From d5aa7c46692474376a3c31704cfc4783c86338f2 Mon Sep 17 00:00:00 2001 From: Jacques Nadeau Date: Fri, 5 Feb 2016 12:08:35 -0800 Subject: [PATCH 0001/1644] Initial Commit --- README.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 README.md diff --git a/README.md b/README.md new file mode 100644 index 0000000000000..e2dc7471c204d --- /dev/null +++ b/README.md @@ -0,0 +1 @@ +arrow From cbc56bf8ac423c585c782d5eda5c517ea8df8e3c Mon Sep 17 00:00:00 2001 From: Jacques Nadeau Date: Tue, 16 Feb 2016 21:35:38 -0800 Subject: [PATCH 0002/1644] Update readme and add license in root. --- LICENSE.txt | 202 ++++++++++++++++++++++++++++++++++++++++++++++++++++ README.md | 14 +++- 2 files changed, 215 insertions(+), 1 deletion(-) create mode 100644 LICENSE.txt diff --git a/LICENSE.txt b/LICENSE.txt new file mode 100644 index 0000000000000..d645695673349 --- /dev/null +++ b/LICENSE.txt @@ -0,0 +1,202 @@ + + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. 
For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. 
The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. 
 + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. diff --git a/README.md b/README.md index e2dc7471c204d..4423a91351381 100644 --- a/README.md +++ b/README.md @@ -1 +1,13 @@ -arrow +## Apache Arrow + +#### Powering Columnar In-Memory Analytics + +Arrow is a set of technologies that enable big-data systems to process and move data fast. + +Initial implementations include: + + - [The Arrow Format](https://github.com/apache/arrow/tree/master/format) + - [Arrow Structures and APIs in C++](https://github.com/apache/arrow/tree/master/cpp) + - [Arrow Structures and APIs in Java](https://github.com/apache/arrow/tree/master/java) + +Arrow is an [Apache Software Foundation](http://www.apache.org) project. More info can be found at [arrow.apache.org](http://arrow.apache.org).
From fa5f0299f046c46e1b2f671e5e3b4f1956522711 Mon Sep 17 00:00:00 2001 From: Steven Phillips Date: Wed, 17 Feb 2016 04:37:53 -0800 Subject: [PATCH 0003/1644] ARROW-1: Initial Arrow Code Commit --- java/.gitignore | 22 + java/memory/pom.xml | 50 + .../main/java/io/netty/buffer/ArrowBuf.java | 863 ++++++++++++++++++ .../io/netty/buffer/ExpandableByteBuf.java | 55 ++ .../java/io/netty/buffer/LargeBuffer.java | 59 ++ .../netty/buffer/MutableWrappedByteBuf.java | 336 +++++++ .../netty/buffer/PooledByteBufAllocatorL.java | 272 ++++++ .../buffer/UnsafeDirectLittleEndian.java | 270 ++++++ .../org/apache/arrow/memory/Accountant.java | 272 ++++++ .../arrow/memory/AllocationManager.java | 433 +++++++++ .../arrow/memory/AllocationReservation.java | 86 ++ .../memory/AllocatorClosedException.java | 31 + .../apache/arrow/memory/BaseAllocator.java | 781 ++++++++++++++++ .../apache/arrow/memory/BoundsChecking.java | 35 + .../apache/arrow/memory/BufferAllocator.java | 151 +++ .../apache/arrow/memory/BufferManager.java | 66 ++ .../apache/arrow/memory/ChildAllocator.java | 53 ++ .../arrow/memory/DrillByteBufAllocator.java | 141 +++ .../arrow/memory/OutOfMemoryException.java | 50 + .../java/org/apache/arrow/memory/README.md | 121 +++ .../apache/arrow/memory/RootAllocator.java | 39 + .../org/apache/arrow/memory/package-info.java | 24 + .../arrow/memory/util/AssertionUtil.java | 37 + .../arrow/memory/util/AutoCloseableLock.java | 43 + .../arrow/memory/util/HistoricalLog.java | 185 ++++ .../org/apache/arrow/memory/util/Metrics.java | 40 + .../org/apache/arrow/memory/util/Pointer.java | 28 + .../apache/arrow/memory/util/StackTrace.java | 70 ++ .../src/main/resources/drill-module.conf | 25 + .../apache/arrow/memory/TestAccountant.java | 164 ++++ .../arrow/memory/TestBaseAllocator.java | 648 +++++++++++++ .../apache/arrow/memory/TestEndianess.java | 43 + java/pom.xml | 470 ++++++++++ java/vector/pom.xml | 165 ++++ java/vector/src/main/codegen/config.fmpp | 24 + .../main/codegen/data/ValueVectorTypes.tdd | 168 ++++ .../src/main/codegen/includes/license.ftl | 18 + .../src/main/codegen/includes/vv_imports.ftl | 62 ++ .../templates/AbstractFieldReader.java | 124 +++ .../templates/AbstractFieldWriter.java | 147 +++ .../AbstractPromotableFieldWriter.java | 142 +++ .../main/codegen/templates/BaseReader.java | 73 ++ .../main/codegen/templates/BaseWriter.java | 117 +++ .../codegen/templates/BasicTypeHelper.java | 538 +++++++++++ .../main/codegen/templates/ComplexCopier.java | 133 +++ .../codegen/templates/ComplexReaders.java | 183 ++++ .../codegen/templates/ComplexWriters.java | 151 +++ .../codegen/templates/FixedValueVectors.java | 813 +++++++++++++++++ .../codegen/templates/HolderReaderImpl.java | 290 ++++++ .../main/codegen/templates/ListWriters.java | 234 +++++ .../main/codegen/templates/MapWriters.java | 240 +++++ .../main/codegen/templates/NullReader.java | 138 +++ .../templates/NullableValueVectors.java | 630 +++++++++++++ .../templates/RepeatedValueVectors.java | 421 +++++++++ .../codegen/templates/UnionListWriter.java | 185 ++++ .../main/codegen/templates/UnionReader.java | 194 ++++ .../main/codegen/templates/UnionVector.java | 467 ++++++++++ .../main/codegen/templates/UnionWriter.java | 228 +++++ .../main/codegen/templates/ValueHolders.java | 116 +++ .../templates/VariableLengthVectors.java | 644 +++++++++++++ .../apache/arrow/vector/AddOrGetResult.java | 38 + .../apache/arrow/vector/AllocationHelper.java | 61 ++ .../arrow/vector/BaseDataValueVector.java | 91 ++ .../apache/arrow/vector/BaseValueVector.java | 125 
+++ .../org/apache/arrow/vector/BitVector.java | 450 +++++++++ .../apache/arrow/vector/FixedWidthVector.java | 35 + .../apache/arrow/vector/NullableVector.java | 23 + .../NullableVectorDefinitionSetter.java | 23 + .../org/apache/arrow/vector/ObjectVector.java | 220 +++++ .../arrow/vector/SchemaChangeCallBack.java | 52 ++ .../arrow/vector/ValueHolderHelper.java | 203 ++++ .../org/apache/arrow/vector/ValueVector.java | 222 +++++ .../arrow/vector/VariableWidthVector.java | 51 ++ .../apache/arrow/vector/VectorDescriptor.java | 83 ++ .../apache/arrow/vector/VectorTrimmer.java | 33 + .../org/apache/arrow/vector/ZeroVector.java | 181 ++++ .../complex/AbstractContainerVector.java | 143 +++ .../vector/complex/AbstractMapVector.java | 278 ++++++ .../complex/BaseRepeatedValueVector.java | 260 ++++++ .../vector/complex/ContainerVectorLike.java | 43 + .../vector/complex/EmptyValuePopulator.java | 54 ++ .../arrow/vector/complex/ListVector.java | 321 +++++++ .../arrow/vector/complex/MapVector.java | 374 ++++++++ .../arrow/vector/complex/Positionable.java | 22 + .../complex/RepeatedFixedWidthVectorLike.java | 40 + .../vector/complex/RepeatedListVector.java | 428 +++++++++ .../vector/complex/RepeatedMapVector.java | 584 ++++++++++++ .../vector/complex/RepeatedValueVector.java | 85 ++ .../RepeatedVariableWidthVectorLike.java | 35 + .../arrow/vector/complex/StateTool.java | 34 + .../vector/complex/VectorWithOrdinal.java | 30 + .../complex/impl/AbstractBaseReader.java | 100 ++ .../complex/impl/AbstractBaseWriter.java | 59 ++ .../complex/impl/ComplexWriterImpl.java | 193 ++++ .../complex/impl/MapOrListWriterImpl.java | 112 +++ .../vector/complex/impl/PromotableWriter.java | 196 ++++ .../complex/impl/RepeatedListReaderImpl.java | 145 +++ .../complex/impl/RepeatedMapReaderImpl.java | 192 ++++ .../impl/SingleLikeRepeatedMapReaderImpl.java | 89 ++ .../complex/impl/SingleListReaderImpl.java | 88 ++ .../complex/impl/SingleMapReaderImpl.java | 108 +++ .../vector/complex/impl/UnionListReader.java | 98 ++ .../vector/complex/reader/FieldReader.java | 29 + .../vector/complex/writer/FieldWriter.java | 27 + .../arrow/vector/holders/ComplexHolder.java | 25 + .../arrow/vector/holders/ObjectHolder.java | 38 + .../vector/holders/RepeatedListHolder.java | 23 + .../vector/holders/RepeatedMapHolder.java | 23 + .../arrow/vector/holders/UnionHolder.java | 37 + .../arrow/vector/holders/ValueHolder.java | 31 + .../arrow/vector/types/MaterializedField.java | 217 +++++ .../org/apache/arrow/vector/types/Types.java | 132 +++ .../vector/util/ByteFunctionHelpers.java | 233 +++++ .../apache/arrow/vector/util/CallBack.java | 23 + .../arrow/vector/util/CoreDecimalUtility.java | 91 ++ .../apache/arrow/vector/util/DateUtility.java | 682 ++++++++++++++ .../arrow/vector/util/DecimalUtility.java | 737 +++++++++++++++ .../vector/util/JsonStringArrayList.java | 57 ++ .../arrow/vector/util/JsonStringHashMap.java | 76 ++ .../arrow/vector/util/MapWithOrdinal.java | 248 +++++ .../util/OversizedAllocationException.java | 49 + .../util/SchemaChangeRuntimeException.java | 41 + .../org/apache/arrow/vector/util/Text.java | 621 +++++++++++++ .../arrow/vector/util/TransferPair.java | 27 + 124 files changed, 22077 insertions(+) create mode 100644 java/.gitignore create mode 100644 java/memory/pom.xml create mode 100644 java/memory/src/main/java/io/netty/buffer/ArrowBuf.java create mode 100644 java/memory/src/main/java/io/netty/buffer/ExpandableByteBuf.java create mode 100644 java/memory/src/main/java/io/netty/buffer/LargeBuffer.java create mode 100644 
java/memory/src/main/java/io/netty/buffer/MutableWrappedByteBuf.java create mode 100644 java/memory/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java create mode 100644 java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java create mode 100644 java/memory/src/main/java/org/apache/arrow/memory/Accountant.java create mode 100644 java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java create mode 100644 java/memory/src/main/java/org/apache/arrow/memory/AllocationReservation.java create mode 100644 java/memory/src/main/java/org/apache/arrow/memory/AllocatorClosedException.java create mode 100644 java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java create mode 100644 java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java create mode 100644 java/memory/src/main/java/org/apache/arrow/memory/BufferAllocator.java create mode 100644 java/memory/src/main/java/org/apache/arrow/memory/BufferManager.java create mode 100644 java/memory/src/main/java/org/apache/arrow/memory/ChildAllocator.java create mode 100644 java/memory/src/main/java/org/apache/arrow/memory/DrillByteBufAllocator.java create mode 100644 java/memory/src/main/java/org/apache/arrow/memory/OutOfMemoryException.java create mode 100644 java/memory/src/main/java/org/apache/arrow/memory/README.md create mode 100644 java/memory/src/main/java/org/apache/arrow/memory/RootAllocator.java create mode 100644 java/memory/src/main/java/org/apache/arrow/memory/package-info.java create mode 100644 java/memory/src/main/java/org/apache/arrow/memory/util/AssertionUtil.java create mode 100644 java/memory/src/main/java/org/apache/arrow/memory/util/AutoCloseableLock.java create mode 100644 java/memory/src/main/java/org/apache/arrow/memory/util/HistoricalLog.java create mode 100644 java/memory/src/main/java/org/apache/arrow/memory/util/Metrics.java create mode 100644 java/memory/src/main/java/org/apache/arrow/memory/util/Pointer.java create mode 100644 java/memory/src/main/java/org/apache/arrow/memory/util/StackTrace.java create mode 100644 java/memory/src/main/resources/drill-module.conf create mode 100644 java/memory/src/test/java/org/apache/arrow/memory/TestAccountant.java create mode 100644 java/memory/src/test/java/org/apache/arrow/memory/TestBaseAllocator.java create mode 100644 java/memory/src/test/java/org/apache/arrow/memory/TestEndianess.java create mode 100644 java/pom.xml create mode 100644 java/vector/pom.xml create mode 100644 java/vector/src/main/codegen/config.fmpp create mode 100644 java/vector/src/main/codegen/data/ValueVectorTypes.tdd create mode 100644 java/vector/src/main/codegen/includes/license.ftl create mode 100644 java/vector/src/main/codegen/includes/vv_imports.ftl create mode 100644 java/vector/src/main/codegen/templates/AbstractFieldReader.java create mode 100644 java/vector/src/main/codegen/templates/AbstractFieldWriter.java create mode 100644 java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java create mode 100644 java/vector/src/main/codegen/templates/BaseReader.java create mode 100644 java/vector/src/main/codegen/templates/BaseWriter.java create mode 100644 java/vector/src/main/codegen/templates/BasicTypeHelper.java create mode 100644 java/vector/src/main/codegen/templates/ComplexCopier.java create mode 100644 java/vector/src/main/codegen/templates/ComplexReaders.java create mode 100644 java/vector/src/main/codegen/templates/ComplexWriters.java create mode 100644 java/vector/src/main/codegen/templates/FixedValueVectors.java create mode 
100644 java/vector/src/main/codegen/templates/HolderReaderImpl.java create mode 100644 java/vector/src/main/codegen/templates/ListWriters.java create mode 100644 java/vector/src/main/codegen/templates/MapWriters.java create mode 100644 java/vector/src/main/codegen/templates/NullReader.java create mode 100644 java/vector/src/main/codegen/templates/NullableValueVectors.java create mode 100644 java/vector/src/main/codegen/templates/RepeatedValueVectors.java create mode 100644 java/vector/src/main/codegen/templates/UnionListWriter.java create mode 100644 java/vector/src/main/codegen/templates/UnionReader.java create mode 100644 java/vector/src/main/codegen/templates/UnionVector.java create mode 100644 java/vector/src/main/codegen/templates/UnionWriter.java create mode 100644 java/vector/src/main/codegen/templates/ValueHolders.java create mode 100644 java/vector/src/main/codegen/templates/VariableLengthVectors.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/AddOrGetResult.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/AllocationHelper.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/BaseValueVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/BitVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/FixedWidthVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/NullableVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/NullableVectorDefinitionSetter.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/ObjectVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/SchemaChangeCallBack.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/ValueHolderHelper.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/VariableWidthVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/VectorDescriptor.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/VectorTrimmer.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/ContainerVectorLike.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/EmptyValuePopulator.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/Positionable.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedFixedWidthVectorLike.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedListVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedMapVector.java create mode 100644 
java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedValueVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedVariableWidthVectorLike.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/StateTool.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/VectorWithOrdinal.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseReader.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseWriter.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/impl/MapOrListWriterImpl.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/impl/RepeatedListReaderImpl.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/impl/RepeatedMapReaderImpl.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleLikeRepeatedMapReaderImpl.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleListReaderImpl.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleMapReaderImpl.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionListReader.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/reader/FieldReader.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/writer/FieldWriter.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/holders/ComplexHolder.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/holders/ObjectHolder.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/holders/RepeatedListHolder.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/holders/RepeatedMapHolder.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/holders/UnionHolder.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/holders/ValueHolder.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/types/MaterializedField.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/types/Types.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/util/CallBack.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/util/CoreDecimalUtility.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/util/DateUtility.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/util/JsonStringArrayList.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/util/JsonStringHashMap.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/util/MapWithOrdinal.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/util/OversizedAllocationException.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/util/SchemaChangeRuntimeException.java create mode 100644 
java/vector/src/main/java/org/apache/arrow/vector/util/Text.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/util/TransferPair.java diff --git a/java/.gitignore b/java/.gitignore new file mode 100644 index 0000000000000..73c1be4912297 --- /dev/null +++ b/java/.gitignore @@ -0,0 +1,22 @@ +.project +.buildpath +.classpath +.checkstyle +.settings/ +.idea/ +TAGS +*.log +*.lck +*.iml +target/ +*.DS_Store +*.patch +*~ +git.properties +contrib/native/client/build/ +contrib/native/client/build/* +CMakeCache.txt +CMakeFiles +Makefile +cmake_install.cmake +install_manifest.txt diff --git a/java/memory/pom.xml b/java/memory/pom.xml new file mode 100644 index 0000000000000..44332f5ed14a8 --- /dev/null +++ b/java/memory/pom.xml @@ -0,0 +1,50 @@ + + +<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> + <modelVersion>4.0.0</modelVersion> + <parent> + <groupId>org.apache.arrow</groupId> + <artifactId>arrow-java-root</artifactId> + <version>0.1-SNAPSHOT</version> + </parent> + <artifactId>arrow-memory</artifactId> + <name>arrow-memory</name> + + <dependencies> + + <dependency> + <groupId>com.codahale.metrics</groupId> + <artifactId>metrics-core</artifactId> + <version>3.0.1</version> + </dependency> + + <dependency> + <groupId>com.google.code.findbugs</groupId> + <artifactId>jsr305</artifactId> + <version>3.0.1</version> + </dependency> + + <dependency> + <groupId>com.carrotsearch</groupId> + <artifactId>hppc</artifactId> + <version>0.7.1</version> + </dependency> + + </dependencies> + +</project> diff --git a/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java b/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java new file mode 100644 index 0000000000000..f033ba6538e83 --- /dev/null +++ b/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java @@ -0,0 +1,863 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ */ +package io.netty.buffer; + +import io.netty.util.internal.PlatformDependent; + +import java.io.IOException; +import java.io.InputStream; +import java.io.OutputStream; +import java.nio.ByteBuffer; +import java.nio.ByteOrder; +import java.nio.channels.GatheringByteChannel; +import java.nio.channels.ScatteringByteChannel; +import java.nio.charset.Charset; +import java.util.concurrent.atomic.AtomicInteger; +import java.util.concurrent.atomic.AtomicLong; + +import org.apache.arrow.memory.BaseAllocator; +import org.apache.arrow.memory.BoundsChecking; +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.BufferManager; +import org.apache.arrow.memory.AllocationManager.BufferLedger; +import org.apache.arrow.memory.BaseAllocator.Verbosity; +import org.apache.arrow.memory.util.HistoricalLog; + +import com.google.common.base.Preconditions; + +public final class ArrowBuf extends AbstractByteBuf implements AutoCloseable { + private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ArrowBuf.class); + + private static final AtomicLong idGenerator = new AtomicLong(0); + + private final long id = idGenerator.incrementAndGet(); + private final AtomicInteger refCnt; + private final UnsafeDirectLittleEndian udle; + private final long addr; + private final int offset; + private final BufferLedger ledger; + private final BufferManager bufManager; + private final ByteBufAllocator alloc; + private final boolean isEmpty; + private volatile int length; + private final HistoricalLog historicalLog = BaseAllocator.DEBUG ? + new HistoricalLog(BaseAllocator.DEBUG_LOG_LENGTH, "DrillBuf[%d]", id) : null; + + public ArrowBuf( + final AtomicInteger refCnt, + final BufferLedger ledger, + final UnsafeDirectLittleEndian byteBuf, + final BufferManager manager, + final ByteBufAllocator alloc, + final int offset, + final int length, + boolean isEmpty) { + super(byteBuf.maxCapacity()); + this.refCnt = refCnt; + this.udle = byteBuf; + this.isEmpty = isEmpty; + this.bufManager = manager; + this.alloc = alloc; + this.addr = byteBuf.memoryAddress() + offset; + this.ledger = ledger; + this.length = length; + this.offset = offset; + + if (BaseAllocator.DEBUG) { + historicalLog.recordEvent("create()"); + } + + } + + public ArrowBuf reallocIfNeeded(final int size) { + Preconditions.checkArgument(size >= 0, "reallocation size must be non-negative"); + + if (this.capacity() >= size) { + return this; + } + + if (bufManager != null) { + return bufManager.replace(this, size); + } else { + throw new UnsupportedOperationException("Realloc is only available in the context of an operator's UDFs"); + } + } + + @Override + public int refCnt() { + if (isEmpty) { + return 1; + } else { + return refCnt.get(); + } + } + + private long addr(int index) { + return addr + index; + } + + private final void checkIndexD(int index, int fieldLength) { + ensureAccessible(); + if (fieldLength < 0) { + throw new IllegalArgumentException("length: " + fieldLength + " (expected: >= 0)"); + } + if (index < 0 || index > capacity() - fieldLength) { + if (BaseAllocator.DEBUG) { + historicalLog.logHistory(logger); + } + throw new IndexOutOfBoundsException(String.format( + "index: %d, length: %d (expected: range(0, %d))", index, fieldLength, capacity())); + } + } + + /** + * Allows a function to determine whether or not reading a particular string of bytes is valid. + * + * Will throw an exception if the memory is not readable for some reason. The check is only performed + * when BoundsChecking.BOUNDS_CHECKING_ENABLED is true.
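+ * + * Illustrative usage (the index values here are hypothetical, not from the original source): + * <pre> + * buf.checkBytes(0, 8); // verify that bytes [0, 8) are readable + * long v = buf.getLong(0); + * </pre>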
+ * + * @param start + * The starting position of the bytes to be read. + * @param end + * The exclusive endpoint of the bytes to be read. + */ + public void checkBytes(int start, int end) { + if (BoundsChecking.BOUNDS_CHECKING_ENABLED) { + checkIndexD(start, end - start); + } + } + + private void chk(int index, int width) { + if (BoundsChecking.BOUNDS_CHECKING_ENABLED) { + checkIndexD(index, width); + } + } + + private void ensure(int width) { + if (BoundsChecking.BOUNDS_CHECKING_ENABLED) { + ensureWritable(width); + } + } + + /** + * Create a new DrillBuf that is associated with an alternative allocator for the purposes of memory ownership and + * accounting. This has no impact on the reference counting for the current DrillBuf except in the situation where the + * passed in Allocator is the same as the current buffer's allocator. + * + * This operation has no impact on the reference count of this DrillBuf. The newly created DrillBuf will either have a + * reference count of 1 (in the case that this is the first time this memory is being associated with the new + * allocator) or the current value of the reference count + 1 for the other AllocationManager/BufferLedger combination + * in the case that the provided allocator already had an association to this underlying memory. + * + * @param target + * The target allocator to create an association with. + * @return A new DrillBuf which shares the same underlying memory as this DrillBuf. + */ + public ArrowBuf retain(BufferAllocator target) { + + if (isEmpty) { + return this; + } + + if (BaseAllocator.DEBUG) { + historicalLog.recordEvent("retain(%s)", target.getName()); + } + final BufferLedger otherLedger = this.ledger.getLedgerForAllocator(target); + return otherLedger.newDrillBuf(offset, length, null); + } + + /** + * Transfer the memory accounting ownership of this DrillBuf to another allocator. This will generate a new DrillBuf + * that carries an association with the underlying memory of this DrillBuf. If this DrillBuf is connected to the + * owning BufferLedger of this memory, that memory ownership/accounting will be transferred to the target allocator. If + * this DrillBuf does not currently own the memory underlying it (and is only associated with it), this does not + * transfer any ownership to the newly created DrillBuf. + * + * This operation has no impact on the reference count of this DrillBuf. The newly created DrillBuf will either have a + * reference count of 1 (in the case that this is the first time this memory is being associated with the new + * allocator) or the current value of the reference count for the other AllocationManager/BufferLedger combination in + * the case that the provided allocator already had an association to this underlying memory. + * + * Transfers will always succeed, even if that puts the other allocator into an overlimit situation. This is possible + * due to the fact that the original owning allocator may have allocated this memory out of a local reservation + * whereas the target allocator may need to allocate new memory from a parent or RootAllocator. This operation is done + * in a mostly-lockless but consistent manner. As such, the overlimit==true situation could be reported slightly before + * an actual overlimit==true condition occurs. This is simply conservative behavior, meaning we may report overlimit + * slightly sooner than is strictly necessary. + * + * @param target + * The allocator to transfer ownership to.
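+ *          For example (illustrative only; {@code childAllocator} stands in for any allocator + *          sharing the same root): {@code TransferResult r = buf.transferOwnership(childAllocator);} + *          afterwards {@code r.buffer} is the newly associated buffer and {@code r.allocationFit} + *          reports whether the target stayed within its allocation limit.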
+ * @return A new transfer result with the impact of the transfer (whether it was overlimit) as well as the newly + * created DrillBuf. + */ + public TransferResult transferOwnership(BufferAllocator target) { + + if (isEmpty) { + return new TransferResult(true, this); + } + + final BufferLedger otherLedger = this.ledger.getLedgerForAllocator(target); + final ArrowBuf newBuf = otherLedger.newDrillBuf(offset, length, null); + final boolean allocationFit = this.ledger.transferBalance(otherLedger); + return new TransferResult(allocationFit, newBuf); + } + + /** + * The outcome of a Transfer. + */ + public class TransferResult { + + /** + * Whether this transfer fit within the target allocator's capacity. + */ + public final boolean allocationFit; + + /** + * The newly created buffer associated with the target allocator. + */ + public final ArrowBuf buffer; + + private TransferResult(boolean allocationFit, ArrowBuf buffer) { + this.allocationFit = allocationFit; + this.buffer = buffer; + } + + } + + @Override + public boolean release() { + return release(1); + } + + /** + * Release the provided number of reference counts. + */ + @Override + public boolean release(int decrement) { + + if (isEmpty) { + return false; + } + + if (decrement < 1) { + throw new IllegalStateException(String.format("release(%d) argument is not positive. Buffer Info: %s", + decrement, toVerboseString())); + } + + final int refCnt = ledger.decrement(decrement); + + if (BaseAllocator.DEBUG) { + historicalLog.recordEvent("release(%d). original value: %d", decrement, refCnt + decrement); + } + + if (refCnt < 0) { + throw new IllegalStateException( + String.format("DrillBuf[%d] refCnt has gone negative. Buffer Info: %s", id, toVerboseString())); + } + + return refCnt == 0; + + } + + @Override + public int capacity() { + return length; + } + + @Override + public synchronized ArrowBuf capacity(int newCapacity) { + + if (newCapacity == length) { + return this; + } + + Preconditions.checkArgument(newCapacity >= 0); + + if (newCapacity < length) { + length = newCapacity; + return this; + } + + throw new UnsupportedOperationException("Buffers don't support resizing that increases the size."); + } + + @Override + public ByteBufAllocator alloc() { + return udle.alloc(); + } + + @Override + public ByteOrder order() { + return ByteOrder.LITTLE_ENDIAN; + } + + @Override + public ByteBuf order(ByteOrder endianness) { + return this; + } + + @Override + public ByteBuf unwrap() { + return udle; + } + + @Override + public boolean isDirect() { + return true; + } + + @Override + public ByteBuf readBytes(int length) { + throw new UnsupportedOperationException(); + } + + @Override + public ByteBuf readSlice(int length) { + final ByteBuf slice = slice(readerIndex(), length); + readerIndex(readerIndex() + length); + return slice; + } + + @Override + public ByteBuf copy() { + throw new UnsupportedOperationException(); + } + + @Override + public ByteBuf copy(int index, int length) { + throw new UnsupportedOperationException(); + } + + @Override + public ByteBuf slice() { + return slice(readerIndex(), readableBytes()); + } + + public static String bufferState(final ByteBuf buf) { + final int cap = buf.capacity(); + final int mcap = buf.maxCapacity(); + final int ri = buf.readerIndex(); + final int rb = buf.readableBytes(); + final int wi = buf.writerIndex(); + final int wb = buf.writableBytes(); + return String.format("cap/max: %d/%d, ri: %d, rb: %d, wi: %d, wb: %d", + cap, mcap, ri, rb, wi, wb); + } + + @Override + public ArrowBuf slice(int index, 
int length) { + + if (isEmpty) { + return this; + } + + /* + * Re the behavior of reference counting, see http://netty.io/wiki/reference-counted-objects.html#wiki-h3-5, which + * explains that derived buffers share their reference count with their parent + */ + final ArrowBuf newBuf = ledger.newDrillBuf(offset + index, length); + newBuf.writerIndex(length); + return newBuf; + } + + @Override + public ArrowBuf duplicate() { + return slice(0, length); + } + + @Override + public int nioBufferCount() { + return 1; + } + + @Override + public ByteBuffer nioBuffer() { + return nioBuffer(readerIndex(), readableBytes()); + } + + @Override + public ByteBuffer nioBuffer(int index, int length) { + return udle.nioBuffer(offset + index, length); + } + + @Override + public ByteBuffer internalNioBuffer(int index, int length) { + return udle.internalNioBuffer(offset + index, length); + } + + @Override + public ByteBuffer[] nioBuffers() { + return new ByteBuffer[] { nioBuffer() }; + } + + @Override + public ByteBuffer[] nioBuffers(int index, int length) { + return new ByteBuffer[] { nioBuffer(index, length) }; + } + + @Override + public boolean hasArray() { + return udle.hasArray(); + } + + @Override + public byte[] array() { + return udle.array(); + } + + @Override + public int arrayOffset() { + return udle.arrayOffset(); + } + + @Override + public boolean hasMemoryAddress() { + return true; + } + + @Override + public long memoryAddress() { + return this.addr; + } + + @Override + public String toString() { + return String.format("DrillBuf[%d], udle: [%d %d..%d]", id, udle.id, offset, offset + capacity()); + } + + @Override + public String toString(Charset charset) { + return toString(readerIndex, readableBytes(), charset); + } + + @Override + public String toString(int index, int length, Charset charset) { + + if (length == 0) { + return ""; + } + + return ByteBufUtil.decodeString(nioBuffer(index, length), charset); + } + + @Override + public int hashCode() { + return System.identityHashCode(this); + } + + @Override + public boolean equals(Object obj) { + // identity equals only. 
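+ // (these buffers are mutable and reference-counted, so content-based equality is intentionally unsupported)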
+ return this == obj; + } + + @Override + public ByteBuf retain(int increment) { + Preconditions.checkArgument(increment > 0, "retain(%s) argument is not positive", increment); + + if (isEmpty) { + return this; + } + + if (BaseAllocator.DEBUG) { + historicalLog.recordEvent("retain(%d)", increment); + } + + final int originalReferenceCount = refCnt.getAndAdd(increment); + Preconditions.checkArgument(originalReferenceCount > 0); + return this; + } + + @Override + public ByteBuf retain() { + return retain(1); + } + + @Override + public long getLong(int index) { + chk(index, 8); + final long v = PlatformDependent.getLong(addr(index)); + return v; + } + + @Override + public float getFloat(int index) { + return Float.intBitsToFloat(getInt(index)); + } + + @Override + public double getDouble(int index) { + return Double.longBitsToDouble(getLong(index)); + } + + @Override + public char getChar(int index) { + return (char) getShort(index); + } + + @Override + public long getUnsignedInt(int index) { + return getInt(index) & 0xFFFFFFFFL; + } + + @Override + public int getInt(int index) { + chk(index, 4); + final int v = PlatformDependent.getInt(addr(index)); + return v; + } + + @Override + public int getUnsignedShort(int index) { + return getShort(index) & 0xFFFF; + } + + @Override + public short getShort(int index) { + chk(index, 2); + short v = PlatformDependent.getShort(addr(index)); + return v; + } + + @Override + public ByteBuf setShort(int index, int value) { + chk(index, 2); + PlatformDependent.putShort(addr(index), (short) value); + return this; + } + + @Override + public ByteBuf setInt(int index, int value) { + chk(index, 4); + PlatformDependent.putInt(addr(index), value); + return this; + } + + @Override + public ByteBuf setLong(int index, long value) { + chk(index, 8); + PlatformDependent.putLong(addr(index), value); + return this; + } + + @Override + public ByteBuf setChar(int index, int value) { + chk(index, 2); + PlatformDependent.putShort(addr(index), (short) value); + return this; + } + + @Override + public ByteBuf setFloat(int index, float value) { + chk(index, 4); + PlatformDependent.putInt(addr(index), Float.floatToRawIntBits(value)); + return this; + } + + @Override + public ByteBuf setDouble(int index, double value) { + chk(index, 8); + PlatformDependent.putLong(addr(index), Double.doubleToRawLongBits(value)); + return this; + } + + @Override + public ByteBuf writeShort(int value) { + ensure(2); + PlatformDependent.putShort(addr(writerIndex), (short) value); + writerIndex += 2; + return this; + } + + @Override + public ByteBuf writeInt(int value) { + ensure(4); + PlatformDependent.putInt(addr(writerIndex), value); + writerIndex += 4; + return this; + } + + @Override + public ByteBuf writeLong(long value) { + ensure(8); + PlatformDependent.putLong(addr(writerIndex), value); + writerIndex += 8; + return this; + } + + @Override + public ByteBuf writeChar(int value) { + ensure(2); + PlatformDependent.putShort(addr(writerIndex), (short) value); + writerIndex += 2; + return this; + } + + @Override + public ByteBuf writeFloat(float value) { + ensure(4); + PlatformDependent.putInt(addr(writerIndex), Float.floatToRawIntBits(value)); + writerIndex += 4; + return this; + } + + @Override + public ByteBuf writeDouble(double value) { + ensure(8); + PlatformDependent.putLong(addr(writerIndex), Double.doubleToRawLongBits(value)); + writerIndex += 8; + return this; + } + + @Override + public ByteBuf getBytes(int index, byte[] dst, int dstIndex, int length) { + udle.getBytes(index + offset, dst,
dstIndex, length); + return this; + } + + @Override + public ByteBuf getBytes(int index, ByteBuffer dst) { + udle.getBytes(index + offset, dst); + return this; + } + + @Override + public ByteBuf setByte(int index, int value) { + chk(index, 1); + PlatformDependent.putByte(addr(index), (byte) value); + return this; + } + + public void setByte(int index, byte b) { + chk(index, 1); + PlatformDependent.putByte(addr(index), b); + } + + public void writeByteUnsafe(byte b) { + PlatformDependent.putByte(addr(writerIndex), b); + writerIndex++; + } + + @Override + protected byte _getByte(int index) { + return getByte(index); + } + + @Override + protected short _getShort(int index) { + return getShort(index); + } + + @Override + protected int _getInt(int index) { + return getInt(index); + } + + @Override + protected long _getLong(int index) { + return getLong(index); + } + + @Override + protected void _setByte(int index, int value) { + setByte(index, value); + } + + @Override + protected void _setShort(int index, int value) { + setShort(index, value); + } + + @Override + protected void _setMedium(int index, int value) { + setMedium(index, value); + } + + @Override + protected void _setInt(int index, int value) { + setInt(index, value); + } + + @Override + protected void _setLong(int index, long value) { + setLong(index, value); + } + + @Override + public ByteBuf getBytes(int index, ByteBuf dst, int dstIndex, int length) { + udle.getBytes(index + offset, dst, dstIndex, length); + return this; + } + + @Override + public ByteBuf getBytes(int index, OutputStream out, int length) throws IOException { + udle.getBytes(index + offset, out, length); + return this; + } + + @Override + protected int _getUnsignedMedium(int index) { + final long addr = addr(index); + return (PlatformDependent.getByte(addr) & 0xff) << 16 | + (PlatformDependent.getByte(addr + 1) & 0xff) << 8 | + PlatformDependent.getByte(addr + 2) & 0xff; + } + + @Override + public int getBytes(int index, GatheringByteChannel out, int length) throws IOException { + return udle.getBytes(index + offset, out, length); + } + + @Override + public ByteBuf setBytes(int index, ByteBuf src, int srcIndex, int length) { + udle.setBytes(index + offset, src, srcIndex, length); + return this; + } + + public ByteBuf setBytes(int index, ByteBuffer src, int srcIndex, int length) { + if (src.isDirect()) { + checkIndex(index, length); + PlatformDependent.copyMemory(PlatformDependent.directBufferAddress(src) + srcIndex, this.memoryAddress() + index, + length); + } else { + if (srcIndex == 0 && src.capacity() == length) { + udle.setBytes(index + offset, src); + } else { + ByteBuffer newBuf = src.duplicate(); + newBuf.position(srcIndex); + newBuf.limit(srcIndex + length); + udle.setBytes(index + offset, newBuf); + } + } + + return this; + } + + @Override + public ByteBuf setBytes(int index, byte[] src, int srcIndex, int length) { + udle.setBytes(index + offset, src, srcIndex, length); + return this; + } + + @Override + public ByteBuf setBytes(int index, ByteBuffer src) { + udle.setBytes(index + offset, src); + return this; + } + + @Override + public int setBytes(int index, InputStream in, int length) throws IOException { + return udle.setBytes(index + offset, in, length); + } + + @Override + public int setBytes(int index, ScatteringByteChannel in, int length) throws IOException { + return udle.setBytes(index + offset, in, length); + } + + @Override + public byte getByte(int index) { + chk(index, 1); + return PlatformDependent.getByte(addr(index)); + }
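+ + /** + * Closing this buffer releases a single reference, which makes it usable with try-with-resources. + * A minimal sketch (buffer creation elided; the setLong call is illustrative): + * <pre> + * try (ArrowBuf buf = ...) { + * buf.setLong(0, 42L); + * } // close() invokes release() + * </pre> + */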
+ @Override + public void close() { + release(); + } + + /** + * Returns the possible memory consumed by this DrillBuf in the worst-case scenario (not shared, but connected to a + * larger underlying buffer of allocated memory). + * + * @return Size in bytes. + */ + public int getPossibleMemoryConsumed() { + return ledger.getSize(); + } + + /** + * Return the memory accounted for by this buffer (and its potentially shared siblings within the context of the + * associated allocator). + * + * @return Size in bytes. + */ + public int getActualMemoryConsumed() { + return ledger.getAccountedSize(); + } + + private final static int LOG_BYTES_PER_ROW = 10; + + /** + * Return the buffer's byte contents in the form of a hex dump. + * + * @param start + * the starting byte index + * @param length + * how many bytes to log + * @return A hex dump in a String. + */ + public String toHexString(final int start, final int length) { + final int roundedStart = (start / LOG_BYTES_PER_ROW) * LOG_BYTES_PER_ROW; + + final StringBuilder sb = new StringBuilder("buffer byte dump\n"); + int index = roundedStart; + for (int nLogged = 0; nLogged < length; nLogged += LOG_BYTES_PER_ROW) { + sb.append(String.format(" [%05d-%05d]", index, index + LOG_BYTES_PER_ROW - 1)); + for (int i = 0; i < LOG_BYTES_PER_ROW; ++i) { + try { + final byte b = getByte(index++); + sb.append(String.format(" 0x%02x", b)); + } catch (IndexOutOfBoundsException ioob) { + sb.append(" "); + } + } + sb.append('\n'); + } + return sb.toString(); + } + + /** + * Get the id assigned to this DrillBuf for debugging purposes. + * + * @return the id + */ + public long getId() { + return id; + } + + public String toVerboseString() { + if (isEmpty) { + return toString(); + } + + StringBuilder sb = new StringBuilder(); + ledger.print(sb, 0, Verbosity.LOG_WITH_STACKTRACE); + return sb.toString(); + } + + public void print(StringBuilder sb, int indent, Verbosity verbosity) { + BaseAllocator.indent(sb, indent).append(toString()); + + if (BaseAllocator.DEBUG && !isEmpty && verbosity.includeHistoricalLog) { + sb.append("\n"); + historicalLog.buildHistory(sb, indent + 1, verbosity.includeStackTraces); + } + } + +} diff --git a/java/memory/src/main/java/io/netty/buffer/ExpandableByteBuf.java b/java/memory/src/main/java/io/netty/buffer/ExpandableByteBuf.java new file mode 100644 index 0000000000000..59886474923f3 --- /dev/null +++ b/java/memory/src/main/java/io/netty/buffer/ExpandableByteBuf.java @@ -0,0 +1,55 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package io.netty.buffer; + +import org.apache.arrow.memory.BufferAllocator; + +/** + * Allows us to decorate a DrillBuf to make it expandable so that we can use it in the context of the Netty framework + * (thus supporting RPC level memory accounting).
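+ * + * <p>A minimal sketch of the intended expansion behavior (the 16- and 64-byte sizes are hypothetical): + * <pre> + * ByteBuf b = new ExpandableByteBuf(allocator.buffer(16), allocator); + * b.capacity(64); // copies the existing bytes into a larger buffer and releases the old one + * </pre>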
+ */ +public class ExpandableByteBuf extends MutableWrappedByteBuf { + + private final BufferAllocator allocator; + + public ExpandableByteBuf(ByteBuf buffer, BufferAllocator allocator) { + super(buffer); + this.allocator = allocator; + } + + @Override + public ByteBuf copy(int index, int length) { + return new ExpandableByteBuf(buffer.copy(index, length), allocator); + } + + @Override + public ByteBuf capacity(int newCapacity) { + if (newCapacity > capacity()) { + ByteBuf newBuf = allocator.buffer(newCapacity); + newBuf.writeBytes(buffer, 0, buffer.capacity()); + newBuf.readerIndex(buffer.readerIndex()); + newBuf.writerIndex(buffer.writerIndex()); + buffer.release(); + buffer = newBuf; + return newBuf; + } else { + return super.capacity(newCapacity); + } + } + +} diff --git a/java/memory/src/main/java/io/netty/buffer/LargeBuffer.java b/java/memory/src/main/java/io/netty/buffer/LargeBuffer.java new file mode 100644 index 0000000000000..5f5e904fb0429 --- /dev/null +++ b/java/memory/src/main/java/io/netty/buffer/LargeBuffer.java @@ -0,0 +1,59 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package io.netty.buffer; + +import java.util.concurrent.atomic.AtomicLong; + +/** + * A MutableWrappedByteBuf that also maintains a metric of the number of huge buffer bytes and counts. + */ +public class LargeBuffer extends MutableWrappedByteBuf { + + private final AtomicLong hugeBufferSize; + private final AtomicLong hugeBufferCount; + + private final int initCap; + + public LargeBuffer(ByteBuf buffer, AtomicLong hugeBufferSize, AtomicLong hugeBufferCount) { + super(buffer); + initCap = buffer.capacity(); + this.hugeBufferCount = hugeBufferCount; + this.hugeBufferSize = hugeBufferSize; + } + + @Override + public ByteBuf copy(int index, int length) { + return new LargeBuffer(buffer.copy(index, length), hugeBufferSize, hugeBufferCount); + } + + @Override + public boolean release() { + return release(1); + } + + @Override + public boolean release(int decrement) { + boolean released = unwrap().release(decrement); + if (released) { + hugeBufferSize.addAndGet(-initCap); + hugeBufferCount.decrementAndGet(); + } + return released; + } + +} diff --git a/java/memory/src/main/java/io/netty/buffer/MutableWrappedByteBuf.java b/java/memory/src/main/java/io/netty/buffer/MutableWrappedByteBuf.java new file mode 100644 index 0000000000000..5709473135e4b --- /dev/null +++ b/java/memory/src/main/java/io/netty/buffer/MutableWrappedByteBuf.java @@ -0,0 +1,336 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package io.netty.buffer;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.OutputStream;
+import java.nio.ByteBuffer;
+import java.nio.ByteOrder;
+import java.nio.channels.GatheringByteChannel;
+import java.nio.channels.ScatteringByteChannel;
+
+/**
+ * This is basically a complete copy of DuplicatedByteBuf. We copy because we want to override some behaviors and
+ * make the buffer mutable.
+ */
+abstract class MutableWrappedByteBuf extends AbstractByteBuf {
+
+  @Override
+  public ByteBuffer nioBuffer(int index, int length) {
+    return unwrap().nioBuffer(index, length);
+  }
+
+  ByteBuf buffer;
+
+  public MutableWrappedByteBuf(ByteBuf buffer) {
+    super(buffer.maxCapacity());
+
+    if (buffer instanceof MutableWrappedByteBuf) {
+      this.buffer = ((MutableWrappedByteBuf) buffer).buffer;
+    } else {
+      this.buffer = buffer;
+    }
+
+    setIndex(buffer.readerIndex(), buffer.writerIndex());
+  }
+
+  @Override
+  public ByteBuf unwrap() {
+    return buffer;
+  }
+
+  @Override
+  public ByteBufAllocator alloc() {
+    return buffer.alloc();
+  }
+
+  @Override
+  public ByteOrder order() {
+    return buffer.order();
+  }
+
+  @Override
+  public boolean isDirect() {
+    return buffer.isDirect();
+  }
+
+  @Override
+  public int capacity() {
+    return buffer.capacity();
+  }
+
+  @Override
+  public ByteBuf capacity(int newCapacity) {
+    buffer.capacity(newCapacity);
+    return this;
+  }
+
+  @Override
+  public boolean hasArray() {
+    return buffer.hasArray();
+  }
+
+  @Override
+  public byte[] array() {
+    return buffer.array();
+  }
+
+  @Override
+  public int arrayOffset() {
+    return buffer.arrayOffset();
+  }
+
+  @Override
+  public boolean hasMemoryAddress() {
+    return buffer.hasMemoryAddress();
+  }
+
+  @Override
+  public long memoryAddress() {
+    return buffer.memoryAddress();
+  }
+
+  @Override
+  public byte getByte(int index) {
+    return _getByte(index);
+  }
+
+  @Override
+  protected byte _getByte(int index) {
+    return buffer.getByte(index);
+  }
+
+  @Override
+  public short getShort(int index) {
+    return _getShort(index);
+  }
+
+  @Override
+  protected short _getShort(int index) {
+    return buffer.getShort(index);
+  }
+
+  @Override
+  public int getUnsignedMedium(int index) {
+    return _getUnsignedMedium(index);
+  }
+
+  @Override
+  protected int _getUnsignedMedium(int index) {
+    return buffer.getUnsignedMedium(index);
+  }
+
+  @Override
+  public int getInt(int index) {
+    return _getInt(index);
+  }
+
+  @Override
+  protected int _getInt(int index) {
+    return buffer.getInt(index);
+  }
+
+  @Override
+  public long getLong(int index) {
+    return _getLong(index);
+  }
+
+  @Override
+  protected long _getLong(int index) {
+    return buffer.getLong(index);
+  }
+
+  @Override
+  public abstract ByteBuf copy(int index, int length);
+
+  @Override
+  public ByteBuf slice(int index, int length) {
+    return new SlicedByteBuf(this, index, length);
+  }
+
+  @Override
+  public ByteBuf getBytes(int index, ByteBuf dst, int dstIndex, int length) {
+    buffer.getBytes(index,
dst, dstIndex, length); + return this; + } + + @Override + public ByteBuf getBytes(int index, byte[] dst, int dstIndex, int length) { + buffer.getBytes(index, dst, dstIndex, length); + return this; + } + + @Override + public ByteBuf getBytes(int index, ByteBuffer dst) { + buffer.getBytes(index, dst); + return this; + } + + @Override + public ByteBuf setByte(int index, int value) { + _setByte(index, value); + return this; + } + + @Override + protected void _setByte(int index, int value) { + buffer.setByte(index, value); + } + + @Override + public ByteBuf setShort(int index, int value) { + _setShort(index, value); + return this; + } + + @Override + protected void _setShort(int index, int value) { + buffer.setShort(index, value); + } + + @Override + public ByteBuf setMedium(int index, int value) { + _setMedium(index, value); + return this; + } + + @Override + protected void _setMedium(int index, int value) { + buffer.setMedium(index, value); + } + + @Override + public ByteBuf setInt(int index, int value) { + _setInt(index, value); + return this; + } + + @Override + protected void _setInt(int index, int value) { + buffer.setInt(index, value); + } + + @Override + public ByteBuf setLong(int index, long value) { + _setLong(index, value); + return this; + } + + @Override + protected void _setLong(int index, long value) { + buffer.setLong(index, value); + } + + @Override + public ByteBuf setBytes(int index, byte[] src, int srcIndex, int length) { + buffer.setBytes(index, src, srcIndex, length); + return this; + } + + @Override + public ByteBuf setBytes(int index, ByteBuf src, int srcIndex, int length) { + buffer.setBytes(index, src, srcIndex, length); + return this; + } + + @Override + public ByteBuf setBytes(int index, ByteBuffer src) { + buffer.setBytes(index, src); + return this; + } + + @Override + public ByteBuf getBytes(int index, OutputStream out, int length) + throws IOException { + buffer.getBytes(index, out, length); + return this; + } + + @Override + public int getBytes(int index, GatheringByteChannel out, int length) + throws IOException { + return buffer.getBytes(index, out, length); + } + + @Override + public int setBytes(int index, InputStream in, int length) + throws IOException { + return buffer.setBytes(index, in, length); + } + + @Override + public int setBytes(int index, ScatteringByteChannel in, int length) + throws IOException { + return buffer.setBytes(index, in, length); + } + + @Override + public int nioBufferCount() { + return buffer.nioBufferCount(); + } + + @Override + public ByteBuffer[] nioBuffers(int index, int length) { + return buffer.nioBuffers(index, length); + } + + @Override + public ByteBuffer internalNioBuffer(int index, int length) { + return nioBuffer(index, length); + } + + @Override + public int forEachByte(int index, int length, ByteBufProcessor processor) { + return buffer.forEachByte(index, length, processor); + } + + @Override + public int forEachByteDesc(int index, int length, ByteBufProcessor processor) { + return buffer.forEachByteDesc(index, length, processor); + } + + @Override + public final int refCnt() { + return unwrap().refCnt(); + } + + @Override + public final ByteBuf retain() { + unwrap().retain(); + return this; + } + + @Override + public final ByteBuf retain(int increment) { + unwrap().retain(increment); + return this; + } + + @Override + public boolean release() { + return release(1); + } + + @Override + public boolean release(int decrement) { + boolean released = unwrap().release(decrement); + return released; + } + +} diff --git 
a/java/memory/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java b/java/memory/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java
new file mode 100644
index 0000000000000..1610028df9de3
--- /dev/null
+++ b/java/memory/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java
@@ -0,0 +1,272 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package io.netty.buffer;
+
+import io.netty.util.internal.StringUtil;
+
+import java.lang.reflect.Field;
+import java.nio.ByteBuffer;
+import java.util.concurrent.atomic.AtomicLong;
+
+import org.apache.arrow.memory.OutOfMemoryException;
+
+import com.codahale.metrics.Gauge;
+import com.codahale.metrics.Histogram;
+import com.codahale.metrics.Metric;
+import com.codahale.metrics.MetricFilter;
+import com.codahale.metrics.MetricRegistry;
+
+/**
+ * The base allocator that we use for all of Drill's memory management. Returns UnsafeDirectLittleEndian buffers.
+ */
+public class PooledByteBufAllocatorL {
+  private static final org.slf4j.Logger memoryLogger = org.slf4j.LoggerFactory.getLogger("drill.allocator");
+
+  private static final int MEMORY_LOGGER_FREQUENCY_SECONDS = 60;
+
+  public static final String METRIC_PREFIX = "drill.allocator.";
+
+  private final MetricRegistry registry;
+  private final AtomicLong hugeBufferSize = new AtomicLong(0);
+  private final AtomicLong hugeBufferCount = new AtomicLong(0);
+  private final AtomicLong normalBufferSize = new AtomicLong(0);
+  private final AtomicLong normalBufferCount = new AtomicLong(0);
+
+  private final InnerAllocator allocator;
+  public final UnsafeDirectLittleEndian empty;
+
+  public PooledByteBufAllocatorL(MetricRegistry registry) {
+    this.registry = registry;
+    allocator = new InnerAllocator();
+    empty = new UnsafeDirectLittleEndian(new DuplicatedByteBuf(Unpooled.EMPTY_BUFFER));
+  }
+
+  public UnsafeDirectLittleEndian allocate(int size) {
+    try {
+      return allocator.directBuffer(size, Integer.MAX_VALUE);
+    } catch (OutOfMemoryError e) {
+      throw new OutOfMemoryException("Failure allocating buffer.", e);
+    }
+  }
+
+  public int getChunkSize() {
+    return allocator.chunkSize;
+  }
+
+  private class InnerAllocator extends PooledByteBufAllocator {
+
+    private final PoolArena<ByteBuffer>[] directArenas;
+    private final MemoryStatusThread statusThread;
+    private final Histogram largeBuffersHist;
+    private final Histogram normalBuffersHist;
+    private final int chunkSize;
+
+    public InnerAllocator() {
+      super(true);
+
+      try {
+        Field f = PooledByteBufAllocator.class.getDeclaredField("directArenas");
+        f.setAccessible(true);
+        this.directArenas = (PoolArena<ByteBuffer>[]) f.get(this);
+      } catch (Exception e) {
+        throw new RuntimeException("Failure while initializing allocator. Unable to retrieve direct arenas field.", e);
+      }
+
+      this.chunkSize = directArenas[0].chunkSize;
+
+      if (memoryLogger.isTraceEnabled()) {
+        statusThread = new MemoryStatusThread();
+        statusThread.start();
+      } else {
+        statusThread = null;
+      }
+      removeOldMetrics();
+
+      registry.register(METRIC_PREFIX + "normal.size", new Gauge<Long>() {
+        @Override
+        public Long getValue() {
+          return normalBufferSize.get();
+        }
+      });
+
+      registry.register(METRIC_PREFIX + "normal.count", new Gauge<Long>() {
+        @Override
+        public Long getValue() {
+          return normalBufferCount.get();
+        }
+      });
+
+      registry.register(METRIC_PREFIX + "huge.size", new Gauge<Long>() {
+        @Override
+        public Long getValue() {
+          return hugeBufferSize.get();
+        }
+      });
+
+      registry.register(METRIC_PREFIX + "huge.count", new Gauge<Long>() {
+        @Override
+        public Long getValue() {
+          return hugeBufferCount.get();
+        }
+      });
+
+      largeBuffersHist = registry.histogram(METRIC_PREFIX + "huge.hist");
+      normalBuffersHist = registry.histogram(METRIC_PREFIX + "normal.hist");
+    }
+
+    private synchronized void removeOldMetrics() {
+      registry.removeMatching(new MetricFilter() {
+        @Override
+        public boolean matches(String name, Metric metric) {
+          return name.startsWith("drill.allocator.");
+        }
+      });
+    }
+
+    private UnsafeDirectLittleEndian newDirectBufferL(int initialCapacity, int maxCapacity) {
+      PoolThreadCache cache = threadCache.get();
+      PoolArena<ByteBuffer> directArena = cache.directArena;
+
+      if (directArena != null) {
+        if (initialCapacity > directArena.chunkSize) {
+          // This is beyond chunk size so we'll allocate separately.
+          ByteBuf buf = UnpooledByteBufAllocator.DEFAULT.directBuffer(initialCapacity, maxCapacity);
+
+          hugeBufferCount.incrementAndGet();
+          hugeBufferSize.addAndGet(buf.capacity());
+          largeBuffersHist.update(buf.capacity());
+          // logger.debug("Allocating huge buffer of size {}", initialCapacity, new Exception());
+          return new UnsafeDirectLittleEndian(new LargeBuffer(buf, hugeBufferSize, hugeBufferCount));
+        } else {
+          // within chunk, use arena.
+          ByteBuf buf = directArena.allocate(cache, initialCapacity, maxCapacity);
+          if (!(buf instanceof PooledUnsafeDirectByteBuf)) {
+            throw fail();
+          }
+
+          normalBuffersHist.update(buf.capacity());
+          if (ASSERT_ENABLED) {
+            normalBufferSize.addAndGet(buf.capacity());
+            normalBufferCount.incrementAndGet();
+          }
+
+          return new UnsafeDirectLittleEndian((PooledUnsafeDirectByteBuf) buf, normalBufferCount,
+              normalBufferSize);
+        }
+      } else {
+        throw fail();
+      }
+    }
+
+    private UnsupportedOperationException fail() {
+      return new UnsupportedOperationException(
+          "Drill requires that the JVM used supports access to sun.misc.Unsafe. This platform didn't provide that functionality.");
+    }
This platform didn't provide that functionality."); + } + + public UnsafeDirectLittleEndian directBuffer(int initialCapacity, int maxCapacity) { + if (initialCapacity == 0 && maxCapacity == 0) { + newDirectBuffer(initialCapacity, maxCapacity); + } + validate(initialCapacity, maxCapacity); + return newDirectBufferL(initialCapacity, maxCapacity); + } + + @Override + public ByteBuf heapBuffer(int initialCapacity, int maxCapacity) { + throw new UnsupportedOperationException("Drill doesn't support using heap buffers."); + } + + + private void validate(int initialCapacity, int maxCapacity) { + if (initialCapacity < 0) { + throw new IllegalArgumentException("initialCapacity: " + initialCapacity + " (expectd: 0+)"); + } + if (initialCapacity > maxCapacity) { + throw new IllegalArgumentException(String.format( + "initialCapacity: %d (expected: not greater than maxCapacity(%d)", + initialCapacity, maxCapacity)); + } + } + + private class MemoryStatusThread extends Thread { + + public MemoryStatusThread() { + super("memory-status-logger"); + this.setDaemon(true); + this.setName("allocation.logger"); + } + + @Override + public void run() { + while (true) { + memoryLogger.trace("Memory Usage: \n{}", PooledByteBufAllocatorL.this.toString()); + try { + Thread.sleep(MEMORY_LOGGER_FREQUENCY_SECONDS * 1000); + } catch (InterruptedException e) { + return; + } + + } + } + + } + + public String toString() { + StringBuilder buf = new StringBuilder(); + buf.append(directArenas.length); + buf.append(" direct arena(s):"); + buf.append(StringUtil.NEWLINE); + for (PoolArena a : directArenas) { + buf.append(a); + } + + buf.append("Large buffers outstanding: "); + buf.append(hugeBufferCount.get()); + buf.append(" totaling "); + buf.append(hugeBufferSize.get()); + buf.append(" bytes."); + buf.append('\n'); + buf.append("Normal buffers outstanding: "); + buf.append(normalBufferCount.get()); + buf.append(" totaling "); + buf.append(normalBufferSize.get()); + buf.append(" bytes."); + return buf.toString(); + } + + + } + + public static final boolean ASSERT_ENABLED; + + static { + boolean isAssertEnabled = false; + assert isAssertEnabled = true; + ASSERT_ENABLED = isAssertEnabled; + } + +} diff --git a/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java b/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java new file mode 100644 index 0000000000000..6495d5d371e76 --- /dev/null +++ b/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java @@ -0,0 +1,270 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package io.netty.buffer; + +import io.netty.util.internal.PlatformDependent; + +import java.nio.ByteOrder; +import java.util.concurrent.atomic.AtomicLong; + +/** + * The underlying class we use for little-endian access to memory. Is used underneath DrillBufs to abstract away the + * Netty classes and underlying Netty memory management. + */ +public final class UnsafeDirectLittleEndian extends WrappedByteBuf { + private static final boolean NATIVE_ORDER = ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN; + private static final AtomicLong ID_GENERATOR = new AtomicLong(0); + + public final long id = ID_GENERATOR.incrementAndGet(); + private final AbstractByteBuf wrapped; + private final long memoryAddress; + + private final AtomicLong bufferCount; + private final AtomicLong bufferSize; + private final long initCap; + + UnsafeDirectLittleEndian(DuplicatedByteBuf buf) { + this(buf, true, null, null); + } + + UnsafeDirectLittleEndian(LargeBuffer buf) { + this(buf, true, null, null); + } + + UnsafeDirectLittleEndian(PooledUnsafeDirectByteBuf buf, AtomicLong bufferCount, AtomicLong bufferSize) { + this(buf, true, bufferCount, bufferSize); + + } + + private UnsafeDirectLittleEndian(AbstractByteBuf buf, boolean fake, AtomicLong bufferCount, AtomicLong bufferSize) { + super(buf); + if (!NATIVE_ORDER || buf.order() != ByteOrder.BIG_ENDIAN) { + throw new IllegalStateException("Drill only runs on LittleEndian systems."); + } + + this.bufferCount = bufferCount; + this.bufferSize = bufferSize; + + // initCap is used if we're tracking memory release. If we're in non-debug mode, we'll skip this. + this.initCap = ASSERT_ENABLED ? buf.capacity() : -1; + + this.wrapped = buf; + this.memoryAddress = buf.memoryAddress(); + } + private long addr(int index) { + return memoryAddress + index; + } + + @Override + public long getLong(int index) { +// wrapped.checkIndex(index, 8); + long v = PlatformDependent.getLong(addr(index)); + return v; + } + + @Override + public float getFloat(int index) { + return Float.intBitsToFloat(getInt(index)); + } + + @Override + public ByteBuf slice() { + return slice(this.readerIndex(), readableBytes()); + } + + @Override + public ByteBuf slice(int index, int length) { + return new SlicedByteBuf(this, index, length); + } + + @Override + public ByteOrder order() { + return ByteOrder.LITTLE_ENDIAN; + } + + @Override + public ByteBuf order(ByteOrder endianness) { + return this; + } + + @Override + public double getDouble(int index) { + return Double.longBitsToDouble(getLong(index)); + } + + @Override + public char getChar(int index) { + return (char) getShort(index); + } + + @Override + public long getUnsignedInt(int index) { + return getInt(index) & 0xFFFFFFFFL; + } + + @Override + public int getInt(int index) { + int v = PlatformDependent.getInt(addr(index)); + return v; + } + + @Override + public int getUnsignedShort(int index) { + return getShort(index) & 0xFFFF; + } + + @Override + public short getShort(int index) { + short v = PlatformDependent.getShort(addr(index)); + return v; + } + + @Override + public ByteBuf setShort(int index, int value) { + wrapped.checkIndex(index, 2); + _setShort(index, value); + return this; + } + + @Override + public ByteBuf setInt(int index, int value) { + wrapped.checkIndex(index, 4); + _setInt(index, value); + return this; + } + + @Override + public ByteBuf setLong(int index, long value) { + wrapped.checkIndex(index, 8); + _setLong(index, value); + return this; + } + + @Override + public ByteBuf setChar(int index, int value) { + 
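+    // A char is two bytes wide, so this simply narrows the int value and delegates to setShort(),
+    // which bounds-checks the index and then writes through sun.misc.Unsafe at the raw address.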
setShort(index, value); + return this; + } + + @Override + public ByteBuf setFloat(int index, float value) { + setInt(index, Float.floatToRawIntBits(value)); + return this; + } + + @Override + public ByteBuf setDouble(int index, double value) { + setLong(index, Double.doubleToRawLongBits(value)); + return this; + } + + @Override + public ByteBuf writeShort(int value) { + wrapped.ensureWritable(2); + _setShort(wrapped.writerIndex, value); + wrapped.writerIndex += 2; + return this; + } + + @Override + public ByteBuf writeInt(int value) { + wrapped.ensureWritable(4); + _setInt(wrapped.writerIndex, value); + wrapped.writerIndex += 4; + return this; + } + + @Override + public ByteBuf writeLong(long value) { + wrapped.ensureWritable(8); + _setLong(wrapped.writerIndex, value); + wrapped.writerIndex += 8; + return this; + } + + @Override + public ByteBuf writeChar(int value) { + writeShort(value); + return this; + } + + @Override + public ByteBuf writeFloat(float value) { + writeInt(Float.floatToRawIntBits(value)); + return this; + } + + @Override + public ByteBuf writeDouble(double value) { + writeLong(Double.doubleToRawLongBits(value)); + return this; + } + + private void _setShort(int index, int value) { + PlatformDependent.putShort(addr(index), (short) value); + } + + private void _setInt(int index, int value) { + PlatformDependent.putInt(addr(index), value); + } + + private void _setLong(int index, long value) { + PlatformDependent.putLong(addr(index), value); + } + + @Override + public byte getByte(int index) { + return PlatformDependent.getByte(addr(index)); + } + + @Override + public ByteBuf setByte(int index, int value) { + PlatformDependent.putByte(addr(index), (byte) value); + return this; + } + + @Override + public boolean release() { + return release(1); + } + + @Override + public boolean release(int decrement) { + final boolean released = super.release(decrement); + if (ASSERT_ENABLED && released && bufferCount != null && bufferSize != null) { + bufferCount.decrementAndGet(); + bufferSize.addAndGet(-initCap); + } + return released; + } + + @Override + public int hashCode() { + return System.identityHashCode(this); + } + + public static final boolean ASSERT_ENABLED; + + static { + boolean isAssertEnabled = false; + assert isAssertEnabled = true; + ASSERT_ENABLED = isAssertEnabled; + } + +} diff --git a/java/memory/src/main/java/org/apache/arrow/memory/Accountant.java b/java/memory/src/main/java/org/apache/arrow/memory/Accountant.java new file mode 100644 index 0000000000000..dc75e5d7231a8 --- /dev/null +++ b/java/memory/src/main/java/org/apache/arrow/memory/Accountant.java @@ -0,0 +1,272 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+package org.apache.arrow.memory;
+
+import java.util.concurrent.atomic.AtomicLong;
+
+import javax.annotation.concurrent.ThreadSafe;
+
+import com.google.common.base.Preconditions;
+
+/**
+ * Provides a concurrent way to manage accounting of memory usage without locking. Used as the basis for Allocators.
+ * All operations are threadsafe (except for close).
+ */
+@ThreadSafe
+class Accountant implements AutoCloseable {
+  // private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(Accountant.class);
+
+  /**
+   * The parent allocator.
+   */
+  protected final Accountant parent;
+
+  /**
+   * The amount of memory reserved for this allocator. Releases below this amount of memory will not be returned to
+   * the parent Accountant until this Accountant is closed.
+   */
+  protected final long reservation;
+
+  private final AtomicLong peakAllocation = new AtomicLong();
+
+  /**
+   * Maximum local memory that can be held. This can be externally updated. Changing it won't cause past memory to
+   * change but will change responses to future allocation efforts.
+   */
+  private final AtomicLong allocationLimit = new AtomicLong();
+
+  /**
+   * Currently allocated amount of memory.
+   */
+  private final AtomicLong locallyHeldMemory = new AtomicLong();
+
+  public Accountant(Accountant parent, long reservation, long maxAllocation) {
+    Preconditions.checkArgument(reservation >= 0, "The initial reservation size must be non-negative.");
+    Preconditions.checkArgument(maxAllocation >= 0, "The maximum allocation limit must be non-negative.");
+    Preconditions.checkArgument(reservation <= maxAllocation,
+        "The initial reservation size must be <= the maximum allocation.");
+    Preconditions.checkArgument(reservation == 0 || parent != null, "The root accountant can't reserve memory.");
+
+    this.parent = parent;
+    this.reservation = reservation;
+    this.allocationLimit.set(maxAllocation);
+
+    if (reservation != 0) {
+      // we will allocate a reservation from our parent.
+      final AllocationOutcome outcome = parent.allocateBytes(reservation);
+      if (!outcome.isOk()) {
+        throw new OutOfMemoryException(String.format(
+            "Failure trying to allocate initial reservation for Allocator. "
+                + "Attempted to allocate %d bytes and received an outcome of %s.", reservation, outcome.name()));
+      }
+    }
+  }
+
+  /**
+   * Attempt to allocate the requested amount of memory. Either completely succeeds or completely fails. If it fails,
+   * no changes are made to the accounting.
+   *
+   * @param size
+   *          The amount of memory to reserve in bytes.
+   * @return The outcome of the allocation; isOk() is true if the allocation was successful.
+   */
+  AllocationOutcome allocateBytes(long size) {
+    final AllocationOutcome outcome = allocate(size, true, false);
+    if (!outcome.isOk()) {
+      releaseBytes(size);
+    }
+    return outcome;
+  }
+
+  private void updatePeak() {
+    final long currentMemory = locallyHeldMemory.get();
+    while (true) {
+      final long previousPeak = peakAllocation.get();
+      if (currentMemory > previousPeak) {
+        if (!peakAllocation.compareAndSet(previousPeak, currentMemory)) {
+          // peak allocation changed underneath us. try again.
+          continue;
+        }
+      }
+
+      // we either succeeded to set peak allocation or we weren't above the previous peak, exit.
+      return;
+    }
+  }
+
+  /**
+   * Increase the accounting. Returns whether the allocation fit within limits.
+   *
+   * @param size
+   *          to increase
+   * @return Whether the allocation fit within limits.
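+   *         Note that, unlike allocateBytes(), a forced allocation is not unwound on failure: the bytes
+   *         remain accounted for even when false is returned.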
+   */
+  boolean forceAllocate(long size) {
+    final AllocationOutcome outcome = allocate(size, true, true);
+    return outcome.isOk();
+  }
+
+  /**
+   * Internal method for allocation. This takes a forced approach to allocation to ensure that we manage reservation
+   * boundary issues consistently. Allocation is always done through the entire tree. The two options that we
+   * influence are whether the allocation should be forced and whether or not the peak memory allocation should be
+   * updated. If at some point during allocation escalation we determine that the allocation is no longer possible,
+   * we will continue to do a complete and consistent allocation but we will stop updating the peak allocation. We do
+   * this because we know that we will be directly unwinding this allocation (and thus never actually making the
+   * allocation). If force allocation is passed, then we continue to update the peak limits since we now know that
+   * this allocation will occur despite our moving past one or more limits.
+   *
+   * @param size
+   *          The size of the allocation.
+   * @param incomingUpdatePeak
+   *          Whether we should update the local peak for this allocation.
+   * @param forceAllocation
+   *          Whether we should force the allocation.
+   * @return The outcome of the allocation.
+   */
+  private AllocationOutcome allocate(final long size, final boolean incomingUpdatePeak, final boolean forceAllocation) {
+    final long newLocal = locallyHeldMemory.addAndGet(size);
+    final long beyondReservation = newLocal - reservation;
+    final boolean beyondLimit = newLocal > allocationLimit.get();
+    final boolean updatePeak = forceAllocation || (incomingUpdatePeak && !beyondLimit);
+
+    AllocationOutcome parentOutcome = AllocationOutcome.SUCCESS;
+    if (beyondReservation > 0 && parent != null) {
+      // we need to get memory from our parent.
+      final long parentRequest = Math.min(beyondReservation, size);
+      parentOutcome = parent.allocate(parentRequest, updatePeak, forceAllocation);
+    }
+
+    final AllocationOutcome finalOutcome = beyondLimit ? AllocationOutcome.FAILED_LOCAL :
+        parentOutcome.ok ? AllocationOutcome.SUCCESS : AllocationOutcome.FAILED_PARENT;
+
+    if (updatePeak) {
+      updatePeak();
+    }
+
+    return finalOutcome;
+  }
+
+  public void releaseBytes(long size) {
+    // reduce local memory. all memory released above reservation should be released up the tree.
+    final long newSize = locallyHeldMemory.addAndGet(-size);
+
+    Preconditions.checkArgument(newSize >= 0, "Accounted size went negative.");
+
+    final long originalSize = newSize + size;
+    if (originalSize > reservation && parent != null) {
+      // we deallocated memory that we should release to our parent.
+      final long possibleAmountToReleaseToParent = originalSize - reservation;
+      final long actualToReleaseToParent = Math.min(size, possibleAmountToReleaseToParent);
+      parent.releaseBytes(actualToReleaseToParent);
+    }
+  }
+
+  /**
+   * Set the maximum amount of memory that can be allocated in this Accountant before failing an allocation.
+   *
+   * @param newLimit
+   *          The limit in bytes.
+   */
+  public void setLimit(long newLimit) {
+    allocationLimit.set(newLimit);
+  }
+
+  public boolean isOverLimit() {
+    return getAllocatedMemory() > getLimit() || (parent != null && parent.isOverLimit());
+  }
+
+  /**
+   * Close this Accountant. This will release any reservation bytes back to a parent Accountant.
+   */
+  public void close() {
+    // return memory reservation to parent allocator.
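+    // Per the class javadoc, close() is the one operation here that is not threadsafe; callers must
+    // ensure that no allocate/release calls are still in flight when the Accountant is closed.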
+ if (parent != null) { + parent.releaseBytes(reservation); + } + } + + /** + * Return the current limit of this Accountant. + * + * @return Limit in bytes. + */ + public long getLimit() { + return allocationLimit.get(); + } + + /** + * Return the current amount of allocated memory that this Accountant is managing accounting for. Note this does not + * include reservation memory that hasn't been allocated. + * + * @return Currently allocate memory in bytes. + */ + public long getAllocatedMemory() { + return locallyHeldMemory.get(); + } + + /** + * The peak memory allocated by this Accountant. + * + * @return The peak allocated memory in bytes. + */ + public long getPeakMemoryAllocation() { + return peakAllocation.get(); + } + + /** + * Describes the type of outcome that occurred when trying to account for allocation of memory. + */ + public static enum AllocationOutcome { + + /** + * Allocation succeeded. + */ + SUCCESS(true), + + /** + * Allocation succeeded but only because the allocator was forced to move beyond a limit. + */ + FORCED_SUCESS(true), + + /** + * Allocation failed because the local allocator's limits were exceeded. + */ + FAILED_LOCAL(false), + + /** + * Allocation failed because a parent allocator's limits were exceeded. + */ + FAILED_PARENT(false); + + private final boolean ok; + + AllocationOutcome(boolean ok) { + this.ok = ok; + } + + public boolean isOk() { + return ok; + } + } +} diff --git a/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java b/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java new file mode 100644 index 0000000000000..0db61443266c6 --- /dev/null +++ b/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java @@ -0,0 +1,433 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.memory; + +import static org.apache.arrow.memory.BaseAllocator.indent; +import io.netty.buffer.ArrowBuf; +import io.netty.buffer.PooledByteBufAllocatorL; +import io.netty.buffer.UnsafeDirectLittleEndian; + +import java.util.IdentityHashMap; +import java.util.concurrent.atomic.AtomicInteger; +import java.util.concurrent.atomic.AtomicLong; +import java.util.concurrent.locks.ReadWriteLock; +import java.util.concurrent.locks.ReentrantReadWriteLock; + +import org.apache.arrow.memory.BaseAllocator.Verbosity; +import org.apache.arrow.memory.util.AutoCloseableLock; +import org.apache.arrow.memory.util.HistoricalLog; +import org.apache.arrow.memory.util.Metrics; + +import com.google.common.base.Preconditions; + +/** + * Manages the relationship between one or more allocators and a particular UDLE. Ensures that one allocator owns the + * memory that multiple allocators may be referencing. Manages a BufferLedger between each of its associated allocators. 
+ * This class is also responsible for managing when memory is allocated and returned to the Netty-based
+ * PooledByteBufAllocatorL.
+ *
+ * The only reason that this isn't package private is that we're forced to put DrillBuf in Netty's package, which
+ * needs access to these objects or methods.
+ *
+ * Threading: AllocationManager manages thread-safety internally. Operations within the context of a single
+ * BufferLedger are lockless in nature and can be leveraged by multiple threads. Operations that cross the context of
+ * two ledgers will acquire a lock on the AllocationManager instance. Important note: there is one AllocationManager
+ * per UnsafeDirectLittleEndian buffer allocation. As such, there will be thousands of these in a typical query. The
+ * contention of acquiring a lock on AllocationManager should be very low.
+ */
+public class AllocationManager {
+  // private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(AllocationManager.class);
+
+  private static final AtomicLong MANAGER_ID_GENERATOR = new AtomicLong(0);
+  private static final AtomicLong LEDGER_ID_GENERATOR = new AtomicLong(0);
+  static final PooledByteBufAllocatorL INNER_ALLOCATOR = new PooledByteBufAllocatorL(Metrics.getInstance());
+
+  private final RootAllocator root;
+  private final long allocatorManagerId = MANAGER_ID_GENERATOR.incrementAndGet();
+  private final int size;
+  private final UnsafeDirectLittleEndian underlying;
+  private final IdentityHashMap<BufferAllocator, BufferLedger> map = new IdentityHashMap<>();
+  private final ReadWriteLock lock = new ReentrantReadWriteLock();
+  private final AutoCloseableLock readLock = new AutoCloseableLock(lock.readLock());
+  private final AutoCloseableLock writeLock = new AutoCloseableLock(lock.writeLock());
+  private final long amCreationTime = System.nanoTime();
+
+  private volatile BufferLedger owningLedger;
+  private volatile long amDestructionTime = 0;
+
+  AllocationManager(BaseAllocator accountingAllocator, int size) {
+    Preconditions.checkNotNull(accountingAllocator);
+    accountingAllocator.assertOpen();
+
+    this.root = accountingAllocator.root;
+    this.underlying = INNER_ALLOCATOR.allocate(size);
+
+    // we do a no retain association since our creator will want to retrieve the newly created ledger and will create
+    // a reference count at that point
+    this.owningLedger = associate(accountingAllocator, false);
+    this.size = underlying.capacity();
+  }
+
+  /**
+   * Associate the existing underlying buffer with a new allocator. This will increase the reference count on the
+   * returned ledger by 1.
+   *
+   * @param allocator
+   *          The target allocator to associate this buffer with.
+   * @return The BufferLedger (new or existing) that associates the underlying buffer with this allocator.
+   */
+  BufferLedger associate(final BaseAllocator allocator) {
+    return associate(allocator, true);
+  }
+
+  private BufferLedger associate(final BaseAllocator allocator, final boolean retain) {
+    allocator.assertOpen();
+
+    if (root != allocator.root) {
+      throw new IllegalStateException(
+          "A buffer can only be associated between two allocators that share the same root.");
+    }
+
+    try (AutoCloseableLock read = readLock.open()) {
+      final BufferLedger ledger = map.get(allocator);
+      if (ledger != null) {
+        if (retain) {
+          ledger.inc();
+        }
+        return ledger;
+      }
+    }
+    try (AutoCloseableLock write = writeLock.open()) {
+      // we have to recheck the existing ledger since a second reader => writer could be competing with us.
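+      // (Classic double-checked locking: the read-locked lookup above is the fast path, but another
+      // thread may have created and registered a ledger for this allocator between our releasing the
+      // read lock and acquiring the write lock, so we must look again before creating a new one.)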
+
+      final BufferLedger existingLedger = map.get(allocator);
+      if (existingLedger != null) {
+        if (retain) {
+          existingLedger.inc();
+        }
+        return existingLedger;
+      }
+
+      final BufferLedger ledger = new BufferLedger(allocator, new ReleaseListener(allocator));
+      if (retain) {
+        ledger.inc();
+      }
+      BufferLedger oldLedger = map.put(allocator, ledger);
+      Preconditions.checkArgument(oldLedger == null);
+      allocator.associateLedger(ledger);
+      return ledger;
+    }
+  }
+
+  /**
+   * The way that a particular BufferLedger communicates back to the AllocationManager that it no longer needs to
+   * hold a reference to a particular piece of memory.
+   */
+  private class ReleaseListener {
+
+    private final BufferAllocator allocator;
+
+    public ReleaseListener(BufferAllocator allocator) {
+      this.allocator = allocator;
+    }
+
+    /**
+     * Can only be called when you already hold the writeLock.
+     */
+    public void release() {
+      allocator.assertOpen();
+
+      final BufferLedger oldLedger = map.remove(allocator);
+      oldLedger.allocator.dissociateLedger(oldLedger);
+
+      if (oldLedger == owningLedger) {
+        if (map.isEmpty()) {
+          // no one else owns, lets release.
+          oldLedger.allocator.releaseBytes(size);
+          underlying.release();
+          amDestructionTime = System.nanoTime();
+          owningLedger = null;
+        } else {
+          // we need to change the owning allocator. we've been removed so we'll get whatever is top of list
+          BufferLedger newLedger = map.values().iterator().next();
+
+          // we'll forcefully transfer the ownership and not worry about whether we exceeded the limit
+          // since this consumer can't do anything with this.
+          oldLedger.transferBalance(newLedger);
+        }
+      } else {
+        if (map.isEmpty()) {
+          throw new IllegalStateException("The final removal of a ledger should be connected to the owning ledger.");
+        }
+      }
+    }
+  }
+
+  /**
+   * The reference manager that binds an allocation manager to a particular BaseAllocator. Also responsible for
+   * creating a set of DrillBufs that share a common fate and set of reference counts.
+   * As with AllocationManager, the only reason this is public is due to DrillBuf being in the io.netty.buffer
+   * package.
+   */
+  public class BufferLedger {
+
+    private final IdentityHashMap<ArrowBuf, Object> buffers =
+        BaseAllocator.DEBUG ? new IdentityHashMap<ArrowBuf, Object>() : null;
+
+    private final long ledgerId = LEDGER_ID_GENERATOR.incrementAndGet(); // unique ID assigned to each ledger
+    private final AtomicInteger bufRefCnt = new AtomicInteger(0); // start at zero so we can manage request for
+                                                                  // retain correctly
+    private final long lCreationTime = System.nanoTime();
+    private volatile long lDestructionTime = 0;
+    private final BaseAllocator allocator;
+    private final ReleaseListener listener;
+    private final HistoricalLog historicalLog = BaseAllocator.DEBUG
+        ? new HistoricalLog(BaseAllocator.DEBUG_LOG_LENGTH, "BufferLedger[%d]", 1)
+        : null;
+
+    private BufferLedger(BaseAllocator allocator, ReleaseListener listener) {
+      this.allocator = allocator;
+      this.listener = listener;
+    }
+
+    /**
+     * Transfer any balance the current ledger has to the target ledger. In the case that the current ledger holds no
+     * memory, no transfer is made to the new ledger.
+     *
+     * @param target
+     *          The ledger to transfer ownership account to.
+     * @return Whether the transfer fit within the target ledger's limits.
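+     *         Note that the transfer itself always takes place (via forceAllocate on the target); the
+     *         return value only reports whether the target stayed within its limit.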
+     */
+    public boolean transferBalance(final BufferLedger target) {
+      Preconditions.checkNotNull(target);
+      Preconditions.checkArgument(allocator.root == target.allocator.root,
+          "You can only transfer between two allocators that share the same root.");
+      allocator.assertOpen();
+
+      target.allocator.assertOpen();
+      // if we're transferring to ourself, just return.
+      if (target == this) {
+        return true;
+      }
+
+      // since two balance transfers out from the allocator manager could cause incorrect accounting, we need to
+      // ensure that this won't happen by synchronizing on the allocator manager instance.
+      try (AutoCloseableLock write = writeLock.open()) {
+        if (owningLedger != this) {
+          return true;
+        }
+
+        if (BaseAllocator.DEBUG) {
+          this.historicalLog.recordEvent("transferBalance(%s)", target.allocator.name);
+          target.historicalLog.recordEvent("incoming(from %s)", owningLedger.allocator.name);
+        }
+
+        boolean fitsInTargetLimit = target.allocator.forceAllocate(size);
+        allocator.releaseBytes(size);
+        owningLedger = target;
+        return fitsInTargetLimit;
+      }
+    }
+
+    /**
+     * Print the current ledger state to the provided StringBuilder.
+     *
+     * @param sb
+     *          The StringBuilder to populate.
+     * @param indent
+     *          The level of indentation to position the data.
+     * @param verbosity
+     *          The level of verbosity to print.
+     */
+    public void print(StringBuilder sb, int indent, Verbosity verbosity) {
+      indent(sb, indent)
+          .append("ledger[")
+          .append(ledgerId)
+          .append("] allocator: ")
+          .append(allocator.name)
+          .append(", isOwning: ")
+          .append(owningLedger == this)
+          .append(", size: ")
+          .append(size)
+          .append(", references: ")
+          .append(bufRefCnt.get())
+          .append(", life: ")
+          .append(lCreationTime)
+          .append("..")
+          .append(lDestructionTime)
+          .append(", allocatorManager: [")
+          .append(AllocationManager.this.allocatorManagerId)
+          .append(", life: ")
+          .append(amCreationTime)
+          .append("..")
+          .append(amDestructionTime);
+
+      if (!BaseAllocator.DEBUG) {
+        sb.append("]\n");
+      } else {
+        synchronized (buffers) {
+          sb.append("] holds ")
+              .append(buffers.size())
+              .append(" buffers. \n");
+          for (ArrowBuf buf : buffers.keySet()) {
+            buf.print(sb, indent + 2, verbosity);
+            sb.append('\n');
+          }
+        }
+      }
+    }
+
+    private void inc() {
+      bufRefCnt.incrementAndGet();
+    }
+
+    /**
+     * Decrement the ledger's reference count. If the ledger is decremented to zero, this ledger should release its
+     * ownership back to the AllocationManager.
+     *
+     * @param decrement
+     *          The amount to decrement by.
+     * @return The new reference count.
+     */
+    public int decrement(int decrement) {
+      allocator.assertOpen();
+
+      final int outcome;
+      try (AutoCloseableLock write = writeLock.open()) {
+        outcome = bufRefCnt.addAndGet(-decrement);
+        if (outcome == 0) {
+          lDestructionTime = System.nanoTime();
+          listener.release();
+        }
+      }
+
+      return outcome;
+    }
+
+    /**
+     * Returns the ledger associated with a particular BufferAllocator. If the BufferAllocator doesn't currently have
+     * a ledger associated with this AllocationManager, a new one is created. This is placed on BufferLedger rather
+     * than AllocationManager directly because DrillBufs don't have access to AllocationManager and they are the ones
+     * responsible for exposing the ability to associate multiple allocators with a particular piece of underlying
+     * memory. Note that this will increment the reference count of this ledger by one to ensure the ledger isn't
+     * destroyed before use.
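+     * This is the mechanism by which a single piece of underlying memory comes to be shared by more
+     * than one allocator, with the AllocationManager ensuring that exactly one of them owns it.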
+     *
+     * @param allocator
+     *          The allocator to retrieve or create a ledger for.
+     * @return The ledger associated with the given allocator.
+     */
+    public BufferLedger getLedgerForAllocator(BufferAllocator allocator) {
+      return associate((BaseAllocator) allocator);
+    }
+
+    /**
+     * Create a new DrillBuf associated with this AllocationManager and memory. Does not impact reference count.
+     * Typically used for slicing.
+     *
+     * @param offset
+     *          The offset in bytes to start this new DrillBuf.
+     * @param length
+     *          The length in bytes that this DrillBuf will provide access to.
+     * @return A new DrillBuf that shares references with all DrillBufs associated with this BufferLedger
+     */
+    public ArrowBuf newDrillBuf(int offset, int length) {
+      allocator.assertOpen();
+      return newDrillBuf(offset, length, null);
+    }
+
+    /**
+     * Create a new DrillBuf associated with this AllocationManager and memory.
+     *
+     * @param offset
+     *          The offset in bytes to start this new DrillBuf.
+     * @param length
+     *          The length in bytes that this DrillBuf will provide access to.
+     * @param manager
+     *          An optional BufferManager argument that can be used to manage expansion of this DrillBuf
+     * @return A new DrillBuf that shares references with all DrillBufs associated with this BufferLedger
+     */
+    public ArrowBuf newDrillBuf(int offset, int length, BufferManager manager) {
+      allocator.assertOpen();
+
+      final ArrowBuf buf = new ArrowBuf(
+          bufRefCnt,
+          this,
+          underlying,
+          manager,
+          allocator.getAsByteBufAllocator(),
+          offset,
+          length,
+          false);
+
+      if (BaseAllocator.DEBUG) {
+        historicalLog.recordEvent(
+            "DrillBuf(BufferLedger, BufferAllocator[%s], UnsafeDirectLittleEndian[identityHashCode == "
+                + "%d](%s)) => ledger hc == %d",
+            allocator.name, System.identityHashCode(buf), buf.toString(),
+            System.identityHashCode(this));
+
+        synchronized (buffers) {
+          buffers.put(buf, null);
+        }
+      }
+
+      return buf;
+    }
+
+    /**
+     * The total size (in bytes) of memory underlying this ledger.
+     *
+     * @return Size in bytes
+     */
+    public int getSize() {
+      return size;
+    }
+
+    /**
+     * How much memory is accounted for by this ledger. This is either getSize() if this is the owning ledger for the
+     * memory or zero in the case that this is not the owning ledger associated with this memory.
+     *
+     * @return Amount of accounted(owned) memory associated with this ledger.
+     */
+    public int getAccountedSize() {
+      try (AutoCloseableLock read = readLock.open()) {
+        if (owningLedger == this) {
+          return size;
+        } else {
+          return 0;
+        }
+      }
+    }
+
+    /**
+     * Package visible for debugging/verification only.
+     */
+    UnsafeDirectLittleEndian getUnderlying() {
+      return underlying;
+    }
+
+    /**
+     * Package visible for debugging/verification only.
+     */
+    boolean isOwningLedger() {
+      return this == owningLedger;
+    }
+  }
+
+}
\ No newline at end of file
diff --git a/java/memory/src/main/java/org/apache/arrow/memory/AllocationReservation.java b/java/memory/src/main/java/org/apache/arrow/memory/AllocationReservation.java
new file mode 100644
index 0000000000000..68d1244d1e328
--- /dev/null
+++ b/java/memory/src/main/java/org/apache/arrow/memory/AllocationReservation.java
@@ -0,0 +1,86 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.memory;
+
+import io.netty.buffer.ArrowBuf;
+
+/**
+ * Supports cumulative allocation reservation. Clients may increase the size of the reservation repeatedly until they
+ * call for an allocation of the current total size. The reservation can only be used once, and will throw an
+ * exception if it is used more than once.
+ *
+ * <p>For the purposes of airtight memory accounting, the reservation must be close()d whether it is used or not.
+ * This is not threadsafe.
+ */
+public interface AllocationReservation extends AutoCloseable {
+
+  /**
+   * Add to the current reservation.
+   *
+   * <p>Adding may fail if the allocator is not allowed to consume any more space.
+   *
+   * @param nBytes the number of bytes to add
+   * @return true if the addition is possible, false otherwise
+   * @throws IllegalStateException if called after buffer() is used to allocate the reservation
+   */
+  boolean add(final int nBytes);
+
+  /**
+   * Requests a reservation of additional space.
+   *
+   * <p>The implementation of the allocator's inner class provides this.
+   *
+   * @param nBytes the amount to reserve
+   * @return true if the reservation can be satisfied, false otherwise
+   */
+  boolean reserve(int nBytes);
+
+  /**
+   * Allocate a buffer whose size is the total of all the add()s made.
+   *
+   * <p>The allocation request can still fail, even if the amount of space requested is available, if the allocation
+   * cannot be made contiguously.
+   *
+   * @return the buffer, or null, if the request cannot be satisfied
+   * @throws IllegalStateException if called more than once
+   */
+  ArrowBuf allocateBuffer();
+
+  /**
+   * Get the current size of the reservation (the sum of all the add()s).
+   *
+   * @return size of the current reservation
+   */
+  int getSize();
+
+  /**
+   * Return whether or not the reservation has been used.
+   *
+   * @return whether or not the reservation has been used
+   */
+  public boolean isUsed();
+
+  /**
+   * Return whether or not the reservation has been closed.
+   *
+   * @return whether or not the reservation has been closed
+   */
+  public boolean isClosed();
+
+  public void close();
+}
diff --git a/java/memory/src/main/java/org/apache/arrow/memory/AllocatorClosedException.java b/java/memory/src/main/java/org/apache/arrow/memory/AllocatorClosedException.java
new file mode 100644
index 0000000000000..566457981c7ed
--- /dev/null
+++ b/java/memory/src/main/java/org/apache/arrow/memory/AllocatorClosedException.java
@@ -0,0 +1,31 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.memory;
+
+/**
+ * Exception thrown when a closed BufferAllocator is used. Note this is an unchecked exception.
+ */
+@SuppressWarnings("serial")
+public class AllocatorClosedException extends RuntimeException {
+
+  /**
+   * @param message string associated with the cause
+   */
+  public AllocatorClosedException(String message) {
+    super(message);
+  }
+}
diff --git a/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java
new file mode 100644
index 0000000000000..72f77ab0c7bc2
--- /dev/null
+++ b/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java
@@ -0,0 +1,781 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.memory;
+
+import io.netty.buffer.ArrowBuf;
+import io.netty.buffer.ByteBufAllocator;
+import io.netty.buffer.UnsafeDirectLittleEndian;
+
+import java.util.Arrays;
+import java.util.IdentityHashMap;
+import java.util.Set;
+import java.util.concurrent.atomic.AtomicInteger;
+import java.util.concurrent.atomic.AtomicLong;
+
+import org.apache.arrow.memory.AllocationManager.BufferLedger;
+import org.apache.arrow.memory.util.AssertionUtil;
+import org.apache.arrow.memory.util.HistoricalLog;
+
+import com.google.common.base.Preconditions;
+
+public abstract class BaseAllocator extends Accountant implements BufferAllocator {
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(BaseAllocator.class);
+
+  public static final String DEBUG_ALLOCATOR = "arrow.memory.debug.allocator";
+
+  private static final AtomicLong ID_GENERATOR = new AtomicLong(0);
+  private static final int CHUNK_SIZE = AllocationManager.INNER_ALLOCATOR.getChunkSize();
+
+  public static final int DEBUG_LOG_LENGTH = 6;
+  public static final boolean DEBUG = AssertionUtil.isAssertionsEnabled()
+      || Boolean.parseBoolean(System.getProperty(DEBUG_ALLOCATOR, "false"));
+  private final Object DEBUG_LOCK = DEBUG ? new Object() : null;
+
+  private final BaseAllocator parentAllocator;
+  private final ByteBufAllocator thisAsByteBufAllocator;
+  private final IdentityHashMap<BaseAllocator, Object> childAllocators;
+  private final ArrowBuf empty;
+
+  private volatile boolean isClosed = false; // the allocator has been closed
+
+  // Package exposed for sharing between AllocationManager and BaseAllocator objects
+  final String name;
+  final RootAllocator root;
+
+  // members used purely for debugging
+  private final IdentityHashMap<BufferLedger, Object> childLedgers;
+  private final IdentityHashMap<Reservation, Object> reservations;
+  private final HistoricalLog historicalLog;
+
+  protected BaseAllocator(
+      final BaseAllocator parentAllocator,
+      final String name,
+      final long initReservation,
+      final long maxAllocation) throws OutOfMemoryException {
+    super(parentAllocator, initReservation, maxAllocation);
+
+    if (parentAllocator != null) {
+      this.root = parentAllocator.root;
+      empty = parentAllocator.empty;
+    } else if (this instanceof RootAllocator) {
+      this.root = (RootAllocator) this;
+      empty = createEmpty();
+    } else {
+      throw new IllegalStateException("A parent allocator must either carry a root or be the root.");
+    }
+
+    this.parentAllocator = parentAllocator;
+    this.name = name;
+
+    this.thisAsByteBufAllocator = new DrillByteBufAllocator(this);
+
+    if (DEBUG) {
+      childAllocators = new IdentityHashMap<>();
+      reservations = new IdentityHashMap<>();
+      childLedgers = new IdentityHashMap<>();
+      historicalLog = new HistoricalLog(DEBUG_LOG_LENGTH, "allocator[%s]", name);
+      hist("created by \"%s\", owned = %d", name, this.getAllocatedMemory());
+    } else {
+      childAllocators = null;
+      reservations = null;
+      historicalLog = null;
+      childLedgers = null;
+    }
+  }
+
+  public void assertOpen() {
+    if (AssertionUtil.ASSERT_ENABLED) {
+      if (isClosed) {
+        throw new IllegalStateException("Attempting operation on allocator when allocator is closed.\n"
+            + toVerboseString());
+      }
+    }
+  }
+
+  @Override
+  public String getName() {
+    return name;
+  }
+
+  @Override
+  public ArrowBuf getEmpty() {
+    assertOpen();
+    return empty;
+  }
+
+  /**
+   * For debug/verification purposes only. Allows an AllocationManager to tell the allocator that we have a new
+   * ledger associated with this allocator.
+ */ + void associateLedger(BufferLedger ledger) { + assertOpen(); + if (DEBUG) { + synchronized (DEBUG_LOCK) { + childLedgers.put(ledger, null); + } + } + } + + /** + * For debug/verification purposes only. Allows an AllocationManager to tell the allocator that we are removing a + * ledger associated with this allocator + */ + void dissociateLedger(BufferLedger ledger) { + assertOpen(); + if (DEBUG) { + synchronized (DEBUG_LOCK) { + if (!childLedgers.containsKey(ledger)) { + throw new IllegalStateException("Trying to remove a child ledger that doesn't exist."); + } + childLedgers.remove(ledger); + } + } + } + + /** + * Track when a ChildAllocator of this BaseAllocator is closed. Used for debugging purposes. + * + * @param childAllocator + * The child allocator that has been closed. + */ + private void childClosed(final BaseAllocator childAllocator) { + assertOpen(); + + if (DEBUG) { + Preconditions.checkArgument(childAllocator != null, "child allocator can't be null"); + + synchronized (DEBUG_LOCK) { + final Object object = childAllocators.remove(childAllocator); + if (object == null) { + childAllocator.historicalLog.logHistory(logger); + throw new IllegalStateException("Child allocator[" + childAllocator.name + + "] not found in parent allocator[" + name + "]'s childAllocators"); + } + } + } + } + + private static String createErrorMsg(final BufferAllocator allocator, final int rounded, final int requested) { + if (rounded != requested) { + return String.format( + "Unable to allocate buffer of size %d (rounded from %d) due to memory limit. Current allocation: %d", + rounded, requested, allocator.getAllocatedMemory()); + } else { + return String.format("Unable to allocate buffer of size %d due to memory limit. Current allocation: %d", + rounded, allocator.getAllocatedMemory()); + } + } + + @Override + public ArrowBuf buffer(final int initialRequestSize) { + assertOpen(); + + return buffer(initialRequestSize, null); + } + + private ArrowBuf createEmpty(){ + assertOpen(); + + return new ArrowBuf(new AtomicInteger(), null, AllocationManager.INNER_ALLOCATOR.empty, null, null, 0, 0, true); + } + + @Override + public ArrowBuf buffer(final int initialRequestSize, BufferManager manager) { + assertOpen(); + + Preconditions.checkArgument(initialRequestSize >= 0, "the requested size must be non-negative"); + + if (initialRequestSize == 0) { + return empty; + } + + // round to next largest power of two if we're within a chunk since that is how our allocator operates + final int actualRequestSize = initialRequestSize < CHUNK_SIZE ? + nextPowerOfTwo(initialRequestSize) + : initialRequestSize; + AllocationOutcome outcome = this.allocateBytes(actualRequestSize); + if (!outcome.isOk()) { + throw new OutOfMemoryException(createErrorMsg(this, actualRequestSize, initialRequestSize)); + } + + boolean success = false; + try { + ArrowBuf buffer = bufferWithoutReservation(actualRequestSize, manager); + success = true; + return buffer; + } finally { + if (!success) { + releaseBytes(actualRequestSize); + } + } + + } + + /** + * Used by usual allocation as well as for allocating a pre-reserved buffer. Skips the typical accounting associated + * with creating a new buffer. 
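+   * Callers are expected to have already accounted for the bytes: buffer(int, BufferManager) above calls
+   * allocateBytes() before delegating here.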
+ */ + private ArrowBuf bufferWithoutReservation(final int size, BufferManager bufferManager) throws OutOfMemoryException { + assertOpen(); + + final AllocationManager manager = new AllocationManager(this, size); + final BufferLedger ledger = manager.associate(this); // +1 ref cnt (required) + final ArrowBuf buffer = ledger.newDrillBuf(0, size, bufferManager); + + // make sure that our allocation is equal to what we expected. + Preconditions.checkArgument(buffer.capacity() == size, + "Allocated capacity %d was not equal to requested capacity %d.", buffer.capacity(), size); + + return buffer; + } + + @Override + public ByteBufAllocator getAsByteBufAllocator() { + return thisAsByteBufAllocator; + } + + @Override + public BufferAllocator newChildAllocator( + final String name, + final long initReservation, + final long maxAllocation) { + assertOpen(); + + final ChildAllocator childAllocator = new ChildAllocator(this, name, initReservation, maxAllocation); + + if (DEBUG) { + synchronized (DEBUG_LOCK) { + childAllocators.put(childAllocator, childAllocator); + historicalLog.recordEvent("allocator[%s] created new child allocator[%s]", name, childAllocator.name); + } + } + + return childAllocator; + } + + public class Reservation implements AllocationReservation { + private int nBytes = 0; + private boolean used = false; + private boolean closed = false; + private final HistoricalLog historicalLog; + + public Reservation() { + if (DEBUG) { + historicalLog = new HistoricalLog("Reservation[allocator[%s], %d]", name, System.identityHashCode(this)); + historicalLog.recordEvent("created"); + synchronized (DEBUG_LOCK) { + reservations.put(this, this); + } + } else { + historicalLog = null; + } + } + + public boolean add(final int nBytes) { + assertOpen(); + + Preconditions.checkArgument(nBytes >= 0, "nBytes(%d) < 0", nBytes); + Preconditions.checkState(!closed, "Attempt to increase reservation after reservation has been closed"); + Preconditions.checkState(!used, "Attempt to increase reservation after reservation has been used"); + + // we round up to next power of two since all reservations are done in powers of two. This may overestimate the + // preallocation since someone may perceive additions to be power of two. If this becomes a problem, we can look + // at + // modifying this behavior so that we maintain what we reserve and what the user asked for and make sure to only + // round to power of two as necessary. 
+ final int nBytesTwo = BaseAllocator.nextPowerOfTwo(nBytes); + if (!reserve(nBytesTwo)) { + return false; + } + + this.nBytes += nBytesTwo; + return true; + } + + public ArrowBuf allocateBuffer() { + assertOpen(); + + Preconditions.checkState(!closed, "Attempt to allocate after closed"); + Preconditions.checkState(!used, "Attempt to allocate more than once"); + + final ArrowBuf drillBuf = allocate(nBytes); + used = true; + return drillBuf; + } + + public int getSize() { + return nBytes; + } + + public boolean isUsed() { + return used; + } + + public boolean isClosed() { + return closed; + } + + @Override + public void close() { + assertOpen(); + + if (closed) { + return; + } + + if (DEBUG) { + if (!isClosed()) { + final Object object; + synchronized (DEBUG_LOCK) { + object = reservations.remove(this); + } + if (object == null) { + final StringBuilder sb = new StringBuilder(); + print(sb, 0, Verbosity.LOG_WITH_STACKTRACE); + logger.debug(sb.toString()); + throw new IllegalStateException( + String.format("Didn't find closing reservation[%d]", System.identityHashCode(this))); + } + + historicalLog.recordEvent("closed"); + } + } + + if (!used) { + releaseReservation(nBytes); + } + + closed = true; + } + + public boolean reserve(int nBytes) { + assertOpen(); + + final AllocationOutcome outcome = BaseAllocator.this.allocateBytes(nBytes); + + if (DEBUG) { + historicalLog.recordEvent("reserve(%d) => %s", nBytes, Boolean.toString(outcome.isOk())); + } + + return outcome.isOk(); + } + + /** + * Allocate the a buffer of the requested size. + * + *

+ * The implementation of the allocator's inner class provides this. + * + * @param nBytes + * the size of the buffer requested + * @return the buffer, or null, if the request cannot be satisfied + */ + private ArrowBuf allocate(int nBytes) { + assertOpen(); + + boolean success = false; + + /* + * The reservation already added the requested bytes to the allocators owned and allocated bytes via reserve(). + * This ensures that they can't go away. But when we ask for the buffer here, that will add to the allocated bytes + * as well, so we need to return the same number back to avoid double-counting them. + */ + try { + final ArrowBuf drillBuf = BaseAllocator.this.bufferWithoutReservation(nBytes, null); + + if (DEBUG) { + historicalLog.recordEvent("allocate() => %s", String.format("DrillBuf[%d]", drillBuf.getId())); + } + success = true; + return drillBuf; + } finally { + if (!success) { + releaseBytes(nBytes); + } + } + } + + /** + * Return the reservation back to the allocator without having used it. + * + * @param nBytes + * the size of the reservation + */ + private void releaseReservation(int nBytes) { + assertOpen(); + + releaseBytes(nBytes); + + if (DEBUG) { + historicalLog.recordEvent("releaseReservation(%d)", nBytes); + } + } + + } + + @Override + public AllocationReservation newReservation() { + assertOpen(); + + return new Reservation(); + } + + + @Override + public synchronized void close() { + /* + * Some owners may close more than once because of complex cleanup and shutdown + * procedures. + */ + if (isClosed) { + return; + } + + isClosed = true; + + if (DEBUG) { + synchronized(DEBUG_LOCK) { + verifyAllocator(); + + // are there outstanding child allocators? + if (!childAllocators.isEmpty()) { + for (final BaseAllocator childAllocator : childAllocators.keySet()) { + if (childAllocator.isClosed) { + logger.warn(String.format( + "Closed child allocator[%s] on parent allocator[%s]'s child list.\n%s", + childAllocator.name, name, toString())); + } + } + + throw new IllegalStateException( + String.format("Allocator[%s] closed with outstanding child allocators.\n%s", name, toString())); + } + + // are there outstanding buffers? + final int allocatedCount = childLedgers.size(); + if (allocatedCount > 0) { + throw new IllegalStateException( + String.format("Allocator[%s] closed with outstanding buffers allocated (%d).\n%s", + name, allocatedCount, toString())); + } + + if (reservations.size() != 0) { + throw new IllegalStateException( + String.format("Allocator[%s] closed with outstanding reservations (%d).\n%s", name, reservations.size(), + toString())); + } + + } + } + + // Is there unaccounted-for outstanding allocation? + final long allocated = getAllocatedMemory(); + if (allocated > 0) { + throw new IllegalStateException( + String.format("Memory was leaked by query. Memory leaked: (%d)\n%s", allocated, toString())); + } + + // we need to release our memory to our parent before we tell it we've closed. + super.close(); + + // Inform our parent allocator that we've closed + if (parentAllocator != null) { + parentAllocator.childClosed(this); + } + + if (DEBUG) { + historicalLog.recordEvent("closed"); + logger.debug(String.format( + "closed allocator[%s].", + name)); + } + + + } + + public String toString() { + final Verbosity verbosity = logger.isTraceEnabled() ? Verbosity.LOG_WITH_STACKTRACE + : Verbosity.BASIC; + final StringBuilder sb = new StringBuilder(); + print(sb, 0, verbosity); + return sb.toString(); + } + + /** + * Provide a verbose string of the current allocator state. 
Includes the state of all child allocators, along with + * historical logs of each object and including stacktraces. + * + * @return A Verbose string of current allocator state. + */ + public String toVerboseString() { + final StringBuilder sb = new StringBuilder(); + print(sb, 0, Verbosity.LOG_WITH_STACKTRACE); + return sb.toString(); + } + + private void hist(String noteFormat, Object... args) { + historicalLog.recordEvent(noteFormat, args); + } + + /** + * Rounds up the provided value to the nearest power of two. + * + * @param val + * An integer value. + * @return The closest power of two of that value. + */ + static int nextPowerOfTwo(int val) { + int highestBit = Integer.highestOneBit(val); + if (highestBit == val) { + return val; + } else { + return highestBit << 1; + } + } + + + /** + * Verifies the accounting state of the allocator. Only works for DEBUG. + * + * @throws IllegalStateException + * when any problems are found + */ + void verifyAllocator() { + final IdentityHashMap buffersSeen = new IdentityHashMap<>(); + verifyAllocator(buffersSeen); + } + + /** + * Verifies the accounting state of the allocator. Only works for DEBUG. + * + *

+ * This overload is used for recursive calls, allowing for checking that DrillBufs are unique across all allocators + * that are checked. + *

+ * + * @param buffersSeen + * a map of buffers that have already been seen when walking a tree of allocators + * @throws IllegalStateException + * when any problems are found + */ + private void verifyAllocator(final IdentityHashMap buffersSeen) { + synchronized (DEBUG_LOCK) { + + // The remaining tests can only be performed if we're in debug mode. + if (!DEBUG) { + return; + } + + final long allocated = getAllocatedMemory(); + + // verify my direct descendants + final Set childSet = childAllocators.keySet(); + for (final BaseAllocator childAllocator : childSet) { + childAllocator.verifyAllocator(buffersSeen); + } + + /* + * Verify my relationships with my descendants. + * + * The sum of direct child allocators' owned memory must be <= my allocated memory; my allocated memory also + * includes DrillBuf's directly allocated by me. + */ + long childTotal = 0; + for (final BaseAllocator childAllocator : childSet) { + childTotal += Math.max(childAllocator.getAllocatedMemory(), childAllocator.reservation); + } + if (childTotal > getAllocatedMemory()) { + historicalLog.logHistory(logger); + logger.debug("allocator[" + name + "] child event logs BEGIN"); + for (final BaseAllocator childAllocator : childSet) { + childAllocator.historicalLog.logHistory(logger); + } + logger.debug("allocator[" + name + "] child event logs END"); + throw new IllegalStateException( + "Child allocators own more memory (" + childTotal + ") than their parent (name = " + + name + " ) has allocated (" + getAllocatedMemory() + ')'); + } + + // Furthermore, the amount I've allocated should be that plus buffers I've allocated. + long bufferTotal = 0; + + final Set ledgerSet = childLedgers.keySet(); + for (final BufferLedger ledger : ledgerSet) { + if (!ledger.isOwningLedger()) { + continue; + } + + final UnsafeDirectLittleEndian udle = ledger.getUnderlying(); + /* + * Even when shared, DrillBufs are rewrapped, so we should never see the same instance twice. 
+ */ + final BaseAllocator otherOwner = buffersSeen.get(udle); + if (otherOwner != null) { + throw new IllegalStateException("This allocator's drillBuf already owned by another allocator"); + } + buffersSeen.put(udle, this); + + bufferTotal += udle.capacity(); + } + + // Preallocated space has to be accounted for + final Set reservationSet = reservations.keySet(); + long reservedTotal = 0; + for (final Reservation reservation : reservationSet) { + if (!reservation.isUsed()) { + reservedTotal += reservation.getSize(); + } + } + + if (bufferTotal + reservedTotal + childTotal != getAllocatedMemory()) { + final StringBuilder sb = new StringBuilder(); + sb.append("allocator["); + sb.append(name); + sb.append("]\nallocated: "); + sb.append(Long.toString(allocated)); + sb.append(" allocated - (bufferTotal + reservedTotal + childTotal): "); + sb.append(Long.toString(allocated - (bufferTotal + reservedTotal + childTotal))); + sb.append('\n'); + + if (bufferTotal != 0) { + sb.append("buffer total: "); + sb.append(Long.toString(bufferTotal)); + sb.append('\n'); + dumpBuffers(sb, ledgerSet); + } + + if (childTotal != 0) { + sb.append("child total: "); + sb.append(Long.toString(childTotal)); + sb.append('\n'); + + for (final BaseAllocator childAllocator : childSet) { + sb.append("child allocator["); + sb.append(childAllocator.name); + sb.append("] owned "); + sb.append(Long.toString(childAllocator.getAllocatedMemory())); + sb.append('\n'); + } + } + + if (reservedTotal != 0) { + sb.append(String.format("reserved total : %d bytes.", reservedTotal)); + for (final Reservation reservation : reservationSet) { + reservation.historicalLog.buildHistory(sb, 0, true); + sb.append('\n'); + } + } + + logger.debug(sb.toString()); + + final long allocated2 = getAllocatedMemory(); + + if (allocated2 != allocated) { + throw new IllegalStateException(String.format( + "allocator[%s]: allocated t1 (%d) + allocated t2 (%d). 
Someone released memory while in verification.", + name, allocated, allocated2)); + + } + throw new IllegalStateException(String.format( + "allocator[%s]: buffer space (%d) + prealloc space (%d) + child space (%d) != allocated (%d)", + name, bufferTotal, reservedTotal, childTotal, allocated)); + } + } + } + + void print(StringBuilder sb, int level, Verbosity verbosity) { + + indent(sb, level) + .append("Allocator(") + .append(name) + .append(") ") + .append(reservation) + .append('/') + .append(getAllocatedMemory()) + .append('/') + .append(getPeakMemoryAllocation()) + .append('/') + .append(getLimit()) + .append(" (res/actual/peak/limit)") + .append('\n'); + + if (DEBUG) { + indent(sb, level + 1).append(String.format("child allocators: %d\n", childAllocators.size())); + for (BaseAllocator child : childAllocators.keySet()) { + child.print(sb, level + 2, verbosity); + } + + indent(sb, level + 1).append(String.format("ledgers: %d\n", childLedgers.size())); + for (BufferLedger ledger : childLedgers.keySet()) { + ledger.print(sb, level + 2, verbosity); + } + + final Set reservations = this.reservations.keySet(); + indent(sb, level + 1).append(String.format("reservations: %d\n", reservations.size())); + for (final Reservation reservation : reservations) { + if (verbosity.includeHistoricalLog) { + reservation.historicalLog.buildHistory(sb, level + 3, true); + } + } + + } + + } + + private void dumpBuffers(final StringBuilder sb, final Set ledgerSet) { + for (final BufferLedger ledger : ledgerSet) { + if (!ledger.isOwningLedger()) { + continue; + } + final UnsafeDirectLittleEndian udle = ledger.getUnderlying(); + sb.append("UnsafeDirectLittleEndian[dentityHashCode == "); + sb.append(Integer.toString(System.identityHashCode(udle))); + sb.append("] size "); + sb.append(Integer.toString(udle.capacity())); + sb.append('\n'); + } + } + + + public static StringBuilder indent(StringBuilder sb, int indent) { + final char[] indentation = new char[indent * 2]; + Arrays.fill(indentation, ' '); + sb.append(indentation); + return sb; + } + + public static enum Verbosity { + BASIC(false, false), // only include basic information + LOG(true, false), // include basic + LOG_WITH_STACKTRACE(true, true) // + ; + + public final boolean includeHistoricalLog; + public final boolean includeStackTraces; + + Verbosity(boolean includeHistoricalLog, boolean includeStackTraces) { + this.includeHistoricalLog = includeHistoricalLog; + this.includeStackTraces = includeStackTraces; + } + } + + public static boolean isDebug() { + return DEBUG; + } +} diff --git a/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java b/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java new file mode 100644 index 0000000000000..4e88c734ab4be --- /dev/null +++ b/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java @@ -0,0 +1,35 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.memory; + +public class BoundsChecking { + static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(BoundsChecking.class); + + public static final boolean BOUNDS_CHECKING_ENABLED; + + static { + boolean isAssertEnabled = false; + assert isAssertEnabled = true; + BOUNDS_CHECKING_ENABLED = isAssertEnabled + || !"true".equals(System.getProperty("drill.enable_unsafe_memory_access")); + } + + private BoundsChecking() { + } + +} diff --git a/java/memory/src/main/java/org/apache/arrow/memory/BufferAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/BufferAllocator.java new file mode 100644 index 0000000000000..16a68128b704f --- /dev/null +++ b/java/memory/src/main/java/org/apache/arrow/memory/BufferAllocator.java @@ -0,0 +1,151 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.memory; + +import io.netty.buffer.ByteBufAllocator; +import io.netty.buffer.ArrowBuf; + +/** + * Wrapper class to deal with byte buffer allocation. Ensures users only use designated methods. + */ +public interface BufferAllocator extends AutoCloseable { + /** + * Allocate a new or reused buffer of the provided size. Note that the buffer may technically be larger than the + * requested size for rounding purposes. However, the buffer's capacity will be set to the configured size. + * + * @param size + * The size in bytes. + * @return a new DrillBuf, or null if the request can't be satisfied + * @throws OutOfMemoryException + * if buffer cannot be allocated + */ + public ArrowBuf buffer(int size); + + /** + * Allocate a new or reused buffer of the provided size. Note that the buffer may technically be larger than the + * requested size for rounding purposes. However, the buffer's capacity will be set to the configured size. + * + * @param size + * The size in bytes. + * @param manager + * A buffer manager to manage reallocation. + * @return a new DrillBuf, or null if the request can't be satisfied + * @throws OutOfMemoryException + * if buffer cannot be allocated + */ + public ArrowBuf buffer(int size, BufferManager manager); + + /** + * Returns the allocator this allocator falls back to when it needs more memory. + * + * @return the underlying allocator used by this allocator + */ + public ByteBufAllocator getAsByteBufAllocator(); + + /** + * Create a new child allocator. 
+ * + * @param name + * the name of the allocator. + * @param initReservation + * the initial space reservation (obtained from this allocator) + * @param maxAllocation + * maximum amount of space the new allocator can allocate + * @return the new allocator, or null if it can't be created + */ + public BufferAllocator newChildAllocator(String name, long initReservation, long maxAllocation); + + /** + * Close and release all buffers generated from this buffer pool. + * + *

When assertions are on, complains if there are any outstanding buffers; to avoid + * that, release all buffers before the allocator is closed. + */ + @Override + public void close(); + + /** + * Returns the amount of memory currently allocated from this allocator. + * + * @return the amount of memory currently allocated + */ + public long getAllocatedMemory(); + + /** + * Set the maximum amount of memory this allocator is allowed to allocate. + * + * @param newLimit + * The new Limit to apply to allocations + */ + public void setLimit(long newLimit); + + /** + * Return the current maximum limit this allocator imposes. + * + * @return Limit in number of bytes. + */ + public long getLimit(); + + /** + * Returns the peak amount of memory allocated from this allocator. + * + * @return the peak amount of memory allocated + */ + public long getPeakMemoryAllocation(); + + /** + * Create an allocation reservation. A reservation is a way of building up + * a request for a buffer whose size is not known in advance. See + * {@see AllocationReservation}. + * + * @return the newly created reservation + */ + public AllocationReservation newReservation(); + + /** + * Get a reference to the empty buffer associated with this allocator. Empty buffers are special because we don't + * worry about them leaking or managing reference counts on them since they don't actually point to any memory. + */ + public ArrowBuf getEmpty(); + + /** + * Return the name of this allocator. This is a human readable name that can help debugging. Typically provides + * coordinates about where this allocator was created + */ + public String getName(); + + /** + * Return whether or not this allocator (or one if its parents) is over its limits. In the case that an allocator is + * over its limit, all consumers of that allocator should aggressively try to addrss the overlimit situation. + */ + public boolean isOverLimit(); + + /** + * Return a verbose string describing this allocator. If in DEBUG mode, this will also include relevant stacktraces + * and historical logs for underlying objects + * + * @return A very verbose description of the allocator hierarchy. + */ + public String toVerboseString(); + + /** + * Asserts (using java assertions) that the provided allocator is currently open. If assertions are disabled, this is + * a no-op. + */ + public void assertOpen(); +} diff --git a/java/memory/src/main/java/org/apache/arrow/memory/BufferManager.java b/java/memory/src/main/java/org/apache/arrow/memory/BufferManager.java new file mode 100644 index 0000000000000..0610ff09276bf --- /dev/null +++ b/java/memory/src/main/java/org/apache/arrow/memory/BufferManager.java @@ -0,0 +1,66 @@ +/******************************************************************************* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + ******************************************************************************/ +package org.apache.arrow.memory; + +import io.netty.buffer.ArrowBuf; + +/** + * Manages a list of {@link ArrowBuf}s that can be reallocated as needed. Upon + * re-allocation the old buffer will be freed. Managing a list of these buffers + * prevents some parts of the system from needing to define a correct location + * to place the final call to free them. + * + * The current uses of these types of buffers are within the pluggable components of Drill. + * In UDFs, memory management should not be a concern. We provide access to re-allocatable + * DrillBufs to give UDF writers general purpose buffers we can account for. To prevent the need + * for UDFs to contain boilerplate to close all of the buffers they request, this list + * is tracked at a higher level and all of the buffers are freed once we are sure that + * the code depending on them is done executing (currently {@link FragmentContext} + * and {@link QueryContext}. + */ +public interface BufferManager extends AutoCloseable { + + /** + * Replace an old buffer with a new version at least of the provided size. Does not copy data. + * + * @param old + * Old Buffer that the user is no longer going to use. + * @param newSize + * Size of new replacement buffer. + * @return + */ + public ArrowBuf replace(ArrowBuf old, int newSize); + + /** + * Get a managed buffer of indeterminate size. + * + * @return A buffer. + */ + public ArrowBuf getManagedBuffer(); + + /** + * Get a managed buffer of at least a certain size. + * + * @param size + * The desired size + * @return A buffer + */ + public ArrowBuf getManagedBuffer(int size); + + public void close(); +} diff --git a/java/memory/src/main/java/org/apache/arrow/memory/ChildAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/ChildAllocator.java new file mode 100644 index 0000000000000..6f120e5328bd4 --- /dev/null +++ b/java/memory/src/main/java/org/apache/arrow/memory/ChildAllocator.java @@ -0,0 +1,53 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.memory; + + +/** + * Child allocator class. Only slightly different from the {@see RootAllocator}, + * in that these can't be created directly, but must be obtained from + * {@see BufferAllocator#newChildAllocator(AllocatorOwner, long, long, int)}. + + *

Child allocators can only be created by the root, or other children, so + * this class is package private.

+ */ +class ChildAllocator extends BaseAllocator { + /** + * Constructor. + * + * @param parentAllocator parent allocator -- the one creating this child + * @param allocatorOwner a handle to the object making the request + * @param allocationPolicy the allocation policy to use; the policy for all + * allocators must match for each invocation of a drillbit + * @param initReservation initial amount of space to reserve (obtained from the parent) + * @param maxAllocation maximum amount of space that can be obtained from this allocator; + * note this includes direct allocations (via {@see BufferAllocator#buffer(int, int)} + * et al) and requests from descendant allocators. Depending on the allocation policy in + * force, even less memory may be available + * @param flags one or more of BaseAllocator.F_* flags + */ + ChildAllocator( + BaseAllocator parentAllocator, + String name, + long initReservation, + long maxAllocation) { + super(parentAllocator, name, initReservation, maxAllocation); + } + + +} diff --git a/java/memory/src/main/java/org/apache/arrow/memory/DrillByteBufAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/DrillByteBufAllocator.java new file mode 100644 index 0000000000000..23d644841e13f --- /dev/null +++ b/java/memory/src/main/java/org/apache/arrow/memory/DrillByteBufAllocator.java @@ -0,0 +1,141 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.memory; + +import io.netty.buffer.ByteBuf; +import io.netty.buffer.ByteBufAllocator; +import io.netty.buffer.CompositeByteBuf; +import io.netty.buffer.ExpandableByteBuf; + +/** + * An implementation of ByteBufAllocator that wraps a Drill BufferAllocator. This allows the RPC layer to be accounted + * and managed using Drill's BufferAllocator infrastructure. The only thin different from a typical BufferAllocator is + * the signature and the fact that this Allocator returns ExpandableByteBufs which enable otherwise non-expandable + * DrillBufs to be expandable. 
+ */ +public class DrillByteBufAllocator implements ByteBufAllocator { + + private static final int DEFAULT_BUFFER_SIZE = 4096; + private static final int DEFAULT_MAX_COMPOSITE_COMPONENTS = 16; + + private final BufferAllocator allocator; + + public DrillByteBufAllocator(BufferAllocator allocator) { + this.allocator = allocator; + } + + @Override + public ByteBuf buffer() { + return buffer(DEFAULT_BUFFER_SIZE); + } + + @Override + public ByteBuf buffer(int initialCapacity) { + return new ExpandableByteBuf(allocator.buffer(initialCapacity), allocator); + } + + @Override + public ByteBuf buffer(int initialCapacity, int maxCapacity) { + return buffer(initialCapacity); + } + + @Override + public ByteBuf ioBuffer() { + return buffer(); + } + + @Override + public ByteBuf ioBuffer(int initialCapacity) { + return buffer(initialCapacity); + } + + @Override + public ByteBuf ioBuffer(int initialCapacity, int maxCapacity) { + return buffer(initialCapacity); + } + + @Override + public ByteBuf directBuffer() { + return buffer(); + } + + @Override + public ByteBuf directBuffer(int initialCapacity) { + return allocator.buffer(initialCapacity); + } + + @Override + public ByteBuf directBuffer(int initialCapacity, int maxCapacity) { + return buffer(initialCapacity, maxCapacity); + } + + @Override + public CompositeByteBuf compositeBuffer() { + return compositeBuffer(DEFAULT_MAX_COMPOSITE_COMPONENTS); + } + + @Override + public CompositeByteBuf compositeBuffer(int maxNumComponents) { + return new CompositeByteBuf(this, true, maxNumComponents); + } + + @Override + public CompositeByteBuf compositeDirectBuffer() { + return compositeBuffer(); + } + + @Override + public CompositeByteBuf compositeDirectBuffer(int maxNumComponents) { + return compositeBuffer(maxNumComponents); + } + + @Override + public boolean isDirectBufferPooled() { + return false; + } + + @Override + public ByteBuf heapBuffer() { + throw fail(); + } + + @Override + public ByteBuf heapBuffer(int initialCapacity) { + throw fail(); + } + + @Override + public ByteBuf heapBuffer(int initialCapacity, int maxCapacity) { + throw fail(); + } + + @Override + public CompositeByteBuf compositeHeapBuffer() { + throw fail(); + } + + @Override + public CompositeByteBuf compositeHeapBuffer(int maxNumComponents) { + throw fail(); + } + + private RuntimeException fail() { + throw new UnsupportedOperationException("Allocator doesn't support heap-based memory."); + } + +} diff --git a/java/memory/src/main/java/org/apache/arrow/memory/OutOfMemoryException.java b/java/memory/src/main/java/org/apache/arrow/memory/OutOfMemoryException.java new file mode 100644 index 0000000000000..6ba0284d8d449 --- /dev/null +++ b/java/memory/src/main/java/org/apache/arrow/memory/OutOfMemoryException.java @@ -0,0 +1,50 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.memory; + + +public class OutOfMemoryException extends RuntimeException { + private static final long serialVersionUID = -6858052345185793382L; + + static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(OutOfMemoryException.class); + + public OutOfMemoryException() { + super(); + } + + public OutOfMemoryException(String message, Throwable cause, boolean enableSuppression, boolean writableStackTrace) { + super(message, cause, enableSuppression, writableStackTrace); + } + + public OutOfMemoryException(String message, Throwable cause) { + super(message, cause); + + } + + public OutOfMemoryException(String message) { + super(message); + + } + + public OutOfMemoryException(Throwable cause) { + super(cause); + + } + + +} diff --git a/java/memory/src/main/java/org/apache/arrow/memory/README.md b/java/memory/src/main/java/org/apache/arrow/memory/README.md new file mode 100644 index 0000000000000..09e4257ed0f72 --- /dev/null +++ b/java/memory/src/main/java/org/apache/arrow/memory/README.md @@ -0,0 +1,121 @@ + +# Memory: Allocation, Accounting and Management + +The memory management package contains all the memory allocation related items that Arrow uses to manage memory. + + +## Key Components +Memory management can be broken into the following main components: + +- Memory chunk allocation and fragmentation management + - `PooledByteBufAllocatorL` - A LittleEndian clone of Netty's jemalloc implementation + - `UnsafeDirectLittleEndian` - A base level memory access interface + - `LargeBuffer` - A buffer backing implementation used when working with data larger than one Netty chunk (default to 16mb) +- Memory limits & Accounting + - `Accountant` - A nestable class of lockfree memory accountors. +- Application-level memory allocation + - `BufferAllocator` - The public interface application users should be leveraging + - `BaseAllocator` - The base implementation of memory allocation, contains the meat of our the Arrow allocator implementation + - `RootAllocator` - The root allocator. Typically only one created for a JVM + - `ChildAllocator` - A child allocator that derives from the root allocator +- Buffer ownership and transfer capabilities + - `AllocationManager` - Responsible for managing the relationship between multiple allocators and a single chunk of memory + - `BufferLedger` - Responsible for allowing maintaining the relationship between an `AllocationManager`, a `BufferAllocator` and one or more individual `ArrowBuf`s +- Memory access + - `ArrowBuf` - The facade for interacting directly with a chunk of memory. + + +## Memory Management Overview +Arrow's memory model is based on the following basic concepts: + + - Memory can be allocated up to some limit. That limit could be a real limit (OS/JVM) or a locally imposed limit. + - Allocation operates in two phases: accounting then actual allocation. Allocation could fail at either point. + - Allocation failure should be recoverable. In all cases, the Allocator infrastructure should expose memory allocation failures (OS or internal limit-based) as `OutOfMemoryException`s. + - Any allocator can reserve memory when created. This memory shall be held such that this allocator will always be able to allocate that amount of memory. + - A particular application component should work to use a local allocator to understand local memory usage and better debug memory leaks. 
+ - The same physical memory can be shared by multiple allocators and the allocator must provide an accounting paradigm for this purpose. + +## Allocator Trees + +Arrow provides a tree-based model for memory allocation. The RootAllocator is created first, then all allocators are created as children of that allocator. The RootAllocator is responsible for being the master bookeeper for memory allocations. All other allocators are created as children of this tree. Each allocator can first determine whether it has enough local memory to satisfy a particular request. If not, the allocator can ask its parent for an additional memory allocation. + +## Reserving Memory + +Arrow provides two different ways to reserve memory: + + - BufferAllocator accounting reservations: + When a new allocator (other than the `RootAllocator`) is initialized, it can set aside memory that it will keep locally for its lifetime. This is memory that will never be released back to its parent allocator until the allocator is closed. + - `AllocationReservation` via BufferAllocator.newReservation(): Allows a short-term preallocation strategy so that a particular subsystem can ensure future memory is available to support a particular request. + +## Memory Ownership, Reference Counts and Sharing +Many BufferAllocators can reference the same piece of memory at the same time. The most common situation for this is in the case of a Broadcast Join: in this situation many downstream operators in the same Arrowbit will receive the same physical memory. Each of these operators will be operating within its own Allocator context. We therefore have multiple allocators all pointing at the same physical memory. It is the AllocationManager's responsibility to ensure that in this situation, that all memory is accurately accounted for from the Root's perspective and also to ensure that the memory is correctly released once all BufferAllocators have stopped using that memory. + +For simplicity of accounting, we treat that memory as being used by one of the BufferAllocators associated with the memory. When that allocator releases its claim on that memory, the memory ownership is then moved to another BufferLedger belonging to the same AllocationManager. Note that because a ArrowBuf.release() is what actually causes memory ownership transfer to occur, we always precede with ownership transfer (even if that violates an allocator limit). It is the responsibility of the application owning a particular allocator to frequently confirm whether the allocator is over its memory limit (BufferAllocator.isOverLimit()) and if so, attempt to aggresively release memory to ameliorate the situation. + +All ArrowBufs (direct or sliced) related to a single BufferLedger/BufferAllocator combination share the same reference count and either all will be valid or all will be invalid. + +## Object Hierarchy + +There are two main ways that someone can look at the object hierarchy for Arrow's memory management scheme. The first is a memory based perspective as below: + +### Memory Perspective +
++ AllocationManager
+|
+|-- UnsignedDirectLittleEndian (One per AllocationManager)
+|
+|-+ BufferLedger 1 ==> Allocator A (owning)
+| ` - ArrowBuf 1
+|-+ BufferLedger 2 ==> Allocator B (non-owning)
+| ` - ArrowBuf 2
+|-+ BufferLedger 3 ==> Allocator C (non-owning)
+  | - ArrowBuf 3
+  | - ArrowBuf 4
+  ` - ArrowBuf 5
+
+ +In this picture, a piece of memory is owned by an allocator manager. An allocator manager is responsible for that piece of memory no matter which allocator(s) it is working with. An allocator manager will have relationships with a piece of raw memory (via its reference to UnsignedDirectLittleEndian) as well as references to each BufferAllocator it has a relationship to. + +### Allocator Perspective +
++ RootAllocator
+|-+ ChildAllocator 1
+| | - ChildAllocator 1.1
+| ` ...
+|
+|-+ ChildAllocator 2
+|-+ ChildAllocator 3
+| |
+| |-+ BufferLedger 1 ==> AllocationManager 1 (owning) ==> UDLE
+| | `- ArrowBuf 1
+| `-+ BufferLedger 2 ==> AllocationManager 2 (non-owning)==> UDLE
+| 	`- ArrowBuf 2
+|
+|-+ BufferLedger 3 ==> AllocationManager 1 (non-owning)==> UDLE
+| ` - ArrowBuf 3
+|-+ BufferLedger 4 ==> AllocationManager 2 (owning) ==> UDLE
+  | - ArrowBuf 4
+  | - ArrowBuf 5
+  ` - ArrowBuf 6
+
+ +In this picture, a RootAllocator owns three ChildAllocators. The first ChildAllocator (ChildAllocator 1) owns a subsequent ChildAllocator. ChildAllocator has two BufferLedgers/AllocationManager references. Coincidentally, each of these AllocationManager's is also associated with the RootAllocator. In this case, one of the these AllocationManagers is owned by ChildAllocator 3 (AllocationManager 1) while the other AllocationManager (AllocationManager 2) is owned/accounted for by the RootAllocator. Note that in this scenario, ArrowBuf 1 is sharing the underlying memory as ArrowBuf 3. However the subset of that memory (e.g. through slicing) might be different. Also note that ArrowBuf 2 and ArrowBuf 4, 5 and 6 are also sharing the same underlying memory. Also note that ArrowBuf 4, 5 and 6 all share the same reference count and fate. + +## Debugging Issues +The Allocator object provides a useful set of tools to better understand the status of the allocator. If in `DEBUG` mode, the allocator and supporting classes will record additional debug tracking information to better track down memory leaks and issues. To enable DEBUG mode, either enable Java assertions with `-ea` or pass the following system property to the VM when starting `-Darrow.memory.debug.allocator=true`. The BufferAllocator also provides a `BufferAllocator.toVerboseString()` which can be used in DEBUG mode to get extensive stacktrace information and events associated with various Allocator behaviors. \ No newline at end of file diff --git a/java/memory/src/main/java/org/apache/arrow/memory/RootAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/RootAllocator.java new file mode 100644 index 0000000000000..571fc37577209 --- /dev/null +++ b/java/memory/src/main/java/org/apache/arrow/memory/RootAllocator.java @@ -0,0 +1,39 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.memory; + +import com.google.common.annotations.VisibleForTesting; + +/** + * The root allocator for using direct memory inside a Drillbit. Supports creating a + * tree of descendant child allocators. + */ +public class RootAllocator extends BaseAllocator { + + public RootAllocator(final long limit) { + super(null, "ROOT", 0, limit); + } + + /** + * Verify the accounting state of the allocation system. 
+ */ + @VisibleForTesting + public void verify() { + verifyAllocator(); + } +} diff --git a/java/memory/src/main/java/org/apache/arrow/memory/package-info.java b/java/memory/src/main/java/org/apache/arrow/memory/package-info.java new file mode 100644 index 0000000000000..712af3026e29c --- /dev/null +++ b/java/memory/src/main/java/org/apache/arrow/memory/package-info.java @@ -0,0 +1,24 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +/** + * Memory Allocation, Account and Management + * + * See the README.md file in this directory for detailed information about Drill's memory allocation subsystem. + * + */ +package org.apache.arrow.memory; diff --git a/java/memory/src/main/java/org/apache/arrow/memory/util/AssertionUtil.java b/java/memory/src/main/java/org/apache/arrow/memory/util/AssertionUtil.java new file mode 100644 index 0000000000000..28d078528974e --- /dev/null +++ b/java/memory/src/main/java/org/apache/arrow/memory/util/AssertionUtil.java @@ -0,0 +1,37 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.memory.util; + +public class AssertionUtil { + static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(AssertionUtil.class); + + public static final boolean ASSERT_ENABLED; + + static{ + boolean isAssertEnabled = false; + assert isAssertEnabled = true; + ASSERT_ENABLED = isAssertEnabled; + } + + public static boolean isAssertionsEnabled(){ + return ASSERT_ENABLED; + } + + private AssertionUtil() { + } +} diff --git a/java/memory/src/main/java/org/apache/arrow/memory/util/AutoCloseableLock.java b/java/memory/src/main/java/org/apache/arrow/memory/util/AutoCloseableLock.java new file mode 100644 index 0000000000000..94e5cc5fded4f --- /dev/null +++ b/java/memory/src/main/java/org/apache/arrow/memory/util/AutoCloseableLock.java @@ -0,0 +1,43 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. 
See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.memory.util; + +import java.util.concurrent.locks.Lock; + +/** + * Simple wrapper class that allows Locks to be released via an try-with-resources block. + */ +public class AutoCloseableLock implements AutoCloseable { + + private final Lock lock; + + public AutoCloseableLock(Lock lock) { + this.lock = lock; + } + + public AutoCloseableLock open() { + lock.lock(); + return this; + } + + @Override + public void close() { + lock.unlock(); + } + +} diff --git a/java/memory/src/main/java/org/apache/arrow/memory/util/HistoricalLog.java b/java/memory/src/main/java/org/apache/arrow/memory/util/HistoricalLog.java new file mode 100644 index 0000000000000..38cb779343ab6 --- /dev/null +++ b/java/memory/src/main/java/org/apache/arrow/memory/util/HistoricalLog.java @@ -0,0 +1,185 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.memory.util; + +import java.util.Arrays; +import java.util.LinkedList; + +import org.slf4j.Logger; + +/** + * Utility class that can be used to log activity within a class + * for later logging and debugging. Supports recording events and + * recording the stack at the time they occur. + */ +public class HistoricalLog { + private static class Event { + private final String note; // the event text + private final StackTrace stackTrace; // where the event occurred + private final long time; + + public Event(final String note) { + this.note = note; + this.time = System.nanoTime(); + stackTrace = new StackTrace(); + } + } + + private final LinkedList history = new LinkedList<>(); + private final String idString; // the formatted id string + private Event firstEvent; // the first stack trace recorded + private final int limit; // the limit on the number of events kept + + /** + * Constructor. The format string will be formatted and have its arguments + * substituted at the time this is called. + * + * @param idStringFormat {@link String#format} format string that can be used + * to identify this object in a log. 
Including some kind of unique identifier + * that can be associated with the object instance is best. + * @param args for the format string, or nothing if none are required + */ + public HistoricalLog(final String idStringFormat, Object... args) { + this(Integer.MAX_VALUE, idStringFormat, args); + } + + /** + * Constructor. The format string will be formatted and have its arguments + * substituted at the time this is called. + * + *

This form supports the specification of a limit that will limit the + * number of historical entries kept (which keeps down the amount of memory + * used). With the limit, the first entry made is always kept (under the + * assumption that this is the creation site of the object, which is usually + * interesting), and then up to the limit number of entries are kept after that. + * Each time a new entry is made, the oldest that is not the first is dropped. + *

+ * + * @param limit the maximum number of historical entries that will be kept, + * not including the first entry made + * @param idStringFormat {@link String#format} format string that can be used + * to identify this object in a log. Including some kind of unique identifier + * that can be associated with the object instance is best. + * @param args for the format string, or nothing if none are required + */ + public HistoricalLog(final int limit, final String idStringFormat, Object... args) { + this.limit = limit; + this.idString = String.format(idStringFormat, args); + } + + /** + * Record an event. Automatically captures the stack trace at the time this is + * called. The format string will be formatted and have its arguments substituted + * at the time this is called. + * + * @param noteFormat {@link String#format} format string that describes the event + * @param args for the format string, or nothing if none are required + */ + public synchronized void recordEvent(final String noteFormat, Object... args) { + final String note = String.format(noteFormat, args); + final Event event = new Event(note); + if (firstEvent == null) { + firstEvent = event; + } + if (history.size() == limit) { + history.removeFirst(); + } + history.add(event); + } + + /** + * Write the history of this object to the given {@link StringBuilder}. The history + * includes the identifying string provided at construction time, and all the recorded + * events with their stack traces. + * + * @param sb {@link StringBuilder} to write to + */ + public void buildHistory(final StringBuilder sb, boolean includeStackTrace) { + buildHistory(sb, 0, includeStackTrace); + } + + /** + * Write the history of this object to the given {@link StringBuilder}. The history + * includes the identifying string provided at construction time, and all the recorded + * events with their stack traces. + * + * @param sb {@link StringBuilder} to write to + * @param additional an extra string that will be written between the identifying + * information and the history; often used for a current piece of state + */ + + /** + * + * @param sb + * @param indexLevel + * @param includeStackTrace + */ + public synchronized void buildHistory(final StringBuilder sb, int indent, boolean includeStackTrace) { + final char[] indentation = new char[indent]; + final char[] innerIndentation = new char[indent + 2]; + Arrays.fill(indentation, ' '); + Arrays.fill(innerIndentation, ' '); + + sb.append(indentation) + .append("event log for: ") + .append(idString) + .append('\n'); + + + if (firstEvent != null) { + sb.append(innerIndentation) + .append(firstEvent.time) + .append(' ') + .append(firstEvent.note) + .append('\n'); + if (includeStackTrace) { + firstEvent.stackTrace.writeToBuilder(sb, indent + 2); + } + + for(final Event event : history) { + if (event == firstEvent) { + continue; + } + sb.append(innerIndentation) + .append(" ") + .append(event.time) + .append(' ') + .append(event.note) + .append('\n'); + + if (includeStackTrace) { + event.stackTrace.writeToBuilder(sb, indent + 2); + sb.append('\n'); + } + } + } + } + + /** + * Write the history of this object to the given {@link Logger}. The history + * includes the identifying string provided at construction time, and all the recorded + * events with their stack traces. 
+ * + * @param logger {@link Logger} to write to + */ + public void logHistory(final Logger logger) { + final StringBuilder sb = new StringBuilder(); + buildHistory(sb, 0, true); + logger.debug(sb.toString()); + } +} diff --git a/java/memory/src/main/java/org/apache/arrow/memory/util/Metrics.java b/java/memory/src/main/java/org/apache/arrow/memory/util/Metrics.java new file mode 100644 index 0000000000000..5177a2478b53a --- /dev/null +++ b/java/memory/src/main/java/org/apache/arrow/memory/util/Metrics.java @@ -0,0 +1,40 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.memory.util; + +import com.codahale.metrics.MetricRegistry; + +public class Metrics { + + private Metrics() { + + } + + private static class RegistryHolder { + public static final MetricRegistry REGISTRY; + + static { + REGISTRY = new MetricRegistry(); + } + + } + + public static MetricRegistry getInstance() { + return RegistryHolder.REGISTRY; + } +} \ No newline at end of file diff --git a/java/memory/src/main/java/org/apache/arrow/memory/util/Pointer.java b/java/memory/src/main/java/org/apache/arrow/memory/util/Pointer.java new file mode 100644 index 0000000000000..58ab13b0a16ab --- /dev/null +++ b/java/memory/src/main/java/org/apache/arrow/memory/util/Pointer.java @@ -0,0 +1,28 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.memory.util; + +public class Pointer { + public T value; + + public Pointer(){} + + public Pointer(T value){ + this.value = value; + } +} diff --git a/java/memory/src/main/java/org/apache/arrow/memory/util/StackTrace.java b/java/memory/src/main/java/org/apache/arrow/memory/util/StackTrace.java new file mode 100644 index 0000000000000..638c2fb9a959e --- /dev/null +++ b/java/memory/src/main/java/org/apache/arrow/memory/util/StackTrace.java @@ -0,0 +1,70 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. 
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.memory.util;
+
+import java.util.Arrays;
+
+/**
+ * Convenient way of obtaining and manipulating stack traces for debugging.
+ */
+public class StackTrace {
+  private final StackTraceElement[] stackTraceElements;
+
+  /**
+   * Constructor. Captures the current stack trace.
+   */
+  public StackTrace() {
+    // skip the first two elements (Thread.getStackTrace() and this constructor)
+    // so that the captured trace starts at the caller
+    final StackTraceElement[] stack = Thread.currentThread().getStackTrace();
+    stackTraceElements = Arrays.copyOfRange(stack, 2, stack.length);
+  }
+
+  /**
+   * Write the stack trace to a StringBuilder.
+   * @param sb
+   *          where to write it
+   * @param indent
+   *          how many double spaces to indent each line
+   */
+  public void writeToBuilder(final StringBuilder sb, final int indent) {
+    // create the indentation string
+    final char[] indentation = new char[indent * 2];
+    Arrays.fill(indentation, ' ');
+
+    // write the stack trace in standard Java format
+    for (final StackTraceElement ste : stackTraceElements) {
+      sb.append(indentation)
+          .append("at ")
+          .append(ste.getClassName())
+          .append('.')
+          .append(ste.getMethodName())
+          .append('(')
+          .append(ste.getFileName())
+          .append(':')
+          .append(Integer.toString(ste.getLineNumber()))
+          .append(")\n");
+    }
+  }
+
+  @Override
+  public String toString() {
+    final StringBuilder sb = new StringBuilder();
+    writeToBuilder(sb, 0);
+    return sb.toString();
+  }
+}
diff --git a/java/memory/src/main/resources/drill-module.conf b/java/memory/src/main/resources/drill-module.conf
new file mode 100644
index 0000000000000..593ef8e41e76b
--- /dev/null
+++ b/java/memory/src/main/resources/drill-module.conf
@@ -0,0 +1,25 @@
+//  Licensed to the Apache Software Foundation (ASF) under one or more
+//  contributor license agreements.  See the NOTICE file distributed with
+//  this work for additional information regarding copyright ownership.
+//  The ASF licenses this file to You under the Apache License, Version 2.0
+//  (the "License"); you may not use this file except in compliance with
+//  the License.  You may obtain a copy of the License at
+//
+//  http://www.apache.org/licenses/LICENSE-2.0
+//
+//  Unless required by applicable law or agreed to in writing, software
+//  distributed under the License is distributed on an "AS IS" BASIS,
+//  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+//  See the License for the specific language governing permissions and
+//  limitations under the License.
+
+//  This file tells Drill to consider this module when class path scanning.
+//  This file can also include any supplementary configuration information.
+//  This file is in HOCON format; see https://github.com/typesafehub/config/blob/master/HOCON.md for more information.
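+//
+//  For example (illustrative only; this is standard HOCON/Typesafe Config
+//  behavior), any setting below can be overridden at JVM startup with a
+//  system property:
+//    -Ddrill.memory.debug.error_on_leak=false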
+drill: { + memory: { + debug.error_on_leak: true, + top.max: 1000000000000 + } + +} diff --git a/java/memory/src/test/java/org/apache/arrow/memory/TestAccountant.java b/java/memory/src/test/java/org/apache/arrow/memory/TestAccountant.java new file mode 100644 index 0000000000000..86bccf5064a60 --- /dev/null +++ b/java/memory/src/test/java/org/apache/arrow/memory/TestAccountant.java @@ -0,0 +1,164 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.memory; + +import static org.junit.Assert.assertEquals; + +import org.apache.arrow.memory.Accountant; +import org.apache.arrow.memory.Accountant.AllocationOutcome; +import org.junit.Assert; +import org.junit.Test; + +public class TestAccountant { + + @Test + public void basic() { + ensureAccurateReservations(null); + } + + @Test + public void nested() { + final Accountant parent = new Accountant(null, 0, Long.MAX_VALUE); + ensureAccurateReservations(parent); + assertEquals(0, parent.getAllocatedMemory()); + } + + @Test + public void multiThread() throws InterruptedException { + final Accountant parent = new Accountant(null, 0, Long.MAX_VALUE); + + final int numberOfThreads = 32; + final int loops = 100; + Thread[] threads = new Thread[numberOfThreads]; + + for (int i = 0; i < numberOfThreads; i++) { + Thread t = new Thread() { + + @Override + public void run() { + try { + for (int i = 0; i < loops; i++) { + ensureAccurateReservations(parent); + } + } catch (Exception ex) { + ex.printStackTrace(); + Assert.fail(ex.getMessage()); + } + } + + }; + threads[i] = t; + t.start(); + } + + for (Thread thread : threads) { + thread.join(); + } + + assertEquals(0, parent.getAllocatedMemory()); + } + + private void ensureAccurateReservations(Accountant outsideParent) { + final Accountant parent = new Accountant(outsideParent, 0, 10); + assertEquals(0, parent.getAllocatedMemory()); + + final Accountant child = new Accountant(parent, 2, Long.MAX_VALUE); + assertEquals(2, parent.getAllocatedMemory()); + + { + AllocationOutcome first = child.allocateBytes(1); + assertEquals(AllocationOutcome.SUCCESS, first); + } + + // child will have new allocation + assertEquals(1, child.getAllocatedMemory()); + + // root has no change since within reservation + assertEquals(2, parent.getAllocatedMemory()); + + { + AllocationOutcome first = child.allocateBytes(1); + assertEquals(AllocationOutcome.SUCCESS, first); + } + + // child will have new allocation + assertEquals(2, child.getAllocatedMemory()); + + // root has no change since within reservation + assertEquals(2, parent.getAllocatedMemory()); + + child.releaseBytes(1); + + // child will have new allocation + assertEquals(1, child.getAllocatedMemory()); + + // root has no change since within reservation + assertEquals(2, 
parent.getAllocatedMemory()); + + { + AllocationOutcome first = child.allocateBytes(2); + assertEquals(AllocationOutcome.SUCCESS, first); + } + + // child will have new allocation + assertEquals(3, child.getAllocatedMemory()); + + // went beyond reservation, now in parent accountant + assertEquals(3, parent.getAllocatedMemory()); + + { + AllocationOutcome first = child.allocateBytes(7); + assertEquals(AllocationOutcome.SUCCESS, first); + } + + // child will have new allocation + assertEquals(10, child.getAllocatedMemory()); + + // went beyond reservation, now in parent accountant + assertEquals(10, parent.getAllocatedMemory()); + + child.releaseBytes(9); + + assertEquals(1, child.getAllocatedMemory()); + + // back to reservation size + assertEquals(2, parent.getAllocatedMemory()); + + AllocationOutcome first = child.allocateBytes(10); + assertEquals(AllocationOutcome.FAILED_PARENT, first); + + // unchanged + assertEquals(1, child.getAllocatedMemory()); + assertEquals(2, parent.getAllocatedMemory()); + + boolean withinLimit = child.forceAllocate(10); + assertEquals(false, withinLimit); + + // at new limit + assertEquals(child.getAllocatedMemory(), 11); + assertEquals(parent.getAllocatedMemory(), 11); + + + child.releaseBytes(11); + assertEquals(child.getAllocatedMemory(), 0); + assertEquals(parent.getAllocatedMemory(), 2); + + child.close(); + parent.close(); + } +} diff --git a/java/memory/src/test/java/org/apache/arrow/memory/TestBaseAllocator.java b/java/memory/src/test/java/org/apache/arrow/memory/TestBaseAllocator.java new file mode 100644 index 0000000000000..e13dabb9533da --- /dev/null +++ b/java/memory/src/test/java/org/apache/arrow/memory/TestBaseAllocator.java @@ -0,0 +1,648 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.memory; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertNotEquals; +import static org.junit.Assert.assertNotNull; +import static org.junit.Assert.assertTrue; +import static org.junit.Assert.fail; +import io.netty.buffer.ArrowBuf; +import io.netty.buffer.ArrowBuf.TransferResult; + +import org.apache.arrow.memory.AllocationReservation; +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.OutOfMemoryException; +import org.apache.arrow.memory.RootAllocator; +import org.junit.Ignore; +import org.junit.Test; + +public class TestBaseAllocator { + // private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(TestBaseAllocator.class); + + private final static int MAX_ALLOCATION = 8 * 1024; + +/* + // ---------------------------------------- DEBUG ----------------------------------- + + @After + public void checkBuffers() { + final int bufferCount = UnsafeDirectLittleEndian.getBufferCount(); + if (bufferCount != 0) { + UnsafeDirectLittleEndian.logBuffers(logger); + UnsafeDirectLittleEndian.releaseBuffers(); + } + + assertEquals(0, bufferCount); + } + +// @AfterClass +// public static void dumpBuffers() { +// UnsafeDirectLittleEndian.logBuffers(logger); +// } + + // ---------------------------------------- DEBUG ------------------------------------ +*/ + + + @Test + public void test_privateMax() throws Exception { + try(final RootAllocator rootAllocator = + new RootAllocator(MAX_ALLOCATION)) { + final ArrowBuf drillBuf1 = rootAllocator.buffer(MAX_ALLOCATION / 2); + assertNotNull("allocation failed", drillBuf1); + + try(final BufferAllocator childAllocator = + rootAllocator.newChildAllocator("noLimits", 0, MAX_ALLOCATION)) { + final ArrowBuf drillBuf2 = childAllocator.buffer(MAX_ALLOCATION / 2); + assertNotNull("allocation failed", drillBuf2); + drillBuf2.release(); + } + + drillBuf1.release(); + } + } + + @Test(expected=IllegalStateException.class) + public void testRootAllocator_closeWithOutstanding() throws Exception { + try { + try(final RootAllocator rootAllocator = + new RootAllocator(MAX_ALLOCATION)) { + final ArrowBuf drillBuf = rootAllocator.buffer(512); + assertNotNull("allocation failed", drillBuf); + } + } finally { + /* + * We expect there to be one unreleased underlying buffer because we're closing + * without releasing it. 
+ */ +/* + // ------------------------------- DEBUG --------------------------------- + final int bufferCount = UnsafeDirectLittleEndian.getBufferCount(); + UnsafeDirectLittleEndian.releaseBuffers(); + assertEquals(1, bufferCount); + // ------------------------------- DEBUG --------------------------------- +*/ + } + } + + @Test + public void testRootAllocator_getEmpty() throws Exception { + try(final RootAllocator rootAllocator = + new RootAllocator(MAX_ALLOCATION)) { + final ArrowBuf drillBuf = rootAllocator.buffer(0); + assertNotNull("allocation failed", drillBuf); + assertEquals("capacity was non-zero", 0, drillBuf.capacity()); + drillBuf.release(); + } + } + + @Ignore // TODO(DRILL-2740) + @Test(expected = IllegalStateException.class) + public void testAllocator_unreleasedEmpty() throws Exception { + try(final RootAllocator rootAllocator = + new RootAllocator(MAX_ALLOCATION)) { + @SuppressWarnings("unused") + final ArrowBuf drillBuf = rootAllocator.buffer(0); + } + } + + @Test + public void testAllocator_transferOwnership() throws Exception { + try(final RootAllocator rootAllocator = + new RootAllocator(MAX_ALLOCATION)) { + final BufferAllocator childAllocator1 = + rootAllocator.newChildAllocator("changeOwnership1", 0, MAX_ALLOCATION); + final BufferAllocator childAllocator2 = + rootAllocator.newChildAllocator("changeOwnership2", 0, MAX_ALLOCATION); + + final ArrowBuf drillBuf1 = childAllocator1.buffer(MAX_ALLOCATION / 4); + rootAllocator.verify(); + TransferResult transferOwnership = drillBuf1.transferOwnership(childAllocator2); + final boolean allocationFit = transferOwnership.allocationFit; + rootAllocator.verify(); + assertTrue(allocationFit); + + drillBuf1.release(); + childAllocator1.close(); + rootAllocator.verify(); + + transferOwnership.buffer.release(); + childAllocator2.close(); + } + } + + @Test + public void testAllocator_shareOwnership() throws Exception { + try (final RootAllocator rootAllocator = new RootAllocator(MAX_ALLOCATION)) { + final BufferAllocator childAllocator1 = rootAllocator.newChildAllocator("shareOwnership1", 0, MAX_ALLOCATION); + final BufferAllocator childAllocator2 = rootAllocator.newChildAllocator("shareOwnership2", 0, MAX_ALLOCATION); + final ArrowBuf drillBuf1 = childAllocator1.buffer(MAX_ALLOCATION / 4); + rootAllocator.verify(); + + // share ownership of buffer. + final ArrowBuf drillBuf2 = drillBuf1.retain(childAllocator2); + rootAllocator.verify(); + assertNotNull(drillBuf2); + assertNotEquals(drillBuf2, drillBuf1); + + // release original buffer (thus transferring ownership to allocator 2. 
(should leave allocator 1 in empty state) + drillBuf1.release(); + rootAllocator.verify(); + childAllocator1.close(); + rootAllocator.verify(); + + final BufferAllocator childAllocator3 = rootAllocator.newChildAllocator("shareOwnership3", 0, MAX_ALLOCATION); + final ArrowBuf drillBuf3 = drillBuf1.retain(childAllocator3); + assertNotNull(drillBuf3); + assertNotEquals(drillBuf3, drillBuf1); + assertNotEquals(drillBuf3, drillBuf2); + rootAllocator.verify(); + + drillBuf2.release(); + rootAllocator.verify(); + childAllocator2.close(); + rootAllocator.verify(); + + drillBuf3.release(); + rootAllocator.verify(); + childAllocator3.close(); + } + } + + @Test + public void testRootAllocator_createChildAndUse() throws Exception { + try (final RootAllocator rootAllocator = new RootAllocator(MAX_ALLOCATION)) { + try (final BufferAllocator childAllocator = rootAllocator.newChildAllocator("createChildAndUse", 0, + MAX_ALLOCATION)) { + final ArrowBuf drillBuf = childAllocator.buffer(512); + assertNotNull("allocation failed", drillBuf); + drillBuf.release(); + } + } + } + + @Test(expected=IllegalStateException.class) + public void testRootAllocator_createChildDontClose() throws Exception { + try { + try (final RootAllocator rootAllocator = new RootAllocator(MAX_ALLOCATION)) { + final BufferAllocator childAllocator = rootAllocator.newChildAllocator("createChildDontClose", 0, + MAX_ALLOCATION); + final ArrowBuf drillBuf = childAllocator.buffer(512); + assertNotNull("allocation failed", drillBuf); + } + } finally { + /* + * We expect one underlying buffer because we closed a child allocator without + * releasing the buffer allocated from it. + */ +/* + // ------------------------------- DEBUG --------------------------------- + final int bufferCount = UnsafeDirectLittleEndian.getBufferCount(); + UnsafeDirectLittleEndian.releaseBuffers(); + assertEquals(1, bufferCount); + // ------------------------------- DEBUG --------------------------------- +*/ + } + } + + private static void allocateAndFree(final BufferAllocator allocator) { + final ArrowBuf drillBuf = allocator.buffer(512); + assertNotNull("allocation failed", drillBuf); + drillBuf.release(); + + final ArrowBuf drillBuf2 = allocator.buffer(MAX_ALLOCATION); + assertNotNull("allocation failed", drillBuf2); + drillBuf2.release(); + + final int nBufs = 8; + final ArrowBuf[] drillBufs = new ArrowBuf[nBufs]; + for(int i = 0; i < drillBufs.length; ++i) { + ArrowBuf drillBufi = allocator.buffer(MAX_ALLOCATION / nBufs); + assertNotNull("allocation failed", drillBufi); + drillBufs[i] = drillBufi; + } + for(ArrowBuf drillBufi : drillBufs) { + drillBufi.release(); + } + } + + @Test + public void testAllocator_manyAllocations() throws Exception { + try(final RootAllocator rootAllocator = + new RootAllocator(MAX_ALLOCATION)) { + try(final BufferAllocator childAllocator = + rootAllocator.newChildAllocator("manyAllocations", 0, MAX_ALLOCATION)) { + allocateAndFree(childAllocator); + } + } + } + + @Test + public void testAllocator_overAllocate() throws Exception { + try(final RootAllocator rootAllocator = + new RootAllocator(MAX_ALLOCATION)) { + try(final BufferAllocator childAllocator = + rootAllocator.newChildAllocator("overAllocate", 0, MAX_ALLOCATION)) { + allocateAndFree(childAllocator); + + try { + childAllocator.buffer(MAX_ALLOCATION + 1); + fail("allocated memory beyond max allowed"); + } catch (OutOfMemoryException e) { + // expected + } + } + } + } + + @Test + public void testAllocator_overAllocateParent() throws Exception { + try(final RootAllocator 
rootAllocator = + new RootAllocator(MAX_ALLOCATION)) { + try(final BufferAllocator childAllocator = + rootAllocator.newChildAllocator("overAllocateParent", 0, MAX_ALLOCATION)) { + final ArrowBuf drillBuf1 = rootAllocator.buffer(MAX_ALLOCATION / 2); + assertNotNull("allocation failed", drillBuf1); + final ArrowBuf drillBuf2 = childAllocator.buffer(MAX_ALLOCATION / 2); + assertNotNull("allocation failed", drillBuf2); + + try { + childAllocator.buffer(MAX_ALLOCATION / 4); + fail("allocated memory beyond max allowed"); + } catch (OutOfMemoryException e) { + // expected + } + + drillBuf1.release(); + drillBuf2.release(); + } + } + } + + private static void testAllocator_sliceUpBufferAndRelease( + final RootAllocator rootAllocator, final BufferAllocator bufferAllocator) { + final ArrowBuf drillBuf1 = bufferAllocator.buffer(MAX_ALLOCATION / 2); + rootAllocator.verify(); + + final ArrowBuf drillBuf2 = drillBuf1.slice(16, drillBuf1.capacity() - 32); + rootAllocator.verify(); + final ArrowBuf drillBuf3 = drillBuf2.slice(16, drillBuf2.capacity() - 32); + rootAllocator.verify(); + @SuppressWarnings("unused") + final ArrowBuf drillBuf4 = drillBuf3.slice(16, drillBuf3.capacity() - 32); + rootAllocator.verify(); + + drillBuf3.release(); // since they share refcounts, one is enough to release them all + rootAllocator.verify(); + } + + @Test + public void testAllocator_createSlices() throws Exception { + try (final RootAllocator rootAllocator = new RootAllocator(MAX_ALLOCATION)) { + testAllocator_sliceUpBufferAndRelease(rootAllocator, rootAllocator); + + try (final BufferAllocator childAllocator = rootAllocator.newChildAllocator("createSlices", 0, MAX_ALLOCATION)) { + testAllocator_sliceUpBufferAndRelease(rootAllocator, childAllocator); + } + rootAllocator.verify(); + + testAllocator_sliceUpBufferAndRelease(rootAllocator, rootAllocator); + + try (final BufferAllocator childAllocator = rootAllocator.newChildAllocator("createSlices", 0, MAX_ALLOCATION)) { + try (final BufferAllocator childAllocator2 = + childAllocator.newChildAllocator("createSlices", 0, MAX_ALLOCATION)) { + final ArrowBuf drillBuf1 = childAllocator2.buffer(MAX_ALLOCATION / 8); + @SuppressWarnings("unused") + final ArrowBuf drillBuf2 = drillBuf1.slice(MAX_ALLOCATION / 16, MAX_ALLOCATION / 16); + testAllocator_sliceUpBufferAndRelease(rootAllocator, childAllocator); + drillBuf1.release(); + rootAllocator.verify(); + } + rootAllocator.verify(); + + testAllocator_sliceUpBufferAndRelease(rootAllocator, childAllocator); + } + rootAllocator.verify(); + } + } + + @Test + public void testAllocator_sliceRanges() throws Exception { +// final AllocatorOwner allocatorOwner = new NamedOwner("sliceRanges"); + try(final RootAllocator rootAllocator = + new RootAllocator(MAX_ALLOCATION)) { + // Populate a buffer with byte values corresponding to their indices. 
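+      // (Note: a freshly allocated buffer starts with readerIndex == writerIndex == 0;
+      // readableBytes() is writerIndex - readerIndex and writableBytes() is
+      // capacity() - writerIndex, which is exactly what the assertions below check.)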
+ final ArrowBuf drillBuf = rootAllocator.buffer(256); + assertEquals(256, drillBuf.capacity()); + assertEquals(0, drillBuf.readerIndex()); + assertEquals(0, drillBuf.readableBytes()); + assertEquals(0, drillBuf.writerIndex()); + assertEquals(256, drillBuf.writableBytes()); + + final ArrowBuf slice3 = (ArrowBuf) drillBuf.slice(); + assertEquals(0, slice3.readerIndex()); + assertEquals(0, slice3.readableBytes()); + assertEquals(0, slice3.writerIndex()); +// assertEquals(256, slice3.capacity()); +// assertEquals(256, slice3.writableBytes()); + + for(int i = 0; i < 256; ++i) { + drillBuf.writeByte(i); + } + assertEquals(0, drillBuf.readerIndex()); + assertEquals(256, drillBuf.readableBytes()); + assertEquals(256, drillBuf.writerIndex()); + assertEquals(0, drillBuf.writableBytes()); + + final ArrowBuf slice1 = (ArrowBuf) drillBuf.slice(); + assertEquals(0, slice1.readerIndex()); + assertEquals(256, slice1.readableBytes()); + for(int i = 0; i < 10; ++i) { + assertEquals(i, slice1.readByte()); + } + assertEquals(256 - 10, slice1.readableBytes()); + for(int i = 0; i < 256; ++i) { + assertEquals((byte) i, slice1.getByte(i)); + } + + final ArrowBuf slice2 = (ArrowBuf) drillBuf.slice(25, 25); + assertEquals(0, slice2.readerIndex()); + assertEquals(25, slice2.readableBytes()); + for(int i = 25; i < 50; ++i) { + assertEquals(i, slice2.readByte()); + } + +/* + for(int i = 256; i > 0; --i) { + slice3.writeByte(i - 1); + } + for(int i = 0; i < 256; ++i) { + assertEquals(255 - i, slice1.getByte(i)); + } +*/ + + drillBuf.release(); // all the derived buffers share this fate + } + } + + @Test + public void testAllocator_slicesOfSlices() throws Exception { +// final AllocatorOwner allocatorOwner = new NamedOwner("slicesOfSlices"); + try(final RootAllocator rootAllocator = + new RootAllocator(MAX_ALLOCATION)) { + // Populate a buffer with byte values corresponding to their indices. + final ArrowBuf drillBuf = rootAllocator.buffer(256); + for(int i = 0; i < 256; ++i) { + drillBuf.writeByte(i); + } + + // Slice it up. 
+ final ArrowBuf slice0 = drillBuf.slice(0, drillBuf.capacity()); + for(int i = 0; i < 256; ++i) { + assertEquals((byte) i, drillBuf.getByte(i)); + } + + final ArrowBuf slice10 = slice0.slice(10, drillBuf.capacity() - 10); + for(int i = 10; i < 256; ++i) { + assertEquals((byte) i, slice10.getByte(i - 10)); + } + + final ArrowBuf slice20 = slice10.slice(10, drillBuf.capacity() - 20); + for(int i = 20; i < 256; ++i) { + assertEquals((byte) i, slice20.getByte(i - 20)); + } + + final ArrowBuf slice30 = slice20.slice(10, drillBuf.capacity() - 30); + for(int i = 30; i < 256; ++i) { + assertEquals((byte) i, slice30.getByte(i - 30)); + } + + drillBuf.release(); + } + } + + @Test + public void testAllocator_transferSliced() throws Exception { + try (final RootAllocator rootAllocator = new RootAllocator(MAX_ALLOCATION)) { + final BufferAllocator childAllocator1 = rootAllocator.newChildAllocator("transferSliced1", 0, MAX_ALLOCATION); + final BufferAllocator childAllocator2 = rootAllocator.newChildAllocator("transferSliced2", 0, MAX_ALLOCATION); + + final ArrowBuf drillBuf1 = childAllocator1.buffer(MAX_ALLOCATION / 8); + final ArrowBuf drillBuf2 = childAllocator2.buffer(MAX_ALLOCATION / 8); + + final ArrowBuf drillBuf1s = drillBuf1.slice(0, drillBuf1.capacity() / 2); + final ArrowBuf drillBuf2s = drillBuf2.slice(0, drillBuf2.capacity() / 2); + + rootAllocator.verify(); + + TransferResult result1 = drillBuf2s.transferOwnership(childAllocator1); + rootAllocator.verify(); + TransferResult result2 = drillBuf1s.transferOwnership(childAllocator2); + rootAllocator.verify(); + + result1.buffer.release(); + result2.buffer.release(); + + drillBuf1s.release(); // releases drillBuf1 + drillBuf2s.release(); // releases drillBuf2 + + childAllocator1.close(); + childAllocator2.close(); + } + } + + @Test + public void testAllocator_shareSliced() throws Exception { + try (final RootAllocator rootAllocator = new RootAllocator(MAX_ALLOCATION)) { + final BufferAllocator childAllocator1 = rootAllocator.newChildAllocator("transferSliced", 0, MAX_ALLOCATION); + final BufferAllocator childAllocator2 = rootAllocator.newChildAllocator("transferSliced", 0, MAX_ALLOCATION); + + final ArrowBuf drillBuf1 = childAllocator1.buffer(MAX_ALLOCATION / 8); + final ArrowBuf drillBuf2 = childAllocator2.buffer(MAX_ALLOCATION / 8); + + final ArrowBuf drillBuf1s = drillBuf1.slice(0, drillBuf1.capacity() / 2); + final ArrowBuf drillBuf2s = drillBuf2.slice(0, drillBuf2.capacity() / 2); + + rootAllocator.verify(); + + final ArrowBuf drillBuf2s1 = drillBuf2s.retain(childAllocator1); + final ArrowBuf drillBuf1s2 = drillBuf1s.retain(childAllocator2); + rootAllocator.verify(); + + drillBuf1s.release(); // releases drillBuf1 + drillBuf2s.release(); // releases drillBuf2 + rootAllocator.verify(); + + drillBuf2s1.release(); // releases the shared drillBuf2 slice + drillBuf1s2.release(); // releases the shared drillBuf1 slice + + childAllocator1.close(); + childAllocator2.close(); + } + } + + @Test + public void testAllocator_transferShared() throws Exception { + try (final RootAllocator rootAllocator = new RootAllocator(MAX_ALLOCATION)) { + final BufferAllocator childAllocator1 = rootAllocator.newChildAllocator("transferShared1", 0, MAX_ALLOCATION); + final BufferAllocator childAllocator2 = rootAllocator.newChildAllocator("transferShared2", 0, MAX_ALLOCATION); + final BufferAllocator childAllocator3 = rootAllocator.newChildAllocator("transferShared3", 0, MAX_ALLOCATION); + + final ArrowBuf drillBuf1 = childAllocator1.buffer(MAX_ALLOCATION / 8); + + 
boolean allocationFit;
+
+      ArrowBuf drillBuf2 = drillBuf1.retain(childAllocator2);
+      rootAllocator.verify();
+      assertNotNull(drillBuf2);
+      assertNotEquals(drillBuf2, drillBuf1);
+
+      TransferResult result = drillBuf1.transferOwnership(childAllocator3);
+      allocationFit = result.allocationFit;
+      final ArrowBuf drillBuf3 = result.buffer;
+      assertTrue(allocationFit);
+      rootAllocator.verify();
+
+      // Since childAllocator3 now owns childAllocator1's buffer, childAllocator1 can close.
+      drillBuf1.release();
+      childAllocator1.close();
+      rootAllocator.verify();
+
+      drillBuf2.release();
+      childAllocator2.close();
+      rootAllocator.verify();
+
+      final BufferAllocator childAllocator4 = rootAllocator.newChildAllocator("transferShared4", 0, MAX_ALLOCATION);
+      TransferResult result2 = drillBuf3.transferOwnership(childAllocator4);
+      allocationFit = result2.allocationFit;
+      final ArrowBuf drillBuf4 = result2.buffer;
+      assertTrue(allocationFit);
+      rootAllocator.verify();
+
+      drillBuf3.release();
+      childAllocator3.close();
+      rootAllocator.verify();
+
+      drillBuf4.release();
+      childAllocator4.close();
+      rootAllocator.verify();
+    }
+  }
+
+  @Test
+  public void testAllocator_unclaimedReservation() throws Exception {
+    try (final RootAllocator rootAllocator = new RootAllocator(MAX_ALLOCATION)) {
+      try (final BufferAllocator childAllocator1 =
+          rootAllocator.newChildAllocator("unclaimedReservation", 0, MAX_ALLOCATION)) {
+        try (final AllocationReservation reservation = childAllocator1.newReservation()) {
+          assertTrue(reservation.add(64));
+        }
+        rootAllocator.verify();
+      }
+    }
+  }
+
+  @Test
+  public void testAllocator_claimedReservation() throws Exception {
+    try (final RootAllocator rootAllocator = new RootAllocator(MAX_ALLOCATION)) {
+
+      try (final BufferAllocator childAllocator1 = rootAllocator.newChildAllocator("claimedReservation", 0,
+          MAX_ALLOCATION)) {
+
+        try (final AllocationReservation reservation = childAllocator1.newReservation()) {
+          assertTrue(reservation.add(32));
+          assertTrue(reservation.add(32));
+
+          final ArrowBuf drillBuf = reservation.allocateBuffer();
+          assertEquals(64, drillBuf.capacity());
+          rootAllocator.verify();
+
+          drillBuf.release();
+          rootAllocator.verify();
+        }
+        rootAllocator.verify();
+      }
+    }
+  }
+
+  @Test
+  public void multiple() throws Exception {
+    final String owner = "test";
+    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE)) {
+
+      final int op = 100000;
+
+      BufferAllocator frag1 = allocator.newChildAllocator(owner, 1500000, Long.MAX_VALUE);
+      BufferAllocator frag2 = allocator.newChildAllocator(owner, 500000, Long.MAX_VALUE);
+
+      allocator.verify();
+
+      BufferAllocator allocator11 = frag1.newChildAllocator(owner, op, Long.MAX_VALUE);
+      ArrowBuf b11 = allocator11.buffer(1000000);
+
+      allocator.verify();
+
+      BufferAllocator allocator12 = frag1.newChildAllocator(owner, op, Long.MAX_VALUE);
+      ArrowBuf b12 = allocator12.buffer(500000);
+
+      allocator.verify();
+
+      BufferAllocator allocator21 = frag1.newChildAllocator(owner, op, Long.MAX_VALUE);
+
+      allocator.verify();
+
+      BufferAllocator allocator22 = frag2.newChildAllocator(owner, op, Long.MAX_VALUE);
+      ArrowBuf b22 = allocator22.buffer(2000000);
+
+      allocator.verify();
+
+      BufferAllocator frag3 = allocator.newChildAllocator(owner, 1000000, Long.MAX_VALUE);
+
+      allocator.verify();
+
+      BufferAllocator allocator31 = frag3.newChildAllocator(owner, op, Long.MAX_VALUE);
+      ArrowBuf b31a = allocator31.buffer(200000);
+
+      allocator.verify();
+
+      // Previously running operator completes
+      b22.release();
+
+      allocator.verify();
+
+      allocator22.close();
+
+      b31a.release();
+      allocator31.close();
+
+      b12.release();
+      allocator12.close();
+
+      allocator21.close();
+
+      b11.release();
+      allocator11.close();
+
+      frag1.close();
+      frag2.close();
+      frag3.close();
+
+    }
+  }
+}
diff --git a/java/memory/src/test/java/org/apache/arrow/memory/TestEndianess.java b/java/memory/src/test/java/org/apache/arrow/memory/TestEndianess.java
new file mode 100644
index 0000000000000..25357dc7b07ef
--- /dev/null
+++ b/java/memory/src/test/java/org/apache/arrow/memory/TestEndianess.java
@@ -0,0 +1,43 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.memory;
+
+import static org.junit.Assert.assertEquals;
+import io.netty.buffer.ByteBuf;
+
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+import org.junit.Test;
+
+
+public class TestEndianess {
+
+  @Test
+  public void testLittleEndian() {
+    final BufferAllocator a = new RootAllocator(10000);
+    final ByteBuf b = a.buffer(4);
+    b.setInt(0, 35);
+    assertEquals(35, b.getByte(0));
+    assertEquals(0, b.getByte(1));
+    assertEquals(0, b.getByte(2));
+    assertEquals(0, b.getByte(3));
+    b.release();
+    a.close();
+  }
+
+}
diff --git a/java/pom.xml b/java/pom.xml
new file mode 100644
index 0000000000000..8a3b192e13e40
--- /dev/null
+++ b/java/pom.xml
@@ -0,0 +1,470 @@
+
+
+  4.0.0
+
+  
+    org.apache
+    apache
+    14
+  
+
+  org.apache.arrow
+  arrow-java-root
+  0.1-SNAPSHOT
+  pom
+
+  Apache Arrow Java Root POM
+  Apache Arrow is an open source, columnar, in-memory data layer designed for fast processing and interchange.
+ http://arrow.apache.org/ + + + ${project.basedir}/target/generated-sources + 4.11 + 1.7.6 + 18.0 + 2 + 2.7.1 + 2.7.1 + 0.9.15 + 2.3.21 + + + + scm:git:https://git-wip-us.apache.org/repos/asf/arrow.git + scm:git:https://git-wip-us.apache.org/repos/asf/arrow.git + https://github.com/apache/arrow + HEAD + + + + + Developer List + dev-subscribe@arrow.apache.org + dev-unsubscribe@arrow.apache.org + dev@arrow.apache.org + http://mail-archives.apache.org/mod_mbox/arrow-dev/ + + + Commits List + commits-subscribe@arrow.apache.org + commits-unsubscribe@arrow.apache.org + commits@arrow.apache.org + http://mail-archives.apache.org/mod_mbox/arrow-commits/ + + + Issues List + issues-subscribe@arrow.apache.org + issues-unsubscribe@arrow.apache.org + http://mail-archives.apache.org/mod_mbox/arrow-issues/ + + + + + + + + + Jira + https://issues.apache.org/jira/browse/arrow + + + + + + + org.apache.rat + apache-rat-plugin + + + rat-checks + validate + + check + + + + + false + + **/*.log + **/*.css + **/*.js + **/*.md + **/*.eps + **/*.json + **/*.seq + **/*.parquet + **/*.sql + **/git.properties + **/*.csv + **/*.csvh + **/*.csvh-test + **/*.tsv + **/*.txt + **/*.ssv + **/arrow-*.conf + **/.buildpath + **/*.proto + **/*.fmpp + **/target/** + **/*.iml + **/*.tdd + **/*.project + **/TAGS + **/*.checkstyle + **/.classpath + **/.settings/** + .*/** + **/*.patch + **/*.pb.cc + **/*.pb.h + **/*.linux + **/client/build/** + **/*.tbl + + + + + + org.apache.maven.plugins + maven-jar-plugin + + + **/logging.properties + **/logback-test.xml + **/logback.out.xml + **/logback.xml + + + true + + true + true + + + org.apache.arrow + ${username} + http://arrow.apache.org/ + + + + + + + test-jar + + + true + + + + + + + + org.apache.maven.plugins + maven-resources-plugin + + UTF-8 + + + + org.apache.maven.plugins + maven-compiler-plugin + + 1.7 + 1.7 + 2048m + false + true + + + + maven-enforcer-plugin + + + validate_java_and_maven_version + verify + + enforce + + false + + + + [3.0.4,4) + + + + + + avoid_bad_dependencies + verify + + enforce + + + + + + commons-logging + javax.servlet:servlet-api + org.mortbay.jetty:servlet-api + org.mortbay.jetty:servlet-api-2.5 + log4j:log4j + + + + + + + + + pl.project13.maven + git-commit-id-plugin + 2.1.9 + + + for-jars + true + + revision + + + target/classes/git.properties + + + + for-source-tarball + + revision + + false + + ./git.properties + + + + + + dd.MM.yyyy '@' HH:mm:ss z + true + false + true + false + + false + false + 7 + -dirty + true + + + + + + + + + org.apache.rat + apache-rat-plugin + 0.11 + + + org.apache.maven.plugins + maven-resources-plugin + 2.6 + + + org.apache.maven.plugins + maven-compiler-plugin + 3.2 + + + maven-enforcer-plugin + 1.3.1 + + + maven-surefire-plugin + 2.17 + + -ea + ${forkCount} + true + + ${project.build.directory} + + + + + org.apache.maven.plugins + maven-release-plugin + 2.5.2 + + false + false + deploy + -Papache-release ${arguments} + + + + + + org.eclipse.m2e + lifecycle-mapping + 1.0.0 + + + + + + org.apache.maven.plugins + maven-antrun-plugin + [1.6,) + + run + + + + + + + + + org.apache.maven.plugins + maven-enforcer-plugin + [1.2,) + + enforce + + + + + + + + + org.apache.maven.plugins + + maven-remote-resources-plugin + + [1.1,) + + process + + + + + + + + + org.apache.rat + apache-rat-plugin + [0.10,) + + check + + + + + + + + + + + + + + + + + io.netty + netty-handler + 4.0.27.Final + + + + com.google.guava + guava + ${dep.guava.version} + + + + org.slf4j + slf4j-api + ${dep.slf4j.version} + + + + org.slf4j + jul-to-slf4j + 
${dep.slf4j.version} + + + + org.slf4j + jcl-over-slf4j + ${dep.slf4j.version} + + + + org.slf4j + log4j-over-slf4j + ${dep.slf4j.version} + + + + + + com.googlecode.jmockit + jmockit + 1.3 + test + + + junit + junit + ${dep.junit.version} + test + + + + org.mockito + mockito-core + 1.9.5 + + + ch.qos.logback + logback-classic + 1.0.13 + test + + + de.huxhorn.lilith + de.huxhorn.lilith.logback.appender.multiplex-classic + 0.9.44 + test + + + + + + memory + vector + + diff --git a/java/vector/pom.xml b/java/vector/pom.xml new file mode 100644 index 0000000000000..e693344221b9a --- /dev/null +++ b/java/vector/pom.xml @@ -0,0 +1,165 @@ + + + + 4.0.0 + + org.apache.arrow + arrow-java-root + 0.1-SNAPSHOT + + vector + vectors + + + + + org.apache.arrow + arrow-memory + ${project.version} + + + joda-time + joda-time + 2.9 + + + com.fasterxml.jackson.core + jackson-annotations + 2.7.1 + + + com.fasterxml.jackson.core + jackson-databind + 2.7.1 + + + com.carrotsearch + hppc + 0.7.1 + + + org.apache.commons + commons-lang3 + 3.4 + + + + + + + + apache + apache + https://repo.maven.apache.org/ + + true + + + false + + + + + + + + + + ${basedir}/src/main/codegen + codegen + + + + + + maven-resources-plugin + + + copy-fmpp-resources + initialize + + copy-resources + + + ${project.build.directory}/codegen + + + src/main/codegen + false + + + + + + + + org.apache.drill.tools + drill-fmpp-maven-plugin + 1.4.0 + + + generate-fmpp + generate-sources + + generate + + + src/main/codegen/config.fmpp + ${project.build.directory}/generated-sources + ${project.build.directory}/codegen/templates + + + + + + + + + + org.eclipse.m2e + lifecycle-mapping + 1.0.0 + + + + + + org.apache.drill.tools + drill-fmpp-maven-plugin + [1.0,) + + generate + + + + + false + true + + + + + + + + + + + + + + + + diff --git a/java/vector/src/main/codegen/config.fmpp b/java/vector/src/main/codegen/config.fmpp new file mode 100644 index 0000000000000..663677cbb5a76 --- /dev/null +++ b/java/vector/src/main/codegen/config.fmpp @@ -0,0 +1,24 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http:# www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +data: { + # TODO: Rename to ~valueVectorModesAndTypes for clarity. + vv: tdd(../data/ValueVectorTypes.tdd), + +} +freemarkerLinks: { + includes: includes/ +} diff --git a/java/vector/src/main/codegen/data/ValueVectorTypes.tdd b/java/vector/src/main/codegen/data/ValueVectorTypes.tdd new file mode 100644 index 0000000000000..e747c30c5d1cb --- /dev/null +++ b/java/vector/src/main/codegen/data/ValueVectorTypes.tdd @@ -0,0 +1,168 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http:# www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +{ + modes: [ + {name: "Optional", prefix: "Nullable"}, + {name: "Required", prefix: ""}, + {name: "Repeated", prefix: "Repeated"} + ], + types: [ + { + major: "Fixed", + width: 1, + javaType: "byte", + boxedType: "Byte", + fields: [{name: "value", type: "byte"}], + minor: [ + { class: "TinyInt", valueHolder: "IntHolder" }, + { class: "UInt1", valueHolder: "UInt1Holder" } + ] + }, + { + major: "Fixed", + width: 2, + javaType: "char", + boxedType: "Character", + fields: [{name: "value", type: "char"}], + minor: [ + { class: "UInt2", valueHolder: "UInt2Holder"} + ] + }, { + major: "Fixed", + width: 2, + javaType: "short", + boxedType: "Short", + fields: [{name: "value", type: "short"}], + minor: [ + { class: "SmallInt", valueHolder: "Int2Holder"}, + ] + }, + { + major: "Fixed", + width: 4, + javaType: "int", + boxedType: "Integer", + fields: [{name: "value", type: "int"}], + minor: [ + { class: "Int", valueHolder: "IntHolder"}, + { class: "UInt4", valueHolder: "UInt4Holder" }, + { class: "Float4", javaType: "float" , boxedType: "Float", fields: [{name: "value", type: "float"}]}, + { class: "Time", javaType: "int", friendlyType: "DateTime" }, + { class: "IntervalYear", javaType: "int", friendlyType: "Period" } + { class: "Decimal9", maxPrecisionDigits: 9, friendlyType: "BigDecimal", fields: [{name:"value", type:"int"}, {name: "scale", type: "int", include: false}, {name: "precision", type: "int", include: false}] }, + ] + }, + { + major: "Fixed", + width: 8, + javaType: "long", + boxedType: "Long", + fields: [{name: "value", type: "long"}], + minor: [ + { class: "BigInt"}, + { class: "UInt8" }, + { class: "Float8", javaType: "double" , boxedType: "Double", fields: [{name: "value", type: "double"}], }, + { class: "Date", javaType: "long", friendlyType: "DateTime" }, + { class: "TimeStamp", javaType: "long", friendlyType: "DateTime" } + { class: "Decimal18", maxPrecisionDigits: 18, friendlyType: "BigDecimal", fields: [{name:"value", type:"long"}, {name: "scale", type: "int", include: false}, {name: "precision", type: "int", include: false}] }, + <#-- + { class: "Money", maxPrecisionDigits: 2, scale: 1, }, + --> + ] + }, + { + major: "Fixed", + width: 12, + javaType: "ArrowBuf", + boxedType: "ArrowBuf", + minor: [ + { class: "IntervalDay", millisecondsOffset: 4, friendlyType: "Period", fields: [ {name: "days", type:"int"}, {name: "milliseconds", type:"int"}] } + ] + }, + { + major: "Fixed", + width: 16, + javaType: "ArrowBuf" + boxedType: "ArrowBuf", + minor: [ + { class: "Interval", daysOffset: 4, millisecondsOffset: 8, friendlyType: "Period", fields: [ {name: "months", type: "int"}, {name: "days", type:"int"}, {name: "milliseconds", type:"int"}] } + ] + }, + { + major: "Fixed", + width: 12, + javaType: "ArrowBuf", + boxedType: "ArrowBuf", + minor: [ + <#-- + { class: "TimeTZ" }, + { class: "Interval" } + --> + { class: "Decimal28Dense", maxPrecisionDigits: 28, nDecimalDigits: 3, friendlyType: "BigDecimal", fields: [{name: 
"start", type: "int"}, {name: "buffer", type: "ArrowBuf"}, {name: "scale", type: "int", include: false}, {name: "precision", type: "int", include: false}] } + ] + }, + { + major: "Fixed", + width: 16, + javaType: "ArrowBuf", + boxedType: "ArrowBuf", + + minor: [ + { class: "Decimal38Dense", maxPrecisionDigits: 38, nDecimalDigits: 4, friendlyType: "BigDecimal", fields: [{name: "start", type: "int"}, {name: "buffer", type: "ArrowBuf"}, {name: "scale", type: "int", include: false}, {name: "precision", type: "int", include: false}] } + ] + }, + { + major: "Fixed", + width: 24, + javaType: "ArrowBuf", + boxedType: "ArrowBuf", + minor: [ + { class: "Decimal38Sparse", maxPrecisionDigits: 38, nDecimalDigits: 6, friendlyType: "BigDecimal", fields: [{name: "start", type: "int"}, {name: "buffer", type: "ArrowBuf"}, {name: "scale", type: "int", include: false}, {name: "precision", type: "int", include: false}] } + ] + }, + { + major: "Fixed", + width: 20, + javaType: "ArrowBuf", + boxedType: "ArrowBuf", + minor: [ + { class: "Decimal28Sparse", maxPrecisionDigits: 28, nDecimalDigits: 5, friendlyType: "BigDecimal", fields: [{name: "start", type: "int"}, {name: "buffer", type: "ArrowBuf"}, {name: "scale", type: "int", include: false}, {name: "precision", type: "int", include: false}] } + ] + }, + { + major: "VarLen", + width: 4, + javaType: "int", + boxedType: "ArrowBuf", + fields: [{name: "start", type: "int"}, {name: "end", type: "int"}, {name: "buffer", type: "ArrowBuf"}], + minor: [ + { class: "VarBinary" , friendlyType: "byte[]" }, + { class: "VarChar" , friendlyType: "Text" }, + { class: "Var16Char" , friendlyType: "String" } + ] + }, + { + major: "Bit", + width: 1, + javaType: "int", + boxedType: "Integer", + minor: [ + { class: "Bit" , friendlyType: "Boolean", fields: [{name: "value", type: "int"}] } + ] + } + ] +} diff --git a/java/vector/src/main/codegen/includes/license.ftl b/java/vector/src/main/codegen/includes/license.ftl new file mode 100644 index 0000000000000..0455fd87ddcb5 --- /dev/null +++ b/java/vector/src/main/codegen/includes/license.ftl @@ -0,0 +1,18 @@ +/******************************************************************************* + + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + ******************************************************************************/ \ No newline at end of file diff --git a/java/vector/src/main/codegen/includes/vv_imports.ftl b/java/vector/src/main/codegen/includes/vv_imports.ftl new file mode 100644 index 0000000000000..2d808b1b3cb3f --- /dev/null +++ b/java/vector/src/main/codegen/includes/vv_imports.ftl @@ -0,0 +1,62 @@ +<#-- Licensed to the Apache Software Foundation (ASF) under one or more contributor + license agreements. 
See the NOTICE file distributed with this work for additional + information regarding copyright ownership. The ASF licenses this file to + You under the Apache License, Version 2.0 (the "License"); you may not use + this file except in compliance with the License. You may obtain a copy of + the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required + by applicable law or agreed to in writing, software distributed under the + License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS + OF ANY KIND, either express or implied. See the License for the specific + language governing permissions and limitations under the License. --> + +import static com.google.common.base.Preconditions.checkArgument; +import static com.google.common.base.Preconditions.checkState; + +import com.google.common.collect.Lists; +import com.google.common.collect.ObjectArrays; +import com.google.common.base.Charsets; +import com.google.common.collect.ObjectArrays; + +import com.google.common.base.Preconditions; +import io.netty.buffer.*; + +import org.apache.commons.lang3.ArrayUtils; + +import org.apache.arrow.memory.*; +import org.apache.arrow.vector.types.Types; +import org.apache.arrow.vector.types.Types.*; +import org.apache.arrow.vector.types.*; +import org.apache.arrow.vector.*; +import org.apache.arrow.vector.holders.*; +import org.apache.arrow.vector.util.*; +import org.apache.arrow.vector.complex.*; +import org.apache.arrow.vector.complex.reader.*; +import org.apache.arrow.vector.complex.impl.*; +import org.apache.arrow.vector.complex.writer.*; +import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; +import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter; +import org.apache.arrow.vector.util.JsonStringArrayList; + +import java.util.Arrays; +import java.util.Random; +import java.util.List; + +import java.io.Closeable; +import java.io.InputStream; +import java.io.InputStreamReader; +import java.nio.ByteBuffer; + +import java.sql.Date; +import java.sql.Time; +import java.sql.Timestamp; +import java.math.BigDecimal; +import java.math.BigInteger; + +import org.joda.time.DateTime; +import org.joda.time.Period; + + + + + + diff --git a/java/vector/src/main/codegen/templates/AbstractFieldReader.java b/java/vector/src/main/codegen/templates/AbstractFieldReader.java new file mode 100644 index 0000000000000..b83dba2879190 --- /dev/null +++ b/java/vector/src/main/codegen/templates/AbstractFieldReader.java @@ -0,0 +1,124 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +<@pp.dropOutputFile /> +<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/AbstractFieldReader.java" /> + + +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.complex.impl; + +<#include "/@includes/vv_imports.ftl" /> + +@SuppressWarnings("unused") +abstract class AbstractFieldReader extends AbstractBaseReader implements FieldReader{ + + AbstractFieldReader(){ + super(); + } + + /** + * Returns true if the current value of the reader is not null + * @return + */ + public boolean isSet() { + return true; + } + + <#list ["Object", "BigDecimal", "Integer", "Long", "Boolean", + "Character", "DateTime", "Period", "Double", "Float", + "Text", "String", "Byte", "Short", "byte[]"] as friendlyType> + <#assign safeType=friendlyType /> + <#if safeType=="byte[]"><#assign safeType="ByteArray" /> + + public ${friendlyType} read${safeType}(int arrayIndex){ + fail("read${safeType}(int arrayIndex)"); + return null; + } + + public ${friendlyType} read${safeType}(){ + fail("read${safeType}()"); + return null; + } + + + + public void copyAsValue(MapWriter writer){ + fail("CopyAsValue MapWriter"); + } + public void copyAsField(String name, MapWriter writer){ + fail("CopyAsField MapWriter"); + } + + public void copyAsField(String name, ListWriter writer){ + fail("CopyAsFieldList"); + } + + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> + <#assign boxedType = (minor.boxedType!type.boxedType) /> + + public void read(${name}Holder holder){ + fail("${name}"); + } + + public void read(Nullable${name}Holder holder){ + fail("${name}"); + } + + public void read(int arrayIndex, ${name}Holder holder){ + fail("Repeated${name}"); + } + + public void read(int arrayIndex, Nullable${name}Holder holder){ + fail("Repeated${name}"); + } + + public void copyAsValue(${name}Writer writer){ + fail("CopyAsValue${name}"); + } + public void copyAsField(String name, ${name}Writer writer){ + fail("CopyAsField${name}"); + } + + + public FieldReader reader(String name){ + fail("reader(String name)"); + return null; + } + + public FieldReader reader(){ + fail("reader()"); + return null; + + } + + public int size(){ + fail("size()"); + return -1; + } + + private void fail(String name){ + throw new IllegalArgumentException(String.format("You tried to read a [%s] type when you are using a field reader of type [%s].", name, this.getClass().getSimpleName())); + } + + +} + + + diff --git a/java/vector/src/main/codegen/templates/AbstractFieldWriter.java b/java/vector/src/main/codegen/templates/AbstractFieldWriter.java new file mode 100644 index 0000000000000..6ee9dad44e929 --- /dev/null +++ b/java/vector/src/main/codegen/templates/AbstractFieldWriter.java @@ -0,0 +1,147 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ + +<@pp.dropOutputFile /> +<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/AbstractFieldWriter.java" /> + + +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.complex.impl; + +<#include "/@includes/vv_imports.ftl" /> + +/* + * This class is generated using freemarker and the ${.template_name} template. + */ +@SuppressWarnings("unused") +abstract class AbstractFieldWriter extends AbstractBaseWriter implements FieldWriter { + AbstractFieldWriter(FieldWriter parent) { + super(parent); + } + + @Override + public void start() { + throw new IllegalStateException(String.format("You tried to start when you are using a ValueWriter of type %s.", this.getClass().getSimpleName())); + } + + @Override + public void end() { + throw new IllegalStateException(String.format("You tried to end when you are using a ValueWriter of type %s.", this.getClass().getSimpleName())); + } + + @Override + public void startList() { + throw new IllegalStateException(String.format("You tried to start when you are using a ValueWriter of type %s.", this.getClass().getSimpleName())); + } + + @Override + public void endList() { + throw new IllegalStateException(String.format("You tried to end when you are using a ValueWriter of type %s.", this.getClass().getSimpleName())); + } + + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> + <#assign fields = minor.fields!type.fields /> + @Override + public void write(${name}Holder holder) { + fail("${name}"); + } + + public void write${minor.class}(<#list fields as field>${field.type} ${field.name}<#if field_has_next>, ) { + fail("${name}"); + } + + + + public void writeNull() { + fail("${name}"); + } + + /** + * This implementation returns {@code false}. + *
+ * Must be overridden by map writers. + *
+ */ + @Override + public boolean isEmptyMap() { + return false; + } + + @Override + public MapWriter map() { + fail("Map"); + return null; + } + + @Override + public ListWriter list() { + fail("List"); + return null; + } + + @Override + public MapWriter map(String name) { + fail("Map"); + return null; + } + + @Override + public ListWriter list(String name) { + fail("List"); + return null; + } + + <#list vv.types as type><#list type.minor as minor> + <#assign lowerName = minor.class?uncap_first /> + <#if lowerName == "int" ><#assign lowerName = "integer" /> + <#assign upperName = minor.class?upper_case /> + <#assign capName = minor.class?cap_first /> + <#if minor.class?starts_with("Decimal") > + public ${capName}Writer ${lowerName}(String name, int scale, int precision) { + fail("${capName}"); + return null; + } + + + @Override + public ${capName}Writer ${lowerName}(String name) { + fail("${capName}"); + return null; + } + + @Override + public ${capName}Writer ${lowerName}() { + fail("${capName}"); + return null; + } + + + + public void copyReader(FieldReader reader) { + fail("Copy FieldReader"); + } + + public void copyReaderToField(String name, FieldReader reader) { + fail("Copy FieldReader to STring"); + } + + private void fail(String name) { + throw new IllegalArgumentException(String.format("You tried to write a %s type when you are using a ValueWriter of type %s.", name, this.getClass().getSimpleName())); + } +} diff --git a/java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java b/java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java new file mode 100644 index 0000000000000..549dbf107ea67 --- /dev/null +++ b/java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java @@ -0,0 +1,142 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +import org.apache.drill.common.types.TypeProtos.MinorType; + +<@pp.dropOutputFile /> +<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/AbstractPromotableFieldWriter.java" /> + + +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.complex.impl; + +<#include "/@includes/vv_imports.ftl" /> + +/* + * A FieldWriter which delegates calls to another FieldWriter. The delegate FieldWriter can be promoted to a new type + * when necessary. Classes that extend this class are responsible for handling promotion. + * + * This class is generated using freemarker and the ${.template_name} template. 
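+ * For example (illustrative), a writer that begins by writing a single scalar
+ * type may be replaced by one backed by a union vector once a second type is
+ * written; the concrete subclasses define the actual promotion rules.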
+ * + */ +@SuppressWarnings("unused") +abstract class AbstractPromotableFieldWriter extends AbstractFieldWriter { + AbstractPromotableFieldWriter(FieldWriter parent) { + super(parent); + } + + /** + * Retrieve the FieldWriter, promoting if it is not a FieldWriter of the specified type + * @param type + * @return + */ + abstract protected FieldWriter getWriter(MinorType type); + + /** + * Return the current FieldWriter + * @return + */ + abstract protected FieldWriter getWriter(); + + @Override + public void start() { + getWriter(MinorType.MAP).start(); + } + + @Override + public void end() { + getWriter(MinorType.MAP).end(); + } + + @Override + public void startList() { + getWriter(MinorType.LIST).startList(); + } + + @Override + public void endList() { + getWriter(MinorType.LIST).endList(); + } + + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> + <#assign fields = minor.fields!type.fields /> + <#if !minor.class?starts_with("Decimal") > + @Override + public void write(${name}Holder holder) { + getWriter(MinorType.${name?upper_case}).write(holder); + } + + public void write${minor.class}(<#list fields as field>${field.type} ${field.name}<#if field_has_next>, ) { + getWriter(MinorType.${name?upper_case}).write${minor.class}(<#list fields as field>${field.name}<#if field_has_next>, ); + } + + + + + public void writeNull() { + } + + @Override + public MapWriter map() { + return getWriter(MinorType.LIST).map(); + } + + @Override + public ListWriter list() { + return getWriter(MinorType.LIST).list(); + } + + @Override + public MapWriter map(String name) { + return getWriter(MinorType.MAP).map(name); + } + + @Override + public ListWriter list(String name) { + return getWriter(MinorType.MAP).list(name); + } + + <#list vv.types as type><#list type.minor as minor> + <#assign lowerName = minor.class?uncap_first /> + <#if lowerName == "int" ><#assign lowerName = "integer" /> + <#assign upperName = minor.class?upper_case /> + <#assign capName = minor.class?cap_first /> + <#if !minor.class?starts_with("Decimal") > + + @Override + public ${capName}Writer ${lowerName}(String name) { + return getWriter(MinorType.MAP).${lowerName}(name); + } + + @Override + public ${capName}Writer ${lowerName}() { + return getWriter(MinorType.LIST).${lowerName}(); + } + + + + + public void copyReader(FieldReader reader) { + getWriter().copyReader(reader); + } + + public void copyReaderToField(String name, FieldReader reader) { + getWriter().copyReaderToField(name, reader); + } +} diff --git a/java/vector/src/main/codegen/templates/BaseReader.java b/java/vector/src/main/codegen/templates/BaseReader.java new file mode 100644 index 0000000000000..8f12b1da80424 --- /dev/null +++ b/java/vector/src/main/codegen/templates/BaseReader.java @@ -0,0 +1,73 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ + +<@pp.dropOutputFile /> +<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/reader/BaseReader.java" /> + + +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.complex.reader; + +<#include "/@includes/vv_imports.ftl" /> + + + +@SuppressWarnings("unused") +public interface BaseReader extends Positionable{ + MajorType getType(); + MaterializedField getField(); + void reset(); + void read(UnionHolder holder); + void read(int index, UnionHolder holder); + void copyAsValue(UnionWriter writer); + boolean isSet(); + + public interface MapReader extends BaseReader, Iterable{ + FieldReader reader(String name); + } + + public interface RepeatedMapReader extends MapReader{ + boolean next(); + int size(); + void copyAsValue(MapWriter writer); + } + + public interface ListReader extends BaseReader{ + FieldReader reader(); + } + + public interface RepeatedListReader extends ListReader{ + boolean next(); + int size(); + void copyAsValue(ListWriter writer); + } + + public interface ScalarReader extends + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> ${name}Reader, + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> Repeated${name}Reader, + BaseReader {} + + interface ComplexReader{ + MapReader rootAsMap(); + ListReader rootAsList(); + boolean rootIsMap(); + boolean ok(); + } +} + diff --git a/java/vector/src/main/codegen/templates/BaseWriter.java b/java/vector/src/main/codegen/templates/BaseWriter.java new file mode 100644 index 0000000000000..299b2389bb35c --- /dev/null +++ b/java/vector/src/main/codegen/templates/BaseWriter.java @@ -0,0 +1,117 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +<@pp.dropOutputFile /> +<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/writer/BaseWriter.java" /> + + +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.complex.writer; + +<#include "/@includes/vv_imports.ftl" /> + +/* + * File generated from ${.template_name} using FreeMarker. + */ +@SuppressWarnings("unused") + public interface BaseWriter extends AutoCloseable, Positionable { + FieldWriter getParent(); + int getValueCapacity(); + + public interface MapWriter extends BaseWriter { + + MaterializedField getField(); + + /** + * Whether this writer is a map writer and is empty (has no children). + * + *
<p>
+ * Intended only for use in determining whether to add a dummy vector to + * avoid an empty (zero-column) schema, as in JsonReader. + *
</p>
+ * + */ + boolean isEmptyMap(); + + <#list vv.types as type><#list type.minor as minor> + <#assign lowerName = minor.class?uncap_first /> + <#if lowerName == "int" ><#assign lowerName = "integer" /> + <#assign upperName = minor.class?upper_case /> + <#assign capName = minor.class?cap_first /> + <#if minor.class?starts_with("Decimal") > + ${capName}Writer ${lowerName}(String name, int scale, int precision); + + ${capName}Writer ${lowerName}(String name); + + + void copyReaderToField(String name, FieldReader reader); + MapWriter map(String name); + ListWriter list(String name); + void start(); + void end(); + } + + public interface ListWriter extends BaseWriter { + void startList(); + void endList(); + MapWriter map(); + ListWriter list(); + void copyReader(FieldReader reader); + + <#list vv.types as type><#list type.minor as minor> + <#assign lowerName = minor.class?uncap_first /> + <#if lowerName == "int" ><#assign lowerName = "integer" /> + <#assign upperName = minor.class?upper_case /> + <#assign capName = minor.class?cap_first /> + ${capName}Writer ${lowerName}(); + + } + + public interface ScalarWriter extends + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> ${name}Writer, BaseWriter {} + + public interface ComplexWriter { + void allocate(); + void clear(); + void copyReader(FieldReader reader); + MapWriter rootAsMap(); + ListWriter rootAsList(); + + void setPosition(int index); + void setValueCount(int count); + void reset(); + } + + public interface MapOrListWriter { + void start(); + void end(); + MapOrListWriter map(String name); + MapOrListWriter listoftmap(String name); + MapOrListWriter list(String name); + boolean isMapWriter(); + boolean isListWriter(); + VarCharWriter varChar(String name); + IntWriter integer(String name); + BigIntWriter bigInt(String name); + Float4Writer float4(String name); + Float8Writer float8(String name); + BitWriter bit(String name); + VarBinaryWriter binary(String name); + } +} diff --git a/java/vector/src/main/codegen/templates/BasicTypeHelper.java b/java/vector/src/main/codegen/templates/BasicTypeHelper.java new file mode 100644 index 0000000000000..bb6446e8d6b19 --- /dev/null +++ b/java/vector/src/main/codegen/templates/BasicTypeHelper.java @@ -0,0 +1,538 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +<@pp.dropOutputFile /> +<@pp.changeOutputFile name="/org/apache/arrow/vector/util/BasicTypeHelper.java" /> + +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.util; + +<#include "/@includes/vv_imports.ftl" /> +import org.apache.arrow.vector.complex.UnionVector; +import org.apache.arrow.vector.complex.RepeatedMapVector; +import org.apache.arrow.vector.util.CallBack; + +public class BasicTypeHelper { + static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(BasicTypeHelper.class); + + private static final int WIDTH_ESTIMATE = 50; + + // Default length when casting to varchar : 65536 = 2^16 + // This only defines an absolute maximum for values, setting + // a high value like this will not inflate the size for small values + public static final int VARCHAR_DEFAULT_CAST_LEN = 65536; + + protected static String buildErrorMessage(final String operation, final MinorType type, final DataMode mode) { + return String.format("Unable to %s for minor type [%s] and mode [%s]", operation, type, mode); + } + + protected static String buildErrorMessage(final String operation, final MajorType type) { + return buildErrorMessage(operation, type.getMinorType(), type.getMode()); + } + + public static int getSize(MajorType major) { + switch (major.getMinorType()) { +<#list vv.types as type> + <#list type.minor as minor> + case ${minor.class?upper_case}: + return ${type.width}<#if minor.class?substring(0, 3) == "Var" || + minor.class?substring(0, 3) == "PRO" || + minor.class?substring(0, 3) == "MSG"> + WIDTH_ESTIMATE; + + +// case FIXEDCHAR: return major.getWidth(); +// case FIXED16CHAR: return major.getWidth(); +// case FIXEDBINARY: return major.getWidth(); + } + throw new UnsupportedOperationException(buildErrorMessage("get size", major)); + } + + public static ValueVector getNewVector(String name, BufferAllocator allocator, MajorType type, CallBack callback){ + MaterializedField field = MaterializedField.create(name, type); + return getNewVector(field, allocator, callback); + } + + + public static Class getValueVectorClass(MinorType type, DataMode mode){ + switch (type) { + case UNION: + return UnionVector.class; + case MAP: + switch (mode) { + case OPTIONAL: + case REQUIRED: + return MapVector.class; + case REPEATED: + return RepeatedMapVector.class; + } + + case LIST: + switch (mode) { + case REPEATED: + return RepeatedListVector.class; + case REQUIRED: + case OPTIONAL: + return ListVector.class; + } + +<#list vv.types as type> + <#list type.minor as minor> + case ${minor.class?upper_case}: + switch (mode) { + case REQUIRED: + return ${minor.class}Vector.class; + case OPTIONAL: + return Nullable${minor.class}Vector.class; + case REPEATED: + return Repeated${minor.class}Vector.class; + } + + + case GENERIC_OBJECT : + return ObjectVector.class ; + default: + break; + } + throw new UnsupportedOperationException(buildErrorMessage("get value vector class", type, mode)); + } + public static Class getReaderClassName( MinorType type, DataMode mode, boolean isSingularRepeated){ + switch (type) { + case MAP: + switch (mode) { + case REQUIRED: + if (!isSingularRepeated) + return SingleMapReaderImpl.class; + else + return SingleLikeRepeatedMapReaderImpl.class; + case REPEATED: + return RepeatedMapReaderImpl.class; + } + case LIST: + switch (mode) { + case REQUIRED: + return SingleListReaderImpl.class; + case REPEATED: + return RepeatedListReaderImpl.class; + } + +<#list vv.types as type> + <#list type.minor as minor> + case ${minor.class?upper_case}: + switch (mode) { + 
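// Each DataMode has a generated reader class: plain, Nullable, or Repeated. +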
case REQUIRED: + return ${minor.class}ReaderImpl.class; + case OPTIONAL: + return Nullable${minor.class}ReaderImpl.class; + case REPEATED: + return Repeated${minor.class}ReaderImpl.class; + } + + + default: + break; + } + throw new UnsupportedOperationException(buildErrorMessage("get reader class name", type, mode)); + } + + public static Class getWriterInterface( MinorType type, DataMode mode){ + switch (type) { + case UNION: return UnionWriter.class; + case MAP: return MapWriter.class; + case LIST: return ListWriter.class; +<#list vv.types as type> + <#list type.minor as minor> + case ${minor.class?upper_case}: return ${minor.class}Writer.class; + + + default: + break; + } + throw new UnsupportedOperationException(buildErrorMessage("get writer interface", type, mode)); + } + + public static Class getWriterImpl( MinorType type, DataMode mode){ + switch (type) { + case UNION: + return UnionWriter.class; + case MAP: + switch (mode) { + case REQUIRED: + case OPTIONAL: + return SingleMapWriter.class; + case REPEATED: + return RepeatedMapWriter.class; + } + case LIST: + switch (mode) { + case REQUIRED: + case OPTIONAL: + return UnionListWriter.class; + case REPEATED: + return RepeatedListWriter.class; + } + +<#list vv.types as type> + <#list type.minor as minor> + case ${minor.class?upper_case}: + switch (mode) { + case REQUIRED: + return ${minor.class}WriterImpl.class; + case OPTIONAL: + return Nullable${minor.class}WriterImpl.class; + case REPEATED: + return Repeated${minor.class}WriterImpl.class; + } + + + default: + break; + } + throw new UnsupportedOperationException(buildErrorMessage("get writer implementation", type, mode)); + } + + public static Class getHolderReaderImpl( MinorType type, DataMode mode){ + switch (type) { +<#list vv.types as type> + <#list type.minor as minor> + case ${minor.class?upper_case}: + switch (mode) { + case REQUIRED: + return ${minor.class}HolderReaderImpl.class; + case OPTIONAL: + return Nullable${minor.class}HolderReaderImpl.class; + case REPEATED: + return Repeated${minor.class}HolderReaderImpl.class; + } + + + default: + break; + } + throw new UnsupportedOperationException(buildErrorMessage("get holder reader implementation", type, mode)); + } + + public static ValueVector getNewVector(MaterializedField field, BufferAllocator allocator){ + return getNewVector(field, allocator, null); + } + public static ValueVector getNewVector(MaterializedField field, BufferAllocator allocator, CallBack callBack){ + MajorType type = field.getType(); + + switch (type.getMinorType()) { + + case UNION: + return new UnionVector(field, allocator, callBack); + + case MAP: + switch (type.getMode()) { + case REQUIRED: + case OPTIONAL: + return new MapVector(field, allocator, callBack); + case REPEATED: + return new RepeatedMapVector(field, allocator, callBack); + } + case LIST: + switch (type.getMode()) { + case REPEATED: + return new RepeatedListVector(field, allocator, callBack); + case OPTIONAL: + case REQUIRED: + return new ListVector(field, allocator, callBack); + } +<#list vv. types as type> + <#list type.minor as minor> + case ${minor.class?upper_case}: + switch (type.getMode()) { + case REQUIRED: + return new ${minor.class}Vector(field, allocator); + case OPTIONAL: + return new Nullable${minor.class}Vector(field, allocator); + case REPEATED: + return new Repeated${minor.class}Vector(field, allocator); + } + + + case GENERIC_OBJECT: + return new ObjectVector(field, allocator) ; + default: + break; + } + // All ValueVector types have been handled. 
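+    // Reaching here means the minor type has no generated vector case above, so fail fast.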
+ throw new UnsupportedOperationException(buildErrorMessage("get new vector", type)); + } + + public static ValueHolder getValue(ValueVector vector, int index) { + MajorType type = vector.getField().getType(); + ValueHolder holder; + switch(type.getMinorType()) { +<#list vv.types as type> + <#list type.minor as minor> + case ${minor.class?upper_case} : + <#if minor.class?starts_with("Var") || minor.class == "IntervalDay" || minor.class == "Interval" || + minor.class?starts_with("Decimal28") || minor.class?starts_with("Decimal38")> + switch (type.getMode()) { + case REQUIRED: + holder = new ${minor.class}Holder(); + ((${minor.class}Vector) vector).getAccessor().get(index, (${minor.class}Holder)holder); + return holder; + case OPTIONAL: + holder = new Nullable${minor.class}Holder(); + ((Nullable${minor.class}Holder)holder).isSet = ((Nullable${minor.class}Vector) vector).getAccessor().isSet(index); + if (((Nullable${minor.class}Holder)holder).isSet == 1) { + ((Nullable${minor.class}Vector) vector).getAccessor().get(index, (Nullable${minor.class}Holder)holder); + } + return holder; + } + <#else> + switch (type.getMode()) { + case REQUIRED: + holder = new ${minor.class}Holder(); + ((${minor.class}Holder)holder).value = ((${minor.class}Vector) vector).getAccessor().get(index); + return holder; + case OPTIONAL: + holder = new Nullable${minor.class}Holder(); + ((Nullable${minor.class}Holder)holder).isSet = ((Nullable${minor.class}Vector) vector).getAccessor().isSet(index); + if (((Nullable${minor.class}Holder)holder).isSet == 1) { + ((Nullable${minor.class}Holder)holder).value = ((Nullable${minor.class}Vector) vector).getAccessor().get(index); + } + return holder; + } + + + + case GENERIC_OBJECT: + holder = new ObjectHolder(); + ((ObjectHolder)holder).obj = ((ObjectVector) vector).getAccessor().getObject(index) ; + break; + } + + throw new UnsupportedOperationException(buildErrorMessage("get value", type)); + } + + public static void setValue(ValueVector vector, int index, ValueHolder holder) { + MajorType type = vector.getField().getType(); + + switch(type.getMinorType()) { +<#list vv.types as type> + <#list type.minor as minor> + case ${minor.class?upper_case} : + switch (type.getMode()) { + case REQUIRED: + ((${minor.class}Vector) vector).getMutator().setSafe(index, (${minor.class}Holder) holder); + return; + case OPTIONAL: + if (((Nullable${minor.class}Holder) holder).isSet == 1) { + ((Nullable${minor.class}Vector) vector).getMutator().setSafe(index, (Nullable${minor.class}Holder) holder); + } + return; + } + + + case GENERIC_OBJECT: + ((ObjectVector) vector).getMutator().setSafe(index, (ObjectHolder) holder); + return; + default: + throw new UnsupportedOperationException(buildErrorMessage("set value", type)); + } + } + + public static void setValueSafe(ValueVector vector, int index, ValueHolder holder) { + MajorType type = vector.getField().getType(); + + switch(type.getMinorType()) { + <#list vv.types as type> + <#list type.minor as minor> + case ${minor.class?upper_case} : + switch (type.getMode()) { + case REQUIRED: + ((${minor.class}Vector) vector).getMutator().setSafe(index, (${minor.class}Holder) holder); + return; + case OPTIONAL: + if (((Nullable${minor.class}Holder) holder).isSet == 1) { + ((Nullable${minor.class}Vector) vector).getMutator().setSafe(index, (Nullable${minor.class}Holder) holder); + } else { + ((Nullable${minor.class}Vector) vector).getMutator().isSafe(index); + } + return; + } + + + case GENERIC_OBJECT: + ((ObjectVector) vector).getMutator().setSafe(index, 
(ObjectHolder) holder); + default: + throw new UnsupportedOperationException(buildErrorMessage("set value safe", type)); + } + } + + public static boolean compareValues(ValueVector v1, int v1index, ValueVector v2, int v2index) { + MajorType type1 = v1.getField().getType(); + MajorType type2 = v2.getField().getType(); + + if (type1.getMinorType() != type2.getMinorType()) { + return false; + } + + switch(type1.getMinorType()) { +<#list vv.types as type> + <#list type.minor as minor> + case ${minor.class?upper_case} : + if ( ((${minor.class}Vector) v1).getAccessor().get(v1index) == + ((${minor.class}Vector) v2).getAccessor().get(v2index) ) + return true; + break; + + + default: + break; + } + return false; + } + + /** + * Create a ValueHolder of MajorType. + * @param type + * @return + */ + public static ValueHolder createValueHolder(MajorType type) { + switch(type.getMinorType()) { + <#list vv.types as type> + <#list type.minor as minor> + case ${minor.class?upper_case} : + + switch (type.getMode()) { + case REQUIRED: + return new ${minor.class}Holder(); + case OPTIONAL: + return new Nullable${minor.class}Holder(); + case REPEATED: + return new Repeated${minor.class}Holder(); + } + + + case GENERIC_OBJECT: + return new ObjectHolder(); + default: + throw new UnsupportedOperationException(buildErrorMessage("create value holder", type)); + } + } + + public static boolean isNull(ValueHolder holder) { + MajorType type = getValueHolderType(holder); + + switch(type.getMinorType()) { + <#list vv.types as type> + <#list type.minor as minor> + case ${minor.class?upper_case} : + + switch (type.getMode()) { + case REQUIRED: + return true; + case OPTIONAL: + return ((Nullable${minor.class}Holder) holder).isSet == 0; + case REPEATED: + return true; + } + + + default: + throw new UnsupportedOperationException(buildErrorMessage("check is null", type)); + } + } + + public static ValueHolder deNullify(ValueHolder holder) { + MajorType type = getValueHolderType(holder); + + switch(type.getMinorType()) { + <#list vv.types as type> + <#list type.minor as minor> + case ${minor.class?upper_case} : + + switch (type.getMode()) { + case REQUIRED: + return holder; + case OPTIONAL: + if( ((Nullable${minor.class}Holder) holder).isSet == 1) { + ${minor.class}Holder newHolder = new ${minor.class}Holder(); + + <#assign fields = minor.fields!type.fields /> + <#list fields as field> + newHolder.${field.name} = ((Nullable${minor.class}Holder) holder).${field.name}; + + + return newHolder; + } else { + throw new UnsupportedOperationException("You can not convert a null value into a non-null value!"); + } + case REPEATED: + return holder; + } + + + default: + throw new UnsupportedOperationException(buildErrorMessage("deNullify", type)); + } + } + + public static ValueHolder nullify(ValueHolder holder) { + MajorType type = getValueHolderType(holder); + + switch(type.getMinorType()) { + <#list vv.types as type> + <#list type.minor as minor> + case ${minor.class?upper_case} : + switch (type.getMode()) { + case REQUIRED: + Nullable${minor.class}Holder newHolder = new Nullable${minor.class}Holder(); + newHolder.isSet = 1; + <#assign fields = minor.fields!type.fields /> + <#list fields as field> + newHolder.${field.name} = ((${minor.class}Holder) holder).${field.name}; + + return newHolder; + case OPTIONAL: + return holder; + case REPEATED: + throw new UnsupportedOperationException("You can not convert repeated type " + type + " to nullable type!"); + } + + + default: + throw new 
UnsupportedOperationException(buildErrorMessage("nullify", type)); + } + } + + public static MajorType getValueHolderType(ValueHolder holder) { + + if (0 == 1) { + return null; + } + <#list vv.types as type> + <#list type.minor as minor> + else if (holder instanceof ${minor.class}Holder) { + return ((${minor.class}Holder) holder).TYPE; + } else if (holder instanceof Nullable${minor.class}Holder) { + return ((Nullable${minor.class}Holder) holder).TYPE; + } + + + + throw new UnsupportedOperationException("ValueHolder is not supported for 'getValueHolderType' method."); + + } + +} diff --git a/java/vector/src/main/codegen/templates/ComplexCopier.java b/java/vector/src/main/codegen/templates/ComplexCopier.java new file mode 100644 index 0000000000000..3614231c8342e --- /dev/null +++ b/java/vector/src/main/codegen/templates/ComplexCopier.java @@ -0,0 +1,133 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +<@pp.dropOutputFile /> +<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/ComplexCopier.java" /> + + +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.complex.impl; + +<#include "/@includes/vv_imports.ftl" /> + +/* + * This class is generated using freemarker and the ${.template_name} template. 
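+ *
+ * A minimal usage sketch, assuming an already populated source reader and a
+ * destination writer (in, out, and sourceVector are illustrative names):
+ *
+ *   FieldReader in = sourceVector.getReader();
+ *   in.setPosition(5);                // value to copy
+ *   out.setPosition(5);               // some FieldWriter positioned at the target slot
+ *   ComplexCopier.copy(in, out);      // deep copy, recursing through maps and lists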
+ */ +@SuppressWarnings("unused") +public class ComplexCopier { + + /** + * Do a deep copy of the value in input into output + * @param in + * @param out + */ + public static void copy(FieldReader input, FieldWriter output) { + writeValue(input, output); + } + + private static void writeValue(FieldReader reader, FieldWriter writer) { + final DataMode m = reader.getType().getMode(); + final MinorType mt = reader.getType().getMinorType(); + + switch(m){ + case OPTIONAL: + case REQUIRED: + + + switch (mt) { + + case LIST: + writer.startList(); + while (reader.next()) { + writeValue(reader.reader(), getListWriterForReader(reader.reader(), writer)); + } + writer.endList(); + break; + case MAP: + writer.start(); + if (reader.isSet()) { + for(String name : reader){ + FieldReader childReader = reader.reader(name); + if(childReader.isSet()){ + writeValue(childReader, getMapWriterForReader(childReader, writer, name)); + } + } + } + writer.end(); + break; + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> + <#assign fields = minor.fields!type.fields /> + <#assign uncappedName = name?uncap_first/> + <#if !minor.class?starts_with("Decimal")> + + case ${name?upper_case}: + if (reader.isSet()) { + Nullable${name}Holder ${uncappedName}Holder = new Nullable${name}Holder(); + reader.read(${uncappedName}Holder); + if (${uncappedName}Holder.isSet == 1) { + writer.write${name}(<#list fields as field>${uncappedName}Holder.${field.name}<#if field_has_next>, ); + } + } + break; + + + + } + break; + } + } + + private static FieldWriter getMapWriterForReader(FieldReader reader, MapWriter writer, String name) { + switch (reader.getType().getMinorType()) { + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> + <#assign fields = minor.fields!type.fields /> + <#assign uncappedName = name?uncap_first/> + <#if !minor.class?starts_with("Decimal")> + case ${name?upper_case}: + return (FieldWriter) writer.<#if name == "Int">integer<#else>${uncappedName}(name); + + + case MAP: + return (FieldWriter) writer.map(name); + case LIST: + return (FieldWriter) writer.list(name); + default: + throw new UnsupportedOperationException(reader.getType().toString()); + } + } + + private static FieldWriter getListWriterForReader(FieldReader reader, ListWriter writer) { + switch (reader.getType().getMinorType()) { + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> + <#assign fields = minor.fields!type.fields /> + <#assign uncappedName = name?uncap_first/> + <#if !minor.class?starts_with("Decimal")> + case ${name?upper_case}: + return (FieldWriter) writer.<#if name == "Int">integer<#else>${uncappedName}(); + + + case MAP: + return (FieldWriter) writer.map(); + case LIST: + return (FieldWriter) writer.list(); + default: + throw new UnsupportedOperationException(reader.getType().toString()); + } + } +} diff --git a/java/vector/src/main/codegen/templates/ComplexReaders.java b/java/vector/src/main/codegen/templates/ComplexReaders.java new file mode 100644 index 0000000000000..34c657126015e --- /dev/null +++ b/java/vector/src/main/codegen/templates/ComplexReaders.java @@ -0,0 +1,183 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +import java.lang.Override; +import java.util.List; + +import org.apache.arrow.record.TransferPair; +import org.apache.arrow.vector.complex.IndexHolder; +import org.apache.arrow.vector.complex.writer.IntervalWriter; +import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; + +<@pp.dropOutputFile /> +<#list vv.types as type> +<#list type.minor as minor> +<#list ["", "Repeated"] as mode> +<#assign lowerName = minor.class?uncap_first /> +<#if lowerName == "int" ><#assign lowerName = "integer" /> +<#assign name = mode + minor.class?cap_first /> +<#assign javaType = (minor.javaType!type.javaType) /> +<#assign friendlyType = (minor.friendlyType!minor.boxedType!type.boxedType) /> +<#assign safeType=friendlyType /> +<#if safeType=="byte[]"><#assign safeType="ByteArray" /> + +<#assign hasFriendly = minor.friendlyType!"no" == "no" /> + +<#list ["", "Nullable"] as nullMode> +<#if (mode == "Repeated" && nullMode == "") || mode == "" > +<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/${nullMode}${name}ReaderImpl.java" /> +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.complex.impl; + +<#include "/@includes/vv_imports.ftl" /> + +@SuppressWarnings("unused") +public class ${nullMode}${name}ReaderImpl extends AbstractFieldReader { + + private final ${nullMode}${name}Vector vector; + + public ${nullMode}${name}ReaderImpl(${nullMode}${name}Vector vector){ + super(); + this.vector = vector; + } + + public MajorType getType(){ + return vector.getField().getType(); + } + + public MaterializedField getField(){ + return vector.getField(); + } + + public boolean isSet(){ + <#if nullMode == "Nullable"> + return !vector.getAccessor().isNull(idx()); + <#else> + return true; + + } + + + + + <#if mode == "Repeated"> + + public void copyAsValue(${minor.class?cap_first}Writer writer){ + Repeated${minor.class?cap_first}WriterImpl impl = (Repeated${minor.class?cap_first}WriterImpl) writer; + impl.vector.copyFromSafe(idx(), impl.idx(), vector); + } + + public void copyAsField(String name, MapWriter writer){ + Repeated${minor.class?cap_first}WriterImpl impl = (Repeated${minor.class?cap_first}WriterImpl) writer.list(name).${lowerName}(); + impl.vector.copyFromSafe(idx(), impl.idx(), vector); + } + + public int size(){ + return vector.getAccessor().getInnerValueCountAt(idx()); + } + + public void read(int arrayIndex, ${minor.class?cap_first}Holder h){ + vector.getAccessor().get(idx(), arrayIndex, h); + } + public void read(int arrayIndex, Nullable${minor.class?cap_first}Holder h){ + vector.getAccessor().get(idx(), arrayIndex, h); + } + + public ${friendlyType} read${safeType}(int arrayIndex){ + return vector.getAccessor().getSingleObject(idx(), arrayIndex); + } + + + public List readObject(){ + return (List) (Object) vector.getAccessor().getObject(idx()); + } + + <#else> + + public void copyAsValue(${minor.class?cap_first}Writer writer){ + ${nullMode}${minor.class?cap_first}WriterImpl impl 
= (${nullMode}${minor.class?cap_first}WriterImpl) writer; + impl.vector.copyFromSafe(idx(), impl.idx(), vector); + } + + public void copyAsField(String name, MapWriter writer){ + ${nullMode}${minor.class?cap_first}WriterImpl impl = (${nullMode}${minor.class?cap_first}WriterImpl) writer.${lowerName}(name); + impl.vector.copyFromSafe(idx(), impl.idx(), vector); + } + + <#if nullMode != "Nullable"> + public void read(${minor.class?cap_first}Holder h){ + vector.getAccessor().get(idx(), h); + } + + + public void read(Nullable${minor.class?cap_first}Holder h){ + vector.getAccessor().get(idx(), h); + } + + public ${friendlyType} read${safeType}(){ + return vector.getAccessor().getObject(idx()); + } + + public void copyValue(FieldWriter w){ + + } + + public Object readObject(){ + return vector.getAccessor().getObject(idx()); + } + + + +} + + +<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/reader/${name}Reader.java" /> +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.complex.reader; + +<#include "/@includes/vv_imports.ftl" /> +@SuppressWarnings("unused") +public interface ${name}Reader extends BaseReader{ + + <#if mode == "Repeated"> + public int size(); + public void read(int arrayIndex, ${minor.class?cap_first}Holder h); + public void read(int arrayIndex, Nullable${minor.class?cap_first}Holder h); + public Object readObject(int arrayIndex); + public ${friendlyType} read${safeType}(int arrayIndex); + <#else> + public void read(${minor.class?cap_first}Holder h); + public void read(Nullable${minor.class?cap_first}Holder h); + public Object readObject(); + public ${friendlyType} read${safeType}(); + + public boolean isSet(); + public void copyAsValue(${minor.class}Writer writer); + public void copyAsField(String name, ${minor.class}Writer writer); + +} + + + + + + + + diff --git a/java/vector/src/main/codegen/templates/ComplexWriters.java b/java/vector/src/main/codegen/templates/ComplexWriters.java new file mode 100644 index 0000000000000..8f9a6e7b97117 --- /dev/null +++ b/java/vector/src/main/codegen/templates/ComplexWriters.java @@ -0,0 +1,151 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +<@pp.dropOutputFile /> +<#list vv.types as type> +<#list type.minor as minor> +<#list ["", "Nullable", "Repeated"] as mode> +<#assign name = mode + minor.class?cap_first /> +<#assign eName = name /> +<#assign javaType = (minor.javaType!type.javaType) /> +<#assign fields = minor.fields!type.fields /> + +<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/${eName}WriterImpl.java" /> +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.complex.impl; + +<#include "/@includes/vv_imports.ftl" /> + +/* + * This class is generated using FreeMarker on the ${.template_name} template. 
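+ *
+ * The generated writers follow a position-then-write pattern. A rough usage
+ * sketch, assuming the template instantiated for the Int minor type (writer
+ * and row are illustrative):
+ *
+ *   writer.setPosition(row);
+ *   writer.writeInt(42);   // forwards to the vector's mutator via setSafe()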
+ */ +@SuppressWarnings("unused") +public class ${eName}WriterImpl extends AbstractFieldWriter { + + private final ${name}Vector.Mutator mutator; + final ${name}Vector vector; + + public ${eName}WriterImpl(${name}Vector vector, AbstractFieldWriter parent) { + super(parent); + this.mutator = vector.getMutator(); + this.vector = vector; + } + + @Override + public MaterializedField getField() { + return vector.getField(); + } + + @Override + public int getValueCapacity() { + return vector.getValueCapacity(); + } + + @Override + public void allocate() { + vector.allocateNew(); + } + + @Override + public void close() { + vector.close(); + } + + @Override + public void clear() { + vector.clear(); + } + + @Override + protected int idx() { + return super.idx(); + } + + <#if mode == "Repeated"> + + public void write(${minor.class?cap_first}Holder h) { + mutator.addSafe(idx(), h); + vector.getMutator().setValueCount(idx()+1); + } + + public void write(Nullable${minor.class?cap_first}Holder h) { + mutator.addSafe(idx(), h); + vector.getMutator().setValueCount(idx()+1); + } + + <#if !(minor.class == "Decimal9" || minor.class == "Decimal18" || minor.class == "Decimal28Sparse" || minor.class == "Decimal38Sparse" || minor.class == "Decimal28Dense" || minor.class == "Decimal38Dense")> + public void write${minor.class}(<#list fields as field>${field.type} ${field.name}<#if field_has_next>, ) { + mutator.addSafe(idx(), <#list fields as field>${field.name}<#if field_has_next>, ); + vector.getMutator().setValueCount(idx()+1); + } + + + public void setPosition(int idx) { + super.setPosition(idx); + mutator.startNewValue(idx); + } + + + <#else> + + public void write(${minor.class}Holder h) { + mutator.setSafe(idx(), h); + vector.getMutator().setValueCount(idx()+1); + } + + public void write(Nullable${minor.class}Holder h) { + mutator.setSafe(idx(), h); + vector.getMutator().setValueCount(idx()+1); + } + + <#if !(minor.class == "Decimal9" || minor.class == "Decimal18" || minor.class == "Decimal28Sparse" || minor.class == "Decimal38Sparse" || minor.class == "Decimal28Dense" || minor.class == "Decimal38Dense")> + public void write${minor.class}(<#list fields as field>${field.type} ${field.name}<#if field_has_next>, ) { + mutator.setSafe(idx(), <#if mode == "Nullable">1, <#list fields as field>${field.name}<#if field_has_next>, ); + vector.getMutator().setValueCount(idx()+1); + } + + <#if mode == "Nullable"> + + public void writeNull() { + mutator.setNull(idx()); + vector.getMutator().setValueCount(idx()+1); + } + + + +} + +<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/writer/${eName}Writer.java" /> +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.complex.writer; + +<#include "/@includes/vv_imports.ftl" /> +@SuppressWarnings("unused") +public interface ${eName}Writer extends BaseWriter { + public void write(${minor.class}Holder h); + + <#if !(minor.class == "Decimal9" || minor.class == "Decimal18" || minor.class == "Decimal28Sparse" || minor.class == "Decimal38Sparse" || minor.class == "Decimal28Dense" || minor.class == "Decimal38Dense")> + public void write${minor.class}(<#list fields as field>${field.type} ${field.name}<#if field_has_next>, ); + +} + + + + diff --git a/java/vector/src/main/codegen/templates/FixedValueVectors.java b/java/vector/src/main/codegen/templates/FixedValueVectors.java new file mode 100644 index 0000000000000..18fcac93bb6f0 --- /dev/null +++ b/java/vector/src/main/codegen/templates/FixedValueVectors.java @@ -0,0 +1,813 @@ +/** + * Licensed to the 
Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +import java.lang.Override; + +<@pp.dropOutputFile /> +<#list vv.types as type> +<#list type.minor as minor> +<#assign friendlyType = (minor.friendlyType!minor.boxedType!type.boxedType) /> + +<#if type.major == "Fixed"> +<@pp.changeOutputFile name="/org/apache/arrow/vector/${minor.class}Vector.java" /> +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector; + +<#include "/@includes/vv_imports.ftl" /> + +/** + * ${minor.class} implements a vector of fixed width values. Elements in the vector are accessed + * by position, starting from the logical start of the vector. Values should be pushed onto the + * vector sequentially, but may be randomly accessed. + * The width of each element is ${type.width} byte(s) + * The equivalent Java primitive is '${minor.javaType!type.javaType}' + * + * NB: this class is automatically generated from ${.template_name} and ValueVectorTypes.tdd using FreeMarker. + */ +public final class ${minor.class}Vector extends BaseDataValueVector implements FixedWidthVector{ + private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(${minor.class}Vector.class); + + private final FieldReader reader = new ${minor.class}ReaderImpl(${minor.class}Vector.this); + private final Accessor accessor = new Accessor(); + private final Mutator mutator = new Mutator(); + + private int allocationSizeInBytes = INITIAL_VALUE_ALLOCATION * ${type.width}; + private int allocationMonitor = 0; + + public ${minor.class}Vector(MaterializedField field, BufferAllocator allocator) { + super(field, allocator); + } + + @Override + public FieldReader getReader(){ + return reader; + } + + @Override + public int getBufferSizeFor(final int valueCount) { + if (valueCount == 0) { + return 0; + } + return valueCount * ${type.width}; + } + + @Override + public int getValueCapacity(){ + return (int) (data.capacity() *1.0 / ${type.width}); + } + + @Override + public Accessor getAccessor(){ + return accessor; + } + + @Override + public Mutator getMutator(){ + return mutator; + } + + @Override + public void setInitialCapacity(final int valueCount) { + final long size = 1L * valueCount * ${type.width}; + if (size > MAX_ALLOCATION_SIZE) { + throw new OversizedAllocationException("Requested amount of memory is more than max allowed allocation size"); + } + allocationSizeInBytes = (int)size; + } + + @Override + public void allocateNew() { + if(!allocateNewSafe()){ + throw new OutOfMemoryException("Failure while allocating buffer."); + } + } + + @Override + public boolean allocateNewSafe() { + long curAllocationSize = allocationSizeInBytes; + if (allocationMonitor > 10) { + curAllocationSize = Math.max(8, curAllocationSize / 2); + allocationMonitor = 0; + } else if (allocationMonitor < -2) { + 
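// allocationMonitor below -2 signals repeated capacity shortfalls: double the next allocation. +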
curAllocationSize = allocationSizeInBytes * 2L; + allocationMonitor = 0; + } + + try{ + allocateBytes(curAllocationSize); + } catch (RuntimeException ex) { + return false; + } + return true; + } + + /** + * Allocate a new buffer that supports setting at least the provided number of values. May actually be sized bigger + * depending on underlying buffer rounding size. Must be called prior to using the ValueVector. + * + * Note that the maximum number of values a vector can allocate is Integer.MAX_VALUE / value width. + * + * @param valueCount + * @throws org.apache.arrow.memory.OutOfMemoryException if it can't allocate the new buffer + */ + @Override + public void allocateNew(final int valueCount) { + allocateBytes(valueCount * ${type.width}); + } + + @Override + public void reset() { + allocationSizeInBytes = INITIAL_VALUE_ALLOCATION; + allocationMonitor = 0; + zeroVector(); + super.reset(); + } + + private void allocateBytes(final long size) { + if (size > MAX_ALLOCATION_SIZE) { + throw new OversizedAllocationException("Requested amount of memory is more than max allowed allocation size"); + } + + final int curSize = (int)size; + clear(); + data = allocator.buffer(curSize); + data.readerIndex(0); + allocationSizeInBytes = curSize; + } + +/** + * Allocate new buffer with double capacity, and copy data into the new buffer. Replace vector's buffer with new buffer, and release old one + * + * @throws org.apache.arrow.memory.OutOfMemoryException if it can't allocate the new buffer + */ + public void reAlloc() { + final long newAllocationSize = allocationSizeInBytes * 2L; + if (newAllocationSize > MAX_ALLOCATION_SIZE) { + throw new OversizedAllocationException("Unable to expand the buffer. Max allowed buffer size is reached."); + } + + logger.debug("Reallocating vector [{}]. 
# of bytes: [{}] -> [{}]", field, allocationSizeInBytes, newAllocationSize); + final ArrowBuf newBuf = allocator.buffer((int)newAllocationSize); + newBuf.setBytes(0, data, 0, data.capacity()); + final int halfNewCapacity = newBuf.capacity() / 2; + newBuf.setZero(halfNewCapacity, halfNewCapacity); + newBuf.writerIndex(data.writerIndex()); + data.release(1); + data = newBuf; + allocationSizeInBytes = (int)newAllocationSize; + } + + /** + * {@inheritDoc} + */ + @Override + public void zeroVector() { + data.setZero(0, data.capacity()); + } + +// @Override +// public void load(SerializedField metadata, ArrowBuf buffer) { +// Preconditions.checkArgument(this.field.getPath().equals(metadata.getNamePart().getName()), "The field %s doesn't match the provided metadata %s.", this.field, metadata); +// final int actualLength = metadata.getBufferLength(); +// final int valueCount = metadata.getValueCount(); +// final int expectedLength = valueCount * ${type.width}; +// assert actualLength == expectedLength : String.format("Expected to load %d bytes but actually loaded %d bytes", expectedLength, actualLength); +// +// clear(); +// if (data != null) { +// data.release(1); +// } +// data = buffer.slice(0, actualLength); +// data.retain(1); +// data.writerIndex(actualLength); +// } + + public TransferPair getTransferPair(BufferAllocator allocator){ + return new TransferImpl(getField(), allocator); + } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator){ + return new TransferImpl(getField().withPath(ref), allocator); + } + + @Override + public TransferPair makeTransferPair(ValueVector to) { + return new TransferImpl((${minor.class}Vector) to); + } + + public void transferTo(${minor.class}Vector target){ + target.clear(); + target.data = data.transferOwnership(target.allocator).buffer; + target.data.writerIndex(data.writerIndex()); + clear(); + } + + public void splitAndTransferTo(int startIndex, int length, ${minor.class}Vector target) { + final int startPoint = startIndex * ${type.width}; + final int sliceLength = length * ${type.width}; + target.clear(); + target.data = data.slice(startPoint, sliceLength).transferOwnership(target.allocator).buffer; + target.data.writerIndex(sliceLength); + } + + private class TransferImpl implements TransferPair{ + private ${minor.class}Vector to; + + public TransferImpl(MaterializedField field, BufferAllocator allocator){ + to = new ${minor.class}Vector(field, allocator); + } + + public TransferImpl(${minor.class}Vector to) { + this.to = to; + } + + @Override + public ${minor.class}Vector getTo(){ + return to; + } + + @Override + public void transfer(){ + transferTo(to); + } + + @Override + public void splitAndTransfer(int startIndex, int length) { + splitAndTransferTo(startIndex, length, to); + } + + @Override + public void copyValueSafe(int fromIndex, int toIndex) { + to.copyFromSafe(fromIndex, toIndex, ${minor.class}Vector.this); + } + } + + public void copyFrom(int fromIndex, int thisIndex, ${minor.class}Vector from){ + <#if (type.width > 8)> + from.data.getBytes(fromIndex * ${type.width}, data, thisIndex * ${type.width}, ${type.width}); + <#else> <#-- type.width <= 8 --> + data.set${(minor.javaType!type.javaType)?cap_first}(thisIndex * ${type.width}, + from.data.get${(minor.javaType!type.javaType)?cap_first}(fromIndex * ${type.width}) + ); + <#-- type.width --> + } + + public void copyFromSafe(int fromIndex, int thisIndex, ${minor.class}Vector from){ + while(thisIndex >= getValueCapacity()) { + reAlloc(); + } + 
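// The loop above guarantees capacity for thisIndex, so the unchecked copy is safe. +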
copyFrom(fromIndex, thisIndex, from); + } + + public void decrementAllocationMonitor() { + if (allocationMonitor > 0) { + allocationMonitor = 0; + } + --allocationMonitor; + } + + private void incrementAllocationMonitor() { + ++allocationMonitor; + } + + public final class Accessor extends BaseDataValueVector.BaseAccessor { + @Override + public int getValueCount() { + return data.writerIndex() / ${type.width}; + } + + @Override + public boolean isNull(int index){ + return false; + } + + <#if (type.width > 8)> + + public ${minor.javaType!type.javaType} get(int index) { + return data.slice(index * ${type.width}, ${type.width}); + } + + <#if (minor.class == "Interval")> + public void get(int index, ${minor.class}Holder holder){ + + final int offsetIndex = index * ${type.width}; + holder.months = data.getInt(offsetIndex); + holder.days = data.getInt(offsetIndex + ${minor.daysOffset}); + holder.milliseconds = data.getInt(offsetIndex + ${minor.millisecondsOffset}); + } + + public void get(int index, Nullable${minor.class}Holder holder){ + final int offsetIndex = index * ${type.width}; + holder.isSet = 1; + holder.months = data.getInt(offsetIndex); + holder.days = data.getInt(offsetIndex + ${minor.daysOffset}); + holder.milliseconds = data.getInt(offsetIndex + ${minor.millisecondsOffset}); + } + + @Override + public ${friendlyType} getObject(int index) { + final int offsetIndex = index * ${type.width}; + final int months = data.getInt(offsetIndex); + final int days = data.getInt(offsetIndex + ${minor.daysOffset}); + final int millis = data.getInt(offsetIndex + ${minor.millisecondsOffset}); + final Period p = new Period(); + return p.plusMonths(months).plusDays(days).plusMillis(millis); + } + + public StringBuilder getAsStringBuilder(int index) { + + final int offsetIndex = index * ${type.width}; + + int months = data.getInt(offsetIndex); + final int days = data.getInt(offsetIndex + ${minor.daysOffset}); + int millis = data.getInt(offsetIndex + ${minor.millisecondsOffset}); + + final int years = (months / org.apache.arrow.vector.util.DateUtility.yearsToMonths); + months = (months % org.apache.arrow.vector.util.DateUtility.yearsToMonths); + + final int hours = millis / (org.apache.arrow.vector.util.DateUtility.hoursToMillis); + millis = millis % (org.apache.arrow.vector.util.DateUtility.hoursToMillis); + + final int minutes = millis / (org.apache.arrow.vector.util.DateUtility.minutesToMillis); + millis = millis % (org.apache.arrow.vector.util.DateUtility.minutesToMillis); + + final long seconds = millis / (org.apache.arrow.vector.util.DateUtility.secondsToMillis); + millis = millis % (org.apache.arrow.vector.util.DateUtility.secondsToMillis); + + final String yearString = (Math.abs(years) == 1) ? " year " : " years "; + final String monthString = (Math.abs(months) == 1) ? " month " : " months "; + final String dayString = (Math.abs(days) == 1) ? " day " : " days "; + + + return(new StringBuilder(). + append(years).append(yearString). + append(months).append(monthString). + append(days).append(dayString). + append(hours).append(":"). + append(minutes).append(":"). + append(seconds).append("."). 
+ append(millis)); + } + + <#elseif (minor.class == "IntervalDay")> + public void get(int index, ${minor.class}Holder holder){ + + final int offsetIndex = index * ${type.width}; + holder.days = data.getInt(offsetIndex); + holder.milliseconds = data.getInt(offsetIndex + ${minor.millisecondsOffset}); + } + + public void get(int index, Nullable${minor.class}Holder holder){ + final int offsetIndex = index * ${type.width}; + holder.isSet = 1; + holder.days = data.getInt(offsetIndex); + holder.milliseconds = data.getInt(offsetIndex + ${minor.millisecondsOffset}); + } + + @Override + public ${friendlyType} getObject(int index) { + final int offsetIndex = index * ${type.width}; + final int millis = data.getInt(offsetIndex + ${minor.millisecondsOffset}); + final int days = data.getInt(offsetIndex); + final Period p = new Period(); + return p.plusDays(days).plusMillis(millis); + } + + + public StringBuilder getAsStringBuilder(int index) { + final int offsetIndex = index * ${type.width}; + + int millis = data.getInt(offsetIndex + ${minor.millisecondsOffset}); + final int days = data.getInt(offsetIndex); + + final int hours = millis / (org.apache.arrow.vector.util.DateUtility.hoursToMillis); + millis = millis % (org.apache.arrow.vector.util.DateUtility.hoursToMillis); + + final int minutes = millis / (org.apache.arrow.vector.util.DateUtility.minutesToMillis); + millis = millis % (org.apache.arrow.vector.util.DateUtility.minutesToMillis); + + final int seconds = millis / (org.apache.arrow.vector.util.DateUtility.secondsToMillis); + millis = millis % (org.apache.arrow.vector.util.DateUtility.secondsToMillis); + + final String dayString = (Math.abs(days) == 1) ? " day " : " days "; + + return(new StringBuilder(). + append(days).append(dayString). + append(hours).append(":"). + append(minutes).append(":"). + append(seconds).append("."). 
+ append(millis)); + } + + <#elseif (minor.class == "Decimal28Sparse") || (minor.class == "Decimal38Sparse") || (minor.class == "Decimal28Dense") || (minor.class == "Decimal38Dense")> + + public void get(int index, ${minor.class}Holder holder) { + holder.start = index * ${type.width}; + holder.buffer = data; + holder.scale = getField().getScale(); + holder.precision = getField().getPrecision(); + } + + public void get(int index, Nullable${minor.class}Holder holder) { + holder.isSet = 1; + holder.start = index * ${type.width}; + holder.buffer = data; + holder.scale = getField().getScale(); + holder.precision = getField().getPrecision(); + } + + @Override + public ${friendlyType} getObject(int index) { + <#if (minor.class == "Decimal28Sparse") || (minor.class == "Decimal38Sparse")> + // Get the BigDecimal object + return org.apache.arrow.vector.util.DecimalUtility.getBigDecimalFromSparse(data, index * ${type.width}, ${minor.nDecimalDigits}, getField().getScale()); + <#else> + return org.apache.arrow.vector.util.DecimalUtility.getBigDecimalFromDense(data, index * ${type.width}, ${minor.nDecimalDigits}, getField().getScale(), ${minor.maxPrecisionDigits}, ${type.width}); + + } + + <#else> + public void get(int index, ${minor.class}Holder holder){ + holder.buffer = data; + holder.start = index * ${type.width}; + } + + public void get(int index, Nullable${minor.class}Holder holder){ + holder.isSet = 1; + holder.buffer = data; + holder.start = index * ${type.width}; + } + + @Override + public ${friendlyType} getObject(int index) { + return data.slice(index * ${type.width}, ${type.width}) + } + + + <#else> <#-- type.width <= 8 --> + + public ${minor.javaType!type.javaType} get(int index) { + return data.get${(minor.javaType!type.javaType)?cap_first}(index * ${type.width}); + } + + <#if type.width == 4> + public long getTwoAsLong(int index) { + return data.getLong(index * ${type.width}); + } + + + + <#if minor.class == "Date"> + @Override + public ${friendlyType} getObject(int index) { + org.joda.time.DateTime date = new org.joda.time.DateTime(get(index), org.joda.time.DateTimeZone.UTC); + date = date.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault()); + return date; + } + + <#elseif minor.class == "TimeStamp"> + @Override + public ${friendlyType} getObject(int index) { + org.joda.time.DateTime date = new org.joda.time.DateTime(get(index), org.joda.time.DateTimeZone.UTC); + date = date.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault()); + return date; + } + + <#elseif minor.class == "IntervalYear"> + @Override + public ${friendlyType} getObject(int index) { + + final int value = get(index); + + final int years = (value / org.apache.arrow.vector.util.DateUtility.yearsToMonths); + final int months = (value % org.apache.arrow.vector.util.DateUtility.yearsToMonths); + final Period p = new Period(); + return p.plusYears(years).plusMonths(months); + } + + public StringBuilder getAsStringBuilder(int index) { + + int months = data.getInt(index); + + final int years = (months / org.apache.arrow.vector.util.DateUtility.yearsToMonths); + months = (months % org.apache.arrow.vector.util.DateUtility.yearsToMonths); + + final String yearString = (Math.abs(years) == 1) ? " year " : " years "; + final String monthString = (Math.abs(months) == 1) ? " month " : " months "; + + return(new StringBuilder(). + append(years).append(yearString). 
+ append(months).append(monthString)); + } + + <#elseif minor.class == "Time"> + @Override + public DateTime getObject(int index) { + + org.joda.time.DateTime time = new org.joda.time.DateTime(get(index), org.joda.time.DateTimeZone.UTC); + time = time.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault()); + return time; + } + + <#elseif minor.class == "Decimal9" || minor.class == "Decimal18"> + @Override + public ${friendlyType} getObject(int index) { + + final BigInteger value = BigInteger.valueOf(((${type.boxedType})get(index)).${type.javaType}Value()); + return new BigDecimal(value, getField().getScale()); + } + + <#else> + @Override + public ${friendlyType} getObject(int index) { + return get(index); + } + public ${minor.javaType!type.javaType} getPrimitiveObject(int index) { + return get(index); + } + + + public void get(int index, ${minor.class}Holder holder){ + <#if minor.class.startsWith("Decimal")> + holder.scale = getField().getScale(); + holder.precision = getField().getPrecision(); + + + holder.value = data.get${(minor.javaType!type.javaType)?cap_first}(index * ${type.width}); + } + + public void get(int index, Nullable${minor.class}Holder holder){ + holder.isSet = 1; + holder.value = data.get${(minor.javaType!type.javaType)?cap_first}(index * ${type.width}); + } + + + <#-- type.width --> + } + + /** + * ${minor.class}.Mutator implements a mutable vector of fixed width values. Elements in the + * vector are accessed by position from the logical start of the vector. Values should be pushed + * onto the vector sequentially, but may be randomly accessed. + * The width of each element is ${type.width} byte(s) + * The equivalent Java primitive is '${minor.javaType!type.javaType}' + * + * NB: this class is automatically generated from ValueVectorTypes.tdd using FreeMarker. + */ + public final class Mutator extends BaseDataValueVector.BaseMutator { + + private Mutator(){}; + /** + * Set the element at the given index to the given value. Note that widths smaller than + * 32 bits are handled by the ArrowBuf interface. 
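+     *
+     * A hypothetical usage sketch of a generated fixed-width mutator (the concrete
+     * vector class and values here are illustrative only):
+     * <pre>{@code
+     * UInt4Vector.Mutator m = vector.getMutator();
+     * m.setSafe(0, 42);    // setSafe() grows the buffer as needed; set() assumes capacity
+     * m.setValueCount(1);  // finalize the number of values written
+     * }</pre>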
+ * + * @param index position of the bit to set + * @param value value to set + */ + <#if (type.width > 8)> + public void set(int index, <#if (type.width > 4)>${minor.javaType!type.javaType}<#else>int value) { + data.setBytes(index * ${type.width}, value, 0, ${type.width}); + } + + public void setSafe(int index, <#if (type.width > 4)>${minor.javaType!type.javaType}<#else>int value) { + while(index >= getValueCapacity()) { + reAlloc(); + } + data.setBytes(index * ${type.width}, value, 0, ${type.width}); + } + + <#if (minor.class == "Interval")> + public void set(int index, int months, int days, int milliseconds){ + final int offsetIndex = index * ${type.width}; + data.setInt(offsetIndex, months); + data.setInt((offsetIndex + ${minor.daysOffset}), days); + data.setInt((offsetIndex + ${minor.millisecondsOffset}), milliseconds); + } + + protected void set(int index, ${minor.class}Holder holder){ + set(index, holder.months, holder.days, holder.milliseconds); + } + + protected void set(int index, Nullable${minor.class}Holder holder){ + set(index, holder.months, holder.days, holder.milliseconds); + } + + public void setSafe(int index, int months, int days, int milliseconds){ + while(index >= getValueCapacity()) { + reAlloc(); + } + set(index, months, days, milliseconds); + } + + public void setSafe(int index, Nullable${minor.class}Holder holder){ + setSafe(index, holder.months, holder.days, holder.milliseconds); + } + + public void setSafe(int index, ${minor.class}Holder holder){ + setSafe(index, holder.months, holder.days, holder.milliseconds); + } + + <#elseif (minor.class == "IntervalDay")> + public void set(int index, int days, int milliseconds){ + final int offsetIndex = index * ${type.width}; + data.setInt(offsetIndex, days); + data.setInt((offsetIndex + ${minor.millisecondsOffset}), milliseconds); + } + + protected void set(int index, ${minor.class}Holder holder){ + set(index, holder.days, holder.milliseconds); + } + protected void set(int index, Nullable${minor.class}Holder holder){ + set(index, holder.days, holder.milliseconds); + } + + public void setSafe(int index, int days, int milliseconds){ + while(index >= getValueCapacity()) { + reAlloc(); + } + set(index, days, milliseconds); + } + + public void setSafe(int index, ${minor.class}Holder holder){ + setSafe(index, holder.days, holder.milliseconds); + } + + public void setSafe(int index, Nullable${minor.class}Holder holder){ + setSafe(index, holder.days, holder.milliseconds); + } + + <#elseif (minor.class == "Decimal28Sparse" || minor.class == "Decimal38Sparse") || (minor.class == "Decimal28Dense") || (minor.class == "Decimal38Dense")> + + public void set(int index, ${minor.class}Holder holder){ + set(index, holder.start, holder.buffer); + } + + void set(int index, Nullable${minor.class}Holder holder){ + set(index, holder.start, holder.buffer); + } + + public void setSafe(int index, Nullable${minor.class}Holder holder){ + setSafe(index, holder.start, holder.buffer); + } + public void setSafe(int index, ${minor.class}Holder holder){ + setSafe(index, holder.start, holder.buffer); + } + + public void setSafe(int index, int start, ArrowBuf buffer){ + while(index >= getValueCapacity()) { + reAlloc(); + } + set(index, start, buffer); + } + + public void set(int index, int start, ArrowBuf buffer){ + data.setBytes(index * ${type.width}, buffer, start, ${type.width}); + } + + <#else> + + protected void set(int index, ${minor.class}Holder holder){ + set(index, holder.start, holder.buffer); + } + + public void set(int index, 
Nullable${minor.class}Holder holder){
+     set(index, holder.start, holder.buffer);
+   }
+
+   public void set(int index, int start, ArrowBuf buffer){
+     data.setBytes(index * ${type.width}, buffer, start, ${type.width});
+   }
+
+   public void setSafe(int index, ${minor.class}Holder holder){
+     setSafe(index, holder.start, holder.buffer);
+   }
+
+   public void setSafe(int index, Nullable${minor.class}Holder holder){
+     setSafe(index, holder.start, holder.buffer);
+   }
+
+   public void setSafe(int index, int start, ArrowBuf buffer){
+     while(index >= getValueCapacity()) {
+       reAlloc();
+     }
+     set(index, start, buffer);
+   }
+
+   @Override
+   public void generateTestData(int count) {
+     setValueCount(count);
+     boolean even = true;
+     final int valueCount = getAccessor().getValueCount();
+     for(int i = 0; i < valueCount; i++, even = !even) {
+       final byte b = even ? Byte.MIN_VALUE : Byte.MAX_VALUE;
+       // write the test pattern byte-by-byte at each value's offset
+       for(int w = 0; w < ${type.width}; w++){
+         data.setByte(i * ${type.width} + w, b);
+       }
+     }
+   }
+
+   <#else> <#-- type.width <= 8 -->
+   public void set(int index, <#if (type.width >= 4)>${minor.javaType!type.javaType}<#else>int value) {
+     data.set${(minor.javaType!type.javaType)?cap_first}(index * ${type.width}, value);
+   }
+
+   public void setSafe(int index, <#if (type.width >= 4)>${minor.javaType!type.javaType}<#else>int value) {
+     while(index >= getValueCapacity()) {
+       reAlloc();
+     }
+     set(index, value);
+   }
+
+   protected void set(int index, ${minor.class}Holder holder){
+     data.set${(minor.javaType!type.javaType)?cap_first}(index * ${type.width}, holder.value);
+   }
+
+   public void setSafe(int index, ${minor.class}Holder holder){
+     while(index >= getValueCapacity()) {
+       reAlloc();
+     }
+     set(index, holder);
+   }
+
+   protected void set(int index, Nullable${minor.class}Holder holder){
+     data.set${(minor.javaType!type.javaType)?cap_first}(index * ${type.width}, holder.value);
+   }
+
+   public void setSafe(int index, Nullable${minor.class}Holder holder){
+     while(index >= getValueCapacity()) {
+       reAlloc();
+     }
+     set(index, holder);
+   }
+
+   @Override
+   public void generateTestData(int size) {
+     setValueCount(size);
+     boolean even = true;
+     final int valueCount = getAccessor().getValueCount();
+     for(int i = 0; i < valueCount; i++, even = !even) {
+       if(even){
+         set(i, ${minor.boxedType!type.boxedType}.MIN_VALUE);
+       }else{
+         set(i, ${minor.boxedType!type.boxedType}.MAX_VALUE);
+       }
+     }
+   }
+
+   public void generateTestDataAlt(int size) {
+     setValueCount(size);
+     boolean even = true;
+     final int valueCount = getAccessor().getValueCount();
+     for(int i = 0; i < valueCount; i++, even = !even) {
+       if(even){
+         set(i, (${(minor.javaType!type.javaType)}) 1);
+       }else{
+         set(i, (${(minor.javaType!type.javaType)}) 0);
+       }
+     }
+   }
+
+   <#-- type.width -->
+
+   @Override
+   public void setValueCount(int valueCount) {
+     final int currentValueCapacity = getValueCapacity();
+     final int idx = (${type.width} * valueCount);
+     while(valueCount > getValueCapacity()) {
+       reAlloc();
+     }
+     if (valueCount > 0 && currentValueCapacity > valueCount * 2) {
+       incrementAllocationMonitor();
+     } else if (allocationMonitor > 0) {
+       allocationMonitor = 0;
+     }
+     VectorTrimmer.trim(data, idx);
+     data.writerIndex(valueCount * ${type.width});
+   }
+ }
+}
+
+ <#-- type.major -->
+
+
diff --git a/java/vector/src/main/codegen/templates/HolderReaderImpl.java b/java/vector/src/main/codegen/templates/HolderReaderImpl.java
new file mode 100644
index 
0000000000000..3005fca0385aa
--- /dev/null
+++ b/java/vector/src/main/codegen/templates/HolderReaderImpl.java
@@ -0,0 +1,290 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+<@pp.dropOutputFile />
+<#list vv.types as type>
+<#list type.minor as minor>
+<#list ["", "Nullable", "Repeated"] as holderMode>
+<#assign nullMode = holderMode />
+<#if holderMode == "Repeated"><#assign nullMode = "Nullable" />
+
+<#assign lowerName = minor.class?uncap_first />
+<#if lowerName == "int" ><#assign lowerName = "integer" />
+<#assign name = minor.class?cap_first />
+<#assign javaType = (minor.javaType!type.javaType) />
+<#assign friendlyType = (minor.friendlyType!minor.boxedType!type.boxedType) />
+<#assign safeType=friendlyType />
+<#if safeType=="byte[]"><#assign safeType="ByteArray" />
+<#assign fields = minor.fields!type.fields />
+
+<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/${holderMode}${name}HolderReaderImpl.java" />
+<#include "/@includes/license.ftl" />
+
+package org.apache.arrow.vector.complex.impl;
+
+<#include "/@includes/vv_imports.ftl" />
+
+import java.math.BigDecimal;
+import java.math.BigInteger;
+
+import org.joda.time.Period;
+
+// Source code generated using FreeMarker template ${.template_name}
+
+@SuppressWarnings("unused")
+public class ${holderMode}${name}HolderReaderImpl extends AbstractFieldReader {
+
+  private ${nullMode}${name}Holder holder;
+<#if holderMode == "Repeated" >
+  private int index = -1;
+  private ${holderMode}${name}Holder repeatedHolder;
+
+
+  public ${holderMode}${name}HolderReaderImpl(${holderMode}${name}Holder holder) {
+<#if holderMode == "Repeated" >
+    this.holder = new ${nullMode}${name}Holder();
+    this.repeatedHolder = holder;
+<#else>
+    this.holder = holder;
+
+  }
+
+  @Override
+  public int size() {
+<#if holderMode == "Repeated">
+    return repeatedHolder.end - repeatedHolder.start;
+<#else>
+    throw new UnsupportedOperationException("You can't call size on a Holder value reader.");
+
+  }
+
+  @Override
+  public boolean next() {
+<#if holderMode == "Repeated">
+    // advance within [start, end), reading the next element into the scalar holder
+    if(repeatedHolder.start + index + 1 < repeatedHolder.end) {
+      index++;
+      repeatedHolder.vector.getAccessor().get(repeatedHolder.start + index, holder);
+      return true;
+    } else {
+      return false;
+    }
+<#else>
+    throw new UnsupportedOperationException("You can't call next on a single value reader.");
+
+
+  }
+
+  @Override
+  public void setPosition(int index) {
+    throw new UnsupportedOperationException("You can't call setPosition on a holder value reader.");
+  }
+
+  @Override
+  public MajorType getType() {
+<#if holderMode == "Repeated">
+    return this.repeatedHolder.TYPE;
+<#else>
+    return this.holder.TYPE;
+
+  }
+
+  @Override
+  public boolean isSet() {
+    <#if holderMode == "Repeated">
+    return this.repeatedHolder.end != this.repeatedHolder.start;
<#elseif nullMode == "Nullable"> + return this.holder.isSet == 1; + <#else> + return true; + + + } + +<#if holderMode != "Repeated"> +@Override + public void read(${name}Holder h) { + <#list fields as field> + h.${field.name} = holder.${field.name}; + + } + + @Override + public void read(Nullable${name}Holder h) { + <#list fields as field> + h.${field.name} = holder.${field.name}; + + h.isSet = isSet() ? 1 : 0; + } + + +<#if holderMode == "Repeated"> + @Override + public ${friendlyType} read${safeType}(int index){ + repeatedHolder.vector.getAccessor().get(repeatedHolder.start + index, holder); + ${friendlyType} value = read${safeType}(); + if (this.index > -1) { + repeatedHolder.vector.getAccessor().get(repeatedHolder.start + this.index, holder); + } + return value; + } + + + @Override + public ${friendlyType} read${safeType}(){ +<#if nullMode == "Nullable"> + if (!isSet()) { + return null; + } + + +<#if type.major == "VarLen"> + + int length = holder.end - holder.start; + byte[] value = new byte [length]; + holder.buffer.getBytes(holder.start, value, 0, length); + +<#if minor.class == "VarBinary"> + return value; +<#elseif minor.class == "Var16Char"> + return new String(value); +<#elseif minor.class == "VarChar"> + Text text = new Text(); + text.set(value); + return text; + + +<#elseif minor.class == "Interval"> + Period p = new Period(); + return p.plusMonths(holder.months).plusDays(holder.days).plusMillis(holder.milliseconds); + +<#elseif minor.class == "IntervalDay"> + Period p = new Period(); + return p.plusDays(holder.days).plusMillis(holder.milliseconds); + +<#elseif minor.class == "Decimal9" || + minor.class == "Decimal18" > + BigInteger value = BigInteger.valueOf(holder.value); + return new BigDecimal(value, holder.scale); + +<#elseif minor.class == "Decimal28Dense" || + minor.class == "Decimal38Dense"> + return org.apache.arrow.vector.util.DecimalUtility.getBigDecimalFromDense(holder.buffer, + holder.start, + holder.nDecimalDigits, + holder.scale, + holder.maxPrecision, + holder.WIDTH); + +<#elseif minor.class == "Decimal28Sparse" || + minor.class == "Decimal38Sparse"> + return org.apache.arrow.vector.util.DecimalUtility.getBigDecimalFromSparse(holder.buffer, + holder.start, + holder.nDecimalDigits, + holder.scale); + +<#elseif minor.class == "Bit" > + return new Boolean(holder.value != 0); +<#else> + ${friendlyType} value = new ${friendlyType}(this.holder.value); + return value; + + + } + + @Override + public Object readObject() { +<#if holderMode == "Repeated" > + List valList = Lists.newArrayList(); + for (int i = repeatedHolder.start; i < repeatedHolder.end; i++) { + valList.add(repeatedHolder.vector.getAccessor().getObject(i)); + } + return valList; +<#else> + return readSingleObject(); + + } + + private Object readSingleObject() { +<#if nullMode == "Nullable"> + if (!isSet()) { + return null; + } + + +<#if type.major == "VarLen"> + int length = holder.end - holder.start; + byte[] value = new byte [length]; + holder.buffer.getBytes(holder.start, value, 0, length); + +<#if minor.class == "VarBinary"> + return value; +<#elseif minor.class == "Var16Char"> + return new String(value); +<#elseif minor.class == "VarChar"> + Text text = new Text(); + text.set(value); + return text; + + +<#elseif minor.class == "Interval"> + Period p = new Period(); + return p.plusMonths(holder.months).plusDays(holder.days).plusMillis(holder.milliseconds); + +<#elseif minor.class == "IntervalDay"> + Period p = new Period(); + return p.plusDays(holder.days).plusMillis(holder.milliseconds); + +<#elseif 
minor.class == "Decimal9" || + minor.class == "Decimal18" > + BigInteger value = BigInteger.valueOf(holder.value); + return new BigDecimal(value, holder.scale); + +<#elseif minor.class == "Decimal28Dense" || + minor.class == "Decimal38Dense"> + return org.apache.arrow.vector.util.DecimalUtility.getBigDecimalFromDense(holder.buffer, + holder.start, + holder.nDecimalDigits, + holder.scale, + holder.maxPrecision, + holder.WIDTH); + +<#elseif minor.class == "Decimal28Sparse" || + minor.class == "Decimal38Sparse"> + return org.apache.arrow.vector.util.DecimalUtility.getBigDecimalFromSparse(holder.buffer, + holder.start, + holder.nDecimalDigits, + holder.scale); + +<#elseif minor.class == "Bit" > + return new Boolean(holder.value != 0); +<#else> + ${friendlyType} value = new ${friendlyType}(this.holder.value); + return value; + + } + +<#if holderMode != "Repeated" && nullMode != "Nullable"> + public void copyAsValue(${minor.class?cap_first}Writer writer){ + writer.write(holder); + } + +} + + + + diff --git a/java/vector/src/main/codegen/templates/ListWriters.java b/java/vector/src/main/codegen/templates/ListWriters.java new file mode 100644 index 0000000000000..cf9fa30fa4784 --- /dev/null +++ b/java/vector/src/main/codegen/templates/ListWriters.java @@ -0,0 +1,234 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +<@pp.dropOutputFile /> + +<#list ["Single", "Repeated"] as mode> +<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/${mode}ListWriter.java" /> + + +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.complex.impl; +<#if mode == "Single"> + <#assign containerClass = "AbstractContainerVector" /> + <#assign index = "idx()"> +<#else> + <#assign containerClass = "RepeatedListVector" /> + <#assign index = "currentChildIndex"> + + + +<#include "/@includes/vv_imports.ftl" /> + +/* + * This class is generated using FreeMarker and the ${.template_name} template. 
+ */ +@SuppressWarnings("unused") +public class ${mode}ListWriter extends AbstractFieldWriter { + private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(${mode}ListWriter.class); + + static enum Mode { INIT, IN_MAP, IN_LIST <#list vv.types as type><#list type.minor as minor>, IN_${minor.class?upper_case} } + + private final String name; + protected final ${containerClass} container; + private Mode mode = Mode.INIT; + private FieldWriter writer; + protected RepeatedValueVector innerVector; + + <#if mode == "Repeated">private int currentChildIndex = 0; + public ${mode}ListWriter(String name, ${containerClass} container, FieldWriter parent){ + super(parent); + this.name = name; + this.container = container; + } + + public ${mode}ListWriter(${containerClass} container, FieldWriter parent){ + super(parent); + this.name = null; + this.container = container; + } + + @Override + public void allocate() { + if(writer != null) { + writer.allocate(); + } + + <#if mode == "Repeated"> + container.allocateNew(); + + } + + @Override + public void clear() { + if (writer != null) { + writer.clear(); + } + } + + @Override + public void close() { + clear(); + container.close(); + if (innerVector != null) { + innerVector.close(); + } + } + + @Override + public int getValueCapacity() { + return innerVector == null ? 0 : innerVector.getValueCapacity(); + } + + public void setValueCount(int count){ + if(innerVector != null) innerVector.getMutator().setValueCount(count); + } + + @Override + public MapWriter map() { + switch(mode) { + case INIT: + int vectorCount = container.size(); + final RepeatedMapVector vector = container.addOrGet(name, RepeatedMapVector.TYPE, RepeatedMapVector.class); + innerVector = vector; + writer = new RepeatedMapWriter(vector, this); + if(vectorCount != container.size()) { + writer.allocate(); + } + writer.setPosition(${index}); + mode = Mode.IN_MAP; + return writer; + case IN_MAP: + return writer; + } + + throw new RuntimeException(getUnsupportedErrorMsg("MAP", mode.name())); + + } + + @Override + public ListWriter list() { + switch(mode) { + case INIT: + final int vectorCount = container.size(); + final RepeatedListVector vector = container.addOrGet(name, RepeatedListVector.TYPE, RepeatedListVector.class); + innerVector = vector; + writer = new RepeatedListWriter(null, vector, this); + if(vectorCount != container.size()) { + writer.allocate(); + } + writer.setPosition(${index}); + mode = Mode.IN_LIST; + return writer; + case IN_LIST: + return writer; + } + + throw new RuntimeException(getUnsupportedErrorMsg("LIST", mode.name())); + + } + + <#list vv.types as type><#list type.minor as minor> + <#assign lowerName = minor.class?uncap_first /> + <#assign upperName = minor.class?upper_case /> + <#assign capName = minor.class?cap_first /> + <#if lowerName == "int" ><#assign lowerName = "integer" /> + + private static final MajorType ${upperName}_TYPE = Types.repeated(MinorType.${upperName}); + + @Override + public ${capName}Writer ${lowerName}() { + switch(mode) { + case INIT: + final int vectorCount = container.size(); + final Repeated${capName}Vector vector = container.addOrGet(name, ${upperName}_TYPE, Repeated${capName}Vector.class); + innerVector = vector; + writer = new Repeated${capName}WriterImpl(vector, this); + if(vectorCount != container.size()) { + writer.allocate(); + } + writer.setPosition(${index}); + mode = Mode.IN_${upperName}; + return writer; + case IN_${upperName}: + return writer; + } + + throw new 
RuntimeException(getUnsupportedErrorMsg("${upperName}", mode.name())); + + } + + + public MaterializedField getField() { + return container.getField(); + } + + <#if mode == "Repeated"> + + public void startList() { + final RepeatedListVector list = (RepeatedListVector) container; + final RepeatedListVector.RepeatedMutator mutator = list.getMutator(); + + // make sure that the current vector can support the end position of this list. + if(container.getValueCapacity() <= idx()) { + mutator.setValueCount(idx()+1); + } + + // update the repeated vector to state that there is current+1 objects. + final RepeatedListHolder h = new RepeatedListHolder(); + list.getAccessor().get(idx(), h); + if (h.start >= h.end) { + mutator.startNewValue(idx()); + } + currentChildIndex = container.getMutator().add(idx()); + if(writer != null) { + writer.setPosition(currentChildIndex); + } + } + + public void endList() { + // noop, we initialize state at start rather than end. + } + <#else> + + public void setPosition(int index) { + super.setPosition(index); + if(writer != null) { + writer.setPosition(index); + } + } + + public void startList() { + // noop + } + + public void endList() { + // noop + } + + + private String getUnsupportedErrorMsg(String expected, String found) { + final String f = found.substring(3); + return String.format("In a list of type %s, encountered a value of type %s. "+ + "Drill does not support lists of different types.", + f, expected + ); + } +} + diff --git a/java/vector/src/main/codegen/templates/MapWriters.java b/java/vector/src/main/codegen/templates/MapWriters.java new file mode 100644 index 0000000000000..7001367bb3774 --- /dev/null +++ b/java/vector/src/main/codegen/templates/MapWriters.java @@ -0,0 +1,240 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +<@pp.dropOutputFile /> +<#list ["Single", "Repeated"] as mode> +<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/${mode}MapWriter.java" /> +<#if mode == "Single"> +<#assign containerClass = "MapVector" /> +<#assign index = "idx()"> +<#else> +<#assign containerClass = "RepeatedMapVector" /> +<#assign index = "currentChildIndex"> + + +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.complex.impl; + +<#include "/@includes/vv_imports.ftl" /> +import java.util.Map; + +import org.apache.arrow.vector.holders.RepeatedMapHolder; +import org.apache.arrow.vector.AllocationHelper; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.complex.writer.FieldWriter; + +import com.google.common.collect.Maps; + +/* + * This class is generated using FreeMarker and the ${.template_name} template. 
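+ * It produces a SingleMapWriter and a RepeatedMapWriter; each writer lazily creates
+ * and caches one child FieldWriter per field name in the fields map below.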
+ */
+@SuppressWarnings("unused")
+public class ${mode}MapWriter extends AbstractFieldWriter {
+
+  protected final ${containerClass} container;
+  private final Map<String, FieldWriter> fields = Maps.newHashMap();
+  <#if mode == "Repeated">private int currentChildIndex = 0;
+
+  private final boolean unionEnabled;
+
+  public ${mode}MapWriter(${containerClass} container, FieldWriter parent, boolean unionEnabled) {
+    super(parent);
+    this.container = container;
+    this.unionEnabled = unionEnabled;
+  }
+
+  public ${mode}MapWriter(${containerClass} container, FieldWriter parent) {
+    this(container, parent, false);
+  }
+
+  @Override
+  public int getValueCapacity() {
+    return container.getValueCapacity();
+  }
+
+  @Override
+  public boolean isEmptyMap() {
+    return 0 == container.size();
+  }
+
+  @Override
+  public MaterializedField getField() {
+    return container.getField();
+  }
+
+  @Override
+  public MapWriter map(String name) {
+    FieldWriter writer = fields.get(name.toLowerCase());
+    if(writer == null){
+      int vectorCount = container.size();
+      MapVector vector = container.addOrGet(name, MapVector.TYPE, MapVector.class);
+      if(!unionEnabled){
+        writer = new SingleMapWriter(vector, this);
+      } else {
+        writer = new PromotableWriter(vector, container);
+      }
+      if(vectorCount != container.size()) {
+        writer.allocate();
+      }
+      writer.setPosition(${index});
+      fields.put(name.toLowerCase(), writer);
+    }
+    return writer;
+  }
+
+  @Override
+  public void close() throws Exception {
+    clear();
+    container.close();
+  }
+
+  @Override
+  public void allocate() {
+    container.allocateNew();
+    for(final FieldWriter w : fields.values()) {
+      w.allocate();
+    }
+  }
+
+  @Override
+  public void clear() {
+    container.clear();
+    for(final FieldWriter w : fields.values()) {
+      w.clear();
+    }
+  }
+
+  @Override
+  public ListWriter list(String name) {
+    FieldWriter writer = fields.get(name.toLowerCase());
+    int vectorCount = container.size();
+    if(writer == null) {
+      if (!unionEnabled){
+        writer = new SingleListWriter(name, container, this);
+      } else{
+        writer = new PromotableWriter(container.addOrGet(name, Types.optional(MinorType.LIST), ListVector.class), container);
+      }
+      if (container.size() > vectorCount) {
+        writer.allocate();
+      }
+      writer.setPosition(${index});
+      fields.put(name.toLowerCase(), writer);
+    }
+    return writer;
+  }
+
+  <#if mode == "Repeated">
+  public void start() {
+    // update the repeated vector to state that there are current + 1 objects.
+    final RepeatedMapHolder h = new RepeatedMapHolder();
+    final RepeatedMapVector map = (RepeatedMapVector) container;
+    final RepeatedMapVector.Mutator mutator = map.getMutator();
+
+    // Make sure that the current vector can support the end position of this list.
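+    // (If this record index has not been written yet, extend the value count so the
+    // offsets bookkeeping covers it before the holder is read below.)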
+ if(container.getValueCapacity() <= idx()) { + mutator.setValueCount(idx()+1); + } + + map.getAccessor().get(idx(), h); + if (h.start >= h.end) { + container.getMutator().startNewValue(idx()); + } + currentChildIndex = container.getMutator().add(idx()); + for(final FieldWriter w : fields.values()) { + w.setPosition(currentChildIndex); + } + } + + + public void end() { + // noop + } + <#else> + + public void setValueCount(int count) { + container.getMutator().setValueCount(count); + } + + @Override + public void setPosition(int index) { + super.setPosition(index); + for(final FieldWriter w: fields.values()) { + w.setPosition(index); + } + } + + @Override + public void start() { + } + + @Override + public void end() { + } + + + + <#list vv.types as type><#list type.minor as minor> + <#assign lowerName = minor.class?uncap_first /> + <#if lowerName == "int" ><#assign lowerName = "integer" /> + <#assign upperName = minor.class?upper_case /> + <#assign capName = minor.class?cap_first /> + <#assign vectName = capName /> + <#assign vectName = "Nullable${capName}" /> + + <#if minor.class?starts_with("Decimal") > + public ${minor.class}Writer ${lowerName}(String name) { + // returns existing writer + final FieldWriter writer = fields.get(name.toLowerCase()); + assert writer != null; + return writer; + } + + public ${minor.class}Writer ${lowerName}(String name, int scale, int precision) { + final MajorType ${upperName}_TYPE = new MajorType(MinorType.${upperName}, DataMode.OPTIONAL, scale, precision, null, null); + <#else> + private static final MajorType ${upperName}_TYPE = Types.optional(MinorType.${upperName}); + @Override + public ${minor.class}Writer ${lowerName}(String name) { + + FieldWriter writer = fields.get(name.toLowerCase()); + if(writer == null) { + ValueVector vector; + ValueVector currentVector = container.getChild(name); + if (unionEnabled){ + ${vectName}Vector v = container.addOrGet(name, ${upperName}_TYPE, ${vectName}Vector.class); + writer = new PromotableWriter(v, container); + vector = v; + } else { + ${vectName}Vector v = container.addOrGet(name, ${upperName}_TYPE, ${vectName}Vector.class); + writer = new ${vectName}WriterImpl(v, this); + vector = v; + } + if (currentVector == null || currentVector != vector) { + vector.allocateNewSafe(); + } + writer.setPosition(${index}); + fields.put(name.toLowerCase(), writer); + } + return writer; + } + + + +} + diff --git a/java/vector/src/main/codegen/templates/NullReader.java b/java/vector/src/main/codegen/templates/NullReader.java new file mode 100644 index 0000000000000..3ef6c7dcc49a6 --- /dev/null +++ b/java/vector/src/main/codegen/templates/NullReader.java @@ -0,0 +1,138 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +<@pp.dropOutputFile /> +<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/NullReader.java" /> + + +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.complex.impl; + +<#include "/@includes/vv_imports.ftl" /> + + +@SuppressWarnings("unused") +public class NullReader extends AbstractBaseReader implements FieldReader{ + + public static final NullReader INSTANCE = new NullReader(); + public static final NullReader EMPTY_LIST_INSTANCE = new NullReader(Types.repeated(MinorType.NULL)); + public static final NullReader EMPTY_MAP_INSTANCE = new NullReader(Types.required(MinorType.MAP)); + private MajorType type; + + private NullReader(){ + super(); + type = Types.required(MinorType.NULL); + } + + private NullReader(MajorType type){ + super(); + this.type = type; + } + + @Override + public MajorType getType() { + return type; + } + + public void copyAsValue(MapWriter writer) {} + + public void copyAsValue(ListWriter writer) {} + + public void copyAsValue(UnionWriter writer) {} + + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> + public void read(${name}Holder holder){ + throw new UnsupportedOperationException("NullReader cannot write into non-nullable holder"); + } + + public void read(Nullable${name}Holder holder){ + holder.isSet = 0; + } + + public void read(int arrayIndex, ${name}Holder holder){ + throw new ArrayIndexOutOfBoundsException(); + } + + public void copyAsValue(${minor.class}Writer writer){} + public void copyAsField(String name, ${minor.class}Writer writer){} + + public void read(int arrayIndex, Nullable${name}Holder holder){ + throw new ArrayIndexOutOfBoundsException(); + } + + + public int size(){ + return 0; + } + + public boolean isSet(){ + return false; + } + + public boolean next(){ + return false; + } + + public RepeatedMapReader map(){ + return this; + } + + public RepeatedListReader list(){ + return this; + } + + public MapReader map(String name){ + return this; + } + + public ListReader list(String name){ + return this; + } + + public FieldReader reader(String name){ + return this; + } + + public FieldReader reader(){ + return this; + } + + private void fail(String name){ + throw new IllegalArgumentException(String.format("You tried to read a %s type when you are using a ValueReader of type %s.", name, this.getClass().getSimpleName())); + } + + <#list ["Object", "BigDecimal", "Integer", "Long", "Boolean", + "Character", "DateTime", "Period", "Double", "Float", + "Text", "String", "Byte", "Short", "byte[]"] as friendlyType> + <#assign safeType=friendlyType /> + <#if safeType=="byte[]"><#assign safeType="ByteArray" /> + + public ${friendlyType} read${safeType}(int arrayIndex){ + return null; + } + + public ${friendlyType} read${safeType}(){ + return null; + } + + +} + + + diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java new file mode 100644 index 0000000000000..6893a25efbe18 --- /dev/null +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -0,0 +1,630 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +<@pp.dropOutputFile /> +<#list vv.types as type> +<#list type.minor as minor> + +<#assign className = "Nullable${minor.class}Vector" /> +<#assign valuesName = "${minor.class}Vector" /> +<#assign friendlyType = (minor.friendlyType!minor.boxedType!type.boxedType) /> + +<@pp.changeOutputFile name="/org/apache/arrow/vector/${className}.java" /> + +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector; + +<#include "/@includes/vv_imports.ftl" /> + +/** + * Nullable${minor.class} implements a vector of values which could be null. Elements in the vector + * are first checked against a fixed length vector of boolean values. Then the element is retrieved + * from the base class (if not null). + * + * NB: this class is automatically generated from ${.template_name} and ValueVectorTypes.tdd using FreeMarker. + */ +@SuppressWarnings("unused") +public final class ${className} extends BaseDataValueVector implements <#if type.major == "VarLen">VariableWidth<#else>FixedWidthVector, NullableVector{ + private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(${className}.class); + + private final FieldReader reader = new Nullable${minor.class}ReaderImpl(Nullable${minor.class}Vector.this); + + private final MaterializedField bitsField = MaterializedField.create("$bits$", new MajorType(MinorType.UINT1, DataMode.REQUIRED)); + private final UInt1Vector bits = new UInt1Vector(bitsField, allocator); + private final ${valuesName} values = new ${minor.class}Vector(field, allocator); + + private final Mutator mutator = new Mutator(); + private final Accessor accessor = new Accessor(); + + public ${className}(MaterializedField field, BufferAllocator allocator) { + super(field, allocator); + } + + @Override + public FieldReader getReader(){ + return reader; + } + + @Override + public int getValueCapacity(){ + return Math.min(bits.getValueCapacity(), values.getValueCapacity()); + } + + @Override + public ArrowBuf[] getBuffers(boolean clear) { + final ArrowBuf[] buffers = ObjectArrays.concat(bits.getBuffers(false), values.getBuffers(false), ArrowBuf.class); + if (clear) { + for (final ArrowBuf buffer:buffers) { + buffer.retain(1); + } + clear(); + } + return buffers; + } + + @Override + public void close() { + bits.close(); + values.close(); + super.close(); + } + + @Override + public void clear() { + bits.clear(); + values.clear(); + super.clear(); + } + + @Override + public int getBufferSize(){ + return values.getBufferSize() + bits.getBufferSize(); + } + + @Override + public int getBufferSizeFor(final int valueCount) { + if (valueCount == 0) { + return 0; + } + + return values.getBufferSizeFor(valueCount) + + bits.getBufferSizeFor(valueCount); + } + + @Override + public ArrowBuf getBuffer() { + return values.getBuffer(); + } + + @Override + public ${valuesName} getValuesVector() { + return values; + } + + @Override + public void setInitialCapacity(int numRecords) { + bits.setInitialCapacity(numRecords); + values.setInitialCapacity(numRecords); + } + +// @Override +// public SerializedField.Builder getMetadataBuilder() { +// return super.getMetadataBuilder() 
+// .addChild(bits.getMetadata()) +// .addChild(values.getMetadata()); +// } + + @Override + public void allocateNew() { + if(!allocateNewSafe()){ + throw new OutOfMemoryException("Failure while allocating buffer."); + } + } + + @Override + public boolean allocateNewSafe() { + /* Boolean to keep track if all the memory allocations were successful + * Used in the case of composite vectors when we need to allocate multiple + * buffers for multiple vectors. If one of the allocations failed we need to + * clear all the memory that we allocated + */ + boolean success = false; + try { + success = values.allocateNewSafe() && bits.allocateNewSafe(); + } finally { + if (!success) { + clear(); + } + } + bits.zeroVector(); + mutator.reset(); + accessor.reset(); + return success; + } + + <#if type.major == "VarLen"> + @Override + public void allocateNew(int totalBytes, int valueCount) { + try { + values.allocateNew(totalBytes, valueCount); + bits.allocateNew(valueCount); + } catch(RuntimeException e) { + clear(); + throw e; + } + bits.zeroVector(); + mutator.reset(); + accessor.reset(); + } + + public void reset() { + bits.zeroVector(); + mutator.reset(); + accessor.reset(); + super.reset(); + } + + @Override + public int getByteCapacity(){ + return values.getByteCapacity(); + } + + @Override + public int getCurrentSizeInBytes(){ + return values.getCurrentSizeInBytes(); + } + + <#else> + @Override + public void allocateNew(int valueCount) { + try { + values.allocateNew(valueCount); + bits.allocateNew(valueCount+1); + } catch(OutOfMemoryException e) { + clear(); + throw e; + } + bits.zeroVector(); + mutator.reset(); + accessor.reset(); + } + + @Override + public void reset() { + bits.zeroVector(); + mutator.reset(); + accessor.reset(); + super.reset(); + } + + /** + * {@inheritDoc} + */ + @Override + public void zeroVector() { + bits.zeroVector(); + values.zeroVector(); + } + + + +// @Override +// public void load(SerializedField metadata, ArrowBuf buffer) { +// clear(); + // the bits vector is the first child (the order in which the children are added in getMetadataBuilder is significant) +// final SerializedField bitsField = metadata.getChild(0); +// bits.load(bitsField, buffer); +// +// final int capacity = buffer.capacity(); +// final int bitsLength = bitsField.getBufferLength(); +// final SerializedField valuesField = metadata.getChild(1); +// values.load(valuesField, buffer.slice(bitsLength, capacity - bitsLength)); +// } + + @Override + public TransferPair getTransferPair(BufferAllocator allocator){ + return new TransferImpl(getField(), allocator); + } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator){ + return new TransferImpl(getField().withPath(ref), allocator); + } + + @Override + public TransferPair makeTransferPair(ValueVector to) { + return new TransferImpl((Nullable${minor.class}Vector) to); + } + + public void transferTo(Nullable${minor.class}Vector target){ + bits.transferTo(target.bits); + values.transferTo(target.values); + <#if type.major == "VarLen"> + target.mutator.lastSet = mutator.lastSet; + + clear(); + } + + public void splitAndTransferTo(int startIndex, int length, Nullable${minor.class}Vector target) { + bits.splitAndTransferTo(startIndex, length, target.bits); + values.splitAndTransferTo(startIndex, length, target.values); + <#if type.major == "VarLen"> + target.mutator.lastSet = length - 1; + + } + + private class TransferImpl implements TransferPair { + Nullable${minor.class}Vector to; + + public TransferImpl(MaterializedField 
field, BufferAllocator allocator){ + to = new Nullable${minor.class}Vector(field, allocator); + } + + public TransferImpl(Nullable${minor.class}Vector to){ + this.to = to; + } + + @Override + public Nullable${minor.class}Vector getTo(){ + return to; + } + + @Override + public void transfer(){ + transferTo(to); + } + + @Override + public void splitAndTransfer(int startIndex, int length) { + splitAndTransferTo(startIndex, length, to); + } + + @Override + public void copyValueSafe(int fromIndex, int toIndex) { + to.copyFromSafe(fromIndex, toIndex, Nullable${minor.class}Vector.this); + } + } + + @Override + public Accessor getAccessor(){ + return accessor; + } + + @Override + public Mutator getMutator(){ + return mutator; + } + + public ${minor.class}Vector convertToRequiredVector(){ + ${minor.class}Vector v = new ${minor.class}Vector(getField().getOtherNullableVersion(), allocator); + if (v.data != null) { + v.data.release(1); + } + v.data = values.data; + v.data.retain(1); + clear(); + return v; + } + + public void copyFrom(int fromIndex, int thisIndex, Nullable${minor.class}Vector from){ + final Accessor fromAccessor = from.getAccessor(); + if (!fromAccessor.isNull(fromIndex)) { + mutator.set(thisIndex, fromAccessor.get(fromIndex)); + } + } + + public void copyFromSafe(int fromIndex, int thisIndex, ${minor.class}Vector from){ + <#if type.major == "VarLen"> + mutator.fillEmpties(thisIndex); + + values.copyFromSafe(fromIndex, thisIndex, from); + bits.getMutator().setSafe(thisIndex, 1); + } + + public void copyFromSafe(int fromIndex, int thisIndex, Nullable${minor.class}Vector from){ + <#if type.major == "VarLen"> + mutator.fillEmpties(thisIndex); + + bits.copyFromSafe(fromIndex, thisIndex, from.bits); + values.copyFromSafe(fromIndex, thisIndex, from.values); + } + + public final class Accessor extends BaseDataValueVector.BaseAccessor <#if type.major = "VarLen">implements VariableWidthVector.VariableWidthAccessor { + final UInt1Vector.Accessor bAccessor = bits.getAccessor(); + final ${valuesName}.Accessor vAccessor = values.getAccessor(); + + /** + * Get the element at the specified position. 
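+     *
+     * A hypothetical read-side sketch for a generated nullable vector (names are
+     * illustrative only):
+     * <pre>{@code
+     * NullableIntVector.Accessor a = vector.getAccessor();
+     * Integer v = a.isNull(i) ? null : a.get(i);   // get(i) throws if the slot is null
+     * }</pre>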
+ * + * @param index position of the value + * @return value of the element, if not null + * @throws NullValueException if the value is null + */ + public <#if type.major == "VarLen">byte[]<#else>${minor.javaType!type.javaType} get(int index) { + if (isNull(index)) { + throw new IllegalStateException("Can't get a null value"); + } + return vAccessor.get(index); + } + + @Override + public boolean isNull(int index) { + return isSet(index) == 0; + } + + public int isSet(int index){ + return bAccessor.get(index); + } + + <#if type.major == "VarLen"> + public long getStartEnd(int index){ + return vAccessor.getStartEnd(index); + } + + @Override + public int getValueLength(int index) { + return values.getAccessor().getValueLength(index); + } + + + public void get(int index, Nullable${minor.class}Holder holder){ + vAccessor.get(index, holder); + holder.isSet = bAccessor.get(index); + + <#if minor.class.startsWith("Decimal")> + holder.scale = getField().getScale(); + holder.precision = getField().getPrecision(); + + } + + @Override + public ${friendlyType} getObject(int index) { + if (isNull(index)) { + return null; + }else{ + return vAccessor.getObject(index); + } + } + + <#if minor.class == "Interval" || minor.class == "IntervalDay" || minor.class == "IntervalYear"> + public StringBuilder getAsStringBuilder(int index) { + if (isNull(index)) { + return null; + }else{ + return vAccessor.getAsStringBuilder(index); + } + } + + + @Override + public int getValueCount(){ + return bits.getAccessor().getValueCount(); + } + + public void reset(){} + } + + public final class Mutator extends BaseDataValueVector.BaseMutator implements NullableVectorDefinitionSetter<#if type.major = "VarLen">, VariableWidthVector.VariableWidthMutator { + private int setCount; + <#if type.major = "VarLen"> private int lastSet = -1; + + private Mutator(){ + } + + public ${valuesName} getVectorWithValues(){ + return values; + } + + @Override + public void setIndexDefined(int index){ + bits.getMutator().set(index, 1); + } + + /** + * Set the variable length element at the specified index to the supplied byte array. 
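+     *
+     * A hypothetical write-side sketch for a generated variable-width vector (names
+     * are illustrative only):
+     * <pre>{@code
+     * NullableVarCharVector.Mutator m = vector.getMutator();
+     * byte[] b = "abc".getBytes(java.nio.charset.StandardCharsets.UTF_8);
+     * m.setSafe(0, b, 0, b.length);   // sets the validity bit and copies the bytes
+     * m.setNull(1);                   // leaves slot 1 null
+     * m.setValueCount(2);
+     * }</pre>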
+ * + * @param index position of the bit to set + * @param bytes array of bytes to write + */ + public void set(int index, <#if type.major == "VarLen">byte[]<#elseif (type.width < 4)>int<#else>${minor.javaType!type.javaType} value) { + setCount++; + final ${valuesName}.Mutator valuesMutator = values.getMutator(); + final UInt1Vector.Mutator bitsMutator = bits.getMutator(); + <#if type.major == "VarLen"> + for (int i = lastSet + 1; i < index; i++) { + valuesMutator.set(i, emptyByteArray); + } + + bitsMutator.set(index, 1); + valuesMutator.set(index, value); + <#if type.major == "VarLen">lastSet = index; + } + + <#if type.major == "VarLen"> + + private void fillEmpties(int index){ + final ${valuesName}.Mutator valuesMutator = values.getMutator(); + for (int i = lastSet; i < index; i++) { + valuesMutator.setSafe(i + 1, emptyByteArray); + } + while(index > bits.getValueCapacity()) { + bits.reAlloc(); + } + lastSet = index; + } + + @Override + public void setValueLengthSafe(int index, int length) { + values.getMutator().setValueLengthSafe(index, length); + lastSet = index; + } + + + public void setSafe(int index, byte[] value, int start, int length) { + <#if type.major != "VarLen"> + throw new UnsupportedOperationException(); + <#else> + fillEmpties(index); + + bits.getMutator().setSafe(index, 1); + values.getMutator().setSafe(index, value, start, length); + setCount++; + <#if type.major == "VarLen">lastSet = index; + + } + + public void setSafe(int index, ByteBuffer value, int start, int length) { + <#if type.major != "VarLen"> + throw new UnsupportedOperationException(); + <#else> + fillEmpties(index); + + bits.getMutator().setSafe(index, 1); + values.getMutator().setSafe(index, value, start, length); + setCount++; + <#if type.major == "VarLen">lastSet = index; + + } + + public void setNull(int index){ + bits.getMutator().setSafe(index, 0); + } + + public void setSkipNull(int index, ${minor.class}Holder holder){ + values.getMutator().set(index, holder); + } + + public void setSkipNull(int index, Nullable${minor.class}Holder holder){ + values.getMutator().set(index, holder); + } + + + public void set(int index, Nullable${minor.class}Holder holder){ + final ${valuesName}.Mutator valuesMutator = values.getMutator(); + <#if type.major == "VarLen"> + for (int i = lastSet + 1; i < index; i++) { + valuesMutator.set(i, emptyByteArray); + } + + bits.getMutator().set(index, holder.isSet); + valuesMutator.set(index, holder); + <#if type.major == "VarLen">lastSet = index; + } + + public void set(int index, ${minor.class}Holder holder){ + final ${valuesName}.Mutator valuesMutator = values.getMutator(); + <#if type.major == "VarLen"> + for (int i = lastSet + 1; i < index; i++) { + valuesMutator.set(i, emptyByteArray); + } + + bits.getMutator().set(index, 1); + valuesMutator.set(index, holder); + <#if type.major == "VarLen">lastSet = index; + } + + public boolean isSafe(int outIndex) { + return outIndex < Nullable${minor.class}Vector.this.getValueCapacity(); + } + + <#assign fields = minor.fields!type.fields /> + public void set(int index, int isSet<#list fields as field><#if field.include!true >, ${field.type} ${field.name}Field ){ + final ${valuesName}.Mutator valuesMutator = values.getMutator(); + <#if type.major == "VarLen"> + for (int i = lastSet + 1; i < index; i++) { + valuesMutator.set(i, emptyByteArray); + } + + bits.getMutator().set(index, isSet); + valuesMutator.set(index<#list fields as field><#if field.include!true >, ${field.name}Field); + <#if type.major == "VarLen">lastSet = index; + } + + 
public void setSafe(int index, int isSet<#list fields as field><#if field.include!true >, ${field.type} ${field.name}Field ) { + <#if type.major == "VarLen"> + fillEmpties(index); + + + bits.getMutator().setSafe(index, isSet); + values.getMutator().setSafe(index<#list fields as field><#if field.include!true >, ${field.name}Field); + setCount++; + <#if type.major == "VarLen">lastSet = index; + } + + + public void setSafe(int index, Nullable${minor.class}Holder value) { + + <#if type.major == "VarLen"> + fillEmpties(index); + + bits.getMutator().setSafe(index, value.isSet); + values.getMutator().setSafe(index, value); + setCount++; + <#if type.major == "VarLen">lastSet = index; + } + + public void setSafe(int index, ${minor.class}Holder value) { + + <#if type.major == "VarLen"> + fillEmpties(index); + + bits.getMutator().setSafe(index, 1); + values.getMutator().setSafe(index, value); + setCount++; + <#if type.major == "VarLen">lastSet = index; + } + + <#if !(type.major == "VarLen" || minor.class == "Decimal28Sparse" || minor.class == "Decimal38Sparse" || minor.class == "Decimal28Dense" || minor.class == "Decimal38Dense" || minor.class == "Interval" || minor.class == "IntervalDay")> + public void setSafe(int index, ${minor.javaType!type.javaType} value) { + <#if type.major == "VarLen"> + fillEmpties(index); + + bits.getMutator().setSafe(index, 1); + values.getMutator().setSafe(index, value); + setCount++; + } + + + + @Override + public void setValueCount(int valueCount) { + assert valueCount >= 0; + <#if type.major == "VarLen"> + fillEmpties(valueCount); + + values.getMutator().setValueCount(valueCount); + bits.getMutator().setValueCount(valueCount); + } + + @Override + public void generateTestData(int valueCount){ + bits.getMutator().generateTestDataAlt(valueCount); + values.getMutator().generateTestData(valueCount); + <#if type.major = "VarLen">lastSet = valueCount; + setValueCount(valueCount); + } + + @Override + public void reset(){ + setCount = 0; + <#if type.major = "VarLen">lastSet = -1; + } + } +} + + diff --git a/java/vector/src/main/codegen/templates/RepeatedValueVectors.java b/java/vector/src/main/codegen/templates/RepeatedValueVectors.java new file mode 100644 index 0000000000000..5ac80f57737ff --- /dev/null +++ b/java/vector/src/main/codegen/templates/RepeatedValueVectors.java @@ -0,0 +1,421 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +<@pp.dropOutputFile /> +<#list vv.types as type> +<#list type.minor as minor> +<#assign friendlyType = (minor.friendlyType!minor.boxedType!type.boxedType) /> +<#assign fields = minor.fields!type.fields /> + +<@pp.changeOutputFile name="/org/apache/arrow/vector/Repeated${minor.class}Vector.java" /> +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector; + +<#include "/@includes/vv_imports.ftl" /> + +/** + * Repeated${minor.class} implements a vector with multple values per row (e.g. JSON array or + * repeated protobuf field). The implementation uses two additional value vectors; one to convert + * the index offset to the underlying element offset, and another to store the number of values + * in the vector. + * + * NB: this class is automatically generated from ${.template_name} and ValueVectorTypes.tdd using FreeMarker. + */ + +public final class Repeated${minor.class}Vector extends BaseRepeatedValueVector implements Repeated<#if type.major == "VarLen">VariableWidth<#else>FixedWidthVectorLike { + //private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(Repeated${minor.class}Vector.class); + + // we maintain local reference to concrete vector type for performance reasons. + private ${minor.class}Vector values; + private final FieldReader reader = new Repeated${minor.class}ReaderImpl(Repeated${minor.class}Vector.this); + private final Mutator mutator = new Mutator(); + private final Accessor accessor = new Accessor(); + + public Repeated${minor.class}Vector(MaterializedField field, BufferAllocator allocator) { + super(field, allocator); + addOrGetVector(VectorDescriptor.create(new MajorType(field.getType().getMinorType(), DataMode.REQUIRED))); + } + + @Override + public Mutator getMutator() { + return mutator; + } + + @Override + public Accessor getAccessor() { + return accessor; + } + + @Override + public FieldReader getReader() { + return reader; + } + + @Override + public ${minor.class}Vector getDataVector() { + return values; + } + + @Override + public TransferPair getTransferPair(BufferAllocator allocator) { + return new TransferImpl(getField(), allocator); + } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator){ + return new TransferImpl(getField().withPath(ref), allocator); + } + + @Override + public TransferPair makeTransferPair(ValueVector to) { + return new TransferImpl((Repeated${minor.class}Vector) to); + } + + @Override + public AddOrGetResult<${minor.class}Vector> addOrGetVector(VectorDescriptor descriptor) { + final AddOrGetResult<${minor.class}Vector> result = super.addOrGetVector(descriptor); + if (result.isCreated()) { + values = result.getVector(); + } + return result; + } + + public void transferTo(Repeated${minor.class}Vector target) { + target.clear(); + offsets.transferTo(target.offsets); + values.transferTo(target.values); + clear(); + } + + public void splitAndTransferTo(final int startIndex, final int groups, Repeated${minor.class}Vector to) { + final UInt4Vector.Accessor a = offsets.getAccessor(); + final UInt4Vector.Mutator m = to.offsets.getMutator(); + + final int startPos = a.get(startIndex); + final int endPos = a.get(startIndex + groups); + final int valuesToCopy = endPos - startPos; + + values.splitAndTransferTo(startPos, valuesToCopy, to.values); + to.offsets.clear(); + to.offsets.allocateNew(groups + 1); + int normalizedPos = 0; + for (int i=0; i < groups + 1;i++ ) { + normalizedPos = a.get(startIndex+i) - startPos; + m.set(i, normalizedPos); + } + 
m.setValueCount(groups == 0 ? 0 : groups + 1); + } + + private class TransferImpl implements TransferPair { + final Repeated${minor.class}Vector to; + + public TransferImpl(MaterializedField field, BufferAllocator allocator) { + this.to = new Repeated${minor.class}Vector(field, allocator); + } + + public TransferImpl(Repeated${minor.class}Vector to) { + this.to = to; + } + + @Override + public Repeated${minor.class}Vector getTo() { + return to; + } + + @Override + public void transfer() { + transferTo(to); + } + + @Override + public void splitAndTransfer(int startIndex, int length) { + splitAndTransferTo(startIndex, length, to); + } + + @Override + public void copyValueSafe(int fromIndex, int toIndex) { + to.copyFromSafe(fromIndex, toIndex, Repeated${minor.class}Vector.this); + } + } + + public void copyFrom(int inIndex, int outIndex, Repeated${minor.class}Vector v) { + final Accessor vAccessor = v.getAccessor(); + final int count = vAccessor.getInnerValueCountAt(inIndex); + mutator.startNewValue(outIndex); + for (int i = 0; i < count; i++) { + mutator.add(outIndex, vAccessor.get(inIndex, i)); + } + } + + public void copyFromSafe(int inIndex, int outIndex, Repeated${minor.class}Vector v) { + final Accessor vAccessor = v.getAccessor(); + final int count = vAccessor.getInnerValueCountAt(inIndex); + mutator.startNewValue(outIndex); + for (int i = 0; i < count; i++) { + mutator.addSafe(outIndex, vAccessor.get(inIndex, i)); + } + } + + public boolean allocateNewSafe() { + /* boolean to keep track if all the memory allocation were successful + * Used in the case of composite vectors when we need to allocate multiple + * buffers for multiple vectors. If one of the allocations failed we need to + * clear all the memory that we allocated + */ + boolean success = false; + try { + if(!offsets.allocateNewSafe()) return false; + if(!values.allocateNewSafe()) return false; + success = true; + } finally { + if (!success) { + clear(); + } + } + offsets.zeroVector(); + mutator.reset(); + return true; + } + + @Override + public void allocateNew() { + try { + offsets.allocateNew(); + values.allocateNew(); + } catch (OutOfMemoryException e) { + clear(); + throw e; + } + offsets.zeroVector(); + mutator.reset(); + } + + <#if type.major == "VarLen"> +// @Override +// protected SerializedField.Builder getMetadataBuilder() { +// return super.getMetadataBuilder() +// .setVarByteLength(values.getVarByteLength()); +// } + + public void allocateNew(int totalBytes, int valueCount, int innerValueCount) { + try { + offsets.allocateNew(valueCount + 1); + values.allocateNew(totalBytes, innerValueCount); + } catch (OutOfMemoryException e) { + clear(); + throw e; + } + offsets.zeroVector(); + mutator.reset(); + } + + public int getByteCapacity(){ + return values.getByteCapacity(); + } + + <#else> + + @Override + public void allocateNew(int valueCount, int innerValueCount) { + clear(); + /* boolean to keep track if all the memory allocation were successful + * Used in the case of composite vectors when we need to allocate multiple + * buffers for multiple vectors. 
If one of the allocations failed we need to
+     * clear all the memory that we allocated.
+     */
+    boolean success = false;
+    try {
+      offsets.allocateNew(valueCount + 1);
+      values.allocateNew(innerValueCount);
+    } catch (OutOfMemoryException e) {
+      clear();
+      throw e;
+    }
+    offsets.zeroVector();
+    mutator.reset();
+  }
+  </#if>
+
+  // This is declared as a subclass of the accessor declared inside of FixedWidthVector. It is also
+  // used for variable length vectors, as they should have as consistent an interface as possible.
+  // If they need to diverge in the future, the interface should be declared in the respective value
+  // vector superclasses for fixed and variable width, and we should refer to each in the generation
+  // template.
+  public final class Accessor extends BaseRepeatedValueVector.BaseRepeatedAccessor {
+    @Override
+    public List<${friendlyType}> getObject(int index) {
+      final List<${friendlyType}> vals = new JsonStringArrayList<>();
+      final UInt4Vector.Accessor offsetsAccessor = offsets.getAccessor();
+      final int start = offsetsAccessor.get(index);
+      final int end = offsetsAccessor.get(index + 1);
+      final ${minor.class}Vector.Accessor valuesAccessor = values.getAccessor();
+      for (int i = start; i < end; i++) {
+        vals.add(valuesAccessor.getObject(i));
+      }
+      return vals;
+    }
+
+    public ${friendlyType} getSingleObject(int index, int arrayIndex) {
+      final int start = offsets.getAccessor().get(index);
+      return values.getAccessor().getObject(start + arrayIndex);
+    }
+
+    /**
+     * Get a value for the given record. Each element in the repeated field is accessed by
+     * the positionIndex param.
+     *
+     * @param index record containing the repeated field
+     * @param positionIndex position within the repeated field
+     * @return element at the given position in the given record
+     */
+    public <#if type.major == "VarLen">byte[]
+           <#else>${minor.javaType!type.javaType}</#if>
+    get(int index, int positionIndex) {
+      return values.getAccessor().get(offsets.getAccessor().get(index) + positionIndex);
+    }
+
+    public void get(int index, Repeated${minor.class}Holder holder) {
+      holder.start = offsets.getAccessor().get(index);
+      holder.end = offsets.getAccessor().get(index + 1);
+      holder.vector = values;
+    }
+
+    public void get(int index, int positionIndex, ${minor.class}Holder holder) {
+      final int offset = offsets.getAccessor().get(index);
+      assert offset >= 0;
+      assert positionIndex < getInnerValueCountAt(index);
+      values.getAccessor().get(offset + positionIndex, holder);
+    }
+
+    public void get(int index, int positionIndex, Nullable${minor.class}Holder holder) {
+      final int offset = offsets.getAccessor().get(index);
+      assert offset >= 0;
+      if (positionIndex >= getInnerValueCountAt(index)) {
+        holder.isSet = 0;
+        return;
+      }
+      values.getAccessor().get(offset + positionIndex, holder);
+    }
+  }
+
+  public final class Mutator extends BaseRepeatedValueVector.BaseRepeatedMutator implements RepeatedMutator {
+    private Mutator() {}
+
+    /**
+     * Add an element to the given record index. This is similar to the set() method in other
+     * value vectors, except that it permits setting multiple values for a single record.
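+     * For example (a hypothetical call sequence), after startNewValue(7), three calls to
+     * add(7, v) each write at the row's current end offset and bump offsets[8], leaving
+     * three elements in row 7.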
+ * + * @param index record of the element to add + * @param value value to add to the given row + */ + public void add(int index, <#if type.major == "VarLen">byte[]<#elseif (type.width < 4)>int<#else>${minor.javaType!type.javaType} value) { + int nextOffset = offsets.getAccessor().get(index+1); + values.getMutator().set(nextOffset, value); + offsets.getMutator().set(index+1, nextOffset+1); + } + + <#if type.major == "VarLen"> + public void addSafe(int index, byte[] bytes) { + addSafe(index, bytes, 0, bytes.length); + } + + public void addSafe(int index, byte[] bytes, int start, int length) { + final int nextOffset = offsets.getAccessor().get(index+1); + values.getMutator().setSafe(nextOffset, bytes, start, length); + offsets.getMutator().setSafe(index+1, nextOffset+1); + } + + <#else> + + public void addSafe(int index, ${minor.javaType!type.javaType} srcValue) { + final int nextOffset = offsets.getAccessor().get(index+1); + values.getMutator().setSafe(nextOffset, srcValue); + offsets.getMutator().setSafe(index+1, nextOffset+1); + } + + + + public void setSafe(int index, Repeated${minor.class}Holder h) { + final ${minor.class}Holder ih = new ${minor.class}Holder(); + final ${minor.class}Vector.Accessor hVectorAccessor = h.vector.getAccessor(); + mutator.startNewValue(index); + for(int i = h.start; i < h.end; i++){ + hVectorAccessor.get(i, ih); + mutator.addSafe(index, ih); + } + } + + public void addSafe(int index, ${minor.class}Holder holder) { + int nextOffset = offsets.getAccessor().get(index+1); + values.getMutator().setSafe(nextOffset, holder); + offsets.getMutator().setSafe(index+1, nextOffset+1); + } + + public void addSafe(int index, Nullable${minor.class}Holder holder) { + final int nextOffset = offsets.getAccessor().get(index+1); + values.getMutator().setSafe(nextOffset, holder); + offsets.getMutator().setSafe(index+1, nextOffset+1); + } + + <#if (fields?size > 1) && !(minor.class == "Decimal9" || minor.class == "Decimal18" || minor.class == "Decimal28Sparse" || minor.class == "Decimal38Sparse" || minor.class == "Decimal28Dense" || minor.class == "Decimal38Dense")> + public void addSafe(int arrayIndex, <#list fields as field>${field.type} ${field.name}<#if field_has_next>, ) { + int nextOffset = offsets.getAccessor().get(arrayIndex+1); + values.getMutator().setSafe(nextOffset, <#list fields as field>${field.name}<#if field_has_next>, ); + offsets.getMutator().setSafe(arrayIndex+1, nextOffset+1); + } + + + protected void add(int index, ${minor.class}Holder holder) { + int nextOffset = offsets.getAccessor().get(index+1); + values.getMutator().set(nextOffset, holder); + offsets.getMutator().set(index+1, nextOffset+1); + } + + public void add(int index, Repeated${minor.class}Holder holder) { + + ${minor.class}Vector.Accessor accessor = holder.vector.getAccessor(); + ${minor.class}Holder innerHolder = new ${minor.class}Holder(); + + for(int i = holder.start; i < holder.end; i++) { + accessor.get(i, innerHolder); + add(index, innerHolder); + } + } + + @Override + public void generateTestData(final int valCount) { + final int[] sizes = {1, 2, 0, 6}; + int size = 0; + int runningOffset = 0; + final UInt4Vector.Mutator offsetsMutator = offsets.getMutator(); + for(int i = 1; i < valCount + 1; i++, size++) { + runningOffset += sizes[size % sizes.length]; + offsetsMutator.set(i, runningOffset); + } + values.getMutator().generateTestData(valCount * 9); + setValueCount(size); + } + + @Override + public void reset() { + } + } +} + + diff --git 
a/java/vector/src/main/codegen/templates/UnionListWriter.java b/java/vector/src/main/codegen/templates/UnionListWriter.java new file mode 100644 index 0000000000000..9a6b08fc561f9 --- /dev/null +++ b/java/vector/src/main/codegen/templates/UnionListWriter.java @@ -0,0 +1,185 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +import java.lang.UnsupportedOperationException; + +<@pp.dropOutputFile /> +<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/UnionListWriter.java" /> + + +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.complex.impl; + +<#include "/@includes/vv_imports.ftl" /> + +/* + * This class is generated using freemarker and the ${.template_name} template. + */ + +@SuppressWarnings("unused") +public class UnionListWriter extends AbstractFieldWriter { + + private ListVector vector; + private UInt4Vector offsets; + private PromotableWriter writer; + private boolean inMap = false; + private String mapName; + private int lastIndex = 0; + + public UnionListWriter(ListVector vector) { + super(null); + this.vector = vector; + this.writer = new PromotableWriter(vector.getDataVector(), vector); + this.offsets = vector.getOffsetVector(); + } + + public UnionListWriter(ListVector vector, AbstractFieldWriter parent) { + this(vector); + } + + @Override + public void allocate() { + vector.allocateNew(); + } + + @Override + public void clear() { + vector.clear(); + } + + @Override + public MaterializedField getField() { + return null; + } + + @Override + public int getValueCapacity() { + return vector.getValueCapacity(); + } + + @Override + public void close() throws Exception { + + } + + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> + <#assign fields = minor.fields!type.fields /> + <#assign uncappedName = name?uncap_first/> + + <#if !minor.class?starts_with("Decimal")> + + @Override + public ${name}Writer <#if uncappedName == "int">integer<#else>${uncappedName}() { + return this; + } + + @Override + public ${name}Writer <#if uncappedName == "int">integer<#else>${uncappedName}(String name) { + assert inMap; + mapName = name; + final int nextOffset = offsets.getAccessor().get(idx() + 1); + vector.getMutator().setNotNull(idx()); + writer.setPosition(nextOffset); + ${name}Writer ${uncappedName}Writer = writer.<#if uncappedName == "int">integer<#else>${uncappedName}(name); + return ${uncappedName}Writer; + } + + + + + + @Override + public MapWriter map() { + inMap = true; + return this; + } + + @Override + public ListWriter list() { + final int nextOffset = offsets.getAccessor().get(idx() + 1); + vector.getMutator().setNotNull(idx()); + offsets.getMutator().setSafe(idx() + 1, nextOffset + 1); + writer.setPosition(nextOffset); + return writer; + } + + 
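+  // The writer methods here share one bookkeeping pattern: read the next free element
+  // position from offsets[idx() + 1], point the inner PromotableWriter at it, and, once a
+  // value is written, bump offsets[idx() + 1] so the current row's list grows by one.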
@Override + public ListWriter list(String name) { + final int nextOffset = offsets.getAccessor().get(idx() + 1); + vector.getMutator().setNotNull(idx()); + writer.setPosition(nextOffset); + ListWriter listWriter = writer.list(name); + return listWriter; + } + + @Override + public MapWriter map(String name) { + MapWriter mapWriter = writer.map(name); + return mapWriter; + } + + @Override + public void startList() { + vector.getMutator().startNewValue(idx()); + } + + @Override + public void endList() { + + } + + @Override + public void start() { + assert inMap; + final int nextOffset = offsets.getAccessor().get(idx() + 1); + vector.getMutator().setNotNull(idx()); + offsets.getMutator().setSafe(idx() + 1, nextOffset); + writer.setPosition(nextOffset); + } + + @Override + public void end() { + if (inMap) { + inMap = false; + final int nextOffset = offsets.getAccessor().get(idx() + 1); + offsets.getMutator().setSafe(idx() + 1, nextOffset + 1); + } + } + + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> + <#assign fields = minor.fields!type.fields /> + <#assign uncappedName = name?uncap_first/> + + <#if !minor.class?starts_with("Decimal")> + + @Override + public void write${name}(<#list fields as field>${field.type} ${field.name}<#if field_has_next>, ) { + assert !inMap; + final int nextOffset = offsets.getAccessor().get(idx() + 1); + vector.getMutator().setNotNull(idx()); + writer.setPosition(nextOffset); + writer.write${name}(<#list fields as field>${field.name}<#if field_has_next>, ); + offsets.getMutator().setSafe(idx() + 1, nextOffset + 1); + } + + + + + +} diff --git a/java/vector/src/main/codegen/templates/UnionReader.java b/java/vector/src/main/codegen/templates/UnionReader.java new file mode 100644 index 0000000000000..44c3e55dcc6f1 --- /dev/null +++ b/java/vector/src/main/codegen/templates/UnionReader.java @@ -0,0 +1,194 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + + +<@pp.dropOutputFile /> +<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/UnionReader.java" /> + + +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.complex.impl; + +<#include "/@includes/vv_imports.ftl" /> + +@SuppressWarnings("unused") +public class UnionReader extends AbstractFieldReader { + + private BaseReader[] readers = new BaseReader[43]; + public UnionVector data; + + public UnionReader(UnionVector data) { + this.data = data; + } + + private static MajorType[] TYPES = new MajorType[43]; + + static { + for (MinorType minorType : MinorType.values()) { + TYPES[minorType.ordinal()] = new MajorType(minorType, DataMode.OPTIONAL); + } + } + + public MajorType getType() { + return TYPES[data.getTypeValue(idx())]; + } + + public boolean isSet(){ + return !data.getAccessor().isNull(idx()); + } + + public void read(UnionHolder holder) { + holder.reader = this; + holder.isSet = this.isSet() ? 1 : 0; + } + + public void read(int index, UnionHolder holder) { + getList().read(index, holder); + } + + private FieldReader getReaderForIndex(int index) { + int typeValue = data.getTypeValue(index); + FieldReader reader = (FieldReader) readers[typeValue]; + if (reader != null) { + return reader; + } + switch (MinorType.values()[typeValue]) { + case LATE: + return NullReader.INSTANCE; + case MAP: + return (FieldReader) getMap(); + case LIST: + return (FieldReader) getList(); + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> + <#assign uncappedName = name?uncap_first/> + <#if !minor.class?starts_with("Decimal")> + case ${name?upper_case}: + return (FieldReader) get${name}(); + + + default: + throw new UnsupportedOperationException("Unsupported type: " + MinorType.values()[typeValue]); + } + } + + private SingleMapReaderImpl mapReader; + + private MapReader getMap() { + if (mapReader == null) { + mapReader = (SingleMapReaderImpl) data.getMap().getReader(); + mapReader.setPosition(idx()); + readers[MinorType.MAP.ordinal()] = mapReader; + } + return mapReader; + } + + private UnionListReader listReader; + + private FieldReader getList() { + if (listReader == null) { + listReader = new UnionListReader(data.getList()); + listReader.setPosition(idx()); + readers[MinorType.LIST.ordinal()] = listReader; + } + return listReader; + } + + @Override + public java.util.Iterator iterator() { + return getMap().iterator(); + } + + @Override + public void copyAsValue(UnionWriter writer) { + writer.data.copyFrom(idx(), writer.idx(), data); + } + + <#list ["Object", "BigDecimal", "Integer", "Long", "Boolean", + "Character", "DateTime", "Period", "Double", "Float", + "Text", "String", "Byte", "Short", "byte[]"] as friendlyType> + <#assign safeType=friendlyType /> + <#if safeType=="byte[]"><#assign safeType="ByteArray" /> + + @Override + public ${friendlyType} read${safeType}() { + return getReaderForIndex(idx()).read${safeType}(); + } + + + + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> + <#assign uncappedName = name?uncap_first/> + <#assign boxedType = (minor.boxedType!type.boxedType) /> + <#assign javaType = (minor.javaType!type.javaType) /> + <#assign friendlyType = (minor.friendlyType!minor.boxedType!type.boxedType) /> + <#assign safeType=friendlyType /> + <#if safeType=="byte[]"><#assign safeType="ByteArray" /> + <#if !minor.class?starts_with("Decimal")> + + private Nullable${name}ReaderImpl ${uncappedName}Reader; + + private Nullable${name}ReaderImpl get${name}() { + if 
(${uncappedName}Reader == null) { + ${uncappedName}Reader = new Nullable${name}ReaderImpl(data.get${name}Vector()); + ${uncappedName}Reader.setPosition(idx()); + readers[MinorType.${name?upper_case}.ordinal()] = ${uncappedName}Reader; + } + return ${uncappedName}Reader; + } + + public void read(Nullable${name}Holder holder){ + getReaderForIndex(idx()).read(holder); + } + + public void copyAsValue(${name}Writer writer){ + getReaderForIndex(idx()).copyAsValue(writer); + } + + + + @Override + public void copyAsValue(ListWriter writer) { + ComplexCopier.copy(this, (FieldWriter) writer); + } + + @Override + public void setPosition(int index) { + super.setPosition(index); + for (BaseReader reader : readers) { + if (reader != null) { + reader.setPosition(index); + } + } + } + + public FieldReader reader(String name){ + return getMap().reader(name); + } + + public FieldReader reader() { + return getList().reader(); + } + + public boolean next() { + return getReaderForIndex(idx()).next(); + } +} + + + diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java new file mode 100644 index 0000000000000..ba94ac22a05f6 --- /dev/null +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -0,0 +1,467 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +<@pp.dropOutputFile /> +<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/UnionVector.java" /> + + +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.complex; + +<#include "/@includes/vv_imports.ftl" /> +import java.util.ArrayList; +import java.util.Iterator; +import org.apache.arrow.vector.complex.impl.ComplexCopier; +import org.apache.arrow.vector.util.CallBack; +import org.apache.arrow.vector.util.BasicTypeHelper; + +/* + * This class is generated using freemarker and the ${.template_name} template. + */ +@SuppressWarnings("unused") + + +/** + * A vector which can hold values of different types. It does so by using a MapVector which contains a vector for each + * primitive type that is stored. MapVector is used in order to take advantage of its serialization/deserialization methods, + * as well as the addOrGet method. + * + * For performance reasons, UnionVector stores a cached reference to each subtype vector, to avoid having to do the map lookup + * each time the vector is accessed. 
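+ *
+ * A UInt1 "types" child vector records, for each position, the ordinal of the MinorType stored
+ * there; accessors and writers use it to dispatch to the matching subtype vector. A hypothetical
+ * write of an INT value at position 0 looks like:
+ *
+ * <pre>
+ *   NullableIntHolder h = ...; // hypothetical holder
+ *   unionVector.getMutator().setSafe(0, h); // records the type and writes the int subvector
+ * </pre>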
+ */ +public class UnionVector implements ValueVector { + + private MaterializedField field; + private BufferAllocator allocator; + private Accessor accessor = new Accessor(); + private Mutator mutator = new Mutator(); + private int valueCount; + + private MapVector internalMap; + private UInt1Vector typeVector; + + private MapVector mapVector; + private ListVector listVector; + + private FieldReader reader; + private NullableBitVector bit; + + private int singleType = 0; + private ValueVector singleVector; + private MajorType majorType; + + private final CallBack callBack; + + public UnionVector(MaterializedField field, BufferAllocator allocator, CallBack callBack) { + this.field = field.clone(); + this.allocator = allocator; + this.internalMap = new MapVector("internal", allocator, callBack); + this.typeVector = internalMap.addOrGet("types", new MajorType(MinorType.UINT1, DataMode.REQUIRED), UInt1Vector.class); + this.field.addChild(internalMap.getField().clone()); + this.majorType = field.getType(); + this.callBack = callBack; + } + + public BufferAllocator getAllocator() { + return allocator; + } + + public List getSubTypes() { + return majorType.getSubTypes(); + } + + public void addSubType(MinorType type) { + if (majorType.getSubTypes().contains(type)) { + return; + } + List subTypes = this.majorType.getSubTypes(); + List newSubTypes = new ArrayList<>(subTypes); + newSubTypes.add(type); + majorType = new MajorType(this.majorType.getMinorType(), this.majorType.getMode(), this.majorType.getPrecision(), + this.majorType.getScale(), this.majorType.getTimezone(), newSubTypes); + field = MaterializedField.create(field.getName(), majorType); + if (callBack != null) { + callBack.doWork(); + } + } + + private static final MajorType MAP_TYPE = new MajorType(MinorType.MAP, DataMode.OPTIONAL); + + public MapVector getMap() { + if (mapVector == null) { + int vectorCount = internalMap.size(); + mapVector = internalMap.addOrGet("map", MAP_TYPE, MapVector.class); + addSubType(MinorType.MAP); + if (internalMap.size() > vectorCount) { + mapVector.allocateNew(); + } + } + return mapVector; + } + + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> + <#assign fields = minor.fields!type.fields /> + <#assign uncappedName = name?uncap_first/> + <#if !minor.class?starts_with("Decimal")> + + private Nullable${name}Vector ${uncappedName}Vector; + private static final MajorType ${name?upper_case}_TYPE = new MajorType(MinorType.${name?upper_case}, DataMode.OPTIONAL); + + public Nullable${name}Vector get${name}Vector() { + if (${uncappedName}Vector == null) { + int vectorCount = internalMap.size(); + ${uncappedName}Vector = internalMap.addOrGet("${uncappedName}", ${name?upper_case}_TYPE, Nullable${name}Vector.class); + addSubType(MinorType.${name?upper_case}); + if (internalMap.size() > vectorCount) { + ${uncappedName}Vector.allocateNew(); + } + } + return ${uncappedName}Vector; + } + + + + + + private static final MajorType LIST_TYPE = new MajorType(MinorType.LIST, DataMode.OPTIONAL); + + public ListVector getList() { + if (listVector == null) { + int vectorCount = internalMap.size(); + listVector = internalMap.addOrGet("list", LIST_TYPE, ListVector.class); + addSubType(MinorType.LIST); + if (internalMap.size() > vectorCount) { + listVector.allocateNew(); + } + } + return listVector; + } + + public int getTypeValue(int index) { + return typeVector.getAccessor().get(index); + } + + public UInt1Vector getTypeVector() { + return typeVector; + } + + @Override + public void 
allocateNew() throws OutOfMemoryException { + internalMap.allocateNew(); + if (typeVector != null) { + typeVector.zeroVector(); + } + } + + @Override + public boolean allocateNewSafe() { + boolean safe = internalMap.allocateNewSafe(); + if (safe) { + if (typeVector != null) { + typeVector.zeroVector(); + } + } + return safe; + } + + @Override + public void setInitialCapacity(int numRecords) { + } + + @Override + public int getValueCapacity() { + return Math.min(typeVector.getValueCapacity(), internalMap.getValueCapacity()); + } + + @Override + public void close() { + } + + @Override + public void clear() { + internalMap.clear(); + } + + @Override + public MaterializedField getField() { + return field; + } + + @Override + public TransferPair getTransferPair(BufferAllocator allocator) { + return new TransferImpl(field, allocator); + } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator) { + return new TransferImpl(field.withPath(ref), allocator); + } + + @Override + public TransferPair makeTransferPair(ValueVector target) { + return new TransferImpl((UnionVector) target); + } + + public void transferTo(UnionVector target) { + internalMap.makeTransferPair(target.internalMap).transfer(); + target.valueCount = valueCount; + target.majorType = majorType; + } + + public void copyFrom(int inIndex, int outIndex, UnionVector from) { + from.getReader().setPosition(inIndex); + getWriter().setPosition(outIndex); + ComplexCopier.copy(from.reader, mutator.writer); + } + + public void copyFromSafe(int inIndex, int outIndex, UnionVector from) { + copyFrom(inIndex, outIndex, from); + } + + public ValueVector addVector(ValueVector v) { + String name = v.getField().getType().getMinorType().name().toLowerCase(); + MajorType type = v.getField().getType(); + Preconditions.checkState(internalMap.getChild(name) == null, String.format("%s vector already exists", name)); + final ValueVector newVector = internalMap.addOrGet(name, type, (Class) BasicTypeHelper.getValueVectorClass(type.getMinorType(), type.getMode())); + v.makeTransferPair(newVector).transfer(); + internalMap.putChild(name, newVector); + addSubType(v.getField().getType().getMinorType()); + return newVector; + } + + private class TransferImpl implements TransferPair { + + UnionVector to; + + public TransferImpl(MaterializedField field, BufferAllocator allocator) { + to = new UnionVector(field, allocator, null); + } + + public TransferImpl(UnionVector to) { + this.to = to; + } + + @Override + public void transfer() { + transferTo(to); + } + + @Override + public void splitAndTransfer(int startIndex, int length) { + + } + + @Override + public ValueVector getTo() { + return to; + } + + @Override + public void copyValueSafe(int from, int to) { + this.to.copyFrom(from, to, UnionVector.this); + } + } + + @Override + public Accessor getAccessor() { + return accessor; + } + + @Override + public Mutator getMutator() { + return mutator; + } + + @Override + public FieldReader getReader() { + if (reader == null) { + reader = new UnionReader(this); + } + return reader; + } + + public FieldWriter getWriter() { + if (mutator.writer == null) { + mutator.writer = new UnionWriter(this); + } + return mutator.writer; + } + +// @Override +// public UserBitShared.SerializedField getMetadata() { +// SerializedField.Builder b = getField() // +// .getAsBuilder() // +// .setBufferLength(getBufferSize()) // +// .setValueCount(valueCount); +// +// b.addChild(internalMap.getMetadata()); +// return b.build(); +// } + + @Override + public int 
getBufferSize() { + return internalMap.getBufferSize(); + } + + @Override + public int getBufferSizeFor(final int valueCount) { + if (valueCount == 0) { + return 0; + } + + long bufferSize = 0; + for (final ValueVector v : (Iterable) this) { + bufferSize += v.getBufferSizeFor(valueCount); + } + + return (int) bufferSize; + } + + @Override + public ArrowBuf[] getBuffers(boolean clear) { + return internalMap.getBuffers(clear); + } + + @Override + public Iterator iterator() { + List vectors = Lists.newArrayList(internalMap.iterator()); + vectors.add(typeVector); + return vectors.iterator(); + } + + public class Accessor extends BaseValueVector.BaseAccessor { + + + @Override + public Object getObject(int index) { + int type = typeVector.getAccessor().get(index); + switch (MinorType.values()[type]) { + case LATE: + return null; + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> + <#assign fields = minor.fields!type.fields /> + <#assign uncappedName = name?uncap_first/> + <#if !minor.class?starts_with("Decimal")> + case ${name?upper_case}: + return get${name}Vector().getAccessor().getObject(index); + + + + case MAP: + return getMap().getAccessor().getObject(index); + case LIST: + return getList().getAccessor().getObject(index); + default: + throw new UnsupportedOperationException("Cannot support type: " + MinorType.values()[type]); + } + } + + public byte[] get(int index) { + return null; + } + + public void get(int index, ComplexHolder holder) { + } + + public void get(int index, UnionHolder holder) { + FieldReader reader = new UnionReader(UnionVector.this); + reader.setPosition(index); + holder.reader = reader; + } + + @Override + public int getValueCount() { + return valueCount; + } + + @Override + public boolean isNull(int index) { + return typeVector.getAccessor().get(index) == 0; + } + + public int isSet(int index) { + return isNull(index) ? 
0 : 1; + } + } + + public class Mutator extends BaseValueVector.BaseMutator { + + UnionWriter writer; + + @Override + public void setValueCount(int valueCount) { + UnionVector.this.valueCount = valueCount; + internalMap.getMutator().setValueCount(valueCount); + } + + public void setSafe(int index, UnionHolder holder) { + FieldReader reader = holder.reader; + if (writer == null) { + writer = new UnionWriter(UnionVector.this); + } + writer.setPosition(index); + MinorType type = reader.getType().getMinorType(); + switch (type) { + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> + <#assign fields = minor.fields!type.fields /> + <#assign uncappedName = name?uncap_first/> + <#if !minor.class?starts_with("Decimal")> + case ${name?upper_case}: + Nullable${name}Holder ${uncappedName}Holder = new Nullable${name}Holder(); + reader.read(${uncappedName}Holder); + setSafe(index, ${uncappedName}Holder); + break; + + + case MAP: { + ComplexCopier.copy(reader, writer); + break; + } + case LIST: { + ComplexCopier.copy(reader, writer); + break; + } + default: + throw new UnsupportedOperationException(); + } + } + + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> + <#assign fields = minor.fields!type.fields /> + <#assign uncappedName = name?uncap_first/> + <#if !minor.class?starts_with("Decimal")> + public void setSafe(int index, Nullable${name}Holder holder) { + setType(index, MinorType.${name?upper_case}); + get${name}Vector().getMutator().setSafe(index, holder); + } + + + + + public void setType(int index, MinorType type) { + typeVector.getMutator().setSafe(index, type.ordinal()); + } + + @Override + public void reset() { } + + @Override + public void generateTestData(int values) { } + } +} diff --git a/java/vector/src/main/codegen/templates/UnionWriter.java b/java/vector/src/main/codegen/templates/UnionWriter.java new file mode 100644 index 0000000000000..c9c29e0dd5f92 --- /dev/null +++ b/java/vector/src/main/codegen/templates/UnionWriter.java @@ -0,0 +1,228 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +<@pp.dropOutputFile /> +<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/UnionWriter.java" /> + + +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.complex.impl; + +<#include "/@includes/vv_imports.ftl" /> + +/* + * This class is generated using freemarker and the ${.template_name} template. 
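+ * The writer records each chosen MinorType in the underlying UnionVector's type vector and then
+ * delegates the actual write to a cached per-type subtype writer.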
+ */ +@SuppressWarnings("unused") +public class UnionWriter extends AbstractFieldWriter implements FieldWriter { + + UnionVector data; + private MapWriter mapWriter; + private UnionListWriter listWriter; + private List writers = Lists.newArrayList(); + + public UnionWriter(BufferAllocator allocator) { + super(null); + } + + public UnionWriter(UnionVector vector) { + super(null); + data = vector; + } + + public UnionWriter(UnionVector vector, FieldWriter parent) { + super(null); + data = vector; + } + + @Override + public void setPosition(int index) { + super.setPosition(index); + for (BaseWriter writer : writers) { + writer.setPosition(index); + } + } + + + @Override + public void start() { + data.getMutator().setType(idx(), MinorType.MAP); + getMapWriter().start(); + } + + @Override + public void end() { + getMapWriter().end(); + } + + @Override + public void startList() { + getListWriter().startList(); + data.getMutator().setType(idx(), MinorType.LIST); + } + + @Override + public void endList() { + getListWriter().endList(); + } + + private MapWriter getMapWriter() { + if (mapWriter == null) { + mapWriter = new SingleMapWriter(data.getMap(), null, true); + mapWriter.setPosition(idx()); + writers.add(mapWriter); + } + return mapWriter; + } + + public MapWriter asMap() { + data.getMutator().setType(idx(), MinorType.MAP); + return getMapWriter(); + } + + private ListWriter getListWriter() { + if (listWriter == null) { + listWriter = new UnionListWriter(data.getList()); + listWriter.setPosition(idx()); + writers.add(listWriter); + } + return listWriter; + } + + public ListWriter asList() { + data.getMutator().setType(idx(), MinorType.LIST); + return getListWriter(); + } + + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> + <#assign fields = minor.fields!type.fields /> + <#assign uncappedName = name?uncap_first/> + + <#if !minor.class?starts_with("Decimal")> + + private ${name}Writer ${name?uncap_first}Writer; + + private ${name}Writer get${name}Writer() { + if (${uncappedName}Writer == null) { + ${uncappedName}Writer = new Nullable${name}WriterImpl(data.get${name}Vector(), null); + ${uncappedName}Writer.setPosition(idx()); + writers.add(${uncappedName}Writer); + } + return ${uncappedName}Writer; + } + + public ${name}Writer as${name}() { + data.getMutator().setType(idx(), MinorType.${name?upper_case}); + return get${name}Writer(); + } + + @Override + public void write(${name}Holder holder) { + data.getMutator().setType(idx(), MinorType.${name?upper_case}); + get${name}Writer().setPosition(idx()); + get${name}Writer().write${name}(<#list fields as field>holder.${field.name}<#if field_has_next>, ); + } + + public void write${minor.class}(<#list fields as field>${field.type} ${field.name}<#if field_has_next>, ) { + data.getMutator().setType(idx(), MinorType.${name?upper_case}); + get${name}Writer().setPosition(idx()); + get${name}Writer().write${name}(<#list fields as field>${field.name}<#if field_has_next>, ); + } + + + + + public void writeNull() { + } + + @Override + public MapWriter map() { + data.getMutator().setType(idx(), MinorType.LIST); + getListWriter().setPosition(idx()); + return getListWriter().map(); + } + + @Override + public ListWriter list() { + data.getMutator().setType(idx(), MinorType.LIST); + getListWriter().setPosition(idx()); + return getListWriter().list(); + } + + @Override + public ListWriter list(String name) { + data.getMutator().setType(idx(), MinorType.MAP); + getMapWriter().setPosition(idx()); + return 
getMapWriter().list(name); + } + + @Override + public MapWriter map(String name) { + data.getMutator().setType(idx(), MinorType.MAP); + getMapWriter().setPosition(idx()); + return getMapWriter().map(name); + } + + <#list vv.types as type><#list type.minor as minor> + <#assign lowerName = minor.class?uncap_first /> + <#if lowerName == "int" ><#assign lowerName = "integer" /> + <#assign upperName = minor.class?upper_case /> + <#assign capName = minor.class?cap_first /> + <#if !minor.class?starts_with("Decimal")> + @Override + public ${capName}Writer ${lowerName}(String name) { + data.getMutator().setType(idx(), MinorType.MAP); + getMapWriter().setPosition(idx()); + return getMapWriter().${lowerName}(name); + } + + @Override + public ${capName}Writer ${lowerName}() { + data.getMutator().setType(idx(), MinorType.LIST); + getListWriter().setPosition(idx()); + return getListWriter().${lowerName}(); + } + + + + @Override + public void allocate() { + data.allocateNew(); + } + + @Override + public void clear() { + data.clear(); + } + + @Override + public void close() throws Exception { + data.close(); + } + + @Override + public MaterializedField getField() { + return data.getField(); + } + + @Override + public int getValueCapacity() { + return data.getValueCapacity(); + } +} diff --git a/java/vector/src/main/codegen/templates/ValueHolders.java b/java/vector/src/main/codegen/templates/ValueHolders.java new file mode 100644 index 0000000000000..2b14194574a58 --- /dev/null +++ b/java/vector/src/main/codegen/templates/ValueHolders.java @@ -0,0 +1,116 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +<@pp.dropOutputFile /> +<#list vv.modes as mode> +<#list vv.types as type> +<#list type.minor as minor> + +<#assign className="${mode.prefix}${minor.class}Holder" /> +<@pp.changeOutputFile name="/org/apache/arrow/vector/holders/${className}.java" /> + +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.holders; + +<#include "/@includes/vv_imports.ftl" /> + +public final class ${className} implements ValueHolder{ + + public static final MajorType TYPE = new MajorType(MinorType.${minor.class?upper_case}, DataMode.${mode.name?upper_case}); + + public MajorType getType() {return TYPE;} + + <#if mode.name == "Repeated"> + + /** The first index (inclusive) into the Vector. **/ + public int start; + + /** The last index (exclusive) into the Vector. **/ + public int end; + + /** The Vector holding the actual values. 
**/ + public ${minor.class}Vector vector; + + <#else> + public static final int WIDTH = ${type.width}; + + <#if mode.name == "Optional">public int isSet; + <#assign fields = minor.fields!type.fields /> + <#list fields as field> + public ${field.type} ${field.name}; + + + <#if minor.class.startsWith("Decimal")> + public static final int maxPrecision = ${minor.maxPrecisionDigits}; + <#if minor.class.startsWith("Decimal28") || minor.class.startsWith("Decimal38")> + public static final int nDecimalDigits = ${minor.nDecimalDigits}; + + public static int getInteger(int index, int start, ArrowBuf buffer) { + int value = buffer.getInt(start + (index * 4)); + + if (index == 0) { + /* the first byte contains sign bit, return value without it */ + <#if minor.class.endsWith("Sparse")> + value = (value & 0x7FFFFFFF); + <#elseif minor.class.endsWith("Dense")> + value = (value & 0x0000007F); + + } + return value; + } + + public static void setInteger(int index, int value, int start, ArrowBuf buffer) { + buffer.setInt(start + (index * 4), value); + } + + public static void setSign(boolean sign, int start, ArrowBuf buffer) { + // Set MSB to 1 if sign is negative + if (sign == true) { + int value = getInteger(0, start, buffer); + setInteger(0, (value | 0x80000000), start, buffer); + } + } + + public static boolean getSign(int start, ArrowBuf buffer) { + return ((buffer.getInt(start) & 0x80000000) != 0); + } + + + @Deprecated + public int hashCode(){ + throw new UnsupportedOperationException(); + } + + /* + * Reason for deprecation is that ValueHolders are potential scalar replacements + * and hence we don't want any methods to be invoked on them. + */ + @Deprecated + public String toString(){ + throw new UnsupportedOperationException(); + } + + + + + +} + + + + \ No newline at end of file diff --git a/java/vector/src/main/codegen/templates/VariableLengthVectors.java b/java/vector/src/main/codegen/templates/VariableLengthVectors.java new file mode 100644 index 0000000000000..13d53b8e846ab --- /dev/null +++ b/java/vector/src/main/codegen/templates/VariableLengthVectors.java @@ -0,0 +1,644 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */

+import java.lang.Override;
+
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.vector.BaseDataValueVector;
+import org.apache.drill.exec.vector.BaseValueVector;
+import org.apache.drill.exec.vector.VariableWidthVector;
+
+<@pp.dropOutputFile />
+<#list vv.types as type>
+<#list type.minor as minor>
+
+<#assign friendlyType = (minor.friendlyType!minor.boxedType!type.boxedType) />
+
+<#if type.major == "VarLen">
+<@pp.changeOutputFile name="/org/apache/arrow/vector/${minor.class}Vector.java" />
+
+<#include "/@includes/license.ftl" />
+
+package org.apache.arrow.vector;
+
+<#include "/@includes/vv_imports.ftl" />
+
+/**
+ * ${minor.class}Vector implements a vector of variable width values. Elements in the vector
+ * are accessed by position from the logical start of the vector. A fixed width offsetVector
+ * is used to convert an element's position to its offset from the start of the (0-based)
+ * ArrowBuf. An element's size is inferred from the adjacent offsets.
+ *   The width of each element is ${type.width} byte(s)
+ *   The equivalent Java primitive is '${minor.javaType!type.javaType}'
+ *
+ * NB: this class is automatically generated from ${.template_name} and ValueVectorTypes.tdd using FreeMarker.
+ */
+public final class ${minor.class}Vector extends BaseDataValueVector implements VariableWidthVector {
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(${minor.class}Vector.class);
+
+  private static final int DEFAULT_RECORD_BYTE_COUNT = 8;
+  private static final int INITIAL_BYTE_COUNT = 4096 * DEFAULT_RECORD_BYTE_COUNT;
+  private static final int MIN_BYTE_COUNT = 4096;
+
+  public final static String OFFSETS_VECTOR_NAME = "$offsets$";
+  private final MaterializedField offsetsField = MaterializedField.create(OFFSETS_VECTOR_NAME, new MajorType(MinorType.UINT4, DataMode.REQUIRED));
+  private final UInt${type.width}Vector offsetVector = new UInt${type.width}Vector(offsetsField, allocator);
+  private final FieldReader reader = new ${minor.class}ReaderImpl(${minor.class}Vector.this);
+
+  private final Accessor accessor;
+  private final Mutator mutator;
+
+  private final UInt${type.width}Vector.Accessor oAccessor;
+
+  private int allocationSizeInBytes = INITIAL_BYTE_COUNT;
+  private int allocationMonitor = 0;
+
+  public ${minor.class}Vector(MaterializedField field, BufferAllocator allocator) {
+    super(field, allocator);
+    this.oAccessor = offsetVector.getAccessor();
+    this.accessor = new Accessor();
+    this.mutator = new Mutator();
+  }
+
+  @Override
+  public FieldReader getReader(){
+    return reader;
+  }
+
+  @Override
+  public int getBufferSize(){
+    if (getAccessor().getValueCount() == 0) {
+      return 0;
+    }
+    return offsetVector.getBufferSize() + data.writerIndex();
+  }
+
+  @Override
+  public int getBufferSizeFor(final int valueCount) {
+    if (valueCount == 0) {
+      return 0;
+    }
+
+    final int idx = offsetVector.getAccessor().get(valueCount);
+    return offsetVector.getBufferSizeFor(valueCount + 1) + idx;
+  }
+
+  @Override
+  public int getValueCapacity(){
+    return Math.max(offsetVector.getValueCapacity() - 1, 0);
+  }
+
+  @Override
+  public int getByteCapacity(){
+    return data.capacity();
+  }
+
+  @Override
+  public int getCurrentSizeInBytes() {
+    return offsetVector.getAccessor().get(getAccessor().getValueCount());
+  }
+
+  /**
+   * Return the number of bytes contained in the current var len byte vector.
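+   * This is simply the last entry in the offset vector, i.e. offsets[valueCount].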
+ * @return + */ + public int getVarByteLength(){ + final int valueCount = getAccessor().getValueCount(); + if(valueCount == 0) { + return 0; + } + return offsetVector.getAccessor().get(valueCount); + } + +// @Override +// public SerializedField getMetadata() { +// return getMetadataBuilder() // +// .addChild(offsetVector.getMetadata()) +// .setValueCount(getAccessor().getValueCount()) // +// .setBufferLength(getBufferSize()) // +// .build(); +// } +// +// @Override +// public void load(SerializedField metadata, ArrowBuf buffer) { +// the bits vector is the first child (the order in which the children are added in getMetadataBuilder is significant) +// final SerializedField offsetField = metadata.getChild(0); +// offsetVector.load(offsetField, buffer); +// +// final int capacity = buffer.capacity(); +// final int offsetsLength = offsetField.getBufferLength(); +// data = buffer.slice(offsetsLength, capacity - offsetsLength); +// data.retain(); +// } + + @Override + public void clear() { + super.clear(); + offsetVector.clear(); + } + + @Override + public ArrowBuf[] getBuffers(boolean clear) { + final ArrowBuf[] buffers = ObjectArrays.concat(offsetVector.getBuffers(false), super.getBuffers(false), ArrowBuf.class); + if (clear) { + // does not make much sense but we have to retain buffers even when clear is set. refactor this interface. + for (final ArrowBuf buffer:buffers) { + buffer.retain(1); + } + clear(); + } + return buffers; + } + + public long getOffsetAddr(){ + return offsetVector.getBuffer().memoryAddress(); + } + + public UInt${type.width}Vector getOffsetVector(){ + return offsetVector; + } + + @Override + public TransferPair getTransferPair(BufferAllocator allocator){ + return new TransferImpl(getField(), allocator); + } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator){ + return new TransferImpl(getField().withPath(ref), allocator); + } + + @Override + public TransferPair makeTransferPair(ValueVector to) { + return new TransferImpl((${minor.class}Vector) to); + } + + public void transferTo(${minor.class}Vector target){ + target.clear(); + this.offsetVector.transferTo(target.offsetVector); + target.data = data.transferOwnership(target.allocator).buffer; + target.data.writerIndex(data.writerIndex()); + clear(); + } + + public void splitAndTransferTo(int startIndex, int length, ${minor.class}Vector target) { + UInt${type.width}Vector.Accessor offsetVectorAccessor = this.offsetVector.getAccessor(); + final int startPoint = offsetVectorAccessor.get(startIndex); + final int sliceLength = offsetVectorAccessor.get(startIndex + length) - startPoint; + target.clear(); + target.offsetVector.allocateNew(length + 1); + offsetVectorAccessor = this.offsetVector.getAccessor(); + final UInt4Vector.Mutator targetOffsetVectorMutator = target.offsetVector.getMutator(); + for (int i = 0; i < length + 1; i++) { + targetOffsetVectorMutator.set(i, offsetVectorAccessor.get(startIndex + i) - startPoint); + } + target.data = data.slice(startPoint, sliceLength).transferOwnership(target.allocator).buffer; + target.getMutator().setValueCount(length); +} + + protected void copyFrom(int fromIndex, int thisIndex, ${minor.class}Vector from){ + final UInt4Vector.Accessor fromOffsetVectorAccessor = from.offsetVector.getAccessor(); + final int start = fromOffsetVectorAccessor.get(fromIndex); + final int end = fromOffsetVectorAccessor.get(fromIndex + 1); + final int len = end - start; + + final int outputStart = 
offsetVector.data.get${(minor.javaType!type.javaType)?cap_first}(thisIndex * ${type.width}); + from.data.getBytes(start, data, outputStart, len); + offsetVector.data.set${(minor.javaType!type.javaType)?cap_first}( (thisIndex+1) * ${type.width}, outputStart + len); + } + + public boolean copyFromSafe(int fromIndex, int thisIndex, ${minor.class}Vector from){ + final UInt${type.width}Vector.Accessor fromOffsetVectorAccessor = from.offsetVector.getAccessor(); + final int start = fromOffsetVectorAccessor.get(fromIndex); + final int end = fromOffsetVectorAccessor.get(fromIndex + 1); + final int len = end - start; + final int outputStart = offsetVector.data.get${(minor.javaType!type.javaType)?cap_first}(thisIndex * ${type.width}); + + while(data.capacity() < outputStart + len) { + reAlloc(); + } + + offsetVector.getMutator().setSafe(thisIndex + 1, outputStart + len); + from.data.getBytes(start, data, outputStart, len); + return true; + } + + private class TransferImpl implements TransferPair{ + ${minor.class}Vector to; + + public TransferImpl(MaterializedField field, BufferAllocator allocator){ + to = new ${minor.class}Vector(field, allocator); + } + + public TransferImpl(${minor.class}Vector to){ + this.to = to; + } + + @Override + public ${minor.class}Vector getTo(){ + return to; + } + + @Override + public void transfer(){ + transferTo(to); + } + + @Override + public void splitAndTransfer(int startIndex, int length) { + splitAndTransferTo(startIndex, length, to); + } + + @Override + public void copyValueSafe(int fromIndex, int toIndex) { + to.copyFromSafe(fromIndex, toIndex, ${minor.class}Vector.this); + } + } + + @Override + public void setInitialCapacity(final int valueCount) { + final long size = 1L * valueCount * ${type.width}; + if (size > MAX_ALLOCATION_SIZE) { + throw new OversizedAllocationException("Requested amount of memory is more than max allowed allocation size"); + } + allocationSizeInBytes = (int)size; + offsetVector.setInitialCapacity(valueCount + 1); + } + + @Override + public void allocateNew() { + if(!allocateNewSafe()){ + throw new OutOfMemoryException("Failure while allocating buffer."); + } + } + + @Override + public boolean allocateNewSafe() { + long curAllocationSize = allocationSizeInBytes; + if (allocationMonitor > 10) { + curAllocationSize = Math.max(MIN_BYTE_COUNT, curAllocationSize / 2); + allocationMonitor = 0; + } else if (allocationMonitor < -2) { + curAllocationSize = curAllocationSize * 2L; + allocationMonitor = 0; + } + + if (curAllocationSize > MAX_ALLOCATION_SIZE) { + return false; + } + + clear(); + /* Boolean to keep track if all the memory allocations were successful + * Used in the case of composite vectors when we need to allocate multiple + * buffers for multiple vectors. 
If one of the allocations failed we need to + * clear all the memory that we allocated + */ + try { + final int requestedSize = (int)curAllocationSize; + data = allocator.buffer(requestedSize); + allocationSizeInBytes = requestedSize; + offsetVector.allocateNew(); + } catch (OutOfMemoryException e) { + clear(); + return false; + } + data.readerIndex(0); + offsetVector.zeroVector(); + return true; + } + + @Override + public void allocateNew(int totalBytes, int valueCount) { + clear(); + assert totalBytes >= 0; + try { + data = allocator.buffer(totalBytes); + offsetVector.allocateNew(valueCount + 1); + } catch (RuntimeException e) { + clear(); + throw e; + } + data.readerIndex(0); + allocationSizeInBytes = totalBytes; + offsetVector.zeroVector(); + } + + @Override + public void reset() { + allocationSizeInBytes = INITIAL_BYTE_COUNT; + allocationMonitor = 0; + data.readerIndex(0); + offsetVector.zeroVector(); + super.reset(); + } + + public void reAlloc() { + final long newAllocationSize = allocationSizeInBytes*2L; + if (newAllocationSize > MAX_ALLOCATION_SIZE) { + throw new OversizedAllocationException("Unable to expand the buffer. Max allowed buffer size is reached."); + } + + final ArrowBuf newBuf = allocator.buffer((int)newAllocationSize); + newBuf.setBytes(0, data, 0, data.capacity()); + data.release(); + data = newBuf; + allocationSizeInBytes = (int)newAllocationSize; + } + + public void decrementAllocationMonitor() { + if (allocationMonitor > 0) { + allocationMonitor = 0; + } + --allocationMonitor; + } + + private void incrementAllocationMonitor() { + ++allocationMonitor; + } + + @Override + public Accessor getAccessor(){ + return accessor; + } + + @Override + public Mutator getMutator() { + return mutator; + } + + public final class Accessor extends BaseValueVector.BaseAccessor implements VariableWidthAccessor { + final UInt${type.width}Vector.Accessor oAccessor = offsetVector.getAccessor(); + public long getStartEnd(int index){ + return oAccessor.getTwoAsLong(index); + } + + public byte[] get(int index) { + assert index >= 0; + final int startIdx = oAccessor.get(index); + final int length = oAccessor.get(index + 1) - startIdx; + assert length >= 0; + final byte[] dst = new byte[length]; + data.getBytes(startIdx, dst, 0, length); + return dst; + } + + @Override + public int getValueLength(int index) { + final UInt${type.width}Vector.Accessor offsetVectorAccessor = offsetVector.getAccessor(); + return offsetVectorAccessor.get(index + 1) - offsetVectorAccessor.get(index); + } + + public void get(int index, ${minor.class}Holder holder){ + holder.start = oAccessor.get(index); + holder.end = oAccessor.get(index + 1); + holder.buffer = data; + } + + public void get(int index, Nullable${minor.class}Holder holder){ + holder.isSet = 1; + holder.start = oAccessor.get(index); + holder.end = oAccessor.get(index + 1); + holder.buffer = data; + } + + + <#switch minor.class> + <#case "VarChar"> + @Override + public ${friendlyType} getObject(int index) { + Text text = new Text(); + text.set(get(index)); + return text; + } + <#break> + <#case "Var16Char"> + @Override + public ${friendlyType} getObject(int index) { + return new String(get(index), Charsets.UTF_16); + } + <#break> + <#default> + @Override + public ${friendlyType} getObject(int index) { + return get(index); + } + + + @Override + public int getValueCount() { + return Math.max(offsetVector.getAccessor().getValueCount()-1, 0); + } + + @Override + public boolean isNull(int index){ + return false; + } + + public UInt${type.width}Vector 
getOffsetVector(){ + return offsetVector; + } + } + + /** + * Mutable${minor.class} implements a vector of variable width values. Elements in the vector + * are accessed by position from the logical start of the vector. A fixed width offsetVector + * is used to convert an element's position to it's offset from the start of the (0-based) + * ArrowBuf. Size is inferred by adjacent elements. + * The width of each element is ${type.width} byte(s) + * The equivalent Java primitive is '${minor.javaType!type.javaType}' + * + * NB: this class is automatically generated from ValueVectorTypes.tdd using FreeMarker. + */ + public final class Mutator extends BaseValueVector.BaseMutator implements VariableWidthVector.VariableWidthMutator { + + /** + * Set the variable length element at the specified index to the supplied byte array. + * + * @param index position of the bit to set + * @param bytes array of bytes to write + */ + protected void set(int index, byte[] bytes) { + assert index >= 0; + final int currentOffset = offsetVector.getAccessor().get(index); + offsetVector.getMutator().set(index + 1, currentOffset + bytes.length); + data.setBytes(currentOffset, bytes, 0, bytes.length); + } + + public void setSafe(int index, byte[] bytes) { + assert index >= 0; + + final int currentOffset = offsetVector.getAccessor().get(index); + while (data.capacity() < currentOffset + bytes.length) { + reAlloc(); + } + offsetVector.getMutator().setSafe(index + 1, currentOffset + bytes.length); + data.setBytes(currentOffset, bytes, 0, bytes.length); + } + + /** + * Set the variable length element at the specified index to the supplied byte array. + * + * @param index position of the bit to set + * @param bytes array of bytes to write + * @param start start index of bytes to write + * @param length length of bytes to write + */ + protected void set(int index, byte[] bytes, int start, int length) { + assert index >= 0; + final int currentOffset = offsetVector.getAccessor().get(index); + offsetVector.getMutator().set(index + 1, currentOffset + length); + data.setBytes(currentOffset, bytes, start, length); + } + + public void setSafe(int index, ByteBuffer bytes, int start, int length) { + assert index >= 0; + + int currentOffset = offsetVector.getAccessor().get(index); + + while (data.capacity() < currentOffset + length) { + reAlloc(); + } + offsetVector.getMutator().setSafe(index + 1, currentOffset + length); + data.setBytes(currentOffset, bytes, start, length); + } + + public void setSafe(int index, byte[] bytes, int start, int length) { + assert index >= 0; + + final int currentOffset = offsetVector.getAccessor().get(index); + + while (data.capacity() < currentOffset + length) { + reAlloc(); + } + offsetVector.getMutator().setSafe(index + 1, currentOffset + length); + data.setBytes(currentOffset, bytes, start, length); + } + + @Override + public void setValueLengthSafe(int index, int length) { + final int offset = offsetVector.getAccessor().get(index); + while(data.capacity() < offset + length ) { + reAlloc(); + } + offsetVector.getMutator().setSafe(index + 1, offsetVector.getAccessor().get(index) + length); + } + + + public void setSafe(int index, int start, int end, ArrowBuf buffer){ + final int len = end - start; + final int outputStart = offsetVector.data.get${(minor.javaType!type.javaType)?cap_first}(index * ${type.width}); + + while(data.capacity() < outputStart + len) { + reAlloc(); + } + + offsetVector.getMutator().setSafe( index+1, outputStart + len); + buffer.getBytes(start, data, outputStart, len); + } + + 
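+    /* Illustrative note, not part of the original code: every setSafe variant
+     * in this Mutator follows the same recipe -- read the running end offset of
+     * the previous element from the offset vector, call reAlloc() until the new
+     * bytes fit, record the new end offset at index + 1, and copy the payload.
+     * For a VarChar vector, for example:
+     *
+     *   mutator.setSafe(0, "ab".getBytes(Charsets.UTF_8));
+     *   mutator.setSafe(1, "cde".getBytes(Charsets.UTF_8));
+     *
+     * leaves the offset vector as [0, 2, 5] and the data buffer as "abcde".
+     */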
public void setSafe(int index, Nullable${minor.class}Holder holder){ + assert holder.isSet == 1; + + final int start = holder.start; + final int end = holder.end; + final int len = end - start; + + int outputStart = offsetVector.data.get${(minor.javaType!type.javaType)?cap_first}(index * ${type.width}); + + while(data.capacity() < outputStart + len) { + reAlloc(); + } + + holder.buffer.getBytes(start, data, outputStart, len); + offsetVector.getMutator().setSafe( index+1, outputStart + len); + } + + public void setSafe(int index, ${minor.class}Holder holder){ + final int start = holder.start; + final int end = holder.end; + final int len = end - start; + final int outputStart = offsetVector.data.get${(minor.javaType!type.javaType)?cap_first}(index * ${type.width}); + + while(data.capacity() < outputStart + len) { + reAlloc(); + } + + holder.buffer.getBytes(start, data, outputStart, len); + offsetVector.getMutator().setSafe( index+1, outputStart + len); + } + + protected void set(int index, int start, int length, ArrowBuf buffer){ + assert index >= 0; + final int currentOffset = offsetVector.getAccessor().get(index); + offsetVector.getMutator().set(index + 1, currentOffset + length); + final ArrowBuf bb = buffer.slice(start, length); + data.setBytes(currentOffset, bb); + } + + protected void set(int index, Nullable${minor.class}Holder holder){ + final int length = holder.end - holder.start; + final int currentOffset = offsetVector.getAccessor().get(index); + offsetVector.getMutator().set(index + 1, currentOffset + length); + data.setBytes(currentOffset, holder.buffer, holder.start, length); + } + + protected void set(int index, ${minor.class}Holder holder){ + final int length = holder.end - holder.start; + final int currentOffset = offsetVector.getAccessor().get(index); + offsetVector.getMutator().set(index + 1, currentOffset + length); + data.setBytes(currentOffset, holder.buffer, holder.start, length); + } + + @Override + public void setValueCount(int valueCount) { + final int currentByteCapacity = getByteCapacity(); + final int idx = offsetVector.getAccessor().get(valueCount); + data.writerIndex(idx); + if (valueCount > 0 && currentByteCapacity > idx * 2) { + incrementAllocationMonitor(); + } else if (allocationMonitor > 0) { + allocationMonitor = 0; + } + VectorTrimmer.trim(data, idx); + offsetVector.getMutator().setValueCount(valueCount == 0 ? 0 : valueCount+1); + } + + @Override + public void generateTestData(int size){ + boolean even = true; + <#switch minor.class> + <#case "Var16Char"> + final java.nio.charset.Charset charset = Charsets.UTF_16; + <#break> + <#case "VarChar"> + <#default> + final java.nio.charset.Charset charset = Charsets.UTF_8; + + final byte[] evenValue = new String("aaaaa").getBytes(charset); + final byte[] oddValue = new String("bbbbbbbbbb").getBytes(charset); + for(int i =0; i < size; i++, even = !even){ + set(i, even ? evenValue : oddValue); + } + setValueCount(size); + } + } +} + + <#-- type.major --> + + diff --git a/java/vector/src/main/java/org/apache/arrow/vector/AddOrGetResult.java b/java/vector/src/main/java/org/apache/arrow/vector/AddOrGetResult.java new file mode 100644 index 0000000000000..388eb9c447977 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/AddOrGetResult.java @@ -0,0 +1,38 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector;
+
+import com.google.common.base.Preconditions;
+
+public class AddOrGetResult<V extends ValueVector> {
+  private final V vector;
+  private final boolean created;
+
+  public AddOrGetResult(V vector, boolean created) {
+    this.vector = Preconditions.checkNotNull(vector);
+    this.created = created;
+  }
+
+  public V getVector() {
+    return vector;
+  }
+
+  public boolean isCreated() {
+    return created;
+  }
+}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/AllocationHelper.java b/java/vector/src/main/java/org/apache/arrow/vector/AllocationHelper.java
new file mode 100644
index 0000000000000..54c3cd7331e0f
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/AllocationHelper.java
@@ -0,0 +1,61 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */ +package org.apache.arrow.vector; + +import org.apache.arrow.vector.complex.RepeatedFixedWidthVectorLike; +import org.apache.arrow.vector.complex.RepeatedVariableWidthVectorLike; + +public class AllocationHelper { +// private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(AllocationHelper.class); + + public static void allocate(ValueVector v, int valueCount, int bytesPerValue) { + allocate(v, valueCount, bytesPerValue, 5); + } + + public static void allocatePrecomputedChildCount(ValueVector v, int valueCount, int bytesPerValue, int childValCount){ + if(v instanceof FixedWidthVector) { + ((FixedWidthVector) v).allocateNew(valueCount); + } else if (v instanceof VariableWidthVector) { + ((VariableWidthVector) v).allocateNew(valueCount * bytesPerValue, valueCount); + } else if(v instanceof RepeatedFixedWidthVectorLike) { + ((RepeatedFixedWidthVectorLike) v).allocateNew(valueCount, childValCount); + } else if(v instanceof RepeatedVariableWidthVectorLike) { + ((RepeatedVariableWidthVectorLike) v).allocateNew(childValCount * bytesPerValue, valueCount, childValCount); + } else { + v.allocateNew(); + } + } + + public static void allocate(ValueVector v, int valueCount, int bytesPerValue, int repeatedPerTop){ + allocatePrecomputedChildCount(v, valueCount, bytesPerValue, repeatedPerTop * valueCount); + } + + /** + * Allocates the exact amount if v is fixed width, otherwise falls back to dynamic allocation + * @param v value vector we are trying to allocate + * @param valueCount size we are trying to allocate + * @throws org.apache.drill.exec.memory.OutOfMemoryException if it can't allocate the memory + */ + public static void allocateNew(ValueVector v, int valueCount) { + if (v instanceof FixedWidthVector) { + ((FixedWidthVector) v).allocateNew(valueCount); + } else { + v.allocateNew(); + } + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java new file mode 100644 index 0000000000000..b129ea9bcb95f --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java @@ -0,0 +1,91 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector; + +import io.netty.buffer.ArrowBuf; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.types.MaterializedField; + + +public abstract class BaseDataValueVector extends BaseValueVector { + + protected final static byte[] emptyByteArray = new byte[]{}; // Nullable vectors use this + + protected ArrowBuf data; + + public BaseDataValueVector(MaterializedField field, BufferAllocator allocator) { + super(field, allocator); + data = allocator.getEmpty(); + } + + @Override + public void clear() { + if (data != null) { + data.release(); + } + data = allocator.getEmpty(); + super.clear(); + } + + @Override + public void close() { + clear(); + if (data != null) { + data.release(); + data = null; + } + super.close(); + } + + @Override + public ArrowBuf[] getBuffers(boolean clear) { + ArrowBuf[] out; + if (getBufferSize() == 0) { + out = new ArrowBuf[0]; + } else { + out = new ArrowBuf[]{data}; + data.readerIndex(0); + if (clear) { + data.retain(1); + } + } + if (clear) { + clear(); + } + return out; + } + + @Override + public int getBufferSize() { + if (getAccessor().getValueCount() == 0) { + return 0; + } + return data.writerIndex(); + } + + public ArrowBuf getBuffer() { + return data; + } + + /** + * This method has a similar effect of allocateNew() without actually clearing and reallocating + * the value vector. The purpose is to move the value vector to a "mutate" state + */ + public void reset() {} +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseValueVector.java new file mode 100644 index 0000000000000..8bca3c005370e --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseValueVector.java @@ -0,0 +1,125 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector; + +import io.netty.buffer.ArrowBuf; + +import java.util.Iterator; + +import com.google.common.base.Preconditions; +import com.google.common.collect.Iterators; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.types.MaterializedField; +import org.apache.arrow.vector.util.TransferPair; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public abstract class BaseValueVector implements ValueVector { + private static final Logger logger = LoggerFactory.getLogger(BaseValueVector.class); + + public static final int MAX_ALLOCATION_SIZE = Integer.MAX_VALUE; + public static final int INITIAL_VALUE_ALLOCATION = 4096; + + protected final BufferAllocator allocator; + protected final MaterializedField field; + + protected BaseValueVector(MaterializedField field, BufferAllocator allocator) { + this.field = Preconditions.checkNotNull(field, "field cannot be null"); + this.allocator = Preconditions.checkNotNull(allocator, "allocator cannot be null"); + } + + @Override + public String toString() { + return super.toString() + "[field = " + field + ", ...]"; + } + + @Override + public void clear() { + getMutator().reset(); + } + + @Override + public void close() { + clear(); + } + + @Override + public MaterializedField getField() { + return field; + } + + public MaterializedField getField(String ref){ + return getField().withPath(ref); + } + + @Override + public TransferPair getTransferPair(BufferAllocator allocator) { + return getTransferPair(getField().getPath(), allocator); + } + +// public static SerializedField getMetadata(BaseValueVector vector) { +// return getMetadataBuilder(vector).build(); +// } +// +// protected static SerializedField.Builder getMetadataBuilder(BaseValueVector vector) { +// return SerializedFieldHelper.getAsBuilder(vector.getField()) +// .setValueCount(vector.getAccessor().getValueCount()) +// .setBufferLength(vector.getBufferSize()); +// } + + public abstract static class BaseAccessor implements ValueVector.Accessor { + protected BaseAccessor() { } + + @Override + public boolean isNull(int index) { + return false; + } + } + + public abstract static class BaseMutator implements ValueVector.Mutator { + protected BaseMutator() { } + + @Override + public void generateTestData(int values) {} + + //TODO: consider making mutator stateless(if possible) on another issue. + public void reset() {} + } + + @Override + public Iterator iterator() { + return Iterators.emptyIterator(); + } + + public static boolean checkBufRefs(final ValueVector vv) { + for(final ArrowBuf buffer : vv.getBuffers(false)) { + if (buffer.refCnt() <= 0) { + throw new IllegalStateException("zero refcount"); + } + } + + return true; + } + + @Override + public BufferAllocator getAllocator() { + return allocator; + } +} + diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java new file mode 100644 index 0000000000000..952e9028e0668 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java @@ -0,0 +1,450 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector; + +import io.netty.buffer.ArrowBuf; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.OutOfMemoryException; +import org.apache.arrow.vector.complex.impl.BitReaderImpl; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.holders.BitHolder; +import org.apache.arrow.vector.holders.NullableBitHolder; +import org.apache.arrow.vector.types.MaterializedField; +import org.apache.arrow.vector.util.OversizedAllocationException; +import org.apache.arrow.vector.util.TransferPair; + +/** + * Bit implements a vector of bit-width values. Elements in the vector are accessed by position from the logical start + * of the vector. The width of each element is 1 bit. The equivalent Java primitive is an int containing the value '0' + * or '1'. + */ +public final class BitVector extends BaseDataValueVector implements FixedWidthVector { + static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(BitVector.class); + + private final FieldReader reader = new BitReaderImpl(BitVector.this); + private final Accessor accessor = new Accessor(); + private final Mutator mutator = new Mutator(); + + private int valueCount; + private int allocationSizeInBytes = INITIAL_VALUE_ALLOCATION; + private int allocationMonitor = 0; + + public BitVector(MaterializedField field, BufferAllocator allocator) { + super(field, allocator); + } + + @Override + public FieldReader getReader() { + return reader; + } + + @Override + public int getBufferSize() { + return getSizeFromCount(valueCount); + } + + @Override + public int getBufferSizeFor(final int valueCount) { + return getSizeFromCount(valueCount); + } + + private int getSizeFromCount(int valueCount) { + return (int) Math.ceil(valueCount / 8.0); + } + + @Override + public int getValueCapacity() { + return (int)Math.min((long)Integer.MAX_VALUE, data.capacity() * 8L); + } + + private int getByteIndex(int index) { + return (int) Math.floor(index / 8.0); + } + + @Override + public void setInitialCapacity(final int valueCount) { + allocationSizeInBytes = getSizeFromCount(valueCount); + } + + @Override + public void allocateNew() { + if (!allocateNewSafe()) { + throw new OutOfMemoryException(); + } + } + + @Override + public boolean allocateNewSafe() { + long curAllocationSize = allocationSizeInBytes; + if (allocationMonitor > 10) { + curAllocationSize = Math.max(8, allocationSizeInBytes / 2); + allocationMonitor = 0; + } else if (allocationMonitor < -2) { + curAllocationSize = allocationSizeInBytes * 2L; + allocationMonitor = 0; + } + + try { + allocateBytes(curAllocationSize); + } catch (OutOfMemoryException ex) { + return false; + } + return true; + } + + @Override + public void reset() { + valueCount = 0; + allocationSizeInBytes = INITIAL_VALUE_ALLOCATION; + allocationMonitor = 0; + zeroVector(); + super.reset(); + } + + /** + * Allocate a new memory space for this vector. Must be called prior to using the ValueVector. + * + * @param valueCount + * The number of values which can be contained within this vector. 
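+   *            (An illustrative sizing note, not in the original code: the
+   *            backing buffer is sized to ceil(valueCount / 8.0) bytes, so
+   *            allocateNew(100) reserves 13 bytes.)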
+ */ + @Override + public void allocateNew(int valueCount) { + final int size = getSizeFromCount(valueCount); + allocateBytes(size); + } + + private void allocateBytes(final long size) { + if (size > MAX_ALLOCATION_SIZE) { + throw new OversizedAllocationException("Requested amount of memory is more than max allowed allocation size"); + } + + final int curSize = (int) size; + clear(); + data = allocator.buffer(curSize); + zeroVector(); + allocationSizeInBytes = curSize; + } + + /** + * Allocate new buffer with double capacity, and copy data into the new buffer. Replace vector's buffer with new buffer, and release old one + */ + public void reAlloc() { + final long newAllocationSize = allocationSizeInBytes * 2L; + if (newAllocationSize > MAX_ALLOCATION_SIZE) { + throw new OversizedAllocationException("Requested amount of memory is more than max allowed allocation size"); + } + + final int curSize = (int)newAllocationSize; + final ArrowBuf newBuf = allocator.buffer(curSize); + newBuf.setZero(0, newBuf.capacity()); + newBuf.setBytes(0, data, 0, data.capacity()); + data.release(); + data = newBuf; + allocationSizeInBytes = curSize; + } + + /** + * {@inheritDoc} + */ + @Override + public void zeroVector() { + data.setZero(0, data.capacity()); + } + + public void copyFrom(int inIndex, int outIndex, BitVector from) { + this.mutator.set(outIndex, from.accessor.get(inIndex)); + } + + public boolean copyFromSafe(int inIndex, int outIndex, BitVector from) { + if (outIndex >= this.getValueCapacity()) { + decrementAllocationMonitor(); + return false; + } + copyFrom(inIndex, outIndex, from); + return true; + } + +// @Override +// public void load(SerializedField metadata, DrillBuf buffer) { +// Preconditions.checkArgument(this.field.getPath().equals(metadata.getNamePart().getName()), "The field %s doesn't match the provided metadata %s.", this.field, metadata); +// final int valueCount = metadata.getValueCount(); +// final int expectedLength = getSizeFromCount(valueCount); +// final int actualLength = metadata.getBufferLength(); +// assert expectedLength == actualLength: "expected and actual buffer sizes do not match"; +// +// clear(); +// data = buffer.slice(0, actualLength); +// data.retain(); +// this.valueCount = valueCount; +// } + + @Override + public Mutator getMutator() { + return new Mutator(); + } + + @Override + public Accessor getAccessor() { + return new Accessor(); + } + + @Override + public TransferPair getTransferPair(BufferAllocator allocator) { + return new TransferImpl(getField(), allocator); + } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator) { + return new TransferImpl(getField().withPath(ref), allocator); + } + + @Override + public TransferPair makeTransferPair(ValueVector to) { + return new TransferImpl((BitVector) to); + } + + + public void transferTo(BitVector target) { + target.clear(); + if (target.data != null) { + target.data.release(); + } + target.data = data; + target.data.retain(1); + target.valueCount = valueCount; + clear(); + } + + public void splitAndTransferTo(int startIndex, int length, BitVector target) { + assert startIndex + length <= valueCount; + int firstByte = getByteIndex(startIndex); + int byteSize = getSizeFromCount(length); + int offset = startIndex % 8; + if (offset == 0) { + target.clear(); + // slice + if (target.data != null) { + target.data.release(); + } + target.data = (ArrowBuf) data.slice(firstByte, byteSize); + target.data.retain(1); + } else { + // Copy data + // When the first bit starts from the 
middle of a byte (offset != 0), copy data from src BitVector. + // Each byte in the target is composed by a part in i-th byte, another part in (i+1)-th byte. + // The last byte copied to target is a bit tricky : + // 1) if length requires partly byte (length % 8 !=0), copy the remaining bits only. + // 2) otherwise, copy the last byte in the same way as to the prior bytes. + target.clear(); + target.allocateNew(length); + // TODO maybe do this one word at a time, rather than byte? + for(int i = 0; i < byteSize - 1; i++) { + target.data.setByte(i, (((this.data.getByte(firstByte + i) & 0xFF) >>> offset) + (this.data.getByte(firstByte + i + 1) << (8 - offset)))); + } + if (length % 8 != 0) { + target.data.setByte(byteSize - 1, ((this.data.getByte(firstByte + byteSize - 1) & 0xFF) >>> offset)); + } else { + target.data.setByte(byteSize - 1, + (((this.data.getByte(firstByte + byteSize - 1) & 0xFF) >>> offset) + (this.data.getByte(firstByte + byteSize) << (8 - offset)))); + } + } + target.getMutator().setValueCount(length); + } + + private class TransferImpl implements TransferPair { + BitVector to; + + public TransferImpl(MaterializedField field, BufferAllocator allocator) { + this.to = new BitVector(field, allocator); + } + + public TransferImpl(BitVector to) { + this.to = to; + } + + @Override + public BitVector getTo() { + return to; + } + + @Override + public void transfer() { + transferTo(to); + } + + @Override + public void splitAndTransfer(int startIndex, int length) { + splitAndTransferTo(startIndex, length, to); + } + + @Override + public void copyValueSafe(int fromIndex, int toIndex) { + to.copyFromSafe(fromIndex, toIndex, BitVector.this); + } + } + + private void decrementAllocationMonitor() { + if (allocationMonitor > 0) { + allocationMonitor = 0; + } + --allocationMonitor; + } + + private void incrementAllocationMonitor() { + ++allocationMonitor; + } + + public class Accessor extends BaseAccessor { + + /** + * Get the byte holding the desired bit, then mask all other bits. Iff the result is 0, the bit was not set. + * + * @param index + * position of the bit in the vector + * @return 1 if set, otherwise 0 + */ + public final int get(int index) { + int byteIndex = index >> 3; + byte b = data.getByte(byteIndex); + int bitIndex = index & 7; + return Long.bitCount(b & (1L << bitIndex)); + } + + @Override + public boolean isNull(int index) { + return false; + } + + @Override + public final Boolean getObject(int index) { + return new Boolean(get(index) != 0); + } + + @Override + public final int getValueCount() { + return valueCount; + } + + public final void get(int index, BitHolder holder) { + holder.value = get(index); + } + + public final void get(int index, NullableBitHolder holder) { + holder.isSet = 1; + holder.value = get(index); + } + } + + /** + * MutableBit implements a vector of bit-width values. Elements in the vector are accessed by position from the + * logical start of the vector. Values should be pushed onto the vector sequentially, but may be randomly accessed. + * + * NB: this class is automatically generated from ValueVectorTypes.tdd using FreeMarker. + */ + public class Mutator extends BaseMutator { + + private Mutator() { + } + + /** + * Set the bit at the given index to the specified value. 
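+     * (A worked example, not in the original code: set(10, 1) computes
+     * byteIndex = 10 >> 3 = 1 and bitIndex = 10 & 7 = 2, then ORs the mask
+     * 1 << 2 = 0b00000100 into byte 1 of the buffer.)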
+ * + * @param index + * position of the bit to set + * @param value + * value to set (either 1 or 0) + */ + public final void set(int index, int value) { + int byteIndex = index >> 3; + int bitIndex = index & 7; + byte currentByte = data.getByte(byteIndex); + byte bitMask = (byte) (1L << bitIndex); + if (value != 0) { + currentByte |= bitMask; + } else { + currentByte -= (bitMask & currentByte); + } + + data.setByte(byteIndex, currentByte); + } + + public final void set(int index, BitHolder holder) { + set(index, holder.value); + } + + final void set(int index, NullableBitHolder holder) { + set(index, holder.value); + } + + public void setSafe(int index, int value) { + while(index >= getValueCapacity()) { + reAlloc(); + } + set(index, value); + } + + public void setSafe(int index, BitHolder holder) { + while(index >= getValueCapacity()) { + reAlloc(); + } + set(index, holder.value); + } + + public void setSafe(int index, NullableBitHolder holder) { + while(index >= getValueCapacity()) { + reAlloc(); + } + set(index, holder.value); + } + + @Override + public final void setValueCount(int valueCount) { + int currentValueCapacity = getValueCapacity(); + BitVector.this.valueCount = valueCount; + int idx = getSizeFromCount(valueCount); + while(valueCount > getValueCapacity()) { + reAlloc(); + } + if (valueCount > 0 && currentValueCapacity > valueCount * 2) { + incrementAllocationMonitor(); + } else if (allocationMonitor > 0) { + allocationMonitor = 0; + } + VectorTrimmer.trim(data, idx); + } + + @Override + public final void generateTestData(int values) { + boolean even = true; + for(int i = 0; i < values; i++, even = !even) { + if (even) { + set(i, 1); + } + } + setValueCount(values); + } + + } + + @Override + public void clear() { + this.valueCount = 0; + super.clear(); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/FixedWidthVector.java b/java/vector/src/main/java/org/apache/arrow/vector/FixedWidthVector.java new file mode 100644 index 0000000000000..59057000bbca9 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/FixedWidthVector.java @@ -0,0 +1,35 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector; + + +public interface FixedWidthVector extends ValueVector{ + + /** + * Allocate a new memory space for this vector. Must be called prior to using the ValueVector. + * + * @param valueCount Number of values in the vector. + */ + void allocateNew(int valueCount); + +/** + * Zero out the underlying buffer backing this vector. 
+ */ + void zeroVector(); + +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/NullableVector.java b/java/vector/src/main/java/org/apache/arrow/vector/NullableVector.java new file mode 100644 index 0000000000000..00c33fc2d6e6c --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/NullableVector.java @@ -0,0 +1,23 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector; + +public interface NullableVector extends ValueVector{ + + ValueVector getValuesVector(); +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/NullableVectorDefinitionSetter.java b/java/vector/src/main/java/org/apache/arrow/vector/NullableVectorDefinitionSetter.java new file mode 100644 index 0000000000000..b819c5d39e99c --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/NullableVectorDefinitionSetter.java @@ -0,0 +1,23 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector; + +public interface NullableVectorDefinitionSetter { + + public void setIndexDefined(int index); +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ObjectVector.java b/java/vector/src/main/java/org/apache/arrow/vector/ObjectVector.java new file mode 100644 index 0000000000000..b806b180e7014 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/ObjectVector.java @@ -0,0 +1,220 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector;
+
+import io.netty.buffer.ArrowBuf;
+
+import java.util.ArrayList;
+import java.util.Iterator;
+import java.util.List;
+
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.OutOfMemoryException;
+import org.apache.arrow.vector.complex.reader.FieldReader;
+import org.apache.arrow.vector.holders.ObjectHolder;
+import org.apache.arrow.vector.types.MaterializedField;
+import org.apache.arrow.vector.util.TransferPair;
+
+public class ObjectVector extends BaseValueVector {
+  private final Accessor accessor = new Accessor();
+  private final Mutator mutator = new Mutator();
+  private int maxCount = 0;
+  private int count = 0;
+  private int allocationSize = 4096;
+
+  private List<Object[]> objectArrayList = new ArrayList<>();
+
+  public ObjectVector(MaterializedField field, BufferAllocator allocator) {
+    super(field, allocator);
+  }
+
+  public void addNewArray() {
+    objectArrayList.add(new Object[allocationSize]);
+    maxCount += allocationSize;
+  }
+
+  @Override
+  public FieldReader getReader() {
+    throw new UnsupportedOperationException("ObjectVector does not support this");
+  }
+
+  public final class Mutator implements ValueVector.Mutator {
+
+    public void set(int index, Object obj) {
+      int listOffset = index / allocationSize;
+      if (listOffset >= objectArrayList.size()) {
+        addNewArray();
+      }
+      objectArrayList.get(listOffset)[index % allocationSize] = obj;
+    }
+
+    public boolean setSafe(int index, long value) {
+      set(index, value);
+      return true;
+    }
+
+    protected void set(int index, ObjectHolder holder) {
+      set(index, holder.obj);
+    }
+
+    public boolean setSafe(int index, ObjectHolder holder){
+      set(index, holder);
+      return true;
+    }
+
+    @Override
+    public void setValueCount(int valueCount) {
+      count = valueCount;
+    }
+
+    @Override
+    public void reset() {
+      count = 0;
+      maxCount = 0;
+      objectArrayList = new ArrayList<>();
+      addNewArray();
+    }
+
+    @Override
+    public void generateTestData(int values) {
+    }
+  }
+
+  @Override
+  public void setInitialCapacity(int numRecords) {
+    // NoOp
+  }
+
+  @Override
+  public void allocateNew() throws OutOfMemoryException {
+    addNewArray();
+  }
+
+  public void allocateNew(int valueCount) throws OutOfMemoryException {
+    while (maxCount < valueCount) {
+      addNewArray();
+    }
+  }
+
+  @Override
+  public boolean allocateNewSafe() {
+    allocateNew();
+    return true;
+  }
+
+  @Override
+  public int getBufferSize() {
+    throw new UnsupportedOperationException("ObjectVector does not support this");
+  }
+
+  @Override
+  public int getBufferSizeFor(final int valueCount) {
+    throw new UnsupportedOperationException("ObjectVector does not support this");
+  }
+
+  @Override
+  public void close() {
+    clear();
+  }
+
+  @Override
+  public void clear() {
+    objectArrayList.clear();
+    maxCount = 0;
+    count = 0;
+  }
+
+  @Override
+  public MaterializedField getField() {
+    return field;
+  }
+
+  @Override
+  public TransferPair getTransferPair(BufferAllocator allocator) {
+    throw new UnsupportedOperationException("ObjectVector does not support this");
+  }
+
+  @Override
+  public TransferPair makeTransferPair(ValueVector to) {
throw new UnsupportedOperationException("ObjectVector does not support this"); + } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator) { + throw new UnsupportedOperationException("ObjectVector does not support this"); + } + + @Override + public int getValueCapacity() { + return maxCount; + } + + @Override + public Accessor getAccessor() { + return accessor; + } + + @Override + public ArrowBuf[] getBuffers(boolean clear) { + throw new UnsupportedOperationException("ObjectVector does not support this"); + } + +// @Override +// public void load(UserBitShared.SerializedField metadata, DrillBuf buffer) { +// throw new UnsupportedOperationException("ObjectVector does not support this"); +// } +// +// @Override +// public UserBitShared.SerializedField getMetadata() { +// throw new UnsupportedOperationException("ObjectVector does not support this"); +// } + + @Override + public Mutator getMutator() { + return mutator; + } + + @Override + public Iterator iterator() { + throw new UnsupportedOperationException("ObjectVector does not support this"); + } + + public final class Accessor extends BaseAccessor { + @Override + public Object getObject(int index) { + int listOffset = index / allocationSize; + if (listOffset >= objectArrayList.size()) { + addNewArray(); + } + return objectArrayList.get(listOffset)[index % allocationSize]; + } + + @Override + public int getValueCount() { + return count; + } + + public Object get(int index) { + return getObject(index); + } + + public void get(int index, ObjectHolder holder){ + holder.obj = getObject(index); + } + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/SchemaChangeCallBack.java b/java/vector/src/main/java/org/apache/arrow/vector/SchemaChangeCallBack.java new file mode 100644 index 0000000000000..fc0a066749a91 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/SchemaChangeCallBack.java @@ -0,0 +1,52 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.arrow.vector; + +import org.apache.arrow.vector.util.CallBack; + + +public class SchemaChangeCallBack implements CallBack { + private boolean schemaChanged = false; + + /** + * Constructs a schema-change callback with the schema-changed state set to + * {@code false}. + */ + public SchemaChangeCallBack() { + } + + /** + * Sets the schema-changed state to {@code true}. + */ + @Override + public void doWork() { + schemaChanged = true; + } + + /** + * Returns the value of schema-changed state, resetting the + * schema-changed state to {@code false}. 
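+   * (An illustrative usage sketch, not in the original code:
+   * <pre>
+   *   callback.doWork();                           // a child vector changed shape
+   *   callback.getSchemaChangedAndReset();         // true, and resets the flag
+   *   callback.getSchemaChangedAndReset();         // false until doWork() again
+   * </pre>)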
+ */ + public boolean getSchemaChangedAndReset() { + final boolean current = schemaChanged; + schemaChanged = false; + return current; + } +} + diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ValueHolderHelper.java b/java/vector/src/main/java/org/apache/arrow/vector/ValueHolderHelper.java new file mode 100644 index 0000000000000..61ce285d61b0c --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/ValueHolderHelper.java @@ -0,0 +1,203 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector; + +import io.netty.buffer.ArrowBuf; + +import java.math.BigDecimal; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.holders.BigIntHolder; +import org.apache.arrow.vector.holders.BitHolder; +import org.apache.arrow.vector.holders.DateHolder; +import org.apache.arrow.vector.holders.Decimal18Holder; +import org.apache.arrow.vector.holders.Decimal28SparseHolder; +import org.apache.arrow.vector.holders.Decimal38SparseHolder; +import org.apache.arrow.vector.holders.Decimal9Holder; +import org.apache.arrow.vector.holders.Float4Holder; +import org.apache.arrow.vector.holders.Float8Holder; +import org.apache.arrow.vector.holders.IntHolder; +import org.apache.arrow.vector.holders.IntervalDayHolder; +import org.apache.arrow.vector.holders.IntervalYearHolder; +import org.apache.arrow.vector.holders.NullableBitHolder; +import org.apache.arrow.vector.holders.TimeHolder; +import org.apache.arrow.vector.holders.TimeStampHolder; +import org.apache.arrow.vector.holders.VarCharHolder; +import org.apache.arrow.vector.util.DecimalUtility; + +import com.google.common.base.Charsets; + + +public class ValueHolderHelper { + static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ValueHolderHelper.class); + + public static IntHolder getIntHolder(int value) { + IntHolder holder = new IntHolder(); + holder.value = value; + + return holder; + } + + public static BigIntHolder getBigIntHolder(long value) { + BigIntHolder holder = new BigIntHolder(); + holder.value = value; + + return holder; + } + + public static Float4Holder getFloat4Holder(float value) { + Float4Holder holder = new Float4Holder(); + holder.value = value; + + return holder; + } + + public static Float8Holder getFloat8Holder(double value) { + Float8Holder holder = new Float8Holder(); + holder.value = value; + + return holder; + } + + public static DateHolder getDateHolder(long value) { + DateHolder holder = new DateHolder(); + holder.value = value; + return holder; + } + + public static TimeHolder getTimeHolder(int value) { + TimeHolder holder = new TimeHolder(); + holder.value = value; + return holder; + } + + public static TimeStampHolder getTimeStampHolder(long value) { + TimeStampHolder holder = new 
TimeStampHolder(); + holder.value = value; + return holder; + } + + public static BitHolder getBitHolder(int value) { + BitHolder holder = new BitHolder(); + holder.value = value; + + return holder; + } + + public static NullableBitHolder getNullableBitHolder(boolean isNull, int value) { + NullableBitHolder holder = new NullableBitHolder(); + holder.isSet = isNull? 0 : 1; + if (! isNull) { + holder.value = value; + } + + return holder; + } + + public static VarCharHolder getVarCharHolder(ArrowBuf buf, String s){ + VarCharHolder vch = new VarCharHolder(); + + byte[] b = s.getBytes(Charsets.UTF_8); + vch.start = 0; + vch.end = b.length; + vch.buffer = buf.reallocIfNeeded(b.length); + vch.buffer.setBytes(0, b); + return vch; + } + + public static VarCharHolder getVarCharHolder(BufferAllocator a, String s){ + VarCharHolder vch = new VarCharHolder(); + + byte[] b = s.getBytes(Charsets.UTF_8); + vch.start = 0; + vch.end = b.length; + vch.buffer = a.buffer(b.length); // + vch.buffer.setBytes(0, b); + return vch; + } + + + public static IntervalYearHolder getIntervalYearHolder(int intervalYear) { + IntervalYearHolder holder = new IntervalYearHolder(); + + holder.value = intervalYear; + return holder; + } + + public static IntervalDayHolder getIntervalDayHolder(int days, int millis) { + IntervalDayHolder dch = new IntervalDayHolder(); + + dch.days = days; + dch.milliseconds = millis; + return dch; + } + + public static Decimal9Holder getDecimal9Holder(int decimal, int scale, int precision) { + Decimal9Holder dch = new Decimal9Holder(); + + dch.scale = scale; + dch.precision = precision; + dch.value = decimal; + + return dch; + } + + public static Decimal18Holder getDecimal18Holder(long decimal, int scale, int precision) { + Decimal18Holder dch = new Decimal18Holder(); + + dch.scale = scale; + dch.precision = precision; + dch.value = decimal; + + return dch; + } + + public static Decimal28SparseHolder getDecimal28Holder(ArrowBuf buf, String decimal) { + + Decimal28SparseHolder dch = new Decimal28SparseHolder(); + + BigDecimal bigDecimal = new BigDecimal(decimal); + + dch.scale = bigDecimal.scale(); + dch.precision = bigDecimal.precision(); + Decimal28SparseHolder.setSign(bigDecimal.signum() == -1, dch.start, dch.buffer); + dch.start = 0; + dch.buffer = buf.reallocIfNeeded(5 * DecimalUtility.INTEGER_SIZE); + DecimalUtility + .getSparseFromBigDecimal(bigDecimal, dch.buffer, dch.start, dch.scale, dch.precision, dch.nDecimalDigits); + + return dch; + } + + public static Decimal38SparseHolder getDecimal38Holder(ArrowBuf buf, String decimal) { + + Decimal38SparseHolder dch = new Decimal38SparseHolder(); + + BigDecimal bigDecimal = new BigDecimal(decimal); + + dch.scale = bigDecimal.scale(); + dch.precision = bigDecimal.precision(); + Decimal38SparseHolder.setSign(bigDecimal.signum() == -1, dch.start, dch.buffer); + dch.start = 0; + dch.buffer = buf.reallocIfNeeded(dch.maxPrecision * DecimalUtility.INTEGER_SIZE); + DecimalUtility + .getSparseFromBigDecimal(bigDecimal, dch.buffer, dch.start, dch.scale, dch.precision, dch.nDecimalDigits); + + return dch; + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java new file mode 100644 index 0000000000000..c05f0e7c50fd2 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java @@ -0,0 +1,222 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. 
See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector;
+
+import java.io.Closeable;
+
+import io.netty.buffer.ArrowBuf;
+
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.OutOfMemoryException;
+import org.apache.arrow.vector.complex.reader.FieldReader;
+import org.apache.arrow.vector.types.MaterializedField;
+import org.apache.arrow.vector.util.TransferPair;
+
+/**
+ * An abstraction that is used to store a sequence of values in an individual column.
+ *
+ * A {@link ValueVector value vector} stores underlying data in-memory in a columnar fashion that is compact and
+ * efficient. The column whose data is stored is referred to by {@link #getField()}.
+ *
+ * A vector, when instantiated, relies on a {@link org.apache.drill.exec.record.DeadBuf dead buffer}. It is important
+ * that the vector is allocated before attempting to read or write.
+ *
+ * There are a few "rules" around vectors:
+ *
+ * <ul>
+ *   <li>values need to be written in order (e.g. index 0, 1, 2, 5)</li>
+ *   <li>null vectors start with all values as null before writing anything</li>
+ *   <li>for variable width types, the offset vector should be all zeros before writing</li>
+ *   <li>you must call setValueCount before a vector can be read</li>
+ *   <li>you should never write to a vector once it has been read.</li>
+ * </ul>
+ *
+ * Please note that the current implementation doesn't enforce those rules, hence we may find a few places that
+ * deviate from these rules (e.g. offset vectors in Variable Length and Repeated vectors).
+ *
+ * This interface "should" strive to guarantee this order of operation:
+ * <blockquote>
+ * allocate > mutate > setvaluecount > access > clear (or allocate to start the process over).
+ * </blockquote>
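+ *
+ * An illustrative walk through that lifecycle (not part of the original code;
+ * writes go through a concrete Mutator, since this interface's Mutator exposes
+ * no set methods):
+ * <pre>
+ *   vector.allocateNew();                             // allocate
+ *   // ... write values via the concrete mutator ...  // mutate
+ *   vector.getMutator().setValueCount(n);             // setvaluecount
+ *   Object first = vector.getAccessor().getObject(0); // access
+ *   vector.clear();                                   // clear
+ * </pre>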
+ */
+public interface ValueVector extends Closeable, Iterable<ValueVector> {
+  /**
+   * Allocate new buffers. ValueVector implements logic to determine how much to allocate.
+   * @throws OutOfMemoryException Thrown if no memory can be allocated.
+   */
+  void allocateNew() throws OutOfMemoryException;
+
+  /**
+   * Allocates new buffers. ValueVector implements logic to determine how much to allocate.
+   * @return Returns true if allocation was successful.
+   */
+  boolean allocateNewSafe();
+
+  BufferAllocator getAllocator();
+
+  /**
+   * Set the initial record capacity.
+   * @param numRecords the initial record capacity
+   */
+  void setInitialCapacity(int numRecords);
+
+  /**
+   * Returns the maximum number of values that can be stored in this vector instance.
+   */
+  int getValueCapacity();
+
+  /**
+   * Alternative to clear(). Allows use as an AutoCloseable in try-with-resources.
+   */
+  @Override
+  void close();
+
+  /**
+   * Release the underlying ArrowBuf and reset the ValueVector to empty.
+   */
+  void clear();
+
+  /**
+   * Get information about how this field is materialized.
+   */
+  MaterializedField getField();
+
+  /**
+   * Returns a {@link org.apache.arrow.vector.util.TransferPair transfer pair}, creating a new target vector of
+   * the same type.
+   */
+  TransferPair getTransferPair(BufferAllocator allocator);
+
+  TransferPair getTransferPair(String ref, BufferAllocator allocator);
+
+  /**
+   * Returns a new {@link org.apache.arrow.vector.util.TransferPair transfer pair} that is used to transfer underlying
+   * buffers into the target vector.
+   */
+  TransferPair makeTransferPair(ValueVector target);
+
+  /**
+   * Returns an {@link org.apache.arrow.vector.ValueVector.Accessor accessor} that is used to read from this vector
+   * instance.
+   */
+  Accessor getAccessor();
+
+  /**
+   * Returns an {@link org.apache.arrow.vector.ValueVector.Mutator mutator} that is used to write to this vector
+   * instance.
+   */
+  Mutator getMutator();
+
+  /**
+   * Returns a {@link org.apache.arrow.vector.complex.reader.FieldReader field reader} that supports reading values
+   * from this vector.
+   */
+  FieldReader getReader();
+
+  /**
+   * Get the metadata for this field. Used in serialization.
+   *
+   * @return FieldMetadata for this field.
+   */
+//  SerializedField getMetadata();
+
+  /**
+   * Returns the number of bytes that is used by this vector instance.
+   */
+  int getBufferSize();
+
+  /**
+   * Returns the number of bytes that is used by this vector if it holds the given number
+   * of values. The result will be the same as if Mutator.setValueCount() were called, followed
+   * by calling getBufferSize(), but without any of the closing side-effects that setValueCount()
+   * implies with respect to finishing off the population of a vector. Some operations might wish
+   * to use this to determine how much memory has been used by a vector so far, even though it is
+   * not finished being populated.
+   *
+   * @param valueCount the number of values to assume this vector contains
+   * @return the buffer size if this vector is holding valueCount values
+   */
+  int getBufferSizeFor(int valueCount);
+
+  /**
+   * Return the underlying buffers associated with this vector. Note that this doesn't impact the reference counts for
+   * this buffer so it only should be used for in-context access. Also note that this buffer changes regularly thus
+   * external classes shouldn't hold a reference to it (unless they change it).
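+   *
+   * (An illustrative note, not in the original code: in BaseDataValueVector,
+   * passing {@code clear = true} retains each buffer once before clearing the
+   * vector, so the caller takes ownership and must eventually release every
+   * returned buffer.)
+   *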
+   * @param clear Whether to clear vector before returning; the buffers will still be refcounted;
+   *   but the returned array will be the only reference to them
+   *
+   * @return The underlying {@link io.netty.buffer.ArrowBuf buffers} that is used by this vector instance.
+   */
+  ArrowBuf[] getBuffers(boolean clear);
+
+  /**
+   * Load the data provided in the buffer. Typically used when deserializing from the wire.
+   *
+   * @param metadata
+   *          Metadata used to decode the incoming buffer.
+   * @param buffer
+   *          The buffer that contains the ValueVector.
+   */
+//  void load(SerializedField metadata, DrillBuf buffer);
+
+  /**
+   * An abstraction that is used to read from this vector instance.
+   */
+  interface Accessor {
+    /**
+     * Get the Java Object representation of the element at the specified position. Useful for testing.
+     *
+     * @param index
+     *          Index of the value to get
+     */
+    Object getObject(int index);
+
+    /**
+     * Returns the number of values that is stored in this vector.
+     */
+    int getValueCount();
+
+    /**
+     * Returns true if the value at the given index is null, false otherwise.
+     */
+    boolean isNull(int index);
+  }
+
+  /**
+   * An abstraction that is used to write into this vector instance.
+   */
+  interface Mutator {
+    /**
+     * Sets the number of values that is stored in this vector to the given value count.
+     *
+     * @param valueCount value count to set.
+     */
+    void setValueCount(int valueCount);
+
+    /**
+     * Resets the mutator to pristine state.
+     */
+    void reset();
+
+    /**
+     * @deprecated this has nothing to do with value vector abstraction and should be removed.
+     */
+    @Deprecated
+    void generateTestData(int values);
+  }
+}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VariableWidthVector.java b/java/vector/src/main/java/org/apache/arrow/vector/VariableWidthVector.java
new file mode 100644
index 0000000000000..e227bb4c4176c
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/VariableWidthVector.java
@@ -0,0 +1,51 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector;
+
+import io.netty.buffer.ArrowBuf;
+
+public interface VariableWidthVector extends ValueVector{
+
+  /**
+   * Allocate a new memory space for this vector. Must be called prior to using the ValueVector.
+   *
+   * @param totalBytes Desired size of the underlying data buffer.
+   * @param valueCount Number of values in the vector.
+   */
+  void allocateNew(int totalBytes, int valueCount);
+
+  /**
+   * Provide the maximum amount of variable width bytes that can be stored in this vector.
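+   * (An assumption-flagged note, not in the original code: for the variable
+   * width implementations generated above, this is expected to be the capacity
+   * of the underlying data buffer, i.e. how many raw value bytes fit before a
+   * reAlloc is needed.)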
+   * @return the maximum number of variable-width bytes this vector can hold
+   */
+  int getByteCapacity();
+
+  VariableWidthMutator getMutator();
+
+  VariableWidthAccessor getAccessor();
+
+  interface VariableWidthAccessor extends Accessor {
+    int getValueLength(int index);
+  }
+
+  int getCurrentSizeInBytes();
+
+  interface VariableWidthMutator extends Mutator {
+    void setValueLengthSafe(int index, int length);
+  }
+}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VectorDescriptor.java b/java/vector/src/main/java/org/apache/arrow/vector/VectorDescriptor.java
new file mode 100644
index 0000000000000..fdad99a333258
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/VectorDescriptor.java
@@ -0,0 +1,83 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector;
+
+import java.util.Collection;
+
+import com.google.common.base.Preconditions;
+
+import org.apache.arrow.vector.types.MaterializedField;
+import org.apache.arrow.vector.types.Types.MajorType;
+
+public class VectorDescriptor {
+  private static final String DEFAULT_NAME = "NONE";
+
+  private final MaterializedField field;
+
+  public VectorDescriptor(final MajorType type) {
+    this(DEFAULT_NAME, type);
+  }
+
+  public VectorDescriptor(final String name, final MajorType type) {
+    this(MaterializedField.create(name, type));
+  }
+
+  public VectorDescriptor(final MaterializedField field) {
+    this.field = Preconditions.checkNotNull(field, "field cannot be null");
+  }
+
+  public MaterializedField getField() {
+    return field;
+  }
+
+  public MajorType getType() {
+    return field.getType();
+  }
+
+  public String getName() {
+    return field.getLastName();
+  }
+
+  public Collection<MaterializedField> getChildren() {
+    return field.getChildren();
+  }
+
+  public boolean hasName() {
+    // Compare by value; a reference comparison against DEFAULT_NAME would
+    // misreport any equal-but-distinct "NONE" string.
+    return !DEFAULT_NAME.equals(getName());
+  }
+
+  public VectorDescriptor withName(final String name) {
+    return new VectorDescriptor(field.withPath(name));
+  }
+
+  public VectorDescriptor withType(final MajorType type) {
+    return new VectorDescriptor(field.withType(type));
+  }
+
+  public static VectorDescriptor create(final String name, final MajorType type) {
+    return new VectorDescriptor(name, type);
+  }
+
+  public static VectorDescriptor create(final MajorType type) {
+    return new VectorDescriptor(type);
+  }
+
+  public static VectorDescriptor create(final MaterializedField field) {
+    return new VectorDescriptor(field);
+  }
+}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VectorTrimmer.java b/java/vector/src/main/java/org/apache/arrow/vector/VectorTrimmer.java
new file mode 100644
index 0000000000000..055857e956084
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/VectorTrimmer.java
@@ -0,0 +1,33 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.
See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector; + +import io.netty.buffer.ByteBuf; +import io.netty.buffer.ArrowBuf; + +public class VectorTrimmer { + static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(VectorTrimmer.class); + + public static void trim(ByteBuf data, int idx) { + data.writerIndex(idx); + if (data instanceof ArrowBuf) { + // data.capacity(idx); + data.writerIndex(idx); + } + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java b/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java new file mode 100644 index 0000000000000..78de8706fb7d4 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java @@ -0,0 +1,181 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector; + +import io.netty.buffer.ArrowBuf; + +import java.util.Iterator; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.OutOfMemoryException; +import org.apache.arrow.vector.complex.impl.NullReader; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.types.MaterializedField; +import org.apache.arrow.vector.types.Types; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.util.TransferPair; + +import com.google.common.collect.Iterators; + +public class ZeroVector implements ValueVector { + public final static ZeroVector INSTANCE = new ZeroVector(); + + private final MaterializedField field = MaterializedField.create("[DEFAULT]", Types.required(MinorType.LATE)); + + private final TransferPair defaultPair = new TransferPair() { + @Override + public void transfer() { } + + @Override + public void splitAndTransfer(int startIndex, int length) { } + + @Override + public ValueVector getTo() { + return ZeroVector.this; + } + + @Override + public void copyValueSafe(int from, int to) { } + }; + + private final Accessor defaultAccessor = new Accessor() { + @Override + public Object getObject(int index) { + return null; + } + + @Override + public int getValueCount() { + return 0; + } + + @Override + public boolean isNull(int index) { + return true; + } + }; + + private final Mutator defaultMutator = new Mutator() { + @Override + public void setValueCount(int valueCount) { } + + @Override + public void reset() { } + + @Override + public void generateTestData(int values) { } + }; + + public ZeroVector() { } + + @Override + public void close() { } + + @Override + public void clear() { } + + @Override + public MaterializedField getField() { + return field; + } + + @Override + public TransferPair getTransferPair(BufferAllocator allocator) { + return defaultPair; + } + +// @Override +// public UserBitShared.SerializedField getMetadata() { +// return getField() +// .getAsBuilder() +// .setBufferLength(getBufferSize()) +// .setValueCount(getAccessor().getValueCount()) +// .build(); +// } + + @Override + public Iterator iterator() { + return Iterators.emptyIterator(); + } + + @Override + public int getBufferSize() { + return 0; + } + + @Override + public int getBufferSizeFor(final int valueCount) { + return 0; + } + + @Override + public ArrowBuf[] getBuffers(boolean clear) { + return new ArrowBuf[0]; + } + + @Override + public void allocateNew() throws OutOfMemoryException { + allocateNewSafe(); + } + + @Override + public boolean allocateNewSafe() { + return true; + } + + @Override + public BufferAllocator getAllocator() { + throw new UnsupportedOperationException("Tried to get allocator from ZeroVector"); + } + + @Override + public void setInitialCapacity(int numRecords) { } + + @Override + public int getValueCapacity() { + return 0; + } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator) { + return defaultPair; + } + + @Override + public TransferPair makeTransferPair(ValueVector target) { + return defaultPair; + } + + @Override + public Accessor getAccessor() { + return defaultAccessor; + } + + @Override + public Mutator getMutator() { + return defaultMutator; + } + + @Override + public FieldReader getReader() { + return NullReader.INSTANCE; + } + +// @Override +// public void load(UserBitShared.SerializedField metadata, DrillBuf buffer) { } +} diff --git 
a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java new file mode 100644 index 0000000000000..c671c9e0b3c55 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java @@ -0,0 +1,143 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.complex; + +import java.util.Collection; + +import javax.annotation.Nullable; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.OutOfMemoryException; +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.types.MaterializedField; +import org.apache.arrow.vector.types.Types.DataMode; +import org.apache.arrow.vector.types.Types.MajorType; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.util.CallBack; + +import com.google.common.base.Function; +import com.google.common.base.Preconditions; +import com.google.common.collect.Iterables; +import com.google.common.collect.Sets; + +/** + * Base class for composite vectors. + * + * This class implements common functionality of composite vectors. + */ +public abstract class AbstractContainerVector implements ValueVector { + static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(AbstractContainerVector.class); + + protected MaterializedField field; + protected final BufferAllocator allocator; + protected final CallBack callBack; + + protected AbstractContainerVector(MaterializedField field, BufferAllocator allocator, CallBack callBack) { + this.field = Preconditions.checkNotNull(field); + this.allocator = allocator; + this.callBack = callBack; + } + + @Override + public void allocateNew() throws OutOfMemoryException { + if (!allocateNewSafe()) { + throw new OutOfMemoryException(); + } + } + + public BufferAllocator getAllocator() { + return allocator; + } + + /** + * Returns the field definition of this instance. + */ + @Override + public MaterializedField getField() { + return field; + } + + /** + * Returns a {@link org.apache.arrow.vector.ValueVector} corresponding to the given field name if exists or null. + */ + public ValueVector getChild(String name) { + return getChild(name, ValueVector.class); + } + + /** + * Returns a sequence of field names in the order that they show up in the schema. + */ + protected Collection getChildFieldNames() { + return Sets.newLinkedHashSet(Iterables.transform(field.getChildren(), new Function() { + @Nullable + @Override + public String apply(MaterializedField field) { + return Preconditions.checkNotNull(field).getLastName(); + } + })); + } + + /** + * Clears out all underlying child vectors. 
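+   *
+   * <p>Because {@link org.apache.arrow.vector.ValueVector} is usable as an
+   * {@code AutoCloseable}, containers fit try-with-resources; a sketch (the
+   * allocator is assumed to be in scope):</p>
+   * <pre>{@code
+   * try (MapVector map = new MapVector("m", allocator, null)) {
+   *   // child vectors created via addOrGet are closed along with the parent
+   * }
+   * }</pre>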
+ */
+  @Override
+  public void close() {
+    for (ValueVector vector : (Iterable<ValueVector>) this) {
+      vector.close();
+    }
+  }
+
+  protected <T extends ValueVector> T typeify(ValueVector v, Class<T> clazz) {
+    if (clazz.isAssignableFrom(v.getClass())) {
+      return (T) v;
+    }
+    throw new IllegalStateException(String.format("Vector requested [%s] was different from type stored [%s]. Drill doesn't yet support heterogeneous types.", clazz.getSimpleName(), v.getClass().getSimpleName()));
+  }
+
+  MajorType getLastPathType() {
+    if ((this.getField().getType().getMinorType() == MinorType.LIST &&
+        this.getField().getType().getMode() == DataMode.REPEATED)) {  // Use Repeated scalar type instead of Required List.
+      VectorWithOrdinal vord = getChildVectorWithOrdinal(null);
+      ValueVector v = vord.vector;
+      if (!(v instanceof AbstractContainerVector)) {
+        return v.getField().getType();
+      }
+    } else if (this.getField().getType().getMinorType() == MinorType.MAP &&
+        this.getField().getType().getMode() == DataMode.REPEATED) {  // Use Required Map
+      return new MajorType(MinorType.MAP, DataMode.REQUIRED);
+    }
+
+    return this.getField().getType();
+  }
+
+  protected boolean supportsDirectRead() {
+    return false;
+  }
+
+  // return the number of child vectors
+  public abstract int size();
+
+  // add a new vector with the input MajorType or return the existing vector if we already added one with the same type
+  public abstract <T extends ValueVector> T addOrGet(String name, MajorType type, Class<T> clazz);
+
+  // return the child vector with the input name
+  public abstract <T extends ValueVector> T getChild(String name, Class<T> clazz);
+
+  // return the child vector's ordinal in the composite container
+  public abstract VectorWithOrdinal getChildVectorWithOrdinal(String name);
+}
\ No newline at end of file
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java
new file mode 100644
index 0000000000000..d4189b2314a6a
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java
@@ -0,0 +1,278 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector.complex;
+
+import io.netty.buffer.ArrowBuf;
+
+import java.util.Collection;
+import java.util.Iterator;
+import java.util.List;
+
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.ValueVector;
+import org.apache.arrow.vector.types.MaterializedField;
+import org.apache.arrow.vector.types.Types.MajorType;
+import org.apache.arrow.vector.util.BasicTypeHelper;
+import org.apache.arrow.vector.util.CallBack;
+import org.apache.arrow.vector.util.MapWithOrdinal;
+
+import com.google.common.base.Preconditions;
+import com.google.common.collect.Lists;
+
+/*
+ * Base class for MapVectors. Currently used by RepeatedMapVector and MapVector.
+ */
+public abstract class AbstractMapVector extends AbstractContainerVector {
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(AbstractMapVector.class);
+
+  // Maintains a map in which the key is the field name and the value is the vector itself
+  private final MapWithOrdinal<String, ValueVector> vectors = new MapWithOrdinal<>();
+
+  protected AbstractMapVector(MaterializedField field, BufferAllocator allocator, CallBack callBack) {
+    super(field.clone(), allocator, callBack);
+    MaterializedField clonedField = field.clone();
+    // create the hierarchy of the child vectors based on the materialized field
+    for (MaterializedField child : clonedField.getChildren()) {
+      if (!child.equals(BaseRepeatedValueVector.OFFSETS_FIELD)) {
+        final String fieldName = child.getLastName();
+        final ValueVector v = BasicTypeHelper.getNewVector(child, allocator, callBack);
+        putVector(fieldName, v);
+      }
+    }
+  }
+
+  @Override
+  public void close() {
+    for (final ValueVector valueVector : vectors.values()) {
+      valueVector.close();
+    }
+    vectors.clear();
+
+    super.close();
+  }
+
+  @Override
+  public boolean allocateNewSafe() {
+    /* boolean to keep track of whether all the memory allocations were successful.
+     * Used in the case of composite vectors when we need to allocate multiple
+     * buffers for multiple vectors. If one of the allocations failed we need to
+     * clear all the memory that we allocated.
+     */
+    boolean success = false;
+    try {
+      for (final ValueVector v : vectors.values()) {
+        if (!v.allocateNewSafe()) {
+          return false;
+        }
+      }
+      success = true;
+    } finally {
+      if (!success) {
+        clear();
+      }
+    }
+    return true;
+  }
+
+  /**
+   * Adds a new field with the given parameters or replaces the existing one and consequently returns the resultant
+   * {@link org.apache.arrow.vector.ValueVector}.
+   *
+   * Execution takes place in the following order:
+   * <ul>
+   *   <li>
+   *     if field is new, create and insert a new vector of desired type.
+   *   </li>
+   *   <li>
+   *     if field exists and existing vector is of desired vector type, return the vector.
+   *   </li>
+   *   <li>
+   *     if field exists and null filled, clear the existing vector; create and insert a new vector of desired type.
+   *   </li>
+   *   <li>
+   *     otherwise, throw an {@link java.lang.IllegalStateException}
+   *   </li>
+   * </ul>
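+   *
+   * <p>An illustrative sketch of the add-or-get pattern (the field name and the
+   * {@code IntVector} type here are hypothetical, not part of this patch):</p>
+   * <pre>{@code
+   * MajorType intType = new MajorType(MinorType.INT, DataMode.REQUIRED);
+   * IntVector age = mapVector.addOrGet("age", intType, IntVector.class);
+   * // a second call with the same name and type returns the same vector
+   * assert age == mapVector.addOrGet("age", intType, IntVector.class);
+   * }</pre>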
+ * + * @param name name of the field + * @param type type of the field + * @param clazz class of expected vector type + * @param class type of expected vector type + * @throws java.lang.IllegalStateException raised if there is a hard schema change + * + * @return resultant {@link org.apache.arrow.vector.ValueVector} + */ + @Override + public T addOrGet(String name, MajorType type, Class clazz) { + final ValueVector existing = getChild(name); + boolean create = false; + if (existing == null) { + create = true; + } else if (clazz.isAssignableFrom(existing.getClass())) { + return (T) existing; + } else if (nullFilled(existing)) { + existing.clear(); + create = true; + } + if (create) { + final T vector = (T) BasicTypeHelper.getNewVector(name, allocator, type, callBack); + putChild(name, vector); + if (callBack!=null) { + callBack.doWork(); + } + return vector; + } + final String message = "Drill does not support schema change yet. Existing[%s] and desired[%s] vector types mismatch"; + throw new IllegalStateException(String.format(message, existing.getClass().getSimpleName(), clazz.getSimpleName())); + } + + private boolean nullFilled(ValueVector vector) { + for (int r = 0; r < vector.getAccessor().getValueCount(); r++) { + if (!vector.getAccessor().isNull(r)) { + return false; + } + } + return true; + } + + /** + * Returns a {@link org.apache.arrow.vector.ValueVector} corresponding to the given ordinal identifier. + */ + public ValueVector getChildByOrdinal(int id) { + return vectors.getByOrdinal(id); + } + + /** + * Returns a {@link org.apache.arrow.vector.ValueVector} instance of subtype of corresponding to the given + * field name if exists or null. + */ + @Override + public T getChild(String name, Class clazz) { + final ValueVector v = vectors.get(name.toLowerCase()); + if (v == null) { + return null; + } + return typeify(v, clazz); + } + + /** + * Inserts the vector with the given name if it does not exist else replaces it with the new value. + * + * Note that this method does not enforce any vector type check nor throws a schema change exception. + */ + protected void putChild(String name, ValueVector vector) { + putVector(name, vector); + field.addChild(vector.getField()); + } + + /** + * Inserts the input vector into the map if it does not exist, replaces if it exists already + * @param name field name + * @param vector vector to be inserted + */ + protected void putVector(String name, ValueVector vector) { + final ValueVector old = vectors.put( + Preconditions.checkNotNull(name, "field name cannot be null").toLowerCase(), + Preconditions.checkNotNull(vector, "vector cannot be null") + ); + if (old != null && old != vector) { + logger.debug("Field [{}] mutated from [{}] to [{}]", name, old.getClass().getSimpleName(), + vector.getClass().getSimpleName()); + } + } + + /** + * Returns a sequence of underlying child vectors. + */ + protected Collection getChildren() { + return vectors.values(); + } + + /** + * Returns the number of underlying child vectors. + */ + @Override + public int size() { + return vectors.size(); + } + + @Override + public Iterator iterator() { + return vectors.values().iterator(); + } + + /** + * Returns a list of scalar child vectors recursing the entire vector hierarchy. 
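+   *
+   * <p>Nested maps are flattened: for a hypothetical schema
+   * {@code {a: INT, b: {c: VARCHAR}}} the returned list holds the vectors
+   * backing {@code a} and {@code c}, not the intermediate map {@code b}.</p>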
+ */ + public List getPrimitiveVectors() { + final List primitiveVectors = Lists.newArrayList(); + for (final ValueVector v : vectors.values()) { + if (v instanceof AbstractMapVector) { + AbstractMapVector mapVector = (AbstractMapVector) v; + primitiveVectors.addAll(mapVector.getPrimitiveVectors()); + } else { + primitiveVectors.add(v); + } + } + return primitiveVectors; + } + + /** + * Returns a vector with its corresponding ordinal mapping if field exists or null. + */ + @Override + public VectorWithOrdinal getChildVectorWithOrdinal(String name) { + final int ordinal = vectors.getOrdinal(name.toLowerCase()); + if (ordinal < 0) { + return null; + } + final ValueVector vector = vectors.getByOrdinal(ordinal); + return new VectorWithOrdinal(vector, ordinal); + } + + @Override + public ArrowBuf[] getBuffers(boolean clear) { + final List buffers = Lists.newArrayList(); + + for (final ValueVector vector : vectors.values()) { + for (final ArrowBuf buf : vector.getBuffers(false)) { + buffers.add(buf); + if (clear) { + buf.retain(1); + } + } + if (clear) { + vector.clear(); + } + } + + return buffers.toArray(new ArrowBuf[buffers.size()]); + } + + @Override + public int getBufferSize() { + int actualBufSize = 0 ; + + for (final ValueVector v : vectors.values()) { + for (final ArrowBuf buf : v.getBuffers(false)) { + actualBufSize += buf.writerIndex(); + } + } + return actualBufSize; + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java new file mode 100644 index 0000000000000..6518897fb780d --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java @@ -0,0 +1,260 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector.complex; + +import io.netty.buffer.ArrowBuf; + +import java.util.Collections; +import java.util.Iterator; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.AddOrGetResult; +import org.apache.arrow.vector.BaseValueVector; +import org.apache.arrow.vector.UInt4Vector; +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.VectorDescriptor; +import org.apache.arrow.vector.ZeroVector; +import org.apache.arrow.vector.types.MaterializedField; +import org.apache.arrow.vector.types.Types.DataMode; +import org.apache.arrow.vector.types.Types.MajorType; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.util.BasicTypeHelper; +import org.apache.arrow.vector.util.SchemaChangeRuntimeException; + +import com.google.common.base.Preconditions; +import com.google.common.collect.ObjectArrays; + +public abstract class BaseRepeatedValueVector extends BaseValueVector implements RepeatedValueVector { + + public final static ValueVector DEFAULT_DATA_VECTOR = ZeroVector.INSTANCE; + public final static String OFFSETS_VECTOR_NAME = "$offsets$"; + public final static String DATA_VECTOR_NAME = "$data$"; + + public final static MaterializedField OFFSETS_FIELD = + MaterializedField.create(OFFSETS_VECTOR_NAME, new MajorType(MinorType.UINT4, DataMode.REQUIRED)); + + protected final UInt4Vector offsets; + protected ValueVector vector; + + protected BaseRepeatedValueVector(MaterializedField field, BufferAllocator allocator) { + this(field, allocator, DEFAULT_DATA_VECTOR); + } + + protected BaseRepeatedValueVector(MaterializedField field, BufferAllocator allocator, ValueVector vector) { + super(field, allocator); + this.offsets = new UInt4Vector(OFFSETS_FIELD, allocator); + this.vector = Preconditions.checkNotNull(vector, "data vector cannot be null"); + } + + @Override + public boolean allocateNewSafe() { + /* boolean to keep track if all the memory allocation were successful + * Used in the case of composite vectors when we need to allocate multiple + * buffers for multiple vectors. 
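+     * (For this repeated vector that means the offsets buffer and the data vector.)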
If one of the allocations failed we need to + * clear all the memory that we allocated + */ + boolean success = false; + try { + if (!offsets.allocateNewSafe()) { + return false; + } + success = vector.allocateNewSafe(); + } finally { + if (!success) { + clear(); + } + } + offsets.zeroVector(); + return success; + } + + + @Override + public UInt4Vector getOffsetVector() { + return offsets; + } + + @Override + public ValueVector getDataVector() { + return vector; + } + + @Override + public void setInitialCapacity(int numRecords) { + offsets.setInitialCapacity(numRecords + 1); + vector.setInitialCapacity(numRecords * RepeatedValueVector.DEFAULT_REPEAT_PER_RECORD); + } + + @Override + public int getValueCapacity() { + final int offsetValueCapacity = Math.max(offsets.getValueCapacity() - 1, 0); + if (vector == DEFAULT_DATA_VECTOR) { + return offsetValueCapacity; + } + return Math.min(vector.getValueCapacity(), offsetValueCapacity); + } + +// @Override +// protected UserBitShared.SerializedField.Builder getMetadataBuilder() { +// return super.getMetadataBuilder() +// .addChild(offsets.getMetadata()) +// .addChild(vector.getMetadata()); +// } + + @Override + public int getBufferSize() { + if (getAccessor().getValueCount() == 0) { + return 0; + } + return offsets.getBufferSize() + vector.getBufferSize(); + } + + @Override + public int getBufferSizeFor(int valueCount) { + if (valueCount == 0) { + return 0; + } + + return offsets.getBufferSizeFor(valueCount + 1) + vector.getBufferSizeFor(valueCount); + } + + @Override + public Iterator iterator() { + return Collections.singleton(getDataVector()).iterator(); + } + + @Override + public void clear() { + offsets.clear(); + vector.clear(); + super.clear(); + } + + @Override + public ArrowBuf[] getBuffers(boolean clear) { + final ArrowBuf[] buffers = ObjectArrays.concat(offsets.getBuffers(false), vector.getBuffers(false), ArrowBuf.class); + if (clear) { + for (ArrowBuf buffer:buffers) { + buffer.retain(); + } + clear(); + } + return buffers; + } + +// @Override +// public void load(UserBitShared.SerializedField metadata, DrillBuf buffer) { +// final UserBitShared.SerializedField offsetMetadata = metadata.getChild(0); +// offsets.load(offsetMetadata, buffer); +// +// final UserBitShared.SerializedField vectorMetadata = metadata.getChild(1); +// if (getDataVector() == DEFAULT_DATA_VECTOR) { +// addOrGetVector(VectorDescriptor.create(vectorMetadata.getMajorType())); +// } +// +// final int offsetLength = offsetMetadata.getBufferLength(); +// final int vectorLength = vectorMetadata.getBufferLength(); +// vector.load(vectorMetadata, buffer.slice(offsetLength, vectorLength)); +// } + + /** + * Returns 1 if inner vector is explicitly set via #addOrGetVector else 0 + * + * @see {@link ContainerVectorLike#size} + */ + @Override + public int size() { + return vector == DEFAULT_DATA_VECTOR ? 0:1; + } + + @Override + public AddOrGetResult addOrGetVector(VectorDescriptor descriptor) { + boolean created = false; + if (vector == DEFAULT_DATA_VECTOR && descriptor.getType().getMinorType() != MinorType.LATE) { + final MaterializedField field = descriptor.withName(DATA_VECTOR_NAME).getField(); + vector = BasicTypeHelper.getNewVector(field, allocator); + // returned vector must have the same field + assert field.equals(vector.getField()); + getField().addChild(field); + created = true; + } + + final MajorType actual = vector.getField().getType(); + if (!actual.equals(descriptor.getType())) { + final String msg = String.format("Inner vector type mismatch. 
Requested type: [%s], actual type: [%s]", + descriptor.getType(), actual); + throw new SchemaChangeRuntimeException(msg); + } + + return new AddOrGetResult<>((T)vector, created); + } + + protected void replaceDataVector(ValueVector v) { + vector.clear(); + vector = v; + } + + public abstract class BaseRepeatedAccessor extends BaseValueVector.BaseAccessor implements RepeatedAccessor { + + @Override + public int getValueCount() { + return Math.max(offsets.getAccessor().getValueCount() - 1, 0); + } + + @Override + public int getInnerValueCount() { + return vector.getAccessor().getValueCount(); + } + + @Override + public int getInnerValueCountAt(int index) { + return offsets.getAccessor().get(index+1) - offsets.getAccessor().get(index); + } + + @Override + public boolean isNull(int index) { + return false; + } + + @Override + public boolean isEmpty(int index) { + return false; + } + } + + public abstract class BaseRepeatedMutator extends BaseValueVector.BaseMutator implements RepeatedMutator { + + @Override + public void startNewValue(int index) { + while (offsets.getValueCapacity() <= index) { + offsets.reAlloc(); + } + offsets.getMutator().setSafe(index+1, offsets.getAccessor().get(index)); + setValueCount(index+1); + } + + @Override + public void setValueCount(int valueCount) { + // TODO: populate offset end points + offsets.getMutator().setValueCount(valueCount == 0 ? 0 : valueCount+1); + final int childValueCount = valueCount == 0 ? 0 : offsets.getAccessor().get(valueCount); + vector.getMutator().setValueCount(childValueCount); + } + } + +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/ContainerVectorLike.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/ContainerVectorLike.java new file mode 100644 index 0000000000000..e50b0d0d0a5ea --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/ContainerVectorLike.java @@ -0,0 +1,43 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.complex; + +import org.apache.arrow.vector.AddOrGetResult; +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.VectorDescriptor; + +/** + * A mix-in used for introducing container vector-like behaviour. + */ +public interface ContainerVectorLike { + + /** + * Creates and adds a child vector if none with the same name exists, else returns the vector instance. + * + * @param descriptor vector descriptor + * @return result of operation wrapping vector corresponding to the given descriptor and whether it's newly created + * @throws org.apache.drill.common.exceptions.DrillRuntimeException + * if schema change is not permissible between the given and existing data vector types. 
+ */ + AddOrGetResult addOrGetVector(VectorDescriptor descriptor); + + /** + * Returns the number of child vectors in this container vector-like instance. + */ + int size(); +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/EmptyValuePopulator.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/EmptyValuePopulator.java new file mode 100644 index 0000000000000..df699755770a5 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/EmptyValuePopulator.java @@ -0,0 +1,54 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.complex; + +import org.apache.arrow.vector.UInt4Vector; + +import com.google.common.base.Preconditions; + +/** + * A helper class that is used to track and populate empty values in repeated value vectors. + */ +public class EmptyValuePopulator { + private final UInt4Vector offsets; + + public EmptyValuePopulator(UInt4Vector offsets) { + this.offsets = Preconditions.checkNotNull(offsets, "offsets cannot be null"); + } + + /** + * Marks all values since the last set as empty. The last set value is obtained from underlying offsets vector. + * + * @param lastIndex the last index (inclusive) in the offsets vector until which empty population takes place + * @throws java.lang.IndexOutOfBoundsException if lastIndex is negative or greater than offsets capacity. + */ + public void populate(int lastIndex) { + if (lastIndex < 0) { + throw new IndexOutOfBoundsException("index cannot be negative"); + } + final UInt4Vector.Accessor accessor = offsets.getAccessor(); + final UInt4Vector.Mutator mutator = offsets.getMutator(); + final int lastSet = Math.max(accessor.getValueCount() - 1, 0); + final int previousEnd = accessor.get(lastSet);//0 ? 0 : accessor.get(lastSet); + for (int i = lastSet; i < lastIndex; i++) { + mutator.setSafe(i + 1, previousEnd); + } + mutator.setValueCount(lastIndex+1); + } + +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java new file mode 100644 index 0000000000000..8387c9e5ba667 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java @@ -0,0 +1,321 @@ +/******************************************************************************* + + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + ******************************************************************************/ +package org.apache.arrow.vector.complex; + +import io.netty.buffer.ArrowBuf; + +import java.util.List; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.OutOfMemoryException; +import org.apache.arrow.vector.AddOrGetResult; +import org.apache.arrow.vector.UInt1Vector; +import org.apache.arrow.vector.UInt4Vector; +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.VectorDescriptor; +import org.apache.arrow.vector.ZeroVector; +import org.apache.arrow.vector.complex.impl.ComplexCopier; +import org.apache.arrow.vector.complex.impl.UnionListReader; +import org.apache.arrow.vector.complex.impl.UnionListWriter; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.complex.writer.FieldWriter; +import org.apache.arrow.vector.types.MaterializedField; +import org.apache.arrow.vector.types.Types.DataMode; +import org.apache.arrow.vector.types.Types.MajorType; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.util.CallBack; +import org.apache.arrow.vector.util.JsonStringArrayList; +import org.apache.arrow.vector.util.TransferPair; + +import com.google.common.collect.ObjectArrays; + +public class ListVector extends BaseRepeatedValueVector { + + private UInt4Vector offsets; + private final UInt1Vector bits; + private Mutator mutator = new Mutator(); + private Accessor accessor = new Accessor(); + private UnionListWriter writer; + private UnionListReader reader; + private CallBack callBack; + + public ListVector(MaterializedField field, BufferAllocator allocator, CallBack callBack) { + super(field, allocator); + this.bits = new UInt1Vector(MaterializedField.create("$bits$", new MajorType(MinorType.UINT1, DataMode.REQUIRED)), allocator); + offsets = getOffsetVector(); + this.field.addChild(getDataVector().getField()); + this.writer = new UnionListWriter(this); + this.reader = new UnionListReader(this); + this.callBack = callBack; + } + + public UnionListWriter getWriter() { + return writer; + } + + @Override + public void allocateNew() throws OutOfMemoryException { + super.allocateNewSafe(); + } + + public void transferTo(ListVector target) { + offsets.makeTransferPair(target.offsets).transfer(); + bits.makeTransferPair(target.bits).transfer(); + if (target.getDataVector() instanceof ZeroVector) { + target.addOrGetVector(new VectorDescriptor(vector.getField().getType())); + } + getDataVector().makeTransferPair(target.getDataVector()).transfer(); + } + + public void copyFromSafe(int inIndex, int outIndex, ListVector from) { + copyFrom(inIndex, outIndex, from); + } + + public void copyFrom(int inIndex, int outIndex, ListVector from) { + FieldReader in = from.getReader(); + in.setPosition(inIndex); + FieldWriter out = getWriter(); + out.setPosition(outIndex); + ComplexCopier.copy(in, out); + } + + @Override + public ValueVector getDataVector() { + return vector; + } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator) { + return new 
TransferImpl(field.withPath(ref), allocator); + } + + @Override + public TransferPair makeTransferPair(ValueVector target) { + return new TransferImpl((ListVector) target); + } + + private class TransferImpl implements TransferPair { + + ListVector to; + + public TransferImpl(MaterializedField field, BufferAllocator allocator) { + to = new ListVector(field, allocator, null); + to.addOrGetVector(new VectorDescriptor(vector.getField().getType())); + } + + public TransferImpl(ListVector to) { + this.to = to; + to.addOrGetVector(new VectorDescriptor(vector.getField().getType())); + } + + @Override + public void transfer() { + transferTo(to); + } + + @Override + public void splitAndTransfer(int startIndex, int length) { + to.allocateNew(); + for (int i = 0; i < length; i++) { + copyValueSafe(startIndex + i, i); + } + } + + @Override + public ValueVector getTo() { + return to; + } + + @Override + public void copyValueSafe(int from, int to) { + this.to.copyFrom(from, to, ListVector.this); + } + } + + @Override + public Accessor getAccessor() { + return accessor; + } + + @Override + public Mutator getMutator() { + return mutator; + } + + @Override + public FieldReader getReader() { + return reader; + } + + @Override + public boolean allocateNewSafe() { + /* boolean to keep track if all the memory allocation were successful + * Used in the case of composite vectors when we need to allocate multiple + * buffers for multiple vectors. If one of the allocations failed we need to + * clear all the memory that we allocated + */ + boolean success = false; + try { + if (!offsets.allocateNewSafe()) { + return false; + } + success = vector.allocateNewSafe(); + success = success && bits.allocateNewSafe(); + } finally { + if (!success) { + clear(); + } + } + if (success) { + offsets.zeroVector(); + bits.zeroVector(); + } + return success; + } + +// @Override +// protected UserBitShared.SerializedField.Builder getMetadataBuilder() { +// return getField().getAsBuilder() +// .setValueCount(getAccessor().getValueCount()) +// .setBufferLength(getBufferSize()) +// .addChild(offsets.getMetadata()) +// .addChild(bits.getMetadata()) +// .addChild(vector.getMetadata()); +// } + public AddOrGetResult addOrGetVector(VectorDescriptor descriptor) { + AddOrGetResult result = super.addOrGetVector(descriptor); + reader = new UnionListReader(this); + return result; + } + + @Override + public int getBufferSize() { + if (getAccessor().getValueCount() == 0) { + return 0; + } + return offsets.getBufferSize() + bits.getBufferSize() + vector.getBufferSize(); + } + + @Override + public void clear() { + offsets.clear(); + vector.clear(); + bits.clear(); + lastSet = 0; + super.clear(); + } + + @Override + public ArrowBuf[] getBuffers(boolean clear) { + final ArrowBuf[] buffers = ObjectArrays.concat(offsets.getBuffers(false), ObjectArrays.concat(bits.getBuffers(false), + vector.getBuffers(false), ArrowBuf.class), ArrowBuf.class); + if (clear) { + for (ArrowBuf buffer:buffers) { + buffer.retain(); + } + clear(); + } + return buffers; + } + +// @Override +// public void load(UserBitShared.SerializedField metadata, DrillBuf buffer) { +// final UserBitShared.SerializedField offsetMetadata = metadata.getChild(0); +// offsets.load(offsetMetadata, buffer); +// +// final int offsetLength = offsetMetadata.getBufferLength(); +// final UserBitShared.SerializedField bitMetadata = metadata.getChild(1); +// final int bitLength = bitMetadata.getBufferLength(); +// bits.load(bitMetadata, buffer.slice(offsetLength, bitLength)); +// +// final 
UserBitShared.SerializedField vectorMetadata = metadata.getChild(2); +// if (getDataVector() == DEFAULT_DATA_VECTOR) { +// addOrGetVector(VectorDescriptor.create(vectorMetadata.getMajorType())); +// } +// +// final int vectorLength = vectorMetadata.getBufferLength(); +// vector.load(vectorMetadata, buffer.slice(offsetLength + bitLength, vectorLength)); +// } + + public UnionVector promoteToUnion() { + MaterializedField newField = MaterializedField.create(getField().getPath(), new MajorType(MinorType.UNION, DataMode.OPTIONAL)); + UnionVector vector = new UnionVector(newField, allocator, null); + replaceDataVector(vector); + reader = new UnionListReader(this); + return vector; + } + + private int lastSet; + + public class Accessor extends BaseRepeatedAccessor { + + @Override + public Object getObject(int index) { + if (isNull(index)) { + return null; + } + final List vals = new JsonStringArrayList<>(); + final UInt4Vector.Accessor offsetsAccessor = offsets.getAccessor(); + final int start = offsetsAccessor.get(index); + final int end = offsetsAccessor.get(index + 1); + final ValueVector.Accessor valuesAccessor = getDataVector().getAccessor(); + for(int i = start; i < end; i++) { + vals.add(valuesAccessor.getObject(i)); + } + return vals; + } + + @Override + public boolean isNull(int index) { + return bits.getAccessor().get(index) == 0; + } + } + + public class Mutator extends BaseRepeatedMutator { + public void setNotNull(int index) { + bits.getMutator().setSafe(index, 1); + lastSet = index + 1; + } + + @Override + public void startNewValue(int index) { + for (int i = lastSet; i <= index; i++) { + offsets.getMutator().setSafe(i + 1, offsets.getAccessor().get(i)); + } + setNotNull(index); + lastSet = index + 1; + } + + @Override + public void setValueCount(int valueCount) { + // TODO: populate offset end points + if (valueCount == 0) { + offsets.getMutator().setValueCount(0); + } else { + for (int i = lastSet; i < valueCount; i++) { + offsets.getMutator().setSafe(i + 1, offsets.getAccessor().get(i)); + } + offsets.getMutator().setValueCount(valueCount + 1); + } + final int childValueCount = valueCount == 0 ? 0 : offsets.getAccessor().get(valueCount); + vector.getMutator().setValueCount(childValueCount); + bits.getMutator().setValueCount(valueCount); + } + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java new file mode 100644 index 0000000000000..1bbce73d6ff82 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java @@ -0,0 +1,374 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector.complex; + +import io.netty.buffer.ArrowBuf; + +import java.util.Collection; +import java.util.Iterator; +import java.util.Map; + +import javax.annotation.Nullable; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.BaseValueVector; +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.complex.RepeatedMapVector.MapSingleCopier; +import org.apache.arrow.vector.complex.impl.SingleMapReaderImpl; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.holders.ComplexHolder; +import org.apache.arrow.vector.types.MaterializedField; +import org.apache.arrow.vector.types.Types.DataMode; +import org.apache.arrow.vector.types.Types.MajorType; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.util.CallBack; +import org.apache.arrow.vector.util.JsonStringHashMap; +import org.apache.arrow.vector.util.TransferPair; + +import com.google.common.base.Preconditions; +import com.google.common.collect.Ordering; +import com.google.common.primitives.Ints; + +public class MapVector extends AbstractMapVector { + //private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(MapVector.class); + + public final static MajorType TYPE = new MajorType(MinorType.MAP, DataMode.OPTIONAL); + + private final SingleMapReaderImpl reader = new SingleMapReaderImpl(MapVector.this); + private final Accessor accessor = new Accessor(); + private final Mutator mutator = new Mutator(); + private int valueCount; + + public MapVector(String path, BufferAllocator allocator, CallBack callBack){ + this(MaterializedField.create(path, TYPE), allocator, callBack); + } + + public MapVector(MaterializedField field, BufferAllocator allocator, CallBack callBack){ + super(field, allocator, callBack); + } + + @Override + public FieldReader getReader() { + //return new SingleMapReaderImpl(MapVector.this); + return reader; + } + + transient private MapTransferPair ephPair; + transient private MapSingleCopier ephPair2; + + public void copyFromSafe(int fromIndex, int thisIndex, MapVector from) { + if(ephPair == null || ephPair.from != from) { + ephPair = (MapTransferPair) from.makeTransferPair(this); + } + ephPair.copyValueSafe(fromIndex, thisIndex); + } + + public void copyFromSafe(int fromSubIndex, int thisIndex, RepeatedMapVector from) { + if(ephPair2 == null || ephPair2.from != from) { + ephPair2 = from.makeSingularCopier(this); + } + ephPair2.copySafe(fromSubIndex, thisIndex); + } + + @Override + protected boolean supportsDirectRead() { + return true; + } + + public Iterator fieldNameIterator() { + return getChildFieldNames().iterator(); + } + + @Override + public void setInitialCapacity(int numRecords) { + for (final ValueVector v : (Iterable) this) { + v.setInitialCapacity(numRecords); + } + } + + @Override + public int getBufferSize() { + if (valueCount == 0 || size() == 0) { + return 0; + } + long buffer = 0; + for (final ValueVector v : (Iterable)this) { + buffer += v.getBufferSize(); + } + + return (int) buffer; + } + + @Override + public int getBufferSizeFor(final int valueCount) { + if (valueCount == 0) { + return 0; + } + + long bufferSize = 0; + for (final ValueVector v : (Iterable) this) { + bufferSize += v.getBufferSizeFor(valueCount); + } + + return (int) bufferSize; + } + + @Override + public ArrowBuf[] getBuffers(boolean clear) { + int expectedSize = getBufferSize(); + int actualSize = super.getBufferSize(); + + 
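+    // Both totals must agree: the override sums the children's getBufferSize(),
+    // while the parent implementation sums each buffer's writer index.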
Preconditions.checkArgument(expectedSize == actualSize); + return super.getBuffers(clear); + } + + @Override + public TransferPair getTransferPair(BufferAllocator allocator) { + return new MapTransferPair(this, getField().getPath(), allocator); + } + + @Override + public TransferPair makeTransferPair(ValueVector to) { + return new MapTransferPair(this, (MapVector) to); + } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator) { + return new MapTransferPair(this, ref, allocator); + } + + protected static class MapTransferPair implements TransferPair{ + private final TransferPair[] pairs; + private final MapVector from; + private final MapVector to; + + public MapTransferPair(MapVector from, String path, BufferAllocator allocator) { + this(from, new MapVector(MaterializedField.create(path, TYPE), allocator, from.callBack), false); + } + + public MapTransferPair(MapVector from, MapVector to) { + this(from, to, true); + } + + protected MapTransferPair(MapVector from, MapVector to, boolean allocate) { + this.from = from; + this.to = to; + this.pairs = new TransferPair[from.size()]; + this.to.ephPair = null; + this.to.ephPair2 = null; + + int i = 0; + ValueVector vector; + for (String child:from.getChildFieldNames()) { + int preSize = to.size(); + vector = from.getChild(child); + if (vector == null) { + continue; + } + //DRILL-1872: we add the child fields for the vector, looking up the field by name. For a map vector, + // the child fields may be nested fields of the top level child. For example if the structure + // of a child field is oa.oab.oabc then we add oa, then add oab to oa then oabc to oab. + // But the children member of a Materialized field is a HashSet. If the fields are added in the + // children HashSet, and the hashCode of the Materialized field includes the hash code of the + // children, the hashCode value of oa changes *after* the field has been added to the HashSet. + // (This is similar to what happens in ScanBatch where the children cannot be added till they are + // read). To take care of this, we ensure that the hashCode of the MaterializedField does not + // include the hashCode of the children but is based only on MaterializedField$key. 
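+        // Mirror this child on the target vector, then pair the two up so values
+        // can be transferred or copied child by child.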
+ final ValueVector newVector = to.addOrGet(child, vector.getField().getType(), vector.getClass()); + if (allocate && to.size() != preSize) { + newVector.allocateNew(); + } + pairs[i++] = vector.makeTransferPair(newVector); + } + } + + @Override + public void transfer() { + for (final TransferPair p : pairs) { + p.transfer(); + } + to.valueCount = from.valueCount; + from.clear(); + } + + @Override + public ValueVector getTo() { + return to; + } + + @Override + public void copyValueSafe(int from, int to) { + for (TransferPair p : pairs) { + p.copyValueSafe(from, to); + } + } + + @Override + public void splitAndTransfer(int startIndex, int length) { + for (TransferPair p : pairs) { + p.splitAndTransfer(startIndex, length); + } + to.getMutator().setValueCount(length); + } + } + + @Override + public int getValueCapacity() { + if (size() == 0) { + return 0; + } + + final Ordering natural = new Ordering() { + @Override + public int compare(@Nullable ValueVector left, @Nullable ValueVector right) { + return Ints.compare( + Preconditions.checkNotNull(left).getValueCapacity(), + Preconditions.checkNotNull(right).getValueCapacity() + ); + } + }; + + return natural.min(getChildren()).getValueCapacity(); + } + + @Override + public Accessor getAccessor() { + return accessor; + } + +// @Override +// public void load(SerializedField metadata, DrillBuf buf) { +// final List fields = metadata.getChildList(); +// valueCount = metadata.getValueCount(); +// +// int bufOffset = 0; +// for (final SerializedField child : fields) { +// final MaterializedField fieldDef = SerializedFieldHelper.create(child); +// +// ValueVector vector = getChild(fieldDef.getLastName()); +// if (vector == null) { +// if we arrive here, we didn't have a matching vector. +// vector = BasicTypeHelper.getNewVector(fieldDef, allocator); +// putChild(fieldDef.getLastName(), vector); +// } +// if (child.getValueCount() == 0) { +// vector.clear(); +// } else { +// vector.load(child, buf.slice(bufOffset, child.getBufferLength())); +// } +// bufOffset += child.getBufferLength(); +// } +// +// assert bufOffset == buf.capacity(); +// } +// +// @Override +// public SerializedField getMetadata() { +// SerializedField.Builder b = getField() // +// .getAsBuilder() // +// .setBufferLength(getBufferSize()) // +// .setValueCount(valueCount); +// +// +// for(ValueVector v : getChildren()) { +// b.addChild(v.getMetadata()); +// } +// return b.build(); +// } + + @Override + public Mutator getMutator() { + return mutator; + } + + public class Accessor extends BaseValueVector.BaseAccessor { + + @Override + public Object getObject(int index) { + Map vv = new JsonStringHashMap<>(); + for (String child:getChildFieldNames()) { + ValueVector v = getChild(child); + // TODO(DRILL-4001): Resolve this hack: + // The index/value count check in the following if statement is a hack + // to work around the current fact that RecordBatchLoader.load and + // MapVector.load leave child vectors with a length of zero (as opposed + // to matching the lengths of siblings and the parent map vector) + // because they don't remove (or set the lengths of) vectors from + // previous batches that aren't in the current batch. 
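+        // Guard against missing children as well as child vectors shorter than the
+        // requested index; null values are simply omitted from the resulting map.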
+ if (v != null && index < v.getAccessor().getValueCount()) { + Object value = v.getAccessor().getObject(index); + if (value != null) { + vv.put(child, value); + } + } + } + return vv; + } + + public void get(int index, ComplexHolder holder) { + reader.setPosition(index); + holder.reader = reader; + } + + @Override + public int getValueCount() { + return valueCount; + } + } + + public ValueVector getVectorById(int id) { + return getChildByOrdinal(id); + } + + public class Mutator extends BaseValueVector.BaseMutator { + + @Override + public void setValueCount(int valueCount) { + for (final ValueVector v : getChildren()) { + v.getMutator().setValueCount(valueCount); + } + MapVector.this.valueCount = valueCount; + } + + @Override + public void reset() { } + + @Override + public void generateTestData(int values) { } + } + + @Override + public void clear() { + for (final ValueVector v : getChildren()) { + v.clear(); + } + valueCount = 0; + } + + @Override + public void close() { + final Collection vectors = getChildren(); + for (final ValueVector v : vectors) { + v.close(); + } + vectors.clear(); + valueCount = 0; + + super.close(); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/Positionable.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/Positionable.java new file mode 100644 index 0000000000000..93451181ca949 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/Positionable.java @@ -0,0 +1,22 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.complex; + +public interface Positionable { + public void setPosition(int index); +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedFixedWidthVectorLike.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedFixedWidthVectorLike.java new file mode 100644 index 0000000000000..23850bc9034df --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedFixedWidthVectorLike.java @@ -0,0 +1,40 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector.complex;
+
+/**
+ * A {@link org.apache.arrow.vector.ValueVector} mix-in that can be used in conjunction with
+ * {@link RepeatedValueVector} subtypes.
+ */
+public interface RepeatedFixedWidthVectorLike {
+  /**
+   * Allocate a new memory space for this vector. Must be called prior to using the ValueVector.
+   *
+   * @param valueCount Number of separate repeating groupings.
+   * @param innerValueCount Number of supported values in the vector.
+   */
+  void allocateNew(int valueCount, int innerValueCount);
+}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedListVector.java
new file mode 100644
index 0000000000000..778fe81b5da6a
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedListVector.java
@@ -0,0 +1,428 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */ +package org.apache.arrow.vector.complex; + +import io.netty.buffer.ArrowBuf; + +import java.util.Iterator; +import java.util.List; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.OutOfMemoryException; +import org.apache.arrow.vector.AddOrGetResult; +import org.apache.arrow.vector.UInt4Vector; +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.VectorDescriptor; +import org.apache.arrow.vector.complex.impl.NullReader; +import org.apache.arrow.vector.complex.impl.RepeatedListReaderImpl; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.holders.ComplexHolder; +import org.apache.arrow.vector.holders.RepeatedListHolder; +import org.apache.arrow.vector.types.MaterializedField; +import org.apache.arrow.vector.types.Types.DataMode; +import org.apache.arrow.vector.types.Types.MajorType; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.util.CallBack; +import org.apache.arrow.vector.util.JsonStringArrayList; +import org.apache.arrow.vector.util.TransferPair; + +import com.google.common.base.Preconditions; +import com.google.common.collect.Lists; + +public class RepeatedListVector extends AbstractContainerVector + implements RepeatedValueVector, RepeatedFixedWidthVectorLike { + + public final static MajorType TYPE = new MajorType(MinorType.LIST, DataMode.REPEATED); + private final RepeatedListReaderImpl reader = new RepeatedListReaderImpl(null, this); + private final DelegateRepeatedVector delegate; + + protected static class DelegateRepeatedVector extends BaseRepeatedValueVector { + + private final RepeatedListAccessor accessor = new RepeatedListAccessor(); + private final RepeatedListMutator mutator = new RepeatedListMutator(); + private final EmptyValuePopulator emptyPopulator; + private transient DelegateTransferPair ephPair; + + public class RepeatedListAccessor extends BaseRepeatedValueVector.BaseRepeatedAccessor { + + @Override + public Object getObject(int index) { + final List list = new JsonStringArrayList<>(); + final int start = offsets.getAccessor().get(index); + final int until = offsets.getAccessor().get(index+1); + for (int i = start; i < until; i++) { + list.add(vector.getAccessor().getObject(i)); + } + return list; + } + + public void get(int index, RepeatedListHolder holder) { + assert index <= getValueCapacity(); + holder.start = getOffsetVector().getAccessor().get(index); + holder.end = getOffsetVector().getAccessor().get(index+1); + } + + public void get(int index, ComplexHolder holder) { + final FieldReader reader = getReader(); + reader.setPosition(index); + holder.reader = reader; + } + + public void get(int index, int arrayIndex, ComplexHolder holder) { + final RepeatedListHolder listHolder = new RepeatedListHolder(); + get(index, listHolder); + int offset = listHolder.start + arrayIndex; + if (offset >= listHolder.end) { + holder.reader = NullReader.INSTANCE; + } else { + FieldReader r = getDataVector().getReader(); + r.setPosition(offset); + holder.reader = r; + } + } + } + + public class RepeatedListMutator extends BaseRepeatedValueVector.BaseRepeatedMutator { + + public int add(int index) { + final int curEnd = getOffsetVector().getAccessor().get(index+1); + getOffsetVector().getMutator().setSafe(index + 1, curEnd + 1); + return curEnd; + } + + @Override + public void startNewValue(int index) { + emptyPopulator.populate(index+1); + super.startNewValue(index); + } + + @Override + public void setValueCount(int valueCount) { + 
emptyPopulator.populate(valueCount); + super.setValueCount(valueCount); + } + } + + + public class DelegateTransferPair implements TransferPair { + private final DelegateRepeatedVector target; + private final TransferPair[] children; + + public DelegateTransferPair(DelegateRepeatedVector target) { + this.target = Preconditions.checkNotNull(target); + if (target.getDataVector() == DEFAULT_DATA_VECTOR) { + target.addOrGetVector(VectorDescriptor.create(getDataVector().getField())); + target.getDataVector().allocateNew(); + } + this.children = new TransferPair[] { + getOffsetVector().makeTransferPair(target.getOffsetVector()), + getDataVector().makeTransferPair(target.getDataVector()) + }; + } + + @Override + public void transfer() { + for (TransferPair child:children) { + child.transfer(); + } + } + + @Override + public ValueVector getTo() { + return target; + } + + @Override + public void splitAndTransfer(int startIndex, int length) { + target.allocateNew(); + for (int i = 0; i < length; i++) { + copyValueSafe(startIndex + i, i); + } + } + + @Override + public void copyValueSafe(int srcIndex, int destIndex) { + final RepeatedListHolder holder = new RepeatedListHolder(); + getAccessor().get(srcIndex, holder); + target.emptyPopulator.populate(destIndex+1); + final TransferPair vectorTransfer = children[1]; + int newIndex = target.getOffsetVector().getAccessor().get(destIndex); + //todo: make this a bulk copy. + for (int i = holder.start; i < holder.end; i++, newIndex++) { + vectorTransfer.copyValueSafe(i, newIndex); + } + target.getOffsetVector().getMutator().setSafe(destIndex + 1, newIndex); + } + } + + public DelegateRepeatedVector(String path, BufferAllocator allocator) { + this(MaterializedField.create(path, TYPE), allocator); + } + + public DelegateRepeatedVector(MaterializedField field, BufferAllocator allocator) { + super(field, allocator); + emptyPopulator = new EmptyValuePopulator(getOffsetVector()); + } + + @Override + public void allocateNew() throws OutOfMemoryException { + if (!allocateNewSafe()) { + throw new OutOfMemoryException(); + } + } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator) { + return makeTransferPair(new DelegateRepeatedVector(ref, allocator)); + } + + @Override + public TransferPair makeTransferPair(ValueVector target) { + return new DelegateTransferPair(DelegateRepeatedVector.class.cast(target)); + } + + @Override + public RepeatedListAccessor getAccessor() { + return accessor; + } + + @Override + public RepeatedListMutator getMutator() { + return mutator; + } + + @Override + public FieldReader getReader() { + throw new UnsupportedOperationException(); + } + + public void copyFromSafe(int fromIndex, int thisIndex, DelegateRepeatedVector from) { + if(ephPair == null || ephPair.target != from) { + ephPair = DelegateTransferPair.class.cast(from.makeTransferPair(this)); + } + ephPair.copyValueSafe(fromIndex, thisIndex); + } + + } + + protected class RepeatedListTransferPair implements TransferPair { + private final TransferPair delegate; + + public RepeatedListTransferPair(TransferPair delegate) { + this.delegate = delegate; + } + + public void transfer() { + delegate.transfer(); + } + + @Override + public void splitAndTransfer(int startIndex, int length) { + delegate.splitAndTransfer(startIndex, length); + } + + @Override + public ValueVector getTo() { + final DelegateRepeatedVector delegateVector = DelegateRepeatedVector.class.cast(delegate.getTo()); + return new RepeatedListVector(getField(), allocator, callBack, 
delegateVector); + } + + @Override + public void copyValueSafe(int from, int to) { + delegate.copyValueSafe(from, to); + } + } + + public RepeatedListVector(String path, BufferAllocator allocator, CallBack callBack) { + this(MaterializedField.create(path, TYPE), allocator, callBack); + } + + public RepeatedListVector(MaterializedField field, BufferAllocator allocator, CallBack callBack) { + this(field, allocator, callBack, new DelegateRepeatedVector(field, allocator)); + } + + protected RepeatedListVector(MaterializedField field, BufferAllocator allocator, CallBack callBack, DelegateRepeatedVector delegate) { + super(field, allocator, callBack); + this.delegate = Preconditions.checkNotNull(delegate); + + final List children = Lists.newArrayList(field.getChildren()); + final int childSize = children.size(); + assert childSize < 3; + final boolean hasChild = childSize > 0; + if (hasChild) { + // the last field is data field + final MaterializedField child = children.get(childSize-1); + addOrGetVector(VectorDescriptor.create(child)); + } + } + + + @Override + public RepeatedListReaderImpl getReader() { + return reader; + } + + @Override + public DelegateRepeatedVector.RepeatedListAccessor getAccessor() { + return delegate.getAccessor(); + } + + @Override + public DelegateRepeatedVector.RepeatedListMutator getMutator() { + return delegate.getMutator(); + } + + @Override + public UInt4Vector getOffsetVector() { + return delegate.getOffsetVector(); + } + + @Override + public ValueVector getDataVector() { + return delegate.getDataVector(); + } + + @Override + public void allocateNew() throws OutOfMemoryException { + delegate.allocateNew(); + } + + @Override + public boolean allocateNewSafe() { + return delegate.allocateNewSafe(); + } + + @Override + public AddOrGetResult addOrGetVector(VectorDescriptor descriptor) { + final AddOrGetResult result = delegate.addOrGetVector(descriptor); + if (result.isCreated() && callBack != null) { + callBack.doWork(); + } + this.field = delegate.getField(); + return result; + } + + @Override + public int size() { + return delegate.size(); + } + + @Override + public int getBufferSize() { + return delegate.getBufferSize(); + } + + @Override + public int getBufferSizeFor(final int valueCount) { + return delegate.getBufferSizeFor(valueCount); + } + + @Override + public void close() { + delegate.close(); + } + + @Override + public void clear() { + delegate.clear(); + } + + @Override + public TransferPair getTransferPair(BufferAllocator allocator) { + return new RepeatedListTransferPair(delegate.getTransferPair(allocator)); + } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator) { + return new RepeatedListTransferPair(delegate.getTransferPair(ref, allocator)); + } + + @Override + public TransferPair makeTransferPair(ValueVector to) { + final RepeatedListVector target = RepeatedListVector.class.cast(to); + return new RepeatedListTransferPair(delegate.makeTransferPair(target.delegate)); + } + + @Override + public int getValueCapacity() { + return delegate.getValueCapacity(); + } + + @Override + public ArrowBuf[] getBuffers(boolean clear) { + return delegate.getBuffers(clear); + } + + +// @Override +// public void load(SerializedField metadata, DrillBuf buf) { +// delegate.load(metadata, buf); +// } + +// @Override +// public SerializedField getMetadata() { +// return delegate.getMetadata(); +// } + + @Override + public Iterator iterator() { + return delegate.iterator(); + } + + @Override + public void setInitialCapacity(int 
numRecords) {
+    delegate.setInitialCapacity(numRecords);
+  }
+
+  /**
+   * @deprecated
+   * prefer using {@link #addOrGetVector(org.apache.arrow.vector.VectorDescriptor)} instead.
+   */
+  @Override
+  public <T extends ValueVector> T addOrGet(String name, MajorType type, Class<T> clazz) {
+    final AddOrGetResult<T> result = addOrGetVector(VectorDescriptor.create(type));
+    return result.getVector();
+  }
+
+  @Override
+  public <T extends ValueVector> T getChild(String name, Class<T> clazz) {
+    if (name != null) {
+      return null;
+    }
+    return typeify(delegate.getDataVector(), clazz);
+  }
+
+  @Override
+  public void allocateNew(int valueCount, int innerValueCount) {
+    clear();
+    getOffsetVector().allocateNew(valueCount + 1);
+    getMutator().reset();
+  }
+
+  @Override
+  public VectorWithOrdinal getChildVectorWithOrdinal(String name) {
+    if (name != null) {
+      return null;
+    }
+    return new VectorWithOrdinal(delegate.getDataVector(), 0);
+  }
+
+  public void copyFromSafe(int fromIndex, int thisIndex, RepeatedListVector from) {
+    delegate.copyFromSafe(fromIndex, thisIndex, from.delegate);
+  }
+}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedMapVector.java
new file mode 100644
index 0000000000000..e7eacd3c67c40
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedMapVector.java
@@ -0,0 +1,584 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */ +package org.apache.arrow.vector.complex; + +import io.netty.buffer.ArrowBuf; + +import java.util.Iterator; +import java.util.List; +import java.util.Map; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.OutOfMemoryException; +import org.apache.arrow.vector.AddOrGetResult; +import org.apache.arrow.vector.AllocationHelper; +import org.apache.arrow.vector.UInt4Vector; +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.VectorDescriptor; +import org.apache.arrow.vector.complex.impl.NullReader; +import org.apache.arrow.vector.complex.impl.RepeatedMapReaderImpl; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.holders.ComplexHolder; +import org.apache.arrow.vector.holders.RepeatedMapHolder; +import org.apache.arrow.vector.types.MaterializedField; +import org.apache.arrow.vector.types.Types.DataMode; +import org.apache.arrow.vector.types.Types.MajorType; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.util.CallBack; +import org.apache.arrow.vector.util.JsonStringArrayList; +import org.apache.arrow.vector.util.TransferPair; +import org.apache.commons.lang3.ArrayUtils; + +import com.google.common.base.Preconditions; +import com.google.common.collect.Maps; + +public class RepeatedMapVector extends AbstractMapVector + implements RepeatedValueVector, RepeatedFixedWidthVectorLike { + //private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(RepeatedMapVector.class); + + public final static MajorType TYPE = new MajorType(MinorType.MAP, DataMode.REPEATED); + + private final UInt4Vector offsets; // offsets to start of each record (considering record indices are 0-indexed) + private final RepeatedMapReaderImpl reader = new RepeatedMapReaderImpl(RepeatedMapVector.this); + private final RepeatedMapAccessor accessor = new RepeatedMapAccessor(); + private final Mutator mutator = new Mutator(); + private final EmptyValuePopulator emptyPopulator; + + public RepeatedMapVector(MaterializedField field, BufferAllocator allocator, CallBack callBack){ + super(field, allocator, callBack); + this.offsets = new UInt4Vector(BaseRepeatedValueVector.OFFSETS_FIELD, allocator); + this.emptyPopulator = new EmptyValuePopulator(offsets); + } + + @Override + public UInt4Vector getOffsetVector() { + return offsets; + } + + @Override + public ValueVector getDataVector() { + throw new UnsupportedOperationException(); + } + + @Override + public AddOrGetResult addOrGetVector(VectorDescriptor descriptor) { + throw new UnsupportedOperationException(); + } + + @Override + public void setInitialCapacity(int numRecords) { + offsets.setInitialCapacity(numRecords + 1); + for(final ValueVector v : (Iterable) this) { + v.setInitialCapacity(numRecords * RepeatedValueVector.DEFAULT_REPEAT_PER_RECORD); + } + } + + @Override + public RepeatedMapReaderImpl getReader() { + return reader; + } + + @Override + public void allocateNew(int groupCount, int innerValueCount) { + clear(); + try { + offsets.allocateNew(groupCount + 1); + for (ValueVector v : getChildren()) { + AllocationHelper.allocatePrecomputedChildCount(v, groupCount, 50, innerValueCount); + } + } catch (OutOfMemoryException e){ + clear(); + throw e; + } + offsets.zeroVector(); + mutator.reset(); + } + + public Iterator fieldNameIterator() { + return getChildFieldNames().iterator(); + } + + @Override + public List getPrimitiveVectors() { + final List primitiveVectors = super.getPrimitiveVectors(); + 
primitiveVectors.add(offsets); + return primitiveVectors; + } + + @Override + public int getBufferSize() { + if (getAccessor().getValueCount() == 0) { + return 0; + } + long bufferSize = offsets.getBufferSize(); + for (final ValueVector v : (Iterable) this) { + bufferSize += v.getBufferSize(); + } + return (int) bufferSize; + } + + @Override + public int getBufferSizeFor(final int valueCount) { + if (valueCount == 0) { + return 0; + } + + long bufferSize = 0; + for (final ValueVector v : (Iterable) this) { + bufferSize += v.getBufferSizeFor(valueCount); + } + + return (int) bufferSize; + } + + @Override + public void close() { + offsets.close(); + super.close(); + } + + @Override + public TransferPair getTransferPair(BufferAllocator allocator) { + return new RepeatedMapTransferPair(this, getField().getPath(), allocator); + } + + @Override + public TransferPair makeTransferPair(ValueVector to) { + return new RepeatedMapTransferPair(this, (RepeatedMapVector)to); + } + + MapSingleCopier makeSingularCopier(MapVector to) { + return new MapSingleCopier(this, to); + } + + protected static class MapSingleCopier { + private final TransferPair[] pairs; + public final RepeatedMapVector from; + + public MapSingleCopier(RepeatedMapVector from, MapVector to) { + this.from = from; + this.pairs = new TransferPair[from.size()]; + + int i = 0; + ValueVector vector; + for (final String child:from.getChildFieldNames()) { + int preSize = to.size(); + vector = from.getChild(child); + if (vector == null) { + continue; + } + final ValueVector newVector = to.addOrGet(child, vector.getField().getType(), vector.getClass()); + if (to.size() != preSize) { + newVector.allocateNew(); + } + pairs[i++] = vector.makeTransferPair(newVector); + } + } + + public void copySafe(int fromSubIndex, int toIndex) { + for (TransferPair p : pairs) { + p.copyValueSafe(fromSubIndex, toIndex); + } + } + } + + public TransferPair getTransferPairToSingleMap(String reference, BufferAllocator allocator) { + return new SingleMapTransferPair(this, reference, allocator); + } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator) { + return new RepeatedMapTransferPair(this, ref, allocator); + } + + @Override + public boolean allocateNewSafe() { + /* boolean to keep track if all the memory allocation were successful + * Used in the case of composite vectors when we need to allocate multiple + * buffers for multiple vectors. 
If one of the allocations failed we need to + * clear all the memory that we allocated + */ + boolean success = false; + try { + if (!offsets.allocateNewSafe()) { + return false; + } + success = super.allocateNewSafe(); + } finally { + if (!success) { + clear(); + } + } + offsets.zeroVector(); + return success; + } + + protected static class SingleMapTransferPair implements TransferPair { + private final TransferPair[] pairs; + private final RepeatedMapVector from; + private final MapVector to; + private static final MajorType MAP_TYPE = new MajorType(MinorType.MAP, DataMode.REQUIRED); + + public SingleMapTransferPair(RepeatedMapVector from, String path, BufferAllocator allocator) { + this(from, new MapVector(MaterializedField.create(path, MAP_TYPE), allocator, from.callBack), false); + } + + public SingleMapTransferPair(RepeatedMapVector from, MapVector to) { + this(from, to, true); + } + + public SingleMapTransferPair(RepeatedMapVector from, MapVector to, boolean allocate) { + this.from = from; + this.to = to; + this.pairs = new TransferPair[from.size()]; + int i = 0; + ValueVector vector; + for (final String child : from.getChildFieldNames()) { + int preSize = to.size(); + vector = from.getChild(child); + if (vector == null) { + continue; + } + final ValueVector newVector = to.addOrGet(child, vector.getField().getType(), vector.getClass()); + if (allocate && to.size() != preSize) { + newVector.allocateNew(); + } + pairs[i++] = vector.makeTransferPair(newVector); + } + } + + + @Override + public void transfer() { + for (TransferPair p : pairs) { + p.transfer(); + } + to.getMutator().setValueCount(from.getAccessor().getValueCount()); + from.clear(); + } + + @Override + public ValueVector getTo() { + return to; + } + + @Override + public void copyValueSafe(int from, int to) { + for (TransferPair p : pairs) { + p.copyValueSafe(from, to); + } + } + + @Override + public void splitAndTransfer(int startIndex, int length) { + for (TransferPair p : pairs) { + p.splitAndTransfer(startIndex, length); + } + to.getMutator().setValueCount(length); + } + } + + private static class RepeatedMapTransferPair implements TransferPair{ + + private final TransferPair[] pairs; + private final RepeatedMapVector to; + private final RepeatedMapVector from; + + public RepeatedMapTransferPair(RepeatedMapVector from, String path, BufferAllocator allocator) { + this(from, new RepeatedMapVector(MaterializedField.create(path, TYPE), allocator, from.callBack), false); + } + + public RepeatedMapTransferPair(RepeatedMapVector from, RepeatedMapVector to) { + this(from, to, true); + } + + public RepeatedMapTransferPair(RepeatedMapVector from, RepeatedMapVector to, boolean allocate) { + this.from = from; + this.to = to; + this.pairs = new TransferPair[from.size()]; + this.to.ephPair = null; + + int i = 0; + ValueVector vector; + for (final String child : from.getChildFieldNames()) { + final int preSize = to.size(); + vector = from.getChild(child); + if (vector == null) { + continue; + } + + final ValueVector newVector = to.addOrGet(child, vector.getField().getType(), vector.getClass()); + if (to.size() != preSize) { + newVector.allocateNew(); + } + + pairs[i++] = vector.makeTransferPair(newVector); + } + } + + @Override + public void transfer() { + from.offsets.transferTo(to.offsets); + for (TransferPair p : pairs) { + p.transfer(); + } + from.clear(); + } + + @Override + public ValueVector getTo() { + return to; + } + + @Override + public void copyValueSafe(int srcIndex, int destIndex) { + RepeatedMapHolder holder = new 
RepeatedMapHolder(); + from.getAccessor().get(srcIndex, holder); + to.emptyPopulator.populate(destIndex + 1); + int newIndex = to.offsets.getAccessor().get(destIndex); + //todo: make these bulk copies + for (int i = holder.start; i < holder.end; i++, newIndex++) { + for (TransferPair p : pairs) { + p.copyValueSafe(i, newIndex); + } + } + to.offsets.getMutator().setSafe(destIndex + 1, newIndex); + } + + @Override + public void splitAndTransfer(final int groupStart, final int groups) { + final UInt4Vector.Accessor a = from.offsets.getAccessor(); + final UInt4Vector.Mutator m = to.offsets.getMutator(); + + final int startPos = a.get(groupStart); + final int endPos = a.get(groupStart + groups); + final int valuesToCopy = endPos - startPos; + + to.offsets.clear(); + to.offsets.allocateNew(groups + 1); + + int normalizedPos; + for (int i = 0; i < groups + 1; i++) { + normalizedPos = a.get(groupStart + i) - startPos; + m.set(i, normalizedPos); + } + + m.setValueCount(groups + 1); + to.emptyPopulator.populate(groups); + + for (final TransferPair p : pairs) { + p.splitAndTransfer(startPos, valuesToCopy); + } + } + } + + + transient private RepeatedMapTransferPair ephPair; + + public void copyFromSafe(int fromIndex, int thisIndex, RepeatedMapVector from) { + if (ephPair == null || ephPair.from != from) { + ephPair = (RepeatedMapTransferPair) from.makeTransferPair(this); + } + ephPair.copyValueSafe(fromIndex, thisIndex); + } + + @Override + public int getValueCapacity() { + return Math.max(offsets.getValueCapacity() - 1, 0); + } + + @Override + public RepeatedMapAccessor getAccessor() { + return accessor; + } + + @Override + public ArrowBuf[] getBuffers(boolean clear) { + final int expectedBufferSize = getBufferSize(); + final int actualBufferSize = super.getBufferSize(); + + Preconditions.checkArgument(expectedBufferSize == actualBufferSize + offsets.getBufferSize()); + return ArrayUtils.addAll(offsets.getBuffers(clear), super.getBuffers(clear)); + } + + +// @Override +// public void load(SerializedField metadata, DrillBuf buffer) { +// final List children = metadata.getChildList(); +// +// final SerializedField offsetField = children.get(0); +// offsets.load(offsetField, buffer); +// int bufOffset = offsetField.getBufferLength(); +// +// for (int i = 1; i < children.size(); i++) { +// final SerializedField child = children.get(i); +// final MaterializedField fieldDef = SerializedFieldHelper.create(child); +// ValueVector vector = getChild(fieldDef.getLastName()); +// if (vector == null) { + // if we arrive here, we didn't have a matching vector. 
+// vector = BasicTypeHelper.getNewVector(fieldDef, allocator); +// putChild(fieldDef.getLastName(), vector); +// } +// final int vectorLength = child.getBufferLength(); +// vector.load(child, buffer.slice(bufOffset, vectorLength)); +// bufOffset += vectorLength; +// } +// +// assert bufOffset == buffer.capacity(); +// } +// +// +// @Override +// public SerializedField getMetadata() { +// SerializedField.Builder builder = getField() // +// .getAsBuilder() // +// .setBufferLength(getBufferSize()) // + // while we don't need to actually read this on load, we need it to make sure we don't skip deserialization of this vector +// .setValueCount(accessor.getValueCount()); +// builder.addChild(offsets.getMetadata()); +// for (final ValueVector child : getChildren()) { +// builder.addChild(child.getMetadata()); +// } +// return builder.build(); +// } + + @Override + public Mutator getMutator() { + return mutator; + } + + public class RepeatedMapAccessor implements RepeatedAccessor { + @Override + public Object getObject(int index) { + final List list = new JsonStringArrayList<>(); + final int end = offsets.getAccessor().get(index+1); + String fieldName; + for (int i = offsets.getAccessor().get(index); i < end; i++) { + final Map vv = Maps.newLinkedHashMap(); + for (final MaterializedField field : getField().getChildren()) { + if (!field.equals(BaseRepeatedValueVector.OFFSETS_FIELD)) { + fieldName = field.getLastName(); + final Object value = getChild(fieldName).getAccessor().getObject(i); + if (value != null) { + vv.put(fieldName, value); + } + } + } + list.add(vv); + } + return list; + } + + @Override + public int getValueCount() { + return Math.max(offsets.getAccessor().getValueCount() - 1, 0); + } + + @Override + public int getInnerValueCount() { + final int valueCount = getValueCount(); + if (valueCount == 0) { + return 0; + } + return offsets.getAccessor().get(valueCount); + } + + @Override + public int getInnerValueCountAt(int index) { + return offsets.getAccessor().get(index+1) - offsets.getAccessor().get(index); + } + + @Override + public boolean isEmpty(int index) { + return false; + } + + @Override + public boolean isNull(int index) { + return false; + } + + public void get(int index, RepeatedMapHolder holder) { + assert index < getValueCapacity() : + String.format("Attempted to access index %d when value capacity is %d", + index, getValueCapacity()); + final UInt4Vector.Accessor offsetsAccessor = offsets.getAccessor(); + holder.start = offsetsAccessor.get(index); + holder.end = offsetsAccessor.get(index + 1); + } + + public void get(int index, ComplexHolder holder) { + final FieldReader reader = getReader(); + reader.setPosition(index); + holder.reader = reader; + } + + public void get(int index, int arrayIndex, ComplexHolder holder) { + final RepeatedMapHolder h = new RepeatedMapHolder(); + get(index, h); + final int offset = h.start + arrayIndex; + + if (offset >= h.end) { + holder.reader = NullReader.INSTANCE; + } else { + reader.setSinglePosition(index, arrayIndex); + holder.reader = reader; + } + } + } + + public class Mutator implements RepeatedMutator { + @Override + public void startNewValue(int index) { + emptyPopulator.populate(index + 1); + offsets.getMutator().setSafe(index + 1, offsets.getAccessor().get(index)); + } + + @Override + public void setValueCount(int topLevelValueCount) { + emptyPopulator.populate(topLevelValueCount); + offsets.getMutator().setValueCount(topLevelValueCount == 0 ? 
0 : topLevelValueCount + 1);
+      int childValueCount = offsets.getAccessor().get(topLevelValueCount);
+      for (final ValueVector v : getChildren()) {
+        v.getMutator().setValueCount(childValueCount);
+      }
+    }
+
+    @Override
+    public void reset() {}
+
+    @Override
+    public void generateTestData(int values) {}
+
+    public int add(int index) {
+      final int prevEnd = offsets.getAccessor().get(index + 1);
+      offsets.getMutator().setSafe(index + 1, prevEnd + 1);
+      return prevEnd;
+    }
+  }
+
+  @Override
+  public void clear() {
+    getMutator().reset();
+
+    offsets.clear();
+    for(final ValueVector vector : getChildren()) {
+      vector.clear();
+    }
+  }
+}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedValueVector.java
new file mode 100644
index 0000000000000..99c0a0aeb1e2c
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedValueVector.java
@@ -0,0 +1,85 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector.complex;
+
+import org.apache.arrow.vector.UInt4Vector;
+import org.apache.arrow.vector.ValueVector;
+
+/**
+ * An abstraction representing repeated value vectors.
+ *
+ * A repeated vector contains values that may be either flat or nested. A value consists of zero
+ * or more cells (inner values). The current design maintains a data vector and an offsets vector.
+ * Each cell is stored in the data vector, and the repeated vector uses the offsets vector to
+ * determine the sequence of cells that belong to an individual value.
+ */
+public interface RepeatedValueVector extends ValueVector, ContainerVectorLike {
+
+  final static int DEFAULT_REPEAT_PER_RECORD = 5;
+
+  /**
+   * Returns the underlying offset vector, or null if none exists.
+   *
+   * TODO(DRILL-2995): eliminate exposing low-level interfaces.
+   */
+  UInt4Vector getOffsetVector();
+
+  /**
+   * Returns the underlying data vector, or null if none exists.
+   */
+  ValueVector getDataVector();
+
+  @Override
+  RepeatedAccessor getAccessor();
+
+  @Override
+  RepeatedMutator getMutator();
+
+  interface RepeatedAccessor extends ValueVector.Accessor {
+    /**
+     * Returns the total number of cells that the vector contains.
+     *
+     * The result includes empty and null-valued cells.
+     */
+    int getInnerValueCount();
+
+    /**
+     * Returns the number of cells that the value at the given index contains.
+     */
+    int getInnerValueCountAt(int index);
+
+    /**
+     * Returns true if the value at the given index is empty, false otherwise.
+     *
+     * @param index value index
+     */
+    boolean isEmpty(int index);
+  }
+
+  interface RepeatedMutator extends ValueVector.Mutator {
+    /**
+     * Starts a new value that is a container of cells.
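+     * As an illustration (based on the mutator implementations in this patch), writing the
+     * values [1, 2] and [3] involves startNewValue(0), two cell writes, startNewValue(1),
+     * and one more cell write, leaving the offsets vector as [0, 2, 3].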
+     *
+     * @param index index of new value to start
+     */
+    void startNewValue(int index);
+  }
+}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedVariableWidthVectorLike.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedVariableWidthVectorLike.java
new file mode 100644
index 0000000000000..93b744e108719
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedVariableWidthVectorLike.java
@@ -0,0 +1,35 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector.complex;
+
+public interface RepeatedVariableWidthVectorLike {
+  /**
+   * Allocate a new memory space for this vector. Must be called prior to using the ValueVector.
+   *
+   * @param totalBytes Desired size of the underlying data buffer.
+   * @param parentValueCount Number of separate repeating groupings.
+   * @param childValueCount Number of supported values in the vector.
+   */
+  void allocateNew(int totalBytes, int parentValueCount, int childValueCount);
+
+  /**
+   * Provides the maximum number of variable-width bytes that can be stored in this vector.
+   *
+   * @return the byte capacity of this vector
+   */
+  int getByteCapacity();
+}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/StateTool.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/StateTool.java
new file mode 100644
index 0000000000000..852c72c549729
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/StateTool.java
@@ -0,0 +1,34 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector.complex;
+
+import java.util.Arrays;
+
+public class StateTool {
+  static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(StateTool.class);
+
+  public static <T extends Enum<?>> void check(T currentState, T...
expectedStates) {
+    for (T s : expectedStates) {
+      if (s == currentState) {
+        return;
+      }
+    }
+    throw new IllegalArgumentException(String.format("Expected to be in one of these states %s but was actually in state %s", Arrays.toString(expectedStates), currentState));
+  }
+
+}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/VectorWithOrdinal.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/VectorWithOrdinal.java
new file mode 100644
index 0000000000000..d04fc1c022c05
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/VectorWithOrdinal.java
@@ -0,0 +1,30 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector.complex;
+
+import org.apache.arrow.vector.ValueVector;
+
+public class VectorWithOrdinal {
+  public final ValueVector vector;
+  public final int ordinal;
+
+  public VectorWithOrdinal(ValueVector v, int ordinal) {
+    this.vector = v;
+    this.ordinal = ordinal;
+  }
+}
\ No newline at end of file
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseReader.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseReader.java
new file mode 100644
index 0000000000000..264e241e73935
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseReader.java
@@ -0,0 +1,100 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */ +package org.apache.arrow.vector.complex.impl; + +import java.util.Iterator; + +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter; +import org.apache.arrow.vector.complex.writer.FieldWriter; +import org.apache.arrow.vector.holders.UnionHolder; +import org.apache.arrow.vector.types.MaterializedField; +import org.apache.arrow.vector.types.Types.DataMode; +import org.apache.arrow.vector.types.Types.MajorType; +import org.apache.arrow.vector.types.Types.MinorType; + + +abstract class AbstractBaseReader implements FieldReader{ + + static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(AbstractBaseReader.class); + private static final MajorType LATE_BIND_TYPE = new MajorType(MinorType.LATE, DataMode.OPTIONAL); + + private int index; + + public AbstractBaseReader() { + super(); + } + + public void setPosition(int index){ + this.index = index; + } + + int idx(){ + return index; + } + + @Override + public void reset() { + index = 0; + } + + @Override + public Iterator iterator() { + throw new IllegalStateException("The current reader doesn't support reading as a map."); + } + + public MajorType getType(){ + throw new IllegalStateException("The current reader doesn't support getting type information."); + } + + @Override + public MaterializedField getField() { + return MaterializedField.create("unknown", LATE_BIND_TYPE); + } + + @Override + public boolean next() { + throw new IllegalStateException("The current reader doesn't support getting next information."); + } + + @Override + public int size() { + throw new IllegalStateException("The current reader doesn't support getting size information."); + } + + @Override + public void read(UnionHolder holder) { + holder.reader = this; + holder.isSet = this.isSet() ? 1 : 0; + } + + @Override + public void read(int index, UnionHolder holder) { + throw new IllegalStateException("The current reader doesn't support reading union type"); + } + + @Override + public void copyAsValue(UnionWriter writer) { + throw new IllegalStateException("The current reader doesn't support reading union type"); + } + + @Override + public void copyAsValue(ListWriter writer) { + ComplexCopier.copy(this, (FieldWriter)writer); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseWriter.java new file mode 100644 index 0000000000000..4e1e103a12e7c --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseWriter.java @@ -0,0 +1,59 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector.complex.impl; + +import org.apache.arrow.vector.complex.writer.FieldWriter; + + +abstract class AbstractBaseWriter implements FieldWriter { + //private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(AbstractBaseWriter.class); + + final FieldWriter parent; + private int index; + + public AbstractBaseWriter(FieldWriter parent) { + this.parent = parent; + } + + @Override + public String toString() { + return super.toString() + "[index = " + index + ", parent = " + parent + "]"; + } + + @Override + public FieldWriter getParent() { + return parent; + } + + public boolean isRoot() { + return parent == null; + } + + int idx() { + return index; + } + + @Override + public void setPosition(int index) { + this.index = index; + } + + @Override + public void end() { + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java new file mode 100644 index 0000000000000..4e2051fd4efee --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java @@ -0,0 +1,193 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.complex.impl; + +import org.apache.arrow.vector.complex.MapVector; +import org.apache.arrow.vector.complex.StateTool; +import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter; +import org.apache.arrow.vector.types.MaterializedField; +import org.apache.arrow.vector.types.Types; +import org.apache.arrow.vector.types.Types.MinorType; + +import com.google.common.base.Preconditions; + +public class ComplexWriterImpl extends AbstractFieldWriter implements ComplexWriter { +// private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ComplexWriterImpl.class); + + private SingleMapWriter mapRoot; + private SingleListWriter listRoot; + private final MapVector container; + + Mode mode = Mode.INIT; + private final String name; + private final boolean unionEnabled; + + private enum Mode { INIT, MAP, LIST }; + + public ComplexWriterImpl(String name, MapVector container, boolean unionEnabled){ + super(null); + this.name = name; + this.container = container; + this.unionEnabled = unionEnabled; + } + + public ComplexWriterImpl(String name, MapVector container){ + this(name, container, false); + } + + @Override + public MaterializedField getField() { + return container.getField(); + } + + @Override + public int getValueCapacity() { + return container.getValueCapacity(); + } + + private void check(Mode... 
modes){ + StateTool.check(mode, modes); + } + + @Override + public void reset(){ + setPosition(0); + } + + @Override + public void close() throws Exception { + clear(); + mapRoot.close(); + if (listRoot != null) { + listRoot.close(); + } + } + + @Override + public void clear(){ + switch(mode){ + case MAP: + mapRoot.clear(); + break; + case LIST: + listRoot.clear(); + break; + } + } + + @Override + public void setValueCount(int count){ + switch(mode){ + case MAP: + mapRoot.setValueCount(count); + break; + case LIST: + listRoot.setValueCount(count); + break; + } + } + + @Override + public void setPosition(int index){ + super.setPosition(index); + switch(mode){ + case MAP: + mapRoot.setPosition(index); + break; + case LIST: + listRoot.setPosition(index); + break; + } + } + + + public MapWriter directMap(){ + Preconditions.checkArgument(name == null); + + switch(mode){ + + case INIT: + MapVector map = (MapVector) container; + mapRoot = new SingleMapWriter(map, this, unionEnabled); + mapRoot.setPosition(idx()); + mode = Mode.MAP; + break; + + case MAP: + break; + + default: + check(Mode.INIT, Mode.MAP); + } + + return mapRoot; + } + + @Override + public MapWriter rootAsMap() { + switch(mode){ + + case INIT: + MapVector map = container.addOrGet(name, Types.required(MinorType.MAP), MapVector.class); + mapRoot = new SingleMapWriter(map, this, unionEnabled); + mapRoot.setPosition(idx()); + mode = Mode.MAP; + break; + + case MAP: + break; + + default: + check(Mode.INIT, Mode.MAP); + } + + return mapRoot; + } + + + @Override + public void allocate() { + if(mapRoot != null) { + mapRoot.allocate(); + } else if(listRoot != null) { + listRoot.allocate(); + } + } + + @Override + public ListWriter rootAsList() { + switch(mode){ + + case INIT: + listRoot = new SingleListWriter(name, container, this); + listRoot.setPosition(idx()); + mode = Mode.LIST; + break; + + case LIST: + break; + + default: + check(Mode.INIT, Mode.MAP); + } + + return listRoot; + } + + +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/MapOrListWriterImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/MapOrListWriterImpl.java new file mode 100644 index 0000000000000..f8a9d4232aadc --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/MapOrListWriterImpl.java @@ -0,0 +1,112 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector.complex.impl; + +import org.apache.arrow.vector.complex.writer.BaseWriter; +import org.apache.arrow.vector.complex.writer.BaseWriter.MapOrListWriter; +import org.apache.arrow.vector.complex.writer.BigIntWriter; +import org.apache.arrow.vector.complex.writer.BitWriter; +import org.apache.arrow.vector.complex.writer.Float4Writer; +import org.apache.arrow.vector.complex.writer.Float8Writer; +import org.apache.arrow.vector.complex.writer.IntWriter; +import org.apache.arrow.vector.complex.writer.VarBinaryWriter; +import org.apache.arrow.vector.complex.writer.VarCharWriter; + +public class MapOrListWriterImpl implements MapOrListWriter { + + public final BaseWriter.MapWriter map; + public final BaseWriter.ListWriter list; + + public MapOrListWriterImpl(final BaseWriter.MapWriter writer) { + this.map = writer; + this.list = null; + } + + public MapOrListWriterImpl(final BaseWriter.ListWriter writer) { + this.map = null; + this.list = writer; + } + + public void start() { + if (map != null) { + map.start(); + } else { + list.startList(); + } + } + + public void end() { + if (map != null) { + map.end(); + } else { + list.endList(); + } + } + + public MapOrListWriter map(final String name) { + assert map != null; + return new MapOrListWriterImpl(map.map(name)); + } + + public MapOrListWriter listoftmap(final String name) { + assert list != null; + return new MapOrListWriterImpl(list.map()); + } + + public MapOrListWriter list(final String name) { + assert map != null; + return new MapOrListWriterImpl(map.list(name)); + } + + public boolean isMapWriter() { + return map != null; + } + + public boolean isListWriter() { + return list != null; + } + + public VarCharWriter varChar(final String name) { + return (map != null) ? map.varChar(name) : list.varChar(); + } + + public IntWriter integer(final String name) { + return (map != null) ? map.integer(name) : list.integer(); + } + + public BigIntWriter bigInt(final String name) { + return (map != null) ? map.bigInt(name) : list.bigInt(); + } + + public Float4Writer float4(final String name) { + return (map != null) ? map.float4(name) : list.float4(); + } + + public Float8Writer float8(final String name) { + return (map != null) ? map.float8(name) : list.float8(); + } + + public BitWriter bit(final String name) { + return (map != null) ? map.bit(name) : list.bit(); + } + + public VarBinaryWriter binary(final String name) { + return (map != null) ? map.varBinary(name) : list.varBinary(); + } + +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java new file mode 100644 index 0000000000000..ea62e02360802 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java @@ -0,0 +1,196 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + *
+ * http://www.apache.org/licenses/LICENSE-2.0 + *
+ * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.complex.impl; + +import java.lang.reflect.Constructor; + +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.VectorDescriptor; +import org.apache.arrow.vector.ZeroVector; +import org.apache.arrow.vector.complex.AbstractMapVector; +import org.apache.arrow.vector.complex.ListVector; +import org.apache.arrow.vector.complex.UnionVector; +import org.apache.arrow.vector.complex.writer.FieldWriter; +import org.apache.arrow.vector.types.MaterializedField; +import org.apache.arrow.vector.types.Types.DataMode; +import org.apache.arrow.vector.types.Types.MajorType; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.util.BasicTypeHelper; +import org.apache.arrow.vector.util.TransferPair; + +/** + * This FieldWriter implementation delegates all FieldWriter API calls to an inner FieldWriter. This inner field writer + * can start as a specific type, and this class will promote the writer to a UnionWriter if a call is made that the specifically + * typed writer cannot handle. A new UnionVector is created, wrapping the original vector, and replaces the original vector + * in the parent vector, which can be either an AbstractMapVector or a ListVector. + */ +public class PromotableWriter extends AbstractPromotableFieldWriter { + + private final AbstractMapVector parentContainer; + private final ListVector listVector; + private int position; + + private enum State { + UNTYPED, SINGLE, UNION + } + + private MinorType type; + private ValueVector vector; + private UnionVector unionVector; + private State state; + private FieldWriter writer; + + public PromotableWriter(ValueVector v, AbstractMapVector parentContainer) { + super(null); + this.parentContainer = parentContainer; + this.listVector = null; + init(v); + } + + public PromotableWriter(ValueVector v, ListVector listVector) { + super(null); + this.listVector = listVector; + this.parentContainer = null; + init(v); + } + + private void init(ValueVector v) { + if (v instanceof UnionVector) { + state = State.UNION; + unionVector = (UnionVector) v; + writer = new UnionWriter(unionVector); + } else if (v instanceof ZeroVector) { + state = State.UNTYPED; + } else { + setWriter(v); + } + } + + private void setWriter(ValueVector v) { + state = State.SINGLE; + vector = v; + type = v.getField().getType().getMinorType(); + Class writerClass = BasicTypeHelper + .getWriterImpl(v.getField().getType().getMinorType(), v.getField().getDataMode()); + if (writerClass.equals(SingleListWriter.class)) { + writerClass = UnionListWriter.class; + } + Class vectorClass = BasicTypeHelper.getValueVectorClass(v.getField().getType().getMinorType(), v.getField() + .getDataMode()); + try { + Constructor constructor = null; + for (Constructor c : writerClass.getConstructors()) { + if (c.getParameterTypes().length == 3) { + constructor = c; + } + } + if (constructor == null) { + constructor = writerClass.getConstructor(vectorClass, AbstractFieldWriter.class); + writer = (FieldWriter) constructor.newInstance(vector, null); + } else { + writer = (FieldWriter) constructor.newInstance(vector, null, true); + } + } catch (ReflectiveOperationException e) { + throw new 
RuntimeException(e); + } + } + + @Override + public void setPosition(int index) { + super.setPosition(index); + FieldWriter w = getWriter(); + if (w == null) { + position = index; + } else { + w.setPosition(index); + } + } + + protected FieldWriter getWriter(MinorType type) { + if (state == State.UNION) { + return writer; + } + if (state == State.UNTYPED) { + if (type == null) { + return null; + } + ValueVector v = listVector.addOrGetVector(new VectorDescriptor(new MajorType(type, DataMode.OPTIONAL))).getVector(); + v.allocateNew(); + setWriter(v); + writer.setPosition(position); + } + if (type != this.type) { + return promoteToUnion(); + } + return writer; + } + + @Override + public boolean isEmptyMap() { + return writer.isEmptyMap(); + } + + protected FieldWriter getWriter() { + return getWriter(type); + } + + private FieldWriter promoteToUnion() { + String name = vector.getField().getLastName(); + TransferPair tp = vector.getTransferPair(vector.getField().getType().getMinorType().name().toLowerCase(), vector.getAllocator()); + tp.transfer(); + if (parentContainer != null) { + unionVector = parentContainer.addOrGet(name, new MajorType(MinorType.UNION, DataMode.OPTIONAL), UnionVector.class); + } else if (listVector != null) { + unionVector = listVector.promoteToUnion(); + } + unionVector.addVector(tp.getTo()); + writer = new UnionWriter(unionVector); + writer.setPosition(idx()); + for (int i = 0; i < idx(); i++) { + unionVector.getMutator().setType(i, vector.getField().getType().getMinorType()); + } + vector = null; + state = State.UNION; + return writer; + } + + @Override + public void allocate() { + getWriter().allocate(); + } + + @Override + public void clear() { + getWriter().clear(); + } + + @Override + public MaterializedField getField() { + return getWriter().getField(); + } + + @Override + public int getValueCapacity() { + return getWriter().getValueCapacity(); + } + + @Override + public void close() throws Exception { + getWriter().close(); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/RepeatedListReaderImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/RepeatedListReaderImpl.java new file mode 100644 index 0000000000000..dd1a152e2f603 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/RepeatedListReaderImpl.java @@ -0,0 +1,145 @@ +/******************************************************************************* + + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
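The class above is the mechanism that lets schema-less writes start out cheap and only pay for a union when mixed types actually appear. A minimal sketch of that behaviour, assuming an allocated ListVector named listVector; the scalar methods bigInt() and float8() come from the AbstractPromotableFieldWriter base, which is not part of this diff:

    PromotableWriter writer = new PromotableWriter(listVector.getDataVector(), listVector);
    writer.setPosition(0);
    writer.bigInt().writeBigInt(42L);   // first write binds the inner writer to BIGINT
    writer.setPosition(1);
    writer.float8().writeFloat8(3.5d);  // different type: promoteToUnion() transfers the
                                        // BIGINT vector into a new UnionVector and swaps
                                        // the union into the parent in place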
+ ******************************************************************************/ +package org.apache.arrow.vector.complex.impl; + + +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.complex.RepeatedListVector; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter; +import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; +import org.apache.arrow.vector.holders.RepeatedListHolder; +import org.apache.arrow.vector.types.Types.DataMode; +import org.apache.arrow.vector.types.Types.MajorType; +import org.apache.arrow.vector.types.Types.MinorType; + +public class RepeatedListReaderImpl extends AbstractFieldReader{ + private static final int NO_VALUES = Integer.MAX_VALUE - 1; + private static final MajorType TYPE = new MajorType(MinorType.LIST, DataMode.REPEATED); + private final String name; + private final RepeatedListVector container; + private FieldReader reader; + + public RepeatedListReaderImpl(String name, RepeatedListVector container) { + super(); + this.name = name; + this.container = container; + } + + @Override + public MajorType getType() { + return TYPE; + } + + @Override + public void copyAsValue(ListWriter writer) { + if (currentOffset == NO_VALUES) { + return; + } + RepeatedListWriter impl = (RepeatedListWriter) writer; + impl.container.copyFromSafe(idx(), impl.idx(), container); + } + + @Override + public void copyAsField(String name, MapWriter writer) { + if (currentOffset == NO_VALUES) { + return; + } + RepeatedListWriter impl = (RepeatedListWriter) writer.list(name); + impl.container.copyFromSafe(idx(), impl.idx(), container); + } + + private int currentOffset; + private int maxOffset; + + @Override + public void reset() { + super.reset(); + currentOffset = 0; + maxOffset = 0; + if (reader != null) { + reader.reset(); + } + reader = null; + } + + @Override + public int size() { + return maxOffset - currentOffset; + } + + @Override + public void setPosition(int index) { + if (index < 0 || index == NO_VALUES) { + currentOffset = NO_VALUES; + return; + } + + super.setPosition(index); + RepeatedListHolder h = new RepeatedListHolder(); + container.getAccessor().get(index, h); + if (h.start == h.end) { + currentOffset = NO_VALUES; + } else { + currentOffset = h.start-1; + maxOffset = h.end; + if(reader != null) { + reader.setPosition(currentOffset); + } + } + } + + @Override + public boolean next() { + if (currentOffset +1 < maxOffset) { + currentOffset++; + if (reader != null) { + reader.setPosition(currentOffset); + } + return true; + } else { + currentOffset = NO_VALUES; + return false; + } + } + + @Override + public Object readObject() { + return container.getAccessor().getObject(idx()); + } + + @Override + public FieldReader reader() { + if (reader == null) { + ValueVector child = container.getChild(name); + if (child == null) { + reader = NullReader.INSTANCE; + } else { + reader = child.getReader(); + } + reader.setPosition(currentOffset); + } + return reader; + } + + public boolean isSet() { + return true; + } + +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/RepeatedMapReaderImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/RepeatedMapReaderImpl.java new file mode 100644 index 0000000000000..09a831d8329fc --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/RepeatedMapReaderImpl.java @@ -0,0 +1,192 @@ 
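RepeatedListReaderImpl drives iteration with a half-open offset range: setPosition(i) loads the element range for row i (using NO_VALUES as an empty-row sentinel) and next() steps the child reader through it. A hedged usage sketch, assuming a populated RepeatedListVector and a made-up child field name:

    RepeatedListReaderImpl listReader = new RepeatedListReaderImpl("inner", repeatedListVector);
    listReader.setPosition(rowIndex);            // loads [start, end) for this row
    while (listReader.next()) {                  // false once the row is exhausted
      Object element = listReader.reader().readObject();
    }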
+/******************************************************************************* + + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + ******************************************************************************/ +package org.apache.arrow.vector.complex.impl; + +import java.util.Map; + +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.complex.RepeatedMapVector; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; +import org.apache.arrow.vector.holders.RepeatedMapHolder; +import org.apache.arrow.vector.types.Types.MajorType; + +import com.google.common.collect.Maps; + +@SuppressWarnings("unused") +public class RepeatedMapReaderImpl extends AbstractFieldReader{ + private static final int NO_VALUES = Integer.MAX_VALUE - 1; + + private final RepeatedMapVector vector; + private final Map fields = Maps.newHashMap(); + + public RepeatedMapReaderImpl(RepeatedMapVector vector) { + this.vector = vector; + } + + private void setChildrenPosition(int index) { + for (FieldReader r : fields.values()) { + r.setPosition(index); + } + } + + @Override + public FieldReader reader(String name) { + FieldReader reader = fields.get(name); + if (reader == null) { + ValueVector child = vector.getChild(name); + if (child == null) { + reader = NullReader.INSTANCE; + } else { + reader = child.getReader(); + } + fields.put(name, reader); + reader.setPosition(currentOffset); + } + return reader; + } + + @Override + public FieldReader reader() { + if (currentOffset == NO_VALUES) { + return NullReader.INSTANCE; + } + + setChildrenPosition(currentOffset); + return new SingleLikeRepeatedMapReaderImpl(vector, this); + } + + private int currentOffset; + private int maxOffset; + + @Override + public void reset() { + super.reset(); + currentOffset = 0; + maxOffset = 0; + for (FieldReader reader:fields.values()) { + reader.reset(); + } + fields.clear(); + } + + @Override + public int size() { + if (isNull()) { + return 0; + } + return maxOffset - (currentOffset < 0 ? 
0 : currentOffset); + } + + @Override + public void setPosition(int index) { + if (index < 0 || index == NO_VALUES) { + currentOffset = NO_VALUES; + return; + } + + super.setPosition(index); + RepeatedMapHolder h = new RepeatedMapHolder(); + vector.getAccessor().get(index, h); + if (h.start == h.end) { + currentOffset = NO_VALUES; + } else { + currentOffset = h.start-1; + maxOffset = h.end; + setChildrenPosition(currentOffset); + } + } + + public void setSinglePosition(int index, int childIndex) { + super.setPosition(index); + RepeatedMapHolder h = new RepeatedMapHolder(); + vector.getAccessor().get(index, h); + if (h.start == h.end) { + currentOffset = NO_VALUES; + } else { + int singleOffset = h.start + childIndex; + assert singleOffset < h.end; + currentOffset = singleOffset; + maxOffset = singleOffset + 1; + setChildrenPosition(singleOffset); + } + } + + @Override + public boolean next() { + if (currentOffset +1 < maxOffset) { + setChildrenPosition(++currentOffset); + return true; + } else { + currentOffset = NO_VALUES; + return false; + } + } + + public boolean isNull() { + return currentOffset == NO_VALUES; + } + + @Override + public Object readObject() { + return vector.getAccessor().getObject(idx()); + } + + @Override + public MajorType getType() { + return vector.getField().getType(); + } + + @Override + public java.util.Iterator iterator() { + return vector.fieldNameIterator(); + } + + @Override + public boolean isSet() { + return true; + } + + @Override + public void copyAsValue(MapWriter writer) { + if (currentOffset == NO_VALUES) { + return; + } + RepeatedMapWriter impl = (RepeatedMapWriter) writer; + impl.container.copyFromSafe(idx(), impl.idx(), vector); + } + + public void copyAsValueSingle(MapWriter writer) { + if (currentOffset == NO_VALUES) { + return; + } + SingleMapWriter impl = (SingleMapWriter) writer; + impl.container.copyFromSafe(currentOffset, impl.idx(), vector); + } + + @Override + public void copyAsField(String name, MapWriter writer) { + if (currentOffset == NO_VALUES) { + return; + } + RepeatedMapWriter impl = (RepeatedMapWriter) writer.map(name); + impl.container.copyFromSafe(idx(), impl.idx(), vector); + } + +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleLikeRepeatedMapReaderImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleLikeRepeatedMapReaderImpl.java new file mode 100644 index 0000000000000..086d26e119440 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleLikeRepeatedMapReaderImpl.java @@ -0,0 +1,89 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
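The repeated-map reader follows the same offset-range protocol and additionally caches one child reader per field name. A sketch, assuming a populated RepeatedMapVector whose maps contain a field called "name" (both variable and field names are made up):

    RepeatedMapReaderImpl mapReader = new RepeatedMapReaderImpl(repeatedMapVector);
    mapReader.setPosition(rowIndex);
    while (mapReader.next()) {                          // one iteration per map in the row
      Object v = mapReader.reader("name").readObject(); // child readers cached per field name
    }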
+ */ + +package org.apache.arrow.vector.complex.impl; + +import java.util.Iterator; + +import org.apache.arrow.vector.complex.RepeatedMapVector; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; +import org.apache.arrow.vector.types.Types; +import org.apache.arrow.vector.types.Types.MajorType; +import org.apache.arrow.vector.types.Types.MinorType; + +public class SingleLikeRepeatedMapReaderImpl extends AbstractFieldReader{ + + private RepeatedMapReaderImpl delegate; + + public SingleLikeRepeatedMapReaderImpl(RepeatedMapVector vector, FieldReader delegate) { + this.delegate = (RepeatedMapReaderImpl) delegate; + } + + @Override + public int size() { + throw new UnsupportedOperationException("You can't call size on a single map reader."); + } + + @Override + public boolean next() { + throw new UnsupportedOperationException("You can't call next on a single map reader."); + } + + @Override + public MajorType getType() { + return Types.required(MinorType.MAP); + } + + + @Override + public void copyAsValue(MapWriter writer) { + delegate.copyAsValueSingle(writer); + } + + public void copyAsValueSingle(MapWriter writer){ + delegate.copyAsValueSingle(writer); + } + + @Override + public FieldReader reader(String name) { + return delegate.reader(name); + } + + @Override + public void setPosition(int index) { + delegate.setPosition(index); + } + + @Override + public Object readObject() { + return delegate.readObject(); + } + + @Override + public Iterator iterator() { + return delegate.iterator(); + } + + @Override + public boolean isSet() { + return ! delegate.isNull(); + } + + +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleListReaderImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleListReaderImpl.java new file mode 100644 index 0000000000000..f16f628603d69 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleListReaderImpl.java @@ -0,0 +1,88 @@ + +/******************************************************************************* + + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
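This adapter presents exactly one element of a repeated map as if it were a plain map, which is why the repeated-only operations are forbidden. A short sketch, continuing from the previous example:

    FieldReader single = mapReader.reader();        // one element viewed as a single map
    Object v = single.reader("name").readObject();  // hypothetical child field
    // single.size() and single.next() deliberately throw UnsupportedOperationException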
+ ******************************************************************************/ +package org.apache.arrow.vector.complex.impl; + + +import org.apache.arrow.vector.complex.AbstractContainerVector; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter; +import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; +import org.apache.arrow.vector.types.Types; +import org.apache.arrow.vector.types.Types.MajorType; +import org.apache.arrow.vector.types.Types.MinorType; + +@SuppressWarnings("unused") +public class SingleListReaderImpl extends AbstractFieldReader{ + + private static final MajorType TYPE = Types.optional(MinorType.LIST); + private final String name; + private final AbstractContainerVector container; + private FieldReader reader; + + public SingleListReaderImpl(String name, AbstractContainerVector container) { + super(); + this.name = name; + this.container = container; + } + + @Override + public MajorType getType() { + return TYPE; + } + + + @Override + public void setPosition(int index) { + super.setPosition(index); + if (reader != null) { + reader.setPosition(index); + } + } + + @Override + public Object readObject() { + return reader.readObject(); + } + + @Override + public FieldReader reader() { + if (reader == null) { + reader = container.getChild(name).getReader(); + setPosition(idx()); + } + return reader; + } + + @Override + public boolean isSet() { + return false; + } + + @Override + public void copyAsValue(ListWriter writer) { + throw new UnsupportedOperationException("Generic list copying not yet supported. Please resolve to typed list."); + } + + @Override + public void copyAsField(String name, MapWriter writer) { + throw new UnsupportedOperationException("Generic list copying not yet supported. Please resolve to typed list."); + } + +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleMapReaderImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleMapReaderImpl.java new file mode 100644 index 0000000000000..84b99801419c4 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleMapReaderImpl.java @@ -0,0 +1,108 @@ + + +/******************************************************************************* + + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
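SingleListReaderImpl resolves its child reader lazily by name on the first reader() call. A sketch, assuming an AbstractContainerVector holding a list child named "prices" (both names assumed):

    SingleListReaderImpl listReader = new SingleListReaderImpl("prices", containerVector);
    listReader.setPosition(3);
    Object value = listReader.reader().readObject();  // reader() resolves the child and
                                                      // re-applies the current position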
+ ******************************************************************************/ +package org.apache.arrow.vector.complex.impl; + + +import java.util.Map; + +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.complex.MapVector; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; +import org.apache.arrow.vector.types.Types.MajorType; + +import com.google.common.collect.Maps; + +@SuppressWarnings("unused") +public class SingleMapReaderImpl extends AbstractFieldReader{ + + private final MapVector vector; + private final Map fields = Maps.newHashMap(); + + public SingleMapReaderImpl(MapVector vector) { + this.vector = vector; + } + + private void setChildrenPosition(int index){ + for(FieldReader r : fields.values()){ + r.setPosition(index); + } + } + + @Override + public FieldReader reader(String name){ + FieldReader reader = fields.get(name); + if(reader == null){ + ValueVector child = vector.getChild(name); + if(child == null){ + reader = NullReader.INSTANCE; + }else{ + reader = child.getReader(); + } + fields.put(name, reader); + reader.setPosition(idx()); + } + return reader; + } + + @Override + public void setPosition(int index){ + super.setPosition(index); + for(FieldReader r : fields.values()){ + r.setPosition(index); + } + } + + @Override + public Object readObject() { + return vector.getAccessor().getObject(idx()); + } + + @Override + public boolean isSet() { + return true; + } + + @Override + public MajorType getType(){ + return vector.getField().getType(); + } + + @Override + public java.util.Iterator iterator(){ + return vector.fieldNameIterator(); + } + + @Override + public void copyAsValue(MapWriter writer){ + SingleMapWriter impl = (SingleMapWriter) writer; + impl.container.copyFromSafe(idx(), impl.idx(), vector); + } + + @Override + public void copyAsField(String name, MapWriter writer){ + SingleMapWriter impl = (SingleMapWriter) writer.map(name); + impl.container.copyFromSafe(idx(), impl.idx(), vector); + } + + +} + diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionListReader.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionListReader.java new file mode 100644 index 0000000000000..9b54d02e571de --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionListReader.java @@ -0,0 +1,98 @@ +/******************************************************************************* + + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
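The single-map reader keeps a fields map so each child reader is looked up once and then repositioned on every setPosition(). A sketch with an assumed MapVector and a made-up child field:

    SingleMapReaderImpl mapReader = new SingleMapReaderImpl(mapVector);
    mapReader.setPosition(row);
    Object city = mapReader.reader("city").readObject();  // cached after the first lookup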
+ ******************************************************************************/ +package org.apache.arrow.vector.complex.impl; + +import org.apache.arrow.vector.UInt4Vector; +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.complex.ListVector; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter; +import org.apache.arrow.vector.complex.writer.FieldWriter; +import org.apache.arrow.vector.holders.UnionHolder; +import org.apache.arrow.vector.types.Types.DataMode; +import org.apache.arrow.vector.types.Types.MajorType; +import org.apache.arrow.vector.types.Types.MinorType; + +public class UnionListReader extends AbstractFieldReader { + + private ListVector vector; + private ValueVector data; + private UInt4Vector offsets; + + public UnionListReader(ListVector vector) { + this.vector = vector; + this.data = vector.getDataVector(); + this.offsets = vector.getOffsetVector(); + } + + @Override + public boolean isSet() { + return true; + } + + MajorType type = new MajorType(MinorType.LIST, DataMode.OPTIONAL); + + public MajorType getType() { + return type; + } + + private int currentOffset; + private int maxOffset; + + @Override + public void setPosition(int index) { + super.setPosition(index); + currentOffset = offsets.getAccessor().get(index) - 1; + maxOffset = offsets.getAccessor().get(index + 1); + } + + @Override + public FieldReader reader() { + return data.getReader(); + } + + @Override + public Object readObject() { + return vector.getAccessor().getObject(idx()); + } + + @Override + public void read(int index, UnionHolder holder) { + setPosition(idx()); + for (int i = -1; i < index; i++) { + next(); + } + holder.reader = data.getReader(); + holder.isSet = data.getReader().isSet() ? 1 : 0; + } + + @Override + public boolean next() { + if (currentOffset + 1 < maxOffset) { + data.getReader().setPosition(++currentOffset); + return true; + } else { + return false; + } + } + + public void copyAsValue(ListWriter writer) { + ComplexCopier.copy(this, (FieldWriter) writer); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/reader/FieldReader.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/reader/FieldReader.java new file mode 100644 index 0000000000000..c4eb3dc739a49 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/reader/FieldReader.java @@ -0,0 +1,29 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
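UnionListReader reads the offset vector directly: setPosition(i) establishes the range [offsets[i], offsets[i+1]) and next() walks the inner data reader through it. A sketch against an assumed, populated ListVector:

    UnionListReader listReader = new UnionListReader(listVector);
    listReader.setPosition(row);
    while (listReader.next()) {
      Object element = listReader.reader().readObject();  // inner data vector's reader
    }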
+ */ +package org.apache.arrow.vector.complex.reader; + +import org.apache.arrow.vector.complex.reader.BaseReader.ListReader; +import org.apache.arrow.vector.complex.reader.BaseReader.MapReader; +import org.apache.arrow.vector.complex.reader.BaseReader.RepeatedListReader; +import org.apache.arrow.vector.complex.reader.BaseReader.RepeatedMapReader; +import org.apache.arrow.vector.complex.reader.BaseReader.ScalarReader; + + + +public interface FieldReader extends MapReader, ListReader, ScalarReader, RepeatedMapReader, RepeatedListReader { +} \ No newline at end of file diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/writer/FieldWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/writer/FieldWriter.java new file mode 100644 index 0000000000000..ecffe0bec0e84 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/writer/FieldWriter.java @@ -0,0 +1,27 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.complex.writer; + +import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter; +import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; +import org.apache.arrow.vector.complex.writer.BaseWriter.ScalarWriter; + +public interface FieldWriter extends MapWriter, ListWriter, ScalarWriter { + void allocate(); + void clear(); +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/holders/ComplexHolder.java b/java/vector/src/main/java/org/apache/arrow/vector/holders/ComplexHolder.java new file mode 100644 index 0000000000000..0f9310da55b79 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/holders/ComplexHolder.java @@ -0,0 +1,25 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
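Because FieldReader folds the map, list, scalar and repeated reader contracts into one interface, generic code can traverse any vector shape without knowing its type up front. A hedged sketch of such a helper:

    // Works for scalars, maps and lists alike; readObject() boxes the value.
    static void printValue(FieldReader reader, int index) {
      reader.setPosition(index);
      System.out.println(reader.readObject());
    }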
+ */ +package org.apache.arrow.vector.holders; + +import org.apache.arrow.vector.complex.reader.FieldReader; + +public class ComplexHolder implements ValueHolder { + public FieldReader reader; + public int isSet; +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/holders/ObjectHolder.java b/java/vector/src/main/java/org/apache/arrow/vector/holders/ObjectHolder.java new file mode 100644 index 0000000000000..5a5fe0305d83a --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/holders/ObjectHolder.java @@ -0,0 +1,38 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.arrow.vector.holders; + +import org.apache.arrow.vector.types.Types; + +/* + * Holder class for the vector ObjectVector. This holder internally stores a + * reference to an object. The ObjectVector maintains an array of these objects. + * This holder can be used only as workspace variables in aggregate functions. + * Using this holder should be avoided and we should stick to native holder types. + */ +@Deprecated +public class ObjectHolder implements ValueHolder { + public static final Types.MajorType TYPE = Types.required(Types.MinorType.GENERIC_OBJECT); + + public Types.MajorType getType() { + return TYPE; + } + + public Object obj; +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/holders/RepeatedListHolder.java b/java/vector/src/main/java/org/apache/arrow/vector/holders/RepeatedListHolder.java new file mode 100644 index 0000000000000..83506cdc17549 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/holders/RepeatedListHolder.java @@ -0,0 +1,23 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector.holders; + +public final class RepeatedListHolder implements ValueHolder{ + public int start; + public int end; +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/holders/RepeatedMapHolder.java b/java/vector/src/main/java/org/apache/arrow/vector/holders/RepeatedMapHolder.java new file mode 100644 index 0000000000000..85d782b381835 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/holders/RepeatedMapHolder.java @@ -0,0 +1,23 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.holders; + +public final class RepeatedMapHolder implements ValueHolder{ + public int start; + public int end; +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/holders/UnionHolder.java b/java/vector/src/main/java/org/apache/arrow/vector/holders/UnionHolder.java new file mode 100644 index 0000000000000..b868a620f985b --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/holders/UnionHolder.java @@ -0,0 +1,37 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
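RepeatedListHolder and RepeatedMapHolder are deliberately bare structs: accessors fill in a half-open [start, end) range and callers derive everything else, exactly as RepeatedListReaderImpl.setPosition() does above. A sketch, assuming a populated RepeatedListVector:

    RepeatedListHolder h = new RepeatedListHolder();
    repeatedListVector.getAccessor().get(index, h);  // fills h.start and h.end
    int length = h.end - h.start;                    // start == end means an empty list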
+ */ +package org.apache.arrow.vector.holders; + +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.types.Types.DataMode; +import org.apache.arrow.vector.types.Types.MajorType; +import org.apache.arrow.vector.types.Types.MinorType; + +public class UnionHolder implements ValueHolder { + public static final MajorType TYPE = new MajorType(MinorType.UNION, DataMode.OPTIONAL); + public FieldReader reader; + public int isSet; + + public MajorType getType() { + return reader.getType(); + } + + public boolean isSet() { + return isSet == 1; + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/holders/ValueHolder.java b/java/vector/src/main/java/org/apache/arrow/vector/holders/ValueHolder.java new file mode 100644 index 0000000000000..88cbcd4a8c308 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/holders/ValueHolder.java @@ -0,0 +1,31 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.holders; + +/** + * Wrapper object for an individual value in Drill. + * + * ValueHolders are designed to be mutable wrapper objects for defining clean + * APIs that access data in Drill. For performance, object creation is avoided + * at all costs throughout execution. For this reason, ValueHolders are + * disallowed from implementing any methods, this allows for them to be + * replaced by their java primitive inner members during optimization of + * run-time generated code. + */ +public interface ValueHolder { +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/MaterializedField.java b/java/vector/src/main/java/org/apache/arrow/vector/types/MaterializedField.java new file mode 100644 index 0000000000000..c73098b2a85d7 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/MaterializedField.java @@ -0,0 +1,217 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
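UnionHolder pairs a FieldReader with an isSet flag, which is how UnionListReader.read(int, UnionHolder) hands a single element back to the caller. A sketch continuing from the UnionListReader example:

    UnionHolder holder = new UnionHolder();
    listReader.read(2, holder);            // position on the third element of the list
    if (holder.isSet()) {
      Object value = holder.reader.readObject();
    }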
+ */ +package org.apache.arrow.vector.types; + +import java.util.ArrayList; +import java.util.Collection; +import java.util.Iterator; +import java.util.LinkedHashSet; +import java.util.Objects; + +import org.apache.arrow.vector.types.Types.DataMode; +import org.apache.arrow.vector.types.Types.MajorType; +import org.apache.arrow.vector.util.BasicTypeHelper; + + +public class MaterializedField { + private final String name; + private final MajorType type; + // use an ordered set as existing code relies on order (e,g. parquet writer) + private final LinkedHashSet children; + + MaterializedField(String name, MajorType type, LinkedHashSet children) { + this.name = name; + this.type = type; + this.children = children; + } + + public Collection getChildren() { + return new ArrayList<>(children); + } + + public MaterializedField newWithChild(MaterializedField child) { + MaterializedField newField = clone(); + newField.addChild(child); + return newField; + } + + public void addChild(MaterializedField field){ + children.add(field); + } + + public MaterializedField clone() { + return withPathAndType(name, getType()); + } + + public MaterializedField withType(MajorType type) { + return withPathAndType(name, type); + } + + public MaterializedField withPath(String name) { + return withPathAndType(name, getType()); + } + + public MaterializedField withPathAndType(String name, final MajorType type) { + final LinkedHashSet newChildren = new LinkedHashSet<>(children.size()); + for (final MaterializedField child:children) { + newChildren.add(child.clone()); + } + return new MaterializedField(name, type, newChildren); + } + +// public String getLastName(){ +// PathSegment seg = key.path.getRootSegment(); +// while (seg.getChild() != null) { +// seg = seg.getChild(); +// } +// return seg.getNameSegment().getPath(); +// } + + + // TODO: rewrite without as direct match rather than conversion then match. 
+// public boolean matches(SerializedField booleanfield){ +// MaterializedField f = create(field); +// return f.equals(this); +// } + + public static MaterializedField create(String name, MajorType type){ + return new MaterializedField(name, type, new LinkedHashSet()); + } + +// public String getName(){ +// StringBuilder sb = new StringBuilder(); +// boolean first = true; +// for(NamePart np : def.getNameList()){ +// if(np.getType() == Type.ARRAY){ +// sb.append("[]"); +// }else{ +// if(first){ +// first = false; +// }else{ +// sb.append("."); +// } +// sb.append('`'); +// sb.append(np.getName()); +// sb.append('`'); +// +// } +// } +// return sb.toString(); +// } + + public String getPath() { + return getName(); + } + + public String getLastName() { + return getName(); + } + + public String getName() { + return name; + } + +// public int getWidth() { +// return type.getWidth(); +// } + + public MajorType getType() { + return type; + } + + public int getScale() { + return type.getScale(); + } + public int getPrecision() { + return type.getPrecision(); + } + public boolean isNullable() { + return type.getMode() == DataMode.OPTIONAL; + } + + public DataMode getDataMode() { + return type.getMode(); + } + + public MaterializedField getOtherNullableVersion(){ + MajorType mt = type; + DataMode newDataMode = null; + switch (mt.getMode()){ + case OPTIONAL: + newDataMode = DataMode.REQUIRED; + break; + case REQUIRED: + newDataMode = DataMode.OPTIONAL; + break; + default: + throw new UnsupportedOperationException(); + } + return new MaterializedField(name, new MajorType(mt.getMinorType(), newDataMode, mt.getPrecision(), mt.getScale(), mt.getTimezone(), mt.getSubTypes()), children); + } + + public Class getValueClass() { + return BasicTypeHelper.getValueVectorClass(getType().getMinorType(), getDataMode()); + } + + @Override + public int hashCode() { + return Objects.hash(this.name, this.type, this.children); + } + + @Override + public boolean equals(Object obj) { + if (this == obj) { + return true; + } + if (obj == null) { + return false; + } + if (getClass() != obj.getClass()) { + return false; + } + MaterializedField other = (MaterializedField) obj; + // DRILL-1872: Compute equals only on key. See also the comment + // in MapVector$MapTransferPair + + return this.name.equalsIgnoreCase(other.name) && + Objects.equals(this.type, other.type); + } + + + @Override + public String toString() { + final int maxLen = 10; + String childStr = children != null && !children.isEmpty() ? toString(children, maxLen) : ""; + return name + "(" + type.getMinorType().name() + ":" + type.getMode().name() + ")" + childStr; + } + + + private String toString(Collection collection, int maxLen) { + StringBuilder builder = new StringBuilder(); + builder.append("["); + int i = 0; + for (Iterator iterator = collection.iterator(); iterator.hasNext() && i < maxLen; i++) { + if (i > 0){ + builder.append(", "); + } + builder.append(iterator.next()); + } + builder.append("]"); + return builder.toString(); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java new file mode 100644 index 0000000000000..cef892ce88030 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -0,0 +1,132 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. 
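MaterializedField's equality is intentionally narrow: per the DRILL-1872 note, equals() compares only the case-insensitive name and the type, ignoring children. A small sketch (imports from org.apache.arrow.vector.types assumed):

    MaterializedField amount = MaterializedField.create("amount", Types.optional(MinorType.FLOAT8));
    MaterializedField row = MaterializedField.create("row", Types.required(MinorType.MAP))
        .newWithChild(amount);                       // returns a clone with the child added
    boolean same = row.equals(row.withPath("ROW"));  // true: name compared ignoring case,
                                                     // children not consulted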
See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + *
+ * http://www.apache.org/licenses/LICENSE-2.0 + *
+ * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.types; + +import java.util.ArrayList; +import java.util.List; +import java.util.Map; + +public class Types { + public enum MinorType { + LATE, // late binding type + MAP, // an empty map column. Useful for conceptual setup. Children listed within here + + TINYINT, // single byte signed integer + SMALLINT, // two byte signed integer + INT, // four byte signed integer + BIGINT, // eight byte signed integer + DECIMAL9, // a decimal supporting precision between 1 and 9 + DECIMAL18, // a decimal supporting precision between 10 and 18 + DECIMAL28SPARSE, // a decimal supporting precision between 19 and 28 + DECIMAL38SPARSE, // a decimal supporting precision between 29 and 38 + MONEY, // signed decimal with two digit precision + DATE, // days since 4713bc + TIME, // time in micros before or after 2000/1/1 + TIMETZ, // time in micros before or after 2000/1/1 with timezone + TIMESTAMPTZ, // unix epoch time in millis + TIMESTAMP, // TBD + INTERVAL, // TBD + FLOAT4, // 4 byte ieee 754 + FLOAT8, // 8 byte ieee 754 + BIT, // single bit value (boolean) + FIXEDCHAR, // utf8 fixed length string, padded with spaces + FIXED16CHAR, + FIXEDBINARY, // fixed length binary, padded with 0 bytes + VARCHAR, // utf8 variable length string + VAR16CHAR, // utf16 variable length string + VARBINARY, // variable length binary + UINT1, // unsigned 1 byte integer + UINT2, // unsigned 2 byte integer + UINT4, // unsigned 4 byte integer + UINT8, // unsigned 8 byte integer + DECIMAL28DENSE, // dense decimal representation, supporting precision between 19 and 28 + DECIMAL38DENSE, // dense decimal representation, supporting precision between 28 and 38 + NULL, // a value of unknown type (e.g. a missing reference). 
+ INTERVALYEAR, // Interval type specifying YEAR to MONTH + INTERVALDAY, // Interval type specifying DAY to SECONDS + LIST, + GENERIC_OBJECT, + UNION + } + + public enum DataMode { + REQUIRED, + OPTIONAL, + REPEATED + } + + public static class MajorType { + private MinorType minorType; + private DataMode mode; + private Integer precision; + private Integer scale; + private Integer timezone; + private List subTypes; + + public MajorType(MinorType minorType, DataMode mode) { + this(minorType, mode, null, null, null, null); + } + + public MajorType(MinorType minorType, DataMode mode, Integer precision, Integer scale) { + this(minorType, mode, precision, scale, null, null); + } + + public MajorType(MinorType minorType, DataMode mode, Integer precision, Integer scale, Integer timezone, List subTypes) { + this.minorType = minorType; + this.mode = mode; + this.precision = precision; + this.scale = scale; + this.timezone = timezone; + this.subTypes = subTypes; + } + + public MinorType getMinorType() { + return minorType; + } + + public DataMode getMode() { + return mode; + } + + public Integer getPrecision() { + return precision; + } + + public Integer getScale() { + return scale; + } + + public Integer getTimezone() { + return timezone; + } + + public List getSubTypes() { + return subTypes; + } + } + + public static MajorType required(MinorType minorType) { + return new MajorType(minorType, DataMode.REQUIRED); + } + public static MajorType optional(MinorType minorType) { + return new MajorType(minorType, DataMode.OPTIONAL); + } + public static MajorType repeated(MinorType minorType) { + return new MajorType(minorType, DataMode.REPEATED); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java b/java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java new file mode 100644 index 0000000000000..2bdfd70b22956 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java @@ -0,0 +1,233 @@ +/******************************************************************************* + + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
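MajorType is a plain value object: precision, scale, timezone and subTypes stay null unless a constructor supplies them, and the static factories cover the common no-extras cases. For example:

    MajorType decimal = new MajorType(MinorType.DECIMAL18, DataMode.OPTIONAL, 18, 2);
    MajorType intList = Types.repeated(MinorType.INT);
    Integer scale = decimal.getScale();           // 2
    Integer precision = intList.getPrecision();   // null: never set for this type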
+ ******************************************************************************/ +package org.apache.arrow.vector.util; + +import io.netty.buffer.ArrowBuf; +import io.netty.util.internal.PlatformDependent; + +import org.apache.arrow.memory.BoundsChecking; + +import com.google.common.primitives.UnsignedLongs; + +public class ByteFunctionHelpers { + static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ByteFunctionHelpers.class); + + /** + * Helper function to check for equality of bytes in two DrillBuffers + * + * @param left Left DrillBuf for comparison + * @param lStart start offset in the buffer + * @param lEnd end offset in the buffer + * @param right Right DrillBuf for comparison + * @param rStart start offset in the buffer + * @param rEnd end offset in the buffer + * @return 1 if left input is greater, -1 if left input is smaller, 0 otherwise + */ + public static final int equal(final ArrowBuf left, int lStart, int lEnd, final ArrowBuf right, int rStart, int rEnd){ + if (BoundsChecking.BOUNDS_CHECKING_ENABLED) { + left.checkBytes(lStart, lEnd); + right.checkBytes(rStart, rEnd); + } + return memEqual(left.memoryAddress(), lStart, lEnd, right.memoryAddress(), rStart, rEnd); + } + + private static final int memEqual(final long laddr, int lStart, int lEnd, final long raddr, int rStart, + final int rEnd) { + + int n = lEnd - lStart; + if (n == rEnd - rStart) { + long lPos = laddr + lStart; + long rPos = raddr + rStart; + + while (n > 7) { + long leftLong = PlatformDependent.getLong(lPos); + long rightLong = PlatformDependent.getLong(rPos); + if (leftLong != rightLong) { + return 0; + } + lPos += 8; + rPos += 8; + n -= 8; + } + while (n-- != 0) { + byte leftByte = PlatformDependent.getByte(lPos); + byte rightByte = PlatformDependent.getByte(rPos); + if (leftByte != rightByte) { + return 0; + } + lPos++; + rPos++; + } + return 1; + } else { + return 0; + } + } + + /** + * Helper function to compare a set of bytes in two DrillBuffers. + * + * Function will check data before completing in the case that + * + * @param left Left DrillBuf to compare + * @param lStart start offset in the buffer + * @param lEnd end offset in the buffer + * @param right Right DrillBuf to compare + * @param rStart start offset in the buffer + * @param rEnd end offset in the buffer + * @return 1 if left input is greater, -1 if left input is smaller, 0 otherwise + */ + public static final int compare(final ArrowBuf left, int lStart, int lEnd, final ArrowBuf right, int rStart, int rEnd){ + if (BoundsChecking.BOUNDS_CHECKING_ENABLED) { + left.checkBytes(lStart, lEnd); + right.checkBytes(rStart, rEnd); + } + return memcmp(left.memoryAddress(), lStart, lEnd, right.memoryAddress(), rStart, rEnd); + } + + private static final int memcmp(final long laddr, int lStart, int lEnd, final long raddr, int rStart, final int rEnd) { + int lLen = lEnd - lStart; + int rLen = rEnd - rStart; + int n = Math.min(rLen, lLen); + long lPos = laddr + lStart; + long rPos = raddr + rStart; + + while (n > 7) { + long leftLong = PlatformDependent.getLong(lPos); + long rightLong = PlatformDependent.getLong(rPos); + if (leftLong != rightLong) { + return UnsignedLongs.compare(Long.reverseBytes(leftLong), Long.reverseBytes(rightLong)); + } + lPos += 8; + rPos += 8; + n -= 8; + } + + while (n-- != 0) { + byte leftByte = PlatformDependent.getByte(lPos); + byte rightByte = PlatformDependent.getByte(rPos); + if (leftByte != rightByte) { + return ((leftByte & 0xFF) - (rightByte & 0xFF)) > 0 ? 
1 : -1; + } + lPos++; + rPos++; + } + + if (lLen == rLen) { + return 0; + } + + return lLen > rLen ? 1 : -1; + + } + + /** + * Helper function to compare a set of bytes in DrillBuf to a ByteArray. + * + * @param left Left DrillBuf for comparison purposes + * @param lStart start offset in the buffer + * @param lEnd end offset in the buffer + * @param right second input to be compared + * @param rStart start offset in the byte array + * @param rEnd end offset in the byte array + * @return 1 if left input is greater, -1 if left input is smaller, 0 otherwise + */ + public static final int compare(final ArrowBuf left, int lStart, int lEnd, final byte[] right, int rStart, final int rEnd) { + if (BoundsChecking.BOUNDS_CHECKING_ENABLED) { + left.checkBytes(lStart, lEnd); + } + return memcmp(left.memoryAddress(), lStart, lEnd, right, rStart, rEnd); + } + + + private static final int memcmp(final long laddr, int lStart, int lEnd, final byte[] right, int rStart, final int rEnd) { + int lLen = lEnd - lStart; + int rLen = rEnd - rStart; + int n = Math.min(rLen, lLen); + long lPos = laddr + lStart; + int rPos = rStart; + + while (n-- != 0) { + byte leftByte = PlatformDependent.getByte(lPos); + byte rightByte = right[rPos]; + if (leftByte != rightByte) { + return ((leftByte & 0xFF) - (rightByte & 0xFF)) > 0 ? 1 : -1; + } + lPos++; + rPos++; + } + + if (lLen == rLen) { + return 0; + } + + return lLen > rLen ? 1 : -1; + } + + /* + * Following are helper functions to interact with sparse decimal represented in a byte array. + */ + + // Get the integer ignore the sign + public static int getInteger(byte[] b, int index) { + return getInteger(b, index, true); + } + // Get the integer, ignore the sign + public static int getInteger(byte[] b, int index, boolean ignoreSign) { + int startIndex = index * DecimalUtility.INTEGER_SIZE; + + if (index == 0 && ignoreSign == true) { + return (b[startIndex + 3] & 0xFF) | + (b[startIndex + 2] & 0xFF) << 8 | + (b[startIndex + 1] & 0xFF) << 16 | + (b[startIndex] & 0x7F) << 24; + } + + return ((b[startIndex + 3] & 0xFF) | + (b[startIndex + 2] & 0xFF) << 8 | + (b[startIndex + 1] & 0xFF) << 16 | + (b[startIndex] & 0xFF) << 24); + + } + + // Set integer in the byte array + public static void setInteger(byte[] b, int index, int value) { + int startIndex = index * DecimalUtility.INTEGER_SIZE; + b[startIndex] = (byte) ((value >> 24) & 0xFF); + b[startIndex + 1] = (byte) ((value >> 16) & 0xFF); + b[startIndex + 2] = (byte) ((value >> 8) & 0xFF); + b[startIndex + 3] = (byte) ((value) & 0xFF); + } + + // Set the sign in a sparse decimal representation + public static void setSign(byte[] b, boolean sign) { + int value = getInteger(b, 0); + if (sign == true) { + setInteger(b, 0, value | 0x80000000); + } else { + setInteger(b, 0, value & 0x7FFFFFFF); + } + } + + // Get the sign + public static boolean getSign(byte[] b) { + return ((getInteger(b, 0, false) & 0x80000000) != 0); + } + +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/CallBack.java b/java/vector/src/main/java/org/apache/arrow/vector/util/CallBack.java new file mode 100644 index 0000000000000..249834270b3fe --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/CallBack.java @@ -0,0 +1,23 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
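The sparse-decimal helpers at the bottom of ByteFunctionHelpers treat the byte array as big-endian 4-byte integers with the sign carried in the top bit of integer 0 (assuming DecimalUtility.INTEGER_SIZE is 4, as the bit arithmetic implies). A self-contained sketch:

    byte[] sparse = new byte[5 * 4];                 // e.g. five integers, DECIMAL28SPARSE-style
    ByteFunctionHelpers.setInteger(sparse, 0, 123);
    ByteFunctionHelpers.setSign(sparse, true);       // sets the top bit of integer 0
    boolean negative = ByteFunctionHelpers.getSign(sparse);     // true
    int magnitude = ByteFunctionHelpers.getInteger(sparse, 0);  // 123: sign bit masked off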
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.util; + + +public interface CallBack { + public void doWork(); +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/CoreDecimalUtility.java b/java/vector/src/main/java/org/apache/arrow/vector/util/CoreDecimalUtility.java new file mode 100644 index 0000000000000..1eb2c13cd4c59 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/CoreDecimalUtility.java @@ -0,0 +1,91 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+package org.apache.arrow.vector.util;
+
+import java.math.BigDecimal;
+
+import org.apache.arrow.vector.types.Types;
+
+public class CoreDecimalUtility {
+  static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(CoreDecimalUtility.class);
+
+  public static long getDecimal18FromBigDecimal(BigDecimal input, int scale, int precision) {
+    // Truncate or pad to set the input to the correct scale
+    input = input.setScale(scale, BigDecimal.ROUND_HALF_UP);
+
+    return (input.unscaledValue().longValue());
+  }
+
+  public static int getMaxPrecision(Types.MinorType decimalType) {
+    if (decimalType == Types.MinorType.DECIMAL9) {
+      return 9;
+    } else if (decimalType == Types.MinorType.DECIMAL18) {
+      return 18;
+    } else if (decimalType == Types.MinorType.DECIMAL28SPARSE) {
+      return 28;
+    } else if (decimalType == Types.MinorType.DECIMAL38SPARSE) {
+      return 38;
+    }
+    return 0;
+  }
+
+  /*
+   * Function returns the Minor decimal type given the precision
+   */
+  public static Types.MinorType getDecimalDataType(int precision) {
+    if (precision <= 9) {
+      return Types.MinorType.DECIMAL9;
+    } else if (precision <= 18) {
+      return Types.MinorType.DECIMAL18;
+    } else if (precision <= 28) {
+      return Types.MinorType.DECIMAL28SPARSE;
+    } else {
+      return Types.MinorType.DECIMAL38SPARSE;
+    }
+  }
+
+  /*
+   * Given a precision, it provides the max precision of the decimal data type used to store it.
+   * For example, given the precision 12 we would use DECIMAL18 to store the data, which has a
+   * max precision range of 18 digits.
+   */
+  public static int getPrecisionRange(int precision) {
+    return getMaxPrecision(getDecimalDataType(precision));
+  }
+
+  public static int getDecimal9FromBigDecimal(BigDecimal input, int scale, int precision) {
+    // Truncate or pad to set the input to the correct scale
+    input = input.setScale(scale, BigDecimal.ROUND_HALF_UP);
+
+    return (input.unscaledValue().intValue());
+  }
+
+  /*
+   * Helper function to detect if the given data type is Decimal
+   */
+  public static boolean isDecimalType(Types.MajorType type) {
+    return isDecimalType(type.getMinorType());
+  }
+
+  public static boolean isDecimalType(Types.MinorType minorType) {
+    if (minorType == Types.MinorType.DECIMAL9 || minorType == Types.MinorType.DECIMAL18 ||
+        minorType == Types.MinorType.DECIMAL28SPARSE || minorType == Types.MinorType.DECIMAL38SPARSE) {
+      return true;
+    }
+    return false;
+  }
+}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/DateUtility.java b/java/vector/src/main/java/org/apache/arrow/vector/util/DateUtility.java
new file mode 100644
index 0000000000000..f4fc1736032c0
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/util/DateUtility.java
@@ -0,0 +1,682 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */ + +package org.apache.arrow.vector.util; + +import org.joda.time.Period; +import org.joda.time.format.DateTimeFormat; +import org.joda.time.format.DateTimeFormatter; +import org.joda.time.format.DateTimeFormatterBuilder; +import org.joda.time.format.DateTimeParser; + +import com.carrotsearch.hppc.ObjectIntHashMap; + +// Utility class for Date, DateTime, TimeStamp, Interval data types +public class DateUtility { + + + /* We have a hashmap that stores the timezone as the key and an index as the value + * While storing the timezone in value vectors, holders we only use this index. As we + * reconstruct the timestamp, we use this index to index through the array timezoneList + * and get the corresponding timezone and pass it to joda-time + */ + public static ObjectIntHashMap timezoneMap = new ObjectIntHashMap(); + + public static String[] timezoneList = {"Africa/Abidjan", + "Africa/Accra", + "Africa/Addis_Ababa", + "Africa/Algiers", + "Africa/Asmara", + "Africa/Asmera", + "Africa/Bamako", + "Africa/Bangui", + "Africa/Banjul", + "Africa/Bissau", + "Africa/Blantyre", + "Africa/Brazzaville", + "Africa/Bujumbura", + "Africa/Cairo", + "Africa/Casablanca", + "Africa/Ceuta", + "Africa/Conakry", + "Africa/Dakar", + "Africa/Dar_es_Salaam", + "Africa/Djibouti", + "Africa/Douala", + "Africa/El_Aaiun", + "Africa/Freetown", + "Africa/Gaborone", + "Africa/Harare", + "Africa/Johannesburg", + "Africa/Juba", + "Africa/Kampala", + "Africa/Khartoum", + "Africa/Kigali", + "Africa/Kinshasa", + "Africa/Lagos", + "Africa/Libreville", + "Africa/Lome", + "Africa/Luanda", + "Africa/Lubumbashi", + "Africa/Lusaka", + "Africa/Malabo", + "Africa/Maputo", + "Africa/Maseru", + "Africa/Mbabane", + "Africa/Mogadishu", + "Africa/Monrovia", + "Africa/Nairobi", + "Africa/Ndjamena", + "Africa/Niamey", + "Africa/Nouakchott", + "Africa/Ouagadougou", + "Africa/Porto-Novo", + "Africa/Sao_Tome", + "Africa/Timbuktu", + "Africa/Tripoli", + "Africa/Tunis", + "Africa/Windhoek", + "America/Adak", + "America/Anchorage", + "America/Anguilla", + "America/Antigua", + "America/Araguaina", + "America/Argentina/Buenos_Aires", + "America/Argentina/Catamarca", + "America/Argentina/ComodRivadavia", + "America/Argentina/Cordoba", + "America/Argentina/Jujuy", + "America/Argentina/La_Rioja", + "America/Argentina/Mendoza", + "America/Argentina/Rio_Gallegos", + "America/Argentina/Salta", + "America/Argentina/San_Juan", + "America/Argentina/San_Luis", + "America/Argentina/Tucuman", + "America/Argentina/Ushuaia", + "America/Aruba", + "America/Asuncion", + "America/Atikokan", + "America/Atka", + "America/Bahia", + "America/Bahia_Banderas", + "America/Barbados", + "America/Belem", + "America/Belize", + "America/Blanc-Sablon", + "America/Boa_Vista", + "America/Bogota", + "America/Boise", + "America/Buenos_Aires", + "America/Cambridge_Bay", + "America/Campo_Grande", + "America/Cancun", + "America/Caracas", + "America/Catamarca", + "America/Cayenne", + "America/Cayman", + "America/Chicago", + "America/Chihuahua", + "America/Coral_Harbour", + "America/Cordoba", + "America/Costa_Rica", + "America/Cuiaba", + "America/Curacao", + "America/Danmarkshavn", + "America/Dawson", + "America/Dawson_Creek", + "America/Denver", + "America/Detroit", + "America/Dominica", + "America/Edmonton", + "America/Eirunepe", + "America/El_Salvador", + "America/Ensenada", + "America/Fort_Wayne", + "America/Fortaleza", + "America/Glace_Bay", + "America/Godthab", + "America/Goose_Bay", + "America/Grand_Turk", + "America/Grenada", + "America/Guadeloupe", + "America/Guatemala", + 
"America/Guayaquil", + "America/Guyana", + "America/Halifax", + "America/Havana", + "America/Hermosillo", + "America/Indiana/Indianapolis", + "America/Indiana/Knox", + "America/Indiana/Marengo", + "America/Indiana/Petersburg", + "America/Indiana/Tell_City", + "America/Indiana/Vevay", + "America/Indiana/Vincennes", + "America/Indiana/Winamac", + "America/Indianapolis", + "America/Inuvik", + "America/Iqaluit", + "America/Jamaica", + "America/Jujuy", + "America/Juneau", + "America/Kentucky/Louisville", + "America/Kentucky/Monticello", + "America/Knox_IN", + "America/Kralendijk", + "America/La_Paz", + "America/Lima", + "America/Los_Angeles", + "America/Louisville", + "America/Lower_Princes", + "America/Maceio", + "America/Managua", + "America/Manaus", + "America/Marigot", + "America/Martinique", + "America/Matamoros", + "America/Mazatlan", + "America/Mendoza", + "America/Menominee", + "America/Merida", + "America/Metlakatla", + "America/Mexico_City", + "America/Miquelon", + "America/Moncton", + "America/Monterrey", + "America/Montevideo", + "America/Montreal", + "America/Montserrat", + "America/Nassau", + "America/New_York", + "America/Nipigon", + "America/Nome", + "America/Noronha", + "America/North_Dakota/Beulah", + "America/North_Dakota/Center", + "America/North_Dakota/New_Salem", + "America/Ojinaga", + "America/Panama", + "America/Pangnirtung", + "America/Paramaribo", + "America/Phoenix", + "America/Port-au-Prince", + "America/Port_of_Spain", + "America/Porto_Acre", + "America/Porto_Velho", + "America/Puerto_Rico", + "America/Rainy_River", + "America/Rankin_Inlet", + "America/Recife", + "America/Regina", + "America/Resolute", + "America/Rio_Branco", + "America/Rosario", + "America/Santa_Isabel", + "America/Santarem", + "America/Santiago", + "America/Santo_Domingo", + "America/Sao_Paulo", + "America/Scoresbysund", + "America/Shiprock", + "America/Sitka", + "America/St_Barthelemy", + "America/St_Johns", + "America/St_Kitts", + "America/St_Lucia", + "America/St_Thomas", + "America/St_Vincent", + "America/Swift_Current", + "America/Tegucigalpa", + "America/Thule", + "America/Thunder_Bay", + "America/Tijuana", + "America/Toronto", + "America/Tortola", + "America/Vancouver", + "America/Virgin", + "America/Whitehorse", + "America/Winnipeg", + "America/Yakutat", + "America/Yellowknife", + "Antarctica/Casey", + "Antarctica/Davis", + "Antarctica/DumontDUrville", + "Antarctica/Macquarie", + "Antarctica/Mawson", + "Antarctica/McMurdo", + "Antarctica/Palmer", + "Antarctica/Rothera", + "Antarctica/South_Pole", + "Antarctica/Syowa", + "Antarctica/Vostok", + "Arctic/Longyearbyen", + "Asia/Aden", + "Asia/Almaty", + "Asia/Amman", + "Asia/Anadyr", + "Asia/Aqtau", + "Asia/Aqtobe", + "Asia/Ashgabat", + "Asia/Ashkhabad", + "Asia/Baghdad", + "Asia/Bahrain", + "Asia/Baku", + "Asia/Bangkok", + "Asia/Beirut", + "Asia/Bishkek", + "Asia/Brunei", + "Asia/Calcutta", + "Asia/Choibalsan", + "Asia/Chongqing", + "Asia/Chungking", + "Asia/Colombo", + "Asia/Dacca", + "Asia/Damascus", + "Asia/Dhaka", + "Asia/Dili", + "Asia/Dubai", + "Asia/Dushanbe", + "Asia/Gaza", + "Asia/Harbin", + "Asia/Hebron", + "Asia/Ho_Chi_Minh", + "Asia/Hong_Kong", + "Asia/Hovd", + "Asia/Irkutsk", + "Asia/Istanbul", + "Asia/Jakarta", + "Asia/Jayapura", + "Asia/Jerusalem", + "Asia/Kabul", + "Asia/Kamchatka", + "Asia/Karachi", + "Asia/Kashgar", + "Asia/Kathmandu", + "Asia/Katmandu", + "Asia/Kolkata", + "Asia/Krasnoyarsk", + "Asia/Kuala_Lumpur", + "Asia/Kuching", + "Asia/Kuwait", + "Asia/Macao", + "Asia/Macau", + "Asia/Magadan", + "Asia/Makassar", + 
"Asia/Manila", + "Asia/Muscat", + "Asia/Nicosia", + "Asia/Novokuznetsk", + "Asia/Novosibirsk", + "Asia/Omsk", + "Asia/Oral", + "Asia/Phnom_Penh", + "Asia/Pontianak", + "Asia/Pyongyang", + "Asia/Qatar", + "Asia/Qyzylorda", + "Asia/Rangoon", + "Asia/Riyadh", + "Asia/Saigon", + "Asia/Sakhalin", + "Asia/Samarkand", + "Asia/Seoul", + "Asia/Shanghai", + "Asia/Singapore", + "Asia/Taipei", + "Asia/Tashkent", + "Asia/Tbilisi", + "Asia/Tehran", + "Asia/Tel_Aviv", + "Asia/Thimbu", + "Asia/Thimphu", + "Asia/Tokyo", + "Asia/Ujung_Pandang", + "Asia/Ulaanbaatar", + "Asia/Ulan_Bator", + "Asia/Urumqi", + "Asia/Vientiane", + "Asia/Vladivostok", + "Asia/Yakutsk", + "Asia/Yekaterinburg", + "Asia/Yerevan", + "Atlantic/Azores", + "Atlantic/Bermuda", + "Atlantic/Canary", + "Atlantic/Cape_Verde", + "Atlantic/Faeroe", + "Atlantic/Faroe", + "Atlantic/Jan_Mayen", + "Atlantic/Madeira", + "Atlantic/Reykjavik", + "Atlantic/South_Georgia", + "Atlantic/St_Helena", + "Atlantic/Stanley", + "Australia/ACT", + "Australia/Adelaide", + "Australia/Brisbane", + "Australia/Broken_Hill", + "Australia/Canberra", + "Australia/Currie", + "Australia/Darwin", + "Australia/Eucla", + "Australia/Hobart", + "Australia/LHI", + "Australia/Lindeman", + "Australia/Lord_Howe", + "Australia/Melbourne", + "Australia/NSW", + "Australia/North", + "Australia/Perth", + "Australia/Queensland", + "Australia/South", + "Australia/Sydney", + "Australia/Tasmania", + "Australia/Victoria", + "Australia/West", + "Australia/Yancowinna", + "Brazil/Acre", + "Brazil/DeNoronha", + "Brazil/East", + "Brazil/West", + "CET", + "CST6CDT", + "Canada/Atlantic", + "Canada/Central", + "Canada/East-Saskatchewan", + "Canada/Eastern", + "Canada/Mountain", + "Canada/Newfoundland", + "Canada/Pacific", + "Canada/Saskatchewan", + "Canada/Yukon", + "Chile/Continental", + "Chile/EasterIsland", + "Cuba", + "EET", + "EST", + "EST5EDT", + "Egypt", + "Eire", + "Etc/GMT", + "Etc/GMT+0", + "Etc/GMT+1", + "Etc/GMT+10", + "Etc/GMT+11", + "Etc/GMT+12", + "Etc/GMT+2", + "Etc/GMT+3", + "Etc/GMT+4", + "Etc/GMT+5", + "Etc/GMT+6", + "Etc/GMT+7", + "Etc/GMT+8", + "Etc/GMT+9", + "Etc/GMT-0", + "Etc/GMT-1", + "Etc/GMT-10", + "Etc/GMT-11", + "Etc/GMT-12", + "Etc/GMT-13", + "Etc/GMT-14", + "Etc/GMT-2", + "Etc/GMT-3", + "Etc/GMT-4", + "Etc/GMT-5", + "Etc/GMT-6", + "Etc/GMT-7", + "Etc/GMT-8", + "Etc/GMT-9", + "Etc/GMT0", + "Etc/Greenwich", + "Etc/UCT", + "Etc/UTC", + "Etc/Universal", + "Etc/Zulu", + "Europe/Amsterdam", + "Europe/Andorra", + "Europe/Athens", + "Europe/Belfast", + "Europe/Belgrade", + "Europe/Berlin", + "Europe/Bratislava", + "Europe/Brussels", + "Europe/Bucharest", + "Europe/Budapest", + "Europe/Chisinau", + "Europe/Copenhagen", + "Europe/Dublin", + "Europe/Gibraltar", + "Europe/Guernsey", + "Europe/Helsinki", + "Europe/Isle_of_Man", + "Europe/Istanbul", + "Europe/Jersey", + "Europe/Kaliningrad", + "Europe/Kiev", + "Europe/Lisbon", + "Europe/Ljubljana", + "Europe/London", + "Europe/Luxembourg", + "Europe/Madrid", + "Europe/Malta", + "Europe/Mariehamn", + "Europe/Minsk", + "Europe/Monaco", + "Europe/Moscow", + "Europe/Nicosia", + "Europe/Oslo", + "Europe/Paris", + "Europe/Podgorica", + "Europe/Prague", + "Europe/Riga", + "Europe/Rome", + "Europe/Samara", + "Europe/San_Marino", + "Europe/Sarajevo", + "Europe/Simferopol", + "Europe/Skopje", + "Europe/Sofia", + "Europe/Stockholm", + "Europe/Tallinn", + "Europe/Tirane", + "Europe/Tiraspol", + "Europe/Uzhgorod", + "Europe/Vaduz", + "Europe/Vatican", + "Europe/Vienna", + "Europe/Vilnius", + "Europe/Volgograd", + "Europe/Warsaw", + 
"Europe/Zagreb", + "Europe/Zaporozhye", + "Europe/Zurich", + "GB", + "GB-Eire", + "GMT", + "GMT+0", + "GMT-0", + "GMT0", + "Greenwich", + "HST", + "Hongkong", + "Iceland", + "Indian/Antananarivo", + "Indian/Chagos", + "Indian/Christmas", + "Indian/Cocos", + "Indian/Comoro", + "Indian/Kerguelen", + "Indian/Mahe", + "Indian/Maldives", + "Indian/Mauritius", + "Indian/Mayotte", + "Indian/Reunion", + "Iran", + "Israel", + "Jamaica", + "Japan", + "Kwajalein", + "Libya", + "MET", + "MST", + "MST7MDT", + "Mexico/BajaNorte", + "Mexico/BajaSur", + "Mexico/General", + "NZ", + "NZ-CHAT", + "Navajo", + "PRC", + "PST8PDT", + "Pacific/Apia", + "Pacific/Auckland", + "Pacific/Chatham", + "Pacific/Chuuk", + "Pacific/Easter", + "Pacific/Efate", + "Pacific/Enderbury", + "Pacific/Fakaofo", + "Pacific/Fiji", + "Pacific/Funafuti", + "Pacific/Galapagos", + "Pacific/Gambier", + "Pacific/Guadalcanal", + "Pacific/Guam", + "Pacific/Honolulu", + "Pacific/Johnston", + "Pacific/Kiritimati", + "Pacific/Kosrae", + "Pacific/Kwajalein", + "Pacific/Majuro", + "Pacific/Marquesas", + "Pacific/Midway", + "Pacific/Nauru", + "Pacific/Niue", + "Pacific/Norfolk", + "Pacific/Noumea", + "Pacific/Pago_Pago", + "Pacific/Palau", + "Pacific/Pitcairn", + "Pacific/Pohnpei", + "Pacific/Ponape", + "Pacific/Port_Moresby", + "Pacific/Rarotonga", + "Pacific/Saipan", + "Pacific/Samoa", + "Pacific/Tahiti", + "Pacific/Tarawa", + "Pacific/Tongatapu", + "Pacific/Truk", + "Pacific/Wake", + "Pacific/Wallis", + "Pacific/Yap", + "Poland", + "Portugal", + "ROC", + "ROK", + "Singapore", + "Turkey", + "UCT", + "US/Alaska", + "US/Aleutian", + "US/Arizona", + "US/Central", + "US/East-Indiana", + "US/Eastern", + "US/Hawaii", + "US/Indiana-Starke", + "US/Michigan", + "US/Mountain", + "US/Pacific", + "US/Pacific-New", + "US/Samoa", + "UTC", + "Universal", + "W-SU", + "WET", + "Zulu"}; + + static { + for (int i = 0; i < timezoneList.length; i++) { + timezoneMap.put(timezoneList[i], i); + } + } + + public static final DateTimeFormatter formatDate = DateTimeFormat.forPattern("yyyy-MM-dd"); + public static final DateTimeFormatter formatTimeStamp = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss.SSS"); + public static final DateTimeFormatter formatTimeStampTZ = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss.SSS ZZZ"); + public static final DateTimeFormatter formatTime = DateTimeFormat.forPattern("HH:mm:ss.SSS"); + + public static DateTimeFormatter dateTimeTZFormat = null; + public static DateTimeFormatter timeFormat = null; + + public static final int yearsToMonths = 12; + public static final int hoursToMillis = 60 * 60 * 1000; + public static final int minutesToMillis = 60 * 1000; + public static final int secondsToMillis = 1000; + public static final int monthToStandardDays = 30; + public static final long monthsToMillis = 2592000000L; // 30 * 24 * 60 * 60 * 1000 + public static final int daysToStandardMillis = 24 * 60 * 60 * 1000; + + + public static int getIndex(String timezone) { + return timezoneMap.get(timezone); + } + + public static String getTimeZone(int index) { + return timezoneList[index]; + } + + // Function returns the date time formatter used to parse date strings + public static DateTimeFormatter getDateTimeFormatter() { + + if (dateTimeTZFormat == null) { + DateTimeFormatter dateFormatter = DateTimeFormat.forPattern("yyyy-MM-dd"); + DateTimeParser optionalTime = DateTimeFormat.forPattern(" HH:mm:ss").getParser(); + DateTimeParser optionalSec = DateTimeFormat.forPattern(".SSS").getParser(); + DateTimeParser optionalZone = DateTimeFormat.forPattern(" 
ZZZ").getParser(); + + dateTimeTZFormat = new DateTimeFormatterBuilder().append(dateFormatter).appendOptional(optionalTime).appendOptional(optionalSec).appendOptional(optionalZone).toFormatter(); + } + + return dateTimeTZFormat; + } + + // Function returns time formatter used to parse time strings + public static DateTimeFormatter getTimeFormatter() { + if (timeFormat == null) { + DateTimeFormatter timeFormatter = DateTimeFormat.forPattern("HH:mm:ss"); + DateTimeParser optionalSec = DateTimeFormat.forPattern(".SSS").getParser(); + timeFormat = new DateTimeFormatterBuilder().append(timeFormatter).appendOptional(optionalSec).toFormatter(); + } + return timeFormat; + } + + public static int monthsFromPeriod(Period period){ + return (period.getYears() * yearsToMonths) + period.getMonths(); + } + + public static int millisFromPeriod(final Period period){ + return (period.getHours() * hoursToMillis) + + (period.getMinutes() * minutesToMillis) + + (period.getSeconds() * secondsToMillis) + + (period.getMillis()); + } + +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java b/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java new file mode 100644 index 0000000000000..576a5b6351ad1 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java @@ -0,0 +1,737 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector.util; + +import io.netty.buffer.ArrowBuf; +import io.netty.buffer.ByteBuf; +import io.netty.buffer.UnpooledByteBufAllocator; + +import java.math.BigDecimal; +import java.math.BigInteger; +import java.nio.ByteBuffer; +import java.util.Arrays; + +import org.apache.arrow.vector.holders.Decimal38SparseHolder; + +public class DecimalUtility extends CoreDecimalUtility{ + + public final static int MAX_DIGITS = 9; + public final static int DIGITS_BASE = 1000000000; + public final static int DIGITS_MAX = 999999999; + public final static int INTEGER_SIZE = (Integer.SIZE/8); + + public final static String[] decimalToString = {"", + "0", + "00", + "000", + "0000", + "00000", + "000000", + "0000000", + "00000000", + "000000000"}; + + public final static long[] scale_long_constants = { + 1, + 10, + 100, + 1000, + 10000, + 100000, + 1000000, + 10000000, + 100000000, + 1000000000, + 10000000000l, + 100000000000l, + 1000000000000l, + 10000000000000l, + 100000000000000l, + 1000000000000000l, + 10000000000000000l, + 100000000000000000l, + 1000000000000000000l}; + + /* + * Simple function that returns the static precomputed + * power of ten, instead of using Math.pow + */ + public static long getPowerOfTen(int power) { + assert power >= 0 && power < scale_long_constants.length; + return scale_long_constants[(power)]; + } + + /* + * Math.pow returns a double and while multiplying with large digits + * in the decimal data type we encounter noise. So instead of multiplying + * with Math.pow we use the static constants to perform the multiplication + */ + public static long adjustScaleMultiply(long input, int factor) { + int index = Math.abs(factor); + assert index >= 0 && index < scale_long_constants.length; + if (factor >= 0) { + return input * scale_long_constants[index]; + } else { + return input / scale_long_constants[index]; + } + } + + public static long adjustScaleDivide(long input, int factor) { + int index = Math.abs(factor); + assert index >= 0 && index < scale_long_constants.length; + if (factor >= 0) { + return input / scale_long_constants[index]; + } else { + return input * scale_long_constants[index]; + } + } + + /* Given the number of actual digits this function returns the + * number of indexes it will occupy in the array of integers + * which are stored in base 1 billion + */ + public static int roundUp(int ndigits) { + return (ndigits + MAX_DIGITS - 1)/MAX_DIGITS; + } + + /* Returns a string representation of the given integer + * If the length of the given integer is less than the + * passed length, this function will prepend zeroes to the string + */ + public static StringBuilder toStringWithZeroes(int number, int desiredLength) { + String value = ((Integer) number).toString(); + int length = value.length(); + + StringBuilder str = new StringBuilder(); + str.append(decimalToString[desiredLength - length]); + str.append(value); + + return str; + } + + public static StringBuilder toStringWithZeroes(long number, int desiredLength) { + String value = ((Long) number).toString(); + int length = value.length(); + + StringBuilder str = new StringBuilder(); + + // Desired length can be > MAX_DIGITS + int zeroesLength = desiredLength - length; + while (zeroesLength > MAX_DIGITS) { + str.append(decimalToString[MAX_DIGITS]); + zeroesLength -= MAX_DIGITS; + } + str.append(decimalToString[zeroesLength]); + str.append(value); + + return str; + } + + public static BigDecimal getBigDecimalFromIntermediate(ByteBuf data, int startIndex, int nDecimalDigits, int scale) { + + 
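// Editorial worked example (added for clarity; not part of the original change). The sparse
+    // and intermediate forms store the unscaled digits as big-endian ints in base 1 billion,
+    // 9 decimal digits per int, with the sign held in the top bit of the first int. For example,
+    // 12345678901.23 at scale 2 keeps its fraction padded to a full 9-digit slot (.230000000)
+    // and ends up in the trailing ints {..., 12, 345678901, 230000000}, because
+    // 12345678901 = 12 * 1000000000 + 345678901.
+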
// In the intermediate representation we don't pad the scale with zeroes, so set truncate = false + return getBigDecimalFromDrillBuf(data, startIndex, nDecimalDigits, scale, false); + } + + public static BigDecimal getBigDecimalFromSparse(ArrowBuf data, int startIndex, int nDecimalDigits, int scale) { + + // In the sparse representation we pad the scale with zeroes for ease of arithmetic, need to truncate + return getBigDecimalFromDrillBuf(data, startIndex, nDecimalDigits, scale, true); + } + + public static BigDecimal getBigDecimalFromDrillBuf(ArrowBuf bytebuf, int start, int length, int scale) { + byte[] value = new byte[length]; + bytebuf.getBytes(start, value, 0, length); + BigInteger unscaledValue = new BigInteger(value); + return new BigDecimal(unscaledValue, scale); + } + + public static BigDecimal getBigDecimalFromByteBuffer(ByteBuffer bytebuf, int start, int length, int scale) { + byte[] value = new byte[length]; + bytebuf.get(value); + BigInteger unscaledValue = new BigInteger(value); + return new BigDecimal(unscaledValue, scale); + } + + /* Create a BigDecimal object using the data in the DrillBuf. + * This function assumes that data is provided in a non-dense format + * It works on both sparse and intermediate representations. + */ + public static BigDecimal getBigDecimalFromDrillBuf(ByteBuf data, int startIndex, int nDecimalDigits, int scale, + boolean truncateScale) { + + // For sparse decimal type we have padded zeroes at the end, strip them while converting to BigDecimal. + int actualDigits; + + // Initialize the BigDecimal, first digit in the DrillBuf has the sign so mask it out + BigInteger decimalDigits = BigInteger.valueOf((data.getInt(startIndex)) & 0x7FFFFFFF); + + BigInteger base = BigInteger.valueOf(DIGITS_BASE); + + for (int i = 1; i < nDecimalDigits; i++) { + + BigInteger temp = BigInteger.valueOf(data.getInt(startIndex + (i * INTEGER_SIZE))); + decimalDigits = decimalDigits.multiply(base); + decimalDigits = decimalDigits.add(temp); + } + + // Truncate any additional padding we might have added + if (truncateScale == true && scale > 0 && (actualDigits = scale % MAX_DIGITS) != 0) { + BigInteger truncate = BigInteger.valueOf((int)Math.pow(10, (MAX_DIGITS - actualDigits))); + decimalDigits = decimalDigits.divide(truncate); + } + + // set the sign + if ((data.getInt(startIndex) & 0x80000000) != 0) { + decimalDigits = decimalDigits.negate(); + } + + BigDecimal decimal = new BigDecimal(decimalDigits, scale); + + return decimal; + } + + /* This function returns a BigDecimal object from the dense decimal representation. + * First step is to convert the dense representation into an intermediate representation + * and then invoke getBigDecimalFromDrillBuf() to get the BigDecimal object + */ + public static BigDecimal getBigDecimalFromDense(ArrowBuf data, int startIndex, int nDecimalDigits, int scale, int maxPrecision, int width) { + + /* This method converts the dense representation to + * an intermediate representation. The intermediate + * representation has one more integer than the dense + * representation. 
+ */ + byte[] intermediateBytes = new byte[((nDecimalDigits + 1) * INTEGER_SIZE)]; + + // Start storing from the least significant byte of the first integer + int intermediateIndex = 3; + + int[] mask = {0x03, 0x0F, 0x3F, 0xFF}; + int[] reverseMask = {0xFC, 0xF0, 0xC0, 0x00}; + + int maskIndex; + int shiftOrder; + byte shiftBits; + + // TODO: Some of the logic here is common with casting from Dense to Sparse types, factor out common code + if (maxPrecision == 38) { + maskIndex = 0; + shiftOrder = 6; + shiftBits = 0x00; + intermediateBytes[intermediateIndex++] = (byte) (data.getByte(startIndex) & 0x7F); + } else if (maxPrecision == 28) { + maskIndex = 1; + shiftOrder = 4; + shiftBits = (byte) ((data.getByte(startIndex) & 0x03) << shiftOrder); + intermediateBytes[intermediateIndex++] = (byte) (((data.getByte(startIndex) & 0x3C) & 0xFF) >>> 2); + } else { + throw new UnsupportedOperationException("Dense types with max precision 38 and 28 are only supported"); + } + + int inputIndex = 1; + boolean sign = false; + + if ((data.getByte(startIndex) & 0x80) != 0) { + sign = true; + } + + while (inputIndex < width) { + + intermediateBytes[intermediateIndex] = (byte) ((shiftBits) | (((data.getByte(startIndex + inputIndex) & reverseMask[maskIndex]) & 0xFF) >>> (8 - shiftOrder))); + + shiftBits = (byte) ((data.getByte(startIndex + inputIndex) & mask[maskIndex]) << shiftOrder); + + inputIndex++; + intermediateIndex++; + + if (((inputIndex - 1) % INTEGER_SIZE) == 0) { + shiftBits = (byte) ((shiftBits & 0xFF) >>> 2); + maskIndex++; + shiftOrder -= 2; + } + + } + /* copy the last byte */ + intermediateBytes[intermediateIndex] = shiftBits; + + if (sign == true) { + intermediateBytes[0] = (byte) (intermediateBytes[0] | 0x80); + } + + final ByteBuf intermediate = UnpooledByteBufAllocator.DEFAULT.buffer(intermediateBytes.length); + try { + intermediate.setBytes(0, intermediateBytes); + + BigDecimal ret = getBigDecimalFromIntermediate(intermediate, 0, nDecimalDigits + 1, scale); + return ret; + } finally { + intermediate.release(); + } + + } + + /* + * Function converts the BigDecimal and stores it in out internal sparse representation + */ + public static void getSparseFromBigDecimal(BigDecimal input, ByteBuf data, int startIndex, int scale, int precision, + int nDecimalDigits) { + + // Initialize the buffer + for (int i = 0; i < nDecimalDigits; i++) { + data.setInt(startIndex + (i * INTEGER_SIZE), 0); + } + + boolean sign = false; + + if (input.signum() == -1) { + // negative input + sign = true; + input = input.abs(); + } + + // Truncate the input as per the scale provided + input = input.setScale(scale, BigDecimal.ROUND_HALF_UP); + + // Separate out the integer part + BigDecimal integerPart = input.setScale(0, BigDecimal.ROUND_DOWN); + + int destIndex = nDecimalDigits - roundUp(scale) - 1; + + // we use base 1 billion integer digits for out integernal representation + BigDecimal base = new BigDecimal(DIGITS_BASE); + + while (integerPart.compareTo(BigDecimal.ZERO) == 1) { + // store the modulo as the integer value + data.setInt(startIndex + (destIndex * INTEGER_SIZE), (integerPart.remainder(base)).intValue()); + destIndex--; + // Divide by base 1 billion + integerPart = (integerPart.divide(base)).setScale(0, BigDecimal.ROUND_DOWN); + } + + /* Sparse representation contains padding of additional zeroes + * so each digit contains MAX_DIGITS for ease of arithmetic + */ + int actualDigits; + if ((actualDigits = (scale % MAX_DIGITS)) != 0) { + // Pad additional zeroes + scale = scale + (MAX_DIGITS - 
actualDigits); + input = input.setScale(scale, BigDecimal.ROUND_DOWN); + } + + //separate out the fractional part + BigDecimal fractionalPart = input.remainder(BigDecimal.ONE).movePointRight(scale); + + destIndex = nDecimalDigits - 1; + + while (scale > 0) { + // Get next set of MAX_DIGITS (9) store it in the DrillBuf + fractionalPart = fractionalPart.movePointLeft(MAX_DIGITS); + BigDecimal temp = fractionalPart.remainder(BigDecimal.ONE); + + data.setInt(startIndex + (destIndex * INTEGER_SIZE), (temp.unscaledValue().intValue())); + destIndex--; + + fractionalPart = fractionalPart.setScale(0, BigDecimal.ROUND_DOWN); + scale -= MAX_DIGITS; + } + + // Set the negative sign + if (sign == true) { + data.setInt(startIndex, data.getInt(startIndex) | 0x80000000); + } + + } + + + public static long getDecimal18FromBigDecimal(BigDecimal input, int scale, int precision) { + // Truncate or pad to set the input to the correct scale + input = input.setScale(scale, BigDecimal.ROUND_HALF_UP); + + return (input.unscaledValue().longValue()); + } + + public static BigDecimal getBigDecimalFromPrimitiveTypes(int input, int scale, int precision) { + return BigDecimal.valueOf(input, scale); + } + + public static BigDecimal getBigDecimalFromPrimitiveTypes(long input, int scale, int precision) { + return BigDecimal.valueOf(input, scale); + } + + + public static int compareDenseBytes(ArrowBuf left, int leftStart, boolean leftSign, ArrowBuf right, int rightStart, boolean rightSign, int width) { + + int invert = 1; + + /* If signs are different then simply look at the + * sign of the two inputs and determine which is greater + */ + if (leftSign != rightSign) { + + return((leftSign == true) ? -1 : 1); + } else if(leftSign == true) { + /* Both inputs are negative, at the end we will + * have to invert the comparison + */ + invert = -1; + } + + int cmp = 0; + + for (int i = 0; i < width; i++) { + byte leftByte = left.getByte(leftStart + i); + byte rightByte = right.getByte(rightStart + i); + // Unsigned byte comparison + if ((leftByte & 0xFF) > (rightByte & 0xFF)) { + cmp = 1; + break; + } else if ((leftByte & 0xFF) < (rightByte & 0xFF)) { + cmp = -1; + break; + } + } + cmp *= invert; // invert the comparison if both were negative values + + return cmp; + } + + public static int getIntegerFromSparseBuffer(ArrowBuf buffer, int start, int index) { + int value = buffer.getInt(start + (index * 4)); + + if (index == 0) { + /* the first byte contains sign bit, return value without it */ + value = (value & 0x7FFFFFFF); + } + return value; + } + + public static void setInteger(ArrowBuf buffer, int start, int index, int value) { + buffer.setInt(start + (index * 4), value); + } + + public static int compareSparseBytes(ArrowBuf left, int leftStart, boolean leftSign, int leftScale, int leftPrecision, ArrowBuf right, int rightStart, boolean rightSign, int rightPrecision, int rightScale, int width, int nDecimalDigits, boolean absCompare) { + + int invert = 1; + + if (absCompare == false) { + if (leftSign != rightSign) { + return (leftSign == true) ? 
-1 : 1; + } + + // Both values are negative invert the outcome of the comparison + if (leftSign == true) { + invert = -1; + } + } + + int cmp = compareSparseBytesInner(left, leftStart, leftSign, leftScale, leftPrecision, right, rightStart, rightSign, rightPrecision, rightScale, width, nDecimalDigits); + return cmp * invert; + } + public static int compareSparseBytesInner(ArrowBuf left, int leftStart, boolean leftSign, int leftScale, int leftPrecision, ArrowBuf right, int rightStart, boolean rightSign, int rightPrecision, int rightScale, int width, int nDecimalDigits) { + /* compute the number of integer digits in each decimal */ + int leftInt = leftPrecision - leftScale; + int rightInt = rightPrecision - rightScale; + + /* compute the number of indexes required for storing integer digits */ + int leftIntRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(leftInt); + int rightIntRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(rightInt); + + /* compute number of indexes required for storing scale */ + int leftScaleRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(leftScale); + int rightScaleRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(rightScale); + + /* compute index of the most significant integer digits */ + int leftIndex1 = nDecimalDigits - leftScaleRoundedUp - leftIntRoundedUp; + int rightIndex1 = nDecimalDigits - rightScaleRoundedUp - rightIntRoundedUp; + + int leftStopIndex = nDecimalDigits - leftScaleRoundedUp; + int rightStopIndex = nDecimalDigits - rightScaleRoundedUp; + + /* Discard the zeroes in the integer part */ + while (leftIndex1 < leftStopIndex) { + if (getIntegerFromSparseBuffer(left, leftStart, leftIndex1) != 0) { + break; + } + + /* Digit in this location is zero, decrement the actual number + * of integer digits + */ + leftIntRoundedUp--; + leftIndex1++; + } + + /* If we reached the stop index then the number of integers is zero */ + if (leftIndex1 == leftStopIndex) { + leftIntRoundedUp = 0; + } + + while (rightIndex1 < rightStopIndex) { + if (getIntegerFromSparseBuffer(right, rightStart, rightIndex1) != 0) { + break; + } + + /* Digit in this location is zero, decrement the actual number + * of integer digits + */ + rightIntRoundedUp--; + rightIndex1++; + } + + if (rightIndex1 == rightStopIndex) { + rightIntRoundedUp = 0; + } + + /* We have the accurate number of non-zero integer digits, + * if the number of integer digits are different then we can determine + * which decimal is larger and needn't go down to comparing individual values + */ + if (leftIntRoundedUp > rightIntRoundedUp) { + return 1; + } + else if (rightIntRoundedUp > leftIntRoundedUp) { + return -1; + } + + /* The number of integer digits are the same, set the each index + * to the first non-zero integer and compare each digit + */ + leftIndex1 = nDecimalDigits - leftScaleRoundedUp - leftIntRoundedUp; + rightIndex1 = nDecimalDigits - rightScaleRoundedUp - rightIntRoundedUp; + + while (leftIndex1 < leftStopIndex && rightIndex1 < rightStopIndex) { + if (getIntegerFromSparseBuffer(left, leftStart, leftIndex1) > getIntegerFromSparseBuffer(right, rightStart, rightIndex1)) { + return 1; + } + else if (getIntegerFromSparseBuffer(right, rightStart, rightIndex1) > getIntegerFromSparseBuffer(left, leftStart, leftIndex1)) { + return -1; + } + + leftIndex1++; + rightIndex1++; + } + + /* The integer part of both the decimal's are equal, now compare + * each individual fractional part. 
Set the index to be at the + * beginning of the fractional part + */ + leftIndex1 = leftStopIndex; + rightIndex1 = rightStopIndex; + + /* Stop indexes will be the end of the array */ + leftStopIndex = nDecimalDigits; + rightStopIndex = nDecimalDigits; + + /* compare the two fractional parts of the decimal */ + while (leftIndex1 < leftStopIndex && rightIndex1 < rightStopIndex) { + if (getIntegerFromSparseBuffer(left, leftStart, leftIndex1) > getIntegerFromSparseBuffer(right, rightStart, rightIndex1)) { + return 1; + } + else if (getIntegerFromSparseBuffer(right, rightStart, rightIndex1) > getIntegerFromSparseBuffer(left, leftStart, leftIndex1)) { + return -1; + } + + leftIndex1++; + rightIndex1++; + } + + /* Till now the fractional part of the decimals are equal, check + * if one of the decimal has fractional part that is remaining + * and is non-zero + */ + while (leftIndex1 < leftStopIndex) { + if (getIntegerFromSparseBuffer(left, leftStart, leftIndex1) != 0) { + return 1; + } + leftIndex1++; + } + + while(rightIndex1 < rightStopIndex) { + if (getIntegerFromSparseBuffer(right, rightStart, rightIndex1) != 0) { + return -1; + } + rightIndex1++; + } + + /* Both decimal values are equal */ + return 0; + } + + public static BigDecimal getBigDecimalFromByteArray(byte[] bytes, int start, int length, int scale) { + byte[] value = Arrays.copyOfRange(bytes, start, start + length); + BigInteger unscaledValue = new BigInteger(value); + return new BigDecimal(unscaledValue, scale); + } + + public static void roundDecimal(ArrowBuf result, int start, int nDecimalDigits, int desiredScale, int currentScale) { + int newScaleRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(desiredScale); + int origScaleRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(currentScale); + + if (desiredScale < currentScale) { + + boolean roundUp = false; + + //Extract the first digit to be truncated to check if we need to round up + int truncatedScaleIndex = desiredScale + 1; + if (truncatedScaleIndex <= currentScale) { + int extractDigitIndex = nDecimalDigits - origScaleRoundedUp -1; + extractDigitIndex += org.apache.arrow.vector.util.DecimalUtility.roundUp(truncatedScaleIndex); + int extractDigit = getIntegerFromSparseBuffer(result, start, extractDigitIndex); + int temp = org.apache.arrow.vector.util.DecimalUtility.MAX_DIGITS - (truncatedScaleIndex % org.apache.arrow.vector.util.DecimalUtility.MAX_DIGITS); + if (temp != 0) { + extractDigit = extractDigit / (int) (Math.pow(10, temp)); + } + if ((extractDigit % 10) > 4) { + roundUp = true; + } + } + + // Get the source index beyond which we will truncate + int srcIntIndex = nDecimalDigits - origScaleRoundedUp - 1; + int srcIndex = srcIntIndex + newScaleRoundedUp; + + // Truncate the remaining fractional part, move the integer part + int destIndex = nDecimalDigits - 1; + if (srcIndex != destIndex) { + while (srcIndex >= 0) { + setInteger(result, start, destIndex--, getIntegerFromSparseBuffer(result, start, srcIndex--)); + } + + // Set the remaining portion of the decimal to be zeroes + while (destIndex >= 0) { + setInteger(result, start, destIndex--, 0); + } + srcIndex = nDecimalDigits - 1; + } + + // We truncated the decimal digit. 
Now we need to truncate within the base 1 billion fractional digit + int truncateFactor = org.apache.arrow.vector.util.DecimalUtility.MAX_DIGITS - (desiredScale % org.apache.arrow.vector.util.DecimalUtility.MAX_DIGITS); + if (truncateFactor != org.apache.arrow.vector.util.DecimalUtility.MAX_DIGITS) { + truncateFactor = (int) Math.pow(10, truncateFactor); + int fractionalDigits = getIntegerFromSparseBuffer(result, start, nDecimalDigits - 1); + fractionalDigits /= truncateFactor; + setInteger(result, start, nDecimalDigits - 1, fractionalDigits * truncateFactor); + } + + // Finally round up the digit if needed + if (roundUp == true) { + srcIndex = nDecimalDigits - 1; + int carry; + if (truncateFactor != org.apache.arrow.vector.util.DecimalUtility.MAX_DIGITS) { + carry = truncateFactor; + } else { + carry = 1; + } + + while (srcIndex >= 0) { + int value = getIntegerFromSparseBuffer(result, start, srcIndex); + value += carry; + + if (value >= org.apache.arrow.vector.util.DecimalUtility.DIGITS_BASE) { + setInteger(result, start, srcIndex--, value % org.apache.arrow.vector.util.DecimalUtility.DIGITS_BASE); + carry = value / org.apache.arrow.vector.util.DecimalUtility.DIGITS_BASE; + } else { + setInteger(result, start, srcIndex--, value); + carry = 0; + break; + } + } + } + } else if (desiredScale > currentScale) { + // Add fractional digits to the decimal + + // Check if we need to shift the decimal digits to the left + if (newScaleRoundedUp > origScaleRoundedUp) { + int srcIndex = 0; + int destIndex = newScaleRoundedUp - origScaleRoundedUp; + + // Check while extending scale, we are not overwriting integer part + while (srcIndex < destIndex) { + if (getIntegerFromSparseBuffer(result, start, srcIndex++) != 0) { + throw new RuntimeException("Truncate resulting in loss of integer part, reduce scale specified"); + } + } + + srcIndex = 0; + while (destIndex < nDecimalDigits) { + setInteger(result, start, srcIndex++, getIntegerFromSparseBuffer(result, start, destIndex++)); + } + + // Clear the remaining part + while (srcIndex < nDecimalDigits) { + setInteger(result, start, srcIndex++, 0); + } + } + } + } + + public static int getFirstFractionalDigit(int decimal, int scale) { + if (scale == 0) { + return 0; + } + int temp = (int) adjustScaleDivide(decimal, scale - 1); + return Math.abs(temp % 10); + } + + public static int getFirstFractionalDigit(long decimal, int scale) { + if (scale == 0) { + return 0; + } + long temp = adjustScaleDivide(decimal, scale - 1); + return (int) (Math.abs(temp % 10)); + } + + public static int getFirstFractionalDigit(ArrowBuf data, int scale, int start, int nDecimalDigits) { + if (scale == 0) { + return 0; + } + + int index = nDecimalDigits - roundUp(scale); + return (int) (adjustScaleDivide(data.getInt(start + (index * INTEGER_SIZE)), MAX_DIGITS - 1)); + } + + public static int compareSparseSamePrecScale(ArrowBuf left, int lStart, byte[] right, int length) { + // check the sign first + boolean lSign = (left.getInt(lStart) & 0x80000000) != 0; + boolean rSign = ByteFunctionHelpers.getSign(right); + int cmp = 0; + + if (lSign != rSign) { + return (lSign == false) ? 1 : -1; + } + + // invert the comparison if we are comparing negative numbers + int invert = (lSign == true) ? -1 : 1; + + // compare byte by byte + int n = 0; + int lPos = lStart; + int rPos = 0; + while (n < length/4) { + int leftInt = Decimal38SparseHolder.getInteger(n, lStart, left); + int rightInt = ByteFunctionHelpers.getInteger(right, n); + if (leftInt != rightInt) { + cmp = (leftInt - rightInt ) > 0 ? 
1 : -1; + break; + } + n++; + } + return cmp * invert; + } +} + diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/JsonStringArrayList.java b/java/vector/src/main/java/org/apache/arrow/vector/util/JsonStringArrayList.java new file mode 100644 index 0000000000000..7aeaa12ef9fcf --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/JsonStringArrayList.java @@ -0,0 +1,57 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.util; + +import java.util.ArrayList; +import java.util.List; + +import com.fasterxml.jackson.core.JsonProcessingException; +import com.fasterxml.jackson.databind.ObjectMapper; + +public class JsonStringArrayList extends ArrayList { + + private static ObjectMapper mapper; + + static { + mapper = new ObjectMapper(); + } + + @Override + public boolean equals(Object obj) { + if (this == obj) { + return true; + } + if (obj == null) { + return false; + } + if (!(obj instanceof List)) { + return false; + } + List other = (List) obj; + return this.size() == other.size() && this.containsAll(other); + } + + @Override + public final String toString() { + try { + return mapper.writeValueAsString(this); + } catch(JsonProcessingException e) { + throw new IllegalStateException("Cannot serialize array list to JSON string", e); + } + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/JsonStringHashMap.java b/java/vector/src/main/java/org/apache/arrow/vector/util/JsonStringHashMap.java new file mode 100644 index 0000000000000..750dd592aa49c --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/JsonStringHashMap.java @@ -0,0 +1,76 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+package org.apache.arrow.vector.util;
+
+import java.util.LinkedHashMap;
+import java.util.Map;
+
+import com.fasterxml.jackson.core.JsonProcessingException;
+import com.fasterxml.jackson.databind.ObjectMapper;
+
+/*
+ * Simple class that extends the regular java.util.LinkedHashMap but overrides the
+ * toString() method of the map to produce a JSON string instead
+ */
+public class JsonStringHashMap<K, V> extends LinkedHashMap<K, V> {
+
+  private static ObjectMapper mapper;
+
+  static {
+    mapper = new ObjectMapper();
+  }
+
+  @Override
+  public boolean equals(Object obj) {
+    if (this == obj) {
+      return true;
+    }
+    if (obj == null) {
+      return false;
+    }
+    if (!(obj instanceof Map)) {
+      return false;
+    }
+    Map other = (Map) obj;
+    if (this.size() != other.size()) {
+      return false;
+    }
+    for (K key : this.keySet()) {
+      if (this.get(key) == null) {
+        if (other.get(key) == null) {
+          continue;
+        } else {
+          return false;
+        }
+      }
+      if (!this.get(key).equals(other.get(key))) {
+        return false;
+      }
+    }
+    return true;
+  }
+
+  @Override
+  public final String toString() {
+    try {
+      return mapper.writeValueAsString(this);
+    } catch (JsonProcessingException e) {
+      throw new IllegalStateException("Cannot serialize hash map to JSON string", e);
+    }
+  }
+}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/MapWithOrdinal.java b/java/vector/src/main/java/org/apache/arrow/vector/util/MapWithOrdinal.java
new file mode 100644
index 0000000000000..dea433e99e80f
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/util/MapWithOrdinal.java
@@ -0,0 +1,248 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector.util;
+
+import java.util.AbstractMap;
+import java.util.Collection;
+import java.util.Map;
+import java.util.Set;
+
+import com.google.common.base.Function;
+import com.google.common.base.Preconditions;
+import com.google.common.collect.Iterables;
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import com.google.common.collect.Sets;
+import io.netty.util.collection.IntObjectHashMap;
+import io.netty.util.collection.IntObjectMap;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * An implementation of map that supports constant time look-up by a generic key or an ordinal.
+ *
+ * This class extends the functionality of a regular {@link Map} with ordinal lookup support.
+ * Upon insertion an unused ordinal is assigned to the inserted (key, value) tuple.
+ * Upon update the same ordinal id is re-used while the value is replaced.
+ * Upon deletion of an existing item, its corresponding ordinal is recycled and could be used by another item.
+ *
+ * For any instance with N items, this implementation guarantees that ordinals are in the range of [0, N). However,
+ * the ordinal assignment is dynamic and may change after an insertion or deletion. Consumers of this class are
+ * responsible for explicitly checking the ordinal corresponding to a key via
+ * {@link org.apache.arrow.vector.util.MapWithOrdinal#getOrdinal(Object)} before attempting to execute a lookup
+ * with an ordinal.
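+ *
+ * An illustrative usage sketch (editorial addition, not part of the original patch):
+ * <pre>{@code
+ *   MapWithOrdinal<String, Integer> map = new MapWithOrdinal<>();
+ *   map.put("a", 1);        // "a" is assigned ordinal 0
+ *   map.put("b", 2);        // "b" is assigned ordinal 1
+ *   map.put("b", 3);        // ordinal 1 is re-used, the value is replaced
+ *   map.getOrdinal("b");    // returns 1
+ *   map.getByOrdinal(1);    // returns 3
+ * }</pre>
+ * After {@code map.remove("a")}, ordinal 0 may be re-assigned to another entry, so ordinals must
+ * be re-checked after removal.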
+ *
+ * @param <K> key type
+ * @param <V> value type
+ */
+
+public class MapWithOrdinal<K, V> implements Map<K, V> {
+  private final static Logger logger = LoggerFactory.getLogger(MapWithOrdinal.class);
+
+  private final Map<K, Entry<Integer, V>> primary = Maps.newLinkedHashMap();
+  private final IntObjectHashMap<V> secondary = new IntObjectHashMap<>();
+
+  private final Map<K, V> delegate = new Map<K, V>() {
+    @Override
+    public boolean isEmpty() {
+      return size() == 0;
+    }
+
+    @Override
+    public int size() {
+      return primary.size();
+    }
+
+    @Override
+    public boolean containsKey(Object key) {
+      return primary.containsKey(key);
+    }
+
+    @Override
+    public boolean containsValue(Object value) {
+      return primary.containsValue(value);
+    }
+
+    @Override
+    public V get(Object key) {
+      Entry<Integer, V> pair = primary.get(key);
+      if (pair != null) {
+        return pair.getValue();
+      }
+      return null;
+    }
+
+    @Override
+    public V put(K key, V value) {
+      final Entry<Integer, V> oldPair = primary.get(key);
+      // if key exists try replacing, otherwise assign a new ordinal identifier
+      final int ordinal = oldPair == null ? primary.size() : oldPair.getKey();
+      primary.put(key, new AbstractMap.SimpleImmutableEntry<>(ordinal, value));
+      secondary.put(ordinal, value);
+      return oldPair == null ? null : oldPair.getValue();
+    }
+
+    @Override
+    public V remove(Object key) {
+      final Entry<Integer, V> oldPair = primary.remove(key);
+      if (oldPair != null) {
+        final int lastOrdinal = secondary.size();
+        final V last = secondary.get(lastOrdinal);
+        // normalize mappings so that all numbers until primary.size() are assigned:
+        // swap the last element with the deleted one
+        secondary.put(oldPair.getKey(), last);
+        primary.put((K) key, new AbstractMap.SimpleImmutableEntry<>(oldPair.getKey(), last));
+      }
+      return oldPair == null ? null : oldPair.getValue();
+    }
+
+    @Override
+    public void putAll(Map<? extends K, ? extends V> m) {
+      throw new UnsupportedOperationException();
+    }
+
+    @Override
+    public void clear() {
+      primary.clear();
+      secondary.clear();
+    }
+
+    @Override
+    public Set<K> keySet() {
+      return primary.keySet();
+    }
+
+    @Override
+    public Collection<V> values() {
+      return Lists.newArrayList(Iterables.transform(secondary.entries(), new Function<IntObjectMap.Entry<V>, V>() {
+        @Override
+        public V apply(IntObjectMap.Entry<V> entry) {
+          return Preconditions.checkNotNull(entry).value();
+        }
+      }));
+    }
+
+    @Override
+    public Set<Entry<K, V>> entrySet() {
+      return Sets.newHashSet(Iterables.transform(primary.entrySet(), new Function<Entry<K, Entry<Integer, V>>, Entry<K, V>>() {
+        @Override
+        public Entry<K, V> apply(Entry<K, Entry<Integer, V>> entry) {
+          return new AbstractMap.SimpleImmutableEntry<>(entry.getKey(), entry.getValue().getValue());
+        }
+      }));
+    }
+  };
+
+  /**
+   * Returns the value corresponding to the given ordinal
+   *
+   * @param id ordinal value for lookup
+   * @return an instance of V
+   */
+  public V getByOrdinal(int id) {
+    return secondary.get(id);
+  }
+
+  /**
+   * Returns the ordinal corresponding to the given key.
+   *
+   * @param key key for ordinal lookup
+   * @return ordinal value corresponding to key if it exists or -1
+   */
+  public int getOrdinal(K key) {
+    Entry<Integer, V> pair = primary.get(key);
+    if (pair != null) {
+      return pair.getKey();
+    }
+    return -1;
+  }
+
+  @Override
+  public int size() {
+    return delegate.size();
+  }
+
+  @Override
+  public boolean isEmpty() {
+    return delegate.isEmpty();
+  }
+
+  @Override
+  public V get(Object key) {
+    return delegate.get(key);
+  }
+
+  /**
+   * Inserts the tuple (key, value) into the map, extending the semantics of {@link Map#put} with automatic ordinal
+   * assignment. A new ordinal is assigned if the key does not exist. Otherwise the same ordinal is re-used but the
+   * value is replaced.
+   *
+   * {@see java.util.Map#put}
+   */
+  @Override
+  public V put(K key, V value) {
+    return delegate.put(key, value);
+  }
+
+  @Override
+  public Collection<V> values() {
+    return delegate.values();
+  }
+
+  @Override
+  public boolean containsKey(Object key) {
+    return delegate.containsKey(key);
+  }
+
+  @Override
+  public boolean containsValue(Object value) {
+    return delegate.containsValue(value);
+  }
+
+  /**
+   * Removes the element corresponding to the key if it exists, extending the semantics of {@link Map#remove} with
+   * ordinal re-cycling. The ordinal corresponding to the given key may be re-assigned to another tuple. It is
+   * important that a consumer checks the ordinal value via {@link #getOrdinal(Object)} before attempting to look up
+   * by ordinal.
+   *
+   * {@see java.util.Map#remove}
+   */
+  @Override
+  public V remove(Object key) {
+    return delegate.remove(key);
+  }
+
+  @Override
+  public void putAll(Map<? extends K, ? extends V> m) {
+    delegate.putAll(m);
+  }
+
+  @Override
+  public void clear() {
+    delegate.clear();
+  }
+
+  @Override
+  public Set<K> keySet() {
+    return delegate.keySet();
+  }
+
+  @Override
+  public Set<Entry<K, V>> entrySet() {
+    return delegate.entrySet();
+  }
+}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/OversizedAllocationException.java b/java/vector/src/main/java/org/apache/arrow/vector/util/OversizedAllocationException.java
new file mode 100644
index 0000000000000..ec628b22c2d90
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/util/OversizedAllocationException.java
@@ -0,0 +1,49 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector.util;
+
+
+/**
+ * An exception that is used to signal that an allocation request in bytes is greater than the maximum allowed by
+ * {@link org.apache.arrow.memory.BufferAllocator#buffer(int) allocator}.
+ *
+ *
<p>
Operators should handle this exception to split the batch and later resume the execution on the next + * {@link RecordBatch#next() iteration}.
</p>
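+ *
+ * <p>A hypothetical catch site, for illustration only ({@code tryLargerAllocation} is a named
+ * placeholder, not an API introduced by this patch):</p>
+ * <pre>{@code
+ *   try {
+ *     tryLargerAllocation(vector);   // placeholder for any call that allocates vector memory
+ *   } catch (OversizedAllocationException e) {
+ *     // split the batch and retry with a smaller allocation request
+ *   }
+ * }</pre>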
+ * + */ +public class OversizedAllocationException extends RuntimeException { + public OversizedAllocationException() { + super(); + } + + public OversizedAllocationException(String message, Throwable cause, boolean enableSuppression, boolean writableStackTrace) { + super(message, cause, enableSuppression, writableStackTrace); + } + + public OversizedAllocationException(String message, Throwable cause) { + super(message, cause); + } + + public OversizedAllocationException(String message) { + super(message); + } + + public OversizedAllocationException(Throwable cause) { + super(cause); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/SchemaChangeRuntimeException.java b/java/vector/src/main/java/org/apache/arrow/vector/util/SchemaChangeRuntimeException.java new file mode 100644 index 0000000000000..c281561430707 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/SchemaChangeRuntimeException.java @@ -0,0 +1,41 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.util; + + +public class SchemaChangeRuntimeException extends RuntimeException { + public SchemaChangeRuntimeException() { + super(); + } + + public SchemaChangeRuntimeException(String message, Throwable cause, boolean enableSuppression, boolean writableStackTrace) { + super(message, cause, enableSuppression, writableStackTrace); + } + + public SchemaChangeRuntimeException(String message, Throwable cause) { + super(message, cause); + } + + public SchemaChangeRuntimeException(String message) { + super(message); + } + + public SchemaChangeRuntimeException(Throwable cause) { + super(cause); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/Text.java b/java/vector/src/main/java/org/apache/arrow/vector/util/Text.java new file mode 100644 index 0000000000000..3919f0606cb20 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/Text.java @@ -0,0 +1,621 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+package org.apache.arrow.vector.util;
+
+import java.io.DataInput;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.CharBuffer;
+import java.nio.charset.CharacterCodingException;
+import java.nio.charset.Charset;
+import java.nio.charset.CharsetDecoder;
+import java.nio.charset.CharsetEncoder;
+import java.nio.charset.CodingErrorAction;
+import java.nio.charset.MalformedInputException;
+import java.text.CharacterIterator;
+import java.text.StringCharacterIterator;
+import java.util.Arrays;
+
+import com.fasterxml.jackson.core.JsonGenerationException;
+import com.fasterxml.jackson.core.JsonGenerator;
+import com.fasterxml.jackson.databind.SerializerProvider;
+import com.fasterxml.jackson.databind.annotation.JsonSerialize;
+import com.fasterxml.jackson.databind.ser.std.StdSerializer;
+
+/**
+ * A simplified byte wrapper similar to Hadoop's Text class without all the dependencies. Lifted from Hadoop 2.7.1
+ */
+@JsonSerialize(using = Text.TextSerializer.class)
+public class Text {
+
+  private static ThreadLocal<CharsetEncoder> ENCODER_FACTORY =
+      new ThreadLocal<CharsetEncoder>() {
+        @Override
+        protected CharsetEncoder initialValue() {
+          return Charset.forName("UTF-8").newEncoder().
+              onMalformedInput(CodingErrorAction.REPORT).
+              onUnmappableCharacter(CodingErrorAction.REPORT);
+        }
+      };
+
+  private static ThreadLocal<CharsetDecoder> DECODER_FACTORY =
+      new ThreadLocal<CharsetDecoder>() {
+        @Override
+        protected CharsetDecoder initialValue() {
+          return Charset.forName("UTF-8").newDecoder().
+              onMalformedInput(CodingErrorAction.REPORT).
+              onUnmappableCharacter(CodingErrorAction.REPORT);
+        }
+      };
+
+  private static final byte[] EMPTY_BYTES = new byte[0];
+
+  private byte[] bytes;
+  private int length;
+
+  public Text() {
+    bytes = EMPTY_BYTES;
+  }
+
+  /**
+   * Construct from a string.
+   */
+  public Text(String string) {
+    set(string);
+  }
+
+  /** Construct from another text. */
+  public Text(Text utf8) {
+    set(utf8);
+  }
+
+  /**
+   * Construct from a byte array.
+   */
+  public Text(byte[] utf8) {
+    set(utf8);
+  }
+
+  /**
+   * Get a copy of the bytes that is exactly the length of the data. See {@link #getBytes()} for faster access to the
+   * underlying array.
+   */
+  public byte[] copyBytes() {
+    byte[] result = new byte[length];
+    System.arraycopy(bytes, 0, result, 0, length);
+    return result;
+  }
+
+  /**
+   * Returns the raw bytes; however, only data up to {@link #getLength()} is valid. Please use {@link #copyBytes()} if
+   * you need the returned array to be precisely the length of the data.
+   */
+  public byte[] getBytes() {
+    return bytes;
+  }
+
+  /** Returns the number of bytes in the byte array */
+  public int getLength() {
+    return length;
+  }
+
+  /**
+   * Returns the Unicode Scalar Value (32-bit integer value) for the character at position. Note that this
+   * method avoids using the converter or doing String instantiation
+   *
+   * @return the Unicode scalar value at position or -1 if the position is invalid or points to a trailing byte
+   */
+  public int charAt(int position) {
+    if (position >= this.length) {
+      return -1; // out of range (no byte exists at position == length)
+    }
+    if (position < 0) {
+      return -1; // duh.
+    }
+
+    ByteBuffer bb = (ByteBuffer) ByteBuffer.wrap(bytes).position(position);
+    return bytesToCodePoint(bb.slice());
+  }
+
+  public int find(String what) {
+    return find(what, 0);
+  }
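As a quick orientation to this API, here is a small usage sketch (an illustrative snippet, not part of the patch). Note that all positions are byte offsets into the UTF-8 encoding, not char indices:

```java
import org.apache.arrow.vector.util.Text;

public class TextDemo {
  public static void main(String[] args) {
    Text t = new Text("naïve");          // stored internally as UTF-8 bytes
    System.out.println(t.getLength());   // 6 -- byte length ("ï" takes two bytes)
    System.out.println(t.find("ïve"));   // 2 -- byte offset where the substring starts
    System.out.println(t.charAt(2));     // 239 -- code point U+00EF at byte position 2
    System.out.println(t);               // naïve -- decoded back from UTF-8
  }
}
```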
+
+  /**
+   * Finds any occurrence of what in the backing buffer, starting at position start. The
+   * starting position is measured in bytes and the return value is in terms of byte position in the buffer. The
+   * backing buffer is not converted to a string for this operation.
+   *
+   * @return byte position of the first occurrence of the search string in the UTF-8 buffer or -1 if not found
+   */
+  public int find(String what, int start) {
+    try {
+      ByteBuffer src = ByteBuffer.wrap(this.bytes, 0, this.length);
+      ByteBuffer tgt = encode(what);
+      byte b = tgt.get();
+      src.position(start);
+
+      while (src.hasRemaining()) {
+        if (b == src.get()) { // matching first byte
+          src.mark(); // save position in loop
+          tgt.mark(); // save position in target
+          boolean found = true;
+          int pos = src.position() - 1;
+          while (tgt.hasRemaining()) {
+            if (!src.hasRemaining()) { // src expired first
+              tgt.reset();
+              src.reset();
+              found = false;
+              break;
+            }
+            if (!(tgt.get() == src.get())) {
+              tgt.reset();
+              src.reset();
+              found = false;
+              break; // no match
+            }
+          }
+          if (found) {
+            return pos;
+          }
+        }
+      }
+      return -1; // not found
+    } catch (CharacterCodingException e) {
+      // can't get here
+      e.printStackTrace();
+      return -1;
+    }
+  }
+
+  /**
+   * Set to contain the contents of a string.
+   */
+  public void set(String string) {
+    try {
+      ByteBuffer bb = encode(string, true);
+      bytes = bb.array();
+      length = bb.limit();
+    } catch (CharacterCodingException e) {
+      throw new RuntimeException("Should not have happened ", e);
+    }
+  }
+
+  /**
+   * Set to a utf8 byte array
+   */
+  public void set(byte[] utf8) {
+    set(utf8, 0, utf8.length);
+  }
+
+  /** copy a text. */
+  public void set(Text other) {
+    set(other.getBytes(), 0, other.getLength());
+  }
+
+  /**
+   * Set the Text to range of bytes
+   *
+   * @param utf8
+   *          the data to copy from
+   * @param start
+   *          the first position of the new string
+   * @param len
+   *          the number of bytes of the new string
+   */
+  public void set(byte[] utf8, int start, int len) {
+    setCapacity(len, false);
+    System.arraycopy(utf8, start, bytes, 0, len);
+    this.length = len;
+  }
+
+  /**
+   * Append a range of bytes to the end of the given text
+   *
+   * @param utf8
+   *          the data to copy from
+   * @param start
+   *          the first position to append from utf8
+   * @param len
+   *          the number of bytes to append
+   */
+  public void append(byte[] utf8, int start, int len) {
+    setCapacity(length + len, true);
+    System.arraycopy(utf8, start, bytes, length, len);
+    length += len;
+  }
+
+  /**
+   * Clear the string to empty.
+   *
+   * Note: For performance reasons, this call does not clear the underlying byte array that is retrievable via
+   * {@link #getBytes()}. In order to free the byte-array memory, call {@link #set(byte[])} with an empty byte array
+   * (For example, new byte[0]).
+   */
+  public void clear() {
+    length = 0;
+  }
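For example, appending reuses the backing buffer when it is large enough and otherwise reallocates via setCapacity (shown next), which at least doubles the current capacity; a small illustrative snippet, not part of the patch:

```java
import java.nio.charset.StandardCharsets;

import org.apache.arrow.vector.util.Text;

public class AppendDemo {
  public static void main(String[] args) {
    Text t = new Text("mark");
    byte[] more = " twain".getBytes(StandardCharsets.UTF_8);
    t.append(more, 0, more.length);  // calls setCapacity(10, true), keeping the old bytes
    System.out.println(t);           // mark twain
  }
}
```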
+
+  /*
+   * Sets the capacity of this Text object to at least len bytes. If the current buffer is longer,
+   * then the capacity and existing content of the buffer are unchanged. If len is larger than the current
+   * capacity, the Text object's capacity is increased to match.
+   *
+   * @param len the number of bytes we need
+   *
+   * @param keepData should the old data be kept
+   */
+  private void setCapacity(int len, boolean keepData) {
+    if (bytes == null || bytes.length < len) {
+      if (bytes != null && keepData) {
+        bytes = Arrays.copyOf(bytes, Math.max(len, length << 1));
+      } else {
+        bytes = new byte[len];
+      }
+    }
+  }
+
+  /**
+   * Convert text back to string
+   *
+   * @see java.lang.Object#toString()
+   */
+  @Override
+  public String toString() {
+    try {
+      return decode(bytes, 0, length);
+    } catch (CharacterCodingException e) {
+      throw new RuntimeException("Should not have happened ", e);
+    }
+  }
+
+  /**
+   * Read a Text object whose length is already known. This allows creating Text from a stream which uses a different
+   * serialization format.
+   */
+  public void readWithKnownLength(DataInput in, int len) throws IOException {
+    setCapacity(len, false);
+    in.readFully(bytes, 0, len);
+    length = len;
+  }
+
+  /** Returns true iff o is a Text with the same contents. */
+  @Override
+  public boolean equals(Object o) {
+    if (!(o instanceof Text)) {
+      return false;
+    }
+
+    final Text that = (Text) o;
+    if (this.getLength() != that.getLength()) {
+      return false;
+    }
+
+    byte[] thisBytes = Arrays.copyOf(this.getBytes(), getLength());
+    byte[] thatBytes = Arrays.copyOf(that.getBytes(), getLength());
+    return Arrays.equals(thisBytes, thatBytes);
+  }
+
+  @Override
+  public int hashCode() {
+    return Arrays.hashCode(copyBytes()); // content-based, consistent with equals()
+  }
+
+  /// STATIC UTILITIES FROM HERE DOWN
+  /**
+   * Converts the provided byte array to a String using the UTF-8 encoding. If the input is malformed, it is
+   * replaced with a default value.
+   */
+  public static String decode(byte[] utf8) throws CharacterCodingException {
+    return decode(ByteBuffer.wrap(utf8), true);
+  }
+
+  public static String decode(byte[] utf8, int start, int length)
+      throws CharacterCodingException {
+    return decode(ByteBuffer.wrap(utf8, start, length), true);
+  }
+
+  /**
+   * Converts the provided byte array to a String using the UTF-8 encoding. If replace is true, then
+   * malformed input is replaced with the substitution character, which is U+FFFD. Otherwise the method throws a
+   * MalformedInputException.
+   */
+  public static String decode(byte[] utf8, int start, int length, boolean replace)
+      throws CharacterCodingException {
+    return decode(ByteBuffer.wrap(utf8, start, length), replace);
+  }
+
+  private static String decode(ByteBuffer utf8, boolean replace)
+      throws CharacterCodingException {
+    CharsetDecoder decoder = DECODER_FACTORY.get();
+    if (replace) {
+      decoder.onMalformedInput(
+          java.nio.charset.CodingErrorAction.REPLACE);
+      decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
+    }
+    String str = decoder.decode(utf8).toString();
+    // set decoder back to its default value: REPORT
+    if (replace) {
+      decoder.onMalformedInput(CodingErrorAction.REPORT);
+      decoder.onUnmappableCharacter(CodingErrorAction.REPORT);
+    }
+    return str;
+  }
+
+  /**
+   * Converts the provided String to bytes using the UTF-8 encoding. If the input is malformed, invalid chars are
+   * replaced by a default value.
+   *
+   * @return ByteBuffer: bytes are stored at ByteBuffer.array() and the length is ByteBuffer.limit()
+   */
+
+  public static ByteBuffer encode(String string)
+      throws CharacterCodingException {
+    return encode(string, true);
+  }
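To illustrate the replace-versus-report distinction these helpers implement (an illustrative snippet, not part of the patch):

```java
import java.nio.charset.CharacterCodingException;
import java.nio.charset.MalformedInputException;

import org.apache.arrow.vector.util.Text;

public class DecodeDemo {
  public static void main(String[] args) throws CharacterCodingException {
    byte[] bad = { 'h', 'i', (byte) 0xC3 };    // 0xC3 starts a 2-byte sequence that is cut off
    System.out.println(Text.decode(bad));      // "hi\uFFFD" -- malformed tail replaced by default
    try {
      Text.decode(bad, 0, bad.length, false);  // replace == false -> strict decoding
    } catch (MalformedInputException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```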
+
+  /**
+   * Converts the provided String to bytes using the UTF-8 encoding. If replace is true, then malformed
+   * input is replaced with the substitution character, which is U+FFFD. Otherwise the method throws a
+   * MalformedInputException.
+   *
+   * @return ByteBuffer: bytes are stored at ByteBuffer.array() and the length is ByteBuffer.limit()
+   */
+  public static ByteBuffer encode(String string, boolean replace)
+      throws CharacterCodingException {
+    CharsetEncoder encoder = ENCODER_FACTORY.get();
+    if (replace) {
+      encoder.onMalformedInput(CodingErrorAction.REPLACE);
+      encoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
+    }
+    ByteBuffer bytes =
+        encoder.encode(CharBuffer.wrap(string.toCharArray()));
+    if (replace) {
+      encoder.onMalformedInput(CodingErrorAction.REPORT);
+      encoder.onUnmappableCharacter(CodingErrorAction.REPORT);
+    }
+    return bytes;
+  }
+
+  public static final int DEFAULT_MAX_LEN = 1024 * 1024;
+
+  // states for validateUTF8
+
+  private static final int LEAD_BYTE = 0;
+
+  private static final int TRAIL_BYTE_1 = 1;
+
+  private static final int TRAIL_BYTE = 2;
+
+  /**
+   * Check if a byte array contains valid utf-8
+   *
+   * @param utf8
+   *          byte array
+   * @throws MalformedInputException
+   *           if the byte array contains invalid utf-8
+   */
+  public static void validateUTF8(byte[] utf8) throws MalformedInputException {
+    validateUTF8(utf8, 0, utf8.length);
+  }
+
+  /**
+   * Check to see if a byte array is valid utf-8
+   *
+   * @param utf8
+   *          the array of bytes
+   * @param start
+   *          the offset of the first byte in the array
+   * @param len
+   *          the length of the byte sequence
+   * @throws MalformedInputException
+   *           if the byte array contains invalid bytes
+   */
+  public static void validateUTF8(byte[] utf8, int start, int len)
+      throws MalformedInputException {
+    int count = start;
+    int leadByte = 0;
+    int length = 0;
+    int state = LEAD_BYTE;
+    while (count < start + len) {
+      int aByte = utf8[count] & 0xFF;
+
+      switch (state) {
+        case LEAD_BYTE:
+          leadByte = aByte;
+          length = bytesFromUTF8[aByte];
+
+          switch (length) {
+            case 0: // check for ASCII
+              if (leadByte > 0x7F) {
+                throw new MalformedInputException(count);
+              }
+              break;
+            case 1:
+              if (leadByte < 0xC2 || leadByte > 0xDF) {
+                throw new MalformedInputException(count);
+              }
+              state = TRAIL_BYTE_1;
+              break;
+            case 2:
+              if (leadByte < 0xE0 || leadByte > 0xEF) {
+                throw new MalformedInputException(count);
+              }
+              state = TRAIL_BYTE_1;
+              break;
+            case 3:
+              if (leadByte < 0xF0 || leadByte > 0xF4) {
+                throw new MalformedInputException(count);
+              }
+              state = TRAIL_BYTE_1;
+              break;
+            default:
+              // too long! Longest valid UTF-8 is 4 bytes (lead + three)
+              // or if < 0 we got a trail byte in the lead byte position
+              throw new MalformedInputException(count);
+          } // switch (length)
+          break;
+
+        case TRAIL_BYTE_1:
+          if (leadByte == 0xF0 && aByte < 0x90) {
+            throw new MalformedInputException(count);
+          }
+          if (leadByte == 0xF4 && aByte > 0x8F) {
+            throw new MalformedInputException(count);
+          }
+          if (leadByte == 0xE0 && aByte < 0xA0) {
+            throw new MalformedInputException(count);
+          }
+          if (leadByte == 0xED && aByte > 0x9F) {
+            throw new MalformedInputException(count);
+          }
+          // falls through to regular trail-byte test!!
+        case TRAIL_BYTE:
+          if (aByte < 0x80 || aByte > 0xBF) {
+            throw new MalformedInputException(count);
+          }
+          if (--length == 0) {
+            state = LEAD_BYTE;
+          } else {
+            state = TRAIL_BYTE;
+          }
+          break;
+        default:
+          break;
+      } // switch (state)
+      count++;
+    }
+  }
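A small illustration of the validator rejecting a bad trail byte (illustrative snippet, not part of the patch):

```java
import java.nio.charset.MalformedInputException;

import org.apache.arrow.vector.util.Text;

public class ValidateDemo {
  public static void main(String[] args) {
    try {
      // 0xC3 announces a 2-byte sequence, but 'x' (0x78) is outside the trail-byte range 0x80-0xBF.
      Text.validateUTF8(new byte[] { (byte) 0xC3, 'x' });
    } catch (MalformedInputException e) {
      System.out.println("invalid UTF-8 at byte " + e.getInputLength()); // 1
    }
  }
}
```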
+
+  /**
+   * Magic numbers for UTF-8. These are the number of bytes that follow a given lead byte. Trailing bytes have
+   * the value -1. The values 4 and 5 are presented in this table, even though valid UTF-8 cannot include the five and
+   * six byte sequences.
+   */
+  static final int[] bytesFromUTF8 =
+      { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+          0, 0, 0, 0, 0, 0, 0,
+          // trail bytes
+          -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+          -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+          -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
+          -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 1, 1, 1, 1, 1,
+          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
+          1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,
+          3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5 };
+
+  /**
+   * Returns the next code point at the current position in the buffer. The buffer's position will be incremented. Any
+   * mark set on this buffer will be changed by this method!
+   */
+  public static int bytesToCodePoint(ByteBuffer bytes) {
+    bytes.mark();
+    byte b = bytes.get();
+    bytes.reset();
+    int extraBytesToRead = bytesFromUTF8[(b & 0xFF)];
+    if (extraBytesToRead < 0) {
+      return -1; // trailing byte!
+    }
+    int ch = 0;
+
+    switch (extraBytesToRead) {
+      case 5:
+        ch += (bytes.get() & 0xFF);
+        ch <<= 6; /* remember, illegal UTF-8 */
+      case 4:
+        ch += (bytes.get() & 0xFF);
+        ch <<= 6; /* remember, illegal UTF-8 */
+      case 3:
+        ch += (bytes.get() & 0xFF);
+        ch <<= 6;
+      case 2:
+        ch += (bytes.get() & 0xFF);
+        ch <<= 6;
+      case 1:
+        ch += (bytes.get() & 0xFF);
+        ch <<= 6;
+      case 0:
+        ch += (bytes.get() & 0xFF);
+    }
+    ch -= offsetsFromUTF8[extraBytesToRead];
+
+    return ch;
+  }
+
+  static final int[] offsetsFromUTF8 =
+      { 0x00000000, 0x00003080,
+          0x000E2080, 0x03C82080, 0xFA082080, 0x82082080 };
+
+  /**
+   * For the given string, returns the number of UTF-8 bytes required to encode the string.
+   *
+   * @param string
+   *          text to encode
+   * @return number of UTF-8 bytes required to encode
+   */
+  public static int utf8Length(String string) {
+    CharacterIterator iter = new StringCharacterIterator(string);
+    char ch = iter.first();
+    int size = 0;
+    while (ch != CharacterIterator.DONE) {
+      if ((ch >= 0xD800) && (ch < 0xDC00)) {
+        // surrogate pair?
+        char trail = iter.next();
+        if ((trail > 0xDBFF) && (trail < 0xE000)) {
+          // valid pair
+          size += 4;
+        } else {
+          // invalid pair
+          size += 3;
+          iter.previous(); // rewind one
+        }
+      } else if (ch < 0x80) {
+        size++;
+      } else if (ch < 0x800) {
+        size += 2;
+      } else {
+        // ch < 0x10000, that is, the largest char value
+        size += 3;
+      }
+      ch = iter.next();
+    }
+    return size;
+  }
+
+  public static class TextSerializer extends StdSerializer<Text> {
+
+    public TextSerializer() {
+      super(Text.class);
+    }
+
+    @Override
+    public void serialize(Text text, JsonGenerator jsonGenerator, SerializerProvider serializerProvider)
+        throws IOException, JsonGenerationException {
+      jsonGenerator.writeString(text.toString());
+    }
+  }
+}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/TransferPair.java b/java/vector/src/main/java/org/apache/arrow/vector/util/TransferPair.java
new file mode 100644
index 0000000000000..6e68d55226266
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/util/TransferPair.java
@@ -0,0 +1,27 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector.util;
+
+import org.apache.arrow.vector.ValueVector;
+
+public interface TransferPair {
+  public void transfer();
+  public void splitAndTransfer(int startIndex, int length);
+  public ValueVector getTo();
+  public void copyValueSafe(int from, int to);
+}

From 16e44e3d456219c48595142d0a6814c9c950d30c Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Tue, 16 Feb 2016 16:02:46 -0800
Subject: [PATCH 0004/1644] ARROW-3: This patch includes a WIP draft
 specification document for the physical Arrow memory layout produced over a
 series of discussions amongst the to-be Arrow committers during late 2015.
 There are also a few small PNG diagrams that illustrate some of the Arrow
 layout concepts.

---
 format/Layout.md                           | 253 +++++++++++++++++++++
 format/README.md                           |   5 +
 format/diagrams/layout-dense-union.png     | Bin 0 -> 47999 bytes
 format/diagrams/layout-list-of-list.png    | Bin 0 -> 40105 bytes
 format/diagrams/layout-list-of-struct.png  | Bin 0 -> 60600 bytes
 format/diagrams/layout-list.png            | Bin 0 -> 15906 bytes
 format/diagrams/layout-primitive-array.png | Bin 0 -> 10907 bytes
 format/diagrams/layout-sparse-union.png    | Bin 0 -> 43020 bytes
 8 files changed, 258 insertions(+)
 create mode 100644 format/Layout.md
 create mode 100644 format/README.md
 create mode 100644 format/diagrams/layout-dense-union.png
 create mode 100644 format/diagrams/layout-list-of-list.png
 create mode 100644 format/diagrams/layout-list-of-struct.png
 create mode 100644 format/diagrams/layout-list.png
 create mode 100644 format/diagrams/layout-primitive-array.png
 create mode 100644 format/diagrams/layout-sparse-union.png

diff --git a/format/Layout.md b/format/Layout.md
new file mode 100644
index 0000000000000..c393163bf894b
--- /dev/null
+++ b/format/Layout.md
@@ -0,0 +1,253 @@
+# Arrow: Physical memory layout
+
+## Definitions / Terminology
+
+Since different projects have used different words to describe various
+concepts, here is a small glossary to help disambiguate.
+
+* Array: a sequence of values with known length all having the same type.
+* Slot or array slot: a single logical value in an array of some particular data type
+* Contiguous memory region: a sequential virtual address space with a given
+  length. Any byte can be reached via a single pointer offset less than the
+  region's length.
+* Primitive type: a data type that occupies a fixed-size memory slot specified
+  in bit width or byte width
+* Nested or parametric type: a data type whose full structure depends on one or
+  more other child relative types. Two fully-specified nested types are equal
+  if and only if their child types are equal. For example, `List<U>` is distinct
+  from `List<V>` iff U and V are different relative types.
+* Relative type or simply type (unqualified): either a specific primitive type
+  or a fully-specified nested type. When we say slot we mean a relative type
+  value, not necessarily any physical storage region.
+* Logical type: A data type that is implemented using some relative (physical)
+  type. For example, a Decimal value stored in 16 bytes could be stored in a
+  primitive array with slot size 16 bytes. Similarly, strings can be stored as
+  `List<1-byte>`.
+* Parent and child arrays: names to express relationships between physical
+  value arrays in a nested type structure. For example, a `List<T>`-type parent
+  array has a T-type array as its child (see more on lists below).
+* Leaf node or leaf: A primitive value array that may or may not be a child
+  array of some array with a nested type.
+
+## Requirements, goals, and non-goals
+
+Base requirements
+
+* A physical memory layout enabling zero-deserialization data interchange
+  amongst a variety of systems handling flat and nested columnar data, including
+  such systems as Spark, Drill, Impala, Kudu, Ibis, ODBC protocols, and
+  proprietary systems that utilize the open source components.
+* All array slots are accessible in constant time, with complexity growing
+  linearly in the nesting level
+* Capable of representing fully-materialized and decoded / decompressed Parquet
+  data
+* All leaf nodes (primitive value arrays) use contiguous memory regions
+* Each relative type can be nullable or non-nullable
+* Arrays are immutable once created. Implementations can provide APIs to mutate
+  an array, but applying mutations will require a new array data structure to
+  be built.
+* Arrays are relocatable (e.g. for RPC/transient storage) without pointer
+  swizzling. Another way of putting this is that contiguous memory regions can
+  be migrated to a different address space (e.g. via a memcpy-type of
+  operation) without altering their contents.
+
+## Goals (for this document)
+
+* To describe relative types (physical value types and a preliminary set of
+  nested types) sufficient for an unambiguous implementation
+* Memory layout and random access patterns for each relative type
+* Null representation for nullable types
+
+## Non-goals (for this document)
+
+* To enumerate or specify logical types that can be implemented as primitive
+  (fixed-width) value types. For example: signed and unsigned integers,
+  floating point numbers, boolean, exact decimals, date and time types,
+  CHAR(K), VARCHAR(K), etc.
+* To specify standardized metadata or a data layout for RPC or transient file
+  storage.
+* To define a selection or masking vector construct
+* Implementation-specific details
+* Details of a user or developer C/C++/Java API.
+* Any "table" structure composed of named arrays each having their own type or
+  any other structure that composes arrays.
+* Any memory management or reference counting subsystem
+* To enumerate or specify types of encodings or compression support
+
+## Array lengths
+
+Any array has a known and fixed length, stored as a 32-bit signed integer, so a
+maximum of 2^31 - 1 elements. We choose a signed int32 for a couple of reasons:
+
+* Enhance compatibility with Java and client languages which may have varying quality of support for unsigned integers.
+* To encourage developers to compose smaller arrays (each of which contains
+  contiguous memory in its leaf nodes) to create larger array structures
+  possibly exceeding 2^31 - 1 elements, as opposed to allocating very large
+  contiguous memory blocks.
+
+## Nullable and non-nullable arrays
+
+Any relative type can be nullable or non-nullable.
+
+Nullable arrays have a contiguous memory buffer, known as the null bitmask,
+whose length is large enough to have 1 bit for each array slot. Whether any
+array slot is null is encoded in the respective bits of this bitmask, i.e.:
+
+```
+is_null[j] -> bitmask[j / 8] & (1 << (j % 8))
+```
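In Java terms the check reads as follows (an illustrative sketch, not part of the spec; under this draft a set bit marks a null slot):

```java
public class NullBitmaskDemo {
  // True if slot j is null, per the formula above: bit j of the bitmask is set.
  static boolean isNull(byte[] bitmask, int j) {
    return (bitmask[j / 8] & (1 << (j % 8))) != 0;
  }

  public static void main(String[] args) {
    byte[] bitmask = { 0b0000_0010 };        // of the first 8 slots, only slot 1 is null
    System.out.println(isNull(bitmask, 0));  // false
    System.out.println(isNull(bitmask, 1));  // true
  }
}
```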
+
+Physically, non-nullable (NN) arrays do not have a null bitmask.
+
+For nested types, if the top-level nested type is nullable, it has its own
+bitmask regardless of whether the child types are nullable.
+
+## Primitive value arrays
+
+A primitive value array represents a fixed-length array of values each having
+the same physical slot width typically measured in bytes, though the spec also
+provides for bit-packed types (e.g. boolean values encoded in bits).
+
+Internally, the array contains a contiguous memory buffer whose total size is
+equal to the slot width multiplied by the array length. For bit-packed types,
+the size is rounded up to the nearest byte.
+
+The associated null bitmask (for nullable types) is contiguously allocated (as
+described above) but does not need to be adjacent in memory to the values
+buffer.
+
+(diagram not to scale)
+
+## List type
+
+List is a nested type in which each array slot contains a variable-size
+sequence of values all having the same relative type (heterogeneity can be
+achieved through unions, described later).
+
+A list type is specified like `List<T>`, where `T` is any relative type
+(primitive or nested).
+
+A list-array is represented by the combination of the following:
+
+* A values array, a child array of type T. T may also be a nested type.
+* An offsets array containing 32-bit signed integers with length equal to the
+  length of the top-level array plus one. Note that this limits the size of the
+  values array to 2^31 - 1.
+
+The offsets array encodes a start position in the values array, and the length
+of the value in each slot is computed using the first difference with the next
+element in the offsets array. For example, the position and length of slot j is
+computed as:
+
+```
+slot_position = offsets[j]
+slot_length = offsets[j + 1] - offsets[j] // (for 0 <= j < length)
+```
+
+The first value in the offsets array is 0, and the last element is the length
+of the values array.
+
+Let's consider an example, the type `List<Char>`, where Char is a 1-byte
+logical type.
+
+For an array of length 3 with respective values:
+
+[['j', 'o', 'e'], null, ['m', 'a', 'r', 'k']]
+
+We have the following offsets and values arrays
+
+Let's consider an array of a nested type, `List<List<byte>>`
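The same arithmetic can be checked with a small Java sketch (illustrative only; the buffer contents follow the `List<Char>` example above, and a null slot simply repeats the previous offset):

```java
import java.nio.charset.StandardCharsets;

public class ListOffsetsDemo {
  public static void main(String[] args) {
    // [['j','o','e'], null, ['m','a','r','k']] -> 3 slots, 7 value bytes
    byte[] values = "joemark".getBytes(StandardCharsets.US_ASCII);
    int[] offsets = {0, 3, 3, 7};  // length + 1 entries; the last equals values.length

    for (int j = 0; j < offsets.length - 1; j++) {
      int slotPosition = offsets[j];
      int slotLength = offsets[j + 1] - offsets[j];  // 3, 0, 4
      System.out.println(new String(values, slotPosition, slotLength, StandardCharsets.US_ASCII));
    }
    // Whether slot 1 is null is recorded in the null bitmask, not in the offsets.
  }
}
```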
+
+## Struct type
+
+A struct is a nested type parameterized by an ordered sequence of relative
+types (which can all be distinct), called its fields.
+
+Typically the fields have names, but the names and their types are part of the
+type metadata, not the physical memory layout.
+
+A struct does not have any additional allocated physical storage.
+
+Physically, a struct type has one child array for each field.
+
+For example, the struct (field names shown here as strings for illustration
+purposes)
+
+```
+Struct [nullable] <
+  name: String (= List<char>) [nullable],
+  age: Int32 [not-nullable]
+>
+```
+
+has two child arrays, one List<char> array (layout as above) and one
+non-nullable 4-byte physical value array having Int32 (not-null) logical
+type. Here is a diagram showing the full physical layout of this struct:
+
+While a struct does not have physical storage for each of its semantic slots
+(i.e. each scalar C-like struct), an entire struct slot can be set to null via
+the bitmask. Whether each of the child field arrays can have null values
+depends on whether or not the respective relative type is nullable.
+
+## Dense union type
+
+A dense union is semantically similar to a struct, and contains an ordered
+sequence of relative types. While a struct contains multiple arrays, a union is
+semantically a single array in which each slot can have a different type.
+
+The union types may be named, but like structs this will be a matter of the
+metadata and will not affect the physical memory layout.
+
+We define two distinct union types that are optimized for different use
+cases. The first, the dense union, represents a mixed-type array with 6 bytes
+of overhead for each value. Its physical layout is as follows:
+
+* One child array for each relative type
+* Types array: An array of unsigned integers, enumerated from 0 corresponding
+  to each type, with the smallest byte width capable of representing the number
+  of types in the union.
+* Offsets array: An array of signed int32 values indicating the relative offset
+  into the respective child array for the type in a given slot. The respective
+  offsets for each child value array must be in order / increasing.
+
+Alternate proposal (TBD): the types and offset values may be packed into an
+int48 with 2 bytes for the type and 4 bytes for the offset.
+
+Critically, the dense union allows for minimal overhead in the ubiquitous
+union-of-structs with non-overlapping-fields use case (Union<Struct1, Struct2, ...>)
+
+Here is a diagram of an example dense union:
+
+## Sparse union type
+
+A sparse union has the same structure as a dense union, with the omission of
+the offsets array. In this case, the child arrays are each equal in length to
+the length of the union. This is analogous to a large struct in which all
+fields are nullable.
+
+While a sparse union may use significantly more space compared with a dense
+union, it has some advantages that may be desirable in certain use cases:
+
+* More amenable to vectorized expression evaluation in some use cases.
+* Equal-length arrays can be interpreted as a union by only defining the types array.
+
+Note that nested types in a sparse union must be internally consistent
+(e.g. see the List in the diagram), i.e. random access at any index j yields
+the correct value.
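To make the union access path concrete, here is a hedged Java sketch of reading slots from a dense union (the child arrays, type ids, and values are illustrative, not from the spec):

```java
public class DenseUnionDemo {
  // One child array per relative type: type id 0 -> Int32, type id 1 -> Float64.
  static final int[] INT_CHILD = {7, 42};
  static final double[] DOUBLE_CHILD = {3.14};

  // types[j] selects the child array; offsets[j] indexes into that child.
  static final byte[] TYPES = {0, 1, 0};
  static final int[] OFFSETS = {0, 0, 1};

  static Object slot(int j) {
    switch (TYPES[j]) {
      case 0:  return INT_CHILD[OFFSETS[j]];
      case 1:  return DOUBLE_CHILD[OFFSETS[j]];
      default: throw new IllegalArgumentException("unknown type id " + TYPES[j]);
    }
  }

  public static void main(String[] args) {
    for (int j = 0; j < TYPES.length; j++) {
      System.out.println(slot(j));  // 7, 3.14, 42
    }
  }
}
```

For a sparse union, the same lookup would drop OFFSETS and index each child array directly with j.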

$Tz|>{&V05 zHCYFBqf=qlzjyub$N0ZF8#toNooGV*XM-F^O?%}+57&93yg3r$!mwgFFA-GfA`LCD zDPn@#^?mL+ZRcMNRLrZ5`YrWYt&;H=rF{knb zVp2*?5Cc1$e`n%AVSn)Tm-pEa_u1x<)o6}L#_xhw$c7O9j?mXbZHV>xkS19;l8`fQ z&f`KCGQ5$fFky0vTwoLVzZPEk4p^u9kQC3~`~N*hR-7#r!G-cO?As3)lRM*ua>eO2JtuXO@`S1i^$omew=Rw4JwbifkukR+G z^@Q{crg2(zg<_0d&}Qp`kqjKV|GXDFLg_>{LISVflF#G9Jz?ARYJ_U1z-lNZ z<6OF3>(1#i=P*SG3XufW_x{T{yQ6nC?b=ploQB_FHM(H%^u4NB()B1_Upszo^15CL zDN!rIEj78A)M(qU9(cGrQ`GMekHFR1l!zpZ4ZKQYsA#U1!FlYJ?u~jbeVRCQce>!V zbTCnbWn=nQj0a*dWTG(I^v_!Huqk0yV9_m(^+WbW*}d15yvbgpzpb+-I67r%yZfnw z90slEb~xhro+BUe4C!dG<<_D1dA;+Fa`i#OTGT?53*TO8jZ&==laVC&C6Ss&M07BPnca$DL^A6C;ARi~@4;D$;)}Ll_3* znMkn=?5Z>wX5_#@{nFbSGkNcDBAePVuz)7dqAOZ|7Xu@_yxpIvF;80|b)7a$V#10< zYQi1($Q)8xKO2TctIgf#D~oRzr?IvwhNa#h_d_h?_F~U`QoUFeG24BydZ+WVX>J0J zzFo^jOzKa(H^mSyh8^1e4z5^ZGF;dTL5!jy$7tniOE<2L0psfPSy`4AlM1Ki!}n_+ zRtDklY8k&;{i@>y=5MJlpzD1(V{TPi1y3;An)Bd5kg;4n%)6Ho3vO| z?ZPax1M^no2#?pCO+7i$8?UWQH#18bO~w1)7h2*JB-H%{$!`eRD?ae($Le^{4NZe?!fTFHr)!8Qh8;cx-X_@=Z*+|#G|?`Zad+5nPxmW3Q3IZ$xkm6YyL6&(*V?Bo{uaX>W@uOqXF%7o ze_)91j#Y)YoKFbXZp5mpLKEc(T^Uf{LaSm>4ybS&GS*Tguo0LN=}*_9h3fJT>gJf$ zvTW)XaC2e5KI_$DdD}cqt5{habYEI#bMr2z1Q%;oM=YrXtZw60R-RckdG+1aQQPih z$?fh~0c3epv7s$>>BF06BcDUNuIBHIi#&avK%Y2{C2eH)rWiT3*Eov9?NE21?P5x& zkUsRTYUgvh_AYI<*Xn?Th!f9R#MaLynf6~g*-3EHK{eH5)yEb2Nn{d`O{q!-85+Gc zjTsaAs_wJEICUV`egb+2PSWUIP^d{wbbK}|gY-CTQO2wk{7YubYSXxv_RH-m3OwDh zGiK>FJyFkf3KcTt`st(p7*eY5%Rf_1?+f+KZrZ{ag>Q#y8ZCKU#{+SbQ5X3=HW!i3 zy9uHr*9Hu)JTI;H{n@_jRhvlHsnNCl88Z#US5YkW z`zo^KDL(Iqn=RGnj?%+#WQsH3B742NcipTcIyPBt*w~Q18Ei?>q%fV~>&`_I^TcsH zlk>V+2_a?eE7PfG&N?Jc8njYT_+t{pz$Ai!+I>y$$NMY98kZN570LYH4zTpmbD100F>#}98&=@? zMCe9-tNjTgj+6k5aNE)+vceG0)SUDXja!;v2jaU9EqiqL_%0orY}g?&40vdV=^ zF3ftr%RHj5#GI?az35KKpbl$4@eO#bk0b(VYDCzgYrWAzzqj)+bW69mm%M6jT5ASG z@Ak0tU#a=3RAA_vTq`5Y7OMbXiv<%~7G?f3E z9ZyO3M{@%hlt+ETSdN;{P;K>q?}rnfZ*N#ogtO63{DZclfk6tAH8Ere_g=@@)o=q5K zO;S`;ehkX@&!4Qd9Pf!F;(J3glJ<8E;-tVDW17o5ifUiSyv| zzd3`fqd<=KZRfw2P+`U_$8~9`WJ6*Z#2d*0C`&gU_N(%r17u+UX;3*#O_)P09^wa8 zfSvX`uX0-db3lp$>;mZw&I_%7?%JOKw%Y#(kKsozy~Ql-u79&TCUGlB-re(C$P++WU7N_<>i z0&a8{pot<7C^LCopyt5jgUf6(U8ba$ekahWK%z{r+zCJ_6#yAVK0a817#@Q9{AaQn zh=sy0VbymPfx#xh!lMwX0`RhY)T~?}gQw>_2I4wK$XlVSd8ZDN;GXl{NrqEY%E9}~ ztEF%p2KJ(6o4LB&4>f&5U?&_ecjR6$zHM$WAIZ|O`Kp$-`t7Kd&q&w8K|I_SK}-tZ zxR+=Y)M3Iw7!6m8o^PNu&n)1sA;B_I-}B(>)sj#4(jyA@e0#}x>+NAl~pMi*2yl#hgyCoIn)kOUOsK%fH5vQ0vJ=uZ&7eCm1C=qGCG6bgFUvi z3-UCj&>kGrh!YnuA+&=J0wzvDgPwaSp%wceqn>rFvUETCZzoDb}(u_Z&vmH40 zd?a9g3pSviC^|?!5sHC7bfopItOoPdpH3cBL4iw2;&m*7N5_E|Lh&fP(Vtir_inP- zRA&y%s+ssSU!slH1|Bps4;7Rkx90T9nI!jDCSs{gM0 ztx*^6hu_Y>?YDD7(k2loVP;2jUjwgBDKPa;u-e^)q z3@)#)c1waK)%_sJjp6hYEfp$)MAmv&-vaX&ku0s(-@VJf9n__Bcfu1{8=R1OU*~dX z9S7mPwUCDFR=8`tHDrCRZzgRYOTw-3?}E)D_$Re&r+UQKf0{)*ZB`e-4S->owLxET+bv~{r zq=i=5bnFWE!S|p~;JE$C$=Y|>1?8Epy`jzT;PF0{giASdFw@!kf$fU5gEA8oRR)5d zg;pgY`5>+-7j%_k@0?*R;o-=oXnc6>3PlU_yW=Dj6qUy@Nm2omK6GUVi7mY_q@kmT zuGh==S4)cP5=1r=AXw`cGegXi7I^f&*kVzPzD(bEQFkBDG7&9_m70eUp~o(ngnG&& z7i&7fn{_O0M{%z*NHl*T^up*QwSnhkQYl0BZY0b5_fDT?^L6Z?(x*yviFL38BSRAH z+U_swKvsLRnNGQn9N}OQAl$D_@|*sp&gRlaP5>$)t3f-ij&-buj&X04z{P~hkCBgL zCnjoKXX*(c4!6q+`$r@Pf2UCDfX%@6Cwgh8;qEa9|B0cz0cOOKYK5PvWm&mIy-dfh z1uSoCDZ$DYraJ!t;9exqJT}&XNu5ZETn=Wx9s#5Zz>U4-XV$15J-3C8@2{FTw-~^7 zs9$hpUWA)(N{k?9^}0UJ>C|86cc!h~8%h=Yhzdd0#%;Vahpo4^A6f&txZM57F>Ie4Pb1IX0S8Ut^aD8;ed~VP-TUrDQUfP=kgWUX z#^-d~)9-EsUeI#Mr~q*eI)?SeR_*9{c{tO7<_+m^e9uN*PWU&e)Sq-2aar%}jYLSux6gJAnbs+a^L)Yo^MH)Rrz z)9B@I;J&ehy*m6Szpl)WHc~LtX+d+j*Tmt>n%A6$>DoKk7_xRX5fP>6zu%$8Ew;wO zrD55bq}A-%2kH3dr%ji$RuxUV^cuMkhCrMQUsUiP$ymf3nXYScDevO&%jJ_vV&9t4ggG@(0IzPK0NX4G5b-5Q7`vZd%% 
z?vnf*jyD_0x9r=>0C0?gg~Lv(+{N=_M6k!qZQ;`jZohs~1b5|-&^s{yd#~cCEc64n z>yF8DyW;OW=`dwdy1X>WvHh7K*T{;jAgbZyei)g3u6~r?3G_y_C232ir z5JH$2QQFmULdbQXT##ha4DL-a*A(y^58s^??R8F84NMmafZv#j`GeEt z0EQdhR#k%iiDB>Qk&#-)&0TKCQ`$?k_FdrpxxFFGFR(F^bZmDb$?y6ZDm3M;4i_@t z^(Ge*Gjq?|zb5+ZX6z*R#O*Va!iCXS@()cm(;&)L#LJvn%9;WP@-I%OhAAOimQ(h! zm3nbGZ5&JFuG4iqwF92gNBbEGH=&@;f80=FPLqtF8XRuhKr zupy7rS5fWAXf$3?(9-4Oa3=!I{U<#lwCZe8n z@#wt2Rl19vUh#R^OBYQ1eZD|8`Gw@Qo1QhY1kp-a68Ri;FR$aiudGoUo6jM8TsuH9 z1H6QVRF1_MQERqs_pZ*Wj6WS<#;#A+X2RhFre;f-ntL2fxQi12xMnoVpnc{zVN{vG9gOAiFt_}8 z?*phWPnEo9PtopAxOP$U434>Wt>Zn0Jl9%Q02WkvLfjnN%E}&owGfTHEsz}=5Tomz z&u!;}uA367UZ$P8#3TEibm1cbhb{)5WId3 zr+^$SygHP@z@?lw>~0>=utdSye@$VuJ5`26KzuxR1I0gK4VL|Ix{A_PWB$doclXb@ z8dEbI#5ZQEU17x9VkBbE)a{j2I4ys6;e>IuD^O=|a4NXx*@%7(6Xrqx(sk5K+*8CB z4$?^jwjk4p?g+d*{AC2AiYb#sz16L-gaC5d67meuk7WLFzR7HbnB~=s1fzSEyM9P~ zxRQii)+y4l>zf@?a2zJYy}8cN*m2o&PV3Jnz6T*`%^7uv5Zv5sW&dbASn}z6x!6r< z7i-bE{Y8^i^gLf%{a%wPyv}=)28J4vtN&1}f`0)+fzAR$+&^8RFZ4T5=^G$odP|&k zBN4i?e3QbBgv57qK%}$8sFeS;Uz#+btc_JNh8+g{KZTz<2ZTd_g1+oD_K^3LSA?Me zfa(cwSUkci7?4D*FAa)D5})WfNP-f&TpR)0Qe7?z2U0$D9eFP1CZs>fA3p4HTnHbj z_V?TqwXA$%9Np zQ!$#C_FqPMnVO6Uj|BySupt8j(pet7V`-seCf46v6&m9hWS*-9x&P`Pp~9sg^9;K< z|KbFBlRuxgAk%DqiTzKm3Su26c%iB+5*LMkM*S}&^6w>8G=P_w8qXUB`vHUk@Nq{3 z9smMX$Y7{t4KDX*knot1QSBny9{o?kF~Fd-&>q8)Tg_iID2~x!07tD;ZG!klKJDd< zMPB4Xta>h91TIq=kJIP9?;x#tTc-UL5dFzM+;i47^lC=umJ@lB(IA6ba+!o<2F$Eg zMaz^PJ$M{+Cm6Q(rCk{bO|&pLq6Rmo8^rcbAa?5P0Z;pGuFfXtoT_%L9&8iw0vD<7Ayr!$I!9mE{>_dk-S5Gf6<)QxbPywS1sD#B-*}b1$3&A6PW4V0U(vy%*pX zG$8)x)qKFhtO`l)8qQ$2pN~l@@@ypNd*1}bR)B==m#)(g2(7OGxMgNgE5=qN22*mW zz<02IFbBN*0R2zEVTTd(IGO|@F)y~#L6E;L@DDN&{S2%SU{h$mzO-As1JpGL=@Daq zeW=yYbqI2~z4)cCX!JMITbGp8iK<&omCj&D5_iG^jdr^ppY6ttfayV_;mF@E|p`w^u zVbnW1siB`adcny}3~>RXh!#aJup=nM$xdM3dYy>JYg<#{-{E>lS-dX?FpyBsJky1B z9I-Cy2KBO;=BqVOzMejJMvt2ZPx-b)or&EC&&pIW?)m7TjKsDN)oZ>GMw=ytKY){67mUBfhsqhJ@s3<*&-T?> z0sQxuCnvq6SffC>7N&z}$uc!XRiW)d2nO-D-9Vzczxb81jXMjCTy}Tc=JQ30{sUi)u=!UneRw-)j84nUqm&6zN3i5i*q zc6&Z)SUniceZr<$nVHYn!XFWGwlpgZpl{!Q zR>fS5!hgmZK|oN&Sj#ei1jOSh@hYI%QQ-{Fr5cSfjN^YYz9D06jnXB4@d zt)AOZQ;_TgJu^GLeG#+JW|M-?nsNXfkB{$oEaPG#ikZqsFw2-Awn(p*7L8EEK*1oe z?VrpTMKEnMWUT^YVEbiR2&d(9f1zi>x~#0{4HBif8ONpx0}&(LVg{l#`B?C#VGzFC zLFF?lZ2*+*XlWi}a^U4PNS2TySql09&yt$?EZ{wmu@*+z`Jc5v$6h|olhgw0SSd;J zZt(^DAkY=cOS#nyf$Lg_P_FZhDkH}7IpE5vHb754s&cVFZ#;5+(>tOcb_(>+F>9W&wekr{JtL{ae_6Nga1E|+J3On!ALnh z7Qx)K4fZ)oPx9MAfN-sq8sZ)Doc#Mdk6J=%Uf~^~R8pOTL{IDEz(@ho+I@HW5|OL4 z6jNQ{Y>GYphhGMs{7=89Bq`!9 zJf<7Z+W<#tZ@?~i@mizQv6ypQR0`+SMXxCmT;(+npetIaV>Jemx68jBI;P%)yvv4NReg%9jso_^yVDRg%PG9g;~j@+^Y*hE|{aqF)K6Ag<4eWMy$;oI)nye>Khf6mtQr9CimLoaK|R&m=YTa^9~+ za)SZ^ukp`7ykC{Dzpj-T>`bIXPrzb8{DLNK>%c?Rxl zxWbvbz)`a~;)xP2n$sDeN!u{yqAr3l%KucgMYv@uHZYZcK zB}TTr*y5>KzP*Mn7DEJ5gt=n15@lr&Lb~d1t;d*hw zae#?qANKSa180KiBgYhfBoVjkC%ItG^&}Lv@c`g{LB6Bn)fYqFOY9_Pd+{X!Oh6rE zL&(bAK%n&UBZ4kiN}yj;`dX)L-a&Uj7@#DO9DLd+ngx-(Kgc&#_E)){j*sP}h<209p6k+wD-XBG4V|z3Kp`^`&XIxl zND%*n%aD5!fd7X>lBtu_ z67m0xOAOrD4~&fP|6~N%*&ty?wo*(kCej< z;&C@evfC`iJ^?t&4Juk3)_NWvuG)ZP@mz%Hjq1sadG?sVyHBz}JM1g#2N=b1b1*dl zustpZR5|mzVLnBcxp$XbD!VvtCpxAJlq~dOIM8WpUHW~VfK~DhFSnj?wXZo zGI@TwG2nB1YX{N+Eu2BxGXPJ0+M%E3d+fP0xT#|3HlI8L%PX}Ox5@U23gZht3`AA~ zp3c$}2u9W*V*x0*641zxAbkLooCl~y3f+>uUyoNT*Qi(&yxmt})B9LEiNZC#3~c*o zyl@0ir;tD?sqN-&O1ysuWSf{^Q7Zu1&_=K=_ZuA6c!8oDDp3X`65dE4I1H;5tBUAz z8uuVSMAmYu(ZCcyC^Nc|@}4AJ7?7D7X+}`R_XC_=L3QsunDL{QtGCv=+w*`FQZ>|L6z=4FH}OKa3`d`OK>16r z+&jyOFBrU8kjjguuOQ=UmC^%NABZ{Xx~;?*j_nst6(jSmO7NiNY7w`c3U^Xg+*Q@a zTzDBwOm!mJzB@qR-8mC+BG3Cl4u32J9^s)^lpI3}7TO%G3Z!c`fS3&1cA9ZLjTU-n 
z03@Z&_5S(dyOOeg+#q;nm^T=s%5PMu)S?mY0@?vmegF^v!{w(mZIHJKi&izP&Qo>+ z_c=gZ`Cuz}vKe(D$o5WC+w@GbBD8EI>5TSQiU-~I>~P@mxZGWzOz*)Gflz1#vR?m_ zT(&;@``fE6eI*vnN~lEmIAWrYI$z%b(ZTbp?E-JDDjQ{$?afYMwgpDOJOCdrvOmv2 zb<1-)Z;wU`{npGTvhh6G zRP_$rB?qIB(y+aBZdR~U<#+>V)^J#f<$65au^2To#w^oB3B^a_7SuyFUnG3>O07NXX*r; zx#Z8Mq+Y&H`2mE}cbQZxC&Sf9(P@;0gYe+Bs@QkZASE)AfubG)snu|+#qZvS6_DZO z@4Q%=xh%8uAo*}U_A)ghyGy}f@!KIwBm(9vcP(CX;*7R9{cX4+-$5-1c>c)(t+Uet zB-*fVM($_cT2!xSc8kA%VkS$ZtynDO%E`e8(qvA{_Uf5Sm+uWJvTF8;Cm@%i`N>p` z#N#k3i50vDK#|b-=<5?xJnc^EWL0H_?2a3GcZ5JQ|g;l|!gc-wnQEQU_TRev!QLNlSq_=(7R{+L6vr>avU>dS9SK0;G z-HY-^vYm>@QA`RPPB?qhf?PO3720p)fsfIxU`zLF;3NZ5f!ZjNWihvp%2#fgA#3wz zEn}a6I!B%4wN=%f!j9>)k1rFiIc9 zZCpFWGDI6h`f9BEOhCqyQ1E%IVY%K76kCwb0d^h2aqBqb8!}9nXt4bH;%0RPv}cH= ze4_$OgAV&}rR(I@2#G>$RK8^&JqkKY{jMYl+nT047{zT{E~Wsqs5T31`?}ho#em8^ z)mKVB$Qbo%HPj$4#h2J^-l5g~EXOW3lv5q?6#6OEjicjE;*Eb1GdvR#g+C&-7)AB6 zFq-uDb)0j88s6Jmp*!=9HHz;}%w$PF>m~5cTQ6b+;fe(Jyg{LrBX89ODHu&F znX0rGVs}B`9U`b+^$9K(fqf?H8?E8xHwupnKP_j{x)V%cfy(Yxf#-@GI~8@NH6BXV z0Fezu5?|++-7uWZROje+G{dV{Wjy{~Jb7D|B8~ab#Qsg95kTZ`*Qbte6DLPfWf!TE zevHPVzos_|7k^x)vn0=^CAHLM@TP z8T%4K8RgJJaA0-T%Z?Zgf%rOS5G-e;WLEhTFAZzlY_;nB$!DPA#%toBJN1!=yyC;R z3r6mI5tE_Zi?b4nXdQTSR$QS=58Z{A;I*IsdfLb%0zZu9t>80BQ4kBM@YTW)umhna zg~5K=P>rlX$xi}A33h%eta>)%i7=&BkgZZ85cFyo zVI5Cyv}T#m-MNiEZ% z@(EBZq@KzXhin~C|9(`cpcuV#PE{JC$LY$^mxn!jW?qSds4%sN@2HvMySl0x0m$N& zgX7D|@dAV=i`B{qxGrGl)>03NU0|t+j7Gm!>IW|rHykdHk(P?EML{vpTpJokyrO_h zA~3sw7T}60WDol)j{EMIAhdz^19VqTETv>M`3x+E*OqZi#3TlTI-p={SZc0*-V}CV z%!AaaMH=qex3XaXyY(!vq+DLLRpl?Y;AvCMNqsSzjzD2a|2gE78OEqms9^fJx=j8? zyd5+UM4*K-i<1$;*s?^=ecLO;>#eZ7{C+&zt3`h^MzO58prf?%61GyeM~`TGPyjIv ze~M1@j88|KzdOhbA@V$og?K&y4l_aoP{+~*oxH$99obSJDy21}%(DHz*Z~B}ugxW3m7{c}w^#Lf8b>e&5gqZiy5GYWySu zOYQFt83mh<9-v%Emyd%*-ibP?%gnMc&r#6zgR%2B4|&pMS$Q7% z*=BJgpEX>0|CvwW4*s&oF;BfV9iuQ0cXV}jelh6P$-#s&2Ee0yr$`Y{avk;+OOk7# zQ$&R%2vG9kVg{42^HnX@rN{K^GiCjl@EZ^$JE}c8I=#d31%=mn8$FG-Q|cqf_52o7 zR89OSP6_f}m<|{An#t=Vlmh|SWv?*WGy~Z+<((4N&4Ys zniE%-AeiyZI?}Rsz^)o@-On}!GmOB55WKCVGfuo^VO`!BNxXn}7iqe}Eukf52O=e& zU2tFLbapWBa(%FrqvU$NCE*#;)>yW6Gt!V*w@zTLp@V45D@zETUM=g|!ww?>qNWNeBPc=1tG^%OH6#4?u@% z!W&(O^11g*W=ed;NBuS!sx^;ho!#>)7t*xfS1b!!CEH&dYcq#(<+td!)Kv31HwWQyer& zA%oSZt!kQ@Y`r%V2I0(_E8WbAWQk_YKx+KOmh1xp`s?I?XuUlg&sZ9T45^Q2U-qos z^sJ>-Om)Uw~}uJuGR{qQJp;&DPq>4ULP>&tBjhfojnCr(3kA5S3} zzJhN7>7pY3+v5*y0PyqVEeu-q@zAn}XA~NG`w^+gp*xn-*1?^ejC?|#B4VsQwv|v? zL%SU-HE<=^*^Sze?w&uNzF3g@|D_H7(i*}ifNP)=bovUE z+`pVf3?;br|Bq;n|5qo5k;@NgH4x#g+g9!1FW-Xg3+4z$SS^u_lGS0g3owthwhO-# zHQt74{hE0GXCnRf5dcFImF!3xD-5bib}m5`DJt(aSs06ShRENlVo0GqY7j)g{h2Qf z9P$^HLR3oU)d<{Z8P`@$R^SZcXp}5L(Id5hFp2`~gODE})gj$lAkDuAObp{YpuD#2 z7MCb>hQ|IOkA#)MO{M$gin7Phg}pqM{hp%hgIZzPk~gceFHO`UASYo5foK0@piap~ z>3hv?F-q2z4C*%fn`lm`%~bLaAw-t&EAV71y4X6Xut`ih!Ph?v6-mQbHdq%3^T01H zA+?%>XJ6i|D=dR20dh*`bd|9ljyWhF&U9PcVaEAuNTF!Z2lT-DzR-5lR2p@MlfMpr zX)`+^%><@{2q2pLB%_hQS%^LReMD6K@v4wLTzCil8qh+ZOt}Fg? zAP5(5NJWZ4%vJ_spvoC)k#wBUVz{-8K$gqmv`X5D>x)ZEQU)i8kwH2YyKH49a$zup zO$hkiS^2`uMzU<=rA+A1->Lm+f(!!MM}fVynv*WV9nh_4szJGJ?{BJDYG$WA??01} z1Y2A3Le`)^OBiM-ojaS$Ce1~cLKcQrNe`mk`#oaB^lbPrV$-X^0P)?#B9sxRIlSvJ zuVq|P2$LZ*od79S*kB#l5R?#00^FM2)}Q$oeJotax)iy+gQMdbz zvp8k`*@iC=!QPcIBIHE)_W*?U{O>dVdp`c|E?^&OGDg@QD64>Seg4v7l)eN~E?+-w z6rpCO63k9}40s#BqFRGCt^J<`vUM*&dnebsi7#NYzgzhp1n9l#yQ8*EVA_BMnJa#l z-~*~T6oDEK7{FREQnveQ7~DJR(|1U)xP%kLvvWYUXL|WIR+@^{Y!Kf&1(Za{gB}3? 
z`$w>SfN*WR3Y3dRkZr?qbn$Q2j0mnBNYFyY1DZ(CC6}$8G$M1!P?mrvXqwEe*@)By z+^-(k;n{$t`OHnH<_^jXih-zFtlFJS;s*$&<9Vl%dqBJmgSz@s*kr95v*&<_2J)kN z4B`ozz_}7@tEiCz$S^wso{qnKR7hpt0Q^RIjIGI{g-C}t@Qr}rdv<`V7zXIk5E?p6 z7zXvyWx6IH)ByJI0}$oInXm-OL1kU|vM5>wXki=mKCOYM!=RtS`DeS|lhF5}vVmRo0tl2(D_6`6wHk&`XDSeY?fL@(=;|sE&g(s-5 zmH}N|PvJEQ>7j(=zM%7my&@lh2ao0L9pLZ21^>P}^4TpazQxANY zcOI7qbRQc^8KCw6RqTnOJ(!D_KjQOvy9AIBptvrEU#)!vR4;dkjL}<@c3)_kucM0I z%gr^Qlc~rc8xZiF9Q62hY#`gZ2TYxE0WQT`|AEIZS&sd50M;!&mdi2(_UX2Hf9Rjg!1o4s%p}GkzGQdb5HMgT@s|H$hRsn>>lv!N?-mmC-J(m+lu77V3A$ z4~Nj0I6gEFn0_zrdWh`0)a~hDrscV2Oh5o!oZ$h^Q>oM47#kJS9njIB%(>sdYu*S# zptlZq#e*%v<4O27jK(Do0iYvPrPN>2T(+FC60Tp>@R!h&OK|92O2^O z=$3$@^Rh^Wo*D)ei3lz+%wvK^Eo1ey?#HJ&tR=se=6kC)H3V=m_5zTk_>H&pj@sXwE-|qtGrr!i* z?ror*ar-A|ztk>PM#?j${tY-dhkiX0BVe|@gjS^(C5;HK68^F=(x2IyHGqJUY-zi} zUf(Hh)G?|&wpP%v8I1Z|CJ$6^(P@v*a3wHS z&O|Xc-axu}2sq5T+8!?>K4%x)--k`v1F^-6-Owr;2lkTrBw0Hpp}`_|&4;ukgd!%q z9W-V)S&UIA-)MS(O67$#hvK`%>mTv7a~nY0-HMF1Nx5fNXb(T*5O6Rs=3B#X{h0U_ zR6F=R1D()sz*a}FifX1er;ASEbUU2y+ly8V+71IfM8@9~SMMS&G2{PMzO9S;a(XiK zNFb6xXY#oXH{)`=TVn`$K9QPSJe^@Jkn?1F1gJ%NRRYrE?=d?*6DS~3LCUlkF{`pe z;W%9jQpq(6WD-|-US8z^rKLuz=L0B{S(_P=dcToh`|9ZUG)h%g9Aw{*0Z?dZAgYGE zXsz^E)tm*kFR~7*hY$^{qu<3IIDz!ufqKT>pB9sO#z>{*y$jwEjBFBPBVgbgy5yzP z1UlXE45VRdUJZutKpo0m&O0gYYZ_nYgA%x*`E3LPJ&~=M{LM%?oVqKk7hhDxL(Q=$ zJ+B=o*g-?c=c1h@NYRteKoZk)?Z!B!cd*Xpcur8AjjIg)!3U)IL<`Ux68x~X>> z4dyvV1Ov2Lw=$2 zx^_F#&5(*@p#4)o#5w#l$q#?4ZQHcVgSi^$-gF*RLov3uTwE`1lUY*acza|APV<`j zsMKyUhMCQGs&*f-|I#g5DtIH#(h?>rtc>cySPSIm&>`SzbqFQJRP>}TXw!Wf32 zGNb!n)+{qU03TZ5&>14U#1NZ_i;D zl>?oax669(mf7f%bsfXF1qD|!I%d%aYeLhpotgSTL08LHjo>IHdl7W!Q|7oAbMlRo zJ5PuNDt#p~r4^MCnZb-5SJeY0jGZ{)DL;LrE9h2#N&QyC+H!M&p#jntcX-n6J~r7n z7uF73Dd}%x5rKgDWye#h73hdI>qB8Y0&{0IK*P03l$&`M z0M_*zXGk749%(j|LdOoIFml+K_d|uFciSg+Hki7nR*A1^IdIwK@1w2WiLaX7AYfwvav4}i8LDRAvcXKxLi}w~5u({H%IM=H+O$p0! zIOP`XlGF_ig(R5VJ!fOi%x&7BEd2SP*nYHr&INjU|6|YnnhzuMhcce9tPv6K@H@lw z)gi3LW&h22-Y}R{I5Rt)KrPSeviHp-MqidrFSv%NX`kS~&nG5gK`Zr<>si9nMDk!T zq?;w^G?G}8uV!7U?g8BnfI?emU)LR-j7bm%-sd?kQ*hKyd#{Lnb1BTU94IDq>QKdt zPh4KLrWw3xDKSa(Gr5<8G<&}8ZRMdwV(+;|VlK%*7+8kI8-M0R$!21wMTK9OVFuTq zE_S`07tJNu$1TOvs#=3b^;+pMZSHJF!-pyvQ^uG7yrJHnZf}i`zkaE&`y7u&8@uKj ziVV1s*X`voUo{^ncK8}9ztk7cYlK#8f`UnHIZIFVMJSSQy4l9d&8=?t%9mT1UKpZT-C(u`=J3}!VJjY@t z2A&FEth@^KjC*6;$eIQBPrOziytC}h z>8yeN9E;6HYa$N5rDze?6&70Dky&$=e_?wcD^*s3P^PV*4=0SO9hVh1YgF36&{utb zGu>`8GvuTnlep~lsd8iq>?+%U)2#fG{XF5PtB&qwV|G)X z8K-rIxEoW4tdZ2KLIfV_1~Fs;X}CS{`9kH2vt29)BJql-!p7SboN@<2uQEq#_KZz3 zHK)ZTpJM(3>}dID74{hk-_bEcqQKm2CT}NFM+KK*xYv6uKcw80KDpsZV@UK%5X~P~ zPVs|e1ayalVQ01jz>Dy*$3sX>? 
zxN%Cw;M>(GBBQP<%}&Si*Xzrx(;fiQ%1GZArpTjqCq9{ zS2d-fI`oaZ(<_$7-w!Z}K8#0?c%kKwsK^wxuVC0wp4!GCN5!gQ>F%edICU5d6%ei| zq(n^B=+ZQ}ORp^lmc3@=&C95ylX%hFgT$b~Usn~lzVJ~#Q~D?$nLzt=bixv zg-enA?Kf3qPJh{-%$$wT=0@8e>>G)qM??jA;`o|YuXO!o&j&LLkkW*?+zez0J`k~$ zgT{ZYN`!+=(Eo@_cvLrNutVIWfUR9DM=_R{gWgjQyS91sv1S!K?acB!-UQr( z!k23IC3&L33@2m;vFo@-5b0;+A|hEs^ptRh{p=lfxFRs(6S5H>#+$8#MHt;HR)~UI zP;JdoJes|y{lzf}I|VoZi@(R8&Ncr+-D52lt1NwkWj=E2cYF@$S{4`5BIYwntd?3A zjQ(UtGeEdN;2pHMXF_W=$~Uzi@L}DpPhr!5pN)TKFfKUz1xdE(S%Oo3%@5A_P$g_; z#G>w-pF@oX7FY#%T0kWwO1O&eM2Syk@YEV2>lDXjZhs0-(-bOWJHk;NQQruxPzn8$ zw9J`#w(<*$7LPcq<<(3|{cF(i&}r=}CD>!HysNwr84|^G4)3`hD;bPNdU(pP3`jhO z7oQNxa6thkmO369S-mZ$_ATf*K6dRUJ;Z*^A%@@B0Z(*{qY%WO`C2^u%d8_Ve#b8; zA%i-F{C5cd2YYWFRn;2Bivohu(uhH~bV*2ugrIavmw-r@AR!@*G)Si)-5^rZ9nv6O zBB-Q*z?=K%Iq$vu?;GQed*8TYoH5Q2*lVw~_S)Y!=WqT(BPG+a;Ks(rSdX!vB5z9f z%8P?dBWH(3q-%U=+Yc5U^Pep5j+u2M7T9Pc-H4wS-(TQ7P;pF)t2@4?q~N7)lpTn~ z>A3!oG~Ft5% z&NXJ-nh#M=xiwTxevEYe1J~uJ+x_%Rn=K5n*tU1=uaPJF|6XFz(TY#I@9tp97urtJ zN@iN#jCAgu-2aAFVC9x4GwR*dlg&eGxvLvVoZ(%ShQf(tHy;h^{?vGV-=gwwnWeS>(j8Y%+Zbjez;foJJdvO67?RzF~=KEMj;$;4-0$S z{X2K;4*jp!(7y(R4reCwEUT@{>2;DJXs%Bn&#TIRvsZh^(b- zGZCeuGd5vw4%x)pFZ=O+R9MsF&~=6CB-yf04X ztWvr^ZO6nZkW)7r*q8Y5Ebc0E**U#9eF2U-C1tu37i-uV2PgZF))eNeyZf$uGV&N~ z5V7v}%-9$DnX2_>_WBjp+A;2>nLu`%+vP%NPl5OKbg}{8Ly$)!q$CERXb8eytO%Y& zns^4Id=w$v{-ZT07!}cSei^DBn#TRq{%BBuLp11*UVgpioqp76qPS}!?g7%8wcv(d zlZV3xQ`g{Qe(Q;{Ik^}0>UF&;(|*p^Y<_;UpLv>J&QK2~4vb8`a@c2HDg0k*pD*wf z3$+G(y8UBg^`bG~R@+{vJo;3Ch>$_B!gKNppswbU(Oz^z4`} z&IwV|=sPjkz6D6F1q%?dQMlU~C9O^jOfv!+ ze$p%{b+a9^Fra&**MLhpV9*Ft3pCPQ?qdrDj@@X6JC8o9f5A&(*$KfjJnmu+T}^MD zd{?^h2A$1mw8!rYGQs+x&j;4~-nyRe7A5S{w)6v-lTzygiB}hRD4Lo$Nv_%0s>bY` z#cBMKIj;A#GoLgRFI314?X5SomE4avR=_2>Zo7+4dGLZ;ihPiSrB~VDdU0(d_b}0@ zutmx&7(wxTboL8%E%Sf196_kpC{P}~jF+E&a66OZV76nakg=@fO479ouV$SAXpNWJ3G+S5-(H)J_>Q0$Nf&;AiKpE8H%Q%4b_HIDQ%&z{O5e z`*@f3mcH;8zmxmv9}ib4*qS=mE*oE+z2sPP39f%vd|_8=x_gG+ryAB_rxYU2@4(5f zR>QQH<9IFK%n$zq$;6~GR%JlkwXE6gbh+vK`(-bOrbik#%9xhaREo6xR3y?A9}8gG zk=ULbW?es)dzz#c@yfgCXF*T`PoT~3Oq(AUZK~=jzJ{vKOnYu=)}Ec250=)S&QvkQ zY!aHORKupOMr?j#$PX(t(EM>5%@mojg;pc_H07uBKq3Ar_IFo8S0ZuA`p9XHXqS#( zAAGvk@TfH2{Fm7tckt!;D|EkHN_n#;yr>86`ewgkOdYf<(m9M!uCqNf`qA>1r8f7Q z&AlPr?qg{O(bVRuxPzM&<8@L(`XY@FjT!yI6{>_I-JhY{`h>B}x8nR}s_o3A zElM~qYUb>iM^t2_1h^m}ngds?-5msl~dg6&iwu zj&6e_*YV)Tnei*^bHi<^j$;{Nt!LH$yt~?>aHuNTul%n_)~N;2n8Xy1##K3Ati^|n z(nNS>PnH>d>bDnqKcm%8}5IiYMsd}Q{eV$opb5^vzUzZ;N5i0X6d_M zS(B_5N9&3L=N20~bKg9nu8jxU!#VGb<)skJYfev-HB>ZEQTgubNq67(IIyfK9mBhp zzd~4#&RQ6(P9QrxBy~K*Y*9H(#*8?U-i&%|$ivo^A^sI?9=55~nVJ^AT1`G2bep-n zo5I}J+?GDp0-@V}TX*`10cu*fEiV%CZq}hDsF@P6|2yi$6GE9$C;Gso9E2Dw{qvs+ zh?}BA!j}J=yBZ@*Ex+YS-2vYEA8=k)8h*me6#_E7zq2reGiLc)0%zI32a%#AhMyF4 zGwT2MgMWusRDWNC=c@)|tQ_iWxBq@H1#eC8e@^EA+?l@*<9~zB|DQiI8S;gw9id`_ zs57pwwWEuSO!=tbr%Pl@)SEiI=kKuMpN#d~En_YU=+fC<4Ho zHn+@E{lkCC41f4WN};PFtd$TmVwk4K!!WvSPkraV7vUpByi^a>>Hb|p9&5qi7PC*Y z8VDzUUIg#tr{K?_9{1mIjSv(I7K8DBeR8Sss9<^cpL8-A^8U^R%=oY45lW>!X!l?K zI9>qAbsorrP;%LW5LD0WkRNjn)CPl~5{4Bu3{Ywg9o&=8M<9}W2xelZ2U`ez0{Hw? 
z!JM!fL=-TuJNc|#SMk&I#^o7@cBNk3*#q7LEMJgv_`buKiewnlc@Rnd1T_1L-|b++ zA|2gRxhLzfxr1PYnL6X%<{!S5?#Lfajtic50LWeE`ZiU*f-rY-TqrW66+SqkO2dZP z2EBFsNA8j`^i>(d>ba0L4J#O9HlLr@0^?xL>T!V-UiW333yhN*CZ7S7;w^AB%8VpQ zE5TevDTSYtg6QOUc=i-T#O~04>7;Vv>j7)O6k&IVuM2!+7)SyUT354Bt%o|eN7aSN zJ@BD@|MySr0~N=bAg5S;nDNf94|K{V*FSKYbgs^s5QmBS-Zyy4QujlrTHfR6+X)aF zW}^@TC6KP@{6p#rA-`nhTr=QByf?VERo)yF zIkPb^Y4j<#FIQ^{^eRX)SfBochD8EOc3!9Jyn$mzTB-JUdU#tk)8@BMUvUm29Bw}$ zthZR66xBAKc)}~<`6T(qYRB_YqvQym7eJeMi$YiR*AH^L+xvZbS4*)NjDn|A_ zRZL2zyYV2$#jz2!xq#DGVkQ)Kg5i>^#rF?5iRBQYc^HY_5uS2g@PY}-n4u;NM}2Pc zTCwO@;O6v+mL|2yf1S z7dD`K38qjZIM7-2js=$uSLcLP~WVA1C&+EXN&P8q=6Vy+s!A3VsmOxw`A7RvX zdj=FEFNT)Ohj0cmx_>PMC7aFXnki7``C*M-)url;qD2X9CMvYd{7?KSsNBR!q5ycyDrNxzKtv;j8#d#O6`Tr z?MF4*t_tR1KiLyHJ8oYctetZRRi1%ydBELRvESb%2OKurnJR9h6Y1Gus4E%f;bA5u z2$s4iG=PNE=KD5SJ~)9;HJ{7u7jPBu#Tos`bG^=vTCLtF*;)EhN)RO9I?_%|`yI;O zWC=pHndFS15;}!=kodH==kbD64uo7ApPVj!pCkqt_i`AwzlJpq=hfB5;9tofU`7|^ z%;Eq8a+!{{S?OGF~tg$-dCdNxJ%Yy~t z{t#t*G7L}N3UZp8A6_HIMb5#i;stub$I^9>;!Ng{XxyD*|0<%7Amqd1?Ahjz!EyzG z&%ajv;s@^1AHBM%x67<)EPwh8c3Rw6Vl64dH&h-%X`Oa@try+*Et*JH5=jqinid%( z(?H_pV<45>ZA^@LX)wlL zf$ni`ujuKk(Oh5>v?Gig^HUr`R6E^56Pd8(=0CU|ck2EEc-wHL2PV9qxWf0J2I_Hv z-8j(DWA1~-U>5ip;qR7uD>TV155~TS*Zi7`S`{h7(S?> zd7=_{P*}x#?Z4}bru}hUJQI&*wO>G!HSN~7otINaLFE_A%c^KUBz(8%u8oyFe8u5A z63o5t_ZUp^wuWVF!UBc6!Htscub6G%4LM$hTIaiMmsKV$f1cM6@R8&@{LGl8CRAAZ$ef zZ`pGAMjD__Y|r&j#{`Nk@4^F~Jgue@jISj;xU!TT6sNOIWi7ZV;Q7&PKceH8#xp&> zZ<=W^ipsaP|9N|6+0&oWqRx%Qe9FQ);l@#e+w+z?$ty)(tZ~8*@;I|3_$WS}etrK% z1=B*d1;evDwjc!Yv8pR|>=(48PJeKHoOA?5LXS;h<^K+5RIx0l+ERuRy`Dz2ZS zev?^FVDcH{^j)~5$PIJotK465=7N&fJk7Pu2<0hv{jjt+E!V_^q%h+R?WnuhsHjg~ zHv;(peNl>SJXz_KNj%hqjBmlyF<~Ew85JC}iJDu}{kp^BK4H$El^>MjU{)FszIr^e zVi+jOCnpTTwf996>gs986TCKy1&t#nq6gq1A9bzvCCEAFY2EgC3j)-sjw;3rR2kr- z3?h(ihs6aqRY*Es#P|M?U~eJ6hy~6j7&liiyLH`h)e`@v5Oswy^tK^q;_%h?4z_C? 
znK*f1KXcF4N;#$OKEGG_FE9}?b(Dm=7*1&OcEV~=50xl zuHLPO`El_}+8INuKu-4~j;UT~`iqoICk(A{IZqDoX0cRnl*I-fq)2Wj5%uH;Wgd2g zPzi^fcqIpo%|DGXNi>>*_Q zs`5tY7Nc$wQfo}DuzRI*;kt0^6-?ax|-@Rk$B*A@`07VO}oF1LdR3v4&ETkE%(WqL> zE^nVvRi<^SMhIw-lv5(}Bui0GeCpLud*#j&7@%+`Ee=zu|KIc4BJ7qIjj=E{#k~?T zAYY@t6u8hM+}K$}e!I$lBIacQI%ivm39WB7sR0f>`5WJBg1uS>Rw4<5SJ*{%_X-#; z%y6EIoFnYwVFG4$hXC3|UDN(yD_DfXHD?Ps}s>p5z@h=2n zq^|Wp=U5{GC?cdmtqQ0}GoK>D2lVd}9?Rl{SAB+sZOHw+>y79emTl7)qnH*H##BGk zFMoG^ZI|@^K`W{zbR@NS)&zu~rWJx>%DWPW}SOdK|Eyy?wk>uDqZ)r8Ul|Nfv z6G}7*Ym~#hhLMrFY{}qB_`xKY<+=q3vv5j@XN<`G%{f^Js3P*^)jc&(jbFgpU77sA zjExtifHqm8-DGANTi#B-FDami%)X)%w#hI#6DUBnTII&rf{@*7*;b<2)XnqlmtGMV zp}f%YoL{~k*cJ6wwB|C0Iik3!FECv%cBe>t*qx6$b7d)o<|rY02~hIG3Bx1KcL4}H ziX!>)XwCqfAQtLne%2A#mNr?{qm8mjdxr6DyUVN)jV6B0c)!Tn7)%6l8L$`s-H$k7gQg)st z(}N7DCF{Dh)@yrx62B8|f-Ljp=_Ef{|1IofeV1cONB#n8EoPpppNJk>O=0>Ezxv!^ z0E(x%rrlv`U{>c4U6XtX{U4v;)jA;9IoPRKUW|tFbQ3#Uo3*J5_xr1y-ZxH6dqiJg zoQOfUhL23)i%F>GdEUJm{b?!cdG`qnR{NeUqqA=(qG|ar|pu(fB%)JV5&o=8IOKB=f^kQK+?@=fFee zIih3vK$}qaU$hN%6t{1d$KXI?LJ_9D|1d(VZP@B37eg-$Y@!?0k8fXR%f}f=(9R z88Zf+5I)8M6FuKVOg4cqDd}ZNvC^vF8xq(Z9Ns7=lX%doUvd@(w&TlomSnXXsR#Xc z`IzF+`%%3s{3j#;fol5tAxP}vW%2I0dHQUQ=r^0tW3hetDHfEuqIjeFF~DCz;O<&~ ztk?&a9j<{wPdLw?^uy3ZxVb+5cvlEvO7;H^6B_+%kMx(|EQH{@+xGGr{tK$I0VL9s zD)cAt1rV^_2*6${-a6p?`xr(eygD5=lo0l4|D0h3g5>`#Pw&dV&hY=)6#c!d41`L9 z=>yZG&)*;XKU|vr=bnZd(+7O(mH-CC4FcIDNz7zLm%sxMOs@H2-5qkeKzv#DzRVvr_>*?A# zC_FUyn6URL3as#a(GePCi9`rf`3oXulmVzk6klrff1}k=aAVPp3Yq@{p4$+@P_3rI zC;z?(5@LY=<6ltt_aITp{P$2%BP2`z zlTJo0xU6JZjTc%hpe6(Eq6v#jk#ctDy7wh8e;NU=+7c=Q142Zw1K5~neFNHfG36zA zXOzQ^xz)#9E%8qp3nI4Yn)f9lJP9nGX2ACW2v$SmnvtO`#pE(COaLSk0Q-31l`w8x zP-9wJM-=iMaxh01J^*v`mb>?vyb);A_C&E{kOvTh^C_ekn!N5!8NM2>a9r<~GtelQ zCg2#Hj^A6O5cY5aY|T%#1~y*mo?l!RFAB&z{WD#cf4wVnC*!xvxF@n`BsF!a~Nu|?Xl_~fFWxV-g33aqZQ=70D(MuzXU*& z7eGS#?rU*s#JG0^KaXrpmM#Edz_J0xEf(P6@AhNypQNB^<4y|4e$I&@fYyjA=OwDl zuJqKKlVD9eaeeYW2WHO*J+b;Bc*al;Xd$4rmM3o<3BOXIGRJ4S?rMD$21dD!%CCEY zH%SkNg$zKk+d%+E0dI%O4gqBROm21G>Wvx~i`n#C!YO#v#HqJZTf{Fe`?zjT4CKMG zJ>>F+M2tNkJTw=`Pam5;_p=uV_V1rL6b2#n#4G_ZzXas+d5MP$061$uI|2-k^pN9H z`{>*b7V$TDvVn^ZbM{@+gwLkRjI?oHfKR6ep0J_@T4=hqUnu7Ts^%X0w;a}5D_O=E zN~kluZ&j<e0=!CeYWk5CPI>0+gO9bsqOU z)3bAgw>3%kkA)O0bFJnIX(b;fM8Lp1gpm;-W{R&4DePL1ztNlYJcq{^AOYQ;EA!ubiI7Vi+B1T8$x>!e9WR= z(@9CX-}XU==UJJO5&DSRvo71^$R|uvV5akM5W^p$F*z{s8WP_ zLb9hyJj5q5^u>HuFLU1{$yyQQF_h}Mhuf32v|OYHcFNaP2SWdu-({eqhS1hwlX0y= zfQBTBon#vm*6!jDT9+O$2!S=i#@$)235g!fQ3~8!7mZx|@se95BCmzyBd7n1-{%{lp1buMc5k=C ze=mo!Kh}U?r9@f5>5?ZOU`4o@8RAM@JZ4d|?g4CS82oR6=Ea06TAejk4$Qmqh&ppv zzGOd}0fbPi?KtFd>(XPATe#1w*}`HH0H5o7LeQNL7OTd@d{l94t)l8(dIS$u)|m7f zJ+%013jYaA(U8kn^y>rrCu0>X_9sNwK`v6XO(=8LjpMdUanE*69Z&;6L_Xw%V9tSm z7=5)NAL64tP1&g+bp^z#phz4sb%@c?T5*FlavmSm(i&Ivw;g7MzZ0xa2nQVHaJKX> zGUGwr+pWzJ<4K*R%?b`lG~Z>CF`s0Y+-mJ->oLP!XVMF^{_IRM6%W6AGf)~G$326540eS6vHXV4u^elA_Je6yGQ6a} zMbq7i8JcPjr4ACWx0yn#l=aV~_+*+?@Zxw^$$t4%&YUFC=K5!#k8Z6kel1Yx5Y-kq zo){2k z72jl$D6|BT)N?+8IMs3N3_S+lhwLF%;6OL1D7W=tLh_1}Ye~UarNvNATq>I9O5Qnr zv0iUme-^9Rijn$zwSwp^9qX!`kk;u{H6?ZKTWd=BN;iT(^1n^#4R)^3Tz2~E|7I$o zvZo+LD~JFUPpn14_gk91cDZo?nd7K6g}Exot(K(Lt1t*5b#5PrbE?n6@(muj40j5JaK!I9Y;OSix{2_bZ}N&+SBIGxSNT(w2uBnq z^(qm1wj}Ad%p3~6sgd%@(rT%S2t)ZFMubK`)(>8T&nQ2NcB5LA#SpoL03mDIj#JXF zwrQ)~O}QdG(;ldW@^2;bu1Li=q|~SP^6{WO2=*h#+Q9A&mZ#w@L6j>@u0Ey^i|@;=OXTYdLJ$ zDHaGaBNKT7;A-oLyO+sG1F}x;(o=3ED;s+z&t&9GfF z$pjp7+;`My+N6-Lb?JO+Ph=8DLPnLRiJEWkNQ$w}3!}y9p3W(vvoDCw`GBhTE>=R- z38EbkHcCk)aUa?{Gpg77{S_5C(}Y}YzvYcKyQEPh*?%=TW#*h24Io(}PlF(H#rk*Dad4+2L;C0d8&x<(mM{$AzRX=+4~dlUJ)vpiK{>jWhmod; z_zvIn{YKA*aKm(0`e<9LtRMZ0^a-hPJd1YLud05z~St1<0@ 
[... base85-encoded binary PNG data elided ...]
literal 0
HcmV?d00001

diff --git a/format/diagrams/layout-list-of-list.png b/format/diagrams/layout-list-of-list.png
new file mode 100644
index 0000000000000000000000000000000000000000..5bc00784641ab3cb3c3275ce1e52e7ccd3b2ec37
GIT binary patch
literal 40105
[... base85-encoded binary PNG data elided ...]
literal 0
HcmV?d00001
zdC7&9<}?7sp7`9-xd&U#5XjH>hSHV^84i=QGL$<2jVf!sOGQCNEgF(Eqb`(pd!R?;E+lafoipq*fO_B!k}Y>M-dcAr1rz!9g;Cz?)TTOG+}YB zzW7Le^tA??@({=9=Xze(EYyokd()JZ{%f(B;9_ncVGzz}I}7hvB1aphQMDaEZQaE- zCeQ-TO(c&vsXH7~32nG(s}o#@FwKWI%+`f(G)wxdS46CHC@8Wf)~S^S4RH0%9a1p} zqh`_*sC9pbal||U2?~Ydg^MlcinTzn2uKh-7-EW zcYU@jHZ&E*O>#_~ma#%a9lGbY*6pJVjy*MW)<{6>Ldb#?0O43ny!-leh2S*{yB*W> zsxE6c_*HD|=~Q2K5E`Tt{kxQ?VY?D7dB>kdth_+O$i;~oyom~4*K6QRvdWzqO*-7> z7}moKQY5_6@}FF%^%7hC2V=QFrFzYVn^^AQT;m$PTyYmV-mG=<0VaH1F8ixjw)A7MjQ4h z*VN~wf`|nF!_ZR#)~`s`Rjz^JpUOfWhV4J5T0t!Lz_QpGuo#wnXq5U$Q+WaIt;30m zZIThx#q~4f`doc*!@LJ@BRL`%PnVm`@|gGgG>1{*&dS0SO_SQ;>?mT(8z=S%tBN1o zfgtZ35x`7Dbw{Bd;Gu|0C!P5JXq{O=Eb35TGsD-zIfdy6(~!0AzGVwShwTY%qUt9-)&Nkdm5we5H3yM zg%4?hPD!g`df>Q^u5KTmy%Of98$dOb`Fi zJWo>Gb2uX#ceCK=G2$XYx37sRT@Ca7%4B;F+8&wpZOVn`TkR$FgO`xUr!CRv@SjbQ z6j-w?K27aSb+@hGvq?SYO-)N!r<(?juWf|6m965#dAD7vr|dd^kqOy$v02_rdy~e& z2;+aj{8i>KsIT5AD%T^pwi(j1zUC1*Rosx&@@S12@}mtN(UB%{t~F8{+b{SM0sK!w z80tTieCAnHq7xB$y43q`I}HY(%xY{3NsUL;g)XBKvt5_q0xo?ey}Bf*HW0C6b_Tj1aH1uaB{2lh{}adwp>?rh>>7O+rOsb!=LFAG2wiV#2p=a4C$r{l zhQhZr4fI%lPGPSWE%*?(rW=z+j?*l!b()HSNAgt{Gic&y^(gh0+~Dcs;v#I5*Wowe zH{-erac!pK#p;OC>c_)9rp!yWc0ES(n{plBlT8!)zGDj8ZJy|}T*=@aV6}x&*754Ku-&%n1i@c5Xbvf~#i+3`Pfhoi7>85`uLBqk1 zGn16hjPmg`?`-^csS|^37A?GZZRkZKfN)#@2aVB%+z?WF<2%h+GX9C#R2nM{QIaG*tGfNQO?()=c@7@lFGjbeCS!oSCKcg?o~aS1cSf^UQ_~ zt*sNGFVY%E+Wp8_E^hQpY-)prOf<#o0p&3-?bg>x zpnNX2{?tee4=r*^gmU&!KNeYtq%xEa&+k{oZ(I&Tv=%X~eQI{;ur61)&Gh34i8OM1dOOS%Tc|%B z#l?as^K1KLR0%njY+Tu|7ql!wV76D=^kW8Ny0smf@^9*R+Ojp>kuU1&9Kv*u=~!m- zOdQPzwILQe42QKc7-a0OphLi{r%ZW5&#QiqyF6w%P^@ELQ>s6%4a3bIhO z_lLY^i5-<_F$X~-*HS+HtO4fe0GOZ@7fA}XrR+GM2c)%xR(rmUcB|;VL7P=Ya+Oe_ z@95wA5M6?q44d)&j~9 zN|8!qQZ~j#qHp1|a0h0(<|K%@#*GibwD`}EU4yw2Sb7bDYlad{R01f1ed4ATaCU5O z6j314nq$0bk&)QU7t==NtWFL(g{F0SO5!ecPOrj^}W;+cH+D8|KstL$g{DE}bL^p5>YlXC(w^q_T~im@DBx0gjG z?*y)|pj#@M{u^gE2C5H=5trX$0+BYyyEt#r@)Ec6HFphpE}k{!4wwG2DUBF(+JyV^ zrY+b5!KxN|wICpekt6|HOJ|F$S%o9MFb|7ZzFUFN0Z1r4$-{5+`B`)QIK#mKiISo8 z->Xuec0~l6aN&u!#=6wZ_8S5f{#fviUD|#1 z1=mzo`4M}_we=I+qjLjIluaHU>wBlR)Wi^QBzZ7gBm^W(F99*J1%ONM*NU^RdppIn zY(=q8Xm69_4lh2X4zjDMv1&bY79(}wP-Lm2?sTjGAv_O^%OafK97gpcDe`-8zM)GN zmGBoxTOJdtp}&T-haGb;K+sB%nt zZEVqj^740e;f63oLE>z(TgHZ`zjhKSTg({*z(`kE@^;$Oa0jM-_l>+Acp}&hd^4A#9u^5J1}&- zEU0LIhfzSP-&ep3f%E@rrn;{PVG~iq1j6%{rg2Uc4t=(S_DNZ zzXY{6nd=8YlX^fLlt>|a)l>vvasWJ0!$A>xvGTz(fKc5(Ui@9D?^;dnF@o#}`X*afVmtDQcN z$v^MD9F(*4-0CuRpc(_<5jT(l481}j zL)%H6vcn|rwai2wCqPBHt-35s6^d{F4B)x|n3Mxh?Y4qALCEdKc%WpPTD2~1%iZ~U zu6u;f&xie~1qbxAl`q#of3~eSj(W@h!S->|?O|!I;{bH14sHDyv-%)P_)&Xjf08TE z1#y=j@oYDs5bfYF`M1K}`T|stZ9MUJ+lw^c=N}|*%6kD2p%xfAj`)2Jg9ObJ+{ec6 z47U2hSIX6@*Ak=$7|lA*T93gDUD5;jfxQEP6C8Q($8q&OYP1ai!RN5sWJEex&|U!a zSmr3tVI>@q@BJous`hl1P8+|^560Nm>j|Fl2LOpz)%7spC8uxM2yA%uU8CM~6)3K# z@0E75txg)F{(&cAZzt}x+wg_H#Qn(Ey$maEm!qA?8{N)_z05bf7hp7WoD2r)HnX2l z)MNzcPS^Ye*ydhU{M77Z|5(FIZ(3axn`OXH`ojA?sL%%V-XN~I-W%? 
z?oXNPop3FD@yQ^Vh4JF68Q8t?s|Bw8EXd|@j`b(NCs_o5TyHq%-=Ja$aSh|rph%SJ zM@PzR0R_)5FW@?`xjhPnvXIMswEOzZjxnF2Wjm~+T&t$%Oi{*A_8w?E5Px2hhV7Y*&}V~L+nZGcct4gc33ag6(@E>@NeXM zvTzKi8y6gFomO#pXr8y7^XKkpUx@&`3OkwDPnzm><`)b-j|>*%&j6<|Dl?=$#DIb# zMJptdB)E7jn|IMMhC|>QLFY=XIe1Q9j%}ESUcx>hgSwvCtTzb%xG!i_$CNK%s9{67HoFvVj<)PuDS z9%#cVF~Pe5w9bnjgfAhQT_4}CS8k-LuTu3TnbJm5oq?um+fj!@PlKwcth=ZLZkbpJ zWgi)~)_*a5Xuau%{EesY`^ukB@SH#C^?m&%yoM2$^()a9ZtMv_rq#=8@tsl=g;(fi zjt<1xj~oz0(ePNL!ArVOwrOB0GgxZE2N52&W8aGrl*XQQKHAcxLrQ87b3qA}O|kB; z{^GxZCcO-hDDmV9jDz|Vb>OF7w>S~Tltzx8!uDxl7an27qyelrKS&W};tehC{h6ah z7Y<>Gr5OxiIfql+(*T~Gh;ndutRDc?-%#W{mry}QV3aC9NcLJHe!1VMOZah6Y^o3i zOBU__dZS2E1@5X(NY~Pt95MNmO_a%}5Y(-17@{IY=dCiOQcbb>oA7RG(nfrM)ELdU z=Q(fRM$Nm**wMw9nKh3z`|b_E{q%gASeKvQwx#_8lA~Ei(7%?lv2<}=a9*U+TvhtO zXA)9)(}oK-mc`}t{C!~`^N3&S>#F5cK?wQ871n0Q?hf*re$Z>G_m$sQwUz)zSDbm4 z(~b;)kbOaY*Hw&Zs^Yx2qED>t8W%IfzWo?#eQjO+nU&eMtqz9GX+9d8?yxO=29Wu) z=60KB?&-9a4X;VoWyZ85EIEnoQ&FH-S2Wc@o7p_{Yk05-yvla15rGmh!A(hhbqS*q zVEfSkrI)z20juxYF^rlpKJat*^rGT$k-(|#BRB1xl2J{QP%Z+Fu@FTpT|}J zRj$%O|MoqqUqS(lYWT?zFjwjs-}dE?Ab#_bMwxw=jbeRFE4yS)_YL`Y*GVFpA=7Th zLx&@BrmtI6MD){m8*J8}+pis@;C@6%cb#{XEm-&9u~{blM#J&QL82ew^{)ri(rDeS zicI^Y#OM09@$}t1A_;A3-|T6TrW-wBU7r8Sv}w|?RCoyM{?PRV50c&=F{?7v!iY|b zbPGWHz&jN}=l>WD4brjzSGF`|4jy0aCp~2l4(ad=_E{$Cu!z7jZ6O6qaX)DrYXf!Q zrt;Pl$Ktrq?$xZOq>II^77O~nHL&f#DYCe?i;F7z>}G0(W5WbPkU==9j6TV)%STIo zd0^9eSYm5IrR3TFJF=n$>@gUq5Yi{whC;SiyjW7L1?MiiF)&ii!?Ss@XtJ6ftHwx40v@w1VBMhGOqsCm>%_RxxLk7L zIRk@eWHE4<*ynCrWSUQ6>M?Fy`e9sg2)kVz#3CMT-vtROr--t9g{jeoAwuZpi(d`b za5F(W{?zxQ|8XH|Jr6Co^3o?l`$2Fka5sNQ&7^%MNwXre^%2^Dvr~#{jwV; zj5PeEL3slmL(Q&4%A{jBooh|5fzg=O=XUMnTYfJfdx-@hvb39Alz*rcW%}*vGXRRB z%PVCICS=A~q$JZDxp|LOn0lIGrz-)>LS@7`uZXUi(9v#5%53A%kXqd+B6cF4?D52l zbz;k^FBgW;tcmQVF6G?rI=3zw-YtVXb@nZZk2s*Rb<>V028*` zrs9jr!+zjAN}B|{h%D-3z~kqZxYPp8Gyj-zAjd3xJ6r{G5QdlcQ8H^fAeHkT`wric%?(#%~wRoH&tv4TV=Oe-1 zw>RqA(|$kmU1t;iCGr!ivUel7PXzBZQu-3OZYwkt{%x}05rn@btF$yjoJgBzMJOaJT5gT1xq#0;B_P7QC zfo}8iEnLtZ(iQC16C-J%O6z_&O&mcIwjJqOYIS-|(y8>;J0=vvoRY_VDmGrp|7e?t z0r?w@`AC8|#BYTA+oa-q^M=`Ix|x$|k0hgT@=pN5J$4P63rXo+{=QPo;9pv2lWD}S zzt)w7nc!-b$6$0Tz05Z(O;w})L(wsbnMu8EQDe6)c|lsqOkp9=I(Az|Pb=OGz7L!l z_G9t-aZrwqGIvkLFbcl;bV{0v?{(FtO4yuSM_p=eT-yl`tCY2oECw~Y{j6S=81^Sb z2_)}60O7W^OiwqIZ##ufgduQ7w#Dc)Ht=Ez+nK=rb-x|^SU&1Ie2ltS(WMj~$%}MZ zCgpc~@{g_~9#+39?5ei4Fe$Hnm{Pt*cK{BAsP+_G+8=1t0OvF#%kc&36Re2m`tzP` z@heXDQK=2srrEYNY-J85V*B9@@eHlL_M2(xH6LI6n9XrR1H>v<_CT9y?wfDIN;ARH z1|h4#-ENqX*(QTHvmL2Wk0wjZi>#7f8i#UT98zRFpeVALS3mkAgtChtEFv2hLzn~u zGc-CUyw99E-C11LEi0-Z4tG4|gqwb-KMKg|ovnjx^IW2oz~`DqQ^hF{tf4 zaaWdmE(cF`MjrD}sT$k7%Jnk|os`ciF?BCQa}O6vNa6-n;v%s?I~%HNm*gJO&VUb`P@Cxm4c`VR5w&`LQAtyku8r zAG}845i>+>+`_RMYgRcW@~C`8C+;K-7pN(hnr{MZC}?+%d?*oD_Q5RIr>T(u8fj8` z*(!wXW-F4R>At~|6{n+=r`~8R9GAR{ha!sv^DD}H^>{|)$V87%Ny2rh*!Im9-&J~{ za+*FP!d63LLl5W@W#Ba(oegg3Ts-1jZtsRWQ;SJ=Wk}=WtTz6;lg~?_Qit5#AjxU< z;!a-P#$XvHxoCE9Z@)#XgYhu_(A5Qg^*}nNfL~jd1srWRRG(A8+z3QW%rDuc@O1i& zO5LbC0&2Q+baN=YcF9U?&mY4jHlQbnVK^r+2GG*((_4hR1Bg2g>!6bGIZ*dMLF!KU z>`#f2WqN);nj6KC0OP|GN3s{hE^NC?=^>TZggRNyYcQsNV_`(Xz}Cw9O4wHmWmF!4 zw4fsoFUTsT572ILUtLH~=De3qdC!)S>&I&9#t(L@Pdk%#lZBfk8STom%vfV$Aq~>% z{rI|<-fG_w47^(&$0s7bdhDreN;x<>jzbT`V{09kH)E!OX%FLWHa(f8y6-FnFEYgo zQ2HJu?bQ}%s93IYWDPTAvWPzm^0&OAyQj$!@L-F@w*5}ZBJZm;sJfo6)74EesN0hYZ~MD9zU))c!{PIX!>#IjeugmRxx`b1ecwYQljY zKdQH4LpGTsAqrfiZpN*=QI2UPoT;V+v)r*HYjm>kXCeM`-@BjMdk z*WQfqC(m^1>?luTa6qMZ5ZX{-zDF=%9=so6Mlggyr-ilGQ&0|t=0K1!SE5BAfZZO= z^u#IRP;x>+lXr9-zdFMI$SuR!_|4*>lR18Iyv4g2h*XP(@_PIXB0;jXajCdEeT1~> zqi@2 z3wqYeWLBAm3yeYCAmFil>+0_p(tUps;6alQ3q|aYVe=cWHXZ&61j+G8He>cy{QFr> 
zu#l9ANCf^>31or(o-8P<#3X==h$@h03Sk@vODcZ_1;9|$Na)*9ysinCskbZc+tKYh z?K^mw`QTvPBxx=lO@-+*EV>i7uTo%(-&r}pe&UGi)oBEHDY8reoiVnA^|eRi|$xRRMV2u-p;dWXUUAjLcql95zoD8P3{ZVbSvOk$4Yr0vGh* zllu$H85+TE*qv{g(224bFPc?>T3GW)gW91~#g5vR%~Zu8@```8*yzFjz6J<^T(acA z+Jo84YK-xZUgS5*33y7205m-@A^w{}_`Pra_Mu+`b+gE$lKKlMtwW2$&CZTJdtv4N zdkNqIfZ~jRLed}i{9%nLfz6MH=wH76Uo1U<74Sq$0}81AfX2a5hU5Sz#+CnHwcvlM zWdS$=a*(15|Ddt|zQ$z(Hb3!pR{N`3(}2=F`&C52OU3_t5%T~oSbR9B`m0(+FOu}v z=%W9~eE)+py8bz|V9+jeJ=D#$X)~8dpJ& zc~vLIlk?*6B+&(2Js#36S1I<25JQ5?hO_u9I5WWwakQr0s5JjkeSvV)I9#0q0?OKd zIOTsD8zBjpkLj?&_K(^Jppp}_wTke_8NLJn|GCg#@I=FDwbeguN-hOF9`2gehxrGI zK}7)=ufM3e_0NK&l7PoiPdUTs{s7A5Wgtv`{z%^J4SjJJ0HDtbphY#CtQP@@3y5b~ z(9Y$VRE@~LZh*EkL#4Db;8moyeTZTOgh;jIej>6>8Ohr1K7d zjCFx1Lc1Mem<-wSlvH`kxJgYhTp$XnL?-l!PEfdr8*i9}Geo?+>^SDi{N#>)PF}xem zY1jv}_zFV60h1)hB#w0`p-P=#sVueO#&7Xz?eBPr10JsT0o~~-0IpM)ytYdLpm9R) zi|(~F{GMIzq4y6viK}kw1jGWPcg98liV6%mXQYc{QK)x`0v_d4fHawZ^$Q?&SE&$A zq*dM;!6&zDeEYjmhH9cG;B?M?fj2;+HS@w~Oj>mT3voh^9YAl1C0(1WIt1u!J%1dQ z&RAl?yy8+Uo1(J%Y)+6rY&p|k*0LJbLN3{b4kG0ai`n3Kis|yVP!2muem>MadF4^vZ5hFxA zyCnLlPoYS04FcI6Filx8ptJw_e2cdZ1_*AG&&z225sP)9*EeTMKa^s;fT&U2&SggN zPrw{1mL@*5ya8f2uUQ50+N>kwdK&2ZZ*OCHd#+}VDyq-6t_cJv*8seBAeQL-T?R0c z0J^3Vm?BXzWtPAK(lN>ZXpIjpIr-+*;={%-H$d?J} zq9()R9EBveXk=3rV1qT=V3Ro`lt(h{Ab5j-vjwjjFSLoAzX<{}!4o(YQY04IZ~H-{ zQnPWX*^Y?4dipm600Bp)8NKIb1?-XjOB9*bfdp8CeYHk|{J);A)tMRPj2ca<-J)^X zB)az60m0PbCEG92s6I#-Bi-=n2nx>m{KHq9(T6V(TZjyu!tqH!$fW7FryJIWlx2j6Y4*lMg>7j3W8#!g``fP-M5MabH} z214MPlz!R{*-Z8%?;oA%3rD&s9+9@}YOKa{ir33b{L7}fT@qNL9f2Kqx^-yAQ* z&R-JT1Etq&GyxnNfCW2fz&8}XpTFdsW3kpLI~1&WVsL;RrAR1~QT~k|0xEfNcRJ z*sZmVl1T3}Fzvf$VhY;XXH+VI%>oR3!awq=7E_FxL`nBn`x;v0p~dpD&ohzl`>c{G z!@3owA0YdMF?gY^68M!3X?|t9t<_#VEM7|`u#ch00fzx-4DDp)2sybF(mp!)W zo`4ub6NBow%uPpzN>aNj3>=NtkVJAcX3KEZxYEgQLCcO0ABsiN>isY4#F3;fJR(M> zP9lDrVYuD_YP$eTvAf2_x5F!MxXOop?pF6v>&>^}hfxKx9W`Ksllw9`Y7MM>MJ`)i zj>WhDbf00jro`BUnM01w6&Ue5^ppId^QA~Nz>%Y#%w1rGB4zU7@VY$#sJWR47eB67 zNWB3Zdu5{U4%*#w(tkbLLAd5OUP z@DepaytHVhYzeL-*YgIJ|NNmr`o_!&0(?by0C!ny*BAIr+&qJmd)#Wf>=TKR*qi$($Wz(`l>sh{Xe{aM!!xd zE4BJu`^t$BEz@`NZrTSC`XdJ*!-XgD`P8S%AH(+~r$j4^Tb3rmWbp)ui)-0y_4r-n z`jx+ZL`EREM`wz=;8-=yvcj<*LrQZY^QbJ-NnS&Px>QsbqrzBqwXe!g(%Qkh^{ddJvxZ6(+yc%1|kexX6`P@Ta`&iIbbV^F+ zpNvaGJ(uk;fYYhf%>d{joun9x(_4{W@wDRbhnYR*(ZIL0c&S^P!OA)@VS-Fjb+mtM zfICGX@DKGeGwmk(_U(KqRF~Ws3}b+wQ3}7T6|LAqwBDL30*=oa<;dpfeP$QS+@0Bw zy)I^z+-T~7r%|Sn{A9Y7bkKNAwD>?)KMWTW?N3&o-Zj!-u~8YH-@07`l~-&~*DOS$%`)-bc4J*gpd zdjdc~$npC4T~f`GWrfFEI3t7_+!@UopYR#o*6#vIzIG5dO91Pb+w;)mEFDIt*-&#- z_tdX7pe2X2>A#+vpO$tN0Rov^%v?zI8-H}yjf_MCWZzmmr!L!lwoH$8O)MO);lj{8 zQmEzWatW{5L10`TI5(lN64~RvnwC^f;8|dIFzuqyFSwF_bu%Q%)n;3Y*;;p=k)62> zey#}sdlNAWCCF-&N~3Q`%V)~QdiJ8WP03fbNi#eR4JOjjF7P4prK9Dk_onoRhWLpX z$Jz4RP5}3_tL=4y1NcjrWh{o!8jg}$LdaS-#X4`YUyG>7VLhnew(l6;rGV)(h^%dV_GJ)LvWX;J~Lb5Szc+E7?wsvY+A!9{D% zwyZ&p5r3>P83!A#`2MWw###SLCONem0^M|9uPQLrx)QL$FX05?u2&sX@x8+%VTrsG zsbJo#xkZAxUA9sXQbJ)(Qixt__bhh3@Em^qlL3Yu0@`q&G*mk$y{+K3zDK(zqr%=l z<3xlj=_>h|tiWx#cV%MRY7x#TRO@5z#i@qX%BOqMNC0~X6{9jsHkjb-0kCaJCQyS3 z^QWJs*m{3yqIuv$1t_d0=}i3EU;Qi6=`Ab8bT|xVTZ^RIJhP-WTP>syvfM!64)DM{ zC*Ak7TZ%ml_-*48e0SP*7@uz(!iPio$kX|6>-Wbo-r>t9$eXCj8OF%w)$SE1E?Khg zp@l2GeJ@((W)y2c#QA*G@SpDjK>Zp|>%3tZc+C6N_&&CP*MZr=Qipnuffd*AQb@7rE_rWk2+(w&!}YQ?9+$HrHL zV74PfjWPM;xOUNJ)U>1TwNVF_LN7XFwVrF9W$2|p3D`>tVQ5WMU-{n~wvg`$4X!F@MPVGzO z98OgbV|Z7Xxvt4o=I67oSumS05HnDL8R>K+;Z>K}kY+u_y)4r8wwc2}ix1uC0rhorHQx+d%9tM&3_@4Azq-AS~H-phtK zNz0Ev0kv#J&{*HbqCvZwHx1A5she2i6$A_&>mhL&q9Hzh>u9WL?{C8h>sH*=NfaHe z?X7fQMxZC6o%eEbkkDA#meU|<+~u-fKG}66guZ7slyM{vPUzcJ1XOki1iHBCJx{O9 
zG#rPMD#o@h57KztY;VQ5gzavPr_Ps3_^dmCSpU5*1ULd~1(}_7x20`~`54Gm$gm65 zdzjH^nQ7nM?$OpkvWP1V-M#GFxNG3&JyEkB4_8*qQV*V<*cv(^RM<8o$rLK0i8eAG ziqcsu;g2>YOk+qj>f=#?5+0(&)ldl)Vl&n$OvC!pTwP#Iz@d^t;6AoelN4cF9+cEjZd0Fc0>4n9w}MStnOHccRNha=CP=H+&G zoT5{FpVcOF3{o`JV!K*Zye2fLJH0N^@2e9?EpwGw*xL_k{dG_nn{WcevX>89W+R!@ zYZH$~ktAZ^dblfcVFo3DlDt#PqXUs`=q>YRMQ_G%esI%sQpsk$LA9Onf<>cJb&LGQ zjfq|14)pg<1wjmhObbM=zHc}Vio3K@EK2?IfDj{0K%}Crk<5@pqwdX?^XF?PF#qV0hWO&^BDPmUl;#R|dJNR=Riuk=Zi%x{L< zB`%?w`fLe%X$L*Ya(~H)I$M{Al?N8pIR5L9{dSnWy!TwdxBoyP8N96|sQ{)R{LV!%O9)>^%Y)N&_v zh%%gpOM*qmuv1aS*O94-j&RrqlEUw|e~{+>*LL{HY_N}11bH5xtIv^SeO3^4pZfLB zFmJAc*{z~5@z6)^QC))2nakwhi>F)Ui5$>bofQ;=*pbkbWF|j{Ok`WxgO$h4GY0nS%L0@3@lV~*Dj+aw9YL=ENT>1 zE%e~~F_%4}H|$G>Qi{OWO=wHm`4{6Ii&!xroTaQiGGEjrxQ~8t1mWfc1)s5z^PZ&F zEq|e3+dc(Ir)!%HB#NH>=m1Muqj|IQ-H9SGo;!`{LU$8O=}+z{8(AbN`JS#0Atp6)M zerg#A?&JoD#Bh2Lswj=^PgkS##SF+r>GTL+d7^0Wl%GrvF$yvCV#^pHCzV*T$)iggT994|eXvaip}O>Op- zf;s2?Si&xFzQNmcTc+W&=#%Ia7XXdm46{Rz6p1Cq<7DiP6gZk#m@a8q3-!|<%Qbbo zKqXH-ZVk+8d#DtaA-S+TTIQ$~S*COD&JhH5lYwWa$ue}(<*XTUVc}Q5Up71Rj4*rc z=j_ecP8}2qKxF=9X#@jR>MPS}tu^HXl_$H(%QAMZCkOz>(B45g6kIxNaW}eGn64{1 z{ZY24p;_wxYVSRuqFTN^VQEppL>2`J5=8_=lH^til9PaB5tSr4v}9UDP?CsA59ZLA-nKmb^;c4gZS=bOCODGy=zYdv5Gz!z2TBZk09|v-~N8AN6z|(F=1O{HnZ}HYkUT5 zu5yn}ZTBO>HNGvod`JxkmMc5zV-hII>z2PC!sJgt9)0X|LP=T4qIwb4*5~n2&f_OH zgYDAT@Uk)_%uj_rmW_#I3YS%L6(e}qR~5|oygm=@NG-%WB`DAHJk66WeBm?vl)Uc8 z$DQYGs|c!JVzgHuiv1jCJwt7@iQhtcgZZnCF-&@PMNxlf9DxqlaG&2b&yV`AB0P)rE#WPM8-16?ya(|z3Kvsb+L{a&i!SKIR2K7*G zXUmO=@bByZsWiwVj1)@1{!Xnh#ej8>Z3HtNp2$5+1Kqm}k_)eANC}z#$|JmW1v!`y zjPyI@zjHV?AYa6qCdc&lpCpDad>JIH6y51${*NtmJY8x7)**-FIAP8N3O7h^o~KVEvziUlOyX$#cXTrP`` z%i|7b#>&V%EqkAKe|-CTn0e7V@wP`-40yKY{3R={kiDu^RTMitHIQv2_C`J2iTTFk zs;c4-bP0>iG!YW@X~hi0jL6JNpPC2S#gcpr`L~NteB%XdBFm_}uSgTx6>s);o2!F! zHk`*EtN-1vO4$++I!q_%o9N&Bd;%1(6oMqy-(8*H1tdxDIbLP1_RofAy&~Fqx;bgU zXWC`Iw$^2q_e(aDJmt={59PC3JU$+L&o(t<1hmq6r_EElEyvZbUGS#j{JrJO5Vy2) zx@;r}aEcUmf9kcmaF9EK`Ll>FE-UNM$oN> z(Mr7Eu&Fa{H9m?uYAh_W%2vXOT;laFbKba5)Fmbo7CG~qnMSQ2^RRt;7!p3SvD-z@ z`0Tc`rB9Z&8?N!i^Y%6eC@%ZHX)DZd;&ZMf^$Ywm8glebP+t_(B!{%X_h?){SLv(sCF@c% zf;15#seL(vt?F&-lf6vW8W ztcyQcV{XSvcc%taf+BIt;+az8QzaGS-IzV)Ve1apLpHFoU(>r`xoW^<);)}&;Q+mm z$~GeUcLoP)KOCWE3H`{n5>bB0wsD5|8BGF+45kt!z9vRpTRh zt$rC))GoNz-4#Grj?cA-KFD(~5zKm;h054geSZgkpYg_gx;P=fX`g>tQb#Ra>&oXM z%V$4ySNiE_mGj)a1hb44!bPUavbPQJuSJyS6Ox&Fv_GLL+xz)8713{nr8L0Uj#uXy z*s%qV)^HUj1VLJoJQ%3l>0zZPkx+?qRD8}0<9X)?Z1cFp_lsBfBSn|j?V|Y6)>GPY zcIMmfO*_Nl$Qc&tkjZKnnu@4+&2|T=x#;p&J{Y*K+e|j)85$cV1_NCMb@B{)n-@_+uR^qPt%? 
zK16a`Y;#cR?et}Fbtb8dZk|yR$Br>R@G$0Z|Ap~ey)sy%>c$3k0Y_re_r2PGs@<|+ zyP{j))e+@8?SQ&19kaXO8?h>o{G7VM!*s0-@E7t!Q)K(Tbj&5|a@rz&C--!r9CJHC z$6fsWzSK8g<7*D24_$}`jSw>CE zshlR+41#A};PBXaf)56c>*%?!xPgHP7M!Rd%GazT-!ZBXm(Yo0P)6ZvnDs}jFXZhs z3+THekN@nYdD*S)rVgnri2UBrQ!@pua!e2nR8d`}OZ7$69PK?%Hl49yFEt-z2KcGK zfvJGIu=G;ReD@f2Q{p!(_HS5T%TOXrZ*^pb=DB|hnaV#mV77JPmXuNye~F>jcI!p) zr3LqvlDbTU^Pa4C&^bN>{C9A8DXMd$;)7`6JCo`6Y+$=rztVPC9SajB>OoT{XAs<% zd)P4N{PrZ#)X8(yPGm&RxpZNo8zSWgzuiPe@R>*33~92%W9-`1Irlu>-7qdj)_W(<70ngqL4zcQJVg;DGu!OXFKyOO z>e|MoZQM<0`nbd+lA#?xDi5;s~ z?yu!m`H0XJ$u!6}b)W75@ZZSt8Zm1>d6nyUqB>pK#zZgc<(n>Djh)@vRylm#mzK~V zx4=yvhBcqZHYC%KM5cxzUGTI$ro3D9S-c6Ux3FBt5pA}cdT61XneLRyM+_%dZ!Y>C zJM*bTH)h<{LKEpDeZCVE9GTfGFEf_z4M((d?(e4zuS_wyGDiq4&B_U-q1l`x6Qo=&)htU+ zaVQUa;`_V$o7tB;d$ER4(R@-TRsArNr3)OHijVc2cdw7-RNoLD!%4xxJA(G&Rqz_y>K_e+K7ZZ+xa<%>1XM29Ezu9+>TS zF48~#1>`|>Auk>Mt4RN~EAp2ufXq{%00?2-Skq|5We5bDR_^%t9zcZ!ceB z(W5Yu_?sy@sFyXAi6>qhr`)?{w8ZR*#Ua9 zUyX^EBEs#aU_7=`)%&|HNCMG3vpE`B^Pt7*|A>Os(TWVW^+x!TR6a15CEIXx#MCsmIYp z>9X)=V7MFx9kZT9@;l)=^9<(1u@h&~O#2P1Ir}a3f7|b@r_HyX15dea5lQ35Nb$1C{UV*|xeYAnp&0N}>746q;Hd;DO(kWv3#`hC zmL+5@h0h0zoq%E;!jN>@fsz`t^R$a&MjL2OK4%;4!fz%po~O;zZ5A7*?THJ6Mzx@w z<~NKT*L~jee4tkky&({QQ`2uPcVD78Q_`xFJC^iG?cnFXwxC)FWT$A|p0TPfA5U1@qQndW-F za9$Xd3Tl%L(Mx)3s_t$>u7%l0@S?BDT9F7r$Du_-gAlv%z?89%2E7swZD5sUR8m#H zQl^_0{p?VozuAW=6zsXK^V7cf$x7&*i-H?Mk8rxn>_G(A6(9C>uqlb!a@n}h*pH1M z?2qaT-$L?!7tsj78K>}4@sd%EW#Qfj9sBOjr$6Ie^9m}DBXQ^=9K~(hwWsD`;>-~k zJ-wsaLMEtM&3f;L^>9UsVhOh7r&X?g!5lL}PY1Lu=5OurRu??Ia;@pyr0&wlXiM(1q6(T8(Vke9O6%}YZG2UN0qdz=F0qvd z4P6!s&o>KziA)&KeECOQ;CBeIUjk!kygC|Z?hf~!=^{@XSW*V+B${sn@|2rP20 z@4GU}qytpbAs+!fpyx&uyQJGc(}w33Gt!Ach-@ggUuaN|k!F*i=YbX%$8&OZrr$B) zJhoLK()0-<&vkV2a~N0Yc8#%t+vp9~#d_s^n|haX2~xt>p;E-(=!M7)K>&g%HU3m3 zh)K)Rvva~}eUk~2KA0&+7#`u-J*=L+32D%lQO97v)oeqjduO0q^yOKbLXRsLVno~|=r6#itxB8Tl8_m($$I}bXkfv9p683%FF17` zwwgRF9#yV@yjkl>8T-GD$0HER(7V3;?eE&r*q6Y}a1J+zVE?u#)eOMzeK4l}FK#m- zMDL0yR@C1?k2lcyo6Nct{l6bX3VM7Oz7$r9gDS87`~eFEwU@z~C4WDN88WAYh-eAm zI{Yq?1(Bu;;N`0s(w+Z)knBfrJL$a0>pu=OG>!6gnEDw?PJdOzpGLm_mM{aCx&5-R zHtCVvfh98q!kemO-qRsp$zP0)d0})l$F4n)`+Xe4X}O1WzOTJgZ%9}~3G)&LYs*N| z$KcnTr;BJAVGKnVkDd%(X3qkurJC$5U*U+4=dmLKfWN$i^pXhA_|dC^za%{eLdxz= zK=6l9&<$|tVkM8-`G2TFjC{kY%%*de(Qy?~S$b(_(jeLUI;L>^o4L8y!)q6-@82CN zO7{t%;Su6PEd$_FSksrh^yaAZ^gSv-RHXA4j3`mra-(0e3_{SP#m=hmI-fqsr$)ic zeCFMs4DlqJI`;0w+NBo@Lf4h+@uTT>`(nE>yCabcX5HF_4JV_Tg{FE~8hFzl)0-86 zmEM8cM4(6j;Er$lWl?geFFo_yhN>Y6fn@BY{;6EkydMH`JDpvN`qw~v3G3+-VWLa3 zPW1|J9N-13*YB45?T0)P+-@3t&@ zR=mZp6@vXra9;@7Jd>X1Slf9X<6(8&rYk{n!|jJh+2qN>kFuw}90VVN0LP4EYBn7* zTqj8nA}@N72^3bnqS~9=5O-#zfGU2Gam1}MG?G8lWDnWR_R0;9eb|rA6di%E#$N}^ z$s~OXmeU+lMy-;lkFIIpZ2#ia?tE_Ar!E;REU>(Vu{2#R-F@`Q2aC3Bt&)@H+L={8 z5%_{u<2~keYZvtRwXbU&^cz=(TBF=mE?bM6(VccEdX~{0ykk|=5C|ax0=c2C6fS$M zE_6nN{wbOqi#$DD03+yOVlX<7_3FJK(VV628Ar4&4~FBwM!<)G=$@}~ZJ6)RH7kP2 zOg8TskMt~6ie28WmiF5ZpyRQ`7!jSn|C02zr%(9@y`nrbbf7P2*(`rq+_Hz2&UI-@ zdY9T`kYb0{@e}Lk7vNDi(oM@MrBgA_w`~vIdc?XX*8?NwH8B6;1jlrer zP8@-ooO$bP$aTZTQorbB z5Vd*qXf;lk5X6?|x-p1;2k;l;G70-d?7(M2=9IFt>1$yo`9C!>JcH?kT}`g@?QgE6 z7tuMNl@#sHQ}PMzt5$uw5=gBaSR-})#u?>%1!4^#R1Q1)+E*PMn$|gMH{}f)8Z6gk zDVk(IGb;5c+kX6jQ}SvM!E=J^R4S}7Xp`_KlmSVGG0yW>ufiBIs))q*e%$j(`!x>a($x zO5!??Y!HbS{QVMU_JV6Upixl*1T16qg)L%?sCX?Olo@($txN>$&k-q3(`P77u5snB z{eo|W-owAZYTZZTB4E;y@9y;qY}G>9xt-NYNk%iTW=dVV{t-e4W*CJ601_u{EL6Xn z1e$@vm~eMU6I8A3aFs>z4qM4-+a)Jx=$BL?S!puSmgr}Z#<2{b?~Dk3DQ==c#zQH? 
z7t9yhNhC5oHs?J1m1G?GCH4>0g1-I$N4I-m*~w}ZK5|<~?Y&wHvuc0S#EJaD7EJJ% zz7^Nh!lHOlz-T*rdySg2><)E6(%Dbse+u*KAqAq0=}0vQ@<6`377WhEDCgcA=UW?+Z-Krmb+nmpufpy}b~AnF*Fupo#TdS0 zzW#W_RD$B2<7(kM91TTDK>hXGO0qY166_PM{Z5VqUpTY1oF@g^uZV8B^~OsFjh$ip z7hmr(68v2B_1aq=lrm(E_3Z_B$ttu(Z_rbfSUx1b^>D2AJe~HC zvzKN!FqceSA*7!k*r*+P`q$=72vuFKk2PSZXz&yM^~(+LtB~>>R(BbZUa76F?3+fo9y{s)W(8aHpbULsyUn|@rmC=q-F8XB_EoO z?^AK3CaVYG_z~#jithfmk~@UDHq9~%#r!6nf#wVO4f-kRXw$J4@zRxPKf02g8Yfn| zL90cdChhF+%A97o1Yd|Zt+A89ed=>=Z$5K@b9b@5W!`xarHOG~=uD(bGf3B$)s{Q} z4Yvug1K1ZMoTy)q1a&ec`lGMqqI_*FS3&UH*)o@9Tx7rP6nm~g=eLid<`bMcX*^Hb zFku7M0)-A{B|nYqEwjTyrP9Nsu2f2it=q+K2u@e!iC=6>+~_Q_@8^7EXK!gQ;d!E2 zKdr^KAUCDFlF?`S(Uo+LSAuUfFOc|3JZ!5X9JBhNYnOP}POC@}9V+}?@u6vfv~*lb z;}gY)dGo@p6;nJ<-#ZC+iU6~#3guSFon>x*pafBy^w-p@AZUD!cJYh(rT%l`xG^v7 z95Y1kkCcOeS-JF(pVcd}c87*0*QEoPj09l-EBoQ_lp0yq@;hzOr!L!QE;Mk7u54PB z?e1ofHE%AaYFd>}!}2e?KS~#J7+=zwVw9NMAh{@DII>75B_Aoa7eyrQKH>dSK+oN9 zWjx|fx6#u|VCTWaQX~&X-Wn;5KPz&vESwIPUT|4oIt5OYl!qfge7$F)fBvcv&>~{6 zyRXg*Pp{ftx9wbK=X<7%34@10I!2isG#9#h+EM))Ozv6F6ThxG4H&)gcCPNVTC0pN4b@=t2P$ddB^vyL255l zqNEJ)=gE)>AHMbGI;4PZ-upaO`d@k9KmZ?E0mV@-Hh+`E-;XJ;T`1!-pYV4OnEg@N z>{8gHkT>WaRzftgyeDAb7z{aMn*{!s*J>vXJcx666M+1^4wE*6OS< zIafh@0`d$&1Vpwz&UVx#KBG9en6( z(E^>YJBr$G^Ll%dN?w{yl;bRkg?3K%#XU%IAQ?zoFH|yAaDn5UHkDlAT^g)ODBmB| z#2Bp6`rHzDuMSi-J3m^&u?otIacNiSji{+tI8iJIz}&gxG$dyD!}3q6yIF z3b^wsz;%`56_9)?aT^BVGdjHHZ?uGJSimVqe*l5*|0@JCv#-)m@}12*oAcw!6W+Ka zRP4u_Qwu&j12z<&n%!r35(I$gqa>2Ft&6nu}! z{@>rEhc;L+<&O=%YK2FT6jt&!;FSPNxS?oh7Z+7YHC2qW>Kn;Y-5HMGme&T9C1`=V z0Mvh37huMs$N&lm65xk|;%iWe(TG+QD_6mnF5k-(CnSb&gg`msdU#rJ<^wN*t5`C%d~2arlC^`KgokfuQd&@LFdth=wArY^@$*y!g&fjEaPoKuAYlwT z$NQ=DEmP>7hX4|^bzI=kT;#h*NA#%22I~hN0rFzCJL0(g+Pmo@Wtt%Km~2gMRtmN_ z4%(vohV2)Z#rfgTzcu1$jZ*$*VPq~y*`r-%AJktyNN(YS zt%)jE488}9bV&gJXYV~amxmSkWx8s{@-gcTz6%APcx{Rg^H!hO({%E!U(axgX>G2|GgQaG3FMSWw zW5Esl7F{{DmVD+NLrkXn)eOpjABaElAm#8%jX>LpK#SFV7F)Zt{vy&kuUW$0&@{;E=>`9 z(jI(F?P&TXAa?MZqHMPxFkewP)PYR?;j=e^liG!yee0l!zbqFEd@7Bfv%Og$p^t`8 zj;?vhu10d3WUK@MuVv3rD&C%J#e6nym_vtm#%`2PNVHEHL=5R+-XMHbxY(#3Pw?f6 z^K1>L{IfH(ii2>XO>7(hBPni7p?1KSmKeCXM;@HLvn2 zml)1U{<+|TAIKk+=Uu#zUyv*1c{q^1^%@>`q56*rX87+bS2H{>>J3fwQ5~$89_|TG z!LPz<)JKHFe}Tl>hS!2{p^mri9dK=q|p$+fA`M?Fp0;bkAD1YKj>jxC5r=UL1Mg*n+#+2cWKejruMwg zEEJ4R6xcmoSS>*Nq~$mY7(2f?q<$TsW3%jF;W+8@l{HPSkBenirTgEi{U|*i?I@*Q zm6C256kF{!v)5L;!RojvdoLiGwI^H zJ!COk9Z9*lL>`7YYqMI{@FSVw^zuj|oF-f2_I*kE1y~}j$7(#$V1GK!(_?!AG0_<9 zu;oOiu~m!k;LJ*Ar43)5Ga`CTdpT#+xHe|lt8$~4S*W>Tqnsm)Qk?Xj?oeEohmbWK zSCz;IZBhgYaeO3-=R^6g6YJt*=8+be(=O0lw)Zz=2A!^zQL!{OS1^|jl&u0Tw_31jS{*-xI!3m5R% ze{Y=Vs0fZ~rl$=)_aZ4PnJqRnKOZ0-P69;Yo+{j5QMFwcl8(3E%ZE?V5bvo)uH*;| z7LR>1XZhq&k5CjIjmv7e=XfOxhzxz2rxmw9`-CebPjylc(GFevgM_0#XK8t-f@^;y zmwyA-u?$0&_LOIRph<~~;X$AUti#sXsi7HldOVmn1#osVQggV$4CG1=M=TJ)4IEvG z8O(gHlf?+bw`B!4&uf2u2!4NmZ>U>fH|>XP0%!@$W84(MyR|f9S?G|Zz`y;ozBbIa zT!WC9KpiNQ9G+L7LM_{nqgo#vg`hqY0NosAgzB+;BUoVauQ6_lWU zDBo4h1zE0-7owCy7Czl9sNbS2-WbfDvT8BNco=19&Jr2z6L6WLb0W0uv6m{fen+>h z%0_6PH)`?D>$)OpbVeBxf11=c&W?kOOK=AWUzo`9ew?HnZ4j+4X}md6b2ML;{x8CXZM1l|HsXFW8()kB)GqC*?;MrKXie zccP}O#LdEwQZcarj(zn0 zMD*2Q4-u9EJAdB06aeDxKW*I%sf1eo^?+l}EDa$lRC@4RYg z-WhZ3C%5x)l=OsKp4feDXsMHE4PNFwe5mhrp8ue`&ZR>9)Lco}7MZnxkcX%mH<&<{ zVKGnBVCECQTNtmsTaGy5sGxzBuVIXo&*WQ<+ZW&Jwg3kyPGDuAc*Be|Lqv}Yt!wRQA2JGi?lNY2A zQh4YDTbbbUrKLrbi;3_3M7XSO4wbfyRQrKXF;;LO2KN??l%6NnE;ouudBUB6q$L_# zFhkqn%Y4H=;3sNpZ+kuU2^M=xsK=v)OY)SG!j|0k-IL3>qCaMr;1 z0k6txD7#)_%6+lTWDwugqqB#`t%7l@O{oXt90E9g66(1m@ zS;%PWlB2bORG&P+Lh?Nvve2{+grBTuRnCBB5@nS^1^W~I`_>D`S)-l%tgH(5e7$Nd zOQ%>I0uZEts@g0EVj-`;b0?XF)%5EzBQ@gIaq_nuC3rn2>xz@~>!K23N~U>@l9Bn!qoiWc 
zAc@u#(v;*$VPaS~z7S!G%!F*o+jr$}e|o8E*WU#HN>VC!iJ!5vy#U?;+Au%?P9O!p z&Ca19Op_7+M;`*A$nn}9pHNw5s`fFaOiey|Jv=?d?tU|w)A`OJ9sWF;3S!N^&zjeE z*r_)*Qx1V-^@4RDBWW-~Kxi&kE9j4&h&=|gIML(90Z|M75f-J!;-mH8erHB-{EJit zASh#Z%F$TeY?0u(O|*g|j6 znqNML!0dFqFS~5lb0uYMs(tgi6f}qsk6VaP((ZsR*I>rcc7vY#dx5t$J@0E|5r8KY z0zN8gU%CuE>?NahE!~34a!bo)r#j@`&f1y8YoWyS%G$tL$xh9)0Yk>*hZ^l#4BT?W zq=9TSPit`E!A?*R$DiqY&<|8JJQIJucyhlsW`E_g(FXO+=gOImPX)kj0LV0(Pj)DP z*Ha~w$cxjQ>zv6?r_uhV*MA)kb_Klg6@;_}^d^hQa=i@&=hQ^50|A*HyTAkq%PklJgBu!SjTy{y3p&?r zz}0+f63`+x2aGXyYf^;Nh{EaJ)lVyNm-g4Nj1iw|Aq~yMKGaYivc^jd(Sqr^b-Rbb z`$hk+0Mx=sffi)_Y;qp_a43)O0QWGP=g<8E7OnuW@PEySLpY)KuW*8mUMEfxmNt0d zXPSLq#ICGT!!CYV*-fk4Q73@mUwNh(ee4*Ox{So_`@rAp?CmIWjrPRN?l1$@BlhqEYr5r{!l!WId%Y3=L z)qsoLN`;4c)I-(%s41dOV2|9d%F>n*LQCp=TNB~}&M7VKHv2D51fjQz@b-2jUBX3; z=W~nfB|??bgesm*#3z z=adz#g(tVD|S#Q3IFJrVx+K5X_+K&Rf2d zO2u=5SMSC+k8E1AYh1#EC6Fu*V%xF{Uk9)(?|)|L5_k_E1)Nbc7eTERVzpm_QlYSF ziQs$MlG%=lzC=Srb!%vP#Ur!n17l>$yccSa8eMD{!E*=~6Pzq~b2S*2_v0TJZAwA0 z3()aUE!w4YbSNEkRoPrsQ`*`N#_@XWzgYJQ%AfzL)Bd2J3hWV91MHEdlgJhkQerNA_Lr#b?1SV+@!WI+|K&`5-Z?Ulh#w2XwGyX7Jj8A4hHf{_Hlwjm zSKH`x4tX(g1=ERL^hzY5d{7;GkS+xrDt=f&){ zzkmNG<1E+|4SIU1^kX%ZkZO3FQ!bn}DZyO4|J1nq4#1)md$*pY@)e0;AsaAtX7fwn zKd|RnR%e}pz~JJ9@cj06nVUyYRILP#WWCvCAxP}I(Q1-KUXOi9;aBTJC^J-o;)v&H zt$JPTkp!pBea8wQj0XB!y#Q_ytEulY>Jh~^{1tQUEpxuYszz8 zGo#L(XzgokT9EU%9K6wV>CLt_=X7aXF+Y_KJBgFkw8>e_RYC=*FOvPf z1t@bNfc^emPXqbgdfUOz#nrEcn!rOss&v|Vlh0snfi;qs2Qz(3eiK6iA16Hx4b80rj#2vtn`m>ulXC8RX1$MS$5CvGK64Sa03 z1l^_MzOk8}&SzhDb8-_3YEGTWr;aKupnxKavHD1$Th<#!1D%(~H8KktDSU#|V_l`jG83d8b@;Ye#09oNTxKBI4aXsXnSIsJ zVr9NFBpxG$__`r7R#$m`0Oya}iekyr*cmVZlQ0>XRr(z;qH56%D*daF8%w<|@D}1U zZFXy1We60E_-xRd{^4_cwPVjZS-(}uxO%q|xrdINl_Nl40PD@bZBIiqdwgY4YBJlT z7kPXeq37z?B(^??!3_#pICr0-0d|Vs4In5tJ9fy@!}VZv{U%=6VwoV3uC>|Q$4^O# zmo9HpA9*#ds?ysN8V9-6c3N2|#r>qVDq^H6)&?IN zQW$0eE6A2FkGYuZp9?2rUFP3~IrYd!15Kvokivo)i@7LOElu}psRXwmhj3$5-qv~ra! 
zPu~p^LuJ;cN%n@j0VQVyC{o*`CIqMlLX(XhlR=v`G5~X~psi>%zXd>u&hYeeHR+FT zDijcfUFjDisjb|n#x>)VN_S>3`q8{wnItpk#TUNEoTa+uc^(Y934#MWR!G6mCscG~^D9j2a9)nS^{7q85>z zVAkkD>?{w{RFJmQ%VWq75JSs3DI7Qt-1yDg<0pIqa9zT76Y`r&L?48&%QmXK8r+d9 z!tD!$;Wk|`rCY!JYUev`R%k%8Nz9k==!M&msDl)dWmIbTflNljYv+qxaXL7msSh*f z2aH?B1B(~mq{P>j;Zm*Eu}eP}(y`}PWol_?Xg+yVt2c;Zynt0OiSiV1$Zqaxgsdj9 z($FYf|E=xBNbU_$5vNEF=`;b?xjsQP+;$s^byvs!a*0x5qzV*1gR|og1Xy~16Fhjf z6Vm$(CWhaw=AD4Nfx}*vXiZ=wzfnzHlh_4OTG7becTa&9D}MvxG!*^MUqOoX4yiAx z-TmfPzLRA;^Tp{gw$lP-86dewx=^F-7k2>#o!k11IC&zD!=H7+q ze)r*!Cmu>F+GEM8h~~z8f~Nl7`lgYh$a)=Cp$!$oBo)*h%~7K`_u?2*7eoXLOhOo- zH9ib2O-bjE^(Z?<>tx>(aW}nWGfa3c!|hyxUEeCu*6F3%bW(r__teTGh1xoD55gt9 z%o$538RM1&_BN{JZm!M>+QmF~{-vSgTAb{3js;M`6)3OLg9HKn+u-v2%pECqrQ*G2 z<{1O)YIP7+S^zO5PNyn7@WyQ2fz1bDQ%&qP(h#) zZ&+pY{G0PN^PTR^l3{zbEgnlp)<_a1-RQ>f!<&uJaC-9tU z+#YB$EDuueT}cAT43H!7gQx3TG(E-^KU0eW(F2fM^ckSZIf63y*MQI7ck#*5h5yha z4wUJ8jsFps_`jx1$u(k2N@Qm3&>s`T&MsBv~nRB%=n)pAa(WsJUvX#q%oWT zvPW-;plQ|X9F;fF*8)Qw7@25)H0gKX#(W=3mHXpA{tY`1o(tCiw`eSP z9Tid3WqMwW8l7N^g7zO8LeCDAJ-gG)QIu}%0f)=6;K`DYUv(c9euXCMQ!aA zN%yZ!ahmY`5u$t$h4$%>w7zBzlC_B!L1-QlJ_bU2dwX(I;eVIUx(SQ zBc>*(BE84-vyhsmVW9qqkMYHT6RZb;lHZK{2uN{e{n?ZxnLr zk6h5h;r@>2{@e?agBFpSg&2;QoOMPya|J7{=Yl_s?APxu6^* zvhmm->BoaT$~OOJF8DJb$o%1W|17r3r2C(_kUc;XJiB@PpT+v3(*EckFf0>jpPt7j z|5+?}@#Fu@^`9^PhtmJg?s)45g4+qIGLLCP{$Y|jZ+I3WzQ4cUE>r2B!ZYFr46UF> zVk|$?3o5vI_C@6%Mv`T3oke5YDtd4IVLo^`V7;3N%;1qMQjrtKQ@`Z#|HHkpjevK< zi>wxora5=o7j>@U(owOX@>7t1YeeTE{s;&1B$)K%@Y&zxR-g(X3)CV^%=Zf)J=P0M z>Sg`v2ztNQg8OsYZOk2+#@O3n(%ZKxk3{AdFnnPJJL^ZLu{GB7BAWgaRAq1|#!r9( zXc%yFIWmnV%D|Z=ZoYA}uA4xEJee;oIWmpBSLvk$2#i78@^F@a8##8xL(>Y>xfkcp R9s~c$NGeLC-Ff)p{{zlJ;^+VX literal 0 HcmV?d00001 diff --git a/format/diagrams/layout-list-of-struct.png b/format/diagrams/layout-list-of-struct.png new file mode 100644 index 0000000000000000000000000000000000000000..00d6c6fa441769a3c86044a52186d71c0bc23d54 GIT binary patch literal 60600 zcmeFZWn9%;)HR9-N+{ALozfsF-60*)-Q5k+-QArcAtfP5cZUL-kVa7&M4G$yIeH%T z`Q7{Je!BPd3){W_v0|<{#~gF46|NvB@em0g2?hq{p_HVk5)2I71Pshw5(HTAjg1Kc zCHQm4Nl8K&rfQ5}8~g{+UQ)ve1_m4afK7QvN{QkCT=3CCMcr9lR)*Wi&W6#z*v`;| z(cQ)#Tnz)m>&^{6+L$;Skht4e+d6T(^O5~?1vmH%{hEo4>N!< z*csUvnaTK(NJvO{9gR)7l|;pV-VXl9M`rHqY|qWan^6c3&Cg!J1 zPoFY?D;S(SY@H3<8El=%|9!}>=ZKm(897?mJ6qVM++0s>woTM`FZ#6-~GIvmkHYO z-v;8}QU2#!FwFc&yiEU189!3>`qL#C7(o~*Q6UxgJ3HAfc?2IJx66Wg>JA8UB!V?7 zdvK}PGzyCr$e48K^>CVb6EI?$5!iMPj}#V{bOP_W{a7yLZU51ln7;JhYj*EbOYZgh z)<}9`Zu-)QTh3?R(&a}2R|f(<6eM>LB?V#8+zd|=hx&v>fTs{BlS4V!e2Q>Qr{|_6zg?g6XHvYBz&8I8}$m_oc z<%HtShVDzx!%&|OZ_MspUwkhwedj{?moenoAbz01z2r1DbU`T~HL6DY+a}C0{X=kZ zbh;t0h7EXrjrT8eV#dD%|Jj_2_kY=dI2y2lh@#zJ7yf!{hdHpM!qtQSWe50Tzz&`y zG?4#2>4o4)F(KV||9889Efh-5$Nb&xC~?8gXZR@p+lUpxtE0&*{!b_VnUw#p&qSVY zA%1Q4De1v#C~*w?jQV=4fY0TFvQqJ*?neSq`0SP&1MzQKkA^4`Wu8vUu#R%AN3F;3 z*vd!u`~1E}CKTa3{ezD)?bBKP|_@K(grZIh&6^O%-J&L7FiVxj(e*-hV$^k6pHH*=>NBK5nbS(Q%%f)$`Tyfbnj5(|&W&eznO28vU&Bx?kYt za3%1OYSo&?uQlWuA;?s1uj|l@B}LSF(1q-KwN6}`?K+>c_Oao*?R4Hsn!X0EVU(>) zp8uSpc$kvNeZ_y6sXeRiv27EnVTE>$ZmREQvYKuEqsZQY|EfuU#TyThMAPDrlhDV z+nx_5jHW3L#?YFPnZkB`&h^!8HYVn|oRlE$?fIPT_9;ZBav1qyBT@c{-5U(8u=gV1UWBHAv}9ckHuVpD#Ghu! 
zYj}#+=`pM0cKubY9qBFeviGq~O)s|7*fVOuPu|C)pusr{+co_}Mnk*}W4y;B_X(X9 zVx|-&2!nC4_(~JapLoST zhjL;s0OLF;$Zs-5%Qk$0IviYZ*!S3O2Rdh7XM%YPpE zl6pk?ntL?s7?;=XTbwr$T)|HshTg*TcFEbYP7Nqx+uvM$7U(C#FWt)Vvi&4I#(%j( zu+|<|i3NW}_cLw@K0-IQ>I^69$U)53Fikf(?8Twpla528ULY!(Y0$w@qiEPw$Yf)t zCz?EZI!v#eAGyI8d217Fn3LcKmmvhxD_J)%3~ae{?gsRcARV%q&AIH&YWH|h!jiFe zfUH?_J(}|q@y$W#fU$+=UPb7ykbFYo-=Nz)7?72!>xsw>m$I@ijLgupMu2^;pX_;i z{UcK7xjQ_O&r7*jFn%Pl-+bepH496jwx=p6071*nZfljRG_Mq!Ir*Xc*4*&Z^1Nf!Z zt4V!|=W|A}o6H0bQdtgtc%<(U5FWh8-|8(~ZT_4itz1E3Qvcuvpd8aH=Q}JJn|Seu zYM&~C)%k(rykA~j?Zd2`l8m0}9*m_bNxuKI^ORb##4&FeVw(9-EP^4)aM|aQp(H_B zo{wFI5iM#144))Lb6L}!*UcVb&Rtz~{&(-Ae<<2e@MtaJ(aozCWl-~rP&ai~HN9Be z;EJ0bdrnu`Q}8zTYJrLfJS>;i;}@LY|Hj>s0+ z`p2&k?knmR*Px=Q&V94zEN*esTh%U=<=aq8VxrQq_mXFu6b0_|%muih1)IjhT`C_f zG)Z^$-g@;C{fbG}CshgKt`vkMq*a$O73`5$@tARVC(rWDDXOWz+ z9<;v5@!A{Y$dp|tk{ZO{5_^zi9y;-c{;eVTgmqnf4u}=*0V6RK#SVyl_KFq4D3_G> z?8<{fh^H!H;uEKm{X@ZZW@=hCZxh}p;*m;$=@KS7a3x=GE2Y?UNy4Cbke1QV&Qr8U7p*WuDWXI{kwy+m z-pvQmGQpYLgf5HJG#;%!OrRWg?lvB%7?vMR{E<&a$Gw@`;X8)!L!w zlE%m8Oj;pJTSAr_dO6-*eLUlaNVuRzJFw_19F1iQDdXX%HsLpA;TF;l@Z(eJ2#eV^ z3LR#QN=HtVVsqalrOQ%e54K6rw13A{o~8Pkvb^*Yyo%NSEIL2ib^gg~|Iogh0W2>K=x>K#fkp$&N6_v6gn4|pvMVg$Pk zb}3*VhO0Y^uJ{m1S?p#xjnUBHz6#zdf)Qq1@M>o3!djO~qeG1HcP1U99p8JUU4)PQ zjA*NxoY21G&LnJ$TTl(cFp+!*FY*z$fL!#mt+Y+g&O`=F(m}&s@{&?iX*S79&n;jo z*O8yaa%dEvNBNedn|4`-;$a5|zQr)YKMKFvnPX-!?4#hIPk)8ESQbh-7`*dDEfXm?EiAa?fo33{kqucSuJ>^Z(Y2&*u z9&9D$0w+~-M>t$M@DXXBN436M|NMBRq`z%+-3E*~YUV~bbq=mUw9{;Xaq^z&Q+HAJ z{qR>|C7bX_-BcLIF0bGlv?!=|I!1F88xKs)JOS3w0XWV(7F)}j#qbr`?U`o?)01|A z9iE7xp_Kg@d5H$HWg&Y5A=&gRU4x5t*cy~{^*2pv1#;6?c`6mXh>MJ2C*7%$UL~+` zj7(4Jy3Mt159SJ{BV%{Qp0viT-A~t?Q<0vLy}^%Sz%mz!vatR&$=Skra-H~TU#*Lb zU98h4x%g`!u&?3=pdh5~yeRQuME{TK>ciJWX3Rw1@u(j$C8a~$O9t_<-LK5x ziq5)s1|Rj@f50kc%S%wzviD9!s7Ak$K4)~3q}*OSnlVMk6|w_^D=@P(5yxA_GN!%c zj?nGFo~m~qC-LKhPOT$d_q$DEZn8C%?h(gPJ)3vp>BQSqmnjSfs@H`%jLAtlJV*TH5Avq)oX+|DQC5! 
z!>whVLVWM{1sm6gGj{81wiy|+u8v@WOr{_{!*}v!+`vrMoUQg>$Xr}Z=iVgkzH-c% zwc%3eDQ^Dyy1X`6D~e^|t;N3}SkR6d^|o>5S>)}#uN9xC+md- zNwqzd!S0ip+|3NiO)F2#KB6?M{F;Eoz80Q06)PiUqqwvz)FSGzL0uMC0DZLICCN-G z8cFgaZRt>kVbDpCqD_}2DTF0#0#Vqx$Kvunn^rmOTC3Q5DPdd{Zl5ntt*|Z)5+~ZHix8f*~_358~>J{A_OtENM+1YDNSm{yd=68qG z6g$tKB<#*BQ+>_QAcr636Dj?yAN8?JH5Jj@!GqIZ@m-M>3ll7=nEP{_1)3j8i<_e& z%IpYZP4=iA-n#&aUAwb%0!)a&|58vb`Ga3gBa2*LB~;22(tB{ z?j-nG;u@7E?w+4YC*_08 ztomh1`CmlmuPZ^4YOxhWt3mwi48igN7aqs^RUJ=u_z-MrO3BsY+hI`@BJ2yj(vF2TBdFHU0xV3B+ zwv!=Z5gMF#mAXx+4${~jY?-%EU7+Z@b(!HGCj;cM#tzmiy&Gi`?j^x>IW$l7bwjxbk-W*M&(_4+x z@m(u$d-ekQe+((|&wO{*xG{Cl0Z5 z&iBW^Dr~)Q)j5v(+OtsELN;2g1>k0+8fls5CY>bEzj{0x*cX+*P+G89d%3x1#qv^jr(4u147$D>1QazH(O2hnalo zt}8Jl{$WA|-U#+vMHqLcqY|vtAQyx)mO2NRz$jVy6J-?R2JO`EGOf zf(DP(9W?`ra9*Xv^x9Qp%)Ya}QeiPUF~OxF6WZL~u0e5RZv^t7_etg*?&ftubpC>L zY9l$a$^)h|Dffg`WJ(6+=Lcuj4ERb!7HN2~hHYvRE2pB^%8yiaTt4QDgrdXWhYY&i zp|zAVIS;{r8usM<=J*v1pA_5^=9D9NBx?ga!{K??Ih-9 zi!NB)%8lLpoy!lc4Z3jZUX!J~t|fMye@K_SD!%5HOzi83i)F{zH2s`=yp)y_gD;?F zLlU_Df2hk71K8asTxeb&F-n;)nAO@g9v@n#JHjJG4rMxuA_RwX2eRMuWnd$65H-uR z;ea(w!ps7c)u89M%Fy{9&pum%9T=(a%u|L_QY?~ncb}9f$30fX!x}Hiq6pk(GvT07 zTxg~J*2l)f8;8d+(}l~MIfS}fcqCf=LY1*&(};%9#GGO6ZYte45(|-1O@RW8m`auU z;FTk%{9sqJ%%^-~r-Y9zpJa{YKW53rF3oiuGWQ6vJdP>eGPQa&FzO3UMxr|iB=pU8 zvR1#uWT(H>K!_XWUY!dmbzEm3e}(5h{P1lYx(_Pvd$e2H9eH3PQYLZ z3U>_*&lxBnrKSC$ejJtI-?*q?b+etxWw@{%kFEKAvMHh&RAVtG5rTRHd& z>aV_JyF6!2x~%|?d(|`IdILCin*sAfDUtpu%uYvlU8dKg#@TeP{j?PmaszT>=#;@Y z2lOh^L=C;P%>-n+L6hi%qwDT_*4#ajTvL^3S?nWps#PU-%rIbDZ*MLKI|b6u#mPo^ z=Gtm0>nuKyOlq7_+%D`utTRfwII$|lU`lt&3^SNH;=|+CmFX2kuO@-ga z-0ROLh0rq3)RrjI_{XDKG3Od|hZK(>A?_^zA{En28vCr~=3z|)B7(=bDaDCWzTsW@ znlrpq$-NEjOz(U|lb&K@4~|mF4WuOA3J>wo4lT!^?^n)A*45fnJdg08)Amn|Xl8~wz?W}}vMcKY<8W8Oc#TML>w zQzu>vR|momW}*zb?&t9DHUqTsW$ASsT+516Xf-N|1Z-%-S>o8XAo1i9Ap#67$j)?J z&*2(`NUPU9EIBe)@aQzfGg2NKH+YLoEoILF8yrF^Ptd8WxlgC_4C~OVP^O1KoJ7a> z`a7jn4LPxa=^4rU=hy?L;uHpKmh-7hUL2WnUiB1O&rhvww&PlO$5JYMfPdXOa-h-n zKI|RrIU9W1x(MN-a5ke*vfoPRjYt;>dC)1b0w8VwNK66$e(N#w)mkF~4*~WtCv)g` zm!=X{zXTif1u9+(BemEBljo+cZsM}NnS=(}0)&u~p<)l^Zo0d=1P2#>xd^uRZ1q2p zta#49nx)))XY1~BWVLeNe?X9OLEcM&0nk&5ez`GcN=B2tzHlFPY7I9>+Jb71WcYrv zW9H$EOKI-tvp{brU zM+e`4`35nA$Y68k#(pE$xEEt|GA)z%*ilVn5LfkToJeMm%($icd(;bal@zdZrsZ+s ziK08QbA}|N*$hNvZiyE7i6n2BBTNPi_zX1BG8XiywB-R?v8CP1xX(Cbi`m2ysOw`GK|h?v=U%44$zndcberhs{i{n-1!uLWKWHfKfI zUy{vI3$t}7@GenZGb0zpjZ@a+J=s!NwVH zkcJW{5j+rRI5b(+iY;SBDjLvrpTaIz87spx+BE$&A775)))0%vEE;C{vw9vxX&CgE zn;IGr79c&EPTE&OS7G$ylCs)RcMQ>@+Vofyx(qh6*0c;RhWF(#%-y($hy2kHse*Vo zyK;-!rcR2y{d(7q<-*|7kfOpn)B9*pd033_MdnJ+x2N->J2gQG!3Jq45$hGEx>EM{ zM0iatP~*dOdoe|_27zq~Ty(9Zd?T~Dj9Gv3Xo$C;)k<7=XSrTa(sqAZ8yV7FL5gOC z&@FgQZq=L-#ISjoeO7NFB6q4YFhiMr`bv=f&|KO|g`)jv6eGHblPWE*Y#>1_i^0DY z6j2UgMzJ2;u2!rRk)<4v1l}{PcbCO*M?4JA=gCvV(K2z4zX7vFt01F~XCz+ZnL-rb z#Jh(PLYpiTfT8uIDl=Ii&A8A|JvrPnd0yx=6cU|^JVN-f|2Z#5zQ4c)HGC;0zx`|B z0zB3N$lh9|kp0Z*w4#Rn5jV_owDE(Dt<|s9T)fg1!raY^pH*H4(>}z>!eV?9^5ev5 z=tynt+mUcI@ryS4IqT_iJcObl*e{%6vt<@B&Tyxz= zf+MP&&r2syR0-~7v~Gx)*#uzJ#8>D};*7=hI;gU~q2SWE$xtzLxUzOvP#MLWlIP zs@MGV?h=^SIYqr#-n0cg_sry%U1fOR&8T_Jl`liN}2SOqh+Y9V%k zu{rCfn7Z%6r|E?aY~Qdv6y&J#%!BS{1NuK@Q)x##)`fAqAW8CNjiMRvXM?#_2i{G! 
zTtS@!+z7F-m1xKkeKpQQB3@j1O;PFj=S3H$!^u6Up{Ihk%>8T5N|m;wktNlWyRFA# zX>`iS+EiEmXvJxkiNjXh$c#C%cCClv98;1Z6I{cv(jy`6rc{LzuV0Y*3wM{|_;zVq zDZ&fzVZ+qTr+hv$th4?SI?VQ%A;`-qST~~7{9!l!wdrR`wp37=!K0&#S?i_UP=I-g z`yk!gmXAo#Pvfb!!sz;ju#s{XQjblaE58^l<$cJKOdE~%1gMD7x#>+3{&;-MuxI+A z2=8M^W6N>nacSWkp+JZ_f&*jkZ5Ai_AeA1y#xv^vcE{DeoJwyYk47eu8aaHxB0DSez4ypQLT7GxO(U-0&= zvRS3&7U(py9PqkS{N*Q-3^NVU`0PmSs?4g)3#Os>y%*Cw2Kqwd)RR8$Os7@mNUtk| z9&LzFP+N;?6W+T!f`E`kx7Z4h%-t4wDuy@XQZ9pVdGOKs4%c#recmfB%qc> zt86U0Bj@nQl)FNHOAXdcbL+*`TD`USXrI=xyhOa<^(_2~_jbdgqx520at_H&wt`tiiwx757Dx0Hk1}#!rG_>hS5ibFWoBD zPc=L?l)d}4{P;CuNHWC)GD7t>-D+)(PH2k|6K;*jfA!5+bOrkvFN1J9Q4FrY8Y+ec z*F|8_wbEx_ck@EO{`lze9aXL+Sx_3W+JZVh9n!;Yb2~k?w=46XF6fHM|K>LHU>P3~ zl|IP+t@V264KbTg2dSf_b$Ryd?;kw?zuQmdZ;D2uMCXHTaX-V<#i-?1^+__14i;&w z3cNj)k+23Vwa?bkimC$@ZP#F=o zxS`D5M*F{h{FCaDCO&m8~faxe{NPD{Cq*)OEAz2pEbD>ZqX5tKf@x^Gk;W3mJzOH_EXf!_te{?6A{*dxNR&JSjG-Wye?WUHj$tkBv*N3!n-wXP*1)bCwf8xe(wt zsM3uSv@4z6LAf#8=VJR2|Je#uO#*~64DU~-=zwshc9@aJabuwU_Ie+4!7!2ml*IQy zjrIF>Y3^2>>G>tJLJ!KflLJ1NyAl;kl4iG4M|fsgoy(v=?6%V3{~!cc08|T28y;*+ z5P59u)Q!iC0G&sA`W8st0D7bbl@T7fQNV{$BB%Yi_D-l^1wQ-;FPoVa&wc$K=vTUf zp1keV0C#&}$g(LOZl!^WG?C|*eC=~Ysx-fZ1=|(|G!3+fUQMW`2Axff4&@Qoil3o} zBUIC{1eg?uuWOM^B?mwppju0j9<&fW1GGFuCh98T&O5Vj^lFgSuYkGM1)k5g-y}ys zl?@POtSReG^tYdUZ?BPhc`5%`MQoXZyy>%$NbE6+$c)Uh>wRC}lSzr``fJSxjoRMbPGmUU`lVr03{3%92vKy0TBxj zAz(XJdmCyJcEwl;hqm0@P*PLuQM_JpMU>nh#H{$8tca{y}V2q4B7% z(H&}|G*Ehj7M;HV_7h)S&h@$Y=&HLuu~MvP0PTkTl*zp!aemnEzzdA!a07W&E&fqKB;!zyn1>RYV(Ms=bFICSejlD$D^P3Inv4OP z{o(-d#n{%Y=_UU|N|8L%XeJvDtuXxX1zW1v;kQyZfX0y^w6+DP=UUHe`+iO}E zP719rFJnOjoNR-M~(J($XAKh+Y@#z5IDA)!kRf@F`KkS>N>NNDAOB-9+v>x`+ zRo0_FV*+KbP-34nIh&@{kSJ9W!s&%$U9Lk2pyuIh83d$Iw# zBnbis(O~SqtgHSJa5K|ydc)Aj50Hy9U;bFaVKuF32jxHbW0Lc;3hZ$r#{tq_8P|v5 zZ-{1GPd?t>tj zi6SsVIW-Dmwq7l(3dC`kp`T_Xc@DA8#GcG#$&`b73YS;WqzJJpacpe|-vZ7Is0n{0xG3{cj^Vr%?*G6OHuCW7BY|@+n1H9aBuw6+;vR4R)it!@gi_Gv2nHOi3%2 z6nWGQE9*2YS9Ur>G;2D5t|^qSE-lwFhWqO)uo`nWD}vt3rvED7{u^=p8z?~>yeNiZ z+JaadUHAOqF$CdEhD4or>7z{PrsRnO#LW7FgRsUaC)*5aT&dvc8(%p#;zmv9cus)) zO(dmGw|;{o|Kaw4_#aTiKLo0qhBGP_55vMNMI;PTwkj^TJr;mU)HY$U#;9gh*B&J= z9wfelW)Z>I^S2+yeNI{7sMmXJvyk;Z+yuG>{T>y+umQkPqUXSI-n5lr^(ACU7F0SK zw!U+MMFo|t)yp5`ROs0@^e^I1?vxS9W3YvpWbG^|8PttZqz%)2CCUxH`=_oK{Z8OL zO3cycBwU5S%lo|3Cbk~)7zVP2wu{%=>5Ru{S&Ca=)|oWYKE|qPJBSS`@5T+H(BZmk z_ziIlM^j_>)b1mfj5(T&r2Y|Q=b?~k8z;bq7ERJw+2F2}D*>s(E@&+_rPbqU5TV(( z(u|aTfDl2IWK|PiH^xmTMopWudzcj#FgyUMtl!X|gJ+}9Iv(XD4w*=LlpXWwVE>}sy1 z#?(DQ7_D118;mdxnYac`7{dBu%g;K0?JEGfZOQ&I^Z_1vuPabSVZzY7FxHa8?(ADN z2qjE8lKnkC3&Pq+3$iNGF+N<7&H(`xU>cTi5SMI#nl#7v<{a^2C?t`7zZKXeUe5<* z`E0lsqek2Z9e07xFdRvn)%Ds1UhpDP@5*qHGS#>x2`T{XLS}-7C6-2Aw7Cmff)%3ty=9VY@T zuq;A}X&!_5Z}AGNY5_Db#6VPSZX=zFh3@P2eIBPV9mJ2 zxG;&(Z4u3jgk#*=XVhYC`V){yb%#9EG6=>ZOQKs)GyEzlwe}+O^n4_I0Tg-m-p=Yq zK|OML^S28+!|(4wsA6nd2>|23w}4-07)&jktq}!+9z^+$;I=a364u&K!&2KkN$sj`nVI1x4y}0_ye`<$K&-?lpf-dqmW_*-)DWl3*yF zq|Y5_Zew*aBl}}e*fIn~;7U~s#r0BhtR8t$`IJgN=g^>e76HU~c^uE)&Rdoz)9e3@ z&hLS2kptEA5hptcXy`#HLW=-r1a$*Vs9oJ%GVHy3kF5n!c=D=wq)(H6DklD0vZEw< z(l)HGPSrEzc@N?8%^|f3Xfpig8=@JMw!7cnw9B;?fLKPGGJmuB$d~I&UZAISm@EAs zjXt>t`_Q56h5c!W^inoU48b!;;F=7~;teTEp=!KMEUfwLQ~E!ByE7RGd(^t$X+K{9 zu%_z@F=1-IIs&LvDrsKB_lmxz2Yru5iBd4~_lzU28w4btt9$krzq>}@)-uHMNCT$x zght=-HvIV>$^VCYK*~g4`~S-IUf&+t4>L4g?evcW)63pcc*>g1Km~wagn1C3N`x8R zhI3;flI+{_vhTBc#G;Xm@60n}r&s^z+e5??yKa_eMV1KOAb|mi4?U&8*-D^F8`y-T zCF6Z{*guI7z6L>hw#n_e;(wp9{mPNn7C;XmB9jQl_8g`;eF{Gb&6Tw7?Qs2`PEPQH zq;A)V;i5~D$orrZs(?-(0Z9C%i2Z4B`5M9Race>ZT+0X7mg*Lnx(#UOE(&arsOa)z zg);pyQ3p3BD&xp8eug3KZsppUpUU7JJis8Qu)2*HKaHsh11GRN1LIFaF8U5DNF+g 
zz7K+UJ9wlW5XmptwP(NL1$ynzK%89%Jg@B2K?=wz@S>^(y$Q$P^lm+8G|WbGB2J+C zP@bYj0KC)9a+=li6rcTAaw8_JK{*DXEZg?oHny(&su;G0@hJN&0Z2`qba;V$7;5q9 zRsd}Qo&Gt{s)7YuL!iX+JO|KC!2RJ^;6wBSA4(Qk1eh0uQ+b}PhJtJ70Zv~B$T}A! zLgvQ6NoLzLKnl3efsmiC#y;2g#zP$ToesDiID?qF=60Yu9|f4&7lZ)jC~(nx-`ks> zGYx%z_$D}LPKIl=vHaQ_G#6;MH=)|ubpRFm!2SeYkXo`#gS^f)MZ)PRJ6F*A26l*rQ@Q_!z?WMeFJt1K&h3>(8j zIvFV3F-AGSv}4_mlJo)kXeSq^_sc*L+#3;J_fb_#Y>7=46aShF+3AH~Xnl!&1hx22gPOs`>+@>#x11MZdne<2AI;(8Q~HnNT-ZkQfQ}5(VwVQka{O@+zaBYA>0BkAp z%dpm-`S30eASh=sY6HfUw!*PsT?YixRj-=9?$qaey}ba*Ajt1Y3aURCSZ7(3q|dEv z0Qu%Mz|72=jP(=3)~>=n&5QtIWK};j955-t6oCslB?9QMY6eWYFR+2z4QhdWp%oB| zdD31$XBXt_&Y^^^evi%iL0AKk8QzxI=WG?EdQn2jkKLeX9-7Z-+jo9g0n>U2071ah zjVqw0FJ^y`{6q$h`89NBg8<0n5mH)!4U1ugG=eS}&289&h7=)hNmJc>@p=M&J|_#j zfW<}e>77@ATzR^20p9KgOd7LPqqZRZzX1a5%@=yuXLsh8y8S*bX})aoepO+=ezmmk z^P?9>uV}E=fXeYnn1Q0q`QXWHAhw)%@XP%$@)*fNGXFR&i{fO%4z9NawNsF+VAjoC z=P4=08-Qb;r^ic97r|83S=*)~t;|OBL`PipzDi@qT`P!s9aRUyxAa@dJUQ~)G^3b7 zji)SgePw_JWPD+u65l+={cXY;{&KY{(G><_i(g~*kP$Q-ju9+_Jjw+6a;%^b%5@ksRM!e8D z0V54l?qpg0mJ(wGwV5}D3t0dl)xrP1_5hX)_X}XnH@7d?5Ednah?*xNC6siozp82m zTUERPZ$_KayZ-|uq)}JUtv|i6>>F3m4I4J>Y7Pf0o$-OgRxHfuiot6;z&vtk+kSlV zk`zrDHk6V}hlr|YA}$5p#9q})eh)xq6ntZfoY>bfOs2%&OCqJ1;L>byn*jKS3xGFA z+7t%fZsq^M^{wEKl`P(!yC*GPB9xow(FV3LRoGh&A-sb;>2w*gpr+4KC)aR}h{tIB zc#)cuqKzDJ&KWY2FuXR0_xPSvlZiZO0vOs2Yq{s^-#xn!sZnh?H$?;T-ii{D$%f5? z%`xd=og39>tYdPw>(pwLglpWxGm!xQQ9p}hXi$gSmNokeYKwuyhb-~lhO?Dh^t}@` zHw=MG$T)3%rv@B#ODxIMy@&WJ`Nr&}2_1AA?$F#koj7GRp>#m>VJ$K;N@Ea?VJ;Sb zMGgZt)$@rv_Xz}zNoESspM*WS$bBUMHb(VC&{c)iqZFFgkRiE3_Z-Abh~jpkavKlC z8A>_*9vVsfMFmsmg9_rq9cQ}4PbViD=DyVmd#T^7?EvaUAz%HQ{3^Ee&T?_8ZibDt zDp0xufC_af#+-^DsL-sWE!Ux;H(Ld4jww?=4v#3|WeN2G7sx-`OafjP`Oq;Eqca`* zZMX~PDQ5992p%ET?|Kt{N#B09XGK+sXDHBccPzkK!5zzsr%GZ#SYUY!q_EkC?6jYZdkAn6pLO4w}D zdV>(w6Cfizj}G^wMZhe$xkx!tmIIGY?L6*yi0(}_&l9hAD;d&Bn$9~KX4%Ql6FQDl zUOumhOfdbv1-xY>x7cD8V8oi{O{FtZ{zIlvuZ4bxAq!C~mU8NbfS#WISr`D%CB{B? zSs}3X+E!_t03Y~_)!z1FD$6KVsee13PJ>Jo7tPCrK-f4MBn{3t0f(9FNXF9rfEBgi z6zn)cRHhEsz9Z9;&oM|HgznQ!(N%MrD9=BLb*k_?ct=M7Mwio$Q34ojTsYl~-br3S zkXZZ}Sc8afR#(IqM^S3=01|A@J3+a}oIxY6^rQxW}wvrG=_M-9vg8N^UuJcYJLEyQxZT zRKi#u!vhVu26_;@;dj$2q)>H!azj+421f;L^DNhS3lo^!oyt zzPbY-t8T7?Qwlhg6mfcMXmU^xM#uw2oLmbZ{LxiYABu(TaC@KGi&8BF8LD2d6Wced zA3Lzs5O}cIuE#d@iaX@*pfw_37I4^}0Q`gtUV;Pxn&b{10mjnG_1OUU6G^CAfEyp` zT(O&w2LIbC<&N_c3iP}wFp}Zfh!dI7^|G0<*Zsau%|pcz+B1OH;$VFIIMWp7_N-rE z&brQc<0A;0hagH+mGlEBmaEVEif1NUT=`XJHJz&Rdo2?%Qq;~=+~t*?WMNY>{)^{eGPV0C>OND?^l)YHh&N zgDw2bETIT-k!5x2eSfa4r@+!}9KN%Z)im}4Vj@R8(Acmg2GD^&=B?gPidqiYDFv7d zrd={$m~irU3QIB*J*b>}P)^Vdw!Aum@Ci+y*tq&|N8h(lWtbNL40^8&Pkfl9n@b}? 
z>IQHaA{uhN4Cb^>0hiR}u_Z?JRLP82`%_JX8Kn zw9S|b-v5Sgp?rf&iFVI3p&#{vhwMH5=9LmzSdGXs|d#TSrd?q~2A{qFm~ z;}}Q*SBPlJ6aR;YqL}aiDO0xFXm-ti1N0Bk=AQx>zy`Rw>_3?!^e?<%jg7Tzjj8|H zAVF#HUabI#+;8R#{pd^tT*GG_C$m43Zpl1K@Lq;aYwF(@FnBX}EJ)3ctD0H=^rRFM z=-|C6`VAO=bt_u}41>q)BHrJFhX~#)&Abr)SGUHd922y&9?B(ArtSV& z^NJJv47U8ceqwj)`>UP*XZZgy;r}!I|Ev#S{U>%(Owh&nL0CVUeFhgSc?R~%|2aV- zLq;Evx9xyv%vcN`{ox6F4N|F4PpZi`dIGO!A(DJ|tE=EHnElUuzON#3v;VSkPxd=6 z1mCoI>!7reYRV&ReT02U8z_!Cfg*T{%${>!*26sW=K$%d)?A3ybTSr>x`Njq7n?17y943I_a zuh|#=_9&6S6J$EpI6;akO+Nr6o6a7&~EC8qjAHMgW0KYvs z1|GCj>B1TD8TSRY`(cKPc_gX%4IU&5oJmy$G}Ml3NIO+`+%fOk)%agV`%ZYklbnCCkDw_@16?P6c`^x-=o4^U!6GOTL22$`#+Hf~-8IC&=J%gr z&5MOrTW*BOKAIkgLQp*;fDQ~z9U(dZaq_Q5LAisdaRHD%L5d*0 zxn@#&`W(z2VE)V{gQb9YY_rFT^1pDXBG0RO=Z4$aK2Epx(hu0rx z*FXcW8r!dTjiDI zo^67is5RinH`QH8y$M*3>io}npfp(^sW)1Ps^*;Qg3W!9X4THTP}9LW2ggPj48qcj zn5bkG7!MkxR+#+7M+3|6-n(;(|7^E@TFFT?AbLq*K*)oPE7xnUd4QBcUat8)(9oMp z2N}Kw1N27VWI_ZmF?--GZot%4Etp<{0}5sUQwC;+WdUXF7^pQ2&!2eJ7hNUk-^ZCU z0#pG=xl^Ckg{x0V$t8{C{8bQ^%!37OfahA`Vb|~Q0~_~)xuNwCz_Bg0``!XhS{x(! z2WDqZIGW>!SBjX>(H97ZxliQa)>2qzv^rEF?FYLp(8BGIUb2lfHKo5ujjFab!# zy3zv-4M24ZVY&rdeeP3wWOwtepJf?@;7w%F2xNvy#dlkdTetnqvnPXfPj|fC2NChT zQt&sgA80Kr$v{dpW&aMyNv{u7p8McHg=J7?PN)V1`Ub%JfWvRh1ume6_%(rj*tImY zETxdCWnBaRB4Ownl+*#e-`))}goXv%;posaaX=6;r0*;x@Fb&9=eQg8u=YZki6o0Z z>JmU@l?wL&{{v;g7B&}HP_}L7rWZgzan}_vmPH(c>45rD=fLy#mg8E5tBA)6Ir`Cl z%V%8_lyVI=NvD6HdSQ;iz-v8MZTKCO%xIq?Dr)gQex^bU%u}+^VmO&a?=M&r4`9JS zQSmc%_U%W+^B^b2<}p=EfTqIUUAIulN-r$UXW%o!PXOJP{i3q01A2ggb@Dbddee=# zVH6AP5~2eK))@O*K9FO`c4B8R73GR z1t#HbpxSU?4u}q4Nz!B*|$>1mpyrd`Jkfz{><8)p31J}4ZO)s7~AH} zlvgf*ws+HLQo~%ipMnQ1O1(F%M^XBBG$v5sw7VgRYO5d#3X^Eew0##o@isXWee&h= zl>F?0H;V(-b)zZeH;K7d6ppk4LVXZN#b8Gp&p9mwRIGG9cfa0 zcFoRUL52_AkNfvCU0OE{din!BkfI4;m%%BdhaPdihBh;gJoGZfENQk z<3MxAX2C=PG(w-Nd;H(>gX^=+66`oL<-_@=e?ewm8Hf_Hy$fRN^h;6b$e@%5)g3gS zh+@d9R0&3~Tmm1RN+)sy_|_~Q9*A*bH1M^rMt5eio1YGPYknwiP9QoZg9V296;uX? z{QzsMW$h3|Vv3_dMoP`}O_wdT1SR=lggG%if@6Rf=i4`Jvs|A_cVI?Z=XT5XN5}Af zm1Gni0uKVVfWDwg=eKvg9@e{2a^f6J5`f*o3?2yM2t?2 zaNTlar9Ooxd03uO8k@Ba2Th-k=v3w&?=XFxzUMrs_!=Ip8ZSgrP1WXGp4s48@_;i* zenv~Qi7rStjx&H+T<05!#6VmHDt$h^SZee!A(dR9K5cVE6XL5`Be-z|h*(i!)X_CN z-%G8*WOpcu3Qk)+$rCD`Q;BC*Rv?iBzjfOQciTh6g<}T z+@=(71N7>VTYfP1+RR#XA7!fc5n#3Dfn=#yN(a=FTfdX!vPZvSGVI8cV**4N7_C4Y zJk}9rkwurq?(&LSq+VN4*x`t!L|{crn`RK@XHmO3jP+e9ea=jz8-se0?8fMg49+zP zgjRpQUjp}$*+@;`cswmUG)|@d4q7@(>@JTpgxdnASY&AKF!ylAaMF($ww-U0?>%|V z^Lg3_U?V8NQ1J#8{If{p4iJv;b_u{iNjfr5%mNKk); z0J!8L$Z5kyTT|9rHzyzLZgnr3-2Q{_V2~(tv5Fpi2c(xdfr!6*V4gX2Jb^85!HX1&%R;s(5zt>@9i~a- zsdVR^gA9IE3h$yYhfD^cg<(GhLiW2EKwXHyi))pM4wJ8(47=%#w;X81pANrvFLU%0^wwP9T9cE883{KG|pSnA=No38W&tU zx~#`lSCJA<2E`|GU_Lt@VLjhUSrkCb4wP^nyyt+pC8p#`LGvDHFCA*0&(NBN+;#n6 zQt4hJIGB|npqeD={(rIemSI_DZM?UX($Xc;H{IRc-Q5iW(j_h3b<-u?NQjgI(%ncZ zpp+sif(QccbvZN7%shMV5BtM=yvKXIdp>Z?Tyt}C)mrOZ=lT2p&lF(3;Y7EJ+`=IqQ)J>8aYTNH+=lXP+BWbpp42+iHC);wYc6*RR`q&I_JIkzoK?T~l6 zc-uve8&|MnHpR6koX$An8SU)xmD~Yo%ZySJzimY}B~JMZPz148KN8H?*QDN*2bnif z^>Gq*L%$bt94a{;@rK*>t*iih{sP=N=aOoDkNY35+!kNDF8VFZ)x~y0P z1^!lpu^QzFz_(-`fOye zI@!-L(T~{L4ecm5>%x#3J|mFQxzca|_|~lUiyQmq7^+rcZVft`{;R6qszfhiBo&QkG`FXm4^AHD0^a(WeqWGsA@K&m?me~}(UJ>}nW9hk{1m;BvcKVn6 zRoe|w!aI=l2#(}&fpG~dd()~jQ!-=Y>(whohJMmzEnbEegnT-;W&*{T5fJjIlO$-$ zc6eAzCB^ud8_HG%iMZ2GwrhNOKSX50I(nW-pV2f3S#*Qa)X&0QHE zIr>Ews%TOTux@R-nN%Rk=kDHIjabi5MXt6yl&3!|9pT78sKDVfQ? 
zKG_vw`=TcFy!)*koPwD6e2+w8@8Ad&l^-;WN&rC3^L3%cJHeqLWxcK|X)CmS2O*S} z9h5t_e|a|9Yq$PR-t%TjUIp>BtC>6yK*Z`CN1q&o781dC-_>cti|ZWPQFv;Qy$U@GKFzastB~&&41%^=yWyg$PrT!t}(^@+^Iu8 zOeXh%0%SxES*`hwhwEYdg=)(z+u!(C@4dj%86pU+K3lds%nYgsp2og9B#wOYRg@6@+bSXG83BZP_?Wb@sfs_ zZ~B)ITIRSdeiB|mY-vY530mtX1>q=-cML3#+7V}5niUsNRBSU$W#JwV%Vs@c+eGYi zJx4Z&_Q$L-hxfwO1Pj$7T7S4jaR2yt+pA|>hj4bN%kj-&`J@Ytn>tZ;(L*SMUO`%8gbzhYWGYl4L1gQd4bSd{N1NKAI6N}b$0QJB+(8Gi(&G_yH41OGgdi5DAp&?anWt*nfm}5?|YS^`+CL%F-+X~8hIn$Z9u`U zIQ7l#*afeEq%vNkZFS!U8S;gZO_mxtVQQZ=$|@aFZEbw_zKYY>Nz1}n$VCJ~yT@x% z6+>$MIkQ-Znq5h5<0iJ)k+BPU1SO5R0dRXYST5o!7K^p;2VAESi zcd}M+D>d#D#*0F>g88v*3KQ>f)@(;$t8p-_nHI2;m9W`d8TKi_6aA@jztU3}&2{-} zS4OT=!|4zCfvS~5AUiGw=G3eBmMeI+TL(LXS`jkC@%ZYiv^~VA0{VCp;Ryj2w-C%) zCZ2pMH%(3@nNKGe@CNm~=XA(E35^I7&$~Nx`3#1BJ3O4>$G&a6TyKu@r3b4;{g%tV z%F&8&p`^^q_eQZ!)OJcTl*TN9gCs=_a>8iSZS*`QR6m0f)0iLCPVi1QGrh4_cc)iM zZJm?5GK@B^Xcd?m?uoB_%`2*Fbl^&JT&ao&#E~^qM|n#JxVxYV@jY%)$wo1P=K-&Z zgAPri&kr62Gi2@hMYfklih8Ik_Y4gsa=9m9WH3P|i2zNTg@Gx)?&=oP&t~ZAwxS7a zHdcsT7-xFio*M1f-Ud|4trzeGYC(^FHUn0o&=&p<({LB@iIlIt$#s2i(<53hIS63D zOVXl>Z;V9u>w&tf{ZBrK8ZV(n-r29CR$EnE$8#LjyJ?adNI|BXMMM6=j47Z}G}?>; z6L`&gp2|p0Bly)x^U zs}|k3Js3-)jZMkRpS;lF+)BYm5OOvkse?r+F8V!4`hnDEuw4(f(+8GR>uCq+8Xlau zT-7;m?XYi~>o5{w%RqBhjY+V*iV}CBK81PhD88L5;YAT>DV6s zXZAu81FA{fT4kr3B|_#TiDtrE%9cLKBqvSze57jViT}#*hd~QbF5&bJO)}RVri7Mp zeaPWUi>xEmfrmB;mCl(FgE&BVWF-BXf6UdYHt9Nr>dFYtWG87%Id#K0-F=>J-CiW6ta(LmXg*b0F@{w{h$UVNrfT5cS# zi5*se zp1?Clm=hPPBR7}{GQAz;TjiYMeW4c<7XIcVqoffTLrzEL6E#Qt469rO3%D&&K|8T~ z-@V6(_FTyIYiwG5y5W_1$qxbRd^#Pg za|VI}|`Hpd;{VL*b}9=PF{=H8?9gocBD zevoA`UI$k_*Zbg#ed4uF7?wYvPDdM}6?*Buv05)G~#$U4Z^@ zW~Dq+W)J&$x?Z3t{@KvUreXhtwv06F$vDj{iwGKi?Q@kGy<-8F&K^=BkoF94=#N>} z^y^6XWoo9v4@Tf0Vw~pck0A5R+^uYuS++?_IG|6x$EC*d_%ZO|}JQZ~{ajwahkQfr?3#qy?Fs|*!B$#Bmx{`*{v3$OH1@k$&)v+VmN=S#WF79AFraHDQP%f4F? 
zE(58r(UCK)=RBi`=Idfo3t=iiy~#oKG}XYZ)^;;9mz`XGGu^rMZv`qfNN2DxT_jE}WPnm4mbt4+hma$g6r9 z6kMy9(n=qdEXP0IE>v;UdV?KV=Gm;uCj|c-O5o3|srF!qJ9Zo7qKUSX=J&9v5f=b2 z8-b|cIdUvD#QYbvW2^+KtIBKNVc|v8eewz0bK?s^LZAD2Tj&HIc5L6K*w6`?LOW8| z%6^t-*y#CbOHoO7YBaI0wiD5xFuI4X8WTZXNCd!0AZfi9uO~Q|HS|4^z%t(`NPj^; zwFbRejMf@UUy7U?d`;YYG$pWFPXE&WQ5YvU|-=lx%|%Kr#1@e%wtx3qD;f4v5l$+ z25r0oXz6FOGIw|Nt?sFsj(gpoy8%i>*qeXgPzdkmc@6s_J8~kr!w%na&dUOXg3CDEe#Z zTb$JEg$UqoAMS}vc6J?q=YjD+x_)DmOUOz8fHT1bQ^W+g?^$f5hwmwJH$RILz6%hR zJF}(5w`JH(lN{apLZ#-3m=^Yt^`2aLSK8=BkTv-P;lW9&WXU0JL4`N+7uj%2R=Hf; zU44(T>9z0MvUMZR2J*9EnGhH9bc*Os?&dOz!5eD?&sY0d%MsMW=469s1V8VY6yc^!Ijx3W(fLBdi>-X1oAmQ1fK?$$*ySQE6-yLC>@M9?yfv;e31WwEz_kWo%uXvMd@t9U6vh&$ZII zM8}roV~`vmpo-0+NVrWM4}E_Gc!|c7&~uV%$S84<=hjE% zlDqhhqvse23Zi=WiEY)p%JVAp3ciW4`!|8e3}%+niJb($Xxg4$+4@<4M3rDIfwK=N z)HHe$b;_r9;1*X3-u6tjjSV*?#D0KZz>fE^3?Br#WZ zk4NCkFcQ{8ltFg4M9E02Qop&fSu(`0>q~hrk1DEAdEYaC)Xu*t6QB-%sT6jr=m`-g z1HFpPZjCl~Jv&JdF!IhDkE#=z2$U-&+KouIt{-@v#zwkIhvTI-FUOSTntQFagF`&X z?2rb+YfDA{V(Zr3Ly2{i0n1JLQi0spo`eqsZQtPsdE<`p7JXuQO<-P#JB#|zaKp~a z!7GM0$|$iSZuKnElzwNHO4%sZFmYJ671)g9sMj8~r(-^uV?EA}%%CRnP|P;9P!p5= z`J~Omm+hD}9?Ko1gk>ceXx3v)M{D}tlG1$K-DO5o#PYAIBY_XgW_aTDj(E|hkEm#t+wf~ z+Wgss0ce^}-}N>I%H!W<@!{8=xmC{+F0NXV65Q**%-}*#WDvpya*0J7&e{b|0jCb3 z)9EHmG}RC>)yTJPs1kj7wCwr?B(hy)K*)k{Acf z{2cez+WKvzzV`|Db5I*$CdneH2$ntymHejhHdVVfRm53vYoQr)!A5Mi`D#;_UNgL{ zkB{*^qkd?^92}rxWq36HfsFm zHDay&Co2EC%{xc$5f>_@43dSX?)+V@>wl}ST&2LeK$P7t1-?rN(O=?(8b3wndlyxB zo$K@Vx08G;^f9uc)^r`3z~X%lSq(N-JG65$!B5&S@Eb1r4IlX69auO7*Fm*p5R82% zq^oAdU^ViM6)nSPzg5F#s>u^=;FUk9L4(QBZSm9tH3>&Ct{~5(we*iN0Q75KfC=L7 zmBiLSEK~`*ru#8?rDfN9G4=5B76#4Ba#y*y4KI)yC%LY;YpP%MXW*AADTH_%i&Pn|Q&b7l=a;4az`7o< zPeCKHfRmBMJ@}dJm{z;9(c#Yu)D-RPm$Whb1`cDG7ugVho#*Wjf41^>CGKj5o3D&YP;YAU4ti|_zf3;mzjWl&*q z4Gtjx|HLnpgnqu?_4^WL6Hx1$hsi*EeISSATOADV1G1EmB|U?)rBxx0-!O!uz@JFW z@Z!Ib7=&Jn;KQBar6Uy{PxcqtVZR=nzj?4f3IoC*a36~w%=M&!+3>-(%NvvyJLszd z5E}w!AC89Jy?Bx>)ZWBx*_E~#7`;hSAou5^I>U9xatOYv|F;es{H7h)FHVGUnf!6H z|9WA4KneGY1*KZullIX#=fvo|3CDffjNSzHh0aU*PDt)MJ#skuDK+1dpZ8|Eu z!E|fTzsh~&`?tKQmI2(0A~*~-`ve(k;9x&JjRgd_my<0_(WwIDD=iGmYsowsc03k9Ljv$v`{Mx^8p|Lz(5l5xY`GR0PO-27>mHeiiZF$ zs09TqppgwdtFype@&$$f(|mC3CLoQ2Gf)8Bic{)}U(s2I#G{$DocQ04Ns)RY<6;4^;{^yArL^rmSm!&%iCCmR{ zvr;89P){+CI)osT!;@ebCJ$!@NpI#)Zul%Y3d6wApEtYeLUvpHM@svxg5 zr>cqpW_-X#U~5G&s){~=N=st~^|41ifJ_D+43!&9 z8r#>LLA(HI8gP)vfD9=I!2%RhK{Jsqpd%ZK$^%3acjd%U4KPs1_yB2Iw7ltt$sW{G z%S2mB4V44NMA4V4Ut75fK+N3w`l`uu#GzW?FDfLF$pwPg)`6CYmy>@Ik|u)-5CMY@ zxQ#rzIT;WD70|&x@tYs-IGe4X+xlrTbMHB(>Hq{vG+GTC0POOSyzQpk&4q73Jk~=i z0`Q)(RgaPTWj&#Pu?Ao!&XW$i(*!Dsm4U{9+oRJ=mPG2csOB?uDC-9h%y*!Z)F`x$ zwDn5EiMIhQUS#fyWva_rgRrcWCN13ag1i({h!0@>@6~~QS5DBH?3BF*^~G#4S0Z$> zkDLLJK}%^C2!=k-nm0OE&!V=99HC1(Q!(;QS6V2jFWP-m%D}GzbX_5Yu{`( zbKM90y{m^L8m*7wzJICO@OK`pT(&t#s?C8e+S&XEa$ffwNRf)Nk$ zt=J{_uCd|P`s|;YBHk=8RW*ycsv){6hY`ZLTFR1znGc+4i?nX+ejob? 
zRDJi=;i9%lj_>hHeZkZ_JOIfxlwi%-+_%Jmajw!h@1Kr(zZZ9R5STa%YTwJ0T*n2C z0f=AW6MnRg>aS>hqRx_ebVBgN8gk09eGkydn<`3)!80T4I9O_)Y;qyclD4vOie?$5 zKhF3B?6dR4^i*3f+j5L1sg-f0vqvRW3nZn2Qd^LftT7?oeXU3YCLv4S-*_tBtR$kb z;62oZO1jxfWEw-3#~4+so@J^gI+Z88%D%ZXd%1T-zw`m=Km65jOD@A)pkc8RDve7% z)NI1}Jp|qG{tTG+>D0)Kd)ijx7KLoyA=?4l((J&&SIh3`0E^QksMB}_`T2oHK`Rz* z(infGouCD)Is-s}!U)D@h%dtWzzj@Qb_LcSX3TRid8M>sdGOkO!_&8K@y#TBa{ z&01jVEn8E8fw+y=>G0#rHqZ)*#EkYH_jJ#>ez_ukdAEvAU^We6f+DWF zR99HM?jco4SQvp*cWJ~tE)JoUl+@vTIYu?XdGN;PpCWtqm#TZ}kB%w*c*ej#47uGG zdu?d=^c!GOXQE@@+~h-NED1yW{eumi#JwH$;28ZhtPjZBNGj^z=<~a(xjrzF5E#xk z^F6t1G}mL{8#5fvxDRO6^-C!)p-DiUnrPfuiFw6C+LFnt zSKcMM72I?j@6Pw;e`+D15NJw5KS~%8@)wC{MSm#6)=GtYoz)E2q8 zCF&)#@^`++}7Cv2b{KF0X$3KEyt`61t7Fe)bqz9I+|cBU$9tOg#F18v&7f?--p!A`FOiRF_el(e zFpzGvcmhnbq~Za)SZ-<3CQupm=pt|Vlgl=yumo#b#+s4}r@+YpiPQrgC8^wu3@c-(an8MmJ!x~^)hke>So76J z)2Mwu!SiUPnXTw9JVq-4$7U}@DDC-M|E^@vPy}^sUrxhHZ zz--<|1PPvNw!jWeVFp+i>3-`3F^nAcjsD-aGp0{7VQQ?@vt&qh&BJ=HMv@Yh;OF(u zqwz2?2+rklFe-p>?ih(HD>5Gb3M23w8Pp3DM=ybGNCmvP&tJm3r`+Z}PzR@N%IYiu z4e>6BnPwU%*?HI)&I+75sQ9_caqx3B@$V5ftPVC$c1U1;$!JI|mt+6ti%Jov+JeWgZLMDOBjdEwLceI^#%enRm;Ki>Kt*vqIr%J$V#$XCLdWK1T0>De9UH`h}k+E%`HzsdWVpBbTQb zTvq>_Oa6K*0u{j+i($`_)BEel?+*kjfy3F2S$6+*`2X*5us#LKvOn%kJ%-Xd4D#k{!kHUPqF885Ht+ z$~xdfuyFjoM%dajfP6F{lsItruR-|d(jc?u3cOzff>+?~HDGK2v;{-$rV}^{n#@d*|2+l+4J>s(6K$rmlt}uaL$;BSD%0)} zXo1<)8kwd1pR={JMXui_Cv`{x6Tv?&7_$JJf1jDBhT(rc30*D;DY)7+kME5C_$XsZ zUmv(**c-6(NWy=<$l+#kj+6d@N-u}8-YigXIgCax}B15X( z0-_MG?FJFP#JhK(xhOv{rKmV-?(-13e8}+Y>L>X8%7}P7pb(%<0!U$~u-hqzf<4hD z*scMuK*6YDph91<0~?h6L5b5Rcj22u@V~Eb2t(ER69f99F~xC17~(EU{B?+0fgl(6 z)nb*n-i?0-y=&_ff7%|HsTG}K3DH)pgAYGL&I5oSR*N$tkq8N*gSM3mm!E?_KWYY` z@s*u%u9@KFK^ZMVYpw0;)*uk^tAn&+MNARMwS<5@>^|PNub>=w(RW8&Gp~p*vM>d7 zhC5`Bs*j^3N{8qE@$#f1cYXVneaxyS?ezpFT)MG^ub}_nA1GfUTV;5eIPC1iNmJ0W1O5XN%g0 z?vN`g2sD2D=)W7{4t7=!_?-rdWEuGCrpa98N)}SMu+f!mFjdtQ zcp29f?*MOgTH@E07N_|dt|6o_3I)k8;F`*|qZF~d;tf1So=OVUjTb(j6{(k(fc-5QVYSr%Wn}D}JAe^mzg(i>_k0p!Z8r(4l*fkFT zXzNy#O%?d)Cs^vAz5+~jZW6yy=bs*3tQ%kllJ;`t?P3}9lEP+ELQcyfW0}@$r)Fny z<{q}aswnyfcIC-3y&#Fld)qz$qJrh~>#1pA@=3P6pvR$2|J#3@4k7233rkZj%d8oq zcc3XqTQDv_j|ytnVIo-C4Oaa;zBWNvUxM18Ct%l!0TsFrZPD`u*)({jcv*Xa^rx*s zjU~Z2hto=0HU)~$U>sB5A6Wb}#^f0MA=wOf&!ftLCm??n#4aj@cG=H3cZzl=Ro>0EG)^gDcXz=EuG2_zqZ-8mHYa#kx z@#f4VBb3D24}2HhV)=6hS_b&k$UkO;F4==ppd!6+%saLg$BZu@6dZwl%t#?Wm4n$K@1foeVC^}JNRybD zD{9(83VoIXKvf!ahECzp@u=W!Gyn@ype}d|6*vlp^h*K|B2@QepLo$C?1N;Wj69e4 zC&)HyoVm{STg4hr=@j_x>~L2!!B&$5vk;IA^(56cfG(glojSo17%!lu3si) zx>nRgs$#(e)aQ2uNZ?Yk28w?SGSu$R-Z9BFgjb8GM;ahO2EagQuzF`-?}!O<(?V9^ zy4H9fCkp8-m%|Yr@j||_VpGJt?rTAShmY+WSJkd z30r+>m-lm_dGXxsk8r#JRnSrM_rACZeQ;JvbZ7Q5GrDE2$QSs{jrqV;Kd;pcGTw8o zWY6TIwiYHz$NdRcDc&NbP>O$v9ou)?9`$9?vO?}CAtlk*A2eJS86 z#vhWAxTeeO<7ITVI|BBGG2g7cVIj02%e;TS>(uVA-A0n)m_f`wSvZhDGVI)#%kY^a zS#T#d=lq$yHU8=ITNaI=s#($S>us)AV(M^Z)gPmu*DhZr$f2!w9 z!Q~4~4M9u+<>iPOTFw`ppem`IcO;EY_u($QwflVe1ibZf3D)D2gZzmFgNRsED*5c^ z%SzA6+PEmVADWLi!gOO|Fm5+z$q5d` z5MJHw!qjpj8<&4%EPspHwO!e(Ur?Y+d`~@{S@t$%AP^?n1nY9{&D%;$gGl9T+Dr4! 
z{ut1?1&a)p)eb9*`wSE^OB43lye^U?<;_Nt=@_ zRL}28EJG4=aaH@0@(VYa~g!nb5P zRm#FCKrv$p$X)G#6seT@fivTTBMVrbwF${8UqbzMl=_bEfpiPFqJaZedE6-T7%MfPuoo^&Gv5mE0F+}uPlOjlRu~~g<6mI(HXw8tBTLBJmapb!HR;}5eAaCKd`nA?>62l+nWTg4YzOB^@M~m zt8i8xJH{C=kX07?<~%RVOBQPn?7*jZZTW#_x&j`Vc{*sLNEhM@#%r56n`XC4&^@;J zh4CAcE%LH6Pty*&ETd>YWbe$1-eYK254>>p)N{u_JjND|g+sjCV#vbO+3A9YLeck% zQ^A}K^B$>;pV->2fc*inh zh+b>taE1$zwVm%#Y|ZwVT9&Xzdn&!77&+Gc4A$ID){#@89t11inw7ye;SFQElvfkU z2fE~X2fPXzj}R2$p7{GiKhr1y(?g!z6{xJT9s0esSSth*_uY#bpeIncsMA=MhTXFu z8jdaXFJLk&>HPKD&?c$HK`R)BMFc7hWWxR#_s%gbNb~N^4i-MwT^dQSuQgx0|0N$S zVZO}6~S@0m8OJEF~-yXn!YkUWR^ zIs4lMm%X0X?;fHE+>+S1ySM{}Ru>Rj;D=_kj+sK+whwq)zxMP)c1RlDZ@olk)Kn_j zeB0C%w9DdwN`cM-4=;&|wOg$I^`d}Ac(*eqM%4ESpUZ~=r5AR#(#RyMCzlf0Y#OCd zLu?zC-T?Ra%{hmR{v~nY35H`fc5WQCjF6wd{J5G(OW$LTg}XTO@A1B+DYTWPhzk$# zJ8=}f0s+dy527l5Otxy|)v5KyGm(s9;{zSaY4>P12J=AXs+so}FG!q@$*Q#xs=C?A zUFM-F%5*!{j(65Oy*)zIXxq4TI15$JTz2F58z$Mm1d&q{2`Kpn!|lj~<=YCx+*rNN zNKXa3AFiI`rsd%9=5u$esy%YrYmSb2q|NVPzW7Z6TjFvq#oy+YLH#LC&&Cru?{+$; zTuleCh;eL2@mT;rxz0Q$J28wPsqthSc{kb5$fc{QIaZ9xKOxgS4yRQW4He`!`nKoR zoli`nPMs)deG|`74})^&>9YeA#O2>2D-W5Z)1%{X@g#Kxq-nMHOUQ(B`J3?Xd&^2+ z9$g-tfMgeirfe18!_79@Yb~gtOIJYTe>gCamN^-fMi5)anXXX3Z@6hx8>sA-cZq`D zIsWT@Lo>Ws8huU~VZ$I?oamtFGqIX>?s$4mY86(*K)n!OR%C2XoWuCY(IxZ5VhE=t ztB;_q3C!q_zIH;8$mhPO^$qa4WL8Tx8+gog_ds{ytD*deob!hd99I0rm)YI>fz;%D zT9ee`F>?SW=wai*Y@q? zWXv9K%P%}H$-Mja>N2OZIc&P}BENbJWaMOH+oj;y1beAnJU05aoJ20@NbUyqaXTu5 zp9br4M`O-xhb(@!jCk6apzM^ko#Ck2zA3D5-jUs?h?fNTNmy(A(|m%1r;^kbC)mXU zRQF##Y9#i_(IQzJv~RjqC}bnnbhqWBk@{DXv?cqvT9DZCTT#?r{TM&Hf-vLM+^X!x zBT!E)*NklFaaDS(GLmEBl&y?m8nt*U{QJzRfeLx%fxx(AS+sWvfqP3_cz;VXqk__7 zIs9)9ch?RYUcF@6O00ckJq;PEp?kTv2>H2priSrE#vK;{�r$)PNu(MM*QeHCao_ z?vgb9$|#bMTx9QLMHZuK?~g5UZ+@gv4~$j}Jd1`#12g|r+cLZ zZJv{jK$cbD0vj%4FfT*X&b&9(;MJ`_DaSp-kFf#r9f(P{b30E@s}rl%nLoW~?bh5H zjX%z_8+c7Xrsa^cnYL?ZT_814On z7NObOC?k`bD4m+)zNZx%T6JFX#%Su3UhC{~WxAQvxlYWB9BI3qFKy4_OkgMl<*0aCysd#5=vk`~^p=nbEeO*8KiJ6Up?$)jbLKBb<_O@awNe9k;<4_1lJ z0vo@}8@U>6&;=w@2us!mRyJ1-Zvo%$j=jM|#GncjkGrv{!jg zIVy%&B@TmS(FM!O9v+(4peHA1J3YXEu8b~NQ)6HKBvWq+#oRqovaq*ir77286658wvvYNaLg{!Y4G%225VX$vo5eZ{o-z*&o5CN?LNE^cP0 zA*7d>xj8=Gf_88)q_Jv-%3mG9vMGBYohoy_206Pe_twl&F_j%Bf7Mb*p)l*Ral_iy z$F!dq)vlrE%#OY^kM51<(q+6m@O=-D05rk+RyyLe_%Vwp`Wv)=jD5CM8j(PJNA$W? 
zRTd?hczy9%oNosTUm^fostYo zTEtwPcl4Eh&MymeiB$3}5m#^8t$g_4)MB!bpKvsuX4-h#P>u;1M;5vml{B!@F38sy z27olvQ&l8;`Kx&N)m{TSH6sT6C7j95SW(U03+YQG@}i9wWjb5gyAzSKWOcOP+4XTy zQO&aRsDLue8HkBTWKYPjKE?KHuMc^1I~il_%(tYp>t@^CVNw#QjLJjvH&CXMWg~sJ z`ei_e_5S?)aVH8zHZZBNTvL@I$DSSwlCu@(6u~5R9 zVxryPD|Jr}(#&$6GgZ1N-}$tM^3s&@VpM?{&bdjkahnM%jo}fOmmd38BG>aDFRO(7 z7uexMJJ%>ChvN6yd{lJid4=^o6-eE)E}_cezpq!j`ITv~wyIX=d;wp)28Wx?@YMW>!of(^asvKA5K( zb}B79TF#q*GuyL#S{ZD>GlE4b3!R@{UOo)-&eL_{NoO+)&9%vhAIX`IZVE*%Xsc_& z>#tO`rry*#v>%0yhSUcr*x`5AgK`18WuMTNKv-2%7-BFib!}nUOP`Kje6k|(#;g;* zngYk09ZdFxd*j2nyclMmB~IH6`X_p{fVN)TrK+o88C6#Hi&Sg2Pl<+=t!}CMrYIW^ zODR3Bw8zlwDzHCMT;iy3GR%0jbxGMjv8OXLzBgSLxREo>POtpLoJCnWQL6^amf3mL ziEu%mNV{wU^{h1?-NNlyKh9dW>l&TIR)O2);I<;nHaYj;LJ8)8T!2YDNj^kZ-7Cfx zw{U-`y1j-;R#@5XNiXfMjj>Bme>Wxo+fm%vO)!MR-Ed2;KLDpjDW~e^1t2$9VLdnO z14288k0`{ZUp_qy6*;$6vKUbo8NO6Tt5%6~^a2@K>UDxX-I=6@HgVV%-Sm?{U%*X6 zKuNA})Y+K?3;yh|R_$}Rnu*TB;7XlZZ@wVM+PFTGY9tO_Z0*|Odg=nho9Y%DvHWx^ zIJf3Q`FvN;vO=#iy}VU!-r%}arHXZq{Ls(r5}@>~Dk*I+)>4()ji60-eCLfI=Q`&@ z=MH=l119rTPAx1O?kZb8;*PS$-kZSmPXItT45{wSMgs+j@I1ubU)4(+Q^>zlRh#sE zTg6&*w}Y?!cPVeD#8p-;GD0j+Q~}DTkJf{@w-JN^GN)WNr7L0(C>XdacSP{`8TT4P zAsVTIz_X`po3Q8hz!D%%Sw-4y4{^*@jnYYue1aC=9RvkV9};x;+HP$@HXh>h;aYX(xdUti>ZR>xOhawazKc5i0h&2mdw6cnv^|LOL`B$3R@*lUL9$%-1+iWR<3 zU{Y)1tliM!RVWUU*)%<*AG1Gb0doxwsMWkd0~anSfyz1Vs0x$5zCF;aFHw)`2Sz(QRV3N3!-bPpVk*TL!)-lk!CKO}3D{%gj0+jUB{f8Nb_X zeYku?Mw0>1!*D*;KE;X2qsq2|%Bk3Qm1kINZT@w<-mG}|uej*Ft|e4e77nQ6)r#%& z#_T4Dub_-5%)l9x1*Tjp<^|1V+&?)|N%A1bUsG>Q!s>3)lyu4vb19y{wNHlL`&uI( z-84EkdS8O{T=*Wy4dr`zK_$;7!qW=@9bq}QW z@aog`|4C{Lnm1e}-#K=)HfAcOpH47SZbGHksu^96bosOccnvbIvDw=#-gM=_BC8NN z6U#(+T%Xz27$Rp!PM`*%_M?g>H5Z})>-aJHtn!ho@!JFGgy+g-G4J&E<>ap=;O2(B zfg6M2!1>K*qP(n=b+pYb~^^#4D|~YBEnFB_sRixcQf2zL2X9 zNPI~1R2Wl#Bl)X&V>wV<-ks1OeBYk=y1KrME>cNVlI-JHZt0$987ZfUlTMzA#kEad z@#ByxsN4qVtTCqSW9^v$ITPz~oM>IWIsDI{=u$Stqf?N0*%~tXTL#FqT!=G$TDV)_J1NFVXR3C9}r!xs> zRu)DL3q4v6euc|Tr~ZPjPnO#7I7k{`6)h~*Os-C7K0ghzH>S;#)-2B{i#<0wvcpJsS;S*NyLcO$OgZu|1LG<4OS zVuIByqQ^M}mC+SOe8ldMJwd08JE1a9(|(V3pVIkj0@xn+{=1cC8SaeoL{mF+ydv8-cC5C+jD+XT-D-E(H+RU)$;sMutL3J1m)DTX z=ZD@6wr;Nv+F`53lJ}`$$rF*JQm>J-cc`e5zy7DK#dT;nW+(adB8Ojsgb3r3TcR8}+@5KG z8Dr7SgNk`%MWD7-+xn!+W%c$8C!I13wl~a46phd4KLfj~jcSn2^=@C$Wp;bEq%Ru? z@^V4G!EU)1TGm)`byxSRLq2*Qk`~41r#=$Nm%wTb*zNb1R`ts?`wUITnmU%T9qsm| z%xF8k%|tvk>Am>`B(^|0@WSWpg}Eumj(OgdoY`B*i}|CPvW>Cuos&LevL9R>7)T$* zL5c{oR7GZEnL}&}m(qU=Vc5eiJ!fN)qrt&E6#J}z7{UCsw&E+@Ky#|lBV4oDvrJ_+ zk~-aWdlf|SsT?#kN9jq)!J6l`{A$+|i5r$gj}D zY3;NA-Q!xF-+M~mCBA2z=$O!^Qk+#mqD`mo<5K)Fp@Atb(mLk{G#R({bUlMlCO|fQ zyTHr`ov6|Pxq56LHi4(YiYxQ;RJobWK@4&8;+KkFRxO3b!Xk$z;G~pXueU0H6LgN* zx85D6^3k4a4y-oXc+^7QBX)3*(c06_I9+gInR_r2J+hrLAxGp1qTOpI7=7u()W3DbiTth0tM<_-n! 
z$WP$wNroq*C4N{*6MG8P9{VgYIQ=mPmbjDdd<&)O50j}sL`nyzAl|_n?2GQL3zfbP zx0OcB%s1S9RZDg3I!eSvf;9l25hw?WMZc>SAfI@t1^|co_u+}|UVtOuI+Ug?um$Hq zX_|K29DK`I6``a8_}YWi=| z0vE)QO))4?@TnYqsB{Tt$t0+)h2_+9gNJoeX*J}L;}m|yz-f4};V zI`jW4H-v;QV&vqH2lImYPc-V(`H;C`A5rnS#Ur&wbiStLqBUNc7M1a5UTk7M3r8ddK=X5Kedl@_xe3V_u0t z4B$5Pkd9nT=ijX9r3H9Pe>F8b{C*Ao1C@Bog2S{&6uSSqSqGuu zX5I7-7d`pson@TG1&5QLKKj2W{l{(j-`#0T-CxC}zi)J6N#E@3EGos63=|steyC<_ zmRJr378Dp&VTfXMoPnnY6wD2O@WKbw;aoPurW01Hn^Z$?b zzA`MUuIrXmQjii*q`OloRuDuffipMi zqdb4lb*}Hc=e+*#(gl0p_g;IhwdNXYj4}CcABoirDPs6U<&27Z|7v!GM@Oj7N2YcX z+l%mt*aDvFN1O9FfnoH%jpXky2QN|?i`-c#aPGgI{kvcOV>-L{I2u?V_5j1L3=9DU z-MG$8-sHcrb{(i_LS+_j7|*8hf3CP}5%_DS&B(=+{pPmFSq(`mD^hc8f0o-HaS2~m zqHae{@j(xmGz(Sy7+=l+v33(Y&k+U`D6VceX>$7&x#qm~5>3$mJ_4wgOSLND{5*?7+)c|8!?01laes+oef3ZWjYu$^b_pku!!$siH17BlG_m zLcQ-H#9#aFalA%zBuxT76Cy(^KikmtXPwy5tGf^cbbcTE-}gSeT8bI7XzEi|O{-1@ zJsL16B@x#~P=aj&VQ(CEmp>L&{e6JcgA`Fs&@&Ia;=K<9y1fs_B~52#p_Wg-BKxyC zIi1qF&;N^@6fRXa{Pzu}$JpYgn zv|Jx~lE0M7zddQvumQV=*QTtLP>V+BV1_wnwmu1RR5gM9PM9tf9hHi0Vp*@Ala z9VAXYi#HdfsXG`;*4OHfQ;UHRn64f^uL@|Wx(c6|{+Y7^53mseG@m^u!KRK>;5r(n zU#-Mim}=B}c;|8%@@<>7lBfqSW$L~^SiQ-tdw*@FwH1*siU|24b-`-I*AFDtnuwzj z!i(Aap7X0+@5-&t`y-{WFdk+_YRvLIrdkD|T@rgc6!p9KD(P~v1VFOGcy+1oXMeGv zqxvmRVV$QaKbKmwp){b=untg_h==5KD1HBqA}-Jud(L*wuAKT1qg|PD$O|ONV<7aO zV0U!)^~X9)bU*Fe&|c1yGDB0^zo0w^E#=?!XiE>x}cfs z11VVthmZJs{B-q%gg-ZV1M%5cK=xips5@XhL%jR$JIUyhr0a zHtDj?X*~@TQZ=27x>enBuIP84pf+FE*)y zhkS&TeVM0fALDZ;G$oIgPjqw zzc{9uQBgY-WH(f0W7MYY(VJnq@e(bFPU@{J_q&I4O284T&oa@L-{)>jOl=>=V-7XyD$p#?8MGtX zU2tQkF+>FTZ3&ZQZN5-~R!&WYO+lOzKz%dWq`g@iAj1 zt`43{NI`24gq@wU5gmBnwyASLabto3{3ZWpv46wGTfVM+-~6kA?wW*XUnF4)i3Ydy=>R+-ZTSVVQ%SFr%aYwzk zASbyV9jPUHykOsa04(A(Se5sd?WzsL*R(X1wNq%`Hrr;EOEX&9 z@eQXzX$_^s1~;a`n|@c<34+Fu+QY#qSBqZ^L!HMf9^5+py0Laz8dj_iRNjK-eJQ&j zE`IsX3gDozES^z^g53ow1nnLn_fM8-@7UU9{fIwX*v$fptd-pVB&N?TnKYKyI!uV- zS%RIG;ysyftU=R=iUK*KaIhL2Lk`VVvcTvZ>$zvYZMhrwaBqFQZESB`z590Cs|$E4 zhCIy9KAKM+cfWyBE z6Pv=@4p45VdpytAv|U*MQt6%ihe)wx3e`}3y`Di*|0XzOOK8|EvNM+{)ecK73aSa2Zzx+8Xo`_4l@hvNT(X|c_wixC|+62 zoLw_NuY?1b%y&n$vVJLyfyXG>uSk}paU1noKI`l`>U{3ku=bW>>oN5(jE2Wg?oMpZ zrCnD^xUjpwS^MzvqTHW_Dg{yi!7!x^DHCi$nRs#PoEpmvDk1d)h1C91JJC`;{EM9& z53(_7%9XJDb=Y_)vJ)O6@H24ndaH#b5{?y)SVCEcJ*atb@@h-O>9>wXhR-j#|H)WX zAY)mn=1Ncu^uSh1;+cXRsAsl^ga(ZC_CRFhIeZP$1S@8a_RkGh5{n*H*g5%5Q4i#S zwzbPj1#=E?!Ap5(mXjG7$@+_!ZGg`XvZwU-_aZkAy0STxiLXN4n%5w`NwJcx8V5SB zxz8KLFe5nV{xz*cfk>;DV`w)?G%`=v29dt zAuR2)x5)bwfQRq932RG4?+Uoo<=M2#4#;d?{!TZM*%A|Yi+m^a*mK}w;e377`%dtp z2~ltNI;0=wf#iZqJYnkjr>x}PBShoS=;EwNa2ZS54Yow)g~p*$wY_;z;G2t5^pAnj z2vQvBTa(kDjdmLju-J7PtuS}41lXz9z$CQ-%V_w$M%5LEFj|WE+@2F@8}1A!^mm}o z#n)-=A3%cO>RrAcjYe$4w>AfhgH%ycchrYuTFX;Vwq{90U($O04J?hVe6)-Md1I2> zcWC`fZ#44_xB@3TVq=M8JkF3G=)JuBIJBb6kpD@Y&|p^QqW&XVL1!_=pGPrvQ}xck zObU(Bc~OxfzO)lMvVhlKmf*g`L&5a?3ZLsIWDc6=>SoNOpG6Eh@UGN3c0agx z|T_4q<9+ zs7^sr44B{24Ei+~=%iZGt#*15Mc3)X(y)9rAf_p(B;fQTl{)TrBYL>RoOiMID+K@~ zvANaCmX{ssW`bCI$0nER%a+EUb0NI5?<@GG?Ykx*@%297cORh;@*D!_kD|(i-p~Mn zdIIZ7Y*(&*Gd1|OqJPRY{@|@lu_rA@1(G;8qOBdT?9-^w)p1}DEN%T~fZ&ikC9nxh zvPLlI>T4SdeQX;EDCify=>&2X|BPXA2GiKlvz2;e?jJC4>{@u7?b8n-eYl{^q%Jq^ z`)SX8tV4+Fygw~x;P1dcgbR8TVRZX?>x;AgGvJY}M8;QWeo%WX;pk>{;WCWXn(CtT zxT81mhUOj<^vk~yh@V4iBOA8fCdN9de`Jx!?G~z`dHzWt{Duz@ zcFJ* z+UbB>n0yoiMZPG(|KJzRWPrFXAtmRx?h3{xt}OQ=Q#uUzKqo?Sai5K?wVA3%3<;pB zR>TZpQclsTEZ-M|{TG81Z|n|%jUwhTMVcBDboY+*qMv=GjJ3rvia>B`ni)}gZzy|G zdf@b|5pAD&bPD&w)M4UPsaKcy+~}qYM+({j=~IH-&7;Ac@gLu*|98Co-s(E=%RuC- z`=AN5lfd2lX9(_q{QMw2#U>{Jts4t;av})h^^-SJ!WYnq?vMSeyVa2IB88wq%)9Yx zlMQ*wXGn@BR+)|Fejvz(-^}Sr(?+=YHlnalx+_#f*H#P5ERrTpKxPxX19b`x5C{wW 
z7ySXxC~z$U2x=CJtRYD<@DVoOK*2^t_mK$tbFSLgAP(>528VJ`&JM$VrMyRUSxcl! zG4l&mkS^kw#4Z5jG6wln z=;`Zsc0rx~z5ANqbtp{UUK|;D%GhfPA&Z#cIDom$*awX!1N`E`bPkj#Fs#V@YCn+W2Vm_mA6_-$pZ#~0UNH8n(9!B{mr7upc@k%T}t~S zYky3_7m3H+VDPy-mIm^0(ex+TvrrWHEaGR3qBY$+x!>vI)dL1>}3ZW*nod#o%-a zL2=4mE{)tsbob-N-Sc4UMgJV!Z?a!q>CO*9!LL35s4+S;B*%>(@o%!Y1MpPiF4Thu zKfF4*TN;l$;`(6sfxr~&(asxS@0jt>W^AYkG1x(>oeadrTKTwJZz@)KI-3DA#}BaG zl;xKr_2xHhe=cZ>u=-zCzI){>cuB1jnpobS%a(eNk%#-3n{EX3qUAtjyDs)ggNj6T z+wr^ufz6rENAj-aI#(qLDBM>|(FH<17kIo2W!U}cYN%eV3-V);C`|8eR;<%DW+>@{l z0)lup#a3fP5I{!f4Jh?i)i{xP*Z{B=gq7E)1s``E;2LV!)Bh5-86!ULa*;8 z_;CaAZI?&+06P=vz8(OegAF}e%arakg5>jdB&ghTp)x%#*;WW26)tm|UH_R4#KnQQ zBa+Att0XU{L?Vh{G-<#$`XOSLh*#P8@HRL#*a4wdq1DkR@ca~*Ggx^ZTTccvON8RI zj{foAA;7sCZ1B>)@4qCGNAJ9*;D5paCRm*12hpiM%NH-)x%3x4!^C}wuf+V zEbuWq9hU~>R*Wf+4HWGm-cQfQbC;`C!PzSW+`K&RsM%qU}*?5b%0Lx89~$6StE zd|tnEt7c#<4|&Zd2}?mlx%8EXiBPXMYI>Q$b{MfO-51QAB9t>1Z-(|mb z@8#R#H1{SH%jPfS__ML2vg^(!zO7!5XDgoi@VzWIG-aqYkqL7VE;u(+?RDQ$j8uon z?WAEsx`Wo8n1GC15)mf5qi^Ys`+b#>_eoE+zz?#NrAZ|EqQS(+z&Az9SNw*PkooB~ zQGsq6dQ@Hwuh1_zLMBNMA_uvaP^bdDxwrc~NRd|NA~vm?6Gj9PJ-&{jW*$0lH(_kJGb>S0!gIo(3;jG<}*)*K}3d!-@#dk5O@(Bee~ zgQKYSs>;V)FQ+hRm}%&(e0~$^`F`+5*@5uz`xI3a6gE8tnG0HHA5~#I_KD>?=`7ox zXS5ApZ7Y*{FSN#T!}&cd6oqj~%s1?NG;e8fQe17Dy8dwYjG$`;MtDtQkl#%c@`2l^ zuHK_e4Gdx#Lg6$Cch6sJ&6Be-S^AEMHOcf;#f>{SIn{D0Bi>mo43^dOqXn{?okcH; z3P4kg;O6dJUrK{p5M%>*Eq6}4dw=+zO{L_bRN^rVw;-raVo)tEFDr7on?HSS(GXU zvFmE`iUk$6^6dh6(Fz4qSq-tp!5T2~bLqcLuxY9>$v;W;AmTYChI`*cyxexuOP(r0%(S>z z@<^P#PdW*jbqZ+q{R>q|%2L77@l>gKiJc{?op>n?I`&ijlw`KL?TUW37n65xu+<uLmr1H+c!Kik>%ckbQ1B!L62RQ27>8Wa=A3BkQ&40kHa%Aju?i+*Weq~PGOwJkJd+#Gg;y$dGy1rG-F4jzU$omdYrzex zR_dg455V`$HsT_!4d^|J=vWla<%>AO4;?|8>T@E|#|%L{D>k|!iJCL|9GojcTdo?cig@}7P>k^gA3j;WvUrbt8eq#nkS2K z)l}stPp^mVoBr@mKK57)42$h@JA@b0*ETTk(*LD(ij;DvfdMsC< zqo}E)FKuyYeQGMHf#=F`Rg{l)nMyuT;4x9RP0Y{g?elAH!JQ>;6(*Pq_@v4-;onJg6SkLtDt~hX?d=X zG|E#{EC@Dk@mUJ^atg#$RQWMf;y!L3iKU$ax#9H=kh~fjd`?i7e#k4EA2#9EbP4=) zl}0RY7iG6p(ph8`J3{nAy3 zW}q3TF_v~Bi%;W>?pb~4-hsR#wL@W6)jm@0qe7hmzG{&^?1EfTnH}GIq1@HSpgUd} zyC${)eo9-gY2|Z#|JX&nqq-k$NOQyNUh6;{0YQ;KZ072MoWy!`>G=eeZ!HJQ!L6XA zD?PvnW$-STsSZVd7e7V zb(_^_vV(F$ecdzt%oe+mO@N~L4sXBg>rXD*kwI%qrKyy!Ro}g=IL&vXw`^?8sP!PX zI?c*tf2&t8k$Qr$Z!S$Ow&0Uurc`#&uUu0cK;_#;xGl-ya_q9JAF|I!Rs#sH`thtZ z99j~ozSFuW0sm*ncuBf9;wAi{J#~>T&pHbyRF1_@*GE)oWc_~Hop3Lud5XA$q_ET5 zlPQT!V$DPw!#5XH^(%OU(Z-DOwPByKmG8Qu)ewg-ym(f}id#rqQ3{u^CEm)^W^UwO z#<|(X%M`{wv)!z@05h& zFX)0k57h$ZoZ3nmudS5Of1kT2wM%2~n>kKMAGjtseZU$xcVsJ(8d zBbcmy3x-5Bj-O)AW_ej#>0mPnSfqz7F}f_H;PhfdQw%MOx!99<^W~75bNP;pW;vuN zVw1mMl+B+iXKs4T%Z!fMhY`Si@`;fxg`p4^-NxgkAlulMsPM$bsDd->qXX;dw_gg4 zNb4HhU%UG*g>9oFM+sk|v*m`Br=1Vu(3QTv`-(kS!gdy_jb403gXu!nartgdZuVVU zM0RQsn8&KKhs*l!Ka*%K9Ey$T8eVwN@Qh7x$(s1u7?v=|mE}sb;9zhQ=p(Cn1J)%Z zvX^pL-MN%uJhn~h^P?s1LW>sXsMaSKh|-zPS#7h(FV~g!_+EEqaPSqDY_M*>*Ng;!iYxiB33=r|`gOwN%a%5rt<+wPH?8 zt`XsRp_+F|8>0}bCTQ?DUlhQbsgJIhQzXt3tbQ74l|pHq*&P=kp)$gX(eHEWn0~-g zJ&?I@PzzR*PThGX%eL1$%Ab)Z#;wRvc_rrbw*keTR#(IzC3NlY?wTJn_1mQ zF_)}7yXxbxn(FEJ1Oo95HA*7>#E{6ryrzzmSi&NNfdaLQsdDeyDVp5`Kj`AHVpjm1 zz+$=LNu;9sF;JF|a%GMvH&Wm%ld7MqMOZclmA*Xb+wcjke3bkVe!-8d-H6zykI6AN zf7^j~R191j^SDY5?%=d*#gyV|O`qh`-cjJDBp>DM7uZb6fG8t<(RZ2;M69;cB{O@h z_*vWSCvGdJLul>t;`^Jo7%jbCyb&Xw@YAl?2)yP-+V3@>TZQM;)up;6u^Pq+G`gpR zJrbaBe6}Fm;bZ(IH<`h;-Rq+LLWGx-r{wo+DL1c~!B^)B&OQCO<^pQUz_3?&N)Z3@*g!bt?E>G@xT(Pl8 zA717-hbc4Svl!l0Zv5i8liZmVG7gPj7;Xy?h7xkQcU^8(#AOJ*THnTU zl54$#6<;zT#_AN4Qj%0`-zUlhirev(FKqP{-?WevN0UCn#U`{+(qapBJx|obdlid) zq|m`I?8;7qqz>I<>Knn))tD?)da35U{d|-@&bnEfZf>%nTpa>?c@oYl511PSM5<@< 
z4s!de!bJ&mO-muecN+_FYEyh$5?004?bE<;uFsC2l$EUii#4kiS0MGz(;7K&b9_)u zX7@=#)-6vW4?Vkx&z#|_c9UHf_>F7#I29m3KU5zsd7 z^Na8BNdGh-c30aYza~sYU!C#NadlCvKHf;RKyJ*;Fro{6MRD_QLq)vl8--bTh1EJ@ z3ux_c<42zR&V{9_}~QC8@vCVubXM10KpA5lSP>cJK_Q67T4?xxE5DOHk(+9PaRXH(bgLN zWKG7WRsPjW?o6Ab2b|{k&$64?sfR~}I7jycS1Uc5W=TDL5=~nJ2%TOVuTJxC=avxx z(IUUu=vy5}s<%h7!YNzF9>U&W-mdI=RS)(~0a-?`!og|r#8-22x0c8`Qc_;ZEA2zp z*p^&-tmz#?hJoNt0jv*66tZVq7e7-_Njzu8Mj?-2b^j2qEHJ`Im;IdGg_7QXJ$C%I zRY#+<)$PIKbFL-x7Nm6M7rA;7{pmd&*|cfqMA~}h2SfRZD$Cy9^?ZRW%L5&yxYcNR z%pW+au5CVXn3hmr)`rfTMRubnHTtzvxvGA|?RoH%h+Wg}_xhAy$8`wgyFEZTAbz<{ zJU^g#V(;ZOCHZduFBO{iMj8n-FER;bKT@=PopAZetVQ@ERf#h6aLfGl>rlX|xRn0Q zu*>VM$ZADCp=Q&_!9*R^xnU37ven&7n3REJ+h4;~ZW}*fj%mp?vQD;o#~3mH43T)r>Y^R?=;fENjZ3^3Vqm2O^4!ZO1XNj+sohf@7;)IfDX-Fn zXFX!2w&P4LoSzy(c_ylC&;Gf`5S{LvVuN}=fne9*?6ALNly^!~eEP!?i|n&Ocvs?# z9wun=o60gPg7=Yv%siTkV+U)lQnPj7@+He2;gXS}g|Sp+P2UO}ccC{Yja7@3Z;^Y% zt+Syzn{hxa2>WTzOF^|{#u@4=iA zBCDA`{+I2M^sGEIliX+7lgkn+HFtd0YC`7xv6H5x^stZr#C#ZVy;CTZof7y2q&Iy`)VziT!0@;{Eh1Kb04ziQbX%b1PzV~AgR4DstWjsd{ zlSxp~roQwTt-K?H#toHml)-9*%u_jyI8dd!id+HrbFN~W$}cs>O0I1J6f=NA9Eb*&x(m%6H<7ya}rA7d1-arwWc1#-HU83(fUcU2cwUndX)~{n8F_@Z2?~Rw?l^ zL|qh%HQ6xoEVoi&ovKhhs3oMo8iioDm$>N%rW0oz@JmOo^WEO$BT45~*N9w5KcA>& zI!eQ%fJeohsY$mSDLjpp!=?cKMy3py)wFUN?=2A8kS{l=s^+pJgeIVXc?C$5|ErM$%4K@sZLFH>Gz7`Azk@5*tGRZSC~mMQutfoTbmF^fKuWub^n z-T0W55R@-SArLdaICnxRh@~=+*B^zSg71{c_2gH&t(Z6z6h$*O>JF&n#`jR|@>^{1 zioqzh+(+}PIl*|%m@T$HxJC1QVl&wWd-4Y&D_76n)71hwC)2g)Bx=NmmZdjJiujzR z^meLVpgHD`+EP7LzK$on;&iYUlziSmc zYL?^oH}MJ>DJ+VRsIe$~{g>>CoC!4S|H(B1{w4{j#lac0dd%EWnJAR{s0+AGeuw%9 zjSLm@E{IAw6#H-ft2qSGiO+lq`bo$6eS0Z&z)UVFq(mNVuH%8-sM(qScJ@ECuK%w) zo2MWzybG)tcJz=cEW}Jo3p2zUg9aymF~JPm>wOPrPhj<#9MZuCJ1YLAm9CuMC1qT+ zuvaa1a!`j0M^ebsZn zsI5R2qD}Nl;TPrdm_F=0f^L2j_dG)H%ZPwjc5D*7dZg7f3b+!-x0i1GvRwZ;wPZ-0 zhUWL7SBHO(gu^#oM*}3x6YE2pW51r~C(AQX2_gtviNw50*7A(E8 z&@42a^5RG?hjZ~sqpbmnN%ONb;Xe{4K_{D^nh-bKUta{W)g^{4!v$$2z)0f_;VtAS~Pzxj8Mo3;b{&k`mG&1;yUq1T6r65%ioGN-1@PU ze&EF#lVRtC7w(4`+wbQGw5<>Vq8O*L_DGmyNa;iLlydw;GXE^lwus$?_Jz%(5l_Mb z{xbRre-~>x4CZ2FjIk~3{}tItj9bAt?JRwzaU_b>GPL0nbuh;fTbDl{sFqO#F|SXi z`!6nS85<0;f{fjxBN8VLs&a62PAeX{g;bype4-(#oBzL>OX1P2^soF8uO1G|-U9f< z!dokU1RsaZYC7=f9QXbb6dis|1Eda!cJJPidE|)e(~)v^ z!W?J*s|h7gUCQ}B#|doY=f8i^l$}QS<*gOsznTK-SZIXh$nN~S(XU^8N=le!cwzJC zwIqaT^!KEHed5nwe20V8Qw$}F55L5(Yk8n8x|#O(S37KVz{Hr>>GeHwEjhHcO-KE@ z>gRVXN4s^MeL>>Lwc*fa=_3EnGa&ykM6ItNb$fW^T3OV3ky2rXKMxy#5?CDYVGO5I z^~kmVk=<8Qjy}dLMML(ggnj>1_@DP2g|&5?-%K#BjvnpF+vRAHki)kfNI>>#bvMXl z-VS5mGjMs&@7lzd)t8C0QQiJ^Mn_(P*Od>^!(VadXC|Aaj{IN?BIPsckr%;$b_@Ba zBb|EuyRUY$^85A e|M$A?2dV*+bTK1SWh4sxQMjZcQ*zPt*8c)ByOAgW literal 0 HcmV?d00001 diff --git a/format/diagrams/layout-list.png b/format/diagrams/layout-list.png new file mode 100644 index 0000000000000000000000000000000000000000..167b10b11e37e761de81de8fa9fc8c5c9a30e4f8 GIT binary patch literal 15906 zcmd_R1yEhVwk?VU34~z5-7UC#aCZp7U4mO+w>s>(9Yo)SKVfq{7@Co8EA0|UDceqxc~ zz?Fsf9(v$ESXXr!ahUQk;w|t6#Yy&!D+~-S4fF?APM!J?H27tsspF=jq$ptKXwPC| z?r3Vk;%V;$TEoByc?y7!_7-j?8!jz9K1i)wLWmZb^$0lyJ!jw8ns^k)m zE*9k6EL<#Xlp;^b$;pLW%q<1fC8ht_9efj}w03iI5@2QZ@bF;q;AC-hv0`QC=jUf- z<6!0BcnMm(bcHy$nRvc*aHaakApbFrq=l=Qi;a_;jiUoObX*ftM|U@2N=oR1{`Kb{ z^K`SZ{9jLUaQ)}9zyevJcUak3*jWEHHs~q@y(%E#Xz%1=;pz&;7vUCqZ24cV{jcZz zqrIxNqnjhx1s5ALIR`fj7tqh-n?9Kkq-v9a9Kidnj zLKps@HsT+<{CE{?v&d5+)_=V+k*Afb9DiV71lQ#xUuk;6?qwtS;7C36Q3R_deiuK` zDnOP!$ZOO2{G*xt$15pJ*eAX()EFq$@2JShpVDhJxiPTPB!8z!R>c;R6-PmUyCoww z=x-jj;BEZV{Kt2{n{H%G`Nr4Zb$o1WWWUhw_R^WrPc{Gn{2fv37$1qTnv;9{mV=R( zTtgNQgAyOGg`zs7bRbSB21hLhm+W#5i$xBL#U>0Z-?(KU3;)ZV{ar18*y; 
z4d?~~dd(t1?{=Ib(61GFP&@Xde_(sJa)1<9uYbnn8GaI$GeIu0qPE3ct2pE(*cux{(sP9^(vpAZcT3V z;u+01xii`6HNUg3Fl-%4VR~I@+mLX4H#5ZV{>9%@bprfL?Dl?2o);PaCTIoI8nSPFf z!l2*aG+CtdoS1JZ%TObFs%yH!NYwf}`8Nvx``gaWP8juKbumZyr+j z>gsB<$3YI$>+*7g<{V*fTpPdZtzs-1`GAqjqt!KLN}r4UKUHS^b=DJbYocdAHI_u} ze}C(bBNHq(k5)p-<2^KEhr@Vvx(*iEeljMA{zt&~`gAz;rE47?M*^)9VFDjHxflW* zJiJ=C#EMcT&oZ;}pJtB=tx|3DwTG|#~RsH$M2hLBz1D`yRO!U#IsHoPG(dJUc80T~yOuBog+sp?O zFY6@qDEh&wj4F*^UGu`goOV~)yJ+?iwqt=(NhOLCFLlM|?lvwO!K8NqK{C*V3eacFplF1Qp-yX{roS2xP zQ_j-)W<8NeqwqHw*^D}>N4&w~V8M5Po?qWH2eM8exx(qVAjIFwhTW}`C7uB`iA3<7 z-9S7AJ`NQkFRqJenTb+J5M2hWlwp-=??;I+CtH`xLlY;@ADfybPR{CY@zl|+VQ$va zj40W$l3(i8xvUK}kOi`!!gYr})2KGqXC)4!-TPgOy71}>f7)s!9TTEg?@(4?-Z2IY z9{02@p2BB*IE|Gx?)SUhnfJ0jmSb5Z=_J_L@Xj)S)7j0EWf8Eju*liN9lz(jCuS|U zm18JTueG1U#eOpMX5cBMGm4riM#lhVLUq=l)!N~>=)-NncQ7;i_|?^w6|hOco&$wG znCc^6_3HOWGkH^(<7KIpvju9ibofg-mJ+%`_C9b}bd^PXel5$N5f(v(8L>r6EIu1Y z%pb^g(00Ru&aPGX?PO9#R7!DwzR`ln8F$*UXElg{mWCa^j^=P&IC~TK)|LRSp|)S_ z1y-JpX1;VxajqX>nC;Y8HA?n1uhqV1BOGR*^dlr;Rsv0E;l4WzH2KH}vu_A`?CW3o zk!dXRghZ4~KwTJdo9(>8~m`b5}27LZHgVSa#J+K)2i0kL3!EZuH~y z8*wz+6$>KFAXK{bQA|~q=g22!dcb22H<7|Gw?Iw=(!=z;d-Pj8)gthif>ll4B=^pm z_C`{`aAWSc$m3#me2Y$3Q(LLp|^r3HC)3?>aXm|bes3AZVi+g6(&D~-lE`~xw+2*L8-7k&@-zM3&A>mwhe z3L88UZ2Rrw@hhp+kp=c0CE=VuQzJt3QJm;$4U;_);;XGEJ|SlbLDXZdW^xd66djJu z$lzwM;mlZ#+P7~VM=>H5p`Fy}b)d82fwk~|1F z>isk|vpQT7$|DPmSG{@9S129!Qamv+Y;9L6x*5*GbTE5G22ikRjh|tA#k5~8{f61= zUMZkQ(m1DFswUv0p{1NxD7dDD@X)L$JLyqTzfefl>F80W(1gv`cgWW?YCc5&v0A;s zd=aQMmc{=IZy;HlMvS5Uii(czXvP+X}0ww--+vyL5nT#??FCo>F@p0h$ zQ(28-_kw2jV&#dr-v+}9)*0suu)hv~L70GWz#ieT=yw>gPsatQ_=1=bXt@g4=(H?8 z@`S;Dt_4!Vk?=|K?b8lHN@3cbBd36(7DYQIxGbjngl2Z?ekw+`nO8%YL8GpS0Ucc= z=jmn>FXk}RNJ6u8s&LLsbqV5VDToOPx971WP4TE!deId&5OmF)P-Yul>1a1=E!~vk zAWlLsA(!vrk%k2Jvp5LYBa?|%^US}+zwrqMI6#KjM3U2TG$Z7R%I@z}vb|1EVt2FX z=;CKF);+oI#xFG$e;eDYmYjy*8H(XIFICkWMBuzv%a^7I;3udO>|OpkC1H2jjf{5# zA>^@3cj4~e$6>dpIZvXG=?r;0l*}+>att>U5KK}yaC5bhYsSqD31LekK~^1lxnuFQ zeR)~VgcO&pYD?9Q*XN-10(8hv{{t*%`&a2I25MsZ*t;1IWG(jvu1|lJI08!Sg{D_4 zVG+A^geGF$#m&QV7(>L$*sh``c)3Uf4JC!XwSf?n-q1nuu5Fm9P+nRI_N$!V-PtrH z>NcqbYbRscm^-|E|=Vb*rO;%P#vb89%{_58!nX;}vImJFqn z#|)wNGB8Fwe$3Khkd`M=A7)g;C8{jL$6}87xPEpT?#gI7{H-!X`dIP0g7h3ca*rk6 zkH7Ue9K&2H)`LJ>6Y+$uHwv%ci=1>=^IXO(a+YHH?_iSiBzar#j9kPA=PG-6Cck6& zc*-<3?94@yr=6U+pMoFm4*b!Josb#Q>I21wvM0%rp5t<_K*(n3(Sj++{BNtm z``1)|Ql1BxG0K%_+Tr&~N%e$VNmfy5$OXXIY5tCOZqo6DQR#eoqw}{1Z7vBW_xxMl zVhGwYH^Wd45?{9WTQtX6iGO0>SAfeB>FJ=+Snnj8)Ft7gSCxbm z_r$m!4S%IK^0<&UB!siM1jUK-Q^}t+$yN5c5C__X&Bfx6e;TlE1z8GZJUe7;Ih`b6 zFjZ`NBQ`v)$kkT{wX$Djro2q~!eOm~u-uo8E%LjDBdiz2X?Ai2y$Ax}@;h?iFM`r+ zr~q{D1t9$8zHx>O0MSTLbIwaUQkHCiRa>aGe??`31)4|(289;!p5TfB7{PzFH_f({B|&?975 z2A^H8W6Efum_GuhP(QGf97+n(lYqJZ|K<^w3MRRzeSLkQAtBiU?u@jwv?@75SI27r zA?v3yztQK>)G$q>VM|2t`2f@QB$LP9#o7cQD%ZaQ%NT@rrNx8wn!AGn8R-+le zI%#UG#@{(q6!nbF3KLc9;S-M6AwkR={JQ z8Qt+uQ?1*U`Wg;Ab^H(#tMzob0e0e_Pf}6TNR1g-J8$jQhEgVSug`XhZC^5HA&4yj zLL=V7$7tn`f#Bmjd(!b99`F}wmAjQFRya7gc_T+|ns-Jxgf;i6e+a5e)>9W~(hE#KmDSh$5XDY^PcH*T6mouhARDzV32tIYPv0xmfU)kLHF$ z#bdzo(Z#A)nv(fDv9m+!~F(fi#Q z>6POTPc)`K&~7ppe4n+qnJgkSvr9oz5fYNW!Rli`K}dw9cN`m0^Ok9uP$MQUPm)0!p?`5RyX zFCklnX~Zj9=a2-29lW8BEWBa5A1cRD)2~^)O>e`Mse-|=|8tqhs#mxB!72yS*K45# z5-BtH&E_PxU9>myg#?_x|3f>~{dN1B8cWL7q8`eq8LdRXBGELf)b0R>+FSt?ssqDi zFp)-{QxXQAz|KsDC~qC#Ma1(EMnV&9fzxs%d|;|fk7IQf-4md z5)u+Ar@HW)Qw{C`Twisn;M_kFs3KShg$x%+y#J)vetSF&SZqJqFBvjlge^(rB|M05g}zrC3{pNN>ubRNh=1yvE|So54t!QHYER#$te@M<{l3zg>+JC02g3gpqHc6Be^@aj@vY zwsCi{*k6K1z+vp&;38k@pAt=)=5pq1`0dpsp1e(6Mu0Z$?n&U6SoQmpanbyP75Mk@ z2Kab*uS?WvdL#;#c$^tM_UHYJ+%M?B2_BSXt~Tb1`IKF&;*gGoOXbV8wLY3DPYs8P 
z%Z#x%*XG+y%-SWrZ;ZJPgN!KSq99o>O~q+7)?zV?yOzvBoy?khd_s#~GME|S)F@6t zqmZg8FUuUDi;z^&b}cU#gjyG#T#9Yg>#o~I%-G0`p?WQZJg_e0bIIno&|-Z2+qPox zC?W*=q#|))RlU4+-1b_WVN_X;oE^G^w?C35Rs=spw;TCg(YTmo1-|-}MFNjKQcjLY z)x7BcV35_di9bXsl3pc6y&m!qd%5Eu4V#ppWB#fT*%&`IDlA-Xmz9*S3x|Kq6dm0@ zKIv5en@J3Qbp4v=YInLjpf5j+#pfH}%_F|2vm8v(uoz0lyb)x#`XtE+dua`5vBWcArQsKXsJze{EnL6d>{dD=`r4QmX&06RLe>hEYr*|+ZqH%jYh>|GKSvpOE z5T7|I#D09bF3=-vGn&DjVh$*`6n)5*H0F!4ZjG?b}~ap}_b2&B&FhXRLFn zei2q58ay>tnCUy&2pJg8k5W;%y*+_C%gHJP*nRebMWL!`KRnS2vr9v(SX&MQC4^}8 z4)sxR8A(>0`5fl?*z-SS!9hlXI}G&?dv7aTtQ) z>~*Ps83>X2j1e$|TFfN%hBjcBqVLaAqtBJghy}y&3-MXtFl~ga4*GOb?%g%SkCkcu zvDykbd6pE#nT@jy<4&wdC&WG~mV+;B&4CE;h-gkCr7koY=9;+;{tx%^4;`=3+xk=C zA=Lq^*Ms}BD5-zXASp)lg`?Z>Q1^>k=@m0nRY%pfq2rU(ekiZw z=D;N$F(C-^%5Jeum`B-|6nohDGD;-^iv~y_3~?zv@x(Fj=LHW!an{AO&K%0<$ux-5 zcRCbt(5tXeU-f&@c6IFdNfeWg0v`7>o)7b7TkuB`8j>S^j?BNoukMa|3YR>1wh6p7B~I*EeY%3`btma zGR=(l&l)L7bxS0_plq@;;>Ad!RyC?R1aai?*lD8MV|B%-@t>y;#b9<*hNX>+0(QyUs`t=?S;(lra(Iwz!O@BvKf@VU_0jsRTGof=UpK2OPS%UVC~r5C=I`-?Y&8 z-$CQhMOYJuvr+BI8`m>PYXPx-&(IveK0v?S4nV1Y8Q%F9(LsttnE(>L^+j= zRfSaLb9KzuW5%lEqPCKR7Hi2DJo3EbS^p(ahg=fy@Z$$yJHBS zvFjWyjDVAwmhN1WE>*p^9A44fgpBSdakI5POAwJ+ph<0>h6&^IUw6kFNNF8iA*id% z$YF?^+eQ{8ZyMiA>{-i!8C5PVe@`^OfP}f|`+X$-(IEOWCF15w_IO6=_zwMAytvQa z(UZ_3vz>7AClft~2sL4RJrWV7)H`SsHSC+6GfT|G`N`2cqNM))v>rV_KdVVZ$cWkH zOJ=f{Yvw59pyZWZ(uQXfx!?x(BA`b0?m&)Oq7z9`1JD_?NLS|Q-LQFgXM6(nXShNY zp1k`gMyZLrQewUWOOJNcQ-&7VyDz}(1DoJO;r1DQeABEIw#I%%PEC;`TsQ%J@Tq|8N^Y^txP%po`q---&&SIRh!zCr`{}>@fQ@Hvg3{zvU8EkJA~& z&8CRq940#N)G01pVcD)~t2!xyd@?7*5!&OzB5G&~z$Ouu)o{#_qqB6pgTR0R4 zeXJU_B!6u0i1-yu#nIsx%jj6}v=(Tw4iRZcc{_f9)(@;OuMLz|kJBBX($tSC>0o0g z=jSHpJ7J$)(TxYt8@y_V$-xiY%tKnrks?ste4Ra<$S(N4*qd!4nw`6Dav5E7^qAz* zSTtH!K0o3@C!Te4kztq21fbJAi4=tJfy>}Pmjn8qL>VB#A7&ts7 zQWhOJz6<0GuOp%dnWbUyloflI#KYTXF#ie}*|B@uprd#K!)e%+EB>7Yqup}E;kL=Q z9&+dj>ps7J*NMLFT{XUQ_N|y?UYf`Nt78tf$}rf#LMNo#sGw6&uBVyJpva0EB|STU z^%SC9ncj+brGHb^h%Z`5oe^^7*22(Q^t00C4U{-1Adti;gon6=uU?0Xv|ty@Gifsus8cA42F0#Zv= z>o-FbE7ue!#)FqOuFq5ZKFNwP!LGJ5?4IcQ7I3tt_)A$^UAm{>M5}1GN`@b$R7+zy z$t?{{sHPp|#EfV)Ra>E&+i<|zDpwBRawtMoi&T(}tZ#|30m4JT8CZSUftfClGmIh7 z)i@IlZjpvz&vg$2geR^YBjjRKi<$NPX%=xF8$m4YBJ|F2Knlt3%b56l$$KP5Pb3F zXLyU}{SAZB%LfYwyVIBn4S!iKB9-`nXUMmgd%VNi58Cir=8))3u^0uXOWeF0K0-?h zSVPGYAUZl+rs^F*LP4o_Sd5L0&@#n8WC00e+xyp4tjm(irZhcUfDZayjHcDJ-|5cB z8@6I?2@sjdb$miVcf3=ovvZtT4&7@WVn_2%gSkj)UiatT=1>kcACG8{KWih&9LW%1 zp_VC^b4|5Gdm7vB!OKJY+g-WuQEXTS*{Je_#psyMh^3J?bq+d{B{r>&7pUN%jUPB(W+$V&L%|o2w1ag>E+Db&rJjJuC}Vzy5mShe8_P${^t5}24Sz2 zm$0XTyuAVfAEOG6Pp!bX|HTQRVp^`yJL|!gH_QETnqPHohvqG2QqrwSbI4)6F+8&( z+u8Q@a@If3XN$@U8CdppE$;XC%NPpeRv2UY`>(i!MQ!X#EPbC?W#cj0*FR^U?SEOw zL$j9`;I{Sj#oZ3deSrq&6O@^Nj*J*RV~Yq=DLdm*e#L_1%x)Ds8Whijjls-QI>9eSl?NRzn1dch`!7Tq#UH#W|#o&sd>T2QKj5YLy#-r#!j2i8V z9Yc@iT-s*Y>V{|$dfHs5Vk~l}BY0%i1m0UzJwApQmPxdso%6QInVfryI_Landc>fd z7#^Io$X-blEHjrDW9-l@>c3flVZ3HBj^l+Vx&^ru+uTZ$$**|8OlEM+@p-~H{h26> z!S!jB#kX*ed2>2w_AIj= z_?*8uUXksT$>ISc`?Kmj!~$NoF|a}71}(`v zW&RPLq*$b+vvrB8R_dP}ocpcb@iuz>^<{O{ET2M}5eJ^YW9DF?(e>|A8;D7-`V$-$_xJKe2|ZV@x&LFNn|;-ppC#_D zNW%i(>_XGK`;`THPQH7>5vMep8CEMc#@tZFuQ<$;2(fVau2}DKiF$AMjP&jR4fcSA zZag6%1UXB<8+{6gNq?ZTsO##4Vtt#;e?Fp_D4jj1`wc0f9GHdHH%@c3uj(}m!Ed5m zHb&#_!smF(wNhdtWIM1B9R4&t6Rcd?GFB231Ifl)%oj*pp=9bq+~obTEpAux0~mlJ zgxV>}!_m^IHW>TqI4!Fe1d*hD%Guo-N@B7yHkJ<45aLP)Mg#A3oBC_^aCJcSljfUe zv+2nQipqJO30;-OdqpqN-!bd8qO-5bi|%2JORR#}{);0nuALV$LYGe84}Z7Dq@|G= zvnAVci_qk5okXjY0b+Ie>}6cWfuN`7@d&@Cp_^)Z&N}bJDK4L~--iYA=LAkWmb*8; z0&7D_qHOe=CgNScm=a{V&ikYn2eP>5J~O!fuv*$H2otNp=fv>*qLa{8;gQ4VtDSZa 
zzcck&x7t&UIe34jmG}Ps&coZoKQ!Nk%!zWbbLBUR>R#HHRMoMotE;uK*LiZKvG#^L z1o${#d*}o^JSwS)NI1RX1ByHueKICWI5s2e)}%Vxc>a_9F_}1!ks|5UibcCqmTGS+(>9Lc{J> z&hvk0^e74x^Pn!Pd851L(euCo2ZzlUp+pUhd>0_P?q3};KYF+T;NcLI zO|hXKt^-;j<}D(m0EJ?-ASuQBK50f0>O-7B<<|bc^B8Tv?9_wR4ZRhgMVTpwkDl47 z2|U%~JCd3CF4-N`|A?fq0V#{o4gcW=4P8pvv6GSlSP;sIpZzMF_s3&2bh$a;BZWkR zh)XIl82IQA^}wVTD961X=V=DM{C9W0w(Ncom%UhBeNKif?0al|dL|mkVKJ6eA+|>7 zeueGKP%yL$mgfaP5S{$l_VO`;^h?*OogC&ppARzb*LiiX^CF@tuHO4*GY)Z~VSiK(P4T0#(0=(ui8=+WmF_&%${Qtwx`U@m zva+8wTHmS~=VevNhq&+wL1@Ke&1aYp8VDvx&J!($?B~he<9Mv@FFAkQ!j>^BzM9pb zvt%*MWijyb;CC0~HR(b~TXk*}rxvTmq>>&;KuZ#<2GH3X9a4WVJC~U^Xnn2Zk~ zg!5QgiD!<PouALRiF4w6teH-~z^&=dM&@CUfszjU(F>;G)v1L_P$XHJE;UfF$ z!J_|?Q)*b7SO5UDt^bDS9@rGfysrY)bttH)3e`Cv2}AcG6h&5y1{UOq#zWC}uh+dN zH^z+WA+|A$FHY8n+z^)*sKm=V#cPeeT#uW^=H?ph z`#j0`wmq|6oAdrvXZeYBgwi>c_>2Xp%Tv3^)H%%i{(;lV<*)3PtD-I%hmvkpaKJ>9 zjqX<~iC?0W!P-z_)M;gQ9!wG))yzZ~av(KmDz%|V@S4qNfjViazlA^dq3sJ(XGxl%UX$+|R(t)) z%+V4ZeT+0w3t}qEpQ#g|o`X*i*P}5c9H{0>!I0ph*@tw&jr|}9UwG??rE;&5lRQlG z`MJi>uNrx2c@}S4_>7AhHB6iziY)r{ufL#t^3rb19Ml9`JMDY!&KXmr$TJzp#(-;eTt*Z{_i1PTPS&Tkg0g z*^!OPhsHdL{!J1~0F&AN3rtpNYxvcmU0P}b41uH=7V2@ZQggB%+G zG4tGqxmGHAwdQR*t|jVPWm^7SU!@|lErcFDydsE#gq|rG?f_7^iFmBY&Md=!-1=eV zB+`rj2(@vc4EY!kINXPcJ1ahs?9d4=y@b<0|0#A41c$CT5*D>Dc4gPk zxXjJeu~cP@=j!Z18t`@6)OSt$ri+K`IRZRZ8Sbs0t=mzfmaLu}%?-v|k;i97C9{=l z<*i6)_RpAHrNDF|06y{T0fWBbP!zM8l7#Q^~o%y9o;=Ji08ruG@bm z==N9A^WYVkVQge%tNTv7S>FM_lT$~fN==x#Ac?cXce>nC)<_0-fK^_jzS{}qri&$I z)+xAjW#v}O4BL_wqk0d-EX5q5M)?#*ke7!RrfO;?_raaYw^{U$g`?t4v)(!z=aM=* zdw92ByxQP-$-{hIr5OP-w%Luo8?lBjYV_Sf+}`FZl;<{Ev8MI~WZTa#xCobUk|whQ zq`wfK5hjBQLSX5v`VHTAuvvK>|B&8V9(;YnVfB3QX};6ta)2Llu=#Lc`*glpqCEqr zS1^p0+>5@wOuugBlX=KV`~0`sFKCLAa5?0Mn-lq9= zf|wX^#1D3!s82^W)oyfjNehn5H#6{gz^3OMWP9a2p1xfbOB8ZNi&sxfBHhjn5-xw^bewmvP+Cm}W$pPSWam2{I!#cm)rX~#p za;RYQk&_uT5T7oTXwdE2uzyjWGJ^s}59JII6C7n;$7veZDZBk2N5m+biUF=G0`NS( zMxL?`X!E{h(NJ;z}-P;gu6)O&b)2XbwIS(o@bYnEf{;3~i|PC1CN{UkWx`{ zaTbLFIqW#bXM^TCtEZ+6=Tsds!Mhu-xBm z%u=wy0A%$JpyIv%H_dgfKwkSmsZDp_!qe)mM+jNUeONa;v{J%*6`$6yIF zRBmxV_zD!=OgtM_6zVt_BbQ7e2a)vQ3_v6hMrpy6;sgA^iNCJ07<_b47{EKc%6sz+ z>Pt3d0@TaN^im&XPejbTWFmx&U*P(`-(|_sn-JP3paexStDVnru%h__)GT@LPwmZ7 z%vR&DsyE0Kzk4X@U!*!i>37lG$96ELlI`e6`KsG)bNetl-zv&%AolxH2&!RhQxoK7 zEY=rMF_Uw=U(X?F{ixG7lP|+EF#r{An{0lu%F1zfRPW4IYdl^=_-PX+Hs^Bl8;$l{ zqIjOYE#Y<~gaUW9>-Kz?l=5igHszf~O`&&^3>WzFxRP*1RC$I&*-G793XNW@{eN2P zO(x#X0$1tR+LoTw0m(k2W7ZWBy-{c2y{%bM4cWTgPRTu!_tq)lgcYIuI}*%JZHxz}ONM#9D(&W`XqOOp zRfX^+1(OW(hrFVO*XaFsO#$pA7!;idn9z+&E1D=~GgLYq=2dw@abQeFz?g2Xb5fxV zLV+=HlowEdgL@7wXRbJ~;RlD)0$PFlzqxtGw9V}F#kVYk=m%Kxwt4+x!+y0!W#c!G z1QlN<)QF?24sx=DVDMKqCnL#9UVDR_>4Z_~wH;d*b z5557BuR|SJU#bgT3^v%1%W-SQUTt^7l% z+((w|?-|P$Xxl4|Ig;*%QWa+BINqf==d?sg5uSlao9);S^%m30tLeOeu2zCx`Rq=#=%}`=IQm(@dp*qYwX^uFZy*u!Lb`Xg4Q828<{}#di@2U5H zxuY)XcRQo!USkN}$pKbJl8ALcg6O1T=fvPDamo%tYVV`c?+o5hUzMd2)Ia87gH8NT z?_J%2*mg=1sznF@|M)-A1#F-j+_0%!2LeLR&bMegdZXe(Z2Q21^ z0TcZEGBvHSI%6dw?J-`N0ZC1z|?%z-Kt{`9+H?5om0u9N@afa{$AY zi=c^#&+G2xrnsB=r}^(lf&l;-9H{`q_8nW=QyVM8ouo16o_Dtpu~)6zwVmlHG1Rxk z*-!w>TLXCG<$)<7G#%y(j>OJQIxg5kogA1n;eT9j!f3$uxX5141^!&O{B$z+8Oi2G zzw<|yEG=C5YsOdX1Ss(xq6jPd3p+opI;wx|>@T)I_&pf@Jo5kU$pHK{mZ!D?ToH{E ztH^z=-@)!oO}%Mjb^AzqB8@BUWQJ*WsB2)aKwAL~wK2^5%bSblQ?ds4jhvyi+z$_b zdY1CY;TbJn$N_39Bqs?i3q8bpfAgHQ<$fw}bk0M&Ovj}2RPLwq;uZmM%1tcm3>Ma* zJDh|}^wVd8nxDn)uO?{`Ek~AiMr$>0_!#Sx>TQ@%3Zlj8W9{KNErmRmS6JK272n8> z?3sYYKZmz)rZzT#vye|szZ%^HnI1F(_U&ozotJhcTZ;L~nPV1qY$%>dVt2Z~fS8NZ zzBA=5f#uz2|O-lVf#H4@=lmg}KYe1-{KkFY!D=KF0-vMzN z(ynyw@apXdWHu9~j1Kk&D}b`wlghv7Ltws!JPH{}-IO`fUIJ literal 0 HcmV?d00001 diff --git 
a/format/diagrams/layout-primitive-array.png b/format/diagrams/layout-primitive-array.png new file mode 100644 index 0000000000000000000000000000000000000000..bd212f081151234c01f5814be4a4e4e1e4841835 GIT binary patch literal 10907 zcmeHtWmJ^yzb*_zBOpk3Bi#%=gp{;|bk|EaNDU<*l1d6PD2f4!fRfVE2+|1BEdqjc zog0|fF!V&z--xuIo4PI$A2%@Tl?7(9o`_sVeEAp+PFZ=MEeQ z_&0-Pkskbq?xUxofcEYy?J8*Cda4@xprMhoUHnB?(_`HN9fqCt@A}@=)R46GaO1PK z^RTh!3v}}Yz0uI510}&nH+x@e#y~e$cOS_>8Rp9#lHl{jV}54F%PzhyGR${1br=;r zyzLo9`9$~xm}T)885yO$?HnZal$24ggO&`lldrF*BtL&ZKmcEWFrSCFBfp@8gap5U z5WkQRFX+MR6Xfn|9mwnM!*Vsq-{UCR``CIrd-^(ixHDdiYi;A<=PSd^d@<2Kf3DW) z>+JCFOzu9YZGjE)Uwp$a$S1)6&)DEq>5He5iXLvB-u6B|V0>9o>C2w~<=MaUT=my+ z^6>Qlr{L{utLE-&?+sq|wZ2%LtPtw$|96l7K9;t(vpv}L<=cX&xBv4Qs=qY<#m4{R zAg)gN@+mlGSv+a}f38dxuXJAM2O1iurJ9nwejxg9>?o|M+&?>3lF>VGG^JmLSLBQ<|wX0U71&086=lKBHCA+ZtlkM)~joSAi{HMqJ`F3e)U73PLsGDPli0Ttgh3xcxux;0p zJN@HEt@8svKV-YJr>ne2sC%XPk8fWY)pr%pC3I5qnPDj{PwjeH`BzC7l|POQSN0b% zjZfaGrum=4|F?&cS+dXZ^SgH-grdu5PF=$eBBrMUa#$5D&pF^`7d^X+}2qh>U& zrA`se$GGZUARUEqcJUTsu2XkeZ*a|!%m zQ$fq059Sig26)1!gEv285J?M&cpuEhmM=3G4mMPT%*Ie`@6RX8ZH$+5L^XwbK2H== z?@kFjS}o$QKbT{j1n*Q2YDA=E^4D!OgDce-(YwuSP=-*r@NwYAB{(N0<u zp{@J-S%S8*U{QV}#>-!&Hq9Ke?dPJYNv6ZkWu_O`k5$MTACzu2w10gD|ntMhwhDY;DsNpa!lf<*rEmg z`Z)Sbnmw{o{?!=aBAO=KbI<6EQB#l4{p=>U*At-Sfd?;|w^xVuZ9a`VO!!Z@{r-%A z|0MuBJDBelw|RG!H;{DDvwevv>xL&D^;8qEA9Vg>)>+uwr$S9TYt};Jcq=Tn#CeOJ#3ktCC{7L1A2ogx zzJr=~)Se7fQ9tHM4u_`{Mcp(mt0V5l_J@Sf{|@P?;LxuRt)gF~!pA)!+2F zG!!_*jcx0kKE$nLJZamt5C5Y;a;EPt2Sro}XVeqxQs(IH8D7c(-m}p8L~(`eG>0hi zaue4E%_!#&A44J}mUBs^p_0@?GEoiuC8 zHv_4Om*X`;Rl#r=C&U;5UWX_&~Vp6$4lX@Ejne@Hj9g`daPND_OKze%(D>dImTOk?IG?z*sR5& zEDftKpQsZq)%<*ipMjBNuZ|&^&Qq-OPoe~S^_2X!QM>u|VC*l|en_*BU0augGL%B0 zgS>U=ld#>J$8~nSHp_Vd>S>nC1^1k65=Ke-> zSTRsZoseD{CTZ06AX=u;nZ=9*`)1dF-gGD-rQpQA1WW2l2P=#>-byL=9%4t1=EzHh zG4Q%Q4h|(pga7n`t_t0kc=ar7_u5}#3Zfomo$PN*K2e*}N_fQEUWd`(3S(iDFHE#Q z)fSW{$;ukP-WVXEd`9^~(6as(t_!E&_Pp4H>mapCx8JCF<`x$1Ae@sGBEfKPr6BZp zFVCa?aO&RAlG@GsC#J_BudQUxGd-jJHYdP?!go$=5lM!Gakt~ft$vnNI&|hbKe2Q` zYP5_EjIZRUR_>-rWVNK=ylRZ^oXz~y__`P15Dy)~W{6Ucy^XTVY+4I`WJ-ACs0e-9+#4%uv`z;3MubBKD>JMa!%D$w#L=FqO}B%S`68vrW7D;c!9>bJPYqv? 
zkK7=9T?4X+uon{*89gqi!Q_|6ceISt>s^(uCQBh%NwYlAE_camlm6-`!_Yp~lbviz z5Z1r(+w)1_)(5NW=2T~A`wc`?LeW_9F^GZwjSMLyvvpE`_FWseF4_(Gmr}?(wU_WY z0yhJmmmWp!)!vtn@xUq9;Ng-2CQtQ38c0k$b|AWy8_9_w`R- zTSuC9%ns8xf{wM(_m-UyJX9xAo`^Pm!kU21uMSZ9wO6{&$H?U?n*N}dbJ&2}S7?6s zC9NG^6r+g(Q|uy;#OmV4J0WL>%j0QqRkQ)&uP_!V!m)-%!CMCkNmRn@fhM#;3#sk2 zK0VQU9iJ^xu`>vA?Jce61OPcl05^?Y%L=Grr6rv@b0Sz={TuXUq2Y>@c@xuf%s#{s z)5w+A61J&?wDWDvQ!4JGYa!m(R`Oh;!xEi>zv3A#7~K&R!d{J%L&YU_ZU4_SLOR=7dy$m~$6f zMz38Y$FUE#RGp6bq2%p%^eBCH;`tj+ghUjW^G%fp%v6%)&JI98(*EKGl3?mB@|%MS zj-AYg&^WxXNPEYK)yv>`BOA`biT!2axRq4<Y!fO5HbFRacG`9B4}kz6e-W)D zo8WZMpsf9WK@w+D#@3#EwBleyI`rex>kTp;b>wVssY4Rs$om;l0PEVE9Q=k=dVKgo zevx$l1MMa^{5Q0#M=RmYT8ou(vZ@_E7g#95T?l}@`*)_(%!ZQ}M(-^SA;W3XM=Q^L zqo*5%`=o`JbA8^GHsf)>tC~}$aeFNHZWWxpGyupe0fU<2=58N{+Y)KSUB?9YUQ=TE z0OTusFcVeiYzHir`ML+G(x8^quRGc@eiLq$iS#l@05zGqe)!^Kx6>@FBC-8*$v<(=E;oAyZ~$wBKTUg@%MKW)Co=zw+3N z+nGX(rafH@g~s*jD_>0_^x6kG&wYCUsg`6wKa(l>=JDqBO9=KL!P$vR$wx!_Uf?zn zH7R>x18N*GHUX1fDqP8{^GQ;CH>}DYHaiX2g>msetUgql#0vU$i$btL3t-qo^ymGa zZMHAnU+x3)2D21Y^&hv-)DfRzFTZ&z6yx{?K`0v6-1LO$QMcS_QLeYT1EtwBAAygK zc7HLU>}H>CU>_2tuhf0?wPA;b`6 zkyv}({0(N_oouc-u#>G8MY8*&McNDmSY}^=Y5F#uGPml;lhwe96wBJ_gIoF3+0j1k ze3il;{%iF`X8rDn=}x6_*|RHF#mN`({U(pKGQrDrnf=Kfkn|ElucshhbHitE%XH0i z_|frVKYXycM5-+igWY<(*by&)d3=(-|IQOF|Y7trq653)JRNHc5i=!?ybNZ#!1)m zUMI0GH8Y=H(MOtMRMk!rucbXCkaG-YTe1CwZ3TcIGV@N?`#P1B_?4uSaLXq8R=i|m zjDV1I@F~QG8@@Tn^e{B+k0oOG%Hijl@$tg7M9sF%=@&88wmyf=L}0&+Y7$0!&&_TW z4ag;ABu1q2hqE;02JpL~T8OaX$=!B`t^D&7j$yK_@?`1lj(+n$&NFEg_&aF0pYCh` zA9|!Aq$bJAOoK&)wa%cMbe*{{;Jb4X60enK2$P;FbBrvKM!rX#bS#>a>M_9yZz&bv zRwB;mS1hGp4R^>)0&%-D#;0x}aza0QwTMr{G1sdNU{LO9Q5OakIzRoC^Ld8z4hpAO zJK*f?zWJ}KT1jq0aUIU<2HEHnnDz&?KSs1 zfXJuC74OV*?)Wr_eS+^9E9+Pt!idHB6BjW>b{>nWry)+Y&*@&betaemL+QuU2yA+I z82}xzz0`uE(U*AD*B>HkMEJde02uH>Yv!Twc6P5wSmaO zw%9bMCxp)=EH1@i?Az#<`Z@JvYp;W%4a2J=g(;A3PA-C!SdJs`dL#GYyy3)EhHXR4o5Z%~GV7wnSJ|y+o zI3r{jU5Yd_-Ix5r%z@x^?eCh#K3KGIM$RulO=g)&l(|iawUolvX>; zHzhC19X)w#A$5q3P1~r9Wup-h$D0&>`qQ#c@+VLI>bTHRzar6>6!mO$D~`rTsY2_w zTJ0lZ9$IC}(o=-PS{NgMiO6-;n+O*%g$NydIIm&#j;N1$eaS z53HS%4;CpD%4)v(JVV~NOxL3@&`oJP1vKUvT2nLCdIVYs0!_A__W;&=>ksChb(wUN zJEm{a1y}bo4rL^S?qNj-(s(Yh=0G3d$Wt1-tKQDZxX?%=wA7JzGoq@Bl47>f;FkAR zU!-kw;o*MhH*?9lMjyN`=sjL84rB&VqvH8RuO8QCp4U`LQ7W`F6VzT5l)?@gkE2_G zDA8+(8EJ?~3W?W1*I<_yh9?C(z%nFx_=q{U?u^~O#APcFp$}fr65kSpx9a5;sg)wD zq~!8Vd2ze9cs^&kCf*?#o75(+KVrilx=kt0(&{Wlhr5(78zIx^`(VQ)_wiTx2$rB^+fJl8`TYgmLpUICB~l0Snk_ahyF^&4h3bGvugcxXj@7 zvlA$rQrxkHsvVvk-*d7;jNDzD);W&+5FufL`lZSV_u8u6)X!K2aM&+MW()Bl><)f{ z!j`hq1ZRT;gnBtYJgmx?oaNaymO*6Zl}*Is+_){lC`1xV6G4<9zTp?rw6oat8ePL` zy{z`p7WaKZx~?fXEX7d_Y(k!fAA$t@8!lG0jG7;~ z^2NUy{Zv-4vNdU;Z=}Ktygp}$t5}Ck$#tr%8H0GJr15Ow(}X+CoQS6oM4rsc9;{MGN*U14>;Z*1?`CeU%< zOarD|)fF3l@Yg=fkE@PEOQQWXR_>^+m8{YJkOh%~nH{r}fnN4kX80-LeJajlxYFFqaOqrCFs(jF_G z!q3l6J!)28j%4;_u5W{i!_o_NRvmr_kmp|;jqY94UXa!}`mkyj38yUaf&%x#=FuAK zyHEo)>e&c&me3C`m=YhubKZ4vWXNLQ)I*8I=(rj~)#*_eM6e)++{F1sdPv&e9-^NB zU$!s|1O%)-iB1XtcO(4RUw>Ek`m1@dW#z7{83BEx1?+WA)$MPctQwfN@5b`-3d(y7 z0EgG{D-`3r@{W~B=wEXwFzt<@6>#cDW;^8Ev^)UGs2xPbHyM;ve&oXY?eb@nTTv0dsC^)!f;3fap~gZVP# z-&%RQC4JwU@aQX~$A5I!I);q%;a&<}QO;jn(JO4P1vItA#_evb3>Il~5v&5v5&|G| zFOk)KDu+P8mxLE}S;#E{XG5%6TamE!(iWk`&#u=Z8p}o5IHH9HRP)P33R6i`2kU7J zigZc4o1P``byXy~S@mQI0u5;$RN1aSH9I*5_!T+`VUPGiZpzc4dlSs~2o#TUyI&VQ z%~`<=iF|%d8jz%YPr8P3$li2a9zc01mTlc;1+}Xjt6VF4rt~N7*COB_|1YQY zZJQ(vF*VXK;aaYl#rVE%yDQ`eU;%hcXHXzGn0G9;2@s`N*ryxpfx4y|dp2vyE!I{{ z3NzwPmSpL)uD9vR6yq-MKRdaBN~^M0Hqc~jx_OAVc9W^laQ8072D=}iTF3MO>?20% zTIR_u^D!VHsUhv^PWL8s-4#$`)lHlsb%`04U{L9UG~EU!2k3ASPgeZ<+6_ro7^M(Z 
zEYFfTyZ*_+9sq$Ao*LNNkwB8j&_3iWHjEUG^a@3a(qd8u)sKUcO%YIGIxg-#MQXM1 z(S7K5q8!baUGW@G1NS`|ZWiSVJzBg99;(P0=wB~@f(Kv{fj9#NW;5HPr3>XC$76$W zcJra|*TAnH4NkY?j46KsUkR55g1N5VZY$u+ zn#lEv3@UL%Ws#9shWy5KM`(;)uvvybCqT2S)yQ76s0UTR0{LQ!UONB7amV}^A=lE@ z2MFxB6eh*H?pER~T5%_(Y&^hS!40kfp+15Ax(3}aT~P@tdrC&a*S_fNii{k15{2x? zwi(~SLC1)rO<7IajRb)+Yg|F|L|TP-6*g9Y|4DXVg4)*UwYF#)~b@Ye+tJ4-T1^AA_p0 zVz}a}4v5BVnGwNe5O=gBVnn7lKy{v$VH(E72dW|KeyySf;pb=X>yQ58G@i9&bTEl~ zM^we;|8s=*(UWSsPDb@o(ixTbme!KWsRs$nHINa)@9jgNS}0&frJP7D+Ranmy3LMq zr#uAwjr$NYT_mklEK%Zxe8L&c->Q}3-@ey?<^oF3Y%uE_vH>2wXb71JVdP&5cVN&~ z3lN$b`e7T7bjRwUMXC&eGZBJmBp;rBAN>k7<(7`a-P;ou=S#d1_aA{qF}8K9BGQyu zE2L_>fS5R&QsRz-gY$Wk)CSHX&~l(XSWJ0miMP*Qm)c3X@+ie}W=r|6kJaUlsmpf+ zgv*Mnj%5Z;d&tqUrk1`=vr=hQ>3W7@O0p}{+-)OMg(thVMfpT(|Mkw(v*(I1?z@!qqm2-KMh1< z+Pgj#!}Ls7ot3Z0Ct9A1qnBKMPn^U{qN61?_@%}@H33^b5FSG_PVPuJLueBz1({hE z((IFd1We~awJ=4SsF|>*9$M=%;xpA~wJE3F zT>aZIJ)r|J52#;A^m0y$$)2#rDd%oFt@IZZ5Y$I#RQ41Xvq=BxS8RzT$_FxY#{*rQ z?3oL-vtOa}2cA4;0iGaMDQO+huNXl1HV^m_Q_i#_f=&k8MobPZBStt|Ho63`DyZ6e zAW*+PnSC?DXx1zi2L849C;hFDDg*G(KY*`^3A!%sym9<{o}wX5WT-!B|KG0?jY z4}{v|Y3LXas4`UkYYc+A-)#~HE=u-}RI#xt0tnd0q90yUdfB9eL=p?cHrHav4{B%c z)7-CNOUIC=5jbdXc7|NT@PUYhM^h~#uq0R>usk2wcFw87^ip(KIj!6IdK-UX@S-4e zFZrdZBBPb7W+!gJLy>_Zo@>s9(1D6gv4I0<4buI_ zUyqdx!+;d}+q9Jkv3XFH)5s`@ODu*M%pIrCUL)>J%tzd&JakT^gKC^)->j=Pkxyb_ zQc;-MfN!z+#%wnc2?Uo$2^Pajk%&P>f?SaaSK=9)X~s=C{6*2(bf8ZlzV}&wy+nj>KJecohCqJ&*LZZ zQ5(3ZlHi>oXrH3m&o2;~h}SWysB$A0P=qFilonLmR|N2vBzE_36r+eF75ECi?}v)l zl_~~A*1CDz1sM%)fU8fbuGKAXkbghXKm&m`K~`yf`C~D5CrB<_fA-Wx;uSG+7=;28 zmEqm;&t-HsRZujB2~2{AZh~St{Z3$w)W72I$)P6218daB>0d)Nl>r6ochVj@PQ8j3 z`l?`y3J60K%MqalqbUbJn@UD)Q2~%7GDt-Ys`>5WcWvKEcf(Nelk9>d{g=fpBELO- zwa(n@!CNJ!VBkq{&mP(n&V5F{4e-JKFr((yg~_I};_ zoH5RS-+$+I42LkC&+NG8bzS!ztEsMtiAIVB0|SGptR$xm0|Rdc0|R%A0tcihif{Y| z{(<$>R+NFM8X?~YenE9pGW3LjA!2y`3#+V6e*$Efw%0ZAGEh?$v2=CjG`DiKu;%o0 zb^~(5z=-*Y06#igdzn-FIXk&{iuj4s{*yxl`1v`Ri8t!^OkH z0p#HD^mp+x_v3K!r2AJP|5J{fwWpo@E#qDHvrrX?`IJ0kErP>;_&Q=XoCc=jV#=dO!NL+cM?* z4<8=a@s18*kV*tzKb(bh5!^ww{{})?~0#4L>!m+ zp8|lgxZz+4yK&w8X#Uf@f-pO(f13HPBv9yEv<(!Y*8fV_(w%cDTZ+2r={$N+f`f!TKM`{c?d8Rk@|1LYQwx1XlyGCGWZWM};;a z!vc#H>wy7WG?kw7)|nrIF6U1|j=s$@=J{GxaD^^wnTy57iH|jI1EDcVeeKg1rqCA z_Z|}8CHEWA?Q|V1%G@!@bH(P1Ni6Rb`Hy`uk6gxhb<_)|ZnbeScIPIlEwkxlw9T+Q= z^^Z`vSq;8BQfB&)KIp&u;i@p2GVW1iK?w%2hN68`;*K5d;r@J7pUvXOpBeOVmzJXj z2Y1r*T`jzC6Q+-6nF-btX`f&FZ@=5lFy?GfpXsPOY=3%irTx}+y%AqmJMH$n{pqpt z9f|Yr$=|{GA&>qaguj0O^@X24H{jRjQIUg+IKx8X)P^1u!2r}3QqX#aKM}t!7hS2` z9}Y~PUVNeD7ol{nxj!2;1*(WeX$5LgfEXg9SL41wY`d{bo+77pRKZ9yWwDGX&&#!1HF(r#xP|0h%R|-)~(8x z&%fVq3+R{6USYI47)l^|++>;7{73cdYBhxFDEareX0bexq1E;~{ouozUYFKWg(%PK zlZVHv_K>S(UtX?N-j^-~v_nhaw~C#>l;6%6yH+wqpUCsw?iF)2%vsb8@t{*ZPz7C# z^Nxyz^Mp!7Xt(W`NcQTg=xfdRQxI}|9&g)&9SoL$6JO%jwYQwJf;s`+WgxUk4x{9ns@Ww z*g%`MQdRsm`175^gfAAIW$>LEmwiHj0VRoDGLNS4i~rW5C9oQJ-hRK$mEI}V1FYn^ znGLm)%diq zu}5ZImMxff*ri3Sw3D!V5A)BqZYGLK+@ExL&%VmUZmE*BEZcd!`#q`$ z>z(n{I4~;^70m?2dV`CslFel&Y!=vT34$wrsmhG~Xvx#jewn<{c*qTwd|nsM4f8hN zruE1iD{6|Eux!e+tqScy<6^tHo�X797#HVGfZ!sz-Zfh5U6d)dv!`Wk70jriy+N zWSTnMZ+;TRHNRS!U*cIN%=(QJtQ94q#$$2sA$+?f5StL@jh5!R^q8h5V0dvz(yJ!8 z+&$~zM{r`+yWp`NNjx%>$W;TZDuILI7IcIKKB4(PEIR_z(S;6ps{?*N5(=#3)2HwH#Rw(6)@|k5?pu1cl@N9NT z%TNXTHgP%E_A?e*@`lYF7mcGA+cTa+f4AMah%<;lO|U=B^xWvZI+f`BZ2t^dk|b80XOb%<|L)qVJti)y-<|SwMeJZDz^e_G3W7O1xa`*(-=6O0FT4)S9=?(XWXNL|$PixKLv`8%8 zb-m?@jS|{S5Pa=&79e%wCC(!|%dXO`gtho0!`t^&UKC~O5pjh3DO|a#{c{EuDHTJT(wqG-Rq(<`{vOn&xsy=;wRebB3R&|w6O1S>{ z?r!y|4HU`}L-nwyDE|~Eu{V%)bC&7bUH-Lg0EdDq160P|yp_r>b-!}erVoC}Bu)}w zTS?DwHh!J6dSB9Cw7HmpYFANzgZ8kF9b!;_K9T`v8^uvm*Lxj3e+1kmRBtR+);b>J 
zwod7|KU$24i~Y335xzgqf0Co=0PTP9T`@;oC-)rcAP+i)zUOF6Kb{#e=v{hWBMm_! zuaKvfP48^Vprxg>gn(VzA1)0{+`5BcE$Y{#^ar68KK(Oq348taMA{#YT*2z23`2(U zCl+(Qt3kz|O$gvy7{Wa4586Na*209<%HUya%8Xo$#ATZ;kkeMfTSIWD#DOay_yPlq zptr!TQ`o5J{?Jx(P(D6$h|or`(u!>sG^MAAV7!P_v;K%5=~{_t!Pj6Z8*Lybj+KEQ zxr|>!BEMM1q-?y&H*@+WmO!%XSEKZyYHzu7xZhbnQ*=)J4-PiHqd&8zw!-o;X?KIh z5!=~>I*6m85+NRj3@+hw3y=5*ub-SJvLD0EPH`jX286X0FF$1;3A68j?2Tx-d3yCw zr3d>NlU!cE=Rb?pc+6O&-dQnqk1wZ@jhad>1?lcC`Xunrz{rFBPt04u7tMVi_oU7T zKS39S%t_7ho(`O&R2_*87&Ss(W#As1ja#Jwo|P>_rWMf0lk)exw9~3T1#xf zi9F#Qg~X5<{OF`Is{1wMmfvlnM<7Z>4Y5-ME_o!%xP zOyb1Q#3o+S=*uq080|T(C~VN7^|&TUP=&z4hoG`ZaTldQU#A(#@5QhaN{+a6x`82) z%PlcocBT+vO~r&pj>@4SZzzVpOR+4k5|s>zs9cL#UD)kQ#GQF1JRi7qBwLe=EoiL) zMCl0h6*DQ!BEmrelc^%ZyzUWRo9Q4a_W?w@r!aqd9{=*#i9!*xMpFVX)w*CP>2^=i z%`#v4e5S#J?+jHJr;7<}iS7%uSs$@c$w36i{LF}(FE|ub(}m3@!G5W*MwcYUV>Rd; zrTtFLc#)#*yAollWmpszaVp;>4dw28FwO{3!Y`6Ivt6wFr_09nNR>R6c{@YF@0R34 zbUas50`)=126i9lV&H#C9yU<0z8*DA509IcLtFx`?2td@rr_+pbXtv@WU@Fo9U8(k z-HR=Bq2sK+^B8d_Hid&~TT6N^_bfHIEt5ygz1x=rB0ihIj_u z6m%;@r2|8vsMA#_Ex^n|q;DR#c=j7{yF*tPKkQ5Ly$FoNv_k;TzBe~Ylh7GOu?WPk zk%uv++N#}1O`6_#{ZJ@7bCc;C0A6&E;0HFvB@0I1%fZ4us)N4H-33?U6@sDr9S<`U z>X~#`0Vl~Z9$9iPQ}zxSy9C_s%_K<)C*f`^xu@%1Ng0oe@nr5duVsA(^FC!5{d(_M z)sL!kRJ$HA&Equjw3t%ob8l2cjr4{hkr0~2#YP|ASg1V5%5{+MIs275YBNYCk1vpu ziW9t#V*6_B95e#KwaDBqq)Hc7z_GM*8#Cqo_*gIxxc3p*UvH7thBvN@dh1man-Mx%p4`yEJf<{_|r|`I7UhZdW;Bb3dIY zLLX&iy$OH9qJOK!bAF?`zNG&0!!ZrW7w@EAvCt*Q5|)DG%&N&Mm?|-lTXJhFYv^IwIgv3}i90-KDhlgErkQk~_k`1e|CR0KB4l2pz;UW< zB>fQ^oY$Kpsy|(Aid2ZfuoG6BdFm`afh$*%)b*pIFHj)5LKF*z`y0iBWhQ3-4JV6( z$`E#xW(lcfXI$b^B@{+zRe8P*q|FbEtPTD=1x|JpO0BshB~Nda@byXN&2XFtl7_^p zf6a8v*jYg+Ym9DT&EsS>CT)3&K?)9EYz!{44ybjgR`vz|S~pgBz(AH_LKX#<@78nHnV4w2H)qjy&6zroQD%?5u0U0fI>jeo?7-1$du)g` z>x&dbvLgsqa8BIEk(qgCG_?ww(c4m*YiS&V(29YH|L`;tT zE@_MaG>lRsZ+R&Q%qOA!-v69%8h$a!MN#MQ zV5ivcIDD?=L>r*ml$sOVDgs{*37UYyMiAv?>*FSXothWKOrrWyahNC5+%j`0Gcv;b zGQaT9%qu1}nuzcmVrmaEwsPG870M+h{>i7s_u=0#PWxe>rDrxHk+p302hWx*PWi-~ zKGvtVLxvUu4+N9F#VPN^ z6#PalxD@Rk94stRJ>mEOm;*5-r zRFXBlN#*uH^@)UqoT$_CiX3y1L25U6s!eq%R6HHnT*jJc&^al zkt;1N5NAy{yXmLNZ!*GOkE6AaYpU)rU#q7h_eoJxt75ER&5n zy$pgSYN)$YRvOfzg{sc&3u18c;Jx($?a;-IU+i?SLWRqkpDDwr4Du~)mTF<>jxCtZ zx|01D*Z0t@@oI@<(-|p74&^Yld!I$;-d-9GJRXeV*zvcOMlAo>93LK!v~lWuD7Dbp z2MEHAtnKyLEj~&>&}(B=V=@*Ez?m7}$~sDKf_pv;G9yAT(-) z$X-nI{jRu?SEVGBBYMyj(HUYs4Rp78=}%kyA%Y$pJS45ii?E>(HKf0Dw%(Dj&)0I! 
zyF8ka-Jj}OvANA^e5sPG zI=y%9*=G34SVQV#+U!+dKNxqRI@*d(jQOy{5RfBxJ%bMp9^D1G>$cynozIf!M%}BL z4RAcaN`HXzd?gW$%HnNgLyrS%q6=aW9kZ);Ttl+MVi*)gL^eJLG5Z@nt(7pz>s(bJ zk<8u}z@1IYSQW`vGJIqRu6G$hBt-vZo?pgii)r}2!nIywm|_~^&}cEncwlu8Z$uxMucx<;Z8V?Rzn6UYA|G#=kz&)Zm^#mF z$f#N=u@ISkx2Rvj(YwwEcb**ScFghmfzX%O977`aa}HrK#qsrF0+MdC_E_HhFsq-9 z>HB88VUsgf^H6kO@o4vmpoVaVK0jEFmvp;P_%nTVj|Bp(G47H>4kbusJRLaE$cm1{ zR#T==IJ^=S!q#O*3!n%e&`kWxcwy?NCFh=Q{3tt&PEQ}_Z?5l~e1ux&t=Q8aE{ao$ z#oY_2dkob%v%}4P@{qv_=Ex&B^u-0(6Ib z4c>=RxklwX7=dJ7^kpHgw@j+s8x~ez()Jw>$HT(epLg-DW~yIg_CZU#-FCg!Vq)wo z&GXt>ZkFNN^(hs$=*xC~Tn-=){`v!U1@`o;7;XY;T8BC1Z4CZ9OJ z?-f;>k+6jcX{hB^Nrr^wvRviI)ZZ+AYnviSrqbGC9oj_SE z_zYK(95-xk@GUrYZjLlz3?@hFV#&KaB33A} zcG??@-j=Aay{hR5`<`}Sm}|y*1O|?!dgIBg&&TuAa($9Sqb8mV58oyH6sdF5IoXFY z2MeGGh9kSLAsC05*jbL|n=lSLrGM~vvS0tlGA@?O67q1?eX_p4ui7p9F|BXY3xqoL7c_+64~*%eX~shs&{1)~3djaHZsi6zCTa+qvE14AV}`|-0X7AfngC^;z+E1 z*o6h$2ieJ9u#rIP9nrOCpZq`Cz-0!94?33P|vqNb8%SJdFmugRY z$fC!2dIX9(EVk0TPhLa5(lT<5b!|wG&fBSd&&Dj2ug%uncIzjW_d&LgIUVeAT1oh% zWMCBy6^!GT{SruK1|wB!T~3OTaMKx!#+HLW-bkVm#=0yK98ZT72X|)~X-_f3a5B|5 zpg?gm&>h_X+LsY*OZE_3VMX^TgfYiP>BISiGL7*nE+8-HC1yu}CGD^t%E~ zB{6ip#m+CAI^xx;NfhG$&SS)&srv%Xlk>Q@noK}#^rP*4H>SH#c)WDLYi`KxdAB7z ziwlF#g*7>8m#NDSZqbfm`QLG$5ELErCB=@5wziV$PZ=`V1Wg!MLX5m~yqcbVW?|Hjl1hADe&&gHWy-OdAP@#k>X>UFdrg9th zo<}e-2@=40(j&Zip(v{4A{fOjgvn6F#X_Xpj-7fQh?OC>RH*vOOPF`}#V<^+Yn6`>kvYJoy{DGSZM{07 zqO4=s`DZ2c6*1?&bF)}Yecy<4p}R+>0-3)y*g^XUpVtQZr93(0dN7_F>Dw4CvPZ4; zORr*O+d=1-PpI#`*uhukZ!3M-`NRS&R0-rrE&3TUhZf3CUyY#XsN#{&e@@71eNU%9 z?4)N$FP!#vNm*yUZ@7u<*zb_%Rpo2>CkI7+ris(J3CL()9B7{q1Zyifw4pCo#&~Wx zef$+EPg0mh$CxFCot@^WzSlbvWflu@#i{PU zvp@;zeYUo(-5sQ_SFq-3JzTsx4)qy?t={~04Y^x1xlehm_a-DV;!vV%g+fav&rFP0 zi)b^r3&A?xZpV3$N#_`qWMEsCk35fr_%pX=DhGk@ebbi^1n^-*q(jU^_F(0u^|P6> z=(dM~Ip)U6k7x0=Z%pM7!4SF5FVL8HvZtlI3yn7QgyCT_|B#6={9=)s zH{#fKB%%OIZc#u)4;(BmLAYv4LsJ(Y7_Gn96_LZ76O!g?yu-2i9iv11+{%)YojQbqqoESTjHDhwo|( z0@)m0PD3H5(xarLPH%Q|T-$O7=ByqoUnGtUtVy2d1zv*r$Yyb-(j>9r7#+ZmYo9w9 z)3@ib5KKpw{ny_d;q5V3UY6Y#GQEO42flK&BGm zldZcSqo_+8IV$I3<;Nd3FN#ohv&ml6)#Ii3s+{s->aU$uLyCQ|~q0zI91BJ?xY_J=y0K z3M&=Z0#c4kRUM<0U?w+!aIYh!K%*0H`n;-a`1AtdQ1V^b9#iNeLqXVFyR+-?n<$tJ ze^7XBTO4UkT*jwbaUFiP30Szxd@QL36x8&@w*oPr9C;8kr1D_ioJ9te&e!I@o1Nc2 zx+#0DQXuv7QAesM8iD@gX`xB0!Ha=96nIIyYBjK!ufDo8-&c*zm{0+#{4w=@Mjoz` zb`8a%T=3m*#7%Lzmsid`+_^Zt*P<-ys|7QB%A^5+x3NVgS z0{C2*1LHGm!@!z6$PT`gaI1-STU zaR|VEGJf!y_}qTSQQJq0&eeJ%Rp+2{53rsaQ~+!Z*qc8NYX=|icO?O-3&&`KnpvUr z2;rw!7O@Ph0NtqoH~Y$$LU0FQ=!HgvHVL*v0M!X)$i4eLQ-0uCwryTBh{ip?G%p~o zkxa$~kN^Nm1QmJvAQipt38uWDN@|I_!#aHvcc_x_EdZBexdZx|@@M(V$J1EuE)3ph zrua`cpSmj!AeG_uS@GWm6cHgnH%XJju7`L>oxbmvJOMI@c=8iTfWwG?e04hSP+mDA za=;$9^DKaQ(}~LDB@ekpcV?RdWH+9`1$GfgJskZGI5zK)-WK52+(gJ)rvlgf5>6H|Y=hF?j&`KlprDP^r&z z2hfV-TP6y}O<-Mc7!FD2pW*E}B7y-aOx^x5lP}JOe4s6GY0a6X$*KhNHz-QiD$b2M{>O(Fg%l4~wqt4^5kii&BiSOu~TPC(I`%cWJG)jxhq*6@{DtaN^@RAQ#d@pKd<{1a$4r9e^$=09XPXZ;bWQ@-=<}DvEbw|D!-5 zQhVw_5xZLQgz%k|S--cD+H_FWaRJl=(xSP(EB>RxyX3l6+}GWEh^$y>L6)@N#0cNB zwVz~5?*w`*0iEb)jiG=MowrI`21G%IBaT(gj9qbtRLk;kfGTUP?e5t4LSxt#5Wpyk z%5iN8ux~E8i;HCwjViPNZ-ctb@r0rwuGc%{kgO%INi|^&t~XI{(JbiHgaLrt70`F_ zYX|>+vs#Yppvi?h=iIn_{aB(x?%s_H|Am+_C`s%LMR^C%xB%p>#9eF8LpasFnN;&P zex7h~)dE|lh1YM6zBOdbsqRp~3@Y6x)5zA!gfYMQ+@b^neXy0hBMFI@R=yFt7?TaT z?xMo62Oz>BZy) zIjjO{*(4k~`%lo~Th`ugub3p%Ks`}%O-73NAq)auwO*=_d()>|eY6h%1S_g{A%)IL zJQq4s)q098wDuO6(vLdmri&sW_s7GBQReF|Y{{Wg7RQn5yiQ83BkqThd2>{q?5m#cA$ntf$p{qAKa>rnddy$j{b zFA=OUKNf0d%JC7ZKchjbxziMnfW2oL^0}ia=c|uT&u7V(zLvF9xUgBiIX3U@zq2VP zObntlD0l}@@1>QPR63YWbZOB+eS73&Y29#*Z0K4uE z8GWC#%~XaTfab#dA}R5Cb!mj$;%7RY>sxYxe0D8Z@z32dW~pHoVL5>VtYTN`B!LIm 
zCOr%5KW{PD4GHr9RP0~{sLE?jr>>3e5F8G7GX_2Zd(yrnvhA+2X(J90R{1694%0*C zYg3T9R922-fA5Wh48Fu7dERViT81g~(xm}}K}Z}+iJ9?0wb5R*)rzS(E? z^FK};H-PHIbCWx{D`tc~O|>(QLF9gb?duS|Spw|~ublfG!FFk*U!5yhWhVTr?__WU zy}7XzKrS_s0M#e_W^zGB0$BA~+-JD}CW)=w7tZMvsUu40xLnnG_CkcxGUGxpfWjgI zldl$gJz3dubhik`OSoaXGap$g0oemwDDfck7BM#H&ax|VY)I%==+A~S8j0^(;-~nr z=YV_+xyBG`QmW*R+6hRul<;2XK1(Ioj*l~jlka7v)3g30x$q6*g!T{YWqVg2AtHeZ z^?zegh@>O7&-bx3Y5Dtqj(!o@9+)yq2O5x$8PzlqKb6e|R6i|6pXj>nsLv`gUUs3% zMz#K!0S=gg0$weH%5;KzGWhr{h||kOHlca~QKZ>2vn7>w1vaa+Re-_gZh8h`d$LZ%9IvpYQ@R{QbZDNsE^C=d@n`j4RK zVYTe3=%s%9Eur)@kGXMq{^LbyII=!PMMs-(W<$-cCnvr@`GzdT&3s=J;&t(S>a%yFongS~Nr7(K2JwGg+T-^0aD{#4EFLnS!Oj$N z3EPdHRFjsB9fvXi&WE}W+X$Gaq0zRe^19)nS!{R%D2R0(YMbMx5E`B5VSOqG7*TLB z7OF3Es+E_<{RQ{iBE#PHe7%974e?0bE85J~1dtHqD-mNX=xU&(bJ?xaNy5C{h141l z>HKB zZeZL3bu|Q+y>z2HxGCR-`fQ-l(uK@(6g$?x(UZR4D^A+jjKAf%ByYh_Tzxq9C3Gc< zA{!sIP|c$kGZU7Kr38t`YrN~p8x@A#TOsIh<1fi*F1Y1-iF084vL}l*W_2YBo1xU@95d>-4vC`1XlaJ-M<{wa^VCb2OhY?eRndEF z0P&iq+)Rd~!eQ`=Z0^h#ki`OeNX79lvPaj{v7-62nglWAh-J12xWf|ZeF7d(i7q?* zH-q3D(-b#A`({_rDd$|F4vc}vS>ZQ8uIe6uBdAZ%vKNE4L1gTNTVm(0UR!?5_O(^U zszQ3>bs=-0lgs#VR5E|sTlfj~)fZ|k z6#aeLa_Zo_0vBy*jM0F{GsJ$29|R&D%f71#B&OG=AJOq@8`eUC>F#Z}rPX(?BL_~9 zlEi}NuxvK#O{A@QI8ge#@sXtFYu$KlDH_by_;c6%oYfadGjsch$BNaWZqF=0h2XZ9 zpYM5dP)lUowNPvl=L^4BoWq!_-x3OR@LHGQ9LKwoW7j6DH!KlU=Up)AY5+)AzAEHU+rGbk8+*5xEipqJ4NJ!9Ntq`t=VG^Q-T zXjLgnV+haQ1acCMwikS{%HA^nWRkJp(~v14@! zw-4D_20_1O5$bSv;3u?#78WHS%F$>uV}0QAUc9$SS}OzVdy=Vqxz&ySJqwXERGh-f z&6jvPSsxj&GIDv%he2dSW{@*VVDs$}frvR<4r}==zCB-J1rM6ithK-cpqFOO;m5G5V(?yp6Xr|XW(-{k89A4@FM zF14iNx(;8zb<9V>*oqjc$&RNppN<=puq}t~GlV#CZWh=Z_!-pr%(+;cP(sr7qE$GX zt-4Uzc2?8!3^1Ytto18>3QE`yt;tlJbOFmkB(C_=$jxzY*bJZpj;d*F>JesBHFUvG zyuxfbFvlHCSBW4$=(G`Rs(CSl1RfBY*42RuZF>~VCtdhuao14(f>Ohox;NcWmU%8t z#Yu^r2UZ!>=RE*^A2rsM5EfKR7-?y%O_-?eB7cv;fxU6aTr=Gi0`@#i%0AykYZT*(LFV?Ce%V~=iP>OY%UTB)dwfHa0Ussj`;q&`DoNJ z;FU!TH^i^6g98?qtD(kHs2R_#T8{{l5goaW+$Mk;g+Bns3d&=%x-n#_8k=|U5ZVL? znQN8N9Ib_{QY+?6(kibvH@pG?lDK-*)91J;CUTkTM$c;4)TCb}_b_-IO3(18aC9l) zpi%%|FXpPKi~;6p7ujGZFeJMzH38ePP9e8<+|iCG*~?2V}KD-g)cb)<$K zBYCa+faBXN=i=Nu06!+%QSdt(A2de;^ZAJNK-he$p9T1tCLYR=ud;BjXn_61b694& z`@#2hUm|i)u4|>iPrzLUTGe_#_qyF8f+#?py8{8jnnL5+5%3pqQV1(wxFDo7scpzCHThoN$>H?mKU9NSX4yDV?I3?MsOl)C!Ku^vC-Sr#`NcRfbb39hS$D_;s`C7N;s$gOI|%Z+?)MR znMhymt>N#pBq?GgAWZO*?l#qzB{S|dl5pLDtRmM276+nwV0{c|+BS>y7eLG`Ok67H zk_PLXzP$dJ2LJkwC+}mJL}q;d%B=f=9@3c>f+sEr8>@1NF{XkwzA*O*~r!Z`K&Vr-m#=*d4@=IEY6%vj=ukkD4LON0bxV zD+qUJyaHgFbLY=bckVszN7uvAJZBf}6Kz4tb#IaSYg#&yVqlr_6r1sd$=<+#MTvUV z=1Eg=>9C~Y@!8_pYf-k5z;nz^1}qhSijyU;JFtb8k}9qJmo2t~0FS4dPVgM#BLKW` zm0(JIKsHuOO}WKZN}>_el*8AUa-8EykLgiG@97FFL@qNK2dug6$vK8{!}&!X0XEOo zT(mg@J?|1y<~hH&LaW{Ya>UUx3;4FdUG{4X}imcC+jl)CAs@g?(qzlKb^JP!{%-tbr4>6F|o!R{`iR_ zvGc`AZ8RKxaNN$dYXk%Ij)*r_fW`|6BYvOznWmg-Z0aTH!n(pLMs40o5YP9_uHWK8 z%U!rJ=lJ{~y9}`GDkzh~K$OSpu36s`BwDrnS|IiM&uV5cDwbYOBA@a(u1Dk!y=!b( zOtQPCfU)#aYt%eZ7Jg$`h&*D-R;y?oK3Fbs$BTw0#1Ug_>(?Afbprpe)F03b^C6Z- zYa^3j1fu#~2g`CyCZ`IAd0lqc9$JnD3M`4LT3n-JN@fG3J5M|$dw)URvjji)lz|>H zR}Y>4+EvCsY%w2xOX7NnCjyzsI1g;L)h?Yb zO!DS6TAWxVl5XzlJeJPct_(kgWzp?h=Ku$F@EgmS&(u8Mab}}!e($dB1Wt=2q1x1? 
z+Eo!x?35rbPR#`_?5HP}cwLrmFJ&G=xIYP=!*nh%mN#*Oy3$EZyR1aD5mSUJLafHX zSG98L^oloC*Bq4Y-snyQ1;M>if3lL#fhXI~*eZ8suotD5UE5~5gCSi>Y)sUqY9#;DTLoQvmxQ zvS6_yG-b_X{8O9SFMXWP-qCj&iJnJ;4rsKmVk%1#B_FSg_>}}?^S%cnS>N~WE|nHSZC=MGg2q3V)DeAvHU}IQ-EPUHk<^wxuw+`F}^}DR{NN7cm8N*5auo| zGAe`4AxF2P+vC6>kXvmO<$cSm6<;IGeZBdTn3#S#yA^(%S~kRT(Ln*9WaW6MZGNwI z9gLSpzm#N`4H4e%iFMR%*KkR@xWoQJnX#k_@PJV(qvu~nj8l7{Q@a2Lx!V!>nU<+G z{lE{=haF9H_KaeBeKW>ki5Te{L~3BcvBf(6ywE$5xLOPE@9 zsLyK9)ydEo@rmq$tskt~%`W(N8t)&*aJ%^yKfTm$9#$LFJqooU&=W~xeczAGhrFpz zPdyiB@cp2FjU298*ly{lx#i3t;Q6~GaXq*XOl!dT#RDLy=bLbg7G!3O+I-7w3S`OH z@iKE=Ve|m%BBW;x|H5&L6am2)UcE7_sofx1WuJZeFaskahe5MyvHLO2xdp3bJ{|XQ z8V#yYWR(2)L#p7VY>M>|lR9|Tr7weP?b)BPE7abA!hGC7!OMPteQs(2(eW0r2N^3= zcbPv-bsuQ>1NQ4&$oCcbmOmMX0Z>91xFGScm|!B`#;QWPh9!rrci@U$DD?*p&w1t+ zg>fXzn73xg5#{hHfF~kr14B{4u;6MguZM_ckvt?_Di-MAsFHErGVb?38BV@GhT19N zf_}s`Q%W7h)B%1LykUI{>D9}!qH%kpVu(3LExeLY*zMe%TEOY9gwc3)K&J@k*<|C7 zskl|8*}(yGD|52z)EEj3Yq^hSJa3A zm&H;a#`D)#IWz*;^o;*o(EpCnzt+|@S&qT~!~%GW@V{&7|Jrs5*TIn?35fa$_)33m zJsZgToJg1=C5dkk^k@1j04@hJnk_SojUpgnQc1MtAGj_XtC)=yNLLPzEdc=&IPp#p z$K|hyI6{x{+z@J7?%&j1q?@uzN51T|3CP39DI}iAh%`Uv=SN0WD)@*{FU@Y>#nSNi zAh06_q8#|7@SFZUKQy(0gMh(V2%7OXpu!O?@NnR)-C_HCqP%zkJa9i^md^b>KMGy} zjM1X}Od!Mm#88k@23U`b`gH>1zcCi-!Y3p`JW z4bR{Gt)Br902oX}={Wsu?Y=EtF zk`s759zT+=L$yg|uCx0x5P9GiLeyQF z)B#Gw@rqTj{wwck07^G)^^C$G3G11&+TlpyX6pjne~He-{!HoAJz};P)2;*Y3wR%| zH;GBAeq1gjNj~~@)ka1<=ahYqG5I@s+hN-&{s5Sn38t~a(=0B4a;*Cu444#Zr=LD0 zA<%~Es0&*I!16ObB>{Nm{{qYfK7fOb9s}YD#18_!fOqf4G5A>1_L8Q3rAKI_!psJ;cm%-?TK1zh1OOqX=gO;5PBLDymMpFfYY6F0{2Vp3Pto@IIO`@f=R33Bf0kz6Z>9N;y*3)>Fv8M9xn0*R>VD_OhUZmFmtF3l4JNHbb z*pTf4PAAWeMkjOt`4pIP{EWZ8m&sRVqw>q~R-pJ_V1h`_}0ip+8vI89r zaqs)nUVuk`V5SAQ#2eapb{QFV7Ec@7W>qC`U|>!&nS`SBRNtxd^wUqO@(!n}aNYp? z9aZ@^lR&OIge-rI%(`I#V?d3Z=RITC#gNicL~RqPJT^RZ2DlZ6nz3(xW?liJXn^hw zOSQpadHYwg#`CXPVNt!_N_n8s8d0cl0dT;K=FKbLR<{) z+LHNUMRiDOY+g6En`N1Znx0|gz~cUrJ~>#j{vDe@ZXU399Z~DGq$`BxqDTy(Xa!=0 zGLZ$VkC%<+8vg|J02)(n|AP~WOx8;D5KcDdjP=5GOXco_y3s2C#(K#U3#_x77% z*D;pxT>zi!OQtcxiY7m)HShT~u1GZuy{c8Bt|xfpk4msiUhu(PQwlP;#aF-q7jF@P z?C&2TnwP>2at6`S99t`CTsYT0+Q{VQxSDXMEwjyj7J>*e~twav-fA}*C zC?1x9bN|Es=5suhvdZ}gKtSOG%91X9mOI1~*+ZmvFlKA}y^-NS@Cb@E{B%Vi0=iym z3VxKg2LL37*ofiIm}Lfc97CFNC&uhkIX53CWuXZ%flT^S88@E~acIE-GJ^y(+64oI z-aXQ~A0_}J&<+u;fohUFS)~53!g2rLJ=3upLzg%Q;E$VY76R(rFX`>#IU4$Ik*{oW zoO@(u0l`Y*{&Le5&+Y1S(2ico#}uJwH3F8<8jSAutc-6prTM`knR8ec3t8VbEVMYd z^dSe?A4LJt0t=W;yNnbS@CBYP@`-r@*6QFoQ4KW(A!;{`?al$4!cSf#sG!%ktXoC6 ztclths%W0gF7gN9)Kt?*EEMrTc(H)(ylo;H@PFES>!>WZXm3++cuG!n7FosX6amKiz{6pb}qrFOEgo)uGE+XwTenmFRRcxH(hx6`jIx|lbHp|8OF6XeV1>QdQ#S8VyF$YX-967Zp4jv~CfS@#gmSpuvM!33 zf$7iccFcaEwZUkDj_tD zDZa42``3Xh_nT`ZF>zw|1Maa3ajMl(7=*|Ttma13d122oMZA|ZDJl1re33-lX}s?G z6!{DOzK;pXwR6;?S5qrOsU-$igl&;Ek}po^}a8F3GXTpS`^qXS<{rWFxO0 zm8Oaf@%Gw$BMk#8gr&7VcHr6syFzCqcB;2fFn-lk`%~7RAD?(gVv%18`6)K(c#+Kb z>vK?J{fxCOzQ?ZzcH-1*dtY`adW|EcE?g3daWsii4c*|5m|+mVIj)mh>-IRu)|W!& z;>x74Ip2yw*Km?R-xD0kdS4*qIMqfE7Rgk?sfuh|NyJi9)%kPjR`!OY;%msChH=@6 zcWlO?Pf#6Rp*v3!QRRlV6j}-1T{Oq!v2r>4E|RqA%uT|~V3x7-K|?QKl{uKrw2-9q zsN>wOW@MJelOd4ZQZSUb<36*VXEbnKm~#heo8X+vx4f8-S7+&_#- z4^dL|+Whb&?{z7ROt%d8?ROQvvI1M_z(r`UUcH#?=a_NbwC@c~1yn6-k5?kWp=1Or zb`VmAsS-q1GW@K)zX*fkTUN~irXqhr|pg!-NR(*|Cm5@g#9W zd|6=>!c&p8)Xh9@jYf~f8yoSpGPng@O%~ekQ-nS}kJ#T7`AY;($E1VuG9d5vCZ!eC z&f@E0s@c#BA|$r7lvq7E;VV4lVJ|*uyhw;7-@OpWM8f`-P-}LOJm2a9jyj_$ir(z@ z<3P{rcZ+2)&lQF=F5)=^!V$@ODmR6j_ks<%mZ!7#S^u|QLwroL02VXuU5?+?943f% zet*NOnkss*?%tFclwgrMdCT>E|G|k;p#xRRxN0}<<*~LRPxT!C@9pBUV?763Pzeiv 
zldATrqEWS^->Dxxr1wFTrec2IC3UY}jnTJ3^hZgpMeVyUu%yh?i{*b|Yx0Zb-$7k`-aY-cJt;^RlbA8I6bBARhbKR?D3OOSaBXl^e@&>{Vz zkI3o4_a~w{AvP~0M?4*#O0BQP$)h$!iZXGt)jGlEb45I}chDsalS0A3?h@0w%oA>A z-#q+N8|pRcp~N?2zXygJHS}`!;mA?RUYEQsCK(GydWYGQj{+DtEolmled6OA;bYTe z1T88_-SG!(YfAmmrwMqL4Fck*uko=yi%Co*@s>i1Wngzn=JpQU6!)gSU}qz+>9C_A z@7D7=p#WgjS3?7B@*S>Ugs` z#=o%vdZ|FD3l%vWuqU7?XmmxRf|=BqxP;f%p{F#g{ege_su`zEdYYkDUF2l{#OHaX zsQo7gjkE?Rl2tC0>@VqEWy)c*(>v(oA2cYkmp!w#*5qXEW!K3aqc6`&KDDS#?dMRd zTeX(Yn-g+3l~QiSXLfkmhsocl>q1Gmj#~t{QajzdY=$KLTJ#`0&zW#8o84L#Woq#Y zJ#>J0dvRb|b0u(}xSoIGMYIdbz&abGch8 zU#3ep^4Nd>9G$1imzq`UHdSNkW_)IRB_fWxt|F*=k{iXI4)RRt{4DB~lLV3{-iWf7 z_}Zvqj!!4i)@G!e70`|-z5L@0D>R{{MV}Vb%IA?vQ;z;AQ z=i7W`AT$80Qn3o`!*}ovaPF0bLs`62r<5-y%m)-<2p@Ph85ulY&R2^Pb}b5(0tjeVDWi zjKT!tSO?cKjnV|c0aAD^j!5mZ+9%=hu1pi$=V7QqpX6E(Lfy`v0Njd&nO(Pd?GSk!9*&6R*-*ghTOiW-QKKN1EU@ z#qz#z4V5VvzH%l)?R2rzqs_hGIGT4%Bdms86#BbeDdg*9#fiHpOH?-L^8a{n-IP#bk zQ9vakY)V4@-Qxn23eQPA;yN7jTY1RwUg}$N%_@P>$s$$%K0TauaS9)P;l(&d)GX>F zp=Awj$5nB>m*8DH6H!U~IMIZ2LP^arOjHu1FUEaUI0FB9QdkC#WO$C(C(it1VrLa8 zpUJP#rhfd+DZ=nchWWXfzg?jOhWWoXRLOW}PI{32n8mq|Xesc<`2G&|?eoBSFHPx* zA2$s>;*PNlbx%EhaYlQD&TdO;(m3uTKNQkjNBP8O9P>(CAcQ#IaVA9f0ow&tfK#&jfUk{az5UbBMcGrL)c#is7wa`#Z*M_3)Nl`vPRWpv;yJ`M8e(=fjR}M za|#Z?Gqo*~X~5VbB)jZ2>dbm~6jBTp$P7p>kRUf0hI|i%nF7wj1ph@vJ@yQ;#>Wd7VHb!a+64kL{R>L$q*Z3rXdG zBp={@hZvJ^ZBZjHx|wBx&D>7E1}>tx*8tmh8StY@1xm>tLaGGD2rf^J2hbW{1=z3f zhw&4G`PW)_t-B7C9es=Poy@Di_8^5d$##J4u802u3H>t|sdgwqD+g==QsI?$P-5sd zjs*-H2(KjS?QY*b83;{k#8-R#RJ4Yqj~`D3f*Yae=qPAOW@vWVnjuB05b|lRzXiEA z{`E2$2bNXft!K4+RJk|U=SmKvPt*E7!lcxlE>?^{*bkKrA_fOX{;n;hheMts#n&N= z-&dVIUL&FhSg<}>dIlV*0qR6;FBGG8RYO^=eDxfP^QLS&QbMl47dsxZ6g~kzTmZL-t&msFmfTlrEQ}@v>Bm)>g;mWt{ zTR=;T1fQq(|F{Y2^!Da?!UvXqSF(jMq**`dRcS~La+dKTWVC4DYs2`v-Ea~Tpmc=V zB3vqV?e1>@i-K2<`0@kf$xAa(RP|;S(cuwrmW_rUv(6PGs=H0mIOoMC4MD0o15B_a zK=OSLv0Afa;_GX7E0huu%Kc1r-eju60h}kKE-Vh_CQbXA|hTLsqPI9b5tSSVppB4AiHVu^Zo5dsO#kR!fa>Ypnd7WD%w(^h~s zjquF+NRby_wo$9Y4TlZs=9W!Z18f9QM9R?wwe&N(5F%Pa2RgRUIe(zu=@`Z}cySe` z7R2Ja5Fi0|oR#2pq$=h%Qv0a9u4$0FEK%IdBf8r)a2HDH(nvJNQQfjBijnDi-ZxSB z&r{$3tn$tA`jBPz`-yR%?|-Vxc}{X_W9$K?OS7;cYJ;)(w5?2~uk>@lK=r#KBccX% z>-4_U?_k3!t0E&*x@4MgD}KEktR}cd1;Ga&&j&H$1PB*EHWXLPZG4hd2|y$JO7Y(N z3YAy^YIdKzLC@Tb!zTxX=@QUKq-A#*9k@Qi+Dl{ZnW_b@`80E*&QgNN=$N(i-6_4C zUrdndDbRAw@EbcMwV5#RUJw30bjeHVDVYJ1FC^(u(?FlpK{mDds#qy=z?GA|gk>(7lofY&4OdiedoOL6zW zT`1Zc=~QRd{|*73tI}a$kAi_}4TQK#iXD=AQ2psXe261qReu9vp_tCf!-QDO9p(hDq%-?xXD|xEiBx#dJfX*AX&~ZF( zdRVb^wsbdEOLCt1Ts)%%K0hNe#y%01TQ-k&A+_L)@3hT?OUX@&Sumr`lkwF#zv~+z z?n?14T5>Vgh|A{a=p>=6?nc9|wskB?rFsu>FGxi`3=|QRSUW^lYv_7?V9gQ#P;0rxbWadYE<*`KpwbqWL)68fOB%6la|7 zzhoWrm{)I>)|?bU=dI@=OflrFV4gmsxvlWMDD1K-tF8;!1dfTo8ZMj{k@wi8229>R zyJ4rJKl-BSdYuE*T3uWFh5^fqj+vlzZ(e z7`N@TH}>PG5B46t^}DCVr>)@-8M=8RS2E1uK<&MeL6<(M(;Y`E&!^-F5j6RX`v!7S zik-Tdt?(I6rt_8BNC2mJ8ac+0g6A0tsoju0mV6Qm8%Yc;4Vh{jV9)Qp@Vz0`L_8yt zCx;d5RtZC{`EYWD5YD6miDMbtDpwa20-={)2vpP&~VjMj^cQjqhtD z@ZrDMFvE0z8Nyg}cWV8K&P6jt}8xGqn?zy_clX(^VdKaviu^`5nuP z>6ur6Zl_QJAx~AobPh-9(bJY;mgens2n-PiEf;v}kd0Vy3k)g2v^`??ZX`^YXar_H z3p2)x9X>0P(k~Z$ZAp5!##X&J1Y7cHK@j@d{D{q90;7Xpo7x@wEYm2x9-w+x?k!#J zjs-Mv9oamx-(H8kgCg=eHsM{=|L>pvs_mGIMX6SuYEQ7!{Xow;;|hEcI4md4Za9K(%p}?joOZ5 z->4GqOlV}LIQ!QTIwAodYreH@`Oml1ivjD^EL_Ozw|Rd<{{O3ok@x;` z3%J66>_2fs(={=we|_k&Zs_8Fiwr>pZuTKk#tef4NX#+;~%VSGRPrQ}wNv);VjBkF1+) zVFt$CgF93k<-byD-#Wv;5@*<3}#OzG&h;{WkjS#_BaDToWN>R%~FA>!BK3}*ek6+=@6rc))cyz=jq zlfuNvD9#A~b@XH5`B{~XWJbr!$tN6;83(6%)baBwogtQ2XQ?M53Ae+jP?OdBgD{^ zS581`FkG9^yYfuP5Jkc%5S9$3@LFHH2P}5ZL$Fi|gdn4kgFVFCJj_}4ZF|1NI0)kM 
z5Xh`YKsiuA6+m^%dO@=LC*2temM~vfm#!Rde3yg3QK*I3|2?e*Z(sOI9sM%|P~Aq? zfg@4dI_omsU?cS%Tm>&cov_L0q(TJDbr!%KK|BO1nFOhjZ||bVYn4&uKoyuwEe8N6 zKE!?B3-*aNDT0NdQhk2PX;MriOvMn17;iZw&kbn%Ui>E{TD;>vQVzQDJxD>U7G*Uz z38-}pmoE4KxU9?U5yZ)eh9UQ#by3k^xUtG43k32*2;n>QrW|!b?6&dOXQgbwB|K;< zv4q_mM79AHlm$22Kj?S7x)c%w0rF0`MgWP#1BQMNS1^2rgZwQ|rcUFTcgjJaNP-a5 zGJwQ_(wITkFf}CQS+nl-$Hx)QaJ(RlWhY} zb%+%>3F4&#VYI}GXe=Vbsv(43#Pr0L4dfMU3Gbjkd|6XFLU-?K5bLGOL8*Nte@~lz z_~UN&-Vo>!0{Xq-u*Q4&Q=7uZ3&*DrhWk~1F)_w_T=H{7w2vHVa58dF{HQ{210dUW z*<)M<%K8wlzJUBNQ}Ks)e<##iFAYd<=VWCLouj{I*sfw=sF+~qlaU8aapeneA~^CV zoM~|QKLg(I9&*nJ54AsOgR4C_aW4fMGd8NmNGcbH_|_ zimgBIabp~IVVXKAR4Wf-95pw0qt&L0{d^)kz=cmgTUJq#&auFAy6-(~qeN%H^Sx`U zNK^}oQBE_CjCiq7)03=HBfhaqCEjlTB)1qO@ZT&3hrHH|?k6q#R_Etn5O()1L=)~< zlfCtbV$VpWyBklz0Ges*o;%b~T5l9&c7f&R1#Ht$sJv}WMi&cWcBHaVE!3GGi=AgZ@;`P8V z{p(N=V&RDz6-bdbp-TT5Iia0R3606jf}a?h|NEKhry3s3;b z-H8M@JiED~Kf8t|4m5d&%z^;yPhFKm@!SmgL>Xbb2$9O)W0e}p5<}jY@zxt2ZY^P1 z^=L2?ic?p8*=t0dWbNVHfjmilp>e2gfE4?nKn*jY^=|nu=eh_#D+Y{O2Q!Mi%>>wd zlao57ioE)ddSVLOTwHyrDk$%XYqgNS=AFW#bsbwK8Gd*L{XQ#TJ=jb;hDsDldl<%- zfrHEl!l(kYX$yA!WQER|lXJq$)W``uQ8xSM+ zRQRJ097Wf|i-9mv0ery>_#!M7tSlW@=^}H3l(=xC`ggccq>^)vE!%=bjYujcIjk?v z6M-0-nxP`H%XwG$BfF1&{XmpswtNI2)@$e-0d?is>9YQRN0d8ePgj$N0M9T&f$lCP z;hQ+9XgA1%@X`tm{xruK|q(DVlfkP?|gdN?VOF)@swJ_(g*;I}P(Tr01VC&q23 zpEwR%1IX`$8LG0G25<^8zsl=##`m1W4)Ze_)JCx;*;C}ce_^Z9mb^K(Ox>jY2o5j$ z?6M{9d*FV6)=msWQvRdezlN6CYPxH5`dr-}JT@J!;UT=W#CKQy3Y%-Jy2FcGXgWn+ z`_5#k@&S|5LJ(Le2cDiHM-xsf*fhwUFe0-y;ujFDV>lX##7>pEf#=jAQ|P8U zqt39`NWCHW=zu#7P8A!(+ERTxx@3XZO+15}?>(90u4Y;Q@Yn2GlQACKI>GxFS$RmW z&v$Q|_ou|h*gT0U;Y|{oGU2t#wTDr+MkI@x7bhS#4)1Z*z5r362Ck=-D)bBCKA@T=A%j7oH+ z0hRnNMs*w#*gzC@Ru7l|LR!sp@O&TbJ3W7)(*v+duZLdE{eAvN1Ul6XkeM8JvLzcj z8Sp)?^Z5Ck^kI+6)7!Jr1kp`j^%HBfsNfy7qA?&q_@3zz@t+u;J%hUGD@S~PVh-HE)kQ3sW_rH@_q5M$^ zG@F2XANhd{%pYnNfgAo1?CyKGp$V}*+eZ*+1+ohgT>p&61ziZc0uU*IqbwFQP2FB4 z*Q~@@YzkGs1hzO(%?^(EBs=s=7`Xp*hGF`3pQ?ER7D4CBz>o5%zb%JABbXl^gGvbz z2-I+{+ofyogTYM%9sy39P2m2Ggmm1VJFwOidLKoKKN&N2zjI4eSuQL7yX4nb0MjFW z_9+vd6$OnpA_&KWtKI-D3&G82&zusa%$RL1tLAH>Oi^{Vqc6tz!n{bY1NK+Kq(C5^ z7*mYjYb0y{P^4*F`<~MQhgOgAru$2zYFEDdeJ-v3GOC0r_76GKFEIt>@z0U&pX^anKvRhK*@dP)oAuxSFpOQ+~6r7x19@Cv&N)@8{$_y{d| zMCXN~8K=Z9#+QW*q7kb>ATk8Uk{C-2b^~B@1f-njKV>6`e|9lN*+ljWoLk7fYv4dE zLri3vPm1p%R2l-7PfIW_7M;e})9aaX83^sW?7H{|@cLCx$K~x|5w)2iv5ZHnDKCU1 zkWnb}jUhZmwgG%QL>wgDGoH3j8n&t2v%nF(02anSYMC+25WDxy=_>^l2(mG}eKF|( zyEiIc+w?s0yDeQ1mgYjk69PO9z&KQ`#RBZ}Z~>V^u06$}9z$*dzuMS2a2kI*#5@9P zP=xFkfcoX2?#?2hk~wm$!{Zch>PGcD~@RoP8%* z3aG`oV=1ee^W-x!z0;jRrMia4xZ^ZjlqWyr0MjVA z^3nb|sr7E2?lVa|(;6@Q_x8s4lN>=X3KwTrL3{S-VM-|?(4|53j$LoJPwdCifVj1e zPz4eI4+|rv0t#6<91wqt*&ql^FUH>u$wS*dd3uL;kowaMd2O-{+IP#oIX~yx>en%r-UM;I4ac_8pW%LApRSP9DDtWo`E66M@ zPAlN&*+fzQncjql@5)Gng()?#44!s9`xDUSQimH7RWrTl=7rro^g@4- zD=d|?=a(}qJm$CXMJjA12kfV2o&jm#?$#EIf)h>fYa!m5yHvT#yV4#fquHN(_%` zuw&vysU_==8{lE#nmh{EcP2D4tg59+ZfT7x$5szBARs%-eU`I$ftXXbvba`nUmL&w zVWg;QQyf|)>r1NhGr`Y~JBPcxKM%i>5AOEcoa^}7aZ6@NHg`$K%Ia;q=LZU2lb)l8 z{7jdX%`xg2?~7dX{6(H0T0EI7h=E<_V^h{;vBAm5muV-&#-?gzpB8vuDprNW40HtZlfBHg1)8+_TT6$f$%t(u-ZN{GjgnM*Mqd;Ro{8n33AYuk;ySDOmm_mHhFzAR`yr5c{XeZLycH^Zm9{j|&K0KZuWCl@>+5a@T;0M#weG4Oj;ZM&SU^dJYrd9qldB_h_ z{8c1$o8`}-P=whHiRb@r@jiBetiRVl8rvm{!_NR#TqmoH+}+(PbPB7Cmy$y)ms2E( zbqzz&dgHN7yzpX-T|JO5k0@_5vcRifT3NYJBrCf*h-&e;n3 z`e~x&!VeFKXT>P1?v*|4Iv}nRuL#@jz4{HD!?OU?)P~@3SKi#e=I{T#)$Mgc8jY9> z89DhIpjN^x&zX>fL=yXEEV!&CYyHF~m9_U-$vV--G?!L!3F(O%Mu&@uiHQ^4A>_C* z$Oi{PJLYZfXFV1aUe#HaAA${|m-Hw;oTrjm5JVGgJ%HKyd;#MGk%EFm7+7MBBDsu8 zh`tU8UGAHxvwTgZSf-Y^Jd)4xoN;jZj&?p|rgVQ47k;mL0>cR1{&aoxd}Ro!;!o^1 
zxl$_uMwhI9Vi(>jza9|tZ0qq&yjS4j!a=ZDn*rNb1D9_+5O`J4duk%h*Z0}ilaAe(ixj z;qQW&I@lH>(G1WS9BjRNf>bwcy!=dGUq5^b=MvS0$am(xe)W7G(w;q24c=Ru^l}V& z5_S6H$B$5H&7jkFQDbQr`7d=5G%VGtEK!bb8XXx||hI+nc8 zvP^qaAm!}UFnkX4ALkT9$}os>gQQpZuFw}n`yFd;+lB}AZ@Z-i_)7?o@nf1 zs3Xx{p$s!DVR_v3ndWCyzPrmK%8A{k%?^G1&0dvx8Rys<97giQ-Das~Y+PKb=6V^x z9f8Afy2aVRd1XkTJ7efRbU40Bo13Nh&WVdelLeHeO0a&z@z&?N=cmWv@|Xp&D#KLs z;XZ!Wp;AuS1|;;96pa zHn-Q}v-@|GL-kP>iiR%I*jTVt+7m+OB*k4Os*HFHjl5q$i3HLrf~9fXE+7YY#!Bee zl4+56xPvbz+s?;4`NMaFusYJ_(|GgZiBw->?h)&PNz*IrU?hQIT-ZS^uSMBBtZ>{E z=DGCH(#1l@da*xMEfZ#G@cm-gBoYB19zM^N{YB4R?UZpb8Y2;wRL!`Zsq&j9`=Oa; z$J%nrwtP`%eb{-$ot$4HPOL;lJI-7mGQzqJCynkCQNP?=TVyOb8Sl1km*gcRB#aiy zoI0y1XZ^A}B&1`UH(+UF8A3kjS~Vt=8Q1_9A?-&IT|i0>SETZaExyYM{&`@EJM$$O zX?n$sqFzYN@$D4?x|$9x{j8n`8NG}R-YYu(4|NG#VVZNEj67I&Bz$7%K#AhU^N2TH znqd-$SMA}}ZFbNuTmAC!>h@MR?IMGI?4RyPrvF8;Ft@dJh}^T? zw<_*FN9B^UTu1BP&V(!w5~uOfn|ktx4=#Jx!i8pJ_+cf=QMX3{=46@}(EoV!N58q+ShaJal$iib>Bj}sT7{y#_K?$9( zbcmA#6CL&$y~x$JzV7M9`Z2O@9npF=znw>T)u0b zxjmUYR$YoexJ{tx3!+!;5z28to&{%>S!30D0 z0dXJubG=y{*hHONEY>i|zmMh~Q&>rkLC*Al4(>4>SO))1{({l8Gv)zV10XSh#F)lu zww?}~?AwDIfXBgBNlWOelH=mybUQ5sRs>xADnh-V zSRVC$8Rg~ZlWL$`JgLp8*u&gIu7ZdC4?aN<={%-7(a&Hk!QkVOLy+PBAy|#oymbbk z+Gm_YV4L1HJ3a^R1rZp1P8@yC0;|%p!^_^&a~o{^%a<=jL_}o#HVX<1HylY<`sA*@ z?RxxJ$@vKKAU@7CS9sHp3!}Kf@>dDs^@fHAQPkNjU(Ev2XMze={$`M3Vu`H4F^L zCJbvwvnHm{G1sw6nw$-TNemAUSB3%~Ttu1W?r95a5Ok$CyV7dNSrmWVPj0c$zfMN?B#LvN(7Z&~Z< zrOw%B-EM$hl6^u|v%_cKi!4;daxE`s=&+kW+hTQpNv*dJ#Hv2iv@i2IYbat4>Nl;wCD<;+(|?;XI+h^G{g zgJTjj4$K!&{dpP7R>6$Lg(oWNeGSxlzy$PsMR{d6!x<}fWh#Squt{$JAvu_Kn(YKP4$yok0akts}5z1d#J$UGoqF3hlrkpcjSR{0iJtKbNG-#kOkV~o!rcld4lg9DEII=Rzy!#mI_5l=gKdF_BQ$|(aNh;##r4cm#e zhtDCAeTb$iP-!GkWwxnaBb|L`rKUUu9KNlhP*yhByjc>C_8a-n<$j#s6dp3mVL6wS z5xuzRTVY}G3Ln>rZfqWM0v~nGz_|><806@Hmx92*%g=G}0^PI;jVcp((HaV8e9{93`(lEjD2P0e`Ieo&nWq*OEgejWy&p&7 zSkcrlUv@^&o4Q>+Ys}88PN;vMn@fF%y^~1SG}L^U`#C`)6S#rnIQ;?lL>xzJ^va^> z#Qk>`(>X4v`N`g6AxbeKd@|tz3%Bv2Trkxg8u+3GIMs0%blbQ3lIR0|gE+gJ@x@DI z%%P#1sGY}6_=;~!GmpZP*m!uV-Cnmgt;KaUFlRR{{s@4VqH0$7wj#5Xs#~@+F)@+T zxuAfy0E^-iglkJ!e0Ba|Bg1eiZ0wTIt~6$bAua)dZ2SEzUQ-embgkLFG_>CSpr4P@ zWQnW;Mu#8W@^n*^@MQskv}irs>;nYMg(Srg!3p#2R;Y!T=$TLv`F6?-hi8ya?GIk; z7%o{K3**(JNNEdxhC0(`wSCPqAwalBSE)&%*1)r_l0T;s14}kgfbo(+Mk3&y{&QsA z0<~YX-dc)yk@=t5hIy!NiN3Xu4SpiW4%2e6n|7}75~x^GEio}M6?t@HLiu4c-5)0N zM2LZToSf%bWE2#euiKu7Ge4&v{3@GJkv^cuD=F@d$l7;)+G|BYalGxm~#oFPj`dl!@Y z_t~OxgjUlS`Ec$I<}1wh@YMq{BIqn5QK150W;hJMh*Pu^H1j@|@OfGwhM~^QApHi>87jQ#=!o@>kEa_Mid@?&pVg_Y(`N_rt3EVg zxe%^t_4fX?E0=?IB$e+;JeG`j=OGVMd|%|p+7W4SaOCE2!R~`9f%@Rc(dYVIXDhE( z21hoIg#G5qrWd6&WiZ@@u1MIU6Oh+Jm=Z}I;7w`-YL97?S!9;}oBRdq70SkJl(^jB zW;wACMJHNPVm-~UgC3Z&8O8r$%KinC|GP|?8D2(qHd4(7PVgHLX}j<9v2@P_`wRw} zpGVis?7r91;QROQ3uXN7R8%-RIQVYPNy*6kXgZ~j0j3FCKp~v!?YZU$h%_F894v}{ z8?#2uV>px;=u%=ypeZWB2ouoX*k7L+d}ahz@4_uPD6-`B@{Tr7_AV_n5u*t_5jTH1 z&+|f=uWK7k>F=H}+c#l;05%T6>4{>jWVGB$1v__~Ql$v>PUbQ!3iE3fKc zGuF{R7{@|kf71i5=B)6iQ-o(S@`uh)<39oP(}02pi8MU{Lcyroc;M(TJ}z#NznS$W z*r*g`=<8E?Y#W>z;4HI&z3jqI;qMHATm?i9B6iCd062j|0yTqg%)&K6E%cR+T zA7?PYBXmIDwno-+R^&w2 z@=#5CN=5q~XvHxVMmDzckfUEDu9b^dfr9MQy`DM~%LOH^FCf86K&giAfzON1g5Yij z&W-ON!XKPp+o23tdT^+wa1MMxddH0)Igpyfo?HozE27Go&)&fUZk}f=J^tt>%{*{& zSpt{%19o?JmF-hXlFAR8P!Q@>1!$7QAkQymU|@g=I@7CJ@)P!Ani_OVQg0w zP%TH`H5oZ_grk4cFmL0?d7=mzz=ZC>Ej!g}i8Sx3ng}t_vV4qpV=a1-yPNtFX*uE- zU`WmvN>9FI@pbe_`%XRc4yhmi2(GMnsQ(SX`#`2D&L;@Pu`2pF-dLS9hPN=6BAuQd zNat8uZsN`QfyD8Fb_+ZEk{pH2$;$ZriI~fT`omkSSIenvU|oLNXqyLU->Rf{gPClg zYe~2Se+eSwL2%0QT~6B@SkhMXA3e9ZX5!lz8=3?stn}!tpU}`b1K;&&CLW7J;ASVU z;rAv&r6@f0DX;bE#=#M)uM`k6mZBJ}8zwpW3`?qiBglCL&^Z2VUTTI`6F;SozY!ZJ 
z@y&ma+xh5&uiZzMn{1Yl0k$TM#``V=s0Iec*FMd=MVcvBv--kOpvB@pSFvY^AGtBR8^s@f0Kpg zqKyizqsFhQ?0OOiMDoB$rT^&9G|!UABV>@{Wst40=)a2oLcV}oD&tMLJ0IiazW^92 zy?80#3^?*cwi~8I+aLu?eLOd3(UHz6kA+XQ1-C*@=<}tGY{@T(Tvj3cmQ{JN{3)Z@1G z&9{=z*e*V57gJ}bCfgH{lqlNtKr9uNMcn~hHi zvHvvZ`T8c4>_ZN6chJ)cPEIcwg*^=VapRhk97DF9E7SpPB9-33&aS^yeB*Unax3N%Sw1WY=31;zAN*Az$iCe zbQgRBjOC8S&khpSdB6DlbllPu$aL$3wun3UMy$$00)E` z<_G~_efqO8%xK2qBxqXNn^=!^N2fH9i^y-_yLtWJGojlTDeT`lQepljv`E+FeAH zWB~vvNMh`M4+DBZLzCBN$9dJ{f`~w6xjpO)yerxGPEOqDhZ589LG_l&L`bz?oWcIb0P{zxsh}-nO z&)xNr!NIiWo2suw?>)iokH$&(+5PRQz+(PT)BBu2-wO{P^IpLj2s$b)W!_D%KuRIn z8*MZsHj-jYVqTA*DWs0u)xNK*nBSITdP;~a^Lq<=Pg{XPr2?HT}LOwI2x#4Ot%DGj zDuxmnk(I#uwIM*jyUk=*3B_p!XED7#6}Pk)T4|(Weg|fi@~Q`AavKsgZmcFY@4E*4 zFf6&DA%!Lr6&;-39jZIBwfEr5-F=Cam==601IL7x=?IE>eWU61rLf(IjBR^-n&vv03F?0$4 z2>{c}UT>(ZRNoZV(w{C-KIQBK!BEI}@JOijrh>($TaJ5?=#3QK#1XCD{~{EuyUC?x z)eiio**zDx`G-FN;nvcP{QCK0cV{PwPRa{OorM(k$?A*sT@pL<+bUH405iS$?);(_Q|}cF|s2KtS@&l#pr0c2zLTj}}NZJHKTA{wM_LJU;9L zp<+)&t+9~z8}UAr0mLhc{TuNDKS9;qbNGGt)`M@iw#Qb0^ioSc69h`0q4-5$8)LeI z%oKXAVZ62V4Z_7nmVAuTzi`$|0l;?nnO65V(q#pdxe*7``rjZ2JDFNUEssXZ>$b(q zjThxZa6%?zfmhs}>|FVZSrE-@dGF(SuHevy=vltG1#DhCD%360qSS{5&m z^enx-{4AN_>UgOt>YyACT|fb_I6q z8?ZD4c!%<0n)XMKdzp|FQb`Ymw*@JgEtCRQ7Tv)?N4o98FMfRPMuO&bKH3%7lH5H? zGf6v9u5(%Cy1@L~>&+Dhoxdgb;8J(6qMxdXlU5}0zUn;$A9}kRvKwu+@io)=SnTlE z2Ht#bg6jxjds6oE*k&6|H%C>k&(zV^TEA+?#luT%%RywVa*(_s-sgHTQann<;1}5f zMYH@$p&mi7RuMS(#*w4_ssapl`6C-rUp#{u$4_lU&e2OJF=i5`Dw2aW-lx z2zzzrmC~a#NFcW0B3+7i2bjQddIh(6Y!CPA*Q%B>Gk{31>gqZKv9NiQokpcitIMP& ziONjC>R5^HHO@BGo_3QfS_ZJ_^bs2H-=ID=2jj)l5cS{^8z$VutaS_C;FFM$aKgkz zg0)+)I^P9NBazWcwv@s`K8!**uR+a6NR!*u*$UujhRsjD9*PzhU*km7Gmn8jiu!6J z0L!^JEA;zeXqWnx;Gqs6va=dS-_MbSBk;;4rN7TQG2%2D%C8MuFr2{9m3JVkN!Puw8Uj!9ZYv>;@e9 zq2b|{Ue=)U^#~gUA*?Eko=jwt+?lP1FT{fy_`%Sgg zQlF+sOpIzG+Xu%@Q%eix0K_n_sxJzdMC)bGP}xib$@qsNbeR&zN^~nZOr4LT@Wc^^ z!vD<<>dx!_#J-$=^0&~Xalkvn7v^Z{-^lK7RN!F@kyF!4X2-gx7C5c{#D4$({uk;~ z{I&v^u7MeEv{-C;eQ`Y8r`hg$$qYA{H%A7+Za+~3?+H7LBl4{Lk+ zgE|j3nsg!L(j!AoWyYA_YN-_4xZfa!j#RMD`tITL`U7HVaVb$)OY=A$n5nyepa_Tq zNMM<1NlyQRwUEyl6&LCL{?@J%iryw65I@E|AX$?-_4hY^gu#i7IQh1y4`UNQ{-0m5TCmRg+v7INn0J zAkeG7h4g1k0f-YhDp4oh$S{7Bh6V8SV;h7gk*^y7Ol8MJLV5r8`yr`%Rb{0D3i*{W z6$SC87nZI^hd|0-g*{Dr(4bvgo4_%Uny9a`lv<{Nu!|jDSfNK5rDJjH2T(gZJ3GL9j04Ml?bDP|EzPU>BOR615*#~D?CX~p>p~Q} zosgiiZjQ#?Z)yPAl8Ddc~$ zTJ^mBi=L_LBQ*PFm`S05p~iQDENt2!s|4L^5kYKXX5E5;TnN5lILkSm}NWB53 zixgkS4|&L0BoHk=_00WkSI2{ylC#_6xG3yZUWM?=-F0HXTHj1Ue+egHe` zTaZJ5yl8yrfrJN0bCd=_m^sY@YbxNZGjd$~-X>!pJ=FWbWp@JUG;}x=LPM-E`!`G_D*sx(|)c zUSK-;rwQ~rIE*c?(x;9a-9YB*mJ-%Ky?_T#gRkfGZ;0x5JF{>MyU5?f|FgS^8ZN@& zVRgam*wt&HRG$h)|Net91H@l=-V+PUsY?J}`tx%k7C6i&(Lj_N;&BNUfi=dY(1(R7 zM~qmBtpQ!?V=-;00xQT?5>#TPkKZ5})=QA+4F0i>B?fG)5DDAb@k`6WN%&>lPQ>uf zxG*ACT0NsO;`hM9k8kSG;dALDx_@N7Vd~H-i(V Date: Tue, 16 Feb 2016 17:56:05 -0800 Subject: [PATCH 0005/1644] ARROW-4: This provides an partial C++11 implementation of the Apache Arrow data structures along with a cmake-based build system. The codebase generally follows Google C++ style guide, but more cleaning to be more conforming is needed. It uses googletest for unit testing. Feature-wise, this patch includes: * A small logical data type object model * Immutable array accessor containers for fixed-width primitive and list types * A String array container implemented as a List * Builder classes for the primitive arrays and list types * A simple memory management model using immutable and immutable buffers and C++ RAII idioms * Modest unit test coverage for the above features. 
---
 cpp/.gitignore                            |   21 +
 cpp/CMakeLists.txt                        |  483 ++
 cpp/LICENSE.txt                           |  202 +
 cpp/README.md                             |   48 +
 cpp/build-support/asan_symbolize.py       |  360 ++
 cpp/build-support/bootstrap_toolchain.py  |  114 +
 cpp/build-support/cpplint.py              | 6323 +++++++++++++++++++++
 cpp/build-support/run-test.sh             |  195 +
 cpp/build-support/stacktrace_addr2line.pl |   92 +
 cpp/cmake_modules/CompilerInfo.cmake      |   46 +
 cpp/cmake_modules/FindGPerf.cmake         |   69 +
 cpp/cmake_modules/FindGTest.cmake         |   91 +
 cpp/cmake_modules/FindParquet.cmake       |   80 +
 cpp/cmake_modules/san-config.cmake        |   92 +
 cpp/setup_build_env.sh                    |   12 +
 cpp/src/arrow/CMakeLists.txt              |   33 +
 cpp/src/arrow/api.h                       |   21 +
 cpp/src/arrow/array-test.cc               |   92 +
 cpp/src/arrow/array.cc                    |   44 +
 cpp/src/arrow/array.h                     |   79 +
 cpp/src/arrow/builder.cc                  |   63 +
 cpp/src/arrow/builder.h                   |  101 +
 cpp/src/arrow/field-test.cc               |   38 +
 cpp/src/arrow/field.h                     |   48 +
 cpp/src/arrow/parquet/CMakeLists.txt      |   35 +
 cpp/src/arrow/test-util.h                 |   97 +
 cpp/src/arrow/type.cc                     |   22 +
 cpp/src/arrow/type.h                      |  180 +
 cpp/src/arrow/types/CMakeLists.txt        |   63 +
 cpp/src/arrow/types/binary.h              |   33 +
 cpp/src/arrow/types/boolean.h             |   35 +
 cpp/src/arrow/types/collection.h          |   45 +
 cpp/src/arrow/types/construct.cc          |   88 +
 cpp/src/arrow/types/construct.h           |   32 +
 cpp/src/arrow/types/datetime.h            |   79 +
 cpp/src/arrow/types/decimal.h             |   32 +
 cpp/src/arrow/types/floating.cc           |   22 +
 cpp/src/arrow/types/floating.h            |   43 +
 cpp/src/arrow/types/integer.cc            |   22 +
 cpp/src/arrow/types/integer.h             |   88 +
 cpp/src/arrow/types/json.cc               |   42 +
 cpp/src/arrow/types/json.h                |   38 +
 cpp/src/arrow/types/list-test.cc          |  166 +
 cpp/src/arrow/types/list.cc               |   31 +
 cpp/src/arrow/types/list.h                |  206 +
 cpp/src/arrow/types/null.h                |   34 +
 cpp/src/arrow/types/primitive-test.cc     |  345 ++
 cpp/src/arrow/types/primitive.cc          |   50 +
 cpp/src/arrow/types/primitive.h           |  240 +
 cpp/src/arrow/types/string-test.cc        |  242 +
 cpp/src/arrow/types/string.cc             |   40 +
 cpp/src/arrow/types/string.h              |  181 +
 cpp/src/arrow/types/struct-test.cc        |   61 +
 cpp/src/arrow/types/struct.cc             |   38 +
 cpp/src/arrow/types/struct.h              |   51 +
 cpp/src/arrow/types/test-common.h         |   50 +
 cpp/src/arrow/types/union.cc              |   49 +
 cpp/src/arrow/types/union.h               |   86 +
 cpp/src/arrow/util/CMakeLists.txt         |   81 +
 cpp/src/arrow/util/bit-util-test.cc       |   44 +
 cpp/src/arrow/util/bit-util.cc            |   46 +
 cpp/src/arrow/util/bit-util.h             |   68 +
 cpp/src/arrow/util/buffer-test.cc         |   58 +
 cpp/src/arrow/util/buffer.cc              |   53 +
 cpp/src/arrow/util/buffer.h               |  133 +
 cpp/src/arrow/util/macros.h               |   26 +
 cpp/src/arrow/util/random.h               |  128 +
 cpp/src/arrow/util/status.cc              |   38 +
 cpp/src/arrow/util/status.h               |  152 +
 cpp/src/arrow/util/test_main.cc           |   26 +
 cpp/thirdparty/build_thirdparty.sh        |   62 +
 cpp/thirdparty/download_thirdparty.sh     |   20 +
 cpp/thirdparty/versions.sh                |    3 +
 73 files changed, 12551 insertions(+)
 create mode 100644 cpp/.gitignore
 create mode 100644 cpp/CMakeLists.txt
 create mode 100644 cpp/LICENSE.txt
 create mode 100644 cpp/README.md
 create mode 100755 cpp/build-support/asan_symbolize.py
 create mode 100755 cpp/build-support/bootstrap_toolchain.py
 create mode 100755 cpp/build-support/cpplint.py
 create mode 100755 cpp/build-support/run-test.sh
 create mode 100755 cpp/build-support/stacktrace_addr2line.pl
 create mode 100644 cpp/cmake_modules/CompilerInfo.cmake
 create mode 100644 cpp/cmake_modules/FindGPerf.cmake
 create mode 100644 cpp/cmake_modules/FindGTest.cmake
 create mode 100644 cpp/cmake_modules/FindParquet.cmake
 create mode 100644 cpp/cmake_modules/san-config.cmake
 create mode 100755 cpp/setup_build_env.sh
 create mode 100644 cpp/src/arrow/CMakeLists.txt
 create mode 100644 cpp/src/arrow/api.h
 create mode 100644 cpp/src/arrow/array-test.cc
 create mode 100644 cpp/src/arrow/array.cc
 create mode 100644 cpp/src/arrow/array.h
 create mode 100644 cpp/src/arrow/builder.cc
 create mode 100644 cpp/src/arrow/builder.h
 create mode 100644 cpp/src/arrow/field-test.cc
 create mode 100644 cpp/src/arrow/field.h
 create mode 100644 cpp/src/arrow/parquet/CMakeLists.txt
 create mode 100644 cpp/src/arrow/test-util.h
 create mode 100644 cpp/src/arrow/type.cc
 create mode 100644 cpp/src/arrow/type.h
 create mode 100644 cpp/src/arrow/types/CMakeLists.txt
 create mode 100644 cpp/src/arrow/types/binary.h
 create mode 100644 cpp/src/arrow/types/boolean.h
 create mode 100644 cpp/src/arrow/types/collection.h
 create mode 100644 cpp/src/arrow/types/construct.cc
 create mode 100644 cpp/src/arrow/types/construct.h
 create mode 100644 cpp/src/arrow/types/datetime.h
 create mode 100644 cpp/src/arrow/types/decimal.h
 create mode 100644 cpp/src/arrow/types/floating.cc
 create mode 100644 cpp/src/arrow/types/floating.h
 create mode 100644 cpp/src/arrow/types/integer.cc
 create mode 100644 cpp/src/arrow/types/integer.h
 create mode 100644 cpp/src/arrow/types/json.cc
 create mode 100644 cpp/src/arrow/types/json.h
 create mode 100644 cpp/src/arrow/types/list-test.cc
 create mode 100644 cpp/src/arrow/types/list.cc
 create mode 100644 cpp/src/arrow/types/list.h
 create mode 100644 cpp/src/arrow/types/null.h
 create mode 100644 cpp/src/arrow/types/primitive-test.cc
 create mode 100644 cpp/src/arrow/types/primitive.cc
 create mode 100644 cpp/src/arrow/types/primitive.h
 create mode 100644 cpp/src/arrow/types/string-test.cc
 create mode 100644 cpp/src/arrow/types/string.cc
 create mode 100644 cpp/src/arrow/types/string.h
 create mode 100644 cpp/src/arrow/types/struct-test.cc
 create mode 100644 cpp/src/arrow/types/struct.cc
 create mode 100644 cpp/src/arrow/types/struct.h
 create mode 100644 cpp/src/arrow/types/test-common.h
 create mode 100644 cpp/src/arrow/types/union.cc
 create mode 100644 cpp/src/arrow/types/union.h
 create mode 100644 cpp/src/arrow/util/CMakeLists.txt
 create mode 100644 cpp/src/arrow/util/bit-util-test.cc
 create mode 100644 cpp/src/arrow/util/bit-util.cc
 create mode 100644 cpp/src/arrow/util/bit-util.h
 create mode 100644 cpp/src/arrow/util/buffer-test.cc
 create mode 100644 cpp/src/arrow/util/buffer.cc
 create mode 100644 cpp/src/arrow/util/buffer.h
 create mode 100644 cpp/src/arrow/util/macros.h
 create mode 100644 cpp/src/arrow/util/random.h
 create mode 100644 cpp/src/arrow/util/status.cc
 create mode 100644 cpp/src/arrow/util/status.h
 create mode 100644 cpp/src/arrow/util/test_main.cc
 create mode 100755 cpp/thirdparty/build_thirdparty.sh
 create mode 100755 cpp/thirdparty/download_thirdparty.sh
 create mode 100755 cpp/thirdparty/versions.sh

diff --git a/cpp/.gitignore b/cpp/.gitignore
new file mode 100644
index 0000000000000..ab30247d49378
--- /dev/null
+++ b/cpp/.gitignore
@@ -0,0 +1,21 @@
+thirdparty/
+CMakeFiles/
+CMakeCache.txt
+CTestTestfile.cmake
+Makefile
+cmake_install.cmake
+build/
+Testing/
+
+#########################################
+# Editor temporary/working/backup files #
+.#*
+*\#*\#
+[#]*#
+*~
+*$
+*.bak
+*flymake*
+*.kdev4
+*.log
+*.swp
diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
new file mode 100644
index 0000000000000..90e55dfddbf30
--- /dev/null
+++ b/cpp/CMakeLists.txt
@@ -0,0 +1,483 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+cmake_minimum_required(VERSION 2.7)
+project(arrow)
+
+set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_SOURCE_DIR}/cmake_modules")
+
+include(CMakeParseArguments)
+
+set(BUILD_SUPPORT_DIR "${CMAKE_SOURCE_DIR}/build-support")
+set(THIRDPARTY_DIR "${CMAKE_SOURCE_DIR}/thirdparty")
+
+# Allow "make install" to not depend on all targets.
+#
+# Must be declared in the top-level CMakeLists.txt.
+set(CMAKE_SKIP_INSTALL_ALL_DEPENDENCY true)
+
+# Generate a Clang compile_commands.json "compilation database" file for use
+# with various development tools, such as Vim's YouCompleteMe plugin.
+# See http://clang.llvm.org/docs/JSONCompilationDatabase.html
+if ("$ENV{CMAKE_EXPORT_COMPILE_COMMANDS}" STREQUAL "1")
+  set(CMAKE_EXPORT_COMPILE_COMMANDS 1)
+endif()
+
+# Enable using a custom GCC toolchain to build Arrow
+if (NOT "$ENV{ARROW_GCC_ROOT}" STREQUAL "")
+  set(GCC_ROOT $ENV{ARROW_GCC_ROOT})
+  set(CMAKE_C_COMPILER ${GCC_ROOT}/bin/gcc)
+  set(CMAKE_CXX_COMPILER ${GCC_ROOT}/bin/g++)
+endif()
+
+# ----------------------------------------------------------------------
+# cmake options
+
+# Top level cmake dir
+if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}")
+  option(ARROW_WITH_PARQUET
+    "Build the Parquet adapter and link to libparquet"
+    OFF)
+
+  option(ARROW_BUILD_TESTS
+    "Build the Arrow googletest unit tests"
+    ON)
+endif()
+
+if(NOT ARROW_BUILD_TESTS)
+  set(NO_TESTS 1)
+endif()
+
+
+############################################################
+# Compiler flags
+############################################################
+
+# compiler flags that are common across debug/release builds
+#  - msse4.2: Enable sse4.2 compiler intrinsics.
+#  - Wall: Enable all warnings.
+#  - Wno-sign-compare: suppress warnings for comparison between signed and unsigned
+#    integers
+#  - Wno-deprecated: some of the gutil code includes old things like ext/hash_set, ignore that
+#  - pthread: enable multithreaded malloc
+#  - -D__STDC_FORMAT_MACROS: for PRI* print format macros
+#  - -fno-strict-aliasing
+#    Assume programs do not follow strict aliasing rules.
+#    GCC cannot always verify whether strict aliasing rules are indeed followed due to
+#    fundamental limitations in escape analysis, which can result in subtle bad code generation.
+#    This has a small perf hit but worth it to avoid hard to debug crashes.
+set(CXX_COMMON_FLAGS "-std=c++11 -fno-strict-aliasing -msse3 -Wall -Wno-deprecated -pthread -D__STDC_FORMAT_MACROS")
+
+# compiler flags for different build types (run 'cmake -DCMAKE_BUILD_TYPE=<type> .')
+# For all builds:
+# For CMAKE_BUILD_TYPE=Debug
+#   -ggdb: Enable gdb debugging
+# For CMAKE_BUILD_TYPE=FastDebug
+#   Same as DEBUG, except with some optimizations on.
+# For CMAKE_BUILD_TYPE=Release +# -O3: Enable all compiler optimizations +# -g: Enable symbols for profiler tools (TODO: remove for shipping) +set(CXX_FLAGS_DEBUG "-ggdb") +set(CXX_FLAGS_FASTDEBUG "-ggdb -O1") +set(CXX_FLAGS_RELEASE "-O3 -g -DNDEBUG") + +set(CXX_FLAGS_PROFILE_GEN "${CXX_FLAGS_RELEASE} -fprofile-generate") +set(CXX_FLAGS_PROFILE_BUILD "${CXX_FLAGS_RELEASE} -fprofile-use") + +# if no build build type is specified, default to debug builds +if (NOT CMAKE_BUILD_TYPE) + set(CMAKE_BUILD_TYPE Debug) +endif(NOT CMAKE_BUILD_TYPE) + +string (TOUPPER ${CMAKE_BUILD_TYPE} CMAKE_BUILD_TYPE) + + +# Set compile flags based on the build type. +message("Configured for ${CMAKE_BUILD_TYPE} build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...})") +if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG") + set(CMAKE_CXX_FLAGS ${CXX_FLAGS_DEBUG}) +elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "FASTDEBUG") + set(CMAKE_CXX_FLAGS ${CXX_FLAGS_FASTDEBUG}) +elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "RELEASE") + set(CMAKE_CXX_FLAGS ${CXX_FLAGS_RELEASE}) +elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "PROFILE_GEN") + set(CMAKE_CXX_FLAGS ${CXX_FLAGS_PROFILE_GEN}) +elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "PROFILE_BUILD") + set(CMAKE_CXX_FLAGS ${CXX_FLAGS_PROFILE_BUILD}) +else() + message(FATAL_ERROR "Unknown build type: ${CMAKE_BUILD_TYPE}") +endif () + +# Add common flags +set(CMAKE_CXX_FLAGS "${CXX_COMMON_FLAGS} ${CMAKE_CXX_FLAGS}") + +# Required to avoid static linking errors with dependencies +add_definitions(-fPIC) + +# Determine compiler version +include(CompilerInfo) + +if ("${COMPILER_FAMILY}" STREQUAL "clang") + # Clang helpfully provides a few extensions from C++11 such as the 'override' + # keyword on methods. This doesn't change behavior, and we selectively enable + # it in src/gutil/port.h only on clang. So, we can safely use it, and don't want + # to trigger warnings when we do so. + # set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-c++11-extensions") + + # Using Clang with ccache causes a bunch of spurious warnings that are + # purportedly fixed in the next version of ccache. See the following for details: + # + # http://petereisentraut.blogspot.com/2011/05/ccache-and-clang.html + # http://petereisentraut.blogspot.com/2011/09/ccache-and-clang-part-2.html + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Qunused-arguments") + + # Only hardcode -fcolor-diagnostics if stderr is opened on a terminal. Otherwise + # the color codes show up as noisy artifacts. + # + # This test is imperfect because 'cmake' and 'make' can be run independently + # (with different terminal options), and we're testing during the former. + execute_process(COMMAND test -t 2 RESULT_VARIABLE ARROW_IS_TTY) + if ((${ARROW_IS_TTY} EQUAL 0) AND (NOT ("$ENV{TERM}" STREQUAL "dumb"))) + message("Running in a controlling terminal") + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fcolor-diagnostics") + else() + message("Running without a controlling terminal or in a dumb terminal") + endif() + + # Use libstdc++ and not libc++. The latter lacks support for tr1 in OSX + # and since 10.9 is now the default. + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -stdlib=libstdc++") +endif() + +# Sanity check linking option. +if (NOT ARROW_LINK) + set(ARROW_LINK "d") +elseif(NOT ("auto" MATCHES "^${ARROW_LINK}" OR + "dynamic" MATCHES "^${ARROW_LINK}" OR + "static" MATCHES "^${ARROW_LINK}")) + message(FATAL_ERROR "Unknown value for ARROW_LINK, must be auto|dynamic|static") +else() + # Remove all but the first letter. 
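+  # (i.e. "auto" -> "a", "dynamic" -> "d", "static" -> "s")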
+ string(SUBSTRING "${ARROW_LINK}" 0 1 ARROW_LINK) +endif() + +# ASAN / TSAN / UBSAN +include(san-config) + +# For any C code, use the same flags. +set(CMAKE_C_FLAGS "${CMAKE_CXX_FLAGS}") + +# Code coverage +if ("${ARROW_GENERATE_COVERAGE}") + if("${CMAKE_CXX_COMPILER}" MATCHES ".*clang.*") + # There appears to be some bugs in clang 3.3 which cause code coverage + # to have link errors, not locating the llvm_gcda_* symbols. + # This should be fixed in llvm 3.4 with http://llvm.org/viewvc/llvm-project?view=revision&revision=184666 + message(SEND_ERROR "Cannot currently generate coverage with clang") + endif() + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} --coverage -DCOVERAGE_BUILD") + + # For coverage to work properly, we need to use static linkage. Otherwise, + # __gcov_flush() doesn't properly flush coverage from every module. + # See http://stackoverflow.com/questions/28164543/using-gcov-flush-within-a-library-doesnt-force-the-other-modules-to-yield-gc + if("${ARROW_LINK}" STREQUAL "a") + message("Using static linking for coverage build") + set(ARROW_LINK "s") + elseif("${ARROW_LINK}" STREQUAL "d") + message(SEND_ERROR "Cannot use coverage with dynamic linking") + endif() +endif() + +# If we still don't know what kind of linking to perform, choose based on +# build type (developers like fast builds). +if ("${ARROW_LINK}" STREQUAL "a") + if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG" OR + "${CMAKE_BUILD_TYPE}" STREQUAL "FASTDEBUG") + message("Using dynamic linking for ${CMAKE_BUILD_TYPE} builds") + set(ARROW_LINK "d") + else() + message("Using static linking for ${CMAKE_BUILD_TYPE} builds") + set(ARROW_LINK "s") + endif() +endif() + +# Are we using the gold linker? It doesn't work with dynamic linking as +# weak symbols aren't properly overridden, causing tcmalloc to be omitted. +# Let's flag this as an error in RELEASE builds (we shouldn't release a +# product like this). +# +# See https://sourceware.org/bugzilla/show_bug.cgi?id=16979 for details. +# +# The gold linker is only for ELF binaries, which OSX doesn't use. We can +# just skip. +if (NOT APPLE) + execute_process(COMMAND ${CMAKE_CXX_COMPILER} -Wl,--version OUTPUT_VARIABLE LINKER_OUTPUT) +endif () +if (LINKER_OUTPUT MATCHES "gold") + if ("${ARROW_LINK}" STREQUAL "d" AND + "${CMAKE_BUILD_TYPE}" STREQUAL "RELEASE") + message(SEND_ERROR "Cannot use gold with dynamic linking in a RELEASE build " + "as it would cause tcmalloc symbols to get dropped") + else() + message("Using gold linker") + endif() + set(ARROW_USING_GOLD 1) +else() + message("Using ld linker") +endif() + +# Having set ARROW_LINK due to build type and/or sanitizer, it's now safe to +# act on its value. +if ("${ARROW_LINK}" STREQUAL "d") + set(BUILD_SHARED_LIBS ON) + + # Position independent code is only necessary when producing shared objects. + add_definitions(-fPIC) +endif() + +# set compile output directory +string (TOLOWER ${CMAKE_BUILD_TYPE} BUILD_SUBDIR_NAME) + +# If build in-source, create the latest symlink. If build out-of-source, which is +# preferred, simply output the binaries in the build folder +if (${CMAKE_SOURCE_DIR} STREQUAL ${CMAKE_CURRENT_BINARY_DIR}) + set(BUILD_OUTPUT_ROOT_DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}/build/${BUILD_SUBDIR_NAME}/") + # Link build/latest to the current build directory, to avoid developers + # accidentally running the latest debug build when in fact they're building + # release builds. 
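+  # On Linux the extra "-T" flag makes ln treat build/latest as the link name
+  # itself, so re-running cmake replaces the symlink instead of creating a
+  # nested link inside it.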
+ FILE(MAKE_DIRECTORY ${BUILD_OUTPUT_ROOT_DIRECTORY}) + if (NOT APPLE) + set(MORE_ARGS "-T") + endif() +EXECUTE_PROCESS(COMMAND ln ${MORE_ARGS} -sf ${BUILD_OUTPUT_ROOT_DIRECTORY} + ${CMAKE_CURRENT_BINARY_DIR}/build/latest) +else() + set(BUILD_OUTPUT_ROOT_DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}/${BUILD_SUBDIR_NAME}/") +endif() + +# where to put generated archives (.a files) +set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}") +set(ARCHIVE_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}") + +# where to put generated libraries (.so files) +set(CMAKE_LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}") +set(LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}") + +# where to put generated binaries +set(EXECUTABLE_OUTPUT_PATH "${BUILD_OUTPUT_ROOT_DIRECTORY}") +include_directories(src) + +############################################################ +# Visibility +############################################################ +# For generate_export_header() and add_compiler_export_flags(). +include(GenerateExportHeader) + +############################################################ +# Testing +############################################################ + +# Add a new test case, with or without an executable that should be built. +# +# REL_TEST_NAME is the name of the test. It may be a single component +# (e.g. monotime-test) or contain additional components (e.g. +# net/net_util-test). Either way, the last component must be a globally +# unique name. +# +# Arguments after the test name will be passed to set_tests_properties(). +function(ADD_ARROW_TEST REL_TEST_NAME) + if(NO_TESTS) + return() + endif() + get_filename_component(TEST_NAME ${REL_TEST_NAME} NAME_WE) + + if(EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${REL_TEST_NAME}.cc) + # This test has a corresponding .cc file, set it up as an executable. + set(TEST_PATH "${EXECUTABLE_OUTPUT_PATH}/${TEST_NAME}") + add_executable(${TEST_NAME} "${REL_TEST_NAME}.cc") + target_link_libraries(${TEST_NAME} ${ARROW_TEST_LINK_LIBS}) + else() + # No executable, just invoke the test (probably a script) directly. + set(TEST_PATH ${CMAKE_CURRENT_SOURCE_DIR}/${REL_TEST_NAME}) + endif() + + add_test(${TEST_NAME} + ${BUILD_SUPPORT_DIR}/run-test.sh ${TEST_PATH}) + if(ARGN) + set_tests_properties(${TEST_NAME} PROPERTIES ${ARGN}) + endif() +endfunction() + +# A wrapper for add_dependencies() that is compatible with NO_TESTS. 
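+# (It returns early when NO_TESTS is set, so callers can declare test
+# dependencies unconditionally.)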
+function(ADD_ARROW_TEST_DEPENDENCIES REL_TEST_NAME) + if(NO_TESTS) + return() + endif() + get_filename_component(TEST_NAME ${REL_TEST_NAME} NAME_WE) + + add_dependencies(${TEST_NAME} ${ARGN}) +endfunction() + +enable_testing() + +############################################################ +# Dependencies +############################################################ +function(ADD_THIRDPARTY_LIB LIB_NAME) + set(options) + set(one_value_args SHARED_LIB STATIC_LIB) + set(multi_value_args DEPS) + cmake_parse_arguments(ARG "${options}" "${one_value_args}" "${multi_value_args}" ${ARGN}) + if(ARG_UNPARSED_ARGUMENTS) + message(SEND_ERROR "Error: unrecognized arguments: ${ARG_UNPARSED_ARGUMENTS}") + endif() + + if(("${ARROW_LINK}" STREQUAL "s" AND ARG_STATIC_LIB) OR (NOT ARG_SHARED_LIB)) + if(NOT ARG_STATIC_LIB) + message(FATAL_ERROR "No static or shared library provided for ${LIB_NAME}") + endif() + add_library(${LIB_NAME} STATIC IMPORTED) + set_target_properties(${LIB_NAME} + PROPERTIES IMPORTED_LOCATION "${ARG_STATIC_LIB}") + message("Added static library dependency ${LIB_NAME}: ${ARG_STATIC_LIB}") + else() + add_library(${LIB_NAME} SHARED IMPORTED) + set_target_properties(${LIB_NAME} + PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}") + message("Added shared library dependency ${LIB_NAME}: ${ARG_SHARED_LIB}") + endif() + + if(ARG_DEPS) + set_target_properties(${LIB_NAME} + PROPERTIES IMPORTED_LINK_INTERFACE_LIBRARIES "${ARG_DEPS}") + endif() +endfunction() + +## GTest +if ("$ENV{GTEST_HOME}" STREQUAL "") + set(GTest_HOME ${THIRDPARTY_DIR}/googletest-release-1.7.0) +endif() +find_package(GTest REQUIRED) +include_directories(SYSTEM ${GTEST_INCLUDE_DIR}) +ADD_THIRDPARTY_LIB(gtest + STATIC_LIB ${GTEST_STATIC_LIB}) + +## Google PerfTools +## +## Disabled with TSAN/ASAN as well as with gold+dynamic linking (see comment +## near definition of ARROW_USING_GOLD). 
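+## (The block below ships commented out; enabling it would link tcmalloc and
+## profiler into ARROW_BASE_LIBS and define TCMALLOC_ENABLED.)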
+# find_package(GPerf REQUIRED) +# if (NOT "${ARROW_USE_ASAN}" AND +# NOT "${ARROW_USE_TSAN}" AND +# NOT ("${ARROW_USING_GOLD}" AND "${ARROW_LINK}" STREQUAL "d")) +# ADD_THIRDPARTY_LIB(tcmalloc +# STATIC_LIB "${TCMALLOC_STATIC_LIB}" +# SHARED_LIB "${TCMALLOC_SHARED_LIB}") +# ADD_THIRDPARTY_LIB(profiler +# STATIC_LIB "${PROFILER_STATIC_LIB}" +# SHARED_LIB "${PROFILER_SHARED_LIB}") +# list(APPEND ARROW_BASE_LIBS tcmalloc profiler) +# add_definitions("-DTCMALLOC_ENABLED") +# set(ARROW_TCMALLOC_AVAILABLE 1) +# endif() + +############################################################ +# Linker setup +############################################################ +set(ARROW_MIN_TEST_LIBS arrow arrow_test_main arrow_test_util ${ARROW_BASE_LIBS}) +set(ARROW_TEST_LINK_LIBS ${ARROW_MIN_TEST_LIBS}) + +############################################################ +# "make ctags" target +############################################################ +if (UNIX) + add_custom_target(ctags ctags -R --languages=c++,c) +endif (UNIX) + +############################################################ +# "make etags" target +############################################################ +if (UNIX) + add_custom_target(tags etags --members --declarations + `find ${CMAKE_CURRENT_SOURCE_DIR}/src + -name \\*.cc -or -name \\*.hh -or -name \\*.cpp -or -name \\*.h -or -name \\*.c -or + -name \\*.f`) + add_custom_target(etags DEPENDS tags) +endif (UNIX) + +############################################################ +# "make cscope" target +############################################################ +if (UNIX) + add_custom_target(cscope find ${CMAKE_CURRENT_SOURCE_DIR} + ( -name \\*.cc -or -name \\*.hh -or -name \\*.cpp -or + -name \\*.h -or -name \\*.c -or -name \\*.f ) + -exec echo \"{}\" \; > cscope.files && cscope -q -b VERBATIM) +endif (UNIX) + +############################################################ +# "make lint" target +############################################################ +if (UNIX) + # Full lint + add_custom_target(lint ${BUILD_SUPPORT_DIR}/cpplint.py + --verbose=2 + --linelength=90 + --filter=-whitespace/comments,-readability/todo,-build/header_guard + `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc -or -name \\*.h`) +endif (UNIX) + +#---------------------------------------------------------------------- +# Parquet adapter + +if(ARROW_WITH_PARQUET) + find_package(Parquet REQUIRED) + include_directories(SYSTEM ${PARQUET_INCLUDE_DIR}) + ADD_THIRDPARTY_LIB(parquet + STATIC_LIB ${PARQUET_STATIC_LIB} + SHARED_LIB ${PARQUET_SHARED_LIB}) + + add_subdirectory(src/arrow/parquet) + list(APPEND LINK_LIBS arrow_parquet parquet) +endif() + +############################################################ +# Subdirectories +############################################################ + +add_subdirectory(src/arrow) +add_subdirectory(src/arrow/util) +add_subdirectory(src/arrow/types) + +set(LINK_LIBS + arrow_util + arrow_types) + +set(ARROW_SRCS + src/arrow/array.cc + src/arrow/builder.cc + src/arrow/type.cc +) + +add_library(arrow SHARED + ${ARROW_SRCS} +) +target_link_libraries(arrow ${LINK_LIBS}) +set_target_properties(arrow PROPERTIES LINKER_LANGUAGE CXX) + +install(TARGETS arrow + LIBRARY DESTINATION lib) diff --git a/cpp/LICENSE.txt b/cpp/LICENSE.txt new file mode 100644 index 0000000000000..d645695673349 --- /dev/null +++ b/cpp/LICENSE.txt @@ -0,0 +1,202 @@ + + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. 
Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. 
Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. 
This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
diff --git a/cpp/README.md b/cpp/README.md new file mode 100644 index 0000000000000..378dc4e28de76 --- /dev/null +++ b/cpp/README.md @@ -0,0 +1,48 @@ +# Arrow C++ + +## Setup Build Environment + +Arrow uses CMake as a build configuration system. Currently, it supports in-source and +out-of-source builds with the latter one being preferred. + +Arrow requires a C++11-enabled compiler. On Linux, gcc 4.8 and higher should be +sufficient. + +To build the thirdparty build dependencies, run: + +``` +./thirdparty/download_thirdparty.sh +./thirdparty/build_thirdparty.sh +``` + +You can also run from the root of the C++ tree + +``` +source setup_build_env.sh +``` + +Arrow is configured to use the `thirdparty` directory by default for its build +dependencies. To set up a custom toolchain see below. + +Simple debug build: + + mkdir debug + cd debug + cmake .. + make + ctest + +Simple release build: + + mkdir release + cd release + cmake .. -DCMAKE_BUILD_TYPE=Release + make + ctest + +### Third-party environment variables + +To set up your own specific build toolchain, here are the relevant environment +variables + +* Googletest: `GTEST_HOME` (only required to build the unit tests) diff --git a/cpp/build-support/asan_symbolize.py b/cpp/build-support/asan_symbolize.py new file mode 100755 index 0000000000000..839a1984bd349 --- /dev/null +++ b/cpp/build-support/asan_symbolize.py @@ -0,0 +1,360 @@ +#!/usr/bin/env python +#===- lib/asan/scripts/asan_symbolize.py -----------------------------------===# +# +# The LLVM Compiler Infrastructure +# +# This file is distributed under the University of Illinois Open Source +# License. See LICENSE.TXT for details. +# +#===------------------------------------------------------------------------===# +import bisect +import os +import re +import subprocess +import sys + +llvm_symbolizer = None +symbolizers = {} +filetypes = {} +vmaddrs = {} +DEBUG = False + + +# FIXME: merge the code that calls fix_filename(). +def fix_filename(file_name): + for path_to_cut in sys.argv[1:]: + file_name = re.sub('.*' + path_to_cut, '', file_name) + file_name = re.sub('.*asan_[a-z_]*.cc:[0-9]*', '_asan_rtl_', file_name) + file_name = re.sub('.*crtstuff.c:0', '???:0', file_name) + return file_name + + +class Symbolizer(object): + def __init__(self): + pass + + def symbolize(self, addr, binary, offset): + """Symbolize the given address (pair of binary and offset). + + Overriden in subclasses. + Args: + addr: virtual address of an instruction. + binary: path to executable/shared object containing this instruction. + offset: instruction offset in the @binary. + Returns: + list of strings (one string for each inlined frame) describing + the code locations for this instruction (that is, function name, file + name, line and column numbers). 
+ """ + return None + + +class LLVMSymbolizer(Symbolizer): + def __init__(self, symbolizer_path): + super(LLVMSymbolizer, self).__init__() + self.symbolizer_path = symbolizer_path + self.pipe = self.open_llvm_symbolizer() + + def open_llvm_symbolizer(self): + if not os.path.exists(self.symbolizer_path): + return None + cmd = [self.symbolizer_path, + '--use-symbol-table=true', + '--demangle=false', + '--functions=true', + '--inlining=true'] + if DEBUG: + print ' '.join(cmd) + return subprocess.Popen(cmd, stdin=subprocess.PIPE, + stdout=subprocess.PIPE) + + def symbolize(self, addr, binary, offset): + """Overrides Symbolizer.symbolize.""" + if not self.pipe: + return None + result = [] + try: + symbolizer_input = '%s %s' % (binary, offset) + if DEBUG: + print symbolizer_input + print >> self.pipe.stdin, symbolizer_input + while True: + function_name = self.pipe.stdout.readline().rstrip() + if not function_name: + break + file_name = self.pipe.stdout.readline().rstrip() + file_name = fix_filename(file_name) + if (not function_name.startswith('??') and + not file_name.startswith('??')): + # Append only valid frames. + result.append('%s in %s %s' % (addr, function_name, + file_name)) + except Exception: + result = [] + if not result: + result = None + return result + + +def LLVMSymbolizerFactory(system): + symbolizer_path = os.getenv('LLVM_SYMBOLIZER_PATH') + if not symbolizer_path: + # Assume llvm-symbolizer is in PATH. + symbolizer_path = 'llvm-symbolizer' + return LLVMSymbolizer(symbolizer_path) + + +class Addr2LineSymbolizer(Symbolizer): + def __init__(self, binary): + super(Addr2LineSymbolizer, self).__init__() + self.binary = binary + self.pipe = self.open_addr2line() + + def open_addr2line(self): + cmd = ['addr2line', '-f', '-e', self.binary] + if DEBUG: + print ' '.join(cmd) + return subprocess.Popen(cmd, + stdin=subprocess.PIPE, stdout=subprocess.PIPE) + + def symbolize(self, addr, binary, offset): + """Overrides Symbolizer.symbolize.""" + if self.binary != binary: + return None + try: + print >> self.pipe.stdin, offset + function_name = self.pipe.stdout.readline().rstrip() + file_name = self.pipe.stdout.readline().rstrip() + except Exception: + function_name = '' + file_name = '' + file_name = fix_filename(file_name) + return ['%s in %s %s' % (addr, function_name, file_name)] + + +class DarwinSymbolizer(Symbolizer): + def __init__(self, addr, binary): + super(DarwinSymbolizer, self).__init__() + self.binary = binary + # Guess which arch we're running. 10 = len('0x') + 8 hex digits. 
+ if len(addr) > 10: + self.arch = 'x86_64' + else: + self.arch = 'i386' + self.vmaddr = None + self.pipe = None + + def write_addr_to_pipe(self, offset): + print >> self.pipe.stdin, '0x%x' % int(offset, 16) + + def open_atos(self): + if DEBUG: + print 'atos -o %s -arch %s' % (self.binary, self.arch) + cmdline = ['atos', '-o', self.binary, '-arch', self.arch] + self.pipe = subprocess.Popen(cmdline, + stdin=subprocess.PIPE, + stdout=subprocess.PIPE, + stderr=subprocess.PIPE) + + def symbolize(self, addr, binary, offset): + """Overrides Symbolizer.symbolize.""" + if self.binary != binary: + return None + self.open_atos() + self.write_addr_to_pipe(offset) + self.pipe.stdin.close() + atos_line = self.pipe.stdout.readline().rstrip() + # A well-formed atos response looks like this: + # foo(type1, type2) (in object.name) (filename.cc:80) + match = re.match('^(.*) \(in (.*)\) \((.*:\d*)\)$', atos_line) + if DEBUG: + print 'atos_line: ', atos_line + if match: + function_name = match.group(1) + function_name = re.sub('\(.*?\)', '', function_name) + file_name = fix_filename(match.group(3)) + return ['%s in %s %s' % (addr, function_name, file_name)] + else: + return ['%s in %s' % (addr, atos_line)] + + +# Chain several symbolizers so that if one symbolizer fails, we fall back +# to the next symbolizer in chain. +class ChainSymbolizer(Symbolizer): + def __init__(self, symbolizer_list): + super(ChainSymbolizer, self).__init__() + self.symbolizer_list = symbolizer_list + + def symbolize(self, addr, binary, offset): + """Overrides Symbolizer.symbolize.""" + for symbolizer in self.symbolizer_list: + if symbolizer: + result = symbolizer.symbolize(addr, binary, offset) + if result: + return result + return None + + def append_symbolizer(self, symbolizer): + self.symbolizer_list.append(symbolizer) + + +def BreakpadSymbolizerFactory(binary): + suffix = os.getenv('BREAKPAD_SUFFIX') + if suffix: + filename = binary + suffix + if os.access(filename, os.F_OK): + return BreakpadSymbolizer(filename) + return None + + +def SystemSymbolizerFactory(system, addr, binary): + if system == 'Darwin': + return DarwinSymbolizer(addr, binary) + elif system == 'Linux': + return Addr2LineSymbolizer(binary) + + +class BreakpadSymbolizer(Symbolizer): + def __init__(self, filename): + super(BreakpadSymbolizer, self).__init__() + self.filename = filename + lines = file(filename).readlines() + self.files = [] + self.symbols = {} + self.address_list = [] + self.addresses = {} + # MODULE mac x86_64 A7001116478B33F18FF9BEDE9F615F190 t + fragments = lines[0].rstrip().split() + self.arch = fragments[2] + self.debug_id = fragments[3] + self.binary = ' '.join(fragments[4:]) + self.parse_lines(lines[1:]) + + def parse_lines(self, lines): + cur_function_addr = '' + for line in lines: + fragments = line.split() + if fragments[0] == 'FILE': + assert int(fragments[1]) == len(self.files) + self.files.append(' '.join(fragments[2:])) + elif fragments[0] == 'PUBLIC': + self.symbols[int(fragments[1], 16)] = ' '.join(fragments[3:]) + elif fragments[0] in ['CFI', 'STACK']: + pass + elif fragments[0] == 'FUNC': + cur_function_addr = int(fragments[1], 16) + if not cur_function_addr in self.symbols.keys(): + self.symbols[cur_function_addr] = ' '.join(fragments[4:]) + else: + # Line starting with an address. + addr = int(fragments[0], 16) + self.address_list.append(addr) + # Tuple of symbol address, size, line, file number. 
+ self.addresses[addr] = (cur_function_addr, + int(fragments[1], 16), + int(fragments[2]), + int(fragments[3])) + self.address_list.sort() + + def get_sym_file_line(self, addr): + key = None + if addr in self.addresses.keys(): + key = addr + else: + index = bisect.bisect_left(self.address_list, addr) + if index == 0: + return None + else: + key = self.address_list[index - 1] + sym_id, size, line_no, file_no = self.addresses[key] + symbol = self.symbols[sym_id] + filename = self.files[file_no] + if addr < key + size: + return symbol, filename, line_no + else: + return None + + def symbolize(self, addr, binary, offset): + if self.binary != binary: + return None + res = self.get_sym_file_line(int(offset, 16)) + if res: + function_name, file_name, line_no = res + result = ['%s in %s %s:%d' % ( + addr, function_name, file_name, line_no)] + print result + return result + else: + return None + + +class SymbolizationLoop(object): + def __init__(self, binary_name_filter=None): + # Used by clients who may want to supply a different binary name. + # E.g. in Chrome several binaries may share a single .dSYM. + self.binary_name_filter = binary_name_filter + self.system = os.uname()[0] + if self.system in ['Linux', 'Darwin']: + self.llvm_symbolizer = LLVMSymbolizerFactory(self.system) + else: + raise Exception('Unknown system') + + def symbolize_address(self, addr, binary, offset): + # Use the chain of symbolizers: + # Breakpad symbolizer -> LLVM symbolizer -> addr2line/atos + # (fall back to next symbolizer if the previous one fails). + if not binary in symbolizers: + symbolizers[binary] = ChainSymbolizer( + [BreakpadSymbolizerFactory(binary), self.llvm_symbolizer]) + result = symbolizers[binary].symbolize(addr, binary, offset) + if result is None: + # Initialize system symbolizer only if other symbolizers failed. + symbolizers[binary].append_symbolizer( + SystemSymbolizerFactory(self.system, addr, binary)) + result = symbolizers[binary].symbolize(addr, binary, offset) + # The system symbolizer must produce some result. + assert result + return result + + def print_symbolized_lines(self, symbolized_lines): + if not symbolized_lines: + print self.current_line + else: + for symbolized_frame in symbolized_lines: + print ' #' + str(self.frame_no) + ' ' + symbolized_frame.rstrip() + self.frame_no += 1 + + def process_stdin(self): + self.frame_no = 0 + sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0) + + while True: + line = sys.stdin.readline() + if not line: break + self.current_line = line.rstrip() + #0 0x7f6e35cf2e45 (/blah/foo.so+0x11fe45) + stack_trace_line_format = ( + '^( *#([0-9]+) *)(0x[0-9a-f]+) *\((.*)\+(0x[0-9a-f]+)\)') + match = re.match(stack_trace_line_format, line) + if not match: + print self.current_line + continue + if DEBUG: + print line + _, frameno_str, addr, binary, offset = match.groups() + if frameno_str == '0': + # Assume that frame #0 is the first frame of new stack trace. 
+ self.frame_no = 0 + original_binary = binary + if self.binary_name_filter: + binary = self.binary_name_filter(binary) + symbolized_line = self.symbolize_address(addr, binary, offset) + if not symbolized_line: + if original_binary != binary: + symbolized_line = self.symbolize_address(addr, binary, offset) + self.print_symbolized_lines(symbolized_line) + + +if __name__ == '__main__': + loop = SymbolizationLoop() + loop.process_stdin() diff --git a/cpp/build-support/bootstrap_toolchain.py b/cpp/build-support/bootstrap_toolchain.py new file mode 100755 index 0000000000000..128be78bbacc9 --- /dev/null +++ b/cpp/build-support/bootstrap_toolchain.py @@ -0,0 +1,114 @@ +#!/usr/bin/env python +# Copyright (c) 2015, Cloudera, inc. +# Confidential Cloudera Information: Covered by NDA. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Bootstrapping the native toolchain with prebuilt binaries +# +# The purpose of this script is to download prebuilt artifacts of the native toolchain to +# satisfy the third-party dependencies. The script checks for the presence of +# NATIVE_TOOLCHAIN. NATIVE_TOOLCHAIN indicates the location where the prebuilt artifacts +# should be extracted to. +# +# The script is called as follows without any additional parameters: +# +# python bootstrap_toolchain.py +import sh +import os +import sys +import re + +HOST = "https://native-toolchain.s3.amazonaws.com/build" + +OS_MAPPING = { + "centos6" : "ec2-package-centos-6", + "centos5" : "ec2-package-centos-5", + "centos7" : "ec2-package-centos-7", + "debian6" : "ec2-package-debian-6", + "debian7" : "ec2-package-debian-7", + "suselinux11": "ec2-package-sles-11", + "ubuntu12.04" : "ec2-package-ubuntu-12-04", + "ubuntu14.04" : "ec2-package-ubuntu-14-04" +} + +def get_release_label(): + """Gets the right package label from the OS version""" + release = "".join(map(lambda x: x.lower(), sh.lsb_release("-irs").split())) + for k, v in OS_MAPPING.iteritems(): + if re.search(k, release): + return v + + print("Pre-built toolchain archives not available for your platform.") + print("Clone and build native toolchain from source using this repository:") + print(" https://github.com/cloudera/native-toolchain") + raise Exception("Could not find package label for OS version: {0}.".format(release)) + +def download_package(destination, product, version, compiler): + label = get_release_label() + file_name = "{0}-{1}-{2}-{3}.tar.gz".format(product, version, compiler, label) + url_path="/{0}/{1}-{2}/{0}-{1}-{2}-{3}.tar.gz".format(product, version, compiler, label) + download_path = HOST + url_path + + print "URL {0}".format(download_path) + print "Downloading {0} to {1}".format(file_name, destination) + # --no-clobber avoids downloading the file if a file with the name already exists + sh.wget(download_path, directory_prefix=destination, no_clobber=True) + print "Extracting {0}".format(file_name) + sh.tar(z=True, x=True, f=os.path.join(destination, file_name), directory=destination) + sh.rm(os.path.join(destination, file_name)) + + +def bootstrap(packages): 
+ """Validates the presence of $NATIVE_TOOLCHAIN in the environment. By checking + $NATIVE_TOOLCHAIN is present, we assume that {LIB}_VERSION will be present as well. Will + create the directory specified by $NATIVE_TOOLCHAIN if it does not yet exist. Each of + the packages specified in `packages` is downloaded and extracted into $NATIVE_TOOLCHAIN. + """ + # Create the destination directory if necessary + destination = os.getenv("NATIVE_TOOLCHAIN") + if not destination: + print("Build environment not set up correctly, make sure " + "$NATIVE_TOOLCHAIN is present.") + sys.exit(1) + + if not os.path.exists(destination): + os.makedirs(destination) + + # Detect the compiler + if "SYSTEM_GCC" in os.environ: + compiler = "gcc-system" + else: + compiler = "gcc-{0}".format(os.environ["GCC_VERSION"]) + + for p in packages: + pkg_name, pkg_version = unpack_name_and_version(p) + download_package(destination, pkg_name, pkg_version, compiler) + +def unpack_name_and_version(package): + """A package definition is either a string where the version is fetched from the + environment or a tuple where the package name and the package version are fully + specified. + """ + if isinstance(package, basestring): + env_var = "{0}_VERSION".format(package).replace("-", "_").upper() + try: + return package, os.environ[env_var] + except KeyError: + raise Exception("Could not find version for {0} in environment var {1}".format( + package, env_var)) + return package[0], package[1] + +if __name__ == "__main__": + packages = [("gcc","4.9.2"), ("gflags", "2.0"), ("glog", "0.3.3-p1"), + ("gperftools", "2.3"), ("libunwind", "1.1"), ("googletest", "20151222")] + bootstrap(packages) diff --git a/cpp/build-support/cpplint.py b/cpp/build-support/cpplint.py new file mode 100755 index 0000000000000..ccc25d4c56b1a --- /dev/null +++ b/cpp/build-support/cpplint.py @@ -0,0 +1,6323 @@ +#!/usr/bin/env python +# +# Copyright (c) 2009 Google Inc. All rights reserved. +# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions are +# met: +# +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# * Redistributions in binary form must reproduce the above +# copyright notice, this list of conditions and the following disclaimer +# in the documentation and/or other materials provided with the +# distribution. +# * Neither the name of Google Inc. nor the names of its +# contributors may be used to endorse or promote products derived from +# this software without specific prior written permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS +# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT +# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR +# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT +# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, +# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT +# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY +# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +"""Does google-lint on c++ files. 
+ +The goal of this script is to identify places in the code that *may* +be in non-compliance with google style. It does not attempt to fix +up these problems -- the point is to educate. It does also not +attempt to find all problems, or to ensure that everything it does +find is legitimately a problem. + +In particular, we can get very confused by /* and // inside strings! +We do a small hack, which is to ignore //'s with "'s after them on the +same line, but it is far from perfect (in either direction). +""" + +import codecs +import copy +import getopt +import math # for log +import os +import re +import sre_compile +import string +import sys +import unicodedata + + +_USAGE = """ +Syntax: cpplint.py [--verbose=#] [--output=vs7] [--filter=-x,+y,...] + [--counting=total|toplevel|detailed] [--root=subdir] + [--linelength=digits] + [file] ... + + The style guidelines this tries to follow are those in + http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml + + Every problem is given a confidence score from 1-5, with 5 meaning we are + certain of the problem, and 1 meaning it could be a legitimate construct. + This will miss some errors, and is not a substitute for a code review. + + To suppress false-positive errors of a certain category, add a + 'NOLINT(category)' comment to the line. NOLINT or NOLINT(*) + suppresses errors of all categories on that line. + + The files passed in will be linted; at least one file must be provided. + Default linted extensions are .cc, .cpp, .cu, .cuh and .h. Change the + extensions with the --extensions flag. + + Flags: + + output=vs7 + By default, the output is formatted to ease emacs parsing. Visual Studio + compatible output (vs7) may also be used. Other formats are unsupported. + + verbose=# + Specify a number 0-5 to restrict errors to certain verbosity levels. + + filter=-x,+y,... + Specify a comma-separated list of category-filters to apply: only + error messages whose category names pass the filters will be printed. + (Category names are printed with the message and look like + "[whitespace/indent]".) Filters are evaluated left to right. + "-FOO" and "FOO" means "do not print categories that start with FOO". + "+FOO" means "do print categories that start with FOO". + + Examples: --filter=-whitespace,+whitespace/braces + --filter=whitespace,runtime/printf,+runtime/printf_format + --filter=-,+build/include_what_you_use + + To see a list of all the categories used in cpplint, pass no arg: + --filter= + + counting=total|toplevel|detailed + The total number of errors found is always printed. If + 'toplevel' is provided, then the count of errors in each of + the top-level categories like 'build' and 'whitespace' will + also be printed. If 'detailed' is provided, then a count + is provided for each category like 'build/class'. + + root=subdir + The root directory used for deriving header guard CPP variable. + By default, the header guard CPP variable is calculated as the relative + path to the directory that contains .git, .hg, or .svn. When this flag + is specified, the relative path is calculated from the specified + directory. If the specified directory does not exist, this flag is + ignored. + + Examples: + Assuming that src/.git exists, the header guard CPP variables for + src/chrome/browser/ui/browser.h are: + + No flag => CHROME_BROWSER_UI_BROWSER_H_ + --root=chrome => BROWSER_UI_BROWSER_H_ + --root=chrome/browser => UI_BROWSER_H_ + + linelength=digits + This is the allowed line length for the project. The default value is + 80 characters. 
+ + Examples: + --linelength=120 + + extensions=extension,extension,... + The allowed file extensions that cpplint will check + + Examples: + --extensions=hpp,cpp + + cpplint.py supports per-directory configurations specified in CPPLINT.cfg + files. CPPLINT.cfg file can contain a number of key=value pairs. + Currently the following options are supported: + + set noparent + filter=+filter1,-filter2,... + exclude_files=regex + linelength=80 + + "set noparent" option prevents cpplint from traversing directory tree + upwards looking for more .cfg files in parent directories. This option + is usually placed in the top-level project directory. + + The "filter" option is similar in function to --filter flag. It specifies + message filters in addition to the |_DEFAULT_FILTERS| and those specified + through --filter command-line flag. + + "exclude_files" allows to specify a regular expression to be matched against + a file name. If the expression matches, the file is skipped and not run + through liner. + + "linelength" allows to specify the allowed line length for the project. + + CPPLINT.cfg has an effect on files in the same directory and all + sub-directories, unless overridden by a nested configuration file. + + Example file: + filter=-build/include_order,+build/include_alpha + exclude_files=.*\.cc + + The above example disables build/include_order warning and enables + build/include_alpha as well as excludes all .cc from being + processed by linter, in the current directory (where the .cfg + file is located) and all sub-directories. +""" + +# We categorize each error message we print. Here are the categories. +# We want an explicit list so we can list them all in cpplint --filter=. +# If you add a new error message with a new category, add it to the list +# here! cpplint_unittest.py should tell you if you forget to do this. 
+_ERROR_CATEGORIES = [ + 'build/class', + 'build/c++11', + 'build/deprecated', + 'build/endif_comment', + 'build/explicit_make_pair', + 'build/forward_decl', + 'build/header_guard', + 'build/include', + 'build/include_alpha', + 'build/include_order', + 'build/include_what_you_use', + 'build/namespaces', + 'build/printf_format', + 'build/storage_class', + 'legal/copyright', + 'readability/alt_tokens', + 'readability/braces', + 'readability/casting', + 'readability/check', + 'readability/constructors', + 'readability/fn_size', + 'readability/function', + 'readability/inheritance', + 'readability/multiline_comment', + 'readability/multiline_string', + 'readability/namespace', + 'readability/nolint', + 'readability/nul', + 'readability/strings', + 'readability/todo', + 'readability/utf8', + 'runtime/arrays', + 'runtime/casting', + 'runtime/explicit', + 'runtime/int', + 'runtime/init', + 'runtime/invalid_increment', + 'runtime/member_string_references', + 'runtime/memset', + 'runtime/indentation_namespace', + 'runtime/operator', + 'runtime/printf', + 'runtime/printf_format', + 'runtime/references', + 'runtime/string', + 'runtime/threadsafe_fn', + 'runtime/vlog', + 'whitespace/blank_line', + 'whitespace/braces', + 'whitespace/comma', + 'whitespace/comments', + 'whitespace/empty_conditional_body', + 'whitespace/empty_loop_body', + 'whitespace/end_of_line', + 'whitespace/ending_newline', + 'whitespace/forcolon', + 'whitespace/indent', + 'whitespace/line_length', + 'whitespace/newline', + 'whitespace/operators', + 'whitespace/parens', + 'whitespace/semicolon', + 'whitespace/tab', + 'whitespace/todo', + ] + +# These error categories are no longer enforced by cpplint, but for backwards- +# compatibility they may still appear in NOLINT comments. +_LEGACY_ERROR_CATEGORIES = [ + 'readability/streams', + ] + +# The default state of the category filter. This is overridden by the --filter= +# flag. By default all errors are on, so only add here categories that should be +# off by default (i.e., categories that must be enabled by the --filter= flags). +# All entries here should start with a '-' or '+', as in the --filter= flag. +_DEFAULT_FILTERS = ['-build/include_alpha'] + +# We used to check for high-bit characters, but after much discussion we +# decided those were OK, as long as they were in UTF-8 and didn't represent +# hard-coded international strings, which belong in a separate i18n file. 
+ +# C++ headers +_CPP_HEADERS = frozenset([ + # Legacy + 'algobase.h', + 'algo.h', + 'alloc.h', + 'builtinbuf.h', + 'bvector.h', + 'complex.h', + 'defalloc.h', + 'deque.h', + 'editbuf.h', + 'fstream.h', + 'function.h', + 'hash_map', + 'hash_map.h', + 'hash_set', + 'hash_set.h', + 'hashtable.h', + 'heap.h', + 'indstream.h', + 'iomanip.h', + 'iostream.h', + 'istream.h', + 'iterator.h', + 'list.h', + 'map.h', + 'multimap.h', + 'multiset.h', + 'ostream.h', + 'pair.h', + 'parsestream.h', + 'pfstream.h', + 'procbuf.h', + 'pthread_alloc', + 'pthread_alloc.h', + 'rope', + 'rope.h', + 'ropeimpl.h', + 'set.h', + 'slist', + 'slist.h', + 'stack.h', + 'stdiostream.h', + 'stl_alloc.h', + 'stl_relops.h', + 'streambuf.h', + 'stream.h', + 'strfile.h', + 'strstream.h', + 'tempbuf.h', + 'tree.h', + 'type_traits.h', + 'vector.h', + # 17.6.1.2 C++ library headers + 'algorithm', + 'array', + 'atomic', + 'bitset', + 'chrono', + 'codecvt', + 'complex', + 'condition_variable', + 'deque', + 'exception', + 'forward_list', + 'fstream', + 'functional', + 'future', + 'initializer_list', + 'iomanip', + 'ios', + 'iosfwd', + 'iostream', + 'istream', + 'iterator', + 'limits', + 'list', + 'locale', + 'map', + 'memory', + 'mutex', + 'new', + 'numeric', + 'ostream', + 'queue', + 'random', + 'ratio', + 'regex', + 'set', + 'sstream', + 'stack', + 'stdexcept', + 'streambuf', + 'string', + 'strstream', + 'system_error', + 'thread', + 'tuple', + 'typeindex', + 'typeinfo', + 'type_traits', + 'unordered_map', + 'unordered_set', + 'utility', + 'valarray', + 'vector', + # 17.6.1.2 C++ headers for C library facilities + 'cassert', + 'ccomplex', + 'cctype', + 'cerrno', + 'cfenv', + 'cfloat', + 'cinttypes', + 'ciso646', + 'climits', + 'clocale', + 'cmath', + 'csetjmp', + 'csignal', + 'cstdalign', + 'cstdarg', + 'cstdbool', + 'cstddef', + 'cstdint', + 'cstdio', + 'cstdlib', + 'cstring', + 'ctgmath', + 'ctime', + 'cuchar', + 'cwchar', + 'cwctype', + ]) + + +# These headers are excluded from [build/include] and [build/include_order] +# checks: +# - Anything not following google file name conventions (containing an +# uppercase character, such as Python.h or nsStringAPI.h, for example). +# - Lua headers. +_THIRD_PARTY_HEADERS_PATTERN = re.compile( + r'^(?:[^/]*[A-Z][^/]*\.h|lua\.h|lauxlib\.h|lualib\.h)$') + + +# Assertion macros. These are defined in base/logging.h and +# testing/base/gunit.h. Note that the _M versions need to come first +# for substring matching to work. 
+_CHECK_MACROS = [ + 'DCHECK', 'CHECK', + 'EXPECT_TRUE_M', 'EXPECT_TRUE', + 'ASSERT_TRUE_M', 'ASSERT_TRUE', + 'EXPECT_FALSE_M', 'EXPECT_FALSE', + 'ASSERT_FALSE_M', 'ASSERT_FALSE', + ] + +# Replacement macros for CHECK/DCHECK/EXPECT_TRUE/EXPECT_FALSE +_CHECK_REPLACEMENT = dict([(m, {}) for m in _CHECK_MACROS]) + +for op, replacement in [('==', 'EQ'), ('!=', 'NE'), + ('>=', 'GE'), ('>', 'GT'), + ('<=', 'LE'), ('<', 'LT')]: + _CHECK_REPLACEMENT['DCHECK'][op] = 'DCHECK_%s' % replacement + _CHECK_REPLACEMENT['CHECK'][op] = 'CHECK_%s' % replacement + _CHECK_REPLACEMENT['EXPECT_TRUE'][op] = 'EXPECT_%s' % replacement + _CHECK_REPLACEMENT['ASSERT_TRUE'][op] = 'ASSERT_%s' % replacement + _CHECK_REPLACEMENT['EXPECT_TRUE_M'][op] = 'EXPECT_%s_M' % replacement + _CHECK_REPLACEMENT['ASSERT_TRUE_M'][op] = 'ASSERT_%s_M' % replacement + +for op, inv_replacement in [('==', 'NE'), ('!=', 'EQ'), + ('>=', 'LT'), ('>', 'LE'), + ('<=', 'GT'), ('<', 'GE')]: + _CHECK_REPLACEMENT['EXPECT_FALSE'][op] = 'EXPECT_%s' % inv_replacement + _CHECK_REPLACEMENT['ASSERT_FALSE'][op] = 'ASSERT_%s' % inv_replacement + _CHECK_REPLACEMENT['EXPECT_FALSE_M'][op] = 'EXPECT_%s_M' % inv_replacement + _CHECK_REPLACEMENT['ASSERT_FALSE_M'][op] = 'ASSERT_%s_M' % inv_replacement + +# Alternative tokens and their replacements. For full list, see section 2.5 +# Alternative tokens [lex.digraph] in the C++ standard. +# +# Digraphs (such as '%:') are not included here since it's a mess to +# match those on a word boundary. +_ALT_TOKEN_REPLACEMENT = { + 'and': '&&', + 'bitor': '|', + 'or': '||', + 'xor': '^', + 'compl': '~', + 'bitand': '&', + 'and_eq': '&=', + 'or_eq': '|=', + 'xor_eq': '^=', + 'not': '!', + 'not_eq': '!=' + } + +# Compile regular expression that matches all the above keywords. The "[ =()]" +# bit is meant to avoid matching these keywords outside of boolean expressions. +# +# False positives include C-style multi-line comments and multi-line strings +# but those have always been troublesome for cpplint. +_ALT_TOKEN_REPLACEMENT_PATTERN = re.compile( + r'[ =()](' + ('|'.join(_ALT_TOKEN_REPLACEMENT.keys())) + r')(?=[ (]|$)') + + +# These constants define types of headers for use with +# _IncludeState.CheckNextIncludeOrder(). +_C_SYS_HEADER = 1 +_CPP_SYS_HEADER = 2 +_LIKELY_MY_HEADER = 3 +_POSSIBLE_MY_HEADER = 4 +_OTHER_HEADER = 5 + +# These constants define the current inline assembly state +_NO_ASM = 0 # Outside of inline assembly block +_INSIDE_ASM = 1 # Inside inline assembly block +_END_ASM = 2 # Last line of inline assembly block +_BLOCK_ASM = 3 # The whole block is an inline assembly block + +# Match start of assembly blocks +_MATCH_ASM = re.compile(r'^\s*(?:asm|_asm|__asm|__asm__)' + r'(?:\s+(volatile|__volatile__))?' + r'\s*[{(]') + + +_regexp_compile_cache = {} + +# {str, set(int)}: a map from error categories to sets of linenumbers +# on which those errors are expected and should be suppressed. +_error_suppressions = {} + +# The root directory used for deriving header guard CPP variable. +# This is set by --root flag. +_root = None + +# The allowed line length of files. +# This is set by --linelength flag. +_line_length = 80 + +# The allowed extensions for file names +# This is set by --extensions flag. +_valid_extensions = set(['cc', 'h', 'cpp', 'cu', 'cuh']) + +def ParseNolintSuppressions(filename, raw_line, linenum, error): + """Updates the global list of error-suppressions. + + Parses any NOLINT comments on the current line, updating the global + error_suppressions store. 
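+  For example, '// NOLINT(runtime/int)' suppresses runtime/int errors on
+  the current line, a bare '// NOLINT' or '// NOLINT(*)' suppresses every
+  category on it, and the NOLINTNEXTLINE variants apply the suppression to
+  the following line instead.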
Reports an error if the NOLINT comment + was malformed. + + Args: + filename: str, the name of the input file. + raw_line: str, the line of input text, with comments. + linenum: int, the number of the current line. + error: function, an error handler. + """ + matched = Search(r'\bNOLINT(NEXTLINE)?\b(\([^)]+\))?', raw_line) + if matched: + if matched.group(1): + suppressed_line = linenum + 1 + else: + suppressed_line = linenum + category = matched.group(2) + if category in (None, '(*)'): # => "suppress all" + _error_suppressions.setdefault(None, set()).add(suppressed_line) + else: + if category.startswith('(') and category.endswith(')'): + category = category[1:-1] + if category in _ERROR_CATEGORIES: + _error_suppressions.setdefault(category, set()).add(suppressed_line) + elif category not in _LEGACY_ERROR_CATEGORIES: + error(filename, linenum, 'readability/nolint', 5, + 'Unknown NOLINT error category: %s' % category) + + +def ResetNolintSuppressions(): + """Resets the set of NOLINT suppressions to empty.""" + _error_suppressions.clear() + + +def IsErrorSuppressedByNolint(category, linenum): + """Returns true if the specified error category is suppressed on this line. + + Consults the global error_suppressions map populated by + ParseNolintSuppressions/ResetNolintSuppressions. + + Args: + category: str, the category of the error. + linenum: int, the current line number. + Returns: + bool, True iff the error should be suppressed due to a NOLINT comment. + """ + return (linenum in _error_suppressions.get(category, set()) or + linenum in _error_suppressions.get(None, set())) + + +def Match(pattern, s): + """Matches the string with the pattern, caching the compiled regexp.""" + # The regexp compilation caching is inlined in both Match and Search for + # performance reasons; factoring it out into a separate function turns out + # to be noticeably expensive. + if pattern not in _regexp_compile_cache: + _regexp_compile_cache[pattern] = sre_compile.compile(pattern) + return _regexp_compile_cache[pattern].match(s) + + +def ReplaceAll(pattern, rep, s): + """Replaces instances of pattern in a string with a replacement. + + The compiled regex is kept in a cache shared by Match and Search. + + Args: + pattern: regex pattern + rep: replacement text + s: search string + + Returns: + string with replacements made (or original string if no replacements) + """ + if pattern not in _regexp_compile_cache: + _regexp_compile_cache[pattern] = sre_compile.compile(pattern) + return _regexp_compile_cache[pattern].sub(rep, s) + + +def Search(pattern, s): + """Searches the string for the pattern, caching the compiled regexp.""" + if pattern not in _regexp_compile_cache: + _regexp_compile_cache[pattern] = sre_compile.compile(pattern) + return _regexp_compile_cache[pattern].search(s) + + +class _IncludeState(object): + """Tracks line numbers for includes, and the order in which includes appear. + + include_list contains list of lists of (header, line number) pairs. + It's a lists of lists rather than just one flat list to make it + easier to update across preprocessor boundaries. + + Call CheckNextIncludeOrder() once for each header in the file, passing + in the type constants defined above. Calls in an illegal order will + raise an _IncludeError with an appropriate error message. + + """ + # self._section will move monotonically through this set. If it ever + # needs to move backwards, CheckNextIncludeOrder will raise an error. 
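+  # For instance, an include sequence of: the file's own header, C system
+  # headers, C++ system headers, then other headers walks these sections
+  # strictly forward and is accepted.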
+ _INITIAL_SECTION = 0 + _MY_H_SECTION = 1 + _C_SECTION = 2 + _CPP_SECTION = 3 + _OTHER_H_SECTION = 4 + + _TYPE_NAMES = { + _C_SYS_HEADER: 'C system header', + _CPP_SYS_HEADER: 'C++ system header', + _LIKELY_MY_HEADER: 'header this file implements', + _POSSIBLE_MY_HEADER: 'header this file may implement', + _OTHER_HEADER: 'other header', + } + _SECTION_NAMES = { + _INITIAL_SECTION: "... nothing. (This can't be an error.)", + _MY_H_SECTION: 'a header this file implements', + _C_SECTION: 'C system header', + _CPP_SECTION: 'C++ system header', + _OTHER_H_SECTION: 'other header', + } + + def __init__(self): + self.include_list = [[]] + self.ResetSection('') + + def FindHeader(self, header): + """Check if a header has already been included. + + Args: + header: header to check. + Returns: + Line number of previous occurrence, or -1 if the header has not + been seen before. + """ + for section_list in self.include_list: + for f in section_list: + if f[0] == header: + return f[1] + return -1 + + def ResetSection(self, directive): + """Reset section checking for preprocessor directive. + + Args: + directive: preprocessor directive (e.g. "if", "else"). + """ + # The name of the current section. + self._section = self._INITIAL_SECTION + # The path of last found header. + self._last_header = '' + + # Update list of includes. Note that we never pop from the + # include list. + if directive in ('if', 'ifdef', 'ifndef'): + self.include_list.append([]) + elif directive in ('else', 'elif'): + self.include_list[-1] = [] + + def SetLastHeader(self, header_path): + self._last_header = header_path + + def CanonicalizeAlphabeticalOrder(self, header_path): + """Returns a path canonicalized for alphabetical comparison. + + - replaces "-" with "_" so they both cmp the same. + - removes '-inl' since we don't require them to be after the main header. + - lowercase everything, just in case. + + Args: + header_path: Path to be canonicalized. + + Returns: + Canonicalized path. + """ + return header_path.replace('-inl.h', '.h').replace('-', '_').lower() + + def IsInAlphabeticalOrder(self, clean_lines, linenum, header_path): + """Check if a header is in alphabetical order with the previous header. + + Args: + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + header_path: Canonicalized header to be checked. + + Returns: + Returns true if the header is in alphabetical order. + """ + # If previous section is different from current section, _last_header will + # be reset to empty string, so it's always less than current header. + # + # If previous line was a blank line, assume that the headers are + # intentionally sorted the way they are. + if (self._last_header > header_path and + Match(r'^\s*#\s*include\b', clean_lines.elided[linenum - 1])): + return False + return True + + def CheckNextIncludeOrder(self, header_type): + """Returns a non-empty error message if the next header is out of order. + + This function also updates the internal state to be ready to check + the next include. + + Args: + header_type: One of the _XXX_HEADER constants defined above. + + Returns: + The empty string if the header is in the right order, or an + error message describing what's wrong. 
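+      For example, a C system header seen after headers from the 'other
+      header' section yields "Found C system header after other header".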
+ + """ + error_message = ('Found %s after %s' % + (self._TYPE_NAMES[header_type], + self._SECTION_NAMES[self._section])) + + last_section = self._section + + if header_type == _C_SYS_HEADER: + if self._section <= self._C_SECTION: + self._section = self._C_SECTION + else: + self._last_header = '' + return error_message + elif header_type == _CPP_SYS_HEADER: + if self._section <= self._CPP_SECTION: + self._section = self._CPP_SECTION + else: + self._last_header = '' + return error_message + elif header_type == _LIKELY_MY_HEADER: + if self._section <= self._MY_H_SECTION: + self._section = self._MY_H_SECTION + else: + self._section = self._OTHER_H_SECTION + elif header_type == _POSSIBLE_MY_HEADER: + if self._section <= self._MY_H_SECTION: + self._section = self._MY_H_SECTION + else: + # This will always be the fallback because we're not sure + # enough that the header is associated with this file. + self._section = self._OTHER_H_SECTION + else: + assert header_type == _OTHER_HEADER + self._section = self._OTHER_H_SECTION + + if last_section != self._section: + self._last_header = '' + + return '' + + +class _CppLintState(object): + """Maintains module-wide state..""" + + def __init__(self): + self.verbose_level = 1 # global setting. + self.error_count = 0 # global count of reported errors + # filters to apply when emitting error messages + self.filters = _DEFAULT_FILTERS[:] + # backup of filter list. Used to restore the state after each file. + self._filters_backup = self.filters[:] + self.counting = 'total' # In what way are we counting errors? + self.errors_by_category = {} # string to int dict storing error counts + + # output format: + # "emacs" - format that emacs can parse (default) + # "vs7" - format that Microsoft Visual Studio 7 can parse + self.output_format = 'emacs' + + def SetOutputFormat(self, output_format): + """Sets the output format for errors.""" + self.output_format = output_format + + def SetVerboseLevel(self, level): + """Sets the module's verbosity, and returns the previous setting.""" + last_verbose_level = self.verbose_level + self.verbose_level = level + return last_verbose_level + + def SetCountingStyle(self, counting_style): + """Sets the module's counting options.""" + self.counting = counting_style + + def SetFilters(self, filters): + """Sets the error-message filters. + + These filters are applied when deciding whether to emit a given + error message. + + Args: + filters: A string of comma-separated filters (eg "+whitespace/indent"). + Each filter should start with + or -; else we die. + + Raises: + ValueError: The comma-separated filters did not all start with '+' or '-'. + E.g. "-,+whitespace,-whitespace/indent,whitespace/badfilter" + """ + # Default filters always have less priority than the flag ones. + self.filters = _DEFAULT_FILTERS[:] + self.AddFilters(filters) + + def AddFilters(self, filters): + """ Adds more filters to the existing list of error-message filters. 
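+    When several filters match the same category, the last one wins; for
+    example, AddFilters('-runtime/int') mutes runtime/int messages even if
+    an earlier '+runtime/int' filter had enabled them.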
""" + for filt in filters.split(','): + clean_filt = filt.strip() + if clean_filt: + self.filters.append(clean_filt) + for filt in self.filters: + if not (filt.startswith('+') or filt.startswith('-')): + raise ValueError('Every filter in --filters must start with + or -' + ' (%s does not)' % filt) + + def BackupFilters(self): + """ Saves the current filter list to backup storage.""" + self._filters_backup = self.filters[:] + + def RestoreFilters(self): + """ Restores filters previously backed up.""" + self.filters = self._filters_backup[:] + + def ResetErrorCounts(self): + """Sets the module's error statistic back to zero.""" + self.error_count = 0 + self.errors_by_category = {} + + def IncrementErrorCount(self, category): + """Bumps the module's error statistic.""" + self.error_count += 1 + if self.counting in ('toplevel', 'detailed'): + if self.counting != 'detailed': + category = category.split('/')[0] + if category not in self.errors_by_category: + self.errors_by_category[category] = 0 + self.errors_by_category[category] += 1 + + def PrintErrorCounts(self): + """Print a summary of errors by category, and the total.""" + for category, count in self.errors_by_category.iteritems(): + sys.stderr.write('Category \'%s\' errors found: %d\n' % + (category, count)) + sys.stderr.write('Total errors found: %d\n' % self.error_count) + +_cpplint_state = _CppLintState() + + +def _OutputFormat(): + """Gets the module's output format.""" + return _cpplint_state.output_format + + +def _SetOutputFormat(output_format): + """Sets the module's output format.""" + _cpplint_state.SetOutputFormat(output_format) + + +def _VerboseLevel(): + """Returns the module's verbosity setting.""" + return _cpplint_state.verbose_level + + +def _SetVerboseLevel(level): + """Sets the module's verbosity, and returns the previous setting.""" + return _cpplint_state.SetVerboseLevel(level) + + +def _SetCountingStyle(level): + """Sets the module's counting options.""" + _cpplint_state.SetCountingStyle(level) + + +def _Filters(): + """Returns the module's list of output filters, as a list.""" + return _cpplint_state.filters + + +def _SetFilters(filters): + """Sets the module's error-message filters. + + These filters are applied when deciding whether to emit a given + error message. + + Args: + filters: A string of comma-separated filters (eg "whitespace/indent"). + Each filter should start with + or -; else we die. + """ + _cpplint_state.SetFilters(filters) + +def _AddFilters(filters): + """Adds more filter overrides. + + Unlike _SetFilters, this function does not reset the current list of filters + available. + + Args: + filters: A string of comma-separated filters (eg "whitespace/indent"). + Each filter should start with + or -; else we die. + """ + _cpplint_state.AddFilters(filters) + +def _BackupFilters(): + """ Saves the current filter list to backup storage.""" + _cpplint_state.BackupFilters() + +def _RestoreFilters(): + """ Restores filters previously backed up.""" + _cpplint_state.RestoreFilters() + +class _FunctionState(object): + """Tracks current function name and the number of lines in its body.""" + + _NORMAL_TRIGGER = 250 # for --v=0, 500 for --v=1, etc. + _TEST_TRIGGER = 400 # about 50% more than _NORMAL_TRIGGER. + + def __init__(self): + self.in_a_function = False + self.lines_in_function = 0 + self.current_function = '' + + def Begin(self, function_name): + """Start analyzing function body. + + Args: + function_name: The name of the function being tracked. 
+ """ + self.in_a_function = True + self.lines_in_function = 0 + self.current_function = function_name + + def Count(self): + """Count line in current function body.""" + if self.in_a_function: + self.lines_in_function += 1 + + def Check(self, error, filename, linenum): + """Report if too many lines in function body. + + Args: + error: The function to call with any errors found. + filename: The name of the current file. + linenum: The number of the line to check. + """ + if Match(r'T(EST|est)', self.current_function): + base_trigger = self._TEST_TRIGGER + else: + base_trigger = self._NORMAL_TRIGGER + trigger = base_trigger * 2**_VerboseLevel() + + if self.lines_in_function > trigger: + error_level = int(math.log(self.lines_in_function / base_trigger, 2)) + # 50 => 0, 100 => 1, 200 => 2, 400 => 3, 800 => 4, 1600 => 5, ... + if error_level > 5: + error_level = 5 + error(filename, linenum, 'readability/fn_size', error_level, + 'Small and focused functions are preferred:' + ' %s has %d non-comment lines' + ' (error triggered by exceeding %d lines).' % ( + self.current_function, self.lines_in_function, trigger)) + + def End(self): + """Stop analyzing function body.""" + self.in_a_function = False + + +class _IncludeError(Exception): + """Indicates a problem with the include order in a file.""" + pass + + +class FileInfo(object): + """Provides utility functions for filenames. + + FileInfo provides easy access to the components of a file's path + relative to the project root. + """ + + def __init__(self, filename): + self._filename = filename + + def FullName(self): + """Make Windows paths like Unix.""" + return os.path.abspath(self._filename).replace('\\', '/') + + def RepositoryName(self): + """FullName after removing the local path to the repository. + + If we have a real absolute path name here we can try to do something smart: + detecting the root of the checkout and truncating /path/to/checkout from + the name so that we get header guards that don't include things like + "C:\Documents and Settings\..." or "/home/username/..." in them and thus + people on different computers who have checked the source out to different + locations won't see bogus errors. + """ + fullname = self.FullName() + + if os.path.exists(fullname): + project_dir = os.path.dirname(fullname) + + if os.path.exists(os.path.join(project_dir, ".svn")): + # If there's a .svn file in the current directory, we recursively look + # up the directory tree for the top of the SVN checkout + root_dir = project_dir + one_up_dir = os.path.dirname(root_dir) + while os.path.exists(os.path.join(one_up_dir, ".svn")): + root_dir = os.path.dirname(root_dir) + one_up_dir = os.path.dirname(one_up_dir) + + prefix = os.path.commonprefix([root_dir, project_dir]) + return fullname[len(prefix) + 1:] + + # Not SVN <= 1.6? Try to find a git, hg, or svn top level directory by + # searching up from the current path. + root_dir = os.path.dirname(fullname) + while (root_dir != os.path.dirname(root_dir) and + not os.path.exists(os.path.join(root_dir, ".git")) and + not os.path.exists(os.path.join(root_dir, ".hg")) and + not os.path.exists(os.path.join(root_dir, ".svn"))): + root_dir = os.path.dirname(root_dir) + + if (os.path.exists(os.path.join(root_dir, ".git")) or + os.path.exists(os.path.join(root_dir, ".hg")) or + os.path.exists(os.path.join(root_dir, ".svn"))): + prefix = os.path.commonprefix([root_dir, project_dir]) + return fullname[len(prefix) + 1:] + + # Don't know what to do; header guard warnings may be wrong... 
+ return fullname + + def Split(self): + """Splits the file into the directory, basename, and extension. + + For 'chrome/browser/browser.cc', Split() would + return ('chrome/browser', 'browser', '.cc') + + Returns: + A tuple of (directory, basename, extension). + """ + + googlename = self.RepositoryName() + project, rest = os.path.split(googlename) + return (project,) + os.path.splitext(rest) + + def BaseName(self): + """File base name - text after the final slash, before the final period.""" + return self.Split()[1] + + def Extension(self): + """File extension - text following the final period.""" + return self.Split()[2] + + def NoExtension(self): + """File has no source file extension.""" + return '/'.join(self.Split()[0:2]) + + def IsSource(self): + """File has a source file extension.""" + return self.Extension()[1:] in ('c', 'cc', 'cpp', 'cxx') + + +def _ShouldPrintError(category, confidence, linenum): + """If confidence >= verbose, category passes filter and is not suppressed.""" + + # There are three ways we might decide not to print an error message: + # a "NOLINT(category)" comment appears in the source, + # the verbosity level isn't high enough, or the filters filter it out. + if IsErrorSuppressedByNolint(category, linenum): + return False + + if confidence < _cpplint_state.verbose_level: + return False + + is_filtered = False + for one_filter in _Filters(): + if one_filter.startswith('-'): + if category.startswith(one_filter[1:]): + is_filtered = True + elif one_filter.startswith('+'): + if category.startswith(one_filter[1:]): + is_filtered = False + else: + assert False # should have been checked for in SetFilter. + if is_filtered: + return False + + return True + + +def Error(filename, linenum, category, confidence, message): + """Logs the fact we've found a lint error. + + We log where the error was found, and also our confidence in the error, + that is, how certain we are this is a legitimate style regression, and + not a misidentification or a use that's sometimes justified. + + False positives can be suppressed by the use of + "cpplint(category)" comments on the offending line. These are + parsed into _error_suppressions. + + Args: + filename: The name of the file containing the error. + linenum: The number of the line containing the error. + category: A string used to describe the "category" this bug + falls under: "whitespace", say, or "runtime". Categories + may have a hierarchy separated by slashes: "whitespace/indent". + confidence: A number from 1-5 representing a confidence score for + the error, with 5 meaning that we are certain of the problem, + and 1 meaning that it could be a legitimate construct. + message: The error message. + """ + if _ShouldPrintError(category, confidence, linenum): + _cpplint_state.IncrementErrorCount(category) + if _cpplint_state.output_format == 'vs7': + sys.stderr.write('%s(%s): %s [%s] [%d]\n' % ( + filename, linenum, message, category, confidence)) + elif _cpplint_state.output_format == 'eclipse': + sys.stderr.write('%s:%s: warning: %s [%s] [%d]\n' % ( + filename, linenum, message, category, confidence)) + else: + sys.stderr.write('%s:%s: %s [%s] [%d]\n' % ( + filename, linenum, message, category, confidence)) + + +# Matches standard C++ escape sequences per 2.13.2.3 of the C++ standard. +_RE_PATTERN_CLEANSE_LINE_ESCAPES = re.compile( + r'\\([abfnrtv?"\\\']|\d+|x[0-9a-fA-F]+)') +# Match a single C style comment on the same line. +_RE_PATTERN_C_COMMENTS = r'/\*(?:[^*]|\*(?!/))*\*/' +# Matches multi-line C style comments. 
+
+# This RE is a little bit more complicated than one might expect, because we
+# also have to take care of the surrounding spaces so that comments inside
+# statements are handled better.
+# The current rule is: we only clear spaces from both sides when we're at the
+# end of the line. Otherwise, we try to remove spaces from the right side;
+# if this doesn't work we try the left side, but only if there's a non-word
+# character on the right.
+_RE_PATTERN_CLEANSE_LINE_C_COMMENTS = re.compile(
+    r'(\s*' + _RE_PATTERN_C_COMMENTS + r'\s*$|' +
+    _RE_PATTERN_C_COMMENTS + r'\s+|' +
+    r'\s+' + _RE_PATTERN_C_COMMENTS + r'(?=\W)|' +
+    _RE_PATTERN_C_COMMENTS + r')')
+
+
+def IsCppString(line):
+  """Checks whether the line ends inside a string constant.
+
+  This function does not consider single-line nor multi-line comments.
+
+  Args:
+    line: a partial line of code, from position 0 up to some position n.
+
+  Returns:
+    True if the next character appended to 'line' would be inside a
+    string constant.
+  """
+
+  line = line.replace(r'\\', 'XX')  # after this, \\" does not match to \"
+  return ((line.count('"') - line.count(r'\"') - line.count("'\"'")) & 1) == 1
+
+
+def CleanseRawStrings(raw_lines):
+  """Removes C++11 raw strings from lines.
+
+    Before:
+      static const char kData[] = R"(
+          multi-line string
+          )";
+
+    After:
+      static const char kData[] = ""
+          (replaced by blank line)
+          "";
+
+  Args:
+    raw_lines: list of raw lines.
+
+  Returns:
+    list of lines with C++11 raw strings replaced by empty strings.
+  """
+
+  delimiter = None
+  lines_without_raw_strings = []
+  for line in raw_lines:
+    if delimiter:
+      # Inside a raw string, look for the end
+      end = line.find(delimiter)
+      if end >= 0:
+        # Found the end of the string, match leading space for this
+        # line and resume copying the original lines, and also insert
+        # a "" on the last line.
+        leading_space = Match(r'^(\s*)\S', line)
+        line = leading_space.group(1) + '""' + line[end + len(delimiter):]
+        delimiter = None
+      else:
+        # Haven't found the end yet, append a blank line.
+        line = '""'
+
+    # Look for beginning of a raw string, and replace them with
+    # empty strings. This is done in a loop to handle multiple raw
+    # strings on the same line.
+    while delimiter is None:
+      # Look for beginning of a raw string.
+      # See 2.14.15 [lex.string] for syntax.
+      matched = Match(r'^(.*)\b(?:R|u8R|uR|UR|LR)"([^\s\\()]*)\((.*)$', line)
+      if matched:
+        delimiter = ')' + matched.group(2) + '"'
+
+        end = matched.group(3).find(delimiter)
+        if end >= 0:
+          # Raw string ended on same line
+          line = (matched.group(1) + '""' +
+                  matched.group(3)[end + len(delimiter):])
+          delimiter = None
+        else:
+          # Start of a multi-line raw string
+          line = matched.group(1) + '""'
+      else:
+        break
+
+    lines_without_raw_strings.append(line)
+
+  # TODO(unknown): if delimiter is not None here, we might want to
+  # emit a warning for unterminated string.
+
+  return lines_without_raw_strings
+
+
+def FindNextMultiLineCommentStart(lines, lineix):
+  """Find the beginning marker for a multiline comment."""
+  while lineix < len(lines):
+    if lines[lineix].strip().startswith('/*'):
+      # Only return this marker if the comment goes beyond this line
+      if lines[lineix].strip().find('*/', 2) < 0:
+        return lineix
+    lineix += 1
+  return len(lines)
+
+
+def FindNextMultiLineCommentEnd(lines, lineix):
+  """We are inside a comment, find the end marker."""
+  while lineix < len(lines):
+    if lines[lineix].strip().endswith('*/'):
+      return lineix
+    lineix += 1
+  return len(lines)
+
+
+def RemoveMultiLineCommentsFromRange(lines, begin, end):
+  """Clears a range of lines for multi-line comments."""
+  # Having // dummy comments makes the lines non-empty, so we will not get
+  # unnecessary blank line warnings later in the code.
+  for i in range(begin, end):
+    lines[i] = '/**/'
+
+
+def RemoveMultiLineComments(filename, lines, error):
+  """Removes multiline (c-style) comments from lines."""
+  lineix = 0
+  while lineix < len(lines):
+    lineix_begin = FindNextMultiLineCommentStart(lines, lineix)
+    if lineix_begin >= len(lines):
+      return
+    lineix_end = FindNextMultiLineCommentEnd(lines, lineix_begin)
+    if lineix_end >= len(lines):
+      error(filename, lineix_begin + 1, 'readability/multiline_comment', 5,
+            'Could not find end of multi-line comment')
+      return
+    RemoveMultiLineCommentsFromRange(lines, lineix_begin, lineix_end + 1)
+    lineix = lineix_end + 1
+
+
+def CleanseComments(line):
+  """Removes //-comments and single-line C-style /* */ comments.
+
+  Args:
+    line: A line of C++ source.
+
+  Returns:
+    The line with single-line comments removed.
+  """
+  commentpos = line.find('//')
+  if commentpos != -1 and not IsCppString(line[:commentpos]):
+    line = line[:commentpos].rstrip()
+  # get rid of /* ... */
+  return _RE_PATTERN_CLEANSE_LINE_C_COMMENTS.sub('', line)
+
+
+class CleansedLines(object):
+  """Holds 4 copies of all lines with different preprocessing applied to them.
+
+  1) elided member contains lines without strings and comments.
+  2) lines member contains lines without comments.
+  3) raw_lines member contains all the lines without processing.
+  4) lines_without_raw_strings member is same as raw_lines, but with C++11 raw
+     strings removed.
+  All these members are of type list, and of the same length.
+  """
+
+  def __init__(self, lines):
+    self.elided = []
+    self.lines = []
+    self.raw_lines = lines
+    self.num_lines = len(lines)
+    self.lines_without_raw_strings = CleanseRawStrings(lines)
+    for linenum in range(len(self.lines_without_raw_strings)):
+      self.lines.append(CleanseComments(
+          self.lines_without_raw_strings[linenum]))
+      elided = self._CollapseStrings(self.lines_without_raw_strings[linenum])
+      self.elided.append(CleanseComments(elided))
+
+  def NumLines(self):
+    """Returns the number of lines represented."""
+    return self.num_lines
+
+  @staticmethod
+  def _CollapseStrings(elided):
+    """Collapses strings and chars on a line to simple "" or '' blocks.
+
+    We nix strings first so we're not fooled by text like '"http://"'
+
+    Args:
+      elided: The line being processed.
+
+    Returns:
+      The line with collapsed strings.
+    """
+    if _RE_PATTERN_INCLUDE.match(elided):
+      return elided
+
+    # Remove escaped characters first to make quote/single quote collapsing
+    # basic. Things that look like escaped characters shouldn't occur
+    # outside of strings and chars.
+    elided = _RE_PATTERN_CLEANSE_LINE_ESCAPES.sub('', elided)
+
+    # Replace quoted strings and digit separators.
Both single quotes + # and double quotes are processed in the same loop, otherwise + # nested quotes wouldn't work. + collapsed = '' + while True: + # Find the first quote character + match = Match(r'^([^\'"]*)([\'"])(.*)$', elided) + if not match: + collapsed += elided + break + head, quote, tail = match.groups() + + if quote == '"': + # Collapse double quoted strings + second_quote = tail.find('"') + if second_quote >= 0: + collapsed += head + '""' + elided = tail[second_quote + 1:] + else: + # Unmatched double quote, don't bother processing the rest + # of the line since this is probably a multiline string. + collapsed += elided + break + else: + # Found single quote, check nearby text to eliminate digit separators. + # + # There is no special handling for floating point here, because + # the integer/fractional/exponent parts would all be parsed + # correctly as long as there are digits on both sides of the + # separator. So we are fine as long as we don't see something + # like "0.'3" (gcc 4.9.0 will not allow this literal). + if Search(r'\b(?:0[bBxX]?|[1-9])[0-9a-fA-F]*$', head): + match_literal = Match(r'^((?:\'?[0-9a-zA-Z_])*)(.*)$', "'" + tail) + collapsed += head + match_literal.group(1).replace("'", '') + elided = match_literal.group(2) + else: + second_quote = tail.find('\'') + if second_quote >= 0: + collapsed += head + "''" + elided = tail[second_quote + 1:] + else: + # Unmatched single quote + collapsed += elided + break + + return collapsed + + +def FindEndOfExpressionInLine(line, startpos, stack): + """Find the position just after the end of current parenthesized expression. + + Args: + line: a CleansedLines line. + startpos: start searching at this position. + stack: nesting stack at startpos. + + Returns: + On finding matching end: (index just after matching end, None) + On finding an unclosed expression: (-1, None) + Otherwise: (-1, new stack at end of this line) + """ + for i in xrange(startpos, len(line)): + char = line[i] + if char in '([{': + # Found start of parenthesized expression, push to expression stack + stack.append(char) + elif char == '<': + # Found potential start of template argument list + if i > 0 and line[i - 1] == '<': + # Left shift operator + if stack and stack[-1] == '<': + stack.pop() + if not stack: + return (-1, None) + elif i > 0 and Search(r'\boperator\s*$', line[0:i]): + # operator<, don't add to stack + continue + else: + # Tentative start of template argument list + stack.append('<') + elif char in ')]}': + # Found end of parenthesized expression. + # + # If we are currently expecting a matching '>', the pending '<' + # must have been an operator. Remove them from expression stack. + while stack and stack[-1] == '<': + stack.pop() + if not stack: + return (-1, None) + if ((stack[-1] == '(' and char == ')') or + (stack[-1] == '[' and char == ']') or + (stack[-1] == '{' and char == '}')): + stack.pop() + if not stack: + return (i + 1, None) + else: + # Mismatched parentheses + return (-1, None) + elif char == '>': + # Found potential end of template argument list. + + # Ignore "->" and operator functions + if (i > 0 and + (line[i - 1] == '-' or Search(r'\boperator\s*$', line[0:i - 1]))): + continue + + # Pop the stack if there is a matching '<'. Otherwise, ignore + # this '>' since it must be an operator. + if stack: + if stack[-1] == '<': + stack.pop() + if not stack: + return (i + 1, None) + elif char == ';': + # Found something that look like end of statements. 
If we are currently + # expecting a '>', the matching '<' must have been an operator, since + # template argument list should not contain statements. + while stack and stack[-1] == '<': + stack.pop() + if not stack: + return (-1, None) + + # Did not find end of expression or unbalanced parentheses on this line + return (-1, stack) + + +def CloseExpression(clean_lines, linenum, pos): + """If input points to ( or { or [ or <, finds the position that closes it. + + If lines[linenum][pos] points to a '(' or '{' or '[' or '<', finds the + linenum/pos that correspond to the closing of the expression. + + TODO(unknown): cpplint spends a fair bit of time matching parentheses. + Ideally we would want to index all opening and closing parentheses once + and have CloseExpression be just a simple lookup, but due to preprocessor + tricks, this is not so easy. + + Args: + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + pos: A position on the line. + + Returns: + A tuple (line, linenum, pos) pointer *past* the closing brace, or + (line, len(lines), -1) if we never find a close. Note we ignore + strings and comments when matching; and the line we return is the + 'cleansed' line at linenum. + """ + + line = clean_lines.elided[linenum] + if (line[pos] not in '({[<') or Match(r'<[<=]', line[pos:]): + return (line, clean_lines.NumLines(), -1) + + # Check first line + (end_pos, stack) = FindEndOfExpressionInLine(line, pos, []) + if end_pos > -1: + return (line, linenum, end_pos) + + # Continue scanning forward + while stack and linenum < clean_lines.NumLines() - 1: + linenum += 1 + line = clean_lines.elided[linenum] + (end_pos, stack) = FindEndOfExpressionInLine(line, 0, stack) + if end_pos > -1: + return (line, linenum, end_pos) + + # Did not find end of expression before end of file, give up + return (line, clean_lines.NumLines(), -1) + + +def FindStartOfExpressionInLine(line, endpos, stack): + """Find position at the matching start of current expression. + + This is almost the reverse of FindEndOfExpressionInLine, but note + that the input position and returned position differs by 1. + + Args: + line: a CleansedLines line. + endpos: start searching at this position. + stack: nesting stack at endpos. + + Returns: + On finding matching start: (index at matching start, None) + On finding an unclosed expression: (-1, None) + Otherwise: (-1, new stack at beginning of this line) + """ + i = endpos + while i >= 0: + char = line[i] + if char in ')]}': + # Found end of expression, push to expression stack + stack.append(char) + elif char == '>': + # Found potential end of template argument list. + # + # Ignore it if it's a "->" or ">=" or "operator>" + if (i > 0 and + (line[i - 1] == '-' or + Match(r'\s>=\s', line[i - 1:]) or + Search(r'\boperator\s*$', line[0:i]))): + i -= 1 + else: + stack.append('>') + elif char == '<': + # Found potential start of template argument list + if i > 0 and line[i - 1] == '<': + # Left shift operator + i -= 1 + else: + # If there is a matching '>', we can pop the expression stack. + # Otherwise, ignore this '<' since it must be an operator. + if stack and stack[-1] == '>': + stack.pop() + if not stack: + return (i, None) + elif char in '([{': + # Found start of expression. + # + # If there are any unmatched '>' on the stack, they must be + # operators. Remove those. 
+
+      while stack and stack[-1] == '>':
+        stack.pop()
+        if not stack:
+          return (-1, None)
+      if ((char == '(' and stack[-1] == ')') or
+          (char == '[' and stack[-1] == ']') or
+          (char == '{' and stack[-1] == '}')):
+        stack.pop()
+        if not stack:
+          return (i, None)
+      else:
+        # Mismatched parentheses
+        return (-1, None)
+    elif char == ';':
+      # Found something that looks like the end of a statement. If we are
+      # currently expecting a '<', the matching '>' must have been an
+      # operator, since template argument list should not contain statements.
+      while stack and stack[-1] == '>':
+        stack.pop()
+        if not stack:
+          return (-1, None)
+
+    i -= 1
+
+  return (-1, stack)
+
+
+def ReverseCloseExpression(clean_lines, linenum, pos):
+  """If input points to ) or } or ] or >, finds the position that opens it.
+
+  If lines[linenum][pos] points to a ')' or '}' or ']' or '>', finds the
+  linenum/pos that correspond to the opening of the expression.
+
+  Args:
+    clean_lines: A CleansedLines instance containing the file.
+    linenum: The number of the line to check.
+    pos: A position on the line.
+
+  Returns:
+    A tuple (line, linenum, pos) pointer *at* the opening brace, or
+    (line, 0, -1) if we never find the matching opening brace. Note
+    we ignore strings and comments when matching; and the line we
+    return is the 'cleansed' line at linenum.
+  """
+  line = clean_lines.elided[linenum]
+  if line[pos] not in ')}]>':
+    return (line, 0, -1)
+
+  # Check last line
+  (start_pos, stack) = FindStartOfExpressionInLine(line, pos, [])
+  if start_pos > -1:
+    return (line, linenum, start_pos)
+
+  # Continue scanning backward
+  while stack and linenum > 0:
+    linenum -= 1
+    line = clean_lines.elided[linenum]
+    (start_pos, stack) = FindStartOfExpressionInLine(line, len(line) - 1, stack)
+    if start_pos > -1:
+      return (line, linenum, start_pos)
+
+  # Did not find start of expression before beginning of file, give up
+  return (line, 0, -1)
+
+
+def CheckForCopyright(filename, lines, error):
+  """Logs an error if no Copyright message appears at the top of the file."""
+
+  # We'll say it should occur by line 10. Don't forget there's a
+  # dummy line at the front.
+  for line in xrange(1, min(len(lines), 11)):
+    if re.search(r'Copyright', lines[line], re.I): break
+  else:  # means no copyright line was found
+    error(filename, 0, 'legal/copyright', 5,
+          'No copyright message found. '
+          'You should have a line: "Copyright [year] <Copyright Owner>"')
+
+
+def GetIndentLevel(line):
+  """Return the number of leading spaces in line.
+
+  Args:
+    line: A string to check.
+
+  Returns:
+    An integer count of leading spaces, possibly zero.
+  """
+  indent = Match(r'^( *)\S', line)
+  if indent:
+    return len(indent.group(1))
+  else:
+    return 0
+
+
+def GetHeaderGuardCPPVariable(filename):
+  """Returns the CPP variable that should be used as a header guard.
+
+  Args:
+    filename: The name of a C++ header file.
+
+  Returns:
+    The CPP variable that should be used as a header guard in the
+    named file.
+
+  """
+
+  # Restores the original filename in case cpplint is invoked from Emacs's
+  # flymake.
+  filename = re.sub(r'_flymake\.h$', '.h', filename)
+  filename = re.sub(r'/\.flymake/([^/]*)$', r'/\1', filename)
+  # Replace 'c++' with 'cpp'.
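+  # (Illustration, assuming a path already relative to the repository root:
+  # 'chrome/c++/util.h' would produce the guard CHROME_CPP_UTIL_H_.)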
+ filename = filename.replace('C++', 'cpp').replace('c++', 'cpp') + + fileinfo = FileInfo(filename) + file_path_from_root = fileinfo.RepositoryName() + if _root: + file_path_from_root = re.sub('^' + _root + os.sep, '', file_path_from_root) + return re.sub(r'[^a-zA-Z0-9]', '_', file_path_from_root).upper() + '_' + + +def CheckForHeaderGuard(filename, clean_lines, error): + """Checks that the file contains a header guard. + + Logs an error if no #ifndef header guard is present. For other + headers, checks that the full pathname is used. + + Args: + filename: The name of the C++ header file. + clean_lines: A CleansedLines instance containing the file. + error: The function to call with any errors found. + """ + + # Don't check for header guards if there are error suppression + # comments somewhere in this file. + # + # Because this is silencing a warning for a nonexistent line, we + # only support the very specific NOLINT(build/header_guard) syntax, + # and not the general NOLINT or NOLINT(*) syntax. + raw_lines = clean_lines.lines_without_raw_strings + for i in raw_lines: + if Search(r'//\s*NOLINT\(build/header_guard\)', i): + return + + cppvar = GetHeaderGuardCPPVariable(filename) + + ifndef = '' + ifndef_linenum = 0 + define = '' + endif = '' + endif_linenum = 0 + for linenum, line in enumerate(raw_lines): + linesplit = line.split() + if len(linesplit) >= 2: + # find the first occurrence of #ifndef and #define, save arg + if not ifndef and linesplit[0] == '#ifndef': + # set ifndef to the header guard presented on the #ifndef line. + ifndef = linesplit[1] + ifndef_linenum = linenum + if not define and linesplit[0] == '#define': + define = linesplit[1] + # find the last occurrence of #endif, save entire line + if line.startswith('#endif'): + endif = line + endif_linenum = linenum + + if not ifndef or not define or ifndef != define: + error(filename, 0, 'build/header_guard', 5, + 'No #ifndef header guard found, suggested CPP variable is: %s' % + cppvar) + return + + # The guard should be PATH_FILE_H_, but we also allow PATH_FILE_H__ + # for backward compatibility. + if ifndef != cppvar: + error_level = 0 + if ifndef != cppvar + '_': + error_level = 5 + + ParseNolintSuppressions(filename, raw_lines[ifndef_linenum], ifndef_linenum, + error) + error(filename, ifndef_linenum, 'build/header_guard', error_level, + '#ifndef header guard has wrong style, please use: %s' % cppvar) + + # Check for "//" comments on endif line. + ParseNolintSuppressions(filename, raw_lines[endif_linenum], endif_linenum, + error) + match = Match(r'#endif\s*//\s*' + cppvar + r'(_)?\b', endif) + if match: + if match.group(1) == '_': + # Issue low severity warning for deprecated double trailing underscore + error(filename, endif_linenum, 'build/header_guard', 0, + '#endif line should be "#endif // %s"' % cppvar) + return + + # Didn't find the corresponding "//" comment. If this file does not + # contain any "//" comments at all, it could be that the compiler + # only wants "/**/" comments, look for those instead. 
+ no_single_line_comments = True + for i in xrange(1, len(raw_lines) - 1): + line = raw_lines[i] + if Match(r'^(?:(?:\'(?:\.|[^\'])*\')|(?:"(?:\.|[^"])*")|[^\'"])*//', line): + no_single_line_comments = False + break + + if no_single_line_comments: + match = Match(r'#endif\s*/\*\s*' + cppvar + r'(_)?\s*\*/', endif) + if match: + if match.group(1) == '_': + # Low severity warning for double trailing underscore + error(filename, endif_linenum, 'build/header_guard', 0, + '#endif line should be "#endif /* %s */"' % cppvar) + return + + # Didn't find anything + error(filename, endif_linenum, 'build/header_guard', 5, + '#endif line should be "#endif // %s"' % cppvar) + + +def CheckHeaderFileIncluded(filename, include_state, error): + """Logs an error if a .cc file does not include its header.""" + + # Do not check test files + if filename.endswith('_test.cc') or filename.endswith('_unittest.cc'): + return + + fileinfo = FileInfo(filename) + headerfile = filename[0:len(filename) - 2] + 'h' + if not os.path.exists(headerfile): + return + headername = FileInfo(headerfile).RepositoryName() + first_include = 0 + for section_list in include_state.include_list: + for f in section_list: + if headername in f[0] or f[0] in headername: + return + if not first_include: + first_include = f[1] + + error(filename, first_include, 'build/include', 5, + '%s should include its header file %s' % (fileinfo.RepositoryName(), + headername)) + + +def CheckForBadCharacters(filename, lines, error): + """Logs an error for each line containing bad characters. + + Two kinds of bad characters: + + 1. Unicode replacement characters: These indicate that either the file + contained invalid UTF-8 (likely) or Unicode replacement characters (which + it shouldn't). Note that it's possible for this to throw off line + numbering if the invalid UTF-8 occurred adjacent to a newline. + + 2. NUL bytes. These are problematic for some tools. + + Args: + filename: The name of the current file. + lines: An array of strings, each representing a line of the file. + error: The function to call with any errors found. + """ + for linenum, line in enumerate(lines): + if u'\ufffd' in line: + error(filename, linenum, 'readability/utf8', 5, + 'Line contains invalid UTF-8 (or Unicode replacement character).') + if '\0' in line: + error(filename, linenum, 'readability/nul', 5, 'Line contains NUL byte.') + + +def CheckForNewlineAtEOF(filename, lines, error): + """Logs an error if there is no newline char at the end of the file. + + Args: + filename: The name of the current file. + lines: An array of strings, each representing a line of the file. + error: The function to call with any errors found. + """ + + # The array lines() was created by adding two newlines to the + # original file (go figure), then splitting on \n. + # To verify that the file ends in \n, we just have to make sure the + # last-but-two element of lines() exists and is empty. + if len(lines) < 3 or lines[-2]: + error(filename, len(lines) - 2, 'whitespace/ending_newline', 5, + 'Could not find a newline character at the end of the file.') + + +def CheckForMultilineCommentsAndStrings(filename, clean_lines, linenum, error): + """Logs an error if we see /* ... */ or "..." that extend past one line. + + /* ... */ comments are legit inside macros, for one line. + Otherwise, we prefer // comments, so it's ok to warn about the + other. Likewise, it's ok for strings to extend across multiple + lines, as long as a line continuation character (backslash) + terminates each line. 
Although not currently prohibited by the C++ + style guide, it's ugly and unnecessary. We don't do well with either + in this lint program, so we warn about both. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + line = clean_lines.elided[linenum] + + # Remove all \\ (escaped backslashes) from the line. They are OK, and the + # second (escaped) slash may trigger later \" detection erroneously. + line = line.replace('\\\\', '') + + if line.count('/*') > line.count('*/'): + error(filename, linenum, 'readability/multiline_comment', 5, + 'Complex multi-line /*...*/-style comment found. ' + 'Lint may give bogus warnings. ' + 'Consider replacing these with //-style comments, ' + 'with #if 0...#endif, ' + 'or with more clearly structured multi-line comments.') + + if (line.count('"') - line.count('\\"')) % 2: + error(filename, linenum, 'readability/multiline_string', 5, + 'Multi-line string ("...") found. This lint script doesn\'t ' + 'do well with such strings, and may give bogus warnings. ' + 'Use C++11 raw strings or concatenation instead.') + + +# (non-threadsafe name, thread-safe alternative, validation pattern) +# +# The validation pattern is used to eliminate false positives such as: +# _rand(); // false positive due to substring match. +# ->rand(); // some member function rand(). +# ACMRandom rand(seed); // some variable named rand. +# ISAACRandom rand(); // another variable named rand. +# +# Basically we require the return value of these functions to be used +# in some expression context on the same line by matching on some +# operator before the function name. This eliminates constructors and +# member function calls. +_UNSAFE_FUNC_PREFIX = r'(?:[-+*/=%^&|(<]\s*|>\s+)' +_THREADING_LIST = ( + ('asctime(', 'asctime_r(', _UNSAFE_FUNC_PREFIX + r'asctime\([^)]+\)'), + ('ctime(', 'ctime_r(', _UNSAFE_FUNC_PREFIX + r'ctime\([^)]+\)'), + ('getgrgid(', 'getgrgid_r(', _UNSAFE_FUNC_PREFIX + r'getgrgid\([^)]+\)'), + ('getgrnam(', 'getgrnam_r(', _UNSAFE_FUNC_PREFIX + r'getgrnam\([^)]+\)'), + ('getlogin(', 'getlogin_r(', _UNSAFE_FUNC_PREFIX + r'getlogin\(\)'), + ('getpwnam(', 'getpwnam_r(', _UNSAFE_FUNC_PREFIX + r'getpwnam\([^)]+\)'), + ('getpwuid(', 'getpwuid_r(', _UNSAFE_FUNC_PREFIX + r'getpwuid\([^)]+\)'), + ('gmtime(', 'gmtime_r(', _UNSAFE_FUNC_PREFIX + r'gmtime\([^)]+\)'), + ('localtime(', 'localtime_r(', _UNSAFE_FUNC_PREFIX + r'localtime\([^)]+\)'), + ('rand(', 'rand_r(', _UNSAFE_FUNC_PREFIX + r'rand\(\)'), + ('strtok(', 'strtok_r(', + _UNSAFE_FUNC_PREFIX + r'strtok\([^)]+\)'), + ('ttyname(', 'ttyname_r(', _UNSAFE_FUNC_PREFIX + r'ttyname\([^)]+\)'), + ) + + +def CheckPosixThreading(filename, clean_lines, linenum, error): + """Checks for calls to thread-unsafe functions. + + Much code has been originally written without consideration of + multi-threading. Also, engineers are relying on their old experience; + they have learned posix before threading extensions were added. These + tests guide the engineers to use thread-safe functions (when using + posix directly). + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. 
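+
+  For example, a line such as 'int r = rand();' is reported with a
+  suggestion to use rand_r(...) instead, per _THREADING_LIST above.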
+ """ + line = clean_lines.elided[linenum] + for single_thread_func, multithread_safe_func, pattern in _THREADING_LIST: + # Additional pattern matching check to confirm that this is the + # function we are looking for + if Search(pattern, line): + error(filename, linenum, 'runtime/threadsafe_fn', 2, + 'Consider using ' + multithread_safe_func + + '...) instead of ' + single_thread_func + + '...) for improved thread safety.') + + +def CheckVlogArguments(filename, clean_lines, linenum, error): + """Checks that VLOG() is only used for defining a logging level. + + For example, VLOG(2) is correct. VLOG(INFO), VLOG(WARNING), VLOG(ERROR), and + VLOG(FATAL) are not. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + line = clean_lines.elided[linenum] + if Search(r'\bVLOG\((INFO|ERROR|WARNING|DFATAL|FATAL)\)', line): + error(filename, linenum, 'runtime/vlog', 5, + 'VLOG() should be used with numeric verbosity level. ' + 'Use LOG() if you want symbolic severity levels.') + +# Matches invalid increment: *count++, which moves pointer instead of +# incrementing a value. +_RE_PATTERN_INVALID_INCREMENT = re.compile( + r'^\s*\*\w+(\+\+|--);') + + +def CheckInvalidIncrement(filename, clean_lines, linenum, error): + """Checks for invalid increment *count++. + + For example following function: + void increment_counter(int* count) { + *count++; + } + is invalid, because it effectively does count++, moving pointer, and should + be replaced with ++*count, (*count)++ or *count += 1. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + line = clean_lines.elided[linenum] + if _RE_PATTERN_INVALID_INCREMENT.match(line): + error(filename, linenum, 'runtime/invalid_increment', 5, + 'Changing pointer instead of value (or unused value of operator*).') + + +def IsMacroDefinition(clean_lines, linenum): + if Search(r'^#define', clean_lines[linenum]): + return True + + if linenum > 0 and Search(r'\\$', clean_lines[linenum - 1]): + return True + + return False + + +def IsForwardClassDeclaration(clean_lines, linenum): + return Match(r'^\s*(\btemplate\b)*.*class\s+\w+;\s*$', clean_lines[linenum]) + + +class _BlockInfo(object): + """Stores information about a generic block of code.""" + + def __init__(self, seen_open_brace): + self.seen_open_brace = seen_open_brace + self.open_parentheses = 0 + self.inline_asm = _NO_ASM + self.check_namespace_indentation = False + + def CheckBegin(self, filename, clean_lines, linenum, error): + """Run checks that applies to text up to the opening brace. + + This is mostly for checking the text after the class identifier + and the "{", usually where the base class is specified. For other + blocks, there isn't much to check, so we always pass. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + pass + + def CheckEnd(self, filename, clean_lines, linenum, error): + """Run checks that applies to text after the closing brace. + + This is mostly used for checking end of namespace comments. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. 
+ linenum: The number of the line to check. + error: The function to call with any errors found. + """ + pass + + def IsBlockInfo(self): + """Returns true if this block is a _BlockInfo. + + This is convenient for verifying that an object is an instance of + a _BlockInfo, but not an instance of any of the derived classes. + + Returns: + True for this class, False for derived classes. + """ + return self.__class__ == _BlockInfo + + +class _ExternCInfo(_BlockInfo): + """Stores information about an 'extern "C"' block.""" + + def __init__(self): + _BlockInfo.__init__(self, True) + + +class _ClassInfo(_BlockInfo): + """Stores information about a class.""" + + def __init__(self, name, class_or_struct, clean_lines, linenum): + _BlockInfo.__init__(self, False) + self.name = name + self.starting_linenum = linenum + self.is_derived = False + self.check_namespace_indentation = True + if class_or_struct == 'struct': + self.access = 'public' + self.is_struct = True + else: + self.access = 'private' + self.is_struct = False + + # Remember initial indentation level for this class. Using raw_lines here + # instead of elided to account for leading comments. + self.class_indent = GetIndentLevel(clean_lines.raw_lines[linenum]) + + # Try to find the end of the class. This will be confused by things like: + # class A { + # } *x = { ... + # + # But it's still good enough for CheckSectionSpacing. + self.last_line = 0 + depth = 0 + for i in range(linenum, clean_lines.NumLines()): + line = clean_lines.elided[i] + depth += line.count('{') - line.count('}') + if not depth: + self.last_line = i + break + + def CheckBegin(self, filename, clean_lines, linenum, error): + # Look for a bare ':' + if Search('(^|[^:]):($|[^:])', clean_lines.elided[linenum]): + self.is_derived = True + + def CheckEnd(self, filename, clean_lines, linenum, error): + # If there is a DISALLOW macro, it should appear near the end of + # the class. + seen_last_thing_in_class = False + for i in xrange(linenum - 1, self.starting_linenum, -1): + match = Search( + r'\b(DISALLOW_COPY_AND_ASSIGN|DISALLOW_IMPLICIT_CONSTRUCTORS)\(' + + self.name + r'\)', + clean_lines.elided[i]) + if match: + if seen_last_thing_in_class: + error(filename, i, 'readability/constructors', 3, + match.group(1) + ' should be the last thing in the class') + break + + if not Match(r'^\s*$', clean_lines.elided[i]): + seen_last_thing_in_class = True + + # Check that closing brace is aligned with beginning of the class. + # Only do this if the closing brace is indented by only whitespaces. + # This means we will not check single-line class definitions. + indent = Match(r'^( *)\}', clean_lines.elided[linenum]) + if indent and len(indent.group(1)) != self.class_indent: + if self.is_struct: + parent = 'struct ' + self.name + else: + parent = 'class ' + self.name + error(filename, linenum, 'whitespace/indent', 3, + 'Closing brace should be aligned with beginning of %s' % parent) + + +class _NamespaceInfo(_BlockInfo): + """Stores information about a namespace.""" + + def __init__(self, name, linenum): + _BlockInfo.__init__(self, False) + self.name = name or '' + self.starting_linenum = linenum + self.check_namespace_indentation = True + + def CheckEnd(self, filename, clean_lines, linenum, error): + """Check end of namespace comments.""" + line = clean_lines.raw_lines[linenum] + + # Check how many lines is enclosed in this namespace. Don't issue + # warning for missing namespace comments if there aren't enough + # lines. 
However, do apply checks if there is already an end of + # namespace comment and it's incorrect. + # + # TODO(unknown): We always want to check end of namespace comments + # if a namespace is large, but sometimes we also want to apply the + # check if a short namespace contained nontrivial things (something + # other than forward declarations). There is currently no logic on + # deciding what these nontrivial things are, so this check is + # triggered by namespace size only, which works most of the time. + if (linenum - self.starting_linenum < 10 + and not Match(r'};*\s*(//|/\*).*\bnamespace\b', line)): + return + + # Look for matching comment at end of namespace. + # + # Note that we accept C style "/* */" comments for terminating + # namespaces, so that code that terminate namespaces inside + # preprocessor macros can be cpplint clean. + # + # We also accept stuff like "// end of namespace ." with the + # period at the end. + # + # Besides these, we don't accept anything else, otherwise we might + # get false negatives when existing comment is a substring of the + # expected namespace. + if self.name: + # Named namespace + if not Match((r'};*\s*(//|/\*).*\bnamespace\s+' + re.escape(self.name) + + r'[\*/\.\\\s]*$'), + line): + error(filename, linenum, 'readability/namespace', 5, + 'Namespace should be terminated with "// namespace %s"' % + self.name) + else: + # Anonymous namespace + if not Match(r'};*\s*(//|/\*).*\bnamespace[\*/\.\\\s]*$', line): + # If "// namespace anonymous" or "// anonymous namespace (more text)", + # mention "// anonymous namespace" as an acceptable form + if Match(r'}.*\b(namespace anonymous|anonymous namespace)\b', line): + error(filename, linenum, 'readability/namespace', 5, + 'Anonymous namespace should be terminated with "// namespace"' + ' or "// anonymous namespace"') + else: + error(filename, linenum, 'readability/namespace', 5, + 'Anonymous namespace should be terminated with "// namespace"') + + +class _PreprocessorInfo(object): + """Stores checkpoints of nesting stacks when #if/#else is seen.""" + + def __init__(self, stack_before_if): + # The entire nesting stack before #if + self.stack_before_if = stack_before_if + + # The entire nesting stack up to #else + self.stack_before_else = [] + + # Whether we have already seen #else or #elif + self.seen_else = False + + +class NestingState(object): + """Holds states related to parsing braces.""" + + def __init__(self): + # Stack for tracking all braces. An object is pushed whenever we + # see a "{", and popped when we see a "}". Only 3 types of + # objects are possible: + # - _ClassInfo: a class or struct. + # - _NamespaceInfo: a namespace. + # - _BlockInfo: some other type of block. + self.stack = [] + + # Top of the previous stack before each Update(). + # + # Because the nesting_stack is updated at the end of each line, we + # had to do some convoluted checks to find out what is the current + # scope at the beginning of the line. This check is simplified by + # saving the previous top of nesting stack. + # + # We could save the full stack, but we only need the top. Copying + # the full nesting stack would slow down cpplint by ~10%. + self.previous_stack_top = [] + + # Stack of _PreprocessorInfo objects. + self.pp_stack = [] + + def SeenOpenBrace(self): + """Check if we have seen the opening brace for the innermost block. + + Returns: + True if we have seen the opening brace, False if the innermost + block is still expecting an opening brace. 
+ """ + return (not self.stack) or self.stack[-1].seen_open_brace + + def InNamespaceBody(self): + """Check if we are currently one level inside a namespace body. + + Returns: + True if top of the stack is a namespace block, False otherwise. + """ + return self.stack and isinstance(self.stack[-1], _NamespaceInfo) + + def InExternC(self): + """Check if we are currently one level inside an 'extern "C"' block. + + Returns: + True if top of the stack is an extern block, False otherwise. + """ + return self.stack and isinstance(self.stack[-1], _ExternCInfo) + + def InClassDeclaration(self): + """Check if we are currently one level inside a class or struct declaration. + + Returns: + True if top of the stack is a class/struct, False otherwise. + """ + return self.stack and isinstance(self.stack[-1], _ClassInfo) + + def InAsmBlock(self): + """Check if we are currently one level inside an inline ASM block. + + Returns: + True if the top of the stack is a block containing inline ASM. + """ + return self.stack and self.stack[-1].inline_asm != _NO_ASM + + def InTemplateArgumentList(self, clean_lines, linenum, pos): + """Check if current position is inside template argument list. + + Args: + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + pos: position just after the suspected template argument. + Returns: + True if (linenum, pos) is inside template arguments. + """ + while linenum < clean_lines.NumLines(): + # Find the earliest character that might indicate a template argument + line = clean_lines.elided[linenum] + match = Match(r'^[^{};=\[\]\.<>]*(.)', line[pos:]) + if not match: + linenum += 1 + pos = 0 + continue + token = match.group(1) + pos += len(match.group(0)) + + # These things do not look like template argument list: + # class Suspect { + # class Suspect x; } + if token in ('{', '}', ';'): return False + + # These things look like template argument list: + # template + # template + # template + # template + if token in ('>', '=', '[', ']', '.'): return True + + # Check if token is an unmatched '<'. + # If not, move on to the next character. + if token != '<': + pos += 1 + if pos >= len(line): + linenum += 1 + pos = 0 + continue + + # We can't be sure if we just find a single '<', and need to + # find the matching '>'. + (_, end_line, end_pos) = CloseExpression(clean_lines, linenum, pos - 1) + if end_pos < 0: + # Not sure if template argument list or syntax error in file + return False + linenum = end_line + pos = end_pos + return False + + def UpdatePreprocessor(self, line): + """Update preprocessor stack. + + We need to handle preprocessors due to classes like this: + #ifdef SWIG + struct ResultDetailsPageElementExtensionPoint { + #else + struct ResultDetailsPageElementExtensionPoint : public Extension { + #endif + + We make the following assumptions (good enough for most files): + - Preprocessor condition evaluates to true from #if up to first + #else/#elif/#endif. + + - Preprocessor condition evaluates to false from #else/#elif up + to #endif. We still perform lint checks on these lines, but + these do not affect nesting stack. + + Args: + line: current line to check. + """ + if Match(r'^\s*#\s*(if|ifdef|ifndef)\b', line): + # Beginning of #if block, save the nesting stack here. The saved + # stack will allow us to restore the parsing state in the #else case. 
+ self.pp_stack.append(_PreprocessorInfo(copy.deepcopy(self.stack))) + elif Match(r'^\s*#\s*(else|elif)\b', line): + # Beginning of #else block + if self.pp_stack: + if not self.pp_stack[-1].seen_else: + # This is the first #else or #elif block. Remember the + # whole nesting stack up to this point. This is what we + # keep after the #endif. + self.pp_stack[-1].seen_else = True + self.pp_stack[-1].stack_before_else = copy.deepcopy(self.stack) + + # Restore the stack to how it was before the #if + self.stack = copy.deepcopy(self.pp_stack[-1].stack_before_if) + else: + # TODO(unknown): unexpected #else, issue warning? + pass + elif Match(r'^\s*#\s*endif\b', line): + # End of #if or #else blocks. + if self.pp_stack: + # If we saw an #else, we will need to restore the nesting + # stack to its former state before the #else, otherwise we + # will just continue from where we left off. + if self.pp_stack[-1].seen_else: + # Here we can just use a shallow copy since we are the last + # reference to it. + self.stack = self.pp_stack[-1].stack_before_else + # Drop the corresponding #if + self.pp_stack.pop() + else: + # TODO(unknown): unexpected #endif, issue warning? + pass + + # TODO(unknown): Update() is too long, but we will refactor later. + def Update(self, filename, clean_lines, linenum, error): + """Update nesting state with current line. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + line = clean_lines.elided[linenum] + + # Remember top of the previous nesting stack. + # + # The stack is always pushed/popped and not modified in place, so + # we can just do a shallow copy instead of copy.deepcopy. Using + # deepcopy would slow down cpplint by ~28%. + if self.stack: + self.previous_stack_top = self.stack[-1] + else: + self.previous_stack_top = None + + # Update pp_stack + self.UpdatePreprocessor(line) + + # Count parentheses. This is to avoid adding struct arguments to + # the nesting stack. + if self.stack: + inner_block = self.stack[-1] + depth_change = line.count('(') - line.count(')') + inner_block.open_parentheses += depth_change + + # Also check if we are starting or ending an inline assembly block. + if inner_block.inline_asm in (_NO_ASM, _END_ASM): + if (depth_change != 0 and + inner_block.open_parentheses == 1 and + _MATCH_ASM.match(line)): + # Enter assembly block + inner_block.inline_asm = _INSIDE_ASM + else: + # Not entering assembly block. If previous line was _END_ASM, + # we will now shift to _NO_ASM state. + inner_block.inline_asm = _NO_ASM + elif (inner_block.inline_asm == _INSIDE_ASM and + inner_block.open_parentheses == 0): + # Exit assembly block + inner_block.inline_asm = _END_ASM + + # Consume namespace declaration at the beginning of the line. Do + # this in a loop so that we catch same line declarations like this: + # namespace proto2 { namespace bridge { class MessageSet; } } + while True: + # Match start of namespace. The "\b\s*" below catches namespace + # declarations even if it weren't followed by a whitespace, this + # is so that we don't confuse our namespace checker. The + # missing spaces will be flagged by CheckSpacing. 
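+      # For example, "namespace foo {", "namespace foo{" (missing space),
+      # and the anonymous "namespace {" are all consumed here; for the
+      # anonymous form, group(1) below is None and the _NamespaceInfo
+      # gets an empty name.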
+      namespace_decl_match = Match(r'^\s*namespace\b\s*([:\w]+)?(.*)$', line)
+      if not namespace_decl_match:
+        break
+
+      new_namespace = _NamespaceInfo(namespace_decl_match.group(1), linenum)
+      self.stack.append(new_namespace)
+
+      line = namespace_decl_match.group(2)
+      if line.find('{') != -1:
+        new_namespace.seen_open_brace = True
+        line = line[line.find('{') + 1:]
+
+    # Look for a class declaration in whatever is left of the line
+    # after parsing namespaces. The regexp accounts for decorated classes
+    # such as in:
+    #   class LOCKABLE API Object {
+    #   };
+    class_decl_match = Match(
+        r'^(\s*(?:template\s*<[\w\s<>,:]*>\s*)?'
+        r'(class|struct)\s+(?:[A-Z_]+\s+)*(\w+(?:::\w+)*))'
+        r'(.*)$', line)
+    if (class_decl_match and
+        (not self.stack or self.stack[-1].open_parentheses == 0)):
+      # We do not want to accept classes that are actually template arguments:
+      #   template <class Ignore1,
+      #             class Ignore2 = Default<args>,
+      #             template <typename> class Ignore3>
+      #   void Function() {};
+      #
+      # To avoid template argument cases, we scan forward and look for
+      # an unmatched '>'. If we see one, assume we are inside a
+      # template argument list.
+      end_declaration = len(class_decl_match.group(1))
+      if not self.InTemplateArgumentList(clean_lines, linenum, end_declaration):
+        self.stack.append(_ClassInfo(
+            class_decl_match.group(3), class_decl_match.group(2),
+            clean_lines, linenum))
+        line = class_decl_match.group(4)
+
+    # If we have not yet seen the opening brace for the innermost block,
+    # run checks here.
+    if not self.SeenOpenBrace():
+      self.stack[-1].CheckBegin(filename, clean_lines, linenum, error)
+
+    # Update access control if we are inside a class/struct
+    if self.stack and isinstance(self.stack[-1], _ClassInfo):
+      classinfo = self.stack[-1]
+      access_match = Match(
+          r'^(.*)\b(public|private|protected|signals)(\s+(?:slots\s*)?)?'
+          r':(?:[^:]|$)',
+          line)
+      if access_match:
+        classinfo.access = access_match.group(2)
+
+        # Check that access keywords are indented +1 space. Skip this
+        # check if the keywords are not preceded by whitespaces.
+        indent = access_match.group(1)
+        if (len(indent) != classinfo.class_indent + 1 and
+            Match(r'^\s*$', indent)):
+          if classinfo.is_struct:
+            parent = 'struct ' + classinfo.name
+          else:
+            parent = 'class ' + classinfo.name
+          slots = ''
+          if access_match.group(3):
+            slots = access_match.group(3)
+          error(filename, linenum, 'whitespace/indent', 3,
+                '%s%s: should be indented +1 space inside %s' % (
+                    access_match.group(2), slots, parent))
+
+    # Consume braces or semicolons from what's left of the line
+    while True:
+      # Match first brace, semicolon, or closed parenthesis.
+      matched = Match(r'^[^{;)}]*([{;)}])(.*)$', line)
+      if not matched:
+        break
+
+      token = matched.group(1)
+      if token == '{':
+        # If namespace or class hasn't seen an opening brace yet, mark
+        # namespace/class head as complete. Push a new block onto the
+        # stack otherwise.
+        if not self.SeenOpenBrace():
+          self.stack[-1].seen_open_brace = True
+        elif Match(r'^extern\s*"[^"]*"\s*\{', line):
+          self.stack.append(_ExternCInfo())
+        else:
+          self.stack.append(_BlockInfo(True))
+          if _MATCH_ASM.match(line):
+            self.stack[-1].inline_asm = _BLOCK_ASM
+
+      elif token == ';' or token == ')':
+        # If we haven't seen an opening brace yet, but we already saw
+        # a semicolon, this is probably a forward declaration. Pop
+        # the stack for these.
+        #
+        # Similarly, if we haven't seen an opening brace yet, but we
+        # already saw a closing parenthesis, then these are probably
+        # function arguments with extra "class" or "struct" keywords.
+        # Also pop the stack for these.
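+        # For example, the forward declaration "class Foo;" pushes a
+        # _ClassInfo above and immediately pops it here at the ';', so
+        # it never counts as an open class.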
+        if not self.SeenOpenBrace():
+          self.stack.pop()
+      else:  # token == '}'
+        # Perform end of block checks and pop the stack.
+        if self.stack:
+          self.stack[-1].CheckEnd(filename, clean_lines, linenum, error)
+          self.stack.pop()
+      line = matched.group(2)
+
+  def InnermostClass(self):
+    """Get class info on the top of the stack.
+
+    Returns:
+      A _ClassInfo object if we are inside a class, or None otherwise.
+    """
+    for i in range(len(self.stack), 0, -1):
+      classinfo = self.stack[i - 1]
+      if isinstance(classinfo, _ClassInfo):
+        return classinfo
+    return None
+
+  def CheckCompletedBlocks(self, filename, error):
+    """Checks that all classes and namespaces have been completely parsed.
+
+    Call this when all lines in a file have been processed.
+    Args:
+      filename: The name of the current file.
+      error: The function to call with any errors found.
+    """
+    # Note: This test can result in false positives if #ifdef constructs
+    # get in the way of brace matching. See the testBuildClass test in
+    # cpplint_unittest.py for an example of this.
+    for obj in self.stack:
+      if isinstance(obj, _ClassInfo):
+        error(filename, obj.starting_linenum, 'build/class', 5,
+              'Failed to find complete declaration of class %s' %
+              obj.name)
+      elif isinstance(obj, _NamespaceInfo):
+        error(filename, obj.starting_linenum, 'build/namespaces', 5,
+              'Failed to find complete declaration of namespace %s' %
+              obj.name)
+
+
+def CheckForNonStandardConstructs(filename, clean_lines, linenum,
+                                  nesting_state, error):
+  r"""Logs an error if we see certain non-ANSI constructs ignored by gcc-2.
+
+  Complain about several constructs which gcc-2 accepts, but which are
+  not standard C++. Warning about these in lint is one way to ease the
+  transition to new compilers.
+  - put storage class first (e.g. "static const" instead of "const static").
+  - "%lld" instead of "%qd" in printf-type functions.
+  - "%1$d" is non-standard in printf-type functions.
+  - "\%" is an undefined character escape sequence.
+  - text after #endif is not allowed.
+  - invalid inner-style forward declaration.
+  - >? and <? operators, and their >?= and <?= cousins.
+
+  Additionally, check for constructor/destructor style violations and reference
+  members, as it is very convenient to do so while checking for
+  gcc-2 compliance.
+
+  Args:
+    filename: The name of the current file.
+    clean_lines: A CleansedLines instance containing the file.
+    linenum: The number of the line to check.
+    nesting_state: A NestingState instance which maintains information about
+                   the current stack of nested blocks being parsed.
+    error: A callable to which errors are reported, which takes 4 arguments:
+           filename, line number, error level, and message
+  """
+
+  # Remove comments from the line, but leave in strings for now.
+  line = clean_lines.lines[linenum]
+
+  if Search(r'printf\s*\(.*".*%[-+ ]?\d*q', line):
+    error(filename, linenum, 'runtime/printf_format', 3,
+          '%q in format strings is deprecated. Use %ll instead.')
+
+  if Search(r'printf\s*\(.*".*%\d+\$', line):
+    error(filename, linenum, 'runtime/printf_format', 2,
+          '%N$ formats are unconventional. Try rewriting to avoid them.')
+
+  # Remove escaped backslashes before looking for undefined escape sequences.
+  line = line.replace('\\\\', '')
+
+  if Search(r'("|\').*\\(%|\[|\(|{)', line):
+    error(filename, linenum, 'build/printf_format', 3,
+          '%, [, (, and { are undefined character escapes. Unescape them.')
+
+  # For the rest, work with both comments and strings removed.
+  line = clean_lines.elided[linenum]
+
+  if Search(r'\b(const|volatile|void|char|short|int|long'
+            r'|float|double|signed|unsigned'
+            r'|schar|u?int8|u?int16|u?int32|u?int64)'
+            r'\s+(register|static|extern|typedef)\b',
+            line):
+    error(filename, linenum, 'build/storage_class', 5,
+          'Storage class (static, extern, typedef, etc) should be first.')
+
+  if Match(r'\s*#\s*endif\s*[^/\s]+', line):
+    error(filename, linenum, 'build/endif_comment', 5,
+          'Uncommented text after #endif is non-standard. Use a comment.')
+
+  if Match(r'\s*class\s+(\w+\s*::\s*)+\w+\s*;', line):
+    error(filename, linenum, 'build/forward_decl', 5,
+          'Inner-style forward declarations are invalid. Remove this line.')
+
+  if Search(r'(\w+|[+-]?\d+(\.\d*)?)\s*(<|>)\?=?\s*(\w+|[+-]?\d+)(\.\d*)?',
+            line):
+    error(filename, linenum, 'build/deprecated', 3,
+          '>? and <? (max and min) operators are non-standard and deprecated.')
+
+  if Search(r'^\s*const\s*string\s*&\s*\w+\s*;', line):
+    # TODO(unknown): Could it be expanded safely to arbitrary references,
+    # without triggering too many false positives? The first
+    # attempt triggered 5 warnings for mostly benign code in the regtest, hence
+    # the restriction.
+    # Here's the original regexp, for the reference:
+    # type_name = r'\w+((\s*::\s*\w+)|(\s*<\s*\w+?\s*>))?'
+    # r'\s*const\s*' + type_name + '\s*&\s*\w+\s*;'
+    error(filename, linenum, 'runtime/member_string_references', 2,
+          'const string& members are dangerous. It is much better to use '
+          'alternatives, such as pointers or simple constants.')
+
+  # Everything else in this function operates on class declarations.
+  # Return early if the top of the nesting stack is not a class, or if
+  # the class head is not completed yet.
+  classinfo = nesting_state.InnermostClass()
+  if not classinfo or not classinfo.seen_open_brace:
+    return
+
+  # The class may have been declared with namespace or classname qualifiers.
+  # The constructor and destructor will not have those qualifiers.
+  base_classname = classinfo.name.split('::')[-1]
+
+  # Look for single-argument constructors that aren't marked explicit.
+  # Technically a valid construct, but against style. Also look for
+  # non-single-argument constructors which are also technically valid, but
+  # strongly suggest something is wrong.
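+  # For example, for "class Foo", the declaration "Foo(int x);" should be
+  # "explicit Foo(int x);", while marking "Foo(int x, int y);" explicit
+  # draws the opposite complaint below, since a two-argument constructor
+  # cannot be invoked implicitly anyway.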
+ explicit_constructor_match = Match( + r'\s+(?:inline\s+)?(explicit\s+)?(?:inline\s+)?%s\s*' + r'\(((?:[^()]|\([^()]*\))*)\)' + % re.escape(base_classname), + line) + + if explicit_constructor_match: + is_marked_explicit = explicit_constructor_match.group(1) + + if not explicit_constructor_match.group(2): + constructor_args = [] + else: + constructor_args = explicit_constructor_match.group(2).split(',') + + # collapse arguments so that commas in template parameter lists and function + # argument parameter lists don't split arguments in two + i = 0 + while i < len(constructor_args): + constructor_arg = constructor_args[i] + while (constructor_arg.count('<') > constructor_arg.count('>') or + constructor_arg.count('(') > constructor_arg.count(')')): + constructor_arg += ',' + constructor_args[i + 1] + del constructor_args[i + 1] + constructor_args[i] = constructor_arg + i += 1 + + defaulted_args = [arg for arg in constructor_args if '=' in arg] + noarg_constructor = (not constructor_args or # empty arg list + # 'void' arg specifier + (len(constructor_args) == 1 and + constructor_args[0].strip() == 'void')) + onearg_constructor = ((len(constructor_args) == 1 and # exactly one arg + not noarg_constructor) or + # all but at most one arg defaulted + (len(constructor_args) >= 1 and + not noarg_constructor and + len(defaulted_args) >= len(constructor_args) - 1)) + initializer_list_constructor = bool( + onearg_constructor and + Search(r'\bstd\s*::\s*initializer_list\b', constructor_args[0])) + copy_constructor = bool( + onearg_constructor and + Match(r'(const\s+)?%s(\s*<[^>]*>)?(\s+const)?\s*(?:<\w+>\s*)?&' + % re.escape(base_classname), constructor_args[0].strip())) + + if (not is_marked_explicit and + onearg_constructor and + not initializer_list_constructor and + not copy_constructor): + if defaulted_args: + error(filename, linenum, 'runtime/explicit', 5, + 'Constructors callable with one argument ' + 'should be marked explicit.') + else: + error(filename, linenum, 'runtime/explicit', 5, + 'Single-parameter constructors should be marked explicit.') + elif is_marked_explicit and not onearg_constructor: + if noarg_constructor: + error(filename, linenum, 'runtime/explicit', 5, + 'Zero-parameter constructors should not be marked explicit.') + else: + error(filename, linenum, 'runtime/explicit', 0, + 'Constructors that require multiple arguments ' + 'should not be marked explicit.') + + +def CheckSpacingForFunctionCall(filename, clean_lines, linenum, error): + """Checks for the correctness of various spacing around function calls. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + line = clean_lines.elided[linenum] + + # Since function calls often occur inside if/for/while/switch + # expressions - which have their own, more liberal conventions - we + # first see if we should be looking inside such an expression for a + # function call, to which we can apply more strict standards. + fncall = line # if there's no control flow construct, look at whole line + for pattern in (r'\bif\s*\((.*)\)\s*{', + r'\bfor\s*\((.*)\)\s*{', + r'\bwhile\s*\((.*)\)\s*[{;]', + r'\bswitch\s*\((.*)\)\s*{'): + match = Search(pattern, line) + if match: + fncall = match.group(1) # look inside the parens for function calls + break + + # Except in if/for/while/switch, there should never be space + # immediately inside parens (eg "f( 3, 4 )"). 
We make an exception + # for nested parens ( (a+b) + c ). Likewise, there should never be + # a space before a ( when it's a function argument. I assume it's a + # function argument when the char before the whitespace is legal in + # a function name (alnum + _) and we're not starting a macro. Also ignore + # pointers and references to arrays and functions coz they're too tricky: + # we use a very simple way to recognize these: + # " (something)(maybe-something)" or + # " (something)(maybe-something," or + # " (something)[something]" + # Note that we assume the contents of [] to be short enough that + # they'll never need to wrap. + if ( # Ignore control structures. + not Search(r'\b(if|for|while|switch|return|new|delete|catch|sizeof)\b', + fncall) and + # Ignore pointers/references to functions. + not Search(r' \([^)]+\)\([^)]*(\)|,$)', fncall) and + # Ignore pointers/references to arrays. + not Search(r' \([^)]+\)\[[^\]]+\]', fncall)): + if Search(r'\w\s*\(\s(?!\s*\\$)', fncall): # a ( used for a fn call + error(filename, linenum, 'whitespace/parens', 4, + 'Extra space after ( in function call') + elif Search(r'\(\s+(?!(\s*\\)|\()', fncall): + error(filename, linenum, 'whitespace/parens', 2, + 'Extra space after (') + if (Search(r'\w\s+\(', fncall) and + not Search(r'#\s*define|typedef|using\s+\w+\s*=', fncall) and + not Search(r'\w\s+\((\w+::)*\*\w+\)\(', fncall) and + not Search(r'\bcase\s+\(', fncall)): + # TODO(unknown): Space after an operator function seem to be a common + # error, silence those for now by restricting them to highest verbosity. + if Search(r'\boperator_*\b', line): + error(filename, linenum, 'whitespace/parens', 0, + 'Extra space before ( in function call') + else: + error(filename, linenum, 'whitespace/parens', 4, + 'Extra space before ( in function call') + # If the ) is followed only by a newline or a { + newline, assume it's + # part of a control statement (if/while/etc), and don't complain + if Search(r'[^)]\s+\)\s*[^{\s]', fncall): + # If the closing parenthesis is preceded by only whitespaces, + # try to give a more descriptive error message. + if Search(r'^\s+\)', fncall): + error(filename, linenum, 'whitespace/parens', 2, + 'Closing ) should be moved to the previous line') + else: + error(filename, linenum, 'whitespace/parens', 2, + 'Extra space before )') + + +def IsBlankLine(line): + """Returns true if the given line is blank. + + We consider a line to be blank if the line is empty or consists of + only white spaces. + + Args: + line: A line of a string. + + Returns: + True, if the given line is blank. + """ + return not line or line.isspace() + + +def CheckForNamespaceIndentation(filename, nesting_state, clean_lines, line, + error): + is_namespace_indent_item = ( + len(nesting_state.stack) > 1 and + nesting_state.stack[-1].check_namespace_indentation and + isinstance(nesting_state.previous_stack_top, _NamespaceInfo) and + nesting_state.previous_stack_top == nesting_state.stack[-2]) + + if ShouldCheckNamespaceIndentation(nesting_state, is_namespace_indent_item, + clean_lines.elided, line): + CheckItemIndentationInNamespace(filename, clean_lines.elided, + line, error) + + +def CheckForFunctionLengths(filename, clean_lines, linenum, + function_state, error): + """Reports for long function bodies. + + For an overview why this is done, see: + http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml#Write_Short_Functions + + Uses a simplistic algorithm assuming other style guidelines + (especially spacing) are followed. 
+ Only checks unindented functions, so class members are unchecked. + Trivial bodies are unchecked, so constructors with huge initializer lists + may be missed. + Blank/comment lines are not counted so as to avoid encouraging the removal + of vertical space and comments just to get through a lint check. + NOLINT *on the last line of a function* disables this check. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + function_state: Current function name and lines in body so far. + error: The function to call with any errors found. + """ + lines = clean_lines.lines + line = lines[linenum] + joined_line = '' + + starting_func = False + regexp = r'(\w(\w|::|\*|\&|\s)*)\(' # decls * & space::name( ... + match_result = Match(regexp, line) + if match_result: + # If the name is all caps and underscores, figure it's a macro and + # ignore it, unless it's TEST or TEST_F. + function_name = match_result.group(1).split()[-1] + if function_name == 'TEST' or function_name == 'TEST_F' or ( + not Match(r'[A-Z_]+$', function_name)): + starting_func = True + + if starting_func: + body_found = False + for start_linenum in xrange(linenum, clean_lines.NumLines()): + start_line = lines[start_linenum] + joined_line += ' ' + start_line.lstrip() + if Search(r'(;|})', start_line): # Declarations and trivial functions + body_found = True + break # ... ignore + elif Search(r'{', start_line): + body_found = True + function = Search(r'((\w|:)*)\(', line).group(1) + if Match(r'TEST', function): # Handle TEST... macros + parameter_regexp = Search(r'(\(.*\))', joined_line) + if parameter_regexp: # Ignore bad syntax + function += parameter_regexp.group(1) + else: + function += '()' + function_state.Begin(function) + break + if not body_found: + # No body for the function (or evidence of a non-function) was found. + error(filename, linenum, 'readability/fn_size', 5, + 'Lint failed to find start of function body.') + elif Match(r'^\}\s*$', line): # function end + function_state.Check(error, filename, linenum) + function_state.End() + elif not Match(r'^\s*$', line): + function_state.Count() # Count non-blank/non-comment lines. + + +_RE_PATTERN_TODO = re.compile(r'^//(\s*)TODO(\(.+?\))?:?(\s|$)?') + + +def CheckComment(line, filename, linenum, next_line_start, error): + """Checks for common mistakes in comments. + + Args: + line: The line in question. + filename: The name of the current file. + linenum: The number of the line to check. + next_line_start: The first non-whitespace column of the next line. + error: The function to call with any errors found. + """ + commentpos = line.find('//') + if commentpos != -1: + # Check if the // may be in quotes. If so, ignore it + # Comparisons made explicit for clarity -- pylint: disable=g-explicit-bool-comparison + if (line.count('"', 0, commentpos) - + line.count('\\"', 0, commentpos)) % 2 == 0: # not in quotes + # Allow one space for new scopes, two spaces otherwise: + if (not (Match(r'^.*{ *//', line) and next_line_start == commentpos) and + ((commentpos >= 1 and + line[commentpos-1] not in string.whitespace) or + (commentpos >= 2 and + line[commentpos-2] not in string.whitespace))): + error(filename, linenum, 'whitespace/comments', 2, + 'At least two spaces is best between code and comments') + + # Checks for common mistakes in TODO comments. 
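+      # For example, given the canonical form "// TODO(my_username): Stuff.":
+      #   //    TODO(bob): ...    -> too many spaces before TODO
+      #   // TODO: fix this       -> missing username
+      #   // TODO(bob):fix this   -> TODO(...) should be followed by a space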
+ comment = line[commentpos:] + match = _RE_PATTERN_TODO.match(comment) + if match: + # One whitespace is correct; zero whitespace is handled elsewhere. + leading_whitespace = match.group(1) + if len(leading_whitespace) > 1: + error(filename, linenum, 'whitespace/todo', 2, + 'Too many spaces before TODO') + + username = match.group(2) + if not username: + error(filename, linenum, 'readability/todo', 2, + 'Missing username in TODO; it should look like ' + '"// TODO(my_username): Stuff."') + + middle_whitespace = match.group(3) + # Comparisons made explicit for correctness -- pylint: disable=g-explicit-bool-comparison + if middle_whitespace != ' ' and middle_whitespace != '': + error(filename, linenum, 'whitespace/todo', 2, + 'TODO(my_username) should be followed by a space') + + # If the comment contains an alphanumeric character, there + # should be a space somewhere between it and the // unless + # it's a /// or //! Doxygen comment. + if (Match(r'//[^ ]*\w', comment) and + not Match(r'(///|//\!)(\s+|$)', comment)): + error(filename, linenum, 'whitespace/comments', 4, + 'Should have a space between // and comment') + + +def CheckAccess(filename, clean_lines, linenum, nesting_state, error): + """Checks for improper use of DISALLOW* macros. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + nesting_state: A NestingState instance which maintains information about + the current stack of nested blocks being parsed. + error: The function to call with any errors found. + """ + line = clean_lines.elided[linenum] # get rid of comments and strings + + matched = Match((r'\s*(DISALLOW_COPY_AND_ASSIGN|' + r'DISALLOW_IMPLICIT_CONSTRUCTORS)'), line) + if not matched: + return + if nesting_state.stack and isinstance(nesting_state.stack[-1], _ClassInfo): + if nesting_state.stack[-1].access != 'private': + error(filename, linenum, 'readability/constructors', 3, + '%s must be in the private: section' % matched.group(1)) + + else: + # Found DISALLOW* macro outside a class declaration, or perhaps it + # was used inside a function when it should have been part of the + # class declaration. We could issue a warning here, but it + # probably resulted in a compiler error already. + pass + + +def CheckSpacing(filename, clean_lines, linenum, nesting_state, error): + """Checks for the correctness of various spacing issues in the code. + + Things we check for: spaces around operators, spaces after + if/for/while/switch, no spaces around parens in function calls, two + spaces between code and comment, don't start a block with a blank + line, don't end a function with a blank line, don't add a blank line + after public/protected/private, don't have too many blank lines in a row. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + nesting_state: A NestingState instance which maintains information about + the current stack of nested blocks being parsed. + error: The function to call with any errors found. + """ + + # Don't use "elided" lines here, otherwise we can't check commented lines. + # Don't want to use "raw" either, because we don't want to check inside C++11 + # raw strings, + raw = clean_lines.lines_without_raw_strings + line = raw[linenum] + + # Before nixing comments, check if the line is blank for no good + # reason. 
This includes the first line after a block is opened, and + # blank lines at the end of a function (ie, right before a line like '}' + # + # Skip all the blank line checks if we are immediately inside a + # namespace body. In other words, don't issue blank line warnings + # for this block: + # namespace { + # + # } + # + # A warning about missing end of namespace comments will be issued instead. + # + # Also skip blank line checks for 'extern "C"' blocks, which are formatted + # like namespaces. + if (IsBlankLine(line) and + not nesting_state.InNamespaceBody() and + not nesting_state.InExternC()): + elided = clean_lines.elided + prev_line = elided[linenum - 1] + prevbrace = prev_line.rfind('{') + # TODO(unknown): Don't complain if line before blank line, and line after, + # both start with alnums and are indented the same amount. + # This ignores whitespace at the start of a namespace block + # because those are not usually indented. + if prevbrace != -1 and prev_line[prevbrace:].find('}') == -1: + # OK, we have a blank line at the start of a code block. Before we + # complain, we check if it is an exception to the rule: The previous + # non-empty line has the parameters of a function header that are indented + # 4 spaces (because they did not fit in a 80 column line when placed on + # the same line as the function name). We also check for the case where + # the previous line is indented 6 spaces, which may happen when the + # initializers of a constructor do not fit into a 80 column line. + exception = False + if Match(r' {6}\w', prev_line): # Initializer list? + # We are looking for the opening column of initializer list, which + # should be indented 4 spaces to cause 6 space indentation afterwards. + search_position = linenum-2 + while (search_position >= 0 + and Match(r' {6}\w', elided[search_position])): + search_position -= 1 + exception = (search_position >= 0 + and elided[search_position][:5] == ' :') + else: + # Search for the function arguments or an initializer list. We use a + # simple heuristic here: If the line is indented 4 spaces; and we have a + # closing paren, without the opening paren, followed by an opening brace + # or colon (for initializer lists) we assume that it is the last line of + # a function header. If we have a colon indented 4 spaces, it is an + # initializer list. 
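+        # For example, a blank line after either of these previous lines
+        # is tolerated:
+        #   "    int argument2) {"   (end of a wrapped function header)
+        #   "    : member_(42),"     (start of a constructor initializer list)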
+ exception = (Match(r' {4}\w[^\(]*\)\s*(const\s*)?(\{\s*$|:)', + prev_line) + or Match(r' {4}:', prev_line)) + + if not exception: + error(filename, linenum, 'whitespace/blank_line', 2, + 'Redundant blank line at the start of a code block ' + 'should be deleted.') + # Ignore blank lines at the end of a block in a long if-else + # chain, like this: + # if (condition1) { + # // Something followed by a blank line + # + # } else if (condition2) { + # // Something else + # } + if linenum + 1 < clean_lines.NumLines(): + next_line = raw[linenum + 1] + if (next_line + and Match(r'\s*}', next_line) + and next_line.find('} else ') == -1): + error(filename, linenum, 'whitespace/blank_line', 3, + 'Redundant blank line at the end of a code block ' + 'should be deleted.') + + matched = Match(r'\s*(public|protected|private):', prev_line) + if matched: + error(filename, linenum, 'whitespace/blank_line', 3, + 'Do not leave a blank line after "%s:"' % matched.group(1)) + + # Next, check comments + next_line_start = 0 + if linenum + 1 < clean_lines.NumLines(): + next_line = raw[linenum + 1] + next_line_start = len(next_line) - len(next_line.lstrip()) + CheckComment(line, filename, linenum, next_line_start, error) + + # get rid of comments and strings + line = clean_lines.elided[linenum] + + # You shouldn't have spaces before your brackets, except maybe after + # 'delete []' or 'return []() {};' + if Search(r'\w\s+\[', line) and not Search(r'(?:delete|return)\s+\[', line): + error(filename, linenum, 'whitespace/braces', 5, + 'Extra space before [') + + # In range-based for, we wanted spaces before and after the colon, but + # not around "::" tokens that might appear. + if (Search(r'for *\(.*[^:]:[^: ]', line) or + Search(r'for *\(.*[^: ]:[^:]', line)): + error(filename, linenum, 'whitespace/forcolon', 2, + 'Missing space around colon in range-based for loop') + + +def CheckOperatorSpacing(filename, clean_lines, linenum, error): + """Checks for horizontal spacing around operators. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + line = clean_lines.elided[linenum] + + # Don't try to do spacing checks for operator methods. Do this by + # replacing the troublesome characters with something else, + # preserving column position for all other characters. + # + # The replacement is done repeatedly to avoid false positives from + # operators that call operators. + while True: + match = Match(r'^(.*\boperator\b)(\S+)(\s*\(.*)$', line) + if match: + line = match.group(1) + ('_' * len(match.group(2))) + match.group(3) + else: + break + + # We allow no-spaces around = within an if: "if ( (a=Foo()) == 0 )". + # Otherwise not. Note we only check for non-spaces on *both* sides; + # sometimes people put non-spaces on one side when aligning ='s among + # many lines (not that this is behavior that I approve of...) + if ((Search(r'[\w.]=', line) or + Search(r'=[\w.]', line)) + and not Search(r'\b(if|while|for) ', line) + # Operators taken from [lex.operators] in C++11 standard. + and not Search(r'(>=|<=|==|!=|&=|\^=|\|=|\+=|\*=|\/=|\%=)', line) + and not Search(r'operator=', line)): + error(filename, linenum, 'whitespace/operators', 4, + 'Missing spaces around =') + + # It's ok not to have spaces around binary operators like + - * /, but if + # there's too little whitespace, we get concerned. It's hard to tell, + # though, so we punt on this one for now. TODO. 
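+  # For example, "if ((a=Foo()) == 0)" is accepted by the '=' check above,
+  # while "a=Foo();" outside of an if/while/for condition is flagged.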
+
+  # You should always have whitespace around binary operators.
+  #
+  # Check <= and >= first to avoid false positives with < and >, then
+  # check non-include lines for spacing around < and >.
+  #
+  # If the operator is followed by a comma, assume it's being used in a
+  # macro context and don't do any checks. This avoids false
+  # positives.
+  #
+  # Note that && is not included here. Those are checked separately
+  # in CheckRValueReference
+  match = Search(r'[^<>=!\s](==|!=|<=|>=|\|\|)[^<>=!\s,;\)]', line)
+  if match:
+    error(filename, linenum, 'whitespace/operators', 3,
+          'Missing spaces around %s' % match.group(1))
+  elif not Match(r'#.*include', line):
+    # Look for < that is not surrounded by spaces. This is only
+    # triggered if both sides are missing spaces, even though
+    # technically we should flag if at least one side is missing a
+    # space. This is done to avoid some false positives with shifts.
+    match = Match(r'^(.*[^\s<])<[^\s=<,]', line)
+    if match:
+      (_, _, end_pos) = CloseExpression(
+          clean_lines, linenum, len(match.group(1)))
+      if end_pos <= -1:
+        error(filename, linenum, 'whitespace/operators', 3,
+              'Missing spaces around <')
+
+    # Look for > that is not surrounded by spaces. Similar to the
+    # above, we only trigger if both sides are missing spaces to avoid
+    # false positives with shifts.
+    match = Match(r'^(.*[^-\s>])>[^\s=>,]', line)
+    if match:
+      (_, _, start_pos) = ReverseCloseExpression(
+          clean_lines, linenum, len(match.group(1)))
+      if start_pos <= -1:
+        error(filename, linenum, 'whitespace/operators', 3,
+              'Missing spaces around >')
+
+  # We allow no-spaces around << when used like this: 10<<20, but
+  # not otherwise (particularly, not when used as streams)
+  #
+  # We also allow operators following an opening parenthesis, since
+  # those tend to be macros that deal with operators.
+  match = Search(r'(operator|[^\s(<])(?:L|UL|ULL|l|ul|ull)?<<([^\s,=<])', line)
+  if (match and not (match.group(1).isdigit() and match.group(2).isdigit()) and
+      not (match.group(1) == 'operator' and match.group(2) == ';')):
+    error(filename, linenum, 'whitespace/operators', 3,
+          'Missing spaces around <<')
+
+  # We allow no-spaces around >> for almost anything. This is because
+  # C++11 allows ">>" to close nested templates, which accounts for
+  # most cases when ">>" is not followed by a space.
+  #
+  # We still warn on ">>" followed by alpha character, because that is
+  # likely due to ">>" being used for right shifts, e.g.:
+  #   value >> alpha
+  #
+  # When ">>" is used to close templates, the alphanumeric letter that
+  # follows would be part of an identifier, and there should still be
+  # a space separating the template type and the identifier.
+  #   type<type<type>> alpha
+  match = Search(r'>>[a-zA-Z_]', line)
+  if match:
+    error(filename, linenum, 'whitespace/operators', 3,
+          'Missing spaces around >>')
+
+  # There shouldn't be space around unary operators
+  match = Search(r'(!\s|~\s|[\s]--[\s;]|[\s]\+\+[\s;])', line)
+  if match:
+    error(filename, linenum, 'whitespace/operators', 4,
+          'Extra space for operator %s' % match.group(1))
+
+
+def CheckParenthesisSpacing(filename, clean_lines, linenum, error):
+  """Checks for horizontal spacing around parentheses.
+
+  Args:
+    filename: The name of the current file.
+    clean_lines: A CleansedLines instance containing the file.
+    linenum: The number of the line to check.
+    error: The function to call with any errors found.
+ """ + line = clean_lines.elided[linenum] + + # No spaces after an if, while, switch, or for + match = Search(r' (if\(|for\(|while\(|switch\()', line) + if match: + error(filename, linenum, 'whitespace/parens', 5, + 'Missing space before ( in %s' % match.group(1)) + + # For if/for/while/switch, the left and right parens should be + # consistent about how many spaces are inside the parens, and + # there should either be zero or one spaces inside the parens. + # We don't want: "if ( foo)" or "if ( foo )". + # Exception: "for ( ; foo; bar)" and "for (foo; bar; )" are allowed. + match = Search(r'\b(if|for|while|switch)\s*' + r'\(([ ]*)(.).*[^ ]+([ ]*)\)\s*{\s*$', + line) + if match: + if len(match.group(2)) != len(match.group(4)): + if not (match.group(3) == ';' and + len(match.group(2)) == 1 + len(match.group(4)) or + not match.group(2) and Search(r'\bfor\s*\(.*; \)', line)): + error(filename, linenum, 'whitespace/parens', 5, + 'Mismatching spaces inside () in %s' % match.group(1)) + if len(match.group(2)) not in [0, 1]: + error(filename, linenum, 'whitespace/parens', 5, + 'Should have zero or one spaces inside ( and ) in %s' % + match.group(1)) + + +def CheckCommaSpacing(filename, clean_lines, linenum, error): + """Checks for horizontal spacing near commas and semicolons. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + raw = clean_lines.lines_without_raw_strings + line = clean_lines.elided[linenum] + + # You should always have a space after a comma (either as fn arg or operator) + # + # This does not apply when the non-space character following the + # comma is another comma, since the only time when that happens is + # for empty macro arguments. + # + # We run this check in two passes: first pass on elided lines to + # verify that lines contain missing whitespaces, second pass on raw + # lines to confirm that those missing whitespaces are not due to + # elided comments. + if (Search(r',[^,\s]', ReplaceAll(r'\boperator\s*,\s*\(', 'F(', line)) and + Search(r',[^,\s]', raw[linenum])): + error(filename, linenum, 'whitespace/comma', 3, + 'Missing space after ,') + + # You should always have a space after a semicolon + # except for few corner cases + # TODO(unknown): clarify if 'if (1) { return 1;}' is requires one more + # space after ; + if Search(r';[^\s};\\)/]', line): + error(filename, linenum, 'whitespace/semicolon', 3, + 'Missing space after ;') + + +def CheckBracesSpacing(filename, clean_lines, linenum, error): + """Checks for horizontal spacing near commas. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + line = clean_lines.elided[linenum] + + # Except after an opening paren, or after another opening brace (in case of + # an initializer list, for instance), you should have spaces before your + # braces. And since you should never have braces at the beginning of a line, + # this is an easy test. + match = Match(r'^(.*[^ ({>]){', line) + if match: + # Try a bit harder to check for brace initialization. This + # happens in one of the following forms: + # Constructor() : initializer_list_{} { ... 
}
+    #   Constructor{}.MemberFunction()
+    #   Type variable{};
+    #   FunctionCall(type{}, ...);
+    #   LastArgument(..., type{});
+    #   LOG(INFO) << type{} << " ...";
+    #   map_of_type[{...}] = ...;
+    #   ternary = expr ? new type{} : nullptr;
+    #   OuterTemplate<InnerTemplateConstructor<Type>{}>
+    #
+    # We check for the character following the closing brace, and
+    # silence the warning if it's one of those listed above, i.e.
+    # "{.;,)<>]:".
+    #
+    # To account for nested initializer list, we allow any number of
+    # closing braces up to "{;,)<". We can't simply silence the
+    # warning on first sight of closing brace, because that would
+    # cause false negatives for things that are not initializer lists.
+    #   Silence this:         But not this:
+    #     Outer{                if (...) {
+    #       Inner{...}            if (...){  // Missing space before {
+    #     };                    }
+    #
+    # There is a false negative with this approach if people inserted
+    # spurious semicolons, e.g. "if (cond){};", but we will catch the
+    # spurious semicolon with a separate check.
+    (endline, endlinenum, endpos) = CloseExpression(
+        clean_lines, linenum, len(match.group(1)))
+    trailing_text = ''
+    if endpos > -1:
+      trailing_text = endline[endpos:]
+    for offset in xrange(endlinenum + 1,
+                         min(endlinenum + 3, clean_lines.NumLines() - 1)):
+      trailing_text += clean_lines.elided[offset]
+    if not Match(r'^[\s}]*[{.;,)<>\]:]', trailing_text):
+      error(filename, linenum, 'whitespace/braces', 5,
+            'Missing space before {')
+
+  # Make sure '} else {' has spaces.
+  if Search(r'}else', line):
+    error(filename, linenum, 'whitespace/braces', 5,
+          'Missing space before else')
+
+  # You shouldn't have a space before a semicolon at the end of the line.
+  # There's a special case for "for" since the style guide allows space before
+  # the semicolon there.
+  if Search(r':\s*;\s*$', line):
+    error(filename, linenum, 'whitespace/semicolon', 5,
+          'Semicolon defining empty statement. Use {} instead.')
+  elif Search(r'^\s*;\s*$', line):
+    error(filename, linenum, 'whitespace/semicolon', 5,
+          'Line contains only semicolon. If this should be an empty statement, '
+          'use {} instead.')
+  elif (Search(r'\s+;\s*$', line) and
+        not Search(r'\bfor\b', line)):
+    error(filename, linenum, 'whitespace/semicolon', 5,
+          'Extra space before last semicolon. If this should be an empty '
+          'statement, use {} instead.')
+
+
+def IsDecltype(clean_lines, linenum, column):
+  """Check if the token ending on (linenum, column) is decltype().
+
+  Args:
+    clean_lines: A CleansedLines instance containing the file.
+    linenum: the number of the line to check.
+    column: end column of the token to check.
+  Returns:
+    True if this token is decltype() expression, False otherwise.
+  """
+  (text, _, start_col) = ReverseCloseExpression(clean_lines, linenum, column)
+  if start_col < 0:
+    return False
+  if Search(r'\bdecltype\s*$', text[0:start_col]):
+    return True
+  return False
+
+
+def IsTemplateParameterList(clean_lines, linenum, column):
+  """Check if the token ending on (linenum, column) is the end of template<>.
+
+  Args:
+    clean_lines: A CleansedLines instance containing the file.
+    linenum: the number of the line to check.
+    column: end column of the token to check.
+  Returns:
+    True if this token is end of a template parameter list, False otherwise.
+ """ + (_, startline, startpos) = ReverseCloseExpression( + clean_lines, linenum, column) + if (startpos > -1 and + Search(r'\btemplate\s*$', clean_lines.elided[startline][0:startpos])): + return True + return False + + +def IsRValueType(typenames, clean_lines, nesting_state, linenum, column): + """Check if the token ending on (linenum, column) is a type. + + Assumes that text to the right of the column is "&&" or a function + name. + + Args: + typenames: set of type names from template-argument-list. + clean_lines: A CleansedLines instance containing the file. + nesting_state: A NestingState instance which maintains information about + the current stack of nested blocks being parsed. + linenum: the number of the line to check. + column: end column of the token to check. + Returns: + True if this token is a type, False if we are not sure. + """ + prefix = clean_lines.elided[linenum][0:column] + + # Get one word to the left. If we failed to do so, this is most + # likely not a type, since it's unlikely that the type name and "&&" + # would be split across multiple lines. + match = Match(r'^(.*)(\b\w+|[>*)&])\s*$', prefix) + if not match: + return False + + # Check text following the token. If it's "&&>" or "&&," or "&&...", it's + # most likely a rvalue reference used inside a template. + suffix = clean_lines.elided[linenum][column:] + if Match(r'&&\s*(?:[>,]|\.\.\.)', suffix): + return True + + # Check for known types and end of templates: + # int&& variable + # vector&& variable + # + # Because this function is called recursively, we also need to + # recognize pointer and reference types: + # int* Function() + # int& Function() + if (match.group(2) in typenames or + match.group(2) in ['char', 'char16_t', 'char32_t', 'wchar_t', 'bool', + 'short', 'int', 'long', 'signed', 'unsigned', + 'float', 'double', 'void', 'auto', '>', '*', '&']): + return True + + # If we see a close parenthesis, look for decltype on the other side. + # decltype would unambiguously identify a type, anything else is + # probably a parenthesized expression and not a type. + if match.group(2) == ')': + return IsDecltype( + clean_lines, linenum, len(match.group(1)) + len(match.group(2)) - 1) + + # Check for casts and cv-qualifiers. + # match.group(1) remainder + # -------------- --------- + # const_cast< type&& + # const type&& + # type const&& + if Search(r'\b(?:const_cast\s*<|static_cast\s*<|dynamic_cast\s*<|' + r'reinterpret_cast\s*<|\w+\s)\s*$', + match.group(1)): + return True + + # Look for a preceding symbol that might help differentiate the context. + # These are the cases that would be ambiguous: + # match.group(1) remainder + # -------------- --------- + # Call ( expression && + # Declaration ( type&& + # sizeof ( type&& + # if ( expression && + # while ( expression && + # for ( type&& + # for( ; expression && + # statement ; type&& + # block { type&& + # constructor { expression && + start = linenum + line = match.group(1) + match_symbol = None + while start >= 0: + # We want to skip over identifiers and commas to get to a symbol. + # Commas are skipped so that we can find the opening parenthesis + # for function parameter lists. 
+    match_symbol = Match(r'^(.*)([^\w\s,])[\w\s,]*$', line)
+    if match_symbol:
+      break
+    start -= 1
+    line = clean_lines.elided[start]
+
+  if not match_symbol:
+    # Probably the first statement in the file is an rvalue reference
+    return True
+
+  if match_symbol.group(2) == '}':
+    # Found closing brace, probably an indication of this:
+    #   block{} type&&
+    return True
+
+  if match_symbol.group(2) == ';':
+    # Found semicolon, probably one of these:
+    #   for(; expression &&
+    #   statement; type&&
+
+    # Look for the previous 'for(' in the previous lines.
+    before_text = match_symbol.group(1)
+    for i in xrange(start - 1, max(start - 6, 0), -1):
+      before_text = clean_lines.elided[i] + before_text
+    if Search(r'for\s*\([^{};]*$', before_text):
+      # This is the condition inside a for-loop
+      return False
+
+    # Did not find a for-init-statement before this semicolon, so this
+    # is probably a new statement and not a condition.
+    return True
+
+  if match_symbol.group(2) == '{':
+    # Found opening brace, probably one of these:
+    #   block{ type&& = ... ; }
+    #   constructor{ expression && expression }
+
+    # Look for a closing brace or a semicolon. If we see a semicolon
+    # first, this is probably a rvalue reference.
+    line = clean_lines.elided[start][0:len(match_symbol.group(1)) + 1]
+    end = start
+    depth = 1
+    while True:
+      for ch in line:
+        if ch == ';':
+          return True
+        elif ch == '{':
+          depth += 1
+        elif ch == '}':
+          depth -= 1
+          if depth == 0:
+            return False
+      end += 1
+      if end >= clean_lines.NumLines():
+        break
+      line = clean_lines.elided[end]
+    # Incomplete program?
+    return False
+
+  if match_symbol.group(2) == '(':
+    # Opening parenthesis. Need to check what's to the left of the
+    # parenthesis. Look back one extra line for additional context.
+    before_text = match_symbol.group(1)
+    if linenum > 1:
+      before_text = clean_lines.elided[linenum - 1] + before_text
+    before_text = match_symbol.group(1)
+
+    # Patterns that are likely to be types:
+    #   [](type&&
+    #   for (type&&
+    #   sizeof(type&&
+    #   operator=(type&&
+    #
+    if Search(r'(?:\]|\bfor|\bsizeof|\boperator\s*\S+\s*)\s*$', before_text):
+      return True
+
+    # Patterns that are likely to be expressions:
+    #   if (expression &&
+    #   while (expression &&
+    #   : initializer(expression &&
+    #   , initializer(expression &&
+    #   ( FunctionCall(expression &&
+    #   + FunctionCall(expression &&
+    #   + (expression &&
+    #
+    # The last '+' represents operators such as '+' and '-'.
+    if Search(r'(?:\bif|\bwhile|[-+=%^(<!?:,&*]+)\s*$', match_symbol.group(1)):
+      return False
+
+    # Something else. Check that tokens to the left look like
+    #   return_type function_name
+    match_func = Match(r'^(.*\S.*)\s+\w(?:\w|::)*(?:<[^<>]*>)?\s*$',
+                       match_symbol.group(1))
+    if match_func:
+      # Check for constructors, which don't have return types.
+      if Search(r'\b(?:explicit|inline)$', match_func.group(1)):
+        return True
+      implicit_constructor = Match(r'\s*(\w+)\((?:const\s+)?(\w+)', prefix)
+      if (implicit_constructor and
+          implicit_constructor.group(1) == implicit_constructor.group(2)):
+        return True
+      return IsRValueType(typenames, clean_lines, nesting_state, linenum,
+                          len(match_func.group(1)))
+
+    # Nothing before the function name. If this is inside a block scope,
+    # this is probably a function call.
+    return not (nesting_state.previous_stack_top and
+                nesting_state.previous_stack_top.IsBlockInfo())
+
+  if match_symbol.group(2) == '>':
+    # Possibly a closing bracket, check that what's on the other side
+    # looks like the start of a template.
+    return IsTemplateParameterList(
+        clean_lines, start, len(match_symbol.group(1)))
+
+  # Some other symbol, usually something like "a=b&&c". This is most
+  # likely not a type.
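+  # For example, in "result = a&&b;" the symbol found is "=", which none
+  # of the cases above claim, so the "&&" is treated as a boolean operator.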
+ return False + + +def IsDeletedOrDefault(clean_lines, linenum): + """Check if current constructor or operator is deleted or default. + + Args: + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + Returns: + True if this is a deleted or default constructor. + """ + open_paren = clean_lines.elided[linenum].find('(') + if open_paren < 0: + return False + (close_line, _, close_paren) = CloseExpression( + clean_lines, linenum, open_paren) + if close_paren < 0: + return False + return Match(r'\s*=\s*(?:delete|default)\b', close_line[close_paren:]) + + +def IsRValueAllowed(clean_lines, linenum, typenames): + """Check if RValue reference is allowed on a particular line. + + Args: + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + typenames: set of type names from template-argument-list. + Returns: + True if line is within the region where RValue references are allowed. + """ + # Allow region marked by PUSH/POP macros + for i in xrange(linenum, 0, -1): + line = clean_lines.elided[i] + if Match(r'GOOGLE_ALLOW_RVALUE_REFERENCES_(?:PUSH|POP)', line): + if not line.endswith('PUSH'): + return False + for j in xrange(linenum, clean_lines.NumLines(), 1): + line = clean_lines.elided[j] + if Match(r'GOOGLE_ALLOW_RVALUE_REFERENCES_(?:PUSH|POP)', line): + return line.endswith('POP') + + # Allow operator= + line = clean_lines.elided[linenum] + if Search(r'\boperator\s*=\s*\(', line): + return IsDeletedOrDefault(clean_lines, linenum) + + # Allow constructors + match = Match(r'\s*(?:[\w<>]+::)*([\w<>]+)\s*::\s*([\w<>]+)\s*\(', line) + if match and match.group(1) == match.group(2): + return IsDeletedOrDefault(clean_lines, linenum) + if Search(r'\b(?:explicit|inline)\s+[\w<>]+\s*\(', line): + return IsDeletedOrDefault(clean_lines, linenum) + + if Match(r'\s*[\w<>]+\s*\(', line): + previous_line = 'ReturnType' + if linenum > 0: + previous_line = clean_lines.elided[linenum - 1] + if Match(r'^\s*$', previous_line) or Search(r'[{}:;]\s*$', previous_line): + return IsDeletedOrDefault(clean_lines, linenum) + + # Reject types not mentioned in template-argument-list + while line: + match = Match(r'^.*?(\w+)\s*&&(.*)$', line) + if not match: + break + if match.group(1) not in typenames: + return False + line = match.group(2) + + # All RValue types that were in template-argument-list should have + # been removed by now. Those were allowed, assuming that they will + # be forwarded. + # + # If there are no remaining RValue types left (i.e. types that were + # not found in template-argument-list), flag those as not allowed. + return line.find('&&') < 0 + + +def GetTemplateArgs(clean_lines, linenum): + """Find list of template arguments associated with this function declaration. + + Args: + clean_lines: A CleansedLines instance containing the file. + linenum: Line number containing the start of the function declaration, + usually one line after the end of the template-argument-list. + Returns: + Set of type names, or empty set if this does not appear to have + any template parameters. 
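+
+    For example, a declaration preceded by "template <typename K, class V>"
+    yields the set {'K', 'V'}.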
+ """ + # Find start of function + func_line = linenum + while func_line > 0: + line = clean_lines.elided[func_line] + if Match(r'^\s*$', line): + return set() + if line.find('(') >= 0: + break + func_line -= 1 + if func_line == 0: + return set() + + # Collapse template-argument-list into a single string + argument_list = '' + match = Match(r'^(\s*template\s*)<', clean_lines.elided[func_line]) + if match: + # template-argument-list on the same line as function name + start_col = len(match.group(1)) + _, end_line, end_col = CloseExpression(clean_lines, func_line, start_col) + if end_col > -1 and end_line == func_line: + start_col += 1 # Skip the opening bracket + argument_list = clean_lines.elided[func_line][start_col:end_col] + + elif func_line > 1: + # template-argument-list one line before function name + match = Match(r'^(.*)>\s*$', clean_lines.elided[func_line - 1]) + if match: + end_col = len(match.group(1)) + _, start_line, start_col = ReverseCloseExpression( + clean_lines, func_line - 1, end_col) + if start_col > -1: + start_col += 1 # Skip the opening bracket + while start_line < func_line - 1: + argument_list += clean_lines.elided[start_line][start_col:] + start_col = 0 + start_line += 1 + argument_list += clean_lines.elided[func_line - 1][start_col:end_col] + + if not argument_list: + return set() + + # Extract type names + typenames = set() + while True: + match = Match(r'^[,\s]*(?:typename|class)(?:\.\.\.)?\s+(\w+)(.*)$', + argument_list) + if not match: + break + typenames.add(match.group(1)) + argument_list = match.group(2) + return typenames + + +def CheckRValueReference(filename, clean_lines, linenum, nesting_state, error): + """Check for rvalue references. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + nesting_state: A NestingState instance which maintains information about + the current stack of nested blocks being parsed. + error: The function to call with any errors found. + """ + # Find lines missing spaces around &&. + # TODO(unknown): currently we don't check for rvalue references + # with spaces surrounding the && to avoid false positives with + # boolean expressions. + line = clean_lines.elided[linenum] + match = Match(r'^(.*\S)&&', line) + if not match: + match = Match(r'(.*)&&\S', line) + if (not match) or '(&&)' in line or Search(r'\boperator\s*$', match.group(1)): + return + + # Either poorly formed && or an rvalue reference, check the context + # to get a more accurate error message. Mostly we want to determine + # if what's to the left of "&&" is a type or not. + typenames = GetTemplateArgs(clean_lines, linenum) + and_pos = len(match.group(1)) + if IsRValueType(typenames, clean_lines, nesting_state, linenum, and_pos): + if not IsRValueAllowed(clean_lines, linenum, typenames): + error(filename, linenum, 'build/c++11', 3, + 'RValue references are an unapproved C++ feature.') + else: + error(filename, linenum, 'whitespace/operators', 3, + 'Missing spaces around &&') + + +def CheckSectionSpacing(filename, clean_lines, class_info, linenum, error): + """Checks for additional blank line issues related to sections. + + Currently the only thing checked here is blank line before protected/private. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + class_info: A _ClassInfo objects. + linenum: The number of the line to check. + error: The function to call with any errors found. 
+ """ + # Skip checks if the class is small, where small means 25 lines or less. + # 25 lines seems like a good cutoff since that's the usual height of + # terminals, and any class that can't fit in one screen can't really + # be considered "small". + # + # Also skip checks if we are on the first line. This accounts for + # classes that look like + # class Foo { public: ... }; + # + # If we didn't find the end of the class, last_line would be zero, + # and the check will be skipped by the first condition. + if (class_info.last_line - class_info.starting_linenum <= 24 or + linenum <= class_info.starting_linenum): + return + + matched = Match(r'\s*(public|protected|private):', clean_lines.lines[linenum]) + if matched: + # Issue warning if the line before public/protected/private was + # not a blank line, but don't do this if the previous line contains + # "class" or "struct". This can happen two ways: + # - We are at the beginning of the class. + # - We are forward-declaring an inner class that is semantically + # private, but needed to be public for implementation reasons. + # Also ignores cases where the previous line ends with a backslash as can be + # common when defining classes in C macros. + prev_line = clean_lines.lines[linenum - 1] + if (not IsBlankLine(prev_line) and + not Search(r'\b(class|struct)\b', prev_line) and + not Search(r'\\$', prev_line)): + # Try a bit harder to find the beginning of the class. This is to + # account for multi-line base-specifier lists, e.g.: + # class Derived + # : public Base { + end_class_head = class_info.starting_linenum + for i in range(class_info.starting_linenum, linenum): + if Search(r'\{\s*$', clean_lines.lines[i]): + end_class_head = i + break + if end_class_head < linenum - 1: + error(filename, linenum, 'whitespace/blank_line', 3, + '"%s:" should be preceded by a blank line' % matched.group(1)) + + +def GetPreviousNonBlankLine(clean_lines, linenum): + """Return the most recent non-blank line and its line number. + + Args: + clean_lines: A CleansedLines instance containing the file contents. + linenum: The number of the line to check. + + Returns: + A tuple with two elements. The first element is the contents of the last + non-blank line before the current line, or the empty string if this is the + first non-blank line. The second is the line number of that line, or -1 + if this is the first non-blank line. + """ + + prevlinenum = linenum - 1 + while prevlinenum >= 0: + prevline = clean_lines.elided[prevlinenum] + if not IsBlankLine(prevline): # if not a blank line... + return (prevline, prevlinenum) + prevlinenum -= 1 + return ('', -1) + + +def CheckBraces(filename, clean_lines, linenum, error): + """Looks for misplaced braces (e.g. at the end of line). + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + + line = clean_lines.elided[linenum] # get rid of comments and strings + + if Match(r'\s*{\s*$', line): + # We allow an open brace to start a line in the case where someone is using + # braces in a block to explicitly create a new scope, which is commonly used + # to control the lifetime of stack-allocated variables. Braces are also + # used for brace initializers inside function calls. 
We don't detect this + # perfectly: we just don't complain if the last non-whitespace character on + # the previous non-blank line is ',', ';', ':', '(', '{', or '}', or if the + # previous line starts a preprocessor block. + prevline = GetPreviousNonBlankLine(clean_lines, linenum)[0] + if (not Search(r'[,;:}{(]\s*$', prevline) and + not Match(r'\s*#', prevline)): + error(filename, linenum, 'whitespace/braces', 4, + '{ should almost always be at the end of the previous line') + + # An else clause should be on the same line as the preceding closing brace. + if Match(r'\s*else\b\s*(?:if\b|\{|$)', line): + prevline = GetPreviousNonBlankLine(clean_lines, linenum)[0] + if Match(r'\s*}\s*$', prevline): + error(filename, linenum, 'whitespace/newline', 4, + 'An else should appear on the same line as the preceding }') + + # If braces come on one side of an else, they should be on both. + # However, we have to worry about "else if" that spans multiple lines! + if Search(r'else if\s*\(', line): # could be multi-line if + brace_on_left = bool(Search(r'}\s*else if\s*\(', line)) + # find the ( after the if + pos = line.find('else if') + pos = line.find('(', pos) + if pos > 0: + (endline, _, endpos) = CloseExpression(clean_lines, linenum, pos) + brace_on_right = endline[endpos:].find('{') != -1 + if brace_on_left != brace_on_right: # must be brace after if + error(filename, linenum, 'readability/braces', 5, + 'If an else has a brace on one side, it should have it on both') + elif Search(r'}\s*else[^{]*$', line) or Match(r'[^}]*else\s*{', line): + error(filename, linenum, 'readability/braces', 5, + 'If an else has a brace on one side, it should have it on both') + + # Likewise, an else should never have the else clause on the same line + if Search(r'\belse [^\s{]', line) and not Search(r'\belse if\b', line): + error(filename, linenum, 'whitespace/newline', 4, + 'Else clause should never be on same line as else (use 2 lines)') + + # In the same way, a do/while should never be on one line + if Match(r'\s*do [^\s{]', line): + error(filename, linenum, 'whitespace/newline', 4, + 'do/while clauses should not be on a single line') + + # Check single-line if/else bodies. The style guide says 'curly braces are not + # required for single-line statements'. We additionally allow multi-line, + # single statements, but we reject anything with more than one semicolon in + # it. This means that the first semicolon after the if should be at the end of + # its line, and the line after that should have an indent level equal to or + # lower than the if. We also check for ambiguous if/else nesting without + # braces. + if_else_match = Search(r'\b(if\s*\(|else\b)', line) + if if_else_match and not Match(r'\s*#', line): + if_indent = GetIndentLevel(line) + endline, endlinenum, endpos = line, linenum, if_else_match.end() + if_match = Search(r'\bif\s*\(', line) + if if_match: + # This could be a multiline if condition, so find the end first. + pos = if_match.end() - 1 + (endline, endlinenum, endpos) = CloseExpression(clean_lines, linenum, pos) + # Check for an opening brace, either directly after the if or on the next + # line. If found, this isn't a single-statement conditional. 
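+      #
+      # A couple of illustrative sketches of the cases this region separates
+      # (hypothetical examples, not text from the style guide):
+      #
+      #   if (cond) {        // brace found: multi-statement body is fine
+      #     DoThis();
+      #     DoThat();
+      #   }
+      #
+      #   if (cond)
+      #     DoThis();
+      #     DoThat();        // no brace: flagged below, braces required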
+ if (not Match(r'\s*{', endline[endpos:]) + and not (Match(r'\s*$', endline[endpos:]) + and endlinenum < (len(clean_lines.elided) - 1) + and Match(r'\s*{', clean_lines.elided[endlinenum + 1]))): + while (endlinenum < len(clean_lines.elided) + and ';' not in clean_lines.elided[endlinenum][endpos:]): + endlinenum += 1 + endpos = 0 + if endlinenum < len(clean_lines.elided): + endline = clean_lines.elided[endlinenum] + # We allow a mix of whitespace and closing braces (e.g. for one-liner + # methods) and a single \ after the semicolon (for macros) + endpos = endline.find(';') + if not Match(r';[\s}]*(\\?)$', endline[endpos:]): + # Semicolon isn't the last character, there's something trailing. + # Output a warning if the semicolon is not contained inside + # a lambda expression. + if not Match(r'^[^{};]*\[[^\[\]]*\][^{}]*\{[^{}]*\}\s*\)*[;,]\s*$', + endline): + error(filename, linenum, 'readability/braces', 4, + 'If/else bodies with multiple statements require braces') + elif endlinenum < len(clean_lines.elided) - 1: + # Make sure the next line is dedented + next_line = clean_lines.elided[endlinenum + 1] + next_indent = GetIndentLevel(next_line) + # With ambiguous nested if statements, this will error out on the + # if that *doesn't* match the else, regardless of whether it's the + # inner one or outer one. + if (if_match and Match(r'\s*else\b', next_line) + and next_indent != if_indent): + error(filename, linenum, 'readability/braces', 4, + 'Else clause should be indented at the same level as if. ' + 'Ambiguous nested if/else chains require braces.') + elif next_indent > if_indent: + error(filename, linenum, 'readability/braces', 4, + 'If/else bodies with multiple statements require braces') + + +def CheckTrailingSemicolon(filename, clean_lines, linenum, error): + """Looks for redundant trailing semicolon. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + + line = clean_lines.elided[linenum] + + # Block bodies should not be followed by a semicolon. Due to C++11 + # brace initialization, there are more places where semicolons are + # required than not, so we use a whitelist approach to check these + # rather than a blacklist. These are the places where "};" should + # be replaced by just "}": + # 1. Some flavor of block following closing parenthesis: + # for (;;) {}; + # while (...) {}; + # switch (...) {}; + # Function(...) {}; + # if (...) {}; + # if (...) else if (...) {}; + # + # 2. else block: + # if (...) else {}; + # + # 3. const member function: + # Function(...) const {}; + # + # 4. Block following some statement: + # x = 42; + # {}; + # + # 5. Block at the beginning of a function: + # Function(...) { + # {}; + # } + # + # Note that naively checking for the preceding "{" will also match + # braces inside multi-dimensional arrays, but this is fine since + # that expression will not contain semicolons. + # + # 6. Block following another block: + # while (true) {} + # {}; + # + # 7. End of namespaces: + # namespace {}; + # + # These semicolons seems far more common than other kinds of + # redundant semicolons, possibly due to people converting classes + # to namespaces. For now we do not warn for this case. + # + # Try matching case 1 first. + match = Match(r'^(.*\)\s*)\{', line) + if match: + # Matched closing parenthesis (case 1). 
Check the token before the + # matching opening parenthesis, and don't warn if it looks like a + # macro. This avoids these false positives: + # - macro that defines a base class + # - multi-line macro that defines a base class + # - macro that defines the whole class-head + # + # But we still issue warnings for macros that we know are safe to + # warn, specifically: + # - TEST, TEST_F, TEST_P, MATCHER, MATCHER_P + # - TYPED_TEST + # - INTERFACE_DEF + # - EXCLUSIVE_LOCKS_REQUIRED, SHARED_LOCKS_REQUIRED, LOCKS_EXCLUDED: + # + # We implement a whitelist of safe macros instead of a blacklist of + # unsafe macros, even though the latter appears less frequently in + # google code and would have been easier to implement. This is because + # the downside for getting the whitelist wrong means some extra + # semicolons, while the downside for getting the blacklist wrong + # would result in compile errors. + # + # In addition to macros, we also don't want to warn on + # - Compound literals + # - Lambdas + # - alignas specifier with anonymous structs: + closing_brace_pos = match.group(1).rfind(')') + opening_parenthesis = ReverseCloseExpression( + clean_lines, linenum, closing_brace_pos) + if opening_parenthesis[2] > -1: + line_prefix = opening_parenthesis[0][0:opening_parenthesis[2]] + macro = Search(r'\b([A-Z_]+)\s*$', line_prefix) + func = Match(r'^(.*\])\s*$', line_prefix) + if ((macro and + macro.group(1) not in ( + 'TEST', 'TEST_F', 'MATCHER', 'MATCHER_P', 'TYPED_TEST', + 'EXCLUSIVE_LOCKS_REQUIRED', 'SHARED_LOCKS_REQUIRED', + 'LOCKS_EXCLUDED', 'INTERFACE_DEF')) or + (func and not Search(r'\boperator\s*\[\s*\]', func.group(1))) or + Search(r'\b(?:struct|union)\s+alignas\s*$', line_prefix) or + Search(r'\s+=\s*$', line_prefix)): + match = None + if (match and + opening_parenthesis[1] > 1 and + Search(r'\]\s*$', clean_lines.elided[opening_parenthesis[1] - 1])): + # Multi-line lambda-expression + match = None + + else: + # Try matching cases 2-3. + match = Match(r'^(.*(?:else|\)\s*const)\s*)\{', line) + if not match: + # Try matching cases 4-6. These are always matched on separate lines. + # + # Note that we can't simply concatenate the previous line to the + # current line and do a single match, otherwise we may output + # duplicate warnings for the blank line case: + # if (cond) { + # // blank line + # } + prevline = GetPreviousNonBlankLine(clean_lines, linenum)[0] + if prevline and Search(r'[;{}]\s*$', prevline): + match = Match(r'^(\s*)\{', line) + + # Check matching closing brace + if match: + (endline, endlinenum, endpos) = CloseExpression( + clean_lines, linenum, len(match.group(1))) + if endpos > -1 and Match(r'^\s*;', endline[endpos:]): + # Current {} pair is eligible for semicolon check, and we have found + # the redundant semicolon, output warning here. + # + # Note: because we are scanning forward for opening braces, and + # outputting warnings for the matching closing brace, if there are + # nested blocks with trailing semicolons, we will get the error + # messages in reversed order. + error(filename, endlinenum, 'readability/braces', 4, + "You don't need a ; after a }") + + +def CheckEmptyBlockBody(filename, clean_lines, linenum, error): + """Look for empty loop/conditional body with only a single semicolon. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + + # Search for loop keywords at the beginning of the line. 
Because only + # whitespaces are allowed before the keywords, this will also ignore most + # do-while-loops, since those lines should start with closing brace. + # + # We also check "if" blocks here, since an empty conditional block + # is likely an error. + line = clean_lines.elided[linenum] + matched = Match(r'\s*(for|while|if)\s*\(', line) + if matched: + # Find the end of the conditional expression + (end_line, end_linenum, end_pos) = CloseExpression( + clean_lines, linenum, line.find('(')) + + # Output warning if what follows the condition expression is a semicolon. + # No warning for all other cases, including whitespace or newline, since we + # have a separate check for semicolons preceded by whitespace. + if end_pos >= 0 and Match(r';', end_line[end_pos:]): + if matched.group(1) == 'if': + error(filename, end_linenum, 'whitespace/empty_conditional_body', 5, + 'Empty conditional bodies should use {}') + else: + error(filename, end_linenum, 'whitespace/empty_loop_body', 5, + 'Empty loop bodies should use {} or continue') + + +def FindCheckMacro(line): + """Find a replaceable CHECK-like macro. + + Args: + line: line to search on. + Returns: + (macro name, start position), or (None, -1) if no replaceable + macro is found. + """ + for macro in _CHECK_MACROS: + i = line.find(macro) + if i >= 0: + # Find opening parenthesis. Do a regular expression match here + # to make sure that we are matching the expected CHECK macro, as + # opposed to some other macro that happens to contain the CHECK + # substring. + matched = Match(r'^(.*\b' + macro + r'\s*)\(', line) + if not matched: + continue + return (macro, len(matched.group(1))) + return (None, -1) + + +def CheckCheck(filename, clean_lines, linenum, error): + """Checks the use of CHECK and EXPECT macros. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + + # Decide the set of replacement macros that should be suggested + lines = clean_lines.elided + (check_macro, start_pos) = FindCheckMacro(lines[linenum]) + if not check_macro: + return + + # Find end of the boolean expression by matching parentheses + (last_line, end_line, end_pos) = CloseExpression( + clean_lines, linenum, start_pos) + if end_pos < 0: + return + + # If the check macro is followed by something other than a + # semicolon, assume users will log their own custom error messages + # and don't suggest any replacements. + if not Match(r'\s*;', last_line[end_pos:]): + return + + if linenum == end_line: + expression = lines[linenum][start_pos + 1:end_pos - 1] + else: + expression = lines[linenum][start_pos + 1:] + for i in xrange(linenum + 1, end_line): + expression += lines[i] + expression += last_line[0:end_pos - 1] + + # Parse expression so that we can take parentheses into account. + # This avoids false positives for inputs like "CHECK((a < 4) == b)", + # which is not replaceable by CHECK_LE. + lhs = '' + rhs = '' + operator = None + while expression: + matched = Match(r'^\s*(<<|<<=|>>|>>=|->\*|->|&&|\|\||' + r'==|!=|>=|>|<=|<|\()(.*)$', expression) + if matched: + token = matched.group(1) + if token == '(': + # Parenthesized operand + expression = matched.group(2) + (end, _) = FindEndOfExpressionInLine(expression, 0, ['(']) + if end < 0: + return # Unmatched parenthesis + lhs += '(' + expression[0:end] + expression = expression[end:] + elif token in ('&&', '||'): + # Logical and/or operators. 
This means the expression + # contains more than one term, for example: + # CHECK(42 < a && a < b); + # + # These are not replaceable with CHECK_LE, so bail out early. + return + elif token in ('<<', '<<=', '>>', '>>=', '->*', '->'): + # Non-relational operator + lhs += token + expression = matched.group(2) + else: + # Relational operator + operator = token + rhs = matched.group(2) + break + else: + # Unparenthesized operand. Instead of appending to lhs one character + # at a time, we do another regular expression match to consume several + # characters at once if possible. Trivial benchmark shows that this + # is more efficient when the operands are longer than a single + # character, which is generally the case. + matched = Match(r'^([^-=!<>()&|]+)(.*)$', expression) + if not matched: + matched = Match(r'^(\s*\S)(.*)$', expression) + if not matched: + break + lhs += matched.group(1) + expression = matched.group(2) + + # Only apply checks if we got all parts of the boolean expression + if not (lhs and operator and rhs): + return + + # Check that rhs do not contain logical operators. We already know + # that lhs is fine since the loop above parses out && and ||. + if rhs.find('&&') > -1 or rhs.find('||') > -1: + return + + # At least one of the operands must be a constant literal. This is + # to avoid suggesting replacements for unprintable things like + # CHECK(variable != iterator) + # + # The following pattern matches decimal, hex integers, strings, and + # characters (in that order). + lhs = lhs.strip() + rhs = rhs.strip() + match_constant = r'^([-+]?(\d+|0[xX][0-9a-fA-F]+)[lLuU]{0,3}|".*"|\'.*\')$' + if Match(match_constant, lhs) or Match(match_constant, rhs): + # Note: since we know both lhs and rhs, we can provide a more + # descriptive error message like: + # Consider using CHECK_EQ(x, 42) instead of CHECK(x == 42) + # Instead of: + # Consider using CHECK_EQ instead of CHECK(a == b) + # + # We are still keeping the less descriptive message because if lhs + # or rhs gets long, the error message might become unreadable. + error(filename, linenum, 'readability/check', 2, + 'Consider using %s instead of %s(a %s b)' % ( + _CHECK_REPLACEMENT[check_macro][operator], + check_macro, operator)) + + +def CheckAltTokens(filename, clean_lines, linenum, error): + """Check alternative keywords being used in boolean expressions. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + line = clean_lines.elided[linenum] + + # Avoid preprocessor lines + if Match(r'^\s*#', line): + return + + # Last ditch effort to avoid multi-line comments. This will not help + # if the comment started before the current line or ended after the + # current line, but it catches most of the false positives. At least, + # it provides a way to workaround this warning for people who use + # multi-line comments in preprocessor macros. + # + # TODO(unknown): remove this once cpplint has better support for + # multi-line comments. + if line.find('/*') >= 0 or line.find('*/') >= 0: + return + + for match in _ALT_TOKEN_REPLACEMENT_PATTERN.finditer(line): + error(filename, linenum, 'readability/alt_tokens', 2, + 'Use operator %s instead of %s' % ( + _ALT_TOKEN_REPLACEMENT[match.group(1)], match.group(1))) + + +def GetLineWidth(line): + """Determines the width of the line in column positions. + + Args: + line: A string, which may be a Unicode string. 
+ + Returns: + The width of the line in column positions, accounting for Unicode + combining characters and wide characters. + """ + if isinstance(line, unicode): + width = 0 + for uc in unicodedata.normalize('NFC', line): + if unicodedata.east_asian_width(uc) in ('W', 'F'): + width += 2 + elif not unicodedata.combining(uc): + width += 1 + return width + else: + return len(line) + + +def CheckStyle(filename, clean_lines, linenum, file_extension, nesting_state, + error): + """Checks rules from the 'C++ style rules' section of cppguide.html. + + Most of these rules are hard to test (naming, comment style), but we + do what we can. In particular we check for 2-space indents, line lengths, + tab usage, spaces inside code, etc. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + file_extension: The extension (without the dot) of the filename. + nesting_state: A NestingState instance which maintains information about + the current stack of nested blocks being parsed. + error: The function to call with any errors found. + """ + + # Don't use "elided" lines here, otherwise we can't check commented lines. + # Don't want to use "raw" either, because we don't want to check inside C++11 + # raw strings, + raw_lines = clean_lines.lines_without_raw_strings + line = raw_lines[linenum] + + if line.find('\t') != -1: + error(filename, linenum, 'whitespace/tab', 1, + 'Tab found; better to use spaces') + + # One or three blank spaces at the beginning of the line is weird; it's + # hard to reconcile that with 2-space indents. + # NOTE: here are the conditions rob pike used for his tests. Mine aren't + # as sophisticated, but it may be worth becoming so: RLENGTH==initial_spaces + # if(RLENGTH > 20) complain = 0; + # if(match($0, " +(error|private|public|protected):")) complain = 0; + # if(match(prev, "&& *$")) complain = 0; + # if(match(prev, "\\|\\| *$")) complain = 0; + # if(match(prev, "[\",=><] *$")) complain = 0; + # if(match($0, " <<")) complain = 0; + # if(match(prev, " +for \\(")) complain = 0; + # if(prevodd && match(prevprev, " +for \\(")) complain = 0; + scope_or_label_pattern = r'\s*\w+\s*:\s*\\?$' + classinfo = nesting_state.InnermostClass() + initial_spaces = 0 + cleansed_line = clean_lines.elided[linenum] + while initial_spaces < len(line) and line[initial_spaces] == ' ': + initial_spaces += 1 + if line and line[-1].isspace(): + error(filename, linenum, 'whitespace/end_of_line', 4, + 'Line ends in whitespace. Consider deleting these extra spaces.') + # There are certain situations we allow one space, notably for + # section labels, and also lines containing multi-line raw strings. + elif ((initial_spaces == 1 or initial_spaces == 3) and + not Match(scope_or_label_pattern, cleansed_line) and + not (clean_lines.raw_lines[linenum] != line and + Match(r'^\s*""', line))): + error(filename, linenum, 'whitespace/indent', 3, + 'Weird number of spaces at line-start. ' + 'Are you using a 2-space indent?') + + # Check if the line is a header guard. + is_header_guard = False + if file_extension == 'h': + cppvar = GetHeaderGuardCPPVariable(filename) + if (line.startswith('#ifndef %s' % cppvar) or + line.startswith('#define %s' % cppvar) or + line.startswith('#endif // %s' % cppvar)): + is_header_guard = True + # #include lines and header guards can be long, since there's no clean way to + # split them. + # + # URLs can be long too. It's possible to split these, but it makes them + # harder to cut&paste. 
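+  #
+  # For example (an illustrative sketch; the URL is hypothetical), a comment
+  # line like
+  #   // See https://example.com/docs/design/long-form-discussion
+  # is exempted by the http(s) pattern below, since wrapping the URL would
+  # make it unusable.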
+ # + # The "$Id:...$" comment may also get very long without it being the + # developers fault. + if (not line.startswith('#include') and not is_header_guard and + not Match(r'^\s*//.*http(s?)://\S*$', line) and + not Match(r'^// \$Id:.*#[0-9]+ \$$', line)): + line_width = GetLineWidth(line) + extended_length = int((_line_length * 1.25)) + if line_width > extended_length: + error(filename, linenum, 'whitespace/line_length', 4, + 'Lines should very rarely be longer than %i characters' % + extended_length) + elif line_width > _line_length: + error(filename, linenum, 'whitespace/line_length', 2, + 'Lines should be <= %i characters long' % _line_length) + + if (cleansed_line.count(';') > 1 and + # for loops are allowed two ;'s (and may run over two lines). + cleansed_line.find('for') == -1 and + (GetPreviousNonBlankLine(clean_lines, linenum)[0].find('for') == -1 or + GetPreviousNonBlankLine(clean_lines, linenum)[0].find(';') != -1) and + # It's ok to have many commands in a switch case that fits in 1 line + not ((cleansed_line.find('case ') != -1 or + cleansed_line.find('default:') != -1) and + cleansed_line.find('break;') != -1)): + error(filename, linenum, 'whitespace/newline', 0, + 'More than one command on the same line') + + # Some more style checks + CheckBraces(filename, clean_lines, linenum, error) + CheckTrailingSemicolon(filename, clean_lines, linenum, error) + CheckEmptyBlockBody(filename, clean_lines, linenum, error) + CheckAccess(filename, clean_lines, linenum, nesting_state, error) + CheckSpacing(filename, clean_lines, linenum, nesting_state, error) + CheckOperatorSpacing(filename, clean_lines, linenum, error) + CheckParenthesisSpacing(filename, clean_lines, linenum, error) + CheckCommaSpacing(filename, clean_lines, linenum, error) + CheckBracesSpacing(filename, clean_lines, linenum, error) + CheckSpacingForFunctionCall(filename, clean_lines, linenum, error) + CheckRValueReference(filename, clean_lines, linenum, nesting_state, error) + CheckCheck(filename, clean_lines, linenum, error) + CheckAltTokens(filename, clean_lines, linenum, error) + classinfo = nesting_state.InnermostClass() + if classinfo: + CheckSectionSpacing(filename, clean_lines, classinfo, linenum, error) + + +_RE_PATTERN_INCLUDE = re.compile(r'^\s*#\s*include\s*([<"])([^>"]*)[>"].*$') +# Matches the first component of a filename delimited by -s and _s. That is: +# _RE_FIRST_COMPONENT.match('foo').group(0) == 'foo' +# _RE_FIRST_COMPONENT.match('foo.cc').group(0) == 'foo' +# _RE_FIRST_COMPONENT.match('foo-bar_baz.cc').group(0) == 'foo' +# _RE_FIRST_COMPONENT.match('foo_bar-baz.cc').group(0) == 'foo' +_RE_FIRST_COMPONENT = re.compile(r'^[^-_.]+') + + +def _DropCommonSuffixes(filename): + """Drops common suffixes like _test.cc or -inl.h from filename. + + For example: + >>> _DropCommonSuffixes('foo/foo-inl.h') + 'foo/foo' + >>> _DropCommonSuffixes('foo/bar/foo.cc') + 'foo/bar/foo' + >>> _DropCommonSuffixes('foo/foo_internal.h') + 'foo/foo' + >>> _DropCommonSuffixes('foo/foo_unusualinternal.h') + 'foo/foo_unusualinternal' + + Args: + filename: The input filename. + + Returns: + The filename with the common suffix removed. 
+ """ + for suffix in ('test.cc', 'regtest.cc', 'unittest.cc', + 'inl.h', 'impl.h', 'internal.h'): + if (filename.endswith(suffix) and len(filename) > len(suffix) and + filename[-len(suffix) - 1] in ('-', '_')): + return filename[:-len(suffix) - 1] + return os.path.splitext(filename)[0] + + +def _IsTestFilename(filename): + """Determines if the given filename has a suffix that identifies it as a test. + + Args: + filename: The input filename. + + Returns: + True if 'filename' looks like a test, False otherwise. + """ + if (filename.endswith('_test.cc') or + filename.endswith('_unittest.cc') or + filename.endswith('_regtest.cc')): + return True + else: + return False + + +def _ClassifyInclude(fileinfo, include, is_system): + """Figures out what kind of header 'include' is. + + Args: + fileinfo: The current file cpplint is running over. A FileInfo instance. + include: The path to a #included file. + is_system: True if the #include used <> rather than "". + + Returns: + One of the _XXX_HEADER constants. + + For example: + >>> _ClassifyInclude(FileInfo('foo/foo.cc'), 'stdio.h', True) + _C_SYS_HEADER + >>> _ClassifyInclude(FileInfo('foo/foo.cc'), 'string', True) + _CPP_SYS_HEADER + >>> _ClassifyInclude(FileInfo('foo/foo.cc'), 'foo/foo.h', False) + _LIKELY_MY_HEADER + >>> _ClassifyInclude(FileInfo('foo/foo_unknown_extension.cc'), + ... 'bar/foo_other_ext.h', False) + _POSSIBLE_MY_HEADER + >>> _ClassifyInclude(FileInfo('foo/foo.cc'), 'foo/bar.h', False) + _OTHER_HEADER + """ + # This is a list of all standard c++ header files, except + # those already checked for above. + is_cpp_h = include in _CPP_HEADERS + + if is_system: + if is_cpp_h: + return _CPP_SYS_HEADER + else: + return _C_SYS_HEADER + + # If the target file and the include we're checking share a + # basename when we drop common extensions, and the include + # lives in . , then it's likely to be owned by the target file. + target_dir, target_base = ( + os.path.split(_DropCommonSuffixes(fileinfo.RepositoryName()))) + include_dir, include_base = os.path.split(_DropCommonSuffixes(include)) + if target_base == include_base and ( + include_dir == target_dir or + include_dir == os.path.normpath(target_dir + '/../public')): + return _LIKELY_MY_HEADER + + # If the target and include share some initial basename + # component, it's possible the target is implementing the + # include, so it's allowed to be first, but we'll never + # complain if it's not there. + target_first_component = _RE_FIRST_COMPONENT.match(target_base) + include_first_component = _RE_FIRST_COMPONENT.match(include_base) + if (target_first_component and include_first_component and + target_first_component.group(0) == + include_first_component.group(0)): + return _POSSIBLE_MY_HEADER + + return _OTHER_HEADER + + + +def CheckIncludeLine(filename, clean_lines, linenum, include_state, error): + """Check rules that are applicable to #include lines. + + Strings on #include lines are NOT removed from elided line, to make + certain tasks easier. However, to prevent false positives, checks + applicable to #include lines in CheckLanguage must be put here. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + include_state: An _IncludeState instance in which the headers are inserted. + error: The function to call with any errors found. 
+  """
+  fileinfo = FileInfo(filename)
+  line = clean_lines.lines[linenum]
+
+  # "include" should use the new style "foo/bar.h" instead of just "bar.h"
+  # Only do this check if the included header follows google naming
+  # conventions.  If not, assume that it's a 3rd party API that
+  # requires special include conventions.
+  #
+  # We also make an exception for Lua headers, which follow google
+  # naming convention but not the include convention.
+  match = Match(r'#include\s*"([^/]+\.h)"', line)
+  if match and not _THIRD_PARTY_HEADERS_PATTERN.match(match.group(1)):
+    error(filename, linenum, 'build/include', 4,
+          'Include the directory when naming .h files')
+
+  # We shouldn't include a file more than once.  Actually, there are a
+  # handful of instances where doing so is okay, but in general it's
+  # not.
+  match = _RE_PATTERN_INCLUDE.search(line)
+  if match:
+    include = match.group(2)
+    is_system = (match.group(1) == '<')
+    duplicate_line = include_state.FindHeader(include)
+    if duplicate_line >= 0:
+      error(filename, linenum, 'build/include', 4,
+            '"%s" already included at %s:%s' %
+            (include, filename, duplicate_line))
+    elif (include.endswith('.cc') and
+          os.path.dirname(fileinfo.RepositoryName()) != os.path.dirname(include)):
+      error(filename, linenum, 'build/include', 4,
+            'Do not include .cc files from other packages')
+    elif not _THIRD_PARTY_HEADERS_PATTERN.match(include):
+      include_state.include_list[-1].append((include, linenum))
+
+      # We want to ensure that headers appear in the right order:
+      # 1) for foo.cc, foo.h  (preferred location)
+      # 2) c system files
+      # 3) cpp system files
+      # 4) for foo.cc, foo.h  (deprecated location)
+      # 5) other google headers
+      #
+      # We classify each include statement as one of those 5 types
+      # using a number of techniques.  The include_state object keeps
+      # track of the highest type seen, and complains if we see a
+      # lower type after that.
+      error_message = include_state.CheckNextIncludeOrder(
+          _ClassifyInclude(fileinfo, include, is_system))
+      if error_message:
+        error(filename, linenum, 'build/include_order', 4,
+              '%s. Should be: %s.h, c system, c++ system, other.' %
+              (error_message, fileinfo.BaseName()))
+      canonical_include = include_state.CanonicalizeAlphabeticalOrder(include)
+      if not include_state.IsInAlphabeticalOrder(
+          clean_lines, linenum, canonical_include):
+        error(filename, linenum, 'build/include_alpha', 4,
+              'Include "%s" not in alphabetical order' % include)
+      include_state.SetLastHeader(canonical_include)
+
+
+
+def _GetTextInside(text, start_pattern):
+  r"""Retrieves all the text between matching open and close parentheses.
+
+  Given a string of lines and a regular expression string, retrieve all the text
+  following the expression and between opening punctuation symbols like
+  (, [, or {, and the matching close-punctuation symbol.  This properly handles
+  nested occurrences of the punctuation, so for text like
+    printf(a(), b(c()));
+  a call to _GetTextInside(text, r'printf\(') will return 'a(), b(c())'.
+  start_pattern must match a string having an open punctuation symbol at the end.
+
+  Args:
+    text: The lines to extract text.  Its comments and strings must be elided.
+          It can be a single line or span multiple lines.
+    start_pattern: The regexp string indicating where to start extracting
+                   the text.
+  Returns:
+    The extracted text.
+    None if either the opening string or ending punctuation could not be found.
+  """
+  # TODO(unknown): Audit cpplint.py to see what places could be profitably
+  # rewritten to use _GetTextInside (and that use inferior regexp matching today).
+
+  # Maps opening punctuation to the matching closing punctuation.
+  matching_punctuation = {'(': ')', '{': '}', '[': ']'}
+  closing_punctuation = set(matching_punctuation.itervalues())
+
+  # Find the position to start extracting text.
+  match = re.search(start_pattern, text, re.M)
+  if not match:  # start_pattern not found in text.
+    return None
+  start_position = match.end(0)
+
+  assert start_position > 0, (
+      'start_pattern must end with an opening punctuation.')
+  assert text[start_position - 1] in matching_punctuation, (
+      'start_pattern must end with an opening punctuation.')
+  # Stack of closing punctuation we expect to have in text after position.
+  punctuation_stack = [matching_punctuation[text[start_position - 1]]]
+  position = start_position
+  while punctuation_stack and position < len(text):
+    if text[position] == punctuation_stack[-1]:
+      punctuation_stack.pop()
+    elif text[position] in closing_punctuation:
+      # A closing punctuation without a matching opening punctuation.
+      return None
+    elif text[position] in matching_punctuation:
+      punctuation_stack.append(matching_punctuation[text[position]])
+    position += 1
+  if punctuation_stack:
+    # Opening punctuation left without matching closing punctuation.
+    return None
+  # All punctuation matched.
+  return text[start_position:position - 1]
+
+
+# Patterns for matching call-by-reference parameters.
+#
+# Supports nested templates up to 2 levels deep using this messy pattern:
+#   < (?: < (?: < [^<>]*
+#               >
+#           |   [^<>] )*
+#         >
+#     |   [^<>] )*
+#   >
+_RE_PATTERN_IDENT = r'[_a-zA-Z]\w*'  # =~ [[:alpha:]][[:alnum:]]*
+_RE_PATTERN_TYPE = (
+    r'(?:const\s+)?(?:typename\s+|class\s+|struct\s+|union\s+|enum\s+)?'
+    r'(?:\w|'
+    r'\s*<(?:<(?:<[^<>]*>|[^<>])*>|[^<>])*>|'
+    r'::)+')
+# A call-by-reference parameter ends with '& identifier'.
+_RE_PATTERN_REF_PARAM = re.compile(
+    r'(' + _RE_PATTERN_TYPE + r'(?:\s*(?:\bconst\b|[*]))*\s*'
+    r'&\s*' + _RE_PATTERN_IDENT + r')\s*(?:=[^,()]+)?[,)]')
+# A call-by-const-reference parameter either ends with 'const& identifier'
+# or looks like 'const type& identifier' when 'type' is atomic.
+_RE_PATTERN_CONST_REF_PARAM = (
+    r'(?:.*\s*\bconst\s*&\s*' + _RE_PATTERN_IDENT +
+    r'|const\s+' + _RE_PATTERN_TYPE + r'\s*&\s*' + _RE_PATTERN_IDENT + r')')
+
+
+def CheckLanguage(filename, clean_lines, linenum, file_extension,
+                  include_state, nesting_state, error):
+  """Checks rules from the 'C++ language rules' section of cppguide.html.
+
+  Some of these rules are hard to test (function overloading, using
+  uint32 inappropriately), but we do the best we can.
+
+  Args:
+    filename: The name of the current file.
+    clean_lines: A CleansedLines instance containing the file.
+    linenum: The number of the line to check.
+    file_extension: The extension (without the dot) of the filename.
+    include_state: An _IncludeState instance in which the headers are inserted.
+    nesting_state: A NestingState instance which maintains information about
+                   the current stack of nested blocks being parsed.
+    error: The function to call with any errors found.
+  """
+  # If the line is empty or consists of entirely a comment, no need to
+  # check it.
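+  #
+  # A few illustrative sketches of inputs and the verdicts reached by the
+  # checks in this function (hypothetical examples, not an exhaustive list):
+  #
+  #   short port;           // runtime/int: use "unsigned short" for ports
+  #   long long counter;    // runtime/int: use int16/int64/etc instead
+  #   using namespace std;  // build/namespaces: use using-declarations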
+ line = clean_lines.elided[linenum] + if not line: + return + + match = _RE_PATTERN_INCLUDE.search(line) + if match: + CheckIncludeLine(filename, clean_lines, linenum, include_state, error) + return + + # Reset include state across preprocessor directives. This is meant + # to silence warnings for conditional includes. + match = Match(r'^\s*#\s*(if|ifdef|ifndef|elif|else|endif)\b', line) + if match: + include_state.ResetSection(match.group(1)) + + # Make Windows paths like Unix. + fullname = os.path.abspath(filename).replace('\\', '/') + + # Perform other checks now that we are sure that this is not an include line + CheckCasts(filename, clean_lines, linenum, error) + CheckGlobalStatic(filename, clean_lines, linenum, error) + CheckPrintf(filename, clean_lines, linenum, error) + + if file_extension == 'h': + # TODO(unknown): check that 1-arg constructors are explicit. + # How to tell it's a constructor? + # (handled in CheckForNonStandardConstructs for now) + # TODO(unknown): check that classes declare or disable copy/assign + # (level 1 error) + pass + + # Check if people are using the verboten C basic types. The only exception + # we regularly allow is "unsigned short port" for port. + if Search(r'\bshort port\b', line): + if not Search(r'\bunsigned short port\b', line): + error(filename, linenum, 'runtime/int', 4, + 'Use "unsigned short" for ports, not "short"') + else: + match = Search(r'\b(short|long(?! +double)|long long)\b', line) + if match: + error(filename, linenum, 'runtime/int', 4, + 'Use int16/int64/etc, rather than the C type %s' % match.group(1)) + + # Check if some verboten operator overloading is going on + # TODO(unknown): catch out-of-line unary operator&: + # class X {}; + # int operator&(const X& x) { return 42; } // unary operator& + # The trick is it's hard to tell apart from binary operator&: + # class Y { int operator&(const Y& x) { return 23; } }; // binary operator& + if Search(r'\boperator\s*&\s*\(\s*\)', line): + error(filename, linenum, 'runtime/operator', 4, + 'Unary operator& is dangerous. Do not use it.') + + # Check for suspicious usage of "if" like + # } if (a == b) { + if Search(r'\}\s*if\s*\(', line): + error(filename, linenum, 'readability/braces', 4, + 'Did you mean "else if"? If not, start a new line for "if".') + + # Check for potential format string bugs like printf(foo). + # We constrain the pattern not to pick things like DocidForPrintf(foo). + # Not perfect but it can catch printf(foo.c_str()) and printf(foo->c_str()) + # TODO(unknown): Catch the following case. Need to change the calling + # convention of the whole function to process multiple line to handle it. + # printf( + # boy_this_is_a_really_long_variable_that_cannot_fit_on_the_prev_line); + printf_args = _GetTextInside(line, r'(?i)\b(string)?printf\s*\(') + if printf_args: + match = Match(r'([\w.\->()]+)$', printf_args) + if match and match.group(1) != '__VA_ARGS__': + function_name = re.search(r'\b((?:string)?printf)\s*\(', + line, re.I).group(1) + error(filename, linenum, 'runtime/printf', 4, + 'Potential format string bug. Do %s("%%s", %s) instead.' + % (function_name, match.group(1))) + + # Check for potential memset bugs like memset(buf, sizeof(buf), 0). + match = Search(r'memset\s*\(([^,]*),\s*([^,]*),\s*0\s*\)', line) + if match and not Match(r"^''|-?[0-9]+|0x[0-9A-Fa-f]$", match.group(2)): + error(filename, linenum, 'runtime/memset', 4, + 'Did you mean "memset(%s, 0, %s)"?' 
+ % (match.group(1), match.group(2))) + + if Search(r'\busing namespace\b', line): + error(filename, linenum, 'build/namespaces', 5, + 'Do not use namespace using-directives. ' + 'Use using-declarations instead.') + + # Detect variable-length arrays. + match = Match(r'\s*(.+::)?(\w+) [a-z]\w*\[(.+)];', line) + if (match and match.group(2) != 'return' and match.group(2) != 'delete' and + match.group(3).find(']') == -1): + # Split the size using space and arithmetic operators as delimiters. + # If any of the resulting tokens are not compile time constants then + # report the error. + tokens = re.split(r'\s|\+|\-|\*|\/|<<|>>]', match.group(3)) + is_const = True + skip_next = False + for tok in tokens: + if skip_next: + skip_next = False + continue + + if Search(r'sizeof\(.+\)', tok): continue + if Search(r'arraysize\(\w+\)', tok): continue + + tok = tok.lstrip('(') + tok = tok.rstrip(')') + if not tok: continue + if Match(r'\d+', tok): continue + if Match(r'0[xX][0-9a-fA-F]+', tok): continue + if Match(r'k[A-Z0-9]\w*', tok): continue + if Match(r'(.+::)?k[A-Z0-9]\w*', tok): continue + if Match(r'(.+::)?[A-Z][A-Z0-9_]*', tok): continue + # A catch all for tricky sizeof cases, including 'sizeof expression', + # 'sizeof(*type)', 'sizeof(const type)', 'sizeof(struct StructName)' + # requires skipping the next token because we split on ' ' and '*'. + if tok.startswith('sizeof'): + skip_next = True + continue + is_const = False + break + if not is_const: + error(filename, linenum, 'runtime/arrays', 1, + 'Do not use variable-length arrays. Use an appropriately named ' + "('k' followed by CamelCase) compile-time constant for the size.") + + # Check for use of unnamed namespaces in header files. Registration + # macros are typically OK, so we allow use of "namespace {" on lines + # that end with backslashes. + if (file_extension == 'h' + and Search(r'\bnamespace\s*{', line) + and line[-1] != '\\'): + error(filename, linenum, 'build/namespaces', 4, + 'Do not use unnamed namespaces in header files. See ' + 'http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml#Namespaces' + ' for more information.') + + +def CheckGlobalStatic(filename, clean_lines, linenum, error): + """Check for unsafe global or static objects. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + line = clean_lines.elided[linenum] + + # Match two lines at a time to support multiline declarations + if linenum + 1 < clean_lines.NumLines() and not Search(r'[;({]', line): + line += clean_lines.elided[linenum + 1].strip() + + # Check for people declaring static/global STL strings at the top level. + # This is dangerous because the C++ language does not guarantee that + # globals with constructors are initialized before the first access. + match = Match( + r'((?:|static +)(?:|const +))string +([a-zA-Z0-9_:]+)\b(.*)', + line) + + # Remove false positives: + # - String pointers (as opposed to values). + # string *pointer + # const string *pointer + # string const *pointer + # string *const pointer + # + # - Functions and template specializations. + # string Function(... + # string Class::Method(... + # + # - Operators. These are matched separately because operator names + # cross non-word boundaries, and trying to match both operators + # and functions at the same time would decrease accuracy of + # matching identifiers. 
+ # string Class::operator*() + if (match and + not Search(r'\bstring\b(\s+const)?\s*\*\s*(const\s+)?\w', line) and + not Search(r'\boperator\W', line) and + not Match(r'\s*(<.*>)?(::[a-zA-Z0-9_]+)*\s*\(([^"]|$)', match.group(3))): + error(filename, linenum, 'runtime/string', 4, + 'For a static/global string constant, use a C style string instead: ' + '"%schar %s[]".' % + (match.group(1), match.group(2))) + + if Search(r'\b([A-Za-z0-9_]*_)\(\1\)', line): + error(filename, linenum, 'runtime/init', 4, + 'You seem to be initializing a member variable with itself.') + + +def CheckPrintf(filename, clean_lines, linenum, error): + """Check for printf related issues. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + line = clean_lines.elided[linenum] + + # When snprintf is used, the second argument shouldn't be a literal. + match = Search(r'snprintf\s*\(([^,]*),\s*([0-9]*)\s*,', line) + if match and match.group(2) != '0': + # If 2nd arg is zero, snprintf is used to calculate size. + error(filename, linenum, 'runtime/printf', 3, + 'If you can, use sizeof(%s) instead of %s as the 2nd arg ' + 'to snprintf.' % (match.group(1), match.group(2))) + + # Check if some verboten C functions are being used. + if Search(r'\bsprintf\s*\(', line): + error(filename, linenum, 'runtime/printf', 5, + 'Never use sprintf. Use snprintf instead.') + match = Search(r'\b(strcpy|strcat)\s*\(', line) + if match: + error(filename, linenum, 'runtime/printf', 4, + 'Almost always, snprintf is better than %s' % match.group(1)) + + +def IsDerivedFunction(clean_lines, linenum): + """Check if current line contains an inherited function. + + Args: + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + Returns: + True if current line contains a function with "override" + virt-specifier. + """ + # Scan back a few lines for start of current function + for i in xrange(linenum, max(-1, linenum - 10), -1): + match = Match(r'^([^()]*\w+)\(', clean_lines.elided[i]) + if match: + # Look for "override" after the matching closing parenthesis + line, _, closing_paren = CloseExpression( + clean_lines, i, len(match.group(1))) + return (closing_paren >= 0 and + Search(r'\boverride\b', line[closing_paren:])) + return False + + +def IsOutOfLineMethodDefinition(clean_lines, linenum): + """Check if current line contains an out-of-line method definition. + + Args: + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + Returns: + True if current line contains an out-of-line method definition. + """ + # Scan back a few lines for start of current function + for i in xrange(linenum, max(-1, linenum - 10), -1): + if Match(r'^([^()]*\w+)\(', clean_lines.elided[i]): + return Match(r'^[^()]*\w+::\w+\(', clean_lines.elided[i]) is not None + return False + + +def IsInitializerList(clean_lines, linenum): + """Check if current line is inside constructor initializer list. + + Args: + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + Returns: + True if current line appears to be inside constructor initializer + list, False otherwise. 
+ """ + for i in xrange(linenum, 1, -1): + line = clean_lines.elided[i] + if i == linenum: + remove_function_body = Match(r'^(.*)\{\s*$', line) + if remove_function_body: + line = remove_function_body.group(1) + + if Search(r'\s:\s*\w+[({]', line): + # A lone colon tend to indicate the start of a constructor + # initializer list. It could also be a ternary operator, which + # also tend to appear in constructor initializer lists as + # opposed to parameter lists. + return True + if Search(r'\}\s*,\s*$', line): + # A closing brace followed by a comma is probably the end of a + # brace-initialized member in constructor initializer list. + return True + if Search(r'[{};]\s*$', line): + # Found one of the following: + # - A closing brace or semicolon, probably the end of the previous + # function. + # - An opening brace, probably the start of current class or namespace. + # + # Current line is probably not inside an initializer list since + # we saw one of those things without seeing the starting colon. + return False + + # Got to the beginning of the file without seeing the start of + # constructor initializer list. + return False + + +def CheckForNonConstReference(filename, clean_lines, linenum, + nesting_state, error): + """Check for non-const references. + + Separate from CheckLanguage since it scans backwards from current + line, instead of scanning forward. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + nesting_state: A NestingState instance which maintains information about + the current stack of nested blocks being parsed. + error: The function to call with any errors found. + """ + # Do nothing if there is no '&' on current line. + line = clean_lines.elided[linenum] + if '&' not in line: + return + + # If a function is inherited, current function doesn't have much of + # a choice, so any non-const references should not be blamed on + # derived function. + if IsDerivedFunction(clean_lines, linenum): + return + + # Don't warn on out-of-line method definitions, as we would warn on the + # in-line declaration, if it isn't marked with 'override'. + if IsOutOfLineMethodDefinition(clean_lines, linenum): + return + + # Long type names may be broken across multiple lines, usually in one + # of these forms: + # LongType + # ::LongTypeContinued &identifier + # LongType:: + # LongTypeContinued &identifier + # LongType< + # ...>::LongTypeContinued &identifier + # + # If we detected a type split across two lines, join the previous + # line to current line so that we can match const references + # accordingly. + # + # Note that this only scans back one line, since scanning back + # arbitrary number of lines would be expensive. If you have a type + # that spans more than 2 lines, please use a typedef. 
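+  #
+  # Illustrative sketches of the end result ("Foo" is a hypothetical type):
+  #   void Swap(Foo& a, Foo& b);    // swap() is whitelisted below: no warning
+  #   void Update(Foo& foo);        // flagged: runtime/references
+  #   void Print(const Foo& foo);   // const reference: no warning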
+ if linenum > 1: + previous = None + if Match(r'\s*::(?:[\w<>]|::)+\s*&\s*\S', line): + # previous_line\n + ::current_line + previous = Search(r'\b((?:const\s*)?(?:[\w<>]|::)+[\w<>])\s*$', + clean_lines.elided[linenum - 1]) + elif Match(r'\s*[a-zA-Z_]([\w<>]|::)+\s*&\s*\S', line): + # previous_line::\n + current_line + previous = Search(r'\b((?:const\s*)?(?:[\w<>]|::)+::)\s*$', + clean_lines.elided[linenum - 1]) + if previous: + line = previous.group(1) + line.lstrip() + else: + # Check for templated parameter that is split across multiple lines + endpos = line.rfind('>') + if endpos > -1: + (_, startline, startpos) = ReverseCloseExpression( + clean_lines, linenum, endpos) + if startpos > -1 and startline < linenum: + # Found the matching < on an earlier line, collect all + # pieces up to current line. + line = '' + for i in xrange(startline, linenum + 1): + line += clean_lines.elided[i].strip() + + # Check for non-const references in function parameters. A single '&' may + # found in the following places: + # inside expression: binary & for bitwise AND + # inside expression: unary & for taking the address of something + # inside declarators: reference parameter + # We will exclude the first two cases by checking that we are not inside a + # function body, including one that was just introduced by a trailing '{'. + # TODO(unknown): Doesn't account for 'catch(Exception& e)' [rare]. + if (nesting_state.previous_stack_top and + not (isinstance(nesting_state.previous_stack_top, _ClassInfo) or + isinstance(nesting_state.previous_stack_top, _NamespaceInfo))): + # Not at toplevel, not within a class, and not within a namespace + return + + # Avoid initializer lists. We only need to scan back from the + # current line for something that starts with ':'. + # + # We don't need to check the current line, since the '&' would + # appear inside the second set of parentheses on the current line as + # opposed to the first set. + if linenum > 0: + for i in xrange(linenum - 1, max(0, linenum - 10), -1): + previous_line = clean_lines.elided[i] + if not Search(r'[),]\s*$', previous_line): + break + if Match(r'^\s*:\s+\S', previous_line): + return + + # Avoid preprocessors + if Search(r'\\\s*$', line): + return + + # Avoid constructor initializer lists + if IsInitializerList(clean_lines, linenum): + return + + # We allow non-const references in a few standard places, like functions + # called "swap()" or iostream operators like "<<" or ">>". Do not check + # those function parameters. + # + # We also accept & in static_assert, which looks like a function but + # it's actually a declaration expression. + whitelisted_functions = (r'(?:[sS]wap(?:<\w:+>)?|' + r'operator\s*[<>][<>]|' + r'static_assert|COMPILE_ASSERT' + r')\s*\(') + if Search(whitelisted_functions, line): + return + elif not Search(r'\S+\([^)]*$', line): + # Don't see a whitelisted function on this line. Actually we + # didn't see any function name on this line, so this is likely a + # multi-line parameter list. Try a bit harder to catch this case. + for i in xrange(2): + if (linenum > i and + Search(whitelisted_functions, clean_lines.elided[linenum - i - 1])): + return + + decls = ReplaceAll(r'{[^}]*}', ' ', line) # exclude function body + for parameter in re.findall(_RE_PATTERN_REF_PARAM, decls): + if not Match(_RE_PATTERN_CONST_REF_PARAM, parameter): + error(filename, linenum, 'runtime/references', 2, + 'Is this a non-const reference? 
'
+            'If so, make const or use a pointer: ' +
+            ReplaceAll(' *<', '<', parameter))
+
+
+def CheckCasts(filename, clean_lines, linenum, error):
+  """Various cast related checks.
+
+  Args:
+    filename: The name of the current file.
+    clean_lines: A CleansedLines instance containing the file.
+    linenum: The number of the line to check.
+    error: The function to call with any errors found.
+  """
+  line = clean_lines.elided[linenum]
+
+  # Check to see if they're using a conversion function cast.
+  # I just try to capture the most common basic types, though there are more.
+  # Parameterless conversion functions, such as bool(), are allowed as they are
+  # probably a member operator declaration or default constructor.
+  match = Search(
+      r'(\bnew\s+|\S<\s*(?:const\s+)?)?\b'
+      r'(int|float|double|bool|char|int32|uint32|int64|uint64)'
+      r'(\([^)].*)', line)
+  expecting_function = ExpectingFunctionArgs(clean_lines, linenum)
+  if match and not expecting_function:
+    matched_type = match.group(2)
+
+    # matched_new_or_template is used to silence two false positives:
+    # - New operators
+    # - Template arguments with function types
+    #
+    # For template arguments, we match on types immediately following
+    # an opening bracket without any spaces.  This is a fast way to
+    # silence the common case where the function type is the first
+    # template argument.  False negative with less-than comparison is
+    # avoided because those operators are usually followed by a space.
+    #
+    #   function<double(double)>   // bracket + no space = false positive
+    #   value < double(42)         // bracket + space = true positive
+    matched_new_or_template = match.group(1)
+
+    # Avoid arrays by looking for brackets that come after the closing
+    # parenthesis.
+    if Match(r'\([^()]+\)\s*\[', match.group(3)):
+      return
+
+    # Other things to ignore:
+    # - Function pointers
+    # - Casts to pointer types
+    # - Placement new
+    # - Alias declarations
+    matched_funcptr = match.group(3)
+    if (matched_new_or_template is None and
+        not (matched_funcptr and
+             (Match(r'\((?:[^() ]+::\s*\*\s*)?[^() ]+\)\s*\(',
+                    matched_funcptr) or
+              matched_funcptr.startswith('(*)'))) and
+        not Match(r'\s*using\s+\S+\s*=\s*' + matched_type, line) and
+        not Search(r'new\(\S+\)\s*' + matched_type, line)):
+      error(filename, linenum, 'readability/casting', 4,
+            'Using deprecated casting style.  '
+            'Use static_cast<%s>(...) instead' %
+            matched_type)
+
+  if not expecting_function:
+    CheckCStyleCast(filename, clean_lines, linenum, 'static_cast',
+                    r'\((int|float|double|bool|char|u?int(16|32|64))\)', error)
+
+  # This doesn't catch all cases.  Consider (const char * const)"hello".
+  #
+  # (char *) "foo" should always be a const_cast (reinterpret_cast won't
+  # compile).
+  if CheckCStyleCast(filename, clean_lines, linenum, 'const_cast',
+                     r'\((char\s?\*+\s?)\)\s*"', error):
+    pass
+  else:
+    # Check pointer casts for other than string constants
+    CheckCStyleCast(filename, clean_lines, linenum, 'reinterpret_cast',
+                    r'\((\w+\s?\*+\s?)\)', error)
+
+  # In addition, we look for people taking the address of a cast.  This
+  # is dangerous -- casts can assign to temporaries, so the pointer doesn't
+  # point where you think.
+  #
+  # Some non-identifier character is required before the '&' for the
+  # expression to be recognized as a cast.
These are casts: + # expression = &static_cast(temporary()); + # function(&(int*)(temporary())); + # + # This is not a cast: + # reference_type&(int* function_param); + match = Search( + r'(?:[^\w]&\(([^)*][^)]*)\)[\w(])|' + r'(?:[^\w]&(static|dynamic|down|reinterpret)_cast\b)', line) + if match: + # Try a better error message when the & is bound to something + # dereferenced by the casted pointer, as opposed to the casted + # pointer itself. + parenthesis_error = False + match = Match(r'^(.*&(?:static|dynamic|down|reinterpret)_cast\b)<', line) + if match: + _, y1, x1 = CloseExpression(clean_lines, linenum, len(match.group(1))) + if x1 >= 0 and clean_lines.elided[y1][x1] == '(': + _, y2, x2 = CloseExpression(clean_lines, y1, x1) + if x2 >= 0: + extended_line = clean_lines.elided[y2][x2:] + if y2 < clean_lines.NumLines() - 1: + extended_line += clean_lines.elided[y2 + 1] + if Match(r'\s*(?:->|\[)', extended_line): + parenthesis_error = True + + if parenthesis_error: + error(filename, linenum, 'readability/casting', 4, + ('Are you taking an address of something dereferenced ' + 'from a cast? Wrapping the dereferenced expression in ' + 'parentheses will make the binding more obvious')) + else: + error(filename, linenum, 'runtime/casting', 4, + ('Are you taking an address of a cast? ' + 'This is dangerous: could be a temp var. ' + 'Take the address before doing the cast, rather than after')) + + +def CheckCStyleCast(filename, clean_lines, linenum, cast_type, pattern, error): + """Checks for a C-style cast by looking for the pattern. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + cast_type: The string for the C++ cast to recommend. This is either + reinterpret_cast, static_cast, or const_cast, depending. + pattern: The regular expression used to find C-style casts. + error: The function to call with any errors found. + + Returns: + True if an error was emitted. + False otherwise. + """ + line = clean_lines.elided[linenum] + match = Search(pattern, line) + if not match: + return False + + # Exclude lines with keywords that tend to look like casts + context = line[0:match.start(1) - 1] + if Match(r'.*\b(?:sizeof|alignof|alignas|[_A-Z][_A-Z0-9]*)\s*$', context): + return False + + # Try expanding current context to see if we one level of + # parentheses inside a macro. + if linenum > 0: + for i in xrange(linenum - 1, max(0, linenum - 5), -1): + context = clean_lines.elided[i] + context + if Match(r'.*\b[_A-Z][_A-Z0-9]*\s*\((?:\([^()]*\)|[^()])*$', context): + return False + + # operator++(int) and operator--(int) + if context.endswith(' operator++') or context.endswith(' operator--'): + return False + + # A single unnamed argument for a function tends to look like old + # style cast. If we see those, don't issue warnings for deprecated + # casts, instead issue warnings for unnamed arguments where + # appropriate. + # + # These are things that we want warnings for, since the style guide + # explicitly require all parameters to be named: + # Function(int); + # Function(int) { + # ConstMember(int) const; + # ConstMember(int) const { + # ExceptionMember(int) throw (...); + # ExceptionMember(int) throw (...) 
{
+  #   PureVirtual(int) = 0;
+  #   [](int) -> bool {
+  #
+  # These are functions of some sort, where the compiler would be fine
+  # if they had named parameters, but people often omit those
+  # identifiers to reduce clutter:
+  #   (FunctionPointer)(int);
+  #   (FunctionPointer)(int) = value;
+  #   Function((function_pointer_arg)(int))
+  #   Function((function_pointer_arg)(int), int param)
+  #   <TemplateArgument(int)>;
+  #   <(FunctionPointerTemplateArgument)(int)>;
+  remainder = line[match.end(0):]
+  if Match(r'^\s*(?:;|const\b|throw\b|final\b|override\b|[=>{),]|->)',
+           remainder):
+    # Looks like an unnamed parameter.
+
+    # Don't warn on any kind of template arguments.
+    if Match(r'^\s*>', remainder):
+      return False
+
+    # Don't warn on assignments to function pointers, but keep warnings for
+    # unnamed parameters to pure virtual functions. Note that this pattern
+    # will also pass on assignments of "0" to function pointers, but the
+    # preferred values for those would be "nullptr" or "NULL".
+    matched_zero = Match(r'^\s=\s*(\S+)\s*;', remainder)
+    if matched_zero and matched_zero.group(1) != '0':
+      return False
+
+    # Don't warn on function pointer declarations. For this we need
+    # to check what came before the "(type)" string.
+    if Match(r'.*\)\s*$', line[0:match.start(0)]):
+      return False
+
+    # Don't warn if the parameter is named with block comments, e.g.:
+    #   Function(int /*unused_param*/);
+    raw_line = clean_lines.raw_lines[linenum]
+    if '/*' in raw_line:
+      return False
+
+    # Passed all filters, issue warning here.
+    error(filename, linenum, 'readability/function', 3,
+          'All parameters should be named in a function')
+    return True
+
+  # At this point, all that should be left is actual casts.
+  error(filename, linenum, 'readability/casting', 4,
+        'Using C-style cast. Use %s<%s>(...) instead' %
+        (cast_type, match.group(1)))
+
+  return True
+
+
+def ExpectingFunctionArgs(clean_lines, linenum):
+  """Checks whether the line is at a place where function type arguments
+  are expected.
+
+  Args:
+    clean_lines: A CleansedLines instance containing the file.
+    linenum: The number of the line to check.
+
+  Returns:
+    True if the line at 'linenum' is inside something that expects arguments
+    of function types.
+  """
+  line = clean_lines.elided[linenum]
+  return (Match(r'^\s*MOCK_(CONST_)?METHOD\d+(_T)?\(', line) or
+          (linenum >= 2 and
+           (Match(r'^\s*MOCK_(?:CONST_)?METHOD\d+(?:_T)?\((?:\S+,)?\s*$',
+                  clean_lines.elided[linenum - 1]) or
+            Match(r'^\s*MOCK_(?:CONST_)?METHOD\d+(?:_T)?\(\s*$',
+                  clean_lines.elided[linenum - 2]) or
+            Search(r'\bstd::m?function\s*\<\s*$',
+                   clean_lines.elided[linenum - 1]))))
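+
+
+# Maps STL headers to the templated entities they provide; used below to
+# build _re_pattern_templates for the include-what-you-use check. For
+# example, a line that uses std::deque<int> in a file that never includes
+# <deque> is reported as: "Add #include <deque> for deque<>".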
+_HEADERS_CONTAINING_TEMPLATES = (
+    ('<deque>', ('deque',)),
+    ('<functional>', ('unary_function', 'binary_function',
+                      'plus', 'minus', 'multiplies', 'divides', 'modulus',
+                      'negate',
+                      'equal_to', 'not_equal_to', 'greater', 'less',
+                      'greater_equal', 'less_equal',
+                      'logical_and', 'logical_or', 'logical_not',
+                      'unary_negate', 'not1', 'binary_negate', 'not2',
+                      'bind1st', 'bind2nd',
+                      'pointer_to_unary_function',
+                      'pointer_to_binary_function',
+                      'ptr_fun',
+                      'mem_fun_t', 'mem_fun', 'mem_fun1_t', 'mem_fun1_ref_t',
+                      'mem_fun_ref_t',
+                      'const_mem_fun_t', 'const_mem_fun1_t',
+                      'const_mem_fun_ref_t', 'const_mem_fun1_ref_t',
+                      'mem_fun_ref',
+                     )),
+    ('<limits>', ('numeric_limits',)),
+    ('<list>', ('list',)),
+    ('<map>', ('map', 'multimap',)),
+    ('<memory>', ('allocator',)),
+    ('<queue>', ('queue', 'priority_queue',)),
+    ('<set>', ('set', 'multiset',)),
+    ('<stack>', ('stack',)),
+    ('<string>', ('char_traits', 'basic_string',)),
+    ('<tuple>', ('tuple',)),
+    ('<utility>', ('pair',)),
+    ('<vector>', ('vector',)),
+
+    # gcc extensions.
+    # Note: std::hash is their hash, ::hash is our hash
+    ('<hash_map>', ('hash_map', 'hash_multimap',)),
+    ('<hash_set>', ('hash_set', 'hash_multiset',)),
+    ('<slist>', ('slist',)),
+    )
+
+_RE_PATTERN_STRING = re.compile(r'\bstring\b')
+
+_re_pattern_algorithm_header = []
+for _template in ('copy', 'max', 'min', 'min_element', 'sort', 'swap',
+                  'transform'):
+  # Match max<type>(..., ...), max(..., ...), but not foo->max, foo.max or
+  # type::max().
+  _re_pattern_algorithm_header.append(
+      (re.compile(r'[^>.]\b' + _template + r'(<.*?>)?\([^\)]'),
+       _template,
+       '<algorithm>'))
+
+_re_pattern_templates = []
+for _header, _templates in _HEADERS_CONTAINING_TEMPLATES:
+  for _template in _templates:
+    _re_pattern_templates.append(
+        (re.compile(r'(\<|\b)' + _template + r'\s*\<'),
+         _template + '<>',
+         _header))
+
+
+def FilesBelongToSameModule(filename_cc, filename_h):
+  """Check if these two filenames belong to the same module.
+
+  The concept of a 'module' here is as follows:
+  foo.h, foo-inl.h, foo.cc, foo_test.cc and foo_unittest.cc belong to the
+  same 'module' if they are in the same directory.
+  some/path/public/xyzzy and some/path/internal/xyzzy are also considered
+  to belong to the same module here.
+
+  If the filename_cc contains a longer path than the filename_h, for example,
+  '/absolute/path/to/base/sysinfo.cc', and this file would include
+  'base/sysinfo.h', this function also produces the prefix needed to open
+  the header. This is used by the caller of this function to more robustly
+  open the header file. We don't have access to the real include paths in
+  this context, so we need this guesswork here.
+
+  Known bugs: tools/base/bar.cc and base/bar.h belong to the same module
+  according to this implementation. Because of this, this function gives
+  some false positives. This should be sufficiently rare in practice.
+
+  Args:
+    filename_cc: is the path for the .cc file
+    filename_h: is the path for the header file
+
+  Returns:
+    Tuple with a bool and a string:
+    bool: True if filename_cc and filename_h belong to the same module.
+    string: the additional prefix needed to open the header file.
+  """
+
+  if not filename_cc.endswith('.cc'):
+    return (False, '')
+  filename_cc = filename_cc[:-len('.cc')]
+  if filename_cc.endswith('_unittest'):
+    filename_cc = filename_cc[:-len('_unittest')]
+  elif filename_cc.endswith('_test'):
+    filename_cc = filename_cc[:-len('_test')]
+  filename_cc = filename_cc.replace('/public/', '/')
+  filename_cc = filename_cc.replace('/internal/', '/')
+
+  if not filename_h.endswith('.h'):
+    return (False, '')
+  filename_h = filename_h[:-len('.h')]
+  if filename_h.endswith('-inl'):
+    filename_h = filename_h[:-len('-inl')]
+  filename_h = filename_h.replace('/public/', '/')
+  filename_h = filename_h.replace('/internal/', '/')
+
+  files_belong_to_same_module = filename_cc.endswith(filename_h)
+  common_path = ''
+  if files_belong_to_same_module:
+    common_path = filename_cc[:-len(filename_h)]
+  return files_belong_to_same_module, common_path
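+
+# Illustrative results (hypothetical paths):
+#   FilesBelongToSameModule('a/b/foo.cc', 'a/b/foo.h')       -> (True, '')
+#   FilesBelongToSameModule('a/b/foo_test.cc', 'a/b/foo.h')  -> (True, '')
+#   FilesBelongToSameModule('/abs/base/sysinfo.cc',
+#                           'base/sysinfo.h')                -> (True, '/abs/')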
+ """ + headerfile = None + try: + headerfile = io.open(filename, 'r', 'utf8', 'replace') + except IOError: + return False + linenum = 0 + for line in headerfile: + linenum += 1 + clean_line = CleanseComments(line) + match = _RE_PATTERN_INCLUDE.search(clean_line) + if match: + include = match.group(2) + include_dict.setdefault(include, linenum) + return True + + +def CheckForIncludeWhatYouUse(filename, clean_lines, include_state, error, + io=codecs): + """Reports for missing stl includes. + + This function will output warnings to make sure you are including the headers + necessary for the stl containers and functions that you use. We only give one + reason to include a header. For example, if you use both equal_to<> and + less<> in a .h file, only one (the latter in the file) of these will be + reported as a reason to include the . + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + include_state: An _IncludeState instance. + error: The function to call with any errors found. + io: The IO factory to use to read the header file. Provided for unittest + injection. + """ + required = {} # A map of header name to linenumber and the template entity. + # Example of required: { '': (1219, 'less<>') } + + for linenum in xrange(clean_lines.NumLines()): + line = clean_lines.elided[linenum] + if not line or line[0] == '#': + continue + + # String is special -- it is a non-templatized type in STL. + matched = _RE_PATTERN_STRING.search(line) + if matched: + # Don't warn about strings in non-STL namespaces: + # (We check only the first match per line; good enough.) + prefix = line[:matched.start()] + if prefix.endswith('std::') or not prefix.endswith('::'): + required[''] = (linenum, 'string') + + for pattern, template, header in _re_pattern_algorithm_header: + if pattern.search(line): + required[header] = (linenum, template) + + # The following function is just a speed up, no semantics are changed. + if not '<' in line: # Reduces the cpu time usage by skipping lines. + continue + + for pattern, template, header in _re_pattern_templates: + if pattern.search(line): + required[header] = (linenum, template) + + # The policy is that if you #include something in foo.h you don't need to + # include it again in foo.cc. Here, we will look at possible includes. + # Let's flatten the include_state include_list and copy it into a dictionary. + include_dict = dict([item for sublist in include_state.include_list + for item in sublist]) + + # Did we find the header for this file (if any) and successfully load it? + header_found = False + + # Use the absolute path so that matching works properly. + abs_filename = FileInfo(filename).FullName() + + # For Emacs's flymake. + # If cpplint is invoked from Emacs's flymake, a temporary file is generated + # by flymake and that file name might end with '_flymake.cc'. In that case, + # restore original file name here so that the corresponding header file can be + # found. + # e.g. If the file name is 'foo_flymake.cc', we should search for 'foo.h' + # instead of 'foo_flymake.h' + abs_filename = re.sub(r'_flymake\.cc$', '.cc', abs_filename) + + # include_dict is modified during iteration, so we iterate over a copy of + # the keys. 
+ header_keys = include_dict.keys() + for header in header_keys: + (same_module, common_path) = FilesBelongToSameModule(abs_filename, header) + fullpath = common_path + header + if same_module and UpdateIncludeState(fullpath, include_dict, io): + header_found = True + + # If we can't find the header file for a .cc, assume it's because we don't + # know where to look. In that case we'll give up as we're not sure they + # didn't include it in the .h file. + # TODO(unknown): Do a better job of finding .h files so we are confident that + # not having the .h file means there isn't one. + if filename.endswith('.cc') and not header_found: + return + + # All the lines have been processed, report the errors found. + for required_header_unstripped in required: + template = required[required_header_unstripped][1] + if required_header_unstripped.strip('<>"') not in include_dict: + error(filename, required[required_header_unstripped][0], + 'build/include_what_you_use', 4, + 'Add #include ' + required_header_unstripped + ' for ' + template) + + +_RE_PATTERN_EXPLICIT_MAKEPAIR = re.compile(r'\bmake_pair\s*<') + + +def CheckMakePairUsesDeduction(filename, clean_lines, linenum, error): + """Check that make_pair's template arguments are deduced. + + G++ 4.6 in C++11 mode fails badly if make_pair's template arguments are + specified explicitly, and such use isn't intended in any case. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + line = clean_lines.elided[linenum] + match = _RE_PATTERN_EXPLICIT_MAKEPAIR.search(line) + if match: + error(filename, linenum, 'build/explicit_make_pair', + 4, # 4 = high confidence + 'For C++11-compatibility, omit template arguments from make_pair' + ' OR use pair directly OR if appropriate, construct a pair directly') + + +def CheckDefaultLambdaCaptures(filename, clean_lines, linenum, error): + """Check that default lambda captures are not used. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + line = clean_lines.elided[linenum] + + # A lambda introducer specifies a default capture if it starts with "[=" + # or if it starts with "[&" _not_ followed by an identifier. + match = Match(r'^(.*)\[\s*(?:=|&[^\w])', line) + if match: + # Found a potential error, check what comes after the lambda-introducer. + # If it's not open parenthesis (for lambda-declarator) or open brace + # (for compound-statement), it's not a lambda. + line, _, pos = CloseExpression(clean_lines, linenum, len(match.group(1))) + if pos >= 0 and Match(r'^\s*[{(]', line[pos:]): + error(filename, linenum, 'build/c++11', + 4, # 4 = high confidence + 'Default lambda captures are an unapproved C++ feature.') + + +def CheckRedundantVirtual(filename, clean_lines, linenum, error): + """Check if line contains a redundant "virtual" function-specifier. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + # Look for "virtual" on current line. + line = clean_lines.elided[linenum] + virtual = Match(r'^(.*)(\bvirtual\b)(.*)$', line) + if not virtual: return + + # Ignore "virtual" keywords that are near access-specifiers. 
These + # are only used in class base-specifier and do not apply to member + # functions. + if (Search(r'\b(public|protected|private)\s+$', virtual.group(1)) or + Match(r'^\s+(public|protected|private)\b', virtual.group(3))): + return + + # Ignore the "virtual" keyword from virtual base classes. Usually + # there is a column on the same line in these cases (virtual base + # classes are rare in google3 because multiple inheritance is rare). + if Match(r'^.*[^:]:[^:].*$', line): return + + # Look for the next opening parenthesis. This is the start of the + # parameter list (possibly on the next line shortly after virtual). + # TODO(unknown): doesn't work if there are virtual functions with + # decltype() or other things that use parentheses, but csearch suggests + # that this is rare. + end_col = -1 + end_line = -1 + start_col = len(virtual.group(2)) + for start_line in xrange(linenum, min(linenum + 3, clean_lines.NumLines())): + line = clean_lines.elided[start_line][start_col:] + parameter_list = Match(r'^([^(]*)\(', line) + if parameter_list: + # Match parentheses to find the end of the parameter list + (_, end_line, end_col) = CloseExpression( + clean_lines, start_line, start_col + len(parameter_list.group(1))) + break + start_col = 0 + + if end_col < 0: + return # Couldn't find end of parameter list, give up + + # Look for "override" or "final" after the parameter list + # (possibly on the next few lines). + for i in xrange(end_line, min(end_line + 3, clean_lines.NumLines())): + line = clean_lines.elided[i][end_col:] + match = Search(r'\b(override|final)\b', line) + if match: + error(filename, linenum, 'readability/inheritance', 4, + ('"virtual" is redundant since function is ' + 'already declared as "%s"' % match.group(1))) + + # Set end_col to check whole lines after we are done with the + # first line. + end_col = 0 + if Search(r'[^\w]\s*$', line): + break + + +def CheckRedundantOverrideOrFinal(filename, clean_lines, linenum, error): + """Check if line contains a redundant "override" or "final" virt-specifier. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + # Look for closing parenthesis nearby. We need one to confirm where + # the declarator ends and where the virt-specifier starts to avoid + # false positives. + line = clean_lines.elided[linenum] + declarator_end = line.rfind(')') + if declarator_end >= 0: + fragment = line[declarator_end:] + else: + if linenum > 1 and clean_lines.elided[linenum - 1].rfind(')') >= 0: + fragment = line + else: + return + + # Check that at most one of "override" or "final" is present, not both + if Search(r'\boverride\b', fragment) and Search(r'\bfinal\b', fragment): + error(filename, linenum, 'readability/inheritance', 4, + ('"override" is redundant since function is ' + 'already declared as "final"')) + + + + +# Returns true if we are at a new block, and it is directly +# inside of a namespace. +def IsBlockInNameSpace(nesting_state, is_forward_declaration): + """Checks that the new block is directly in a namespace. + + Args: + nesting_state: The _NestingState object that contains info about our state. + is_forward_declaration: If the class is a forward declared class. + Returns: + Whether or not the new block is directly in a namespace. 
+ """ + if is_forward_declaration: + if len(nesting_state.stack) >= 1 and ( + isinstance(nesting_state.stack[-1], _NamespaceInfo)): + return True + else: + return False + + return (len(nesting_state.stack) > 1 and + nesting_state.stack[-1].check_namespace_indentation and + isinstance(nesting_state.stack[-2], _NamespaceInfo)) + + +def ShouldCheckNamespaceIndentation(nesting_state, is_namespace_indent_item, + raw_lines_no_comments, linenum): + """This method determines if we should apply our namespace indentation check. + + Args: + nesting_state: The current nesting state. + is_namespace_indent_item: If we just put a new class on the stack, True. + If the top of the stack is not a class, or we did not recently + add the class, False. + raw_lines_no_comments: The lines without the comments. + linenum: The current line number we are processing. + + Returns: + True if we should apply our namespace indentation check. Currently, it + only works for classes and namespaces inside of a namespace. + """ + + is_forward_declaration = IsForwardClassDeclaration(raw_lines_no_comments, + linenum) + + if not (is_namespace_indent_item or is_forward_declaration): + return False + + # If we are in a macro, we do not want to check the namespace indentation. + if IsMacroDefinition(raw_lines_no_comments, linenum): + return False + + return IsBlockInNameSpace(nesting_state, is_forward_declaration) + + +# Call this method if the line is directly inside of a namespace. +# If the line above is blank (excluding comments) or the start of +# an inner namespace, it cannot be indented. +def CheckItemIndentationInNamespace(filename, raw_lines_no_comments, linenum, + error): + line = raw_lines_no_comments[linenum] + if Match(r'^\s+', line): + error(filename, linenum, 'runtime/indentation_namespace', 4, + 'Do not indent within a namespace') + + +def ProcessLine(filename, file_extension, clean_lines, line, + include_state, function_state, nesting_state, error, + extra_check_functions=[]): + """Processes a single line in the file. + + Args: + filename: Filename of the file that is being processed. + file_extension: The extension (dot not included) of the file. + clean_lines: An array of strings, each representing a line of the file, + with comments stripped. + line: Number of line being processed. + include_state: An _IncludeState instance in which the headers are inserted. + function_state: A _FunctionState instance which counts function lines, etc. + nesting_state: A NestingState instance which maintains information about + the current stack of nested blocks being parsed. + error: A callable to which errors are reported, which takes 4 arguments: + filename, line number, error level, and message + extra_check_functions: An array of additional check functions that will be + run on each source line. 
Each function takes 4 + arguments: filename, clean_lines, line, error + """ + raw_lines = clean_lines.raw_lines + ParseNolintSuppressions(filename, raw_lines[line], line, error) + nesting_state.Update(filename, clean_lines, line, error) + CheckForNamespaceIndentation(filename, nesting_state, clean_lines, line, + error) + if nesting_state.InAsmBlock(): return + CheckForFunctionLengths(filename, clean_lines, line, function_state, error) + CheckForMultilineCommentsAndStrings(filename, clean_lines, line, error) + CheckStyle(filename, clean_lines, line, file_extension, nesting_state, error) + CheckLanguage(filename, clean_lines, line, file_extension, include_state, + nesting_state, error) + CheckForNonConstReference(filename, clean_lines, line, nesting_state, error) + CheckForNonStandardConstructs(filename, clean_lines, line, + nesting_state, error) + CheckVlogArguments(filename, clean_lines, line, error) + CheckPosixThreading(filename, clean_lines, line, error) + CheckInvalidIncrement(filename, clean_lines, line, error) + CheckMakePairUsesDeduction(filename, clean_lines, line, error) + CheckDefaultLambdaCaptures(filename, clean_lines, line, error) + CheckRedundantVirtual(filename, clean_lines, line, error) + CheckRedundantOverrideOrFinal(filename, clean_lines, line, error) + for check_fn in extra_check_functions: + check_fn(filename, clean_lines, line, error) + +def FlagCxx11Features(filename, clean_lines, linenum, error): + """Flag those c++11 features that we only allow in certain places. + + Args: + filename: The name of the current file. + clean_lines: A CleansedLines instance containing the file. + linenum: The number of the line to check. + error: The function to call with any errors found. + """ + line = clean_lines.elided[linenum] + + # Flag unapproved C++11 headers. + include = Match(r'\s*#\s*include\s+[<"]([^<"]+)[">]', line) + if include and include.group(1) in ('cfenv', + 'condition_variable', + 'fenv.h', + 'future', + 'mutex', + 'thread', + 'chrono', + 'ratio', + 'regex', + 'system_error', + ): + error(filename, linenum, 'build/c++11', 5, + ('<%s> is an unapproved C++11 header.') % include.group(1)) + + # The only place where we need to worry about C++11 keywords and library + # features in preprocessor directives is in macro definitions. + if Match(r'\s*#', line) and not Match(r'\s*#\s*define\b', line): return + + # These are classes and free functions. The classes are always + # mentioned as std::*, but we only catch the free functions if + # they're not found by ADL. They're alphabetical by header. + for top_name in ( + # type_traits + 'alignment_of', + 'aligned_union', + ): + if Search(r'\bstd::%s\b' % top_name, line): + error(filename, linenum, 'build/c++11', 5, + ('std::%s is an unapproved C++11 class or function. Send c-style ' + 'an example of where it would make your code more readable, and ' + 'they may let you use it.') % top_name) + + +def ProcessFileData(filename, file_extension, lines, error, + extra_check_functions=[]): + """Performs lint checks and reports any errors to the given error function. + + Args: + filename: Filename of the file that is being processed. + file_extension: The extension (dot not included) of the file. + lines: An array of strings, each representing a line of the file, with the + last element being empty if the file is terminated with a newline. 
+ error: A callable to which errors are reported, which takes 4 arguments: + filename, line number, error level, and message + extra_check_functions: An array of additional check functions that will be + run on each source line. Each function takes 4 + arguments: filename, clean_lines, line, error + """ + lines = (['// marker so line numbers and indices both start at 1'] + lines + + ['// marker so line numbers end in a known way']) + + include_state = _IncludeState() + function_state = _FunctionState() + nesting_state = NestingState() + + ResetNolintSuppressions() + + CheckForCopyright(filename, lines, error) + + RemoveMultiLineComments(filename, lines, error) + clean_lines = CleansedLines(lines) + + if file_extension == 'h': + CheckForHeaderGuard(filename, clean_lines, error) + + for line in xrange(clean_lines.NumLines()): + ProcessLine(filename, file_extension, clean_lines, line, + include_state, function_state, nesting_state, error, + extra_check_functions) + FlagCxx11Features(filename, clean_lines, line, error) + nesting_state.CheckCompletedBlocks(filename, error) + + CheckForIncludeWhatYouUse(filename, clean_lines, include_state, error) + + # Check that the .cc file has included its header if it exists. + if file_extension == 'cc': + CheckHeaderFileIncluded(filename, include_state, error) + + # We check here rather than inside ProcessLine so that we see raw + # lines rather than "cleaned" lines. + CheckForBadCharacters(filename, lines, error) + + CheckForNewlineAtEOF(filename, lines, error) + +def ProcessConfigOverrides(filename): + """ Loads the configuration files and processes the config overrides. + + Args: + filename: The name of the file being processed by the linter. + + Returns: + False if the current |filename| should not be processed further. + """ + + abs_filename = os.path.abspath(filename) + cfg_filters = [] + keep_looking = True + while keep_looking: + abs_path, base_name = os.path.split(abs_filename) + if not base_name: + break # Reached the root directory. + + cfg_file = os.path.join(abs_path, "CPPLINT.cfg") + abs_filename = abs_path + if not os.path.isfile(cfg_file): + continue + + try: + with open(cfg_file) as file_handle: + for line in file_handle: + line, _, _ = line.partition('#') # Remove comments. + if not line.strip(): + continue + + name, _, val = line.partition('=') + name = name.strip() + val = val.strip() + if name == 'set noparent': + keep_looking = False + elif name == 'filter': + cfg_filters.append(val) + elif name == 'exclude_files': + # When matching exclude_files pattern, use the base_name of + # the current file name or the directory name we are processing. + # For example, if we are checking for lint errors in /foo/bar/baz.cc + # and we found the .cfg file at /foo/CPPLINT.cfg, then the config + # file's "exclude_files" filter is meant to be checked against "bar" + # and not "baz" nor "bar/baz.cc". + if base_name: + pattern = re.compile(val) + if pattern.match(base_name): + sys.stderr.write('Ignoring "%s": file excluded by "%s". 
' + 'File path component "%s" matches ' + 'pattern "%s"\n' % + (filename, cfg_file, base_name, val)) + return False + elif name == 'linelength': + global _line_length + try: + _line_length = int(val) + except ValueError: + sys.stderr.write('Line length must be numeric.') + else: + sys.stderr.write( + 'Invalid configuration option (%s) in file %s\n' % + (name, cfg_file)) + + except IOError: + sys.stderr.write( + "Skipping config file '%s': Can't open for reading\n" % cfg_file) + keep_looking = False + + # Apply all the accumulated filters in reverse order (top-level directory + # config options having the least priority). + for filter in reversed(cfg_filters): + _AddFilters(filter) + + return True + + +def ProcessFile(filename, vlevel, extra_check_functions=[]): + """Does google-lint on a single file. + + Args: + filename: The name of the file to parse. + + vlevel: The level of errors to report. Every error of confidence + >= verbose_level will be reported. 0 is a good default. + + extra_check_functions: An array of additional check functions that will be + run on each source line. Each function takes 4 + arguments: filename, clean_lines, line, error + """ + + _SetVerboseLevel(vlevel) + _BackupFilters() + + if not ProcessConfigOverrides(filename): + _RestoreFilters() + return + + lf_lines = [] + crlf_lines = [] + try: + # Support the UNIX convention of using "-" for stdin. Note that + # we are not opening the file with universal newline support + # (which codecs doesn't support anyway), so the resulting lines do + # contain trailing '\r' characters if we are reading a file that + # has CRLF endings. + # If after the split a trailing '\r' is present, it is removed + # below. + if filename == '-': + lines = codecs.StreamReaderWriter(sys.stdin, + codecs.getreader('utf8'), + codecs.getwriter('utf8'), + 'replace').read().split('\n') + else: + lines = codecs.open(filename, 'r', 'utf8', 'replace').read().split('\n') + + # Remove trailing '\r'. + # The -1 accounts for the extra trailing blank line we get from split() + for linenum in range(len(lines) - 1): + if lines[linenum].endswith('\r'): + lines[linenum] = lines[linenum].rstrip('\r') + crlf_lines.append(linenum + 1) + else: + lf_lines.append(linenum + 1) + + except IOError: + sys.stderr.write( + "Skipping input '%s': Can't open for reading\n" % filename) + _RestoreFilters() + return + + # Note, if no dot is found, this will give the entire filename as the ext. + file_extension = filename[filename.rfind('.') + 1:] + + # When reading from stdin, the extension is unknown, so no cpplint tests + # should rely on the extension. + if filename != '-' and file_extension not in _valid_extensions: + sys.stderr.write('Ignoring %s; not a valid file name ' + '(%s)\n' % (filename, ', '.join(_valid_extensions))) + else: + ProcessFileData(filename, file_extension, lines, Error, + extra_check_functions) + + # If end-of-line sequences are a mix of LF and CR-LF, issue + # warnings on the lines with CR. + # + # Don't issue any warnings if all lines are uniformly LF or CR-LF, + # since critique can handle these just fine, and the style guide + # doesn't dictate a particular end of line sequence. + # + # We can't depend on os.linesep to determine what the desired + # end-of-line sequence should be, since that will return the + # server-side end-of-line sequence. + if lf_lines and crlf_lines: + # Warn on every line with CR. 
An alternative approach might be to + # check whether the file is mostly CRLF or just LF, and warn on the + # minority, we bias toward LF here since most tools prefer LF. + for linenum in crlf_lines: + Error(filename, linenum, 'whitespace/newline', 1, + 'Unexpected \\r (^M) found; better to use only \\n') + + sys.stderr.write('Done processing %s\n' % filename) + _RestoreFilters() + + +def PrintUsage(message): + """Prints a brief usage string and exits, optionally with an error message. + + Args: + message: The optional error message. + """ + sys.stderr.write(_USAGE) + if message: + sys.exit('\nFATAL ERROR: ' + message) + else: + sys.exit(1) + + +def PrintCategories(): + """Prints a list of all the error-categories used by error messages. + + These are the categories used to filter messages via --filter. + """ + sys.stderr.write(''.join(' %s\n' % cat for cat in _ERROR_CATEGORIES)) + sys.exit(0) + + +def ParseArguments(args): + """Parses the command line arguments. + + This may set the output format and verbosity level as side-effects. + + Args: + args: The command line arguments: + + Returns: + The list of filenames to lint. + """ + try: + (opts, filenames) = getopt.getopt(args, '', ['help', 'output=', 'verbose=', + 'counting=', + 'filter=', + 'root=', + 'linelength=', + 'extensions=']) + except getopt.GetoptError: + PrintUsage('Invalid arguments.') + + verbosity = _VerboseLevel() + output_format = _OutputFormat() + filters = '' + counting_style = '' + + for (opt, val) in opts: + if opt == '--help': + PrintUsage(None) + elif opt == '--output': + if val not in ('emacs', 'vs7', 'eclipse'): + PrintUsage('The only allowed output formats are emacs, vs7 and eclipse.') + output_format = val + elif opt == '--verbose': + verbosity = int(val) + elif opt == '--filter': + filters = val + if not filters: + PrintCategories() + elif opt == '--counting': + if val not in ('total', 'toplevel', 'detailed'): + PrintUsage('Valid counting options are total, toplevel, and detailed') + counting_style = val + elif opt == '--root': + global _root + _root = val + elif opt == '--linelength': + global _line_length + try: + _line_length = int(val) + except ValueError: + PrintUsage('Line length must be digits.') + elif opt == '--extensions': + global _valid_extensions + try: + _valid_extensions = set(val.split(',')) + except ValueError: + PrintUsage('Extensions must be comma seperated list.') + + if not filenames: + PrintUsage('No files were specified.') + + _SetOutputFormat(output_format) + _SetVerboseLevel(verbosity) + _SetFilters(filters) + _SetCountingStyle(counting_style) + + return filenames + + +def main(): + filenames = ParseArguments(sys.argv[1:]) + + # Change stderr to write with replacement characters so we don't die + # if we try to print something containing non-ASCII characters. + sys.stderr = codecs.StreamReaderWriter(sys.stderr, + codecs.getreader('utf8'), + codecs.getwriter('utf8'), + 'replace') + + _cpplint_state.ResetErrorCounts() + for filename in filenames: + ProcessFile(filename, _cpplint_state.verbose_level) + _cpplint_state.PrintErrorCounts() + + sys.exit(_cpplint_state.error_count > 0) + + +if __name__ == '__main__': + main() diff --git a/cpp/build-support/run-test.sh b/cpp/build-support/run-test.sh new file mode 100755 index 0000000000000..b2039134d558d --- /dev/null +++ b/cpp/build-support/run-test.sh @@ -0,0 +1,195 @@ +#!/bin/bash +# Copyright 2014 Cloudera, Inc. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Script which wraps running a test and redirects its output to a +# test log directory. +# +# If KUDU_COMPRESS_TEST_OUTPUT is non-empty, then the logs will be +# gzip-compressed while they are written. +# +# If KUDU_FLAKY_TEST_ATTEMPTS is non-zero, and the test being run matches +# one of the lines in the file KUDU_FLAKY_TEST_LIST, then the test will +# be retried on failure up to the specified number of times. This can be +# used in the gerrit workflow to prevent annoying false -1s caused by +# tests that are known to be flaky in master. +# +# If KUDU_REPORT_TEST_RESULTS is non-zero, then tests are reported to the +# central test server. + +ROOT=$(cd $(dirname $BASH_SOURCE)/..; pwd) + +TEST_LOGDIR=$ROOT/build/test-logs +mkdir -p $TEST_LOGDIR + +TEST_DEBUGDIR=$ROOT/build/test-debug +mkdir -p $TEST_DEBUGDIR + +TEST_DIRNAME=$(cd $(dirname $1); pwd) +TEST_FILENAME=$(basename $1) +shift +TEST_EXECUTABLE="$TEST_DIRNAME/$TEST_FILENAME" +TEST_NAME=$(echo $TEST_FILENAME | perl -pe 's/\..+?$//') # Remove path and extension (if any). + +# We run each test in its own subdir to avoid core file related races. +TEST_WORKDIR=$ROOT/build/test-work/$TEST_NAME +mkdir -p $TEST_WORKDIR +pushd $TEST_WORKDIR >/dev/null || exit 1 +rm -f * + +set -o pipefail + +LOGFILE=$TEST_LOGDIR/$TEST_NAME.txt +XMLFILE=$TEST_LOGDIR/$TEST_NAME.xml + +TEST_EXECUTION_ATTEMPTS=1 + +# Remove both the uncompressed output, so the developer doesn't accidentally get confused +# and read output from a prior test run. +rm -f $LOGFILE $LOGFILE.gz + +pipe_cmd=cat + +# Configure TSAN (ignored if this isn't a TSAN build). +# +# Deadlock detection (new in clang 3.5) is disabled because: +# 1. The clang 3.5 deadlock detector crashes in some unit tests. It +# needs compiler-rt commits c4c3dfd, 9a8efe3, and possibly others. +# 2. Many unit tests report lock-order-inversion warnings; they should be +# fixed before reenabling the detector. +TSAN_OPTIONS="$TSAN_OPTIONS detect_deadlocks=0" +TSAN_OPTIONS="$TSAN_OPTIONS suppressions=$ROOT/build-support/tsan-suppressions.txt" +TSAN_OPTIONS="$TSAN_OPTIONS history_size=7" +export TSAN_OPTIONS + +# Enable leak detection even under LLVM 3.4, where it was disabled by default. +# This flag only takes effect when running an ASAN build. +ASAN_OPTIONS="$ASAN_OPTIONS detect_leaks=1" +export ASAN_OPTIONS + +# Set up suppressions for LeakSanitizer +LSAN_OPTIONS="$LSAN_OPTIONS suppressions=$ROOT/build-support/lsan-suppressions.txt" +export LSAN_OPTIONS + +# Suppressions require symbolization. We'll default to using the symbolizer in +# thirdparty. +if [ -z "$ASAN_SYMBOLIZER_PATH" ]; then + export ASAN_SYMBOLIZER_PATH=$(find $NATIVE_TOOLCHAIN/llvm-3.7.0/bin -name llvm-symbolizer) +fi + +# Allow for collecting core dumps. +ARROW_TEST_ULIMIT_CORE=${ARROW_TEST_ULIMIT_CORE:-0} +ulimit -c $ARROW_TEST_ULIMIT_CORE + +# Run the actual test. 
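+#
+# The loop below makes up to $TEST_EXECUTION_ATTEMPTS attempts, breaking out
+# on the first success; between attempts it deletes any output the previous
+# run left behind so that a retry starts from a clean slate.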
+for ATTEMPT_NUMBER in $(seq 1 $TEST_EXECUTION_ATTEMPTS) ; do
+  if [ $ATTEMPT_NUMBER -lt $TEST_EXECUTION_ATTEMPTS ]; then
+    # If the test fails, the test output may or may not be left behind,
+    # depending on whether the test cleaned up or exited immediately. Either
+    # way we need to clean it up. We do this by comparing the data directory
+    # contents before and after the test runs, and deleting anything new.
+    #
+    # The comm program requires that its two inputs be sorted.
+    TEST_TMPDIR_BEFORE=$(find $TEST_TMPDIR -maxdepth 1 -type d | sort)
+  fi
+
+  # gtest won't overwrite old junit test files, resulting in a build failure
+  # even when retries are successful.
+  rm -f $XMLFILE
+
+  echo "Running $TEST_NAME, redirecting output into $LOGFILE" \
+    "(attempt ${ATTEMPT_NUMBER}/$TEST_EXECUTION_ATTEMPTS)"
+  $TEST_EXECUTABLE "$@" 2>&1 \
+    | $ROOT/build-support/asan_symbolize.py \
+    | c++filt \
+    | $ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE \
+    | $pipe_cmd > $LOGFILE
+  STATUS=$?
+
+  # TSAN doesn't always exit with a non-zero exit code due to a bug:
+  # mutex errors don't get reported through the normal error reporting
+  # infrastructure. So we make sure to detect this and exit 1.
+  #
+  # Additionally, certain types of failures won't show up in the standard
+  # JUnit XML output from gtest. We assume that gtest knows better than us
+  # and our regexes in most cases, but for certain errors we delete the
+  # resulting xml file and let our own post-processing step regenerate it.
+  export GREP=$(which egrep)
+  if zgrep --silent "ThreadSanitizer|Leak check.*detected leaks" $LOGFILE ; then
+    echo ThreadSanitizer or leak check failures in $LOGFILE
+    STATUS=1
+    rm -f $XMLFILE
+  fi
+
+  if [ $ATTEMPT_NUMBER -lt $TEST_EXECUTION_ATTEMPTS ]; then
+    # Now delete any new test output.
+    TEST_TMPDIR_AFTER=$(find $TEST_TMPDIR -maxdepth 1 -type d | sort)
+    DIFF=$(comm -13 <(echo "$TEST_TMPDIR_BEFORE") \
+                    <(echo "$TEST_TMPDIR_AFTER"))
+    for DIR in $DIFF; do
+      # Multiple tests may be running concurrently. To avoid deleting the
+      # wrong directories, constrain to only directories beginning with the
+      # test name.
+      #
+      # This may delete old test directories belonging to this test, but
+      # that's not typically a concern when rerunning flaky tests.
+      if [[ $DIR =~ ^$TEST_TMPDIR/$TEST_NAME ]]; then
+        echo Deleting leftover flaky test directory "$DIR"
+        rm -Rf "$DIR"
+      fi
+    done
+  fi
+
+  if [ "$STATUS" -eq "0" ]; then
+    break
+  elif [ "$ATTEMPT_NUMBER" -lt "$TEST_EXECUTION_ATTEMPTS" ]; then
+    echo Test failed attempt number $ATTEMPT_NUMBER
+    echo Will retry...
+  fi
+done
+
+# If we have a LeakSanitizer report, and XML reporting is configured, add a
+# new test case result to the XML file for the leak report. Otherwise Jenkins
+# won't show us which tests had LSAN errors.
+if zgrep --silent "ERROR: LeakSanitizer: detected memory leaks" $LOGFILE ; then
+  echo Test had memory leaks. Editing XML
+  perl -p -i -e '
+  if (m#</testsuite>#) {
+    print "<testcase name=\"LeakSanitizer\" status=\"run\" classname=\"LSAN\">\n";
+    print "  <failure message=\"LeakSanitizer failed\">\n";
+    print "    See txt log file for details\n";
+    print "  </failure>\n";
+    print "</testcase>\n";
+  }' $XMLFILE
+fi
+
+# Capture and compress core file and binary.
+COREFILES=$(ls | grep ^core)
+if [ -n "$COREFILES" ]; then
+  echo Found core dump. Saving executable and core files.
+  gzip < $TEST_EXECUTABLE > "$TEST_DEBUGDIR/$TEST_NAME.gz" || exit $?
+  for COREFILE in $COREFILES; do
+    gzip < $COREFILE > "$TEST_DEBUGDIR/$TEST_NAME.$COREFILE.gz" || exit $?
+  done
+  # Pull in any .so files as well.
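+  # ldd prints one "name => path (address)" line per shared library the test
+  # binary links against; grepping for $ROOT keeps only libraries built in
+  # this tree, and awk extracts the resolved path (the third field).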
+  for LIB in $(ldd $TEST_EXECUTABLE | grep $ROOT | awk '{print $3}'); do
+    LIB_NAME=$(basename $LIB)
+    gzip < $LIB > "$TEST_DEBUGDIR/$LIB_NAME.gz" || exit $?
+  done
+fi
+
+popd
+rm -Rf $TEST_WORKDIR
+
+exit $STATUS
diff --git a/cpp/build-support/stacktrace_addr2line.pl b/cpp/build-support/stacktrace_addr2line.pl
new file mode 100755
index 0000000000000..7664bab5af65b
--- /dev/null
+++ b/cpp/build-support/stacktrace_addr2line.pl
@@ -0,0 +1,92 @@
+#!/usr/bin/perl
+# Copyright 2014 Cloudera, Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#######################################################################
+# This script will convert a stack trace with addresses:
+#   @ 0x5fb015 kudu::master::Master::Init()
+#   @ 0x5c2d38 kudu::master::MiniMaster::StartOnPorts()
+#   @ 0x5c31fa kudu::master::MiniMaster::Start()
+#   @ 0x58270a kudu::MiniCluster::Start()
+#   @ 0x57dc71 kudu::CreateTableStressTest::SetUp()
+# To one with line numbers:
+#   @ 0x5fb015 kudu::master::Master::Init() at /home/mpercy/src/kudu/src/master/master.cc:54
+#   @ 0x5c2d38 kudu::master::MiniMaster::StartOnPorts() at /home/mpercy/src/kudu/src/master/mini_master.cc:52
+#   @ 0x5c31fa kudu::master::MiniMaster::Start() at /home/mpercy/src/kudu/src/master/mini_master.cc:33
+#   @ 0x58270a kudu::MiniCluster::Start() at /home/mpercy/src/kudu/src/integration-tests/mini_cluster.cc:48
+#   @ 0x57dc71 kudu::CreateTableStressTest::SetUp() at /home/mpercy/src/kudu/src/integration-tests/create-table-stress-test.cc:61
+#
+# If the script detects that the output is not symbolized, it will also
+# attempt to determine the function names, i.e. it will convert:
+#   @ 0x5fb015
+#   @ 0x5c2d38
+#   @ 0x5c31fa
+# To:
+#   @ 0x5fb015 kudu::master::Master::Init() at /home/mpercy/src/kudu/src/master/master.cc:54
+#   @ 0x5c2d38 kudu::master::MiniMaster::StartOnPorts() at /home/mpercy/src/kudu/src/master/mini_master.cc:52
+#   @ 0x5c31fa kudu::master::MiniMaster::Start() at /home/mpercy/src/kudu/src/master/mini_master.cc:33
+#######################################################################
+use strict;
+use warnings;
+
+if (!@ARGV) {
+  die <<EOF
+Usage: $0 executable
+
+Reads a stack trace from stdin, resolves each address to a file and line
+number with addr2line, and echoes the annotated trace to stdout.
+EOF
+}
+
+my $binary = shift @ARGV;
+if (! -x $binary || ! -r $binary) {
+  die "Error: Cannot access executable ($binary)";
+}
+
+# Cache addr2line lookups, since the same address tends to appear many
+# times across a trace.
+my %addr2line_map = ();
+
+# Convert one blob of addr2line -ifC output into a suffix for the trace
+# line. addr2line prints pairs of lines: the function name followed by
+# file:line, and we take the first such pair.
+sub parse_addr2line_output($$) {
+  my ($output, $lookup_func_name) = @_;
+  my @lines = grep { $_ ne '' } split("\n", $output);
+  my $pretty_str = '';
+  if ($lookup_func_name) {
+    $pretty_str .= ' ' . $lines[0];
+  }
+  $pretty_str .= ' at ' . $lines[1];
+  return $pretty_str;
+}
+
+# Reading from <STDIN> is magical in Perl.
+while (defined(my $input = <STDIN>)) {
+  if ($input =~ /^\s+\@\s+(0x[[:xdigit:]]{6,})(?:\s+(\S+))?/) {
+    my $addr = $1;
+    my $lookup_func_name = (!defined $2);
+    if (!exists($addr2line_map{$addr})) {
+      $addr2line_map{$addr} = `addr2line -ifC -e $binary $addr`;
+    }
+    chomp $input;
+    $input .= parse_addr2line_output($addr2line_map{$addr}, $lookup_func_name) . "\n";
+  }
+  print $input;
+}
+
+exit 0;
diff --git a/cpp/cmake_modules/CompilerInfo.cmake b/cpp/cmake_modules/CompilerInfo.cmake
new file mode 100644
index 0000000000000..07860682f9b1b
--- /dev/null
+++ b/cpp/cmake_modules/CompilerInfo.cmake
@@ -0,0 +1,46 @@
+# Copyright 2013 Cloudera, Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Sets COMPILER_FAMILY to 'clang' or 'gcc' +# Sets COMPILER_VERSION to the version +execute_process(COMMAND "${CMAKE_CXX_COMPILER}" -v + ERROR_VARIABLE COMPILER_VERSION_FULL) +message(INFO " ${COMPILER_VERSION_FULL}") + +# clang on Linux and Mac OS X before 10.9 +if("${COMPILER_VERSION_FULL}" MATCHES ".*clang version.*") + set(COMPILER_FAMILY "clang") + string(REGEX REPLACE ".*clang version ([0-9]+\\.[0-9]+).*" "\\1" + COMPILER_VERSION "${COMPILER_VERSION_FULL}") +# clang on Mac OS X 10.9 and later +elseif("${COMPILER_VERSION_FULL}" MATCHES ".*based on LLVM.*") + set(COMPILER_FAMILY "clang") + string(REGEX REPLACE ".*based on LLVM ([0-9]+\\.[0.9]+).*" "\\1" + COMPILER_VERSION "${COMPILER_VERSION_FULL}") + +# clang on Mac OS X, XCode 7. No version replacement is done +# because Apple no longer advertises the upstream LLVM version. +elseif("${COMPILER_VERSION_FULL}" MATCHES "clang-700\\..*") + set(COMPILER_FAMILY "clang") + +# gcc +elseif("${COMPILER_VERSION_FULL}" MATCHES ".*gcc version.*") + set(COMPILER_FAMILY "gcc") + string(REGEX REPLACE ".*gcc version ([0-9\\.]+).*" "\\1" + COMPILER_VERSION "${COMPILER_VERSION_FULL}") +else() + message(FATAL_ERROR "Unknown compiler. Version info:\n${COMPILER_VERSION_FULL}") +endif() +message("Selected compiler ${COMPILER_FAMILY} ${COMPILER_VERSION}") + diff --git a/cpp/cmake_modules/FindGPerf.cmake b/cpp/cmake_modules/FindGPerf.cmake new file mode 100644 index 0000000000000..e8310799c3671 --- /dev/null +++ b/cpp/cmake_modules/FindGPerf.cmake @@ -0,0 +1,69 @@ +# -*- cmake -*- + +# - Find Google perftools +# Find the Google perftools includes and libraries +# This module defines +# GOOGLE_PERFTOOLS_INCLUDE_DIR, where to find heap-profiler.h, etc. +# GOOGLE_PERFTOOLS_FOUND, If false, do not try to use Google perftools. +# also defined for general use are +# TCMALLOC_LIBS, where to find the tcmalloc libraries. +# TCMALLOC_STATIC_LIB, path to libtcmalloc.a. +# TCMALLOC_SHARED_LIB, path to libtcmalloc's shared library +# PROFILER_LIBS, where to find the profiler libraries. +# PROFILER_STATIC_LIB, path to libprofiler.a. 
+# PROFILER_SHARED_LIB, path to libprofiler's shared library + +FIND_PATH(GOOGLE_PERFTOOLS_INCLUDE_DIR google/heap-profiler.h + $ENV{NATIVE_TOOLCHAIN}/gperftools-$ENV{GPERFTOOLS_VERSION}/include + NO_DEFAULT_PATH +) + +SET(GPERF_LIB_SEARCH $ENV{NATIVE_TOOLCHAIN}/gperftools-$ENV{GPERFTOOLS_VERSION}/lib) + +FIND_LIBRARY(TCMALLOC_LIB_PATH + NAMES libtcmalloc.a + PATHS ${GPERF_LIB_SEARCH} + NO_DEFAULT_PATH +) + +IF (TCMALLOC_LIB_PATH AND GOOGLE_PERFTOOLS_INCLUDE_DIR) + SET(TCMALLOC_LIBS ${GPERF_LIB_SEARCH}) + SET(TCMALLOC_LIB_NAME libtcmalloc) + SET(TCMALLOC_STATIC_LIB ${GPERF_LIB_SEARCH}/${TCMALLOC_LIB_NAME}.a) + SET(TCMALLOC_SHARED_LIB ${TCMALLOC_LIBS}/${TCMALLOC_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) + SET(GOOGLE_PERFTOOLS_FOUND "YES") +ELSE (TCMALLOC_LIB_PATH AND GOOGLE_PERFTOOLS_INCLUDE_DIR) + SET(GOOGLE_PERFTOOLS_FOUND "NO") +ENDIF (TCMALLOC_LIB_PATH AND GOOGLE_PERFTOOLS_INCLUDE_DIR) + +FIND_LIBRARY(PROFILER_LIB_PATH + NAMES libprofiler.a + PATHS ${GPERF_LIB_SEARCH} +) + +IF (PROFILER_LIB_PATH AND GOOGLE_PERFTOOLS_INCLUDE_DIR) + SET(PROFILER_LIBS ${GPERF_LIB_SEARCH}) + SET(PROFILER_LIB_NAME libprofiler) + SET(PROFILER_STATIC_LIB ${GPERF_LIB_SEARCH}/${PROFILER_LIB_NAME}.a) + SET(PROFILER_SHARED_LIB ${PROFILER_LIBS}/${PROFILER_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) +ENDIF (PROFILER_LIB_PATH AND GOOGLE_PERFTOOLS_INCLUDE_DIR) + +IF (GOOGLE_PERFTOOLS_FOUND) + IF (NOT GPerf_FIND_QUIETLY) + MESSAGE(STATUS "Found the Google perftools library: ${TCMALLOC_LIBS}") + ENDIF (NOT GPerf_FIND_QUIETLY) +ELSE (GOOGLE_PERFTOOLS_FOUND) + IF (GPerf_FIND_REQUIRED) + MESSAGE(FATAL_ERROR "Could not find the Google perftools library") + ENDIF (GPerf_FIND_REQUIRED) +ENDIF (GOOGLE_PERFTOOLS_FOUND) + +MARK_AS_ADVANCED( + TCMALLOC_LIBS + TCMALLOC_STATIC_LIB + TCMALLOC_SHARED_LIB + PROFILER_LIBS + PROFILER_STATIC_LIB + PROFILER_SHARED_LIB + GOOGLE_PERFTOOLS_INCLUDE_DIR +) diff --git a/cpp/cmake_modules/FindGTest.cmake b/cpp/cmake_modules/FindGTest.cmake new file mode 100644 index 0000000000000..e47faf0dd89d2 --- /dev/null +++ b/cpp/cmake_modules/FindGTest.cmake @@ -0,0 +1,91 @@ +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Tries to find GTest headers and libraries. +# +# Usage of this module as follows: +# +# find_package(GTest) +# +# Variables used by this module, they can change the default behaviour and need +# to be set before calling find_package: +# +# GTest_HOME - When set, this path is inspected instead of standard library +# locations as the root of the GTest installation. +# The environment variable GTEST_HOME overrides this veriable. 
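+#
+# Typical usage (the install path below is only an example):
+#   GTEST_HOME=/opt/gtest-1.7.0 cmake ..
+# or, in CMake code before the find_package() call:
+#   set(GTest_HOME /opt/gtest-1.7.0)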
+# +# This module defines +# GTEST_INCLUDE_DIR, directory containing headers +# GTEST_LIBS, directory containing gtest libraries +# GTEST_STATIC_LIB, path to libgtest.a +# GTEST_SHARED_LIB, path to libgtest's shared library +# GTEST_FOUND, whether gtest has been found + +if( NOT "$ENV{GTEST_HOME}" STREQUAL "") + file( TO_CMAKE_PATH "$ENV{GTEST_HOME}" _native_path ) + list( APPEND _gtest_roots ${_native_path} ) +elseif ( GTest_HOME ) + list( APPEND _gtest_roots ${GTest_HOME} ) +endif() + +# Try the parameterized roots, if they exist +if ( _gtest_roots ) + find_path( GTEST_INCLUDE_DIR NAMES gtest/gtest.h + PATHS ${_gtest_roots} NO_DEFAULT_PATH + PATH_SUFFIXES "include" ) + find_library( GTEST_LIBRARIES NAMES gtest + PATHS ${_gtest_roots} NO_DEFAULT_PATH + PATH_SUFFIXES "lib" ) +else () + find_path( GTEST_INCLUDE_DIR NAMES gtest/gtest.h ) + find_library( GTEST_LIBRARIES NAMES gtest ) +endif () + + +if (GTEST_INCLUDE_DIR AND GTEST_LIBRARIES) + set(GTEST_FOUND TRUE) + get_filename_component( GTEST_LIBS ${GTEST_LIBRARIES} DIRECTORY ) + set(GTEST_LIB_NAME libgtest) + set(GTEST_STATIC_LIB ${GTEST_LIBS}/${GTEST_LIB_NAME}.a) + set(GTEST_SHARED_LIB ${GTEST_LIBS}/${GTEST_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) +else () + set(GTEST_FOUND FALSE) +endif () + +if (GTEST_FOUND) + if (NOT GTest_FIND_QUIETLY) + message(STATUS "Found the GTest library: ${GTEST_LIBRARIES}") + endif () +else () + if (NOT GTest_FIND_QUIETLY) + set(GTEST_ERR_MSG "Could not find the GTest library. Looked in ") + if ( _gtest_roots ) + set(GTEST_ERR_MSG "${GTEST_ERR_MSG} in ${_gtest_roots}.") + else () + set(GTEST_ERR_MSG "${GTEST_ERR_MSG} system search paths.") + endif () + if (GTest_FIND_REQUIRED) + message(FATAL_ERROR "${GTEST_ERR_MSG}") + else (GTest_FIND_REQUIRED) + message(STATUS "${GTEST_ERR_MSG}") + endif (GTest_FIND_REQUIRED) + endif () +endif () + +mark_as_advanced( + GTEST_INCLUDE_DIR + GTEST_LIBS + GTEST_LIBRARIES + GTEST_STATIC_LIB + GTEST_SHARED_LIB +) diff --git a/cpp/cmake_modules/FindParquet.cmake b/cpp/cmake_modules/FindParquet.cmake new file mode 100644 index 0000000000000..76c2d1dbee941 --- /dev/null +++ b/cpp/cmake_modules/FindParquet.cmake @@ -0,0 +1,80 @@ +# Copyright 2012 Cloudera Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +# - Find PARQUET (parquet/parquet.h, libparquet.a, libparquet.so) +# This module defines +# PARQUET_INCLUDE_DIR, directory containing headers +# PARQUET_LIBS, directory containing parquet libraries +# PARQUET_STATIC_LIB, path to libparquet.a +# PARQUET_SHARED_LIB, path to libparquet's shared library +# PARQUET_FOUND, whether parquet has been found + +if( NOT "$ENV{PARQUET_HOME}" STREQUAL "") + file( TO_CMAKE_PATH "$ENV{PARQUET_HOME}" _native_path ) + list( APPEND _parquet_roots ${_native_path} ) +elseif ( Parquet_HOME ) + list( APPEND _parquet_roots ${Parquet_HOME} ) +endif() + +# Try the parameterized roots, if they exist +if ( _parquet_roots ) + find_path( PARQUET_INCLUDE_DIR NAMES parquet/parquet.h + PATHS ${_parquet_roots} NO_DEFAULT_PATH + PATH_SUFFIXES "include" ) + find_library( PARQUET_LIBRARIES NAMES parquet + PATHS ${_parquet_roots} NO_DEFAULT_PATH + PATH_SUFFIXES "lib" ) +else () + find_path( PARQUET_INCLUDE_DIR NAMES parquet/parquet.h ) + find_library( PARQUET_LIBRARIES NAMES parquet ) +endif () + + +if (PARQUET_INCLUDE_DIR AND PARQUET_LIBRARIES) + set(PARQUET_FOUND TRUE) + get_filename_component( PARQUET_LIBS ${PARQUET_LIBRARIES} DIRECTORY ) + set(PARQUET_LIB_NAME libparquet) + set(PARQUET_STATIC_LIB ${PARQUET_LIBS}/${PARQUET_LIB_NAME}.a) + set(PARQUET_SHARED_LIB ${PARQUET_LIBS}/${PARQUET_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) +else () + set(PARQUET_FOUND FALSE) +endif () + +if (PARQUET_FOUND) + if (NOT Parquet_FIND_QUIETLY) + message(STATUS "Found the Parquet library: ${PARQUET_LIBRARIES}") + endif () +else () + if (NOT Parquet_FIND_QUIETLY) + set(PARQUET_ERR_MSG "Could not find the Parquet library. Looked in ") + if ( _parquet_roots ) + set(PARQUET_ERR_MSG "${PARQUET_ERR_MSG} in ${_parquet_roots}.") + else () + set(PARQUET_ERR_MSG "${PARQUET_ERR_MSG} system search paths.") + endif () + if (Parquet_FIND_REQUIRED) + message(FATAL_ERROR "${PARQUET_ERR_MSG}") + else (Parquet_FIND_REQUIRED) + message(STATUS "${PARQUET_ERR_MSG}") + endif (Parquet_FIND_REQUIRED) + endif () +endif () + +mark_as_advanced( + PARQUET_INCLUDE_DIR + PARQUET_LIBS + PARQUET_LIBRARIES + PARQUET_STATIC_LIB + PARQUET_SHARED_LIB +) diff --git a/cpp/cmake_modules/san-config.cmake b/cpp/cmake_modules/san-config.cmake new file mode 100644 index 0000000000000..b847c96657a4a --- /dev/null +++ b/cpp/cmake_modules/san-config.cmake @@ -0,0 +1,92 @@ +# Clang does not support using ASAN and TSAN simultaneously. +if ("${ARROW_USE_ASAN}" AND "${ARROW_USE_TSAN}") + message(SEND_ERROR "Can only enable one of ASAN or TSAN at a time") +endif() + +# Flag to enable clang address sanitizer +# This will only build if clang or a recent enough gcc is the chosen compiler +if (${ARROW_USE_ASAN}) + if(NOT (("${COMPILER_FAMILY}" STREQUAL "clang") OR + ("${COMPILER_FAMILY}" STREQUAL "gcc" AND "${COMPILER_VERSION}" VERSION_GREATER "4.8"))) + message(SEND_ERROR "Cannot use ASAN without clang or gcc >= 4.8") + endif() + + # If UBSAN is also enabled, and we're on clang < 3.5, ensure static linking is + # enabled. 
Otherwise, we run into https://llvm.org/bugs/show_bug.cgi?id=18211 + if("${ARROW_USE_UBSAN}" AND + "${COMPILER_FAMILY}" STREQUAL "clang" AND + "${COMPILER_VERSION}" VERSION_LESS "3.5") + if("${ARROW_LINK}" STREQUAL "a") + message("Using static linking for ASAN+UBSAN build") + set(ARROW_LINK "s") + elseif("${ARROW_LINK}" STREQUAL "d") + message(SEND_ERROR "Cannot use dynamic linking when ASAN and UBSAN are both enabled") + endif() + endif() + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=address -DADDRESS_SANITIZER") +endif() + + +# Flag to enable clang undefined behavior sanitizer +# We explicitly don't enable all of the sanitizer flags: +# - disable 'vptr' because it currently crashes somewhere in boost::intrusive::list code +# - disable 'alignment' because unaligned access is really OK on Nehalem and we do it +# all over the place. +if (${ARROW_USE_UBSAN}) + if(NOT (("${COMPILER_FAMILY}" STREQUAL "clang") OR + ("${COMPILER_FAMILY}" STREQUAL "gcc" AND "${COMPILER_VERSION}" VERSION_GREATER "4.9"))) + message(SEND_ERROR "Cannot use UBSAN without clang or gcc >= 4.9") + endif() + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fsanitize=undefined -fno-sanitize=alignment,vptr -fno-sanitize-recover") +endif () + +# Flag to enable thread sanitizer (clang or gcc 4.8) +if (${ARROW_USE_TSAN}) + if(NOT (("${COMPILER_FAMILY}" STREQUAL "clang") OR + ("${COMPILER_FAMILY}" STREQUAL "gcc" AND "${COMPILER_VERSION}" VERSION_GREATER "4.8"))) + message(SEND_ERROR "Cannot use TSAN without clang or gcc >= 4.8") + endif() + + add_definitions("-fsanitize=thread") + + # Enables dynamic_annotations.h to actually generate code + add_definitions("-DDYNAMIC_ANNOTATIONS_ENABLED") + + # changes atomicops to use the tsan implementations + add_definitions("-DTHREAD_SANITIZER") + + # Disables using the precompiled template specializations for std::string, shared_ptr, etc + # so that the annotations in the header actually take effect. + add_definitions("-D_GLIBCXX_EXTERN_TEMPLATE=0") + + # Some of the above also need to be passed to the linker. + set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} -pie -fsanitize=thread") + + # Strictly speaking, TSAN doesn't require dynamic linking. But it does + # require all code to be position independent, and the easiest way to + # guarantee that is via dynamic linking (not all 3rd party archives are + # compiled with -fPIC e.g. boost). + if("${ARROW_LINK}" STREQUAL "a") + message("Using dynamic linking for TSAN") + set(ARROW_LINK "d") + elseif("${ARROW_LINK}" STREQUAL "s") + message(SEND_ERROR "Cannot use TSAN with static linking") + endif() +endif() + + +if ("${ARROW_USE_UBSAN}" OR "${ARROW_USE_ASAN}" OR "${ARROW_USE_TSAN}") + # GCC 4.8 and 4.9 (latest as of this writing) don't allow you to specify a + # sanitizer blacklist. + if("${COMPILER_FAMILY}" STREQUAL "clang") + # Require clang 3.4 or newer; clang 3.3 has issues with TSAN and pthread + # symbol interception. + if("${COMPILER_VERSION}" VERSION_LESS "3.4") + message(SEND_ERROR "Must use clang 3.4 or newer to run a sanitizer build." + " Try using clang from $NATIVE_TOOLCHAIN/") + endif() + add_definitions("-fsanitize-blacklist=${BUILD_SUPPORT_DIR}/sanitize-blacklist.txt") + else() + message(WARNING "GCC does not support specifying a sanitizer blacklist. 
Known sanitizer check failures will not be suppressed.") + endif() +endif() diff --git a/cpp/setup_build_env.sh b/cpp/setup_build_env.sh new file mode 100755 index 0000000000000..457b9717ebe81 --- /dev/null +++ b/cpp/setup_build_env.sh @@ -0,0 +1,12 @@ +#!/bin/bash + +set -e + +SOURCE_DIR=$(cd "$(dirname "$BASH_SOURCE")"; pwd) + +./thirdparty/download_thirdparty.sh +./thirdparty/build_thirdparty.sh + +export GTEST_HOME=$SOURCE_DIR/thirdparty/$GTEST_BASEDIR + +echo "Build env initialized" diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt new file mode 100644 index 0000000000000..eeea2dbc517b4 --- /dev/null +++ b/cpp/src/arrow/CMakeLists.txt @@ -0,0 +1,33 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# Headers: top level +install(FILES + api.h + array.h + builder.h + type.h + DESTINATION include/arrow) + +####################################### +# Unit tests +####################################### + +set(ARROW_TEST_LINK_LIBS arrow_test_util ${ARROW_MIN_TEST_LIBS}) + +ADD_ARROW_TEST(array-test) +ADD_ARROW_TEST(field-test) diff --git a/cpp/src/arrow/api.h b/cpp/src/arrow/api.h new file mode 100644 index 0000000000000..899e8aae19c0e --- /dev/null +++ b/cpp/src/arrow/api.h @@ -0,0 +1,21 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_API_H +#define ARROW_API_H + +#endif // ARROW_API_H diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc new file mode 100644 index 0000000000000..5ecf91624fe73 --- /dev/null +++ b/cpp/src/arrow/array-test.cc @@ -0,0 +1,92 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include + +#include +#include +#include +#include + +#include "arrow/array.h" +#include "arrow/test-util.h" +#include "arrow/type.h" +#include "arrow/types/integer.h" +#include "arrow/types/primitive.h" +#include "arrow/util/buffer.h" + +using std::string; +using std::vector; + +namespace arrow { + +static TypePtr int32 = TypePtr(new Int32Type()); +static TypePtr int32_nn = TypePtr(new Int32Type(false)); + + +class TestArray : public ::testing::Test { + public: + void SetUp() { + auto data = std::make_shared(); + auto nulls = std::make_shared(); + + ASSERT_OK(data->Resize(400)); + ASSERT_OK(nulls->Resize(128)); + + arr_.reset(new Int32Array(100, data, nulls)); + } + + protected: + std::unique_ptr arr_; +}; + + +TEST_F(TestArray, TestNullable) { + std::shared_ptr tmp = arr_->data(); + std::unique_ptr arr_nn(new Int32Array(100, tmp)); + + ASSERT_TRUE(arr_->nullable()); + ASSERT_FALSE(arr_nn->nullable()); +} + + +TEST_F(TestArray, TestLength) { + ASSERT_EQ(arr_->length(), 100); +} + +TEST_F(TestArray, TestIsNull) { + vector nulls = {1, 0, 1, 1, 0, 1, 0, 0, + 1, 0, 1, 1, 0, 1, 0, 0, + 1, 0, 1, 1, 0, 1, 0, 0, + 1, 0, 1, 1, 0, 1, 0, 0, + 1, 0, 0, 1}; + + std::shared_ptr null_buf = bytes_to_null_buffer(nulls.data(), nulls.size()); + std::unique_ptr arr; + arr.reset(new Array(int32, nulls.size(), null_buf)); + + ASSERT_EQ(null_buf->size(), 5); + for (size_t i = 0; i < nulls.size(); ++i) { + ASSERT_EQ(static_cast(nulls[i]), arr->IsNull(i)); + } +} + + +TEST_F(TestArray, TestCopy) { +} + +} // namespace arrow diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc new file mode 100644 index 0000000000000..1726a2f27d82d --- /dev/null +++ b/cpp/src/arrow/array.cc @@ -0,0 +1,44 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
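+// Editorial sketch (illustrative, not part of the original commit): building
+// a nullable array the way array-test.cc does. Note that in this early API a
+// *set* bit in the nulls buffer marks a null slot (see Array::IsNull in
+// array.h), the opposite of a validity bitmap. `MutableBuffer` is a
+// placeholder name for whichever resizable Buffer subclass util/buffer.h
+// provides.
+//
+//   auto data = std::make_shared<MutableBuffer>();
+//   auto nulls = std::make_shared<MutableBuffer>();
+//   data->Resize(100 * sizeof(int32_t));      // room for 100 int32 values
+//   nulls->Resize(util::ceil_byte(100) / 8);  // one bit per slot
+//   Int32Array arr(100, data, nulls);         // nulls present => nullable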
+ +#include "arrow/array.h" + +#include "arrow/util/buffer.h" + +namespace arrow { + +// ---------------------------------------------------------------------- +// Base array class + +Array::Array(const TypePtr& type, int64_t length, + const std::shared_ptr& nulls) { + Init(type, length, nulls); +} + +void Array::Init(const TypePtr& type, int64_t length, + const std::shared_ptr& nulls) { + type_ = type; + length_ = length; + nulls_ = nulls; + + nullable_ = type->nullable; + if (nulls_) { + null_bits_ = nulls_->data(); + } +} + +} // namespace arrow diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h new file mode 100644 index 0000000000000..c95450d12a419 --- /dev/null +++ b/cpp/src/arrow/array.h @@ -0,0 +1,79 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_ARRAY_H +#define ARROW_ARRAY_H + +#include +#include +#include + +#include "arrow/type.h" +#include "arrow/util/bit-util.h" +#include "arrow/util/macros.h" + +namespace arrow { + +class Buffer; + +// Immutable data array with some logical type and some length. Any memory is +// owned by the respective Buffer instance (or its parents). May or may not be +// nullable. +// +// The base class only has a null array (if the data type is nullable) +// +// Any buffers used to initialize the array have their references "stolen". If +// you wish to use the buffer beyond the lifetime of the array, you need to +// explicitly increment its reference count +class Array { + public: + Array() : length_(0), nulls_(nullptr), null_bits_(nullptr) {} + Array(const TypePtr& type, int64_t length, + const std::shared_ptr& nulls = nullptr); + + virtual ~Array() {} + + void Init(const TypePtr& type, int64_t length, const std::shared_ptr& nulls); + + // Determine if a slot if null. For inner loops. Does *not* boundscheck + bool IsNull(int64_t i) const { + return nullable_ && util::get_bit(null_bits_, i); + } + + int64_t length() const { return length_;} + bool nullable() const { return nullable_;} + const TypePtr& type() const { return type_;} + TypeEnum type_enum() const { return type_->type;} + + protected: + TypePtr type_; + bool nullable_; + int64_t length_; + + std::shared_ptr nulls_; + const uint8_t* null_bits_; + + private: + DISALLOW_COPY_AND_ASSIGN(Array); +}; + + +typedef std::shared_ptr ArrayPtr; + +} // namespace arrow + +#endif diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc new file mode 100644 index 0000000000000..1fd7471928367 --- /dev/null +++ b/cpp/src/arrow/builder.cc @@ -0,0 +1,63 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. 
The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/builder.h" + +#include + +#include "arrow/util/bit-util.h" +#include "arrow/util/buffer.h" +#include "arrow/util/status.h" + +namespace arrow { + +Status ArrayBuilder::Init(int64_t capacity) { + capacity_ = capacity; + + if (nullable_) { + int64_t to_alloc = util::ceil_byte(capacity) / 8; + nulls_ = std::make_shared(); + RETURN_NOT_OK(nulls_->Resize(to_alloc)); + null_bits_ = nulls_->mutable_data(); + memset(null_bits_, 0, to_alloc); + } + return Status::OK(); +} + +Status ArrayBuilder::Resize(int64_t new_bits) { + if (nullable_) { + int64_t new_bytes = util::ceil_byte(new_bits) / 8; + int64_t old_bytes = nulls_->size(); + RETURN_NOT_OK(nulls_->Resize(new_bytes)); + null_bits_ = nulls_->mutable_data(); + if (old_bytes < new_bytes) { + memset(null_bits_ + old_bytes, 0, new_bytes - old_bytes); + } + } + return Status::OK(); +} + +Status ArrayBuilder::Advance(int64_t elements) { + if (nullable_ && length_ + elements > capacity_) { + return Status::Invalid("Builder must be expanded"); + } + length_ += elements; + return Status::OK(); +} + + +} // namespace arrow diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h new file mode 100644 index 0000000000000..b43668af77cbd --- /dev/null +++ b/cpp/src/arrow/builder.h @@ -0,0 +1,101 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_BUILDER_H +#define ARROW_BUILDER_H + +#include +#include +#include + +#include "arrow/type.h" +#include "arrow/util/buffer.h" +#include "arrow/util/macros.h" +#include "arrow/util/status.h" + +namespace arrow { + +class Array; + +static constexpr int64_t MIN_BUILDER_CAPACITY = 1 << 8; + +// Base class for all data array builders +class ArrayBuilder { + public: + explicit ArrayBuilder(const TypePtr& type) + : type_(type), + nullable_(type_->nullable), + nulls_(nullptr), null_bits_(nullptr), + length_(0), + capacity_(0) {} + + virtual ~ArrayBuilder() {} + + // For nested types. 
Since the objects are owned by this class instance, we + // skip shared pointers and just return a raw pointer + ArrayBuilder* child(int i) { + return children_[i].get(); + } + + int num_children() const { + return children_.size(); + } + + int64_t length() const { return length_;} + int64_t capacity() const { return capacity_;} + bool nullable() const { return nullable_;} + + // Allocates requires memory at this level, but children need to be + // initialized independently + Status Init(int64_t capacity); + + // Resizes the nulls array (if nullable) + Status Resize(int64_t new_bits); + + // For cases where raw data was memcpy'd into the internal buffers, allows us + // to advance the length of the builder. It is your responsibility to use + // this function responsibly. + Status Advance(int64_t elements); + + const std::shared_ptr& nulls() const { return nulls_;} + + // Creates new array object to hold the contents of the builder and transfers + // ownership of the data + virtual Status ToArray(Array** out) = 0; + + protected: + TypePtr type_; + bool nullable_; + + // If the type is not nullable, then null_ is nullptr after initialization + std::shared_ptr nulls_; + uint8_t* null_bits_; + + // Array length, so far. Also, the index of the next element to be added + int64_t length_; + int64_t capacity_; + + // Child value array builders. These are owned by this class + std::vector > children_; + + private: + DISALLOW_COPY_AND_ASSIGN(ArrayBuilder); +}; + +} // namespace arrow + +#endif // ARROW_BUILDER_H_ diff --git a/cpp/src/arrow/field-test.cc b/cpp/src/arrow/field-test.cc new file mode 100644 index 0000000000000..2bb8bad4054c3 --- /dev/null +++ b/cpp/src/arrow/field-test.cc @@ -0,0 +1,38 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include +#include +#include + +#include "arrow/field.h" +#include "arrow/type.h" +#include "arrow/types/integer.h" + +using std::string; + +namespace arrow { + +TEST(TestField, Basics) { + TypePtr ftype = TypePtr(new Int32Type()); + Field f0("f0", ftype); + + ASSERT_EQ(f0.name, "f0"); + ASSERT_EQ(f0.type->ToString(), ftype->ToString()); +} + +} // namespace arrow diff --git a/cpp/src/arrow/field.h b/cpp/src/arrow/field.h new file mode 100644 index 0000000000000..664cae61a777a --- /dev/null +++ b/cpp/src/arrow/field.h @@ -0,0 +1,48 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_FIELD_H +#define ARROW_FIELD_H + +#include + +#include "arrow/type.h" + +namespace arrow { + +// A field is a piece of metadata that includes (for now) a name and a data +// type + +struct Field { + // Field name + std::string name; + + // The field's data type + TypePtr type; + + Field(const std::string& name, const TypePtr& type) : + name(name), type(type) {} + + bool Equals(const Field& other) const { + return (this == &other) || (this->name == other.name && + this->type->Equals(other.type.get())); + } +}; + +} // namespace arrow + +#endif // ARROW_FIELD_H diff --git a/cpp/src/arrow/parquet/CMakeLists.txt b/cpp/src/arrow/parquet/CMakeLists.txt new file mode 100644 index 0000000000000..7b449affab025 --- /dev/null +++ b/cpp/src/arrow/parquet/CMakeLists.txt @@ -0,0 +1,35 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# ---------------------------------------------------------------------- +# arrow_parquet : Arrow <-> Parquet adapter + +set(PARQUET_SRCS +) + +set(PARQUET_LIBS +) + +add_library(arrow_parquet STATIC + ${PARQUET_SRCS} +) +target_link_libraries(arrow_parquet ${PARQUET_LIBS}) +SET_TARGET_PROPERTIES(arrow_parquet PROPERTIES LINKER_LANGUAGE CXX) + +# Headers: top level +install(FILES + DESTINATION include/arrow/parquet) diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h new file mode 100644 index 0000000000000..2233a4f832a8c --- /dev/null +++ b/cpp/src/arrow/test-util.h @@ -0,0 +1,97 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
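+// Editorial note (illustrative, not part of the original commit): each macro
+// below evaluates its Status-returning expression exactly once and asserts
+// on the result; ASSERT_RAISES(ENUM, expr) expands to the s.Is##ENUM()
+// predicate. For example, with a hypothetical builder under test:
+//
+//   ASSERT_OK(builder->Init(1024));                     // requires s.ok()
+//   ASSERT_RAISES(Invalid, builder->Advance(1 << 20));  // requires s.IsInvalid()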
+ +#ifndef ARROW_TEST_UTIL_H_ +#define ARROW_TEST_UTIL_H_ + +#include +#include +#include +#include + +#include "arrow/util/bit-util.h" +#include "arrow/util/random.h" +#include "arrow/util/status.h" + +#define ASSERT_RAISES(ENUM, expr) \ + do { \ + Status s = (expr); \ + ASSERT_TRUE(s.Is##ENUM()); \ + } while (0) + + +#define ASSERT_OK(expr) \ + do { \ + Status s = (expr); \ + ASSERT_TRUE(s.ok()); \ + } while (0) + + +#define EXPECT_OK(expr) \ + do { \ + Status s = (expr); \ + EXPECT_TRUE(s.ok()); \ + } while (0) + + +namespace arrow { + +template +void randint(int64_t N, T lower, T upper, std::vector* out) { + Random rng(random_seed()); + uint64_t draw; + uint64_t span = upper - lower; + T val; + for (int64_t i = 0; i < N; ++i) { + draw = rng.Uniform64(span); + val = lower + static_cast(draw); + out->push_back(val); + } +} + + +template +std::shared_ptr to_buffer(const std::vector& values) { + return std::make_shared(reinterpret_cast(values.data()), + values.size() * sizeof(T)); +} + +void random_nulls(int64_t n, double pct_null, std::vector* nulls) { + Random rng(random_seed()); + for (int i = 0; i < n; ++i) { + nulls->push_back(static_cast(rng.NextDoubleFraction() > pct_null)); + } +} + +void random_nulls(int64_t n, double pct_null, std::vector* nulls) { + Random rng(random_seed()); + for (int i = 0; i < n; ++i) { + nulls->push_back(rng.NextDoubleFraction() > pct_null); + } +} + +std::shared_ptr bytes_to_null_buffer(uint8_t* bytes, int length) { + std::shared_ptr out; + + // TODO(wesm): error checking + util::bytes_to_bits(bytes, length, &out); + return out; +} + +} // namespace arrow + +#endif // ARROW_TEST_UTIL_H_ diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc new file mode 100644 index 0000000000000..492eee52b04b1 --- /dev/null +++ b/cpp/src/arrow/type.cc @@ -0,0 +1,22 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/type.h" + +namespace arrow { + +} // namespace arrow diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h new file mode 100644 index 0000000000000..220f99f4e885a --- /dev/null +++ b/cpp/src/arrow/type.h @@ -0,0 +1,180 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef ARROW_TYPE_H
+#define ARROW_TYPE_H
+
+#include <memory>
+#include <string>
+
+namespace arrow {
+
+// Physical data type that describes the memory layout of values. See details
+// for each type
+enum class LayoutEnum: char {
+  // A physical type consisting of some non-negative number of bytes
+  BYTE = 0,
+
+  // A physical type consisting of some non-negative number of bits
+  BIT = 1,
+
+  // A parametric variable-length value type. Full specification requires a
+  // child logical type
+  LIST = 2,
+
+  // A collection of multiple equal-length child arrays. Parametric type taking
+  // 1 or more child logical types
+  STRUCT = 3,
+
+  // An array with heterogeneous value types. Parametric types taking 1 or more
+  // child logical types
+  DENSE_UNION = 4,
+  SPARSE_UNION = 5
+};
+
+
+struct LayoutType {
+  LayoutEnum type;
+  explicit LayoutType(LayoutEnum type) : type(type) {}
+};
+
+
+// Data types in this library are all *logical*. They can be expressed as
+// either a primitive physical type (bytes or bits of some fixed size), a
+// nested type consisting of other data types, or another data type (e.g. a
+// timestamp encoded as an int64)
+//
+// Any data type can be nullable
+
+enum class TypeEnum: char {
+  // A degenerate NULL type represented as 0 bytes/bits
+  NA = 0,
+
+  // Little-endian integer types
+  UINT8 = 1,
+  INT8 = 2,
+  UINT16 = 3,
+  INT16 = 4,
+  UINT32 = 5,
+  INT32 = 6,
+  UINT64 = 7,
+  INT64 = 8,
+
+  // A boolean value represented as 1 byte
+  BOOL = 9,
+
+  // A boolean value represented as 1 bit
+  BIT = 10,
+
+  // 4-byte floating point value
+  FLOAT = 11,
+
+  // 8-byte floating point value
+  DOUBLE = 12,
+
+  // CHAR(N): fixed-length UTF8 string with length N
+  CHAR = 13,
+
+  // UTF8 variable-length string as List<Char>
+  STRING = 14,
+
+  // VARCHAR(N): Null-terminated string type embedded in a CHAR(N + 1)
+  VARCHAR = 15,
+
+  // Variable-length bytes (no guarantee of UTF8-ness)
+  BINARY = 16,
+
+  // By default, int32 days since the UNIX epoch
+  DATE = 17,
+
+  // Exact timestamp encoded with int64 since UNIX epoch
+  // Default unit millisecond
+  TIMESTAMP = 18,
+
+  // Timestamp as double seconds since the UNIX epoch
+  TIMESTAMP_DOUBLE = 19,
+
+  // Exact time encoded with int64, default unit millisecond
+  TIME = 20,
+
+  // Precision- and scale-based decimal type. Storage type depends on the
+  // parameters.
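+  // (Editorial note, illustrative: a decimal of precision 9 and scale 2,
+  // e.g. 1234567.89, can be stored as the scaled integer 123456789 in an
+  // int32; wider precisions need int64 or a fixed-size binary layout.)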
+ DECIMAL = 21, + + // Decimal value encoded as a text string + DECIMAL_TEXT = 22, + + // A list of some logical data type + LIST = 30, + + // Struct of logical types + STRUCT = 31, + + // Unions of logical types + DENSE_UNION = 32, + SPARSE_UNION = 33, + + // Union + JSON_SCALAR = 50, + + // User-defined type + USER = 60 +}; + + +struct DataType { + TypeEnum type; + bool nullable; + + explicit DataType(TypeEnum type, bool nullable = true) + : type(type), nullable(nullable) {} + + virtual bool Equals(const DataType* other) { + return (this == other) || (this->type == other->type && + this->nullable == other->nullable); + } + + virtual std::string ToString() const = 0; +}; + + +typedef std::shared_ptr LayoutPtr; +typedef std::shared_ptr TypePtr; + + +struct BytesType : public LayoutType { + int size; + + explicit BytesType(int size) + : LayoutType(LayoutEnum::BYTE), + size(size) {} + + BytesType(const BytesType& other) + : BytesType(other.size) {} +}; + +struct ListLayoutType : public LayoutType { + LayoutPtr value_type; + + explicit ListLayoutType(const LayoutPtr& value_type) + : LayoutType(LayoutEnum::BYTE), + value_type(value_type) {} +}; + +} // namespace arrow + +#endif // ARROW_TYPE_H diff --git a/cpp/src/arrow/types/CMakeLists.txt b/cpp/src/arrow/types/CMakeLists.txt new file mode 100644 index 0000000000000..e090aead1f8b9 --- /dev/null +++ b/cpp/src/arrow/types/CMakeLists.txt @@ -0,0 +1,63 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +####################################### +# arrow_types +####################################### + +set(TYPES_SRCS + construct.cc + floating.cc + integer.cc + json.cc + list.cc + primitive.cc + string.cc + struct.cc + union.cc +) + +set(TYPES_LIBS +) + +add_library(arrow_types STATIC + ${TYPES_SRCS} +) +target_link_libraries(arrow_types ${TYPES_LIBS}) +SET_TARGET_PROPERTIES(arrow_types PROPERTIES LINKER_LANGUAGE CXX) + +# Headers: top level +install(FILES + boolean.h + collection.h + datetime.h + decimal.h + floating.h + integer.h + json.h + list.h + primitive.h + string.h + struct.h + union.h + DESTINATION include/arrow/types) + + +ADD_ARROW_TEST(list-test) +ADD_ARROW_TEST(primitive-test) +ADD_ARROW_TEST(string-test) +ADD_ARROW_TEST(struct-test) diff --git a/cpp/src/arrow/types/binary.h b/cpp/src/arrow/types/binary.h new file mode 100644 index 0000000000000..a9f20046b582b --- /dev/null +++ b/cpp/src/arrow/types/binary.h @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. 
The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_TYPES_BINARY_H +#define ARROW_TYPES_BINARY_H + +#include +#include + +#include "arrow/type.h" + +namespace arrow { + +struct StringType : public DataType { +}; + +} // namespace arrow + +#endif // ARROW_TYPES_BINARY_H diff --git a/cpp/src/arrow/types/boolean.h b/cpp/src/arrow/types/boolean.h new file mode 100644 index 0000000000000..31388c8152d52 --- /dev/null +++ b/cpp/src/arrow/types/boolean.h @@ -0,0 +1,35 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_TYPES_BOOLEAN_H +#define ARROW_TYPES_BOOLEAN_H + +#include "arrow/types/primitive.h" + +namespace arrow { + +struct BooleanType : public PrimitiveType { + PRIMITIVE_DECL(BooleanType, uint8_t, BOOL, 1, "bool"); +}; + +typedef PrimitiveArrayImpl BooleanArray; + +// typedef PrimitiveBuilder BooleanBuilder; + +} // namespace arrow + +#endif // ARROW_TYPES_BOOLEAN_H diff --git a/cpp/src/arrow/types/collection.h b/cpp/src/arrow/types/collection.h new file mode 100644 index 0000000000000..59ba61419417a --- /dev/null +++ b/cpp/src/arrow/types/collection.h @@ -0,0 +1,45 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
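+// Editorial sketch (illustrative, not part of the original commit): the
+// CollectionType template below is parameterized on a TypeEnum value, so a
+// nested kind inherits the child accessors. A hypothetical subclass:
+//
+//   struct PairType : public CollectionType<TypeEnum::STRUCT> {
+//     virtual std::string ToString() const { return "pair"; }
+//   };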
+ +#ifndef ARROW_TYPES_COLLECTION_H +#define ARROW_TYPES_COLLECTION_H + +#include +#include + +#include "arrow/type.h" + +namespace arrow { + +template +struct CollectionType : public DataType { + std::vector child_types_; + + explicit CollectionType(bool nullable = true) : DataType(T, nullable) {} + + const TypePtr& child(int i) const { + return child_types_[i]; + } + + int num_children() const { + return child_types_.size(); + } +}; + +} // namespace arrow + +#endif // ARROW_TYPES_COLLECTION_H diff --git a/cpp/src/arrow/types/construct.cc b/cpp/src/arrow/types/construct.cc new file mode 100644 index 0000000000000..5176cafd3ba1c --- /dev/null +++ b/cpp/src/arrow/types/construct.cc @@ -0,0 +1,88 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/types/construct.h" + +#include + +#include "arrow/types/floating.h" +#include "arrow/types/integer.h" +#include "arrow/types/list.h" +#include "arrow/types/string.h" +#include "arrow/util/status.h" + +namespace arrow { + +class ArrayBuilder; + +// Initially looked at doing this with vtables, but shared pointers makes it +// difficult + +#define BUILDER_CASE(ENUM, BuilderType) \ + case TypeEnum::ENUM: \ + *out = static_cast(new BuilderType(type)); \ + return Status::OK(); + +Status make_builder(const TypePtr& type, ArrayBuilder** out) { + switch (type->type) { + BUILDER_CASE(UINT8, UInt8Builder); + BUILDER_CASE(INT8, Int8Builder); + BUILDER_CASE(UINT16, UInt16Builder); + BUILDER_CASE(INT16, Int16Builder); + BUILDER_CASE(UINT32, UInt32Builder); + BUILDER_CASE(INT32, Int32Builder); + BUILDER_CASE(UINT64, UInt64Builder); + BUILDER_CASE(INT64, Int64Builder); + + // BUILDER_CASE(BOOL, BooleanBuilder); + + BUILDER_CASE(FLOAT, FloatBuilder); + BUILDER_CASE(DOUBLE, DoubleBuilder); + + BUILDER_CASE(STRING, StringBuilder); + + case TypeEnum::LIST: + { + ListType* list_type = static_cast(type.get()); + ArrayBuilder* value_builder; + RETURN_NOT_OK(make_builder(list_type->value_type, &value_builder)); + + // The ListBuilder takes ownership of the value_builder + ListBuilder* builder = new ListBuilder(type, value_builder); + *out = static_cast(builder); + return Status::OK(); + } + // BUILDER_CASE(CHAR, CharBuilder); + + // BUILDER_CASE(VARCHAR, VarcharBuilder); + // BUILDER_CASE(BINARY, BinaryBuilder); + + // BUILDER_CASE(DATE, DateBuilder); + // BUILDER_CASE(TIMESTAMP, TimestampBuilder); + // BUILDER_CASE(TIME, TimeBuilder); + + // BUILDER_CASE(LIST, ListBuilder); + // BUILDER_CASE(STRUCT, StructBuilder); + // BUILDER_CASE(DENSE_UNION, DenseUnionBuilder); + // BUILDER_CASE(SPARSE_UNION, SparseUnionBuilder); + + default: + return Status::NotImplemented(type->ToString()); + } +} + +} // namespace arrow diff --git a/cpp/src/arrow/types/construct.h b/cpp/src/arrow/types/construct.h new file mode 
100644
index 0000000000000..c0bfedd27d6ad
--- /dev/null
+++ b/cpp/src/arrow/types/construct.h
@@ -0,0 +1,32 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef ARROW_TYPES_CONSTRUCT_H
+#define ARROW_TYPES_CONSTRUCT_H
+
+#include "arrow/type.h"
+
+namespace arrow {
+
+class ArrayBuilder;
+class Status;
+
+Status make_builder(const TypePtr& type, ArrayBuilder** out);
+
+} // namespace arrow
+
+#endif // ARROW_TYPES_CONSTRUCT_H
diff --git a/cpp/src/arrow/types/datetime.h b/cpp/src/arrow/types/datetime.h
new file mode 100644
index 0000000000000..b4d62523c413a
--- /dev/null
+++ b/cpp/src/arrow/types/datetime.h
@@ -0,0 +1,79 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
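+// Editorial note (illustrative, not part of the original commit): both
+// temporal types below carry their resolution in the type rather than in the
+// stored values, e.g.:
+//
+//   TimestampType ts(TimestampType::Unit::NANO);  // int64 of nanoseconds
+//   DateType d;                                   // defaults to Unit::DAY
+//   TimestampType ts2(ts);                        // copy preserves the unit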
+ +#ifndef ARROW_TYPES_DATETIME_H +#define ARROW_TYPES_DATETIME_H + +#include "arrow/type.h" + +namespace arrow { + +struct DateType : public DataType { + enum class Unit: char { + DAY = 0, + MONTH = 1, + YEAR = 2 + }; + + Unit unit; + + explicit DateType(Unit unit = Unit::DAY, bool nullable = true) + : DataType(TypeEnum::DATE, nullable), + unit(unit) {} + + DateType(const DateType& other) + : DateType(other.unit, other.nullable) {} + + static char const *name() { + return "date"; + } + + // virtual std::string ToString() { + // return name(); + // } +}; + + +struct TimestampType : public DataType { + enum class Unit: char { + SECOND = 0, + MILLI = 1, + MICRO = 2, + NANO = 3 + }; + + Unit unit; + + explicit TimestampType(Unit unit = Unit::MILLI, bool nullable = true) + : DataType(TypeEnum::TIMESTAMP, nullable), + unit(unit) {} + + TimestampType(const TimestampType& other) + : TimestampType(other.unit, other.nullable) {} + + static char const *name() { + return "timestamp"; + } + + // virtual std::string ToString() { + // return name(); + // } +}; + +} // namespace arrow + +#endif // ARROW_TYPES_DATETIME_H diff --git a/cpp/src/arrow/types/decimal.h b/cpp/src/arrow/types/decimal.h new file mode 100644 index 0000000000000..464c3ff8da92b --- /dev/null +++ b/cpp/src/arrow/types/decimal.h @@ -0,0 +1,32 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_TYPES_DECIMAL_H +#define ARROW_TYPES_DECIMAL_H + +#include "arrow/type.h" + +namespace arrow { + +struct DecimalType : public DataType { + int precision; + int scale; +}; + +} // namespace arrow + +#endif // ARROW_TYPES_DECIMAL_H diff --git a/cpp/src/arrow/types/floating.cc b/cpp/src/arrow/types/floating.cc new file mode 100644 index 0000000000000..bde28266e638c --- /dev/null +++ b/cpp/src/arrow/types/floating.cc @@ -0,0 +1,22 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
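+// Editorial sketch (illustrative, not part of the original commit):
+// FloatType and DoubleType are declared via PRIMITIVE_DECL in floating.h and
+// used through the builder typedefs. Single-value Append mirrors the
+// Int32Builder usage in list-test.cc; the full PrimitiveBuilder interface
+// lives in primitive.h, whose body is not shown in this patch.
+//
+//   DoubleBuilder builder(TypePtr(new DoubleType()));
+//   builder.Init(MIN_BUILDER_CAPACITY);
+//   builder.Append(3.14);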
+ +#include "arrow/types/floating.h" + +namespace arrow { + +} // namespace arrow diff --git a/cpp/src/arrow/types/floating.h b/cpp/src/arrow/types/floating.h new file mode 100644 index 0000000000000..7551ce665a27b --- /dev/null +++ b/cpp/src/arrow/types/floating.h @@ -0,0 +1,43 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_TYPES_FLOATING_H +#define ARROW_TYPES_FLOATING_H + +#include + +#include "arrow/types/primitive.h" + +namespace arrow { + +struct FloatType : public PrimitiveType { + PRIMITIVE_DECL(FloatType, float, FLOAT, 4, "float"); +}; + +struct DoubleType : public PrimitiveType { + PRIMITIVE_DECL(DoubleType, double, DOUBLE, 8, "double"); +}; + +typedef PrimitiveArrayImpl FloatArray; +typedef PrimitiveArrayImpl DoubleArray; + +typedef PrimitiveBuilder FloatBuilder; +typedef PrimitiveBuilder DoubleBuilder; + +} // namespace arrow + +#endif // ARROW_TYPES_FLOATING_H diff --git a/cpp/src/arrow/types/integer.cc b/cpp/src/arrow/types/integer.cc new file mode 100644 index 0000000000000..4696536616971 --- /dev/null +++ b/cpp/src/arrow/types/integer.cc @@ -0,0 +1,22 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/types/integer.h" + +namespace arrow { + +} // namespace arrow diff --git a/cpp/src/arrow/types/integer.h b/cpp/src/arrow/types/integer.h new file mode 100644 index 0000000000000..7e5eab55be0a9 --- /dev/null +++ b/cpp/src/arrow/types/integer.h @@ -0,0 +1,88 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_TYPES_INTEGER_H +#define ARROW_TYPES_INTEGER_H + +#include +#include + +#include "arrow/types/primitive.h" + +namespace arrow { + +struct UInt8Type : public PrimitiveType { + PRIMITIVE_DECL(UInt8Type, uint8_t, UINT8, 1, "uint8"); +}; + +struct Int8Type : public PrimitiveType { + PRIMITIVE_DECL(Int8Type, int8_t, INT8, 1, "int8"); +}; + +struct UInt16Type : public PrimitiveType { + PRIMITIVE_DECL(UInt16Type, uint16_t, UINT16, 2, "uint16"); +}; + +struct Int16Type : public PrimitiveType { + PRIMITIVE_DECL(Int16Type, int16_t, INT16, 2, "int16"); +}; + +struct UInt32Type : public PrimitiveType { + PRIMITIVE_DECL(UInt32Type, uint32_t, UINT32, 4, "uint32"); +}; + +struct Int32Type : public PrimitiveType { + PRIMITIVE_DECL(Int32Type, int32_t, INT32, 4, "int32"); +}; + +struct UInt64Type : public PrimitiveType { + PRIMITIVE_DECL(UInt64Type, uint64_t, UINT64, 8, "uint64"); +}; + +struct Int64Type : public PrimitiveType { + PRIMITIVE_DECL(Int64Type, int64_t, INT64, 8, "int64"); +}; + +// Array containers + +typedef PrimitiveArrayImpl UInt8Array; +typedef PrimitiveArrayImpl Int8Array; + +typedef PrimitiveArrayImpl UInt16Array; +typedef PrimitiveArrayImpl Int16Array; + +typedef PrimitiveArrayImpl UInt32Array; +typedef PrimitiveArrayImpl Int32Array; + +typedef PrimitiveArrayImpl UInt64Array; +typedef PrimitiveArrayImpl Int64Array; + +// Builders + +typedef PrimitiveBuilder UInt8Builder; +typedef PrimitiveBuilder UInt16Builder; +typedef PrimitiveBuilder UInt32Builder; +typedef PrimitiveBuilder UInt64Builder; + +typedef PrimitiveBuilder Int8Builder; +typedef PrimitiveBuilder Int16Builder; +typedef PrimitiveBuilder Int32Builder; +typedef PrimitiveBuilder Int64Builder; + +} // namespace arrow + +#endif // ARROW_TYPES_INTEGER_H diff --git a/cpp/src/arrow/types/json.cc b/cpp/src/arrow/types/json.cc new file mode 100644 index 0000000000000..b29b95715fef6 --- /dev/null +++ b/cpp/src/arrow/types/json.cc @@ -0,0 +1,42 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
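+// Editorial note (illustrative, not part of the original commit): a JSON
+// scalar is modeled below as a union over {null, int32, string, double,
+// bool}. The static dense_type / sparse_type members differ only in union
+// encoding: a dense union keeps packed children plus offsets, a sparse union
+// keeps full-length children.
+//
+//   TypePtr dense = JSONScalar::dense_type;    // packed children + offsets
+//   TypePtr sparse = JSONScalar::sparse_type;  // full-length children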
+ +#include "arrow/types/json.h" + +#include + +#include "arrow/types/boolean.h" +#include "arrow/types/integer.h" +#include "arrow/types/floating.h" +#include "arrow/types/null.h" +#include "arrow/types/string.h" +#include "arrow/types/union.h" + +namespace arrow { + +static const TypePtr Null(new NullType()); +static const TypePtr Int32(new Int32Type()); +static const TypePtr String(new StringType()); +static const TypePtr Double(new DoubleType()); +static const TypePtr Bool(new BooleanType()); + +static const std::vector json_types = {Null, Int32, String, + Double, Bool}; +TypePtr JSONScalar::dense_type = TypePtr(new DenseUnionType(json_types)); +TypePtr JSONScalar::sparse_type = TypePtr(new SparseUnionType(json_types)); + +} // namespace arrow diff --git a/cpp/src/arrow/types/json.h b/cpp/src/arrow/types/json.h new file mode 100644 index 0000000000000..91fd132408fe6 --- /dev/null +++ b/cpp/src/arrow/types/json.h @@ -0,0 +1,38 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_TYPES_JSON_H +#define ARROW_TYPES_JSON_H + +#include "arrow/type.h" + +namespace arrow { + +struct JSONScalar : public DataType { + bool dense; + + static TypePtr dense_type; + static TypePtr sparse_type; + + explicit JSONScalar(bool dense = true, bool nullable = true) + : DataType(TypeEnum::JSON_SCALAR, nullable), + dense(dense) {} +}; + +} // namespace arrow + +#endif // ARROW_TYPES_JSON_H diff --git a/cpp/src/arrow/types/list-test.cc b/cpp/src/arrow/types/list-test.cc new file mode 100644 index 0000000000000..47673ff898bbd --- /dev/null +++ b/cpp/src/arrow/types/list-test.cc @@ -0,0 +1,166 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
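+// Editorial note (illustrative, not part of the original commit): the offset
+// convention exercised by the tests below. For the lists
+// {[0, 1, 2], null, [3, 4, 5, 6]} the builder yields 7 child values and
+// length + 1 offsets {0, 3, 3, 7}; slot i spans [offset(i), offset(i + 1)),
+// so the null slot 1 is the empty range [3, 3).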
+ +#include +#include +#include +#include +#include +#include + +#include "arrow/array.h" +#include "arrow/test-util.h" +#include "arrow/type.h" +#include "arrow/types/construct.h" +#include "arrow/types/integer.h" +#include "arrow/types/list.h" +#include "arrow/types/string.h" +#include "arrow/types/test-common.h" +#include "arrow/util/status.h" + +using std::string; +using std::unique_ptr; +using std::vector; + +namespace arrow { + +class ArrayBuilder; + +TEST(TypesTest, TestListType) { + std::shared_ptr vt = std::make_shared(); + + ListType list_type(vt); + ListType list_type_nn(vt, false); + + ASSERT_EQ(list_type.type, TypeEnum::LIST); + ASSERT_TRUE(list_type.nullable); + ASSERT_FALSE(list_type_nn.nullable); + + ASSERT_EQ(list_type.name(), string("list")); + ASSERT_EQ(list_type.ToString(), string("list")); + + ASSERT_EQ(list_type.value_type->type, vt->type); + ASSERT_EQ(list_type.value_type->type, vt->type); + + std::shared_ptr st = std::make_shared(); + std::shared_ptr lt = std::make_shared(st); + ASSERT_EQ(lt->ToString(), string("list")); + + ListType lt2(lt); + ASSERT_EQ(lt2.ToString(), string("list>")); +} + +// ---------------------------------------------------------------------- +// List tests + +class TestListBuilder : public TestBuilder { + public: + void SetUp() { + TestBuilder::SetUp(); + + value_type_ = TypePtr(new Int32Type()); + type_ = TypePtr(new ListType(value_type_)); + + ArrayBuilder* tmp; + ASSERT_OK(make_builder(type_, &tmp)); + builder_.reset(static_cast(tmp)); + } + + void Done() { + Array* out; + ASSERT_OK(builder_->ToArray(&out)); + result_.reset(static_cast(out)); + } + + protected: + TypePtr value_type_; + TypePtr type_; + + unique_ptr builder_; + unique_ptr result_; +}; + + +TEST_F(TestListBuilder, TestResize) { +} + +TEST_F(TestListBuilder, TestAppendNull) { + ASSERT_OK(builder_->AppendNull()); + ASSERT_OK(builder_->AppendNull()); + + Done(); + + ASSERT_TRUE(result_->IsNull(0)); + ASSERT_TRUE(result_->IsNull(1)); + + ASSERT_EQ(0, result_->offsets()[0]); + ASSERT_EQ(0, result_->offset(1)); + ASSERT_EQ(0, result_->offset(2)); + + Int32Array* values = static_cast(result_->values().get()); + ASSERT_EQ(0, values->length()); +} + +TEST_F(TestListBuilder, TestBasics) { + vector values = {0, 1, 2, 3, 4, 5, 6}; + vector lengths = {3, 0, 4}; + vector is_null = {0, 1, 0}; + + Int32Builder* vb = static_cast(builder_->value_builder()); + + int pos = 0; + for (size_t i = 0; i < lengths.size(); ++i) { + ASSERT_OK(builder_->Append(is_null[i] > 0)); + for (int j = 0; j < lengths[i]; ++j) { + ASSERT_OK(vb->Append(values[pos++])); + } + } + + Done(); + + ASSERT_TRUE(result_->nullable()); + ASSERT_TRUE(result_->values()->nullable()); + + ASSERT_EQ(3, result_->length()); + vector ex_offsets = {0, 3, 3, 7}; + for (size_t i = 0; i < ex_offsets.size(); ++i) { + ASSERT_EQ(ex_offsets[i], result_->offset(i)); + } + + for (int i = 0; i < result_->length(); ++i) { + ASSERT_EQ(static_cast(is_null[i]), result_->IsNull(i)); + } + + ASSERT_EQ(7, result_->values()->length()); + Int32Array* varr = static_cast(result_->values().get()); + + for (size_t i = 0; i < values.size(); ++i) { + ASSERT_EQ(values[i], varr->Value(i)); + } +} + +TEST_F(TestListBuilder, TestBasicsNonNullable) { +} + + +TEST_F(TestListBuilder, TestZeroLength) { + // All buffers are null + Done(); +} + + +} // namespace arrow diff --git a/cpp/src/arrow/types/list.cc b/cpp/src/arrow/types/list.cc new file mode 100644 index 0000000000000..f0ff5bf928a1a --- /dev/null +++ b/cpp/src/arrow/types/list.cc @@ -0,0 +1,31 @@ +// 
diff --git a/cpp/src/arrow/types/list.cc b/cpp/src/arrow/types/list.cc
new file mode 100644
index 0000000000000..f0ff5bf928a1a
--- /dev/null
+++ b/cpp/src/arrow/types/list.cc
@@ -0,0 +1,31 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/types/list.h"
+
+#include <sstream>
+#include <string>
+
+namespace arrow {
+
+std::string ListType::ToString() const {
+  std::stringstream s;
+  s << "list<" << value_type->ToString() << ">";
+  return s.str();
+}
+
+} // namespace arrow
diff --git a/cpp/src/arrow/types/list.h b/cpp/src/arrow/types/list.h
new file mode 100644
index 0000000000000..0f1116257c507
--- /dev/null
+++ b/cpp/src/arrow/types/list.h
@@ -0,0 +1,206 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef ARROW_TYPES_LIST_H
+#define ARROW_TYPES_LIST_H
+
+#include <cstdint>
+#include <cstring>
+#include <memory>
+#include <string>
+
+#include "arrow/array.h"
+#include "arrow/builder.h"
+#include "arrow/type.h"
+#include "arrow/types/integer.h"
+#include "arrow/types/primitive.h"
+#include "arrow/util/bit-util.h"
+#include "arrow/util/buffer.h"
+#include "arrow/util/status.h"
+
+namespace arrow {
+
+struct ListType : public DataType {
+  // List can contain any other logical value type
+  TypePtr value_type;
+
+  explicit ListType(const TypePtr& value_type, bool nullable = true)
+      : DataType(TypeEnum::LIST, nullable),
+        value_type(value_type) {}
+
+  static char const *name() {
+    return "list";
+  }
+
+  virtual std::string ToString() const;
+};
+
+
+class ListArray : public Array {
+ public:
+  ListArray() : Array(), offset_buf_(nullptr), offsets_(nullptr) {}
+
+  ListArray(const TypePtr& type, int64_t length, std::shared_ptr<Buffer> offsets,
+      const ArrayPtr& values, std::shared_ptr<Buffer> nulls = nullptr) {
+    Init(type, length, offsets, values, nulls);
+  }
+
+  virtual ~ListArray() {}
+
+  void Init(const TypePtr& type, int64_t length, std::shared_ptr<Buffer> offsets,
+      const ArrayPtr& values, std::shared_ptr<Buffer> nulls = nullptr) {
+    offset_buf_ = offsets;
+    offsets_ = offsets == nullptr ? nullptr :
+      reinterpret_cast<const int32_t*>(offset_buf_->data());
+
+    values_ = values;
+    Array::Init(type, length, nulls);
+  }
+
+  // Return a shared pointer in case the requestor desires to share ownership
+  // with this array.
+  const ArrayPtr& values() const { return values_;}
+
+  const int32_t* offsets() const { return offsets_;}
+
+  int32_t offset(int i) const { return offsets_[i];}
+
+  // Neither of these functions will perform boundschecking
+  int32_t value_offset(int i) { return offsets_[i];}
+  int32_t value_length(int i) { return offsets_[i + 1] - offsets_[i];}
+
+ protected:
+  std::shared_ptr<Buffer> offset_buf_;
+  const int32_t* offsets_;
+  ArrayPtr values_;
+};
+
+// ----------------------------------------------------------------------
+// Array builder
+
+
+// Builder class for variable-length list array value types
+//
+// To use this class, you must append values to the child array builder and use
+// the Append function to delimit each distinct list value (once the values
+// have been appended to the child array)
+class ListBuilder : public Int32Builder {
+ public:
+  ListBuilder(const TypePtr& type, ArrayBuilder* value_builder)
+      : Int32Builder(type) {
+    value_builder_.reset(value_builder);
+  }
+
+  Status Init(int64_t elements) {
+    // One more than requested.
+    //
+    // XXX: This is slightly imprecise, because we might trigger null mask
+    // resizes that are unnecessary when creating arrays with power-of-two size
+    return Int32Builder::Init(elements + 1);
+  }
+
+  Status Resize(int64_t capacity) {
+    // Need space for the end offset
+    RETURN_NOT_OK(Int32Builder::Resize(capacity + 1));
+
+    // Slight hack, as the "real" capacity is one less
+    --capacity_;
+    return Status::OK();
+  }
+
+  // Vector append
+  //
+  // If passed, null_bytes is of equal length to values, and any nonzero byte
+  // will be considered as a null for that slot
+  Status Append(T* values, int64_t length, uint8_t* null_bytes = nullptr) {
+    if (length_ + length > capacity_) {
+      int64_t new_capacity = util::next_power2(length_ + length);
+      RETURN_NOT_OK(Resize(new_capacity));
+    }
+    memcpy(raw_buffer() + length_, values, length * elsize_);
+
+    if (nullable_ && null_bytes != nullptr) {
+      // If null_bytes is all not null, then none of the values are null
+      for (int i = 0; i < length; ++i) {
+        util::set_bit(null_bits_, length_ + i, static_cast<bool>(null_bytes[i]));
+      }
+    }
+
+    length_ += length;
+    return Status::OK();
+  }
+
+  // Initialize an array type instance with the results of this builder
+  // Transfers ownership of all buffers
+  template <typename Container>
+  Status Transfer(Container* out) {
+    Array* child_values;
+    RETURN_NOT_OK(value_builder_->ToArray(&child_values));
+
+    // Add final offset if the length is non-zero
+    if (length_) {
+      raw_buffer()[length_] = child_values->length();
+    }
+
+    out->Init(type_, length_, values_, ArrayPtr(child_values), nulls_);
+    values_ = nulls_ = nullptr;
+    capacity_ = length_ = 0;
+    return Status::OK();
+  }
+
+  virtual Status ToArray(Array** out) {
+    ListArray* result = new ListArray();
+    RETURN_NOT_OK(Transfer(result));
+    *out = static_cast<Array*>(result);
+    return Status::OK();
+  }
+
+  // Start a new variable-length list slot
+  //
+  // This function should be called before beginning to append elements to the
+  // value builder
+  Status Append(bool is_null = false) {
+    if (length_ == capacity_) {
+      // If the capacity was not already a multiple of 2, do so here
+      RETURN_NOT_OK(Resize(util::next_power2(capacity_ + 1)));
+    }
+    if (nullable_) {
+      util::set_bit(null_bits_, length_, is_null);
+    }
+
+    raw_buffer()[length_++] = value_builder_->length();
+    return Status::OK();
+  }
+
+  // Status Append(int32_t* offsets, int length, uint8_t* null_bytes) {
+  //   return Int32Builder::Append(offsets, length, null_bytes);
+  // }
+
+  Status AppendNull() {
+    return Append(true);
+  }
+
+  ArrayBuilder* value_builder() const { return value_builder_.get();}
+
+ protected:
+  std::unique_ptr<ArrayBuilder> value_builder_;
+};
+
+
+} // namespace arrow
+
+#endif  // ARROW_TYPES_LIST_H
diff --git a/cpp/src/arrow/types/null.h b/cpp/src/arrow/types/null.h
new file mode 100644
index 0000000000000..c67f752d40989
--- /dev/null
+++ b/cpp/src/arrow/types/null.h
@@ -0,0 +1,34 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef ARROW_TYPES_NULL_H
+#define ARROW_TYPES_NULL_H
+
+#include <cstdint>
+#include <string>
+
+#include "arrow/type.h"
+#include "arrow/types/primitive.h"
+
+namespace arrow {
+
+struct NullType : public PrimitiveType<NullType> {
+  PRIMITIVE_DECL(NullType, void, NA, 0, "null");
+};
+
+} // namespace arrow
+
+#endif  // ARROW_TYPES_NULL_H
diff --git a/cpp/src/arrow/types/primitive-test.cc b/cpp/src/arrow/types/primitive-test.cc
new file mode 100644
index 0000000000000..12968608094d7
--- /dev/null
+++ b/cpp/src/arrow/types/primitive-test.cc
@@ -0,0 +1,345 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
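A minimal usage sketch of the builder protocol described in the ListBuilder comment above (hypothetical driver code, not part of the patch; it assumes the headers in this commit build and that RETURN_NOT_OK comes from arrow/util/status.h):

    #include "arrow/types/integer.h"
    #include "arrow/types/list.h"
    #include "arrow/util/status.h"

    // Sketch: construct the list array [[7, 8], null] by delimiting each
    // list slot, then appending that list's values to the child builder.
    arrow::Status BuildExample(arrow::Array** out) {
      using namespace arrow;
      TypePtr value_type(new Int32Type());
      TypePtr list_type(new ListType(value_type));

      ListBuilder builder(list_type, new Int32Builder(value_type));
      Int32Builder* items = static_cast<Int32Builder*>(builder.value_builder());

      RETURN_NOT_OK(builder.Append());      // delimit list 0
      RETURN_NOT_OK(items->Append(7));
      RETURN_NOT_OK(items->Append(8));
      RETURN_NOT_OK(builder.AppendNull());  // list 1 is null
      return builder.ToArray(out);
    }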
+
+#include <gtest/gtest.h>
+
+#include <cstdint>
+#include <cstdlib>
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "arrow/array.h"
+#include "arrow/builder.h"
+#include "arrow/test-util.h"
+#include "arrow/type.h"
+#include "arrow/types/boolean.h"
+#include "arrow/types/construct.h"
+#include "arrow/types/floating.h"
+#include "arrow/types/integer.h"
+#include "arrow/types/primitive.h"
+#include "arrow/types/test-common.h"
+#include "arrow/util/bit-util.h"
+#include "arrow/util/buffer.h"
+#include "arrow/util/status.h"
+
+using std::string;
+using std::unique_ptr;
+using std::vector;
+
+namespace arrow {
+
+TEST(TypesTest, TestBytesType) {
+  BytesType t1(3);
+
+  ASSERT_EQ(t1.type, LayoutEnum::BYTE);
+  ASSERT_EQ(t1.size, 3);
+}
+
+
+#define PRIMITIVE_TEST(KLASS, ENUM, NAME) \
+  TEST(TypesTest, TestPrimitive_##ENUM) { \
+    KLASS tp; \
+    KLASS tp_nn(false); \
+ \
+    ASSERT_EQ(tp.type, TypeEnum::ENUM); \
+    ASSERT_EQ(tp.name(), string(NAME)); \
+    ASSERT_TRUE(tp.nullable); \
+    ASSERT_FALSE(tp_nn.nullable); \
+ \
+    KLASS tp_copy = tp_nn; \
+    ASSERT_FALSE(tp_copy.nullable); \
+  }
+
+PRIMITIVE_TEST(Int8Type, INT8, "int8");
+PRIMITIVE_TEST(Int16Type, INT16, "int16");
+PRIMITIVE_TEST(Int32Type, INT32, "int32");
+PRIMITIVE_TEST(Int64Type, INT64, "int64");
+PRIMITIVE_TEST(UInt8Type, UINT8, "uint8");
+PRIMITIVE_TEST(UInt16Type, UINT16, "uint16");
+PRIMITIVE_TEST(UInt32Type, UINT32, "uint32");
+PRIMITIVE_TEST(UInt64Type, UINT64, "uint64");
+
+PRIMITIVE_TEST(FloatType, FLOAT, "float");
+PRIMITIVE_TEST(DoubleType, DOUBLE, "double");
+
+PRIMITIVE_TEST(BooleanType, BOOL, "bool");
+
+// ----------------------------------------------------------------------
+// Primitive type tests
+
+TEST_F(TestBuilder, TestResize) {
+  builder_->Init(10);
+  ASSERT_EQ(2, builder_->nulls()->size());
+
+  builder_->Resize(30);
+  ASSERT_EQ(4, builder_->nulls()->size());
+}
+
+template <typename Attrs>
+class TestPrimitiveBuilder : public TestBuilder {
+ public:
+  typedef typename Attrs::ArrayType ArrayType;
+  typedef typename Attrs::BuilderType BuilderType;
+  typedef typename Attrs::T T;
+
+  void SetUp() {
+    TestBuilder::SetUp();
+
+    type_ = Attrs::type();
+    type_nn_ = Attrs::type(false);
+
+    ArrayBuilder* tmp;
+    ASSERT_OK(make_builder(type_, &tmp));
+    builder_.reset(static_cast<BuilderType*>(tmp));
+
+    ASSERT_OK(make_builder(type_nn_, &tmp));
+    builder_nn_.reset(static_cast<BuilderType*>(tmp));
+  }
+
+  void RandomData(int64_t N, double pct_null = 0.1) {
+    Attrs::draw(N, &draws_);
+    random_nulls(N, pct_null, &nulls_);
+  }
+
+  void CheckNullable() {
+    ArrayType result;
+    ArrayType expected;
+    int64_t size = builder_->length();
+
+    auto ex_data = std::make_shared<Buffer>(reinterpret_cast<uint8_t*>(draws_.data()),
+        size * sizeof(T));
+
+    auto ex_nulls = bytes_to_null_buffer(nulls_.data(), size);
+
+    expected.Init(size, ex_data, ex_nulls);
+    ASSERT_OK(builder_->Transfer(&result));
+
+    // Builder is now reset
+    ASSERT_EQ(0, builder_->length());
+    ASSERT_EQ(0, builder_->capacity());
+    ASSERT_EQ(nullptr, builder_->buffer());
+
+    ASSERT_TRUE(result.Equals(expected));
+  }
+
+  void CheckNonNullable() {
+    ArrayType result;
+    ArrayType expected;
+    int64_t size = builder_nn_->length();
+
+    auto ex_data = std::make_shared<Buffer>(reinterpret_cast<uint8_t*>(draws_.data()),
+        size * sizeof(T));
+
+    expected.Init(size, ex_data);
+    ASSERT_OK(builder_nn_->Transfer(&result));
+
+    // Builder is now reset
+    ASSERT_EQ(0, builder_nn_->length());
+    ASSERT_EQ(0, builder_nn_->capacity());
+    ASSERT_EQ(nullptr, builder_nn_->buffer());
+
+    ASSERT_TRUE(result.Equals(expected));
+  }
+
+ protected:
+  TypePtr type_;
+  TypePtr type_nn_;
+  unique_ptr<BuilderType> builder_;
+  unique_ptr<BuilderType> builder_nn_;
+
+  vector<T> draws_;
+  vector<uint8_t> nulls_;
+};
+
+#define PTYPE_DECL(CapType, c_type) \
+  typedef CapType##Array ArrayType; \
+  typedef CapType##Builder BuilderType; \
+  typedef CapType##Type Type; \
+  typedef c_type T; \
+ \
+  static TypePtr type(bool nullable = true) { \
+    return TypePtr(new Type(nullable)); \
+  }
+
+#define PINT_DECL(CapType, c_type, LOWER, UPPER) \
+  struct P##CapType { \
+    PTYPE_DECL(CapType, c_type); \
+    static void draw(int64_t N, vector<T>* draws) { \
+      randint(N, LOWER, UPPER, draws); \
+    } \
+  }
+
+PINT_DECL(UInt8, uint8_t, 0, UINT8_MAX);
+PINT_DECL(UInt16, uint16_t, 0, UINT16_MAX);
+PINT_DECL(UInt32, uint32_t, 0, UINT32_MAX);
+PINT_DECL(UInt64, uint64_t, 0, UINT64_MAX);
+
+PINT_DECL(Int8, int8_t, INT8_MIN, INT8_MAX);
+PINT_DECL(Int16, int16_t, INT16_MIN, INT16_MAX);
+PINT_DECL(Int32, int32_t, INT32_MIN, INT32_MAX);
+PINT_DECL(Int64, int64_t, INT64_MIN, INT64_MAX);
+
+typedef ::testing::Types<PUInt8, PUInt16, PUInt32, PUInt64,
+    PInt8, PInt16, PInt32, PInt64> Primitives;
+
+TYPED_TEST_CASE(TestPrimitiveBuilder, Primitives);
+
+#define DECL_T() \
+  typedef typename TestFixture::T T;
+
+#define DECL_ARRAYTYPE() \
+  typedef typename TestFixture::ArrayType ArrayType;
+
+
+TYPED_TEST(TestPrimitiveBuilder, TestInit) {
+  DECL_T();
+
+  int64_t n = 1000;
+  ASSERT_OK(this->builder_->Init(n));
+  ASSERT_EQ(n, this->builder_->capacity());
+  ASSERT_EQ(n * sizeof(T), this->builder_->buffer()->size());
+
+  // unsure if this should go in all builder classes
+  ASSERT_EQ(0, this->builder_->num_children());
+}
+
+TYPED_TEST(TestPrimitiveBuilder, TestAppendNull) {
+  int size = 10000;
+  for (int i = 0; i < size; ++i) {
+    ASSERT_OK(this->builder_->AppendNull());
+  }
+
+  Array* result;
+  ASSERT_OK(this->builder_->ToArray(&result));
+  unique_ptr<Array> holder(result);
+
+  for (int i = 0; i < size; ++i) {
+    ASSERT_TRUE(result->IsNull(i));
+  }
+}
+
+
+TYPED_TEST(TestPrimitiveBuilder, TestAppendScalar) {
+  DECL_T();
+
+  int size = 10000;
+
+  vector<T>& draws = this->draws_;
+  vector<uint8_t>& nulls = this->nulls_;
+
+  this->RandomData(size);
+
+  int i;
+  // Append the first 1000
+  for (i = 0; i < 1000; ++i) {
+    ASSERT_OK(this->builder_->Append(draws[i], nulls[i] > 0));
+    ASSERT_OK(this->builder_nn_->Append(draws[i]));
+  }
+
+  ASSERT_EQ(1000, this->builder_->length());
+  ASSERT_EQ(1024, this->builder_->capacity());
+
+  ASSERT_EQ(1000, this->builder_nn_->length());
+  ASSERT_EQ(1024, this->builder_nn_->capacity());
+
+  // Append the next 9000
+  for (i = 1000; i < size; ++i) {
+    ASSERT_OK(this->builder_->Append(draws[i], nulls[i] > 0));
+    ASSERT_OK(this->builder_nn_->Append(draws[i]));
+  }
+
+  ASSERT_EQ(size, this->builder_->length());
+  ASSERT_EQ(util::next_power2(size), this->builder_->capacity());
+
+  ASSERT_EQ(size, this->builder_nn_->length());
+  ASSERT_EQ(util::next_power2(size), this->builder_nn_->capacity());
+
+  this->CheckNullable();
+  this->CheckNonNullable();
+}
+
+
+TYPED_TEST(TestPrimitiveBuilder, TestAppendVector) {
+  DECL_T();
+
+  int size = 10000;
+  this->RandomData(size);
+
+  vector<T>& draws = this->draws_;
+  vector<uint8_t>& nulls = this->nulls_;
+
+  // first slug
+  int K = 1000;
+
+  ASSERT_OK(this->builder_->Append(draws.data(), K, nulls.data()));
+  ASSERT_OK(this->builder_nn_->Append(draws.data(), K));
+
+  ASSERT_EQ(1000, this->builder_->length());
+  ASSERT_EQ(1024, this->builder_->capacity());
+
+  ASSERT_EQ(1000, this->builder_nn_->length());
+  ASSERT_EQ(1024, this->builder_nn_->capacity());
+
+  // Append the next 9000
+  ASSERT_OK(this->builder_->Append(draws.data() + K, size - K, nulls.data() + K));
+  ASSERT_OK(this->builder_nn_->Append(draws.data() + K, size - K));
+
+  ASSERT_EQ(size, this->builder_->length());
+  ASSERT_EQ(util::next_power2(size), this->builder_->capacity());
+
+  this->CheckNullable();
+  this->CheckNonNullable();
+}
+
+TYPED_TEST(TestPrimitiveBuilder, TestAdvance) {
+  int n = 1000;
+  ASSERT_OK(this->builder_->Init(n));
+
+  ASSERT_OK(this->builder_->Advance(100));
+  ASSERT_EQ(100, this->builder_->length());
+
+  ASSERT_OK(this->builder_->Advance(900));
+  ASSERT_RAISES(Invalid, this->builder_->Advance(1));
+}
+
+TYPED_TEST(TestPrimitiveBuilder, TestResize) {
+  DECL_T();
+
+  int cap = MIN_BUILDER_CAPACITY * 2;
+
+  ASSERT_OK(this->builder_->Resize(cap));
+  ASSERT_EQ(cap, this->builder_->capacity());
+
+  ASSERT_EQ(cap * sizeof(T), this->builder_->buffer()->size());
+  ASSERT_EQ(util::ceil_byte(cap) / 8, this->builder_->nulls()->size());
+}
+
+TYPED_TEST(TestPrimitiveBuilder, TestReserve) {
+  int n = 100;
+  ASSERT_OK(this->builder_->Reserve(n));
+  ASSERT_EQ(0, this->builder_->length());
+  ASSERT_EQ(MIN_BUILDER_CAPACITY, this->builder_->capacity());
+
+  ASSERT_OK(this->builder_->Advance(100));
+  ASSERT_OK(this->builder_->Reserve(MIN_BUILDER_CAPACITY));
+
+  ASSERT_EQ(util::next_power2(MIN_BUILDER_CAPACITY + 100),
+      this->builder_->capacity());
+}
+
+} // namespace arrow
diff --git a/cpp/src/arrow/types/primitive.cc b/cpp/src/arrow/types/primitive.cc
new file mode 100644
index 0000000000000..2612e8ca7fd4a
--- /dev/null
+++ b/cpp/src/arrow/types/primitive.cc
@@ -0,0 +1,50 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/types/primitive.h"
+
+#include <memory>
+
+#include "arrow/util/buffer.h"
+
+namespace arrow {
+
+// ----------------------------------------------------------------------
+// Primitive array base
+
+void PrimitiveArray::Init(const TypePtr& type, int64_t length,
+    const std::shared_ptr<Buffer>& data,
+    const std::shared_ptr<Buffer>& nulls) {
+  Array::Init(type, length, nulls);
+  data_ = data;
+  raw_data_ = data == nullptr ? nullptr : data_->data();
+}
+
+bool PrimitiveArray::Equals(const PrimitiveArray& other) const {
+  if (this == &other) return true;
+  if (type_->nullable != other.type_->nullable) return false;
+
+  bool equal_data = data_->Equals(*other.data_, length_);
+  if (type_->nullable) {
+    return equal_data &&
+      nulls_->Equals(*other.nulls_, util::ceil_byte(length_) / 8);
+  } else {
+    return equal_data;
+  }
+}
+
+} // namespace arrow
diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h
new file mode 100644
index 0000000000000..a41911224e05e
--- /dev/null
+++ b/cpp/src/arrow/types/primitive.h
@@ -0,0 +1,240 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef ARROW_TYPES_PRIMITIVE_H
+#define ARROW_TYPES_PRIMITIVE_H
+
+#include <cstdint>
+#include <cstring>
+#include <string>
+
+#include "arrow/array.h"
+#include "arrow/builder.h"
+#include "arrow/type.h"
+#include "arrow/util/bit-util.h"
+#include "arrow/util/buffer.h"
+#include "arrow/util/status.h"
+
+namespace arrow {
+
+template <typename Derived>
+struct PrimitiveType : public DataType {
+  explicit PrimitiveType(bool nullable = true)
+      : DataType(Derived::type_enum, nullable) {}
+
+  virtual std::string ToString() const {
+    return std::string(static_cast<const Derived*>(this)->name());
+  }
+};
+
+#define PRIMITIVE_DECL(TYPENAME, C_TYPE, ENUM, SIZE, NAME) \
+  typedef C_TYPE c_type; \
+  static constexpr TypeEnum type_enum = TypeEnum::ENUM; \
+  static constexpr int size = SIZE; \
+ \
+  explicit TYPENAME(bool nullable = true) \
+      : PrimitiveType<TYPENAME>(nullable) {} \
+ \
+  static const char* name() { \
+    return NAME; \
+  }
+
+
+// Base class for fixed-size logical types
+class PrimitiveArray : public Array {
+ public:
+  PrimitiveArray() : Array(), data_(nullptr), raw_data_(nullptr) {}
+
+  virtual ~PrimitiveArray() {}
+
+  void Init(const TypePtr& type, int64_t length, const std::shared_ptr<Buffer>& data,
+      const std::shared_ptr<Buffer>& nulls = nullptr);
+
+  const std::shared_ptr<Buffer>& data() const { return data_;}
+
+  bool Equals(const PrimitiveArray& other) const;
+
+ protected:
+  std::shared_ptr<Buffer> data_;
+  const uint8_t* raw_data_;
+};
+
+
+template <typename TypeClass>
+class PrimitiveArrayImpl : public PrimitiveArray {
+ public:
+  typedef typename TypeClass::c_type T;
+
+  PrimitiveArrayImpl() : PrimitiveArray() {}
+
+  PrimitiveArrayImpl(int64_t length, const std::shared_ptr<Buffer>& data,
+      const std::shared_ptr<Buffer>& nulls = nullptr) {
+    Init(length, data, nulls);
+  }
+
+  void Init(int64_t length, const std::shared_ptr<Buffer>& data,
+      const std::shared_ptr<Buffer>& nulls = nullptr) {
+    TypePtr type(new TypeClass(nulls != nullptr));
+    PrimitiveArray::Init(type, length, data, nulls);
+  }
+
+  bool Equals(const PrimitiveArrayImpl& other) const {
+    return PrimitiveArray::Equals(*static_cast<const PrimitiveArray*>(&other));
+  }
+
+  const T* raw_data() const { return reinterpret_cast<const T*>(raw_data_);}
+
+  T Value(int64_t i) const {
+    return raw_data()[i];
+  }
+
+  TypeClass* exact_type() const {
+    return static_cast<TypeClass*>(type_.get());
+  }
+};
+
+
+template <typename Type, typename ArrayType>
+class PrimitiveBuilder : public ArrayBuilder {
+ public:
+  typedef typename Type::c_type T;
+
+  explicit PrimitiveBuilder(const TypePtr& type)
+      : ArrayBuilder(type), values_(nullptr) {
+    elsize_ = sizeof(T);
+  }
+
+  virtual ~PrimitiveBuilder() {}
+
+  Status Resize(int64_t capacity) {
+    // XXX: Set floor size for now
+    if (capacity < MIN_BUILDER_CAPACITY) {
+      capacity = MIN_BUILDER_CAPACITY;
+    }
+
+    if (capacity_ == 0) {
+      RETURN_NOT_OK(Init(capacity));
+    } else {
+      RETURN_NOT_OK(ArrayBuilder::Resize(capacity));
+      RETURN_NOT_OK(values_->Resize(capacity * elsize_));
+      capacity_ = capacity;
+    }
+    return Status::OK();
+  }
+
+  Status Init(int64_t capacity) {
+    RETURN_NOT_OK(ArrayBuilder::Init(capacity));
+
+    values_ = std::make_shared<OwnedMutableBuffer>();
+    return values_->Resize(capacity * elsize_);
+  }
+
+  Status Reserve(int64_t elements) {
+    if (length_ + elements > capacity_) {
+      int64_t new_capacity = util::next_power2(length_ + elements);
+      return Resize(new_capacity);
+    }
+    return Status::OK();
+  }
+
+  Status Advance(int64_t elements) {
+    return ArrayBuilder::Advance(elements);
+  }
+
+  // Scalar append
+  Status Append(T val, bool is_null = false) {
+    if (length_ == capacity_) {
+      // If the capacity was not already a multiple of 2, do so here
+      RETURN_NOT_OK(Resize(util::next_power2(capacity_ + 1)));
+    }
+    if (nullable_) {
+      util::set_bit(null_bits_, length_, is_null);
+    }
+    raw_buffer()[length_++] = val;
+    return Status::OK();
+  }
+
+  // Vector append
+  //
+  // If passed, null_bytes is of equal length to values, and any nonzero byte
+  // will be considered as a null for that slot
+  Status Append(const T* values, int64_t length, uint8_t* null_bytes = nullptr) {
+    if (length_ + length > capacity_) {
+      int64_t new_capacity = util::next_power2(length_ + length);
+      RETURN_NOT_OK(Resize(new_capacity));
+    }
+    memcpy(raw_buffer() + length_, values, length * elsize_);
+
+    if (nullable_ && null_bytes != nullptr) {
+      // If null_bytes is all not null, then none of the values are null
+      for (int64_t i = 0; i < length; ++i) {
+        util::set_bit(null_bits_, length_ + i, static_cast<bool>(null_bytes[i]));
+      }
+    }
+
+    length_ += length;
+    return Status::OK();
+  }
+
+  Status AppendNull() {
+    if (!nullable_) {
+      return Status::Invalid("not nullable");
+    }
+    if (length_ == capacity_) {
+      // If the capacity was not already a multiple of 2, do so here
+      RETURN_NOT_OK(Resize(util::next_power2(capacity_ + 1)));
+    }
+    util::set_bit(null_bits_, length_++, true);
+    return Status::OK();
+  }
+
+  // Initialize an array type instance with the results of this builder
+  // Transfers ownership of all buffers
+  Status Transfer(PrimitiveArray* out) {
+    out->Init(type_, length_, values_, nulls_);
+    values_ = nulls_ = nullptr;
+    capacity_ = length_ = 0;
+    return Status::OK();
+  }
+
+  Status Transfer(ArrayType* out) {
+    return Transfer(static_cast<PrimitiveArray*>(out));
+  }
+
+  virtual Status ToArray(Array** out) {
+    ArrayType* result = new ArrayType();
+    RETURN_NOT_OK(Transfer(result));
+    *out = static_cast<Array*>(result);
+    return Status::OK();
+  }
+
+  T* raw_buffer() {
+    return reinterpret_cast<T*>(values_->mutable_data());
+  }
+
+  std::shared_ptr<Buffer> buffer() const {
+    return values_;
+  }
+
+ protected:
+  std::shared_ptr<OwnedMutableBuffer> values_;
+  int64_t elsize_;
+};
+
+} // namespace arrow
+
+#endif  // ARROW_TYPES_PRIMITIVE_H
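A short usage sketch of the scalar and vector append paths implemented above (hypothetical driver code, not part of the patch; it assumes Int32Builder and Int32Type are the concrete instantiations declared in arrow/types/integer.h):

    #include <cstdint>

    #include "arrow/types/integer.h"
    #include "arrow/util/status.h"

    // Sketch: append a scalar, a vector slice, and a null, then transfer
    // ownership of the accumulated buffers into a new Array.
    arrow::Status BuildInt32(arrow::Array** out) {
      using namespace arrow;
      Int32Builder builder(TypePtr(new Int32Type()));
      RETURN_NOT_OK(builder.Append(42));       // scalar append
      int32_t more[] = {1, 2, 3};
      RETURN_NOT_OK(builder.Append(more, 3));  // vector append
      RETURN_NOT_OK(builder.AppendNull());     // null slot
      return builder.ToArray(out);             // builder is reset afterwards
    }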
diff --git a/cpp/src/arrow/types/string-test.cc b/cpp/src/arrow/types/string-test.cc
new file mode 100644
index 0000000000000..6dba3fdcbb6aa
--- /dev/null
+++ b/cpp/src/arrow/types/string-test.cc
@@ -0,0 +1,242 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <gtest/gtest.h>
+
+#include <cstdint>
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "arrow/array.h"
+#include "arrow/builder.h"
+#include "arrow/test-util.h"
+#include "arrow/type.h"
+#include "arrow/types/construct.h"
+#include "arrow/types/integer.h"
+#include "arrow/types/string.h"
+#include "arrow/types/test-common.h"
+#include "arrow/util/status.h"
+
+using std::string;
+using std::unique_ptr;
+using std::vector;
+
+namespace arrow {
+
+
+TEST(TypesTest, TestCharType) {
+  CharType t1(5);
+
+  ASSERT_EQ(t1.type, TypeEnum::CHAR);
+  ASSERT_TRUE(t1.nullable);
+  ASSERT_EQ(t1.size, 5);
+
+  ASSERT_EQ(t1.ToString(), string("char(5)"));
+
+  // Test copy constructor
+  CharType t2 = t1;
+  ASSERT_EQ(t2.type, TypeEnum::CHAR);
+  ASSERT_TRUE(t2.nullable);
+  ASSERT_EQ(t2.size, 5);
+}
+
+
+TEST(TypesTest, TestVarcharType) {
+  VarcharType t1(5);
+
+  ASSERT_EQ(t1.type, TypeEnum::VARCHAR);
+  ASSERT_TRUE(t1.nullable);
+  ASSERT_EQ(t1.size, 5);
+  ASSERT_EQ(t1.physical_type.size, 6);
+
+  ASSERT_EQ(t1.ToString(), string("varchar(5)"));
+
+  // Test copy constructor
+  VarcharType t2 = t1;
+  ASSERT_EQ(t2.type, TypeEnum::VARCHAR);
+  ASSERT_TRUE(t2.nullable);
+  ASSERT_EQ(t2.size, 5);
+  ASSERT_EQ(t2.physical_type.size, 6);
+}
+
+TEST(TypesTest, TestStringType) {
+  StringType str;
+  StringType str_nn(false);
+
+  ASSERT_EQ(str.type, TypeEnum::STRING);
+  ASSERT_EQ(str.name(), string("string"));
+  ASSERT_TRUE(str.nullable);
+  ASSERT_FALSE(str_nn.nullable);
+}
+
+// ----------------------------------------------------------------------
+// String container
+
+class TestStringContainer : public ::testing::Test {
+ public:
+  void SetUp() {
+    chars_ = {'a', 'b', 'b', 'c', 'c', 'c'};
+    offsets_ = {0, 1, 1, 1, 3, 6};
+    nulls_ = {0, 0, 1, 0, 0};
+    expected_ = {"a", "", "", "bb", "ccc"};
+
+    MakeArray();
+  }
+
+  void MakeArray() {
+    length_ = offsets_.size() - 1;
+    int64_t nchars = chars_.size();
+
+    value_buf_ = to_buffer(chars_);
+    values_ = ArrayPtr(new UInt8Array(nchars, value_buf_));
+
+    offsets_buf_ = to_buffer(offsets_);
+
+    nulls_buf_ = bytes_to_null_buffer(nulls_.data(), nulls_.size());
+    strings_.Init(length_, offsets_buf_, values_, nulls_buf_);
+  }
+
+ protected:
+  vector<int32_t> offsets_;
+  vector<char> chars_;
+  vector<uint8_t> nulls_;
+
+  vector<string> expected_;
+
+  std::shared_ptr<Buffer> value_buf_;
+  std::shared_ptr<Buffer> offsets_buf_;
+  std::shared_ptr<Buffer> nulls_buf_;
+
+  int64_t length_;
+
+  ArrayPtr values_;
+  StringArray strings_;
+};
+
+
+TEST_F(TestStringContainer, TestArrayBasics) {
+  ASSERT_EQ(length_, strings_.length());
+  ASSERT_TRUE(strings_.nullable());
+}
+
+TEST_F(TestStringContainer, TestType) {
+  TypePtr type = strings_.type();
+
+  ASSERT_EQ(TypeEnum::STRING, type->type);
+  ASSERT_EQ(TypeEnum::STRING, strings_.type_enum());
+}
+
+
+TEST_F(TestStringContainer, TestListFunctions) {
+  int pos = 0;
+  for (size_t i = 0; i < expected_.size(); ++i) {
+    ASSERT_EQ(pos, strings_.value_offset(i));
+    ASSERT_EQ(expected_[i].size(), strings_.value_length(i));
+    pos += expected_[i].size();
+  }
+}
+
+
+TEST_F(TestStringContainer, TestDestructor) {
+  auto arr = std::make_shared<StringArray>(length_, offsets_buf_, values_, nulls_buf_);
+}
+
+TEST_F(TestStringContainer, TestGetString) {
+  for (size_t i = 0; i < expected_.size(); ++i) {
+    if (nulls_[i]) {
+      ASSERT_TRUE(strings_.IsNull(i));
+    } else {
+      ASSERT_EQ(expected_[i], strings_.GetString(i));
+    }
+  }
+}
+
+// ----------------------------------------------------------------------
+// String builder tests
+
+class TestStringBuilder : public TestBuilder {
+ public:
+  void SetUp() {
+    TestBuilder::SetUp();
+    type_ = TypePtr(new StringType());
+
+    ArrayBuilder* tmp;
+    ASSERT_OK(make_builder(type_, &tmp));
+    builder_.reset(static_cast<StringBuilder*>(tmp));
+  }
+
+  void Done() {
+    Array* out;
+    ASSERT_OK(builder_->ToArray(&out));
+    result_.reset(static_cast<StringArray*>(out));
+  }
+
+ protected:
+  TypePtr type_;
+
+  unique_ptr<StringBuilder> builder_;
+  unique_ptr<StringArray> result_;
+};
+
+TEST_F(TestStringBuilder, TestAttrs) {
+  ASSERT_FALSE(builder_->value_builder()->nullable());
+}
+
+TEST_F(TestStringBuilder, TestScalarAppend) {
+  vector<string> strings = {"a", "bb", "", "", "ccc"};
+  vector<uint8_t> is_null = {0, 0, 0, 1, 0};
+
+  int N = strings.size();
+  int reps = 1000;
+
+  for (int j = 0; j < reps; ++j) {
+    for (int i = 0; i < N; ++i) {
+      if (is_null[i]) {
+        builder_->AppendNull();
+      } else {
+        builder_->Append(strings[i]);
+      }
+    }
+  }
+  Done();
+
+  ASSERT_EQ(reps * N, result_->length());
+  ASSERT_EQ(reps * 6, result_->values()->length());
+
+  int64_t length;
+  int64_t pos = 0;
+  for (int i = 0; i < N * reps; ++i) {
+    if (is_null[i % N]) {
+      ASSERT_TRUE(result_->IsNull(i));
+    } else {
+      ASSERT_FALSE(result_->IsNull(i));
+      result_->GetValue(i, &length);
+      ASSERT_EQ(pos, result_->offset(i));
+      ASSERT_EQ(strings[i % N].size(), length);
+      ASSERT_EQ(strings[i % N], result_->GetString(i));
+
+      pos += length;
+    }
+  }
+}
+
+TEST_F(TestStringBuilder, TestZeroLength) {
+  // All buffers are null
+  Done();
+}
+
+} // namespace arrow
diff --git a/cpp/src/arrow/types/string.cc b/cpp/src/arrow/types/string.cc
new file mode 100644
index 0000000000000..f3dfbdc50f7a4
--- /dev/null
+++ b/cpp/src/arrow/types/string.cc
@@ -0,0 +1,40 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/types/string.h"
+
+#include <sstream>
+#include <string>
+
+namespace arrow {
+
+std::string CharType::ToString() const {
+  std::stringstream s;
+  s << "char(" << size << ")";
+  return s.str();
+}
+
+
+std::string VarcharType::ToString() const {
+  std::stringstream s;
+  s << "varchar(" << size << ")";
+  return s.str();
+}
+
+TypePtr StringBuilder::value_type_ = TypePtr(new UInt8Type(false));
+
+} // namespace arrow
diff --git a/cpp/src/arrow/types/string.h b/cpp/src/arrow/types/string.h
new file mode 100644
index 0000000000000..30d6e247db1ad
--- /dev/null
+++ b/cpp/src/arrow/types/string.h
@@ -0,0 +1,181 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef ARROW_TYPES_STRING_H
+#define ARROW_TYPES_STRING_H
+
+#include <cstdint>
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "arrow/array.h"
+#include "arrow/type.h"
+#include "arrow/types/integer.h"
+#include "arrow/types/list.h"
+#include "arrow/util/buffer.h"
+#include "arrow/util/status.h"
+
+namespace arrow {
+
+class ArrayBuilder;
+
+struct CharType : public DataType {
+  int size;
+
+  BytesType physical_type;
+
+  explicit CharType(int size, bool nullable = true)
+      : DataType(TypeEnum::CHAR, nullable),
+        size(size),
+        physical_type(BytesType(size)) {}
+
+  CharType(const CharType& other)
+      : CharType(other.size, other.nullable) {}
+
+  virtual std::string ToString() const;
+};
+
+
+// Variable-length, null-terminated strings, up to a certain length
+struct VarcharType : public DataType {
+  int size;
+
+  BytesType physical_type;
+
+  explicit VarcharType(int size, bool nullable = true)
+      : DataType(TypeEnum::VARCHAR, nullable),
+        size(size),
+        physical_type(BytesType(size + 1)) {}
+  VarcharType(const VarcharType& other)
+      : VarcharType(other.size, other.nullable) {}
+
+  virtual std::string ToString() const;
+};
+
+static const LayoutPtr byte1(new BytesType(1));
+static const LayoutPtr physical_string = LayoutPtr(new ListLayoutType(byte1));
+
+// String is a logical type consisting of a physical list of 1-byte values
+struct StringType : public DataType {
+  explicit StringType(bool nullable = true)
+      : DataType(TypeEnum::STRING, nullable) {}
+
+  StringType(const StringType& other)
+      : StringType(other.nullable) {}
+
+  const LayoutPtr& physical_type() {
+    return physical_string;
+  }
+
+  static char const *name() {
+    return "string";
+  }
+
+  virtual std::string ToString() const {
+    return name();
+  }
+};
+
+
+// TODO: add a BinaryArray layer in between
+class StringArray : public ListArray {
+ public:
+  StringArray() : ListArray(), bytes_(nullptr), raw_bytes_(nullptr) {}
+
+  StringArray(int64_t length, const std::shared_ptr<Buffer>& offsets,
+      const ArrayPtr& values,
+      const std::shared_ptr<Buffer>& nulls = nullptr) {
+    Init(length, offsets, values, nulls);
+  }
+
+  void Init(const TypePtr& type, int64_t length,
+      const std::shared_ptr<Buffer>& offsets,
+      const ArrayPtr& values,
+      const std::shared_ptr<Buffer>& nulls = nullptr) {
+    ListArray::Init(type, length, offsets, values, nulls);
+
+    // TODO: type validation for values array
+
+    // For convenience
+    bytes_ = static_cast<UInt8Array*>(values.get());
+    raw_bytes_ = bytes_->raw_data();
+  }
+
+  void Init(int64_t length, const std::shared_ptr<Buffer>& offsets,
+      const ArrayPtr& values,
+      const std::shared_ptr<Buffer>& nulls = nullptr) {
+    TypePtr type(new StringType(nulls != nullptr));
+    Init(type, length, offsets, values, nulls);
+  }
+
+  // Compute the pointer to the value at slot i; the length in bytes is
+  // returned through out_length
+  const uint8_t* GetValue(int64_t i, int64_t* out_length) const {
+    int32_t pos = offsets_[i];
+    *out_length = offsets_[i + 1] - pos;
+    return raw_bytes_ + pos;
+  }
+
+  // Construct a std::string
+  std::string GetString(int64_t i) const {
+    int64_t nchars;
+    const uint8_t* str = GetValue(i, &nchars);
+    return std::string(reinterpret_cast<const char*>(str), nchars);
+  }
+
+ private:
+  UInt8Array* bytes_;
+  const uint8_t* raw_bytes_;
+};
+
+// Array builder
+
+class StringBuilder : public ListBuilder {
+ public:
+  explicit StringBuilder(const TypePtr& type) :
+      ListBuilder(type, static_cast<ArrayBuilder*>(new UInt8Builder(value_type_))) {
+    byte_builder_ = static_cast<UInt8Builder*>(value_builder_.get());
+  }
+
+  Status Append(const std::string& value) {
+    RETURN_NOT_OK(ListBuilder::Append());
+    return byte_builder_->Append(reinterpret_cast<const uint8_t*>(value.c_str()),
+        value.size());
+  }
+
+  Status Append(const uint8_t* value, int64_t length);
+  Status Append(const std::vector<std::string>& values,
+                uint8_t* null_bytes);
+
+  virtual Status ToArray(Array** out) {
+    StringArray* result = new StringArray();
+    RETURN_NOT_OK(ListBuilder::Transfer(result));
+    *out = static_cast<Array*>(result);
+    return Status::OK();
+  }
+
+ protected:
+  UInt8Builder* byte_builder_;
+
+  static TypePtr value_type_;
+};
+
+} // namespace arrow
+
+#endif  // ARROW_TYPES_STRING_H
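StringBuilder simply drives the ListBuilder/UInt8Builder pair: each string append delimits a new list slot and copies the bytes into the shared byte child. A brief usage sketch (hypothetical driver code, not part of the patch, assuming the headers above build):

    #include <string>

    #include "arrow/types/string.h"
    #include "arrow/util/status.h"

    // Sketch: build the string array ["hello", null, "arrow"].
    arrow::Status BuildStrings(arrow::Array** out) {
      using namespace arrow;
      StringBuilder builder(TypePtr(new StringType()));
      RETURN_NOT_OK(builder.Append(std::string("hello")));
      RETURN_NOT_OK(builder.AppendNull());  // inherited from ListBuilder
      RETURN_NOT_OK(builder.Append(std::string("arrow")));
      return builder.ToArray(out);
    }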
diff --git a/cpp/src/arrow/types/struct-test.cc b/cpp/src/arrow/types/struct-test.cc
new file mode 100644
index 0000000000000..644b5457d5851
--- /dev/null
+++ b/cpp/src/arrow/types/struct-test.cc
@@ -0,0 +1,61 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <gtest/gtest.h>
+
+#include <string>
+#include <vector>
+
+#include "arrow/field.h"
+#include "arrow/type.h"
+#include "arrow/types/integer.h"
+#include "arrow/types/string.h"
+#include "arrow/types/struct.h"
+
+using std::string;
+using std::vector;
+
+namespace arrow {
+
+TEST(TestStructType, Basics) {
+  TypePtr f0_type = TypePtr(new Int32Type());
+  Field f0("f0", f0_type);
+
+  TypePtr f1_type = TypePtr(new StringType());
+  Field f1("f1", f1_type);
+
+  TypePtr f2_type = TypePtr(new UInt8Type());
+  Field f2("f2", f2_type);
+
+  vector<Field> fields = {f0, f1, f2};
+
+  StructType struct_type(fields, true);
+  StructType struct_type_nn(fields, false);
+
+  ASSERT_TRUE(struct_type.nullable);
+  ASSERT_FALSE(struct_type_nn.nullable);
+
+  ASSERT_TRUE(struct_type.field(0).Equals(f0));
+  ASSERT_TRUE(struct_type.field(1).Equals(f1));
+  ASSERT_TRUE(struct_type.field(2).Equals(f2));
+
+  ASSERT_EQ(struct_type.ToString(), "struct<f0: int32, f1: string, f2: uint8>");
+
+  // TODO: out of bounds for field(...)
+}
+
+} // namespace arrow
diff --git a/cpp/src/arrow/types/struct.cc b/cpp/src/arrow/types/struct.cc
new file mode 100644
index 0000000000000..b7be5d8245f1d
--- /dev/null
+++ b/cpp/src/arrow/types/struct.cc
@@ -0,0 +1,38 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/types/struct.h"
+
+#include <sstream>
+#include <string>
+#include <vector>
+
+namespace arrow {
+
+std::string StructType::ToString() const {
+  std::stringstream s;
+  s << "struct<";
+  for (size_t i = 0; i < fields_.size(); ++i) {
+    if (i > 0) s << ", ";
+    const Field& field = fields_[i];
+    s << field.name << ": " << field.type->ToString();
+  }
+  s << ">";
+  return s.str();
+}
+
+} // namespace arrow
diff --git a/cpp/src/arrow/types/struct.h b/cpp/src/arrow/types/struct.h
new file mode 100644
index 0000000000000..7d8885b830dba
--- /dev/null
+++ b/cpp/src/arrow/types/struct.h
@@ -0,0 +1,51 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef ARROW_TYPES_STRUCT_H
+#define ARROW_TYPES_STRUCT_H
+
+#include <string>
+#include <vector>
+
+#include "arrow/field.h"
+#include "arrow/type.h"
+
+namespace arrow {
+
+struct StructType : public DataType {
+  std::vector<Field> fields_;
+
+  StructType(const std::vector<Field>& fields,
+      bool nullable = true)
+      : DataType(TypeEnum::STRUCT, nullable) {
+    fields_ = fields;
+  }
+
+  const Field& field(int i) const {
+    return fields_[i];
+  }
+
+  int num_children() const {
+    return fields_.size();
+  }
+
+  virtual std::string ToString() const;
+};
+
+} // namespace arrow
+
+#endif  // ARROW_TYPES_STRUCT_H
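A compact sketch of assembling a struct type and rendering it, matching the ToString format implemented in struct.cc above (hypothetical driver code, not part of the patch):

    #include <string>
    #include <vector>

    #include "arrow/field.h"
    #include "arrow/types/integer.h"
    #include "arrow/types/string.h"
    #include "arrow/types/struct.h"

    // Sketch: struct<f0: int32, f1: string>, rendered via StructType::ToString.
    int main() {
      using namespace arrow;
      std::vector<Field> fields = {
          Field("f0", TypePtr(new Int32Type())),
          Field("f1", TypePtr(new StringType()))};
      StructType type(fields);
      return type.ToString() == "struct<f0: int32, f1: string>" ? 0 : 1;
    }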
diff --git a/cpp/src/arrow/types/test-common.h b/cpp/src/arrow/types/test-common.h
new file mode 100644
index 0000000000000..267e48a7f25c9
--- /dev/null
+++ b/cpp/src/arrow/types/test-common.h
@@ -0,0 +1,50 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef ARROW_TYPES_TEST_COMMON_H
+#define ARROW_TYPES_TEST_COMMON_H
+
+#include <gtest/gtest.h>
+
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "arrow/test-util.h"
+#include "arrow/type.h"
+
+using std::unique_ptr;
+
+namespace arrow {
+
+class TestBuilder : public ::testing::Test {
+ public:
+  void SetUp() {
+    type_ = TypePtr(new UInt8Type());
+    type_nn_ = TypePtr(new UInt8Type(false));
+    builder_.reset(new UInt8Builder(type_));
+    builder_nn_.reset(new UInt8Builder(type_nn_));
+  }
+ protected:
+  TypePtr type_;
+  TypePtr type_nn_;
+  unique_ptr<ArrayBuilder> builder_;
+  unique_ptr<ArrayBuilder> builder_nn_;
+};
+
+} // namespace arrow
+
+#endif  // ARROW_TYPES_TEST_COMMON_H
diff --git a/cpp/src/arrow/types/union.cc b/cpp/src/arrow/types/union.cc
new file mode 100644
index 0000000000000..54f41a7eef6be
--- /dev/null
+++ b/cpp/src/arrow/types/union.cc
@@ -0,0 +1,49 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "arrow/types/union.h"
+
+#include <sstream>
+#include <string>
+#include <vector>
+
+#include "arrow/type.h"
+
+namespace arrow {
+
+static inline std::string format_union(const std::vector<TypePtr>& child_types) {
+  std::stringstream s;
+  s << "union<";
+  for (size_t i = 0; i < child_types.size(); ++i) {
+    if (i) s << ", ";
+    s << child_types[i]->ToString();
+  }
+  s << ">";
+  return s.str();
+}
+
+std::string DenseUnionType::ToString() const {
+  return format_union(child_types_);
+}
+
+
+std::string SparseUnionType::ToString() const {
+  return format_union(child_types_);
+}
+
+
+} // namespace arrow
diff --git a/cpp/src/arrow/types/union.h b/cpp/src/arrow/types/union.h
new file mode 100644
index 0000000000000..7b66c3b88bf3c
--- /dev/null
+++ b/cpp/src/arrow/types/union.h
@@ -0,0 +1,86 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef ARROW_TYPES_UNION_H
+#define ARROW_TYPES_UNION_H
+
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "arrow/array.h"
+#include "arrow/type.h"
+#include "arrow/types/collection.h"
+
+namespace arrow {
+
+class Buffer;
+
+struct DenseUnionType : public CollectionType<TypeEnum::DENSE_UNION> {
+  typedef CollectionType<TypeEnum::DENSE_UNION> Base;
+
+  DenseUnionType(const std::vector<TypePtr>& child_types,
+      bool nullable = true)
+      : Base(nullable) {
+    child_types_ = child_types;
+  }
+
+  virtual std::string ToString() const;
+};
+
+
+struct SparseUnionType : public CollectionType<TypeEnum::SPARSE_UNION> {
+  typedef CollectionType<TypeEnum::SPARSE_UNION> Base;
+
+  SparseUnionType(const std::vector<TypePtr>& child_types,
+      bool nullable = true)
+      : Base(nullable) {
+    child_types_ = child_types;
+  }
+
+  virtual std::string ToString() const;
+};
+
+
+class UnionArray : public Array {
+ public:
+  UnionArray() : Array() {}
+
+ protected:
+  // The data are types encoded as int16
+  Buffer* types_;
+  std::vector<std::shared_ptr<Array> > children_;
+};
+
+
+class DenseUnionArray : public UnionArray {
+ public:
+  DenseUnionArray() : UnionArray() {}
+
+ protected:
+  Buffer* offset_buf_;
+};
+
+
+class SparseUnionArray : public UnionArray {
+ public:
+  SparseUnionArray() : UnionArray() {}
+};
+
+} // namespace arrow
+
+#endif  // ARROW_TYPES_UNION_H
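The dense and sparse variants declared above differ only in whether children are packed. A standalone sketch of the two layouts for the values (int32) 5, (double) 2.5, (int32) 7 (plain standard C++, illustrative only):

    #include <cstdint>
    #include <vector>

    // Sketch: dense children hold only their own values and are indexed
    // through an offsets array; sparse children are full union length, with
    // slots belonging to other types left undefined.
    int main() {
      std::vector<int16_t> type_ids = {0, 1, 0};     // 0 = int32, 1 = double
      // Dense layout
      std::vector<int32_t> dense_ints = {5, 7};
      std::vector<double> dense_doubles = {2.5};
      std::vector<int32_t> offsets = {0, 0, 1};      // index within the child
      // Sparse layout (no offsets needed)
      std::vector<int32_t> sparse_ints = {5, 0, 7};  // slot 1 unused
      std::vector<double> sparse_doubles = {0.0, 2.5, 0.0};
      return 0;
    }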
diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt
new file mode 100644
index 0000000000000..88e3f7a656d90
--- /dev/null
+++ b/cpp/src/arrow/util/CMakeLists.txt
@@ -0,0 +1,81 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+#######################################
+# arrow_util
+#######################################
+
+set(UTIL_SRCS
+  bit-util.cc
+  buffer.cc
+  status.cc
+)
+
+set(UTIL_LIBS
+  rt)
+
+add_library(arrow_util STATIC
+  ${UTIL_SRCS}
+)
+target_link_libraries(arrow_util ${UTIL_LIBS})
+SET_TARGET_PROPERTIES(arrow_util PROPERTIES LINKER_LANGUAGE CXX)
+
+# Headers: top level
+install(FILES
+  bit-util.h
+  buffer.h
+  macros.h
+  status.h
+  DESTINATION include/arrow/util)
+
+#######################################
+# arrow_test_util
+#######################################
+
+add_library(arrow_test_util)
+target_link_libraries(arrow_test_util
+  arrow_util)
+
+SET_TARGET_PROPERTIES(arrow_test_util PROPERTIES LINKER_LANGUAGE CXX)
+
+#######################################
+# arrow_test_main
+#######################################
+
+add_library(arrow_test_main
+  test_main.cc)
+
+if (APPLE)
+  target_link_libraries(arrow_test_main
+    gtest
+    arrow_util
+    arrow_test_util
+    dl)
+  set_target_properties(arrow_test_main
+    PROPERTIES LINK_FLAGS "-undefined dynamic_lookup")
+else()
+  target_link_libraries(arrow_test_main
+    gtest
+    arrow_util
+    arrow_test_util
+    pthread
+    dl
+  )
+endif()
+
+ADD_ARROW_TEST(bit-util-test)
+ADD_ARROW_TEST(buffer-test)
diff --git a/cpp/src/arrow/util/bit-util-test.cc b/cpp/src/arrow/util/bit-util-test.cc
new file mode 100644
index 0000000000000..7506ca5b5531c
--- /dev/null
+++ b/cpp/src/arrow/util/bit-util-test.cc
@@ -0,0 +1,44 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <gtest/gtest.h>
+
+#include "arrow/util/bit-util.h"
+
+namespace arrow {
+
+TEST(UtilTests, TestNextPower2) {
+  using util::next_power2;
+
+  ASSERT_EQ(8, next_power2(6));
+  ASSERT_EQ(8, next_power2(8));
+
+  ASSERT_EQ(1, next_power2(1));
+  ASSERT_EQ(256, next_power2(131));
+
+  ASSERT_EQ(1024, next_power2(1000));
+
+  ASSERT_EQ(4096, next_power2(4000));
+
+  ASSERT_EQ(65536, next_power2(64000));
+
+  ASSERT_EQ(1LL << 32, next_power2((1LL << 32) - 1));
+  ASSERT_EQ(1LL << 31, next_power2((1LL << 31) - 1));
+  ASSERT_EQ(1LL << 62, next_power2((1LL << 62) - 1));
+}
+
+} // namespace arrow
diff --git a/cpp/src/arrow/util/bit-util.cc b/cpp/src/arrow/util/bit-util.cc
new file mode 100644
index 0000000000000..d2ddd6584a88c
--- /dev/null
+++ b/cpp/src/arrow/util/bit-util.cc
@@ -0,0 +1,46 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include <cstring>
+
+#include "arrow/util/bit-util.h"
+#include "arrow/util/buffer.h"
+#include "arrow/util/status.h"
+
+namespace arrow {
+
+void util::bytes_to_bits(uint8_t* bytes, int length, uint8_t* bits) {
+  for (int i = 0; i < length; ++i) {
+    set_bit(bits, i, static_cast<bool>(bytes[i]));
+  }
+}
+
+Status util::bytes_to_bits(uint8_t* bytes, int length,
+    std::shared_ptr<Buffer>* out) {
+  int bit_length = ceil_byte(length) / 8;
+
+  auto buffer = std::make_shared<OwnedMutableBuffer>();
+  RETURN_NOT_OK(buffer->Resize(bit_length));
+  memset(buffer->mutable_data(), 0, bit_length);
+  bytes_to_bits(bytes, length, buffer->mutable_data());
+
+  *out = buffer;
+
+  return Status::OK();
+}
+
+} // namespace arrow
diff --git a/cpp/src/arrow/util/bit-util.h b/cpp/src/arrow/util/bit-util.h
new file mode 100644
index 0000000000000..61dffa30423b1
--- /dev/null
+++ b/cpp/src/arrow/util/bit-util.h
@@ -0,0 +1,68 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef ARROW_UTIL_BIT_UTIL_H
+#define ARROW_UTIL_BIT_UTIL_H
+
+#include <cstdint>
+#include <memory>
+#include <vector>
+
+#include "arrow/util/buffer.h"
+
+namespace arrow {
+
+class Status;
+
+namespace util {
+
+static inline int64_t ceil_byte(int64_t size) {
+  return (size + 7) & ~7;
+}
+
+static inline int64_t ceil_2bytes(int64_t size) {
+  return (size + 15) & ~15;
+}
+
+static inline bool get_bit(const uint8_t* bits, int i) {
+  return bits[i / 8] & (1 << (i % 8));
+}
+
+static inline void set_bit(uint8_t* bits, int i, bool is_set) {
+  bits[i / 8] |= (1 << (i % 8)) * is_set;
+}
+
+static inline int64_t next_power2(int64_t n) {
+  n--;
+  n |= n >> 1;
+  n |= n >> 2;
+  n |= n >> 4;
+  n |= n >> 8;
+  n |= n >> 16;
+  n |= n >> 32;
+  n++;
+  return n;
+}
+
+void bytes_to_bits(uint8_t* bytes, int length, uint8_t* bits);
+Status bytes_to_bits(uint8_t*, int, std::shared_ptr<Buffer>*);
+
+} // namespace util
+
+} // namespace arrow
+
+#endif  // ARROW_UTIL_BIT_UTIL_H
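The bitmap helpers above pack one validity bit per slot, eight slots per byte, least-significant bit first. A small standalone demonstration (assuming only the bit-util.h header from this patch):

    #include <cassert>
    #include <cstdint>

    #include "arrow/util/bit-util.h"

    // Sketch: set_bit/get_bit address bit i as byte i/8, mask 1 << (i % 8).
    int main() {
      using namespace arrow;
      uint8_t bits[2] = {0, 0};
      util::set_bit(bits, 3, true);
      util::set_bit(bits, 9, true);
      assert(util::get_bit(bits, 3));
      assert(bits[0] == 0x08);  // bit 3 -> byte 0, mask 1 << 3
      assert(bits[1] == 0x02);  // bit 9 -> byte 1, mask 1 << 1
      assert(util::next_power2(1000) == 1024);
      return 0;
    }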
The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include +#include +#include +#include +#include +#include + +#include "arrow/test-util.h" +#include "arrow/util/buffer.h" +#include "arrow/util/status.h" + +using std::string; + +namespace arrow { + +class TestBuffer : public ::testing::Test { +}; + +TEST_F(TestBuffer, Resize) { + OwnedMutableBuffer buf; + + ASSERT_EQ(0, buf.size()); + ASSERT_OK(buf.Resize(100)); + ASSERT_EQ(100, buf.size()); + ASSERT_OK(buf.Resize(200)); + ASSERT_EQ(200, buf.size()); + + // Make it smaller, too + ASSERT_OK(buf.Resize(50)); + ASSERT_EQ(50, buf.size()); +} + +TEST_F(TestBuffer, ResizeOOM) { + // realloc fails, even though there may be no explicit limit + OwnedMutableBuffer buf; + ASSERT_OK(buf.Resize(100)); + int64_t to_alloc = std::numeric_limits::max(); + ASSERT_RAISES(OutOfMemory, buf.Resize(to_alloc)); +} + +} // namespace arrow diff --git a/cpp/src/arrow/util/buffer.cc b/cpp/src/arrow/util/buffer.cc new file mode 100644 index 0000000000000..2fb34d59e0b78 --- /dev/null +++ b/cpp/src/arrow/util/buffer.cc @@ -0,0 +1,53 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/util/buffer.h" + +#include + +#include "arrow/util/status.h" + +namespace arrow { + +Buffer::Buffer(const std::shared_ptr& parent, int64_t offset, + int64_t size) { + data_ = parent->data() + offset; + size_ = size; + parent_ = parent; +} + +std::shared_ptr MutableBuffer::GetImmutableView() { + return std::make_shared(this->get_shared_ptr(), 0, size()); +} + +OwnedMutableBuffer::OwnedMutableBuffer() : + MutableBuffer(nullptr, 0) {} + +Status OwnedMutableBuffer::Resize(int64_t new_size) { + size_ = new_size; + try { + buffer_owner_.resize(new_size); + } catch (const std::bad_alloc& e) { + return Status::OutOfMemory("resize failed"); + } + data_ = buffer_owner_.data(); + mutable_data_ = buffer_owner_.data(); + + return Status::OK(); +} + +} // namespace arrow diff --git a/cpp/src/arrow/util/buffer.h b/cpp/src/arrow/util/buffer.h new file mode 100644 index 0000000000000..3e4183936b33d --- /dev/null +++ b/cpp/src/arrow/util/buffer.h @@ -0,0 +1,133 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. 
+// See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef ARROW_UTIL_BUFFER_H
+#define ARROW_UTIL_BUFFER_H
+
+#include <algorithm>
+#include <cstdint>
+#include <cstring>
+#include <memory>
+#include <vector>
+
+#include "arrow/util/macros.h"
+
+namespace arrow {
+
+class Status;
+
+// ----------------------------------------------------------------------
+// Buffer classes
+
+// Immutable API for a chunk of bytes which may or may not be owned by the
+// class instance
+class Buffer : public std::enable_shared_from_this<Buffer> {
+ public:
+  Buffer(const uint8_t* data, int64_t size) :
+      data_(data),
+      size_(size) {}
+
+  // An offset into data that is owned by another buffer, but we want to be
+  // able to retain a valid pointer to it even after other shared_ptr's to the
+  // parent buffer have been destroyed. For example,
+  // std::make_shared<Buffer>(parent, 16, 64) views bytes [16, 80) of parent
+  // and keeps the parent memory alive through parent_.
+  Buffer(const std::shared_ptr<Buffer>& parent, int64_t offset, int64_t size);
+
+  std::shared_ptr<Buffer> get_shared_ptr() {
+    return shared_from_this();
+  }
+
+  // Return true if both buffers are the same size and contain the same bytes
+  // up to the number of compared bytes
+  bool Equals(const Buffer& other, int64_t nbytes) const {
+    return this == &other ||
+      (size_ >= nbytes && other.size_ >= nbytes &&
+        !memcmp(data_, other.data_, nbytes));
+  }
+
+  bool Equals(const Buffer& other) const {
+    return this == &other ||
+      (size_ == other.size_ && !memcmp(data_, other.data_, size_));
+  }
+
+  const uint8_t* data() const {
+    return data_;
+  }
+
+  int64_t size() const {
+    return size_;
+  }
+
+  // Returns true if this Buffer is referencing memory (possibly) owned by some
+  // other buffer
+  bool is_shared() const {
+    return static_cast<bool>(parent_);
+  }
+
+  const std::shared_ptr<Buffer> parent() const {
+    return parent_;
+  }
+
+ protected:
+  const uint8_t* data_;
+  int64_t size_;
+
+  // nullptr by default, but may be set
+  std::shared_ptr<Buffer> parent_;
+
+ private:
+  DISALLOW_COPY_AND_ASSIGN(Buffer);
+};
+
+// A Buffer whose contents can be mutated. May or may not own its data.
+class MutableBuffer : public Buffer {
+ public:
+  MutableBuffer(uint8_t* data, int64_t size) :
+      Buffer(data, size) {
+    mutable_data_ = data;
+  }
+
+  uint8_t* mutable_data() {
+    return mutable_data_;
+  }
+
+  // Get a read-only view of this buffer
+  std::shared_ptr<Buffer> GetImmutableView();
+
+ protected:
+  MutableBuffer() :
+      Buffer(nullptr, 0),
+      mutable_data_(nullptr) {}
+
+  uint8_t* mutable_data_;
+};
+
+// A MutableBuffer whose memory is owned by the class instance. For example,
+// for reading data out of files that you want to deallocate when this class is
+// garbage-collected
+class OwnedMutableBuffer : public MutableBuffer {
+ public:
+  OwnedMutableBuffer();
+  Status Resize(int64_t new_size);
+
+ private:
+  // TODO: aligned allocations
+  std::vector<uint8_t> buffer_owner_;
+};
+
+} // namespace arrow
+
+#endif // ARROW_UTIL_BUFFER_H
diff --git a/cpp/src/arrow/util/macros.h b/cpp/src/arrow/util/macros.h
new file mode 100644
index 0000000000000..069e627c90eaa
--- /dev/null
+++ b/cpp/src/arrow/util/macros.h
@@ -0,0 +1,26 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef ARROW_UTIL_MACROS_H
+#define ARROW_UTIL_MACROS_H
+
+// From Google gutil
+#define DISALLOW_COPY_AND_ASSIGN(TypeName) \
+  TypeName(const TypeName&) = delete;      \
+  void operator=(const TypeName&) = delete
+
+#endif // ARROW_UTIL_MACROS_H
diff --git a/cpp/src/arrow/util/random.h b/cpp/src/arrow/util/random.h
new file mode 100644
index 0000000000000..64c197ef080fd
--- /dev/null
+++ b/cpp/src/arrow/util/random.h
@@ -0,0 +1,128 @@
+// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file. See the AUTHORS file for names of contributors.
+
+// Moved from Kudu http://github.com/cloudera/kudu
+
+#ifndef ARROW_UTIL_RANDOM_H_
+#define ARROW_UTIL_RANDOM_H_
+
+#include <math.h>
+
+#include <stdint.h>
+
+namespace arrow {
+
+namespace random_internal {
+
+static const uint32_t M = 2147483647L;   // 2^31-1
+const double kTwoPi = 6.283185307179586476925286;
+
+}  // namespace random_internal
+
+// A very simple random number generator. Not especially good at
+// generating truly random bits, but good enough for our needs in this
+// package. This implementation is not thread-safe.
+class Random {
+ public:
+  explicit Random(uint32_t s) : seed_(s & 0x7fffffffu) {
+    // Avoid bad seeds.
+    if (seed_ == 0 || seed_ == random_internal::M) {
+      seed_ = 1;
+    }
+  }
+
+  // Next pseudo-random 32-bit unsigned integer.
+  // FIXME: This currently only generates 31 bits of randomness.
+  // The MSB will always be zero.
+  uint32_t Next() {
+    static const uint64_t A = 16807;  // bits 14, 8, 7, 5, 2, 1, 0
+    // We are computing
+    //       seed_ = (seed_ * A) % M,    where M = 2^31-1
+    //
+    // seed_ must not be zero or M, or else all subsequent computed values
+    // will be zero or M respectively. For all other values, seed_ will end
+    // up cycling through every number in [1,M-1]
+    uint64_t product = seed_ * A;
+
+    // Compute (product % M) using the fact that ((x << 31) % M) == x.
+    seed_ = static_cast<uint32_t>((product >> 31) + (product & random_internal::M));
+    // The first reduction may overflow by 1 bit, so we may need to
+    // repeat.
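+    // (Why the fold works: 2^31 is congruent to 1 modulo M = 2^31 - 1, so
+    // adding the high bits back onto the low bits computes the product
+    // modulo M, up to at most one extra subtraction of M.)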
mod == M is not possible; using > allows the faster + // sign-bit-based test. + if (seed_ > random_internal::M) { + seed_ -= random_internal::M; + } + return seed_; + } + + // Alias for consistency with Next64 + uint32_t Next32() { return Next(); } + + // Next pseudo-random 64-bit unsigned integer. + // FIXME: This currently only generates 62 bits of randomness due to Next() + // only giving 31 bits of randomness. The 2 most significant bits will always + // be zero. + uint64_t Next64() { + uint64_t large = Next(); + // Only shift by 31 bits so we end up with zeros in MSB and not scattered + // throughout the 64-bit word. This is due to the weakness in Next() noted + // above. + large <<= 31; + large |= Next(); + return large; + } + + // Returns a uniformly distributed value in the range [0..n-1] + // REQUIRES: n > 0 + uint32_t Uniform(uint32_t n) { return Next() % n; } + + // Alias for consistency with Uniform64 + uint32_t Uniform32(uint32_t n) { return Uniform(n); } + + // Returns a uniformly distributed 64-bit value in the range [0..n-1] + // REQUIRES: n > 0 + uint64_t Uniform64(uint64_t n) { return Next64() % n; } + + // Randomly returns true ~"1/n" of the time, and false otherwise. + // REQUIRES: n > 0 + bool OneIn(int n) { return (Next() % n) == 0; } + + // Skewed: pick "base" uniformly from range [0,max_log] and then + // return "base" random bits. The effect is to pick a number in the + // range [0,2^max_log-1] with exponential bias towards smaller numbers. + uint32_t Skewed(int max_log) { + return Uniform(1 << Uniform(max_log + 1)); + } + + // Creates a normal distribution variable using the + // Box-Muller transform. See: + // http://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform + // Adapted from WebRTC source code at: + // webrtc/trunk/modules/video_coding/main/test/test_util.cc + double Normal(double mean, double std_dev) { + double uniform1 = (Next() + 1.0) / (random_internal::M + 1.0); + double uniform2 = (Next() + 1.0) / (random_internal::M + 1.0); + return (mean + std_dev * sqrt(-2 * ::log(uniform1)) * + cos(random_internal::kTwoPi * uniform2)); + } + + // Return a random number between 0.0 and 1.0 inclusive. + double NextDoubleFraction() { + return Next() / static_cast(random_internal::M + 1.0); + } + + private: + uint32_t seed_; +}; + + +uint32_t random_seed() { + // TODO: use system time to get a reasonably random seed + return 0; +} + + +} // namespace arrow + +#endif // ARROW_UTIL_RANDOM_H_ diff --git a/cpp/src/arrow/util/status.cc b/cpp/src/arrow/util/status.cc new file mode 100644 index 0000000000000..c64b8a3d5f80a --- /dev/null +++ b/cpp/src/arrow/util/status.cc @@ -0,0 +1,38 @@ +// Copyright (c) 2011 The LevelDB Authors. All rights reserved. +// Use of this source code is governed by a BSD-style license that can be +// found in the LICENSE file. See the AUTHORS file for names of contributors. +// +// A Status encapsulates the result of an operation. It may indicate success, +// or it may indicate an error with an associated error message. +// +// Multiple threads can invoke const methods on a Status without +// external synchronization, but if any of the threads may call a +// non-const method, all threads accessing the same Status must use +// external synchronization. 
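As an aside on random.h above, a minimal usage sketch of the generator, using only members defined in that header (the seed value is arbitrary, and this assumes a single translation unit, since random.h defines random_seed() at namespace scope without inline):

```
#include <cstdint>
#include <cstdio>

#include "arrow/util/random.h"

int main() {
  arrow::Random rng(42);                // any seed; 0 is remapped to 1
  uint32_t roll = rng.Uniform(6) + 1;   // uniformly distributed in [1, 6]
  double gauss = rng.Normal(0.0, 1.0);  // Box-Muller normal variate
  std::printf("roll=%u gauss=%f\n", roll, gauss);
  return 0;
}
```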
+
+#include "arrow/util/status.h"
+
+#include <assert.h>
+
+namespace arrow {
+
+Status::Status(StatusCode code, const std::string& msg, int16_t posix_code) {
+  assert(code != StatusCode::OK);
+  const uint32_t size = msg.size();
+  char* result = new char[size + 7];
+  memcpy(result, &size, sizeof(size));
+  result[4] = static_cast<char>(code);
+  memcpy(result + 5, &posix_code, sizeof(posix_code));
+  memcpy(result + 7, msg.c_str(), msg.size());
+  state_ = result;
+}
+
+const char* Status::CopyState(const char* state) {
+  uint32_t size;
+  memcpy(&size, state, sizeof(size));
+  char* result = new char[size + 7];
+  memcpy(result, state, size + 7);
+  return result;
+}
+
+} // namespace arrow
diff --git a/cpp/src/arrow/util/status.h b/cpp/src/arrow/util/status.h
new file mode 100644
index 0000000000000..47fda40db2596
--- /dev/null
+++ b/cpp/src/arrow/util/status.h
@@ -0,0 +1,152 @@
+// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file. See the AUTHORS file for names of contributors.
+//
+// A Status encapsulates the result of an operation. It may indicate success,
+// or it may indicate an error with an associated error message.
+//
+// Multiple threads can invoke const methods on a Status without
+// external synchronization, but if any of the threads may call a
+// non-const method, all threads accessing the same Status must use
+// external synchronization.
+
+// Adapted from Kudu github.com/cloudera/kudu
+
+#ifndef ARROW_STATUS_H_
+#define ARROW_STATUS_H_
+
+#include <cstdint>
+#include <cstring>
+#include <string>
+
+// Return the given status if it is not OK.
+#define ARROW_RETURN_NOT_OK(s) do { \
+    ::arrow::Status _s = (s); \
+    if (!_s.ok()) return _s; \
+  } while (0);
+
+// Return the given status if it is not OK, but first clone it and
+// prepend the given message.
+#define ARROW_RETURN_NOT_OK_PREPEND(s, msg) do { \
+    ::arrow::Status _s = (s); \
+    if (::gutil::PREDICT_FALSE(!_s.ok())) return _s.CloneAndPrepend(msg); \
+  } while (0);
+
+// Return 'to_return' if 'to_call' returns a bad status.
+// The substitution for 'to_return' may reference the variable
+// 's' for the bad status.
+#define ARROW_RETURN_NOT_OK_RET(to_call, to_return) do { \
+    ::arrow::Status s = (to_call); \
+    if (::gutil::PREDICT_FALSE(!s.ok())) return (to_return); \
+  } while (0);
+
+// If 'to_call' returns a bad status, CHECK immediately with a logged message
+// of 'msg' followed by the status.
+#define ARROW_CHECK_OK_PREPEND(to_call, msg) do { \
+::arrow::Status _s = (to_call); \
+ARROW_CHECK(_s.ok()) << (msg) << ": " << _s.ToString(); \
+} while (0);
+
+// If the status is bad, CHECK immediately, appending the status to the
+// logged message.
+#define ARROW_CHECK_OK(s) ARROW_CHECK_OK_PREPEND(s, "Bad status")
+
+namespace arrow {
+
+#define RETURN_NOT_OK(s) do { \
+    Status _s = (s); \
+    if (!_s.ok()) return _s; \
+  } while (0);
+
+enum class StatusCode: char {
+  OK = 0,
+  OutOfMemory = 1,
+  KeyError = 2,
+  Invalid = 3,
+
+  NotImplemented = 10,
+};
+
+class Status {
+ public:
+  // Create a success status.
+  Status() : state_(NULL) { }
+  ~Status() { delete[] state_; }
+
+  // Copy the specified status.
+  Status(const Status& s);
+  void operator=(const Status& s);
+
+  // Return a success status.
+  static Status OK() { return Status(); }
+
+  // Return error status of an appropriate type.
+  static Status OutOfMemory(const std::string& msg, int16_t posix_code = -1) {
+    return Status(StatusCode::OutOfMemory, msg, posix_code);
+  }
+
+  static Status KeyError(const std::string& msg) {
+    return Status(StatusCode::KeyError, msg, -1);
+  }
+
+  static Status NotImplemented(const std::string& msg) {
+    return Status(StatusCode::NotImplemented, msg, -1);
+  }
+
+  static Status Invalid(const std::string& msg) {
+    return Status(StatusCode::Invalid, msg, -1);
+  }
+
+  // Returns true iff the status indicates success.
+  bool ok() const { return (state_ == NULL); }
+
+  bool IsOutOfMemory() const { return code() == StatusCode::OutOfMemory; }
+  bool IsKeyError() const { return code() == StatusCode::KeyError; }
+  bool IsInvalid() const { return code() == StatusCode::Invalid; }
+
+  // Return a string representation of this status suitable for printing.
+  // Returns the string "OK" for success.
+  std::string ToString() const;
+
+  // Return a string representation of the status code, without the message
+  // text or posix code information.
+  std::string CodeAsString() const;
+
+  // Get the POSIX code associated with this Status, or -1 if there is none.
+  int16_t posix_code() const;
+
+ private:
+  // OK status has a NULL state_. Otherwise, state_ is a new[] array
+  // of the following form:
+  //    state_[0..3] == length of message
+  //    state_[4]    == code
+  //    state_[5..6] == posix_code
+  //    state_[7..]  == message
+  const char* state_;
+
+  StatusCode code() const {
+    return ((state_ == NULL) ?
+        StatusCode::OK : static_cast<StatusCode>(state_[4]));
+  }
+
+  Status(StatusCode code, const std::string& msg, int16_t posix_code);
+  static const char* CopyState(const char* s);
+};
+
+inline Status::Status(const Status& s) {
+  state_ = (s.state_ == NULL) ? NULL : CopyState(s.state_);
+}
+
+inline void Status::operator=(const Status& s) {
+  // The following condition catches both aliasing (when this == &s),
+  // and the common case where both s and *this are ok.
+  if (state_ != s.state_) {
+    delete[] state_;
+    state_ = (s.state_ == NULL) ? NULL : CopyState(s.state_);
+  }
+}
+
+} // namespace arrow
+
+
+#endif // ARROW_STATUS_H_
diff --git a/cpp/src/arrow/util/test_main.cc b/cpp/src/arrow/util/test_main.cc
new file mode 100644
index 0000000000000..00139f36742ed
--- /dev/null
+++ b/cpp/src/arrow/util/test_main.cc
@@ -0,0 +1,26 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
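With status.h in place, error handling is value-based: functions return a Status and callers unwrap it with RETURN_NOT_OK. A minimal sketch of that calling pattern, using only the factories and macro shown above (the helper names here are invented for illustration):

```
#include <cstdint>
#include <cstdio>

#include "arrow/util/status.h"

namespace arrow {

// Hypothetical validation helper: fails fast on bad input.
Status CheckPositive(int64_t value) {
  if (value <= 0) {
    return Status::Invalid("value must be positive");
  }
  return Status::OK();
}

Status DoWork(int64_t value) {
  RETURN_NOT_OK(CheckPositive(value));  // early-returns any non-OK status
  // ... real work would go here ...
  return Status::OK();
}

}  // namespace arrow

int main() {
  arrow::Status s = arrow::DoWork(-1);
  std::printf("%s\n", s.IsInvalid() ? "invalid input" : "ok");
  return 0;
}
```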
+ +#include + +int main(int argc, char **argv) { + ::testing::InitGoogleTest(&argc, argv); + + int ret = RUN_ALL_TESTS(); + + return ret; +} diff --git a/cpp/thirdparty/build_thirdparty.sh b/cpp/thirdparty/build_thirdparty.sh new file mode 100755 index 0000000000000..46794def400eb --- /dev/null +++ b/cpp/thirdparty/build_thirdparty.sh @@ -0,0 +1,62 @@ +#!/bin/bash + +set -x +set -e +TP_DIR=$(cd "$(dirname "$BASH_SOURCE")"; pwd) + +source $TP_DIR/versions.sh +PREFIX=$TP_DIR/installed + +################################################################################ + +if [ "$#" = "0" ]; then + F_ALL=1 +else + # Allow passing specific libs to build on the command line + for arg in "$*"; do + case $arg in + "gtest") F_GTEST=1 ;; + *) echo "Unknown module: $arg"; exit 1 ;; + esac + done +fi + +################################################################################ + +# Determine how many parallel jobs to use for make based on the number of cores +if [[ "$OSTYPE" =~ ^linux ]]; then + PARALLEL=$(grep -c processor /proc/cpuinfo) +elif [[ "$OSTYPE" == "darwin"* ]]; then + PARALLEL=$(sysctl -n hw.ncpu) +else + echo Unsupported platform $OSTYPE + exit 1 +fi + +mkdir -p "$PREFIX/include" +mkdir -p "$PREFIX/lib" + +# On some systems, autotools installs libraries to lib64 rather than lib. Fix +# this by setting up lib64 as a symlink to lib. We have to do this step first +# to handle cases where one third-party library depends on another. +ln -sf lib "$PREFIX/lib64" + +# use the compiled tools +export PATH=$PREFIX/bin:$PATH + + +# build googletest +if [ -n "$F_ALL" -o -n "$F_GTEST" ]; then + cd $TP_DIR/$GTEST_BASEDIR + + if [[ "$OSTYPE" == "darwin"* ]]; then + CXXFLAGS=-fPIC cmake -DCMAKE_CXX_FLAGS="-std=c++11 -stdlib=libc++ -DGTEST_USE_OWN_TR1_TUPLE=1 -Wno-unused-value -Wno-ignored-attributes" + else + CXXFLAGS=-fPIC cmake . + fi + + make VERBOSE=1 +fi + +echo "---------------------" +echo "Thirdparty dependencies built and installed into $PREFIX successfully" diff --git a/cpp/thirdparty/download_thirdparty.sh b/cpp/thirdparty/download_thirdparty.sh new file mode 100755 index 0000000000000..8ffb22a93f7e2 --- /dev/null +++ b/cpp/thirdparty/download_thirdparty.sh @@ -0,0 +1,20 @@ +#!/bin/bash + +set -x +set -e + +TP_DIR=$(cd "$(dirname "$BASH_SOURCE")"; pwd) + +source $TP_DIR/versions.sh + +download_extract_and_cleanup() { + filename=$TP_DIR/$(basename "$1") + curl -#LC - "$1" -o $filename + tar xzf $filename -C $TP_DIR + rm $filename +} + +if [ ! -d ${GTEST_BASEDIR} ]; then + echo "Fetching gtest" + download_extract_and_cleanup $GTEST_URL +fi diff --git a/cpp/thirdparty/versions.sh b/cpp/thirdparty/versions.sh new file mode 100755 index 0000000000000..12ad56ef00103 --- /dev/null +++ b/cpp/thirdparty/versions.sh @@ -0,0 +1,3 @@ +GTEST_VERSION=1.7.0 +GTEST_URL="https://github.com/google/googletest/archive/release-${GTEST_VERSION}.tar.gz" +GTEST_BASEDIR=googletest-release-$GTEST_VERSION From 7e76e3aee92122f39702241db2d0eaea86fd3e8c Mon Sep 17 00:00:00 2001 From: proflin Date: Fri, 19 Feb 2016 23:07:17 +0800 Subject: [PATCH 0006/1644] ARROW-5: Update drill-fmpp-maven-plugin to 1.5.0 This closes #1. 
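Before moving on to the Java changes, one more note on the C++ utilities above: the bitmap helpers in cpp/src/arrow/util/bit-util.h compose as below. This is a sketch using only functions defined in that header; the wrapper name is invented for illustration.

```
#include <cstdint>
#include <memory>

#include "arrow/util/bit-util.h"
#include "arrow/util/buffer.h"
#include "arrow/util/status.h"

namespace arrow {

// Packs one-byte-per-value validity flags into an LSB-first bitmap.
Status MakeValidityBitmap(std::shared_ptr<Buffer>* out) {
  uint8_t is_valid[] = {1, 0, 1, 1};
  // Yields a 1-byte, zero-initialized buffer holding 0b00001101.
  return util::bytes_to_bits(is_valid, 4, out);
}

}  // namespace arrow
```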
--- java/pom.xml | 2 -- java/vector/pom.xml | 2 +- 2 files changed, 1 insertion(+), 3 deletions(-) diff --git a/java/pom.xml b/java/pom.xml index 8a3b192e13e40..4ee4ff4f7604e 100644 --- a/java/pom.xml +++ b/java/pom.xml @@ -36,8 +36,6 @@ 2 2.7.1 2.7.1 - 0.9.15 - 2.3.21 diff --git a/java/vector/pom.xml b/java/vector/pom.xml index e693344221b9a..1fef81b7eba2a 100644 --- a/java/vector/pom.xml +++ b/java/vector/pom.xml @@ -106,7 +106,7 @@ org.apache.drill.tools drill-fmpp-maven-plugin - 1.4.0 + 1.5.0 generate-fmpp From e9cc8ce390a1ab28bf71ce6eeb66c915140e2cb9 Mon Sep 17 00:00:00 2001 From: Jacques Nadeau Date: Fri, 19 Feb 2016 18:42:35 -0800 Subject: [PATCH 0007/1644] ARROW-5: Correct Apache Maven repo for maven plugin use --- java/vector/pom.xml | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/java/vector/pom.xml b/java/vector/pom.xml index 1fef81b7eba2a..df5389261ba57 100644 --- a/java/vector/pom.xml +++ b/java/vector/pom.xml @@ -60,7 +60,7 @@ apache apache - https://repo.maven.apache.org/ + https://repo.maven.apache.org/maven2/ true From e6905effbb9383afd2423a4f86cf9a33ca680b9d Mon Sep 17 00:00:00 2001 From: proflin Date: Sat, 20 Feb 2016 15:50:45 +0800 Subject: [PATCH 0008/1644] ARROW-9: Replace straggler references to Drill - Renaming drill to arrow for TestBaseAllocator - Fix ArrowBuffer as ArrowBuf - Replace Drill with Arrow for ValueHolder This closes #2. --- .../main/java/io/netty/buffer/ArrowBuf.java | 36 +-- .../io/netty/buffer/ExpandableByteBuf.java | 2 +- .../netty/buffer/PooledByteBufAllocatorL.java | 6 +- .../buffer/UnsafeDirectLittleEndian.java | 4 +- .../arrow/memory/AllocationManager.java | 34 +-- ...ocator.java => ArrowByteBufAllocator.java} | 10 +- .../apache/arrow/memory/BaseAllocator.java | 22 +- .../apache/arrow/memory/BufferAllocator.java | 4 +- .../apache/arrow/memory/BufferManager.java | 2 +- .../org/apache/arrow/memory/package-info.java | 2 +- .../arrow/memory/TestBaseAllocator.java | 232 +++++++++--------- .../main/codegen/templates/ListWriters.java | 2 +- .../complex/AbstractContainerVector.java | 2 +- .../vector/complex/AbstractMapVector.java | 2 +- .../arrow/vector/holders/ValueHolder.java | 4 +- .../vector/util/ByteFunctionHelpers.java | 16 +- .../arrow/vector/util/DecimalUtility.java | 16 +- 17 files changed, 198 insertions(+), 198 deletions(-) rename java/memory/src/main/java/org/apache/arrow/memory/{DrillByteBufAllocator.java => ArrowByteBufAllocator.java} (92%) diff --git a/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java b/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java index f033ba6538e83..bbec26aa85c74 100644 --- a/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java +++ b/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java @@ -56,7 +56,7 @@ public final class ArrowBuf extends AbstractByteBuf implements AutoCloseable { private final boolean isEmpty; private volatile int length; private final HistoricalLog historicalLog = BaseAllocator.DEBUG ? - new HistoricalLog(BaseAllocator.DEBUG_LOG_LENGTH, "DrillBuf[%d]", id) : null; + new HistoricalLog(BaseAllocator.DEBUG_LOG_LENGTH, "ArrowBuf[%d]", id) : null; public ArrowBuf( final AtomicInteger refCnt, @@ -155,18 +155,18 @@ private void ensure(int width) { } /** - * Create a new DrillBuf that is associated with an alternative allocator for the purposes of memory ownership and - * accounting. 
This has no impact on the reference counting for the current DrillBuf except in the situation where the + * Create a new ArrowBuf that is associated with an alternative allocator for the purposes of memory ownership and + * accounting. This has no impact on the reference counting for the current ArrowBuf except in the situation where the * passed in Allocator is the same as the current buffer. * - * This operation has no impact on the reference count of this DrillBuf. The newly created DrillBuf with either have a + * This operation has no impact on the reference count of this ArrowBuf. The newly created ArrowBuf with either have a * reference count of 1 (in the case that this is the first time this memory is being associated with the new * allocator) or the current value of the reference count + 1 for the other AllocationManager/BufferLedger combination * in the case that the provided allocator already had an association to this underlying memory. * * @param target * The target allocator to create an association with. - * @return A new DrillBuf which shares the same underlying memory as this DrillBuf. + * @return A new ArrowBuf which shares the same underlying memory as this ArrowBuf. */ public ArrowBuf retain(BufferAllocator target) { @@ -178,17 +178,17 @@ public ArrowBuf retain(BufferAllocator target) { historicalLog.recordEvent("retain(%s)", target.getName()); } final BufferLedger otherLedger = this.ledger.getLedgerForAllocator(target); - return otherLedger.newDrillBuf(offset, length, null); + return otherLedger.newArrowBuf(offset, length, null); } /** - * Transfer the memory accounting ownership of this DrillBuf to another allocator. This will generate a new DrillBuf - * that carries an association with the underlying memory of this DrillBuf. If this DrillBuf is connected to the + * Transfer the memory accounting ownership of this ArrowBuf to another allocator. This will generate a new ArrowBuf + * that carries an association with the underlying memory of this ArrowBuf. If this ArrowBuf is connected to the * owning BufferLedger of this memory, that memory ownership/accounting will be transferred to the taret allocator. If - * this DrillBuf does not currently own the memory underlying it (and is only associated with it), this does not - * transfer any ownership to the newly created DrillBuf. + * this ArrowBuf does not currently own the memory underlying it (and is only associated with it), this does not + * transfer any ownership to the newly created ArrowBuf. * - * This operation has no impact on the reference count of this DrillBuf. The newly created DrillBuf with either have a + * This operation has no impact on the reference count of this ArrowBuf. The newly created ArrowBuf with either have a * reference count of 1 (in the case that this is the first time this memory is being associated with the new * allocator) or the current value of the reference count for the other AllocationManager/BufferLedger combination in * the case that the provided allocator already had an association to this underlying memory. @@ -203,7 +203,7 @@ public ArrowBuf retain(BufferAllocator target) { * @param target * The allocator to transfer ownership to. * @return A new transfer result with the impact of the transfer (whether it was overlimit) as well as the newly - * created DrillBuf. + * created ArrowBuf. 
*/ public TransferResult transferOwnership(BufferAllocator target) { @@ -212,7 +212,7 @@ public TransferResult transferOwnership(BufferAllocator target) { } final BufferLedger otherLedger = this.ledger.getLedgerForAllocator(target); - final ArrowBuf newBuf = otherLedger.newDrillBuf(offset, length, null); + final ArrowBuf newBuf = otherLedger.newArrowBuf(offset, length, null); final boolean allocationFit = this.ledger.transferBalance(otherLedger); return new TransferResult(allocationFit, newBuf); } @@ -267,7 +267,7 @@ public boolean release(int decrement) { if (refCnt < 0) { throw new IllegalStateException( - String.format("DrillBuf[%d] refCnt has gone negative. Buffer Info: %s", id, toVerboseString())); + String.format("ArrowBuf[%d] refCnt has gone negative. Buffer Info: %s", id, toVerboseString())); } return refCnt == 0; @@ -370,7 +370,7 @@ public ArrowBuf slice(int index, int length) { * Re the behavior of reference counting, see http://netty.io/wiki/reference-counted-objects.html#wiki-h3-5, which * explains that derived buffers share their reference count with their parent */ - final ArrowBuf newBuf = ledger.newDrillBuf(offset + index, length); + final ArrowBuf newBuf = ledger.newArrowBuf(offset + index, length); newBuf.writerIndex(length); return newBuf; } @@ -437,7 +437,7 @@ public long memoryAddress() { @Override public String toString() { - return String.format("DrillBuf[%d], udle: [%d %d..%d]", id, udle.id, offset, offset + capacity()); + return String.format("ArrowBuf[%d], udle: [%d %d..%d]", id, udle.id, offset, offset + capacity()); } @Override @@ -782,7 +782,7 @@ public void close() { } /** - * Returns the possible memory consumed by this DrillBuf in the worse case scenario. (not shared, connected to larger + * Returns the possible memory consumed by this ArrowBuf in the worse case scenario. (not shared, connected to larger * underlying buffer of allocated memory) * * @return Size in bytes. @@ -833,7 +833,7 @@ public String toHexString(final int start, final int length) { } /** - * Get the integer id assigned to this DrillBuf for debugging purposes. + * Get the integer id assigned to this ArrowBuf for debugging purposes. * * @return integer id */ diff --git a/java/memory/src/main/java/io/netty/buffer/ExpandableByteBuf.java b/java/memory/src/main/java/io/netty/buffer/ExpandableByteBuf.java index 59886474923f3..7fb884daa3952 100644 --- a/java/memory/src/main/java/io/netty/buffer/ExpandableByteBuf.java +++ b/java/memory/src/main/java/io/netty/buffer/ExpandableByteBuf.java @@ -20,7 +20,7 @@ import org.apache.arrow.memory.BufferAllocator; /** - * Allows us to decorate DrillBuf to make it expandable so that we can use them in the context of the Netty framework + * Allows us to decorate ArrowBuf to make it expandable so that we can use them in the context of the Netty framework * (thus supporting RPC level memory accounting). */ public class ExpandableByteBuf extends MutableWrappedByteBuf { diff --git a/java/memory/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java b/java/memory/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java index 1610028df9de3..0b6e3f7f8392d 100644 --- a/java/memory/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java +++ b/java/memory/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java @@ -32,7 +32,7 @@ import com.codahale.metrics.MetricRegistry; /** - * The base allocator that we use for all of Drill's memory management. Returns UnsafeDirectLittleEndian buffers. 
+ * The base allocator that we use for all of Arrow's memory management. Returns UnsafeDirectLittleEndian buffers. */ public class PooledByteBufAllocatorL { private static final org.slf4j.Logger memoryLogger = org.slf4j.LoggerFactory.getLogger("drill.allocator"); @@ -184,7 +184,7 @@ private UnsafeDirectLittleEndian newDirectBufferL(int initialCapacity, int maxCa private UnsupportedOperationException fail() { return new UnsupportedOperationException( - "Drill requries that the JVM used supports access sun.misc.Unsafe. This platform didn't provide that functionality."); + "Arrow requries that the JVM used supports access sun.misc.Unsafe. This platform didn't provide that functionality."); } public UnsafeDirectLittleEndian directBuffer(int initialCapacity, int maxCapacity) { @@ -197,7 +197,7 @@ public UnsafeDirectLittleEndian directBuffer(int initialCapacity, int maxCapacit @Override public ByteBuf heapBuffer(int initialCapacity, int maxCapacity) { - throw new UnsupportedOperationException("Drill doesn't support using heap buffers."); + throw new UnsupportedOperationException("Arrow doesn't support using heap buffers."); } diff --git a/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java b/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java index 6495d5d371e76..a94c6d1988399 100644 --- a/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java +++ b/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java @@ -24,7 +24,7 @@ import java.util.concurrent.atomic.AtomicLong; /** - * The underlying class we use for little-endian access to memory. Is used underneath DrillBufs to abstract away the + * The underlying class we use for little-endian access to memory. Is used underneath ArrowBufs to abstract away the * Netty classes and underlying Netty memory management. */ public final class UnsafeDirectLittleEndian extends WrappedByteBuf { @@ -55,7 +55,7 @@ public final class UnsafeDirectLittleEndian extends WrappedByteBuf { private UnsafeDirectLittleEndian(AbstractByteBuf buf, boolean fake, AtomicLong bufferCount, AtomicLong bufferSize) { super(buf); if (!NATIVE_ORDER || buf.order() != ByteOrder.BIG_ENDIAN) { - throw new IllegalStateException("Drill only runs on LittleEndian systems."); + throw new IllegalStateException("Arrow only runs on LittleEndian systems."); } this.bufferCount = bufferCount; diff --git a/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java b/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java index 0db61443266c6..37d1d34a62005 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java @@ -41,7 +41,7 @@ * This class is also responsible for managing when memory is allocated and returned to the Netty-based * PooledByteBufAllocatorL. * - * The only reason that this isn't package private is we're forced to put DrillBuf in Netty's package which need access + * The only reason that this isn't package private is we're forced to put ArrowBuf in Netty's package which need access * to these objects or methods. * * Threading: AllocationManager manages thread-safety internally. Operations within the context of a single BufferLedger @@ -185,8 +185,8 @@ public void release() { /** * The reference manager that binds an allocator manager to a particular BaseAllocator. Also responsible for creating - * a set of DrillBufs that share a common fate and set of reference counts. 
- * As with AllocationManager, the only reason this is public is due to DrillBuf being in io.netty.buffer package. + * a set of ArrowBufs that share a common fate and set of reference counts. + * As with AllocationManager, the only reason this is public is due to ArrowBuf being in io.netty.buffer package. */ public class BufferLedger { @@ -322,7 +322,7 @@ public int decrement(int decrement) { /** * Returns the ledger associated with a particular BufferAllocator. If the BufferAllocator doesn't currently have a * ledger associated with this AllocationManager, a new one is created. This is placed on BufferLedger rather than - * AllocationManager directly because DrillBufs don't have access to AllocationManager and they are the ones + * AllocationManager directly because ArrowBufs don't have access to AllocationManager and they are the ones * responsible for exposing the ability to associate multiple allocators with a particular piece of underlying * memory. Note that this will increment the reference count of this ledger by one to ensure the ledger isn't * destroyed before use. @@ -335,32 +335,32 @@ public BufferLedger getLedgerForAllocator(BufferAllocator allocator) { } /** - * Create a new DrillBuf associated with this AllocationManager and memory. Does not impact reference count. + * Create a new ArrowBuf associated with this AllocationManager and memory. Does not impact reference count. * Typically used for slicing. * @param offset - * The offset in bytes to start this new DrillBuf. + * The offset in bytes to start this new ArrowBuf. * @param length - * The length in bytes that this DrillBuf will provide access to. - * @return A new DrillBuf that shares references with all DrillBufs associated with this BufferLedger + * The length in bytes that this ArrowBuf will provide access to. + * @return A new ArrowBuf that shares references with all ArrowBufs associated with this BufferLedger */ - public ArrowBuf newDrillBuf(int offset, int length) { + public ArrowBuf newArrowBuf(int offset, int length) { allocator.assertOpen(); - return newDrillBuf(offset, length, null); + return newArrowBuf(offset, length, null); } /** - * Create a new DrillBuf associated with this AllocationManager and memory. + * Create a new ArrowBuf associated with this AllocationManager and memory. * @param offset - * The offset in bytes to start this new DrillBuf. + * The offset in bytes to start this new ArrowBuf. * @param length - * The length in bytes that this DrillBuf will provide access to. + * The length in bytes that this ArrowBuf will provide access to. * @param manager - * An optional BufferManager argument that can be used to manage expansion of this DrillBuf + * An optional BufferManager argument that can be used to manage expansion of this ArrowBuf * @param retain * Whether or not the newly created buffer should get an additional reference count added to it. 
- * @return A new DrillBuf that shares references with all DrillBufs associated with this BufferLedger + * @return A new ArrowBuf that shares references with all ArrowBufs associated with this BufferLedger */ - public ArrowBuf newDrillBuf(int offset, int length, BufferManager manager) { + public ArrowBuf newArrowBuf(int offset, int length, BufferManager manager) { allocator.assertOpen(); final ArrowBuf buf = new ArrowBuf( @@ -375,7 +375,7 @@ public ArrowBuf newDrillBuf(int offset, int length, BufferManager manager) { if (BaseAllocator.DEBUG) { historicalLog.recordEvent( - "DrillBuf(BufferLedger, BufferAllocator[%s], UnsafeDirectLittleEndian[identityHashCode == " + "ArrowBuf(BufferLedger, BufferAllocator[%s], UnsafeDirectLittleEndian[identityHashCode == " + "%d](%s)) => ledger hc == %d", allocator.name, System.identityHashCode(buf), buf.toString(), System.identityHashCode(this)); diff --git a/java/memory/src/main/java/org/apache/arrow/memory/DrillByteBufAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/ArrowByteBufAllocator.java similarity index 92% rename from java/memory/src/main/java/org/apache/arrow/memory/DrillByteBufAllocator.java rename to java/memory/src/main/java/org/apache/arrow/memory/ArrowByteBufAllocator.java index 23d644841e13f..f3f72fa57c33a 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/DrillByteBufAllocator.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/ArrowByteBufAllocator.java @@ -23,19 +23,19 @@ import io.netty.buffer.ExpandableByteBuf; /** - * An implementation of ByteBufAllocator that wraps a Drill BufferAllocator. This allows the RPC layer to be accounted - * and managed using Drill's BufferAllocator infrastructure. The only thin different from a typical BufferAllocator is + * An implementation of ByteBufAllocator that wraps a Arrow BufferAllocator. This allows the RPC layer to be accounted + * and managed using Arrow's BufferAllocator infrastructure. The only thin different from a typical BufferAllocator is * the signature and the fact that this Allocator returns ExpandableByteBufs which enable otherwise non-expandable - * DrillBufs to be expandable. + * ArrowBufs to be expandable. 
*/ -public class DrillByteBufAllocator implements ByteBufAllocator { +public class ArrowByteBufAllocator implements ByteBufAllocator { private static final int DEFAULT_BUFFER_SIZE = 4096; private static final int DEFAULT_MAX_COMPOSITE_COMPONENTS = 16; private final BufferAllocator allocator; - public DrillByteBufAllocator(BufferAllocator allocator) { + public ArrowByteBufAllocator(BufferAllocator allocator) { this.allocator = allocator; } diff --git a/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java index 72f77ab0c7bc2..90257bb9ffbf7 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java @@ -82,7 +82,7 @@ protected BaseAllocator( this.parentAllocator = parentAllocator; this.name = name; - this.thisAsByteBufAllocator = new DrillByteBufAllocator(this); + this.thisAsByteBufAllocator = new ArrowByteBufAllocator(this); if (DEBUG) { childAllocators = new IdentityHashMap<>(); @@ -236,7 +236,7 @@ private ArrowBuf bufferWithoutReservation(final int size, BufferManager bufferMa final AllocationManager manager = new AllocationManager(this, size); final BufferLedger ledger = manager.associate(this); // +1 ref cnt (required) - final ArrowBuf buffer = ledger.newDrillBuf(0, size, bufferManager); + final ArrowBuf buffer = ledger.newArrowBuf(0, size, bufferManager); // make sure that our allocation is equal to what we expected. Preconditions.checkArgument(buffer.capacity() == size, @@ -314,9 +314,9 @@ public ArrowBuf allocateBuffer() { Preconditions.checkState(!closed, "Attempt to allocate after closed"); Preconditions.checkState(!used, "Attempt to allocate more than once"); - final ArrowBuf drillBuf = allocate(nBytes); + final ArrowBuf arrowBuf = allocate(nBytes); used = true; - return drillBuf; + return arrowBuf; } public int getSize() { @@ -397,13 +397,13 @@ private ArrowBuf allocate(int nBytes) { * as well, so we need to return the same number back to avoid double-counting them. */ try { - final ArrowBuf drillBuf = BaseAllocator.this.bufferWithoutReservation(nBytes, null); + final ArrowBuf arrowBuf = BaseAllocator.this.bufferWithoutReservation(nBytes, null); if (DEBUG) { - historicalLog.recordEvent("allocate() => %s", String.format("DrillBuf[%d]", drillBuf.getId())); + historicalLog.recordEvent("allocate() => %s", String.format("ArrowBuf[%d]", arrowBuf.getId())); } success = true; - return drillBuf; + return arrowBuf; } finally { if (!success) { releaseBytes(nBytes); @@ -565,7 +565,7 @@ void verifyAllocator() { * Verifies the accounting state of the allocator. Only works for DEBUG. * *

- * This overload is used for recursive calls, allowing for checking that DrillBufs are unique across all allocators + * This overload is used for recursive calls, allowing for checking that ArrowBufs are unique across all allocators * that are checked. *

* @@ -594,7 +594,7 @@ private void verifyAllocator(final IdentityHashMap T typeify(ValueVector v, Class clazz) { if (clazz.isAssignableFrom(v.getClass())) { return (T) v; } - throw new IllegalStateException(String.format("Vector requested [%s] was different than type stored [%s]. Drill doesn't yet support hetergenous types.", clazz.getSimpleName(), v.getClass().getSimpleName())); + throw new IllegalStateException(String.format("Vector requested [%s] was different than type stored [%s]. Arrow doesn't yet support hetergenous types.", clazz.getSimpleName(), v.getClass().getSimpleName())); } MajorType getLastPathType() { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java index d4189b2314a6a..de6ae829b476d 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java @@ -137,7 +137,7 @@ public T addOrGet(String name, MajorType type, Class } return vector; } - final String message = "Drill does not support schema change yet. Existing[%s] and desired[%s] vector types mismatch"; + final String message = "Arrow does not support schema change yet. Existing[%s] and desired[%s] vector types mismatch"; throw new IllegalStateException(String.format(message, existing.getClass().getSimpleName(), clazz.getSimpleName())); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/holders/ValueHolder.java b/java/vector/src/main/java/org/apache/arrow/vector/holders/ValueHolder.java index 88cbcd4a8c308..16777c806ec2d 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/holders/ValueHolder.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/holders/ValueHolder.java @@ -18,10 +18,10 @@ package org.apache.arrow.vector.holders; /** - * Wrapper object for an individual value in Drill. + * Wrapper object for an individual value in Arrow. * * ValueHolders are designed to be mutable wrapper objects for defining clean - * APIs that access data in Drill. For performance, object creation is avoided + * APIs that access data in Arrow. For performance, object creation is avoided * at all costs throughout execution. 
For this reason, ValueHolders are * disallowed from implementing any methods, this allows for them to be * replaced by their java primitive inner members during optimization of diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java b/java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java index 2bdfd70b22956..b6dd13a06a82d 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java @@ -29,12 +29,12 @@ public class ByteFunctionHelpers { static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ByteFunctionHelpers.class); /** - * Helper function to check for equality of bytes in two DrillBuffers + * Helper function to check for equality of bytes in two ArrowBufs * - * @param left Left DrillBuf for comparison + * @param left Left ArrowBuf for comparison * @param lStart start offset in the buffer * @param lEnd end offset in the buffer - * @param right Right DrillBuf for comparison + * @param right Right ArrowBuf for comparison * @param rStart start offset in the buffer * @param rEnd end offset in the buffer * @return 1 if left input is greater, -1 if left input is smaller, 0 otherwise @@ -81,14 +81,14 @@ private static final int memEqual(final long laddr, int lStart, int lEnd, final } /** - * Helper function to compare a set of bytes in two DrillBuffers. + * Helper function to compare a set of bytes in two ArrowBufs. * * Function will check data before completing in the case that * - * @param left Left DrillBuf to compare + * @param left Left ArrowBuf to compare * @param lStart start offset in the buffer * @param lEnd end offset in the buffer - * @param right Right DrillBuf to compare + * @param right Right ArrowBuf to compare * @param rStart start offset in the buffer * @param rEnd end offset in the buffer * @return 1 if left input is greater, -1 if left input is smaller, 0 otherwise @@ -138,9 +138,9 @@ private static final int memcmp(final long laddr, int lStart, int lEnd, final lo } /** - * Helper function to compare a set of bytes in DrillBuf to a ByteArray. + * Helper function to compare a set of bytes in ArrowBuf to a ByteArray. 
* - * @param left Left DrillBuf for comparison purposes + * @param left Left ArrowBuf for comparison purposes * @param lStart start offset in the buffer * @param lEnd end offset in the buffer * @param right second input to be compared diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java b/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java index 576a5b6351ad1..a3763cd34f1a1 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java @@ -145,16 +145,16 @@ public static StringBuilder toStringWithZeroes(long number, int desiredLength) { public static BigDecimal getBigDecimalFromIntermediate(ByteBuf data, int startIndex, int nDecimalDigits, int scale) { // In the intermediate representation we don't pad the scale with zeroes, so set truncate = false - return getBigDecimalFromDrillBuf(data, startIndex, nDecimalDigits, scale, false); + return getBigDecimalFromArrowBuf(data, startIndex, nDecimalDigits, scale, false); } public static BigDecimal getBigDecimalFromSparse(ArrowBuf data, int startIndex, int nDecimalDigits, int scale) { // In the sparse representation we pad the scale with zeroes for ease of arithmetic, need to truncate - return getBigDecimalFromDrillBuf(data, startIndex, nDecimalDigits, scale, true); + return getBigDecimalFromArrowBuf(data, startIndex, nDecimalDigits, scale, true); } - public static BigDecimal getBigDecimalFromDrillBuf(ArrowBuf bytebuf, int start, int length, int scale) { + public static BigDecimal getBigDecimalFromArrowBuf(ArrowBuf bytebuf, int start, int length, int scale) { byte[] value = new byte[length]; bytebuf.getBytes(start, value, 0, length); BigInteger unscaledValue = new BigInteger(value); @@ -168,17 +168,17 @@ public static BigDecimal getBigDecimalFromByteBuffer(ByteBuffer bytebuf, int sta return new BigDecimal(unscaledValue, scale); } - /* Create a BigDecimal object using the data in the DrillBuf. + /* Create a BigDecimal object using the data in the ArrowBuf. * This function assumes that data is provided in a non-dense format * It works on both sparse and intermediate representations. */ - public static BigDecimal getBigDecimalFromDrillBuf(ByteBuf data, int startIndex, int nDecimalDigits, int scale, + public static BigDecimal getBigDecimalFromArrowBuf(ByteBuf data, int startIndex, int nDecimalDigits, int scale, boolean truncateScale) { // For sparse decimal type we have padded zeroes at the end, strip them while converting to BigDecimal. int actualDigits; - // Initialize the BigDecimal, first digit in the DrillBuf has the sign so mask it out + // Initialize the BigDecimal, first digit in the ArrowBuf has the sign so mask it out BigInteger decimalDigits = BigInteger.valueOf((data.getInt(startIndex)) & 0x7FFFFFFF); BigInteger base = BigInteger.valueOf(DIGITS_BASE); @@ -208,7 +208,7 @@ public static BigDecimal getBigDecimalFromDrillBuf(ByteBuf data, int startIndex, /* This function returns a BigDecimal object from the dense decimal representation. 
* First step is to convert the dense representation into an intermediate representation - * and then invoke getBigDecimalFromDrillBuf() to get the BigDecimal object + * and then invoke getBigDecimalFromArrowBuf() to get the BigDecimal object */ public static BigDecimal getBigDecimalFromDense(ArrowBuf data, int startIndex, int nDecimalDigits, int scale, int maxPrecision, int width) { @@ -340,7 +340,7 @@ public static void getSparseFromBigDecimal(BigDecimal input, ByteBuf data, int s destIndex = nDecimalDigits - 1; while (scale > 0) { - // Get next set of MAX_DIGITS (9) store it in the DrillBuf + // Get next set of MAX_DIGITS (9) store it in the ArrowBuf fractionalPart = fractionalPart.movePointLeft(MAX_DIGITS); BigDecimal temp = fractionalPart.remainder(BigDecimal.ONE); From a3856222d78d58b51088769178715dcb1e5a8d2c Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 1 Mar 2016 14:48:27 -0800 Subject: [PATCH 0009/1644] ARROW-8: Add .travis.yml and test script for Arrow C++. OS X build fixes --- .travis.yml | 27 ++++++++++++++++++++++ README.md | 11 +++++++++ ci/travis_script_cpp.sh | 35 +++++++++++++++++++++++++++++ cpp/CMakeLists.txt | 37 ++++++++++++++++--------------- cpp/setup_build_env.sh | 3 +-- cpp/src/arrow/util/CMakeLists.txt | 2 +- 6 files changed, 94 insertions(+), 21 deletions(-) create mode 100644 .travis.yml create mode 100755 ci/travis_script_cpp.sh diff --git a/.travis.yml b/.travis.yml new file mode 100644 index 0000000000000..cb2d5cb1bad19 --- /dev/null +++ b/.travis.yml @@ -0,0 +1,27 @@ +sudo: required +dist: trusty +addons: + apt: + sources: + - ubuntu-toolchain-r-test + - kalakris-cmake + packages: + - gcc-4.9 # Needed for C++11 + - g++-4.9 # Needed for C++11 + - gcov + - cmake + - valgrind + +matrix: + include: + - compiler: gcc + language: cpp + os: linux + script: + - $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh + - compiler: clang + language: cpp + os: osx + addons: + script: + - $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh diff --git a/README.md b/README.md index 4423a91351381..d948a996bc075 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,16 @@ ## Apache Arrow + + + + + +
Build Status + + travis build status + +
+ #### Powering Columnar In-Memory Analytics Arrow is a set of technologies that enable big-data systems to process and move data fast. diff --git a/ci/travis_script_cpp.sh b/ci/travis_script_cpp.sh new file mode 100755 index 0000000000000..28f16cc021fe3 --- /dev/null +++ b/ci/travis_script_cpp.sh @@ -0,0 +1,35 @@ +#!/usr/bin/env bash + +set -e + +mkdir $TRAVIS_BUILD_DIR/cpp-build +pushd $TRAVIS_BUILD_DIR/cpp-build + +CPP_DIR=$TRAVIS_BUILD_DIR/cpp + +# Build an isolated thirdparty +cp -r $CPP_DIR/thirdparty . +cp $CPP_DIR/setup_build_env.sh . + +if [ $TRAVIS_OS_NAME == "linux" ]; then + # Use a C++11 compiler on Linux + export CC="gcc-4.9" + export CXX="g++-4.9" +fi + +source setup_build_env.sh + +echo $GTEST_HOME + +cmake -DCMAKE_CXX_FLAGS="-Werror" $CPP_DIR +make lint +make -j4 + +if [ $TRAVIS_OS_NAME == "linux" ]; then + valgrind --tool=memcheck --leak-check=yes --error-exitcode=1 ctest +else + ctest +fi + +popd +rm -rf cpp-build diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 90e55dfddbf30..5ddd9dae3fe82 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -44,6 +44,11 @@ if (NOT "$ENV{ARROW_GCC_ROOT}" STREQUAL "") set(CMAKE_CXX_COMPILER ${GCC_ROOT}/bin/g++) endif() +if(APPLE) + # In newer versions of CMake, this is the default setting + set(CMAKE_MACOSX_RPATH 1) +endif() + # ---------------------------------------------------------------------- # cmake options @@ -68,19 +73,15 @@ endif() ############################################################ # compiler flags that are common across debug/release builds -# - msse4.2: Enable sse4.2 compiler intrinsics. # - Wall: Enable all warnings. -# - Wno-sign-compare: suppress warnings for comparison between signed and unsigned -# integers -# -Wno-deprecated: some of the gutil code includes old things like ext/hash_set, ignore that -# - pthread: enable multithreaded malloc -# - -D__STDC_FORMAT_MACROS: for PRI* print format macros -# -fno-strict-aliasing -# Assume programs do not follow strict aliasing rules. -# GCC cannot always verify whether strict aliasing rules are indeed followed due to -# fundamental limitations in escape analysis, which can result in subtle bad code generation. -# This has a small perf hit but worth it to avoid hard to debug crashes. -set(CXX_COMMON_FLAGS "-std=c++11 -fno-strict-aliasing -msse3 -Wall -Wno-deprecated -pthread -D__STDC_FORMAT_MACROS") +set(CXX_COMMON_FLAGS "-std=c++11 -msse3 -Wall") + +if (APPLE) + # Depending on the default OSX_DEPLOYMENT_TARGET (< 10.9), libstdc++ may be + # the default standard library which does not support C++11. libc++ is the + # default from 10.9 onward. + set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -stdlib=libc++") +endif() # compiler flags for different build types (run 'cmake -DCMAKE_BUILD_TYPE= .') # For all builds: @@ -157,10 +158,6 @@ if ("${COMPILER_FAMILY}" STREQUAL "clang") else() message("Running without a controlling terminal or in a dumb terminal") endif() - - # Use libstdc++ and not libc++. The latter lacks support for tr1 in OSX - # and since 10.9 is now the default. - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -stdlib=libstdc++") endif() # Sanity check linking option. 
@@ -473,11 +470,15 @@ set(ARROW_SRCS src/arrow/type.cc ) -add_library(arrow SHARED +set(LIBARROW_LINKAGE "SHARED") + +add_library(arrow + ${LIBARROW_LINKAGE} ${ARROW_SRCS} ) target_link_libraries(arrow ${LINK_LIBS}) set_target_properties(arrow PROPERTIES LINKER_LANGUAGE CXX) install(TARGETS arrow - LIBRARY DESTINATION lib) + LIBRARY DESTINATION lib + ARCHIVE DESTINATION lib) diff --git a/cpp/setup_build_env.sh b/cpp/setup_build_env.sh index 457b9717ebe81..e9901bdbecd42 100755 --- a/cpp/setup_build_env.sh +++ b/cpp/setup_build_env.sh @@ -1,11 +1,10 @@ #!/bin/bash -set -e - SOURCE_DIR=$(cd "$(dirname "$BASH_SOURCE")"; pwd) ./thirdparty/download_thirdparty.sh ./thirdparty/build_thirdparty.sh +source thirdparty/versions.sh export GTEST_HOME=$SOURCE_DIR/thirdparty/$GTEST_BASEDIR diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt index 88e3f7a656d90..ff8db6a04106d 100644 --- a/cpp/src/arrow/util/CMakeLists.txt +++ b/cpp/src/arrow/util/CMakeLists.txt @@ -26,7 +26,7 @@ set(UTIL_SRCS ) set(UTIL_LIBS - rt) +) add_library(arrow_util STATIC ${UTIL_SRCS} From 8f2ca246b34daa49eed2a1eb2a747cab93bb2dbd Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 3 Mar 2016 13:49:20 -0800 Subject: [PATCH 0010/1644] ARROW-13: Add PR merge tool from parquet-mr, suitably modified Author: Wes McKinney Closes #7 from wesm/ARROW-13 and squashes the following commits: 7a58712 [Wes McKinney] Add PR merge tool from parquet-mr, suitably modified --- dev/README.md | 94 +++++++++++ dev/merge_arrow_pr.py | 362 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 456 insertions(+) create mode 100644 dev/README.md create mode 100755 dev/merge_arrow_pr.py diff --git a/dev/README.md b/dev/README.md new file mode 100644 index 0000000000000..e986abef1913f --- /dev/null +++ b/dev/README.md @@ -0,0 +1,94 @@ + + +# Arrow Developer Scripts + +This directory contains scripts useful to developers when packaging, +testing, or committing to Arrow. + +Merging a pull request requires being a committer on the project. + +* How to merge a Pull request: +have an apache and apache-github remote setup +``` +git remote add apache-github https://github.com/apache/arrow.git +git remote add apache https://git-wip-us.apache.org/repos/asf/arrow.git +``` +run the following command +``` +dev/merge_arrow_pr.py +``` + +Note: +* The directory name of your Arrow git clone must be called arrow +* Without jira-python installed you'll have to close the JIRA manually + +example output: +``` +Which pull request would you like to merge? (e.g. 34): +``` +Type the pull request number (from https://github.com/apache/arrow/pulls) and hit enter. +``` +=== Pull Request #X === +title Blah Blah Blah +source repo/branch +target master +url https://api.github.com/repos/apache/arrow/pulls/X + +Proceed with merging pull request #3? (y/n): +``` +If this looks good, type y and hit enter. +``` +From git-wip-us.apache.org:/repos/asf/arrow.git + * [new branch] master -> PR_TOOL_MERGE_PR_3_MASTER +Switched to branch 'PR_TOOL_MERGE_PR_3_MASTER' + +Merge complete (local ref PR_TOOL_MERGE_PR_3_MASTER). Push to apache? (y/n): +``` +A local branch with the merge has been created. +type y and hit enter to push it to apache master +``` +Counting objects: 67, done. +Delta compression using up to 4 threads. +Compressing objects: 100% (26/26), done. +Writing objects: 100% (36/36), 5.32 KiB, done. 
+Total 36 (delta 17), reused 0 (delta 0) +To git-wip-us.apache.org:/repos/arrow-mr.git + b767ac4..485658a PR_TOOL_MERGE_PR_X_MASTER -> master +Restoring head pointer to b767ac4e +Note: checking out 'b767ac4e'. + +You are in 'detached HEAD' state. You can look around, make experimental +changes and commit them, and you can discard any commits you make in this +state without impacting any branches by performing another checkout. + +If you want to create a new branch to retain commits you create, you may +do so (now or later) by using -b with the checkout command again. Example: + + git checkout -b new_branch_name + +HEAD is now at b767ac4... Update README.md +Deleting local branch PR_TOOL_MERGE_PR_X +Deleting local branch PR_TOOL_MERGE_PR_X_MASTER +Pull request #X merged! +Merge hash: 485658a5 + +Would you like to pick 485658a5 into another branch? (y/n): +``` +For now just say n as we have 1 branch diff --git a/dev/merge_arrow_pr.py b/dev/merge_arrow_pr.py new file mode 100755 index 0000000000000..ef47dec88c124 --- /dev/null +++ b/dev/merge_arrow_pr.py @@ -0,0 +1,362 @@ +#!/usr/bin/env python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# Utility for creating well-formed pull request merges and pushing them to Apache. +# usage: ./apache-pr-merge.py (see config env vars below) +# +# This utility assumes you already have a local Arrow git clone and that you +# have added remotes corresponding to both (i) the Github Apache Arrow mirror +# and (ii) the apache git repo. 
+ +import json +import os +import re +import subprocess +import sys +import tempfile +import urllib2 +import getpass + +try: + import jira.client + JIRA_IMPORTED = True +except ImportError: + JIRA_IMPORTED = False + +# Location of your Arrow git clone +ARROW_HOME = os.path.abspath(__file__).rsplit("/", 2)[0] +PROJECT_NAME = ARROW_HOME.rsplit("/", 1)[1] +print "ARROW_HOME = " + ARROW_HOME +print "PROJECT_NAME = " + PROJECT_NAME + +# Remote name which points to the Gihub site +PR_REMOTE_NAME = os.environ.get("PR_REMOTE_NAME", "apache-github") +# Remote name which points to Apache git +PUSH_REMOTE_NAME = os.environ.get("PUSH_REMOTE_NAME", "apache") +# ASF JIRA username +JIRA_USERNAME = os.environ.get("JIRA_USERNAME") +# ASF JIRA password +JIRA_PASSWORD = os.environ.get("JIRA_PASSWORD") + +GITHUB_BASE = "https://github.com/apache/" + PROJECT_NAME + "/pull" +GITHUB_API_BASE = "https://api.github.com/repos/apache/" + PROJECT_NAME +JIRA_BASE = "https://issues.apache.org/jira/browse" +JIRA_API_BASE = "https://issues.apache.org/jira" +# Prefix added to temporary branches +BRANCH_PREFIX = "PR_TOOL" + +os.chdir(ARROW_HOME) + + +def get_json(url): + try: + return json.load(urllib2.urlopen(url)) + except urllib2.HTTPError as e: + print "Unable to fetch URL, exiting: %s" % url + sys.exit(-1) + + +def fail(msg): + print msg + clean_up() + sys.exit(-1) + + +def run_cmd(cmd): + try: + if isinstance(cmd, list): + return subprocess.check_output(cmd) + else: + return subprocess.check_output(cmd.split(" ")) + except subprocess.CalledProcessError as e: + # this avoids hiding the stdout / stderr of failed processes + print 'Command failed: %s' % cmd + print 'With output:' + print '--------------' + print e.output + print '--------------' + raise e + +def continue_maybe(prompt): + result = raw_input("\n%s (y/n): " % prompt) + if result.lower() != "y": + fail("Okay, exiting") + + +original_head = run_cmd("git rev-parse HEAD")[:8] + + +def clean_up(): + print "Restoring head pointer to %s" % original_head + run_cmd("git checkout %s" % original_head) + + branches = run_cmd("git branch").replace(" ", "").split("\n") + + for branch in filter(lambda x: x.startswith(BRANCH_PREFIX), branches): + print "Deleting local branch %s" % branch + run_cmd("git branch -D %s" % branch) + + +# merge the requested PR and return the merge hash +def merge_pr(pr_num, target_ref): + pr_branch_name = "%s_MERGE_PR_%s" % (BRANCH_PREFIX, pr_num) + target_branch_name = "%s_MERGE_PR_%s_%s" % (BRANCH_PREFIX, pr_num, target_ref.upper()) + run_cmd("git fetch %s pull/%s/head:%s" % (PR_REMOTE_NAME, pr_num, pr_branch_name)) + run_cmd("git fetch %s %s:%s" % (PUSH_REMOTE_NAME, target_ref, target_branch_name)) + run_cmd("git checkout %s" % target_branch_name) + + had_conflicts = False + try: + run_cmd(['git', 'merge', pr_branch_name, '--squash']) + except Exception as e: + msg = "Error merging: %s\nWould you like to manually fix-up this merge?" % e + continue_maybe(msg) + msg = "Okay, please fix any conflicts and 'git add' conflicting files... Finished?" 
+ continue_maybe(msg) + had_conflicts = True + + commit_authors = run_cmd(['git', 'log', 'HEAD..%s' % pr_branch_name, + '--pretty=format:%an <%ae>']).split("\n") + distinct_authors = sorted(set(commit_authors), + key=lambda x: commit_authors.count(x), reverse=True) + primary_author = distinct_authors[0] + commits = run_cmd(['git', 'log', 'HEAD..%s' % pr_branch_name, + '--pretty=format:%h [%an] %s']).split("\n\n") + + merge_message_flags = [] + + merge_message_flags += ["-m", title] + if body != None: + merge_message_flags += ["-m", body] + + authors = "\n".join(["Author: %s" % a for a in distinct_authors]) + + merge_message_flags += ["-m", authors] + + if had_conflicts: + committer_name = run_cmd("git config --get user.name").strip() + committer_email = run_cmd("git config --get user.email").strip() + message = "This patch had conflicts when merged, resolved by\nCommitter: %s <%s>" % ( + committer_name, committer_email) + merge_message_flags += ["-m", message] + + # The string "Closes #%s" string is required for GitHub to correctly close the PR + merge_message_flags += [ + "-m", + "Closes #%s from %s and squashes the following commits:" % (pr_num, pr_repo_desc)] + for c in commits: + merge_message_flags += ["-m", c] + + run_cmd(['git', 'commit', '--author="%s"' % primary_author] + merge_message_flags) + + continue_maybe("Merge complete (local ref %s). Push to %s?" % ( + target_branch_name, PUSH_REMOTE_NAME)) + + try: + run_cmd('git push %s %s:%s' % (PUSH_REMOTE_NAME, target_branch_name, target_ref)) + except Exception as e: + clean_up() + fail("Exception while pushing: %s" % e) + + merge_hash = run_cmd("git rev-parse %s" % target_branch_name)[:8] + clean_up() + print("Pull request #%s merged!" % pr_num) + print("Merge hash: %s" % merge_hash) + return merge_hash + + +def cherry_pick(pr_num, merge_hash, default_branch): + pick_ref = raw_input("Enter a branch name [%s]: " % default_branch) + if pick_ref == "": + pick_ref = default_branch + + pick_branch_name = "%s_PICK_PR_%s_%s" % (BRANCH_PREFIX, pr_num, pick_ref.upper()) + + run_cmd("git fetch %s %s:%s" % (PUSH_REMOTE_NAME, pick_ref, pick_branch_name)) + run_cmd("git checkout %s" % pick_branch_name) + run_cmd("git cherry-pick -sx %s" % merge_hash) + + continue_maybe("Pick complete (local ref %s). Push to %s?" % ( + pick_branch_name, PUSH_REMOTE_NAME)) + + try: + run_cmd('git push %s %s:%s' % (PUSH_REMOTE_NAME, pick_branch_name, pick_ref)) + except Exception as e: + clean_up() + fail("Exception while pushing: %s" % e) + + pick_hash = run_cmd("git rev-parse %s" % pick_branch_name)[:8] + clean_up() + + print("Pull request #%s picked into %s!" 
% (pr_num, pick_ref)) + print("Pick hash: %s" % pick_hash) + return pick_ref + + +def fix_version_from_branch(branch, versions): + # Note: Assumes this is a sorted (newest->oldest) list of un-released versions + if branch == "master": + return versions[0] + else: + branch_ver = branch.replace("branch-", "") + return filter(lambda x: x.name.startswith(branch_ver), versions)[-1] + +def exctract_jira_id(title): + m = re.search(r'^(ARROW-[0-9]+)\b.*$', title) + if m and m.groups > 0: + return m.group(1) + else: + fail("PR title should be prefixed by a jira id \"ARROW-XXX: ...\", found: \"%s\"" % title) + +def check_jira(title): + jira_id = exctract_jira_id(title) + asf_jira = jira.client.JIRA({'server': JIRA_API_BASE}, + basic_auth=(JIRA_USERNAME, JIRA_PASSWORD)) + try: + issue = asf_jira.issue(jira_id) + except Exception as e: + fail("ASF JIRA could not find %s\n%s" % (jira_id, e)) + +def resolve_jira(title, merge_branches, comment): + asf_jira = jira.client.JIRA({'server': JIRA_API_BASE}, + basic_auth=(JIRA_USERNAME, JIRA_PASSWORD)) + + default_jira_id = exctract_jira_id(title) + + jira_id = raw_input("Enter a JIRA id [%s]: " % default_jira_id) + if jira_id == "": + jira_id = default_jira_id + + try: + issue = asf_jira.issue(jira_id) + except Exception as e: + fail("ASF JIRA could not find %s\n%s" % (jira_id, e)) + + cur_status = issue.fields.status.name + cur_summary = issue.fields.summary + cur_assignee = issue.fields.assignee + if cur_assignee is None: + cur_assignee = "NOT ASSIGNED!!!" + else: + cur_assignee = cur_assignee.displayName + + if cur_status == "Resolved" or cur_status == "Closed": + fail("JIRA issue %s already has status '%s'" % (jira_id, cur_status)) + print ("=== JIRA %s ===" % jira_id) + print ("summary\t\t%s\nassignee\t%s\nstatus\t\t%s\nurl\t\t%s/%s\n" % ( + cur_summary, cur_assignee, cur_status, JIRA_BASE, jira_id)) + + versions = asf_jira.project_versions("ARROW") + versions = sorted(versions, key=lambda x: x.name, reverse=True) + versions = filter(lambda x: x.raw['released'] is False, versions) + + default_fix_versions = map(lambda x: fix_version_from_branch(x, versions).name, merge_branches) + for v in default_fix_versions: + # Handles the case where we have forked a release branch but not yet made the release. + # In this case, if the PR is committed to the master branch and the release branch, we + # only consider the release branch to be the fix version. E.g. it is not valid to have + # both 1.1.0 and 1.0.0 as fix versions. + (major, minor, patch) = v.split(".") + if patch == "0": + previous = "%s.%s.%s" % (major, int(minor) - 1, 0) + if previous in default_fix_versions: + default_fix_versions = filter(lambda x: x != v, default_fix_versions) + default_fix_versions = ",".join(default_fix_versions) + + fix_versions = raw_input("Enter comma-separated fix version(s) [%s]: " % default_fix_versions) + if fix_versions == "": + fix_versions = default_fix_versions + fix_versions = fix_versions.replace(" ", "").split(",") + + def get_version_json(version_str): + return filter(lambda v: v.name == version_str, versions)[0].raw + + jira_fix_versions = map(lambda v: get_version_json(v), fix_versions) + + resolve = filter(lambda a: a['name'] == "Resolve Issue", asf_jira.transitions(jira_id))[0] + asf_jira.transition_issue( + jira_id, resolve["id"], fixVersions=jira_fix_versions, comment=comment) + + print "Succesfully resolved %s with fixVersions=%s!" 
% (jira_id, fix_versions) + + +if not JIRA_USERNAME: + JIRA_USERNAME = raw_input("Env JIRA_USERNAME not set, please enter your JIRA username:") + +if not JIRA_PASSWORD: + JIRA_PASSWORD = getpass.getpass("Env JIRA_PASSWORD not set, please enter your JIRA password:") + +branches = get_json("%s/branches" % GITHUB_API_BASE) +branch_names = filter(lambda x: x.startswith("branch-"), [x['name'] for x in branches]) +# Assumes branch names can be sorted lexicographically +# Julien: I commented this out as we don't have any "branch-*" branch yet +#latest_branch = sorted(branch_names, reverse=True)[0] + +pr_num = raw_input("Which pull request would you like to merge? (e.g. 34): ") +pr = get_json("%s/pulls/%s" % (GITHUB_API_BASE, pr_num)) + +url = pr["url"] +title = pr["title"] +check_jira(title) +body = pr["body"] +target_ref = pr["base"]["ref"] +user_login = pr["user"]["login"] +base_ref = pr["head"]["ref"] +pr_repo_desc = "%s/%s" % (user_login, base_ref) + +if pr["merged"] is True: + print "Pull request %s has already been merged, assuming you want to backport" % pr_num + merge_commit_desc = run_cmd([ + 'git', 'log', '--merges', '--first-parent', + '--grep=pull request #%s' % pr_num, '--oneline']).split("\n")[0] + if merge_commit_desc == "": + fail("Couldn't find any merge commit for #%s, you may need to update HEAD." % pr_num) + + merge_hash = merge_commit_desc[:7] + message = merge_commit_desc[8:] + + print "Found: %s" % message + maybe_cherry_pick(pr_num, merge_hash, latest_branch) + sys.exit(0) + +if not bool(pr["mergeable"]): + msg = "Pull request %s is not mergeable in its current form.\n" % pr_num + \ + "Continue? (experts only!)" + continue_maybe(msg) + +print ("\n=== Pull Request #%s ===" % pr_num) +print ("title\t%s\nsource\t%s\ntarget\t%s\nurl\t%s" % ( + title, pr_repo_desc, target_ref, url)) +continue_maybe("Proceed with merging pull request #%s?" % pr_num) + +merged_refs = [target_ref] + +merge_hash = merge_pr(pr_num, target_ref) + +pick_prompt = "Would you like to pick %s into another branch?" % merge_hash +while raw_input("\n%s (y/n): " % pick_prompt).lower() == "y": + merged_refs = merged_refs + [cherry_pick(pr_num, merge_hash, latest_branch)] + +if JIRA_IMPORTED: + continue_maybe("Would you like to update the associated JIRA?") + jira_comment = "Issue resolved by pull request %s\n[%s/%s]" % (pr_num, GITHUB_BASE, pr_num) + resolve_jira(title, merged_refs, jira_comment) +else: + print "Could not find jira-python library. Run 'sudo pip install jira-python' to install." + print "Exiting without trying to close the associated JIRA." From 1000d110cdc8a699cfb9caaee7772a0a5161538c Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 3 Mar 2016 14:00:12 -0800 Subject: [PATCH 0011/1644] ARROW-36: Remove fixVersions from JIRA resolve code path This one is tricky to test; sorry I missed this on the first go (the JIRA transition code executes after ARROW-13 was merged). 
Author: Wes McKinney Closes #11 from wesm/ARROW-36 and squashes the following commits: 432c17c [Wes McKinney] Remove fixVersions from JIRA resolve code path --- dev/merge_arrow_pr.py | 37 +++++-------------------------------- 1 file changed, 5 insertions(+), 32 deletions(-) diff --git a/dev/merge_arrow_pr.py b/dev/merge_arrow_pr.py index ef47dec88c124..fe0bcd13dd8f1 100755 --- a/dev/merge_arrow_pr.py +++ b/dev/merge_arrow_pr.py @@ -262,38 +262,11 @@ def resolve_jira(title, merge_branches, comment): print ("summary\t\t%s\nassignee\t%s\nstatus\t\t%s\nurl\t\t%s/%s\n" % ( cur_summary, cur_assignee, cur_status, JIRA_BASE, jira_id)) - versions = asf_jira.project_versions("ARROW") - versions = sorted(versions, key=lambda x: x.name, reverse=True) - versions = filter(lambda x: x.raw['released'] is False, versions) - - default_fix_versions = map(lambda x: fix_version_from_branch(x, versions).name, merge_branches) - for v in default_fix_versions: - # Handles the case where we have forked a release branch but not yet made the release. - # In this case, if the PR is committed to the master branch and the release branch, we - # only consider the release branch to be the fix version. E.g. it is not valid to have - # both 1.1.0 and 1.0.0 as fix versions. - (major, minor, patch) = v.split(".") - if patch == "0": - previous = "%s.%s.%s" % (major, int(minor) - 1, 0) - if previous in default_fix_versions: - default_fix_versions = filter(lambda x: x != v, default_fix_versions) - default_fix_versions = ",".join(default_fix_versions) - - fix_versions = raw_input("Enter comma-separated fix version(s) [%s]: " % default_fix_versions) - if fix_versions == "": - fix_versions = default_fix_versions - fix_versions = fix_versions.replace(" ", "").split(",") - - def get_version_json(version_str): - return filter(lambda v: v.name == version_str, versions)[0].raw - - jira_fix_versions = map(lambda v: get_version_json(v), fix_versions) - - resolve = filter(lambda a: a['name'] == "Resolve Issue", asf_jira.transitions(jira_id))[0] - asf_jira.transition_issue( - jira_id, resolve["id"], fixVersions=jira_fix_versions, comment=comment) - - print "Succesfully resolved %s with fixVersions=%s!" % (jira_id, fix_versions) + resolve = filter(lambda a: a['name'] == "Resolve Issue", + asf_jira.transitions(jira_id))[0] + asf_jira.transition_issue(jira_id, resolve["id"], comment=comment) + + print "Succesfully resolved %s!" % (jira_id) if not JIRA_USERNAME: From e418020852ad4fa148b07f21f5b4d47230fe4c5b Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 3 Mar 2016 14:02:53 -0800 Subject: [PATCH 0012/1644] ARROW-19: Add an externalized MemoryPool interface for use in builder classes Memory management will be an ongoing concern, but this is a stride in the right direction. Applications requiring custom memory management will be able to implement a subclass of MemoryPool; we can evolve its API as user needs evolve. 
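As a sketch of the extension point described above (illustrative only, not part of this patch): a custom pool implements the three virtual methods of the MemoryPool interface added here. The LoggingMemoryPool name and its delegation to the default pool are assumptions made for the example.

```
// Hypothetical example of a MemoryPool subclass: logs every call and
// delegates the actual allocation to the process-wide default pool.
#include <cstdint>
#include <iostream>

#include "arrow/util/memory-pool.h"
#include "arrow/util/status.h"

class LoggingMemoryPool : public arrow::MemoryPool {
 public:
  LoggingMemoryPool() : wrapped_(arrow::GetDefaultMemoryPool()) {}

  arrow::Status Allocate(int64_t size, uint8_t** out) override {
    std::cerr << "Allocate(" << size << ")" << std::endl;
    return wrapped_->Allocate(size, out);
  }

  void Free(uint8_t* buffer, int64_t size) override {
    std::cerr << "Free(" << size << ")" << std::endl;
    wrapped_->Free(buffer, size);
  }

  int64_t bytes_allocated() const override {
    return wrapped_->bytes_allocated();
  }

 private:
  arrow::MemoryPool* wrapped_;  // not owned
};
```

A pool like this can then be threaded through the builders via the make_builder(pool, type, &builder) entry point changed in this patch.
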
Author: Wes McKinney Closes #8 from wesm/ARROW-19 and squashes the following commits: 08d3895 [Wes McKinney] Some include cleanup e319a36 [Wes McKinney] cpplint fixes abca6eb [Wes McKinney] Add a MemoryPool abstract interface, change builder instances to request memory from pool via Buffer subclass --- cpp/CMakeLists.txt | 2 +- cpp/src/arrow/array-test.cc | 10 +++- cpp/src/arrow/array.h | 1 - cpp/src/arrow/builder.cc | 2 +- cpp/src/arrow/builder.h | 22 +++++--- cpp/src/arrow/types/construct.cc | 13 +++-- cpp/src/arrow/types/construct.h | 4 +- cpp/src/arrow/types/list-test.cc | 2 +- cpp/src/arrow/types/list.h | 7 ++- cpp/src/arrow/types/primitive-test.cc | 5 +- cpp/src/arrow/types/primitive.h | 12 ++-- cpp/src/arrow/types/string-test.cc | 29 +++++----- cpp/src/arrow/types/string.h | 9 ++- cpp/src/arrow/types/struct.cc | 1 + cpp/src/arrow/types/test-common.h | 8 ++- cpp/src/arrow/types/union.cc | 1 + cpp/src/arrow/util/CMakeLists.txt | 3 + cpp/src/arrow/util/bit-util.cc | 2 +- cpp/src/arrow/util/bit-util.h | 3 +- cpp/src/arrow/util/buffer-test.cc | 6 +- cpp/src/arrow/util/buffer.cc | 36 ++++++++---- cpp/src/arrow/util/buffer.h | 36 ++++++++---- cpp/src/arrow/util/memory-pool-test.cc | 47 ++++++++++++++++ cpp/src/arrow/util/memory-pool.cc | 78 ++++++++++++++++++++++++++ cpp/src/arrow/util/memory-pool.h | 41 ++++++++++++++ 25 files changed, 301 insertions(+), 79 deletions(-) create mode 100644 cpp/src/arrow/util/memory-pool-test.cc create mode 100644 cpp/src/arrow/util/memory-pool.cc create mode 100644 cpp/src/arrow/util/memory-pool.h diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 5ddd9dae3fe82..d2c840abfe823 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -434,7 +434,7 @@ if (UNIX) add_custom_target(lint ${BUILD_SUPPORT_DIR}/cpplint.py --verbose=2 --linelength=90 - --filter=-whitespace/comments,-readability/todo,-build/header_guard + --filter=-whitespace/comments,-readability/todo,-build/header_guard,-build/c++11 `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc -or -name \\*.h`) endif (UNIX) diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc index 5ecf91624fe73..16afb9bef348c 100644 --- a/cpp/src/arrow/array-test.cc +++ b/cpp/src/arrow/array-test.cc @@ -18,6 +18,7 @@ #include #include +#include #include #include #include @@ -28,6 +29,8 @@ #include "arrow/types/integer.h" #include "arrow/types/primitive.h" #include "arrow/util/buffer.h" +#include "arrow/util/memory-pool.h" +#include "arrow/util/status.h" using std::string; using std::vector; @@ -41,8 +44,10 @@ static TypePtr int32_nn = TypePtr(new Int32Type(false)); class TestArray : public ::testing::Test { public: void SetUp() { - auto data = std::make_shared(); - auto nulls = std::make_shared(); + pool_ = GetDefaultMemoryPool(); + + auto data = std::make_shared(pool_); + auto nulls = std::make_shared(pool_); ASSERT_OK(data->Resize(400)); ASSERT_OK(nulls->Resize(128)); @@ -51,6 +56,7 @@ class TestArray : public ::testing::Test { } protected: + MemoryPool* pool_; std::unique_ptr arr_; }; diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index c95450d12a419..0eaa28d528e37 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -19,7 +19,6 @@ #define ARROW_ARRAY_H #include -#include #include #include "arrow/type.h" diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index 1fd7471928367..cb85067315099 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -30,7 +30,7 @@ Status ArrayBuilder::Init(int64_t capacity) { if (nullable_) { int64_t to_alloc = 
util::ceil_byte(capacity) / 8; - nulls_ = std::make_shared(); + nulls_ = std::make_shared(pool_); RETURN_NOT_OK(nulls_->Resize(to_alloc)); null_bits_ = nulls_->mutable_data(); memset(null_bits_, 0, to_alloc); diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index b43668af77cbd..456bb04ae090a 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -23,25 +23,27 @@ #include #include "arrow/type.h" -#include "arrow/util/buffer.h" #include "arrow/util/macros.h" #include "arrow/util/status.h" namespace arrow { class Array; +class MemoryPool; +class PoolBuffer; static constexpr int64_t MIN_BUILDER_CAPACITY = 1 << 8; // Base class for all data array builders class ArrayBuilder { public: - explicit ArrayBuilder(const TypePtr& type) - : type_(type), - nullable_(type_->nullable), - nulls_(nullptr), null_bits_(nullptr), - length_(0), - capacity_(0) {} + explicit ArrayBuilder(MemoryPool* pool, const TypePtr& type) : + pool_(pool), + type_(type), + nullable_(type_->nullable), + nulls_(nullptr), null_bits_(nullptr), + length_(0), + capacity_(0) {} virtual ~ArrayBuilder() {} @@ -71,18 +73,20 @@ class ArrayBuilder { // this function responsibly. Status Advance(int64_t elements); - const std::shared_ptr& nulls() const { return nulls_;} + const std::shared_ptr& nulls() const { return nulls_;} // Creates new array object to hold the contents of the builder and transfers // ownership of the data virtual Status ToArray(Array** out) = 0; protected: + MemoryPool* pool_; + TypePtr type_; bool nullable_; // If the type is not nullable, then null_ is nullptr after initialization - std::shared_ptr nulls_; + std::shared_ptr nulls_; uint8_t* null_bits_; // Array length, so far. Also, the index of the next element to be added diff --git a/cpp/src/arrow/types/construct.cc b/cpp/src/arrow/types/construct.cc index 5176cafd3ba1c..e1bb990063c1b 100644 --- a/cpp/src/arrow/types/construct.cc +++ b/cpp/src/arrow/types/construct.cc @@ -32,12 +32,13 @@ class ArrayBuilder; // Initially looked at doing this with vtables, but shared pointers makes it // difficult -#define BUILDER_CASE(ENUM, BuilderType) \ - case TypeEnum::ENUM: \ - *out = static_cast(new BuilderType(type)); \ +#define BUILDER_CASE(ENUM, BuilderType) \ + case TypeEnum::ENUM: \ + *out = static_cast(new BuilderType(pool, type)); \ return Status::OK(); -Status make_builder(const TypePtr& type, ArrayBuilder** out) { +Status make_builder(MemoryPool* pool, const TypePtr& type, + ArrayBuilder** out) { switch (type->type) { BUILDER_CASE(UINT8, UInt8Builder); BUILDER_CASE(INT8, Int8Builder); @@ -59,10 +60,10 @@ Status make_builder(const TypePtr& type, ArrayBuilder** out) { { ListType* list_type = static_cast(type.get()); ArrayBuilder* value_builder; - RETURN_NOT_OK(make_builder(list_type->value_type, &value_builder)); + RETURN_NOT_OK(make_builder(pool, list_type->value_type, &value_builder)); // The ListBuilder takes ownership of the value_builder - ListBuilder* builder = new ListBuilder(type, value_builder); + ListBuilder* builder = new ListBuilder(pool, type, value_builder); *out = static_cast(builder); return Status::OK(); } diff --git a/cpp/src/arrow/types/construct.h b/cpp/src/arrow/types/construct.h index c0bfedd27d6ad..b5ba436f787d9 100644 --- a/cpp/src/arrow/types/construct.h +++ b/cpp/src/arrow/types/construct.h @@ -23,9 +23,11 @@ namespace arrow { class ArrayBuilder; +class MemoryPool; class Status; -Status make_builder(const TypePtr& type, ArrayBuilder** out); +Status make_builder(MemoryPool* pool, const TypePtr& type, + ArrayBuilder** 
out); } // namespace arrow diff --git a/cpp/src/arrow/types/list-test.cc b/cpp/src/arrow/types/list-test.cc index 47673ff898bbd..abfc8a31b0daa 100644 --- a/cpp/src/arrow/types/list-test.cc +++ b/cpp/src/arrow/types/list-test.cc @@ -76,7 +76,7 @@ class TestListBuilder : public TestBuilder { type_ = TypePtr(new ListType(value_type_)); ArrayBuilder* tmp; - ASSERT_OK(make_builder(type_, &tmp)); + ASSERT_OK(make_builder(pool_, type_, &tmp)); builder_.reset(static_cast(tmp)); } diff --git a/cpp/src/arrow/types/list.h b/cpp/src/arrow/types/list.h index 0f1116257c507..4ca0f13d53c6f 100644 --- a/cpp/src/arrow/types/list.h +++ b/cpp/src/arrow/types/list.h @@ -34,6 +34,8 @@ namespace arrow { +class MemoryPool; + struct ListType : public DataType { // List can contain any other logical value type TypePtr value_type; @@ -100,8 +102,9 @@ class ListArray : public Array { // have been appended to the child array) class ListBuilder : public Int32Builder { public: - ListBuilder(const TypePtr& type, ArrayBuilder* value_builder) - : Int32Builder(type) { + ListBuilder(MemoryPool* pool, const TypePtr& type, + ArrayBuilder* value_builder) + : Int32Builder(pool, type) { value_builder_.reset(value_builder); } diff --git a/cpp/src/arrow/types/primitive-test.cc b/cpp/src/arrow/types/primitive-test.cc index 12968608094d7..3484294a39f9a 100644 --- a/cpp/src/arrow/types/primitive-test.cc +++ b/cpp/src/arrow/types/primitive-test.cc @@ -18,7 +18,6 @@ #include #include -#include #include #include #include @@ -104,10 +103,10 @@ class TestPrimitiveBuilder : public TestBuilder { type_nn_ = Attrs::type(false); ArrayBuilder* tmp; - ASSERT_OK(make_builder(type_, &tmp)); + ASSERT_OK(make_builder(pool_, type_, &tmp)); builder_.reset(static_cast(tmp)); - ASSERT_OK(make_builder(type_nn_, &tmp)); + ASSERT_OK(make_builder(pool_, type_nn_, &tmp)); builder_nn_.reset(static_cast(tmp)); } diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index a41911224e05e..c5ae0f78a991b 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -20,6 +20,7 @@ #include #include +#include #include #include "arrow/array.h" @@ -31,6 +32,8 @@ namespace arrow { +class MemoryPool; + template struct PrimitiveType : public DataType { explicit PrimitiveType(bool nullable = true) @@ -113,8 +116,9 @@ class PrimitiveBuilder : public ArrayBuilder { public: typedef typename Type::c_type T; - explicit PrimitiveBuilder(const TypePtr& type) - : ArrayBuilder(type), values_(nullptr) { + explicit PrimitiveBuilder(MemoryPool* pool, const TypePtr& type) : + ArrayBuilder(pool, type), + values_(nullptr) { elsize_ = sizeof(T); } @@ -139,7 +143,7 @@ class PrimitiveBuilder : public ArrayBuilder { Status Init(int64_t capacity) { RETURN_NOT_OK(ArrayBuilder::Init(capacity)); - values_ = std::make_shared(); + values_ = std::make_shared(pool_); return values_->Resize(capacity * elsize_); } @@ -231,7 +235,7 @@ class PrimitiveBuilder : public ArrayBuilder { } protected: - std::shared_ptr values_; + std::shared_ptr values_; int64_t elsize_; }; diff --git a/cpp/src/arrow/types/string-test.cc b/cpp/src/arrow/types/string-test.cc index 6dba3fdcbb6aa..a2d87ead59c59 100644 --- a/cpp/src/arrow/types/string-test.cc +++ b/cpp/src/arrow/types/string-test.cc @@ -31,12 +31,9 @@ #include "arrow/types/test-common.h" #include "arrow/util/status.h" -using std::string; -using std::unique_ptr; -using std::vector; - namespace arrow { +class Buffer; TEST(TypesTest, TestCharType) { CharType t1(5); @@ -45,7 +42,7 @@ TEST(TypesTest, TestCharType) { 
ASSERT_TRUE(t1.nullable); ASSERT_EQ(t1.size, 5); - ASSERT_EQ(t1.ToString(), string("char(5)")); + ASSERT_EQ(t1.ToString(), std::string("char(5)")); // Test copy constructor CharType t2 = t1; @@ -63,7 +60,7 @@ TEST(TypesTest, TestVarcharType) { ASSERT_EQ(t1.size, 5); ASSERT_EQ(t1.physical_type.size, 6); - ASSERT_EQ(t1.ToString(), string("varchar(5)")); + ASSERT_EQ(t1.ToString(), std::string("varchar(5)")); // Test copy constructor VarcharType t2 = t1; @@ -78,7 +75,7 @@ TEST(TypesTest, TestStringType) { StringType str_nn(false); ASSERT_EQ(str.type, TypeEnum::STRING); - ASSERT_EQ(str.name(), string("string")); + ASSERT_EQ(str.name(), std::string("string")); ASSERT_TRUE(str.nullable); ASSERT_FALSE(str_nn.nullable); } @@ -111,11 +108,11 @@ class TestStringContainer : public ::testing::Test { } protected: - vector offsets_; - vector chars_; - vector nulls_; + std::vector offsets_; + std::vector chars_; + std::vector nulls_; - vector expected_; + std::vector expected_; std::shared_ptr value_buf_; std::shared_ptr offsets_buf_; @@ -175,7 +172,7 @@ class TestStringBuilder : public TestBuilder { type_ = TypePtr(new StringType()); ArrayBuilder* tmp; - ASSERT_OK(make_builder(type_, &tmp)); + ASSERT_OK(make_builder(pool_, type_, &tmp)); builder_.reset(static_cast(tmp)); } @@ -188,8 +185,8 @@ class TestStringBuilder : public TestBuilder { protected: TypePtr type_; - unique_ptr builder_; - unique_ptr result_; + std::unique_ptr builder_; + std::unique_ptr result_; }; TEST_F(TestStringBuilder, TestAttrs) { @@ -197,8 +194,8 @@ TEST_F(TestStringBuilder, TestAttrs) { } TEST_F(TestStringBuilder, TestScalarAppend) { - vector strings = {"a", "bb", "", "", "ccc"}; - vector is_null = {0, 0, 0, 1, 0}; + std::vector strings = {"a", "bb", "", "", "ccc"}; + std::vector is_null = {0, 0, 0, 1, 0}; int N = strings.size(); int reps = 1000; diff --git a/cpp/src/arrow/types/string.h b/cpp/src/arrow/types/string.h index 30d6e247db1ad..d0690d9a7d2a4 100644 --- a/cpp/src/arrow/types/string.h +++ b/cpp/src/arrow/types/string.h @@ -27,12 +27,13 @@ #include "arrow/type.h" #include "arrow/types/integer.h" #include "arrow/types/list.h" -#include "arrow/util/buffer.h" #include "arrow/util/status.h" namespace arrow { class ArrayBuilder; +class Buffer; +class MemoryPool; struct CharType : public DataType { int size; @@ -148,8 +149,9 @@ class StringArray : public ListArray { class StringBuilder : public ListBuilder { public: - explicit StringBuilder(const TypePtr& type) : - ListBuilder(type, static_cast(new UInt8Builder(value_type_))) { + explicit StringBuilder(MemoryPool* pool, const TypePtr& type) : + ListBuilder(pool, type, + static_cast(new UInt8Builder(pool, value_type_))) { byte_builder_ = static_cast(value_builder_.get()); } @@ -171,6 +173,7 @@ class StringBuilder : public ListBuilder { } protected: + std::shared_ptr list_builder_; UInt8Builder* byte_builder_; static TypePtr value_type_; diff --git a/cpp/src/arrow/types/struct.cc b/cpp/src/arrow/types/struct.cc index b7be5d8245f1d..a245656b516cc 100644 --- a/cpp/src/arrow/types/struct.cc +++ b/cpp/src/arrow/types/struct.cc @@ -17,6 +17,7 @@ #include "arrow/types/struct.h" +#include #include #include #include diff --git a/cpp/src/arrow/types/test-common.h b/cpp/src/arrow/types/test-common.h index 267e48a7f25c9..3ecb0dec7c04a 100644 --- a/cpp/src/arrow/types/test-common.h +++ b/cpp/src/arrow/types/test-common.h @@ -25,6 +25,7 @@ #include "arrow/test-util.h" #include "arrow/type.h" +#include "arrow/util/memory-pool.h" using std::unique_ptr; @@ -33,12 +34,15 @@ namespace arrow { 
class TestBuilder : public ::testing::Test { public: void SetUp() { + pool_ = GetDefaultMemoryPool(); type_ = TypePtr(new UInt8Type()); type_nn_ = TypePtr(new UInt8Type(false)); - builder_.reset(new UInt8Builder(type_)); - builder_nn_.reset(new UInt8Builder(type_nn_)); + builder_.reset(new UInt8Builder(pool_, type_)); + builder_nn_.reset(new UInt8Builder(pool_, type_nn_)); } protected: + MemoryPool* pool_; + TypePtr type_; TypePtr type_nn_; unique_ptr builder_; diff --git a/cpp/src/arrow/types/union.cc b/cpp/src/arrow/types/union.cc index 54f41a7eef6be..db3f81795eae2 100644 --- a/cpp/src/arrow/types/union.cc +++ b/cpp/src/arrow/types/union.cc @@ -17,6 +17,7 @@ #include "arrow/types/union.h" +#include #include #include #include diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt index ff8db6a04106d..c53f307c9f59a 100644 --- a/cpp/src/arrow/util/CMakeLists.txt +++ b/cpp/src/arrow/util/CMakeLists.txt @@ -22,6 +22,7 @@ set(UTIL_SRCS bit-util.cc buffer.cc + memory-pool.cc status.cc ) @@ -39,6 +40,7 @@ install(FILES bit-util.h buffer.h macros.h + memory-pool.h status.h DESTINATION include/arrow/util) @@ -79,3 +81,4 @@ endif() ADD_ARROW_TEST(bit-util-test) ADD_ARROW_TEST(buffer-test) +ADD_ARROW_TEST(memory-pool-test) diff --git a/cpp/src/arrow/util/bit-util.cc b/cpp/src/arrow/util/bit-util.cc index d2ddd6584a88c..dbac0a42527be 100644 --- a/cpp/src/arrow/util/bit-util.cc +++ b/cpp/src/arrow/util/bit-util.cc @@ -33,7 +33,7 @@ Status util::bytes_to_bits(uint8_t* bytes, int length, std::shared_ptr* out) { int bit_length = ceil_byte(length) / 8; - auto buffer = std::make_shared(); + auto buffer = std::make_shared(); RETURN_NOT_OK(buffer->Resize(bit_length)); memset(buffer->mutable_data(), 0, bit_length); bytes_to_bits(bytes, length, buffer->mutable_data()); diff --git a/cpp/src/arrow/util/bit-util.h b/cpp/src/arrow/util/bit-util.h index 61dffa30423b1..9ae6127c5ea9c 100644 --- a/cpp/src/arrow/util/bit-util.h +++ b/cpp/src/arrow/util/bit-util.h @@ -22,10 +22,9 @@ #include #include -#include "arrow/util/buffer.h" - namespace arrow { +class Buffer; class Status; namespace util { diff --git a/cpp/src/arrow/util/buffer-test.cc b/cpp/src/arrow/util/buffer-test.cc index edfd08e850bd8..9f1fd91432b4d 100644 --- a/cpp/src/arrow/util/buffer-test.cc +++ b/cpp/src/arrow/util/buffer-test.cc @@ -16,10 +16,8 @@ // under the License. 
#include -#include #include #include -#include #include #include "arrow/test-util.h" @@ -34,7 +32,7 @@ class TestBuffer : public ::testing::Test { }; TEST_F(TestBuffer, Resize) { - OwnedMutableBuffer buf; + PoolBuffer buf; ASSERT_EQ(0, buf.size()); ASSERT_OK(buf.Resize(100)); @@ -49,7 +47,7 @@ TEST_F(TestBuffer, Resize) { TEST_F(TestBuffer, ResizeOOM) { // realloc fails, even though there may be no explicit limit - OwnedMutableBuffer buf; + PoolBuffer buf; ASSERT_OK(buf.Resize(100)); int64_t to_alloc = std::numeric_limits::max(); ASSERT_RAISES(OutOfMemory, buf.Resize(to_alloc)); diff --git a/cpp/src/arrow/util/buffer.cc b/cpp/src/arrow/util/buffer.cc index 2fb34d59e0b78..3f3807d4e2094 100644 --- a/cpp/src/arrow/util/buffer.cc +++ b/cpp/src/arrow/util/buffer.cc @@ -19,6 +19,7 @@ #include +#include "arrow/util/memory-pool.h" #include "arrow/util/status.h" namespace arrow { @@ -34,19 +35,34 @@ std::shared_ptr MutableBuffer::GetImmutableView() { return std::make_shared(this->get_shared_ptr(), 0, size()); } -OwnedMutableBuffer::OwnedMutableBuffer() : - MutableBuffer(nullptr, 0) {} +PoolBuffer::PoolBuffer(MemoryPool* pool) : + ResizableBuffer(nullptr, 0) { + if (pool == nullptr) { + pool = GetDefaultMemoryPool(); + } + pool_ = pool; +} -Status OwnedMutableBuffer::Resize(int64_t new_size) { - size_ = new_size; - try { - buffer_owner_.resize(new_size); - } catch (const std::bad_alloc& e) { - return Status::OutOfMemory("resize failed"); +Status PoolBuffer::Reserve(int64_t new_capacity) { + if (!mutable_data_ || new_capacity > capacity_) { + uint8_t* new_data; + if (mutable_data_) { + RETURN_NOT_OK(pool_->Allocate(new_capacity, &new_data)); + memcpy(new_data, mutable_data_, size_); + pool_->Free(mutable_data_, capacity_); + } else { + RETURN_NOT_OK(pool_->Allocate(new_capacity, &new_data)); + } + mutable_data_ = new_data; + data_ = mutable_data_; + capacity_ = new_capacity; } - data_ = buffer_owner_.data(); - mutable_data_ = buffer_owner_.data(); + return Status::OK(); +} +Status PoolBuffer::Resize(int64_t new_size) { + RETURN_NOT_OK(Reserve(new_size)); + size_ = new_size; return Status::OK(); } diff --git a/cpp/src/arrow/util/buffer.h b/cpp/src/arrow/util/buffer.h index 3e4183936b33d..8704723eb0a89 100644 --- a/cpp/src/arrow/util/buffer.h +++ b/cpp/src/arrow/util/buffer.h @@ -19,15 +19,14 @@ #define ARROW_UTIL_BUFFER_H #include -#include #include #include -#include #include "arrow/util/macros.h" namespace arrow { +class MemoryPool; class Status; // ---------------------------------------------------------------------- @@ -115,17 +114,34 @@ class MutableBuffer : public Buffer { uint8_t* mutable_data_; }; -// A MutableBuffer whose memory is owned by the class instance. For example, -// for reading data out of files that you want to deallocate when this class is -// garbage-collected -class OwnedMutableBuffer : public MutableBuffer { +class ResizableBuffer : public MutableBuffer { public: - OwnedMutableBuffer(); - Status Resize(int64_t new_size); + // Change buffer reported size to indicated size, allocating memory if + // necessary + virtual Status Resize(int64_t new_size) = 0; + + // Ensure that buffer has enough memory allocated to fit the indicated + // capacity. 
Does not change buffer's reported size + virtual Status Reserve(int64_t new_capacity) = 0; + + protected: + ResizableBuffer(uint8_t* data, int64_t size) : + MutableBuffer(data, size), + capacity_(size) {} + + int64_t capacity_; +}; + +// A Buffer whose lifetime is tied to a particular MemoryPool +class PoolBuffer : public ResizableBuffer { + public: + explicit PoolBuffer(MemoryPool* pool = nullptr); + + virtual Status Resize(int64_t new_size); + virtual Status Reserve(int64_t new_capacity); private: - // TODO: aligned allocations - std::vector buffer_owner_; + MemoryPool* pool_; }; } // namespace arrow diff --git a/cpp/src/arrow/util/memory-pool-test.cc b/cpp/src/arrow/util/memory-pool-test.cc new file mode 100644 index 0000000000000..954b5f951b558 --- /dev/null +++ b/cpp/src/arrow/util/memory-pool-test.cc @@ -0,0 +1,47 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include +#include +#include + +#include "arrow/test-util.h" +#include "arrow/util/memory-pool.h" +#include "arrow/util/status.h" + +namespace arrow { + +TEST(DefaultMemoryPool, MemoryTracking) { + MemoryPool* pool = GetDefaultMemoryPool(); + + uint8_t* data; + ASSERT_OK(pool->Allocate(100, &data)); + ASSERT_EQ(100, pool->bytes_allocated()); + + pool->Free(data, 100); + ASSERT_EQ(0, pool->bytes_allocated()); +} + +TEST(DefaultMemoryPool, OOM) { + MemoryPool* pool = GetDefaultMemoryPool(); + + uint8_t* data; + int64_t to_alloc = std::numeric_limits::max(); + ASSERT_RAISES(OutOfMemory, pool->Allocate(to_alloc, &data)); +} + +} // namespace arrow diff --git a/cpp/src/arrow/util/memory-pool.cc b/cpp/src/arrow/util/memory-pool.cc new file mode 100644 index 0000000000000..5820346e5a739 --- /dev/null +++ b/cpp/src/arrow/util/memory-pool.cc @@ -0,0 +1,78 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include "arrow/util/memory-pool.h" + +#include +#include +#include + +#include "arrow/util/status.h" + +namespace arrow { + +MemoryPool::~MemoryPool() {} + +class InternalMemoryPool : public MemoryPool { + public: + InternalMemoryPool() : bytes_allocated_(0) {} + virtual ~InternalMemoryPool(); + + Status Allocate(int64_t size, uint8_t** out) override; + + void Free(uint8_t* buffer, int64_t size) override; + + int64_t bytes_allocated() const override; + + private: + mutable std::mutex pool_lock_; + int64_t bytes_allocated_; +}; + +Status InternalMemoryPool::Allocate(int64_t size, uint8_t** out) { + std::lock_guard guard(pool_lock_); + *out = static_cast(std::malloc(size)); + if (*out == nullptr) { + std::stringstream ss; + ss << "malloc of size " << size << " failed"; + return Status::OutOfMemory(ss.str()); + } + + bytes_allocated_ += size; + + return Status::OK(); +} + +int64_t InternalMemoryPool::bytes_allocated() const { + std::lock_guard guard(pool_lock_); + return bytes_allocated_; +} + +void InternalMemoryPool::Free(uint8_t* buffer, int64_t size) { + std::lock_guard guard(pool_lock_); + std::free(buffer); + bytes_allocated_ -= size; +} + +InternalMemoryPool::~InternalMemoryPool() {} + +MemoryPool* GetDefaultMemoryPool() { + static InternalMemoryPool default_memory_pool; + return &default_memory_pool; +} + +} // namespace arrow diff --git a/cpp/src/arrow/util/memory-pool.h b/cpp/src/arrow/util/memory-pool.h new file mode 100644 index 0000000000000..a7cb10dae1703 --- /dev/null +++ b/cpp/src/arrow/util/memory-pool.h @@ -0,0 +1,41 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_UTIL_MEMORY_POOL_H +#define ARROW_UTIL_MEMORY_POOL_H + +#include + +namespace arrow { + +class Status; + +class MemoryPool { + public: + virtual ~MemoryPool(); + + virtual Status Allocate(int64_t size, uint8_t** out) = 0; + virtual void Free(uint8_t* buffer, int64_t size) = 0; + + virtual int64_t bytes_allocated() const = 0; +}; + +MemoryPool* GetDefaultMemoryPool(); + +} // namespace arrow + +#endif // ARROW_UTIL_MEMORY_POOL_H From b88b69e204b59fa8f19cd20dcb6c091fe9bde3a9 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 3 Mar 2016 14:56:31 -0800 Subject: [PATCH 0013/1644] ARROW-20: Add null_count_ member to array containers, remove nullable_ member Based off of ARROW-19. After some contemplation / discussion, I believe it would be better to track nullability at the schema metadata level (if at all!) rather than making it a property of the data structures. This allows the data containers to be "plain ol' data" and thus both nullable data with `null_count == 0` and non-nullable data (implicitly `null_count == 0`) can be treated as semantically equivalent in algorithms code. 
If it is deemed useful we can validate (cheaply) that physical data meets the metadata requirements (e.g. non-nullable type metadata cannot be associated with data containers having nulls). Author: Wes McKinney Closes #9 from wesm/ARROW-20 and squashes the following commits: 98be016 [Wes McKinney] ARROW-20: Add null_count_ member to Array containers, remove nullable member --- cpp/CMakeLists.txt | 2 +- cpp/src/arrow/array-test.cc | 57 ++++++++-------- cpp/src/arrow/array.cc | 11 ++-- cpp/src/arrow/array.h | 37 +++++++---- cpp/src/arrow/builder.cc | 35 +++++----- cpp/src/arrow/builder.h | 29 ++++---- cpp/src/arrow/test-util.h | 10 +++ cpp/src/arrow/type.h | 12 ++-- cpp/src/arrow/types/collection.h | 2 +- cpp/src/arrow/types/datetime.h | 12 ++-- cpp/src/arrow/types/json.h | 4 +- cpp/src/arrow/types/list-test.cc | 12 +--- cpp/src/arrow/types/list.h | 46 ++++++------- cpp/src/arrow/types/primitive-test.cc | 34 +++++----- cpp/src/arrow/types/primitive.cc | 11 ++-- cpp/src/arrow/types/primitive.h | 95 +++++++++++++++------------ cpp/src/arrow/types/string-test.cc | 31 ++++----- cpp/src/arrow/types/string.cc | 2 +- cpp/src/arrow/types/string.h | 43 ++++++------ cpp/src/arrow/types/struct-test.cc | 6 +- cpp/src/arrow/types/struct.h | 5 +- cpp/src/arrow/types/test-common.h | 4 +- cpp/src/arrow/types/union.h | 10 ++- cpp/src/arrow/util/bit-util.cc | 4 +- cpp/src/arrow/util/bit-util.h | 4 +- 25 files changed, 265 insertions(+), 253 deletions(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index d2c840abfe823..f0eb73dc41371 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -92,7 +92,7 @@ endif() # For CMAKE_BUILD_TYPE=Release # -O3: Enable all compiler optimizations # -g: Enable symbols for profiler tools (TODO: remove for shipping) -set(CXX_FLAGS_DEBUG "-ggdb") +set(CXX_FLAGS_DEBUG "-ggdb -O0") set(CXX_FLAGS_FASTDEBUG "-ggdb -O1") set(CXX_FLAGS_RELEASE "-O3 -g -DNDEBUG") diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc index 16afb9bef348c..df827aaa113aa 100644 --- a/cpp/src/arrow/array-test.cc +++ b/cpp/src/arrow/array-test.cc @@ -20,7 +20,6 @@ #include #include #include -#include #include #include "arrow/array.h" @@ -32,60 +31,60 @@ #include "arrow/util/memory-pool.h" #include "arrow/util/status.h" -using std::string; -using std::vector; - namespace arrow { static TypePtr int32 = TypePtr(new Int32Type()); -static TypePtr int32_nn = TypePtr(new Int32Type(false)); - class TestArray : public ::testing::Test { public: void SetUp() { pool_ = GetDefaultMemoryPool(); - - auto data = std::make_shared(pool_); - auto nulls = std::make_shared(pool_); - - ASSERT_OK(data->Resize(400)); - ASSERT_OK(nulls->Resize(128)); - - arr_.reset(new Int32Array(100, data, nulls)); } protected: MemoryPool* pool_; - std::unique_ptr arr_; }; -TEST_F(TestArray, TestNullable) { - std::shared_ptr tmp = arr_->data(); - std::unique_ptr arr_nn(new Int32Array(100, tmp)); +TEST_F(TestArray, TestNullCount) { + auto data = std::make_shared(pool_); + auto nulls = std::make_shared(pool_); - ASSERT_TRUE(arr_->nullable()); - ASSERT_FALSE(arr_nn->nullable()); + std::unique_ptr arr(new Int32Array(100, data, 10, nulls)); + ASSERT_EQ(10, arr->null_count()); + + std::unique_ptr arr_no_nulls(new Int32Array(100, data)); + ASSERT_EQ(0, arr_no_nulls->null_count()); } TEST_F(TestArray, TestLength) { - ASSERT_EQ(arr_->length(), 100); + auto data = std::make_shared(pool_); + std::unique_ptr arr(new Int32Array(100, data)); + ASSERT_EQ(arr->length(), 100); } TEST_F(TestArray, TestIsNull) { - vector nulls = {1, 0, 1, 
1, 0, 1, 0, 0, - 1, 0, 1, 1, 0, 1, 0, 0, - 1, 0, 1, 1, 0, 1, 0, 0, - 1, 0, 1, 1, 0, 1, 0, 0, - 1, 0, 0, 1}; + std::vector nulls = {1, 0, 1, 1, 0, 1, 0, 0, + 1, 0, 1, 1, 0, 1, 0, 0, + 1, 0, 1, 1, 0, 1, 0, 0, + 1, 0, 1, 1, 0, 1, 0, 0, + 1, 0, 0, 1}; + int32_t null_count = 0; + for (uint8_t x : nulls) { + if (x > 0) ++null_count; + } - std::shared_ptr null_buf = bytes_to_null_buffer(nulls.data(), nulls.size()); + std::shared_ptr null_buf = bytes_to_null_buffer(nulls.data(), + nulls.size()); std::unique_ptr arr; - arr.reset(new Array(int32, nulls.size(), null_buf)); + arr.reset(new Array(int32, nulls.size(), null_count, null_buf)); + + ASSERT_EQ(null_count, arr->null_count()); + ASSERT_EQ(5, null_buf->size()); + + ASSERT_TRUE(arr->nulls()->Equals(*null_buf.get())); - ASSERT_EQ(null_buf->size(), 5); for (size_t i = 0; i < nulls.size(); ++i) { ASSERT_EQ(static_cast(nulls[i]), arr->IsNull(i)); } diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index 1726a2f27d82d..ee4ef66d11e26 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -17,6 +17,8 @@ #include "arrow/array.h" +#include + #include "arrow/util/buffer.h" namespace arrow { @@ -24,18 +26,17 @@ namespace arrow { // ---------------------------------------------------------------------- // Base array class -Array::Array(const TypePtr& type, int64_t length, +Array::Array(const TypePtr& type, int32_t length, int32_t null_count, const std::shared_ptr& nulls) { - Init(type, length, nulls); + Init(type, length, null_count, nulls); } -void Array::Init(const TypePtr& type, int64_t length, +void Array::Init(const TypePtr& type, int32_t length, int32_t null_count, const std::shared_ptr& nulls) { type_ = type; length_ = length; + null_count_ = null_count; nulls_ = nulls; - - nullable_ = type->nullable; if (nulls_) { null_bits_ = nulls_->data(); } diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 0eaa28d528e37..3d748c1bad6f8 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -30,38 +30,49 @@ namespace arrow { class Buffer; // Immutable data array with some logical type and some length. Any memory is -// owned by the respective Buffer instance (or its parents). May or may not be -// nullable. +// owned by the respective Buffer instance (or its parents). // -// The base class only has a null array (if the data type is nullable) +// The base class is only required to have a nulls buffer if the null count is +// greater than 0 // // Any buffers used to initialize the array have their references "stolen". If // you wish to use the buffer beyond the lifetime of the array, you need to // explicitly increment its reference count class Array { public: - Array() : length_(0), nulls_(nullptr), null_bits_(nullptr) {} - Array(const TypePtr& type, int64_t length, + Array() : + null_count_(0), + length_(0), + nulls_(nullptr), + null_bits_(nullptr) {} + + Array(const TypePtr& type, int32_t length, int32_t null_count = 0, const std::shared_ptr& nulls = nullptr); virtual ~Array() {} - void Init(const TypePtr& type, int64_t length, const std::shared_ptr& nulls); + void Init(const TypePtr& type, int32_t length, int32_t null_count, + const std::shared_ptr& nulls); - // Determine if a slot if null. For inner loops. Does *not* boundscheck - bool IsNull(int64_t i) const { - return nullable_ && util::get_bit(null_bits_, i); + // Determine if a slot is null. For inner loops. 
Does *not* boundscheck + bool IsNull(int i) const { + return null_count_ > 0 && util::get_bit(null_bits_, i); } - int64_t length() const { return length_;} - bool nullable() const { return nullable_;} + int32_t length() const { return length_;} + int32_t null_count() const { return null_count_;} + const TypePtr& type() const { return type_;} TypeEnum type_enum() const { return type_->type;} + const std::shared_ptr& nulls() const { + return nulls_; + } + protected: TypePtr type_; - bool nullable_; - int64_t length_; + int32_t null_count_; + int32_t length_; std::shared_ptr nulls_; const uint8_t* null_bits_; diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index cb85067315099..ba70add155186 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -25,34 +25,29 @@ namespace arrow { -Status ArrayBuilder::Init(int64_t capacity) { +Status ArrayBuilder::Init(int32_t capacity) { capacity_ = capacity; - - if (nullable_) { - int64_t to_alloc = util::ceil_byte(capacity) / 8; - nulls_ = std::make_shared(pool_); - RETURN_NOT_OK(nulls_->Resize(to_alloc)); - null_bits_ = nulls_->mutable_data(); - memset(null_bits_, 0, to_alloc); - } + int32_t to_alloc = util::ceil_byte(capacity) / 8; + nulls_ = std::make_shared(pool_); + RETURN_NOT_OK(nulls_->Resize(to_alloc)); + null_bits_ = nulls_->mutable_data(); + memset(null_bits_, 0, to_alloc); return Status::OK(); } -Status ArrayBuilder::Resize(int64_t new_bits) { - if (nullable_) { - int64_t new_bytes = util::ceil_byte(new_bits) / 8; - int64_t old_bytes = nulls_->size(); - RETURN_NOT_OK(nulls_->Resize(new_bytes)); - null_bits_ = nulls_->mutable_data(); - if (old_bytes < new_bytes) { - memset(null_bits_ + old_bytes, 0, new_bytes - old_bytes); - } +Status ArrayBuilder::Resize(int32_t new_bits) { + int32_t new_bytes = util::ceil_byte(new_bits) / 8; + int32_t old_bytes = nulls_->size(); + RETURN_NOT_OK(nulls_->Resize(new_bytes)); + null_bits_ = nulls_->mutable_data(); + if (old_bytes < new_bytes) { + memset(null_bits_ + old_bytes, 0, new_bytes - old_bytes); } return Status::OK(); } -Status ArrayBuilder::Advance(int64_t elements) { - if (nullable_ && length_ + elements > capacity_) { +Status ArrayBuilder::Advance(int32_t elements) { + if (length_ + elements > capacity_) { return Status::Invalid("Builder must be expanded"); } length_ += elements; diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index 456bb04ae090a..491b9133d2cca 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -32,7 +32,7 @@ class Array; class MemoryPool; class PoolBuffer; -static constexpr int64_t MIN_BUILDER_CAPACITY = 1 << 8; +static constexpr int32_t MIN_BUILDER_CAPACITY = 1 << 8; // Base class for all data array builders class ArrayBuilder { @@ -40,8 +40,9 @@ class ArrayBuilder { explicit ArrayBuilder(MemoryPool* pool, const TypePtr& type) : pool_(pool), type_(type), - nullable_(type_->nullable), - nulls_(nullptr), null_bits_(nullptr), + nulls_(nullptr), + null_count_(0), + null_bits_(nullptr), length_(0), capacity_(0) {} @@ -57,21 +58,21 @@ class ArrayBuilder { return children_.size(); } - int64_t length() const { return length_;} - int64_t capacity() const { return capacity_;} - bool nullable() const { return nullable_;} + int32_t length() const { return length_;} + int32_t null_count() const { return null_count_;} + int32_t capacity() const { return capacity_;} // Allocates requires memory at this level, but children need to be // initialized independently - Status Init(int64_t capacity); + Status Init(int32_t capacity); - // 
Resizes the nulls array (if nullable) - Status Resize(int64_t new_bits); + // Resizes the nulls array + Status Resize(int32_t new_bits); // For cases where raw data was memcpy'd into the internal buffers, allows us // to advance the length of the builder. It is your responsibility to use // this function responsibly. - Status Advance(int64_t elements); + Status Advance(int32_t elements); const std::shared_ptr& nulls() const { return nulls_;} @@ -83,15 +84,15 @@ class ArrayBuilder { MemoryPool* pool_; TypePtr type_; - bool nullable_; - // If the type is not nullable, then null_ is nullptr after initialization + // When nulls are first appended to the builder, the null bitmap is allocated std::shared_ptr nulls_; + int32_t null_count_; uint8_t* null_bits_; // Array length, so far. Also, the index of the next element to be added - int64_t length_; - int64_t capacity_; + int32_t length_; + int32_t capacity_; // Child value array builders. These are owned by this class std::vector > children_; diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index 2233a4f832a8c..0898c8e3e3aa3 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -84,6 +84,16 @@ void random_nulls(int64_t n, double pct_null, std::vector* nulls) { } } +static inline int null_count(const std::vector& nulls) { + int result = 0; + for (size_t i = 0; i < nulls.size(); ++i) { + if (nulls[i] > 0) { + ++result; + } + } + return result; +} + std::shared_ptr bytes_to_null_buffer(uint8_t* bytes, int length) { std::shared_ptr out; diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 220f99f4e885a..12f19604c688d 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -57,11 +57,9 @@ struct LayoutType { // either a primitive physical type (bytes or bits of some fixed size), a // nested type consisting of other data types, or another data type (e.g. 
a // timestamp encoded as an int64) -// -// Any data type can be nullable enum class TypeEnum: char { - // A degerate NULL type represented as 0 bytes/bits + // A degenerate NULL type represented as 0 bytes/bits NA = 0, // Little-endian integer types @@ -138,14 +136,12 @@ enum class TypeEnum: char { struct DataType { TypeEnum type; - bool nullable; - explicit DataType(TypeEnum type, bool nullable = true) - : type(type), nullable(nullable) {} + explicit DataType(TypeEnum type) + : type(type) {} virtual bool Equals(const DataType* other) { - return (this == other) || (this->type == other->type && - this->nullable == other->nullable); + return this == other || this->type == other->type; } virtual std::string ToString() const = 0; diff --git a/cpp/src/arrow/types/collection.h b/cpp/src/arrow/types/collection.h index 59ba61419417a..094b63f28988a 100644 --- a/cpp/src/arrow/types/collection.h +++ b/cpp/src/arrow/types/collection.h @@ -29,7 +29,7 @@ template struct CollectionType : public DataType { std::vector child_types_; - explicit CollectionType(bool nullable = true) : DataType(T, nullable) {} + CollectionType() : DataType(T) {} const TypePtr& child(int i) const { return child_types_[i]; diff --git a/cpp/src/arrow/types/datetime.h b/cpp/src/arrow/types/datetime.h index b4d62523c413a..d90883cb01871 100644 --- a/cpp/src/arrow/types/datetime.h +++ b/cpp/src/arrow/types/datetime.h @@ -31,12 +31,12 @@ struct DateType : public DataType { Unit unit; - explicit DateType(Unit unit = Unit::DAY, bool nullable = true) - : DataType(TypeEnum::DATE, nullable), + explicit DateType(Unit unit = Unit::DAY) + : DataType(TypeEnum::DATE), unit(unit) {} DateType(const DateType& other) - : DateType(other.unit, other.nullable) {} + : DateType(other.unit) {} static char const *name() { return "date"; @@ -58,12 +58,12 @@ struct TimestampType : public DataType { Unit unit; - explicit TimestampType(Unit unit = Unit::MILLI, bool nullable = true) - : DataType(TypeEnum::TIMESTAMP, nullable), + explicit TimestampType(Unit unit = Unit::MILLI) + : DataType(TypeEnum::TIMESTAMP), unit(unit) {} TimestampType(const TimestampType& other) - : TimestampType(other.unit, other.nullable) {} + : TimestampType(other.unit) {} static char const *name() { return "timestamp"; diff --git a/cpp/src/arrow/types/json.h b/cpp/src/arrow/types/json.h index 91fd132408fe6..6c2b097a737c7 100644 --- a/cpp/src/arrow/types/json.h +++ b/cpp/src/arrow/types/json.h @@ -28,8 +28,8 @@ struct JSONScalar : public DataType { static TypePtr dense_type; static TypePtr sparse_type; - explicit JSONScalar(bool dense = true, bool nullable = true) - : DataType(TypeEnum::JSON_SCALAR, nullable), + explicit JSONScalar(bool dense = true) + : DataType(TypeEnum::JSON_SCALAR), dense(dense) {} }; diff --git a/cpp/src/arrow/types/list-test.cc b/cpp/src/arrow/types/list-test.cc index abfc8a31b0daa..1d9ddbe607a41 100644 --- a/cpp/src/arrow/types/list-test.cc +++ b/cpp/src/arrow/types/list-test.cc @@ -44,11 +44,7 @@ TEST(TypesTest, TestListType) { std::shared_ptr vt = std::make_shared(); ListType list_type(vt); - ListType list_type_nn(vt, false); - ASSERT_EQ(list_type.type, TypeEnum::LIST); - ASSERT_TRUE(list_type.nullable); - ASSERT_FALSE(list_type_nn.nullable); ASSERT_EQ(list_type.name(), string("list")); ASSERT_EQ(list_type.ToString(), string("list")); @@ -132,8 +128,8 @@ TEST_F(TestListBuilder, TestBasics) { Done(); - ASSERT_TRUE(result_->nullable()); - ASSERT_TRUE(result_->values()->nullable()); + ASSERT_EQ(1, result_->null_count()); + ASSERT_EQ(0, 
result_->values()->null_count()); ASSERT_EQ(3, result_->length()); vector ex_offsets = {0, 3, 3, 7}; @@ -153,10 +149,6 @@ TEST_F(TestListBuilder, TestBasics) { } } -TEST_F(TestListBuilder, TestBasicsNonNullable) { -} - - TEST_F(TestListBuilder, TestZeroLength) { // All buffers are null Done(); diff --git a/cpp/src/arrow/types/list.h b/cpp/src/arrow/types/list.h index 4ca0f13d53c6f..4190b53df01cd 100644 --- a/cpp/src/arrow/types/list.h +++ b/cpp/src/arrow/types/list.h @@ -40,8 +40,8 @@ struct ListType : public DataType { // List can contain any other logical value type TypePtr value_type; - explicit ListType(const TypePtr& value_type, bool nullable = true) - : DataType(TypeEnum::LIST, nullable), + explicit ListType(const TypePtr& value_type) + : DataType(TypeEnum::LIST), value_type(value_type) {} static char const *name() { @@ -56,21 +56,25 @@ class ListArray : public Array { public: ListArray() : Array(), offset_buf_(nullptr), offsets_(nullptr) {} - ListArray(const TypePtr& type, int64_t length, std::shared_ptr offsets, - const ArrayPtr& values, std::shared_ptr nulls = nullptr) { - Init(type, length, offsets, values, nulls); + ListArray(const TypePtr& type, int32_t length, std::shared_ptr offsets, + const ArrayPtr& values, + int32_t null_count = 0, + std::shared_ptr nulls = nullptr) { + Init(type, length, offsets, values, null_count, nulls); } virtual ~ListArray() {} - void Init(const TypePtr& type, int64_t length, std::shared_ptr offsets, - const ArrayPtr& values, std::shared_ptr nulls = nullptr) { + void Init(const TypePtr& type, int32_t length, std::shared_ptr offsets, + const ArrayPtr& values, + int32_t null_count = 0, + std::shared_ptr nulls = nullptr) { offset_buf_ = offsets; offsets_ = offsets == nullptr? nullptr : reinterpret_cast(offset_buf_->data()); values_ = values; - Array::Init(type, length, nulls); + Array::Init(type, length, null_count, nulls); } // Return a shared pointer in case the requestor desires to share ownership @@ -108,7 +112,7 @@ class ListBuilder : public Int32Builder { value_builder_.reset(value_builder); } - Status Init(int64_t elements) { + Status Init(int32_t elements) { // One more than requested. 
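// A length-N list array stores N + 1 int32 offsets, with slot i spanning
// [offsets[i], offsets[i + 1]); e.g. the illustrative list values
// [[0, 1], [], [2]] are stored as the four offsets {0, 2, 2, 3}. Hence
// Init reserves one element beyond the requested length.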
// // XXX: This is slightly imprecise, because we might trigger null mask @@ -116,7 +120,7 @@ class ListBuilder : public Int32Builder { return Int32Builder::Init(elements + 1); } - Status Resize(int64_t capacity) { + Status Resize(int32_t capacity) { // Need space for the end offset RETURN_NOT_OK(Int32Builder::Resize(capacity + 1)); @@ -129,18 +133,15 @@ class ListBuilder : public Int32Builder { // // If passed, null_bytes is of equal length to values, and any nonzero byte // will be considered as a null for that slot - Status Append(T* values, int64_t length, uint8_t* null_bytes = nullptr) { + Status Append(T* values, int32_t length, uint8_t* null_bytes = nullptr) { if (length_ + length > capacity_) { - int64_t new_capacity = util::next_power2(length_ + length); + int32_t new_capacity = util::next_power2(length_ + length); RETURN_NOT_OK(Resize(new_capacity)); } memcpy(raw_buffer() + length_, values, length * elsize_); - if (nullable_ && null_bytes != nullptr) { - // If null_bytes is all not null, then none of the values are null - for (int i = 0; i < length; ++i) { - util::set_bit(null_bits_, length_ + i, static_cast(null_bytes[i])); - } + if (null_bytes != nullptr) { + AppendNulls(null_bytes, length); } length_ += length; @@ -159,9 +160,10 @@ class ListBuilder : public Int32Builder { raw_buffer()[length_] = child_values->length(); } - out->Init(type_, length_, values_, ArrayPtr(child_values), nulls_); + out->Init(type_, length_, values_, ArrayPtr(child_values), + null_count_, nulls_); values_ = nulls_ = nullptr; - capacity_ = length_ = 0; + capacity_ = length_ = null_count_ = 0; return Status::OK(); } @@ -181,10 +183,10 @@ class ListBuilder : public Int32Builder { // If the capacity was not already a multiple of 2, do so here RETURN_NOT_OK(Resize(util::next_power2(capacity_ + 1))); } - if (nullable_) { - util::set_bit(null_bits_, length_, is_null); + if (is_null) { + ++null_count_; + util::set_bit(null_bits_, length_); } - raw_buffer()[length_++] = value_builder_->length(); return Status::OK(); } diff --git a/cpp/src/arrow/types/primitive-test.cc b/cpp/src/arrow/types/primitive-test.cc index 3484294a39f9a..93634432d5ccb 100644 --- a/cpp/src/arrow/types/primitive-test.cc +++ b/cpp/src/arrow/types/primitive-test.cc @@ -53,15 +53,12 @@ TEST(TypesTest, TestBytesType) { #define PRIMITIVE_TEST(KLASS, ENUM, NAME) \ TEST(TypesTest, TestPrimitive_##ENUM) { \ KLASS tp; \ - KLASS tp_nn(false); \ \ ASSERT_EQ(tp.type, TypeEnum::ENUM); \ ASSERT_EQ(tp.name(), string(NAME)); \ - ASSERT_TRUE(tp.nullable); \ - ASSERT_FALSE(tp_nn.nullable); \ \ - KLASS tp_copy = tp_nn; \ - ASSERT_FALSE(tp_copy.nullable); \ + KLASS tp_copy = tp; \ + ASSERT_EQ(tp_copy.type, TypeEnum::ENUM); \ } PRIMITIVE_TEST(Int8Type, INT8, "int8"); @@ -100,17 +97,16 @@ class TestPrimitiveBuilder : public TestBuilder { TestBuilder::SetUp(); type_ = Attrs::type(); - type_nn_ = Attrs::type(false); ArrayBuilder* tmp; ASSERT_OK(make_builder(pool_, type_, &tmp)); builder_.reset(static_cast(tmp)); - ASSERT_OK(make_builder(pool_, type_nn_, &tmp)); + ASSERT_OK(make_builder(pool_, type_, &tmp)); builder_nn_.reset(static_cast(tmp)); } - void RandomData(int64_t N, double pct_null = 0.1) { + void RandomData(int N, double pct_null = 0.1) { Attrs::draw(N, &draws_); random_nulls(N, pct_null, &nulls_); } @@ -118,28 +114,33 @@ class TestPrimitiveBuilder : public TestBuilder { void CheckNullable() { ArrayType result; ArrayType expected; - int64_t size = builder_->length(); + int size = builder_->length(); - auto ex_data = 
std::make_shared(reinterpret_cast(draws_.data()), + auto ex_data = std::make_shared( + reinterpret_cast(draws_.data()), size * sizeof(T)); auto ex_nulls = bytes_to_null_buffer(nulls_.data(), size); - expected.Init(size, ex_data, ex_nulls); + int32_t ex_null_count = null_count(nulls_); + + expected.Init(size, ex_data, ex_null_count, ex_nulls); ASSERT_OK(builder_->Transfer(&result)); // Builder is now reset ASSERT_EQ(0, builder_->length()); ASSERT_EQ(0, builder_->capacity()); + ASSERT_EQ(0, builder_->null_count()); ASSERT_EQ(nullptr, builder_->buffer()); ASSERT_TRUE(result.Equals(expected)); + ASSERT_EQ(ex_null_count, result.null_count()); } void CheckNonNullable() { ArrayType result; ArrayType expected; - int64_t size = builder_nn_->length(); + int size = builder_nn_->length(); auto ex_data = std::make_shared(reinterpret_cast(draws_.data()), size * sizeof(T)); @@ -153,6 +154,7 @@ class TestPrimitiveBuilder : public TestBuilder { ASSERT_EQ(nullptr, builder_nn_->buffer()); ASSERT_TRUE(result.Equals(expected)); + ASSERT_EQ(0, result.null_count()); } protected: @@ -171,14 +173,14 @@ class TestPrimitiveBuilder : public TestBuilder { typedef CapType##Type Type; \ typedef c_type T; \ \ - static TypePtr type(bool nullable = true) { \ - return TypePtr(new Type(nullable)); \ + static TypePtr type() { \ + return TypePtr(new Type()); \ } #define PINT_DECL(CapType, c_type, LOWER, UPPER) \ struct P##CapType { \ PTYPE_DECL(CapType, c_type); \ - static void draw(int64_t N, vector* draws) { \ + static void draw(int N, vector* draws) { \ randint(N, LOWER, UPPER, draws); \ } \ } @@ -208,7 +210,7 @@ TYPED_TEST_CASE(TestPrimitiveBuilder, Primitives); TYPED_TEST(TestPrimitiveBuilder, TestInit) { DECL_T(); - int64_t n = 1000; + int n = 1000; ASSERT_OK(this->builder_->Init(n)); ASSERT_EQ(n, this->builder_->capacity()); ASSERT_EQ(n * sizeof(T), this->builder_->buffer()->size()); diff --git a/cpp/src/arrow/types/primitive.cc b/cpp/src/arrow/types/primitive.cc index 2612e8ca7fd4a..c86260b0fc641 100644 --- a/cpp/src/arrow/types/primitive.cc +++ b/cpp/src/arrow/types/primitive.cc @@ -26,20 +26,23 @@ namespace arrow { // ---------------------------------------------------------------------- // Primitive array base -void PrimitiveArray::Init(const TypePtr& type, int64_t length, +void PrimitiveArray::Init(const TypePtr& type, int32_t length, const std::shared_ptr& data, + int32_t null_count, const std::shared_ptr& nulls) { - Array::Init(type, length, nulls); + Array::Init(type, length, null_count, nulls); data_ = data; raw_data_ = data == nullptr? 
nullptr : data_->data(); } bool PrimitiveArray::Equals(const PrimitiveArray& other) const { if (this == &other) return true; - if (type_->nullable != other.type_->nullable) return false; + if (null_count_ != other.null_count_) { + return false; + } bool equal_data = data_->Equals(*other.data_, length_); - if (type_->nullable) { + if (null_count_ > 0) { return equal_data && nulls_->Equals(*other.nulls_, util::ceil_byte(length_) / 8); } else { diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index c5ae0f78a991b..aa8f351202a20 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -36,24 +36,24 @@ class MemoryPool; template struct PrimitiveType : public DataType { - explicit PrimitiveType(bool nullable = true) - : DataType(Derived::type_enum, nullable) {} + PrimitiveType() + : DataType(Derived::type_enum) {} virtual std::string ToString() const { return std::string(static_cast(this)->name()); } }; -#define PRIMITIVE_DECL(TYPENAME, C_TYPE, ENUM, SIZE, NAME) \ - typedef C_TYPE c_type; \ - static constexpr TypeEnum type_enum = TypeEnum::ENUM; \ - static constexpr int size = SIZE; \ - \ - explicit TYPENAME(bool nullable = true) \ - : PrimitiveType(nullable) {} \ - \ - static const char* name() { \ - return NAME; \ +#define PRIMITIVE_DECL(TYPENAME, C_TYPE, ENUM, SIZE, NAME) \ + typedef C_TYPE c_type; \ + static constexpr TypeEnum type_enum = TypeEnum::ENUM; \ + static constexpr int size = SIZE; \ + \ + TYPENAME() \ + : PrimitiveType() {} \ + \ + static const char* name() { \ + return NAME; \ } @@ -64,7 +64,9 @@ class PrimitiveArray : public Array { virtual ~PrimitiveArray() {} - void Init(const TypePtr& type, int64_t length, const std::shared_ptr& data, + void Init(const TypePtr& type, int32_t length, + const std::shared_ptr& data, + int32_t null_count = 0, const std::shared_ptr& nulls = nullptr); const std::shared_ptr& data() const { return data_;} @@ -84,15 +86,17 @@ class PrimitiveArrayImpl : public PrimitiveArray { PrimitiveArrayImpl() : PrimitiveArray() {} - PrimitiveArrayImpl(int64_t length, const std::shared_ptr& data, + PrimitiveArrayImpl(int32_t length, const std::shared_ptr& data, + int32_t null_count = 0, const std::shared_ptr& nulls = nullptr) { - Init(length, data, nulls); + Init(length, data, null_count, nulls); } - void Init(int64_t length, const std::shared_ptr& data, + void Init(int32_t length, const std::shared_ptr& data, + int32_t null_count = 0, const std::shared_ptr& nulls = nullptr) { - TypePtr type(new TypeClass(nulls != nullptr)); - PrimitiveArray::Init(type, length, data, nulls); + TypePtr type(new TypeClass()); + PrimitiveArray::Init(type, length, data, null_count, nulls); } bool Equals(const PrimitiveArrayImpl& other) const { @@ -101,7 +105,7 @@ class PrimitiveArrayImpl : public PrimitiveArray { const T* raw_data() const { return reinterpret_cast(raw_data_);} - T Value(int64_t i) const { + T Value(int i) const { return raw_data()[i]; } @@ -124,7 +128,7 @@ class PrimitiveBuilder : public ArrayBuilder { virtual ~PrimitiveBuilder() {} - Status Resize(int64_t capacity) { + Status Resize(int32_t capacity) { // XXX: Set floor size for now if (capacity < MIN_BUILDER_CAPACITY) { capacity = MIN_BUILDER_CAPACITY; @@ -135,27 +139,26 @@ class PrimitiveBuilder : public ArrayBuilder { } else { RETURN_NOT_OK(ArrayBuilder::Resize(capacity)); RETURN_NOT_OK(values_->Resize(capacity * elsize_)); - capacity_ = capacity; } + capacity_ = capacity; return Status::OK(); } - Status Init(int64_t capacity) { + Status Init(int32_t 
capacity) { RETURN_NOT_OK(ArrayBuilder::Init(capacity)); - values_ = std::make_shared(pool_); return values_->Resize(capacity * elsize_); } - Status Reserve(int64_t elements) { + Status Reserve(int32_t elements) { if (length_ + elements > capacity_) { - int64_t new_capacity = util::next_power2(length_ + elements); + int32_t new_capacity = util::next_power2(length_ + elements); return Resize(new_capacity); } return Status::OK(); } - Status Advance(int64_t elements) { + Status Advance(int32_t elements) { return ArrayBuilder::Advance(elements); } @@ -165,8 +168,9 @@ class PrimitiveBuilder : public ArrayBuilder { // If the capacity was not already a multiple of 2, do so here RETURN_NOT_OK(Resize(util::next_power2(capacity_ + 1))); } - if (nullable_) { - util::set_bit(null_bits_, length_, is_null); + if (is_null) { + ++null_count_; + util::set_bit(null_bits_, length_); } raw_buffer()[length_++] = val; return Status::OK(); @@ -176,42 +180,49 @@ class PrimitiveBuilder : public ArrayBuilder { // // If passed, null_bytes is of equal length to values, and any nonzero byte // will be considered as a null for that slot - Status Append(const T* values, int64_t length, uint8_t* null_bytes = nullptr) { + Status Append(const T* values, int32_t length, + const uint8_t* null_bytes = nullptr) { if (length_ + length > capacity_) { - int64_t new_capacity = util::next_power2(length_ + length); + int32_t new_capacity = util::next_power2(length_ + length); RETURN_NOT_OK(Resize(new_capacity)); } memcpy(raw_buffer() + length_, values, length * elsize_); - if (nullable_ && null_bytes != nullptr) { - // If null_bytes is all not null, then none of the values are null - for (int64_t i = 0; i < length; ++i) { - util::set_bit(null_bits_, length_ + i, static_cast(null_bytes[i])); - } + if (null_bytes != nullptr) { + AppendNulls(null_bytes, length); } length_ += length; return Status::OK(); } - Status AppendNull() { - if (!nullable_) { - return Status::Invalid("not nullable"); + // Write nulls as uint8_t* into pre-allocated memory + void AppendNulls(const uint8_t* null_bytes, int32_t length) { + // If null_bytes is all not null, then none of the values are null + for (int i = 0; i < length; ++i) { + if (static_cast(null_bytes[i])) { + ++null_count_; + util::set_bit(null_bits_, length_ + i); + } } + } + + Status AppendNull() { if (length_ == capacity_) { // If the capacity was not already a multiple of 2, do so here RETURN_NOT_OK(Resize(util::next_power2(capacity_ + 1))); } - util::set_bit(null_bits_, length_++, true); + ++null_count_; + util::set_bit(null_bits_, length_++); return Status::OK(); } // Initialize an array type instance with the results of this builder // Transfers ownership of all buffers Status Transfer(PrimitiveArray* out) { - out->Init(type_, length_, values_, nulls_); + out->Init(type_, length_, values_, null_count_, nulls_); values_ = nulls_ = nullptr; - capacity_ = length_ = 0; + capacity_ = length_ = null_count_ = 0; return Status::OK(); } @@ -236,7 +247,7 @@ class PrimitiveBuilder : public ArrayBuilder { protected: std::shared_ptr values_; - int64_t elsize_; + int elsize_; }; } // namespace arrow diff --git a/cpp/src/arrow/types/string-test.cc b/cpp/src/arrow/types/string-test.cc index a2d87ead59c59..e1dcebe97f013 100644 --- a/cpp/src/arrow/types/string-test.cc +++ b/cpp/src/arrow/types/string-test.cc @@ -39,7 +39,6 @@ TEST(TypesTest, TestCharType) { CharType t1(5); ASSERT_EQ(t1.type, TypeEnum::CHAR); - ASSERT_TRUE(t1.nullable); ASSERT_EQ(t1.size, 5); ASSERT_EQ(t1.ToString(), std::string("char(5)")); 
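The primitive builder rewrite above drops the per-type `nullable_` flag in favor of a running `null_count_`, which is bumped by `Append`, `AppendNulls`, and `AppendNull` and handed to the array in `Transfer`. A minimal usage sketch, not code from the patch: it assumes `Int32Builder`/`Int32Array` are the int32 typedefs implied by `construct.cc` and `integer.h`, that the relevant Arrow headers are included, and the function name is made up:

    #include <memory>

    Status BuilderNullCountExample(MemoryPool* pool) {
      TypePtr type(new Int32Type());

      ArrayBuilder* tmp;
      RETURN_NOT_OK(make_builder(pool, type, &tmp));
      std::unique_ptr<Int32Builder> builder(static_cast<Int32Builder*>(tmp));

      int32_t values[] = {7, 8, 9, 10};
      uint8_t null_bytes[] = {0, 1, 0, 1};  // any nonzero byte marks a null slot

      RETURN_NOT_OK(builder->Init(8));
      RETURN_NOT_OK(builder->Append(values, 4, null_bytes));  // null_count_ += 2
      RETURN_NOT_OK(builder->AppendNull());                   // null_count_ += 1

      Int32Array result;
      RETURN_NOT_OK(builder->Transfer(&result));
      // result.length() == 5 and result.null_count() == 3; the builder is
      // reset, so its length, capacity, and null count are back to zero.
      return Status::OK();
    }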
@@ -47,7 +46,6 @@ TEST(TypesTest, TestCharType) { // Test copy constructor CharType t2 = t1; ASSERT_EQ(t2.type, TypeEnum::CHAR); - ASSERT_TRUE(t2.nullable); ASSERT_EQ(t2.size, 5); } @@ -56,7 +54,6 @@ TEST(TypesTest, TestVarcharType) { VarcharType t1(5); ASSERT_EQ(t1.type, TypeEnum::VARCHAR); - ASSERT_TRUE(t1.nullable); ASSERT_EQ(t1.size, 5); ASSERT_EQ(t1.physical_type.size, 6); @@ -65,19 +62,14 @@ TEST(TypesTest, TestVarcharType) { // Test copy constructor VarcharType t2 = t1; ASSERT_EQ(t2.type, TypeEnum::VARCHAR); - ASSERT_TRUE(t2.nullable); ASSERT_EQ(t2.size, 5); ASSERT_EQ(t2.physical_type.size, 6); } TEST(TypesTest, TestStringType) { StringType str; - StringType str_nn(false); - ASSERT_EQ(str.type, TypeEnum::STRING); ASSERT_EQ(str.name(), std::string("string")); - ASSERT_TRUE(str.nullable); - ASSERT_FALSE(str_nn.nullable); } // ---------------------------------------------------------------------- @@ -96,7 +88,7 @@ class TestStringContainer : public ::testing::Test { void MakeArray() { length_ = offsets_.size() - 1; - int64_t nchars = chars_.size(); + int nchars = chars_.size(); value_buf_ = to_buffer(chars_); values_ = ArrayPtr(new UInt8Array(nchars, value_buf_)); @@ -104,7 +96,9 @@ class TestStringContainer : public ::testing::Test { offsets_buf_ = to_buffer(offsets_); nulls_buf_ = bytes_to_null_buffer(nulls_.data(), nulls_.size()); - strings_.Init(length_, offsets_buf_, values_, nulls_buf_); + null_count_ = null_count(nulls_); + + strings_.Init(length_, offsets_buf_, values_, null_count_, nulls_buf_); } protected: @@ -118,7 +112,8 @@ class TestStringContainer : public ::testing::Test { std::shared_ptr offsets_buf_; std::shared_ptr nulls_buf_; - int64_t length_; + int null_count_; + int length_; ArrayPtr values_; StringArray strings_; @@ -127,7 +122,7 @@ class TestStringContainer : public ::testing::Test { TEST_F(TestStringContainer, TestArrayBasics) { ASSERT_EQ(length_, strings_.length()); - ASSERT_TRUE(strings_.nullable()); + ASSERT_EQ(1, strings_.null_count()); } TEST_F(TestStringContainer, TestType) { @@ -149,7 +144,8 @@ TEST_F(TestStringContainer, TestListFunctions) { TEST_F(TestStringContainer, TestDestructor) { - auto arr = std::make_shared(length_, offsets_buf_, values_, nulls_buf_); + auto arr = std::make_shared(length_, offsets_buf_, values_, + null_count_, nulls_buf_); } TEST_F(TestStringContainer, TestGetString) { @@ -189,10 +185,6 @@ class TestStringBuilder : public TestBuilder { std::unique_ptr result_; }; -TEST_F(TestStringBuilder, TestAttrs) { - ASSERT_FALSE(builder_->value_builder()->nullable()); -} - TEST_F(TestStringBuilder, TestScalarAppend) { std::vector strings = {"a", "bb", "", "", "ccc"}; std::vector is_null = {0, 0, 0, 1, 0}; @@ -212,10 +204,11 @@ TEST_F(TestStringBuilder, TestScalarAppend) { Done(); ASSERT_EQ(reps * N, result_->length()); + ASSERT_EQ(reps * null_count(is_null), result_->null_count()); ASSERT_EQ(reps * 6, result_->values()->length()); - int64_t length; - int64_t pos = 0; + int32_t length; + int32_t pos = 0; for (int i = 0; i < N * reps; ++i) { if (is_null[i % N]) { ASSERT_TRUE(result_->IsNull(i)); diff --git a/cpp/src/arrow/types/string.cc b/cpp/src/arrow/types/string.cc index f3dfbdc50f7a4..dea42e102b0d0 100644 --- a/cpp/src/arrow/types/string.cc +++ b/cpp/src/arrow/types/string.cc @@ -35,6 +35,6 @@ std::string VarcharType::ToString() const { return s.str(); } -TypePtr StringBuilder::value_type_ = TypePtr(new UInt8Type(false)); +TypePtr StringBuilder::value_type_ = TypePtr(new UInt8Type()); } // namespace arrow diff --git 
a/cpp/src/arrow/types/string.h b/cpp/src/arrow/types/string.h index d0690d9a7d2a4..084562530a8fc 100644 --- a/cpp/src/arrow/types/string.h +++ b/cpp/src/arrow/types/string.h @@ -40,13 +40,13 @@ struct CharType : public DataType { BytesType physical_type; - explicit CharType(int size, bool nullable = true) - : DataType(TypeEnum::CHAR, nullable), + explicit CharType(int size) + : DataType(TypeEnum::CHAR), size(size), physical_type(BytesType(size)) {} CharType(const CharType& other) - : CharType(other.size, other.nullable) {} + : CharType(other.size) {} virtual std::string ToString() const; }; @@ -58,12 +58,12 @@ struct VarcharType : public DataType { BytesType physical_type; - explicit VarcharType(int size, bool nullable = true) - : DataType(TypeEnum::VARCHAR, nullable), + explicit VarcharType(int size) + : DataType(TypeEnum::VARCHAR), size(size), physical_type(BytesType(size + 1)) {} VarcharType(const VarcharType& other) - : VarcharType(other.size, other.nullable) {} + : VarcharType(other.size) {} virtual std::string ToString() const; }; @@ -73,11 +73,11 @@ static const LayoutPtr physical_string = LayoutPtr(new ListLayoutType(byte1)); // String is a logical type consisting of a physical list of 1-byte values struct StringType : public DataType { - explicit StringType(bool nullable = true) - : DataType(TypeEnum::STRING, nullable) {} + StringType() + : DataType(TypeEnum::STRING) {} StringType(const StringType& other) - : StringType(other.nullable) {} + : StringType() {} const LayoutPtr& physical_type() { return physical_string; @@ -98,17 +98,19 @@ class StringArray : public ListArray { public: StringArray() : ListArray(), bytes_(nullptr), raw_bytes_(nullptr) {} - StringArray(int64_t length, const std::shared_ptr& offsets, + StringArray(int32_t length, const std::shared_ptr& offsets, const ArrayPtr& values, + int32_t null_count = 0, const std::shared_ptr& nulls = nullptr) { - Init(length, offsets, values, nulls); + Init(length, offsets, values, null_count, nulls); } - void Init(const TypePtr& type, int64_t length, + void Init(const TypePtr& type, int32_t length, const std::shared_ptr& offsets, const ArrayPtr& values, + int32_t null_count = 0, const std::shared_ptr& nulls = nullptr) { - ListArray::Init(type, length, offsets, values, nulls); + ListArray::Init(type, length, offsets, values, null_count, nulls); // TODO: type validation for values array @@ -117,23 +119,24 @@ class StringArray : public ListArray { raw_bytes_ = bytes_->raw_data(); } - void Init(int64_t length, const std::shared_ptr& offsets, + void Init(int32_t length, const std::shared_ptr& offsets, const ArrayPtr& values, + int32_t null_count = 0, const std::shared_ptr& nulls = nullptr) { - TypePtr type(new StringType(nulls != nullptr)); - Init(type, length, offsets, values, nulls); + TypePtr type(new StringType()); + Init(type, length, offsets, values, null_count, nulls); } // Compute the pointer t - const uint8_t* GetValue(int64_t i, int64_t* out_length) const { + const uint8_t* GetValue(int i, int32_t* out_length) const { int32_t pos = offsets_[i]; *out_length = offsets_[i + 1] - pos; return raw_bytes_ + pos; } // Construct a std::string - std::string GetString(int64_t i) const { - int64_t nchars; + std::string GetString(int i) const { + int32_t nchars; const uint8_t* str = GetValue(i, &nchars); return std::string(reinterpret_cast(str), nchars); } @@ -161,7 +164,7 @@ class StringBuilder : public ListBuilder { value.size()); } - Status Append(const uint8_t* value, int64_t length); + Status Append(const uint8_t* value, int32_t 
length); Status Append(const std::vector& values, uint8_t* null_bytes); diff --git a/cpp/src/arrow/types/struct-test.cc b/cpp/src/arrow/types/struct-test.cc index 644b5457d5851..1a9fc6be4a5ce 100644 --- a/cpp/src/arrow/types/struct-test.cc +++ b/cpp/src/arrow/types/struct-test.cc @@ -43,11 +43,7 @@ TEST(TestStructType, Basics) { vector fields = {f0, f1, f2}; - StructType struct_type(fields, true); - StructType struct_type_nn(fields, false); - - ASSERT_TRUE(struct_type.nullable); - ASSERT_FALSE(struct_type_nn.nullable); + StructType struct_type(fields); ASSERT_TRUE(struct_type.field(0).Equals(f0)); ASSERT_TRUE(struct_type.field(1).Equals(f1)); diff --git a/cpp/src/arrow/types/struct.h b/cpp/src/arrow/types/struct.h index 7d8885b830dba..afba19a7e4699 100644 --- a/cpp/src/arrow/types/struct.h +++ b/cpp/src/arrow/types/struct.h @@ -29,9 +29,8 @@ namespace arrow { struct StructType : public DataType { std::vector fields_; - StructType(const std::vector& fields, - bool nullable = true) - : DataType(TypeEnum::STRUCT, nullable) { + explicit StructType(const std::vector& fields) + : DataType(TypeEnum::STRUCT) { fields_ = fields; } diff --git a/cpp/src/arrow/types/test-common.h b/cpp/src/arrow/types/test-common.h index 3ecb0dec7c04a..1744efce7d631 100644 --- a/cpp/src/arrow/types/test-common.h +++ b/cpp/src/arrow/types/test-common.h @@ -36,15 +36,13 @@ class TestBuilder : public ::testing::Test { void SetUp() { pool_ = GetDefaultMemoryPool(); type_ = TypePtr(new UInt8Type()); - type_nn_ = TypePtr(new UInt8Type(false)); builder_.reset(new UInt8Builder(pool_, type_)); - builder_nn_.reset(new UInt8Builder(pool_, type_nn_)); + builder_nn_.reset(new UInt8Builder(pool_, type_)); } protected: MemoryPool* pool_; TypePtr type_; - TypePtr type_nn_; unique_ptr builder_; unique_ptr builder_nn_; }; diff --git a/cpp/src/arrow/types/union.h b/cpp/src/arrow/types/union.h index 7b66c3b88bf3c..62a3d1c10355d 100644 --- a/cpp/src/arrow/types/union.h +++ b/cpp/src/arrow/types/union.h @@ -33,9 +33,8 @@ class Buffer; struct DenseUnionType : public CollectionType { typedef CollectionType Base; - DenseUnionType(const std::vector& child_types, - bool nullable = true) - : Base(nullable) { + explicit DenseUnionType(const std::vector& child_types) : + Base() { child_types_ = child_types; } @@ -46,9 +45,8 @@ struct DenseUnionType : public CollectionType { struct SparseUnionType : public CollectionType { typedef CollectionType Base; - SparseUnionType(const std::vector& child_types, - bool nullable = true) - : Base(nullable) { + explicit SparseUnionType(const std::vector& child_types) : + Base() { child_types_ = child_types; } diff --git a/cpp/src/arrow/util/bit-util.cc b/cpp/src/arrow/util/bit-util.cc index dbac0a42527be..292cb33887ffa 100644 --- a/cpp/src/arrow/util/bit-util.cc +++ b/cpp/src/arrow/util/bit-util.cc @@ -25,7 +25,9 @@ namespace arrow { void util::bytes_to_bits(uint8_t* bytes, int length, uint8_t* bits) { for (int i = 0; i < length; ++i) { - set_bit(bits, i, static_cast(bytes[i])); + if (static_cast(bytes[i])) { + set_bit(bits, i); + } } } diff --git a/cpp/src/arrow/util/bit-util.h b/cpp/src/arrow/util/bit-util.h index 9ae6127c5ea9c..841f617a3139c 100644 --- a/cpp/src/arrow/util/bit-util.h +++ b/cpp/src/arrow/util/bit-util.h @@ -41,8 +41,8 @@ static inline bool get_bit(const uint8_t* bits, int i) { return bits[i / 8] & (1 << (i % 8)); } -static inline void set_bit(uint8_t* bits, int i, bool is_set) { - bits[i / 8] |= (1 << (i % 8)) * is_set; +static inline void set_bit(uint8_t* bits, int i) { + bits[i / 8] |= 1 << 
(i % 8); } static inline int64_t next_power2(int64_t n) { From 89c6afd2026cab21fbe2b3c81f14335dffde6d08 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 3 Mar 2016 15:35:54 -0800 Subject: [PATCH 0014/1644] ARROW-21: Implement a simple in-memory Schema data structure I also have restored the `nullable` bit to the type metadata only (for the moment mainly to facilitate schema testing / round-trips to Parquet and other media with required/optional distinction) and done some miscellaneous refactoring (`TypeEnum` is renamed to `LogicalType`). Author: Wes McKinney Closes #10 from wesm/ARROW-21 and squashes the following commits: c770f7d [Wes McKinney] Add simple in-memory Schema data structure. Restore nullable bit to type metadata only. Add "?" to nullable type formatting. --- cpp/CMakeLists.txt | 2 + cpp/src/arrow/CMakeLists.txt | 2 +- cpp/src/arrow/array.h | 4 +- cpp/src/arrow/{field-test.cc => field.cc} | 19 +-- cpp/src/arrow/field.h | 17 +- cpp/src/arrow/schema-test.cc | 110 ++++++++++++ cpp/src/arrow/schema.cc | 58 +++++++ cpp/src/arrow/schema.h | 56 +++++++ cpp/src/arrow/type.h | 193 +++++++++++++++------- cpp/src/arrow/types/binary.h | 3 - cpp/src/arrow/types/boolean.h | 4 - cpp/src/arrow/types/collection.h | 2 +- cpp/src/arrow/types/construct.cc | 4 +- cpp/src/arrow/types/datetime.h | 8 +- cpp/src/arrow/types/floating.h | 9 +- cpp/src/arrow/types/integer.h | 33 +--- cpp/src/arrow/types/json.h | 4 +- cpp/src/arrow/types/list-test.cc | 10 +- cpp/src/arrow/types/list.cc | 3 + cpp/src/arrow/types/list.h | 5 +- cpp/src/arrow/types/primitive-test.cc | 4 +- cpp/src/arrow/types/primitive.h | 22 --- cpp/src/arrow/types/string-test.cc | 14 +- cpp/src/arrow/types/string.h | 24 +-- cpp/src/arrow/types/struct-test.cc | 2 +- cpp/src/arrow/types/struct.cc | 1 + cpp/src/arrow/types/struct.h | 4 +- cpp/src/arrow/types/union.h | 8 +- 28 files changed, 434 insertions(+), 191 deletions(-) rename cpp/src/arrow/{field-test.cc => field.cc} (74%) create mode 100644 cpp/src/arrow/schema-test.cc create mode 100644 cpp/src/arrow/schema.cc create mode 100644 cpp/src/arrow/schema.h diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index f0eb73dc41371..5e4c204581369 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -467,6 +467,8 @@ set(LINK_LIBS set(ARROW_SRCS src/arrow/array.cc src/arrow/builder.cc + src/arrow/field.cc + src/arrow/schema.cc src/arrow/type.cc ) diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index eeea2dbc517b4..04f8dd1f908cb 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -30,4 +30,4 @@ install(FILES set(ARROW_TEST_LINK_LIBS arrow_test_util ${ARROW_MIN_TEST_LIBS}) ADD_ARROW_TEST(array-test) -ADD_ARROW_TEST(field-test) +ADD_ARROW_TEST(schema-test) diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 3d748c1bad6f8..0632146637e59 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -62,8 +62,8 @@ class Array { int32_t length() const { return length_;} int32_t null_count() const { return null_count_;} - const TypePtr& type() const { return type_;} - TypeEnum type_enum() const { return type_->type;} + const std::shared_ptr& type() const { return type_;} + LogicalType::type logical_type() const { return type_->type;} const std::shared_ptr& nulls() const { return nulls_; diff --git a/cpp/src/arrow/field-test.cc b/cpp/src/arrow/field.cc similarity index 74% rename from cpp/src/arrow/field-test.cc rename to cpp/src/arrow/field.cc index 2bb8bad4054c3..4568d905c2991 100644 --- a/cpp/src/arrow/field-test.cc +++ 
b/cpp/src/arrow/field.cc @@ -15,24 +15,17 @@ // specific language governing permissions and limitations // under the License. -#include -#include -#include - #include "arrow/field.h" -#include "arrow/type.h" -#include "arrow/types/integer.h" -using std::string; +#include +#include namespace arrow { -TEST(TestField, Basics) { - TypePtr ftype = TypePtr(new Int32Type()); - Field f0("f0", ftype); - - ASSERT_EQ(f0.name, "f0"); - ASSERT_EQ(f0.type->ToString(), ftype->ToString()); +std::string Field::ToString() const { + std::stringstream ss; + ss << this->name << " " << this->type->ToString(); + return ss.str(); } } // namespace arrow diff --git a/cpp/src/arrow/field.h b/cpp/src/arrow/field.h index 664cae61a777a..89a450c66f256 100644 --- a/cpp/src/arrow/field.h +++ b/cpp/src/arrow/field.h @@ -35,12 +35,27 @@ struct Field { TypePtr type; Field(const std::string& name, const TypePtr& type) : - name(name), type(type) {} + name(name), + type(type) {} + + bool operator==(const Field& other) const { + return this->Equals(other); + } + + bool operator!=(const Field& other) const { + return !this->Equals(other); + } bool Equals(const Field& other) const { return (this == &other) || (this->name == other.name && this->type->Equals(other.type.get())); } + + bool nullable() const { + return this->type->nullable; + } + + std::string ToString() const; }; } // namespace arrow diff --git a/cpp/src/arrow/schema-test.cc b/cpp/src/arrow/schema-test.cc new file mode 100644 index 0000000000000..3debb9cec3c00 --- /dev/null +++ b/cpp/src/arrow/schema-test.cc @@ -0,0 +1,110 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
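The `Field` additions above give fields value semantics: `operator==` delegates to `Field::Equals`, which compares the name and the type (and `DataType::Equals` now compares the restored `nullable` bit as well), while `Field::ToString` renders the name followed by the type, whose own `ToString` prefixes `?` when nullable. A short illustrative sketch, assuming `arrow/field.h` and `arrow/type.h` are included; the field name `ts` is made up:

    #include <cassert>

    void FieldEqualityExample() {
      Field f_req("ts", TypePtr(new Int64Type(false)));  // non-nullable int64
      Field f_opt("ts", TypePtr(new Int64Type()));       // nullable by default

      assert(!(f_req == f_opt));                // nullable bit breaks equality
      assert(f_opt.ToString() == "ts ?int64");  // "?" marks a nullable type
    }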
+ +#include +#include +#include +#include + +#include "arrow/field.h" +#include "arrow/schema.h" +#include "arrow/type.h" +#include "arrow/types/string.h" + +using std::shared_ptr; +using std::vector; + +namespace arrow { + +TEST(TestField, Basics) { + shared_ptr ftype = std::make_shared(); + shared_ptr ftype_nn = std::make_shared(false); + Field f0("f0", ftype); + Field f0_nn("f0", ftype_nn); + + ASSERT_EQ(f0.name, "f0"); + ASSERT_EQ(f0.type->ToString(), ftype->ToString()); + + ASSERT_TRUE(f0.nullable()); + ASSERT_FALSE(f0_nn.nullable()); +} + +TEST(TestField, Equals) { + shared_ptr ftype = std::make_shared(); + shared_ptr ftype_nn = std::make_shared(false); + + Field f0("f0", ftype); + Field f0_nn("f0", ftype_nn); + Field f0_other("f0", ftype); + + ASSERT_EQ(f0, f0_other); + ASSERT_NE(f0, f0_nn); +} + +class TestSchema : public ::testing::Test { + public: + void SetUp() {} +}; + +TEST_F(TestSchema, Basics) { + auto f0 = std::make_shared("f0", std::make_shared()); + + auto f1 = std::make_shared("f1", std::make_shared(false)); + auto f1_optional = std::make_shared("f1", std::make_shared()); + + auto f2 = std::make_shared("f2", std::make_shared()); + + vector > fields = {f0, f1, f2}; + auto schema = std::make_shared(fields); + + ASSERT_EQ(3, schema->num_fields()); + ASSERT_EQ(f0, schema->field(0)); + ASSERT_EQ(f1, schema->field(1)); + ASSERT_EQ(f2, schema->field(2)); + + auto schema2 = std::make_shared(fields); + + vector > fields3 = {f0, f1_optional, f2}; + auto schema3 = std::make_shared(fields3); + ASSERT_TRUE(schema->Equals(schema2)); + ASSERT_FALSE(schema->Equals(schema3)); + + ASSERT_TRUE(schema->Equals(*schema2.get())); + ASSERT_FALSE(schema->Equals(*schema3.get())); +} + +TEST_F(TestSchema, ToString) { + auto f0 = std::make_shared("f0", std::make_shared()); + auto f1 = std::make_shared("f1", std::make_shared(false)); + auto f2 = std::make_shared("f2", std::make_shared()); + auto f3 = std::make_shared("f3", + std::make_shared(std::make_shared())); + + vector > fields = {f0, f1, f2, f3}; + auto schema = std::make_shared(fields); + + std::string result = schema->ToString(); + std::string expected = R"(f0 ?int32 +f1 uint8 +f2 ?string +f3 ?list +)"; + + ASSERT_EQ(expected, result); +} + +} // namespace arrow diff --git a/cpp/src/arrow/schema.cc b/cpp/src/arrow/schema.cc new file mode 100644 index 0000000000000..a735fd3d23075 --- /dev/null +++ b/cpp/src/arrow/schema.cc @@ -0,0 +1,58 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
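`Schema::Equals`, implemented just below, compares schemas field-by-field in positional order after a pointer short-circuit and a length check, so two schemas holding the same fields in a different order are not equal. A minimal sketch of that behavior, assuming `arrow/schema.h` and the type headers are included; field names `a`/`b` are illustrative:

    #include <cassert>
    #include <memory>

    void SchemaOrderExample() {
      auto a = std::make_shared<Field>("a", std::make_shared<Int32Type>());
      auto b = std::make_shared<Field>("b", std::make_shared<StringType>());

      Schema s1({a, b});
      Schema s2({b, a});  // same fields, different order

      assert(s1.num_fields() == 2);
      assert(!s1.Equals(s2));  // comparison is positional, so order matters
    }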
+ +#include "arrow/schema.h" + +#include +#include +#include +#include + +#include "arrow/field.h" + +namespace arrow { + +Schema::Schema(const std::vector >& fields) : + fields_(fields) {} + +bool Schema::Equals(const Schema& other) const { + if (this == &other) return true; + if (num_fields() != other.num_fields()) { + return false; + } + for (int i = 0; i < num_fields(); ++i) { + if (!field(i)->Equals(*other.field(i).get())) { + return false; + } + } + return true; +} + +bool Schema::Equals(const std::shared_ptr& other) const { + return Equals(*other.get()); +} + +std::string Schema::ToString() const { + std::stringstream buffer; + + for (auto field : fields_) { + buffer << field->ToString() << std::endl; + } + return buffer.str(); +} + +} // namespace arrow diff --git a/cpp/src/arrow/schema.h b/cpp/src/arrow/schema.h new file mode 100644 index 0000000000000..d04e3f628c1e3 --- /dev/null +++ b/cpp/src/arrow/schema.h @@ -0,0 +1,56 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_SCHEMA_H +#define ARROW_SCHEMA_H + +#include +#include +#include + +#include "arrow/field.h" +#include "arrow/type.h" + +namespace arrow { + +class Schema { + public: + explicit Schema(const std::vector >& fields); + + // Returns true if all of the schema fields are equal + bool Equals(const Schema& other) const; + bool Equals(const std::shared_ptr& other) const; + + // Return the ith schema element. Does not boundscheck + const std::shared_ptr& field(int i) const { + return fields_[i]; + } + + // Render a string representation of the schema suitable for debugging + std::string ToString() const; + + int num_fields() const { + return fields_.size(); + } + + private: + std::vector > fields_; +}; + +} // namespace arrow + +#endif // ARROW_FIELD_H diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 12f19604c688d..04cdb52b535db 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -52,96 +52,98 @@ struct LayoutType { explicit LayoutType(LayoutEnum type) : type(type) {} }; - // Data types in this library are all *logical*. They can be expressed as // either a primitive physical type (bytes or bits of some fixed size), a // nested type consisting of other data types, or another data type (e.g. 
a // timestamp encoded as an int64) +struct LogicalType { + enum type { + // A degenerate NULL type represented as 0 bytes/bits + NA = 0, -enum class TypeEnum: char { - // A degenerate NULL type represented as 0 bytes/bits - NA = 0, - - // Little-endian integer types - UINT8 = 1, - INT8 = 2, - UINT16 = 3, - INT16 = 4, - UINT32 = 5, - INT32 = 6, - UINT64 = 7, - INT64 = 8, + // Little-endian integer types + UINT8 = 1, + INT8 = 2, + UINT16 = 3, + INT16 = 4, + UINT32 = 5, + INT32 = 6, + UINT64 = 7, + INT64 = 8, - // A boolean value represented as 1 byte - BOOL = 9, + // A boolean value represented as 1 byte + BOOL = 9, - // A boolean value represented as 1 bit - BIT = 10, + // A boolean value represented as 1 bit + BIT = 10, - // 4-byte floating point value - FLOAT = 11, + // 4-byte floating point value + FLOAT = 11, - // 8-byte floating point value - DOUBLE = 12, + // 8-byte floating point value + DOUBLE = 12, - // CHAR(N): fixed-length UTF8 string with length N - CHAR = 13, + // CHAR(N): fixed-length UTF8 string with length N + CHAR = 13, - // UTF8 variable-length string as List - STRING = 14, + // UTF8 variable-length string as List + STRING = 14, - // VARCHAR(N): Null-terminated string type embedded in a CHAR(N + 1) - VARCHAR = 15, + // VARCHAR(N): Null-terminated string type embedded in a CHAR(N + 1) + VARCHAR = 15, - // Variable-length bytes (no guarantee of UTF8-ness) - BINARY = 16, + // Variable-length bytes (no guarantee of UTF8-ness) + BINARY = 16, - // By default, int32 days since the UNIX epoch - DATE = 17, + // By default, int32 days since the UNIX epoch + DATE = 17, - // Exact timestamp encoded with int64 since UNIX epoch - // Default unit millisecond - TIMESTAMP = 18, + // Exact timestamp encoded with int64 since UNIX epoch + // Default unit millisecond + TIMESTAMP = 18, - // Timestamp as double seconds since the UNIX epoch - TIMESTAMP_DOUBLE = 19, + // Timestamp as double seconds since the UNIX epoch + TIMESTAMP_DOUBLE = 19, - // Exact time encoded with int64, default unit millisecond - TIME = 20, + // Exact time encoded with int64, default unit millisecond + TIME = 20, - // Precision- and scale-based decimal type. Storage type depends on the - // parameters. - DECIMAL = 21, + // Precision- and scale-based decimal type. Storage type depends on the + // parameters. 
+ DECIMAL = 21, - // Decimal value encoded as a text string - DECIMAL_TEXT = 22, + // Decimal value encoded as a text string + DECIMAL_TEXT = 22, - // A list of some logical data type - LIST = 30, + // A list of some logical data type + LIST = 30, - // Struct of logical types - STRUCT = 31, + // Struct of logical types + STRUCT = 31, - // Unions of logical types - DENSE_UNION = 32, - SPARSE_UNION = 33, + // Unions of logical types + DENSE_UNION = 32, + SPARSE_UNION = 33, - // Union - JSON_SCALAR = 50, + // Union + JSON_SCALAR = 50, - // User-defined type - USER = 60 + // User-defined type + USER = 60 + }; }; - struct DataType { - TypeEnum type; + LogicalType::type type; + bool nullable; - explicit DataType(TypeEnum type) - : type(type) {} + explicit DataType(LogicalType::type type, bool nullable = true) : + type(type), + nullable(nullable) {} virtual bool Equals(const DataType* other) { - return this == other || this->type == other->type; + return this == other || (this->type == other->type && + this->nullable == other->nullable); } virtual std::string ToString() const = 0; @@ -171,6 +173,77 @@ struct ListLayoutType : public LayoutType { value_type(value_type) {} }; +template +struct PrimitiveType : public DataType { + explicit PrimitiveType(bool nullable = true) + : DataType(Derived::type_enum, nullable) {} + + virtual std::string ToString() const { + std::string result; + if (nullable) { + result.append("?"); + } + result.append(static_cast(this)->name()); + return result; + } +}; + +#define PRIMITIVE_DECL(TYPENAME, C_TYPE, ENUM, SIZE, NAME) \ + typedef C_TYPE c_type; \ + static constexpr LogicalType::type type_enum = LogicalType::ENUM; \ + static constexpr int size = SIZE; \ + \ + explicit TYPENAME(bool nullable = true) \ + : PrimitiveType(nullable) {} \ + \ + static const char* name() { \ + return NAME; \ + } + +struct BooleanType : public PrimitiveType { + PRIMITIVE_DECL(BooleanType, uint8_t, BOOL, 1, "bool"); +}; + +struct UInt8Type : public PrimitiveType { + PRIMITIVE_DECL(UInt8Type, uint8_t, UINT8, 1, "uint8"); +}; + +struct Int8Type : public PrimitiveType { + PRIMITIVE_DECL(Int8Type, int8_t, INT8, 1, "int8"); +}; + +struct UInt16Type : public PrimitiveType { + PRIMITIVE_DECL(UInt16Type, uint16_t, UINT16, 2, "uint16"); +}; + +struct Int16Type : public PrimitiveType { + PRIMITIVE_DECL(Int16Type, int16_t, INT16, 2, "int16"); +}; + +struct UInt32Type : public PrimitiveType { + PRIMITIVE_DECL(UInt32Type, uint32_t, UINT32, 4, "uint32"); +}; + +struct Int32Type : public PrimitiveType { + PRIMITIVE_DECL(Int32Type, int32_t, INT32, 4, "int32"); +}; + +struct UInt64Type : public PrimitiveType { + PRIMITIVE_DECL(UInt64Type, uint64_t, UINT64, 8, "uint64"); +}; + +struct Int64Type : public PrimitiveType { + PRIMITIVE_DECL(Int64Type, int64_t, INT64, 8, "int64"); +}; + +struct FloatType : public PrimitiveType { + PRIMITIVE_DECL(FloatType, float, FLOAT, 4, "float"); +}; + +struct DoubleType : public PrimitiveType { + PRIMITIVE_DECL(DoubleType, double, DOUBLE, 8, "double"); +}; + } // namespace arrow #endif // ARROW_TYPE_H diff --git a/cpp/src/arrow/types/binary.h b/cpp/src/arrow/types/binary.h index a9f20046b582b..1fd675e5fdebf 100644 --- a/cpp/src/arrow/types/binary.h +++ b/cpp/src/arrow/types/binary.h @@ -25,9 +25,6 @@ namespace arrow { -struct StringType : public DataType { -}; - } // namespace arrow #endif // ARROW_TYPES_BINARY_H diff --git a/cpp/src/arrow/types/boolean.h b/cpp/src/arrow/types/boolean.h index 31388c8152d52..8fc9cfd19c0d4 100644 --- a/cpp/src/arrow/types/boolean.h +++ 
b/cpp/src/arrow/types/boolean.h @@ -22,10 +22,6 @@ namespace arrow { -struct BooleanType : public PrimitiveType { - PRIMITIVE_DECL(BooleanType, uint8_t, BOOL, 1, "bool"); -}; - typedef PrimitiveArrayImpl BooleanArray; // typedef PrimitiveBuilder BooleanBuilder; diff --git a/cpp/src/arrow/types/collection.h b/cpp/src/arrow/types/collection.h index 094b63f28988a..42a9c926bb134 100644 --- a/cpp/src/arrow/types/collection.h +++ b/cpp/src/arrow/types/collection.h @@ -25,7 +25,7 @@ namespace arrow { -template +template struct CollectionType : public DataType { std::vector child_types_; diff --git a/cpp/src/arrow/types/construct.cc b/cpp/src/arrow/types/construct.cc index e1bb990063c1b..05d6b270fc3fd 100644 --- a/cpp/src/arrow/types/construct.cc +++ b/cpp/src/arrow/types/construct.cc @@ -33,7 +33,7 @@ class ArrayBuilder; // difficult #define BUILDER_CASE(ENUM, BuilderType) \ - case TypeEnum::ENUM: \ + case LogicalType::ENUM: \ *out = static_cast(new BuilderType(pool, type)); \ return Status::OK(); @@ -56,7 +56,7 @@ Status make_builder(MemoryPool* pool, const TypePtr& type, BUILDER_CASE(STRING, StringBuilder); - case TypeEnum::LIST: + case LogicalType::LIST: { ListType* list_type = static_cast(type.get()); ArrayBuilder* value_builder; diff --git a/cpp/src/arrow/types/datetime.h b/cpp/src/arrow/types/datetime.h index d90883cb01871..765fc29dd57ae 100644 --- a/cpp/src/arrow/types/datetime.h +++ b/cpp/src/arrow/types/datetime.h @@ -31,8 +31,8 @@ struct DateType : public DataType { Unit unit; - explicit DateType(Unit unit = Unit::DAY) - : DataType(TypeEnum::DATE), + explicit DateType(Unit unit = Unit::DAY, bool nullable = true) + : DataType(LogicalType::DATE, nullable), unit(unit) {} DateType(const DateType& other) @@ -58,8 +58,8 @@ struct TimestampType : public DataType { Unit unit; - explicit TimestampType(Unit unit = Unit::MILLI) - : DataType(TypeEnum::TIMESTAMP), + explicit TimestampType(Unit unit = Unit::MILLI, bool nullable = true) + : DataType(LogicalType::TIMESTAMP, nullable), unit(unit) {} TimestampType(const TimestampType& other) diff --git a/cpp/src/arrow/types/floating.h b/cpp/src/arrow/types/floating.h index 7551ce665a27b..e7522781d33e3 100644 --- a/cpp/src/arrow/types/floating.h +++ b/cpp/src/arrow/types/floating.h @@ -21,17 +21,10 @@ #include #include "arrow/types/primitive.h" +#include "arrow/type.h" namespace arrow { -struct FloatType : public PrimitiveType { - PRIMITIVE_DECL(FloatType, float, FLOAT, 4, "float"); -}; - -struct DoubleType : public PrimitiveType { - PRIMITIVE_DECL(DoubleType, double, DOUBLE, 8, "double"); -}; - typedef PrimitiveArrayImpl FloatArray; typedef PrimitiveArrayImpl DoubleArray; diff --git a/cpp/src/arrow/types/integer.h b/cpp/src/arrow/types/integer.h index 7e5eab55be0a9..568419124941f 100644 --- a/cpp/src/arrow/types/integer.h +++ b/cpp/src/arrow/types/integer.h @@ -22,41 +22,10 @@ #include #include "arrow/types/primitive.h" +#include "arrow/type.h" namespace arrow { -struct UInt8Type : public PrimitiveType { - PRIMITIVE_DECL(UInt8Type, uint8_t, UINT8, 1, "uint8"); -}; - -struct Int8Type : public PrimitiveType { - PRIMITIVE_DECL(Int8Type, int8_t, INT8, 1, "int8"); -}; - -struct UInt16Type : public PrimitiveType { - PRIMITIVE_DECL(UInt16Type, uint16_t, UINT16, 2, "uint16"); -}; - -struct Int16Type : public PrimitiveType { - PRIMITIVE_DECL(Int16Type, int16_t, INT16, 2, "int16"); -}; - -struct UInt32Type : public PrimitiveType { - PRIMITIVE_DECL(UInt32Type, uint32_t, UINT32, 4, "uint32"); -}; - -struct Int32Type : public PrimitiveType { - 
PRIMITIVE_DECL(Int32Type, int32_t, INT32, 4, "int32"); -}; - -struct UInt64Type : public PrimitiveType { - PRIMITIVE_DECL(UInt64Type, uint64_t, UINT64, 8, "uint64"); -}; - -struct Int64Type : public PrimitiveType { - PRIMITIVE_DECL(Int64Type, int64_t, INT64, 8, "int64"); -}; - // Array containers typedef PrimitiveArrayImpl UInt8Array; diff --git a/cpp/src/arrow/types/json.h b/cpp/src/arrow/types/json.h index 6c2b097a737c7..b67fb3807aded 100644 --- a/cpp/src/arrow/types/json.h +++ b/cpp/src/arrow/types/json.h @@ -28,8 +28,8 @@ struct JSONScalar : public DataType { static TypePtr dense_type; static TypePtr sparse_type; - explicit JSONScalar(bool dense = true) - : DataType(TypeEnum::JSON_SCALAR), + explicit JSONScalar(bool dense = true, bool nullable = true) + : DataType(LogicalType::JSON_SCALAR, nullable), dense(dense) {} }; diff --git a/cpp/src/arrow/types/list-test.cc b/cpp/src/arrow/types/list-test.cc index 1d9ddbe607a41..b4bbd2841a89d 100644 --- a/cpp/src/arrow/types/list-test.cc +++ b/cpp/src/arrow/types/list-test.cc @@ -44,19 +44,19 @@ TEST(TypesTest, TestListType) { std::shared_ptr vt = std::make_shared(); ListType list_type(vt); - ASSERT_EQ(list_type.type, TypeEnum::LIST); + ASSERT_EQ(list_type.type, LogicalType::LIST); ASSERT_EQ(list_type.name(), string("list")); - ASSERT_EQ(list_type.ToString(), string("list")); + ASSERT_EQ(list_type.ToString(), string("?list")); ASSERT_EQ(list_type.value_type->type, vt->type); ASSERT_EQ(list_type.value_type->type, vt->type); - std::shared_ptr st = std::make_shared(); - std::shared_ptr lt = std::make_shared(st); + std::shared_ptr st = std::make_shared(false); + std::shared_ptr lt = std::make_shared(st, false); ASSERT_EQ(lt->ToString(), string("list")); - ListType lt2(lt); + ListType lt2(lt, false); ASSERT_EQ(lt2.ToString(), string("list>")); } diff --git a/cpp/src/arrow/types/list.cc b/cpp/src/arrow/types/list.cc index f0ff5bf928a1a..577d71d0b2892 100644 --- a/cpp/src/arrow/types/list.cc +++ b/cpp/src/arrow/types/list.cc @@ -24,6 +24,9 @@ namespace arrow { std::string ListType::ToString() const { std::stringstream s; + if (this->nullable) { + s << "?"; + } s << "list<" << value_type->ToString() << ">"; return s.str(); } diff --git a/cpp/src/arrow/types/list.h b/cpp/src/arrow/types/list.h index 4190b53df01cd..1fc83536db8c6 100644 --- a/cpp/src/arrow/types/list.h +++ b/cpp/src/arrow/types/list.h @@ -40,8 +40,8 @@ struct ListType : public DataType { // List can contain any other logical value type TypePtr value_type; - explicit ListType(const TypePtr& value_type) - : DataType(TypeEnum::LIST), + explicit ListType(const TypePtr& value_type, bool nullable = true) + : DataType(LogicalType::LIST, nullable), value_type(value_type) {} static char const *name() { @@ -51,7 +51,6 @@ struct ListType : public DataType { virtual std::string ToString() const; }; - class ListArray : public Array { public: ListArray() : Array(), offset_buf_(nullptr), offsets_(nullptr) {} diff --git a/cpp/src/arrow/types/primitive-test.cc b/cpp/src/arrow/types/primitive-test.cc index 93634432d5ccb..02eaaa7542bf0 100644 --- a/cpp/src/arrow/types/primitive-test.cc +++ b/cpp/src/arrow/types/primitive-test.cc @@ -54,11 +54,11 @@ TEST(TypesTest, TestBytesType) { TEST(TypesTest, TestPrimitive_##ENUM) { \ KLASS tp; \ \ - ASSERT_EQ(tp.type, TypeEnum::ENUM); \ + ASSERT_EQ(tp.type, LogicalType::ENUM); \ ASSERT_EQ(tp.name(), string(NAME)); \ \ KLASS tp_copy = tp; \ - ASSERT_EQ(tp_copy.type, TypeEnum::ENUM); \ + ASSERT_EQ(tp_copy.type, LogicalType::ENUM); \ } PRIMITIVE_TEST(Int8Type, INT8, 
"int8"); diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index aa8f351202a20..49040fb66268f 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -34,28 +34,6 @@ namespace arrow { class MemoryPool; -template -struct PrimitiveType : public DataType { - PrimitiveType() - : DataType(Derived::type_enum) {} - - virtual std::string ToString() const { - return std::string(static_cast(this)->name()); - } -}; - -#define PRIMITIVE_DECL(TYPENAME, C_TYPE, ENUM, SIZE, NAME) \ - typedef C_TYPE c_type; \ - static constexpr TypeEnum type_enum = TypeEnum::ENUM; \ - static constexpr int size = SIZE; \ - \ - TYPENAME() \ - : PrimitiveType() {} \ - \ - static const char* name() { \ - return NAME; \ - } - // Base class for fixed-size logical types class PrimitiveArray : public Array { diff --git a/cpp/src/arrow/types/string-test.cc b/cpp/src/arrow/types/string-test.cc index e1dcebe97f013..9af667295026b 100644 --- a/cpp/src/arrow/types/string-test.cc +++ b/cpp/src/arrow/types/string-test.cc @@ -38,14 +38,14 @@ class Buffer; TEST(TypesTest, TestCharType) { CharType t1(5); - ASSERT_EQ(t1.type, TypeEnum::CHAR); + ASSERT_EQ(t1.type, LogicalType::CHAR); ASSERT_EQ(t1.size, 5); ASSERT_EQ(t1.ToString(), std::string("char(5)")); // Test copy constructor CharType t2 = t1; - ASSERT_EQ(t2.type, TypeEnum::CHAR); + ASSERT_EQ(t2.type, LogicalType::CHAR); ASSERT_EQ(t2.size, 5); } @@ -53,7 +53,7 @@ TEST(TypesTest, TestCharType) { TEST(TypesTest, TestVarcharType) { VarcharType t1(5); - ASSERT_EQ(t1.type, TypeEnum::VARCHAR); + ASSERT_EQ(t1.type, LogicalType::VARCHAR); ASSERT_EQ(t1.size, 5); ASSERT_EQ(t1.physical_type.size, 6); @@ -61,14 +61,14 @@ TEST(TypesTest, TestVarcharType) { // Test copy constructor VarcharType t2 = t1; - ASSERT_EQ(t2.type, TypeEnum::VARCHAR); + ASSERT_EQ(t2.type, LogicalType::VARCHAR); ASSERT_EQ(t2.size, 5); ASSERT_EQ(t2.physical_type.size, 6); } TEST(TypesTest, TestStringType) { StringType str; - ASSERT_EQ(str.type, TypeEnum::STRING); + ASSERT_EQ(str.type, LogicalType::STRING); ASSERT_EQ(str.name(), std::string("string")); } @@ -128,8 +128,8 @@ TEST_F(TestStringContainer, TestArrayBasics) { TEST_F(TestStringContainer, TestType) { TypePtr type = strings_.type(); - ASSERT_EQ(TypeEnum::STRING, type->type); - ASSERT_EQ(TypeEnum::STRING, strings_.type_enum()); + ASSERT_EQ(LogicalType::STRING, type->type); + ASSERT_EQ(LogicalType::STRING, strings_.logical_type()); } diff --git a/cpp/src/arrow/types/string.h b/cpp/src/arrow/types/string.h index 084562530a8fc..5795cfed577c5 100644 --- a/cpp/src/arrow/types/string.h +++ b/cpp/src/arrow/types/string.h @@ -40,8 +40,8 @@ struct CharType : public DataType { BytesType physical_type; - explicit CharType(int size) - : DataType(TypeEnum::CHAR), + explicit CharType(int size, bool nullable = true) + : DataType(LogicalType::CHAR, nullable), size(size), physical_type(BytesType(size)) {} @@ -58,8 +58,8 @@ struct VarcharType : public DataType { BytesType physical_type; - explicit VarcharType(int size) - : DataType(TypeEnum::VARCHAR), + explicit VarcharType(int size, bool nullable = true) + : DataType(LogicalType::VARCHAR, nullable), size(size), physical_type(BytesType(size + 1)) {} VarcharType(const VarcharType& other) @@ -73,26 +73,26 @@ static const LayoutPtr physical_string = LayoutPtr(new ListLayoutType(byte1)); // String is a logical type consisting of a physical list of 1-byte values struct StringType : public DataType { - StringType() - : DataType(TypeEnum::STRING) {} + explicit StringType(bool nullable = 
true) + : DataType(LogicalType::STRING, nullable) {} StringType(const StringType& other) : StringType() {} - const LayoutPtr& physical_type() { - return physical_string; - } - static char const *name() { return "string"; } virtual std::string ToString() const { - return name(); + std::string result; + if (nullable) { + result.append("?"); + } + result.append(name()); + return result; } }; - // TODO: add a BinaryArray layer in between class StringArray : public ListArray { public: diff --git a/cpp/src/arrow/types/struct-test.cc b/cpp/src/arrow/types/struct-test.cc index 1a9fc6be4a5ce..df6157104795e 100644 --- a/cpp/src/arrow/types/struct-test.cc +++ b/cpp/src/arrow/types/struct-test.cc @@ -49,7 +49,7 @@ TEST(TestStructType, Basics) { ASSERT_TRUE(struct_type.field(1).Equals(f1)); ASSERT_TRUE(struct_type.field(2).Equals(f2)); - ASSERT_EQ(struct_type.ToString(), "struct"); + ASSERT_EQ(struct_type.ToString(), "?struct"); // TODO: out of bounds for field(...) } diff --git a/cpp/src/arrow/types/struct.cc b/cpp/src/arrow/types/struct.cc index a245656b516cc..6b233bc372af1 100644 --- a/cpp/src/arrow/types/struct.cc +++ b/cpp/src/arrow/types/struct.cc @@ -26,6 +26,7 @@ namespace arrow { std::string StructType::ToString() const { std::stringstream s; + if (nullable) s << "?"; s << "struct<"; for (size_t i = 0; i < fields_.size(); ++i) { if (i > 0) s << ", "; diff --git a/cpp/src/arrow/types/struct.h b/cpp/src/arrow/types/struct.h index afba19a7e4699..e575c31287cb2 100644 --- a/cpp/src/arrow/types/struct.h +++ b/cpp/src/arrow/types/struct.h @@ -29,8 +29,8 @@ namespace arrow { struct StructType : public DataType { std::vector fields_; - explicit StructType(const std::vector& fields) - : DataType(TypeEnum::STRUCT) { + explicit StructType(const std::vector& fields, bool nullable = true) + : DataType(LogicalType::STRUCT, nullable) { fields_ = fields; } diff --git a/cpp/src/arrow/types/union.h b/cpp/src/arrow/types/union.h index 62a3d1c10355d..9aff780c6a392 100644 --- a/cpp/src/arrow/types/union.h +++ b/cpp/src/arrow/types/union.h @@ -30,8 +30,8 @@ namespace arrow { class Buffer; -struct DenseUnionType : public CollectionType<TypeEnum::DENSE_UNION> { - typedef CollectionType<TypeEnum::DENSE_UNION> Base; +struct DenseUnionType : public CollectionType<LogicalType::DENSE_UNION> { + typedef CollectionType<LogicalType::DENSE_UNION> Base; explicit DenseUnionType(const std::vector& child_types) : Base() { @@ -42,8 +42,8 @@ struct DenseUnionType : public CollectionType<TypeEnum::DENSE_UNION> { }; -struct SparseUnionType : public CollectionType<TypeEnum::SPARSE_UNION> { - typedef CollectionType<TypeEnum::SPARSE_UNION> Base; +struct SparseUnionType : public CollectionType<LogicalType::SPARSE_UNION> { + typedef CollectionType<LogicalType::SPARSE_UNION> Base; explicit SparseUnionType(const std::vector& child_types) : Base() { From 307977e39eddf62f832a5f1a452963751c6b36a0 Mon Sep 17 00:00:00 2001 From: proflin Date: Thu, 3 Mar 2016 16:14:47 -0800 Subject: [PATCH 0015/1644] ARROW-15: Fix a naming typo for memory.AllocationManager.AllocationOutcome Rename FORCED_SUCESS to FORCED_SUCCESS in memory.AllocationManager.AllocationOutcome.
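The distinction the renamed constant encodes is worth a gloss: an accountant-style allocator can let a privileged ("forced") request proceed past its configured limit, and callers need to tell that apart from an ordinary success. A minimal C++ sketch of that decision follows; the names are illustrative stand-ins, not the Java API this patch touches:

```cpp
// Illustrative only -- the real logic lives in the Java Accountant class
// patched below; these C++ names are hypothetical.
#include <cstdint>

enum class Outcome {
  SUCCESS,         // fit within the local limit
  FORCED_SUCCESS,  // succeeded, but only by moving beyond the limit
  FAILED           // rejected outright
};

Outcome TryReserve(int64_t used, int64_t limit, int64_t size, bool forced) {
  if (used + size <= limit) {
    return Outcome::SUCCESS;
  }
  // A "forced" request may exceed the limit, but callers still need to
  // distinguish this from an ordinary success -- hence a distinct constant.
  return forced ? Outcome::FORCED_SUCCESS : Outcome::FAILED;
}
```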
Author: proflin Closes #4 from proflin/ARROW-15--Fix-a-naming-typo-for-memory.AllocationManager.AllocationOutcome and squashes the following commits: 0e276fa [proflin] ARROW-15: Fix a naming typo for memory.AllocationManager.AllocationOutcome --- .../src/main/java/org/apache/arrow/memory/Accountant.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/java/memory/src/main/java/org/apache/arrow/memory/Accountant.java b/java/memory/src/main/java/org/apache/arrow/memory/Accountant.java index dc75e5d7231a8..37c598ad89ece 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/Accountant.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/Accountant.java @@ -247,7 +247,7 @@ public static enum AllocationOutcome { /** * Allocation succeeded but only because the allocator was forced to move beyond a limit. */ - FORCED_SUCESS(true), + FORCED_SUCCESS(true), /** * Allocation failed because the local allocator's limits were exceeded. From 0c95d3cc6d954128bf400598878ad9c4228ccbce Mon Sep 17 00:00:00 2001 From: proflin Date: Thu, 3 Mar 2016 16:16:28 -0800 Subject: [PATCH 0016/1644] ARROW-10: Fix mismatch of javadoc names and method parameters Author: proflin Author: Liwei Lin Closes #3 from proflin/ARROW-10--Fix-mismatch-of-javadoc-names-and-method-parameters and squashes the following commits: 99366ab [Liwei Lin] ARROW-10: Fix mismatch of javadoc names and method parameters 9186cb3 [proflin] ARROW-10: Fix mismatch of javadoc names and method parameters 2b1313e [proflin] Fix mismatch of javadoc names and method parameters --- .../main/java/org/apache/arrow/memory/AllocationManager.java | 5 ++--- .../org/apache/arrow/memory/AllocatorClosedException.java | 5 +++-- .../src/main/java/org/apache/arrow/memory/BufferManager.java | 1 + .../main/java/org/apache/arrow/memory/ChildAllocator.java | 5 +---- .../java/org/apache/arrow/memory/util/HistoricalLog.java | 2 +- .../main/java/org/apache/arrow/vector/AllocationHelper.java | 2 +- .../org/apache/arrow/vector/complex/ContainerVectorLike.java | 2 +- 7 files changed, 10 insertions(+), 12 deletions(-) diff --git a/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java b/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java index 37d1d34a62005..43ee9c108d902 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java @@ -328,7 +328,8 @@ public int decrement(int decrement) { * destroyed before use. * * @param allocator - * @return + * A BufferAllocator. + * @return The ledger associated with the BufferAllocator. */ public BufferLedger getLedgerForAllocator(BufferAllocator allocator) { return associate((BaseAllocator) allocator); @@ -356,8 +357,6 @@ public ArrowBuf newArrowBuf(int offset, int length) { * The length in bytes that this ArrowBuf will provide access to. * @param manager * An optional BufferManager argument that can be used to manage expansion of this ArrowBuf - * @param retain - * Whether or not the newly created buffer should get an additional reference count added to it. 
* @return A new ArrowBuf that shares references with all ArrowBufs associated with this BufferLedger */ public ArrowBuf newArrowBuf(int offset, int length, BufferManager manager) { diff --git a/java/memory/src/main/java/org/apache/arrow/memory/AllocatorClosedException.java b/java/memory/src/main/java/org/apache/arrow/memory/AllocatorClosedException.java index 566457981c7ed..3274642dedd59 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/AllocatorClosedException.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/AllocatorClosedException.java @@ -20,11 +20,12 @@ /** * Exception thrown when a closed BufferAllocator is used. Note * this is an unchecked exception. - * - * @param message string associated with the cause */ @SuppressWarnings("serial") public class AllocatorClosedException extends RuntimeException { + /** + * @param message string associated with the cause + */ public AllocatorClosedException(String message) { super(message); } diff --git a/java/memory/src/main/java/org/apache/arrow/memory/BufferManager.java b/java/memory/src/main/java/org/apache/arrow/memory/BufferManager.java index d6470fa51e7a2..8969434791012 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/BufferManager.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/BufferManager.java @@ -43,6 +43,7 @@ public interface BufferManager extends AutoCloseable { * @param newSize * Size of new replacement buffer. * @return + * A new version of the buffer. */ public ArrowBuf replace(ArrowBuf old, int newSize); diff --git a/java/memory/src/main/java/org/apache/arrow/memory/ChildAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/ChildAllocator.java index 6f120e5328bd4..11c9063fc9c69 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/ChildAllocator.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/ChildAllocator.java @@ -31,15 +31,12 @@ class ChildAllocator extends BaseAllocator { * Constructor. * * @param parentAllocator parent allocator -- the one creating this child - * @param allocatorOwner a handle to the object making the request - * @param allocationPolicy the allocation policy to use; the policy for all - * allocators must match for each invocation of a drillbit + * @param name the name of this child allocator * @param initReservation initial amount of space to reserve (obtained from the parent) * @param maxAllocation maximum amount of space that can be obtained from this allocator; * note this includes direct allocations (via {@see BufferAllocator#buffer(int, int)} * et al) and requests from descendant allocators. 
Depending on the allocation policy in * force, even less memory may be available - * @param flags one or more of BaseAllocator.F_* flags */ ChildAllocator( BaseAllocator parentAllocator, diff --git a/java/memory/src/main/java/org/apache/arrow/memory/util/HistoricalLog.java b/java/memory/src/main/java/org/apache/arrow/memory/util/HistoricalLog.java index 38cb779343ab6..c9b5c5385c596 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/util/HistoricalLog.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/util/HistoricalLog.java @@ -126,7 +126,7 @@ public void buildHistory(final StringBuilder sb, boolean includeStackTrace) { /** * * @param sb - * @param indexLevel + * @param indent * @param includeStackTrace */ public synchronized void buildHistory(final StringBuilder sb, int indent, boolean includeStackTrace) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/AllocationHelper.java b/java/vector/src/main/java/org/apache/arrow/vector/AllocationHelper.java index 54c3cd7331e0f..15c3a0227c656 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/AllocationHelper.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/AllocationHelper.java @@ -49,7 +49,7 @@ public static void allocate(ValueVector v, int valueCount, int bytesPerValue, in * Allocates the exact amount if v is fixed width, otherwise falls back to dynamic allocation * @param v value vector we are trying to allocate * @param valueCount size we are trying to allocate - * @throws org.apache.drill.exec.memory.OutOfMemoryException if it can't allocate the memory + * @throws org.apache.arrow.memory.OutOfMemoryException if it can't allocate the memory */ public static void allocateNew(ValueVector v, int valueCount) { if (v instanceof FixedWidthVector) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/ContainerVectorLike.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/ContainerVectorLike.java index e50b0d0d0a5ea..655b55a6aa2c6 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/ContainerVectorLike.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/ContainerVectorLike.java @@ -31,7 +31,7 @@ public interface ContainerVectorLike { * * @param descriptor vector descriptor * @return result of operation wrapping vector corresponding to the given descriptor and whether it's newly created - * @throws org.apache.drill.common.exceptions.DrillRuntimeException + * @throws org.apache.arrow.vector.util.SchemaChangeRuntimeException * if schema change is not permissible between the given and existing data vector types. */ AddOrGetResult addOrGetVector(VectorDescriptor descriptor); From 3b777c7f43d75444f040351b8ae4b735250f2efc Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 3 Mar 2016 16:18:28 -0800 Subject: [PATCH 0017/1644] ARROW-26: Add instructions for enabling Arrow C++ Parquet adapter build This patch documents the workflow for building the optional Arrow-Parquet C++ integration. 
I originally thought about adding an option to build it in Arrow's thirdparty, but it immediately results in a dependency-hell situation (Parquet requires Thrift, Boost, snappy, lz4, zlib) Author: Wes McKinney Closes #12 from wesm/ARROW-26 and squashes the following commits: b28fd75 [Wes McKinney] Add instructions for enabling Arrow C++ Parquet adapter build --- cpp/CMakeLists.txt | 4 ++-- cpp/doc/Parquet.md | 24 ++++++++++++++++++++++++ 2 files changed, 26 insertions(+), 2 deletions(-) create mode 100644 cpp/doc/Parquet.md diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 5e4c204581369..f425c5f310673 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -54,7 +54,7 @@ endif() # Top level cmake dir if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") - option(ARROW_WITH_PARQUET + option(ARROW_PARQUET "Build the Parquet adapter and link to libparquet" OFF) @@ -441,7 +441,7 @@ endif (UNIX) #---------------------------------------------------------------------- # Parquet adapter -if(ARROW_WITH_PARQUET) +if(ARROW_PARQUET) find_package(Parquet REQUIRED) include_directories(SYSTEM ${PARQUET_INCLUDE_DIR}) ADD_THIRDPARTY_LIB(parquet diff --git a/cpp/doc/Parquet.md b/cpp/doc/Parquet.md new file mode 100644 index 0000000000000..370ac833388fc --- /dev/null +++ b/cpp/doc/Parquet.md @@ -0,0 +1,24 @@ +## Building Arrow-Parquet integration + +To build the Arrow C++'s Parquet adapter library, you must first build [parquet-cpp][1]: + +```bash +# Set this to your preferred install location +export PARQUET_HOME=$HOME/local + +git clone https://github.com/apache/parquet-cpp.git +cd parquet-cpp +source setup_build_env.sh +cmake -DCMAKE_INSTALL_PREFIX=$PARQUET_HOME +make -j4 +make install +``` + +Make sure that `$PARQUET_HOME` is set to the installation location. 
Now, build +Arrow with the Parquet adapter enabled: + +```bash +cmake -DARROW_PARQUET=ON +``` + +[1]: https://github.com/apache/parquet-cpp \ No newline at end of file From 9c2b95446abe1ec4dd5c25215c9595a3d7b49f2b Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 4 Mar 2016 15:02:10 -0800 Subject: [PATCH 0018/1644] ARROW-23: Add a logical Column data structure I also added global const instances of common primitive types Author: Wes McKinney Closes #15 from wesm/ARROW-23 and squashes the following commits: 1835d33 [Wes McKinney] Don't use auto 988135c [Wes McKinney] Add Column chunk type validation function 8a2e40e [Wes McKinney] Remove unneeded operator()/shared_from_this experiment de9ec70 [Wes McKinney] Aggregate null counts too 7049314 [Wes McKinney] cpplint a565d26 [Wes McKinney] Add ChunkedArray / Column ctors, test passes 0648ed2 [Wes McKinney] Prototyping --- cpp/CMakeLists.txt | 2 + cpp/src/arrow/array.h | 1 - cpp/src/arrow/schema-test.cc | 7 +- cpp/src/arrow/table/CMakeLists.txt | 39 +++++++++++ cpp/src/arrow/table/column-test.cc | 93 ++++++++++++++++++++++++++ cpp/src/arrow/table/column.cc | 62 +++++++++++++++++ cpp/src/arrow/table/column.h | 103 +++++++++++++++++++++++++++++ cpp/src/arrow/type.cc | 12 ++++ cpp/src/arrow/type.h | 17 +++++ cpp/src/arrow/types/list.h | 2 +- cpp/src/arrow/types/primitive.h | 20 +++--- cpp/src/arrow/util/bit-util.h | 4 ++ 12 files changed, 347 insertions(+), 15 deletions(-) create mode 100644 cpp/src/arrow/table/CMakeLists.txt create mode 100644 cpp/src/arrow/table/column-test.cc create mode 100644 cpp/src/arrow/table/column.cc create mode 100644 cpp/src/arrow/table/column.h diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index f425c5f310673..15afb1acf67cf 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -458,10 +458,12 @@ endif() add_subdirectory(src/arrow) add_subdirectory(src/arrow/util) +add_subdirectory(src/arrow/table) add_subdirectory(src/arrow/types) set(LINK_LIBS arrow_util + arrow_table arrow_types) set(ARROW_SRCS diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 0632146637e59..85e853e2ae5e2 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -81,7 +81,6 @@ class Array { DISALLOW_COPY_AND_ASSIGN(Array); }; - typedef std::shared_ptr ArrayPtr; } // namespace arrow diff --git a/cpp/src/arrow/schema-test.cc b/cpp/src/arrow/schema-test.cc index 3debb9cec3c00..7c190d068c2a6 100644 --- a/cpp/src/arrow/schema-test.cc +++ b/cpp/src/arrow/schema-test.cc @@ -31,7 +31,7 @@ using std::vector; namespace arrow { TEST(TestField, Basics) { - shared_ptr ftype = std::make_shared(); + shared_ptr ftype = INT32; shared_ptr ftype_nn = std::make_shared(false); Field f0("f0", ftype); Field f0_nn("f0", ftype_nn); @@ -44,7 +44,7 @@ TEST(TestField, Basics) { } TEST(TestField, Equals) { - shared_ptr ftype = std::make_shared(); + shared_ptr ftype = INT32; shared_ptr ftype_nn = std::make_shared(false); Field f0("f0", ftype); @@ -61,8 +61,7 @@ class TestSchema : public ::testing::Test { }; TEST_F(TestSchema, Basics) { - auto f0 = std::make_shared("f0", std::make_shared()); - + auto f0 = std::make_shared("f0", INT32); auto f1 = std::make_shared("f1", std::make_shared(false)); auto f1_optional = std::make_shared("f1", std::make_shared()); diff --git a/cpp/src/arrow/table/CMakeLists.txt b/cpp/src/arrow/table/CMakeLists.txt new file mode 100644 index 0000000000000..a401622d2e0d7 --- /dev/null +++ b/cpp/src/arrow/table/CMakeLists.txt @@ -0,0 +1,39 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more 
contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +####################################### +# arrow_table +####################################### + +set(TABLE_SRCS + column.cc +) + +set(TABLE_LIBS +) + +add_library(arrow_table STATIC + ${TABLE_SRCS} +) +target_link_libraries(arrow_table ${TABLE_LIBS}) +SET_TARGET_PROPERTIES(arrow_table PROPERTIES LINKER_LANGUAGE CXX) + +# Headers: top level +install(FILES + DESTINATION include/arrow/table) + +ADD_ARROW_TEST(column-test) diff --git a/cpp/src/arrow/table/column-test.cc b/cpp/src/arrow/table/column-test.cc new file mode 100644 index 0000000000000..15f554f46325d --- /dev/null +++ b/cpp/src/arrow/table/column-test.cc @@ -0,0 +1,93 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include +#include +#include +#include +#include + +#include "arrow/field.h" +#include "arrow/schema.h" +#include "arrow/table/column.h" +#include "arrow/test-util.h" +#include "arrow/type.h" +#include "arrow/types/integer.h" +#include "arrow/util/bit-util.h" +#include "arrow/util/buffer.h" +#include "arrow/util/memory-pool.h" +#include "arrow/util/status.h" + +using std::shared_ptr; +using std::vector; + +namespace arrow { + +class TestColumn : public ::testing::Test { + public: + void SetUp() { + pool_ = GetDefaultMemoryPool(); + } + + template + std::shared_ptr MakeArray(int32_t length, int32_t null_count = 0) { + auto data = std::make_shared(pool_); + auto nulls = std::make_shared(pool_); + data->Resize(length * sizeof(typename ArrayType::value_type)); + nulls->Resize(util::bytes_for_bits(length)); + return std::make_shared(length, data, 10, nulls); + } + + protected: + MemoryPool* pool_; + + std::shared_ptr data_; + std::unique_ptr column_; +}; + +TEST_F(TestColumn, BasicAPI) { + ArrayVector arrays; + arrays.push_back(MakeArray(100)); + arrays.push_back(MakeArray(100, 10)); + arrays.push_back(MakeArray(100, 20)); + + auto field = std::make_shared("c0", INT32); + column_.reset(new Column(field, arrays)); + + ASSERT_EQ("c0", column_->name()); + ASSERT_TRUE(column_->type()->Equals(INT32)); + ASSERT_EQ(300, column_->length()); + ASSERT_EQ(30, column_->null_count()); + ASSERT_EQ(3, column_->data()->num_chunks()); +} + +TEST_F(TestColumn, ChunksInhomogeneous) { + ArrayVector arrays; + arrays.push_back(MakeArray(100)); + arrays.push_back(MakeArray(100, 10)); + + auto field = std::make_shared("c0", INT32); + column_.reset(new Column(field, arrays)); + + ASSERT_OK(column_->ValidateData()); + + arrays.push_back(MakeArray(100, 10)); + column_.reset(new Column(field, arrays)); + ASSERT_RAISES(Invalid, column_->ValidateData()); +} + +} // namespace arrow diff --git a/cpp/src/arrow/table/column.cc b/cpp/src/arrow/table/column.cc new file mode 100644 index 0000000000000..82750cf4d4306 --- /dev/null +++ b/cpp/src/arrow/table/column.cc @@ -0,0 +1,62 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include "arrow/table/column.h" + +#include +#include + +#include "arrow/field.h" +#include "arrow/util/status.h" + +namespace arrow { + +ChunkedArray::ChunkedArray(const ArrayVector& chunks) : + chunks_(chunks) { + length_ = 0; + for (const std::shared_ptr& chunk : chunks) { + length_ += chunk->length(); + null_count_ += chunk->null_count(); + } +} + +Column::Column(const std::shared_ptr& field, const ArrayVector& chunks) : + field_(field) { + data_ = std::make_shared(chunks); +} + +Column::Column(const std::shared_ptr& field, + const std::shared_ptr& data) : + field_(field), + data_(data) {} + +Status Column::ValidateData() { + for (int i = 0; i < data_->num_chunks(); ++i) { + const std::shared_ptr& type = data_->chunk(i)->type(); + if (!this->type()->Equals(type)) { + std::stringstream ss; + ss << "In chunk " << i << " expected type " + << this->type()->ToString() + << " but saw " + << type->ToString(); + return Status::Invalid(ss.str()); + } + } + return Status::OK(); +} + +} // namespace arrow diff --git a/cpp/src/arrow/table/column.h b/cpp/src/arrow/table/column.h new file mode 100644 index 0000000000000..9e9064e86545d --- /dev/null +++ b/cpp/src/arrow/table/column.h @@ -0,0 +1,103 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_TABLE_COLUMN_H +#define ARROW_TABLE_COLUMN_H + +#include +#include +#include + +#include "arrow/array.h" +#include "arrow/field.h" + +namespace arrow { + +typedef std::vector > ArrayVector; + +// A data structure managing a list of primitive Arrow arrays logically as one +// large array +class ChunkedArray { + public: + explicit ChunkedArray(const ArrayVector& chunks); + + // @returns: the total length of the chunked array; computed on construction + int64_t length() const { + return length_; + } + + int64_t null_count() const { + return null_count_; + } + + int num_chunks() const { + return chunks_.size(); + } + + const std::shared_ptr& chunk(int i) const { + return chunks_[i]; + } + + protected: + ArrayVector chunks_; + int64_t length_; + int64_t null_count_; +}; + +// An immutable column data structure consisting of a field (type metadata) and +// a logical chunked data array (which can be validated as all being the same +// type). 
+class Column { + public: + Column(const std::shared_ptr& field, const ArrayVector& chunks); + Column(const std::shared_ptr& field, + const std::shared_ptr& data); + + int64_t length() const { + return data_->length(); + } + + int64_t null_count() const { + return data_->null_count(); + } + + // @returns: the column's name in the passed metadata + const std::string& name() const { + return field_->name; + } + + // @returns: the column's type according to the metadata + const std::shared_ptr& type() const { + return field_->type; + } + + // @returns: the column's data as a chunked logical array + const std::shared_ptr& data() const { + return data_; + } + // Verify that the column's array data is consistent with the passed field's + // metadata + Status ValidateData(); + + protected: + std::shared_ptr field_; + std::shared_ptr data_; +}; + +} // namespace arrow + +#endif // ARROW_TABLE_COLUMN_H diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 492eee52b04b1..ff145e2c1e3b4 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -19,4 +19,16 @@ namespace arrow { +const std::shared_ptr BOOL = std::make_shared(); +const std::shared_ptr UINT8 = std::make_shared(); +const std::shared_ptr UINT16 = std::make_shared(); +const std::shared_ptr UINT32 = std::make_shared(); +const std::shared_ptr UINT64 = std::make_shared(); +const std::shared_ptr INT8 = std::make_shared(); +const std::shared_ptr INT16 = std::make_shared(); +const std::shared_ptr INT32 = std::make_shared(); +const std::shared_ptr INT64 = std::make_shared(); +const std::shared_ptr FLOAT = std::make_shared(); +const std::shared_ptr DOUBLE = std::make_shared(); + } // namespace arrow diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 04cdb52b535db..4193a0e8bc851 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -142,10 +142,15 @@ struct DataType { nullable(nullable) {} virtual bool Equals(const DataType* other) { + // Call with a pointer so more friendly to subclasses return this == other || (this->type == other->type && this->nullable == other->nullable); } + bool Equals(const std::shared_ptr& other) { + return Equals(other.get()); + } + virtual std::string ToString() const = 0; }; @@ -244,6 +249,18 @@ struct DoubleType : public PrimitiveType { PRIMITIVE_DECL(DoubleType, double, DOUBLE, 8, "double"); }; +extern const std::shared_ptr BOOL; +extern const std::shared_ptr UINT8; +extern const std::shared_ptr UINT16; +extern const std::shared_ptr UINT32; +extern const std::shared_ptr UINT64; +extern const std::shared_ptr INT8; +extern const std::shared_ptr INT16; +extern const std::shared_ptr INT32; +extern const std::shared_ptr INT64; +extern const std::shared_ptr FLOAT; +extern const std::shared_ptr DOUBLE; + } // namespace arrow #endif // ARROW_TYPE_H diff --git a/cpp/src/arrow/types/list.h b/cpp/src/arrow/types/list.h index 1fc83536db8c6..f39fe5c4d811b 100644 --- a/cpp/src/arrow/types/list.h +++ b/cpp/src/arrow/types/list.h @@ -132,7 +132,7 @@ class ListBuilder : public Int32Builder { // // If passed, null_bytes is of equal length to values, and any nonzero byte // will be considered as a null for that slot - Status Append(T* values, int32_t length, uint8_t* null_bytes = nullptr) { + Status Append(value_type* values, int32_t length, uint8_t* null_bytes = nullptr) { if (length_ + length > capacity_) { int32_t new_capacity = util::next_power2(length_ + length); RETURN_NOT_OK(Resize(new_capacity)); diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index 
49040fb66268f..09d43e7ec8b80 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -60,7 +60,7 @@ class PrimitiveArray : public Array { template class PrimitiveArrayImpl : public PrimitiveArray { public: - typedef typename TypeClass::c_type T; + typedef typename TypeClass::c_type value_type; PrimitiveArrayImpl() : PrimitiveArray() {} @@ -81,9 +81,11 @@ class PrimitiveArrayImpl : public PrimitiveArray { return PrimitiveArray::Equals(*static_cast(&other)); } - const T* raw_data() const { return reinterpret_cast(raw_data_);} + const value_type* raw_data() const { + return reinterpret_cast(raw_data_); + } - T Value(int i) const { + value_type Value(int i) const { return raw_data()[i]; } @@ -96,12 +98,12 @@ class PrimitiveArrayImpl : public PrimitiveArray { template class PrimitiveBuilder : public ArrayBuilder { public: - typedef typename Type::c_type T; + typedef typename Type::c_type value_type; explicit PrimitiveBuilder(MemoryPool* pool, const TypePtr& type) : ArrayBuilder(pool, type), values_(nullptr) { - elsize_ = sizeof(T); + elsize_ = sizeof(value_type); } virtual ~PrimitiveBuilder() {} @@ -141,7 +143,7 @@ class PrimitiveBuilder : public ArrayBuilder { } // Scalar append - Status Append(T val, bool is_null = false) { + Status Append(value_type val, bool is_null = false) { if (length_ == capacity_) { // If the capacity was not already a multiple of 2, do so here RETURN_NOT_OK(Resize(util::next_power2(capacity_ + 1))); @@ -158,7 +160,7 @@ class PrimitiveBuilder : public ArrayBuilder { // // If passed, null_bytes is of equal length to values, and any nonzero byte // will be considered as a null for that slot - Status Append(const T* values, int32_t length, + Status Append(const value_type* values, int32_t length, const uint8_t* null_bytes = nullptr) { if (length_ + length > capacity_) { int32_t new_capacity = util::next_power2(length_ + length); @@ -215,8 +217,8 @@ class PrimitiveBuilder : public ArrayBuilder { return Status::OK(); } - T* raw_buffer() { - return reinterpret_cast(values_->mutable_data()); + value_type* raw_buffer() { + return reinterpret_cast(values_->mutable_data()); } std::shared_ptr buffer() const { diff --git a/cpp/src/arrow/util/bit-util.h b/cpp/src/arrow/util/bit-util.h index 841f617a3139c..5e7197f901222 100644 --- a/cpp/src/arrow/util/bit-util.h +++ b/cpp/src/arrow/util/bit-util.h @@ -33,6 +33,10 @@ static inline int64_t ceil_byte(int64_t size) { return (size + 7) & ~7; } +static inline int64_t bytes_for_bits(int64_t size) { + return ceil_byte(size) / 8; +} + static inline int64_t ceil_2bytes(int64_t size) { return (size + 15) & ~15; } From 612fbc74ece160a52edbd260de8391aa07ad00ca Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 4 Mar 2016 17:59:58 -0800 Subject: [PATCH 0019/1644] ARROW-24: C++: Implement a logical Table container type A table enables us to interpret a collection of Arrow arrays as a logical table or "data frame"-like structure. Each column may consist of one or more "primitive" Arrow memory containers. Note that this currently has the limitation that the table column names must be strings. At least, this is consistent with most storage media and up-stack table implementations (e.g. R's data.frame). 
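To make the shape of the API concrete, here is a short sketch of how the containers from ARROW-23 and this patch compose. It mirrors the constructors and test code in the diff below; the chunk vectors are assumed to be pre-built arrays of matching total length:

```cpp
#include <memory>
#include <vector>

#include "arrow/field.h"
#include "arrow/table/column.h"
#include "arrow/table/schema.h"
#include "arrow/table/table.h"
#include "arrow/type.h"
#include "arrow/util/status.h"

using namespace arrow;

// f0_chunks / f1_chunks: pre-built arrays whose total lengths match.
std::shared_ptr<Table> MakeExampleTable(const ArrayVector& f0_chunks,
                                        const ArrayVector& f1_chunks) {
  auto f0 = std::make_shared<Field>("f0", INT32);
  auto f1 = std::make_shared<Field>("f1", UINT8);
  auto schema = std::make_shared<Schema>(
      std::vector<std::shared_ptr<Field>>{f0, f1});

  // Each Column wraps its chunks in a ChunkedArray internally.
  std::vector<std::shared_ptr<Column>> columns = {
      std::make_shared<Column>(f0, f0_chunks),
      std::make_shared<Column>(f1, f1_chunks)};

  auto table = std::make_shared<Table>("data", schema, columns);
  // Checks the column count against the schema and each column's
  // length against num_rows().
  Status st = table->ValidateColumns();
  return st.ok() ? table : nullptr;
}
```

Note that the table is immutable once constructed; validation is a separate, explicit step rather than something the constructor enforces.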
Currently this is somewhat limited in the arrangement of data (a vector of chunked columns -- the columns may contain only one data chunk) -- since a Table might be assembled from a vector of row batches (coming across the wire), "pivoting" the row batches might have performance implications that we can examine further on down the road. Author: Wes McKinney Closes #16 from wesm/ARROW-24 and squashes the following commits: b701c76 [Wes McKinney] Test case for wrong number of columns passed 5faa5ac [Wes McKinney] cpplint 9a651cb [Wes McKinney] Basic table prototype. Move Schema code under arrow/table --- cpp/CMakeLists.txt | 1 - cpp/src/arrow/CMakeLists.txt | 1 - cpp/src/arrow/table/CMakeLists.txt | 4 + cpp/src/arrow/table/column-test.cc | 37 ++----- cpp/src/arrow/table/column.cc | 6 ++ cpp/src/arrow/table/column.h | 2 + cpp/src/arrow/{ => table}/schema-test.cc | 2 +- cpp/src/arrow/{ => table}/schema.cc | 2 +- cpp/src/arrow/{ => table}/schema.h | 0 cpp/src/arrow/table/table-test.cc | 125 +++++++++++++++++++++++ cpp/src/arrow/table/table.cc | 73 +++++++++++++ cpp/src/arrow/table/table.h | 82 +++++++++++++++ cpp/src/arrow/table/test-common.h | 55 ++++++++++ 13 files changed, 358 insertions(+), 32 deletions(-) rename cpp/src/arrow/{ => table}/schema-test.cc (99%) rename cpp/src/arrow/{ => table}/schema.cc (98%) rename cpp/src/arrow/{ => table}/schema.h (100%) create mode 100644 cpp/src/arrow/table/table-test.cc create mode 100644 cpp/src/arrow/table/table.cc create mode 100644 cpp/src/arrow/table/table.h create mode 100644 cpp/src/arrow/table/test-common.h diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 15afb1acf67cf..8042661533e1d 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -470,7 +470,6 @@ set(ARROW_SRCS src/arrow/array.cc src/arrow/builder.cc src/arrow/field.cc - src/arrow/schema.cc src/arrow/type.cc ) diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index 04f8dd1f908cb..77326ce38d754 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -30,4 +30,3 @@ install(FILES set(ARROW_TEST_LINK_LIBS arrow_test_util ${ARROW_MIN_TEST_LIBS}) ADD_ARROW_TEST(array-test) -ADD_ARROW_TEST(schema-test) diff --git a/cpp/src/arrow/table/CMakeLists.txt b/cpp/src/arrow/table/CMakeLists.txt index a401622d2e0d7..b51258ffd8b0d 100644 --- a/cpp/src/arrow/table/CMakeLists.txt +++ b/cpp/src/arrow/table/CMakeLists.txt @@ -21,6 +21,8 @@ set(TABLE_SRCS column.cc + schema.cc + table.cc ) set(TABLE_LIBS @@ -37,3 +39,5 @@ install(FILES DESTINATION include/arrow/table) ADD_ARROW_TEST(column-test) +ADD_ARROW_TEST(schema-test) +ADD_ARROW_TEST(table-test) diff --git a/cpp/src/arrow/table/column-test.cc b/cpp/src/arrow/table/column-test.cc index 15f554f46325d..4959b82c6e2ae 100644 --- a/cpp/src/arrow/table/column-test.cc +++ b/cpp/src/arrow/table/column-test.cc @@ -22,48 +22,29 @@ #include #include "arrow/field.h" -#include "arrow/schema.h" #include "arrow/table/column.h" +#include "arrow/table/schema.h" +#include "arrow/table/test-common.h" #include "arrow/test-util.h" #include "arrow/type.h" #include "arrow/types/integer.h" -#include "arrow/util/bit-util.h" -#include "arrow/util/buffer.h" -#include "arrow/util/memory-pool.h" -#include "arrow/util/status.h" using std::shared_ptr; using std::vector; namespace arrow { -class TestColumn : public ::testing::Test { - public: - void SetUp() { - pool_ = GetDefaultMemoryPool(); - } - - template - std::shared_ptr MakeArray(int32_t length, int32_t null_count = 0) { - auto data = std::make_shared(pool_); - auto 
nulls = std::make_shared(pool_); - data->Resize(length * sizeof(typename ArrayType::value_type)); - nulls->Resize(util::bytes_for_bits(length)); - return std::make_shared(length, data, 10, nulls); - } - +class TestColumn : public TestBase { protected: - MemoryPool* pool_; - std::shared_ptr data_; std::unique_ptr column_; }; TEST_F(TestColumn, BasicAPI) { ArrayVector arrays; - arrays.push_back(MakeArray(100)); - arrays.push_back(MakeArray(100, 10)); - arrays.push_back(MakeArray(100, 20)); + arrays.push_back(MakePrimitive(100)); + arrays.push_back(MakePrimitive(100, 10)); + arrays.push_back(MakePrimitive(100, 20)); auto field = std::make_shared("c0", INT32); column_.reset(new Column(field, arrays)); @@ -77,15 +58,15 @@ TEST_F(TestColumn, BasicAPI) { TEST_F(TestColumn, ChunksInhomogeneous) { ArrayVector arrays; - arrays.push_back(MakeArray(100)); - arrays.push_back(MakeArray(100, 10)); + arrays.push_back(MakePrimitive(100)); + arrays.push_back(MakePrimitive(100, 10)); auto field = std::make_shared("c0", INT32); column_.reset(new Column(field, arrays)); ASSERT_OK(column_->ValidateData()); - arrays.push_back(MakeArray(100, 10)); + arrays.push_back(MakePrimitive(100, 10)); column_.reset(new Column(field, arrays)); ASSERT_RAISES(Invalid, column_->ValidateData()); } diff --git a/cpp/src/arrow/table/column.cc b/cpp/src/arrow/table/column.cc index 82750cf4d4306..d68b491fb99da 100644 --- a/cpp/src/arrow/table/column.cc +++ b/cpp/src/arrow/table/column.cc @@ -39,6 +39,12 @@ Column::Column(const std::shared_ptr& field, const ArrayVector& chunks) : data_ = std::make_shared(chunks); } +Column::Column(const std::shared_ptr& field, + const std::shared_ptr& data) : + field_(field) { + data_ = std::make_shared(ArrayVector({data})); +} + Column::Column(const std::shared_ptr& field, const std::shared_ptr& data) : field_(field), diff --git a/cpp/src/arrow/table/column.h b/cpp/src/arrow/table/column.h index 9e9064e86545d..64423bf956147 100644 --- a/cpp/src/arrow/table/column.h +++ b/cpp/src/arrow/table/column.h @@ -67,6 +67,8 @@ class Column { Column(const std::shared_ptr& field, const std::shared_ptr& data); + Column(const std::shared_ptr& field, const std::shared_ptr& data); + int64_t length() const { return data_->length(); } diff --git a/cpp/src/arrow/schema-test.cc b/cpp/src/arrow/table/schema-test.cc similarity index 99% rename from cpp/src/arrow/schema-test.cc rename to cpp/src/arrow/table/schema-test.cc index 7c190d068c2a6..0cf1b3c5f9a8e 100644 --- a/cpp/src/arrow/schema-test.cc +++ b/cpp/src/arrow/table/schema-test.cc @@ -21,7 +21,7 @@ #include #include "arrow/field.h" -#include "arrow/schema.h" +#include "arrow/table/schema.h" #include "arrow/type.h" #include "arrow/types/string.h" diff --git a/cpp/src/arrow/schema.cc b/cpp/src/arrow/table/schema.cc similarity index 98% rename from cpp/src/arrow/schema.cc rename to cpp/src/arrow/table/schema.cc index a735fd3d23075..fb3b4d6f29268 100644 --- a/cpp/src/arrow/schema.cc +++ b/cpp/src/arrow/table/schema.cc @@ -15,7 +15,7 @@ // specific language governing permissions and limitations // under the License. 
-#include "arrow/schema.h" +#include "arrow/table/schema.h" #include #include diff --git a/cpp/src/arrow/schema.h b/cpp/src/arrow/table/schema.h similarity index 100% rename from cpp/src/arrow/schema.h rename to cpp/src/arrow/table/schema.h diff --git a/cpp/src/arrow/table/table-test.cc b/cpp/src/arrow/table/table-test.cc new file mode 100644 index 0000000000000..dd4f74cd16f89 --- /dev/null +++ b/cpp/src/arrow/table/table-test.cc @@ -0,0 +1,125 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include +#include +#include +#include +#include + +#include "arrow/field.h" +#include "arrow/table/column.h" +#include "arrow/table/schema.h" +#include "arrow/table/table.h" +#include "arrow/table/test-common.h" +#include "arrow/test-util.h" +#include "arrow/type.h" +#include "arrow/types/integer.h" + +using std::shared_ptr; +using std::vector; + +namespace arrow { + +class TestTable : public TestBase { + public: + void MakeExample1(int length) { + auto f0 = std::make_shared("f0", INT32); + auto f1 = std::make_shared("f1", UINT8); + auto f2 = std::make_shared("f2", INT16); + + vector > fields = {f0, f1, f2}; + schema_ = std::make_shared(fields); + + columns_ = { + std::make_shared(schema_->field(0), MakePrimitive(length)), + std::make_shared(schema_->field(1), MakePrimitive(length)), + std::make_shared(schema_->field(2), MakePrimitive(length)) + }; + } + + protected: + std::unique_ptr table_; + shared_ptr schema_; + vector > columns_; +}; + +TEST_F(TestTable, EmptySchema) { + auto empty_schema = shared_ptr(new Schema({})); + table_.reset(new Table("data", empty_schema, columns_)); + ASSERT_OK(table_->ValidateColumns()); + ASSERT_EQ(0, table_->num_rows()); + ASSERT_EQ(0, table_->num_columns()); +} + +TEST_F(TestTable, Ctors) { + int length = 100; + MakeExample1(length); + + std::string name = "data"; + + table_.reset(new Table(name, schema_, columns_)); + ASSERT_OK(table_->ValidateColumns()); + ASSERT_EQ(name, table_->name()); + ASSERT_EQ(length, table_->num_rows()); + ASSERT_EQ(3, table_->num_columns()); + + table_.reset(new Table(name, schema_, columns_, length)); + ASSERT_OK(table_->ValidateColumns()); + ASSERT_EQ(name, table_->name()); + ASSERT_EQ(length, table_->num_rows()); +} + +TEST_F(TestTable, Metadata) { + int length = 100; + MakeExample1(length); + + std::string name = "data"; + table_.reset(new Table(name, schema_, columns_)); + + ASSERT_TRUE(table_->schema()->Equals(schema_)); + + auto col = table_->column(0); + ASSERT_EQ(schema_->field(0)->name, col->name()); + ASSERT_EQ(schema_->field(0)->type, col->type()); +} + +TEST_F(TestTable, InvalidColumns) { + // Check that columns are all the same length + int length = 100; + MakeExample1(length); + + table_.reset(new Table("data", schema_, columns_, length - 1)); + ASSERT_RAISES(Invalid, 
table_->ValidateColumns()); + + columns_.clear(); + + // Wrong number of columns + table_.reset(new Table("data", schema_, columns_, length)); + ASSERT_RAISES(Invalid, table_->ValidateColumns()); + + columns_ = { + std::make_shared(schema_->field(0), MakePrimitive(length)), + std::make_shared(schema_->field(1), MakePrimitive(length)), + std::make_shared(schema_->field(2), MakePrimitive(length - 1)) + }; + + table_.reset(new Table("data", schema_, columns_, length)); + ASSERT_RAISES(Invalid, table_->ValidateColumns()); +} + +} // namespace arrow diff --git a/cpp/src/arrow/table/table.cc b/cpp/src/arrow/table/table.cc new file mode 100644 index 0000000000000..4cefc924ed38f --- /dev/null +++ b/cpp/src/arrow/table/table.cc @@ -0,0 +1,73 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/table/table.h" + +#include +#include + +#include "arrow/field.h" +#include "arrow/table/column.h" +#include "arrow/table/schema.h" +#include "arrow/util/status.h" + +namespace arrow { + +Table::Table(const std::string& name, const std::shared_ptr& schema, + const std::vector >& columns) : + name_(name), + schema_(schema), + columns_(columns) { + if (columns.size() == 0) { + num_rows_ = 0; + } else { + num_rows_ = columns[0]->length(); + } +} + +Table::Table(const std::string& name, const std::shared_ptr& schema, + const std::vector >& columns, int64_t num_rows) : + name_(name), + schema_(schema), + columns_(columns), + num_rows_(num_rows) {} + +Status Table::ValidateColumns() const { + if (num_columns() != schema_->num_fields()) { + return Status::Invalid("Number of columns did not match schema"); + } + + if (columns_.size() == 0) { + return Status::OK(); + } + + // Make sure columns are all the same length + for (size_t i = 0; i < columns_.size(); ++i) { + const Column* col = columns_[i].get(); + if (col->length() != num_rows_) { + std::stringstream ss; + ss << "Column " << i << " expected length " + << num_rows_ + << " but got length " + << col->length(); + return Status::Invalid(ss.str()); + } + } + return Status::OK(); +} + +} // namespace arrow diff --git a/cpp/src/arrow/table/table.h b/cpp/src/arrow/table/table.h new file mode 100644 index 0000000000000..b0129387b710c --- /dev/null +++ b/cpp/src/arrow/table/table.h @@ -0,0 +1,82 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_TABLE_TABLE_H +#define ARROW_TABLE_TABLE_H + +#include +#include +#include + +namespace arrow { + +class Column; +class Schema; +class Status; + +// Immutable container of fixed-length columns conforming to a particular schema +class Table { + public: + // If columns is zero-length, the table's number of rows is zero + Table(const std::string& name, const std::shared_ptr& schema, + const std::vector >& columns); + + Table(const std::string& name, const std::shared_ptr& schema, + const std::vector >& columns, int64_t num_rows); + + // @returns: the table's name, if any (may be length 0) + const std::string& name() const { + return name_; + } + + // @returns: the table's schema + const std::shared_ptr& schema() const { + return schema_; + } + + // Note: Does not boundscheck + // @returns: the i-th column + const std::shared_ptr& column(int i) const { + return columns_[i]; + } + + // @returns: the number of columns in the table + int num_columns() const { + return columns_.size(); + } + + // @returns: the number of rows (the corresponding length of each column) + int64_t num_rows() const { + return num_rows_; + } + + // After construction, perform any checks to validate the input arguments + Status ValidateColumns() const; + + private: + // The table's name, optional + std::string name_; + + std::shared_ptr schema_; + std::vector > columns_; + + int64_t num_rows_; +}; + +} // namespace arrow + +#endif // ARROW_TABLE_TABLE_H diff --git a/cpp/src/arrow/table/test-common.h b/cpp/src/arrow/table/test-common.h new file mode 100644 index 0000000000000..efe2f228cd0a3 --- /dev/null +++ b/cpp/src/arrow/table/test-common.h @@ -0,0 +1,55 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include +#include +#include +#include +#include + +#include "arrow/field.h" +#include "arrow/table/column.h" +#include "arrow/table/schema.h" +#include "arrow/table/table.h" +#include "arrow/test-util.h" +#include "arrow/type.h" +#include "arrow/util/bit-util.h" +#include "arrow/util/buffer.h" +#include "arrow/util/memory-pool.h" + +namespace arrow { + +class TestBase : public ::testing::Test { + public: + void SetUp() { + pool_ = GetDefaultMemoryPool(); + } + + template + std::shared_ptr MakePrimitive(int32_t length, int32_t null_count = 0) { + auto data = std::make_shared(pool_); + auto nulls = std::make_shared(pool_); + EXPECT_OK(data->Resize(length * sizeof(typename ArrayType::value_type))); + EXPECT_OK(nulls->Resize(util::bytes_for_bits(length))); + return std::make_shared(length, data, 10, nulls); + } + + protected: + MemoryPool* pool_; +}; + +} // namespace arrow From 572cdf22e3595035966a05a5ec2398f9d29df669 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 7 Mar 2016 14:42:32 -0800 Subject: [PATCH 0020/1644] ARROW-7: Add barebones Python library build toolchain This patch provides no actual functionality; it only builds an empty Cython extension that links to libarrow.so. I will hook this into Travis CI at some later time. I have adapted a limited amount of BSD (2- or 3-clause) or Apache 2.0 3rd-party code (particularly the cmake/Cython integration) to bootstrap this Python package / build setup in accordance with http://www.apache.org/legal/resolved.html. I have noted the relevant copyright holders and licenses in `python/LICENSE.txt`. In particular, I expect to continue to refactor and reuse occasional utility code from pandas (https://github.com/pydata/pandas) as practical. Since a significant amount of "glue code" will need to be written to marshal between Arrow data and pure Python / NumPy / pandas objects, to get started I've adopted the approach used by libdynd/dynd-python -- a C++ "glue library" that is then called from Cython to provide a Python user interface. This will allow us to build shims as necessary to abstract away complications that leak through (for example: enabling C++ code with no knowledge of Python to invoke Python functions). Let's see how this goes: there are other options, like Boost::Python, but Cython + shim code is a more lightweight and flexible solution for the moment. 
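The shim arrangement is easier to see in miniature. Below is a hedged sketch of what the C++ side of such a glue layer looks like; the header guard, namespace, and function are illustrative stand-ins for the src/pyarrow/init.h added in this patch, not its exact contents:

```cpp
// Hypothetical shim header: plain C++ with no Python.h in the interface,
// so a Cython .pxd can declare it and call it directly.
#ifndef PYARROW_SHIM_EXAMPLE_H
#define PYARROW_SHIM_EXAMPLE_H

namespace pyarrow {

// One-time setup called from the Cython module at import time; the real
// patch wires an analogous entry point through init.cc/init.h.
void InitExample();

}  // namespace pyarrow

#endif  // PYARROW_SHIM_EXAMPLE_H
```

On the Cython side, such a header is pulled in with a `cdef extern from` declaration, so all Python-object handling stays in the .pyx layer while the C++ shim remains buildable and testable on its own.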
Author: Wes McKinney Closes #17 from wesm/ARROW-7 and squashes the following commits: be059a2 [Wes McKinney] Nest arrow::py namespace 3ad3143 [Wes McKinney] Add preliminary Python development toolchain --- cpp/src/arrow/CMakeLists.txt | 1 + cpp/src/arrow/table/CMakeLists.txt | 3 + python/.gitignore | 37 ++ python/CMakeLists.txt | 464 +++++++++++++++++++ python/LICENSE.txt | 88 ++++ python/README.md | 14 + python/arrow/__init__.py | 0 python/arrow/compat.py | 86 ++++ python/arrow/config.pyx | 8 + python/arrow/includes/__init__.pxd | 0 python/arrow/includes/arrow.pxd | 23 + python/arrow/includes/common.pxd | 34 ++ python/arrow/includes/parquet.pxd | 51 ++ python/arrow/includes/pyarrow.pxd | 23 + python/arrow/parquet.pyx | 23 + python/arrow/tests/__init__.py | 0 python/cmake_modules/CompilerInfo.cmake | 48 ++ python/cmake_modules/FindArrow.cmake | 77 +++ python/cmake_modules/FindCython.cmake | 30 ++ python/cmake_modules/FindNumPy.cmake | 100 ++++ python/cmake_modules/FindPythonLibsNew.cmake | 236 ++++++++++ python/cmake_modules/UseCython.cmake | 164 +++++++ python/setup.py | 244 ++++++++++ python/src/pyarrow/CMakeLists.txt | 20 + python/src/pyarrow/api.h | 21 + python/src/pyarrow/init.cc | 29 ++ python/src/pyarrow/init.h | 31 ++ python/src/pyarrow/util/CMakeLists.txt | 53 +++ python/src/pyarrow/util/test_main.cc | 26 ++ 29 files changed, 1934 insertions(+) create mode 100644 python/.gitignore create mode 100644 python/CMakeLists.txt create mode 100644 python/LICENSE.txt create mode 100644 python/README.md create mode 100644 python/arrow/__init__.py create mode 100644 python/arrow/compat.py create mode 100644 python/arrow/config.pyx create mode 100644 python/arrow/includes/__init__.pxd create mode 100644 python/arrow/includes/arrow.pxd create mode 100644 python/arrow/includes/common.pxd create mode 100644 python/arrow/includes/parquet.pxd create mode 100644 python/arrow/includes/pyarrow.pxd create mode 100644 python/arrow/parquet.pyx create mode 100644 python/arrow/tests/__init__.py create mode 100644 python/cmake_modules/CompilerInfo.cmake create mode 100644 python/cmake_modules/FindArrow.cmake create mode 100644 python/cmake_modules/FindCython.cmake create mode 100644 python/cmake_modules/FindNumPy.cmake create mode 100644 python/cmake_modules/FindPythonLibsNew.cmake create mode 100644 python/cmake_modules/UseCython.cmake create mode 100644 python/setup.py create mode 100644 python/src/pyarrow/CMakeLists.txt create mode 100644 python/src/pyarrow/api.h create mode 100644 python/src/pyarrow/init.cc create mode 100644 python/src/pyarrow/init.h create mode 100644 python/src/pyarrow/util/CMakeLists.txt create mode 100644 python/src/pyarrow/util/test_main.cc diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index 77326ce38d754..102a8a1853f3e 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -20,6 +20,7 @@ install(FILES api.h array.h builder.h + field.h type.h DESTINATION include/arrow) diff --git a/cpp/src/arrow/table/CMakeLists.txt b/cpp/src/arrow/table/CMakeLists.txt index b51258ffd8b0d..68bf3148a9889 100644 --- a/cpp/src/arrow/table/CMakeLists.txt +++ b/cpp/src/arrow/table/CMakeLists.txt @@ -36,6 +36,9 @@ SET_TARGET_PROPERTIES(arrow_table PROPERTIES LINKER_LANGUAGE CXX) # Headers: top level install(FILES + column.h + schema.h + table.h DESTINATION include/arrow/table) ADD_ARROW_TEST(column-test) diff --git a/python/.gitignore b/python/.gitignore new file mode 100644 index 0000000000000..80103a1a52942 --- /dev/null +++ b/python/.gitignore @@ 
-0,0 +1,37 @@ +thirdparty/ +CMakeFiles/ +CMakeCache.txt +CTestTestfile.cmake +Makefile +cmake_install.cmake +build/ +Testing/ + +# Python stuff + +# Editor temporary/working/backup files +*flymake* + +# Compiled source +*.a +*.dll +*.o +*.py[ocd] +*.so +.build_cache_dir +MANIFEST + +# Generated sources +*.c +*.cpp +# Python files + +# setup.py working directory +build +# setup.py dist directory +dist +# Egg metadata +*.egg-info +# coverage +.coverage +coverage.xml diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt new file mode 100644 index 0000000000000..df55bfac9eb4a --- /dev/null +++ b/python/CMakeLists.txt @@ -0,0 +1,464 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# +# Includes code assembled from BSD/MIT/Apache-licensed code from some 3rd-party +# projects, including Kudu, Impala, and libdynd. See python/LICENSE.txt + +cmake_minimum_required(VERSION 2.7) +project(pyarrow) + +set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_SOURCE_DIR}/cmake_modules") + +# Use common cmake modules from Arrow C++ if available +set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_SOURCE_DIR}/../cpp/cmake_modules") + +include(CMakeParseArguments) + +set(BUILD_SUPPORT_DIR ${CMAKE_CURRENT_SOURCE_DIR}/../cpp/build-support) + +# Allow "make install" to not depend on all targets. +# +# Must be declared in the top-level CMakeLists.txt. +set(CMAKE_SKIP_INSTALL_ALL_DEPENDENCY true) + +set(CMAKE_MACOSX_RPATH 1) +set(CMAKE_OSX_DEPLOYMENT_TARGET 10.9) + +# Generate a Clang compile_commands.json "compilation database" file for use +# with various development tools, such as Vim's YouCompleteMe plugin. +# See http://clang.llvm.org/docs/JSONCompilationDatabase.html +if ("$ENV{CMAKE_EXPORT_COMPILE_COMMANDS}" STREQUAL "1") + set(CMAKE_EXPORT_COMPILE_COMMANDS 1) +endif() + +############################################################ +# Compiler flags +############################################################ + +# compiler flags that are common across debug/release builds +set(CXX_COMMON_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -Wall") + +# compiler flags for different build types (run 'cmake -DCMAKE_BUILD_TYPE= .') +# For all builds: +# For CMAKE_BUILD_TYPE=Debug +# -ggdb: Enable gdb debugging +# For CMAKE_BUILD_TYPE=FastDebug +# Same as DEBUG, except with some optimizations on. +# For CMAKE_BUILD_TYPE=Release +# -O3: Enable all compiler optimizations +# -g: Enable symbols for profiler tools (TODO: remove for shipping) +# -DNDEBUG: Turn off dchecks/asserts/debug only code. 
+set(CXX_FLAGS_DEBUG "-ggdb -O0")
+set(CXX_FLAGS_FASTDEBUG "-ggdb -O1")
+set(CXX_FLAGS_RELEASE "-O3 -g -DNDEBUG")
+
+# if no build type is specified, default to debug builds
+if (NOT CMAKE_BUILD_TYPE)
+  set(CMAKE_BUILD_TYPE Debug)
+endif(NOT CMAKE_BUILD_TYPE)
+
+string (TOUPPER ${CMAKE_BUILD_TYPE} CMAKE_BUILD_TYPE)
+
+# Set compile flags based on the build type.
+message("Configured for ${CMAKE_BUILD_TYPE} build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...})")
+if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG")
+  set(CMAKE_CXX_FLAGS ${CXX_FLAGS_DEBUG})
+elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "FASTDEBUG")
+  set(CMAKE_CXX_FLAGS ${CXX_FLAGS_FASTDEBUG})
+elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "RELEASE")
+  set(CMAKE_CXX_FLAGS ${CXX_FLAGS_RELEASE})
+else()
+  message(FATAL_ERROR "Unknown build type: ${CMAKE_BUILD_TYPE}")
+endif ()
+
+# Add common flags
+set(CMAKE_CXX_FLAGS "${CXX_COMMON_FLAGS} ${CMAKE_CXX_FLAGS}")
+
+# Determine compiler version
+include(CompilerInfo)
+
+if ("${COMPILER_FAMILY}" STREQUAL "clang")
+  # Using Clang with ccache causes a bunch of spurious warnings that are
+  # purportedly fixed in the next version of ccache. See the following for details:
+  #
+  # http://petereisentraut.blogspot.com/2011/05/ccache-and-clang.html
+  # http://petereisentraut.blogspot.com/2011/09/ccache-and-clang-part-2.html
+  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Qunused-arguments")
+endif()
+
+# Link mode: "a" = auto (decided below), "s" = static, "d" = dynamic
+set(PYARROW_LINK "a")
+
+# For any C code, use the same flags.
+set(CMAKE_C_FLAGS "${CMAKE_CXX_FLAGS}")
+
+# Code coverage
+if ("${PYARROW_GENERATE_COVERAGE}")
+  if("${CMAKE_CXX_COMPILER}" MATCHES ".*clang.*")
+    # There appear to be some bugs in clang 3.3 which cause code coverage
+    # to have link errors, not locating the llvm_gcda_* symbols.
+    # This should be fixed in llvm 3.4 with http://llvm.org/viewvc/llvm-project?view=revision&revision=184666
+    message(SEND_ERROR "Cannot currently generate coverage with clang")
+  endif()
+  set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} --coverage -DCOVERAGE_BUILD")
+
+  # For coverage to work properly, we need to use static linkage. Otherwise,
+  # __gcov_flush() doesn't properly flush coverage from every module.
+  # See http://stackoverflow.com/questions/28164543/using-gcov-flush-within-a-library-doesnt-force-the-other-modules-to-yield-gc
+  if("${PYARROW_LINK}" STREQUAL "a")
+    message("Using static linking for coverage build")
+    set(PYARROW_LINK "s")
+  elseif("${PYARROW_LINK}" STREQUAL "d")
+    message(SEND_ERROR "Cannot use coverage with dynamic linking")
+  endif()
+endif()
+
+# If we still don't know what kind of linking to perform, choose based on
+# build type (developers like fast builds).
+if ("${PYARROW_LINK}" STREQUAL "a")
+  if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG" OR
+      "${CMAKE_BUILD_TYPE}" STREQUAL "FASTDEBUG")
+    message("Using dynamic linking for ${CMAKE_BUILD_TYPE} builds")
+    set(PYARROW_LINK "d")
+  else()
+    message("Using static linking for ${CMAKE_BUILD_TYPE} builds")
+    set(PYARROW_LINK "s")
+  endif()
+endif()
+
+# Are we using the gold linker? It doesn't work with dynamic linking as
+# weak symbols aren't properly overridden, causing tcmalloc to be omitted.
+# Let's flag this as an error in RELEASE builds (we shouldn't release a
+# product like this).
+#
+# See https://sourceware.org/bugzilla/show_bug.cgi?id=16979 for details.
+#
+# The gold linker is only for ELF binaries, which OSX doesn't use. We can
+# just skip.
+if (NOT APPLE)
+  execute_process(COMMAND ${CMAKE_CXX_COMPILER} -Wl,--version OUTPUT_VARIABLE LINKER_OUTPUT)
+endif ()
+if (LINKER_OUTPUT MATCHES "gold")
+  if ("${PYARROW_LINK}" STREQUAL "d" AND
+      "${CMAKE_BUILD_TYPE}" STREQUAL "RELEASE")
+    message(SEND_ERROR "Cannot use gold with dynamic linking in a RELEASE build "
+      "as it would cause tcmalloc symbols to get dropped")
+  else()
+    message("Using gold linker")
+  endif()
+  set(PYARROW_USING_GOLD 1)
+else()
+  message("Using ld linker")
+endif()
+
+# Having set PYARROW_LINK based on the build type and coverage settings above,
+# it's now safe to act on its value.
+if ("${PYARROW_LINK}" STREQUAL "d")
+  set(BUILD_SHARED_LIBS ON)
+
+  # Position independent code is only necessary when producing shared objects.
+  add_definitions(-fPIC)
+endif()
+
+# set compile output directory
+string (TOLOWER ${CMAKE_BUILD_TYPE} BUILD_SUBDIR_NAME)
+
+# If building in-source, create the latest symlink. If building out-of-source,
+# which is preferred, simply output the binaries in the build folder
+if (${CMAKE_SOURCE_DIR} STREQUAL ${CMAKE_CURRENT_BINARY_DIR})
+  set(BUILD_OUTPUT_ROOT_DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}/build/${BUILD_SUBDIR_NAME}/")
+  # Link build/latest to the current build directory, to avoid developers
+  # accidentally running the latest debug build when in fact they're building
+  # release builds.
+  FILE(MAKE_DIRECTORY ${BUILD_OUTPUT_ROOT_DIRECTORY})
+  if (NOT APPLE)
+    set(MORE_ARGS "-T")
+  endif()
+EXECUTE_PROCESS(COMMAND ln ${MORE_ARGS} -sf ${BUILD_OUTPUT_ROOT_DIRECTORY}
+  ${CMAKE_CURRENT_BINARY_DIR}/build/latest)
+else()
+  set(BUILD_OUTPUT_ROOT_DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}")
+  # set(BUILD_OUTPUT_ROOT_DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}/${BUILD_SUBDIR_NAME}/")
+endif()
+
+# where to put generated archives (.a files)
+set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}")
+set(ARCHIVE_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}")
+
+# where to put generated libraries (.so files)
+set(CMAKE_LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}")
+set(LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}")
+
+# where to put generated binaries
+set(EXECUTABLE_OUTPUT_PATH "${BUILD_OUTPUT_ROOT_DIRECTORY}")
+
+## Python and libraries
+find_package(PythonLibsNew REQUIRED)
+include(UseCython)
+
+include_directories(SYSTEM
+  src)
+
+############################################################
+# Testing
+############################################################
+
+# Add a new test case, with or without an executable that should be built.
+#
+# REL_TEST_NAME is the name of the test. It may be a single component
+# (e.g. monotime-test) or contain additional components (e.g.
+# net/net_util-test). Either way, the last component must be a globally
+# unique name.
+#
+# Arguments after the test name will be passed to set_tests_properties().
+function(ADD_PYARROW_TEST REL_TEST_NAME)
+  if(NO_TESTS)
+    return()
+  endif()
+  get_filename_component(TEST_NAME ${REL_TEST_NAME} NAME_WE)
+
+  if(EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${REL_TEST_NAME}.cc)
+    # This test has a corresponding .cc file, set it up as an executable.
+    set(TEST_PATH "${EXECUTABLE_OUTPUT_PATH}/${TEST_NAME}")
+    add_executable(${TEST_NAME} "${REL_TEST_NAME}.cc")
+    target_link_libraries(${TEST_NAME} ${PYARROW_TEST_LINK_LIBS})
+  else()
+    # No executable, just invoke the test (probably a script) directly.
+ set(TEST_PATH ${CMAKE_CURRENT_SOURCE_DIR}/${REL_TEST_NAME}) + endif() + + add_test(${TEST_NAME} + ${BUILD_SUPPORT_DIR}/run-test.sh ${TEST_PATH}) + if(ARGN) + set_tests_properties(${TEST_NAME} PROPERTIES ${ARGN}) + endif() +endfunction() + +# A wrapper for add_dependencies() that is compatible with NO_TESTS. +function(ADD_PYARROW_TEST_DEPENDENCIES REL_TEST_NAME) + if(NO_TESTS) + return() + endif() + get_filename_component(TEST_NAME ${REL_TEST_NAME} NAME_WE) + + add_dependencies(${TEST_NAME} ${ARGN}) +endfunction() + +enable_testing() + +############################################################ +# Dependencies +############################################################ +function(ADD_THIRDPARTY_LIB LIB_NAME) + set(options) + set(one_value_args SHARED_LIB STATIC_LIB) + set(multi_value_args DEPS) + cmake_parse_arguments(ARG "${options}" "${one_value_args}" "${multi_value_args}" ${ARGN}) + if(ARG_UNPARSED_ARGUMENTS) + message(SEND_ERROR "Error: unrecognized arguments: ${ARG_UNPARSED_ARGUMENTS}") + endif() + + if(("${PYARROW_LINK}" STREQUAL "s" AND ARG_STATIC_LIB) OR (NOT ARG_SHARED_LIB)) + if(NOT ARG_STATIC_LIB) + message(FATAL_ERROR "No static or shared library provided for ${LIB_NAME}") + endif() + add_library(${LIB_NAME} STATIC IMPORTED) + set_target_properties(${LIB_NAME} + PROPERTIES IMPORTED_LOCATION "${ARG_STATIC_LIB}") + message("Added static library dependency ${LIB_NAME}: ${ARG_STATIC_LIB}") + else() + add_library(${LIB_NAME} SHARED IMPORTED) + set_target_properties(${LIB_NAME} + PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}") + message("Added shared library dependency ${LIB_NAME}: ${ARG_SHARED_LIB}") + endif() + + if(ARG_DEPS) + set_target_properties(${LIB_NAME} + PROPERTIES IMPORTED_LINK_INTERFACE_LIBRARIES "${ARG_DEPS}") + endif() + + # Set up an "exported variant" for this thirdparty library (see "Visibility" + # above). It's the same as the real target, just with an "_exported" suffix. + # We prefer the static archive if it exists (as it's akin to an "internal" + # library), but we'll settle for the shared object if we must. + # + # A shared object exported variant will force any "leaf" library that + # transitively depends on it to also depend on it at runtime; this is + # desirable for some libraries (e.g. cyrus_sasl). 
+  set(LIB_NAME_EXPORTED ${LIB_NAME}_exported)
+  if(ARG_STATIC_LIB)
+    add_library(${LIB_NAME_EXPORTED} STATIC IMPORTED)
+    set_target_properties(${LIB_NAME_EXPORTED}
+      PROPERTIES IMPORTED_LOCATION "${ARG_STATIC_LIB}")
+  else()
+    add_library(${LIB_NAME_EXPORTED} SHARED IMPORTED)
+    set_target_properties(${LIB_NAME_EXPORTED}
+      PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}")
+  endif()
+  if(ARG_DEPS)
+    set_target_properties(${LIB_NAME_EXPORTED}
+      PROPERTIES IMPORTED_LINK_INTERFACE_LIBRARIES "${ARG_DEPS}")
+  endif()
+endfunction()
+
+## GTest
+find_package(GTest REQUIRED)
+include_directories(SYSTEM ${GTEST_INCLUDE_DIR})
+ADD_THIRDPARTY_LIB(gtest
+  STATIC_LIB ${GTEST_STATIC_LIB})
+
+## Arrow
+find_package(Arrow REQUIRED)
+include_directories(SYSTEM ${ARROW_INCLUDE_DIR})
+ADD_THIRDPARTY_LIB(arrow
+  SHARED_LIB ${ARROW_SHARED_LIB})
+
+############################################################
+# Linker setup
+############################################################
+
+set(PYARROW_MIN_TEST_LIBS
+  pyarrow_test_main
+  pyarrow
+  ${PYARROW_BASE_LIBS})
+
+set(PYARROW_TEST_LINK_LIBS ${PYARROW_MIN_TEST_LIBS})
+
+############################################################
+# "make ctags" target
+############################################################
+if (UNIX)
+  add_custom_target(ctags ctags -R --languages=c++,c --exclude=thirdparty/installed)
+endif (UNIX)
+
+############################################################
+# "make etags" target
+############################################################
+if (UNIX)
+  add_custom_target(tags etags --members --declarations
+    `find ${CMAKE_CURRENT_SOURCE_DIR}/src
+    -name \\*.cc -or -name \\*.hh -or -name \\*.cpp -or -name \\*.h -or -name \\*.c -or
+    -name \\*.f`)
+  add_custom_target(etags DEPENDS tags)
+endif (UNIX)
+
+############################################################
+# "make cscope" target
+############################################################
+if (UNIX)
+  add_custom_target(cscope find ${CMAKE_CURRENT_SOURCE_DIR}
+    ( -name \\*.cc -or -name \\*.hh -or -name \\*.cpp -or
+      -name \\*.h -or -name \\*.c -or -name \\*.f )
+    -exec echo \"{}\" \; > cscope.files && cscope -q -b VERBATIM)
+endif (UNIX)
+
+############################################################
+# "make lint" target
+############################################################
+if (UNIX)
+  # Full lint
+  add_custom_target(lint ${BUILD_SUPPORT_DIR}/cpplint.py
+    --verbose=2
+    --filter=-whitespace/comments,-readability/todo,-build/header_guard
+    `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc -or -name \\*.h`)
+endif (UNIX)
+
+############################################################
+# Subdirectories
+############################################################
+
+add_subdirectory(src/pyarrow)
+add_subdirectory(src/pyarrow/util)
+
+set(PYARROW_SRCS
+  src/pyarrow/init.cc
+)
+
+set(LINK_LIBS
+  pyarrow_util
+  arrow
+)
+
+add_library(pyarrow SHARED
+  ${PYARROW_SRCS})
+target_link_libraries(pyarrow ${LINK_LIBS})
+set_target_properties(pyarrow PROPERTIES LINKER_LANGUAGE CXX)
+
+if(APPLE)
+  set_target_properties(pyarrow PROPERTIES LINK_FLAGS "-undefined dynamic_lookup")
+endif()
+
+############################################################
+# Setup and build Cython modules
+############################################################
+
+foreach(pyx_api_file
+    arrow/config.pyx
+    arrow/parquet.pyx)
+  set_source_files_properties(${pyx_api_file} PROPERTIES CYTHON_API 1)
+endforeach(pyx_api_file)
+
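The two CYTHON_API modules above are compiled by the cython_add_module machinery in cmake_modules/UseCython.cmake and land under arrow/. A minimal sketch of how they would be exercised once built in place (hypothetical session, assuming the extensions compiled successfully; nothing else is exported yet):

    # Importing config runs arrow::py::pyarrow_init()
    # (see python/src/pyarrow/init.cc); parquet pulls in the
    # parquet_cpp declarations wrapped by arrow/parquet.pyx.
    import arrow.config
    import arrow.parquet

At this stage both modules are scaffolding, so a clean import is the entire smoke test.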
+set(USE_RELATIVE_RPATH ON) +set(CMAKE_BUILD_WITH_INSTALL_RPATH TRUE) + +set(CYTHON_EXTENSIONS + config + parquet +) + +foreach(module ${CYTHON_EXTENSIONS}) + string(REPLACE "." ";" directories ${module}) + list(GET directories -1 module_name) + list(REMOVE_AT directories -1) + + string(REPLACE "." "/" module_root "${module}") + set(module_SRC arrow/${module_root}.pyx) + set_source_files_properties(${module_SRC} PROPERTIES CYTHON_IS_CXX 1) + + cython_add_module(${module_name} + ${module_name}_pyx + ${module_name}_output + ${module_SRC}) + + if (directories) + string(REPLACE ";" "/" module_output_directory ${directories}) + set_target_properties(${module_name} PROPERTIES + LIBRARY_OUTPUT_DIRECTORY ${module_output_directory}) + endif() + + if(APPLE) + set(module_install_rpath "@loader_path") + else() + set(module_install_rpath "$ORIGIN") + endif() + list(LENGTH directories i) + while(${i} GREATER 0) + set(module_install_rpath "${module_install_rpath}/..") + math(EXPR i "${i} - 1" ) + endwhile(${i} GREATER 0) + + # for inplace development for now + set(module_install_rpath "${CMAKE_SOURCE_DIR}/arrow/") + + set_target_properties(${module_name} PROPERTIES + INSTALL_RPATH ${module_install_rpath}) + target_link_libraries(${module_name} pyarrow) +endforeach(module) diff --git a/python/LICENSE.txt b/python/LICENSE.txt new file mode 100644 index 0000000000000..078e144ded1c1 --- /dev/null +++ b/python/LICENSE.txt @@ -0,0 +1,88 @@ +## 3rd-party licenses for code that has been adapted for the Arrow Python + library + +------------------------------------------------------------------------------- +Some code from pandas has been adapted for this codebase. pandas is available +under the 3-clause BSD license, which follows: + +pandas license +============== + +Copyright (c) 2011-2012, Lambda Foundry, Inc. and PyData Development Team +All rights reserved. + +Copyright (c) 2008-2011 AQR Capital Management, LLC +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are +met: + + * Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + + * Redistributions in binary form must reproduce the above + copyright notice, this list of conditions and the following + disclaimer in the documentation and/or other materials provided + with the distribution. + + * Neither the name of the copyright holder nor the names of any + contributors may be used to endorse or promote products derived + from this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER AND CONTRIBUTORS +"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT +LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR +A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT +OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, +SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT +LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY +THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
+ +------------------------------------------------------------------------------- + +Some bits from DyND, in particular aspects of the build system, have been +adapted from libdynd and dynd-python under the terms of the BSD 2-clause +license + +The BSD 2-Clause License + + Copyright (C) 2011-12, Dynamic NDArray Developers + All rights reserved. + + Redistribution and use in source and binary forms, with or without + modification, are permitted provided that the following conditions are + met: + + * Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + + * Redistributions in binary form must reproduce the above + copyright notice, this list of conditions and the following + disclaimer in the documentation and/or other materials provided + with the distribution. + + THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + +Dynamic NDArray Developers list: + + * Mark Wiebe + * Continuum Analytics + +------------------------------------------------------------------------------- + +Some source code from Ibis (https://github.com/cloudera/ibis) has been adapted +for Arrow. Ibis is released under the Apache License, Version 2.0. diff --git a/python/README.md b/python/README.md new file mode 100644 index 0000000000000..c79fa9786f476 --- /dev/null +++ b/python/README.md @@ -0,0 +1,14 @@ +## Python library for Apache Arrow + +This library provides a Pythonic API wrapper for the reference Arrow C++ +implementation, along with tools for interoperability with pandas, NumPy, and +other traditional Python scientific computing packages. + +#### Development details + +This project is layered in two pieces: + +* pyarrow, a C++ library for easier interoperability between Arrow C++, NumPy, + and pandas +* Cython extensions and pure Python code under arrow/ which expose Arrow C++ + and pyarrow to pure Python users \ No newline at end of file diff --git a/python/arrow/__init__.py b/python/arrow/__init__.py new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/python/arrow/compat.py b/python/arrow/compat.py new file mode 100644 index 0000000000000..2ac41ac8abf89 --- /dev/null +++ b/python/arrow/compat.py @@ -0,0 +1,86 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# flake8: noqa + +import itertools + +import numpy as np + +import sys +import six +from six import BytesIO, StringIO, string_types as py_string + + +PY26 = sys.version_info[:2] == (2, 6) +PY2 = sys.version_info[0] == 2 + + +if PY26: + import unittest2 as unittest +else: + import unittest + + +if PY2: + import cPickle + + try: + from cdecimal import Decimal + except ImportError: + from decimal import Decimal + + unicode_type = unicode + lzip = zip + zip = itertools.izip + + def dict_values(x): + return x.values() + + range = xrange + long = long + + def tobytes(o): + if isinstance(o, unicode): + return o.encode('utf8') + else: + return o + + def frombytes(o): + return o +else: + unicode_type = str + def lzip(*x): + return list(zip(*x)) + long = int + zip = zip + def dict_values(x): + return list(x.values()) + from decimal import Decimal + range = range + + def tobytes(o): + if isinstance(o, str): + return o.encode('utf8') + else: + return o + + def frombytes(o): + return o.decode('utf8') + + +integer_types = six.integer_types + (np.integer,) diff --git a/python/arrow/config.pyx b/python/arrow/config.pyx new file mode 100644 index 0000000000000..8f10beb3a2e72 --- /dev/null +++ b/python/arrow/config.pyx @@ -0,0 +1,8 @@ +# cython: profile=False +# distutils: language = c++ +# cython: embedsignature = True + +cdef extern from 'pyarrow/init.h' namespace 'arrow::py': + void pyarrow_init() + +pyarrow_init() diff --git a/python/arrow/includes/__init__.pxd b/python/arrow/includes/__init__.pxd new file mode 100644 index 0000000000000..e69de29bb2d1d diff --git a/python/arrow/includes/arrow.pxd b/python/arrow/includes/arrow.pxd new file mode 100644 index 0000000000000..3635ceb868596 --- /dev/null +++ b/python/arrow/includes/arrow.pxd @@ -0,0 +1,23 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# distutils: language = c++ + +from arrow.includes.common cimport * + +cdef extern from "arrow/api.h" namespace "arrow" nogil: + pass diff --git a/python/arrow/includes/common.pxd b/python/arrow/includes/common.pxd new file mode 100644 index 0000000000000..f2fc826625e45 --- /dev/null +++ b/python/arrow/includes/common.pxd @@ -0,0 +1,34 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# distutils: language = c++
+
+from libc.stdint cimport *
+from libcpp cimport bool as c_bool
+from libcpp.string cimport string
+from libcpp.vector cimport vector
+
+# This must be included for cerr and other things to work
+cdef extern from "<iostream>":
+    pass
+
+cdef extern from "<memory>" namespace "std" nogil:
+
+    cdef cppclass shared_ptr[T]:
+        T* get()
+        void reset()
+        void reset(T* p)
diff --git a/python/arrow/includes/parquet.pxd b/python/arrow/includes/parquet.pxd
new file mode 100644
index 0000000000000..62342f3066969
--- /dev/null
+++ b/python/arrow/includes/parquet.pxd
@@ -0,0 +1,51 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# distutils: language = c++
+
+from arrow.includes.common cimport *
+
+cdef extern from "parquet/api/reader.h" namespace "parquet_cpp" nogil:
+    cdef cppclass ColumnReader:
+        pass
+
+    cdef cppclass BoolReader(ColumnReader):
+        pass
+
+    cdef cppclass Int32Reader(ColumnReader):
+        pass
+
+    cdef cppclass Int64Reader(ColumnReader):
+        pass
+
+    cdef cppclass Int96Reader(ColumnReader):
+        pass
+
+    cdef cppclass FloatReader(ColumnReader):
+        pass
+
+    cdef cppclass DoubleReader(ColumnReader):
+        pass
+
+    cdef cppclass ByteArrayReader(ColumnReader):
+        pass
+
+    cdef cppclass RowGroupReader:
+        pass
+
+    cdef cppclass ParquetFileReader:
+        pass
diff --git a/python/arrow/includes/pyarrow.pxd b/python/arrow/includes/pyarrow.pxd
new file mode 100644
index 0000000000000..dcef663f3894d
--- /dev/null
+++ b/python/arrow/includes/pyarrow.pxd
@@ -0,0 +1,23 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# distutils: language = c++
+
+from arrow.includes.common cimport *
+
+cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil:
+    pass
diff --git a/python/arrow/parquet.pyx b/python/arrow/parquet.pyx
new file mode 100644
index 0000000000000..23c3838bcad1f
--- /dev/null
+++ b/python/arrow/parquet.pyx
@@ -0,0 +1,23 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# cython: profile=False
+# distutils: language = c++
+# cython: embedsignature = True
+
+from arrow.compat import frombytes, tobytes
+from arrow.includes.parquet cimport *
diff --git a/python/arrow/tests/__init__.py b/python/arrow/tests/__init__.py
new file mode 100644
index 0000000000000..e69de29bb2d1d
diff --git a/python/cmake_modules/CompilerInfo.cmake b/python/cmake_modules/CompilerInfo.cmake
new file mode 100644
index 0000000000000..e66bc2693eead
--- /dev/null
+++ b/python/cmake_modules/CompilerInfo.cmake
@@ -0,0 +1,48 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# Sets COMPILER_FAMILY to 'clang' or 'gcc'
+# Sets COMPILER_VERSION to the version
execute_process(COMMAND "${CMAKE_CXX_COMPILER}" -v
+                ERROR_VARIABLE COMPILER_VERSION_FULL)
+message(INFO " ${COMPILER_VERSION_FULL}")
+
+# clang on Linux and Mac OS X before 10.9
+if("${COMPILER_VERSION_FULL}" MATCHES ".*clang version.*")
+  set(COMPILER_FAMILY "clang")
+  string(REGEX REPLACE ".*clang version ([0-9]+\\.[0-9]+).*" "\\1"
+    COMPILER_VERSION "${COMPILER_VERSION_FULL}")
+# clang on Mac OS X 10.9 and later
+elseif("${COMPILER_VERSION_FULL}" MATCHES ".*based on LLVM.*")
+  set(COMPILER_FAMILY "clang")
+  string(REGEX REPLACE ".*based on LLVM ([0-9]+\\.[0-9]+).*" "\\1"
+    COMPILER_VERSION "${COMPILER_VERSION_FULL}")
+
+# clang on Mac OS X, Xcode 7. No version replacement is done
+# because Apple no longer advertises the upstream LLVM version.
+elseif("${COMPILER_VERSION_FULL}" MATCHES "clang-700\\..*") + set(COMPILER_FAMILY "clang") + +# gcc +elseif("${COMPILER_VERSION_FULL}" MATCHES ".*gcc version.*") + set(COMPILER_FAMILY "gcc") + string(REGEX REPLACE ".*gcc version ([0-9\\.]+).*" "\\1" + COMPILER_VERSION "${COMPILER_VERSION_FULL}") +else() + message(FATAL_ERROR "Unknown compiler. Version info:\n${COMPILER_VERSION_FULL}") +endif() +message("Selected compiler ${COMPILER_FAMILY} ${COMPILER_VERSION}") diff --git a/python/cmake_modules/FindArrow.cmake b/python/cmake_modules/FindArrow.cmake new file mode 100644 index 0000000000000..3d9983849ebb2 --- /dev/null +++ b/python/cmake_modules/FindArrow.cmake @@ -0,0 +1,77 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# - Find ARROW (arrow/api.h, libarrow.a, libarrow.so) +# This module defines +# ARROW_INCLUDE_DIR, directory containing headers +# ARROW_LIBS, directory containing arrow libraries +# ARROW_STATIC_LIB, path to libarrow.a +# ARROW_SHARED_LIB, path to libarrow's shared library +# ARROW_FOUND, whether arrow has been found + +set(ARROW_SEARCH_HEADER_PATHS + $ENV{ARROW_HOME}/include +) + +set(ARROW_SEARCH_LIB_PATH + $ENV{ARROW_HOME}/lib +) + +find_path(ARROW_INCLUDE_DIR arrow/array.h PATHS + ${ARROW_SEARCH_HEADER_PATHS} + # make sure we don't accidentally pick up a different version + NO_DEFAULT_PATH +) + +find_library(ARROW_LIB_PATH NAMES arrow + PATHS + ${ARROW_SEARCH_LIB_PATH} + NO_DEFAULT_PATH) + +if (ARROW_INCLUDE_DIR AND ARROW_LIB_PATH) + set(ARROW_FOUND TRUE) + set(ARROW_LIB_NAME libarrow) + set(ARROW_LIBS ${ARROW_SEARCH_LIB_PATH}) + set(ARROW_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_LIB_NAME}.a) + set(ARROW_SHARED_LIB ${ARROW_LIBS}/${ARROW_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) +else () + set(ARROW_FOUND FALSE) +endif () + +if (ARROW_FOUND) + if (NOT Arrow_FIND_QUIETLY) + message(STATUS "Found the Arrow library: ${ARROW_LIB_PATH}") + endif () +else () + if (NOT Arrow_FIND_QUIETLY) + set(ARROW_ERR_MSG "Could not find the Arrow library. Looked for headers") + set(ARROW_ERR_MSG "${ARROW_ERR_MSG} in ${ARROW_SEARCH_HEADER_PATHS}, and for libs") + set(ARROW_ERR_MSG "${ARROW_ERR_MSG} in ${ARROW_SEARCH_LIB_PATH}") + if (Arrow_FIND_REQUIRED) + message(FATAL_ERROR "${ARROW_ERR_MSG}") + else (Arrow_FIND_REQUIRED) + message(STATUS "${ARROW_ERR_MSG}") + endif (Arrow_FIND_REQUIRED) + endif () +endif () + +mark_as_advanced( + ARROW_INCLUDE_DIR + ARROW_LIBS + ARROW_STATIC_LIB + ARROW_SHARED_LIB +) diff --git a/python/cmake_modules/FindCython.cmake b/python/cmake_modules/FindCython.cmake new file mode 100644 index 0000000000000..9df3b5d59d274 --- /dev/null +++ b/python/cmake_modules/FindCython.cmake @@ -0,0 +1,30 @@ +# Find the Cython compiler. 
+# +# This code sets the following variables: +# +# CYTHON_EXECUTABLE +# +# See also UseCython.cmake + +#============================================================================= +# Copyright 2011 Kitware, Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +#============================================================================= + +find_program( CYTHON_EXECUTABLE NAMES cython cython.bat ) + +include( FindPackageHandleStandardArgs ) +FIND_PACKAGE_HANDLE_STANDARD_ARGS( Cython REQUIRED_VARS CYTHON_EXECUTABLE ) + +mark_as_advanced( CYTHON_EXECUTABLE ) diff --git a/python/cmake_modules/FindNumPy.cmake b/python/cmake_modules/FindNumPy.cmake new file mode 100644 index 0000000000000..58bb531f5324a --- /dev/null +++ b/python/cmake_modules/FindNumPy.cmake @@ -0,0 +1,100 @@ +# - Find the NumPy libraries +# This module finds if NumPy is installed, and sets the following variables +# indicating where it is. +# +# TODO: Update to provide the libraries and paths for linking npymath lib. +# +# NUMPY_FOUND - was NumPy found +# NUMPY_VERSION - the version of NumPy found as a string +# NUMPY_VERSION_MAJOR - the major version number of NumPy +# NUMPY_VERSION_MINOR - the minor version number of NumPy +# NUMPY_VERSION_PATCH - the patch version number of NumPy +# NUMPY_VERSION_DECIMAL - e.g. version 1.6.1 is 10601 +# NUMPY_INCLUDE_DIRS - path to the NumPy include files + +#============================================================================ +# Copyright 2012 Continuum Analytics, Inc. +# +# MIT License +# +# Permission is hereby granted, free of charge, to any person obtaining +# a copy of this software and associated documentation files +# (the "Software"), to deal in the Software without restriction, including +# without limitation the rights to use, copy, modify, merge, publish, +# distribute, sublicense, and/or sell copies of the Software, and to permit +# persons to whom the Software is furnished to do so, subject to +# the following conditions: +# +# The above copyright notice and this permission notice shall be included +# in all copies or substantial portions of the Software. +# +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS +# OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL +# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR +# OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, +# ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR +# OTHER DEALINGS IN THE SOFTWARE. 
+# +#============================================================================ + +# Finding NumPy involves calling the Python interpreter +if(NumPy_FIND_REQUIRED) + find_package(PythonInterp REQUIRED) +else() + find_package(PythonInterp) +endif() + +if(NOT PYTHONINTERP_FOUND) + set(NUMPY_FOUND FALSE) + return() +endif() + +execute_process(COMMAND "${PYTHON_EXECUTABLE}" "-c" + "import numpy as n; print(n.__version__); print(n.get_include());" + RESULT_VARIABLE _NUMPY_SEARCH_SUCCESS + OUTPUT_VARIABLE _NUMPY_VALUES_OUTPUT + ERROR_VARIABLE _NUMPY_ERROR_VALUE + OUTPUT_STRIP_TRAILING_WHITESPACE) + +if(NOT _NUMPY_SEARCH_SUCCESS MATCHES 0) + if(NumPy_FIND_REQUIRED) + message(FATAL_ERROR + "NumPy import failure:\n${_NUMPY_ERROR_VALUE}") + endif() + set(NUMPY_FOUND FALSE) + return() +endif() + +# Convert the process output into a list +string(REGEX REPLACE ";" "\\\\;" _NUMPY_VALUES ${_NUMPY_VALUES_OUTPUT}) +string(REGEX REPLACE "\n" ";" _NUMPY_VALUES ${_NUMPY_VALUES}) +list(GET _NUMPY_VALUES 0 NUMPY_VERSION) +list(GET _NUMPY_VALUES 1 NUMPY_INCLUDE_DIRS) + +string(REGEX MATCH "^[0-9]+\\.[0-9]+\\.[0-9]+" _VER_CHECK "${NUMPY_VERSION}") +if("${_VER_CHECK}" STREQUAL "") + # The output from Python was unexpected. Raise an error always + # here, because we found NumPy, but it appears to be corrupted somehow. + message(FATAL_ERROR + "Requested version and include path from NumPy, got instead:\n${_NUMPY_VALUES_OUTPUT}\n") + return() +endif() + +# Make sure all directory separators are '/' +string(REGEX REPLACE "\\\\" "/" NUMPY_INCLUDE_DIRS ${NUMPY_INCLUDE_DIRS}) + +# Get the major and minor version numbers +string(REGEX REPLACE "\\." ";" _NUMPY_VERSION_LIST ${NUMPY_VERSION}) +list(GET _NUMPY_VERSION_LIST 0 NUMPY_VERSION_MAJOR) +list(GET _NUMPY_VERSION_LIST 1 NUMPY_VERSION_MINOR) +list(GET _NUMPY_VERSION_LIST 2 NUMPY_VERSION_PATCH) +string(REGEX MATCH "[0-9]*" NUMPY_VERSION_PATCH ${NUMPY_VERSION_PATCH}) +math(EXPR NUMPY_VERSION_DECIMAL + "(${NUMPY_VERSION_MAJOR} * 10000) + (${NUMPY_VERSION_MINOR} * 100) + ${NUMPY_VERSION_PATCH}") + +find_package_message(NUMPY + "Found NumPy: version \"${NUMPY_VERSION}\" ${NUMPY_INCLUDE_DIRS}" + "${NUMPY_INCLUDE_DIRS}${NUMPY_VERSION}") + +set(NUMPY_FOUND TRUE) diff --git a/python/cmake_modules/FindPythonLibsNew.cmake b/python/cmake_modules/FindPythonLibsNew.cmake new file mode 100644 index 0000000000000..c70e6bc26a719 --- /dev/null +++ b/python/cmake_modules/FindPythonLibsNew.cmake @@ -0,0 +1,236 @@ +# - Find python libraries +# This module finds the libraries corresponding to the Python interpeter +# FindPythonInterp provides. +# This code sets the following variables: +# +# PYTHONLIBS_FOUND - have the Python libs been found +# PYTHON_PREFIX - path to the Python installation +# PYTHON_LIBRARIES - path to the python library +# PYTHON_INCLUDE_DIRS - path to where Python.h is found +# PYTHON_SITE_PACKAGES - path to installation site-packages +# PYTHON_IS_DEBUG - whether the Python interpreter is a debug build +# +# PYTHON_INCLUDE_PATH - path to where Python.h is found (deprecated) +# +# A function PYTHON_ADD_MODULE( src1 src2 ... srcN) is defined +# to build modules for python. +# +# Thanks to talljimbo for the patch adding the 'LDVERSION' config +# variable usage. + +#============================================================================= +# Copyright 2001-2009 Kitware, Inc. +# Copyright 2012-2014 Continuum Analytics, Inc. +# +# All rights reserved. 
+# +# Redistribution and use in source and binary forms, with or without +# modification, are permitted provided that the following conditions +# are met: +# +# * Redistributions of source code must retain the above copyright +# notice, this list of conditions and the following disclaimer. +# +# * Redistributions in binary form must reproduce the above copyright +# notice, this list of conditions and the following disclaimer in the +# documentation and/or other materials provided with the distribution. +# +# * Neither the names of Kitware, Inc., the Insight Software Consortium, +# nor the names of their contributors may be used to endorse or promote +# products derived from this software without specific prior written +# permission. +# +# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS +# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT +# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR +# # A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT +# HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, +# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT +# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY +# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. +#============================================================================= +# (To distribute this file outside of CMake, substitute the full +# License text for the above reference.) + +# Use the Python interpreter to find the libs. +if(PythonLibsNew_FIND_REQUIRED) + find_package(PythonInterp REQUIRED) +else() + find_package(PythonInterp) +endif() + +if(NOT PYTHONINTERP_FOUND) + set(PYTHONLIBS_FOUND FALSE) + return() +endif() + +# According to http://stackoverflow.com/questions/646518/python-how-to-detect-debug-interpreter +# testing whether sys has the gettotalrefcount function is a reliable, +# cross-platform way to detect a CPython debug interpreter. +# +# The library suffix is from the config var LDVERSION sometimes, otherwise +# VERSION. VERSION will typically be like "2.7" on unix, and "27" on windows. +# +# The config var LIBPL is for Linux, and helps on Debian Jessie where the +# addition of multi-arch support shuffled things around. 
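For readability, the interpreter probe that the execute_process() call below runs can be written out as a standalone script. This is a sketch of the same queries, not part of the module; each print corresponds to one index read out of _PYTHON_VALUES:

    # probe_python.py -- hypothetical standalone version of the embedded query
    import struct
    import sys
    from distutils import sysconfig as s

    print('.'.join(str(v) for v in sys.version_info))  # full version tuple
    print(sys.prefix)                                  # installation prefix
    print(s.get_python_inc(plat_specific=True))        # where Python.h lives
    print(s.get_python_lib(plat_specific=True))        # site-packages
    print(s.get_config_var('SO'))                      # extension suffix, e.g. '.so'
    print(hasattr(sys, 'gettotalrefcount') + 0)        # 1 only on a debug CPython
    print(struct.calcsize('@P'))                       # pointer size in bytes
    print(s.get_config_var('LDVERSION') or s.get_config_var('VERSION'))
    print(s.get_config_var('LIBPL'))                   # library dir (Linux)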
+execute_process(COMMAND "${PYTHON_EXECUTABLE}" "-c" + "from distutils import sysconfig as s;import sys;import struct; +print('.'.join(str(v) for v in sys.version_info)); +print(sys.prefix); +print(s.get_python_inc(plat_specific=True)); +print(s.get_python_lib(plat_specific=True)); +print(s.get_config_var('SO')); +print(hasattr(sys, 'gettotalrefcount')+0); +print(struct.calcsize('@P')); +print(s.get_config_var('LDVERSION') or s.get_config_var('VERSION')); +print(s.get_config_var('LIBPL')); +" + RESULT_VARIABLE _PYTHON_SUCCESS + OUTPUT_VARIABLE _PYTHON_VALUES + ERROR_VARIABLE _PYTHON_ERROR_VALUE + OUTPUT_STRIP_TRAILING_WHITESPACE) + +if(NOT _PYTHON_SUCCESS MATCHES 0) + if(PythonLibsNew_FIND_REQUIRED) + message(FATAL_ERROR + "Python config failure:\n${_PYTHON_ERROR_VALUE}") + endif() + set(PYTHONLIBS_FOUND FALSE) + return() +endif() + +# Convert the process output into a list +string(REGEX REPLACE ";" "\\\\;" _PYTHON_VALUES ${_PYTHON_VALUES}) +string(REGEX REPLACE "\n" ";" _PYTHON_VALUES ${_PYTHON_VALUES}) +list(GET _PYTHON_VALUES 0 _PYTHON_VERSION_LIST) +list(GET _PYTHON_VALUES 1 PYTHON_PREFIX) +list(GET _PYTHON_VALUES 2 PYTHON_INCLUDE_DIR) +list(GET _PYTHON_VALUES 3 PYTHON_SITE_PACKAGES) +list(GET _PYTHON_VALUES 4 PYTHON_MODULE_EXTENSION) +list(GET _PYTHON_VALUES 5 PYTHON_IS_DEBUG) +list(GET _PYTHON_VALUES 6 PYTHON_SIZEOF_VOID_P) +list(GET _PYTHON_VALUES 7 PYTHON_LIBRARY_SUFFIX) +list(GET _PYTHON_VALUES 8 PYTHON_LIBRARY_PATH) + +# Make sure the Python has the same pointer-size as the chosen compiler +# Skip the check on OS X, it doesn't consistently have CMAKE_SIZEOF_VOID_P defined +if((NOT APPLE) AND (NOT "${PYTHON_SIZEOF_VOID_P}" STREQUAL "${CMAKE_SIZEOF_VOID_P}")) + if(PythonLibsNew_FIND_REQUIRED) + math(EXPR _PYTHON_BITS "${PYTHON_SIZEOF_VOID_P} * 8") + math(EXPR _CMAKE_BITS "${CMAKE_SIZEOF_VOID_P} * 8") + message(FATAL_ERROR + "Python config failure: Python is ${_PYTHON_BITS}-bit, " + "chosen compiler is ${_CMAKE_BITS}-bit") + endif() + set(PYTHONLIBS_FOUND FALSE) + return() +endif() + +# The built-in FindPython didn't always give the version numbers +string(REGEX REPLACE "\\." ";" _PYTHON_VERSION_LIST ${_PYTHON_VERSION_LIST}) +list(GET _PYTHON_VERSION_LIST 0 PYTHON_VERSION_MAJOR) +list(GET _PYTHON_VERSION_LIST 1 PYTHON_VERSION_MINOR) +list(GET _PYTHON_VERSION_LIST 2 PYTHON_VERSION_PATCH) + +# Make sure all directory separators are '/' +string(REGEX REPLACE "\\\\" "/" PYTHON_PREFIX ${PYTHON_PREFIX}) +string(REGEX REPLACE "\\\\" "/" PYTHON_INCLUDE_DIR ${PYTHON_INCLUDE_DIR}) +string(REGEX REPLACE "\\\\" "/" PYTHON_SITE_PACKAGES ${PYTHON_SITE_PACKAGES}) + +if(CMAKE_HOST_WIN32) + if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "MSVC") + set(PYTHON_LIBRARY + "${PYTHON_PREFIX}/libs/Python${PYTHON_LIBRARY_SUFFIX}.lib") + else() + set(PYTHON_LIBRARY "${PYTHON_PREFIX}/libs/libpython${PYTHON_LIBRARY_SUFFIX}.a") + endif() +elseif(APPLE) + # Seems to require "-undefined dynamic_lookup" instead of linking + # against the .dylib, otherwise it crashes. This flag is added + # below + set(PYTHON_LIBRARY "") + #set(PYTHON_LIBRARY + # "${PYTHON_PREFIX}/lib/libpython${PYTHON_LIBRARY_SUFFIX}.dylib") +else() + if(${PYTHON_SIZEOF_VOID_P} MATCHES 8) + set(_PYTHON_LIBS_SEARCH "${PYTHON_PREFIX}/lib64" "${PYTHON_PREFIX}/lib" "${PYTHON_LIBRARY_PATH}") + else() + set(_PYTHON_LIBS_SEARCH "${PYTHON_PREFIX}/lib" "${PYTHON_LIBRARY_PATH}") + endif() + message(STATUS "Searching for Python libs in ${_PYTHON_LIBS_SEARCH}") + # Probably this needs to be more involved. 
It would be nice if the config + # information the python interpreter itself gave us were more complete. + find_library(PYTHON_LIBRARY + NAMES "python${PYTHON_LIBRARY_SUFFIX}" + PATHS ${_PYTHON_LIBS_SEARCH} + NO_DEFAULT_PATH) + message(STATUS "Found Python lib ${PYTHON_LIBRARY}") +endif() + +# For backward compatibility, set PYTHON_INCLUDE_PATH, but make it internal. +SET(PYTHON_INCLUDE_PATH "${PYTHON_INCLUDE_DIR}" CACHE INTERNAL + "Path to where Python.h is found (deprecated)") + +MARK_AS_ADVANCED( + PYTHON_LIBRARY + PYTHON_INCLUDE_DIR +) + +# We use PYTHON_INCLUDE_DIR, PYTHON_LIBRARY and PYTHON_DEBUG_LIBRARY for the +# cache entries because they are meant to specify the location of a single +# library. We now set the variables listed by the documentation for this +# module. +SET(PYTHON_INCLUDE_DIRS "${PYTHON_INCLUDE_DIR}") +SET(PYTHON_LIBRARIES "${PYTHON_LIBRARY}") +SET(PYTHON_DEBUG_LIBRARIES "${PYTHON_DEBUG_LIBRARY}") + + +# Don't know how to get to this directory, just doing something simple :P +#INCLUDE(${CMAKE_CURRENT_LIST_DIR}/FindPackageHandleStandardArgs.cmake) +#FIND_PACKAGE_HANDLE_STANDARD_ARGS(PythonLibs DEFAULT_MSG PYTHON_LIBRARIES PYTHON_INCLUDE_DIRS) +find_package_message(PYTHON + "Found PythonLibs: ${PYTHON_LIBRARY}" + "${PYTHON_EXECUTABLE}${PYTHON_VERSION}") + + +# PYTHON_ADD_MODULE( src1 src2 ... srcN) is used to build modules for python. +FUNCTION(PYTHON_ADD_MODULE _NAME ) + GET_PROPERTY(_TARGET_SUPPORTS_SHARED_LIBS + GLOBAL PROPERTY TARGET_SUPPORTS_SHARED_LIBS) + OPTION(PYTHON_ENABLE_MODULE_${_NAME} "Add module ${_NAME}" TRUE) + OPTION(PYTHON_MODULE_${_NAME}_BUILD_SHARED + "Add module ${_NAME} shared" ${_TARGET_SUPPORTS_SHARED_LIBS}) + + # Mark these options as advanced + MARK_AS_ADVANCED(PYTHON_ENABLE_MODULE_${_NAME} + PYTHON_MODULE_${_NAME}_BUILD_SHARED) + + IF(PYTHON_ENABLE_MODULE_${_NAME}) + IF(PYTHON_MODULE_${_NAME}_BUILD_SHARED) + SET(PY_MODULE_TYPE MODULE) + ELSE(PYTHON_MODULE_${_NAME}_BUILD_SHARED) + SET(PY_MODULE_TYPE STATIC) + SET_PROPERTY(GLOBAL APPEND PROPERTY PY_STATIC_MODULES_LIST ${_NAME}) + ENDIF(PYTHON_MODULE_${_NAME}_BUILD_SHARED) + + SET_PROPERTY(GLOBAL APPEND PROPERTY PY_MODULES_LIST ${_NAME}) + ADD_LIBRARY(${_NAME} ${PY_MODULE_TYPE} ${ARGN}) + IF(APPLE) + # On OS X, linking against the Python libraries causes + # segfaults, so do this dynamic lookup instead. + SET_TARGET_PROPERTIES(${_NAME} PROPERTIES LINK_FLAGS + "-undefined dynamic_lookup") + ELSE() + TARGET_LINK_LIBRARIES(${_NAME} ${PYTHON_LIBRARIES}) + ENDIF() + IF(PYTHON_MODULE_${_NAME}_BUILD_SHARED) + SET_TARGET_PROPERTIES(${_NAME} PROPERTIES PREFIX "${PYTHON_MODULE_PREFIX}") + SET_TARGET_PROPERTIES(${_NAME} PROPERTIES SUFFIX "${PYTHON_MODULE_EXTENSION}") + ELSE() + ENDIF() + + ENDIF(PYTHON_ENABLE_MODULE_${_NAME}) +ENDFUNCTION(PYTHON_ADD_MODULE) \ No newline at end of file diff --git a/python/cmake_modules/UseCython.cmake b/python/cmake_modules/UseCython.cmake new file mode 100644 index 0000000000000..e7034db52f335 --- /dev/null +++ b/python/cmake_modules/UseCython.cmake @@ -0,0 +1,164 @@ +# Define a function to create Cython modules. +# +# For more information on the Cython project, see http://cython.org/. +# "Cython is a language that makes writing C extensions for the Python language +# as easy as Python itself." +# +# This file defines a CMake function to build a Cython Python module. +# To use it, first include this file. +# +# include( UseCython ) +# +# Then call cython_add_module to create a module. +# +# cython_add_module( ... 
)
+#
+# Where <module_name> is the desired name of the target for the resulting Python
+# module, <pyx_target_name> is the desired name of the target that runs the
+# Cython compiler to generate the needed C or C++ files, <generated_files> is a
+# variable to hold the files generated by Cython, and <src1> <src2> ... <srcN>
+# are source files to be compiled into the module, e.g. *.pyx, *.c, *.cxx, etc.
+# Only one .pyx file may be present for each target
+# (this is an inherent limitation of Cython).
+#
+# The sample paths set with the CMake include_directories() command will be used
+# for include directories to search for *.pxd when running the Cython compiler.
+#
+# Cache variables that affect the behavior include:
+#
+#  CYTHON_ANNOTATE
+#  CYTHON_NO_DOCSTRINGS
+#  CYTHON_FLAGS
+#
+# Source file properties that affect the build process are
+#
+#  CYTHON_IS_CXX
+#  CYTHON_PUBLIC
+#  CYTHON_API
+#
+# If CYTHON_IS_CXX is set on a *.pyx file with the CMake
+# set_source_files_properties() command, the file will be compiled as a C++ file.
+#
+# See also FindCython.cmake

+#=============================================================================
+# Copyright 2011 Kitware, Inc.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#=============================================================================
+
+# Configuration options.
+set( CYTHON_ANNOTATE OFF
+  CACHE BOOL "Create an annotated .html file when compiling *.pyx." )
+set( CYTHON_NO_DOCSTRINGS OFF
+  CACHE BOOL "Strip docstrings from the compiled module." )
+set( CYTHON_FLAGS "" CACHE STRING
+  "Extra flags to the cython compiler." )
+mark_as_advanced( CYTHON_ANNOTATE CYTHON_NO_DOCSTRINGS CYTHON_FLAGS )
+
+find_package( Cython REQUIRED )
+find_package( PythonLibsNew REQUIRED )
+
+set( CYTHON_CXX_EXTENSION "cxx" )
+set( CYTHON_C_EXTENSION "c" )
+
+# Create a *.c or *.cxx file from a *.pyx file.
+# Input the generated file basename. The generated files will be put into the
+# variable named by the "generated_files" argument. The remaining arguments are
+# the *.py and *.pyx source files.
+function( compile_pyx _name pyx_target_name generated_files pyx_file)
+  # Default to assuming all files are C.
+  set( cxx_arg "" )
+  set( extension ${CYTHON_C_EXTENSION} )
+  set( pyx_lang "C" )
+  set( comment "Compiling Cython C source for ${_name}..." )
+
+  get_filename_component( pyx_file_basename "${pyx_file}" NAME_WE )
+
+  # Determine if it is a C or C++ file.
+  get_source_file_property( property_is_cxx ${pyx_file} CYTHON_IS_CXX )
+  if( ${property_is_cxx} )
+    set( cxx_arg "--cplus" )
+    set( extension ${CYTHON_CXX_EXTENSION} )
+    set( pyx_lang "CXX" )
+    set( comment "Compiling Cython CXX source for ${_name}..." )
+  endif()
+  get_source_file_property( pyx_location ${pyx_file} LOCATION )
+
+  # Set additional flags.
+  if( CYTHON_ANNOTATE )
+    set( annotate_arg "--annotate" )
+  endif()
+
+  if( CYTHON_NO_DOCSTRINGS )
+    set( no_docstrings_arg "--no-docstrings" )
+  endif()
+
+  if(NOT WIN32)
+    if( "${CMAKE_BUILD_TYPE}" STREQUAL "Debug" OR
+        "${CMAKE_BUILD_TYPE}" STREQUAL "RelWithDebInfo" )
+      set( cython_debug_arg "--gdb" )
+    endif()
+  endif()
+
+  # Determining generated file names.
+  get_source_file_property( property_is_public ${pyx_file} CYTHON_PUBLIC )
+  get_source_file_property( property_is_api ${pyx_file} CYTHON_API )
+  if( ${property_is_api} )
+    set( _generated_files "${_name}.${extension}" "${_name}.h" "${_name}_api.h")
+  elseif( ${property_is_public} )
+    set( _generated_files "${_name}.${extension}" "${_name}.h")
+  else()
+    set( _generated_files "${_name}.${extension}")
+  endif()
+  set_source_files_properties( ${_generated_files} PROPERTIES GENERATED TRUE )
+  set( ${generated_files} ${_generated_files} PARENT_SCOPE )
+
+  # Add the command to run the compiler.
+  add_custom_target(${pyx_target_name}
+    COMMAND ${CYTHON_EXECUTABLE} ${cxx_arg} ${include_directory_arg}
+    ${annotate_arg} ${no_docstrings_arg} ${cython_debug_arg} ${CYTHON_FLAGS}
+    --output-file "${_name}.${extension}" ${pyx_location}
+    DEPENDS ${pyx_location}
+    # do not specify byproducts for now since they don't work with the older
+    # version of cmake available in the apt repositories.
+    #BYPRODUCTS ${_generated_files}
+    COMMENT ${comment}
+  )
+
+  # Remove their visibility to the user.
+  set( corresponding_pxd_file "" CACHE INTERNAL "" )
+  set( header_location "" CACHE INTERNAL "" )
+  set( pxd_location "" CACHE INTERNAL "" )
+endfunction()
+
+# cython_add_module( <module_name> <pyx_target_name> <generated_files> src1 src2 ... srcN )
+# Build the Cython Python module.
+function( cython_add_module _name pyx_target_name generated_files)
+  set( pyx_module_source "" )
+  set( other_module_sources "" )
+  foreach( _file ${ARGN} )
+    if( ${_file} MATCHES ".*\\.py[x]?$" )
+      list( APPEND pyx_module_source ${_file} )
+    else()
+      list( APPEND other_module_sources ${_file} )
+    endif()
+  endforeach()
+  compile_pyx( ${_name} ${pyx_target_name} _generated_files ${pyx_module_source} )
+  set( ${generated_files} ${_generated_files} PARENT_SCOPE )
+  include_directories( ${PYTHON_INCLUDE_DIRS} )
+  python_add_module( ${_name} ${_generated_files} ${other_module_sources} )
+  add_dependencies( ${_name} ${pyx_target_name})
+  target_link_libraries( ${_name} ${PYTHON_LIBRARIES} )
+endfunction()
+
+include( CMakeParseArguments )
diff --git a/python/setup.py b/python/setup.py
new file mode 100644
index 0000000000000..f6b0a4bee8316
--- /dev/null
+++ b/python/setup.py
@@ -0,0 +1,244 @@
+#!/usr/bin/env python
+
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
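One fragile spot worth flagging before the imports below: the guard further down compares Cython.__version__ against '0.19.1' as plain strings, and lexicographic comparison misorders versions (for example, '0.9' sorts after '0.19.1'). A more robust sketch using pkg_resources, which this setup.py already imports:

    # Hypothetical hardened version check; parse_version compares
    # release segments numerically instead of character by character.
    from pkg_resources import parse_version
    import Cython

    if parse_version(Cython.__version__) < parse_version('0.19.1'):
        raise Exception('Please upgrade to Cython 0.19.1 or newer')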
+ +import glob +import os.path as osp +import re +import shutil +from Cython.Distutils import build_ext as _build_ext +import Cython + +import sys + +import pkg_resources +from setuptools import setup + +import os + +from os.path import join as pjoin + +from distutils.command.clean import clean as _clean +from distutils import sysconfig + +# Check if we're running 64-bit Python +is_64_bit = sys.maxsize > 2**32 + +# Check if this is a debug build of Python. +if hasattr(sys, 'gettotalrefcount'): + build_type = 'Debug' +else: + build_type = 'Release' + +if Cython.__version__ < '0.19.1': + raise Exception('Please upgrade to Cython 0.19.1 or newer') + +MAJOR = 0 +MINOR = 1 +MICRO = 0 +VERSION = '%d.%d.%d' % (MAJOR, MINOR, MICRO) + + +class clean(_clean): + + def run(self): + _clean.run(self) + for x in []: + try: + os.remove(x) + except OSError: + pass + + +class build_ext(_build_ext): + + def build_extensions(self): + numpy_incl = pkg_resources.resource_filename('numpy', 'core/include') + + for ext in self.extensions: + if (hasattr(ext, 'include_dirs') and + numpy_incl not in ext.include_dirs): + ext.include_dirs.append(numpy_incl) + _build_ext.build_extensions(self) + + def run(self): + self._run_cmake() + _build_ext.run(self) + + # adapted from cmake_build_ext in dynd-python + # github.com/libdynd/dynd-python + + description = "Build the C-extensions for arrow" + user_options = ([('extra-cmake-args=', None, + 'extra arguments for CMake')] + + _build_ext.user_options) + + def initialize_options(self): + _build_ext.initialize_options(self) + self.extra_cmake_args = '' + + def _run_cmake(self): + # The directory containing this setup.py + source = osp.dirname(osp.abspath(__file__)) + + # The staging directory for the module being built + build_temp = pjoin(os.getcwd(), self.build_temp) + + # Change to the build directory + saved_cwd = os.getcwd() + if not os.path.isdir(self.build_temp): + self.mkpath(self.build_temp) + os.chdir(self.build_temp) + + # Detect if we built elsewhere + if os.path.isfile('CMakeCache.txt'): + cachefile = open('CMakeCache.txt', 'r') + cachedir = re.search('CMAKE_CACHEFILE_DIR:INTERNAL=(.*)', + cachefile.read()).group(1) + cachefile.close() + if (cachedir != build_temp): + return + + pyexe_option = '-DPYTHON_EXECUTABLE=%s' % sys.executable + static_lib_option = '' + build_tests_option = '' + + if sys.platform != 'win32': + cmake_command = ['cmake', self.extra_cmake_args, pyexe_option, + build_tests_option, + static_lib_option, source] + + self.spawn(cmake_command) + self.spawn(['make']) + else: + import shlex + cmake_generator = 'Visual Studio 14 2015' + if is_64_bit: + cmake_generator += ' Win64' + # Generate the build files + extra_cmake_args = shlex.split(self.extra_cmake_args) + cmake_command = (['cmake'] + extra_cmake_args + + [source, pyexe_option, + static_lib_option, + build_tests_option, + '-G', cmake_generator]) + if "-G" in self.extra_cmake_args: + cmake_command = cmake_command[:-2] + + self.spawn(cmake_command) + # Do the build + self.spawn(['cmake', '--build', '.', '--config', build_type]) + + if self.inplace: + # a bit hacky + build_lib = saved_cwd + else: + build_lib = pjoin(os.getcwd(), self.build_lib) + + # Move the built libpyarrow library to the place expected by the Python + # build + if sys.platform != 'win32': + name, = glob.glob('libpyarrow.*') + try: + os.makedirs(pjoin(build_lib, 'arrow')) + except OSError: + pass + shutil.move(name, pjoin(build_lib, 'arrow', name)) + else: + shutil.move(pjoin(build_type, 'pyarrow.dll'), + pjoin(build_lib, 
'arrow', 'pyarrow.dll')) + + # Move the built C-extension to the place expected by the Python build + self._found_names = [] + for name in self.get_cmake_cython_names(): + built_path = self.get_ext_built(name) + if not os.path.exists(built_path): + print(built_path) + raise RuntimeError('libpyarrow C-extension failed to build:', + os.path.abspath(built_path)) + + ext_path = pjoin(build_lib, self._get_cmake_ext_path(name)) + if os.path.exists(ext_path): + os.remove(ext_path) + self.mkpath(os.path.dirname(ext_path)) + print('Moving built libpyarrow C-extension', built_path, + 'to build path', ext_path) + shutil.move(self.get_ext_built(name), ext_path) + self._found_names.append(name) + + os.chdir(saved_cwd) + + def _get_inplace_dir(self): + pass + + def _get_cmake_ext_path(self, name): + # Get the package directory from build_py + build_py = self.get_finalized_command('build_py') + package_dir = build_py.get_package_dir('arrow') + # This is the name of the arrow C-extension + suffix = sysconfig.get_config_var('EXT_SUFFIX') + if suffix is None: + suffix = sysconfig.get_config_var('SO') + filename = name + suffix + return pjoin(package_dir, filename) + + def get_ext_built(self, name): + if sys.platform == 'win32': + head, tail = os.path.split(name) + suffix = sysconfig.get_config_var('SO') + return pjoin(head, build_type, tail + suffix) + else: + suffix = sysconfig.get_config_var('SO') + return name + suffix + + def get_cmake_cython_names(self): + return ['config', 'parquet'] + + def get_names(self): + return self._found_names + + def get_outputs(self): + # Just the C extensions + cmake_exts = [self._get_cmake_ext_path(name) + for name in self.get_names()] + regular_exts = _build_ext.get_outputs(self) + return regular_exts + cmake_exts + + +extensions = [] + +DESC = """\ +Python library for Apache Arrow""" + +setup( + name="arrow", + packages=['arrow', 'arrow.tests'], + version=VERSION, + package_data={'arrow': ['*.pxd', '*.pyx']}, + ext_modules=extensions, + cmdclass={ + 'clean': clean, + 'build_ext': build_ext + }, + install_requires=['cython >= 0.21'], + description=DESC, + license='Apache License, Version 2.0', + maintainer="Apache Arrow Developers", + maintainer_email="dev@arrow.apache.org", + test_suite="arrow.tests" +) diff --git a/python/src/pyarrow/CMakeLists.txt b/python/src/pyarrow/CMakeLists.txt new file mode 100644 index 0000000000000..e20c3238b5f78 --- /dev/null +++ b/python/src/pyarrow/CMakeLists.txt @@ -0,0 +1,20 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
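The EXT_SUFFIX fallback in _get_cmake_ext_path above is the one subtle bit: newer CPython builds expose a versioned extension suffix, while very old ones only define the legacy SO variable. The same resolution can be checked standalone with a short sketch; the printed value is a platform-dependent example, not a guarantee:

from distutils import sysconfig

# EXT_SUFFIX (e.g. '.cpython-35m-x86_64-linux-gnu.so') supersedes the
# legacy 'SO' config var; setup.py above falls back for old interpreters.
suffix = sysconfig.get_config_var('EXT_SUFFIX')
if suffix is None:
    suffix = sysconfig.get_config_var('SO')

# 'config' is one of the names returned by get_cmake_cython_names()
print('config' + suffix)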
+ +####################################### +# Unit tests +####################################### diff --git a/python/src/pyarrow/api.h b/python/src/pyarrow/api.h new file mode 100644 index 0000000000000..c2285de77bf10 --- /dev/null +++ b/python/src/pyarrow/api.h @@ -0,0 +1,21 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef PYARROW_API_H +#define PYARROW_API_H + +#endif // PYARROW_API_H diff --git a/python/src/pyarrow/init.cc b/python/src/pyarrow/init.cc new file mode 100644 index 0000000000000..c36f413725532 --- /dev/null +++ b/python/src/pyarrow/init.cc @@ -0,0 +1,29 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "pyarrow/init.h" + +namespace arrow { + +namespace py { + +void pyarrow_init() { +} + +} // namespace py + +} // namespace arrow diff --git a/python/src/pyarrow/init.h b/python/src/pyarrow/init.h new file mode 100644 index 0000000000000..1fc9f10102696 --- /dev/null +++ b/python/src/pyarrow/init.h @@ -0,0 +1,31 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
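pyarrow_init() above is an intentionally empty hook: a place for one-time native setup when the extension module is first imported. A pure-Python analogue of that pattern, with a hypothetical placeholder body rather than anything the C++ side actually does, looks like:

_initialized = False

def pyarrow_init():
    # One-time library initialization; subsequent calls are no-ops.
    global _initialized
    if _initialized:
        return
    _initialized = True
    # ... register types, set up memory pools, etc. (hypothetical)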
+ +#ifndef PYARROW_INIT_H +#define PYARROW_INIT_H + +namespace arrow { + +namespace py { + +void pyarrow_init(); + +} // namespace py + +} // namespace arrow + +#endif // PYARROW_INIT_H diff --git a/python/src/pyarrow/util/CMakeLists.txt b/python/src/pyarrow/util/CMakeLists.txt new file mode 100644 index 0000000000000..60dc80eb38cb6 --- /dev/null +++ b/python/src/pyarrow/util/CMakeLists.txt @@ -0,0 +1,53 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +####################################### +# pyarrow_util +####################################### + +set(UTIL_SRCS +) + +set(UTIL_LIBS +) + +add_library(pyarrow_util STATIC + ${UTIL_SRCS} +) +target_link_libraries(pyarrow_util ${UTIL_LIBS}) +SET_TARGET_PROPERTIES(pyarrow_util PROPERTIES LINKER_LANGUAGE CXX) + +####################################### +# pyarrow_test_main +####################################### + +add_library(pyarrow_test_main + test_main.cc) + +if (APPLE) + target_link_libraries(pyarrow_test_main + gmock + dl) + set_target_properties(pyarrow_test_main + PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") +else() + target_link_libraries(pyarrow_test_main + gtest + pthread + dl + ) +endif() diff --git a/python/src/pyarrow/util/test_main.cc b/python/src/pyarrow/util/test_main.cc new file mode 100644 index 0000000000000..00139f36742ed --- /dev/null +++ b/python/src/pyarrow/util/test_main.cc @@ -0,0 +1,26 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
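The pyarrow_test_main library above links each test binary against Google Test (gmock on macOS, where -undefined dynamic_lookup defers Python symbol resolution to load time). Driving one of the resulting executables from Python is then routine; in this sketch the binary name pyarrow-util-test is a hypothetical example:

import subprocess
import sys

# Run a gtest-linked pyarrow test binary and propagate its exit status;
# --gtest_color is a standard Google Test flag.
status = subprocess.call(['./pyarrow-util-test', '--gtest_color=yes'])
sys.exit(status)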
+
+#include <gtest/gtest.h>
+
+int main(int argc, char **argv) {
+  ::testing::InitGoogleTest(&argc, argv);
+
+  int ret = RUN_ALL_TESTS();
+
+  return ret;
+}

From 8caa287263425c5b6c64c0e25fb8aa945e2f78d4 Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Mon, 7 Mar 2016 14:47:36 -0800
Subject: [PATCH 0021/1644] ARROW-35: Add a short call-to-action in the top level README.md

Author: Wes McKinney

Closes #13 from wesm/ARROW-35 and squashes the following commits:

e10bfc3 [Wes McKinney] Add a proper mailto link
c4428fe [Wes McKinney] Add a short 'how to get involved' blurb in top-level README
---
 README.md | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/README.md b/README.md
index d948a996bc075..84bae78cc7fbe 100644
--- a/README.md
+++ b/README.md
@@ -22,3 +22,24 @@ Initial implementations include:
 - [Arrow Structures and APIs in Java](https://github.com/apache/arrow/tree/master/java)
 
 Arrow is an [Apache Software Foundation](www.apache.org) project. More info can be found at [arrow.apache.org](http://arrow.apache.org).
+
+#### Getting involved
+
+Right now the primary audience for Apache Arrow is the designers and
+developers of data systems; most people will use Apache Arrow indirectly
+through systems that use it for internal data handling and interoperating with
+other Arrow-enabled systems.
+
+Even if you do not plan to contribute to Apache Arrow itself or Arrow
+integrations in other projects, we'd be happy to have you involved:
+
+- Join the mailing list: send an email to
+  [dev-subscribe@arrow.apache.org][1]. Share your ideas and use cases for the
+  project.
+- [Follow our activity on JIRA][3]
+- [Learn the format][2]
+- Contribute code to one of the reference implementations
+
+[1]: mailto:dev-subscribe@arrow.apache.org
+[2]: https://github.com/apache/arrow/tree/master/format
+[3]: https://issues.apache.org/jira/browse/ARROW
\ No newline at end of file

From 571343bbe36f99a11ed82e475b976bbe79dfb755 Mon Sep 17 00:00:00 2001
From: hyukjinkwon
Date: Mon, 7 Mar 2016 14:49:27 -0800
Subject: [PATCH 0022/1644] ARROW-9: Rename some unchanged "Drill" to "Arrow" (follow-up)

https://issues.apache.org/jira/browse/ARROW-9

One reference in `ValueVector` still said "Drill" rather than "Arrow"; this renames it and fixes minor typos.

Author: hyukjinkwon
Author: Hyukjin Kwon

Closes #18 from HyukjinKwon/ARROW-9 and squashes the following commits:

54a5d9f [Hyukjin Kwon] Update typo
628f35d [hyukjinkwon] Replace straggler references to Drill (follow-up)
---
 .../main/java/org/apache/arrow/vector/ValueVector.java | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java
index c05f0e7c50fd2..a170c59abd7cc 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java
@@ -63,7 +63,7 @@ public interface ValueVector extends Closeable, Iterable<ValueVector> {
 
   /**
    * Allocates new buffers. ValueVector implements logic to determine how much to allocate.
-   * @return Returns true if allocation was succesful.
+   * @return Returns true if allocation was successful.
    */
  boolean allocateNewSafe();
 
@@ -71,7 +71,7 @@ public interface ValueVector extends Closeable, Iterable<ValueVector> {
 
  /**
   * Set the initial record capacity
-  * @param numRecords
+  * @param numRecords the initial record capacity.
  */
  void setInitialCapacity(int numRecords);
 
@@ -87,7 +87,7 @@ public interface ValueVector extends Closeable, Iterable<ValueVector> {
  void close();
 
  /**
-  * Release the underlying DrillBuf and reset the ValueVector to empty.
+  * Release the underlying ArrowBuf and reset the ValueVector to empty.
   */
  void clear();
 
@@ -198,7 +198,7 @@ interface Accessor {
  }
 
  /**
-  * An abstractiong that is used to write into this vector instance.
+  * An abstraction that is used to write into this vector instance.
   */
  interface Mutator {
    /**

From 9afb667783b8cedbe6e9d6ee5eb02d35cf1d0f79 Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Mon, 7 Mar 2016 15:02:56 -0800
Subject: [PATCH 0023/1644] ARROW-31: Python: prototype user object model, add PyList conversion path with type inference

Depends on ARROW-7. Pretty mundane stuff, but we have to start somewhere. I'm going to do a little more in this patch (handle normal lists of strings and lists of other supported Python types) before merging.

Author: Wes McKinney

Closes #19 from wesm/ARROW-31 and squashes the following commits:

2345541 [Wes McKinney] Test basic conversion of nested lists
1d4618b [Wes McKinney] Prototype string and double converters
b02b296 [Wes McKinney] Type inference for lists and lists-of-lists
8c3891c [Wes McKinney] Smoke test that array garbage collection deallocates memory
c28bf09 [Wes McKinney] Build array successfully, without validating contents
731544a [Wes McKinney] Move PrimitiveType::ToString template back to type.h
b5b5b82 [Wes McKinney] Failing test stubs, raise on null array
edb451c [Wes McKinney] Add a few data type smoke tests
47fd78e [Wes McKinney] Add unit test stub
07c1379 [Wes McKinney] Move some bits from arrow/type.h to type.cc
3a774fb [Wes McKinney] Add Status::ToString impls. Unit test stub
4e206fc [Wes McKinney] Add pandas converter placeholder
102ed36 [Wes McKinney] Cython array box scaffold builds
94f122f [Wes McKinney] Basic object model for sequence->arrow conversions
bdb02e7 [Wes McKinney] Use shared_ptr with dynamic make_builder too
d5655ba [Wes McKinney] Clean up array builder API to return shared_ptr
4132bda [Wes McKinney] Essential scaffolding -- error handling, memory pools, etc. -- to work toward converting Python lists to Arrow arrays
55e69a2 [Wes McKinney] Typed array stubs
ac8c796 [Wes McKinney] Cache primitive data type instances
8f7edaf [Wes McKinney] Consolidate Field and data type subclasses.
Add more Python stubs ea2f3ec [Wes McKinney] Bootstrap end-to-end exposure in Python, wrap DataType and Field types --- cpp/CMakeLists.txt | 83 ++-- cpp/src/arrow/CMakeLists.txt | 1 - cpp/src/arrow/api.h | 21 + cpp/src/arrow/builder.h | 10 +- cpp/src/arrow/field.h | 63 --- cpp/src/arrow/table/CMakeLists.txt | 15 - cpp/src/arrow/table/column-test.cc | 1 - cpp/src/arrow/table/column.cc | 2 +- cpp/src/arrow/table/column.h | 2 +- cpp/src/arrow/table/schema-test.cc | 9 +- cpp/src/arrow/table/schema.cc | 2 +- cpp/src/arrow/table/schema.h | 1 - cpp/src/arrow/table/table-test.cc | 1 - cpp/src/arrow/table/table.cc | 2 +- cpp/src/arrow/table/test-common.h | 1 - cpp/src/arrow/type.cc | 49 +++ cpp/src/arrow/type.h | 143 ++++-- cpp/src/arrow/types/CMakeLists.txt | 22 +- cpp/src/arrow/types/boolean.h | 3 +- cpp/src/arrow/types/construct.cc | 21 +- cpp/src/arrow/types/construct.h | 6 +- cpp/src/arrow/types/json.cc | 5 +- cpp/src/arrow/types/list-test.cc | 24 +- cpp/src/arrow/types/list.cc | 12 - cpp/src/arrow/types/list.h | 51 +-- cpp/src/arrow/types/primitive-test.cc | 64 ++- cpp/src/arrow/types/primitive.h | 22 +- cpp/src/arrow/types/string-test.cc | 11 +- cpp/src/arrow/types/string.h | 41 +- cpp/src/arrow/types/struct-test.cc | 19 +- cpp/src/arrow/types/struct.cc | 18 - cpp/src/arrow/types/struct.h | 21 +- cpp/src/arrow/util/CMakeLists.txt | 20 +- cpp/src/arrow/util/buffer.cc | 8 + cpp/src/arrow/util/buffer.h | 2 + cpp/src/arrow/util/status.cc | 40 ++ python/CMakeLists.txt | 21 +- python/arrow/__init__.py | 34 ++ python/arrow/array.pxd | 85 ++++ python/arrow/array.pyx | 179 ++++++++ python/arrow/config.pyx | 2 +- python/arrow/error.pxd | 20 + python/arrow/error.pyx | 30 ++ python/arrow/includes/arrow.pxd | 75 +++- python/arrow/includes/common.pxd | 4 +- python/arrow/includes/pyarrow.pxd | 24 +- python/arrow/scalar.pxd | 47 ++ python/arrow/scalar.pyx | 28 ++ python/arrow/schema.pxd | 39 ++ python/arrow/schema.pyx | 150 +++++++ python/arrow/tests/test_array.py | 26 ++ python/arrow/tests/test_convert_builtin.py | 85 ++++ python/arrow/tests/test_schema.py | 51 +++ python/setup.py | 7 +- python/src/pyarrow/adapters/builtin.cc | 415 ++++++++++++++++++ python/src/pyarrow/adapters/builtin.h | 40 ++ .../src/pyarrow/adapters/pandas.h | 17 +- python/src/pyarrow/api.h | 7 + python/src/pyarrow/common.cc | 71 +++ python/src/pyarrow/common.h | 95 ++++ python/src/pyarrow/helpers.cc | 57 +++ .../null.h => python/src/pyarrow/helpers.h | 22 +- python/src/pyarrow/init.cc | 8 +- python/src/pyarrow/init.h | 8 +- python/src/pyarrow/status.cc | 92 ++++ python/src/pyarrow/status.h | 144 ++++++ 66 files changed, 2246 insertions(+), 453 deletions(-) delete mode 100644 cpp/src/arrow/field.h create mode 100644 python/arrow/array.pxd create mode 100644 python/arrow/array.pyx create mode 100644 python/arrow/error.pxd create mode 100644 python/arrow/error.pyx create mode 100644 python/arrow/scalar.pxd create mode 100644 python/arrow/scalar.pyx create mode 100644 python/arrow/schema.pxd create mode 100644 python/arrow/schema.pyx create mode 100644 python/arrow/tests/test_array.py create mode 100644 python/arrow/tests/test_convert_builtin.py create mode 100644 python/arrow/tests/test_schema.py create mode 100644 python/src/pyarrow/adapters/builtin.cc create mode 100644 python/src/pyarrow/adapters/builtin.h rename cpp/src/arrow/field.cc => python/src/pyarrow/adapters/pandas.h (76%) create mode 100644 python/src/pyarrow/common.cc create mode 100644 python/src/pyarrow/common.h create mode 100644 python/src/pyarrow/helpers.cc rename 
cpp/src/arrow/types/null.h => python/src/pyarrow/helpers.h (72%) create mode 100644 python/src/pyarrow/status.cc create mode 100644 python/src/pyarrow/status.h diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 8042661533e1d..e8cb88c0b4d9b 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -37,18 +37,17 @@ if ("$ENV{CMAKE_EXPORT_COMPILE_COMMANDS}" STREQUAL "1") set(CMAKE_EXPORT_COMPILE_COMMANDS 1) endif() -# Enable using a custom GCC toolchain to build Arrow -if (NOT "$ENV{ARROW_GCC_ROOT}" STREQUAL "") - set(GCC_ROOT $ENV{ARROW_GCC_ROOT}) - set(CMAKE_C_COMPILER ${GCC_ROOT}/bin/gcc) - set(CMAKE_CXX_COMPILER ${GCC_ROOT}/bin/g++) -endif() - if(APPLE) # In newer versions of CMake, this is the default setting set(CMAKE_MACOSX_RPATH 1) endif() +find_program(CCACHE_FOUND ccache) +if(CCACHE_FOUND) + set_property(GLOBAL PROPERTY RULE_LAUNCH_COMPILE ccache) + set_property(GLOBAL PROPERTY RULE_LAUNCH_LINK ccache) +endif(CCACHE_FOUND) + # ---------------------------------------------------------------------- # cmake options @@ -126,38 +125,16 @@ endif () # Add common flags set(CMAKE_CXX_FLAGS "${CXX_COMMON_FLAGS} ${CMAKE_CXX_FLAGS}") -# Required to avoid static linking errors with dependencies -add_definitions(-fPIC) - # Determine compiler version include(CompilerInfo) if ("${COMPILER_FAMILY}" STREQUAL "clang") - # Clang helpfully provides a few extensions from C++11 such as the 'override' - # keyword on methods. This doesn't change behavior, and we selectively enable - # it in src/gutil/port.h only on clang. So, we can safely use it, and don't want - # to trigger warnings when we do so. - # set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-c++11-extensions") - # Using Clang with ccache causes a bunch of spurious warnings that are # purportedly fixed in the next version of ccache. See the following for details: # # http://petereisentraut.blogspot.com/2011/05/ccache-and-clang.html # http://petereisentraut.blogspot.com/2011/09/ccache-and-clang-part-2.html set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Qunused-arguments") - - # Only hardcode -fcolor-diagnostics if stderr is opened on a terminal. Otherwise - # the color codes show up as noisy artifacts. - # - # This test is imperfect because 'cmake' and 'make' can be run independently - # (with different terminal options), and we're testing during the former. - execute_process(COMMAND test -t 2 RESULT_VARIABLE ARROW_IS_TTY) - if ((${ARROW_IS_TTY} EQUAL 0) AND (NOT ("$ENV{TERM}" STREQUAL "dumb"))) - message("Running in a controlling terminal") - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fcolor-diagnostics") - else() - message("Running without a controlling terminal or in a dumb terminal") - endif() endif() # Sanity check linking option. @@ -278,12 +255,6 @@ set(LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}") set(EXECUTABLE_OUTPUT_PATH "${BUILD_OUTPUT_ROOT_DIRECTORY}") include_directories(src) -############################################################ -# Visibility -############################################################ -# For generate_export_header() and add_compiler_export_flags(). 
-include(GenerateExportHeader) - ############################################################ # Testing ############################################################ @@ -456,21 +427,32 @@ endif() # Subdirectories ############################################################ -add_subdirectory(src/arrow) -add_subdirectory(src/arrow/util) -add_subdirectory(src/arrow/table) -add_subdirectory(src/arrow/types) - -set(LINK_LIBS - arrow_util - arrow_table - arrow_types) +set(LIBARROW_LINK_LIBS +) set(ARROW_SRCS src/arrow/array.cc src/arrow/builder.cc - src/arrow/field.cc src/arrow/type.cc + + src/arrow/table/column.cc + src/arrow/table/schema.cc + src/arrow/table/table.cc + + src/arrow/types/construct.cc + src/arrow/types/floating.cc + src/arrow/types/integer.cc + src/arrow/types/json.cc + src/arrow/types/list.cc + src/arrow/types/primitive.cc + src/arrow/types/string.cc + src/arrow/types/struct.cc + src/arrow/types/union.cc + + src/arrow/util/bit-util.cc + src/arrow/util/buffer.cc + src/arrow/util/memory-pool.cc + src/arrow/util/status.cc ) set(LIBARROW_LINKAGE "SHARED") @@ -479,8 +461,15 @@ add_library(arrow ${LIBARROW_LINKAGE} ${ARROW_SRCS} ) -target_link_libraries(arrow ${LINK_LIBS}) -set_target_properties(arrow PROPERTIES LINKER_LANGUAGE CXX) +set_target_properties(arrow + PROPERTIES + LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}") +target_link_libraries(arrow ${LIBARROW_LINK_LIBS}) + +add_subdirectory(src/arrow) +add_subdirectory(src/arrow/util) +add_subdirectory(src/arrow/table) +add_subdirectory(src/arrow/types) install(TARGETS arrow LIBRARY DESTINATION lib diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index 102a8a1853f3e..77326ce38d754 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -20,7 +20,6 @@ install(FILES api.h array.h builder.h - field.h type.h DESTINATION include/arrow) diff --git a/cpp/src/arrow/api.h b/cpp/src/arrow/api.h index 899e8aae19c0e..c73d4b386cf54 100644 --- a/cpp/src/arrow/api.h +++ b/cpp/src/arrow/api.h @@ -15,7 +15,28 @@ // specific language governing permissions and limitations // under the License. 
+// Coarse public API while the library is in development + #ifndef ARROW_API_H #define ARROW_API_H +#include "arrow/array.h" +#include "arrow/builder.h" +#include "arrow/type.h" + +#include "arrow/table/column.h" +#include "arrow/table/schema.h" +#include "arrow/table/table.h" + +#include "arrow/types/boolean.h" +#include "arrow/types/construct.h" +#include "arrow/types/floating.h" +#include "arrow/types/integer.h" +#include "arrow/types/list.h" +#include "arrow/types/string.h" +#include "arrow/types/struct.h" + +#include "arrow/util/memory-pool.h" +#include "arrow/util/status.h" + #endif // ARROW_API_H diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index 491b9133d2cca..8cc689c3e81ee 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -32,7 +32,7 @@ class Array; class MemoryPool; class PoolBuffer; -static constexpr int32_t MIN_BUILDER_CAPACITY = 1 << 8; +static constexpr int32_t MIN_BUILDER_CAPACITY = 1 << 5; // Base class for all data array builders class ArrayBuilder { @@ -78,12 +78,16 @@ class ArrayBuilder { // Creates new array object to hold the contents of the builder and transfers // ownership of the data - virtual Status ToArray(Array** out) = 0; + virtual std::shared_ptr Finish() = 0; + + const std::shared_ptr& type() const { + return type_; + } protected: MemoryPool* pool_; - TypePtr type_; + std::shared_ptr type_; // When nulls are first appended to the builder, the null bitmap is allocated std::shared_ptr nulls_; diff --git a/cpp/src/arrow/field.h b/cpp/src/arrow/field.h deleted file mode 100644 index 89a450c66f256..0000000000000 --- a/cpp/src/arrow/field.h +++ /dev/null @@ -1,63 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
- -#ifndef ARROW_FIELD_H -#define ARROW_FIELD_H - -#include - -#include "arrow/type.h" - -namespace arrow { - -// A field is a piece of metadata that includes (for now) a name and a data -// type - -struct Field { - // Field name - std::string name; - - // The field's data type - TypePtr type; - - Field(const std::string& name, const TypePtr& type) : - name(name), - type(type) {} - - bool operator==(const Field& other) const { - return this->Equals(other); - } - - bool operator!=(const Field& other) const { - return !this->Equals(other); - } - - bool Equals(const Field& other) const { - return (this == &other) || (this->name == other.name && - this->type->Equals(other.type.get())); - } - - bool nullable() const { - return this->type->nullable; - } - - std::string ToString() const; -}; - -} // namespace arrow - -#endif // ARROW_FIELD_H diff --git a/cpp/src/arrow/table/CMakeLists.txt b/cpp/src/arrow/table/CMakeLists.txt index 68bf3148a9889..26d843d853bfb 100644 --- a/cpp/src/arrow/table/CMakeLists.txt +++ b/cpp/src/arrow/table/CMakeLists.txt @@ -19,21 +19,6 @@ # arrow_table ####################################### -set(TABLE_SRCS - column.cc - schema.cc - table.cc -) - -set(TABLE_LIBS -) - -add_library(arrow_table STATIC - ${TABLE_SRCS} -) -target_link_libraries(arrow_table ${TABLE_LIBS}) -SET_TARGET_PROPERTIES(arrow_table PROPERTIES LINKER_LANGUAGE CXX) - # Headers: top level install(FILES column.h diff --git a/cpp/src/arrow/table/column-test.cc b/cpp/src/arrow/table/column-test.cc index 4959b82c6e2ae..bf95932916cf4 100644 --- a/cpp/src/arrow/table/column-test.cc +++ b/cpp/src/arrow/table/column-test.cc @@ -21,7 +21,6 @@ #include #include -#include "arrow/field.h" #include "arrow/table/column.h" #include "arrow/table/schema.h" #include "arrow/table/test-common.h" diff --git a/cpp/src/arrow/table/column.cc b/cpp/src/arrow/table/column.cc index d68b491fb99da..573e650875944 100644 --- a/cpp/src/arrow/table/column.cc +++ b/cpp/src/arrow/table/column.cc @@ -20,7 +20,7 @@ #include #include -#include "arrow/field.h" +#include "arrow/type.h" #include "arrow/util/status.h" namespace arrow { diff --git a/cpp/src/arrow/table/column.h b/cpp/src/arrow/table/column.h index 64423bf956147..dfc7516e26aac 100644 --- a/cpp/src/arrow/table/column.h +++ b/cpp/src/arrow/table/column.h @@ -23,7 +23,7 @@ #include #include "arrow/array.h" -#include "arrow/field.h" +#include "arrow/type.h" namespace arrow { diff --git a/cpp/src/arrow/table/schema-test.cc b/cpp/src/arrow/table/schema-test.cc index 0cf1b3c5f9a8e..d6725cc08c0c8 100644 --- a/cpp/src/arrow/table/schema-test.cc +++ b/cpp/src/arrow/table/schema-test.cc @@ -20,7 +20,6 @@ #include #include -#include "arrow/field.h" #include "arrow/table/schema.h" #include "arrow/type.h" #include "arrow/types/string.h" @@ -97,10 +96,10 @@ TEST_F(TestSchema, ToString) { auto schema = std::make_shared(fields); std::string result = schema->ToString(); - std::string expected = R"(f0 ?int32 -f1 uint8 -f2 ?string -f3 ?list + std::string expected = R"(f0 int32 +f1 uint8 not null +f2 string +f3 list )"; ASSERT_EQ(expected, result); diff --git a/cpp/src/arrow/table/schema.cc b/cpp/src/arrow/table/schema.cc index fb3b4d6f29268..d49d0a713e7f4 100644 --- a/cpp/src/arrow/table/schema.cc +++ b/cpp/src/arrow/table/schema.cc @@ -22,7 +22,7 @@ #include #include -#include "arrow/field.h" +#include "arrow/type.h" namespace arrow { diff --git a/cpp/src/arrow/table/schema.h b/cpp/src/arrow/table/schema.h index d04e3f628c1e3..103f01b26e3ca 100644 --- a/cpp/src/arrow/table/schema.h +++ 
b/cpp/src/arrow/table/schema.h @@ -22,7 +22,6 @@ #include #include -#include "arrow/field.h" #include "arrow/type.h" namespace arrow { diff --git a/cpp/src/arrow/table/table-test.cc b/cpp/src/arrow/table/table-test.cc index dd4f74cd16f89..c4fdb062db83a 100644 --- a/cpp/src/arrow/table/table-test.cc +++ b/cpp/src/arrow/table/table-test.cc @@ -21,7 +21,6 @@ #include #include -#include "arrow/field.h" #include "arrow/table/column.h" #include "arrow/table/schema.h" #include "arrow/table/table.h" diff --git a/cpp/src/arrow/table/table.cc b/cpp/src/arrow/table/table.cc index 4cefc924ed38f..0c788b8fe3ff3 100644 --- a/cpp/src/arrow/table/table.cc +++ b/cpp/src/arrow/table/table.cc @@ -20,9 +20,9 @@ #include #include -#include "arrow/field.h" #include "arrow/table/column.h" #include "arrow/table/schema.h" +#include "arrow/type.h" #include "arrow/util/status.h" namespace arrow { diff --git a/cpp/src/arrow/table/test-common.h b/cpp/src/arrow/table/test-common.h index efe2f228cd0a3..50a5f6a2f5018 100644 --- a/cpp/src/arrow/table/test-common.h +++ b/cpp/src/arrow/table/test-common.h @@ -21,7 +21,6 @@ #include #include -#include "arrow/field.h" #include "arrow/table/column.h" #include "arrow/table/schema.h" #include "arrow/table/table.h" diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index ff145e2c1e3b4..265770822ce90 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -17,8 +17,56 @@ #include "arrow/type.h" +#include +#include + namespace arrow { +std::string Field::ToString() const { + std::stringstream ss; + ss << this->name << " " << this->type->ToString(); + return ss.str(); +} + +DataType::~DataType() {} + +StringType::StringType(bool nullable) + : DataType(LogicalType::STRING, nullable) {} + +StringType::StringType(const StringType& other) + : StringType(other.nullable) {} + +std::string StringType::ToString() const { + std::string result(name()); + if (!nullable) { + result.append(" not null"); + } + return result; +} + +std::string ListType::ToString() const { + std::stringstream s; + s << "list<" << value_type->ToString() << ">"; + if (!this->nullable) { + s << " not null"; + } + return s.str(); +} + +std::string StructType::ToString() const { + std::stringstream s; + s << "struct<"; + for (size_t i = 0; i < fields_.size(); ++i) { + if (i > 0) s << ", "; + const std::shared_ptr& field = fields_[i]; + s << field->name << ": " << field->type->ToString(); + } + s << ">"; + if (!nullable) s << " not null"; + return s.str(); +} + +const std::shared_ptr NA = std::make_shared(); const std::shared_ptr BOOL = std::make_shared(); const std::shared_ptr UINT8 = std::make_shared(); const std::shared_ptr UINT16 = std::make_shared(); @@ -30,5 +78,6 @@ const std::shared_ptr INT32 = std::make_shared(); const std::shared_ptr INT64 = std::make_shared(); const std::shared_ptr FLOAT = std::make_shared(); const std::shared_ptr DOUBLE = std::make_shared(); +const std::shared_ptr STRING = std::make_shared(); } // namespace arrow diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 4193a0e8bc851..e78e49491193e 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -20,6 +20,7 @@ #include #include +#include namespace arrow { @@ -71,49 +72,46 @@ struct LogicalType { UINT64 = 7, INT64 = 8, - // A boolean value represented as 1 byte - BOOL = 9, - // A boolean value represented as 1 bit - BIT = 10, + BOOL = 9, // 4-byte floating point value - FLOAT = 11, + FLOAT = 10, // 8-byte floating point value - DOUBLE = 12, + DOUBLE = 11, // CHAR(N): fixed-length UTF8 string with length N 
- CHAR = 13, + CHAR = 12, // UTF8 variable-length string as List - STRING = 14, + STRING = 13, // VARCHAR(N): Null-terminated string type embedded in a CHAR(N + 1) - VARCHAR = 15, + VARCHAR = 14, // Variable-length bytes (no guarantee of UTF8-ness) - BINARY = 16, + BINARY = 15, // By default, int32 days since the UNIX epoch - DATE = 17, + DATE = 16, // Exact timestamp encoded with int64 since UNIX epoch // Default unit millisecond - TIMESTAMP = 18, + TIMESTAMP = 17, // Timestamp as double seconds since the UNIX epoch - TIMESTAMP_DOUBLE = 19, + TIMESTAMP_DOUBLE = 18, // Exact time encoded with int64, default unit millisecond - TIME = 20, + TIME = 19, // Precision- and scale-based decimal type. Storage type depends on the // parameters. - DECIMAL = 21, + DECIMAL = 20, // Decimal value encoded as a text string - DECIMAL_TEXT = 22, + DECIMAL_TEXT = 21, // A list of some logical data type LIST = 30, @@ -141,7 +139,9 @@ struct DataType { type(type), nullable(nullable) {} - virtual bool Equals(const DataType* other) { + virtual ~DataType(); + + bool Equals(const DataType* other) { // Call with a pointer so more friendly to subclasses return this == other || (this->type == other->type && this->nullable == other->nullable); @@ -154,10 +154,45 @@ struct DataType { virtual std::string ToString() const = 0; }; - typedef std::shared_ptr LayoutPtr; typedef std::shared_ptr TypePtr; +// A field is a piece of metadata that includes (for now) a name and a data +// type +struct Field { + // Field name + std::string name; + + // The field's data type + TypePtr type; + + Field(const std::string& name, const TypePtr& type) : + name(name), + type(type) {} + + bool operator==(const Field& other) const { + return this->Equals(other); + } + + bool operator!=(const Field& other) const { + return !this->Equals(other); + } + + bool Equals(const Field& other) const { + return (this == &other) || (this->name == other.name && + this->type->Equals(other.type.get())); + } + + bool Equals(const std::shared_ptr& other) const { + return Equals(*other.get()); + } + + bool nullable() const { + return this->type->nullable; + } + + std::string ToString() const; +}; struct BytesType : public LayoutType { int size; @@ -183,16 +218,18 @@ struct PrimitiveType : public DataType { explicit PrimitiveType(bool nullable = true) : DataType(Derived::type_enum, nullable) {} - virtual std::string ToString() const { - std::string result; - if (nullable) { - result.append("?"); - } - result.append(static_cast(this)->name()); - return result; - } + std::string ToString() const override; }; +template +inline std::string PrimitiveType::ToString() const { + std::string result(static_cast(this)->name()); + if (!nullable) { + result.append(" not null"); + } + return result; +} + #define PRIMITIVE_DECL(TYPENAME, C_TYPE, ENUM, SIZE, NAME) \ typedef C_TYPE c_type; \ static constexpr LogicalType::type type_enum = LogicalType::ENUM; \ @@ -205,6 +242,10 @@ struct PrimitiveType : public DataType { return NAME; \ } +struct NullType : public PrimitiveType { + PRIMITIVE_DECL(NullType, void, NA, 0, "null"); +}; + struct BooleanType : public PrimitiveType { PRIMITIVE_DECL(BooleanType, uint8_t, BOOL, 1, "bool"); }; @@ -249,6 +290,55 @@ struct DoubleType : public PrimitiveType { PRIMITIVE_DECL(DoubleType, double, DOUBLE, 8, "double"); }; +struct ListType : public DataType { + // List can contain any other logical value type + TypePtr value_type; + + explicit ListType(const TypePtr& value_type, bool nullable = true) + : DataType(LogicalType::LIST, nullable), + 
value_type(value_type) {} + + static char const *name() { + return "list"; + } + + std::string ToString() const override; +}; + +// String is a logical type consisting of a physical list of 1-byte values +struct StringType : public DataType { + explicit StringType(bool nullable = true); + + StringType(const StringType& other); + + static char const *name() { + return "string"; + } + + std::string ToString() const override; +}; + +struct StructType : public DataType { + std::vector > fields_; + + explicit StructType(const std::vector >& fields, + bool nullable = true) + : DataType(LogicalType::STRUCT, nullable) { + fields_ = fields; + } + + const std::shared_ptr& field(int i) const { + return fields_[i]; + } + + int num_children() const { + return fields_.size(); + } + + std::string ToString() const override; +}; + +extern const std::shared_ptr NA; extern const std::shared_ptr BOOL; extern const std::shared_ptr UINT8; extern const std::shared_ptr UINT16; @@ -260,6 +350,7 @@ extern const std::shared_ptr INT32; extern const std::shared_ptr INT64; extern const std::shared_ptr FLOAT; extern const std::shared_ptr DOUBLE; +extern const std::shared_ptr STRING; } // namespace arrow diff --git a/cpp/src/arrow/types/CMakeLists.txt b/cpp/src/arrow/types/CMakeLists.txt index e090aead1f8b9..57cabdefd2525 100644 --- a/cpp/src/arrow/types/CMakeLists.txt +++ b/cpp/src/arrow/types/CMakeLists.txt @@ -19,31 +19,11 @@ # arrow_types ####################################### -set(TYPES_SRCS - construct.cc - floating.cc - integer.cc - json.cc - list.cc - primitive.cc - string.cc - struct.cc - union.cc -) - -set(TYPES_LIBS -) - -add_library(arrow_types STATIC - ${TYPES_SRCS} -) -target_link_libraries(arrow_types ${TYPES_LIBS}) -SET_TARGET_PROPERTIES(arrow_types PROPERTIES LINKER_LANGUAGE CXX) - # Headers: top level install(FILES boolean.h collection.h + construct.h datetime.h decimal.h floating.h diff --git a/cpp/src/arrow/types/boolean.h b/cpp/src/arrow/types/boolean.h index 8fc9cfd19c0d4..a5023d7b368d2 100644 --- a/cpp/src/arrow/types/boolean.h +++ b/cpp/src/arrow/types/boolean.h @@ -24,7 +24,8 @@ namespace arrow { typedef PrimitiveArrayImpl BooleanArray; -// typedef PrimitiveBuilder BooleanBuilder; +class BooleanBuilder : public ArrayBuilder { +}; } // namespace arrow diff --git a/cpp/src/arrow/types/construct.cc b/cpp/src/arrow/types/construct.cc index 05d6b270fc3fd..43f01a3051385 100644 --- a/cpp/src/arrow/types/construct.cc +++ b/cpp/src/arrow/types/construct.cc @@ -32,13 +32,13 @@ class ArrayBuilder; // Initially looked at doing this with vtables, but shared pointers makes it // difficult -#define BUILDER_CASE(ENUM, BuilderType) \ - case LogicalType::ENUM: \ - *out = static_cast(new BuilderType(pool, type)); \ +#define BUILDER_CASE(ENUM, BuilderType) \ + case LogicalType::ENUM: \ + out->reset(new BuilderType(pool, type)); \ return Status::OK(); -Status make_builder(MemoryPool* pool, const TypePtr& type, - ArrayBuilder** out) { +Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, + std::shared_ptr* out) { switch (type->type) { BUILDER_CASE(UINT8, UInt8Builder); BUILDER_CASE(INT8, Int8Builder); @@ -58,13 +58,12 @@ Status make_builder(MemoryPool* pool, const TypePtr& type, case LogicalType::LIST: { - ListType* list_type = static_cast(type.get()); - ArrayBuilder* value_builder; - RETURN_NOT_OK(make_builder(pool, list_type->value_type, &value_builder)); + std::shared_ptr value_builder; - // The ListBuilder takes ownership of the value_builder - ListBuilder* builder = new ListBuilder(pool, type, 
value_builder); - *out = static_cast(builder); + const std::shared_ptr& value_type = static_cast( + type.get())->value_type; + RETURN_NOT_OK(MakeBuilder(pool, value_type, &value_builder)); + out->reset(new ListBuilder(pool, type, value_builder)); return Status::OK(); } // BUILDER_CASE(CHAR, CharBuilder); diff --git a/cpp/src/arrow/types/construct.h b/cpp/src/arrow/types/construct.h index b5ba436f787d9..59ebe1acddc98 100644 --- a/cpp/src/arrow/types/construct.h +++ b/cpp/src/arrow/types/construct.h @@ -18,6 +18,8 @@ #ifndef ARROW_TYPES_CONSTRUCT_H #define ARROW_TYPES_CONSTRUCT_H +#include + #include "arrow/type.h" namespace arrow { @@ -26,8 +28,8 @@ class ArrayBuilder; class MemoryPool; class Status; -Status make_builder(MemoryPool* pool, const TypePtr& type, - ArrayBuilder** out); +Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, + std::shared_ptr* out); } // namespace arrow diff --git a/cpp/src/arrow/types/json.cc b/cpp/src/arrow/types/json.cc index b29b95715fef6..168e370d51a14 100644 --- a/cpp/src/arrow/types/json.cc +++ b/cpp/src/arrow/types/json.cc @@ -19,10 +19,7 @@ #include -#include "arrow/types/boolean.h" -#include "arrow/types/integer.h" -#include "arrow/types/floating.h" -#include "arrow/types/null.h" +#include "arrow/type.h" #include "arrow/types/string.h" #include "arrow/types/union.h" diff --git a/cpp/src/arrow/types/list-test.cc b/cpp/src/arrow/types/list-test.cc index b4bbd2841a89d..02991de2648e7 100644 --- a/cpp/src/arrow/types/list-test.cc +++ b/cpp/src/arrow/types/list-test.cc @@ -32,6 +32,7 @@ #include "arrow/types/test-common.h" #include "arrow/util/status.h" +using std::shared_ptr; using std::string; using std::unique_ptr; using std::vector; @@ -47,17 +48,18 @@ TEST(TypesTest, TestListType) { ASSERT_EQ(list_type.type, LogicalType::LIST); ASSERT_EQ(list_type.name(), string("list")); - ASSERT_EQ(list_type.ToString(), string("?list")); + ASSERT_EQ(list_type.ToString(), string("list")); ASSERT_EQ(list_type.value_type->type, vt->type); ASSERT_EQ(list_type.value_type->type, vt->type); std::shared_ptr st = std::make_shared(false); std::shared_ptr lt = std::make_shared(st, false); - ASSERT_EQ(lt->ToString(), string("list")); + ASSERT_EQ(lt->ToString(), string("list not null")); ListType lt2(lt, false); - ASSERT_EQ(lt2.ToString(), string("list>")); + ASSERT_EQ(lt2.ToString(), + string("list not null> not null")); } // ---------------------------------------------------------------------- @@ -71,23 +73,21 @@ class TestListBuilder : public TestBuilder { value_type_ = TypePtr(new Int32Type()); type_ = TypePtr(new ListType(value_type_)); - ArrayBuilder* tmp; - ASSERT_OK(make_builder(pool_, type_, &tmp)); - builder_.reset(static_cast(tmp)); + std::shared_ptr tmp; + ASSERT_OK(MakeBuilder(pool_, type_, &tmp)); + builder_ = std::dynamic_pointer_cast(tmp); } void Done() { - Array* out; - ASSERT_OK(builder_->ToArray(&out)); - result_.reset(static_cast(out)); + result_ = std::dynamic_pointer_cast(builder_->Finish()); } protected: TypePtr value_type_; TypePtr type_; - unique_ptr builder_; - unique_ptr result_; + shared_ptr builder_; + shared_ptr result_; }; @@ -116,7 +116,7 @@ TEST_F(TestListBuilder, TestBasics) { vector lengths = {3, 0, 4}; vector is_null = {0, 1, 0}; - Int32Builder* vb = static_cast(builder_->value_builder()); + Int32Builder* vb = static_cast(builder_->value_builder().get()); int pos = 0; for (size_t i = 0; i < lengths.size(); ++i) { diff --git a/cpp/src/arrow/types/list.cc b/cpp/src/arrow/types/list.cc index 577d71d0b2892..69a79a77fabe0 100644 --- 
a/cpp/src/arrow/types/list.cc +++ b/cpp/src/arrow/types/list.cc @@ -17,18 +17,6 @@ #include "arrow/types/list.h" -#include -#include - namespace arrow { -std::string ListType::ToString() const { - std::stringstream s; - if (this->nullable) { - s << "?"; - } - s << "list<" << value_type->ToString() << ">"; - return s.str(); -} - } // namespace arrow diff --git a/cpp/src/arrow/types/list.h b/cpp/src/arrow/types/list.h index f39fe5c4d811b..f40a8245362b1 100644 --- a/cpp/src/arrow/types/list.h +++ b/cpp/src/arrow/types/list.h @@ -36,21 +36,6 @@ namespace arrow { class MemoryPool; -struct ListType : public DataType { - // List can contain any other logical value type - TypePtr value_type; - - explicit ListType(const TypePtr& value_type, bool nullable = true) - : DataType(LogicalType::LIST, nullable), - value_type(value_type) {} - - static char const *name() { - return "list"; - } - - virtual std::string ToString() const; -}; - class ListArray : public Array { public: ListArray() : Array(), offset_buf_(nullptr), offsets_(nullptr) {} @@ -106,10 +91,9 @@ class ListArray : public Array { class ListBuilder : public Int32Builder { public: ListBuilder(MemoryPool* pool, const TypePtr& type, - ArrayBuilder* value_builder) - : Int32Builder(pool, type) { - value_builder_.reset(value_builder); - } + std::shared_ptr value_builder) + : Int32Builder(pool, type), + value_builder_(value_builder) {} Status Init(int32_t elements) { // One more than requested. @@ -147,30 +131,27 @@ class ListBuilder : public Int32Builder { return Status::OK(); } - // Initialize an array type instance with the results of this builder - // Transfers ownership of all buffers template - Status Transfer(Container* out) { - Array* child_values; - RETURN_NOT_OK(value_builder_->ToArray(&child_values)); + std::shared_ptr Transfer() { + auto result = std::make_shared(); + + std::shared_ptr items = value_builder_->Finish(); // Add final offset if the length is non-zero if (length_) { - raw_buffer()[length_] = child_values->length(); + raw_buffer()[length_] = items->length(); } - out->Init(type_, length_, values_, ArrayPtr(child_values), + result->Init(type_, length_, values_, items, null_count_, nulls_); values_ = nulls_ = nullptr; capacity_ = length_ = null_count_ = 0; - return Status::OK(); + + return result; } - virtual Status ToArray(Array** out) { - ListArray* result = new ListArray(); - RETURN_NOT_OK(Transfer(result)); - *out = static_cast(result); - return Status::OK(); + std::shared_ptr Finish() override { + return Transfer(); } // Start a new variable-length list slot @@ -198,10 +179,12 @@ class ListBuilder : public Int32Builder { return Append(true); } - ArrayBuilder* value_builder() const { return value_builder_.get();} + const std::shared_ptr& value_builder() const { + return value_builder_; + } protected: - std::unique_ptr value_builder_; + std::shared_ptr value_builder_; }; diff --git a/cpp/src/arrow/types/primitive-test.cc b/cpp/src/arrow/types/primitive-test.cc index 02eaaa7542bf0..f35a258e2cb57 100644 --- a/cpp/src/arrow/types/primitive-test.cc +++ b/cpp/src/arrow/types/primitive-test.cc @@ -37,6 +37,7 @@ #include "arrow/util/status.h" using std::string; +using std::shared_ptr; using std::unique_ptr; using std::vector; @@ -98,12 +99,12 @@ class TestPrimitiveBuilder : public TestBuilder { type_ = Attrs::type(); - ArrayBuilder* tmp; - ASSERT_OK(make_builder(pool_, type_, &tmp)); - builder_.reset(static_cast(tmp)); + std::shared_ptr tmp; + ASSERT_OK(MakeBuilder(pool_, type_, &tmp)); + builder_ = std::dynamic_pointer_cast(tmp); 
- ASSERT_OK(make_builder(pool_, type_, &tmp)); - builder_nn_.reset(static_cast(tmp)); + ASSERT_OK(MakeBuilder(pool_, type_, &tmp)); + builder_nn_ = std::dynamic_pointer_cast(tmp); } void RandomData(int N, double pct_null = 0.1) { @@ -112,7 +113,6 @@ class TestPrimitiveBuilder : public TestBuilder { } void CheckNullable() { - ArrayType result; ArrayType expected; int size = builder_->length(); @@ -125,7 +125,9 @@ class TestPrimitiveBuilder : public TestBuilder { int32_t ex_null_count = null_count(nulls_); expected.Init(size, ex_data, ex_null_count, ex_nulls); - ASSERT_OK(builder_->Transfer(&result)); + + std::shared_ptr result = std::dynamic_pointer_cast( + builder_->Finish()); // Builder is now reset ASSERT_EQ(0, builder_->length()); @@ -133,12 +135,11 @@ class TestPrimitiveBuilder : public TestBuilder { ASSERT_EQ(0, builder_->null_count()); ASSERT_EQ(nullptr, builder_->buffer()); - ASSERT_TRUE(result.Equals(expected)); - ASSERT_EQ(ex_null_count, result.null_count()); + ASSERT_TRUE(result->Equals(expected)); + ASSERT_EQ(ex_null_count, result->null_count()); } void CheckNonNullable() { - ArrayType result; ArrayType expected; int size = builder_nn_->length(); @@ -146,22 +147,24 @@ class TestPrimitiveBuilder : public TestBuilder { size * sizeof(T)); expected.Init(size, ex_data); - ASSERT_OK(builder_nn_->Transfer(&result)); + + std::shared_ptr result = std::dynamic_pointer_cast( + builder_nn_->Finish()); // Builder is now reset ASSERT_EQ(0, builder_nn_->length()); ASSERT_EQ(0, builder_nn_->capacity()); ASSERT_EQ(nullptr, builder_nn_->buffer()); - ASSERT_TRUE(result.Equals(expected)); - ASSERT_EQ(0, result.null_count()); + ASSERT_TRUE(result->Equals(expected)); + ASSERT_EQ(0, result->null_count()); } protected: TypePtr type_; TypePtr type_nn_; - unique_ptr builder_; - unique_ptr builder_nn_; + shared_ptr builder_; + shared_ptr builder_nn_; vector draws_; vector nulls_; @@ -225,15 +228,36 @@ TYPED_TEST(TestPrimitiveBuilder, TestAppendNull) { ASSERT_OK(this->builder_->AppendNull()); } - Array* result; - ASSERT_OK(this->builder_->ToArray(&result)); - unique_ptr holder(result); + auto result = this->builder_->Finish(); for (int i = 0; i < size; ++i) { ASSERT_TRUE(result->IsNull(i)); } } +TYPED_TEST(TestPrimitiveBuilder, TestArrayDtorDealloc) { + DECL_T(); + + int size = 10000; + + vector& draws = this->draws_; + vector& nulls = this->nulls_; + + int64_t memory_before = this->pool_->bytes_allocated(); + + this->RandomData(size); + + int i; + for (i = 0; i < size; ++i) { + ASSERT_OK(this->builder_->Append(draws[i], nulls[i] > 0)); + } + + do { + std::shared_ptr result = this->builder_->Finish(); + } while (false); + + ASSERT_EQ(memory_before, this->pool_->bytes_allocated()); +} TYPED_TEST(TestPrimitiveBuilder, TestAppendScalar) { DECL_T(); @@ -331,11 +355,11 @@ TYPED_TEST(TestPrimitiveBuilder, TestResize) { } TYPED_TEST(TestPrimitiveBuilder, TestReserve) { - int n = 100; - ASSERT_OK(this->builder_->Reserve(n)); + ASSERT_OK(this->builder_->Reserve(10)); ASSERT_EQ(0, this->builder_->length()); ASSERT_EQ(MIN_BUILDER_CAPACITY, this->builder_->capacity()); + ASSERT_OK(this->builder_->Reserve(90)); ASSERT_OK(this->builder_->Advance(100)); ASSERT_OK(this->builder_->Reserve(MIN_BUILDER_CAPACITY)); diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index 09d43e7ec8b80..1073bb6e1c340 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -64,6 +64,8 @@ class PrimitiveArrayImpl : public PrimitiveArray { PrimitiveArrayImpl() : PrimitiveArray() {} + 
virtual ~PrimitiveArrayImpl() {} + PrimitiveArrayImpl(int32_t length, const std::shared_ptr& data, int32_t null_count = 0, const std::shared_ptr& nulls = nullptr) { @@ -197,24 +199,12 @@ class PrimitiveBuilder : public ArrayBuilder { return Status::OK(); } - // Initialize an array type instance with the results of this builder - // Transfers ownership of all buffers - Status Transfer(PrimitiveArray* out) { - out->Init(type_, length_, values_, null_count_, nulls_); + std::shared_ptr Finish() override { + std::shared_ptr result = std::make_shared(); + result->PrimitiveArray::Init(type_, length_, values_, null_count_, nulls_); values_ = nulls_ = nullptr; capacity_ = length_ = null_count_ = 0; - return Status::OK(); - } - - Status Transfer(ArrayType* out) { - return Transfer(static_cast(out)); - } - - virtual Status ToArray(Array** out) { - ArrayType* result = new ArrayType(); - RETURN_NOT_OK(Transfer(result)); - *out = static_cast(result); - return Status::OK(); + return result; } value_type* raw_buffer() { diff --git a/cpp/src/arrow/types/string-test.cc b/cpp/src/arrow/types/string-test.cc index 9af667295026b..8e82fd95dd808 100644 --- a/cpp/src/arrow/types/string-test.cc +++ b/cpp/src/arrow/types/string-test.cc @@ -166,23 +166,18 @@ class TestStringBuilder : public TestBuilder { void SetUp() { TestBuilder::SetUp(); type_ = TypePtr(new StringType()); - - ArrayBuilder* tmp; - ASSERT_OK(make_builder(pool_, type_, &tmp)); - builder_.reset(static_cast(tmp)); + builder_.reset(new StringBuilder(pool_, type_)); } void Done() { - Array* out; - ASSERT_OK(builder_->ToArray(&out)); - result_.reset(static_cast(out)); + result_ = std::dynamic_pointer_cast(builder_->Finish()); } protected: TypePtr type_; std::unique_ptr builder_; - std::unique_ptr result_; + std::shared_ptr result_; }; TEST_F(TestStringBuilder, TestScalarAppend) { diff --git a/cpp/src/arrow/types/string.h b/cpp/src/arrow/types/string.h index 5795cfed577c5..8ccc0a9698a54 100644 --- a/cpp/src/arrow/types/string.h +++ b/cpp/src/arrow/types/string.h @@ -71,28 +71,6 @@ struct VarcharType : public DataType { static const LayoutPtr byte1(new BytesType(1)); static const LayoutPtr physical_string = LayoutPtr(new ListLayoutType(byte1)); -// String is a logical type consisting of a physical list of 1-byte values -struct StringType : public DataType { - explicit StringType(bool nullable = true) - : DataType(LogicalType::STRING, nullable) {} - - StringType(const StringType& other) - : StringType() {} - - static char const *name() { - return "string"; - } - - virtual std::string ToString() const { - std::string result; - if (nullable) { - result.append("?"); - } - result.append(name()); - return result; - } -}; - // TODO: add a BinaryArray layer in between class StringArray : public ListArray { public: @@ -153,26 +131,23 @@ class StringArray : public ListArray { class StringBuilder : public ListBuilder { public: explicit StringBuilder(MemoryPool* pool, const TypePtr& type) : - ListBuilder(pool, type, - static_cast(new UInt8Builder(pool, value_type_))) { + ListBuilder(pool, type, std::make_shared(pool, value_type_)) { byte_builder_ = static_cast(value_builder_.get()); } Status Append(const std::string& value) { - RETURN_NOT_OK(ListBuilder::Append()); - return byte_builder_->Append(reinterpret_cast(value.c_str()), - value.size()); + return Append(value.c_str(), value.size()); } - Status Append(const uint8_t* value, int32_t length); + Status Append(const char* value, int32_t length) { + RETURN_NOT_OK(ListBuilder::Append()); + return 
byte_builder_->Append(reinterpret_cast(value), length); + } Status Append(const std::vector& values, uint8_t* null_bytes); - virtual Status ToArray(Array** out) { - StringArray* result = new StringArray(); - RETURN_NOT_OK(ListBuilder::Transfer(result)); - *out = static_cast(result); - return Status::OK(); + std::shared_ptr Finish() override { + return ListBuilder::Transfer(); } protected: diff --git a/cpp/src/arrow/types/struct-test.cc b/cpp/src/arrow/types/struct-test.cc index df6157104795e..9a4777e8b983d 100644 --- a/cpp/src/arrow/types/struct-test.cc +++ b/cpp/src/arrow/types/struct-test.cc @@ -17,15 +17,16 @@ #include +#include #include #include -#include "arrow/field.h" #include "arrow/type.h" #include "arrow/types/integer.h" #include "arrow/types/string.h" #include "arrow/types/struct.h" +using std::shared_ptr; using std::string; using std::vector; @@ -33,23 +34,23 @@ namespace arrow { TEST(TestStructType, Basics) { TypePtr f0_type = TypePtr(new Int32Type()); - Field f0("f0", f0_type); + auto f0 = std::make_shared("f0", f0_type); TypePtr f1_type = TypePtr(new StringType()); - Field f1("f1", f1_type); + auto f1 = std::make_shared("f1", f1_type); TypePtr f2_type = TypePtr(new UInt8Type()); - Field f2("f2", f2_type); + auto f2 = std::make_shared("f2", f2_type); - vector fields = {f0, f1, f2}; + vector > fields = {f0, f1, f2}; StructType struct_type(fields); - ASSERT_TRUE(struct_type.field(0).Equals(f0)); - ASSERT_TRUE(struct_type.field(1).Equals(f1)); - ASSERT_TRUE(struct_type.field(2).Equals(f2)); + ASSERT_TRUE(struct_type.field(0)->Equals(f0)); + ASSERT_TRUE(struct_type.field(1)->Equals(f1)); + ASSERT_TRUE(struct_type.field(2)->Equals(f2)); - ASSERT_EQ(struct_type.ToString(), "?struct"); + ASSERT_EQ(struct_type.ToString(), "struct"); // TODO: out of bounds for field(...) 
} diff --git a/cpp/src/arrow/types/struct.cc b/cpp/src/arrow/types/struct.cc index 6b233bc372af1..02af600b017d0 100644 --- a/cpp/src/arrow/types/struct.cc +++ b/cpp/src/arrow/types/struct.cc @@ -17,24 +17,6 @@ #include "arrow/types/struct.h" -#include -#include -#include -#include - namespace arrow { -std::string StructType::ToString() const { - std::stringstream s; - if (nullable) s << "?"; - s << "struct<"; - for (size_t i = 0; i < fields_.size(); ++i) { - if (i > 0) s << ", "; - const Field& field = fields_[i]; - s << field.name << ": " << field.type->ToString(); - } - s << ">"; - return s.str(); -} - } // namespace arrow diff --git a/cpp/src/arrow/types/struct.h b/cpp/src/arrow/types/struct.h index e575c31287cb2..5842534d35be1 100644 --- a/cpp/src/arrow/types/struct.h +++ b/cpp/src/arrow/types/struct.h @@ -18,33 +18,14 @@ #ifndef ARROW_TYPES_STRUCT_H #define ARROW_TYPES_STRUCT_H +#include #include #include -#include "arrow/field.h" #include "arrow/type.h" namespace arrow { -struct StructType : public DataType { - std::vector fields_; - - explicit StructType(const std::vector& fields, bool nullable = true) - : DataType(LogicalType::STRUCT, nullable) { - fields_ = fields; - } - - const Field& field(int i) const { - return fields_[i]; - } - - int num_children() const { - return fields_.size(); - } - - virtual std::string ToString() const; -}; - } // namespace arrow #endif // ARROW_TYPES_STRUCT_H diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt index c53f307c9f59a..4272ce4285482 100644 --- a/cpp/src/arrow/util/CMakeLists.txt +++ b/cpp/src/arrow/util/CMakeLists.txt @@ -19,22 +19,6 @@ # arrow_util ####################################### -set(UTIL_SRCS - bit-util.cc - buffer.cc - memory-pool.cc - status.cc -) - -set(UTIL_LIBS -) - -add_library(arrow_util STATIC - ${UTIL_SRCS} -) -target_link_libraries(arrow_util ${UTIL_LIBS}) -SET_TARGET_PROPERTIES(arrow_util PROPERTIES LINKER_LANGUAGE CXX) - # Headers: top level install(FILES bit-util.h @@ -50,7 +34,7 @@ install(FILES add_library(arrow_test_util) target_link_libraries(arrow_test_util - arrow_util) +) SET_TARGET_PROPERTIES(arrow_test_util PROPERTIES LINKER_LANGUAGE CXX) @@ -64,7 +48,6 @@ add_library(arrow_test_main if (APPLE) target_link_libraries(arrow_test_main gtest - arrow_util arrow_test_util dl) set_target_properties(arrow_test_main @@ -72,7 +55,6 @@ if (APPLE) else() target_link_libraries(arrow_test_main gtest - arrow_util arrow_test_util pthread dl diff --git a/cpp/src/arrow/util/buffer.cc b/cpp/src/arrow/util/buffer.cc index 3f3807d4e2094..50f4716769d70 100644 --- a/cpp/src/arrow/util/buffer.cc +++ b/cpp/src/arrow/util/buffer.cc @@ -31,6 +31,8 @@ Buffer::Buffer(const std::shared_ptr& parent, int64_t offset, parent_ = parent; } +Buffer::~Buffer() {} + std::shared_ptr MutableBuffer::GetImmutableView() { return std::make_shared(this->get_shared_ptr(), 0, size()); } @@ -43,6 +45,12 @@ PoolBuffer::PoolBuffer(MemoryPool* pool) : pool_ = pool; } +PoolBuffer::~PoolBuffer() { + if (mutable_data_ != nullptr) { + pool_->Free(mutable_data_, capacity_); + } +} + Status PoolBuffer::Reserve(int64_t new_capacity) { if (!mutable_data_ || new_capacity > capacity_) { uint8_t* new_data; diff --git a/cpp/src/arrow/util/buffer.h b/cpp/src/arrow/util/buffer.h index 8704723eb0a89..0c3e210abd910 100644 --- a/cpp/src/arrow/util/buffer.h +++ b/cpp/src/arrow/util/buffer.h @@ -39,6 +39,7 @@ class Buffer : public std::enable_shared_from_this { Buffer(const uint8_t* data, int64_t size) : data_(data), size_(size) {} + virtual 
~Buffer(); // An offset into data that is owned by another buffer, but we want to be // able to retain a valid pointer to it even after other shared_ptr's to the @@ -136,6 +137,7 @@ class ResizableBuffer : public MutableBuffer { class PoolBuffer : public ResizableBuffer { public: explicit PoolBuffer(MemoryPool* pool = nullptr); + virtual ~PoolBuffer(); virtual Status Resize(int64_t new_size); virtual Status Reserve(int64_t new_capacity); diff --git a/cpp/src/arrow/util/status.cc b/cpp/src/arrow/util/status.cc index c64b8a3d5f80a..c6e113ebea590 100644 --- a/cpp/src/arrow/util/status.cc +++ b/cpp/src/arrow/util/status.cc @@ -35,4 +35,44 @@ const char* Status::CopyState(const char* state) { return result; } +std::string Status::CodeAsString() const { + if (state_ == NULL) { + return "OK"; + } + + const char* type; + switch (code()) { + case StatusCode::OK: + type = "OK"; + break; + case StatusCode::OutOfMemory: + type = "Out of memory"; + break; + case StatusCode::KeyError: + type = "Key error"; + break; + case StatusCode::Invalid: + type = "Invalid"; + break; + case StatusCode::NotImplemented: + type = "NotImplemented"; + break; + } + return std::string(type); +} + +std::string Status::ToString() const { + std::string result(CodeAsString()); + if (state_ == NULL) { + return result; + } + + result.append(": "); + + uint32_t length; + memcpy(&length, state_, sizeof(length)); + result.append(reinterpret_cast(state_ + 7), length); + return result; +} + } // namespace arrow diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index df55bfac9eb4a..8fdd829010eef 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -45,6 +45,12 @@ if ("$ENV{CMAKE_EXPORT_COMPILE_COMMANDS}" STREQUAL "1") set(CMAKE_EXPORT_COMPILE_COMMANDS 1) endif() +find_program(CCACHE_FOUND ccache) +if(CCACHE_FOUND) + set_property(GLOBAL PROPERTY RULE_LAUNCH_COMPILE ccache) + set_property(GLOBAL PROPERTY RULE_LAUNCH_LINK ccache) +endif(CCACHE_FOUND) + ############################################################ # Compiler flags ############################################################ @@ -389,7 +395,12 @@ add_subdirectory(src/pyarrow) add_subdirectory(src/pyarrow/util) set(PYARROW_SRCS + src/pyarrow/common.cc + src/pyarrow/helpers.cc src/pyarrow/init.cc + src/pyarrow/status.cc + + src/pyarrow/adapters/builtin.cc ) set(LINK_LIBS @@ -410,18 +421,16 @@ endif() # Setup and build Cython modules ############################################################ -foreach(pyx_api_file - arrow/config.pyx - arrow/parquet.pyx) - set_source_files_properties(${pyx_api_file} PROPERTIES CYTHON_API 1) -endforeach(pyx_api_file) - set(USE_RELATIVE_RPATH ON) set(CMAKE_BUILD_WITH_INSTALL_RPATH TRUE) set(CYTHON_EXTENSIONS + array config + error parquet + scalar + schema ) foreach(module ${CYTHON_EXTENSIONS}) diff --git a/python/arrow/__init__.py b/python/arrow/__init__.py index e69de29bb2d1d..3c049b85e8c93 100644 --- a/python/arrow/__init__.py +++ b/python/arrow/__init__.py @@ -0,0 +1,34 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# flake8: noqa + +from arrow.array import (Array, from_pylist, total_allocated_bytes, + BooleanArray, NumericArray, + Int8Array, UInt8Array, + ListArray, StringArray) + +from arrow.error import ArrowException + +from arrow.scalar import ArrayValue, NA, Scalar + +from arrow.schema import (null, bool_, + int8, int16, int32, int64, + uint8, uint16, uint32, uint64, + float_, double, string, + list_, struct, field, + DataType, Field, Schema) diff --git a/python/arrow/array.pxd b/python/arrow/array.pxd new file mode 100644 index 0000000000000..e32d27769b5e1 --- /dev/null +++ b/python/arrow/array.pxd @@ -0,0 +1,85 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +from arrow.includes.common cimport shared_ptr +from arrow.includes.arrow cimport CArray, LogicalType + +from arrow.scalar import NA + +from arrow.schema cimport DataType + +cdef extern from "Python.h": + int PySlice_Check(object) + +cdef class Array: + cdef: + shared_ptr[CArray] sp_array + CArray* ap + + cdef readonly: + DataType type + + cdef init(self, const shared_ptr[CArray]& sp_array) + cdef _getitem(self, int i) + + +cdef class BooleanArray(Array): + pass + + +cdef class NumericArray(Array): + pass + + +cdef class Int8Array(NumericArray): + pass + + +cdef class UInt8Array(NumericArray): + pass + + +cdef class Int16Array(NumericArray): + pass + + +cdef class UInt16Array(NumericArray): + pass + + +cdef class Int32Array(NumericArray): + pass + + +cdef class UInt32Array(NumericArray): + pass + + +cdef class Int64Array(NumericArray): + pass + + +cdef class UInt64Array(NumericArray): + pass + + +cdef class ListArray(Array): + pass + + +cdef class StringArray(Array): + pass diff --git a/python/arrow/array.pyx b/python/arrow/array.pyx new file mode 100644 index 0000000000000..3a3210d6cc100 --- /dev/null +++ b/python/arrow/array.pyx @@ -0,0 +1,179 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# cython: profile=False +# distutils: language = c++ +# cython: embedsignature = True + +from arrow.includes.arrow cimport * +cimport arrow.includes.pyarrow as pyarrow + +from arrow.compat import frombytes, tobytes +from arrow.error cimport check_status + +from arrow.scalar import NA + +def total_allocated_bytes(): + cdef MemoryPool* pool = pyarrow.GetMemoryPool() + return pool.bytes_allocated() + + +cdef class Array: + + cdef init(self, const shared_ptr[CArray]& sp_array): + self.sp_array = sp_array + self.ap = sp_array.get() + self.type = DataType() + self.type.init(self.sp_array.get().type()) + + property null_count: + + def __get__(self): + return self.sp_array.get().null_count() + + def __len__(self): + return self.sp_array.get().length() + + def isnull(self): + raise NotImplemented + + def __getitem__(self, key): + cdef: + Py_ssize_t n = len(self) + + if PySlice_Check(key): + start = key.start or 0 + while start < 0: + start += n + + stop = key.stop if key.stop is not None else n + while stop < 0: + stop += n + + step = key.step or 1 + if step != 1: + raise NotImplementedError + else: + return self.slice(start, stop) + + while key < 0: + key += len(self) + + if self.ap.IsNull(key): + return NA + else: + return self._getitem(key) + + cdef _getitem(self, int i): + raise NotImplementedError + + def slice(self, start, end): + pass + + +cdef class NullArray(Array): + pass + + +cdef class BooleanArray(Array): + pass + + +cdef class NumericArray(Array): + pass + + +cdef class Int8Array(NumericArray): + pass + + +cdef class UInt8Array(NumericArray): + pass + + +cdef class Int16Array(NumericArray): + pass + + +cdef class UInt16Array(NumericArray): + pass + + +cdef class Int32Array(NumericArray): + pass + + +cdef class UInt32Array(NumericArray): + pass + + +cdef class Int64Array(NumericArray): + pass + + +cdef class UInt64Array(NumericArray): + pass + + +cdef class FloatArray(NumericArray): + pass + + +cdef class DoubleArray(NumericArray): + pass + + +cdef class ListArray(Array): + pass + + +cdef class StringArray(Array): + pass + + +cdef dict _array_classes = { + LogicalType_NA: NullArray, + LogicalType_BOOL: BooleanArray, + LogicalType_INT64: Int64Array, + LogicalType_DOUBLE: DoubleArray, + LogicalType_LIST: ListArray, + LogicalType_STRING: StringArray, +} + +cdef object box_arrow_array(const shared_ptr[CArray]& sp_array): + if sp_array.get() == NULL: + raise ValueError('Array was NULL') + + cdef CDataType* data_type = sp_array.get().type().get() + + if data_type == NULL: + raise ValueError('Array data type was NULL') + + cdef Array arr = _array_classes[data_type.type]() + arr.init(sp_array) + return arr + + +def from_pylist(object list_obj, type=None): + """ + Convert Python list to Arrow array + """ + cdef: + shared_ptr[CArray] sp_array + + check_status(pyarrow.ConvertPySequence(list_obj, &sp_array)) + return box_arrow_array(sp_array) diff --git a/python/arrow/config.pyx b/python/arrow/config.pyx index 8f10beb3a2e72..521bc066cd4a5 100644 --- a/python/arrow/config.pyx +++ b/python/arrow/config.pyx @@ -2,7 +2,7 @@ # distutils: language = c++ # cython: embedsignature = True -cdef 
extern from 'pyarrow/init.h' namespace 'arrow::py': +cdef extern from 'pyarrow/init.h' namespace 'pyarrow': void pyarrow_init() pyarrow_init() diff --git a/python/arrow/error.pxd b/python/arrow/error.pxd new file mode 100644 index 0000000000000..c18cb3efffca6 --- /dev/null +++ b/python/arrow/error.pxd @@ -0,0 +1,20 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +from arrow.includes.pyarrow cimport * + +cdef check_status(const Status& status) diff --git a/python/arrow/error.pyx b/python/arrow/error.pyx new file mode 100644 index 0000000000000..f1d516358819d --- /dev/null +++ b/python/arrow/error.pyx @@ -0,0 +1,30 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
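+
+# check_status is the bridge between C++ and Python error handling: any
+# non-OK pyarrow::Status coming back from C++ (e.g. from
+# pyarrow.ConvertPySequence) is re-raised here as an ArrowException.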
+ +from arrow.includes.common cimport c_string + +from arrow.compat import frombytes + +class ArrowException(Exception): + pass + +cdef check_status(const Status& status): + if status.ok(): + return + + cdef c_string c_message = status.ToString() + raise ArrowException(frombytes(c_message)) diff --git a/python/arrow/includes/arrow.pxd b/python/arrow/includes/arrow.pxd index 3635ceb868596..fde5de910915a 100644 --- a/python/arrow/includes/arrow.pxd +++ b/python/arrow/includes/arrow.pxd @@ -20,4 +20,77 @@ from arrow.includes.common cimport * cdef extern from "arrow/api.h" namespace "arrow" nogil: - pass + + enum LogicalType" arrow::LogicalType::type": + LogicalType_NA" arrow::LogicalType::NA" + + LogicalType_BOOL" arrow::LogicalType::BOOL" + + LogicalType_UINT8" arrow::LogicalType::UINT8" + LogicalType_INT8" arrow::LogicalType::INT8" + LogicalType_UINT16" arrow::LogicalType::UINT16" + LogicalType_INT16" arrow::LogicalType::INT16" + LogicalType_UINT32" arrow::LogicalType::UINT32" + LogicalType_INT32" arrow::LogicalType::INT32" + LogicalType_UINT64" arrow::LogicalType::UINT64" + LogicalType_INT64" arrow::LogicalType::INT64" + + LogicalType_FLOAT" arrow::LogicalType::FLOAT" + LogicalType_DOUBLE" arrow::LogicalType::DOUBLE" + + LogicalType_STRING" arrow::LogicalType::STRING" + + LogicalType_LIST" arrow::LogicalType::LIST" + LogicalType_STRUCT" arrow::LogicalType::STRUCT" + + cdef cppclass CDataType" arrow::DataType": + LogicalType type + c_bool nullable + + c_bool Equals(const CDataType* other) + + c_string ToString() + + cdef cppclass MemoryPool" arrow::MemoryPool": + int64_t bytes_allocated() + + cdef cppclass CListType" arrow::ListType"(CDataType): + CListType(const shared_ptr[CDataType]& value_type, + c_bool nullable) + + cdef cppclass CStringType" arrow::StringType"(CDataType): + pass + + cdef cppclass CField" arrow::Field": + c_string name + shared_ptr[CDataType] type + + CField(const c_string& name, const shared_ptr[CDataType]& type) + + cdef cppclass CStructType" arrow::StructType"(CDataType): + CStructType(const vector[shared_ptr[CField]]& fields, + c_bool nullable) + + cdef cppclass CSchema" arrow::Schema": + CSchema(const shared_ptr[CField]& fields) + + cdef cppclass CArray" arrow::Array": + const shared_ptr[CDataType]& type() + + int32_t length() + int32_t null_count() + LogicalType logical_type() + + c_bool IsNull(int i) + + cdef cppclass CUInt8Array" arrow::UInt8Array"(CArray): + pass + + cdef cppclass CInt8Array" arrow::Int8Array"(CArray): + pass + + cdef cppclass CListArray" arrow::ListArray"(CArray): + pass + + cdef cppclass CStringArray" arrow::StringArray"(CListArray): + pass diff --git a/python/arrow/includes/common.pxd b/python/arrow/includes/common.pxd index f2fc826625e45..839427a699002 100644 --- a/python/arrow/includes/common.pxd +++ b/python/arrow/includes/common.pxd @@ -19,7 +19,7 @@ from libc.stdint cimport * from libcpp cimport bool as c_bool -from libcpp.string cimport string +from libcpp.string cimport string as c_string from libcpp.vector cimport vector # This must be included for cerr and other things to work @@ -29,6 +29,8 @@ cdef extern from "": cdef extern from "" namespace "std" nogil: cdef cppclass shared_ptr[T]: + shared_ptr() + shared_ptr(T*) T* get() void reset() void reset(T* p) diff --git a/python/arrow/includes/pyarrow.pxd b/python/arrow/includes/pyarrow.pxd index dcef663f3894d..3eed5b8542493 100644 --- a/python/arrow/includes/pyarrow.pxd +++ b/python/arrow/includes/pyarrow.pxd @@ -18,6 +18,28 @@ # distutils: language = c++ from arrow.includes.common 
cimport * +from arrow.includes.arrow cimport (CArray, CDataType, LogicalType, + MemoryPool) cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil: - pass + # We can later add more of the common status factory methods as needed + cdef Status Status_OK "Status::OK"() + + cdef cppclass Status: + Status() + + c_string ToString() + + c_bool ok() + c_bool IsOutOfMemory() + c_bool IsKeyError() + c_bool IsTypeError() + c_bool IsIOError() + c_bool IsValueError() + c_bool IsNotImplemented() + c_bool IsArrowError() + + shared_ptr[CDataType] GetPrimitiveType(LogicalType type, c_bool nullable) + Status ConvertPySequence(object obj, shared_ptr[CArray]* out) + + MemoryPool* GetMemoryPool() diff --git a/python/arrow/scalar.pxd b/python/arrow/scalar.pxd new file mode 100644 index 0000000000000..e193c09cd69a2 --- /dev/null +++ b/python/arrow/scalar.pxd @@ -0,0 +1,47 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +from arrow.includes.common cimport * +from arrow.includes.arrow cimport CArray, CListArray + +from arrow.schema cimport DataType + +cdef class Scalar: + cdef readonly: + DataType type + + +cdef class NAType(Scalar): + pass + + +cdef class ArrayValue(Scalar): + cdef: + shared_ptr[CArray] array + int index + + +cdef class Int8Value(ArrayValue): + pass + + +cdef class ListValue(ArrayValue): + pass + + +cdef class StringValue(ArrayValue): + pass diff --git a/python/arrow/scalar.pyx b/python/arrow/scalar.pyx new file mode 100644 index 0000000000000..78dadecf9b422 --- /dev/null +++ b/python/arrow/scalar.pyx @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +import arrow.schema as schema + +cdef class NAType(Scalar): + + def __cinit__(self): + self.type = schema.null() + + def __repr__(self): + return 'NA' + +NA = NAType() diff --git a/python/arrow/schema.pxd b/python/arrow/schema.pxd new file mode 100644 index 0000000000000..487c246f44abf --- /dev/null +++ b/python/arrow/schema.pxd @@ -0,0 +1,39 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. 
See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +from arrow.includes.common cimport shared_ptr +from arrow.includes.arrow cimport CDataType, CField, CSchema + +cdef class DataType: + cdef: + shared_ptr[CDataType] sp_type + CDataType* type + + cdef init(self, const shared_ptr[CDataType]& type) + +cdef class Field: + cdef: + shared_ptr[CField] sp_field + CField* field + + cdef readonly: + DataType type + +cdef class Schema: + cdef: + shared_ptr[CSchema] sp_schema + CSchema* schema diff --git a/python/arrow/schema.pyx b/python/arrow/schema.pyx new file mode 100644 index 0000000000000..63cd6e888abd0 --- /dev/null +++ b/python/arrow/schema.pyx @@ -0,0 +1,150 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
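+
+# The factory functions below (null(), bool_(), int64(), ...) memoize one
+# DataType per (logical type, nullable) pair, so repeated calls return the
+# same object; list_() and struct() build a fresh type on each call.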
+ +######################################## +# Data types, fields, schemas, and so forth + +# cython: profile=False +# distutils: language = c++ +# cython: embedsignature = True + +from arrow.compat import frombytes, tobytes +from arrow.includes.arrow cimport * +cimport arrow.includes.pyarrow as pyarrow + +cimport cpython + +cdef class DataType: + + def __cinit__(self): + pass + + cdef init(self, const shared_ptr[CDataType]& type): + self.sp_type = type + self.type = type.get() + + def __str__(self): + return frombytes(self.type.ToString()) + + def __repr__(self): + return 'DataType({0})'.format(str(self)) + + def __richcmp__(DataType self, DataType other, int op): + if op == cpython.Py_EQ: + return self.type.Equals(other.type) + elif op == cpython.Py_NE: + return not self.type.Equals(other.type) + else: + raise TypeError('Invalid comparison') + + +cdef class Field: + + def __cinit__(self, object name, DataType type): + self.type = type + self.sp_field.reset(new CField(tobytes(name), type.sp_type)) + self.field = self.sp_field.get() + + def __repr__(self): + return 'Field({0!r}, type={1})'.format(self.name, str(self.type)) + + property name: + + def __get__(self): + return frombytes(self.field.name) + +cdef dict _type_cache = {} + +cdef DataType primitive_type(LogicalType type, bint nullable=True): + if (type, nullable) in _type_cache: + return _type_cache[type, nullable] + + cdef DataType out = DataType() + out.init(pyarrow.GetPrimitiveType(type, nullable)) + + _type_cache[type, nullable] = out + return out + +#------------------------------------------------------------ +# Type factory functions + +def field(name, type): + return Field(name, type) + +def null(): + return primitive_type(LogicalType_NA) + +def bool_(c_bool nullable=True): + return primitive_type(LogicalType_BOOL, nullable) + +def uint8(c_bool nullable=True): + return primitive_type(LogicalType_UINT8, nullable) + +def int8(c_bool nullable=True): + return primitive_type(LogicalType_INT8, nullable) + +def uint16(c_bool nullable=True): + return primitive_type(LogicalType_UINT16, nullable) + +def int16(c_bool nullable=True): + return primitive_type(LogicalType_INT16, nullable) + +def uint32(c_bool nullable=True): + return primitive_type(LogicalType_UINT32, nullable) + +def int32(c_bool nullable=True): + return primitive_type(LogicalType_INT32, nullable) + +def uint64(c_bool nullable=True): + return primitive_type(LogicalType_UINT64, nullable) + +def int64(c_bool nullable=True): + return primitive_type(LogicalType_INT64, nullable) + +def float_(c_bool nullable=True): + return primitive_type(LogicalType_FLOAT, nullable) + +def double(c_bool nullable=True): + return primitive_type(LogicalType_DOUBLE, nullable) + +def string(c_bool nullable=True): + """ + UTF8 string + """ + return primitive_type(LogicalType_STRING, nullable) + +def list_(DataType value_type, c_bool nullable=True): + cdef DataType out = DataType() + out.init(shared_ptr[CDataType]( + new CListType(value_type.sp_type, nullable))) + return out + +def struct(fields, c_bool nullable=True): + """ + + """ + cdef: + DataType out = DataType() + Field field + vector[shared_ptr[CField]] c_fields + + for field in fields: + c_fields.push_back(field.sp_field) + + out.init(shared_ptr[CDataType]( + new CStructType(c_fields, nullable))) + return out diff --git a/python/arrow/tests/test_array.py b/python/arrow/tests/test_array.py new file mode 100644 index 0000000000000..8eaa53352061b --- /dev/null +++ b/python/arrow/tests/test_array.py @@ -0,0 +1,26 @@ +# Licensed to the Apache 
Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +from arrow.compat import unittest +import arrow + + +class TestArrayAPI(unittest.TestCase): + + def test_getitem_NA(self): + arr = arrow.from_pylist([1, None, 2]) + assert arr[1] is arrow.NA diff --git a/python/arrow/tests/test_convert_builtin.py b/python/arrow/tests/test_convert_builtin.py new file mode 100644 index 0000000000000..57e6ab9f0e7b5 --- /dev/null +++ b/python/arrow/tests/test_convert_builtin.py @@ -0,0 +1,85 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
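+
+# These tests exercise the type inference in pyarrow/adapters/builtin.cc:
+# from_pylist infers a common Arrow type from the sequence contents
+# (e.g. [1, None, 3, None] -> int64 with two nulls) before converting.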
+ +from arrow.compat import unittest +import arrow + + +class TestConvertList(unittest.TestCase): + + def test_boolean(self): + pass + + def test_empty_list(self): + arr = arrow.from_pylist([]) + assert len(arr) == 0 + assert arr.null_count == 0 + assert arr.type == arrow.null() + + def test_all_none(self): + arr = arrow.from_pylist([None, None]) + assert len(arr) == 2 + assert arr.null_count == 2 + assert arr.type == arrow.null() + + def test_integer(self): + arr = arrow.from_pylist([1, None, 3, None]) + assert len(arr) == 4 + assert arr.null_count == 2 + assert arr.type == arrow.int64() + + def test_garbage_collection(self): + import gc + bytes_before = arrow.total_allocated_bytes() + arrow.from_pylist([1, None, 3, None]) + gc.collect() + assert arrow.total_allocated_bytes() == bytes_before + + def test_double(self): + data = [1.5, 1, None, 2.5, None, None] + arr = arrow.from_pylist(data) + assert len(arr) == 6 + assert arr.null_count == 3 + assert arr.type == arrow.double() + + def test_string(self): + data = ['foo', b'bar', None, 'arrow'] + arr = arrow.from_pylist(data) + assert len(arr) == 4 + assert arr.null_count == 1 + assert arr.type == arrow.string() + + def test_mixed_nesting_levels(self): + arrow.from_pylist([1, 2, None]) + arrow.from_pylist([[1], [2], None]) + arrow.from_pylist([[1], [2], [None]]) + + with self.assertRaises(arrow.ArrowException): + arrow.from_pylist([1, 2, [1]]) + + with self.assertRaises(arrow.ArrowException): + arrow.from_pylist([1, 2, []]) + + with self.assertRaises(arrow.ArrowException): + arrow.from_pylist([[1], [2], [None, [1]]]) + + def test_list_of_int(self): + data = [[1, 2, 3], [], None, [1, 2]] + arr = arrow.from_pylist(data) + assert len(arr) == 4 + assert arr.null_count == 1 + assert arr.type == arrow.list_(arrow.int64()) diff --git a/python/arrow/tests/test_schema.py b/python/arrow/tests/test_schema.py new file mode 100644 index 0000000000000..a89edd74a0adf --- /dev/null +++ b/python/arrow/tests/test_schema.py @@ -0,0 +1,51 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
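+
+# Type factory tests: each factory (int8() ... string()) accepts a nullable
+# flag defaulting to True; a non-nullable type renders as "<name> not null",
+# which is what test_integers asserts below.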
+ +from arrow.compat import unittest +import arrow + + +class TestTypes(unittest.TestCase): + + def test_integers(self): + dtypes = ['int8', 'int16', 'int32', 'int64', + 'uint8', 'uint16', 'uint32', 'uint64'] + + for name in dtypes: + factory = getattr(arrow, name) + t = factory() + t_required = factory(False) + + assert str(t) == name + assert str(t_required) == '{0} not null'.format(name) + + def test_list(self): + value_type = arrow.int32() + list_type = arrow.list_(value_type) + assert str(list_type) == 'list' + + def test_string(self): + t = arrow.string() + assert str(t) == 'string' + + def test_field(self): + t = arrow.string() + f = arrow.field('foo', t) + + assert f.name == 'foo' + assert f.type is t + assert repr(f) == "Field('foo', type=string)" diff --git a/python/setup.py b/python/setup.py index f6b0a4bee8316..9a0de071a9c40 100644 --- a/python/setup.py +++ b/python/setup.py @@ -124,7 +124,10 @@ def _run_cmake(self): static_lib_option, source] self.spawn(cmake_command) - self.spawn(['make']) + args = ['make'] + if 'PYARROW_PARALLEL' in os.environ: + args.append('-j{0}'.format(os.environ['PYARROW_PARALLEL'])) + self.spawn(args) else: import shlex cmake_generator = 'Visual Studio 14 2015' @@ -207,7 +210,7 @@ def get_ext_built(self, name): return name + suffix def get_cmake_cython_names(self): - return ['config', 'parquet'] + return ['array', 'config', 'error', 'parquet', 'scalar', 'schema'] def get_names(self): return self._found_names diff --git a/python/src/pyarrow/adapters/builtin.cc b/python/src/pyarrow/adapters/builtin.cc new file mode 100644 index 0000000000000..ae84fa12b0de6 --- /dev/null +++ b/python/src/pyarrow/adapters/builtin.cc @@ -0,0 +1,415 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
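+
+// Overview: ConvertPySequence runs two passes over the input sequence.
+// InferArrowType (SeqVisitor/ScalarVisitor below) determines a common Arrow
+// DataType, then a SeqConverter matching that type appends every element to
+// an arrow::ArrayBuilder, which is finished into the output Array.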
+ +#include +#include + +#include "pyarrow/adapters/builtin.h" + +#include + +#include "pyarrow/status.h" + +using arrow::ArrayBuilder; +using arrow::DataType; +using arrow::LogicalType; + +namespace pyarrow { + +static inline bool IsPyInteger(PyObject* obj) { +#if PYARROW_IS_PY2 + return PyLong_Check(obj) || PyInt_Check(obj); +#else + return PyLong_Check(obj); +#endif +} + +static inline bool IsPyBaseString(PyObject* obj) { +#if PYARROW_IS_PY2 + return PyString_Check(obj) || PyUnicode_Check(obj); +#else + return PyUnicode_Check(obj); +#endif +} + +class ScalarVisitor { + public: + ScalarVisitor() : + total_count_(0), + none_count_(0), + bool_count_(0), + int_count_(0), + float_count_(0), + string_count_(0) {} + + void Visit(PyObject* obj) { + ++total_count_; + if (obj == Py_None) { + ++none_count_; + } else if (PyFloat_Check(obj)) { + ++float_count_; + } else if (IsPyInteger(obj)) { + ++int_count_; + } else if (IsPyBaseString(obj)) { + ++string_count_; + } else { + // TODO(wesm): accumulate error information somewhere + } + } + + std::shared_ptr GetType() { + // TODO(wesm): handling mixed-type cases + if (float_count_) { + return arrow::DOUBLE; + } else if (int_count_) { + // TODO(wesm): tighter type later + return arrow::INT64; + } else if (bool_count_) { + return arrow::BOOL; + } else if (string_count_) { + return arrow::STRING; + } else { + return arrow::NA; + } + } + + int64_t total_count() const { + return total_count_; + } + + private: + int64_t total_count_; + int64_t none_count_; + int64_t bool_count_; + int64_t int_count_; + int64_t float_count_; + int64_t string_count_; + + // Place to accumulate errors + // std::vector errors_; +}; + +static constexpr int MAX_NESTING_LEVELS = 32; + +class SeqVisitor { + public: + SeqVisitor() : + max_nesting_level_(0) { + memset(nesting_histogram_, 0, MAX_NESTING_LEVELS * sizeof(int)); + } + + Status Visit(PyObject* obj, int level=0) { + Py_ssize_t size = PySequence_Size(obj); + + if (level > max_nesting_level_) { + max_nesting_level_ = level; + } + + for (int64_t i = 0; i < size; ++i) { + // TODO(wesm): Error checking? + // TODO(wesm): Specialize for PyList_GET_ITEM? 
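+      // Each non-null scalar observed at this depth bumps
+      // nesting_histogram_[level]; Validate() later rejects sequences whose
+      // scalars appear at more than one nesting depth.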
+ OwnedRef item_ref(PySequence_GetItem(obj, i)); + PyObject* item = item_ref.obj(); + + if (PyList_Check(item)) { + PY_RETURN_NOT_OK(Visit(item, level + 1)); + } else if (PyDict_Check(item)) { + return Status::NotImplemented("No type inference for dicts"); + } else { + // We permit nulls at any level of nesting + if (item == Py_None) { + // TODO + } else { + ++nesting_histogram_[level]; + scalars_.Visit(item); + } + } + } + return Status::OK(); + } + + std::shared_ptr GetType() { + if (scalars_.total_count() == 0) { + if (max_nesting_level_ == 0) { + return arrow::NA; + } else { + return nullptr; + } + } else { + std::shared_ptr result = scalars_.GetType(); + for (int i = 0; i < max_nesting_level_; ++i) { + result = std::make_shared(result); + } + return result; + } + } + + Status Validate() const { + if (scalars_.total_count() > 0) { + if (num_nesting_levels() > 1) { + return Status::ValueError("Mixed nesting levels not supported"); + } else if (max_observed_level() < max_nesting_level_) { + return Status::ValueError("Mixed nesting levels not supported"); + } + } + return Status::OK(); + } + + int max_observed_level() const { + int result = 0; + for (int i = 0; i < MAX_NESTING_LEVELS; ++i) { + if (nesting_histogram_[i] > 0) { + result = i; + } + } + return result; + } + + int num_nesting_levels() const { + int result = 0; + for (int i = 0; i < MAX_NESTING_LEVELS; ++i) { + if (nesting_histogram_[i] > 0) { + ++result; + } + } + return result; + } + + private: + ScalarVisitor scalars_; + + // Track observed + int max_nesting_level_; + int nesting_histogram_[MAX_NESTING_LEVELS]; +}; + +// Non-exhaustive type inference +static Status InferArrowType(PyObject* obj, int64_t* size, + std::shared_ptr* out_type) { + *size = PySequence_Size(obj); + if (PyErr_Occurred()) { + // Not a sequence + PyErr_Clear(); + return Status::TypeError("Object is not a sequence"); + } + + // For 0-length sequences, refuse to guess + if (*size == 0) { + *out_type = arrow::NA; + } + + SeqVisitor seq_visitor; + PY_RETURN_NOT_OK(seq_visitor.Visit(obj)); + PY_RETURN_NOT_OK(seq_visitor.Validate()); + + *out_type = seq_visitor.GetType(); + return Status::OK(); +} + +// Marshal Python sequence (list, tuple, etc.) 
to Arrow array
+class SeqConverter {
+ public:
+  virtual Status Init(const std::shared_ptr<ArrayBuilder>& builder) {
+    builder_ = builder;
+    return Status::OK();
+  }
+
+  virtual Status AppendData(PyObject* seq) = 0;
+
+ protected:
+  std::shared_ptr<ArrayBuilder> builder_;
+};
+
+template <typename BuilderType>
+class TypedConverter : public SeqConverter {
+ public:
+  Status Init(const std::shared_ptr<ArrayBuilder>& builder) override {
+    builder_ = builder;
+    typed_builder_ = static_cast<BuilderType*>(builder.get());
+    return Status::OK();
+  }
+
+ protected:
+  BuilderType* typed_builder_;
+};
+
+class BoolConverter : public TypedConverter<arrow::BooleanBuilder> {
+ public:
+  Status AppendData(PyObject* seq) override {
+    return Status::OK();
+  }
+};
+
+class Int64Converter : public TypedConverter<arrow::Int64Builder> {
+ public:
+  Status AppendData(PyObject* seq) override {
+    int64_t val;
+    Py_ssize_t size = PySequence_Size(seq);
+    for (int64_t i = 0; i < size; ++i) {
+      OwnedRef item(PySequence_GetItem(seq, i));
+      if (item.obj() == Py_None) {
+        RETURN_ARROW_NOT_OK(typed_builder_->AppendNull());
+      } else {
+        val = PyLong_AsLongLong(item.obj());
+        RETURN_IF_PYERROR();
+        RETURN_ARROW_NOT_OK(typed_builder_->Append(val));
+      }
+    }
+    return Status::OK();
+  }
+};
+
+class DoubleConverter : public TypedConverter<arrow::DoubleBuilder> {
+ public:
+  Status AppendData(PyObject* seq) override {
+    double val;
+    Py_ssize_t size = PySequence_Size(seq);
+    for (int64_t i = 0; i < size; ++i) {
+      OwnedRef item(PySequence_GetItem(seq, i));
+      if (item.obj() == Py_None) {
+        RETURN_ARROW_NOT_OK(typed_builder_->AppendNull());
+      } else {
+        val = PyFloat_AsDouble(item.obj());
+        RETURN_IF_PYERROR();
+        RETURN_ARROW_NOT_OK(typed_builder_->Append(val));
+      }
+    }
+    return Status::OK();
+  }
+};
+
+class StringConverter : public TypedConverter<arrow::StringBuilder> {
+ public:
+  Status AppendData(PyObject* seq) override {
+    PyObject* item;
+    PyObject* bytes_obj;
+    OwnedRef tmp;
+    const char* bytes;
+    int32_t length;
+    Py_ssize_t size = PySequence_Size(seq);
+    for (int64_t i = 0; i < size; ++i) {
+      item = PySequence_GetItem(seq, i);
+      OwnedRef holder(item);
+
+      if (item == Py_None) {
+        RETURN_ARROW_NOT_OK(typed_builder_->AppendNull());
+        continue;
+      } else if (PyUnicode_Check(item)) {
+        tmp.reset(PyUnicode_AsUTF8String(item));
+        RETURN_IF_PYERROR();
+        bytes_obj = tmp.obj();
+      } else if (PyBytes_Check(item)) {
+        bytes_obj = item;
+      } else {
+        return Status::TypeError("Non-string value encountered");
+      }
+      // No error checking
+      length = PyBytes_GET_SIZE(bytes_obj);
+      bytes = PyBytes_AS_STRING(bytes_obj);
+      RETURN_ARROW_NOT_OK(typed_builder_->Append(bytes, length));
+    }
+    return Status::OK();
+  }
+};
+
+class ListConverter : public TypedConverter<arrow::ListBuilder> {
+ public:
+  Status Init(const std::shared_ptr<ArrayBuilder>& builder) override;
+
+  Status AppendData(PyObject* seq) override {
+    Py_ssize_t size = PySequence_Size(seq);
+    for (int64_t i = 0; i < size; ++i) {
+      OwnedRef item(PySequence_GetItem(seq, i));
+      if (item.obj() == Py_None) {
+        RETURN_ARROW_NOT_OK(typed_builder_->AppendNull());
+      } else {
+        typed_builder_->Append();
+        PY_RETURN_NOT_OK(value_converter_->AppendData(item.obj()));
+      }
+    }
+    return Status::OK();
+  }
+ protected:
+  std::shared_ptr<SeqConverter> value_converter_;
+};
+
+// Dynamic constructor for sequence converters
+std::shared_ptr<SeqConverter> GetConverter(const std::shared_ptr<DataType>& type) {
+  switch (type->type) {
+    case LogicalType::BOOL:
+      return std::make_shared<BoolConverter>();
+    case LogicalType::INT64:
+      return std::make_shared<Int64Converter>();
+    case LogicalType::DOUBLE:
+      return std::make_shared<DoubleConverter>();
+    case LogicalType::STRING:
+      return std::make_shared<StringConverter>();
+    case LogicalType::LIST:
+      return std::make_shared<ListConverter>();
+    case LogicalType::STRUCT:
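+      // Struct conversion is not implemented yet, so STRUCT falls through
+      // to the default case, which reports "no converter" via nullptr.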
default: + return nullptr; + break; + } +} + +Status ListConverter::Init(const std::shared_ptr& builder) { + builder_ = builder; + typed_builder_ = static_cast(builder.get()); + + value_converter_ = GetConverter(static_cast( + builder->type().get())->value_type); + if (value_converter_ == nullptr) { + return Status::NotImplemented("value type not implemented"); + } + + value_converter_->Init(typed_builder_->value_builder()); + return Status::OK(); +} + +Status ConvertPySequence(PyObject* obj, std::shared_ptr* out) { + std::shared_ptr type; + int64_t size; + PY_RETURN_NOT_OK(InferArrowType(obj, &size, &type)); + + // Handle NA / NullType case + if (type->type == LogicalType::NA) { + out->reset(new arrow::Array(type, size, size)); + return Status::OK(); + } + + std::shared_ptr converter = GetConverter(type); + if (converter == nullptr) { + std::stringstream ss; + ss << "No type converter implemented for " + << type->ToString(); + return Status::NotImplemented(ss.str()); + } + + // Give the sequence converter an array builder + std::shared_ptr builder; + RETURN_ARROW_NOT_OK(arrow::MakeBuilder(GetMemoryPool(), type, &builder)); + converter->Init(builder); + + PY_RETURN_NOT_OK(converter->AppendData(obj)); + + *out = builder->Finish(); + + return Status::OK(); +} + +} // namespace pyarrow diff --git a/python/src/pyarrow/adapters/builtin.h b/python/src/pyarrow/adapters/builtin.h new file mode 100644 index 0000000000000..24886f4970d50 --- /dev/null +++ b/python/src/pyarrow/adapters/builtin.h @@ -0,0 +1,40 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +// Functions for converting between CPython built-in data structures and Arrow +// data structures + +#ifndef PYARROW_ADAPTERS_BUILTIN_H +#define PYARROW_ADAPTERS_BUILTIN_H + +#include + +#include + +#include "pyarrow/common.h" + +namespace arrow { class Array; } + +namespace pyarrow { + +class Status; + +Status ConvertPySequence(PyObject* obj, std::shared_ptr* out); + +} // namespace pyarrow + +#endif // PYARROW_ADAPTERS_BUILTIN_H diff --git a/cpp/src/arrow/field.cc b/python/src/pyarrow/adapters/pandas.h similarity index 76% rename from cpp/src/arrow/field.cc rename to python/src/pyarrow/adapters/pandas.h index 4568d905c2991..a4f4163808711 100644 --- a/cpp/src/arrow/field.cc +++ b/python/src/pyarrow/adapters/pandas.h @@ -15,17 +15,14 @@ // specific language governing permissions and limitations // under the License. 
-#include "arrow/field.h" +// Functions for converting between pandas's NumPy-based data representation +// and Arrow data structures -#include -#include +#ifndef PYARROW_ADAPTERS_PANDAS_H +#define PYARROW_ADAPTERS_PANDAS_H -namespace arrow { +namespace pyarrow { -std::string Field::ToString() const { - std::stringstream ss; - ss << this->name << " " << this->type->ToString(); - return ss.str(); -} +} // namespace pyarrow -} // namespace arrow +#endif // PYARROW_ADAPTERS_PANDAS_H diff --git a/python/src/pyarrow/api.h b/python/src/pyarrow/api.h index c2285de77bf10..72be6afe02c76 100644 --- a/python/src/pyarrow/api.h +++ b/python/src/pyarrow/api.h @@ -18,4 +18,11 @@ #ifndef PYARROW_API_H #define PYARROW_API_H +#include "pyarrow/status.h" + +#include "pyarrow/helpers.h" + +#include "pyarrow/adapters/builtin.h" +#include "pyarrow/adapters/pandas.h" + #endif // PYARROW_API_H diff --git a/python/src/pyarrow/common.cc b/python/src/pyarrow/common.cc new file mode 100644 index 0000000000000..a2748f99b6733 --- /dev/null +++ b/python/src/pyarrow/common.cc @@ -0,0 +1,71 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "pyarrow/common.h" + +#include +#include +#include + +#include +#include + +#include "pyarrow/status.h" + +namespace pyarrow { + +class PyArrowMemoryPool : public arrow::MemoryPool { + public: + PyArrowMemoryPool() : bytes_allocated_(0) {} + virtual ~PyArrowMemoryPool() {} + + arrow::Status Allocate(int64_t size, uint8_t** out) override { + std::lock_guard guard(pool_lock_); + *out = static_cast(std::malloc(size)); + if (*out == nullptr) { + std::stringstream ss; + ss << "malloc of size " << size << " failed"; + return arrow::Status::OutOfMemory(ss.str()); + } + + bytes_allocated_ += size; + + return arrow::Status::OK(); + } + + int64_t bytes_allocated() const override { + std::lock_guard guard(pool_lock_); + return bytes_allocated_; + } + + void Free(uint8_t* buffer, int64_t size) override { + std::lock_guard guard(pool_lock_); + std::free(buffer); + bytes_allocated_ -= size; + } + + private: + mutable std::mutex pool_lock_; + int64_t bytes_allocated_; +}; + +arrow::MemoryPool* GetMemoryPool() { + static PyArrowMemoryPool memory_pool; + return &memory_pool; +} + +} // namespace pyarrow diff --git a/python/src/pyarrow/common.h b/python/src/pyarrow/common.h new file mode 100644 index 0000000000000..a43e4d28c899a --- /dev/null +++ b/python/src/pyarrow/common.h @@ -0,0 +1,95 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. 
The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef PYARROW_COMMON_H
+#define PYARROW_COMMON_H
+
+#include <Python.h>
+
+namespace arrow { class MemoryPool; }
+
+namespace pyarrow {
+
+#define PYARROW_IS_PY2 PY_MAJOR_VERSION <= 2
+
+#define RETURN_ARROW_NOT_OK(s) do {             \
+    arrow::Status _s = (s);                     \
+    if (!_s.ok()) {                             \
+      return Status::ArrowError(_s.ToString()); \
+    }                                           \
+  } while (0);
+
+class OwnedRef {
+ public:
+  OwnedRef() : obj_(nullptr) {}
+
+  OwnedRef(PyObject* obj) :
+      obj_(obj) {}
+
+  ~OwnedRef() {
+    Py_XDECREF(obj_);
+  }
+
+  void reset(PyObject* obj) {
+    if (obj_ != nullptr) {
+      Py_XDECREF(obj_);
+    }
+    obj_ = obj;
+  }
+
+  PyObject* obj() const {
+    return obj_;
+  }
+
+ private:
+  PyObject* obj_;
+};
+
+struct PyObjectStringify {
+  OwnedRef tmp_obj;
+  const char* bytes;
+
+  PyObjectStringify(PyObject* obj) {
+    PyObject* bytes_obj;
+    if (PyUnicode_Check(obj)) {
+      bytes_obj = PyUnicode_AsUTF8String(obj);
+      tmp_obj.reset(bytes_obj);
+    } else {
+      bytes_obj = obj;
+    }
+    bytes = PyBytes_AsString(bytes_obj);
+  }
+};
+
+// TODO(wesm): We can just let errors pass through. To be explored later
+#define RETURN_IF_PYERROR()                             \
+  if (PyErr_Occurred()) {                               \
+    PyObject *exc_type, *exc_value, *traceback;         \
+    PyErr_Fetch(&exc_type, &exc_value, &traceback);     \
+    PyObjectStringify stringified(exc_value);           \
+    std::string message(stringified.bytes);             \
+    Py_XDECREF(exc_type);                               \
+    Py_XDECREF(exc_value);                              \
+    Py_XDECREF(traceback); /* traceback may be NULL */  \
+    return Status::UnknownError(message);               \
+  }
+
+arrow::MemoryPool* GetMemoryPool();
+
+} // namespace pyarrow
+
+#endif // PYARROW_COMMON_H
diff --git a/python/src/pyarrow/helpers.cc b/python/src/pyarrow/helpers.cc
new file mode 100644
index 0000000000000..d0969dacc21e0
--- /dev/null
+++ b/python/src/pyarrow/helpers.cc
@@ -0,0 +1,57 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
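+
+// GetPrimitiveType maps a LogicalType to a DataType instance. Via the
+// GET_PRIMITIVE_TYPE macro, the nullable variant returns the shared
+// singleton (NA, INT64, ...) while the non-nullable variant allocates a
+// fresh instance with nullable=false.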
+ +#include "pyarrow/helpers.h" + +#include + +using namespace arrow; + +namespace pyarrow { + +#define GET_PRIMITIVE_TYPE(NAME, Type) \ + case LogicalType::NAME: \ + if (nullable) { \ + return NAME; \ + } else { \ + return std::make_shared(nullable); \ + } \ + break; + +std::shared_ptr GetPrimitiveType(LogicalType::type type, + bool nullable) { + switch (type) { + case LogicalType::NA: + return NA; + GET_PRIMITIVE_TYPE(UINT8, UInt8Type); + GET_PRIMITIVE_TYPE(INT8, Int8Type); + GET_PRIMITIVE_TYPE(UINT16, UInt16Type); + GET_PRIMITIVE_TYPE(INT16, Int16Type); + GET_PRIMITIVE_TYPE(UINT32, UInt32Type); + GET_PRIMITIVE_TYPE(INT32, Int32Type); + GET_PRIMITIVE_TYPE(UINT64, UInt64Type); + GET_PRIMITIVE_TYPE(INT64, Int64Type); + GET_PRIMITIVE_TYPE(BOOL, BooleanType); + GET_PRIMITIVE_TYPE(FLOAT, FloatType); + GET_PRIMITIVE_TYPE(DOUBLE, DoubleType); + GET_PRIMITIVE_TYPE(STRING, StringType); + default: + return nullptr; + } +} + +} // namespace pyarrow diff --git a/cpp/src/arrow/types/null.h b/python/src/pyarrow/helpers.h similarity index 72% rename from cpp/src/arrow/types/null.h rename to python/src/pyarrow/helpers.h index c67f752d40989..1a24f056febe6 100644 --- a/cpp/src/arrow/types/null.h +++ b/python/src/pyarrow/helpers.h @@ -15,20 +15,20 @@ // specific language governing permissions and limitations // under the License. -#ifndef ARROW_TYPES_NULL_H -#define ARROW_TYPES_NULL_H +#ifndef PYARROW_HELPERS_H +#define PYARROW_HELPERS_H -#include -#include +#include +#include -#include "arrow/type.h" +namespace pyarrow { -namespace arrow { +using arrow::DataType; +using arrow::LogicalType; -struct NullType : public PrimitiveType { - PRIMITIVE_DECL(NullType, void, NA, 0, "null"); -}; +std::shared_ptr GetPrimitiveType(LogicalType::type type, + bool nullable); -} // namespace arrow +} // namespace pyarrow -#endif // ARROW_TYPES_NULL_H +#endif // PYARROW_HELPERS_H diff --git a/python/src/pyarrow/init.cc b/python/src/pyarrow/init.cc index c36f413725532..acd851e168743 100644 --- a/python/src/pyarrow/init.cc +++ b/python/src/pyarrow/init.cc @@ -17,13 +17,9 @@ #include "pyarrow/init.h" -namespace arrow { - -namespace py { +namespace pyarrow { void pyarrow_init() { } -} // namespace py - -} // namespace arrow +} // namespace pyarrow diff --git a/python/src/pyarrow/init.h b/python/src/pyarrow/init.h index 1fc9f10102696..71e67a20c1ca5 100644 --- a/python/src/pyarrow/init.h +++ b/python/src/pyarrow/init.h @@ -18,14 +18,10 @@ #ifndef PYARROW_INIT_H #define PYARROW_INIT_H -namespace arrow { - -namespace py { +namespace pyarrow { void pyarrow_init(); -} // namespace py - -} // namespace arrow +} // namespace pyarrow #endif // PYARROW_INIT_H diff --git a/python/src/pyarrow/status.cc b/python/src/pyarrow/status.cc new file mode 100644 index 0000000000000..1cd54f6a78560 --- /dev/null +++ b/python/src/pyarrow/status.cc @@ -0,0 +1,92 @@ +// Copyright (c) 2011 The LevelDB Authors. All rights reserved. +// Use of this source code is governed by a BSD-style license that can be +// found in the LICENSE file. See the AUTHORS file for names of contributors. +// +// A Status encapsulates the result of an operation. It may indicate success, +// or it may indicate an error with an associated error message. +// +// Multiple threads can invoke const methods on a Status without +// external synchronization, but if any of the threads may call a +// non-const method, all threads accessing the same Status must use +// external synchronization. 
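+
+// Adapted from LevelDB's Status (per the copyright header above), with a
+// pyarrow-specific code set: TypeError, ValueError, ArrowError, and
+// UnknownError have no LevelDB counterpart.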
+
+#include "pyarrow/status.h"
+
+#include <assert.h>
+#include <cstdint>
+#include <cstring>
+
+namespace pyarrow {
+
+Status::Status(StatusCode code, const std::string& msg, int16_t posix_code) {
+  assert(code != StatusCode::OK);
+  const uint32_t size = msg.size();
+  char* result = new char[size + 7];
+  memcpy(result, &size, sizeof(size));
+  result[4] = static_cast<char>(code);
+  memcpy(result + 5, &posix_code, sizeof(posix_code));
+  memcpy(result + 7, msg.c_str(), msg.size());
+  state_ = result;
+}
+
+const char* Status::CopyState(const char* state) {
+  uint32_t size;
+  memcpy(&size, state, sizeof(size));
+  char* result = new char[size + 7];
+  memcpy(result, state, size + 7);
+  return result;
+}
+
+std::string Status::CodeAsString() const {
+  if (state_ == NULL) {
+    return "OK";
+  }
+
+  const char* type;
+  switch (code()) {
+    case StatusCode::OK:
+      type = "OK";
+      break;
+    case StatusCode::OutOfMemory:
+      type = "Out of memory";
+      break;
+    case StatusCode::KeyError:
+      type = "Key error";
+      break;
+    case StatusCode::TypeError:
+      type = "Type error";
+      break;
+    case StatusCode::ValueError:
+      type = "Value error";
+      break;
+    case StatusCode::IOError:
+      type = "IO error";
+      break;
+    case StatusCode::NotImplemented:
+      type = "Not implemented";
+      break;
+    case StatusCode::ArrowError:
+      type = "Arrow C++ error";
+      break;
+    case StatusCode::UnknownError:
+      type = "Unknown error";
+      break;
+  }
+  return std::string(type);
+}
+
+std::string Status::ToString() const {
+  std::string result(CodeAsString());
+  if (state_ == NULL) {
+    return result;
+  }
+
+  result.append(": ");
+
+  uint32_t length;
+  memcpy(&length, state_, sizeof(length));
+  result.append(reinterpret_cast<const char*>(state_ + 7), length);
+  return result;
+}
+
+} // namespace pyarrow
diff --git a/python/src/pyarrow/status.h b/python/src/pyarrow/status.h
new file mode 100644
index 0000000000000..cb8c8add210e4
--- /dev/null
+++ b/python/src/pyarrow/status.h
@@ -0,0 +1,144 @@
+// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file. See the AUTHORS file for names of contributors.
+//
+// A Status encapsulates the result of an operation.  It may indicate success,
+// or it may indicate an error with an associated error message.
+//
+// Multiple threads can invoke const methods on a Status without
+// external synchronization, but if any of the threads may call a
+// non-const method, all threads accessing the same Status must use
+// external synchronization.
+
+#ifndef PYARROW_STATUS_H_
+#define PYARROW_STATUS_H_
+
+#include <cstdint>
+#include <cstring>
+#include <string>
+
+namespace pyarrow {
+
+#define PY_RETURN_NOT_OK(s) do {    \
+    Status _s = (s);                \
+    if (!_s.ok()) return _s;        \
+  } while (0);
+
+enum class StatusCode: char {
+  OK = 0,
+  OutOfMemory = 1,
+  KeyError = 2,
+  TypeError = 3,
+  ValueError = 4,
+  IOError = 5,
+  NotImplemented = 6,
+
+  ArrowError = 7,
+
+  UnknownError = 10
+};
+
+class Status {
+ public:
+  // Create a success status.
+  Status() : state_(NULL) { }
+  ~Status() { delete[] state_; }
+
+  // Copy the specified status.
+  Status(const Status& s);
+  void operator=(const Status& s);
+
+  // Return a success status.
+  static Status OK() { return Status(); }
+
+  // Return error status of an appropriate type.
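+  // Illustrative usage: a failing path returns e.g.
+  //   return Status::NotImplemented("No type inference for dicts");
+  // and callers propagate with PY_RETURN_NOT_OK(expr).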
+  static Status OutOfMemory(const std::string& msg, int16_t posix_code = -1) {
+    return Status(StatusCode::OutOfMemory, msg, posix_code);
+  }
+
+  static Status KeyError(const std::string& msg) {
+    return Status(StatusCode::KeyError, msg, -1);
+  }
+
+  static Status TypeError(const std::string& msg) {
+    return Status(StatusCode::TypeError, msg, -1);
+  }
+
+  static Status IOError(const std::string& msg) {
+    return Status(StatusCode::IOError, msg, -1);
+  }
+
+  static Status ValueError(const std::string& msg) {
+    return Status(StatusCode::ValueError, msg, -1);
+  }
+
+  static Status NotImplemented(const std::string& msg) {
+    return Status(StatusCode::NotImplemented, msg, -1);
+  }
+
+  static Status UnknownError(const std::string& msg) {
+    return Status(StatusCode::UnknownError, msg, -1);
+  }
+
+  static Status ArrowError(const std::string& msg) {
+    return Status(StatusCode::ArrowError, msg, -1);
+  }
+
+  // Returns true iff the status indicates success.
+  bool ok() const { return (state_ == NULL); }
+
+  bool IsOutOfMemory() const { return code() == StatusCode::OutOfMemory; }
+  bool IsKeyError() const { return code() == StatusCode::KeyError; }
+  bool IsIOError() const { return code() == StatusCode::IOError; }
+  bool IsTypeError() const { return code() == StatusCode::TypeError; }
+  bool IsValueError() const { return code() == StatusCode::ValueError; }
+
+  bool IsUnknownError() const { return code() == StatusCode::UnknownError; }
+
+  bool IsArrowError() const { return code() == StatusCode::ArrowError; }
+
+  // Return a string representation of this status suitable for printing.
+  // Returns the string "OK" for success.
+  std::string ToString() const;
+
+  // Return a string representation of the status code, without the message
+  // text or posix code information.
+  std::string CodeAsString() const;
+
+  // Get the POSIX code associated with this Status, or -1 if there is none.
+  int16_t posix_code() const;
+
+ private:
+  // OK status has a NULL state_. Otherwise, state_ is a new[] array
+  // of the following form:
+  //   state_[0..3] == length of message
+  //   state_[4]    == code
+  //   state_[5..6] == posix_code
+  //   state_[7..]  == message
+  const char* state_;
+
+  StatusCode code() const {
+    return ((state_ == NULL) ?
+        StatusCode::OK : static_cast<StatusCode>(state_[4]));
+  }
+
+  Status(StatusCode code, const std::string& msg, int16_t posix_code);
+  static const char* CopyState(const char* s);
+};
+
+inline Status::Status(const Status& s) {
+  state_ = (s.state_ == NULL) ? NULL : CopyState(s.state_);
+}
+
+inline void Status::operator=(const Status& s) {
+  // The following condition catches both aliasing (when this == &s),
+  // and the common case where both s and *this are ok.
+  if (state_ != s.state_) {
+    delete[] state_;
+    state_ = (s.state_ == NULL) ? NULL : CopyState(s.state_);
+  }
+}
+
+} // namespace pyarrow
+
+#endif // PYARROW_STATUS_H_

From ae95dbd189477442d39e55fb0a1aede206906cd9 Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Mon, 7 Mar 2016 22:39:07 -0800
Subject: [PATCH 0024/1644] ARROW-44: Python: prototype object model for array slot values ("scalars")

Non-exhaustive, but this will facilitate inspecting Arrow data while the
library is in development.
```python In [2]: arr = arrow.from_pylist([['foo', None], None, [], ['qux']]) In [3]: arr Out[3]: In [4]: arr[0] Out[4]: ['foo', None] In [5]: type(arr[0]) Out[5]: arrow.scalar.ListValue In [6]: arr[0][0] Out[6]: 'foo' In [7]: arr[0][1] Out[7]: NA In [8]: arr[1] Out[8]: NA In [9]: arr[2] Out[9]: [] In [10]: len(arr[2]) Out[10]: 0 In [11]: arr.type Out[11]: DataType(list) ``` Author: Wes McKinney Closes #20 from wesm/ARROW-44 and squashes the following commits: df06ba1 [Wes McKinney] Add tests for scalars proxying implemented Python list type conversions, fix associated bugs 20fbdc1 [Wes McKinney] Draft scalar box types, no tests yet --- cpp/src/arrow/types/list.h | 6 +- python/arrow/__init__.py | 6 +- python/arrow/array.pxd | 1 - python/arrow/array.pyx | 17 ++- python/arrow/compat.py | 6 + python/arrow/includes/arrow.pxd | 36 +++++- python/arrow/scalar.pxd | 25 +++- python/arrow/scalar.pyx | 165 +++++++++++++++++++++++++ python/arrow/schema.pxd | 2 + python/arrow/schema.pyx | 14 +++ python/arrow/tests/test_scalars.py | 82 ++++++++++++ python/src/pyarrow/adapters/builtin.cc | 2 +- 12 files changed, 342 insertions(+), 20 deletions(-) create mode 100644 python/arrow/tests/test_scalars.py diff --git a/cpp/src/arrow/types/list.h b/cpp/src/arrow/types/list.h index f40a8245362b1..210c76a046c21 100644 --- a/cpp/src/arrow/types/list.h +++ b/cpp/src/arrow/types/list.h @@ -63,7 +63,11 @@ class ListArray : public Array { // Return a shared pointer in case the requestor desires to share ownership // with this array. - const ArrayPtr& values() const {return values_;} + const std::shared_ptr& values() const {return values_;} + + const std::shared_ptr& value_type() const { + return values_->type(); + } const int32_t* offsets() const { return offsets_;} diff --git a/python/arrow/__init__.py b/python/arrow/__init__.py index 3c049b85e8c93..3507ea0235afe 100644 --- a/python/arrow/__init__.py +++ b/python/arrow/__init__.py @@ -24,7 +24,11 @@ from arrow.error import ArrowException -from arrow.scalar import ArrayValue, NA, Scalar +from arrow.scalar import (ArrayValue, Scalar, NA, NAType, + BooleanValue, + Int8Value, Int16Value, Int32Value, Int64Value, + UInt8Value, UInt16Value, UInt32Value, UInt64Value, + FloatValue, DoubleValue, ListValue, StringValue) from arrow.schema import (null, bool_, int8, int16, int32, int64, diff --git a/python/arrow/array.pxd b/python/arrow/array.pxd index e32d27769b5e1..04dd8d182bcf6 100644 --- a/python/arrow/array.pxd +++ b/python/arrow/array.pxd @@ -34,7 +34,6 @@ cdef class Array: DataType type cdef init(self, const shared_ptr[CArray]& sp_array) - cdef _getitem(self, int i) cdef class BooleanArray(Array): diff --git a/python/arrow/array.pyx b/python/arrow/array.pyx index 3a3210d6cc100..8ebd01d1dbe73 100644 --- a/python/arrow/array.pyx +++ b/python/arrow/array.pyx @@ -25,6 +25,7 @@ cimport arrow.includes.pyarrow as pyarrow from arrow.compat import frombytes, tobytes from arrow.error cimport check_status +cimport arrow.scalar as scalar from arrow.scalar import NA def total_allocated_bytes(): @@ -73,13 +74,7 @@ cdef class Array: while key < 0: key += len(self) - if self.ap.IsNull(key): - return NA - else: - return self._getitem(key) - - cdef _getitem(self, int i): - raise NotImplementedError + return scalar.box_arrow_scalar(self.type, self.sp_array, key) def slice(self, start, end): pass @@ -168,12 +163,16 @@ cdef object box_arrow_array(const shared_ptr[CArray]& sp_array): return arr -def from_pylist(object list_obj, type=None): +def from_pylist(object list_obj, DataType type=None): 
""" Convert Python list to Arrow array """ cdef: shared_ptr[CArray] sp_array - check_status(pyarrow.ConvertPySequence(list_obj, &sp_array)) + if type is None: + check_status(pyarrow.ConvertPySequence(list_obj, &sp_array)) + else: + raise NotImplementedError + return box_arrow_array(sp_array) diff --git a/python/arrow/compat.py b/python/arrow/compat.py index 2ac41ac8abf89..08f0f23796797 100644 --- a/python/arrow/compat.py +++ b/python/arrow/compat.py @@ -54,6 +54,9 @@ def dict_values(x): range = xrange long = long + def u(s): + return unicode(s, "unicode_escape") + def tobytes(o): if isinstance(o, unicode): return o.encode('utf8') @@ -73,6 +76,9 @@ def dict_values(x): from decimal import Decimal range = range + def u(s): + return s + def tobytes(o): if isinstance(o, str): return o.encode('utf8') diff --git a/python/arrow/includes/arrow.pxd b/python/arrow/includes/arrow.pxd index fde5de910915a..0cc44c06cb607 100644 --- a/python/arrow/includes/arrow.pxd +++ b/python/arrow/includes/arrow.pxd @@ -84,13 +84,41 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: c_bool IsNull(int i) cdef cppclass CUInt8Array" arrow::UInt8Array"(CArray): - pass + uint8_t Value(int i) cdef cppclass CInt8Array" arrow::Int8Array"(CArray): - pass + int8_t Value(int i) + + cdef cppclass CUInt16Array" arrow::UInt16Array"(CArray): + uint16_t Value(int i) + + cdef cppclass CInt16Array" arrow::Int16Array"(CArray): + int16_t Value(int i) + + cdef cppclass CUInt32Array" arrow::UInt32Array"(CArray): + uint32_t Value(int i) + + cdef cppclass CInt32Array" arrow::Int32Array"(CArray): + int32_t Value(int i) + + cdef cppclass CUInt64Array" arrow::UInt64Array"(CArray): + uint64_t Value(int i) + + cdef cppclass CInt64Array" arrow::Int64Array"(CArray): + int64_t Value(int i) + + cdef cppclass CFloatArray" arrow::FloatArray"(CArray): + float Value(int i) + + cdef cppclass CDoubleArray" arrow::DoubleArray"(CArray): + double Value(int i) cdef cppclass CListArray" arrow::ListArray"(CArray): - pass + const int32_t* offsets() + int32_t offset(int i) + int32_t value_length(int i) + const shared_ptr[CArray]& values() + const shared_ptr[CDataType]& value_type() cdef cppclass CStringArray" arrow::StringArray"(CListArray): - pass + c_string GetString(int i) diff --git a/python/arrow/scalar.pxd b/python/arrow/scalar.pxd index e193c09cd69a2..15cdc956a2593 100644 --- a/python/arrow/scalar.pxd +++ b/python/arrow/scalar.pxd @@ -16,7 +16,7 @@ # under the License. 
 from arrow.includes.common cimport *
-from arrow.includes.arrow cimport CArray, CListArray
+from arrow.includes.arrow cimport *
 
 from arrow.schema cimport DataType
 
@@ -31,17 +31,36 @@ cdef class NAType(Scalar):
 
 cdef class ArrayValue(Scalar):
     cdef:
-        shared_ptr[CArray] array
+        shared_ptr[CArray] sp_array
         int index
 
+    cdef void init(self, DataType type,
+                   const shared_ptr[CArray]& sp_array, int index)
+
+    cdef void _set_array(self, const shared_ptr[CArray]& sp_array)
+
 
 cdef class Int8Value(ArrayValue):
     pass
 
 
-cdef class ListValue(ArrayValue):
+cdef class Int64Value(ArrayValue):
     pass
 
 
+cdef class ListValue(ArrayValue):
+    cdef readonly:
+        DataType value_type
+
+    cdef:
+        CListArray* ap
+
+    cdef _getitem(self, int i)
+
+
 cdef class StringValue(ArrayValue):
     pass
+
+
+cdef object box_arrow_scalar(DataType type,
+                             const shared_ptr[CArray]& sp_array,
+                             int index)
diff --git a/python/arrow/scalar.pyx b/python/arrow/scalar.pyx
index 78dadecf9b422..951ede2877690 100644
--- a/python/arrow/scalar.pyx
+++ b/python/arrow/scalar.pyx
@@ -15,14 +15,179 @@
 # specific language governing permissions and limitations
 # under the License.
 
+from arrow.schema cimport DataType, box_data_type
+
+from arrow.compat import frombytes
 import arrow.schema as schema
 
+NA = None
+
+
 cdef class NAType(Scalar):
 
     def __cinit__(self):
+        global NA
+        if NA is not None:
+            raise Exception('Cannot create multiple NAType instances')
+
         self.type = schema.null()
 
     def __repr__(self):
         return 'NA'
 
+    def as_py(self):
+        return None
+
 NA = NAType()
+
+
+cdef class ArrayValue(Scalar):
+
+    cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array,
+                   int index):
+        self.type = type
+        self.index = index
+        self._set_array(sp_array)
+
+    cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
+        self.sp_array = sp_array
+
+    def __repr__(self):
+        if hasattr(self, 'as_py'):
+            return repr(self.as_py())
+        else:
+            return Scalar.__repr__(self)
+
+
+cdef class BooleanValue(ArrayValue):
+    pass
+
+
+cdef class Int8Value(ArrayValue):
+
+    def as_py(self):
+        cdef CInt8Array* ap = <CInt8Array*> self.sp_array.get()
+        return ap.Value(self.index)
+
+
+cdef class UInt8Value(ArrayValue):
+
+    def as_py(self):
+        cdef CUInt8Array* ap = <CUInt8Array*> self.sp_array.get()
+        return ap.Value(self.index)
+
+
+cdef class Int16Value(ArrayValue):
+
+    def as_py(self):
+        cdef CInt16Array* ap = <CInt16Array*> self.sp_array.get()
+        return ap.Value(self.index)
+
+
+cdef class UInt16Value(ArrayValue):
+
+    def as_py(self):
+        cdef CUInt16Array* ap = <CUInt16Array*> self.sp_array.get()
+        return ap.Value(self.index)
+
+
+cdef class Int32Value(ArrayValue):
+
+    def as_py(self):
+        cdef CInt32Array* ap = <CInt32Array*> self.sp_array.get()
+        return ap.Value(self.index)
+
+
+cdef class UInt32Value(ArrayValue):
+
+    def as_py(self):
+        cdef CUInt32Array* ap = <CUInt32Array*> self.sp_array.get()
+        return ap.Value(self.index)
+
+
+cdef class Int64Value(ArrayValue):
+
+    def as_py(self):
+        cdef CInt64Array* ap = <CInt64Array*> self.sp_array.get()
+        return ap.Value(self.index)
+
+
+cdef class UInt64Value(ArrayValue):
+
+    def as_py(self):
+        cdef CUInt64Array* ap = <CUInt64Array*> self.sp_array.get()
+        return ap.Value(self.index)
+
+
+cdef class FloatValue(ArrayValue):
+
+    def as_py(self):
+        cdef CFloatArray* ap = <CFloatArray*> self.sp_array.get()
+        return ap.Value(self.index)
+
+
+cdef class DoubleValue(ArrayValue):
+
+    def as_py(self):
+        cdef CDoubleArray* ap = <CDoubleArray*> self.sp_array.get()
+        return ap.Value(self.index)
+
+
+cdef class StringValue(ArrayValue):
+
+    def as_py(self):
+        cdef CStringArray* ap = <CStringArray*> self.sp_array.get()
+        return frombytes(ap.GetString(self.index))
+
+
+cdef class ListValue(ArrayValue):
+
+    def __len__(self):
+        return self.ap.value_length(self.index)
+
+    def __getitem__(self, i):
+        return self._getitem(i)
+
+    cdef void _set_array(self, const shared_ptr[CArray]& sp_array):
+        self.sp_array = sp_array
+        self.ap = <CListArray*> sp_array.get()
+        self.value_type = box_data_type(self.ap.value_type())
+
+    cdef _getitem(self, int i):
+        cdef int j = self.ap.offset(self.index) + i
+        return box_arrow_scalar(self.value_type, self.ap.values(), j)
+
+    def as_py(self):
+        cdef:
+            int j
+            list result = []
+
+        for j in range(len(self)):
+            result.append(self._getitem(j).as_py())
+
+        return result
+
+
+cdef dict _scalar_classes = {
+    LogicalType_UINT8: UInt8Value,
+    LogicalType_UINT16: UInt16Value,
+    LogicalType_UINT32: UInt32Value,
+    LogicalType_UINT64: UInt64Value,
+    LogicalType_INT8: Int8Value,
+    LogicalType_INT16: Int16Value,
+    LogicalType_INT32: Int32Value,
+    LogicalType_INT64: Int64Value,
+    LogicalType_FLOAT: FloatValue,
+    LogicalType_DOUBLE: DoubleValue,
+    LogicalType_LIST: ListValue,
+    LogicalType_STRING: StringValue
+}
+
+cdef object box_arrow_scalar(DataType type,
+                             const shared_ptr[CArray]& sp_array,
+                             int index):
+    cdef ArrayValue val
+    if sp_array.get().IsNull(index):
+        return NA
+    else:
+        val = _scalar_classes[type.type.type]()
+        val.init(type, sp_array, index)
+        return val
diff --git a/python/arrow/schema.pxd b/python/arrow/schema.pxd
index 487c246f44abf..8cc244aaba341 100644
--- a/python/arrow/schema.pxd
+++ b/python/arrow/schema.pxd
@@ -37,3 +37,5 @@ cdef class Schema:
     cdef:
         shared_ptr[CSchema] sp_schema
         CSchema* schema
+
+cdef DataType box_data_type(const shared_ptr[CDataType]& type)
diff --git a/python/arrow/schema.pyx b/python/arrow/schema.pyx
index 63cd6e888abd0..3001531eb607d 100644
--- a/python/arrow/schema.pyx
+++ b/python/arrow/schema.pyx
@@ -85,6 +85,14 @@ cdef DataType primitive_type(LogicalType type, bint nullable=True):
 def field(name, type):
     return Field(name, type)
 
+cdef set PRIMITIVE_TYPES = set([
+    LogicalType_NA, LogicalType_BOOL,
+    LogicalType_UINT8, LogicalType_INT8,
+    LogicalType_UINT16, LogicalType_INT16,
+    LogicalType_UINT32, LogicalType_INT32,
+    LogicalType_UINT64, LogicalType_INT64,
+    LogicalType_FLOAT, LogicalType_DOUBLE])
+
 def null():
     return primitive_type(LogicalType_NA)
@@ -148,3 +156,9 @@ def struct(fields, c_bool nullable=True):
     out.init(shared_ptr[CDataType](
         new CStructType(c_fields, nullable)))
     return out
+
+
+cdef DataType box_data_type(const shared_ptr[CDataType]& type):
+    cdef DataType out = DataType()
+    out.init(type)
+    return out
diff --git a/python/arrow/tests/test_scalars.py b/python/arrow/tests/test_scalars.py
new file mode 100644
index 0000000000000..951380bd981e4
--- /dev/null
+++ b/python/arrow/tests/test_scalars.py
@@ -0,0 +1,82 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
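# [Editor's note, not part of the original patch] The list tests below lean
# on ListValue._getitem from scalar.pyx: a list array keeps all child values
# in one flattened child array plus an offsets buffer, so element i of list
# slot k lives at child index offsets[k] + i. For example, for
# arrow.from_pylist([['foo', None], None, ['qux']]) the child string array is
# ['foo', None, 'qux'] with offsets [0, 2, 2, 3], and arr[2][0] resolves to
# child index offsets[2] + 0 = 2, i.e. 'qux'.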
+ +from arrow.compat import unittest, u +import arrow + + +class TestScalars(unittest.TestCase): + + def test_null_singleton(self): + with self.assertRaises(Exception): + arrow.NAType() + + def test_bool(self): + pass + + def test_int64(self): + arr = arrow.from_pylist([1, 2, None]) + + v = arr[0] + assert isinstance(v, arrow.Int64Value) + assert repr(v) == "1" + assert v.as_py() == 1 + + assert arr[2] is arrow.NA + + def test_double(self): + arr = arrow.from_pylist([1.5, None, 3]) + + v = arr[0] + assert isinstance(v, arrow.DoubleValue) + assert repr(v) == "1.5" + assert v.as_py() == 1.5 + + assert arr[1] is arrow.NA + + v = arr[2] + assert v.as_py() == 3.0 + + def test_string(self): + arr = arrow.from_pylist(['foo', None, u('bar')]) + + v = arr[0] + assert isinstance(v, arrow.StringValue) + assert repr(v) == "'foo'" + assert v.as_py() == 'foo' + + assert arr[1] is arrow.NA + + v = arr[2].as_py() + assert v == 'bar' + assert isinstance(v, str) + + def test_list(self): + arr = arrow.from_pylist([['foo', None], None, ['bar'], []]) + + v = arr[0] + assert len(v) == 2 + assert isinstance(v, arrow.ListValue) + assert repr(v) == "['foo', None]" + assert v.as_py() == ['foo', None] + assert v[0].as_py() == 'foo' + assert v[1] is arrow.NA + + assert arr[1] is arrow.NA + + v = arr[3] + assert len(v) == 0 diff --git a/python/src/pyarrow/adapters/builtin.cc b/python/src/pyarrow/adapters/builtin.cc index ae84fa12b0de6..60d6248842ec9 100644 --- a/python/src/pyarrow/adapters/builtin.cc +++ b/python/src/pyarrow/adapters/builtin.cc @@ -276,7 +276,7 @@ class Int64Converter : public TypedConverter { class DoubleConverter : public TypedConverter { public: Status AppendData(PyObject* seq) override { - int64_t val; + double val; Py_ssize_t size = PySequence_Size(seq); for (int64_t i = 0; i < size; ++i) { OwnedRef item(PySequence_GetItem(seq, i)); From 45cd9fd8ddc75f5c8a558024c705ab8d37bbc5b5 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 8 Mar 2016 12:48:42 -0800 Subject: [PATCH 0025/1644] ARROW-43: Python: format array values to in __repr__ for interactive computing Author: Wes McKinney Closes #21 from wesm/ARROW-43 and squashes the following commits: dee6ba2 [Wes McKinney] Basic array formatter, not tweaking too much for now --- python/arrow/array.pxd | 1 + python/arrow/array.pyx | 16 +++++++++++++- python/arrow/scalar.pxd | 2 +- python/arrow/scalar.pyx | 11 +++++++--- python/arrow/tests/test_array.py | 37 ++++++++++++++++++++++++++++++++ 5 files changed, 62 insertions(+), 5 deletions(-) diff --git a/python/arrow/array.pxd b/python/arrow/array.pxd index 04dd8d182bcf6..482f8f796dd26 100644 --- a/python/arrow/array.pxd +++ b/python/arrow/array.pxd @@ -34,6 +34,7 @@ cdef class Array: DataType type cdef init(self, const shared_ptr[CArray]& sp_array) + cdef getitem(self, int i) cdef class BooleanArray(Array): diff --git a/python/arrow/array.pyx b/python/arrow/array.pyx index 8ebd01d1dbe73..b367e3b84a8b3 100644 --- a/python/arrow/array.pyx +++ b/python/arrow/array.pyx @@ -46,6 +46,17 @@ cdef class Array: def __get__(self): return self.sp_array.get().null_count() + def __iter__(self): + for i in range(len(self)): + yield self.getitem(i) + raise StopIteration + + def __repr__(self): + from arrow.formatting import array_format + type_format = object.__repr__(self) + values = array_format(self, window=10) + return '{0}\n{1}'.format(type_format, values) + def __len__(self): return self.sp_array.get().length() @@ -74,7 +85,10 @@ cdef class Array: while key < 0: key += len(self) - return 
scalar.box_arrow_scalar(self.type, self.sp_array, key) + return self.getitem(key) + + cdef getitem(self, int i): + return scalar.box_arrow_scalar(self.type, self.sp_array, i) def slice(self, start, end): pass diff --git a/python/arrow/scalar.pxd b/python/arrow/scalar.pxd index 15cdc956a2593..4e0a3647155a6 100644 --- a/python/arrow/scalar.pxd +++ b/python/arrow/scalar.pxd @@ -55,7 +55,7 @@ cdef class ListValue(ArrayValue): cdef: CListArray* ap - cdef _getitem(self, int i) + cdef getitem(self, int i) cdef class StringValue(ArrayValue): diff --git a/python/arrow/scalar.pyx b/python/arrow/scalar.pyx index 951ede2877690..72a280e334f4e 100644 --- a/python/arrow/scalar.pyx +++ b/python/arrow/scalar.pyx @@ -144,14 +144,19 @@ cdef class ListValue(ArrayValue): return self.ap.value_length(self.index) def __getitem__(self, i): - return self._getitem(i) + return self.getitem(i) + + def __iter__(self): + for i in range(len(self)): + yield self.getitem(i) + raise StopIteration cdef void _set_array(self, const shared_ptr[CArray]& sp_array): self.sp_array = sp_array self.ap = sp_array.get() self.value_type = box_data_type(self.ap.value_type()) - cdef _getitem(self, int i): + cdef getitem(self, int i): cdef int j = self.ap.offset(self.index) + i return box_arrow_scalar(self.value_type, self.ap.values(), j) @@ -161,7 +166,7 @@ cdef class ListValue(ArrayValue): list result = [] for j in range(len(self)): - result.append(self._getitem(j).as_py()) + result.append(self.getitem(j).as_py()) return result diff --git a/python/arrow/tests/test_array.py b/python/arrow/tests/test_array.py index 8eaa53352061b..ebd872c744e44 100644 --- a/python/arrow/tests/test_array.py +++ b/python/arrow/tests/test_array.py @@ -17,6 +17,7 @@ from arrow.compat import unittest import arrow +import arrow.formatting as fmt class TestArrayAPI(unittest.TestCase): @@ -24,3 +25,39 @@ class TestArrayAPI(unittest.TestCase): def test_getitem_NA(self): arr = arrow.from_pylist([1, None, 2]) assert arr[1] is arrow.NA + + def test_list_format(self): + arr = arrow.from_pylist([[1], None, [2, 3]]) + result = fmt.array_format(arr) + expected = """\ +[ + [1], + NA, + [2, + 3] +]""" + assert result == expected + + def test_string_format(self): + arr = arrow.from_pylist(['foo', None, 'bar']) + result = fmt.array_format(arr) + expected = """\ +[ + 'foo', + NA, + 'bar' +]""" + assert result == expected + + def test_long_array_format(self): + arr = arrow.from_pylist(range(100)) + result = fmt.array_format(arr, window=2) + expected = """\ +[ + 0, + 1, + ... 
+ 98, + 99 +]""" + assert result == expected From 1650026285bea52288c7f24720c3caf7cd3ce2a8 Mon Sep 17 00:00:00 2001 From: Steven Phillips Date: Mon, 29 Feb 2016 19:32:12 -0800 Subject: [PATCH 0026/1644] ARROW-17: set some vector fields to package level access for Drill compatibility --- .../codegen/templates/BasicTypeHelper.java | 1 + .../templates/NullableValueVectors.java | 6 ++- .../templates/RepeatedValueVectors.java | 2 +- .../main/codegen/templates/UnionVector.java | 4 +- .../templates/VariableLengthVectors.java | 2 +- .../org/apache/arrow/vector/BitVector.java | 4 +- .../arrow/vector/complex/ListVector.java | 4 +- .../arrow/vector/complex/MapVector.java | 2 +- .../vector/complex/RepeatedListVector.java | 3 +- .../vector/complex/RepeatedMapVector.java | 2 +- .../org/apache/arrow/vector/types/Types.java | 54 +++++++++++++++---- 11 files changed, 60 insertions(+), 24 deletions(-) diff --git a/java/vector/src/main/codegen/templates/BasicTypeHelper.java b/java/vector/src/main/codegen/templates/BasicTypeHelper.java index bb6446e8d6b19..0bae715e35283 100644 --- a/java/vector/src/main/codegen/templates/BasicTypeHelper.java +++ b/java/vector/src/main/codegen/templates/BasicTypeHelper.java @@ -231,6 +231,7 @@ public static ValueVector getNewVector(MaterializedField field, BufferAllocator return getNewVector(field, allocator, null); } public static ValueVector getNewVector(MaterializedField field, BufferAllocator allocator, CallBack callBack){ + field = field.clone(); MajorType type = field.getType(); switch (type.getMinorType()) { diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index 6893a25efbe18..b0029f7ad4c37 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -45,8 +45,10 @@ public final class ${className} extends BaseDataValueVector implements <#if type private final FieldReader reader = new Nullable${minor.class}ReaderImpl(Nullable${minor.class}Vector.this); private final MaterializedField bitsField = MaterializedField.create("$bits$", new MajorType(MinorType.UINT1, DataMode.REQUIRED)); - private final UInt1Vector bits = new UInt1Vector(bitsField, allocator); - private final ${valuesName} values = new ${minor.class}Vector(field, allocator); + private final MaterializedField valuesField = MaterializedField.create("$values$", new MajorType(field.getType().getMinorType(), DataMode.REQUIRED, field.getPrecision(), field.getScale())); + + final UInt1Vector bits = new UInt1Vector(bitsField, allocator); + final ${valuesName} values = new ${minor.class}Vector(valuesField, allocator); private final Mutator mutator = new Mutator(); private final Accessor accessor = new Accessor(); diff --git a/java/vector/src/main/codegen/templates/RepeatedValueVectors.java b/java/vector/src/main/codegen/templates/RepeatedValueVectors.java index 5ac80f57737ff..ceae53bbf58cf 100644 --- a/java/vector/src/main/codegen/templates/RepeatedValueVectors.java +++ b/java/vector/src/main/codegen/templates/RepeatedValueVectors.java @@ -42,7 +42,7 @@ public final class Repeated${minor.class}Vector extends BaseRepeatedValueVector //private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(Repeated${minor.class}Vector.class); // we maintain local reference to concrete vector type for performance reasons. 
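  // [Editor's note, not part of the original patch] This commit widens a
  // number of fields like the one below from private to package-private,
  // apparently so that Drill code compiled into the same package can reach
  // the inner vectors without accessor indirection, e.g. (sketch):
  //
  //   Repeated${minor.class}Vector v = ...;
  //   v.values.getMutator().setSafe(0, value);  // direct field access
  //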
- private ${minor.class}Vector values; + ${minor.class}Vector values; private final FieldReader reader = new Repeated${minor.class}ReaderImpl(Repeated${minor.class}Vector.this); private final Mutator mutator = new Mutator(); private final Accessor accessor = new Accessor(); diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java index ba94ac22a05f6..6042a5bf68352 100644 --- a/java/vector/src/main/codegen/templates/UnionVector.java +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -51,9 +51,9 @@ public class UnionVector implements ValueVector { private BufferAllocator allocator; private Accessor accessor = new Accessor(); private Mutator mutator = new Mutator(); - private int valueCount; + int valueCount; - private MapVector internalMap; + MapVector internalMap; private UInt1Vector typeVector; private MapVector mapVector; diff --git a/java/vector/src/main/codegen/templates/VariableLengthVectors.java b/java/vector/src/main/codegen/templates/VariableLengthVectors.java index 13d53b8e846ab..84fb3eb55674f 100644 --- a/java/vector/src/main/codegen/templates/VariableLengthVectors.java +++ b/java/vector/src/main/codegen/templates/VariableLengthVectors.java @@ -57,7 +57,7 @@ public final class ${minor.class}Vector extends BaseDataValueVector implements V public final static String OFFSETS_VECTOR_NAME = "$offsets$"; private final MaterializedField offsetsField = MaterializedField.create(OFFSETS_VECTOR_NAME, new MajorType(MinorType.UINT4, DataMode.REQUIRED)); - private final UInt${type.width}Vector offsetVector = new UInt${type.width}Vector(offsetsField, allocator); + final UInt${type.width}Vector offsetVector = new UInt${type.width}Vector(offsetsField, allocator); private final FieldReader reader = new ${minor.class}ReaderImpl(${minor.class}Vector.this); private final Accessor accessor; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java index 952e9028e0668..c5bcb2decc43b 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java @@ -41,7 +41,7 @@ public final class BitVector extends BaseDataValueVector implements FixedWidthVe private final Accessor accessor = new Accessor(); private final Mutator mutator = new Mutator(); - private int valueCount; + int valueCount; private int allocationSizeInBytes = INITIAL_VALUE_ALLOCATION; private int allocationMonitor = 0; @@ -64,7 +64,7 @@ public int getBufferSizeFor(final int valueCount) { return getSizeFromCount(valueCount); } - private int getSizeFromCount(int valueCount) { + int getSizeFromCount(int valueCount) { return (int) Math.ceil(valueCount / 8.0); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java index 8387c9e5ba667..13610c4f03f61 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java @@ -47,8 +47,8 @@ public class ListVector extends BaseRepeatedValueVector { - private UInt4Vector offsets; - private final UInt1Vector bits; + UInt4Vector offsets; + final UInt1Vector bits; private Mutator mutator = new Mutator(); private Accessor accessor = new Accessor(); private UnionListWriter writer; diff --git 
a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java index 1bbce73d6ff82..cc0953a1af8ba 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java @@ -52,7 +52,7 @@ public class MapVector extends AbstractMapVector { private final SingleMapReaderImpl reader = new SingleMapReaderImpl(MapVector.this); private final Accessor accessor = new Accessor(); private final Mutator mutator = new Mutator(); - private int valueCount; + int valueCount; public MapVector(String path, BufferAllocator allocator, CallBack callBack){ this(MaterializedField.create(path, TYPE), allocator, callBack); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedListVector.java index 778fe81b5da6a..f337f9c4a60e0 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedListVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedListVector.java @@ -49,7 +49,7 @@ public class RepeatedListVector extends AbstractContainerVector public final static MajorType TYPE = new MajorType(MinorType.LIST, DataMode.REPEATED); private final RepeatedListReaderImpl reader = new RepeatedListReaderImpl(null, this); - private final DelegateRepeatedVector delegate; + final DelegateRepeatedVector delegate; protected static class DelegateRepeatedVector extends BaseRepeatedValueVector { @@ -313,7 +313,6 @@ public AddOrGetResult addOrGetVector(VectorDescriptor if (result.isCreated() && callBack != null) { callBack.doWork(); } - this.field = delegate.getField(); return result; } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedMapVector.java index e7eacd3c67c40..686414e71cadf 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedMapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedMapVector.java @@ -53,7 +53,7 @@ public class RepeatedMapVector extends AbstractMapVector public final static MajorType TYPE = new MajorType(MinorType.MAP, DataMode.REPEATED); - private final UInt4Vector offsets; // offsets to start of each record (considering record indices are 0-indexed) + final UInt4Vector offsets; // offsets to start of each record (considering record indices are 0-indexed) private final RepeatedMapReaderImpl reader = new RepeatedMapReaderImpl(RepeatedMapVector.this); private final RepeatedMapAccessor accessor = new RepeatedMapAccessor(); private final Mutator mutator = new Mutator(); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java index cef892ce88030..88999cb8f5ab8 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -20,6 +20,7 @@ import java.util.ArrayList; import java.util.List; import java.util.Map; +import java.util.Objects; public class Types { public enum MinorType { @@ -73,26 +74,35 @@ public enum DataMode { public static class MajorType { private MinorType minorType; private DataMode mode; - private Integer precision; - private Integer scale; - private Integer timezone; + private int precision; + private int scale; + 
private int timezone; + private int width; private List subTypes; public MajorType(MinorType minorType, DataMode mode) { - this(minorType, mode, null, null, null, null); + this(minorType, mode, 0, 0, 0, 0, null); } - public MajorType(MinorType minorType, DataMode mode, Integer precision, Integer scale) { - this(minorType, mode, precision, scale, null, null); + public MajorType(MinorType minorType, DataMode mode, int precision, int scale) { + this(minorType, mode, precision, scale, 0, 0, null); } - public MajorType(MinorType minorType, DataMode mode, Integer precision, Integer scale, Integer timezone, List subTypes) { + public MajorType(MinorType minorType, DataMode mode, int precision, int scale, int timezone, List subTypes) { + this(minorType, mode, precision, scale, timezone, 0, subTypes); + } + + public MajorType(MinorType minorType, DataMode mode, int precision, int scale, int timezone, int width, List subTypes) { this.minorType = minorType; this.mode = mode; this.precision = precision; this.scale = scale; this.timezone = timezone; + this.width = width; this.subTypes = subTypes; + if (subTypes == null) { + this.subTypes = new ArrayList<>(); + } } public MinorType getMinorType() { @@ -103,21 +113,45 @@ public DataMode getMode() { return mode; } - public Integer getPrecision() { + public int getPrecision() { return precision; } - public Integer getScale() { + public int getScale() { return scale; } - public Integer getTimezone() { + public int getTimezone() { return timezone; } public List getSubTypes() { return subTypes; } + + public int getWidth() { + return width; + } + + + @Override + public boolean equals(Object other) { + if (other == null) { + return false; + } + if (!(other instanceof MajorType)) { + return false; + } + MajorType that = (MajorType) other; + return this.minorType == that.minorType && + this.mode == that.mode && + this.precision == that.precision && + this.scale == that.scale && + this.timezone == that.timezone && + this.width == that.width && + Objects.equals(this.subTypes, that.subTypes); + } + } public static MajorType required(MinorType minorType) { From 243ed4e91d5ed922b205f7ac5fa8f9f821a07fbb Mon Sep 17 00:00:00 2001 From: Steven Phillips Date: Mon, 29 Feb 2016 19:33:44 -0800 Subject: [PATCH 0027/1644] ARROW-18: Fix decimal precision and scale in MapWriters --- java/vector/src/main/codegen/templates/MapWriters.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/java/vector/src/main/codegen/templates/MapWriters.java b/java/vector/src/main/codegen/templates/MapWriters.java index 7001367bb3774..42f39820393e5 100644 --- a/java/vector/src/main/codegen/templates/MapWriters.java +++ b/java/vector/src/main/codegen/templates/MapWriters.java @@ -206,7 +206,7 @@ public void end() { } public ${minor.class}Writer ${lowerName}(String name, int scale, int precision) { - final MajorType ${upperName}_TYPE = new MajorType(MinorType.${upperName}, DataMode.OPTIONAL, scale, precision, null, null); + final MajorType ${upperName}_TYPE = new MajorType(MinorType.${upperName}, DataMode.OPTIONAL, precision, scale, 0, null); <#else> private static final MajorType ${upperName}_TYPE = Types.optional(MinorType.${upperName}); @Override From 31def7d81a094dd051d2f4bbead78edaae25755a Mon Sep 17 00:00:00 2001 From: Steven Phillips Date: Tue, 8 Mar 2016 14:11:29 -0800 Subject: [PATCH 0028/1644] ARROW-51: Add simple ValueVector tests --- .../apache/arrow/vector/TestValueVector.java | 521 ++++++++++++++++++ 1 file changed, 521 insertions(+) create mode 100644 
java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java new file mode 100644 index 0000000000000..4488d750284c7 --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java @@ -0,0 +1,521 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector; + +import static org.junit.Assert.assertArrayEquals; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +import java.nio.charset.Charset; + +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.complex.MapVector; +import org.apache.arrow.vector.complex.RepeatedListVector; +import org.apache.arrow.vector.complex.RepeatedMapVector; +import org.apache.arrow.vector.types.MaterializedField; +import org.apache.arrow.vector.util.BasicTypeHelper; +import org.apache.arrow.vector.util.OversizedAllocationException; +import org.apache.arrow.vector.holders.BitHolder; +import org.apache.arrow.vector.holders.IntHolder; +import org.apache.arrow.vector.holders.NullableFloat4Holder; +import org.apache.arrow.vector.holders.NullableUInt4Holder; +import org.apache.arrow.vector.holders.NullableVar16CharHolder; +import org.apache.arrow.vector.holders.NullableVarCharHolder; +import org.apache.arrow.vector.holders.RepeatedFloat4Holder; +import org.apache.arrow.vector.holders.RepeatedIntHolder; +import org.apache.arrow.vector.holders.RepeatedVarBinaryHolder; +import org.apache.arrow.vector.holders.UInt4Holder; +import org.apache.arrow.vector.holders.VarCharHolder; +import org.apache.arrow.memory.BufferAllocator; +import org.junit.After; +import org.junit.Before; +import org.junit.Test; + + +public class TestValueVector { + //private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(TestValueVector.class); + + private final static String EMPTY_SCHEMA_PATH = ""; + + private BufferAllocator allocator; + + @Before + public void init() { + allocator = new RootAllocator(Long.MAX_VALUE); + } + + private final static Charset utf8Charset = Charset.forName("UTF-8"); + private final static byte[] STR1 = new String("AAAAA1").getBytes(utf8Charset); + private final static byte[] STR2 = new String("BBBBBBBBB2").getBytes(utf8Charset); + private final static byte[] STR3 = new String("CCCC3").getBytes(utf8Charset); + + @After + public void terminate() throws Exception { + allocator.close(); + } + + @Test(expected = OversizedAllocationException.class) + public void testFixedVectorReallocation() { + final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, UInt4Holder.TYPE); + final UInt4Vector vector = new 
UInt4Vector(field, allocator); + // edge case 1: buffer size = max value capacity + final int expectedValueCapacity = BaseValueVector.MAX_ALLOCATION_SIZE / 4; + try { + vector.allocateNew(expectedValueCapacity); + assertEquals(expectedValueCapacity, vector.getValueCapacity()); + vector.reAlloc(); + assertEquals(expectedValueCapacity * 2, vector.getValueCapacity()); + } finally { + vector.close(); + } + + // common case: value count < max value capacity + try { + vector.allocateNew(BaseValueVector.MAX_ALLOCATION_SIZE / 8); + vector.reAlloc(); // value allocation reaches to MAX_VALUE_ALLOCATION + vector.reAlloc(); // this should throw an IOOB + } finally { + vector.close(); + } + } + + @Test(expected = OversizedAllocationException.class) + public void testBitVectorReallocation() { + final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, UInt4Holder.TYPE); + final BitVector vector = new BitVector(field, allocator); + // edge case 1: buffer size ~ max value capacity + final int expectedValueCapacity = 1 << 29; + try { + vector.allocateNew(expectedValueCapacity); + assertEquals(expectedValueCapacity, vector.getValueCapacity()); + vector.reAlloc(); + assertEquals(expectedValueCapacity * 2, vector.getValueCapacity()); + } finally { + vector.close(); + } + + // common: value count < MAX_VALUE_ALLOCATION + try { + vector.allocateNew(expectedValueCapacity); + for (int i=0; i<3;i++) { + vector.reAlloc(); // expand buffer size + } + assertEquals(Integer.MAX_VALUE, vector.getValueCapacity()); + vector.reAlloc(); // buffer size ~ max allocation + assertEquals(Integer.MAX_VALUE, vector.getValueCapacity()); + vector.reAlloc(); // overflow + } finally { + vector.close(); + } + } + + + @Test(expected = OversizedAllocationException.class) + public void testVariableVectorReallocation() { + final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, UInt4Holder.TYPE); + final VarCharVector vector = new VarCharVector(field, allocator); + // edge case 1: value count = MAX_VALUE_ALLOCATION + final int expectedAllocationInBytes = BaseValueVector.MAX_ALLOCATION_SIZE; + final int expectedOffsetSize = 10; + try { + vector.allocateNew(expectedAllocationInBytes, 10); + assertTrue(expectedOffsetSize <= vector.getValueCapacity()); + assertTrue(expectedAllocationInBytes <= vector.getBuffer().capacity()); + vector.reAlloc(); + assertTrue(expectedOffsetSize * 2 <= vector.getValueCapacity()); + assertTrue(expectedAllocationInBytes * 2 <= vector.getBuffer().capacity()); + } finally { + vector.close(); + } + + // common: value count < MAX_VALUE_ALLOCATION + try { + vector.allocateNew(BaseValueVector.MAX_ALLOCATION_SIZE / 2, 0); + vector.reAlloc(); // value allocation reaches to MAX_VALUE_ALLOCATION + vector.reAlloc(); // this tests if it overflows + } finally { + vector.close(); + } + } + + @Test + public void testFixedType() { + final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, UInt4Holder.TYPE); + + // Create a new value vector for 1024 integers. 
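    // [Editor's note, not part of the original patch] The reallocation tests
    // above rely on two behaviors: reAlloc() doubles the buffer, and growth
    // is capped at BaseValueVector.MAX_ALLOCATION_SIZE, beyond which
    // OversizedAllocationException is thrown. For the 4-byte UInt4Vector that
    // is why the edge-case capacity is MAX_ALLOCATION_SIZE / 4 values:
    // allocating MAX_ALLOCATION_SIZE / 8 values fills half the byte budget,
    // one reAlloc() reaches the cap, and the next one overflows.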
+ try (final UInt4Vector vector = new UInt4Vector(field, allocator)) { + final UInt4Vector.Mutator m = vector.getMutator(); + vector.allocateNew(1024); + + // Put and set a few values + m.setSafe(0, 100); + m.setSafe(1, 101); + m.setSafe(100, 102); + m.setSafe(1022, 103); + m.setSafe(1023, 104); + + final UInt4Vector.Accessor accessor = vector.getAccessor(); + assertEquals(100, accessor.get(0)); + assertEquals(101, accessor.get(1)); + assertEquals(102, accessor.get(100)); + assertEquals(103, accessor.get(1022)); + assertEquals(104, accessor.get(1023)); + } + } + + @Test + public void testNullableVarLen2() { + final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, NullableVarCharHolder.TYPE); + + // Create a new value vector for 1024 integers. + try (final NullableVarCharVector vector = new NullableVarCharVector(field, allocator)) { + final NullableVarCharVector.Mutator m = vector.getMutator(); + vector.allocateNew(1024 * 10, 1024); + + m.set(0, STR1); + m.set(1, STR2); + m.set(2, STR3); + + // Check the sample strings. + final NullableVarCharVector.Accessor accessor = vector.getAccessor(); + assertArrayEquals(STR1, accessor.get(0)); + assertArrayEquals(STR2, accessor.get(1)); + assertArrayEquals(STR3, accessor.get(2)); + + // Ensure null value throws. + boolean b = false; + try { + vector.getAccessor().get(3); + } catch (IllegalStateException e) { + b = true; + } finally { + assertTrue(b); + } + } + } + + @Test + public void testRepeatedIntVector() { + final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, RepeatedIntHolder.TYPE); + + // Create a new value vector. + try (final RepeatedIntVector vector1 = new RepeatedIntVector(field, allocator)) { + + // Populate the vector. + final int[] values = {2, 3, 5, 7, 11, 13, 17, 19, 23, 27}; // some tricksy primes + final int nRecords = 7; + final int nElements = values.length; + vector1.allocateNew(nRecords, nRecords * nElements); + final RepeatedIntVector.Mutator mutator = vector1.getMutator(); + for (int recordIndex = 0; recordIndex < nRecords; ++recordIndex) { + mutator.startNewValue(recordIndex); + for (int elementIndex = 0; elementIndex < nElements; ++elementIndex) { + mutator.add(recordIndex, recordIndex * values[elementIndex]); + } + } + mutator.setValueCount(nRecords); + + // Verify the contents. + final RepeatedIntVector.Accessor accessor1 = vector1.getAccessor(); + assertEquals(nRecords, accessor1.getValueCount()); + for (int recordIndex = 0; recordIndex < nRecords; ++recordIndex) { + for (int elementIndex = 0; elementIndex < nElements; ++elementIndex) { + final int value = accessor1.get(recordIndex, elementIndex); + assertEquals(recordIndex * values[elementIndex], value); + } + } + } + } + + @Test + public void testNullableFixedType() { + final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, NullableUInt4Holder.TYPE); + + // Create a new value vector for 1024 integers. 
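    // [Editor's note, not part of the original patch] Per the
    // NullableValueVectors template earlier in this series, a nullable vector
    // pairs an inner UInt1 "bits" validity vector with a plain values vector;
    // get(i) on an index whose validity bit was never set is what produces
    // the IllegalStateException the assertions below expect, e.g. (sketch):
    //
    //   m.set(0, 100);       // sets validity bit 0 and stores the value
    //   accessor.get(3);     // bit 3 unset -> IllegalStateException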
+ try (final NullableUInt4Vector vector = new NullableUInt4Vector(field, allocator)) { + final NullableUInt4Vector.Mutator m = vector.getMutator(); + vector.allocateNew(1024); + + // Put and set a few values + m.set(0, 100); + m.set(1, 101); + m.set(100, 102); + m.set(1022, 103); + m.set(1023, 104); + + final NullableUInt4Vector.Accessor accessor = vector.getAccessor(); + assertEquals(100, accessor.get(0)); + assertEquals(101, accessor.get(1)); + assertEquals(102, accessor.get(100)); + assertEquals(103, accessor.get(1022)); + assertEquals(104, accessor.get(1023)); + + // Ensure null values throw + { + boolean b = false; + try { + accessor.get(3); + } catch (IllegalStateException e) { + b = true; + } finally { + assertTrue(b); + } + } + + vector.allocateNew(2048); + { + boolean b = false; + try { + accessor.get(0); + } catch (IllegalStateException e) { + b = true; + } finally { + assertTrue(b); + } + } + + m.set(0, 100); + m.set(1, 101); + m.set(100, 102); + m.set(1022, 103); + m.set(1023, 104); + assertEquals(100, accessor.get(0)); + assertEquals(101, accessor.get(1)); + assertEquals(102, accessor.get(100)); + assertEquals(103, accessor.get(1022)); + assertEquals(104, accessor.get(1023)); + + // Ensure null values throw. + { + boolean b = false; + try { + vector.getAccessor().get(3); + } catch (IllegalStateException e) { + b = true; + } finally { + assertTrue(b); + } + } + } + } + + @Test + public void testNullableFloat() { + final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, NullableFloat4Holder.TYPE); + + // Create a new value vector for 1024 integers + try (final NullableFloat4Vector vector = (NullableFloat4Vector) BasicTypeHelper.getNewVector(field, allocator)) { + final NullableFloat4Vector.Mutator m = vector.getMutator(); + vector.allocateNew(1024); + + // Put and set a few values. + m.set(0, 100.1f); + m.set(1, 101.2f); + m.set(100, 102.3f); + m.set(1022, 103.4f); + m.set(1023, 104.5f); + + final NullableFloat4Vector.Accessor accessor = vector.getAccessor(); + assertEquals(100.1f, accessor.get(0), 0); + assertEquals(101.2f, accessor.get(1), 0); + assertEquals(102.3f, accessor.get(100), 0); + assertEquals(103.4f, accessor.get(1022), 0); + assertEquals(104.5f, accessor.get(1023), 0); + + // Ensure null values throw. 
+ { + boolean b = false; + try { + vector.getAccessor().get(3); + } catch (IllegalStateException e) { + b = true; + } finally { + assertTrue(b); + } + } + + vector.allocateNew(2048); + { + boolean b = false; + try { + accessor.get(0); + } catch (IllegalStateException e) { + b = true; + } finally { + assertTrue(b); + } + } + } + } + + @Test + public void testBitVector() { + final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, BitHolder.TYPE); + + // Create a new value vector for 1024 integers + try (final BitVector vector = new BitVector(field, allocator)) { + final BitVector.Mutator m = vector.getMutator(); + vector.allocateNew(1024); + + // Put and set a few values + m.set(0, 1); + m.set(1, 0); + m.set(100, 0); + m.set(1022, 1); + + final BitVector.Accessor accessor = vector.getAccessor(); + assertEquals(1, accessor.get(0)); + assertEquals(0, accessor.get(1)); + assertEquals(0, accessor.get(100)); + assertEquals(1, accessor.get(1022)); + + // test setting the same value twice + m.set(0, 1); + m.set(0, 1); + m.set(1, 0); + m.set(1, 0); + assertEquals(1, accessor.get(0)); + assertEquals(0, accessor.get(1)); + + // test toggling the values + m.set(0, 0); + m.set(1, 1); + assertEquals(0, accessor.get(0)); + assertEquals(1, accessor.get(1)); + + // Ensure unallocated space returns 0 + assertEquals(0, accessor.get(3)); + } + } + + @Test + public void testReAllocNullableFixedWidthVector() { + final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, NullableFloat4Holder.TYPE); + + // Create a new value vector for 1024 integers + try (final NullableFloat4Vector vector = (NullableFloat4Vector) BasicTypeHelper.getNewVector(field, allocator)) { + final NullableFloat4Vector.Mutator m = vector.getMutator(); + vector.allocateNew(1024); + + assertEquals(1024, vector.getValueCapacity()); + + // Put values in indexes that fall within the initial allocation + m.setSafe(0, 100.1f); + m.setSafe(100, 102.3f); + m.setSafe(1023, 104.5f); + + // Now try to put values in space that falls beyond the initial allocation + m.setSafe(2000, 105.5f); + + // Check valueCapacity is more than initial allocation + assertEquals(1024 * 2, vector.getValueCapacity()); + + final NullableFloat4Vector.Accessor accessor = vector.getAccessor(); + assertEquals(100.1f, accessor.get(0), 0); + assertEquals(102.3f, accessor.get(100), 0); + assertEquals(104.5f, accessor.get(1023), 0); + assertEquals(105.5f, accessor.get(2000), 0); + + // Set the valueCount to be more than valueCapacity of current allocation. 
This is possible for NullableValueVectors + // as we don't call setSafe for null values, but we do call setValueCount when all values are inserted into the + // vector + m.setValueCount(vector.getValueCapacity() + 200); + } + } + + @Test + public void testReAllocNullableVariableWidthVector() { + final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, NullableVarCharHolder.TYPE); + + // Create a new value vector for 1024 integers + try (final NullableVarCharVector vector = (NullableVarCharVector) BasicTypeHelper.getNewVector(field, allocator)) { + final NullableVarCharVector.Mutator m = vector.getMutator(); + vector.allocateNew(); + + int initialCapacity = vector.getValueCapacity(); + + // Put values in indexes that fall within the initial allocation + m.setSafe(0, STR1, 0, STR1.length); + m.setSafe(initialCapacity - 1, STR2, 0, STR2.length); + + // Now try to put values in space that falls beyond the initial allocation + m.setSafe(initialCapacity + 200, STR3, 0, STR3.length); + + // Check valueCapacity is more than initial allocation + assertEquals((initialCapacity + 1) * 2 - 1, vector.getValueCapacity()); + + final NullableVarCharVector.Accessor accessor = vector.getAccessor(); + assertArrayEquals(STR1, accessor.get(0)); + assertArrayEquals(STR2, accessor.get(initialCapacity - 1)); + assertArrayEquals(STR3, accessor.get(initialCapacity + 200)); + + // Set the valueCount to be more than valueCapacity of current allocation. This is possible for NullableValueVectors + // as we don't call setSafe for null values, but we do call setValueCount when the current batch is processed. + m.setValueCount(vector.getValueCapacity() + 200); + } + } + + @Test + public void testVVInitialCapacity() throws Exception { + final MaterializedField[] fields = new MaterializedField[9]; + final ValueVector[] valueVectors = new ValueVector[9]; + + fields[0] = MaterializedField.create(EMPTY_SCHEMA_PATH, BitHolder.TYPE); + fields[1] = MaterializedField.create(EMPTY_SCHEMA_PATH, IntHolder.TYPE); + fields[2] = MaterializedField.create(EMPTY_SCHEMA_PATH, VarCharHolder.TYPE); + fields[3] = MaterializedField.create(EMPTY_SCHEMA_PATH, NullableVar16CharHolder.TYPE); + fields[4] = MaterializedField.create(EMPTY_SCHEMA_PATH, RepeatedFloat4Holder.TYPE); + fields[5] = MaterializedField.create(EMPTY_SCHEMA_PATH, RepeatedVarBinaryHolder.TYPE); + + fields[6] = MaterializedField.create(EMPTY_SCHEMA_PATH, MapVector.TYPE); + fields[6].addChild(fields[0] /*bit*/); + fields[6].addChild(fields[2] /*varchar*/); + + fields[7] = MaterializedField.create(EMPTY_SCHEMA_PATH, RepeatedMapVector.TYPE); + fields[7].addChild(fields[1] /*int*/); + fields[7].addChild(fields[3] /*optional var16char*/); + + fields[8] = MaterializedField.create(EMPTY_SCHEMA_PATH, RepeatedListVector.TYPE); + fields[8].addChild(fields[1] /*int*/); + + final int initialCapacity = 1024; + + try { + for (int i = 0; i < valueVectors.length; i++) { + valueVectors[i] = BasicTypeHelper.getNewVector(fields[i], allocator); + valueVectors[i].setInitialCapacity(initialCapacity); + valueVectors[i].allocateNew(); + } + + for (int i = 0; i < valueVectors.length; i++) { + final ValueVector vv = valueVectors[i]; + final int vvCapacity = vv.getValueCapacity(); + + // this can't be equality because Nullables will be allocated using power of two sized buffers (thus need 1025 + // spots in one vector > power of two is 2048, available capacity will be 2048 => 2047) + assertTrue(String.format("Incorrect value capacity for %s [%d]", vv.getField(), vvCapacity), + 
initialCapacity <= vvCapacity); + } + } finally { + for (ValueVector v : valueVectors) { + v.close(); + } + } + } + +} From e822ea758dc18ade9d3386acfd1d38e7b05ba3dd Mon Sep 17 00:00:00 2001 From: Minji Kim Date: Mon, 7 Mar 2016 15:23:33 -0800 Subject: [PATCH 0029/1644] ARROW-46: ListVector should initialize bits in allocateNew --- .../arrow/vector/complex/ListVector.java | 1 + .../apache/arrow/vector/TestValueVector.java | 20 +++++++++++++++++++ 2 files changed, 21 insertions(+) diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java index 13610c4f03f61..3e60c76802380 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java @@ -72,6 +72,7 @@ public UnionListWriter getWriter() { @Override public void allocateNew() throws OutOfMemoryException { super.allocateNewSafe(); + bits.allocateNewSafe(); } public void transferTo(ListVector target) { diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java index 4488d750284c7..ac3eebe98eab7 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java @@ -24,10 +24,13 @@ import java.nio.charset.Charset; import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.complex.ListVector; import org.apache.arrow.vector.complex.MapVector; import org.apache.arrow.vector.complex.RepeatedListVector; import org.apache.arrow.vector.complex.RepeatedMapVector; import org.apache.arrow.vector.types.MaterializedField; +import org.apache.arrow.vector.types.Types; +import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.util.BasicTypeHelper; import org.apache.arrow.vector.util.OversizedAllocationException; import org.apache.arrow.vector.holders.BitHolder; @@ -518,4 +521,21 @@ public void testVVInitialCapacity() throws Exception { } } + @Test + public void testListVectorShouldNotThrowOversizedAllocationException() throws Exception { + final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, + Types.optional(MinorType.LIST)); + ListVector vector = new ListVector(field, allocator, null); + ListVector vectorFrom = new ListVector(field, allocator, null); + vectorFrom.allocateNew(); + + for (int i = 0; i < 10000; i++) { + vector.allocateNew(); + vector.copyFromSafe(0, 0, vectorFrom); + vector.clear(); + } + + vectorFrom.clear(); + vector.clear(); + } } From 83675273bd2057552ae64b7d8632a54093a02ed9 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 8 Mar 2016 20:28:58 -0800 Subject: [PATCH 0030/1644] ARROW-42: Add Python tests to Travis CI build Author: Wes McKinney Closes #22 from wesm/ARROW-42 and squashes the following commits: 3b056a1 [Wes McKinney] Modularize Travis CI build and add Python build script. Remove parquet.pyx from Cython build for now, suppress -Wunused-variable in Cython compilation. 
Add missing formatting.py file --- .travis.yml | 23 ++++++++++ ci/travis_before_script_cpp.sh | 26 ++++++++++++ ci/travis_script_cpp.sh | 22 +--------- ci/travis_script_python.sh | 59 ++++++++++++++++++++++++++ cpp/src/arrow/table/column-test.cc | 2 + cpp/src/arrow/table/schema-test.cc | 2 + cpp/src/arrow/table/table-test.cc | 4 ++ cpp/src/arrow/type.cc | 14 ------ cpp/src/arrow/type.h | 14 ------ python/CMakeLists.txt | 2 - python/arrow/formatting.py | 56 ++++++++++++++++++++++++ python/cmake_modules/UseCython.cmake | 5 +++ python/requirements.txt | 4 ++ python/setup.py | 2 +- python/src/pyarrow/adapters/builtin.cc | 20 ++++++--- python/src/pyarrow/adapters/builtin.h | 2 + python/src/pyarrow/helpers.cc | 14 ++++++ python/src/pyarrow/helpers.h | 14 ++++++ python/src/pyarrow/util/CMakeLists.txt | 18 +------- 19 files changed, 228 insertions(+), 75 deletions(-) create mode 100755 ci/travis_before_script_cpp.sh create mode 100755 ci/travis_script_python.sh create mode 100644 python/arrow/formatting.py create mode 100644 python/requirements.txt diff --git a/.travis.yml b/.travis.yml index cb2d5cb1bad19..9e858d7d98e48 100644 --- a/.travis.yml +++ b/.travis.yml @@ -8,7 +8,9 @@ addons: packages: - gcc-4.9 # Needed for C++11 - g++-4.9 # Needed for C++11 + - gdb - gcov + - ccache - cmake - valgrind @@ -17,11 +19,32 @@ matrix: - compiler: gcc language: cpp os: linux + before_script: + - $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh script: + - export CC="gcc-4.9" + - export CXX="g++-4.9" - $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh + - $TRAVIS_BUILD_DIR/ci/travis_script_python.sh - compiler: clang language: cpp os: osx addons: + before_script: + - $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh script: - $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh + - $TRAVIS_BUILD_DIR/ci/travis_script_python.sh + +before_install: +- ulimit -c unlimited -S +- export CPP_BUILD_DIR=$TRAVIS_BUILD_DIR/cpp-build +- export ARROW_CPP_INSTALL=$TRAVIS_BUILD_DIR/cpp-install +- export LD_LIBRARY_PATH=$ARROW_CPP_INSTALL/lib:$LD_LIBRARY_PATH + +after_script: +- rm -rf $CPP_BUILD_DIR + +after_failure: +- COREFILE=$(find . -maxdepth 2 -name "core*" | head -n 1) +- if [[ -f "$COREFILE" ]]; then gdb -c "$COREFILE" example -ex "thread apply all bt" -ex "set pagination 0" -batch; fi diff --git a/ci/travis_before_script_cpp.sh b/ci/travis_before_script_cpp.sh new file mode 100755 index 0000000000000..4d5bef8bbdf70 --- /dev/null +++ b/ci/travis_before_script_cpp.sh @@ -0,0 +1,26 @@ +#!/usr/bin/env bash + +set -e + +: ${CPP_BUILD_DIR=$TRAVIS_BUILD_DIR/cpp-build} + +mkdir $CPP_BUILD_DIR +pushd $CPP_BUILD_DIR + +CPP_DIR=$TRAVIS_BUILD_DIR/cpp + +# Build an isolated thirdparty +cp -r $CPP_DIR/thirdparty . +cp $CPP_DIR/setup_build_env.sh . + +source setup_build_env.sh + +echo $GTEST_HOME + +: ${ARROW_CPP_INSTALL=$TRAVIS_BUILD_DIR/cpp-install} + +cmake -DCMAKE_INSTALL_PREFIX=$ARROW_CPP_INSTALL -DCMAKE_CXX_FLAGS="-Werror" $CPP_DIR +make -j4 +make install + +popd diff --git a/ci/travis_script_cpp.sh b/ci/travis_script_cpp.sh index 28f16cc021fe3..3e843dd759ea1 100755 --- a/ci/travis_script_cpp.sh +++ b/ci/travis_script_cpp.sh @@ -2,28 +2,11 @@ set -e -mkdir $TRAVIS_BUILD_DIR/cpp-build -pushd $TRAVIS_BUILD_DIR/cpp-build +: ${CPP_BUILD_DIR=$TRAVIS_BUILD_DIR/cpp-build} -CPP_DIR=$TRAVIS_BUILD_DIR/cpp +pushd $CPP_BUILD_DIR -# Build an isolated thirdparty -cp -r $CPP_DIR/thirdparty . -cp $CPP_DIR/setup_build_env.sh . 
-
-if [ $TRAVIS_OS_NAME == "linux" ]; then
-  # Use a C++11 compiler on Linux
-  export CC="gcc-4.9"
-  export CXX="g++-4.9"
-fi
-
-source setup_build_env.sh
-
-echo $GTEST_HOME
-
-cmake -DCMAKE_CXX_FLAGS="-Werror" $CPP_DIR
 make lint
-make -j4
 
 if [ $TRAVIS_OS_NAME == "linux" ]; then
   valgrind --tool=memcheck --leak-check=yes --error-exitcode=1 ctest
@@ -32,4 +15,3 @@ else
 fi
 
 popd
-rm -rf cpp-build
diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh
new file mode 100755
index 0000000000000..9b0bd4f54cbc9
--- /dev/null
+++ b/ci/travis_script_python.sh
@@ -0,0 +1,59 @@
+#!/usr/bin/env bash
+
+set -e
+
+PYTHON_DIR=$TRAVIS_BUILD_DIR/python
+
+# Share environment with C++
+pushd $CPP_BUILD_DIR
+source setup_build_env.sh
+popd
+
+pushd $PYTHON_DIR
+
+# Bootstrap a Conda Python environment
+
+if [ $TRAVIS_OS_NAME == "linux" ]; then
+  MINICONDA_URL="https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh"
+else
+  MINICONDA_URL="https://repo.continuum.io/miniconda/Miniconda-latest-MacOSX-x86_64.sh"
+fi
+
+curl $MINICONDA_URL > miniconda.sh
+MINICONDA=$TRAVIS_BUILD_DIR/miniconda
+bash miniconda.sh -b -p $MINICONDA
+export PATH="$MINICONDA/bin:$PATH"
+conda update -y -q conda
+conda info -a
+
+PYTHON_VERSION=3.5
+CONDA_ENV_NAME=pyarrow-test
+
+conda create -y -q -n $CONDA_ENV_NAME python=$PYTHON_VERSION
+source activate $CONDA_ENV_NAME
+
+python --version
+which python
+
+# faster builds, please
+conda install -y nomkl
+
+# Expensive dependencies install from Continuum package repo
+conda install -y pip numpy pandas cython
+
+# Other stuff pip install
+pip install -r requirements.txt
+
+export ARROW_HOME=$ARROW_CPP_INSTALL
+
+python setup.py build_ext --inplace
+
+py.test -vv -r sxX arrow
+
+# if [ $TRAVIS_OS_NAME == "linux" ]; then
+#     valgrind --tool=memcheck py.test -vv -r sxX arrow
+# else
+#     py.test -vv -r sxX arrow
+# fi
+
+popd
diff --git a/cpp/src/arrow/table/column-test.cc b/cpp/src/arrow/table/column-test.cc
index bf95932916cf4..3b102e48c87cf 100644
--- a/cpp/src/arrow/table/column-test.cc
+++ b/cpp/src/arrow/table/column-test.cc
@@ -33,6 +33,8 @@ using std::vector;
 namespace arrow {
 
+const auto INT32 = std::make_shared<Int32Type>();
+
 class TestColumn : public TestBase {
  protected:
   std::shared_ptr<ChunkedArray> data_;
diff --git a/cpp/src/arrow/table/schema-test.cc b/cpp/src/arrow/table/schema-test.cc
index d6725cc08c0c8..9dfade2695311 100644
--- a/cpp/src/arrow/table/schema-test.cc
+++ b/cpp/src/arrow/table/schema-test.cc
@@ -29,6 +29,8 @@ using std::vector;
 namespace arrow {
 
+const auto INT32 = std::make_shared<Int32Type>();
+
 TEST(TestField, Basics) {
   shared_ptr<DataType> ftype = INT32;
   shared_ptr<DataType> ftype_nn = std::make_shared<Int32Type>(false);
diff --git a/cpp/src/arrow/table/table-test.cc b/cpp/src/arrow/table/table-test.cc
index c4fdb062db83a..8b354e8503c71 100644
--- a/cpp/src/arrow/table/table-test.cc
+++ b/cpp/src/arrow/table/table-test.cc
@@ -34,6 +34,10 @@ using std::vector;
 namespace arrow {
 
+const auto INT16 = std::make_shared<Int16Type>();
+const auto UINT8 = std::make_shared<UInt8Type>();
+const auto INT32 = std::make_shared<Int32Type>();
+
 class TestTable : public TestBase {
  public:
  void MakeExample1(int length) {
diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc
index 265770822ce90..0a2e817ad30c6 100644
--- a/cpp/src/arrow/type.cc
+++ b/cpp/src/arrow/type.cc
@@ -66,18 +66,4 @@ std::string StructType::ToString() const {
   return s.str();
 }
 
-const std::shared_ptr<DataType> NA = std::make_shared<NullType>();
-const std::shared_ptr<DataType> BOOL = std::make_shared<BooleanType>();
-const std::shared_ptr<DataType> UINT8 = std::make_shared<UInt8Type>();
-const std::shared_ptr<DataType> UINT16 = std::make_shared<UInt16Type>();
-const std::shared_ptr<DataType> UINT32 = std::make_shared<UInt32Type>();
-const std::shared_ptr<DataType> UINT64 = std::make_shared<UInt64Type>();
-const std::shared_ptr<DataType> INT8 = std::make_shared<Int8Type>();
-const std::shared_ptr<DataType> INT16 = std::make_shared<Int16Type>();
-const std::shared_ptr<DataType> INT32 = std::make_shared<Int32Type>();
-const std::shared_ptr<DataType> INT64 = std::make_shared<Int64Type>();
-const std::shared_ptr<DataType> FLOAT = std::make_shared<FloatType>();
-const std::shared_ptr<DataType> DOUBLE = std::make_shared<DoubleType>();
-const std::shared_ptr<DataType> STRING = std::make_shared<StringType>();
-
 } // namespace arrow
diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h
index e78e49491193e..00b01ea86e8a5 100644
--- a/cpp/src/arrow/type.h
+++ b/cpp/src/arrow/type.h
@@ -338,20 +338,6 @@ struct StructType : public DataType {
   std::string ToString() const override;
 };
 
-extern const std::shared_ptr<DataType> NA;
-extern const std::shared_ptr<DataType> BOOL;
-extern const std::shared_ptr<DataType> UINT8;
-extern const std::shared_ptr<DataType> UINT16;
-extern const std::shared_ptr<DataType> UINT32;
-extern const std::shared_ptr<DataType> UINT64;
-extern const std::shared_ptr<DataType> INT8;
-extern const std::shared_ptr<DataType> INT16;
-extern const std::shared_ptr<DataType> INT32;
-extern const std::shared_ptr<DataType> INT64;
-extern const std::shared_ptr<DataType> FLOAT;
-extern const std::shared_ptr<DataType> DOUBLE;
-extern const std::shared_ptr<DataType> STRING;
-
 } // namespace arrow
 
 #endif // ARROW_TYPE_H
diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt
index 8fdd829010eef..8f5c27b0f76d7 100644
--- a/python/CMakeLists.txt
+++ b/python/CMakeLists.txt
@@ -404,7 +404,6 @@ set(PYARROW_SRCS
 )
 
 set(LINK_LIBS
-  pyarrow_util
   arrow
 )
@@ -428,7 +427,6 @@ set(CYTHON_EXTENSIONS
   array
   config
   error
-  parquet
   scalar
   schema
 )
diff --git a/python/arrow/formatting.py b/python/arrow/formatting.py
new file mode 100644
index 0000000000000..a42d4e4bb5713
--- /dev/null
+++ b/python/arrow/formatting.py
@@ -0,0 +1,56 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
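A quick sketch of the windowed truncation this module implements just below; hypothetical usage, assuming the `arrow` package from this patch series has been built (`from_pylist` and the expected output shape are taken from the test suite later in this patch set):

    import arrow
    from arrow.formatting import array_format

    arr = arrow.from_pylist(list(range(100)))
    # With window=2, only the two leading and two trailing values are
    # rendered around an ellipsis:
    print(array_format(arr, window=2))
    # [
    #   0,
    #   1,
    #   ...
    #   98,
    #   99
    # ]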
+
+# Pretty-printing and other formatting utilities for Arrow data structures
+
+import arrow.scalar as scalar
+
+
+def array_format(arr, window=None):
+    values = []
+
+    if window is None or window * 2 >= len(arr):
+        for x in arr:
+            values.append(value_format(x, 0))
+        contents = _indent(',\n'.join(values), 2)
+    else:
+        for i in range(window):
+            values.append(value_format(arr[i], 0) + ',')
+        values.append('...')
+        for i in range(len(arr) - window, len(arr)):
+            formatted = value_format(arr[i], 0)
+            if i < len(arr) - 1:
+                formatted += ','
+            values.append(formatted)
+        contents = _indent('\n'.join(values), 2)
+
+    return '[\n{0}\n]'.format(contents)
+
+
+def value_format(x, indent_level=0):
+    if isinstance(x, scalar.ListValue):
+        contents = ',\n'.join(value_format(item) for item in x)
+        return '[{0}]'.format(_indent(contents, 1).strip())
+    else:
+        return repr(x)
+
+
+def _indent(text, spaces):
+    if spaces == 0:
+        return text
+    block = ' ' * spaces
+    return '\n'.join(block + x for x in text.split('\n'))
diff --git a/python/cmake_modules/UseCython.cmake b/python/cmake_modules/UseCython.cmake
index e7034db52f335..3b1c201edff5f 100644
--- a/python/cmake_modules/UseCython.cmake
+++ b/python/cmake_modules/UseCython.cmake
@@ -121,6 +121,11 @@ function( compile_pyx _name pyx_target_name generated_files pyx_file)
     set( _generated_files "${_name}.${extension}")
   endif()
   set_source_files_properties( ${_generated_files} PROPERTIES GENERATED TRUE )
+
+  # Cython creates a lot of compiler warning detritus on clang
+  set_source_files_properties(${_generated_files} PROPERTIES
+    COMPILE_FLAGS -Wno-unused-function)
+
   set( ${generated_files} ${_generated_files} PARENT_SCOPE )
 
   # Add the command to run the compiler.
diff --git a/python/requirements.txt b/python/requirements.txt
new file mode 100644
index 0000000000000..a82cb20aab86e
--- /dev/null
+++ b/python/requirements.txt
@@ -0,0 +1,4 @@
+pytest
+numpy>=1.7.0
+pandas>=0.12.0
+six
diff --git a/python/setup.py b/python/setup.py
index 9a0de071a9c40..eb3ff2a1547d6 100644
--- a/python/setup.py
+++ b/python/setup.py
@@ -210,7 +210,7 @@ def get_ext_built(self, name):
         return name + suffix
 
     def get_cmake_cython_names(self):
-        return ['array', 'config', 'error', 'parquet', 'scalar', 'schema']
+        return ['array', 'config', 'error', 'scalar', 'schema']
 
     def get_names(self):
         return self._found_names
diff --git a/python/src/pyarrow/adapters/builtin.cc b/python/src/pyarrow/adapters/builtin.cc
index 60d6248842ec9..bb7905236c59c 100644
--- a/python/src/pyarrow/adapters/builtin.cc
+++ b/python/src/pyarrow/adapters/builtin.cc
@@ -22,6 +22,7 @@
 
 #include
 
+#include "pyarrow/helpers.h"
 #include "pyarrow/status.h"
 
 using arrow::ArrayBuilder;
@@ -74,16 +75,16 @@
   std::shared_ptr<DataType> GetType() {
     // TODO(wesm): handling mixed-type cases
     if (float_count_) {
-      return arrow::DOUBLE;
+      return DOUBLE;
     } else if (int_count_) {
       // TODO(wesm): tighter type later
-      return arrow::INT64;
+      return INT64;
     } else if (bool_count_) {
-      return arrow::BOOL;
+      return BOOL;
     } else if (string_count_) {
-      return arrow::STRING;
+      return STRING;
     } else {
-      return arrow::NA;
+      return NA;
     }
   }
@@ -145,7 +146,7 @@ class SeqVisitor {
   std::shared_ptr<DataType> GetType() {
     if (scalars_.total_count() == 0) {
       if (max_nesting_level_ == 0) {
-        return arrow::NA;
+        return NA;
       } else {
         return nullptr;
       }
@@ -209,7 +210,7 @@ static Status InferArrowType(PyObject* obj, int64_t* size,
 
   // For 0-length sequences, refuse to guess
   if (*size == 0) {
-    *out_type = arrow::NA;
+    *out_type = NA;
   }
 
   SeqVisitor seq_visitor;
@@ -217,6 +218,11 @@ static Status InferArrowType(PyObject* obj, int64_t* size,
 
   PY_RETURN_NOT_OK(seq_visitor.Validate());
   *out_type = seq_visitor.GetType();
+
+  if (*out_type == nullptr) {
+    return Status::TypeError("Unable to determine data type");
+  }
+
   return Status::OK();
 }
diff --git a/python/src/pyarrow/adapters/builtin.h b/python/src/pyarrow/adapters/builtin.h
index 24886f4970d50..88869c2048003 100644
--- a/python/src/pyarrow/adapters/builtin.h
+++ b/python/src/pyarrow/adapters/builtin.h
@@ -25,6 +25,8 @@
 
 #include
 
+#include
+
 #include "pyarrow/common.h"
 
 namespace arrow { class Array; }
diff --git a/python/src/pyarrow/helpers.cc b/python/src/pyarrow/helpers.cc
index d0969dacc21e0..0921fc4994599 100644
--- a/python/src/pyarrow/helpers.cc
+++ b/python/src/pyarrow/helpers.cc
@@ -23,6 +23,20 @@ using namespace arrow;
 
 namespace pyarrow {
 
+const std::shared_ptr<DataType> NA = std::make_shared<NullType>();
+const std::shared_ptr<DataType> BOOL = std::make_shared<BooleanType>();
+const std::shared_ptr<DataType> UINT8 = std::make_shared<UInt8Type>();
+const std::shared_ptr<DataType> UINT16 = std::make_shared<UInt16Type>();
+const std::shared_ptr<DataType> UINT32 = std::make_shared<UInt32Type>();
+const std::shared_ptr<DataType> UINT64 = std::make_shared<UInt64Type>();
+const std::shared_ptr<DataType> INT8 = std::make_shared<Int8Type>();
+const std::shared_ptr<DataType> INT16 = std::make_shared<Int16Type>();
+const std::shared_ptr<DataType> INT32 = std::make_shared<Int32Type>();
+const std::shared_ptr<DataType> INT64 = std::make_shared<Int64Type>();
+const std::shared_ptr<DataType> FLOAT = std::make_shared<FloatType>();
+const std::shared_ptr<DataType> DOUBLE = std::make_shared<DoubleType>();
+const std::shared_ptr<DataType> STRING = std::make_shared<StringType>();
+
 #define GET_PRIMITIVE_TYPE(NAME, Type) \
   case LogicalType::NAME:              \
     if (nullable) {                    \
diff --git a/python/src/pyarrow/helpers.h b/python/src/pyarrow/helpers.h
index 1a24f056febe6..e41568d5881d4 100644
--- a/python/src/pyarrow/helpers.h
+++ b/python/src/pyarrow/helpers.h
@@ -26,6 +26,20 @@ namespace pyarrow {
 using arrow::DataType;
 using arrow::LogicalType;
 
+extern const std::shared_ptr<DataType> NA;
+extern const std::shared_ptr<DataType> BOOL;
+extern const std::shared_ptr<DataType> UINT8;
+extern const std::shared_ptr<DataType> UINT16;
+extern const std::shared_ptr<DataType> UINT32;
+extern const std::shared_ptr<DataType> UINT64;
+extern const std::shared_ptr<DataType> INT8;
+extern const std::shared_ptr<DataType> INT16;
+extern const std::shared_ptr<DataType> INT32;
+extern const std::shared_ptr<DataType> INT64;
+extern const std::shared_ptr<DataType> FLOAT;
+extern const std::shared_ptr<DataType> DOUBLE;
+extern const std::shared_ptr<DataType> STRING;
+
 std::shared_ptr<DataType> GetPrimitiveType(LogicalType::type type,
                                            bool nullable);
diff --git a/python/src/pyarrow/util/CMakeLists.txt b/python/src/pyarrow/util/CMakeLists.txt
index 60dc80eb38cb6..3fd8bac31506d 100644
--- a/python/src/pyarrow/util/CMakeLists.txt
+++ b/python/src/pyarrow/util/CMakeLists.txt
@@ -15,22 +15,6 @@
 # specific language governing permissions and limitations
 # under the License.
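The new nullptr guard in InferArrowType surfaces in Python as an ArrowException rather than a crash. A minimal sketch of the observable behavior, assuming the bindings build as configured in this patch (the package is still named `arrow` at this point in the series; the specific inputs echo the test suite):

    import arrow

    # Scalar inference picks the widest numeric type seen (see ScalarVisitor above):
    assert arrow.from_pylist([1.5, 1, None]).type == arrow.double()

    # Sequences with inconsistent nesting cannot be typed and now fail cleanly:
    try:
        arrow.from_pylist([1, 2, [1]])
    except arrow.ArrowException:
        print('type inference failed as expected')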
-####################################### -# pyarrow_util -####################################### - -set(UTIL_SRCS -) - -set(UTIL_LIBS -) - -add_library(pyarrow_util STATIC - ${UTIL_SRCS} -) -target_link_libraries(pyarrow_util ${UTIL_LIBS}) -SET_TARGET_PROPERTIES(pyarrow_util PROPERTIES LINKER_LANGUAGE CXX) - ####################################### # pyarrow_test_main ####################################### @@ -40,7 +24,7 @@ add_library(pyarrow_test_main if (APPLE) target_link_libraries(pyarrow_test_main - gmock + gtest dl) set_target_properties(pyarrow_test_main PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") From 6fdcd4943ff9a8cc66afbee380217cec40c0cda0 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 9 Mar 2016 15:45:05 -0800 Subject: [PATCH 0031/1644] ARROW-54: [Python] Rename package to "pyarrow" Also fixed rpath issues (at great cost) per ARROW-53 Author: Wes McKinney Closes #23 from wesm/ARROW-54 and squashes the following commits: b8ce0e8 [Wes McKinney] Update installation instructions cae9b39 [Wes McKinney] Fix rpath issues per ARROW-53 7554539 [Wes McKinney] Twiddle rpath stuff, remove empty arrow_test_util module 8cca41a [Wes McKinney] Fix Travis CI script for renamed package 1d37c93 [Wes McKinney] Opt in to building unit tests 60088d0 [Wes McKinney] Rename package to pyarrow e3d0caf [Wes McKinney] Note on other Python interpreters 80d3bac [Wes McKinney] Start installation document --- .travis.yml | 4 +- ci/travis_script_python.sh | 2 +- cpp/CMakeLists.txt | 29 ++++--- cpp/src/arrow/CMakeLists.txt | 2 +- cpp/src/arrow/util/CMakeLists.txt | 44 ++++------ python/CMakeLists.txt | 31 ++++--- python/arrow/__init__.py | 38 -------- python/doc/INSTALL.md | 87 +++++++++++++++++++ python/pyarrow/__init__.py | 38 ++++++++ python/{arrow => pyarrow}/array.pxd | 8 +- python/{arrow => pyarrow}/array.pyx | 14 +-- python/{arrow => pyarrow}/compat.py | 0 python/{arrow => pyarrow}/config.pyx | 0 python/{arrow => pyarrow}/error.pxd | 2 +- python/{arrow => pyarrow}/error.pyx | 5 +- python/{arrow => pyarrow}/formatting.py | 2 +- .../{arrow => pyarrow}/includes/__init__.pxd | 0 python/{arrow => pyarrow}/includes/common.pxd | 0 .../includes/libarrow.pxd} | 2 +- .../{arrow => pyarrow}/includes/parquet.pxd | 2 +- .../{arrow => pyarrow}/includes/pyarrow.pxd | 6 +- python/{arrow => pyarrow}/parquet.pyx | 4 +- python/{arrow => pyarrow}/scalar.pxd | 6 +- python/{arrow => pyarrow}/scalar.pyx | 6 +- python/{arrow => pyarrow}/schema.pxd | 4 +- python/{arrow => pyarrow}/schema.pyx | 6 +- python/{arrow => pyarrow}/tests/__init__.py | 0 python/{arrow => pyarrow}/tests/test_array.py | 16 ++-- .../tests/test_convert_builtin.py | 52 +++++------ .../{arrow => pyarrow}/tests/test_scalars.py | 4 +- .../{arrow => pyarrow}/tests/test_schema.py | 4 +- python/requirements.txt | 1 - python/setup.py | 52 ++++++----- python/src/pyarrow/util/CMakeLists.txt | 30 ++++--- 34 files changed, 300 insertions(+), 201 deletions(-) delete mode 100644 python/arrow/__init__.py create mode 100644 python/doc/INSTALL.md create mode 100644 python/pyarrow/__init__.py rename python/{arrow => pyarrow}/array.pxd (90%) rename python/{arrow => pyarrow}/array.pyx (93%) rename python/{arrow => pyarrow}/compat.py (100%) rename python/{arrow => pyarrow}/config.pyx (100%) rename python/{arrow => pyarrow}/error.pxd (95%) rename python/{arrow => pyarrow}/error.pyx (92%) rename python/{arrow => pyarrow}/formatting.py (98%) rename python/{arrow => pyarrow}/includes/__init__.pxd (100%) rename python/{arrow => pyarrow}/includes/common.pxd (100%) 
rename python/{arrow/includes/arrow.pxd => pyarrow/includes/libarrow.pxd} (99%) rename python/{arrow => pyarrow}/includes/parquet.pxd (97%) rename python/{arrow => pyarrow}/includes/pyarrow.pxd (90%) rename python/{arrow => pyarrow}/parquet.pyx (91%) rename python/{arrow => pyarrow}/scalar.pxd (93%) rename python/{arrow => pyarrow}/scalar.pyx (97%) rename python/{arrow => pyarrow}/schema.pxd (91%) rename python/{arrow => pyarrow}/schema.pyx (97%) rename python/{arrow => pyarrow}/tests/__init__.py (100%) rename python/{arrow => pyarrow}/tests/test_array.py (80%) rename python/{arrow => pyarrow}/tests/test_convert_builtin.py (58%) rename python/{arrow => pyarrow}/tests/test_scalars.py (97%) rename python/{arrow => pyarrow}/tests/test_schema.py (96%) diff --git a/.travis.yml b/.travis.yml index 9e858d7d98e48..49a956ead3dca 100644 --- a/.travis.yml +++ b/.travis.yml @@ -27,7 +27,8 @@ matrix: - $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh - $TRAVIS_BUILD_DIR/ci/travis_script_python.sh - compiler: clang - language: cpp + language: objective-c + osx_image: xcode6.4 os: osx addons: before_script: @@ -40,7 +41,6 @@ before_install: - ulimit -c unlimited -S - export CPP_BUILD_DIR=$TRAVIS_BUILD_DIR/cpp-build - export ARROW_CPP_INSTALL=$TRAVIS_BUILD_DIR/cpp-install -- export LD_LIBRARY_PATH=$ARROW_CPP_INSTALL/lib:$LD_LIBRARY_PATH after_script: - rm -rf $CPP_BUILD_DIR diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index 9b0bd4f54cbc9..14d66b44ff8ee 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -48,7 +48,7 @@ export ARROW_HOME=$ARROW_CPP_INSTALL python setup.py build_ext --inplace -py.test -vv -r sxX arrow +py.test -vv -r sxX pyarrow # if [ $TRAVIS_OS_NAME == "linux" ]; then # valgrind --tool=memcheck py.test -vv -r sxX arrow diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index e8cb88c0b4d9b..f5f6038031127 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -37,11 +37,6 @@ if ("$ENV{CMAKE_EXPORT_COMPILE_COMMANDS}" STREQUAL "1") set(CMAKE_EXPORT_COMPILE_COMMANDS 1) endif() -if(APPLE) - # In newer versions of CMake, this is the default setting - set(CMAKE_MACOSX_RPATH 1) -endif() - find_program(CCACHE_FOUND ccache) if(CCACHE_FOUND) set_property(GLOBAL PROPERTY RULE_LAUNCH_COMPILE ccache) @@ -339,10 +334,13 @@ endfunction() if ("$ENV{GTEST_HOME}" STREQUAL "") set(GTest_HOME ${THIRDPARTY_DIR}/googletest-release-1.7.0) endif() -find_package(GTest REQUIRED) -include_directories(SYSTEM ${GTEST_INCLUDE_DIR}) -ADD_THIRDPARTY_LIB(gtest - STATIC_LIB ${GTEST_STATIC_LIB}) + +if(ARROW_BUILD_TESTS) + find_package(GTest REQUIRED) + include_directories(SYSTEM ${GTEST_INCLUDE_DIR}) + ADD_THIRDPARTY_LIB(gtest + STATIC_LIB ${GTEST_STATIC_LIB}) +endif() ## Google PerfTools ## @@ -366,7 +364,7 @@ ADD_THIRDPARTY_LIB(gtest ############################################################ # Linker setup ############################################################ -set(ARROW_MIN_TEST_LIBS arrow arrow_test_main arrow_test_util ${ARROW_BASE_LIBS}) +set(ARROW_MIN_TEST_LIBS arrow arrow_test_main ${ARROW_BASE_LIBS}) set(ARROW_TEST_LINK_LIBS ${ARROW_MIN_TEST_LIBS}) ############################################################ @@ -461,9 +459,18 @@ add_library(arrow ${LIBARROW_LINKAGE} ${ARROW_SRCS} ) + +if (APPLE) + set_target_properties(arrow + PROPERTIES + BUILD_WITH_INSTALL_RPATH ON + INSTALL_NAME_DIR "@rpath") +endif() + set_target_properties(arrow PROPERTIES - LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}") + LIBRARY_OUTPUT_DIRECTORY 
"${BUILD_OUTPUT_ROOT_DIRECTORY}" +) target_link_libraries(arrow ${LIBARROW_LINK_LIBS}) add_subdirectory(src/arrow) diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index 77326ce38d754..73e6a9b22c94a 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -27,6 +27,6 @@ install(FILES # Unit tests ####################################### -set(ARROW_TEST_LINK_LIBS arrow_test_util ${ARROW_MIN_TEST_LIBS}) +set(ARROW_TEST_LINK_LIBS ${ARROW_MIN_TEST_LIBS}) ADD_ARROW_TEST(array-test) diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt index 4272ce4285482..d8e2f98f2c85e 100644 --- a/cpp/src/arrow/util/CMakeLists.txt +++ b/cpp/src/arrow/util/CMakeLists.txt @@ -28,37 +28,27 @@ install(FILES status.h DESTINATION include/arrow/util) -####################################### -# arrow_test_util -####################################### - -add_library(arrow_test_util) -target_link_libraries(arrow_test_util -) - -SET_TARGET_PROPERTIES(arrow_test_util PROPERTIES LINKER_LANGUAGE CXX) - ####################################### # arrow_test_main ####################################### -add_library(arrow_test_main - test_main.cc) - -if (APPLE) - target_link_libraries(arrow_test_main - gtest - arrow_test_util - dl) - set_target_properties(arrow_test_main - PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") -else() - target_link_libraries(arrow_test_main - gtest - arrow_test_util - pthread - dl - ) +if (ARROW_BUILD_TESTS) + add_library(arrow_test_main + test_main.cc) + + if (APPLE) + target_link_libraries(arrow_test_main + gtest + dl) + set_target_properties(arrow_test_main + PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") + else() + target_link_libraries(arrow_test_main + gtest + pthread + dl + ) + endif() endif() ADD_ARROW_TEST(bit-util-test) diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 8f5c27b0f76d7..0ecafc7202e89 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -45,6 +45,13 @@ if ("$ENV{CMAKE_EXPORT_COMPILE_COMMANDS}" STREQUAL "1") set(CMAKE_EXPORT_COMPILE_COMMANDS 1) endif() +# Top level cmake dir +if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") + option(PYARROW_BUILD_TESTS + "Build the PyArrow C++ googletest unit tests" + OFF) +endif() + find_program(CCACHE_FOUND ccache) if(CCACHE_FOUND) set_property(GLOBAL PROPERTY RULE_LAUNCH_COMPILE ccache) @@ -322,10 +329,12 @@ function(ADD_THIRDPARTY_LIB LIB_NAME) endfunction() ## GMock -find_package(GTest REQUIRED) -include_directories(SYSTEM ${GTEST_INCLUDE_DIR}) -ADD_THIRDPARTY_LIB(gtest - STATIC_LIB ${GTEST_STATIC_LIB}) +if (PYARROW_BUILD_TESTS) + find_package(GTest REQUIRED) + include_directories(SYSTEM ${GTEST_INCLUDE_DIR}) + ADD_THIRDPARTY_LIB(gtest + STATIC_LIB ${GTEST_STATIC_LIB}) +endif() ## Arrow find_package(Arrow REQUIRED) @@ -391,6 +400,10 @@ endif (UNIX) # Subdirectories ############################################################ +if (UNIX) + set(CMAKE_BUILD_WITH_INSTALL_RPATH TRUE) +endif() + add_subdirectory(src/pyarrow) add_subdirectory(src/pyarrow/util) @@ -407,10 +420,11 @@ set(LINK_LIBS arrow ) +SET(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE) + add_library(pyarrow SHARED ${PYARROW_SRCS}) target_link_libraries(pyarrow ${LINK_LIBS}) -set_target_properties(pyarrow PROPERTIES LINKER_LANGUAGE CXX) if(APPLE) set_target_properties(pyarrow PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") @@ -420,9 +434,6 @@ endif() # Setup and build Cython modules ############################################################ 
-set(USE_RELATIVE_RPATH ON)
-set(CMAKE_BUILD_WITH_INSTALL_RPATH TRUE)
-
 set(CYTHON_EXTENSIONS
   array
   config
@@ -437,7 +448,7 @@ foreach(module ${CYTHON_EXTENSIONS})
     list(REMOVE_AT directories -1)
 
   string(REPLACE "." "/" module_root "${module}")
-  set(module_SRC arrow/${module_root}.pyx)
+  set(module_SRC pyarrow/${module_root}.pyx)
   set_source_files_properties(${module_SRC} PROPERTIES CYTHON_IS_CXX 1)
 
   cython_add_module(${module_name}
@@ -463,7 +474,7 @@ foreach(module ${CYTHON_EXTENSIONS})
   endwhile(${i} GREATER 0)
 
   # for inplace development for now
-  set(module_install_rpath "${CMAKE_SOURCE_DIR}/arrow/")
+  #set(module_install_rpath "${CMAKE_SOURCE_DIR}/pyarrow/")
 
   set_target_properties(${module_name} PROPERTIES
     INSTALL_RPATH ${module_install_rpath})
diff --git a/python/arrow/__init__.py b/python/arrow/__init__.py
deleted file mode 100644
index 3507ea0235afe..0000000000000
--- a/python/arrow/__init__.py
+++ /dev/null
@@ -1,38 +0,0 @@
-# Licensed to the Apache Software Foundation (ASF) under one
-# or more contributor license agreements. See the NOTICE file
-# distributed with this work for additional information
-# regarding copyright ownership. The ASF licenses this file
-# to you under the Apache License, Version 2.0 (the
-# "License"); you may not use this file except in compliance
-# with the License. You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing,
-# software distributed under the License is distributed on an
-# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-# KIND, either express or implied. See the License for the
-# specific language governing permissions and limitations
-# under the License.
-
-# flake8: noqa
-
-from arrow.array import (Array, from_pylist, total_allocated_bytes,
-                         BooleanArray, NumericArray,
-                         Int8Array, UInt8Array,
-                         ListArray, StringArray)
-
-from arrow.error import ArrowException
-
-from arrow.scalar import (ArrayValue, Scalar, NA, NAType,
-                          BooleanValue,
-                          Int8Value, Int16Value, Int32Value, Int64Value,
-                          UInt8Value, UInt16Value, UInt32Value, UInt64Value,
-                          FloatValue, DoubleValue, ListValue, StringValue)
-
-from arrow.schema import (null, bool_,
-                          int8, int16, int32, int64,
-                          uint8, uint16, uint32, uint64,
-                          float_, double, string,
-                          list_, struct, field,
-                          DataType, Field, Schema)
diff --git a/python/doc/INSTALL.md b/python/doc/INSTALL.md
new file mode 100644
index 0000000000000..d30a03046eda7
--- /dev/null
+++ b/python/doc/INSTALL.md
@@ -0,0 +1,87 @@
+## Building pyarrow (Apache Arrow Python library)
+
+First, clone the master git repository:
+
+```bash
+git clone https://github.com/apache/arrow.git arrow
+```
+
+#### System requirements
+
+Building pyarrow requires:
+
+* A C++11 compiler
+
+  * Linux: gcc >= 4.8 or clang >= 3.5
+  * OS X: XCode 6.4 or higher preferred
+
+* [cmake][1]
+
+#### Python requirements
+
+You will need Python (CPython) 2.7, 3.4, or 3.5 installed. Earlier releases
+are not being targeted.
+
+> This library targets CPython only due to an emphasis on interoperability with
+> pandas and NumPy, which are only available for CPython.
+
+The build requires NumPy, Cython, and a few other Python dependencies:
+
+```bash
+pip install cython
+cd arrow/python
+pip install -r requirements.txt
+```
+
+#### Installing Arrow C++ library
+
+First, you should choose an installation location for Arrow C++.
In the future +using the default system install location will work, but for now we are being +explicit: + +```bash +export ARROW_HOME=$HOME/local +``` + +Now, we build Arrow: + +```bash +cd arrow/cpp + +mkdir dev-build +cd dev-build + +cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME .. + +make + +# Use sudo here if $ARROW_HOME requires it +make install +``` + +#### Install `pyarrow` + +```bash +cd arrow/python + +python setup.py install +``` + +> On XCode 6 and prior there are some known OS X `@rpath` issues. If you are +> unable to import pyarrow, upgrading XCode may be the solution. + + +```python +In [1]: import pyarrow + +In [2]: pyarrow.from_pylist([1,2,3]) +Out[2]: + +[ + 1, + 2, + 3 +] +``` + +[1]: https://cmake.org/ \ No newline at end of file diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py new file mode 100644 index 0000000000000..8d93a156bcc3d --- /dev/null +++ b/python/pyarrow/__init__.py @@ -0,0 +1,38 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# flake8: noqa + +from pyarrow.array import (Array, from_pylist, total_allocated_bytes, + BooleanArray, NumericArray, + Int8Array, UInt8Array, + ListArray, StringArray) + +from pyarrow.error import ArrowException + +from pyarrow.scalar import (ArrayValue, Scalar, NA, NAType, + BooleanValue, + Int8Value, Int16Value, Int32Value, Int64Value, + UInt8Value, UInt16Value, UInt32Value, UInt64Value, + FloatValue, DoubleValue, ListValue, StringValue) + +from pyarrow.schema import (null, bool_, + int8, int16, int32, int64, + uint8, uint16, uint32, uint64, + float_, double, string, + list_, struct, field, + DataType, Field, Schema) diff --git a/python/arrow/array.pxd b/python/pyarrow/array.pxd similarity index 90% rename from python/arrow/array.pxd rename to python/pyarrow/array.pxd index 482f8f796dd26..d0d3486c032fe 100644 --- a/python/arrow/array.pxd +++ b/python/pyarrow/array.pxd @@ -15,12 +15,12 @@ # specific language governing permissions and limitations # under the License. 
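Following the rename, the public surface re-exported by pyarrow/__init__.py can be smoke-tested in a few lines. A sketch assuming a successful build per doc/INSTALL.md; the expected values mirror the renamed test suite later in this patch:

    import pyarrow

    arr = pyarrow.from_pylist([1, None, 3])
    assert len(arr) == 3
    assert arr.null_count == 1
    assert arr.type == pyarrow.int64()
    assert arr[1] is pyarrow.NA   # missing values come back as the NA singleton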
-from arrow.includes.common cimport shared_ptr -from arrow.includes.arrow cimport CArray, LogicalType +from pyarrow.includes.common cimport shared_ptr +from pyarrow.includes.libarrow cimport CArray, LogicalType -from arrow.scalar import NA +from pyarrow.scalar import NA -from arrow.schema cimport DataType +from pyarrow.schema cimport DataType cdef extern from "Python.h": int PySlice_Check(object) diff --git a/python/arrow/array.pyx b/python/pyarrow/array.pyx similarity index 93% rename from python/arrow/array.pyx rename to python/pyarrow/array.pyx index b367e3b84a8b3..bceb333c94ea5 100644 --- a/python/arrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -19,14 +19,14 @@ # distutils: language = c++ # cython: embedsignature = True -from arrow.includes.arrow cimport * -cimport arrow.includes.pyarrow as pyarrow +from pyarrow.includes.libarrow cimport * +cimport pyarrow.includes.pyarrow as pyarrow -from arrow.compat import frombytes, tobytes -from arrow.error cimport check_status +from pyarrow.compat import frombytes, tobytes +from pyarrow.error cimport check_status -cimport arrow.scalar as scalar -from arrow.scalar import NA +cimport pyarrow.scalar as scalar +from pyarrow.scalar import NA def total_allocated_bytes(): cdef MemoryPool* pool = pyarrow.GetMemoryPool() @@ -52,7 +52,7 @@ cdef class Array: raise StopIteration def __repr__(self): - from arrow.formatting import array_format + from pyarrow.formatting import array_format type_format = object.__repr__(self) values = array_format(self, window=10) return '{0}\n{1}'.format(type_format, values) diff --git a/python/arrow/compat.py b/python/pyarrow/compat.py similarity index 100% rename from python/arrow/compat.py rename to python/pyarrow/compat.py diff --git a/python/arrow/config.pyx b/python/pyarrow/config.pyx similarity index 100% rename from python/arrow/config.pyx rename to python/pyarrow/config.pyx diff --git a/python/arrow/error.pxd b/python/pyarrow/error.pxd similarity index 95% rename from python/arrow/error.pxd rename to python/pyarrow/error.pxd index c18cb3efffca6..d226abeda04e0 100644 --- a/python/arrow/error.pxd +++ b/python/pyarrow/error.pxd @@ -15,6 +15,6 @@ # specific language governing permissions and limitations # under the License. -from arrow.includes.pyarrow cimport * +from pyarrow.includes.pyarrow cimport * cdef check_status(const Status& status) diff --git a/python/arrow/error.pyx b/python/pyarrow/error.pyx similarity index 92% rename from python/arrow/error.pyx rename to python/pyarrow/error.pyx index f1d516358819d..3f8d7dd646091 100644 --- a/python/arrow/error.pyx +++ b/python/pyarrow/error.pyx @@ -15,9 +15,8 @@ # specific language governing permissions and limitations # under the License. 
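The __repr__ hook in array.pyx above delegates to the formatting module, so arrays print with a windowed value listing under the object header. An illustrative REPL sketch, assuming the built package (class name and address shown are hypothetical):

    import pyarrow

    pyarrow.from_pylist([1, None, 2])
    # <pyarrow.array.Int64Array object at 0x7f...>   (header illustrative)
    # [
    #   1,
    #   NA,
    #   2
    # ]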
-from arrow.includes.common cimport c_string - -from arrow.compat import frombytes +from pyarrow.includes.common cimport c_string +from pyarrow.compat import frombytes class ArrowException(Exception): pass diff --git a/python/arrow/formatting.py b/python/pyarrow/formatting.py similarity index 98% rename from python/arrow/formatting.py rename to python/pyarrow/formatting.py index a42d4e4bb5713..5fe0611f8450b 100644 --- a/python/arrow/formatting.py +++ b/python/pyarrow/formatting.py @@ -17,7 +17,7 @@ # Pretty-printing and other formatting utilities for Arrow data structures -import arrow.scalar as scalar +import pyarrow.scalar as scalar def array_format(arr, window=None): diff --git a/python/arrow/includes/__init__.pxd b/python/pyarrow/includes/__init__.pxd similarity index 100% rename from python/arrow/includes/__init__.pxd rename to python/pyarrow/includes/__init__.pxd diff --git a/python/arrow/includes/common.pxd b/python/pyarrow/includes/common.pxd similarity index 100% rename from python/arrow/includes/common.pxd rename to python/pyarrow/includes/common.pxd diff --git a/python/arrow/includes/arrow.pxd b/python/pyarrow/includes/libarrow.pxd similarity index 99% rename from python/arrow/includes/arrow.pxd rename to python/pyarrow/includes/libarrow.pxd index 0cc44c06cb607..baba112833e0d 100644 --- a/python/arrow/includes/arrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -17,7 +17,7 @@ # distutils: language = c++ -from arrow.includes.common cimport * +from pyarrow.includes.common cimport * cdef extern from "arrow/api.h" namespace "arrow" nogil: diff --git a/python/arrow/includes/parquet.pxd b/python/pyarrow/includes/parquet.pxd similarity index 97% rename from python/arrow/includes/parquet.pxd rename to python/pyarrow/includes/parquet.pxd index 62342f3066969..99a2d423d9cba 100644 --- a/python/arrow/includes/parquet.pxd +++ b/python/pyarrow/includes/parquet.pxd @@ -17,7 +17,7 @@ # distutils: language = c++ -from arrow.includes.common cimport * +from pyarrow.includes.common cimport * cdef extern from "parquet/api/reader.h" namespace "parquet_cpp" nogil: cdef cppclass ColumnReader: diff --git a/python/arrow/includes/pyarrow.pxd b/python/pyarrow/includes/pyarrow.pxd similarity index 90% rename from python/arrow/includes/pyarrow.pxd rename to python/pyarrow/includes/pyarrow.pxd index 3eed5b8542493..9a0c004b7684a 100644 --- a/python/arrow/includes/pyarrow.pxd +++ b/python/pyarrow/includes/pyarrow.pxd @@ -17,9 +17,9 @@ # distutils: language = c++ -from arrow.includes.common cimport * -from arrow.includes.arrow cimport (CArray, CDataType, LogicalType, - MemoryPool) +from pyarrow.includes.common cimport * +from pyarrow.includes.libarrow cimport (CArray, CDataType, LogicalType, + MemoryPool) cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil: # We can later add more of the common status factory methods as needed diff --git a/python/arrow/parquet.pyx b/python/pyarrow/parquet.pyx similarity index 91% rename from python/arrow/parquet.pyx rename to python/pyarrow/parquet.pyx index 23c3838bcad1f..622e7d0772456 100644 --- a/python/arrow/parquet.pyx +++ b/python/pyarrow/parquet.pyx @@ -19,5 +19,5 @@ # distutils: language = c++ # cython: embedsignature = True -from arrow.compat import frombytes, tobytes -from arrow.includes.parquet cimport * +from pyarrow.compat import frombytes, tobytes +from pyarrow.includes.parquet cimport * diff --git a/python/arrow/scalar.pxd b/python/pyarrow/scalar.pxd similarity index 93% rename from python/arrow/scalar.pxd rename to python/pyarrow/scalar.pxd index 
4e0a3647155a6..b06845718649b 100644 --- a/python/arrow/scalar.pxd +++ b/python/pyarrow/scalar.pxd @@ -15,10 +15,10 @@ # specific language governing permissions and limitations # under the License. -from arrow.includes.common cimport * -from arrow.includes.arrow cimport * +from pyarrow.includes.common cimport * +from pyarrow.includes.libarrow cimport * -from arrow.schema cimport DataType +from pyarrow.schema cimport DataType cdef class Scalar: cdef readonly: diff --git a/python/arrow/scalar.pyx b/python/pyarrow/scalar.pyx similarity index 97% rename from python/arrow/scalar.pyx rename to python/pyarrow/scalar.pyx index 72a280e334f4e..261a38967c495 100644 --- a/python/arrow/scalar.pyx +++ b/python/pyarrow/scalar.pyx @@ -15,10 +15,10 @@ # specific language governing permissions and limitations # under the License. -from arrow.schema cimport DataType, box_data_type +from pyarrow.schema cimport DataType, box_data_type -from arrow.compat import frombytes -import arrow.schema as schema +from pyarrow.compat import frombytes +import pyarrow.schema as schema NA = None diff --git a/python/arrow/schema.pxd b/python/pyarrow/schema.pxd similarity index 91% rename from python/arrow/schema.pxd rename to python/pyarrow/schema.pxd index 8cc244aaba341..07b9bd04da20e 100644 --- a/python/arrow/schema.pxd +++ b/python/pyarrow/schema.pxd @@ -15,8 +15,8 @@ # specific language governing permissions and limitations # under the License. -from arrow.includes.common cimport shared_ptr -from arrow.includes.arrow cimport CDataType, CField, CSchema +from pyarrow.includes.common cimport shared_ptr +from pyarrow.includes.libarrow cimport CDataType, CField, CSchema cdef class DataType: cdef: diff --git a/python/arrow/schema.pyx b/python/pyarrow/schema.pyx similarity index 97% rename from python/arrow/schema.pyx rename to python/pyarrow/schema.pyx index 3001531eb607d..ea878720d5bb8 100644 --- a/python/arrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ -22,9 +22,9 @@ # distutils: language = c++ # cython: embedsignature = True -from arrow.compat import frombytes, tobytes -from arrow.includes.arrow cimport * -cimport arrow.includes.pyarrow as pyarrow +from pyarrow.compat import frombytes, tobytes +from pyarrow.includes.libarrow cimport * +cimport pyarrow.includes.pyarrow as pyarrow cimport cpython diff --git a/python/arrow/tests/__init__.py b/python/pyarrow/tests/__init__.py similarity index 100% rename from python/arrow/tests/__init__.py rename to python/pyarrow/tests/__init__.py diff --git a/python/arrow/tests/test_array.py b/python/pyarrow/tests/test_array.py similarity index 80% rename from python/arrow/tests/test_array.py rename to python/pyarrow/tests/test_array.py index ebd872c744e44..034c1576551d3 100644 --- a/python/arrow/tests/test_array.py +++ b/python/pyarrow/tests/test_array.py @@ -15,19 +15,19 @@ # specific language governing permissions and limitations # under the License. 
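The schema module renamed above provides the type factories used throughout these tests; a small sketch of a composite type, assuming the built package (inputs modeled on test_list_of_int below):

    import pyarrow

    # List types are parameterized by a value type
    t = pyarrow.list_(pyarrow.int64())
    assert pyarrow.from_pylist([[1, 2], None]).type == t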
-from arrow.compat import unittest -import arrow -import arrow.formatting as fmt +from pyarrow.compat import unittest +import pyarrow +import pyarrow.formatting as fmt class TestArrayAPI(unittest.TestCase): def test_getitem_NA(self): - arr = arrow.from_pylist([1, None, 2]) - assert arr[1] is arrow.NA + arr = pyarrow.from_pylist([1, None, 2]) + assert arr[1] is pyarrow.NA def test_list_format(self): - arr = arrow.from_pylist([[1], None, [2, 3]]) + arr = pyarrow.from_pylist([[1], None, [2, 3]]) result = fmt.array_format(arr) expected = """\ [ @@ -39,7 +39,7 @@ def test_list_format(self): assert result == expected def test_string_format(self): - arr = arrow.from_pylist(['foo', None, 'bar']) + arr = pyarrow.from_pylist(['foo', None, 'bar']) result = fmt.array_format(arr) expected = """\ [ @@ -50,7 +50,7 @@ def test_string_format(self): assert result == expected def test_long_array_format(self): - arr = arrow.from_pylist(range(100)) + arr = pyarrow.from_pylist(range(100)) result = fmt.array_format(arr, window=2) expected = """\ [ diff --git a/python/arrow/tests/test_convert_builtin.py b/python/pyarrow/tests/test_convert_builtin.py similarity index 58% rename from python/arrow/tests/test_convert_builtin.py rename to python/pyarrow/tests/test_convert_builtin.py index 57e6ab9f0e7b5..25f696912105d 100644 --- a/python/arrow/tests/test_convert_builtin.py +++ b/python/pyarrow/tests/test_convert_builtin.py @@ -15,8 +15,8 @@ # specific language governing permissions and limitations # under the License. -from arrow.compat import unittest -import arrow +from pyarrow.compat import unittest +import pyarrow class TestConvertList(unittest.TestCase): @@ -25,61 +25,61 @@ def test_boolean(self): pass def test_empty_list(self): - arr = arrow.from_pylist([]) + arr = pyarrow.from_pylist([]) assert len(arr) == 0 assert arr.null_count == 0 - assert arr.type == arrow.null() + assert arr.type == pyarrow.null() def test_all_none(self): - arr = arrow.from_pylist([None, None]) + arr = pyarrow.from_pylist([None, None]) assert len(arr) == 2 assert arr.null_count == 2 - assert arr.type == arrow.null() + assert arr.type == pyarrow.null() def test_integer(self): - arr = arrow.from_pylist([1, None, 3, None]) + arr = pyarrow.from_pylist([1, None, 3, None]) assert len(arr) == 4 assert arr.null_count == 2 - assert arr.type == arrow.int64() + assert arr.type == pyarrow.int64() def test_garbage_collection(self): import gc - bytes_before = arrow.total_allocated_bytes() - arrow.from_pylist([1, None, 3, None]) + bytes_before = pyarrow.total_allocated_bytes() + pyarrow.from_pylist([1, None, 3, None]) gc.collect() - assert arrow.total_allocated_bytes() == bytes_before + assert pyarrow.total_allocated_bytes() == bytes_before def test_double(self): data = [1.5, 1, None, 2.5, None, None] - arr = arrow.from_pylist(data) + arr = pyarrow.from_pylist(data) assert len(arr) == 6 assert arr.null_count == 3 - assert arr.type == arrow.double() + assert arr.type == pyarrow.double() def test_string(self): data = ['foo', b'bar', None, 'arrow'] - arr = arrow.from_pylist(data) + arr = pyarrow.from_pylist(data) assert len(arr) == 4 assert arr.null_count == 1 - assert arr.type == arrow.string() + assert arr.type == pyarrow.string() def test_mixed_nesting_levels(self): - arrow.from_pylist([1, 2, None]) - arrow.from_pylist([[1], [2], None]) - arrow.from_pylist([[1], [2], [None]]) + pyarrow.from_pylist([1, 2, None]) + pyarrow.from_pylist([[1], [2], None]) + pyarrow.from_pylist([[1], [2], [None]]) - with self.assertRaises(arrow.ArrowException): - 
arrow.from_pylist([1, 2, [1]]) + with self.assertRaises(pyarrow.ArrowException): + pyarrow.from_pylist([1, 2, [1]]) - with self.assertRaises(arrow.ArrowException): - arrow.from_pylist([1, 2, []]) + with self.assertRaises(pyarrow.ArrowException): + pyarrow.from_pylist([1, 2, []]) - with self.assertRaises(arrow.ArrowException): - arrow.from_pylist([[1], [2], [None, [1]]]) + with self.assertRaises(pyarrow.ArrowException): + pyarrow.from_pylist([[1], [2], [None, [1]]]) def test_list_of_int(self): data = [[1, 2, 3], [], None, [1, 2]] - arr = arrow.from_pylist(data) + arr = pyarrow.from_pylist(data) assert len(arr) == 4 assert arr.null_count == 1 - assert arr.type == arrow.list_(arrow.int64()) + assert arr.type == pyarrow.list_(pyarrow.int64()) diff --git a/python/arrow/tests/test_scalars.py b/python/pyarrow/tests/test_scalars.py similarity index 97% rename from python/arrow/tests/test_scalars.py rename to python/pyarrow/tests/test_scalars.py index 951380bd981e4..021737db6726e 100644 --- a/python/arrow/tests/test_scalars.py +++ b/python/pyarrow/tests/test_scalars.py @@ -15,8 +15,8 @@ # specific language governing permissions and limitations # under the License. -from arrow.compat import unittest, u -import arrow +from pyarrow.compat import unittest, u +import pyarrow as arrow class TestScalars(unittest.TestCase): diff --git a/python/arrow/tests/test_schema.py b/python/pyarrow/tests/test_schema.py similarity index 96% rename from python/arrow/tests/test_schema.py rename to python/pyarrow/tests/test_schema.py index a89edd74a0adf..0235526198f35 100644 --- a/python/arrow/tests/test_schema.py +++ b/python/pyarrow/tests/test_schema.py @@ -15,8 +15,8 @@ # specific language governing permissions and limitations # under the License. -from arrow.compat import unittest -import arrow +from pyarrow.compat import unittest +import pyarrow as arrow class TestTypes(unittest.TestCase): diff --git a/python/requirements.txt b/python/requirements.txt index a82cb20aab86e..f42c90c5c9b3f 100644 --- a/python/requirements.txt +++ b/python/requirements.txt @@ -1,4 +1,3 @@ pytest numpy>=1.7.0 -pandas>=0.12.0 six diff --git a/python/setup.py b/python/setup.py index eb3ff2a1547d6..5cc871aba9f81 100644 --- a/python/setup.py +++ b/python/setup.py @@ -27,7 +27,7 @@ import sys import pkg_resources -from setuptools import setup +from setuptools import setup, Extension import os @@ -40,10 +40,12 @@ is_64_bit = sys.maxsize > 2**32 # Check if this is a debug build of Python. 
-if hasattr(sys, 'gettotalrefcount'): - build_type = 'Debug' -else: - build_type = 'Release' +# if hasattr(sys, 'gettotalrefcount'): +# build_type = 'Debug' +# else: +# build_type = 'Release' + +build_type = 'Debug' if Cython.__version__ < '0.19.1': raise Exception('Please upgrade to Cython 0.19.1 or newer') @@ -51,7 +53,7 @@ MAJOR = 0 MINOR = 1 MICRO = 0 -VERSION = '%d.%d.%d' % (MAJOR, MINOR, MICRO) +VERSION = '%d.%d.%ddev' % (MAJOR, MINOR, MICRO) class clean(_clean): @@ -70,6 +72,9 @@ class build_ext(_build_ext): def build_extensions(self): numpy_incl = pkg_resources.resource_filename('numpy', 'core/include') + self.extensions = [ext for ext in self.extensions + if ext.name != '__dummy__'] + for ext in self.extensions: if (hasattr(ext, 'include_dirs') and numpy_incl not in ext.include_dirs): @@ -98,6 +103,7 @@ def _run_cmake(self): # The staging directory for the module being built build_temp = pjoin(os.getcwd(), self.build_temp) + build_lib = os.path.join(os.getcwd(), self.build_lib) # Change to the build directory saved_cwd = os.getcwd() @@ -124,7 +130,7 @@ def _run_cmake(self): static_lib_option, source] self.spawn(cmake_command) - args = ['make'] + args = ['make', 'VERBOSE=1'] if 'PYARROW_PARALLEL' in os.environ: args.append('-j{0}'.format(os.environ['PYARROW_PARALLEL'])) self.spawn(args) @@ -150,21 +156,19 @@ def _run_cmake(self): if self.inplace: # a bit hacky build_lib = saved_cwd - else: - build_lib = pjoin(os.getcwd(), self.build_lib) # Move the built libpyarrow library to the place expected by the Python # build if sys.platform != 'win32': name, = glob.glob('libpyarrow.*') try: - os.makedirs(pjoin(build_lib, 'arrow')) + os.makedirs(pjoin(build_lib, 'pyarrow')) except OSError: pass - shutil.move(name, pjoin(build_lib, 'arrow', name)) + shutil.move(name, pjoin(build_lib, 'pyarrow', name)) else: shutil.move(pjoin(build_type, 'pyarrow.dll'), - pjoin(build_lib, 'arrow', 'pyarrow.dll')) + pjoin(build_lib, 'pyarrow', 'pyarrow.dll')) # Move the built C-extension to the place expected by the Python build self._found_names = [] @@ -192,7 +196,7 @@ def _get_inplace_dir(self): def _get_cmake_ext_path(self, name): # Get the package directory from build_py build_py = self.get_finalized_command('build_py') - package_dir = build_py.get_package_dir('arrow') + package_dir = build_py.get_package_dir('pyarrow') # This is the name of the arrow C-extension suffix = sysconfig.get_config_var('EXT_SUFFIX') if suffix is None: @@ -217,23 +221,23 @@ def get_names(self): def get_outputs(self): # Just the C extensions - cmake_exts = [self._get_cmake_ext_path(name) - for name in self.get_names()] - regular_exts = _build_ext.get_outputs(self) - return regular_exts + cmake_exts + # regular_exts = _build_ext.get_outputs(self) + return [self._get_cmake_ext_path(name) + for name in self.get_names()] -extensions = [] - DESC = """\ Python library for Apache Arrow""" setup( - name="arrow", - packages=['arrow', 'arrow.tests'], + name="pyarrow", + packages=['pyarrow', 'pyarrow.tests'], version=VERSION, - package_data={'arrow': ['*.pxd', '*.pyx']}, - ext_modules=extensions, + zip_safe=False, + package_data={'pyarrow': ['*.pxd', '*.pyx']}, + # Dummy extension to trigger build_ext + ext_modules=[Extension('__dummy__', sources=[])], + cmdclass={ 'clean': clean, 'build_ext': build_ext @@ -243,5 +247,5 @@ def get_outputs(self): license='Apache License, Version 2.0', maintainer="Apache Arrow Developers", maintainer_email="dev@arrow.apache.org", - test_suite="arrow.tests" + test_suite="pyarrow.tests" ) diff --git 
a/python/src/pyarrow/util/CMakeLists.txt b/python/src/pyarrow/util/CMakeLists.txt index 3fd8bac31506d..4afb4d0f912b1 100644 --- a/python/src/pyarrow/util/CMakeLists.txt +++ b/python/src/pyarrow/util/CMakeLists.txt @@ -19,19 +19,21 @@ # pyarrow_test_main ####################################### -add_library(pyarrow_test_main - test_main.cc) +if (PYARROW_BUILD_TESTS) + add_library(pyarrow_test_main + test_main.cc) -if (APPLE) - target_link_libraries(pyarrow_test_main - gtest - dl) - set_target_properties(pyarrow_test_main - PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") -else() - target_link_libraries(pyarrow_test_main - gtest - pthread - dl - ) + if (APPLE) + target_link_libraries(pyarrow_test_main + gtest + dl) + set_target_properties(pyarrow_test_main + PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") + else() + target_link_libraries(pyarrow_test_main + gtest + pthread + dl + ) + endif() endif() From 883c62bddc534df2c0a4ee1e8bef38772aa4a7cd Mon Sep 17 00:00:00 2001 From: Dan Robinson Date: Wed, 16 Mar 2016 15:11:56 -0700 Subject: [PATCH 0032/1644] ARROW-55: [Python] Fix unit tests in 2.7 Fixing the #define check for Python 2 makes all unit tests pass in Python 2.7. Author: Dan Robinson Closes #25 from danrobinson/ARROW-55 and squashes the following commits: dda4396 [Dan Robinson] ARROW-55: Add Python 2.7 tests to travis-ci b00524b [Dan Robinson] ARROW-55: [Python] Fix unit tests in 2.7 --- ci/travis_script_python.sh | 35 ++++++++++++++++++++--------------- python/src/pyarrow/common.h | 2 +- 2 files changed, 21 insertions(+), 16 deletions(-) diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index 14d66b44ff8ee..af6b0085724fc 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -26,29 +26,34 @@ export PATH="$MINICONDA/bin:$PATH" conda update -y -q conda conda info -a -PYTHON_VERSION=3.5 -CONDA_ENV_NAME=pyarrow-test +python_version_tests() { + PYTHON_VERSION=$1 + CONDA_ENV_NAME="pyarrow-test-${PYTHON_VERSION}" + conda create -y -q -n $CONDA_ENV_NAME python=$PYTHON_VERSION + source activate $CONDA_ENV_NAME -conda create -y -q -n $CONDA_ENV_NAME python=$PYTHON_VERSION -source activate $CONDA_ENV_NAME + python --version + which python -python --version -which python + # faster builds, please + conda install -y nomkl -# faster builds, please -conda install -y nomkl + # Expensive dependencies install from Continuum package repo + conda install -y pip numpy pandas cython -# Expensive dependencies install from Continuum package repo -conda install -y pip numpy pandas cython + # Other stuff pip install + pip install -r requirements.txt -# Other stuff pip install -pip install -r requirements.txt + export ARROW_HOME=$ARROW_CPP_INSTALL -export ARROW_HOME=$ARROW_CPP_INSTALL + python setup.py build_ext --inplace -python setup.py build_ext --inplace + py.test -vv -r sxX pyarrow +} -py.test -vv -r sxX pyarrow +# run tests for python 2.7 and 3.5 +python_version_tests 2.7 +python_version_tests 3.5 # if [ $TRAVIS_OS_NAME == "linux" ]; then # valgrind --tool=memcheck py.test -vv -r sxX arrow diff --git a/python/src/pyarrow/common.h b/python/src/pyarrow/common.h index a43e4d28c899a..db6361384c10d 100644 --- a/python/src/pyarrow/common.h +++ b/python/src/pyarrow/common.h @@ -24,7 +24,7 @@ namespace arrow { class MemoryPool; } namespace pyarrow { -#define PYARROW_IS_PY2 PY_MAJOR_VERSION < 2 +#define PYARROW_IS_PY2 PY_MAJOR_VERSION <= 2 #define RETURN_ARROW_NOT_OK(s) do { \ arrow::Status _s = (s); \ From 5881aacefc577ef8a2c39dc40d8f9cd978d50a88 Mon Sep 17 00:00:00 
2001 From: "Uwe L. Korn" Date: Wed, 16 Mar 2016 15:13:55 -0700 Subject: [PATCH 0033/1644] ARROW-64: Add zsh support to C++ build scripts All scripts that have to be sourced during development currently only support bash. This patch adds zsh support. Author: Uwe L. Korn Closes #24 from xhochy/zsh-support and squashes the following commits: d3590aa [Uwe L. Korn] ARROW-64: Add zsh support to C++ build scripts --- cpp/setup_build_env.sh | 2 +- cpp/thirdparty/build_thirdparty.sh | 2 +- cpp/thirdparty/download_thirdparty.sh | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/cpp/setup_build_env.sh b/cpp/setup_build_env.sh index e9901bdbecd42..26a727c87e526 100755 --- a/cpp/setup_build_env.sh +++ b/cpp/setup_build_env.sh @@ -1,6 +1,6 @@ #!/bin/bash -SOURCE_DIR=$(cd "$(dirname "$BASH_SOURCE")"; pwd) +SOURCE_DIR=$(cd "$(dirname "${BASH_SOURCE:-$0}")"; pwd) ./thirdparty/download_thirdparty.sh ./thirdparty/build_thirdparty.sh diff --git a/cpp/thirdparty/build_thirdparty.sh b/cpp/thirdparty/build_thirdparty.sh index 46794def400eb..8de56a6d08678 100755 --- a/cpp/thirdparty/build_thirdparty.sh +++ b/cpp/thirdparty/build_thirdparty.sh @@ -2,7 +2,7 @@ set -x set -e -TP_DIR=$(cd "$(dirname "$BASH_SOURCE")"; pwd) +TP_DIR=$(cd "$(dirname "${BASH_SOURCE:-$0}")"; pwd) source $TP_DIR/versions.sh PREFIX=$TP_DIR/installed diff --git a/cpp/thirdparty/download_thirdparty.sh b/cpp/thirdparty/download_thirdparty.sh index 8ffb22a93f7e2..0c801179e8d30 100755 --- a/cpp/thirdparty/download_thirdparty.sh +++ b/cpp/thirdparty/download_thirdparty.sh @@ -3,7 +3,7 @@ set -x set -e -TP_DIR=$(cd "$(dirname "$BASH_SOURCE")"; pwd) +TP_DIR=$(cd "$(dirname "${BASH_SOURCE:-$0}")"; pwd) source $TP_DIR/versions.sh From c99661069c2f1dbd29c3a86e1e0bd5fa3c6c809f Mon Sep 17 00:00:00 2001 From: Micah Kornfield Date: Thu, 17 Mar 2016 15:05:24 -0700 Subject: [PATCH 0034/1644] ARROW-68: Better error handling for not fully setup systems Author: Micah Kornfield Closes #27 from emkornfield/emk_add_nice_errors_PR and squashes the following commits: c0b9d78 [Micah Kornfield] ARROW-68: Better error handling for systems missing prerequistites --- cpp/setup_build_env.sh | 4 ++-- cpp/thirdparty/build_thirdparty.sh | 9 ++++++--- cpp/thirdparty/download_thirdparty.sh | 1 + 3 files changed, 9 insertions(+), 5 deletions(-) diff --git a/cpp/setup_build_env.sh b/cpp/setup_build_env.sh index 26a727c87e526..1a33fe386f103 100755 --- a/cpp/setup_build_env.sh +++ b/cpp/setup_build_env.sh @@ -2,8 +2,8 @@ SOURCE_DIR=$(cd "$(dirname "${BASH_SOURCE:-$0}")"; pwd) -./thirdparty/download_thirdparty.sh -./thirdparty/build_thirdparty.sh +./thirdparty/download_thirdparty.sh || { echo "download_thirdparty.sh failed" ; return; } +./thirdparty/build_thirdparty.sh || { echo "build_thirdparty.sh failed" ; return; } source thirdparty/versions.sh export GTEST_HOME=$SOURCE_DIR/thirdparty/$GTEST_BASEDIR diff --git a/cpp/thirdparty/build_thirdparty.sh b/cpp/thirdparty/build_thirdparty.sh index 8de56a6d08678..beb248803594c 100755 --- a/cpp/thirdparty/build_thirdparty.sh +++ b/cpp/thirdparty/build_thirdparty.sh @@ -44,18 +44,21 @@ ln -sf lib "$PREFIX/lib64" # use the compiled tools export PATH=$PREFIX/bin:$PATH +type cmake >/dev/null 2>&1 || { echo >&2 "cmake not installed. Aborting."; exit 1; } +type make >/dev/null 2>&1 || { echo >&2 "make not installed. Aborting."; exit 1; } # build googletest +GOOGLETEST_ERROR="failed for googletest!" 
if [ -n "$F_ALL" -o -n "$F_GTEST" ]; then cd $TP_DIR/$GTEST_BASEDIR if [[ "$OSTYPE" == "darwin"* ]]; then - CXXFLAGS=-fPIC cmake -DCMAKE_CXX_FLAGS="-std=c++11 -stdlib=libc++ -DGTEST_USE_OWN_TR1_TUPLE=1 -Wno-unused-value -Wno-ignored-attributes" + CXXFLAGS=-fPIC cmake -DCMAKE_CXX_FLAGS="-std=c++11 -stdlib=libc++ -DGTEST_USE_OWN_TR1_TUPLE=1 -Wno-unused-value -Wno-ignored-attributes" || { echo "cmake $GOOGLETEST_ERROR" ; exit 1; } else - CXXFLAGS=-fPIC cmake . + CXXFLAGS=-fPIC cmake . || { echo "cmake $GOOGLETEST_ERROR"; exit 1; } fi - make VERBOSE=1 + make VERBOSE=1 || { echo "Make $GOOGLETEST_ERROR" ; exit 1; } fi echo "---------------------" diff --git a/cpp/thirdparty/download_thirdparty.sh b/cpp/thirdparty/download_thirdparty.sh index 0c801179e8d30..c18dd4d8e80ab 100755 --- a/cpp/thirdparty/download_thirdparty.sh +++ b/cpp/thirdparty/download_thirdparty.sh @@ -8,6 +8,7 @@ TP_DIR=$(cd "$(dirname "${BASH_SOURCE:-$0}")"; pwd) source $TP_DIR/versions.sh download_extract_and_cleanup() { + type curl >/dev/null 2>&1 || { echo >&2 "curl not installed. Aborting."; exit 1; } filename=$TP_DIR/$(basename "$1") curl -#LC - "$1" -o $filename tar xzf $filename -C $TP_DIR From 3a99f39d64d4e0d6556582c0560140c7b06ee21d Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Mon, 21 Mar 2016 16:31:21 -0700 Subject: [PATCH 0035/1644] ARROW-73: Support older CMake versions Author: Uwe L. Korn Closes #31 from xhochy/arrow-73 and squashes the following commits: c92ce5c [Uwe L. Korn] ARROW-73: Support older CMake versions --- cpp/cmake_modules/FindGTest.cmake | 2 +- cpp/cmake_modules/FindParquet.cmake | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/cpp/cmake_modules/FindGTest.cmake b/cpp/cmake_modules/FindGTest.cmake index e47faf0dd89d2..3c5d2b67e4494 100644 --- a/cpp/cmake_modules/FindGTest.cmake +++ b/cpp/cmake_modules/FindGTest.cmake @@ -54,7 +54,7 @@ endif () if (GTEST_INCLUDE_DIR AND GTEST_LIBRARIES) set(GTEST_FOUND TRUE) - get_filename_component( GTEST_LIBS ${GTEST_LIBRARIES} DIRECTORY ) + get_filename_component( GTEST_LIBS ${GTEST_LIBRARIES} PATH ) set(GTEST_LIB_NAME libgtest) set(GTEST_STATIC_LIB ${GTEST_LIBS}/${GTEST_LIB_NAME}.a) set(GTEST_SHARED_LIB ${GTEST_LIBS}/${GTEST_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) diff --git a/cpp/cmake_modules/FindParquet.cmake b/cpp/cmake_modules/FindParquet.cmake index 76c2d1dbee941..d16e6c98f8d1c 100644 --- a/cpp/cmake_modules/FindParquet.cmake +++ b/cpp/cmake_modules/FindParquet.cmake @@ -43,7 +43,7 @@ endif () if (PARQUET_INCLUDE_DIR AND PARQUET_LIBRARIES) set(PARQUET_FOUND TRUE) - get_filename_component( PARQUET_LIBS ${PARQUET_LIBRARIES} DIRECTORY ) + get_filename_component( PARQUET_LIBS ${PARQUET_LIBRARIES} PATH ) set(PARQUET_LIB_NAME libparquet) set(PARQUET_STATIC_LIB ${PARQUET_LIBS}/${PARQUET_LIB_NAME}.a) set(PARQUET_SHARED_LIB ${PARQUET_LIBS}/${PARQUET_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) From 016b92bccf60de480da07acbabe876fb695c45e5 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Mon, 21 Mar 2016 16:34:07 -0700 Subject: [PATCH 0036/1644] ARROW-72: Search for alternative parquet-cpp header Author: Uwe L. Korn Closes #30 from xhochy/arrow-72 and squashes the following commits: 5b6b328 [Uwe L. 
Korn] ARROW-72: Search for alternative parquet-cpp header --- cpp/cmake_modules/FindParquet.cmake | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/cpp/cmake_modules/FindParquet.cmake b/cpp/cmake_modules/FindParquet.cmake index d16e6c98f8d1c..e3350d6e13da6 100644 --- a/cpp/cmake_modules/FindParquet.cmake +++ b/cpp/cmake_modules/FindParquet.cmake @@ -29,14 +29,14 @@ endif() # Try the parameterized roots, if they exist if ( _parquet_roots ) - find_path( PARQUET_INCLUDE_DIR NAMES parquet/parquet.h + find_path( PARQUET_INCLUDE_DIR NAMES parquet/api/reader.h PATHS ${_parquet_roots} NO_DEFAULT_PATH PATH_SUFFIXES "include" ) find_library( PARQUET_LIBRARIES NAMES parquet PATHS ${_parquet_roots} NO_DEFAULT_PATH PATH_SUFFIXES "lib" ) else () - find_path( PARQUET_INCLUDE_DIR NAMES parquet/parquet.h ) + find_path( PARQUET_INCLUDE_DIR NAMES parquet/api/reader.h ) find_library( PARQUET_LIBRARIES NAMES parquet ) endif () From 4ec034bbe18bd961a4bac64f2e25dba0472c28c9 Mon Sep 17 00:00:00 2001 From: Micah Kornfield Date: Tue, 22 Mar 2016 08:51:23 -0700 Subject: [PATCH 0037/1644] ARROW-28: Adding google's benchmark library to the toolchain This isn't yet complete, but before I go further I think it's worth asking some questions on people's preferences: 1. It seems that the build third-party script is setting up an install directory that it is not making use of. Do we want to keep this functionality and start adding new libraries to be placed there? The gtest component of the tool-chain assumes it is in its own location, and this is how I patterned google benchmark integration. 2. Do we want to couple unit test builds with benchmark builds? I am currently aiming for having them decoupled and having benchmarks off by default. 3. I am not familiar with the Darwin/mac build environment and it is not clear if the CXX flags are required universally. (I need to fix it anyways to move -DGTEST_USE_OWN_TR1_TUPLE=1 back to be gtest only). Travis-ci might provide the answer. 4. Any other basic features in the benchmark toolchain people would like to see as part of this PR? Wes mentioned starting to create benchmarking tools lib, but I think that likely belongs in a separate PR.
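(Editor's note: for readers new to Google Benchmark, a micro benchmark registered through the ADD_ARROW_BENCHMARK machinery added below reduces to a file shaped like the following sketch. It targets the v1.0.0-era API this patch pins; the benchmark name and body are invented for illustration and are not part of the change.)

    #include <vector>

    #include "benchmark/benchmark.h"

    // Illustrative only: time std::vector growth over a range of input sizes.
    static void BM_VectorPushBack(benchmark::State& state) {  // NOLINT non-const reference
      while (state.KeepRunning()) {
        std::vector<int> v;
        for (int i = 0; i < state.range_x(); ++i) {
          v.push_back(i);
        }
      }
    }

    // Range() sweeps the size argument, mirroring column-benchmark.cc below.
    BENCHMARK(BM_VectorPushBack)->Range(8, 8 << 10);

Linked against the arrow_benchmark_main library this patch introduces, such a file needs no main() of its own.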
Author: Micah Kornfield Closes #29 from emkornfield/emk_add_benchmark and squashes the following commits: dbd4e71 [Micah Kornfield] only run unittests is travis ab21150 [Micah Kornfield] Enable benchmarks in cpp toolchain 40847ee [Micah Kornfield] WIP-Adding google's benchmark library to the toolchain --- ci/travis_before_script_cpp.sh | 2 +- ci/travis_script_cpp.sh | 4 +- cpp/CMakeLists.txt | 88 ++++++++++++- cpp/README.md | 23 +++- cpp/build-support/run-test.sh | 160 ++++++++++++++---------- cpp/cmake_modules/FindGBenchmark.cmake | 88 +++++++++++++ cpp/setup_build_env.sh | 1 + cpp/src/arrow/table/CMakeLists.txt | 2 + cpp/src/arrow/table/column-benchmark.cc | 55 ++++++++ cpp/src/arrow/util/CMakeLists.txt | 14 +++ cpp/src/arrow/util/benchmark_main.cc | 24 ++++ cpp/thirdparty/build_thirdparty.sh | 20 ++- cpp/thirdparty/download_thirdparty.sh | 6 + cpp/thirdparty/versions.sh | 4 + 14 files changed, 415 insertions(+), 76 deletions(-) create mode 100644 cpp/cmake_modules/FindGBenchmark.cmake create mode 100644 cpp/src/arrow/table/column-benchmark.cc create mode 100644 cpp/src/arrow/util/benchmark_main.cc diff --git a/ci/travis_before_script_cpp.sh b/ci/travis_before_script_cpp.sh index 4d5bef8bbdf70..49dcc395fbc83 100755 --- a/ci/travis_before_script_cpp.sh +++ b/ci/travis_before_script_cpp.sh @@ -19,7 +19,7 @@ echo $GTEST_HOME : ${ARROW_CPP_INSTALL=$TRAVIS_BUILD_DIR/cpp-install} -cmake -DCMAKE_INSTALL_PREFIX=$ARROW_CPP_INSTALL -DCMAKE_CXX_FLAGS="-Werror" $CPP_DIR +cmake -DARROW_BUILD_BENCHMARKS=ON -DCMAKE_INSTALL_PREFIX=$ARROW_CPP_INSTALL -DCMAKE_CXX_FLAGS="-Werror" $CPP_DIR make -j4 make install diff --git a/ci/travis_script_cpp.sh b/ci/travis_script_cpp.sh index 3e843dd759ea1..d96b98f8d37f5 100755 --- a/ci/travis_script_cpp.sh +++ b/ci/travis_script_cpp.sh @@ -9,9 +9,9 @@ pushd $CPP_BUILD_DIR make lint if [ $TRAVIS_OS_NAME == "linux" ]; then - valgrind --tool=memcheck --leak-check=yes --error-exitcode=1 ctest + valgrind --tool=memcheck --leak-check=yes --error-exitcode=1 ctest -L unittest else - ctest + ctest -L unittest fi popd diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index f5f6038031127..268c1d11e1e8e 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -55,12 +55,21 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") option(ARROW_BUILD_TESTS "Build the Arrow googletest unit tests" ON) + + option(ARROW_BUILD_BENCHMARKS + "Build the Arrow micro benchmarks" + OFF) + endif() if(NOT ARROW_BUILD_TESTS) set(NO_TESTS 1) endif() +if(NOT ARROW_BUILD_BENCHMARKS) + set(NO_BENCHMARKS 1) +endif() + ############################################################ # Compiler flags @@ -251,9 +260,63 @@ set(EXECUTABLE_OUTPUT_PATH "${BUILD_OUTPUT_ROOT_DIRECTORY}") include_directories(src) ############################################################ -# Testing +# Benchmarking ############################################################ +# Add a new micro benchmark, with or without an executable that should be built. +# If benchmarks are enabled then they will be run along side unit tests with ctest. +# 'make runbenchmark' and 'make unittest' to build/run only benchmark or unittests, +# respectively. +# +# REL_BENCHMARK_NAME is the name of the benchmark app. It may be a single component +# (e.g. monotime-benchmark) or contain additional components (e.g. +# net/net_util-benchmark). Either way, the last component must be a globally +# unique name. + +# The benchmark will registered as unit test with ctest with a label +# of 'benchmark'. 
+# +# Arguments after the test name will be passed to set_tests_properties(). +function(ADD_ARROW_BENCHMARK REL_BENCHMARK_NAME) + if(NO_BENCHMARKS) + return() + endif() + get_filename_component(BENCHMARK_NAME ${REL_BENCHMARK_NAME} NAME_WE) + + if(EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${REL_BENCHMARK_NAME}.cc) + # This benchmark has a corresponding .cc file, set it up as an executable. + set(BENCHMARK_PATH "${EXECUTABLE_OUTPUT_PATH}/${BENCHMARK_NAME}") + add_executable(${BENCHMARK_NAME} "${REL_BENCHMARK_NAME}.cc") + target_link_libraries(${BENCHMARK_NAME} ${ARROW_BENCHMARK_LINK_LIBS}) + add_dependencies(runbenchmark ${BENCHMARK_NAME}) + set(NO_COLOR "--color_print=false") + else() + # No executable, just invoke the benchmark (probably a script) directly. + set(BENCHMARK_PATH ${CMAKE_CURRENT_SOURCE_DIR}/${REL_BENCHMARK_NAME}) + set(NO_COLOR "") + endif() + + add_test(${BENCHMARK_NAME} + ${BUILD_SUPPORT_DIR}/run-test.sh ${CMAKE_BINARY_DIR} benchmark ${BENCHMARK_PATH} ${NO_COLOR}) + set_tests_properties(${BENCHMARK_NAME} PROPERTIES LABELS "benchmark") + if(ARGN) + set_tests_properties(${BENCHMARK_NAME} PROPERTIES ${ARGN}) + endif() +endfunction() + +# A wrapper for add_dependencies() that is compatible with NO_BENCHMARKS. +function(ADD_ARROW_BENCHMARK_DEPENDENCIES REL_BENCHMARK_NAME) + if(NO_BENCHMARKS) + return() + endif() + get_filename_component(BENCHMARK_NAME ${REL_BENCHMARK_NAME} NAME_WE) + add_dependencies(${BENCHMARK_NAME} ${ARGN}) +endfunction() + + +############################################################ +# Testing +############################################################ # Add a new test case, with or without an executable that should be built. # # REL_TEST_NAME is the name of the test. It may be a single component # (e.g. monotime-benchmark) or contain additional components (e.g. # net/net_util-test). Either way, the last component must be a globally # unique name. # +# The unit test is added with a label of "unittest" to support filtering with +# ctest. +# # Arguments after the test name will be passed to set_tests_properties(). function(ADD_ARROW_TEST REL_TEST_NAME) if(NO_TESTS) @@ -273,13 +339,15 @@ set(TEST_PATH "${EXECUTABLE_OUTPUT_PATH}/${TEST_NAME}") add_executable(${TEST_NAME} "${REL_TEST_NAME}.cc") target_link_libraries(${TEST_NAME} ${ARROW_TEST_LINK_LIBS}) + add_dependencies(unittest ${TEST_NAME}) else() # No executable, just invoke the test (probably a script) directly.
set(TEST_PATH ${CMAKE_CURRENT_SOURCE_DIR}/${REL_TEST_NAME}) endif() add_test(${TEST_NAME} - ${BUILD_SUPPORT_DIR}/run-test.sh ${TEST_PATH}) + ${BUILD_SUPPORT_DIR}/run-test.sh ${CMAKE_BINARY_DIR} test ${TEST_PATH}) + set_tests_properties(${TEST_NAME} PROPERTIES LABELS "unittest") if(ARGN) set_tests_properties(${TEST_NAME} PROPERTIES ${ARGN}) endif() @@ -335,13 +403,28 @@ if ("$ENV{GTEST_HOME}" STREQUAL "") set(GTest_HOME ${THIRDPARTY_DIR}/googletest-release-1.7.0) endif() +## Google Benchmark +if ("$ENV{GBENCHMARK_HOME}" STREQUAL "") + set(GBENCHMARK_HOME ${THIRDPARTY_DIR}/installed) +endif() + + if(ARROW_BUILD_TESTS) + add_custom_target(unittest ctest -L unittest) find_package(GTest REQUIRED) include_directories(SYSTEM ${GTEST_INCLUDE_DIR}) ADD_THIRDPARTY_LIB(gtest STATIC_LIB ${GTEST_STATIC_LIB}) endif() +if(ARROW_BUILD_BENCHMARKS) + add_custom_target(runbenchmark ctest -L benchmark) + find_package(GBenchmark REQUIRED) + include_directories(SYSTEM ${GBENCHMARK_INCLUDE_DIR}) + ADD_THIRDPARTY_LIB(benchmark + STATIC_LIB ${GBENCHMARK_STATIC_LIB}) +endif() + ## Google PerfTools ## ## Disabled with TSAN/ASAN as well as with gold+dynamic linking (see comment @@ -366,6 +449,7 @@ endif() ############################################################ set(ARROW_MIN_TEST_LIBS arrow arrow_test_main ${ARROW_BASE_LIBS}) set(ARROW_TEST_LINK_LIBS ${ARROW_MIN_TEST_LIBS}) +set(ARROW_BENCHMARK_LINK_LIBS arrow arrow_benchmark_main ${ARROW_BASE_LIBS}) ############################################################ # "make ctags" target diff --git a/cpp/README.md b/cpp/README.md index 378dc4e28de76..542cce43a1391 100644 --- a/cpp/README.md +++ b/cpp/README.md @@ -29,16 +29,29 @@ Simple debug build: mkdir debug cd debug cmake .. - make - ctest + make unittest Simple release build: mkdir release cd release cmake .. -DCMAKE_BUILD_TYPE=Release - make - ctest + make unittest + +Detailed unit test logs will be placed in the build directory under `build/test-logs`. + +### Building/Running benchmarks + +Follow the directions for simple build except run cmake +with the `--ARROW_BUILD_BENCHMARKS` parameter set correctly: + + cmake -DARROW_BUILD_BENCHMARKS=ON .. + +and instead of make unittest run either `make; ctest` to run both unit tests +and benchmarks or `make runbenchmark` to run only the benchmark tests. + +Benchmark logs will be placed in the build directory under `build/benchmark-logs`. + ### Third-party environment variables @@ -46,3 +59,5 @@ To set up your own specific build toolchain, here are the relevant environment variables * Googletest: `GTEST_HOME` (only required to build the unit tests) +* Google Benchmark: `GBENCHMARK_HOME` (only required if building benchmarks) + diff --git a/cpp/build-support/run-test.sh b/cpp/build-support/run-test.sh index b2039134d558d..0e628e26ecd52 100755 --- a/cpp/build-support/run-test.sh +++ b/cpp/build-support/run-test.sh @@ -16,24 +16,23 @@ # Script which wraps running a test and redirects its output to a # test log directory. # -# If KUDU_COMPRESS_TEST_OUTPUT is non-empty, then the logs will be -# gzip-compressed while they are written. +# Arguments: +# $1 - Base path for logs/artifacts. +# $2 - type of test (e.g. test or benchmark) +# $3 - path to executable +# $ARGN - arguments for executable # -# If KUDU_FLAKY_TEST_ATTEMPTS is non-zero, and the test being run matches -# one of the lines in the file KUDU_FLAKY_TEST_LIST, then the test will -# be retried on failure up to the specified number of times. 
This can be -# used in the gerrit workflow to prevent annoying false -1s caused by -# tests that are known to be flaky in master. -# -# If KUDU_REPORT_TEST_RESULTS is non-zero, then tests are reported to the -# central test server. +OUTPUT_ROOT=$1 +shift ROOT=$(cd $(dirname $BASH_SOURCE)/..; pwd) -TEST_LOGDIR=$ROOT/build/test-logs +TEST_LOGDIR=$OUTPUT_ROOT/build/$1-logs mkdir -p $TEST_LOGDIR -TEST_DEBUGDIR=$ROOT/build/test-debug +RUN_TYPE=$1 +shift +TEST_DEBUGDIR=$OUTPUT_ROOT/build/$RUN_TYPE-debug mkdir -p $TEST_DEBUGDIR TEST_DIRNAME=$(cd $(dirname $1); pwd) @@ -43,7 +42,7 @@ TEST_EXECUTABLE="$TEST_DIRNAME/$TEST_FILENAME" TEST_NAME=$(echo $TEST_FILENAME | perl -pe 's/\..+?$//') # Remove path and extension (if any). # We run each test in its own subdir to avoid core file related races. -TEST_WORKDIR=$ROOT/build/test-work/$TEST_NAME +TEST_WORKDIR=$OUTPUT_ROOT/build/test-work/$TEST_NAME mkdir -p $TEST_WORKDIR pushd $TEST_WORKDIR >/dev/null || exit 1 rm -f * @@ -61,55 +60,49 @@ rm -f $LOGFILE $LOGFILE.gz pipe_cmd=cat -# Configure TSAN (ignored if this isn't a TSAN build). -# -# Deadlock detection (new in clang 3.5) is disabled because: -# 1. The clang 3.5 deadlock detector crashes in some unit tests. It -# needs compiler-rt commits c4c3dfd, 9a8efe3, and possibly others. -# 2. Many unit tests report lock-order-inversion warnings; they should be -# fixed before reenabling the detector. -TSAN_OPTIONS="$TSAN_OPTIONS detect_deadlocks=0" -TSAN_OPTIONS="$TSAN_OPTIONS suppressions=$ROOT/build-support/tsan-suppressions.txt" -TSAN_OPTIONS="$TSAN_OPTIONS history_size=7" -export TSAN_OPTIONS - -# Enable leak detection even under LLVM 3.4, where it was disabled by default. -# This flag only takes effect when running an ASAN build. -ASAN_OPTIONS="$ASAN_OPTIONS detect_leaks=1" -export ASAN_OPTIONS - -# Set up suppressions for LeakSanitizer -LSAN_OPTIONS="$LSAN_OPTIONS suppressions=$ROOT/build-support/lsan-suppressions.txt" -export LSAN_OPTIONS - -# Suppressions require symbolization. We'll default to using the symbolizer in -# thirdparty. -if [ -z "$ASAN_SYMBOLIZER_PATH" ]; then - export ASAN_SYMBOLIZER_PATH=$(find $NATIVE_TOOLCHAIN/llvm-3.7.0/bin -name llvm-symbolizer) -fi - # Allow for collecting core dumps. ARROW_TEST_ULIMIT_CORE=${ARROW_TEST_ULIMIT_CORE:-0} ulimit -c $ARROW_TEST_ULIMIT_CORE -# Run the actual test. -for ATTEMPT_NUMBER in $(seq 1 $TEST_EXECUTION_ATTEMPTS) ; do - if [ $ATTEMPT_NUMBER -lt $TEST_EXECUTION_ATTEMPTS ]; then - # If the test fails, the test output may or may not be left behind, - # depending on whether the test cleaned up or exited immediately. Either - # way we need to clean it up. We do this by comparing the data directory - # contents before and after the test runs, and deleting anything new. - # - # The comm program requires that its two inputs be sorted. - TEST_TMPDIR_BEFORE=$(find $TEST_TMPDIR -maxdepth 1 -type d | sort) + +function setup_sanitizers() { + # Sets environment variables for different sanitizers (it configures how) the run_tests. Function works. + + # Configure TSAN (ignored if this isn't a TSAN build). + # + # Deadlock detection (new in clang 3.5) is disabled because: + # 1. The clang 3.5 deadlock detector crashes in some unit tests. It + # needs compiler-rt commits c4c3dfd, 9a8efe3, and possibly others. + # 2. Many unit tests report lock-order-inversion warnings; they should be + # fixed before reenabling the detector. 
+ TSAN_OPTIONS="$TSAN_OPTIONS detect_deadlocks=0" + TSAN_OPTIONS="$TSAN_OPTIONS suppressions=$ROOT/build-support/tsan-suppressions.txt" + TSAN_OPTIONS="$TSAN_OPTIONS history_size=7" + export TSAN_OPTIONS + + # Enable leak detection even under LLVM 3.4, where it was disabled by default. + # This flag only takes effect when running an ASAN build. + ASAN_OPTIONS="$ASAN_OPTIONS detect_leaks=1" + export ASAN_OPTIONS + + # Set up suppressions for LeakSanitizer + LSAN_OPTIONS="$LSAN_OPTIONS suppressions=$ROOT/build-support/lsan-suppressions.txt" + export LSAN_OPTIONS + + # Suppressions require symbolization. We'll default to using the symbolizer in + # thirdparty. + if [ -z "$ASAN_SYMBOLIZER_PATH" ]; then + export ASAN_SYMBOLIZER_PATH=$(find $NATIVE_TOOLCHAIN/llvm-3.7.0/bin -name llvm-symbolizer) fi +} + +function run_test() { + # Run gtest style tests with sanitizers if they are setup appropriately. # gtest won't overwrite old junit test files, resulting in a build failure # even when retries are successful. rm -f $XMLFILE - echo "Running $TEST_NAME, redirecting output into $LOGFILE" \ - "(attempt ${ATTEMPT_NUMBER}/$TEST_EXECUTION_ATTEMPTS)" $TEST_EXECUTABLE "$@" 2>&1 \ | $ROOT/build-support/asan_symbolize.py \ | c++filt \ @@ -131,6 +124,46 @@ for ATTEMPT_NUMBER in $(seq 1 $TEST_EXECUTION_ATTEMPTS) ; do STATUS=1 rm -f $XMLFILE fi +} + +function post_process_tests() { + # If we have a LeakSanitizer report, and XML reporting is configured, add a new test + # case result to the XML file for the leak report. Otherwise Jenkins won't show + # us which tests had LSAN errors. + if zgrep --silent "ERROR: LeakSanitizer: detected memory leaks" $LOGFILE ; then + echo Test had memory leaks. Editing XML + perl -p -i -e ' + if (m##) { + print "\n"; + print " \n"; + print " See txt log file for details\n"; + print " \n"; + print "\n"; + }' $XMLFILE + fi +} + +function run_other() { + # Generic run function for test like executables that aren't actually gtest + $TEST_EXECUTABLE "$@" 2>&1 | $pipe_cmd > $LOGFILE + STATUS=$? +} + +if [ $RUN_TYPE = "test" ]; then + setup_sanitizers +fi + +# Run the actual test. +for ATTEMPT_NUMBER in $(seq 1 $TEST_EXECUTION_ATTEMPTS) ; do + if [ $ATTEMPT_NUMBER -lt $TEST_EXECUTION_ATTEMPTS ]; then + # If the test fails, the test output may or may not be left behind, + # depending on whether the test cleaned up or exited immediately. Either + # way we need to clean it up. We do this by comparing the data directory + # contents before and after the test runs, and deleting anything new. + # + # The comm program requires that its two inputs be sorted. + TEST_TMPDIR_BEFORE=$(find $TEST_TMPDIR -maxdepth 1 -type d | sort) + fi if [ $ATTEMPT_NUMBER -lt $TEST_EXECUTION_ATTEMPTS ]; then # Now delete any new test output. @@ -150,7 +183,13 @@ for ATTEMPT_NUMBER in $(seq 1 $TEST_EXECUTION_ATTEMPTS) ; do fi done fi - + echo "Running $TEST_NAME, redirecting output into $LOGFILE" \ + "(attempt ${ATTEMPT_NUMBER}/$TEST_EXECUTION_ATTEMPTS)" + if [ $RUN_TYPE = "test" ]; then + run_test $* + else + run_other $* + fi if [ "$STATUS" -eq "0" ]; then break elif [ "$ATTEMPT_NUMBER" -lt "$TEST_EXECUTION_ATTEMPTS" ]; then @@ -159,19 +198,8 @@ for ATTEMPT_NUMBER in $(seq 1 $TEST_EXECUTION_ATTEMPTS) ; do fi done -# If we have a LeakSanitizer report, and XML reporting is configured, add a new test -# case result to the XML file for the leak report. Otherwise Jenkins won't show -# us which tests had LSAN errors. 
-if zgrep --silent "ERROR: LeakSanitizer: detected memory leaks" $LOGFILE ; then - echo Test had memory leaks. Editing XML - perl -p -i -e ' - if (m##) { - print "\n"; - print " \n"; - print " See txt log file for details\n"; - print " \n"; - print "\n"; - }' $XMLFILE +if [ $RUN_TYPE = "test" ]; then + post_process_tests fi # Capture and compress core file and binary. diff --git a/cpp/cmake_modules/FindGBenchmark.cmake b/cpp/cmake_modules/FindGBenchmark.cmake new file mode 100644 index 0000000000000..3e46a60f5e68a --- /dev/null +++ b/cpp/cmake_modules/FindGBenchmark.cmake @@ -0,0 +1,88 @@ +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Tries to find Google benchmark headers and libraries. +# +# Usage of this module as follows: +# +# find_package(GBenchmark) +# +# Variables used by this module, they can change the default behaviour and need +# to be set before calling find_package: +# +# GBenchmark_HOME - When set, this path is inspected instead of standard library +# locations as the root of the benchmark installation. +# The environment variable GBENCHMARK_HOME overrides this variable. +# +# This module defines +# GBENCHMARK_INCLUDE_DIR, directory containing benchmark header directory +# GBENCHMARK_LIBS, directory containing benchmark libraries +# GBENCHMARK_STATIC_LIB, path to libbenchmark.a +# GBENCHMARK_FOUND, whether gbenchmark has been found +
+if( NOT "$ENV{GBENCHMARK_HOME}" STREQUAL "") + file( TO_CMAKE_PATH "$ENV{GBENCHMARK_HOME}" _native_path ) + list( APPEND _gbenchmark_roots ${_native_path} ) +elseif ( GBenchmark_HOME ) + list( APPEND _gbenchmark_roots ${GBenchmark_HOME} ) +endif() + +# Try the parameterized roots, if they exist +if ( _gbenchmark_roots ) + find_path( GBENCHMARK_INCLUDE_DIR NAMES benchmark/benchmark.h + PATHS ${_gbenchmark_roots} NO_DEFAULT_PATH + PATH_SUFFIXES "include" ) + find_library( GBENCHMARK_LIBRARIES NAMES benchmark + PATHS ${_gbenchmark_roots} NO_DEFAULT_PATH + PATH_SUFFIXES "lib" ) +else () + find_path( GBENCHMARK_INCLUDE_DIR NAMES benchmark/benchmark.h ) + find_library( GBENCHMARK_LIBRARIES NAMES benchmark ) +endif () + + +if (GBENCHMARK_INCLUDE_DIR AND GBENCHMARK_LIBRARIES) + set(GBENCHMARK_FOUND TRUE) + get_filename_component( GBENCHMARK_LIBS ${GBENCHMARK_LIBRARIES} PATH ) + set(GBENCHMARK_LIB_NAME libbenchmark) + set(GBENCHMARK_STATIC_LIB ${GBENCHMARK_LIBS}/${GBENCHMARK_LIB_NAME}.a) +else () + set(GBENCHMARK_FOUND FALSE) +endif () + +if (GBENCHMARK_FOUND) + if (NOT GBenchmark_FIND_QUIETLY) + message(STATUS "Found the GBenchmark library: ${GBENCHMARK_LIBRARIES}") + endif () +else () + if (NOT GBenchmark_FIND_QUIETLY) + set(GBENCHMARK_ERR_MSG "Could not find the GBenchmark library.
Looked in ") + if ( _gbenchmark_roots ) + set(GBENCHMARK_ERR_MSG "${GBENCHMARK_ERR_MSG} in ${_gbenchmark_roots}.") + else () + set(GBENCHMARK_ERR_MSG "${GBENCHMARK_ERR_MSG} system search paths.") + endif () + if (GBenchmark_FIND_REQUIRED) + message(FATAL_ERROR "${GBENCHMARK_ERR_MSG}") + else (GBenchmark_FIND_REQUIRED) + message(STATUS "${GBENCHMARK_ERR_MSG}") + endif (GBenchmark_FIND_REQUIRED) + endif () +endif () + +mark_as_advanced( + GBENCHMARK_INCLUDE_DIR + GBENCHMARK_LIBS + GBENCHMARK_LIBRARIES + GBENCHMARK_STATIC_LIB +) diff --git a/cpp/setup_build_env.sh b/cpp/setup_build_env.sh index 1a33fe386f103..04688e7d59400 100755 --- a/cpp/setup_build_env.sh +++ b/cpp/setup_build_env.sh @@ -7,5 +7,6 @@ SOURCE_DIR=$(cd "$(dirname "${BASH_SOURCE:-$0}")"; pwd) source thirdparty/versions.sh export GTEST_HOME=$SOURCE_DIR/thirdparty/$GTEST_BASEDIR +export GBENCHMARK_HOME=$SOURCE_DIR/thirdparty/installed echo "Build env initialized" diff --git a/cpp/src/arrow/table/CMakeLists.txt b/cpp/src/arrow/table/CMakeLists.txt index 26d843d853bfb..d9f00e74a37db 100644 --- a/cpp/src/arrow/table/CMakeLists.txt +++ b/cpp/src/arrow/table/CMakeLists.txt @@ -29,3 +29,5 @@ install(FILES ADD_ARROW_TEST(column-test) ADD_ARROW_TEST(schema-test) ADD_ARROW_TEST(table-test) + +ADD_ARROW_BENCHMARK(column-benchmark) diff --git a/cpp/src/arrow/table/column-benchmark.cc b/cpp/src/arrow/table/column-benchmark.cc new file mode 100644 index 0000000000000..c01146d7b096f --- /dev/null +++ b/cpp/src/arrow/table/column-benchmark.cc @@ -0,0 +1,55 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ + +#include "benchmark/benchmark.h" + +#include "arrow/test-util.h" +#include "arrow/table/test-common.h" +#include "arrow/types/integer.h" +#include "arrow/util/memory-pool.h" + +namespace arrow { +namespace { + template + std::shared_ptr MakePrimitive(int32_t length, int32_t null_count = 0) { + auto pool = GetDefaultMemoryPool(); + auto data = std::make_shared(pool); + auto nulls = std::make_shared(pool); + data->Resize(length * sizeof(typename ArrayType::value_type)); + nulls->Resize(util::bytes_for_bits(length)); + return std::make_shared(length, data, 10, nulls); + } +} // anonymous namespace + + +static void BM_BuildInt32ColumnByChunk(benchmark::State& state) { //NOLINT non-const reference + ArrayVector arrays; + for (int chunk_n = 0; chunk_n < state.range_x(); ++chunk_n) { + arrays.push_back(MakePrimitive(100, 10)); + } + const auto INT32 = std::make_shared(); + const auto field = std::make_shared("c0", INT32); + std::unique_ptr column; + while (state.KeepRunning()) { + column.reset(new Column(field, arrays)); + } +} + +BENCHMARK(BM_BuildInt32ColumnByChunk)->Range(5, 50000); + +} // namespace arrow diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt index d8e2f98f2c85e..fed05e3690c74 100644 --- a/cpp/src/arrow/util/CMakeLists.txt +++ b/cpp/src/arrow/util/CMakeLists.txt @@ -51,6 +51,20 @@ if (ARROW_BUILD_TESTS) endif() endif() +if (ARROW_BUILD_BENCHMARKS) + add_library(arrow_benchmark_main benchmark_main.cc) + if (APPLE) + target_link_libraries(arrow_benchmark_main + benchmark + ) + else() + target_link_libraries(arrow_benchmark_main + benchmark + pthread + ) + endif() +endif() + ADD_ARROW_TEST(bit-util-test) ADD_ARROW_TEST(buffer-test) ADD_ARROW_TEST(memory-pool-test) diff --git a/cpp/src/arrow/util/benchmark_main.cc b/cpp/src/arrow/util/benchmark_main.cc new file mode 100644 index 0000000000000..c9739af03fb53 --- /dev/null +++ b/cpp/src/arrow/util/benchmark_main.cc @@ -0,0 +1,24 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "benchmark/benchmark.h" + +int main(int argc, char** argv) { + benchmark::Initialize(&argc, argv); + benchmark::RunSpecifiedBenchmarks(); + return 0; +} diff --git a/cpp/thirdparty/build_thirdparty.sh b/cpp/thirdparty/build_thirdparty.sh index beb248803594c..294737cc50522 100755 --- a/cpp/thirdparty/build_thirdparty.sh +++ b/cpp/thirdparty/build_thirdparty.sh @@ -16,6 +16,7 @@ else for arg in "$*"; do case $arg in "gtest") F_GTEST=1 ;; + "gbenchmark") F_GBENCHMARK=1 ;; *) echo "Unknown module: $arg"; exit 1 ;; esac done @@ -47,13 +48,15 @@ export PATH=$PREFIX/bin:$PATH type cmake >/dev/null 2>&1 || { echo >&2 "cmake not installed. Aborting."; exit 1; } type make >/dev/null 2>&1 || { echo >&2 "make not installed. 
Aborting."; exit 1; } +STANDARD_DARWIN_FLAGS="-std=c++11 -stdlib=libc++" + # build googletest GOOGLETEST_ERROR="failed for googletest!" if [ -n "$F_ALL" -o -n "$F_GTEST" ]; then cd $TP_DIR/$GTEST_BASEDIR if [[ "$OSTYPE" == "darwin"* ]]; then - CXXFLAGS=-fPIC cmake -DCMAKE_CXX_FLAGS="-std=c++11 -stdlib=libc++ -DGTEST_USE_OWN_TR1_TUPLE=1 -Wno-unused-value -Wno-ignored-attributes" || { echo "cmake $GOOGLETEST_ERROR" ; exit 1; } + CXXFLAGS=-fPIC cmake -DCMAKE_CXX_FLAGS="$STANDARD_DARWIN_FLAGS -DGTEST_USE_OWN_TR1_TUPLE=1 -Wno-unused-value -Wno-ignored-attributes" || { echo "cmake $GOOGLETEST_ERROR" ; exit 1; } else CXXFLAGS=-fPIC cmake . || { echo "cmake $GOOGLETEST_ERROR"; exit 1; } fi @@ -61,5 +64,20 @@ if [ -n "$F_ALL" -o -n "$F_GTEST" ]; then make VERBOSE=1 || { echo "Make $GOOGLETEST_ERROR" ; exit 1; } fi +# build google benchmark +GBENCHMARK_ERROR="failed for google benchmark" +if [ -n "$F_ALL" -o -n "$F_GBENCHMARK" ]; then + cd $TP_DIR/$GBENCHMARK_BASEDIR + + CMAKE_CXX_FLAGS="--std=c++11" + if [[ "$OSTYPE" == "darwin"* ]]; then + CMAKE_CXX_FLAGS=$STANDARD_DARWIN_FLAGS + fi + cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$PREFIX -DCMAKE_CXX_FLAGS="-fPIC $CMAKE_CXX_FLAGS" . || { echo "cmake $GBENCHMARK_ERROR" ; exit 1; } + + make VERBOSE=1 install || { echo "make $GBENCHMARK_ERROR" ; exit 1; } +fi + + echo "---------------------" echo "Thirdparty dependencies built and installed into $PREFIX successfully" diff --git a/cpp/thirdparty/download_thirdparty.sh b/cpp/thirdparty/download_thirdparty.sh index c18dd4d8e80ab..d22c559b3e3ba 100755 --- a/cpp/thirdparty/download_thirdparty.sh +++ b/cpp/thirdparty/download_thirdparty.sh @@ -19,3 +19,9 @@ if [ ! -d ${GTEST_BASEDIR} ]; then echo "Fetching gtest" download_extract_and_cleanup $GTEST_URL fi + +echo ${GBENCHMARK_BASEDIR} +if [ ! -d ${GBENCHMARK_BASEDIR} ]; then + echo "Fetching google benchmark" + download_extract_and_cleanup $GBENCHMARK_URL +fi diff --git a/cpp/thirdparty/versions.sh b/cpp/thirdparty/versions.sh index 12ad56ef00103..9cfc7cd94b58c 100755 --- a/cpp/thirdparty/versions.sh +++ b/cpp/thirdparty/versions.sh @@ -1,3 +1,7 @@ GTEST_VERSION=1.7.0 GTEST_URL="https://github.com/google/googletest/archive/release-${GTEST_VERSION}.tar.gz" GTEST_BASEDIR=googletest-release-$GTEST_VERSION + +GBENCHMARK_VERSION=1.0.0 +GBENCHMARK_URL="https://github.com/google/benchmark/archive/v${GBENCHMARK_VERSION}.tar.gz" +GBENCHMARK_BASEDIR=benchmark-$GBENCHMARK_VERSION From 093f9bd8c30b1b77b3e6e7a4123cab9a6dd9daa1 Mon Sep 17 00:00:00 2001 From: Dan Robinson Date: Tue, 22 Mar 2016 14:15:38 -0700 Subject: [PATCH 0038/1644] ARROW-75: Fix handling of empty strings Fixes [ARROW-75](https://issues.apache.org/jira/browse/ARROW-75) (and changes Python tests to verify that behavior). 
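(Editor's note: the heart of this fix, visible in the primitive.h hunk below, is to skip the copy entirely for zero-length appends. Calling memcpy with a null source pointer is undefined behavior even when the length is zero, and an empty append may legitimately arrive with no backing values. Distilled into a standalone sketch, with an invented helper name rather than the builder's real interface:)

    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Illustrative only: a zero-length append may pass values == nullptr,
    // so the copy must be guarded rather than relying on a 0-byte memcpy.
    void AppendValues(std::uint8_t* dst, const std::uint8_t* values, std::size_t nbytes) {
      if (nbytes > 0) {
        std::memcpy(dst, values, nbytes);
      }
    }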
Author: Dan Robinson Closes #32 from danrobinson/ARROW-75 and squashes the following commits: cb8e527 [Dan Robinson] ARROW-75: remove whitespace 9604a21 [Dan Robinson] ARROW-75: Changed tests 722df19 [Dan Robinson] ARROW-75: Fixed braces 1ef3b75 [Dan Robinson] ARROW-75: Fix handling of empty strings --- cpp/src/arrow/types/primitive.h | 4 +++- cpp/src/arrow/types/string-test.cc | 2 +- python/pyarrow/tests/test_array.py | 6 +++--- 3 files changed, 7 insertions(+), 5 deletions(-) diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index 1073bb6e1c340..22ab59c309a1d 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -168,7 +168,9 @@ class PrimitiveBuilder : public ArrayBuilder { int32_t new_capacity = util::next_power2(length_ + length); RETURN_NOT_OK(Resize(new_capacity)); } - memcpy(raw_buffer() + length_, values, length * elsize_); + if (length > 0) { + memcpy(raw_buffer() + length_, values, length * elsize_); + } if (null_bytes != nullptr) { AppendNulls(null_bytes, length); diff --git a/cpp/src/arrow/types/string-test.cc b/cpp/src/arrow/types/string-test.cc index 8e82fd95dd808..6381093dcbb45 100644 --- a/cpp/src/arrow/types/string-test.cc +++ b/cpp/src/arrow/types/string-test.cc @@ -181,7 +181,7 @@ class TestStringBuilder : public TestBuilder { }; TEST_F(TestStringBuilder, TestScalarAppend) { - std::vector strings = {"a", "bb", "", "", "ccc"}; + std::vector strings = {"", "bb", "a", "", "ccc"}; std::vector is_null = {0, 0, 0, 1, 0}; int N = strings.size(); diff --git a/python/pyarrow/tests/test_array.py b/python/pyarrow/tests/test_array.py index 034c1576551d3..36aaaa4f93d5d 100644 --- a/python/pyarrow/tests/test_array.py +++ b/python/pyarrow/tests/test_array.py @@ -39,13 +39,13 @@ def test_list_format(self): assert result == expected def test_string_format(self): - arr = pyarrow.from_pylist(['foo', None, 'bar']) + arr = pyarrow.from_pylist(['', None, 'foo']) result = fmt.array_format(arr) expected = """\ [ - 'foo', + '', NA, - 'bar' + 'foo' ]""" assert result == expected From 65db0da80b6a1fb6887b7ac1df24e2423d41dfb9 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 22 Mar 2016 18:45:13 -0700 Subject: [PATCH 0039/1644] ARROW-67: C++ metadata flatbuffer serialization and data movement to memory maps Several things here: * Add Google flatbuffers dependency * Flatbuffers IDL draft in collaboration with @jacques-n and @stevenmphillips * Add Schema wrapper in Cython * arrow::Schema conversion to/from flatbuffer representation * Remove unneeded physical layout types from type.h * Refactor ListType to be a nested type with a single child * Implement shared memory round-trip for numeric row batches * mmap-based shared memory interface and MemorySource abstract API Quite a bit of judicious code cleaning and consolidation as part of this. For example, List types are now internally equivalent to a nested type with 1 named child field (versus a struct, which can have any number of child fields). Associated JIRAs: ARROW-48, ARROW-57, ARROW-58 Author: Wes McKinney Closes #28 from wesm/cpp-ipc-draft and squashes the following commits: 0cef7ea [Wes McKinney] Add NullArray type now that Array is virtual, fix pyarrow build 5e841f7 [Wes McKinney] Create explicit PrimitiveArray subclasses to avoid unwanted template instantiation 6fa6319 [Wes McKinney] ARROW-28: Draft C++ shared memory IPC workflow and related refactoring / scaffolding / cleaning. 
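(Editor's note: the shared memory round-trip introduced here ultimately rests on plain POSIX mmap: two processes map the same file and exchange bytes in place, with no copy through a pipe or socket. The following is a minimal standalone sketch of that mechanism only; it uses none of the patch's actual classes, and the /tmp path is invented. Error handling is trimmed for brevity.)

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #include <cstddef>
    #include <cstdio>
    #include <cstring>

    int main() {
      const char* path = "/tmp/arrow-mmap-demo";  // hypothetical scratch file
      const std::size_t size = 4096;
      int fd = open(path, O_RDWR | O_CREAT, 0600);
      if (fd < 0 || ftruncate(fd, size) != 0) { return 1; }

      // "Writer" side: map the file and place bytes at a known offset.
      void* addr = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
      if (addr == MAP_FAILED) { return 1; }
      std::memcpy(addr, "row batch bytes", 16);

      // "Reader" side: any process mapping the same file sees these bytes
      // in place; this is the mechanism the new MemorySource API builds on.
      std::printf("%s\n", static_cast<char*>(addr));

      munmap(addr, size);
      close(fd);
      return 0;
    }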
--- ci/travis_before_script_cpp.sh | 9 +- ci/travis_script_cpp.sh | 6 +- cpp/CMakeLists.txt | 96 ++++-- cpp/cmake_modules/FindFlatbuffers.cmake | 95 ++++++ cpp/setup_build_env.sh | 5 +- cpp/src/arrow/CMakeLists.txt | 8 + cpp/src/arrow/api.h | 11 +- cpp/src/arrow/array-test.cc | 14 +- cpp/src/arrow/array.cc | 26 +- cpp/src/arrow/array.h | 27 +- cpp/src/arrow/builder.h | 2 +- cpp/src/arrow/{table => }/column-benchmark.cc | 5 +- cpp/src/arrow/{table => }/column-test.cc | 10 +- cpp/src/arrow/{table => }/column.cc | 4 +- cpp/src/arrow/{table => }/column.h | 13 +- cpp/src/arrow/ipc/.gitignore | 1 + cpp/src/arrow/ipc/CMakeLists.txt | 51 +++ cpp/src/arrow/ipc/adapter.cc | 305 +++++++++++++++++ cpp/src/arrow/ipc/adapter.h | 86 +++++ cpp/src/arrow/ipc/ipc-adapter-test.cc | 112 +++++++ cpp/src/arrow/ipc/ipc-memory-test.cc | 82 +++++ cpp/src/arrow/ipc/ipc-metadata-test.cc | 99 ++++++ cpp/src/arrow/ipc/memory.cc | 162 +++++++++ cpp/src/arrow/ipc/memory.h | 131 ++++++++ cpp/src/arrow/ipc/metadata-internal.cc | 317 ++++++++++++++++++ cpp/src/arrow/ipc/metadata-internal.h | 69 ++++ cpp/src/arrow/ipc/metadata.cc | 238 +++++++++++++ cpp/src/arrow/ipc/metadata.h | 146 ++++++++ .../{types/floating.h => ipc/test-common.h} | 43 ++- cpp/src/arrow/{table => }/schema-test.cc | 48 ++- cpp/src/arrow/{table => }/schema.cc | 11 +- cpp/src/arrow/{table => }/schema.h | 8 +- cpp/src/arrow/{table => }/table-test.cc | 18 +- cpp/src/arrow/{table => }/table.cc | 35 +- cpp/src/arrow/{table => }/table.h | 58 +++- cpp/src/arrow/table/test-common.h | 54 --- cpp/src/arrow/test-util.h | 68 +++- cpp/src/arrow/type.cc | 24 +- cpp/src/arrow/type.h | 177 ++++------ cpp/src/arrow/types/CMakeLists.txt | 2 - cpp/src/arrow/types/boolean.h | 2 +- cpp/src/arrow/types/collection.h | 2 +- cpp/src/arrow/types/construct.cc | 53 +-- cpp/src/arrow/types/construct.h | 11 +- cpp/src/arrow/types/datetime.h | 16 +- cpp/src/arrow/types/floating.cc | 22 -- cpp/src/arrow/types/integer.cc | 22 -- cpp/src/arrow/types/integer.h | 57 ---- cpp/src/arrow/types/json.cc | 1 - cpp/src/arrow/types/json.h | 4 +- cpp/src/arrow/types/list-test.cc | 28 +- cpp/src/arrow/types/list.cc | 29 ++ cpp/src/arrow/types/list.h | 28 +- cpp/src/arrow/types/primitive-test.cc | 41 +-- cpp/src/arrow/types/primitive.cc | 16 +- cpp/src/arrow/types/primitive.h | 102 +++--- cpp/src/arrow/types/string-test.cc | 54 ++- cpp/src/arrow/types/string.h | 55 +-- cpp/src/arrow/types/struct-test.cc | 15 +- cpp/src/arrow/types/test-common.h | 5 +- cpp/src/arrow/types/union.h | 18 +- cpp/src/arrow/util/bit-util-test.cc | 4 +- cpp/src/arrow/util/bit-util.h | 1 - cpp/src/arrow/util/buffer-test.cc | 3 +- cpp/src/arrow/util/buffer.cc | 2 +- cpp/src/arrow/util/memory-pool-test.cc | 7 +- cpp/src/arrow/util/memory-pool.cc | 6 +- cpp/src/arrow/util/memory-pool.h | 2 +- cpp/src/arrow/util/status.cc | 3 + cpp/src/arrow/util/status.h | 6 + cpp/src/arrow/util/test_main.cc | 2 +- cpp/thirdparty/build_thirdparty.sh | 9 + cpp/thirdparty/download_thirdparty.sh | 5 + cpp/thirdparty/versions.sh | 4 + format/Message.fbs | 183 ++++++++++ python/pyarrow/__init__.py | 4 +- python/pyarrow/array.pxd | 2 +- python/pyarrow/array.pyx | 47 ++- python/pyarrow/includes/libarrow.pxd | 107 ++++-- python/pyarrow/includes/pyarrow.pxd | 5 +- python/pyarrow/scalar.pyx | 24 +- python/pyarrow/schema.pxd | 6 +- python/pyarrow/schema.pyx | 155 ++++++--- python/pyarrow/tests/test_schema.py | 28 +- .../pyarrow/tests/test_table.py | 39 ++- python/src/pyarrow/adapters/builtin.cc | 20 +- python/src/pyarrow/helpers.cc | 15 +- 
python/src/pyarrow/helpers.h | 5 +- 88 files changed, 3113 insertions(+), 838 deletions(-) create mode 100644 cpp/cmake_modules/FindFlatbuffers.cmake rename cpp/src/arrow/{table => }/column-benchmark.cc (94%) rename cpp/src/arrow/{table => }/column-test.cc (93%) rename cpp/src/arrow/{table => }/column.cc (96%) rename cpp/src/arrow/{table => }/column.h (93%) create mode 100644 cpp/src/arrow/ipc/.gitignore create mode 100644 cpp/src/arrow/ipc/CMakeLists.txt create mode 100644 cpp/src/arrow/ipc/adapter.cc create mode 100644 cpp/src/arrow/ipc/adapter.h create mode 100644 cpp/src/arrow/ipc/ipc-adapter-test.cc create mode 100644 cpp/src/arrow/ipc/ipc-memory-test.cc create mode 100644 cpp/src/arrow/ipc/ipc-metadata-test.cc create mode 100644 cpp/src/arrow/ipc/memory.cc create mode 100644 cpp/src/arrow/ipc/memory.h create mode 100644 cpp/src/arrow/ipc/metadata-internal.cc create mode 100644 cpp/src/arrow/ipc/metadata-internal.h create mode 100644 cpp/src/arrow/ipc/metadata.cc create mode 100644 cpp/src/arrow/ipc/metadata.h rename cpp/src/arrow/{types/floating.h => ipc/test-common.h} (59%) rename cpp/src/arrow/{table => }/schema-test.cc (72%) rename cpp/src/arrow/{table => }/schema.cc (88%) rename cpp/src/arrow/{table => }/schema.h (91%) rename cpp/src/arrow/{table => }/table-test.cc (92%) rename cpp/src/arrow/{table => }/table.cc (69%) rename cpp/src/arrow/{table => }/table.h (55%) delete mode 100644 cpp/src/arrow/table/test-common.h delete mode 100644 cpp/src/arrow/types/floating.cc delete mode 100644 cpp/src/arrow/types/integer.cc delete mode 100644 cpp/src/arrow/types/integer.h create mode 100644 format/Message.fbs rename cpp/src/arrow/table/CMakeLists.txt => python/pyarrow/tests/test_table.py (58%) diff --git a/ci/travis_before_script_cpp.sh b/ci/travis_before_script_cpp.sh index 49dcc395fbc83..193c76feba1d7 100755 --- a/ci/travis_before_script_cpp.sh +++ b/ci/travis_before_script_cpp.sh @@ -19,7 +19,14 @@ echo $GTEST_HOME : ${ARROW_CPP_INSTALL=$TRAVIS_BUILD_DIR/cpp-install} -cmake -DARROW_BUILD_BENCHMARKS=ON -DCMAKE_INSTALL_PREFIX=$ARROW_CPP_INSTALL -DCMAKE_CXX_FLAGS="-Werror" $CPP_DIR +CMAKE_COMMON_FLAGS="-DARROW_BUILD_BENCHMARKS=ON -DCMAKE_INSTALL_PREFIX=$ARROW_CPP_INSTALL" + +if [ $TRAVIS_OS_NAME == "linux" ]; then + cmake -DARROW_TEST_MEMCHECK=on $CMAKE_COMMON_FLAGS -DCMAKE_CXX_FLAGS="-Werror" $CPP_DIR +else + cmake $CMAKE_COMMON_FLAGS -DCMAKE_CXX_FLAGS="-Werror" $CPP_DIR +fi + make -j4 make install diff --git a/ci/travis_script_cpp.sh b/ci/travis_script_cpp.sh index d96b98f8d37f5..997bdf35e83d2 100755 --- a/ci/travis_script_cpp.sh +++ b/ci/travis_script_cpp.sh @@ -8,10 +8,6 @@ pushd $CPP_BUILD_DIR make lint -if [ $TRAVIS_OS_NAME == "linux" ]; then - valgrind --tool=memcheck --leak-check=yes --error-exitcode=1 ctest -L unittest -else - ctest -L unittest -fi +ctest -L unittest popd diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 268c1d11e1e8e..6d701079b482c 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -51,7 +51,9 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") option(ARROW_PARQUET "Build the Parquet adapter and link to libparquet" OFF) - + option(ARROW_TEST_MEMCHECK + "Run the test suite using valgrind --tool=memcheck" + OFF) option(ARROW_BUILD_TESTS "Build the Arrow googletest unit tests" ON) @@ -60,6 +62,10 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") "Build the Arrow micro benchmarks" OFF) + option(ARROW_IPC + "Build the Arrow IPC extensions" + ON) + endif() if(NOT ARROW_BUILD_TESTS) @@ -260,17 +266,17 @@ 
set(EXECUTABLE_OUTPUT_PATH "${BUILD_OUTPUT_ROOT_DIRECTORY}") include_directories(src) ############################################################ -# Benchmarking +# Benchmarking ############################################################ # Add a new micro benchmark, with or without an executable that should be built. # If benchmarks are enabled then they will be run along side unit tests with ctest. -# 'make runbenchmark' and 'make unittest' to build/run only benchmark or unittests, +# 'make runbenchmark' and 'make unittest' to build/run only benchmark or unittests, # respectively. # # REL_BENCHMARK_NAME is the name of the benchmark app. It may be a single component # (e.g. monotime-benchmark) or contain additional components (e.g. # net/net_util-benchmark). Either way, the last component must be a globally -# unique name. +# unique name. # The benchmark will registered as unit test with ctest with a label # of 'benchmark'. @@ -281,7 +287,7 @@ function(ADD_ARROW_BENCHMARK REL_BENCHMARK_NAME) return() endif() get_filename_component(BENCHMARK_NAME ${REL_BENCHMARK_NAME} NAME_WE) - + if(EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${REL_BENCHMARK_NAME}.cc) # This benchmark has a corresponding .cc file, set it up as an executable. set(BENCHMARK_PATH "${EXECUTABLE_OUTPUT_PATH}/${BENCHMARK_NAME}") @@ -294,7 +300,7 @@ function(ADD_ARROW_BENCHMARK REL_BENCHMARK_NAME) set(BENCHMARK_PATH ${CMAKE_CURRENT_SOURCE_DIR}/${REL_BENCHMARK_NAME}) set(NO_COLOR "") endif() - + add_test(${BENCHMARK_NAME} ${BUILD_SUPPORT_DIR}/run-test.sh ${CMAKE_BINARY_DIR} benchmark ${BENCHMARK_PATH} ${NO_COLOR}) set_tests_properties(${BENCHMARK_NAME} PROPERTIES LABELS "benchmark") @@ -345,9 +351,18 @@ function(ADD_ARROW_TEST REL_TEST_NAME) set(TEST_PATH ${CMAKE_CURRENT_SOURCE_DIR}/${REL_TEST_NAME}) endif() - add_test(${TEST_NAME} - ${BUILD_SUPPORT_DIR}/run-test.sh ${CMAKE_BINARY_DIR} test ${TEST_PATH}) + if (ARROW_TEST_MEMCHECK) + SET_PROPERTY(TARGET ${TEST_NAME} + APPEND_STRING PROPERTY + COMPILE_FLAGS " -DARROW_VALGRIND") + add_test(${TEST_NAME} + valgrind --tool=memcheck --leak-check=full --error-exitcode=1 ${TEST_PATH}) + else() + add_test(${TEST_NAME} + ${BUILD_SUPPORT_DIR}/run-test.sh ${CMAKE_BINARY_DIR} test ${TEST_PATH}) + endif() set_tests_properties(${TEST_NAME} PROPERTIES LABELS "unittest") + if(ARGN) set_tests_properties(${TEST_NAME} PROPERTIES ${ARGN}) endif() @@ -403,7 +418,7 @@ if ("$ENV{GTEST_HOME}" STREQUAL "") set(GTest_HOME ${THIRDPARTY_DIR}/googletest-release-1.7.0) endif() -## Google Benchmark +## Google Benchmark if ("$ENV{GBENCHMARK_HOME}" STREQUAL "") set(GBENCHMARK_HOME ${THIRDPARTY_DIR}/installed) endif() @@ -487,24 +502,10 @@ if (UNIX) add_custom_target(lint ${BUILD_SUPPORT_DIR}/cpplint.py --verbose=2 --linelength=90 - --filter=-whitespace/comments,-readability/todo,-build/header_guard,-build/c++11 - `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc -or -name \\*.h`) + --filter=-whitespace/comments,-readability/todo,-build/header_guard,-build/c++11,-runtime/references + `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc -or -name \\*.h | sed -e '/_generated/g'`) endif (UNIX) -#---------------------------------------------------------------------- -# Parquet adapter - -if(ARROW_PARQUET) - find_package(Parquet REQUIRED) - include_directories(SYSTEM ${PARQUET_INCLUDE_DIR}) - ADD_THIRDPARTY_LIB(parquet - STATIC_LIB ${PARQUET_STATIC_LIB} - SHARED_LIB ${PARQUET_SHARED_LIB}) - - add_subdirectory(src/arrow/parquet) - list(APPEND LINK_LIBS arrow_parquet parquet) -endif() - 
############################################################ # Subdirectories ############################################################ @@ -515,15 +516,18 @@ set(LIBARROW_LINK_LIBS set(ARROW_SRCS src/arrow/array.cc src/arrow/builder.cc + src/arrow/column.cc + src/arrow/schema.cc + src/arrow/table.cc src/arrow/type.cc - src/arrow/table/column.cc - src/arrow/table/schema.cc - src/arrow/table/table.cc + # IPC / Shared memory library; to be turned into an optional component + src/arrow/ipc/adapter.cc + src/arrow/ipc/memory.cc + src/arrow/ipc/metadata.cc + src/arrow/ipc/metadata-internal.cc src/arrow/types/construct.cc - src/arrow/types/floating.cc - src/arrow/types/integer.cc src/arrow/types/json.cc src/arrow/types/list.cc src/arrow/types/primitive.cc @@ -559,9 +563,39 @@ target_link_libraries(arrow ${LIBARROW_LINK_LIBS}) add_subdirectory(src/arrow) add_subdirectory(src/arrow/util) -add_subdirectory(src/arrow/table) add_subdirectory(src/arrow/types) install(TARGETS arrow LIBRARY DESTINATION lib ARCHIVE DESTINATION lib) + +#---------------------------------------------------------------------- +# Parquet adapter library + +if(ARROW_PARQUET) + find_package(Parquet REQUIRED) + include_directories(SYSTEM ${PARQUET_INCLUDE_DIR}) + ADD_THIRDPARTY_LIB(parquet + STATIC_LIB ${PARQUET_STATIC_LIB} + SHARED_LIB ${PARQUET_SHARED_LIB}) + + add_subdirectory(src/arrow/parquet) + list(APPEND LINK_LIBS arrow_parquet parquet) +endif() + +#---------------------------------------------------------------------- +# IPC library + +## Flatbuffers +if(ARROW_IPC) + find_package(Flatbuffers REQUIRED) + message(STATUS "Flatbuffers include dir: ${FLATBUFFERS_INCLUDE_DIR}") + message(STATUS "Flatbuffers static library: ${FLATBUFFERS_STATIC_LIB}") + message(STATUS "Flatbuffers compiler: ${FLATBUFFERS_COMPILER}") + include_directories(SYSTEM ${FLATBUFFERS_INCLUDE_DIR}) + add_library(flatbuffers STATIC IMPORTED) + set_target_properties(flatbuffers PROPERTIES + IMPORTED_LOCATION ${FLATBUFFERS_STATIC_LIB}) + + add_subdirectory(src/arrow/ipc) +endif() diff --git a/cpp/cmake_modules/FindFlatbuffers.cmake b/cpp/cmake_modules/FindFlatbuffers.cmake new file mode 100644 index 0000000000000..ee472d1c8995f --- /dev/null +++ b/cpp/cmake_modules/FindFlatbuffers.cmake @@ -0,0 +1,95 @@ +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Tries to find Flatbuffers headers and libraries. +# +# Usage of this module as follows: +# +# find_package(Flatbuffers) +# +# Variables used by this module, they can change the default behaviour and need +# to be set before calling find_package: +# +# Flatbuffers_HOME - +# When set, this path is inspected instead of standard library locations as +# the root of the Flatbuffers installation. The environment variable +# FLATBUFFERS_HOME overrides this veriable. 
+# +# This module defines +# FLATBUFFERS_INCLUDE_DIR, directory containing headers +# FLATBUFFERS_LIBS, directory containing flatbuffers libraries +# FLATBUFFERS_STATIC_LIB, path to libflatbuffers.a +# FLATBUFFERS_FOUND, whether flatbuffers has been found + +if( NOT "$ENV{FLATBUFFERS_HOME}" STREQUAL "") + file( TO_CMAKE_PATH "$ENV{FLATBUFFERS_HOME}" _native_path ) + list( APPEND _flatbuffers_roots ${_native_path} ) +elseif ( Flatbuffers_HOME ) + list( APPEND _flatbuffers_roots ${Flatbuffers_HOME} ) +endif() + +# Try the parameterized roots, if they exist +if ( _flatbuffers_roots ) + find_path( FLATBUFFERS_INCLUDE_DIR NAMES flatbuffers/flatbuffers.h + PATHS ${_flatbuffers_roots} NO_DEFAULT_PATH + PATH_SUFFIXES "include" ) + find_library( FLATBUFFERS_LIBRARIES NAMES flatbuffers + PATHS ${_flatbuffers_roots} NO_DEFAULT_PATH + PATH_SUFFIXES "lib" ) +else () + find_path( FLATBUFFERS_INCLUDE_DIR NAMES flatbuffers/flatbuffers.h ) + find_library( FLATBUFFERS_LIBRARIES NAMES flatbuffers ) +endif () + +find_program(FLATBUFFERS_COMPILER flatc + $ENV{FLATBUFFERS_HOME}/bin + /usr/local/bin + /usr/bin + NO_DEFAULT_PATH +) + +if (FLATBUFFERS_INCLUDE_DIR AND FLATBUFFERS_LIBRARIES) + set(FLATBUFFERS_FOUND TRUE) + get_filename_component( FLATBUFFERS_LIBS ${FLATBUFFERS_LIBRARIES} PATH ) + set(FLATBUFFERS_LIB_NAME libflatbuffers) + set(FLATBUFFERS_STATIC_LIB ${FLATBUFFERS_LIBS}/${FLATBUFFERS_LIB_NAME}.a) +else () + set(FLATBUFFERS_FOUND FALSE) +endif () + +if (FLATBUFFERS_FOUND) + if (NOT Flatbuffers_FIND_QUIETLY) + message(STATUS "Found the Flatbuffers library: ${FLATBUFFERS_LIBRARIES}") + endif () +else () + if (NOT Flatbuffers_FIND_QUIETLY) + set(FLATBUFFERS_ERR_MSG "Could not find the Flatbuffers library. Looked in ") + if ( _flatbuffers_roots ) + set(FLATBUFFERS_ERR_MSG "${FLATBUFFERS_ERR_MSG} in ${_flatbuffers_roots}.") + else () + set(FLATBUFFERS_ERR_MSG "${FLATBUFFERS_ERR_MSG} system search paths.") + endif () + if (Flatbuffers_FIND_REQUIRED) + message(FATAL_ERROR "${FLATBUFFERS_ERR_MSG}") + else (Flatbuffers_FIND_REQUIRED) + message(STATUS "${FLATBUFFERS_ERR_MSG}") + endif (Flatbuffers_FIND_REQUIRED) + endif () +endif () + +mark_as_advanced( + FLATBUFFERS_INCLUDE_DIR + FLATBUFFERS_LIBS + FLATBUFFERS_STATIC_LIB + FLATBUFFERS_COMPILER +) diff --git a/cpp/setup_build_env.sh b/cpp/setup_build_env.sh index 04688e7d59400..6520dbd43f705 100755 --- a/cpp/setup_build_env.sh +++ b/cpp/setup_build_env.sh @@ -2,11 +2,12 @@ SOURCE_DIR=$(cd "$(dirname "${BASH_SOURCE:-$0}")"; pwd) -./thirdparty/download_thirdparty.sh || { echo "download_thirdparty.sh failed" ; return; } -./thirdparty/build_thirdparty.sh || { echo "build_thirdparty.sh failed" ; return; } +./thirdparty/download_thirdparty.sh || { echo "download_thirdparty.sh failed" ; return; } +./thirdparty/build_thirdparty.sh || { echo "build_thirdparty.sh failed" ; return; } source thirdparty/versions.sh export GTEST_HOME=$SOURCE_DIR/thirdparty/$GTEST_BASEDIR export GBENCHMARK_HOME=$SOURCE_DIR/thirdparty/installed +export FLATBUFFERS_HOME=$SOURCE_DIR/thirdparty/installed echo "Build env initialized" diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index 73e6a9b22c94a..2d42edcfbd499 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -19,7 +19,10 @@ install(FILES api.h array.h + column.h builder.h + schema.h + table.h type.h DESTINATION include/arrow) @@ -30,3 +33,8 @@ install(FILES set(ARROW_TEST_LINK_LIBS ${ARROW_MIN_TEST_LIBS}) ADD_ARROW_TEST(array-test) +ADD_ARROW_TEST(column-test) +ADD_ARROW_TEST(schema-test) 
+ADD_ARROW_TEST(table-test) + +ADD_ARROW_BENCHMARK(column-benchmark) diff --git a/cpp/src/arrow/api.h b/cpp/src/arrow/api.h index c73d4b386cf54..7be7f88c22eb6 100644 --- a/cpp/src/arrow/api.h +++ b/cpp/src/arrow/api.h @@ -22,20 +22,19 @@ #include "arrow/array.h" #include "arrow/builder.h" +#include "arrow/column.h" +#include "arrow/schema.h" +#include "arrow/table.h" #include "arrow/type.h" -#include "arrow/table/column.h" -#include "arrow/table/schema.h" -#include "arrow/table/table.h" - #include "arrow/types/boolean.h" #include "arrow/types/construct.h" -#include "arrow/types/floating.h" -#include "arrow/types/integer.h" #include "arrow/types/list.h" +#include "arrow/types/primitive.h" #include "arrow/types/string.h" #include "arrow/types/struct.h" +#include "arrow/util/buffer.h" #include "arrow/util/memory-pool.h" #include "arrow/util/status.h" diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc index df827aaa113aa..eded5941e892e 100644 --- a/cpp/src/arrow/array-test.cc +++ b/cpp/src/arrow/array-test.cc @@ -15,30 +15,26 @@ // specific language governing permissions and limitations // under the License. -#include - #include #include #include #include +#include "gtest/gtest.h" + #include "arrow/array.h" #include "arrow/test-util.h" #include "arrow/type.h" -#include "arrow/types/integer.h" #include "arrow/types/primitive.h" #include "arrow/util/buffer.h" #include "arrow/util/memory-pool.h" -#include "arrow/util/status.h" namespace arrow { -static TypePtr int32 = TypePtr(new Int32Type()); - class TestArray : public ::testing::Test { public: void SetUp() { - pool_ = GetDefaultMemoryPool(); + pool_ = default_memory_pool(); } protected: @@ -75,10 +71,10 @@ TEST_F(TestArray, TestIsNull) { if (x > 0) ++null_count; } - std::shared_ptr null_buf = bytes_to_null_buffer(nulls.data(), + std::shared_ptr null_buf = test::bytes_to_null_buffer(nulls.data(), nulls.size()); std::unique_ptr arr; - arr.reset(new Array(int32, nulls.size(), null_count, null_buf)); + arr.reset(new Int32Array(nulls.size(), nullptr, null_count, null_buf)); ASSERT_EQ(null_count, arr->null_count()); ASSERT_EQ(5, null_buf->size()); diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index ee4ef66d11e26..5a5bc1069db13 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -28,11 +28,6 @@ namespace arrow { Array::Array(const TypePtr& type, int32_t length, int32_t null_count, const std::shared_ptr& nulls) { - Init(type, length, null_count, nulls); -} - -void Array::Init(const TypePtr& type, int32_t length, int32_t null_count, - const std::shared_ptr& nulls) { type_ = type; length_ = length; null_count_ = null_count; @@ -42,4 +37,25 @@ void Array::Init(const TypePtr& type, int32_t length, int32_t null_count, } } +bool Array::EqualsExact(const Array& other) const { + if (this == &other) return true; + if (length_ != other.length_ || null_count_ != other.null_count_ || + type_enum() != other.type_enum()) { + return false; + } + if (null_count_ > 0) { + return nulls_->Equals(*other.nulls_, util::bytes_for_bits(length_)); + } else { + return true; + } +} + +bool NullArray::Equals(const std::shared_ptr& arr) const { + if (this == arr.get()) return true; + if (Type::NA != arr->type_enum()) { + return false; + } + return arr->length() == length_; +} + } // namespace arrow diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 85e853e2ae5e2..65fc0aaf583e9 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -40,20 +40,11 @@ class Buffer; // explicitly increment its reference count 
class Array { public: - Array() : - null_count_(0), - length_(0), - nulls_(nullptr), - null_bits_(nullptr) {} - Array(const TypePtr& type, int32_t length, int32_t null_count = 0, const std::shared_ptr& nulls = nullptr); virtual ~Array() {} - void Init(const TypePtr& type, int32_t length, int32_t null_count, - const std::shared_ptr& nulls); - // Determine if a slot is null. For inner loops. Does *not* boundscheck bool IsNull(int i) const { return null_count_ > 0 && util::get_bit(null_bits_, i); @@ -63,12 +54,15 @@ class Array { int32_t null_count() const { return null_count_;} const std::shared_ptr& type() const { return type_;} - LogicalType::type logical_type() const { return type_->type;} + Type::type type_enum() const { return type_->type;} const std::shared_ptr& nulls() const { return nulls_; } + bool EqualsExact(const Array& arr) const; + virtual bool Equals(const std::shared_ptr& arr) const = 0; + protected: TypePtr type_; int32_t null_count_; @@ -78,9 +72,22 @@ class Array { const uint8_t* null_bits_; private: + Array() {} DISALLOW_COPY_AND_ASSIGN(Array); }; +// Degenerate null type Array +class NullArray : public Array { + public: + NullArray(const std::shared_ptr& type, int32_t length) : + Array(type, length, length, nullptr) {} + + explicit NullArray(int32_t length) : + NullArray(std::make_shared(), length) {} + + bool Equals(const std::shared_ptr& arr) const override; +}; + typedef std::shared_ptr ArrayPtr; } // namespace arrow diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index 8cc689c3e81ee..d5d1fdf95af17 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -99,7 +99,7 @@ class ArrayBuilder { int32_t capacity_; // Child value array builders. These are owned by this class - std::vector > children_; + std::vector> children_; private: DISALLOW_COPY_AND_ASSIGN(ArrayBuilder); diff --git a/cpp/src/arrow/table/column-benchmark.cc b/cpp/src/arrow/column-benchmark.cc similarity index 94% rename from cpp/src/arrow/table/column-benchmark.cc rename to cpp/src/arrow/column-benchmark.cc index c01146d7b096f..69ee52c3e09ea 100644 --- a/cpp/src/arrow/table/column-benchmark.cc +++ b/cpp/src/arrow/column-benchmark.cc @@ -19,15 +19,14 @@ #include "benchmark/benchmark.h" #include "arrow/test-util.h" -#include "arrow/table/test-common.h" -#include "arrow/types/integer.h" +#include "arrow/types/primitive.h" #include "arrow/util/memory-pool.h" namespace arrow { namespace { template std::shared_ptr MakePrimitive(int32_t length, int32_t null_count = 0) { - auto pool = GetDefaultMemoryPool(); + auto pool = default_memory_pool(); auto data = std::make_shared(pool); auto nulls = std::make_shared(pool); data->Resize(length * sizeof(typename ArrayType::value_type)); diff --git a/cpp/src/arrow/table/column-test.cc b/cpp/src/arrow/column-test.cc similarity index 93% rename from cpp/src/arrow/table/column-test.cc rename to cpp/src/arrow/column-test.cc index 3b102e48c87cf..0630785630e81 100644 --- a/cpp/src/arrow/table/column-test.cc +++ b/cpp/src/arrow/column-test.cc @@ -15,18 +15,18 @@ // specific language governing permissions and limitations // under the License. 
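A note on the equality machinery introduced above: Array::EqualsExact compares two arrays' validity bitmaps byte-wise over util::bytes_for_bits(length) bytes. The standalone sketch below reproduces just that arithmetic; bytes_for_bits and null_bitmaps_equal here are illustrative stand-ins, not Arrow's util API, and the comparison assumes trailing padding bits are zeroed the same way on both sides.

#include <cstdint>
#include <cstring>

// Stand-in for util::bytes_for_bits: bit count rounded up to whole bytes.
inline int64_t bytes_for_bits(int64_t bits) {
  return (bits + 7) / 8;
}

// Two arrays of equal length and null count can only be equal if their
// validity bitmaps agree over the first bytes_for_bits(length) bytes,
// which is exactly the comparison EqualsExact performs.
bool null_bitmaps_equal(const uint8_t* left, const uint8_t* right,
                        int32_t length) {
  return std::memcmp(left, right, bytes_for_bits(length)) == 0;
}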
-#include #include #include #include #include -#include "arrow/table/column.h" -#include "arrow/table/schema.h" -#include "arrow/table/test-common.h" +#include "gtest/gtest.h" + +#include "arrow/column.h" +#include "arrow/schema.h" #include "arrow/test-util.h" #include "arrow/type.h" -#include "arrow/types/integer.h" +#include "arrow/types/primitive.h" using std::shared_ptr; using std::vector; diff --git a/cpp/src/arrow/table/column.cc b/cpp/src/arrow/column.cc similarity index 96% rename from cpp/src/arrow/table/column.cc rename to cpp/src/arrow/column.cc index 573e650875944..46acf8df2ff57 100644 --- a/cpp/src/arrow/table/column.cc +++ b/cpp/src/arrow/column.cc @@ -15,11 +15,12 @@ // specific language governing permissions and limitations // under the License. -#include "arrow/table/column.h" +#include "arrow/column.h" #include #include +#include "arrow/array.h" #include "arrow/type.h" #include "arrow/util/status.h" @@ -28,6 +29,7 @@ namespace arrow { ChunkedArray::ChunkedArray(const ArrayVector& chunks) : chunks_(chunks) { length_ = 0; + null_count_ = 0; for (const std::shared_ptr& chunk : chunks) { length_ += chunk->length(); null_count_ += chunk->null_count(); diff --git a/cpp/src/arrow/table/column.h b/cpp/src/arrow/column.h similarity index 93% rename from cpp/src/arrow/table/column.h rename to cpp/src/arrow/column.h index dfc7516e26aac..1ad97b20863c8 100644 --- a/cpp/src/arrow/table/column.h +++ b/cpp/src/arrow/column.h @@ -15,19 +15,22 @@ // specific language governing permissions and limitations // under the License. -#ifndef ARROW_TABLE_COLUMN_H -#define ARROW_TABLE_COLUMN_H +#ifndef ARROW_COLUMN_H +#define ARROW_COLUMN_H +#include #include #include #include -#include "arrow/array.h" #include "arrow/type.h" namespace arrow { -typedef std::vector > ArrayVector; +class Array; +class Status; + +typedef std::vector> ArrayVector; // A data structure managing a list of primitive Arrow arrays logically as one // large array @@ -102,4 +105,4 @@ class Column { } // namespace arrow -#endif // ARROW_TABLE_COLUMN_H +#endif // ARROW_COLUMN_H diff --git a/cpp/src/arrow/ipc/.gitignore b/cpp/src/arrow/ipc/.gitignore new file mode 100644 index 0000000000000..8150d7efe33c4 --- /dev/null +++ b/cpp/src/arrow/ipc/.gitignore @@ -0,0 +1 @@ +*_generated.h \ No newline at end of file diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt new file mode 100644 index 0000000000000..383684f42f952 --- /dev/null +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -0,0 +1,51 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
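Worth calling out in the ChunkedArray hunk above: the constructor now zeroes null_count_ before the accumulation loop (previously only length_ was initialized, so the null count summed into an indeterminate value). A minimal sketch of the corrected bookkeeping, with an illustrative Chunk type rather than Arrow's classes:

#include <cstdint>
#include <memory>
#include <vector>

// Illustrative stand-in for anything exposing length() and null_count().
struct Chunk {
  int64_t length_ = 0;
  int64_t null_count_ = 0;
  int64_t length() const { return length_; }
  int64_t null_count() const { return null_count_; }
};

// The logical length and null count of a chunked array are sums over all
// chunks; both totals must start at zero, which is the fix made above.
void AccumulateChunks(const std::vector<std::shared_ptr<Chunk>>& chunks,
                      int64_t* length, int64_t* null_count) {
  *length = 0;
  *null_count = 0;
  for (const auto& chunk : chunks) {
    *length += chunk->length();
    *null_count += chunk->null_count();
  }
}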
+ +####################################### +# arrow_ipc +####################################### + +# Headers: top level +install(FILES + adapter.h + metadata.h + memory.h + DESTINATION include/arrow/ipc) + +ADD_ARROW_TEST(ipc-adapter-test) +ADD_ARROW_TEST(ipc-memory-test) +ADD_ARROW_TEST(ipc-metadata-test) + +# make clean will delete the generated file +set_source_files_properties(Metadata_generated.h PROPERTIES GENERATED TRUE) + +set(OUTPUT_DIR ${CMAKE_SOURCE_DIR}/src/arrow/ipc) +set(FBS_OUTPUT_FILES "${OUTPUT_DIR}/Message_generated.h") + +set(FBS_SRC ${CMAKE_SOURCE_DIR}/../format/Message.fbs) +get_filename_component(ABS_FBS_SRC ${FBS_SRC} ABSOLUTE) + +add_custom_command( + OUTPUT ${FBS_OUTPUT_FILES} + COMMAND ${FLATBUFFERS_COMPILER} -c -o ${OUTPUT_DIR} ${ABS_FBS_SRC} + DEPENDS ${ABS_FBS_SRC} + COMMENT "Running flatc compiler on ${FBS_SRC}" + VERBATIM +) + +add_custom_target(metadata_fbs DEPENDS ${FBS_OUTPUT_FILES}) +add_dependencies(arrow metadata_fbs) diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc new file mode 100644 index 0000000000000..7cdb965f5f45c --- /dev/null +++ b/cpp/src/arrow/ipc/adapter.cc @@ -0,0 +1,305 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include "arrow/ipc/adapter.h" + +#include +#include +#include + +#include "arrow/array.h" +#include "arrow/ipc/memory.h" +#include "arrow/ipc/Message_generated.h" +#include "arrow/ipc/metadata.h" +#include "arrow/ipc/metadata-internal.h" +#include "arrow/schema.h" +#include "arrow/table.h" +#include "arrow/type.h" +#include "arrow/types/construct.h" +#include "arrow/types/primitive.h" +#include "arrow/util/buffer.h" +#include "arrow/util/status.h" + +namespace arrow { + +namespace flatbuf = apache::arrow::flatbuf; + +namespace ipc { + +static bool IsPrimitive(const DataType* type) { + switch (type->type) { + // NA is null type or "no type", considered primitive for now + case Type::NA: + case Type::BOOL: + case Type::UINT8: + case Type::INT8: + case Type::UINT16: + case Type::INT16: + case Type::UINT32: + case Type::INT32: + case Type::UINT64: + case Type::INT64: + case Type::FLOAT: + case Type::DOUBLE: + return true; + default: + return false; + } +} + +// ---------------------------------------------------------------------- +// Row batch write path + +Status VisitArray(const Array* arr, std::vector* field_nodes, + std::vector>* buffers) { + if (IsPrimitive(arr->type().get())) { + const PrimitiveArray* prim_arr = static_cast(arr); + + field_nodes->push_back( + flatbuf::FieldNode(prim_arr->length(), prim_arr->null_count())); + + if (prim_arr->null_count() > 0) { + buffers->push_back(prim_arr->nulls()); + } else { + // Push a dummy zero-length buffer, not to be copied + buffers->push_back(std::make_shared(nullptr, 0)); + } + buffers->push_back(prim_arr->data()); + } else if (arr->type_enum() == Type::LIST) { + // TODO(wesm) + return Status::NotImplemented("List type"); + } else if (arr->type_enum() == Type::STRUCT) { + // TODO(wesm) + return Status::NotImplemented("Struct type"); + } + + return Status::OK(); +} + +class RowBatchWriter { + public: + explicit RowBatchWriter(const RowBatch* batch) : + batch_(batch) {} + + Status AssemblePayload() { + // Perform depth-first traversal of the row-batch + for (int i = 0; i < batch_->num_columns(); ++i) { + const Array* arr = batch_->column(i).get(); + RETURN_NOT_OK(VisitArray(arr, &field_nodes_, &buffers_)); + } + return Status::OK(); + } + + Status Write(MemorySource* dst, int64_t position, int64_t* data_header_offset) { + // Write out all the buffers contiguously and compute the total size of the + // memory payload + int64_t offset = 0; + for (size_t i = 0; i < buffers_.size(); ++i) { + const Buffer* buffer = buffers_[i].get(); + int64_t size = buffer->size(); + + // TODO(wesm): We currently have no notion of shared memory page id's, + // but we've included it in the metadata IDL for when we have it in the + // future. Use page=0 for now + // + // Note that page ids are a bespoke notion for Arrow and not a feature we + // are using from any OS-level shared memory. The thought is that systems + // may (in the future) associate integer page id's with physical memory + // pages (according to whatever is the desired shared memory mechanism) + buffer_meta_.push_back(flatbuf::Buffer(0, position + offset, size)); + + if (size > 0) { + RETURN_NOT_OK(dst->Write(position + offset, buffer->data(), size)); + offset += size; + } + } + + // Now that we have computed the locations of all of the buffers in shared + // memory, the data header can be converted to a flatbuffer and written out + // + // Note: The memory written here is prefixed by the size of the flatbuffer + // itself as an int32_t. 
On reading from a MemorySource, you will have to + // determine the data header size then request a buffer such that you can + // construct the flatbuffer data accessor object (see arrow::ipc::Message) + std::shared_ptr data_header; + RETURN_NOT_OK(WriteDataHeader(batch_->num_rows(), offset, + field_nodes_, buffer_meta_, &data_header)); + + // Write the data header at the end + RETURN_NOT_OK(dst->Write(position + offset, data_header->data(), + data_header->size())); + + *data_header_offset = position + offset; + return Status::OK(); + } + + // This must be called after invoking AssemblePayload + int64_t DataHeaderSize() { + // TODO(wesm): In case it is needed, compute the upper bound for the size + // of the buffer containing the flatbuffer data header. + return 0; + } + + // Total footprint of buffers. This must be called after invoking + // AssemblePayload + int64_t TotalBytes() { + int64_t total = 0; + for (const std::shared_ptr& buffer : buffers_) { + total += buffer->size(); + } + return total; + } + + private: + const RowBatch* batch_; + + std::vector field_nodes_; + std::vector buffer_meta_; + std::vector> buffers_; +}; + +Status WriteRowBatch(MemorySource* dst, const RowBatch* batch, int64_t position, + int64_t* header_offset) { + RowBatchWriter serializer(batch); + RETURN_NOT_OK(serializer.AssemblePayload()); + return serializer.Write(dst, position, header_offset); +} +// ---------------------------------------------------------------------- +// Row batch read path + +static constexpr int64_t INIT_METADATA_SIZE = 4096; + +class RowBatchReader::Impl { + public: + Impl(MemorySource* source, const std::shared_ptr& metadata) : + source_(source), + metadata_(metadata) { + num_buffers_ = metadata->num_buffers(); + num_flattened_fields_ = metadata->num_fields(); + } + + Status AssembleBatch(const std::shared_ptr& schema, + std::shared_ptr* out) { + std::vector> arrays(schema->num_fields()); + + // The field_index and buffer_index are incremented in NextArray based on + // how much of the batch is "consumed" (through nested data reconstruction, + // for example) + field_index_ = 0; + buffer_index_ = 0; + for (int i = 0; i < schema->num_fields(); ++i) { + const Field* field = schema->field(i).get(); + RETURN_NOT_OK(NextArray(field, &arrays[i])); + } + + *out = std::make_shared(schema, metadata_->length(), + arrays); + return Status::OK(); + } + + private: + // Traverse the flattened record batch metadata and reassemble the + // corresponding array containers + Status NextArray(const Field* field, std::shared_ptr* out) { + const std::shared_ptr& type = field->type; + + // pop off a field + if (field_index_ >= num_flattened_fields_) { + return Status::Invalid("Ran out of field metadata, likely malformed"); + } + + // This only contains the length and null count, which we need to figure + // out what to do with the buffers. 
For example, if null_count == 0, then + // we can skip that buffer without reading from shared memory + FieldMetadata field_meta = metadata_->field(field_index_++); + + if (IsPrimitive(type.get())) { + std::shared_ptr nulls; + std::shared_ptr data; + if (field_meta.null_count == 0) { + nulls = nullptr; + ++buffer_index_; + } else { + RETURN_NOT_OK(GetBuffer(buffer_index_++, &nulls)); + } + if (field_meta.length > 0) { + RETURN_NOT_OK(GetBuffer(buffer_index_++, &data)); + } else { + data.reset(new Buffer(nullptr, 0)); + } + return MakePrimitiveArray(type, field_meta.length, data, + field_meta.null_count, nulls, out); + } else { + return Status::NotImplemented("Non-primitive types not complete yet"); + } + } + + Status GetBuffer(int buffer_index, std::shared_ptr* out) { + BufferMetadata metadata = metadata_->buffer(buffer_index); + return source_->ReadAt(metadata.offset, metadata.length, out); + } + + MemorySource* source_; + std::shared_ptr metadata_; + + int field_index_; + int buffer_index_; + int num_buffers_; + int num_flattened_fields_; +}; + +Status RowBatchReader::Open(MemorySource* source, int64_t position, + std::shared_ptr* out) { + std::shared_ptr metadata; + RETURN_NOT_OK(source->ReadAt(position, INIT_METADATA_SIZE, &metadata)); + + int32_t metadata_size = *reinterpret_cast(metadata->data()); + + // We may not need to call source->ReadAt again + if (metadata_size > static_cast(INIT_METADATA_SIZE - sizeof(int32_t))) { + // We don't have enough data, read the indicated metadata size. + RETURN_NOT_OK(source->ReadAt(position + sizeof(int32_t), + metadata_size, &metadata)); + } + + // TODO(wesm): buffer slicing here would be better in case ReadAt returns + // allocated memory + + std::shared_ptr message; + RETURN_NOT_OK(Message::Open(metadata, &message)); + + if (message->type() != Message::RECORD_BATCH) { + return Status::Invalid("Metadata message is not a record batch"); + } + + std::shared_ptr batch_meta = message->GetRecordBatch(); + + std::shared_ptr result(new RowBatchReader()); + result->impl_.reset(new Impl(source, batch_meta)); + *out = result; + + return Status::OK(); +} + +Status RowBatchReader::GetRowBatch(const std::shared_ptr& schema, + std::shared_ptr* out) { + return impl_->AssembleBatch(schema, out); +} + + +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/adapter.h b/cpp/src/arrow/ipc/adapter.h new file mode 100644 index 0000000000000..26dea6d04b889 --- /dev/null +++ b/cpp/src/arrow/ipc/adapter.h @@ -0,0 +1,86 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +// Public API for writing and accessing (with zero copy, if possible) Arrow +// data in shared memory + +#ifndef ARROW_IPC_ADAPTER_H +#define ARROW_IPC_ADAPTER_H + +#include +#include + +namespace arrow { + +class Array; +class RowBatch; +class Schema; +class Status; + +namespace ipc { + +class MemorySource; +class RecordBatchMessage; + +// ---------------------------------------------------------------------- +// Write path + +// Write the RowBatch (collection of equal-length Arrow arrays) to the memory +// source at the indicated position +// +// First, each of the memory buffers are written out end-to-end in starting at +// the indicated position. +// +// Then, this function writes the batch metadata as a flatbuffer (see +// format/Message.fbs -- the RecordBatch message type) like so: +// +// +// +// Finally, the memory offset to the start of the metadata / data header is +// returned in an out-variable +Status WriteRowBatch(MemorySource* dst, const RowBatch* batch, int64_t position, + int64_t* header_offset); + +// int64_t GetRowBatchMetadata(const RowBatch* batch); + +// Compute the precise number of bytes needed in a contiguous memory segment to +// write the row batch. This involves generating the complete serialized +// Flatbuffers metadata. +int64_t GetRowBatchSize(const RowBatch* batch); + +// ---------------------------------------------------------------------- +// "Read" path; does not copy data if the MemorySource does not + +class RowBatchReader { + public: + static Status Open(MemorySource* source, int64_t position, + std::shared_ptr* out); + + // Reassemble the row batch. A Schema is required to be able to construct the + // right array containers + Status GetRowBatch(const std::shared_ptr& schema, + std::shared_ptr* out); + + private: + class Impl; + std::unique_ptr impl_; +}; + +} // namespace ipc +} // namespace arrow + +#endif // ARROW_IPC_MEMORY_H diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc new file mode 100644 index 0000000000000..d75998f0a5dd2 --- /dev/null +++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc @@ -0,0 +1,112 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
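Composed end to end, the two halves of the API above give a shared-memory round trip; the sketch below strings them together and is essentially what the adapter test that follows exercises. It leans on the RETURN_NOT_OK macro used throughout this patch, and the helper name RoundTripBatch is ours, not the patch's:

#include <cstdint>
#include <memory>

#include "arrow/ipc/adapter.h"
#include "arrow/ipc/memory.h"
#include "arrow/schema.h"
#include "arrow/table.h"
#include "arrow/util/status.h"

arrow::Status RoundTripBatch(arrow::ipc::MemorySource* source,
                             const arrow::RowBatch* batch,
                             const std::shared_ptr<arrow::Schema>& schema,
                             std::shared_ptr<arrow::RowBatch>* out) {
  // Write buffers end-to-end starting at position 0; the offset of the
  // trailing data header comes back through the out-variable.
  int64_t header_offset = 0;
  RETURN_NOT_OK(arrow::ipc::WriteRowBatch(source, batch, 0, &header_offset));

  // Reattach from the header offset; the schema is required to rebuild
  // the array containers.
  std::shared_ptr<arrow::ipc::RowBatchReader> reader;
  RETURN_NOT_OK(
      arrow::ipc::RowBatchReader::Open(source, header_offset, &reader));
  return reader->GetRowBatch(schema, out);
}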
+ +#include +#include +#include +#include +#include +#include +#include +#include + +#include "gtest/gtest.h" + +#include "arrow/ipc/adapter.h" +#include "arrow/ipc/memory.h" +#include "arrow/ipc/test-common.h" + +#include "arrow/test-util.h" +#include "arrow/types/primitive.h" +#include "arrow/util/bit-util.h" +#include "arrow/util/buffer.h" +#include "arrow/util/memory-pool.h" +#include "arrow/util/status.h" + +namespace arrow { +namespace ipc { + +class TestWriteRowBatch : public ::testing::Test, public MemoryMapFixture { + public: + void SetUp() { + pool_ = default_memory_pool(); + } + void TearDown() { + MemoryMapFixture::TearDown(); + } + + void InitMemoryMap(int64_t size) { + std::string path = "test-write-row-batch"; + MemoryMapFixture::CreateFile(path, size); + ASSERT_OK(MemoryMappedSource::Open(path, MemorySource::READ_WRITE, &mmap_)); + } + + protected: + MemoryPool* pool_; + std::shared_ptr mmap_; +}; + +const auto INT32 = std::make_shared(); + +TEST_F(TestWriteRowBatch, IntegerRoundTrip) { + const int length = 1000; + + // Make the schema + auto f0 = std::make_shared("f0", INT32); + auto f1 = std::make_shared("f1", INT32); + std::shared_ptr schema(new Schema({f0, f1})); + + // Example data + + auto data = std::make_shared(pool_); + ASSERT_OK(data->Resize(length * sizeof(int32_t))); + test::rand_uniform_int(length, 0, 0, std::numeric_limits::max(), + reinterpret_cast(data->mutable_data())); + + auto nulls = std::make_shared(pool_); + int null_bytes = util::bytes_for_bits(length); + ASSERT_OK(nulls->Resize(null_bytes)); + test::random_bytes(null_bytes, 0, nulls->mutable_data()); + + auto a0 = std::make_shared(length, data); + auto a1 = std::make_shared(length, data, + test::bitmap_popcount(nulls->data(), length), nulls); + + RowBatch batch(schema, length, {a0, a1}); + + // TODO(wesm): computing memory requirements for a row batch + // 64k is plenty of space + InitMemoryMap(1 << 16); + + int64_t header_location; + ASSERT_OK(WriteRowBatch(mmap_.get(), &batch, 0, &header_location)); + + std::shared_ptr result; + ASSERT_OK(RowBatchReader::Open(mmap_.get(), header_location, &result)); + + std::shared_ptr batch_result; + ASSERT_OK(result->GetRowBatch(schema, &batch_result)); + EXPECT_EQ(batch.num_rows(), batch_result->num_rows()); + + for (int i = 0; i < batch.num_columns(); ++i) { + EXPECT_TRUE(batch.column(i)->Equals(batch_result->column(i))) + << i << batch.column_name(i); + } +} + +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/ipc-memory-test.cc b/cpp/src/arrow/ipc/ipc-memory-test.cc new file mode 100644 index 0000000000000..332ad2a2b809b --- /dev/null +++ b/cpp/src/arrow/ipc/ipc-memory-test.cc @@ -0,0 +1,82 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
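One detail of the round-trip test above that is easy to miss: the null count passed to the second Int32Array is computed by popcounting the bitmap, because in this early code a set bit marks a null slot (compare Array::IsNull earlier in the patch). A standalone equivalent of that count, assuming LSB-first bit order within each byte; this is a sketch, not Arrow's test::bitmap_popcount itself:

#include <cstdint>

// Count the set bits across the first length_bits positions of a bitmap.
int64_t CountSetBits(const uint8_t* bits, int64_t length_bits) {
  int64_t count = 0;
  for (int64_t i = 0; i < length_bits; ++i) {
    if (bits[i / 8] & (1u << (i % 8))) {
      ++count;
    }
  }
  return count;
}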
+ +#include +#include +#include +#include +#include +#include + +#include "gtest/gtest.h" + +#include "arrow/ipc/memory.h" +#include "arrow/ipc/test-common.h" +#include "arrow/test-util.h" +#include "arrow/util/buffer.h" +#include "arrow/util/status.h" + +namespace arrow { +namespace ipc { + +class TestMemoryMappedSource : public ::testing::Test, public MemoryMapFixture { + public: + void TearDown() { + MemoryMapFixture::TearDown(); + } +}; + +TEST_F(TestMemoryMappedSource, InvalidUsages) { +} + +TEST_F(TestMemoryMappedSource, WriteRead) { + const int64_t buffer_size = 1024; + std::vector buffer(buffer_size); + + test::random_bytes(1024, 0, buffer.data()); + + const int reps = 5; + + std::string path = "ipc-write-read-test"; + CreateFile(path, reps * buffer_size); + + std::shared_ptr result; + ASSERT_OK(MemoryMappedSource::Open(path, MemorySource::READ_WRITE, &result)); + + int64_t position = 0; + + std::shared_ptr out_buffer; + for (int i = 0; i < reps; ++i) { + ASSERT_OK(result->Write(position, buffer.data(), buffer_size)); + ASSERT_OK(result->ReadAt(position, buffer_size, &out_buffer)); + + ASSERT_EQ(0, memcmp(out_buffer->data(), buffer.data(), buffer_size)); + + position += buffer_size; + } +} + +TEST_F(TestMemoryMappedSource, InvalidFile) { + std::string non_existent_path = "invalid-file-name-asfd"; + + std::shared_ptr result; + ASSERT_RAISES(IOError, MemoryMappedSource::Open(non_existent_path, + MemorySource::READ_ONLY, &result)); +} + +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/ipc-metadata-test.cc b/cpp/src/arrow/ipc/ipc-metadata-test.cc new file mode 100644 index 0000000000000..ceabec0fa7c29 --- /dev/null +++ b/cpp/src/arrow/ipc/ipc-metadata-test.cc @@ -0,0 +1,99 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include +#include +#include + +#include "gtest/gtest.h" + +#include "arrow/ipc/metadata.h" +#include "arrow/schema.h" +#include "arrow/test-util.h" +#include "arrow/type.h" +#include "arrow/util/status.h" + +namespace arrow { + +class Buffer; + +static inline void assert_schema_equal(const Schema* lhs, const Schema* rhs) { + if (!lhs->Equals(*rhs)) { + std::stringstream ss; + ss << "left schema: " << lhs->ToString() << std::endl + << "right schema: " << rhs->ToString() << std::endl; + FAIL() << ss.str(); + } +} + +class TestSchemaMessage : public ::testing::Test { + public: + void SetUp() {} + + void CheckRoundtrip(const Schema* schema) { + std::shared_ptr buffer; + ASSERT_OK(ipc::WriteSchema(schema, &buffer)); + + std::shared_ptr message; + ASSERT_OK(ipc::Message::Open(buffer, &message)); + + ASSERT_EQ(ipc::Message::SCHEMA, message->type()); + + std::shared_ptr schema_msg = message->GetSchema(); + ASSERT_EQ(schema->num_fields(), schema_msg->num_fields()); + + std::shared_ptr schema2; + ASSERT_OK(schema_msg->GetSchema(&schema2)); + + assert_schema_equal(schema, schema2.get()); + } +}; + +const std::shared_ptr INT32 = std::make_shared(); + +TEST_F(TestSchemaMessage, PrimitiveFields) { + auto f0 = std::make_shared("f0", std::make_shared()); + auto f1 = std::make_shared("f1", std::make_shared()); + auto f2 = std::make_shared("f2", std::make_shared()); + auto f3 = std::make_shared("f3", std::make_shared()); + auto f4 = std::make_shared("f4", std::make_shared()); + auto f5 = std::make_shared("f5", std::make_shared()); + auto f6 = std::make_shared("f6", std::make_shared()); + auto f7 = std::make_shared("f7", std::make_shared()); + auto f8 = std::make_shared("f8", std::make_shared()); + auto f9 = std::make_shared("f9", std::make_shared()); + auto f10 = std::make_shared("f10", std::make_shared()); + + Schema schema({f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10}); + CheckRoundtrip(&schema); +} + +TEST_F(TestSchemaMessage, NestedFields) { + auto type = std::make_shared(std::make_shared()); + auto f0 = std::make_shared("f0", type); + + std::shared_ptr type2(new StructType({ + std::make_shared("k1", INT32), + std::make_shared("k2", INT32), + std::make_shared("k3", INT32)})); + auto f1 = std::make_shared("f1", type2); + + Schema schema({f0, f1}); + CheckRoundtrip(&schema); +} + +} // namespace arrow diff --git a/cpp/src/arrow/ipc/memory.cc b/cpp/src/arrow/ipc/memory.cc new file mode 100644 index 0000000000000..e630ccd109b77 --- /dev/null +++ b/cpp/src/arrow/ipc/memory.cc @@ -0,0 +1,162 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include "arrow/ipc/memory.h" + +#include // For memory-mapping +#include +#include +#include +#include +#include +#include +#include + +#include "arrow/util/buffer.h" +#include "arrow/util/status.h" + +namespace arrow { +namespace ipc { + +MemorySource::MemorySource(AccessMode access_mode) : + access_mode_(access_mode) {} + +MemorySource::~MemorySource() {} + +// Implement MemoryMappedSource + +class MemoryMappedSource::Impl { + public: + Impl() : + file_(nullptr), + is_open_(false), + data_(nullptr) {} + + ~Impl() { + if (is_open_) { + munmap(data_, size_); + fclose(file_); + } + } + + Status Open(const std::string& path, MemorySource::AccessMode mode) { + if (is_open_) { + return Status::IOError("A file is already open"); + } + + path_ = path; + + if (mode == MemorySource::READ_WRITE) { + file_ = fopen(path.c_str(), "r+b"); + } else { + file_ = fopen(path.c_str(), "rb"); + } + if (file_ == nullptr) { + std::stringstream ss; + ss << "Unable to open file, errno: " << errno; + return Status::IOError(ss.str()); + } + + fseek(file_, 0L, SEEK_END); + if (ferror(file_)) { + return Status::IOError("Unable to seek to end of file"); + } + size_ = ftell(file_); + + fseek(file_, 0L, SEEK_SET); + is_open_ = true; + + // TODO(wesm): Add read-only version of this + data_ = reinterpret_cast(mmap(nullptr, size_, + PROT_READ | PROT_WRITE, + MAP_SHARED, fileno(file_), 0)); + if (data_ == nullptr) { + std::stringstream ss; + ss << "Memory mapping file failed, errno: " << errno; + return Status::IOError(ss.str()); + } + + return Status::OK(); + } + + int64_t size() const { + return size_; + } + + uint8_t* data() { + return data_; + } + + private: + std::string path_; + FILE* file_; + int64_t size_; + bool is_open_; + + // The memory map + uint8_t* data_; +}; + +MemoryMappedSource::MemoryMappedSource(AccessMode access_mode) : + MemorySource(access_mode) {} + +Status MemoryMappedSource::Open(const std::string& path, AccessMode access_mode, + std::shared_ptr* out) { + std::shared_ptr result(new MemoryMappedSource(access_mode)); + + result->impl_.reset(new Impl()); + RETURN_NOT_OK(result->impl_->Open(path, access_mode)); + + *out = result; + return Status::OK(); +} + +int64_t MemoryMappedSource::Size() const { + return impl_->size(); +} + +Status MemoryMappedSource::Close() { + // munmap handled in ::Impl dtor + return Status::OK(); +} + +Status MemoryMappedSource::ReadAt(int64_t position, int64_t nbytes, + std::shared_ptr* out) { + if (position < 0 || position >= impl_->size()) { + return Status::Invalid("position is out of bounds"); + } + + nbytes = std::min(nbytes, impl_->size() - position); + *out = std::make_shared(impl_->data() + position, nbytes); + return Status::OK(); +} + +Status MemoryMappedSource::Write(int64_t position, const uint8_t* data, + int64_t nbytes) { + if (position < 0 || position >= impl_->size()) { + return Status::Invalid("position is out of bounds"); + } + + // TODO(wesm): verify we are not writing past the end of the buffer + uint8_t* dst = impl_->data() + position; + memcpy(dst, data, nbytes); + + return Status::OK(); +} + +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/memory.h b/cpp/src/arrow/ipc/memory.h new file mode 100644 index 0000000000000..0b4d8347c342f --- /dev/null +++ b/cpp/src/arrow/ipc/memory.h @@ -0,0 +1,131 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. 
The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +// Public API for different interprocess memory sharing mechanisms + +#ifndef ARROW_IPC_MEMORY_H +#define ARROW_IPC_MEMORY_H + +#include +#include +#include + +#include "arrow/util/macros.h" + +namespace arrow { + +class Buffer; +class MutableBuffer; +class Status; + +namespace ipc { + +// Abstract output stream +class OutputStream { + public: + virtual ~OutputStream() {} + // Close the output stream + virtual Status Close() = 0; + + // The current position in the output stream + virtual int64_t Tell() const = 0; + + // Write bytes to the stream + virtual Status Write(const uint8_t* data, int64_t length) = 0; +}; + +// An output stream that writes to a MutableBuffer, such as one obtained from a +// memory map +class BufferOutputStream : public OutputStream { + public: + explicit BufferOutputStream(const std::shared_ptr& buffer): + buffer_(buffer) {} + + // Implement the OutputStream interface + Status Close() override; + int64_t Tell() const override; + Status Write(const uint8_t* data, int64_t length) override; + + // Returns the number of bytes remaining in the buffer + int64_t bytes_remaining() const; + + private: + std::shared_ptr buffer_; + int64_t capacity_; + int64_t position_; +}; + +class MemorySource { + public: + // Indicates the access permissions of the memory source + enum AccessMode { + READ_ONLY, + READ_WRITE + }; + + virtual ~MemorySource(); + + // Retrieve a buffer of memory from the source of the indicates size and at + // the indicated location + // @returns: arrow::Status indicating success / failure. 
The buffer is set + // into the *out argument + virtual Status ReadAt(int64_t position, int64_t nbytes, + std::shared_ptr* out) = 0; + + virtual Status Close() = 0; + + virtual Status Write(int64_t position, const uint8_t* data, int64_t nbytes) = 0; + + // @return: the size in bytes of the memory source + virtual int64_t Size() const = 0; + + protected: + explicit MemorySource(AccessMode access_mode = AccessMode::READ_WRITE); + + AccessMode access_mode_; + + private: + DISALLOW_COPY_AND_ASSIGN(MemorySource); +}; + +// A memory source that uses memory-mapped files for memory interactions +class MemoryMappedSource : public MemorySource { + public: + static Status Open(const std::string& path, AccessMode access_mode, + std::shared_ptr* out); + + Status Close() override; + + Status ReadAt(int64_t position, int64_t nbytes, + std::shared_ptr* out) override; + + Status Write(int64_t position, const uint8_t* data, int64_t nbytes) override; + + // @return: the size in bytes of the memory source + int64_t Size() const override; + + private: + explicit MemoryMappedSource(AccessMode access_mode); + // Hide the internal details of this class for now + class Impl; + std::unique_ptr impl_; +}; + +} // namespace ipc +} // namespace arrow + +#endif // ARROW_IPC_MEMORY_H diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc new file mode 100644 index 0000000000000..14b186906c3a0 --- /dev/null +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -0,0 +1,317 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
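Stepping back to MemoryMappedSource, implemented just above: its Impl opens the file, measures it with fseek/ftell, and maps it MAP_SHARED so writes land in the file. A condensed POSIX sketch of the same sequence follows; the names are ours. One hedge worth noting: mmap reports failure with MAP_FAILED rather than a null pointer, so the sketch checks that value.

#include <sys/mman.h>

#include <cstdint>
#include <cstdio>

struct Mapping {
  uint8_t* data;
  int64_t size;
  FILE* file;
};

// Returns false on failure. The caller is expected to munmap(data, size)
// and fclose(file) on teardown, as the Impl destructor does.
bool OpenReadWriteMapping(const char* path, Mapping* out) {
  FILE* file = std::fopen(path, "r+b");
  if (file == nullptr) {
    return false;
  }
  std::fseek(file, 0L, SEEK_END);
  const int64_t size = std::ftell(file);
  std::fseek(file, 0L, SEEK_SET);

  void* data = mmap(nullptr, static_cast<size_t>(size),
                    PROT_READ | PROT_WRITE, MAP_SHARED, fileno(file), 0);
  if (data == MAP_FAILED) {  // MAP_FAILED is (void*)-1, not nullptr
    std::fclose(file);
    return false;
  }
  out->data = static_cast<uint8_t*>(data);
  out->size = size;
  out->file = file;
  return true;
}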
+ +#include "arrow/ipc/metadata-internal.h" + +#include +#include +#include +#include +#include +#include + +#include "arrow/ipc/Message_generated.h" +#include "arrow/schema.h" +#include "arrow/type.h" +#include "arrow/util/buffer.h" +#include "arrow/util/status.h" + +typedef flatbuffers::FlatBufferBuilder FBB; +typedef flatbuffers::Offset FieldOffset; +typedef flatbuffers::Offset Offset; + +namespace arrow { + +namespace flatbuf = apache::arrow::flatbuf; + +namespace ipc { + +const std::shared_ptr BOOL = std::make_shared(); +const std::shared_ptr INT8 = std::make_shared(); +const std::shared_ptr INT16 = std::make_shared(); +const std::shared_ptr INT32 = std::make_shared(); +const std::shared_ptr INT64 = std::make_shared(); +const std::shared_ptr UINT8 = std::make_shared(); +const std::shared_ptr UINT16 = std::make_shared(); +const std::shared_ptr UINT32 = std::make_shared(); +const std::shared_ptr UINT64 = std::make_shared(); +const std::shared_ptr FLOAT = std::make_shared(); +const std::shared_ptr DOUBLE = std::make_shared(); + +static Status IntFromFlatbuffer(const flatbuf::Int* int_data, + std::shared_ptr* out) { + if (int_data->bitWidth() % 8 != 0) { + return Status::NotImplemented("Integers not in cstdint are not implemented"); + } else if (int_data->bitWidth() > 64) { + return Status::NotImplemented("Integers with more than 64 bits not implemented"); + } + + switch (int_data->bitWidth()) { + case 8: + *out = int_data->is_signed() ? INT8 : UINT8; + break; + case 16: + *out = int_data->is_signed() ? INT16 : UINT16; + break; + case 32: + *out = int_data->is_signed() ? INT32 : UINT32; + break; + case 64: + *out = int_data->is_signed() ? INT64 : UINT64; + break; + default: + *out = nullptr; + break; + } + return Status::OK(); +} + +static Status FloatFromFlatuffer(const flatbuf::FloatingPoint* float_data, + std::shared_ptr* out) { + if (float_data->precision() == flatbuf::Precision_SINGLE) { + *out = FLOAT; + } else { + *out = DOUBLE; + } + return Status::OK(); +} + +static Status TypeFromFlatbuffer(flatbuf::Type type, + const void* type_data, const std::vector>& children, + std::shared_ptr* out) { + switch (type) { + case flatbuf::Type_NONE: + return Status::Invalid("Type metadata cannot be none"); + case flatbuf::Type_Int: + return IntFromFlatbuffer(static_cast(type_data), out); + case flatbuf::Type_Bit: + return Status::NotImplemented("Type is not implemented"); + case flatbuf::Type_FloatingPoint: + return FloatFromFlatuffer(static_cast(type_data), + out); + case flatbuf::Type_Binary: + case flatbuf::Type_Utf8: + return Status::NotImplemented("Type is not implemented"); + case flatbuf::Type_Bool: + *out = BOOL; + return Status::OK(); + case flatbuf::Type_Decimal: + case flatbuf::Type_Timestamp: + case flatbuf::Type_List: + if (children.size() != 1) { + return Status::Invalid("List must have exactly 1 child field"); + } + *out = std::make_shared(children[0]); + return Status::OK(); + case flatbuf::Type_Tuple: + *out = std::make_shared(children); + return Status::OK(); + case flatbuf::Type_Union: + return Status::NotImplemented("Type is not implemented"); + default: + return Status::Invalid("Unrecognized type"); + } +} + +// Forward declaration +static Status FieldToFlatbuffer(FBB& fbb, const std::shared_ptr& field, + FieldOffset* offset); + +static Offset IntToFlatbuffer(FBB& fbb, int bitWidth, + bool is_signed) { + return flatbuf::CreateInt(fbb, bitWidth, is_signed).Union(); +} + +static Offset FloatToFlatbuffer(FBB& fbb, + flatbuf::Precision precision) { + return 
flatbuf::CreateFloatingPoint(fbb, precision).Union(); +} + +static Status ListToFlatbuffer(FBB& fbb, const std::shared_ptr& type, + std::vector* out_children, Offset* offset) { + FieldOffset field; + RETURN_NOT_OK(FieldToFlatbuffer(fbb, type->child(0), &field)); + out_children->push_back(field); + *offset = flatbuf::CreateList(fbb).Union(); + return Status::OK(); +} + +static Status StructToFlatbuffer(FBB& fbb, const std::shared_ptr& type, + std::vector* out_children, Offset* offset) { + FieldOffset field; + for (int i = 0; i < type->num_children(); ++i) { + RETURN_NOT_OK(FieldToFlatbuffer(fbb, type->child(i), &field)); + out_children->push_back(field); + } + *offset = flatbuf::CreateTuple(fbb).Union(); + return Status::OK(); +} + +#define INT_TO_FB_CASE(BIT_WIDTH, IS_SIGNED) \ + *out_type = flatbuf::Type_Int; \ + *offset = IntToFlatbuffer(fbb, BIT_WIDTH, IS_SIGNED); \ + break; + + +static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, + std::vector* children, + flatbuf::Type* out_type, Offset* offset) { + switch (type->type) { + case Type::BOOL: + *out_type = flatbuf::Type_Bool; + *offset = flatbuf::CreateBool(fbb).Union(); + break; + case Type::UINT8: + INT_TO_FB_CASE(8, false); + case Type::INT8: + INT_TO_FB_CASE(8, true); + case Type::UINT16: + INT_TO_FB_CASE(16, false); + case Type::INT16: + INT_TO_FB_CASE(16, true); + case Type::UINT32: + INT_TO_FB_CASE(32, false); + case Type::INT32: + INT_TO_FB_CASE(32, true); + case Type::UINT64: + INT_TO_FB_CASE(64, false); + case Type::INT64: + INT_TO_FB_CASE(64, true); + case Type::FLOAT: + *out_type = flatbuf::Type_FloatingPoint; + *offset = FloatToFlatbuffer(fbb, flatbuf::Precision_SINGLE); + break; + case Type::DOUBLE: + *out_type = flatbuf::Type_FloatingPoint; + *offset = FloatToFlatbuffer(fbb, flatbuf::Precision_DOUBLE); + break; + case Type::LIST: + *out_type = flatbuf::Type_List; + return ListToFlatbuffer(fbb, type, children, offset); + case Type::STRUCT: + *out_type = flatbuf::Type_Tuple; + return StructToFlatbuffer(fbb, type, children, offset); + default: + std::stringstream ss; + ss << "Unable to convert type: " << type->ToString() + << std::endl; + return Status::NotImplemented(ss.str()); + } + return Status::OK(); +} + +static Status FieldToFlatbuffer(FBB& fbb, const std::shared_ptr& field, + FieldOffset* offset) { + auto fb_name = fbb.CreateString(field->name); + + flatbuf::Type type_enum; + Offset type_data; + std::vector children; + + RETURN_NOT_OK(TypeToFlatbuffer(fbb, field->type, &children, &type_enum, &type_data)); + auto fb_children = fbb.CreateVector(children); + + *offset = flatbuf::CreateField(fbb, fb_name, field->nullable, type_enum, + type_data, fb_children); + + return Status::OK(); +} + +Status FieldFromFlatbuffer(const flatbuf::Field* field, + std::shared_ptr* out) { + std::shared_ptr type; + + auto children = field->children(); + std::vector> child_fields(children->size()); + for (size_t i = 0; i < children->size(); ++i) { + RETURN_NOT_OK(FieldFromFlatbuffer(children->Get(i), &child_fields[i])); + } + + RETURN_NOT_OK(TypeFromFlatbuffer(field->type_type(), + field->type(), child_fields, &type)); + + *out = std::make_shared(field->name()->str(), type); + return Status::OK(); +} + +// Implement MessageBuilder + +Status MessageBuilder::SetSchema(const Schema* schema) { + header_type_ = flatbuf::MessageHeader_Schema; + + std::vector field_offsets; + for (int i = 0; i < schema->num_fields(); ++i) { + const std::shared_ptr& field = schema->field(i); + FieldOffset offset; + 
RETURN_NOT_OK(FieldToFlatbuffer(fbb_, field, &offset)); + field_offsets.push_back(offset); + } + + header_ = flatbuf::CreateSchema(fbb_, fbb_.CreateVector(field_offsets)).Union(); + body_length_ = 0; + return Status::OK(); +} + +Status MessageBuilder::SetRecordBatch(int32_t length, int64_t body_length, + const std::vector& nodes, + const std::vector& buffers) { + header_type_ = flatbuf::MessageHeader_RecordBatch; + header_ = flatbuf::CreateRecordBatch(fbb_, length, + fbb_.CreateVectorOfStructs(nodes), + fbb_.CreateVectorOfStructs(buffers)).Union(); + body_length_ = body_length; + + return Status::OK(); +} + + +Status WriteDataHeader(int32_t length, int64_t body_length, + const std::vector& nodes, + const std::vector& buffers, + std::shared_ptr* out) { + MessageBuilder message; + RETURN_NOT_OK(message.SetRecordBatch(length, body_length, nodes, buffers)); + RETURN_NOT_OK(message.Finish()); + return message.GetBuffer(out); +} + +Status MessageBuilder::Finish() { + auto message = flatbuf::CreateMessage(fbb_, header_type_, header_, + body_length_); + fbb_.Finish(message); + return Status::OK(); +} + +Status MessageBuilder::GetBuffer(std::shared_ptr* out) { + // The message buffer is prefixed by the size of the complete flatbuffer as + // int32_t + // + int32_t size = fbb_.GetSize(); + + auto result = std::make_shared(); + RETURN_NOT_OK(result->Resize(size + sizeof(int32_t))); + + uint8_t* dst = result->mutable_data(); + memcpy(dst, reinterpret_cast(&size), sizeof(int32_t)); + memcpy(dst + sizeof(int32_t), fbb_.GetBufferPointer(), size); + + *out = result; + return Status::OK(); +} + +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/metadata-internal.h b/cpp/src/arrow/ipc/metadata-internal.h new file mode 100644 index 0000000000000..f7365d2a49f95 --- /dev/null +++ b/cpp/src/arrow/ipc/metadata-internal.h @@ -0,0 +1,69 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
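MessageBuilder::GetBuffer above is the write side of the framing convention the readers rely on: the finished flatbuffer is copied behind an int32_t length prefix. A standalone sketch of just that layout, with std::vector standing in for arrow::Buffer:

#include <cstdint>
#include <cstring>
#include <vector>

// Prefix a serialized flatbuffer with its int32_t size, as GetBuffer does.
std::vector<uint8_t> SizePrefixMessage(const uint8_t* fb_data,
                                       int32_t fb_size) {
  std::vector<uint8_t> out(sizeof(int32_t) + fb_size);
  std::memcpy(out.data(), &fb_size, sizeof(int32_t));           // prefix
  std::memcpy(out.data() + sizeof(int32_t), fb_data, fb_size);  // payload
  return out;
}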
+ +#ifndef ARROW_IPC_METADATA_INTERNAL_H +#define ARROW_IPC_METADATA_INTERNAL_H + +#include +#include +#include +#include + +#include "arrow/ipc/Message_generated.h" + +namespace arrow { + +namespace flatbuf = apache::arrow::flatbuf; + +class Buffer; +struct Field; +class Schema; +class Status; + +namespace ipc { + +Status FieldFromFlatbuffer(const flatbuf::Field* field, + std::shared_ptr* out); + +class MessageBuilder { + public: + Status SetSchema(const Schema* schema); + + Status SetRecordBatch(int32_t length, int64_t body_length, + const std::vector& nodes, + const std::vector& buffers); + + Status Finish(); + + Status GetBuffer(std::shared_ptr* out); + + private: + flatbuf::MessageHeader header_type_; + flatbuffers::Offset header_; + int64_t body_length_; + flatbuffers::FlatBufferBuilder fbb_; +}; + +Status WriteDataHeader(int32_t length, int64_t body_length, + const std::vector& nodes, + const std::vector& buffers, + std::shared_ptr* out); + +} // namespace ipc +} // namespace arrow + +#endif // ARROW_IPC_METADATA_INTERNAL_H diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc new file mode 100644 index 0000000000000..642f21a41e640 --- /dev/null +++ b/cpp/src/arrow/ipc/metadata.cc @@ -0,0 +1,238 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
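Back in metadata-internal.cc, IntFromFlatbuffer maps (bitWidth, is_signed) pairs from the flatbuffer metadata onto the fixed-width Arrow types, and the INT_TO_FB_CASE macro performs the inverse when writing. A standalone sketch of the forward direction, with a plain enum standing in for the shared_ptr<DataType> singletons (only whole-byte widths up to 64 bits are supported; the real code returns NotImplemented otherwise):

#include <cstdint>

// Illustrative stand-in for the patch's Type::type integer ids.
enum class IntTypeId {
  UINT8, INT8, UINT16, INT16, UINT32, INT32, UINT64, INT64, INVALID
};

IntTypeId FromBitWidth(int bit_width, bool is_signed) {
  switch (bit_width) {
    case 8:  return is_signed ? IntTypeId::INT8  : IntTypeId::UINT8;
    case 16: return is_signed ? IntTypeId::INT16 : IntTypeId::UINT16;
    case 32: return is_signed ? IntTypeId::INT32 : IntTypeId::UINT32;
    case 64: return is_signed ? IntTypeId::INT64 : IntTypeId::UINT64;
    default: return IntTypeId::INVALID;
  }
}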
+ +#include "arrow/ipc/metadata.h" + +#include +#include +#include +#include + +// Generated C++ flatbuffer IDL +#include "arrow/ipc/Message_generated.h" +#include "arrow/ipc/metadata-internal.h" + +#include "arrow/schema.h" +#include "arrow/util/buffer.h" +#include "arrow/util/status.h" + +namespace arrow { + +namespace flatbuf = apache::arrow::flatbuf; + +namespace ipc { + +Status WriteSchema(const Schema* schema, std::shared_ptr* out) { + MessageBuilder message; + RETURN_NOT_OK(message.SetSchema(schema)); + RETURN_NOT_OK(message.Finish()); + return message.GetBuffer(out); +} + +//---------------------------------------------------------------------- +// Message reader + +class Message::Impl { + public: + explicit Impl(const std::shared_ptr& buffer, + const flatbuf::Message* message) : + buffer_(buffer), + message_(message) {} + + Message::Type type() const { + switch (message_->header_type()) { + case flatbuf::MessageHeader_Schema: + return Message::SCHEMA; + case flatbuf::MessageHeader_DictionaryBatch: + return Message::DICTIONARY_BATCH; + case flatbuf::MessageHeader_RecordBatch: + return Message::RECORD_BATCH; + default: + return Message::NONE; + } + } + + const void* header() const { + return message_->header(); + } + + int64_t body_length() const { + return message_->bodyLength(); + } + + private: + // Owns the memory this message accesses + std::shared_ptr buffer_; + + const flatbuf::Message* message_; +}; + +class SchemaMessage::Impl { + public: + explicit Impl(const void* schema) : + schema_(static_cast(schema)) {} + + const flatbuf::Field* field(int i) const { + return schema_->fields()->Get(i); + } + + int num_fields() const { + return schema_->fields()->size(); + } + + private: + const flatbuf::Schema* schema_; +}; + +Message::Message() {} + +Status Message::Open(const std::shared_ptr& buffer, + std::shared_ptr* out) { + std::shared_ptr result(new Message()); + + // The buffer is prefixed by its size as int32_t + const uint8_t* fb_head = buffer->data() + sizeof(int32_t); + const flatbuf::Message* message = flatbuf::GetMessage(fb_head); + + // TODO(wesm): verify message + result->impl_.reset(new Impl(buffer, message)); + *out = result; + + return Status::OK(); +} + +Message::Type Message::type() const { + return impl_->type(); +} + +int64_t Message::body_length() const { + return impl_->body_length(); +} + +std::shared_ptr Message::get_shared_ptr() { + return this->shared_from_this(); +} + +std::shared_ptr Message::GetSchema() { + return std::make_shared(this->shared_from_this(), + impl_->header()); +} + +SchemaMessage::SchemaMessage(const std::shared_ptr& message, + const void* schema) { + message_ = message; + impl_.reset(new Impl(schema)); +} + +int SchemaMessage::num_fields() const { + return impl_->num_fields(); +} + +Status SchemaMessage::GetField(int i, std::shared_ptr* out) const { + const flatbuf::Field* field = impl_->field(i); + return FieldFromFlatbuffer(field, out); +} + +Status SchemaMessage::GetSchema(std::shared_ptr* out) const { + std::vector> fields(num_fields()); + for (int i = 0; i < this->num_fields(); ++i) { + RETURN_NOT_OK(GetField(i, &fields[i])); + } + *out = std::make_shared(fields); + return Status::OK(); +} + +class RecordBatchMessage::Impl { + public: + explicit Impl(const void* batch) : + batch_(static_cast(batch)) { + nodes_ = batch_->nodes(); + buffers_ = batch_->buffers(); + } + + const flatbuf::FieldNode* field(int i) const { + return nodes_->Get(i); + } + + const flatbuf::Buffer* buffer(int i) const { + return buffers_->Get(i); + } + + int32_t 
length() const { + return batch_->length(); + } + + int num_buffers() const { + return batch_->buffers()->size(); + } + + int num_fields() const { + return batch_->nodes()->size(); + } + + private: + const flatbuf::RecordBatch* batch_; + const flatbuffers::Vector* nodes_; + const flatbuffers::Vector* buffers_; +}; + +std::shared_ptr Message::GetRecordBatch() { + return std::make_shared(this->shared_from_this(), + impl_->header()); +} + +RecordBatchMessage::RecordBatchMessage(const std::shared_ptr& message, + const void* batch) { + message_ = message; + impl_.reset(new Impl(batch)); +} + +// TODO(wesm): Copying the flatbuffer data isn't great, but this will do for +// now +FieldMetadata RecordBatchMessage::field(int i) const { + const flatbuf::FieldNode* node = impl_->field(i); + + FieldMetadata result; + result.length = node->length(); + result.null_count = node->null_count(); + return result; +} + +BufferMetadata RecordBatchMessage::buffer(int i) const { + const flatbuf::Buffer* buffer = impl_->buffer(i); + + BufferMetadata result; + result.page = buffer->page(); + result.offset = buffer->offset(); + result.length = buffer->length(); + return result; +} + +int32_t RecordBatchMessage::length() const { + return impl_->length(); +} + +int RecordBatchMessage::num_buffers() const { + return impl_->num_buffers(); +} + +int RecordBatchMessage::num_fields() const { + return impl_->num_fields(); +} + +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h new file mode 100644 index 0000000000000..c7288529b9fbd --- /dev/null +++ b/cpp/src/arrow/ipc/metadata.h @@ -0,0 +1,146 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +// C++ object model and user API for interprocess schema messaging + +#ifndef ARROW_IPC_METADATA_H +#define ARROW_IPC_METADATA_H + +#include +#include + +namespace arrow { + +class Buffer; +struct Field; +class Schema; +class Status; + +namespace ipc { + +//---------------------------------------------------------------------- +// Message read/write APIs + +// Serialize arrow::Schema as a Flatbuffer +Status WriteSchema(const Schema* schema, std::shared_ptr* out); + +//---------------------------------------------------------------------- + +// Read interface classes. 
We do not fully deserialize the flatbuffers so that +// individual fields metadata can be retrieved from very large schema without +// + +class Message; + +// Container for serialized Schema metadata contained in an IPC message +class SchemaMessage { + public: + // Accepts an opaque flatbuffer pointer + SchemaMessage(const std::shared_ptr& message, const void* schema); + + int num_fields() const; + + // Construct an arrow::Field for the i-th value in the metadata + Status GetField(int i, std::shared_ptr* out) const; + + // Construct a complete Schema from the message. May be expensive for very + // large schemas if you are only interested in a few fields + Status GetSchema(std::shared_ptr* out) const; + + private: + // Parent, owns the flatbuffer data + std::shared_ptr message_; + + class Impl; + std::unique_ptr impl_; +}; + +// Field metadata +struct FieldMetadata { + int32_t length; + int32_t null_count; +}; + +struct BufferMetadata { + int32_t page; + int64_t offset; + int64_t length; +}; + +// Container for serialized record batch metadata contained in an IPC message +class RecordBatchMessage { + public: + // Accepts an opaque flatbuffer pointer + RecordBatchMessage(const std::shared_ptr& message, + const void* batch_meta); + + FieldMetadata field(int i) const; + BufferMetadata buffer(int i) const; + + int32_t length() const; + int num_buffers() const; + int num_fields() const; + + private: + // Parent, owns the flatbuffer data + std::shared_ptr message_; + + class Impl; + std::unique_ptr impl_; +}; + +class DictionaryBatchMessage { + public: + int64_t id() const; + std::unique_ptr data() const; +}; + +class Message : public std::enable_shared_from_this { + public: + enum Type { + NONE, + SCHEMA, + DICTIONARY_BATCH, + RECORD_BATCH + }; + + static Status Open(const std::shared_ptr& buffer, + std::shared_ptr* out); + + std::shared_ptr get_shared_ptr(); + + int64_t body_length() const; + + Type type() const; + + // These methods only to be invoked if you have checked the message type + std::shared_ptr GetSchema(); + std::shared_ptr GetRecordBatch(); + std::shared_ptr GetDictionaryBatch(); + + private: + Message(); + + // Hide serialization details from user API + class Impl; + std::unique_ptr impl_; +}; + +} // namespace ipc +} // namespace arrow + +#endif // ARROW_IPC_METADATA_H diff --git a/cpp/src/arrow/types/floating.h b/cpp/src/arrow/ipc/test-common.h similarity index 59% rename from cpp/src/arrow/types/floating.h rename to cpp/src/arrow/ipc/test-common.h index e7522781d33e3..0fccce941071b 100644 --- a/cpp/src/arrow/types/floating.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -15,22 +15,39 @@ // specific language governing permissions and limitations // under the License. 
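To tie the metadata classes above together, here is a short consumption sketch assembled from these declarations and the metadata test earlier in the patch: open the size-prefixed buffer, dispatch on type(), then pull the typed view. The helper name ReadSchemaMessage is ours:

#include <memory>

#include "arrow/ipc/metadata.h"
#include "arrow/schema.h"
#include "arrow/util/buffer.h"
#include "arrow/util/status.h"

// Extract an arrow::Schema if, and only if, the message carries one.
arrow::Status ReadSchemaMessage(const std::shared_ptr<arrow::Buffer>& buffer,
                                std::shared_ptr<arrow::Schema>* out) {
  std::shared_ptr<arrow::ipc::Message> message;
  RETURN_NOT_OK(arrow::ipc::Message::Open(buffer, &message));

  if (message->type() != arrow::ipc::Message::SCHEMA) {
    return arrow::Status::Invalid("Expected a schema message");
  }
  // GetSchema() yields the SchemaMessage view; its GetSchema(out) then
  // reconstructs the full arrow::Schema, which the comment above notes
  // may be expensive for very large schemas.
  return message->GetSchema()->GetSchema(out);
}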
-#ifndef ARROW_TYPES_FLOATING_H
-#define ARROW_TYPES_FLOATING_H
+#ifndef ARROW_IPC_TEST_COMMON_H
+#define ARROW_IPC_TEST_COMMON_H
 
+#include <cstdint>
+#include <cstdio>
 #include <string>
-
-#include "arrow/types/primitive.h"
-#include "arrow/type.h"
+#include <vector>
 
 namespace arrow {
-
-typedef PrimitiveArrayImpl<FloatType> FloatArray;
-typedef PrimitiveArrayImpl<DoubleType> DoubleArray;
-
-typedef PrimitiveBuilder<FloatType> FloatBuilder;
-typedef PrimitiveBuilder<DoubleType> DoubleBuilder;
-
+namespace ipc {
+
+class MemoryMapFixture {
+ public:
+  void TearDown() {
+    for (auto path : tmp_files_) {
+      std::remove(path.c_str());
+    }
+  }
+
+  void CreateFile(const std::string path, int64_t size) {
+    FILE* file = fopen(path.c_str(), "w");
+    // Only track and size the file if it was actually opened; calling
+    // ftruncate/fclose on a null FILE* would crash.
+    if (file != nullptr) {
+      tmp_files_.push_back(path);
+      ftruncate(fileno(file), size);
+      fclose(file);
+    }
+  }
+
+ private:
+  std::vector<std::string> tmp_files_;
+};
+
+} // namespace ipc
 } // namespace arrow
 
-#endif // ARROW_TYPES_FLOATING_H
+#endif // ARROW_IPC_TEST_COMMON_H
diff --git a/cpp/src/arrow/table/schema-test.cc b/cpp/src/arrow/schema-test.cc
similarity index 72%
rename from cpp/src/arrow/table/schema-test.cc
rename to cpp/src/arrow/schema-test.cc
index 9dfade2695311..a1de1dc5ac8a4 100644
--- a/cpp/src/arrow/table/schema-test.cc
+++ b/cpp/src/arrow/schema-test.cc
@@ -15,14 +15,14 @@
 // specific language governing permissions and limitations
 // under the License.
 
-#include <gtest/gtest.h>
 #include <memory>
 #include <string>
 #include <vector>
 
-#include "arrow/table/schema.h"
+#include "gtest/gtest.h"
+
+#include "arrow/schema.h"
 #include "arrow/type.h"
-#include "arrow/types/string.h"
 
 using std::shared_ptr;
 using std::vector;
@@ -32,25 +32,20 @@ namespace arrow {
 
 const auto INT32 = std::make_shared<Int32Type>();
 
 TEST(TestField, Basics) {
-  shared_ptr<DataType> ftype = INT32;
-  shared_ptr<DataType> ftype_nn = std::make_shared<Int32Type>(false);
-  Field f0("f0", ftype);
-  Field f0_nn("f0", ftype_nn);
+  Field f0("f0", INT32);
+  Field f0_nn("f0", INT32, false);
 
   ASSERT_EQ(f0.name, "f0");
-  ASSERT_EQ(f0.type->ToString(), ftype->ToString());
+  ASSERT_EQ(f0.type->ToString(), INT32->ToString());
 
-  ASSERT_TRUE(f0.nullable());
-  ASSERT_FALSE(f0_nn.nullable());
+  ASSERT_TRUE(f0.nullable);
+  ASSERT_FALSE(f0_nn.nullable);
 }
 
 TEST(TestField, Equals) {
-  shared_ptr<DataType> ftype = INT32;
-  shared_ptr<DataType> ftype_nn = std::make_shared<Int32Type>(false);
-
-  Field f0("f0", ftype);
-  Field f0_nn("f0", ftype_nn);
-  Field f0_other("f0", ftype);
+  Field f0("f0", INT32);
+  Field f0_nn("f0", INT32, false);
+  Field f0_other("f0", INT32);
 
   ASSERT_EQ(f0, f0_other);
   ASSERT_NE(f0, f0_nn);
@@ -63,12 +58,12 @@ class TestSchema : public ::testing::Test {
 
 TEST_F(TestSchema, Basics) {
   auto f0 = std::make_shared<Field>("f0", INT32);
-  auto f1 = std::make_shared<Field>("f1", std::make_shared<UInt8Type>(false));
+  auto f1 = std::make_shared<Field>("f1", std::make_shared<UInt8Type>(), false);
   auto f1_optional = std::make_shared<Field>("f1", std::make_shared<UInt8Type>());
 
   auto f2 = std::make_shared<Field>("f2", std::make_shared<StringType>());
 
-  vector<shared_ptr<Field> > fields = {f0, f1, f2};
+  vector<shared_ptr<Field>> fields = {f0, f1, f2};
   auto schema = std::make_shared<Schema>(fields);
 
   ASSERT_EQ(3, schema->num_fields());
@@ -78,7 +73,7 @@ TEST_F(TestSchema, Basics) {
 
   auto schema2 = std::make_shared<Schema>(fields);
 
-  vector<shared_ptr<Field> > fields3 = {f0, f1_optional, f2};
+  vector<shared_ptr<Field>> fields3 = {f0, f1_optional, f2};
   auto schema3 = std::make_shared<Schema>(fields3);
   ASSERT_TRUE(schema->Equals(schema2));
   ASSERT_FALSE(schema->Equals(schema3));
@@ -88,21 +83,20 @@
 }
 
 TEST_F(TestSchema, ToString) {
-  auto f0 = std::make_shared<Field>("f0", std::make_shared<Int32Type>());
-  auto f1 = std::make_shared<Field>("f1", std::make_shared<UInt8Type>(false));
+  auto f0 = std::make_shared<Field>("f0", INT32);
+  auto f1 = std::make_shared<Field>("f1", std::make_shared<UInt8Type>(),
false); auto f2 = std::make_shared("f2", std::make_shared()); auto f3 = std::make_shared("f3", std::make_shared(std::make_shared())); - vector > fields = {f0, f1, f2, f3}; + vector> fields = {f0, f1, f2, f3}; auto schema = std::make_shared(fields); std::string result = schema->ToString(); - std::string expected = R"(f0 int32 -f1 uint8 not null -f2 string -f3 list -)"; + std::string expected = R"(f0: int32 +f1: uint8 not null +f2: string +f3: list)"; ASSERT_EQ(expected, result); } diff --git a/cpp/src/arrow/table/schema.cc b/cpp/src/arrow/schema.cc similarity index 88% rename from cpp/src/arrow/table/schema.cc rename to cpp/src/arrow/schema.cc index d49d0a713e7f4..18aad0e806ff2 100644 --- a/cpp/src/arrow/table/schema.cc +++ b/cpp/src/arrow/schema.cc @@ -15,7 +15,7 @@ // specific language governing permissions and limitations // under the License. -#include "arrow/table/schema.h" +#include "arrow/schema.h" #include #include @@ -26,7 +26,7 @@ namespace arrow { -Schema::Schema(const std::vector >& fields) : +Schema::Schema(const std::vector>& fields) : fields_(fields) {} bool Schema::Equals(const Schema& other) const { @@ -49,8 +49,13 @@ bool Schema::Equals(const std::shared_ptr& other) const { std::string Schema::ToString() const { std::stringstream buffer; + int i = 0; for (auto field : fields_) { - buffer << field->ToString() << std::endl; + if (i > 0) { + buffer << std::endl; + } + buffer << field->ToString(); + ++i; } return buffer.str(); } diff --git a/cpp/src/arrow/table/schema.h b/cpp/src/arrow/schema.h similarity index 91% rename from cpp/src/arrow/table/schema.h rename to cpp/src/arrow/schema.h index 103f01b26e3ca..52f3c1ceae46d 100644 --- a/cpp/src/arrow/table/schema.h +++ b/cpp/src/arrow/schema.h @@ -22,13 +22,13 @@ #include #include -#include "arrow/type.h" - namespace arrow { +struct Field; + class Schema { public: - explicit Schema(const std::vector >& fields); + explicit Schema(const std::vector>& fields); // Returns true if all of the schema fields are equal bool Equals(const Schema& other) const; @@ -47,7 +47,7 @@ class Schema { } private: - std::vector > fields_; + std::vector> fields_; }; } // namespace arrow diff --git a/cpp/src/arrow/table/table-test.cc b/cpp/src/arrow/table-test.cc similarity index 92% rename from cpp/src/arrow/table/table-test.cc rename to cpp/src/arrow/table-test.cc index 8b354e8503c71..4c7b8f80486de 100644 --- a/cpp/src/arrow/table/table-test.cc +++ b/cpp/src/arrow/table-test.cc @@ -15,19 +15,19 @@ // specific language governing permissions and limitations // under the License. -#include -#include #include #include #include -#include "arrow/table/column.h" -#include "arrow/table/schema.h" -#include "arrow/table/table.h" -#include "arrow/table/test-common.h" +#include "gtest/gtest.h" + +#include "arrow/column.h" +#include "arrow/schema.h" +#include "arrow/table.h" #include "arrow/test-util.h" #include "arrow/type.h" -#include "arrow/types/integer.h" +#include "arrow/types/primitive.h" +#include "arrow/util/status.h" using std::shared_ptr; using std::vector; @@ -45,7 +45,7 @@ class TestTable : public TestBase { auto f1 = std::make_shared("f1", UINT8); auto f2 = std::make_shared("f2", INT16); - vector > fields = {f0, f1, f2}; + vector> fields = {f0, f1, f2}; schema_ = std::make_shared(fields); columns_ = { @@ -58,7 +58,7 @@ class TestTable : public TestBase { protected: std::unique_ptr
table_;
   shared_ptr<Schema> schema_;
 
-  vector<shared_ptr<Column> > columns_;
+  vector<shared_ptr<Column>> columns_;
 };
 
 TEST_F(TestTable, EmptySchema) {
diff --git a/cpp/src/arrow/table/table.cc b/cpp/src/arrow/table.cc
similarity index 69%
rename from cpp/src/arrow/table/table.cc
rename to cpp/src/arrow/table.cc
index 0c788b8fe3ff3..e405c1d508c22 100644
--- a/cpp/src/arrow/table/table.cc
+++ b/cpp/src/arrow/table.cc
@@ -15,20 +15,30 @@
 // specific language governing permissions and limitations
 // under the License.
 
-#include "arrow/table/table.h"
+#include "arrow/table.h"
 
+#include <memory>
 #include <sstream>
 #include <string>
 
-#include "arrow/table/column.h"
-#include "arrow/table/schema.h"
-#include "arrow/type.h"
+#include "arrow/column.h"
+#include "arrow/schema.h"
 #include "arrow/util/status.h"
 
 namespace arrow {
 
+RowBatch::RowBatch(const std::shared_ptr<Schema>& schema, int num_rows,
+    const std::vector<std::shared_ptr<Array>>& columns) :
+    schema_(schema),
+    num_rows_(num_rows),
+    columns_(columns) {}
+
+const std::string& RowBatch::column_name(int i) const {
+  return schema_->field(i)->name;
+}
+
 Table::Table(const std::string& name, const std::shared_ptr<Schema>& schema,
-    const std::vector<std::shared_ptr<Column> >& columns) :
+    const std::vector<std::shared_ptr<Column>>& columns) :
     name_(name),
     schema_(schema),
     columns_(columns) {
@@ -40,7 +50,7 @@
 }
 
 Table::Table(const std::string& name, const std::shared_ptr<Schema>& schema,
-    const std::vector<std::shared_ptr<Column> >& columns, int64_t num_rows) :
+    const std::vector<std::shared_ptr<Column>>& columns, int64_t num_rows) :
     name_(name),
     schema_(schema),
     columns_(columns),
@@ -51,16 +61,19 @@ Status Table::ValidateColumns() const {
     return Status::Invalid("Number of columns did not match schema");
   }
 
-  if (columns_.size() == 0) {
-    return Status::OK();
-  }
-
   // Make sure columns are all the same length
   for (size_t i = 0; i < columns_.size(); ++i) {
     const Column* col = columns_[i].get();
+    if (col == nullptr) {
+      std::stringstream ss;
+      // Take the name from the schema: col is null here, so it must not
+      // be dereferenced.
+      ss << "Column " << i << " named " << schema_->field(i)->name
+         << " was null";
+      return Status::Invalid(ss.str());
+    }
     if (col->length() != num_rows_) {
       std::stringstream ss;
-      ss << "Column " << i << " expected length "
+      ss << "Column " << i << " named " << col->name()
+         << " expected length "
         << num_rows_ << " but got length "
        << col->length();
diff --git a/cpp/src/arrow/table/table.h b/cpp/src/arrow/table.h
similarity index 55%
rename from cpp/src/arrow/table/table.h
rename to cpp/src/arrow/table.h
index b0129387b710c..e2f73a2eeddcb 100644
--- a/cpp/src/arrow/table/table.h
+++ b/cpp/src/arrow/table.h
@@ -15,28 +15,74 @@
 // specific language governing permissions and limitations
 // under the License.
 
-#ifndef ARROW_TABLE_TABLE_H
-#define ARROW_TABLE_TABLE_H
+#ifndef ARROW_TABLE_H
+#define ARROW_TABLE_H
 
+#include <cstdint>
 #include <memory>
 #include <string>
 #include <vector>
 
 namespace arrow {
 
+class Array;
 class Column;
 class Schema;
 class Status;
 
+// A row batch is a simpler and more rigid table data structure intended for
+// use primarily in shared memory IPC. It contains a schema (metadata) and a
+// corresponding vector of equal-length Arrow arrays
+class RowBatch {
+ public:
+  // num_rows is a parameter to allow for row batches of a particular size not
+  // having any materialized columns.
Each array should have the same length as
+  // num_rows
+  RowBatch(const std::shared_ptr<Schema>& schema, int num_rows,
+      const std::vector<std::shared_ptr<Array>>& columns);
+
+  // @returns: the table's schema
+  const std::shared_ptr<Schema>& schema() const {
+    return schema_;
+  }
+
+  // @returns: the i-th column
+  // Note: Does not boundscheck
+  const std::shared_ptr<Array>& column(int i) const {
+    return columns_[i];
+  }
+
+  const std::string& column_name(int i) const;
+
+  // @returns: the number of columns in the table
+  int num_columns() const {
+    return columns_.size();
+  }
+
+  // @returns: the number of rows (the corresponding length of each column)
+  int64_t num_rows() const {
+    return num_rows_;
+  }
+
+ private:
+  std::shared_ptr<Schema> schema_;
+  int num_rows_;
+  std::vector<std::shared_ptr<Array>> columns_;
+};
+
 // Immutable container of fixed-length columns conforming to a particular schema
 class Table {
  public:
   // If columns is zero-length, the table's number of rows is zero
   Table(const std::string& name, const std::shared_ptr<Schema>& schema,
-      const std::vector<std::shared_ptr<Column> >& columns);
+      const std::vector<std::shared_ptr<Column>>& columns);
 
+  // num_rows is a parameter to allow for tables of a particular size not
+  // having any materialized columns. Each column should therefore have the
+  // same length as num_rows -- you can validate this using
+  // Table::ValidateColumns
   Table(const std::string& name, const std::shared_ptr<Schema>& schema,
-      const std::vector<std::shared_ptr<Column> >& columns, int64_t num_rows);
+      const std::vector<std::shared_ptr<Column>>& columns, int64_t num_rows);
 
   // @returns: the table's name, if any (may be length 0)
   const std::string& name() const {
@@ -72,11 +118,11 @@ class Table {
 
   std::string name_;
   std::shared_ptr<Schema> schema_;
 
-  std::vector<std::shared_ptr<Column> > columns_;
+  std::vector<std::shared_ptr<Column>> columns_;
   int64_t num_rows_;
 };
 
 } // namespace arrow
 
-#endif // ARROW_TABLE_TABLE_H
+#endif // ARROW_TABLE_H
diff --git a/cpp/src/arrow/table/test-common.h b/cpp/src/arrow/table/test-common.h
deleted file mode 100644
index 50a5f6a2f5018..0000000000000
--- a/cpp/src/arrow/table/test-common.h
+++ /dev/null
@@ -1,54 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements. See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership. The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License. You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied. See the License for the
-// specific language governing permissions and limitations
-// under the License.
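For reference, a minimal sketch of assembling a single-column RowBatch from the pieces declared above. The helper name MakeExampleBatch and the pre-filled value buffer are assumptions for illustration; the constructors used (Field, Schema, Int32Array, RowBatch) are exactly those declared in this patch:

#include <memory>
#include <vector>

#include "arrow/schema.h"
#include "arrow/table.h"
#include "arrow/type.h"
#include "arrow/types/primitive.h"
#include "arrow/util/buffer.h"

// Sketch only: assumes `values` already holds num_rows little-endian
// int32 values.
std::shared_ptr<arrow::RowBatch> MakeExampleBatch(
    const std::shared_ptr<arrow::Buffer>& values, int num_rows) {
  auto f0 = std::make_shared<arrow::Field>(
      "f0", std::make_shared<arrow::Int32Type>());
  auto schema = std::make_shared<arrow::Schema>(
      std::vector<std::shared_ptr<arrow::Field>>{f0});
  auto col = std::make_shared<arrow::Int32Array>(num_rows, values);
  return std::make_shared<arrow::RowBatch>(
      schema, num_rows, std::vector<std::shared_ptr<arrow::Array>>{col});
}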
- -#include -#include -#include -#include -#include - -#include "arrow/table/column.h" -#include "arrow/table/schema.h" -#include "arrow/table/table.h" -#include "arrow/test-util.h" -#include "arrow/type.h" -#include "arrow/util/bit-util.h" -#include "arrow/util/buffer.h" -#include "arrow/util/memory-pool.h" - -namespace arrow { - -class TestBase : public ::testing::Test { - public: - void SetUp() { - pool_ = GetDefaultMemoryPool(); - } - - template - std::shared_ptr MakePrimitive(int32_t length, int32_t null_count = 0) { - auto data = std::make_shared(pool_); - auto nulls = std::make_shared(pool_); - EXPECT_OK(data->Resize(length * sizeof(typename ArrayType::value_type))); - EXPECT_OK(nulls->Resize(util::bytes_for_bits(length))); - return std::make_shared(length, data, 10, nulls); - } - - protected: - MemoryPool* pool_; -}; - -} // namespace arrow diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index 0898c8e3e3aa3..a9fb2a7644ab3 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -18,26 +18,39 @@ #ifndef ARROW_TEST_UTIL_H_ #define ARROW_TEST_UTIL_H_ -#include +#include #include +#include #include #include +#include "gtest/gtest.h" + +#include "arrow/type.h" +#include "arrow/column.h" +#include "arrow/schema.h" +#include "arrow/table.h" #include "arrow/util/bit-util.h" +#include "arrow/util/buffer.h" +#include "arrow/util/memory-pool.h" #include "arrow/util/random.h" #include "arrow/util/status.h" #define ASSERT_RAISES(ENUM, expr) \ do { \ Status s = (expr); \ - ASSERT_TRUE(s.Is##ENUM()); \ + if (!s.Is##ENUM()) { \ + FAIL() << s.ToString(); \ + } \ } while (0) #define ASSERT_OK(expr) \ do { \ Status s = (expr); \ - ASSERT_TRUE(s.ok()); \ + if (!s.ok()) { \ + FAIL() << s.ToString(); \ + } \ } while (0) @@ -50,6 +63,27 @@ namespace arrow { +class TestBase : public ::testing::Test { + public: + void SetUp() { + pool_ = default_memory_pool(); + } + + template + std::shared_ptr MakePrimitive(int32_t length, int32_t null_count = 0) { + auto data = std::make_shared(pool_); + auto nulls = std::make_shared(pool_); + EXPECT_OK(data->Resize(length * sizeof(typename ArrayType::value_type))); + EXPECT_OK(nulls->Resize(util::bytes_for_bits(length))); + return std::make_shared(length, data, 10, nulls); + } + + protected: + MemoryPool* pool_; +}; + +namespace test { + template void randint(int64_t N, T lower, T upper, std::vector* out) { Random rng(random_seed()); @@ -84,6 +118,33 @@ void random_nulls(int64_t n, double pct_null, std::vector* nulls) { } } +static inline void random_bytes(int n, uint32_t seed, uint8_t* out) { + std::mt19937 gen(seed); + std::uniform_int_distribution d(0, 255); + + for (int i = 0; i < n; ++i) { + out[i] = d(gen) & 0xFF; + } +} + +template +void rand_uniform_int(int n, uint32_t seed, T min_value, T max_value, T* out) { + std::mt19937 gen(seed); + std::uniform_int_distribution d(min_value, max_value); + for (int i = 0; i < n; ++i) { + out[i] = d(gen); + } +} + +static inline int bitmap_popcount(const uint8_t* data, int length) { + int count = 0; + for (int i = 0; i < length; ++i) { + // TODO: accelerate this + if (util::get_bit(data, i)) ++count; + } + return count; +} + static inline int null_count(const std::vector& nulls) { int result = 0; for (size_t i = 0; i < nulls.size(); ++i) { @@ -102,6 +163,7 @@ std::shared_ptr bytes_to_null_buffer(uint8_t* bytes, int length) { return out; } +} // namespace test } // namespace arrow #endif // ARROW_TEST_UTIL_H_ diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 
0a2e817ad30c6..f7f835e96a729 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -24,45 +24,37 @@ namespace arrow { std::string Field::ToString() const { std::stringstream ss; - ss << this->name << " " << this->type->ToString(); + ss << this->name << ": " << this->type->ToString(); + if (!this->nullable) { + ss << " not null"; + } return ss.str(); } DataType::~DataType() {} -StringType::StringType(bool nullable) - : DataType(LogicalType::STRING, nullable) {} - -StringType::StringType(const StringType& other) - : StringType(other.nullable) {} +StringType::StringType() : DataType(Type::STRING) {} std::string StringType::ToString() const { std::string result(name()); - if (!nullable) { - result.append(" not null"); - } return result; } std::string ListType::ToString() const { std::stringstream s; - s << "list<" << value_type->ToString() << ">"; - if (!this->nullable) { - s << " not null"; - } + s << "list<" << value_field()->ToString() << ">"; return s.str(); } std::string StructType::ToString() const { std::stringstream s; s << "struct<"; - for (size_t i = 0; i < fields_.size(); ++i) { + for (int i = 0; i < this->num_children(); ++i) { if (i > 0) s << ", "; - const std::shared_ptr& field = fields_[i]; + const std::shared_ptr& field = this->child(i); s << field->name << ": " << field->type->ToString(); } s << ">"; - if (!nullable) s << " not null"; return s.str(); } diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 00b01ea86e8a5..5984b6718ddbe 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -18,62 +18,34 @@ #ifndef ARROW_TYPE_H #define ARROW_TYPE_H +#include #include #include #include namespace arrow { -// Physical data type that describes the memory layout of values. See details -// for each type -enum class LayoutEnum: char { - // A physical type consisting of some non-negative number of bytes - BYTE = 0, - - // A physical type consisting of some non-negative number of bits - BIT = 1, - - // A parametric variable-length value type. Full specification requires a - // child logical type - LIST = 2, - - // A collection of multiple equal-length child arrays. Parametric type taking - // 1 or more child logical types - STRUCT = 3, - - // An array with heterogeneous value types. Parametric types taking 1 or more - // child logical types - DENSE_UNION = 4, - SPARSE_UNION = 5 -}; - - -struct LayoutType { - LayoutEnum type; - explicit LayoutType(LayoutEnum type) : type(type) {} -}; - // Data types in this library are all *logical*. They can be expressed as // either a primitive physical type (bytes or bits of some fixed size), a // nested type consisting of other data types, or another data type (e.g. 
a // timestamp encoded as an int64) -struct LogicalType { +struct Type { enum type { // A degenerate NULL type represented as 0 bytes/bits NA = 0, - // Little-endian integer types - UINT8 = 1, - INT8 = 2, - UINT16 = 3, - INT16 = 4, - UINT32 = 5, - INT32 = 6, - UINT64 = 7, - INT64 = 8, - // A boolean value represented as 1 bit - BOOL = 9, + BOOL = 1, + + // Little-endian integer types + UINT8 = 2, + INT8 = 3, + UINT16 = 4, + INT16 = 5, + UINT32 = 6, + INT32 = 7, + UINT64 = 8, + INT64 = 9, // 4-byte floating point value FLOAT = 10, @@ -131,30 +103,38 @@ struct LogicalType { }; }; +struct Field; + struct DataType { - LogicalType::type type; - bool nullable; + Type::type type; - explicit DataType(LogicalType::type type, bool nullable = true) : - type(type), - nullable(nullable) {} + std::vector> children_; + + explicit DataType(Type::type type) : + type(type) {} virtual ~DataType(); bool Equals(const DataType* other) { // Call with a pointer so more friendly to subclasses - return this == other || (this->type == other->type && - this->nullable == other->nullable); + return this == other || (this->type == other->type); } bool Equals(const std::shared_ptr& other) { return Equals(other.get()); } + const std::shared_ptr& child(int i) const { + return children_[i]; + } + + int num_children() const { + return children_.size(); + } + virtual std::string ToString() const = 0; }; -typedef std::shared_ptr LayoutPtr; typedef std::shared_ptr TypePtr; // A field is a piece of metadata that includes (for now) a name and a data @@ -166,9 +146,13 @@ struct Field { // The field's data type TypePtr type; - Field(const std::string& name, const TypePtr& type) : + // Fields can be nullable + bool nullable; + + Field(const std::string& name, const TypePtr& type, bool nullable = true) : name(name), - type(type) {} + type(type), + nullable(nullable) {} bool operator==(const Field& other) const { return this->Equals(other); @@ -180,6 +164,7 @@ struct Field { bool Equals(const Field& other) const { return (this == &other) || (this->name == other.name && + this->nullable == other.nullable && this->type->Equals(other.type.get())); } @@ -187,36 +172,12 @@ struct Field { return Equals(*other.get()); } - bool nullable() const { - return this->type->nullable; - } - std::string ToString() const; }; -struct BytesType : public LayoutType { - int size; - - explicit BytesType(int size) - : LayoutType(LayoutEnum::BYTE), - size(size) {} - - BytesType(const BytesType& other) - : BytesType(other.size) {} -}; - -struct ListLayoutType : public LayoutType { - LayoutPtr value_type; - - explicit ListLayoutType(const LayoutPtr& value_type) - : LayoutType(LayoutEnum::BYTE), - value_type(value_type) {} -}; - template struct PrimitiveType : public DataType { - explicit PrimitiveType(bool nullable = true) - : DataType(Derived::type_enum, nullable) {} + PrimitiveType() : DataType(Derived::type_enum) {} std::string ToString() const override; }; @@ -224,22 +185,19 @@ struct PrimitiveType : public DataType { template inline std::string PrimitiveType::ToString() const { std::string result(static_cast(this)->name()); - if (!nullable) { - result.append(" not null"); - } return result; } -#define PRIMITIVE_DECL(TYPENAME, C_TYPE, ENUM, SIZE, NAME) \ - typedef C_TYPE c_type; \ - static constexpr LogicalType::type type_enum = LogicalType::ENUM; \ - static constexpr int size = SIZE; \ - \ - explicit TYPENAME(bool nullable = true) \ - : PrimitiveType(nullable) {} \ - \ - static const char* name() { \ - return NAME; \ +#define PRIMITIVE_DECL(TYPENAME, 
C_TYPE, ENUM, SIZE, NAME) \ + typedef C_TYPE c_type; \ + static constexpr Type::type type_enum = Type::ENUM; \ + static constexpr int size = SIZE; \ + \ + TYPENAME() \ + : PrimitiveType() {} \ + \ + static const char* name() { \ + return NAME; \ } struct NullType : public PrimitiveType { @@ -292,11 +250,23 @@ struct DoubleType : public PrimitiveType { struct ListType : public DataType { // List can contain any other logical value type - TypePtr value_type; + explicit ListType(const std::shared_ptr& value_type) + : DataType(Type::LIST) { + children_ = {std::make_shared("item", value_type)}; + } + + explicit ListType(const std::shared_ptr& value_field) + : DataType(Type::LIST) { + children_ = {value_field}; + } - explicit ListType(const TypePtr& value_type, bool nullable = true) - : DataType(LogicalType::LIST, nullable), - value_type(value_type) {} + const std::shared_ptr& value_field() const { + return children_[0]; + } + + const std::shared_ptr& value_type() const { + return children_[0]->type; + } static char const *name() { return "list"; @@ -307,9 +277,7 @@ struct ListType : public DataType { // String is a logical type consisting of a physical list of 1-byte values struct StringType : public DataType { - explicit StringType(bool nullable = true); - - StringType(const StringType& other); + StringType(); static char const *name() { return "string"; @@ -319,20 +287,9 @@ struct StringType : public DataType { }; struct StructType : public DataType { - std::vector > fields_; - - explicit StructType(const std::vector >& fields, - bool nullable = true) - : DataType(LogicalType::STRUCT, nullable) { - fields_ = fields; - } - - const std::shared_ptr& field(int i) const { - return fields_[i]; - } - - int num_children() const { - return fields_.size(); + explicit StructType(const std::vector>& fields) + : DataType(Type::STRUCT) { + children_ = fields; } std::string ToString() const override; diff --git a/cpp/src/arrow/types/CMakeLists.txt b/cpp/src/arrow/types/CMakeLists.txt index 57cabdefd2525..595b3be6e1661 100644 --- a/cpp/src/arrow/types/CMakeLists.txt +++ b/cpp/src/arrow/types/CMakeLists.txt @@ -26,8 +26,6 @@ install(FILES construct.h datetime.h decimal.h - floating.h - integer.h json.h list.h primitive.h diff --git a/cpp/src/arrow/types/boolean.h b/cpp/src/arrow/types/boolean.h index a5023d7b368d2..1cb91f9ba4966 100644 --- a/cpp/src/arrow/types/boolean.h +++ b/cpp/src/arrow/types/boolean.h @@ -22,7 +22,7 @@ namespace arrow { -typedef PrimitiveArrayImpl BooleanArray; +// typedef PrimitiveArrayImpl BooleanArray; class BooleanBuilder : public ArrayBuilder { }; diff --git a/cpp/src/arrow/types/collection.h b/cpp/src/arrow/types/collection.h index 42a9c926bb134..46d84f1f183c8 100644 --- a/cpp/src/arrow/types/collection.h +++ b/cpp/src/arrow/types/collection.h @@ -25,7 +25,7 @@ namespace arrow { -template +template struct CollectionType : public DataType { std::vector child_types_; diff --git a/cpp/src/arrow/types/construct.cc b/cpp/src/arrow/types/construct.cc index 43f01a3051385..290decd81ff42 100644 --- a/cpp/src/arrow/types/construct.cc +++ b/cpp/src/arrow/types/construct.cc @@ -19,24 +19,26 @@ #include -#include "arrow/types/floating.h" -#include "arrow/types/integer.h" +#include "arrow/type.h" +#include "arrow/types/primitive.h" #include "arrow/types/list.h" #include "arrow/types/string.h" +#include "arrow/util/buffer.h" #include "arrow/util/status.h" namespace arrow { class ArrayBuilder; -// Initially looked at doing this with vtables, but shared pointers makes it -// difficult - #define 
BUILDER_CASE(ENUM, BuilderType) \ - case LogicalType::ENUM: \ + case Type::ENUM: \ out->reset(new BuilderType(pool, type)); \ return Status::OK(); +// Initially looked at doing this with vtables, but shared pointers makes it +// difficult +// +// TODO(wesm): come up with a less monolithic strategy Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, std::shared_ptr* out) { switch (type->type) { @@ -56,30 +58,41 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, BUILDER_CASE(STRING, StringBuilder); - case LogicalType::LIST: + case Type::LIST: { std::shared_ptr value_builder; const std::shared_ptr& value_type = static_cast( - type.get())->value_type; + type.get())->value_type(); RETURN_NOT_OK(MakeBuilder(pool, value_type, &value_builder)); out->reset(new ListBuilder(pool, type, value_builder)); return Status::OK(); } - // BUILDER_CASE(CHAR, CharBuilder); - - // BUILDER_CASE(VARCHAR, VarcharBuilder); - // BUILDER_CASE(BINARY, BinaryBuilder); - - // BUILDER_CASE(DATE, DateBuilder); - // BUILDER_CASE(TIMESTAMP, TimestampBuilder); - // BUILDER_CASE(TIME, TimeBuilder); + default: + return Status::NotImplemented(type->ToString()); + } +} - // BUILDER_CASE(LIST, ListBuilder); - // BUILDER_CASE(STRUCT, StructBuilder); - // BUILDER_CASE(DENSE_UNION, DenseUnionBuilder); - // BUILDER_CASE(SPARSE_UNION, SparseUnionBuilder); +#define MAKE_PRIMITIVE_ARRAY_CASE(ENUM, ArrayType) \ + case Type::ENUM: \ + out->reset(new ArrayType(type, length, data, null_count, nulls)); \ + return Status::OK(); +Status MakePrimitiveArray(const std::shared_ptr& type, + int32_t length, const std::shared_ptr& data, + int32_t null_count, const std::shared_ptr& nulls, + std::shared_ptr* out) { + switch (type->type) { + MAKE_PRIMITIVE_ARRAY_CASE(UINT8, UInt8Array); + MAKE_PRIMITIVE_ARRAY_CASE(INT8, Int8Array); + MAKE_PRIMITIVE_ARRAY_CASE(UINT16, UInt16Array); + MAKE_PRIMITIVE_ARRAY_CASE(INT16, Int16Array); + MAKE_PRIMITIVE_ARRAY_CASE(UINT32, UInt32Array); + MAKE_PRIMITIVE_ARRAY_CASE(INT32, Int32Array); + MAKE_PRIMITIVE_ARRAY_CASE(UINT64, UInt64Array); + MAKE_PRIMITIVE_ARRAY_CASE(INT64, Int64Array); + MAKE_PRIMITIVE_ARRAY_CASE(FLOAT, FloatArray); + MAKE_PRIMITIVE_ARRAY_CASE(DOUBLE, DoubleArray); default: return Status::NotImplemented(type->ToString()); } diff --git a/cpp/src/arrow/types/construct.h b/cpp/src/arrow/types/construct.h index 59ebe1acddc98..089c484c58bee 100644 --- a/cpp/src/arrow/types/construct.h +++ b/cpp/src/arrow/types/construct.h @@ -18,19 +18,26 @@ #ifndef ARROW_TYPES_CONSTRUCT_H #define ARROW_TYPES_CONSTRUCT_H +#include #include -#include "arrow/type.h" - namespace arrow { +class Array; class ArrayBuilder; +class Buffer; +struct DataType; class MemoryPool; class Status; Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, std::shared_ptr* out); +Status MakePrimitiveArray(const std::shared_ptr& type, + int32_t length, const std::shared_ptr& data, + int32_t null_count, const std::shared_ptr& nulls, + std::shared_ptr* out); + } // namespace arrow #endif // ARROW_BUILDER_H_ diff --git a/cpp/src/arrow/types/datetime.h b/cpp/src/arrow/types/datetime.h index 765fc29dd57ae..e57b66ab46adb 100644 --- a/cpp/src/arrow/types/datetime.h +++ b/cpp/src/arrow/types/datetime.h @@ -31,8 +31,8 @@ struct DateType : public DataType { Unit unit; - explicit DateType(Unit unit = Unit::DAY, bool nullable = true) - : DataType(LogicalType::DATE, nullable), + explicit DateType(Unit unit = Unit::DAY) + : DataType(Type::DATE), unit(unit) {} DateType(const DateType& other) @@ -41,10 +41,6 @@ struct 
DateType : public DataType { static char const *name() { return "date"; } - - // virtual std::string ToString() { - // return name(); - // } }; @@ -58,8 +54,8 @@ struct TimestampType : public DataType { Unit unit; - explicit TimestampType(Unit unit = Unit::MILLI, bool nullable = true) - : DataType(LogicalType::TIMESTAMP, nullable), + explicit TimestampType(Unit unit = Unit::MILLI) + : DataType(Type::TIMESTAMP), unit(unit) {} TimestampType(const TimestampType& other) @@ -68,10 +64,6 @@ struct TimestampType : public DataType { static char const *name() { return "timestamp"; } - - // virtual std::string ToString() { - // return name(); - // } }; } // namespace arrow diff --git a/cpp/src/arrow/types/floating.cc b/cpp/src/arrow/types/floating.cc deleted file mode 100644 index bde28266e638c..0000000000000 --- a/cpp/src/arrow/types/floating.cc +++ /dev/null @@ -1,22 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#include "arrow/types/floating.h" - -namespace arrow { - -} // namespace arrow diff --git a/cpp/src/arrow/types/integer.cc b/cpp/src/arrow/types/integer.cc deleted file mode 100644 index 4696536616971..0000000000000 --- a/cpp/src/arrow/types/integer.cc +++ /dev/null @@ -1,22 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#include "arrow/types/integer.h" - -namespace arrow { - -} // namespace arrow diff --git a/cpp/src/arrow/types/integer.h b/cpp/src/arrow/types/integer.h deleted file mode 100644 index 568419124941f..0000000000000 --- a/cpp/src/arrow/types/integer.h +++ /dev/null @@ -1,57 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. 
You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#ifndef ARROW_TYPES_INTEGER_H -#define ARROW_TYPES_INTEGER_H - -#include -#include - -#include "arrow/types/primitive.h" -#include "arrow/type.h" - -namespace arrow { - -// Array containers - -typedef PrimitiveArrayImpl UInt8Array; -typedef PrimitiveArrayImpl Int8Array; - -typedef PrimitiveArrayImpl UInt16Array; -typedef PrimitiveArrayImpl Int16Array; - -typedef PrimitiveArrayImpl UInt32Array; -typedef PrimitiveArrayImpl Int32Array; - -typedef PrimitiveArrayImpl UInt64Array; -typedef PrimitiveArrayImpl Int64Array; - -// Builders - -typedef PrimitiveBuilder UInt8Builder; -typedef PrimitiveBuilder UInt16Builder; -typedef PrimitiveBuilder UInt32Builder; -typedef PrimitiveBuilder UInt64Builder; - -typedef PrimitiveBuilder Int8Builder; -typedef PrimitiveBuilder Int16Builder; -typedef PrimitiveBuilder Int32Builder; -typedef PrimitiveBuilder Int64Builder; - -} // namespace arrow - -#endif // ARROW_TYPES_INTEGER_H diff --git a/cpp/src/arrow/types/json.cc b/cpp/src/arrow/types/json.cc index 168e370d51a14..fb731edd6073f 100644 --- a/cpp/src/arrow/types/json.cc +++ b/cpp/src/arrow/types/json.cc @@ -20,7 +20,6 @@ #include #include "arrow/type.h" -#include "arrow/types/string.h" #include "arrow/types/union.h" namespace arrow { diff --git a/cpp/src/arrow/types/json.h b/cpp/src/arrow/types/json.h index b67fb3807aded..9c850afac0af4 100644 --- a/cpp/src/arrow/types/json.h +++ b/cpp/src/arrow/types/json.h @@ -28,8 +28,8 @@ struct JSONScalar : public DataType { static TypePtr dense_type; static TypePtr sparse_type; - explicit JSONScalar(bool dense = true, bool nullable = true) - : DataType(LogicalType::JSON_SCALAR, nullable), + explicit JSONScalar(bool dense = true) + : DataType(Type::JSON_SCALAR), dense(dense) {} }; diff --git a/cpp/src/arrow/types/list-test.cc b/cpp/src/arrow/types/list-test.cc index 02991de2648e7..eb55ca868eeee 100644 --- a/cpp/src/arrow/types/list-test.cc +++ b/cpp/src/arrow/types/list-test.cc @@ -15,20 +15,21 @@ // specific language governing permissions and limitations // under the License. 
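With integer.h and floating.h removed, the typed array and builder aliases now live in types/primitive.h, and builder construction goes through the factory in types/construct.h. A minimal sketch of the equivalent call site after this change; the helper name MakeInt32Builder is hypothetical, everything else uses only declarations from this patch:

#include <memory>

#include "arrow/builder.h"
#include "arrow/type.h"
#include "arrow/types/construct.h"
#include "arrow/util/memory-pool.h"
#include "arrow/util/status.h"

// Sketch only: MakeBuilder dispatches on Type::INT32 and hands back an
// Int32Builder (the typedef now declared in types/primitive.h).
arrow::Status MakeInt32Builder(std::shared_ptr<arrow::ArrayBuilder>* out) {
  arrow::MemoryPool* pool = arrow::default_memory_pool();
  std::shared_ptr<arrow::DataType> type =
      std::make_shared<arrow::Int32Type>();
  return arrow::MakeBuilder(pool, type, out);
}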
-#include #include #include #include #include #include +#include "gtest/gtest.h" + #include "arrow/array.h" +#include "arrow/builder.h" #include "arrow/test-util.h" #include "arrow/type.h" #include "arrow/types/construct.h" -#include "arrow/types/integer.h" #include "arrow/types/list.h" -#include "arrow/types/string.h" +#include "arrow/types/primitive.h" #include "arrow/types/test-common.h" #include "arrow/util/status.h" @@ -39,27 +40,24 @@ using std::vector; namespace arrow { -class ArrayBuilder; - TEST(TypesTest, TestListType) { std::shared_ptr vt = std::make_shared(); ListType list_type(vt); - ASSERT_EQ(list_type.type, LogicalType::LIST); + ASSERT_EQ(list_type.type, Type::LIST); ASSERT_EQ(list_type.name(), string("list")); - ASSERT_EQ(list_type.ToString(), string("list")); + ASSERT_EQ(list_type.ToString(), string("list")); - ASSERT_EQ(list_type.value_type->type, vt->type); - ASSERT_EQ(list_type.value_type->type, vt->type); + ASSERT_EQ(list_type.value_type()->type, vt->type); + ASSERT_EQ(list_type.value_type()->type, vt->type); - std::shared_ptr st = std::make_shared(false); - std::shared_ptr lt = std::make_shared(st, false); - ASSERT_EQ(lt->ToString(), string("list not null")); + std::shared_ptr st = std::make_shared(); + std::shared_ptr lt = std::make_shared(st); + ASSERT_EQ(lt->ToString(), string("list")); - ListType lt2(lt, false); - ASSERT_EQ(lt2.ToString(), - string("list not null> not null")); + ListType lt2(lt); + ASSERT_EQ(lt2.ToString(), string("list>")); } // ---------------------------------------------------------------------- diff --git a/cpp/src/arrow/types/list.cc b/cpp/src/arrow/types/list.cc index 69a79a77fabe0..670ee4da11675 100644 --- a/cpp/src/arrow/types/list.cc +++ b/cpp/src/arrow/types/list.cc @@ -19,4 +19,33 @@ namespace arrow { +bool ListArray::EqualsExact(const ListArray& other) const { + if (this == &other) return true; + if (null_count_ != other.null_count_) { + return false; + } + + bool equal_offsets = offset_buf_->Equals(*other.offset_buf_, + length_ + 1); + bool equal_nulls = true; + if (null_count_ > 0) { + equal_nulls = nulls_->Equals(*other.nulls_, + util::bytes_for_bits(length_)); + } + + if (!(equal_offsets && equal_nulls)) { + return false; + } + + return values()->Equals(other.values()); +} + +bool ListArray::Equals(const std::shared_ptr& arr) const { + if (this == arr.get()) return true; + if (this->type_enum() != arr->type_enum()) { + return false; + } + return EqualsExact(*static_cast(arr.get())); +} + } // namespace arrow diff --git a/cpp/src/arrow/types/list.h b/cpp/src/arrow/types/list.h index 210c76a046c21..141f762458b3b 100644 --- a/cpp/src/arrow/types/list.h +++ b/cpp/src/arrow/types/list.h @@ -21,12 +21,10 @@ #include #include #include -#include #include "arrow/array.h" #include "arrow/builder.h" #include "arrow/type.h" -#include "arrow/types/integer.h" #include "arrow/types/primitive.h" #include "arrow/util/bit-util.h" #include "arrow/util/buffer.h" @@ -38,29 +36,19 @@ class MemoryPool; class ListArray : public Array { public: - ListArray() : Array(), offset_buf_(nullptr), offsets_(nullptr) {} - ListArray(const TypePtr& type, int32_t length, std::shared_ptr offsets, const ArrayPtr& values, int32_t null_count = 0, - std::shared_ptr nulls = nullptr) { - Init(type, length, offsets, values, null_count, nulls); - } - - virtual ~ListArray() {} - - void Init(const TypePtr& type, int32_t length, std::shared_ptr offsets, - const ArrayPtr& values, - int32_t null_count = 0, - std::shared_ptr nulls = nullptr) { + std::shared_ptr nulls = nullptr) : + 
Array(type, length, null_count, nulls) { offset_buf_ = offsets; offsets_ = offsets == nullptr? nullptr : reinterpret_cast(offset_buf_->data()); - values_ = values; - Array::Init(type, length, null_count, nulls); } + virtual ~ListArray() {} + // Return a shared pointer in case the requestor desires to share ownership // with this array. const std::shared_ptr& values() const {return values_;} @@ -77,6 +65,9 @@ class ListArray : public Array { int32_t value_offset(int i) { return offsets_[i];} int32_t value_length(int i) { return offsets_[i + 1] - offsets_[i];} + bool EqualsExact(const ListArray& other) const; + bool Equals(const std::shared_ptr& arr) const override; + protected: std::shared_ptr offset_buf_; const int32_t* offsets_; @@ -137,8 +128,6 @@ class ListBuilder : public Int32Builder { template std::shared_ptr Transfer() { - auto result = std::make_shared(); - std::shared_ptr items = value_builder_->Finish(); // Add final offset if the length is non-zero @@ -146,8 +135,9 @@ class ListBuilder : public Int32Builder { raw_buffer()[length_] = items->length(); } - result->Init(type_, length_, values_, items, + auto result = std::make_shared(type_, length_, values_, items, null_count_, nulls_); + values_ = nulls_ = nullptr; capacity_ = length_ = null_count_ = 0; diff --git a/cpp/src/arrow/types/primitive-test.cc b/cpp/src/arrow/types/primitive-test.cc index f35a258e2cb57..7eae8cda8c488 100644 --- a/cpp/src/arrow/types/primitive-test.cc +++ b/cpp/src/arrow/types/primitive-test.cc @@ -15,21 +15,17 @@ // specific language governing permissions and limitations // under the License. -#include - #include #include #include #include -#include "arrow/array.h" +#include "gtest/gtest.h" + #include "arrow/builder.h" #include "arrow/test-util.h" #include "arrow/type.h" -#include "arrow/types/boolean.h" #include "arrow/types/construct.h" -#include "arrow/types/floating.h" -#include "arrow/types/integer.h" #include "arrow/types/primitive.h" #include "arrow/types/test-common.h" #include "arrow/util/bit-util.h" @@ -43,23 +39,17 @@ using std::vector; namespace arrow { -TEST(TypesTest, TestBytesType) { - BytesType t1(3); - - ASSERT_EQ(t1.type, LayoutEnum::BYTE); - ASSERT_EQ(t1.size, 3); -} - +class Array; #define PRIMITIVE_TEST(KLASS, ENUM, NAME) \ TEST(TypesTest, TestPrimitive_##ENUM) { \ KLASS tp; \ \ - ASSERT_EQ(tp.type, LogicalType::ENUM); \ + ASSERT_EQ(tp.type, Type::ENUM); \ ASSERT_EQ(tp.name(), string(NAME)); \ \ KLASS tp_copy = tp; \ - ASSERT_EQ(tp_copy.type, LogicalType::ENUM); \ + ASSERT_EQ(tp_copy.type, Type::ENUM); \ } PRIMITIVE_TEST(Int8Type, INT8, "int8"); @@ -109,22 +99,20 @@ class TestPrimitiveBuilder : public TestBuilder { void RandomData(int N, double pct_null = 0.1) { Attrs::draw(N, &draws_); - random_nulls(N, pct_null, &nulls_); + test::random_nulls(N, pct_null, &nulls_); } void CheckNullable() { - ArrayType expected; int size = builder_->length(); auto ex_data = std::make_shared( reinterpret_cast(draws_.data()), size * sizeof(T)); - auto ex_nulls = bytes_to_null_buffer(nulls_.data(), size); - - int32_t ex_null_count = null_count(nulls_); + auto ex_nulls = test::bytes_to_null_buffer(nulls_.data(), size); + int32_t ex_null_count = test::null_count(nulls_); - expected.Init(size, ex_data, ex_null_count, ex_nulls); + auto expected = std::make_shared(size, ex_data, ex_null_count, ex_nulls); std::shared_ptr result = std::dynamic_pointer_cast( builder_->Finish()); @@ -135,18 +123,17 @@ class TestPrimitiveBuilder : public TestBuilder { ASSERT_EQ(0, builder_->null_count()); ASSERT_EQ(nullptr, 
builder_->buffer()); - ASSERT_TRUE(result->Equals(expected)); + ASSERT_TRUE(result->EqualsExact(*expected.get())); ASSERT_EQ(ex_null_count, result->null_count()); } void CheckNonNullable() { - ArrayType expected; int size = builder_nn_->length(); auto ex_data = std::make_shared(reinterpret_cast(draws_.data()), size * sizeof(T)); - expected.Init(size, ex_data); + auto expected = std::make_shared(size, ex_data); std::shared_ptr result = std::dynamic_pointer_cast( builder_nn_->Finish()); @@ -156,7 +143,7 @@ class TestPrimitiveBuilder : public TestBuilder { ASSERT_EQ(0, builder_nn_->capacity()); ASSERT_EQ(nullptr, builder_nn_->buffer()); - ASSERT_TRUE(result->Equals(expected)); + ASSERT_TRUE(result->EqualsExact(*expected.get())); ASSERT_EQ(0, result->null_count()); } @@ -183,8 +170,8 @@ class TestPrimitiveBuilder : public TestBuilder { #define PINT_DECL(CapType, c_type, LOWER, UPPER) \ struct P##CapType { \ PTYPE_DECL(CapType, c_type); \ - static void draw(int N, vector* draws) { \ - randint(N, LOWER, UPPER, draws); \ + static void draw(int N, vector* draws) { \ + test::randint(N, LOWER, UPPER, draws); \ } \ } diff --git a/cpp/src/arrow/types/primitive.cc b/cpp/src/arrow/types/primitive.cc index c86260b0fc641..32b8bfa7f1bd4 100644 --- a/cpp/src/arrow/types/primitive.cc +++ b/cpp/src/arrow/types/primitive.cc @@ -26,16 +26,16 @@ namespace arrow { // ---------------------------------------------------------------------- // Primitive array base -void PrimitiveArray::Init(const TypePtr& type, int32_t length, +PrimitiveArray::PrimitiveArray(const TypePtr& type, int32_t length, const std::shared_ptr& data, int32_t null_count, - const std::shared_ptr& nulls) { - Array::Init(type, length, null_count, nulls); + const std::shared_ptr& nulls) : + Array(type, length, null_count, nulls) { data_ = data; raw_data_ = data == nullptr? 
nullptr : data_->data(); } -bool PrimitiveArray::Equals(const PrimitiveArray& other) const { +bool PrimitiveArray::EqualsExact(const PrimitiveArray& other) const { if (this == &other) return true; if (null_count_ != other.null_count_) { return false; @@ -50,4 +50,12 @@ bool PrimitiveArray::Equals(const PrimitiveArray& other) const { } } +bool PrimitiveArray::Equals(const std::shared_ptr& arr) const { + if (this == arr.get()) return true; + if (this->type_enum() != arr->type_enum()) { + return false; + } + return EqualsExact(*static_cast(arr.get())); +} + } // namespace arrow diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index 22ab59c309a1d..e01027cf55c39 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -21,7 +21,6 @@ #include #include #include -#include #include "arrow/array.h" #include "arrow/builder.h" @@ -38,64 +37,57 @@ class MemoryPool; // Base class for fixed-size logical types class PrimitiveArray : public Array { public: - PrimitiveArray() : Array(), data_(nullptr), raw_data_(nullptr) {} - - virtual ~PrimitiveArray() {} - - void Init(const TypePtr& type, int32_t length, + PrimitiveArray(const TypePtr& type, int32_t length, const std::shared_ptr& data, int32_t null_count = 0, const std::shared_ptr& nulls = nullptr); + virtual ~PrimitiveArray() {} const std::shared_ptr& data() const { return data_;} - bool Equals(const PrimitiveArray& other) const; + bool EqualsExact(const PrimitiveArray& other) const; + bool Equals(const std::shared_ptr& arr) const override; protected: std::shared_ptr data_; const uint8_t* raw_data_; }; - -template -class PrimitiveArrayImpl : public PrimitiveArray { - public: - typedef typename TypeClass::c_type value_type; - - PrimitiveArrayImpl() : PrimitiveArray() {} - - virtual ~PrimitiveArrayImpl() {} - - PrimitiveArrayImpl(int32_t length, const std::shared_ptr& data, - int32_t null_count = 0, - const std::shared_ptr& nulls = nullptr) { - Init(length, data, null_count, nulls); - } - - void Init(int32_t length, const std::shared_ptr& data, - int32_t null_count = 0, - const std::shared_ptr& nulls = nullptr) { - TypePtr type(new TypeClass()); - PrimitiveArray::Init(type, length, data, null_count, nulls); - } - - bool Equals(const PrimitiveArrayImpl& other) const { - return PrimitiveArray::Equals(*static_cast(&other)); - } - - const value_type* raw_data() const { - return reinterpret_cast(raw_data_); - } - - value_type Value(int i) const { - return raw_data()[i]; - } - - TypeClass* exact_type() const { - return static_cast(type_); - } +#define NUMERIC_ARRAY_DECL(NAME, TypeClass, T) \ +class NAME : public PrimitiveArray { \ + public: \ + using value_type = T; \ + using PrimitiveArray::PrimitiveArray; \ + NAME(int32_t length, const std::shared_ptr& data, \ + int32_t null_count = 0, \ + const std::shared_ptr& nulls = nullptr) : \ + PrimitiveArray(std::make_shared(), length, data, \ + null_count, nulls) {} \ + \ + bool EqualsExact(const NAME& other) const { \ + return PrimitiveArray::EqualsExact( \ + *static_cast(&other)); \ + } \ + \ + const T* raw_data() const { \ + return reinterpret_cast(raw_data_); \ + } \ + \ + T Value(int i) const { \ + return raw_data()[i]; \ + } \ }; +NUMERIC_ARRAY_DECL(UInt8Array, UInt8Type, uint8_t); +NUMERIC_ARRAY_DECL(Int8Array, Int8Type, int8_t); +NUMERIC_ARRAY_DECL(UInt16Array, UInt16Type, uint16_t); +NUMERIC_ARRAY_DECL(Int16Array, Int16Type, int16_t); +NUMERIC_ARRAY_DECL(UInt32Array, UInt32Type, uint32_t); +NUMERIC_ARRAY_DECL(Int32Array, Int32Type, int32_t); 
+NUMERIC_ARRAY_DECL(UInt64Array, UInt64Type, uint64_t); +NUMERIC_ARRAY_DECL(Int64Array, Int64Type, int64_t); +NUMERIC_ARRAY_DECL(FloatArray, FloatType, float); +NUMERIC_ARRAY_DECL(DoubleArray, DoubleType, double); template class PrimitiveBuilder : public ArrayBuilder { @@ -202,8 +194,9 @@ class PrimitiveBuilder : public ArrayBuilder { } std::shared_ptr Finish() override { - std::shared_ptr result = std::make_shared(); - result->PrimitiveArray::Init(type_, length_, values_, null_count_, nulls_); + std::shared_ptr result = std::make_shared( + type_, length_, values_, null_count_, nulls_); + values_ = nulls_ = nullptr; capacity_ = length_ = null_count_ = 0; return result; @@ -222,6 +215,21 @@ class PrimitiveBuilder : public ArrayBuilder { int elsize_; }; +// Builders + +typedef PrimitiveBuilder UInt8Builder; +typedef PrimitiveBuilder UInt16Builder; +typedef PrimitiveBuilder UInt32Builder; +typedef PrimitiveBuilder UInt64Builder; + +typedef PrimitiveBuilder Int8Builder; +typedef PrimitiveBuilder Int16Builder; +typedef PrimitiveBuilder Int32Builder; +typedef PrimitiveBuilder Int64Builder; + +typedef PrimitiveBuilder FloatBuilder; +typedef PrimitiveBuilder DoubleBuilder; + } // namespace arrow #endif // ARROW_TYPES_PRIMITIVE_H diff --git a/cpp/src/arrow/types/string-test.cc b/cpp/src/arrow/types/string-test.cc index 6381093dcbb45..7dc3d682cdc15 100644 --- a/cpp/src/arrow/types/string-test.cc +++ b/cpp/src/arrow/types/string-test.cc @@ -15,21 +15,20 @@ // specific language governing permissions and limitations // under the License. -#include #include +#include #include #include #include +#include "gtest/gtest.h" + #include "arrow/array.h" -#include "arrow/builder.h" #include "arrow/test-util.h" #include "arrow/type.h" -#include "arrow/types/construct.h" -#include "arrow/types/integer.h" +#include "arrow/types/primitive.h" #include "arrow/types/string.h" #include "arrow/types/test-common.h" -#include "arrow/util/status.h" namespace arrow { @@ -38,14 +37,14 @@ class Buffer; TEST(TypesTest, TestCharType) { CharType t1(5); - ASSERT_EQ(t1.type, LogicalType::CHAR); + ASSERT_EQ(t1.type, Type::CHAR); ASSERT_EQ(t1.size, 5); ASSERT_EQ(t1.ToString(), std::string("char(5)")); // Test copy constructor CharType t2 = t1; - ASSERT_EQ(t2.type, LogicalType::CHAR); + ASSERT_EQ(t2.type, Type::CHAR); ASSERT_EQ(t2.size, 5); } @@ -53,22 +52,20 @@ TEST(TypesTest, TestCharType) { TEST(TypesTest, TestVarcharType) { VarcharType t1(5); - ASSERT_EQ(t1.type, LogicalType::VARCHAR); + ASSERT_EQ(t1.type, Type::VARCHAR); ASSERT_EQ(t1.size, 5); - ASSERT_EQ(t1.physical_type.size, 6); ASSERT_EQ(t1.ToString(), std::string("varchar(5)")); // Test copy constructor VarcharType t2 = t1; - ASSERT_EQ(t2.type, LogicalType::VARCHAR); + ASSERT_EQ(t2.type, Type::VARCHAR); ASSERT_EQ(t2.size, 5); - ASSERT_EQ(t2.physical_type.size, 6); } TEST(TypesTest, TestStringType) { StringType str; - ASSERT_EQ(str.type, LogicalType::STRING); + ASSERT_EQ(str.type, Type::STRING); ASSERT_EQ(str.name(), std::string("string")); } @@ -90,15 +87,16 @@ class TestStringContainer : public ::testing::Test { length_ = offsets_.size() - 1; int nchars = chars_.size(); - value_buf_ = to_buffer(chars_); + value_buf_ = test::to_buffer(chars_); values_ = ArrayPtr(new UInt8Array(nchars, value_buf_)); - offsets_buf_ = to_buffer(offsets_); + offsets_buf_ = test::to_buffer(offsets_); - nulls_buf_ = bytes_to_null_buffer(nulls_.data(), nulls_.size()); - null_count_ = null_count(nulls_); + nulls_buf_ = test::bytes_to_null_buffer(nulls_.data(), nulls_.size()); + null_count_ = 
test::null_count(nulls_); - strings_.Init(length_, offsets_buf_, values_, null_count_, nulls_buf_); + strings_ = std::make_shared(length_, offsets_buf_, values_, + null_count_, nulls_buf_); } protected: @@ -116,28 +114,28 @@ class TestStringContainer : public ::testing::Test { int length_; ArrayPtr values_; - StringArray strings_; + std::shared_ptr strings_; }; TEST_F(TestStringContainer, TestArrayBasics) { - ASSERT_EQ(length_, strings_.length()); - ASSERT_EQ(1, strings_.null_count()); + ASSERT_EQ(length_, strings_->length()); + ASSERT_EQ(1, strings_->null_count()); } TEST_F(TestStringContainer, TestType) { - TypePtr type = strings_.type(); + TypePtr type = strings_->type(); - ASSERT_EQ(LogicalType::STRING, type->type); - ASSERT_EQ(LogicalType::STRING, strings_.logical_type()); + ASSERT_EQ(Type::STRING, type->type); + ASSERT_EQ(Type::STRING, strings_->type_enum()); } TEST_F(TestStringContainer, TestListFunctions) { int pos = 0; for (size_t i = 0; i < expected_.size(); ++i) { - ASSERT_EQ(pos, strings_.value_offset(i)); - ASSERT_EQ(expected_[i].size(), strings_.value_length(i)); + ASSERT_EQ(pos, strings_->value_offset(i)); + ASSERT_EQ(expected_[i].size(), strings_->value_length(i)); pos += expected_[i].size(); } } @@ -151,9 +149,9 @@ TEST_F(TestStringContainer, TestDestructor) { TEST_F(TestStringContainer, TestGetString) { for (size_t i = 0; i < expected_.size(); ++i) { if (nulls_[i]) { - ASSERT_TRUE(strings_.IsNull(i)); + ASSERT_TRUE(strings_->IsNull(i)); } else { - ASSERT_EQ(expected_[i], strings_.GetString(i)); + ASSERT_EQ(expected_[i], strings_->GetString(i)); } } } @@ -199,7 +197,7 @@ TEST_F(TestStringBuilder, TestScalarAppend) { Done(); ASSERT_EQ(reps * N, result_->length()); - ASSERT_EQ(reps * null_count(is_null), result_->null_count()); + ASSERT_EQ(reps * test::null_count(is_null), result_->null_count()); ASSERT_EQ(reps * 6, result_->values()->length()); int32_t length; diff --git a/cpp/src/arrow/types/string.h b/cpp/src/arrow/types/string.h index 8ccc0a9698a54..2b3fba5ce0932 100644 --- a/cpp/src/arrow/types/string.h +++ b/cpp/src/arrow/types/string.h @@ -25,25 +25,21 @@ #include "arrow/array.h" #include "arrow/type.h" -#include "arrow/types/integer.h" #include "arrow/types/list.h" +#include "arrow/types/primitive.h" #include "arrow/util/status.h" namespace arrow { -class ArrayBuilder; class Buffer; class MemoryPool; struct CharType : public DataType { int size; - BytesType physical_type; - - explicit CharType(int size, bool nullable = true) - : DataType(LogicalType::CHAR, nullable), - size(size), - physical_type(BytesType(size)) {} + explicit CharType(int size) + : DataType(Type::CHAR), + size(size) {} CharType(const CharType& other) : CharType(other.size) {} @@ -56,54 +52,36 @@ struct CharType : public DataType { struct VarcharType : public DataType { int size; - BytesType physical_type; - - explicit VarcharType(int size, bool nullable = true) - : DataType(LogicalType::VARCHAR, nullable), - size(size), - physical_type(BytesType(size + 1)) {} + explicit VarcharType(int size) + : DataType(Type::VARCHAR), + size(size) {} VarcharType(const VarcharType& other) : VarcharType(other.size) {} virtual std::string ToString() const; }; -static const LayoutPtr byte1(new BytesType(1)); -static const LayoutPtr physical_string = LayoutPtr(new ListLayoutType(byte1)); - // TODO: add a BinaryArray layer in between class StringArray : public ListArray { public: - StringArray() : ListArray(), bytes_(nullptr), raw_bytes_(nullptr) {} - - StringArray(int32_t length, const std::shared_ptr& offsets, - const 
ArrayPtr& values, - int32_t null_count = 0, - const std::shared_ptr& nulls = nullptr) { - Init(length, offsets, values, null_count, nulls); - } - - void Init(const TypePtr& type, int32_t length, + StringArray(const TypePtr& type, int32_t length, const std::shared_ptr& offsets, const ArrayPtr& values, int32_t null_count = 0, - const std::shared_ptr& nulls = nullptr) { - ListArray::Init(type, length, offsets, values, null_count, nulls); - - // TODO: type validation for values array - + const std::shared_ptr& nulls = nullptr) : + ListArray(type, length, offsets, values, null_count, nulls) { // For convenience bytes_ = static_cast(values.get()); raw_bytes_ = bytes_->raw_data(); } - void Init(int32_t length, const std::shared_ptr& offsets, + StringArray(int32_t length, + const std::shared_ptr& offsets, const ArrayPtr& values, int32_t null_count = 0, - const std::shared_ptr& nulls = nullptr) { - TypePtr type(new StringType()); - Init(type, length, offsets, values, null_count, nulls); - } + const std::shared_ptr& nulls = nullptr) : + StringArray(std::make_shared(), length, offsets, values, + null_count, nulls) {} // Compute the pointer t const uint8_t* GetValue(int i, int32_t* out_length) const { @@ -125,9 +103,6 @@ class StringArray : public ListArray { }; // Array builder - - - class StringBuilder : public ListBuilder { public: explicit StringBuilder(MemoryPool* pool, const TypePtr& type) : diff --git a/cpp/src/arrow/types/struct-test.cc b/cpp/src/arrow/types/struct-test.cc index 9a4777e8b983d..d94396f42c52a 100644 --- a/cpp/src/arrow/types/struct-test.cc +++ b/cpp/src/arrow/types/struct-test.cc @@ -15,16 +15,13 @@ // specific language governing permissions and limitations // under the License. -#include - #include #include #include +#include "gtest/gtest.h" + #include "arrow/type.h" -#include "arrow/types/integer.h" -#include "arrow/types/string.h" -#include "arrow/types/struct.h" using std::shared_ptr; using std::string; @@ -42,13 +39,13 @@ TEST(TestStructType, Basics) { TypePtr f2_type = TypePtr(new UInt8Type()); auto f2 = std::make_shared("f2", f2_type); - vector > fields = {f0, f1, f2}; + vector> fields = {f0, f1, f2}; StructType struct_type(fields); - ASSERT_TRUE(struct_type.field(0)->Equals(f0)); - ASSERT_TRUE(struct_type.field(1)->Equals(f1)); - ASSERT_TRUE(struct_type.field(2)->Equals(f2)); + ASSERT_TRUE(struct_type.child(0)->Equals(f0)); + ASSERT_TRUE(struct_type.child(1)->Equals(f1)); + ASSERT_TRUE(struct_type.child(2)->Equals(f2)); ASSERT_EQ(struct_type.ToString(), "struct"); diff --git a/cpp/src/arrow/types/test-common.h b/cpp/src/arrow/types/test-common.h index 1744efce7d631..227aca632ef3c 100644 --- a/cpp/src/arrow/types/test-common.h +++ b/cpp/src/arrow/types/test-common.h @@ -18,11 +18,12 @@ #ifndef ARROW_TYPES_TEST_COMMON_H #define ARROW_TYPES_TEST_COMMON_H -#include #include #include #include +#include "gtest/gtest.h" + #include "arrow/test-util.h" #include "arrow/type.h" #include "arrow/util/memory-pool.h" @@ -34,7 +35,7 @@ namespace arrow { class TestBuilder : public ::testing::Test { public: void SetUp() { - pool_ = GetDefaultMemoryPool(); + pool_ = default_memory_pool(); type_ = TypePtr(new UInt8Type()); builder_.reset(new UInt8Builder(pool_, type_)); builder_nn_.reset(new UInt8Builder(pool_, type_)); diff --git a/cpp/src/arrow/types/union.h b/cpp/src/arrow/types/union.h index 9aff780c6a392..29cda90b972dd 100644 --- a/cpp/src/arrow/types/union.h +++ b/cpp/src/arrow/types/union.h @@ -30,8 +30,8 @@ namespace arrow { class Buffer; -struct DenseUnionType : public 
diff --git a/cpp/src/arrow/types/union.h b/cpp/src/arrow/types/union.h
index 9aff780c6a392..29cda90b972dd 100644
--- a/cpp/src/arrow/types/union.h
+++ b/cpp/src/arrow/types/union.h
@@ -30,8 +30,8 @@ namespace arrow {

 class Buffer;

-struct DenseUnionType : public CollectionType<LogicalType::DENSE_UNION> {
-  typedef CollectionType<LogicalType::DENSE_UNION> Base;
+struct DenseUnionType : public CollectionType<Type::DENSE_UNION> {
+  typedef CollectionType<Type::DENSE_UNION> Base;

   explicit DenseUnionType(const std::vector<TypePtr>& child_types) :
       Base() {
@@ -42,8 +42,8 @@ struct DenseUnionType : public CollectionType<LogicalType::DENSE_UNION> {
 };


-struct SparseUnionType : public CollectionType<LogicalType::SPARSE_UNION> {
-  typedef CollectionType<LogicalType::SPARSE_UNION> Base;
+struct SparseUnionType : public CollectionType<Type::SPARSE_UNION> {
+  typedef CollectionType<Type::SPARSE_UNION> Base;

   explicit SparseUnionType(const std::vector<TypePtr>& child_types) :
       Base() {
@@ -55,28 +55,20 @@ struct SparseUnionType : public CollectionType<LogicalType::SPARSE_UNION> {
 };


 class UnionArray : public Array {
- public:
-  UnionArray() : Array() {}
-
 protected:
   // The data are types encoded as int16
   Buffer* types_;

-  std::vector<std::shared_ptr<Array> > children_;
+  std::vector<std::shared_ptr<Array>> children_;
 };


 class DenseUnionArray : public UnionArray {
- public:
-  DenseUnionArray() : UnionArray() {}
-
 protected:
   Buffer* offset_buf_;
 };


 class SparseUnionArray : public UnionArray {
- public:
-  SparseUnionArray() : UnionArray() {}
 };

 }  // namespace arrow
diff --git a/cpp/src/arrow/util/bit-util-test.cc b/cpp/src/arrow/util/bit-util-test.cc
index 7506ca5b5531c..220bff084fd6e 100644
--- a/cpp/src/arrow/util/bit-util-test.cc
+++ b/cpp/src/arrow/util/bit-util-test.cc
@@ -15,10 +15,10 @@
 // specific language governing permissions and limitations
 // under the License.

-#include <gtest/gtest.h>
-
 #include "arrow/util/bit-util.h"

+#include "gtest/gtest.h"
+
 namespace arrow {

 TEST(UtilTests, TestNextPower2) {
diff --git a/cpp/src/arrow/util/bit-util.h b/cpp/src/arrow/util/bit-util.h
index 5e7197f901222..1d2d1d5f9d7e4 100644
--- a/cpp/src/arrow/util/bit-util.h
+++ b/cpp/src/arrow/util/bit-util.h
@@ -19,7 +19,6 @@
 #define ARROW_UTIL_BIT_UTIL_H

 #include <cstdint>
-#include
 #include <vector>

 namespace arrow {
diff --git a/cpp/src/arrow/util/buffer-test.cc b/cpp/src/arrow/util/buffer-test.cc
index 9f1fd91432b4d..1d58226d84a46 100644
--- a/cpp/src/arrow/util/buffer-test.cc
+++ b/cpp/src/arrow/util/buffer-test.cc
@@ -15,11 +15,12 @@
 // specific language governing permissions and limitations
 // under the License.

-#include <gtest/gtest.h>
 #include
 #include
 #include

+#include "gtest/gtest.h"
+
 #include "arrow/test-util.h"
 #include "arrow/util/buffer.h"
 #include "arrow/util/status.h"
diff --git a/cpp/src/arrow/util/buffer.cc b/cpp/src/arrow/util/buffer.cc
index 50f4716769d70..04cdcd75cd41a 100644
--- a/cpp/src/arrow/util/buffer.cc
+++ b/cpp/src/arrow/util/buffer.cc
@@ -40,7 +40,7 @@ std::shared_ptr<Buffer> MutableBuffer::GetImmutableView() {

 PoolBuffer::PoolBuffer(MemoryPool* pool) : ResizableBuffer(nullptr, 0) {
   if (pool == nullptr) {
-    pool = GetDefaultMemoryPool();
+    pool = default_memory_pool();
   }
   pool_ = pool;
 }
diff --git a/cpp/src/arrow/util/memory-pool-test.cc b/cpp/src/arrow/util/memory-pool-test.cc
index 954b5f951b558..6ef07a07ada3f 100644
--- a/cpp/src/arrow/util/memory-pool-test.cc
+++ b/cpp/src/arrow/util/memory-pool-test.cc
@@ -15,10 +15,11 @@
 // specific language governing permissions and limitations
 // under the License.
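The memory-pool test that follows exercises the default_memory_pool() entry point, renamed above from GetDefaultMemoryPool() in buffer.cc. The implementation (next chunk) is a function-local static, which C++11 guarantees is initialized exactly once, in a thread-safe way, on first call. A minimal sketch of the idiom with a stand-in class, not the Arrow MemoryPool API:

#include <cstdint>
#include <iostream>

class TrackingPool {
 public:
  void Allocate(int64_t size) { bytes_allocated_ += size; }
  int64_t bytes_allocated() const { return bytes_allocated_; }

 private:
  int64_t bytes_allocated_ = 0;
};

TrackingPool* default_pool() {
  static TrackingPool pool;  // constructed on first use, shared afterwards
  return &pool;
}

int main() {
  default_pool()->Allocate(100);
  std::cout << default_pool()->bytes_allocated() << std::endl;  // 100
  return 0;
}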
-#include <gtest/gtest.h>
 #include <cstdint>
 #include <limits>

+#include "gtest/gtest.h"
+
 #include "arrow/test-util.h"
 #include "arrow/util/memory-pool.h"
 #include "arrow/util/status.h"
@@ -26,7 +27,7 @@
 namespace arrow {

 TEST(DefaultMemoryPool, MemoryTracking) {
-  MemoryPool* pool = GetDefaultMemoryPool();
+  MemoryPool* pool = default_memory_pool();

   uint8_t* data;
   ASSERT_OK(pool->Allocate(100, &data));
@@ -37,7 +38,7 @@ TEST(DefaultMemoryPool, MemoryTracking) {
 }

 TEST(DefaultMemoryPool, OOM) {
-  MemoryPool* pool = GetDefaultMemoryPool();
+  MemoryPool* pool = default_memory_pool();

   uint8_t* data;
   int64_t to_alloc = std::numeric_limits<int64_t>::max();
diff --git a/cpp/src/arrow/util/memory-pool.cc b/cpp/src/arrow/util/memory-pool.cc
index 5820346e5a739..0b885e9376a62 100644
--- a/cpp/src/arrow/util/memory-pool.cc
+++ b/cpp/src/arrow/util/memory-pool.cc
@@ -70,9 +70,9 @@ void InternalMemoryPool::Free(uint8_t* buffer, int64_t size) {

 InternalMemoryPool::~InternalMemoryPool() {}

-MemoryPool* GetDefaultMemoryPool() {
-  static InternalMemoryPool default_memory_pool;
-  return &default_memory_pool;
+MemoryPool* default_memory_pool() {
+  static InternalMemoryPool default_memory_pool_;
+  return &default_memory_pool_;
 }

 }  // namespace arrow
diff --git a/cpp/src/arrow/util/memory-pool.h b/cpp/src/arrow/util/memory-pool.h
index a7cb10dae1703..0d2478686f5a4 100644
--- a/cpp/src/arrow/util/memory-pool.h
+++ b/cpp/src/arrow/util/memory-pool.h
@@ -34,7 +34,7 @@ class MemoryPool {
   virtual int64_t bytes_allocated() const = 0;
 };

-MemoryPool* GetDefaultMemoryPool();
+MemoryPool* default_memory_pool();

 }  // namespace arrow
diff --git a/cpp/src/arrow/util/status.cc b/cpp/src/arrow/util/status.cc
index c6e113ebea590..43cb87e1a8c56 100644
--- a/cpp/src/arrow/util/status.cc
+++ b/cpp/src/arrow/util/status.cc
@@ -54,6 +54,9 @@ std::string Status::CodeAsString() const {
     case StatusCode::Invalid:
       type = "Invalid";
       break;
+    case StatusCode::IOError:
+      type = "IOError";
+      break;
     case StatusCode::NotImplemented:
       type = "NotImplemented";
       break;
diff --git a/cpp/src/arrow/util/status.h b/cpp/src/arrow/util/status.h
index 47fda40db2596..b5931232dbdcb 100644
--- a/cpp/src/arrow/util/status.h
+++ b/cpp/src/arrow/util/status.h
@@ -63,6 +63,7 @@ enum class StatusCode: char {
   OutOfMemory = 1,
   KeyError = 2,
   Invalid = 3,
+  IOError = 4,

   NotImplemented = 10,
 };
@@ -97,12 +98,17 @@ class Status {
     return Status(StatusCode::Invalid, msg, -1);
   }

+  static Status IOError(const std::string& msg) {
+    return Status(StatusCode::IOError, msg, -1);
+  }
+
   // Returns true iff the status indicates success.
   bool ok() const { return (state_ == NULL); }

   bool IsOutOfMemory() const { return code() == StatusCode::OutOfMemory; }
   bool IsKeyError() const { return code() == StatusCode::KeyError; }
   bool IsInvalid() const { return code() == StatusCode::Invalid; }
+  bool IsIOError() const { return code() == StatusCode::IOError; }

   // Return a string representation of this status suitable for printing.
   // Returns the string "OK" for success.
diff --git a/cpp/src/arrow/util/test_main.cc b/cpp/src/arrow/util/test_main.cc
index 00139f36742ed..adc8466fb0be9 100644
--- a/cpp/src/arrow/util/test_main.cc
+++ b/cpp/src/arrow/util/test_main.cc
@@ -15,7 +15,7 @@
 // specific language governing permissions and limitations
 // under the License.
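The status.h and status.cc hunks above add IOError to the factory methods and predicate helpers. A hedged sketch of the intended call pattern when compiling against the patched header; OpenFileForReading is hypothetical and not part of the patch:

#include <string>

#include "arrow/util/status.h"

using arrow::Status;

Status OpenFileForReading(const std::string& path, bool exists) {
  if (!exists) {
    // Report the failure through the new factory method.
    return Status::IOError("could not open " + path);
  }
  return Status::OK();
}

int main() {
  Status s = OpenFileForReading("/tmp/data.arrow", false);
  return s.IsIOError() ? 0 : 1;  // exits 0: the IOError predicate matches
}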
-#include +#include "gtest/gtest.h" int main(int argc, char **argv) { ::testing::InitGoogleTest(&argc, argv); diff --git a/cpp/thirdparty/build_thirdparty.sh b/cpp/thirdparty/build_thirdparty.sh index 294737cc50522..3d5f532b16309 100755 --- a/cpp/thirdparty/build_thirdparty.sh +++ b/cpp/thirdparty/build_thirdparty.sh @@ -17,6 +17,7 @@ else case $arg in "gtest") F_GTEST=1 ;; "gbenchmark") F_GBENCHMARK=1 ;; + "flatbuffers") F_FLATBUFFERS=1 ;; *) echo "Unknown module: $arg"; exit 1 ;; esac done @@ -78,6 +79,14 @@ if [ -n "$F_ALL" -o -n "$F_GBENCHMARK" ]; then make VERBOSE=1 install || { echo "make $GBENCHMARK_ERROR" ; exit 1; } fi +FLATBUFFERS_ERROR="failed for flatbuffers" +if [ -n "$F_ALL" -o -n "$F_FLATBUFFERS" ]; then + cd $TP_DIR/$FLATBUFFERS_BASEDIR + + CXXFLAGS=-fPIC cmake -DCMAKE_INSTALL_PREFIX:PATH=$PREFIX -DFLATBUFFERS_BUILD_TESTS=OFF . || { echo "cmake $FLATBUFFERS_ERROR" ; exit 1; } + make -j$PARALLEL + make install +fi echo "---------------------" echo "Thirdparty dependencies built and installed into $PREFIX successfully" diff --git a/cpp/thirdparty/download_thirdparty.sh b/cpp/thirdparty/download_thirdparty.sh index d22c559b3e3ba..d299afc15222b 100755 --- a/cpp/thirdparty/download_thirdparty.sh +++ b/cpp/thirdparty/download_thirdparty.sh @@ -25,3 +25,8 @@ if [ ! -d ${GBENCHMARK_BASEDIR} ]; then echo "Fetching google benchmark" download_extract_and_cleanup $GBENCHMARK_URL fi + +if [ ! -d ${FLATBUFFERS_BASEDIR} ]; then + echo "Fetching flatbuffers" + download_extract_and_cleanup $FLATBUFFERS_URL +fi diff --git a/cpp/thirdparty/versions.sh b/cpp/thirdparty/versions.sh index 9cfc7cd94b58c..cb455b4eadd3b 100755 --- a/cpp/thirdparty/versions.sh +++ b/cpp/thirdparty/versions.sh @@ -5,3 +5,7 @@ GTEST_BASEDIR=googletest-release-$GTEST_VERSION GBENCHMARK_VERSION=1.0.0 GBENCHMARK_URL="https://github.com/google/benchmark/archive/v${GBENCHMARK_VERSION}.tar.gz" GBENCHMARK_BASEDIR=benchmark-$GBENCHMARK_VERSION + +FLATBUFFERS_VERSION=1.3.0 +FLATBUFFERS_URL="https://github.com/google/flatbuffers/archive/v${FLATBUFFERS_VERSION}.tar.gz" +FLATBUFFERS_BASEDIR=flatbuffers-$FLATBUFFERS_VERSION diff --git a/format/Message.fbs b/format/Message.fbs new file mode 100644 index 0000000000000..3ffd20332087a --- /dev/null +++ b/format/Message.fbs @@ -0,0 +1,183 @@ +namespace apache.arrow.flatbuf; + +/// ---------------------------------------------------------------------- +/// Logical types and their metadata (if any) +/// +/// These are stored in the flatbuffer in the Type union below + +/// A Tuple in the flatbuffer metadata is the same as an Arrow Struct +/// (according to the physical memory layout). We used Tuple here as Struct is +/// a reserved word in Flatbuffers +table Tuple { +} + +table List { +} + +enum UnionMode:int { Sparse, Dense } + +table Union { + mode: UnionMode; +} + +table Bit { +} + +table Int { + bitWidth: int; // 1 to 64 + is_signed: bool; +} + +enum Precision:int {SINGLE, DOUBLE} + +table FloatingPoint { + precision: Precision; +} + +table Utf8 { +} + +table Binary { +} + +table Bool { +} + +table Decimal { + precision: int; + scale: int; +} + +table Timestamp { + timezone: string; +} + +table JSONScalar { + dense:bool=true; +} + +/// ---------------------------------------------------------------------- +/// Top-level Type value, enabling extensible type-specific metadata. 
We can +/// add new logical types to Type without breaking backwards compatibility + +union Type { + Int, + Bit, + FloatingPoint, + Binary, + Utf8, + Bool, + Decimal, + Timestamp, + List, + Tuple, + Union, + JSONScalar +} + +/// ---------------------------------------------------------------------- +/// A field represents a named column in a record / row batch or child of a +/// nested type. +/// +/// - children is only for nested Arrow arrays +/// - For primitive types, children will have length 0 +/// - nullable should default to true in general + +table Field { + // Name is not required, in i.e. a List + name: string; + nullable: bool; + type: Type; + children: [Field]; +} + +/// ---------------------------------------------------------------------- +/// A Schema describes the columns in a row batch + +table Schema { + fields: [Field]; +} + +/// ---------------------------------------------------------------------- +/// Data structures for describing a table row batch (a collection of +/// equal-length Arrow arrays) + +/// A Buffer represents a single contiguous memory segment +struct Buffer { + /// The shared memory page id where this buffer is located. Currently this is + /// not used + page: int; + + /// The relative offset into the shared memory page where the bytes for this + /// buffer starts + offset: long; + + /// The absolute length (in bytes) of the memory buffer. The memory is found + /// from offset (inclusive) to offset + length (non-inclusive). + length: long; +} + +/// Metadata about a field at some level of a nested type tree (but not +/// its children). +/// +/// For example, a List with values [[1, 2, 3], null, [4], [5, 6], null] +/// would have {length: 5, null_count: 2} for its List node, and {length: 6, +/// null_count: 0} for its Int16 node, as separate FieldNode structs +struct FieldNode { + /// The number of value slots in the Arrow array at this level of a nested + /// tree + length: int; + + /// The number of observed nulls. Fields with null_count == 0 may choose not + /// to write their physical null bitmap out as a materialized buffer, instead + /// setting the length of the null buffer to 0. + null_count: int; +} + +/// A data header describing the shared memory layout of a "record" or "row" +/// batch. Some systems call this a "row batch" internally and others a "record +/// batch". +table RecordBatch { + /// number of records / rows. The arrays in the batch should all have this + /// length + length: int; + + /// Nodes correspond to the pre-ordered flattened logical schema + nodes: [FieldNode]; + + /// Buffers correspond to the pre-ordered flattened buffer tree + /// + /// The number of buffers appended to this list depends on the schema. For + /// example, most primitive arrays will have 2 buffers, 1 for the null bitmap + /// and 1 for the values. For struct arrays, there will only be a single + /// buffer for the null bitmap + buffers: [Buffer]; +} + +/// ---------------------------------------------------------------------- +/// For sending dictionary encoding information. Any Field can be +/// dictionary-encoded, but in this case none of its children may be +/// dictionary-encoded. +/// +/// TODO(wesm): To be documented in more detail + +table DictionaryBatch { + id: long; + data: RecordBatch; +} + +/// ---------------------------------------------------------------------- +/// The root Message type + +/// This union enables us to easily send different message types without +/// redundant storage, and in the future we can easily add new message types. 
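Before the root Message type closes out the schema, the FieldNode comment above is worth a concrete trace. A small C++ stand-in, mirroring the struct outside flatbuffers, spells out the pre-ordered flattening for the List example; the buffer counts in the comment are an assumption drawn from the RecordBatch description, not something the schema enforces:

#include <cassert>
#include <cstdint>
#include <vector>

struct FieldNode {
  int32_t length;
  int32_t null_count;
};

int main() {
  // Values [[1, 2, 3], null, [4], [5, 6], null]: the parent List node comes
  // first, then its Int16 child, exactly as the doc comment describes.
  std::vector<FieldNode> nodes = {{5, 2}, {6, 0}};
  assert(nodes[0].length == 5 && nodes[0].null_count == 2);  // List level
  assert(nodes[1].length == 6 && nodes[1].null_count == 0);  // Int16 level

  // Presumably four buffers accompany these nodes: null bitmap plus offsets
  // for the List level, null bitmap plus values for the Int16 level.
  return 0;
}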
+union MessageHeader { + Schema, DictionaryBatch, RecordBatch +} + +table Message { + header: MessageHeader; + bodyLength: long; +} + +root_type Message; diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 8d93a156bcc3d..9a080709bebda 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -35,4 +35,6 @@ uint8, uint16, uint32, uint64, float_, double, string, list_, struct, field, - DataType, Field, Schema) + DataType, Field, Schema, schema) + +from pyarrow.array import RowBatch diff --git a/python/pyarrow/array.pxd b/python/pyarrow/array.pxd index d0d3486c032fe..de3c77419623f 100644 --- a/python/pyarrow/array.pxd +++ b/python/pyarrow/array.pxd @@ -16,7 +16,7 @@ # under the License. from pyarrow.includes.common cimport shared_ptr -from pyarrow.includes.libarrow cimport CArray, LogicalType +from pyarrow.includes.libarrow cimport CArray from pyarrow.scalar import NA diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index bceb333c94ea5..c5d40ddd7a481 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -28,6 +28,9 @@ from pyarrow.error cimport check_status cimport pyarrow.scalar as scalar from pyarrow.scalar import NA +from pyarrow.schema cimport Schema +import pyarrow.schema as schema + def total_allocated_bytes(): cdef MemoryPool* pool = pyarrow.GetMemoryPool() return pool.bytes_allocated() @@ -155,12 +158,12 @@ cdef class StringArray(Array): cdef dict _array_classes = { - LogicalType_NA: NullArray, - LogicalType_BOOL: BooleanArray, - LogicalType_INT64: Int64Array, - LogicalType_DOUBLE: DoubleArray, - LogicalType_LIST: ListArray, - LogicalType_STRING: StringArray, + Type_NA: NullArray, + Type_BOOL: BooleanArray, + Type_INT64: Int64Array, + Type_DOUBLE: DoubleArray, + Type_LIST: ListArray, + Type_STRING: StringArray, } cdef object box_arrow_array(const shared_ptr[CArray]& sp_array): @@ -190,3 +193,35 @@ def from_pylist(object list_obj, DataType type=None): raise NotImplementedError return box_arrow_array(sp_array) + +#---------------------------------------------------------------------- +# Table-like data structures + +cdef class RowBatch: + """ + + """ + cdef readonly: + Schema schema + int num_rows + list arrays + + def __cinit__(self, Schema schema, int num_rows, list arrays): + self.schema = schema + self.num_rows = num_rows + self.arrays = arrays + + if len(self.schema) != len(arrays): + raise ValueError('Mismatch number of data arrays and ' + 'schema fields') + + def __len__(self): + return self.num_rows + + property num_columns: + + def __get__(self): + return len(self.arrays) + + def __getitem__(self, i): + return self.arrays[i] diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index baba112833e0d..e6afcbd79b69f 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -21,31 +21,30 @@ from pyarrow.includes.common cimport * cdef extern from "arrow/api.h" namespace "arrow" nogil: - enum LogicalType" arrow::LogicalType::type": - LogicalType_NA" arrow::LogicalType::NA" + enum Type" arrow::Type::type": + Type_NA" arrow::Type::NA" - LogicalType_BOOL" arrow::LogicalType::BOOL" + Type_BOOL" arrow::Type::BOOL" - LogicalType_UINT8" arrow::LogicalType::UINT8" - LogicalType_INT8" arrow::LogicalType::INT8" - LogicalType_UINT16" arrow::LogicalType::UINT16" - LogicalType_INT16" arrow::LogicalType::INT16" - LogicalType_UINT32" arrow::LogicalType::UINT32" - LogicalType_INT32" arrow::LogicalType::INT32" - LogicalType_UINT64" 
arrow::LogicalType::UINT64" - LogicalType_INT64" arrow::LogicalType::INT64" + Type_UINT8" arrow::Type::UINT8" + Type_INT8" arrow::Type::INT8" + Type_UINT16" arrow::Type::UINT16" + Type_INT16" arrow::Type::INT16" + Type_UINT32" arrow::Type::UINT32" + Type_INT32" arrow::Type::INT32" + Type_UINT64" arrow::Type::UINT64" + Type_INT64" arrow::Type::INT64" - LogicalType_FLOAT" arrow::LogicalType::FLOAT" - LogicalType_DOUBLE" arrow::LogicalType::DOUBLE" + Type_FLOAT" arrow::Type::FLOAT" + Type_DOUBLE" arrow::Type::DOUBLE" - LogicalType_STRING" arrow::LogicalType::STRING" + Type_STRING" arrow::Type::STRING" - LogicalType_LIST" arrow::LogicalType::LIST" - LogicalType_STRUCT" arrow::LogicalType::STRUCT" + Type_LIST" arrow::Type::LIST" + Type_STRUCT" arrow::Type::STRUCT" cdef cppclass CDataType" arrow::DataType": - LogicalType type - c_bool nullable + Type type c_bool Equals(const CDataType* other) @@ -55,8 +54,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: int64_t bytes_allocated() cdef cppclass CListType" arrow::ListType"(CDataType): - CListType(const shared_ptr[CDataType]& value_type, - c_bool nullable) + CListType(const shared_ptr[CDataType]& value_type) cdef cppclass CStringType" arrow::StringType"(CDataType): pass @@ -65,21 +63,26 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: c_string name shared_ptr[CDataType] type - CField(const c_string& name, const shared_ptr[CDataType]& type) + c_bool nullable + + CField(const c_string& name, const shared_ptr[CDataType]& type, + c_bool nullable) cdef cppclass CStructType" arrow::StructType"(CDataType): - CStructType(const vector[shared_ptr[CField]]& fields, - c_bool nullable) + CStructType(const vector[shared_ptr[CField]]& fields) cdef cppclass CSchema" arrow::Schema": - CSchema(const shared_ptr[CField]& fields) + CSchema(const vector[shared_ptr[CField]]& fields) + const shared_ptr[CField]& field(int i) + int num_fields() + c_string ToString() cdef cppclass CArray" arrow::Array": const shared_ptr[CDataType]& type() int32_t length() int32_t null_count() - LogicalType logical_type() + Type type_enum() c_bool IsNull(int i) @@ -122,3 +125,57 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass CStringArray" arrow::StringArray"(CListArray): c_string GetString(int i) + + +cdef extern from "arrow/api.h" namespace "arrow" nogil: + # We can later add more of the common status factory methods as needed + cdef CStatus CStatus_OK "Status::OK"() + + cdef cppclass CStatus "arrow::Status": + CStatus() + + c_string ToString() + + c_bool ok() + c_bool IsOutOfMemory() + c_bool IsKeyError() + c_bool IsNotImplemented() + c_bool IsInvalid() + + cdef cppclass Buffer: + uint8_t* data() + int64_t size() + + +cdef extern from "arrow/ipc/metadata.h" namespace "arrow::ipc" nogil: + cdef cppclass SchemaMessage: + int num_fields() + CStatus GetField(int i, shared_ptr[CField]* out) + CStatus GetSchema(shared_ptr[CSchema]* out) + + cdef cppclass FieldMetadata: + pass + + cdef cppclass BufferMetadata: + pass + + cdef cppclass RecordBatchMessage: + pass + + cdef cppclass DictionaryBatchMessage: + pass + + enum MessageType" arrow::ipc::Message::Type": + MessageType_SCHEMA" arrow::ipc::Message::SCHEMA" + MessageType_RECORD_BATCH" arrow::ipc::Message::RECORD_BATCH" + MessageType_DICTIONARY_BATCH" arrow::ipc::Message::DICTIONARY_BATCH" + + cdef cppclass Message: + CStatus Open(const shared_ptr[Buffer]& buf, + shared_ptr[Message]* out) + int64_t body_length() + MessageType type() + + shared_ptr[SchemaMessage] GetSchema() + 
shared_ptr[RecordBatchMessage] GetRecordBatch() + shared_ptr[DictionaryBatchMessage] GetDictionaryBatch() diff --git a/python/pyarrow/includes/pyarrow.pxd b/python/pyarrow/includes/pyarrow.pxd index 9a0c004b7684a..eedfc85446810 100644 --- a/python/pyarrow/includes/pyarrow.pxd +++ b/python/pyarrow/includes/pyarrow.pxd @@ -18,8 +18,7 @@ # distutils: language = c++ from pyarrow.includes.common cimport * -from pyarrow.includes.libarrow cimport (CArray, CDataType, LogicalType, - MemoryPool) +from pyarrow.includes.libarrow cimport CArray, CDataType, Type, MemoryPool cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil: # We can later add more of the common status factory methods as needed @@ -39,7 +38,7 @@ cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil: c_bool IsNotImplemented() c_bool IsArrowError() - shared_ptr[CDataType] GetPrimitiveType(LogicalType type, c_bool nullable) + shared_ptr[CDataType] GetPrimitiveType(Type type) Status ConvertPySequence(object obj, shared_ptr[CArray]* out) MemoryPool* GetMemoryPool() diff --git a/python/pyarrow/scalar.pyx b/python/pyarrow/scalar.pyx index 261a38967c495..04f013d6ca706 100644 --- a/python/pyarrow/scalar.pyx +++ b/python/pyarrow/scalar.pyx @@ -172,18 +172,18 @@ cdef class ListValue(ArrayValue): cdef dict _scalar_classes = { - LogicalType_UINT8: Int8Value, - LogicalType_UINT16: Int16Value, - LogicalType_UINT32: Int32Value, - LogicalType_UINT64: Int64Value, - LogicalType_INT8: Int8Value, - LogicalType_INT16: Int16Value, - LogicalType_INT32: Int32Value, - LogicalType_INT64: Int64Value, - LogicalType_FLOAT: FloatValue, - LogicalType_DOUBLE: DoubleValue, - LogicalType_LIST: ListValue, - LogicalType_STRING: StringValue + Type_UINT8: Int8Value, + Type_UINT16: Int16Value, + Type_UINT32: Int32Value, + Type_UINT64: Int64Value, + Type_INT8: Int8Value, + Type_INT16: Int16Value, + Type_INT32: Int32Value, + Type_INT64: Int64Value, + Type_FLOAT: FloatValue, + Type_DOUBLE: DoubleValue, + Type_LIST: ListValue, + Type_STRING: StringValue } cdef object box_arrow_scalar(DataType type, diff --git a/python/pyarrow/schema.pxd b/python/pyarrow/schema.pxd index 07b9bd04da20e..61458b765c742 100644 --- a/python/pyarrow/schema.pxd +++ b/python/pyarrow/schema.pxd @@ -15,7 +15,7 @@ # specific language governing permissions and limitations # under the License. 
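The schema.pxd declarations below keep both a shared_ptr and a borrowed raw pointer to the wrapped C++ object, and Schema.from_fields (in schema.pyx further down) collects each Field's sp_field into a vector before constructing CSchema. A simplified C++ picture of that ownership model, using stand-in structs rather than the Arrow headers:

#include <iostream>
#include <memory>
#include <sstream>
#include <string>
#include <vector>

struct Field {
  std::string name;
  std::string type;
  bool nullable;
};

struct Schema {
  explicit Schema(std::vector<std::shared_ptr<Field>> fields)
      : fields(std::move(fields)) {}

  std::string ToString() const {
    std::ostringstream ss;
    for (const auto& f : fields) ss << f->name << ": " << f->type << "\n";
    return ss.str();
  }

  std::vector<std::shared_ptr<Field>> fields;  // shared with any wrapper
};

int main() {
  auto f0 = std::make_shared<Field>(Field{"foo", "int32", true});
  auto f1 = std::make_shared<Field>(Field{"bar", "string", true});
  std::cout << Schema({f0, f1}).ToString();  // foo: int32, then bar: string
  return 0;
}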
-from pyarrow.includes.common cimport shared_ptr +from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport CDataType, CField, CSchema cdef class DataType: @@ -33,9 +33,13 @@ cdef class Field: cdef readonly: DataType type + cdef init(self, const shared_ptr[CField]& field) + cdef class Schema: cdef: shared_ptr[CSchema] sp_schema CSchema* schema + cdef init(self, const vector[shared_ptr[CField]]& fields) + cdef DataType box_data_type(const shared_ptr[CDataType]& type) diff --git a/python/pyarrow/schema.pyx b/python/pyarrow/schema.pyx index ea878720d5bb8..b3bf02aad76bb 100644 --- a/python/pyarrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ -54,94 +54,153 @@ cdef class DataType: cdef class Field: - def __cinit__(self, object name, DataType type): - self.type = type - self.sp_field.reset(new CField(tobytes(name), type.sp_type)) - self.field = self.sp_field.get() + def __cinit__(self): + pass + + cdef init(self, const shared_ptr[CField]& field): + self.sp_field = field + self.field = field.get() + + @classmethod + def from_py(cls, object name, DataType type, bint nullable=True): + cdef Field result = Field() + result.type = type + result.sp_field.reset(new CField(tobytes(name), type.sp_type, + nullable)) + result.field = result.sp_field.get() + + return result def __repr__(self): return 'Field({0!r}, type={1})'.format(self.name, str(self.type)) + property nullable: + + def __get__(self): + return self.field.nullable + property name: def __get__(self): return frombytes(self.field.name) +cdef class Schema: + + def __cinit__(self): + pass + + def __len__(self): + return self.schema.num_fields() + + def __getitem__(self, i): + if i < 0 or i >= len(self): + raise IndexError("{0} is out of bounds".format(i)) + + cdef Field result = Field() + result.init(self.schema.field(i)) + result.type = box_data_type(result.field.type) + + return result + + cdef init(self, const vector[shared_ptr[CField]]& fields): + self.schema = new CSchema(fields) + self.sp_schema.reset(self.schema) + + @classmethod + def from_fields(cls, fields): + cdef: + Schema result + Field field + vector[shared_ptr[CField]] c_fields + + c_fields.resize(len(fields)) + + for i in range(len(fields)): + field = fields[i] + c_fields[i] = field.sp_field + + result = Schema() + result.init(c_fields) + + return result + + def __repr__(self): + return frombytes(self.schema.ToString()) + cdef dict _type_cache = {} -cdef DataType primitive_type(LogicalType type, bint nullable=True): - if (type, nullable) in _type_cache: - return _type_cache[type, nullable] +cdef DataType primitive_type(Type type): + if type in _type_cache: + return _type_cache[type] cdef DataType out = DataType() - out.init(pyarrow.GetPrimitiveType(type, nullable)) + out.init(pyarrow.GetPrimitiveType(type)) - _type_cache[type, nullable] = out + _type_cache[type] = out return out #------------------------------------------------------------ # Type factory functions -def field(name, type): - return Field(name, type) +def field(name, type, bint nullable=True): + return Field.from_py(name, type, nullable) cdef set PRIMITIVE_TYPES = set([ - LogicalType_NA, LogicalType_BOOL, - LogicalType_UINT8, LogicalType_INT8, - LogicalType_UINT16, LogicalType_INT16, - LogicalType_UINT32, LogicalType_INT32, - LogicalType_UINT64, LogicalType_INT64, - LogicalType_FLOAT, LogicalType_DOUBLE]) + Type_NA, Type_BOOL, + Type_UINT8, Type_INT8, + Type_UINT16, Type_INT16, + Type_UINT32, Type_INT32, + Type_UINT64, Type_INT64, + Type_FLOAT, Type_DOUBLE]) def null(): - return 
primitive_type(LogicalType_NA) + return primitive_type(Type_NA) -def bool_(c_bool nullable=True): - return primitive_type(LogicalType_BOOL, nullable) +def bool_(): + return primitive_type(Type_BOOL) -def uint8(c_bool nullable=True): - return primitive_type(LogicalType_UINT8, nullable) +def uint8(): + return primitive_type(Type_UINT8) -def int8(c_bool nullable=True): - return primitive_type(LogicalType_INT8, nullable) +def int8(): + return primitive_type(Type_INT8) -def uint16(c_bool nullable=True): - return primitive_type(LogicalType_UINT16, nullable) +def uint16(): + return primitive_type(Type_UINT16) -def int16(c_bool nullable=True): - return primitive_type(LogicalType_INT16, nullable) +def int16(): + return primitive_type(Type_INT16) -def uint32(c_bool nullable=True): - return primitive_type(LogicalType_UINT32, nullable) +def uint32(): + return primitive_type(Type_UINT32) -def int32(c_bool nullable=True): - return primitive_type(LogicalType_INT32, nullable) +def int32(): + return primitive_type(Type_INT32) -def uint64(c_bool nullable=True): - return primitive_type(LogicalType_UINT64, nullable) +def uint64(): + return primitive_type(Type_UINT64) -def int64(c_bool nullable=True): - return primitive_type(LogicalType_INT64, nullable) +def int64(): + return primitive_type(Type_INT64) -def float_(c_bool nullable=True): - return primitive_type(LogicalType_FLOAT, nullable) +def float_(): + return primitive_type(Type_FLOAT) -def double(c_bool nullable=True): - return primitive_type(LogicalType_DOUBLE, nullable) +def double(): + return primitive_type(Type_DOUBLE) -def string(c_bool nullable=True): +def string(): """ UTF8 string """ - return primitive_type(LogicalType_STRING, nullable) + return primitive_type(Type_STRING) -def list_(DataType value_type, c_bool nullable=True): +def list_(DataType value_type): cdef DataType out = DataType() - out.init(shared_ptr[CDataType]( - new CListType(value_type.sp_type, nullable))) + out.init(shared_ptr[CDataType](new CListType(value_type.sp_type))) return out -def struct(fields, c_bool nullable=True): +def struct(fields): """ """ @@ -154,9 +213,11 @@ def struct(fields, c_bool nullable=True): c_fields.push_back(field.sp_field) out.init(shared_ptr[CDataType]( - new CStructType(c_fields, nullable))) + new CStructType(c_fields))) return out +def schema(fields): + return Schema.from_fields(fields) cdef DataType box_data_type(const shared_ptr[CDataType]& type): cdef DataType out = DataType() diff --git a/python/pyarrow/tests/test_schema.py b/python/pyarrow/tests/test_schema.py index 0235526198f35..2894ea8f84451 100644 --- a/python/pyarrow/tests/test_schema.py +++ b/python/pyarrow/tests/test_schema.py @@ -18,6 +18,8 @@ from pyarrow.compat import unittest import pyarrow as arrow +A = arrow + class TestTypes(unittest.TestCase): @@ -28,15 +30,12 @@ def test_integers(self): for name in dtypes: factory = getattr(arrow, name) t = factory() - t_required = factory(False) - assert str(t) == name - assert str(t_required) == '{0} not null'.format(name) def test_list(self): value_type = arrow.int32() list_type = arrow.list_(value_type) - assert str(list_type) == 'list' + assert str(list_type) == 'list' def test_string(self): t = arrow.string() @@ -47,5 +46,26 @@ def test_field(self): f = arrow.field('foo', t) assert f.name == 'foo' + assert f.nullable assert f.type is t assert repr(f) == "Field('foo', type=string)" + + f = arrow.field('foo', t, False) + assert not f.nullable + + def test_schema(self): + fields = [ + A.field('foo', A.int32()), + A.field('bar', A.string()), + 
A.field('baz', A.list_(A.int8()))
+        ]
+        sch = A.schema(fields)
+
+        assert len(sch) == 3
+        assert sch[0].name == 'foo'
+        assert sch[0].type == fields[0].type
+
+        assert repr(sch) == """\
+foo: int32
+bar: string
+baz: list<int8>"""
diff --git a/cpp/src/arrow/table/CMakeLists.txt b/python/pyarrow/tests/test_table.py
similarity index 58%
rename from cpp/src/arrow/table/CMakeLists.txt
rename to python/pyarrow/tests/test_table.py
index d9f00e74a37db..2e24445bd0c22 100644
--- a/cpp/src/arrow/table/CMakeLists.txt
+++ b/python/pyarrow/tests/test_table.py
@@ -15,19 +15,26 @@
 # specific language governing permissions and limitations
 # under the License.

-#######################################
-# arrow_table
-#######################################
-
-# Headers: top level
-install(FILES
-  column.h
-  schema.h
-  table.h
-  DESTINATION include/arrow/table)
-
-ADD_ARROW_TEST(column-test)
-ADD_ARROW_TEST(schema-test)
-ADD_ARROW_TEST(table-test)
-
-ADD_ARROW_BENCHMARK(column-benchmark)
+from pyarrow.compat import unittest
+import pyarrow as arrow
+
+A = arrow
+
+
+class TestRowBatch(unittest.TestCase):
+
+    def test_basics(self):
+        data = [
+            A.from_pylist(range(5)),
+            A.from_pylist([-10, -5, 0, 5, 10])
+        ]
+        num_rows = 5
+
+        descr = A.schema([A.field('c0', data[0].type),
+                          A.field('c1', data[1].type)])
+
+        batch = A.RowBatch(descr, num_rows, data)
+
+        assert len(batch) == num_rows
+        assert batch.num_rows == num_rows
+        assert batch.num_columns == len(data)
diff --git a/python/src/pyarrow/adapters/builtin.cc b/python/src/pyarrow/adapters/builtin.cc
index bb7905236c59c..acb13acecaf33 100644
--- a/python/src/pyarrow/adapters/builtin.cc
+++ b/python/src/pyarrow/adapters/builtin.cc
@@ -27,7 +27,7 @@

 using arrow::ArrayBuilder;
 using arrow::DataType;
-using arrow::LogicalType;
+using arrow::Type;

 namespace pyarrow {

@@ -356,17 +356,17 @@ class ListConverter : public TypedConverter<ListBuilder> {
 // Dynamic constructor for sequence converters
 std::shared_ptr<SeqConverter> GetConverter(const std::shared_ptr<DataType>& type) {
   switch (type->type) {
-    case LogicalType::BOOL:
+    case Type::BOOL:
       return std::make_shared<BoolConverter>();
-    case LogicalType::INT64:
+    case Type::INT64:
       return std::make_shared<Int64Converter>();
-    case LogicalType::DOUBLE:
+    case Type::DOUBLE:
       return std::make_shared<DoubleConverter>();
-    case LogicalType::STRING:
+    case Type::STRING:
       return std::make_shared<StringConverter>();
-    case LogicalType::LIST:
+    case Type::LIST:
       return std::make_shared<ListConverter>();
-    case LogicalType::STRUCT:
+    case Type::STRUCT:
     default:
       return nullptr;
       break;
@@ -378,7 +378,7 @@ Status ListConverter::Init(const std::shared_ptr<ArrayBuilder>& builder) {
   typed_builder_ = static_cast<ListBuilder*>(builder.get());

   value_converter_ = GetConverter(static_cast<arrow::ListType*>(
-      builder->type().get())->value_type);
+      builder->type().get())->value_type());
   if (value_converter_ == nullptr) {
     return Status::NotImplemented("value type not implemented");
   }
@@ -393,8 +393,8 @@ Status ConvertPySequence(PyObject* obj, std::shared_ptr<arrow::Array>* out) {
   PY_RETURN_NOT_OK(InferArrowType(obj, &size, &type));

   // Handle NA / NullType case
-  if (type->type == LogicalType::NA) {
-    out->reset(new arrow::Array(type, size, size));
+  if (type->type == Type::NA) {
+    out->reset(new arrow::NullArray(type, size));
     return Status::OK();
   }
diff --git a/python/src/pyarrow/helpers.cc b/python/src/pyarrow/helpers.cc
index 0921fc4994599..08003aabf9f22 100644
--- a/python/src/pyarrow/helpers.cc
+++ b/python/src/pyarrow/helpers.cc
@@ -37,19 +37,14 @@

 const std::shared_ptr<DataType> FLOAT = std::make_shared<FloatType>();
 const std::shared_ptr<DataType> DOUBLE = std::make_shared<DoubleType>();
 const std::shared_ptr<DataType> STRING = std::make_shared<StringType>();

-#define GET_PRIMITIVE_TYPE(NAME, Type)          \
-  case LogicalType::NAME:                       \
-    if (nullable) {                             \
-      return NAME;                              \
-    } else {                                    \
-      return std::make_shared<Type>(nullable);  \
-    }                                           \
+#define GET_PRIMITIVE_TYPE(NAME, Class) \
+  case Type::NAME:                      \
+    return NAME;                        \
     break;

-std::shared_ptr<DataType> GetPrimitiveType(LogicalType::type type,
-    bool nullable) {
+std::shared_ptr<DataType> GetPrimitiveType(Type::type type) {
   switch (type) {
-    case LogicalType::NA:
+    case Type::NA:
       return NA;
     GET_PRIMITIVE_TYPE(UINT8, UInt8Type);
     GET_PRIMITIVE_TYPE(INT8, Int8Type);
diff --git a/python/src/pyarrow/helpers.h b/python/src/pyarrow/helpers.h
index e41568d5881d4..ec42bb31d3b9b 100644
--- a/python/src/pyarrow/helpers.h
+++ b/python/src/pyarrow/helpers.h
@@ -24,7 +24,7 @@
 namespace pyarrow {

 using arrow::DataType;
-using arrow::LogicalType;
+using arrow::Type;

 extern const std::shared_ptr<DataType> NA;
 extern const std::shared_ptr<DataType> BOOL;
@@ -40,8 +40,7 @@ extern const std::shared_ptr<DataType> FLOAT;
 extern const std::shared_ptr<DataType> DOUBLE;
 extern const std::shared_ptr<DataType> STRING;

-std::shared_ptr<DataType> GetPrimitiveType(LogicalType::type type,
-    bool nullable);
+std::shared_ptr<DataType> GetPrimitiveType(Type::type type);

 }  // namespace pyarrow

From a4002c6e217bf1e74895dc11ab76f0c8befbfe2a Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Wed, 23 Mar 2016 10:59:31 -0700
Subject: [PATCH 0040/1644] ARROW-70: Add adapt 'lite' DCHECK macros from Kudu
 as also used in Parquet

Also added a null pointer DCHECK to show that it works. cc @emkornfield

Author: Wes McKinney

Closes #33 from wesm/ARROW-70 and squashes the following commits:

258d77b [Wes McKinney] Add adapt 'lite' DCHECK macros from Kudu as also used in Parquet
---
 cpp/src/arrow/ipc/adapter.cc |   2 +
 cpp/src/arrow/util/logging.h | 109 +++++++++++++++++++++++++++++++++++
 2 files changed, 111 insertions(+)
 create mode 100644 cpp/src/arrow/util/logging.h

diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc
index 7cdb965f5f45c..8a7d818ceeedd 100644
--- a/cpp/src/arrow/ipc/adapter.cc
+++ b/cpp/src/arrow/ipc/adapter.cc
@@ -32,6 +32,7 @@
 #include "arrow/types/construct.h"
 #include "arrow/types/primitive.h"
 #include "arrow/util/buffer.h"
+#include "arrow/util/logging.h"
 #include "arrow/util/status.h"

 namespace arrow {
@@ -41,6 +42,7 @@ namespace flatbuf = apache::arrow::flatbuf;
 namespace ipc {

 static bool IsPrimitive(const DataType* type) {
+  DCHECK(type != nullptr);
   switch (type->type) {
     // NA is null type or "no type", considered primitive for now
     case Type::NA:
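The adapter.cc hunk above is the intended call site for the new macros: a DCHECK that compiles away in NDEBUG builds and aborts through CerrLog in debug builds, exactly as the header added below defines. A small usage sketch under that contract; Lookup is a hypothetical caller, not code from the patch:

#include "arrow/util/logging.h"

static int Lookup(const int* table, int size, int index) {
  DCHECK(table != nullptr);  // free in release, fatal with message in debug
  DCHECK_LT(index, size);
  return table[index];
}

int main() {
  int table[] = {1, 2, 3};
  return Lookup(table, 3, 2) == 3 ? 0 : 1;
}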
diff --git a/cpp/src/arrow/util/logging.h b/cpp/src/arrow/util/logging.h
new file mode 100644
index 0000000000000..3ce4ccc1e9c26
--- /dev/null
+++ b/cpp/src/arrow/util/logging.h
@@ -0,0 +1,109 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef ARROW_UTIL_LOGGING_H
+#define ARROW_UTIL_LOGGING_H
+
+#include <iostream>
+
+namespace arrow {
+
+// Stubbed versions of macros defined in glog/logging.h, intended for
+// environments where glog headers aren't available.
+//
+// Add more as needed.
+
+// Log levels. LOG ignores them, so their values are arbitrary.
+
+#define ARROW_INFO 0
+#define ARROW_WARNING 1
+#define ARROW_ERROR 2
+#define ARROW_FATAL 3
+
+#define ARROW_LOG_INTERNAL(level) arrow::internal::CerrLog(level)
+#define ARROW_LOG(level) ARROW_LOG_INTERNAL(ARROW_##level)
+
+#define ARROW_CHECK(condition) \
+  (condition) ? 0 : ARROW_LOG(FATAL) << "Check failed: " #condition " "
+
+#ifdef NDEBUG
+#define ARROW_DFATAL ARROW_WARNING
+
+#define DCHECK(condition) while (false) arrow::internal::NullLog()
+#define DCHECK_EQ(val1, val2) while (false) arrow::internal::NullLog()
+#define DCHECK_NE(val1, val2) while (false) arrow::internal::NullLog()
+#define DCHECK_LE(val1, val2) while (false) arrow::internal::NullLog()
+#define DCHECK_LT(val1, val2) while (false) arrow::internal::NullLog()
+#define DCHECK_GE(val1, val2) while (false) arrow::internal::NullLog()
+#define DCHECK_GT(val1, val2) while (false) arrow::internal::NullLog()
+
+#else
+#define ARROW_DFATAL ARROW_FATAL
+
+#define DCHECK(condition) ARROW_CHECK(condition)
+#define DCHECK_EQ(val1, val2) ARROW_CHECK((val1) == (val2))
+#define DCHECK_NE(val1, val2) ARROW_CHECK((val1) != (val2))
+#define DCHECK_LE(val1, val2) ARROW_CHECK((val1) <= (val2))
+#define DCHECK_LT(val1, val2) ARROW_CHECK((val1) < (val2))
+#define DCHECK_GE(val1, val2) ARROW_CHECK((val1) >= (val2))
+#define DCHECK_GT(val1, val2) ARROW_CHECK((val1) > (val2))
+
+#endif  // NDEBUG
+
+namespace internal {
+
+class NullLog {
+ public:
+  template <class T>
+  NullLog& operator<<(const T& t) {
+    return *this;
+  }
+};
+
+class CerrLog {
+ public:
+  CerrLog(int severity)  // NOLINT(runtime/explicit)
+      : severity_(severity),
+        has_logged_(false) {
+  }
+
+  ~CerrLog() {
+    if (has_logged_) {
+      std::cerr << std::endl;
+    }
+    if (severity_ == ARROW_FATAL) {
+      exit(1);
+    }
+  }
+
+  template <class T>
+  CerrLog& operator<<(const T& t) {
+    has_logged_ = true;
+    std::cerr << t;
+    return *this;
+  }
+
+ private:
+  const int severity_;
+  bool has_logged_;
+};
+
+}  // namespace internal
+
+}  // namespace arrow
+
+#endif  // ARROW_UTIL_LOGGING_H

From fbbee3d2db5beb1ae6c623fc6392095cffdf74fe Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Thu, 24 Mar 2016 09:31:56 -0700
Subject: [PATCH 0041/1644] ARROW-77: [C++] Conform bitmap interpretation to
 ARROW-62; 1 for nulls, 0 for non-nulls

Author: Wes McKinney

Closes #35 from wesm/ARROW-77 and squashes the following commits:

848d1fe [Wes McKinney] Clean up variable names to indicate valid_bytes vs null_bytes and change nulls to null_bitmap to be more clear
8960c7d [Wes McKinney] Flip bit interpretation so that 1 is null and 0 is not-null.
Do not compare null slots in primitive arrays --- cpp/src/arrow/array-test.cc | 30 ++++----- cpp/src/arrow/array.cc | 10 +-- cpp/src/arrow/array.h | 16 ++--- cpp/src/arrow/builder.cc | 16 ++--- cpp/src/arrow/builder.h | 14 ++-- cpp/src/arrow/column-benchmark.cc | 6 +- cpp/src/arrow/ipc/adapter.cc | 10 +-- cpp/src/arrow/ipc/ipc-adapter-test.cc | 8 +-- cpp/src/arrow/test-util.h | 25 ++++--- cpp/src/arrow/types/construct.cc | 4 +- cpp/src/arrow/types/construct.h | 2 +- cpp/src/arrow/types/list.cc | 6 +- cpp/src/arrow/types/list.h | 23 +++---- cpp/src/arrow/types/primitive-test.cc | 55 ++++++++++------ cpp/src/arrow/types/primitive.cc | 29 +++++++-- cpp/src/arrow/types/primitive.h | 93 +++++++++++++++------------ cpp/src/arrow/types/string-test.cc | 18 +++--- cpp/src/arrow/types/string.h | 8 +-- cpp/src/arrow/util/bit-util.h | 8 ++- 19 files changed, 213 insertions(+), 168 deletions(-) diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc index eded5941e892e..7c6eaf55c0d0f 100644 --- a/cpp/src/arrow/array-test.cc +++ b/cpp/src/arrow/array-test.cc @@ -44,9 +44,9 @@ class TestArray : public ::testing::Test { TEST_F(TestArray, TestNullCount) { auto data = std::make_shared(pool_); - auto nulls = std::make_shared(pool_); + auto null_bitmap = std::make_shared(pool_); - std::unique_ptr arr(new Int32Array(100, data, 10, nulls)); + std::unique_ptr arr(new Int32Array(100, data, 10, null_bitmap)); ASSERT_EQ(10, arr->null_count()); std::unique_ptr arr_no_nulls(new Int32Array(100, data)); @@ -61,28 +61,28 @@ TEST_F(TestArray, TestLength) { } TEST_F(TestArray, TestIsNull) { - std::vector nulls = {1, 0, 1, 1, 0, 1, 0, 0, - 1, 0, 1, 1, 0, 1, 0, 0, - 1, 0, 1, 1, 0, 1, 0, 0, - 1, 0, 1, 1, 0, 1, 0, 0, - 1, 0, 0, 1}; + std::vector null_bitmap = {1, 0, 1, 1, 0, 1, 0, 0, + 1, 0, 1, 1, 0, 1, 0, 0, + 1, 0, 1, 1, 0, 1, 0, 0, + 1, 0, 1, 1, 0, 1, 0, 0, + 1, 0, 0, 1}; int32_t null_count = 0; - for (uint8_t x : nulls) { - if (x > 0) ++null_count; + for (uint8_t x : null_bitmap) { + if (x == 0) ++null_count; } - std::shared_ptr null_buf = test::bytes_to_null_buffer(nulls.data(), - nulls.size()); + std::shared_ptr null_buf = test::bytes_to_null_buffer(null_bitmap.data(), + null_bitmap.size()); std::unique_ptr arr; - arr.reset(new Int32Array(nulls.size(), nullptr, null_count, null_buf)); + arr.reset(new Int32Array(null_bitmap.size(), nullptr, null_count, null_buf)); ASSERT_EQ(null_count, arr->null_count()); ASSERT_EQ(5, null_buf->size()); - ASSERT_TRUE(arr->nulls()->Equals(*null_buf.get())); + ASSERT_TRUE(arr->null_bitmap()->Equals(*null_buf.get())); - for (size_t i = 0; i < nulls.size(); ++i) { - ASSERT_EQ(static_cast(nulls[i]), arr->IsNull(i)); + for (size_t i = 0; i < null_bitmap.size(); ++i) { + EXPECT_EQ(static_cast(null_bitmap[i]), !arr->IsNull(i)) << i; } } diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index 5a5bc1069db13..3736732740b5b 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -27,13 +27,13 @@ namespace arrow { // Base array class Array::Array(const TypePtr& type, int32_t length, int32_t null_count, - const std::shared_ptr& nulls) { + const std::shared_ptr& null_bitmap) { type_ = type; length_ = length; null_count_ = null_count; - nulls_ = nulls; - if (nulls_) { - null_bits_ = nulls_->data(); + null_bitmap_ = null_bitmap; + if (null_bitmap_) { + null_bitmap_data_ = null_bitmap_->data(); } } @@ -44,7 +44,7 @@ bool Array::EqualsExact(const Array& other) const { return false; } if (null_count_ > 0) { - return nulls_->Equals(*other.nulls_, 
util::bytes_for_bits(length_)); + return null_bitmap_->Equals(*other.null_bitmap_, util::bytes_for_bits(length_)); } else { return true; } diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 65fc0aaf583e9..133adf32cbd50 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -32,8 +32,8 @@ class Buffer; // Immutable data array with some logical type and some length. Any memory is // owned by the respective Buffer instance (or its parents). // -// The base class is only required to have a nulls buffer if the null count is -// greater than 0 +// The base class is only required to have a null bitmap buffer if the null +// count is greater than 0 // // Any buffers used to initialize the array have their references "stolen". If // you wish to use the buffer beyond the lifetime of the array, you need to @@ -41,13 +41,13 @@ class Buffer; class Array { public: Array(const TypePtr& type, int32_t length, int32_t null_count = 0, - const std::shared_ptr& nulls = nullptr); + const std::shared_ptr& null_bitmap = nullptr); virtual ~Array() {} // Determine if a slot is null. For inner loops. Does *not* boundscheck bool IsNull(int i) const { - return null_count_ > 0 && util::get_bit(null_bits_, i); + return null_count_ > 0 && util::bit_not_set(null_bitmap_data_, i); } int32_t length() const { return length_;} @@ -56,8 +56,8 @@ class Array { const std::shared_ptr& type() const { return type_;} Type::type type_enum() const { return type_->type;} - const std::shared_ptr& nulls() const { - return nulls_; + const std::shared_ptr& null_bitmap() const { + return null_bitmap_; } bool EqualsExact(const Array& arr) const; @@ -68,8 +68,8 @@ class Array { int32_t null_count_; int32_t length_; - std::shared_ptr nulls_; - const uint8_t* null_bits_; + std::shared_ptr null_bitmap_; + const uint8_t* null_bitmap_data_; private: Array() {} diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index ba70add155186..6a62dc3b0e08f 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -28,20 +28,20 @@ namespace arrow { Status ArrayBuilder::Init(int32_t capacity) { capacity_ = capacity; int32_t to_alloc = util::ceil_byte(capacity) / 8; - nulls_ = std::make_shared(pool_); - RETURN_NOT_OK(nulls_->Resize(to_alloc)); - null_bits_ = nulls_->mutable_data(); - memset(null_bits_, 0, to_alloc); + null_bitmap_ = std::make_shared(pool_); + RETURN_NOT_OK(null_bitmap_->Resize(to_alloc)); + null_bitmap_data_ = null_bitmap_->mutable_data(); + memset(null_bitmap_data_, 0, to_alloc); return Status::OK(); } Status ArrayBuilder::Resize(int32_t new_bits) { int32_t new_bytes = util::ceil_byte(new_bits) / 8; - int32_t old_bytes = nulls_->size(); - RETURN_NOT_OK(nulls_->Resize(new_bytes)); - null_bits_ = nulls_->mutable_data(); + int32_t old_bytes = null_bitmap_->size(); + RETURN_NOT_OK(null_bitmap_->Resize(new_bytes)); + null_bitmap_data_ = null_bitmap_->mutable_data(); if (old_bytes < new_bytes) { - memset(null_bits_ + old_bytes, 0, new_bytes - old_bytes); + memset(null_bitmap_data_ + old_bytes, 0, new_bytes - old_bytes); } return Status::OK(); } diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index d5d1fdf95af17..308e54c80d794 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -40,9 +40,9 @@ class ArrayBuilder { explicit ArrayBuilder(MemoryPool* pool, const TypePtr& type) : pool_(pool), type_(type), - nulls_(nullptr), + null_bitmap_(nullptr), null_count_(0), - null_bits_(nullptr), + null_bitmap_data_(nullptr), length_(0), capacity_(0) {} @@ -66,7 +66,7 @@ class 
ArrayBuilder { // initialized independently Status Init(int32_t capacity); - // Resizes the nulls array + // Resizes the null_bitmap array Status Resize(int32_t new_bits); // For cases where raw data was memcpy'd into the internal buffers, allows us @@ -74,7 +74,7 @@ class ArrayBuilder { // this function responsibly. Status Advance(int32_t elements); - const std::shared_ptr& nulls() const { return nulls_;} + const std::shared_ptr& null_bitmap() const { return null_bitmap_;} // Creates new array object to hold the contents of the builder and transfers // ownership of the data @@ -89,10 +89,10 @@ class ArrayBuilder { std::shared_ptr type_; - // When nulls are first appended to the builder, the null bitmap is allocated - std::shared_ptr nulls_; + // When null_bitmap are first appended to the builder, the null bitmap is allocated + std::shared_ptr null_bitmap_; int32_t null_count_; - uint8_t* null_bits_; + uint8_t* null_bitmap_data_; // Array length, so far. Also, the index of the next element to be added int32_t length_; diff --git a/cpp/src/arrow/column-benchmark.cc b/cpp/src/arrow/column-benchmark.cc index 69ee52c3e09ea..335d581782ac0 100644 --- a/cpp/src/arrow/column-benchmark.cc +++ b/cpp/src/arrow/column-benchmark.cc @@ -28,10 +28,10 @@ namespace { std::shared_ptr MakePrimitive(int32_t length, int32_t null_count = 0) { auto pool = default_memory_pool(); auto data = std::make_shared(pool); - auto nulls = std::make_shared(pool); + auto null_bitmap = std::make_shared(pool); data->Resize(length * sizeof(typename ArrayType::value_type)); - nulls->Resize(util::bytes_for_bits(length)); - return std::make_shared(length, data, 10, nulls); + null_bitmap->Resize(util::bytes_for_bits(length)); + return std::make_shared(length, data, 10, null_bitmap); } } // anonymous namespace diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index 8a7d818ceeedd..c79e8469530f7 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -75,7 +75,7 @@ Status VisitArray(const Array* arr, std::vector* field_nodes flatbuf::FieldNode(prim_arr->length(), prim_arr->null_count())); if (prim_arr->null_count() > 0) { - buffers->push_back(prim_arr->nulls()); + buffers->push_back(prim_arr->null_bitmap()); } else { // Push a dummy zero-length buffer, not to be copied buffers->push_back(std::make_shared(nullptr, 0)); @@ -230,13 +230,13 @@ class RowBatchReader::Impl { FieldMetadata field_meta = metadata_->field(field_index_++); if (IsPrimitive(type.get())) { - std::shared_ptr nulls; + std::shared_ptr null_bitmap; std::shared_ptr data; if (field_meta.null_count == 0) { - nulls = nullptr; + null_bitmap = nullptr; ++buffer_index_; } else { - RETURN_NOT_OK(GetBuffer(buffer_index_++, &nulls)); + RETURN_NOT_OK(GetBuffer(buffer_index_++, &null_bitmap)); } if (field_meta.length > 0) { RETURN_NOT_OK(GetBuffer(buffer_index_++, &data)); @@ -244,7 +244,7 @@ class RowBatchReader::Impl { data.reset(new Buffer(nullptr, 0)); } return MakePrimitiveArray(type, field_meta.length, data, - field_meta.null_count, nulls, out); + field_meta.null_count, null_bitmap, out); } else { return Status::NotImplemented("Non-primitive types not complete yet"); } diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc index d75998f0a5dd2..79b4d710d282f 100644 --- a/cpp/src/arrow/ipc/ipc-adapter-test.cc +++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc @@ -77,14 +77,14 @@ TEST_F(TestWriteRowBatch, IntegerRoundTrip) { test::rand_uniform_int(length, 0, 0, std::numeric_limits::max(), 
reinterpret_cast(data->mutable_data())); - auto nulls = std::make_shared(pool_); + auto null_bitmap = std::make_shared(pool_); int null_bytes = util::bytes_for_bits(length); - ASSERT_OK(nulls->Resize(null_bytes)); - test::random_bytes(null_bytes, 0, nulls->mutable_data()); + ASSERT_OK(null_bitmap->Resize(null_bytes)); + test::random_bytes(null_bytes, 0, null_bitmap->mutable_data()); auto a0 = std::make_shared(length, data); auto a1 = std::make_shared(length, data, - test::bitmap_popcount(nulls->data(), length), nulls); + test::bitmap_popcount(null_bitmap->data(), length), null_bitmap); RowBatch batch(schema, length, {a0, a1}); diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index a9fb2a7644ab3..ea3ce5f7f53ff 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -72,10 +72,10 @@ class TestBase : public ::testing::Test { template std::shared_ptr MakePrimitive(int32_t length, int32_t null_count = 0) { auto data = std::make_shared(pool_); - auto nulls = std::make_shared(pool_); + auto null_bitmap = std::make_shared(pool_); EXPECT_OK(data->Resize(length * sizeof(typename ArrayType::value_type))); - EXPECT_OK(nulls->Resize(util::bytes_for_bits(length))); - return std::make_shared(length, data, 10, nulls); + EXPECT_OK(null_bitmap->Resize(util::bytes_for_bits(length))); + return std::make_shared(length, data, 10, null_bitmap); } protected: @@ -104,17 +104,22 @@ std::shared_ptr to_buffer(const std::vector& values) { values.size() * sizeof(T)); } -void random_nulls(int64_t n, double pct_null, std::vector* nulls) { +void random_null_bitmap(int64_t n, double pct_null, std::vector* null_bitmap) { Random rng(random_seed()); for (int i = 0; i < n; ++i) { - nulls->push_back(static_cast(rng.NextDoubleFraction() > pct_null)); + if (rng.NextDoubleFraction() > pct_null) { + null_bitmap->push_back(1); + } else { + // null + null_bitmap->push_back(0); + } } } -void random_nulls(int64_t n, double pct_null, std::vector* nulls) { +void random_null_bitmap(int64_t n, double pct_null, std::vector* null_bitmap) { Random rng(random_seed()); for (int i = 0; i < n; ++i) { - nulls->push_back(rng.NextDoubleFraction() > pct_null); + null_bitmap->push_back(rng.NextDoubleFraction() > pct_null); } } @@ -145,10 +150,10 @@ static inline int bitmap_popcount(const uint8_t* data, int length) { return count; } -static inline int null_count(const std::vector& nulls) { +static inline int null_count(const std::vector& valid_bytes) { int result = 0; - for (size_t i = 0; i < nulls.size(); ++i) { - if (nulls[i] > 0) { + for (size_t i = 0; i < valid_bytes.size(); ++i) { + if (valid_bytes[i] == 0) { ++result; } } diff --git a/cpp/src/arrow/types/construct.cc b/cpp/src/arrow/types/construct.cc index 290decd81ff42..df2317c340b2d 100644 --- a/cpp/src/arrow/types/construct.cc +++ b/cpp/src/arrow/types/construct.cc @@ -75,12 +75,12 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, #define MAKE_PRIMITIVE_ARRAY_CASE(ENUM, ArrayType) \ case Type::ENUM: \ - out->reset(new ArrayType(type, length, data, null_count, nulls)); \ + out->reset(new ArrayType(type, length, data, null_count, null_bitmap)); \ return Status::OK(); Status MakePrimitiveArray(const std::shared_ptr& type, int32_t length, const std::shared_ptr& data, - int32_t null_count, const std::shared_ptr& nulls, + int32_t null_count, const std::shared_ptr& null_bitmap, std::shared_ptr* out) { switch (type->type) { MAKE_PRIMITIVE_ARRAY_CASE(UINT8, UInt8Array); diff --git a/cpp/src/arrow/types/construct.h b/cpp/src/arrow/types/construct.h 
index 089c484c58bee..228faeccc4e4d 100644 --- a/cpp/src/arrow/types/construct.h +++ b/cpp/src/arrow/types/construct.h @@ -35,7 +35,7 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, Status MakePrimitiveArray(const std::shared_ptr& type, int32_t length, const std::shared_ptr& data, - int32_t null_count, const std::shared_ptr& nulls, + int32_t null_count, const std::shared_ptr& null_bitmap, std::shared_ptr* out); } // namespace arrow diff --git a/cpp/src/arrow/types/list.cc b/cpp/src/arrow/types/list.cc index 670ee4da11675..d64c06d90c174 100644 --- a/cpp/src/arrow/types/list.cc +++ b/cpp/src/arrow/types/list.cc @@ -27,13 +27,13 @@ bool ListArray::EqualsExact(const ListArray& other) const { bool equal_offsets = offset_buf_->Equals(*other.offset_buf_, length_ + 1); - bool equal_nulls = true; + bool equal_null_bitmap = true; if (null_count_ > 0) { - equal_nulls = nulls_->Equals(*other.nulls_, + equal_null_bitmap = null_bitmap_->Equals(*other.null_bitmap_, util::bytes_for_bits(length_)); } - if (!(equal_offsets && equal_nulls)) { + if (!(equal_offsets && equal_null_bitmap)) { return false; } diff --git a/cpp/src/arrow/types/list.h b/cpp/src/arrow/types/list.h index 141f762458b3b..72e20e943c347 100644 --- a/cpp/src/arrow/types/list.h +++ b/cpp/src/arrow/types/list.h @@ -39,8 +39,8 @@ class ListArray : public Array { ListArray(const TypePtr& type, int32_t length, std::shared_ptr offsets, const ArrayPtr& values, int32_t null_count = 0, - std::shared_ptr nulls = nullptr) : - Array(type, length, null_count, nulls) { + std::shared_ptr null_bitmap = nullptr) : + Array(type, length, null_count, null_bitmap) { offset_buf_ = offsets; offsets_ = offsets == nullptr? nullptr : reinterpret_cast(offset_buf_->data()); @@ -109,17 +109,17 @@ class ListBuilder : public Int32Builder { // Vector append // - // If passed, null_bytes is of equal length to values, and any nonzero byte + // If passed, valid_bytes is of equal length to values, and any zero byte // will be considered as a null for that slot - Status Append(value_type* values, int32_t length, uint8_t* null_bytes = nullptr) { + Status Append(value_type* values, int32_t length, uint8_t* valid_bytes = nullptr) { if (length_ + length > capacity_) { int32_t new_capacity = util::next_power2(length_ + length); RETURN_NOT_OK(Resize(new_capacity)); } memcpy(raw_buffer() + length_, values, length * elsize_); - if (null_bytes != nullptr) { - AppendNulls(null_bytes, length); + if (valid_bytes != nullptr) { + AppendNulls(valid_bytes, length); } length_ += length; @@ -136,9 +136,9 @@ class ListBuilder : public Int32Builder { } auto result = std::make_shared(type_, length_, values_, items, - null_count_, nulls_); + null_count_, null_bitmap_); - values_ = nulls_ = nullptr; + values_ = null_bitmap_ = nullptr; capacity_ = length_ = null_count_ = 0; return result; @@ -159,16 +159,13 @@ class ListBuilder : public Int32Builder { } if (is_null) { ++null_count_; - util::set_bit(null_bits_, length_); + } else { + util::set_bit(null_bitmap_data_, length_); } raw_buffer()[length_++] = value_builder_->length(); return Status::OK(); } - // Status Append(int32_t* offsets, int length, uint8_t* null_bytes) { - // return Int32Builder::Append(offsets, length, null_bytes); - // } - Status AppendNull() { return Append(true); } diff --git a/cpp/src/arrow/types/primitive-test.cc b/cpp/src/arrow/types/primitive-test.cc index 7eae8cda8c488..10ba113c5916c 100644 --- a/cpp/src/arrow/types/primitive-test.cc +++ b/cpp/src/arrow/types/primitive-test.cc @@ -71,10 +71,10 @@ 
PRIMITIVE_TEST(BooleanType, BOOL, "bool"); TEST_F(TestBuilder, TestResize) { builder_->Init(10); - ASSERT_EQ(2, builder_->nulls()->size()); + ASSERT_EQ(2, builder_->null_bitmap()->size()); builder_->Resize(30); - ASSERT_EQ(4, builder_->nulls()->size()); + ASSERT_EQ(4, builder_->null_bitmap()->size()); } template @@ -99,7 +99,7 @@ class TestPrimitiveBuilder : public TestBuilder { void RandomData(int N, double pct_null = 0.1) { Attrs::draw(N, &draws_); - test::random_nulls(N, pct_null, &nulls_); + test::random_null_bitmap(N, pct_null, &valid_bytes_); } void CheckNullable() { @@ -109,10 +109,11 @@ class TestPrimitiveBuilder : public TestBuilder { reinterpret_cast(draws_.data()), size * sizeof(T)); - auto ex_nulls = test::bytes_to_null_buffer(nulls_.data(), size); - int32_t ex_null_count = test::null_count(nulls_); + auto ex_null_bitmap = test::bytes_to_null_buffer(valid_bytes_.data(), size); + int32_t ex_null_count = test::null_count(valid_bytes_); - auto expected = std::make_shared(size, ex_data, ex_null_count, ex_nulls); + auto expected = std::make_shared(size, ex_data, ex_null_count, + ex_null_bitmap); std::shared_ptr result = std::dynamic_pointer_cast( builder_->Finish()); @@ -123,8 +124,8 @@ class TestPrimitiveBuilder : public TestBuilder { ASSERT_EQ(0, builder_->null_count()); ASSERT_EQ(nullptr, builder_->buffer()); - ASSERT_TRUE(result->EqualsExact(*expected.get())); ASSERT_EQ(ex_null_count, result->null_count()); + ASSERT_TRUE(result->EqualsExact(*expected.get())); } void CheckNonNullable() { @@ -154,7 +155,7 @@ class TestPrimitiveBuilder : public TestBuilder { shared_ptr builder_nn_; vector draws_; - vector nulls_; + vector valid_bytes_; }; #define PTYPE_DECL(CapType, c_type) \ @@ -210,7 +211,7 @@ TYPED_TEST(TestPrimitiveBuilder, TestInit) { } TYPED_TEST(TestPrimitiveBuilder, TestAppendNull) { - int size = 10000; + int size = 1000; for (int i = 0; i < size; ++i) { ASSERT_OK(this->builder_->AppendNull()); } @@ -218,17 +219,17 @@ TYPED_TEST(TestPrimitiveBuilder, TestAppendNull) { auto result = this->builder_->Finish(); for (int i = 0; i < size; ++i) { - ASSERT_TRUE(result->IsNull(i)); + ASSERT_TRUE(result->IsNull(i)) << i; } } TYPED_TEST(TestPrimitiveBuilder, TestArrayDtorDealloc) { DECL_T(); - int size = 10000; + int size = 1000; vector& draws = this->draws_; - vector& nulls = this->nulls_; + vector& valid_bytes = this->valid_bytes_; int64_t memory_before = this->pool_->bytes_allocated(); @@ -236,7 +237,11 @@ TYPED_TEST(TestPrimitiveBuilder, TestArrayDtorDealloc) { int i; for (i = 0; i < size; ++i) { - ASSERT_OK(this->builder_->Append(draws[i], nulls[i] > 0)); + if (valid_bytes[i] > 0) { + ASSERT_OK(this->builder_->Append(draws[i])); + } else { + ASSERT_OK(this->builder_->AppendNull()); + } } do { @@ -249,17 +254,21 @@ TYPED_TEST(TestPrimitiveBuilder, TestArrayDtorDealloc) { TYPED_TEST(TestPrimitiveBuilder, TestAppendScalar) { DECL_T(); - int size = 10000; + const int size = 10000; vector& draws = this->draws_; - vector& nulls = this->nulls_; + vector& valid_bytes = this->valid_bytes_; this->RandomData(size); int i; // Append the first 1000 for (i = 0; i < 1000; ++i) { - ASSERT_OK(this->builder_->Append(draws[i], nulls[i] > 0)); + if (valid_bytes[i] > 0) { + ASSERT_OK(this->builder_->Append(draws[i])); + } else { + ASSERT_OK(this->builder_->AppendNull()); + } ASSERT_OK(this->builder_nn_->Append(draws[i])); } @@ -271,7 +280,11 @@ TYPED_TEST(TestPrimitiveBuilder, TestAppendScalar) { // Append the next 9000 for (i = 1000; i < size; ++i) { - ASSERT_OK(this->builder_->Append(draws[i], 
nulls[i] > 0)); + if (valid_bytes[i] > 0) { + ASSERT_OK(this->builder_->Append(draws[i])); + } else { + ASSERT_OK(this->builder_->AppendNull()); + } ASSERT_OK(this->builder_nn_->Append(draws[i])); } @@ -293,12 +306,12 @@ TYPED_TEST(TestPrimitiveBuilder, TestAppendVector) { this->RandomData(size); vector& draws = this->draws_; - vector& nulls = this->nulls_; + vector& valid_bytes = this->valid_bytes_; // first slug int K = 1000; - ASSERT_OK(this->builder_->Append(draws.data(), K, nulls.data())); + ASSERT_OK(this->builder_->Append(draws.data(), K, valid_bytes.data())); ASSERT_OK(this->builder_nn_->Append(draws.data(), K)); ASSERT_EQ(1000, this->builder_->length()); @@ -308,7 +321,7 @@ TYPED_TEST(TestPrimitiveBuilder, TestAppendVector) { ASSERT_EQ(1024, this->builder_nn_->capacity()); // Append the next 9000 - ASSERT_OK(this->builder_->Append(draws.data() + K, size - K, nulls.data() + K)); + ASSERT_OK(this->builder_->Append(draws.data() + K, size - K, valid_bytes.data() + K)); ASSERT_OK(this->builder_nn_->Append(draws.data() + K, size - K)); ASSERT_EQ(size, this->builder_->length()); @@ -338,7 +351,7 @@ TYPED_TEST(TestPrimitiveBuilder, TestResize) { ASSERT_EQ(cap, this->builder_->capacity()); ASSERT_EQ(cap * sizeof(T), this->builder_->buffer()->size()); - ASSERT_EQ(util::ceil_byte(cap) / 8, this->builder_->nulls()->size()); + ASSERT_EQ(util::ceil_byte(cap) / 8, this->builder_->null_bitmap()->size()); } TYPED_TEST(TestPrimitiveBuilder, TestReserve) { diff --git a/cpp/src/arrow/types/primitive.cc b/cpp/src/arrow/types/primitive.cc index 32b8bfa7f1bd4..ecd5d68ff45a8 100644 --- a/cpp/src/arrow/types/primitive.cc +++ b/cpp/src/arrow/types/primitive.cc @@ -26,13 +26,14 @@ namespace arrow { // ---------------------------------------------------------------------- // Primitive array base -PrimitiveArray::PrimitiveArray(const TypePtr& type, int32_t length, +PrimitiveArray::PrimitiveArray(const TypePtr& type, int32_t length, int value_size, const std::shared_ptr& data, int32_t null_count, - const std::shared_ptr& nulls) : - Array(type, length, null_count, nulls) { + const std::shared_ptr& null_bitmap) : + Array(type, length, null_count, null_bitmap) { data_ = data; raw_data_ = data == nullptr? 
nullptr : data_->data(); + value_size_ = value_size; } bool PrimitiveArray::EqualsExact(const PrimitiveArray& other) const { @@ -41,12 +42,26 @@ bool PrimitiveArray::EqualsExact(const PrimitiveArray& other) const { return false; } - bool equal_data = data_->Equals(*other.data_, length_); if (null_count_ > 0) { - return equal_data && - nulls_->Equals(*other.nulls_, util::ceil_byte(length_) / 8); + bool equal_bitmap = null_bitmap_->Equals(*other.null_bitmap_, + util::ceil_byte(length_) / 8); + if (!equal_bitmap) { + return false; + } + + const uint8_t* this_data = raw_data_; + const uint8_t* other_data = other.raw_data_; + + for (int i = 0; i < length_; ++i) { + if (!IsNull(i) && memcmp(this_data, other_data, value_size_)) { + return false; + } + this_data += value_size_; + other_data += value_size_; + } + return true; } else { - return equal_data; + return data_->Equals(*other.data_, length_); } } diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index e01027cf55c39..4eaff433229e0 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -37,10 +37,10 @@ class MemoryPool; // Base class for fixed-size logical types class PrimitiveArray : public Array { public: - PrimitiveArray(const TypePtr& type, int32_t length, + PrimitiveArray(const TypePtr& type, int32_t length, int value_size, const std::shared_ptr& data, int32_t null_count = 0, - const std::shared_ptr& nulls = nullptr); + const std::shared_ptr& null_bitmap = nullptr); virtual ~PrimitiveArray() {} const std::shared_ptr& data() const { return data_;} @@ -51,31 +51,38 @@ class PrimitiveArray : public Array { protected: std::shared_ptr data_; const uint8_t* raw_data_; + int value_size_; }; -#define NUMERIC_ARRAY_DECL(NAME, TypeClass, T) \ -class NAME : public PrimitiveArray { \ - public: \ - using value_type = T; \ - using PrimitiveArray::PrimitiveArray; \ - NAME(int32_t length, const std::shared_ptr& data, \ - int32_t null_count = 0, \ - const std::shared_ptr& nulls = nullptr) : \ - PrimitiveArray(std::make_shared(), length, data, \ - null_count, nulls) {} \ - \ - bool EqualsExact(const NAME& other) const { \ - return PrimitiveArray::EqualsExact( \ - *static_cast(&other)); \ - } \ - \ - const T* raw_data() const { \ - return reinterpret_cast(raw_data_); \ - } \ - \ - T Value(int i) const { \ - return raw_data()[i]; \ - } \ +#define NUMERIC_ARRAY_DECL(NAME, TypeClass, T) \ +class NAME : public PrimitiveArray { \ + public: \ + using value_type = T; \ + NAME(const TypePtr& type, int32_t length, \ + const std::shared_ptr& data, \ + int32_t null_count = 0, \ + const std::shared_ptr& null_bitmap = nullptr) : \ + PrimitiveArray(std::make_shared(), length, \ + sizeof(T), data, null_count, null_bitmap) {} \ + \ + NAME(int32_t length, const std::shared_ptr& data, \ + int32_t null_count = 0, \ + const std::shared_ptr& null_bitmap = nullptr) : \ + PrimitiveArray(std::make_shared(), length, \ + sizeof(T), data, null_count, null_bitmap) {} \ + \ + bool EqualsExact(const NAME& other) const { \ + return PrimitiveArray::EqualsExact( \ + *static_cast(&other)); \ + } \ + \ + const T* raw_data() const { \ + return reinterpret_cast(raw_data_); \ + } \ + \ + T Value(int i) const { \ + return raw_data()[i]; \ + } \ }; NUMERIC_ARRAY_DECL(UInt8Array, UInt8Type, uint8_t); @@ -137,25 +144,22 @@ class PrimitiveBuilder : public ArrayBuilder { } // Scalar append - Status Append(value_type val, bool is_null = false) { + Status Append(value_type val) { if (length_ == capacity_) { // If the capacity was not already 
a multiple of 2, do so here RETURN_NOT_OK(Resize(util::next_power2(capacity_ + 1))); } - if (is_null) { - ++null_count_; - util::set_bit(null_bits_, length_); - } + util::set_bit(null_bitmap_data_, length_); raw_buffer()[length_++] = val; return Status::OK(); } // Vector append // - // If passed, null_bytes is of equal length to values, and any nonzero byte + // If passed, valid_bytes is of equal length to values, and any zero byte // will be considered as a null for that slot Status Append(const value_type* values, int32_t length, - const uint8_t* null_bytes = nullptr) { + const uint8_t* valid_bytes = nullptr) { if (length_ + length > capacity_) { int32_t new_capacity = util::next_power2(length_ + length); RETURN_NOT_OK(Resize(new_capacity)); @@ -164,21 +168,26 @@ class PrimitiveBuilder : public ArrayBuilder { memcpy(raw_buffer() + length_, values, length * elsize_); } - if (null_bytes != nullptr) { - AppendNulls(null_bytes, length); + if (valid_bytes != nullptr) { + AppendNulls(valid_bytes, length); + } else { + for (int i = 0; i < length; ++i) { + util::set_bit(null_bitmap_data_, length_ + i); + } } length_ += length; return Status::OK(); } - // Write nulls as uint8_t* into pre-allocated memory - void AppendNulls(const uint8_t* null_bytes, int32_t length) { - // If null_bytes is all not null, then none of the values are null + // Write nulls as uint8_t* (0 value indicates null) into pre-allocated memory + void AppendNulls(const uint8_t* valid_bytes, int32_t length) { + // If valid_bytes is all not null, then none of the values are null for (int i = 0; i < length; ++i) { - if (static_cast(null_bytes[i])) { + if (valid_bytes[i] == 0) { ++null_count_; - util::set_bit(null_bits_, length_ + i); + } else { + util::set_bit(null_bitmap_data_, length_ + i); } } } @@ -189,15 +198,15 @@ class PrimitiveBuilder : public ArrayBuilder { RETURN_NOT_OK(Resize(util::next_power2(capacity_ + 1))); } ++null_count_; - util::set_bit(null_bits_, length_++); + ++length_; return Status::OK(); } std::shared_ptr Finish() override { std::shared_ptr result = std::make_shared( - type_, length_, values_, null_count_, nulls_); + type_, length_, values_, null_count_, null_bitmap_); - values_ = nulls_ = nullptr; + values_ = null_bitmap_ = nullptr; capacity_ = length_ = null_count_ = 0; return result; } diff --git a/cpp/src/arrow/types/string-test.cc b/cpp/src/arrow/types/string-test.cc index 7dc3d682cdc15..b329b4f0ef7e1 100644 --- a/cpp/src/arrow/types/string-test.cc +++ b/cpp/src/arrow/types/string-test.cc @@ -77,7 +77,7 @@ class TestStringContainer : public ::testing::Test { void SetUp() { chars_ = {'a', 'b', 'b', 'c', 'c', 'c'}; offsets_ = {0, 1, 1, 1, 3, 6}; - nulls_ = {0, 0, 1, 0, 0}; + valid_bytes_ = {1, 1, 0, 1, 1}; expected_ = {"a", "", "", "bb", "ccc"}; MakeArray(); @@ -92,23 +92,23 @@ class TestStringContainer : public ::testing::Test { offsets_buf_ = test::to_buffer(offsets_); - nulls_buf_ = test::bytes_to_null_buffer(nulls_.data(), nulls_.size()); - null_count_ = test::null_count(nulls_); + null_bitmap_ = test::bytes_to_null_buffer(valid_bytes_.data(), valid_bytes_.size()); + null_count_ = test::null_count(valid_bytes_); strings_ = std::make_shared(length_, offsets_buf_, values_, - null_count_, nulls_buf_); + null_count_, null_bitmap_); } protected: std::vector offsets_; std::vector chars_; - std::vector nulls_; + std::vector valid_bytes_; std::vector expected_; std::shared_ptr value_buf_; std::shared_ptr offsets_buf_; - std::shared_ptr nulls_buf_; + std::shared_ptr null_bitmap_; int null_count_; int length_; 
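The hunks above all enforce one convention: a `valid_bytes` array uses a zero byte to mark a null slot, and the null bitmap gets a set bit only for valid slots. A minimal standalone sketch of that convention (plain C++ for illustration, with `set_bit` re-implemented rather than taken from `util/bit-util.h`; `fill_bitmap` is a hypothetical helper, not an Arrow API):

```
#include <cstdint>
#include <vector>

// LSB numbering, as in util::set_bit: bit j lives at byte j / 8, position j % 8.
static void set_bit(uint8_t* bits, int i) { bits[i / 8] |= (1 << (i % 8)); }

// Mirrors the patched AppendNulls logic: a zero byte in valid_bytes counts as
// a null and leaves its bit unset; a nonzero byte sets the validity bit.
static int fill_bitmap(const std::vector<uint8_t>& valid_bytes,
                       std::vector<uint8_t>* bitmap) {
  int null_count = 0;
  for (size_t i = 0; i < valid_bytes.size(); ++i) {
    if (valid_bytes[i] == 0) {
      ++null_count;
    } else {
      set_bit(bitmap->data(), static_cast<int>(i));
    }
  }
  return null_count;
}

int main() {
  // Same data as TestStringContainer's valid_bytes_ above: slot 2 is null.
  std::vector<uint8_t> valid_bytes = {1, 1, 0, 1, 1};
  std::vector<uint8_t> bitmap((valid_bytes.size() + 7) / 8, 0);
  int null_count = fill_bitmap(valid_bytes, &bitmap);
  // Expect bitmap[0] == 0b00011011 (0x1B) and null_count == 1.
  return (null_count == 1 && bitmap[0] == 0x1B) ? 0 : 1;
}
```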
@@ -143,12 +143,12 @@ TEST_F(TestStringContainer, TestListFunctions) { TEST_F(TestStringContainer, TestDestructor) { auto arr = std::make_shared(length_, offsets_buf_, values_, - null_count_, nulls_buf_); + null_count_, null_bitmap_); } TEST_F(TestStringContainer, TestGetString) { for (size_t i = 0; i < expected_.size(); ++i) { - if (nulls_[i]) { + if (valid_bytes_[i] == 0) { ASSERT_TRUE(strings_->IsNull(i)); } else { ASSERT_EQ(expected_[i], strings_->GetString(i)); @@ -197,7 +197,7 @@ TEST_F(TestStringBuilder, TestScalarAppend) { Done(); ASSERT_EQ(reps * N, result_->length()); - ASSERT_EQ(reps * test::null_count(is_null), result_->null_count()); + ASSERT_EQ(reps, result_->null_count()); ASSERT_EQ(reps * 6, result_->values()->length()); int32_t length; diff --git a/cpp/src/arrow/types/string.h b/cpp/src/arrow/types/string.h index 2b3fba5ce0932..fda722ba6def2 100644 --- a/cpp/src/arrow/types/string.h +++ b/cpp/src/arrow/types/string.h @@ -68,8 +68,8 @@ class StringArray : public ListArray { const std::shared_ptr& offsets, const ArrayPtr& values, int32_t null_count = 0, - const std::shared_ptr& nulls = nullptr) : - ListArray(type, length, offsets, values, null_count, nulls) { + const std::shared_ptr& null_bitmap = nullptr) : + ListArray(type, length, offsets, values, null_count, null_bitmap) { // For convenience bytes_ = static_cast(values.get()); raw_bytes_ = bytes_->raw_data(); @@ -79,9 +79,9 @@ class StringArray : public ListArray { const std::shared_ptr& offsets, const ArrayPtr& values, int32_t null_count = 0, - const std::shared_ptr& nulls = nullptr) : + const std::shared_ptr& null_bitmap = nullptr) : StringArray(std::make_shared(), length, offsets, values, - null_count, nulls) {} + null_count, null_bitmap) {} // Compute the pointer t const uint8_t* GetValue(int i, int32_t* out_length) const { diff --git a/cpp/src/arrow/util/bit-util.h b/cpp/src/arrow/util/bit-util.h index 1d2d1d5f9d7e4..08222d5089474 100644 --- a/cpp/src/arrow/util/bit-util.h +++ b/cpp/src/arrow/util/bit-util.h @@ -40,8 +40,14 @@ static inline int64_t ceil_2bytes(int64_t size) { return (size + 15) & ~15; } +static constexpr uint8_t BITMASK[] = {1, 2, 4, 8, 16, 32, 64, 128}; + static inline bool get_bit(const uint8_t* bits, int i) { - return bits[i / 8] & (1 << (i % 8)); + return bits[i / 8] & BITMASK[i % 8]; +} + +static inline bool bit_not_set(const uint8_t* bits, int i) { + return (bits[i / 8] & BITMASK[i % 8]) == 0; } static inline void set_bit(uint8_t* bits, int i) { From c06b7654bccfe8c461869a6e5922668896c27c45 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 24 Mar 2016 19:19:22 -0700 Subject: [PATCH 0042/1644] ARROW-62: Clarify null bitmap interpretation, indicate bit-endianness, add null count, remove non-nullable physical distinction As the initial scribe for the Arrow format, I made a mistake in what the null bits mean (1 for not-null, 0 for null). I also addressed ARROW-56 (bit-numbering) here. Database systems are split on this subject. PostgreSQL for example does it this way: http://www.postgresql.org/docs/9.5/static/storage-page-layout.html > In this list of bits, a 1 bit indicates not-null, a 0 bit is a null. When the bitmap is not present, all columns are assumed not-null. Since the Drill implementation predates the Arrow project, I think it's safe to go with this. This patch also includes ARROW-76 which adds a "null count" to the memory layout indicating the actual number of nulls in an array. 
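A minimal sketch of how that null count relates to the validity bitmap, reusing the `BITMASK` and `bit_not_set` helpers the previous commit added to `util/bit-util.h` (re-implemented standalone here; `count_nulls` is a hypothetical helper, not Arrow code):

```
#include <cstdint>

static constexpr uint8_t BITMASK[] = {1, 2, 4, 8, 16, 32, 64, 128};

// As added in bit-util.h: true when validity bit i is unset, i.e. slot i is null.
static bool bit_not_set(const uint8_t* bits, int i) {
  return (bits[i / 8] & BITMASK[i % 8]) == 0;
}

// The null count is the number of unset bits among the first `length`
// positions of the validity bitmap.
static int count_nulls(const uint8_t* bitmap, int length) {
  int nulls = 0;
  for (int i = 0; i < length; ++i) {
    if (bit_not_set(bitmap, i)) ++nulls;
  }
  return nulls;
}
```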
This also strikes the "non-nullable" distinction from the memory layout as there is no semantic difference between arrays with null count 0 and a non-nullable array. Instead, users may choose to set `nullable=false` in the schema metadata and verify that Arrow memory conforms to the schema. Author: Wes McKinney Closes #34 from wesm/ARROW-62 and squashes the following commits: 8c92926 [Wes McKinney] Add to README about what the format documents are 1f6fe03 [Wes McKinney] Account for null count and non-nullable removal from ARROW-76 648fd47 [Wes McKinney] Indicate that bitmaps should be a multiple of 8 bytes 4333d82 [Wes McKinney] Use 'null bitmap' similar to PostgreSQL documentation dac77d4 [Wes McKinney] Revise format document language re: null bitmaps per feedback f7a3898 [Wes McKinney] Revise format to indicate LSB bit numbering and 0/1 null/not-null distinction --- format/Layout.md | 77 +++++++++++++++------- format/Message.fbs | 10 +-- format/README.md | 17 +++++ format/diagrams/layout-list-of-struct.png | Bin 60600 -> 54122 bytes 4 files changed, 74 insertions(+), 30 deletions(-) diff --git a/format/Layout.md b/format/Layout.md index c393163bf894b..2d46ece606ea7 100644 --- a/format/Layout.md +++ b/format/Layout.md @@ -42,7 +42,7 @@ Base requirements * Capable of representing fully-materialized and decoded / decompressed Parquet data * All leaf nodes (primitive value arrays) use contiguous memory regions -* Each relative type can be nullable or non-nullable +* Any relative type can have null slots * Arrays are immutable once created. Implementations can provide APIs to mutate an array, but applying mutations will require a new array data structure to be built. @@ -56,7 +56,7 @@ Base requirements * To describe relative types (physical value types and a preliminary set of nested types) sufficient for an unambiguous implementation * Memory layout and random access patterns for each relative type -* Null representation for nullable types +* Null value representation ## Non-goals (for this document) @@ -79,28 +79,55 @@ Base requirements Any array has a known and fixed length, stored as a 32-bit signed integer, so a maximum of 2^31 - 1 elements. We choose a signed int32 for a couple reasons: -* Enhance compatibility with Java and client languages which may have varying quality of support for unsigned integers. +* Enhance compatibility with Java and client languages which may have varying + quality of support for unsigned integers. * To encourage developers to compose smaller arrays (each of which contains contiguous memory in its leaf nodes) to create larger array structures possibly exceeding 2^31 - 1 elements, as opposed to allocating very large contiguous memory blocks. -## Nullable and non-nullable arrays +## Null count -Any relative type can be nullable or non-nullable. +The number of null value slots is a property of the physical array and +considered part of the data structure. The null count is stored as a 32-bit +signed integer, as it may be as large as the array length. -Nullable arrays have a contiguous memory buffer, known as the null bitmask, -whose length is large enough to have 1 bit for each array slot. Whether any -array slot is null is encoded in the respective bits of this bitmask, i.e.: +## Null bitmaps + +Any relative type can have null value slots, whether primitive or nested type.
+ +An array with nulls must have a contiguous memory buffer, known as the null (or +validity) bitmap, whose length is a multiple of 8 bytes (to avoid +word-alignment concerns) and large enough to have at least 1 bit for each array +slot. + +Whether any array slot is valid (non-null) is encoded in the respective bits of +this bitmap. A 1 (set bit) for index `j` indicates that the value is not null, +while a 0 (bit not set) indicates that it is null. Bitmaps are to be +initialized to be all unset at allocation time. ``` -is_null[j] -> bitmask[j / 8] & (1 << (j % 8)) +is_valid[j] -> bitmap[j / 8] & (1 << (j % 8)) ``` -Physically, non-nullable (NN) arrays do not have a null bitmask. +We use [least-significant bit (LSB) numbering][1] (also known as +bit-endianness). This means that within a group of 8 bits, we read +right-to-left: -For nested types, if the top-level nested type is nullable, it has its own -bitmask regardless of whether the child types are nullable. +``` +values = [0, 1, null, 2, null, 3] + +bitmap +j mod 8 7 6 5 4 3 2 1 0 + 0 0 1 0 1 0 1 1 +``` + +Arrays having a 0 null count may choose to not allocate the null +bitmap. Implementations may choose to always allocate one anyway as a matter of +convenience, but this should be noted when memory is being shared. + +Nested type arrays have their own null bitmap and null count regardless of +the null count and null bits of their child arrays. ## Primitive value arrays @@ -112,9 +139,8 @@ Internally, the array contains a contiguous memory buffer whose total size is equal to the slot width multiplied by the array length. For bit-packed types, the size is rounded up to the nearest byte. -The associated null bitmask (for nullable types) is contiguously allocated (as -described above) but does not need to be adjacent in memory to the values -buffer. +The associated null bitmap is contiguously allocated (as described above) but +does not need to be adjacent in memory to the values buffer. (diagram not to scale) @@ -180,22 +206,22 @@ For example, the struct (field names shown here as strings for illustration purposes) ``` -Struct [nullable] < - name: String (= List) [nullable], - age: Int32 [not-nullable] +Struct < + name: String (= List), + age: Int32 > ``` -has two child arrays, one List array (layout as above) and one -non-nullable 4-byte physical value array having Int32 (not-null) logical -type. Here is a diagram showing the full physical layout of this struct: +has two child arrays, one List array (layout as above) and one 4-byte +physical value array having Int32 logical type. Here is a diagram showing the +full physical layout of this struct: While a struct does not have physical storage for each of its semantic slots (i.e. each scalar C-like struct), an entire struct slot can be set to null via -the bitmask. Whether each of the child field arrays can have null values -depends on whether or not the respective relative type is nullable. +the null bitmap. Any of the child field arrays can have null values according +to their respective independent null bitmaps. ## Dense union type @@ -233,8 +259,7 @@ Here is a diagram of an example dense union: A sparse union has the same structure as a dense union, with the omission of the offsets array. In this case, the child arrays are each equal in length to -the length of the union. This is analogous to a large struct in which all -fields are nullable. +the length of the union. 
While a sparse union may use significantly more space compared with a dense union, it has some advantages that may be desirable in certain use cases: @@ -251,3 +276,5 @@ the correct value. ## References Drill docs https://drill.apache.org/docs/value-vectors/ + +[1]: https://en.wikipedia.org/wiki/Bit_numbering \ No newline at end of file diff --git a/format/Message.fbs b/format/Message.fbs index 3ffd20332087a..fc849eedf791a 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -129,8 +129,8 @@ struct FieldNode { length: int; /// The number of observed nulls. Fields with null_count == 0 may choose not - /// to write their physical null bitmap out as a materialized buffer, instead - /// setting the length of the null buffer to 0. + /// to write their physical validity bitmap out as a materialized buffer, + /// instead setting the length of the bitmap buffer to 0. null_count: int; } @@ -148,9 +148,9 @@ table RecordBatch { /// Buffers correspond to the pre-ordered flattened buffer tree /// /// The number of buffers appended to this list depends on the schema. For - /// example, most primitive arrays will have 2 buffers, 1 for the null bitmap - /// and 1 for the values. For struct arrays, there will only be a single - /// buffer for the null bitmap + /// example, most primitive arrays will have 2 buffers, 1 for the validity + /// bitmap and 1 for the values. For struct arrays, there will only be a + /// single buffer for the validity (nulls) bitmap buffers: [Buffer]; } diff --git a/format/README.md b/format/README.md index 1120e6282a50a..c84e00772c3d6 100644 --- a/format/README.md +++ b/format/README.md @@ -3,3 +3,20 @@ > **Work-in-progress specification documents**. These are discussion documents > created by the Arrow developers during late 2015 and in no way represents a > finalized specification. + +Currently, the Arrow specification consists of these pieces: + +- Physical memory layout specification (see Layout.md) +- Metadata serialized representation (see Message.fbs) + +The metadata currently uses Google's [flatbuffers library][1] for serializing a +couple related pieces of information: + +- Schemas for tables or record (row) batches. This contains the logical types, + field names, and other metadata. Schemas do not contain any information about + actual data. +- *Data headers* for record (row) batches. These must correspond to a known + schema, and enable a system to send and receive Arrow row batches in a form + that can be precisely disassembled or reconstructed. 
+ +[1]: http://github.com/google/flatbuffers \ No newline at end of file diff --git a/format/diagrams/layout-list-of-struct.png b/format/diagrams/layout-list-of-struct.png index 00d6c6fa441769a3c86044a52186d71c0bc23d54..fb6f2a27e07a766729d12ea33454db011ce6ae00 100644 GIT binary patch [binary image data omitted: layout-list-of-struct.png, Bin 60600 -> 54122 bytes]
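The bit-numbering example in Layout.md above (`values = [0, 1, null, 2, null, 3]`) can be checked in a few lines. A self-contained sketch of the `is_valid` rule (not Arrow code; the byte 0x2B is read off the example's bitmap row):

```
#include <cassert>
#include <cstdint>

// is_valid[j] -> bitmap[j / 8] & (1 << (j % 8)), per the Layout.md rule.
static bool is_valid(const uint8_t* bitmap, int j) {
  return (bitmap[j / 8] & (1 << (j % 8))) != 0;
}

int main() {
  // values = [0, 1, null, 2, null, 3]: bits 0, 1, 3, 5 set -> 0b00101011 = 0x2B.
  const uint8_t bitmap[] = {0x2B};
  assert(is_valid(bitmap, 0) && is_valid(bitmap, 1));
  assert(!is_valid(bitmap, 2));  // slot 2 is null
  assert(is_valid(bitmap, 3));
  assert(!is_valid(bitmap, 4));  // slot 4 is null
  assert(is_valid(bitmap, 5));
  return 0;
}
```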
z!>whVLVWM{1sm6gGj{81wiy|+u8v@WOr{_{!*}v!+`vrMoUQg>$Xr}Z=iVgkzH-c% zwc%3eDQ^Dyy1X`6D~e^|t;N3}SkR6d^|o>5S>)}#uN9xC+md- zNwqzd!S0ip+|3NiO)F2#KB6?M{F;Eoz80Q06)PiUqqwvz)FSGzL0uMC0DZLICCN-G z8cFgaZRt>kVbDpCqD_}2DTF0#0#Vqx$Kvunn^rmOTC3Q5DPdd{Zl5ntt*|Z)5+~ZHix8f*~_358~>J{A_OtENM+1YDNSm{yd=68qG z6g$tKB<#*BQ+>_QAcr636Dj?yAN8?JH5Jj@!GqIZ@m-M>3ll7=nEP{_1)3j8i<_e& z%IpYZP4=iA-n#&aUAwb%0!)a&|58vb`Ga3gBa2*LB~;22(tB{ z?j-nG;u@7E?w+4YC*_08 ztomh1`CmlmuPZ^4YOxhWt3mwi48igN7aqs^RUJ=u_z-MrO3BsY+hI`@BJ2yj(vF2TBdFHU0xV3B+ zwv!=Z5gMF#mAXx+4${~jY?-%EU7+Z@b(!HGCj;cM#tzmiy&Gi`?j^x>IW$l7bwjxbk-W*M&(_4+x z@m(u$d-ekQe+((|&wO{*xG{Cl0Z5 z&iBW^Dr~)Q)j5v(+OtsELN;2g1>k0+8fls5CY>bEzj{0x*cX+*P+G89d%3x1#qv^jr(4u147$D>1QazH(O2hnalo zt}8Jl{$WA|-U#+vMHqLcqY|vtAQyx)mO2NRz$jVy6J-?R2JO`EGOf zf(DP(9W?`ra9*Xv^x9Qp%)Ya}QeiPUF~OxF6WZL~u0e5RZv^t7_etg*?&ftubpC>L zY9l$a$^)h|Dffg`WJ(6+=Lcuj4ERb!7HN2~hHYvRE2pB^%8yiaTt4QDgrdXWhYY&i zp|zAVIS;{r8usM<=J*v1pA_5^=9D9NBx?ga!{K??Ih-9 zi!NB)%8lLpoy!lc4Z3jZUX!J~t|fMye@K_SD!%5HOzi83i)F{zH2s`=yp)y_gD;?F zLlU_Df2hk71K8asTxeb&F-n;)nAO@g9v@n#JHjJG4rMxuA_RwX2eRMuWnd$65H-uR z;ea(w!ps7c)u89M%Fy{9&pum%9T=(a%u|L_QY?~ncb}9f$30fX!x}Hiq6pk(GvT07 zTxg~J*2l)f8;8d+(}l~MIfS}fcqCf=LY1*&(};%9#GGO6ZYte45(|-1O@RW8m`auU z;FTk%{9sqJ%%^-~r-Y9zpJa{YKW53rF3oiuGWQ6vJdP>eGPQa&FzO3UMxr|iB=pU8 zvR1#uWT(H>K!_XWUY!dmbzEm3e}(5h{P1lYx(_Pvd$e2H9eH3PQYLZ z3U>_*&lxBnrKSC$ejJtI-?*q?b+etxWw@{%kFEKAvMHh&RAVtG5rTRHd& z>aV_JyF6!2x~%|?d(|`IdILCin*sAfDUtpu%uYvlU8dKg#@TeP{j?PmaszT>=#;@Y z2lOh^L=C;P%>-n+L6hi%qwDT_*4#ajTvL^3S?nWps#PU-%rIbDZ*MLKI|b6u#mPo^ z=Gtm0>nuKyOlq7_+%D`utTRfwII$|lU`lt&3^SNH;=|+CmFX2kuO@-ga z-0ROLh0rq3)RrjI_{XDKG3Od|hZK(>A?_^zA{En28vCr~=3z|)B7(=bDaDCWzTsW@ znlrpq$-NEjOz(U|lb&K@4~|mF4WuOA3J>wo4lT!^?^n)A*45fnJdg08)Amn|Xl8~wz?W}}vMcKY<8W8Oc#TML>w zQzu>vR|momW}*zb?&t9DHUqTsW$ASsT+516Xf-N|1Z-%-S>o8XAo1i9Ap#67$j)?J z&*2(`NUPU9EIBe)@aQzfGg2NKH+YLoEoILF8yrF^Ptd8WxlgC_4C~OVP^O1KoJ7a> z`a7jn4LPxa=^4rU=hy?L;uHpKmh-7hUL2WnUiB1O&rhvww&PlO$5JYMfPdXOa-h-n zKI|RrIU9W1x(MN-a5ke*vfoPRjYt;>dC)1b0w8VwNK66$e(N#w)mkF~4*~WtCv)g` zm!=X{zXTif1u9+(BemEBljo+cZsM}NnS=(}0)&u~p<)l^Zo0d=1P2#>xd^uRZ1q2p zta#49nx)))XY1~BWVLeNe?X9OLEcM&0nk&5ez`GcN=B2tzHlFPY7I9>+Jb71WcYrv zW9H$EOKI-tvp{brU zM+e`4`35nA$Y68k#(pE$xEEt|GA)z%*ilVn5LfkToJeMm%($icd(;bal@zdZrsZ+s ziK08QbA}|N*$hNvZiyE7i6n2BBTNPi_zX1BG8XiywB-R?v8CP1xX(Cbi`m2ysOw`GK|h?v=U%44$zndcberhs{i{n-1!uLWKWHfKfI zUy{vI3$t}7@GenZGb0zpjZ@a+J=s!NwVH zkcJW{5j+rRI5b(+iY;SBDjLvrpTaIz87spx+BE$&A775)))0%vEE;C{vw9vxX&CgE zn;IGr79c&EPTE&OS7G$ylCs)RcMQ>@+Vofyx(qh6*0c;RhWF(#%-y($hy2kHse*Vo zyK;-!rcR2y{d(7q<-*|7kfOpn)B9*pd033_MdnJ+x2N->J2gQG!3Jq45$hGEx>EM{ zM0iatP~*dOdoe|_27zq~Ty(9Zd?T~Dj9Gv3Xo$C;)k<7=XSrTa(sqAZ8yV7FL5gOC z&@FgQZq=L-#ISjoeO7NFB6q4YFhiMr`bv=f&|KO|g`)jv6eGHblPWE*Y#>1_i^0DY z6j2UgMzJ2;u2!rRk)<4v1l}{PcbCO*M?4JA=gCvV(K2z4zX7vFt01F~XCz+ZnL-rb z#Jh(PLYpiTfT8uIDl=Ii&A8A|JvrPnd0yx=6cU|^JVN-f|2Z#5zQ4c)HGC;0zx`|B z0zB3N$lh9|kp0Z*w4#Rn5jV_owDE(Dt<|s9T)fg1!raY^pH*H4(>}z>!eV?9^5ev5 z=tynt+mUcI@ryS4IqT_iJcObl*e{%6vt<@B&Tyxz= zf+MP&&r2syR0-~7v~Gx)*#uzJ#8>D};*7=hI;gU~q2SWE$xtzLxUzOvP#MLWlIP zs@MGV?h=^SIYqr#-n0cg_sry%U1fOR&8T_Jl`liN}2SOqh+Y9V%k zu{rCfn7Z%6r|E?aY~Qdv6y&J#%!BS{1NuK@Q)x##)`fAqAW8CNjiMRvXM?#_2i{G! 
zTtS@!+z7F-m1xKkeKpQQB3@j1O;PFj=S3H$!^u6Up{Ihk%>8T5N|m;wktNlWyRFA# zX>`iS+EiEmXvJxkiNjXh$c#C%cCClv98;1Z6I{cv(jy`6rc{LzuV0Y*3wM{|_;zVq zDZ&fzVZ+qTr+hv$th4?SI?VQ%A;`-qST~~7{9!l!wdrR`wp37=!K0&#S?i_UP=I-g z`yk!gmXAo#Pvfb!!sz;ju#s{XQjblaE58^l<$cJKOdE~%1gMD7x#>+3{&;-MuxI+A z2=8M^W6N>nacSWkp+JZ_f&*jkZ5Ai_AeA1y#xv^vcE{DeoJwyYk47eu8aaHxB0DSez4ypQLT7GxO(U-0&= zvRS3&7U(py9PqkS{N*Q-3^NVU`0PmSs?4g)3#Os>y%*Cw2Kqwd)RR8$Os7@mNUtk| z9&LzFP+N;?6W+T!f`E`kx7Z4h%-t4wDuy@XQZ9pVdGOKs4%c#recmfB%qc> zt86U0Bj@nQl)FNHOAXdcbL+*`TD`USXrI=xyhOa<^(_2~_jbdgqx520at_H&wt`tiiwx757Dx0Hk1}#!rG_>hS5ibFWoBD zPc=L?l)d}4{P;CuNHWC)GD7t>-D+)(PH2k|6K;*jfA!5+bOrkvFN1J9Q4FrY8Y+ec z*F|8_wbEx_ck@EO{`lze9aXL+Sx_3W+JZVh9n!;Yb2~k?w=46XF6fHM|K>LHU>P3~ zl|IP+t@V264KbTg2dSf_b$Ryd?;kw?zuQmdZ;D2uMCXHTaX-V<#i-?1^+__14i;&w z3cNj)k+23Vwa?bkimC$@ZP#F=o zxS`D5M*F{h{FCaDCO&m8~faxe{NPD{Cq*)OEAz2pEbD>ZqX5tKf@x^Gk;W3mJzOH_EXf!_te{?6A{*dxNR&JSjG-Wye?WUHj$tkBv*N3!n-wXP*1)bCwf8xe(wt zsM3uSv@4z6LAf#8=VJR2|Je#uO#*~64DU~-=zwshc9@aJabuwU_Ie+4!7!2ml*IQy zjrIF>Y3^2>>G>tJLJ!KflLJ1NyAl;kl4iG4M|fsgoy(v=?6%V3{~!cc08|T28y;*+ z5P59u)Q!iC0G&sA`W8st0D7bbl@T7fQNV{$BB%Yi_D-l^1wQ-;FPoVa&wc$K=vTUf zp1keV0C#&}$g(LOZl!^WG?C|*eC=~Ysx-fZ1=|(|G!3+fUQMW`2Axff4&@Qoil3o} zBUIC{1eg?uuWOM^B?mwppju0j9<&fW1GGFuCh98T&O5Vj^lFgSuYkGM1)k5g-y}ys zl?@POtSReG^tYdUZ?BPhc`5%`MQoXZyy>%$NbE6+$c)Uh>wRC}lSzr``fJSxjoRMbPGmUU`lVr03{3%92vKy0TBxj zAz(XJdmCyJcEwl;hqm0@P*PLuQM_JpMU>nh#H{$8tca{y}V2q4B7% z(H&}|G*Ehj7M;HV_7h)S&h@$Y=&HLuu~MvP0PTkTl*zp!aemnEzzdA!a07W&E&fqKB;!zyn1>RYV(Ms=bFICSejlD$D^P3Inv4OP z{o(-d#n{%Y=_UU|N|8L%XeJvDtuXxX1zW1v;kQyZfX0y^w6+DP=UUHe`+iO}E zP719rFJnOjoNR-M~(J($XAKh+Y@#z5IDA)!kRf@F`KkS>N>NNDAOB-9+v>x`+ zRo0_FV*+KbP-34nIh&@{kSJ9W!s&%$U9Lk2pyuIh83d$Iw# zBnbis(O~SqtgHSJa5K|ydc)Aj50Hy9U;bFaVKuF32jxHbW0Lc;3hZ$r#{tq_8P|v5 zZ-{1GPd?t>tj zi6SsVIW-Dmwq7l(3dC`kp`T_Xc@DA8#GcG#$&`b73YS;WqzJJpacpe|-vZ7Is0n{0xG3{cj^Vr%?*G6OHuCW7BY|@+n1H9aBuw6+;vR4R)it!@gi_Gv2nHOi3%2 z6nWGQE9*2YS9Ur>G;2D5t|^qSE-lwFhWqO)uo`nWD}vt3rvED7{u^=p8z?~>yeNiZ z+JaadUHAOqF$CdEhD4or>7z{PrsRnO#LW7FgRsUaC)*5aT&dvc8(%p#;zmv9cus)) zO(dmGw|;{o|Kaw4_#aTiKLo0qhBGP_55vMNMI;PTwkj^TJr;mU)HY$U#;9gh*B&J= z9wfelW)Z>I^S2+yeNI{7sMmXJvyk;Z+yuG>{T>y+umQkPqUXSI-n5lr^(ACU7F0SK zw!U+MMFo|t)yp5`ROs0@^e^I1?vxS9W3YvpWbG^|8PttZqz%)2CCUxH`=_oK{Z8OL zO3cycBwU5S%lo|3Cbk~)7zVP2wu{%=>5Ru{S&Ca=)|oWYKE|qPJBSS`@5T+H(BZmk z_ziIlM^j_>)b1mfj5(T&r2Y|Q=b?~k8z;bq7ERJw+2F2}D*>s(E@&+_rPbqU5TV(( z(u|aTfDl2IWK|PiH^xmTMopWudzcj#FgyUMtl!X|gJ+}9Iv(XD4w*=LlpXWwVE>}sy1 z#?(DQ7_D118;mdxnYac`7{dBu%g;K0?JEGfZOQ&I^Z_1vuPabSVZzY7FxHa8?(ADN z2qjE8lKnkC3&Pq+3$iNGF+N<7&H(`xU>cTi5SMI#nl#7v<{a^2C?t`7zZKXeUe5<* z`E0lsqek2Z9e07xFdRvn)%Ds1UhpDP@5*qHGS#>x2`T{XLS}-7C6-2Aw7Cmff)%3ty=9VY@T zuq;A}X&!_5Z}AGNY5_Db#6VPSZX=zFh3@P2eIBPV9mJ2 zxG;&(Z4u3jgk#*=XVhYC`V){yb%#9EG6=>ZOQKs)GyEzlwe}+O^n4_I0Tg-m-p=Yq zK|OML^S28+!|(4wsA6nd2>|23w}4-07)&jktq}!+9z^+$;I=a364u&K!&2KkN$sj`nVI1x4y}0_ye`<$K&-?lpf-dqmW_*-)DWl3*yF zq|Y5_Zew*aBl}}e*fIn~;7U~s#r0BhtR8t$`IJgN=g^>e76HU~c^uE)&Rdoz)9e3@ z&hLS2kptEA5hptcXy`#HLW=-r1a$*Vs9oJ%GVHy3kF5n!c=D=wq)(H6DklD0vZEw< z(l)HGPSrEzc@N?8%^|f3Xfpig8=@JMw!7cnw9B;?fLKPGGJmuB$d~I&UZAISm@EAs zjXt>t`_Q56h5c!W^inoU48b!;;F=7~;teTEp=!KMEUfwLQ~E!ByE7RGd(^t$X+K{9 zu%_z@F=1-IIs&LvDrsKB_lmxz2Yru5iBd4~_lzU28w4btt9$krzq>}@)-uHMNCT$x zght=-HvIV>$^VCYK*~g4`~S-IUf&+t4>L4g?evcW)63pcc*>g1Km~wagn1C3N`x8R zhI3;flI+{_vhTBc#G;Xm@60n}r&s^z+e5??yKa_eMV1KOAb|mi4?U&8*-D^F8`y-T zCF6Z{*guI7z6L>hw#n_e;(wp9{mPNn7C;XmB9jQl_8g`;eF{Gb&6Tw7?Qs2`PEPQH zq;A)V;i5~D$orrZs(?-(0Z9C%i2Z4B`5M9Race>ZT+0X7mg*Lnx(#UOE(&arsOa)z zg);pyQ3p3BD&xp8eug3KZsppUpUU7JJis8Qu)2*HKaHsh11GRN1LIFaF8U5DNF+g 
zz7K+UJ9wlW5XmptwP(NL1$ynzK%89%Jg@B2K?=wz@S>^(y$Q$P^lm+8G|WbGB2J+C zP@bYj0KC)9a+=li6rcTAaw8_JK{*DXEZg?oHny(&su;G0@hJN&0Z2`qba;V$7;5q9 zRsd}Qo&Gt{s)7YuL!iX+JO|KC!2RJ^;6wBSA4(Qk1eh0uQ+b}PhJtJ70Zv~B$T}A! zLgvQ6NoLzLKnl3efsmiC#y;2g#zP$ToesDiID?qF=60Yu9|f4&7lZ)jC~(nx-`ks> zGYx%z_$D}LPKIl=vHaQ_G#6;MH=)|ubpRFm!2SeYkXo`#gS^f)MZ)PRJ6F*A26l*rQ@Q_!z?WMeFJt1K&h3>(8j zIvFV3F-AGSv}4_mlJo)kXeSq^_sc*L+#3;J_fb_#Y>7=46aShF+3AH~Xnl!&1hx22gPOs`>+@>#x11MZdne<2AI;(8Q~HnNT-ZkQfQ}5(VwVQka{O@+zaBYA>0BkAp z%dpm-`S30eASh=sY6HfUw!*PsT?YixRj-=9?$qaey}ba*Ajt1Y3aURCSZ7(3q|dEv z0Qu%Mz|72=jP(=3)~>=n&5QtIWK};j955-t6oCslB?9QMY6eWYFR+2z4QhdWp%oB| zdD31$XBXt_&Y^^^evi%iL0AKk8QzxI=WG?EdQn2jkKLeX9-7Z-+jo9g0n>U2071ah zjVqw0FJ^y`{6q$h`89NBg8<0n5mH)!4U1ugG=eS}&289&h7=)hNmJc>@p=M&J|_#j zfW<}e>77@ATzR^20p9KgOd7LPqqZRZzX1a5%@=yuXLsh8y8S*bX})aoepO+=ezmmk z^P?9>uV}E=fXeYnn1Q0q`QXWHAhw)%@XP%$@)*fNGXFR&i{fO%4z9NawNsF+VAjoC z=P4=08-Qb;r^ic97r|83S=*)~t;|OBL`PipzDi@qT`P!s9aRUyxAa@dJUQ~)G^3b7 zji)SgePw_JWPD+u65l+={cXY;{&KY{(G><_i(g~*kP$Q-ju9+_Jjw+6a;%^b%5@ksRM!e8D z0V54l?qpg0mJ(wGwV5}D3t0dl)xrP1_5hX)_X}XnH@7d?5Ednah?*xNC6siozp82m zTUERPZ$_KayZ-|uq)}JUtv|i6>>F3m4I4J>Y7Pf0o$-OgRxHfuiot6;z&vtk+kSlV zk`zrDHk6V}hlr|YA}$5p#9q})eh)xq6ntZfoY>bfOs2%&OCqJ1;L>byn*jKS3xGFA z+7t%fZsq^M^{wEKl`P(!yC*GPB9xow(FV3LRoGh&A-sb;>2w*gpr+4KC)aR}h{tIB zc#)cuqKzDJ&KWY2FuXR0_xPSvlZiZO0vOs2Yq{s^-#xn!sZnh?H$?;T-ii{D$%f5? z%`xd=og39>tYdPw>(pwLglpWxGm!xQQ9p}hXi$gSmNokeYKwuyhb-~lhO?Dh^t}@` zHw=MG$T)3%rv@B#ODxIMy@&WJ`Nr&}2_1AA?$F#koj7GRp>#m>VJ$K;N@Ea?VJ;Sb zMGgZt)$@rv_Xz}zNoESspM*WS$bBUMHb(VC&{c)iqZFFgkRiE3_Z-Abh~jpkavKlC z8A>_*9vVsfMFmsmg9_rq9cQ}4PbViD=DyVmd#T^7?EvaUAz%HQ{3^Ee&T?_8ZibDt zDp0xufC_af#+-^DsL-sWE!Ux;H(Ld4jww?=4v#3|WeN2G7sx-`OafjP`Oq;Eqca`* zZMX~PDQ5992p%ET?|Kt{N#B09XGK+sXDHBccPzkK!5zzsr%GZ#SYUY!q_EkC?6jYZdkAn6pLO4w}D zdV>(w6Cfizj}G^wMZhe$xkx!tmIIGY?L6*yi0(}_&l9hAD;d&Bn$9~KX4%Ql6FQDl zUOumhOfdbv1-xY>x7cD8V8oi{O{FtZ{zIlvuZ4bxAq!C~mU8NbfS#WISr`D%CB{B? zSs}3X+E!_t03Y~_)!z1FD$6KVsee13PJ>Jo7tPCrK-f4MBn{3t0f(9FNXF9rfEBgi z6zn)cRHhEsz9Z9;&oM|HgznQ!(N%MrD9=BLb*k_?ct=M7Mwio$Q34ojTsYl~-br3S zkXZZ}Sc8afR#(IqM^S3=01|A@J3+a}oIxY6^rQxW}wvrG=_M-9vg8N^UuJcYJLEyQxZT zRKi#u!vhVu26_;@;dj$2q)>H!azj+421f;L^DNhS3lo^!oyt zzPbY-t8T7?Qwlhg6mfcMXmU^xM#uw2oLmbZ{LxiYABu(TaC@KGi&8BF8LD2d6Wced zA3Lzs5O}cIuE#d@iaX@*pfw_37I4^}0Q`gtUV;Pxn&b{10mjnG_1OUU6G^CAfEyp` zT(O&w2LIbC<&N_c3iP}wFp}Zfh!dI7^|G0<*Zsau%|pcz+B1OH;$VFIIMWp7_N-rE z&brQc<0A;0hagH+mGlEBmaEVEif1NUT=`XJHJz&Rdo2?%Qq;~=+~t*?WMNY>{)^{eGPV0C>OND?^l)YHh&N zgDw2bETIT-k!5x2eSfa4r@+!}9KN%Z)im}4Vj@R8(Acmg2GD^&=B?gPidqiYDFv7d zrd={$m~irU3QIB*J*b>}P)^Vdw!Aum@Ci+y*tq&|N8h(lWtbNL40^8&Pkfl9n@b}? 
z>IQHaA{uhN4Cb^>0hiR}u_Z?JRLP82`%_JX8Kn zw9S|b-v5Sgp?rf&iFVI3p&#{vhwMH5=9LmzSdGXs|d#TSrd?q~2A{qFm~ z;}}Q*SBPlJ6aR;YqL}aiDO0xFXm-ti1N0Bk=AQx>zy`Rw>_3?!^e?<%jg7Tzjj8|H zAVF#HUabI#+;8R#{pd^tT*GG_C$m43Zpl1K@Lq;aYwF(@FnBX}EJ)3ctD0H=^rRFM z=-|C6`VAO=bt_u}41>q)BHrJFhX~#)&Abr)SGUHd922y&9?B(ArtSV& z^NJJv47U8ceqwj)`>UP*XZZgy;r}!I|Ev#S{U>%(Owh&nL0CVUeFhgSc?R~%|2aV- zLq;Evx9xyv%vcN`{ox6F4N|F4PpZi`dIGO!A(DJ|tE=EHnElUuzON#3v;VSkPxd=6 z1mCoI>!7reYRV&ReT02U8z_!Cfg*T{%${>!*26sW=K$%d)?A3ybTSr>x`Njq7n?17y943I_a zuh|#=_9&6S6J$EpI6;akO+Nr6o6a7&~EC8qjAHMgW0KYvs z1|GCj>B1TD8TSRY`(cKPc_gX%4IU&5oJmy$G}Ml3NIO+`+%fOk)%agV`%ZYklbnCCkDw_@16?P6c`^x-=o4^U!6GOTL22$`#+Hf~-8IC&=J%gr z&5MOrTW*BOKAIkgLQp*;fDQ~z9U(dZaq_Q5LAisdaRHD%L5d*0 zxn@#&`W(z2VE)V{gQb9YY_rFT^1pDXBG0RO=Z4$aK2Epx(hu0rx z*FXcW8r!dTjiDI zo^67is5RinH`QH8y$M*3>io}npfp(^sW)1Ps^*;Qg3W!9X4THTP}9LW2ggPj48qcj zn5bkG7!MkxR+#+7M+3|6-n(;(|7^E@TFFT?AbLq*K*)oPE7xnUd4QBcUat8)(9oMp z2N}Kw1N27VWI_ZmF?--GZot%4Etp<{0}5sUQwC;+WdUXF7^pQ2&!2eJ7hNUk-^ZCU z0#pG=xl^Ckg{x0V$t8{C{8bQ^%!37OfahA`Vb|~Q0~_~)xuNwCz_Bg0``!XhS{x(! z2WDqZIGW>!SBjX>(H97ZxliQa)>2qzv^rEF?FYLp(8BGIUb2lfHKo5ujjFab!# zy3zv-4M24ZVY&rdeeP3wWOwtepJf?@;7w%F2xNvy#dlkdTetnqvnPXfPj|fC2NChT zQt&sgA80Kr$v{dpW&aMyNv{u7p8McHg=J7?PN)V1`Ub%JfWvRh1ume6_%(rj*tImY zETxdCWnBaRB4Ownl+*#e-`))}goXv%;posaaX=6;r0*;x@Fb&9=eQg8u=YZki6o0Z z>JmU@l?wL&{{v;g7B&}HP_}L7rWZgzan}_vmPH(c>45rD=fLy#mg8E5tBA)6Ir`Cl z%V%8_lyVI=NvD6HdSQ;iz-v8MZTKCO%xIq?Dr)gQex^bU%u}+^VmO&a?=M&r4`9JS zQSmc%_U%W+^B^b2<}p=EfTqIUUAIulN-r$UXW%o!PXOJP{i3q01A2ggb@Dbddee=# zVH6AP5~2eK))@O*K9FO`c4B8R73GR z1t#HbpxSU?4u}q4Nz!B*|$>1mpyrd`Jkfz{><8)p31J}4ZO)s7~AH} zlvgf*ws+HLQo~%ipMnQ1O1(F%M^XBBG$v5sw7VgRYO5d#3X^Eew0##o@isXWee&h= zl>F?0H;V(-b)zZeH;K7d6ppk4LVXZN#b8Gp&p9mwRIGG9cfa0 zcFoRUL52_AkNfvCU0OE{din!BkfI4;m%%BdhaPdihBh;gJoGZfENQk z<3MxAX2C=PG(w-Nd;H(>gX^=+66`oL<-_@=e?ewm8Hf_Hy$fRN^h;6b$e@%5)g3gS zh+@d9R0&3~Tmm1RN+)sy_|_~Q9*A*bH1M^rMt5eio1YGPYknwiP9QoZg9V296;uX? z{QzsMW$h3|Vv3_dMoP`}O_wdT1SR=lggG%if@6Rf=i4`Jvs|A_cVI?Z=XT5XN5}Af zm1Gni0uKVVfWDwg=eKvg9@e{2a^f6J5`f*o3?2yM2t?2 zaNTlar9Ooxd03uO8k@Ba2Th-k=v3w&?=XFxzUMrs_!=Ip8ZSgrP1WXGp4s48@_;i* zenv~Qi7rStjx&H+T<05!#6VmHDt$h^SZee!A(dR9K5cVE6XL5`Be-z|h*(i!)X_CN z-%G8*WOpcu3Qk)+$rCD`Q;BC*Rv?iBzjfOQciTh6g<}T z+@=(71N7>VTYfP1+RR#XA7!fc5n#3Dfn=#yN(a=FTfdX!vPZvSGVI8cV**4N7_C4Y zJk}9rkwurq?(&LSq+VN4*x`t!L|{crn`RK@XHmO3jP+e9ea=jz8-se0?8fMg49+zP zgjRpQUjp}$*+@;`cswmUG)|@d4q7@(>@JTpgxdnASY&AKF!ylAaMF($ww-U0?>%|V z^Lg3_U?V8NQ1J#8{If{p4iJv;b_u{iNjfr5%mNKk); z0J!8L$Z5kyTT|9rHzyzLZgnr3-2Q{_V2~(tv5Fpi2c(xdfr!6*V4gX2Jb^85!HX1&%R;s(5zt>@9i~a- zsdVR^gA9IE3h$yYhfD^cg<(GhLiW2EKwXHyi))pM4wJ8(47=%#w;X81pANrvFLU%0^wwP9T9cE883{KG|pSnA=No38W&tU zx~#`lSCJA<2E`|GU_Lt@VLjhUSrkCb4wP^nyyt+pC8p#`LGvDHFCA*0&(NBN+;#n6 zQt4hJIGB|npqeD={(rIemSI_DZM?UX($Xc;H{IRc-Q5iW(j_h3b<-u?NQjgI(%ncZ zpp+sif(QccbvZN7%shMV5BtM=yvKXIdp>Z?Tyt}C)mrOZ=lT2p&lF(3;Y7EJ+`=IqQ)J>8aYTNH+=lXP+BWbpp42+iHC);wYc6*RR`q&I_JIkzoK?T~l6 zc-uve8&|MnHpR6koX$An8SU)xmD~Yo%ZySJzimY}B~JMZPz148KN8H?*QDN*2bnif z^>Gq*L%$bt94a{;@rK*>t*iih{sP=N=aOoDkNY35+!kNDF8VFZ)x~y0P z1^!lpu^QzFz_(-`fOye zI@!-L(T~{L4ecm5>%x#3J|mFQxzca|_|~lUiyQmq7^+rcZVft`{;R6qszfhiBo&QkG`FXm4^AHD0^a(WeqWGsA@K&m?me~}(UJ>}nW9hk{1m;BvcKVn6 zRoe|w!aI=l2#(}&fpG~dd()~jQ!-=Y>(whohJMmzEnbEegnT-;W&*{T5fJjIlO$-$ zc6eAzCB^ud8_HG%iMZ2GwrhNOKSX50I(nW-pV2f3S#*Qa)X&0QHE zIr>Ews%TOTux@R-nN%Rk=kDHIjabi5MXt6yl&3!|9pT78sKDVfQ? 
zKG_vw`=TcFy!)*koPwD6e2+w8@8Ad&l^-;WN&rC3^L3%cJHeqLWxcK|X)CmS2O*S} z9h5t_e|a|9Yq$PR-t%TjUIp>BtC>6yK*Z`CN1q&o781dC-_>cti|ZWPQFv;Qy$U@GKFzastB~&&41%^=yWyg$PrT!t}(^@+^Iu8 zOeXh%0%SxES*`hwhwEYdg=)(z+u!(C@4dj%86pU+K3lds%nYgsp2og9B#wOYRg@6@+bSXG83BZP_?Wb@sfs_ zZ~B)ITIRSdeiB|mY-vY530mtX1>q=-cML3#+7V}5niUsNRBSU$W#JwV%Vs@c+eGYi zJx4Z&_Q$L-hxfwO1Pj$7T7S4jaR2yt+pA|>hj4bN%kj-&`J@Ytn>tZ;(L*SMUO`%8gbzhYWGYl4L1gQd4bSd{N1NKAI6N}b$0QJB+(8Gi(&G_yH41OGgdi5DAp&?anWt*nfm}5?|YS^`+CL%F-+X~8hIn$Z9u`U zIQ7l#*afeEq%vNkZFS!U8S;gZO_mxtVQQZ=$|@aFZEbw_zKYY>Nz1}n$VCJ~yT@x% z6+>$MIkQ-Znq5h5<0iJ)k+BPU1SO5R0dRXYST5o!7K^p;2VAESi zcd}M+D>d#D#*0F>g88v*3KQ>f)@(;$t8p-_nHI2;m9W`d8TKi_6aA@jztU3}&2{-} zS4OT=!|4zCfvS~5AUiGw=G3eBmMeI+TL(LXS`jkC@%ZYiv^~VA0{VCp;Ryj2w-C%) zCZ2pMH%(3@nNKGe@CNm~=XA(E35^I7&$~Nx`3#1BJ3O4>$G&a6TyKu@r3b4;{g%tV z%F&8&p`^^q_eQZ!)OJcTl*TN9gCs=_a>8iSZS*`QR6m0f)0iLCPVi1QGrh4_cc)iM zZJm?5GK@B^Xcd?m?uoB_%`2*Fbl^&JT&ao&#E~^qM|n#JxVxYV@jY%)$wo1P=K-&Z zgAPri&kr62Gi2@hMYfklih8Ik_Y4gsa=9m9WH3P|i2zNTg@Gx)?&=oP&t~ZAwxS7a zHdcsT7-xFio*M1f-Ud|4trzeGYC(^FHUn0o&=&p<({LB@iIlIt$#s2i(<53hIS63D zOVXl>Z;V9u>w&tf{ZBrK8ZV(n-r29CR$EnE$8#LjyJ?adNI|BXMMM6=j47Z}G}?>; z6L`&gp2|p0Bly)x^U zs}|k3Js3-)jZMkRpS;lF+)BYm5OOvkse?r+F8V!4`hnDEuw4(f(+8GR>uCq+8Xlau zT-7;m?XYi~>o5{w%RqBhjY+V*iV}CBK81PhD88L5;YAT>DV6s zXZAu81FA{fT4kr3B|_#TiDtrE%9cLKBqvSze57jViT}#*hd~QbF5&bJO)}RVri7Mp zeaPWUi>xEmfrmB;mCl(FgE&BVWF-BXf6UdYHt9Nr>dFYtWG87%Id#K0-F=>J-CiW6ta(LmXg*b0F@{w{h$UVNrfT5cS# zi5*se zp1?Clm=hPPBR7}{GQAz;TjiYMeW4c<7XIcVqoffTLrzEL6E#Qt469rO3%D&&K|8T~ z-@V6(_FTyIYiwG5y5W_1$qxbRd^#Pg za|VI}|`Hpd;{VL*b}9=PF{=H8?9gocBD zevoA`UI$k_*Zbg#ed4uF7?wYvPDdM}6?*Buv05)G~#$U4Z^@ zW~Dq+W)J&$x?Z3t{@KvUreXhtwv06F$vDj{iwGKi?Q@kGy<-8F&K^=BkoF94=#N>} z^y^6XWoo9v4@Tf0Vw~pck0A5R+^uYuS++?_IG|6x$EC*d_%ZO|}JQZ~{ajwahkQfr?3#qy?Fs|*!B$#Bmx{`*{v3$OH1@k$&)v+VmN=S#WF79AFraHDQP%f4F? 
zE(58r(UCK)=RBi`=Idfo3t=iiy~#oKG}XYZ)^;;9mz`XGGu^rMZv`qfNN2DxT_jE}WPnm4mbt4+hma$g6r9 z6kMy9(n=qdEXP0IE>v;UdV?KV=Gm;uCj|c-O5o3|srF!qJ9Zo7qKUSX=J&9v5f=b2 z8-b|cIdUvD#QYbvW2^+KtIBKNVc|v8eewz0bK?s^LZAD2Tj&HIc5L6K*w6`?LOW8| z%6^t-*y#CbOHoO7YBaI0wiD5xFuI4X8WTZXNCd!0AZfi9uO~Q|HS|4^z%t(`NPj^; zwFbRejMf@UUy7U?d`;YYG$pWFPXE&WQ5YvU|-=lx%|%Kr#1@e%wtx3qD;f4v5l$+ z25r0oXz6FOGIw|Nt?sFsj(gpoy8%i>*qeXgPzdkmc@6s_J8~kr!w%na&dUOXg3CDEe#Z zTb$JEg$UqoAMS}vc6J?q=YjD+x_)DmOUOz8fHT1bQ^W+g?^$f5hwmwJH$RILz6%hR zJF}(5w`JH(lN{apLZ#-3m=^Yt^`2aLSK8=BkTv-P;lW9&WXU0JL4`N+7uj%2R=Hf; zU44(T>9z0MvUMZR2J*9EnGhH9bc*Os?&dOz!5eD?&sY0d%MsMW=469s1V8VY6yc^!Ijx3W(fLBdi>-X1oAmQ1fK?$$*ySQE6-yLC>@M9?yfv;e31WwEz_kWo%uXvMd@t9U6vh&$ZII zM8}roV~`vmpo-0+NVrWM4}E_Gc!|c7&~uV%$S84<=hjE% zlDqhhqvse23Zi=WiEY)p%JVAp3ciW4`!|8e3}%+niJb($Xxg4$+4@<4M3rDIfwK=N z)HHe$b;_r9;1*X3-u6tjjSV*?#D0KZz>fE^3?Br#WZ zk4NCkFcQ{8ltFg4M9E02Qop&fSu(`0>q~hrk1DEAdEYaC)Xu*t6QB-%sT6jr=m`-g z1HFpPZjCl~Jv&JdF!IhDkE#=z2$U-&+KouIt{-@v#zwkIhvTI-FUOSTntQFagF`&X z?2rb+YfDA{V(Zr3Ly2{i0n1JLQi0spo`eqsZQtPsdE<`p7JXuQO<-P#JB#|zaKp~a z!7GM0$|$iSZuKnElzwNHO4%sZFmYJ671)g9sMj8~r(-^uV?EA}%%CRnP|P;9P!p5= z`J~Omm+hD}9?Ko1gk>ceXx3v)M{D}tlG1$K-DO5o#PYAIBY_XgW_aTDj(E|hkEm#t+wf~ z+Wgss0ce^}-}N>I%H!W<@!{8=xmC{+F0NXV65Q**%-}*#WDvpya*0J7&e{b|0jCb3 z)9EHmG}RC>)yTJPs1kj7wCwr?B(hy)K*)k{Acf z{2cez+WKvzzV`|Db5I*$CdneH2$ntymHejhHdVVfRm53vYoQr)!A5Mi`D#;_UNgL{ zkB{*^qkd?^92}rxWq36HfsFm zHDay&Co2EC%{xc$5f>_@43dSX?)+V@>wl}ST&2LeK$P7t1-?rN(O=?(8b3wndlyxB zo$K@Vx08G;^f9uc)^r`3z~X%lSq(N-JG65$!B5&S@Eb1r4IlX69auO7*Fm*p5R82% zq^oAdU^ViM6)nSPzg5F#s>u^=;FUk9L4(QBZSm9tH3>&Ct{~5(we*iN0Q75KfC=L7 zmBiLSEK~`*ru#8?rDfN9G4=5B76#4Ba#y*y4KI)yC%LY;YpP%MXW*AADTH_%i&Pn|Q&b7l=a;4az`7o< zPeCKHfRmBMJ@}dJm{z;9(c#Yu)D-RPm$Whb1`cDG7ugVho#*Wjf41^>CGKj5o3D&YP;YAU4ti|_zf3;mzjWl&*q z4Gtjx|HLnpgnqu?_4^WL6Hx1$hsi*EeISSATOADV1G1EmB|U?)rBxx0-!O!uz@JFW z@Z!Ib7=&Jn;KQBar6Uy{PxcqtVZR=nzj?4f3IoC*a36~w%=M&!+3>-(%NvvyJLszd z5E}w!AC89Jy?Bx>)ZWBx*_E~#7`;hSAou5^I>U9xatOYv|F;es{H7h)FHVGUnf!6H z|9WA4KneGY1*KZullIX#=fvo|3CDffjNSzHh0aU*PDt)MJ#skuDK+1dpZ8|Eu z!E|fTzsh~&`?tKQmI2(0A~*~-`ve(k;9x&JjRgd_my<0_(WwIDD=iGmYsowsc03k9Ljv$v`{Mx^8p|Lz(5l5xY`GR0PO-27>mHeiiZF$ zs09TqppgwdtFype@&$$f(|mC3CLoQ2Gf)8Bic{)}U(s2I#G{$DocQ04Ns)RY<6;4^;{^yArL^rmSm!&%iCCmR{ zvr;89P){+CI)osT!;@ebCJ$!@NpI#)Zul%Y3d6wApEtYeLUvpHM@svxg5 zr>cqpW_-X#U~5G&s){~=N=st~^|41ifJ_D+43!&9 z8r#>LLA(HI8gP)vfD9=I!2%RhK{Jsqpd%ZK$^%3acjd%U4KPs1_yB2Iw7ltt$sW{G z%S2mB4V44NMA4V4Ut75fK+N3w`l`uu#GzW?FDfLF$pwPg)`6CYmy>@Ik|u)-5CMY@ zxQ#rzIT;WD70|&x@tYs-IGe4X+xlrTbMHB(>Hq{vG+GTC0POOSyzQpk&4q73Jk~=i z0`Q)(RgaPTWj&#Pu?Ao!&XW$i(*!Dsm4U{9+oRJ=mPG2csOB?uDC-9h%y*!Z)F`x$ zwDn5EiMIhQUS#fyWva_rgRrcWCN13ag1i({h!0@>@6~~QS5DBH?3BF*^~G#4S0Z$> zkDLLJK}%^C2!=k-nm0OE&!V=99HC1(Q!(;QS6V2jFWP-m%D}GzbX_5Yu{`( zbKM90y{m^L8m*7wzJICO@OK`pT(&t#s?C8e+S&XEa$ffwNRf)Nk$ zt=J{_uCd|P`s|;YBHk=8RW*ycsv){6hY`ZLTFR1znGc+4i?nX+ejob? 
zRDJi=;i9%lj_>hHeZkZ_JOIfxlwi%-+_%Jmajw!h@1Kr(zZZ9R5STa%YTwJ0T*n2C z0f=AW6MnRg>aS>hqRx_ebVBgN8gk09eGkydn<`3)!80T4I9O_)Y;qyclD4vOie?$5 zKhF3B?6dR4^i*3f+j5L1sg-f0vqvRW3nZn2Qd^LftT7?oeXU3YCLv4S-*_tBtR$kb z;62oZO1jxfWEw-3#~4+so@J^gI+Z88%D%ZXd%1T-zw`m=Km65jOD@A)pkc8RDve7% z)NI1}Jp|qG{tTG+>D0)Kd)ijx7KLoyA=?4l((J&&SIh3`0E^QksMB}_`T2oHK`Rz* z(infGouCD)Is-s}!U)D@h%dtWzzj@Qb_LcSX3TRid8M>sdGOkO!_&8K@y#TBa{ z&01jVEn8E8fw+y=>G0#rHqZ)*#EkYH_jJ#>ez_ukdAEvAU^We6f+DWF zR99HM?jco4SQvp*cWJ~tE)JoUl+@vTIYu?XdGN;PpCWtqm#TZ}kB%w*c*ej#47uGG zdu?d=^c!GOXQE@@+~h-NED1yW{eumi#JwH$;28ZhtPjZBNGj^z=<~a(xjrzF5E#xk z^F6t1G}mL{8#5fvxDRO6^-C!)p-DiUnrPfuiFw6C+LFnt zSKcMM72I?j@6Pw;e`+D15NJw5KS~%8@)wC{MSm#6)=GtYoz)E2q8 zCF&)#@^`++}7Cv2b{KF0X$3KEyt`61t7Fe)bqz9I+|cBU$9tOg#F18v&7f?--p!A`FOiRF_el(e zFpzGvcmhnbq~Za)SZ-<3CQupm=pt|Vlgl=yumo#b#+s4}r@+YpiPQrgC8^wu3@c-(an8MmJ!x~^)hke>So76J z)2Mwu!SiUPnXTw9JVq-4$7U}@DDC-M|E^@vPy}^sUrxhHZ zz--<|1PPvNw!jWeVFp+i>3-`3F^nAcjsD-aGp0{7VQQ?@vt&qh&BJ=HMv@Yh;OF(u zqwz2?2+rklFe-p>?ih(HD>5Gb3M23w8Pp3DM=ybGNCmvP&tJm3r`+Z}PzR@N%IYiu z4e>6BnPwU%*?HI)&I+75sQ9_caqx3B@$V5ftPVC$c1U1;$!JI|mt+6ti%Jov+JeWgZLMDOBjdEwLceI^#%enRm;Ki>Kt*vqIr%J$V#$XCLdWK1T0>De9UH`h}k+E%`HzsdWVpBbTQb zTvq>_Oa6K*0u{j+i($`_)BEel?+*kjfy3F2S$6+*`2X*5us#LKvOn%kJ%-Xd4D#k{!kHUPqF885Ht+ z$~xdfuyFjoM%dajfP6F{lsItruR-|d(jc?u3cOzff>+?~HDGK2v;{-$rV}^{n#@d*|2+l+4J>s(6K$rmlt}uaL$;BSD%0)} zXo1<)8kwd1pR={JMXui_Cv`{x6Tv?&7_$JJf1jDBhT(rc30*D;DY)7+kME5C_$XsZ zUmv(**c-6(NWy=<$l+#kj+6d@N-u}8-YigXIgCax}B15X( z0-_MG?FJFP#JhK(xhOv{rKmV-?(-13e8}+Y>L>X8%7}P7pb(%<0!U$~u-hqzf<4hD z*scMuK*6YDph91<0~?h6L5b5Rcj22u@V~Eb2t(ER69f99F~xC17~(EU{B?+0fgl(6 z)nb*n-i?0-y=&_ff7%|HsTG}K3DH)pgAYGL&I5oSR*N$tkq8N*gSM3mm!E?_KWYY` z@s*u%u9@KFK^ZMVYpw0;)*uk^tAn&+MNARMwS<5@>^|PNub>=w(RW8&Gp~p*vM>d7 zhC5`Bs*j^3N{8qE@$#f1cYXVneaxyS?ezpFT)MG^ub}_nA1GfUTV;5eIPC1iNmJ0W1O5XN%g0 z?vN`g2sD2D=)W7{4t7=!_?-rdWEuGCrpa98N)}SMu+f!mFjdtQ zcp29f?*MOgTH@E07N_|dt|6o_3I)k8;F`*|qZF~d;tf1So=OVUjTb(j6{(k(fc-5QVYSr%Wn}D}JAe^mzg(i>_k0p!Z8r(4l*fkFT zXzNy#O%?d)Cs^vAz5+~jZW6yy=bs*3tQ%kllJ;`t?P3}9lEP+ELQcyfW0}@$r)Fny z<{q}aswnyfcIC-3y&#Fld)qz$qJrh~>#1pA@=3P6pvR$2|J#3@4k7233rkZj%d8oq zcc3XqTQDv_j|ytnVIo-C4Oaa;zBWNvUxM18Ct%l!0TsFrZPD`u*)({jcv*Xa^rx*s zjU~Z2hto=0HU)~$U>sB5A6Wb}#^f0MA=wOf&!ftLCm??n#4aj@cG=H3cZzl=Ro>0EG)^gDcXz=EuG2_zqZ-8mHYa#kx z@#f4VBb3D24}2HhV)=6hS_b&k$UkO;F4==ppd!6+%saLg$BZu@6dZwl%t#?Wm4n$K@1foeVC^}JNRybD zD{9(83VoIXKvf!ahECzp@u=W!Gyn@ype}d|6*vlp^h*K|B2@QepLo$C?1N;Wj69e4 zC&)HyoVm{STg4hr=@j_x>~L2!!B&$5vk;IA^(56cfG(glojSo17%!lu3si) zx>nRgs$#(e)aQ2uNZ?Yk28w?SGSu$R-Z9BFgjb8GM;ahO2EagQuzF`-?}!O<(?V9^ zy4H9fCkp8-m%|Yr@j||_VpGJt?rTAShmY+WSJkd z30r+>m-lm_dGXxsk8r#JRnSrM_rACZeQ;JvbZ7Q5GrDE2$QSs{jrqV;Kd;pcGTw8o zWY6TIwiYHz$NdRcDc&NbP>O$v9ou)?9`$9?vO?}CAtlk*A2eJS86 z#vhWAxTeeO<7ITVI|BBGG2g7cVIj02%e;TS>(uVA-A0n)m_f`wSvZhDGVI)#%kY^a zS#T#d=lq$yHU8=ITNaI=s#($S>us)AV(M^Z)gPmu*DhZr$f2!w9 z!Q~4~4M9u+<>iPOTFw`ppem`IcO;EY_u($QwflVe1ibZf3D)D2gZzmFgNRsED*5c^ z%SzA6+PEmVADWLi!gOO|Fm5+z$q5d` z5MJHw!qjpj8<&4%EPspHwO!e(Ur?Y+d`~@{S@t$%AP^?n1nY9{&D%;$gGl9T+Dr4! 
z{ut1?1&a)p)eb9*`wSE^OB43lye^U?<;_Nt=@_ zRL}28EJG4=aaH@0@(VYa~g!nb5P zRm#FCKrv$p$X)G#6seT@fivTTBMVrbwF${8UqbzMl=_bEfpiPFqJaZedE6-T7%MfPuoo^&Gv5mE0F+}uPlOjlRu~~g<6mI(HXw8tBTLBJmapb!HR;}5eAaCKd`nA?>62l+nWTg4YzOB^@M~m zt8i8xJH{C=kX07?<~%RVOBQPn?7*jZZTW#_x&j`Vc{*sLNEhM@#%r56n`XC4&^@;J zh4CAcE%LH6Pty*&ETd>YWbe$1-eYK254>>p)N{u_JjND|g+sjCV#vbO+3A9YLeck% zQ^A}K^B$>;pV->2fc*inh zh+b>taE1$zwVm%#Y|ZwVT9&Xzdn&!77&+Gc4A$ID){#@89t11inw7ye;SFQElvfkU z2fE~X2fPXzj}R2$p7{GiKhr1y(?g!z6{xJT9s0esSSth*_uY#bpeIncsMA=MhTXFu z8jdaXFJLk&>HPKD&?c$HK`R)BMFc7hWWxR#_s%gbNb~N^4i-MwT^dQSuQgx0|0N$S zVZO}6~S@0m8OJEF~-yXn!YkUWR^ zIs4lMm%X0X?;fHE+>+S1ySM{}Ru>Rj;D=_kj+sK+whwq)zxMP)c1RlDZ@olk)Kn_j zeB0C%w9DdwN`cM-4=;&|wOg$I^`d}Ac(*eqM%4ESpUZ~=r5AR#(#RyMCzlf0Y#OCd zLu?zC-T?Ra%{hmR{v~nY35H`fc5WQCjF6wd{J5G(OW$LTg}XTO@A1B+DYTWPhzk$# zJ8=}f0s+dy527l5Otxy|)v5KyGm(s9;{zSaY4>P12J=AXs+so}FG!q@$*Q#xs=C?A zUFM-F%5*!{j(65Oy*)zIXxq4TI15$JTz2F58z$Mm1d&q{2`Kpn!|lj~<=YCx+*rNN zNKXa3AFiI`rsd%9=5u$esy%YrYmSb2q|NVPzW7Z6TjFvq#oy+YLH#LC&&Cru?{+$; zTuleCh;eL2@mT;rxz0Q$J28wPsqthSc{kb5$fc{QIaZ9xKOxgS4yRQW4He`!`nKoR zoli`nPMs)deG|`74})^&>9YeA#O2>2D-W5Z)1%{X@g#Kxq-nMHOUQ(B`J3?Xd&^2+ z9$g-tfMgeirfe18!_79@Yb~gtOIJYTe>gCamN^-fMi5)anXXX3Z@6hx8>sA-cZq`D zIsWT@Lo>Ws8huU~VZ$I?oamtFGqIX>?s$4mY86(*K)n!OR%C2XoWuCY(IxZ5VhE=t ztB;_q3C!q_zIH;8$mhPO^$qa4WL8Tx8+gog_ds{ytD*deob!hd99I0rm)YI>fz;%D zT9ee`F>?SW=wai*Y@q? zWXv9K%P%}H$-Mja>N2OZIc&P}BENbJWaMOH+oj;y1beAnJU05aoJ20@NbUyqaXTu5 zp9br4M`O-xhb(@!jCk6apzM^ko#Ck2zA3D5-jUs?h?fNTNmy(A(|m%1r;^kbC)mXU zRQF##Y9#i_(IQzJv~RjqC}bnnbhqWBk@{DXv?cqvT9DZCTT#?r{TM&Hf-vLM+^X!x zBT!E)*NklFaaDS(GLmEBl&y?m8nt*U{QJzRfeLx%fxx(AS+sWvfqP3_cz;VXqk__7 zIs9)9ch?RYUcF@6O00ckJq;PEp?kTv2>H2priSrE#vK;{�r$)PNu(MM*QeHCao_ z?vgb9$|#bMTx9QLMHZuK?~g5UZ+@gv4~$j}Jd1`#12g|r+cLZ zZJv{jK$cbD0vj%4FfT*X&b&9(;MJ`_DaSp-kFf#r9f(P{b30E@s}rl%nLoW~?bh5H zjX%z_8+c7Xrsa^cnYL?ZT_814On z7NObOC?k`bD4m+)zNZx%T6JFX#%Su3UhC{~WxAQvxlYWB9BI3qFKy4_OkgMl<*0aCysd#5=vk`~^p=nbEeO*8KiJ6Up?$)jbLKBb<_O@awNe9k;<4_1lJ z0vo@}8@U>6&;=w@2us!mRyJ1-Zvo%$j=jM|#GncjkGrv{!jg zIVy%&B@TmS(FM!O9v+(4peHA1J3YXEu8b~NQ)6HKBvWq+#oRqovaq*ir77286658wvvYNaLg{!Y4G%225VX$vo5eZ{o-z*&o5CN?LNE^cP0 zA*7d>xj8=Gf_88)q_Jv-%3mG9vMGBYohoy_206Pe_twl&F_j%Bf7Mb*p)l*Ral_iy z$F!dq)vlrE%#OY^kM51<(q+6m@O=-D05rk+RyyLe_%Vwp`Wv)=jD5CM8j(PJNA$W? 
zRTd?hczy9%oNosTUm^fostYo zTEtwPcl4Eh&MymeiB$3}5m#^8t$g_4)MB!bpKvsuX4-h#P>u;1M;5vml{B!@F38sy z27olvQ&l8;`Kx&N)m{TSH6sT6C7j95SW(U03+YQG@}i9wWjb5gyAzSKWOcOP+4XTy zQO&aRsDLue8HkBTWKYPjKE?KHuMc^1I~il_%(tYp>t@^CVNw#QjLJjvH&CXMWg~sJ z`ei_e_5S?)aVH8zHZZBNTvL@I$DSSwlCu@(6u~5R9 zVxryPD|Jr}(#&$6GgZ1N-}$tM^3s&@VpM?{&bdjkahnM%jo}fOmmd38BG>aDFRO(7 z7uexMJJ%>ChvN6yd{lJid4=^o6-eE)E}_cezpq!j`ITv~wyIX=d;wp)28Wx?@YMW>!of(^asvKA5K( zb}B79TF#q*GuyL#S{ZD>GlE4b3!R@{UOo)-&eL_{NoO+)&9%vhAIX`IZVE*%Xsc_& z>#tO`rry*#v>%0yhSUcr*x`5AgK`18WuMTNKv-2%7-BFib!}nUOP`Kje6k|(#;g;* zngYk09ZdFxd*j2nyclMmB~IH6`X_p{fVN)TrK+o88C6#Hi&Sg2Pl<+=t!}CMrYIW^ zODR3Bw8zlwDzHCMT;iy3GR%0jbxGMjv8OXLzBgSLxREo>POtpLoJCnWQL6^amf3mL ziEu%mNV{wU^{h1?-NNlyKh9dW>l&TIR)O2);I<;nHaYj;LJ8)8T!2YDNj^kZ-7Cfx zw{U-`y1j-;R#@5XNiXfMjj>Bme>Wxo+fm%vO)!MR-Ed2;KLDpjDW~e^1t2$9VLdnO z14288k0`{ZUp_qy6*;$6vKUbo8NO6Tt5%6~^a2@K>UDxX-I=6@HgVV%-Sm?{U%*X6 zKuNA})Y+K?3;yh|R_$}Rnu*TB;7XlZZ@wVM+PFTGY9tO_Z0*|Odg=nho9Y%DvHWx^ zIJf3Q`FvN;vO=#iy}VU!-r%}arHXZq{Ls(r5}@>~Dk*I+)>4()ji60-eCLfI=Q`&@ z=MH=l119rTPAx1O?kZb8;*PS$-kZSmPXItT45{wSMgs+j@I1ubU)4(+Q^>zlRh#sE zTg6&*w}Y?!cPVeD#8p-;GD0j+Q~}DTkJf{@w-JN^GN)WNr7L0(C>XdacSP{`8TT4P zAsVTIz_X`po3Q8hz!D%%Sw-4y4{^*@jnYYue1aC=9RvkV9};x;+HP$@HXh>h;aYX(xdUti>ZR>xOhawazKc5i0h&2mdw6cnv^|LOL`B$3R@*lUL9$%-1+iWR<3 zU{Y)1tliM!RVWUU*)%<*AG1Gb0doxwsMWkd0~anSfyz1Vs0x$5zCF;aFHw)`2Sz(QRV3N3!-bPpVk*TL!)-lk!CKO}3D{%gj0+jUB{f8Nb_X zeYku?Mw0>1!*D*;KE;X2qsq2|%Bk3Qm1kINZT@w<-mG}|uej*Ft|e4e77nQ6)r#%& z#_T4Dub_-5%)l9x1*Tjp<^|1V+&?)|N%A1bUsG>Q!s>3)lyu4vb19y{wNHlL`&uI( z-84EkdS8O{T=*Wy4dr`zK_$;7!qW=@9bq}QW z@aog`|4C{Lnm1e}-#K=)HfAcOpH47SZbGHksu^96bosOccnvbIvDw=#-gM=_BC8NN z6U#(+T%Xz27$Rp!PM`*%_M?g>H5Z})>-aJHtn!ho@!JFGgy+g-G4J&E<>ap=;O2(B zfg6M2!1>K*qP(n=b+pYb~^^#4D|~YBEnFB_sRixcQf2zL2X9 zNPI~1R2Wl#Bl)X&V>wV<-ks1OeBYk=y1KrME>cNVlI-JHZt0$987ZfUlTMzA#kEad z@#ByxsN4qVtTCqSW9^v$ITPz~oM>IWIsDI{=u$Stqf?N0*%~tXTL#FqT!=G$TDV)_J1NFVXR3C9}r!xs> zRu)DL3q4v6euc|Tr~ZPjPnO#7I7k{`6)h~*Os-C7K0ghzH>S;#)-2B{i#<0wvcpJsS;S*NyLcO$OgZu|1LG<4OS zVuIByqQ^M}mC+SOe8ldMJwd08JE1a9(|(V3pVIkj0@xn+{=1cC8SaeoL{mF+ydv8-cC5C+jD+XT-D-E(H+RU)$;sMutL3J1m)DTX z=ZD@6wr;Nv+F`53lJ}`$$rF*JQm>J-cc`e5zy7DK#dT;nW+(adB8Ojsgb3r3TcR8}+@5KG z8Dr7SgNk`%MWD7-+xn!+W%c$8C!I13wl~a46phd4KLfj~jcSn2^=@C$Wp;bEq%Ru? z@^V4G!EU)1TGm)`byxSRLq2*Qk`~41r#=$Nm%wTb*zNb1R`ts?`wUITnmU%T9qsm| z%xF8k%|tvk>Am>`B(^|0@WSWpg}Eumj(OgdoY`B*i}|CPvW>Cuos&LevL9R>7)T$* zL5c{oR7GZEnL}&}m(qU=Vc5eiJ!fN)qrt&E6#J}z7{UCsw&E+@Ky#|lBV4oDvrJ_+ zk~-aWdlf|SsT?#kN9jq)!J6l`{A$+|i5r$gj}D zY3;NA-Q!xF-+M~mCBA2z=$O!^Qk+#mqD`mo<5K)Fp@Atb(mLk{G#R({bUlMlCO|fQ zyTHr`ov6|Pxq56LHi4(YiYxQ;RJobWK@4&8;+KkFRxO3b!Xk$z;G~pXueU0H6LgN* zx85D6^3k4a4y-oXc+^7QBX)3*(c06_I9+gInR_r2J+hrLAxGp1qTOpI7=7u()W3DbiTth0tM<_-n! 
z$WP$wNroq*C4N{*6MG8P9{VgYIQ=mPmbjDdd<&)O50j}sL`nyzAl|_n?2GQL3zfbP zx0OcB%s1S9RZDg3I!eSvf;9l25hw?WMZc>SAfI@t1^|co_u+}|UVtOuI+Ug?um$Hq zX_|K29DK`I6``a8_}YWi=| z0vE)QO))4?@TnYqsB{Tt$t0+)h2_+9gNJoeX*J}L;}m|yz-f4};V zI`jW4H-v;QV&vqH2lImYPc-V(`H;C`A5rnS#Ur&wbiStLqBUNc7M1a5UTk7M3r8ddK=X5Kedl@_xe3V_u0t z4B$5Pkd9nT=ijX9r3H9Pe>F8b{C*Ao1C@Bog2S{&6uSSqSqGuu zX5I7-7d`pson@TG1&5QLKKj2W{l{(j-`#0T-CxC}zi)J6N#E@3EGos63=|steyC<_ zmRJr378Dp&VTfXMoPnnY6wD2O@WKbw;aoPurW01Hn^Z$?b zzA`MUuIrXmQjii*q`OloRuDuffipMi zqdb4lb*}Hc=e+*#(gl0p_g;IhwdNXYj4}CcABoirDPs6U<&27Z|7v!GM@Oj7N2YcX z+l%mt*aDvFN1O9FfnoH%jpXky2QN|?i`-c#aPGgI{kvcOV>-L{I2u?V_5j1L3=9DU z-MG$8-sHcrb{(i_LS+_j7|*8hf3CP}5%_DS&B(=+{pPmFSq(`mD^hc8f0o-HaS2~m zqHae{@j(xmGz(Sy7+=l+v33(Y&k+U`D6VceX>$7&x#qm~5>3$mJ_4wgOSLND{5*?7+)c|8!?01laes+oef3ZWjYu$^b_pku!!$siH17BlG_m zLcQ-H#9#aFalA%zBuxT76Cy(^KikmtXPwy5tGf^cbbcTE-}gSeT8bI7XzEi|O{-1@ zJsL16B@x#~P=aj&VQ(CEmp>L&{e6JcgA`Fs&@&Ia;=K<9y1fs_B~52#p_Wg-BKxyC zIi1qF&;N^@6fRXa{Pzu}$JpYgn zv|Jx~lE0M7zddQvumQV=*QTtLP>V+BV1_wnwmu1RR5gM9PM9tf9hHi0Vp*@Ala z9VAXYi#HdfsXG`;*4OHfQ;UHRn64f^uL@|Wx(c6|{+Y7^53mseG@m^u!KRK>;5r(n zU#-Mim}=B}c;|8%@@<>7lBfqSW$L~^SiQ-tdw*@FwH1*siU|24b-`-I*AFDtnuwzj z!i(Aap7X0+@5-&t`y-{WFdk+_YRvLIrdkD|T@rgc6!p9KD(P~v1VFOGcy+1oXMeGv zqxvmRVV$QaKbKmwp){b=untg_h==5KD1HBqA}-Jud(L*wuAKT1qg|PD$O|ONV<7aO zV0U!)^~X9)bU*Fe&|c1yGDB0^zo0w^E#=?!XiE>x}cfs z11VVthmZJs{B-q%gg-ZV1M%5cK=xips5@XhL%jR$JIUyhr0a zHtDj?X*~@TQZ=27x>enBuIP84pf+FE*)y zhkS&TeVM0fALDZ;G$oIgPjqw zzc{9uQBgY-WH(f0W7MYY(VJnq@e(bFPU@{J_q&I4O284T&oa@L-{)>jOl=>=V-7XyD$p#?8MGtX zU2tQkF+>FTZ3&ZQZN5-~R!&WYO+lOzKz%dWq`g@iAj1 zt`43{NI`24gq@wU5gmBnwyASLabto3{3ZWpv46wGTfVM+-~6kA?wW*XUnF4)i3Ydy=>R+-ZTSVVQ%SFr%aYwzk zASbyV9jPUHykOsa04(A(Se5sd?WzsL*R(X1wNq%`Hrr;EOEX&9 z@eQXzX$_^s1~;a`n|@c<34+Fu+QY#qSBqZ^L!HMf9^5+py0Laz8dj_iRNjK-eJQ&j zE`IsX3gDozES^z^g53ow1nnLn_fM8-@7UU9{fIwX*v$fptd-pVB&N?TnKYKyI!uV- zS%RIG;ysyftU=R=iUK*KaIhL2Lk`VVvcTvZ>$zvYZMhrwaBqFQZESB`z590Cs|$E4 zhCIy9KAKM+cfWyBE z6Pv=@4p45VdpytAv|U*MQt6%ihe)wx3e`}3y`Di*|0XzOOK8|EvNM+{)ecK73aSa2Zzx+8Xo`_4l@hvNT(X|c_wixC|+62 zoLw_NuY?1b%y&n$vVJLyfyXG>uSk}paU1noKI`l`>U{3ku=bW>>oN5(jE2Wg?oMpZ zrCnD^xUjpwS^MzvqTHW_Dg{yi!7!x^DHCi$nRs#PoEpmvDk1d)h1C91JJC`;{EM9& z53(_7%9XJDb=Y_)vJ)O6@H24ndaH#b5{?y)SVCEcJ*atb@@h-O>9>wXhR-j#|H)WX zAY)mn=1Ncu^uSh1;+cXRsAsl^ga(ZC_CRFhIeZP$1S@8a_RkGh5{n*H*g5%5Q4i#S zwzbPj1#=E?!Ap5(mXjG7$@+_!ZGg`XvZwU-_aZkAy0STxiLXN4n%5w`NwJcx8V5SB zxz8KLFe5nV{xz*cfk>;DV`w)?G%`=v29dt zAuR2)x5)bwfQRq932RG4?+Uoo<=M2#4#;d?{!TZM*%A|Yi+m^a*mK}w;e377`%dtp z2~ltNI;0=wf#iZqJYnkjr>x}PBShoS=;EwNa2ZS54Yow)g~p*$wY_;z;G2t5^pAnj z2vQvBTa(kDjdmLju-J7PtuS}41lXz9z$CQ-%V_w$M%5LEFj|WE+@2F@8}1A!^mm}o z#n)-=A3%cO>RrAcjYe$4w>AfhgH%ycchrYuTFX;Vwq{90U($O04J?hVe6)-Md1I2> zcWC`fZ#44_xB@3TVq=M8JkF3G=)JuBIJBb6kpD@Y&|p^QqW&XVL1!_=pGPrvQ}xck zObU(Bc~OxfzO)lMvVhlKmf*g`L&5a?3ZLsIWDc6=>SoNOpG6Eh@UGN3c0agx z|T_4q<9+ zs7^sr44B{24Ei+~=%iZGt#*15Mc3)X(y)9rAf_p(B;fQTl{)TrBYL>RoOiMID+K@~ zvANaCmX{ssW`bCI$0nER%a+EUb0NI5?<@GG?Ykx*@%297cORh;@*D!_kD|(i-p~Mn zdIIZ7Y*(&*Gd1|OqJPRY{@|@lu_rA@1(G;8qOBdT?9-^w)p1}DEN%T~fZ&ikC9nxh zvPLlI>T4SdeQX;EDCify=>&2X|BPXA2GiKlvz2;e?jJC4>{@u7?b8n-eYl{^q%Jq^ z`)SX8tV4+Fygw~x;P1dcgbR8TVRZX?>x;AgGvJY}M8;QWeo%WX;pk>{;WCWXn(CtT zxT81mhUOj<^vk~yh@V4iBOA8fCdN9de`Jx!?G~z`dHzWt{Duz@ zcFJ* z+UbB>n0yoiMZPG(|KJzRWPrFXAtmRx?h3{xt}OQ=Q#uUzKqo?Sai5K?wVA3%3<;pB zR>TZpQclsTEZ-M|{TG81Z|n|%jUwhTMVcBDboY+*qMv=GjJ3rvia>B`ni)}gZzy|G zdf@b|5pAD&bPD&w)M4UPsaKcy+~}qYM+({j=~IH-&7;Ac@gLu*|98Co-s(E=%RuC- z`=AN5lfd2lX9(_q{QMw2#U>{Jts4t;av})h^^-SJ!WYnq?vMSeyVa2IB88wq%)9Yx zlMQ*wXGn@BR+)|Fejvz(-^}Sr(?+=YHlnalx+_#f*H#P5ERrTpKxPxX19b`x5C{wW 
z7ySXxC~z$U2x=CJtRYD<@DVoOK*2^t_mK$tbFSLgAP(>528VJ`&JM$VrMyRUSxcl! zG4l&mkS^kw#4Z5jG6wln z=;`Zsc0rx~z5ANqbtp{UUK|;D%GhfPA&Z#cIDom$*awX!1N`E`bPkj#Fs#V@YCn+W2Vm_mA6_-$pZ#~0UNH8n(9!B{mr7upc@k%T}t~S zYky3_7m3H+VDPy-mIm^0(ex+TvrrWHEaGR3qBY$+x!>vI)dL1>}3ZW*nod#o%-a zL2=4mE{)tsbob-N-Sc4UMgJV!Z?a!q>CO*9!LL35s4+S;B*%>(@o%!Y1MpPiF4Thu zKfF4*TN;l$;`(6sfxr~&(asxS@0jt>W^AYkG1x(>oeadrTKTwJZz@)KI-3DA#}BaG zl;xKr_2xHhe=cZ>u=-zCzI){>cuB1jnpobS%a(eNk%#-3n{EX3qUAtjyDs)ggNj6T z+wr^ufz6rENAj-aI#(qLDBM>|(FH<17kIo2W!U}cYN%eV3-V);C`|8eR;<%DW+>@{l z0)lup#a3fP5I{!f4Jh?i)i{xP*Z{B=gq7E)1s``E;2LV!)Bh5-86!ULa*;8 z_;CaAZI?&+06P=vz8(OegAF}e%arakg5>jdB&ghTp)x%#*;WW26)tm|UH_R4#KnQQ zBa+Att0XU{L?Vh{G-<#$`XOSLh*#P8@HRL#*a4wdq1DkR@ca~*Ggx^ZTTccvON8RI zj{foAA;7sCZ1B>)@4qCGNAJ9*;D5paCRm*12hpiM%NH-)x%3x4!^C}wuf+V zEbuWq9hU~>R*Wf+4HWGm-cQfQbC;`C!PzSW+`K&RsM%qU}*?5b%0Lx89~$6StE zd|tnEt7c#<4|&Zd2}?mlx%8EXiBPXMYI>Q$b{MfO-51QAB9t>1Z-(|mb z@8#R#H1{SH%jPfS__ML2vg^(!zO7!5XDgoi@VzWIG-aqYkqL7VE;u(+?RDQ$j8uon z?WAEsx`Wo8n1GC15)mf5qi^Ys`+b#>_eoE+zz?#NrAZ|EqQS(+z&Az9SNw*PkooB~ zQGsq6dQ@Hwuh1_zLMBNMA_uvaP^bdDxwrc~NRd|NA~vm?6Gj9PJ-&{jW*$0lH(_kJGb>S0!gIo(3;jG<}*)*K}3d!-@#dk5O@(Bee~ zgQKYSs>;V)FQ+hRm}%&(e0~$^`F`+5*@5uz`xI3a6gE8tnG0HHA5~#I_KD>?=`7ox zXS5ApZ7Y*{FSN#T!}&cd6oqj~%s1?NG;e8fQe17Dy8dwYjG$`;MtDtQkl#%c@`2l^ zuHK_e4Gdx#Lg6$Cch6sJ&6Be-S^AEMHOcf;#f>{SIn{D0Bi>mo43^dOqXn{?okcH; z3P4kg;O6dJUrK{p5M%>*Eq6}4dw=+zO{L_bRN^rVw;-raVo)tEFDr7on?HSS(GXU zvFmE`iUk$6^6dh6(Fz4qSq-tp!5T2~bLqcLuxY9>$v;W;AmTYChI`*cyxexuOP(r0%(S>z z@<^P#PdW*jbqZ+q{R>q|%2L77@l>gKiJc{?op>n?I`&ijlw`KL?TUW37n65xu+<uLmr1H+c!Kik>%ckbQ1B!L62RQ27>8Wa=A3BkQ&40kHa%Aju?i+*Weq~PGOwJkJd+#Gg;y$dGy1rG-F4jzU$omdYrzex zR_dg455V`$HsT_!4d^|J=vWla<%>AO4;?|8>T@E|#|%L{D>k|!iJCL|9GojcTdo?cig@}7P>k^gA3j;WvUrbt8eq#nkS2K z)l}stPp^mVoBr@mKK57)42$h@JA@b0*ETTk(*LD(ij;DvfdMsC< zqo}E)FKuyYeQGMHf#=F`Rg{l)nMyuT;4x9RP0Y{g?elAH!JQ>;6(*Pq_@v4-;onJg6SkLtDt~hX?d=X zG|E#{EC@Dk@mUJ^atg#$RQWMf;y!L3iKU$ax#9H=kh~fjd`?i7e#k4EA2#9EbP4=) zl}0RY7iG6p(ph8`J3{nAy3 zW}q3TF_v~Bi%;W>?pb~4-hsR#wL@W6)jm@0qe7hmzG{&^?1EfTnH}GIq1@HSpgUd} zyC${)eo9-gY2|Z#|JX&nqq-k$NOQyNUh6;{0YQ;KZ072MoWy!`>G=eeZ!HJQ!L6XA zD?PvnW$-STsSZVd7e7V zb(_^_vV(F$ecdzt%oe+mO@N~L4sXBg>rXD*kwI%qrKyy!Ro}g=IL&vXw`^?8sP!PX zI?c*tf2&t8k$Qr$Z!S$Ow&0Uurc`#&uUu0cK;_#;xGl-ya_q9JAF|I!Rs#sH`thtZ z99j~ozSFuW0sm*ncuBf9;wAi{J#~>T&pHbyRF1_@*GE)oWc_~Hop3Lud5XA$q_ET5 zlPQT!V$DPw!#5XH^(%OU(Z-DOwPByKmG8Qu)ewg-ym(f}id#rqQ3{u^CEm)^W^UwO z#<|(X%M`{wv)!z@05h& zFX)0k57h$ZoZ3nmudS5Of1kT2wM%2~n>kKMAGjtseZU$xcVsJ(8d zBbcmy3x-5Bj-O)AW_ej#>0mPnSfqz7F}f_H;PhfdQw%MOx!99<^W~75bNP;pW;vuN zVw1mMl+B+iXKs4T%Z!fMhY`Si@`;fxg`p4^-NxgkAlulMsPM$bsDd->qXX;dw_gg4 zNb4HhU%UG*g>9oFM+sk|v*m`Br=1Vu(3QTv`-(kS!gdy_jb403gXu!nartgdZuVVU zM0RQsn8&KKhs*l!Ka*%K9Ey$T8eVwN@Qh7x$(s1u7?v=|mE}sb;9zhQ=p(Cn1J)%Z zvX^pL-MN%uJhn~h^P?s1LW>sXsMaSKh|-zPS#7h(FV~g!_+EEqaPSqDY_M*>*Ng;!iYxiB33=r|`gOwN%a%5rt<+wPH?8 zt`XsRp_+F|8>0}bCTQ?DUlhQbsgJIhQzXt3tbQ74l|pHq*&P=kp)$gX(eHEWn0~-g zJ&?I@PzzR*PThGX%eL1$%Ab)Z#;wRvc_rrbw*keTR#(IzC3NlY?wTJn_1mQ zF_)}7yXxbxn(FEJ1Oo95HA*7>#E{6ryrzzmSi&NNfdaLQsdDeyDVp5`Kj`AHVpjm1 zz+$=LNu;9sF;JF|a%GMvH&Wm%ld7MqMOZclmA*Xb+wcjke3bkVe!-8d-H6zykI6AN zf7^j~R191j^SDY5?%=d*#gyV|O`qh`-cjJDBp>DM7uZb6fG8t<(RZ2;M69;cB{O@h z_*vWSCvGdJLul>t;`^Jo7%jbCyb&Xw@YAl?2)yP-+V3@>TZQM;)up;6u^Pq+G`gpR zJrbaBe6}Fm;bZ(IH<`h;-Rq+LLWGx-r{wo+DL1c~!B^)B&OQCO<^pQUz_3?&N)Z3@*g!bt?E>G@xT(Pl8 zA717-hbc4Svl!l0Zv5i8liZmVG7gPj7;Xy?h7xkQcU^8(#AOJ*THnTU zl54$#6<;zT#_AN4Qj%0`-zUlhirev(FKqP{-?WevN0UCn#U`{+(qapBJx|obdlid) zq|m`I?8;7qqz>I<>Knn))tD?)da35U{d|-@&bnEfZf>%nTpa>?c@oYl511PSM5<@< 
z4s!de!bJ&mO-muecN+_FYEyh$5?004?bE<;uFsC2l$EUii#4kiS0MGz(;7K&b9_)u zX7@=#)-6vW4?Vkx&z#|_c9UHf_>F7#I29m3KU5zsd7 z^Na8BNdGh-c30aYza~sYU!C#NadlCvKHf;RKyJ*;Fro{6MRD_QLq)vl8--bTh1EJ@ z3ux_c<42zR&V{9_}~QC8@vCVubXM10KpA5lSP>cJK_Q67T4?xxE5DOHk(+9PaRXH(bgLN zWKG7WRsPjW?o6Ab2b|{k&$64?sfR~}I7jycS1Uc5W=TDL5=~nJ2%TOVuTJxC=avxx z(IUUu=vy5}s<%h7!YNzF9>U&W-mdI=RS)(~0a-?`!og|r#8-22x0c8`Qc_;ZEA2zp z*p^&-tmz#?hJoNt0jv*66tZVq7e7-_Njzu8Mj?-2b^j2qEHJ`Im;IdGg_7QXJ$C%I zRY#+<)$PIKbFL-x7Nm6M7rA;7{pmd&*|cfqMA~}h2SfRZD$Cy9^?ZRW%L5&yxYcNR z%pW+au5CVXn3hmr)`rfTMRubnHTtzvxvGA|?RoH%h+Wg}_xhAy$8`wgyFEZTAbz<{ zJU^g#V(;ZOCHZduFBO{iMj8n-FER;bKT@=PopAZetVQ@ERf#h6aLfGl>rlX|xRn0Q zu*>VM$ZADCp=Q&_!9*R^xnU37ven&7n3REJ+h4;~ZW}*fj%mp?vQD;o#~3mH43T)r>Y^R?=;fENjZ3^3Vqm2O^4!ZO1XNj+sohf@7;)IfDX-Fn zXFX!2w&P4LoSzy(c_ylC&;Gf`5S{LvVuN}=fne9*?6ALNly^!~eEP!?i|n&Ocvs?# z9wun=o60gPg7=Yv%siTkV+U)lQnPj7@+He2;gXS}g|Sp+P2UO}ccC{Yja7@3Z;^Y% zt+Syzn{hxa2>WTzOF^|{#u@4=iA zBCDA`{+I2M^sGEIliX+7lgkn+HFtd0YC`7xv6H5x^stZr#C#ZVy;CTZof7y2q&Iy`)VziT!0@;{Eh1Kb04ziQbX%b1PzV~AgR4DstWjsd{ zlSxp~roQwTt-K?H#toHml)-9*%u_jyI8dd!id+HrbFN~W$}cs>O0I1J6f=NA9Eb*&x(m%6H<7ya}rA7d1-arwWc1#-HU83(fUcU2cwUndX)~{n8F_@Z2?~Rw?l^ zL|qh%HQ6xoEVoi&ovKhhs3oMo8iioDm$>N%rW0oz@JmOo^WEO$BT45~*N9w5KcA>& zI!eQ%fJeohsY$mSDLjpp!=?cKMy3py)wFUN?=2A8kS{l=s^+pJgeIVXc?C$5|ErM$%4K@sZLFH>Gz7`Azk@5*tGRZSC~mMQutfoTbmF^fKuWub^n z-T0W55R@-SArLdaICnxRh@~=+*B^zSg71{c_2gH&t(Z6z6h$*O>JF&n#`jR|@>^{1 zioqzh+(+}PIl*|%m@T$HxJC1QVl&wWd-4Y&D_76n)71hwC)2g)Bx=NmmZdjJiujzR z^meLVpgHD`+EP7LzK$on;&iYUlziSmc zYL?^oH}MJ>DJ+VRsIe$~{g>>CoC!4S|H(B1{w4{j#lac0dd%EWnJAR{s0+AGeuw%9 zjSLm@E{IAw6#H-ft2qSGiO+lq`bo$6eS0Z&z)UVFq(mNVuH%8-sM(qScJ@ECuK%w) zo2MWzybG)tcJz=cEW}Jo3p2zUg9aymF~JPm>wOPrPhj<#9MZuCJ1YLAm9CuMC1qT+ zuvaa1a!`j0M^ebsZn zsI5R2qD}Nl;TPrdm_F=0f^L2j_dG)H%ZPwjc5D*7dZg7f3b+!-x0i1GvRwZ;wPZ-0 zhUWL7SBHO(gu^#oM*}3x6YE2pW51r~C(AQX2_gtviNw50*7A(E8 z&@42a^5RG?hjZ~sqpbmnN%ONb;Xe{4K_{D^nh-bKUta{W)g^{4!v$$2z)0f_;VtAS~Pzxj8Mo3;b{&k`mG&1;yUq1T6r65%ioGN-1@PU ze&EF#lVRtC7w(4`+wbQGw5<>Vq8O*L_DGmyNa;iLlydw;GXE^lwus$?_Jz%(5l_Mb z{xbRre-~>x4CZ2FjIk~3{}tItj9bAt?JRwzaU_b>GPL0nbuh;fTbDl{sFqO#F|SXi z`!6nS85<0;f{fjxBN8VLs&a62PAeX{g;bype4-(#oBzL>OX1P2^soF8uO1G|-U9f< z!dokU1RsaZYC7=f9QXbb6dis|1Eda!cJJPidE|)e(~)v^ z!W?J*s|h7gUCQ}B#|doY=f8i^l$}QS<*gOsznTK-SZIXh$nN~S(XU^8N=le!cwzJC zwIqaT^!KEHed5nwe20V8Qw$}F55L5(Yk8n8x|#O(S37KVz{Hr>>GeHwEjhHcO-KE@ z>gRVXN4s^MeL>>Lwc*fa=_3EnGa&ykM6ItNb$fW^T3OV3ky2rXKMxy#5?CDYVGO5I z^~kmVk=<8Qjy}dLMML(ggnj>1_@DP2g|&5?-%K#BjvnpF+vRAHki)kfNI>>#bvMXl z-VS5mGjMs&@7lzd)t8C0QQiJ^Mn_(P*Od>^!(VadXC|Aaj{IN?BIPsckr%;$b_@Ba zBb|EuyRUY$^85A e|M$A?2dV*+bTK1SWh4sxQMjZcQ*zPt*8c)ByOAgW From 0a8979d3ab80374eb4f84d08e060bcec70174c73 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 25 Mar 2016 16:52:11 -0700 Subject: [PATCH 0043/1644] ARROW-37: [C++ / Python] Implement BooleanArray and BooleanBuilder. Handle Python built-in bool This patch implements `arrow::Type::BOOL` in bit-packed form. ARROW-59 (Support for boolean in Python lists) is also addressed here. Author: Wes McKinney Closes #40 from wesm/ARROW-37 and squashes the following commits: f6b60f1 [Wes McKinney] Clean up template instantiations to fix clang 2d3b855 [Wes McKinney] Use gcc-4.9 to build thirdparty, too 74f704b [Wes McKinney] Try to fix clang issues, make some PrimitiveBuilder methods protected dc8d0a4 [Wes McKinney] Fix up pyarrow per C++ API changes. 
2299a6c [Wes McKinney] Install logging.h
2dae07e [Wes McKinney] cpplint
0f55344 [Wes McKinney] Initialize memory to 0 after PoolBuffer::Resize to avoid boolean bit setting issues
83527a9 [Wes McKinney] Draft BooleanBuilder and arrange to share common code between normal numeric builders and boolean builder
---
 .travis.yml                                  |   4 +-
 cpp/src/arrow/api.h                          |   1 -
 cpp/src/arrow/array-test.cc                  |   5 +-
 cpp/src/arrow/builder.cc                     |   7 +
 cpp/src/arrow/builder.h                      |   2 +
 cpp/src/arrow/test-util.h                    |  31 +-
 cpp/src/arrow/type.h                         |  15 +-
 cpp/src/arrow/types/CMakeLists.txt           |   1 -
 cpp/src/arrow/types/boolean.h                |  32 --
 cpp/src/arrow/types/construct.cc             |   3 +-
 cpp/src/arrow/types/list-test.cc             |   5 +-
 cpp/src/arrow/types/list.h                   |  11 +-
 cpp/src/arrow/types/primitive-test.cc        | 214 ++++++++-----
 cpp/src/arrow/types/primitive.cc             | 179 ++++++++++-
 cpp/src/arrow/types/primitive.h              | 302 ++++++++++++-------
 cpp/src/arrow/types/string-test.cc           |   2 +-
 cpp/src/arrow/util/CMakeLists.txt            |   3 +-
 cpp/src/arrow/util/bit-util.cc               |  14 +-
 cpp/src/arrow/util/bit-util.h                |  13 +-
 python/pyarrow/includes/libarrow.pxd         |   3 +
 python/pyarrow/scalar.pyx                    |   6 +-
 python/pyarrow/tests/test_convert_builtin.py |   5 +-
 python/pyarrow/tests/test_scalars.py         |  39 ++-
 python/src/pyarrow/adapters/builtin.cc       |  26 +-
 24 files changed, 643 insertions(+), 280 deletions(-)
 delete mode 100644 cpp/src/arrow/types/boolean.h

diff --git a/.travis.yml b/.travis.yml
index 49a956ead3dca..d89a200b892e6 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -20,10 +20,10 @@ matrix:
     language: cpp
     os: linux
     before_script:
-    - $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh
-    script:
     - export CC="gcc-4.9"
     - export CXX="g++-4.9"
+    - $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh
+    script:
     - $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh
     - $TRAVIS_BUILD_DIR/ci/travis_script_python.sh
   - compiler: clang
diff --git a/cpp/src/arrow/api.h b/cpp/src/arrow/api.h
index 7be7f88c22eb6..2ae80f642f29d 100644
--- a/cpp/src/arrow/api.h
+++ b/cpp/src/arrow/api.h
@@ -27,7 +27,6 @@
 #include "arrow/table.h"
 #include "arrow/type.h"
 
-#include "arrow/types/boolean.h"
 #include "arrow/types/construct.h"
 #include "arrow/types/list.h"
 #include "arrow/types/primitive.h"
diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc
index 7c6eaf55c0d0f..121b802d994fa 100644
--- a/cpp/src/arrow/array-test.cc
+++ b/cpp/src/arrow/array-test.cc
@@ -71,8 +71,7 @@ TEST_F(TestArray, TestIsNull) {
     if (x == 0) ++null_count;
   }
 
-  std::shared_ptr<Buffer> null_buf = test::bytes_to_null_buffer(null_bitmap.data(),
-      null_bitmap.size());
+  std::shared_ptr<Buffer> null_buf = test::bytes_to_null_buffer(null_bitmap);
   std::unique_ptr<Array> arr;
   arr.reset(new Int32Array(null_bitmap.size(), nullptr, null_count, null_buf));
 
@@ -82,7 +81,7 @@ TEST_F(TestArray, TestIsNull) {
   ASSERT_TRUE(arr->null_bitmap()->Equals(*null_buf.get()));
 
   for (size_t i = 0; i < null_bitmap.size(); ++i) {
-    EXPECT_EQ(static_cast<bool>(null_bitmap[i]), !arr->IsNull(i)) << i;
+    EXPECT_EQ(null_bitmap[i], !arr->IsNull(i)) << i;
   }
 }
diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc
index 6a62dc3b0e08f..4061f35fd5e53 100644
--- a/cpp/src/arrow/builder.cc
+++ b/cpp/src/arrow/builder.cc
@@ -54,5 +54,12 @@ Status ArrayBuilder::Advance(int32_t elements) {
   return Status::OK();
 }
 
+Status ArrayBuilder::Reserve(int32_t elements) {
+  if (length_ + elements > capacity_) {
+    int32_t new_capacity = util::next_power2(length_ + elements);
+    return Resize(new_capacity);
+  }
+  return Status::OK();
+}
 
 } // namespace arrow
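The new ArrayBuilder::Reserve above is the amortized-growth half of the Reserve/Resize split: it reallocates only when the requested headroom exceeds the current capacity, rounding the new capacity up to a power of two. A minimal self-contained sketch of that policy follows; next_power2 is re-derived here for illustration and is an assumption, not the actual Arrow utility.

#include <cassert>
#include <cstdint>

// Hypothetical stand-in for util::next_power2: smallest power of two >= n.
int32_t next_power2(int32_t n) {
  int32_t result = 1;
  while (result < n) {
    result <<= 1;  // double until we cover n
  }
  return result;
}

// Sketch of the Reserve policy: only grow when the requested extra
// elements would overflow the current capacity, so a sequence of appends
// triggers O(log n) reallocations rather than one per append.
struct GrowthSketch {
  int32_t length = 0;
  int32_t capacity = 0;

  void Reserve(int32_t elements) {
    if (length + elements > capacity) {
      capacity = next_power2(length + elements);
    }
  }
};

int main() {
  GrowthSketch b;
  b.Reserve(1000);
  assert(b.capacity == 1024);  // rounded up to the next power of two
  return 0;
}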
diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h
index 308e54c80d794..d1a49dce79961 100644
--- a/cpp/src/arrow/builder.h
+++ b/cpp/src/arrow/builder.h
@@ -69,6 +69,8 @@ class ArrayBuilder {
   // Resizes the null_bitmap array
   Status Resize(int32_t new_bits);
 
+  Status Reserve(int32_t extra_bits);
+
   // For cases where raw data was memcpy'd into the internal buffers, allows us
   // to advance the length of the builder. It is your responsibility to use
   // this function responsibly.
diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h
index ea3ce5f7f53ff..b2bce269992d0 100644
--- a/cpp/src/arrow/test-util.h
+++ b/cpp/src/arrow/test-util.h
@@ -98,28 +98,27 @@ void randint(int64_t N, T lower, T upper, std::vector<T>* out) {
 }
 
+template <typename T>
+void random_real(int n, uint32_t seed, T min_value, T max_value,
+    std::vector<T>* out) {
+  std::mt19937 gen(seed);
+  std::uniform_real_distribution<T> d(min_value, max_value);
+  for (int i = 0; i < n; ++i) {
+    out->push_back(d(gen));
+  }
+}
+
+
 template <typename T>
 std::shared_ptr<Buffer> to_buffer(const std::vector<T>& values) {
   return std::make_shared<Buffer>(reinterpret_cast<const uint8_t*>(values.data()),
       values.size() * sizeof(T));
 }
 
-void random_null_bitmap(int64_t n, double pct_null, std::vector<uint8_t>* null_bitmap) {
-  Random rng(random_seed());
-  for (int i = 0; i < n; ++i) {
-    if (rng.NextDoubleFraction() > pct_null) {
-      null_bitmap->push_back(1);
-    } else {
-      // null
-      null_bitmap->push_back(0);
-    }
-  }
-}
-
-void random_null_bitmap(int64_t n, double pct_null, std::vector<uint8_t>* null_bitmap) {
+void random_null_bitmap(int64_t n, double pct_null, uint8_t* null_bitmap) {
   Random rng(random_seed());
   for (int i = 0; i < n; ++i) {
-    null_bitmap->push_back(rng.NextDoubleFraction() > pct_null);
+    null_bitmap[i] = rng.NextDoubleFraction() > pct_null;
   }
 }
 
@@ -160,11 +159,11 @@ static inline int null_count(const std::vector<uint8_t>& valid_bytes) {
   return result;
 }
 
-std::shared_ptr<Buffer> bytes_to_null_buffer(uint8_t* bytes, int length) {
+std::shared_ptr<Buffer> bytes_to_null_buffer(const std::vector<uint8_t>& bytes) {
   std::shared_ptr<Buffer> out;
 
   // TODO(wesm): error checking
-  util::bytes_to_bits(bytes, length, &out);
+  util::bytes_to_bits(bytes, &out);
   return out;
 }
diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h
index 5984b6718ddbe..86e47791b7cea 100644
--- a/cpp/src/arrow/type.h
+++ b/cpp/src/arrow/type.h
@@ -132,6 +132,10 @@ struct DataType {
     return children_.size();
   }
 
+  virtual int value_size() const {
+    return -1;
+  }
+
   virtual std::string ToString() const = 0;
 };
 
@@ -191,11 +195,14 @@ inline std::string PrimitiveType<Derived>::ToString() const {
 #define PRIMITIVE_DECL(TYPENAME, C_TYPE, ENUM, SIZE, NAME) \
   typedef C_TYPE c_type;                                   \
   static constexpr Type::type type_enum = Type::ENUM;      \
-  static constexpr int size = SIZE;                        \
                                                            \
   TYPENAME()                                               \
       : PrimitiveType<TYPENAME>() {}                       \
                                                            \
+  virtual int value_size() const {                         \
+    return SIZE;                                           \
+  }                                                        \
+                                                           \
   static const char* name() {                              \
     return NAME;                                           \
   }
 
@@ -295,6 +302,12 @@ struct StructType : public DataType {
   std::string ToString() const override;
 };
 
+// These will be defined elsewhere
+template <typename T>
+struct type_traits {
+};
+
+
 } // namespace arrow
 
 #endif // ARROW_TYPE_H
diff --git a/cpp/src/arrow/types/CMakeLists.txt b/cpp/src/arrow/types/CMakeLists.txt
index 595b3be6e1661..f3e41289bfe8d 100644
--- a/cpp/src/arrow/types/CMakeLists.txt
+++ b/cpp/src/arrow/types/CMakeLists.txt
@@ -21,7 +21,6 @@
 
 # Headers: top level
 install(FILES
-  boolean.h
   collection.h
   construct.h
   datetime.h
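The empty type_traits primary template added to type.h above is filled in elsewhere in the series; later hunks in this patch call type_traits<T>::bytes_required. A sketch of what plausible specializations look like, under the assumption that fixed-width types report whole bytes while BooleanType reports bit-packed bytes; the definitions here are stand-ins for illustration, not the actual Arrow ones.

#include <cassert>
#include <cstdint>

template <typename T>
struct type_traits {};  // empty primary template, as in the type.h hunk

struct Int32Type;    // forward declarations used only as template keys
struct BooleanType;

template <>
struct type_traits<Int32Type> {
  // Four bytes per value for a fixed-width 32-bit type.
  static int64_t bytes_required(int32_t elements) {
    return elements * static_cast<int64_t>(sizeof(int32_t));
  }
};

template <>
struct type_traits<BooleanType> {
  // Bit-packed: one bit per value, rounded up to whole bytes.
  static int64_t bytes_required(int32_t elements) {
    return (static_cast<int64_t>(elements) + 7) / 8;
  }
};

int main() {
  assert(type_traits<Int32Type>::bytes_required(3) == 12);
  assert(type_traits<BooleanType>::bytes_required(10) == 2);
  return 0;
}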
diff --git a/cpp/src/arrow/types/boolean.h b/cpp/src/arrow/types/boolean.h
deleted file mode 100644
index 1cb91f9ba4966..0000000000000
--- a/cpp/src/arrow/types/boolean.h
+++ /dev/null
@@ -1,32 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements.  See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership.  The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License.  You may obtain a copy of the License at
-//
-//   http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied.  See the License for the
-// specific language governing permissions and limitations
-// under the License.
-
-#ifndef ARROW_TYPES_BOOLEAN_H
-#define ARROW_TYPES_BOOLEAN_H
-
-#include "arrow/types/primitive.h"
-
-namespace arrow {
-
-// typedef PrimitiveArrayImpl<BooleanType> BooleanArray;
-
-class BooleanBuilder : public ArrayBuilder {
-};
-
-} // namespace arrow
-
-#endif // ARROW_TYPES_BOOLEAN_H
diff --git a/cpp/src/arrow/types/construct.cc b/cpp/src/arrow/types/construct.cc
index df2317c340b2d..34647a5005b90 100644
--- a/cpp/src/arrow/types/construct.cc
+++ b/cpp/src/arrow/types/construct.cc
@@ -51,7 +51,7 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr<DataType>& type,
       BUILDER_CASE(UINT64, UInt64Builder);
       BUILDER_CASE(INT64, Int64Builder);
 
-      // BUILDER_CASE(BOOL, BooleanBuilder);
+      BUILDER_CASE(BOOL, BooleanBuilder);
 
       BUILDER_CASE(FLOAT, FloatBuilder);
       BUILDER_CASE(DOUBLE, DoubleBuilder);
@@ -83,6 +83,7 @@ Status MakePrimitiveArray(const std::shared_ptr<DataType>& type,
     int32_t null_count, const std::shared_ptr<Buffer>& null_bitmap,
     std::shared_ptr<Array>* out) {
   switch (type->type) {
+    MAKE_PRIMITIVE_ARRAY_CASE(BOOL, BooleanArray);
     MAKE_PRIMITIVE_ARRAY_CASE(UINT8, UInt8Array);
     MAKE_PRIMITIVE_ARRAY_CASE(INT8, Int8Array);
     MAKE_PRIMITIVE_ARRAY_CASE(UINT16, UInt16Array);
diff --git a/cpp/src/arrow/types/list-test.cc b/cpp/src/arrow/types/list-test.cc
index eb55ca868eeee..4eb560ea52256 100644
--- a/cpp/src/arrow/types/list-test.cc
+++ b/cpp/src/arrow/types/list-test.cc
@@ -116,11 +116,14 @@ TEST_F(TestListBuilder, TestBasics) {
 
   Int32Builder* vb = static_cast<Int32Builder*>(builder_->value_builder().get());
 
+  EXPECT_OK(builder_->Reserve(lengths.size()));
+  EXPECT_OK(vb->Reserve(values.size()));
+
   int pos = 0;
   for (size_t i = 0; i < lengths.size(); ++i) {
     ASSERT_OK(builder_->Append(is_null[i] > 0));
     for (int j = 0; j < lengths[i]; ++j) {
-      ASSERT_OK(vb->Append(values[pos++]));
+      vb->Append(values[pos++]);
     }
   }
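The list-test change above illustrates the calling convention this patch moves toward: call Reserve once with the total element count, then use the cheaper unchecked Append per value. A hedged, self-contained sketch of that pattern against a toy builder; the names and signatures below are stand-ins, not the Arrow API.

#include <cassert>
#include <cstdint>
#include <vector>

// Toy builder mirroring the Reserve-then-Append convention: Reserve does
// the (fallible) allocation once; Append is then cheap and cannot fail.
class Int32BuilderSketch {
 public:
  bool Reserve(size_t elements) {
    values_.reserve(values_.size() + elements);  // single up-front growth
    return true;  // the real API returns a Status here
  }
  void Append(int32_t value) {
    assert(values_.size() < values_.capacity());  // caller reserved space
    values_.push_back(value);  // no per-call error path
  }
  size_t length() const { return values_.size(); }

 private:
  std::vector<int32_t> values_;
};

int main() {
  Int32BuilderSketch builder;
  builder.Reserve(3);
  for (int32_t v : {4, 8, 15}) builder.Append(v);
  assert(builder.length() == 3);
  return 0;
}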
diff --git a/cpp/src/arrow/types/list.h b/cpp/src/arrow/types/list.h
index 72e20e943c347..8073b5121764d 100644
--- a/cpp/src/arrow/types/list.h
+++ b/cpp/src/arrow/types/list.h
@@ -116,7 +116,8 @@ class ListBuilder : public Int32Builder {
       int32_t new_capacity = util::next_power2(length_ + length);
       RETURN_NOT_OK(Resize(new_capacity));
     }
-    memcpy(raw_buffer() + length_, values, length * elsize_);
+    memcpy(raw_data_ + length_, values,
+        type_traits<Int32Type>::bytes_required(length));
 
     if (valid_bytes != nullptr) {
       AppendNulls(valid_bytes, length);
@@ -132,13 +133,13 @@ class ListBuilder : public Int32Builder {
 
     // Add final offset if the length is non-zero
     if (length_) {
-      raw_buffer()[length_] = items->length();
+      raw_data_[length_] = items->length();
     }
 
-    auto result = std::make_shared<ListArray>(type_, length_, values_, items,
+    auto result = std::make_shared<ListArray>(type_, length_, data_, items,
         null_count_, null_bitmap_);
 
-    values_ = null_bitmap_ = nullptr;
+    data_ = null_bitmap_ = nullptr;
     capacity_ = length_ = null_count_ = 0;
 
     return result;
@@ -162,7 +163,7 @@ class ListBuilder : public Int32Builder {
     } else {
       util::set_bit(null_bitmap_data_, length_);
     }
-    raw_buffer()[length_++] = value_builder_->length();
+    raw_data_[length_++] = value_builder_->length();
     return Status::OK();
   }
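For orientation, the raw_data_ this ListBuilder writes is an int32 offsets buffer, with one trailing offset appended at Finish so that list i spans [offsets[i], offsets[i + 1]) of the child values array. A toy illustration of that layout in plain standard C++, not the Arrow classes:

#include <cassert>
#include <cstdint>
#include <vector>

// The lists [[0, 1, 2], [], [3, 4]] flatten into one child values array
// plus an offsets array whose final entry equals values.size().
int main() {
  std::vector<int32_t> values = {0, 1, 2, 3, 4};
  std::vector<int32_t> offsets = {0, 3, 3, 5};

  assert(offsets[1] - offsets[0] == 3);  // first list has 3 elements
  assert(offsets[2] - offsets[1] == 0);  // second list is empty
  assert(offsets.back() == static_cast<int32_t>(values.size()));
  return 0;
}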
diff --git a/cpp/src/arrow/types/primitive-test.cc b/cpp/src/arrow/types/primitive-test.cc
index 10ba113c5916c..761845d93812a 100644
--- a/cpp/src/arrow/types/primitive-test.cc
+++ b/cpp/src/arrow/types/primitive-test.cc
@@ -69,11 +69,11 @@ PRIMITIVE_TEST(BooleanType, BOOL, "bool");
 // ----------------------------------------------------------------------
 // Primitive type tests
 
-TEST_F(TestBuilder, TestResize) {
+TEST_F(TestBuilder, TestReserve) {
   builder_->Init(10);
   ASSERT_EQ(2, builder_->null_bitmap()->size());
 
-  builder_->Resize(30);
+  builder_->Reserve(30);
   ASSERT_EQ(4, builder_->null_bitmap()->size());
 }
 
@@ -83,8 +83,9 @@ class TestPrimitiveBuilder : public TestBuilder {
   typedef typename Attrs::ArrayType ArrayType;
   typedef typename Attrs::BuilderType BuilderType;
   typedef typename Attrs::T T;
+  typedef typename Attrs::Type Type;
 
-  void SetUp() {
+  virtual void SetUp() {
     TestBuilder::SetUp();
 
     type_ = Attrs::type();
@@ -99,58 +100,44 @@ class TestPrimitiveBuilder : public TestBuilder {
 
   void RandomData(int N, double pct_null = 0.1) {
     Attrs::draw(N, &draws_);
-    test::random_null_bitmap(N, pct_null, &valid_bytes_);
+
+    valid_bytes_.resize(N);
+    test::random_null_bitmap(N, pct_null, valid_bytes_.data());
   }
 
-  void CheckNullable() {
-    int size = builder_->length();
+  void Check(const std::shared_ptr<BuilderType>& builder, bool nullable) {
+    int size = builder->length();
 
-    auto ex_data = std::make_shared<Buffer>(
-        reinterpret_cast<uint8_t*>(draws_.data()),
+    auto ex_data = std::make_shared<Buffer>(reinterpret_cast<uint8_t*>(draws_.data()),
         size * sizeof(T));
 
-    auto ex_null_bitmap = test::bytes_to_null_buffer(valid_bytes_.data(), size);
-    int32_t ex_null_count = test::null_count(valid_bytes_);
+    std::shared_ptr<Buffer> ex_null_bitmap;
+    int32_t ex_null_count = 0;
+
+    if (nullable) {
+      ex_null_bitmap = test::bytes_to_null_buffer(valid_bytes_);
+      ex_null_count = test::null_count(valid_bytes_);
+    } else {
+      ex_null_bitmap = nullptr;
+    }
 
     auto expected = std::make_shared<ArrayType>(size, ex_data, ex_null_count,
         ex_null_bitmap);
-
     std::shared_ptr<ArrayType> result = std::dynamic_pointer_cast<ArrayType>(
-        builder_->Finish());
+        builder->Finish());
 
     // Builder is now reset
-    ASSERT_EQ(0, builder_->length());
-    ASSERT_EQ(0, builder_->capacity());
-    ASSERT_EQ(0, builder_->null_count());
-    ASSERT_EQ(nullptr, builder_->buffer());
+    ASSERT_EQ(0, builder->length());
+    ASSERT_EQ(0, builder->capacity());
+    ASSERT_EQ(0, builder->null_count());
+    ASSERT_EQ(nullptr, builder->data());
 
     ASSERT_EQ(ex_null_count, result->null_count());
     ASSERT_TRUE(result->EqualsExact(*expected.get()));
   }
 
-  void CheckNonNullable() {
-    int size = builder_nn_->length();
-
-    auto ex_data = std::make_shared<Buffer>(reinterpret_cast<uint8_t*>(draws_.data()),
-        size * sizeof(T));
-
-    auto expected = std::make_shared<ArrayType>(size, ex_data);
-
-    std::shared_ptr<ArrayType> result = std::dynamic_pointer_cast<ArrayType>(
-        builder_nn_->Finish());
-
-    // Builder is now reset
-    ASSERT_EQ(0, builder_nn_->length());
-    ASSERT_EQ(0, builder_nn_->capacity());
-    ASSERT_EQ(nullptr, builder_nn_->buffer());
-
-    ASSERT_TRUE(result->EqualsExact(*expected.get()));
-    ASSERT_EQ(0, result->null_count());
-  }
-
  protected:
-  TypePtr type_;
-  TypePtr type_nn_;
+  std::shared_ptr<DataType> type_;
 
   shared_ptr<BuilderType> builder_;
   shared_ptr<BuilderType> builder_nn_;
@@ -158,14 +145,14 @@ class TestPrimitiveBuilder : public TestBuilder {
   vector<uint8_t> valid_bytes_;
 };
 
-#define PTYPE_DECL(CapType, c_type) \
-  typedef CapType##Array ArrayType; \
-  typedef CapType##Builder BuilderType; \
-  typedef CapType##Type Type; \
-  typedef c_type T; \
-  \
-  static TypePtr type() { \
-    return TypePtr(new Type()); \
+#define PTYPE_DECL(CapType, c_type)                 \
+  typedef CapType##Array ArrayType;                 \
+  typedef CapType##Builder BuilderType;             \
+  typedef CapType##Type Type;                       \
+  typedef c_type T;                                 \
+                                                    \
+  static std::shared_ptr<DataType> type() {         \
+    return std::shared_ptr<DataType>(new Type());   \
   }
 
@@ -176,6 +163,14 @@ class TestPrimitiveBuilder : public TestBuilder {
     } \
   }
 
+#define PFLOAT_DECL(CapType, c_type, LOWER, UPPER)          \
+  struct P##CapType {                                       \
+    PTYPE_DECL(CapType, c_type);                            \
+    static void draw(int N, vector<c_type>* draws) {        \
+      test::random_real<c_type>(N, 0, LOWER, UPPER, draws); \
+    }                                                       \
+  }
+
 PINT_DECL(UInt8, uint8_t, 0, UINT8_MAX);
 PINT_DECL(UInt16, uint16_t, 0, UINT16_MAX);
 PINT_DECL(UInt32, uint32_t, 0, UINT32_MAX);
@@ -186,25 +181,89 @@ PINT_DECL(Int16, int16_t, INT16_MIN, INT16_MAX);
 PINT_DECL(Int32, int32_t, INT32_MIN, INT32_MAX);
 PINT_DECL(Int64, int64_t, INT64_MIN, INT64_MAX);
 
-typedef ::testing::Types<PUInt8, PUInt16, PUInt32, PUInt64,
-    PInt8, PInt16, PInt32, PInt64> Primitives;
+PFLOAT_DECL(Float, float, -1000, 1000);
+PFLOAT_DECL(Double, double, -1000, 1000);
+
+struct PBoolean {
+  PTYPE_DECL(Boolean, uint8_t);
+};
+
+template <>
+void TestPrimitiveBuilder<PBoolean>::RandomData(int N, double pct_null) {
+  draws_.resize(N);
+  valid_bytes_.resize(N);
+
+  test::random_null_bitmap(N, 0.5, draws_.data());
+  test::random_null_bitmap(N, pct_null, valid_bytes_.data());
+}
+
+template <>
+void TestPrimitiveBuilder<PBoolean>::Check(
+    const std::shared_ptr<BooleanBuilder>& builder, bool nullable) {
+  int size = builder->length();
+
+  auto ex_data = test::bytes_to_null_buffer(draws_);
+
+  std::shared_ptr<Buffer> ex_null_bitmap;
+  int32_t ex_null_count = 0;
+
+  if (nullable) {
+    ex_null_bitmap = test::bytes_to_null_buffer(valid_bytes_);
+    ex_null_count = test::null_count(valid_bytes_);
+  } else {
+    ex_null_bitmap = nullptr;
+  }
+
+  auto expected = std::make_shared<BooleanArray>(size, ex_data, ex_null_count,
+      ex_null_bitmap);
+  std::shared_ptr<BooleanArray> result = std::dynamic_pointer_cast<BooleanArray>(
+      builder->Finish());
+
+  // Builder is now reset
+  ASSERT_EQ(0, builder->length());
+  ASSERT_EQ(0, builder->capacity());
+  ASSERT_EQ(0, builder->null_count());
+  ASSERT_EQ(nullptr, builder->data());
+
+  ASSERT_EQ(ex_null_count, result->null_count());
+
+  ASSERT_EQ(expected->length(), result->length());
+
+  for (int i = 0; i < result->length(); ++i) {
+    if (nullable) {
+      ASSERT_EQ(valid_bytes_[i] == 0, result->IsNull(i)) << i;
+    }
+    bool actual = util::get_bit(result->raw_data(), i);
+    ASSERT_EQ(static_cast<bool>(draws_[i]), actual) << i;
+  }
+  ASSERT_TRUE(result->EqualsExact(*expected.get()));
+}
+
+typedef ::testing::Types<PBoolean, PUInt8, PUInt16, PUInt32, PUInt64,
+    PInt8, PInt16, PInt32, PInt64, PFloat, PDouble> Primitives;
 
 TYPED_TEST_CASE(TestPrimitiveBuilder, Primitives);
 
 #define DECL_T() \
   typedef typename TestFixture::T T;
 
+#define DECL_TYPE() \
+  typedef typename TestFixture::Type Type;
+
 #define DECL_ARRAYTYPE() \
   typedef typename TestFixture::ArrayType ArrayType;
 
 TYPED_TEST(TestPrimitiveBuilder, TestInit) {
-  DECL_T();
+  DECL_TYPE();
 
   int n = 1000;
-  ASSERT_OK(this->builder_->Init(n));
-  ASSERT_EQ(n, this->builder_->capacity());
-  ASSERT_EQ(n * sizeof(T), this->builder_->buffer()->size());
+  ASSERT_OK(this->builder_->Reserve(n));
+  ASSERT_EQ(util::next_power2(n), this->builder_->capacity());
+  ASSERT_EQ(util::next_power2(type_traits<Type>::bytes_required(n)),
this->builder_->data()->size()); // unsure if this should go in all builder classes ASSERT_EQ(0, this->builder_->num_children()); @@ -235,12 +294,14 @@ TYPED_TEST(TestPrimitiveBuilder, TestArrayDtorDealloc) { this->RandomData(size); + this->builder_->Reserve(size); + int i; for (i = 0; i < size; ++i) { if (valid_bytes[i] > 0) { - ASSERT_OK(this->builder_->Append(draws[i])); + this->builder_->Append(draws[i]); } else { - ASSERT_OK(this->builder_->AppendNull()); + this->builder_->AppendNull(); } } @@ -261,31 +322,41 @@ TYPED_TEST(TestPrimitiveBuilder, TestAppendScalar) { this->RandomData(size); + this->builder_->Reserve(1000); + this->builder_nn_->Reserve(1000); + int i; + int null_count = 0; // Append the first 1000 for (i = 0; i < 1000; ++i) { if (valid_bytes[i] > 0) { - ASSERT_OK(this->builder_->Append(draws[i])); + this->builder_->Append(draws[i]); } else { - ASSERT_OK(this->builder_->AppendNull()); + this->builder_->AppendNull(); + ++null_count; } - ASSERT_OK(this->builder_nn_->Append(draws[i])); + this->builder_nn_->Append(draws[i]); } + ASSERT_EQ(null_count, this->builder_->null_count()); + ASSERT_EQ(1000, this->builder_->length()); ASSERT_EQ(1024, this->builder_->capacity()); ASSERT_EQ(1000, this->builder_nn_->length()); ASSERT_EQ(1024, this->builder_nn_->capacity()); + this->builder_->Reserve(size - 1000); + this->builder_nn_->Reserve(size - 1000); + // Append the next 9000 for (i = 1000; i < size; ++i) { if (valid_bytes[i] > 0) { - ASSERT_OK(this->builder_->Append(draws[i])); + this->builder_->Append(draws[i]); } else { - ASSERT_OK(this->builder_->AppendNull()); + this->builder_->AppendNull(); } - ASSERT_OK(this->builder_nn_->Append(draws[i])); + this->builder_nn_->Append(draws[i]); } ASSERT_EQ(size, this->builder_->length()); @@ -294,8 +365,8 @@ TYPED_TEST(TestPrimitiveBuilder, TestAppendScalar) { ASSERT_EQ(size, this->builder_nn_->length()); ASSERT_EQ(util::next_power2(size), this->builder_nn_->capacity()); - this->CheckNullable(); - this->CheckNonNullable(); + this->Check(this->builder_, true); + this->Check(this->builder_nn_, false); } @@ -327,31 +398,34 @@ TYPED_TEST(TestPrimitiveBuilder, TestAppendVector) { ASSERT_EQ(size, this->builder_->length()); ASSERT_EQ(util::next_power2(size), this->builder_->capacity()); - this->CheckNullable(); - this->CheckNonNullable(); + this->Check(this->builder_, true); + this->Check(this->builder_nn_, false); } TYPED_TEST(TestPrimitiveBuilder, TestAdvance) { int n = 1000; - ASSERT_OK(this->builder_->Init(n)); + ASSERT_OK(this->builder_->Reserve(n)); ASSERT_OK(this->builder_->Advance(100)); ASSERT_EQ(100, this->builder_->length()); ASSERT_OK(this->builder_->Advance(900)); - ASSERT_RAISES(Invalid, this->builder_->Advance(1)); + + int too_many = this->builder_->capacity() - 1000 + 1; + ASSERT_RAISES(Invalid, this->builder_->Advance(too_many)); } TYPED_TEST(TestPrimitiveBuilder, TestResize) { - DECL_T(); + DECL_TYPE(); int cap = MIN_BUILDER_CAPACITY * 2; - ASSERT_OK(this->builder_->Resize(cap)); + ASSERT_OK(this->builder_->Reserve(cap)); ASSERT_EQ(cap, this->builder_->capacity()); - ASSERT_EQ(cap * sizeof(T), this->builder_->buffer()->size()); - ASSERT_EQ(util::ceil_byte(cap) / 8, this->builder_->null_bitmap()->size()); + ASSERT_EQ(type_traits::bytes_required(cap), this->builder_->data()->size()); + ASSERT_EQ(util::bytes_for_bits(cap), + this->builder_->null_bitmap()->size()); } TYPED_TEST(TestPrimitiveBuilder, TestReserve) { diff --git a/cpp/src/arrow/types/primitive.cc b/cpp/src/arrow/types/primitive.cc index ecd5d68ff45a8..c54d0757c4789 100644 
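The revised tests above encode a new builder contract: callers Reserve capacity up front (Reserve rounds the requested capacity up to the next power of two), the scalar Append path no longer performs a per-call capacity check, and Finish resets the builder to an empty state. A minimal sketch of that calling convention, assuming only the names introduced by this patch (Int32Builder, Reserve, Append, AppendNull, Finish); illustrative only, not part of the diff:

// Reserve-then-Append, as exercised by TestAppendScalar above.
Status BuildExample(MemoryPool* pool, std::shared_ptr<Array>* out) {
  Int32Builder builder(pool, std::make_shared<Int32Type>());
  RETURN_NOT_OK(builder.Reserve(1000));  // capacity becomes util::next_power2(1000) == 1024
  for (int32_t i = 0; i < 1000; ++i) {
    if (i % 10 == 0) {
      builder.AppendNull();  // leaves the null bitmap bit unset, bumps null_count_
    } else {
      builder.Append(i);     // unchecked scalar append; safe only after Reserve
    }
  }
  *out = builder.Finish();   // builder resets: length, capacity, null count all zero
  return Status::OK();
}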
--- a/cpp/src/arrow/types/primitive.cc +++ b/cpp/src/arrow/types/primitive.cc @@ -20,20 +20,20 @@ #include #include "arrow/util/buffer.h" +#include "arrow/util/logging.h" namespace arrow { // ---------------------------------------------------------------------- // Primitive array base -PrimitiveArray::PrimitiveArray(const TypePtr& type, int32_t length, int value_size, +PrimitiveArray::PrimitiveArray(const TypePtr& type, int32_t length, const std::shared_ptr& data, int32_t null_count, const std::shared_ptr& null_bitmap) : Array(type, length, null_count, null_bitmap) { data_ = data; raw_data_ = data == nullptr? nullptr : data_->data(); - value_size_ = value_size; } bool PrimitiveArray::EqualsExact(const PrimitiveArray& other) const { @@ -52,12 +52,15 @@ bool PrimitiveArray::EqualsExact(const PrimitiveArray& other) const { const uint8_t* this_data = raw_data_; const uint8_t* other_data = other.raw_data_; + int value_size = type_->value_size(); + DCHECK_GT(value_size, 0); + for (int i = 0; i < length_; ++i) { - if (!IsNull(i) && memcmp(this_data, other_data, value_size_)) { + if (!IsNull(i) && memcmp(this_data, other_data, value_size)) { return false; } - this_data += value_size_; - other_data += value_size_; + this_data += value_size; + other_data += value_size; } return true; } else { @@ -73,4 +76,170 @@ bool PrimitiveArray::Equals(const std::shared_ptr& arr) const { return EqualsExact(*static_cast(arr.get())); } +template +Status PrimitiveBuilder::Init(int32_t capacity) { + RETURN_NOT_OK(ArrayBuilder::Init(capacity)); + data_ = std::make_shared(pool_); + + int64_t nbytes = type_traits::bytes_required(capacity); + RETURN_NOT_OK(data_->Resize(nbytes)); + memset(data_->mutable_data(), 0, nbytes); + + raw_data_ = reinterpret_cast(data_->mutable_data()); + return Status::OK(); +} + +template +Status PrimitiveBuilder::Resize(int32_t capacity) { + // XXX: Set floor size for now + if (capacity < MIN_BUILDER_CAPACITY) { + capacity = MIN_BUILDER_CAPACITY; + } + + if (capacity_ == 0) { + RETURN_NOT_OK(Init(capacity)); + } else { + RETURN_NOT_OK(ArrayBuilder::Resize(capacity)); + + int64_t old_bytes = data_->size(); + int64_t new_bytes = type_traits::bytes_required(capacity); + RETURN_NOT_OK(data_->Resize(new_bytes)); + raw_data_ = reinterpret_cast(data_->mutable_data()); + + memset(data_->mutable_data() + old_bytes, 0, new_bytes - old_bytes); + } + capacity_ = capacity; + return Status::OK(); +} + +template +Status PrimitiveBuilder::Reserve(int32_t elements) { + if (length_ + elements > capacity_) { + int32_t new_capacity = util::next_power2(length_ + elements); + return Resize(new_capacity); + } + return Status::OK(); +} + +template +Status PrimitiveBuilder::Append(const value_type* values, int32_t length, + const uint8_t* valid_bytes) { + RETURN_NOT_OK(PrimitiveBuilder::Reserve(length)); + + if (length > 0) { + memcpy(raw_data_ + length_, values, type_traits::bytes_required(length)); + } + + if (valid_bytes != nullptr) { + PrimitiveBuilder::AppendNulls(valid_bytes, length); + } else { + for (int i = 0; i < length; ++i) { + util::set_bit(null_bitmap_data_, length_ + i); + } + } + + length_ += length; + return Status::OK(); +} + +template +void PrimitiveBuilder::AppendNulls(const uint8_t* valid_bytes, int32_t length) { + // If valid_bytes is all not null, then none of the values are null + for (int i = 0; i < length; ++i) { + if (valid_bytes[i] == 0) { + ++null_count_; + } else { + util::set_bit(null_bitmap_data_, length_ + i); + } + } +} + +template +std::shared_ptr PrimitiveBuilder::Finish() { + 
std::shared_ptr result = std::make_shared< + typename type_traits::ArrayType>( + type_, length_, data_, null_count_, null_bitmap_); + + data_ = null_bitmap_ = nullptr; + capacity_ = length_ = null_count_ = 0; + return result; +} + +template <> +Status PrimitiveBuilder::Append(const uint8_t* values, int32_t length, + const uint8_t* valid_bytes) { + RETURN_NOT_OK(Reserve(length)); + + for (int i = 0; i < length; ++i) { + if (values[i] > 0) { + util::set_bit(raw_data_, length_ + i); + } else { + util::clear_bit(raw_data_, length_ + i); + } + } + + if (valid_bytes != nullptr) { + PrimitiveBuilder::AppendNulls(valid_bytes, length); + } else { + for (int i = 0; i < length; ++i) { + util::set_bit(null_bitmap_data_, length_ + i); + } + } + length_ += length; + return Status::OK(); +} + +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; + +BooleanArray::BooleanArray(int32_t length, const std::shared_ptr& data, + int32_t null_count, + const std::shared_ptr& null_bitmap) : + PrimitiveArray(std::make_shared(), length, + data, null_count, null_bitmap) {} + +bool BooleanArray::EqualsExact(const BooleanArray& other) const { + if (this == &other) return true; + if (null_count_ != other.null_count_) { + return false; + } + + if (null_count_ > 0) { + bool equal_bitmap = null_bitmap_->Equals(*other.null_bitmap_, + util::bytes_for_bits(length_)); + if (!equal_bitmap) { + return false; + } + + const uint8_t* this_data = raw_data_; + const uint8_t* other_data = other.raw_data_; + + for (int i = 0; i < length_; ++i) { + if (!IsNull(i) && util::get_bit(this_data, i) != util::get_bit(other_data, i)) { + return false; + } + } + return true; + } else { + return data_->Equals(*other.data_, util::bytes_for_bits(length_)); + } +} + +bool BooleanArray::Equals(const std::shared_ptr& arr) const { + if (this == arr.get()) return true; + if (Type::BOOL != arr->type_enum()) { + return false; + } + return EqualsExact(*static_cast(arr.get())); +} + } // namespace arrow diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index 4eaff433229e0..ec6fee35513ce 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -21,6 +21,7 @@ #include #include #include +#include #include "arrow/array.h" #include "arrow/builder.h" @@ -37,7 +38,7 @@ class MemoryPool; // Base class for fixed-size logical types class PrimitiveArray : public Array { public: - PrimitiveArray(const TypePtr& type, int32_t length, int value_size, + PrimitiveArray(const TypePtr& type, int32_t length, const std::shared_ptr& data, int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); @@ -51,25 +52,19 @@ class PrimitiveArray : public Array { protected: std::shared_ptr data_; const uint8_t* raw_data_; - int value_size_; }; #define NUMERIC_ARRAY_DECL(NAME, TypeClass, T) \ class NAME : public PrimitiveArray { \ public: \ using value_type = T; \ - NAME(const TypePtr& type, int32_t length, \ - const std::shared_ptr& data, \ - int32_t null_count = 0, \ - const std::shared_ptr& null_bitmap = nullptr) : \ - PrimitiveArray(std::make_shared(), length, \ - sizeof(T), data, null_count, null_bitmap) {} \ + using PrimitiveArray::PrimitiveArray; \ \ NAME(int32_t length, const std::shared_ptr& data, \ 
int32_t null_count = 0, \ const std::shared_ptr& null_bitmap = nullptr) : \ PrimitiveArray(std::make_shared(), length, \ - sizeof(T), data, null_count, null_bitmap) {} \ + data, null_count, null_bitmap) {} \ \ bool EqualsExact(const NAME& other) const { \ return PrimitiveArray::EqualsExact( \ @@ -96,148 +91,241 @@ NUMERIC_ARRAY_DECL(Int64Array, Int64Type, int64_t); NUMERIC_ARRAY_DECL(FloatArray, FloatType, float); NUMERIC_ARRAY_DECL(DoubleArray, DoubleType, double); -template +template class PrimitiveBuilder : public ArrayBuilder { public: typedef typename Type::c_type value_type; explicit PrimitiveBuilder(MemoryPool* pool, const TypePtr& type) : ArrayBuilder(pool, type), - values_(nullptr) { - elsize_ = sizeof(value_type); - } + data_(nullptr) {} virtual ~PrimitiveBuilder() {} - Status Resize(int32_t capacity) { - // XXX: Set floor size for now - if (capacity < MIN_BUILDER_CAPACITY) { - capacity = MIN_BUILDER_CAPACITY; - } - - if (capacity_ == 0) { - RETURN_NOT_OK(Init(capacity)); - } else { - RETURN_NOT_OK(ArrayBuilder::Resize(capacity)); - RETURN_NOT_OK(values_->Resize(capacity * elsize_)); - } - capacity_ = capacity; - return Status::OK(); - } - - Status Init(int32_t capacity) { - RETURN_NOT_OK(ArrayBuilder::Init(capacity)); - values_ = std::make_shared(pool_); - return values_->Resize(capacity * elsize_); - } - - Status Reserve(int32_t elements) { - if (length_ + elements > capacity_) { - int32_t new_capacity = util::next_power2(length_ + elements); - return Resize(new_capacity); - } - return Status::OK(); - } + using ArrayBuilder::Advance; - Status Advance(int32_t elements) { - return ArrayBuilder::Advance(elements); - } + // Write nulls as uint8_t* (0 value indicates null) into pre-allocated memory + void AppendNulls(const uint8_t* valid_bytes, int32_t length); - // Scalar append - Status Append(value_type val) { + Status AppendNull() { if (length_ == capacity_) { // If the capacity was not already a multiple of 2, do so here RETURN_NOT_OK(Resize(util::next_power2(capacity_ + 1))); } - util::set_bit(null_bitmap_data_, length_); - raw_buffer()[length_++] = val; + ++null_count_; + ++length_; return Status::OK(); } + std::shared_ptr data() const { + return data_; + } + // Vector append // // If passed, valid_bytes is of equal length to values, and any zero byte // will be considered as a null for that slot Status Append(const value_type* values, int32_t length, - const uint8_t* valid_bytes = nullptr) { - if (length_ + length > capacity_) { - int32_t new_capacity = util::next_power2(length_ + length); - RETURN_NOT_OK(Resize(new_capacity)); - } - if (length > 0) { - memcpy(raw_buffer() + length_, values, length * elsize_); - } + const uint8_t* valid_bytes = nullptr); - if (valid_bytes != nullptr) { - AppendNulls(valid_bytes, length); - } else { - for (int i = 0; i < length; ++i) { - util::set_bit(null_bitmap_data_, length_ + i); - } - } + // Ensure that builder can accommodate an additional number of + // elements. 
Resizes if the current capacity is not sufficient + Status Reserve(int32_t elements); - length_ += length; - return Status::OK(); + std::shared_ptr Finish() override; + + protected: + std::shared_ptr data_; + value_type* raw_data_; + + Status Init(int32_t capacity); + + // Increase the capacity of the builder to accommodate at least the indicated + // number of elements + Status Resize(int32_t capacity); +}; + +template +class NumericBuilder : public PrimitiveBuilder { + public: + using typename PrimitiveBuilder::value_type; + using PrimitiveBuilder::PrimitiveBuilder; + + using PrimitiveBuilder::Append; + + // Scalar append. Does not capacity-check; make sure to call Reserve beforehand + void Append(value_type val) { + util::set_bit(null_bitmap_data_, length_); + raw_data_[length_++] = val; } - // Write nulls as uint8_t* (0 value indicates null) into pre-allocated memory - void AppendNulls(const uint8_t* valid_bytes, int32_t length) { - // If valid_bytes is all not null, then none of the values are null - for (int i = 0; i < length; ++i) { - if (valid_bytes[i] == 0) { - ++null_count_; - } else { - util::set_bit(null_bitmap_data_, length_ + i); - } - } + protected: + using PrimitiveBuilder::length_; + using PrimitiveBuilder::null_bitmap_data_; + using PrimitiveBuilder::raw_data_; + + using PrimitiveBuilder::Init; + using PrimitiveBuilder::Resize; +}; + +template <> +struct type_traits { + typedef UInt8Array ArrayType; + + static inline int bytes_required(int elements) { + return elements; } +}; - Status AppendNull() { - if (length_ == capacity_) { - // If the capacity was not already a multiple of 2, do so here - RETURN_NOT_OK(Resize(util::next_power2(capacity_ + 1))); - } - ++null_count_; - ++length_; - return Status::OK(); +template <> +struct type_traits { + typedef Int8Array ArrayType; + + static inline int bytes_required(int elements) { + return elements; } +}; - std::shared_ptr Finish() override { - std::shared_ptr result = std::make_shared( - type_, length_, values_, null_count_, null_bitmap_); +template <> +struct type_traits { + typedef UInt16Array ArrayType; - values_ = null_bitmap_ = nullptr; - capacity_ = length_ = null_count_ = 0; - return result; + static inline int bytes_required(int elements) { + return elements * sizeof(uint16_t); } +}; + +template <> +struct type_traits { + typedef Int16Array ArrayType; + + static inline int bytes_required(int elements) { + return elements * sizeof(int16_t); + } +}; - value_type* raw_buffer() { - return reinterpret_cast(values_->mutable_data()); +template <> +struct type_traits { + typedef UInt32Array ArrayType; + + static inline int bytes_required(int elements) { + return elements * sizeof(uint32_t); } +}; - std::shared_ptr buffer() const { - return values_; +template <> +struct type_traits { + typedef Int32Array ArrayType; + + static inline int bytes_required(int elements) { + return elements * sizeof(int32_t); } +}; - protected: - std::shared_ptr values_; - int elsize_; +template <> +struct type_traits { + typedef UInt64Array ArrayType; + + static inline int bytes_required(int elements) { + return elements * sizeof(uint64_t); + } +}; + +template <> +struct type_traits { + typedef Int64Array ArrayType; + + static inline int bytes_required(int elements) { + return elements * sizeof(int64_t); + } +}; +template <> +struct type_traits { + typedef FloatArray ArrayType; + + static inline int bytes_required(int elements) { + return elements * sizeof(float); + } +}; + +template <> +struct type_traits { + typedef DoubleArray ArrayType; + + static 
inline int bytes_required(int elements) { + return elements * sizeof(double); + } }; // Builders -typedef PrimitiveBuilder UInt8Builder; -typedef PrimitiveBuilder UInt16Builder; -typedef PrimitiveBuilder UInt32Builder; -typedef PrimitiveBuilder UInt64Builder; +typedef NumericBuilder UInt8Builder; +typedef NumericBuilder UInt16Builder; +typedef NumericBuilder UInt32Builder; +typedef NumericBuilder UInt64Builder; + +typedef NumericBuilder Int8Builder; +typedef NumericBuilder Int16Builder; +typedef NumericBuilder Int32Builder; +typedef NumericBuilder Int64Builder; + +typedef NumericBuilder FloatBuilder; +typedef NumericBuilder DoubleBuilder; + -typedef PrimitiveBuilder Int8Builder; -typedef PrimitiveBuilder Int16Builder; -typedef PrimitiveBuilder Int32Builder; -typedef PrimitiveBuilder Int64Builder; +class BooleanArray : public PrimitiveArray { + public: + using PrimitiveArray::PrimitiveArray; + + BooleanArray(int32_t length, const std::shared_ptr& data, + int32_t null_count = 0, + const std::shared_ptr& null_bitmap = nullptr); + + bool EqualsExact(const BooleanArray& other) const; + bool Equals(const std::shared_ptr& arr) const override; + + const uint8_t* raw_data() const { + return reinterpret_cast(raw_data_); + } + + bool Value(int i) const { + return util::get_bit(raw_data(), i); + } +}; + +template <> +struct type_traits { + typedef BooleanArray ArrayType; + + static inline int bytes_required(int elements) { + return util::bytes_for_bits(elements); + } +}; + +class BooleanBuilder : public PrimitiveBuilder { + public: + explicit BooleanBuilder(MemoryPool* pool, const TypePtr& type) : + PrimitiveBuilder(pool, type) {} + + virtual ~BooleanBuilder() {} + + using PrimitiveBuilder::Append; + + // Scalar append + void Append(bool val) { + util::set_bit(null_bitmap_data_, length_); + if (val) { + util::set_bit(raw_data_, length_); + } else { + util::clear_bit(raw_data_, length_); + } + ++length_; + } -typedef PrimitiveBuilder FloatBuilder; -typedef PrimitiveBuilder DoubleBuilder; + void Append(uint8_t val) { + Append(static_cast(val)); + } +}; } // namespace arrow diff --git a/cpp/src/arrow/types/string-test.cc b/cpp/src/arrow/types/string-test.cc index b329b4f0ef7e1..d3a4cc37f9c4c 100644 --- a/cpp/src/arrow/types/string-test.cc +++ b/cpp/src/arrow/types/string-test.cc @@ -92,7 +92,7 @@ class TestStringContainer : public ::testing::Test { offsets_buf_ = test::to_buffer(offsets_); - null_bitmap_ = test::bytes_to_null_buffer(valid_bytes_.data(), valid_bytes_.size()); + null_bitmap_ = test::bytes_to_null_buffer(valid_bytes_); null_count_ = test::null_count(valid_bytes_); strings_ = std::make_shared(length_, offsets_buf_, values_, diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt index fed05e3690c74..d2a4b091fada5 100644 --- a/cpp/src/arrow/util/CMakeLists.txt +++ b/cpp/src/arrow/util/CMakeLists.txt @@ -23,6 +23,7 @@ install(FILES bit-util.h buffer.h + logging.h macros.h memory-pool.h status.h @@ -59,7 +60,7 @@ if (ARROW_BUILD_BENCHMARKS) ) else() target_link_libraries(arrow_benchmark_main - benchmark + benchmark pthread ) endif() diff --git a/cpp/src/arrow/util/bit-util.cc b/cpp/src/arrow/util/bit-util.cc index 292cb33887ffa..6c6d5330eab0d 100644 --- a/cpp/src/arrow/util/bit-util.cc +++ b/cpp/src/arrow/util/bit-util.cc @@ -16,6 +16,7 @@ // under the License. 
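The type_traits specializations above give the builders one compile-time hook for sizing the values buffer: numeric types need elements * sizeof(T) bytes, while BooleanType is bit-packed through util::bytes_for_bits. A small sketch of the dispatch, illustrative and using only names declared above:

// Generic buffer sizing via type_traits, as Init/Resize in primitive.cc use it.
template <typename Type>
int64_t values_buffer_size(int32_t elements) {
  return type_traits<Type>::bytes_required(elements);
}
// values_buffer_size<Int32Type>(100)   == 400  (100 * sizeof(int32_t))
// values_buffer_size<BooleanType>(100) == 13   (util::bytes_for_bits, ceil(100 / 8))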
#include +#include #include "arrow/util/bit-util.h" #include "arrow/util/buffer.h" @@ -23,25 +24,24 @@ namespace arrow { -void util::bytes_to_bits(uint8_t* bytes, int length, uint8_t* bits) { - for (int i = 0; i < length; ++i) { - if (static_cast(bytes[i])) { +void util::bytes_to_bits(const std::vector& bytes, uint8_t* bits) { + for (size_t i = 0; i < bytes.size(); ++i) { + if (bytes[i] > 0) { set_bit(bits, i); } } } -Status util::bytes_to_bits(uint8_t* bytes, int length, +Status util::bytes_to_bits(const std::vector& bytes, std::shared_ptr* out) { - int bit_length = ceil_byte(length) / 8; + int bit_length = util::bytes_for_bits(bytes.size()); auto buffer = std::make_shared(); RETURN_NOT_OK(buffer->Resize(bit_length)); memset(buffer->mutable_data(), 0, bit_length); - bytes_to_bits(bytes, length, buffer->mutable_data()); + bytes_to_bits(bytes, buffer->mutable_data()); *out = buffer; - return Status::OK(); } diff --git a/cpp/src/arrow/util/bit-util.h b/cpp/src/arrow/util/bit-util.h index 08222d5089474..8d6287130dd2b 100644 --- a/cpp/src/arrow/util/bit-util.h +++ b/cpp/src/arrow/util/bit-util.h @@ -20,6 +20,7 @@ #include #include +#include namespace arrow { @@ -43,15 +44,19 @@ static inline int64_t ceil_2bytes(int64_t size) { static constexpr uint8_t BITMASK[] = {1, 2, 4, 8, 16, 32, 64, 128}; static inline bool get_bit(const uint8_t* bits, int i) { - return bits[i / 8] & BITMASK[i % 8]; + return static_cast(bits[i / 8] & BITMASK[i % 8]); } static inline bool bit_not_set(const uint8_t* bits, int i) { return (bits[i / 8] & BITMASK[i % 8]) == 0; } +static inline void clear_bit(uint8_t* bits, int i) { + bits[i / 8] &= ~BITMASK[i % 8]; +} + static inline void set_bit(uint8_t* bits, int i) { - bits[i / 8] |= 1 << (i % 8); + bits[i / 8] |= BITMASK[i % 8]; } static inline int64_t next_power2(int64_t n) { @@ -66,8 +71,8 @@ static inline int64_t next_power2(int64_t n) { return n; } -void bytes_to_bits(uint8_t* bytes, int length, uint8_t* bits); -Status bytes_to_bits(uint8_t*, int, std::shared_ptr*); +void bytes_to_bits(const std::vector& bytes, uint8_t* bits); +Status bytes_to_bits(const std::vector&, std::shared_ptr*); } // namespace util diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index e6afcbd79b69f..943a08f84a055 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -86,6 +86,9 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: c_bool IsNull(int i) + cdef cppclass CBooleanArray" arrow::BooleanArray"(CArray): + c_bool Value(int i) + cdef cppclass CUInt8Array" arrow::UInt8Array"(CArray): uint8_t Value(int i) diff --git a/python/pyarrow/scalar.pyx b/python/pyarrow/scalar.pyx index 04f013d6ca706..0d391e5f26b3e 100644 --- a/python/pyarrow/scalar.pyx +++ b/python/pyarrow/scalar.pyx @@ -58,7 +58,10 @@ cdef class ArrayValue(Scalar): cdef class BooleanValue(ArrayValue): - pass + + def as_py(self): + cdef CBooleanArray* ap = self.sp_array.get() + return ap.Value(self.index) cdef class Int8Value(ArrayValue): @@ -172,6 +175,7 @@ cdef class ListValue(ArrayValue): cdef dict _scalar_classes = { + Type_BOOL: BooleanValue, Type_UINT8: Int8Value, Type_UINT16: Int16Value, Type_UINT32: Int32Value, diff --git a/python/pyarrow/tests/test_convert_builtin.py b/python/pyarrow/tests/test_convert_builtin.py index 25f696912105d..2beb6b39d73ed 100644 --- a/python/pyarrow/tests/test_convert_builtin.py +++ b/python/pyarrow/tests/test_convert_builtin.py @@ -22,7 +22,10 @@ class TestConvertList(unittest.TestCase): def test_boolean(self): 
- pass + arr = pyarrow.from_pylist([True, None, False, None]) + assert len(arr) == 4 + assert arr.null_count == 2 + assert arr.type == pyarrow.bool_() def test_empty_list(self): arr = pyarrow.from_pylist([]) diff --git a/python/pyarrow/tests/test_scalars.py b/python/pyarrow/tests/test_scalars.py index 021737db6726e..4fb850a4d47bf 100644 --- a/python/pyarrow/tests/test_scalars.py +++ b/python/pyarrow/tests/test_scalars.py @@ -16,67 +16,74 @@ # under the License. from pyarrow.compat import unittest, u -import pyarrow as arrow +import pyarrow as A class TestScalars(unittest.TestCase): def test_null_singleton(self): with self.assertRaises(Exception): - arrow.NAType() + A.NAType() def test_bool(self): - pass + arr = A.from_pylist([True, None, False, None]) + + v = arr[0] + assert isinstance(v, A.BooleanValue) + assert repr(v) == "True" + assert v.as_py() == True + + assert arr[1] is A.NA def test_int64(self): - arr = arrow.from_pylist([1, 2, None]) + arr = A.from_pylist([1, 2, None]) v = arr[0] - assert isinstance(v, arrow.Int64Value) + assert isinstance(v, A.Int64Value) assert repr(v) == "1" assert v.as_py() == 1 - assert arr[2] is arrow.NA + assert arr[2] is A.NA def test_double(self): - arr = arrow.from_pylist([1.5, None, 3]) + arr = A.from_pylist([1.5, None, 3]) v = arr[0] - assert isinstance(v, arrow.DoubleValue) + assert isinstance(v, A.DoubleValue) assert repr(v) == "1.5" assert v.as_py() == 1.5 - assert arr[1] is arrow.NA + assert arr[1] is A.NA v = arr[2] assert v.as_py() == 3.0 def test_string(self): - arr = arrow.from_pylist(['foo', None, u('bar')]) + arr = A.from_pylist(['foo', None, u('bar')]) v = arr[0] - assert isinstance(v, arrow.StringValue) + assert isinstance(v, A.StringValue) assert repr(v) == "'foo'" assert v.as_py() == 'foo' - assert arr[1] is arrow.NA + assert arr[1] is A.NA v = arr[2].as_py() assert v == 'bar' assert isinstance(v, str) def test_list(self): - arr = arrow.from_pylist([['foo', None], None, ['bar'], []]) + arr = A.from_pylist([['foo', None], None, ['bar'], []]) v = arr[0] assert len(v) == 2 - assert isinstance(v, arrow.ListValue) + assert isinstance(v, A.ListValue) assert repr(v) == "['foo', None]" assert v.as_py() == ['foo', None] assert v[0].as_py() == 'foo' - assert v[1] is arrow.NA + assert v[1] is A.NA - assert arr[1] is arrow.NA + assert arr[1] is A.NA v = arr[3] assert len(v) == 0 diff --git a/python/src/pyarrow/adapters/builtin.cc b/python/src/pyarrow/adapters/builtin.cc index acb13acecaf33..78ef1b31f34f1 100644 --- a/python/src/pyarrow/adapters/builtin.cc +++ b/python/src/pyarrow/adapters/builtin.cc @@ -61,6 +61,8 @@ class ScalarVisitor { ++total_count_; if (obj == Py_None) { ++none_count_; + } else if (PyBool_Check(obj)) { + ++bool_count_; } else if (PyFloat_Check(obj)) { ++float_count_; } else if (IsPyInteger(obj)) { @@ -256,6 +258,20 @@ class TypedConverter : public SeqConverter { class BoolConverter : public TypedConverter { public: Status AppendData(PyObject* seq) override { + Py_ssize_t size = PySequence_Size(seq); + RETURN_ARROW_NOT_OK(typed_builder_->Reserve(size)); + for (int64_t i = 0; i < size; ++i) { + OwnedRef item(PySequence_GetItem(seq, i)); + if (item.obj() == Py_None) { + typed_builder_->AppendNull(); + } else { + if (item.obj() == Py_True) { + typed_builder_->Append(true); + } else { + typed_builder_->Append(false); + } + } + } return Status::OK(); } }; @@ -265,14 +281,15 @@ class Int64Converter : public TypedConverter { Status AppendData(PyObject* seq) override { int64_t val; Py_ssize_t size = PySequence_Size(seq); + 
RETURN_ARROW_NOT_OK(typed_builder_->Reserve(size)); for (int64_t i = 0; i < size; ++i) { OwnedRef item(PySequence_GetItem(seq, i)); if (item.obj() == Py_None) { - RETURN_ARROW_NOT_OK(typed_builder_->AppendNull()); + typed_builder_->AppendNull(); } else { val = PyLong_AsLongLong(item.obj()); RETURN_IF_PYERROR(); - RETURN_ARROW_NOT_OK(typed_builder_->Append(val)); + typed_builder_->Append(val); } } return Status::OK(); @@ -284,14 +301,15 @@ class DoubleConverter : public TypedConverter { Status AppendData(PyObject* seq) override { double val; Py_ssize_t size = PySequence_Size(seq); + RETURN_ARROW_NOT_OK(typed_builder_->Reserve(size)); for (int64_t i = 0; i < size; ++i) { OwnedRef item(PySequence_GetItem(seq, i)); if (item.obj() == Py_None) { - RETURN_ARROW_NOT_OK(typed_builder_->AppendNull()); + typed_builder_->AppendNull(); } else { val = PyFloat_AsDouble(item.obj()); RETURN_IF_PYERROR(); - RETURN_ARROW_NOT_OK(typed_builder_->Append(val)); + typed_builder_->Append(val); } } return Status::OK(); From d3cb6b47fde2935522b73c7150d83e364f4e19c9 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sat, 26 Mar 2016 17:07:40 -0700 Subject: [PATCH 0044/1644] ARROW-22: [C++] Convert flat Parquet schemas to Arrow schemas I'm going to limit the amount of nested data (especially repeated fields) cases in this patch as I haven't yet thought through the nested data reassembly from repetition / definition levels. Since the effective Arrow schemas may "collapse" multiple levels of nesting (for example: 3-level array encoding -- see https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema/types.h), we'll need to track the logical correspondence between repetition and definition levels so that the right null bits can be set easily during reassembly. Closes #37. Closes #38. Closes #39 Author: Wes McKinney Author: Uwe L. Korn Closes #41 from wesm/ARROW-22 and squashes the following commits: f388210 [Wes McKinney] Correct typo in Layout.md (thanks @takahirox) e5c429a [Wes McKinney] Test for some unsupported Parquet schema types, add unannotated FIXED_LEN_BYTE_ARRAY to List 54daa9b [Wes McKinney] Refactor tests to invoke FromParquetSchema 74d6bae [Wes McKinney] Convert BYTE_ARRAY to StringType or List depending on the logical type b7b9ca9 [Uwe L. Korn] Add basic conversion for primitive types 0e2a7f1 [Uwe L. Korn] Add macro for adding dependencies to tests 0dd1109 [Uwe L. Korn] ARROW-78: Add constructor for DecimalType --- cpp/CMakeLists.txt | 11 ++ cpp/src/arrow/parquet/CMakeLists.txt | 8 +- cpp/src/arrow/parquet/parquet-schema-test.cc | 147 +++++++++++++++ cpp/src/arrow/parquet/schema.cc | 178 +++++++++++++++++++ cpp/src/arrow/parquet/schema.h | 44 +++++ cpp/src/arrow/types/decimal.cc | 32 ++++ cpp/src/arrow/types/decimal.h | 11 ++ cpp/src/arrow/util/status.h | 1 + format/Layout.md | 2 +- 9 files changed, 432 insertions(+), 2 deletions(-) create mode 100644 cpp/src/arrow/parquet/parquet-schema-test.cc create mode 100644 cpp/src/arrow/parquet/schema.cc create mode 100644 cpp/src/arrow/parquet/schema.h create mode 100644 cpp/src/arrow/types/decimal.cc diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 6d701079b482c..6ed2768d13918 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -378,6 +378,16 @@ function(ADD_ARROW_TEST_DEPENDENCIES REL_TEST_NAME) add_dependencies(${TEST_NAME} ${ARGN}) endfunction() +# A wrapper for target_link_libraries() that is compatible with NO_TESTS. 
+function(ARROW_TEST_LINK_LIBRARIES REL_TEST_NAME) + if(NO_TESTS) + return() + endif() + get_filename_component(TEST_NAME ${REL_TEST_NAME} NAME_WE) + + target_link_libraries(${TEST_NAME} ${ARGN}) +endfunction() + enable_testing() ############################################################ @@ -528,6 +538,7 @@ set(ARROW_SRCS src/arrow/ipc/metadata-internal.cc src/arrow/types/construct.cc + src/arrow/types/decimal.cc src/arrow/types/json.cc src/arrow/types/list.cc src/arrow/types/primitive.cc diff --git a/cpp/src/arrow/parquet/CMakeLists.txt b/cpp/src/arrow/parquet/CMakeLists.txt index 7b449affab025..0d5cf263ec3e2 100644 --- a/cpp/src/arrow/parquet/CMakeLists.txt +++ b/cpp/src/arrow/parquet/CMakeLists.txt @@ -19,17 +19,23 @@ # arrow_parquet : Arrow <-> Parquet adapter set(PARQUET_SRCS + schema.cc ) set(PARQUET_LIBS + arrow + ${PARQUET_SHARED_LIB} ) -add_library(arrow_parquet STATIC +add_library(arrow_parquet SHARED ${PARQUET_SRCS} ) target_link_libraries(arrow_parquet ${PARQUET_LIBS}) SET_TARGET_PROPERTIES(arrow_parquet PROPERTIES LINKER_LANGUAGE CXX) +ADD_ARROW_TEST(parquet-schema-test) +ARROW_TEST_LINK_LIBRARIES(parquet-schema-test arrow_parquet) + # Headers: top level install(FILES DESTINATION include/arrow/parquet) diff --git a/cpp/src/arrow/parquet/parquet-schema-test.cc b/cpp/src/arrow/parquet/parquet-schema-test.cc new file mode 100644 index 0000000000000..9c3093d9ff7c9 --- /dev/null +++ b/cpp/src/arrow/parquet/parquet-schema-test.cc @@ -0,0 +1,147 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
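The ARROW-22 commit message above notes that an Arrow schema may collapse several levels of Parquet nesting into one logical type. As a point of reference (an illustration of the mapping, not code from this patch), the standard 3-level Parquet list encoding and the single Arrow list field it corresponds to:

// Parquet (3-level list encoding):      Arrow (collapsed):
//   optional group my_list (LIST) {       my_list: list<element: int32>
//     repeated group list {
//       optional int32 element;
//     }
//   }
// Using the same helpers the tests below use:
auto element = std::make_shared<Field>("element", std::make_shared<Int32Type>());
auto my_list = std::make_shared<Field>("my_list", std::make_shared<ListType>(element));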
+ +#include +#include + +#include "gtest/gtest.h" + +#include "arrow/test-util.h" +#include "arrow/type.h" +#include "arrow/util/status.h" + +#include "arrow/parquet/schema.h" + +namespace arrow { + +namespace parquet { + +using parquet_cpp::Repetition; +using parquet_cpp::schema::NodePtr; +using parquet_cpp::schema::GroupNode; +using parquet_cpp::schema::PrimitiveNode; + +const auto BOOL = std::make_shared(); +const auto UINT8 = std::make_shared(); +const auto INT32 = std::make_shared(); +const auto INT64 = std::make_shared(); +const auto FLOAT = std::make_shared(); +const auto DOUBLE = std::make_shared(); +const auto UTF8 = std::make_shared(); +const auto BINARY = std::make_shared( + std::make_shared("", UINT8)); + +class TestConvertParquetSchema : public ::testing::Test { + public: + virtual void SetUp() {} + + void CheckFlatSchema(const std::shared_ptr& expected_schema) { + ASSERT_EQ(expected_schema->num_fields(), result_schema_->num_fields()); + for (int i = 0; i < expected_schema->num_fields(); ++i) { + auto lhs = result_schema_->field(i); + auto rhs = expected_schema->field(i); + EXPECT_TRUE(lhs->Equals(rhs)) + << i << " " << lhs->ToString() << " != " << rhs->ToString(); + } + } + + Status ConvertSchema(const std::vector& nodes) { + NodePtr schema = GroupNode::Make("schema", Repetition::REPEATED, nodes); + descr_.Init(schema); + return FromParquetSchema(&descr_, &result_schema_); + } + + protected: + parquet_cpp::SchemaDescriptor descr_; + std::shared_ptr result_schema_; +}; + +TEST_F(TestConvertParquetSchema, ParquetFlatPrimitives) { + std::vector parquet_fields; + std::vector> arrow_fields; + + parquet_fields.push_back( + PrimitiveNode::Make("boolean", Repetition::REQUIRED, parquet_cpp::Type::BOOLEAN)); + arrow_fields.push_back(std::make_shared("boolean", BOOL, false)); + + parquet_fields.push_back( + PrimitiveNode::Make("int32", Repetition::REQUIRED, parquet_cpp::Type::INT32)); + arrow_fields.push_back(std::make_shared("int32", INT32, false)); + + parquet_fields.push_back( + PrimitiveNode::Make("int64", Repetition::REQUIRED, parquet_cpp::Type::INT64)); + arrow_fields.push_back(std::make_shared("int64", INT64, false)); + + parquet_fields.push_back( + PrimitiveNode::Make("float", Repetition::OPTIONAL, parquet_cpp::Type::FLOAT)); + arrow_fields.push_back(std::make_shared("float", FLOAT)); + + parquet_fields.push_back( + PrimitiveNode::Make("double", Repetition::OPTIONAL, parquet_cpp::Type::DOUBLE)); + arrow_fields.push_back(std::make_shared("double", DOUBLE)); + + parquet_fields.push_back( + PrimitiveNode::Make("binary", Repetition::OPTIONAL, + parquet_cpp::Type::BYTE_ARRAY)); + arrow_fields.push_back(std::make_shared("binary", BINARY)); + + parquet_fields.push_back( + PrimitiveNode::Make("string", Repetition::OPTIONAL, + parquet_cpp::Type::BYTE_ARRAY, + parquet_cpp::LogicalType::UTF8)); + arrow_fields.push_back(std::make_shared("string", UTF8)); + + parquet_fields.push_back( + PrimitiveNode::Make("flba-binary", Repetition::OPTIONAL, + parquet_cpp::Type::FIXED_LEN_BYTE_ARRAY, + parquet_cpp::LogicalType::NONE, 12)); + arrow_fields.push_back(std::make_shared("flba-binary", BINARY)); + + auto arrow_schema = std::make_shared(arrow_fields); + ASSERT_OK(ConvertSchema(parquet_fields)); + + CheckFlatSchema(arrow_schema); +} + +TEST_F(TestConvertParquetSchema, UnsupportedThings) { + std::vector unsupported_nodes; + + unsupported_nodes.push_back( + PrimitiveNode::Make("int96", Repetition::REQUIRED, parquet_cpp::Type::INT96)); + + unsupported_nodes.push_back( + 
GroupNode::Make("repeated-group", Repetition::REPEATED, {})); + + unsupported_nodes.push_back( + PrimitiveNode::Make("int32", Repetition::OPTIONAL, + parquet_cpp::Type::INT32, parquet_cpp::LogicalType::DATE)); + + unsupported_nodes.push_back( + PrimitiveNode::Make("int64", Repetition::OPTIONAL, + parquet_cpp::Type::INT64, parquet_cpp::LogicalType::TIMESTAMP_MILLIS)); + + for (const NodePtr& node : unsupported_nodes) { + ASSERT_RAISES(NotImplemented, ConvertSchema({node})); + } +} + +TEST(TestNodeConversion, DateAndTime) { +} + +} // namespace parquet + +} // namespace arrow diff --git a/cpp/src/arrow/parquet/schema.cc b/cpp/src/arrow/parquet/schema.cc new file mode 100644 index 0000000000000..6b1de572617b8 --- /dev/null +++ b/cpp/src/arrow/parquet/schema.cc @@ -0,0 +1,178 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/parquet/schema.h" + +#include + +#include "parquet/api/schema.h" + +#include "arrow/util/status.h" +#include "arrow/types/decimal.h" + +using parquet_cpp::schema::Node; +using parquet_cpp::schema::NodePtr; +using parquet_cpp::schema::GroupNode; +using parquet_cpp::schema::PrimitiveNode; + +using parquet_cpp::LogicalType; + +namespace arrow { + +namespace parquet { + +const auto BOOL = std::make_shared(); +const auto UINT8 = std::make_shared(); +const auto INT32 = std::make_shared(); +const auto INT64 = std::make_shared(); +const auto FLOAT = std::make_shared(); +const auto DOUBLE = std::make_shared(); +const auto UTF8 = std::make_shared(); +const auto BINARY = std::make_shared( + std::make_shared("", UINT8)); + +TypePtr MakeDecimalType(const PrimitiveNode* node) { + int precision = node->decimal_metadata().precision; + int scale = node->decimal_metadata().scale; + return std::make_shared(precision, scale); +} + +static Status FromByteArray(const PrimitiveNode* node, TypePtr* out) { + switch (node->logical_type()) { + case LogicalType::UTF8: + *out = UTF8; + break; + default: + // BINARY + *out = BINARY; + break; + } + return Status::OK(); +} + +static Status FromFLBA(const PrimitiveNode* node, TypePtr* out) { + switch (node->logical_type()) { + case LogicalType::NONE: + *out = BINARY; + break; + case LogicalType::DECIMAL: + *out = MakeDecimalType(node); + break; + default: + return Status::NotImplemented("unhandled type"); + break; + } + + return Status::OK(); +} + +static Status FromInt32(const PrimitiveNode* node, TypePtr* out) { + switch (node->logical_type()) { + case LogicalType::NONE: + *out = INT32; + break; + default: + return Status::NotImplemented("Unhandled logical type for int32"); + break; + } + return Status::OK(); +} + +static Status FromInt64(const PrimitiveNode* node, TypePtr* out) { + switch (node->logical_type()) { + case LogicalType::NONE: + *out = INT64; + break; + default: + return 
Status::NotImplemented("Unhandled logical type for int64"); + break; + } + return Status::OK(); +} + +// TODO: Logical Type Handling +Status NodeToField(const NodePtr& node, std::shared_ptr* out) { + std::shared_ptr type; + + if (node->is_repeated()) { + return Status::NotImplemented("No support yet for repeated node types"); + } + + if (node->is_group()) { + const GroupNode* group = static_cast(node.get()); + std::vector> fields(group->field_count()); + for (int i = 0; i < group->field_count(); i++) { + RETURN_NOT_OK(NodeToField(group->field(i), &fields[i])); + } + type = std::make_shared(fields); + } else { + // Primitive (leaf) node + const PrimitiveNode* primitive = static_cast(node.get()); + + switch (primitive->physical_type()) { + case parquet_cpp::Type::BOOLEAN: + type = BOOL; + break; + case parquet_cpp::Type::INT32: + RETURN_NOT_OK(FromInt32(primitive, &type)); + break; + case parquet_cpp::Type::INT64: + RETURN_NOT_OK(FromInt64(primitive, &type)); + break; + case parquet_cpp::Type::INT96: + // TODO: Do we have that type in Arrow? + // type = TypePtr(new Int96Type()); + return Status::NotImplemented("int96"); + case parquet_cpp::Type::FLOAT: + type = FLOAT; + break; + case parquet_cpp::Type::DOUBLE: + type = DOUBLE; + break; + case parquet_cpp::Type::BYTE_ARRAY: + // TODO: Do we have that type in Arrow? + RETURN_NOT_OK(FromByteArray(primitive, &type)); + break; + case parquet_cpp::Type::FIXED_LEN_BYTE_ARRAY: + RETURN_NOT_OK(FromFLBA(primitive, &type)); + break; + } + } + + *out = std::make_shared(node->name(), type, !node->is_required()); + return Status::OK(); +} + +Status FromParquetSchema(const parquet_cpp::SchemaDescriptor* parquet_schema, + std::shared_ptr* out) { + // TODO(wesm): Consider adding an arrow::Schema name attribute, which comes + // from the root Parquet node + const GroupNode* schema_node = static_cast( + parquet_schema->schema().get()); + + std::vector> fields(schema_node->field_count()); + for (int i = 0; i < schema_node->field_count(); i++) { + RETURN_NOT_OK(NodeToField(schema_node->field(i), &fields[i])); + } + + *out = std::make_shared(fields); + return Status::OK(); +} + +} // namespace parquet + +} // namespace arrow diff --git a/cpp/src/arrow/parquet/schema.h b/cpp/src/arrow/parquet/schema.h new file mode 100644 index 0000000000000..61de193a33877 --- /dev/null +++ b/cpp/src/arrow/parquet/schema.h @@ -0,0 +1,44 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#ifndef ARROW_PARQUET_SCHEMA_H +#define ARROW_PARQUET_SCHEMA_H + +#include + +#include "parquet/api/schema.h" + +#include "arrow/schema.h" +#include "arrow/type.h" + +namespace arrow { + +class Status; + +namespace parquet { + +Status NodeToField(const parquet_cpp::schema::NodePtr& node, + std::shared_ptr* out); + +Status FromParquetSchema(const parquet_cpp::SchemaDescriptor* parquet_schema, + std::shared_ptr* out); + +} // namespace parquet + +} // namespace arrow + +#endif diff --git a/cpp/src/arrow/types/decimal.cc b/cpp/src/arrow/types/decimal.cc new file mode 100644 index 0000000000000..f120c1a9dfde6 --- /dev/null +++ b/cpp/src/arrow/types/decimal.cc @@ -0,0 +1,32 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/types/decimal.h" + +#include +#include + +namespace arrow { + +std::string DecimalType::ToString() const { + std::stringstream s; + s << "decimal(" << precision << ", " << scale << ")"; + return s.str(); +} + +} // namespace arrow + diff --git a/cpp/src/arrow/types/decimal.h b/cpp/src/arrow/types/decimal.h index 464c3ff8da92b..26243b42b0e7d 100644 --- a/cpp/src/arrow/types/decimal.h +++ b/cpp/src/arrow/types/decimal.h @@ -18,13 +18,24 @@ #ifndef ARROW_TYPES_DECIMAL_H #define ARROW_TYPES_DECIMAL_H +#include + #include "arrow/type.h" namespace arrow { struct DecimalType : public DataType { + explicit DecimalType(int precision_, int scale_) + : DataType(Type::DECIMAL), precision(precision_), + scale(scale_) { } int precision; int scale; + + static char const *name() { + return "decimal"; + } + + std::string ToString() const override; }; } // namespace arrow diff --git a/cpp/src/arrow/util/status.h b/cpp/src/arrow/util/status.h index b5931232dbdcb..4e273edcb8f1f 100644 --- a/cpp/src/arrow/util/status.h +++ b/cpp/src/arrow/util/status.h @@ -109,6 +109,7 @@ class Status { bool IsKeyError() const { return code() == StatusCode::KeyError; } bool IsInvalid() const { return code() == StatusCode::Invalid; } bool IsIOError() const { return code() == StatusCode::IOError; } + bool IsNotImplemented() const { return code() == StatusCode::NotImplemented; } // Return a string representation of this status suitable for printing. // Returns the string "OK" for success. diff --git a/format/Layout.md b/format/Layout.md index 2d46ece606ea7..1b532c6b3817c 100644 --- a/format/Layout.md +++ b/format/Layout.md @@ -58,7 +58,7 @@ Base requirements * Memory layout and random access patterns for each relative type * Null value representation -## Non-goals (for this document +## Non-goals (for this document) * To enumerate or specify logical types that can be implemented as primitive (fixed-width) value types. 
For example: signed and unsigned integers, From d6d53b25ef4e8cd7d8c34df56661817366906bbf Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 27 Mar 2016 12:28:18 -0700 Subject: [PATCH 0045/1644] ARROW-63: [C++] Enable ctest to work on systems with Python 3 as the default Python Author: Wes McKinney Closes #42 from wesm/ARROW-63 and squashes the following commits: 9840308 [Wes McKinney] Make asan_symbolize.py work on both Python 2.7 and 3.x --- cpp/build-support/asan_symbolize.py | 36 ++++++++++++++++++----------- 1 file changed, 22 insertions(+), 14 deletions(-) diff --git a/cpp/build-support/asan_symbolize.py b/cpp/build-support/asan_symbolize.py index 839a1984bd349..1108044d7d648 100755 --- a/cpp/build-support/asan_symbolize.py +++ b/cpp/build-support/asan_symbolize.py @@ -64,7 +64,7 @@ def open_llvm_symbolizer(self): '--functions=true', '--inlining=true'] if DEBUG: - print ' '.join(cmd) + print(' '.join(cmd)) return subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE) @@ -76,8 +76,9 @@ def symbolize(self, addr, binary, offset): try: symbolizer_input = '%s %s' % (binary, offset) if DEBUG: - print symbolizer_input - print >> self.pipe.stdin, symbolizer_input + print(symbolizer_input) + self.pipe.stdin.write(symbolizer_input) + self.pipe.stdin.write('\n') while True: function_name = self.pipe.stdout.readline().rstrip() if not function_name: @@ -113,7 +114,7 @@ def __init__(self, binary): def open_addr2line(self): cmd = ['addr2line', '-f', '-e', self.binary] if DEBUG: - print ' '.join(cmd) + print(' '.join(cmd)) return subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE) @@ -122,7 +123,8 @@ def symbolize(self, addr, binary, offset): if self.binary != binary: return None try: - print >> self.pipe.stdin, offset + self.pipe.stdin.write(offset) + self.pipe.stdin.write('\n') function_name = self.pipe.stdout.readline().rstrip() file_name = self.pipe.stdout.readline().rstrip() except Exception: @@ -145,11 +147,12 @@ def __init__(self, addr, binary): self.pipe = None def write_addr_to_pipe(self, offset): - print >> self.pipe.stdin, '0x%x' % int(offset, 16) + self.pipe.stdin.write('0x%x' % int(offset, 16)) + self.pipe.stdin.write('\n') def open_atos(self): if DEBUG: - print 'atos -o %s -arch %s' % (self.binary, self.arch) + print('atos -o %s -arch %s' % (self.binary, self.arch)) cmdline = ['atos', '-o', self.binary, '-arch', self.arch] self.pipe = subprocess.Popen(cmdline, stdin=subprocess.PIPE, @@ -168,7 +171,7 @@ def symbolize(self, addr, binary, offset): # foo(type1, type2) (in object.name) (filename.cc:80) match = re.match('^(.*) \(in (.*)\) \((.*:\d*)\)$', atos_line) if DEBUG: - print 'atos_line: ', atos_line + print('atos_line: {0}'.format(atos_line)) if match: function_name = match.group(1) function_name = re.sub('\(.*?\)', '', function_name) @@ -282,7 +285,7 @@ def symbolize(self, addr, binary, offset): function_name, file_name, line_no = res result = ['%s in %s %s:%d' % ( addr, function_name, file_name, line_no)] - print result + print(result) return result else: return None @@ -318,15 +321,20 @@ def symbolize_address(self, addr, binary, offset): def print_symbolized_lines(self, symbolized_lines): if not symbolized_lines: - print self.current_line + print(self.current_line) else: for symbolized_frame in symbolized_lines: - print ' #' + str(self.frame_no) + ' ' + symbolized_frame.rstrip() + print(' #' + str(self.frame_no) + ' ' + symbolized_frame.rstrip()) self.frame_no += 1 def process_stdin(self): self.frame_no = 0 - sys.stdout = 
os.fdopen(sys.stdout.fileno(), 'w', 0) + + if sys.version_info[0] == 2: + sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0) + else: + # Unbuffered output is not supported in Python 3 + sys.stdout = os.fdopen(sys.stdout.fileno(), 'w') while True: line = sys.stdin.readline() @@ -337,10 +345,10 @@ def process_stdin(self): '^( *#([0-9]+) *)(0x[0-9a-f]+) *\((.*)\+(0x[0-9a-f]+)\)') match = re.match(stack_trace_line_format, line) if not match: - print self.current_line + print(self.current_line) continue if DEBUG: - print line + print(line) _, frameno_str, addr, binary, offset = match.groups() if frameno_str == '0': # Assume that frame #0 is the first frame of new stack trace. From 017187749f3916e589015a4db2409258a0b3c03c Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sun, 27 Mar 2016 12:30:58 -0700 Subject: [PATCH 0046/1644] ARROW-65: Be less restrictive on PYTHON_LIBRARY search paths Current CMake FindPythonLibs also uses this option instead of NO_DEFAULT_PATH. Author: Uwe L. Korn Closes #43 from xhochy/arrow-65 and squashes the following commits: 10eb9e0 [Uwe L. Korn] ARROW-65: Be less restrictive on PYTHON_LIBRARY search paths --- python/cmake_modules/FindPythonLibsNew.cmake | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python/cmake_modules/FindPythonLibsNew.cmake b/python/cmake_modules/FindPythonLibsNew.cmake index c70e6bc26a719..0f2295aa43bc1 100644 --- a/python/cmake_modules/FindPythonLibsNew.cmake +++ b/python/cmake_modules/FindPythonLibsNew.cmake @@ -166,7 +166,7 @@ else() find_library(PYTHON_LIBRARY NAMES "python${PYTHON_LIBRARY_SUFFIX}" PATHS ${_PYTHON_LIBS_SEARCH} - NO_DEFAULT_PATH) + NO_SYSTEM_ENVIRONMENT_PATH) message(STATUS "Found Python lib ${PYTHON_LIBRARY}") endif() From 1fd0668a1330e72b1b137d90d00906bc188243e0 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 28 Mar 2016 09:36:20 -0700 Subject: [PATCH 0047/1644] ARROW-30: [Python] Routines for converting between arrow::Array/Table and pandas.DataFrame There is a lot to do here for maximum compatibility, but this gets things started. Author: Wes McKinney Closes #46 from wesm/ARROW-30 and squashes the following commits: 0a9e747 [Wes McKinney] Invoke py.test with python -m pytest 4c9f766 [Wes McKinney] More scaffolding. Table wrapper. 
Initial unit tests passing 8475a0e [Wes McKinney] More pandas conversion scaffolding, enable libpyarrow to use the NumPy C API globally d1f05c5 [Wes McKinney] cpplint f0cc451 [Wes McKinney] Give libpyarrow a reference to numpy.nan 5e09bfe [Wes McKinney] Compiling, but untested draft of pandas <-> arrow converters --- ci/travis_script_python.sh | 8 +- cpp/README.md | 6 +- cpp/src/arrow/array.h | 13 +- cpp/src/arrow/types/string.cc | 10 + cpp/src/arrow/types/string.h | 4 +- cpp/src/arrow/util/buffer.h | 42 ++ python/CMakeLists.txt | 6 +- python/pyarrow/__init__.py | 8 +- python/pyarrow/array.pyx | 135 ++++ python/pyarrow/config.pyx | 13 +- python/pyarrow/includes/common.pxd | 6 + python/pyarrow/includes/libarrow.pxd | 52 +- python/pyarrow/includes/pyarrow.pxd | 9 +- python/pyarrow/tests/test_convert_pandas.py | 172 +++++ python/src/pyarrow/adapters/pandas.cc | 714 ++++++++++++++++++ python/src/pyarrow/adapters/pandas.h | 21 + python/src/pyarrow/common.h | 23 +- python/src/pyarrow/{init.cc => config.cc} | 11 +- python/src/pyarrow/config.h | 39 + .../src/pyarrow/{init.h => do_import_numpy.h} | 12 +- python/src/pyarrow/numpy_interop.h | 58 ++ 21 files changed, 1313 insertions(+), 49 deletions(-) create mode 100644 python/pyarrow/tests/test_convert_pandas.py create mode 100644 python/src/pyarrow/adapters/pandas.cc rename python/src/pyarrow/{init.cc => config.cc} (84%) create mode 100644 python/src/pyarrow/config.h rename python/src/pyarrow/{init.h => do_import_numpy.h} (83%) create mode 100644 python/src/pyarrow/numpy_interop.h diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index af6b0085724fc..d45b895d8cf38 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -48,17 +48,11 @@ python_version_tests() { python setup.py build_ext --inplace - py.test -vv -r sxX pyarrow + python -m pytest -vv -r sxX pyarrow } # run tests for python 2.7 and 3.5 python_version_tests 2.7 python_version_tests 3.5 -# if [ $TRAVIS_OS_NAME == "linux" ]; then -# valgrind --tool=memcheck py.test -vv -r sxX arrow -# else -# py.test -vv -r sxX arrow -# fi - popd diff --git a/cpp/README.md b/cpp/README.md index 542cce43a1391..9026cf963f8ee 100644 --- a/cpp/README.md +++ b/cpp/README.md @@ -42,12 +42,12 @@ Detailed unit test logs will be placed in the build directory under `build/test- ### Building/Running benchmarks -Follow the directions for simple build except run cmake +Follow the directions for simple build except run cmake with the `--ARROW_BUILD_BENCHMARKS` parameter set correctly: cmake -DARROW_BUILD_BENCHMARKS=ON .. -and instead of make unittest run either `make; ctest` to run both unit tests +and instead of make unittest run either `make; ctest` to run both unit tests and benchmarks or `make runbenchmark` to run only the benchmark tests. Benchmark logs will be placed in the build directory under `build/benchmark-logs`. @@ -60,4 +60,4 @@ variables * Googletest: `GTEST_HOME` (only required to build the unit tests) * Google Benchmark: `GBENCHMARK_HOME` (only required if building benchmarks) - +* Flatbuffers: `FLATBUFFERS_HOME` (only required for the IPC extensions) diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 133adf32cbd50..097634d74f890 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -34,13 +34,10 @@ class Buffer; // // The base class is only required to have a null bitmap buffer if the null // count is greater than 0 -// -// Any buffers used to initialize the array have their references "stolen". 
If -// you wish to use the buffer beyond the lifetime of the array, you need to -// explicitly increment its reference count class Array { public: - Array(const TypePtr& type, int32_t length, int32_t null_count = 0, + Array(const std::shared_ptr& type, int32_t length, + int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); virtual ~Array() {} @@ -60,11 +57,15 @@ class Array { return null_bitmap_; } + const uint8_t* null_bitmap_data() const { + return null_bitmap_data_; + } + bool EqualsExact(const Array& arr) const; virtual bool Equals(const std::shared_ptr& arr) const = 0; protected: - TypePtr type_; + std::shared_ptr type_; int32_t null_count_; int32_t length_; diff --git a/cpp/src/arrow/types/string.cc b/cpp/src/arrow/types/string.cc index dea42e102b0d0..80b075cdfbb23 100644 --- a/cpp/src/arrow/types/string.cc +++ b/cpp/src/arrow/types/string.cc @@ -20,8 +20,18 @@ #include #include +#include "arrow/type.h" + namespace arrow { +const std::shared_ptr STRING(new StringType()); + +StringArray::StringArray(int32_t length, + const std::shared_ptr& offsets, + const ArrayPtr& values, int32_t null_count, + const std::shared_ptr& null_bitmap) : + StringArray(STRING, length, offsets, values, null_count, null_bitmap) {} + std::string CharType::ToString() const { std::stringstream s; s << "char(" << size << ")"; diff --git a/cpp/src/arrow/types/string.h b/cpp/src/arrow/types/string.h index fda722ba6def2..84cd0326ec850 100644 --- a/cpp/src/arrow/types/string.h +++ b/cpp/src/arrow/types/string.h @@ -79,9 +79,7 @@ class StringArray : public ListArray { const std::shared_ptr& offsets, const ArrayPtr& values, int32_t null_count = 0, - const std::shared_ptr& null_bitmap = nullptr) : - StringArray(std::make_shared(), length, offsets, values, - null_count, null_bitmap) {} + const std::shared_ptr& null_bitmap = nullptr); // Compute the pointer t const uint8_t* GetValue(int i, int32_t* out_length) const { diff --git a/cpp/src/arrow/util/buffer.h b/cpp/src/arrow/util/buffer.h index 0c3e210abd910..c15f9b630cd97 100644 --- a/cpp/src/arrow/util/buffer.h +++ b/cpp/src/arrow/util/buffer.h @@ -18,11 +18,13 @@ #ifndef ARROW_UTIL_BUFFER_H #define ARROW_UTIL_BUFFER_H +#include #include #include #include #include "arrow/util/macros.h" +#include "arrow/util/status.h" namespace arrow { @@ -146,6 +148,46 @@ class PoolBuffer : public ResizableBuffer { MemoryPool* pool_; }; +static constexpr int64_t MIN_BUFFER_CAPACITY = 1024; + +class BufferBuilder { + public: + explicit BufferBuilder(MemoryPool* pool) : + pool_(pool), + capacity_(0), + size_(0) {} + + Status Append(const uint8_t* data, int length) { + if (capacity_ < length + size_) { + if (capacity_ == 0) { + buffer_ = std::make_shared(pool_); + } + capacity_ = std::max(MIN_BUFFER_CAPACITY, capacity_); + while (capacity_ < length + size_) { + capacity_ *= 2; + } + RETURN_NOT_OK(buffer_->Resize(capacity_)); + data_ = buffer_->mutable_data(); + } + memcpy(data_ + size_, data, length); + size_ += length; + return Status::OK(); + } + + std::shared_ptr Finish() { + auto result = buffer_; + buffer_ = nullptr; + return result; + } + + private: + std::shared_ptr buffer_; + MemoryPool* pool_; + uint8_t* data_; + int64_t capacity_; + int64_t size_; +}; + } // namespace arrow #endif // ARROW_UTIL_BUFFER_H diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 0ecafc7202e89..ebe825f65c4da 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -220,9 +220,12 @@ set(EXECUTABLE_OUTPUT_PATH "${BUILD_OUTPUT_ROOT_DIRECTORY}") ## Python and libraries 
find_package(PythonLibsNew REQUIRED) +find_package(NumPy REQUIRED) include(UseCython) include_directories(SYSTEM + ${NUMPY_INCLUDE_DIRS} + ${PYTHON_INCLUDE_DIRS} src) ############################################################ @@ -409,11 +412,12 @@ add_subdirectory(src/pyarrow/util) set(PYARROW_SRCS src/pyarrow/common.cc + src/pyarrow/config.cc src/pyarrow/helpers.cc - src/pyarrow/init.cc src/pyarrow/status.cc src/pyarrow/adapters/builtin.cc + src/pyarrow/adapters/pandas.cc ) set(LINK_LIBS diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 9a080709bebda..c343f5ba5f129 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -17,7 +17,11 @@ # flake8: noqa -from pyarrow.array import (Array, from_pylist, total_allocated_bytes, +import pyarrow.config + +from pyarrow.array import (Array, + from_pandas_series, from_pylist, + total_allocated_bytes, BooleanArray, NumericArray, Int8Array, UInt8Array, ListArray, StringArray) @@ -37,4 +41,4 @@ list_, struct, field, DataType, Field, Schema, schema) -from pyarrow.array import RowBatch +from pyarrow.array import RowBatch, Table, from_pandas_dataframe diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index c5d40ddd7a481..88770cdaa966e 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -22,6 +22,8 @@ from pyarrow.includes.libarrow cimport * cimport pyarrow.includes.pyarrow as pyarrow +import pyarrow.config + from pyarrow.compat import frombytes, tobytes from pyarrow.error cimport check_status @@ -44,6 +46,10 @@ cdef class Array: self.type = DataType() self.type.init(self.sp_array.get().type()) + @staticmethod + def from_pandas(obj, mask=None): + return from_pandas_series(obj, mask) + property null_count: def __get__(self): @@ -160,7 +166,15 @@ cdef class StringArray(Array): cdef dict _array_classes = { Type_NA: NullArray, Type_BOOL: BooleanArray, + Type_UINT8: UInt8Array, + Type_UINT16: UInt16Array, + Type_UINT32: UInt32Array, + Type_UINT64: UInt64Array, + Type_INT8: Int8Array, + Type_INT16: Int16Array, + Type_INT32: Int32Array, Type_INT64: Int64Array, + Type_FLOAT: FloatArray, Type_DOUBLE: DoubleArray, Type_LIST: ListArray, Type_STRING: StringArray, @@ -194,6 +208,49 @@ def from_pylist(object list_obj, DataType type=None): return box_arrow_array(sp_array) + +def from_pandas_series(object series, object mask=None): + cdef: + shared_ptr[CArray] out + + series_values = series_as_ndarray(series) + + if mask is None: + check_status(pyarrow.PandasToArrow(pyarrow.GetMemoryPool(), + series_values, &out)) + else: + mask = series_as_ndarray(mask) + check_status(pyarrow.PandasMaskedToArrow( + pyarrow.GetMemoryPool(), series_values, mask, &out)) + + return box_arrow_array(out) + + +def from_pandas_dataframe(object df, name=None): + cdef: + list names = [] + list arrays = [] + + for name in df.columns: + col = df[name] + arr = from_pandas_series(col) + + names.append(name) + arrays.append(arr) + + return Table.from_arrays(names, arrays, name=name) + + +cdef object series_as_ndarray(object obj): + import pandas as pd + + if isinstance(obj, pd.Series): + result = obj.values + else: + result = obj + + return result + #---------------------------------------------------------------------- # Table-like data structures @@ -225,3 +282,81 @@ cdef class RowBatch: def __getitem__(self, i): return self.arrays[i] + + +cdef class Table: + ''' + Do not call this class's constructor directly. 
+ ''' + cdef: + shared_ptr[CTable] sp_table + CTable* table + + def __cinit__(self): + pass + + cdef init(self, const shared_ptr[CTable]& table): + self.sp_table = table + self.table = table.get() + + @staticmethod + def from_pandas(df, name=None): + pass + + @staticmethod + def from_arrays(names, arrays, name=None): + cdef: + Array arr + Table result + c_string c_name + vector[shared_ptr[CField]] fields + vector[shared_ptr[CColumn]] columns + shared_ptr[CSchema] schema + shared_ptr[CTable] table + + cdef int K = len(arrays) + + fields.resize(K) + columns.resize(K) + for i in range(K): + arr = arrays[i] + c_name = tobytes(names[i]) + + fields[i].reset(new CField(c_name, arr.type.sp_type, True)) + columns[i].reset(new CColumn(fields[i], arr.sp_array)) + + if name is None: + c_name = '' + else: + c_name = tobytes(name) + + schema.reset(new CSchema(fields)) + table.reset(new CTable(c_name, schema, columns)) + + result = Table() + result.init(table) + + return result + + def to_pandas(self): + """ + Convert the arrow::Table to a pandas DataFrame + """ + cdef: + PyObject* arr + shared_ptr[CColumn] col + + import pandas as pd + + names = [] + data = [] + for i in range(self.table.num_columns()): + col = self.table.column(i) + check_status(pyarrow.ArrowToPandas(col, &arr)) + names.append(frombytes(col.get().name())) + data.append( arr) + + # One ref count too many + Py_XDECREF(arr) + + return pd.DataFrame(dict(zip(names, data)), columns=names) diff --git a/python/pyarrow/config.pyx b/python/pyarrow/config.pyx index 521bc066cd4a5..1047a472fe338 100644 --- a/python/pyarrow/config.pyx +++ b/python/pyarrow/config.pyx @@ -2,7 +2,18 @@ # distutils: language = c++ # cython: embedsignature = True -cdef extern from 'pyarrow/init.h' namespace 'pyarrow': +cdef extern from 'pyarrow/do_import_numpy.h': + pass + +cdef extern from 'pyarrow/numpy_interop.h' namespace 'pyarrow': + int import_numpy() + +cdef extern from 'pyarrow/config.h' namespace 'pyarrow': void pyarrow_init() + void pyarrow_set_numpy_nan(object o) +import_numpy() pyarrow_init() + +import numpy as np +pyarrow_set_numpy_nan(np.nan) diff --git a/python/pyarrow/includes/common.pxd b/python/pyarrow/includes/common.pxd index 839427a699002..e86d5d77e8b10 100644 --- a/python/pyarrow/includes/common.pxd +++ b/python/pyarrow/includes/common.pxd @@ -22,10 +22,16 @@ from libcpp cimport bool as c_bool from libcpp.string cimport string as c_string from libcpp.vector cimport vector +from cpython cimport PyObject +cimport cpython + # This must be included for cerr and other things to work cdef extern from "": pass +cdef extern from "": + void Py_XDECREF(PyObject* o) + cdef extern from "" namespace "std" nogil: cdef cppclass shared_ptr[T]: diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 943a08f84a055..42f1f25073d1b 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -19,6 +19,25 @@ from pyarrow.includes.common cimport * +cdef extern from "arrow/api.h" namespace "arrow" nogil: + # We can later add more of the common status factory methods as needed + cdef CStatus CStatus_OK "Status::OK"() + + cdef cppclass CStatus "arrow::Status": + CStatus() + + c_string ToString() + + c_bool ok() + c_bool IsOutOfMemory() + c_bool IsKeyError() + c_bool IsNotImplemented() + c_bool IsInvalid() + + cdef cppclass Buffer: + uint8_t* data() + int64_t size() + cdef extern from "arrow/api.h" namespace "arrow" nogil: enum Type" arrow::Type::type": @@ -129,25 +148,30 @@ cdef extern from 
"arrow/api.h" namespace "arrow" nogil: cdef cppclass CStringArray" arrow::StringArray"(CListArray): c_string GetString(int i) + cdef cppclass CChunkedArray" arrow::ChunkedArray": + pass -cdef extern from "arrow/api.h" namespace "arrow" nogil: - # We can later add more of the common status factory methods as needed - cdef CStatus CStatus_OK "Status::OK"() + cdef cppclass CColumn" arrow::Column": + CColumn(const shared_ptr[CField]& field, + const shared_ptr[CArray]& data) - cdef cppclass CStatus "arrow::Status": - CStatus() + int64_t length() + int64_t null_count() + const c_string& name() + const shared_ptr[CDataType]& type() + const shared_ptr[CChunkedArray]& data() - c_string ToString() + cdef cppclass CTable" arrow::Table": + CTable(const c_string& name, const shared_ptr[CSchema]& schema, + const vector[shared_ptr[CColumn]]& columns) - c_bool ok() - c_bool IsOutOfMemory() - c_bool IsKeyError() - c_bool IsNotImplemented() - c_bool IsInvalid() + int num_columns() + int num_rows() - cdef cppclass Buffer: - uint8_t* data() - int64_t size() + const c_string& name() + + const shared_ptr[CSchema]& schema() + const shared_ptr[CColumn]& column(int i) cdef extern from "arrow/ipc/metadata.h" namespace "arrow::ipc" nogil: diff --git a/python/pyarrow/includes/pyarrow.pxd b/python/pyarrow/includes/pyarrow.pxd index eedfc85446810..1066b8034be70 100644 --- a/python/pyarrow/includes/pyarrow.pxd +++ b/python/pyarrow/includes/pyarrow.pxd @@ -18,7 +18,8 @@ # distutils: language = c++ from pyarrow.includes.common cimport * -from pyarrow.includes.libarrow cimport CArray, CDataType, Type, MemoryPool +from pyarrow.includes.libarrow cimport (CArray, CColumn, CDataType, + Type, MemoryPool) cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil: # We can later add more of the common status factory methods as needed @@ -41,4 +42,10 @@ cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil: shared_ptr[CDataType] GetPrimitiveType(Type type) Status ConvertPySequence(object obj, shared_ptr[CArray]* out) + Status PandasToArrow(MemoryPool* pool, object ao, shared_ptr[CArray]* out) + Status PandasMaskedToArrow(MemoryPool* pool, object ao, object mo, + shared_ptr[CArray]* out) + + Status ArrowToPandas(const shared_ptr[CColumn]& arr, PyObject** out) + MemoryPool* GetMemoryPool() diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py new file mode 100644 index 0000000000000..6dc9c689e249b --- /dev/null +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -0,0 +1,172 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +import unittest + +import numpy as np + +import pandas as pd +import pandas.util.testing as tm + +import pyarrow as A + + +class TestPandasConversion(unittest.TestCase): + + def setUp(self): + pass + + def tearDown(self): + pass + + def _check_pandas_roundtrip(self, df, expected=None): + table = A.from_pandas_dataframe(df) + result = table.to_pandas() + if expected is None: + expected = df + tm.assert_frame_equal(result, expected) + + def test_float_no_nulls(self): + data = {} + numpy_dtypes = ['f4', 'f8'] + num_values = 100 + + for dtype in numpy_dtypes: + values = np.random.randn(num_values) + data[dtype] = values.astype(dtype) + + df = pd.DataFrame(data) + self._check_pandas_roundtrip(df) + + def test_float_nulls(self): + num_values = 100 + + null_mask = np.random.randint(0, 10, size=num_values) < 3 + dtypes = ['f4', 'f8'] + expected_cols = [] + + arrays = [] + for name in dtypes: + values = np.random.randn(num_values).astype(name) + + arr = A.from_pandas_series(values, null_mask) + arrays.append(arr) + + values[null_mask] = np.nan + + expected_cols.append(values) + + ex_frame = pd.DataFrame(dict(zip(dtypes, expected_cols)), + columns=dtypes) + + table = A.Table.from_arrays(dtypes, arrays) + result = table.to_pandas() + tm.assert_frame_equal(result, ex_frame) + + def test_integer_no_nulls(self): + data = {} + + numpy_dtypes = ['i1', 'i2', 'i4', 'i8', 'u1', 'u2', 'u4', 'u8'] + num_values = 100 + + for dtype in numpy_dtypes: + info = np.iinfo(dtype) + values = np.random.randint(info.min, + min(info.max, np.iinfo('i8').max), + size=num_values) + data[dtype] = values.astype(dtype) + + df = pd.DataFrame(data) + self._check_pandas_roundtrip(df) + + def test_integer_with_nulls(self): + # pandas requires upcast to float dtype + + int_dtypes = ['i1', 'i2', 'i4', 'i8', 'u1', 'u2', 'u4', 'u8'] + num_values = 100 + + null_mask = np.random.randint(0, 10, size=num_values) < 3 + + expected_cols = [] + arrays = [] + for name in int_dtypes: + values = np.random.randint(0, 100, size=num_values) + + arr = A.from_pandas_series(values, null_mask) + arrays.append(arr) + + expected = values.astype('f8') + expected[null_mask] = np.nan + + expected_cols.append(expected) + + ex_frame = pd.DataFrame(dict(zip(int_dtypes, expected_cols)), + columns=int_dtypes) + + table = A.Table.from_arrays(int_dtypes, arrays) + result = table.to_pandas() + + tm.assert_frame_equal(result, ex_frame) + + def test_boolean_no_nulls(self): + num_values = 100 + + np.random.seed(0) + + df = pd.DataFrame({'bools': np.random.randn(num_values) > 0}) + self._check_pandas_roundtrip(df) + + def test_boolean_nulls(self): + # pandas requires upcast to object dtype + num_values = 100 + np.random.seed(0) + + mask = np.random.randint(0, 10, size=num_values) < 3 + values = np.random.randint(0, 10, size=num_values) < 5 + + arr = A.from_pandas_series(values, mask) + + expected = values.astype(object) + expected[mask] = None + + ex_frame = pd.DataFrame({'bools': expected}) + + table = A.Table.from_arrays(['bools'], [arr]) + result = table.to_pandas() + + tm.assert_frame_equal(result, ex_frame) + + def test_boolean_object_nulls(self): + arr = np.array([False, None, True] * 100, dtype=object) + df = pd.DataFrame({'bools': arr}) + self._check_pandas_roundtrip(df) + + def test_strings(self): + repeats = 1000 + values = [b'foo', None, u'bar', 'qux', np.nan] + df = pd.DataFrame({'strings': values * repeats}) + + values = ['foo', None, u'bar', 'qux', None] + expected = pd.DataFrame({'strings': values * repeats}) + self._check_pandas_roundtrip(df, expected) 
+ + # def test_category(self): + # repeats = 1000 + # values = [b'foo', None, u'bar', 'qux', np.nan] + # df = pd.DataFrame({'strings': values * repeats}) + # df['strings'] = df['strings'].astype('category') + # self._check_pandas_roundtrip(df) diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc new file mode 100644 index 0000000000000..22f1d7575f8c5 --- /dev/null +++ b/python/src/pyarrow/adapters/pandas.cc @@ -0,0 +1,714 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +// Functions for pandas conversion via NumPy + +#include + +#include "pyarrow/numpy_interop.h" + +#include +#include +#include +#include +#include + +#include "arrow/api.h" +#include "arrow/util/bit-util.h" + +#include "pyarrow/common.h" +#include "pyarrow/config.h" +#include "pyarrow/status.h" + +namespace pyarrow { + +using arrow::Array; +using arrow::Column; +namespace util = arrow::util; + +// ---------------------------------------------------------------------- +// Serialization + +template +struct npy_traits { +}; + +template <> +struct npy_traits { + typedef uint8_t value_type; + using ArrayType = arrow::BooleanArray; + + static constexpr bool supports_nulls = false; + static inline bool isnull(uint8_t v) { + return false; + } +}; + +#define NPY_INT_DECL(TYPE, CapType, T) \ + template <> \ + struct npy_traits { \ + typedef T value_type; \ + using ArrayType = arrow::CapType##Array; \ + \ + static constexpr bool supports_nulls = false; \ + static inline bool isnull(T v) { \ + return false; \ + } \ + }; + +NPY_INT_DECL(INT8, Int8, int8_t); +NPY_INT_DECL(INT16, Int16, int16_t); +NPY_INT_DECL(INT32, Int32, int32_t); +NPY_INT_DECL(INT64, Int64, int64_t); +NPY_INT_DECL(UINT8, UInt8, uint8_t); +NPY_INT_DECL(UINT16, UInt16, uint16_t); +NPY_INT_DECL(UINT32, UInt32, uint32_t); +NPY_INT_DECL(UINT64, UInt64, uint64_t); + +template <> +struct npy_traits { + typedef float value_type; + using ArrayType = arrow::FloatArray; + + static constexpr bool supports_nulls = true; + + static inline bool isnull(float v) { + return v != v; + } +}; + +template <> +struct npy_traits { + typedef double value_type; + using ArrayType = arrow::DoubleArray; + + static constexpr bool supports_nulls = true; + + static inline bool isnull(double v) { + return v != v; + } +}; + +template <> +struct npy_traits { + typedef PyObject* value_type; + static constexpr bool supports_nulls = true; +}; + +template +class ArrowSerializer { + public: + ArrowSerializer(arrow::MemoryPool* pool, PyArrayObject* arr, PyArrayObject* mask) : + pool_(pool), + arr_(arr), + mask_(mask) { + length_ = PyArray_SIZE(arr_); + } + + Status Convert(std::shared_ptr* out); + + int stride() const { + return PyArray_STRIDES(arr_)[0]; + } + + Status InitNullBitmap() { + int null_bytes = 
util::bytes_for_bits(length_); + + null_bitmap_ = std::make_shared(pool_); + RETURN_ARROW_NOT_OK(null_bitmap_->Resize(null_bytes)); + + null_bitmap_data_ = null_bitmap_->mutable_data(); + memset(null_bitmap_data_, 0, null_bytes); + + return Status::OK(); + } + + bool is_strided() const { + npy_intp* astrides = PyArray_STRIDES(arr_); + return astrides[0] != PyArray_DESCR(arr_)->elsize; + } + + private: + Status ConvertData(); + + Status ConvertObjectStrings(std::shared_ptr* out) { + PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + + auto offsets_buffer = std::make_shared(pool_); + RETURN_ARROW_NOT_OK(offsets_buffer->Resize(sizeof(int32_t) * (length_ + 1))); + int32_t* offsets = reinterpret_cast(offsets_buffer->mutable_data()); + + arrow::BufferBuilder data_builder(pool_); + arrow::Status s; + PyObject* obj; + int length; + int offset = 0; + int64_t null_count = 0; + for (int64_t i = 0; i < length_; ++i) { + obj = objects[i]; + if (PyUnicode_Check(obj)) { + obj = PyUnicode_AsUTF8String(obj); + if (obj == NULL) { + PyErr_Clear(); + return Status::TypeError("failed converting unicode to UTF8"); + } + length = PyBytes_GET_SIZE(obj); + s = data_builder.Append( + reinterpret_cast(PyBytes_AS_STRING(obj)), length); + Py_DECREF(obj); + if (!s.ok()) { + return Status::ArrowError(s.ToString()); + } + util::set_bit(null_bitmap_data_, i); + } else if (PyBytes_Check(obj)) { + length = PyBytes_GET_SIZE(obj); + RETURN_ARROW_NOT_OK(data_builder.Append( + reinterpret_cast(PyBytes_AS_STRING(obj)), length)); + util::set_bit(null_bitmap_data_, i); + } else { + // NULL + // No change to offset + length = 0; + ++null_count; + } + offsets[i] = offset; + offset += length; + } + // End offset + offsets[length_] = offset; + + std::shared_ptr data_buffer = data_builder.Finish(); + + auto values = std::make_shared(data_buffer->size(), + data_buffer); + *out = std::shared_ptr( + new arrow::StringArray(length_, offsets_buffer, values, null_count, + null_bitmap_)); + + return Status::OK(); + } + + Status ConvertBooleans(std::shared_ptr* out) { + PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + + int nbytes = util::bytes_for_bits(length_); + auto data = std::make_shared(pool_); + RETURN_ARROW_NOT_OK(data->Resize(nbytes)); + uint8_t* bitmap = data->mutable_data(); + memset(bitmap, 0, nbytes); + + int64_t null_count = 0; + for (int64_t i = 0; i < length_; ++i) { + if (objects[i] == Py_True) { + util::set_bit(bitmap, i); + util::set_bit(null_bitmap_data_, i); + } else if (objects[i] != Py_False) { + ++null_count; + } else { + util::set_bit(null_bitmap_data_, i); + } + } + + *out = std::make_shared(length_, data, null_count, + null_bitmap_); + + return Status::OK(); + } + + arrow::MemoryPool* pool_; + + PyArrayObject* arr_; + PyArrayObject* mask_; + + int64_t length_; + + std::shared_ptr data_; + std::shared_ptr null_bitmap_; + uint8_t* null_bitmap_data_; +}; + +// Returns null count +static int64_t MaskToBitmap(PyArrayObject* mask, int64_t length, uint8_t* bitmap) { + int64_t null_count = 0; + const uint8_t* mask_values = static_cast(PyArray_DATA(mask)); + // TODO(wesm): strided null mask + for (int i = 0; i < length; ++i) { + if (mask_values[i]) { + ++null_count; + } else { + util::set_bit(bitmap, i); + } + } + return null_count; +} + +template +static int64_t ValuesToBitmap(const void* data, int64_t length, uint8_t* bitmap) { + typedef npy_traits traits; + typedef typename traits::value_type T; + + int64_t null_count = 0; + const T* values = reinterpret_cast(data); + + // TODO(wesm): striding + for 
(int i = 0; i < length; ++i) { + if (traits::isnull(values[i])) { + ++null_count; + } else { + util::set_bit(bitmap, i); + } + } + + return null_count; +} + +template +inline Status ArrowSerializer::Convert(std::shared_ptr* out) { + typedef npy_traits traits; + + if (mask_ != nullptr || traits::supports_nulls) { + RETURN_NOT_OK(InitNullBitmap()); + } + + int64_t null_count = 0; + if (mask_ != nullptr) { + null_count = MaskToBitmap(mask_, length_, null_bitmap_data_); + } else if (traits::supports_nulls) { + null_count = ValuesToBitmap(PyArray_DATA(arr_), length_, null_bitmap_data_); + } + + RETURN_NOT_OK(ConvertData()); + *out = std::make_shared(length_, data_, null_count, + null_bitmap_); + + return Status::OK(); +} + +static inline bool PyObject_is_null(const PyObject* obj) { + return obj == Py_None || obj == numpy_nan; +} + +static inline bool PyObject_is_string(const PyObject* obj) { +#if PY_MAJOR_VERSION >= 3 + return PyUnicode_Check(obj) || PyBytes_Check(obj); +#else + return PyString_Check(obj) || PyUnicode_Check(obj); +#endif +} + +static inline bool PyObject_is_bool(const PyObject* obj) { +#if PY_MAJOR_VERSION >= 3 + return PyString_Check(obj) || PyBytes_Check(obj); +#else + return PyString_Check(obj) || PyUnicode_Check(obj); +#endif +} + +template <> +inline Status ArrowSerializer::Convert(std::shared_ptr* out) { + // Python object arrays are annoying, since we could have one of: + // + // * Strings + // * Booleans with nulls + // * Mixed type (not supported at the moment by arrow format) + // + // Additionally, nulls may be encoded either as np.nan or None. So we have to + // do some type inference and conversion + + RETURN_NOT_OK(InitNullBitmap()); + + // TODO: mask not supported here + const PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + + for (int64_t i = 0; i < length_; ++i) { + if (PyObject_is_null(objects[i])) { + continue; + } else if (PyObject_is_string(objects[i])) { + return ConvertObjectStrings(out); + } else if (PyBool_Check(objects[i])) { + return ConvertBooleans(out); + } else { + return Status::TypeError("unhandled python type"); + } + } + + return Status::TypeError("Unable to infer type of object array, were all null"); +} + +template +inline Status ArrowSerializer::ConvertData() { + // TODO(wesm): strided arrays + if (is_strided()) { + return Status::ValueError("no support for strided data yet"); + } + + data_ = std::make_shared(arr_); + return Status::OK(); +} + +template <> +inline Status ArrowSerializer::ConvertData() { + if (is_strided()) { + return Status::ValueError("no support for strided data yet"); + } + + int nbytes = util::bytes_for_bits(length_); + auto buffer = std::make_shared(pool_); + RETURN_ARROW_NOT_OK(buffer->Resize(nbytes)); + + const uint8_t* values = reinterpret_cast(PyArray_DATA(arr_)); + + uint8_t* bitmap = buffer->mutable_data(); + + memset(bitmap, 0, nbytes); + for (int i = 0; i < length_; ++i) { + if (values[i] > 0) { + util::set_bit(bitmap, i); + } + } + + data_ = buffer; + + return Status::OK(); +} + +template <> +inline Status ArrowSerializer::ConvertData() { + return Status::TypeError("NYI"); +} + + +#define TO_ARROW_CASE(TYPE) \ + case NPY_##TYPE: \ + { \ + ArrowSerializer converter(pool, arr, mask); \ + RETURN_NOT_OK(converter.Convert(out)); \ + } \ + break; + +Status PandasMaskedToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, + std::shared_ptr* out) { + PyArrayObject* arr = reinterpret_cast(ao); + PyArrayObject* mask = nullptr; + + if (mo != nullptr) { + mask = reinterpret_cast(mo); + } + + if 
(PyArray_NDIM(arr) != 1) { + return Status::ValueError("only handle 1-dimensional arrays"); + } + + switch(PyArray_DESCR(arr)->type_num) { + TO_ARROW_CASE(BOOL); + TO_ARROW_CASE(INT8); + TO_ARROW_CASE(INT16); + TO_ARROW_CASE(INT32); + TO_ARROW_CASE(INT64); + TO_ARROW_CASE(UINT8); + TO_ARROW_CASE(UINT16); + TO_ARROW_CASE(UINT32); + TO_ARROW_CASE(UINT64); + TO_ARROW_CASE(FLOAT32); + TO_ARROW_CASE(FLOAT64); + TO_ARROW_CASE(OBJECT); + default: + std::stringstream ss; + ss << "unsupported type " << PyArray_DESCR(arr)->type_num + << std::endl; + return Status::NotImplemented(ss.str()); + } + return Status::OK(); +} + +Status PandasToArrow(arrow::MemoryPool* pool, PyObject* ao, + std::shared_ptr* out) { + return PandasMaskedToArrow(pool, ao, nullptr, out); +} + +// ---------------------------------------------------------------------- +// Deserialization + +template +struct arrow_traits { +}; + +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_BOOL; + static constexpr bool supports_nulls = false; + static constexpr bool is_boolean = true; + static constexpr bool is_integer = false; + static constexpr bool is_floating = false; +}; + +#define INT_DECL(TYPE) \ + template <> \ + struct arrow_traits { \ + static constexpr int npy_type = NPY_##TYPE; \ + static constexpr bool supports_nulls = false; \ + static constexpr double na_value = NAN; \ + static constexpr bool is_boolean = false; \ + static constexpr bool is_integer = true; \ + static constexpr bool is_floating = false; \ + typedef typename npy_traits::value_type T; \ + }; + +INT_DECL(INT8); +INT_DECL(INT16); +INT_DECL(INT32); +INT_DECL(INT64); +INT_DECL(UINT8); +INT_DECL(UINT16); +INT_DECL(UINT32); +INT_DECL(UINT64); + +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_FLOAT32; + static constexpr bool supports_nulls = true; + static constexpr float na_value = NAN; + static constexpr bool is_boolean = false; + static constexpr bool is_integer = false; + static constexpr bool is_floating = true; + typedef typename npy_traits::value_type T; +}; + +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_FLOAT64; + static constexpr bool supports_nulls = true; + static constexpr double na_value = NAN; + static constexpr bool is_boolean = false; + static constexpr bool is_integer = false; + static constexpr bool is_floating = true; + typedef typename npy_traits::value_type T; +}; + +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_OBJECT; + static constexpr bool supports_nulls = true; + static constexpr bool is_boolean = false; + static constexpr bool is_integer = false; + static constexpr bool is_floating = false; +}; + + +static inline PyObject* make_pystring(const uint8_t* data, int32_t length) { +#if PY_MAJOR_VERSION >= 3 + return PyUnicode_FromStringAndSize(reinterpret_cast(data), length); +#else + return PyString_FromStringAndSize(reinterpret_cast(data), length); +#endif +} + +template +class ArrowDeserializer { + public: + ArrowDeserializer(const std::shared_ptr& col) : + col_(col) {} + + Status Convert(PyObject** out) { + const std::shared_ptr data = col_->data(); + if (data->num_chunks() > 1) { + return Status::NotImplemented("Chunked column conversion NYI"); + } + + auto chunk = data->chunk(0); + + RETURN_NOT_OK(ConvertValues(chunk)); + *out = reinterpret_cast(out_); + return Status::OK(); + } + + Status AllocateOutput(int type) { + npy_intp dims[1] = {col_->length()}; + out_ = reinterpret_cast(PyArray_SimpleNew(1, dims, type)); + + if (out_ == NULL) { + 
// Error occurred, trust that SimpleNew set the error state + return Status::OK(); + } + + return Status::OK(); + } + + template + inline typename std::enable_if< + arrow_traits::is_floating, Status>::type + ConvertValues(const std::shared_ptr& arr) { + typedef typename arrow_traits::T T; + + arrow::PrimitiveArray* prim_arr = static_cast( + arr.get()); + + RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + + if (arr->null_count() > 0) { + T* out_values = reinterpret_cast(PyArray_DATA(out_)); + const T* in_values = reinterpret_cast(prim_arr->data()->data()); + for (int64_t i = 0; i < arr->length(); ++i) { + out_values[i] = arr->IsNull(i) ? NAN : in_values[i]; + } + } else { + memcpy(PyArray_DATA(out_), prim_arr->data()->data(), + arr->length() * arr->type()->value_size()); + } + + return Status::OK(); + } + + // Integer specialization + template + inline typename std::enable_if< + arrow_traits::is_integer, Status>::type + ConvertValues(const std::shared_ptr& arr) { + typedef typename arrow_traits::T T; + + arrow::PrimitiveArray* prim_arr = static_cast( + arr.get()); + + const T* in_values = reinterpret_cast(prim_arr->data()->data()); + + if (arr->null_count() > 0) { + RETURN_NOT_OK(AllocateOutput(NPY_FLOAT64)); + + // Upcast to double, set NaN as appropriate + double* out_values = reinterpret_cast(PyArray_DATA(out_)); + for (int i = 0; i < arr->length(); ++i) { + out_values[i] = prim_arr->IsNull(i) ? NAN : in_values[i]; + } + } else { + RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + + memcpy(PyArray_DATA(out_), in_values, + arr->length() * arr->type()->value_size()); + } + + return Status::OK(); + } + + // Boolean specialization + template + inline typename std::enable_if< + arrow_traits::is_boolean, Status>::type + ConvertValues(const std::shared_ptr& arr) { + arrow::BooleanArray* bool_arr = static_cast(arr.get()); + + if (arr->null_count() > 0) { + RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); + + PyObject** out_values = reinterpret_cast(PyArray_DATA(out_)); + for (int64_t i = 0; i < arr->length(); ++i) { + if (bool_arr->IsNull(i)) { + Py_INCREF(Py_None); + out_values[i] = Py_None; + } else if (bool_arr->Value(i)) { + // True + Py_INCREF(Py_True); + out_values[i] = Py_True; + } else { + // False + Py_INCREF(Py_False); + out_values[i] = Py_False; + } + } + } else { + RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + + uint8_t* out_values = reinterpret_cast(PyArray_DATA(out_)); + for (int64_t i = 0; i < arr->length(); ++i) { + out_values[i] = static_cast(bool_arr->Value(i)); + } + } + + return Status::OK(); + } + + // UTF8 + template + inline typename std::enable_if< + T2 == arrow::Type::STRING, Status>::type + ConvertValues(const std::shared_ptr& arr) { + RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); + + PyObject** out_values = reinterpret_cast(PyArray_DATA(out_)); + + arrow::StringArray* string_arr = static_cast(arr.get()); + + const uint8_t* data; + int32_t length; + if (arr->null_count() > 0) { + for (int64_t i = 0; i < arr->length(); ++i) { + if (string_arr->IsNull(i)) { + Py_INCREF(Py_None); + out_values[i] = Py_None; + } else { + data = string_arr->GetValue(i, &length); + + out_values[i] = make_pystring(data, length); + if (out_values[i] == nullptr) { + return Status::OK(); + } + } + } + } else { + for (int64_t i = 0; i < arr->length(); ++i) { + data = string_arr->GetValue(i, &length); + out_values[i] = make_pystring(data, length); + if (out_values[i] == nullptr) { + return Status::OK(); + } + } + } + return Status::OK(); + } + private: + std::shared_ptr col_; + 
PyArrayObject* out_; +}; + +#define FROM_ARROW_CASE(TYPE) \ + case arrow::Type::TYPE: \ + { \ + ArrowDeserializer converter(col); \ + return converter.Convert(out); \ + } \ + break; + +Status ArrowToPandas(const std::shared_ptr& col, PyObject** out) { + switch(col->type()->type) { + FROM_ARROW_CASE(BOOL); + FROM_ARROW_CASE(INT8); + FROM_ARROW_CASE(INT16); + FROM_ARROW_CASE(INT32); + FROM_ARROW_CASE(INT64); + FROM_ARROW_CASE(UINT8); + FROM_ARROW_CASE(UINT16); + FROM_ARROW_CASE(UINT32); + FROM_ARROW_CASE(UINT64); + FROM_ARROW_CASE(FLOAT); + FROM_ARROW_CASE(DOUBLE); + FROM_ARROW_CASE(STRING); + default: + return Status::NotImplemented("Arrow type reading not implemented"); + } + return Status::OK(); +} + +} // namespace pyarrow diff --git a/python/src/pyarrow/adapters/pandas.h b/python/src/pyarrow/adapters/pandas.h index a4f4163808711..58eb3ca61cdf4 100644 --- a/python/src/pyarrow/adapters/pandas.h +++ b/python/src/pyarrow/adapters/pandas.h @@ -21,8 +21,29 @@ #ifndef PYARROW_ADAPTERS_PANDAS_H #define PYARROW_ADAPTERS_PANDAS_H +#include + +#include + +namespace arrow { + +class Array; +class Column; + +} // namespace arrow + namespace pyarrow { +class Status; + +Status ArrowToPandas(const std::shared_ptr& col, PyObject** out); + +Status PandasMaskedToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, + std::shared_ptr* out); + +Status PandasToArrow(arrow::MemoryPool* pool, PyObject* ao, + std::shared_ptr* out); + } // namespace pyarrow #endif // PYARROW_ADAPTERS_PANDAS_H diff --git a/python/src/pyarrow/common.h b/python/src/pyarrow/common.h index db6361384c10d..cc9ad9ec5bbea 100644 --- a/python/src/pyarrow/common.h +++ b/python/src/pyarrow/common.h @@ -18,7 +18,9 @@ #ifndef PYARROW_COMMON_H #define PYARROW_COMMON_H -#include +#include "pyarrow/config.h" + +#include "arrow/util/buffer.h" namespace arrow { class MemoryPool; } @@ -90,6 +92,25 @@ struct PyObjectStringify { arrow::MemoryPool* GetMemoryPool(); +class NumPyBuffer : public arrow::Buffer { + public: + NumPyBuffer(PyArrayObject* arr) : + Buffer(nullptr, 0) { + arr_ = arr; + Py_INCREF(arr); + + data_ = reinterpret_cast(PyArray_DATA(arr_)); + size_ = PyArray_SIZE(arr_); + } + + virtual ~NumPyBuffer() { + Py_XDECREF(arr_); + } + + private: + PyArrayObject* arr_; +}; + } // namespace pyarrow #endif // PYARROW_COMMON_H diff --git a/python/src/pyarrow/init.cc b/python/src/pyarrow/config.cc similarity index 84% rename from python/src/pyarrow/init.cc rename to python/src/pyarrow/config.cc index acd851e168743..730d2db99a530 100644 --- a/python/src/pyarrow/init.cc +++ b/python/src/pyarrow/config.cc @@ -15,11 +15,20 @@ // specific language governing permissions and limitations // under the License. -#include "pyarrow/init.h" +#include + +#include "pyarrow/config.h" namespace pyarrow { void pyarrow_init() { } +PyObject* numpy_nan = nullptr; + +void pyarrow_set_numpy_nan(PyObject* obj) { + Py_INCREF(obj); + numpy_nan = obj; +} + } // namespace pyarrow diff --git a/python/src/pyarrow/config.h b/python/src/pyarrow/config.h new file mode 100644 index 0000000000000..48ae715d842b1 --- /dev/null +++ b/python/src/pyarrow/config.h @@ -0,0 +1,39 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef PYARROW_CONFIG_H +#define PYARROW_CONFIG_H + +#include + +#include "pyarrow/numpy_interop.h" + +#if PY_MAJOR_VERSION >= 3 + #define PyString_Check PyUnicode_Check +#endif + +namespace pyarrow { + +extern PyObject* numpy_nan; + +void pyarrow_init(); + +void pyarrow_set_numpy_nan(PyObject* obj); + +} // namespace pyarrow + +#endif // PYARROW_CONFIG_H diff --git a/python/src/pyarrow/init.h b/python/src/pyarrow/do_import_numpy.h similarity index 83% rename from python/src/pyarrow/init.h rename to python/src/pyarrow/do_import_numpy.h index 71e67a20c1ca5..bb4a382959102 100644 --- a/python/src/pyarrow/init.h +++ b/python/src/pyarrow/do_import_numpy.h @@ -15,13 +15,7 @@ // specific language governing permissions and limitations // under the License. -#ifndef PYARROW_INIT_H -#define PYARROW_INIT_H +// Trick borrowed from dynd-python for initializing the NumPy array API -namespace pyarrow { - -void pyarrow_init(); - -} // namespace pyarrow - -#endif // PYARROW_INIT_H +// Trigger the array import (inversion of NO_IMPORT_ARRAY) +#define NUMPY_IMPORT_ARRAY diff --git a/python/src/pyarrow/numpy_interop.h b/python/src/pyarrow/numpy_interop.h new file mode 100644 index 0000000000000..882d287c7c559 --- /dev/null +++ b/python/src/pyarrow/numpy_interop.h @@ -0,0 +1,58 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef PYARROW_NUMPY_INTEROP_H +#define PYARROW_NUMPY_INTEROP_H + +#include + +#include + +// Don't use the deprecated Numpy functions +#ifdef NPY_1_7_API_VERSION +#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION +#else +#define NPY_ARRAY_NOTSWAPPED NPY_NOTSWAPPED +#define NPY_ARRAY_ALIGNED NPY_ALIGNED +#define NPY_ARRAY_WRITEABLE NPY_WRITEABLE +#define NPY_ARRAY_UPDATEIFCOPY NPY_UPDATEIFCOPY +#endif + +// This is required to be able to access the NumPy C API properly in C++ files +// other than this main one +#define PY_ARRAY_UNIQUE_SYMBOL pyarrow_ARRAY_API +#ifndef NUMPY_IMPORT_ARRAY +#define NO_IMPORT_ARRAY +#endif + +#include +#include + +namespace pyarrow { + +inline int import_numpy() { +#ifdef NUMPY_IMPORT_ARRAY + import_array1(-1); + import_umath1(-1); +#endif + + return 0; +} + +} // namespace pyarrow + +#endif // PYARROW_NUMPY_INTEROP_H From ecadd0bcb9f022a5067826ed564f513ffd0c578e Mon Sep 17 00:00:00 2001 From: "Uwe L. 
Korn" Date: Mon, 28 Mar 2016 09:38:13 -0700 Subject: [PATCH 0048/1644] ARROW-80: Handle len call for pre-init arrays Author: Uwe L. Korn Closes #45 from xhochy/arrow-80 and squashes the following commits: d9a1160 [Uwe L. Korn] Add unit test for repr on pre-init Array 6208d7d [Uwe L. Korn] ARROW-80: Handle len call for pre-init arrays --- python/pyarrow/array.pyx | 5 ++++- python/pyarrow/tests/test_array.py | 4 ++++ 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 88770cdaa966e..155c965f3e8aa 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -67,7 +67,10 @@ cdef class Array: return '{0}\n{1}'.format(type_format, values) def __len__(self): - return self.sp_array.get().length() + if self.sp_array.get(): + return self.sp_array.get().length() + else: + return 0 def isnull(self): raise NotImplemented diff --git a/python/pyarrow/tests/test_array.py b/python/pyarrow/tests/test_array.py index 36aaaa4f93d5d..d608f8167df65 100644 --- a/python/pyarrow/tests/test_array.py +++ b/python/pyarrow/tests/test_array.py @@ -22,6 +22,10 @@ class TestArrayAPI(unittest.TestCase): + def test_repr_on_pre_init_array(self): + arr = pyarrow.array.Array() + assert len(repr(arr)) > 0 + def test_getitem_NA(self): arr = pyarrow.from_pylist([1, None, 2]) assert arr[1] is pyarrow.NA From 80ec2c17fccac484993868f951d95362cb75cea9 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Mon, 28 Mar 2016 09:39:55 -0700 Subject: [PATCH 0049/1644] ARROW-79: [Python] Add benchmarks Run them using `asv run --python=same` or `asv dev`. Author: Uwe L. Korn Closes #44 from xhochy/arrow-79 and squashes the following commits: d3c6401 [Uwe L. Korn] Move benchmarks to toplevel folder 2737f18 [Uwe L. Korn] ARROW-79: [Python] Add benchmarks --- python/.gitignore | 3 ++ python/asv.conf.json | 73 +++++++++++++++++++++++++++++++++++ python/benchmarks/__init__.py | 17 ++++++++ python/benchmarks/array.py | 38 ++++++++++++++++++ python/doc/Benchmarks.md | 11 ++++++ 5 files changed, 142 insertions(+) create mode 100644 python/asv.conf.json create mode 100644 python/benchmarks/__init__.py create mode 100644 python/benchmarks/array.py create mode 100644 python/doc/Benchmarks.md diff --git a/python/.gitignore b/python/.gitignore index 80103a1a52942..3cb591ea766d5 100644 --- a/python/.gitignore +++ b/python/.gitignore @@ -35,3 +35,6 @@ dist # coverage .coverage coverage.xml + +# benchmark working dir +.asv diff --git a/python/asv.conf.json b/python/asv.conf.json new file mode 100644 index 0000000000000..96beba64c2e6e --- /dev/null +++ b/python/asv.conf.json @@ -0,0 +1,73 @@ +{ + // The version of the config file format. Do not change, unless + // you know what you are doing. + "version": 1, + + // The name of the project being benchmarked + "project": "pyarrow", + + // The project's homepage + "project_url": "https://arrow.apache.org/", + + // The URL or local path of the source code repository for the + // project being benchmarked + "repo": "https://github.com/apache/arrow/", + + // List of branches to benchmark. If not provided, defaults to "master" + // (for git) or "tip" (for mercurial). + // "branches": ["master"], // for git + // "branches": ["tip"], // for mercurial + + // The DVCS being used. If not set, it will be automatically + // determined from "repo" by looking at the protocol in the URL + // (if remote), or by looking for special directories, such as + // ".git" (if local). + "dvcs": "git", + + // The tool to use to create environments. 
May be "conda", + // "virtualenv" or other value depending on the plugins in use. + // If missing or the empty string, the tool will be automatically + // determined by looking for tools on the PATH environment + // variable. + "environment_type": "virtualenv", + + // the base URL to show a commit for the project. + "show_commit_url": "https://github.com/apache/arrow/commit/", + + // The Pythons you'd like to test against. If not provided, defaults + // to the current version of Python used to run `asv`. + // "pythons": ["2.7", "3.3"], + + // The matrix of dependencies to test. Each key is the name of a + // package (in PyPI) and the values are version numbers. An empty + // list indicates to just test against the default (latest) + // version. + // "matrix": { + // "numpy": ["1.6", "1.7"] + // }, + + // The directory (relative to the current directory) that benchmarks are + // stored in. If not provided, defaults to "benchmarks" + "benchmark_dir": "benchmarks", + + // The directory (relative to the current directory) to cache the Python + // environments in. If not provided, defaults to "env" + "env_dir": ".asv/env", + + + // The directory (relative to the current directory) that raw benchmark + // results are stored in. If not provided, defaults to "results". + "results_dir": ".asv/results", + + // The directory (relative to the current directory) that the html tree + // should be written to. If not provided, defaults to "html". + "html_dir": "build/benchmarks/html", + + // The number of characters to retain in the commit hashes. + // "hash_length": 8, + + // `asv` will cache wheels of the recent builds in each + // environment, making them faster to install next time. This is + // number of builds to keep, per environment. + // "wheel_cache_size": 0 +} diff --git a/python/benchmarks/__init__.py b/python/benchmarks/__init__.py new file mode 100644 index 0000000000000..245692337bc3f --- /dev/null +++ b/python/benchmarks/__init__.py @@ -0,0 +1,17 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + diff --git a/python/benchmarks/array.py b/python/benchmarks/array.py new file mode 100644 index 0000000000000..6ab73d18d1f87 --- /dev/null +++ b/python/benchmarks/array.py @@ -0,0 +1,38 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +import pyarrow + +class Conversions(object): + params = (1, 10 ** 5, 10 ** 6, 10 ** 7) + + def time_from_pylist(self, n): + pyarrow.from_pylist(list(range(n))) + + def peakmem_from_pylist(self, n): + pyarrow.from_pylist(list(range(n))) + +class ScalarAccess(object): + params = (1, 10 ** 5, 10 ** 6, 10 ** 7) + + def setUp(self, n): + self._array = pyarrow.from_pylist(list(range(n))) + + def time_as_py(self, n): + for i in range(n): + self._array[i].as_py() + diff --git a/python/doc/Benchmarks.md b/python/doc/Benchmarks.md new file mode 100644 index 0000000000000..8edfb6209e4af --- /dev/null +++ b/python/doc/Benchmarks.md @@ -0,0 +1,11 @@ +## Benchmark Requirements + +The benchmarks are run using [asv][1] which is also their only requirement. + +## Running the benchmarks + +To run the benchmarks, call `asv run --python=same`. You cannot use the +plain `asv run` command at the moment as asv cannot handle python packages +in subdirectories of a repository. + +[1]: https://asv.readthedocs.org/ From df7726d44ab59828aacc20a1786287ba7ade2562 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 28 Mar 2016 10:39:25 -0700 Subject: [PATCH 0050/1644] ARROW-88: [C++] Refactor usages of parquet_cpp namespace I also removed an unneeded `Py_XDECREF` from ARROW-30; didn't want to create a separate patch for that. Author: Wes McKinney Closes #49 from wesm/ARROW-88 and squashes the following commits: c4d81dc [Wes McKinney] Refactor usages of parquet_cpp namespace --- cpp/src/arrow/parquet/parquet-schema-test.cc | 40 ++++++++++---------- cpp/src/arrow/parquet/schema.cc | 29 +++++++------- cpp/src/arrow/parquet/schema.h | 4 +- python/pyarrow/array.pyx | 3 -- python/pyarrow/includes/parquet.pxd | 2 +- 5 files changed, 39 insertions(+), 39 deletions(-) diff --git a/cpp/src/arrow/parquet/parquet-schema-test.cc b/cpp/src/arrow/parquet/parquet-schema-test.cc index 9c3093d9ff7c9..02a8caf03c9bd 100644 --- a/cpp/src/arrow/parquet/parquet-schema-test.cc +++ b/cpp/src/arrow/parquet/parquet-schema-test.cc @@ -26,15 +26,17 @@ #include "arrow/parquet/schema.h" +using ParquetType = parquet::Type; +using parquet::LogicalType; +using parquet::Repetition; +using parquet::schema::NodePtr; +using parquet::schema::GroupNode; +using parquet::schema::PrimitiveNode; + namespace arrow { namespace parquet { -using parquet_cpp::Repetition; -using parquet_cpp::schema::NodePtr; -using parquet_cpp::schema::GroupNode; -using parquet_cpp::schema::PrimitiveNode; - const auto BOOL = std::make_shared(); const auto UINT8 = std::make_shared(); const auto INT32 = std::make_shared(); @@ -66,7 +68,7 @@ class TestConvertParquetSchema : public ::testing::Test { } protected: - parquet_cpp::SchemaDescriptor descr_; + ::parquet::SchemaDescriptor descr_; std::shared_ptr result_schema_; }; @@ -75,40 +77,40 @@ TEST_F(TestConvertParquetSchema, ParquetFlatPrimitives) { std::vector> arrow_fields; parquet_fields.push_back( - PrimitiveNode::Make("boolean", Repetition::REQUIRED, parquet_cpp::Type::BOOLEAN)); + PrimitiveNode::Make("boolean", Repetition::REQUIRED, ParquetType::BOOLEAN)); arrow_fields.push_back(std::make_shared("boolean", BOOL, 
false)); parquet_fields.push_back( - PrimitiveNode::Make("int32", Repetition::REQUIRED, parquet_cpp::Type::INT32)); + PrimitiveNode::Make("int32", Repetition::REQUIRED, ParquetType::INT32)); arrow_fields.push_back(std::make_shared("int32", INT32, false)); parquet_fields.push_back( - PrimitiveNode::Make("int64", Repetition::REQUIRED, parquet_cpp::Type::INT64)); + PrimitiveNode::Make("int64", Repetition::REQUIRED, ParquetType::INT64)); arrow_fields.push_back(std::make_shared("int64", INT64, false)); parquet_fields.push_back( - PrimitiveNode::Make("float", Repetition::OPTIONAL, parquet_cpp::Type::FLOAT)); + PrimitiveNode::Make("float", Repetition::OPTIONAL, ParquetType::FLOAT)); arrow_fields.push_back(std::make_shared("float", FLOAT)); parquet_fields.push_back( - PrimitiveNode::Make("double", Repetition::OPTIONAL, parquet_cpp::Type::DOUBLE)); + PrimitiveNode::Make("double", Repetition::OPTIONAL, ParquetType::DOUBLE)); arrow_fields.push_back(std::make_shared("double", DOUBLE)); parquet_fields.push_back( PrimitiveNode::Make("binary", Repetition::OPTIONAL, - parquet_cpp::Type::BYTE_ARRAY)); + ParquetType::BYTE_ARRAY)); arrow_fields.push_back(std::make_shared("binary", BINARY)); parquet_fields.push_back( PrimitiveNode::Make("string", Repetition::OPTIONAL, - parquet_cpp::Type::BYTE_ARRAY, - parquet_cpp::LogicalType::UTF8)); + ParquetType::BYTE_ARRAY, + LogicalType::UTF8)); arrow_fields.push_back(std::make_shared("string", UTF8)); parquet_fields.push_back( PrimitiveNode::Make("flba-binary", Repetition::OPTIONAL, - parquet_cpp::Type::FIXED_LEN_BYTE_ARRAY, - parquet_cpp::LogicalType::NONE, 12)); + ParquetType::FIXED_LEN_BYTE_ARRAY, + LogicalType::NONE, 12)); arrow_fields.push_back(std::make_shared("flba-binary", BINARY)); auto arrow_schema = std::make_shared(arrow_fields); @@ -121,18 +123,18 @@ TEST_F(TestConvertParquetSchema, UnsupportedThings) { std::vector unsupported_nodes; unsupported_nodes.push_back( - PrimitiveNode::Make("int96", Repetition::REQUIRED, parquet_cpp::Type::INT96)); + PrimitiveNode::Make("int96", Repetition::REQUIRED, ParquetType::INT96)); unsupported_nodes.push_back( GroupNode::Make("repeated-group", Repetition::REPEATED, {})); unsupported_nodes.push_back( PrimitiveNode::Make("int32", Repetition::OPTIONAL, - parquet_cpp::Type::INT32, parquet_cpp::LogicalType::DATE)); + ParquetType::INT32, LogicalType::DATE)); unsupported_nodes.push_back( PrimitiveNode::Make("int64", Repetition::OPTIONAL, - parquet_cpp::Type::INT64, parquet_cpp::LogicalType::TIMESTAMP_MILLIS)); + ParquetType::INT64, LogicalType::TIMESTAMP_MILLIS)); for (const NodePtr& node : unsupported_nodes) { ASSERT_RAISES(NotImplemented, ConvertSchema({node})); diff --git a/cpp/src/arrow/parquet/schema.cc b/cpp/src/arrow/parquet/schema.cc index 6b1de572617b8..d8eb2addb0ada 100644 --- a/cpp/src/arrow/parquet/schema.cc +++ b/cpp/src/arrow/parquet/schema.cc @@ -24,12 +24,13 @@ #include "arrow/util/status.h" #include "arrow/types/decimal.h" -using parquet_cpp::schema::Node; -using parquet_cpp::schema::NodePtr; -using parquet_cpp::schema::GroupNode; -using parquet_cpp::schema::PrimitiveNode; +using parquet::schema::Node; +using parquet::schema::NodePtr; +using parquet::schema::GroupNode; +using parquet::schema::PrimitiveNode; -using parquet_cpp::LogicalType; +using ParquetType = parquet::Type; +using parquet::LogicalType; namespace arrow { @@ -124,30 +125,30 @@ Status NodeToField(const NodePtr& node, std::shared_ptr* out) { const PrimitiveNode* primitive = static_cast(node.get()); switch (primitive->physical_type()) { - case 
parquet_cpp::Type::BOOLEAN: + case ParquetType::BOOLEAN: type = BOOL; break; - case parquet_cpp::Type::INT32: + case ParquetType::INT32: RETURN_NOT_OK(FromInt32(primitive, &type)); break; - case parquet_cpp::Type::INT64: + case ParquetType::INT64: RETURN_NOT_OK(FromInt64(primitive, &type)); break; - case parquet_cpp::Type::INT96: + case ParquetType::INT96: // TODO: Do we have that type in Arrow? // type = TypePtr(new Int96Type()); return Status::NotImplemented("int96"); - case parquet_cpp::Type::FLOAT: + case ParquetType::FLOAT: type = FLOAT; break; - case parquet_cpp::Type::DOUBLE: + case ParquetType::DOUBLE: type = DOUBLE; break; - case parquet_cpp::Type::BYTE_ARRAY: + case ParquetType::BYTE_ARRAY: // TODO: Do we have that type in Arrow? RETURN_NOT_OK(FromByteArray(primitive, &type)); break; - case parquet_cpp::Type::FIXED_LEN_BYTE_ARRAY: + case ParquetType::FIXED_LEN_BYTE_ARRAY: RETURN_NOT_OK(FromFLBA(primitive, &type)); break; } @@ -157,7 +158,7 @@ Status NodeToField(const NodePtr& node, std::shared_ptr* out) { return Status::OK(); } -Status FromParquetSchema(const parquet_cpp::SchemaDescriptor* parquet_schema, +Status FromParquetSchema(const ::parquet::SchemaDescriptor* parquet_schema, std::shared_ptr* out) { // TODO(wesm): Consider adding an arrow::Schema name attribute, which comes // from the root Parquet node diff --git a/cpp/src/arrow/parquet/schema.h b/cpp/src/arrow/parquet/schema.h index 61de193a33877..a8408970ede48 100644 --- a/cpp/src/arrow/parquet/schema.h +++ b/cpp/src/arrow/parquet/schema.h @@ -31,10 +31,10 @@ class Status; namespace parquet { -Status NodeToField(const parquet_cpp::schema::NodePtr& node, +Status NodeToField(const ::parquet::schema::NodePtr& node, std::shared_ptr* out); -Status FromParquetSchema(const parquet_cpp::SchemaDescriptor* parquet_schema, +Status FromParquetSchema(const ::parquet::SchemaDescriptor* parquet_schema, std::shared_ptr* out); } // namespace parquet diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 155c965f3e8aa..255efc268fe29 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -359,7 +359,4 @@ cdef class Table: names.append(frombytes(col.get().name())) data.append( arr) - # One ref count too many - Py_XDECREF(arr) - return pd.DataFrame(dict(zip(names, data)), columns=names) diff --git a/python/pyarrow/includes/parquet.pxd b/python/pyarrow/includes/parquet.pxd index 99a2d423d9cba..ffdc5d487068d 100644 --- a/python/pyarrow/includes/parquet.pxd +++ b/python/pyarrow/includes/parquet.pxd @@ -19,7 +19,7 @@ from pyarrow.includes.common cimport * -cdef extern from "parquet/api/reader.h" namespace "parquet_cpp" nogil: +cdef extern from "parquet/api/reader.h" namespace "parquet" nogil: cdef cppclass ColumnReader: pass From 38897ee29f85765f7646e90237fa85f98ccb55f5 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Mon, 28 Mar 2016 10:42:14 -0700 Subject: [PATCH 0051/1644] ARROW-83: [C++] Add basic test infrastructure for DecimalType Author: Uwe L. Korn Closes #47 from xhochy/arrow-83 and squashes the following commits: 6eabd7a [Uwe L. Korn] Remove unused forward decl e1854e9 [Uwe L. 
Korn] ARROW-83: [C++] Add basic test infrastructure for DecimalType --- cpp/src/arrow/types/CMakeLists.txt | 1 + cpp/src/arrow/types/decimal-test.cc | 40 +++++++++++++++++++++++++++++ 2 files changed, 41 insertions(+) create mode 100644 cpp/src/arrow/types/decimal-test.cc diff --git a/cpp/src/arrow/types/CMakeLists.txt b/cpp/src/arrow/types/CMakeLists.txt index f3e41289bfe8d..72a8e77664610 100644 --- a/cpp/src/arrow/types/CMakeLists.txt +++ b/cpp/src/arrow/types/CMakeLists.txt @@ -34,6 +34,7 @@ install(FILES DESTINATION include/arrow/types) +ADD_ARROW_TEST(decimal-test) ADD_ARROW_TEST(list-test) ADD_ARROW_TEST(primitive-test) ADD_ARROW_TEST(string-test) diff --git a/cpp/src/arrow/types/decimal-test.cc b/cpp/src/arrow/types/decimal-test.cc new file mode 100644 index 0000000000000..89896c8b425d0 --- /dev/null +++ b/cpp/src/arrow/types/decimal-test.cc @@ -0,0 +1,40 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "gtest/gtest.h" + +#include "arrow/types/decimal.h" + +namespace arrow { + +TEST(TypesTest, TestDecimalType) { + DecimalType t1(8, 4); + + ASSERT_EQ(t1.type, Type::DECIMAL); + ASSERT_EQ(t1.precision, 8); + ASSERT_EQ(t1.scale, 4); + + ASSERT_EQ(t1.ToString(), std::string("decimal(8, 4)")); + + // Test copy constructor + DecimalType t2 = t1; + ASSERT_EQ(t2.type, Type::DECIMAL); + ASSERT_EQ(t2.precision, 8); + ASSERT_EQ(t2.scale, 4); +} + +} // namespace arrow From 2d8627cd81f83783b0ceb01d137a46b581ecba26 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Mon, 28 Mar 2016 10:49:08 -0700 Subject: [PATCH 0052/1644] ARROW-87: [C++] Add all four possible ways to encode Decimals in Parquet to schema conversion See also: https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#decimal Author: Uwe L. Korn Closes #48 from xhochy/arrow-87 and squashes the following commits: 05ca3be [Uwe L. Korn] Use parquet:: namespace instead of parquet_cpp 6bafc5f [Uwe L. 
Korn] ARROW-87: [C++] Add all four possible ways to encode Decimals in Parquet to schema conversion --- cpp/src/arrow/parquet/parquet-schema-test.cc | 36 ++++++++++++++++++++ cpp/src/arrow/parquet/schema.cc | 9 +++++ 2 files changed, 45 insertions(+) diff --git a/cpp/src/arrow/parquet/parquet-schema-test.cc b/cpp/src/arrow/parquet/parquet-schema-test.cc index 02a8caf03c9bd..a289ddbfde6eb 100644 --- a/cpp/src/arrow/parquet/parquet-schema-test.cc +++ b/cpp/src/arrow/parquet/parquet-schema-test.cc @@ -22,6 +22,7 @@ #include "arrow/test-util.h" #include "arrow/type.h" +#include "arrow/types/decimal.h" #include "arrow/util/status.h" #include "arrow/parquet/schema.h" @@ -46,6 +47,7 @@ const auto DOUBLE = std::make_shared(); const auto UTF8 = std::make_shared(); const auto BINARY = std::make_shared( std::make_shared("", UINT8)); +const auto DECIMAL_8_4 = std::make_shared(8, 4); class TestConvertParquetSchema : public ::testing::Test { public: @@ -119,6 +121,40 @@ TEST_F(TestConvertParquetSchema, ParquetFlatPrimitives) { CheckFlatSchema(arrow_schema); } +TEST_F(TestConvertParquetSchema, ParquetFlatDecimals) { + std::vector parquet_fields; + std::vector> arrow_fields; + + parquet_fields.push_back( + PrimitiveNode::Make("flba-decimal", Repetition::OPTIONAL, + ParquetType::FIXED_LEN_BYTE_ARRAY, + LogicalType::DECIMAL, 4, 8, 4)); + arrow_fields.push_back(std::make_shared("flba-decimal", DECIMAL_8_4)); + + parquet_fields.push_back( + PrimitiveNode::Make("binary-decimal", Repetition::OPTIONAL, + ParquetType::BYTE_ARRAY, + LogicalType::DECIMAL, -1, 8, 4)); + arrow_fields.push_back(std::make_shared("binary-decimal", DECIMAL_8_4)); + + parquet_fields.push_back( + PrimitiveNode::Make("int32-decimal", Repetition::OPTIONAL, + ParquetType::INT32, + LogicalType::DECIMAL, -1, 8, 4)); + arrow_fields.push_back(std::make_shared("int32-decimal", DECIMAL_8_4)); + + parquet_fields.push_back( + PrimitiveNode::Make("int64-decimal", Repetition::OPTIONAL, + ParquetType::INT64, + LogicalType::DECIMAL, -1, 8, 4)); + arrow_fields.push_back(std::make_shared("int64-decimal", DECIMAL_8_4)); + + auto arrow_schema = std::make_shared(arrow_fields); + ASSERT_OK(ConvertSchema(parquet_fields)); + + CheckFlatSchema(arrow_schema); +} + TEST_F(TestConvertParquetSchema, UnsupportedThings) { std::vector unsupported_nodes; diff --git a/cpp/src/arrow/parquet/schema.cc b/cpp/src/arrow/parquet/schema.cc index d8eb2addb0ada..14f4f5be53ce9 100644 --- a/cpp/src/arrow/parquet/schema.cc +++ b/cpp/src/arrow/parquet/schema.cc @@ -57,6 +57,9 @@ static Status FromByteArray(const PrimitiveNode* node, TypePtr* out) { case LogicalType::UTF8: *out = UTF8; break; + case LogicalType::DECIMAL: + *out = MakeDecimalType(node); + break; default: // BINARY *out = BINARY; @@ -86,6 +89,9 @@ static Status FromInt32(const PrimitiveNode* node, TypePtr* out) { case LogicalType::NONE: *out = INT32; break; + case LogicalType::DECIMAL: + *out = MakeDecimalType(node); + break; default: return Status::NotImplemented("Unhandled logical type for int32"); break; @@ -98,6 +104,9 @@ static Status FromInt64(const PrimitiveNode* node, TypePtr* out) { case LogicalType::NONE: *out = INT64; break; + case LogicalType::DECIMAL: + *out = MakeDecimalType(node); + break; default: return Status::NotImplemented("Unhandled logical type for int64"); break; From 5a68f8d737aa94ff3d09dae4e5b29883e798e9c4 Mon Sep 17 00:00:00 2001 From: Dan Robinson Date: Thu, 31 Mar 2016 10:02:54 -0700 Subject: [PATCH 0053/1644] ARROW-93: Fix builds when using XCode 7.3 Author: Dan Robinson Closes #54 from 
danrobinson/ARROW-93 and squashes the following commits: ddff5b0 [Dan Robinson] ARROW-93: Fix builds when using XCode 7.3 --- cpp/cmake_modules/CompilerInfo.cmake | 2 +- python/cmake_modules/CompilerInfo.cmake | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/cpp/cmake_modules/CompilerInfo.cmake b/cpp/cmake_modules/CompilerInfo.cmake index 07860682f9b1b..e1c821cca5d45 100644 --- a/cpp/cmake_modules/CompilerInfo.cmake +++ b/cpp/cmake_modules/CompilerInfo.cmake @@ -31,7 +31,7 @@ elseif("${COMPILER_VERSION_FULL}" MATCHES ".*based on LLVM.*") # clang on Mac OS X, XCode 7. No version replacement is done # because Apple no longer advertises the upstream LLVM version. -elseif("${COMPILER_VERSION_FULL}" MATCHES "clang-700\\..*") +elseif("${COMPILER_VERSION_FULL}" MATCHES "clang-70[0-9]\\..*") set(COMPILER_FAMILY "clang") # gcc diff --git a/python/cmake_modules/CompilerInfo.cmake b/python/cmake_modules/CompilerInfo.cmake index e66bc2693eead..55f989a1a6c9d 100644 --- a/python/cmake_modules/CompilerInfo.cmake +++ b/python/cmake_modules/CompilerInfo.cmake @@ -34,7 +34,7 @@ elseif("${COMPILER_VERSION_FULL}" MATCHES ".*based on LLVM.*") # clang on Mac OS X, XCode 7. No version replacement is done # because Apple no longer advertises the upstream LLVM version. -elseif("${COMPILER_VERSION_FULL}" MATCHES "clang-700\\..*") +elseif("${COMPILER_VERSION_FULL}" MATCHES "clang-70[0-9]\\..*") set(COMPILER_FAMILY "clang") # gcc From b3ebce1b3471abbdc4516ff86014aa26bcc99a24 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Thu, 31 Mar 2016 17:27:56 -0700 Subject: [PATCH 0054/1644] ARROW-89: [Python] Add benchmarks for Arrow<->Pandas conversion Author: Uwe L. Korn Closes #51 from xhochy/arrow-89 and squashes the following commits: bd6a7cb [Uwe L. Korn] Split benchmarks and add one for a float64 column with NaNs 8f74528 [Uwe L. Korn] ARROW-89: [Python] Add benchmarks for Arrow<->Pandas conversion --- python/benchmarks/array.py | 55 ++++++++++++++++++++++++++++++++++---- 1 file changed, 50 insertions(+), 5 deletions(-) diff --git a/python/benchmarks/array.py b/python/benchmarks/array.py index 6ab73d18d1f87..4268f0073f292 100644 --- a/python/benchmarks/array.py +++ b/python/benchmarks/array.py @@ -15,22 +15,67 @@ # specific language governing permissions and limitations # under the License. 
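The ARROW-93 hunks above work because CMake's MATCHES operator treats its pattern as a regular expression over the compiler's version banner: widening clang-700 to clang-70[0-9] accepts the clang-703.x banner that XCode 7.3 reports, where the old pattern only matched the clang-700.x banners of earlier XCode 7 releases. A minimal standalone check of the same match, assuming a representative Apple banner string (the exact build number here is illustrative):

    #include <cassert>
    #include <regex>
    #include <string>

    int main() {
      // Illustrative banner; the precise clang-70x build number varies by release.
      const std::string banner = "Apple LLVM version 7.3.0 (clang-703.0.29)";
      // Old pattern: no match, so the compiler family fell through undetected.
      assert(!std::regex_search(banner, std::regex("clang-700\\..*")));
      // Widened pattern: any clang-70x banner, including XCode 7.3's, matches.
      assert(std::regex_search(banner, std::regex("clang-70[0-9]\\..*")));
      return 0;
    }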
-import pyarrow +import numpy as np +import pandas as pd +import pyarrow as A -class Conversions(object): + +class PyListConversions(object): + param_names = ('size',) params = (1, 10 ** 5, 10 ** 6, 10 ** 7) + def setup(self, n): + self.data = list(range(n)) + def time_from_pylist(self, n): - pyarrow.from_pylist(list(range(n))) + A.from_pylist(self.data) def peakmem_from_pylist(self, n): - pyarrow.from_pylist(list(range(n))) + A.from_pylist(self.data) + + +class PandasConversionsBase(object): + def setup(self, n, dtype): + if dtype == 'float64_nans': + arr = np.arange(n).astype('float64') + arr[arr % 10 == 0] = np.nan + else: + arr = np.arange(n).astype(dtype) + self.data = pd.DataFrame({'column': arr}) + + +class PandasConversionsToArrow(PandasConversionsBase): + param_names = ('size', 'dtype') + params = ((1, 10 ** 5, 10 ** 6, 10 ** 7), ('int64', 'float64', 'float64_nans', 'str')) + + def time_from_series(self, n, dtype): + A.from_pandas_dataframe(self.data) + + def peakmem_from_series(self, n, dtype): + A.from_pandas_dataframe(self.data) + + +class PandasConversionsFromArrow(PandasConversionsBase): + param_names = ('size', 'dtype') + params = ((1, 10 ** 5, 10 ** 6, 10 ** 7), ('int64', 'float64', 'float64_nans', 'str')) + + def setup(self, n, dtype): + super(PandasConversionsFromArrow, self).setup(n, dtype) + self.arrow_data = A.from_pandas_dataframe(self.data) + + def time_to_series(self, n, dtype): + self.arrow_data.to_pandas() + + def peakmem_to_series(self, n, dtype): + self.arrow_data.to_pandas() + class ScalarAccess(object): + param_names = ('size',) params = (1, 10 ** 5, 10 ** 6, 10 ** 7) def setUp(self, n): - self._array = pyarrow.from_pylist(list(range(n))) + self._array = A.from_pylist(list(range(n))) def time_as_py(self, n): for i in range(n): From 6d31d5928f4ec5ced14a105b5b05d46a7dab5264 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Thu, 31 Mar 2016 17:47:42 -0700 Subject: [PATCH 0055/1644] ARROW-49: [Python] Add Column and Table wrapper interface After https://github.com/apache/arrow/pull/52 is merged, I'd like to split Column and Table into separate .pyx files, array.pyx seems a bit overcrowded. Author: Uwe L. Korn Closes #53 from xhochy/arrow-49 and squashes the following commits: b01b201 [Uwe L. Korn] Use correct number of chunks e422faf [Uwe L. Korn] Incoportate PR feedback, Add ChunkedArray interface e8f84a9 [Uwe L. 
Korn] ARROW-49: [Python] Add Column and Table wrapper interface --- python/CMakeLists.txt | 1 + python/pyarrow/__init__.py | 4 +- python/pyarrow/array.pxd | 2 + python/pyarrow/array.pyx | 75 +------- python/pyarrow/includes/libarrow.pxd | 5 +- python/pyarrow/schema.pxd | 2 + python/pyarrow/schema.pyx | 9 + python/pyarrow/table.pxd | 46 +++++ python/pyarrow/table.pyx | 264 +++++++++++++++++++++++++++ python/pyarrow/tests/test_column.py | 49 +++++ python/pyarrow/tests/test_table.py | 39 ++++ python/setup.py | 2 +- 12 files changed, 422 insertions(+), 76 deletions(-) create mode 100644 python/pyarrow/table.pxd create mode 100644 python/pyarrow/table.pyx create mode 100644 python/pyarrow/tests/test_column.py diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index ebe825f65c4da..2173232d4eff5 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -444,6 +444,7 @@ set(CYTHON_EXTENSIONS error scalar schema + table ) foreach(module ${CYTHON_EXTENSIONS}) diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index c343f5ba5f129..40a09c2feaef0 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -41,4 +41,6 @@ list_, struct, field, DataType, Field, Schema, schema) -from pyarrow.array import RowBatch, Table, from_pandas_dataframe +from pyarrow.array import RowBatch, from_pandas_dataframe + +from pyarrow.table import Column, Table diff --git a/python/pyarrow/array.pxd b/python/pyarrow/array.pxd index de3c77419623f..8cd15cd450219 100644 --- a/python/pyarrow/array.pxd +++ b/python/pyarrow/array.pxd @@ -36,6 +36,8 @@ cdef class Array: cdef init(self, const shared_ptr[CArray]& sp_array) cdef getitem(self, int i) +cdef object box_arrow_array(const shared_ptr[CArray]& sp_array) + cdef class BooleanArray(Array): pass diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 255efc268fe29..456bf6d1da848 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -33,6 +33,8 @@ from pyarrow.scalar import NA from pyarrow.schema cimport Schema import pyarrow.schema as schema +from pyarrow.table cimport Table + def total_allocated_bytes(): cdef MemoryPool* pool = pyarrow.GetMemoryPool() return pool.bytes_allocated() @@ -287,76 +289,3 @@ cdef class RowBatch: return self.arrays[i] -cdef class Table: - ''' - Do not call this class's constructor directly. 
- ''' - cdef: - shared_ptr[CTable] sp_table - CTable* table - - def __cinit__(self): - pass - - cdef init(self, const shared_ptr[CTable]& table): - self.sp_table = table - self.table = table.get() - - @staticmethod - def from_pandas(df, name=None): - pass - - @staticmethod - def from_arrays(names, arrays, name=None): - cdef: - Array arr - Table result - c_string c_name - vector[shared_ptr[CField]] fields - vector[shared_ptr[CColumn]] columns - shared_ptr[CSchema] schema - shared_ptr[CTable] table - - cdef int K = len(arrays) - - fields.resize(K) - columns.resize(K) - for i in range(K): - arr = arrays[i] - c_name = tobytes(names[i]) - - fields[i].reset(new CField(c_name, arr.type.sp_type, True)) - columns[i].reset(new CColumn(fields[i], arr.sp_array)) - - if name is None: - c_name = '' - else: - c_name = tobytes(name) - - schema.reset(new CSchema(fields)) - table.reset(new CTable(c_name, schema, columns)) - - result = Table() - result.init(table) - - return result - - def to_pandas(self): - """ - Convert the arrow::Table to a pandas DataFrame - """ - cdef: - PyObject* arr - shared_ptr[CColumn] col - - import pandas as pd - - names = [] - data = [] - for i in range(self.table.num_columns()): - col = self.table.column(i) - check_status(pyarrow.ArrowToPandas(col, &arr)) - names.append(frombytes(col.get().name())) - data.append( arr) - - return pd.DataFrame(dict(zip(names, data)), columns=names) diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 42f1f25073d1b..b2ef45a347bc0 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -149,7 +149,10 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: c_string GetString(int i) cdef cppclass CChunkedArray" arrow::ChunkedArray": - pass + int64_t length() + int64_t null_count() + int num_chunks() + const shared_ptr[CArray]& chunk(int i) cdef cppclass CColumn" arrow::Column": CColumn(const shared_ptr[CField]& field, diff --git a/python/pyarrow/schema.pxd b/python/pyarrow/schema.pxd index 61458b765c742..f2cb776eb2e9f 100644 --- a/python/pyarrow/schema.pxd +++ b/python/pyarrow/schema.pxd @@ -41,5 +41,7 @@ cdef class Schema: CSchema* schema cdef init(self, const vector[shared_ptr[CField]]& fields) + cdef init_schema(self, const shared_ptr[CSchema]& schema) cdef DataType box_data_type(const shared_ptr[CDataType]& type) +cdef Schema box_schema(const shared_ptr[CSchema]& schema) diff --git a/python/pyarrow/schema.pyx b/python/pyarrow/schema.pyx index b3bf02aad76bb..22ddf0cf17e41 100644 --- a/python/pyarrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ -106,6 +106,10 @@ cdef class Schema: self.schema = new CSchema(fields) self.sp_schema.reset(self.schema) + cdef init_schema(self, const shared_ptr[CSchema]& schema): + self.schema = schema.get() + self.sp_schema = schema + @classmethod def from_fields(cls, fields): cdef: @@ -223,3 +227,8 @@ cdef DataType box_data_type(const shared_ptr[CDataType]& type): cdef DataType out = DataType() out.init(type) return out + +cdef Schema box_schema(const shared_ptr[CSchema]& type): + cdef Schema out = Schema() + out.init_schema(type) + return out diff --git a/python/pyarrow/table.pxd b/python/pyarrow/table.pxd new file mode 100644 index 0000000000000..0a5c122c95cff --- /dev/null +++ b/python/pyarrow/table.pxd @@ -0,0 +1,46 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +from pyarrow.includes.common cimport shared_ptr +from pyarrow.includes.libarrow cimport CChunkedArray, CColumn, CTable + + +cdef class ChunkedArray: + cdef: + shared_ptr[CChunkedArray] sp_chunked_array + CChunkedArray* chunked_array + + cdef init(self, const shared_ptr[CChunkedArray]& chunked_array) + cdef _check_nullptr(self) + + +cdef class Column: + cdef: + shared_ptr[CColumn] sp_column + CColumn* column + + cdef init(self, const shared_ptr[CColumn]& column) + cdef _check_nullptr(self) + + +cdef class Table: + cdef: + shared_ptr[CTable] sp_table + CTable* table + + cdef init(self, const shared_ptr[CTable]& table) + cdef _check_nullptr(self) diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx new file mode 100644 index 0000000000000..4c4816f0c7e69 --- /dev/null +++ b/python/pyarrow/table.pyx @@ -0,0 +1,264 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# cython: profile=False +# distutils: language = c++ +# cython: embedsignature = True + +from pyarrow.includes.libarrow cimport * +cimport pyarrow.includes.pyarrow as pyarrow + +import pyarrow.config + +from pyarrow.array cimport Array, box_arrow_array +from pyarrow.compat import frombytes, tobytes +from pyarrow.error cimport check_status +from pyarrow.schema cimport box_data_type, box_schema + +cdef class ChunkedArray: + ''' + Do not call this class's constructor directly. + ''' + + def __cinit__(self): + self.chunked_array = NULL + + cdef init(self, const shared_ptr[CChunkedArray]& chunked_array): + self.sp_chunked_array = chunked_array + self.chunked_array = chunked_array.get() + + cdef _check_nullptr(self): + if self.chunked_array == NULL: + raise ReferenceError("ChunkedArray object references a NULL pointer." 
+ "Not initialized.") + + def length(self): + self._check_nullptr() + return self.chunked_array.length() + + def __len__(self): + return self.length() + + property null_count: + + def __get__(self): + self._check_nullptr() + return self.chunked_array.null_count() + + property num_chunks: + + def __get__(self): + self._check_nullptr() + return self.chunked_array.num_chunks() + + def chunk(self, i): + self._check_nullptr() + return box_arrow_array(self.chunked_array.chunk(i)) + + + def iterchunks(self): + for i in range(self.num_chunks): + yield self.chunk(i) + + +cdef class Column: + ''' + Do not call this class's constructor directly. + ''' + + def __cinit__(self): + self.column = NULL + + cdef init(self, const shared_ptr[CColumn]& column): + self.sp_column = column + self.column = column.get() + + def to_pandas(self): + """ + Convert the arrow::Column to a pandas Series + """ + cdef: + PyObject* arr + + import pandas as pd + + check_status(pyarrow.ArrowToPandas(self.sp_column, &arr)) + return pd.Series(arr, name=self.name) + + cdef _check_nullptr(self): + if self.column == NULL: + raise ReferenceError("Column object references a NULL pointer." + "Not initialized.") + + def __len__(self): + self._check_nullptr() + return self.column.length() + + def length(self): + self._check_nullptr() + return self.column.length() + + property shape: + + def __get__(self): + self._check_nullptr() + return (self.length(),) + + property null_count: + + def __get__(self): + self._check_nullptr() + return self.column.null_count() + + property name: + + def __get__(self): + return frombytes(self.column.name()) + + property type: + + def __get__(self): + return box_data_type(self.column.type()) + + property data: + + def __get__(self): + cdef ChunkedArray chunked_array = ChunkedArray() + chunked_array.init(self.column.data()) + return chunked_array + + +cdef class Table: + ''' + Do not call this class's constructor directly. + ''' + + def __cinit__(self): + self.table = NULL + + cdef init(self, const shared_ptr[CTable]& table): + self.sp_table = table + self.table = table.get() + + cdef _check_nullptr(self): + if self.table == NULL: + raise ReferenceError("Table object references a NULL pointer." 
+ "Not initialized.") + + @staticmethod + def from_pandas(df, name=None): + pass + + @staticmethod + def from_arrays(names, arrays, name=None): + cdef: + Array arr + Table result + c_string c_name + vector[shared_ptr[CField]] fields + vector[shared_ptr[CColumn]] columns + shared_ptr[CSchema] schema + shared_ptr[CTable] table + + cdef int K = len(arrays) + + fields.resize(K) + columns.resize(K) + for i in range(K): + arr = arrays[i] + c_name = tobytes(names[i]) + + fields[i].reset(new CField(c_name, arr.type.sp_type, True)) + columns[i].reset(new CColumn(fields[i], arr.sp_array)) + + if name is None: + c_name = '' + else: + c_name = tobytes(name) + + schema.reset(new CSchema(fields)) + table.reset(new CTable(c_name, schema, columns)) + + result = Table() + result.init(table) + + return result + + def to_pandas(self): + """ + Convert the arrow::Table to a pandas DataFrame + """ + cdef: + PyObject* arr + shared_ptr[CColumn] col + + import pandas as pd + + names = [] + data = [] + for i in range(self.table.num_columns()): + col = self.table.column(i) + check_status(pyarrow.ArrowToPandas(col, &arr)) + names.append(frombytes(col.get().name())) + data.append( arr) + + return pd.DataFrame(dict(zip(names, data)), columns=names) + + property name: + + def __get__(self): + self._check_nullptr() + return frombytes(self.table.name()) + + property schema: + + def __get__(self): + raise box_schema(self.table.schema()) + + def column(self, index): + self._check_nullptr() + cdef Column column = Column() + column.init(self.table.column(index)) + return column + + def __getitem__(self, i): + return self.column(i) + + def itercolumns(self): + for i in range(self.num_columns): + yield self.column(i) + + property num_columns: + + def __get__(self): + self._check_nullptr() + return self.table.num_columns() + + property num_rows: + + def __get__(self): + self._check_nullptr() + return self.table.num_rows() + + def __len__(self): + return self.num_rows + + property shape: + + def __get__(self): + return (self.num_rows, self.num_columns) + diff --git a/python/pyarrow/tests/test_column.py b/python/pyarrow/tests/test_column.py new file mode 100644 index 0000000000000..b62f58236e073 --- /dev/null +++ b/python/pyarrow/tests/test_column.py @@ -0,0 +1,49 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +from pyarrow.compat import unittest +import pyarrow as arrow + +A = arrow + +import pandas as pd + + +class TestColumn(unittest.TestCase): + + def test_basics(self): + data = [ + A.from_pylist([-10, -5, 0, 5, 10]) + ] + table = A.Table.from_arrays(('a'), data, 'table_name') + column = table.column(0) + assert column.name == 'a' + assert column.length() == 5 + assert len(column) == 5 + assert column.shape == (5,) + + def test_pandas(self): + data = [ + A.from_pylist([-10, -5, 0, 5, 10]) + ] + table = A.Table.from_arrays(('a'), data, 'table_name') + column = table.column(0) + series = column.to_pandas() + assert series.name == 'a' + assert series.shape == (5,) + assert series.iloc[0] == -10 + diff --git a/python/pyarrow/tests/test_table.py b/python/pyarrow/tests/test_table.py index 2e24445bd0c22..83fcbb8faff5d 100644 --- a/python/pyarrow/tests/test_table.py +++ b/python/pyarrow/tests/test_table.py @@ -20,6 +20,8 @@ A = arrow +import pandas as pd + class TestRowBatch(unittest.TestCase): @@ -38,3 +40,40 @@ def test_basics(self): assert len(batch) == num_rows assert batch.num_rows == num_rows assert batch.num_columns == len(data) + + +class TestTable(unittest.TestCase): + + def test_basics(self): + data = [ + A.from_pylist(range(5)), + A.from_pylist([-10, -5, 0, 5, 10]) + ] + table = A.Table.from_arrays(('a', 'b'), data, 'table_name') + assert table.name == 'table_name' + assert len(table) == 5 + assert table.num_rows == 5 + assert table.num_columns == 2 + assert table.shape == (5, 2) + + for col in table.itercolumns(): + for chunk in col.data.iterchunks(): + assert chunk is not None + + def test_pandas(self): + data = [ + A.from_pylist(range(5)), + A.from_pylist([-10, -5, 0, 5, 10]) + ] + table = A.Table.from_arrays(('a', 'b'), data, 'table_name') + + # TODO: Use this part once from_pandas is implemented + # data = {'a': range(5), 'b': [-10, -5, 0, 5, 10]} + # df = pd.DataFrame(data) + # A.Table.from_pandas(df) + + df = table.to_pandas() + assert set(df.columns) == set(('a', 'b')) + assert df.shape == (5, 2) + assert df.ix[0, 'b'] == -10 + diff --git a/python/setup.py b/python/setup.py index 5cc871aba9f81..ebd80de46b4da 100644 --- a/python/setup.py +++ b/python/setup.py @@ -214,7 +214,7 @@ def get_ext_built(self, name): return name + suffix def get_cmake_cython_names(self): - return ['array', 'config', 'error', 'scalar', 'schema'] + return ['array', 'config', 'error', 'scalar', 'schema', 'table'] def get_names(self): return self._found_names From 79fddd1138ff69953e943f5980533dc01eabbb97 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Thu, 31 Mar 2016 17:48:38 -0700 Subject: [PATCH 0056/1644] ARROW-90: [C++] Check for SIMD instruction set support This also adds an option to disable the usage of a specific instruction set, e.g. you compile on a machine that supports SSE3 but you want to use the binary also on machines without SSE3. (Distribution packagers will love that option!) Author: Uwe L. Korn Closes #50 from xhochy/arrow-90 and squashes the following commits: 6fd80d3 [Uwe L. 
Korn] ARROW-90: Check for SIMD instruction set support --- cpp/CMakeLists.txt | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 6ed2768d13918..26d12d2424796 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -66,6 +66,14 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") "Build the Arrow IPC extensions" ON) + option(ARROW_SSE3 + "Build Arrow with SSE3" + ON) + + option(ARROW_ALTIVEC + "Build Arrow with Altivec" + ON) + endif() if(NOT ARROW_BUILD_TESTS) @@ -81,9 +89,25 @@ endif() # Compiler flags ############################################################ +# Check if the target architecture and compiler supports some special +# instruction sets that would boost performance. +include(CheckCXXCompilerFlag) +# x86/amd64 compiler flags +CHECK_CXX_COMPILER_FLAG("-msse3" CXX_SUPPORTS_SSE3) +# power compiler flags +CHECK_CXX_COMPILER_FLAG("-maltivec" CXX_SUPPORTS_ALTIVEC) + # compiler flags that are common across debug/release builds # - Wall: Enable all warnings. -set(CXX_COMMON_FLAGS "-std=c++11 -msse3 -Wall") +set(CXX_COMMON_FLAGS "-std=c++11 -Wall") + +# Only enable additional instruction sets if they are supported +if (CXX_SUPPORTS_SSE3 AND ARROW_SSE3) + set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -msse3") +endif() +if (CXX_SUPPORTS_ALTIVEC AND ARROW_ALTIVEC) + set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -maltivec") +endif() if (APPLE) # Depending on the default OSX_DEPLOYMENT_TARGET (< 10.9), libstdc++ may be From 5d129991b3369b0e45cb79d1efe6ba2fd8dd21d0 Mon Sep 17 00:00:00 2001 From: Micah Kornfield Date: Fri, 1 Apr 2016 21:40:20 -0700 Subject: [PATCH 0057/1644] ARROW-71: [C++] Add clang-tidy and clang-format to the the tool chain. I changed the ubuntu flavor for building to precise because https://github.com/travis-ci/apt-source-whitelist/issues/199 is currently blocking using trusty. I also expect there might be a couple of iterations on settings for clang-format and clang-tidy (or if we even want them as standard parts of the toolchain). @wesm I noticed the lint target explicitly turns off some checks, I don't know if these were copy and pasted or you really don't like them. If the latter I can do a first pass of turning the same ones off for clang-tidy. In terms of reviewing: It is likely useful, to look at the PR commit by commit, since the last two commits are 99% driven by the first commit. The main chunk of code that wasn't machine fixed is FatalLog in logging. The good news is clang-tidy caught one potential corner case segfault when a column happened to be null :) Author: Micah Kornfield Closes #55 from emkornfield/emk_add_clang_tidy_PR and squashes the following commits: 2fafb10 [Micah Kornfield] adjust line length from 88 to 90, turn on bin packing of parameters. increase penality for before first call parameter 169352f [Micah Kornfield] add llvm tool chain as travis source e7723d1 [Micah Kornfield] upgrade to precise to verify if build works. 
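One practical consequence of the ARROW-90 change above: -msse3 both unlocks SSE3 intrinsics and lets the compiler emit instructions older x86 CPUs lack, which is exactly why distribution packagers want the ARROW_SSE3 switch. A sketch of source that adapts to either build mode, assuming (as holds for GCC and Clang) that __SSE3__ is predefined whenever -msse3 is in effect; Sum4 is a hypothetical example, not Arrow code:

    #ifdef __SSE3__
    #include <pmmintrin.h>  // SSE3 intrinsics such as _mm_hadd_ps
    #endif

    // Horizontal sum of four floats: SSE3 hadd when available, portable otherwise.
    float Sum4(const float* v) {
    #ifdef __SSE3__
      __m128 x = _mm_loadu_ps(v);
      x = _mm_hadd_ps(x, x);
      x = _mm_hadd_ps(x, x);
      return _mm_cvtss_f32(x);
    #else
      return v[0] + v[1] + v[2] + v[3];
    #endif
    }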
address self comments d3f76d8 [Micah Kornfield] clang format change 9c556ef [Micah Kornfield] cleanup from clang-tidy 26945e9 [Micah Kornfield] add more failure checks for build_thirdparty 4dd0b81 [Micah Kornfield] Add clang-format and clang-tidy targets to toolchain --- .travis.yml | 5 +- ci/travis_script_cpp.sh | 4 + cpp/CMakeLists.txt | 39 +++++- cpp/README.md | 16 +++ cpp/build-support/run-clang-format.sh | 42 ++++++ cpp/build-support/run-clang-tidy.sh | 40 ++++++ cpp/cmake_modules/FindClangTools.cmake | 60 +++++++++ cpp/src/.clang-format | 65 ++++++++++ cpp/src/.clang-tidy | 14 ++ cpp/src/arrow/api.h | 2 +- cpp/src/arrow/array-test.cc | 16 +-- cpp/src/arrow/array.cc | 17 +-- cpp/src/arrow/array.h | 28 ++-- cpp/src/arrow/builder.cc | 2 +- cpp/src/arrow/builder.h | 40 +++--- cpp/src/arrow/column-benchmark.cc | 23 ++-- cpp/src/arrow/column-test.cc | 2 +- cpp/src/arrow/column.cc | 27 ++-- cpp/src/arrow/column.h | 41 ++---- cpp/src/arrow/ipc/adapter.cc | 50 ++++---- cpp/src/arrow/ipc/adapter.h | 18 +-- cpp/src/arrow/ipc/ipc-adapter-test.cc | 20 ++- cpp/src/arrow/ipc/ipc-memory-test.cc | 15 +-- cpp/src/arrow/ipc/ipc-metadata-test.cc | 8 +- cpp/src/arrow/ipc/memory.cc | 46 +++---- cpp/src/arrow/ipc/memory.h | 22 ++-- cpp/src/arrow/ipc/metadata-internal.cc | 70 +++++----- cpp/src/arrow/ipc/metadata-internal.h | 12 +- cpp/src/arrow/ipc/metadata.cc | 72 ++++------- cpp/src/arrow/ipc/metadata.h | 20 +-- cpp/src/arrow/ipc/test-common.h | 10 +- cpp/src/arrow/parquet/parquet-schema-test.cc | 63 ++++----- cpp/src/arrow/parquet/schema.cc | 15 +-- cpp/src/arrow/parquet/schema.h | 11 +- cpp/src/arrow/schema-test.cc | 6 +- cpp/src/arrow/schema.cc | 20 +-- cpp/src/arrow/schema.h | 10 +- cpp/src/arrow/table-test.cc | 16 +-- cpp/src/arrow/table.cc | 31 ++--- cpp/src/arrow/table.h | 38 ++---- cpp/src/arrow/test-util.h | 58 ++++----- cpp/src/arrow/type.cc | 8 +- cpp/src/arrow/type.h | 94 +++++--------- cpp/src/arrow/types/binary.h | 6 +- cpp/src/arrow/types/collection.h | 12 +- cpp/src/arrow/types/construct.cc | 42 +++--- cpp/src/arrow/types/construct.h | 11 +- cpp/src/arrow/types/datetime.h | 39 ++---- cpp/src/arrow/types/decimal-test.cc | 2 +- cpp/src/arrow/types/decimal.cc | 3 +- cpp/src/arrow/types/decimal.h | 11 +- cpp/src/arrow/types/json.cc | 5 +- cpp/src/arrow/types/json.h | 8 +- cpp/src/arrow/types/list-test.cc | 11 +- cpp/src/arrow/types/list.cc | 25 ++-- cpp/src/arrow/types/list.h | 65 ++++------ cpp/src/arrow/types/primitive-test.cc | 107 +++++++--------- cpp/src/arrow/types/primitive.cc | 75 ++++------- cpp/src/arrow/types/primitive.h | 128 +++++++------------ cpp/src/arrow/types/string-test.cc | 20 +-- cpp/src/arrow/types/string.cc | 10 +- cpp/src/arrow/types/string.h | 48 +++---- cpp/src/arrow/types/struct-test.cc | 4 +- cpp/src/arrow/types/struct.cc | 4 +- cpp/src/arrow/types/struct.h | 6 +- cpp/src/arrow/types/test-common.h | 9 +- cpp/src/arrow/types/union.cc | 6 +- cpp/src/arrow/types/union.h | 17 +-- cpp/src/arrow/util/bit-util-test.cc | 2 +- cpp/src/arrow/util/bit-util.cc | 10 +- cpp/src/arrow/util/bit-util.h | 6 +- cpp/src/arrow/util/buffer-test.cc | 5 +- cpp/src/arrow/util/buffer.cc | 16 +-- cpp/src/arrow/util/buffer.h | 61 +++------ cpp/src/arrow/util/logging.h | 78 +++++++---- cpp/src/arrow/util/macros.h | 6 +- cpp/src/arrow/util/memory-pool-test.cc | 2 +- cpp/src/arrow/util/memory-pool.cc | 2 +- cpp/src/arrow/util/memory-pool.h | 4 +- cpp/src/arrow/util/random.h | 27 ++-- cpp/src/arrow/util/status.cc | 10 +- cpp/src/arrow/util/status.h | 45 ++++--- cpp/src/arrow/util/test_main.cc | 2 
+- cpp/thirdparty/build_thirdparty.sh | 4 +- 84 files changed, 1015 insertions(+), 1155 deletions(-) create mode 100755 cpp/build-support/run-clang-format.sh create mode 100755 cpp/build-support/run-clang-tidy.sh create mode 100644 cpp/cmake_modules/FindClangTools.cmake create mode 100644 cpp/src/.clang-format create mode 100644 cpp/src/.clang-tidy diff --git a/.travis.yml b/.travis.yml index d89a200b892e6..a0138a79598a1 100644 --- a/.travis.yml +++ b/.travis.yml @@ -1,11 +1,14 @@ sudo: required -dist: trusty +dist: precise addons: apt: sources: - ubuntu-toolchain-r-test - kalakris-cmake + - llvm-toolchain-precise-3.7 packages: + - clang-format-3.7 + - clang-tidy-3.7 - gcc-4.9 # Needed for C++11 - g++-4.9 # Needed for C++11 - gdb diff --git a/ci/travis_script_cpp.sh b/ci/travis_script_cpp.sh index 997bdf35e83d2..c9b3b5f1442a1 100755 --- a/ci/travis_script_cpp.sh +++ b/ci/travis_script_cpp.sh @@ -7,6 +7,10 @@ set -e pushd $CPP_BUILD_DIR make lint +if [ $TRAVIS_OS_NAME == "linux" ]; then + make check-format + make check-clang-tidy +fi ctest -L unittest diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 26d12d2424796..f803c0fb3e428 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -30,10 +30,11 @@ set(THIRDPARTY_DIR "${CMAKE_SOURCE_DIR}/thirdparty") # Must be declared in the top-level CMakeLists.txt. set(CMAKE_SKIP_INSTALL_ALL_DEPENDENCY true) -# Generate a Clang compile_commands.json "compilation database" file for use -# with various development tools, such as Vim's YouCompleteMe plugin. -# See http://clang.llvm.org/docs/JSONCompilationDatabase.html -if ("$ENV{CMAKE_EXPORT_COMPILE_COMMANDS}" STREQUAL "1") +find_package(ClangTools) +if ("$ENV{CMAKE_EXPORT_COMPILE_COMMANDS}" STREQUAL "1" OR CLANG_TIDY_FOUND) + # Generate a Clang compile_commands.json "compilation database" file for use + # with various development tools, such as Vim's YouCompleteMe plugin. + # See http://clang.llvm.org/docs/JSONCompilationDatabase.html set(CMAKE_EXPORT_COMPILE_COMMANDS 1) endif() @@ -540,6 +541,36 @@ if (UNIX) `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc -or -name \\*.h | sed -e '/_generated/g'`) endif (UNIX) + +############################################################ +# "make format" and "make check-format" targets +############################################################ +if (${CLANG_FORMAT_FOUND}) + # runs clang format and updates files in place. + add_custom_target(format ${BUILD_SUPPORT_DIR}/run-clang-format.sh ${CMAKE_CURRENT_SOURCE_DIR} ${CLANG_FORMAT_BIN} 1 + `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc -or -name \\*.h | sed -e '/_generated/g'`) + + # runs clang format and exits with a non-zero exit code if any files need to be reformatted + add_custom_target(check-format ${BUILD_SUPPORT_DIR}/run-clang-format.sh ${CMAKE_CURRENT_SOURCE_DIR} ${CLANG_FORMAT_BIN} 0 + `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc -or -name \\*.h | sed -e '/_generated/g'`) +endif() + + +############################################################ +# "make clang-tidy" and "make check-clang-tidy" targets +############################################################ +if (${CLANG_TIDY_FOUND}) + # runs clang-tidy and attempts to fix any warning automatically + add_custom_target(clang-tidy ${BUILD_SUPPORT_DIR}/run-clang-tidy.sh ${CLANG_TIDY_BIN} ${CMAKE_BINARY_DIR}/compile_commands.json 1 + `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc | sed -e '/_generated/g'`) + # runs clang-tidy and exits with a non-zero exit code if any errors are found. 
+ add_custom_target(check-clang-tidy ${BUILD_SUPPORT_DIR}/run-clang-tidy.sh ${CLANG_TIDY_BIN} ${CMAKE_BINARY_DIR}/compile_commands.json + 0 `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc | sed -e '/_generated/g'`) + +endif() + + + ############################################################ # Subdirectories ############################################################ diff --git a/cpp/README.md b/cpp/README.md index 9026cf963f8ee..3f5da21b7d417 100644 --- a/cpp/README.md +++ b/cpp/README.md @@ -61,3 +61,19 @@ variables * Googletest: `GTEST_HOME` (only required to build the unit tests) * Google Benchmark: `GBENCHMARK_HOME` (only required if building benchmarks) * Flatbuffers: `FLATBUFFERS_HOME` (only required for the IPC extensions) + +## Continuous Integration + +Pull requests are run through travis-ci for continuous integration. You can avoid +build failures by running the following checks before submitting your pull request: + + make unittest + make lint + # The next two commands may change your code. It is recommended you commit + # before running them. + make clang-tidy # requires clang-tidy is installed + make format # requires clang-format is installed + +Note that the clang-tidy target may take a while to run. You might consider +running clang-tidy separately on the files you have added/changed before +invoking the make target to reduce iteration time. diff --git a/cpp/build-support/run-clang-format.sh b/cpp/build-support/run-clang-format.sh new file mode 100755 index 0000000000000..ba525dfc33c69 --- /dev/null +++ b/cpp/build-support/run-clang-format.sh @@ -0,0 +1,42 @@ +#!/bin/bash +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Runs clang format in the given directory +# Arguments: +# $1 - Path to the source tree +# $2 - Path to the clang format binary +# $3 - Apply fixes (will raise an error if false and not there where changes) +# $ARGN - Files to run clang format on +# +SOURCE_DIR=$1 +shift +CLANG_FORMAT=$1 +shift +APPLY_FIXES=$1 +shift + +# clang format will only find its configuration if we are in +# the source tree or in a path relative to the source tree +pushd $SOURCE_DIR +if [ "$APPLY_FIXES" == "1" ]; then + $CLANG_FORMAT -i $@ +else + + NUM_CORRECTIONS=`$CLANG_FORMAT -output-replacements-xml $@ | grep offset | wc -l` + if [ "$NUM_CORRECTIONS" -gt "0" ]; then + echo "clang-format suggested changes, please run 'make format'!!!!" + exit 1 + fi +fi +popd diff --git a/cpp/build-support/run-clang-tidy.sh b/cpp/build-support/run-clang-tidy.sh new file mode 100755 index 0000000000000..4ba8ab8cd766d --- /dev/null +++ b/cpp/build-support/run-clang-tidy.sh @@ -0,0 +1,40 @@ +#!/bin/bash +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# +# Runs clang format in the given directory +# Arguments: +# $1 - Path to the clang tidy binary +# $2 - Path to the compile_commands.json to use +# $3 - Apply fixes (will raise an error if false and not there where changes) +# $ARGN - Files to run clang-tidy on +# +CLANG_TIDY=$1 +shift +COMPILE_COMMANDS=$1 +shift +APPLY_FIXES=$1 +shift + +# clang format will only find its configuration if we are in +# the source tree or in a path relative to the source tree +if [ "$APPLY_FIXES" == "1" ]; then + $CLANG_TIDY -p $COMPILE_COMMANDS -fix $@ +else + NUM_CORRECTIONS=`$CLANG_TIDY -p $COMPILE_COMMANDS $@ 2>&1 | grep -v Skipping | grep "warnings* generated" | wc -l` + if [ "$NUM_CORRECTIONS" -gt "0" ]; then + echo "clang-tidy had suggested fixes. Please fix these!!!" + exit 1 + fi +fi diff --git a/cpp/cmake_modules/FindClangTools.cmake b/cpp/cmake_modules/FindClangTools.cmake new file mode 100644 index 0000000000000..c07c7d244493e --- /dev/null +++ b/cpp/cmake_modules/FindClangTools.cmake @@ -0,0 +1,60 @@ +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# +# Tries to find the clang-tidy and clang-format modules +# +# Usage of this module as follows: +# +# find_package(ClangTools) +# +# Variables used by this module, they can change the default behaviour and need +# to be set before calling find_package: +# +# ClangToolsBin_HOME - +# When set, this path is inspected instead of standard library binary locations +# to find clang-tidy and clang-format +# +# This module defines +# CLANG_TIDY_BIN, The path to the clang tidy binary +# CLANG_TIDY_FOUND, Whether clang tidy was found +# CLANG_FORMAT_BIN, The path to the clang format binary +# CLANG_TIDY_FOUND, Whether clang format was found + +find_program(CLANG_TIDY_BIN + NAMES clang-tidy-3.8 clang-tidy-3.7 clang-tidy-3.6 clang-tidy + PATHS ${ClangTools_PATH} $ENV{CLANG_TOOLS_PATH} /usr/local/bin /usr/bin + NO_DEFAULT_PATH +) + +if ( "${CLANG_TIDY_BIN}" STREQUAL "CLANG_TIDY_BIN-NOTFOUND" ) + set(CLANG_TIDY_FOUND 0) + message("clang-tidy not found") +else() + set(CLANG_TIDY_FOUND 1) + message("clang-tidy found at ${CLANG_TIDY_BIN}") +endif() + +find_program(CLANG_FORMAT_BIN + NAMES clang-format-3.8 clang-format-3.7 clang-format-3.6 clang-format + PATHS ${ClangTools_PATH} $ENV{CLANG_TOOLS_PATH} /usr/local/bin /usr/bin + NO_DEFAULT_PATH +) + +if ( "${CLANG_FORMAT_BIN}" STREQUAL "CLANG_FORMAT_BIN-NOTFOUND" ) + set(CLANG_FORMAT_FOUND 0) + message("clang-format not found") +else() + set(CLANG_FORMAT_FOUND 1) + message("clang-format found at ${CLANG_FORMAT_BIN}") +endif() + diff --git a/cpp/src/.clang-format b/cpp/src/.clang-format new file mode 100644 index 0000000000000..7d5b3cf30ef51 --- /dev/null +++ b/cpp/src/.clang-format @@ -0,0 +1,65 @@ +--- +Language: Cpp +# BasedOnStyle: Google +AccessModifierOffset: -1 +AlignAfterOpenBracket: false +AlignConsecutiveAssignments: false +AlignEscapedNewlinesLeft: true +AlignOperands: true +AlignTrailingComments: true +AllowAllParametersOfDeclarationOnNextLine: true +AllowShortBlocksOnASingleLine: true +AllowShortCaseLabelsOnASingleLine: false +AllowShortFunctionsOnASingleLine: Inline +AllowShortIfStatementsOnASingleLine: true +AllowShortLoopsOnASingleLine: false +AlwaysBreakAfterDefinitionReturnType: None +AlwaysBreakBeforeMultilineStrings: true +AlwaysBreakTemplateDeclarations: true +BinPackArguments: true +BinPackParameters: true +BreakBeforeBinaryOperators: None +BreakBeforeBraces: Attach +BreakBeforeTernaryOperators: true +BreakConstructorInitializersBeforeComma: false +ColumnLimit: 90 +CommentPragmas: '^ IWYU pragma:' +ConstructorInitializerAllOnOneLineOrOnePerLine: true +ConstructorInitializerIndentWidth: 4 +ContinuationIndentWidth: 4 +Cpp11BracedListStyle: true +DerivePointerAlignment: false +DisableFormat: false +ExperimentalAutoDetectBinPacking: false +ForEachMacros: [ foreach, Q_FOREACH, BOOST_FOREACH ] +IndentCaseLabels: true +IndentWidth: 2 +IndentWrappedFunctionNames: false +KeepEmptyLinesAtTheStartOfBlocks: false +MacroBlockBegin: '' +MacroBlockEnd: '' +MaxEmptyLinesToKeep: 1 +NamespaceIndentation: None +ObjCBlockIndentWidth: 2 +ObjCSpaceAfterProperty: false +ObjCSpaceBeforeProtocolList: false +PenaltyBreakBeforeFirstCallParameter: 1000 +PenaltyBreakComment: 300 +PenaltyBreakFirstLessLess: 120 +PenaltyBreakString: 1000 +PenaltyExcessCharacter: 1000000 +PenaltyReturnTypeOnItsOwnLine: 200 +PointerAlignment: Left +SpaceAfterCStyleCast: false +SpaceBeforeAssignmentOperators: true +SpaceBeforeParens: ControlStatements +SpaceInEmptyParentheses: false +SpacesBeforeTrailingComments: 2 +SpacesInAngles: false +SpacesInContainerLiterals: true 
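Most of what this .clang-format configuration pins down is easiest to see in output. Illustrative only, not taken from the patch (CopyBytes and its arguments are invented for the example): under the options above, code comes out with 2-space indents, attached braces, left-bound pointers, short blocks kept on one line, and wrapping only past column 90.

    Status CopyBytes(const std::shared_ptr<Buffer>& src, uint8_t* dst) {
      if (!src) { return Status::Invalid("null buffer"); }  // short block, one line
      std::memcpy(dst, src->data(), src->size());
      return Status::OK();
    }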
+SpacesInCStyleCastParentheses: false +SpacesInParentheses: false +SpacesInSquareBrackets: false +Standard: Cpp11 +TabWidth: 8 +UseTab: Never diff --git a/cpp/src/.clang-tidy b/cpp/src/.clang-tidy new file mode 100644 index 0000000000000..deaa9bdf97fa1 --- /dev/null +++ b/cpp/src/.clang-tidy @@ -0,0 +1,14 @@ +--- +Checks: 'clang-diagnostic-*,clang-analyzer-*,-clang-analyzer-alpha*,google-.*,modernize-.*,readablity-.*' +HeaderFilterRegex: 'arrow/.*' +AnalyzeTemporaryDtors: true +CheckOptions: + - key: google-readability-braces-around-statements.ShortStatementLines + value: '1' + - key: google-readability-function-size.StatementThreshold + value: '800' + - key: google-readability-namespace-comments.ShortNamespaceLines + value: '10' + - key: google-readability-namespace-comments.SpacesBeforeComments + value: '2' + diff --git a/cpp/src/arrow/api.h b/cpp/src/arrow/api.h index 2ae80f642f29d..2d317b49cb7b6 100644 --- a/cpp/src/arrow/api.h +++ b/cpp/src/arrow/api.h @@ -37,4 +37,4 @@ #include "arrow/util/memory-pool.h" #include "arrow/util/status.h" -#endif // ARROW_API_H +#endif // ARROW_API_H diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc index 121b802d994fa..b4c727997ee7e 100644 --- a/cpp/src/arrow/array-test.cc +++ b/cpp/src/arrow/array-test.cc @@ -33,15 +33,12 @@ namespace arrow { class TestArray : public ::testing::Test { public: - void SetUp() { - pool_ = default_memory_pool(); - } + void SetUp() { pool_ = default_memory_pool(); } protected: MemoryPool* pool_; }; - TEST_F(TestArray, TestNullCount) { auto data = std::make_shared(pool_); auto null_bitmap = std::make_shared(pool_); @@ -53,7 +50,6 @@ TEST_F(TestArray, TestNullCount) { ASSERT_EQ(0, arr_no_nulls->null_count()); } - TEST_F(TestArray, TestLength) { auto data = std::make_shared(pool_); std::unique_ptr arr(new Int32Array(100, data)); @@ -61,14 +57,16 @@ TEST_F(TestArray, TestLength) { } TEST_F(TestArray, TestIsNull) { + // clang-format off std::vector null_bitmap = {1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1}; + // clang-format on int32_t null_count = 0; for (uint8_t x : null_bitmap) { - if (x == 0) ++null_count; + if (x == 0) { ++null_count; } } std::shared_ptr null_buf = test::bytes_to_null_buffer(null_bitmap); @@ -85,8 +83,6 @@ TEST_F(TestArray, TestIsNull) { } } +TEST_F(TestArray, TestCopy) {} -TEST_F(TestArray, TestCopy) { -} - -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index 3736732740b5b..a1536861a20be 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -32,30 +32,25 @@ Array::Array(const TypePtr& type, int32_t length, int32_t null_count, length_ = length; null_count_ = null_count; null_bitmap_ = null_bitmap; - if (null_bitmap_) { - null_bitmap_data_ = null_bitmap_->data(); - } + if (null_bitmap_) { null_bitmap_data_ = null_bitmap_->data(); } } bool Array::EqualsExact(const Array& other) const { - if (this == &other) return true; + if (this == &other) { return true; } if (length_ != other.length_ || null_count_ != other.null_count_ || type_enum() != other.type_enum()) { return false; } if (null_count_ > 0) { return null_bitmap_->Equals(*other.null_bitmap_, util::bytes_for_bits(length_)); - } else { - return true; } + return true; } bool NullArray::Equals(const std::shared_ptr& arr) const { - if (this == arr.get()) return true; - if (Type::NA != arr->type_enum()) { - return false; - } + if (this == arr.get()) { return true; } + if (Type::NA != arr->type_enum()) { 
return false; } return arr->length() == length_; } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 097634d74f890..c6735f87d8f42 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -36,8 +36,7 @@ class Buffer; // count is greater than 0 class Array { public: - Array(const std::shared_ptr& type, int32_t length, - int32_t null_count = 0, + Array(const std::shared_ptr& type, int32_t length, int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); virtual ~Array() {} @@ -47,19 +46,15 @@ class Array { return null_count_ > 0 && util::bit_not_set(null_bitmap_data_, i); } - int32_t length() const { return length_;} - int32_t null_count() const { return null_count_;} + int32_t length() const { return length_; } + int32_t null_count() const { return null_count_; } - const std::shared_ptr& type() const { return type_;} - Type::type type_enum() const { return type_->type;} + const std::shared_ptr& type() const { return type_; } + Type::type type_enum() const { return type_->type; } - const std::shared_ptr& null_bitmap() const { - return null_bitmap_; - } + const std::shared_ptr& null_bitmap() const { return null_bitmap_; } - const uint8_t* null_bitmap_data() const { - return null_bitmap_data_; - } + const uint8_t* null_bitmap_data() const { return null_bitmap_data_; } bool EqualsExact(const Array& arr) const; virtual bool Equals(const std::shared_ptr& arr) const = 0; @@ -80,17 +75,16 @@ class Array { // Degenerate null type Array class NullArray : public Array { public: - NullArray(const std::shared_ptr& type, int32_t length) : - Array(type, length, length, nullptr) {} + NullArray(const std::shared_ptr& type, int32_t length) + : Array(type, length, length, nullptr) {} - explicit NullArray(int32_t length) : - NullArray(std::make_shared(), length) {} + explicit NullArray(int32_t length) : NullArray(std::make_shared(), length) {} bool Equals(const std::shared_ptr& arr) const override; }; typedef std::shared_ptr ArrayPtr; -} // namespace arrow +} // namespace arrow #endif diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index 4061f35fd5e53..1447078f76028 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -62,4 +62,4 @@ Status ArrayBuilder::Reserve(int32_t elements) { return Status::OK(); } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index d1a49dce79961..21a6341ef5086 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -37,30 +37,26 @@ static constexpr int32_t MIN_BUILDER_CAPACITY = 1 << 5; // Base class for all data array builders class ArrayBuilder { public: - explicit ArrayBuilder(MemoryPool* pool, const TypePtr& type) : - pool_(pool), - type_(type), - null_bitmap_(nullptr), - null_count_(0), - null_bitmap_data_(nullptr), - length_(0), - capacity_(0) {} + explicit ArrayBuilder(MemoryPool* pool, const TypePtr& type) + : pool_(pool), + type_(type), + null_bitmap_(nullptr), + null_count_(0), + null_bitmap_data_(nullptr), + length_(0), + capacity_(0) {} virtual ~ArrayBuilder() {} // For nested types. 
Since the objects are owned by this class instance, we // skip shared pointers and just return a raw pointer - ArrayBuilder* child(int i) { - return children_[i].get(); - } + ArrayBuilder* child(int i) { return children_[i].get(); } - int num_children() const { - return children_.size(); - } + int num_children() const { return children_.size(); } - int32_t length() const { return length_;} - int32_t null_count() const { return null_count_;} - int32_t capacity() const { return capacity_;} + int32_t length() const { return length_; } + int32_t null_count() const { return null_count_; } + int32_t capacity() const { return capacity_; } // Allocates requires memory at this level, but children need to be // initialized independently @@ -76,15 +72,13 @@ class ArrayBuilder { // this function responsibly. Status Advance(int32_t elements); - const std::shared_ptr& null_bitmap() const { return null_bitmap_;} + const std::shared_ptr& null_bitmap() const { return null_bitmap_; } // Creates new array object to hold the contents of the builder and transfers // ownership of the data virtual std::shared_ptr Finish() = 0; - const std::shared_ptr& type() const { - return type_; - } + const std::shared_ptr& type() const { return type_; } protected: MemoryPool* pool_; @@ -107,6 +101,6 @@ class ArrayBuilder { DISALLOW_COPY_AND_ASSIGN(ArrayBuilder); }; -} // namespace arrow +} // namespace arrow -#endif // ARROW_BUILDER_H_ +#endif // ARROW_BUILDER_H_ diff --git a/cpp/src/arrow/column-benchmark.cc b/cpp/src/arrow/column-benchmark.cc index 335d581782ac0..edea0948860de 100644 --- a/cpp/src/arrow/column-benchmark.cc +++ b/cpp/src/arrow/column-benchmark.cc @@ -15,7 +15,6 @@ // specific language governing permissions and limitations // under the License. - #include "benchmark/benchmark.h" #include "arrow/test-util.h" @@ -24,19 +23,19 @@ namespace arrow { namespace { - template - std::shared_ptr MakePrimitive(int32_t length, int32_t null_count = 0) { - auto pool = default_memory_pool(); - auto data = std::make_shared(pool); - auto null_bitmap = std::make_shared(pool); - data->Resize(length * sizeof(typename ArrayType::value_type)); - null_bitmap->Resize(util::bytes_for_bits(length)); - return std::make_shared(length, data, 10, null_bitmap); - } +template +std::shared_ptr MakePrimitive(int32_t length, int32_t null_count = 0) { + auto pool = default_memory_pool(); + auto data = std::make_shared(pool); + auto null_bitmap = std::make_shared(pool); + data->Resize(length * sizeof(typename ArrayType::value_type)); + null_bitmap->Resize(util::bytes_for_bits(length)); + return std::make_shared(length, data, 10, null_bitmap); +} } // anonymous namespace - -static void BM_BuildInt32ColumnByChunk(benchmark::State& state) { //NOLINT non-const reference +static void BM_BuildInt32ColumnByChunk( + benchmark::State& state) { // NOLINT non-const reference ArrayVector arrays; for (int chunk_n = 0; chunk_n < state.range_x(); ++chunk_n) { arrays.push_back(MakePrimitive(100, 10)); diff --git a/cpp/src/arrow/column-test.cc b/cpp/src/arrow/column-test.cc index 0630785630e81..1edf313d49bf6 100644 --- a/cpp/src/arrow/column-test.cc +++ b/cpp/src/arrow/column-test.cc @@ -72,4 +72,4 @@ TEST_F(TestColumn, ChunksInhomogeneous) { ASSERT_RAISES(Invalid, column_->ValidateData()); } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/column.cc b/cpp/src/arrow/column.cc index 46acf8df2ff57..52e4c58e1dc3d 100644 --- a/cpp/src/arrow/column.cc +++ b/cpp/src/arrow/column.cc @@ -26,8 +26,7 @@ namespace arrow { 
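The reworked ChunkedArray constructor just below only accumulates length_ and null_count_ across its chunks; mapping a global row index back to a chunk is left to callers. A hypothetical helper, not part of this patch, built solely from the num_chunks()/chunk(i) accessors the class already exposes:

    // Returns the chunk holding global row i (or -1 when out of range);
    // *local receives the row's offset within that chunk.
    int ResolveChunk(const ChunkedArray& arr, int64_t i, int64_t* local) {
      int64_t offset = 0;
      for (int c = 0; c < arr.num_chunks(); ++c) {
        const int64_t n = arr.chunk(c)->length();
        if (i < offset + n) {
          *local = i - offset;
          return c;
        }
        offset += n;
      }
      return -1;
    }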
-ChunkedArray::ChunkedArray(const ArrayVector& chunks) : - chunks_(chunks) { +ChunkedArray::ChunkedArray(const ArrayVector& chunks) : chunks_(chunks) { length_ = 0; null_count_ = 0; for (const std::shared_ptr& chunk : chunks) { @@ -36,35 +35,31 @@ ChunkedArray::ChunkedArray(const ArrayVector& chunks) : } } -Column::Column(const std::shared_ptr& field, const ArrayVector& chunks) : - field_(field) { +Column::Column(const std::shared_ptr& field, const ArrayVector& chunks) + : field_(field) { data_ = std::make_shared(chunks); } -Column::Column(const std::shared_ptr& field, - const std::shared_ptr& data) : - field_(field) { +Column::Column(const std::shared_ptr& field, const std::shared_ptr& data) + : field_(field) { data_ = std::make_shared(ArrayVector({data})); } -Column::Column(const std::shared_ptr& field, - const std::shared_ptr& data) : - field_(field), - data_(data) {} +Column::Column( + const std::shared_ptr& field, const std::shared_ptr& data) + : field_(field), data_(data) {} Status Column::ValidateData() { for (int i = 0; i < data_->num_chunks(); ++i) { const std::shared_ptr& type = data_->chunk(i)->type(); if (!this->type()->Equals(type)) { std::stringstream ss; - ss << "In chunk " << i << " expected type " - << this->type()->ToString() - << " but saw " - << type->ToString(); + ss << "In chunk " << i << " expected type " << this->type()->ToString() + << " but saw " << type->ToString(); return Status::Invalid(ss.str()); } } return Status::OK(); } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/column.h b/cpp/src/arrow/column.h index 1ad97b20863c8..22becc3454780 100644 --- a/cpp/src/arrow/column.h +++ b/cpp/src/arrow/column.h @@ -39,21 +39,13 @@ class ChunkedArray { explicit ChunkedArray(const ArrayVector& chunks); // @returns: the total length of the chunked array; computed on construction - int64_t length() const { - return length_; - } + int64_t length() const { return length_; } - int64_t null_count() const { - return null_count_; - } + int64_t null_count() const { return null_count_; } - int num_chunks() const { - return chunks_.size(); - } + int num_chunks() const { return chunks_.size(); } - const std::shared_ptr& chunk(int i) const { - return chunks_[i]; - } + const std::shared_ptr& chunk(int i) const { return chunks_[i]; } protected: ArrayVector chunks_; @@ -67,33 +59,22 @@ class ChunkedArray { class Column { public: Column(const std::shared_ptr& field, const ArrayVector& chunks); - Column(const std::shared_ptr& field, - const std::shared_ptr& data); + Column(const std::shared_ptr& field, const std::shared_ptr& data); Column(const std::shared_ptr& field, const std::shared_ptr& data); - int64_t length() const { - return data_->length(); - } + int64_t length() const { return data_->length(); } - int64_t null_count() const { - return data_->null_count(); - } + int64_t null_count() const { return data_->null_count(); } // @returns: the column's name in the passed metadata - const std::string& name() const { - return field_->name; - } + const std::string& name() const { return field_->name; } // @returns: the column's type according to the metadata - const std::shared_ptr& type() const { - return field_->type; - } + const std::shared_ptr& type() const { return field_->type; } // @returns: the column's data as a chunked logical array - const std::shared_ptr& data() const { - return data_; - } + const std::shared_ptr& data() const { return data_; } // Verify that the column's array data is consistent with the passed field's // metadata Status ValidateData(); 
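ValidateData, declared above and implemented in the column.cc hunk earlier in this patch, is what TestColumn.ChunksInhomogeneous exercises: any chunk whose type disagrees with the column's field yields Status::Invalid naming the offending chunk. A sketch of that failing path; int32_chunk and utf8_chunk stand for pre-built arrays of differing types, and the three-argument Field constructor is assumed from its use elsewhere in this series:

    Status CheckMixedChunks(const ArrayPtr& int32_chunk, const ArrayPtr& utf8_chunk) {
      auto field = std::make_shared<Field>("f", int32_chunk->type(), true);
      Column column(field, ArrayVector({int32_chunk, utf8_chunk}));
      // Fails with "In chunk 1 expected type ... but saw ...", per column.cc above.
      return column.ValidateData();
    }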
@@ -103,6 +84,6 @@ class Column { std::shared_ptr data_; }; -} // namespace arrow +} // namespace arrow #endif // ARROW_COLUMN_H diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index c79e8469530f7..2f72c3aa8467a 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -94,8 +94,7 @@ Status VisitArray(const Array* arr, std::vector* field_nodes class RowBatchWriter { public: - explicit RowBatchWriter(const RowBatch* batch) : - batch_(batch) {} + explicit RowBatchWriter(const RowBatch* batch) : batch_(batch) {} Status AssemblePayload() { // Perform depth-first traversal of the row-batch @@ -138,12 +137,12 @@ class RowBatchWriter { // determine the data header size then request a buffer such that you can // construct the flatbuffer data accessor object (see arrow::ipc::Message) std::shared_ptr data_header; - RETURN_NOT_OK(WriteDataHeader(batch_->num_rows(), offset, - field_nodes_, buffer_meta_, &data_header)); + RETURN_NOT_OK(WriteDataHeader( + batch_->num_rows(), offset, field_nodes_, buffer_meta_, &data_header)); // Write the data header at the end - RETURN_NOT_OK(dst->Write(position + offset, data_header->data(), - data_header->size())); + RETURN_NOT_OK( + dst->Write(position + offset, data_header->data(), data_header->size())); *data_header_offset = position + offset; return Status::OK(); @@ -174,8 +173,8 @@ class RowBatchWriter { std::vector> buffers_; }; -Status WriteRowBatch(MemorySource* dst, const RowBatch* batch, int64_t position, - int64_t* header_offset) { +Status WriteRowBatch( + MemorySource* dst, const RowBatch* batch, int64_t position, int64_t* header_offset) { RowBatchWriter serializer(batch); RETURN_NOT_OK(serializer.AssemblePayload()); return serializer.Write(dst, position, header_offset); @@ -187,15 +186,14 @@ static constexpr int64_t INIT_METADATA_SIZE = 4096; class RowBatchReader::Impl { public: - Impl(MemorySource* source, const std::shared_ptr& metadata) : - source_(source), - metadata_(metadata) { + Impl(MemorySource* source, const std::shared_ptr& metadata) + : source_(source), metadata_(metadata) { num_buffers_ = metadata->num_buffers(); num_flattened_fields_ = metadata->num_fields(); } - Status AssembleBatch(const std::shared_ptr& schema, - std::shared_ptr* out) { + Status AssembleBatch( + const std::shared_ptr& schema, std::shared_ptr* out) { std::vector> arrays(schema->num_fields()); // The field_index and buffer_index are incremented in NextArray based on @@ -208,8 +206,7 @@ class RowBatchReader::Impl { RETURN_NOT_OK(NextArray(field, &arrays[i])); } - *out = std::make_shared(schema, metadata_->length(), - arrays); + *out = std::make_shared(schema, metadata_->length(), arrays); return Status::OK(); } @@ -243,11 +240,10 @@ class RowBatchReader::Impl { } else { data.reset(new Buffer(nullptr, 0)); } - return MakePrimitiveArray(type, field_meta.length, data, - field_meta.null_count, null_bitmap, out); - } else { - return Status::NotImplemented("Non-primitive types not complete yet"); + return MakePrimitiveArray( + type, field_meta.length, data, field_meta.null_count, null_bitmap, out); } + return Status::NotImplemented("Non-primitive types not complete yet"); } Status GetBuffer(int buffer_index, std::shared_ptr* out) { @@ -264,8 +260,8 @@ class RowBatchReader::Impl { int num_flattened_fields_; }; -Status RowBatchReader::Open(MemorySource* source, int64_t position, - std::shared_ptr* out) { +Status RowBatchReader::Open( + MemorySource* source, int64_t position, std::shared_ptr* out) { std::shared_ptr metadata; 
RETURN_NOT_OK(source->ReadAt(position, INIT_METADATA_SIZE, &metadata)); @@ -274,8 +270,7 @@ Status RowBatchReader::Open(MemorySource* source, int64_t position, // We may not need to call source->ReadAt again if (metadata_size > static_cast(INIT_METADATA_SIZE - sizeof(int32_t))) { // We don't have enough data, read the indicated metadata size. - RETURN_NOT_OK(source->ReadAt(position + sizeof(int32_t), - metadata_size, &metadata)); + RETURN_NOT_OK(source->ReadAt(position + sizeof(int32_t), metadata_size, &metadata)); } // TODO(wesm): buffer slicing here would be better in case ReadAt returns @@ -297,11 +292,10 @@ Status RowBatchReader::Open(MemorySource* source, int64_t position, return Status::OK(); } -Status RowBatchReader::GetRowBatch(const std::shared_ptr& schema, - std::shared_ptr* out) { +Status RowBatchReader::GetRowBatch( + const std::shared_ptr& schema, std::shared_ptr* out) { return impl_->AssembleBatch(schema, out); } - -} // namespace ipc -} // namespace arrow +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/adapter.h b/cpp/src/arrow/ipc/adapter.h index 26dea6d04b889..d453fa05f4982 100644 --- a/cpp/src/arrow/ipc/adapter.h +++ b/cpp/src/arrow/ipc/adapter.h @@ -52,8 +52,8 @@ class RecordBatchMessage; // // Finally, the memory offset to the start of the metadata / data header is // returned in an out-variable -Status WriteRowBatch(MemorySource* dst, const RowBatch* batch, int64_t position, - int64_t* header_offset); +Status WriteRowBatch( + MemorySource* dst, const RowBatch* batch, int64_t position, int64_t* header_offset); // int64_t GetRowBatchMetadata(const RowBatch* batch); @@ -67,20 +67,20 @@ int64_t GetRowBatchSize(const RowBatch* batch); class RowBatchReader { public: - static Status Open(MemorySource* source, int64_t position, - std::shared_ptr* out); + static Status Open( + MemorySource* source, int64_t position, std::shared_ptr* out); // Reassemble the row batch. 
A Schema is required to be able to construct the // right array containers - Status GetRowBatch(const std::shared_ptr& schema, - std::shared_ptr* out); + Status GetRowBatch( + const std::shared_ptr& schema, std::shared_ptr* out); private: class Impl; std::unique_ptr impl_; }; -} // namespace ipc -} // namespace arrow +} // namespace ipc +} // namespace arrow -#endif // ARROW_IPC_MEMORY_H +#endif // ARROW_IPC_MEMORY_H diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc index 79b4d710d282f..fbdda77e4919c 100644 --- a/cpp/src/arrow/ipc/ipc-adapter-test.cc +++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc @@ -42,12 +42,8 @@ namespace ipc { class TestWriteRowBatch : public ::testing::Test, public MemoryMapFixture { public: - void SetUp() { - pool_ = default_memory_pool(); - } - void TearDown() { - MemoryMapFixture::TearDown(); - } + void SetUp() { pool_ = default_memory_pool(); } + void TearDown() { MemoryMapFixture::TearDown(); } void InitMemoryMap(int64_t size) { std::string path = "test-write-row-batch"; @@ -83,8 +79,8 @@ TEST_F(TestWriteRowBatch, IntegerRoundTrip) { test::random_bytes(null_bytes, 0, null_bitmap->mutable_data()); auto a0 = std::make_shared(length, data); - auto a1 = std::make_shared(length, data, - test::bitmap_popcount(null_bitmap->data(), length), null_bitmap); + auto a1 = std::make_shared( + length, data, test::bitmap_popcount(null_bitmap->data(), length), null_bitmap); RowBatch batch(schema, length, {a0, a1}); @@ -103,10 +99,10 @@ TEST_F(TestWriteRowBatch, IntegerRoundTrip) { EXPECT_EQ(batch.num_rows(), batch_result->num_rows()); for (int i = 0; i < batch.num_columns(); ++i) { - EXPECT_TRUE(batch.column(i)->Equals(batch_result->column(i))) - << i << batch.column_name(i); + EXPECT_TRUE(batch.column(i)->Equals(batch_result->column(i))) << i + << batch.column_name(i); } } -} // namespace ipc -} // namespace arrow +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/ipc-memory-test.cc b/cpp/src/arrow/ipc/ipc-memory-test.cc index 332ad2a2b809b..1933921222595 100644 --- a/cpp/src/arrow/ipc/ipc-memory-test.cc +++ b/cpp/src/arrow/ipc/ipc-memory-test.cc @@ -35,13 +35,10 @@ namespace ipc { class TestMemoryMappedSource : public ::testing::Test, public MemoryMapFixture { public: - void TearDown() { - MemoryMapFixture::TearDown(); - } + void TearDown() { MemoryMapFixture::TearDown(); } }; -TEST_F(TestMemoryMappedSource, InvalidUsages) { -} +TEST_F(TestMemoryMappedSource, InvalidUsages) {} TEST_F(TestMemoryMappedSource, WriteRead) { const int64_t buffer_size = 1024; @@ -74,9 +71,9 @@ TEST_F(TestMemoryMappedSource, InvalidFile) { std::string non_existent_path = "invalid-file-name-asfd"; std::shared_ptr result; - ASSERT_RAISES(IOError, MemoryMappedSource::Open(non_existent_path, - MemorySource::READ_ONLY, &result)); + ASSERT_RAISES(IOError, + MemoryMappedSource::Open(non_existent_path, MemorySource::READ_ONLY, &result)); } -} // namespace ipc -} // namespace arrow +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/ipc-metadata-test.cc b/cpp/src/arrow/ipc/ipc-metadata-test.cc index ceabec0fa7c29..51d79cfb4c4bb 100644 --- a/cpp/src/arrow/ipc/ipc-metadata-test.cc +++ b/cpp/src/arrow/ipc/ipc-metadata-test.cc @@ -86,14 +86,12 @@ TEST_F(TestSchemaMessage, NestedFields) { auto type = std::make_shared(std::make_shared()); auto f0 = std::make_shared("f0", type); - std::shared_ptr type2(new StructType({ - std::make_shared("k1", INT32), - std::make_shared("k2", INT32), - std::make_shared("k3", INT32)})); + std::shared_ptr 
type2(new StructType({std::make_shared("k1", INT32), + std::make_shared("k2", INT32), std::make_shared("k3", INT32)})); auto f1 = std::make_shared("f1", type2); Schema schema({f0, f1}); CheckRoundtrip(&schema); } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/ipc/memory.cc b/cpp/src/arrow/ipc/memory.cc index e630ccd109b77..2b077e9792925 100644 --- a/cpp/src/arrow/ipc/memory.cc +++ b/cpp/src/arrow/ipc/memory.cc @@ -17,7 +17,7 @@ #include "arrow/ipc/memory.h" -#include // For memory-mapping +#include // For memory-mapping #include #include #include @@ -32,8 +32,7 @@ namespace arrow { namespace ipc { -MemorySource::MemorySource(AccessMode access_mode) : - access_mode_(access_mode) {} +MemorySource::MemorySource(AccessMode access_mode) : access_mode_(access_mode) {} MemorySource::~MemorySource() {} @@ -41,10 +40,7 @@ MemorySource::~MemorySource() {} class MemoryMappedSource::Impl { public: - Impl() : - file_(nullptr), - is_open_(false), - data_(nullptr) {} + Impl() : file_(nullptr), is_open_(false), data_(nullptr) {} ~Impl() { if (is_open_) { @@ -54,9 +50,7 @@ class MemoryMappedSource::Impl { } Status Open(const std::string& path, MemorySource::AccessMode mode) { - if (is_open_) { - return Status::IOError("A file is already open"); - } + if (is_open_) { return Status::IOError("A file is already open"); } path_ = path; @@ -72,18 +66,15 @@ class MemoryMappedSource::Impl { } fseek(file_, 0L, SEEK_END); - if (ferror(file_)) { - return Status::IOError("Unable to seek to end of file"); - } + if (ferror(file_)) { return Status::IOError("Unable to seek to end of file"); } size_ = ftell(file_); fseek(file_, 0L, SEEK_SET); is_open_ = true; // TODO(wesm): Add read-only version of this - data_ = reinterpret_cast(mmap(nullptr, size_, - PROT_READ | PROT_WRITE, - MAP_SHARED, fileno(file_), 0)); + data_ = reinterpret_cast( + mmap(nullptr, size_, PROT_READ | PROT_WRITE, MAP_SHARED, fileno(file_), 0)); if (data_ == nullptr) { std::stringstream ss; ss << "Memory mapping file failed, errno: " << errno; @@ -93,13 +84,9 @@ class MemoryMappedSource::Impl { return Status::OK(); } - int64_t size() const { - return size_; - } + int64_t size() const { return size_; } - uint8_t* data() { - return data_; - } + uint8_t* data() { return data_; } private: std::string path_; @@ -111,8 +98,8 @@ class MemoryMappedSource::Impl { uint8_t* data_; }; -MemoryMappedSource::MemoryMappedSource(AccessMode access_mode) : - MemorySource(access_mode) {} +MemoryMappedSource::MemoryMappedSource(AccessMode access_mode) + : MemorySource(access_mode) {} Status MemoryMappedSource::Open(const std::string& path, AccessMode access_mode, std::shared_ptr* out) { @@ -134,8 +121,8 @@ Status MemoryMappedSource::Close() { return Status::OK(); } -Status MemoryMappedSource::ReadAt(int64_t position, int64_t nbytes, - std::shared_ptr* out) { +Status MemoryMappedSource::ReadAt( + int64_t position, int64_t nbytes, std::shared_ptr* out) { if (position < 0 || position >= impl_->size()) { return Status::Invalid("position is out of bounds"); } @@ -145,8 +132,7 @@ Status MemoryMappedSource::ReadAt(int64_t position, int64_t nbytes, return Status::OK(); } -Status MemoryMappedSource::Write(int64_t position, const uint8_t* data, - int64_t nbytes) { +Status MemoryMappedSource::Write(int64_t position, const uint8_t* data, int64_t nbytes) { if (position < 0 || position >= impl_->size()) { return Status::Invalid("position is out of bounds"); } @@ -158,5 +144,5 @@ Status MemoryMappedSource::Write(int64_t position, const uint8_t* data, return 
Status::OK(); } -} // namespace ipc -} // namespace arrow +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/memory.h b/cpp/src/arrow/ipc/memory.h index 0b4d8347c342f..e529603dc6e2a 100644 --- a/cpp/src/arrow/ipc/memory.h +++ b/cpp/src/arrow/ipc/memory.h @@ -52,8 +52,8 @@ class OutputStream { // memory map class BufferOutputStream : public OutputStream { public: - explicit BufferOutputStream(const std::shared_ptr& buffer): - buffer_(buffer) {} + explicit BufferOutputStream(const std::shared_ptr& buffer) + : buffer_(buffer) {} // Implement the OutputStream interface Status Close() override; @@ -72,10 +72,7 @@ class BufferOutputStream : public OutputStream { class MemorySource { public: // Indicates the access permissions of the memory source - enum AccessMode { - READ_ONLY, - READ_WRITE - }; + enum AccessMode { READ_ONLY, READ_WRITE }; virtual ~MemorySource(); @@ -83,8 +80,8 @@ class MemorySource { // the indicated location // @returns: arrow::Status indicating success / failure. The buffer is set // into the *out argument - virtual Status ReadAt(int64_t position, int64_t nbytes, - std::shared_ptr* out) = 0; + virtual Status ReadAt( + int64_t position, int64_t nbytes, std::shared_ptr* out) = 0; virtual Status Close() = 0; @@ -110,8 +107,7 @@ class MemoryMappedSource : public MemorySource { Status Close() override; - Status ReadAt(int64_t position, int64_t nbytes, - std::shared_ptr* out) override; + Status ReadAt(int64_t position, int64_t nbytes, std::shared_ptr* out) override; Status Write(int64_t position, const uint8_t* data, int64_t nbytes) override; @@ -125,7 +121,7 @@ class MemoryMappedSource : public MemorySource { std::unique_ptr impl_; }; -} // namespace ipc -} // namespace arrow +} // namespace ipc +} // namespace arrow -#endif // ARROW_IPC_MEMORY_H +#endif // ARROW_IPC_MEMORY_H diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index 14b186906c3a0..ad5951d17e2c0 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -52,11 +52,12 @@ const std::shared_ptr UINT64 = std::make_shared(); const std::shared_ptr FLOAT = std::make_shared(); const std::shared_ptr DOUBLE = std::make_shared(); -static Status IntFromFlatbuffer(const flatbuf::Int* int_data, - std::shared_ptr* out) { +static Status IntFromFlatbuffer( + const flatbuf::Int* int_data, std::shared_ptr* out) { if (int_data->bitWidth() % 8 != 0) { return Status::NotImplemented("Integers not in cstdint are not implemented"); - } else if (int_data->bitWidth() > 64) { + } + if (int_data->bitWidth() > 64) { return Status::NotImplemented("Integers with more than 64 bits not implemented"); } @@ -80,8 +81,8 @@ static Status IntFromFlatbuffer(const flatbuf::Int* int_data, return Status::OK(); } -static Status FloatFromFlatuffer(const flatbuf::FloatingPoint* float_data, - std::shared_ptr* out) { +static Status FloatFromFlatuffer( + const flatbuf::FloatingPoint* float_data, std::shared_ptr* out) { if (float_data->precision() == flatbuf::Precision_SINGLE) { *out = FLOAT; } else { @@ -90,9 +91,8 @@ static Status FloatFromFlatuffer(const flatbuf::FloatingPoint* float_data, return Status::OK(); } -static Status TypeFromFlatbuffer(flatbuf::Type type, - const void* type_data, const std::vector>& children, - std::shared_ptr* out) { +static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, + const std::vector>& children, std::shared_ptr* out) { switch (type) { case flatbuf::Type_NONE: return Status::Invalid("Type metadata 
cannot be none"); @@ -101,8 +101,8 @@ static Status TypeFromFlatbuffer(flatbuf::Type type, case flatbuf::Type_Bit: return Status::NotImplemented("Type is not implemented"); case flatbuf::Type_FloatingPoint: - return FloatFromFlatuffer(static_cast(type_data), - out); + return FloatFromFlatuffer( + static_cast(type_data), out); case flatbuf::Type_Binary: case flatbuf::Type_Utf8: return Status::NotImplemented("Type is not implemented"); @@ -128,16 +128,14 @@ static Status TypeFromFlatbuffer(flatbuf::Type type, } // Forward declaration -static Status FieldToFlatbuffer(FBB& fbb, const std::shared_ptr& field, - FieldOffset* offset); +static Status FieldToFlatbuffer( + FBB& fbb, const std::shared_ptr& field, FieldOffset* offset); -static Offset IntToFlatbuffer(FBB& fbb, int bitWidth, - bool is_signed) { +static Offset IntToFlatbuffer(FBB& fbb, int bitWidth, bool is_signed) { return flatbuf::CreateInt(fbb, bitWidth, is_signed).Union(); } -static Offset FloatToFlatbuffer(FBB& fbb, - flatbuf::Precision precision) { +static Offset FloatToFlatbuffer(FBB& fbb, flatbuf::Precision precision) { return flatbuf::CreateFloatingPoint(fbb, precision).Union(); } @@ -166,10 +164,8 @@ static Status StructToFlatbuffer(FBB& fbb, const std::shared_ptr& type *offset = IntToFlatbuffer(fbb, BIT_WIDTH, IS_SIGNED); \ break; - static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, - std::vector* children, - flatbuf::Type* out_type, Offset* offset) { + std::vector* children, flatbuf::Type* out_type, Offset* offset) { switch (type->type) { case Type::BOOL: *out_type = flatbuf::Type_Bool; @@ -206,16 +202,16 @@ static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, *out_type = flatbuf::Type_Tuple; return StructToFlatbuffer(fbb, type, children, offset); default: + *out_type = flatbuf::Type_NONE; // Make clang-tidy happy std::stringstream ss; - ss << "Unable to convert type: " << type->ToString() - << std::endl; + ss << "Unable to convert type: " << type->ToString() << std::endl; return Status::NotImplemented(ss.str()); } return Status::OK(); } -static Status FieldToFlatbuffer(FBB& fbb, const std::shared_ptr& field, - FieldOffset* offset) { +static Status FieldToFlatbuffer( + FBB& fbb, const std::shared_ptr& field, FieldOffset* offset) { auto fb_name = fbb.CreateString(field->name); flatbuf::Type type_enum; @@ -225,14 +221,13 @@ static Status FieldToFlatbuffer(FBB& fbb, const std::shared_ptr& field, RETURN_NOT_OK(TypeToFlatbuffer(fbb, field->type, &children, &type_enum, &type_data)); auto fb_children = fbb.CreateVector(children); - *offset = flatbuf::CreateField(fbb, fb_name, field->nullable, type_enum, - type_data, fb_children); + *offset = flatbuf::CreateField( + fbb, fb_name, field->nullable, type_enum, type_data, fb_children); return Status::OK(); } -Status FieldFromFlatbuffer(const flatbuf::Field* field, - std::shared_ptr* out) { +Status FieldFromFlatbuffer(const flatbuf::Field* field, std::shared_ptr* out) { std::shared_ptr type; auto children = field->children(); @@ -241,8 +236,8 @@ Status FieldFromFlatbuffer(const flatbuf::Field* field, RETURN_NOT_OK(FieldFromFlatbuffer(children->Get(i), &child_fields[i])); } - RETURN_NOT_OK(TypeFromFlatbuffer(field->type_type(), - field->type(), child_fields, &type)); + RETURN_NOT_OK( + TypeFromFlatbuffer(field->type_type(), field->type(), child_fields, &type)); *out = std::make_shared(field->name()->str(), type); return Status::OK(); @@ -270,19 +265,17 @@ Status MessageBuilder::SetRecordBatch(int32_t length, int64_t body_length, const std::vector& nodes, 
const std::vector& buffers) { header_type_ = flatbuf::MessageHeader_RecordBatch; - header_ = flatbuf::CreateRecordBatch(fbb_, length, - fbb_.CreateVectorOfStructs(nodes), - fbb_.CreateVectorOfStructs(buffers)).Union(); + header_ = flatbuf::CreateRecordBatch(fbb_, length, fbb_.CreateVectorOfStructs(nodes), + fbb_.CreateVectorOfStructs(buffers)) + .Union(); body_length_ = body_length; return Status::OK(); } - Status WriteDataHeader(int32_t length, int64_t body_length, const std::vector& nodes, - const std::vector& buffers, - std::shared_ptr* out) { + const std::vector& buffers, std::shared_ptr* out) { MessageBuilder message; RETURN_NOT_OK(message.SetRecordBatch(length, body_length, nodes, buffers)); RETURN_NOT_OK(message.Finish()); @@ -290,8 +283,7 @@ Status WriteDataHeader(int32_t length, int64_t body_length, } Status MessageBuilder::Finish() { - auto message = flatbuf::CreateMessage(fbb_, header_type_, header_, - body_length_); + auto message = flatbuf::CreateMessage(fbb_, header_type_, header_, body_length_); fbb_.Finish(message); return Status::OK(); } @@ -313,5 +305,5 @@ Status MessageBuilder::GetBuffer(std::shared_ptr* out) { return Status::OK(); } -} // namespace ipc -} // namespace arrow +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/metadata-internal.h b/cpp/src/arrow/ipc/metadata-internal.h index f7365d2a49f95..779c5a30a044a 100644 --- a/cpp/src/arrow/ipc/metadata-internal.h +++ b/cpp/src/arrow/ipc/metadata-internal.h @@ -36,8 +36,7 @@ class Status; namespace ipc { -Status FieldFromFlatbuffer(const flatbuf::Field* field, - std::shared_ptr* out); +Status FieldFromFlatbuffer(const flatbuf::Field* field, std::shared_ptr* out); class MessageBuilder { public: @@ -60,10 +59,9 @@ class MessageBuilder { Status WriteDataHeader(int32_t length, int64_t body_length, const std::vector& nodes, - const std::vector& buffers, - std::shared_ptr* out); + const std::vector& buffers, std::shared_ptr* out); -} // namespace ipc -} // namespace arrow +} // namespace ipc +} // namespace arrow -#endif // ARROW_IPC_METADATA_INTERNAL_H +#endif // ARROW_IPC_METADATA_INTERNAL_H diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index 642f21a41e640..bcf104f0b8ba6 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -48,10 +48,8 @@ Status WriteSchema(const Schema* schema, std::shared_ptr* out) { class Message::Impl { public: - explicit Impl(const std::shared_ptr& buffer, - const flatbuf::Message* message) : - buffer_(buffer), - message_(message) {} + explicit Impl(const std::shared_ptr& buffer, const flatbuf::Message* message) + : buffer_(buffer), message_(message) {} Message::Type type() const { switch (message_->header_type()) { @@ -66,13 +64,9 @@ class Message::Impl { } } - const void* header() const { - return message_->header(); - } + const void* header() const { return message_->header(); } - int64_t body_length() const { - return message_->bodyLength(); - } + int64_t body_length() const { return message_->bodyLength(); } private: // Owns the memory this message accesses @@ -83,16 +77,12 @@ class Message::Impl { class SchemaMessage::Impl { public: - explicit Impl(const void* schema) : - schema_(static_cast(schema)) {} + explicit Impl(const void* schema) + : schema_(static_cast(schema)) {} - const flatbuf::Field* field(int i) const { - return schema_->fields()->Get(i); - } + const flatbuf::Field* field(int i) const { return schema_->fields()->Get(i); } - int num_fields() const { - return schema_->fields()->size(); - } + int 
num_fields() const { return schema_->fields()->size(); } private: const flatbuf::Schema* schema_; @@ -100,8 +90,8 @@ class SchemaMessage::Impl { Message::Message() {} -Status Message::Open(const std::shared_ptr& buffer, - std::shared_ptr* out) { +Status Message::Open( + const std::shared_ptr& buffer, std::shared_ptr* out) { std::shared_ptr result(new Message()); // The buffer is prefixed by its size as int32_t @@ -128,12 +118,11 @@ std::shared_ptr Message::get_shared_ptr() { } std::shared_ptr Message::GetSchema() { - return std::make_shared(this->shared_from_this(), - impl_->header()); + return std::make_shared(this->shared_from_this(), impl_->header()); } -SchemaMessage::SchemaMessage(const std::shared_ptr& message, - const void* schema) { +SchemaMessage::SchemaMessage( + const std::shared_ptr& message, const void* schema) { message_ = message; impl_.reset(new Impl(schema)); } @@ -158,31 +147,21 @@ Status SchemaMessage::GetSchema(std::shared_ptr* out) const { class RecordBatchMessage::Impl { public: - explicit Impl(const void* batch) : - batch_(static_cast(batch)) { + explicit Impl(const void* batch) + : batch_(static_cast(batch)) { nodes_ = batch_->nodes(); buffers_ = batch_->buffers(); } - const flatbuf::FieldNode* field(int i) const { - return nodes_->Get(i); - } + const flatbuf::FieldNode* field(int i) const { return nodes_->Get(i); } - const flatbuf::Buffer* buffer(int i) const { - return buffers_->Get(i); - } + const flatbuf::Buffer* buffer(int i) const { return buffers_->Get(i); } - int32_t length() const { - return batch_->length(); - } + int32_t length() const { return batch_->length(); } - int num_buffers() const { - return batch_->buffers()->size(); - } + int num_buffers() const { return batch_->buffers()->size(); } - int num_fields() const { - return batch_->nodes()->size(); - } + int num_fields() const { return batch_->nodes()->size(); } private: const flatbuf::RecordBatch* batch_; @@ -191,12 +170,11 @@ class RecordBatchMessage::Impl { }; std::shared_ptr Message::GetRecordBatch() { - return std::make_shared(this->shared_from_this(), - impl_->header()); + return std::make_shared(this->shared_from_this(), impl_->header()); } -RecordBatchMessage::RecordBatchMessage(const std::shared_ptr& message, - const void* batch) { +RecordBatchMessage::RecordBatchMessage( + const std::shared_ptr& message, const void* batch) { message_ = message; impl_.reset(new Impl(batch)); } @@ -234,5 +212,5 @@ int RecordBatchMessage::num_fields() const { return impl_->num_fields(); } -} // namespace ipc -} // namespace arrow +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h index c7288529b9fbd..838a4a676ea35 100644 --- a/cpp/src/arrow/ipc/metadata.h +++ b/cpp/src/arrow/ipc/metadata.h @@ -85,8 +85,7 @@ struct BufferMetadata { class RecordBatchMessage { public: // Accepts an opaque flatbuffer pointer - RecordBatchMessage(const std::shared_ptr& message, - const void* batch_meta); + RecordBatchMessage(const std::shared_ptr& message, const void* batch_meta); FieldMetadata field(int i) const; BufferMetadata buffer(int i) const; @@ -111,15 +110,10 @@ class DictionaryBatchMessage { class Message : public std::enable_shared_from_this { public: - enum Type { - NONE, - SCHEMA, - DICTIONARY_BATCH, - RECORD_BATCH - }; + enum Type { NONE, SCHEMA, DICTIONARY_BATCH, RECORD_BATCH }; - static Status Open(const std::shared_ptr& buffer, - std::shared_ptr* out); + static Status Open( + const std::shared_ptr& buffer, std::shared_ptr* out); std::shared_ptr 
get_shared_ptr(); @@ -140,7 +134,7 @@ class Message : public std::enable_shared_from_this { std::unique_ptr impl_; }; -} // namespace ipc -} // namespace arrow +} // namespace ipc +} // namespace arrow -#endif // ARROW_IPC_METADATA_H +#endif // ARROW_IPC_METADATA_H diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index 0fccce941071b..65c837dc8b141 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -36,9 +36,7 @@ class MemoryMapFixture { void CreateFile(const std::string path, int64_t size) { FILE* file = fopen(path.c_str(), "w"); - if (file != nullptr) { - tmp_files_.push_back(path); - } + if (file != nullptr) { tmp_files_.push_back(path); } ftruncate(fileno(file), size); fclose(file); } @@ -47,7 +45,7 @@ class MemoryMapFixture { std::vector tmp_files_; }; -} // namespace ipc -} // namespace arrow +} // namespace ipc +} // namespace arrow -#endif // ARROW_IPC_TEST_COMMON_H +#endif // ARROW_IPC_TEST_COMMON_H diff --git a/cpp/src/arrow/parquet/parquet-schema-test.cc b/cpp/src/arrow/parquet/parquet-schema-test.cc index a289ddbfde6eb..e2280f41189ef 100644 --- a/cpp/src/arrow/parquet/parquet-schema-test.cc +++ b/cpp/src/arrow/parquet/parquet-schema-test.cc @@ -45,8 +45,7 @@ const auto INT64 = std::make_shared(); const auto FLOAT = std::make_shared(); const auto DOUBLE = std::make_shared(); const auto UTF8 = std::make_shared(); -const auto BINARY = std::make_shared( - std::make_shared("", UINT8)); +const auto BINARY = std::make_shared(std::make_shared("", UINT8)); const auto DECIMAL_8_4 = std::make_shared(8, 4); class TestConvertParquetSchema : public ::testing::Test { @@ -58,8 +57,8 @@ class TestConvertParquetSchema : public ::testing::Test { for (int i = 0; i < expected_schema->num_fields(); ++i) { auto lhs = result_schema_->field(i); auto rhs = expected_schema->field(i); - EXPECT_TRUE(lhs->Equals(rhs)) - << i << " " << lhs->ToString() << " != " << rhs->ToString(); + EXPECT_TRUE(lhs->Equals(rhs)) << i << " " << lhs->ToString() + << " != " << rhs->ToString(); } } @@ -99,20 +98,15 @@ TEST_F(TestConvertParquetSchema, ParquetFlatPrimitives) { arrow_fields.push_back(std::make_shared("double", DOUBLE)); parquet_fields.push_back( - PrimitiveNode::Make("binary", Repetition::OPTIONAL, - ParquetType::BYTE_ARRAY)); + PrimitiveNode::Make("binary", Repetition::OPTIONAL, ParquetType::BYTE_ARRAY)); arrow_fields.push_back(std::make_shared("binary", BINARY)); - parquet_fields.push_back( - PrimitiveNode::Make("string", Repetition::OPTIONAL, - ParquetType::BYTE_ARRAY, - LogicalType::UTF8)); + parquet_fields.push_back(PrimitiveNode::Make( + "string", Repetition::OPTIONAL, ParquetType::BYTE_ARRAY, LogicalType::UTF8)); arrow_fields.push_back(std::make_shared("string", UTF8)); - parquet_fields.push_back( - PrimitiveNode::Make("flba-binary", Repetition::OPTIONAL, - ParquetType::FIXED_LEN_BYTE_ARRAY, - LogicalType::NONE, 12)); + parquet_fields.push_back(PrimitiveNode::Make("flba-binary", Repetition::OPTIONAL, + ParquetType::FIXED_LEN_BYTE_ARRAY, LogicalType::NONE, 12)); arrow_fields.push_back(std::make_shared("flba-binary", BINARY)); auto arrow_schema = std::make_shared(arrow_fields); @@ -125,28 +119,20 @@ TEST_F(TestConvertParquetSchema, ParquetFlatDecimals) { std::vector parquet_fields; std::vector> arrow_fields; - parquet_fields.push_back( - PrimitiveNode::Make("flba-decimal", Repetition::OPTIONAL, - ParquetType::FIXED_LEN_BYTE_ARRAY, - LogicalType::DECIMAL, 4, 8, 4)); + parquet_fields.push_back(PrimitiveNode::Make("flba-decimal", 
Repetition::OPTIONAL, + ParquetType::FIXED_LEN_BYTE_ARRAY, LogicalType::DECIMAL, 4, 8, 4)); arrow_fields.push_back(std::make_shared("flba-decimal", DECIMAL_8_4)); - parquet_fields.push_back( - PrimitiveNode::Make("binary-decimal", Repetition::OPTIONAL, - ParquetType::BYTE_ARRAY, - LogicalType::DECIMAL, -1, 8, 4)); + parquet_fields.push_back(PrimitiveNode::Make("binary-decimal", Repetition::OPTIONAL, + ParquetType::BYTE_ARRAY, LogicalType::DECIMAL, -1, 8, 4)); arrow_fields.push_back(std::make_shared("binary-decimal", DECIMAL_8_4)); - parquet_fields.push_back( - PrimitiveNode::Make("int32-decimal", Repetition::OPTIONAL, - ParquetType::INT32, - LogicalType::DECIMAL, -1, 8, 4)); + parquet_fields.push_back(PrimitiveNode::Make("int32-decimal", Repetition::OPTIONAL, + ParquetType::INT32, LogicalType::DECIMAL, -1, 8, 4)); arrow_fields.push_back(std::make_shared("int32-decimal", DECIMAL_8_4)); - parquet_fields.push_back( - PrimitiveNode::Make("int64-decimal", Repetition::OPTIONAL, - ParquetType::INT64, - LogicalType::DECIMAL, -1, 8, 4)); + parquet_fields.push_back(PrimitiveNode::Make("int64-decimal", Repetition::OPTIONAL, + ParquetType::INT64, LogicalType::DECIMAL, -1, 8, 4)); arrow_fields.push_back(std::make_shared("int64-decimal", DECIMAL_8_4)); auto arrow_schema = std::make_shared(arrow_fields); @@ -164,22 +150,19 @@ TEST_F(TestConvertParquetSchema, UnsupportedThings) { unsupported_nodes.push_back( GroupNode::Make("repeated-group", Repetition::REPEATED, {})); - unsupported_nodes.push_back( - PrimitiveNode::Make("int32", Repetition::OPTIONAL, - ParquetType::INT32, LogicalType::DATE)); + unsupported_nodes.push_back(PrimitiveNode::Make( + "int32", Repetition::OPTIONAL, ParquetType::INT32, LogicalType::DATE)); - unsupported_nodes.push_back( - PrimitiveNode::Make("int64", Repetition::OPTIONAL, - ParquetType::INT64, LogicalType::TIMESTAMP_MILLIS)); + unsupported_nodes.push_back(PrimitiveNode::Make( + "int64", Repetition::OPTIONAL, ParquetType::INT64, LogicalType::TIMESTAMP_MILLIS)); for (const NodePtr& node : unsupported_nodes) { ASSERT_RAISES(NotImplemented, ConvertSchema({node})); } } -TEST(TestNodeConversion, DateAndTime) { -} +TEST(TestNodeConversion, DateAndTime) {} -} // namespace parquet +} // namespace parquet -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/parquet/schema.cc b/cpp/src/arrow/parquet/schema.cc index 14f4f5be53ce9..066388b4d0e23 100644 --- a/cpp/src/arrow/parquet/schema.cc +++ b/cpp/src/arrow/parquet/schema.cc @@ -43,8 +43,7 @@ const auto INT64 = std::make_shared(); const auto FLOAT = std::make_shared(); const auto DOUBLE = std::make_shared(); const auto UTF8 = std::make_shared(); -const auto BINARY = std::make_shared( - std::make_shared("", UINT8)); +const auto BINARY = std::make_shared(std::make_shared("", UINT8)); TypePtr MakeDecimalType(const PrimitiveNode* node) { int precision = node->decimal_metadata().precision; @@ -167,12 +166,12 @@ Status NodeToField(const NodePtr& node, std::shared_ptr* out) { return Status::OK(); } -Status FromParquetSchema(const ::parquet::SchemaDescriptor* parquet_schema, - std::shared_ptr* out) { +Status FromParquetSchema( + const ::parquet::SchemaDescriptor* parquet_schema, std::shared_ptr* out) { // TODO(wesm): Consider adding an arrow::Schema name attribute, which comes // from the root Parquet node - const GroupNode* schema_node = static_cast( - parquet_schema->schema().get()); + const GroupNode* schema_node = + static_cast(parquet_schema->schema().get()); std::vector> fields(schema_node->field_count()); for (int i = 0; i 
< schema_node->field_count(); i++) { @@ -183,6 +182,6 @@ Status FromParquetSchema(const ::parquet::SchemaDescriptor* parquet_schema, return Status::OK(); } -} // namespace parquet +} // namespace parquet -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/parquet/schema.h b/cpp/src/arrow/parquet/schema.h index a8408970ede48..a44a9a4b6a892 100644 --- a/cpp/src/arrow/parquet/schema.h +++ b/cpp/src/arrow/parquet/schema.h @@ -31,14 +31,13 @@ class Status; namespace parquet { -Status NodeToField(const ::parquet::schema::NodePtr& node, - std::shared_ptr* out); +Status NodeToField(const ::parquet::schema::NodePtr& node, std::shared_ptr* out); -Status FromParquetSchema(const ::parquet::SchemaDescriptor* parquet_schema, - std::shared_ptr* out); +Status FromParquetSchema( + const ::parquet::SchemaDescriptor* parquet_schema, std::shared_ptr* out); -} // namespace parquet +} // namespace parquet -} // namespace arrow +} // namespace arrow #endif diff --git a/cpp/src/arrow/schema-test.cc b/cpp/src/arrow/schema-test.cc index a1de1dc5ac8a4..8cc80be120a44 100644 --- a/cpp/src/arrow/schema-test.cc +++ b/cpp/src/arrow/schema-test.cc @@ -86,8 +86,8 @@ TEST_F(TestSchema, ToString) { auto f0 = std::make_shared("f0", INT32); auto f1 = std::make_shared("f1", std::make_shared(), false); auto f2 = std::make_shared("f2", std::make_shared()); - auto f3 = std::make_shared("f3", - std::make_shared(std::make_shared())); + auto f3 = std::make_shared( + "f3", std::make_shared(std::make_shared())); vector> fields = {f0, f1, f2, f3}; auto schema = std::make_shared(fields); @@ -101,4 +101,4 @@ f3: list)"; ASSERT_EQ(expected, result); } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/schema.cc b/cpp/src/arrow/schema.cc index 18aad0e806ff2..a38acaa94ba56 100644 --- a/cpp/src/arrow/schema.cc +++ b/cpp/src/arrow/schema.cc @@ -26,18 +26,14 @@ namespace arrow { -Schema::Schema(const std::vector>& fields) : - fields_(fields) {} +Schema::Schema(const std::vector>& fields) : fields_(fields) {} bool Schema::Equals(const Schema& other) const { - if (this == &other) return true; - if (num_fields() != other.num_fields()) { - return false; - } + if (this == &other) { return true; } + + if (num_fields() != other.num_fields()) { return false; } for (int i = 0; i < num_fields(); ++i) { - if (!field(i)->Equals(*other.field(i).get())) { - return false; - } + if (!field(i)->Equals(*other.field(i).get())) { return false; } } return true; } @@ -51,13 +47,11 @@ std::string Schema::ToString() const { int i = 0; for (auto field : fields_) { - if (i > 0) { - buffer << std::endl; - } + if (i > 0) { buffer << std::endl; } buffer << field->ToString(); ++i; } return buffer.str(); } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/schema.h b/cpp/src/arrow/schema.h index 52f3c1ceae46d..a8b0d8444ac92 100644 --- a/cpp/src/arrow/schema.h +++ b/cpp/src/arrow/schema.h @@ -35,21 +35,17 @@ class Schema { bool Equals(const std::shared_ptr& other) const; // Return the ith schema element. 
Does not boundscheck - const std::shared_ptr& field(int i) const { - return fields_[i]; - } + const std::shared_ptr& field(int i) const { return fields_[i]; } // Render a string representation of the schema suitable for debugging std::string ToString() const; - int num_fields() const { - return fields_.size(); - } + int num_fields() const { return fields_.size(); } private: std::vector> fields_; }; -} // namespace arrow +} // namespace arrow #endif // ARROW_FIELD_H diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc index 4c7b8f80486de..385e7d831500a 100644 --- a/cpp/src/arrow/table-test.cc +++ b/cpp/src/arrow/table-test.cc @@ -49,10 +49,9 @@ class TestTable : public TestBase { schema_ = std::make_shared(fields); columns_ = { - std::make_shared(schema_->field(0), MakePrimitive(length)), - std::make_shared(schema_->field(1), MakePrimitive(length)), - std::make_shared(schema_->field(2), MakePrimitive(length)) - }; + std::make_shared(schema_->field(0), MakePrimitive(length)), + std::make_shared(schema_->field(1), MakePrimitive(length)), + std::make_shared(schema_->field(2), MakePrimitive(length))}; } protected: @@ -116,13 +115,12 @@ TEST_F(TestTable, InvalidColumns) { ASSERT_RAISES(Invalid, table_->ValidateColumns()); columns_ = { - std::make_shared(schema_->field(0), MakePrimitive(length)), - std::make_shared(schema_->field(1), MakePrimitive(length)), - std::make_shared(schema_->field(2), MakePrimitive(length - 1)) - }; + std::make_shared(schema_->field(0), MakePrimitive(length)), + std::make_shared(schema_->field(1), MakePrimitive(length)), + std::make_shared(schema_->field(2), MakePrimitive(length - 1))}; table_.reset(new Table("data", schema_, columns_, length)); ASSERT_RAISES(Invalid, table_->ValidateColumns()); } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc index e405c1d508c22..d9573eae74ddd 100644 --- a/cpp/src/arrow/table.cc +++ b/cpp/src/arrow/table.cc @@ -28,20 +28,16 @@ namespace arrow { RowBatch::RowBatch(const std::shared_ptr& schema, int num_rows, - const std::vector>& columns) : - schema_(schema), - num_rows_(num_rows), - columns_(columns) {} + const std::vector>& columns) + : schema_(schema), num_rows_(num_rows), columns_(columns) {} const std::string& RowBatch::column_name(int i) const { return schema_->field(i)->name; } Table::Table(const std::string& name, const std::shared_ptr& schema, - const std::vector>& columns) : - name_(name), - schema_(schema), - columns_(columns) { + const std::vector>& columns) + : name_(name), schema_(schema), columns_(columns) { if (columns.size() == 0) { num_rows_ = 0; } else { @@ -50,11 +46,8 @@ Table::Table(const std::string& name, const std::shared_ptr& schema, } Table::Table(const std::string& name, const std::shared_ptr& schema, - const std::vector>& columns, int64_t num_rows) : - name_(name), - schema_(schema), - columns_(columns), - num_rows_(num_rows) {} + const std::vector>& columns, int64_t num_rows) + : name_(name), schema_(schema), columns_(columns), num_rows_(num_rows) {} Status Table::ValidateColumns() const { if (num_columns() != schema_->num_fields()) { @@ -66,21 +59,17 @@ Status Table::ValidateColumns() const { const Column* col = columns_[i].get(); if (col == nullptr) { std::stringstream ss; - ss << "Column " << i << " named " << col->name() - << " was null"; + ss << "Column " << i << " was null"; return Status::Invalid(ss.str()); } if (col->length() != num_rows_) { std::stringstream ss; - ss << "Column " << i << " named " << col->name() - << " 
expected length " - << num_rows_ - << " but got length " - << col->length(); + ss << "Column " << i << " named " << col->name() << " expected length " << num_rows_ + << " but got length " << col->length(); return Status::Invalid(ss.str()); } } return Status::OK(); } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h index e2f73a2eeddcb..756b2a19593f4 100644 --- a/cpp/src/arrow/table.h +++ b/cpp/src/arrow/table.h @@ -42,27 +42,19 @@ class RowBatch { const std::vector>& columns); // @returns: the table's schema - const std::shared_ptr& schema() const { - return schema_; - } + const std::shared_ptr& schema() const { return schema_; } // @returns: the i-th column // Note: Does not boundscheck - const std::shared_ptr& column(int i) const { - return columns_[i]; - } + const std::shared_ptr& column(int i) const { return columns_[i]; } const std::string& column_name(int i) const; // @returns: the number of columns in the table - int num_columns() const { - return columns_.size(); - } + int num_columns() const { return columns_.size(); } // @returns: the number of rows (the corresponding length of each column) - int64_t num_rows() const { - return num_rows_; - } + int64_t num_rows() const { return num_rows_; } private: std::shared_ptr schema_; @@ -85,30 +77,20 @@ class Table { const std::vector>& columns, int64_t num_rows); // @returns: the table's name, if any (may be length 0) - const std::string& name() const { - return name_; - } + const std::string& name() const { return name_; } // @returns: the table's schema - const std::shared_ptr& schema() const { - return schema_; - } + const std::shared_ptr& schema() const { return schema_; } // Note: Does not boundscheck // @returns: the i-th column - const std::shared_ptr& column(int i) const { - return columns_[i]; - } + const std::shared_ptr& column(int i) const { return columns_[i]; } // @returns: the number of columns in the table - int num_columns() const { - return columns_.size(); - } + int num_columns() const { return columns_.size(); } // @returns: the number of rows (the corresponding length of each column) - int64_t num_rows() const { - return num_rows_; - } + int64_t num_rows() const { return num_rows_; } // After construction, perform any checks to validate the input arguments Status ValidateColumns() const; @@ -123,6 +105,6 @@ class Table { int64_t num_rows_; }; -} // namespace arrow +} // namespace arrow #endif // ARROW_TABLE_H diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index b2bce269992d0..538d9b233d990 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -36,38 +36,29 @@ #include "arrow/util/random.h" #include "arrow/util/status.h" -#define ASSERT_RAISES(ENUM, expr) \ - do { \ - Status s = (expr); \ - if (!s.Is##ENUM()) { \ - FAIL() << s.ToString(); \ - } \ +#define ASSERT_RAISES(ENUM, expr) \ + do { \ + Status s = (expr); \ + if (!s.Is##ENUM()) { FAIL() << s.ToString(); } \ } while (0) - -#define ASSERT_OK(expr) \ - do { \ - Status s = (expr); \ - if (!s.ok()) { \ - FAIL() << s.ToString(); \ - } \ +#define ASSERT_OK(expr) \ + do { \ + Status s = (expr); \ + if (!s.ok()) { FAIL() << s.ToString(); } \ } while (0) - -#define EXPECT_OK(expr) \ - do { \ - Status s = (expr); \ - EXPECT_TRUE(s.ok()); \ +#define EXPECT_OK(expr) \ + do { \ + Status s = (expr); \ + EXPECT_TRUE(s.ok()); \ } while (0) - namespace arrow { class TestBase : public ::testing::Test { public: - void SetUp() { - pool_ = default_memory_pool(); - } + void SetUp() { pool_ = 
default_memory_pool(); } template std::shared_ptr MakePrimitive(int32_t length, int32_t null_count = 0) { @@ -97,10 +88,8 @@ void randint(int64_t N, T lower, T upper, std::vector* out) { } } - template -void random_real(int n, uint32_t seed, T min_value, T max_value, - std::vector* out) { +void random_real(int n, uint32_t seed, T min_value, T max_value, std::vector* out) { std::mt19937 gen(seed); std::uniform_real_distribution d(min_value, max_value); for (int i = 0; i < n; ++i) { @@ -108,11 +97,10 @@ void random_real(int n, uint32_t seed, T min_value, T max_value, } } - template std::shared_ptr to_buffer(const std::vector& values) { - return std::make_shared(reinterpret_cast(values.data()), - values.size() * sizeof(T)); + return std::make_shared( + reinterpret_cast(values.data()), values.size() * sizeof(T)); } void random_null_bitmap(int64_t n, double pct_null, uint8_t* null_bitmap) { @@ -143,8 +131,8 @@ void rand_uniform_int(int n, uint32_t seed, T min_value, T max_value, T* out) { static inline int bitmap_popcount(const uint8_t* data, int length) { int count = 0; for (int i = 0; i < length; ++i) { - // TODO: accelerate this - if (util::get_bit(data, i)) ++count; + // TODO(wesm): accelerate this + if (util::get_bit(data, i)) { ++count; } } return count; } @@ -152,9 +140,7 @@ static inline int bitmap_popcount(const uint8_t* data, int length) { static inline int null_count(const std::vector& valid_bytes) { int result = 0; for (size_t i = 0; i < valid_bytes.size(); ++i) { - if (valid_bytes[i] == 0) { - ++result; - } + if (valid_bytes[i] == 0) { ++result; } } return result; } @@ -167,7 +153,7 @@ std::shared_ptr bytes_to_null_buffer(const std::vector& bytes) return out; } -} // namespace test -} // namespace arrow +} // namespace test +} // namespace arrow -#endif // ARROW_TEST_UTIL_H_ +#endif // ARROW_TEST_UTIL_H_ diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index f7f835e96a729..4e686d9cf4a6f 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -25,9 +25,7 @@ namespace arrow { std::string Field::ToString() const { std::stringstream ss; ss << this->name << ": " << this->type->ToString(); - if (!this->nullable) { - ss << " not null"; - } + if (!this->nullable) { ss << " not null"; } return ss.str(); } @@ -50,7 +48,7 @@ std::string StructType::ToString() const { std::stringstream s; s << "struct<"; for (int i = 0; i < this->num_children(); ++i) { - if (i > 0) s << ", "; + if (i > 0) { s << ", "; } const std::shared_ptr& field = this->child(i); s << field->name << ": " << field->type->ToString(); } @@ -58,4 +56,4 @@ std::string StructType::ToString() const { return s.str(); } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 86e47791b7cea..051ab46b199f9 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -110,8 +110,7 @@ struct DataType { std::vector> children_; - explicit DataType(Type::type type) : - type(type) {} + explicit DataType(Type::type type) : type(type) {} virtual ~DataType(); @@ -120,21 +119,13 @@ struct DataType { return this == other || (this->type == other->type); } - bool Equals(const std::shared_ptr& other) { - return Equals(other.get()); - } + bool Equals(const std::shared_ptr& other) { return Equals(other.get()); } - const std::shared_ptr& child(int i) const { - return children_[i]; - } + const std::shared_ptr& child(int i) const { return children_[i]; } - int num_children() const { - return children_.size(); - } + int num_children() const { return children_.size(); } - virtual int 
value_size() const { - return -1; - } + virtual int value_size() const { return -1; } virtual std::string ToString() const = 0; }; @@ -153,28 +144,20 @@ struct Field { // Fields can be nullable bool nullable; - Field(const std::string& name, const TypePtr& type, bool nullable = true) : - name(name), - type(type), - nullable(nullable) {} + Field(const std::string& name, const TypePtr& type, bool nullable = true) + : name(name), type(type), nullable(nullable) {} - bool operator==(const Field& other) const { - return this->Equals(other); - } + bool operator==(const Field& other) const { return this->Equals(other); } - bool operator!=(const Field& other) const { - return !this->Equals(other); - } + bool operator!=(const Field& other) const { return !this->Equals(other); } bool Equals(const Field& other) const { - return (this == &other) || (this->name == other.name && - this->nullable == other.nullable && - this->type->Equals(other.type.get())); + return (this == &other) || + (this->name == other.name && this->nullable == other.nullable && + this->type->Equals(other.type.get())); } - bool Equals(const std::shared_ptr& other) const { - return Equals(*other.get()); - } + bool Equals(const std::shared_ptr& other) const { return Equals(*other.get()); } std::string ToString() const; }; @@ -192,20 +175,15 @@ inline std::string PrimitiveType::ToString() const { return result; } -#define PRIMITIVE_DECL(TYPENAME, C_TYPE, ENUM, SIZE, NAME) \ - typedef C_TYPE c_type; \ - static constexpr Type::type type_enum = Type::ENUM; \ - \ - TYPENAME() \ - : PrimitiveType() {} \ - \ - virtual int value_size() const { \ - return SIZE; \ - } \ - \ - static const char* name() { \ - return NAME; \ - } +#define PRIMITIVE_DECL(TYPENAME, C_TYPE, ENUM, SIZE, NAME) \ + typedef C_TYPE c_type; \ + static constexpr Type::type type_enum = Type::ENUM; \ + \ + TYPENAME() : PrimitiveType() {} \ + \ + virtual int value_size() const { return SIZE; } \ + \ + static const char* name() { return NAME; } struct NullType : public PrimitiveType { PRIMITIVE_DECL(NullType, void, NA, 0, "null"); @@ -257,27 +235,19 @@ struct DoubleType : public PrimitiveType { struct ListType : public DataType { // List can contain any other logical value type - explicit ListType(const std::shared_ptr& value_type) - : DataType(Type::LIST) { + explicit ListType(const std::shared_ptr& value_type) : DataType(Type::LIST) { children_ = {std::make_shared("item", value_type)}; } - explicit ListType(const std::shared_ptr& value_field) - : DataType(Type::LIST) { + explicit ListType(const std::shared_ptr& value_field) : DataType(Type::LIST) { children_ = {value_field}; } - const std::shared_ptr& value_field() const { - return children_[0]; - } + const std::shared_ptr& value_field() const { return children_[0]; } - const std::shared_ptr& value_type() const { - return children_[0]->type; - } + const std::shared_ptr& value_type() const { return children_[0]->type; } - static char const *name() { - return "list"; - } + static char const* name() { return "list"; } std::string ToString() const override; }; @@ -286,9 +256,7 @@ struct ListType : public DataType { struct StringType : public DataType { StringType(); - static char const *name() { - return "string"; - } + static char const* name() { return "string"; } std::string ToString() const override; }; @@ -304,10 +272,8 @@ struct StructType : public DataType { // These will be defined elsewhere template -struct type_traits { -}; - +struct type_traits {}; -} // namespace arrow +} // namespace arrow #endif // ARROW_TYPE_H diff 
--git a/cpp/src/arrow/types/binary.h b/cpp/src/arrow/types/binary.h index 1fd675e5fdebf..201fbb6e79536 100644 --- a/cpp/src/arrow/types/binary.h +++ b/cpp/src/arrow/types/binary.h @@ -23,8 +23,6 @@ #include "arrow/type.h" -namespace arrow { +namespace arrow {} // namespace arrow -} // namespace arrow - -#endif // ARROW_TYPES_BINARY_H +#endif // ARROW_TYPES_BINARY_H diff --git a/cpp/src/arrow/types/collection.h b/cpp/src/arrow/types/collection.h index 46d84f1f183c8..1712030203fa2 100644 --- a/cpp/src/arrow/types/collection.h +++ b/cpp/src/arrow/types/collection.h @@ -31,15 +31,11 @@ struct CollectionType : public DataType { CollectionType() : DataType(T) {} - const TypePtr& child(int i) const { - return child_types_[i]; - } + const TypePtr& child(int i) const { return child_types_[i]; } - int num_children() const { - return child_types_.size(); - } + int num_children() const { return child_types_.size(); } }; -} // namespace arrow +} // namespace arrow -#endif // ARROW_TYPES_COLLECTION_H +#endif // ARROW_TYPES_COLLECTION_H diff --git a/cpp/src/arrow/types/construct.cc b/cpp/src/arrow/types/construct.cc index 34647a5005b90..0a30929b97c51 100644 --- a/cpp/src/arrow/types/construct.cc +++ b/cpp/src/arrow/types/construct.cc @@ -30,10 +30,10 @@ namespace arrow { class ArrayBuilder; -#define BUILDER_CASE(ENUM, BuilderType) \ - case Type::ENUM: \ - out->reset(new BuilderType(pool, type)); \ - return Status::OK(); +#define BUILDER_CASE(ENUM, BuilderType) \ + case Type::ENUM: \ + out->reset(new BuilderType(pool, type)); \ + return Status::OK(); // Initially looked at doing this with vtables, but shared pointers makes it // difficult @@ -58,30 +58,28 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, BUILDER_CASE(STRING, StringBuilder); - case Type::LIST: - { - std::shared_ptr value_builder; + case Type::LIST: { + std::shared_ptr value_builder; - const std::shared_ptr& value_type = static_cast( - type.get())->value_type(); - RETURN_NOT_OK(MakeBuilder(pool, value_type, &value_builder)); - out->reset(new ListBuilder(pool, type, value_builder)); - return Status::OK(); - } + const std::shared_ptr& value_type = + static_cast(type.get())->value_type(); + RETURN_NOT_OK(MakeBuilder(pool, value_type, &value_builder)); + out->reset(new ListBuilder(pool, type, value_builder)); + return Status::OK(); + } default: return Status::NotImplemented(type->ToString()); } } -#define MAKE_PRIMITIVE_ARRAY_CASE(ENUM, ArrayType) \ - case Type::ENUM: \ - out->reset(new ArrayType(type, length, data, null_count, null_bitmap)); \ - return Status::OK(); +#define MAKE_PRIMITIVE_ARRAY_CASE(ENUM, ArrayType) \ + case Type::ENUM: \ + out->reset(new ArrayType(type, length, data, null_count, null_bitmap)); \ + return Status::OK(); -Status MakePrimitiveArray(const std::shared_ptr& type, - int32_t length, const std::shared_ptr& data, - int32_t null_count, const std::shared_ptr& null_bitmap, - std::shared_ptr* out) { +Status MakePrimitiveArray(const std::shared_ptr& type, int32_t length, + const std::shared_ptr& data, int32_t null_count, + const std::shared_ptr& null_bitmap, std::shared_ptr* out) { switch (type->type) { MAKE_PRIMITIVE_ARRAY_CASE(BOOL, BooleanArray); MAKE_PRIMITIVE_ARRAY_CASE(UINT8, UInt8Array); @@ -99,4 +97,4 @@ Status MakePrimitiveArray(const std::shared_ptr& type, } } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/types/construct.h b/cpp/src/arrow/types/construct.h index 228faeccc4e4d..27fb7bd2149cf 100644 --- a/cpp/src/arrow/types/construct.h +++ 
b/cpp/src/arrow/types/construct.h @@ -33,11 +33,10 @@ class Status; Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, std::shared_ptr* out); -Status MakePrimitiveArray(const std::shared_ptr& type, - int32_t length, const std::shared_ptr& data, - int32_t null_count, const std::shared_ptr& null_bitmap, - std::shared_ptr* out); +Status MakePrimitiveArray(const std::shared_ptr& type, int32_t length, + const std::shared_ptr& data, int32_t null_count, + const std::shared_ptr& null_bitmap, std::shared_ptr* out); -} // namespace arrow +} // namespace arrow -#endif // ARROW_BUILDER_H_ +#endif // ARROW_BUILDER_H_ diff --git a/cpp/src/arrow/types/datetime.h b/cpp/src/arrow/types/datetime.h index e57b66ab46adb..b782455546c33 100644 --- a/cpp/src/arrow/types/datetime.h +++ b/cpp/src/arrow/types/datetime.h @@ -23,49 +23,30 @@ namespace arrow { struct DateType : public DataType { - enum class Unit: char { - DAY = 0, - MONTH = 1, - YEAR = 2 - }; + enum class Unit : char { DAY = 0, MONTH = 1, YEAR = 2 }; Unit unit; - explicit DateType(Unit unit = Unit::DAY) - : DataType(Type::DATE), - unit(unit) {} + explicit DateType(Unit unit = Unit::DAY) : DataType(Type::DATE), unit(unit) {} - DateType(const DateType& other) - : DateType(other.unit) {} + DateType(const DateType& other) : DateType(other.unit) {} - static char const *name() { - return "date"; - } + static char const* name() { return "date"; } }; - struct TimestampType : public DataType { - enum class Unit: char { - SECOND = 0, - MILLI = 1, - MICRO = 2, - NANO = 3 - }; + enum class Unit : char { SECOND = 0, MILLI = 1, MICRO = 2, NANO = 3 }; Unit unit; explicit TimestampType(Unit unit = Unit::MILLI) - : DataType(Type::TIMESTAMP), - unit(unit) {} + : DataType(Type::TIMESTAMP), unit(unit) {} - TimestampType(const TimestampType& other) - : TimestampType(other.unit) {} + TimestampType(const TimestampType& other) : TimestampType(other.unit) {} - static char const *name() { - return "timestamp"; - } + static char const* name() { return "timestamp"; } }; -} // namespace arrow +} // namespace arrow -#endif // ARROW_TYPES_DATETIME_H +#endif // ARROW_TYPES_DATETIME_H diff --git a/cpp/src/arrow/types/decimal-test.cc b/cpp/src/arrow/types/decimal-test.cc index 89896c8b425d0..7296ff8176113 100644 --- a/cpp/src/arrow/types/decimal-test.cc +++ b/cpp/src/arrow/types/decimal-test.cc @@ -37,4 +37,4 @@ TEST(TypesTest, TestDecimalType) { ASSERT_EQ(t2.scale, 4); } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/types/decimal.cc b/cpp/src/arrow/types/decimal.cc index f120c1a9dfde6..1d9a5e50e460b 100644 --- a/cpp/src/arrow/types/decimal.cc +++ b/cpp/src/arrow/types/decimal.cc @@ -28,5 +28,4 @@ std::string DecimalType::ToString() const { return s.str(); } -} // namespace arrow - +} // namespace arrow diff --git a/cpp/src/arrow/types/decimal.h b/cpp/src/arrow/types/decimal.h index 26243b42b0e7d..1be489d4f51b6 100644 --- a/cpp/src/arrow/types/decimal.h +++ b/cpp/src/arrow/types/decimal.h @@ -26,18 +26,15 @@ namespace arrow { struct DecimalType : public DataType { explicit DecimalType(int precision_, int scale_) - : DataType(Type::DECIMAL), precision(precision_), - scale(scale_) { } + : DataType(Type::DECIMAL), precision(precision_), scale(scale_) {} int precision; int scale; - static char const *name() { - return "decimal"; - } + static char const* name() { return "decimal"; } std::string ToString() const override; }; -} // namespace arrow +} // namespace arrow -#endif // ARROW_TYPES_DECIMAL_H +#endif // ARROW_TYPES_DECIMAL_H diff --git 
a/cpp/src/arrow/types/json.cc b/cpp/src/arrow/types/json.cc index fb731edd6073f..a4e0d085620a0 100644 --- a/cpp/src/arrow/types/json.cc +++ b/cpp/src/arrow/types/json.cc @@ -30,9 +30,8 @@ static const TypePtr String(new StringType()); static const TypePtr Double(new DoubleType()); static const TypePtr Bool(new BooleanType()); -static const std::vector json_types = {Null, Int32, String, - Double, Bool}; +static const std::vector json_types = {Null, Int32, String, Double, Bool}; TypePtr JSONScalar::dense_type = TypePtr(new DenseUnionType(json_types)); TypePtr JSONScalar::sparse_type = TypePtr(new SparseUnionType(json_types)); -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/types/json.h b/cpp/src/arrow/types/json.h index 9c850afac0af4..9de961f79a60a 100644 --- a/cpp/src/arrow/types/json.h +++ b/cpp/src/arrow/types/json.h @@ -28,11 +28,9 @@ struct JSONScalar : public DataType { static TypePtr dense_type; static TypePtr sparse_type; - explicit JSONScalar(bool dense = true) - : DataType(Type::JSON_SCALAR), - dense(dense) {} + explicit JSONScalar(bool dense = true) : DataType(Type::JSON_SCALAR), dense(dense) {} }; -} // namespace arrow +} // namespace arrow -#endif // ARROW_TYPES_JSON_H +#endif // ARROW_TYPES_JSON_H diff --git a/cpp/src/arrow/types/list-test.cc b/cpp/src/arrow/types/list-test.cc index 4eb560ea52256..aa34f23cc0230 100644 --- a/cpp/src/arrow/types/list-test.cc +++ b/cpp/src/arrow/types/list-test.cc @@ -76,9 +76,7 @@ class TestListBuilder : public TestBuilder { builder_ = std::dynamic_pointer_cast(tmp); } - void Done() { - result_ = std::dynamic_pointer_cast(builder_->Finish()); - } + void Done() { result_ = std::dynamic_pointer_cast(builder_->Finish()); } protected: TypePtr value_type_; @@ -88,9 +86,7 @@ class TestListBuilder : public TestBuilder { shared_ptr result_; }; - -TEST_F(TestListBuilder, TestResize) { -} +TEST_F(TestListBuilder, TestResize) {} TEST_F(TestListBuilder, TestAppendNull) { ASSERT_OK(builder_->AppendNull()); @@ -155,5 +151,4 @@ TEST_F(TestListBuilder, TestZeroLength) { Done(); } - -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/types/list.cc b/cpp/src/arrow/types/list.cc index d64c06d90c174..23f12ddc4ecd7 100644 --- a/cpp/src/arrow/types/list.cc +++ b/cpp/src/arrow/types/list.cc @@ -20,32 +20,25 @@ namespace arrow { bool ListArray::EqualsExact(const ListArray& other) const { - if (this == &other) return true; - if (null_count_ != other.null_count_) { - return false; - } + if (this == &other) { return true; } + if (null_count_ != other.null_count_) { return false; } - bool equal_offsets = offset_buf_->Equals(*other.offset_buf_, - length_ + 1); + bool equal_offsets = offset_buf_->Equals(*other.offset_buf_, length_ + 1); bool equal_null_bitmap = true; if (null_count_ > 0) { - equal_null_bitmap = null_bitmap_->Equals(*other.null_bitmap_, - util::bytes_for_bits(length_)); + equal_null_bitmap = + null_bitmap_->Equals(*other.null_bitmap_, util::bytes_for_bits(length_)); } - if (!(equal_offsets && equal_null_bitmap)) { - return false; - } + if (!(equal_offsets && equal_null_bitmap)) { return false; } return values()->Equals(other.values()); } bool ListArray::Equals(const std::shared_ptr& arr) const { - if (this == arr.get()) return true; - if (this->type_enum() != arr->type_enum()) { - return false; - } + if (this == arr.get()) { return true; } + if (this->type_enum() != arr->type_enum()) { return false; } return EqualsExact(*static_cast(arr.get())); } -} // namespace arrow +} // namespace arrow diff --git 
a/cpp/src/arrow/types/list.h b/cpp/src/arrow/types/list.h index 8073b5121764d..6b815460ecb1e 100644 --- a/cpp/src/arrow/types/list.h +++ b/cpp/src/arrow/types/list.h @@ -37,13 +37,12 @@ class MemoryPool; class ListArray : public Array { public: ListArray(const TypePtr& type, int32_t length, std::shared_ptr offsets, - const ArrayPtr& values, - int32_t null_count = 0, - std::shared_ptr null_bitmap = nullptr) : - Array(type, length, null_count, null_bitmap) { + const ArrayPtr& values, int32_t null_count = 0, + std::shared_ptr null_bitmap = nullptr) + : Array(type, length, null_count, null_bitmap) { offset_buf_ = offsets; - offsets_ = offsets == nullptr? nullptr : - reinterpret_cast(offset_buf_->data()); + offsets_ = offsets == nullptr ? nullptr + : reinterpret_cast(offset_buf_->data()); values_ = values; } @@ -51,19 +50,17 @@ class ListArray : public Array { // Return a shared pointer in case the requestor desires to share ownership // with this array. - const std::shared_ptr& values() const {return values_;} + const std::shared_ptr& values() const { return values_; } - const std::shared_ptr& value_type() const { - return values_->type(); - } + const std::shared_ptr& value_type() const { return values_->type(); } - const int32_t* offsets() const { return offsets_;} + const int32_t* offsets() const { return offsets_; } - int32_t offset(int i) const { return offsets_[i];} + int32_t offset(int i) const { return offsets_[i]; } // Neither of these functions will perform boundschecking - int32_t value_offset(int i) { return offsets_[i];} - int32_t value_length(int i) { return offsets_[i + 1] - offsets_[i];} + int32_t value_offset(int i) { return offsets_[i]; } + int32_t value_length(int i) { return offsets_[i + 1] - offsets_[i]; } bool EqualsExact(const ListArray& other) const; bool Equals(const std::shared_ptr& arr) const override; @@ -77,7 +74,6 @@ class ListArray : public Array { // ---------------------------------------------------------------------- // Array builder - // Builder class for variable-length list array value types // // To use this class, you must append values to the child array builder and use @@ -85,10 +81,9 @@ class ListArray : public Array { // have been appended to the child array) class ListBuilder : public Int32Builder { public: - ListBuilder(MemoryPool* pool, const TypePtr& type, - std::shared_ptr value_builder) - : Int32Builder(pool, type), - value_builder_(value_builder) {} + ListBuilder( + MemoryPool* pool, const TypePtr& type, std::shared_ptr value_builder) + : Int32Builder(pool, type), value_builder_(value_builder) {} Status Init(int32_t elements) { // One more than requested. 
@@ -116,12 +111,9 @@ class ListBuilder : public Int32Builder { int32_t new_capacity = util::next_power2(length_ + length); RETURN_NOT_OK(Resize(new_capacity)); } - memcpy(raw_data_ + length_, values, - type_traits::bytes_required(length)); + memcpy(raw_data_ + length_, values, type_traits::bytes_required(length)); - if (valid_bytes != nullptr) { - AppendNulls(valid_bytes, length); - } + if (valid_bytes != nullptr) { AppendNulls(valid_bytes, length); } length_ += length; return Status::OK(); @@ -132,12 +124,10 @@ class ListBuilder : public Int32Builder { std::shared_ptr items = value_builder_->Finish(); // Add final offset if the length is non-zero - if (length_) { - raw_data_[length_] = items->length(); - } + if (length_) { raw_data_[length_] = items->length(); } - auto result = std::make_shared(type_, length_, data_, items, - null_count_, null_bitmap_); + auto result = std::make_shared( + type_, length_, data_, items, null_count_, null_bitmap_); data_ = null_bitmap_ = nullptr; capacity_ = length_ = null_count_ = 0; @@ -145,9 +135,7 @@ class ListBuilder : public Int32Builder { return result; } - std::shared_ptr Finish() override { - return Transfer(); - } + std::shared_ptr Finish() override { return Transfer(); } // Start a new variable-length list slot // @@ -167,19 +155,14 @@ class ListBuilder : public Int32Builder { return Status::OK(); } - Status AppendNull() { - return Append(true); - } + Status AppendNull() { return Append(true); } - const std::shared_ptr& value_builder() const { - return value_builder_; - } + const std::shared_ptr& value_builder() const { return value_builder_; } protected: std::shared_ptr value_builder_; }; +} // namespace arrow -} // namespace arrow - -#endif // ARROW_TYPES_LIST_H +#endif // ARROW_TYPES_LIST_H diff --git a/cpp/src/arrow/types/primitive-test.cc b/cpp/src/arrow/types/primitive-test.cc index 761845d93812a..6bd9e73eb46ac 100644 --- a/cpp/src/arrow/types/primitive-test.cc +++ b/cpp/src/arrow/types/primitive-test.cc @@ -41,15 +41,15 @@ namespace arrow { class Array; -#define PRIMITIVE_TEST(KLASS, ENUM, NAME) \ - TEST(TypesTest, TestPrimitive_##ENUM) { \ - KLASS tp; \ - \ - ASSERT_EQ(tp.type, Type::ENUM); \ - ASSERT_EQ(tp.name(), string(NAME)); \ - \ - KLASS tp_copy = tp; \ - ASSERT_EQ(tp_copy.type, Type::ENUM); \ +#define PRIMITIVE_TEST(KLASS, ENUM, NAME) \ + TEST(TypesTest, TestPrimitive_##ENUM) { \ + KLASS tp; \ + \ + ASSERT_EQ(tp.type, Type::ENUM); \ + ASSERT_EQ(tp.name(), string(NAME)); \ + \ + KLASS tp_copy = tp; \ + ASSERT_EQ(tp_copy.type, Type::ENUM); \ } PRIMITIVE_TEST(Int8Type, INT8, "int8"); @@ -108,8 +108,8 @@ class TestPrimitiveBuilder : public TestBuilder { void Check(const std::shared_ptr& builder, bool nullable) { int size = builder->length(); - auto ex_data = std::make_shared(reinterpret_cast(draws_.data()), - size * sizeof(T)); + auto ex_data = std::make_shared( + reinterpret_cast(draws_.data()), size * sizeof(T)); std::shared_ptr ex_null_bitmap; int32_t ex_null_count = 0; @@ -121,10 +121,10 @@ class TestPrimitiveBuilder : public TestBuilder { ex_null_bitmap = nullptr; } - auto expected = std::make_shared(size, ex_data, ex_null_count, - ex_null_bitmap); - std::shared_ptr result = std::dynamic_pointer_cast( - builder->Finish()); + auto expected = + std::make_shared(size, ex_data, ex_null_count, ex_null_bitmap); + std::shared_ptr result = + std::dynamic_pointer_cast(builder->Finish()); // Builder is now reset ASSERT_EQ(0, builder->length()); @@ -145,30 +145,30 @@ class TestPrimitiveBuilder : public TestBuilder { vector valid_bytes_; }; 
-#define PTYPE_DECL(CapType, c_type) \ - typedef CapType##Array ArrayType; \ - typedef CapType##Builder BuilderType; \ - typedef CapType##Type Type; \ - typedef c_type T; \ - \ - static std::shared_ptr type() { \ - return std::shared_ptr(new Type()); \ +#define PTYPE_DECL(CapType, c_type) \ + typedef CapType##Array ArrayType; \ + typedef CapType##Builder BuilderType; \ + typedef CapType##Type Type; \ + typedef c_type T; \ + \ + static std::shared_ptr type() { \ + return std::shared_ptr(new Type()); \ } -#define PINT_DECL(CapType, c_type, LOWER, UPPER) \ - struct P##CapType { \ - PTYPE_DECL(CapType, c_type); \ - static void draw(int N, vector* draws) { \ - test::randint(N, LOWER, UPPER, draws); \ - } \ +#define PINT_DECL(CapType, c_type, LOWER, UPPER) \ + struct P##CapType { \ + PTYPE_DECL(CapType, c_type); \ + static void draw(int N, vector* draws) { \ + test::randint(N, LOWER, UPPER, draws); \ + } \ } -#define PFLOAT_DECL(CapType, c_type, LOWER, UPPER) \ - struct P##CapType { \ - PTYPE_DECL(CapType, c_type); \ - static void draw(int N, vector* draws) { \ - test::random_real(N, 0, LOWER, UPPER, draws); \ - } \ +#define PFLOAT_DECL(CapType, c_type, LOWER, UPPER) \ + struct P##CapType { \ + PTYPE_DECL(CapType, c_type); \ + static void draw(int N, vector* draws) { \ + test::random_real(N, 0, LOWER, UPPER, draws); \ + } \ } PINT_DECL(UInt8, uint8_t, 0, UINT8_MAX); @@ -214,10 +214,10 @@ void TestPrimitiveBuilder::Check( ex_null_bitmap = nullptr; } - auto expected = std::make_shared(size, ex_data, ex_null_count, - ex_null_bitmap); - std::shared_ptr result = std::dynamic_pointer_cast( - builder->Finish()); + auto expected = + std::make_shared(size, ex_data, ex_null_count, ex_null_bitmap); + std::shared_ptr result = + std::dynamic_pointer_cast(builder->Finish()); // Builder is now reset ASSERT_EQ(0, builder->length()); @@ -230,31 +230,23 @@ void TestPrimitiveBuilder::Check( ASSERT_EQ(expected->length(), result->length()); for (int i = 0; i < result->length(); ++i) { - if (nullable) { - ASSERT_EQ(valid_bytes_[i] == 0, result->IsNull(i)) << i; - } + if (nullable) { ASSERT_EQ(valid_bytes_[i] == 0, result->IsNull(i)) << i; } bool actual = util::get_bit(result->raw_data(), i); ASSERT_EQ(static_cast(draws_[i]), actual) << i; } ASSERT_TRUE(result->EqualsExact(*expected.get())); } -typedef ::testing::Types Primitives; +typedef ::testing::Types Primitives; TYPED_TEST_CASE(TestPrimitiveBuilder, Primitives); -#define DECL_T() \ - typedef typename TestFixture::T T; +#define DECL_T() typedef typename TestFixture::T T; -#define DECL_TYPE() \ - typedef typename TestFixture::Type Type; - -#define DECL_ARRAYTYPE() \ - typedef typename TestFixture::ArrayType ArrayType; +#define DECL_TYPE() typedef typename TestFixture::Type Type; +#define DECL_ARRAYTYPE() typedef typename TestFixture::ArrayType ArrayType; TYPED_TEST(TestPrimitiveBuilder, TestInit) { DECL_TYPE(); @@ -369,7 +361,6 @@ TYPED_TEST(TestPrimitiveBuilder, TestAppendScalar) { this->Check(this->builder_nn_, false); } - TYPED_TEST(TestPrimitiveBuilder, TestAppendVector) { DECL_T(); @@ -424,8 +415,7 @@ TYPED_TEST(TestPrimitiveBuilder, TestResize) { ASSERT_EQ(cap, this->builder_->capacity()); ASSERT_EQ(type_traits::bytes_required(cap), this->builder_->data()->size()); - ASSERT_EQ(util::bytes_for_bits(cap), - this->builder_->null_bitmap()->size()); + ASSERT_EQ(util::bytes_for_bits(cap), this->builder_->null_bitmap()->size()); } TYPED_TEST(TestPrimitiveBuilder, TestReserve) { @@ -437,8 +427,7 @@ TYPED_TEST(TestPrimitiveBuilder, TestReserve) { 
ASSERT_OK(this->builder_->Advance(100)); ASSERT_OK(this->builder_->Reserve(MIN_BUILDER_CAPACITY)); - ASSERT_EQ(util::next_power2(MIN_BUILDER_CAPACITY + 100), - this->builder_->capacity()); + ASSERT_EQ(util::next_power2(MIN_BUILDER_CAPACITY + 100), this->builder_->capacity()); } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/types/primitive.cc b/cpp/src/arrow/types/primitive.cc index c54d0757c4789..9549c47b41157 100644 --- a/cpp/src/arrow/types/primitive.cc +++ b/cpp/src/arrow/types/primitive.cc @@ -28,26 +28,21 @@ namespace arrow { // Primitive array base PrimitiveArray::PrimitiveArray(const TypePtr& type, int32_t length, - const std::shared_ptr& data, - int32_t null_count, - const std::shared_ptr& null_bitmap) : - Array(type, length, null_count, null_bitmap) { + const std::shared_ptr& data, int32_t null_count, + const std::shared_ptr& null_bitmap) + : Array(type, length, null_count, null_bitmap) { data_ = data; - raw_data_ = data == nullptr? nullptr : data_->data(); + raw_data_ = data == nullptr ? nullptr : data_->data(); } bool PrimitiveArray::EqualsExact(const PrimitiveArray& other) const { - if (this == &other) return true; - if (null_count_ != other.null_count_) { - return false; - } + if (this == &other) { return true; } + if (null_count_ != other.null_count_) { return false; } if (null_count_ > 0) { - bool equal_bitmap = null_bitmap_->Equals(*other.null_bitmap_, - util::ceil_byte(length_) / 8); - if (!equal_bitmap) { - return false; - } + bool equal_bitmap = + null_bitmap_->Equals(*other.null_bitmap_, util::ceil_byte(length_) / 8); + if (!equal_bitmap) { return false; } const uint8_t* this_data = raw_data_; const uint8_t* other_data = other.raw_data_; @@ -56,9 +51,7 @@ bool PrimitiveArray::EqualsExact(const PrimitiveArray& other) const { DCHECK_GT(value_size, 0); for (int i = 0; i < length_; ++i) { - if (!IsNull(i) && memcmp(this_data, other_data, value_size)) { - return false; - } + if (!IsNull(i) && memcmp(this_data, other_data, value_size)) { return false; } this_data += value_size; other_data += value_size; } @@ -69,10 +62,8 @@ bool PrimitiveArray::EqualsExact(const PrimitiveArray& other) const { } bool PrimitiveArray::Equals(const std::shared_ptr& arr) const { - if (this == arr.get()) return true; - if (this->type_enum() != arr->type_enum()) { - return false; - } + if (this == arr.get()) { return true; } + if (this->type_enum() != arr->type_enum()) { return false; } return EqualsExact(*static_cast(arr.get())); } @@ -92,9 +83,7 @@ Status PrimitiveBuilder::Init(int32_t capacity) { template Status PrimitiveBuilder::Resize(int32_t capacity) { // XXX: Set floor size for now - if (capacity < MIN_BUILDER_CAPACITY) { - capacity = MIN_BUILDER_CAPACITY; - } + if (capacity < MIN_BUILDER_CAPACITY) { capacity = MIN_BUILDER_CAPACITY; } if (capacity_ == 0) { RETURN_NOT_OK(Init(capacity)); @@ -122,8 +111,8 @@ Status PrimitiveBuilder::Reserve(int32_t elements) { } template -Status PrimitiveBuilder::Append(const value_type* values, int32_t length, - const uint8_t* valid_bytes) { +Status PrimitiveBuilder::Append( + const value_type* values, int32_t length, const uint8_t* valid_bytes) { RETURN_NOT_OK(PrimitiveBuilder::Reserve(length)); if (length > 0) { @@ -156,9 +145,8 @@ void PrimitiveBuilder::AppendNulls(const uint8_t* valid_bytes, int32_t length template std::shared_ptr PrimitiveBuilder::Finish() { - std::shared_ptr result = std::make_shared< - typename type_traits::ArrayType>( - type_, length_, data_, null_count_, null_bitmap_); + std::shared_ptr result = 
std::make_shared::ArrayType>( + type_, length_, data_, null_count_, null_bitmap_); data_ = null_bitmap_ = nullptr; capacity_ = length_ = null_count_ = 0; @@ -166,8 +154,8 @@ std::shared_ptr PrimitiveBuilder::Finish() { } template <> -Status PrimitiveBuilder::Append(const uint8_t* values, int32_t length, - const uint8_t* valid_bytes) { +Status PrimitiveBuilder::Append( + const uint8_t* values, int32_t length, const uint8_t* valid_bytes) { RETURN_NOT_OK(Reserve(length)); for (int i = 0; i < length; ++i) { @@ -202,23 +190,18 @@ template class PrimitiveBuilder; template class PrimitiveBuilder; BooleanArray::BooleanArray(int32_t length, const std::shared_ptr& data, - int32_t null_count, - const std::shared_ptr& null_bitmap) : - PrimitiveArray(std::make_shared(), length, - data, null_count, null_bitmap) {} + int32_t null_count, const std::shared_ptr& null_bitmap) + : PrimitiveArray( + std::make_shared(), length, data, null_count, null_bitmap) {} bool BooleanArray::EqualsExact(const BooleanArray& other) const { if (this == &other) return true; - if (null_count_ != other.null_count_) { - return false; - } + if (null_count_ != other.null_count_) { return false; } if (null_count_ > 0) { - bool equal_bitmap = null_bitmap_->Equals(*other.null_bitmap_, - util::bytes_for_bits(length_)); - if (!equal_bitmap) { - return false; - } + bool equal_bitmap = + null_bitmap_->Equals(*other.null_bitmap_, util::bytes_for_bits(length_)); + if (!equal_bitmap) { return false; } const uint8_t* this_data = raw_data_; const uint8_t* other_data = other.raw_data_; @@ -236,10 +219,8 @@ bool BooleanArray::EqualsExact(const BooleanArray& other) const { bool BooleanArray::Equals(const std::shared_ptr& arr) const { if (this == arr.get()) return true; - if (Type::BOOL != arr->type_enum()) { - return false; - } + if (Type::BOOL != arr->type_enum()) { return false; } return EqualsExact(*static_cast(arr.get())); } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index ec6fee35513ce..fcd3db4e96e53 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -34,17 +34,14 @@ namespace arrow { class MemoryPool; - // Base class for fixed-size logical types class PrimitiveArray : public Array { public: - PrimitiveArray(const TypePtr& type, int32_t length, - const std::shared_ptr& data, - int32_t null_count = 0, - const std::shared_ptr& null_bitmap = nullptr); + PrimitiveArray(const TypePtr& type, int32_t length, const std::shared_ptr& data, + int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); virtual ~PrimitiveArray() {} - const std::shared_ptr& data() const { return data_;} + const std::shared_ptr& data() const { return data_; } bool EqualsExact(const PrimitiveArray& other) const; bool Equals(const std::shared_ptr& arr) const override; @@ -54,31 +51,25 @@ class PrimitiveArray : public Array { const uint8_t* raw_data_; }; -#define NUMERIC_ARRAY_DECL(NAME, TypeClass, T) \ -class NAME : public PrimitiveArray { \ - public: \ - using value_type = T; \ - using PrimitiveArray::PrimitiveArray; \ - \ - NAME(int32_t length, const std::shared_ptr& data, \ - int32_t null_count = 0, \ - const std::shared_ptr& null_bitmap = nullptr) : \ - PrimitiveArray(std::make_shared(), length, \ - data, null_count, null_bitmap) {} \ - \ - bool EqualsExact(const NAME& other) const { \ - return PrimitiveArray::EqualsExact( \ - *static_cast(&other)); \ - } \ - \ - const T* raw_data() const { \ - return reinterpret_cast(raw_data_); \ - } \ - \ - T 
Value(int i) const { \ - return raw_data()[i]; \ - } \ -}; +#define NUMERIC_ARRAY_DECL(NAME, TypeClass, T) \ + class NAME : public PrimitiveArray { \ + public: \ + using value_type = T; \ + using PrimitiveArray::PrimitiveArray; \ + \ + NAME(int32_t length, const std::shared_ptr& data, int32_t null_count = 0, \ + const std::shared_ptr& null_bitmap = nullptr) \ + : PrimitiveArray( \ + std::make_shared(), length, data, null_count, null_bitmap) {} \ + \ + bool EqualsExact(const NAME& other) const { \ + return PrimitiveArray::EqualsExact(*static_cast(&other)); \ + } \ + \ + const T* raw_data() const { return reinterpret_cast(raw_data_); } \ + \ + T Value(int i) const { return raw_data()[i]; } \ + }; NUMERIC_ARRAY_DECL(UInt8Array, UInt8Type, uint8_t); NUMERIC_ARRAY_DECL(Int8Array, Int8Type, int8_t); @@ -96,9 +87,8 @@ class PrimitiveBuilder : public ArrayBuilder { public: typedef typename Type::c_type value_type; - explicit PrimitiveBuilder(MemoryPool* pool, const TypePtr& type) : - ArrayBuilder(pool, type), - data_(nullptr) {} + explicit PrimitiveBuilder(MemoryPool* pool, const TypePtr& type) + : ArrayBuilder(pool, type), data_(nullptr) {} virtual ~PrimitiveBuilder() {} @@ -117,16 +107,14 @@ class PrimitiveBuilder : public ArrayBuilder { return Status::OK(); } - std::shared_ptr data() const { - return data_; - } + std::shared_ptr data() const { return data_; } // Vector append // // If passed, valid_bytes is of equal length to values, and any zero byte // will be considered as a null for that slot - Status Append(const value_type* values, int32_t length, - const uint8_t* valid_bytes = nullptr); + Status Append( + const value_type* values, int32_t length, const uint8_t* valid_bytes = nullptr); // Ensure that builder can accommodate an additional number of // elements. 
Resizes if the current capacity is not sufficient @@ -172,89 +160,69 @@ template <> struct type_traits { typedef UInt8Array ArrayType; - static inline int bytes_required(int elements) { - return elements; - } + static inline int bytes_required(int elements) { return elements; } }; template <> struct type_traits { typedef Int8Array ArrayType; - static inline int bytes_required(int elements) { - return elements; - } + static inline int bytes_required(int elements) { return elements; } }; template <> struct type_traits { typedef UInt16Array ArrayType; - static inline int bytes_required(int elements) { - return elements * sizeof(uint16_t); - } + static inline int bytes_required(int elements) { return elements * sizeof(uint16_t); } }; template <> struct type_traits { typedef Int16Array ArrayType; - static inline int bytes_required(int elements) { - return elements * sizeof(int16_t); - } + static inline int bytes_required(int elements) { return elements * sizeof(int16_t); } }; template <> struct type_traits { typedef UInt32Array ArrayType; - static inline int bytes_required(int elements) { - return elements * sizeof(uint32_t); - } + static inline int bytes_required(int elements) { return elements * sizeof(uint32_t); } }; template <> struct type_traits { typedef Int32Array ArrayType; - static inline int bytes_required(int elements) { - return elements * sizeof(int32_t); - } + static inline int bytes_required(int elements) { return elements * sizeof(int32_t); } }; template <> struct type_traits { typedef UInt64Array ArrayType; - static inline int bytes_required(int elements) { - return elements * sizeof(uint64_t); - } + static inline int bytes_required(int elements) { return elements * sizeof(uint64_t); } }; template <> struct type_traits { typedef Int64Array ArrayType; - static inline int bytes_required(int elements) { - return elements * sizeof(int64_t); - } + static inline int bytes_required(int elements) { return elements * sizeof(int64_t); } }; template <> struct type_traits { typedef FloatArray ArrayType; - static inline int bytes_required(int elements) { - return elements * sizeof(float); - } + static inline int bytes_required(int elements) { return elements * sizeof(float); } }; template <> struct type_traits { typedef DoubleArray ArrayType; - static inline int bytes_required(int elements) { - return elements * sizeof(double); - } + static inline int bytes_required(int elements) { return elements * sizeof(double); } }; // Builders @@ -272,25 +240,19 @@ typedef NumericBuilder Int64Builder; typedef NumericBuilder FloatBuilder; typedef NumericBuilder DoubleBuilder; - class BooleanArray : public PrimitiveArray { public: using PrimitiveArray::PrimitiveArray; BooleanArray(int32_t length, const std::shared_ptr& data, - int32_t null_count = 0, - const std::shared_ptr& null_bitmap = nullptr); + int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); bool EqualsExact(const BooleanArray& other) const; bool Equals(const std::shared_ptr& arr) const override; - const uint8_t* raw_data() const { - return reinterpret_cast(raw_data_); - } + const uint8_t* raw_data() const { return reinterpret_cast(raw_data_); } - bool Value(int i) const { - return util::get_bit(raw_data(), i); - } + bool Value(int i) const { return util::get_bit(raw_data(), i); } }; template <> @@ -304,8 +266,8 @@ struct type_traits { class BooleanBuilder : public PrimitiveBuilder { public: - explicit BooleanBuilder(MemoryPool* pool, const TypePtr& type) : - PrimitiveBuilder(pool, type) {} + explicit 
BooleanBuilder(MemoryPool* pool, const TypePtr& type) + : PrimitiveBuilder(pool, type) {} virtual ~BooleanBuilder() {} @@ -322,11 +284,9 @@ class BooleanBuilder : public PrimitiveBuilder { ++length_; } - void Append(uint8_t val) { - Append(static_cast(val)); - } + void Append(uint8_t val) { Append(static_cast(val)); } }; -} // namespace arrow +} // namespace arrow #endif // ARROW_TYPES_PRIMITIVE_H diff --git a/cpp/src/arrow/types/string-test.cc b/cpp/src/arrow/types/string-test.cc index d3a4cc37f9c4c..ee4307c4d168a 100644 --- a/cpp/src/arrow/types/string-test.cc +++ b/cpp/src/arrow/types/string-test.cc @@ -48,7 +48,6 @@ TEST(TypesTest, TestCharType) { ASSERT_EQ(t2.size, 5); } - TEST(TypesTest, TestVarcharType) { VarcharType t1(5); @@ -72,7 +71,7 @@ TEST(TypesTest, TestStringType) { // ---------------------------------------------------------------------- // String container -class TestStringContainer : public ::testing::Test { +class TestStringContainer : public ::testing::Test { public: void SetUp() { chars_ = {'a', 'b', 'b', 'c', 'c', 'c'}; @@ -95,8 +94,8 @@ class TestStringContainer : public ::testing::Test { null_bitmap_ = test::bytes_to_null_buffer(valid_bytes_); null_count_ = test::null_count(valid_bytes_); - strings_ = std::make_shared(length_, offsets_buf_, values_, - null_count_, null_bitmap_); + strings_ = std::make_shared( + length_, offsets_buf_, values_, null_count_, null_bitmap_); } protected: @@ -117,7 +116,6 @@ class TestStringContainer : public ::testing::Test { std::shared_ptr strings_; }; - TEST_F(TestStringContainer, TestArrayBasics) { ASSERT_EQ(length_, strings_->length()); ASSERT_EQ(1, strings_->null_count()); @@ -130,7 +128,6 @@ TEST_F(TestStringContainer, TestType) { ASSERT_EQ(Type::STRING, strings_->type_enum()); } - TEST_F(TestStringContainer, TestListFunctions) { int pos = 0; for (size_t i = 0; i < expected_.size(); ++i) { @@ -140,10 +137,9 @@ TEST_F(TestStringContainer, TestListFunctions) { } } - TEST_F(TestStringContainer, TestDestructor) { - auto arr = std::make_shared(length_, offsets_buf_, values_, - null_count_, null_bitmap_); + auto arr = std::make_shared( + length_, offsets_buf_, values_, null_count_, null_bitmap_); } TEST_F(TestStringContainer, TestGetString) { @@ -167,9 +163,7 @@ class TestStringBuilder : public TestBuilder { builder_.reset(new StringBuilder(pool_, type_)); } - void Done() { - result_ = std::dynamic_pointer_cast(builder_->Finish()); - } + void Done() { result_ = std::dynamic_pointer_cast(builder_->Finish()); } protected: TypePtr type_; @@ -222,4 +216,4 @@ TEST_F(TestStringBuilder, TestZeroLength) { Done(); } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/types/string.cc b/cpp/src/arrow/types/string.cc index 80b075cdfbb23..29d97d039477c 100644 --- a/cpp/src/arrow/types/string.cc +++ b/cpp/src/arrow/types/string.cc @@ -26,11 +26,10 @@ namespace arrow { const std::shared_ptr STRING(new StringType()); -StringArray::StringArray(int32_t length, - const std::shared_ptr& offsets, +StringArray::StringArray(int32_t length, const std::shared_ptr& offsets, const ArrayPtr& values, int32_t null_count, - const std::shared_ptr& null_bitmap) : - StringArray(STRING, length, offsets, values, null_count, null_bitmap) {} + const std::shared_ptr& null_bitmap) + : StringArray(STRING, length, offsets, values, null_count, null_bitmap) {} std::string CharType::ToString() const { std::stringstream s; @@ -38,7 +37,6 @@ std::string CharType::ToString() const { return s.str(); } - std::string VarcharType::ToString() const { std::stringstream 
s; s << "varchar(" << size << ")"; @@ -47,4 +45,4 @@ std::string VarcharType::ToString() const { TypePtr StringBuilder::value_type_ = TypePtr(new UInt8Type()); -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/types/string.h b/cpp/src/arrow/types/string.h index 84cd0326ec850..c5cbe1058c7cf 100644 --- a/cpp/src/arrow/types/string.h +++ b/cpp/src/arrow/types/string.h @@ -37,48 +37,37 @@ class MemoryPool; struct CharType : public DataType { int size; - explicit CharType(int size) - : DataType(Type::CHAR), - size(size) {} + explicit CharType(int size) : DataType(Type::CHAR), size(size) {} - CharType(const CharType& other) - : CharType(other.size) {} + CharType(const CharType& other) : CharType(other.size) {} virtual std::string ToString() const; }; - // Variable-length, null-terminated strings, up to a certain length struct VarcharType : public DataType { int size; - explicit VarcharType(int size) - : DataType(Type::VARCHAR), - size(size) {} - VarcharType(const VarcharType& other) - : VarcharType(other.size) {} + explicit VarcharType(int size) : DataType(Type::VARCHAR), size(size) {} + VarcharType(const VarcharType& other) : VarcharType(other.size) {} virtual std::string ToString() const; }; -// TODO: add a BinaryArray layer in between +// TODO(wesm): add a BinaryArray layer in between class StringArray : public ListArray { public: - StringArray(const TypePtr& type, int32_t length, - const std::shared_ptr& offsets, - const ArrayPtr& values, - int32_t null_count = 0, - const std::shared_ptr& null_bitmap = nullptr) : - ListArray(type, length, offsets, values, null_count, null_bitmap) { + StringArray(const TypePtr& type, int32_t length, const std::shared_ptr& offsets, + const ArrayPtr& values, int32_t null_count = 0, + const std::shared_ptr& null_bitmap = nullptr) + : ListArray(type, length, offsets, values, null_count, null_bitmap) { // For convenience bytes_ = static_cast(values.get()); raw_bytes_ = bytes_->raw_data(); } - StringArray(int32_t length, - const std::shared_ptr& offsets, - const ArrayPtr& values, - int32_t null_count = 0, + StringArray(int32_t length, const std::shared_ptr& offsets, + const ArrayPtr& values, int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); // Compute the pointer t @@ -103,21 +92,18 @@ class StringArray : public ListArray { // Array builder class StringBuilder : public ListBuilder { public: - explicit StringBuilder(MemoryPool* pool, const TypePtr& type) : - ListBuilder(pool, type, std::make_shared(pool, value_type_)) { + explicit StringBuilder(MemoryPool* pool, const TypePtr& type) + : ListBuilder(pool, type, std::make_shared(pool, value_type_)) { byte_builder_ = static_cast(value_builder_.get()); } - Status Append(const std::string& value) { - return Append(value.c_str(), value.size()); - } + Status Append(const std::string& value) { return Append(value.c_str(), value.size()); } Status Append(const char* value, int32_t length) { RETURN_NOT_OK(ListBuilder::Append()); return byte_builder_->Append(reinterpret_cast(value), length); } - Status Append(const std::vector& values, - uint8_t* null_bytes); + Status Append(const std::vector& values, uint8_t* null_bytes); std::shared_ptr Finish() override { return ListBuilder::Transfer(); @@ -130,6 +116,6 @@ class StringBuilder : public ListBuilder { static TypePtr value_type_; }; -} // namespace arrow +} // namespace arrow -#endif // ARROW_TYPES_STRING_H +#endif // ARROW_TYPES_STRING_H diff --git a/cpp/src/arrow/types/struct-test.cc b/cpp/src/arrow/types/struct-test.cc index 
d94396f42c52a..79d560e19bcc0 100644 --- a/cpp/src/arrow/types/struct-test.cc +++ b/cpp/src/arrow/types/struct-test.cc @@ -49,7 +49,7 @@ TEST(TestStructType, Basics) { ASSERT_EQ(struct_type.ToString(), "struct"); - // TODO: out of bounds for field(...) + // TODO(wesm): out of bounds for field(...) } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/types/struct.cc b/cpp/src/arrow/types/struct.cc index 02af600b017d0..04a277a86fa58 100644 --- a/cpp/src/arrow/types/struct.cc +++ b/cpp/src/arrow/types/struct.cc @@ -17,6 +17,4 @@ #include "arrow/types/struct.h" -namespace arrow { - -} // namespace arrow +namespace arrow {} // namespace arrow diff --git a/cpp/src/arrow/types/struct.h b/cpp/src/arrow/types/struct.h index 5842534d35be1..17e32993bf975 100644 --- a/cpp/src/arrow/types/struct.h +++ b/cpp/src/arrow/types/struct.h @@ -24,8 +24,6 @@ #include "arrow/type.h" -namespace arrow { +namespace arrow {} // namespace arrow -} // namespace arrow - -#endif // ARROW_TYPES_STRUCT_H +#endif // ARROW_TYPES_STRUCT_H diff --git a/cpp/src/arrow/types/test-common.h b/cpp/src/arrow/types/test-common.h index 227aca632ef3c..1957636b141fd 100644 --- a/cpp/src/arrow/types/test-common.h +++ b/cpp/src/arrow/types/test-common.h @@ -28,10 +28,10 @@ #include "arrow/type.h" #include "arrow/util/memory-pool.h" -using std::unique_ptr; - namespace arrow { +using std::unique_ptr; + class TestBuilder : public ::testing::Test { public: void SetUp() { @@ -40,6 +40,7 @@ class TestBuilder : public ::testing::Test { builder_.reset(new UInt8Builder(pool_, type_)); builder_nn_.reset(new UInt8Builder(pool_, type_)); } + protected: MemoryPool* pool_; @@ -48,6 +49,6 @@ class TestBuilder : public ::testing::Test { unique_ptr builder_nn_; }; -} // namespace arrow +} // namespace arrow -#endif // ARROW_TYPES_TEST_COMMON_H +#endif // ARROW_TYPES_TEST_COMMON_H diff --git a/cpp/src/arrow/types/union.cc b/cpp/src/arrow/types/union.cc index db3f81795eae2..c891b4a5357ef 100644 --- a/cpp/src/arrow/types/union.cc +++ b/cpp/src/arrow/types/union.cc @@ -30,7 +30,7 @@ static inline std::string format_union(const std::vector& child_types) std::stringstream s; s << "union<"; for (size_t i = 0; i < child_types.size(); ++i) { - if (i) s << ", "; + if (i) { s << ", "; } s << child_types[i]->ToString(); } s << ">"; @@ -41,10 +41,8 @@ std::string DenseUnionType::ToString() const { return format_union(child_types_); } - std::string SparseUnionType::ToString() const { return format_union(child_types_); } - -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/types/union.h b/cpp/src/arrow/types/union.h index 29cda90b972dd..d2ee9bde04d0d 100644 --- a/cpp/src/arrow/types/union.h +++ b/cpp/src/arrow/types/union.h @@ -33,27 +33,23 @@ class Buffer; struct DenseUnionType : public CollectionType { typedef CollectionType Base; - explicit DenseUnionType(const std::vector& child_types) : - Base() { + explicit DenseUnionType(const std::vector& child_types) : Base() { child_types_ = child_types; } virtual std::string ToString() const; }; - struct SparseUnionType : public CollectionType { typedef CollectionType Base; - explicit SparseUnionType(const std::vector& child_types) : - Base() { + explicit SparseUnionType(const std::vector& child_types) : Base() { child_types_ = child_types; } virtual std::string ToString() const; }; - class UnionArray : public Array { protected: // The data are types encoded as int16 @@ -61,16 +57,13 @@ class UnionArray : public Array { std::vector> children_; }; - class DenseUnionArray : public 
UnionArray { protected: Buffer* offset_buf_; }; +class SparseUnionArray : public UnionArray {}; -class SparseUnionArray : public UnionArray { -}; - -} // namespace arrow +} // namespace arrow -#endif // ARROW_TYPES_UNION_H +#endif // ARROW_TYPES_UNION_H diff --git a/cpp/src/arrow/util/bit-util-test.cc b/cpp/src/arrow/util/bit-util-test.cc index 220bff084fd6e..26554d2c9069c 100644 --- a/cpp/src/arrow/util/bit-util-test.cc +++ b/cpp/src/arrow/util/bit-util-test.cc @@ -41,4 +41,4 @@ TEST(UtilTests, TestNextPower2) { ASSERT_EQ(1LL << 62, next_power2((1LL << 62) - 1)); } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/util/bit-util.cc b/cpp/src/arrow/util/bit-util.cc index 6c6d5330eab0d..475576e87cadd 100644 --- a/cpp/src/arrow/util/bit-util.cc +++ b/cpp/src/arrow/util/bit-util.cc @@ -26,14 +26,12 @@ namespace arrow { void util::bytes_to_bits(const std::vector& bytes, uint8_t* bits) { for (size_t i = 0; i < bytes.size(); ++i) { - if (bytes[i] > 0) { - set_bit(bits, i); - } + if (bytes[i] > 0) { set_bit(bits, i); } } } -Status util::bytes_to_bits(const std::vector& bytes, - std::shared_ptr* out) { +Status util::bytes_to_bits( + const std::vector& bytes, std::shared_ptr* out) { int bit_length = util::bytes_for_bits(bytes.size()); auto buffer = std::make_shared(); @@ -45,4 +43,4 @@ Status util::bytes_to_bits(const std::vector& bytes, return Status::OK(); } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/util/bit-util.h b/cpp/src/arrow/util/bit-util.h index 8d6287130dd2b..1f0f08c4d88ef 100644 --- a/cpp/src/arrow/util/bit-util.h +++ b/cpp/src/arrow/util/bit-util.h @@ -74,8 +74,8 @@ static inline int64_t next_power2(int64_t n) { void bytes_to_bits(const std::vector& bytes, uint8_t* bits); Status bytes_to_bits(const std::vector&, std::shared_ptr*); -} // namespace util +} // namespace util -} // namespace arrow +} // namespace arrow -#endif // ARROW_UTIL_BIT_UTIL_H +#endif // ARROW_UTIL_BIT_UTIL_H diff --git a/cpp/src/arrow/util/buffer-test.cc b/cpp/src/arrow/util/buffer-test.cc index 1d58226d84a46..dad0f7461d914 100644 --- a/cpp/src/arrow/util/buffer-test.cc +++ b/cpp/src/arrow/util/buffer-test.cc @@ -29,8 +29,7 @@ using std::string; namespace arrow { -class TestBuffer : public ::testing::Test { -}; +class TestBuffer : public ::testing::Test {}; TEST_F(TestBuffer, Resize) { PoolBuffer buf; @@ -54,4 +53,4 @@ TEST_F(TestBuffer, ResizeOOM) { ASSERT_RAISES(OutOfMemory, buf.Resize(to_alloc)); } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/util/buffer.cc b/cpp/src/arrow/util/buffer.cc index 04cdcd75cd41a..bc9c22c10de44 100644 --- a/cpp/src/arrow/util/buffer.cc +++ b/cpp/src/arrow/util/buffer.cc @@ -24,8 +24,7 @@ namespace arrow { -Buffer::Buffer(const std::shared_ptr& parent, int64_t offset, - int64_t size) { +Buffer::Buffer(const std::shared_ptr& parent, int64_t offset, int64_t size) { data_ = parent->data() + offset; size_ = size; parent_ = parent; @@ -37,18 +36,13 @@ std::shared_ptr MutableBuffer::GetImmutableView() { return std::make_shared(this->get_shared_ptr(), 0, size()); } -PoolBuffer::PoolBuffer(MemoryPool* pool) : - ResizableBuffer(nullptr, 0) { - if (pool == nullptr) { - pool = default_memory_pool(); - } +PoolBuffer::PoolBuffer(MemoryPool* pool) : ResizableBuffer(nullptr, 0) { + if (pool == nullptr) { pool = default_memory_pool(); } pool_ = pool; } PoolBuffer::~PoolBuffer() { - if (mutable_data_ != nullptr) { - pool_->Free(mutable_data_, capacity_); - } + if (mutable_data_ != nullptr) { pool_->Free(mutable_data_, 
capacity_); } } Status PoolBuffer::Reserve(int64_t new_capacity) { @@ -74,4 +68,4 @@ Status PoolBuffer::Resize(int64_t new_size) { return Status::OK(); } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/util/buffer.h b/cpp/src/arrow/util/buffer.h index c15f9b630cd97..94e53b61f2e83 100644 --- a/cpp/src/arrow/util/buffer.h +++ b/cpp/src/arrow/util/buffer.h @@ -38,9 +38,7 @@ class Status; // class instance class Buffer : public std::enable_shared_from_this { public: - Buffer(const uint8_t* data, int64_t size) : - data_(data), - size_(size) {} + Buffer(const uint8_t* data, int64_t size) : data_(data), size_(size) {} virtual ~Buffer(); // An offset into data that is owned by another buffer, but we want to be @@ -48,40 +46,28 @@ class Buffer : public std::enable_shared_from_this { // parent buffer have been destroyed Buffer(const std::shared_ptr& parent, int64_t offset, int64_t size); - std::shared_ptr get_shared_ptr() { - return shared_from_this(); - } + std::shared_ptr get_shared_ptr() { return shared_from_this(); } // Return true if both buffers are the same size and contain the same bytes // up to the number of compared bytes bool Equals(const Buffer& other, int64_t nbytes) const { - return this == &other || - (size_ >= nbytes && other.size_ >= nbytes && - !memcmp(data_, other.data_, nbytes)); + return this == &other || (size_ >= nbytes && other.size_ >= nbytes && + !memcmp(data_, other.data_, nbytes)); } bool Equals(const Buffer& other) const { - return this == &other || - (size_ == other.size_ && !memcmp(data_, other.data_, size_)); + return this == &other || (size_ == other.size_ && !memcmp(data_, other.data_, size_)); } - const uint8_t* data() const { - return data_; - } + const uint8_t* data() const { return data_; } - int64_t size() const { - return size_; - } + int64_t size() const { return size_; } // Returns true if this Buffer is referencing memory (possibly) owned by some // other buffer - bool is_shared() const { - return static_cast(parent_); - } + bool is_shared() const { return static_cast(parent_); } - const std::shared_ptr parent() const { - return parent_; - } + const std::shared_ptr parent() const { return parent_; } protected: const uint8_t* data_; @@ -97,22 +83,17 @@ class Buffer : public std::enable_shared_from_this { // A Buffer whose contents can be mutated. May or may not own its data. 
class MutableBuffer : public Buffer { public: - MutableBuffer(uint8_t* data, int64_t size) : - Buffer(data, size) { + MutableBuffer(uint8_t* data, int64_t size) : Buffer(data, size) { mutable_data_ = data; } - uint8_t* mutable_data() { - return mutable_data_; - } + uint8_t* mutable_data() { return mutable_data_; } // Get a read-only view of this buffer std::shared_ptr GetImmutableView(); protected: - MutableBuffer() : - Buffer(nullptr, 0), - mutable_data_(nullptr) {} + MutableBuffer() : Buffer(nullptr, 0), mutable_data_(nullptr) {} uint8_t* mutable_data_; }; @@ -128,9 +109,8 @@ class ResizableBuffer : public MutableBuffer { virtual Status Reserve(int64_t new_capacity) = 0; protected: - ResizableBuffer(uint8_t* data, int64_t size) : - MutableBuffer(data, size), - capacity_(size) {} + ResizableBuffer(uint8_t* data, int64_t size) + : MutableBuffer(data, size), capacity_(size) {} int64_t capacity_; }; @@ -152,16 +132,11 @@ static constexpr int64_t MIN_BUFFER_CAPACITY = 1024; class BufferBuilder { public: - explicit BufferBuilder(MemoryPool* pool) : - pool_(pool), - capacity_(0), - size_(0) {} + explicit BufferBuilder(MemoryPool* pool) : pool_(pool), capacity_(0), size_(0) {} Status Append(const uint8_t* data, int length) { if (capacity_ < length + size_) { - if (capacity_ == 0) { - buffer_ = std::make_shared(pool_); - } + if (capacity_ == 0) { buffer_ = std::make_shared(pool_); } capacity_ = std::max(MIN_BUFFER_CAPACITY, capacity_); while (capacity_ < length + size_) { capacity_ *= 2; @@ -188,6 +163,6 @@ class BufferBuilder { int64_t size_; }; -} // namespace arrow +} // namespace arrow -#endif // ARROW_UTIL_BUFFER_H +#endif // ARROW_UTIL_BUFFER_H diff --git a/cpp/src/arrow/util/logging.h b/cpp/src/arrow/util/logging.h index 3ce4ccc1e9c26..527ce423e7751 100644 --- a/cpp/src/arrow/util/logging.h +++ b/cpp/src/arrow/util/logging.h @@ -19,6 +19,7 @@ #define ARROW_UTIL_LOGGING_H #include +#include namespace arrow { @@ -37,19 +38,34 @@ namespace arrow { #define ARROW_LOG_INTERNAL(level) arrow::internal::CerrLog(level) #define ARROW_LOG(level) ARROW_LOG_INTERNAL(ARROW_##level) -#define ARROW_CHECK(condition) \ - (condition) ? 0 : ARROW_LOG(FATAL) << "Check failed: " #condition " " +#define ARROW_CHECK(condition) \ + (condition) ? 
0 : ::arrow::internal::FatalLog(ARROW_FATAL) \ + << __FILE__ << __LINE__ << "Check failed: " #condition " " #ifdef NDEBUG #define ARROW_DFATAL ARROW_WARNING -#define DCHECK(condition) while (false) arrow::internal::NullLog() -#define DCHECK_EQ(val1, val2) while (false) arrow::internal::NullLog() -#define DCHECK_NE(val1, val2) while (false) arrow::internal::NullLog() -#define DCHECK_LE(val1, val2) while (false) arrow::internal::NullLog() -#define DCHECK_LT(val1, val2) while (false) arrow::internal::NullLog() -#define DCHECK_GE(val1, val2) while (false) arrow::internal::NullLog() -#define DCHECK_GT(val1, val2) while (false) arrow::internal::NullLog() +#define DCHECK(condition) \ + while (false) \ + arrow::internal::NullLog() +#define DCHECK_EQ(val1, val2) \ + while (false) \ + arrow::internal::NullLog() +#define DCHECK_NE(val1, val2) \ + while (false) \ + arrow::internal::NullLog() +#define DCHECK_LE(val1, val2) \ + while (false) \ + arrow::internal::NullLog() +#define DCHECK_LT(val1, val2) \ + while (false) \ + arrow::internal::NullLog() +#define DCHECK_GE(val1, val2) \ + while (false) \ + arrow::internal::NullLog() +#define DCHECK_GT(val1, val2) \ + while (false) \ + arrow::internal::NullLog() #else #define ARROW_DFATAL ARROW_FATAL @@ -62,13 +78,13 @@ namespace arrow { #define DCHECK_GE(val1, val2) ARROW_CHECK((val1) >= (val2)) #define DCHECK_GT(val1, val2) ARROW_CHECK((val1) > (val2)) -#endif // NDEBUG +#endif // NDEBUG namespace internal { class NullLog { public: - template + template NullLog& operator<<(const T& t) { return *this; } @@ -76,34 +92,42 @@ class NullLog { class CerrLog { public: - CerrLog(int severity) // NOLINT(runtime/explicit) - : severity_(severity), - has_logged_(false) { - } + CerrLog(int severity) // NOLINT(runtime/explicit) + : severity_(severity), + has_logged_(false) {} - ~CerrLog() { - if (has_logged_) { - std::cerr << std::endl; - } - if (severity_ == ARROW_FATAL) { - exit(1); - } + virtual ~CerrLog() { + if (has_logged_) { std::cerr << std::endl; } + if (severity_ == ARROW_FATAL) { std::exit(1); } } - template + template CerrLog& operator<<(const T& t) { has_logged_ = true; std::cerr << t; return *this; } - private: + protected: const int severity_; bool has_logged_; }; -} // namespace internal +// Clang-tidy isn't smart enough to determine that DCHECK using CerrLog doesn't +// return so we create a new class to give it a hint. 
+class FatalLog : public CerrLog { + public: + FatalLog(int /* severity */) // NOLINT + : CerrLog(ARROW_FATAL) {} + + [[noreturn]] ~FatalLog() { + if (has_logged_) { std::cerr << std::endl; } + std::exit(1); + } +}; + +} // namespace internal -} // namespace arrow +} // namespace arrow -#endif // ARROW_UTIL_LOGGING_H +#endif // ARROW_UTIL_LOGGING_H diff --git a/cpp/src/arrow/util/macros.h b/cpp/src/arrow/util/macros.h index 069e627c90eaa..51e605ee50ac4 100644 --- a/cpp/src/arrow/util/macros.h +++ b/cpp/src/arrow/util/macros.h @@ -19,8 +19,8 @@ #define ARROW_UTIL_MACROS_H // From Google gutil -#define DISALLOW_COPY_AND_ASSIGN(TypeName) \ - TypeName(const TypeName&) = delete; \ +#define DISALLOW_COPY_AND_ASSIGN(TypeName) \ + TypeName(const TypeName&) = delete; \ void operator=(const TypeName&) = delete -#endif // ARROW_UTIL_MACROS_H +#endif // ARROW_UTIL_MACROS_H diff --git a/cpp/src/arrow/util/memory-pool-test.cc b/cpp/src/arrow/util/memory-pool-test.cc index 6ef07a07ada3f..e4600a9bd9b27 100644 --- a/cpp/src/arrow/util/memory-pool-test.cc +++ b/cpp/src/arrow/util/memory-pool-test.cc @@ -45,4 +45,4 @@ TEST(DefaultMemoryPool, OOM) { ASSERT_RAISES(OutOfMemory, pool->Allocate(to_alloc, &data)); } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/util/memory-pool.cc b/cpp/src/arrow/util/memory-pool.cc index 0b885e9376a62..fb417e74daf53 100644 --- a/cpp/src/arrow/util/memory-pool.cc +++ b/cpp/src/arrow/util/memory-pool.cc @@ -75,4 +75,4 @@ MemoryPool* default_memory_pool() { return &default_memory_pool_; } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/util/memory-pool.h b/cpp/src/arrow/util/memory-pool.h index 0d2478686f5a4..824c7248e2e86 100644 --- a/cpp/src/arrow/util/memory-pool.h +++ b/cpp/src/arrow/util/memory-pool.h @@ -36,6 +36,6 @@ class MemoryPool { MemoryPool* default_memory_pool(); -} // namespace arrow +} // namespace arrow -#endif // ARROW_UTIL_MEMORY_POOL_H +#endif // ARROW_UTIL_MEMORY_POOL_H diff --git a/cpp/src/arrow/util/random.h b/cpp/src/arrow/util/random.h index 64c197ef080fd..31f2b0680fe0a 100644 --- a/cpp/src/arrow/util/random.h +++ b/cpp/src/arrow/util/random.h @@ -15,10 +15,10 @@ namespace arrow { namespace random_internal { -static const uint32_t M = 2147483647L; // 2^31-1 +static const uint32_t M = 2147483647L; // 2^31-1 const double kTwoPi = 6.283185307179586476925286; -} // namespace random_internal +} // namespace random_internal // A very simple random number generator. Not especially good at // generating truly random bits, but good enough for our needs in this @@ -27,9 +27,7 @@ class Random { public: explicit Random(uint32_t s) : seed_(s & 0x7fffffffu) { // Avoid bad seeds. - if (seed_ == 0 || seed_ == random_internal::M) { - seed_ = 1; - } + if (seed_ == 0 || seed_ == random_internal::M) { seed_ = 1; } } // Next pseudo-random 32-bit unsigned integer. @@ -50,9 +48,7 @@ class Random { // The first reduction may overflow by 1 bit, so we may need to // repeat. mod == M is not possible; using > allows the faster // sign-bit-based test. - if (seed_ > random_internal::M) { - seed_ -= random_internal::M; - } + if (seed_ > random_internal::M) { seed_ -= random_internal::M; } return seed_; } @@ -91,9 +87,7 @@ class Random { // Skewed: pick "base" uniformly from range [0,max_log] and then // return "base" random bits. The effect is to pick a number in the // range [0,2^max_log-1] with exponential bias towards smaller numbers. 
- uint32_t Skewed(int max_log) { - return Uniform(1 << Uniform(max_log + 1)); - } + uint32_t Skewed(int max_log) { return Uniform(1 << Uniform(max_log + 1)); } // Creates a normal distribution variable using the // Box-Muller transform. See: @@ -103,8 +97,9 @@ class Random { double Normal(double mean, double std_dev) { double uniform1 = (Next() + 1.0) / (random_internal::M + 1.0); double uniform2 = (Next() + 1.0) / (random_internal::M + 1.0); - return (mean + std_dev * sqrt(-2 * ::log(uniform1)) * - cos(random_internal::kTwoPi * uniform2)); + return ( + mean + + std_dev * sqrt(-2 * ::log(uniform1)) * cos(random_internal::kTwoPi * uniform2)); } // Return a random number between 0.0 and 1.0 inclusive. @@ -116,13 +111,11 @@ class Random { uint32_t seed_; }; - uint32_t random_seed() { - // TODO: use system time to get a reasonably random seed + // TODO(wesm): use system time to get a reasonably random seed return 0; } - -} // namespace arrow +} // namespace arrow #endif // ARROW_UTIL_RANDOM_H_ diff --git a/cpp/src/arrow/util/status.cc b/cpp/src/arrow/util/status.cc index 43cb87e1a8c56..d194ed5572f52 100644 --- a/cpp/src/arrow/util/status.cc +++ b/cpp/src/arrow/util/status.cc @@ -36,9 +36,7 @@ const char* Status::CopyState(const char* state) { } std::string Status::CodeAsString() const { - if (state_ == NULL) { - return "OK"; - } + if (state_ == NULL) { return "OK"; } const char* type; switch (code()) { @@ -66,9 +64,7 @@ std::string Status::CodeAsString() const { std::string Status::ToString() const { std::string result(CodeAsString()); - if (state_ == NULL) { - return result; - } + if (state_ == NULL) { return result; } result.append(": "); @@ -78,4 +74,4 @@ std::string Status::ToString() const { return result; } -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/util/status.h b/cpp/src/arrow/util/status.h index 4e273edcb8f1f..6ddc177a9a50d 100644 --- a/cpp/src/arrow/util/status.h +++ b/cpp/src/arrow/util/status.h @@ -20,32 +20,36 @@ #include // Return the given status if it is not OK. -#define ARROW_RETURN_NOT_OK(s) do { \ - ::arrow::Status _s = (s); \ - if (!_s.ok()) return _s; \ +#define ARROW_RETURN_NOT_OK(s) \ + do { \ + ::arrow::Status _s = (s); \ + if (!_s.ok()) { return _s; } \ } while (0); // Return the given status if it is not OK, but first clone it and // prepend the given message. -#define ARROW_RETURN_NOT_OK_PREPEND(s, msg) do { \ - ::arrow::Status _s = (s); \ +#define ARROW_RETURN_NOT_OK_PREPEND(s, msg) \ + do { \ + ::arrow::Status _s = (s); \ if (::gutil::PREDICT_FALSE(!_s.ok())) return _s.CloneAndPrepend(msg); \ } while (0); // Return 'to_return' if 'to_call' returns a bad status. // The substitution for 'to_return' may reference the variable // 's' for the bad status. -#define ARROW_RETURN_NOT_OK_RET(to_call, to_return) do { \ - ::arrow::Status s = (to_call); \ - if (::gutil::PREDICT_FALSE(!s.ok())) return (to_return); \ +#define ARROW_RETURN_NOT_OK_RET(to_call, to_return) \ + do { \ + ::arrow::Status s = (to_call); \ + if (::gutil::PREDICT_FALSE(!s.ok())) return (to_return); \ } while (0); // If 'to_call' returns a bad status, CHECK immediately with a logged message // of 'msg' followed by the status. 
-#define ARROW_CHECK_OK_PREPEND(to_call, msg) do { \ -::arrow::Status _s = (to_call); \ -ARROW_CHECK(_s.ok()) << (msg) << ": " << _s.ToString(); \ -} while (0); +#define ARROW_CHECK_OK_PREPEND(to_call, msg) \ + do { \ + ::arrow::Status _s = (to_call); \ + ARROW_CHECK(_s.ok()) << (msg) << ": " << _s.ToString(); \ + } while (0); // If the status is bad, CHECK immediately, appending the status to the // logged message. @@ -53,12 +57,13 @@ ARROW_CHECK(_s.ok()) << (msg) << ": " << _s.ToString(); \ namespace arrow { -#define RETURN_NOT_OK(s) do { \ - Status _s = (s); \ - if (!_s.ok()) return _s; \ +#define RETURN_NOT_OK(s) \ + do { \ + Status _s = (s); \ + if (!_s.ok()) { return _s; } \ } while (0); -enum class StatusCode: char { +enum class StatusCode : char { OK = 0, OutOfMemory = 1, KeyError = 2, @@ -71,7 +76,7 @@ enum class StatusCode: char { class Status { public: // Create a success status. - Status() : state_(NULL) { } + Status() : state_(NULL) {} ~Status() { delete[] state_; } // Copy the specified status. @@ -132,8 +137,7 @@ class Status { const char* state_; StatusCode code() const { - return ((state_ == NULL) ? - StatusCode::OK : static_cast(state_[4])); + return ((state_ == NULL) ? StatusCode::OK : static_cast(state_[4])); } Status(StatusCode code, const std::string& msg, int16_t posix_code); @@ -155,5 +159,4 @@ inline void Status::operator=(const Status& s) { } // namespace arrow - -#endif // ARROW_STATUS_H_ +#endif // ARROW_STATUS_H_ diff --git a/cpp/src/arrow/util/test_main.cc b/cpp/src/arrow/util/test_main.cc index adc8466fb0be9..f928047023966 100644 --- a/cpp/src/arrow/util/test_main.cc +++ b/cpp/src/arrow/util/test_main.cc @@ -17,7 +17,7 @@ #include "gtest/gtest.h" -int main(int argc, char **argv) { +int main(int argc, char** argv) { ::testing::InitGoogleTest(&argc, argv); int ret = RUN_ALL_TESTS(); diff --git a/cpp/thirdparty/build_thirdparty.sh b/cpp/thirdparty/build_thirdparty.sh index 3d5f532b16309..f1738ff748299 100755 --- a/cpp/thirdparty/build_thirdparty.sh +++ b/cpp/thirdparty/build_thirdparty.sh @@ -84,8 +84,8 @@ if [ -n "$F_ALL" -o -n "$F_FLATBUFFERS" ]; then cd $TP_DIR/$FLATBUFFERS_BASEDIR CXXFLAGS=-fPIC cmake -DCMAKE_INSTALL_PREFIX:PATH=$PREFIX -DFLATBUFFERS_BUILD_TESTS=OFF . || { echo "cmake $FLATBUFFERS_ERROR" ; exit 1; } - make -j$PARALLEL - make install + make VERBOSE=1 -j$PARALLEL || { echo "make $FLATBUFFERS_ERROR" ; exit 1; } + make install || { echo "install $FLATBUFFERS_ERROR" ; exit 1; } fi echo "---------------------" From 9d88a50c55d18860c5543dfa6ddc8f4f162dd5e5 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sun, 3 Apr 2016 13:10:17 -0700 Subject: [PATCH 0058/1644] ARROW-86: [Python] Implement zero-copy Arrow-to-Pandas conversion We can create zero-copy NumPy arrays for floats and ints if we have no nulls. Each numpy-arrow-view has a reference to the underlying column to ensure that the Arrow structure lives at least as long as the newly created NumPy array. Author: Uwe L. Korn Closes #52 from xhochy/arrow-86 and squashes the following commits: ee29e90 [Uwe L. Korn] Remove duplicate ref counting 2cb4c7d [Uwe L. Korn] Release instead of reset reference 9d35528 [Uwe L. Korn] Handle reference counting with OwnedRef 327b368 [Uwe L. 
Korn] ARROW-86: [Python] Implement zero-copy Arrow-to-Pandas conversion --- python/pyarrow/array.pyx | 1 - python/pyarrow/includes/pyarrow.pxd | 2 +- python/pyarrow/table.pyx | 6 ++- python/src/pyarrow/adapters/pandas.cc | 67 ++++++++++++++++++++------- python/src/pyarrow/adapters/pandas.h | 3 +- python/src/pyarrow/common.h | 4 ++ 6 files changed, 60 insertions(+), 23 deletions(-) diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 456bf6d1da848..a80b3ce83980e 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -288,4 +288,3 @@ cdef class RowBatch: def __getitem__(self, i): return self.arrays[i] - diff --git a/python/pyarrow/includes/pyarrow.pxd b/python/pyarrow/includes/pyarrow.pxd index 1066b8034be70..92c814706fdd6 100644 --- a/python/pyarrow/includes/pyarrow.pxd +++ b/python/pyarrow/includes/pyarrow.pxd @@ -46,6 +46,6 @@ cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil: Status PandasMaskedToArrow(MemoryPool* pool, object ao, object mo, shared_ptr[CArray]* out) - Status ArrowToPandas(const shared_ptr[CColumn]& arr, PyObject** out) + Status ArrowToPandas(const shared_ptr[CColumn]& arr, object py_ref, PyObject** out) MemoryPool* GetMemoryPool() diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index 4c4816f0c7e69..f02d36f520be6 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -96,7 +96,7 @@ cdef class Column: import pandas as pd - check_status(pyarrow.ArrowToPandas(self.sp_column, &arr)) + check_status(pyarrow.ArrowToPandas(self.sp_column, self, &arr)) return pd.Series(arr, name=self.name) cdef _check_nullptr(self): @@ -205,6 +205,7 @@ cdef class Table: cdef: PyObject* arr shared_ptr[CColumn] col + Column column import pandas as pd @@ -212,7 +213,8 @@ cdef class Table: data = [] for i in range(self.table.num_columns()): col = self.table.column(i) - check_status(pyarrow.ArrowToPandas(col, &arr)) + column = self.column(i) + check_status(pyarrow.ArrowToPandas(col, column, &arr)) names.append(frombytes(col.get().name())) data.append( arr) diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index 22f1d7575f8c5..b39fde92034aa 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -520,8 +520,8 @@ static inline PyObject* make_pystring(const uint8_t* data, int32_t length) { template class ArrowDeserializer { public: - ArrowDeserializer(const std::shared_ptr& col) : - col_(col) {} + ArrowDeserializer(const std::shared_ptr& col, PyObject* py_ref) : + col_(col), py_ref_(py_ref) {} Status Convert(PyObject** out) { const std::shared_ptr data = col_->data(); @@ -548,6 +548,33 @@ class ArrowDeserializer { return Status::OK(); } + Status OutputFromData(int type, void* data) { + // Zero-Copy. We can pass the data pointer directly to NumPy. + Py_INCREF(py_ref_); + OwnedRef py_ref(py_ref); + npy_intp dims[1] = {col_->length()}; + out_ = reinterpret_cast(PyArray_SimpleNewFromData(1, dims, + type, data)); + + if (out_ == NULL) { + // Error occurred, trust that SimpleNew set the error state + return Status::OK(); + } + + if (PyArray_SetBaseObject(out_, py_ref_) == -1) { + // Error occurred, trust that SetBaseObject set the error state + return Status::OK(); + } else { + // PyArray_SetBaseObject steals our reference to py_ref_ + py_ref.release(); + } + + // Arrow data is immutable. 
+ PyArray_CLEARFLAGS(out_, NPY_ARRAY_WRITEABLE); + + return Status::OK(); + } + template inline typename std::enable_if< arrow_traits::is_floating, Status>::type @@ -556,18 +583,20 @@ class ArrowDeserializer { arrow::PrimitiveArray* prim_arr = static_cast( arr.get()); - - RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + const T* in_values = reinterpret_cast(prim_arr->data()->data()); if (arr->null_count() > 0) { + RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + T* out_values = reinterpret_cast(PyArray_DATA(out_)); - const T* in_values = reinterpret_cast(prim_arr->data()->data()); for (int64_t i = 0; i < arr->length(); ++i) { out_values[i] = arr->IsNull(i) ? NAN : in_values[i]; } } else { - memcpy(PyArray_DATA(out_), prim_arr->data()->data(), - arr->length() * arr->type()->value_size()); + // Zero-Copy. We can pass the data pointer directly to NumPy. + void* data = const_cast(in_values); + int type = arrow_traits::npy_type; + RETURN_NOT_OK(OutputFromData(type, data)); } return Status::OK(); @@ -594,10 +623,10 @@ class ArrowDeserializer { out_values[i] = prim_arr->IsNull(i) ? NAN : in_values[i]; } } else { - RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); - - memcpy(PyArray_DATA(out_), in_values, - arr->length() * arr->type()->value_size()); + // Zero-Copy. We can pass the data pointer directly to NumPy. + void* data = const_cast(in_values); + int type = arrow_traits::npy_type; + RETURN_NOT_OK(OutputFromData(type, data)); } return Status::OK(); @@ -680,18 +709,20 @@ class ArrowDeserializer { } private: std::shared_ptr col_; + PyObject* py_ref_; PyArrayObject* out_; }; -#define FROM_ARROW_CASE(TYPE) \ - case arrow::Type::TYPE: \ - { \ - ArrowDeserializer converter(col); \ - return converter.Convert(out); \ - } \ +#define FROM_ARROW_CASE(TYPE) \ + case arrow::Type::TYPE: \ + { \ + ArrowDeserializer converter(col, py_ref); \ + return converter.Convert(out); \ + } \ break; -Status ArrowToPandas(const std::shared_ptr& col, PyObject** out) { +Status ArrowToPandas(const std::shared_ptr& col, PyObject* py_ref, + PyObject** out) { switch(col->type()->type) { FROM_ARROW_CASE(BOOL); FROM_ARROW_CASE(INT8); diff --git a/python/src/pyarrow/adapters/pandas.h b/python/src/pyarrow/adapters/pandas.h index 58eb3ca61cdf4..17922349de6c1 100644 --- a/python/src/pyarrow/adapters/pandas.h +++ b/python/src/pyarrow/adapters/pandas.h @@ -36,7 +36,8 @@ namespace pyarrow { class Status; -Status ArrowToPandas(const std::shared_ptr& col, PyObject** out); +Status ArrowToPandas(const std::shared_ptr& col, PyObject* py_ref, + PyObject** out); Status PandasMaskedToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, std::shared_ptr* out); diff --git a/python/src/pyarrow/common.h b/python/src/pyarrow/common.h index cc9ad9ec5bbea..0211e8948f2f7 100644 --- a/python/src/pyarrow/common.h +++ b/python/src/pyarrow/common.h @@ -53,6 +53,10 @@ class OwnedRef { obj_ = obj; } + void release() { + obj_ = nullptr; + } + PyObject* obj() const{ return obj_; } From 7b2153b0430b825730a6e986993bb290ef29d22a Mon Sep 17 00:00:00 2001 From: Kai Zheng Date: Mon, 11 Apr 2016 10:35:50 +0200 Subject: [PATCH 0059/1644] =?UTF-8?q?ARROW-85:=20memcmp=20can=20be=20avoid?= =?UTF-8?q?ed=20in=20Equal=20when=20comparing=20with=20the=20same=20?= =?UTF-8?q?=E2=80=A6?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Avoid memcmp when possible, if the two underlying buffers are the same. 
Author: Kai Zheng Closes #57 from drankye/upstream and squashes the following commits: 2a70944 [Kai Zheng] Free test buffers afterwards 6a8bef5 [Kai Zheng] Fixed some comments b83f989 [Kai Zheng] ARROW-85. Corrected another format issue by clang-format 0ddcd01 [Kai Zheng] ARROW-85. Fixed another format issue 1b48663 [Kai Zheng] ARROW-85. Fixed checking styles 9f239a3 [Kai Zheng] ARROW-85. Added tests for Buffer and the new behavior 4d04c27 [Kai Zheng] ARROW-85 memcmp can be avoided in Equal when comparing with the same Buffer --- cpp/src/arrow/util/buffer-test.cc | 43 +++++++++++++++++++++++++++++++ cpp/src/arrow/util/buffer.h | 9 ++++--- 2 files changed, 49 insertions(+), 3 deletions(-) diff --git a/cpp/src/arrow/util/buffer-test.cc b/cpp/src/arrow/util/buffer-test.cc index dad0f7461d914..cc4ec98e4fb29 100644 --- a/cpp/src/arrow/util/buffer-test.cc +++ b/cpp/src/arrow/util/buffer-test.cc @@ -53,4 +53,47 @@ TEST_F(TestBuffer, ResizeOOM) { ASSERT_RAISES(OutOfMemory, buf.Resize(to_alloc)); } +TEST_F(TestBuffer, EqualsWithSameContent) { + MemoryPool* pool = default_memory_pool(); + const int32_t bufferSize = 128 * 1024; + uint8_t* rawBuffer1; + ASSERT_OK(pool->Allocate(bufferSize, &rawBuffer1)); + memset(rawBuffer1, 12, bufferSize); + uint8_t* rawBuffer2; + ASSERT_OK(pool->Allocate(bufferSize, &rawBuffer2)); + memset(rawBuffer2, 12, bufferSize); + uint8_t* rawBuffer3; + ASSERT_OK(pool->Allocate(bufferSize, &rawBuffer3)); + memset(rawBuffer3, 3, bufferSize); + + Buffer buffer1(rawBuffer1, bufferSize); + Buffer buffer2(rawBuffer2, bufferSize); + Buffer buffer3(rawBuffer3, bufferSize); + ASSERT_TRUE(buffer1.Equals(buffer2)); + ASSERT_FALSE(buffer1.Equals(buffer3)); + + pool->Free(rawBuffer1, bufferSize); + pool->Free(rawBuffer2, bufferSize); + pool->Free(rawBuffer3, bufferSize); +} + +TEST_F(TestBuffer, EqualsWithSameBuffer) { + MemoryPool* pool = default_memory_pool(); + const int32_t bufferSize = 128 * 1024; + uint8_t* rawBuffer; + ASSERT_OK(pool->Allocate(bufferSize, &rawBuffer)); + memset(rawBuffer, 111, bufferSize); + + Buffer buffer1(rawBuffer, bufferSize); + Buffer buffer2(rawBuffer, bufferSize); + ASSERT_TRUE(buffer1.Equals(buffer2)); + + const int64_t nbytes = bufferSize / 2; + Buffer buffer3(rawBuffer, nbytes); + ASSERT_TRUE(buffer1.Equals(buffer3, nbytes)); + ASSERT_FALSE(buffer1.Equals(buffer3, nbytes + 1)); + + pool->Free(rawBuffer, bufferSize); +} + } // namespace arrow diff --git a/cpp/src/arrow/util/buffer.h b/cpp/src/arrow/util/buffer.h index 94e53b61f2e83..56532be8070ae 100644 --- a/cpp/src/arrow/util/buffer.h +++ b/cpp/src/arrow/util/buffer.h @@ -51,12 +51,15 @@ class Buffer : public std::enable_shared_from_this { // Return true if both buffers are the same size and contain the same bytes // up to the number of compared bytes bool Equals(const Buffer& other, int64_t nbytes) const { - return this == &other || (size_ >= nbytes && other.size_ >= nbytes && - !memcmp(data_, other.data_, nbytes)); + return this == &other || + (size_ >= nbytes && other.size_ >= nbytes && + (data_ == other.data_ || !memcmp(data_, other.data_, nbytes))); } bool Equals(const Buffer& other) const { - return this == &other || (size_ == other.size_ && !memcmp(data_, other.data_, size_)); + return this == &other || + (size_ == other.size_ && + (data_ == other.data_ || !memcmp(data_, other.data_, size_))); } const uint8_t* data() const { return data_; } From 37f72716822f5b7ec3055b2dd0fabbc992e46c08 Mon Sep 17 00:00:00 2001 From: Micah Kornfield Date: Thu, 14 Apr 2016 19:24:19 +0200 Subject: [PATCH 0060/1644] 
ARROW-94: [Format] Expand list example to clarify null vs empty list WIP to make sure what I've done so far looks good. Per discussion on the JIRA item I started conversion of examples images to "text diagrams", but I wanted to get feedback to see if this is actually desirable (and if the way I'm approaching it is desirable). The remaining diagrams are for unions which I can convert if the existing changes look OK to others (although I think the Union diagrams are are pretty reasonable/compact). This change also includes some other minor cleanup, as well as including a statement about endianness per the discussion on the mailing list. Rendered markdown can be seen at: https://github.com/emkornfield/arrow/blob/emk_doc_fixes_PR3/format/Layout.md Author: Micah Kornfield Closes #58 from emkornfield/emk_doc_fixes_PR3 and squashes the following commits: 00b99ef [Micah Kornfield] remove png diagrams that are no longer used cab6f87 [Micah Kornfield] a few more consistency fixes 5550a78 [Micah Kornfield] fix some off by one bugs 69e1a78 [Micah Kornfield] fix some alignment, and one last offset array to buffer conversion b7aa7ea [Micah Kornfield] change list offset array to offset buffer 7dda5d5 [Micah Kornfield] clarify requirements of child types, finish replacing diagrams, fix some typos 0f23052 [Micah Kornfield] replace diagrams with physical layouts, clarify memory requirements for struct 590e4a7 [Micah Kornfield] cleanup magic quotes and clarify/fix some minor points --- format/Layout.md | 343 +++++++++++++++++++-- format/diagrams/layout-dense-union.png | Bin 47999 -> 0 bytes format/diagrams/layout-list-of-list.png | Bin 40105 -> 0 bytes format/diagrams/layout-list-of-struct.png | Bin 54122 -> 0 bytes format/diagrams/layout-list.png | Bin 15906 -> 0 bytes format/diagrams/layout-primitive-array.png | Bin 10907 -> 0 bytes format/diagrams/layout-sparse-union.png | Bin 43020 -> 0 bytes 7 files changed, 311 insertions(+), 32 deletions(-) delete mode 100644 format/diagrams/layout-dense-union.png delete mode 100644 format/diagrams/layout-list-of-list.png delete mode 100644 format/diagrams/layout-list-of-struct.png delete mode 100644 format/diagrams/layout-list.png delete mode 100644 format/diagrams/layout-primitive-array.png delete mode 100644 format/diagrams/layout-sparse-union.png diff --git a/format/Layout.md b/format/Layout.md index 1b532c6b3817c..92553d944c2d1 100644 --- a/format/Layout.md +++ b/format/Layout.md @@ -9,7 +9,7 @@ concepts, here is a small glossary to help disambiguate. * Slot or array slot: a single logical value in an array of some particular data type * Contiguous memory region: a sequential virtual address space with a given length. Any byte can be reached via a single pointer offset less than the - region’s length. + region's length. * Primitive type: a data type that occupies a fixed-size memory slot specified in bit width or byte width * Nested or parametric type: a data type whose full structure depends on one or @@ -42,7 +42,7 @@ Base requirements * Capable of representing fully-materialized and decoded / decompressed Parquet data * All leaf nodes (primitive value arrays) use contiguous memory regions -* Any relative type can be have null slots +* Any relative type can have null slots * Arrays are immutable once created. Implementations can provide APIs to mutate an array, but applying mutations will require a new array data structure to be built. 
@@ -69,11 +69,15 @@ Base requirements * To define a selection or masking vector construct * Implementation-specific details * Details of a user or developer C/C++/Java API. -* Any “table” structure composed of named arrays each having their own type or +* Any "table" structure composed of named arrays each having their own type or any other structure that composes arrays. * Any memory management or reference counting subsystem * To enumerate or specify types of encodings or compression support +## Byte Order (Endianness) + +The Arrow format is little endian. + ## Array lengths Any array has a known and fixed length, stored as a 32-bit signed integer, so a @@ -142,9 +146,59 @@ the size is rounded up to the nearest byte. The associated null bitmap is contiguously allocated (as described above) but does not need to be adjacent in memory to the values buffer. -(diagram not to scale) - +### Example Layout: Int32 Array +For example a primitive array of int32s: + +[1, 2, null, 4, 8] + +Would look like: + +``` +* Length: 5, Null count: 1 +* Null bitmap buffer: + + |Byte 0 (validity bitmap) | Bytes 1-7 | + |-------------------------|-----------------------| + |00011011 | 0 (padding) | + +* Value Buffer: + + |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | + |------------|-------------|-------------|-------------|-------------| + | 1 | 2 | unspecified | 4 | 8 | +``` + +### Example Layout: Non-null int32 Array + +[1, 2, 3, 4, 8] has two possible layouts: + +``` +* Length: 5, Null count: 0 +* Null bitmap buffer: + + | Byte 0 (validity bitmap) | Bytes 1-7 | + |--------------------------|-----------------------| + | 00011111 | 0 (padding) | + +* Value Buffer: + + |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | bytes 12-15 | bytes 16-19 | + |------------|-------------|-------------|-------------|-------------| + | 1 | 2 | 3 | 4 | 8 | +``` + +or with the bitmap elided: + +``` +* Length 5, Null count: 0 +* Null bitmap buffer: Not required +* Value Buffer: + + |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | bytes 12-15 | bytes 16-19 | + |------------|-------------|-------------|-------------|-------------| + | 1 | 2 | 3 | 4 | 8 | +``` ## List type @@ -158,7 +212,7 @@ A list type is specified like `List`, where `T` is any relative type A list-array is represented by the combination of the following: * A values array, a child array of type T. T may also be a nested type. -* An offsets array containing 32-bit signed integers with length equal to the +* An offsets buffer containing 32-bit signed integers with length equal to the length of the top-level array plus one. Note that this limits the size of the values array to 2^31 -1. @@ -175,20 +229,76 @@ slot_length = offsets[j + 1] - offsets[j] // (for 0 <= j < length) The first value in the offsets array is 0, and the last element is the length of the values array. -Let’s consider an example, the type `List`, where Char is a 1-byte +### Example Layout: `List` Array +Let's consider an example, the type `List`, where Char is a 1-byte logical type. 
-For an array of length 3 with respective values: +For an array of length 4 with respective values: -[[‘j’, ‘o’, ‘e’], null, [‘m’, ‘a’, ‘r’, ‘k’]] +[['j', 'o', 'e'], null, ['m', 'a', 'r', 'k'], []] -We have the following offsets and values arrays +will have the following representation: - +``` +* Length: 4, Null count: 1 +* Null bitmap buffer: + + | Byte 0 (validity bitmap) | Bytes 1-7 | + |--------------------------|-----------------------| + | 00001101 | 0 (padding) | + +* Offsets buffer (int32) + + | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | + |------------|-------------|-------------|-------------|-------------| + | 0 | 3 | 3 | 7 | 7 | + +* Values array (char array): + * Length: 7, Null count: 0 + * Null bitmap buffer: Not required + + | Bytes 0-7 | + |------------| + | joemark | +``` + +### Example Layout: `List>` +[[[1, 2], [3, 4]], [[5, 6, 7], null, [8]], [[9, 10]]] + +will be be represented as follows: + +``` +* Length 3 +* Nulls count: 0 +* Null bitmap buffer: Not required +* Offsets buffer (int32) + + | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | + |------------|------------|------------|-------------| + | 0 | 2 | 6 | 7 | -Let’s consider an array of a nested type, `List>` +* Values array (`List`) + * Length: 6, Null count: 1 + * Null bitmap buffer: - + | Byte 0 (validity bitmap) | Bytes 1-7 | + |--------------------------|-------------| + | 00110111 | 0 (padding) | + + * Offsets buffer (int32) + + | Bytes 0-28 | + |----------------------| + | 0, 2, 4, 7, 7, 8, 10 | + + * Values array (bytes): + * Length: 10, Null count: 0 + * Null bitmap buffer: Not required + + | Bytes 0-9 | + |-------------------------------| + | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | +``` ## Struct type @@ -198,7 +308,8 @@ types (which can all be distinct), called its fields. Typically the fields have names, but the names and their types are part of the type metadata, not the physical memory layout. -A struct does not have any additional allocated physical storage. +A struct array does not have any additional allocated physical storage for its values. +A struct array must still have an allocated null bitmap, if it has one or more null values. Physically, a struct type has one child array for each field. @@ -213,15 +324,67 @@ Struct < ``` has two child arrays, one List array (layout as above) and one 4-byte -physical value array having Int32 logical type. Here is a diagram showing the -full physical layout of this struct: +primitive value array having Int32 logical type. + +### Example Layout: `Struct, Int32>`: +The layout for [{'joe', 1}, {null, 2}, null, {'mark', 4}] would be: + +``` +* Length: 4, Null count: 1 +* Null bitmap buffer: + + | Byte 0 (validity bitmap) | Bytes 1-7 | + |--------------------------|-------------| + | 00001011 | 0 (padding) | + +* Children arrays: + * field-0 array (`List`): + * Length: 4, Null count: 1 + * Null bitmap buffer: + + | Byte 0 (validity bitmap) | Bytes 1-7 | + |--------------------------|-----------------------| + | 00011101 | 0 (padding) | - + * Offsets buffer: + + | Bytes 0-19 | + |----------------| + | 0, 3, 3, 6, 10 | + + * Values array: + * Length: 10, Null count: 0 + * Null bitmap buffer: Not required + + * Value buffer: + + | Bytes 0-9 | + |----------------| + | joebobmark | + + * field-1 array (int32 array): + * Length: 4, Null count: 0 + * Null bitmap buffer: Not required + * Value Buffer: + + | Bytes 0-15 | + |----------------| + | 1, 2, 3, 4 | + +``` While a struct does not have physical storage for each of its semantic slots (i.e. 
each scalar C-like struct), an entire struct slot can be set to null via the null bitmap. Any of the child field arrays can have null values according to their respective independent null bitmaps. +This implies that for a particular struct slot the null bitmap for the struct +array might indicate a null slot when one or more of its child arrays has a +non-null value in their corresponding slot. When reading the struct array the +parent null bitmap is authoritative. +This is illustrated in the example above, the child arrays have valid entries +for the null struct but are 'hidden' from the consumer by the parent array's +null bitmap. However, when treated independently corresponding +values of the children array will be non-null. ## Dense union type @@ -237,23 +400,64 @@ cases. This first, the dense union, represents a mixed-type array with 6 bytes of overhead for each value. Its physical layout is as follows: * One child array for each relative type -* Types array: An array of unsigned integers, enumerated from 0 corresponding +* Types buffer: A buffer of unsigned integers, enumerated from 0 corresponding to each type, with the smallest byte width capable of representing the number of types in the union. -* Offsets array: An array of signed int32 values indicating the relative offset +* Offsets buffer: A buffer of signed int32 values indicating the relative offset into the respective child array for the type in a given slot. The respective offsets for each child value array must be in order / increasing. -Alternate proposal (TBD): the types and offset values may be packed into an -int48 with 2 bytes for the type and 4 bytes for the offset. - Critically, the dense union allows for minimal overhead in the ubiquitous -union-of-structs with non-overlapping-fields use case (Union) +union-of-structs with non-overlapping-fields use case (`Union`) -Here is a diagram of an example dense union: +### Example Layout: Dense union + +An example layout for logical union of: +`Union` having the values: +[{f=1.2}, null, {f=3.4}, {i=5}] + +``` +* Length: 4, Null count: 1 +* Null bitmap buffer: + |Byte 0 (validity bitmap) | Bytes 1-7 | + |-------------------------|-----------------------| + |00001101 | 0 (padding) | - +* Types buffer: + + |Byte 0-1 | Byte 2-3 | Byte 4-5 | Byte 6-7 | + |---------|-------------|----------|----------| + | 0 | unspecified | 0 | 1 | + +* Offset buffer: + + |Byte 0-3 | Byte 4-7 | Byte 8-11 | Byte 12-15 | + |---------|-------------|-----------|------------| + | 0 | unspecified | 1 | 0 | + +* Children arrays: + * Field-0 array (f: float): + * Length: 2, nulls: 0 + * Null bitmap buffer: Not required + + * Value Buffer: + + | Bytes 0-7 | + |-----------| + | 1.2, 3.4 | + + + * Field-1 array (f: float): + * Length: 1, nulls: 0 + * Null bitmap buffer: Not required + + * Value Buffer: + + | Bytes 0-3 | + |-----------| + | 5 | +``` ## Sparse union type @@ -264,17 +468,92 @@ the length of the union. While a sparse union may use significantly more space compared with a dense union, it has some advantages that may be desirable in certain use cases: - +* A sparse union is more amenable to vectorized expression evaluation in some use cases. +* Equal-length arrays can be interpreted as a union by only defining the types array. -More amenable to vectorized expression evaluation in some use cases. 
-Equal-length arrays can be interpreted as a union by only defining the types array +### Example layout: `SparseUnion>` + +For the union array: + +[{u0=5}, {u1=1.2}, {u2='joe'}, {u1=3.4}, {u0=4}, 'mark'] + +will have the following layout: +``` +* Length: 6, Null count: 0 +* Null bitmap buffer: Not required + +* Types buffer: + + | Bytes 0-1 | Bytes 2-3 | Bytes 4-5 | Bytes 6-7 | Bytes 8-9 | Bytes 10-11 | + |------------|-------------|-------------|-------------|-------------|--------------| + | 0 | 1 | 2 | 1 | 0 | 2 | + +* Children arrays: + + * u0 (Int32): + * Length: 6, Null count: 4 + * Null bitmap buffer: + + |Byte 0 (validity bitmap) | Bytes 1-7 | + |-------------------------|-----------------------| + |00010001 | 0 (padding) | + + * Value buffer: + + |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | + |------------|-------------|-------------|-------------|-------------|--------------| + | 1 | unspecified | unspecified | unspecified | 4 | unspecified | + + * u1 (float): + * Length: 6, Null count: 4 + * Null bitmap buffer: + + |Byte 0 (validity bitmap) | Bytes 1-7 | + |-------------------------|-----------------------| + |00001010 | 0 (padding) | + + * Value buffer: + + |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | + |-------------|-------------|-------------|-------------|-------------|--------------| + | unspecified | 1.2 | unspecified | 3.4 | unspecified | unspecified | + + * u2 (`List`) + * Length: 6, Null count: 4 + * Null bitmap buffer: + + | Byte 0 (validity bitmap) | Bytes 1-7 | + |--------------------------|-----------------------| + | 00100100 | 0 (padding) | + + * Offsets buffer (int32) + + | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | Bytes 24-27 | + |------------|-------------|-------------|-------------|-------------|-------------|-------------| + | 0 | 0 | 0 | 3 | 3 | 3 | 7 | + + * Values array (char array): + * Length: 7, Null count: 0 + * Null bitmap buffer: Not required + + | Bytes 0-7 | + |------------| + | joemark | +``` Note that nested types in a sparse union must be internally consistent -(e.g. see the List in the diagram), i.e. random access at any index j yields -the correct value. +(e.g. see the List in the diagram), i.e. random access at any index j +on any child array will not cause an error. +In other words, the array for the nested type must be valid if it is +reinterpreted as a non-nested array. + +Similar to structs, a particular child array may have a non-null slot +even if the null bitmap of the parent union array indicates the slot is +null. Additionally, a child array may have a non-null slot even if +the the types array indicates that a slot contains a different type at the index. 
## References Drill docs https://drill.apache.org/docs/value-vectors/ -[1]: https://en.wikipedia.org/wiki/Bit_numbering \ No newline at end of file +[1]: https://en.wikipedia.org/wiki/Bit_numbering diff --git a/format/diagrams/layout-dense-union.png b/format/diagrams/layout-dense-union.png deleted file mode 100644 index 5f1f3811bf0056defe19abf494afcaba1bedbb77..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 47999 zcmeFZbx>W~*DVMH2n2TxF2OB8a0%|gHMj+L2n2V6yF-8wT!RG-8k``(-Q6Wvuan&S z``&x?y1J^XySo4QZk^nO&E9M8z1CcFjycAhCrn8}66G1;Gbku16lp0jWhf|^Z78TG zIS4S|$g&u=Blrv2Nm)`9s(hGu2mF9&FQw%K1%*ou`2#JjOmz%Sn6y;Ybk>xY<2AOk zVKy|eGcsj%x3LFjLqYMo^MZfcm^vGhx!YLVI`O&-Q2aTA7yKJ?n1zDu&neDU0u-9^ zN@TC?98JkMnO`!qQV2dHBO~K?G%@2<7L)kra`2M?g@v=TJueH3o0}W68wazUqd5y3 z4-XFuD?1B2I}c z$No8=p9S*b|HDN5o#j7A!88j#<7fHLk_kSmTw`B=f)a+3786l*hu+WhSjHKDd@Nx6 zfr%L=6;bv{oa$DcJ}z4Eh1HK%FO#L3{JvIdMCmEUTsRDI@$|lOWvV#zT&cF+Uwc

$Tz|>{&V05 zHCYFBqf=qlzjyub$N0ZF8#toNooGV*XM-F^O?%}+57&93yg3r$!mwgFFA-GfA`LCD zDPn@#^?mL+ZRcMNRLrZ5`YrWYt&;H=rF{knb zVp2*?5Cc1$e`n%AVSn)Tm-pEa_u1x<)o6}L#_xhw$c7O9j?mXbZHV>xkS19;l8`fQ z&f`KCGQ5$fFky0vTwoLVzZPEk4p^u9kQC3~`~N*hR-7#r!G-cO?As3)lRM*ua>eO2JtuXO@`S1i^$omew=Rw4JwbifkukR+G z^@Q{crg2(zg<_0d&}Qp`kqjKV|GXDFLg_>{LISVflF#G9Jz?ARYJ_U1z-lNZ z<6OF3>(1#i=P*SG3XufW_x{T{yQ6nC?b=ploQB_FHM(H%^u4NB()B1_Upszo^15CL zDN!rIEj78A)M(qU9(cGrQ`GMekHFR1l!zpZ4ZKQYsA#U1!FlYJ?u~jbeVRCQce>!V zbTCnbWn=nQj0a*dWTG(I^v_!Huqk0yV9_m(^+WbW*}d15yvbgpzpb+-I67r%yZfnw z90slEb~xhro+BUe4C!dG<<_D1dA;+Fa`i#OTGT?53*TO8jZ&==laVC&C6Ss&M07BPnca$DL^A6C;ARi~@4;D$;)}Ll_3* znMkn=?5Z>wX5_#@{nFbSGkNcDBAePVuz)7dqAOZ|7Xu@_yxpIvF;80|b)7a$V#10< zYQi1($Q)8xKO2TctIgf#D~oRzr?IvwhNa#h_d_h?_F~U`QoUFeG24BydZ+WVX>J0J zzFo^jOzKa(H^mSyh8^1e4z5^ZGF;dTL5!jy$7tniOE<2L0psfPSy`4AlM1Ki!}n_+ zRtDklY8k&;{i@>y=5MJlpzD1(V{TPi1y3;An)Bd5kg;4n%)6Ho3vO| z?ZPax1M^no2#?pCO+7i$8?UWQH#18bO~w1)7h2*JB-H%{$!`eRD?ae($Le^{4NZe?!fTFHr)!8Qh8;cx-X_@=Z*+|#G|?`Zad+5nPxmW3Q3IZ$xkm6YyL6&(*V?Bo{uaX>W@uOqXF%7o ze_)91j#Y)YoKFbXZp5mpLKEc(T^Uf{LaSm>4ybS&GS*Tguo0LN=}*_9h3fJT>gJf$ zvTW)XaC2e5KI_$DdD}cqt5{habYEI#bMr2z1Q%;oM=YrXtZw60R-RckdG+1aQQPih z$?fh~0c3epv7s$>>BF06BcDUNuIBHIi#&avK%Y2{C2eH)rWiT3*Eov9?NE21?P5x& zkUsRTYUgvh_AYI<*Xn?Th!f9R#MaLynf6~g*-3EHK{eH5)yEb2Nn{d`O{q!-85+Gc zjTsaAs_wJEICUV`egb+2PSWUIP^d{wbbK}|gY-CTQO2wk{7YubYSXxv_RH-m3OwDh zGiK>FJyFkf3KcTt`st(p7*eY5%Rf_1?+f+KZrZ{ag>Q#y8ZCKU#{+SbQ5X3=HW!i3 zy9uHr*9Hu)JTI;H{n@_jRhvlHsnNCl88Z#US5YkW z`zo^KDL(Iqn=RGnj?%+#WQsH3B742NcipTcIyPBt*w~Q18Ei?>q%fV~>&`_I^TcsH zlk>V+2_a?eE7PfG&N?Jc8njYT_+t{pz$Ai!+I>y$$NMY98kZN570LYH4zTpmbD100F>#}98&=@? zMCe9-tNjTgj+6k5aNE)+vceG0)SUDXja!;v2jaU9EqiqL_%0orY}g?&40vdV=^ zF3ftr%RHj5#GI?az35KKpbl$4@eO#bk0b(VYDCzgYrWAzzqj)+bW69mm%M6jT5ASG z@Ak0tU#a=3RAA_vTq`5Y7OMbXiv<%~7G?f3E z9ZyO3M{@%hlt+ETSdN;{P;K>q?}rnfZ*N#ogtO63{DZclfk6tAH8Ere_g=@@)o=q5K zO;S`;ehkX@&!4Qd9Pf!F;(J3glJ<8E;-tVDW17o5ifUiSyv| zzd3`fqd<=KZRfw2P+`U_$8~9`WJ6*Z#2d*0C`&gU_N(%r17u+UX;3*#O_)P09^wa8 zfSvX`uX0-db3lp$>;mZw&I_%7?%JOKw%Y#(kKsozy~Ql-u79&TCUGlB-re(C$P++WU7N_<>i z0&a8{pot<7C^LCopyt5jgUf6(U8ba$ekahWK%z{r+zCJ_6#yAVK0a817#@Q9{AaQn zh=sy0VbymPfx#xh!lMwX0`RhY)T~?}gQw>_2I4wK$XlVSd8ZDN;GXl{NrqEY%E9}~ ztEF%p2KJ(6o4LB&4>f&5U?&_ecjR6$zHM$WAIZ|O`Kp$-`t7Kd&q&w8K|I_SK}-tZ zxR+=Y)M3Iw7!6m8o^PNu&n)1sA;B_I-}B(>)sj#4(jyA@e0#}x>+NAl~pMi*2yl#hgyCoIn)kOUOsK%fH5vQ0vJ=uZ&7eCm1C=qGCG6bgFUvi z3-UCj&>kGrh!YnuA+&=J0wzvDgPwaSp%wceqn>rFvUETCZzoDb}(u_Z&vmH40 zd?a9g3pSviC^|?!5sHC7bfopItOoPdpH3cBL4iw2;&m*7N5_E|Lh&fP(Vtir_inP- zRA&y%s+ssSU!slH1|Bps4;7Rkx90T9nI!jDCSs{gM0 ztx*^6hu_Y>?YDD7(k2loVP;2jUjwgBDKPa;u-e^)q z3@)#)c1waK)%_sJjp6hYEfp$)MAmv&-vaX&ku0s(-@VJf9n__Bcfu1{8=R1OU*~dX z9S7mPwUCDFR=8`tHDrCRZzgRYOTw-3?}E)D_$Re&r+UQKf0{)*ZB`e-4S->owLxET+bv~{r zq=i=5bnFWE!S|p~;JE$C$=Y|>1?8Epy`jzT;PF0{giASdFw@!kf$fU5gEA8oRR)5d zg;pgY`5>+-7j%_k@0?*R;o-=oXnc6>3PlU_yW=Dj6qUy@Nm2omK6GUVi7mY_q@kmT zuGh==S4)cP5=1r=AXw`cGegXi7I^f&*kVzPzD(bEQFkBDG7&9_m70eUp~o(ngnG&& z7i&7fn{_O0M{%z*NHl*T^up*QwSnhkQYl0BZY0b5_fDT?^L6Z?(x*yviFL38BSRAH z+U_swKvsLRnNGQn9N}OQAl$D_@|*sp&gRlaP5>$)t3f-ij&-buj&X04z{P~hkCBgL zCnjoKXX*(c4!6q+`$r@Pf2UCDfX%@6Cwgh8;qEa9|B0cz0cOOKYK5PvWm&mIy-dfh z1uSoCDZ$DYraJ!t;9exqJT}&XNu5ZETn=Wx9s#5Zz>U4-XV$15J-3C8@2{FTw-~^7 zs9$hpUWA)(N{k?9^}0UJ>C|86cc!h~8%h=Yhzdd0#%;Vahpo4^A6f&txZM57F>Ie4Pb1IX0S8Ut^aD8;ed~VP-TUrDQUfP=kgWUX z#^-d~)9-EsUeI#Mr~q*eI)?SeR_*9{c{tO7<_+m^e9uN*PWU&e)Sq-2aar%}jYLSux6gJAnbs+a^L)Yo^MH)Rrz z)9B@I;J&ehy*m6Szpl)WHc~LtX+d+j*Tmt>n%A6$>DoKk7_xRX5fP>6zu%$8Ew;wO zrD55bq}A-%2kH3dr%ji$RuxUV^cuMkhCrMQUsUiP$ymf3nXYScDevO&%jJ_vV&9t4ggG@(0IzPK0NX4G5b-5Q7`vZd%% 
z?vnf*jyD_0x9r=>0C0?gg~Lv(+{N=_M6k!qZQ;`jZohs~1b5|-&^s{yd#~cCEc64n z>yF8DyW;OW=`dwdy1X>WvHh7K*T{;jAgbZyei)g3u6~r?3G_y_C232ir z5JH$2QQFmULdbQXT##ha4DL-a*A(y^58s^??R8F84NMmafZv#j`GeEt z0EQdhR#k%iiDB>Qk&#-)&0TKCQ`$?k_FdrpxxFFGFR(F^bZmDb$?y6ZDm3M;4i_@t z^(Ge*Gjq?|zb5+ZX6z*R#O*Va!iCXS@()cm(;&)L#LJvn%9;WP@-I%OhAAOimQ(h! zm3nbGZ5&JFuG4iqwF92gNBbEGH=&@;f80=FPLqtF8XRuhKr zupy7rS5fWAXf$3?(9-4Oa3=!I{U<#lwCZe8n z@#wt2Rl19vUh#R^OBYQ1eZD|8`Gw@Qo1QhY1kp-a68Ri;FR$aiudGoUo6jM8TsuH9 z1H6QVRF1_MQERqs_pZ*Wj6WS<#;#A+X2RhFre;f-ntL2fxQi12xMnoVpnc{zVN{vG9gOAiFt_}8 z?*phWPnEo9PtopAxOP$U434>Wt>Zn0Jl9%Q02WkvLfjnN%E}&owGfTHEsz}=5Tomz z&u!;}uA367UZ$P8#3TEibm1cbhb{)5WId3 zr+^$SygHP@z@?lw>~0>=utdSye@$VuJ5`26KzuxR1I0gK4VL|Ix{A_PWB$doclXb@ z8dEbI#5ZQEU17x9VkBbE)a{j2I4ys6;e>IuD^O=|a4NXx*@%7(6Xrqx(sk5K+*8CB z4$?^jwjk4p?g+d*{AC2AiYb#sz16L-gaC5d67meuk7WLFzR7HbnB~=s1fzSEyM9P~ zxRQii)+y4l>zf@?a2zJYy}8cN*m2o&PV3Jnz6T*`%^7uv5Zv5sW&dbASn}z6x!6r< z7i-bE{Y8^i^gLf%{a%wPyv}=)28J4vtN&1}f`0)+fzAR$+&^8RFZ4T5=^G$odP|&k zBN4i?e3QbBgv57qK%}$8sFeS;Uz#+btc_JNh8+g{KZTz<2ZTd_g1+oD_K^3LSA?Me zfa(cwSUkci7?4D*FAa)D5})WfNP-f&TpR)0Qe7?z2U0$D9eFP1CZs>fA3p4HTnHbj z_V?TqwXA$%9Np zQ!$#C_FqPMnVO6Uj|BySupt8j(pet7V`-seCf46v6&m9hWS*-9x&P`Pp~9sg^9;K< z|KbFBlRuxgAk%DqiTzKm3Su26c%iB+5*LMkM*S}&^6w>8G=P_w8qXUB`vHUk@Nq{3 z9smMX$Y7{t4KDX*knot1QSBny9{o?kF~Fd-&>q8)Tg_iID2~x!07tD;ZG!klKJDd< zMPB4Xta>h91TIq=kJIP9?;x#tTc-UL5dFzM+;i47^lC=umJ@lB(IA6ba+!o<2F$Eg zMaz^PJ$M{+Cm6Q(rCk{bO|&pLq6Rmo8^rcbAa?5P0Z;pGuFfXtoT_%L9&8iw0vD<7Ayr!$I!9mE{>_dk-S5Gf6<)QxbPywS1sD#B-*}b1$3&A6PW4V0U(vy%*pX zG$8)x)qKFhtO`l)8qQ$2pN~l@@@ypNd*1}bR)B==m#)(g2(7OGxMgNgE5=qN22*mW zz<02IFbBN*0R2zEVTTd(IGO|@F)y~#L6E;L@DDN&{S2%SU{h$mzO-As1JpGL=@Daq zeW=yYbqI2~z4)cCX!JMITbGp8iK<&omCj&D5_iG^jdr^ppY6ttfayV_;mF@E|p`w^u zVbnW1siB`adcny}3~>RXh!#aJup=nM$xdM3dYy>JYg<#{-{E>lS-dX?FpyBsJky1B z9I-Cy2KBO;=BqVOzMejJMvt2ZPx-b)or&EC&&pIW?)m7TjKsDN)oZ>GMw=ytKY){67mUBfhsqhJ@s3<*&-T?> z0sQxuCnvq6SffC>7N&z}$uc!XRiW)d2nO-D-9Vzczxb81jXMjCTy}Tc=JQ30{sUi)u=!UneRw-)j84nUqm&6zN3i5i*q zc6&Z)SUniceZr<$nVHYn!XFWGwlpgZpl{!Q zR>fS5!hgmZK|oN&Sj#ei1jOSh@hYI%QQ-{Fr5cSfjN^YYz9D06jnXB4@d zt)AOZQ;_TgJu^GLeG#+JW|M-?nsNXfkB{$oEaPG#ikZqsFw2-Awn(p*7L8EEK*1oe z?VrpTMKEnMWUT^YVEbiR2&d(9f1zi>x~#0{4HBif8ONpx0}&(LVg{l#`B?C#VGzFC zLFF?lZ2*+*XlWi}a^U4PNS2TySql09&yt$?EZ{wmu@*+z`Jc5v$6h|olhgw0SSd;J zZt(^DAkY=cOS#nyf$Lg_P_FZhDkH}7IpE5vHb754s&cVFZ#;5+(>tOcb_(>+F>9W&wekr{JtL{ae_6Nga1E|+J3On!ALnh z7Qx)K4fZ)oPx9MAfN-sq8sZ)Doc#Mdk6J=%Uf~^~R8pOTL{IDEz(@ho+I@HW5|OL4 z6jNQ{Y>GYphhGMs{7=89Bq`!9 zJf<7Z+W<#tZ@?~i@mizQv6ypQR0`+SMXxCmT;(+npetIaV>Jemx68jBI;P%)yvv4NReg%9jso_^yVDRg%PG9g;~j@+^Y*hE|{aqF)K6Ag<4eWMy$;oI)nye>Khf6mtQr9CimLoaK|R&m=YTa^9~+ za)SZ^ukp`7ykC{Dzpj-T>`bIXPrzb8{DLNK>%c?Rxl zxWbvbz)`a~;)xP2n$sDeN!u{yqAr3l%KucgMYv@uHZYZcK zB}TTr*y5>KzP*Mn7DEJ5gt=n15@lr&Lb~d1t;d*hw zae#?qANKSa180KiBgYhfBoVjkC%ItG^&}Lv@c`g{LB6Bn)fYqFOY9_Pd+{X!Oh6rE zL&(bAK%n&UBZ4kiN}yj;`dX)L-a&Uj7@#DO9DLd+ngx-(Kgc&#_E)){j*sP}h<209p6k+wD-XBG4V|z3Kp`^`&XIxl zND%*n%aD5!fd7X>lBtu_ z67m0xOAOrD4~&fP|6~N%*&ty?wo*(kCej< z;&C@evfC`iJ^?t&4Juk3)_NWvuG)ZP@mz%Hjq1sadG?sVyHBz}JM1g#2N=b1b1*dl zustpZR5|mzVLnBcxp$XbD!VvtCpxAJlq~dOIM8WpUHW~VfK~DhFSnj?wXZo zGI@TwG2nB1YX{N+Eu2BxGXPJ0+M%E3d+fP0xT#|3HlI8L%PX}Ox5@U23gZht3`AA~ zp3c$}2u9W*V*x0*641zxAbkLooCl~y3f+>uUyoNT*Qi(&yxmt})B9LEiNZC#3~c*o zyl@0ir;tD?sqN-&O1ysuWSf{^Q7Zu1&_=K=_ZuA6c!8oDDp3X`65dE4I1H;5tBUAz z8uuVSMAmYu(ZCcyC^Nc|@}4AJ7?7D7X+}`R_XC_=L3QsunDL{QtGCv=+w*`FQZ>|L6z=4FH}OKa3`d`OK>16r z+&jyOFBrU8kjjguuOQ=UmC^%NABZ{Xx~;?*j_nst6(jSmO7NiNY7w`c3U^Xg+*Q@a zTzDBwOm!mJzB@qR-8mC+BG3Cl4u32J9^s)^lpI3}7TO%G3Z!c`fS3&1cA9ZLjTU-n 
z03@Z&_5S(dyOOeg+#q;nm^T=s%5PMu)S?mY0@?vmegF^v!{w(mZIHJKi&izP&Qo>+ z_c=gZ`Cuz}vKe(D$o5WC+w@GbBD8EI>5TSQiU-~I>~P@mxZGWzOz*)Gflz1#vR?m_ zT(&;@``fE6eI*vnN~lEmIAWrYI$z%b(ZTbp?E-JDDjQ{$?afYMwgpDOJOCdrvOmv2 zb<1-)Z;wU`{npGTvhh6G zRP_$rB?qIB(y+aBZdR~U<#+>V)^J#f<$65au^2To#w^oB3B^a_7SuyFUnG3>O07NXX*r; zx#Z8Mq+Y&H`2mE}cbQZxC&Sf9(P@;0gYe+Bs@QkZASE)AfubG)snu|+#qZvS6_DZO z@4Q%=xh%8uAo*}U_A)ghyGy}f@!KIwBm(9vcP(CX;*7R9{cX4+-$5-1c>c)(t+Uet zB-*fVM($_cT2!xSc8kA%VkS$ZtynDO%E`e8(qvA{_Uf5Sm+uWJvTF8;Cm@%i`N>p` z#N#k3i50vDK#|b-=<5?xJnc^EWL0H_?2a3GcZ5JQ|g;l|!gc-wnQEQU_TRev!QLNlSq_=(7R{+L6vr>avU>dS9SK0;G z-HY-^vYm>@QA`RPPB?qhf?PO3720p)fsfIxU`zLF;3NZ5f!ZjNWihvp%2#fgA#3wz zEn}a6I!B%4wN=%f!j9>)k1rFiIc9 zZCpFWGDI6h`f9BEOhCqyQ1E%IVY%K76kCwb0d^h2aqBqb8!}9nXt4bH;%0RPv}cH= ze4_$OgAV&}rR(I@2#G>$RK8^&JqkKY{jMYl+nT047{zT{E~Wsqs5T31`?}ho#em8^ z)mKVB$Qbo%HPj$4#h2J^-l5g~EXOW3lv5q?6#6OEjicjE;*Eb1GdvR#g+C&-7)AB6 zFq-uDb)0j88s6Jmp*!=9HHz;}%w$PF>m~5cTQ6b+;fe(Jyg{LrBX89ODHu&F znX0rGVs}B`9U`b+^$9K(fqf?H8?E8xHwupnKP_j{x)V%cfy(Yxf#-@GI~8@NH6BXV z0Fezu5?|++-7uWZROje+G{dV{Wjy{~Jb7D|B8~ab#Qsg95kTZ`*Qbte6DLPfWf!TE zevHPVzos_|7k^x)vn0=^CAHLM@TP z8T%4K8RgJJaA0-T%Z?Zgf%rOS5G-e;WLEhTFAZzlY_;nB$!DPA#%toBJN1!=yyC;R z3r6mI5tE_Zi?b4nXdQTSR$QS=58Z{A;I*IsdfLb%0zZu9t>80BQ4kBM@YTW)umhna zg~5K=P>rlX$xi}A33h%eta>)%i7=&BkgZZ85cFyo zVI5Cyv}T#m-MNiEZ% z@(EBZq@KzXhin~C|9(`cpcuV#PE{JC$LY$^mxn!jW?qSds4%sN@2HvMySl0x0m$N& zgX7D|@dAV=i`B{qxGrGl)>03NU0|t+j7Gm!>IW|rHykdHk(P?EML{vpTpJokyrO_h zA~3sw7T}60WDol)j{EMIAhdz^19VqTETv>M`3x+E*OqZi#3TlTI-p={SZc0*-V}CV z%!AaaMH=qex3XaXyY(!vq+DLLRpl?Y;AvCMNqsSzjzD2a|2gE78OEqms9^fJx=j8? zyd5+UM4*K-i<1$;*s?^=ecLO;>#eZ7{C+&zt3`h^MzO58prf?%61GyeM~`TGPyjIv ze~M1@j88|KzdOhbA@V$og?K&y4l_aoP{+~*oxH$99obSJDy21}%(DHz*Z~B}ugxW3m7{c}w^#Lf8b>e&5gqZiy5GYWySu zOYQFt83mh<9-v%Emyd%*-ibP?%gnMc&r#6zgR%2B4|&pMS$Q7% z*=BJgpEX>0|CvwW4*s&oF;BfV9iuQ0cXV}jelh6P$-#s&2Ee0yr$`Y{avk;+OOk7# zQ$&R%2vG9kVg{42^HnX@rN{K^GiCjl@EZ^$JE}c8I=#d31%=mn8$FG-Q|cqf_52o7 zR89OSP6_f}m<|{An#t=Vlmh|SWv?*WGy~Z+<((4N&4Ys zniE%-AeiyZI?}Rsz^)o@-On}!GmOB55WKCVGfuo^VO`!BNxXn}7iqe}Eukf52O=e& zU2tFLbapWBa(%FrqvU$NCE*#;)>yW6Gt!V*w@zTLp@V45D@zETUM=g|!ww?>qNWNeBPc=1tG^%OH6#4?u@% z!W&(O^11g*W=ed;NBuS!sx^;ho!#>)7t*xfS1b!!CEH&dYcq#(<+td!)Kv31HwWQyer& zA%oSZt!kQ@Y`r%V2I0(_E8WbAWQk_YKx+KOmh1xp`s?I?XuUlg&sZ9T45^Q2U-qos z^sJ>-Om)Uw~}uJuGR{qQJp;&DPq>4ULP>&tBjhfojnCr(3kA5S3} zzJhN7>7pY3+v5*y0PyqVEeu-q@zAn}XA~NG`w^+gp*xn-*1?^ejC?|#B4VsQwv|v? zL%SU-HE<=^*^Sze?w&uNzF3g@|D_H7(i*}ifNP)=bovUE z+`pVf3?;br|Bq;n|5qo5k;@NgH4x#g+g9!1FW-Xg3+4z$SS^u_lGS0g3owthwhO-# zHQt74{hE0GXCnRf5dcFImF!3xD-5bib}m5`DJt(aSs06ShRENlVo0GqY7j)g{h2Qf z9P$^HLR3oU)d<{Z8P`@$R^SZcXp}5L(Id5hFp2`~gODE})gj$lAkDuAObp{YpuD#2 z7MCb>hQ|IOkA#)MO{M$gin7Phg}pqM{hp%hgIZzPk~gceFHO`UASYo5foK0@piap~ z>3hv?F-q2z4C*%fn`lm`%~bLaAw-t&EAV71y4X6Xut`ih!Ph?v6-mQbHdq%3^T01H zA+?%>XJ6i|D=dR20dh*`bd|9ljyWhF&U9PcVaEAuNTF!Z2lT-DzR-5lR2p@MlfMpr zX)`+^%><@{2q2pLB%_hQS%^LReMD6K@v4wLTzCil8qh+ZOt}Fg? zAP5(5NJWZ4%vJ_spvoC)k#wBUVz{-8K$gqmv`X5D>x)ZEQU)i8kwH2YyKH49a$zup zO$hkiS^2`uMzU<=rA+A1->Lm+f(!!MM}fVynv*WV9nh_4szJGJ?{BJDYG$WA??01} z1Y2A3Le`)^OBiM-ojaS$Ce1~cLKcQrNe`mk`#oaB^lbPrV$-X^0P)?#B9sxRIlSvJ zuVq|P2$LZ*od79S*kB#l5R?#00^FM2)}Q$oeJotax)iy+gQMdbz zvp8k`*@iC=!QPcIBIHE)_W*?U{O>dVdp`c|E?^&OGDg@QD64>Seg4v7l)eN~E?+-w z6rpCO63k9}40s#BqFRGCt^J<`vUM*&dnebsi7#NYzgzhp1n9l#yQ8*EVA_BMnJa#l z-~*~T6oDEK7{FREQnveQ7~DJR(|1U)xP%kLvvWYUXL|WIR+@^{Y!Kf&1(Za{gB}3? 
z`$w>SfN*WR3Y3dRkZr?qbn$Q2j0mnBNYFyY1DZ(CC6}$8G$M1!P?mrvXqwEe*@)By z+^-(k;n{$t`OHnH<_^jXih-zFtlFJS;s*$&<9Vl%dqBJmgSz@s*kr95v*&<_2J)kN z4B`ozz_}7@tEiCz$S^wso{qnKR7hpt0Q^RIjIGI{g-C}t@Qr}rdv<`V7zXIk5E?p6 z7zXvyWx6IH)ByJI0}$oInXm-OL1kU|vM5>wXki=mKCOYM!=RtS`DeS|lhF5}vVmRo0tl2(D_6`6wHk&`XDSeY?fL@(=;|sE&g(s-5 zmH}N|PvJEQ>7j(=zM%7my&@lh2ao0L9pLZ21^>P}^4TpazQxANY zcOI7qbRQc^8KCw6RqTnOJ(!D_KjQOvy9AIBptvrEU#)!vR4;dkjL}<@c3)_kucM0I z%gr^Qlc~rc8xZiF9Q62hY#`gZ2TYxE0WQT`|AEIZS&sd50M;!&mdi2(_UX2Hf9Rjg!1o4s%p}GkzGQdb5HMgT@s|H$hRsn>>lv!N?-mmC-J(m+lu77V3A$ z4~Nj0I6gEFn0_zrdWh`0)a~hDrscV2Oh5o!oZ$h^Q>oM47#kJS9njIB%(>sdYu*S# zptlZq#e*%v<4O27jK(Do0iYvPrPN>2T(+FC60Tp>@R!h&OK|92O2^O z=$3$@^Rh^Wo*D)ei3lz+%wvK^Eo1ey?#HJ&tR=se=6kC)H3V=m_5zTk_>H&pj@sXwE-|qtGrr!i* z?ror*ar-A|ztk>PM#?j${tY-dhkiX0BVe|@gjS^(C5;HK68^F=(x2IyHGqJUY-zi} zUf(Hh)G?|&wpP%v8I1Z|CJ$6^(P@v*a3wHS z&O|Xc-axu}2sq5T+8!?>K4%x)--k`v1F^-6-Owr;2lkTrBw0Hpp}`_|&4;ukgd!%q z9W-V)S&UIA-)MS(O67$#hvK`%>mTv7a~nY0-HMF1Nx5fNXb(T*5O6Rs=3B#X{h0U_ zR6F=R1D()sz*a}FifX1er;ASEbUU2y+ly8V+71IfM8@9~SMMS&G2{PMzO9S;a(XiK zNFb6xXY#oXH{)`=TVn`$K9QPSJe^@Jkn?1F1gJ%NRRYrE?=d?*6DS~3LCUlkF{`pe z;W%9jQpq(6WD-|-US8z^rKLuz=L0B{S(_P=dcToh`|9ZUG)h%g9Aw{*0Z?dZAgYGE zXsz^E)tm*kFR~7*hY$^{qu<3IIDz!ufqKT>pB9sO#z>{*y$jwEjBFBPBVgbgy5yzP z1UlXE45VRdUJZutKpo0m&O0gYYZ_nYgA%x*`E3LPJ&~=M{LM%?oVqKk7hhDxL(Q=$ zJ+B=o*g-?c=c1h@NYRteKoZk)?Z!B!cd*Xpcur8AjjIg)!3U)IL<`Ux68x~X>> z4dyvV1Ov2Lw=$2 zx^_F#&5(*@p#4)o#5w#l$q#?4ZQHcVgSi^$-gF*RLov3uTwE`1lUY*acza|APV<`j zsMKyUhMCQGs&*f-|I#g5DtIH#(h?>rtc>cySPSIm&>`SzbqFQJRP>}TXw!Wf32 zGNb!n)+{qU03TZ5&>14U#1NZ_i;D zl>?oax669(mf7f%bsfXF1qD|!I%d%aYeLhpotgSTL08LHjo>IHdl7W!Q|7oAbMlRo zJ5PuNDt#p~r4^MCnZb-5SJeY0jGZ{)DL;LrE9h2#N&QyC+H!M&p#jntcX-n6J~r7n z7uF73Dd}%x5rKgDWye#h73hdI>qB8Y0&{0IK*P03l$&`M z0M_*zXGk749%(j|LdOoIFml+K_d|uFciSg+Hki7nR*A1^IdIwK@1w2WiLaX7AYfwvav4}i8LDRAvcXKxLi}w~5u({H%IM=H+O$p0! zIOP`XlGF_ig(R5VJ!fOi%x&7BEd2SP*nYHr&INjU|6|YnnhzuMhcce9tPv6K@H@lw z)gi3LW&h22-Y}R{I5Rt)KrPSeviHp-MqidrFSv%NX`kS~&nG5gK`Zr<>si9nMDk!T zq?;w^G?G}8uV!7U?g8BnfI?emU)LR-j7bm%-sd?kQ*hKyd#{Lnb1BTU94IDq>QKdt zPh4KLrWw3xDKSa(Gr5<8G<&}8ZRMdwV(+;|VlK%*7+8kI8-M0R$!21wMTK9OVFuTq zE_S`07tJNu$1TOvs#=3b^;+pMZSHJF!-pyvQ^uG7yrJHnZf}i`zkaE&`y7u&8@uKj ziVV1s*X`voUo{^ncK8}9ztk7cYlK#8f`UnHIZIFVMJSSQy4l9d&8=?t%9mT1UKpZT-C(u`=J3}!VJjY@t z2A&FEth@^KjC*6;$eIQBPrOziytC}h z>8yeN9E;6HYa$N5rDze?6&70Dky&$=e_?wcD^*s3P^PV*4=0SO9hVh1YgF36&{utb zGu>`8GvuTnlep~lsd8iq>?+%U)2#fG{XF5PtB&qwV|G)X z8K-rIxEoW4tdZ2KLIfV_1~Fs;X}CS{`9kH2vt29)BJql-!p7SboN@<2uQEq#_KZz3 zHK)ZTpJM(3>}dID74{hk-_bEcqQKm2CT}NFM+KK*xYv6uKcw80KDpsZV@UK%5X~P~ zPVs|e1ayalVQ01jz>Dy*$3sX>? 
zxN%Cw;M>(GBBQP<%}&Si*Xzrx(;fiQ%1GZArpTjqCq9{ zS2d-fI`oaZ(<_$7-w!Z}K8#0?c%kKwsK^wxuVC0wp4!GCN5!gQ>F%edICU5d6%ei| zq(n^B=+ZQ}ORp^lmc3@=&C95ylX%hFgT$b~Usn~lzVJ~#Q~D?$nLzt=bixv zg-enA?Kf3qPJh{-%$$wT=0@8e>>G)qM??jA;`o|YuXO!o&j&LLkkW*?+zez0J`k~$ zgT{ZYN`!+=(Eo@_cvLrNutVIWfUR9DM=_R{gWgjQyS91sv1S!K?acB!-UQr( z!k23IC3&L33@2m;vFo@-5b0;+A|hEs^ptRh{p=lfxFRs(6S5H>#+$8#MHt;HR)~UI zP;JdoJes|y{lzf}I|VoZi@(R8&Ncr+-D52lt1NwkWj=E2cYF@$S{4`5BIYwntd?3A zjQ(UtGeEdN;2pHMXF_W=$~Uzi@L}DpPhr!5pN)TKFfKUz1xdE(S%Oo3%@5A_P$g_; z#G>w-pF@oX7FY#%T0kWwO1O&eM2Syk@YEV2>lDXjZhs0-(-bOWJHk;NQQruxPzn8$ zw9J`#w(<*$7LPcq<<(3|{cF(i&}r=}CD>!HysNwr84|^G4)3`hD;bPNdU(pP3`jhO z7oQNxa6thkmO369S-mZ$_ATf*K6dRUJ;Z*^A%@@B0Z(*{qY%WO`C2^u%d8_Ve#b8; zA%i-F{C5cd2YYWFRn;2Bivohu(uhH~bV*2ugrIavmw-r@AR!@*G)Si)-5^rZ9nv6O zBB-Q*z?=K%Iq$vu?;GQed*8TYoH5Q2*lVw~_S)Y!=WqT(BPG+a;Ks(rSdX!vB5z9f z%8P?dBWH(3q-%U=+Yc5U^Pep5j+u2M7T9Pc-H4wS-(TQ7P;pF)t2@4?q~N7)lpTn~ z>A3!oG~Ft5% z&NXJ-nh#M=xiwTxevEYe1J~uJ+x_%Rn=K5n*tU1=uaPJF|6XFz(TY#I@9tp97urtJ zN@iN#jCAgu-2aAFVC9x4GwR*dlg&eGxvLvVoZ(%ShQf(tHy;h^{?vGV-=gwwnWeS>(j8Y%+Zbjez;foJJdvO67?RzF~=KEMj;$;4-0$S z{X2K;4*jp!(7y(R4reCwEUT@{>2;DJXs%Bn&#TIRvsZh^(b- zGZCeuGd5vw4%x)pFZ=O+R9MsF&~=6CB-yf04X ztWvr^ZO6nZkW)7r*q8Y5Ebc0E**U#9eF2U-C1tu37i-uV2PgZF))eNeyZf$uGV&N~ z5V7v}%-9$DnX2_>_WBjp+A;2>nLu`%+vP%NPl5OKbg}{8Ly$)!q$CERXb8eytO%Y& zns^4Id=w$v{-ZT07!}cSei^DBn#TRq{%BBuLp11*UVgpioqp76qPS}!?g7%8wcv(d zlZV3xQ`g{Qe(Q;{Ik^}0>UF&;(|*p^Y<_;UpLv>J&QK2~4vb8`a@c2HDg0k*pD*wf z3$+G(y8UBg^`bG~R@+{vJo;3Ch>$_B!gKNppswbU(Oz^z4`} z&IwV|=sPjkz6D6F1q%?dQMlU~C9O^jOfv!+ ze$p%{b+a9^Fra&**MLhpV9*Ft3pCPQ?qdrDj@@X6JC8o9f5A&(*$KfjJnmu+T}^MD zd{?^h2A$1mw8!rYGQs+x&j;4~-nyRe7A5S{w)6v-lTzygiB}hRD4Lo$Nv_%0s>bY` z#cBMKIj;A#GoLgRFI314?X5SomE4avR=_2>Zo7+4dGLZ;ihPiSrB~VDdU0(d_b}0@ zutmx&7(wxTboL8%E%Sf196_kpC{P}~jF+E&a66OZV76nakg=@fO479ouV$SAXpNWJ3G+S5-(H)J_>Q0$Nf&;AiKpE8H%Q%4b_HIDQ%&z{O5e z`*@f3mcH;8zmxmv9}ib4*qS=mE*oE+z2sPP39f%vd|_8=x_gG+ryAB_rxYU2@4(5f zR>QQH<9IFK%n$zq$;6~GR%JlkwXE6gbh+vK`(-bOrbik#%9xhaREo6xR3y?A9}8gG zk=ULbW?es)dzz#c@yfgCXF*T`PoT~3Oq(AUZK~=jzJ{vKOnYu=)}Ec250=)S&QvkQ zY!aHORKupOMr?j#$PX(t(EM>5%@mojg;pc_H07uBKq3Ar_IFo8S0ZuA`p9XHXqS#( zAAGvk@TfH2{Fm7tckt!;D|EkHN_n#;yr>86`ewgkOdYf<(m9M!uCqNf`qA>1r8f7Q z&AlPr?qg{O(bVRuxPzM&<8@L(`XY@FjT!yI6{>_I-JhY{`h>B}x8nR}s_o3A zElM~qYUb>iM^t2_1h^m}ngds?-5msl~dg6&iwu zj&6e_*YV)Tnei*^bHi<^j$;{Nt!LH$yt~?>aHuNTul%n_)~N;2n8Xy1##K3Ati^|n z(nNS>PnH>d>bDnqKcm%8}5IiYMsd}Q{eV$opb5^vzUzZ;N5i0X6d_M zS(B_5N9&3L=N20~bKg9nu8jxU!#VGb<)skJYfev-HB>ZEQTgubNq67(IIyfK9mBhp zzd~4#&RQ6(P9QrxBy~K*Y*9H(#*8?U-i&%|$ivo^A^sI?9=55~nVJ^AT1`G2bep-n zo5I}J+?GDp0-@V}TX*`10cu*fEiV%CZq}hDsF@P6|2yi$6GE9$C;Gso9E2Dw{qvs+ zh?}BA!j}J=yBZ@*Ex+YS-2vYEA8=k)8h*me6#_E7zq2reGiLc)0%zI32a%#AhMyF4 zGwT2MgMWusRDWNC=c@)|tQ_iWxBq@H1#eC8e@^EA+?l@*<9~zB|DQiI8S;gw9id`_ zs57pwwWEuSO!=tbr%Pl@)SEiI=kKuMpN#d~En_YU=+fC<4Ho zHn+@E{lkCC41f4WN};PFtd$TmVwk4K!!WvSPkraV7vUpByi^a>>Hb|p9&5qi7PC*Y z8VDzUUIg#tr{K?_9{1mIjSv(I7K8DBeR8Sss9<^cpL8-A^8U^R%=oY45lW>!X!l?K zI9>qAbsorrP;%LW5LD0WkRNjn)CPl~5{4Bu3{Ywg9o&=8M<9}W2xelZ2U`ez0{Hw? 
z!JM!fL=-TuJNc|#SMk&I#^o7@cBNk3*#q7LEMJgv_`buKiewnlc@Rnd1T_1L-|b++ zA|2gRxhLzfxr1PYnL6X%<{!S5?#Lfajtic50LWeE`ZiU*f-rY-TqrW66+SqkO2dZP z2EBFsNA8j`^i>(d>ba0L4J#O9HlLr@0^?xL>T!V-UiW333yhN*CZ7S7;w^AB%8VpQ zE5TevDTSYtg6QOUc=i-T#O~04>7;Vv>j7)O6k&IVuM2!+7)SyUT354Bt%o|eN7aSN zJ@BD@|MySr0~N=bAg5S;nDNf94|K{V*FSKYbgs^s5QmBS-Zyy4QujlrTHfR6+X)aF zW}^@TC6KP@{6p#rA-`nhTr=QByf?VERo)yF zIkPb^Y4j<#FIQ^{^eRX)SfBochD8EOc3!9Jyn$mzTB-JUdU#tk)8@BMUvUm29Bw}$ zthZR66xBAKc)}~<`6T(qYRB_YqvQym7eJeMi$YiR*AH^L+xvZbS4*)NjDn|A_ zRZL2zyYV2$#jz2!xq#DGVkQ)Kg5i>^#rF?5iRBQYc^HY_5uS2g@PY}-n4u;NM}2Pc zTCwO@;O6v+mL|2yf1S z7dD`K38qjZIM7-2js=$uSLcLP~WVA1C&+EXN&P8q=6Vy+s!A3VsmOxw`A7RvX zdj=FEFNT)Ohj0cmx_>PMC7aFXnki7``C*M-)url;qD2X9CMvYd{7?KSsNBR!q5ycyDrNxzKtv;j8#d#O6`Tr z?MF4*t_tR1KiLyHJ8oYctetZRRi1%ydBELRvESb%2OKurnJR9h6Y1Gus4E%f;bA5u z2$s4iG=PNE=KD5SJ~)9;HJ{7u7jPBu#Tos`bG^=vTCLtF*;)EhN)RO9I?_%|`yI;O zWC=pHndFS15;}!=kodH==kbD64uo7ApPVj!pCkqt_i`AwzlJpq=hfB5;9tofU`7|^ z%;Eq8a+!{{S?OGF~tg$-dCdNxJ%Yy~t z{t#t*G7L}N3UZp8A6_HIMb5#i;stub$I^9>;!Ng{XxyD*|0<%7Amqd1?Ahjz!EyzG z&%ajv;s@^1AHBM%x67<)EPwh8c3Rw6Vl64dH&h-%X`Oa@try+*Et*JH5=jqinid%( z(?H_pV<45>ZA^@LX)wlL zf$ni`ujuKk(Oh5>v?Gig^HUr`R6E^56Pd8(=0CU|ck2EEc-wHL2PV9qxWf0J2I_Hv z-8j(DWA1~-U>5ip;qR7uD>TV155~TS*Zi7`S`{h7(S?> zd7=_{P*}x#?Z4}bru}hUJQI&*wO>G!HSN~7otINaLFE_A%c^KUBz(8%u8oyFe8u5A z63o5t_ZUp^wuWVF!UBc6!Htscub6G%4LM$hTIaiMmsKV$f1cM6@R8&@{LGl8CRAAZ$ef zZ`pGAMjD__Y|r&j#{`Nk@4^F~Jgue@jISj;xU!TT6sNOIWi7ZV;Q7&PKceH8#xp&> zZ<=W^ipsaP|9N|6+0&oWqRx%Qe9FQ);l@#e+w+z?$ty)(tZ~8*@;I|3_$WS}etrK% z1=B*d1;evDwjc!Yv8pR|>=(48PJeKHoOA?5LXS;h<^K+5RIx0l+ERuRy`Dz2ZS zev?^FVDcH{^j)~5$PIJotK465=7N&fJk7Pu2<0hv{jjt+E!V_^q%h+R?WnuhsHjg~ zHv;(peNl>SJXz_KNj%hqjBmlyF<~Ew85JC}iJDu}{kp^BK4H$El^>MjU{)FszIr^e zVi+jOCnpTTwf996>gs986TCKy1&t#nq6gq1A9bzvCCEAFY2EgC3j)-sjw;3rR2kr- z3?h(ihs6aqRY*Es#P|M?U~eJ6hy~6j7&liiyLH`h)e`@v5Oswy^tK^q;_%h?4z_C? 
znK*f1KXcF4N;#$OKEGG_FE9}?b(Dm=7*1&OcEV~=50xl zuHLPO`El_}+8INuKu-4~j;UT~`iqoICk(A{IZqDoX0cRnl*I-fq)2Wj5%uH;Wgd2g zPzi^fcqIpo%|DGXNi>>*_Q zs`5tY7Nc$wQfo}DuzRI*;kt0^6-?ax|-@Rk$B*A@`07VO}oF1LdR3v4&ETkE%(WqL> zE^nVvRi<^SMhIw-lv5(}Bui0GeCpLud*#j&7@%+`Ee=zu|KIc4BJ7qIjj=E{#k~?T zAYY@t6u8hM+}K$}e!I$lBIacQI%ivm39WB7sR0f>`5WJBg1uS>Rw4<5SJ*{%_X-#; z%y6EIoFnYwVFG4$hXC3|UDN(yD_DfXHD?Ps}s>p5z@h=2n zq^|Wp=U5{GC?cdmtqQ0}GoK>D2lVd}9?Rl{SAB+sZOHw+>y79emTl7)qnH*H##BGk zFMoG^ZI|@^K`W{zbR@NS)&zu~rWJx>%DWPW}SOdK|Eyy?wk>uDqZ)r8Ul|Nfv z6G}7*Ym~#hhLMrFY{}qB_`xKY<+=q3vv5j@XN<`G%{f^Js3P*^)jc&(jbFgpU77sA zjExtifHqm8-DGANTi#B-FDami%)X)%w#hI#6DUBnTII&rf{@*7*;b<2)XnqlmtGMV zp}f%YoL{~k*cJ6wwB|C0Iik3!FECv%cBe>t*qx6$b7d)o<|rY02~hIG3Bx1KcL4}H ziX!>)XwCqfAQtLne%2A#mNr?{qm8mjdxr6DyUVN)jV6B0c)!Tn7)%6l8L$`s-H$k7gQg)st z(}N7DCF{Dh)@yrx62B8|f-Ljp=_Ef{|1IofeV1cONB#n8EoPpppNJk>O=0>Ezxv!^ z0E(x%rrlv`U{>c4U6XtX{U4v;)jA;9IoPRKUW|tFbQ3#Uo3*J5_xr1y-ZxH6dqiJg zoQOfUhL23)i%F>GdEUJm{b?!cdG`qnR{NeUqqA=(qG|ar|pu(fB%)JV5&o=8IOKB=f^kQK+?@=fFee zIih3vK$}qaU$hN%6t{1d$KXI?LJ_9D|1d(VZP@B37eg-$Y@!?0k8fXR%f}f=(9R z88Zf+5I)8M6FuKVOg4cqDd}ZNvC^vF8xq(Z9Ns7=lX%doUvd@(w&TlomSnXXsR#Xc z`IzF+`%%3s{3j#;fol5tAxP}vW%2I0dHQUQ=r^0tW3hetDHfEuqIjeFF~DCz;O<&~ ztk?&a9j<{wPdLw?^uy3ZxVb+5cvlEvO7;H^6B_+%kMx(|EQH{@+xGGr{tK$I0VL9s zD)cAt1rV^_2*6${-a6p?`xr(eygD5=lo0l4|D0h3g5>`#Pw&dV&hY=)6#c!d41`L9 z=>yZG&)*;XKU|vr=bnZd(+7O(mH-CC4FcIDNz7zLm%sxMOs@H2-5qkeKzv#DzRVvr_>*?A# zC_FUyn6URL3as#a(GePCi9`rf`3oXulmVzk6klrff1}k=aAVPp3Yq@{p4$+@P_3rI zC;z?(5@LY=<6ltt_aITp{P$2%BP2`z zlTJo0xU6JZjTc%hpe6(Eq6v#jk#ctDy7wh8e;NU=+7c=Q142Zw1K5~neFNHfG36zA zXOzQ^xz)#9E%8qp3nI4Yn)f9lJP9nGX2ACW2v$SmnvtO`#pE(COaLSk0Q-31l`w8x zP-9wJM-=iMaxh01J^*v`mb>?vyb);A_C&E{kOvTh^C_ekn!N5!8NM2>a9r<~GtelQ zCg2#Hj^A6O5cY5aY|T%#1~y*mo?l!RFAB&z{WD#cf4wVnC*!xvxF@n`BsF!a~Nu|?Xl_~fFWxV-g33aqZQ=70D(MuzXU*& z7eGS#?rU*s#JG0^KaXrpmM#Edz_J0xEf(P6@AhNypQNB^<4y|4e$I&@fYyjA=OwDl zuJqKKlVD9eaeeYW2WHO*J+b;Bc*al;Xd$4rmM3o<3BOXIGRJ4S?rMD$21dD!%CCEY zH%SkNg$zKk+d%+E0dI%O4gqBROm21G>Wvx~i`n#C!YO#v#HqJZTf{Fe`?zjT4CKMG zJ>>F+M2tNkJTw=`Pam5;_p=uV_V1rL6b2#n#4G_ZzXas+d5MP$061$uI|2-k^pN9H z`{>*b7V$TDvVn^ZbM{@+gwLkRjI?oHfKR6ep0J_@T4=hqUnu7Ts^%X0w;a}5D_O=E zN~kluZ&j<e0=!CeYWk5CPI>0+gO9bsqOU z)3bAgw>3%kkA)O0bFJnIX(b;fM8Lp1gpm;-W{R&4DePL1ztNlYJcq{^AOYQ;EA!ubiI7Vi+B1T8$x>!e9WR= z(@9CX-}XU==UJJO5&DSRvo71^$R|uvV5akM5W^p$F*z{s8WP_ zLb9hyJj5q5^u>HuFLU1{$yyQQF_h}Mhuf32v|OYHcFNaP2SWdu-({eqhS1hwlX0y= zfQBTBon#vm*6!jDT9+O$2!S=i#@$)235g!fQ3~8!7mZx|@se95BCmzyBd7n1-{%{lp1buMc5k=C ze=mo!Kh}U?r9@f5>5?ZOU`4o@8RAM@JZ4d|?g4CS82oR6=Ea06TAejk4$Qmqh&ppv zzGOd}0fbPi?KtFd>(XPATe#1w*}`HH0H5o7LeQNL7OTd@d{l94t)l8(dIS$u)|m7f zJ+%013jYaA(U8kn^y>rrCu0>X_9sNwK`v6XO(=8LjpMdUanE*69Z&;6L_Xw%V9tSm z7=5)NAL64tP1&g+bp^z#phz4sb%@c?T5*FlavmSm(i&Ivw;g7MzZ0xa2nQVHaJKX> zGUGwr+pWzJ<4K*R%?b`lG~Z>CF`s0Y+-mJ->oLP!XVMF^{_IRM6%W6AGf)~G$326540eS6vHXV4u^elA_Je6yGQ6a} zMbq7i8JcPjr4ACWx0yn#l=aV~_+*+?@Zxw^$$t4%&YUFC=K5!#k8Z6kel1Yx5Y-kq zo){2k z72jl$D6|BT)N?+8IMs3N3_S+lhwLF%;6OL1D7W=tLh_1}Ye~UarNvNATq>I9O5Qnr zv0iUme-^9Rijn$zwSwp^9qX!`kk;u{H6?ZKTWd=BN;iT(^1n^#4R)^3Tz2~E|7I$o zvZo+LD~JFUPpn14_gk91cDZo?nd7K6g}Exot(K(Lt1t*5b#5PrbE?n6@(muj40j5JaK!I9Y;OSix{2_bZ}N&+SBIGxSNT(w2uBnq z^(qm1wj}Ad%p3~6sgd%@(rT%S2t)ZFMubK`)(>8T&nQ2NcB5LA#SpoL03mDIj#JXF zwrQ)~O}QdG(;ldW@^2;bu1Li=q|~SP^6{WO2=*h#+Q9A&mZ#w@L6j>@u0Ey^i|@;=OXTYdLJ$ zDHaGaBNKT7;A-oLyO+sG1F}x;(o=3ED;s+z&t&9GfF z$pjp7+;`My+N6-Lb?JO+Ph=8DLPnLRiJEWkNQ$w}3!}y9p3W(vvoDCw`GBhTE>=R- z38EbkHcCk)aUa?{Gpg77{S_5C(}Y}YzvYcKyQEPh*?%=TW#*h24Io(}PlF(H#rk*Dad4+2L;C0d8&x<(mM{$AzRX=+4~dlUJ)vpiK{>jWhmod; z_zvIn{YKA*aKm(0`e<9LtRMZ0^a-hPJd1YLud05z~St1<0@ 
z3GKm!f&%A_{+P?ECOr@l5HkEI%qw}lMHb#LKQn#zC}6!Hu~tXo9vdguntgzNR`|~3 z;3V__+BG&w6WC7V3Z~1SpFGj|tabNrt9*7zM#+EabW8{ZeYMnQ(B1U#fyG&B(Annm zO16s4&!KxagwtgWGVZt*52Q0=TMP!Vb>$Ss6zYX`C$SwwWvDr)>@{B;mgZILv5rr^ z3OcZnem1W5I2CP#Fmpgr=Cst$^~gcrlB<;ac5%#EzUaA}iRKRv+!~pCc8g;BL_H(f z4}?CnH$7HNRpVf_q^{hx8ReyMF0Qsa2xAaV6?g{CVeY z8nb;5|BRZE8~Z&f_vZ{RSn)Dlb+I`Q*u=2VO(~=16{OcS-|lve%v5-=RXKi@j-(i+ zKUaqggY40p!l_-Kv5%WC_t-D9w%I*yXLCQ&_-V=fPD{~H&?Lu^L7q~j17WiG)ijJg zT(CYO#(7KW=@X?PsL_|*kRN+9ipFZ6SFL-HjisC1nDt(*VVZxp zQ8_u!X>u2#iUgxSoX1lHxDi* zR-cK#teWe5=x^IWux*lPxm<%PS!?|Hff$M}J=^PzV5H`Vj}(@O8?*Qk?MyW2OP4Ix z8$$GD73I~(=r@pIyMQnFHzG>=^8GqIVmNoi1<>MC+ROP#;Q#&S6qDGp@rFGW^Pj)Q zlcg40jz$fj{`=1|RiE~b!4(O$KQE??lJaS%!@!jO`%lAsq~Orhu7G#{ycM1pWiV2( z0s&FT-+#6!A!`g)?*qM%xpKz5s!Qwuo$|*en$1hkuFJ?bmm6=8EcEs^4X^Qi&dSPa zInAO$#p^B^kXqJOpLnst|8bi$eSV{&iHH zzs#eqzo!PTZtp0*aR^Hd;?F60kO>loH@E)1EpwPrb#=tqB7bj#KfTy8!A;>ie?DU$ zA6;YuQpBzP^H#pYsIqzz+4&}a?~2`BvE^lru+zV9g_jIhZF%4n_V+|>(PZ`JLX;`~ zegYDNl=c|mX%TR4VbVaz~|L3Cr z&nKOU@xROCEtWZZ=N$zG!pqkltfRp-w;xO)0Ls>Zeg@(k^$y1#f$W-^6V|^d0c! z^K0FCJ+E3vw!ncvTXzJS)aO%WRO)KYr{ksckJ3E*Ibp9AbJvGkzB;q?rPAUN#EpPI zuG!TIXcMs5>JVupu;iZtflW;dz$&mRP6g+JdOn9vRY7DgP0-5sFSfqAPJZ9=h|-=Z z^>%dy^$+lu5H!dIU+ov*#<$fBs8H434Z6F)%Zl4NoP^CBY+))1ki%sZeH5gXg^cMs zW^11t#8hwY4W|e=DhVH|RYHn(wT^T)5|Y4OIVo{9Aw*_Zz2gSRNsr+o3?Qx-_96wV zu$_8(4oFf38=TEY0ax>`w=8`Ma(w0=#^8qnW*l`8Ok4txB8*%+0!y*aV7okryoZ7| z0C;p6(BsD1H5Q#;VsZ%F%a1sNs0;?Bly_9bL`3%id{mmqB1knuU}^`#GNsNa<`5ks z!0T{THM#{nPfPcAiDz@t0EPi*qPc-rQGQwkcrGn&8v)s5t^x|hE zxPjdBf>KXdSA5tu2Nqar=Q?@Jqap8$%DlD035eaa8<8gMP`?BK5K=6JRDVGg!(@B+ zxufnP7-k4A_GG#HGuz12k!hBU>pAO7c_C#Flei#f_MWcdU}_=#+6hfd zZq}T@gy7C)Li6vAunllo4ZD?S&Id5z1MiOyPk?MuQg8-0Eckur>}?W$NV){;`v!!M z1!R2Ua}{+7gUMezDV&FMk; zAHQvI{l5F#Bn`ksur*TC0!ss6%-pkmkQ#=tc~i4A=VR=12ODQd%pI(=YOL0#Tp0?y zn)@QH`h^kIW$~#*WQZACQHqT0kCG-s9SO!GQGuf}7kQq&2U7E-G*Vv*~vAmEtuNGH<@<2x~Av zUyrc^hf(@sF)EZ?q94aGWV2gf3t%adtJ6zYlQMb2(sG-0Ka!gC!dj`;k{0l+$Xkyt zFU}!4tGJW}?E3w=D_aXFG?Qv!6ZFABc_e%ot3-U4V4mg#j6-Je2so!GTii`-@LBTYOidR)Gli{3lPSF6O_#%&&!;=>b^R;W8x9aOtL=Dw*jc>H|J>Pl!{ZN#m6yU_}4`3z#NTQ&-a)gm zYnd;?4t0gE9U7p9+2gnS`bRgF5r}P zagOq%_vK4fMOFQ}E?T89xF2{WKI>uWV4UE=Q`B|?Mv@d4KZ5L5GF`EtFkrdQNEWE} z!R1+H{)Fs!C}_&X-&O!|>m_pnXQ z=^&W`Q32^8KWUt(Uvl1_Uz%-2KgLFbjm263x9MLh6W#!8f9ha~9UjrOwQMoQ`ruguddpME5r#}>) z`10H8>vUy)T5g<)2zHkN)L!1gUY3fHx4DZM_ho11U|~E=v8ZI?s|QX-IFpkK;XJxm zCz(Y%U)Jf&d~Ex(km62jAvDBM1#V~ck|9T?KV)r07%V>BOo-lZsZMD%qHMylE9iDr zDaEm3D}LJ6Umk9)q`K?u9hJsIq*+ocuLnfWrP#@0T~`3m2X*(2d!p%J)gRl=ju4bz zuvxCzn{eO)VW=F*E(X`^Z3i1-JVq<-e6#by5=3?4cK(5y0Hx+l|MtT zkWYDmgqpZQfK>5;?ckB-HanQV$n7hGl@c$^M(7ojEh~0_$pfK)xitrig%_}$=Lnq4IDwZwg?&4I1Gy=QwKDE!;B`5k$Td@@U~umGWxKpX(HFTTNcBCZ z*c?yDb(e=db_K$5`CLq4#sj7tsx}y<__8!`PoFB)zC&a}Z-3BE*01VG`jXIeQ`b>X z%=^wsxuTdtF|EjwHicAi`1>O15q*G3*Bws~-acrhTkO3GDUXUvu00jcM}&OSA!vXE z8Fe4sZV@9sH67YdC)lySB!*VPFwyOGsH1)|RJNPa9j#dP7-G=qA+r&8Tx6H6Rg?4O zaI=-a?_}ZqR3%?nhcn2F}b`ac{T>Yg(OpPzqP6wn6{B%t46k&yn=NiD!Cws%h0^6O7S^EwJ1 zdw*P#OzMC7=6@_Q@Hv9tr?3VdU4>y2K70)XXNd*#R0|0?jXAF z3e5pcRT4D%s5VMiS$jjXq0Iwld(&kno4<%}J(y;ftKFgeuF)TTs>fWhDu#FUZ?_i9 zBDSnjfzSJI!%F~bmjW)T(Lb#=piDh*Gw5z9U-{b#vOx1){=ewU8N*u;b>z;hW^(f9 z9<2jVLDCCrtm7%&En>4Z(LH`Xw91B3cgCui@Ry5MpPj34bVL;U0 zfl4+0;y^W1A^=!t2nyXDK_^Dlz|uTqFU)v$gO|W=18TW7WSe{iam^bb-(bG`{(C#H zoo=Sf6NZ)v`W{bDIT_DO)e2Nj1#Jf0ePQ4W@t887zJaUj;B~&ztNdWM1hib6xjIhU z&p=st0S0BsJWmNU2XzY9DW@Kdelz6`b}9tP@<#h32q^XyC@o1F`-mya}Xi;2@eOXfpQY=@(m* z*`v_V(7eyL4S>U*S5s8!|W zNmLh9vY4_q&C)-MD$teF&Qkt`7%S+1)G0?KVvoE|WPC+1Ogjgasb-AM7w=22iujDr zS9Ugl565FM!gEV=Ok(AfjfLg&wH6{U4|@-r+*=&vRQn#8%Bm#B7P17XnC{rk53%cP 
zjC0`3D@AXKf+Ljyvu&6@gzU&0V2Z*_e2PH-fsP??;nzc(aI2KcX(QIad}`7ZE7f_d znC&$B6WxjwbtD}jG${9;r`;T>7C6^DCd47XKUL)pQ|Y8W7xE8=$2su%Bd%bxzke-E zfd^O1M;YNl2ObtKHbEdma)9T=M8WOOQ{;JX>*eS0SrqzOj)9H?bU{BrwZDCJOKqHS z=%HBUYo~?y%Fc ztb<$fW6s`ZvkP25>FQpIKSD>noAKT^?-R70d0*>6E@2+S98;I0(p#s;&=6K;E@w?% z#Wn;?-=V6`QEjsK@#YT*5=I%db-HNTAT2A$)>W^}(3dcmzG<_tX8j&Eg@_*3{WA18 znon4$4l;{i0?1=Ba+sm^su1%0rgM5`3cfy0D{t|>ILMxlBl_C*v57HRE{N=2cSJ7W zf=qO=;p6=uhu6b?!Z>9T;xM5N)V_S_L>m0^(U#wPTfgzID4@y;3LhaPL_v4M8%Cq7 ztZiz@vFe;YfQL7~P*MQp4ZvH$b5v)|K5@y31@0hxf(Nq8w(+No1#N?*N>49D(x_$K zqDKh6pyL$Kb@xR=5unVlCNw8{s#G8dqb>msyFedU_|-(Oz)lINGq2OsonY>;rKj+X zgNNy%s(x$}SODrTRK>O+yqg1-&BrQST2pWK4*?gstxV}d8hBpIQ* z<#(y}5G)+3^1>YA1(Wm(^)_1@86;mubpA(5;|VcpKN06#V5JiA@4l)RG*f49QWH`39hlM?Bd0p@jub!nz^9%#n1>iG{<%|&0L|h~U!oQD)HnDP$6Xm^nOL6V(1wQQ zT|`VBz8GbaPo=4BGTNIzqp%7-MoG$Olob2mRU_45Xgx=3-vL4`Wbumdfw5B6=3b5C zSGsasI=CKb--%+~w`HWPIlwuKZQYoye+N1IefhjpN_h^HWICDWkYiAUK4_1ZcNna|Djb z^IQ(AvX$cn@!VG|DBYnnAsD~zLw;^Wk~+ra5LFC8A(_H(vZlQ<=2tcJaun&C!aK_iH zvUuy!grqwBYRi$fqKrGRouE>)!b*isA#y$gd)5T7t`SMCQ%05pR|kBOU=Mu6&X5C*aSc4%YZ_H{UGx);oHzIa%JFafcuQd)RZf<03Ne9RJjXgT=c0rTZC8)wd{mjINiG zXEYDnbV8`ly3D-&a}5VIh}nqE6yja}g`=S$gZ1kv-ro|Xf)2JbR_&QT#U2teqT2Ih z$QA#KM|a1=_q4|JDe2!oN0gF3Q)93GJsPDhEVk3pxljK7`41@AI$Tw5O8o`WZJtFtlal7o5P1kJ$hHZidh)d$Ue9xlaNrmW{z#c_Nf1mhSSPj2flvFls{37v6pWRA zhb>9`kk1ODK*`&3f4CuLx9cxOtbeT6dsNUIHGHH9WGg$&JV-vqY+oOX7wmECM*FH~zixeNA6 zR$#WOC-9tx0a0<`)5u`o5j+4~w7m5$=gvrW7vc+=<}@l-zSby1c@6c~MBJ5YKP@B^ z@4N58Ms!k1{be<`h49h+%>HpxRxpWp2K6U|KZp4@rEc$l6<=ZkM?LrE^71l3Z$Ltv zHD!fw>3taL#ug9}5g{CGL3ptY8qhLez|4_x(1Lw$N@+7&qevR;U?q!9PlLR6 z7s^ns4_}~sJ>7hyPxj@L|OQ3-ETVtH@`!4{cXrXkP zHDu*aEe7S^9W!I7I46L~&eJ9RJmF6<7DRCu-;_3^k_EB-x(M%e(}14t92CX;1Dj6C zgN@|CRb(Dniue$+sefFmGM@3ATws5K1=RC2v-7u7E?`qM=(0Ysgs6LS8VqskJD1UKp>l*r)H zBLY^NaSD}}1ykOzfVR$=&mJP)ptwu(Bw$J(vKm!yn=bY|n1W6t{ z!k#Ff{{bA63r}RoZ&@7kz{2v>j=B;EWRjOqfE9kofGp#F?jbb-!lYpg9xRkoHd~!% zi0W;S*%}eMRA>6V``3jZEL@{_H$B1Go`<&-D7(qicoN4>Pm@}pczY;?bBpgG zu?{>q{q*Ye_HwzVjwfrP=Q$ydNs%`OQ{NXi915$O$5s%qV0uv1^*`QS35SPXfVhA>=+T}fa!_)@a? 
[remainder of the preceding GIT binary patch literal omitted: base85-encoded binary image data, not reproducible as text]

diff --git a/format/diagrams/layout-list-of-list.png b/format/diagrams/layout-list-of-list.png
deleted file mode 100644
index 5bc00784641ab3cb3c3275ce1e52e7ccd3b2ec37..0000000000000000000000000000000000000000
GIT binary patch
literal 0
HcmV?d00001

literal 40105
[base85-encoded binary data for the deleted layout-list-of-list.png diagram omitted]

diff --git a/format/diagrams/layout-list-of-struct.png b/format/diagrams/layout-list-of-struct.png
deleted file mode 100644
index fb6f2a27e07a766729d12ea33454db011ce6ae00..0000000000000000000000000000000000000000
GIT binary patch
literal 0
HcmV?d00001

literal 54122
[base85-encoded binary data for the deleted layout-list-of-struct.png diagram omitted]

diff --git a/format/diagrams/layout-list.png b/format/diagrams/layout-list.png
deleted file mode 100644
index 167b10b11e37e761de81de8fa9fc8c5c9a30e4f8..0000000000000000000000000000000000000000
GIT binary patch
literal 0
HcmV?d00001

literal 15906
[base85-encoded binary data for the deleted layout-list.png diagram omitted]
z<*i6)_RpAHrNDF|06y{T0fWBbP!zM8l7#Q^~o%y9o;=Ji08ruG@bm z==N9A^WYVkVQge%tNTv7S>FM_lT$~fN==x#Ac?cXce>nC)<_0-fK^_jzS{}qri&$I z)+xAjW#v}O4BL_wqk0d-EX5q5M)?#*ke7!RrfO;?_raaYw^{U$g`?t4v)(!z=aM=* zdw92ByxQP-$-{hIr5OP-w%Luo8?lBjYV_Sf+}`FZl;<{Ev8MI~WZTa#xCobUk|whQ zq`wfK5hjBQLSX5v`VHTAuvvK>|B&8V9(;YnVfB3QX};6ta)2Llu=#Lc`*glpqCEqr zS1^p0+>5@wOuugBlX=KV`~0`sFKCLAa5?0Mn-lq9= zf|wX^#1D3!s82^W)oyfjNehn5H#6{gz^3OMWP9a2p1xfbOB8ZNi&sxfBHhjn5-xw^bewmvP+Cm}W$pPSWam2{I!#cm)rX~#p za;RYQk&_uT5T7oTXwdE2uzyjWGJ^s}59JII6C7n;$7veZDZBk2N5m+biUF=G0`NS( zMxL?`X!E{h(NJ;z}-P;gu6)O&b)2XbwIS(o@bYnEf{;3~i|PC1CN{UkWx`{ zaTbLFIqW#bXM^TCtEZ+6=Tsds!Mhu-xBm z%u=wy0A%$JpyIv%H_dgfKwkSmsZDp_!qe)mM+jNUeONa;v{J%*6`$6yIF zRBmxV_zD!=OgtM_6zVt_BbQ7e2a)vQ3_v6hMrpy6;sgA^iNCJ07<_b47{EKc%6sz+ z>Pt3d0@TaN^im&XPejbTWFmx&U*P(`-(|_sn-JP3paexStDVnru%h__)GT@LPwmZ7 z%vR&DsyE0Kzk4X@U!*!i>37lG$96ELlI`e6`KsG)bNetl-zv&%AolxH2&!RhQxoK7 zEY=rMF_Uw=U(X?F{ixG7lP|+EF#r{An{0lu%F1zfRPW4IYdl^=_-PX+Hs^Bl8;$l{ zqIjOYE#Y<~gaUW9>-Kz?l=5igHszf~O`&&^3>WzFxRP*1RC$I&*-G793XNW@{eN2P zO(x#X0$1tR+LoTw0m(k2W7ZWBy-{c2y{%bM4cWTgPRTu!_tq)lgcYIuI}*%JZHxz}ONM#9D(&W`XqOOp zRfX^+1(OW(hrFVO*XaFsO#$pA7!;idn9z+&E1D=~GgLYq=2dw@abQeFz?g2Xb5fxV zLV+=HlowEdgL@7wXRbJ~;RlD)0$PFlzqxtGw9V}F#kVYk=m%Kxwt4+x!+y0!W#c!G z1QlN<)QF?24sx=DVDMKqCnL#9UVDR_>4Z_~wH;d*b z5557BuR|SJU#bgT3^v%1%W-SQUTt^7l% z+((w|?-|P$Xxl4|Ig;*%QWa+BINqf==d?sg5uSlao9);S^%m30tLeOeu2zCx`Rq=#=%}`=IQm(@dp*qYwX^uFZy*u!Lb`Xg4Q828<{}#di@2U5H zxuY)XcRQo!USkN}$pKbJl8ALcg6O1T=fvPDamo%tYVV`c?+o5hUzMd2)Ia87gH8NT z?_J%2*mg=1sznF@|M)-A1#F-j+_0%!2LeLR&bMegdZXe(Z2Q21^ z0TcZEGBvHSI%6dw?J-`N0ZC1z|?%z-Kt{`9+H?5om0u9N@afa{$AY zi=c^#&+G2xrnsB=r}^(lf&l;-9H{`q_8nW=QyVM8ouo16o_Dtpu~)6zwVmlHG1Rxk z*-!w>TLXCG<$)<7G#%y(j>OJQIxg5kogA1n;eT9j!f3$uxX5141^!&O{B$z+8Oi2G zzw<|yEG=C5YsOdX1Ss(xq6jPd3p+opI;wx|>@T)I_&pf@Jo5kU$pHK{mZ!D?ToH{E ztH^z=-@)!oO}%Mjb^AzqB8@BUWQJ*WsB2)aKwAL~wK2^5%bSblQ?ds4jhvyi+z$_b zdY1CY;TbJn$N_39Bqs?i3q8bpfAgHQ<$fw}bk0M&Ovj}2RPLwq;uZmM%1tcm3>Ma* zJDh|}^wVd8nxDn)uO?{`Ek~AiMr$>0_!#Sx>TQ@%3Zlj8W9{KNErmRmS6JK272n8> z?3sYYKZmz)rZzT#vye|szZ%^HnI1F(_U&ozotJhcTZ;L~nPV1qY$%>dVt2Z~fS8NZ zzBA=5f#uz2|O-lVf#H4@=lmg}KYe1-{KkFY!D=KF0-vMzN z(ynyw@apXdWHu9~j1Kk&D}b`wlghv7Ltws!JPH{}-IO`fUIJ diff --git a/format/diagrams/layout-primitive-array.png b/format/diagrams/layout-primitive-array.png deleted file mode 100644 index bd212f081151234c01f5814be4a4e4e1e4841835..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 10907 zcmeHtWmJ^yzb*_zBOpk3Bi#%=gp{;|bk|EaNDU<*l1d6PD2f4!fRfVE2+|1BEdqjc zog0|fF!V&z--xuIo4PI$A2%@Tl?7(9o`_sVeEAp+PFZ=MEeQ z_&0-Pkskbq?xUxofcEYy?J8*Cda4@xprMhoUHnB?(_`HN9fqCt@A}@=)R46GaO1PK z^RTh!3v}}Yz0uI510}&nH+x@e#y~e$cOS_>8Rp9#lHl{jV}54F%PzhyGR${1br=;r zyzLo9`9$~xm}T)885yO$?HnZal$24ggO&`lldrF*BtL&ZKmcEWFrSCFBfp@8gap5U z5WkQRFX+MR6Xfn|9mwnM!*Vsq-{UCR``CIrd-^(ixHDdiYi;A<=PSd^d@<2Kf3DW) z>+JCFOzu9YZGjE)Uwp$a$S1)6&)DEq>5He5iXLvB-u6B|V0>9o>C2w~<=MaUT=my+ z^6>Qlr{L{utLE-&?+sq|wZ2%LtPtw$|96l7K9;t(vpv}L<=cX&xBv4Qs=qY<#m4{R zAg)gN@+mlGSv+a}f38dxuXJAM2O1iurJ9nwejxg9>?o|M+&?>3lF>VGG^JmLSLBQ<|wX0U71&086=lKBHCA+ZtlkM)~joSAi{HMqJ`F3e)U73PLsGDPli0Ttgh3xcxux;0p zJN@HEt@8svKV-YJr>ne2sC%XPk8fWY)pr%pC3I5qnPDj{PwjeH`BzC7l|POQSN0b% zjZfaGrum=4|F?&cS+dXZ^SgH-grdu5PF=$eBBrMUa#$5D&pF^`7d^X+}2qh>U& zrA`se$GGZUARUEqcJUTsu2XkeZ*a|!%m zQ$fq059Sig26)1!gEv285J?M&cpuEhmM=3G4mMPT%*Ie`@6RX8ZH$+5L^XwbK2H== z?@kFjS}o$QKbT{j1n*Q2YDA=E^4D!OgDce-(YwuSP=-*r@NwYAB{(N0<u zp{@J-S%S8*U{QV}#>-!&Hq9Ke?dPJYNv6ZkWu_O`k5$MTACzu2w10gD|ntMhwhDY;DsNpa!lf<*rEmg z`Z)Sbnmw{o{?!=aBAO=KbI<6EQB#l4{p=>U*At-Sfd?;|w^xVuZ9a`VO!!Z@{r-%A 
z|0MuBJDBelw|RG!H;{DDvwevv>xL&D^;8qEA9Vg>)>+uwr$S9TYt};Jcq=Tn#CeOJ#3ktCC{7L1A2ogx zzJr=~)Se7fQ9tHM4u_`{Mcp(mt0V5l_J@Sf{|@P?;LxuRt)gF~!pA)!+2F zG!!_*jcx0kKE$nLJZamt5C5Y;a;EPt2Sro}XVeqxQs(IH8D7c(-m}p8L~(`eG>0hi zaue4E%_!#&A44J}mUBs^p_0@?GEoiuC8 zHv_4Om*X`;Rl#r=C&U;5UWX_&~Vp6$4lX@Ejne@Hj9g`daPND_OKze%(D>dImTOk?IG?z*sR5& zEDftKpQsZq)%<*ipMjBNuZ|&^&Qq-OPoe~S^_2X!QM>u|VC*l|en_*BU0augGL%B0 zgS>U=ld#>J$8~nSHp_Vd>S>nC1^1k65=Ke-> zSTRsZoseD{CTZ06AX=u;nZ=9*`)1dF-gGD-rQpQA1WW2l2P=#>-byL=9%4t1=EzHh zG4Q%Q4h|(pga7n`t_t0kc=ar7_u5}#3Zfomo$PN*K2e*}N_fQEUWd`(3S(iDFHE#Q z)fSW{$;ukP-WVXEd`9^~(6as(t_!E&_Pp4H>mapCx8JCF<`x$1Ae@sGBEfKPr6BZp zFVCa?aO&RAlG@GsC#J_BudQUxGd-jJHYdP?!go$=5lM!Gakt~ft$vnNI&|hbKe2Q` zYP5_EjIZRUR_>-rWVNK=ylRZ^oXz~y__`P15Dy)~W{6Ucy^XTVY+4I`WJ-ACs0e-9+#4%uv`z;3MubBKD>JMa!%D$w#L=FqO}B%S`68vrW7D;c!9>bJPYqv? zkK7=9T?4X+uon{*89gqi!Q_|6ceISt>s^(uCQBh%NwYlAE_camlm6-`!_Yp~lbviz z5Z1r(+w)1_)(5NW=2T~A`wc`?LeW_9F^GZwjSMLyvvpE`_FWseF4_(Gmr}?(wU_WY z0yhJmmmWp!)!vtn@xUq9;Ng-2CQtQ38c0k$b|AWy8_9_w`R- zTSuC9%ns8xf{wM(_m-UyJX9xAo`^Pm!kU21uMSZ9wO6{&$H?U?n*N}dbJ&2}S7?6s zC9NG^6r+g(Q|uy;#OmV4J0WL>%j0QqRkQ)&uP_!V!m)-%!CMCkNmRn@fhM#;3#sk2 zK0VQU9iJ^xu`>vA?Jce61OPcl05^?Y%L=Grr6rv@b0Sz={TuXUq2Y>@c@xuf%s#{s z)5w+A61J&?wDWDvQ!4JGYa!m(R`Oh;!xEi>zv3A#7~K&R!d{J%L&YU_ZU4_SLOR=7dy$m~$6f zMz38Y$FUE#RGp6bq2%p%^eBCH;`tj+ghUjW^G%fp%v6%)&JI98(*EKGl3?mB@|%MS zj-AYg&^WxXNPEYK)yv>`BOA`biT!2axRq4<Y!fO5HbFRacG`9B4}kz6e-W)D zo8WZMpsf9WK@w+D#@3#EwBleyI`rex>kTp;b>wVssY4Rs$om;l0PEVE9Q=k=dVKgo zevx$l1MMa^{5Q0#M=RmYT8ou(vZ@_E7g#95T?l}@`*)_(%!ZQ}M(-^SA;W3XM=Q^L zqo*5%`=o`JbA8^GHsf)>tC~}$aeFNHZWWxpGyupe0fU<2=58N{+Y)KSUB?9YUQ=TE z0OTusFcVeiYzHir`ML+G(x8^quRGc@eiLq$iS#l@05zGqe)!^Kx6>@FBC-8*$v<(=E;oAyZ~$wBKTUg@%MKW)Co=zw+3N z+nGX(rafH@g~s*jD_>0_^x6kG&wYCUsg`6wKa(l>=JDqBO9=KL!P$vR$wx!_Uf?zn zH7R>x18N*GHUX1fDqP8{^GQ;CH>}DYHaiX2g>msetUgql#0vU$i$btL3t-qo^ymGa zZMHAnU+x3)2D21Y^&hv-)DfRzFTZ&z6yx{?K`0v6-1LO$QMcS_QLeYT1EtwBAAygK zc7HLU>}H>CU>_2tuhf0?wPA;b`6 zkyv}({0(N_oouc-u#>G8MY8*&McNDmSY}^=Y5F#uGPml;lhwe96wBJ_gIoF3+0j1k ze3il;{%iF`X8rDn=}x6_*|RHF#mN`({U(pKGQrDrnf=Kfkn|ElucshhbHitE%XH0i z_|frVKYXycM5-+igWY<(*by&)d3=(-|IQOF|Y7trq653)JRNHc5i=!?ybNZ#!1)m zUMI0GH8Y=H(MOtMRMk!rucbXCkaG-YTe1CwZ3TcIGV@N?`#P1B_?4uSaLXq8R=i|m zjDV1I@F~QG8@@Tn^e{B+k0oOG%Hijl@$tg7M9sF%=@&88wmyf=L}0&+Y7$0!&&_TW z4ag;ABu1q2hqE;02JpL~T8OaX$=!B`t^D&7j$yK_@?`1lj(+n$&NFEg_&aF0pYCh` zA9|!Aq$bJAOoK&)wa%cMbe*{{;Jb4X60enK2$P;FbBrvKM!rX#bS#>a>M_9yZz&bv zRwB;mS1hGp4R^>)0&%-D#;0x}aza0QwTMr{G1sdNU{LO9Q5OakIzRoC^Ld8z4hpAO zJK*f?zWJ}KT1jq0aUIU<2HEHnnDz&?KSs1 zfXJuC74OV*?)Wr_eS+^9E9+Pt!idHB6BjW>b{>nWry)+Y&*@&betaemL+QuU2yA+I z82}xzz0`uE(U*AD*B>HkMEJde02uH>Yv!Twc6P5wSmaO zw%9bMCxp)=EH1@i?Az#<`Z@JvYp;W%4a2J=g(;A3PA-C!SdJs`dL#GYyy3)EhHXR4o5Z%~GV7wnSJ|y+o zI3r{jU5Yd_-Ix5r%z@x^?eCh#K3KGIM$RulO=g)&l(|iawUolvX>; zHzhC19X)w#A$5q3P1~r9Wup-h$D0&>`qQ#c@+VLI>bTHRzar6>6!mO$D~`rTsY2_w zTJ0lZ9$IC}(o=-PS{NgMiO6-;n+O*%g$NydIIm&#j;N1$eaS z53HS%4;CpD%4)v(JVV~NOxL3@&`oJP1vKUvT2nLCdIVYs0!_A__W;&=>ksChb(wUN zJEm{a1y}bo4rL^S?qNj-(s(Yh=0G3d$Wt1-tKQDZxX?%=wA7JzGoq@Bl47>f;FkAR zU!-kw;o*MhH*?9lMjyN`=sjL84rB&VqvH8RuO8QCp4U`LQ7W`F6VzT5l)?@gkE2_G zDA8+(8EJ?~3W?W1*I<_yh9?C(z%nFx_=q{U?u^~O#APcFp$}fr65kSpx9a5;sg)wD zq~!8Vd2ze9cs^&kCf*?#o75(+KVrilx=kt0(&{Wlhr5(78zIx^`(VQ)_wiTx2$rB^+fJl8`TYgmLpUICB~l0Snk_ahyF^&4h3bGvugcxXj@7 zvlA$rQrxkHsvVvk-*d7;jNDzD);W&+5FufL`lZSV_u8u6)X!K2aM&+MW()Bl><)f{ z!j`hq1ZRT;gnBtYJgmx?oaNaymO*6Zl}*Is+_){lC`1xV6G4<9zTp?rw6oat8ePL` zy{z`p7WaKZx~?fXEX7d_Y(k!fAA$t@8!lG0jG7;~ z^2NUy{Zv-4vNdU;Z=}Ktygp}$t5}Ck$#tr%8H0GJr15Ow(}X+CoQS6oM4rsc9;{MGN*U14>;Z*1?`CeU%< 
zOarD|)fF3l@Yg=fkE@PEOQQWXR_>^+m8{YJkOh%~nH{r}fnN4kX80-LeJajlxYFFqaOqrCFs(jF_G z!q3l6J!)28j%4;_u5W{i!_o_NRvmr_kmp|;jqY94UXa!}`mkyj38yUaf&%x#=FuAK zyHEo)>e&c&me3C`m=YhubKZ4vWXNLQ)I*8I=(rj~)#*_eM6e)++{F1sdPv&e9-^NB zU$!s|1O%)-iB1XtcO(4RUw>Ek`m1@dW#z7{83BEx1?+WA)$MPctQwfN@5b`-3d(y7 z0EgG{D-`3r@{W~B=wEXwFzt<@6>#cDW;^8Ev^)UGs2xPbHyM;ve&oXY?eb@nTTv0dsC^)!f;3fap~gZVP# z-&%RQC4JwU@aQX~$A5I!I);q%;a&<}QO;jn(JO4P1vItA#_evb3>Il~5v&5v5&|G| zFOk)KDu+P8mxLE}S;#E{XG5%6TamE!(iWk`&#u=Z8p}o5IHH9HRP)P33R6i`2kU7J zigZc4o1P``byXy~S@mQI0u5;$RN1aSH9I*5_!T+`VUPGiZpzc4dlSs~2o#TUyI&VQ z%~`<=iF|%d8jz%YPr8P3$li2a9zc01mTlc;1+}Xjt6VF4rt~N7*COB_|1YQY zZJQ(vF*VXK;aaYl#rVE%yDQ`eU;%hcXHXzGn0G9;2@s`N*ryxpfx4y|dp2vyE!I{{ z3NzwPmSpL)uD9vR6yq-MKRdaBN~^M0Hqc~jx_OAVc9W^laQ8072D=}iTF3MO>?20% zTIR_u^D!VHsUhv^PWL8s-4#$`)lHlsb%`04U{L9UG~EU!2k3ASPgeZ<+6_ro7^M(Z zEYFfTyZ*_+9sq$Ao*LNNkwB8j&_3iWHjEUG^a@3a(qd8u)sKUcO%YIGIxg-#MQXM1 z(S7K5q8!baUGW@G1NS`|ZWiSVJzBg99;(P0=wB~@f(Kv{fj9#NW;5HPr3>XC$76$W zcJra|*TAnH4NkY?j46KsUkR55g1N5VZY$u+ zn#lEv3@UL%Ws#9shWy5KM`(;)uvvybCqT2S)yQ76s0UTR0{LQ!UONB7amV}^A=lE@ z2MFxB6eh*H?pER~T5%_(Y&^hS!40kfp+15Ax(3}aT~P@tdrC&a*S_fNii{k15{2x? zwi(~SLC1)rO<7IajRb)+Yg|F|L|TP-6*g9Y|4DXVg4)*UwYF#)~b@Ye+tJ4-T1^AA_p0 zVz}a}4v5BVnGwNe5O=gBVnn7lKy{v$VH(E72dW|KeyySf;pb=X>yQ58G@i9&bTEl~ zM^we;|8s=*(UWSsPDb@o(ixTbme!KWsRs$nHINa)@9jgNS}0&frJP7D+Ranmy3LMq zr#uAwjr$NYT_mklEK%Zxe8L&c->Q}3-@ey?<^oF3Y%uE_vH>2wXb71JVdP&5cVN&~ z3lN$b`e7T7bjRwUMXC&eGZBJmBp;rBAN>k7<(7`a-P;ou=S#d1_aA{qF}8K9BGQyu zE2L_>fS5R&QsRz-gY$Wk)CSHX&~l(XSWJ0miMP*Qm)c3X@+ie}W=r|6kJaUlsmpf+ zgv*Mnj%5Z;d&tqUrk1`=vr=hQ>3W7@O0p}{+-)OMg(thVMfpT(|Mkw(v*(I1?z@!qqm2-KMh1< z+Pgj#!}Ls7ot3Z0Ct9A1qnBKMPn^U{qN61?_@%}@H33^b5FSG_PVPuJLueBz1({hE z((IFd1We~awJ=4SsF|>*9$M=%;xpA~wJE3F zT>aZIJ)r|J52#;A^m0y$$)2#rDd%oFt@IZZ5Y$I#RQ41Xvq=BxS8RzT$_FxY#{*rQ z?3oL-vtOa}2cA4;0iGaMDQO+huNXl1HV^m_Q_i#_f=&k8MobPZBStt|Ho63`DyZ6e zAW*+PnSC?DXx1zi2L849C;hFDDg*G(KY*`^3A!%sym9<{o}wX5WT-!B|KG0?jY z4}{v|Y3LXas4`UkYYc+A-)#~HE=u-}RI#xt0tnd0q90yUdfB9eL=p?cHrHav4{B%c z)7-CNOUIC=5jbdXc7|NT@PUYhM^h~#uq0R>usk2wcFw87^ip(KIj!6IdK-UX@S-4e zFZrdZBBPb7W+!gJLy>_Zo@>s9(1D6gv4I0<4buI_ zUyqdx!+;d}+q9Jkv3XFH)5s`@ODu*M%pIrCUL)>J%tzd&JakT^gKC^)->j=Pkxyb_ zQc;-MfN!z+#%wnc2?Uo$2^Pajk%&P>f?SaaSK=9)X~s=C{6*2(bf8ZlzV}&wy+nj>KJecohCqJ&*LZZ zQ5(3ZlHi>oXrH3m&o2;~h}SWysB$A0P=qFilonLmR|N2vBzE_36r+eF75ECi?}v)l zl_~~A*1CDz1sM%)fU8fbuGKAXkbghXKm&m`K~`yf`C~D5CrB<_fA-Wx;uSG+7=;28 zmEqm;&t-HsRZujB2~2{AZh~St{Z3$w)W72I$)P6218daB>0d)Nl>r6ochVj@PQ8j3 z`l?`y3J60K%MqalqbUbJn@UD)Q2~%7GDt-Ys`>5WcWvKEcf(Nelk9>d{g=fpBELO- zwa(n@!CNJ!VBkq{&mP(n&V5F{4e-JKFr((yg~_I};_ zoH5RS-+$+I42LkC&+NG8bzS!ztEsMtiAIVB0|SGptR$xm0|Rdc0|R%A0tcihif{Y| z{(<$>R+NFM8X?~YenE9pGW3LjA!2y`3#+V6e*$Efw%0ZAGEh?$v2=CjG`DiKu;%o0 zb^~(5z=-*Y06#igdzn-FIXk&{iuj4s{*yxl`1v`Ri8t!^OkH z0p#HD^mp+x_v3K!r2AJP|5J{fwWpo@E#qDHvrrX?`IJ0kErP>;_&Q=XoCc=jV#=dO!NL+cM?* z4<8=a@s18*kV*tzKb(bh5!^ww{{})?~0#4L>!m+ zp8|lgxZz+4yK&w8X#Uf@f-pO(f13HPBv9yEv<(!Y*8fV_(w%cDTZ+2r={$N+f`f!TKM`{c?d8Rk@|1LYQwx1XlyGCGWZWM};;a z!vc#H>wy7WG?kw7)|nrIF6U1|j=s$@=J{GxaD^^wnTy57iH|jI1EDcVeeKg1rqCA z_Z|}8CHEWA?Q|V1%G@!@bH(P1Ni6Rb`Hy`uk6gxhb<_)|ZnbeScIPIlEwkxlw9T+Q= z^^Z`vSq;8BQfB&)KIp&u;i@p2GVW1iK?w%2hN68`;*K5d;r@J7pUvXOpBeOVmzJXj z2Y1r*T`jzC6Q+-6nF-btX`f&FZ@=5lFy?GfpXsPOY=3%irTx}+y%AqmJMH$n{pqpt z9f|Yr$=|{GA&>qaguj0O^@X24H{jRjQIUg+IKx8X)P^1u!2r}3QqX#aKM}t!7hS2` z9}Y~PUVNeD7ol{nxj!2;1*(WeX$5LgfEXg9SL41wY`d{bo+77pRKZ9yWwDGX&&#!1HF(r#xP|0h%R|-)~(8x z&%fVq3+R{6USYI47)l^|++>;7{73cdYBhxFDEareX0bexq1E;~{ouozUYFKWg(%PK zlZVHv_K>S(UtX?N-j^-~v_nhaw~C#>l;6%6yH+wqpUCsw?iF)2%vsb8@t{*ZPz7C# 
z^Nxyz^Mp!7Xt(W`NcQTg=xfdRQxI}|9&g)&9SoL$6JO%jwYQwJf;s`+WgxUk4x{9ns@Ww z*g%`MQdRsm`175^gfAAIW$>LEmwiHj0VRoDGLNS4i~rW5C9oQJ-hRK$mEI}V1FYn^ znGLm)%diq zu}5ZImMxff*ri3Sw3D!V5A)BqZYGLK+@ExL&%VmUZmE*BEZcd!`#q`$ z>z(n{I4~;^70m?2dV`CslFel&Y!=vT34$wrsmhG~Xvx#jewn<{c*qTwd|nsM4f8hN zruE1iD{6|Eux!e+tqScy<6^tHo�X797#HVGfZ!sz-Zfh5U6d)dv!`Wk70jriy+N zWSTnMZ+;TRHNRS!U*cIN%=(QJtQ94q#$$2sA$+?f5StL@jh5!R^q8h5V0dvz(yJ!8 z+&$~zM{r`+yWp`NNjx%>$W;TZDuILI7IcIKKB4(PEIR_z(S;6ps{?*N5(=#3)2HwH#Rw(6)@|k5?pu1cl@N9NT z%TNXTHgP%E_A?e*@`lYF7mcGA+cTa+f4AMah%<;lO|U=B^xWvZI+f`BZ2t^dk|b80XOb%<|L)qVJti)y-<|SwMeJZDz^e_G3W7O1xa`*(-=6O0FT4)S9=?(XWXNL|$PixKLv`8%8 zb-m?@jS|{S5Pa=&79e%wCC(!|%dXO`gtho0!`t^&UKC~O5pjh3DO|a#{c{EuDHTJT(wqG-Rq(<`{vOn&xsy=;wRebB3R&|w6O1S>{ z?r!y|4HU`}L-nwyDE|~Eu{V%)bC&7bUH-Lg0EdDq160P|yp_r>b-!}erVoC}Bu)}w zTS?DwHh!J6dSB9Cw7HmpYFANzgZ8kF9b!;_K9T`v8^uvm*Lxj3e+1kmRBtR+);b>J zwod7|KU$24i~Y335xzgqf0Co=0PTP9T`@;oC-)rcAP+i)zUOF6Kb{#e=v{hWBMm_! zuaKvfP48^Vprxg>gn(VzA1)0{+`5BcE$Y{#^ar68KK(Oq348taMA{#YT*2z23`2(U zCl+(Qt3kz|O$gvy7{Wa4586Na*209<%HUya%8Xo$#ATZ;kkeMfTSIWD#DOay_yPlq zptr!TQ`o5J{?Jx(P(D6$h|or`(u!>sG^MAAV7!P_v;K%5=~{_t!Pj6Z8*Lybj+KEQ zxr|>!BEMM1q-?y&H*@+WmO!%XSEKZyYHzu7xZhbnQ*=)J4-PiHqd&8zw!-o;X?KIh z5!=~>I*6m85+NRj3@+hw3y=5*ub-SJvLD0EPH`jX286X0FF$1;3A68j?2Tx-d3yCw zr3d>NlU!cE=Rb?pc+6O&-dQnqk1wZ@jhad>1?lcC`Xunrz{rFBPt04u7tMVi_oU7T zKS39S%t_7ho(`O&R2_*87&Ss(W#As1ja#Jwo|P>_rWMf0lk)exw9~3T1#xf zi9F#Qg~X5<{OF`Is{1wMmfvlnM<7Z>4Y5-ME_o!%xP zOyb1Q#3o+S=*uq080|T(C~VN7^|&TUP=&z4hoG`ZaTldQU#A(#@5QhaN{+a6x`82) z%PlcocBT+vO~r&pj>@4SZzzVpOR+4k5|s>zs9cL#UD)kQ#GQF1JRi7qBwLe=EoiL) zMCl0h6*DQ!BEmrelc^%ZyzUWRo9Q4a_W?w@r!aqd9{=*#i9!*xMpFVX)w*CP>2^=i z%`#v4e5S#J?+jHJr;7<}iS7%uSs$@c$w36i{LF}(FE|ub(}m3@!G5W*MwcYUV>Rd; zrTtFLc#)#*yAollWmpszaVp;>4dw28FwO{3!Y`6Ivt6wFr_09nNR>R6c{@YF@0R34 zbUas50`)=126i9lV&H#C9yU<0z8*DA509IcLtFx`?2td@rr_+pbXtv@WU@Fo9U8(k z-HR=Bq2sK+^B8d_Hid&~TT6N^_bfHIEt5ygz1x=rB0ihIj_u z6m%;@r2|8vsMA#_Ex^n|q;DR#c=j7{yF*tPKkQ5Ly$FoNv_k;TzBe~Ylh7GOu?WPk zk%uv++N#}1O`6_#{ZJ@7bCc;C0A6&E;0HFvB@0I1%fZ4us)N4H-33?U6@sDr9S<`U z>X~#`0Vl~Z9$9iPQ}zxSy9C_s%_K<)C*f`^xu@%1Ng0oe@nr5duVsA(^FC!5{d(_M z)sL!kRJ$HA&Equjw3t%ob8l2cjr4{hkr0~2#YP|ASg1V5%5{+MIs275YBNYCk1vpu ziW9t#V*6_B95e#KwaDBqq)Hc7z_GM*8#Cqo_*gIxxc3p*UvH7thBvN@dh1man-Mx%p4`yEJf<{_|r|`I7UhZdW;Bb3dIY zLLX&iy$OH9qJOK!bAF?`zNG&0!!ZrW7w@EAvCt*Q5|)DG%&N&Mm?|-lTXJhFYv^IwIgv3}i90-KDhlgErkQk~_k`1e|CR0KB4l2pz;UW< zB>fQ^oY$Kpsy|(Aid2ZfuoG6BdFm`afh$*%)b*pIFHj)5LKF*z`y0iBWhQ3-4JV6( z$`E#xW(lcfXI$b^B@{+zRe8P*q|FbEtPTD=1x|JpO0BshB~Nda@byXN&2XFtl7_^p zf6a8v*jYg+Ym9DT&EsS>CT)3&K?)9EYz!{44ybjgR`vz|S~pgBz(AH_LKX#<@78nHnV4w2H)qjy&6zroQD%?5u0U0fI>jeo?7-1$du)g` z>x&dbvLgsqa8BIEk(qgCG_?ww(c4m*YiS&V(29YH|L`;tT zE@_MaG>lRsZ+R&Q%qOA!-v69%8h$a!MN#MQ zV5ivcIDD?=L>r*ml$sOVDgs{*37UYyMiAv?>*FSXothWKOrrWyahNC5+%j`0Gcv;b zGQaT9%qu1}nuzcmVrmaEwsPG870M+h{>i7s_u=0#PWxe>rDrxHk+p302hWx*PWi-~ zKGvtVLxvUu4+N9F#VPN^ z6#PalxD@Rk94stRJ>mEOm;*5-r zRFXBlN#*uH^@)UqoT$_CiX3y1L25U6s!eq%R6HHnT*jJc&^al zkt;1N5NAy{yXmLNZ!*GOkE6AaYpU)rU#q7h_eoJxt75ER&5n zy$pgSYN)$YRvOfzg{sc&3u18c;Jx($?a;-IU+i?SLWRqkpDDwr4Du~)mTF<>jxCtZ zx|01D*Z0t@@oI@<(-|p74&^Yld!I$;-d-9GJRXeV*zvcOMlAo>93LK!v~lWuD7Dbp z2MEHAtnKyLEj~&>&}(B=V=@*Ez?m7}$~sDKf_pv;G9yAT(-) z$X-nI{jRu?SEVGBBYMyj(HUYs4Rp78=}%kyA%Y$pJS45ii?E>(HKf0Dw%(Dj&)0I! 
zyF8ka-Jj}OvANA^e5sPG zI=y%9*=G34SVQV#+U!+dKNxqRI@*d(jQOy{5RfBxJ%bMp9^D1G>$cynozIf!M%}BL z4RAcaN`HXzd?gW$%HnNgLyrS%q6=aW9kZ);Ttl+MVi*)gL^eJLG5Z@nt(7pz>s(bJ zk<8u}z@1IYSQW`vGJIqRu6G$hBt-vZo?pgii)r}2!nIywm|_~^&}cEncwlu8Z$uxMucx<;Z8V?Rzn6UYA|G#=kz&)Zm^#mF z$f#N=u@ISkx2Rvj(YwwEcb**ScFghmfzX%O977`aa}HrK#qsrF0+MdC_E_HhFsq-9 z>HB88VUsgf^H6kO@o4vmpoVaVK0jEFmvp;P_%nTVj|Bp(G47H>4kbusJRLaE$cm1{ zR#T==IJ^=S!q#O*3!n%e&`kWxcwy?NCFh=Q{3tt&PEQ}_Z?5l~e1ux&t=Q8aE{ao$ z#oY_2dkob%v%}4P@{qv_=Ex&B^u-0(6Ib z4c>=RxklwX7=dJ7^kpHgw@j+s8x~ez()Jw>$HT(epLg-DW~yIg_CZU#-FCg!Vq)wo z&GXt>ZkFNN^(hs$=*xC~Tn-=){`v!U1@`o;7;XY;T8BC1Z4CZ9OJ z?-f;>k+6jcX{hB^Nrr^wvRviI)ZZ+AYnviSrqbGC9oj_SE z_zYK(95-xk@GUrYZjLlz3?@hFV#&KaB33A} zcG??@-j=Aay{hR5`<`}Sm}|y*1O|?!dgIBg&&TuAa($9Sqb8mV58oyH6sdF5IoXFY z2MeGGh9kSLAsC05*jbL|n=lSLrGM~vvS0tlGA@?O67q1?eX_p4ui7p9F|BXY3xqoL7c_+64~*%eX~shs&{1)~3djaHZsi6zCTa+qvE14AV}`|-0X7AfngC^;z+E1 z*o6h$2ieJ9u#rIP9nrOCpZq`Cz-0!94?33P|vqNb8%SJdFmugRY z$fC!2dIX9(EVk0TPhLa5(lT<5b!|wG&fBSd&&Dj2ug%uncIzjW_d&LgIUVeAT1oh% zWMCBy6^!GT{SruK1|wB!T~3OTaMKx!#+HLW-bkVm#=0yK98ZT72X|)~X-_f3a5B|5 zpg?gm&>h_X+LsY*OZE_3VMX^TgfYiP>BISiGL7*nE+8-HC1yu}CGD^t%E~ zB{6ip#m+CAI^xx;NfhG$&SS)&srv%Xlk>Q@noK}#^rP*4H>SH#c)WDLYi`KxdAB7z ziwlF#g*7>8m#NDSZqbfm`QLG$5ELErCB=@5wziV$PZ=`V1Wg!MLX5m~yqcbVW?|Hjl1hADe&&gHWy-OdAP@#k>X>UFdrg9th zo<}e-2@=40(j&Zip(v{4A{fOjgvn6F#X_Xpj-7fQh?OC>RH*vOOPF`}#V<^+Yn6`>kvYJoy{DGSZM{07 zqO4=s`DZ2c6*1?&bF)}Yecy<4p}R+>0-3)y*g^XUpVtQZr93(0dN7_F>Dw4CvPZ4; zORr*O+d=1-PpI#`*uhukZ!3M-`NRS&R0-rrE&3TUhZf3CUyY#XsN#{&e@@71eNU%9 z?4)N$FP!#vNm*yUZ@7u<*zb_%Rpo2>CkI7+ris(J3CL()9B7{q1Zyifw4pCo#&~Wx zef$+EPg0mh$CxFCot@^WzSlbvWflu@#i{PU zvp@;zeYUo(-5sQ_SFq-3JzTsx4)qy?t={~04Y^x1xlehm_a-DV;!vV%g+fav&rFP0 zi)b^r3&A?xZpV3$N#_`qWMEsCk35fr_%pX=DhGk@ebbi^1n^-*q(jU^_F(0u^|P6> z=(dM~Ip)U6k7x0=Z%pM7!4SF5FVL8HvZtlI3yn7QgyCT_|B#6={9=)s zH{#fKB%%OIZc#u)4;(BmLAYv4LsJ(Y7_Gn96_LZ76O!g?yu-2i9iv11+{%)YojQbqqoESTjHDhwo|( z0@)m0PD3H5(xarLPH%Q|T-$O7=ByqoUnGtUtVy2d1zv*r$Yyb-(j>9r7#+ZmYo9w9 z)3@ib5KKpw{ny_d;q5V3UY6Y#GQEO42flK&BGm zldZcSqo_+8IV$I3<;Nd3FN#ohv&ml6)#Ii3s+{s->aU$uLyCQ|~q0zI91BJ?xY_J=y0K z3M&=Z0#c4kRUM<0U?w+!aIYh!K%*0H`n;-a`1AtdQ1V^b9#iNeLqXVFyR+-?n<$tJ ze^7XBTO4UkT*jwbaUFiP30Szxd@QL36x8&@w*oPr9C;8kr1D_ioJ9te&e!I@o1Nc2 zx+#0DQXuv7QAesM8iD@gX`xB0!Ha=96nIIyYBjK!ufDo8-&c*zm{0+#{4w=@Mjoz` zb`8a%T=3m*#7%Lzmsid`+_^Zt*P<-ys|7QB%A^5+x3NVgS z0{C2*1LHGm!@!z6$PT`gaI1-STU zaR|VEGJf!y_}qTSQQJq0&eeJ%Rp+2{53rsaQ~+!Z*qc8NYX=|icO?O-3&&`KnpvUr z2;rw!7O@Ph0NtqoH~Y$$LU0FQ=!HgvHVL*v0M!X)$i4eLQ-0uCwryTBh{ip?G%p~o zkxa$~kN^Nm1QmJvAQipt38uWDN@|I_!#aHvcc_x_EdZBexdZx|@@M(V$J1EuE)3ph zrua`cpSmj!AeG_uS@GWm6cHgnH%XJju7`L>oxbmvJOMI@c=8iTfWwG?e04hSP+mDA za=;$9^DKaQ(}~LDB@ekpcV?RdWH+9`1$GfgJskZGI5zK)-WK52+(gJ)rvlgf5>6H|Y=hF?j&`KlprDP^r&z z2hfV-TP6y}O<-Mc7!FD2pW*E}B7y-aOx^x5lP}JOe4s6GY0a6X$*KhNHz-QiD$b2M{>O(Fg%l4~wqt4^5kii&BiSOu~TPC(I`%cWJG)jxhq*6@{DtaN^@RAQ#d@pKd<{1a$4r9e^$=09XPXZ;bWQ@-=<}DvEbw|D!-5 zQhVw_5xZLQgz%k|S--cD+H_FWaRJl=(xSP(EB>RxyX3l6+}GWEh^$y>L6)@N#0cNB zwVz~5?*w`*0iEb)jiG=MowrI`21G%IBaT(gj9qbtRLk;kfGTUP?e5t4LSxt#5Wpyk z%5iN8ux~E8i;HCwjViPNZ-ctb@r0rwuGc%{kgO%INi|^&t~XI{(JbiHgaLrt70`F_ zYX|>+vs#Yppvi?h=iIn_{aB(x?%s_H|Am+_C`s%LMR^C%xB%p>#9eF8LpasFnN;&P zex7h~)dE|lh1YM6zBOdbsqRp~3@Y6x)5zA!gfYMQ+@b^neXy0hBMFI@R=yFt7?TaT z?xMo62Oz>BZy) zIjjO{*(4k~`%lo~Th`ugub3p%Ks`}%O-73NAq)auwO*=_d()>|eY6h%1S_g{A%)IL zJQq4s)q098wDuO6(vLdmri&sW_s7GBQReF|Y{{Wg7RQn5yiQ83BkqThd2>{q?5m#cA$ntf$p{qAKa>rnddy$j{b zFA=OUKNf0d%JC7ZKchjbxziMnfW2oL^0}ia=c|uT&u7V(zLvF9xUgBiIX3U@zq2VP zObntlD0l}@@1>QPR63YWbZOB+eS73&Y29#*Z0K4uE z8GWC#%~XaTfab#dA}R5Cb!mj$;%7RY>sxYxe0D8Z@z32dW~pHoVL5>VtYTN`B!LIm 
zCOr%5KW{PD4GHr9RP0~{sLE?jr>>3e5F8G7GX_2Zd(yrnvhA+2X(J90R{1694%0*C zYg3T9R922-fA5Wh48Fu7dERViT81g~(xm}}K}Z}+iJ9?0wb5R*)rzS(E? z^FK};H-PHIbCWx{D`tc~O|>(QLF9gb?duS|Spw|~ublfG!FFk*U!5yhWhVTr?__WU zy}7XzKrS_s0M#e_W^zGB0$BA~+-JD}CW)=w7tZMvsUu40xLnnG_CkcxGUGxpfWjgI zldl$gJz3dubhik`OSoaXGap$g0oemwDDfck7BM#H&ax|VY)I%==+A~S8j0^(;-~nr z=YV_+xyBG`QmW*R+6hRul<;2XK1(Ioj*l~jlka7v)3g30x$q6*g!T{YWqVg2AtHeZ z^?zegh@>O7&-bx3Y5Dtqj(!o@9+)yq2O5x$8PzlqKb6e|R6i|6pXj>nsLv`gUUs3% zMz#K!0S=gg0$weH%5;KzGWhr{h||kOHlca~QKZ>2vn7>w1vaa+Re-_gZh8h`d$LZ%9IvpYQ@R{QbZDNsE^C=d@n`j4RK zVYTe3=%s%9Eur)@kGXMq{^LbyII=!PMMs-(W<$-cCnvr@`GzdT&3s=J;&t(S>a%yFongS~Nr7(K2JwGg+T-^0aD{#4EFLnS!Oj$N z3EPdHRFjsB9fvXi&WE}W+X$Gaq0zRe^19)nS!{R%D2R0(YMbMx5E`B5VSOqG7*TLB z7OF3Es+E_<{RQ{iBE#PHe7%974e?0bE85J~1dtHqD-mNX=xU&(bJ?xaNy5C{h141l z>HKB zZeZL3bu|Q+y>z2HxGCR-`fQ-l(uK@(6g$?x(UZR4D^A+jjKAf%ByYh_Tzxq9C3Gc< zA{!sIP|c$kGZU7Kr38t`YrN~p8x@A#TOsIh<1fi*F1Y1-iF084vL}l*W_2YBo1xU@95d>-4vC`1XlaJ-M<{wa^VCb2OhY?eRndEF z0P&iq+)Rd~!eQ`=Z0^h#ki`OeNX79lvPaj{v7-62nglWAh-J12xWf|ZeF7d(i7q?* zH-q3D(-b#A`({_rDd$|F4vc}vS>ZQ8uIe6uBdAZ%vKNE4L1gTNTVm(0UR!?5_O(^U zszQ3>bs=-0lgs#VR5E|sTlfj~)fZ|k z6#aeLa_Zo_0vBy*jM0F{GsJ$29|R&D%f71#B&OG=AJOq@8`eUC>F#Z}rPX(?BL_~9 zlEi}NuxvK#O{A@QI8ge#@sXtFYu$KlDH_by_;c6%oYfadGjsch$BNaWZqF=0h2XZ9 zpYM5dP)lUowNPvl=L^4BoWq!_-x3OR@LHGQ9LKwoW7j6DH!KlU=Up)AY5+)AzAEHU+rGbk8+*5xEipqJ4NJ!9Ntq`t=VG^Q-T zXjLgnV+haQ1acCMwikS{%HA^nWRkJp(~v14@! zw-4D_20_1O5$bSv;3u?#78WHS%F$>uV}0QAUc9$SS}OzVdy=Vqxz&ySJqwXERGh-f z&6jvPSsxj&GIDv%he2dSW{@*VVDs$}frvR<4r}==zCB-J1rM6ithK-cpqFOO;m5G5V(?yp6Xr|XW(-{k89A4@FM zF14iNx(;8zb<9V>*oqjc$&RNppN<=puq}t~GlV#CZWh=Z_!-pr%(+;cP(sr7qE$GX zt-4Uzc2?8!3^1Ytto18>3QE`yt;tlJbOFmkB(C_=$jxzY*bJZpj;d*F>JesBHFUvG zyuxfbFvlHCSBW4$=(G`Rs(CSl1RfBY*42RuZF>~VCtdhuao14(f>Ohox;NcWmU%8t z#Yu^r2UZ!>=RE*^A2rsM5EfKR7-?y%O_-?eB7cv;fxU6aTr=Gi0`@#i%0AykYZT*(LFV?Ce%V~=iP>OY%UTB)dwfHa0Ussj`;q&`DoNJ z;FU!TH^i^6g98?qtD(kHs2R_#T8{{l5goaW+$Mk;g+Bns3d&=%x-n#_8k=|U5ZVL? znQN8N9Ib_{QY+?6(kibvH@pG?lDK-*)91J;CUTkTM$c;4)TCb}_b_-IO3(18aC9l) zpi%%|FXpPKi~;6p7ujGZFeJMzH38ePP9e8<+|iCG*~?2V}KD-g)cb)<$K zBYCa+faBXN=i=Nu06!+%QSdt(A2de;^ZAJNK-he$p9T1tCLYR=ud;BjXn_61b694& z`@#2hUm|i)u4|>iPrzLUTGe_#_qyF8f+#?py8{8jnnL5+5%3pqQV1(wxFDo7scpzCHThoN$>H?mKU9NSX4yDV?I3?MsOl)C!Ku^vC-Sr#`NcRfbb39hS$D_;s`C7N;s$gOI|%Z+?)MR znMhymt>N#pBq?GgAWZO*?l#qzB{S|dl5pLDtRmM276+nwV0{c|+BS>y7eLG`Ok67H zk_PLXzP$dJ2LJkwC+}mJL}q;d%B=f=9@3c>f+sEr8>@1NF{XkwzA*O*~r!Z`K&Vr-m#=*d4@=IEY6%vj=ukkD4LON0bxV zD+qUJyaHgFbLY=bckVszN7uvAJZBf}6Kz4tb#IaSYg#&yVqlr_6r1sd$=<+#MTvUV z=1Eg=>9C~Y@!8_pYf-k5z;nz^1}qhSijyU;JFtb8k}9qJmo2t~0FS4dPVgM#BLKW` zm0(JIKsHuOO}WKZN}>_el*8AUa-8EykLgiG@97FFL@qNK2dug6$vK8{!}&!X0XEOo zT(mg@J?|1y<~hH&LaW{Ya>UUx3;4FdUG{4X}imcC+jl)CAs@g?(qzlKb^JP!{%-tbr4>6F|o!R{`iR_ zvGc`AZ8RKxaNN$dYXk%Ij)*r_fW`|6BYvOznWmg-Z0aTH!n(pLMs40o5YP9_uHWK8 z%U!rJ=lJ{~y9}`GDkzh~K$OSpu36s`BwDrnS|IiM&uV5cDwbYOBA@a(u1Dk!y=!b( zOtQPCfU)#aYt%eZ7Jg$`h&*D-R;y?oK3Fbs$BTw0#1Ug_>(?Afbprpe)F03b^C6Z- zYa^3j1fu#~2g`CyCZ`IAd0lqc9$JnD3M`4LT3n-JN@fG3J5M|$dw)URvjji)lz|>H zR}Y>4+EvCsY%w2xOX7NnCjyzsI1g;L)h?Yb zO!DS6TAWxVl5XzlJeJPct_(kgWzp?h=Ku$F@EgmS&(u8Mab}}!e($dB1Wt=2q1x1? 
z+Eo!x?35rbPR#`_?5HP}cwLrmFJ&G=xIYP=!*nh%mN#*Oy3$EZyR1aD5mSUJLafHX zSG98L^oloC*Bq4Y-snyQ1;M>if3lL#fhXI~*eZ8suotD5UE5~5gCSi>Y)sUqY9#;DTLoQvmxQ zvS6_yG-b_X{8O9SFMXWP-qCj&iJnJ;4rsKmVk%1#B_FSg_>}}?^S%cnS>N~WE|nHSZC=MGg2q3V)DeAvHU}IQ-EPUHk<^wxuw+`F}^}DR{NN7cm8N*5auo| zGAe`4AxF2P+vC6>kXvmO<$cSm6<;IGeZBdTn3#S#yA^(%S~kRT(Ln*9WaW6MZGNwI z9gLSpzm#N`4H4e%iFMR%*KkR@xWoQJnX#k_@PJV(qvu~nj8l7{Q@a2Lx!V!>nU<+G z{lE{=haF9H_KaeBeKW>ki5Te{L~3BcvBf(6ywE$5xLOPE@9 zsLyK9)ydEo@rmq$tskt~%`W(N8t)&*aJ%^yKfTm$9#$LFJqooU&=W~xeczAGhrFpz zPdyiB@cp2FjU298*ly{lx#i3t;Q6~GaXq*XOl!dT#RDLy=bLbg7G!3O+I-7w3S`OH z@iKE=Ve|m%BBW;x|H5&L6am2)UcE7_sofx1WuJZeFaskahe5MyvHLO2xdp3bJ{|XQ z8V#yYWR(2)L#p7VY>M>|lR9|Tr7weP?b)BPE7abA!hGC7!OMPteQs(2(eW0r2N^3= zcbPv-bsuQ>1NQ4&$oCcbmOmMX0Z>91xFGScm|!B`#;QWPh9!rrci@U$DD?*p&w1t+ zg>fXzn73xg5#{hHfF~kr14B{4u;6MguZM_ckvt?_Di-MAsFHErGVb?38BV@GhT19N zf_}s`Q%W7h)B%1LykUI{>D9}!qH%kpVu(3LExeLY*zMe%TEOY9gwc3)K&J@k*<|C7 zskl|8*}(yGD|52z)EEj3Yq^hSJa3A zm&H;a#`D)#IWz*;^o;*o(EpCnzt+|@S&qT~!~%GW@V{&7|Jrs5*TIn?35fa$_)33m zJsZgToJg1=C5dkk^k@1j04@hJnk_SojUpgnQc1MtAGj_XtC)=yNLLPzEdc=&IPp#p z$K|hyI6{x{+z@J7?%&j1q?@uzN51T|3CP39DI}iAh%`Uv=SN0WD)@*{FU@Y>#nSNi zAh06_q8#|7@SFZUKQy(0gMh(V2%7OXpu!O?@NnR)-C_HCqP%zkJa9i^md^b>KMGy} zjM1X}Od!Mm#88k@23U`b`gH>1zcCi-!Y3p`JW z4bR{Gt)Br902oX}={Wsu?Y=EtF zk`s759zT+=L$yg|uCx0x5P9GiLeyQF z)B#Gw@rqTj{wwck07^G)^^C$G3G11&+TlpyX6pjne~He-{!HoAJz};P)2;*Y3wR%| zH;GBAeq1gjNj~~@)ka1<=ahYqG5I@s+hN-&{s5Sn38t~a(=0B4a;*Cu444#Zr=LD0 zA<%~Es0&*I!16ObB>{Nm{{qYfK7fOb9s}YD#18_!fOqf4G5A>1_L8Q3rAKI_!psJ;cm%-?TK1zh1OOqX=gO;5PBLDymMpFfYY6F0{2Vp3Pto@IIO`@f=R33Bf0kz6Z>9N;y*3)>Fv8M9xn0*R>VD_OhUZmFmtF3l4JNHbb z*pTf4PAAWeMkjOt`4pIP{EWZ8m&sRVqw>q~R-pJ_V1h`_}0ip+8vI89r zaqs)nUVuk`V5SAQ#2eapb{QFV7Ec@7W>qC`U|>!&nS`SBRNtxd^wUqO@(!n}aNYp? z9aZ@^lR&OIge-rI%(`I#V?d3Z=RITC#gNicL~RqPJT^RZ2DlZ6nz3(xW?liJXn^hw zOSQpadHYwg#`CXPVNt!_N_n8s8d0cl0dT;K=FKbLR<{) z+LHNUMRiDOY+g6En`N1Znx0|gz~cUrJ~>#j{vDe@ZXU399Z~DGq$`BxqDTy(Xa!=0 zGLZ$VkC%<+8vg|J02)(n|AP~WOx8;D5KcDdjP=5GOXco_y3s2C#(K#U3#_x77% z*D;pxT>zi!OQtcxiY7m)HShT~u1GZuy{c8Bt|xfpk4msiUhu(PQwlP;#aF-q7jF@P z?C&2TnwP>2at6`S99t`CTsYT0+Q{VQxSDXMEwjyj7J>*e~twav-fA}*C zC?1x9bN|Es=5suhvdZ}gKtSOG%91X9mOI1~*+ZmvFlKA}y^-NS@Cb@E{B%Vi0=iym z3VxKg2LL37*ofiIm}Lfc97CFNC&uhkIX53CWuXZ%flT^S88@E~acIE-GJ^y(+64oI z-aXQ~A0_}J&<+u;fohUFS)~53!g2rLJ=3upLzg%Q;E$VY76R(rFX`>#IU4$Ik*{oW zoO@(u0l`Y*{&Le5&+Y1S(2ico#}uJwH3F8<8jSAutc-6prTM`knR8ec3t8VbEVMYd z^dSe?A4LJt0t=W;yNnbS@CBYP@`-r@*6QFoQ4KW(A!;{`?al$4!cSf#sG!%ktXoC6 ztclths%W0gF7gN9)Kt?*EEMrTc(H)(ylo;H@PFES>!>WZXm3++cuG!n7FosX6amKiz{6pb}qrFOEgo)uGE+XwTenmFRRcxH(hx6`jIx|lbHp|8OF6XeV1>QdQ#S8VyF$YX-967Zp4jv~CfS@#gmSpuvM!33 zf$7iccFcaEwZUkDj_tD zDZa42``3Xh_nT`ZF>zw|1Maa3ajMl(7=*|Ttma13d122oMZA|ZDJl1re33-lX}s?G z6!{DOzK;pXwR6;?S5qrOsU-$igl&;Ek}po^}a8F3GXTpS`^qXS<{rWFxO0 zm8Oaf@%Gw$BMk#8gr&7VcHr6syFzCqcB;2fFn-lk`%~7RAD?(gVv%18`6)K(c#+Kb z>vK?J{fxCOzQ?ZzcH-1*dtY`adW|EcE?g3daWsii4c*|5m|+mVIj)mh>-IRu)|W!& z;>x74Ip2yw*Km?R-xD0kdS4*qIMqfE7Rgk?sfuh|NyJi9)%kPjR`!OY;%msChH=@6 zcWlO?Pf#6Rp*v3!QRRlV6j}-1T{Oq!v2r>4E|RqA%uT|~V3x7-K|?QKl{uKrw2-9q zsN>wOW@MJelOd4ZQZSUb<36*VXEbnKm~#heo8X+vx4f8-S7+&_#- z4^dL|+Whb&?{z7ROt%d8?ROQvvI1M_z(r`UUcH#?=a_NbwC@c~1yn6-k5?kWp=1Or zb`VmAsS-q1GW@K)zX*fkTUN~irXqhr|pg!-NR(*|Cm5@g#9W zd|6=>!c&p8)Xh9@jYf~f8yoSpGPng@O%~ekQ-nS}kJ#T7`AY;($E1VuG9d5vCZ!eC z&f@E0s@c#BA|$r7lvq7E;VV4lVJ|*uyhw;7-@OpWM8f`-P-}LOJm2a9jyj_$ir(z@ z<3P{rcZ+2)&lQF=F5)=^!V$@ODmR6j_ks<%mZ!7#S^u|QLwroL02VXuU5?+?943f% zet*NOnkss*?%tFclwgrMdCT>E|G|k;p#xRRxN0}<<*~LRPxT!C@9pBUV?763Pzeiv 
zldATrqEWS^->Dxxr1wFTrec2IC3UY}jnTJ3^hZgpMeVyUu%yh?i{*b|Yx0Zb-$7k`-aY-cJt;^RlbA8I6bBARhbKR?D3OOSaBXl^e@&>{Vz zkI3o4_a~w{AvP~0M?4*#O0BQP$)h$!iZXGt)jGlEb45I}chDsalS0A3?h@0w%oA>A z-#q+N8|pRcp~N?2zXygJHS}`!;mA?RUYEQsCK(GydWYGQj{+DtEolmled6OA;bYTe z1T88_-SG!(YfAmmrwMqL4Fck*uko=yi%Co*@s>i1Wngzn=JpQU6!)gSU}qz+>9C_A z@7D7=p#WgjS3?7B@*S>Ugs` z#=o%vdZ|FD3l%vWuqU7?XmmxRf|=BqxP;f%p{F#g{ege_su`zEdYYkDUF2l{#OHaX zsQo7gjkE?Rl2tC0>@VqEWy)c*(>v(oA2cYkmp!w#*5qXEW!K3aqc6`&KDDS#?dMRd zTeX(Yn-g+3l~QiSXLfkmhsocl>q1Gmj#~t{QajzdY=$KLTJ#`0&zW#8o84L#Woq#Y zJ#>J0dvRb|b0u(}xSoIGMYIdbz&abGch8 zU#3ep^4Nd>9G$1imzq`UHdSNkW_)IRB_fWxt|F*=k{iXI4)RRt{4DB~lLV3{-iWf7 z_}Zvqj!!4i)@G!e70`|-z5L@0D>R{{MV}Vb%IA?vQ;z;AQ z=i7W`AT$80Qn3o`!*}ovaPF0bLs`62r<5-y%m)-<2p@Ph85ulY&R2^Pb}b5(0tjeVDWi zjKT!tSO?cKjnV|c0aAD^j!5mZ+9%=hu1pi$=V7QqpX6E(Lfy`v0Njd&nO(Pd?GSk!9*&6R*-*ghTOiW-QKKN1EU@ z#qz#z4V5VvzH%l)?R2rzqs_hGIGT4%Bdms86#BbeDdg*9#fiHpOH?-L^8a{n-IP#bk zQ9vakY)V4@-Qxn23eQPA;yN7jTY1RwUg}$N%_@P>$s$$%K0TauaS9)P;l(&d)GX>F zp=Awj$5nB>m*8DH6H!U~IMIZ2LP^arOjHu1FUEaUI0FB9QdkC#WO$C(C(it1VrLa8 zpUJP#rhfd+DZ=nchWWXfzg?jOhWWoXRLOW}PI{32n8mq|Xesc<`2G&|?eoBSFHPx* zA2$s>;*PNlbx%EhaYlQD&TdO;(m3uTKNQkjNBP8O9P>(CAcQ#IaVA9f0ow&tfK#&jfUk{az5UbBMcGrL)c#is7wa`#Z*M_3)Nl`vPRWpv;yJ`M8e(=fjR}M za|#Z?Gqo*~X~5VbB)jZ2>dbm~6jBTp$P7p>kRUf0hI|i%nF7wj1ph@vJ@yQ;#>Wd7VHb!a+64kL{R>L$q*Z3rXdG zBp={@hZvJ^ZBZjHx|wBx&D>7E1}>tx*8tmh8StY@1xm>tLaGGD2rf^J2hbW{1=z3f zhw&4G`PW)_t-B7C9es=Poy@Di_8^5d$##J4u802u3H>t|sdgwqD+g==QsI?$P-5sd zjs*-H2(KjS?QY*b83;{k#8-R#RJ4Yqj~`D3f*Yae=qPAOW@vWVnjuB05b|lRzXiEA z{`E2$2bNXft!K4+RJk|U=SmKvPt*E7!lcxlE>?^{*bkKrA_fOX{;n;hheMts#n&N= z-&dVIUL&FhSg<}>dIlV*0qR6;FBGG8RYO^=eDxfP^QLS&QbMl47dsxZ6g~kzTmZL-t&msFmfTlrEQ}@v>Bm)>g;mWt{ zTR=;T1fQq(|F{Y2^!Da?!UvXqSF(jMq**`dRcS~La+dKTWVC4DYs2`v-Ea~Tpmc=V zB3vqV?e1>@i-K2<`0@kf$xAa(RP|;S(cuwrmW_rUv(6PGs=H0mIOoMC4MD0o15B_a zK=OSLv0Afa;_GX7E0huu%Kc1r-eju60h}kKE-Vh_CQbXA|hTLsqPI9b5tSSVppB4AiHVu^Zo5dsO#kR!fa>Ypnd7WD%w(^h~s zjquF+NRby_wo$9Y4TlZs=9W!Z18f9QM9R?wwe&N(5F%Pa2RgRUIe(zu=@`Z}cySe` z7R2Ja5Fi0|oR#2pq$=h%Qv0a9u4$0FEK%IdBf8r)a2HDH(nvJNQQfjBijnDi-ZxSB z&r{$3tn$tA`jBPz`-yR%?|-Vxc}{X_W9$K?OS7;cYJ;)(w5?2~uk>@lK=r#KBccX% z>-4_U?_k3!t0E&*x@4MgD}KEktR}cd1;Ga&&j&H$1PB*EHWXLPZG4hd2|y$JO7Y(N z3YAy^YIdKzLC@Tb!zTxX=@QUKq-A#*9k@Qi+Dl{ZnW_b@`80E*&QgNN=$N(i-6_4C zUrdndDbRAw@EbcMwV5#RUJw30bjeHVDVYJ1FC^(u(?FlpK{mDds#qy=z?GA|gk>(7lofY&4OdiedoOL6zW zT`1Zc=~QRd{|*73tI}a$kAi_}4TQK#iXD=AQ2psXe261qReu9vp_tCf!-QDO9p(hDq%-?xXD|xEiBx#dJfX*AX&~ZF( zdRVb^wsbdEOLCt1Ts)%%K0hNe#y%01TQ-k&A+_L)@3hT?OUX@&Sumr`lkwF#zv~+z z?n?14T5>Vgh|A{a=p>=6?nc9|wskB?rFsu>FGxi`3=|QRSUW^lYv_7?V9gQ#P;0rxbWadYE<*`KpwbqWL)68fOB%6la|7 zzhoWrm{)I>)|?bU=dI@=OflrFV4gmsxvlWMDD1K-tF8;!1dfTo8ZMj{k@wi8229>R zyJ4rJKl-BSdYuE*T3uWFh5^fqj+vlzZ(e z7`N@TH}>PG5B46t^}DCVr>)@-8M=8RS2E1uK<&MeL6<(M(;Y`E&!^-F5j6RX`v!7S zik-Tdt?(I6rt_8BNC2mJ8ac+0g6A0tsoju0mV6Qm8%Yc;4Vh{jV9)Qp@Vz0`L_8yt zCx;d5RtZC{`EYWD5YD6miDMbtDpwa20-={)2vpP&~VjMj^cQjqhtD z@ZrDMFvE0z8Nyg}cWV8K&P6jt}8xGqn?zy_clX(^VdKaviu^`5nuP z>6ur6Zl_QJAx~AobPh-9(bJY;mgens2n-PiEf;v}kd0Vy3k)g2v^`??ZX`^YXar_H z3p2)x9X>0P(k~Z$ZAp5!##X&J1Y7cHK@j@d{D{q90;7Xpo7x@wEYm2x9-w+x?k!#J zjs-Mv9oamx-(H8kgCg=eHsM{=|L>pvs_mGIMX6SuYEQ7!{Xow;;|hEcI4md4Za9K(%p}?joOZ5 z->4GqOlV}LIQ!QTIwAodYreH@`Oml1ivjD^EL_Ozw|Rd<{{O3ok@x;` z3%J66>_2fs(={=we|_k&Zs_8Fiwr>pZuTKk#tef4NX#+;~%VSGRPrQ}wNv);VjBkF1+) zVFt$CgF93k<-byD-#Wv;5@*<3}#OzG&h;{WkjS#_BaDToWN>R%~FA>!BK3}*ek6+=@6rc))cyz=jq zlfuNvD9#A~b@XH5`B{~XWJbr!$tN6;83(6%)baBwogtQ2XQ?M53Ae+jP?OdBgD{^ zS581`FkG9^yYfuP5Jkc%5S9$3@LFHH2P}5ZL$Fi|gdn4kgFVFCJj_}4ZF|1NI0)kM 
z5Xh`YKsiuA6+m^%dO@=LC*2temM~vfm#!Rde3yg3QK*I3|2?e*Z(sOI9sM%|P~Aq? zfg@4dI_omsU?cS%Tm>&cov_L0q(TJDbr!%KK|BO1nFOhjZ||bVYn4&uKoyuwEe8N6 zKE!?B3-*aNDT0NdQhk2PX;MriOvMn17;iZw&kbn%Ui>E{TD;>vQVzQDJxD>U7G*Uz z38-}pmoE4KxU9?U5yZ)eh9UQ#by3k^xUtG43k32*2;n>QrW|!b?6&dOXQgbwB|K;< zv4q_mM79AHlm$22Kj?S7x)c%w0rF0`MgWP#1BQMNS1^2rgZwQ|rcUFTcgjJaNP-a5 zGJwQ_(wITkFf}CQS+nl-$Hx)QaJ(RlWhY} zb%+%>3F4&#VYI}GXe=Vbsv(43#Pr0L4dfMU3Gbjkd|6XFLU-?K5bLGOL8*Nte@~lz z_~UN&-Vo>!0{Xq-u*Q4&Q=7uZ3&*DrhWk~1F)_w_T=H{7w2vHVa58dF{HQ{210dUW z*<)M<%K8wlzJUBNQ}Ks)e<##iFAYd<=VWCLouj{I*sfw=sF+~qlaU8aapeneA~^CV zoM~|QKLg(I9&*nJ54AsOgR4C_aW4fMGd8NmNGcbH_|_ zimgBIabp~IVVXKAR4Wf-95pw0qt&L0{d^)kz=cmgTUJq#&auFAy6-(~qeN%H^Sx`U zNK^}oQBE_CjCiq7)03=HBfhaqCEjlTB)1qO@ZT&3hrHH|?k6q#R_Etn5O()1L=)~< zlfCtbV$VpWyBklz0Ges*o;%b~T5l9&c7f&R1#Ht$sJv}WMi&cWcBHaVE!3GGi=AgZ@;`P8V z{p(N=V&RDz6-bdbp-TT5Iia0R3606jf}a?h|NEKhry3s3;b z-H8M@JiED~Kf8t|4m5d&%z^;yPhFKm@!SmgL>Xbb2$9O)W0e}p5<}jY@zxt2ZY^P1 z^=L2?ic?p8*=t0dWbNVHfjmilp>e2gfE4?nKn*jY^=|nu=eh_#D+Y{O2Q!Mi%>>wd zlao57ioE)ddSVLOTwHyrDk$%XYqgNS=AFW#bsbwK8Gd*L{XQ#TJ=jb;hDsDldl<%- zfrHEl!l(kYX$yA!WQER|lXJq$)W``uQ8xSM+ zRQRJ097Wf|i-9mv0ery>_#!M7tSlW@=^}H3l(=xC`ggccq>^)vE!%=bjYujcIjk?v z6M-0-nxP`H%XwG$BfF1&{XmpswtNI2)@$e-0d?is>9YQRN0d8ePgj$N0M9T&f$lCP z;hQ+9XgA1%@X`tm{xruK|q(DVlfkP?|gdN?VOF)@swJ_(g*;I}P(Tr01VC&q23 zpEwR%1IX`$8LG0G25<^8zsl=##`m1W4)Ze_)JCx;*;C}ce_^Z9mb^K(Ox>jY2o5j$ z?6M{9d*FV6)=msWQvRdezlN6CYPxH5`dr-}JT@J!;UT=W#CKQy3Y%-Jy2FcGXgWn+ z`_5#k@&S|5LJ(Le2cDiHM-xsf*fhwUFe0-y;ujFDV>lX##7>pEf#=jAQ|P8U zqt39`NWCHW=zu#7P8A!(+ERTxx@3XZO+15}?>(90u4Y;Q@Yn2GlQACKI>GxFS$RmW z&v$Q|_ou|h*gT0U;Y|{oGU2t#wTDr+MkI@x7bhS#4)1Z*z5r362Ck=-D)bBCKA@T=A%j7oH+ z0hRnNMs*w#*gzC@Ru7l|LR!sp@O&TbJ3W7)(*v+duZLdE{eAvN1Ul6XkeM8JvLzcj z8Sp)?^Z5Ck^kI+6)7!Jr1kp`j^%HBfsNfy7qA?&q_@3zz@t+u;J%hUGD@S~PVh-HE)kQ3sW_rH@_q5M$^ zG@F2XANhd{%pYnNfgAo1?CyKGp$V}*+eZ*+1+ohgT>p&61ziZc0uU*IqbwFQP2FB4 z*Q~@@YzkGs1hzO(%?^(EBs=s=7`Xp*hGF`3pQ?ER7D4CBz>o5%zb%JABbXl^gGvbz z2-I+{+ofyogTYM%9sy39P2m2Ggmm1VJFwOidLKoKKN&N2zjI4eSuQL7yX4nb0MjFW z_9+vd6$OnpA_&KWtKI-D3&G82&zusa%$RL1tLAH>Oi^{Vqc6tz!n{bY1NK+Kq(C5^ z7*mYjYb0y{P^4*F`<~MQhgOgAru$2zYFEDdeJ-v3GOC0r_76GKFEIt>@z0U&pX^anKvRhK*@dP)oAuxSFpOQ+~6r7x19@Cv&N)@8{$_y{d| zMCXN~8K=Z9#+QW*q7kb>ATk8Uk{C-2b^~B@1f-njKV>6`e|9lN*+ljWoLk7fYv4dE zLri3vPm1p%R2l-7PfIW_7M;e})9aaX83^sW?7H{|@cLCx$K~x|5w)2iv5ZHnDKCU1 zkWnb}jUhZmwgG%QL>wgDGoH3j8n&t2v%nF(02anSYMC+25WDxy=_>^l2(mG}eKF|( zyEiIc+w?s0yDeQ1mgYjk69PO9z&KQ`#RBZ}Z~>V^u06$}9z$*dzuMS2a2kI*#5@9P zP=xFkfcoX2?#?2hk~wm$!{Zch>PGcD~@RoP8%* z3aG`oV=1ee^W-x!z0;jRrMia4xZ^ZjlqWyr0MjVA z^3nb|sr7E2?lVa|(;6@Q_x8s4lN>=X3KwTrL3{S-VM-|?(4|53j$LoJPwdCifVj1e zPz4eI4+|rv0t#6<91wqt*&ql^FUH>u$wS*dd3uL;kowaMd2O-{+IP#oIX~yx>en%r-UM;I4ac_8pW%LApRSP9DDtWo`E66M@ zPAlN&*+fzQncjql@5)Gng()?#44!s9`xDUSQimH7RWrTl=7rro^g@4- zD=d|?=a(}qJm$CXMJjA12kfV2o&jm#?$#EIf)h>fYa!m5yHvT#yV4#fquHN(_%` zuw&vysU_==8{lE#nmh{EcP2D4tg59+ZfT7x$5szBARs%-eU`I$ftXXbvba`nUmL&w zVWg;QQyf|)>r1NhGr`Y~JBPcxKM%i>5AOEcoa^}7aZ6@NHg`$K%Ia;q=LZU2lb)l8 z{7jdX%`xg2?~7dX{6(H0T0EI7h=E<_V^h{;vBAm5muV-&#-?gzpB8vuDprNW40HtZlfBHg1)8+_TT6$f$%t(u-ZN{GjgnM*Mqd;Ro{8n33AYuk;ySDOmm_mHhFzAR`yr5c{XeZLycH^Zm9{j|&K0KZuWCl@>+5a@T;0M#weG4Oj;ZM&SU^dJYrd9qldB_h_ z{8c1$o8`}-P=whHiRb@r@jiBetiRVl8rvm{!_NR#TqmoH+}+(PbPB7Cmy$y)ms2E( zbqzz&dgHN7yzpX-T|JO5k0@_5vcRifT3NYJBrCf*h-&e;n3 z`e~x&!VeFKXT>P1?v*|4Iv}nRuL#@jz4{HD!?OU?)P~@3SKi#e=I{T#)$Mgc8jY9> z89DhIpjN^x&zX>fL=yXEEV!&CYyHF~m9_U-$vV--G?!L!3F(O%Mu&@uiHQ^4A>_C* z$Oi{PJLYZfXFV1aUe#HaAA${|m-Hw;oTrjm5JVGgJ%HKyd;#MGk%EFm7+7MBBDsu8 zh`tU8UGAHxvwTgZSf-Y^Jd)4xoN;jZj&?p|rgVQ47k;mL0>cR1{&aoxd}Ro!;!o^1 
zxl$_uMwhI9Vi(>jza9|tZ0qq&yjS4j!a=ZDn*rNb1D9_+5O`J4duk%h*Z0}ilaAe(ixj z;qQW&I@lH>(G1WS9BjRNf>bwcy!=dGUq5^b=MvS0$am(xe)W7G(w;q24c=Ru^l}V& z5_S6H$B$5H&7jkFQDbQr`7d=5G%VGtEK!bb8XXx||hI+nc8 zvP^qaAm!}UFnkX4ALkT9$}os>gQQpZuFw}n`yFd;+lB}AZ@Z-i_)7?o@nf1 zs3Xx{p$s!DVR_v3ndWCyzPrmK%8A{k%?^G1&0dvx8Rys<97giQ-Das~Y+PKb=6V^x z9f8Afy2aVRd1XkTJ7efRbU40Bo13Nh&WVdelLeHeO0a&z@z&?N=cmWv@|Xp&D#KLs z;XZ!Wp;AuS1|;;96pa zHn-Q}v-@|GL-kP>iiR%I*jTVt+7m+OB*k4Os*HFHjl5q$i3HLrf~9fXE+7YY#!Bee zl4+56xPvbz+s?;4`NMaFusYJ_(|GgZiBw->?h)&PNz*IrU?hQIT-ZS^uSMBBtZ>{E z=DGCH(#1l@da*xMEfZ#G@cm-gBoYB19zM^N{YB4R?UZpb8Y2;wRL!`Zsq&j9`=Oa; z$J%nrwtP`%eb{-$ot$4HPOL;lJI-7mGQzqJCynkCQNP?=TVyOb8Sl1km*gcRB#aiy zoI0y1XZ^A}B&1`UH(+UF8A3kjS~Vt=8Q1_9A?-&IT|i0>SETZaExyYM{&`@EJM$$O zX?n$sqFzYN@$D4?x|$9x{j8n`8NG}R-YYu(4|NG#VVZNEj67I&Bz$7%K#AhU^N2TH znqd-$SMA}}ZFbNuTmAC!>h@MR?IMGI?4RyPrvF8;Ft@dJh}^T? zw<_*FN9B^UTu1BP&V(!w5~uOfn|ktx4=#Jx!i8pJ_+cf=QMX3{=46@}(EoV!N58q+ShaJal$iib>Bj}sT7{y#_K?$9( zbcmA#6CL&$y~x$JzV7M9`Z2O@9npF=znw>T)u0b zxjmUYR$YoexJ{tx3!+!;5z28to&{%>S!30D0 z0dXJubG=y{*hHONEY>i|zmMh~Q&>rkLC*Al4(>4>SO))1{({l8Gv)zV10XSh#F)lu zww?}~?AwDIfXBgBNlWOelH=mybUQ5sRs>xADnh-V zSRVC$8Rg~ZlWL$`JgLp8*u&gIu7ZdC4?aN<={%-7(a&Hk!QkVOLy+PBAy|#oymbbk z+Gm_YV4L1HJ3a^R1rZp1P8@yC0;|%p!^_^&a~o{^%a<=jL_}o#HVX<1HylY<`sA*@ z?RxxJ$@vKKAU@7CS9sHp3!}Kf@>dDs^@fHAQPkNjU(Ev2XMze={$`M3Vu`H4F^L zCJbvwvnHm{G1sw6nw$-TNemAUSB3%~Ttu1W?r95a5Ok$CyV7dNSrmWVPj0c$zfMN?B#LvN(7Z&~Z< zrOw%B-EM$hl6^u|v%_cKi!4;daxE`s=&+kW+hTQpNv*dJ#Hv2iv@i2IYbat4>Nl;wCD<;+(|?;XI+h^G{g zgJTjj4$K!&{dpP7R>6$Lg(oWNeGSxlzy$PsMR{d6!x<}fWh#Squt{$JAvu_Kn(YKP4$yok0akts}5z1d#J$UGoqF3hlrkpcjSR{0iJtKbNG-#kOkV~o!rcld4lg9DEII=Rzy!#mI_5l=gKdF_BQ$|(aNh;##r4cm#e zhtDCAeTb$iP-!GkWwxnaBb|L`rKUUu9KNlhP*yhByjc>C_8a-n<$j#s6dp3mVL6wS z5xuzRTVY}G3Ln>rZfqWM0v~nGz_|><806@Hmx92*%g=G}0^PI;jVcp((HaV8e9{93`(lEjD2P0e`Ieo&nWq*OEgejWy&p&7 zSkcrlUv@^&o4Q>+Ys}88PN;vMn@fF%y^~1SG}L^U`#C`)6S#rnIQ;?lL>xzJ^va^> z#Qk>`(>X4v`N`g6AxbeKd@|tz3%Bv2Trkxg8u+3GIMs0%blbQ3lIR0|gE+gJ@x@DI z%%P#1sGY}6_=;~!GmpZP*m!uV-Cnmgt;KaUFlRR{{s@4VqH0$7wj#5Xs#~@+F)@+T zxuAfy0E^-iglkJ!e0Ba|Bg1eiZ0wTIt~6$bAua)dZ2SEzUQ-embgkLFG_>CSpr4P@ zWQnW;Mu#8W@^n*^@MQskv}irs>;nYMg(Srg!3p#2R;Y!T=$TLv`F6?-hi8ya?GIk; z7%o{K3**(JNNEdxhC0(`wSCPqAwalBSE)&%*1)r_l0T;s14}kgfbo(+Mk3&y{&QsA z0<~YX-dc)yk@=t5hIy!NiN3Xu4SpiW4%2e6n|7}75~x^GEio}M6?t@HLiu4c-5)0N zM2LZToSf%bWE2#euiKu7Ge4&v{3@GJkv^cuD=F@d$l7;)+G|BYalGxm~#oFPj`dl!@Y z_t~OxgjUlS`Ec$I<}1wh@YMq{BIqn5QK150W;hJMh*Pu^H1j@|@OfGwhM~^QApHi>87jQ#=!o@>kEa_Mid@?&pVg_Y(`N_rt3EVg zxe%^t_4fX?E0=?IB$e+;JeG`j=OGVMd|%|p+7W4SaOCE2!R~`9f%@Rc(dYVIXDhE( z21hoIg#G5qrWd6&WiZ@@u1MIU6Oh+Jm=Z}I;7w`-YL97?S!9;}oBRdq70SkJl(^jB zW;wACMJHNPVm-~UgC3Z&8O8r$%KinC|GP|?8D2(qHd4(7PVgHLX}j<9v2@P_`wRw} zpGVis?7r91;QROQ3uXN7R8%-RIQVYPNy*6kXgZ~j0j3FCKp~v!?YZU$h%_F894v}{ z8?#2uV>px;=u%=ypeZWB2ouoX*k7L+d}ahz@4_uPD6-`B@{Tr7_AV_n5u*t_5jTH1 z&+|f=uWK7k>F=H}+c#l;05%T6>4{>jWVGB$1v__~Ql$v>PUbQ!3iE3fKc zGuF{R7{@|kf71i5=B)6iQ-o(S@`uh)<39oP(}02pi8MU{Lcyroc;M(TJ}z#NznS$W z*r*g`=<8E?Y#W>z;4HI&z3jqI;qMHATm?i9B6iCd062j|0yTqg%)&K6E%cR+T zA7?PYBXmIDwno-+R^&w2 z@=#5CN=5q~XvHxVMmDzckfUEDu9b^dfr9MQy`DM~%LOH^FCf86K&giAfzON1g5Yij z&W-ON!XKPp+o23tdT^+wa1MMxddH0)Igpyfo?HozE27Go&)&fUZk}f=J^tt>%{*{& zSpt{%19o?JmF-hXlFAR8P!Q@>1!$7QAkQymU|@g=I@7CJ@)P!Ani_OVQg0w zP%TH`H5oZ_grk4cFmL0?d7=mzz=ZC>Ej!g}i8Sx3ng}t_vV4qpV=a1-yPNtFX*uE- zU`WmvN>9FI@pbe_`%XRc4yhmi2(GMnsQ(SX`#`2D&L;@Pu`2pF-dLS9hPN=6BAuQd zNat8uZsN`QfyD8Fb_+ZEk{pH2$;$ZriI~fT`omkSSIenvU|oLNXqyLU->Rf{gPClg zYe~2Se+eSwL2%0QT~6B@SkhMXA3e9ZX5!lz8=3?stn}!tpU}`b1K;&&CLW7J;ASVU z;rAv&r6@f0DX;bE#=#M)uM`k6mZBJ}8zwpW3`?qiBglCL&^Z2VUTTI`6F;SozY!ZJ 
z@y&ma+xh5&uiZzMn{1Yl0k$TM#``V=s0Iec*FMd=MVcvBv--kOpvB@pSFvY^AGtBR8^s@f0Kpg zqKyizqsFhQ?0OOiMDoB$rT^&9G|!UABV>@{Wst40=)a2oLcV}oD&tMLJ0IiazW^92 zy?80#3^?*cwi~8I+aLu?eLOd3(UHz6kA+XQ1-C*@=<}tGY{@T(Tvj3cmQ{JN{3)Z@1G z&9{=z*e*V57gJ}bCfgH{lqlNtKr9uNMcn~hHi zvHvvZ`T8c4>_ZN6chJ)cPEIcwg*^=VapRhk97DF9E7SpPB9-33&aS^yeB*Unax3N%Sw1WY=31;zAN*Az$iCe zbQgRBjOC8S&khpSdB6DlbllPu$aL$3wun3UMy$$00)E` z<_G~_efqO8%xK2qBxqXNn^=!^N2fH9i^y-_yLtWJGojlTDeT`lQepljv`E+FeAH zWB~vvNMh`M4+DBZLzCBN$9dJ{f`~w6xjpO)yerxGPEOqDhZ589LG_l&L`bz?oWcIb0P{zxsh}-nO z&)xNr!NIiWo2suw?>)iokH$&(+5PRQz+(PT)BBu2-wO{P^IpLj2s$b)W!_D%KuRIn z8*MZsHj-jYVqTA*DWs0u)xNK*nBSITdP;~a^Lq<=Pg{XPr2?HT}LOwI2x#4Ot%DGj zDuxmnk(I#uwIM*jyUk=*3B_p!XED7#6}Pk)T4|(Weg|fi@~Q`AavKsgZmcFY@4E*4 zFf6&DA%!Lr6&;-39jZIBwfEr5-F=Cam==601IL7x=?IE>eWU61rLf(IjBR^-n&vv03F?0$4 z2>{c}UT>(ZRNoZV(w{C-KIQBK!BEI}@JOijrh>($TaJ5?=#3QK#1XCD{~{EuyUC?x z)eiio**zDx`G-FN;nvcP{QCK0cV{PwPRa{OorM(k$?A*sT@pL<+bUH405iS$?);(_Q|}cF|s2KtS@&l#pr0c2zLTj}}NZJHKTA{wM_LJU;9L zp<+)&t+9~z8}UAr0mLhc{TuNDKS9;qbNGGt)`M@iw#Qb0^ioSc69h`0q4-5$8)LeI z%oKXAVZ62V4Z_7nmVAuTzi`$|0l;?nnO65V(q#pdxe*7``rjZ2JDFNUEssXZ>$b(q zjThxZa6%?zfmhs}>|FVZSrE-@dGF(SuHevy=vltG1#DhCD%360qSS{5&m z^enx-{4AN_>UgOt>YyACT|fb_I6q z8?ZD4c!%<0n)XMKdzp|FQb`Ymw*@JgEtCRQ7Tv)?N4o98FMfRPMuO&bKH3%7lH5H? zGf6v9u5(%Cy1@L~>&+Dhoxdgb;8J(6qMxdXlU5}0zUn;$A9}kRvKwu+@io)=SnTlE z2Ht#bg6jxjds6oE*k&6|H%C>k&(zV^TEA+?#luT%%RywVa*(_s-sgHTQann<;1}5f zMYH@$p&mi7RuMS(#*w4_ssapl`6C-rUp#{u$4_lU&e2OJF=i5`Dw2aW-lx z2zzzrmC~a#NFcW0B3+7i2bjQddIh(6Y!CPA*Q%B>Gk{31>gqZKv9NiQokpcitIMP& ziONjC>R5^HHO@BGo_3QfS_ZJ_^bs2H-=ID=2jj)l5cS{^8z$VutaS_C;FFM$aKgkz zg0)+)I^P9NBazWcwv@s`K8!**uR+a6NR!*u*$UujhRsjD9*PzhU*km7Gmn8jiu!6J z0L!^JEA;zeXqWnx;Gqs6va=dS-_MbSBk;;4rN7TQG2%2D%C8MuFr2{9m3JVkN!Puw8Uj!9ZYv>;@e9 zq2b|{Ue=)U^#~gUA*?Eko=jwt+?lP1FT{fy_`%Sgg zQlF+sOpIzG+Xu%@Q%eix0K_n_sxJzdMC)bGP}xib$@qsNbeR&zN^~nZOr4LT@Wc^^ z!vD<<>dx!_#J-$=^0&~Xalkvn7v^Z{-^lK7RN!F@kyF!4X2-gx7C5c{#D4$({uk;~ z{I&v^u7MeEv{-C;eQ`Y8r`hg$$qYA{H%A7+Za+~3?+H7LBl4{Lk+ zgE|j3nsg!L(j!AoWyYA_YN-_4xZfa!j#RMD`tITL`U7HVaVb$)OY=A$n5nyepa_Tq zNMM<1NlyQRwUEyl6&LCL{?@J%iryw65I@E|AX$?-_4hY^gu#i7IQh1y4`UNQ{-0m5TCmRg+v7INn0J zAkeG7h4g1k0f-YhDp4oh$S{7Bh6V8SV;h7gk*^y7Ol8MJLV5r8`yr`%Rb{0D3i*{W z6$SC87nZI^hd|0-g*{Dr(4bvgo4_%Uny9a`lv<{Nu!|jDSfNK5rDJjH2T(gZJ3GL9j04Ml?bDP|EzPU>BOR615*#~D?CX~p>p~Q} zosgiiZjQ#?Z)yPAl8Ddc~$ zTJ^mBi=L_LBQ*PFm`S05p~iQDENt2!s|4L^5kYKXX5E5;TnN5lILkSm}NWB53 zixgkS4|&L0BoHk=_00WkSI2{ylC#_6xG3yZUWM?=-F0HXTHj1Ue+egHe` zTaZJ5yl8yrfrJN0bCd=_m^sY@YbxNZGjd$~-X>!pJ=FWbWp@JUG;}x=LPM-E`!`G_D*sx(|)c zUSK-;rwQ~rIE*c?(x;9a-9YB*mJ-%Ky?_T#gRkfGZ;0x5JF{>MyU5?f|FgS^8ZN@& zVRgam*wt&HRG$h)|Net91H@l=-V+PUsY?J}`tx%k7C6i&(Lj_N;&BNUfi=dY(1(R7 zM~qmBtpQ!?V=-;00xQT?5>#TPkKZ5})=QA+4F0i>B?fG)5DDAb@k`6WN%&>lPQ>uf zxG*ACT0NsO;`hM9k8kSG;dALDx_@N7Vd~H-i(V Date: Sun, 17 Apr 2016 15:25:39 +0200 Subject: [PATCH 0061/1644] ARROW-103: Add files to gitignore Patches [ARROW-103](https://issues.apache.org/jira/browse/ARROW-103), though perhaps it would make sense to leave that issue open to cover any future .gitignore-related pull requests. 
Author: Dan Robinson

Closes #62 from danrobinson/ARROW-103 and squashes the following commits:

7c1c7d8 [Dan Robinson] ARROW-103: Added '*-build' to cpp/.gitignore
633bacf [Dan Robinson] ARROW-103: Added '.cache' to python/.gitignore
59f58ba [Dan Robinson] ARROW-103: Add '*.dylib' to python/.gitignore
52572ab [Dan Robinson] ARROW-103: Add 'dev-build/' to cpp/.gitignore
---
 cpp/.gitignore    | 1 +
 python/.gitignore | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/cpp/.gitignore b/cpp/.gitignore
index ab30247d49378..4910544ec87cd 100644
--- a/cpp/.gitignore
+++ b/cpp/.gitignore
@@ -5,6 +5,7 @@ CTestTestfile.cmake
 Makefile
 cmake_install.cmake
 build/
+*-build/
 Testing/
 
 #########################################

diff --git a/python/.gitignore b/python/.gitignore
index 3cb591ea766d5..7e2e360557ad8 100644
--- a/python/.gitignore
+++ b/python/.gitignore
@@ -18,6 +18,7 @@ Testing/
 *.o
 *.py[ocd]
 *.so
+*.dylib
 .build_cache_dir
 MANIFEST
 
@@ -35,6 +36,8 @@ dist
 # coverage
 .coverage
 coverage.xml
+# cache
+.cache
 
 # benchmark working dir
 .asv

From 0b472d860260f7063aee742939be23b921382741 Mon Sep 17 00:00:00 2001
From: Micah Kornfield
Date: Mon, 18 Apr 2016 19:44:29 +0200
Subject: [PATCH 0062/1644] ARROW-82: Initial IPC support for ListArray

This is a work in progress because I can't get clang-tidy to shut up on
parameterized test files (see the last commit, which I need to revert before
merge). I'd like to make sure this is a clean build and make sure people are
ok with these changes. This PR also has a lot of collateral damage from
small/large things I cleaned up on my way to making this work. I tried to
split the commits up logically, but if people would prefer separate pull
requests I can try to do that as well.

Open questions:
1.  For supporting strings, binary, etc. I was thinking of changing their
type definitions to inherit from ListType, and to hard-code the child type.
This would allow for simpler IPC code (all of the instantiation of types
would happen in construct.h/.cc?) vs. handling each of these types
separately for IPC.
2.  There are some TODOs I left sprinkled in the code and would like
people's thoughts on whether they are urgent/worthwhile to follow up on.

Open issues:
1.  Supporting the rest of the List-backed logical types
2.  More unit tests for added functionality.

As part of this commit I also refactored the Builder interfaces a little bit
for the following reasons:
1.  It seems that if ArrayBuilder owns the bitmap, it should be responsible
for having methods to manipulate it.
2.  This allows ListBuilder to use the parent class + a BufferBuilder
instead of inheriting Int32Builder, which means it doesn't need to do
strange length/capacity hacks (see the sketch after this list).

Other misc things here:
1.  Native popcount in test-util.h
2.  Ability to build a new list on top of an existing one by incrementally
adding offsets/sizes
3.  Added missing primitive types in construct.h
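A minimal standalone sketch of the builder layout described above: the
parent builder owns the validity bitmap and its length/null-count
bookkeeping, while the list builder composes a separate offsets buffer
instead of inheriting an int32 builder. This is an illustration only, not
code from this patch -- the Sketch* names are hypothetical, and std::vector
stands in for Arrow's pooled buffers.

#include <cstdint>
#include <iostream>
#include <vector>

// Parent class: owns the validity bitmap and the length/null-count state.
class SketchArrayBuilder {
 public:
  void AppendToBitmap(bool is_valid) {
    valid_.push_back(is_valid);
    if (!is_valid) ++null_count_;
    ++length_;
  }
  int32_t length() const { return length_; }
  int32_t null_count() const { return null_count_; }

 protected:
  std::vector<bool> valid_;
  int32_t length_ = 0;
  int32_t null_count_ = 0;
};

// List builder: composes an offsets buffer rather than inheriting one.
class SketchListBuilder : public SketchArrayBuilder {
 public:
  // Begin a new list slot; child values appended afterwards belong to it.
  void Append(bool is_valid = true) {
    offsets_.push_back(static_cast<int32_t>(values_.size()));
    AppendToBitmap(is_valid);  // null bookkeeping lives in the parent class
  }
  void AppendValue(int32_t v) { values_.push_back(v); }
  // Seal the array: offsets ends up with length() + 1 entries bracketing
  // each slot's values.
  void Finish() { offsets_.push_back(static_cast<int32_t>(values_.size())); }
  const std::vector<int32_t>& offsets() const { return offsets_; }

 private:
  std::vector<int32_t> offsets_;
  std::vector<int32_t> values_;
};

int main() {
  // Build [[1, 2], null, [3]].
  SketchListBuilder b;
  b.Append();
  b.AppendValue(1);
  b.AppendValue(2);
  b.Append(false);  // null slot contributes no child values
  b.Append();
  b.AppendValue(3);
  b.Finish();
  for (int32_t off : b.offsets()) { std::cout << off << " "; }  // 0 2 2 3
  std::cout << "nulls: " << b.null_count() << "\n";             // nulls: 1
  return 0;
}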
Author: Micah Kornfield

Closes #59 from emkornfield/emk_list_ipc_PR and squashes the following
commits:

0c5162d [Micah Kornfield] another format fix
0af558b [Micah Kornfield] remove a now unnecessary NOLINT, but mostly to
trigger another travis-ci job that failed due to apt get issue
7789205 [Micah Kornfield] make clang-format-3.7 happy
6e57728 [Micah Kornfield] make format fixes
5e15815 [Micah Kornfield] fix make lint
8982723 [Micah Kornfield] remaining style cleanup
be04b3e [Micah Kornfield] add unit tests for zero length row batches and
non-null batches. fix bugs
10e6651 [Micah Kornfield] add in maximum recursion depth, surfaced possible
recursion issue with flatbuffers
3b219a1 [Micah Kornfield] Make append is_null parameter is_valid for api
consistency
2e6c477 [Micah Kornfield] add missing RETURN_NOT_OK
e71810b [Micah Kornfield] make Resize and Init virtual on builder
8ab5315 [Micah Kornfield] make clang tidy ignore a little bit less hacky
53d37bc [Micah Kornfield] filter out ipc-adapter-test from tidy
8e464b5 [Micah Kornfield] Fixes per tidy and lint
aa0602c [Micah Kornfield] add potentially useful pool factories to test utils
39c57ed [Micah Kornfield] add potentially useful methods for generating
arrays to ipc test-common
a2e1e52 [Micah Kornfield] native popcount
61b0481 [Micah Kornfield] small fixes to naming/style for c++ and potential
bugs
5f87aef [Micah Kornfield] Refactor ipc-adapter-test to make it
parameterizable. add unit test for lists. make unit test pass and add
construction method for list arrays
45e41c0 [Micah Kornfield] Make BufferBuilder more usable for appending
primitives
1374485 [Micah Kornfield] augment python unittest to have null element in
list
20f984b [Micah Kornfield] refactor primitive builders to use parent
builder's bitmap
3895d34 [Micah Kornfield] Refactor list builder to use ArrayBuilder's bitmap
methods and a separate buffer builder
01c50be [Micah Kornfield] Add utility methods for managing null bitmap
directly to ArrayBuilder
cc7f851 [Micah Kornfield] add Validate method to array and implementation
for ListArray
---
 cpp/CMakeLists.txt                     |   2 +-
 cpp/README.md                          |   9 +-
 cpp/src/.clang-tidy-ignore             |   1 +
 cpp/src/arrow/array.cc                 |   5 +
 cpp/src/arrow/array.h                  |   6 +-
 cpp/src/arrow/builder.cc               |  56 +++++++
 cpp/src/arrow/builder.h                |  46 ++++--
 cpp/src/arrow/ipc/adapter.cc           | 136 +++++++++++-----
 cpp/src/arrow/ipc/adapter.h            |  11 +-
 cpp/src/arrow/ipc/ipc-adapter-test.cc  | 216 +++++++++++++++++++++----
 cpp/src/arrow/ipc/memory.cc            |   1 +
 cpp/src/arrow/ipc/metadata-internal.cc |   3 +-
 cpp/src/arrow/ipc/metadata-internal.h  |   3 +-
 cpp/src/arrow/ipc/metadata.cc          |   3 +-
 cpp/src/arrow/ipc/test-common.h        |  67 ++++++++
 cpp/src/arrow/parquet/schema.cc        |   2 +-
 cpp/src/arrow/schema.cc                |   2 +-
 cpp/src/arrow/test-util.h              |  49 +++++-
 cpp/src/arrow/type.h                   |   2 +-
 cpp/src/arrow/types/construct.cc       |  43 ++++-
 cpp/src/arrow/types/construct.h        |   9 ++
 cpp/src/arrow/types/list-test.cc       |  80 ++++++---
 cpp/src/arrow/types/list.cc            |  60 ++++++-
 cpp/src/arrow/types/list.h             | 112 +++++++------
 cpp/src/arrow/types/primitive-test.cc  |   6 +-
 cpp/src/arrow/types/primitive.cc       |  45 +----
 cpp/src/arrow/types/primitive.h        |  41 +++--
 cpp/src/arrow/types/string.h           |   5 +-
 cpp/src/arrow/util/buffer.h            |  59 +++++--
 cpp/src/arrow/util/logging.h           |   2 +-
 cpp/src/arrow/util/memory-pool.cc      |   2 +-
 python/pyarrow/tests/test_array.py     |   5 +-
 32 files changed, 839 insertions(+), 250 deletions(-)
 create mode 100644 cpp/src/.clang-tidy-ignore

diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
index f803c0fb3e428..b38f91e5d687c 100644
--- a/cpp/CMakeLists.txt
+++ b/cpp/CMakeLists.txt
@@ -565,7 +565,7 @@ if (${CLANG_TIDY_FOUND})
     `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc | sed -e '/_generated/g'`)
   # runs clang-tidy and exits with a non-zero exit code if any errors are found.
   add_custom_target(check-clang-tidy ${BUILD_SUPPORT_DIR}/run-clang-tidy.sh ${CLANG_TIDY_BIN} ${CMAKE_BINARY_DIR}/compile_commands.json
-    0 `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc | sed -e '/_generated/g'`)
+    0 `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc |grep -v -F -f ${CMAKE_CURRENT_SOURCE_DIR}/src/.clang-tidy-ignore | sed -e '/_generated/g'`)
 endif()
 
diff --git a/cpp/README.md b/cpp/README.md
index 3f5da21b7d417..c8cd86fedc6fe 100644
--- a/cpp/README.md
+++ b/cpp/README.md
@@ -76,4 +76,11 @@ build failures by running the following checks before submitting your pull request
 
 Note that the clang-tidy target may take a while to run. You might
 consider running clang-tidy separately on the files you have added/changed before
-invoking the make target to reduce iteration time.
+invoking the make target to reduce iteration time. Also, it might generate warnings
+that aren't valid. To avoid these you can add a line comment `// NOLINT`. If
+NOLINT doesn't suppress the warnings, you can add the file in question to
+the .clang-tidy-ignore file. This will allow `make check-clang-tidy` to pass in
+travis-CI (but still surface the potential warnings in `make clang-tidy`). Ideally,
+both of these options would be used rarely. Currently known use-cases when they are required:
+
+* Parameterized tests in google test.
diff --git a/cpp/src/.clang-tidy-ignore b/cpp/src/.clang-tidy-ignore
new file mode 100644
index 0000000000000..a128c38889672
--- /dev/null
+++ b/cpp/src/.clang-tidy-ignore
@@ -0,0 +1 @@
+ipc-adapter-test.cc
diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc
index a1536861a20be..c6b9b1599cdd2 100644
--- a/cpp/src/arrow/array.cc
+++ b/cpp/src/arrow/array.cc
@@ -20,6 +20,7 @@
 #include 
 
 #include "arrow/util/buffer.h"
+#include "arrow/util/status.h"
 
 namespace arrow {
 
@@ -47,6 +48,10 @@ bool Array::EqualsExact(const Array& other) const {
   return true;
 }
 
+Status Array::Validate() const {
+  return Status::OK();
+}
+
 bool NullArray::Equals(const std::shared_ptr<Array>& arr) const {
   if (this == arr.get()) { return true; }
   if (Type::NA != arr->type_enum()) { return false; }
diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h
index c6735f87d8f42..f98c4c28310f8 100644
--- a/cpp/src/arrow/array.h
+++ b/cpp/src/arrow/array.h
@@ -28,6 +28,7 @@
 namespace arrow {
 
 class Buffer;
+class Status;
 
 // Immutable data array with some logical type and some length. Any memory is
 // owned by the respective Buffer instance (or its parents).
@@ -39,7 +40,7 @@ class Array {
   Array(const std::shared_ptr<DataType>& type, int32_t length, int32_t null_count = 0,
       const std::shared_ptr<Buffer>& null_bitmap = nullptr);
 
-  virtual ~Array() {}
+  virtual ~Array() = default;
 
   // Determine if a slot is null. For inner loops. Does *not* boundscheck
   bool IsNull(int i) const {
@@ -58,6 +59,9 @@ class Array {
   bool EqualsExact(const Array& arr) const;
   virtual bool Equals(const std::shared_ptr<Array>& arr) const = 0;
 
+  // Determines if the array is internally consistent. Defaults to always
+  // returning Status::OK. This can be an expensive check.
+  virtual Status Validate() const;
 
 protected:
   std::shared_ptr<DataType> type_;
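The array.cc/array.h change above introduces a Status-returning Validate()
hook. As an illustration of the pattern (not code from this patch -- Status
here is a stand-in for arrow::Status, and SketchListArray is hypothetical),
one plausible consistency check for a list array is that its offsets never
decrease, exactly the kind of "expensive check" the comment anticipates:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Minimal stand-in for a status type: an empty message means OK.
struct Status {
  std::string msg;
  static Status OK() { return {}; }
  static Status Invalid(std::string m) { return {std::move(m)}; }
  bool ok() const { return msg.empty(); }
};

// Hypothetical list array: length() + 1 offsets bracket each slot's values.
struct SketchListArray {
  std::vector<int32_t> offsets{0, 3, 2};  // deliberately inconsistent

  // Mirrors the idea of the new virtual Validate(): an opt-in consistency
  // check rather than something performed on every access.
  Status Validate() const {
    for (std::size_t i = 1; i < offsets.size(); ++i) {
      if (offsets[i] < offsets[i - 1]) {
        return Status::Invalid("offsets are not monotonically non-decreasing");
      }
    }
    return Status::OK();
  }
};

int main() {
  SketchListArray arr;
  Status s = arr.Validate();
  std::cout << (s.ok() ? std::string("valid") : "invalid: " + s.msg) << "\n";
  return s.ok() ? 0 : 1;
}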
diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc
index 1447078f76028..87c1219025d37 100644
--- a/cpp/src/arrow/builder.cc
+++ b/cpp/src/arrow/builder.cc
@@ -25,6 +25,25 @@
 
 namespace arrow {
 
+Status ArrayBuilder::AppendToBitmap(bool is_valid) {
+  if (length_ == capacity_) {
+    // If the capacity was not already a multiple of 2, do so here
+    // TODO(emkornfield) doubling isn't great default allocation practice
+    // see https://github.com/facebook/folly/blob/master/folly/docs/FBVector.md
+    // for discussion
+    RETURN_NOT_OK(Resize(util::next_power2(capacity_ + 1)));
+  }
+  UnsafeAppendToBitmap(is_valid);
+  return Status::OK();
+}
+
+Status ArrayBuilder::AppendToBitmap(const uint8_t* valid_bytes, int32_t length) {
+  RETURN_NOT_OK(Reserve(length));
+
+  UnsafeAppendToBitmap(valid_bytes, length);
+  return Status::OK();
+}
+
 Status ArrayBuilder::Init(int32_t capacity) {
   capacity_ = capacity;
   int32_t to_alloc = util::ceil_byte(capacity) / 8;
@@ -36,6 +55,7 @@ Status ArrayBuilder::Init(int32_t capacity) {
 }
 
 Status ArrayBuilder::Resize(int32_t new_bits) {
+  if (!null_bitmap_) { return Init(new_bits); }
   int32_t new_bytes = util::ceil_byte(new_bits) / 8;
   int32_t old_bytes = null_bitmap_->size();
   RETURN_NOT_OK(null_bitmap_->Resize(new_bytes));
@@ -56,10 +76,46 @@ Status ArrayBuilder::Advance(int32_t elements) {
 
 Status ArrayBuilder::Reserve(int32_t elements) {
   if (length_ + elements > capacity_) {
+    // TODO(emkornfield) power of 2 growth is potentially suboptimal
    int32_t new_capacity = util::next_power2(length_ + elements);
    return Resize(new_capacity);
  }
  return Status::OK();
 }
 
+Status ArrayBuilder::SetNotNull(int32_t length) {
+  RETURN_NOT_OK(Reserve(length));
+  UnsafeSetNotNull(length);
+  return Status::OK();
+}
+
+void ArrayBuilder::UnsafeAppendToBitmap(bool is_valid) {
+  if (is_valid) {
+    util::set_bit(null_bitmap_data_, length_);
+  } else {
+    ++null_count_;
+  }
+  ++length_;
+}
+
+void ArrayBuilder::UnsafeAppendToBitmap(const uint8_t* valid_bytes, int32_t length) {
+  if (valid_bytes == nullptr) {
+    UnsafeSetNotNull(length);
+    return;
+  }
+  for (int32_t i = 0; i < length; ++i) {
+    // TODO(emkornfield) Optimize for large values of length?
+    UnsafeAppendToBitmap(valid_bytes[i] > 0);
+  }
+}
+
+void ArrayBuilder::UnsafeSetNotNull(int32_t length) {
+  const int32_t new_length = length + length_;
+  // TODO(emkornfield) Optimize for large values of length?
+  for (int32_t i = length_; i < new_length; ++i) {
+    util::set_bit(null_bitmap_data_, i);
+  }
+  length_ = new_length;
+}
+
 } // namespace arrow
diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h
index 21a6341ef5086..7d3f4398d73e3 100644
--- a/cpp/src/arrow/builder.h
+++ b/cpp/src/arrow/builder.h
@@ -34,7 +34,10 @@ class PoolBuffer;
 
 static constexpr int32_t MIN_BUILDER_CAPACITY = 1 << 5;
 
-// Base class for all data array builders
+// Base class for all data array builders.
+// This class provides facilities for incrementally building the null bitmap
+// (see the Append methods) and, as a side effect, tracking the current
+// number of slots and the null count.
 class ArrayBuilder {
  public:
   explicit ArrayBuilder(MemoryPool* pool, const TypePtr& type)
@@ -46,7 +49,7 @@ class ArrayBuilder {
         length_(0),
         capacity_(0) {}
 
-  virtual ~ArrayBuilder() {}
+  virtual ~ArrayBuilder() = default;
 
diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index 21a6341ef5086..7d3f4398d73e3 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -34,7 +34,10 @@ class PoolBuffer; static constexpr int32_t MIN_BUILDER_CAPACITY = 1 << 5; -// Base class for all data array builders +// Base class for all data array builders. +// This class provides facilities for incrementally building the null bitmap +// (see the Append methods) and, as a side effect, tracking the current number +// of slots and the null count. class ArrayBuilder { public: explicit ArrayBuilder(MemoryPool* pool, const TypePtr& type) : ... length_(0), capacity_(0) {} - virtual ~ArrayBuilder() {} + virtual ~ArrayBuilder() = default; // For nested types. Since the objects are owned by this class instance, we // skip shared pointers and just return a raw pointer @@ -58,14 +61,27 @@ class ArrayBuilder { int32_t null_count() const { return null_count_; } int32_t capacity() const { return capacity_; } - // Allocates requires memory at this level, but children need to be - // initialized independently - Status Init(int32_t capacity); + // Append to null bitmap + Status AppendToBitmap(bool is_valid); + // Vector append. Treat each zero byte as a null. If valid_bytes is null + // assume all of length bits are valid. + Status AppendToBitmap(const uint8_t* valid_bytes, int32_t length); + // Set the next length bits to not null (i.e. valid). + Status SetNotNull(int32_t length); - // Resizes the null_bitmap array - Status Resize(int32_t new_bits); + // Allocates initial capacity requirements for the builder. In most + // cases subclasses should override and call their parent class's + // method as well. + virtual Status Init(int32_t capacity); - Status Reserve(int32_t extra_bits); + // Resizes the null_bitmap array. In most + // cases subclasses should override and call their parent class's + // method as well. + virtual Status Resize(int32_t new_bits); + + // Ensures there is enough space for adding the number of elements by checking + // capacity and calling Resize if necessary. + Status Reserve(int32_t elements); // For cases where raw data was memcpy'd into the internal buffers, allows us // to advance the length of the builder. It is your responsibility to use @@ -75,7 +91,7 @@ const std::shared_ptr<PoolBuffer>& null_bitmap() const { return null_bitmap_; } // Creates new array object to hold the contents of the builder and transfers - // ownership of the data + // ownership of the data. This resets all variables on the builder. virtual std::shared_ptr<Array> Finish() = 0; const std::shared_ptr<DataType>& type() const { return type_; }
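Because Init and Resize are now virtual, the contract documented above is that a subclass first chains to its parent (which manages the null bitmap) and only then handles its own buffers. A hedged sketch of that chaining with toy types; the real methods return Status rather than bool:

```cpp
// Parent owns the null bitmap; child owns a data buffer. Illustrative only.
struct ParentBuilder {
  virtual ~ParentBuilder() = default;
  virtual bool Init(int capacity) { return true; }    // allocate null bitmap
  virtual bool Resize(int new_bits) { return true; }  // grow null bitmap
};

struct ChildBuilder : ParentBuilder {
  bool Init(int capacity) override {
    if (!ParentBuilder::Init(capacity)) return false;  // parent first
    return true;                                       // then own allocations
  }
  bool Resize(int new_bits) override {
    if (!ParentBuilder::Resize(new_bits)) return false;
    return true;                                       // then grow own buffers
  }
};
```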
@@ -97,6 +113,18 @@ class ArrayBuilder { // Child value array builders. These are owned by this class std::vector<std::shared_ptr<ArrayBuilder>> children_; + // + // Unsafe operations (don't check capacity/don't resize) + // + + // Append to null bitmap. + void UnsafeAppendToBitmap(bool is_valid); + // Vector append. Treat each zero byte as a null. If valid_bytes is null + // assume all of length bits are valid. + void UnsafeAppendToBitmap(const uint8_t* valid_bytes, int32_t length); + // Set the next length bits to not null (i.e. valid). + void UnsafeSetNotNull(int32_t length); + private: DISALLOW_COPY_AND_ASSIGN(ArrayBuilder); }; diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index 2f72c3aa8467a..bf6fa94dea7a4 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -19,17 +19,19 @@ #include #include +#include #include #include "arrow/array.h" -#include "arrow/ipc/memory.h" #include "arrow/ipc/Message_generated.h" -#include "arrow/ipc/metadata.h" +#include "arrow/ipc/memory.h" #include "arrow/ipc/metadata-internal.h" +#include "arrow/ipc/metadata.h" #include "arrow/schema.h" #include "arrow/table.h" #include "arrow/type.h" #include "arrow/types/construct.h" +#include "arrow/types/list.h" #include "arrow/types/primitive.h" #include "arrow/util/buffer.h" #include "arrow/util/logging.h" @@ -63,44 +65,70 @@ static bool IsPrimitive(const DataType* type) { } } +static bool IsListType(const DataType* type) { + DCHECK(type != nullptr); + switch (type->type) { + // TODO(emkornfield) groupings like this are used in a few places in the + // code; consider using a pattern like: + // http://stackoverflow.com/questions/26784685/c-macro-for-calling-function-based-on-enum-type + // + // TODO(emkornfield) Fix type systems so these are all considered lists and + // the types behave the same way? + // case Type::BINARY: + // case Type::CHAR: + case Type::LIST: + // see todo on common types + // case Type::STRING: + // case Type::VARCHAR: + return true; + default: + return false; + } +} + // ---------------------------------------------------------------------- // Row batch write path Status VisitArray(const Array* arr, std::vector<flatbuf::FieldNode>* field_nodes, - std::vector<std::shared_ptr<Buffer>>* buffers) { - if (IsPrimitive(arr->type().get())) { - const PrimitiveArray* prim_arr = static_cast<const PrimitiveArray*>(arr); - - field_nodes->push_back( - flatbuf::FieldNode(prim_arr->length(), prim_arr->null_count())); + std::vector<std::shared_ptr<Buffer>>* buffers, int max_recursion_depth) { + if (max_recursion_depth <= 0) { return Status::Invalid("Max recursion depth reached"); } + DCHECK(arr); + DCHECK(field_nodes); + // push back all common elements + field_nodes->push_back(flatbuf::FieldNode(arr->length(), arr->null_count())); + if (arr->null_count() > 0) { + buffers->push_back(arr->null_bitmap()); + } else { + // Push a dummy zero-length buffer, not to be copied + buffers->push_back(std::make_shared<Buffer>(nullptr, 0)); + } - if (prim_arr->null_count() > 0) { - buffers->push_back(prim_arr->null_bitmap()); - } else { - // Push a dummy zero-length buffer, not to be copied - buffers->push_back(std::make_shared<Buffer>(nullptr, 0)); - } + const DataType* arr_type = arr->type().get(); + if (IsPrimitive(arr_type)) { + const auto prim_arr = static_cast<const PrimitiveArray*>(arr); buffers->push_back(prim_arr->data()); - } else if (arr->type_enum() == Type::LIST) { - // TODO(wesm) - return Status::NotImplemented("List type"); + } else if (IsListType(arr_type)) { + const auto list_arr = static_cast<const ListArray*>(arr); + buffers->push_back(list_arr->offset_buffer()); + RETURN_NOT_OK(VisitArray( + list_arr->values().get(), field_nodes, buffers, max_recursion_depth - 1)); } else if (arr->type_enum() == Type::STRUCT) { // TODO(wesm) return Status::NotImplemented("Struct type"); } return Status::OK(); } class RowBatchWriter { public: - explicit RowBatchWriter(const RowBatch* batch) : batch_(batch) {} + RowBatchWriter(const RowBatch* batch, int max_recursion_depth) + : batch_(batch), max_recursion_depth_(max_recursion_depth) {} Status AssemblePayload() { // Perform depth-first traversal of the row-batch for (int i = 0; i < 
batch_->num_columns(); ++i) { const Array* arr = batch_->column(i).get(); - RETURN_NOT_OK(VisitArray(arr, &field_nodes_, &buffers_)); + RETURN_NOT_OK(VisitArray(arr, &field_nodes_, &buffers_, max_recursion_depth_)); } return Status::OK(); } @@ -111,8 +139,10 @@ class RowBatchWriter { int64_t offset = 0; for (size_t i = 0; i < buffers_.size(); ++i) { const Buffer* buffer = buffers_[i].get(); - int64_t size = buffer->size(); + int64_t size = 0; + // The buffer might be null if we are handling zero row lengths. + if (buffer) { size = buffer->size(); } // TODO(wesm): We currently have no notion of shared memory page id's, // but we've included it in the metadata IDL for when we have it in the // future. Use page=0 for now @@ -171,11 +201,13 @@ class RowBatchWriter { std::vector field_nodes_; std::vector buffer_meta_; std::vector> buffers_; + int max_recursion_depth_; }; -Status WriteRowBatch( - MemorySource* dst, const RowBatch* batch, int64_t position, int64_t* header_offset) { - RowBatchWriter serializer(batch); +Status WriteRowBatch(MemorySource* dst, const RowBatch* batch, int64_t position, + int64_t* header_offset, int max_recursion_depth) { + DCHECK_GT(max_recursion_depth, 0); + RowBatchWriter serializer(batch, max_recursion_depth); RETURN_NOT_OK(serializer.AssemblePayload()); return serializer.Write(dst, position, header_offset); } @@ -186,8 +218,9 @@ static constexpr int64_t INIT_METADATA_SIZE = 4096; class RowBatchReader::Impl { public: - Impl(MemorySource* source, const std::shared_ptr& metadata) - : source_(source), metadata_(metadata) { + Impl(MemorySource* source, const std::shared_ptr& metadata, + int max_recursion_depth) + : source_(source), metadata_(metadata), max_recursion_depth_(max_recursion_depth) { num_buffers_ = metadata->num_buffers(); num_flattened_fields_ = metadata->num_fields(); } @@ -203,7 +236,7 @@ class RowBatchReader::Impl { buffer_index_ = 0; for (int i = 0; i < schema->num_fields(); ++i) { const Field* field = schema->field(i).get(); - RETURN_NOT_OK(NextArray(field, &arrays[i])); + RETURN_NOT_OK(NextArray(field, max_recursion_depth_, &arrays[i])); } *out = std::make_shared(schema, metadata_->length(), arrays); @@ -213,8 +246,12 @@ class RowBatchReader::Impl { private: // Traverse the flattened record batch metadata and reassemble the // corresponding array containers - Status NextArray(const Field* field, std::shared_ptr* out) { - const std::shared_ptr& type = field->type; + Status NextArray( + const Field* field, int max_recursion_depth, std::shared_ptr* out) { + const TypePtr& type = field->type; + if (max_recursion_depth <= 0) { + return Status::Invalid("Max recursion depth reached"); + } // pop off a field if (field_index_ >= num_flattened_fields_) { @@ -226,23 +263,42 @@ class RowBatchReader::Impl { // we can skip that buffer without reading from shared memory FieldMetadata field_meta = metadata_->field(field_index_++); + // extract null_bitmap which is common to all arrays + std::shared_ptr null_bitmap; + if (field_meta.null_count == 0) { + ++buffer_index_; + } else { + RETURN_NOT_OK(GetBuffer(buffer_index_++, &null_bitmap)); + } + if (IsPrimitive(type.get())) { - std::shared_ptr null_bitmap; std::shared_ptr data; - if (field_meta.null_count == 0) { - null_bitmap = nullptr; - ++buffer_index_; - } else { - RETURN_NOT_OK(GetBuffer(buffer_index_++, &null_bitmap)); - } if (field_meta.length > 0) { RETURN_NOT_OK(GetBuffer(buffer_index_++, &data)); } else { + buffer_index_++; data.reset(new Buffer(nullptr, 0)); } return MakePrimitiveArray( type, 
field_meta.length, data, field_meta.null_count, null_bitmap, out); } + + if (IsListType(type.get())) { + std::shared_ptr<Buffer> offsets; + RETURN_NOT_OK(GetBuffer(buffer_index_++, &offsets)); + const int num_children = type->num_children(); + if (num_children != 1) { + std::stringstream ss; + ss << "Field: " << field->ToString() + << " has wrong number of children: " << num_children; + return Status::Invalid(ss.str()); + } + std::shared_ptr<Array> values_array; + RETURN_NOT_OK( + NextArray(type->child(0).get(), max_recursion_depth - 1, &values_array)); + return MakeListArray(type, field_meta.length, offsets, values_array, + field_meta.null_count, null_bitmap, out); + } return Status::NotImplemented("Non-primitive types not complete yet"); } @@ -256,12 +312,18 @@ int field_index_; int buffer_index_; + int max_recursion_depth_; int num_buffers_; int num_flattened_fields_; }; Status RowBatchReader::Open( MemorySource* source, int64_t position, std::shared_ptr<RowBatchReader>* out) { + return Open(source, position, kMaxIpcRecursionDepth, out); +} + +Status RowBatchReader::Open(MemorySource* source, int64_t position, + int max_recursion_depth, std::shared_ptr<RowBatchReader>* out) { std::shared_ptr<Buffer> metadata; RETURN_NOT_OK(source->ReadAt(position, INIT_METADATA_SIZE, &metadata)); @@ -286,7 +348,7 @@ Status RowBatchReader::Open( std::shared_ptr<RecordBatchMessage> batch_meta = message->GetRecordBatch(); std::shared_ptr<RowBatchReader> result(new RowBatchReader()); - result->impl_.reset(new Impl(source, batch_meta)); + result->impl_.reset(new Impl(source, batch_meta, max_recursion_depth)); *out = result; return Status::OK(); diff --git a/cpp/src/arrow/ipc/adapter.h b/cpp/src/arrow/ipc/adapter.h index d453fa05f4982..4c9a8a9d8ee39 100644 --- a/cpp/src/arrow/ipc/adapter.h +++ b/cpp/src/arrow/ipc/adapter.h @@ -38,7 +38,9 @@ class RecordBatchMessage; // ---------------------------------------------------------------------- // Write path - +// We have trouble decoding flatbuffers if the recursion depth is > 70, so 64 is a nice round number +// TODO(emkornfield) investigate this more +constexpr int kMaxIpcRecursionDepth = 64; // Write the RowBatch (collection of equal-length Arrow arrays) to the memory // source at the indicated position // @@ -52,8 +54,8 @@ // // Finally, the memory offset to the start of the metadata / data header is // returned in an out-variable -Status WriteRowBatch( - MemorySource* dst, const RowBatch* batch, int64_t position, int64_t* header_offset); +Status WriteRowBatch(MemorySource* dst, const RowBatch* batch, int64_t position, + int64_t* header_offset, int max_recursion_depth = kMaxIpcRecursionDepth); // int64_t GetRowBatchMetadata(const RowBatch* batch); @@ -70,6 +72,9 @@ class RowBatchReader { static Status Open( MemorySource* source, int64_t position, std::shared_ptr<RowBatchReader>* out); + static Status Open(MemorySource* source, int64_t position, int max_recursion_depth, + std::shared_ptr<RowBatchReader>* out); + // Reassemble the row batch.
A Schema is required to be able to construct the // right array containers Status GetRowBatch( diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc index fbdda77e4919c..c243cfba820cc 100644 --- a/cpp/src/arrow/ipc/ipc-adapter-test.cc +++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc @@ -18,9 +18,7 @@ #include #include #include -#include #include -#include #include #include @@ -31,6 +29,7 @@ #include "arrow/ipc/test-common.h" #include "arrow/test-util.h" +#include "arrow/types/list.h" #include "arrow/types/primitive.h" #include "arrow/util/bit-util.h" #include "arrow/util/buffer.h" @@ -40,25 +39,56 @@ namespace arrow { namespace ipc { -class TestWriteRowBatch : public ::testing::Test, public MemoryMapFixture { +// TODO(emkornfield) convert to google style kInt32, etc? +const auto INT32 = std::make_shared(); +const auto LIST_INT32 = std::make_shared(INT32); +const auto LIST_LIST_INT32 = std::make_shared(LIST_INT32); + +typedef Status MakeRowBatch(std::shared_ptr* out); + +class TestWriteRowBatch : public ::testing::TestWithParam, + public MemoryMapFixture { public: void SetUp() { pool_ = default_memory_pool(); } void TearDown() { MemoryMapFixture::TearDown(); } - void InitMemoryMap(int64_t size) { + Status RoundTripHelper(const RowBatch& batch, int memory_map_size, + std::shared_ptr* batch_result) { std::string path = "test-write-row-batch"; - MemoryMapFixture::CreateFile(path, size); - ASSERT_OK(MemoryMappedSource::Open(path, MemorySource::READ_WRITE, &mmap_)); + MemoryMapFixture::InitMemoryMap(memory_map_size, path, &mmap_); + int64_t header_location; + RETURN_NOT_OK(WriteRowBatch(mmap_.get(), &batch, 0, &header_location)); + + std::shared_ptr reader; + RETURN_NOT_OK(RowBatchReader::Open(mmap_.get(), header_location, &reader)); + + RETURN_NOT_OK(reader->GetRowBatch(batch.schema(), batch_result)); + return Status::OK(); } protected: - MemoryPool* pool_; std::shared_ptr mmap_; + MemoryPool* pool_; }; -const auto INT32 = std::make_shared(); +TEST_P(TestWriteRowBatch, RoundTrip) { + std::shared_ptr batch; + ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue + std::shared_ptr batch_result; + ASSERT_OK(RoundTripHelper(*batch, 1 << 16, &batch_result)); + + // do checks + ASSERT_TRUE(batch->schema()->Equals(batch_result->schema())); + ASSERT_EQ(batch->num_columns(), batch_result->num_columns()) + << batch->schema()->ToString() << " result: " << batch_result->schema()->ToString(); + EXPECT_EQ(batch->num_rows(), batch_result->num_rows()); + for (int i = 0; i < batch->num_columns(); ++i) { + EXPECT_TRUE(batch->column(i)->Equals(batch_result->column(i))) + << "Idx: " << i << " Name: " << batch->column_name(i); + } +} -TEST_F(TestWriteRowBatch, IntegerRoundTrip) { +Status MakeIntRowBatch(std::shared_ptr* out) { const int length = 1000; // Make the schema @@ -67,41 +97,159 @@ TEST_F(TestWriteRowBatch, IntegerRoundTrip) { std::shared_ptr schema(new Schema({f0, f1})); // Example data + std::shared_ptr a0, a1; + MemoryPool* pool = default_memory_pool(); + RETURN_NOT_OK(MakeRandomInt32Array(length, false, pool, &a0)); + RETURN_NOT_OK(MakeRandomInt32Array(length, true, pool, &a1)); + out->reset(new RowBatch(schema, length, {a0, a1})); + return Status::OK(); +} - auto data = std::make_shared(pool_); - ASSERT_OK(data->Resize(length * sizeof(int32_t))); - test::rand_uniform_int(length, 0, 0, std::numeric_limits::max(), - reinterpret_cast(data->mutable_data())); +Status MakeListRowBatch(std::shared_ptr* out) { + // Make the schema + auto f0 = std::make_shared("f0", 
LIST_INT32); + auto f1 = std::make_shared("f1", LIST_LIST_INT32); + auto f2 = std::make_shared("f2", INT32); + std::shared_ptr schema(new Schema({f0, f1, f2})); - auto null_bitmap = std::make_shared(pool_); - int null_bytes = util::bytes_for_bits(length); - ASSERT_OK(null_bitmap->Resize(null_bytes)); - test::random_bytes(null_bytes, 0, null_bitmap->mutable_data()); + // Example data - auto a0 = std::make_shared(length, data); - auto a1 = std::make_shared( - length, data, test::bitmap_popcount(null_bitmap->data(), length), null_bitmap); + MemoryPool* pool = default_memory_pool(); + const int length = 200; + std::shared_ptr leaf_values, list_array, list_list_array, flat_array; + const bool include_nulls = true; + RETURN_NOT_OK(MakeRandomInt32Array(1000, include_nulls, pool, &leaf_values)); + RETURN_NOT_OK( + MakeRandomListArray(leaf_values, length, include_nulls, pool, &list_array)); + RETURN_NOT_OK( + MakeRandomListArray(list_array, length, include_nulls, pool, &list_list_array)); + RETURN_NOT_OK(MakeRandomInt32Array(length, include_nulls, pool, &flat_array)); + out->reset(new RowBatch(schema, length, {list_array, list_list_array, flat_array})); + return Status::OK(); +} - RowBatch batch(schema, length, {a0, a1}); +Status MakeZeroLengthRowBatch(std::shared_ptr* out) { + // Make the schema + auto f0 = std::make_shared("f0", LIST_INT32); + auto f1 = std::make_shared("f1", LIST_LIST_INT32); + auto f2 = std::make_shared("f2", INT32); + std::shared_ptr schema(new Schema({f0, f1, f2})); - // TODO(wesm): computing memory requirements for a row batch - // 64k is plenty of space - InitMemoryMap(1 << 16); + // Example data + MemoryPool* pool = default_memory_pool(); + const int length = 200; + const bool include_nulls = true; + std::shared_ptr leaf_values, list_array, list_list_array, flat_array; + RETURN_NOT_OK(MakeRandomInt32Array(0, include_nulls, pool, &leaf_values)); + RETURN_NOT_OK(MakeRandomListArray(leaf_values, 0, include_nulls, pool, &list_array)); + RETURN_NOT_OK( + MakeRandomListArray(list_array, 0, include_nulls, pool, &list_list_array)); + RETURN_NOT_OK(MakeRandomInt32Array(0, include_nulls, pool, &flat_array)); + out->reset(new RowBatch(schema, length, {list_array, list_list_array, flat_array})); + return Status::OK(); +} - int64_t header_location; - ASSERT_OK(WriteRowBatch(mmap_.get(), &batch, 0, &header_location)); +Status MakeNonNullRowBatch(std::shared_ptr* out) { + // Make the schema + auto f0 = std::make_shared("f0", LIST_INT32); + auto f1 = std::make_shared("f1", LIST_LIST_INT32); + auto f2 = std::make_shared("f2", INT32); + std::shared_ptr schema(new Schema({f0, f1, f2})); - std::shared_ptr result; - ASSERT_OK(RowBatchReader::Open(mmap_.get(), header_location, &result)); + // Example data + MemoryPool* pool = default_memory_pool(); + const int length = 200; + std::shared_ptr leaf_values, list_array, list_list_array, flat_array; - std::shared_ptr batch_result; - ASSERT_OK(result->GetRowBatch(schema, &batch_result)); - EXPECT_EQ(batch.num_rows(), batch_result->num_rows()); + RETURN_NOT_OK(MakeRandomInt32Array(1000, true, pool, &leaf_values)); + bool include_nulls = false; + RETURN_NOT_OK(MakeRandomListArray(leaf_values, 50, include_nulls, pool, &list_array)); + RETURN_NOT_OK( + MakeRandomListArray(list_array, 50, include_nulls, pool, &list_list_array)); + RETURN_NOT_OK(MakeRandomInt32Array(0, include_nulls, pool, &flat_array)); + out->reset(new RowBatch(schema, length, {list_array, list_list_array, flat_array})); + return Status::OK(); +} - for (int i = 0; i < 
batch.num_columns(); ++i) { - EXPECT_TRUE(batch.column(i)->Equals(batch_result->column(i))) << i - << batch.column_name(i); +Status MakeDeeplyNestedList(std::shared_ptr* out) { + const int batch_length = 5; + TypePtr type = INT32; + + MemoryPool* pool = default_memory_pool(); + ArrayPtr array; + const bool include_nulls = true; + RETURN_NOT_OK(MakeRandomInt32Array(1000, include_nulls, pool, &array)); + for (int i = 0; i < 63; ++i) { + type = std::static_pointer_cast(std::make_shared(type)); + RETURN_NOT_OK(MakeRandomListArray(array, batch_length, include_nulls, pool, &array)); + } + + auto f0 = std::make_shared("f0", type); + std::shared_ptr schema(new Schema({f0})); + std::vector arrays = {array}; + out->reset(new RowBatch(schema, batch_length, arrays)); + return Status::OK(); +} + +INSTANTIATE_TEST_CASE_P(RoundTripTests, TestWriteRowBatch, + ::testing::Values(&MakeIntRowBatch, &MakeListRowBatch, &MakeNonNullRowBatch, + &MakeZeroLengthRowBatch, &MakeDeeplyNestedList)); + +class RecursionLimits : public ::testing::Test, public MemoryMapFixture { + public: + void SetUp() { pool_ = default_memory_pool(); } + void TearDown() { MemoryMapFixture::TearDown(); } + + Status WriteToMmap(int recursion_level, bool override_level, + int64_t* header_out = nullptr, std::shared_ptr* schema_out = nullptr) { + const int batch_length = 5; + TypePtr type = INT32; + ArrayPtr array; + const bool include_nulls = true; + RETURN_NOT_OK(MakeRandomInt32Array(1000, include_nulls, pool_, &array)); + for (int i = 0; i < recursion_level; ++i) { + type = std::static_pointer_cast(std::make_shared(type)); + RETURN_NOT_OK( + MakeRandomListArray(array, batch_length, include_nulls, pool_, &array)); + } + + auto f0 = std::make_shared("f0", type); + std::shared_ptr schema(new Schema({f0})); + if (schema_out != nullptr) { *schema_out = schema; } + std::vector arrays = {array}; + auto batch = std::make_shared(schema, batch_length, arrays); + + std::string path = "test-write-past-max-recursion"; + const int memory_map_size = 1 << 16; + MemoryMapFixture::InitMemoryMap(memory_map_size, path, &mmap_); + int64_t header_location; + int64_t* header_out_param = header_out == nullptr ? 
&header_location : header_out; + if (override_level) { + return WriteRowBatch( + mmap_.get(), batch.get(), 0, header_out_param, recursion_level + 1); + } else { + return WriteRowBatch(mmap_.get(), batch.get(), 0, header_out_param); + } } + + protected: + std::shared_ptr mmap_; + MemoryPool* pool_; +}; + +TEST_F(RecursionLimits, WriteLimit) { + ASSERT_RAISES(Invalid, WriteToMmap((1 << 8) + 1, false)); +} + +TEST_F(RecursionLimits, ReadLimit) { + int64_t header_location; + std::shared_ptr schema; + ASSERT_OK(WriteToMmap(64, true, &header_location, &schema)); + + std::shared_ptr reader; + ASSERT_OK(RowBatchReader::Open(mmap_.get(), header_location, &reader)); + std::shared_ptr batch_result; + ASSERT_RAISES(Invalid, reader->GetRowBatch(schema, &batch_result)); } } // namespace ipc diff --git a/cpp/src/arrow/ipc/memory.cc b/cpp/src/arrow/ipc/memory.cc index 2b077e9792925..84cbc182cd26f 100644 --- a/cpp/src/arrow/ipc/memory.cc +++ b/cpp/src/arrow/ipc/memory.cc @@ -18,6 +18,7 @@ #include "arrow/ipc/memory.h" #include // For memory-mapping + #include #include #include diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index ad5951d17e2c0..1b1d50f96eaf5 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -17,13 +17,14 @@ #include "arrow/ipc/metadata-internal.h" -#include #include #include #include #include #include +#include "flatbuffers/flatbuffers.h" + #include "arrow/ipc/Message_generated.h" #include "arrow/schema.h" #include "arrow/type.h" diff --git a/cpp/src/arrow/ipc/metadata-internal.h b/cpp/src/arrow/ipc/metadata-internal.h index 779c5a30a044a..871b5bc4bf606 100644 --- a/cpp/src/arrow/ipc/metadata-internal.h +++ b/cpp/src/arrow/ipc/metadata-internal.h @@ -18,11 +18,12 @@ #ifndef ARROW_IPC_METADATA_INTERNAL_H #define ARROW_IPC_METADATA_INTERNAL_H -#include #include #include #include +#include "flatbuffers/flatbuffers.h" + #include "arrow/ipc/Message_generated.h" namespace arrow { diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index bcf104f0b8ba6..4fc8ec50eb716 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -17,11 +17,12 @@ #include "arrow/ipc/metadata.h" -#include #include #include #include +#include "flatbuffers/flatbuffers.h" + // Generated C++ flatbuffer IDL #include "arrow/ipc/Message_generated.h" #include "arrow/ipc/metadata-internal.h" diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index 65c837dc8b141..e7dbb84d790a1 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -18,11 +18,19 @@ #ifndef ARROW_IPC_TEST_COMMON_H #define ARROW_IPC_TEST_COMMON_H +#include #include #include #include #include +#include "arrow/array.h" +#include "arrow/test-util.h" +#include "arrow/types/list.h" +#include "arrow/types/primitive.h" +#include "arrow/util/buffer.h" +#include "arrow/util/memory-pool.h" + namespace arrow { namespace ipc { @@ -41,10 +49,69 @@ class MemoryMapFixture { fclose(file); } + Status InitMemoryMap( + int64_t size, const std::string& path, std::shared_ptr* mmap) { + CreateFile(path, size); + return MemoryMappedSource::Open(path, MemorySource::READ_WRITE, mmap); + } + private: std::vector tmp_files_; }; +Status MakeRandomInt32Array( + int32_t length, bool include_nulls, MemoryPool* pool, std::shared_ptr* array) { + std::shared_ptr data; + test::MakeRandomInt32PoolBuffer(length, pool, &data); + const auto INT32 = std::make_shared(); + Int32Builder builder(pool, INT32); + 
if (include_nulls) { + std::shared_ptr valid_bytes; + test::MakeRandomBytePoolBuffer(length, pool, &valid_bytes); + RETURN_NOT_OK(builder.Append( + reinterpret_cast(data->data()), length, valid_bytes->data())); + *array = builder.Finish(); + return Status::OK(); + } + RETURN_NOT_OK(builder.Append(reinterpret_cast(data->data()), length)); + *array = builder.Finish(); + return Status::OK(); +} + +Status MakeRandomListArray(const std::shared_ptr& child_array, int num_lists, + bool include_nulls, MemoryPool* pool, std::shared_ptr* array) { + // Create the null list values + std::vector valid_lists(num_lists); + const double null_percent = include_nulls ? 0.1 : 0; + test::random_null_bytes(num_lists, null_percent, valid_lists.data()); + + // Create list offsets + const int max_list_size = 10; + + std::vector list_sizes(num_lists, 0); + std::vector offsets( + num_lists + 1, 0); // +1 so we can shift for nulls. See partial sum below. + const int seed = child_array->length(); + if (num_lists > 0) { + test::rand_uniform_int(num_lists, seed, 0, max_list_size, list_sizes.data()); + // make sure sizes are consistent with null + std::transform(list_sizes.begin(), list_sizes.end(), valid_lists.begin(), + list_sizes.begin(), + [](int32_t size, int32_t valid) { return valid == 0 ? 0 : size; }); + std::partial_sum(list_sizes.begin(), list_sizes.end(), ++offsets.begin()); + + // Force invariants + const int child_length = child_array->length(); + offsets[0] = 0; + std::replace_if(offsets.begin(), offsets.end(), + [child_length](int32_t offset) { return offset > child_length; }, child_length); + } + ListBuilder builder(pool, child_array); + RETURN_NOT_OK(builder.Append(offsets.data(), num_lists, valid_lists.data())); + *array = builder.Finish(); + return (*array)->Validate(); +} + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/parquet/schema.cc b/cpp/src/arrow/parquet/schema.cc index 066388b4d0e23..560e28374066b 100644 --- a/cpp/src/arrow/parquet/schema.cc +++ b/cpp/src/arrow/parquet/schema.cc @@ -21,8 +21,8 @@ #include "parquet/api/schema.h" -#include "arrow/util/status.h" #include "arrow/types/decimal.h" +#include "arrow/util/status.h" using parquet::schema::Node; using parquet::schema::NodePtr; diff --git a/cpp/src/arrow/schema.cc b/cpp/src/arrow/schema.cc index a38acaa94ba56..ff3ea1990e551 100644 --- a/cpp/src/arrow/schema.cc +++ b/cpp/src/arrow/schema.cc @@ -18,8 +18,8 @@ #include "arrow/schema.h" #include -#include #include +#include #include #include "arrow/type.h" diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index 538d9b233d990..2f81161d1d6d1 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -19,6 +19,7 @@ #define ARROW_TEST_UTIL_H_ #include +#include #include #include #include @@ -26,12 +27,13 @@ #include "gtest/gtest.h" -#include "arrow/type.h" #include "arrow/column.h" #include "arrow/schema.h" #include "arrow/table.h" +#include "arrow/type.h" #include "arrow/util/bit-util.h" #include "arrow/util/buffer.h" +#include "arrow/util/logging.h" #include "arrow/util/memory-pool.h" #include "arrow/util/random.h" #include "arrow/util/status.h" @@ -103,10 +105,12 @@ std::shared_ptr to_buffer(const std::vector& values) { reinterpret_cast(values.data()), values.size() * sizeof(T)); } -void random_null_bitmap(int64_t n, double pct_null, uint8_t* null_bitmap) { +// Sets approximately pct_null of the first n bytes in null_bytes to zero +// and the rest to non-zero (true) values. 
+void random_null_bytes(int64_t n, double pct_null, uint8_t* null_bytes) { Random rng(random_seed()); for (int i = 0; i < n; ++i) { - null_bitmap[i] = rng.NextDoubleFraction() > pct_null; + null_bytes[i] = rng.NextDoubleFraction() > pct_null; } } @@ -121,6 +125,7 @@ static inline void random_bytes(int n, uint32_t seed, uint8_t* out) { template void rand_uniform_int(int n, uint32_t seed, T min_value, T max_value, T* out) { + DCHECK(out); std::mt19937 gen(seed); std::uniform_int_distribution d(min_value, max_value); for (int i = 0; i < n; ++i) { @@ -129,11 +134,25 @@ void rand_uniform_int(int n, uint32_t seed, T min_value, T max_value, T* out) { } static inline int bitmap_popcount(const uint8_t* data, int length) { + // book keeping + constexpr int pop_len = sizeof(uint64_t); + const uint64_t* i64_data = reinterpret_cast(data); + const int fast_counts = length / pop_len; + const uint64_t* end = i64_data + fast_counts; + int count = 0; - for (int i = 0; i < length; ++i) { - // TODO(wesm): accelerate this + // popcount as much as possible with the widest possible count + for (auto iter = i64_data; iter < end; ++iter) { + count += __builtin_popcountll(*iter); + } + + // Account for left over bytes (in theory we could fall back to smaller + // versions of popcount but the code complexity is likely not worth it) + const int loop_tail_index = fast_counts * pop_len; + for (int i = loop_tail_index; i < length; ++i) { if (util::get_bit(data, i)) { ++count; } } + return count; } @@ -153,6 +172,26 @@ std::shared_ptr bytes_to_null_buffer(const std::vector& bytes) return out; } +Status MakeRandomInt32PoolBuffer(int32_t length, MemoryPool* pool, + std::shared_ptr* pool_buffer, uint32_t seed = 0) { + DCHECK(pool); + auto data = std::make_shared(pool); + RETURN_NOT_OK(data->Resize(length * sizeof(int32_t))); + test::rand_uniform_int(length, seed, 0, std::numeric_limits::max(), + reinterpret_cast(data->mutable_data())); + *pool_buffer = data; + return Status::OK(); +} + +Status MakeRandomBytePoolBuffer(int32_t length, MemoryPool* pool, + std::shared_ptr* pool_buffer, uint32_t seed = 0) { + auto bytes = std::make_shared(pool); + RETURN_NOT_OK(bytes->Resize(length)); + test::random_bytes(length, seed, bytes->mutable_data()); + *pool_buffer = bytes; + return Status::OK(); +} + } // namespace test } // namespace arrow diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 051ab46b199f9..77404cd702524 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -116,7 +116,7 @@ struct DataType { bool Equals(const DataType* other) { // Call with a pointer so more friendly to subclasses - return this == other || (this->type == other->type); + return other && ((this == other) || (this->type == other->type)); } bool Equals(const std::shared_ptr& other) { return Equals(other.get()); } diff --git a/cpp/src/arrow/types/construct.cc b/cpp/src/arrow/types/construct.cc index 0a30929b97c51..78036d4bf5711 100644 --- a/cpp/src/arrow/types/construct.cc +++ b/cpp/src/arrow/types/construct.cc @@ -20,8 +20,8 @@ #include #include "arrow/type.h" -#include "arrow/types/primitive.h" #include "arrow/types/list.h" +#include "arrow/types/primitive.h" #include "arrow/types/string.h" #include "arrow/util/buffer.h" #include "arrow/util/status.h" @@ -60,11 +60,10 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, case Type::LIST: { std::shared_ptr value_builder; - const std::shared_ptr& value_type = static_cast(type.get())->value_type(); RETURN_NOT_OK(MakeBuilder(pool, value_type, &value_builder)); - 
out->reset(new ListBuilder(pool, type, value_builder)); + out->reset(new ListBuilder(pool, value_builder)); return Status::OK(); } default: @@ -75,11 +74,11 @@ #define MAKE_PRIMITIVE_ARRAY_CASE(ENUM, ArrayType) \ case Type::ENUM: \ out->reset(new ArrayType(type, length, data, null_count, null_bitmap)); \ - return Status::OK(); + break; -Status MakePrimitiveArray(const std::shared_ptr<DataType>& type, int32_t length, +Status MakePrimitiveArray(const TypePtr& type, int32_t length, const std::shared_ptr<Buffer>& data, int32_t null_count, - const std::shared_ptr<Buffer>& null_bitmap, std::shared_ptr<Array>* out) { + const std::shared_ptr<Buffer>& null_bitmap, ArrayPtr* out) { switch (type->type) { MAKE_PRIMITIVE_ARRAY_CASE(BOOL, BooleanArray); MAKE_PRIMITIVE_ARRAY_CASE(UINT8, UInt8Array); @@ -90,11 +89,43 @@ Status MakePrimitiveArray(const std::shared_ptr<DataType>& type, int32_t length, MAKE_PRIMITIVE_ARRAY_CASE(INT32, Int32Array); MAKE_PRIMITIVE_ARRAY_CASE(UINT64, UInt64Array); MAKE_PRIMITIVE_ARRAY_CASE(INT64, Int64Array); + MAKE_PRIMITIVE_ARRAY_CASE(TIME, Int64Array); + MAKE_PRIMITIVE_ARRAY_CASE(TIMESTAMP, Int64Array); MAKE_PRIMITIVE_ARRAY_CASE(FLOAT, FloatArray); MAKE_PRIMITIVE_ARRAY_CASE(DOUBLE, DoubleArray); + MAKE_PRIMITIVE_ARRAY_CASE(TIMESTAMP_DOUBLE, DoubleArray); + default: + return Status::NotImplemented(type->ToString()); + } +#ifdef NDEBUG + return Status::OK(); +#else + return (*out)->Validate(); +#endif +} + +Status MakeListArray(const TypePtr& type, int32_t length, + const std::shared_ptr<Buffer>& offsets, const ArrayPtr& values, int32_t null_count, + const std::shared_ptr<Buffer>& null_bitmap, ArrayPtr* out) { + switch (type->type) { + case Type::BINARY: + case Type::LIST: + out->reset(new ListArray(type, length, offsets, values, null_count, null_bitmap)); + break; + case Type::CHAR: + case Type::DECIMAL_TEXT: + case Type::STRING: + case Type::VARCHAR: + out->reset(new StringArray(type, length, offsets, values, null_count, null_bitmap)); + break; default: return Status::NotImplemented(type->ToString()); } +#ifdef NDEBUG + return Status::OK(); +#else + return (*out)->Validate(); +#endif } } // namespace arrow diff --git a/cpp/src/arrow/types/construct.h b/cpp/src/arrow/types/construct.h index 27fb7bd2149cf..43c0018c67e41 100644 --- a/cpp/src/arrow/types/construct.h +++ b/cpp/src/arrow/types/construct.h @@ -33,10 +33,19 @@ class Status; Status MakeBuilder(MemoryPool* pool, const std::shared_ptr<DataType>& type, std::shared_ptr<ArrayBuilder>* out); +// Create new arrays for logical types that are backed by primitive arrays. Status MakePrimitiveArray(const std::shared_ptr<DataType>& type, int32_t length, const std::shared_ptr<Buffer>& data, int32_t null_count, const std::shared_ptr<Buffer>& null_bitmap, std::shared_ptr<Array>* out); +// Create new list arrays for logical types that are backed by ListArrays (e.g. list of +// primitives and strings) +// TODO(emkornfield) split up string vs list? +Status MakeListArray(const std::shared_ptr<DataType>& type, int32_t length, + const std::shared_ptr<Buffer>& offsets, const std::shared_ptr<Array>& values, + int32_t null_count, const std::shared_ptr<Buffer>& null_bitmap, + std::shared_ptr<Array>* out); + } // namespace arrow #endif // ARROW_BUILDER_H_
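Both factories above end the same way: release builds (NDEBUG defined) skip the check, while debug builds validate the freshly constructed array before handing it back. A self-contained sketch of that pattern, with a hypothetical Thing standing in for the array types:

```cpp
#include <memory>

struct Thing {
  // Stand-in for the Status-returning Array::Validate().
  bool Validate() const { return true; }
};

bool MakeThing(std::shared_ptr<Thing>* out) {
  *out = std::make_shared<Thing>();
#ifdef NDEBUG
  return true;                // release: skip the possibly expensive check
#else
  return (*out)->Validate();  // debug: verify invariants eagerly
#endif
}
```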
diff --git a/cpp/src/arrow/types/list-test.cc b/cpp/src/arrow/types/list-test.cc index aa34f23cc0230..6a8ad9aa59ead 100644 --- a/cpp/src/arrow/types/list-test.cc +++ b/cpp/src/arrow/types/list-test.cc @@ -15,8 +15,8 @@ // specific language governing permissions and limitations // under the License. -#include #include +#include #include #include #include @@ -94,6 +94,7 @@ TEST_F(TestListBuilder, TestAppendNull) { Done(); + ASSERT_OK(result_->Validate()); ASSERT_TRUE(result_->IsNull(0)); ASSERT_TRUE(result_->IsNull(1)); @@ -105,50 +106,93 @@ TEST_F(TestListBuilder, TestAppendNull) { ASSERT_EQ(0, values->length()); } +void ValidateBasicListArray(const ListArray* result, const vector<int32_t>& values, + const vector<uint8_t>& is_valid) { + ASSERT_OK(result->Validate()); + ASSERT_EQ(1, result->null_count()); + ASSERT_EQ(0, result->values()->null_count()); + + ASSERT_EQ(3, result->length()); + vector<int32_t> ex_offsets = {0, 3, 3, 7}; + for (size_t i = 0; i < ex_offsets.size(); ++i) { + ASSERT_EQ(ex_offsets[i], result->offset(i)); + } + + for (int i = 0; i < result->length(); ++i) { + ASSERT_EQ(!static_cast<bool>(is_valid[i]), result->IsNull(i)); + } + + ASSERT_EQ(7, result->values()->length()); + Int32Array* varr = static_cast<Int32Array*>(result->values().get()); + + for (size_t i = 0; i < values.size(); ++i) { + ASSERT_EQ(values[i], varr->Value(i)); + } +} + TEST_F(TestListBuilder, TestBasics) { vector<int32_t> values = {0, 1, 2, 3, 4, 5, 6}; vector<int> lengths = {3, 0, 4}; - vector<uint8_t> is_null = {0, 1, 0}; + vector<uint8_t> is_valid = {1, 0, 1}; Int32Builder* vb = static_cast<Int32Builder*>(builder_->value_builder().get()); - EXPECT_OK(builder_->Reserve(lengths.size())); - EXPECT_OK(vb->Reserve(values.size())); + ASSERT_OK(builder_->Reserve(lengths.size())); + ASSERT_OK(vb->Reserve(values.size())); int pos = 0; for (size_t i = 0; i < lengths.size(); ++i) { - ASSERT_OK(builder_->Append(is_null[i] > 0)); + ASSERT_OK(builder_->Append(is_valid[i] > 0)); for (int j = 0; j < lengths[i]; ++j) { vb->Append(values[pos++]); } } Done(); + ValidateBasicListArray(result_.get(), values, is_valid); +} - ASSERT_EQ(1, result_->null_count()); - ASSERT_EQ(0, result_->values()->null_count()); +TEST_F(TestListBuilder, BulkAppend) { + vector<int32_t> values = {0, 1, 2, 3, 4, 5, 6}; + vector<int> lengths = {3, 0, 4}; + vector<uint8_t> is_valid = {1, 0, 1}; + vector<int32_t> offsets = {0, 3, 3}; - ASSERT_EQ(3, result_->length()); - vector<int32_t> ex_offsets = {0, 3, 3, 7}; - for (size_t i = 0; i < ex_offsets.size(); ++i) { - ASSERT_EQ(ex_offsets[i], result_->offset(i)); - } + Int32Builder* vb = static_cast<Int32Builder*>(builder_->value_builder().get()); + ASSERT_OK(vb->Reserve(values.size())); - for (int i = 0; i < result_->length(); ++i) { - ASSERT_EQ(static_cast<bool>(is_null[i]), result_->IsNull(i)); + builder_->Append(offsets.data(), offsets.size(), is_valid.data()); + for (int32_t value : values) { + vb->Append(value); } + Done(); + ValidateBasicListArray(result_.get(), values, is_valid); +} - ASSERT_EQ(7, result_->values()->length()); - Int32Array* varr = static_cast<Int32Array*>(result_->values().get()); +TEST_F(TestListBuilder, BulkAppendInvalid) { + vector<int32_t> values = {0, 1, 2, 3, 4, 5, 6}; + vector<int> lengths = {3, 0, 4}; + vector<uint8_t> is_null = {0, 1, 0}; + vector<uint8_t> is_valid = {1, 0, 1}; + vector<int32_t> offsets = {0, 2, 4}; // should be 0, 3, 3 given the is_null array - for (size_t i = 0; i < values.size(); ++i) { - ASSERT_EQ(values[i], varr->Value(i)); + Int32Builder* vb = static_cast<Int32Builder*>(builder_->value_builder().get()); + ASSERT_OK(vb->Reserve(values.size())); + + builder_->Append(offsets.data(), offsets.size(), is_valid.data()); + builder_->Append(offsets.data(), offsets.size(), is_valid.data()); + for (int32_t value : values) { + vb->Append(value); } + + Done(); + ASSERT_RAISES(Invalid, result_->Validate()); } TEST_F(TestListBuilder, TestZeroLength) { // All buffers are null Done(); + ASSERT_OK(result_->Validate()); } } // namespace arrow diff --git 
a/cpp/src/arrow/types/list.cc b/cpp/src/arrow/types/list.cc index 23f12ddc4ecd7..fc3331139c6d8 100644 --- a/cpp/src/arrow/types/list.cc +++ b/cpp/src/arrow/types/list.cc @@ -14,23 +14,26 @@ // KIND, either express or implied. See the License for the // specific language governing permissions and limitations // under the License. - #include "arrow/types/list.h" +#include + namespace arrow { bool ListArray::EqualsExact(const ListArray& other) const { if (this == &other) { return true; } if (null_count_ != other.null_count_) { return false; } - bool equal_offsets = offset_buf_->Equals(*other.offset_buf_, length_ + 1); + bool equal_offsets = + offset_buf_->Equals(*other.offset_buf_, (length_ + 1) * sizeof(int32_t)); + if (!equal_offsets) { return false; } bool equal_null_bitmap = true; if (null_count_ > 0) { equal_null_bitmap = null_bitmap_->Equals(*other.null_bitmap_, util::bytes_for_bits(length_)); } - if (!(equal_offsets && equal_null_bitmap)) { return false; } + if (!equal_null_bitmap) { return false; } return values()->Equals(other.values()); } @@ -41,4 +44,55 @@ bool ListArray::Equals(const std::shared_ptr<Array>& arr) const { return EqualsExact(*static_cast<const ListArray*>(arr.get())); } +Status ListArray::Validate() const { + if (length_ < 0) { return Status::Invalid("Length was negative"); } + if (!offset_buf_) { return Status::Invalid("offset_buf_ was null"); } + if (offset_buf_->size() / sizeof(int32_t) < length_) { + std::stringstream ss; + ss << "offset buffer size (bytes): " << offset_buf_->size() + << " isn't large enough for length: " << length_; + return Status::Invalid(ss.str()); + } + const int32_t last_offset = offset(length_); + if (last_offset > 0) { + if (!values_) { + return Status::Invalid("last offset was non-zero and values was null"); + } + if (values_->length() != last_offset) { + std::stringstream ss; + ss << "Final offset invariant not equal to values length: " << last_offset + << " != " << values_->length(); + return Status::Invalid(ss.str()); + } + + const Status child_valid = values_->Validate(); + if (!child_valid.ok()) { + std::stringstream ss; + ss << "Child array invalid: " << child_valid.ToString(); + return Status::Invalid(ss.str()); + } + } + + int32_t prev_offset = offset(0); + if (prev_offset != 0) { return Status::Invalid("The first offset wasn't zero"); } + for (int32_t i = 1; i <= length_; ++i) { + int32_t current_offset = offset(i); + if (IsNull(i - 1) && current_offset != prev_offset) { + std::stringstream ss; + ss << "Offset invariant failure at: " << i << " inconsistent offsets for null slot: " + << current_offset << " != " << prev_offset; + return Status::Invalid(ss.str()); + } + if (current_offset < prev_offset) { + std::stringstream ss; + ss << "Offset invariant failure: " << i + << " inconsistent offset for non-null slot: " << current_offset << " < " + << prev_offset; + return Status::Invalid(ss.str()); + } + prev_offset = current_offset; + } + return Status::OK(); +} + } // namespace arrow
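Restated compactly, ListArray::Validate above enforces four offset invariants: the first offset is zero, offsets never decrease, a null slot repeats the previous offset, and the final offset matches the child length. A standalone predicate version (illustrative only; it returns bool instead of Status and, unlike the real code, does not special-case a zero final offset with no child array):

```cpp
#include <cstdint>
#include <vector>

bool OffsetsValid(const std::vector<int32_t>& offsets,  // length + 1 entries
                  const std::vector<bool>& is_null,     // length entries
                  int32_t values_length) {
  if (offsets.empty() || offsets.front() != 0) return false;
  if (offsets.back() != values_length) return false;
  for (size_t i = 1; i < offsets.size(); ++i) {
    if (offsets[i] < offsets[i - 1]) return false;  // must be non-decreasing
    if (is_null[i - 1] && offsets[i] != offsets[i - 1]) {
      return false;  // null slots must be empty
    }
  }
  return true;
}
```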
diff --git a/cpp/src/arrow/types/list.h b/cpp/src/arrow/types/list.h index 6b815460ecb1e..e2302d917b8f6 100644 --- a/cpp/src/arrow/types/list.h +++ b/cpp/src/arrow/types/list.h @@ -28,6 +28,7 @@ #include "arrow/types/primitive.h" #include "arrow/util/bit-util.h" #include "arrow/util/buffer.h" +#include "arrow/util/logging.h" #include "arrow/util/status.h" namespace arrow { @@ -46,11 +47,16 @@ class ListArray : public Array { values_ = values; } - virtual ~ListArray() {} + Status Validate() const override; + + virtual ~ListArray() = default; // Return a shared pointer in case the requestor desires to share ownership // with this array. const std::shared_ptr<Array>& values() const { return values_; } + const std::shared_ptr<Buffer> offset_buffer() const { + return std::static_pointer_cast<Buffer>(offset_buf_); + } const std::shared_ptr<DataType>& value_type() const { return values_->type(); } @@ -78,59 +84,73 @@ // // To use this class, you must append values to the child array builder and use // the Append function to delimit each distinct list value (once the values -// have been appended to the child array) -class ListBuilder : public Int32Builder { +// have been appended to the child array) or use the bulk API to append +// a sequence of offsets and null values. +// +// A note on types. Per arrow/type.h all types in the C++ implementation are +// logical so even though this class always builds an Array of lists, this can +// represent multiple different logical types. If no logical type is provided +// at construction time, the class defaults to List<T> where T is taken from the +// value_builder/values that the object is constructed with. +class ListBuilder : public ArrayBuilder { public: + // Use this constructor to incrementally build the value array along with offsets and + // null bitmap. + ListBuilder(MemoryPool* pool, std::shared_ptr<ArrayBuilder> value_builder, + const TypePtr& type = nullptr) + : ArrayBuilder( + pool, type ? type : std::static_pointer_cast<DataType>( + std::make_shared<ListType>(value_builder->type()))), + offset_builder_(pool), + value_builder_(value_builder) {} + + // Use this constructor to build the list with a pre-existing values array ListBuilder( - MemoryPool* pool, const TypePtr& type, std::shared_ptr<ArrayBuilder> value_builder) - : Int32Builder(pool, type), value_builder_(value_builder) {} - - Status Init(int32_t elements) { - // One more than requested. - // - // XXX: This is slightly imprecise, because we might trigger null mask - // resizes that are unnecessary when creating arrays with power-of-two size - return Int32Builder::Init(elements + 1); + MemoryPool* pool, std::shared_ptr<Array> values, const TypePtr& type = nullptr) : ArrayBuilder(pool, type ? 
type : std::static_pointer_cast<DataType>( + std::make_shared<ListType>(values->type()))), + offset_builder_(pool), + values_(values) {} + + Status Init(int32_t elements) override { + RETURN_NOT_OK(ArrayBuilder::Init(elements)); + // one more than requested for offsets + return offset_builder_.Resize((elements + 1) * sizeof(int32_t)); } - Status Resize(int32_t capacity) { - // Need space for the end offset - RETURN_NOT_OK(Int32Builder::Resize(capacity + 1)); - - // Slight hack, as the "real" capacity is one less - --capacity_; - return Status::OK(); + Status Resize(int32_t capacity) override { + // one more than requested for offsets + RETURN_NOT_OK(offset_builder_.Resize((capacity + 1) * sizeof(int32_t))); + return ArrayBuilder::Resize(capacity); } // Vector append // // If passed, valid_bytes is of equal length to values, and any zero byte // will be considered as a null for that slot - Status Append(value_type* values, int32_t length, uint8_t* valid_bytes = nullptr) { - if (length_ + length > capacity_) { - int32_t new_capacity = util::next_power2(length_ + length); - RETURN_NOT_OK(Resize(new_capacity)); - } - memcpy(raw_data_ + length_, values, type_traits::bytes_required(length)); - - if (valid_bytes != nullptr) { AppendNulls(valid_bytes, length); } - - length_ += length; + Status Append( + const int32_t* offsets, int32_t length, const uint8_t* valid_bytes = nullptr) { + RETURN_NOT_OK(Reserve(length)); + UnsafeAppendToBitmap(valid_bytes, length); + offset_builder_.UnsafeAppend(offsets, length); return Status::OK(); } + // The same as Finish but allows for overriding the C++ type template <typename Container> std::shared_ptr<Container> Transfer() { - std::shared_ptr<Array> items = value_builder_->Finish(); + std::shared_ptr<Array> items = values_; + if (!items) { items = value_builder_->Finish(); } - // Add final offset if the length is non-zero - if (length_) { raw_data_[length_] = items->length(); } + offset_builder_.Append(items->length()); + const auto offsets_buffer = offset_builder_.Finish(); auto result = std::make_shared<Container>( - type_, length_, data_, items, null_count_, null_bitmap_); + type_, length_, offsets_buffer, items, null_count_, null_bitmap_); - data_ = null_bitmap_ = nullptr; + // TODO(emkornfield) make a reset method capacity_ = length_ = null_count_ = 0; + null_bitmap_ = nullptr; return result; } @@ -141,26 +161,24 @@ class ListBuilder { // // This function should be called before beginning to append elements to the // value builder - Status Append(bool is_null = false) { - if (length_ == capacity_) { - // If the capacity was not already a multiple of 2, do so here - RETURN_NOT_OK(Resize(util::next_power2(capacity_ + 1))); - } - if (is_null) { - ++null_count_; - } else { - util::set_bit(null_bitmap_data_, length_); - } - raw_data_[length_++] = value_builder_->length(); + Status Append(bool is_valid = true) { + RETURN_NOT_OK(Reserve(1)); + UnsafeAppendToBitmap(is_valid); + RETURN_NOT_OK(offset_builder_.Append(value_builder_->length())); return Status::OK(); } - Status AppendNull() { return Append(true); } + Status AppendNull() { return Append(false); } - const std::shared_ptr<ArrayBuilder>& value_builder() const { return value_builder_; } + const std::shared_ptr<ArrayBuilder>& value_builder() const { + DCHECK(!values_) << "Using value builder is pointless when values_ is set"; + return value_builder_; + } protected: + BufferBuilder offset_builder_; std::shared_ptr<ArrayBuilder> value_builder_; + std::shared_ptr<Array> values_; }; } // namespace arrow
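To show how the scalar and bulk append styles above fit together, here is a toy mirror of the builder's offset bookkeeping (not the Arrow class; the real builder also tracks the null bitmap through ArrayBuilder and writes offsets into a BufferBuilder):

```cpp
#include <cstdint>
#include <vector>

struct ToyListBuilder {
  std::vector<int32_t> offsets;
  std::vector<uint8_t> valid;
  std::vector<int32_t> child;  // stand-in for the child value builder

  // Scalar style: delimit one list; the next offset is the child length so far.
  void Append(bool is_valid = true) {
    offsets.push_back(static_cast<int32_t>(child.size()));
    valid.push_back(is_valid ? 1 : 0);
  }

  // Bulk style: copy caller-provided offsets and one validity byte per slot.
  void Append(const int32_t* offs, int32_t n, const uint8_t* valid_bytes) {
    offsets.insert(offsets.end(), offs, offs + n);
    valid.insert(valid.end(), valid_bytes, valid_bytes + n);
  }

  // Mirrors Transfer(): the final offset is appended when finishing.
  void Finish() { offsets.push_back(static_cast<int32_t>(child.size())); }
};
```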
diff --git a/cpp/src/arrow/types/primitive-test.cc b/cpp/src/arrow/types/primitive-test.cc index 6bd9e73eb46ac..2b4c0879a28f4 100644 --- a/cpp/src/arrow/types/primitive-test.cc +++ b/cpp/src/arrow/types/primitive-test.cc @@ -102,7 +102,7 @@ class TestPrimitiveBuilder : public TestBuilder { Attrs::draw(N, &draws_); valid_bytes_.resize(N); - test::random_null_bitmap(N, pct_null, valid_bytes_.data()); + test::random_null_bytes(N, pct_null, valid_bytes_.data()); } void Check(const std::shared_ptr& builder, bool nullable) { @@ -193,8 +193,8 @@ void TestPrimitiveBuilder::RandomData(int N, double pct_null) { draws_.resize(N); valid_bytes_.resize(N); - test::random_null_bitmap(N, 0.5, draws_.data()); - test::random_null_bitmap(N, pct_null, valid_bytes_.data()); + test::random_null_bytes(N, 0.5, draws_.data()); + test::random_null_bytes(N, pct_null, valid_bytes_.data()); } template <> diff --git a/cpp/src/arrow/types/primitive.cc b/cpp/src/arrow/types/primitive.cc index 9549c47b41157..9102c530e25da 100644 --- a/cpp/src/arrow/types/primitive.cc +++ b/cpp/src/arrow/types/primitive.cc @@ -57,12 +57,14 @@ bool PrimitiveArray::EqualsExact(const PrimitiveArray& other) const { } return true; } else { + if (length_ == 0 && other.length_ == 0) { return true; } return data_->Equals(*other.data_, length_); } } bool PrimitiveArray::Equals(const std::shared_ptr<Array>& arr) const { if (this == arr.get()) { return true; } + if (!arr) { return false; } if (this->type_enum() != arr->type_enum()) { return false; } return EqualsExact(*static_cast<const PrimitiveArray*>(arr.get())); } @@ -101,48 +103,21 @@ Status PrimitiveBuilder::Resize(int32_t capacity) { return Status::OK(); } -template <typename T> -Status PrimitiveBuilder<T>::Reserve(int32_t elements) { - if (length_ + elements > capacity_) { - int32_t new_capacity = util::next_power2(length_ + elements); - return Resize(new_capacity); - } - return Status::OK(); -} - template <typename T> Status PrimitiveBuilder<T>::Append( const value_type* values, int32_t length, const uint8_t* valid_bytes) { - RETURN_NOT_OK(PrimitiveBuilder<T>::Reserve(length)); + RETURN_NOT_OK(Reserve(length)); if (length > 0) { memcpy(raw_data_ + length_, values, type_traits<T>::bytes_required(length)); } - if (valid_bytes != nullptr) { - PrimitiveBuilder<T>::AppendNulls(valid_bytes, length); - } else { - for (int i = 0; i < length; ++i) { - util::set_bit(null_bitmap_data_, length_ + i); - } - } + // length_ is updated by these + ArrayBuilder::UnsafeAppendToBitmap(valid_bytes, length); - length_ += length; return Status::OK(); } -template <typename T> -void PrimitiveBuilder<T>::AppendNulls(const uint8_t* valid_bytes, int32_t length) { - // If valid_bytes is all not null, then none of the values are null - for (int i = 0; i < length; ++i) { - if (valid_bytes[i] == 0) { - ++null_count_; - } else { - util::set_bit(null_bitmap_data_, length_ + i); - } - } -} - template <typename T> std::shared_ptr<Array> PrimitiveBuilder<T>::Finish() { std::shared_ptr<Array> result = std::make_shared<typename type_traits<T>::ArrayType>( @@ -166,14 +141,8 @@ Status PrimitiveBuilder::Append( } } - if (valid_bytes != nullptr) { - PrimitiveBuilder<T>::AppendNulls(valid_bytes, length); - } else { - for (int i = 0; i < length; ++i) { - util::set_bit(null_bitmap_data_, length_ + i); - } - } - length_ += length; + // this updates length_ + ArrayBuilder::UnsafeAppendToBitmap(valid_bytes, length); return Status::OK(); } diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index fcd3db4e96e53..6f6b2fed5a320 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -95,15 +95,13 @@ class PrimitiveBuilder : public ArrayBuilder { using ArrayBuilder::Advance; // Write nulls as uint8_t* (0 value indicates null) 
into pre-allocated memory - void AppendNulls(const uint8_t* valid_bytes, int32_t length); + void AppendNulls(const uint8_t* valid_bytes, int32_t length) { + UnsafeAppendToBitmap(valid_bytes, length); + } Status AppendNull() { - if (length_ == capacity_) { - // If the capacity was not already a multiple of 2, do so here - RETURN_NOT_OK(Resize(util::next_power2(capacity_ + 1))); - } - ++null_count_; - ++length_; + RETURN_NOT_OK(Reserve(1)); + UnsafeAppendToBitmap(false); return Status::OK(); } @@ -116,21 +114,17 @@ class PrimitiveBuilder : public ArrayBuilder { Status Append( const value_type* values, int32_t length, const uint8_t* valid_bytes = nullptr); - // Ensure that builder can accommodate an additional number of - // elements. Resizes if the current capacity is not sufficient - Status Reserve(int32_t elements); - std::shared_ptr Finish() override; - protected: - std::shared_ptr data_; - value_type* raw_data_; - - Status Init(int32_t capacity); + Status Init(int32_t capacity) override; // Increase the capacity of the builder to accommodate at least the indicated // number of elements - Status Resize(int32_t capacity); + Status Resize(int32_t capacity) override; + + protected: + std::shared_ptr data_; + value_type* raw_data_; }; template @@ -140,9 +134,17 @@ class NumericBuilder : public PrimitiveBuilder { using PrimitiveBuilder::PrimitiveBuilder; using PrimitiveBuilder::Append; + using PrimitiveBuilder::Init; + using PrimitiveBuilder::Resize; - // Scalar append. Does not capacity-check; make sure to call Reserve beforehand + // Scalar append. void Append(value_type val) { + ArrayBuilder::Reserve(1); + UnsafeAppend(val); + } + + // Does not capacity-check; make sure to call Reserve beforehand + void UnsafeAppend(value_type val) { util::set_bit(null_bitmap_data_, length_); raw_data_[length_++] = val; } @@ -151,9 +153,6 @@ class NumericBuilder : public PrimitiveBuilder { using PrimitiveBuilder::length_; using PrimitiveBuilder::null_bitmap_data_; using PrimitiveBuilder::raw_data_; - - using PrimitiveBuilder::Init; - using PrimitiveBuilder::Resize; }; template <> diff --git a/cpp/src/arrow/types/string.h b/cpp/src/arrow/types/string.h index c5cbe1058c7cf..d2d3c5b6b5a83 100644 --- a/cpp/src/arrow/types/string.h +++ b/cpp/src/arrow/types/string.h @@ -89,11 +89,11 @@ class StringArray : public ListArray { const uint8_t* raw_bytes_; }; -// Array builder +// String builder class StringBuilder : public ListBuilder { public: explicit StringBuilder(MemoryPool* pool, const TypePtr& type) - : ListBuilder(pool, type, std::make_shared(pool, value_type_)) { + : ListBuilder(pool, std::make_shared(pool, value_type_), type) { byte_builder_ = static_cast(value_builder_.get()); } @@ -110,7 +110,6 @@ class StringBuilder : public ListBuilder { } protected: - std::shared_ptr list_builder_; UInt8Builder* byte_builder_; static TypePtr value_type_; diff --git a/cpp/src/arrow/util/buffer.h b/cpp/src/arrow/util/buffer.h index 56532be8070ae..5ef0076953cea 100644 --- a/cpp/src/arrow/util/buffer.h +++ b/cpp/src/arrow/util/buffer.h @@ -23,6 +23,7 @@ #include #include +#include "arrow/util/bit-util.h" #include "arrow/util/macros.h" #include "arrow/util/status.h" @@ -137,26 +138,64 @@ class BufferBuilder { public: explicit BufferBuilder(MemoryPool* pool) : pool_(pool), capacity_(0), size_(0) {} + Status Resize(int32_t elements) { + if (capacity_ == 0) { buffer_ = std::make_shared(pool_); } + capacity_ = elements; + RETURN_NOT_OK(buffer_->Resize(capacity_)); + data_ = buffer_->mutable_data(); + return Status::OK(); + } + 
Status Append(const uint8_t* data, int length) { - if (capacity_ < length + size_) { - if (capacity_ == 0) { buffer_ = std::make_shared(pool_); } - capacity_ = std::max(MIN_BUFFER_CAPACITY, capacity_); - while (capacity_ < length + size_) { - capacity_ *= 2; - } - RETURN_NOT_OK(buffer_->Resize(capacity_)); - data_ = buffer_->mutable_data(); - } + if (capacity_ < length + size_) { RETURN_NOT_OK(Resize(length + size_)); } + UnsafeAppend(data, length); + return Status::OK(); + } + + template + Status Append(T arithmetic_value) { + static_assert(std::is_arithmetic::value, + "Convenience buffer append only supports arithmetic types"); + return Append(reinterpret_cast(&arithmetic_value), sizeof(T)); + } + + template + Status Append(const T* arithmetic_values, int num_elements) { + static_assert(std::is_arithmetic::value, + "Convenience buffer append only supports arithmetic types"); + return Append( + reinterpret_cast(arithmetic_values), num_elements * sizeof(T)); + } + + // Unsafe methods don't check existing size + void UnsafeAppend(const uint8_t* data, int length) { memcpy(data_ + size_, data, length); size_ += length; - return Status::OK(); + } + + template + void UnsafeAppend(T arithmetic_value) { + static_assert(std::is_arithmetic::value, + "Convenience buffer append only supports arithmetic types"); + UnsafeAppend(reinterpret_cast(&arithmetic_value), sizeof(T)); + } + + template + void UnsafeAppend(const T* arithmetic_values, int num_elements) { + static_assert(std::is_arithmetic::value, + "Convenience buffer append only supports arithmetic types"); + UnsafeAppend( + reinterpret_cast(arithmetic_values), num_elements * sizeof(T)); } std::shared_ptr Finish() { auto result = buffer_; buffer_ = nullptr; + capacity_ = size_ = 0; return result; } + int capacity() { return capacity_; } + int length() { return size_; } private: std::shared_ptr buffer_; diff --git a/cpp/src/arrow/util/logging.h b/cpp/src/arrow/util/logging.h index 527ce423e7751..fccc5e3085de5 100644 --- a/cpp/src/arrow/util/logging.h +++ b/cpp/src/arrow/util/logging.h @@ -18,8 +18,8 @@ #ifndef ARROW_UTIL_LOGGING_H #define ARROW_UTIL_LOGGING_H -#include #include +#include namespace arrow { diff --git a/cpp/src/arrow/util/memory-pool.cc b/cpp/src/arrow/util/memory-pool.cc index fb417e74daf53..961554fe06bcc 100644 --- a/cpp/src/arrow/util/memory-pool.cc +++ b/cpp/src/arrow/util/memory-pool.cc @@ -18,8 +18,8 @@ #include "arrow/util/memory-pool.h" #include -#include #include +#include #include "arrow/util/status.h" diff --git a/python/pyarrow/tests/test_array.py b/python/pyarrow/tests/test_array.py index d608f8167df65..bf5a22089cdba 100644 --- a/python/pyarrow/tests/test_array.py +++ b/python/pyarrow/tests/test_array.py @@ -31,14 +31,15 @@ def test_getitem_NA(self): assert arr[1] is pyarrow.NA def test_list_format(self): - arr = pyarrow.from_pylist([[1], None, [2, 3]]) + arr = pyarrow.from_pylist([[1], None, [2, 3, None]]) result = fmt.array_format(arr) expected = """\ [ [1], NA, [2, - 3] + 3, + NA] ]""" assert result == expected From a541644721ba4cb4723931b2a5eff1ac58c8aedd Mon Sep 17 00:00:00 2001 From: Philipp Moritz Date: Sat, 23 Apr 2016 11:11:05 -0400 Subject: [PATCH 0063/1644] ARROW-100: [C++] Computing RowBatch size Implement RowBatchWriter::DataHeaderSize and arrow::ipc::GetRowBatchSize. To achieve this, the Flatbuffer metadata is written to a temporary buffer and its size is determined. This commit also adds MockMemorySource, a new MemorySource that tracks the amount of memory written. 
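The measuring trick described in this commit message reduces to running the normal write path against a sink that records only the furthest byte touched. A hedged sketch of that idea with a toy interface (the committed class is ipc::MockMemorySource and its GetExtentBytesWritten accessor):

```cpp
#include <algorithm>
#include <cstdint>

class CountingSink {
 public:
  // Same shape as a write call, but only bookkeeping: no bytes are copied.
  void Write(int64_t position, const uint8_t* /*data*/, int64_t nbytes) {
    extent_ = std::max(extent_, position + nbytes);
  }
  int64_t extent() const { return extent_; }  // smallest size that fits all writes

 private:
  int64_t extent_ = 0;
};
```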
Author: Philipp Moritz Author: Philipp Moritz Closes #61 from pcmoritz/rowbatchsize and squashes the following commits: e95fc5c [Philipp Moritz] fix formating 253c9f0 [Philipp Moritz] rename MockMemorySource methods to reflect better what they are doing 3484458 [Philipp Moritz] add tests for more datatypes 6b798f8 [Philipp Moritz] fix maximum recursion depth 67af8e1 [Philipp Moritz] merge GetRowBatchSize 9b69f12 [Philipp Moritz] factor out GetRowBatchSize test, use MockMemorySource to implement GetRowBatchSize, unify DataHeaderSize and TotalBytes into GetTotalSize aa48cdf [Philipp Moritz] ARROW-100: [C++] Computing RowBatch size --- cpp/src/arrow/ipc/adapter.cc | 29 ++++++++++++++------------- cpp/src/arrow/ipc/adapter.h | 2 +- cpp/src/arrow/ipc/ipc-adapter-test.cc | 28 ++++++++++++++++++++++++++ cpp/src/arrow/ipc/memory.cc | 25 +++++++++++++++++++++++ cpp/src/arrow/ipc/memory.h | 22 ++++++++++++++++++++ 5 files changed, 91 insertions(+), 15 deletions(-) diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index bf6fa94dea7a4..34700080746e7 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -179,20 +179,13 @@ class RowBatchWriter { } // This must be called after invoking AssemblePayload - int64_t DataHeaderSize() { - // TODO(wesm): In case it is needed, compute the upper bound for the size - // of the buffer containing the flatbuffer data header. - return 0; - } - - // Total footprint of buffers. This must be called after invoking - // AssemblePayload - int64_t TotalBytes() { - int64_t total = 0; - for (const std::shared_ptr& buffer : buffers_) { - total += buffer->size(); - } - return total; + Status GetTotalSize(int64_t* size) { + // emulates the behavior of Write without actually writing + int64_t data_header_offset; + MockMemorySource source(0); + RETURN_NOT_OK(Write(&source, 0, &data_header_offset)); + *size = source.GetExtentBytesWritten(); + return Status::OK(); } private: @@ -211,6 +204,14 @@ Status WriteRowBatch(MemorySource* dst, const RowBatch* batch, int64_t position, RETURN_NOT_OK(serializer.AssemblePayload()); return serializer.Write(dst, position, header_offset); } + +Status GetRowBatchSize(const RowBatch* batch, int64_t* size) { + RowBatchWriter serializer(batch, kMaxIpcRecursionDepth); + RETURN_NOT_OK(serializer.AssemblePayload()); + RETURN_NOT_OK(serializer.GetTotalSize(size)); + return Status::OK(); +} + // ---------------------------------------------------------------------- // Row batch read path diff --git a/cpp/src/arrow/ipc/adapter.h b/cpp/src/arrow/ipc/adapter.h index 4c9a8a9d8ee39..0d2b77f5acefe 100644 --- a/cpp/src/arrow/ipc/adapter.h +++ b/cpp/src/arrow/ipc/adapter.h @@ -62,7 +62,7 @@ Status WriteRowBatch(MemorySource* dst, const RowBatch* batch, int64_t position, // Compute the precise number of bytes needed in a contiguous memory segment to // write the row batch. This involves generating the complete serialized // Flatbuffers metadata. 
-int64_t GetRowBatchSize(const RowBatch* batch); +Status GetRowBatchSize(const RowBatch* batch, int64_t* size); // ---------------------------------------------------------------------- // "Read" path; does not copy data if the MemorySource does not diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc index c243cfba820cc..3b147343f772a 100644 --- a/cpp/src/arrow/ipc/ipc-adapter-test.cc +++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc @@ -195,6 +195,34 @@ INSTANTIATE_TEST_CASE_P(RoundTripTests, TestWriteRowBatch, ::testing::Values(&MakeIntRowBatch, &MakeListRowBatch, &MakeNonNullRowBatch, &MakeZeroLengthRowBatch, &MakeDeeplyNestedList)); +void TestGetRowBatchSize(std::shared_ptr batch) { + MockMemorySource mock_source(1 << 16); + int64_t mock_header_location; + int64_t size; + ASSERT_OK(WriteRowBatch(&mock_source, batch.get(), 0, &mock_header_location)); + ASSERT_OK(GetRowBatchSize(batch.get(), &size)); + ASSERT_EQ(mock_source.GetExtentBytesWritten(), size); +} + +TEST_F(TestWriteRowBatch, IntegerGetRowBatchSize) { + std::shared_ptr batch; + + ASSERT_OK(MakeIntRowBatch(&batch)); + TestGetRowBatchSize(batch); + + ASSERT_OK(MakeListRowBatch(&batch)); + TestGetRowBatchSize(batch); + + ASSERT_OK(MakeZeroLengthRowBatch(&batch)); + TestGetRowBatchSize(batch); + + ASSERT_OK(MakeNonNullRowBatch(&batch)); + TestGetRowBatchSize(batch); + + ASSERT_OK(MakeDeeplyNestedList(&batch)); + TestGetRowBatchSize(batch); +} + class RecursionLimits : public ::testing::Test, public MemoryMapFixture { public: void SetUp() { pool_ = default_memory_pool(); } diff --git a/cpp/src/arrow/ipc/memory.cc b/cpp/src/arrow/ipc/memory.cc index 84cbc182cd26f..caff2c610b907 100644 --- a/cpp/src/arrow/ipc/memory.cc +++ b/cpp/src/arrow/ipc/memory.cc @@ -145,5 +145,30 @@ Status MemoryMappedSource::Write(int64_t position, const uint8_t* data, int64_t return Status::OK(); } +MockMemorySource::MockMemorySource(int64_t size) + : size_(size), extent_bytes_written_(0) {} + +Status MockMemorySource::Close() { + return Status::OK(); +} + +Status MockMemorySource::ReadAt( + int64_t position, int64_t nbytes, std::shared_ptr* out) { + return Status::OK(); +} + +Status MockMemorySource::Write(int64_t position, const uint8_t* data, int64_t nbytes) { + extent_bytes_written_ = std::max(extent_bytes_written_, position + nbytes); + return Status::OK(); +} + +int64_t MockMemorySource::Size() const { + return size_; +} + +int64_t MockMemorySource::GetExtentBytesWritten() const { + return extent_bytes_written_; +} + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/memory.h b/cpp/src/arrow/ipc/memory.h index e529603dc6e2a..c6fd7a718991b 100644 --- a/cpp/src/arrow/ipc/memory.h +++ b/cpp/src/arrow/ipc/memory.h @@ -121,6 +121,28 @@ class MemoryMappedSource : public MemorySource { std::unique_ptr impl_; }; +// A MemorySource that tracks the size of allocations from a memory source +class MockMemorySource : public MemorySource { + public: + explicit MockMemorySource(int64_t size); + + Status Close() override; + + Status ReadAt(int64_t position, int64_t nbytes, std::shared_ptr* out) override; + + Status Write(int64_t position, const uint8_t* data, int64_t nbytes) override; + + int64_t Size() const override; + + // @return: the smallest number of bytes containing the modified region of the + // MockMemorySource + int64_t GetExtentBytesWritten() const; + + private: + int64_t size_; + int64_t extent_bytes_written_; +}; + } // namespace ipc } // namespace arrow From 56514d93a2d1c5ad9419c807f23127eb07d9ccfe 
Mon Sep 17 00:00:00 2001 From: Micah Kornfield Date: Fri, 29 Apr 2016 19:31:25 -0700 Subject: [PATCH 0064/1644] ARROW-104: [FORMAT] Add alignment and padding requirements + union clarification I believe this change captures the discussion we had on the mailing list about alignment and padding for arrays. It also captures the update to UnionArrays. The rendered version should be viewable here: https://github.com/emkornfield/arrow/blob/emk_format_changes/format/Layout.md Author: Micah Kornfield Closes #67 from emkornfield/emk_format_changes and squashes the following commits: c91421e [Micah Kornfield] fixes per code review b33d4c2 [Micah Kornfield] Add alignment and padding requirements. update union types buffer to reflect using only 1 type buffer --- format/Layout.md | 165 +++++++++++++++++++++++++++-------------------- 1 file changed, 95 insertions(+), 70 deletions(-) diff --git a/format/Layout.md b/format/Layout.md index 92553d944c2d1..34eade313415a 100644 --- a/format/Layout.md +++ b/format/Layout.md @@ -10,6 +10,8 @@ concepts, here is a small glossary to help disambiguate. * Contiguous memory region: a sequential virtual address space with a given length. Any byte can be reached via a single pointer offset less than the region's length. +* Contiguous memory buffer: A contiguous memory region that stores + a multi-value component of an Array. Sometimes referred to as just "buffer". * Primitive type: a data type that occupies a fixed-size memory slot specified in bit width or byte width * Nested or parametric type: a data type whose full structure depends on one or @@ -41,7 +43,7 @@ Base requirements linearly in the nesting level * Capable of representing fully-materialized and decoded / decompressed Parquet data -* All leaf nodes (primitive value arrays) use contiguous memory regions +* All contiguous memory buffers are aligned at 64-byte boundaries and padded to a multiple of 64 bytes. * Any relative type can have null slots * Arrays are immutable once created. Implementations can provide APIs to mutate an array, but applying mutations will require a new array data structure to @@ -78,6 +80,28 @@ Base requirements The Arrow format is little endian. +## Alignment and Padding + +As noted above, all buffers are intended to be aligned in memory at 64 byte +boundaries and padded to a length that is a multiple of 64 bytes. The alignment +requirement follows best practices for optimized memory access: + +* Elements in numeric arrays will be guaranteed to be retrieved via aligned access. +* On some architectures alignment can help limit partially used cache lines. +* 64 byte alignment is recommended by the [Intel performance guide][2] for +data-structures over 64 bytes (which will be a common case for Arrow Arrays). + +Requiring padding to a multiple of 64 bytes allows for using SIMD instructions +consistently in loops without additional conditional checks. +This should allow for simpler and more efficient code. +The specific padding length was chosen because it matches the largest known +SIMD instruction registers available as of April 2016 (Intel AVX-512). +Guaranteed padding can also allow certain compilers +to generate more optimized code directly (e.g. One can safely use Intel's +`-qopt-assume-safe-padding`). + +Unless otherwise noted, padded bytes do not need to have a specific value. + ## Array lengths Any array has a known and fixed length, stored as a 32-bit signed integer, so a @@ -101,14 +125,14 @@ signed integer, as it may be as large as the array length. 
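One practical consequence of the 64-byte rule above is that computing a padded allocation size is a single rounding step. A minimal sketch follows; the helper names are our own choosing and not part of this specification.

```
#include <cstdint>

// Round a byte length up to the next multiple of 64, per the padding
// requirement above; 64 is a power of two, so bit masking suffices.
inline int64_t PaddedLength(int64_t nbytes) {
  return (nbytes + 63) & ~int64_t(63);
}

// For example, a validity bitmap covering n slots needs ceil(n / 8)
// bytes of data, allocated as PaddedLength((n + 7) / 8) bytes.
```
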
Any relative type can have null value slots, whether primitive or nested type. An array with nulls must have a contiguous memory buffer, known as the null (or -validity) bitmap, whose length is a multiple of 8 bytes (to avoid -word-alignment concerns) and large enough to have at least 1 bit for each array +validity) bitmap, whose length is a multiple of 64 bytes (as discussed above) +and large enough to have at least 1 bit for each array slot. Whether any array slot is valid (non-null) is encoded in the respective bits of this bitmap. A 1 (set bit) for index `j` indicates that the value is not null, while a 0 (bit not set) indicates that it is null. Bitmaps are to be -initialized to be all unset at allocation time. +initialized to be all unset at allocation time (this includes padding). ``` is_valid[j] -> bitmap[j / 8] & (1 << (j % 8)) @@ -158,15 +182,15 @@ Would look like: * Length: 5, Null count: 1 * Null bitmap buffer: - |Byte 0 (validity bitmap) | Bytes 1-7 | + |Byte 0 (validity bitmap) | Bytes 1-63 | |-------------------------|-----------------------| |00011011 | 0 (padding) | * Value Buffer: - |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | - |------------|-------------|-------------|-------------|-------------| - | 1 | 2 | unspecified | 4 | 8 | + |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 | + |------------|-------------|-------------|-------------|-------------|-------------| + | 1 | 2 | unspecified | 4 | 8 | unspecified | ``` ### Example Layout: Non-null int32 Array @@ -177,15 +201,15 @@ Would look like: * Length: 5, Null count: 0 * Null bitmap buffer: - | Byte 0 (validity bitmap) | Bytes 1-7 | + | Byte 0 (validity bitmap) | Bytes 1-63 | |--------------------------|-----------------------| | 00011111 | 0 (padding) | * Value Buffer: - |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | bytes 12-15 | bytes 16-19 | - |------------|-------------|-------------|-------------|-------------| - | 1 | 2 | 3 | 4 | 8 | + |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | bytes 12-15 | bytes 16-19 | Bytes 20-63 | + |------------|-------------|-------------|-------------|-------------|-------------| + | 1 | 2 | 3 | 4 | 8 | unspecified | ``` or with the bitmap elided: @@ -195,9 +219,9 @@ or with the bitmap elided: * Null bitmap buffer: Not required * Value Buffer: - |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | bytes 12-15 | bytes 16-19 | - |------------|-------------|-------------|-------------|-------------| - | 1 | 2 | 3 | 4 | 8 | + |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | bytes 12-15 | bytes 16-19 | Bytes 20-63 | + |------------|-------------|-------------|-------------|-------------|-------------| + | 1 | 2 | 3 | 4 | 8 | unspecified | ``` ## List type @@ -243,23 +267,23 @@ will have the following representation: * Length: 4, Null count: 1 * Null bitmap buffer: - | Byte 0 (validity bitmap) | Bytes 1-7 | + | Byte 0 (validity bitmap) | Bytes 1-63 | |--------------------------|-----------------------| | 00001101 | 0 (padding) | * Offsets buffer (int32) - | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | - |------------|-------------|-------------|-------------|-------------| - | 0 | 3 | 3 | 7 | 7 | + | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-63 | + |------------|-------------|-------------|-------------|-------------|-------------| + | 0 | 3 | 3 | 7 | 7 | unspecified | * Values array (char array): * Length: 7, Null count: 0 * Null bitmap buffer: Not required - | Bytes 0-7 | - |------------| - | joemark | + | Bytes 0-7 | Bytes 8-63 
|
+ |------------|-------------|
+ | joemark | unspecified |
 ```

 ### Example Layout: `List<List<byte>>`

@@ -273,31 +297,31 @@ will be represented as follows:
 * Null bitmap buffer: Not required
 * Offsets buffer (int32)

- | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 |
- |------------|------------|------------|-------------|
- | 0 | 2 | 6 | 7 |
+ | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
+ |------------|------------|------------|-------------|-------------|
+ | 0 | 2 | 6 | 7 | unspecified |

 * Values array (`List<byte>`)

 * Length: 6, Null count: 1
 * Null bitmap buffer:

- | Byte 0 (validity bitmap) | Bytes 1-7 |
+ | Byte 0 (validity bitmap) | Bytes 1-63 |
 |--------------------------|-------------|
 | 00110111 | 0 (padding) |

 * Offsets buffer (int32)

- | Bytes 0-28 |
- |----------------------|
- | 0, 2, 4, 7, 7, 8, 10 |
+ | Bytes 0-28 | Bytes 29-63 |
+ |----------------------|-------------|
+ | 0, 2, 4, 7, 7, 8, 10 | unspecified |

 * Values array (bytes):
 * Length: 10, Null count: 0
 * Null bitmap buffer: Not required

- | Bytes 0-9 |
- |-------------------------------|
- | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 |
+ | Bytes 0-9 | Bytes 10-63 |
+ |-------------------------------|-------------|
+ | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | unspecified |
 ```

 ## Struct type
@@ -333,9 +357,9 @@ The layout for [{'joe', 1}, {null, 2}, null, {'mark', 4}] would be:
 * Length: 4, Null count: 1
 * Null bitmap buffer:

- | Byte 0 (validity bitmap) | Bytes 1-7 |
- |--------------------------|-------------|
- | 00001011 | 0 (padding) |
+ | Byte 0 (validity bitmap) | Bytes 1-7 | Bytes 8-63 |
+ |--------------------------|-------------|-------------|
+ | 00001011 | 0 (padding) | unspecified |

 * Children arrays:
 * field-0 array (`List<char>`):
@@ -396,13 +420,13 @@ The union types may be named, but like structs this will be a matter of the
 metadata and will not affect the physical memory layout.

 We define two distinct union types that are optimized for different use
-cases. The first, the dense union, represents a mixed-type array with 6 bytes
+cases. The first, the dense union, represents a mixed-type array with 5 bytes
 of overhead for each value. Its physical layout is as follows:

 * One child array for each relative type
-* Types buffer: A buffer of unsigned integers, enumerated from 0 corresponding
- to each type, with the smallest byte width capable of representing the number
- of types in the union.
+* Types buffer: A buffer of 8-bit signed integers, enumerated from 0 corresponding
+ to each type. A union with more than 127 possible types can be modeled as a
+ union of unions.
* Offsets buffer: A buffer of signed int32 values indicating the relative offset
 into the respective child array for the type in a given slot. The respective
 offsets for each child value array must be in order / increasing.
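To make the two indirections concrete, here is a hedged sketch of how a reader could resolve one slot of a dense union; the struct and function names are invented for illustration and are not part of this specification.

```
#include <cstdint>

// Slot i of a dense union resolves in two steps: the types buffer picks
// the child array, and the offsets buffer picks the position inside it.
// Per-slot overhead is 5 bytes: a 1-byte type id plus a 4-byte offset.
struct DenseUnionSlot {
  int8_t child_index;    // which child array holds the value
  int32_t child_offset;  // position of the value within that child
};

inline DenseUnionSlot ResolveDenseUnionSlot(
    const int8_t* type_ids, const int32_t* offsets, int64_t i) {
  return DenseUnionSlot{type_ids[i], offsets[i]};
}
```
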
@@ -420,21 +444,21 @@ An example layout for logical union of: ``` * Length: 4, Null count: 1 * Null bitmap buffer: - |Byte 0 (validity bitmap) | Bytes 1-7 | + |Byte 0 (validity bitmap) | Bytes 1-63 | |-------------------------|-----------------------| |00001101 | 0 (padding) | * Types buffer: - |Byte 0-1 | Byte 2-3 | Byte 4-5 | Byte 6-7 | - |---------|-------------|----------|----------| - | 0 | unspecified | 0 | 1 | + |Byte 0 | Byte 1 | Byte 2 | Byte 3 | Bytes 4-63 | + |---------|-------------|----------|----------|-------------| + | 0 | unspecified | 0 | 1 | unspecified | * Offset buffer: - |Byte 0-3 | Byte 4-7 | Byte 8-11 | Byte 12-15 | - |---------|-------------|-----------|------------| - | 0 | unspecified | 1 | 0 | + |Byte 0-3 | Byte 4-7 | Byte 8-11 | Byte 12-15 | Bytes 16-63 | + |---------|-------------|-----------|------------|-------------| + | 0 | unspecified | 1 | 0 | unspecified | * Children arrays: * Field-0 array (f: float): @@ -443,9 +467,9 @@ An example layout for logical union of: * Value Buffer: - | Bytes 0-7 | - |-----------| - | 1.2, 3.4 | + | Bytes 0-7 | Bytes 8-63 | + |-----------|-------------| + | 1.2, 3.4 | unspecified | * Field-1 array (f: float): @@ -454,9 +478,9 @@ An example layout for logical union of: * Value Buffer: - | Bytes 0-3 | - |-----------| - | 5 | + | Bytes 0-3 | Bytes 4-63 | + |-----------|-------------| + | 5 | unspecified | ``` ## Sparse union type @@ -484,9 +508,9 @@ will have the following layout: * Types buffer: - | Bytes 0-1 | Bytes 2-3 | Bytes 4-5 | Bytes 6-7 | Bytes 8-9 | Bytes 10-11 | - |------------|-------------|-------------|-------------|-------------|--------------| - | 0 | 1 | 2 | 1 | 0 | 2 | + | Byte 0 | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Bytes 6-63 | + |------------|-------------|-------------|-------------|-------------|--------------|-----------------------| + | 0 | 1 | 2 | 1 | 0 | 2 | unspecified (padding) | * Children arrays: @@ -494,51 +518,51 @@ will have the following layout: * Length: 6, Null count: 4 * Null bitmap buffer: - |Byte 0 (validity bitmap) | Bytes 1-7 | + |Byte 0 (validity bitmap) | Bytes 1-63 | |-------------------------|-----------------------| |00010001 | 0 (padding) | * Value buffer: - |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | - |------------|-------------|-------------|-------------|-------------|--------------| - | 1 | unspecified | unspecified | unspecified | 4 | unspecified | + |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | Bytes 24-63 | + |------------|-------------|-------------|-------------|-------------|--------------|-----------------------| + | 1 | unspecified | unspecified | unspecified | 4 | unspecified | unspecified (padding) | * u1 (float): * Length: 6, Null count: 4 * Null bitmap buffer: - |Byte 0 (validity bitmap) | Bytes 1-7 | + |Byte 0 (validity bitmap) | Bytes 1-63 | |-------------------------|-----------------------| |00001010 | 0 (padding) | * Value buffer: - |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | - |-------------|-------------|-------------|-------------|-------------|--------------| - | unspecified | 1.2 | unspecified | 3.4 | unspecified | unspecified | + |Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | Bytes 24-63 | + |-------------|-------------|-------------|-------------|-------------|--------------|-----------------------| + | unspecified | 1.2 | unspecified | 3.4 | unspecified | unspecified | unspecified (padding) | * u2 (`List`) 
* Length: 6, Null count: 4 * Null bitmap buffer: - | Byte 0 (validity bitmap) | Bytes 1-7 | + | Byte 0 (validity bitmap) | Bytes 1-63 | |--------------------------|-----------------------| | 00100100 | 0 (padding) | * Offsets buffer (int32) - | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | Bytes 24-27 | - |------------|-------------|-------------|-------------|-------------|-------------|-------------| - | 0 | 0 | 0 | 3 | 3 | 3 | 7 | + | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | Bytes 24-27 | Bytes 28-63 | + |------------|-------------|-------------|-------------|-------------|-------------|-------------|-------------| + | 0 | 0 | 0 | 3 | 3 | 3 | 7 | unspecified | * Values array (char array): * Length: 7, Null count: 0 * Null bitmap buffer: Not required - | Bytes 0-7 | - |------------| - | joemark | + | Bytes 0-7 | Bytes 8-63 | + |------------|-----------------------| + | joemark | unspecified (padding) | ``` Note that nested types in a sparse union must be internally consistent @@ -557,3 +581,4 @@ the the types array indicates that a slot contains a different type at the index Drill docs https://drill.apache.org/docs/value-vectors/ [1]: https://en.wikipedia.org/wiki/Bit_numbering +[2]: https://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors From 355f7c96a194c65bad523466586f51a9ae0e8627 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sun, 1 May 2016 15:53:37 -0700 Subject: [PATCH 0065/1644] ARROW-92: Arrow to Parquet Schema conversion My current WIP state. To make the actual schema conversion complete, we probably need the physical structure too as Arrow schemas only care about logical types whereas Parquet schema is about logical and physical types. Author: Uwe L. Korn Closes #68 from xhochy/arrow-92 and squashes the following commits: e3aa261 [Uwe L. Korn] Add macro to convert ParquetException to Status 9c5b085 [Uwe L. Korn] Include string 42ed0ea [Uwe L. Korn] Add struct conversion 38e68e5 [Uwe L. Korn] make format 9a6c876 [Uwe L. Korn] Add more types 8a0293e [Uwe L. 
Korn] ARROW-92: Arrow to Parquet Schema conversion --- cpp/src/arrow/parquet/parquet-schema-test.cc | 75 +++++++++++ cpp/src/arrow/parquet/schema.cc | 130 +++++++++++++++++++ cpp/src/arrow/parquet/schema.h | 5 + 3 files changed, 210 insertions(+) diff --git a/cpp/src/arrow/parquet/parquet-schema-test.cc b/cpp/src/arrow/parquet/parquet-schema-test.cc index e2280f41189ef..8de739491b56f 100644 --- a/cpp/src/arrow/parquet/parquet-schema-test.cc +++ b/cpp/src/arrow/parquet/parquet-schema-test.cc @@ -161,6 +161,81 @@ TEST_F(TestConvertParquetSchema, UnsupportedThings) { } } +class TestConvertArrowSchema : public ::testing::Test { + public: + virtual void SetUp() {} + + void CheckFlatSchema(const std::vector& nodes) { + NodePtr schema_node = GroupNode::Make("schema", Repetition::REPEATED, nodes); + const GroupNode* expected_schema_node = + static_cast(schema_node.get()); + const GroupNode* result_schema_node = + static_cast(result_schema_->schema().get()); + + ASSERT_EQ(expected_schema_node->field_count(), result_schema_node->field_count()); + + for (int i = 0; i < expected_schema_node->field_count(); i++) { + auto lhs = result_schema_node->field(i); + auto rhs = expected_schema_node->field(i); + EXPECT_TRUE(lhs->Equals(rhs.get())); + } + } + + Status ConvertSchema(const std::vector>& fields) { + arrow_schema_ = std::make_shared(fields); + return ToParquetSchema(arrow_schema_.get(), &result_schema_); + } + + protected: + std::shared_ptr arrow_schema_; + std::shared_ptr<::parquet::SchemaDescriptor> result_schema_; +}; + +TEST_F(TestConvertArrowSchema, ParquetFlatPrimitives) { + std::vector parquet_fields; + std::vector> arrow_fields; + + parquet_fields.push_back( + PrimitiveNode::Make("boolean", Repetition::REQUIRED, ParquetType::BOOLEAN)); + arrow_fields.push_back(std::make_shared("boolean", BOOL, false)); + + parquet_fields.push_back( + PrimitiveNode::Make("int32", Repetition::REQUIRED, ParquetType::INT32)); + arrow_fields.push_back(std::make_shared("int32", INT32, false)); + + parquet_fields.push_back( + PrimitiveNode::Make("int64", Repetition::REQUIRED, ParquetType::INT64)); + arrow_fields.push_back(std::make_shared("int64", INT64, false)); + + parquet_fields.push_back( + PrimitiveNode::Make("float", Repetition::OPTIONAL, ParquetType::FLOAT)); + arrow_fields.push_back(std::make_shared("float", FLOAT)); + + parquet_fields.push_back( + PrimitiveNode::Make("double", Repetition::OPTIONAL, ParquetType::DOUBLE)); + arrow_fields.push_back(std::make_shared("double", DOUBLE)); + + // TODO: String types need to be clarified a bit more in the Arrow spec + parquet_fields.push_back(PrimitiveNode::Make( + "string", Repetition::OPTIONAL, ParquetType::BYTE_ARRAY, LogicalType::UTF8)); + arrow_fields.push_back(std::make_shared("string", UTF8)); + + ASSERT_OK(ConvertSchema(arrow_fields)); + + CheckFlatSchema(parquet_fields); +} + +TEST_F(TestConvertArrowSchema, ParquetFlatDecimals) { + std::vector parquet_fields; + std::vector> arrow_fields; + + // TODO: Test Decimal Arrow -> Parquet conversion + + ASSERT_OK(ConvertSchema(arrow_fields)); + + CheckFlatSchema(parquet_fields); +} + TEST(TestNodeConversion, DateAndTime) {} } // namespace parquet diff --git a/cpp/src/arrow/parquet/schema.cc b/cpp/src/arrow/parquet/schema.cc index 560e28374066b..214c764f08b6e 100644 --- a/cpp/src/arrow/parquet/schema.cc +++ b/cpp/src/arrow/parquet/schema.cc @@ -17,13 +17,18 @@ #include "arrow/parquet/schema.h" +#include #include #include "parquet/api/schema.h" +#include "parquet/exception.h" #include "arrow/types/decimal.h" +#include 
"arrow/types/string.h" #include "arrow/util/status.h" +using parquet::ParquetException; +using parquet::Repetition; using parquet::schema::Node; using parquet::schema::NodePtr; using parquet::schema::GroupNode; @@ -36,6 +41,11 @@ namespace arrow { namespace parquet { +#define PARQUET_CATCH_NOT_OK(s) \ + try { \ + (s); \ + } catch (const ParquetException& e) { return Status::Invalid(e.what()); } + const auto BOOL = std::make_shared(); const auto UINT8 = std::make_shared(); const auto INT32 = std::make_shared(); @@ -182,6 +192,126 @@ Status FromParquetSchema( return Status::OK(); } +Status StructToNode(const std::shared_ptr& type, const std::string& name, + bool nullable, NodePtr* out) { + Repetition::type repetition = Repetition::REQUIRED; + if (nullable) { repetition = Repetition::OPTIONAL; } + + std::vector children(type->num_children()); + for (int i = 0; i < type->num_children(); i++) { + RETURN_NOT_OK(FieldToNode(type->child(i), &children[i])); + } + + *out = GroupNode::Make(name, repetition, children); + return Status::OK(); +} + +Status FieldToNode(const std::shared_ptr& field, NodePtr* out) { + LogicalType::type logical_type = LogicalType::NONE; + ParquetType::type type; + Repetition::type repetition = Repetition::REQUIRED; + if (field->nullable) { repetition = Repetition::OPTIONAL; } + int length = -1; + + switch (field->type->type) { + // TODO: + // case Type::NA: + // break; + case Type::BOOL: + type = ParquetType::BOOLEAN; + break; + case Type::UINT8: + type = ParquetType::INT32; + logical_type = LogicalType::UINT_8; + break; + case Type::INT8: + type = ParquetType::INT32; + logical_type = LogicalType::INT_8; + break; + case Type::UINT16: + type = ParquetType::INT32; + logical_type = LogicalType::UINT_16; + break; + case Type::INT16: + type = ParquetType::INT32; + logical_type = LogicalType::INT_16; + break; + case Type::UINT32: + type = ParquetType::INT32; + logical_type = LogicalType::UINT_32; + break; + case Type::INT32: + type = ParquetType::INT32; + break; + case Type::UINT64: + type = ParquetType::INT64; + logical_type = LogicalType::UINT_64; + break; + case Type::INT64: + type = ParquetType::INT64; + break; + case Type::FLOAT: + type = ParquetType::FLOAT; + break; + case Type::DOUBLE: + type = ParquetType::DOUBLE; + break; + case Type::CHAR: + type = ParquetType::FIXED_LEN_BYTE_ARRAY; + logical_type = LogicalType::UTF8; + length = static_cast(field->type.get())->size; + break; + case Type::STRING: + type = ParquetType::BYTE_ARRAY; + logical_type = LogicalType::UTF8; + break; + case Type::BINARY: + type = ParquetType::BYTE_ARRAY; + break; + case Type::DATE: + type = ParquetType::INT32; + logical_type = LogicalType::DATE; + break; + case Type::TIMESTAMP: + type = ParquetType::INT64; + logical_type = LogicalType::TIMESTAMP_MILLIS; + break; + case Type::TIMESTAMP_DOUBLE: + type = ParquetType::INT64; + // This is specified as seconds since the UNIX epoch + // TODO: Converted type in Parquet? 
+ // logical_type = LogicalType::TIMESTAMP_MILLIS; + break; + case Type::TIME: + type = ParquetType::INT64; + logical_type = LogicalType::TIME_MILLIS; + break; + case Type::STRUCT: { + auto struct_type = std::static_pointer_cast(field->type); + return StructToNode(struct_type, field->name, field->nullable, out); + } break; + default: + // TODO: LIST, DENSE_UNION, SPARE_UNION, JSON_SCALAR, DECIMAL, DECIMAL_TEXT, VARCHAR + return Status::NotImplemented("unhandled type"); + } + *out = PrimitiveNode::Make(field->name, repetition, type, logical_type, length); + return Status::OK(); +} + +Status ToParquetSchema( + const Schema* arrow_schema, std::shared_ptr<::parquet::SchemaDescriptor>* out) { + std::vector nodes(arrow_schema->num_fields()); + for (int i = 0; i < arrow_schema->num_fields(); i++) { + RETURN_NOT_OK(FieldToNode(arrow_schema->field(i), &nodes[i])); + } + + NodePtr schema = GroupNode::Make("schema", Repetition::REPEATED, nodes); + *out = std::make_shared<::parquet::SchemaDescriptor>(); + PARQUET_CATCH_NOT_OK((*out)->Init(schema)); + + return Status::OK(); +} + } // namespace parquet } // namespace arrow diff --git a/cpp/src/arrow/parquet/schema.h b/cpp/src/arrow/parquet/schema.h index a44a9a4b6a892..bfc7d21138154 100644 --- a/cpp/src/arrow/parquet/schema.h +++ b/cpp/src/arrow/parquet/schema.h @@ -36,6 +36,11 @@ Status NodeToField(const ::parquet::schema::NodePtr& node, std::shared_ptr* out); +Status FieldToNode(const std::shared_ptr& field, ::parquet::schema::NodePtr* out); + +Status ToParquetSchema( + const Schema* arrow_schema, std::shared_ptr<::parquet::SchemaDescriptor>* out); + } // namespace parquet } // namespace arrow From ad3d01dd5c47f6d21771a53d437772cf71bee10f Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Tue, 3 May 2016 18:23:43 -0700 Subject: [PATCH 0066/1644] ARROW-188: Add numpy as install requirement Successfully tested with NumPy 1.9 which should be a recent but still old version that we can support for now. Author: Uwe L. Korn Closes #69 from xhochy/arrow-188 and squashes the following commits: 651a9aa [Uwe L. Korn] ARROW-188: Add numpy as install requirement --- python/setup.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python/setup.py b/python/setup.py index ebd80de46b4da..5f228ed0af245 100644 --- a/python/setup.py +++ b/python/setup.py @@ -242,7 +242,7 @@ def get_outputs(self): 'clean': clean, 'build_ext': build_ext }, - install_requires=['cython >= 0.21'], + install_requires=['cython >= 0.21', 'numpy >= 1.9'], description=DESC, license='Apache License, Version 2.0', maintainer="Apache Arrow Developers", From 33022579e31b2448ed227ddf51160d08edd625e3 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sun, 8 May 2016 18:03:28 -0700 Subject: [PATCH 0067/1644] ARROW-190: Python: Provide installable sdist builds Author: Uwe L. Korn Closes #71 from xhochy/arrow-190 and squashes the following commits: e28db45 [Uwe L. Korn] Add LICENSE and README to MANIFEST f9943f5 [Uwe L. 
Korn] ARROW-190: Python: Provide standalone installable sdist builds --- python/MANIFEST.in | 14 ++++++++++++++ 1 file changed, 14 insertions(+) create mode 100644 python/MANIFEST.in diff --git a/python/MANIFEST.in b/python/MANIFEST.in new file mode 100644 index 0000000000000..756879a0bb033 --- /dev/null +++ b/python/MANIFEST.in @@ -0,0 +1,14 @@ +include README.md +include LICENSE.txt + +global-include CMakeLists.txt +graft cmake_modules +recursive-include src/pyarrow *.cc *.h +recursive-include pyarrow *.pxd + +global-exclude *.so +global-exclude *.pyc +global-exclude *~ +global-exclude \#* +global-exclude .git* +global-exclude .DS_Store From c9ffe546b8ddb81851bcff78e4db051942dcc546 Mon Sep 17 00:00:00 2001 From: Jihoon Son Date: Sun, 8 May 2016 22:15:40 -0700 Subject: [PATCH 0068/1644] ARROW-194: C++: Allow read-only memory mapped source A simple patch to allow read-only mode. A test is also included. Author: Jihoon Son Closes #72 from jihoonson/ARROW-194 and squashes the following commits: f55dd22 [Jihoon Son] Change the type of protection flag from int8_t to int b928031 [Jihoon Son] Add missing initialization 63b99c5 [Jihoon Son] Remove unintended whitespace 22e6128 [Jihoon Son] Simplify error check 5559b8d [Jihoon Son] - Fixed a wrong protection flag in a test - Added a routine to check the protection flag before writing - Added a unit test to check the error status for protection mode - Improved failure check for mmap() d8939fa [Jihoon Son] Allow read-only memory mapped source. --- cpp/src/arrow/ipc/ipc-memory-test.cc | 54 ++++++++++++++++++++++++++-- cpp/src/arrow/ipc/memory.cc | 22 ++++++++---- 2 files changed, 66 insertions(+), 10 deletions(-) diff --git a/cpp/src/arrow/ipc/ipc-memory-test.cc b/cpp/src/arrow/ipc/ipc-memory-test.cc index 1933921222595..a2dbd35728c49 100644 --- a/cpp/src/arrow/ipc/ipc-memory-test.cc +++ b/cpp/src/arrow/ipc/ipc-memory-test.cc @@ -26,9 +26,6 @@ #include "arrow/ipc/memory.h" #include "arrow/ipc/test-common.h" -#include "arrow/test-util.h" -#include "arrow/util/buffer.h" -#include "arrow/util/status.h" namespace arrow { namespace ipc { @@ -67,6 +64,57 @@ TEST_F(TestMemoryMappedSource, WriteRead) { } } +TEST_F(TestMemoryMappedSource, ReadOnly) { + const int64_t buffer_size = 1024; + std::vector buffer(buffer_size); + + test::random_bytes(1024, 0, buffer.data()); + + const int reps = 5; + + std::string path = "ipc-read-only-test"; + CreateFile(path, reps * buffer_size); + + std::shared_ptr rwmmap; + ASSERT_OK(MemoryMappedSource::Open(path, MemorySource::READ_WRITE, &rwmmap)); + + int64_t position = 0; + for (int i = 0; i < reps; ++i) { + ASSERT_OK(rwmmap->Write(position, buffer.data(), buffer_size)); + + position += buffer_size; + } + rwmmap->Close(); + + std::shared_ptr rommap; + ASSERT_OK(MemoryMappedSource::Open(path, MemorySource::READ_ONLY, &rommap)); + + position = 0; + std::shared_ptr out_buffer; + for (int i = 0; i < reps; ++i) { + ASSERT_OK(rommap->ReadAt(position, buffer_size, &out_buffer)); + + ASSERT_EQ(0, memcmp(out_buffer->data(), buffer.data(), buffer_size)); + position += buffer_size; + } + rommap->Close(); +} + +TEST_F(TestMemoryMappedSource, InvalidMode) { + const int64_t buffer_size = 1024; + std::vector buffer(buffer_size); + + test::random_bytes(1024, 0, buffer.data()); + + std::string path = "ipc-invalid-mode-test"; + CreateFile(path, buffer_size); + + std::shared_ptr rommap; + ASSERT_OK(MemoryMappedSource::Open(path, MemorySource::READ_ONLY, &rommap)); + + ASSERT_RAISES(IOError, rommap->Write(0, buffer.data(), buffer_size)); +} + 
TEST_F(TestMemoryMappedSource, InvalidFile) { std::string non_existent_path = "invalid-file-name-asfd"; diff --git a/cpp/src/arrow/ipc/memory.cc b/cpp/src/arrow/ipc/memory.cc index caff2c610b907..a6c56d64f4aed 100644 --- a/cpp/src/arrow/ipc/memory.cc +++ b/cpp/src/arrow/ipc/memory.cc @@ -41,7 +41,7 @@ MemorySource::~MemorySource() {} class MemoryMappedSource::Impl { public: - Impl() : file_(nullptr), is_open_(false), data_(nullptr) {} + Impl() : file_(nullptr), is_open_(false), is_writable_(false), data_(nullptr) {} ~Impl() { if (is_open_) { @@ -53,10 +53,12 @@ class MemoryMappedSource::Impl { Status Open(const std::string& path, MemorySource::AccessMode mode) { if (is_open_) { return Status::IOError("A file is already open"); } - path_ = path; + int prot_flags = PROT_READ; if (mode == MemorySource::READ_WRITE) { file_ = fopen(path.c_str(), "r+b"); + prot_flags |= PROT_WRITE; + is_writable_ = true; } else { file_ = fopen(path.c_str(), "rb"); } @@ -73,14 +75,13 @@ class MemoryMappedSource::Impl { fseek(file_, 0L, SEEK_SET); is_open_ = true; - // TODO(wesm): Add read-only version of this - data_ = reinterpret_cast( - mmap(nullptr, size_, PROT_READ | PROT_WRITE, MAP_SHARED, fileno(file_), 0)); - if (data_ == nullptr) { + void* result = mmap(nullptr, size_, prot_flags, MAP_SHARED, fileno(file_), 0); + if (result == MAP_FAILED) { std::stringstream ss; ss << "Memory mapping file failed, errno: " << errno; return Status::IOError(ss.str()); } + data_ = reinterpret_cast(result); return Status::OK(); } @@ -89,11 +90,15 @@ class MemoryMappedSource::Impl { uint8_t* data() { return data_; } + bool writable() { return is_writable_; } + + bool opened() { return is_open_; } + private: - std::string path_; FILE* file_; int64_t size_; bool is_open_; + bool is_writable_; // The memory map uint8_t* data_; @@ -134,6 +139,9 @@ Status MemoryMappedSource::ReadAt( } Status MemoryMappedSource::Write(int64_t position, const uint8_t* data, int64_t nbytes) { + if (!impl_->opened() || !impl_->writable()) { + return Status::IOError("Unable to write"); + } if (position < 0 || position >= impl_->size()) { return Status::Invalid("position is out of bounds"); } From 1f04f7ff90c43efd72b57cc09ba21da1597682d6 Mon Sep 17 00:00:00 2001 From: lfzCarlosC Date: Thu, 5 May 2016 21:58:31 +0200 Subject: [PATCH 0069/1644] ARROW-193: typos "int his" fix to "in this" --- .../main/java/org/apache/arrow/vector/VariableWidthVector.java | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VariableWidthVector.java b/java/vector/src/main/java/org/apache/arrow/vector/VariableWidthVector.java index e227bb4c4176c..971a241adafc2 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/VariableWidthVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/VariableWidthVector.java @@ -30,7 +30,7 @@ public interface VariableWidthVector extends ValueVector{ void allocateNew(int totalBytes, int valueCount); /** - * Provide the maximum amount of variable width bytes that can be stored int his vector. + * Provide the maximum amount of variable width bytes that can be stored in this vector. * @return */ int getByteCapacity(); From 4bd13b852d376065fdb16c36fa821ab0e167f0fc Mon Sep 17 00:00:00 2001 From: "Uwe L. 
Korn" Date: Tue, 10 May 2016 15:58:04 -0700 Subject: [PATCH 0070/1644] ARROW-91: Basic Parquet read support Depends on (mainly one line fixes): - [x] https://github.com/apache/parquet-cpp/pull/99 - [x] https://github.com/apache/parquet-cpp/pull/98 - [x] https://github.com/apache/parquet-cpp/pull/97 Author: Uwe L. Korn Author: Wes McKinney Closes #73 from xhochy/arrow-91 and squashes the following commits: 7579fed [Uwe L. Korn] Mark single argument constructor as explicit 47441a1 [Uwe L. Korn] Assert that no exception was thrown 5fa1026 [Uwe L. Korn] Incorporate review comments 8d2db22 [Uwe L. Korn] ARROW-91: Basic Parquet read support d9940d8 [Wes McKinney] Public API draft --- cpp/src/arrow/parquet/CMakeLists.txt | 4 + cpp/src/arrow/parquet/parquet-reader-test.cc | 116 +++++++++++ cpp/src/arrow/parquet/reader.cc | 194 +++++++++++++++++++ cpp/src/arrow/parquet/reader.h | 134 +++++++++++++ cpp/src/arrow/parquet/schema.cc | 8 +- cpp/src/arrow/parquet/schema.h | 2 +- cpp/src/arrow/parquet/utils.h | 38 ++++ 7 files changed, 488 insertions(+), 8 deletions(-) create mode 100644 cpp/src/arrow/parquet/parquet-reader-test.cc create mode 100644 cpp/src/arrow/parquet/reader.cc create mode 100644 cpp/src/arrow/parquet/reader.h create mode 100644 cpp/src/arrow/parquet/utils.h diff --git a/cpp/src/arrow/parquet/CMakeLists.txt b/cpp/src/arrow/parquet/CMakeLists.txt index 0d5cf263ec3e2..1ae6709652ea5 100644 --- a/cpp/src/arrow/parquet/CMakeLists.txt +++ b/cpp/src/arrow/parquet/CMakeLists.txt @@ -19,6 +19,7 @@ # arrow_parquet : Arrow <-> Parquet adapter set(PARQUET_SRCS + reader.cc schema.cc ) @@ -36,6 +37,9 @@ SET_TARGET_PROPERTIES(arrow_parquet PROPERTIES LINKER_LANGUAGE CXX) ADD_ARROW_TEST(parquet-schema-test) ARROW_TEST_LINK_LIBRARIES(parquet-schema-test arrow_parquet) +ADD_ARROW_TEST(parquet-reader-test) +ARROW_TEST_LINK_LIBRARIES(parquet-reader-test arrow_parquet) + # Headers: top level install(FILES DESTINATION include/arrow/parquet) diff --git a/cpp/src/arrow/parquet/parquet-reader-test.cc b/cpp/src/arrow/parquet/parquet-reader-test.cc new file mode 100644 index 0000000000000..a7fc2a89f5f45 --- /dev/null +++ b/cpp/src/arrow/parquet/parquet-reader-test.cc @@ -0,0 +1,116 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include "gtest/gtest.h" + +#include "arrow/test-util.h" +#include "arrow/parquet/reader.h" +#include "arrow/types/primitive.h" +#include "arrow/util/memory-pool.h" +#include "arrow/util/status.h" + +#include "parquet/api/reader.h" +#include "parquet/api/writer.h" + +using ParquetBuffer = parquet::Buffer; +using parquet::BufferReader; +using parquet::InMemoryOutputStream; +using parquet::Int64Writer; +using parquet::ParquetFileReader; +using parquet::ParquetFileWriter; +using parquet::RandomAccessSource; +using parquet::Repetition; +using parquet::SchemaDescriptor; +using ParquetType = parquet::Type; +using parquet::schema::GroupNode; +using parquet::schema::NodePtr; +using parquet::schema::PrimitiveNode; + +namespace arrow { + +namespace parquet { + +class TestReadParquet : public ::testing::Test { + public: + virtual void SetUp() {} + + std::shared_ptr Int64Schema() { + auto pnode = PrimitiveNode::Make("int64", Repetition::REQUIRED, ParquetType::INT64); + NodePtr node_ = + GroupNode::Make("schema", Repetition::REQUIRED, std::vector({pnode})); + return std::static_pointer_cast(node_); + } + + std::unique_ptr Int64File( + std::vector& values, int num_chunks) { + std::shared_ptr schema = Int64Schema(); + std::shared_ptr sink(new InMemoryOutputStream()); + auto file_writer = ParquetFileWriter::Open(sink, schema); + size_t chunk_size = values.size() / num_chunks; + for (int i = 0; i < num_chunks; i++) { + auto row_group_writer = file_writer->AppendRowGroup(chunk_size); + auto column_writer = static_cast(row_group_writer->NextColumn()); + int64_t* data = values.data() + i * chunk_size; + column_writer->WriteBatch(chunk_size, nullptr, nullptr, data); + column_writer->Close(); + row_group_writer->Close(); + } + file_writer->Close(); + + std::shared_ptr buffer = sink->GetBuffer(); + std::unique_ptr source(new BufferReader(buffer)); + return ParquetFileReader::Open(std::move(source)); + } + + private: +}; + +TEST_F(TestReadParquet, SingleColumnInt64) { + std::vector values(100, 128); + std::unique_ptr file_reader = Int64File(values, 1); + arrow::parquet::FileReader reader(default_memory_pool(), std::move(file_reader)); + std::unique_ptr column_reader; + ASSERT_NO_THROW(ASSERT_OK(reader.GetFlatColumn(0, &column_reader))); + ASSERT_NE(nullptr, column_reader.get()); + std::shared_ptr out; + ASSERT_OK(column_reader->NextBatch(100, &out)); + ASSERT_NE(nullptr, out.get()); + Int64Array* out_array = static_cast(out.get()); + for (size_t i = 0; i < values.size(); i++) { + EXPECT_EQ(values[i], out_array->raw_data()[i]); + } +} + +TEST_F(TestReadParquet, SingleColumnInt64Chunked) { + std::vector values(100, 128); + std::unique_ptr file_reader = Int64File(values, 4); + arrow::parquet::FileReader reader(default_memory_pool(), std::move(file_reader)); + std::unique_ptr column_reader; + ASSERT_NO_THROW(ASSERT_OK(reader.GetFlatColumn(0, &column_reader))); + ASSERT_NE(nullptr, column_reader.get()); + std::shared_ptr out; + ASSERT_OK(column_reader->NextBatch(100, &out)); + ASSERT_NE(nullptr, out.get()); + Int64Array* out_array = static_cast(out.get()); + for (size_t i = 0; i < values.size(); i++) { + EXPECT_EQ(values[i], out_array->raw_data()[i]); + } +} + +} // namespace parquet + +} // namespace arrow diff --git a/cpp/src/arrow/parquet/reader.cc b/cpp/src/arrow/parquet/reader.cc new file mode 100644 index 0000000000000..481ded5789a71 --- /dev/null +++ b/cpp/src/arrow/parquet/reader.cc @@ -0,0 +1,194 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. 
See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/parquet/reader.h" + +#include + +#include "arrow/parquet/schema.h" +#include "arrow/parquet/utils.h" +#include "arrow/schema.h" +#include "arrow/types/primitive.h" +#include "arrow/util/status.h" + +using parquet::ColumnReader; +using parquet::TypedColumnReader; + +namespace arrow { +namespace parquet { + +class FileReader::Impl { + public: + Impl(MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileReader> reader); + virtual ~Impl() {} + + Status GetFlatColumn(int i, std::unique_ptr* out); + Status ReadFlatColumn(int i, std::shared_ptr* out); + + private: + MemoryPool* pool_; + std::unique_ptr<::parquet::ParquetFileReader> reader_; +}; + +class FlatColumnReader::Impl { + public: + Impl(MemoryPool* pool, const ::parquet::ColumnDescriptor* descr, + ::parquet::ParquetFileReader* reader, int column_index); + virtual ~Impl() {} + + Status NextBatch(int batch_size, std::shared_ptr* out); + template + Status TypedReadBatch(int batch_size, std::shared_ptr* out); + + private: + void NextRowGroup(); + + MemoryPool* pool_; + const ::parquet::ColumnDescriptor* descr_; + ::parquet::ParquetFileReader* reader_; + int column_index_; + int next_row_group_; + std::shared_ptr column_reader_; + std::shared_ptr field_; + + PoolBuffer values_buffer_; + PoolBuffer def_levels_buffer_; + PoolBuffer rep_levels_buffer_; +}; + +FileReader::Impl::Impl( + MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileReader> reader) + : pool_(pool), reader_(std::move(reader)) {} + +Status FileReader::Impl::GetFlatColumn(int i, std::unique_ptr* out) { + std::unique_ptr impl( + new FlatColumnReader::Impl(pool_, reader_->descr()->Column(i), reader_.get(), i)); + *out = std::unique_ptr(new FlatColumnReader(std::move(impl))); + return Status::OK(); +} + +Status FileReader::Impl::ReadFlatColumn(int i, std::shared_ptr* out) { + std::unique_ptr flat_column_reader; + RETURN_NOT_OK(GetFlatColumn(i, &flat_column_reader)); + return flat_column_reader->NextBatch(reader_->num_rows(), out); +} + +FileReader::FileReader( + MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileReader> reader) + : impl_(new FileReader::Impl(pool, std::move(reader))) {} + +FileReader::~FileReader() {} + +Status FileReader::GetFlatColumn(int i, std::unique_ptr* out) { + return impl_->GetFlatColumn(i, out); +} + +Status FileReader::ReadFlatColumn(int i, std::shared_ptr* out) { + return impl_->ReadFlatColumn(i, out); +} + +FlatColumnReader::Impl::Impl(MemoryPool* pool, const ::parquet::ColumnDescriptor* descr, + ::parquet::ParquetFileReader* reader, int column_index) + : pool_(pool), + descr_(descr), + reader_(reader), + column_index_(column_index), + next_row_group_(0), + values_buffer_(pool), + def_levels_buffer_(pool), + rep_levels_buffer_(pool) { + NodeToField(descr_->schema_node(), &field_); + NextRowGroup(); +} + +template +Status 
FlatColumnReader::Impl::TypedReadBatch( + int batch_size, std::shared_ptr* out) { + int values_to_read = batch_size; + NumericBuilder builder(pool_, field_->type); + while ((values_to_read > 0) && column_reader_) { + values_buffer_.Resize(values_to_read * sizeof(CType)); + if (descr_->max_definition_level() > 0) { + def_levels_buffer_.Resize(values_to_read * sizeof(int16_t)); + } + if (descr_->max_repetition_level() > 0) { + rep_levels_buffer_.Resize(values_to_read * sizeof(int16_t)); + } + auto reader = dynamic_cast*>(column_reader_.get()); + int64_t values_read; + CType* values = reinterpret_cast(values_buffer_.mutable_data()); + PARQUET_CATCH_NOT_OK( + values_to_read -= reader->ReadBatch(values_to_read, + reinterpret_cast(def_levels_buffer_.mutable_data()), + reinterpret_cast(rep_levels_buffer_.mutable_data()), values, + &values_read)); + if (descr_->max_definition_level() == 0) { + RETURN_NOT_OK(builder.Append(values, values_read)); + } else { + return Status::NotImplemented("no support for definition levels yet"); + } + if (!column_reader_->HasNext()) { NextRowGroup(); } + } + *out = builder.Finish(); + return Status::OK(); +} + +#define TYPED_BATCH_CASE(ENUM, ArrowType, ParquetType, CType) \ + case Type::ENUM: \ + return TypedReadBatch(batch_size, out); \ + break; + +Status FlatColumnReader::Impl::NextBatch(int batch_size, std::shared_ptr* out) { + if (!column_reader_) { + // Exhausted all row groups. + *out = nullptr; + return Status::OK(); + } + + if (descr_->max_repetition_level() > 0) { + return Status::NotImplemented("no support for repetition yet"); + } + + switch (field_->type->type) { + TYPED_BATCH_CASE(INT32, Int32Type, ::parquet::Int32Type, int32_t) + TYPED_BATCH_CASE(INT64, Int64Type, ::parquet::Int64Type, int64_t) + TYPED_BATCH_CASE(FLOAT, FloatType, ::parquet::FloatType, float) + TYPED_BATCH_CASE(DOUBLE, DoubleType, ::parquet::DoubleType, double) + default: + return Status::NotImplemented(field_->type->ToString()); + } +} + +void FlatColumnReader::Impl::NextRowGroup() { + if (next_row_group_ < reader_->num_row_groups()) { + column_reader_ = reader_->RowGroup(next_row_group_)->Column(column_index_); + next_row_group_++; + } else { + column_reader_ = nullptr; + } +} + +FlatColumnReader::FlatColumnReader(std::unique_ptr impl) : impl_(std::move(impl)) {} + +FlatColumnReader::~FlatColumnReader() {} + +Status FlatColumnReader::NextBatch(int batch_size, std::shared_ptr* out) { + return impl_->NextBatch(batch_size, out); +} + +} // namespace parquet +} // namespace arrow diff --git a/cpp/src/arrow/parquet/reader.h b/cpp/src/arrow/parquet/reader.h new file mode 100644 index 0000000000000..41ca7eb35b9f0 --- /dev/null +++ b/cpp/src/arrow/parquet/reader.h @@ -0,0 +1,134 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#ifndef ARROW_PARQUET_READER_H +#define ARROW_PARQUET_READER_H + +#include + +#include "parquet/api/reader.h" +#include "parquet/api/schema.h" + +namespace arrow { + +class Array; +class MemoryPool; +class RowBatch; +class Status; + +namespace parquet { + +class FlatColumnReader; + +// Arrow read adapter class for deserializing Parquet files as Arrow row +// batches. +// +// TODO(wesm): nested data does not always make sense with this user +// interface unless you are only reading a single leaf node from a branch of +// a table. For example: +// +// repeated group data { +// optional group record { +// optional int32 val1; +// optional byte_array val2; +// optional bool val3; +// } +// optional int32 val4; +// } +// +// In the Parquet file, there are 3 leaf nodes: +// +// * data.record.val1 +// * data.record.val2 +// * data.record.val3 +// * data.val4 +// +// When materializing this data in an Arrow array, we would have: +// +// data: list), +// val3: bool, +// >, +// val4: int32 +// >> +// +// However, in the Parquet format, each leaf node has its own repetition and +// definition levels describing the structure of the intermediate nodes in +// this array structure. Thus, we will need to scan the leaf data for a group +// of leaf nodes part of the same type tree to create a single result Arrow +// nested array structure. +// +// This is additionally complicated "chunky" repeated fields or very large byte +// arrays +class FileReader { + public: + FileReader(MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileReader> reader); + + // Since the distribution of columns amongst a Parquet file's row groups may + // be uneven (the number of values in each column chunk can be different), we + // provide a column-oriented read interface. The ColumnReader hides the + // details of paging through the file's row groups and yielding + // fully-materialized arrow::Array instances + // + // Returns error status if the column of interest is not flat. + Status GetFlatColumn(int i, std::unique_ptr* out); + // Read column as a whole into an Array. + Status ReadFlatColumn(int i, std::shared_ptr* out); + + virtual ~FileReader(); + + private: + class Impl; + std::unique_ptr impl_; +}; + +// At this point, the column reader is a stream iterator. It only knows how to +// read the next batch of values for a particular column from the file until it +// runs out. +// +// We also do not expose any internal Parquet details, such as row groups. This +// might change in the future. +class FlatColumnReader { + public: + virtual ~FlatColumnReader(); + + // Scan the next array of the indicated size. The actual size of the + // returned array may be less than the passed size depending how much data is + // available in the file. + // + // When all the data in the file has been exhausted, the result is set to + // nullptr. + // + // Returns Status::OK on a successful read, including if you have exhausted + // the data available in the file. 
+ Status NextBatch(int batch_size, std::shared_ptr* out); + + private: + class Impl; + std::unique_ptr impl_; + explicit FlatColumnReader(std::unique_ptr impl); + + friend class FileReader; +}; + +} // namespace parquet + +} // namespace arrow + +#endif // ARROW_PARQUET_READER_H diff --git a/cpp/src/arrow/parquet/schema.cc b/cpp/src/arrow/parquet/schema.cc index 214c764f08b6e..fd758940c9f3a 100644 --- a/cpp/src/arrow/parquet/schema.cc +++ b/cpp/src/arrow/parquet/schema.cc @@ -21,13 +21,12 @@ #include #include "parquet/api/schema.h" -#include "parquet/exception.h" +#include "arrow/parquet/utils.h" #include "arrow/types/decimal.h" #include "arrow/types/string.h" #include "arrow/util/status.h" -using parquet::ParquetException; using parquet::Repetition; using parquet::schema::Node; using parquet::schema::NodePtr; @@ -41,11 +40,6 @@ namespace arrow { namespace parquet { -#define PARQUET_CATCH_NOT_OK(s) \ - try { \ - (s); \ - } catch (const ParquetException& e) { return Status::Invalid(e.what()); } - const auto BOOL = std::make_shared(); const auto UINT8 = std::make_shared(); const auto INT32 = std::make_shared(); diff --git a/cpp/src/arrow/parquet/schema.h b/cpp/src/arrow/parquet/schema.h index bfc7d21138154..ec5f96062e89f 100644 --- a/cpp/src/arrow/parquet/schema.h +++ b/cpp/src/arrow/parquet/schema.h @@ -45,4 +45,4 @@ Status ToParquetSchema( } // namespace arrow -#endif +#endif // ARROW_PARQUET_SCHEMA_H diff --git a/cpp/src/arrow/parquet/utils.h b/cpp/src/arrow/parquet/utils.h new file mode 100644 index 0000000000000..b32792fdf7030 --- /dev/null +++ b/cpp/src/arrow/parquet/utils.h @@ -0,0 +1,38 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_PARQUET_UTILS_H +#define ARROW_PARQUET_UTILS_H + +#include "arrow/util/status.h" + +#include "parquet/exception.h" + +namespace arrow { + +namespace parquet { + +#define PARQUET_CATCH_NOT_OK(s) \ + try { \ + (s); \ + } catch (const ::parquet::ParquetException& e) { return Status::Invalid(e.what()); } + +} // namespace parquet + +} // namespace arrow + +#endif // ARROW_PARQUET_UTILS_H From 68b80a83876b1306f80d3914eef98f51100a8009 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sat, 14 May 2016 18:53:22 -0700 Subject: [PATCH 0071/1644] ARROW-197: Working first draft of a conda recipe for pyarrow Includes ARROW-196. I will close that PR and merge these together as I had to make some additional changes. Requires PARQUET-617. 
Closes #76 Author: Wes McKinney Closes #77 from wesm/ARROW-197 and squashes the following commits: 4bf3d2c [Wes McKinney] Finagle toolchain environment variables to get pyarrow conda package working c2d3684 [Wes McKinney] Add conda recipe and ensure that libarrow_parquet is installed as well --- cpp/conda.recipe/build.sh | 45 ++++++++++++++++++++++++++++ cpp/conda.recipe/meta.yaml | 32 ++++++++++++++++++++ cpp/src/arrow/parquet/CMakeLists.txt | 7 +++++ cpp/src/arrow/types/primitive.h | 1 + python/conda.recipe/build.sh | 18 +++++++++++ python/conda.recipe/meta.yaml | 41 +++++++++++++++++++++++++ 6 files changed, 144 insertions(+) create mode 100644 cpp/conda.recipe/build.sh create mode 100644 cpp/conda.recipe/meta.yaml create mode 100644 python/conda.recipe/build.sh create mode 100644 python/conda.recipe/meta.yaml diff --git a/cpp/conda.recipe/build.sh b/cpp/conda.recipe/build.sh new file mode 100644 index 0000000000000..ac1f9c89cc9ed --- /dev/null +++ b/cpp/conda.recipe/build.sh @@ -0,0 +1,45 @@ +#!/bin/bash + +set -e +set -x + +cd $RECIPE_DIR + +# Build dependencies +export FLATBUFFERS_HOME=$PREFIX +export PARQUET_HOME=$PREFIX + +cd .. + +rm -rf conda-build +mkdir conda-build + +cp -r thirdparty conda-build/ + +cd conda-build +pwd + +# Build googletest for running unit tests +./thirdparty/download_thirdparty.sh +./thirdparty/build_thirdparty.sh gtest + +source thirdparty/versions.sh +export GTEST_HOME=`pwd`/thirdparty/$GTEST_BASEDIR + +if [ `uname` == Linux ]; then + SHARED_LINKER_FLAGS='-static-libstdc++' +elif [ `uname` == Darwin ]; then + SHARED_LINKER_FLAGS='' +fi + +cmake \ + -DCMAKE_BUILD_TYPE=debug \ + -DCMAKE_INSTALL_PREFIX=$PREFIX \ + -DCMAKE_SHARED_LINKER_FLAGS=$SHARED_LINKER_FLAGS \ + -DARROW_IPC=on \ + -DARROW_PARQUET=on \ + .. + +make +ctest -L unittest +make install diff --git a/cpp/conda.recipe/meta.yaml b/cpp/conda.recipe/meta.yaml new file mode 100644 index 0000000000000..2e834d5cbf86c --- /dev/null +++ b/cpp/conda.recipe/meta.yaml @@ -0,0 +1,32 @@ +package: + name: arrow-cpp + version: "0.1" + +build: + number: {{environ.get('TRAVIS_BUILD_NUMBER', 0)}} # [unix] + skip: true # [win] + script_env: + - CC [linux] + - CXX [linux] + - LD_LIBRARY_PATH [linux] + +requirements: + build: + - cmake + - flatbuffers + - parquet-cpp + - thrift-cpp + + run: + - parquet-cpp + +test: + commands: + - test -f $PREFIX/lib/libarrow.so + - test -f $PREFIX/lib/libarrow_parquet.so + - test -f $PREFIX/include/arrow/api.h + +about: + home: http://github.com/apache/arrow + license: Apache 2.0 + summary: 'C++ libraries for the reference Apache Arrow implementation' diff --git a/cpp/src/arrow/parquet/CMakeLists.txt b/cpp/src/arrow/parquet/CMakeLists.txt index 1ae6709652ea5..cd6f05d6b5f8a 100644 --- a/cpp/src/arrow/parquet/CMakeLists.txt +++ b/cpp/src/arrow/parquet/CMakeLists.txt @@ -42,4 +42,11 @@ ARROW_TEST_LINK_LIBRARIES(parquet-reader-test arrow_parquet) # Headers: top level install(FILES + reader.h + schema.h + utils.h DESTINATION include/arrow/parquet) + +install(TARGETS arrow_parquet + LIBRARY DESTINATION lib + ARCHIVE DESTINATION lib) diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index 6f6b2fed5a320..fc45f6c5b0568 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -136,6 +136,7 @@ class NumericBuilder : public PrimitiveBuilder { using PrimitiveBuilder::Append; using PrimitiveBuilder::Init; using PrimitiveBuilder::Resize; + using PrimitiveBuilder::Reserve; // Scalar append. 
 void Append(value_type val) {
diff --git a/python/conda.recipe/build.sh b/python/conda.recipe/build.sh
new file mode 100644
index 0000000000000..a9d9aedead399
--- /dev/null
+++ b/python/conda.recipe/build.sh
@@ -0,0 +1,18 @@
+#!/bin/bash
+set -ex
+
+# Build dependency
+export ARROW_HOME=$PREFIX
+
+cd $RECIPE_DIR
+
+echo Setting the compiler...
+if [ `uname` == Linux ]; then
+  EXTRA_CMAKE_ARGS=-DCMAKE_SHARED_LINKER_FLAGS=-static-libstdc++
+elif [ `uname` == Darwin ]; then
+  EXTRA_CMAKE_ARGS=
+fi
+
+cd ..
+$PYTHON setup.py build_ext --extra-cmake-args=$EXTRA_CMAKE_ARGS || exit 1
+$PYTHON setup.py install || exit 1
diff --git a/python/conda.recipe/meta.yaml b/python/conda.recipe/meta.yaml
new file mode 100644
index 0000000000000..85d24b6bc322e
--- /dev/null
+++ b/python/conda.recipe/meta.yaml
@@ -0,0 +1,41 @@
+package:
+  name: pyarrow
+  version: "0.1"
+
+build:
+  number: {{environ.get('TRAVIS_BUILD_NUMBER', 0)}} # [unix]
+  rpaths:
+    - lib # [unix]
+    - lib/python{{environ.get('PY_VER')}}/site-packages/pyarrow # [unix]
+  script_env:
+    - CC [linux]
+    - CXX [linux]
+    - LD_LIBRARY_PATH [linux]
+  skip: true # [win]
+
+requirements:
+  build:
+    - cmake
+    - python
+    - setuptools
+    - cython
+    - numpy
+    - pandas
+    - arrow-cpp
+    - pytest
+
+  run:
+    - arrow-cpp
+    - python
+    - numpy
+    - pandas
+    - six
+
+test:
+  imports:
+    - pyarrow
+
+about:
+  home: http://github.com/apache/arrow
+  license: Apache 2.0
+  summary: 'Python bindings for Arrow C++ and interoperability tool for pandas and NumPy'
From 6968ec01d722584e9561dc3c0438bce29c664b5a Mon Sep 17 00:00:00 2001
From: hzhang2
Date: Sat, 14 May 2016 19:07:44 -0700
Subject: [PATCH 0072/1644] ARROW-199: [C++] Refine third party dependency

Running download_thirdparty.sh and build_thirdparty.sh is not enough to
generate the makefiles; sourcing setup_build_env.sh is also necessary, since
FLATBUFFERS_HOME must be set.

Author: hzhang2

Closes #75 from zhangh43/arrow2 and squashes the following commits:

ea3101b [hzhang2] remove CMAKE_SKIP_INSTALL_ALL_DEPENDENCY for target install and fix typo
8c02a38 [hzhang2] ARROW-199: [C++] Refine third party dependency
b2312e0 [hzhang2] ARROW-199: [C++] Refine third party dependency
fefc314 [hzhang2] FLATBUFFERS_HOME must be set before cmake
---
 cpp/CMakeLists.txt                   |  5 -----
 cpp/README.md                        |  1 +
 cpp/setup_build_env.sh               |  6 +-----
 cpp/thirdparty/set_thirdparty_env.sh | 12 ++++++++++++
 4 files changed, 14 insertions(+), 10 deletions(-)
 create mode 100755 cpp/thirdparty/set_thirdparty_env.sh
diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
index b38f91e5d687c..a3fb01076d44e 100644
--- a/cpp/CMakeLists.txt
+++ b/cpp/CMakeLists.txt
@@ -25,11 +25,6 @@ include(CMakeParseArguments)
 set(BUILD_SUPPORT_DIR "${CMAKE_SOURCE_DIR}/build-support")
 set(THIRDPARTY_DIR "${CMAKE_SOURCE_DIR}/thirdparty")

-# Allow "make install" to not depend on all targets.
-#
-# Must be declared in the top-level CMakeLists.txt.
-set(CMAKE_SKIP_INSTALL_ALL_DEPENDENCY true) - find_package(ClangTools) if ("$ENV{CMAKE_EXPORT_COMPILE_COMMANDS}" STREQUAL "1" OR CLANG_TIDY_FOUND) # Generate a Clang compile_commands.json "compilation database" file for use diff --git a/cpp/README.md b/cpp/README.md index c8cd86fedc6fe..129c5f15b150c 100644 --- a/cpp/README.md +++ b/cpp/README.md @@ -13,6 +13,7 @@ To build the thirdparty build dependencies, run: ``` ./thirdparty/download_thirdparty.sh ./thirdparty/build_thirdparty.sh +source ./thirdparty/set_thirdparty_env.sh ``` You can also run from the root of the C++ tree diff --git a/cpp/setup_build_env.sh b/cpp/setup_build_env.sh index 6520dbd43f705..fa779fdd5c2a3 100755 --- a/cpp/setup_build_env.sh +++ b/cpp/setup_build_env.sh @@ -4,10 +4,6 @@ SOURCE_DIR=$(cd "$(dirname "${BASH_SOURCE:-$0}")"; pwd) ./thirdparty/download_thirdparty.sh || { echo "download_thirdparty.sh failed" ; return; } ./thirdparty/build_thirdparty.sh || { echo "build_thirdparty.sh failed" ; return; } -source thirdparty/versions.sh - -export GTEST_HOME=$SOURCE_DIR/thirdparty/$GTEST_BASEDIR -export GBENCHMARK_HOME=$SOURCE_DIR/thirdparty/installed -export FLATBUFFERS_HOME=$SOURCE_DIR/thirdparty/installed +source ./thirdparty/set_thirdparty_env.sh || { echo "source set_thirdparty_env.sh failed" ; return; } echo "Build env initialized" diff --git a/cpp/thirdparty/set_thirdparty_env.sh b/cpp/thirdparty/set_thirdparty_env.sh new file mode 100755 index 0000000000000..7e9531cd50864 --- /dev/null +++ b/cpp/thirdparty/set_thirdparty_env.sh @@ -0,0 +1,12 @@ +#!/usr/bash + +SOURCE_DIR=$(cd "$(dirname "${BASH_SOURCE:-$0}")"; pwd) +source $SOURCE_DIR/versions.sh + +if [ -z "$THIRDPARTY_DIR" ]; then + THIRDPARTY_DIR=$SOURCE_DIR +fi + +export GTEST_HOME=$THIRDPARTY_DIR/$GTEST_BASEDIR +export GBENCHMARK_HOME=$THIRDPARTY_DIR/installed +export FLATBUFFERS_HOME=$THIRDPARTY_DIR/installed From 9c59158b4dc84e4de8e9271430befb840e523a4c Mon Sep 17 00:00:00 2001 From: Micah Kornfield Date: Tue, 17 May 2016 16:46:40 -0700 Subject: [PATCH 0073/1644] ARROW-185: Make padding and alignment for all buffers be 64 bytes + some small cleanup/removal of unnecessary code. I think there is likely a good opportunity to factor this code better generally, but this seems to work for now. 
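The core arithmetic of the change is rounding each allocation up to the next
multiple of 64 bytes. Since 64 is a power of two, this needs no division; a
sketch of the identity that the patch's RoundUpToMultipleOf64 helper relies
on (RoundUp64 is an illustrative name, and the committed helper additionally
guards against int64 overflow):

    #include <cstdint>

    // Adding 63 forces a carry into the 64s place whenever n is not already
    // a multiple of 64; masking off the low six bits truncates back down.
    int64_t RoundUp64(int64_t n) { return (n + 63) & ~int64_t(63); }

    // RoundUp64(0) == 0, RoundUp64(1) == 64, RoundUp64(64) == 64,
    // RoundUp64(100) == 128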
Author: Micah Kornfield Closes #74 from emkornfield/emk_fix_allocations_PR and squashes the following commits: e3cca14 [Micah Kornfield] fix cast style 1d006d8 [Micah Kornfield] fix warning c140e04 [Micah Kornfield] fix lint 7543267 [Micah Kornfield] cleanup 11b3fd7 [Micah Kornfield] replace cython string conversion with string builder 05653cb [Micah Kornfield] add back in memsets because they make valgrind happy 6ff3048 [Micah Kornfield] ARROW-185: Make padding and alignment for all buffers be 64 bytes --- cpp/src/arrow/builder.cc | 11 +++++-- cpp/src/arrow/ipc/adapter.cc | 20 ++++++++++++- cpp/src/arrow/ipc/ipc-adapter-test.cc | 6 ++-- cpp/src/arrow/types/list.cc | 2 +- cpp/src/arrow/types/list.h | 3 ++ cpp/src/arrow/types/primitive.cc | 7 ++--- cpp/src/arrow/util/bit-util-test.cc | 10 +++++++ cpp/src/arrow/util/bit-util.h | 4 +++ cpp/src/arrow/util/buffer.cc | 17 +++++++++++ cpp/src/arrow/util/buffer.h | 34 +++++++++++++-------- cpp/src/arrow/util/memory-pool-test.cc | 1 + cpp/src/arrow/util/memory-pool.cc | 31 ++++++++++++++----- python/src/pyarrow/adapters/pandas.cc | 41 ++++++-------------------- 13 files changed, 124 insertions(+), 63 deletions(-) diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index 87c1219025d37..1fba96169228f 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -45,12 +45,14 @@ Status ArrayBuilder::AppendToBitmap(const uint8_t* valid_bytes, int32_t length) } Status ArrayBuilder::Init(int32_t capacity) { - capacity_ = capacity; int32_t to_alloc = util::ceil_byte(capacity) / 8; null_bitmap_ = std::make_shared(pool_); RETURN_NOT_OK(null_bitmap_->Resize(to_alloc)); + // Buffers might allocate more then necessary to satisfy padding requirements + const int byte_capacity = null_bitmap_->capacity(); + capacity_ = capacity; null_bitmap_data_ = null_bitmap_->mutable_data(); - memset(null_bitmap_data_, 0, to_alloc); + memset(null_bitmap_data_, 0, byte_capacity); return Status::OK(); } @@ -60,8 +62,11 @@ Status ArrayBuilder::Resize(int32_t new_bits) { int32_t old_bytes = null_bitmap_->size(); RETURN_NOT_OK(null_bitmap_->Resize(new_bytes)); null_bitmap_data_ = null_bitmap_->mutable_data(); + // The buffer might be overpadded to deal with padding according to the spec + const int32_t byte_capacity = null_bitmap_->capacity(); + capacity_ = new_bits; if (old_bytes < new_bytes) { - memset(null_bitmap_data_ + old_bytes, 0, new_bytes - old_bytes); + memset(null_bitmap_data_ + old_bytes, 0, byte_capacity - old_bytes); } return Status::OK(); } diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index 34700080746e7..45cc288cd6b9e 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -43,6 +43,15 @@ namespace flatbuf = apache::arrow::flatbuf; namespace ipc { +namespace { +Status CheckMultipleOf64(int64_t size) { + if (util::is_multiple_of_64(size)) { return Status::OK(); } + return Status::Invalid( + "Attempted to write a buffer that " + "wasn't a multiple of 64 bytes"); +} +} + static bool IsPrimitive(const DataType* type) { DCHECK(type != nullptr); switch (type->type) { @@ -115,6 +124,8 @@ Status VisitArray(const Array* arr, std::vector* field_nodes } else if (arr->type_enum() == Type::STRUCT) { // TODO(wesm) return Status::NotImplemented("Struct type"); + } else { + return Status::NotImplemented("Unrecognized type"); } return Status::OK(); } @@ -142,7 +153,13 @@ class RowBatchWriter { int64_t size = 0; // The buffer might be null if we are handling zero row lengths. 
- if (buffer) { size = buffer->size(); } + if (buffer) { + // We use capacity here, because size might not reflect the padding + // requirements of buffers but capacity always should. + size = buffer->capacity(); + // check that padding is appropriate + RETURN_NOT_OK(CheckMultipleOf64(size)); + } // TODO(wesm): We currently have no notion of shared memory page id's, // but we've included it in the metadata IDL for when we have it in the // future. Use page=0 for now @@ -305,6 +322,7 @@ class RowBatchReader::Impl { Status GetBuffer(int buffer_index, std::shared_ptr* out) { BufferMetadata metadata = metadata_->buffer(buffer_index); + RETURN_NOT_OK(CheckMultipleOf64(metadata.length)); return source_->ReadAt(metadata.offset, metadata.length, out); } diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc index 3b147343f772a..eb47ac6fee8a1 100644 --- a/cpp/src/arrow/ipc/ipc-adapter-test.cc +++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc @@ -197,8 +197,8 @@ INSTANTIATE_TEST_CASE_P(RoundTripTests, TestWriteRowBatch, void TestGetRowBatchSize(std::shared_ptr batch) { MockMemorySource mock_source(1 << 16); - int64_t mock_header_location; - int64_t size; + int64_t mock_header_location = -1; + int64_t size = -1; ASSERT_OK(WriteRowBatch(&mock_source, batch.get(), 0, &mock_header_location)); ASSERT_OK(GetRowBatchSize(batch.get(), &size)); ASSERT_EQ(mock_source.GetExtentBytesWritten(), size); @@ -270,7 +270,7 @@ TEST_F(RecursionLimits, WriteLimit) { } TEST_F(RecursionLimits, ReadLimit) { - int64_t header_location; + int64_t header_location = -1; std::shared_ptr schema; ASSERT_OK(WriteToMmap(64, true, &header_location, &schema)); diff --git a/cpp/src/arrow/types/list.cc b/cpp/src/arrow/types/list.cc index fc3331139c6d8..76e7fe5f4d429 100644 --- a/cpp/src/arrow/types/list.cc +++ b/cpp/src/arrow/types/list.cc @@ -47,7 +47,7 @@ bool ListArray::Equals(const std::shared_ptr& arr) const { Status ListArray::Validate() const { if (length_ < 0) { return Status::Invalid("Length was negative"); } if (!offset_buf_) { return Status::Invalid("offset_buf_ was null"); } - if (offset_buf_->size() / sizeof(int32_t) < length_) { + if (offset_buf_->size() / static_cast(sizeof(int32_t)) < length_) { std::stringstream ss; ss << "offset buffer size (bytes): " << offset_buf_->size() << " isn't large enough for length: " << length_; diff --git a/cpp/src/arrow/types/list.h b/cpp/src/arrow/types/list.h index e2302d917b8f6..a020b8ad22668 100644 --- a/cpp/src/arrow/types/list.h +++ b/cpp/src/arrow/types/list.h @@ -20,6 +20,7 @@ #include #include +#include #include #include "arrow/array.h" @@ -113,12 +114,14 @@ class ListBuilder : public ArrayBuilder { values_(values) {} Status Init(int32_t elements) override { + DCHECK_LT(elements, std::numeric_limits::max()); RETURN_NOT_OK(ArrayBuilder::Init(elements)); // one more then requested for offsets return offset_builder_.Resize((elements + 1) * sizeof(int32_t)); } Status Resize(int32_t capacity) override { + DCHECK_LT(capacity, std::numeric_limits::max()); // one more then requested for offsets RETURN_NOT_OK(offset_builder_.Resize((capacity + 1) * sizeof(int32_t))); return ArrayBuilder::Resize(capacity); diff --git a/cpp/src/arrow/types/primitive.cc b/cpp/src/arrow/types/primitive.cc index 9102c530e25da..57a3f1e4e150b 100644 --- a/cpp/src/arrow/types/primitive.cc +++ b/cpp/src/arrow/types/primitive.cc @@ -76,6 +76,7 @@ Status PrimitiveBuilder::Init(int32_t capacity) { int64_t nbytes = type_traits::bytes_required(capacity); 
RETURN_NOT_OK(data_->Resize(nbytes)); + // TODO(emkornfield) valgrind complains without this memset(data_->mutable_data(), 0, nbytes); raw_data_ = reinterpret_cast(data_->mutable_data()); @@ -91,15 +92,13 @@ Status PrimitiveBuilder::Resize(int32_t capacity) { RETURN_NOT_OK(Init(capacity)); } else { RETURN_NOT_OK(ArrayBuilder::Resize(capacity)); - - int64_t old_bytes = data_->size(); - int64_t new_bytes = type_traits::bytes_required(capacity); + const int64_t old_bytes = data_->size(); + const int64_t new_bytes = type_traits::bytes_required(capacity); RETURN_NOT_OK(data_->Resize(new_bytes)); raw_data_ = reinterpret_cast(data_->mutable_data()); memset(data_->mutable_data() + old_bytes, 0, new_bytes - old_bytes); } - capacity_ = capacity; return Status::OK(); } diff --git a/cpp/src/arrow/util/bit-util-test.cc b/cpp/src/arrow/util/bit-util-test.cc index 26554d2c9069c..e1d8a0808b41a 100644 --- a/cpp/src/arrow/util/bit-util-test.cc +++ b/cpp/src/arrow/util/bit-util-test.cc @@ -21,6 +21,16 @@ namespace arrow { +TEST(UtilTests, TestIsMultipleOf64) { + using util::is_multiple_of_64; + EXPECT_TRUE(is_multiple_of_64(64)); + EXPECT_TRUE(is_multiple_of_64(0)); + EXPECT_TRUE(is_multiple_of_64(128)); + EXPECT_TRUE(is_multiple_of_64(192)); + EXPECT_FALSE(is_multiple_of_64(23)); + EXPECT_FALSE(is_multiple_of_64(32)); +} + TEST(UtilTests, TestNextPower2) { using util::next_power2; diff --git a/cpp/src/arrow/util/bit-util.h b/cpp/src/arrow/util/bit-util.h index 1f0f08c4d88ef..a6c8dd904d8e0 100644 --- a/cpp/src/arrow/util/bit-util.h +++ b/cpp/src/arrow/util/bit-util.h @@ -71,6 +71,10 @@ static inline int64_t next_power2(int64_t n) { return n; } +static inline bool is_multiple_of_64(int64_t n) { + return (n & 63) == 0; +} + void bytes_to_bits(const std::vector& bytes, uint8_t* bits); Status bytes_to_bits(const std::vector&, std::shared_ptr*); diff --git a/cpp/src/arrow/util/buffer.cc b/cpp/src/arrow/util/buffer.cc index bc9c22c10de44..703ef8384ac07 100644 --- a/cpp/src/arrow/util/buffer.cc +++ b/cpp/src/arrow/util/buffer.cc @@ -18,16 +18,32 @@ #include "arrow/util/buffer.h" #include +#include +#include "arrow/util/logging.h" #include "arrow/util/memory-pool.h" #include "arrow/util/status.h" namespace arrow { +namespace { +int64_t RoundUpToMultipleOf64(int64_t num) { + DCHECK_GE(num, 0); + constexpr int64_t round_to = 64; + constexpr int64_t force_carry_addend = round_to - 1; + constexpr int64_t truncate_bitmask = ~(round_to - 1); + constexpr int64_t max_roundable_num = std::numeric_limits::max() - round_to; + if (num <= max_roundable_num) { return (num + force_carry_addend) & truncate_bitmask; } + // handle overflow case. 
This should result in a malloc error upstream
+  return num;
+}
+}  // namespace
+
 Buffer::Buffer(const std::shared_ptr<Buffer>& parent, int64_t offset, int64_t size) {
   data_ = parent->data() + offset;
   size_ = size;
   parent_ = parent;
+  capacity_ = size;
 }

 Buffer::~Buffer() {}
@@ -48,6 +64,7 @@ PoolBuffer::~PoolBuffer() {
 Status PoolBuffer::Reserve(int64_t new_capacity) {
   if (!mutable_data_ || new_capacity > capacity_) {
     uint8_t* new_data;
+    new_capacity = RoundUpToMultipleOf64(new_capacity);
     if (mutable_data_) {
       RETURN_NOT_OK(pool_->Allocate(new_capacity, &new_data));
       memcpy(new_data, mutable_data_, size_);
diff --git a/cpp/src/arrow/util/buffer.h b/cpp/src/arrow/util/buffer.h
index 5ef0076953cea..f845d67761fe4 100644
--- a/cpp/src/arrow/util/buffer.h
+++ b/cpp/src/arrow/util/buffer.h
@@ -36,15 +36,23 @@ class Status;
 // Buffer classes

 // Immutable API for a chunk of bytes which may or may not be owned by the
-// class instance
+// class instance. Buffers have two related notions of length: size and
+// capacity. Size is the number of bytes that might have valid data.
+// Capacity is the number of bytes that were allocated for the buffer in
+// total.
+// The following invariant is always true: Size <= Capacity
 class Buffer : public std::enable_shared_from_this<Buffer> {
  public:
-  Buffer(const uint8_t* data, int64_t size) : data_(data), size_(size) {}
+  Buffer(const uint8_t* data, int64_t size) : data_(data), size_(size), capacity_(size) {}
   virtual ~Buffer();

   // An offset into data that is owned by another buffer, but we want to be
   // able to retain a valid pointer to it even after other shared_ptr's to the
   // parent buffer have been destroyed
+  //
+  // This method makes no assertions about alignment or padding of the buffer but
+  // in general we expect buffers to be aligned and padded to 64 bytes. In the future
+  // we might add utility methods to help determine if a buffer satisfies this contract.
   Buffer(const std::shared_ptr<Buffer>& parent, int64_t offset, int64_t size);

   std::shared_ptr<Buffer> get_shared_ptr() { return shared_from_this(); }
@@ -63,6 +71,7 @@ class Buffer : public std::enable_shared_from_this<Buffer> {
     (data_ == other.data_ || !memcmp(data_, other.data_, size_)));
   }

+  int64_t capacity() const { return capacity_; }
   const uint8_t* data() const { return data_; }

   int64_t size() const { return size_; }
@@ -76,6 +85,7 @@ class Buffer : public std::enable_shared_from_this<Buffer> {
  protected:
   const uint8_t* data_;
   int64_t size_;
+  int64_t capacity_;

   // nullptr by default, but may be set
   std::shared_ptr<Buffer> parent_;
@@ -105,18 +115,17 @@ class MutableBuffer : public Buffer {
 class ResizableBuffer : public MutableBuffer {
  public:
   // Change buffer reported size to indicated size, allocating memory if
-  // necessary
+  // necessary. This will ensure that the capacity of the buffer is a multiple
+  // of 64 bytes as defined in Layout.md.
   virtual Status Resize(int64_t new_size) = 0;

   // Ensure that buffer has enough memory allocated to fit the indicated
-  // capacity. Does not change buffer's reported size
+  // capacity (and meets the 64 byte padding requirement in Layout.md).
+  // It does not change buffer's reported size.
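+  //
+  // A usage sketch of the Resize/Reserve distinction (sizes are
+  // illustrative; PoolBuffer below implements this interface and rounds
+  // capacity up to a multiple of 64):
+  //
+  //   PoolBuffer buf;      // uses the default memory pool
+  //   buf.Resize(100);     // size() == 100, capacity() == 128
+  //   buf.Reserve(1000);   // capacity() == 1024, size() still 100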
virtual Status Reserve(int64_t new_capacity) = 0; protected: - ResizableBuffer(uint8_t* data, int64_t size) - : MutableBuffer(data, size), capacity_(size) {} - - int64_t capacity_; + ResizableBuffer(uint8_t* data, int64_t size) : MutableBuffer(data, size) {} }; // A Buffer whose lifetime is tied to a particular MemoryPool @@ -125,8 +134,8 @@ class PoolBuffer : public ResizableBuffer { explicit PoolBuffer(MemoryPool* pool = nullptr); virtual ~PoolBuffer(); - virtual Status Resize(int64_t new_size); - virtual Status Reserve(int64_t new_capacity); + Status Resize(int64_t new_size) override; + Status Reserve(int64_t new_capacity) override; private: MemoryPool* pool_; @@ -138,10 +147,11 @@ class BufferBuilder { public: explicit BufferBuilder(MemoryPool* pool) : pool_(pool), capacity_(0), size_(0) {} + // Resizes the buffer to the nearest multiple of 64 bytes per Layout.md Status Resize(int32_t elements) { if (capacity_ == 0) { buffer_ = std::make_shared(pool_); } - capacity_ = elements; - RETURN_NOT_OK(buffer_->Resize(capacity_)); + RETURN_NOT_OK(buffer_->Resize(elements)); + capacity_ = buffer_->capacity(); data_ = buffer_->mutable_data(); return Status::OK(); } diff --git a/cpp/src/arrow/util/memory-pool-test.cc b/cpp/src/arrow/util/memory-pool-test.cc index e4600a9bd9b27..4ab9736c2b468 100644 --- a/cpp/src/arrow/util/memory-pool-test.cc +++ b/cpp/src/arrow/util/memory-pool-test.cc @@ -31,6 +31,7 @@ TEST(DefaultMemoryPool, MemoryTracking) { uint8_t* data; ASSERT_OK(pool->Allocate(100, &data)); + EXPECT_EQ(0, reinterpret_cast(data) % 64); ASSERT_EQ(100, pool->bytes_allocated()); pool->Free(data, 100); diff --git a/cpp/src/arrow/util/memory-pool.cc b/cpp/src/arrow/util/memory-pool.cc index 961554fe06bcc..0a58e5aa21f72 100644 --- a/cpp/src/arrow/util/memory-pool.cc +++ b/cpp/src/arrow/util/memory-pool.cc @@ -17,6 +17,7 @@ #include "arrow/util/memory-pool.h" +#include #include #include #include @@ -25,6 +26,28 @@ namespace arrow { +namespace { +// Allocate memory according to the alignment requirements for Arrow +// (as of May 2016 64 bytes) +Status AllocateAligned(int64_t size, uint8_t** out) { + // TODO(emkornfield) find something compatible with windows + constexpr size_t kAlignment = 64; + const int result = posix_memalign(reinterpret_cast(out), kAlignment, size); + if (result == ENOMEM) { + std::stringstream ss; + ss << "malloc of size " << size << " failed"; + return Status::OutOfMemory(ss.str()); + } + + if (result == EINVAL) { + std::stringstream ss; + ss << "invalid alignment parameter: " << kAlignment; + return Status::Invalid(ss.str()); + } + return Status::OK(); +} +} // namespace + MemoryPool::~MemoryPool() {} class InternalMemoryPool : public MemoryPool { @@ -45,13 +68,7 @@ class InternalMemoryPool : public MemoryPool { Status InternalMemoryPool::Allocate(int64_t size, uint8_t** out) { std::lock_guard guard(pool_lock_); - *out = static_cast(std::malloc(size)); - if (*out == nullptr) { - std::stringstream ss; - ss << "malloc of size " << size << " failed"; - return Status::OutOfMemory(ss.str()); - } - + RETURN_NOT_OK(AllocateAligned(size, out)); bytes_allocated_ += size; return Status::OK(); diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index b39fde92034aa..5159d86865caa 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -147,17 +147,12 @@ class ArrowSerializer { Status ConvertObjectStrings(std::shared_ptr* out) { PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + arrow::TypePtr 
string_type(new arrow::StringType()); + arrow::StringBuilder string_builder(pool_, string_type); + RETURN_ARROW_NOT_OK(string_builder.Resize(length_)); - auto offsets_buffer = std::make_shared(pool_); - RETURN_ARROW_NOT_OK(offsets_buffer->Resize(sizeof(int32_t) * (length_ + 1))); - int32_t* offsets = reinterpret_cast(offsets_buffer->mutable_data()); - - arrow::BufferBuilder data_builder(pool_); arrow::Status s; PyObject* obj; - int length; - int offset = 0; - int64_t null_count = 0; for (int64_t i = 0; i < length_; ++i) { obj = objects[i]; if (PyUnicode_Check(obj)) { @@ -166,38 +161,20 @@ class ArrowSerializer { PyErr_Clear(); return Status::TypeError("failed converting unicode to UTF8"); } - length = PyBytes_GET_SIZE(obj); - s = data_builder.Append( - reinterpret_cast(PyBytes_AS_STRING(obj)), length); + const int32_t length = PyBytes_GET_SIZE(obj); + s = string_builder.Append(PyBytes_AS_STRING(obj), length); Py_DECREF(obj); if (!s.ok()) { return Status::ArrowError(s.ToString()); } - util::set_bit(null_bitmap_data_, i); } else if (PyBytes_Check(obj)) { - length = PyBytes_GET_SIZE(obj); - RETURN_ARROW_NOT_OK(data_builder.Append( - reinterpret_cast(PyBytes_AS_STRING(obj)), length)); - util::set_bit(null_bitmap_data_, i); + const int32_t length = PyBytes_GET_SIZE(obj); + RETURN_ARROW_NOT_OK(string_builder.Append(PyBytes_AS_STRING(obj), length)); } else { - // NULL - // No change to offset - length = 0; - ++null_count; + string_builder.AppendNull(); } - offsets[i] = offset; - offset += length; } - // End offset - offsets[length_] = offset; - - std::shared_ptr data_buffer = data_builder.Finish(); - - auto values = std::make_shared(data_buffer->size(), - data_buffer); - *out = std::shared_ptr( - new arrow::StringArray(length_, offsets_buffer, values, null_count, - null_bitmap_)); + *out = std::shared_ptr(string_builder.Finish()); return Status::OK(); } From 978de1a94dd451b3142aca0eb95ce410064a2330 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 18 May 2016 10:15:14 -0700 Subject: [PATCH 0074/1644] ARROW-204: Add Travis CI builds that post conda artifacts for Linux and OS X I tested this on my fork of Arrow, but ultimately we'll see if it works when the commit hits master. I've arranged so that packaging issues won't fail the build. Author: Wes McKinney Closes #79 from wesm/ARROW-204 and squashes the following commits: afd0582 [Wes McKinney] Change encrypted token to apache/arrow, only upload on commits to master 58955e5 [Wes McKinney] Draft of automated conda builds for libarrow, pyarrow. 
Remove unneeded thrift-cpp build dependency --- .travis.yml | 27 ++++++++++++++++++- ci/travis_conda_build.sh | 53 ++++++++++++++++++++++++++++++++++++++ cpp/conda.recipe/build.sh | 15 ++++++++++- cpp/conda.recipe/meta.yaml | 5 ++-- 4 files changed, 95 insertions(+), 5 deletions(-) create mode 100755 ci/travis_conda_build.sh diff --git a/.travis.yml b/.travis.yml index a0138a79598a1..646f80fee7b3b 100644 --- a/.travis.yml +++ b/.travis.yml @@ -1,5 +1,5 @@ sudo: required -dist: precise +dist: precise addons: apt: sources: @@ -18,6 +18,9 @@ addons: - valgrind matrix: + fast_finish: true + allow_failures: + - env: ARROW_TEST_GROUP=packaging include: - compiler: gcc language: cpp @@ -39,6 +42,24 @@ matrix: script: - $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh - $TRAVIS_BUILD_DIR/ci/travis_script_python.sh + - compiler: gcc + env: ARROW_TEST_GROUP=packaging + os: linux + before_script: + - export CC="gcc-4.9" + - export CXX="g++-4.9" + script: + - $TRAVIS_BUILD_DIR/ci/travis_conda_build.sh + - os: osx + env: ARROW_TEST_GROUP=packaging + language: objective-c + osx_image: xcode6.4 + compiler: clang + addons: + before_script: + before_install: + script: + - $TRAVIS_BUILD_DIR/ci/travis_conda_build.sh before_install: - ulimit -c unlimited -S @@ -51,3 +72,7 @@ after_script: after_failure: - COREFILE=$(find . -maxdepth 2 -name "core*" | head -n 1) - if [[ -f "$COREFILE" ]]; then gdb -c "$COREFILE" example -ex "thread apply all bt" -ex "set pagination 0" -batch; fi + +env: + global: + - secure: "GcrPtsKUCgNY7HKYjWlHQo8SiFrShDvdZSU8t1m1FJrE+UfK0Dgh9zXmAausM8GmhqSwkF0q4UbLQf2uCnSITWKeEPAL8Mo9eu4ib+ikJx/b3Sk81frgW5ADoHfW1Eyqd8xJNIMwMegJOtRLSDqiXh1CvMlKnY8PyTOGM2DgN9ona/v6p9OFH9Qs0JhBRVXAn0S4ztjumck8E56+01hqRfxbZ88pTfpKghBxYp9PJaMjtGdomjVWlqPaWaWJj+KptT8inV9NK+TVYKx0dXWD+S1Vgr1PytQnLdILOYV23gsOBYqn33ByF/yADl4m3hUjU/qeT0Fi7aWxmVpj+oTJISOSH5N8nIsuNH8mQk2ZzzXHfV7btFvP+cOPRczadoKkT6D6cHA8nQ7b0dphC6bl6SAeSfc/cbhRT+fYnIjg8jFXC8jlyWBr7LR6GXVpc0bND7i300ITo0FuRJhy2OxqPtGo3dKLE7eAcv78tuO0OYJ/kol1PEqFdFkbYbNVbg/cFpbGqiCXDsOtPDbAGBv69YnXdVowSxxs8cRGjSkDydv6ZSytb/Zd4lH/KAomcFNk8adx12O1Lk4sbmVav1cGig5P6OcQKS0jC5IiRb4THcQzVzAkXXbaafKm5sru/NoYxhzmkyhkOc11nTYHKVng+XKWzLCNn7pTTSLitp5+xa4=" diff --git a/ci/travis_conda_build.sh b/ci/travis_conda_build.sh new file mode 100755 index 0000000000000..afa531dbd6b5f --- /dev/null +++ b/ci/travis_conda_build.sh @@ -0,0 +1,53 @@ +#!/usr/bin/env bash + +set -e + +if [ $TRAVIS_OS_NAME == "linux" ]; then + MINICONDA_URL="https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh" +else + MINICONDA_URL="https://repo.continuum.io/miniconda/Miniconda-latest-MacOSX-x86_64.sh" +fi + +wget -O miniconda.sh $MINICONDA_URL +MINICONDA=$TRAVIS_BUILD_DIR/miniconda +bash miniconda.sh -b -p $MINICONDA +export PATH="$MINICONDA/bin:$PATH" +conda update -y -q conda +conda info -a + +conda config --set show_channel_urls yes +conda config --add channels conda-forge +conda config --add channels apache + +conda install --yes conda-build jinja2 anaconda-client + +# faster builds, please +conda install -y nomkl + +# Build libarrow + +cd $TRAVIS_BUILD_DIR/cpp + +conda build conda.recipe --channel apache/channel/dev +CONDA_PACKAGE=`conda build --output conda.recipe | grep bz2` + +if [ $TRAVIS_BRANCH == "master" ] && [ $TRAVIS_PULL_REQUEST == "false" ]; then + anaconda --token $ANACONDA_TOKEN upload $CONDA_PACKAGE --user apache --channel dev; +fi + +# Build pyarrow + +cd $TRAVIS_BUILD_DIR/python + +build_for_python_version() { + PY_VERSION=$1 + conda build conda.recipe --python $PY_VERSION --channel 
apache/channel/dev + CONDA_PACKAGE=`conda build --python $PY_VERSION --output conda.recipe | grep bz2` + + if [ $TRAVIS_BRANCH == "master" ] && [ $TRAVIS_PULL_REQUEST == "false" ]; then + anaconda --token $ANACONDA_TOKEN upload $CONDA_PACKAGE --user apache --channel dev; + fi +} + +build_for_python_version 2.7 +build_for_python_version 3.5 diff --git a/cpp/conda.recipe/build.sh b/cpp/conda.recipe/build.sh index ac1f9c89cc9ed..b10dd03349bd3 100644 --- a/cpp/conda.recipe/build.sh +++ b/cpp/conda.recipe/build.sh @@ -9,6 +9,19 @@ cd $RECIPE_DIR export FLATBUFFERS_HOME=$PREFIX export PARQUET_HOME=$PREFIX +if [ "$(uname)" == "Darwin" ]; then + # C++11 finagling for Mac OSX + export CC=clang + export CXX=clang++ + export MACOSX_VERSION_MIN="10.7" + CXXFLAGS="${CXXFLAGS} -mmacosx-version-min=${MACOSX_VERSION_MIN}" + CXXFLAGS="${CXXFLAGS} -stdlib=libc++ -std=c++11" + export LDFLAGS="${LDFLAGS} -mmacosx-version-min=${MACOSX_VERSION_MIN}" + export LDFLAGS="${LDFLAGS} -stdlib=libc++ -std=c++11" + export LINKFLAGS="${LDFLAGS}" + export MACOSX_DEPLOYMENT_TARGET=10.7 +fi + cd .. rm -rf conda-build @@ -33,7 +46,7 @@ elif [ `uname` == Darwin ]; then fi cmake \ - -DCMAKE_BUILD_TYPE=debug \ + -DCMAKE_BUILD_TYPE=release \ -DCMAKE_INSTALL_PREFIX=$PREFIX \ -DCMAKE_SHARED_LINKER_FLAGS=$SHARED_LINKER_FLAGS \ -DARROW_IPC=on \ diff --git a/cpp/conda.recipe/meta.yaml b/cpp/conda.recipe/meta.yaml index 2e834d5cbf86c..75f3a8ba3d98f 100644 --- a/cpp/conda.recipe/meta.yaml +++ b/cpp/conda.recipe/meta.yaml @@ -15,15 +15,14 @@ requirements: - cmake - flatbuffers - parquet-cpp - - thrift-cpp run: - parquet-cpp test: commands: - - test -f $PREFIX/lib/libarrow.so - - test -f $PREFIX/lib/libarrow_parquet.so + - test -f $PREFIX/lib/libarrow.so # [linux] + - test -f $PREFIX/lib/libarrow_parquet.so # [linux] - test -f $PREFIX/include/arrow/api.h about: From e0fb3698e5602bccaee232d4c259b3df089886e6 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Wed, 18 May 2016 10:49:04 -0700 Subject: [PATCH 0075/1644] ARROW-201: [C++] Initial ParquetWriter implementation Author: Uwe L. Korn Closes #78 from xhochy/arrow-201 and squashes the following commits: 5d95099 [Uwe L. Korn] Add check for flat column 88ae3ca [Uwe L. Korn] Install arrow_parquet headers f81021b [Uwe L. Korn] Incorporate reader comments ba240e8 [Uwe L. Korn] Incorporate writer comments 2179c0e [Uwe L. Korn] Infer c-type from ArrowType efd46fb [Uwe L. Korn] Infer c-type from ArrowType 77386ea [Uwe L. Korn] Templatize test functions 1aa7698 [Uwe L. Korn] Add comment to helper function 8fdd4c8 [Uwe L. Korn] Parameterize schema creation 8e8d7d7 [Uwe L. 
Korn] ARROW-201: [C++] Initial ParquetWriter implementation --- cpp/src/arrow/parquet/CMakeLists.txt | 6 +- cpp/src/arrow/parquet/parquet-io-test.cc | 222 +++++++++++++++++++ cpp/src/arrow/parquet/parquet-reader-test.cc | 116 ---------- cpp/src/arrow/parquet/reader.cc | 79 ++++--- cpp/src/arrow/parquet/writer.cc | 148 +++++++++++++ cpp/src/arrow/parquet/writer.h | 59 +++++ 6 files changed, 485 insertions(+), 145 deletions(-) create mode 100644 cpp/src/arrow/parquet/parquet-io-test.cc delete mode 100644 cpp/src/arrow/parquet/parquet-reader-test.cc create mode 100644 cpp/src/arrow/parquet/writer.cc create mode 100644 cpp/src/arrow/parquet/writer.h diff --git a/cpp/src/arrow/parquet/CMakeLists.txt b/cpp/src/arrow/parquet/CMakeLists.txt index cd6f05d6b5f8a..c00cc9f0f25d0 100644 --- a/cpp/src/arrow/parquet/CMakeLists.txt +++ b/cpp/src/arrow/parquet/CMakeLists.txt @@ -21,6 +21,7 @@ set(PARQUET_SRCS reader.cc schema.cc + writer.cc ) set(PARQUET_LIBS @@ -37,14 +38,15 @@ SET_TARGET_PROPERTIES(arrow_parquet PROPERTIES LINKER_LANGUAGE CXX) ADD_ARROW_TEST(parquet-schema-test) ARROW_TEST_LINK_LIBRARIES(parquet-schema-test arrow_parquet) -ADD_ARROW_TEST(parquet-reader-test) -ARROW_TEST_LINK_LIBRARIES(parquet-reader-test arrow_parquet) +ADD_ARROW_TEST(parquet-io-test) +ARROW_TEST_LINK_LIBRARIES(parquet-io-test arrow_parquet) # Headers: top level install(FILES reader.h schema.h utils.h + writer.h DESTINATION include/arrow/parquet) install(TARGETS arrow_parquet diff --git a/cpp/src/arrow/parquet/parquet-io-test.cc b/cpp/src/arrow/parquet/parquet-io-test.cc new file mode 100644 index 0000000000000..845574d2c53b9 --- /dev/null +++ b/cpp/src/arrow/parquet/parquet-io-test.cc @@ -0,0 +1,222 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
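+
+// The tests in this file follow a round-trip pattern: build an Arrow array
+// in memory, write it to an in-memory Parquet file with FileWriter, read it
+// back with FileReader, and compare the result using Array::Equals. A
+// condensed sketch of one such round trip (using the schema and helpers
+// defined below):
+//
+//   std::shared_ptr<PrimitiveArray> values = NonNullArray<Int64Type>(100, 128);
+//   FileWriter writer(default_memory_pool(), MakeWriter(schema));
+//   ASSERT_OK(writer.NewRowGroup(values->length()));
+//   ASSERT_OK(writer.WriteFlatColumnChunk(values.get()));
+//   ASSERT_OK(writer.Close());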
+ +#include "gtest/gtest.h" + +#include "arrow/test-util.h" +#include "arrow/parquet/reader.h" +#include "arrow/parquet/writer.h" +#include "arrow/types/primitive.h" +#include "arrow/util/memory-pool.h" +#include "arrow/util/status.h" + +#include "parquet/api/reader.h" +#include "parquet/api/writer.h" + +using ParquetBuffer = parquet::Buffer; +using parquet::BufferReader; +using parquet::InMemoryOutputStream; +using parquet::ParquetFileReader; +using parquet::ParquetFileWriter; +using parquet::RandomAccessSource; +using parquet::Repetition; +using parquet::SchemaDescriptor; +using ParquetType = parquet::Type; +using parquet::schema::GroupNode; +using parquet::schema::NodePtr; +using parquet::schema::PrimitiveNode; + +namespace arrow { + +namespace parquet { + +template +std::shared_ptr NonNullArray( + size_t size, typename ArrowType::c_type value) { + std::vector values(size, value); + NumericBuilder builder(default_memory_pool(), std::make_shared()); + builder.Append(values.data(), values.size()); + return std::static_pointer_cast(builder.Finish()); +} + +// This helper function only supports (size/2) nulls yet. +template +std::shared_ptr NullableArray( + size_t size, typename ArrowType::c_type value, size_t num_nulls) { + std::vector values(size, value); + std::vector valid_bytes(size, 1); + + for (size_t i = 0; i < num_nulls; i++) { + valid_bytes[i * 2] = 0; + } + + NumericBuilder builder(default_memory_pool(), std::make_shared()); + builder.Append(values.data(), values.size(), valid_bytes.data()); + return std::static_pointer_cast(builder.Finish()); +} + +class TestParquetIO : public ::testing::Test { + public: + virtual void SetUp() {} + + std::shared_ptr Schema( + ParquetType::type parquet_type, Repetition::type repetition) { + auto pnode = PrimitiveNode::Make("column1", repetition, parquet_type); + NodePtr node_ = + GroupNode::Make("schema", Repetition::REQUIRED, std::vector({pnode})); + return std::static_pointer_cast(node_); + } + + std::unique_ptr MakeWriter(std::shared_ptr& schema) { + sink_ = std::make_shared(); + return ParquetFileWriter::Open(sink_, schema); + } + + std::unique_ptr ReaderFromSink() { + std::shared_ptr buffer = sink_->GetBuffer(); + std::unique_ptr source(new BufferReader(buffer)); + return ParquetFileReader::Open(std::move(source)); + } + + void ReadSingleColumnFile( + std::unique_ptr file_reader, std::shared_ptr* out) { + arrow::parquet::FileReader reader(default_memory_pool(), std::move(file_reader)); + std::unique_ptr column_reader; + ASSERT_NO_THROW(ASSERT_OK(reader.GetFlatColumn(0, &column_reader))); + ASSERT_NE(nullptr, column_reader.get()); + ASSERT_OK(column_reader->NextBatch(100, out)); + ASSERT_NE(nullptr, out->get()); + } + + std::unique_ptr Int64File( + std::vector& values, int num_chunks) { + std::shared_ptr schema = Schema(ParquetType::INT64, Repetition::REQUIRED); + std::unique_ptr file_writer = MakeWriter(schema); + size_t chunk_size = values.size() / num_chunks; + for (int i = 0; i < num_chunks; i++) { + auto row_group_writer = file_writer->AppendRowGroup(chunk_size); + auto column_writer = + static_cast<::parquet::Int64Writer*>(row_group_writer->NextColumn()); + int64_t* data = values.data() + i * chunk_size; + column_writer->WriteBatch(chunk_size, nullptr, nullptr, data); + column_writer->Close(); + row_group_writer->Close(); + } + file_writer->Close(); + return ReaderFromSink(); + } + + private: + std::shared_ptr sink_; +}; + +TEST_F(TestParquetIO, SingleColumnInt64Read) { + std::vector values(100, 128); + std::unique_ptr file_reader = 
Int64File(values, 1); + + std::shared_ptr out; + ReadSingleColumnFile(std::move(file_reader), &out); + + Int64Array* out_array = static_cast(out.get()); + for (size_t i = 0; i < values.size(); i++) { + EXPECT_EQ(values[i], out_array->raw_data()[i]); + } +} + +TEST_F(TestParquetIO, SingleColumnInt64ChunkedRead) { + std::vector values(100, 128); + std::unique_ptr file_reader = Int64File(values, 4); + + std::shared_ptr out; + ReadSingleColumnFile(std::move(file_reader), &out); + + Int64Array* out_array = static_cast(out.get()); + for (size_t i = 0; i < values.size(); i++) { + EXPECT_EQ(values[i], out_array->raw_data()[i]); + } +} + +TEST_F(TestParquetIO, SingleColumnInt64Write) { + std::shared_ptr values = NonNullArray(100, 128); + + std::shared_ptr schema = Schema(ParquetType::INT64, Repetition::REQUIRED); + FileWriter writer(default_memory_pool(), MakeWriter(schema)); + ASSERT_NO_THROW(ASSERT_OK(writer.NewRowGroup(values->length()))); + ASSERT_NO_THROW(ASSERT_OK(writer.WriteFlatColumnChunk(values.get()))); + ASSERT_NO_THROW(ASSERT_OK(writer.Close())); + + std::shared_ptr out; + ReadSingleColumnFile(ReaderFromSink(), &out); + ASSERT_TRUE(values->Equals(out)); +} + +TEST_F(TestParquetIO, SingleColumnDoubleReadWrite) { + // This also tests max_definition_level = 1 + std::shared_ptr values = NullableArray(100, 128, 10); + + std::shared_ptr schema = Schema(ParquetType::DOUBLE, Repetition::OPTIONAL); + FileWriter writer(default_memory_pool(), MakeWriter(schema)); + ASSERT_NO_THROW(ASSERT_OK(writer.NewRowGroup(values->length()))); + ASSERT_NO_THROW(ASSERT_OK(writer.WriteFlatColumnChunk(values.get()))); + ASSERT_NO_THROW(ASSERT_OK(writer.Close())); + + std::shared_ptr out; + ReadSingleColumnFile(ReaderFromSink(), &out); + ASSERT_TRUE(values->Equals(out)); +} + +TEST_F(TestParquetIO, SingleColumnInt64ChunkedWrite) { + std::shared_ptr values = NonNullArray(100, 128); + std::shared_ptr values_chunk = NonNullArray(25, 128); + + std::shared_ptr schema = Schema(ParquetType::INT64, Repetition::REQUIRED); + FileWriter writer(default_memory_pool(), MakeWriter(schema)); + for (int i = 0; i < 4; i++) { + ASSERT_NO_THROW(ASSERT_OK(writer.NewRowGroup(values_chunk->length()))); + ASSERT_NO_THROW(ASSERT_OK(writer.WriteFlatColumnChunk(values_chunk.get()))); + } + ASSERT_NO_THROW(ASSERT_OK(writer.Close())); + + std::shared_ptr out; + ReadSingleColumnFile(ReaderFromSink(), &out); + ASSERT_TRUE(values->Equals(out)); +} + +TEST_F(TestParquetIO, SingleColumnDoubleChunkedWrite) { + std::shared_ptr values = NullableArray(100, 128, 10); + std::shared_ptr values_chunk_nulls = + NullableArray(25, 128, 10); + std::shared_ptr values_chunk = NullableArray(25, 128, 0); + + std::shared_ptr schema = Schema(ParquetType::DOUBLE, Repetition::OPTIONAL); + FileWriter writer(default_memory_pool(), MakeWriter(schema)); + ASSERT_NO_THROW(ASSERT_OK(writer.NewRowGroup(values_chunk_nulls->length()))); + ASSERT_NO_THROW(ASSERT_OK(writer.WriteFlatColumnChunk(values_chunk_nulls.get()))); + for (int i = 0; i < 3; i++) { + ASSERT_NO_THROW(ASSERT_OK(writer.NewRowGroup(values_chunk->length()))); + ASSERT_NO_THROW(ASSERT_OK(writer.WriteFlatColumnChunk(values_chunk.get()))); + } + ASSERT_NO_THROW(ASSERT_OK(writer.Close())); + + std::shared_ptr out; + ReadSingleColumnFile(ReaderFromSink(), &out); + ASSERT_TRUE(values->Equals(out)); +} + +} // namespace parquet + +} // namespace arrow diff --git a/cpp/src/arrow/parquet/parquet-reader-test.cc b/cpp/src/arrow/parquet/parquet-reader-test.cc deleted file mode 100644 index a7fc2a89f5f45..0000000000000 --- 
a/cpp/src/arrow/parquet/parquet-reader-test.cc +++ /dev/null @@ -1,116 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#include "gtest/gtest.h" - -#include "arrow/test-util.h" -#include "arrow/parquet/reader.h" -#include "arrow/types/primitive.h" -#include "arrow/util/memory-pool.h" -#include "arrow/util/status.h" - -#include "parquet/api/reader.h" -#include "parquet/api/writer.h" - -using ParquetBuffer = parquet::Buffer; -using parquet::BufferReader; -using parquet::InMemoryOutputStream; -using parquet::Int64Writer; -using parquet::ParquetFileReader; -using parquet::ParquetFileWriter; -using parquet::RandomAccessSource; -using parquet::Repetition; -using parquet::SchemaDescriptor; -using ParquetType = parquet::Type; -using parquet::schema::GroupNode; -using parquet::schema::NodePtr; -using parquet::schema::PrimitiveNode; - -namespace arrow { - -namespace parquet { - -class TestReadParquet : public ::testing::Test { - public: - virtual void SetUp() {} - - std::shared_ptr Int64Schema() { - auto pnode = PrimitiveNode::Make("int64", Repetition::REQUIRED, ParquetType::INT64); - NodePtr node_ = - GroupNode::Make("schema", Repetition::REQUIRED, std::vector({pnode})); - return std::static_pointer_cast(node_); - } - - std::unique_ptr Int64File( - std::vector& values, int num_chunks) { - std::shared_ptr schema = Int64Schema(); - std::shared_ptr sink(new InMemoryOutputStream()); - auto file_writer = ParquetFileWriter::Open(sink, schema); - size_t chunk_size = values.size() / num_chunks; - for (int i = 0; i < num_chunks; i++) { - auto row_group_writer = file_writer->AppendRowGroup(chunk_size); - auto column_writer = static_cast(row_group_writer->NextColumn()); - int64_t* data = values.data() + i * chunk_size; - column_writer->WriteBatch(chunk_size, nullptr, nullptr, data); - column_writer->Close(); - row_group_writer->Close(); - } - file_writer->Close(); - - std::shared_ptr buffer = sink->GetBuffer(); - std::unique_ptr source(new BufferReader(buffer)); - return ParquetFileReader::Open(std::move(source)); - } - - private: -}; - -TEST_F(TestReadParquet, SingleColumnInt64) { - std::vector values(100, 128); - std::unique_ptr file_reader = Int64File(values, 1); - arrow::parquet::FileReader reader(default_memory_pool(), std::move(file_reader)); - std::unique_ptr column_reader; - ASSERT_NO_THROW(ASSERT_OK(reader.GetFlatColumn(0, &column_reader))); - ASSERT_NE(nullptr, column_reader.get()); - std::shared_ptr out; - ASSERT_OK(column_reader->NextBatch(100, &out)); - ASSERT_NE(nullptr, out.get()); - Int64Array* out_array = static_cast(out.get()); - for (size_t i = 0; i < values.size(); i++) { - EXPECT_EQ(values[i], out_array->raw_data()[i]); - } -} - -TEST_F(TestReadParquet, SingleColumnInt64Chunked) { - std::vector values(100, 128); - std::unique_ptr 
file_reader = Int64File(values, 4); - arrow::parquet::FileReader reader(default_memory_pool(), std::move(file_reader)); - std::unique_ptr column_reader; - ASSERT_NO_THROW(ASSERT_OK(reader.GetFlatColumn(0, &column_reader))); - ASSERT_NE(nullptr, column_reader.get()); - std::shared_ptr out; - ASSERT_OK(column_reader->NextBatch(100, &out)); - ASSERT_NE(nullptr, out.get()); - Int64Array* out_array = static_cast(out.get()); - for (size_t i = 0; i < values.size(); i++) { - EXPECT_EQ(values[i], out_array->raw_data()[i]); - } -} - -} // namespace parquet - -} // namespace arrow diff --git a/cpp/src/arrow/parquet/reader.cc b/cpp/src/arrow/parquet/reader.cc index 481ded5789a71..346de25360649 100644 --- a/cpp/src/arrow/parquet/reader.cc +++ b/cpp/src/arrow/parquet/reader.cc @@ -26,6 +26,7 @@ #include "arrow/util/status.h" using parquet::ColumnReader; +using parquet::Repetition; using parquet::TypedColumnReader; namespace arrow { @@ -36,6 +37,7 @@ class FileReader::Impl { Impl(MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileReader> reader); virtual ~Impl() {} + bool CheckForFlatColumn(const ::parquet::ColumnDescriptor* descr); Status GetFlatColumn(int i, std::unique_ptr* out); Status ReadFlatColumn(int i, std::shared_ptr* out); @@ -51,7 +53,7 @@ class FlatColumnReader::Impl { virtual ~Impl() {} Status NextBatch(int batch_size, std::shared_ptr* out); - template + template Status TypedReadBatch(int batch_size, std::shared_ptr* out); private: @@ -67,14 +69,28 @@ class FlatColumnReader::Impl { PoolBuffer values_buffer_; PoolBuffer def_levels_buffer_; - PoolBuffer rep_levels_buffer_; + PoolBuffer values_builder_buffer_; + PoolBuffer valid_bytes_buffer_; }; FileReader::Impl::Impl( MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileReader> reader) : pool_(pool), reader_(std::move(reader)) {} +bool FileReader::Impl::CheckForFlatColumn(const ::parquet::ColumnDescriptor* descr) { + if ((descr->max_repetition_level() > 0) || (descr->max_definition_level() > 1)) { + return false; + } else if ((descr->max_definition_level() == 1) && + (descr->schema_node()->repetition() != Repetition::OPTIONAL)) { + return false; + } + return true; +} + Status FileReader::Impl::GetFlatColumn(int i, std::unique_ptr* out) { + if (!CheckForFlatColumn(reader_->descr()->Column(i))) { + return Status::Invalid("The requested column is not flat"); + } std::unique_ptr impl( new FlatColumnReader::Impl(pool_, reader_->descr()->Column(i), reader_.get(), i)); *out = std::unique_ptr(new FlatColumnReader(std::move(impl))); @@ -109,37 +125,50 @@ FlatColumnReader::Impl::Impl(MemoryPool* pool, const ::parquet::ColumnDescriptor column_index_(column_index), next_row_group_(0), values_buffer_(pool), - def_levels_buffer_(pool), - rep_levels_buffer_(pool) { + def_levels_buffer_(pool) { NodeToField(descr_->schema_node(), &field_); NextRowGroup(); } -template +template Status FlatColumnReader::Impl::TypedReadBatch( int batch_size, std::shared_ptr* out) { int values_to_read = batch_size; NumericBuilder builder(pool_, field_->type); while ((values_to_read > 0) && column_reader_) { - values_buffer_.Resize(values_to_read * sizeof(CType)); + values_buffer_.Resize(values_to_read * sizeof(typename ParquetType::c_type)); if (descr_->max_definition_level() > 0) { def_levels_buffer_.Resize(values_to_read * sizeof(int16_t)); } - if (descr_->max_repetition_level() > 0) { - rep_levels_buffer_.Resize(values_to_read * sizeof(int16_t)); - } auto reader = dynamic_cast*>(column_reader_.get()); int64_t values_read; - CType* values = 
reinterpret_cast(values_buffer_.mutable_data()); - PARQUET_CATCH_NOT_OK( - values_to_read -= reader->ReadBatch(values_to_read, - reinterpret_cast(def_levels_buffer_.mutable_data()), - reinterpret_cast(rep_levels_buffer_.mutable_data()), values, - &values_read)); + int64_t levels_read; + int16_t* def_levels = reinterpret_cast(def_levels_buffer_.mutable_data()); + auto values = + reinterpret_cast(values_buffer_.mutable_data()); + PARQUET_CATCH_NOT_OK(levels_read = reader->ReadBatch( + values_to_read, def_levels, nullptr, values, &values_read)); + values_to_read -= levels_read; if (descr_->max_definition_level() == 0) { RETURN_NOT_OK(builder.Append(values, values_read)); } else { - return Status::NotImplemented("no support for definition levels yet"); + // descr_->max_definition_level() == 1 + RETURN_NOT_OK(values_builder_buffer_.Resize( + levels_read * sizeof(typename ParquetType::c_type))); + RETURN_NOT_OK(valid_bytes_buffer_.Resize(levels_read * sizeof(uint8_t))); + auto values_ptr = reinterpret_cast( + values_builder_buffer_.mutable_data()); + uint8_t* valid_bytes = valid_bytes_buffer_.mutable_data(); + int values_idx = 0; + for (int64_t i = 0; i < levels_read; i++) { + if (def_levels[i] < descr_->max_definition_level()) { + valid_bytes[i] = 0; + } else { + valid_bytes[i] = 1; + values_ptr[i] = values[values_idx++]; + } + } + builder.Append(values_ptr, levels_read, valid_bytes); } if (!column_reader_->HasNext()) { NextRowGroup(); } } @@ -147,9 +176,9 @@ Status FlatColumnReader::Impl::TypedReadBatch( return Status::OK(); } -#define TYPED_BATCH_CASE(ENUM, ArrowType, ParquetType, CType) \ - case Type::ENUM: \ - return TypedReadBatch(batch_size, out); \ +#define TYPED_BATCH_CASE(ENUM, ArrowType, ParquetType) \ + case Type::ENUM: \ + return TypedReadBatch(batch_size, out); \ break; Status FlatColumnReader::Impl::NextBatch(int batch_size, std::shared_ptr* out) { @@ -159,15 +188,11 @@ Status FlatColumnReader::Impl::NextBatch(int batch_size, std::shared_ptr* return Status::OK(); } - if (descr_->max_repetition_level() > 0) { - return Status::NotImplemented("no support for repetition yet"); - } - switch (field_->type->type) { - TYPED_BATCH_CASE(INT32, Int32Type, ::parquet::Int32Type, int32_t) - TYPED_BATCH_CASE(INT64, Int64Type, ::parquet::Int64Type, int64_t) - TYPED_BATCH_CASE(FLOAT, FloatType, ::parquet::FloatType, float) - TYPED_BATCH_CASE(DOUBLE, DoubleType, ::parquet::DoubleType, double) + TYPED_BATCH_CASE(INT32, Int32Type, ::parquet::Int32Type) + TYPED_BATCH_CASE(INT64, Int64Type, ::parquet::Int64Type) + TYPED_BATCH_CASE(FLOAT, FloatType, ::parquet::FloatType) + TYPED_BATCH_CASE(DOUBLE, DoubleType, ::parquet::DoubleType) default: return Status::NotImplemented(field_->type->ToString()); } diff --git a/cpp/src/arrow/parquet/writer.cc b/cpp/src/arrow/parquet/writer.cc new file mode 100644 index 0000000000000..3ad2c5b073501 --- /dev/null +++ b/cpp/src/arrow/parquet/writer.cc @@ -0,0 +1,148 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/parquet/writer.h" + +#include "arrow/array.h" +#include "arrow/types/primitive.h" +#include "arrow/parquet/utils.h" +#include "arrow/util/status.h" + +namespace arrow { + +namespace parquet { + +class FileWriter::Impl { + public: + Impl(MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileWriter> writer); + + Status NewRowGroup(int64_t chunk_size); + template + Status TypedWriteBatch(::parquet::ColumnWriter* writer, const PrimitiveArray* data); + Status WriteFlatColumnChunk(const PrimitiveArray* data); + Status Close(); + + virtual ~Impl() {} + + private: + MemoryPool* pool_; + PoolBuffer data_buffer_; + PoolBuffer def_levels_buffer_; + std::unique_ptr<::parquet::ParquetFileWriter> writer_; + ::parquet::RowGroupWriter* row_group_writer_; +}; + +FileWriter::Impl::Impl( + MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileWriter> writer) + : pool_(pool), + data_buffer_(pool), + writer_(std::move(writer)), + row_group_writer_(nullptr) {} + +Status FileWriter::Impl::NewRowGroup(int64_t chunk_size) { + if (row_group_writer_ != nullptr) { PARQUET_CATCH_NOT_OK(row_group_writer_->Close()); } + PARQUET_CATCH_NOT_OK(row_group_writer_ = writer_->AppendRowGroup(chunk_size)); + return Status::OK(); +} + +template +Status FileWriter::Impl::TypedWriteBatch( + ::parquet::ColumnWriter* column_writer, const PrimitiveArray* data) { + auto data_ptr = + reinterpret_cast(data->data()->data()); + auto writer = + reinterpret_cast<::parquet::TypedColumnWriter*>(column_writer); + if (writer->descr()->max_definition_level() == 0) { + // no nulls, just dump the data + PARQUET_CATCH_NOT_OK(writer->WriteBatch(data->length(), nullptr, nullptr, data_ptr)); + } else if (writer->descr()->max_definition_level() == 1) { + RETURN_NOT_OK(def_levels_buffer_.Resize(data->length() * sizeof(int16_t))); + int16_t* def_levels_ptr = + reinterpret_cast(def_levels_buffer_.mutable_data()); + if (data->null_count() == 0) { + std::fill(def_levels_ptr, def_levels_ptr + data->length(), 1); + PARQUET_CATCH_NOT_OK( + writer->WriteBatch(data->length(), def_levels_ptr, nullptr, data_ptr)); + } else { + RETURN_NOT_OK(data_buffer_.Resize( + (data->length() - data->null_count()) * sizeof(typename ParquetType::c_type))); + auto buffer_ptr = + reinterpret_cast(data_buffer_.mutable_data()); + int buffer_idx = 0; + for (size_t i = 0; i < data->length(); i++) { + if (data->IsNull(i)) { + def_levels_ptr[i] = 0; + } else { + def_levels_ptr[i] = 1; + buffer_ptr[buffer_idx++] = data_ptr[i]; + } + } + PARQUET_CATCH_NOT_OK( + writer->WriteBatch(data->length(), def_levels_ptr, nullptr, buffer_ptr)); + } + } else { + return Status::NotImplemented("no support for max definition level > 1 yet"); + } + PARQUET_CATCH_NOT_OK(writer->Close()); + return Status::OK(); +} + +Status FileWriter::Impl::Close() { + if (row_group_writer_ != nullptr) { PARQUET_CATCH_NOT_OK(row_group_writer_->Close()); } + PARQUET_CATCH_NOT_OK(writer_->Close()); + return Status::OK(); +} + +#define TYPED_BATCH_CASE(ENUM, ArrowType, ParquetType) \ + case Type::ENUM: \ + return TypedWriteBatch(writer, data); \ + break; + +Status 
FileWriter::Impl::WriteFlatColumnChunk(const PrimitiveArray* data) {
+ ::parquet::ColumnWriter* writer;
+ PARQUET_CATCH_NOT_OK(writer = row_group_writer_->NextColumn());
+ switch (data->type_enum()) {
+ TYPED_BATCH_CASE(INT32, Int32Type, ::parquet::Int32Type)
+ TYPED_BATCH_CASE(INT64, Int64Type, ::parquet::Int64Type)
+ TYPED_BATCH_CASE(FLOAT, FloatType, ::parquet::FloatType)
+ TYPED_BATCH_CASE(DOUBLE, DoubleType, ::parquet::DoubleType)
+ default:
+ return Status::NotImplemented(data->type()->ToString());
+ }
+}
+
+FileWriter::FileWriter(
+ MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileWriter> writer)
+ : impl_(new FileWriter::Impl(pool, std::move(writer))) {}
+
+Status FileWriter::NewRowGroup(int64_t chunk_size) {
+ return impl_->NewRowGroup(chunk_size);
+}
+
+Status FileWriter::WriteFlatColumnChunk(const PrimitiveArray* data) {
+ return impl_->WriteFlatColumnChunk(data);
+}
+
+Status FileWriter::Close() {
+ return impl_->Close();
+}
+
+FileWriter::~FileWriter() {}
+
+} // namespace parquet
+
+} // namespace arrow
diff --git a/cpp/src/arrow/parquet/writer.h b/cpp/src/arrow/parquet/writer.h new file mode 100644 index 0000000000000..38f7d0b3a89d5 --- /dev/null +++ b/cpp/src/arrow/parquet/writer.h @@ -0,0 +1,59 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef ARROW_PARQUET_WRITER_H
+#define ARROW_PARQUET_WRITER_H
+
+#include <memory>
+
+#include "parquet/api/schema.h"
+#include "parquet/api/writer.h"
+
+namespace arrow {
+
+class MemoryPool;
+class PrimitiveArray;
+class RowBatch;
+class Status;
+
+namespace parquet {
+
+/**
+ * Iterative API:
+ * Start a new RowGroup/Chunk with NewRowGroup
+ * Write column-by-column the whole column chunk
+ */
+class FileWriter {
+ public:
+ FileWriter(MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileWriter> writer);
+
+ Status NewRowGroup(int64_t chunk_size);
+ Status WriteFlatColumnChunk(const PrimitiveArray* data);
+ Status Close();
+
+ virtual ~FileWriter();
+
+ private:
+ class Impl;
+ std::unique_ptr<Impl> impl_;
+};
+
+} // namespace parquet
+
+} // namespace arrow
+
+#endif // ARROW_PARQUET_WRITER_H
From c0985a47665f8ce8847a6a0215e6e3c0f1db28f4 Mon Sep 17 00:00:00 2001
From: Laurent Goujon
Date: Mon, 18 Apr 2016 11:07:22 -0700
Subject: [PATCH 0076/1644] Make BaseValueVector#MAX_ALLOCATION_SIZE configurable

This closes #65

Some of the tests are based on the assumption that the JVM can allocate at least 2GB of memory, which is not a common occurrence (the JVM usually defaults to 512MB). Current Travis CI VMs only have 3GB of memory in total, which would have made it challenging to run some of those tests on them.

Add a system property to change BaseValueVector.MAX_ALLOCATION_SIZE, allowing a much smaller value to be used during tests.
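As an illustrative invocation (hedged: only the property name `arrow.vector.max_allocation_bytes` and the 32 MiB value used by the new test rule come from this change; the exact command line is an assumption), the Java suite could then be run with a reduced cap via `mvn -B test -Darrow.vector.max_allocation_bytes=33554432`.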
--- .../apache/arrow/vector/BaseValueVector.java | 14 ++++---- .../apache/arrow/vector/TestValueVector.java | 36 +++++++++++++++---- 2 files changed, 38 insertions(+), 12 deletions(-) diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseValueVector.java index 8bca3c005370e..932e6f13caf2b 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BaseValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseValueVector.java @@ -17,23 +17,24 @@ */ package org.apache.arrow.vector; -import io.netty.buffer.ArrowBuf; - import java.util.Iterator; -import com.google.common.base.Preconditions; -import com.google.common.collect.Iterators; - import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.types.MaterializedField; import org.apache.arrow.vector.util.TransferPair; import org.slf4j.Logger; import org.slf4j.LoggerFactory; +import com.google.common.base.Preconditions; +import com.google.common.collect.Iterators; + +import io.netty.buffer.ArrowBuf; + public abstract class BaseValueVector implements ValueVector { private static final Logger logger = LoggerFactory.getLogger(BaseValueVector.class); - public static final int MAX_ALLOCATION_SIZE = Integer.MAX_VALUE; + public static final String MAX_ALLOCATION_SIZE_PROPERTY = "arrow.vector.max_allocation_bytes"; + public static final int MAX_ALLOCATION_SIZE = Integer.getInteger(MAX_ALLOCATION_SIZE_PROPERTY, Integer.MAX_VALUE); public static final int INITIAL_VALUE_ALLOCATION = 4096; protected final BufferAllocator allocator; @@ -99,6 +100,7 @@ protected BaseMutator() { } public void generateTestData(int values) {} //TODO: consider making mutator stateless(if possible) on another issue. 
+ @Override public void reset() {} } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java index ac3eebe98eab7..b5c4509c8b540 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java @@ -23,16 +23,12 @@ import java.nio.charset.Charset; +import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; import org.apache.arrow.vector.complex.ListVector; import org.apache.arrow.vector.complex.MapVector; import org.apache.arrow.vector.complex.RepeatedListVector; import org.apache.arrow.vector.complex.RepeatedMapVector; -import org.apache.arrow.vector.types.MaterializedField; -import org.apache.arrow.vector.types.Types; -import org.apache.arrow.vector.types.Types.MinorType; -import org.apache.arrow.vector.util.BasicTypeHelper; -import org.apache.arrow.vector.util.OversizedAllocationException; import org.apache.arrow.vector.holders.BitHolder; import org.apache.arrow.vector.holders.IntHolder; import org.apache.arrow.vector.holders.NullableFloat4Holder; @@ -44,10 +40,16 @@ import org.apache.arrow.vector.holders.RepeatedVarBinaryHolder; import org.apache.arrow.vector.holders.UInt4Holder; import org.apache.arrow.vector.holders.VarCharHolder; -import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.types.MaterializedField; +import org.apache.arrow.vector.types.Types; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.util.BasicTypeHelper; +import org.apache.arrow.vector.util.OversizedAllocationException; import org.junit.After; import org.junit.Before; +import org.junit.Rule; import org.junit.Test; +import org.junit.rules.ExternalResource; public class TestValueVector { @@ -57,6 +59,28 @@ public class TestValueVector { private BufferAllocator allocator; + // Rule to adjust MAX_ALLOCATION_SIZE and restore it back after the tests + @Rule + public final ExternalResource rule = new ExternalResource() { + private final String systemValue = System.getProperty(BaseValueVector.MAX_ALLOCATION_SIZE_PROPERTY); + private final String testValue = Long.toString(32*1024*1024); + + @Override + protected void before() throws Throwable { + System.setProperty(BaseValueVector.MAX_ALLOCATION_SIZE_PROPERTY, testValue); + } + + @Override + protected void after() { + if (systemValue != null) { + System.setProperty(BaseValueVector.MAX_ALLOCATION_SIZE_PROPERTY, systemValue); + } + else { + System.clearProperty(BaseValueVector.MAX_ALLOCATION_SIZE_PROPERTY); + } + } + }; + @Before public void init() { allocator = new RootAllocator(Long.MAX_VALUE); From e316b3f765167fa1f45197061624e73332b095f4 Mon Sep 17 00:00:00 2001 From: Laurent Goujon Date: Fri, 15 Apr 2016 14:00:19 -0700 Subject: [PATCH 0077/1644] Fix BaseAllocator.java NPE when assertions are disabled This closes #64 When verifying memory using verifyAllocator() method, BaseAllocator throws NPE if assertions are disabled. 
Fix this issue by first checking whether assertions are disabled.
--- .../apache/arrow/memory/BaseAllocator.java | 20 +++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-)
diff --git a/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java index 90257bb9ffbf7..f1503c902d0be 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java
@@ -99,6 +99,7 @@ protected BaseAllocator( }
+ @Override public void assertOpen() { if (AssertionUtil.ASSERT_ENABLED) { if (isClosed) {
@@ -287,6 +288,7 @@ public Reservation() { } }
+ @Override public boolean add(final int nBytes) { assertOpen();
@@ -308,6 +310,7 @@ public boolean add(final int nBytes) { return true; }
+ @Override public ArrowBuf allocateBuffer() { assertOpen();
@@ -319,14 +322,17 @@ public ArrowBuf allocateBuffer() { return arrowBuf; }
+ @Override public int getSize() { return nBytes; }
+ @Override public boolean isUsed() { return used; }
+ @Override public boolean isClosed() { return closed; }
@@ -364,6 +370,7 @@ public void close() { closed = true; }
+ @Override public boolean reserve(int nBytes) { assertOpen();
@@ -509,6 +516,7 @@ public synchronized void close() { }
+ @Override public String toString() { final Verbosity verbosity = logger.isTraceEnabled() ? Verbosity.LOG_WITH_STACKTRACE : Verbosity.BASIC;
@@ -523,6 +531,7 @@ public String toString() { * * @return A Verbose string of current allocator state. */
+ @Override public String toVerboseString() { final StringBuilder sb = new StringBuilder(); print(sb, 0, Verbosity.LOG_WITH_STACKTRACE);
@@ -575,13 +584,12 @@ void verifyAllocator() { * when any problems are found */ private void verifyAllocator(final IdentityHashMap buffersSeen) {
- synchronized (DEBUG_LOCK) {
-
- // The remaining tests can only be performed if we're in debug mode.
- if (!DEBUG) {
- return;
- }
+ // The remaining tests can only be performed if we're in debug mode.
+ if (!DEBUG) {
+ return;
+ }
+ synchronized (DEBUG_LOCK) { final long allocated = getAllocatedMemory(); // verify my direct descendants
From 703546787e049f1abbc96082f60fe4d08731a5ce Mon Sep 17 00:00:00 2001
From: Laurent Goujon
Date: Wed, 13 Apr 2016 22:36:38 -0700
Subject: [PATCH 0078/1644] Add java support to Travis CI

Add Java support to Travis CI using Oracle JDK7 on a Linux host.
--- .travis.yml | 6 +++++- ci/travis_script_java.sh | 11 +++++++++++ java/pom.xml | 2 +- 3 files changed, 17 insertions(+), 2 deletions(-) create mode 100755 ci/travis_script_java.sh
diff --git a/.travis.yml b/.travis.yml index 646f80fee7b3b..7c4183700ca10 100644 --- a/.travis.yml +++ b/.travis.yml
@@ -12,7 +12,6 @@ addons: - gcc-4.9 # Needed for C++11 - g++-4.9 # Needed for C++11 - gdb
- - gcov
- ccache - cmake - valgrind
@@ -60,6 +59,11 @@ matrix: before_install: script: - $TRAVIS_BUILD_DIR/ci/travis_conda_build.sh
+ - language: java
+ os: linux
+ jdk: oraclejdk7
+ script:
+ - $TRAVIS_BUILD_DIR/ci/travis_script_java.sh
before_install: - ulimit -c unlimited -S
diff --git a/ci/travis_script_java.sh b/ci/travis_script_java.sh new file mode 100755 index 0000000000000..2d11eaeb4c57d --- /dev/null +++ b/ci/travis_script_java.sh @@ -0,0 +1,11 @@
+#!/usr/bin/env bash
+
+set -e
+
+JAVA_DIR=${TRAVIS_BUILD_DIR}/java
+
+pushd $JAVA_DIR
+
+mvn -B test
+
+popd
diff --git a/java/pom.xml b/java/pom.xml index 4ee4ff4f7604e..ea42894fda22e 100644 --- a/java/pom.xml +++ b/java/pom.xml
@@ -297,7 +297,7 @@ maven-surefire-plugin 2.17
- -ea
+ true
${forkCount} true
From cd1d770ede57f08b8be2f2b42f2f629eb5106098 Mon Sep 17 00:00:00 2001
From: Micah Kornfield
Date: Mon, 23 May 2016 13:55:51 -0700
Subject: [PATCH 0079/1644] ARROW-206: Expose a C++ API to compare ranges of slots between two arrays

@wesm the need for this grew out of @fengguangyuan's PR to add struct type (#66) and struct builder. I considered a few different APIs before settling on this one:

1. Add an API that took the parent bitmask (this potentially has the possibility of being the most performant, but would have a more awkward contract than the one provided here).
2. Add an equality comparison for a single slot (this leaves the least amount of room for optimization, but it would be the simplest to implement).
3. This API, which potentially leaves some room for optimization but, I think, places the fewest requirements on the caller.

Let me know if you would prefer a different API. WIP because I need to add more unit tests (I also need to think about whether it is worth mirroring EqualsExact in addition to the Equals method), which I should get to by the end of the weekend.
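A minimal usage sketch of option 3, the API adopted below (hedged: the helper function and the index values are illustrative; only the `RangeEquals` signature comes from the `array.h` change in this patch):

    #include <memory>
    #include "arrow/array.h"

    // Compare slots [4, 8) of `left` against the slots of `right` starting
    // at index 4. The end index is exclusive, and RangeEquals does not
    // bounds-check, so the caller must ensure both ranges are valid.
    bool SlotsMatch(const std::shared_ptr<arrow::Array>& left,
                    const std::shared_ptr<arrow::Array>& right) {
      return left->RangeEquals(4, 8, 4, right);
    }

This mirrors how the new unit tests exercise the API, e.g. `array->RangeEquals(4, 8, 4, unequal_array)` in array-test.cc.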
@fengguangyuan let me know if this makes sense to you as a way forward on your PR Author: Micah Kornfield Closes #80 from emkornfield/emk_add_equality and squashes the following commits: d5ae777 [Micah Kornfield] remove todo, its handled by type_traits f963639 [Micah Kornfield] add in check for null arrays f5c6bd5 [Micah Kornfield] make format/lint check dcbaad4 [Micah Kornfield] unittests passing 318855d [Micah Kornfield] working primitive tests dadb244 [Micah Kornfield] wip expose range equality to to allow for nested comparisons --- cpp/src/arrow/array-test.cc | 29 +++++++++++++++ cpp/src/arrow/array.cc | 7 ++++ cpp/src/arrow/array.h | 9 ++++- cpp/src/arrow/types/list-test.cc | 36 +++++++++++++++++++ cpp/src/arrow/types/list.cc | 26 ++++++++++++++ cpp/src/arrow/types/list.h | 3 ++ cpp/src/arrow/types/primitive-test.cc | 51 +++++++++++++++++++++++++++ cpp/src/arrow/types/primitive.cc | 17 ++++++++- cpp/src/arrow/types/primitive.h | 35 ++++++++++++++---- 9 files changed, 205 insertions(+), 8 deletions(-) diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc index b4c727997ee7e..3b4736327b47c 100644 --- a/cpp/src/arrow/array-test.cc +++ b/cpp/src/arrow/array-test.cc @@ -56,6 +56,35 @@ TEST_F(TestArray, TestLength) { ASSERT_EQ(arr->length(), 100); } +ArrayPtr MakeArrayFromValidBytes(const std::vector& v, MemoryPool* pool) { + int32_t null_count = v.size() - std::accumulate(v.begin(), v.end(), 0); + std::shared_ptr null_buf = test::bytes_to_null_buffer(v); + + BufferBuilder value_builder(pool); + for (size_t i = 0; i < v.size(); ++i) { + value_builder.Append(0); + } + + ArrayPtr arr(new Int32Array(v.size(), value_builder.Finish(), null_count, null_buf)); + return arr; +} + +TEST_F(TestArray, TestEquality) { + auto array = MakeArrayFromValidBytes({1, 0, 1, 1, 0, 1, 0, 0}, pool_); + auto equal_array = MakeArrayFromValidBytes({1, 0, 1, 1, 0, 1, 0, 0}, pool_); + auto unequal_array = MakeArrayFromValidBytes({1, 1, 1, 1, 0, 1, 0, 0}, pool_); + + EXPECT_TRUE(array->Equals(array)); + EXPECT_TRUE(array->Equals(equal_array)); + EXPECT_TRUE(equal_array->Equals(array)); + EXPECT_FALSE(equal_array->Equals(unequal_array)); + EXPECT_FALSE(unequal_array->Equals(equal_array)); + EXPECT_TRUE(array->RangeEquals(4, 8, 4, unequal_array)); + EXPECT_FALSE(array->RangeEquals(0, 4, 0, unequal_array)); + EXPECT_FALSE(array->RangeEquals(0, 8, 0, unequal_array)); + EXPECT_FALSE(array->RangeEquals(1, 2, 1, unequal_array)); +} + TEST_F(TestArray, TestIsNull) { // clang-format off std::vector null_bitmap = {1, 0, 1, 1, 0, 1, 0, 0, diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index c6b9b1599cdd2..d6b081f315532 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -58,4 +58,11 @@ bool NullArray::Equals(const std::shared_ptr& arr) const { return arr->length() == length_; } +bool NullArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_index, + const std::shared_ptr& arr) const { + if (!arr) { return false; } + if (Type::NA != arr->type_enum()) { return false; } + return true; +} + } // namespace arrow diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index f98c4c28310f8..76dc0f598141f 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -59,6 +59,12 @@ class Array { bool EqualsExact(const Array& arr) const; virtual bool Equals(const std::shared_ptr& arr) const = 0; + + // Compare if the range of slots specified are equal for the given array and + // this array. end_idx exclusive. This methods does not bounds check. 
+ virtual bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, + const std::shared_ptr& arr) const = 0; + // Determines if the array is internally consistent. Defaults to always // returning Status::OK. This can be an expensive check. virtual Status Validate() const; @@ -85,10 +91,11 @@ class NullArray : public Array { explicit NullArray(int32_t length) : NullArray(std::make_shared(), length) {} bool Equals(const std::shared_ptr& arr) const override; + bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_index, + const std::shared_ptr& arr) const override; }; typedef std::shared_ptr ArrayPtr; - } // namespace arrow #endif diff --git a/cpp/src/arrow/types/list-test.cc b/cpp/src/arrow/types/list-test.cc index 6a8ad9aa59ead..2e41b4a61caf2 100644 --- a/cpp/src/arrow/types/list-test.cc +++ b/cpp/src/arrow/types/list-test.cc @@ -86,6 +86,42 @@ class TestListBuilder : public TestBuilder { shared_ptr result_; }; +TEST_F(TestListBuilder, Equality) { + Int32Builder* vb = static_cast(builder_->value_builder().get()); + + ArrayPtr array, equal_array, unequal_array; + vector equal_offsets = {0, 1, 2, 5}; + vector equal_values = {1, 2, 3, 4, 5, 2, 2, 2}; + vector unequal_offsets = {0, 1, 4}; + vector unequal_values = {1, 2, 2, 2, 3, 4, 5}; + + // setup two equal arrays + ASSERT_OK(builder_->Append(equal_offsets.data(), equal_offsets.size())); + ASSERT_OK(vb->Append(equal_values.data(), equal_values.size())); + array = builder_->Finish(); + ASSERT_OK(builder_->Append(equal_offsets.data(), equal_offsets.size())); + ASSERT_OK(vb->Append(equal_values.data(), equal_values.size())); + equal_array = builder_->Finish(); + // now an unequal one + ASSERT_OK(builder_->Append(unequal_offsets.data(), unequal_offsets.size())); + ASSERT_OK(vb->Append(unequal_values.data(), unequal_values.size())); + unequal_array = builder_->Finish(); + + // Test array equality + EXPECT_TRUE(array->Equals(array)); + EXPECT_TRUE(array->Equals(equal_array)); + EXPECT_TRUE(equal_array->Equals(array)); + EXPECT_FALSE(equal_array->Equals(unequal_array)); + EXPECT_FALSE(unequal_array->Equals(equal_array)); + + // Test range equality + EXPECT_TRUE(array->RangeEquals(0, 1, 0, unequal_array)); + EXPECT_FALSE(array->RangeEquals(0, 2, 0, unequal_array)); + EXPECT_FALSE(array->RangeEquals(1, 2, 1, unequal_array)); + EXPECT_TRUE(array->RangeEquals(2, 3, 2, unequal_array)); + EXPECT_TRUE(array->RangeEquals(3, 4, 1, unequal_array)); +} + TEST_F(TestListBuilder, TestResize) {} TEST_F(TestListBuilder, TestAppendNull) { diff --git a/cpp/src/arrow/types/list.cc b/cpp/src/arrow/types/list.cc index 76e7fe5f4d429..6334054caf84a 100644 --- a/cpp/src/arrow/types/list.cc +++ b/cpp/src/arrow/types/list.cc @@ -44,6 +44,32 @@ bool ListArray::Equals(const std::shared_ptr& arr) const { return EqualsExact(*static_cast(arr.get())); } +bool ListArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, + const std::shared_ptr& arr) const { + if (this == arr.get()) { return true; } + if (!arr) { return false; } + if (this->type_enum() != arr->type_enum()) { return false; } + const auto other = static_cast(arr.get()); + for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { + const bool is_null = IsNull(i); + if (is_null != arr->IsNull(o_i)) { return false; } + if (is_null) continue; + const int32_t begin_offset = offset(i); + const int32_t end_offset = offset(i + 1); + const int32_t other_begin_offset = other->offset(o_i); + const int32_t other_end_offset = other->offset(o_i + 1); 
+ // Underlying can't be equal if the size isn't equal + if (end_offset - begin_offset != other_end_offset - other_begin_offset) { + return false; + } + if (!values_->RangeEquals( + begin_offset, end_offset, other_begin_offset, other->values())) { + return false; + } + } + return true; +} + Status ListArray::Validate() const { if (length_ < 0) { return Status::Invalid("Length was negative"); } if (!offset_buf_) { return Status::Invalid("offset_buf_ was null"); } diff --git a/cpp/src/arrow/types/list.h b/cpp/src/arrow/types/list.h index a020b8ad22668..0a3941633eb83 100644 --- a/cpp/src/arrow/types/list.h +++ b/cpp/src/arrow/types/list.h @@ -72,6 +72,9 @@ class ListArray : public Array { bool EqualsExact(const ListArray& other) const; bool Equals(const std::shared_ptr& arr) const override; + bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, + const ArrayPtr& arr) const override; + protected: std::shared_ptr offset_buf_; const int32_t* offsets_; diff --git a/cpp/src/arrow/types/primitive-test.cc b/cpp/src/arrow/types/primitive-test.cc index 2b4c0879a28f4..87eb0fe3a8bf7 100644 --- a/cpp/src/arrow/types/primitive-test.cc +++ b/cpp/src/arrow/types/primitive-test.cc @@ -304,6 +304,57 @@ TYPED_TEST(TestPrimitiveBuilder, TestArrayDtorDealloc) { ASSERT_EQ(memory_before, this->pool_->bytes_allocated()); } +template +Status MakeArray(const vector& valid_bytes, const vector& draws, int size, + Builder* builder, ArrayPtr* out) { + // Append the first 1000 + for (int i = 0; i < size; ++i) { + if (valid_bytes[i] > 0) { + RETURN_NOT_OK(builder->Append(draws[i])); + } else { + RETURN_NOT_OK(builder->AppendNull()); + } + } + *out = builder->Finish(); + return Status::OK(); +} + +TYPED_TEST(TestPrimitiveBuilder, Equality) { + DECL_T(); + + const int size = 1000; + this->RandomData(size); + vector& draws = this->draws_; + vector& valid_bytes = this->valid_bytes_; + ArrayPtr array, equal_array, unequal_array; + auto builder = this->builder_.get(); + ASSERT_OK(MakeArray(valid_bytes, draws, size, builder, &array)); + ASSERT_OK(MakeArray(valid_bytes, draws, size, builder, &equal_array)); + + // Make the not equal array by negating the first valid element with itself. 
+ const auto first_valid = std::find_if( + valid_bytes.begin(), valid_bytes.end(), [](uint8_t valid) { return valid > 0; }); + const int first_valid_idx = std::distance(valid_bytes.begin(), first_valid); + // This should be true with a very high probability, but might introduce flakiness + ASSERT_LT(first_valid_idx, size - 1); + draws[first_valid_idx] = ~*reinterpret_cast(&draws[first_valid_idx]); + ASSERT_OK(MakeArray(valid_bytes, draws, size, builder, &unequal_array)); + + // test normal equality + EXPECT_TRUE(array->Equals(array)); + EXPECT_TRUE(array->Equals(equal_array)); + EXPECT_TRUE(equal_array->Equals(array)); + EXPECT_FALSE(equal_array->Equals(unequal_array)); + EXPECT_FALSE(unequal_array->Equals(equal_array)); + + // Test range equality + EXPECT_FALSE(array->RangeEquals(0, first_valid_idx + 1, 0, unequal_array)); + EXPECT_FALSE(array->RangeEquals(first_valid_idx, size, first_valid_idx, unequal_array)); + EXPECT_TRUE(array->RangeEquals(0, first_valid_idx, 0, unequal_array)); + EXPECT_TRUE( + array->RangeEquals(first_valid_idx + 1, size, first_valid_idx + 1, unequal_array)); +} + TYPED_TEST(TestPrimitiveBuilder, TestAppendScalar) { DECL_T(); diff --git a/cpp/src/arrow/types/primitive.cc b/cpp/src/arrow/types/primitive.cc index 57a3f1e4e150b..8e6c0f809ca44 100644 --- a/cpp/src/arrow/types/primitive.cc +++ b/cpp/src/arrow/types/primitive.cc @@ -185,10 +185,25 @@ bool BooleanArray::EqualsExact(const BooleanArray& other) const { } } -bool BooleanArray::Equals(const std::shared_ptr& arr) const { +bool BooleanArray::Equals(const ArrayPtr& arr) const { if (this == arr.get()) return true; if (Type::BOOL != arr->type_enum()) { return false; } return EqualsExact(*static_cast(arr.get())); } +bool BooleanArray::RangeEquals(int32_t start_idx, int32_t end_idx, + int32_t other_start_idx, const ArrayPtr& arr) const { + if (this == arr.get()) { return true; } + if (!arr) { return false; } + if (this->type_enum() != arr->type_enum()) { return false; } + const auto other = static_cast(arr.get()); + for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { + const bool is_null = IsNull(i); + if (is_null != arr->IsNull(o_i) || (!is_null && Value(i) != other->Value(o_i))) { + return false; + } + } + return true; +} + } // namespace arrow diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index fc45f6c5b0568..9597fc8363138 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -66,6 +66,22 @@ class PrimitiveArray : public Array { return PrimitiveArray::EqualsExact(*static_cast(&other)); \ } \ \ + bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, \ + const ArrayPtr& arr) const override { \ + if (this == arr.get()) { return true; } \ + if (!arr) { return false; } \ + if (this->type_enum() != arr->type_enum()) { return false; } \ + const auto other = static_cast(arr.get()); \ + for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { \ + const bool is_null = IsNull(i); \ + if (is_null != arr->IsNull(o_i) || \ + (!is_null && Value(i) != other->Value(o_i))) { \ + return false; \ + } \ + } \ + return true; \ + } \ + \ const T* raw_data() const { return reinterpret_cast(raw_data_); } \ \ T Value(int i) const { return raw_data()[i]; } \ @@ -95,8 +111,10 @@ class PrimitiveBuilder : public ArrayBuilder { using ArrayBuilder::Advance; // Write nulls as uint8_t* (0 value indicates null) into pre-allocated memory - void AppendNulls(const uint8_t* valid_bytes, int32_t length) { + Status 
AppendNulls(const uint8_t* valid_bytes, int32_t length) {
+ RETURN_NOT_OK(Reserve(length));
UnsafeAppendToBitmap(valid_bytes, length);
+ return Status::OK();
}
Status AppendNull() {
@@ -139,9 +157,10 @@ class NumericBuilder : public PrimitiveBuilder { using PrimitiveBuilder::Reserve; // Scalar append.
- void Append(value_type val) {
- ArrayBuilder::Reserve(1);
+ Status Append(value_type val) {
+ RETURN_NOT_OK(ArrayBuilder::Reserve(1));
UnsafeAppend(val);
+ return Status::OK();
}
// Does not capacity-check; make sure to call Reserve beforehand
@@ -248,7 +267,9 @@ class BooleanArray : public PrimitiveArray { int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); bool EqualsExact(const BooleanArray& other) const;
- bool Equals(const std::shared_ptr& arr) const override;
+ bool Equals(const ArrayPtr& arr) const override;
+ bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx,
+ const ArrayPtr& arr) const override;
const uint8_t* raw_data() const { return reinterpret_cast(raw_data_); }
@@ -274,7 +295,8 @@ class BooleanBuilder : public PrimitiveBuilder { using PrimitiveBuilder::Append; // Scalar append
- void Append(bool val) {
+ Status Append(bool val) {
+ Reserve(1);
util::set_bit(null_bitmap_data_, length_); if (val) { util::set_bit(raw_data_, length_);
@@ -282,9 +304,10 @@ class BooleanBuilder : public PrimitiveBuilder { util::clear_bit(raw_data_, length_); } ++length_;
+ return Status::OK();
}
- void Append(uint8_t val) { Append(static_cast<bool>(val)); }
+ Status Append(uint8_t val) { return Append(static_cast<bool>(val)); }
};
} // namespace arrow
From c8b8078810be1d703c0261859b0862d574384600 Mon Sep 17 00:00:00 2001
From: Edmon Begoli
Date: Sat, 28 May 2016 19:11:47 -0400
Subject: [PATCH 0080/1644] [Doc] Update Layout.md

For clarity, added references to official SIMD documentation, the description of Endianness, and Parquet. Used Markdown syntax for the exponent to document the size of the arrays. Closes PR #82.
--- format/Layout.md | 18 +++++++++++------- 1 file changed, 11 insertions(+), 7 deletions(-)
diff --git a/format/Layout.md b/format/Layout.md index 34eade313415a..9de0479738ac5 100644 --- a/format/Layout.md +++ b/format/Layout.md
@@ -41,7 +41,7 @@ Base requirements proprietary systems that utilize the open source components. * All array slots are accessible in constant time, with complexity growing linearly in the nesting level
-* Capable of representing fully-materialized and decoded / decompressed Parquet
+* Capable of representing fully-materialized and decoded / decompressed [Parquet][5]
data * All contiguous memory buffers are aligned at 64-byte boundaries and padded to a multiple of 64 bytes. * Any relative type can have null slots
@@ -76,7 +76,7 @@ Base requirements * Any memory management or reference counting subsystem * To enumerate or specify types of encodings or compression support
-## Byte Order (Endianness)
+## Byte Order ([Endianness][3])
The Arrow format is little endian.
@@ -91,7 +91,7 @@ requirement follows best practices for optimized memory access: * 64 byte alignment is recommended by the [Intel performance guide][2] for data-structures over 64 bytes (which will be a common case for Arrow Arrays).
-Requiring padding to a multiple of 64 bytes allows for using SIMD instructions
+Requiring padding to a multiple of 64 bytes allows for using [SIMD][4] instructions
consistently in loops without additional conditional checks. This should allow for simpler and more efficient code.
The specific padding length was chosen because it matches the largest known
@@ -105,13 +105,13 @@ Unless otherwise noted, padded bytes do not need to have a specific value. ## Array lengths Any array has a known and fixed length, stored as a 32-bit signed integer, so a
-maximum of 2^31 - 1 elements. We choose a signed int32 for a couple reasons:
+maximum of 2<sup>31</sup> - 1 elements. We choose a signed int32 for a couple reasons:
* Enhance compatibility with Java and client languages which may have varying quality of support for unsigned integers. * To encourage developers to compose smaller arrays (each of which contains contiguous memory in its leaf nodes) to create larger array structures
- possibly exceeding 2^31 - 1 elements, as opposed to allocating very large
+ possibly exceeding 2<sup>31</sup> - 1 elements, as opposed to allocating very large
contiguous memory blocks. ## Null count
@@ -238,7 +238,7 @@ A list-array is represented by the combination of the following: * A values array, a child array of type T. T may also be a nested type. * An offsets buffer containing 32-bit signed integers with length equal to the length of the top-level array plus one. Note that this limits the size of the
- values array to 2^31 -1.
+ values array to 2<sup>31</sup>-1.
The offsets array encodes a start position in the values array, and the length of the value in each slot is computed using the first difference with the next
@@ -578,7 +578,11 @@ the the types array indicates that a slot contains a different type at the index ## References
-Drill docs https://drill.apache.org/docs/value-vectors/
+Apache Drill Documentation - [Value Vectors][6]
[1]: https://en.wikipedia.org/wiki/Bit_numbering [2]: https://software.intel.com/en-us/articles/practical-intel-avx-optimization-on-2nd-generation-intel-core-processors
+[3]: https://en.wikipedia.org/wiki/Endianness
+[4]: https://software.intel.com/en-us/node/600110
+[5]: https://parquet.apache.org/documentation/latest/
+[6]: https://drill.apache.org/docs/value-vectors/
From 65740950c852b82c475ca84e970e147d25d27398 Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Thu, 2 Jun 2016 18:36:43 -0700
Subject: [PATCH 0081/1644] ARROW-209: [C++] Triage builds due to unavailable LLVM apt repo

For now, this unblocks builds until we can resolve the LLVM apt issue.
Author: Wes McKinney

Closes #84 from wesm/ARROW-209 and squashes the following commits:

c6bf166 [Wes McKinney] Remove clang-* packages from apt list 30d8c5c [Wes McKinney] Temporarily disable clang-format and clang-tidy checks in Travis CI build
--- .travis.yml | 3 --- ci/travis_script_cpp.sh | 12 ++++++++---- 2 files changed, 8 insertions(+), 7 deletions(-)
diff --git a/.travis.yml b/.travis.yml index 7c4183700ca10..ac2b0d457cb8e 100644 --- a/.travis.yml +++ b/.travis.yml
@@ -5,10 +5,7 @@ addons: sources: - ubuntu-toolchain-r-test - kalakris-cmake
- - llvm-toolchain-precise-3.7
packages:
- - clang-format-3.7
- - clang-tidy-3.7
- gcc-4.9 # Needed for C++11 - g++-4.9 # Needed for C++11 - gdb
diff --git a/ci/travis_script_cpp.sh b/ci/travis_script_cpp.sh index c9b3b5f1442a1..9cf4f8e352109 100755 --- a/ci/travis_script_cpp.sh +++ b/ci/travis_script_cpp.sh
@@ -7,10 +7,14 @@ set -e pushd $CPP_BUILD_DIR make lint
-
-if [ $TRAVIS_OS_NAME == "linux" ]; then
- make check-format
- make check-clang-tidy
-fi
+
+# ARROW-209: checks depending on the LLVM toolchain are disabled temporarily
+# until we are able to install the full LLVM toolchain in Travis CI again
+
+# if [ $TRAVIS_OS_NAME == "linux" ]; then
+# make check-format
+# make check-clang-tidy
+# fi
ctest -L unittest
From ce2fe7a782c9c1f84a6ccdc2b7b00768d535d8fc Mon Sep 17 00:00:00 2001
From: Smyatkin Maxim
Date: Mon, 6 Jun 2016 23:25:31 -0700
Subject: [PATCH 0082/1644] ARROW-211: [Format] Fixed typos in layout examples

Just a few typo fixes according to the ticket.

Author: Smyatkin Maxim

Closes #86 from Smyatkin-Maxim/ARROW-211 and squashes the following commits:

6cefba6 [Smyatkin Maxim] ARROW-211: [Format] Fixed typos in layout examples
--- format/Layout.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/format/Layout.md b/format/Layout.md index 9de0479738ac5..815c47f2c934b 100644 --- a/format/Layout.md +++ b/format/Layout.md
@@ -299,7 +299,7 @@ will be be represented as follows: | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 | |------------|------------|------------|-------------|-------------|
- | 0 | 2 | 6 | 7 | unspecified |
+ | 0 | 2 | 5 | 6 | unspecified |
* Values array (`List`) * Length: 6, Null count: 1
@@ -368,7 +368,7 @@ The layout for [{'joe', 1}, {null, 2}, null, {'mark', 4}] would be: | Byte 0 (validity bitmap) | Bytes 1-7 | |--------------------------|-----------------------|
- | 00011101 | 0 (padding) |
+ | 00001101 | 0 (padding) |
* Offsets buffer:
@@ -472,7 +472,7 @@ An example layout for logical union of: | 1.2, 3.4 | unspecified |
- * Field-1 array (f: float):
+ * Field-1 array (i: int32):
* Length: 1, nulls: 0 * Null bitmap buffer: Not required
@@ -499,7 +499,7 @@ union, it has some advantages that may be desirable in certain use cases: For the union array:
-[{u0=5}, {u1=1.2}, {u2='joe'}, {u1=3.4}, {u0=4}, 'mark']
+[{u0=5}, {u1=1.2}, {u2='joe'}, {u1=3.4}, {u0=4}, {u2='mark'}]
will have the following layout: ```
From 9ce13a06726874c04433100127f74e6ea4afa855 Mon Sep 17 00:00:00 2001
From: fengguangyuan
Date: Mon, 6 Jun 2016 23:32:38 -0700
Subject: [PATCH 0083/1644] ARROW-60: [C++] Struct type builder API

Implement the basic classes, `StructArray` and `StructBuilder`; meanwhile, add the respective test cases for them. Other necessary methods will be added subsequently.

Author: fengguangyuan

Closes #66 from fengguangyuan/ARROW-60 and squashes the following commits:

190967f [fengguangyuan] ARROW-60: [C++] Struct type builder API Add field index and TODO comment.
ae74c80 [fengguangyuan] ARROW-60: Struct type builder API Add RangeEquals method to implement Equals method. fa856fd [fengguangyuan] ARROW-60:[C++] Struct typebuilder API Modify Validate() refered to the specification. bfabdc1 [fengguangyuan] ARROW-60: Struct type builder API Refine the previous committed patch. Add validate methods for testing StructArray and StructBuilder. TODO, Equals methods also need to be tested, but now it's not convient to do it. 5733de7 [fengguangyuan] ARROW-60: Struct type builder API --- cpp/src/arrow/type.h | 1 + cpp/src/arrow/types/construct.cc | 15 ++ cpp/src/arrow/types/construct.h | 3 +- cpp/src/arrow/types/struct-test.cc | 332 +++++++++++++++++++++++++++++ cpp/src/arrow/types/struct.cc | 72 ++++++- cpp/src/arrow/types/struct.h | 97 ++++++++- 6 files changed, 517 insertions(+), 3 deletions(-) diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 77404cd702524..f366645cd5cf9 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -161,6 +161,7 @@ struct Field { std::string ToString() const; }; +typedef std::shared_ptr FieldPtr; template struct PrimitiveType : public DataType { diff --git a/cpp/src/arrow/types/construct.cc b/cpp/src/arrow/types/construct.cc index 78036d4bf5711..bcb0ec490901f 100644 --- a/cpp/src/arrow/types/construct.cc +++ b/cpp/src/arrow/types/construct.cc @@ -23,6 +23,7 @@ #include "arrow/types/list.h" #include "arrow/types/primitive.h" #include "arrow/types/string.h" +#include "arrow/types/struct.h" #include "arrow/util/buffer.h" #include "arrow/util/status.h" @@ -66,6 +67,20 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, out->reset(new ListBuilder(pool, value_builder)); return Status::OK(); } + + case Type::STRUCT: { + std::vector& fields = type->children_; + std::vector> values_builder; + + for (auto it : fields) { + std::shared_ptr builder; + RETURN_NOT_OK(MakeBuilder(pool, it->type, &builder)); + values_builder.push_back(builder); + } + out->reset(new StructBuilder(pool, type, values_builder)); + return Status::OK(); + } + default: return Status::NotImplemented(type->ToString()); } diff --git a/cpp/src/arrow/types/construct.h b/cpp/src/arrow/types/construct.h index 43c0018c67e41..d0370840ca108 100644 --- a/cpp/src/arrow/types/construct.h +++ b/cpp/src/arrow/types/construct.h @@ -20,13 +20,14 @@ #include #include - +#include namespace arrow { class Array; class ArrayBuilder; class Buffer; struct DataType; +struct Field; class MemoryPool; class Status; diff --git a/cpp/src/arrow/types/struct-test.cc b/cpp/src/arrow/types/struct-test.cc index 79d560e19bcc0..d2bd2971d0438 100644 --- a/cpp/src/arrow/types/struct-test.cc +++ b/cpp/src/arrow/types/struct-test.cc @@ -21,7 +21,16 @@ #include "gtest/gtest.h" +#include "arrow/array.h" +#include "arrow/builder.h" +#include "arrow/test-util.h" #include "arrow/type.h" +#include "arrow/types/construct.h" +#include "arrow/types/list.h" +#include "arrow/types/primitive.h" +#include "arrow/types/struct.h" +#include "arrow/types/test-common.h" +#include "arrow/util/status.h" using std::shared_ptr; using std::string; @@ -52,4 +61,327 @@ TEST(TestStructType, Basics) { // TODO(wesm): out of bounds for field(...) 
} +void ValidateBasicStructArray(const StructArray* result, + const vector& struct_is_valid, const vector& list_values, + const vector& list_is_valid, const vector& list_lengths, + const vector& list_offsets, const vector& int_values) { + ASSERT_EQ(4, result->length()); + ASSERT_OK(result->Validate()); + + auto list_char_arr = static_cast(result->field(0).get()); + auto char_arr = static_cast(list_char_arr->values().get()); + auto int32_arr = static_cast(result->field(1).get()); + + ASSERT_EQ(0, result->null_count()); + ASSERT_EQ(1, list_char_arr->null_count()); + ASSERT_EQ(0, int32_arr->null_count()); + + // List + ASSERT_EQ(4, list_char_arr->length()); + ASSERT_EQ(10, list_char_arr->values()->length()); + for (size_t i = 0; i < list_offsets.size(); ++i) { + ASSERT_EQ(list_offsets[i], list_char_arr->offsets()[i]); + } + for (size_t i = 0; i < list_values.size(); ++i) { + ASSERT_EQ(list_values[i], char_arr->Value(i)); + } + + // Int32 + ASSERT_EQ(4, int32_arr->length()); + for (size_t i = 0; i < int_values.size(); ++i) { + ASSERT_EQ(int_values[i], int32_arr->Value(i)); + } +} + +// ---------------------------------------------------------------------------------- +// Struct test +class TestStructBuilder : public TestBuilder { + public: + void SetUp() { + TestBuilder::SetUp(); + + auto int32_type = TypePtr(new Int32Type()); + auto char_type = TypePtr(new Int8Type()); + auto list_type = TypePtr(new ListType(char_type)); + + std::vector types = {list_type, int32_type}; + std::vector fields; + fields.push_back(FieldPtr(new Field("list", list_type))); + fields.push_back(FieldPtr(new Field("int", int32_type))); + + type_ = TypePtr(new StructType(fields)); + value_fields_ = fields; + + std::shared_ptr tmp; + ASSERT_OK(MakeBuilder(pool_, type_, &tmp)); + + builder_ = std::dynamic_pointer_cast(tmp); + ASSERT_EQ(2, builder_->field_builders().size()); + } + + void Done() { result_ = std::dynamic_pointer_cast(builder_->Finish()); } + + protected: + std::vector value_fields_; + TypePtr type_; + + std::shared_ptr builder_; + std::shared_ptr result_; +}; + +TEST_F(TestStructBuilder, TestAppendNull) { + ASSERT_OK(builder_->AppendNull()); + ASSERT_OK(builder_->AppendNull()); + ASSERT_EQ(2, builder_->field_builders().size()); + + ListBuilder* list_vb = static_cast(builder_->field_builder(0).get()); + ASSERT_OK(list_vb->AppendNull()); + ASSERT_OK(list_vb->AppendNull()); + ASSERT_EQ(2, list_vb->length()); + + Int32Builder* int_vb = static_cast(builder_->field_builder(1).get()); + ASSERT_OK(int_vb->AppendNull()); + ASSERT_OK(int_vb->AppendNull()); + ASSERT_EQ(2, int_vb->length()); + + Done(); + + ASSERT_OK(result_->Validate()); + + ASSERT_EQ(2, result_->fields().size()); + ASSERT_EQ(2, result_->length()); + ASSERT_EQ(2, result_->field(0)->length()); + ASSERT_EQ(2, result_->field(1)->length()); + ASSERT_TRUE(result_->IsNull(0)); + ASSERT_TRUE(result_->IsNull(1)); + ASSERT_TRUE(result_->field(0)->IsNull(0)); + ASSERT_TRUE(result_->field(0)->IsNull(1)); + ASSERT_TRUE(result_->field(1)->IsNull(0)); + ASSERT_TRUE(result_->field(1)->IsNull(1)); + + ASSERT_EQ(Type::LIST, result_->field(0)->type_enum()); + ASSERT_EQ(Type::INT32, result_->field(1)->type_enum()); +} + +TEST_F(TestStructBuilder, TestBasics) { + vector int_values = {1, 2, 3, 4}; + vector list_values = {'j', 'o', 'e', 'b', 'o', 'b', 'm', 'a', 'r', 'k'}; + vector list_lengths = {3, 0, 3, 4}; + vector list_offsets = {0, 3, 3, 6, 10}; + vector list_is_valid = {1, 0, 1, 1}; + vector struct_is_valid = {1, 1, 1, 1}; + + ListBuilder* list_vb = 
static_cast(builder_->field_builder(0).get()); + Int8Builder* char_vb = static_cast(list_vb->value_builder().get()); + Int32Builder* int_vb = static_cast(builder_->field_builder(1).get()); + ASSERT_EQ(2, builder_->field_builders().size()); + + EXPECT_OK(builder_->Resize(list_lengths.size())); + EXPECT_OK(char_vb->Resize(list_values.size())); + EXPECT_OK(int_vb->Resize(int_values.size())); + + int pos = 0; + for (size_t i = 0; i < list_lengths.size(); ++i) { + ASSERT_OK(list_vb->Append(list_is_valid[i] > 0)); + int_vb->UnsafeAppend(int_values[i]); + for (int j = 0; j < list_lengths[i]; ++j) { + char_vb->UnsafeAppend(list_values[pos++]); + } + } + + for (size_t i = 0; i < struct_is_valid.size(); ++i) { + ASSERT_OK(builder_->Append(struct_is_valid[i] > 0)); + } + + Done(); + + ValidateBasicStructArray(result_.get(), struct_is_valid, list_values, list_is_valid, + list_lengths, list_offsets, int_values); +} + +TEST_F(TestStructBuilder, BulkAppend) { + vector int_values = {1, 2, 3, 4}; + vector list_values = {'j', 'o', 'e', 'b', 'o', 'b', 'm', 'a', 'r', 'k'}; + vector list_lengths = {3, 0, 3, 4}; + vector list_offsets = {0, 3, 3, 6}; + vector list_is_valid = {1, 0, 1, 1}; + vector struct_is_valid = {1, 1, 1, 1}; + + ListBuilder* list_vb = static_cast(builder_->field_builder(0).get()); + Int8Builder* char_vb = static_cast(list_vb->value_builder().get()); + Int32Builder* int_vb = static_cast(builder_->field_builder(1).get()); + + ASSERT_OK(builder_->Resize(list_lengths.size())); + ASSERT_OK(char_vb->Resize(list_values.size())); + ASSERT_OK(int_vb->Resize(int_values.size())); + + builder_->Append(struct_is_valid.size(), struct_is_valid.data()); + + list_vb->Append(list_offsets.data(), list_offsets.size(), list_is_valid.data()); + for (int8_t value : list_values) { + char_vb->UnsafeAppend(value); + } + for (int32_t value : int_values) { + int_vb->UnsafeAppend(value); + } + + Done(); + ValidateBasicStructArray(result_.get(), struct_is_valid, list_values, list_is_valid, + list_lengths, list_offsets, int_values); +} + +TEST_F(TestStructBuilder, BulkAppendInvalid) { + vector int_values = {1, 2, 3, 4}; + vector list_values = {'j', 'o', 'e', 'b', 'o', 'b', 'm', 'a', 'r', 'k'}; + vector list_lengths = {3, 0, 3, 4}; + vector list_offsets = {0, 3, 3, 6}; + vector list_is_valid = {1, 0, 1, 1}; + vector struct_is_valid = {1, 0, 1, 1}; // should be 1, 1, 1, 1 + + ListBuilder* list_vb = static_cast(builder_->field_builder(0).get()); + Int8Builder* char_vb = static_cast(list_vb->value_builder().get()); + Int32Builder* int_vb = static_cast(builder_->field_builder(1).get()); + + ASSERT_OK(builder_->Reserve(list_lengths.size())); + ASSERT_OK(char_vb->Reserve(list_values.size())); + ASSERT_OK(int_vb->Reserve(int_values.size())); + + builder_->Append(struct_is_valid.size(), struct_is_valid.data()); + + list_vb->Append(list_offsets.data(), list_offsets.size(), list_is_valid.data()); + for (int8_t value : list_values) { + char_vb->UnsafeAppend(value); + } + for (int32_t value : int_values) { + int_vb->UnsafeAppend(value); + } + + Done(); + // Even null bitmap of the parent Struct is not valid, Validate() will ignore it. 
+ ASSERT_OK(result_->Validate()); +} + +TEST_F(TestStructBuilder, TestEquality) { + ArrayPtr array, equal_array; + ArrayPtr unequal_bitmap_array, unequal_offsets_array, unequal_values_array; + + vector int_values = {1, 2, 3, 4}; + vector list_values = {'j', 'o', 'e', 'b', 'o', 'b', 'm', 'a', 'r', 'k'}; + vector list_lengths = {3, 0, 3, 4}; + vector list_offsets = {0, 3, 3, 6}; + vector list_is_valid = {1, 0, 1, 1}; + vector struct_is_valid = {1, 1, 1, 1}; + + vector unequal_int_values = {4, 2, 3, 1}; + vector unequal_list_values = {'j', 'o', 'e', 'b', 'o', 'b', 'l', 'u', 'c', 'y'}; + vector unequal_list_offsets = {0, 3, 4, 6}; + vector unequal_list_is_valid = {1, 1, 1, 1}; + vector unequal_struct_is_valid = {1, 0, 0, 1}; + + ListBuilder* list_vb = static_cast(builder_->field_builder(0).get()); + Int8Builder* char_vb = static_cast(list_vb->value_builder().get()); + Int32Builder* int_vb = static_cast(builder_->field_builder(1).get()); + ASSERT_OK(builder_->Reserve(list_lengths.size())); + ASSERT_OK(char_vb->Reserve(list_values.size())); + ASSERT_OK(int_vb->Reserve(int_values.size())); + + // setup two equal arrays, one of which takes an unequal bitmap + builder_->Append(struct_is_valid.size(), struct_is_valid.data()); + list_vb->Append(list_offsets.data(), list_offsets.size(), list_is_valid.data()); + for (int8_t value : list_values) { + char_vb->UnsafeAppend(value); + } + for (int32_t value : int_values) { + int_vb->UnsafeAppend(value); + } + array = builder_->Finish(); + + ASSERT_OK(builder_->Resize(list_lengths.size())); + ASSERT_OK(char_vb->Resize(list_values.size())); + ASSERT_OK(int_vb->Resize(int_values.size())); + + builder_->Append(struct_is_valid.size(), struct_is_valid.data()); + list_vb->Append(list_offsets.data(), list_offsets.size(), list_is_valid.data()); + for (int8_t value : list_values) { + char_vb->UnsafeAppend(value); + } + for (int32_t value : int_values) { + int_vb->UnsafeAppend(value); + } + equal_array = builder_->Finish(); + + ASSERT_OK(builder_->Resize(list_lengths.size())); + ASSERT_OK(char_vb->Resize(list_values.size())); + ASSERT_OK(int_vb->Resize(int_values.size())); + + // setup an unequal one with the unequal bitmap + builder_->Append(unequal_struct_is_valid.size(), unequal_struct_is_valid.data()); + list_vb->Append(list_offsets.data(), list_offsets.size(), list_is_valid.data()); + for (int8_t value : list_values) { + char_vb->UnsafeAppend(value); + } + for (int32_t value : int_values) { + int_vb->UnsafeAppend(value); + } + unequal_bitmap_array = builder_->Finish(); + + ASSERT_OK(builder_->Resize(list_lengths.size())); + ASSERT_OK(char_vb->Resize(list_values.size())); + ASSERT_OK(int_vb->Resize(int_values.size())); + + // setup an unequal one with unequal offsets + builder_->Append(struct_is_valid.size(), struct_is_valid.data()); + list_vb->Append(unequal_list_offsets.data(), unequal_list_offsets.size(), + unequal_list_is_valid.data()); + for (int8_t value : list_values) { + char_vb->UnsafeAppend(value); + } + for (int32_t value : int_values) { + int_vb->UnsafeAppend(value); + } + unequal_offsets_array = builder_->Finish(); + + ASSERT_OK(builder_->Resize(list_lengths.size())); + ASSERT_OK(char_vb->Resize(list_values.size())); + ASSERT_OK(int_vb->Resize(int_values.size())); + + // setup anunequal one with unequal values + builder_->Append(struct_is_valid.size(), struct_is_valid.data()); + list_vb->Append(list_offsets.data(), list_offsets.size(), list_is_valid.data()); + for (int8_t value : unequal_list_values) { + char_vb->UnsafeAppend(value); + } + for 
(int32_t value : unequal_int_values) { + int_vb->UnsafeAppend(value); + } + unequal_values_array = builder_->Finish(); + + // Test array equality + EXPECT_TRUE(array->Equals(array)); + EXPECT_TRUE(array->Equals(equal_array)); + EXPECT_TRUE(equal_array->Equals(array)); + EXPECT_FALSE(equal_array->Equals(unequal_bitmap_array)); + EXPECT_FALSE(unequal_bitmap_array->Equals(equal_array)); + EXPECT_FALSE(unequal_bitmap_array->Equals(unequal_values_array)); + EXPECT_FALSE(unequal_values_array->Equals(unequal_bitmap_array)); + EXPECT_FALSE(unequal_bitmap_array->Equals(unequal_offsets_array)); + EXPECT_FALSE(unequal_offsets_array->Equals(unequal_bitmap_array)); + + // Test range equality + EXPECT_TRUE(array->RangeEquals(0, 4, 0, equal_array)); + EXPECT_TRUE(array->RangeEquals(3, 4, 3, unequal_bitmap_array)); + EXPECT_TRUE(array->RangeEquals(0, 1, 0, unequal_offsets_array)); + EXPECT_FALSE(array->RangeEquals(0, 2, 0, unequal_offsets_array)); + EXPECT_FALSE(array->RangeEquals(1, 2, 1, unequal_offsets_array)); + EXPECT_FALSE(array->RangeEquals(0, 1, 0, unequal_values_array)); + EXPECT_TRUE(array->RangeEquals(1, 3, 1, unequal_values_array)); + EXPECT_FALSE(array->RangeEquals(3, 4, 3, unequal_values_array)); +} + +TEST_F(TestStructBuilder, TestZeroLength) { + // All buffers are null + Done(); + ASSERT_OK(result_->Validate()); +} + } // namespace arrow diff --git a/cpp/src/arrow/types/struct.cc b/cpp/src/arrow/types/struct.cc index 04a277a86fa58..e8176f08268b4 100644 --- a/cpp/src/arrow/types/struct.cc +++ b/cpp/src/arrow/types/struct.cc @@ -17,4 +17,74 @@ #include "arrow/types/struct.h" -namespace arrow {} // namespace arrow +#include + +namespace arrow { + +bool StructArray::Equals(const std::shared_ptr& arr) const { + if (this == arr.get()) { return true; } + if (!arr) { return false; } + if (this->type_enum() != arr->type_enum()) { return false; } + if (null_count_ != arr->null_count()) { return false; } + return RangeEquals(0, length_, 0, arr); +} + +bool StructArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, + const std::shared_ptr& arr) const { + if (this == arr.get()) { return true; } + if (!arr) { return false; } + if (Type::STRUCT != arr->type_enum()) { return false; } + const auto other = static_cast(arr.get()); + + bool equal_fields = true; + for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { + if (IsNull(i) != arr->IsNull(o_i)) { return false; } + if (IsNull(i)) continue; + for (size_t j = 0; j < field_arrays_.size(); ++j) { + // TODO: really we should be comparing stretches of non-null data rather + // than looking at one value at a time. 
+ equal_fields = field(j)->RangeEquals(i, i + 1, o_i, other->field(j)); + if (!equal_fields) { return false; } + } + } + + return true; +} + +Status StructArray::Validate() const { + if (length_ < 0) { return Status::Invalid("Length was negative"); } + + if (null_count() > length_) { + return Status::Invalid("Null count exceeds the length of this struct"); + } + + if (field_arrays_.size() > 0) { + // Validate fields + int32_t array_length = field_arrays_[0]->length(); + size_t idx = 0; + for (auto it : field_arrays_) { + if (it->length() != array_length) { + std::stringstream ss; + ss << "Length is not equal from field " << it->type()->ToString() + << " at position {" << idx << "}"; + return Status::Invalid(ss.str()); + } + + const Status child_valid = it->Validate(); + if (!child_valid.ok()) { + std::stringstream ss; + ss << "Child array invalid: " << child_valid.ToString() << " at position {" << idx + << "}"; + return Status::Invalid(ss.str()); + } + ++idx; + } + + if (array_length > 0 && array_length != length_) { + return Status::Invalid("Struct's length is not equal to its child arrays"); + } + } + return Status::OK(); +} + +} // namespace arrow diff --git a/cpp/src/arrow/types/struct.h b/cpp/src/arrow/types/struct.h index 17e32993bf975..78afd29eb8df5 100644 --- a/cpp/src/arrow/types/struct.h +++ b/cpp/src/arrow/types/struct.h @@ -23,7 +23,102 @@ #include #include "arrow/type.h" +#include "arrow/types/list.h" +#include "arrow/types/primitive.h" -namespace arrow {} // namespace arrow +namespace arrow { + +class StructArray : public Array { + public: + StructArray(const TypePtr& type, int32_t length, std::vector& field_arrays, + int32_t null_count = 0, std::shared_ptr null_bitmap = nullptr) + : Array(type, length, null_count, null_bitmap) { + type_ = type; + field_arrays_ = field_arrays; + } + + Status Validate() const override; + + virtual ~StructArray() {} + + // Return a shared pointer in case the requestor desires to share ownership + // with this array. + const std::shared_ptr& field(int32_t pos) const { + DCHECK_GT(field_arrays_.size(), 0); + return field_arrays_[pos]; + } + const std::vector& fields() const { return field_arrays_; } + + bool EqualsExact(const StructArray& other) const; + bool Equals(const std::shared_ptr& arr) const override; + bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, + const std::shared_ptr& arr) const override; + + protected: + // The child arrays corresponding to each field of the struct data type. + std::vector field_arrays_; +}; + +// --------------------------------------------------------------------------------- +// StructArray builder +// Append, Resize and Reserve methods are acting on StructBuilder. +// Please make sure all these methods of all child-builders' are consistently +// called to maintain data-structure consistency. +class StructBuilder : public ArrayBuilder { + public: + StructBuilder(MemoryPool* pool, const std::shared_ptr& type, + const std::vector>& field_builders) + : ArrayBuilder(pool, type) { + field_builders_ = field_builders; + } + + // Null bitmap is of equal length to every child field, and any zero byte + // will be considered as a null for that field, but users must using app- + // end methods or advance methods of the child builders' independently to + // insert data. 
+ Status Append(int32_t length, const uint8_t* valid_bytes) { + RETURN_NOT_OK(Reserve(length)); + UnsafeAppendToBitmap(valid_bytes, length); + return Status::OK(); + } + + std::shared_ptr Finish() override { + std::vector fields; + for (auto it : field_builders_) { + fields.push_back(it->Finish()); + } + + auto result = + std::make_shared(type_, length_, fields, null_count_, null_bitmap_); + + null_bitmap_ = nullptr; + capacity_ = length_ = null_count_ = 0; + + return result; + } + + // Append an element to the Struct. All child-builders' Append method must + // be called independently to maintain data-structure consistency. + Status Append(bool is_valid = true) { + RETURN_NOT_OK(Reserve(1)); + UnsafeAppendToBitmap(is_valid); + return Status::OK(); + } + + Status AppendNull() { return Append(false); } + + const std::shared_ptr field_builder(int pos) const { + DCHECK_GT(field_builders_.size(), 0); + return field_builders_[pos]; + } + const std::vector>& field_builders() const { + return field_builders_; + } + + protected: + std::vector> field_builders_; +}; + +} // namespace arrow #endif // ARROW_TYPES_STRUCT_H From bc6c4c88fb4bfd1d99e71c8043f0ba0ca5544ae2 Mon Sep 17 00:00:00 2001 From: Micah Kornfield Date: Wed, 8 Jun 2016 11:23:07 -0700 Subject: [PATCH 0084/1644] ARROW-200: [C++/Python] Return error status on string initialization failure Author: Micah Kornfield Closes #88 from emkornfield/emk_arrow_200 and squashes the following commits: 37e23be [Micah Kornfield] ARROW-200: Return error status on string initialization failure --- python/src/pyarrow/adapters/pandas.cc | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index 5159d86865caa..8dcc2b1c92e11 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -669,7 +669,7 @@ class ArrowDeserializer { out_values[i] = make_pystring(data, length); if (out_values[i] == nullptr) { - return Status::OK(); + return Status::UnknownError("String initialization failed"); } } } @@ -678,7 +678,7 @@ class ArrowDeserializer { data = string_arr->GetValue(i, &length); out_values[i] = make_pystring(data, length); if (out_values[i] == nullptr) { - return Status::OK(); + return Status::UnknownError("String initialization failed"); } } } From 8197f246de934db14b3af26a0899d95bffbdc6b2 Mon Sep 17 00:00:00 2001 From: Micah Kornfield Date: Wed, 8 Jun 2016 11:24:04 -0700 Subject: [PATCH 0085/1644] ARROW-212: Change contract of PrimitiveArray to reflect its abstractness Follow-up based on #80 Author: Micah Kornfield Closes #87 from emkornfield/emk_clarify_primitive and squashes the following commits: 14bd5b2 [Micah Kornfield] ARROW-212: Make the fact that PrimitiveArray is a abstract class more apparent fromt the contract --- cpp/src/arrow/types/primitive.cc | 5 +++++ cpp/src/arrow/types/primitive.h | 15 +++++++++------ 2 files changed, 14 insertions(+), 6 deletions(-) diff --git a/cpp/src/arrow/types/primitive.cc b/cpp/src/arrow/types/primitive.cc index 8e6c0f809ca44..08fc8478e6de5 100644 --- a/cpp/src/arrow/types/primitive.cc +++ b/cpp/src/arrow/types/primitive.cc @@ -162,6 +162,11 @@ BooleanArray::BooleanArray(int32_t length, const std::shared_ptr& data, : PrimitiveArray( std::make_shared(), length, data, null_count, null_bitmap) {} +BooleanArray::BooleanArray(const TypePtr& type, int32_t length, + const std::shared_ptr& data, int32_t null_count, + const std::shared_ptr& null_bitmap) + : PrimitiveArray(type, length, data, 
null_count, null_bitmap) {} + bool BooleanArray::EqualsExact(const BooleanArray& other) const { if (this == &other) return true; if (null_count_ != other.null_count_) { return false; } diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index 9597fc8363138..f1ec417d51014 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -34,11 +34,10 @@ namespace arrow { class MemoryPool; -// Base class for fixed-size logical types +// Base class for fixed-size logical types. See MakePrimitiveArray +// (types/construct.h) for constructing a specific subclass. class PrimitiveArray : public Array { public: - PrimitiveArray(const TypePtr& type, int32_t length, const std::shared_ptr& data, - int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); virtual ~PrimitiveArray() {} const std::shared_ptr& data() const { return data_; } @@ -47,6 +46,8 @@ class PrimitiveArray : public Array { bool Equals(const std::shared_ptr& arr) const override; protected: + PrimitiveArray(const TypePtr& type, int32_t length, const std::shared_ptr& data, + int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); std::shared_ptr data_; const uint8_t* raw_data_; }; @@ -55,12 +56,14 @@ class PrimitiveArray : public Array { class NAME : public PrimitiveArray { \ public: \ using value_type = T; \ - using PrimitiveArray::PrimitiveArray; \ \ NAME(int32_t length, const std::shared_ptr& data, int32_t null_count = 0, \ const std::shared_ptr& null_bitmap = nullptr) \ : PrimitiveArray( \ std::make_shared(), length, data, null_count, null_bitmap) {} \ + NAME(const TypePtr& type, int32_t length, const std::shared_ptr& data, \ + int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr) \ + : PrimitiveArray(type, length, data, null_count, null_bitmap) {} \ \ bool EqualsExact(const NAME& other) const { \ return PrimitiveArray::EqualsExact(*static_cast(&other)); \ @@ -261,10 +264,10 @@ typedef NumericBuilder DoubleBuilder; class BooleanArray : public PrimitiveArray { public: - using PrimitiveArray::PrimitiveArray; - BooleanArray(int32_t length, const std::shared_ptr& data, int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); + BooleanArray(const TypePtr& type, int32_t length, const std::shared_ptr& data, + int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); bool EqualsExact(const BooleanArray& other) const; bool Equals(const ArrayPtr& arr) const override; From ec66ddd1fd4954b78967bfa1893480473e4d380c Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Fri, 10 Jun 2016 15:08:23 -0700 Subject: [PATCH 0086/1644] ARROW-203: Python: Basic filename based Parquet read/write Author: Uwe L. Korn Closes #83 from xhochy/arrow-203 and squashes the following commits: 405f85d [Uwe L. Korn] Remove FindParquet duplication 38d786c [Uwe L. Korn] Make code more readable by using using ec07768 [Uwe L. Korn] Set LD_LIBRARY_PATH in python build 8d90d3f [Uwe L. Korn] Do not set LD_LIBRARY_PATH in python build 000e1e3 [Uwe L. Korn] Use unique_ptr and shared_ptr from Cython 8f6010a [Uwe L. Korn] Linter fixes 0514d01 [Uwe L. Korn] Handle exceptions on RowGroupWriter::Close better 77bd21a [Uwe L. Korn] Add pandas roundtrip to tests f583b61 [Uwe L. Korn] Fix rpath for libarrow_parquet 00c1461 [Uwe L. Korn] Also ensure correct OSX compiler flags in PyArrow 4a80116 [Uwe L. Korn] Handle Python3 strings correctly 066c08a [Uwe L. Korn] Add missing functions to smart pointers 5706db2 [Uwe L. 
Korn] Use length and offset instead of slicing 443de8b [Uwe L. Korn] Add miniconda to the LD_LIBRARY_PATH 2dffc14 [Uwe L. Korn] Fix min mistake, use equals instead of == 2006e70 [Uwe L. Korn] Rewrite test py.test style 9520c39 [Uwe L. Korn] Use PARQUET from miniconda path cd3b9a9 [Uwe L. Korn] Also search for Parquet in PyArrow 6a41d23 [Uwe L. Korn] Re-use conda installation from C++ 81f501e [Uwe L. Korn] No need to install conda in travis_script_python anymore b505feb [Uwe L. Korn] Install parquet-cpp via conda 5d4929a [Uwe L. Korn] Add test-util.h 9b06e41 [Uwe L. Korn] Make tests templated be6415c [Uwe L. Korn] Incorportate review comments 0fbed3f [Uwe L. Korn] Remove obsolete parquet files 081db5f [Uwe L. Korn] Limit and document chunk_size 7192cfb [Uwe L. Korn] Add const to slicing parameters 0463995 [Uwe L. Korn] ARROW-203: Python: Basic filename based Parquet read/write --- ci/travis_before_script_cpp.sh | 6 +- ci/travis_conda_build.sh | 22 +- ci/travis_install_conda.sh | 26 +++ ci/travis_script_python.sh | 21 +- cpp/src/arrow/column.h | 2 + cpp/src/arrow/parquet/CMakeLists.txt | 7 + cpp/src/arrow/parquet/parquet-io-test.cc | 256 +++++++++++++++++------ cpp/src/arrow/parquet/reader.cc | 25 +++ cpp/src/arrow/parquet/reader.h | 3 + cpp/src/arrow/parquet/test-util.h | 77 +++++++ cpp/src/arrow/parquet/utils.h | 5 + cpp/src/arrow/parquet/writer.cc | 99 +++++++-- cpp/src/arrow/parquet/writer.h | 12 +- cpp/src/arrow/util/status.h | 9 + python/CMakeLists.txt | 8 + python/cmake_modules/FindArrow.cmake | 14 +- python/conda.recipe/build.sh | 13 ++ python/pyarrow/array.pyx | 3 + python/pyarrow/error.pxd | 2 + python/pyarrow/error.pyx | 8 + python/pyarrow/includes/common.pxd | 9 +- python/pyarrow/includes/libarrow.pxd | 3 + python/pyarrow/includes/parquet.pxd | 46 ++++ python/pyarrow/parquet.pyx | 50 ++++- python/pyarrow/schema.pyx | 9 +- python/pyarrow/tests/test_parquet.py | 59 ++++++ python/setup.py | 4 +- 27 files changed, 654 insertions(+), 144 deletions(-) create mode 100644 ci/travis_install_conda.sh create mode 100644 cpp/src/arrow/parquet/test-util.h create mode 100644 python/pyarrow/tests/test_parquet.py diff --git a/ci/travis_before_script_cpp.sh b/ci/travis_before_script_cpp.sh index 193c76feba1d7..6159f67e3613b 100755 --- a/ci/travis_before_script_cpp.sh +++ b/ci/travis_before_script_cpp.sh @@ -2,6 +2,10 @@ set -e +source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh +conda install -y --channel apache/channel/dev parquet-cpp +export PARQUET_HOME=$MINICONDA + : ${CPP_BUILD_DIR=$TRAVIS_BUILD_DIR/cpp-build} mkdir $CPP_BUILD_DIR @@ -19,7 +23,7 @@ echo $GTEST_HOME : ${ARROW_CPP_INSTALL=$TRAVIS_BUILD_DIR/cpp-install} -CMAKE_COMMON_FLAGS="-DARROW_BUILD_BENCHMARKS=ON -DCMAKE_INSTALL_PREFIX=$ARROW_CPP_INSTALL" +CMAKE_COMMON_FLAGS="-DARROW_BUILD_BENCHMARKS=ON -DARROW_PARQUET=ON -DCMAKE_INSTALL_PREFIX=$ARROW_CPP_INSTALL" if [ $TRAVIS_OS_NAME == "linux" ]; then cmake -DARROW_TEST_MEMCHECK=on $CMAKE_COMMON_FLAGS -DCMAKE_CXX_FLAGS="-Werror" $CPP_DIR diff --git a/ci/travis_conda_build.sh b/ci/travis_conda_build.sh index afa531dbd6b5f..c43a85170b094 100755 --- a/ci/travis_conda_build.sh +++ b/ci/travis_conda_build.sh @@ -2,27 +2,7 @@ set -e -if [ $TRAVIS_OS_NAME == "linux" ]; then - MINICONDA_URL="https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh" -else - MINICONDA_URL="https://repo.continuum.io/miniconda/Miniconda-latest-MacOSX-x86_64.sh" -fi - -wget -O miniconda.sh $MINICONDA_URL -MINICONDA=$TRAVIS_BUILD_DIR/miniconda -bash miniconda.sh -b -p $MINICONDA -export 
PATH="$MINICONDA/bin:$PATH" -conda update -y -q conda -conda info -a - -conda config --set show_channel_urls yes -conda config --add channels conda-forge -conda config --add channels apache - -conda install --yes conda-build jinja2 anaconda-client - -# faster builds, please -conda install -y nomkl +source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh # Build libarrow diff --git a/ci/travis_install_conda.sh b/ci/travis_install_conda.sh new file mode 100644 index 0000000000000..bef667dff7cc1 --- /dev/null +++ b/ci/travis_install_conda.sh @@ -0,0 +1,26 @@ +#!/usr/bin/env bash + +set -e + +if [ $TRAVIS_OS_NAME == "linux" ]; then + MINICONDA_URL="https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh" +else + MINICONDA_URL="https://repo.continuum.io/miniconda/Miniconda-latest-MacOSX-x86_64.sh" +fi + +wget -O miniconda.sh $MINICONDA_URL +export MINICONDA=$TRAVIS_BUILD_DIR/miniconda +bash miniconda.sh -b -p $MINICONDA +export PATH="$MINICONDA/bin:$PATH" +conda update -y -q conda +conda info -a + +conda config --set show_channel_urls yes +conda config --add channels conda-forge +conda config --add channels apache + +conda install --yes conda-build jinja2 anaconda-client + +# faster builds, please +conda install -y nomkl + diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index d45b895d8cf38..6d35785356ab4 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -4,6 +4,12 @@ set -e PYTHON_DIR=$TRAVIS_BUILD_DIR/python +# Re-use conda installation from C++ +export MINICONDA=$TRAVIS_BUILD_DIR/miniconda +export PATH="$MINICONDA/bin:$PATH" +export LD_LIBRARY_PATH="$MINICONDA/lib:$LD_LIBRARY_PATH" +export PARQUET_HOME=$MINICONDA + # Share environment with C++ pushd $CPP_BUILD_DIR source setup_build_env.sh @@ -11,21 +17,6 @@ popd pushd $PYTHON_DIR -# Bootstrap a Conda Python environment - -if [ $TRAVIS_OS_NAME == "linux" ]; then - MINICONDA_URL="https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh" -else - MINICONDA_URL="https://repo.continuum.io/miniconda/Miniconda-latest-MacOSX-x86_64.sh" -fi - -curl $MINICONDA_URL > miniconda.sh -MINICONDA=$TRAVIS_BUILD_DIR/miniconda -bash miniconda.sh -b -p $MINICONDA -export PATH="$MINICONDA/bin:$PATH" -conda update -y -q conda -conda info -a - python_version_tests() { PYTHON_VERSION=$1 CONDA_ENV_NAME="pyarrow-test-${PYTHON_VERSION}" diff --git a/cpp/src/arrow/column.h b/cpp/src/arrow/column.h index 22becc3454780..e409566e1f139 100644 --- a/cpp/src/arrow/column.h +++ b/cpp/src/arrow/column.h @@ -67,6 +67,8 @@ class Column { int64_t null_count() const { return data_->null_count(); } + const std::shared_ptr& field() const { return field_; } + // @returns: the column's name in the passed metadata const std::string& name() const { return field_->name; } diff --git a/cpp/src/arrow/parquet/CMakeLists.txt b/cpp/src/arrow/parquet/CMakeLists.txt index c00cc9f0f25d0..f00bb53c0848f 100644 --- a/cpp/src/arrow/parquet/CMakeLists.txt +++ b/cpp/src/arrow/parquet/CMakeLists.txt @@ -35,6 +35,13 @@ add_library(arrow_parquet SHARED target_link_libraries(arrow_parquet ${PARQUET_LIBS}) SET_TARGET_PROPERTIES(arrow_parquet PROPERTIES LINKER_LANGUAGE CXX) +if (APPLE) + set_target_properties(arrow_parquet + PROPERTIES + BUILD_WITH_INSTALL_RPATH ON + INSTALL_NAME_DIR "@rpath") +endif() + ADD_ARROW_TEST(parquet-schema-test) ARROW_TEST_LINK_LIBRARIES(parquet-schema-test arrow_parquet) diff --git a/cpp/src/arrow/parquet/parquet-io-test.cc b/cpp/src/arrow/parquet/parquet-io-test.cc index 845574d2c53b9..db779d8309cf6 100644 --- 
a/cpp/src/arrow/parquet/parquet-io-test.cc +++ b/cpp/src/arrow/parquet/parquet-io-test.cc @@ -18,6 +18,7 @@ #include "gtest/gtest.h" #include "arrow/test-util.h" +#include "arrow/parquet/test-util.h" #include "arrow/parquet/reader.h" #include "arrow/parquet/writer.h" #include "arrow/types/primitive.h" @@ -44,36 +45,45 @@ namespace arrow { namespace parquet { -template -std::shared_ptr NonNullArray( - size_t size, typename ArrowType::c_type value) { - std::vector values(size, value); - NumericBuilder builder(default_memory_pool(), std::make_shared()); - builder.Append(values.data(), values.size()); - return std::static_pointer_cast(builder.Finish()); -} +const int SMALL_SIZE = 100; +const int LARGE_SIZE = 10000; -// This helper function only supports (size/2) nulls yet. -template -std::shared_ptr NullableArray( - size_t size, typename ArrowType::c_type value, size_t num_nulls) { - std::vector values(size, value); - std::vector valid_bytes(size, 1); +template +struct test_traits {}; - for (size_t i = 0; i < num_nulls; i++) { - valid_bytes[i * 2] = 0; - } +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::INT32; +}; - NumericBuilder builder(default_memory_pool(), std::make_shared()); - builder.Append(values.data(), values.size(), valid_bytes.data()); - return std::static_pointer_cast(builder.Finish()); -} +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::INT64; +}; + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::FLOAT; +}; + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::DOUBLE; +}; + +template +using ParquetDataType = ::parquet::DataType::parquet_enum>; +template +using ParquetWriter = ::parquet::TypedColumnWriter>; + +template class TestParquetIO : public ::testing::Test { public: + typedef typename TestType::c_type T; virtual void SetUp() {} - std::shared_ptr Schema( + std::shared_ptr MakeSchema( ParquetType::type parquet_type, Repetition::type repetition) { auto pnode = PrimitiveNode::Make("column1", repetition, parquet_type); NodePtr node_ = @@ -98,20 +108,27 @@ class TestParquetIO : public ::testing::Test { std::unique_ptr column_reader; ASSERT_NO_THROW(ASSERT_OK(reader.GetFlatColumn(0, &column_reader))); ASSERT_NE(nullptr, column_reader.get()); - ASSERT_OK(column_reader->NextBatch(100, out)); + ASSERT_OK(column_reader->NextBatch(SMALL_SIZE, out)); + ASSERT_NE(nullptr, out->get()); + } + + void ReadTableFromFile( + std::unique_ptr file_reader, std::shared_ptr

* out) { + arrow::parquet::FileReader reader(default_memory_pool(), std::move(file_reader)); + ASSERT_NO_THROW(ASSERT_OK(reader.ReadFlatTable(out))); ASSERT_NE(nullptr, out->get()); } - std::unique_ptr Int64File( - std::vector& values, int num_chunks) { - std::shared_ptr schema = Schema(ParquetType::INT64, Repetition::REQUIRED); + std::unique_ptr TestFile(std::vector& values, int num_chunks) { + std::shared_ptr schema = + MakeSchema(test_traits::parquet_enum, Repetition::REQUIRED); std::unique_ptr file_writer = MakeWriter(schema); size_t chunk_size = values.size() / num_chunks; for (int i = 0; i < num_chunks; i++) { auto row_group_writer = file_writer->AppendRowGroup(chunk_size); - auto column_writer = - static_cast<::parquet::Int64Writer*>(row_group_writer->NextColumn()); - int64_t* data = values.data() + i * chunk_size; + auto column_writer = static_cast*>( + row_group_writer->NextColumn()); + T* data = values.data() + i * chunk_size; column_writer->WriteBatch(chunk_size, nullptr, nullptr, data); column_writer->Close(); row_group_writer->Close(); @@ -120,71 +137,135 @@ class TestParquetIO : public ::testing::Test { return ReaderFromSink(); } - private: std::shared_ptr sink_; }; -TEST_F(TestParquetIO, SingleColumnInt64Read) { - std::vector values(100, 128); - std::unique_ptr file_reader = Int64File(values, 1); +typedef ::testing::Types TestTypes; + +TYPED_TEST_CASE(TestParquetIO, TestTypes); + +TYPED_TEST(TestParquetIO, SingleColumnRequiredRead) { + std::vector values(SMALL_SIZE, 128); + std::unique_ptr file_reader = this->TestFile(values, 1); std::shared_ptr out; - ReadSingleColumnFile(std::move(file_reader), &out); + this->ReadSingleColumnFile(std::move(file_reader), &out); - Int64Array* out_array = static_cast(out.get()); - for (size_t i = 0; i < values.size(); i++) { - EXPECT_EQ(values[i], out_array->raw_data()[i]); - } + ExpectArray(values.data(), out.get()); } -TEST_F(TestParquetIO, SingleColumnInt64ChunkedRead) { - std::vector values(100, 128); - std::unique_ptr file_reader = Int64File(values, 4); +TYPED_TEST(TestParquetIO, SingleColumnRequiredTableRead) { + std::vector values(SMALL_SIZE, 128); + std::unique_ptr file_reader = this->TestFile(values, 1); + + std::shared_ptr
out; + this->ReadTableFromFile(std::move(file_reader), &out); + ASSERT_EQ(1, out->num_columns()); + ASSERT_EQ(SMALL_SIZE, out->num_rows()); + + std::shared_ptr chunked_array = out->column(0)->data(); + ASSERT_EQ(1, chunked_array->num_chunks()); + ExpectArray(values.data(), chunked_array->chunk(0).get()); +} + +TYPED_TEST(TestParquetIO, SingleColumnRequiredChunkedRead) { + std::vector values(SMALL_SIZE, 128); + std::unique_ptr file_reader = this->TestFile(values, 4); std::shared_ptr out; - ReadSingleColumnFile(std::move(file_reader), &out); + this->ReadSingleColumnFile(std::move(file_reader), &out); - Int64Array* out_array = static_cast(out.get()); - for (size_t i = 0; i < values.size(); i++) { - EXPECT_EQ(values[i], out_array->raw_data()[i]); - } + ExpectArray(values.data(), out.get()); } -TEST_F(TestParquetIO, SingleColumnInt64Write) { - std::shared_ptr values = NonNullArray(100, 128); +TYPED_TEST(TestParquetIO, SingleColumnRequiredChunkedTableRead) { + std::vector values(SMALL_SIZE, 128); + std::unique_ptr file_reader = this->TestFile(values, 4); + + std::shared_ptr
out; + this->ReadTableFromFile(std::move(file_reader), &out); + ASSERT_EQ(1, out->num_columns()); + ASSERT_EQ(SMALL_SIZE, out->num_rows()); - std::shared_ptr schema = Schema(ParquetType::INT64, Repetition::REQUIRED); - FileWriter writer(default_memory_pool(), MakeWriter(schema)); + std::shared_ptr chunked_array = out->column(0)->data(); + ASSERT_EQ(1, chunked_array->num_chunks()); + ExpectArray(values.data(), chunked_array->chunk(0).get()); +} + +TYPED_TEST(TestParquetIO, SingleColumnRequiredWrite) { + std::shared_ptr values = NonNullArray(SMALL_SIZE, 128); + + std::shared_ptr schema = + this->MakeSchema(test_traits::parquet_enum, Repetition::REQUIRED); + FileWriter writer(default_memory_pool(), this->MakeWriter(schema)); ASSERT_NO_THROW(ASSERT_OK(writer.NewRowGroup(values->length()))); ASSERT_NO_THROW(ASSERT_OK(writer.WriteFlatColumnChunk(values.get()))); ASSERT_NO_THROW(ASSERT_OK(writer.Close())); std::shared_ptr out; - ReadSingleColumnFile(ReaderFromSink(), &out); + this->ReadSingleColumnFile(this->ReaderFromSink(), &out); ASSERT_TRUE(values->Equals(out)); } -TEST_F(TestParquetIO, SingleColumnDoubleReadWrite) { +TYPED_TEST(TestParquetIO, SingleColumnTableRequiredWrite) { + std::shared_ptr values = NonNullArray(SMALL_SIZE, 128); + std::shared_ptr
table = MakeSimpleTable(values, false); + this->sink_ = std::make_shared(); + ASSERT_NO_THROW(ASSERT_OK( + WriteFlatTable(table.get(), default_memory_pool(), this->sink_, values->length()))); + + std::shared_ptr
out; + this->ReadTableFromFile(this->ReaderFromSink(), &out); + ASSERT_EQ(1, out->num_columns()); + ASSERT_EQ(100, out->num_rows()); + + std::shared_ptr chunked_array = out->column(0)->data(); + ASSERT_EQ(1, chunked_array->num_chunks()); + ASSERT_TRUE(values->Equals(chunked_array->chunk(0))); +} + +TYPED_TEST(TestParquetIO, SingleColumnOptionalReadWrite) { // This also tests max_definition_level = 1 - std::shared_ptr values = NullableArray(100, 128, 10); + std::shared_ptr values = NullableArray(SMALL_SIZE, 128, 10); - std::shared_ptr schema = Schema(ParquetType::DOUBLE, Repetition::OPTIONAL); - FileWriter writer(default_memory_pool(), MakeWriter(schema)); + std::shared_ptr schema = + this->MakeSchema(test_traits::parquet_enum, Repetition::OPTIONAL); + FileWriter writer(default_memory_pool(), this->MakeWriter(schema)); ASSERT_NO_THROW(ASSERT_OK(writer.NewRowGroup(values->length()))); ASSERT_NO_THROW(ASSERT_OK(writer.WriteFlatColumnChunk(values.get()))); ASSERT_NO_THROW(ASSERT_OK(writer.Close())); std::shared_ptr out; - ReadSingleColumnFile(ReaderFromSink(), &out); + this->ReadSingleColumnFile(this->ReaderFromSink(), &out); ASSERT_TRUE(values->Equals(out)); } -TEST_F(TestParquetIO, SingleColumnInt64ChunkedWrite) { - std::shared_ptr values = NonNullArray(100, 128); - std::shared_ptr values_chunk = NonNullArray(25, 128); +TYPED_TEST(TestParquetIO, SingleColumnTableOptionalReadWrite) { + // This also tests max_definition_level = 1 + std::shared_ptr values = NullableArray(SMALL_SIZE, 128, 10); + std::shared_ptr
table = MakeSimpleTable(values, true); + this->sink_ = std::make_shared(); + ASSERT_NO_THROW(ASSERT_OK( + WriteFlatTable(table.get(), default_memory_pool(), this->sink_, values->length()))); + + std::shared_ptr
out; + this->ReadTableFromFile(this->ReaderFromSink(), &out); + ASSERT_EQ(1, out->num_columns()); + ASSERT_EQ(SMALL_SIZE, out->num_rows()); + + std::shared_ptr chunked_array = out->column(0)->data(); + ASSERT_EQ(1, chunked_array->num_chunks()); + ASSERT_TRUE(values->Equals(chunked_array->chunk(0))); +} - std::shared_ptr schema = Schema(ParquetType::INT64, Repetition::REQUIRED); - FileWriter writer(default_memory_pool(), MakeWriter(schema)); +TYPED_TEST(TestParquetIO, SingleColumnIntRequiredChunkedWrite) { + std::shared_ptr values = NonNullArray(SMALL_SIZE, 128); + std::shared_ptr values_chunk = + NonNullArray(SMALL_SIZE / 4, 128); + + std::shared_ptr schema = + this->MakeSchema(test_traits::parquet_enum, Repetition::REQUIRED); + FileWriter writer(default_memory_pool(), this->MakeWriter(schema)); for (int i = 0; i < 4; i++) { ASSERT_NO_THROW(ASSERT_OK(writer.NewRowGroup(values_chunk->length()))); ASSERT_NO_THROW(ASSERT_OK(writer.WriteFlatColumnChunk(values_chunk.get()))); @@ -192,18 +273,37 @@ TEST_F(TestParquetIO, SingleColumnInt64ChunkedWrite) { ASSERT_NO_THROW(ASSERT_OK(writer.Close())); std::shared_ptr out; - ReadSingleColumnFile(ReaderFromSink(), &out); + this->ReadSingleColumnFile(this->ReaderFromSink(), &out); ASSERT_TRUE(values->Equals(out)); } -TEST_F(TestParquetIO, SingleColumnDoubleChunkedWrite) { - std::shared_ptr values = NullableArray(100, 128, 10); +TYPED_TEST(TestParquetIO, SingleColumnTableRequiredChunkedWrite) { + std::shared_ptr values = NonNullArray(LARGE_SIZE, 128); + std::shared_ptr
table = MakeSimpleTable(values, false); + this->sink_ = std::make_shared(); + ASSERT_NO_THROW( + ASSERT_OK(WriteFlatTable(table.get(), default_memory_pool(), this->sink_, 512))); + + std::shared_ptr
out; + this->ReadTableFromFile(this->ReaderFromSink(), &out); + ASSERT_EQ(1, out->num_columns()); + ASSERT_EQ(LARGE_SIZE, out->num_rows()); + + std::shared_ptr chunked_array = out->column(0)->data(); + ASSERT_EQ(1, chunked_array->num_chunks()); + ASSERT_TRUE(values->Equals(chunked_array->chunk(0))); +} + +TYPED_TEST(TestParquetIO, SingleColumnOptionalChunkedWrite) { + std::shared_ptr values = NullableArray(SMALL_SIZE, 128, 10); std::shared_ptr values_chunk_nulls = - NullableArray(25, 128, 10); - std::shared_ptr values_chunk = NullableArray(25, 128, 0); + NullableArray(SMALL_SIZE / 4, 128, 10); + std::shared_ptr values_chunk = + NullableArray(SMALL_SIZE / 4, 128, 0); - std::shared_ptr schema = Schema(ParquetType::DOUBLE, Repetition::OPTIONAL); - FileWriter writer(default_memory_pool(), MakeWriter(schema)); + std::shared_ptr schema = + this->MakeSchema(test_traits::parquet_enum, Repetition::OPTIONAL); + FileWriter writer(default_memory_pool(), this->MakeWriter(schema)); ASSERT_NO_THROW(ASSERT_OK(writer.NewRowGroup(values_chunk_nulls->length()))); ASSERT_NO_THROW(ASSERT_OK(writer.WriteFlatColumnChunk(values_chunk_nulls.get()))); for (int i = 0; i < 3; i++) { @@ -213,10 +313,28 @@ TEST_F(TestParquetIO, SingleColumnDoubleChunkedWrite) { ASSERT_NO_THROW(ASSERT_OK(writer.Close())); std::shared_ptr out; - ReadSingleColumnFile(ReaderFromSink(), &out); + this->ReadSingleColumnFile(this->ReaderFromSink(), &out); ASSERT_TRUE(values->Equals(out)); } +TYPED_TEST(TestParquetIO, SingleColumnTableOptionalChunkedWrite) { + // This also tests max_definition_level = 1 + std::shared_ptr values = NullableArray(LARGE_SIZE, 128, 100); + std::shared_ptr
table = MakeSimpleTable(values, true); + this->sink_ = std::make_shared(); + ASSERT_NO_THROW( + ASSERT_OK(WriteFlatTable(table.get(), default_memory_pool(), this->sink_, 512))); + + std::shared_ptr
out; + this->ReadTableFromFile(this->ReaderFromSink(), &out); + ASSERT_EQ(1, out->num_columns()); + ASSERT_EQ(LARGE_SIZE, out->num_rows()); + + std::shared_ptr chunked_array = out->column(0)->data(); + ASSERT_EQ(1, chunked_array->num_chunks()); + ASSERT_TRUE(values->Equals(chunked_array->chunk(0))); +} + } // namespace parquet } // namespace arrow diff --git a/cpp/src/arrow/parquet/reader.cc b/cpp/src/arrow/parquet/reader.cc index 346de25360649..3b4882d4439d5 100644 --- a/cpp/src/arrow/parquet/reader.cc +++ b/cpp/src/arrow/parquet/reader.cc @@ -18,10 +18,14 @@ #include "arrow/parquet/reader.h" #include +#include +#include +#include "arrow/column.h" #include "arrow/parquet/schema.h" #include "arrow/parquet/utils.h" #include "arrow/schema.h" +#include "arrow/table.h" #include "arrow/types/primitive.h" #include "arrow/util/status.h" @@ -40,6 +44,7 @@ class FileReader::Impl { bool CheckForFlatColumn(const ::parquet::ColumnDescriptor* descr); Status GetFlatColumn(int i, std::unique_ptr* out); Status ReadFlatColumn(int i, std::shared_ptr* out); + Status ReadFlatTable(std::shared_ptr
* out); private: MemoryPool* pool_; @@ -103,6 +108,22 @@ Status FileReader::Impl::ReadFlatColumn(int i, std::shared_ptr* out) { return flat_column_reader->NextBatch(reader_->num_rows(), out); } +Status FileReader::Impl::ReadFlatTable(std::shared_ptr
* table) { + const std::string& name = reader_->descr()->schema()->name(); + std::shared_ptr schema; + RETURN_NOT_OK(FromParquetSchema(reader_->descr(), &schema)); + + std::vector> columns(reader_->num_columns()); + for (int i = 0; i < reader_->num_columns(); i++) { + std::shared_ptr array; + RETURN_NOT_OK(ReadFlatColumn(i, &array)); + columns[i] = std::make_shared(schema->field(i), array); + } + + *table = std::make_shared
(name, schema, columns); + return Status::OK(); +} + FileReader::FileReader( MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileReader> reader) : impl_(new FileReader::Impl(pool, std::move(reader))) {} @@ -117,6 +138,10 @@ Status FileReader::ReadFlatColumn(int i, std::shared_ptr* out) { return impl_->ReadFlatColumn(i, out); } +Status FileReader::ReadFlatTable(std::shared_ptr
* out) { + return impl_->ReadFlatTable(out); +} + FlatColumnReader::Impl::Impl(MemoryPool* pool, const ::parquet::ColumnDescriptor* descr, ::parquet::ParquetFileReader* reader, int column_index) : pool_(pool), diff --git a/cpp/src/arrow/parquet/reader.h b/cpp/src/arrow/parquet/reader.h index 41ca7eb35b9f0..db7a15753d8e8 100644 --- a/cpp/src/arrow/parquet/reader.h +++ b/cpp/src/arrow/parquet/reader.h @@ -29,6 +29,7 @@ class Array; class MemoryPool; class RowBatch; class Status; +class Table; namespace parquet { @@ -90,6 +91,8 @@ class FileReader { Status GetFlatColumn(int i, std::unique_ptr* out); // Read column as a whole into an Array. Status ReadFlatColumn(int i, std::shared_ptr* out); + // Read a table of flat columns into a Table. + Status ReadFlatTable(std::shared_ptr
* out); virtual ~FileReader(); diff --git a/cpp/src/arrow/parquet/test-util.h b/cpp/src/arrow/parquet/test-util.h new file mode 100644 index 0000000000000..1496082d5c661 --- /dev/null +++ b/cpp/src/arrow/parquet/test-util.h @@ -0,0 +1,77 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include +#include + +#include "arrow/types/primitive.h" + +namespace arrow { + +namespace parquet { + +template +std::shared_ptr NonNullArray( + size_t size, typename ArrowType::c_type value) { + std::vector values(size, value); + NumericBuilder builder(default_memory_pool(), std::make_shared()); + builder.Append(values.data(), values.size()); + return std::static_pointer_cast(builder.Finish()); +} + +// This helper function only supports (size/2) nulls yet. +template +std::shared_ptr NullableArray( + size_t size, typename ArrowType::c_type value, size_t num_nulls) { + std::vector values(size, value); + std::vector valid_bytes(size, 1); + + for (size_t i = 0; i < num_nulls; i++) { + valid_bytes[i * 2] = 0; + } + + NumericBuilder builder(default_memory_pool(), std::make_shared()); + builder.Append(values.data(), values.size(), valid_bytes.data()); + return std::static_pointer_cast(builder.Finish()); +} + +std::shared_ptr MakeColumn(const std::string& name, + const std::shared_ptr& array, bool nullable) { + auto field = std::make_shared(name, array->type(), nullable); + return std::make_shared(field, array); +} + +std::shared_ptr
MakeSimpleTable( + const std::shared_ptr& values, bool nullable) { + std::shared_ptr column = MakeColumn("col", values, nullable); + std::vector> columns({column}); + std::vector> fields({column->field()}); + auto schema = std::make_shared(fields); + return std::make_shared
("table", schema, columns); +} + +template +void ExpectArray(T* expected, Array* result) { + PrimitiveArray* p_array = static_cast(result); + for (size_t i = 0; i < result->length(); i++) { + EXPECT_EQ(expected[i], reinterpret_cast(p_array->data()->data())[i]); + } +} + +} // namespace parquet + +} // namespace arrow diff --git a/cpp/src/arrow/parquet/utils.h b/cpp/src/arrow/parquet/utils.h index b32792fdf7030..409bcd9065cda 100644 --- a/cpp/src/arrow/parquet/utils.h +++ b/cpp/src/arrow/parquet/utils.h @@ -31,6 +31,11 @@ namespace parquet { (s); \ } catch (const ::parquet::ParquetException& e) { return Status::Invalid(e.what()); } +#define PARQUET_IGNORE_NOT_OK(s) \ + try { \ + (s); \ + } catch (const ::parquet::ParquetException& e) {} + } // namespace parquet } // namespace arrow diff --git a/cpp/src/arrow/parquet/writer.cc b/cpp/src/arrow/parquet/writer.cc index 3ad2c5b073501..1223901d5505a 100644 --- a/cpp/src/arrow/parquet/writer.cc +++ b/cpp/src/arrow/parquet/writer.cc @@ -17,11 +17,21 @@ #include "arrow/parquet/writer.h" +#include +#include + #include "arrow/array.h" +#include "arrow/column.h" +#include "arrow/table.h" +#include "arrow/types/construct.h" #include "arrow/types/primitive.h" +#include "arrow/parquet/schema.h" #include "arrow/parquet/utils.h" #include "arrow/util/status.h" +using parquet::ParquetFileWriter; +using parquet::schema::GroupNode; + namespace arrow { namespace parquet { @@ -32,8 +42,9 @@ class FileWriter::Impl { Status NewRowGroup(int64_t chunk_size); template - Status TypedWriteBatch(::parquet::ColumnWriter* writer, const PrimitiveArray* data); - Status WriteFlatColumnChunk(const PrimitiveArray* data); + Status TypedWriteBatch(::parquet::ColumnWriter* writer, const PrimitiveArray* data, + int64_t offset, int64_t length); + Status WriteFlatColumnChunk(const PrimitiveArray* data, int64_t offset, int64_t length); Status Close(); virtual ~Impl() {} @@ -60,31 +71,31 @@ Status FileWriter::Impl::NewRowGroup(int64_t chunk_size) { } template -Status FileWriter::Impl::TypedWriteBatch( - ::parquet::ColumnWriter* column_writer, const PrimitiveArray* data) { +Status FileWriter::Impl::TypedWriteBatch(::parquet::ColumnWriter* column_writer, + const PrimitiveArray* data, int64_t offset, int64_t length) { + // TODO: DCHECK((offset + length) <= data->length()); auto data_ptr = - reinterpret_cast(data->data()->data()); + reinterpret_cast(data->data()->data()) + + offset; auto writer = reinterpret_cast<::parquet::TypedColumnWriter*>(column_writer); if (writer->descr()->max_definition_level() == 0) { // no nulls, just dump the data - PARQUET_CATCH_NOT_OK(writer->WriteBatch(data->length(), nullptr, nullptr, data_ptr)); + PARQUET_CATCH_NOT_OK(writer->WriteBatch(length, nullptr, nullptr, data_ptr)); } else if (writer->descr()->max_definition_level() == 1) { - RETURN_NOT_OK(def_levels_buffer_.Resize(data->length() * sizeof(int16_t))); + RETURN_NOT_OK(def_levels_buffer_.Resize(length * sizeof(int16_t))); int16_t* def_levels_ptr = reinterpret_cast(def_levels_buffer_.mutable_data()); if (data->null_count() == 0) { - std::fill(def_levels_ptr, def_levels_ptr + data->length(), 1); - PARQUET_CATCH_NOT_OK( - writer->WriteBatch(data->length(), def_levels_ptr, nullptr, data_ptr)); + std::fill(def_levels_ptr, def_levels_ptr + length, 1); + PARQUET_CATCH_NOT_OK(writer->WriteBatch(length, def_levels_ptr, nullptr, data_ptr)); } else { - RETURN_NOT_OK(data_buffer_.Resize( - (data->length() - data->null_count()) * sizeof(typename ParquetType::c_type))); + RETURN_NOT_OK(data_buffer_.Resize(length * 
sizeof(typename ParquetType::c_type))); auto buffer_ptr = reinterpret_cast(data_buffer_.mutable_data()); int buffer_idx = 0; - for (size_t i = 0; i < data->length(); i++) { - if (data->IsNull(i)) { + for (size_t i = 0; i < length; i++) { + if (data->IsNull(offset + i)) { def_levels_ptr[i] = 0; } else { def_levels_ptr[i] = 1; @@ -92,7 +103,7 @@ Status FileWriter::Impl::TypedWriteBatch( } } PARQUET_CATCH_NOT_OK( - writer->WriteBatch(data->length(), def_levels_ptr, nullptr, buffer_ptr)); + writer->WriteBatch(length, def_levels_ptr, nullptr, buffer_ptr)); } } else { return Status::NotImplemented("no support for max definition level > 1 yet"); @@ -107,12 +118,13 @@ Status FileWriter::Impl::Close() { return Status::OK(); } -#define TYPED_BATCH_CASE(ENUM, ArrowType, ParquetType) \ - case Type::ENUM: \ - return TypedWriteBatch(writer, data); \ +#define TYPED_BATCH_CASE(ENUM, ArrowType, ParquetType) \ + case Type::ENUM: \ + return TypedWriteBatch(writer, data, offset, length); \ break; -Status FileWriter::Impl::WriteFlatColumnChunk(const PrimitiveArray* data) { +Status FileWriter::Impl::WriteFlatColumnChunk( + const PrimitiveArray* data, int64_t offset, int64_t length) { ::parquet::ColumnWriter* writer; PARQUET_CATCH_NOT_OK(writer = row_group_writer_->NextColumn()); switch (data->type_enum()) { @@ -133,8 +145,11 @@ Status FileWriter::NewRowGroup(int64_t chunk_size) { return impl_->NewRowGroup(chunk_size); } -Status FileWriter::WriteFlatColumnChunk(const PrimitiveArray* data) { - return impl_->WriteFlatColumnChunk(data); +Status FileWriter::WriteFlatColumnChunk( + const PrimitiveArray* data, int64_t offset, int64_t length) { + int64_t real_length = length; + if (length == -1) { real_length = data->length(); } + return impl_->WriteFlatColumnChunk(data, offset, real_length); } Status FileWriter::Close() { @@ -143,6 +158,48 @@ Status FileWriter::Close() { FileWriter::~FileWriter() {} +Status WriteFlatTable(const Table* table, MemoryPool* pool, + std::shared_ptr<::parquet::OutputStream> sink, int64_t chunk_size) { + std::shared_ptr<::parquet::SchemaDescriptor> parquet_schema; + RETURN_NOT_OK(ToParquetSchema(table->schema().get(), &parquet_schema)); + auto schema_node = std::static_pointer_cast(parquet_schema->schema()); + std::unique_ptr parquet_writer = + ParquetFileWriter::Open(sink, schema_node); + FileWriter writer(pool, std::move(parquet_writer)); + + // TODO: Support writing chunked arrays. + for (int i = 0; i < table->num_columns(); i++) { + if (table->column(i)->data()->num_chunks() != 1) { + return Status::NotImplemented("No support for writing chunked arrays yet."); + } + } + + // Cast to PrimitiveArray instances as we work with them. 
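+  // Note on the chunking below (illustrative arithmetic, matching the unit
+  // tests above): with num_rows = 10000 and chunk_size = 512 the loop emits
+  // ceil(10000 / 512) = 20 row groups, the last holding 10000 - 19 * 512 =
+  // 272 rows, because size = std::min(chunk_size, num_rows - offset).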
+  std::vector<std::shared_ptr<PrimitiveArray>> arrays(table->num_columns());
+  for (int i = 0; i < table->num_columns(); i++) {
+    // num_chunks == 1 as per above loop
+    std::shared_ptr<Array> array = table->column(i)->data()->chunk(0);
+    auto primitive_array = std::dynamic_pointer_cast<PrimitiveArray>(array);
+    if (!primitive_array) {
+      PARQUET_IGNORE_NOT_OK(writer.Close());
+      return Status::NotImplemented("Table must consist of PrimitiveArray instances");
+    }
+    arrays[i] = primitive_array;
+  }
+
+  for (int chunk = 0; chunk * chunk_size < table->num_rows(); chunk++) {
+    int64_t offset = chunk * chunk_size;
+    int64_t size = std::min(chunk_size, table->num_rows() - offset);
+    RETURN_NOT_OK_ELSE(writer.NewRowGroup(size), PARQUET_IGNORE_NOT_OK(writer.Close()));
+    for (int i = 0; i < table->num_columns(); i++) {
+      RETURN_NOT_OK_ELSE(writer.WriteFlatColumnChunk(arrays[i].get(), offset, size),
+          PARQUET_IGNORE_NOT_OK(writer.Close()));
+    }
+  }
+
+  return writer.Close();
+}
+
 }  // namespace parquet
 }  // namespace arrow
diff --git a/cpp/src/arrow/parquet/writer.h b/cpp/src/arrow/parquet/writer.h
index 38f7d0b3a89d5..83e799f7ed1ed 100644
--- a/cpp/src/arrow/parquet/writer.h
+++ b/cpp/src/arrow/parquet/writer.h
@@ -29,6 +29,7 @@ class MemoryPool;
 class PrimitiveArray;
 class RowBatch;
 class Status;
+class Table;

 namespace parquet {

@@ -42,7 +43,8 @@ class FileWriter {
   FileWriter(MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileWriter> writer);

   Status NewRowGroup(int64_t chunk_size);
-  Status WriteFlatColumnChunk(const PrimitiveArray* data);
+  Status WriteFlatColumnChunk(
+      const PrimitiveArray* data, int64_t offset = 0, int64_t length = -1);
   Status Close();

   virtual ~FileWriter();
@@ -52,6 +54,14 @@
   std::unique_ptr<Impl> impl_;
 };

+/**
+ * Write a flat Table to Parquet.
+ *
+ * The table shall only consist of nullable, non-repeated columns of primitive type.
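+ *
+ * A rough usage sketch (the table and output path are illustrative;
+ * LocalFileOutputStream comes from parquet/api/writer.h):
+ *
+ *   std::shared_ptr<Table> table = ...;  // one chunk per column
+ *   auto sink = std::make_shared<::parquet::LocalFileOutputStream>("out.parquet");
+ *   RETURN_NOT_OK(WriteFlatTable(
+ *       table.get(), default_memory_pool(), sink, 512));  // 512 rows per RowGroup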
+ */ +Status WriteFlatTable(const Table* table, MemoryPool* pool, + std::shared_ptr<::parquet::OutputStream> sink, int64_t chunk_size); + } // namespace parquet } // namespace arrow diff --git a/cpp/src/arrow/util/status.h b/cpp/src/arrow/util/status.h index 6ddc177a9a50d..d1a742500084c 100644 --- a/cpp/src/arrow/util/status.h +++ b/cpp/src/arrow/util/status.h @@ -63,6 +63,15 @@ namespace arrow { if (!_s.ok()) { return _s; } \ } while (0); +#define RETURN_NOT_OK_ELSE(s, else_) \ + do { \ + Status _s = (s); \ + if (!_s.ok()) { \ + else_; \ + return _s; \ + } \ + } while (0); + enum class StatusCode : char { OK = 0, OutOfMemory = 1, diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 2173232d4eff5..f1becfcf44964 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -339,11 +339,17 @@ if (PYARROW_BUILD_TESTS) STATIC_LIB ${GTEST_STATIC_LIB}) endif() +## Parquet +find_package(Parquet REQUIRED) +include_directories(SYSTEM ${PARQUET_INCLUDE_DIR}) + ## Arrow find_package(Arrow REQUIRED) include_directories(SYSTEM ${ARROW_INCLUDE_DIR}) ADD_THIRDPARTY_LIB(arrow SHARED_LIB ${ARROW_SHARED_LIB}) +ADD_THIRDPARTY_LIB(arrow_parquet + SHARED_LIB ${ARROW_PARQUET_SHARED_LIB}) ############################################################ # Linker setup @@ -422,6 +428,7 @@ set(PYARROW_SRCS set(LINK_LIBS arrow + arrow_parquet ) SET(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE) @@ -442,6 +449,7 @@ set(CYTHON_EXTENSIONS array config error + parquet scalar schema table diff --git a/python/cmake_modules/FindArrow.cmake b/python/cmake_modules/FindArrow.cmake index 3d9983849ebb2..f0b258ed027b0 100644 --- a/python/cmake_modules/FindArrow.cmake +++ b/python/cmake_modules/FindArrow.cmake @@ -42,19 +42,27 @@ find_library(ARROW_LIB_PATH NAMES arrow ${ARROW_SEARCH_LIB_PATH} NO_DEFAULT_PATH) -if (ARROW_INCLUDE_DIR AND ARROW_LIB_PATH) +find_library(ARROW_PARQUET_LIB_PATH NAMES arrow_parquet + PATHS + ${ARROW_SEARCH_LIB_PATH} + NO_DEFAULT_PATH) + +if (ARROW_INCLUDE_DIR AND ARROW_LIB_PATH AND ARROW_PARQUET_LIB_PATH) set(ARROW_FOUND TRUE) set(ARROW_LIB_NAME libarrow) + set(ARROW_PARQUET_LIB_NAME libarrow_parquet) set(ARROW_LIBS ${ARROW_SEARCH_LIB_PATH}) set(ARROW_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_LIB_NAME}.a) set(ARROW_SHARED_LIB ${ARROW_LIBS}/${ARROW_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(ARROW_PARQUET_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_PARQUET_LIB_NAME}.a) + set(ARROW_PARQUET_SHARED_LIB ${ARROW_LIBS}/${ARROW_PARQUET_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) else () set(ARROW_FOUND FALSE) endif () if (ARROW_FOUND) if (NOT Arrow_FIND_QUIETLY) - message(STATUS "Found the Arrow library: ${ARROW_LIB_PATH}") + message(STATUS "Found the Arrow library: ${ARROW_LIB_PATH}, ${ARROW_PARQUET_LIB_PATH}") endif () else () if (NOT Arrow_FIND_QUIETLY) @@ -74,4 +82,6 @@ mark_as_advanced( ARROW_LIBS ARROW_STATIC_LIB ARROW_SHARED_LIB + ARROW_PARQUET_STATIC_LIB + ARROW_PARQUET_SHARED_LIB ) diff --git a/python/conda.recipe/build.sh b/python/conda.recipe/build.sh index a9d9aedead399..a164c1af51833 100644 --- a/python/conda.recipe/build.sh +++ b/python/conda.recipe/build.sh @@ -6,6 +6,19 @@ export ARROW_HOME=$PREFIX cd $RECIPE_DIR +if [ "$(uname)" == "Darwin" ]; then + # C++11 finagling for Mac OSX + export CC=clang + export CXX=clang++ + export MACOSX_VERSION_MIN="10.7" + CXXFLAGS="${CXXFLAGS} -mmacosx-version-min=${MACOSX_VERSION_MIN}" + CXXFLAGS="${CXXFLAGS} -stdlib=libc++ -std=c++11" + export LDFLAGS="${LDFLAGS} -mmacosx-version-min=${MACOSX_VERSION_MIN}" + export LDFLAGS="${LDFLAGS} -stdlib=libc++ 
-std=c++11" + export LINKFLAGS="${LDFLAGS}" + export MACOSX_DEPLOYMENT_TARGET=10.7 +fi + echo Setting the compiler... if [ `uname` == Linux ]; then EXTRA_CMAKE_ARGS=-DCMAKE_SHARED_LINKER_FLAGS=-static-libstdc++ diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index a80b3ce83980e..619e5ef7e3943 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -68,6 +68,9 @@ cdef class Array: values = array_format(self, window=10) return '{0}\n{1}'.format(type_format, values) + def equals(Array self, Array other): + return self.ap.Equals(other.sp_array) + def __len__(self): if self.sp_array.get(): return self.sp_array.get().length() diff --git a/python/pyarrow/error.pxd b/python/pyarrow/error.pxd index d226abeda04e0..97ba0ef2e9fcb 100644 --- a/python/pyarrow/error.pxd +++ b/python/pyarrow/error.pxd @@ -15,6 +15,8 @@ # specific language governing permissions and limitations # under the License. +from pyarrow.includes.libarrow cimport CStatus from pyarrow.includes.pyarrow cimport * +cdef check_cstatus(const CStatus& status) cdef check_status(const Status& status) diff --git a/python/pyarrow/error.pyx b/python/pyarrow/error.pyx index 3f8d7dd646091..5a6a038a92e43 100644 --- a/python/pyarrow/error.pyx +++ b/python/pyarrow/error.pyx @@ -15,12 +15,20 @@ # specific language governing permissions and limitations # under the License. +from pyarrow.includes.libarrow cimport CStatus from pyarrow.includes.common cimport c_string from pyarrow.compat import frombytes class ArrowException(Exception): pass +cdef check_cstatus(const CStatus& status): + if status.ok(): + return + + cdef c_string c_message = status.ToString() + raise ArrowException(frombytes(c_message)) + cdef check_status(const Status& status): if status.ok(): return diff --git a/python/pyarrow/includes/common.pxd b/python/pyarrow/includes/common.pxd index e86d5d77e8b10..1f6ecee510521 100644 --- a/python/pyarrow/includes/common.pxd +++ b/python/pyarrow/includes/common.pxd @@ -19,6 +19,7 @@ from libc.stdint cimport * from libcpp cimport bool as c_bool +from libcpp.memory cimport shared_ptr, unique_ptr from libcpp.string cimport string as c_string from libcpp.vector cimport vector @@ -32,11 +33,3 @@ cdef extern from "": cdef extern from "": void Py_XDECREF(PyObject* o) -cdef extern from "" namespace "std" nogil: - - cdef cppclass shared_ptr[T]: - shared_ptr() - shared_ptr(T*) - T* get() - void reset() - void reset(T* p) diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index b2ef45a347bc0..90414e3d542db 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -72,6 +72,8 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass MemoryPool" arrow::MemoryPool": int64_t bytes_allocated() + cdef MemoryPool* default_memory_pool() + cdef cppclass CListType" arrow::ListType"(CDataType): CListType(const shared_ptr[CDataType]& value_type) @@ -103,6 +105,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: int32_t null_count() Type type_enum() + c_bool Equals(const shared_ptr[CArray]& arr) c_bool IsNull(int i) cdef cppclass CBooleanArray" arrow::BooleanArray"(CArray): diff --git a/python/pyarrow/includes/parquet.pxd b/python/pyarrow/includes/parquet.pxd index ffdc5d487068d..0918344070eb0 100644 --- a/python/pyarrow/includes/parquet.pxd +++ b/python/pyarrow/includes/parquet.pxd @@ -18,6 +18,26 @@ # distutils: language = c++ from pyarrow.includes.common cimport * +from pyarrow.includes.libarrow cimport CSchema, CStatus, CTable, 
MemoryPool + + +cdef extern from "parquet/api/schema.h" namespace "parquet::schema" nogil: + cdef cppclass Node: + pass + + cdef cppclass GroupNode(Node): + pass + + cdef cppclass PrimitiveNode(Node): + pass + +cdef extern from "parquet/api/schema.h" namespace "parquet" nogil: + cdef cppclass SchemaDescriptor: + shared_ptr[Node] schema() + GroupNode* group() + + cdef cppclass ColumnDescriptor: + pass cdef extern from "parquet/api/reader.h" namespace "parquet" nogil: cdef cppclass ColumnReader: @@ -48,4 +68,30 @@ cdef extern from "parquet/api/reader.h" namespace "parquet" nogil: pass cdef cppclass ParquetFileReader: + # TODO: Some default arguments are missing + @staticmethod + unique_ptr[ParquetFileReader] OpenFile(const c_string& path) + +cdef extern from "parquet/api/writer.h" namespace "parquet" nogil: + cdef cppclass OutputStream: pass + + cdef cppclass LocalFileOutputStream(OutputStream): + LocalFileOutputStream(const c_string& path) + void Close() + + +cdef extern from "arrow/parquet/reader.h" namespace "arrow::parquet" nogil: + cdef cppclass FileReader: + FileReader(MemoryPool* pool, unique_ptr[ParquetFileReader] reader) + CStatus ReadFlatTable(shared_ptr[CTable]* out); + + +cdef extern from "arrow/parquet/schema.h" namespace "arrow::parquet" nogil: + CStatus FromParquetSchema(const SchemaDescriptor* parquet_schema, shared_ptr[CSchema]* out) + CStatus ToParquetSchema(const CSchema* arrow_schema, shared_ptr[SchemaDescriptor]* out) + + +cdef extern from "arrow/parquet/writer.h" namespace "arrow::parquet" nogil: + cdef CStatus WriteFlatTable(const CTable* table, MemoryPool* pool, shared_ptr[OutputStream] sink, int64_t chunk_size) + diff --git a/python/pyarrow/parquet.pyx b/python/pyarrow/parquet.pyx index 622e7d0772456..3d5355ebe433a 100644 --- a/python/pyarrow/parquet.pyx +++ b/python/pyarrow/parquet.pyx @@ -19,5 +19,53 @@ # distutils: language = c++ # cython: embedsignature = True -from pyarrow.compat import frombytes, tobytes +from pyarrow.includes.libarrow cimport * +cimport pyarrow.includes.pyarrow as pyarrow from pyarrow.includes.parquet cimport * + +from pyarrow.compat import tobytes +from pyarrow.error cimport check_cstatus +from pyarrow.table cimport Table + +def read_table(filename, columns=None): + """ + Read a Table from Parquet format + Returns + ------- + table: pyarrow.Table + """ + cdef unique_ptr[FileReader] reader + cdef Table table = Table() + cdef shared_ptr[CTable] ctable + + # Must be in one expression to avoid calling std::move which is not possible + # in Cython (due to missing rvalue support) + reader = unique_ptr[FileReader](new FileReader(default_memory_pool(), + ParquetFileReader.OpenFile(tobytes(filename)))) + check_cstatus(reader.get().ReadFlatTable(&ctable)) + table.init(ctable) + return table + +def write_table(table, filename, chunk_size=None): + """ + Write a Table to Parquet format + + Parameters + ---------- + table : pyarrow.Table + filename : string + chunk_size : int + The maximum number of rows in each Parquet RowGroup + """ + cdef Table table_ = table + cdef CTable* ctable_ = table_.table + cdef shared_ptr[OutputStream] sink + cdef int64_t chunk_size_ = 0 + if chunk_size is None: + chunk_size_ = min(ctable_.num_rows(), int(2**16)) + else: + chunk_size_ = chunk_size + + sink.reset(new LocalFileOutputStream(tobytes(filename))) + check_cstatus(WriteFlatTable(ctable_, default_memory_pool(), sink, chunk_size_)) + diff --git a/python/pyarrow/schema.pyx b/python/pyarrow/schema.pyx index 22ddf0cf17e41..084c304aed2a2 100644 --- 
a/python/pyarrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ -201,7 +201,9 @@ def string(): def list_(DataType value_type): cdef DataType out = DataType() - out.init(shared_ptr[CDataType](new CListType(value_type.sp_type))) + cdef shared_ptr[CDataType] list_type + list_type.reset(new CListType(value_type.sp_type)) + out.init(list_type) return out def struct(fields): @@ -212,12 +214,13 @@ def struct(fields): DataType out = DataType() Field field vector[shared_ptr[CField]] c_fields + cdef shared_ptr[CDataType] struct_type for field in fields: c_fields.push_back(field.sp_field) - out.init(shared_ptr[CDataType]( - new CStructType(c_fields))) + struct_type.reset(new CStructType(c_fields)) + out.init(struct_type) return out def schema(fields): diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py new file mode 100644 index 0000000000000..d92cf4ca6563e --- /dev/null +++ b/python/pyarrow/tests/test_parquet.py @@ -0,0 +1,59 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +from pyarrow.compat import unittest +import pyarrow as arrow +import pyarrow.parquet + +A = arrow + +import numpy as np +import os.path +import pandas as pd + +import pandas.util.testing as pdt + + +def test_single_pylist_column_roundtrip(tmpdir): + for dtype in [int, float]: + filename = tmpdir.join('single_{}_column.parquet'.format(dtype.__name__)) + data = [A.from_pylist(list(map(dtype, range(5))))] + table = A.Table.from_arrays(('a', 'b'), data, 'table_name') + A.parquet.write_table(table, filename.strpath) + table_read = pyarrow.parquet.read_table(filename.strpath) + for col_written, col_read in zip(table.itercolumns(), table_read.itercolumns()): + assert col_written.name == col_read.name + assert col_read.data.num_chunks == 1 + data_written = col_written.data.chunk(0) + data_read = col_read.data.chunk(0) + assert data_written.equals(data_read) + +def test_pandas_rountrip(tmpdir): + size = 10000 + df = pd.DataFrame({ + 'int32': np.arange(size, dtype=np.int32), + 'int64': np.arange(size, dtype=np.int64), + 'float32': np.arange(size, dtype=np.float32), + 'float64': np.arange(size, dtype=np.float64) + }) + filename = tmpdir.join('pandas_rountrip.parquet') + arrow_table = A.from_pandas_dataframe(df) + A.parquet.write_table(arrow_table, filename.strpath) + table_read = pyarrow.parquet.read_table(filename.strpath) + df_read = table_read.to_pandas() + pdt.assert_frame_equal(df, df_read) + diff --git a/python/setup.py b/python/setup.py index 5f228ed0af245..7edeb9143319b 100644 --- a/python/setup.py +++ b/python/setup.py @@ -214,7 +214,7 @@ def get_ext_built(self, name): return name + suffix def get_cmake_cython_names(self): - return ['array', 'config', 'error', 'scalar', 'schema', 'table'] + return ['array', 'config', 'error', 'parquet', 'scalar', 'schema', 
'table'] def get_names(self): return self._found_names @@ -242,7 +242,7 @@ def get_outputs(self): 'clean': clean, 'build_ext': build_ext }, - install_requires=['cython >= 0.21', 'numpy >= 1.9'], + install_requires=['cython >= 0.23', 'numpy >= 1.9'], description=DESC, license='Apache License, Version 2.0', maintainer="Apache Arrow Developers", From b4e0e93d580b8e0344c0caa1cf51cbe088bd25ac Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Wed, 15 Jun 2016 13:28:10 -0700 Subject: [PATCH 0087/1644] ARROW-217: Fix Travis w.r.t conda 4.1.0 changes Travis is happy, fixes the problems we see with Travis in #85 Author: Uwe L. Korn Closes #90 from xhochy/fix-conda-show-channel-urls and squashes the following commits: 82e9840 [Uwe L. Korn] ARROW-217: Fix Travis w.r.t. conda 4.1.0 changes --- ci/travis_before_script_cpp.sh | 2 +- ci/travis_conda_build.sh | 2 +- ci/travis_install_conda.sh | 4 +++- 3 files changed, 5 insertions(+), 3 deletions(-) diff --git a/ci/travis_before_script_cpp.sh b/ci/travis_before_script_cpp.sh index 6159f67e3613b..9060cc9b5ef22 100755 --- a/ci/travis_before_script_cpp.sh +++ b/ci/travis_before_script_cpp.sh @@ -1,6 +1,6 @@ #!/usr/bin/env bash -set -e +set -ex source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh conda install -y --channel apache/channel/dev parquet-cpp diff --git a/ci/travis_conda_build.sh b/ci/travis_conda_build.sh index c43a85170b094..a787df79a5574 100755 --- a/ci/travis_conda_build.sh +++ b/ci/travis_conda_build.sh @@ -1,6 +1,6 @@ #!/usr/bin/env bash -set -e +set -ex source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh diff --git a/ci/travis_install_conda.sh b/ci/travis_install_conda.sh index bef667dff7cc1..be7f59a4733bd 100644 --- a/ci/travis_install_conda.sh +++ b/ci/travis_install_conda.sh @@ -15,9 +15,11 @@ export PATH="$MINICONDA/bin:$PATH" conda update -y -q conda conda info -a -conda config --set show_channel_urls yes +conda config --set show_channel_urls True +conda config --add channels https://repo.continuum.io/pkgs/free conda config --add channels conda-forge conda config --add channels apache +conda info -a conda install --yes conda-build jinja2 anaconda-client From 790d5412da67f807159f236179a8a7df37b270d2 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 16 Jun 2016 10:50:40 -0700 Subject: [PATCH 0088/1644] ARROW-218: Add optional API token authentication option to PR merge tool You can use an API token with extremely limited privileges (i.e., only access to public GitHub repos), but this helps avoid rate limiting issues on shared outbound IP addresses. 
Author: Wes McKinney Closes #91 from wesm/ARROW-218 and squashes the following commits: f45808c [Wes McKinney] Add optional GitHub API token to patch tool (to avoid rate limiting issues with unauthenticated requests) --- dev/merge_arrow_pr.py | 12 +++++++++++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/dev/merge_arrow_pr.py b/dev/merge_arrow_pr.py index fe0bcd13dd8f1..981779ffb4c76 100755 --- a/dev/merge_arrow_pr.py +++ b/dev/merge_arrow_pr.py @@ -66,7 +66,17 @@ def get_json(url): try: - return json.load(urllib2.urlopen(url)) + from urllib2 import urlopen, Request + env_var = 'ARROW_GITHUB_API_TOKEN' + + if env_var in os.environ: + token = os.environ[env_var] + request = Request(url) + request.add_header('Authorization', 'token %s' % token) + response = urlopen(request) + else: + response = urlopen(url) + return json.load(response) except urllib2.HTTPError as e: print "Unable to fetch URL, exiting: %s" % url sys.exit(-1) From 27edd25eb4f714ff1cc2770ed5a1fbc695eb8a08 Mon Sep 17 00:00:00 2001 From: Micah Kornfield Date: Thu, 16 Jun 2016 10:58:18 -0700 Subject: [PATCH 0089/1644] ARROW-210: Cleanup of the string related types in C++ code base One thing that is worth discussing is if char types should also be removed (if they aren't i'll add the missing unit tests). I also moved CharType to type.h which seems more consistent with existing code. I can clean it up either way in a follow-up review if we decide with want to push types into their corresponding Array headers. Author: Micah Kornfield Closes #85 from emkornfield/emk_string_types_wip and squashes the following commits: 4414816 [Micah Kornfield] remove CHAR from parquet 6f0634c [Micah Kornfield] remove char type and add dcheck 58bfcc9 [Micah Kornfield] fix style of char_type_ 1e0152d [Micah Kornfield] wip --- cpp/src/arrow/parquet/schema.cc | 5 - cpp/src/arrow/type.cc | 17 ++- cpp/src/arrow/type.h | 55 ++++++--- cpp/src/arrow/types/construct.cc | 2 - cpp/src/arrow/types/decimal.h | 1 - cpp/src/arrow/types/list.h | 8 +- cpp/src/arrow/types/string-test.cc | 188 ++++++++++++++++++++++++----- cpp/src/arrow/types/string.cc | 40 ++++-- cpp/src/arrow/types/string.h | 104 +++++++++------- cpp/src/arrow/util/macros.h | 2 +- 10 files changed, 307 insertions(+), 115 deletions(-) diff --git a/cpp/src/arrow/parquet/schema.cc b/cpp/src/arrow/parquet/schema.cc index fd758940c9f3a..c7979db349453 100644 --- a/cpp/src/arrow/parquet/schema.cc +++ b/cpp/src/arrow/parquet/schema.cc @@ -250,11 +250,6 @@ Status FieldToNode(const std::shared_ptr& field, NodePtr* out) { case Type::DOUBLE: type = ParquetType::DOUBLE; break; - case Type::CHAR: - type = ParquetType::FIXED_LEN_BYTE_ARRAY; - logical_type = LogicalType::UTF8; - length = static_cast(field->type.get())->size; - break; case Type::STRING: type = ParquetType::BYTE_ARRAY; logical_type = LogicalType::UTF8; diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 4e686d9cf4a6f..4fd50b7c19365 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -31,7 +31,18 @@ std::string Field::ToString() const { DataType::~DataType() {} -StringType::StringType() : DataType(Type::STRING) {} +bool DataType::Equals(const DataType* other) const { + bool equals = other && ((this == other) || + ((this->type == other->type) && + ((this->num_children() == other->num_children())))); + if (equals) { + for (int i = 0; i < num_children(); ++i) { + // TODO(emkornfield) limit recursion + if (!children_[i]->Equals(other->children_[i])) { return false; } + } + } + return equals; +} std::string 
StringType::ToString() const { std::string result(name()); @@ -44,6 +55,10 @@ std::string ListType::ToString() const { return s.str(); } +std::string BinaryType::ToString() const { + return std::string(name()); +} + std::string StructType::ToString() const { std::stringstream s; s << "struct<"; diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index f366645cd5cf9..8fb41211ba945 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -23,6 +23,8 @@ #include #include +#include "arrow/util/macros.h" + namespace arrow { // Data types in this library are all *logical*. They can be expressed as @@ -53,15 +55,9 @@ struct Type { // 8-byte floating point value DOUBLE = 11, - // CHAR(N): fixed-length UTF8 string with length N - CHAR = 12, - // UTF8 variable-length string as List<Char> STRING = 13, - // VARCHAR(N): Null-terminated string type embedded in a CHAR(N + 1) - VARCHAR = 14, - // Variable-length bytes (no guarantee of UTF8-ness) BINARY = 15, @@ -114,12 +110,15 @@ struct DataType { virtual ~DataType(); - bool Equals(const DataType* other) { - // Call with a pointer so more friendly to subclasses - return other && ((this == other) || (this->type == other->type)); - } + // Return whether the types are equal + // + // Types that are logically convertible from one to another (e.g. List<UInt8> + // and Binary) are NOT equal. + virtual bool Equals(const DataType* other) const; - bool Equals(const std::shared_ptr<DataType>& other) { return Equals(other.get()); } + bool Equals(const std::shared_ptr<DataType>& other) const { + return Equals(other.get()); + } const std::shared_ptr<Field>& child(int i) const { return children_[i]; } @@ -236,9 +235,8 @@ struct DoubleType : public PrimitiveType { struct ListType : public DataType { // List can contain any other logical value type - explicit ListType(const std::shared_ptr<DataType>& value_type) : DataType(Type::LIST) { - children_ = {std::make_shared<Field>("item", value_type)}; - } + explicit ListType(const std::shared_ptr<DataType>& value_type) + : ListType(value_type, Type::LIST) {} explicit ListType(const std::shared_ptr<Field>& value_field) : DataType(Type::LIST) { children_ = {value_field}; @@ -251,15 +249,38 @@ struct ListType : public DataType { static char const* name() { return "list"; } std::string ToString() const override; + + protected: + // Constructor for classes that are implemented as List Arrays. + ListType(const std::shared_ptr<DataType>& value_type, Type::type logical_type) + : DataType(logical_type) { + // TODO ARROW-187 this can technically fail, make a constructor method? + children_ = {std::make_shared<Field>("item", value_type)}; + } }; -// String is a logical type consisting of a physical list of 1-byte values -struct StringType : public DataType { - StringType(); +// BinaryType represents lists of 1-byte values. +struct BinaryType : public ListType { + BinaryType() : BinaryType(Type::BINARY) {} + static char const* name() { return "binary"; } + std::string ToString() const override; + + protected: + // Allow subclasses to change the logical type.
+ explicit BinaryType(Type::type logical_type) + : ListType(std::shared_ptr<DataType>(new UInt8Type()), logical_type) {} +}; + +// UTF8-encoded strings +struct StringType : public BinaryType { + StringType() : BinaryType(Type::STRING) {} static char const* name() { return "string"; } std::string ToString() const override; + + protected: + explicit StringType(Type::type logical_type) : BinaryType(logical_type) {} }; struct StructType : public DataType { diff --git a/cpp/src/arrow/types/construct.cc b/cpp/src/arrow/types/construct.cc index bcb0ec490901f..2d913a737486f 100644 --- a/cpp/src/arrow/types/construct.cc +++ b/cpp/src/arrow/types/construct.cc @@ -127,10 +127,8 @@ Status MakeListArray(const TypePtr& type, int32_t length, case Type::LIST: out->reset(new ListArray(type, length, offsets, values, null_count, null_bitmap)); break; - case Type::CHAR: case Type::DECIMAL_TEXT: case Type::STRING: - case Type::VARCHAR: out->reset(new StringArray(type, length, offsets, values, null_count, null_bitmap)); break; default: diff --git a/cpp/src/arrow/types/decimal.h b/cpp/src/arrow/types/decimal.h index 1be489d4f51b6..598df3ef70d2d 100644 --- a/cpp/src/arrow/types/decimal.h +++ b/cpp/src/arrow/types/decimal.h @@ -29,7 +29,6 @@ struct DecimalType : public DataType { : DataType(Type::DECIMAL), precision(precision_), scale(scale_) {} int precision; int scale; - static char const* name() { return "decimal"; } std::string ToString() const override; diff --git a/cpp/src/arrow/types/list.h b/cpp/src/arrow/types/list.h index 0a3941633eb83..2f6f85d66ca60 100644 --- a/cpp/src/arrow/types/list.h +++ b/cpp/src/arrow/types/list.h @@ -66,8 +66,8 @@ class ListArray : public Array { int32_t offset(int i) const { return offsets_[i]; } // Neither of these functions will perform bounds checking - int32_t value_offset(int i) { return offsets_[i]; } - int32_t value_length(int i) { return offsets_[i + 1] - offsets_[i]; } + int32_t value_offset(int i) const { return offsets_[i]; } + int32_t value_length(int i) const { return offsets_[i + 1] - offsets_[i]; } bool EqualsExact(const ListArray& other) const; bool Equals(const std::shared_ptr<Array>& arr) const override; @@ -92,9 +92,9 @@ class ListArray : public Array { // a sequence of offsets and null values. // // A note on types. Per arrow/type.h all types in the c++ implementation are -// logical so even though this class always builds an Array of lists, this can +// logical so even though this class always builds a list array, this can // represent multiple different logical types. If no logical type is provided -// at construction time, the class defaults to List<T> where T is take from the +// at construction time, the class defaults to List<T> where T is taken from the // value_builder/values that the object is constructed with.
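// Editor's illustrative sketch (not part of this patch; all names come from the
// diffs above): BinaryType and StringType are now physically List<UInt8>, so the
// builder below can back either logical type and they differ only in the logical
// type they pass through. For example:
//
//   BinaryType t1;            // logical Type::BINARY
//   StringType t2;            // logical Type::STRING, same physical layout
//   assert(!t1.Equals(&t2));  // logically distinct, so not equal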
class ListBuilder : public ArrayBuilder { public: diff --git a/cpp/src/arrow/types/string-test.cc b/cpp/src/arrow/types/string-test.cc index ee4307c4d168a..a141fc113211a 100644 --- a/cpp/src/arrow/types/string-test.cc +++ b/cpp/src/arrow/types/string-test.cc @@ -34,32 +34,14 @@ namespace arrow { class Buffer; -TEST(TypesTest, TestCharType) { - CharType t1(5); - - ASSERT_EQ(t1.type, Type::CHAR); - ASSERT_EQ(t1.size, 5); - - ASSERT_EQ(t1.ToString(), std::string("char(5)")); - - // Test copy constructor - CharType t2 = t1; - ASSERT_EQ(t2.type, Type::CHAR); - ASSERT_EQ(t2.size, 5); -} - -TEST(TypesTest, TestVarcharType) { - VarcharType t1(5); - - ASSERT_EQ(t1.type, Type::VARCHAR); - ASSERT_EQ(t1.size, 5); - - ASSERT_EQ(t1.ToString(), std::string("varchar(5)")); - - // Test copy constructor - VarcharType t2 = t1; - ASSERT_EQ(t2.type, Type::VARCHAR); - ASSERT_EQ(t2.size, 5); +TEST(TypesTest, BinaryType) { + BinaryType t1; + BinaryType e1; + StringType t2; + EXPECT_TRUE(t1.Equals(&e1)); + EXPECT_FALSE(t1.Equals(&t2)); + ASSERT_EQ(t1.type, Type::BINARY); + ASSERT_EQ(t1.ToString(), std::string("binary")); } TEST(TypesTest, TestStringType) { @@ -119,6 +101,7 @@ class TestStringContainer : public ::testing::Test { TEST_F(TestStringContainer, TestArrayBasics) { ASSERT_EQ(length_, strings_->length()); ASSERT_EQ(1, strings_->null_count()); + ASSERT_OK(strings_->Validate()); } TEST_F(TestStringContainer, TestType) { @@ -163,7 +146,10 @@ class TestStringBuilder : public TestBuilder { builder_.reset(new StringBuilder(pool_, type_)); } - void Done() { result_ = std::dynamic_pointer_cast<StringArray>(builder_->Finish()); } + void Done() { + result_ = std::dynamic_pointer_cast<StringArray>(builder_->Finish()); + result_->Validate(); + } protected: TypePtr type_; @@ -216,4 +202,152 @@ TEST_F(TestStringBuilder, TestZeroLength) { Done(); } +// Binary container type +// TODO(emkornfield) there should be some way to refactor these to avoid duplicating +// code with String +class TestBinaryContainer : public ::testing::Test { + public: + void SetUp() { + chars_ = {'a', 'b', 'b', 'c', 'c', 'c'}; + offsets_ = {0, 1, 1, 1, 3, 6}; + valid_bytes_ = {1, 1, 0, 1, 1}; + expected_ = {"a", "", "", "bb", "ccc"}; + + MakeArray(); + } + + void MakeArray() { + length_ = offsets_.size() - 1; + int nchars = chars_.size(); + + value_buf_ = test::to_buffer(chars_); + values_ = ArrayPtr(new UInt8Array(nchars, value_buf_)); + + offsets_buf_ = test::to_buffer(offsets_); + + null_bitmap_ = test::bytes_to_null_buffer(valid_bytes_); + null_count_ = test::null_count(valid_bytes_); + + strings_ = std::make_shared<BinaryArray>( + length_, offsets_buf_, values_, null_count_, null_bitmap_); + } + + protected: + std::vector<int32_t> offsets_; + std::vector<uint8_t> chars_; + std::vector<uint8_t> valid_bytes_; + + std::vector<std::string> expected_; + + std::shared_ptr<Buffer> value_buf_; + std::shared_ptr<Buffer> offsets_buf_; + std::shared_ptr<Buffer> null_bitmap_; + + int null_count_; + int length_; + + ArrayPtr values_; + std::shared_ptr<BinaryArray> strings_; +}; + +TEST_F(TestBinaryContainer, TestArrayBasics) { + ASSERT_EQ(length_, strings_->length()); + ASSERT_EQ(1, strings_->null_count()); + ASSERT_OK(strings_->Validate()); +} + +TEST_F(TestBinaryContainer, TestType) { + TypePtr type = strings_->type(); + + ASSERT_EQ(Type::BINARY, type->type); + ASSERT_EQ(Type::BINARY, strings_->type_enum()); +} + +TEST_F(TestBinaryContainer, TestListFunctions) { + int pos = 0; + for (size_t i = 0; i < expected_.size(); ++i) { + ASSERT_EQ(pos, strings_->value_offset(i)); + ASSERT_EQ(expected_[i].size(), strings_->value_length(i)); + pos += expected_[i].size(); + } +}
+ +TEST_F(TestBinaryContainer, TestDestructor) { + auto arr = std::make_shared( + length_, offsets_buf_, values_, null_count_, null_bitmap_); +} + +TEST_F(TestBinaryContainer, TestGetValue) { + for (size_t i = 0; i < expected_.size(); ++i) { + if (valid_bytes_[i] == 0) { + ASSERT_TRUE(strings_->IsNull(i)); + } else { + int32_t len = -1; + const uint8_t* bytes = strings_->GetValue(i, &len); + ASSERT_EQ(0, std::memcmp(expected_[i].data(), bytes, len)); + } + } +} + +class TestBinaryBuilder : public TestBuilder { + public: + void SetUp() { + TestBuilder::SetUp(); + type_ = TypePtr(new BinaryType()); + builder_.reset(new BinaryBuilder(pool_, type_)); + } + + void Done() { + result_ = std::dynamic_pointer_cast(builder_->Finish()); + result_->Validate(); + } + + protected: + TypePtr type_; + + std::unique_ptr builder_; + std::shared_ptr result_; +}; + +TEST_F(TestBinaryBuilder, TestScalarAppend) { + std::vector strings = {"", "bb", "a", "", "ccc"}; + std::vector is_null = {0, 0, 0, 1, 0}; + + int N = strings.size(); + int reps = 1000; + + for (int j = 0; j < reps; ++j) { + for (int i = 0; i < N; ++i) { + if (is_null[i]) { + builder_->AppendNull(); + } else { + builder_->Append( + reinterpret_cast(strings[i].data()), strings[i].size()); + } + } + } + Done(); + ASSERT_OK(result_->Validate()); + ASSERT_EQ(reps * N, result_->length()); + ASSERT_EQ(reps, result_->null_count()); + ASSERT_EQ(reps * 6, result_->values()->length()); + + int32_t length; + for (int i = 0; i < N * reps; ++i) { + if (is_null[i % N]) { + ASSERT_TRUE(result_->IsNull(i)); + } else { + ASSERT_FALSE(result_->IsNull(i)); + const uint8_t* vals = result_->GetValue(i, &length); + ASSERT_EQ(strings[i % N].size(), length); + ASSERT_EQ(0, std::memcmp(vals, strings[i % N].data(), length)); + } + } +} + +TEST_F(TestBinaryBuilder, TestZeroLength) { + // All buffers are null + Done(); +} + } // namespace arrow diff --git a/cpp/src/arrow/types/string.cc b/cpp/src/arrow/types/string.cc index 29d97d039477c..da02c7d1d8a9e 100644 --- a/cpp/src/arrow/types/string.cc +++ b/cpp/src/arrow/types/string.cc @@ -24,25 +24,43 @@ namespace arrow { +const std::shared_ptr BINARY(new BinaryType()); const std::shared_ptr STRING(new StringType()); -StringArray::StringArray(int32_t length, const std::shared_ptr& offsets, +BinaryArray::BinaryArray(int32_t length, const std::shared_ptr& offsets, const ArrayPtr& values, int32_t null_count, const std::shared_ptr& null_bitmap) - : StringArray(STRING, length, offsets, values, null_count, null_bitmap) {} + : BinaryArray(BINARY, length, offsets, values, null_count, null_bitmap) {} + +BinaryArray::BinaryArray(const TypePtr& type, int32_t length, + const std::shared_ptr& offsets, const ArrayPtr& values, int32_t null_count, + const std::shared_ptr& null_bitmap) + : ListArray(type, length, offsets, values, null_count, null_bitmap), + bytes_(std::dynamic_pointer_cast(values).get()), + raw_bytes_(bytes_->raw_data()) { + // Check in case the dynamic cast fails. 
+ DCHECK(bytes_); +} -std::string CharType::ToString() const { - std::stringstream s; - s << "char(" << size << ")"; - return s.str(); +Status BinaryArray::Validate() const { + if (values()->null_count() > 0) { + std::stringstream ss; + ss << type()->ToString() << " cannot have null values in the value array"; + return Status::Invalid(ss.str()); + } + return ListArray::Validate(); } -std::string VarcharType::ToString() const { - std::stringstream s; - s << "varchar(" << size << ")"; - return s.str(); +StringArray::StringArray(int32_t length, const std::shared_ptr<Buffer>& offsets, + const ArrayPtr& values, int32_t null_count, + const std::shared_ptr<Buffer>& null_bitmap) + : StringArray(STRING, length, offsets, values, null_count, null_bitmap) {} + +Status StringArray::Validate() const { + // TODO(emkornfield) Validate proper UTF8 code points? + return BinaryArray::Validate(); } -TypePtr StringBuilder::value_type_ = TypePtr(new UInt8Type()); +TypePtr BinaryBuilder::value_type_ = TypePtr(new UInt8Type()); } // namespace arrow diff --git a/cpp/src/arrow/types/string.h b/cpp/src/arrow/types/string.h index d2d3c5b6b5a83..b3c00d298b35c 100644 --- a/cpp/src/arrow/types/string.h +++ b/cpp/src/arrow/types/string.h @@ -34,87 +34,99 @@ namespace arrow { class Buffer; class MemoryPool; -struct CharType : public DataType { - int size; - - explicit CharType(int size) : DataType(Type::CHAR), size(size) {} - - CharType(const CharType& other) : CharType(other.size) {} - - virtual std::string ToString() const; -}; - -// Variable-length, null-terminated strings, up to a certain length -struct VarcharType : public DataType { - int size; - - explicit VarcharType(int size) : DataType(Type::VARCHAR), size(size) {} - VarcharType(const VarcharType& other) : VarcharType(other.size) {} - - virtual std::string ToString() const; -}; - -// TODO(wesm): add a BinaryArray layer in between -class StringArray : public ListArray { +class BinaryArray : public ListArray { public: - StringArray(const TypePtr& type, int32_t length, const std::shared_ptr<Buffer>& offsets, + BinaryArray(int32_t length, const std::shared_ptr<Buffer>& offsets, const ArrayPtr& values, int32_t null_count = 0, - const std::shared_ptr<Buffer>& null_bitmap = nullptr) - : ListArray(type, length, offsets, values, null_count, null_bitmap) { - // For convenience - bytes_ = static_cast<UInt8Array*>(values.get()); - raw_bytes_ = bytes_->raw_data(); - } - - StringArray(int32_t length, const std::shared_ptr<Buffer>& offsets, + const std::shared_ptr<Buffer>& null_bitmap = nullptr); + // Constructor that allows sub-classes/builders to propagate their logical type up the + // class hierarchy.
+ BinaryArray(const TypePtr& type, int32_t length, const std::shared_ptr<Buffer>& offsets, const ArrayPtr& values, int32_t null_count = 0, const std::shared_ptr<Buffer>& null_bitmap = nullptr); - // Compute the pointer t + // Return the pointer to the given element's bytes + // TODO(emkornfield) introduce a StringPiece or something similar to capture zero-copy + // pointer + offset const uint8_t* GetValue(int i, int32_t* out_length) const { - int32_t pos = offsets_[i]; + DCHECK(out_length); + const int32_t pos = offsets_[i]; *out_length = offsets_[i + 1] - pos; return raw_bytes_ + pos; } + Status Validate() const override; + + private: + UInt8Array* bytes_; + const uint8_t* raw_bytes_; +}; + +class StringArray : public BinaryArray { + public: + StringArray(int32_t length, const std::shared_ptr<Buffer>& offsets, + const ArrayPtr& values, int32_t null_count = 0, + const std::shared_ptr<Buffer>& null_bitmap = nullptr); + // Constructor that allows overriding the logical type, so subclasses can propagate + // their type up the class hierarchy. + StringArray(const TypePtr& type, int32_t length, const std::shared_ptr<Buffer>& offsets, + const ArrayPtr& values, int32_t null_count = 0, + const std::shared_ptr<Buffer>& null_bitmap = nullptr) + : BinaryArray(type, length, offsets, values, null_count, null_bitmap) {} + // Construct a std::string + // TODO: std::bad_alloc possibility std::string GetString(int i) const { int32_t nchars; const uint8_t* str = GetValue(i, &nchars); return std::string(reinterpret_cast<const char*>(str), nchars); } - private: - UInt8Array* bytes_; - const uint8_t* raw_bytes_; + Status Validate() const override; }; -// String builder -class StringBuilder : public ListBuilder { +// BinaryBuilder : public ListBuilder +class BinaryBuilder : public ListBuilder { public: - explicit StringBuilder(MemoryPool* pool, const TypePtr& type) + explicit BinaryBuilder(MemoryPool* pool, const TypePtr& type) : ListBuilder(pool, std::make_shared<UInt8Builder>(pool, value_type_), type) { byte_builder_ = static_cast<UInt8Builder*>(value_builder_.get()); } - Status Append(const std::string& value) { return Append(value.c_str(), value.size()); } - - Status Append(const char* value, int32_t length) { + Status Append(const uint8_t* value, int32_t length) { RETURN_NOT_OK(ListBuilder::Append()); - return byte_builder_->Append(reinterpret_cast<const uint8_t*>(value), length); + return byte_builder_->Append(value, length); } - Status Append(const std::vector<std::string>& values, uint8_t* null_bytes); std::shared_ptr<Array> Finish() override { - return ListBuilder::Transfer<StringArray>(); + return ListBuilder::Transfer<BinaryArray>(); } protected: UInt8Builder* byte_builder_; - static TypePtr value_type_; }; +// String builder +class StringBuilder : public BinaryBuilder { + public: + explicit StringBuilder(MemoryPool* pool, const TypePtr& type) + : BinaryBuilder(pool, type) {} + + Status Append(const std::string& value) { return Append(value.c_str(), value.size()); } + + Status Append(const char* value, int32_t length) { + return BinaryBuilder::Append(reinterpret_cast<const uint8_t*>(value), length); + } + + Status Append(const std::vector<std::string>& values, uint8_t* null_bytes); + + std::shared_ptr<Array> Finish() override { + return ListBuilder::Transfer<StringArray>(); + } +}; + } // namespace arrow #endif // ARROW_TYPES_STRING_H diff --git a/cpp/src/arrow/util/macros.h b/cpp/src/arrow/util/macros.h index 51e605ee50ac4..69ecda16ceba5 100644 --- a/cpp/src/arrow/util/macros.h +++ b/cpp/src/arrow/util/macros.h @@ -21,6 +21,6 @@ // From Google gutil #define DISALLOW_COPY_AND_ASSIGN(TypeName) \ TypeName(const TypeName&) = delete; \ - void operator=(const TypeName&) = delete + TypeName&
operator=(const TypeName&) = delete #endif // ARROW_UTIL_MACROS_H From a3e3849cde60f611ea47271f510a96c2f36606a7 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 16 Jun 2016 22:07:30 -0700 Subject: [PATCH 0090/1644] ARROW-219: Preserve CMAKE_CXX_FLAGS, fix compiler warnings Some undesired compiler warnings had crept into our build; future warnings should fail the build now. Author: Wes McKinney Closes #92 from wesm/ARROW-219 and squashes the following commits: fd68a74 [Wes McKinney] Buglet 6507351 [Wes McKinney] Fix clang warning 0f9e3ca [Wes McKinney] Preserve CMAKE_CXX_FLAGS, fix compiler warnings --- cpp/CMakeLists.txt | 13 +++++++------ cpp/src/arrow/parquet/test-util.h | 2 +- cpp/src/arrow/parquet/writer.cc | 8 +++++++- cpp/src/arrow/parquet/writer.h | 2 ++ cpp/src/arrow/util/macros.h | 2 ++ 5 files changed, 19 insertions(+), 8 deletions(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index a3fb01076d44e..bdf757238cc6b 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -139,15 +139,15 @@ string (TOUPPER ${CMAKE_BUILD_TYPE} CMAKE_BUILD_TYPE) # Set compile flags based on the build type. message("Configured for ${CMAKE_BUILD_TYPE} build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...})") if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG") - set(CMAKE_CXX_FLAGS ${CXX_FLAGS_DEBUG}) + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX_FLAGS_DEBUG}") elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "FASTDEBUG") - set(CMAKE_CXX_FLAGS ${CXX_FLAGS_FASTDEBUG}) + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX_FLAGS_FASTDEBUG}") elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "RELEASE") - set(CMAKE_CXX_FLAGS ${CXX_FLAGS_RELEASE}) + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX_FLAGS_RELEASE}") elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "PROFILE_GEN") - set(CMAKE_CXX_FLAGS ${CXX_FLAGS_PROFILE_GEN}) + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX_FLAGS_PROFILE_GEN}") elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "PROFILE_BUILD") - set(CMAKE_CXX_FLAGS ${CXX_FLAGS_PROFILE_BUILD}) + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX_FLAGS_PROFILE_BUILD}") else() message(FATAL_ERROR "Unknown build type: ${CMAKE_BUILD_TYPE}") endif () @@ -165,6 +165,7 @@ if ("${COMPILER_FAMILY}" STREQUAL "clang") # http://petereisentraut.blogspot.com/2011/05/ccache-and-clang.html # http://petereisentraut.blogspot.com/2011/09/ccache-and-clang-part-2.html set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Qunused-arguments") + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CMAKE_CLANG_OPTIONS}") endif() # Sanity check linking option. @@ -559,7 +560,7 @@ if (${CLANG_TIDY_FOUND}) add_custom_target(clang-tidy ${BUILD_SUPPORT_DIR}/run-clang-tidy.sh ${CLANG_TIDY_BIN} ${CMAKE_BINARY_DIR}/compile_commands.json 1 `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc | sed -e '/_generated/g'`) # runs clang-tidy and exits with a non-zero exit code if any errors are found. - add_custom_target(check-clang-tidy ${BUILD_SUPPORT_DIR}/run-clang-tidy.sh ${CLANG_TIDY_BIN} ${CMAKE_BINARY_DIR}/compile_commands.json + add_custom_target(check-clang-tidy ${BUILD_SUPPORT_DIR}/run-clang-tidy.sh ${CLANG_TIDY_BIN} ${CMAKE_BINARY_DIR}/compile_commands.json 0 `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc |grep -v -F -f ${CMAKE_CURRENT_SOURCE_DIR}/src/.clang-tidy-ignore | sed -e '/_generated/g'`) endif() diff --git a/cpp/src/arrow/parquet/test-util.h b/cpp/src/arrow/parquet/test-util.h index 1496082d5c661..cc8723bf6ecab 100644 --- a/cpp/src/arrow/parquet/test-util.h +++ b/cpp/src/arrow/parquet/test-util.h @@ -67,7 +67,7 @@ std::shared_ptr
MakeSimpleTable( template <typename T> void ExpectArray(T* expected, Array* result) { PrimitiveArray* p_array = static_cast<PrimitiveArray*>(result); - for (size_t i = 0; i < result->length(); i++) { + for (int i = 0; i < result->length(); i++) { EXPECT_EQ(expected[i], reinterpret_cast<const T*>(p_array->data()->data())[i]); } } diff --git a/cpp/src/arrow/parquet/writer.cc b/cpp/src/arrow/parquet/writer.cc index 1223901d5505a..4005e3b2b0c1b 100644 --- a/cpp/src/arrow/parquet/writer.cc +++ b/cpp/src/arrow/parquet/writer.cc @@ -50,6 +50,8 @@ class FileWriter::Impl { virtual ~Impl() {} private: + friend class FileWriter; + MemoryPool* pool_; PoolBuffer data_buffer_; PoolBuffer def_levels_buffer_; @@ -94,7 +96,7 @@ Status FileWriter::Impl::TypedWriteBatch(::parquet::ColumnWriter* column_writer, auto buffer_ptr = reinterpret_cast(data_buffer_.mutable_data()); int buffer_idx = 0; - for (size_t i = 0; i < length; i++) { + for (int i = 0; i < length; i++) { if (data->IsNull(offset + i)) { def_levels_ptr[i] = 0; } else { @@ -156,6 +158,10 @@ Status FileWriter::Close() { return impl_->Close(); } +MemoryPool* FileWriter::memory_pool() const { + return impl_->pool_; +} + FileWriter::~FileWriter() {} Status WriteFlatTable(const Table* table, MemoryPool* pool, diff --git a/cpp/src/arrow/parquet/writer.h b/cpp/src/arrow/parquet/writer.h index 83e799f7ed1ed..93693f511846b 100644 --- a/cpp/src/arrow/parquet/writer.h +++ b/cpp/src/arrow/parquet/writer.h @@ -49,6 +49,8 @@ class FileWriter { virtual ~FileWriter(); + MemoryPool* memory_pool() const; + private: class Impl; std::unique_ptr<Impl> impl_; diff --git a/cpp/src/arrow/util/macros.h b/cpp/src/arrow/util/macros.h index 69ecda16ceba5..e2bb355115b42 100644 --- a/cpp/src/arrow/util/macros.h +++ b/cpp/src/arrow/util/macros.h @@ -19,8 +19,10 @@ #define ARROW_UTIL_MACROS_H // From Google gutil +#ifndef DISALLOW_COPY_AND_ASSIGN #define DISALLOW_COPY_AND_ASSIGN(TypeName) \ TypeName(const TypeName&) = delete; \ TypeName& operator=(const TypeName&) = delete +#endif #endif // ARROW_UTIL_MACROS_H From f7ade7bfeaa7e0d7fb3dd9d5a93e29a413cc142a Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Tue, 21 Jun 2016 15:11:26 -0700 Subject: [PATCH 0091/1644] ARROW-223: Do not link against libpython Author: Uwe L. Korn Closes #95 from xhochy/arrow-223 and squashes the following commits: 4fdf1e7 [Uwe L. Korn] ARROW-223: Do not link against libpython Change-Id: I1238a48aaf94ab175b367551f74c335c6455d78a --- python/cmake_modules/FindPythonLibsNew.cmake | 6 +++++- python/cmake_modules/UseCython.cmake | 1 - 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/python/cmake_modules/FindPythonLibsNew.cmake b/python/cmake_modules/FindPythonLibsNew.cmake index 0f2295aa43bc1..5cb65c9f1a484 100644 --- a/python/cmake_modules/FindPythonLibsNew.cmake +++ b/python/cmake_modules/FindPythonLibsNew.cmake @@ -224,7 +224,11 @@ FUNCTION(PYTHON_ADD_MODULE _NAME ) SET_TARGET_PROPERTIES(${_NAME} PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") ELSE() - TARGET_LINK_LIBRARIES(${_NAME} ${PYTHON_LIBRARIES}) + # In general, we should not link against libpython as we do not embed + # the Python interpreter. The python binary itself can then define where + # the symbols should be loaded from.
+ SET_TARGET_PROPERTIES(${_NAME} PROPERTIES LINK_FLAGS + "-Wl,-undefined,dynamic_lookup") ENDIF() IF(PYTHON_MODULE_${_NAME}_BUILD_SHARED) SET_TARGET_PROPERTIES(${_NAME} PROPERTIES PREFIX "${PYTHON_MODULE_PREFIX}") ENDIF() diff --git a/python/cmake_modules/UseCython.cmake b/python/cmake_modules/UseCython.cmake index 3b1c201edff5f..cee6066d31de0 100644 --- a/python/cmake_modules/UseCython.cmake +++ b/python/cmake_modules/UseCython.cmake @@ -163,7 +163,6 @@ function( cython_add_module _name pyx_target_name generated_files) include_directories( ${PYTHON_INCLUDE_DIRS} ) python_add_module( ${_name} ${_generated_files} ${other_module_sources} ) add_dependencies( ${_name} ${pyx_target_name}) - target_link_libraries( ${_name} ${PYTHON_LIBRARIES} ) endfunction() include( CMakeParseArguments ) From ef90830290491294d2fccfc5dcb16d3c0f96a70a Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 24 Jun 2016 16:41:08 -0700 Subject: [PATCH 0092/1644] ARROW-222: Prototyping an IO interface for Arrow, with initial HDFS target - Switch Travis CI back to Ubuntu trusty (old Boost in precise has issues with C++11) - Adapt SFrame libhdfs shim for Arrow - Create C++ public API within arrow::io to libhdfs - Implement and test many functions in libhdfs - Start Cython wrapper interface to arrow_io. Begin Python file-like interface and unit tests - Add thirdparty hdfs.h so builds are possible without a local Hadoop distro (e.g. in Travis CI). Change-Id: I4a46e50f6c1c22787baa3749d8a542216341e630 --- .travis.yml | 5 +- NOTICE.txt | 9 + ci/travis_before_script_cpp.sh | 15 +- cpp/CMakeLists.txt | 60 +- cpp/doc/HDFS.md | 39 + cpp/src/arrow/io/CMakeLists.txt | 97 ++ cpp/src/arrow/io/hdfs-io-test.cc | 315 +++++++ cpp/src/arrow/io/hdfs.cc | 458 ++++++++++ cpp/src/arrow/io/hdfs.h | 213 +++++ cpp/src/arrow/io/interfaces.h | 71 ++ cpp/src/arrow/io/libhdfs_shim.cc | 544 ++++++++++++ cpp/src/arrow/parquet/parquet-io-test.cc | 4 +- cpp/thirdparty/hadoop/include/hdfs.h | 1024 ++++++++++++++++++++++ dev/merge_arrow_pr.py | 5 +- python/CMakeLists.txt | 6 +- python/cmake_modules/FindArrow.cmake | 17 +- python/conda.recipe/meta.yaml | 1 + python/pyarrow/error.pxd | 4 +- python/pyarrow/error.pyx | 14 +- python/pyarrow/includes/common.pxd | 18 + python/pyarrow/includes/libarrow.pxd | 19 - python/pyarrow/includes/libarrow_io.pxd | 93 ++ python/pyarrow/io.pyx | 504 +++++++++++ python/pyarrow/tests/test_array.py | 47 +- python/pyarrow/tests/test_io.py | 126 +++ python/setup.py | 9 +- 26 files changed, 3656 insertions(+), 61 deletions(-) create mode 100644 NOTICE.txt create mode 100644 cpp/doc/HDFS.md create mode 100644 cpp/src/arrow/io/CMakeLists.txt create mode 100644 cpp/src/arrow/io/hdfs-io-test.cc create mode 100644 cpp/src/arrow/io/hdfs.cc create mode 100644 cpp/src/arrow/io/hdfs.h create mode 100644 cpp/src/arrow/io/interfaces.h create mode 100644 cpp/src/arrow/io/libhdfs_shim.cc create mode 100644 cpp/thirdparty/hadoop/include/hdfs.h create mode 100644 python/pyarrow/includes/libarrow_io.pxd create mode 100644 python/pyarrow/io.pyx create mode 100644 python/pyarrow/tests/test_io.py diff --git a/.travis.yml b/.travis.yml index ac2b0d457cb8e..97229b1ceb3bc 100644 --- a/.travis.yml +++ b/.travis.yml @@ -1,5 +1,5 @@ sudo: required -dist: precise +dist: trusty addons: apt: sources: @@ -12,6 +12,9 @@ addons: - ccache - cmake - valgrind + - libboost-dev + - libboost-filesystem-dev + - libboost-system-dev matrix: fast_finish: true diff --git a/NOTICE.txt b/NOTICE.txt new file mode 100644 index 0000000000000..0310c897cd743 --- /dev/null +++ b/NOTICE.txt
@@ -0,0 +1,9 @@ +Apache Arrow +Copyright 2016 The Apache Software Foundation + +This product includes software developed at +The Apache Software Foundation (http://www.apache.org/). + +This product includes software from the SFrame project (BSD, 3-clause). +* Copyright (C) 2015 Dato, Inc. +* Copyright (c) 2009 Carnegie Mellon University. diff --git a/ci/travis_before_script_cpp.sh b/ci/travis_before_script_cpp.sh index 9060cc9b5ef22..08551f3b009a8 100755 --- a/ci/travis_before_script_cpp.sh +++ b/ci/travis_before_script_cpp.sh @@ -23,12 +23,21 @@ echo $GTEST_HOME : ${ARROW_CPP_INSTALL=$TRAVIS_BUILD_DIR/cpp-install} -CMAKE_COMMON_FLAGS="-DARROW_BUILD_BENCHMARKS=ON -DARROW_PARQUET=ON -DCMAKE_INSTALL_PREFIX=$ARROW_CPP_INSTALL" +CMAKE_COMMON_FLAGS="\ +-DARROW_BUILD_BENCHMARKS=ON \ +-DARROW_PARQUET=ON \ +-DARROW_HDFS=on \ +-DCMAKE_INSTALL_PREFIX=$ARROW_CPP_INSTALL" if [ $TRAVIS_OS_NAME == "linux" ]; then - cmake -DARROW_TEST_MEMCHECK=on $CMAKE_COMMON_FLAGS -DCMAKE_CXX_FLAGS="-Werror" $CPP_DIR + cmake -DARROW_TEST_MEMCHECK=on \ + $CMAKE_COMMON_FLAGS \ + -DCMAKE_CXX_FLAGS="-Werror" \ + $CPP_DIR else - cmake $CMAKE_COMMON_FLAGS -DCMAKE_CXX_FLAGS="-Werror" $CPP_DIR + cmake $CMAKE_COMMON_FLAGS \ + -DCMAKE_CXX_FLAGS="-Werror" \ + $CPP_DIR fi make -j4 diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index bdf757238cc6b..18b47599b93d0 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -62,6 +62,10 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") "Build the Arrow IPC extensions" ON) + option(ARROW_HDFS + "Build the Arrow IO extensions for the Hadoop file system" + OFF) + option(ARROW_SSE3 "Build Arrow with SSE3" ON) @@ -454,6 +458,47 @@ if ("$ENV{GBENCHMARK_HOME}" STREQUAL "") set(GBENCHMARK_HOME ${THIRDPARTY_DIR}/installed) endif() +# ---------------------------------------------------------------------- +# Add Boost dependencies (code adapted from Apache Kudu (incubating)) + +# find boost headers and libs +set(Boost_DEBUG TRUE) +set(Boost_USE_MULTITHREADED ON) +set(Boost_USE_STATIC_LIBS ON) +find_package(Boost COMPONENTS system filesystem REQUIRED) +include_directories(SYSTEM ${Boost_INCLUDE_DIRS}) +set(BOOST_STATIC_LIBS ${Boost_LIBRARIES}) +list(LENGTH BOOST_STATIC_LIBS BOOST_STATIC_LIBS_LEN) + +# Find Boost shared libraries. +set(Boost_USE_STATIC_LIBS OFF) +find_package(Boost COMPONENTS system filesystem REQUIRED) +set(BOOST_SHARED_LIBS ${Boost_LIBRARIES}) +list(LENGTH BOOST_SHARED_LIBS BOOST_SHARED_LIBS_LEN) +list(SORT BOOST_SHARED_LIBS) + +message(STATUS "Boost include dir: " ${Boost_INCLUDE_DIRS}) +message(STATUS "Boost libraries: " ${Boost_LIBRARIES}) + +math(EXPR LAST_IDX "${BOOST_STATIC_LIBS_LEN} - 1") +foreach(IDX RANGE ${LAST_IDX}) + list(GET BOOST_STATIC_LIBS ${IDX} BOOST_STATIC_LIB) + list(GET BOOST_SHARED_LIBS ${IDX} BOOST_SHARED_LIB) + + # Remove the prefix/suffix from the library name. + # + # e.g. libboost_system-mt --> boost_system + get_filename_component(LIB_NAME ${BOOST_STATIC_LIB} NAME_WE) + string(REGEX REPLACE "lib([^-]*)(-mt)?" 
"\\1" LIB_NAME_NO_PREFIX_SUFFIX ${LIB_NAME}) + ADD_THIRDPARTY_LIB(${LIB_NAME_NO_PREFIX_SUFFIX} + STATIC_LIB "${BOOST_STATIC_LIB}" + SHARED_LIB "${BOOST_SHARED_LIB}") + list(APPEND ARROW_BOOST_LIBS ${LIB_NAME_NO_PREFIX_SUFFIX}) +endforeach() +include_directories(SYSTEM ${Boost_INCLUDE_DIR}) + +# ---------------------------------------------------------------------- +# Enable / disable tests and benchmarks if(ARROW_BUILD_TESTS) add_custom_target(unittest ctest -L unittest) @@ -529,12 +574,24 @@ endif (UNIX) # "make lint" target ############################################################ if (UNIX) + + file(GLOB_RECURSE LINT_FILES + "${CMAKE_CURRENT_SOURCE_DIR}/src/*.h" + "${CMAKE_CURRENT_SOURCE_DIR}/src/*.cc" + ) + + FOREACH(item ${LINT_FILES}) + IF(NOT (item MATCHES "_generated.h")) + LIST(APPEND FILTERED_LINT_FILES ${item}) + ENDIF() + ENDFOREACH(item ${LINT_FILES}) + # Full lint add_custom_target(lint ${BUILD_SUPPORT_DIR}/cpplint.py --verbose=2 --linelength=90 --filter=-whitespace/comments,-readability/todo,-build/header_guard,-build/c++11,-runtime/references - `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc -or -name \\*.h | sed -e '/_generated/g'`) + ${FILTERED_LINT_FILES}) endif (UNIX) @@ -624,6 +681,7 @@ set_target_properties(arrow target_link_libraries(arrow ${LIBARROW_LINK_LIBS}) add_subdirectory(src/arrow) +add_subdirectory(src/arrow/io) add_subdirectory(src/arrow/util) add_subdirectory(src/arrow/types) diff --git a/cpp/doc/HDFS.md b/cpp/doc/HDFS.md new file mode 100644 index 0000000000000..e0d5dfda21d93 --- /dev/null +++ b/cpp/doc/HDFS.md @@ -0,0 +1,39 @@ +## Using Arrow's HDFS (Apache Hadoop Distributed File System) interface + +### Build requirements + +To build the integration, pass the following option to CMake + +```shell +-DARROW_HDFS=on +``` + +For convenience, we have bundled `hdfs.h` for libhdfs from Apache Hadoop in +Arrow's thirdparty. If you wish to build against the `hdfs.h` in your installed +Hadoop distribution, set the `$HADOOP_HOME` environment variable. + +### Runtime requirements + +By default, the HDFS client C++ class in `libarrow_io` uses the libhdfs JNI +interface to the Java Hadoop client. This library is loaded **at runtime** +(rather than at link / library load time, since the library may not be in your +LD_LIBRARY_PATH), and relies on some environment variables. + +* `HADOOP_HOME`: the root of your installed Hadoop distribution. Check in the + `lib/native` directory to look for `libhdfs.so` if you have any questions + about which directory you're after. +* `JAVA_HOME`: the location of your Java SDK installation +* `CLASSPATH`: must contain the Hadoop jars. You can set these using: + +```shell +export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob` +``` + +#### Setting $JAVA_HOME automatically on OS X + +The installed location of Java on OS X can vary, however the following snippet +will set it automatically for you: + +```shell +export JAVA_HOME=$(/usr/libexec/java_home) +``` \ No newline at end of file diff --git a/cpp/src/arrow/io/CMakeLists.txt b/cpp/src/arrow/io/CMakeLists.txt new file mode 100644 index 0000000000000..33b654f81903f --- /dev/null +++ b/cpp/src/arrow/io/CMakeLists.txt @@ -0,0 +1,97 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# ---------------------------------------------------------------------- +# arrow_io : Arrow IO interfaces + +set(ARROW_IO_LINK_LIBS + arrow +) + +set(ARROW_IO_PRIVATE_LINK_LIBS + boost_system + boost_filesystem +) + +set(ARROW_IO_TEST_LINK_LIBS + arrow_io + ${ARROW_IO_PRIVATE_LINK_LIBS}) + +set(ARROW_IO_SRCS +) + +if(ARROW_HDFS) + if(NOT THIRDPARTY_DIR) + message(FATAL_ERROR "THIRDPARTY_DIR not set") + endif() + + if (DEFINED ENV{HADOOP_HOME}) + set(HADOOP_HOME $ENV{HADOOP_HOME}) + else() + set(HADOOP_HOME "${THIRDPARTY_DIR}/hadoop") + endif() + + set(HDFS_H_PATH "${HADOOP_HOME}/include/hdfs.h") + if (NOT EXISTS ${HDFS_H_PATH}) + message(FATAL_ERROR "Did not find hdfs.h at ${HDFS_H_PATH}") + endif() + message(STATUS "Found hdfs.h at: " ${HDFS_H_PATH}) + message(STATUS "Building libhdfs shim component") + + include_directories(SYSTEM "${HADOOP_HOME}/include") + + set(ARROW_HDFS_SRCS + hdfs.cc + libhdfs_shim.cc) + + set_property(SOURCE ${ARROW_HDFS_SRCS} + APPEND_STRING PROPERTY + COMPILE_FLAGS "-DHAS_HADOOP") + + set(ARROW_IO_SRCS + ${ARROW_HDFS_SRCS} + ${ARROW_IO_SRCS}) + + ADD_ARROW_TEST(hdfs-io-test) + ARROW_TEST_LINK_LIBRARIES(hdfs-io-test + ${ARROW_IO_TEST_LINK_LIBS}) +endif() + +add_library(arrow_io SHARED + ${ARROW_IO_SRCS} +) +target_link_libraries(arrow_io LINK_PUBLIC ${ARROW_IO_LINK_LIBS}) +target_link_libraries(arrow_io LINK_PRIVATE ${ARROW_IO_PRIVATE_LINK_LIBS}) + +SET_TARGET_PROPERTIES(arrow_io PROPERTIES LINKER_LANGUAGE CXX) + +if (APPLE) + set_target_properties(arrow_io + PROPERTIES + BUILD_WITH_INSTALL_RPATH ON + INSTALL_NAME_DIR "@rpath") +endif() + +# Headers: top level +install(FILES + hdfs.h + interfaces.h + DESTINATION include/arrow/io) + +install(TARGETS arrow_io + LIBRARY DESTINATION lib + ARCHIVE DESTINATION lib) diff --git a/cpp/src/arrow/io/hdfs-io-test.cc b/cpp/src/arrow/io/hdfs-io-test.cc new file mode 100644 index 0000000000000..11d67aeba2026 --- /dev/null +++ b/cpp/src/arrow/io/hdfs-io-test.cc @@ -0,0 +1,315 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include +#include +#include +#include + +#include "gtest/gtest.h" + +#include // NOLINT + +#include "arrow/io/hdfs.h" +#include "arrow/test-util.h" +#include "arrow/util/status.h" + +namespace arrow { +namespace io { + +std::vector RandomData(int64_t size) { + std::vector buffer(size); + test::random_bytes(size, 0, buffer.data()); + return buffer; +} + +class TestHdfsClient : public ::testing::Test { + public: + Status MakeScratchDir() { + if (client_->Exists(scratch_dir_)) { + RETURN_NOT_OK((client_->Delete(scratch_dir_, true))); + } + return client_->CreateDirectory(scratch_dir_); + } + + Status WriteDummyFile(const std::string& path, const uint8_t* buffer, int64_t size, + bool append = false, int buffer_size = 0, int replication = 0, + int default_block_size = 0) { + std::shared_ptr file; + RETURN_NOT_OK(client_->OpenWriteable( + path, append, buffer_size, replication, default_block_size, &file)); + + RETURN_NOT_OK(file->Write(buffer, size)); + RETURN_NOT_OK(file->Close()); + + return Status::OK(); + } + + std::string ScratchPath(const std::string& name) { + std::stringstream ss; + ss << scratch_dir_ << "/" << name; + return ss.str(); + } + + std::string HdfsAbsPath(const std::string& relpath) { + std::stringstream ss; + ss << "hdfs://" << conf_.host << ":" << conf_.port << relpath; + return ss.str(); + } + + protected: + // Set up shared state between unit tests + static void SetUpTestCase() { + if (!ConnectLibHdfs().ok()) { + std::cout << "Loading libhdfs failed, skipping tests gracefully" << std::endl; + return; + } + + loaded_libhdfs_ = true; + + const char* host = std::getenv("ARROW_HDFS_TEST_HOST"); + const char* port = std::getenv("ARROW_HDFS_TEST_PORT"); + const char* user = std::getenv("ARROW_HDFS_TEST_USER"); + + ASSERT_TRUE(user) << "Set ARROW_HDFS_TEST_USER"; + + conf_.host = host == nullptr ? "localhost" : host; + conf_.user = user; + conf_.port = port == nullptr ? 
20500 : atoi(port); + + ASSERT_OK(HdfsClient::Connect(&conf_, &client_)); + } + + static void TearDownTestCase() { + if (client_) { + EXPECT_OK(client_->Delete(scratch_dir_, true)); + EXPECT_OK(client_->Disconnect()); + } + } + + static bool loaded_libhdfs_; + + // Resources shared amongst unit tests + static HdfsConnectionConfig conf_; + static std::string scratch_dir_; + static std::shared_ptr client_; +}; + +bool TestHdfsClient::loaded_libhdfs_ = false; +HdfsConnectionConfig TestHdfsClient::conf_ = HdfsConnectionConfig(); + +std::string TestHdfsClient::scratch_dir_ = + boost::filesystem::unique_path("/tmp/arrow-hdfs/scratch-%%%%").native(); + +std::shared_ptr TestHdfsClient::client_ = nullptr; + +#define SKIP_IF_NO_LIBHDFS() \ + if (!loaded_libhdfs_) { \ + std::cout << "No libhdfs, skipping" << std::endl; \ + return; \ + } + +TEST_F(TestHdfsClient, ConnectsAgain) { + SKIP_IF_NO_LIBHDFS(); + + std::shared_ptr client; + ASSERT_OK(HdfsClient::Connect(&conf_, &client)); + ASSERT_OK(client->Disconnect()); +} + +TEST_F(TestHdfsClient, CreateDirectory) { + SKIP_IF_NO_LIBHDFS(); + + std::string path = ScratchPath("create-directory"); + + if (client_->Exists(path)) { ASSERT_OK(client_->Delete(path, true)); } + + ASSERT_OK(client_->CreateDirectory(path)); + ASSERT_TRUE(client_->Exists(path)); + EXPECT_OK(client_->Delete(path, true)); + ASSERT_FALSE(client_->Exists(path)); +} + +TEST_F(TestHdfsClient, GetCapacityUsed) { + SKIP_IF_NO_LIBHDFS(); + + // Who knows what is actually in your DFS cluster, but expect it to have + // positive used bytes and capacity + int64_t nbytes = 0; + ASSERT_OK(client_->GetCapacity(&nbytes)); + ASSERT_LT(0, nbytes); + + ASSERT_OK(client_->GetUsed(&nbytes)); + ASSERT_LT(0, nbytes); +} + +TEST_F(TestHdfsClient, GetPathInfo) { + SKIP_IF_NO_LIBHDFS(); + + HdfsPathInfo info; + + ASSERT_OK(MakeScratchDir()); + + // Directory info + ASSERT_OK(client_->GetPathInfo(scratch_dir_, &info)); + ASSERT_EQ(ObjectType::DIRECTORY, info.kind); + ASSERT_EQ(HdfsAbsPath(scratch_dir_), info.name); + ASSERT_EQ(conf_.user, info.owner); + + // TODO(wesm): test group, other attrs + + auto path = ScratchPath("test-file"); + + const int size = 100; + + std::vector buffer = RandomData(size); + + ASSERT_OK(WriteDummyFile(path, buffer.data(), size)); + ASSERT_OK(client_->GetPathInfo(path, &info)); + + ASSERT_EQ(ObjectType::FILE, info.kind); + ASSERT_EQ(HdfsAbsPath(path), info.name); + ASSERT_EQ(conf_.user, info.owner); + ASSERT_EQ(size, info.size); +} + +TEST_F(TestHdfsClient, AppendToFile) { + SKIP_IF_NO_LIBHDFS(); + + ASSERT_OK(MakeScratchDir()); + + auto path = ScratchPath("test-file"); + const int size = 100; + + std::vector buffer = RandomData(size); + ASSERT_OK(WriteDummyFile(path, buffer.data(), size)); + + // now append + ASSERT_OK(WriteDummyFile(path, buffer.data(), size, true)); + + HdfsPathInfo info; + ASSERT_OK(client_->GetPathInfo(path, &info)); + ASSERT_EQ(size * 2, info.size); +} + +TEST_F(TestHdfsClient, ListDirectory) { + SKIP_IF_NO_LIBHDFS(); + + const int size = 100; + std::vector data = RandomData(size); + + auto p1 = ScratchPath("test-file-1"); + auto p2 = ScratchPath("test-file-2"); + auto d1 = ScratchPath("test-dir-1"); + + ASSERT_OK(MakeScratchDir()); + ASSERT_OK(WriteDummyFile(p1, data.data(), size)); + ASSERT_OK(WriteDummyFile(p2, data.data(), size / 2)); + ASSERT_OK(client_->CreateDirectory(d1)); + + std::vector listing; + ASSERT_OK(client_->ListDirectory(scratch_dir_, &listing)); + + // Do it again, appends! 
+ ASSERT_OK(client_->ListDirectory(scratch_dir_, &listing)); + + ASSERT_EQ(6, listing.size()); + + // Argh, well, shouldn't expect the listing to be in any particular order + for (size_t i = 0; i < listing.size(); ++i) { + const HdfsPathInfo& info = listing[i]; + if (info.name == HdfsAbsPath(p1)) { + ASSERT_EQ(ObjectType::FILE, info.kind); + ASSERT_EQ(size, info.size); + } else if (info.name == HdfsAbsPath(p2)) { + ASSERT_EQ(ObjectType::FILE, info.kind); + ASSERT_EQ(size / 2, info.size); + } else if (info.name == HdfsAbsPath(d1)) { + ASSERT_EQ(ObjectType::DIRECTORY, info.kind); + } else { + FAIL() << "Unexpected path: " << info.name; + } + } +} + +TEST_F(TestHdfsClient, ReadableMethods) { + SKIP_IF_NO_LIBHDFS(); + + ASSERT_OK(MakeScratchDir()); + + auto path = ScratchPath("test-file"); + const int size = 100; + + std::vector data = RandomData(size); + ASSERT_OK(WriteDummyFile(path, data.data(), size)); + + std::shared_ptr file; + ASSERT_OK(client_->OpenReadable(path, &file)); + + // Test GetSize -- move this into its own unit test if ever needed + int64_t file_size; + ASSERT_OK(file->GetSize(&file_size)); + ASSERT_EQ(size, file_size); + + uint8_t buffer[50]; + int32_t bytes_read = 0; + + ASSERT_OK(file->Read(50, &bytes_read, buffer)); + ASSERT_EQ(0, std::memcmp(buffer, data.data(), 50)); + ASSERT_EQ(50, bytes_read); + + ASSERT_OK(file->Read(50, &bytes_read, buffer)); + ASSERT_EQ(0, std::memcmp(buffer, data.data() + 50, 50)); + ASSERT_EQ(50, bytes_read); + + // EOF + ASSERT_OK(file->Read(1, &bytes_read, buffer)); + ASSERT_EQ(0, bytes_read); + + // ReadAt to EOF + ASSERT_OK(file->ReadAt(60, 100, &bytes_read, buffer)); + ASSERT_EQ(40, bytes_read); + ASSERT_EQ(0, std::memcmp(buffer, data.data() + 60, bytes_read)); + + // Seek, Tell + ASSERT_OK(file->Seek(60)); + + int64_t position; + ASSERT_OK(file->Tell(&position)); + ASSERT_EQ(60, position); +} + +TEST_F(TestHdfsClient, RenameFile) { + SKIP_IF_NO_LIBHDFS(); + + ASSERT_OK(MakeScratchDir()); + + auto src_path = ScratchPath("src-file"); + auto dst_path = ScratchPath("dst-file"); + const int size = 100; + + std::vector data = RandomData(size); + ASSERT_OK(WriteDummyFile(src_path, data.data(), size)); + + ASSERT_OK(client_->Rename(src_path, dst_path)); + + ASSERT_FALSE(client_->Exists(src_path)); + ASSERT_TRUE(client_->Exists(dst_path)); +} + +} // namespace io +} // namespace arrow diff --git a/cpp/src/arrow/io/hdfs.cc b/cpp/src/arrow/io/hdfs.cc new file mode 100644 index 0000000000000..6da6ea4e71bd8 --- /dev/null +++ b/cpp/src/arrow/io/hdfs.cc @@ -0,0 +1,458 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include + +#include +#include +#include + +#include "arrow/io/hdfs.h" +#include "arrow/util/status.h" + +namespace arrow { +namespace io { + +#define CHECK_FAILURE(RETURN_VALUE, WHAT) \ + do { \ + if (RETURN_VALUE == -1) { \ + std::stringstream ss; \ + ss << "HDFS: " << WHAT << " failed"; \ + return Status::IOError(ss.str()); \ + } \ + } while (0) + +static Status CheckReadResult(int ret) { + // Check for error on -1 (possibly errno set) + + // ret == 0 at end of file, which is OK + if (ret == -1) { + // EOF + std::stringstream ss; + ss << "HDFS read failed, errno: " << errno; + return Status::IOError(ss.str()); + } + return Status::OK(); +} + +// ---------------------------------------------------------------------- +// File reading + +class HdfsAnyFileImpl { + public: + void set_members(const std::string& path, hdfsFS fs, hdfsFile handle) { + path_ = path; + fs_ = fs; + file_ = handle; + is_open_ = true; + } + + Status Seek(int64_t position) { + int ret = hdfsSeek(fs_, file_, position); + CHECK_FAILURE(ret, "seek"); + return Status::OK(); + } + + Status Tell(int64_t* offset) { + int64_t ret = hdfsTell(fs_, file_); + CHECK_FAILURE(ret, "tell"); + *offset = ret; + return Status::OK(); + } + + bool is_open() const { return is_open_; } + + protected: + std::string path_; + + // These are pointers in libhdfs, so OK to copy + hdfsFS fs_; + hdfsFile file_; + + bool is_open_; +}; + +// Private implementation for read-only files +class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { + public: + HdfsReadableFileImpl() {} + + Status Close() { + if (is_open_) { + int ret = hdfsCloseFile(fs_, file_); + CHECK_FAILURE(ret, "CloseFile"); + is_open_ = false; + } + return Status::OK(); + } + + Status ReadAt(int64_t position, int32_t nbytes, int32_t* bytes_read, uint8_t* buffer) { + tSize ret = hdfsPread(fs_, file_, static_cast(position), + reinterpret_cast(buffer), nbytes); + RETURN_NOT_OK(CheckReadResult(ret)); + *bytes_read = ret; + return Status::OK(); + } + + Status Read(int32_t nbytes, int32_t* bytes_read, uint8_t* buffer) { + tSize ret = hdfsRead(fs_, file_, reinterpret_cast(buffer), nbytes); + RETURN_NOT_OK(CheckReadResult(ret)); + *bytes_read = ret; + return Status::OK(); + } + + Status GetSize(int64_t* size) { + hdfsFileInfo* entry = hdfsGetPathInfo(fs_, path_.c_str()); + if (entry == nullptr) { return Status::IOError("HDFS: GetPathInfo failed"); } + + *size = entry->mSize; + hdfsFreeFileInfo(entry, 1); + return Status::OK(); + } +}; + +HdfsReadableFile::HdfsReadableFile() { + impl_.reset(new HdfsReadableFileImpl()); +} + +HdfsReadableFile::~HdfsReadableFile() { + impl_->Close(); +} + +Status HdfsReadableFile::Close() { + return impl_->Close(); +} + +Status HdfsReadableFile::ReadAt( + int64_t position, int32_t nbytes, int32_t* bytes_read, uint8_t* buffer) { + return impl_->ReadAt(position, nbytes, bytes_read, buffer); +} + +Status HdfsReadableFile::Read(int32_t nbytes, int32_t* bytes_read, uint8_t* buffer) { + return impl_->Read(nbytes, bytes_read, buffer); +} + +Status HdfsReadableFile::GetSize(int64_t* size) { + return impl_->GetSize(size); +} + +Status HdfsReadableFile::Seek(int64_t position) { + return impl_->Seek(position); +} + +Status HdfsReadableFile::Tell(int64_t* position) { + return impl_->Tell(position); +} + +// ---------------------------------------------------------------------- +// File writing + +// Private implementation for writeable-only files +class HdfsWriteableFile::HdfsWriteableFileImpl : public HdfsAnyFileImpl { + public: + HdfsWriteableFileImpl() 
{} + + Status Close() { + if (is_open_) { + int ret = hdfsFlush(fs_, file_); + CHECK_FAILURE(ret, "Flush"); + ret = hdfsCloseFile(fs_, file_); + CHECK_FAILURE(ret, "CloseFile"); + is_open_ = false; + } + return Status::OK(); + } + + Status Write(const uint8_t* buffer, int32_t nbytes, int32_t* bytes_written) { + tSize ret = hdfsWrite(fs_, file_, reinterpret_cast(buffer), nbytes); + CHECK_FAILURE(ret, "Write"); + *bytes_written = ret; + return Status::OK(); + } +}; + +HdfsWriteableFile::HdfsWriteableFile() { + impl_.reset(new HdfsWriteableFileImpl()); +} + +HdfsWriteableFile::~HdfsWriteableFile() { + impl_->Close(); +} + +Status HdfsWriteableFile::Close() { + return impl_->Close(); +} + +Status HdfsWriteableFile::Write( + const uint8_t* buffer, int32_t nbytes, int32_t* bytes_read) { + return impl_->Write(buffer, nbytes, bytes_read); +} + +Status HdfsWriteableFile::Write(const uint8_t* buffer, int32_t nbytes) { + int32_t bytes_written_dummy = 0; + return Write(buffer, nbytes, &bytes_written_dummy); +} + +Status HdfsWriteableFile::Tell(int64_t* position) { + return impl_->Tell(position); +} + +// ---------------------------------------------------------------------- +// HDFS client + +// TODO(wesm): this could throw std::bad_alloc in the course of copying strings +// into the path info object +static void SetPathInfo(const hdfsFileInfo* input, HdfsPathInfo* out) { + out->kind = input->mKind == kObjectKindFile ? ObjectType::FILE : ObjectType::DIRECTORY; + out->name = std::string(input->mName); + out->owner = std::string(input->mOwner); + out->group = std::string(input->mGroup); + + out->last_access_time = static_cast(input->mLastAccess); + out->last_modified_time = static_cast(input->mLastMod); + out->size = static_cast(input->mSize); + + out->replication = input->mReplication; + out->block_size = input->mBlockSize; + + out->permissions = input->mPermissions; +} + +// Private implementation +class HdfsClient::HdfsClientImpl { + public: + HdfsClientImpl() {} + + Status Connect(const HdfsConnectionConfig* config) { + RETURN_NOT_OK(ConnectLibHdfs()); + + fs_ = hdfsConnectAsUser(config->host.c_str(), config->port, config->user.c_str()); + + if (fs_ == nullptr) { return Status::IOError("HDFS connection failed"); } + namenode_host_ = config->host; + port_ = config->port; + user_ = config->user; + + return Status::OK(); + } + + Status CreateDirectory(const std::string& path) { + int ret = hdfsCreateDirectory(fs_, path.c_str()); + CHECK_FAILURE(ret, "create directory"); + return Status::OK(); + } + + Status Delete(const std::string& path, bool recursive) { + int ret = hdfsDelete(fs_, path.c_str(), static_cast(recursive)); + CHECK_FAILURE(ret, "delete"); + return Status::OK(); + } + + Status Disconnect() { + int ret = hdfsDisconnect(fs_); + CHECK_FAILURE(ret, "hdfsFS::Disconnect"); + return Status::OK(); + } + + bool Exists(const std::string& path) { + // hdfsExists does not distinguish between RPC failure and the file not + // existing + int ret = hdfsExists(fs_, path.c_str()); + return ret == 0; + } + + Status GetCapacity(int64_t* nbytes) { + tOffset ret = hdfsGetCapacity(fs_); + CHECK_FAILURE(ret, "GetCapacity"); + *nbytes = ret; + return Status::OK(); + } + + Status GetUsed(int64_t* nbytes) { + tOffset ret = hdfsGetUsed(fs_); + CHECK_FAILURE(ret, "GetUsed"); + *nbytes = ret; + return Status::OK(); + } + + Status GetPathInfo(const std::string& path, HdfsPathInfo* info) { + hdfsFileInfo* entry = hdfsGetPathInfo(fs_, path.c_str()); + + if (entry == nullptr) { return Status::IOError("HDFS: GetPathInfo 
failed"); } + + SetPathInfo(entry, info); + hdfsFreeFileInfo(entry, 1); + + return Status::OK(); + } + + Status ListDirectory(const std::string& path, std::vector<HdfsPathInfo>* listing) { + int num_entries = 0; + hdfsFileInfo* entries = hdfsListDirectory(fs_, path.c_str(), &num_entries); + + if (entries == nullptr) { + // If the directory is empty, entries is NULL but errno is 0. Non-zero + // errno indicates error + // + // Note: errno is thread-local + if (errno == 0) { num_entries = 0; } + else { return Status::IOError("HDFS: list directory failed"); } + } + + // Allocate additional space for elements + + int vec_offset = listing->size(); + listing->resize(vec_offset + num_entries); + + for (int i = 0; i < num_entries; ++i) { + SetPathInfo(entries + i, &(*listing)[vec_offset + i]); + } + + // Free libhdfs file info + hdfsFreeFileInfo(entries, num_entries); + + return Status::OK(); + } + + Status OpenReadable(const std::string& path, std::shared_ptr<HdfsReadableFile>* file) { + hdfsFile handle = hdfsOpenFile(fs_, path.c_str(), O_RDONLY, 0, 0, 0); + + if (handle == nullptr) { + // TODO(wesm): determine cause of failure + std::stringstream ss; + ss << "Unable to open file " << path; + return Status::IOError(ss.str()); + } + + // std::make_shared does not work with private ctors + *file = std::shared_ptr<HdfsReadableFile>(new HdfsReadableFile()); + (*file)->impl_->set_members(path, fs_, handle); + + return Status::OK(); + } + + Status OpenWriteable(const std::string& path, bool append, int32_t buffer_size, + int16_t replication, int64_t default_block_size, + std::shared_ptr<HdfsWriteableFile>* file) { + int flags = O_WRONLY; + if (append) flags |= O_APPEND; + + hdfsFile handle = hdfsOpenFile( + fs_, path.c_str(), flags, buffer_size, replication, default_block_size); + + if (handle == nullptr) { + // TODO(wesm): determine cause of failure + std::stringstream ss; + ss << "Unable to open file " << path; + return Status::IOError(ss.str()); + } + + // std::make_shared does not work with private ctors + *file = std::shared_ptr<HdfsWriteableFile>(new HdfsWriteableFile()); + (*file)->impl_->set_members(path, fs_, handle); + + return Status::OK(); + } + + Status Rename(const std::string& src, const std::string& dst) { + int ret = hdfsRename(fs_, src.c_str(), dst.c_str()); + CHECK_FAILURE(ret, "Rename"); + return Status::OK(); + } + + private: + std::string namenode_host_; + std::string user_; + int port_; + + hdfsFS fs_; +}; + +// ---------------------------------------------------------------------- +// Public API for HDFSClient + +HdfsClient::HdfsClient() { + impl_.reset(new HdfsClientImpl()); +} + +HdfsClient::~HdfsClient() {} + +Status HdfsClient::Connect( + const HdfsConnectionConfig* config, std::shared_ptr<HdfsClient>* fs) { + // ctor is private, make_shared will not work + *fs = std::shared_ptr<HdfsClient>(new HdfsClient()); + + RETURN_NOT_OK((*fs)->impl_->Connect(config)); + return Status::OK(); +} + +Status HdfsClient::CreateDirectory(const std::string& path) { + return impl_->CreateDirectory(path); +} + +Status HdfsClient::Delete(const std::string& path, bool recursive) { + return impl_->Delete(path, recursive); +} + +Status HdfsClient::Disconnect() { + return impl_->Disconnect(); +} + +bool HdfsClient::Exists(const std::string& path) { + return impl_->Exists(path); +} + +Status HdfsClient::GetPathInfo(const std::string& path, HdfsPathInfo* info) { + return impl_->GetPathInfo(path, info); +} + +Status HdfsClient::GetCapacity(int64_t* nbytes) { + return impl_->GetCapacity(nbytes); +} + +Status HdfsClient::GetUsed(int64_t* nbytes) { + return impl_->GetUsed(nbytes); +} + +Status
+
+  Status OpenReadable(const std::string& path, std::shared_ptr<HdfsReadableFile>* file) {
+    hdfsFile handle = hdfsOpenFile(fs_, path.c_str(), O_RDONLY, 0, 0, 0);
+
+    if (handle == nullptr) {
+      // TODO(wesm): determine cause of failure
+      std::stringstream ss;
+      ss << "Unable to open file " << path;
+      return Status::IOError(ss.str());
+    }
+
+    // std::make_shared does not work with private ctors
+    *file = std::shared_ptr<HdfsReadableFile>(new HdfsReadableFile());
+    (*file)->impl_->set_members(path, fs_, handle);
+
+    return Status::OK();
+  }
+
+  Status OpenWriteable(const std::string& path, bool append, int32_t buffer_size,
+      int16_t replication, int64_t default_block_size,
+      std::shared_ptr<HdfsWriteableFile>* file) {
+    int flags = O_WRONLY;
+    if (append) flags |= O_APPEND;
+
+    hdfsFile handle = hdfsOpenFile(
+        fs_, path.c_str(), flags, buffer_size, replication, default_block_size);
+
+    if (handle == nullptr) {
+      // TODO(wesm): determine cause of failure
+      std::stringstream ss;
+      ss << "Unable to open file " << path;
+      return Status::IOError(ss.str());
+    }
+
+    // std::make_shared does not work with private ctors
+    *file = std::shared_ptr<HdfsWriteableFile>(new HdfsWriteableFile());
+    (*file)->impl_->set_members(path, fs_, handle);
+
+    return Status::OK();
+  }
+
+  Status Rename(const std::string& src, const std::string& dst) {
+    int ret = hdfsRename(fs_, src.c_str(), dst.c_str());
+    CHECK_FAILURE(ret, "Rename");
+    return Status::OK();
+  }
+
+ private:
+  std::string namenode_host_;
+  std::string user_;
+  int port_;
+
+  hdfsFS fs_;
+};
+
+// ----------------------------------------------------------------------
+// Public API for HdfsClient
+
+HdfsClient::HdfsClient() {
+  impl_.reset(new HdfsClientImpl());
+}
+
+HdfsClient::~HdfsClient() {}
+
+Status HdfsClient::Connect(
+    const HdfsConnectionConfig* config, std::shared_ptr<HdfsClient>* fs) {
+  // ctor is private, make_shared will not work
+  *fs = std::shared_ptr<HdfsClient>(new HdfsClient());
+
+  RETURN_NOT_OK((*fs)->impl_->Connect(config));
+  return Status::OK();
+}
+
+Status HdfsClient::CreateDirectory(const std::string& path) {
+  return impl_->CreateDirectory(path);
+}
+
+Status HdfsClient::Delete(const std::string& path, bool recursive) {
+  return impl_->Delete(path, recursive);
+}
+
+Status HdfsClient::Disconnect() {
+  return impl_->Disconnect();
+}
+
+bool HdfsClient::Exists(const std::string& path) {
+  return impl_->Exists(path);
+}
+
+Status HdfsClient::GetPathInfo(const std::string& path, HdfsPathInfo* info) {
+  return impl_->GetPathInfo(path, info);
+}
+
+Status HdfsClient::GetCapacity(int64_t* nbytes) {
+  return impl_->GetCapacity(nbytes);
+}
+
+Status HdfsClient::GetUsed(int64_t* nbytes) {
+  return impl_->GetUsed(nbytes);
+}
+
+Status HdfsClient::ListDirectory(
+    const std::string& path, std::vector<HdfsPathInfo>* listing) {
+  return impl_->ListDirectory(path, listing);
+}
+
+Status HdfsClient::OpenReadable(
+    const std::string& path, std::shared_ptr<HdfsReadableFile>* file) {
+  return impl_->OpenReadable(path, file);
+}
+
+Status HdfsClient::OpenWriteable(const std::string& path, bool append,
+    int32_t buffer_size, int16_t replication, int64_t default_block_size,
+    std::shared_ptr<HdfsWriteableFile>* file) {
+  return impl_->OpenWriteable(
+      path, append, buffer_size, replication, default_block_size, file);
+}
+
+Status HdfsClient::OpenWriteable(
+    const std::string& path, bool append, std::shared_ptr<HdfsWriteableFile>* file) {
+  return OpenWriteable(path, append, 0, 0, 0, file);
+}
+
+Status HdfsClient::Rename(const std::string& src, const std::string& dst) {
+  return impl_->Rename(src, dst);
+}
+
+}  // namespace io
+}  // namespace arrow
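
With hdfs.cc complete, the pieces compose into a connect/write/list round trip. The following is a minimal, hedged end-to-end sketch against the API introduced in this patch, not an official example; the host, port, user, and paths are placeholders, and each failed Status simply aborts:

#include <cstdint>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

#include "arrow/io/hdfs.h"

int main() {
  arrow::io::HdfsConnectionConfig config;
  config.host = "localhost";  // placeholder namenode host
  config.port = 8020;         // placeholder namenode port
  config.user = "hadoop";     // placeholder user

  std::shared_ptr<arrow::io::HdfsClient> client;
  if (!arrow::io::HdfsClient::Connect(&config, &client).ok()) return 1;

  if (!client->CreateDirectory("/tmp/arrow-demo").ok()) return 1;

  // Open with defaults (append=false, buffer/replication/block size = 0)
  std::shared_ptr<arrow::io::HdfsWriteableFile> file;
  if (!client->OpenWriteable("/tmp/arrow-demo/hello.txt", false, &file).ok()) return 1;

  const std::string data = "hello, hdfs";
  if (!file->Write(reinterpret_cast<const uint8_t*>(data.data()),
          static_cast<int32_t>(data.size())).ok()) {
    return 1;
  }
  if (!file->Close().ok()) return 1;

  std::vector<arrow::io::HdfsPathInfo> listing;
  if (!client->ListDirectory("/tmp/arrow-demo", &listing).ok()) return 1;
  for (const auto& info : listing) {
    std::cout << info.name << " (" << info.size << " bytes)" << std::endl;
  }
  return client->Disconnect().ok() ? 0 : 1;
}
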
diff --git a/cpp/src/arrow/io/hdfs.h b/cpp/src/arrow/io/hdfs.h
new file mode 100644
index 0000000000000..a1972db96157a
--- /dev/null
+++ b/cpp/src/arrow/io/hdfs.h
@@ -0,0 +1,213 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef ARROW_IO_HDFS
+#define ARROW_IO_HDFS
+
+#include <cstdint>
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "arrow/io/interfaces.h"
+#include "arrow/util/macros.h"
+
+namespace arrow {
+
+class Status;
+
+namespace io {
+
+Status ConnectLibHdfs();
+
+class HdfsClient;
+class HdfsReadableFile;
+class HdfsWriteableFile;
+
+struct HdfsPathInfo {
+  ObjectType::type kind;
+
+  std::string name;
+  std::string owner;
+  std::string group;
+
+  int64_t size;
+  int64_t block_size;
+
+  // Access times in UNIX timestamps (seconds)
+  int32_t last_modified_time;
+  int32_t last_access_time;
+
+  int16_t replication;
+  int16_t permissions;
+};
+
+struct HdfsConnectionConfig {
+  std::string host;
+  int port;
+  std::string user;
+
+  // TODO: Kerberos, etc.
+};
+
+class HdfsClient : public FileSystemClient {
+ public:
+  ~HdfsClient();
+
+  // Connect to an HDFS cluster at the indicated host and port, as the
+  // indicated user
+  //
+  // @param config (in): host, port, and user to connect with
+  // @param fs (out): the created client
+  // @returns Status
+  static Status Connect(
+      const HdfsConnectionConfig* config, std::shared_ptr<HdfsClient>* fs);
+
+  // Create directory and all parents
+  //
+  // @param path (in): absolute HDFS path
+  // @returns Status
+  Status CreateDirectory(const std::string& path);
+
+  // Delete file or directory
+  // @param path: absolute path to data
+  // @param recursive: if path is a directory, delete contents as well
+  // @returns error status on failure
+  Status Delete(const std::string& path, bool recursive = false);
+
+  // Disconnect from cluster
+  //
+  // @returns Status
+  Status Disconnect();
+
+  // @param path (in): absolute HDFS path
+  // @returns bool, true if the path exists, false if not (or on error)
+  bool Exists(const std::string& path);
+
+  // @param path (in): absolute HDFS path
+  // @param info (out)
+  // @returns Status
+  Status GetPathInfo(const std::string& path, HdfsPathInfo* info);
+
+  // @param nbytes (out): total capacity of the filesystem
+  // @returns Status
+  Status GetCapacity(int64_t* nbytes);
+
+  // @param nbytes (out): total bytes used of the filesystem
+  // @returns Status
+  Status GetUsed(int64_t* nbytes);
+
+  Status ListDirectory(const std::string& path, std::vector<HdfsPathInfo>* listing);
+
+  // @param path file path to change
+  // @param owner pass nullptr for no change
+  // @param group pass nullptr for no change
+  Status Chown(const std::string& path, const char* owner, const char* group);
+
+  Status Chmod(const std::string& path, int mode);
+
+  // Move file or directory from source path to destination path within the
+  // current filesystem
+  Status Rename(const std::string& src, const std::string& dst);
+
+  // TODO(wesm): GetWorkingDirectory, SetWorkingDirectory
+
+  // Open an HDFS file in READ mode. Returns an error status if the file is
+  // not found.
+  //
+  // @param path complete file path
+  Status OpenReadable(const std::string& path, std::shared_ptr<HdfsReadableFile>* file);
+
+  // FileMode::WRITE options
+  // @param path complete file path
+  // @param buffer_size 0 for default
+  // @param replication 0 for default
+  // @param default_block_size 0 for default
+  Status OpenWriteable(const std::string& path, bool append, int32_t buffer_size,
+      int16_t replication, int64_t default_block_size,
+      std::shared_ptr<HdfsWriteableFile>* file);
+
+  Status OpenWriteable(
+      const std::string& path, bool append, std::shared_ptr<HdfsWriteableFile>* file);
+
+ private:
+  friend class HdfsReadableFile;
+  friend class HdfsWriteableFile;
+
+  class HdfsClientImpl;
+  std::unique_ptr<HdfsClientImpl> impl_;
+
+  HdfsClient();
+  DISALLOW_COPY_AND_ASSIGN(HdfsClient);
+};
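
Chown and Chmod are declared above, though their definitions do not appear in the hdfs.cc hunks earlier in this patch, so treat the following as a sketch of the intended calling convention rather than tested behavior. The path and group are placeholders, and the snippet assumes it runs inside a function returning Status:

// Change only the group; nullptr means "leave the owner unchanged"
RETURN_NOT_OK(client->Chown("/tmp/arrow-demo/hello.txt", nullptr, "analytics"));
// POSIX-style mode bitmask (rw-r--r--), mirroring hdfsChmod
RETURN_NOT_OK(client->Chmod("/tmp/arrow-demo/hello.txt", 0644));
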
+
+class HdfsReadableFile : public RandomAccessFile {
+ public:
+  ~HdfsReadableFile();
+
+  Status Close() override;
+
+  Status GetSize(int64_t* size) override;
+
+  Status ReadAt(
+      int64_t position, int32_t nbytes, int32_t* bytes_read, uint8_t* buffer) override;
+
+  Status Seek(int64_t position) override;
+  Status Tell(int64_t* position) override;
+
+  // NOTE: If you wish to read a particular range of a file in a multithreaded
+  // context, you may prefer to use ReadAt to avoid locking issues
+  Status Read(int32_t nbytes, int32_t* bytes_read, uint8_t* buffer) override;
+
+ private:
+  class HdfsReadableFileImpl;
+  std::unique_ptr<HdfsReadableFileImpl> impl_;
+
+  friend class HdfsClient::HdfsClientImpl;
+
+  HdfsReadableFile();
+  DISALLOW_COPY_AND_ASSIGN(HdfsReadableFile);
+};
+
+class HdfsWriteableFile : public WriteableFile {
+ public:
+  ~HdfsWriteableFile();
+
+  Status Close() override;
+
+  Status Write(const uint8_t* buffer, int32_t nbytes) override;
+
+  Status Write(const uint8_t* buffer, int32_t nbytes, int32_t* bytes_written);
+
+  Status Tell(int64_t* position) override;
+
+ private:
+  class HdfsWriteableFileImpl;
+  std::unique_ptr<HdfsWriteableFileImpl> impl_;
+
+  friend class HdfsClient::HdfsClientImpl;
+
+  HdfsWriteableFile();
+
+  DISALLOW_COPY_AND_ASSIGN(HdfsWriteableFile);
+};
+
+}  // namespace io
+}  // namespace arrow
+
+#endif  // ARROW_IO_HDFS
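
The NOTE on HdfsReadableFile::Read deserves an illustration: Read advances a shared cursor, while ReadAt takes an explicit offset, so concurrent readers can fetch disjoint ranges without contending on seek state. A hedged sketch, with error propagation and short-read handling trimmed to the comments:

#include <cstdint>
#include <memory>
#include <thread>
#include <vector>

#include "arrow/io/hdfs.h"

// Sketch: two threads read disjoint byte ranges of one file via ReadAt,
// so neither depends on the shared Seek/Read cursor.
void ParallelRead(const std::shared_ptr<arrow::io::HdfsReadableFile>& file) {
  std::vector<uint8_t> left(1024);
  std::vector<uint8_t> right(1024);
  auto read_slice = [&file](int64_t offset, int32_t nbytes, uint8_t* out) {
    int32_t bytes_read = 0;
    arrow::Status st = file->ReadAt(offset, nbytes, &bytes_read, out);
    // A real caller would propagate st and handle bytes_read < nbytes
  };
  std::thread t1(read_slice, 0, 1024, left.data());
  std::thread t2(read_slice, 1024, 1024, right.data());
  t1.join();
  t2.join();
}
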
diff --git a/cpp/src/arrow/io/interfaces.h b/cpp/src/arrow/io/interfaces.h
new file mode 100644
index 0000000000000..4bd8a8ffc2f9d
--- /dev/null
+++ b/cpp/src/arrow/io/interfaces.h
@@ -0,0 +1,71 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef ARROW_IO_INTERFACES
+#define ARROW_IO_INTERFACES
+
+#include <cstdint>
+
+namespace arrow {
+
+class Status;
+
+namespace io {
+
+struct FileMode {
+  enum type { READ, WRITE, READWRITE };
+};
+
+struct ObjectType {
+  enum type { FILE, DIRECTORY };
+};
+
+class FileSystemClient {
+ public:
+  virtual ~FileSystemClient() {}
+};
+
+class FileBase {
+  virtual Status Close() = 0;
+
+  virtual Status Tell(int64_t* position) = 0;
+};
+
+class ReadableFile : public FileBase {
+ public:
+  virtual Status ReadAt(
+      int64_t position, int32_t nbytes, int32_t* bytes_read, uint8_t* buffer) = 0;
+
+  virtual Status Read(int32_t nbytes, int32_t* bytes_read, uint8_t* buffer) = 0;
+
+  virtual Status GetSize(int64_t* size) = 0;
+};
+
+class RandomAccessFile : public ReadableFile {
+ public:
+  virtual Status Seek(int64_t position) = 0;
+};
+
+class WriteableFile : public FileBase {
+ public:
+  virtual Status Write(const uint8_t* buffer, int32_t nbytes) = 0;
+};
+
+}  // namespace io
+}  // namespace arrow
+
+#endif  // ARROW_IO_INTERFACES
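
To see what these interfaces demand of an implementation, here is a toy in-memory WriteableFile. BufferWriteableFile is hypothetical, exists only for illustration, and assumes arrow/util/status.h is available for the concrete Status returns:

#include <cstdint>
#include <vector>

#include "arrow/io/interfaces.h"
#include "arrow/util/status.h"

// Sketch: an append-only byte-buffer "file" satisfying WriteableFile.
class BufferWriteableFile : public arrow::io::WriteableFile {
 public:
  arrow::Status Close() override { return arrow::Status::OK(); }

  arrow::Status Tell(int64_t* position) override {
    // The write cursor is always the end of the buffer
    *position = static_cast<int64_t>(data_.size());
    return arrow::Status::OK();
  }

  arrow::Status Write(const uint8_t* buffer, int32_t nbytes) override {
    data_.insert(data_.end(), buffer, buffer + nbytes);
    return arrow::Status::OK();
  }

 private:
  std::vector<uint8_t> data_;
};
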
diff --git a/cpp/src/arrow/io/libhdfs_shim.cc b/cpp/src/arrow/io/libhdfs_shim.cc
new file mode 100644
index 0000000000000..f75266536e5b3
--- /dev/null
+++ b/cpp/src/arrow/io/libhdfs_shim.cc
@@ -0,0 +1,544 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// This shim interface to libhdfs (for runtime shared library loading) has been
+// adapted from the SFrame project, released under the ASF-compatible 3-clause
+// BSD license
+//
+// Using this requires having the $JAVA_HOME and $HADOOP_HOME environment
+// variables set, so that libjvm and libhdfs can be located easily
+
+// Copyright (C) 2015 Dato, Inc.
+// All rights reserved.
+//
+// This software may be modified and distributed under the terms
+// of the BSD license. See the LICENSE file for details.
+
+#ifdef HAS_HADOOP
+
+#ifndef _WIN32
+#include <dlfcn.h>
+#else
+#include <windows.h>
+#include <winbase.h>
+
+// TODO(wesm): address when/if we add windows support
+// #include
+#endif
+
+extern "C" {
+#include <hdfs.h>
+}
+
+#include <cstdint>
+#include <cstdlib>
+#include <cstring>
+#include <mutex>
+#include <sstream>
+#include <string>
+#include <vector>
+
+#include <boost/filesystem.hpp>       // NOLINT
+#include <boost/filesystem/path.hpp>  // NOLINT
+
+#include "arrow/util/status.h"
+
+namespace fs = boost::filesystem;
+
+extern "C" {
+
+#ifndef _WIN32
+static void* libhdfs_handle = NULL;
+static void* libjvm_handle = NULL;
+#else
+static HINSTANCE libhdfs_handle = NULL;
+static HINSTANCE libjvm_handle = NULL;
+#endif
+/*
+ * All the shim pointers
+ */
+
+// NOTE(wesm): cpplint does not like use of short and other imprecise C types
+
+static hdfsFS (*ptr_hdfsConnectAsUser)(
+    const char* host, tPort port, const char* user) = NULL;
+static hdfsFS (*ptr_hdfsConnect)(const char* host, tPort port) = NULL;
+static int (*ptr_hdfsDisconnect)(hdfsFS fs) = NULL;
+
+static hdfsFile (*ptr_hdfsOpenFile)(hdfsFS fs, const char* path, int flags,
+    int bufferSize, short replication, tSize blocksize) = NULL;  // NOLINT
+
+static int (*ptr_hdfsCloseFile)(hdfsFS fs, hdfsFile file) = NULL;
+static int (*ptr_hdfsExists)(hdfsFS fs, const char* path) = NULL;
+static int (*ptr_hdfsSeek)(hdfsFS fs, hdfsFile file, tOffset desiredPos) = NULL;
+static tOffset (*ptr_hdfsTell)(hdfsFS fs, hdfsFile file) = NULL;
+static tSize (*ptr_hdfsRead)(hdfsFS fs, hdfsFile file, void* buffer, tSize length) = NULL;
+static tSize (*ptr_hdfsPread)(
+    hdfsFS fs, hdfsFile file, tOffset position, void* buffer, tSize length) = NULL;
+static tSize (*ptr_hdfsWrite)(
+    hdfsFS fs, hdfsFile file, const void* buffer, tSize length) = NULL;
+static int (*ptr_hdfsFlush)(hdfsFS fs, hdfsFile file) = NULL;
+static int (*ptr_hdfsAvailable)(hdfsFS fs, hdfsFile file) = NULL;
+static int (*ptr_hdfsCopy)(
+    hdfsFS srcFS, const char* src, hdfsFS dstFS, const char* dst) = NULL;
+static int (*ptr_hdfsMove)(
+    hdfsFS srcFS, const char* src, hdfsFS dstFS, const char* dst) = NULL;
+static int (*ptr_hdfsDelete)(hdfsFS fs, const char* path, int recursive) = NULL;
+static int (*ptr_hdfsRename)(hdfsFS fs, const char* oldPath, const char* newPath) = NULL;
+static char* (*ptr_hdfsGetWorkingDirectory)(
+    hdfsFS fs, char* buffer, size_t bufferSize) = NULL;
+static int (*ptr_hdfsSetWorkingDirectory)(hdfsFS fs, const char* path) = NULL;
+static int (*ptr_hdfsCreateDirectory)(hdfsFS fs, const char* path) = NULL;
+static int (*ptr_hdfsSetReplication)(
+    hdfsFS fs, const char* path, int16_t replication) = NULL;
+static hdfsFileInfo* (*ptr_hdfsListDirectory)(
+    hdfsFS fs, const char* path, int* numEntries) = NULL;
+static hdfsFileInfo* (*ptr_hdfsGetPathInfo)(hdfsFS fs, const char* path) = NULL;
+static void (*ptr_hdfsFreeFileInfo)(hdfsFileInfo* hdfsFileInfo, int numEntries) = NULL;
+static char*** (*ptr_hdfsGetHosts)(
+    hdfsFS fs, const char* path, tOffset start, tOffset length) = NULL;
+static void (*ptr_hdfsFreeHosts)(char*** blockHosts) = NULL;
+static tOffset (*ptr_hdfsGetDefaultBlockSize)(hdfsFS fs) = NULL;
+static tOffset (*ptr_hdfsGetCapacity)(hdfsFS fs) = NULL;
+static tOffset (*ptr_hdfsGetUsed)(hdfsFS fs) = NULL;
+static int (*ptr_hdfsChown)(
+    hdfsFS fs, const char* path, const char* owner, const char* group) = NULL;
+static int (*ptr_hdfsChmod)(hdfsFS fs, const char* path, short mode) = NULL;  // NOLINT
+static int (*ptr_hdfsUtime)(hdfsFS fs, const char* path, tTime mtime, tTime atime) = NULL;
+
+// Helper functions for dlopens
+static std::vector<fs::path> get_potential_libjvm_paths();
+static std::vector<fs::path> get_potential_libhdfs_paths();
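
The table above declares one function pointer per libhdfs entry point, each resolved at runtime. In isolation, the underlying dlopen/dlsym pattern that the rest of this file generalizes looks like the following standalone, POSIX-only sketch; the library name and symbol are placeholders:

#include <dlfcn.h>

#include <cstdio>

// Sketch of resolving one symbol from a shared library at runtime.
typedef int (*example_fn)(int);

int main() {
  void* handle = dlopen("libexample.so", RTLD_NOW | RTLD_LOCAL);  // placeholder library
  if (handle == nullptr) {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return 1;
  }
  example_fn fn = nullptr;
  // Cast through void** because dlsym returns an untyped object pointer
  *reinterpret_cast<void**>(&fn) = dlsym(handle, "example_symbol");  // placeholder symbol
  if (fn != nullptr) { std::printf("result: %d\n", fn(42)); }
  dlclose(handle);
  return 0;
}
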
+static arrow::Status try_dlopen(std::vector<fs::path> potential_paths, const char* name,
+#ifndef _WIN32
+    void*& out_handle);
+#else
+    HINSTANCE& out_handle);
+#endif
+
+#define GET_SYMBOL(SYMBOL_NAME)                                                   \
+  if (!ptr_##SYMBOL_NAME) {                                                       \
+    *reinterpret_cast<void**>(&ptr_##SYMBOL_NAME) = get_symbol("" #SYMBOL_NAME);  \
+  }
+
+static void* get_symbol(const char* symbol) {
+  if (libhdfs_handle == NULL) return NULL;
+#ifndef _WIN32
+  return dlsym(libhdfs_handle, symbol);
+#else
+
+  void* ret = reinterpret_cast<void*>(GetProcAddress(libhdfs_handle, symbol));
+  if (ret == NULL) {
+    // logstream(LOG_INFO) << "GetProcAddress error: "
+    //                     << get_last_err_str(GetLastError()) << std::endl;
+  }
+  return ret;
+#endif
+}
+
+hdfsFS hdfsConnectAsUser(const char* host, tPort port, const char* user) {
+  return ptr_hdfsConnectAsUser(host, port, user);
+}
+
+// Returns NULL on failure
+hdfsFS hdfsConnect(const char* host, tPort port) {
+  if (ptr_hdfsConnect) {
+    return ptr_hdfsConnect(host, port);
+  } else {
+    // TODO: error reporting when shim setup fails
+    return NULL;
+  }
+}
+
+int hdfsDisconnect(hdfsFS fs) {
+  return ptr_hdfsDisconnect(fs);
+}
+
+hdfsFile hdfsOpenFile(hdfsFS fs, const char* path, int flags, int bufferSize,
+    short replication, tSize blocksize) {  // NOLINT
+  return ptr_hdfsOpenFile(fs, path, flags, bufferSize, replication, blocksize);
+}
+
+int hdfsCloseFile(hdfsFS fs, hdfsFile file) {
+  return ptr_hdfsCloseFile(fs, file);
+}
+
+int hdfsExists(hdfsFS fs, const char* path) {
+  return ptr_hdfsExists(fs, path);
+}
+
+int hdfsSeek(hdfsFS fs, hdfsFile file, tOffset desiredPos) {
+  return ptr_hdfsSeek(fs, file, desiredPos);
+}
+
+tOffset hdfsTell(hdfsFS fs, hdfsFile file) {
+  return ptr_hdfsTell(fs, file);
+}
+
+tSize hdfsRead(hdfsFS fs, hdfsFile file, void* buffer, tSize length) {
+  return ptr_hdfsRead(fs, file, buffer, length);
+}
+
+tSize hdfsPread(hdfsFS fs, hdfsFile file, tOffset position, void* buffer, tSize length) {
+  return ptr_hdfsPread(fs, file, position, buffer, length);
+}
+
+tSize hdfsWrite(hdfsFS fs, hdfsFile file, const void* buffer, tSize length) {
+  return ptr_hdfsWrite(fs, file, buffer, length);
+}
+
+int hdfsFlush(hdfsFS fs, hdfsFile file) {
+  return ptr_hdfsFlush(fs, file);
+}
+
+int hdfsAvailable(hdfsFS fs, hdfsFile file) {
+  GET_SYMBOL(hdfsAvailable);
+  if (ptr_hdfsAvailable)
+    return ptr_hdfsAvailable(fs, file);
+  else
+    return 0;
+}
+
+int hdfsCopy(hdfsFS srcFS, const char* src, hdfsFS dstFS, const char* dst) {
+  GET_SYMBOL(hdfsCopy);
+  if (ptr_hdfsCopy)
+    return ptr_hdfsCopy(srcFS, src, dstFS, dst);
+  else
+    return 0;
+}
+
+int hdfsMove(hdfsFS srcFS, const char* src, hdfsFS dstFS, const char* dst) {
+  GET_SYMBOL(hdfsMove);
+  if (ptr_hdfsMove)
+    return ptr_hdfsMove(srcFS, src, dstFS, dst);
+  else
+    return 0;
+}
+
+int hdfsDelete(hdfsFS fs, const char* path, int recursive) {
+  return ptr_hdfsDelete(fs, path, recursive);
+}
+
+int hdfsRename(hdfsFS fs, const char* oldPath, const char* newPath) {
+  GET_SYMBOL(hdfsRename);
+  if (ptr_hdfsRename)
+    return ptr_hdfsRename(fs, oldPath, newPath);
+  else
+    return 0;
+}
+
+char* hdfsGetWorkingDirectory(hdfsFS fs, char* buffer, size_t bufferSize) {
+  GET_SYMBOL(hdfsGetWorkingDirectory);
+  if (ptr_hdfsGetWorkingDirectory) {
+    return ptr_hdfsGetWorkingDirectory(fs, buffer, bufferSize);
+  } else {
+    return NULL;
+  }
+}
+
+int hdfsSetWorkingDirectory(hdfsFS fs, const char* path) {
+  GET_SYMBOL(hdfsSetWorkingDirectory);
+  if (ptr_hdfsSetWorkingDirectory) {
+    return ptr_hdfsSetWorkingDirectory(fs, path);
+  } else {
+    return 0;
+  }
+}
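
To make the lazy resolution in these wrappers concrete: inside hdfsRename above, GET_SYMBOL(hdfsRename) expands (by hand, for illustration only) to roughly the following, after which the wrapper falls back to a no-op return when the symbol is absent from the loaded library:

// Hand-expanded equivalent of GET_SYMBOL(hdfsRename):
if (!ptr_hdfsRename) {
  *reinterpret_cast<void**>(&ptr_hdfsRename) = get_symbol("hdfsRename");
}
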
+
+int hdfsCreateDirectory(hdfsFS fs, const char* path) {
+  return ptr_hdfsCreateDirectory(fs, path);
+}
+
+int hdfsSetReplication(hdfsFS fs, const char* path, int16_t replication) {
+  GET_SYMBOL(hdfsSetReplication);
+  if (ptr_hdfsSetReplication) {
+    return ptr_hdfsSetReplication(fs, path, replication);
+  } else {
+    return 0;
+  }
+}
+
+hdfsFileInfo* hdfsListDirectory(hdfsFS fs, const char* path, int* numEntries) {
+  return ptr_hdfsListDirectory(fs, path, numEntries);
+}
+
+hdfsFileInfo* hdfsGetPathInfo(hdfsFS fs, const char* path) {
+  return ptr_hdfsGetPathInfo(fs, path);
+}
+
+void hdfsFreeFileInfo(hdfsFileInfo* hdfsFileInfo, int numEntries) {
+  ptr_hdfsFreeFileInfo(hdfsFileInfo, numEntries);
+}
+
+char*** hdfsGetHosts(hdfsFS fs, const char* path, tOffset start, tOffset length) {
+  GET_SYMBOL(hdfsGetHosts);
+  if (ptr_hdfsGetHosts) {
+    return ptr_hdfsGetHosts(fs, path, start, length);
+  } else {
+    return NULL;
+  }
+}
+
+void hdfsFreeHosts(char*** blockHosts) {
+  GET_SYMBOL(hdfsFreeHosts);
+  if (ptr_hdfsFreeHosts) { ptr_hdfsFreeHosts(blockHosts); }
+}
+
+tOffset hdfsGetDefaultBlockSize(hdfsFS fs) {
+  GET_SYMBOL(hdfsGetDefaultBlockSize);
+  if (ptr_hdfsGetDefaultBlockSize) {
+    return ptr_hdfsGetDefaultBlockSize(fs);
+  } else {
+    return 0;
+  }
+}
+
+tOffset hdfsGetCapacity(hdfsFS fs) {
+  return ptr_hdfsGetCapacity(fs);
+}
+
+tOffset hdfsGetUsed(hdfsFS fs) {
+  return ptr_hdfsGetUsed(fs);
+}
+
+int hdfsChown(hdfsFS fs, const char* path, const char* owner, const char* group) {
+  GET_SYMBOL(hdfsChown);
+  if (ptr_hdfsChown) {
+    return ptr_hdfsChown(fs, path, owner, group);
+  } else {
+    return 0;
+  }
+}
+
+int hdfsChmod(hdfsFS fs, const char* path, short mode) {  // NOLINT
+  GET_SYMBOL(hdfsChmod);
+  if (ptr_hdfsChmod) {
+    return ptr_hdfsChmod(fs, path, mode);
+  } else {
+    return 0;
+  }
+}
+
+int hdfsUtime(hdfsFS fs, const char* path, tTime mtime, tTime atime) {
+  GET_SYMBOL(hdfsUtime);
+  if (ptr_hdfsUtime) {
+    return ptr_hdfsUtime(fs, path, mtime, atime);
+  } else {
+    return 0;
+  }
+}
+
+static std::vector<fs::path> get_potential_libhdfs_paths() {
+  std::vector<fs::path> libhdfs_potential_paths = {
+      // find one in the local directory
+      fs::path("./libhdfs.so"), fs::path("./hdfs.dll"),
+      // find a global libhdfs.so
+      fs::path("libhdfs.so"), fs::path("hdfs.dll"),
+  };
+
+  const char* hadoop_home = std::getenv("HADOOP_HOME");
+  if (hadoop_home != nullptr) {
+    auto path = fs::path(hadoop_home) / "lib/native/libhdfs.so";
+    libhdfs_potential_paths.push_back(path);
+  }
+  return libhdfs_potential_paths;
+}
+
+static std::vector<fs::path> get_potential_libjvm_paths() {
+  std::vector<fs::path> libjvm_potential_paths;
+
+  std::vector<std::string> search_prefixes;
+  std::vector<std::string> search_suffixes;
+  std::string file_name;
+
+// From heuristics
+#ifdef __WIN32
+  search_prefixes = {""};
+  search_suffixes = {"/jre/bin/server", "/bin/server"};
+  file_name = "jvm.dll";
+#elif __APPLE__
+  search_prefixes = {""};
+  search_suffixes = {""};
+  file_name = "libjvm.dylib";
+
+// SFrame uses /usr/libexec/java_home to find JAVA_HOME; for now we are
+// expecting users to set an environment variable
+#else
+  search_prefixes = {
+      "/usr/lib/jvm/default-java",                // ubuntu / debian distros
+      "/usr/lib/jvm/java",                        // rhel6
+      "/usr/lib/jvm",                             // centos6
+      "/usr/lib64/jvm",                           // opensuse 13
+      "/usr/local/lib/jvm/default-java",          // alt ubuntu / debian distros
+      "/usr/local/lib/jvm/java",                  // alt rhel6
+      "/usr/local/lib/jvm",                       // alt centos6
+      "/usr/local/lib64/jvm",                     // alt opensuse 13
+      "/usr/local/lib/jvm/java-7-openjdk-amd64",  // alt ubuntu / debian distros
+      "/usr/lib/jvm/java-7-openjdk-amd64",        // alt ubuntu / debian distros
+      "/usr/local/lib/jvm/java-6-openjdk-amd64",  // alt ubuntu / debian distros
+      "/usr/lib/jvm/java-6-openjdk-amd64",        // alt ubuntu / debian distros
+      "/usr/lib/jvm/java-7-oracle",               // alt ubuntu
+      "/usr/lib/jvm/java-8-oracle",               // alt ubuntu
+      "/usr/lib/jvm/java-6-oracle",               // alt ubuntu
+      "/usr/local/lib/jvm/java-7-oracle",         // alt ubuntu
+      "/usr/local/lib/jvm/java-8-oracle",         // alt ubuntu
+      "/usr/local/lib/jvm/java-6-oracle",         // alt ubuntu
+      "/usr/lib/jvm/default",                     // alt centos
+      "/usr/java/latest",                         // alt centos
+  };
+  search_suffixes = {"/jre/lib/amd64/server"};
+  file_name = "libjvm.so";
+#endif
+  // From direct environment variable
+  char* env_value = NULL;
+  if ((env_value = getenv("JAVA_HOME")) != NULL) {
+    // logstream(LOG_INFO) << "Found environment variable JAVA_HOME: "
+    //                     << env_value << std::endl;
+    search_prefixes.insert(search_prefixes.begin(), env_value);
+  }
+
+  // Generate cross product between search_prefixes, search_suffixes, and file_name
+  for (auto& prefix : search_prefixes) {
+    for (auto& suffix : search_suffixes) {
+      auto path = (fs::path(prefix) / fs::path(suffix) / fs::path(file_name));
+      libjvm_potential_paths.push_back(path);
+    }
+  }
+
+  return libjvm_potential_paths;
+}
+
+#ifndef _WIN32
+static arrow::Status try_dlopen(
+    std::vector<fs::path> potential_paths, const char* name, void*& out_handle) {
+  std::vector<std::string> error_messages;
+
+  for (auto& i : potential_paths) {
+    i.make_preferred();
+    // logstream(LOG_INFO) << "Trying " << i.string().c_str() << std::endl;
+    out_handle = dlopen(i.native().c_str(), RTLD_NOW | RTLD_LOCAL);
+
+    if (out_handle != NULL) {
+      // logstream(LOG_INFO) << "Success!" << std::endl;
+      break;
+    } else {
+      const char* err_msg = dlerror();
+      if (err_msg != NULL) {
+        error_messages.push_back(std::string(err_msg));
+      } else {
+        error_messages.push_back(std::string(" returned NULL"));
+      }
+    }
+  }
+
+  if (out_handle == NULL) {
+    std::stringstream ss;
+    ss << "Unable to load " << name;
+    // Include the per-path dlerror() messages so failures are diagnosable
+    for (const auto& message : error_messages) {
+      ss << "\n  " << message;
+    }
+    return arrow::Status::IOError(ss.str());
+  }
+
+  return arrow::Status::OK();
+}
+
+#else
+static arrow::Status try_dlopen(
+    std::vector<fs::path> potential_paths, const char* name, HINSTANCE& out_handle) {
+  std::vector<std::string> error_messages;
+
+  for (auto& i : potential_paths) {
+    i.make_preferred();
+    // logstream(LOG_INFO) << "Trying " << i.string().c_str() << std::endl;
+
+    out_handle = LoadLibrary(i.string().c_str());
+
+    if (out_handle != NULL) {
+      // logstream(LOG_INFO) << "Success!" << std::endl;
+      break;
+    } else {
+      // error_messages.push_back(get_last_err_str(GetLastError()));
+    }
+  }
+
+  if (out_handle == NULL) {
+    std::stringstream ss;
+    ss << "Unable to load " << name;
+    return arrow::Status::IOError(ss.str());
+  }
+
+  return arrow::Status::OK();
+}
+#endif  // _WIN32
+
+}  // extern "C"
+
+#define GET_SYMBOL_REQUIRED(SYMBOL_NAME)                                            \
+  do {                                                                              \
+    if (!ptr_##SYMBOL_NAME) {                                                       \
+      *reinterpret_cast<void**>(&ptr_##SYMBOL_NAME) = get_symbol("" #SYMBOL_NAME);  \
+    }                                                                               \
+    if (!ptr_##SYMBOL_NAME)                                                         \
+      return Status::IOError("Getting symbol " #SYMBOL_NAME " failed");             \
+  } while (0)
+
+namespace arrow {
+namespace io {
+
+Status ConnectLibHdfs() {
+  static std::mutex lock;
+  std::lock_guard<std::mutex> guard(lock);
+
+  static bool shim_attempted = false;
+  if (!shim_attempted) {
+    shim_attempted = true;
+
+    std::vector<fs::path> libjvm_potential_paths = get_potential_libjvm_paths();
+    RETURN_NOT_OK(try_dlopen(libjvm_potential_paths, "libjvm", libjvm_handle));
+
+    std::vector<fs::path> libhdfs_potential_paths = get_potential_libhdfs_paths();
+    RETURN_NOT_OK(try_dlopen(libhdfs_potential_paths, "libhdfs", libhdfs_handle));
+  } else if (libhdfs_handle == nullptr) {
+    return Status::IOError("Prior attempt to load libhdfs failed");
+  }
+
+  GET_SYMBOL_REQUIRED(hdfsConnect);
+  GET_SYMBOL_REQUIRED(hdfsConnectAsUser);
+  GET_SYMBOL_REQUIRED(hdfsCreateDirectory);
+  GET_SYMBOL_REQUIRED(hdfsDelete);
+  GET_SYMBOL_REQUIRED(hdfsDisconnect);
+  GET_SYMBOL_REQUIRED(hdfsExists);
+  GET_SYMBOL_REQUIRED(hdfsFreeFileInfo);
+  GET_SYMBOL_REQUIRED(hdfsGetCapacity);
+  GET_SYMBOL_REQUIRED(hdfsGetUsed);
+  GET_SYMBOL_REQUIRED(hdfsGetPathInfo);
+  GET_SYMBOL_REQUIRED(hdfsListDirectory);
+
+  // File methods
+  GET_SYMBOL_REQUIRED(hdfsCloseFile);
+  GET_SYMBOL_REQUIRED(hdfsFlush);
+  GET_SYMBOL_REQUIRED(hdfsOpenFile);
+  GET_SYMBOL_REQUIRED(hdfsRead);
+  GET_SYMBOL_REQUIRED(hdfsPread);
+  GET_SYMBOL_REQUIRED(hdfsSeek);
+  GET_SYMBOL_REQUIRED(hdfsTell);
+  GET_SYMBOL_REQUIRED(hdfsWrite);
+
+  return Status::OK();
+}
+
+}  // namespace io
+}  // namespace arrow
+
+#endif  // HAS_HADOOP
diff --git a/cpp/src/arrow/parquet/parquet-io-test.cc b/cpp/src/arrow/parquet/parquet-io-test.cc
index db779d8309cf6..edcac88705668 100644
--- a/cpp/src/arrow/parquet/parquet-io-test.cc
+++ b/cpp/src/arrow/parquet/parquet-io-test.cc
@@ -126,8 +126,8 @@ class TestParquetIO : public ::testing::Test {
     size_t chunk_size = values.size() / num_chunks;
     for (int i = 0; i < num_chunks; i++) {
       auto row_group_writer = file_writer->AppendRowGroup(chunk_size);
-      auto column_writer = static_cast*>(
-          row_group_writer->NextColumn());
+      auto column_writer =
+          static_cast*>(row_group_writer->NextColumn());
       T* data = values.data() + i * chunk_size;
       column_writer->WriteBatch(chunk_size, nullptr, nullptr, data);
       column_writer->Close();
diff --git a/cpp/thirdparty/hadoop/include/hdfs.h b/cpp/thirdparty/hadoop/include/hdfs.h
new file mode 100644
index 0000000000000..a4df6ae3b2be7
--- /dev/null
+++ b/cpp/thirdparty/hadoop/include/hdfs.h
@@ -0,0 +1,1024 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#ifndef LIBHDFS_HDFS_H +#define LIBHDFS_HDFS_H + +#include /* for EINTERNAL, etc. */ +#include /* for O_RDONLY, O_WRONLY */ +#include /* for uint64_t, etc. */ +#include /* for time_t */ + +/* + * Support export of DLL symbols during libhdfs build, and import of DLL symbols + * during client application build. A client application may optionally define + * symbol LIBHDFS_DLL_IMPORT in its build. This is not strictly required, but + * the compiler can produce more efficient code with it. + */ +#ifdef WIN32 + #ifdef LIBHDFS_DLL_EXPORT + #define LIBHDFS_EXTERNAL __declspec(dllexport) + #elif LIBHDFS_DLL_IMPORT + #define LIBHDFS_EXTERNAL __declspec(dllimport) + #else + #define LIBHDFS_EXTERNAL + #endif +#else + #ifdef LIBHDFS_DLL_EXPORT + #define LIBHDFS_EXTERNAL __attribute__((visibility("default"))) + #elif LIBHDFS_DLL_IMPORT + #define LIBHDFS_EXTERNAL __attribute__((visibility("default"))) + #else + #define LIBHDFS_EXTERNAL + #endif +#endif + +#ifndef O_RDONLY +#define O_RDONLY 1 +#endif + +#ifndef O_WRONLY +#define O_WRONLY 2 +#endif + +#ifndef EINTERNAL +#define EINTERNAL 255 +#endif + +#define ELASTIC_BYTE_BUFFER_POOL_CLASS \ + "org/apache/hadoop/io/ElasticByteBufferPool" + +/** All APIs set errno to meaningful values */ + +#ifdef __cplusplus +extern "C" { +#endif + /** + * Some utility decls used in libhdfs. + */ + struct hdfsBuilder; + typedef int32_t tSize; /// size of data for read/write io ops + typedef time_t tTime; /// time type in seconds + typedef int64_t tOffset;/// offset within the file + typedef uint16_t tPort; /// port + typedef enum tObjectKind { + kObjectKindFile = 'F', + kObjectKindDirectory = 'D', + } tObjectKind; + struct hdfsStreamBuilder; + + + /** + * The C reflection of org.apache.org.hadoop.FileSystem . + */ + struct hdfs_internal; + typedef struct hdfs_internal* hdfsFS; + + struct hdfsFile_internal; + typedef struct hdfsFile_internal* hdfsFile; + + struct hadoopRzOptions; + + struct hadoopRzBuffer; + + /** + * Determine if a file is open for read. + * + * @param file The HDFS file + * @return 1 if the file is open for read; 0 otherwise + */ + LIBHDFS_EXTERNAL + int hdfsFileIsOpenForRead(hdfsFile file); + + /** + * Determine if a file is open for write. + * + * @param file The HDFS file + * @return 1 if the file is open for write; 0 otherwise + */ + LIBHDFS_EXTERNAL + int hdfsFileIsOpenForWrite(hdfsFile file); + + struct hdfsReadStatistics { + uint64_t totalBytesRead; + uint64_t totalLocalBytesRead; + uint64_t totalShortCircuitBytesRead; + uint64_t totalZeroCopyBytesRead; + }; + + /** + * Get read statistics about a file. This is only applicable to files + * opened for reading. + * + * @param file The HDFS file + * @param stats (out parameter) on a successful return, the read + * statistics. Unchanged otherwise. You must free the + * returned statistics with hdfsFileFreeReadStatistics. + * @return 0 if the statistics were successfully returned, + * -1 otherwise. On a failure, please check errno against + * ENOTSUP. webhdfs, LocalFilesystem, and so forth may + * not support read statistics. 
+ */ + LIBHDFS_EXTERNAL + int hdfsFileGetReadStatistics(hdfsFile file, + struct hdfsReadStatistics **stats); + + /** + * @param stats HDFS read statistics for a file. + * + * @return the number of remote bytes read. + */ + LIBHDFS_EXTERNAL + int64_t hdfsReadStatisticsGetRemoteBytesRead( + const struct hdfsReadStatistics *stats); + + /** + * Clear the read statistics for a file. + * + * @param file The file to clear the read statistics of. + * + * @return 0 on success; the error code otherwise. + * EINVAL: the file is not open for reading. + * ENOTSUP: the file does not support clearing the read + * statistics. + * Errno will also be set to this code on failure. + */ + LIBHDFS_EXTERNAL + int hdfsFileClearReadStatistics(hdfsFile file); + + /** + * Free some HDFS read statistics. + * + * @param stats The HDFS read statistics to free. + */ + LIBHDFS_EXTERNAL + void hdfsFileFreeReadStatistics(struct hdfsReadStatistics *stats); + + /** + * hdfsConnectAsUser - Connect to a hdfs file system as a specific user + * Connect to the hdfs. + * @param nn The NameNode. See hdfsBuilderSetNameNode for details. + * @param port The port on which the server is listening. + * @param user the user name (this is hadoop domain user). Or NULL is equivelant to hhdfsConnect(host, port) + * @return Returns a handle to the filesystem or NULL on error. + * @deprecated Use hdfsBuilderConnect instead. + */ + LIBHDFS_EXTERNAL + hdfsFS hdfsConnectAsUser(const char* nn, tPort port, const char *user); + + /** + * hdfsConnect - Connect to a hdfs file system. + * Connect to the hdfs. + * @param nn The NameNode. See hdfsBuilderSetNameNode for details. + * @param port The port on which the server is listening. + * @return Returns a handle to the filesystem or NULL on error. + * @deprecated Use hdfsBuilderConnect instead. + */ + LIBHDFS_EXTERNAL + hdfsFS hdfsConnect(const char* nn, tPort port); + + /** + * hdfsConnect - Connect to an hdfs file system. + * + * Forces a new instance to be created + * + * @param nn The NameNode. See hdfsBuilderSetNameNode for details. + * @param port The port on which the server is listening. + * @param user The user name to use when connecting + * @return Returns a handle to the filesystem or NULL on error. + * @deprecated Use hdfsBuilderConnect instead. + */ + LIBHDFS_EXTERNAL + hdfsFS hdfsConnectAsUserNewInstance(const char* nn, tPort port, const char *user ); + + /** + * hdfsConnect - Connect to an hdfs file system. + * + * Forces a new instance to be created + * + * @param nn The NameNode. See hdfsBuilderSetNameNode for details. + * @param port The port on which the server is listening. + * @return Returns a handle to the filesystem or NULL on error. + * @deprecated Use hdfsBuilderConnect instead. + */ + LIBHDFS_EXTERNAL + hdfsFS hdfsConnectNewInstance(const char* nn, tPort port); + + /** + * Connect to HDFS using the parameters defined by the builder. + * + * The HDFS builder will be freed, whether or not the connection was + * successful. + * + * Every successful call to hdfsBuilderConnect should be matched with a call + * to hdfsDisconnect, when the hdfsFS is no longer needed. + * + * @param bld The HDFS builder + * @return Returns a handle to the filesystem, or NULL on error. + */ + LIBHDFS_EXTERNAL + hdfsFS hdfsBuilderConnect(struct hdfsBuilder *bld); + + /** + * Create an HDFS builder. + * + * @return The HDFS builder, or NULL on error. 
+ */ + LIBHDFS_EXTERNAL + struct hdfsBuilder *hdfsNewBuilder(void); + + /** + * Force the builder to always create a new instance of the FileSystem, + * rather than possibly finding one in the cache. + * + * @param bld The HDFS builder + */ + LIBHDFS_EXTERNAL + void hdfsBuilderSetForceNewInstance(struct hdfsBuilder *bld); + + /** + * Set the HDFS NameNode to connect to. + * + * @param bld The HDFS builder + * @param nn The NameNode to use. + * + * If the string given is 'default', the default NameNode + * configuration will be used (from the XML configuration files) + * + * If NULL is given, a LocalFileSystem will be created. + * + * If the string starts with a protocol type such as file:// or + * hdfs://, this protocol type will be used. If not, the + * hdfs:// protocol type will be used. + * + * You may specify a NameNode port in the usual way by + * passing a string of the format hdfs://:. + * Alternately, you may set the port with + * hdfsBuilderSetNameNodePort. However, you must not pass the + * port in two different ways. + */ + LIBHDFS_EXTERNAL + void hdfsBuilderSetNameNode(struct hdfsBuilder *bld, const char *nn); + + /** + * Set the port of the HDFS NameNode to connect to. + * + * @param bld The HDFS builder + * @param port The port. + */ + LIBHDFS_EXTERNAL + void hdfsBuilderSetNameNodePort(struct hdfsBuilder *bld, tPort port); + + /** + * Set the username to use when connecting to the HDFS cluster. + * + * @param bld The HDFS builder + * @param userName The user name. The string will be shallow-copied. + */ + LIBHDFS_EXTERNAL + void hdfsBuilderSetUserName(struct hdfsBuilder *bld, const char *userName); + + /** + * Set the path to the Kerberos ticket cache to use when connecting to + * the HDFS cluster. + * + * @param bld The HDFS builder + * @param kerbTicketCachePath The Kerberos ticket cache path. The string + * will be shallow-copied. + */ + LIBHDFS_EXTERNAL + void hdfsBuilderSetKerbTicketCachePath(struct hdfsBuilder *bld, + const char *kerbTicketCachePath); + + /** + * Free an HDFS builder. + * + * It is normally not necessary to call this function since + * hdfsBuilderConnect frees the builder. + * + * @param bld The HDFS builder + */ + LIBHDFS_EXTERNAL + void hdfsFreeBuilder(struct hdfsBuilder *bld); + + /** + * Set a configuration string for an HdfsBuilder. + * + * @param key The key to set. + * @param val The value, or NULL to set no value. + * This will be shallow-copied. You are responsible for + * ensuring that it remains valid until the builder is + * freed. + * + * @return 0 on success; nonzero error code otherwise. + */ + LIBHDFS_EXTERNAL + int hdfsBuilderConfSetStr(struct hdfsBuilder *bld, const char *key, + const char *val); + + /** + * Get a configuration string. + * + * @param key The key to find + * @param val (out param) The value. This will be set to NULL if the + * key isn't found. You must free this string with + * hdfsConfStrFree. + * + * @return 0 on success; nonzero error code otherwise. + * Failure to find the key is not an error. + */ + LIBHDFS_EXTERNAL + int hdfsConfGetStr(const char *key, char **val); + + /** + * Get a configuration integer. + * + * @param key The key to find + * @param val (out param) The value. This will NOT be changed if the + * key isn't found. + * + * @return 0 on success; nonzero error code otherwise. + * Failure to find the key is not an error. + */ + LIBHDFS_EXTERNAL + int hdfsConfGetInt(const char *key, int32_t *val); + + /** + * Free a configuration string found with hdfsConfGetStr. 
+ * + * @param val A configuration string obtained from hdfsConfGetStr + */ + LIBHDFS_EXTERNAL + void hdfsConfStrFree(char *val); + + /** + * hdfsDisconnect - Disconnect from the hdfs file system. + * Disconnect from hdfs. + * @param fs The configured filesystem handle. + * @return Returns 0 on success, -1 on error. + * Even if there is an error, the resources associated with the + * hdfsFS will be freed. + */ + LIBHDFS_EXTERNAL + int hdfsDisconnect(hdfsFS fs); + + /** + * hdfsOpenFile - Open a hdfs file in given mode. + * @deprecated Use the hdfsStreamBuilder functions instead. + * This function does not support setting block sizes bigger than 2 GB. + * + * @param fs The configured filesystem handle. + * @param path The full path to the file. + * @param flags - an | of bits/fcntl.h file flags - supported flags are O_RDONLY, O_WRONLY (meaning create or overwrite i.e., implies O_TRUNCAT), + * O_WRONLY|O_APPEND. Other flags are generally ignored other than (O_RDWR || (O_EXCL & O_CREAT)) which return NULL and set errno equal ENOTSUP. + * @param bufferSize Size of buffer for read/write - pass 0 if you want + * to use the default configured values. + * @param replication Block replication - pass 0 if you want to use + * the default configured values. + * @param blocksize Size of block - pass 0 if you want to use the + * default configured values. Note that if you want a block size bigger + * than 2 GB, you must use the hdfsStreamBuilder API rather than this + * deprecated function. + * @return Returns the handle to the open file or NULL on error. + */ + LIBHDFS_EXTERNAL + hdfsFile hdfsOpenFile(hdfsFS fs, const char* path, int flags, + int bufferSize, short replication, tSize blocksize); + + /** + * hdfsStreamBuilderAlloc - Allocate an HDFS stream builder. + * + * @param fs The configured filesystem handle. + * @param path The full path to the file. Will be deep-copied. + * @param flags The open flags, as in hdfsOpenFile. + * @return Returns the hdfsStreamBuilder, or NULL on error. + */ + LIBHDFS_EXTERNAL + struct hdfsStreamBuilder *hdfsStreamBuilderAlloc(hdfsFS fs, + const char *path, int flags); + + /** + * hdfsStreamBuilderFree - Free an HDFS file builder. + * + * It is normally not necessary to call this function since + * hdfsStreamBuilderBuild frees the builder. + * + * @param bld The hdfsStreamBuilder to free. + */ + LIBHDFS_EXTERNAL + void hdfsStreamBuilderFree(struct hdfsStreamBuilder *bld); + + /** + * hdfsStreamBuilderSetBufferSize - Set the stream buffer size. + * + * @param bld The hdfs stream builder. + * @param bufferSize The buffer size to set. + * + * @return 0 on success, or -1 on error. Errno will be set on error. + */ + LIBHDFS_EXTERNAL + int hdfsStreamBuilderSetBufferSize(struct hdfsStreamBuilder *bld, + int32_t bufferSize); + + /** + * hdfsStreamBuilderSetReplication - Set the replication for the stream. + * This is only relevant for output streams, which will create new blocks. + * + * @param bld The hdfs stream builder. + * @param replication The replication to set. + * + * @return 0 on success, or -1 on error. Errno will be set on error. + * If you call this on an input stream builder, you will get + * EINVAL, because this configuration is not relevant to input + * streams. + */ + LIBHDFS_EXTERNAL + int hdfsStreamBuilderSetReplication(struct hdfsStreamBuilder *bld, + int16_t replication); + + /** + * hdfsStreamBuilderSetDefaultBlockSize - Set the default block size for + * the stream. This is only relevant for output streams, which will create + * new blocks. 
+ * + * @param bld The hdfs stream builder. + * @param defaultBlockSize The default block size to set. + * + * @return 0 on success, or -1 on error. Errno will be set on error. + * If you call this on an input stream builder, you will get + * EINVAL, because this configuration is not relevant to input + * streams. + */ + LIBHDFS_EXTERNAL + int hdfsStreamBuilderSetDefaultBlockSize(struct hdfsStreamBuilder *bld, + int64_t defaultBlockSize); + + /** + * hdfsStreamBuilderBuild - Build the stream by calling open or create. + * + * @param bld The hdfs stream builder. This pointer will be freed, whether + * or not the open succeeds. + * + * @return the stream pointer on success, or NULL on error. Errno will be + * set on error. + */ + LIBHDFS_EXTERNAL + hdfsFile hdfsStreamBuilderBuild(struct hdfsStreamBuilder *bld); + + /** + * hdfsTruncateFile - Truncate a hdfs file to given lenght. + * @param fs The configured filesystem handle. + * @param path The full path to the file. + * @param newlength The size the file is to be truncated to + * @return 1 if the file has been truncated to the desired newlength + * and is immediately available to be reused for write operations + * such as append. + * 0 if a background process of adjusting the length of the last + * block has been started, and clients should wait for it to + * complete before proceeding with further file updates. + * -1 on error. + */ + int hdfsTruncateFile(hdfsFS fs, const char* path, tOffset newlength); + + /** + * hdfsUnbufferFile - Reduce the buffering done on a file. + * + * @param file The file to unbuffer. + * @return 0 on success + * ENOTSUP if the file does not support unbuffering + * Errno will also be set to this value. + */ + LIBHDFS_EXTERNAL + int hdfsUnbufferFile(hdfsFile file); + + /** + * hdfsCloseFile - Close an open file. + * @param fs The configured filesystem handle. + * @param file The file handle. + * @return Returns 0 on success, -1 on error. + * On error, errno will be set appropriately. + * If the hdfs file was valid, the memory associated with it will + * be freed at the end of this call, even if there was an I/O + * error. + */ + LIBHDFS_EXTERNAL + int hdfsCloseFile(hdfsFS fs, hdfsFile file); + + + /** + * hdfsExists - Checks if a given path exsits on the filesystem + * @param fs The configured filesystem handle. + * @param path The path to look for + * @return Returns 0 on success, -1 on error. + */ + LIBHDFS_EXTERNAL + int hdfsExists(hdfsFS fs, const char *path); + + + /** + * hdfsSeek - Seek to given offset in file. + * This works only for files opened in read-only mode. + * @param fs The configured filesystem handle. + * @param file The file handle. + * @param desiredPos Offset into the file to seek into. + * @return Returns 0 on success, -1 on error. + */ + LIBHDFS_EXTERNAL + int hdfsSeek(hdfsFS fs, hdfsFile file, tOffset desiredPos); + + + /** + * hdfsTell - Get the current offset in the file, in bytes. + * @param fs The configured filesystem handle. + * @param file The file handle. + * @return Current offset, -1 on error. + */ + LIBHDFS_EXTERNAL + tOffset hdfsTell(hdfsFS fs, hdfsFile file); + + + /** + * hdfsRead - Read data from an open file. + * @param fs The configured filesystem handle. + * @param file The file handle. + * @param buffer The buffer to copy read bytes into. + * @param length The length of the buffer. + * @return On success, a positive number indicating how many bytes + * were read. + * On end-of-file, 0. + * On error, -1. Errno will be set to the error code. 
+ * Just like the POSIX read function, hdfsRead will return -1 + * and set errno to EINTR if data is temporarily unavailable, + * but we are not yet at the end of the file. + */ + LIBHDFS_EXTERNAL + tSize hdfsRead(hdfsFS fs, hdfsFile file, void* buffer, tSize length); + + /** + * hdfsPread - Positional read of data from an open file. + * @param fs The configured filesystem handle. + * @param file The file handle. + * @param position Position from which to read + * @param buffer The buffer to copy read bytes into. + * @param length The length of the buffer. + * @return See hdfsRead + */ + LIBHDFS_EXTERNAL + tSize hdfsPread(hdfsFS fs, hdfsFile file, tOffset position, + void* buffer, tSize length); + + + /** + * hdfsWrite - Write data into an open file. + * @param fs The configured filesystem handle. + * @param file The file handle. + * @param buffer The data. + * @param length The no. of bytes to write. + * @return Returns the number of bytes written, -1 on error. + */ + LIBHDFS_EXTERNAL + tSize hdfsWrite(hdfsFS fs, hdfsFile file, const void* buffer, + tSize length); + + + /** + * hdfsWrite - Flush the data. + * @param fs The configured filesystem handle. + * @param file The file handle. + * @return Returns 0 on success, -1 on error. + */ + LIBHDFS_EXTERNAL + int hdfsFlush(hdfsFS fs, hdfsFile file); + + + /** + * hdfsHFlush - Flush out the data in client's user buffer. After the + * return of this call, new readers will see the data. + * @param fs configured filesystem handle + * @param file file handle + * @return 0 on success, -1 on error and sets errno + */ + LIBHDFS_EXTERNAL + int hdfsHFlush(hdfsFS fs, hdfsFile file); + + + /** + * hdfsHSync - Similar to posix fsync, Flush out the data in client's + * user buffer. all the way to the disk device (but the disk may have + * it in its cache). + * @param fs configured filesystem handle + * @param file file handle + * @return 0 on success, -1 on error and sets errno + */ + LIBHDFS_EXTERNAL + int hdfsHSync(hdfsFS fs, hdfsFile file); + + + /** + * hdfsAvailable - Number of bytes that can be read from this + * input stream without blocking. + * @param fs The configured filesystem handle. + * @param file The file handle. + * @return Returns available bytes; -1 on error. + */ + LIBHDFS_EXTERNAL + int hdfsAvailable(hdfsFS fs, hdfsFile file); + + + /** + * hdfsCopy - Copy file from one filesystem to another. + * @param srcFS The handle to source filesystem. + * @param src The path of source file. + * @param dstFS The handle to destination filesystem. + * @param dst The path of destination file. + * @return Returns 0 on success, -1 on error. + */ + LIBHDFS_EXTERNAL + int hdfsCopy(hdfsFS srcFS, const char* src, hdfsFS dstFS, const char* dst); + + + /** + * hdfsMove - Move file from one filesystem to another. + * @param srcFS The handle to source filesystem. + * @param src The path of source file. + * @param dstFS The handle to destination filesystem. + * @param dst The path of destination file. + * @return Returns 0 on success, -1 on error. + */ + LIBHDFS_EXTERNAL + int hdfsMove(hdfsFS srcFS, const char* src, hdfsFS dstFS, const char* dst); + + + /** + * hdfsDelete - Delete file. + * @param fs The configured filesystem handle. + * @param path The path of the file. + * @param recursive if path is a directory and set to + * non-zero, the directory is deleted else throws an exception. In + * case of a file the recursive argument is irrelevant. + * @return Returns 0 on success, -1 on error. 
+ */ + LIBHDFS_EXTERNAL + int hdfsDelete(hdfsFS fs, const char* path, int recursive); + + /** + * hdfsRename - Rename file. + * @param fs The configured filesystem handle. + * @param oldPath The path of the source file. + * @param newPath The path of the destination file. + * @return Returns 0 on success, -1 on error. + */ + LIBHDFS_EXTERNAL + int hdfsRename(hdfsFS fs, const char* oldPath, const char* newPath); + + + /** + * hdfsGetWorkingDirectory - Get the current working directory for + * the given filesystem. + * @param fs The configured filesystem handle. + * @param buffer The user-buffer to copy path of cwd into. + * @param bufferSize The length of user-buffer. + * @return Returns buffer, NULL on error. + */ + LIBHDFS_EXTERNAL + char* hdfsGetWorkingDirectory(hdfsFS fs, char *buffer, size_t bufferSize); + + + /** + * hdfsSetWorkingDirectory - Set the working directory. All relative + * paths will be resolved relative to it. + * @param fs The configured filesystem handle. + * @param path The path of the new 'cwd'. + * @return Returns 0 on success, -1 on error. + */ + LIBHDFS_EXTERNAL + int hdfsSetWorkingDirectory(hdfsFS fs, const char* path); + + + /** + * hdfsCreateDirectory - Make the given file and all non-existent + * parents into directories. + * @param fs The configured filesystem handle. + * @param path The path of the directory. + * @return Returns 0 on success, -1 on error. + */ + LIBHDFS_EXTERNAL + int hdfsCreateDirectory(hdfsFS fs, const char* path); + + + /** + * hdfsSetReplication - Set the replication of the specified + * file to the supplied value + * @param fs The configured filesystem handle. + * @param path The path of the file. + * @return Returns 0 on success, -1 on error. + */ + LIBHDFS_EXTERNAL + int hdfsSetReplication(hdfsFS fs, const char* path, int16_t replication); + + + /** + * hdfsFileInfo - Information about a file/directory. + */ + typedef struct { + tObjectKind mKind; /* file or directory */ + char *mName; /* the name of the file */ + tTime mLastMod; /* the last modification time for the file in seconds */ + tOffset mSize; /* the size of the file in bytes */ + short mReplication; /* the count of replicas */ + tOffset mBlockSize; /* the block size for the file */ + char *mOwner; /* the owner of the file */ + char *mGroup; /* the group associated with the file */ + short mPermissions; /* the permissions associated with the file */ + tTime mLastAccess; /* the last access time for the file in seconds */ + } hdfsFileInfo; + + + /** + * hdfsListDirectory - Get list of files/directories for a given + * directory-path. hdfsFreeFileInfo should be called to deallocate memory. + * @param fs The configured filesystem handle. + * @param path The path of the directory. + * @param numEntries Set to the number of files/directories in path. + * @return Returns a dynamically-allocated array of hdfsFileInfo + * objects; NULL on error or empty directory. + * errno is set to non-zero on error or zero on success. + */ + LIBHDFS_EXTERNAL + hdfsFileInfo *hdfsListDirectory(hdfsFS fs, const char* path, + int *numEntries); + + + /** + * hdfsGetPathInfo - Get information about a path as a (dynamically + * allocated) single hdfsFileInfo struct. hdfsFreeFileInfo should be + * called when the pointer is no longer needed. + * @param fs The configured filesystem handle. + * @param path The path of the file. + * @return Returns a dynamically-allocated hdfsFileInfo object; + * NULL on error. 
+ */ + LIBHDFS_EXTERNAL + hdfsFileInfo *hdfsGetPathInfo(hdfsFS fs, const char* path); + + + /** + * hdfsFreeFileInfo - Free up the hdfsFileInfo array (including fields) + * @param hdfsFileInfo The array of dynamically-allocated hdfsFileInfo + * objects. + * @param numEntries The size of the array. + */ + LIBHDFS_EXTERNAL + void hdfsFreeFileInfo(hdfsFileInfo *hdfsFileInfo, int numEntries); + + /** + * hdfsFileIsEncrypted: determine if a file is encrypted based on its + * hdfsFileInfo. + * @return -1 if there was an error (errno will be set), 0 if the file is + * not encrypted, 1 if the file is encrypted. + */ + LIBHDFS_EXTERNAL + int hdfsFileIsEncrypted(hdfsFileInfo *hdfsFileInfo); + + + /** + * hdfsGetHosts - Get hostnames where a particular block (determined by + * pos & blocksize) of a file is stored. The last element in the array + * is NULL. Due to replication, a single block could be present on + * multiple hosts. + * @param fs The configured filesystem handle. + * @param path The path of the file. + * @param start The start of the block. + * @param length The length of the block. + * @return Returns a dynamically-allocated 2-d array of blocks-hosts; + * NULL on error. + */ + LIBHDFS_EXTERNAL + char*** hdfsGetHosts(hdfsFS fs, const char* path, + tOffset start, tOffset length); + + + /** + * hdfsFreeHosts - Free up the structure returned by hdfsGetHosts + * @param hdfsFileInfo The array of dynamically-allocated hdfsFileInfo + * objects. + * @param numEntries The size of the array. + */ + LIBHDFS_EXTERNAL + void hdfsFreeHosts(char ***blockHosts); + + + /** + * hdfsGetDefaultBlockSize - Get the default blocksize. + * + * @param fs The configured filesystem handle. + * @deprecated Use hdfsGetDefaultBlockSizeAtPath instead. + * + * @return Returns the default blocksize, or -1 on error. + */ + LIBHDFS_EXTERNAL + tOffset hdfsGetDefaultBlockSize(hdfsFS fs); + + + /** + * hdfsGetDefaultBlockSizeAtPath - Get the default blocksize at the + * filesystem indicated by a given path. + * + * @param fs The configured filesystem handle. + * @param path The given path will be used to locate the actual + * filesystem. The full path does not have to exist. + * + * @return Returns the default blocksize, or -1 on error. + */ + LIBHDFS_EXTERNAL + tOffset hdfsGetDefaultBlockSizeAtPath(hdfsFS fs, const char *path); + + + /** + * hdfsGetCapacity - Return the raw capacity of the filesystem. + * @param fs The configured filesystem handle. + * @return Returns the raw-capacity; -1 on error. + */ + LIBHDFS_EXTERNAL + tOffset hdfsGetCapacity(hdfsFS fs); + + + /** + * hdfsGetUsed - Return the total raw size of all files in the filesystem. + * @param fs The configured filesystem handle. + * @return Returns the total-size; -1 on error. + */ + LIBHDFS_EXTERNAL + tOffset hdfsGetUsed(hdfsFS fs); + + /** + * Change the user and/or group of a file or directory. + * + * @param fs The configured filesystem handle. + * @param path the path to the file or directory + * @param owner User string. Set to NULL for 'no change' + * @param group Group string. Set to NULL for 'no change' + * @return 0 on success else -1 + */ + LIBHDFS_EXTERNAL + int hdfsChown(hdfsFS fs, const char* path, const char *owner, + const char *group); + + /** + * hdfsChmod + * @param fs The configured filesystem handle. 
+ * @param path the path to the file or directory + * @param mode the bitmask to set it to + * @return 0 on success else -1 + */ + LIBHDFS_EXTERNAL + int hdfsChmod(hdfsFS fs, const char* path, short mode); + + /** + * hdfsUtime + * @param fs The configured filesystem handle. + * @param path the path to the file or directory + * @param mtime new modification time or -1 for no change + * @param atime new access time or -1 for no change + * @return 0 on success else -1 + */ + LIBHDFS_EXTERNAL + int hdfsUtime(hdfsFS fs, const char* path, tTime mtime, tTime atime); + + /** + * Allocate a zero-copy options structure. + * + * You must free all options structures allocated with this function using + * hadoopRzOptionsFree. + * + * @return A zero-copy options structure, or NULL if one could + * not be allocated. If NULL is returned, errno will + * contain the error number. + */ + LIBHDFS_EXTERNAL + struct hadoopRzOptions *hadoopRzOptionsAlloc(void); + + /** + * Determine whether we should skip checksums in read0. + * + * @param opts The options structure. + * @param skip Nonzero to skip checksums sometimes; zero to always + * check them. + * + * @return 0 on success; -1 plus errno on failure. + */ + LIBHDFS_EXTERNAL + int hadoopRzOptionsSetSkipChecksum( + struct hadoopRzOptions *opts, int skip); + + /** + * Set the ByteBufferPool to use with read0. + * + * @param opts The options structure. + * @param className If this is NULL, we will not use any + * ByteBufferPool. If this is non-NULL, it will be + * treated as the name of the pool class to use. + * For example, you can use + * ELASTIC_BYTE_BUFFER_POOL_CLASS. + * + * @return 0 if the ByteBufferPool class was found and + * instantiated; + * -1 plus errno otherwise. + */ + LIBHDFS_EXTERNAL + int hadoopRzOptionsSetByteBufferPool( + struct hadoopRzOptions *opts, const char *className); + + /** + * Free a hadoopRzOptionsFree structure. + * + * @param opts The options structure to free. + * Any associated ByteBufferPool will also be freed. + */ + LIBHDFS_EXTERNAL + void hadoopRzOptionsFree(struct hadoopRzOptions *opts); + + /** + * Perform a byte buffer read. + * If possible, this will be a zero-copy (mmap) read. + * + * @param file The file to read from. + * @param opts An options structure created by hadoopRzOptionsAlloc. + * @param maxLength The maximum length to read. We may read fewer bytes + * than this length. + * + * @return On success, we will return a new hadoopRzBuffer. + * This buffer will continue to be valid and readable + * until it is released by readZeroBufferFree. Failure to + * release a buffer will lead to a memory leak. + * You can access the data within the hadoopRzBuffer with + * hadoopRzBufferGet. If you have reached EOF, the data + * within the hadoopRzBuffer will be NULL. You must still + * free hadoopRzBuffer instances containing NULL. + * + * On failure, we will return NULL plus an errno code. + * errno = EOPNOTSUPP indicates that we could not do a + * zero-copy read, and there was no ByteBufferPool + * supplied. + */ + LIBHDFS_EXTERNAL + struct hadoopRzBuffer* hadoopReadZero(hdfsFile file, + struct hadoopRzOptions *opts, int32_t maxLength); + + /** + * Determine the length of the buffer returned from readZero. + * + * @param buffer a buffer returned from readZero. + * @return the length of the buffer. + */ + LIBHDFS_EXTERNAL + int32_t hadoopRzBufferLength(const struct hadoopRzBuffer *buffer); + + /** + * Get a pointer to the raw buffer returned from readZero. 
+ * + * To find out how many bytes this buffer contains, call + * hadoopRzBufferLength. + * + * @param buffer a buffer returned from readZero. + * @return a pointer to the start of the buffer. This will be + * NULL when end-of-file has been reached. + */ + LIBHDFS_EXTERNAL + const void *hadoopRzBufferGet(const struct hadoopRzBuffer *buffer); + + /** + * Release a buffer obtained through readZero. + * + * @param file The hdfs stream that created this buffer. This must be + * the same stream you called hadoopReadZero on. + * @param buffer The buffer to release. + */ + LIBHDFS_EXTERNAL + void hadoopRzBufferFree(hdfsFile file, struct hadoopRzBuffer *buffer); + +#ifdef __cplusplus +} +#endif + +#undef LIBHDFS_EXTERNAL +#endif /*LIBHDFS_HDFS_H*/ + +/** + * vim: ts=4: sw=4: et + */ diff --git a/dev/merge_arrow_pr.py b/dev/merge_arrow_pr.py index 981779ffb4c76..8f47f93b26dd1 100755 --- a/dev/merge_arrow_pr.py +++ b/dev/merge_arrow_pr.py @@ -173,7 +173,10 @@ def merge_pr(pr_num, target_ref): for c in commits: merge_message_flags += ["-m", c] - run_cmd(['git', 'commit', '--author="%s"' % primary_author] + merge_message_flags) + run_cmd(['git', 'commit', + '--no-verify', # do not run commit hooks + '--author="%s"' % primary_author] + + merge_message_flags) continue_maybe("Merge complete (local ref %s). Push to %s?" % ( target_branch_name, PUSH_REMOTE_NAME)) diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index f1becfcf44964..fdbfce99656ca 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -348,8 +348,10 @@ find_package(Arrow REQUIRED) include_directories(SYSTEM ${ARROW_INCLUDE_DIR}) ADD_THIRDPARTY_LIB(arrow SHARED_LIB ${ARROW_SHARED_LIB}) +ADD_THIRDPARTY_LIB(arrow_io + SHARED_LIB ${ARROW_IO_SHARED_LIB}) ADD_THIRDPARTY_LIB(arrow_parquet - SHARED_LIB ${ARROW_PARQUET_SHARED_LIB}) + SHARED_LIB ${ARROW_PARQUET_SHARED_LIB}) ############################################################ # Linker setup @@ -428,6 +430,7 @@ set(PYARROW_SRCS set(LINK_LIBS arrow + arrow_io arrow_parquet ) @@ -449,6 +452,7 @@ set(CYTHON_EXTENSIONS array config error + io parquet scalar schema diff --git a/python/cmake_modules/FindArrow.cmake b/python/cmake_modules/FindArrow.cmake index f0b258ed027b0..6bd305615fce2 100644 --- a/python/cmake_modules/FindArrow.cmake +++ b/python/cmake_modules/FindArrow.cmake @@ -47,13 +47,24 @@ find_library(ARROW_PARQUET_LIB_PATH NAMES arrow_parquet ${ARROW_SEARCH_LIB_PATH} NO_DEFAULT_PATH) +find_library(ARROW_IO_LIB_PATH NAMES arrow_io + PATHS + ${ARROW_SEARCH_LIB_PATH} + NO_DEFAULT_PATH) + if (ARROW_INCLUDE_DIR AND ARROW_LIB_PATH AND ARROW_PARQUET_LIB_PATH) set(ARROW_FOUND TRUE) set(ARROW_LIB_NAME libarrow) + set(ARROW_IO_LIB_NAME libarrow_io) set(ARROW_PARQUET_LIB_NAME libarrow_parquet) + set(ARROW_LIBS ${ARROW_SEARCH_LIB_PATH}) set(ARROW_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_LIB_NAME}.a) set(ARROW_SHARED_LIB ${ARROW_LIBS}/${ARROW_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) + + set(ARROW_IO_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_IO_LIB_NAME}.a) + set(ARROW_IO_SHARED_LIB ${ARROW_LIBS}/${ARROW_IO_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(ARROW_PARQUET_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_PARQUET_LIB_NAME}.a) set(ARROW_PARQUET_SHARED_LIB ${ARROW_LIBS}/${ARROW_PARQUET_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) else () @@ -62,7 +73,9 @@ endif () if (ARROW_FOUND) if (NOT Arrow_FIND_QUIETLY) - message(STATUS "Found the Arrow library: ${ARROW_LIB_PATH}, ${ARROW_PARQUET_LIB_PATH}") + message(STATUS "Found the Arrow core library: ${ARROW_LIB_PATH}") + 
message(STATUS "Found the Arrow IO library: ${ARROW_IO_LIB_PATH}") + message(STATUS "Found the Arrow Parquet library: ${ARROW_PARQUET_LIB_PATH}") endif () else () if (NOT Arrow_FIND_QUIETLY) @@ -82,6 +95,8 @@ mark_as_advanced( ARROW_LIBS ARROW_STATIC_LIB ARROW_SHARED_LIB + ARROW_IO_STATIC_LIB + ARROW_IO_SHARED_LIB ARROW_PARQUET_STATIC_LIB ARROW_PARQUET_SHARED_LIB ) diff --git a/python/conda.recipe/meta.yaml b/python/conda.recipe/meta.yaml index 85d24b6bc322e..98ae4141e3bd7 100644 --- a/python/conda.recipe/meta.yaml +++ b/python/conda.recipe/meta.yaml @@ -26,6 +26,7 @@ requirements: run: - arrow-cpp + - parquet-cpp - python - numpy - pandas diff --git a/python/pyarrow/error.pxd b/python/pyarrow/error.pxd index 97ba0ef2e9fcb..1fb6fad396a8b 100644 --- a/python/pyarrow/error.pxd +++ b/python/pyarrow/error.pxd @@ -18,5 +18,5 @@ from pyarrow.includes.libarrow cimport CStatus from pyarrow.includes.pyarrow cimport * -cdef check_cstatus(const CStatus& status) -cdef check_status(const Status& status) +cdef int check_cstatus(const CStatus& status) nogil except -1 +cdef int check_status(const Status& status) nogil except -1 diff --git a/python/pyarrow/error.pyx b/python/pyarrow/error.pyx index 5a6a038a92e43..244019321a7fd 100644 --- a/python/pyarrow/error.pyx +++ b/python/pyarrow/error.pyx @@ -22,16 +22,18 @@ from pyarrow.compat import frombytes class ArrowException(Exception): pass -cdef check_cstatus(const CStatus& status): +cdef int check_cstatus(const CStatus& status) nogil except -1: if status.ok(): - return + return 0 cdef c_string c_message = status.ToString() - raise ArrowException(frombytes(c_message)) + with gil: + raise ArrowException(frombytes(c_message)) -cdef check_status(const Status& status): +cdef int check_status(const Status& status) nogil except -1: if status.ok(): - return + return 0 cdef c_string c_message = status.ToString() - raise ArrowException(frombytes(c_message)) + with gil: + raise ArrowException(frombytes(c_message)) diff --git a/python/pyarrow/includes/common.pxd b/python/pyarrow/includes/common.pxd index 1f6ecee510521..133797bc37b5c 100644 --- a/python/pyarrow/includes/common.pxd +++ b/python/pyarrow/includes/common.pxd @@ -33,3 +33,21 @@ cdef extern from "": cdef extern from "": void Py_XDECREF(PyObject* o) +cdef extern from "arrow/api.h" namespace "arrow" nogil: + # We can later add more of the common status factory methods as needed + cdef CStatus CStatus_OK "Status::OK"() + + cdef cppclass CStatus "arrow::Status": + CStatus() + + c_string ToString() + + c_bool ok() + c_bool IsOutOfMemory() + c_bool IsKeyError() + c_bool IsNotImplemented() + c_bool IsInvalid() + + cdef cppclass Buffer: + uint8_t* data() + int64_t size() diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 90414e3d542db..91ce069df8f42 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -19,25 +19,6 @@ from pyarrow.includes.common cimport * -cdef extern from "arrow/api.h" namespace "arrow" nogil: - # We can later add more of the common status factory methods as needed - cdef CStatus CStatus_OK "Status::OK"() - - cdef cppclass CStatus "arrow::Status": - CStatus() - - c_string ToString() - - c_bool ok() - c_bool IsOutOfMemory() - c_bool IsKeyError() - c_bool IsNotImplemented() - c_bool IsInvalid() - - cdef cppclass Buffer: - uint8_t* data() - int64_t size() - cdef extern from "arrow/api.h" namespace "arrow" nogil: enum Type" arrow::Type::type": diff --git a/python/pyarrow/includes/libarrow_io.pxd 
b/python/pyarrow/includes/libarrow_io.pxd new file mode 100644 index 0000000000000..d874ba3091237 --- /dev/null +++ b/python/pyarrow/includes/libarrow_io.pxd @@ -0,0 +1,93 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# distutils: language = c++ + +from pyarrow.includes.common cimport * + +cdef extern from "arrow/io/interfaces.h" nogil: + enum ObjectType" arrow::io::ObjectType::type": + ObjectType_FILE" arrow::io::ObjectType::FILE" + ObjectType_DIRECTORY" arrow::io::ObjectType::DIRECTORY" + +cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: + CStatus ConnectLibHdfs() + + cdef cppclass HdfsConnectionConfig: + c_string host + int port + c_string user + + cdef cppclass HdfsPathInfo: + ObjectType kind; + c_string name + c_string owner + c_string group + int32_t last_modified_time + int32_t last_access_time + int64_t size + int16_t replication + int64_t block_size + int16_t permissions + + cdef cppclass CHdfsFile: + CStatus Close() + CStatus Seek(int64_t position) + CStatus Tell(int64_t* position) + + cdef cppclass HdfsReadableFile(CHdfsFile): + CStatus GetSize(int64_t* size) + CStatus Read(int32_t nbytes, int32_t* bytes_read, + uint8_t* buffer) + + CStatus ReadAt(int64_t position, int32_t nbytes, + int32_t* bytes_read, uint8_t* buffer) + + cdef cppclass HdfsWriteableFile(CHdfsFile): + CStatus Write(const uint8_t* buffer, int32_t nbytes) + + CStatus Write(const uint8_t* buffer, int32_t nbytes, + int32_t* bytes_written) + + cdef cppclass CHdfsClient" arrow::io::HdfsClient": + @staticmethod + CStatus Connect(const HdfsConnectionConfig* config, + shared_ptr[CHdfsClient]* client) + + CStatus CreateDirectory(const c_string& path) + + CStatus Delete(const c_string& path, c_bool recursive) + + CStatus Disconnect() + + c_bool Exists(const c_string& path) + + CStatus GetCapacity(int64_t* nbytes) + CStatus GetUsed(int64_t* nbytes) + + CStatus ListDirectory(const c_string& path, + vector[HdfsPathInfo]* listing) + + CStatus Rename(const c_string& src, const c_string& dst) + + CStatus OpenReadable(const c_string& path, + shared_ptr[HdfsReadableFile]* handle) + + CStatus OpenWriteable(const c_string& path, c_bool append, + int32_t buffer_size, int16_t replication, + int64_t default_block_size, + shared_ptr[HdfsWriteableFile]* handle) diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx new file mode 100644 index 0000000000000..8b97671e45373 --- /dev/null +++ b/python/pyarrow/io.pyx @@ -0,0 +1,504 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
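
The CHdfsClient methods declared above are surfaced by the HdfsClient wrapper implemented below in io.pyx. A minimal usage sketch, assuming libhdfs is available and a namenode is reachable; the host, port, and user values are placeholders.

    # Sketch: basic metadata operations with the new pyarrow.io.HdfsClient.
    import pyarrow.io as io

    client = io.HdfsClient.connect('localhost', 8020, 'hdfs')  # placeholders
    try:
        client.mkdir('/tmp/pyarrow-demo')
        assert client.exists('/tmp/pyarrow-demo')
        # full_info=False returns plain paths instead of metadata dicts
        print(client.ls('/tmp', full_info=False))
    finally:
        client.close()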
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# Cython wrappers for IO interfaces defined in arrow/io + +# cython: profile=False +# distutils: language = c++ +# cython: embedsignature = True + +from libc.stdlib cimport malloc, free + +from pyarrow.includes.libarrow cimport * +cimport pyarrow.includes.pyarrow as pyarrow +from pyarrow.includes.libarrow_io cimport * + +from pyarrow.compat import frombytes, tobytes +from pyarrow.error cimport check_cstatus + +cimport cpython as cp + +import re +import sys +import threading + +_HDFS_PATH_RE = re.compile('hdfs://(.*):(\d+)(.*)') + +try: + # Python 3 + from queue import Queue, Empty as QueueEmpty, Full as QueueFull +except ImportError: + from Queue import Queue, Empty as QueueEmpty, Full as QueueFull + + +def have_libhdfs(): + try: + check_cstatus(ConnectLibHdfs()) + return True + except: + return False + + +def strip_hdfs_abspath(path): + m = _HDFS_PATH_RE.match(path) + if m: + return m.group(3) + else: + return path + + +cdef class HdfsClient: + cdef: + shared_ptr[CHdfsClient] client + + cdef readonly: + object host + int port + object user + bint is_open + + def __cinit__(self): + self.is_open = False + + def __dealloc__(self): + if self.is_open: + self.close() + + def close(self): + self._ensure_client() + with nogil: + check_cstatus(self.client.get().Disconnect()) + self.is_open = False + + cdef _ensure_client(self): + if self.client.get() == NULL: + raise IOError('HDFS client improperly initialized') + elif not self.is_open: + raise IOError('HDFS client is closed') + + @classmethod + def connect(cls, host, port, user): + """ + + Parameters + ---------- + host : + port : + user : + + Notes + ----- + The first time you call this method, it will take longer than usual due + to JNI spin-up time. + + Returns + ------- + client : HDFSClient + """ + cdef: + HdfsClient out = HdfsClient() + HdfsConnectionConfig conf + + conf.host = tobytes(host) + conf.port = port + conf.user = tobytes(user) + + with nogil: + check_cstatus( + CHdfsClient.Connect(&conf, &out.client)) + out.is_open = True + + return out + + def exists(self, path): + """ + Returns True if the path is known to the cluster, False if it does not + (or there is an RPC error) + """ + self._ensure_client() + + cdef c_string c_path = tobytes(path) + cdef c_bool result + with nogil: + result = self.client.get().Exists(c_path) + return result + + def ls(self, path, bint full_info=True): + """ + Retrieve directory contents and metadata, if requested. 
+ + Parameters + ---------- + path : HDFS path + full_info : boolean, default True + If False, only return list of paths + + Returns + ------- + result : list of dicts (full_info=True) or strings (full_info=False) + """ + cdef: + c_string c_path = tobytes(path) + vector[HdfsPathInfo] listing + list results = [] + int i + + self._ensure_client() + + with nogil: + check_cstatus(self.client.get() + .ListDirectory(c_path, &listing)) + + cdef const HdfsPathInfo* info + for i in range(listing.size()): + info = &listing[i] + + # Try to trim off the hdfs://HOST:PORT piece + name = strip_hdfs_abspath(frombytes(info.name)) + + if full_info: + kind = ('file' if info.kind == ObjectType_FILE + else 'directory') + + results.append({ + 'kind': kind, + 'name': name, + 'owner': frombytes(info.owner), + 'group': frombytes(info.group), + 'list_modified_time': info.last_modified_time, + 'list_access_time': info.last_access_time, + 'size': info.size, + 'replication': info.replication, + 'block_size': info.block_size, + 'permissions': info.permissions + }) + else: + results.append(name) + + return results + + def mkdir(self, path): + """ + Create indicated directory and any necessary parent directories + """ + self._ensure_client() + + cdef c_string c_path = tobytes(path) + with nogil: + check_cstatus(self.client.get() + .CreateDirectory(c_path)) + + def delete(self, path, bint recursive=False): + """ + Delete the indicated file or directory + + Parameters + ---------- + path : string + recursive : boolean, default False + If True, also delete child paths for directories + """ + self._ensure_client() + + cdef c_string c_path = tobytes(path) + with nogil: + check_cstatus(self.client.get() + .Delete(c_path, recursive)) + + def open(self, path, mode='rb', buffer_size=None, replication=None, + default_block_size=None): + """ + Parameters + ---------- + mode : string, 'rb', 'wb', 'ab' + """ + self._ensure_client() + + cdef HdfsFile out = HdfsFile() + + if mode not in ('rb', 'wb', 'ab'): + raise Exception("Mode must be 'rb' (read), " + "'wb' (write, new file), or 'ab' (append)") + + cdef c_string c_path = tobytes(path) + cdef c_bool append = False + + # 0 in libhdfs means "use the default" + cdef int32_t c_buffer_size = buffer_size or 0 + cdef int16_t c_replication = replication or 0 + cdef int64_t c_default_block_size = default_block_size or 0 + + if mode in ('wb', 'ab'): + if mode == 'ab': + append = True + + with nogil: + check_cstatus( + self.client.get() + .OpenWriteable(c_path, append, c_buffer_size, + c_replication, c_default_block_size, + &out.wr_file)) + + out.is_readonly = False + else: + with nogil: + check_cstatus(self.client.get() + .OpenReadable(c_path, &out.rd_file)) + out.is_readonly = True + + if c_buffer_size == 0: + c_buffer_size = 2 ** 16 + + out.mode = mode + out.buffer_size = c_buffer_size + out.parent = self + out.is_open = True + + return out + + def upload(self, path, stream, buffer_size=2**16): + """ + Upload file-like object to HDFS path + """ + write_queue = Queue(50) + + f = self.open(path, 'wb') + + done = False + exc_info = None + def bg_write(): + try: + while not done or write_queue.qsize() > 0: + try: + buf = write_queue.get(timeout=0.01) + except QueueEmpty: + continue + + f.write(buf) + + except Exception as e: + exc_info = sys.exc_info() + + writer_thread = threading.Thread(target=bg_write) + writer_thread.start() + + try: + while True: + buf = stream.read(buffer_size) + if not buf: + break + + write_queue.put_nowait(buf) + finally: + done = True + + writer_thread.join() + if 
exc_info is not None: + raise exc_info[0], exc_info[1], exc_info[2] + + def download(self, path, stream, buffer_size=None): + f = self.open(path, 'rb', buffer_size=buffer_size) + f.download(stream) + + +cdef class HdfsFile: + cdef: + shared_ptr[HdfsReadableFile] rd_file + shared_ptr[HdfsWriteableFile] wr_file + bint is_readonly + bint is_open + object parent + + cdef readonly: + int32_t buffer_size + object mode + + def __cinit__(self): + self.is_open = False + + def __dealloc__(self): + if self.is_open: + self.close() + + def __enter__(self): + return self + + def __exit__(self, exc_type, exc_value, tb): + self.close() + + def close(self): + if self.is_open: + with nogil: + if self.is_readonly: + check_cstatus(self.rd_file.get().Close()) + else: + check_cstatus(self.wr_file.get().Close()) + self.is_open = False + + cdef _assert_readable(self): + if not self.is_readonly: + raise IOError("only valid on readonly files") + + cdef _assert_writeable(self): + if self.is_readonly: + raise IOError("only valid on writeonly files") + + def size(self): + cdef int64_t size + self._assert_readable() + with nogil: + check_cstatus(self.rd_file.get().GetSize(&size)) + return size + + def tell(self): + cdef int64_t position + with nogil: + if self.is_readonly: + check_cstatus(self.rd_file.get().Tell(&position)) + else: + check_cstatus(self.wr_file.get().Tell(&position)) + return position + + def seek(self, int64_t position): + self._assert_readable() + with nogil: + check_cstatus(self.rd_file.get().Seek(position)) + + def read(self, int nbytes): + """ + Read indicated number of bytes from the file, up to EOF + """ + cdef: + int32_t bytes_read = 0 + uint8_t* buf + + self._assert_readable() + + # This isn't ideal -- PyBytes_FromStringAndSize copies the data from + # the passed buffer, so it's hard for us to avoid doubling the memory + buf = malloc(nbytes) + if buf == NULL: + raise MemoryError("Failed to allocate {0} bytes".format(nbytes)) + + cdef int32_t total_bytes = 0 + + cdef int rpc_chunksize = min(self.buffer_size, nbytes) + + try: + with nogil: + while total_bytes < nbytes: + check_cstatus(self.rd_file.get() + .Read(rpc_chunksize, &bytes_read, + buf + total_bytes)) + + total_bytes += bytes_read + + # EOF + if bytes_read == 0: + break + result = cp.PyBytes_FromStringAndSize(buf, + total_bytes) + finally: + free(buf) + + return result + + def download(self, stream_or_path): + """ + Read file completely to local path (rather than reading completely into + memory). First seeks to the beginning of the file. 
+ """ + cdef: + int32_t bytes_read = 0 + uint8_t* buf + self._assert_readable() + + write_queue = Queue(50) + + if not hasattr(stream_or_path, 'read'): + stream = open(stream_or_path, 'wb') + cleanup = lambda: stream.close() + else: + stream = stream_or_path + cleanup = lambda: None + + done = False + exc_info = None + def bg_write(): + try: + while not done or write_queue.qsize() > 0: + try: + buf = write_queue.get(timeout=0.01) + except QueueEmpty: + continue + stream.write(buf) + except Exception as e: + exc_info = sys.exc_info() + finally: + cleanup() + + self.seek(0) + + writer_thread = threading.Thread(target=bg_write) + + # This isn't ideal -- PyBytes_FromStringAndSize copies the data from + # the passed buffer, so it's hard for us to avoid doubling the memory + buf = malloc(self.buffer_size) + if buf == NULL: + raise MemoryError("Failed to allocate {0} bytes" + .format(self.buffer_size)) + + writer_thread.start() + + cdef int64_t total_bytes = 0 + + try: + while True: + with nogil: + check_cstatus(self.rd_file.get() + .Read(self.buffer_size, &bytes_read, buf)) + + total_bytes += bytes_read + + # EOF + if bytes_read == 0: + break + + pybuf = cp.PyBytes_FromStringAndSize(buf, + bytes_read) + + write_queue.put_nowait(pybuf) + finally: + free(buf) + done = True + + writer_thread.join() + if exc_info is not None: + raise exc_info[0], exc_info[1], exc_info[2] + + def write(self, data): + """ + Write bytes-like (unicode, encoded to UTF-8) to file + """ + self._assert_writeable() + + data = tobytes(data) + + cdef const uint8_t* buf = cp.PyBytes_AS_STRING(data) + cdef int32_t bufsize = len(data) + with nogil: + check_cstatus(self.wr_file.get().Write(buf, bufsize)) diff --git a/python/pyarrow/tests/test_array.py b/python/pyarrow/tests/test_array.py index bf5a22089cdba..86147f8df5a11 100644 --- a/python/pyarrow/tests/test_array.py +++ b/python/pyarrow/tests/test_array.py @@ -15,25 +15,24 @@ # specific language governing permissions and limitations # under the License. 
-from pyarrow.compat import unittest import pyarrow import pyarrow.formatting as fmt -class TestArrayAPI(unittest.TestCase): +def test_repr_on_pre_init_array(): + arr = pyarrow.array.Array() + assert len(repr(arr)) > 0 - def test_repr_on_pre_init_array(self): - arr = pyarrow.array.Array() - assert len(repr(arr)) > 0 - def test_getitem_NA(self): - arr = pyarrow.from_pylist([1, None, 2]) - assert arr[1] is pyarrow.NA +def test_getitem_NA(): + arr = pyarrow.from_pylist([1, None, 2]) + assert arr[1] is pyarrow.NA - def test_list_format(self): - arr = pyarrow.from_pylist([[1], None, [2, 3, None]]) - result = fmt.array_format(arr) - expected = """\ + +def test_list_format(): + arr = pyarrow.from_pylist([[1], None, [2, 3, None]]) + result = fmt.array_format(arr) + expected = """\ [ [1], NA, @@ -41,23 +40,25 @@ def test_list_format(self): 3, NA] ]""" - assert result == expected + assert result == expected + - def test_string_format(self): - arr = pyarrow.from_pylist(['', None, 'foo']) - result = fmt.array_format(arr) - expected = """\ +def test_string_format(): + arr = pyarrow.from_pylist(['', None, 'foo']) + result = fmt.array_format(arr) + expected = """\ [ '', NA, 'foo' ]""" - assert result == expected + assert result == expected + - def test_long_array_format(self): - arr = pyarrow.from_pylist(range(100)) - result = fmt.array_format(arr, window=2) - expected = """\ +def test_long_array_format(): + arr = pyarrow.from_pylist(range(100)) + result = fmt.array_format(arr, window=2) + expected = """\ [ 0, 1, @@ -65,4 +66,4 @@ def test_long_array_format(self): 98, 99 ]""" - assert result == expected + assert result == expected diff --git a/python/pyarrow/tests/test_io.py b/python/pyarrow/tests/test_io.py new file mode 100644 index 0000000000000..328e923b941a4 --- /dev/null +++ b/python/pyarrow/tests/test_io.py @@ -0,0 +1,126 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
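
The tests are being moved from unittest.TestCase classes to plain pytest functions, and the new test_io.py below adds a session-scoped `hdfs` fixture torn down via request.addfinalizer. On pytest 3 and later the same fixture can be written with a yield (older releases spell it pytest.yield_fixture); a sketch, reusing hdfs_test_client and HDFS_TMP_PATH from the module below:

    # Sketch: yield-style equivalent of the hdfs fixture defined below.
    import pytest

    @pytest.fixture(scope='session')
    def hdfs():
        fixture = hdfs_test_client()
        yield fixture
        # teardown runs after the last test in the session
        fixture.delete(HDFS_TMP_PATH, recursive=True)
        fixture.close()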
+ +from io import BytesIO +from os.path import join as pjoin +import os +import random + +import pytest + +import pyarrow.io as io + +#---------------------------------------------------------------------- +# HDFS tests + + +def hdfs_test_client(): + host = os.environ.get('ARROW_HDFS_TEST_HOST', 'localhost') + user = os.environ['ARROW_HDFS_TEST_USER'] + try: + port = int(os.environ.get('ARROW_HDFS_TEST_PORT', 20500)) + except ValueError: + raise ValueError('Env variable ARROW_HDFS_TEST_PORT was not ' + 'an integer') + + return io.HdfsClient.connect(host, port, user) + + +libhdfs = pytest.mark.skipif(not io.have_libhdfs(), + reason='No libhdfs available on system') + + +HDFS_TMP_PATH = '/tmp/pyarrow-test-{0}'.format(random.randint(0, 1000)) + +@pytest.fixture(scope='session') +def hdfs(request): + fixture = hdfs_test_client() + def teardown(): + fixture.delete(HDFS_TMP_PATH, recursive=True) + fixture.close() + request.addfinalizer(teardown) + return fixture + + +@libhdfs +def test_hdfs_close(): + client = hdfs_test_client() + assert client.is_open + client.close() + assert not client.is_open + + with pytest.raises(Exception): + client.ls('/') + + +@libhdfs +def test_hdfs_mkdir(hdfs): + path = pjoin(HDFS_TMP_PATH, 'test-dir/test-dir') + parent_path = pjoin(HDFS_TMP_PATH, 'test-dir') + + hdfs.mkdir(path) + assert hdfs.exists(path) + + hdfs.delete(parent_path, recursive=True) + assert not hdfs.exists(path) + + +@libhdfs +def test_hdfs_ls(hdfs): + base_path = pjoin(HDFS_TMP_PATH, 'ls-test') + hdfs.mkdir(base_path) + + dir_path = pjoin(base_path, 'a-dir') + f1_path = pjoin(base_path, 'a-file-1') + + hdfs.mkdir(dir_path) + + f = hdfs.open(f1_path, 'wb') + f.write('a' * 10) + + contents = sorted(hdfs.ls(base_path, False)) + assert contents == [dir_path, f1_path] + + +@libhdfs +def test_hdfs_download_upload(hdfs): + base_path = pjoin(HDFS_TMP_PATH, 'upload-test') + + data = b'foobarbaz' + buf = BytesIO(data) + buf.seek(0) + + hdfs.upload(base_path, buf) + + out_buf = BytesIO() + hdfs.download(base_path, out_buf) + out_buf.seek(0) + assert out_buf.getvalue() == data + + +@libhdfs +def test_hdfs_file_context_manager(hdfs): + path = pjoin(HDFS_TMP_PATH, 'ctx-manager') + + data = b'foo' + with hdfs.open(path, 'wb') as f: + f.write(data) + + with hdfs.open(path, 'rb') as f: + assert f.size() == 3 + result = f.read(10) + assert result == data diff --git a/python/setup.py b/python/setup.py index 7edeb9143319b..59410d75a61e2 100644 --- a/python/setup.py +++ b/python/setup.py @@ -214,7 +214,14 @@ def get_ext_built(self, name): return name + suffix def get_cmake_cython_names(self): - return ['array', 'config', 'error', 'parquet', 'scalar', 'schema', 'table'] + return ['array', + 'config', + 'error', + 'io', + 'parquet', + 'scalar', + 'schema', + 'table'] def get_names(self): return self._found_names From 2f52cf4eed1033d1bf1f043d9063e462e60d6605 Mon Sep 17 00:00:00 2001 From: "Uwe L. 
Korn" Date: Sun, 12 Jun 2016 11:48:10 +0200 Subject: [PATCH 0093/1644] ARROW-215: Support other integer types and strings in Parquet I/O Change-Id: I72c6c82bc38c895a04172531bebbc78d4fb08732 --- cpp/src/arrow/parquet/parquet-io-test.cc | 461 ++++++++++++------- cpp/src/arrow/parquet/parquet-schema-test.cc | 4 +- cpp/src/arrow/parquet/reader.cc | 160 ++++++- cpp/src/arrow/parquet/schema.cc | 47 +- cpp/src/arrow/parquet/schema.h | 9 +- cpp/src/arrow/parquet/test-util.h | 136 +++++- cpp/src/arrow/parquet/writer.cc | 234 ++++++++-- cpp/src/arrow/parquet/writer.h | 9 +- cpp/src/arrow/test-util.h | 2 + cpp/src/arrow/types/primitive.cc | 5 + python/pyarrow/includes/parquet.pxd | 13 +- python/pyarrow/parquet.pyx | 22 +- python/pyarrow/tests/test_parquet.py | 43 +- 13 files changed, 901 insertions(+), 244 deletions(-) diff --git a/cpp/src/arrow/parquet/parquet-io-test.cc b/cpp/src/arrow/parquet/parquet-io-test.cc index edcac88705668..572cae16e58c0 100644 --- a/cpp/src/arrow/parquet/parquet-io-test.cc +++ b/cpp/src/arrow/parquet/parquet-io-test.cc @@ -21,7 +21,9 @@ #include "arrow/parquet/test-util.h" #include "arrow/parquet/reader.h" #include "arrow/parquet/writer.h" +#include "arrow/types/construct.h" #include "arrow/types/primitive.h" +#include "arrow/types/string.h" #include "arrow/util/memory-pool.h" #include "arrow/util/status.h" @@ -30,12 +32,15 @@ using ParquetBuffer = parquet::Buffer; using parquet::BufferReader; +using parquet::default_writer_properties; using parquet::InMemoryOutputStream; +using parquet::LogicalType; using parquet::ParquetFileReader; using parquet::ParquetFileWriter; using parquet::RandomAccessSource; using parquet::Repetition; using parquet::SchemaDescriptor; +using parquet::ParquetVersion; using ParquetType = parquet::Type; using parquet::schema::GroupNode; using parquet::schema::NodePtr; @@ -51,26 +56,114 @@ const int LARGE_SIZE = 10000; template struct test_traits {}; +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::BOOLEAN; + static constexpr LogicalType::type logical_enum = LogicalType::NONE; + static uint8_t const value; +}; + +const uint8_t test_traits::value(1); + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::INT32; + static constexpr LogicalType::type logical_enum = LogicalType::UINT_8; + static uint8_t const value; +}; + +const uint8_t test_traits::value(64); + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::INT32; + static constexpr LogicalType::type logical_enum = LogicalType::INT_8; + static int8_t const value; +}; + +const int8_t test_traits::value(-64); + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::INT32; + static constexpr LogicalType::type logical_enum = LogicalType::UINT_16; + static uint16_t const value; +}; + +const uint16_t test_traits::value(1024); + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::INT32; + static constexpr LogicalType::type logical_enum = LogicalType::INT_16; + static int16_t const value; +}; + +const int16_t test_traits::value(-1024); + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::INT32; + static constexpr LogicalType::type logical_enum = LogicalType::UINT_32; + static uint32_t const value; +}; + +const uint32_t test_traits::value(1024); + template <> struct test_traits { static constexpr ParquetType::type 
parquet_enum = ParquetType::INT32; + static constexpr LogicalType::type logical_enum = LogicalType::NONE; + static int32_t const value; +}; + +const int32_t test_traits::value(-1024); + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::INT64; + static constexpr LogicalType::type logical_enum = LogicalType::UINT_64; + static uint64_t const value; }; +const uint64_t test_traits::value(1024); + template <> struct test_traits { static constexpr ParquetType::type parquet_enum = ParquetType::INT64; + static constexpr LogicalType::type logical_enum = LogicalType::NONE; + static int64_t const value; }; +const int64_t test_traits::value(-1024); + template <> struct test_traits { static constexpr ParquetType::type parquet_enum = ParquetType::FLOAT; + static constexpr LogicalType::type logical_enum = LogicalType::NONE; + static float const value; }; +const float test_traits::value(2.1f); + template <> struct test_traits { static constexpr ParquetType::type parquet_enum = ParquetType::DOUBLE; + static constexpr LogicalType::type logical_enum = LogicalType::NONE; + static double const value; +}; + +const double test_traits::value(4.2); + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::BYTE_ARRAY; + static constexpr LogicalType::type logical_enum = LogicalType::UTF8; + static std::string const value; }; +const std::string test_traits::value("Test"); + template using ParquetDataType = ::parquet::DataType::parquet_enum>; @@ -80,18 +173,18 @@ using ParquetWriter = ::parquet::TypedColumnWriter>; template class TestParquetIO : public ::testing::Test { public: - typedef typename TestType::c_type T; virtual void SetUp() {} - std::shared_ptr MakeSchema( - ParquetType::type parquet_type, Repetition::type repetition) { - auto pnode = PrimitiveNode::Make("column1", repetition, parquet_type); + std::shared_ptr MakeSchema(Repetition::type repetition) { + auto pnode = PrimitiveNode::Make("column1", repetition, + test_traits::parquet_enum, test_traits::logical_enum); NodePtr node_ = GroupNode::Make("schema", Repetition::REQUIRED, std::vector({pnode})); return std::static_pointer_cast(node_); } - std::unique_ptr MakeWriter(std::shared_ptr& schema) { + std::unique_ptr MakeWriter( + const std::shared_ptr& schema) { sink_ = std::make_shared(); return ParquetFileWriter::Open(sink_, schema); } @@ -106,113 +199,74 @@ class TestParquetIO : public ::testing::Test { std::unique_ptr file_reader, std::shared_ptr* out) { arrow::parquet::FileReader reader(default_memory_pool(), std::move(file_reader)); std::unique_ptr column_reader; - ASSERT_NO_THROW(ASSERT_OK(reader.GetFlatColumn(0, &column_reader))); + ASSERT_OK_NO_THROW(reader.GetFlatColumn(0, &column_reader)); ASSERT_NE(nullptr, column_reader.get()); + ASSERT_OK(column_reader->NextBatch(SMALL_SIZE, out)); ASSERT_NE(nullptr, out->get()); } + void ReadAndCheckSingleColumnFile(Array* values) { + std::shared_ptr out; + ReadSingleColumnFile(ReaderFromSink(), &out); + ASSERT_TRUE(values->Equals(out)); + } + void ReadTableFromFile( std::unique_ptr file_reader, std::shared_ptr
* out) { arrow::parquet::FileReader reader(default_memory_pool(), std::move(file_reader)); - ASSERT_NO_THROW(ASSERT_OK(reader.ReadFlatTable(out))); + ASSERT_OK_NO_THROW(reader.ReadFlatTable(out)); ASSERT_NE(nullptr, out->get()); } - std::unique_ptr TestFile(std::vector& values, int num_chunks) { - std::shared_ptr schema = - MakeSchema(test_traits::parquet_enum, Repetition::REQUIRED); - std::unique_ptr file_writer = MakeWriter(schema); - size_t chunk_size = values.size() / num_chunks; - for (int i = 0; i < num_chunks; i++) { - auto row_group_writer = file_writer->AppendRowGroup(chunk_size); - auto column_writer = - static_cast*>(row_group_writer->NextColumn()); - T* data = values.data() + i * chunk_size; - column_writer->WriteBatch(chunk_size, nullptr, nullptr, data); - column_writer->Close(); - row_group_writer->Close(); - } - file_writer->Close(); - return ReaderFromSink(); + void ReadAndCheckSingleColumnTable(const std::shared_ptr& values) { + std::shared_ptr
out; + ReadTableFromFile(ReaderFromSink(), &out); + ASSERT_EQ(1, out->num_columns()); + ASSERT_EQ(values->length(), out->num_rows()); + + std::shared_ptr chunked_array = out->column(0)->data(); + ASSERT_EQ(1, chunked_array->num_chunks()); + ASSERT_TRUE(values->Equals(chunked_array->chunk(0))); + } + + template + void WriteFlatColumn(const std::shared_ptr& schema, + const std::shared_ptr& values) { + FileWriter writer(default_memory_pool(), MakeWriter(schema)); + ASSERT_OK_NO_THROW(writer.NewRowGroup(values->length())); + ASSERT_OK_NO_THROW(writer.WriteFlatColumnChunk(values.get())); + ASSERT_OK_NO_THROW(writer.Close()); } std::shared_ptr sink_; }; -typedef ::testing::Types TestTypes; - -TYPED_TEST_CASE(TestParquetIO, TestTypes); - -TYPED_TEST(TestParquetIO, SingleColumnRequiredRead) { - std::vector values(SMALL_SIZE, 128); - std::unique_ptr file_reader = this->TestFile(values, 1); - - std::shared_ptr out; - this->ReadSingleColumnFile(std::move(file_reader), &out); - - ExpectArray(values.data(), out.get()); -} - -TYPED_TEST(TestParquetIO, SingleColumnRequiredTableRead) { - std::vector values(SMALL_SIZE, 128); - std::unique_ptr file_reader = this->TestFile(values, 1); - - std::shared_ptr
out;
-  this->ReadTableFromFile(std::move(file_reader), &out);
-  ASSERT_EQ(1, out->num_columns());
-  ASSERT_EQ(SMALL_SIZE, out->num_rows());
-
-  std::shared_ptr<ChunkedArray> chunked_array = out->column(0)->data();
-  ASSERT_EQ(1, chunked_array->num_chunks());
-  ExpectArray<T>(values.data(), chunked_array->chunk(0).get());
-}
-
-TYPED_TEST(TestParquetIO, SingleColumnRequiredChunkedRead) {
-  std::vector<T> values(SMALL_SIZE, 128);
-  std::unique_ptr<ParquetFileReader> file_reader = this->TestFile(values, 4);
-
-  std::shared_ptr<Array> out;
-  this->ReadSingleColumnFile(std::move(file_reader), &out);
+// We have separate tests for UInt32Type as this is currently the only type
+// where a roundtrip does not yield the identical Array structure.
+// There we write a UInt32 Array but receive an Int64 Array as the result for
+// Parquet version 1.0.
-  ExpectArray<T>(values.data(), out.get());
-}
-
-TYPED_TEST(TestParquetIO, SingleColumnRequiredChunkedTableRead) {
-  std::vector<T> values(SMALL_SIZE, 128);
-  std::unique_ptr<ParquetFileReader> file_reader = this->TestFile(values, 4);
-
-  std::shared_ptr<Table> out;
-  this->ReadTableFromFile(std::move(file_reader), &out);
-  ASSERT_EQ(1, out->num_columns());
-  ASSERT_EQ(SMALL_SIZE, out->num_rows());
+typedef ::testing::Types<BooleanType, UInt8Type, Int8Type, UInt16Type, Int16Type,
+    Int32Type, UInt64Type, Int64Type, FloatType, DoubleType, StringType> TestTypes;
-  std::shared_ptr<ChunkedArray> chunked_array = out->column(0)->data();
-  ASSERT_EQ(1, chunked_array->num_chunks());
-  ExpectArray<T>(values.data(), chunked_array->chunk(0).get());
-}
+TYPED_TEST_CASE(TestParquetIO, TestTypes);
 
 TYPED_TEST(TestParquetIO, SingleColumnRequiredWrite) {
-  std::shared_ptr<Array> values = NonNullArray<TypeParam>(SMALL_SIZE, 128);
+  auto values = NonNullArray<TypeParam>(SMALL_SIZE);
 
-  std::shared_ptr<GroupNode> schema =
-      this->MakeSchema(test_traits<TypeParam>::parquet_enum, Repetition::REQUIRED);
-  FileWriter writer(default_memory_pool(), this->MakeWriter(schema));
-  ASSERT_NO_THROW(ASSERT_OK(writer.NewRowGroup(values->length())));
-  ASSERT_NO_THROW(ASSERT_OK(writer.WriteFlatColumnChunk(values.get())));
-  ASSERT_NO_THROW(ASSERT_OK(writer.Close()));
+  std::shared_ptr<GroupNode> schema = this->MakeSchema(Repetition::REQUIRED);
+  this->WriteFlatColumn(schema, values);
 
-  std::shared_ptr<Array> out;
-  this->ReadSingleColumnFile(this->ReaderFromSink(), &out);
-  ASSERT_TRUE(values->Equals(out));
+  this->ReadAndCheckSingleColumnFile(values.get());
 }
 
 TYPED_TEST(TestParquetIO, SingleColumnTableRequiredWrite) {
-  std::shared_ptr<Array> values = NonNullArray<TypeParam>(SMALL_SIZE, 128);
+  auto values = NonNullArray<TypeParam>(SMALL_SIZE);
   std::shared_ptr<Table>
table = MakeSimpleTable(values, false); this->sink_ = std::make_shared(); - ASSERT_NO_THROW(ASSERT_OK( - WriteFlatTable(table.get(), default_memory_pool(), this->sink_, values->length()))); + ASSERT_OK_NO_THROW(WriteFlatTable(table.get(), default_memory_pool(), this->sink_, + values->length(), default_writer_properties())); std::shared_ptr
out; this->ReadTableFromFile(this->ReaderFromSink(), &out); @@ -226,113 +280,208 @@ TYPED_TEST(TestParquetIO, SingleColumnTableRequiredWrite) { TYPED_TEST(TestParquetIO, SingleColumnOptionalReadWrite) { // This also tests max_definition_level = 1 - std::shared_ptr values = NullableArray(SMALL_SIZE, 128, 10); + auto values = NullableArray(SMALL_SIZE, 10); - std::shared_ptr schema = - this->MakeSchema(test_traits::parquet_enum, Repetition::OPTIONAL); - FileWriter writer(default_memory_pool(), this->MakeWriter(schema)); - ASSERT_NO_THROW(ASSERT_OK(writer.NewRowGroup(values->length()))); - ASSERT_NO_THROW(ASSERT_OK(writer.WriteFlatColumnChunk(values.get()))); - ASSERT_NO_THROW(ASSERT_OK(writer.Close())); + std::shared_ptr schema = this->MakeSchema(Repetition::OPTIONAL); + this->WriteFlatColumn(schema, values); - std::shared_ptr out; - this->ReadSingleColumnFile(this->ReaderFromSink(), &out); - ASSERT_TRUE(values->Equals(out)); + this->ReadAndCheckSingleColumnFile(values.get()); } TYPED_TEST(TestParquetIO, SingleColumnTableOptionalReadWrite) { // This also tests max_definition_level = 1 - std::shared_ptr values = NullableArray(SMALL_SIZE, 128, 10); + std::shared_ptr values = NullableArray(SMALL_SIZE, 10); std::shared_ptr
table = MakeSimpleTable(values, true); this->sink_ = std::make_shared(); - ASSERT_NO_THROW(ASSERT_OK( - WriteFlatTable(table.get(), default_memory_pool(), this->sink_, values->length()))); - - std::shared_ptr
out; - this->ReadTableFromFile(this->ReaderFromSink(), &out); - ASSERT_EQ(1, out->num_columns()); - ASSERT_EQ(SMALL_SIZE, out->num_rows()); + ASSERT_OK_NO_THROW(WriteFlatTable(table.get(), default_memory_pool(), this->sink_, + values->length(), default_writer_properties())); - std::shared_ptr chunked_array = out->column(0)->data(); - ASSERT_EQ(1, chunked_array->num_chunks()); - ASSERT_TRUE(values->Equals(chunked_array->chunk(0))); + this->ReadAndCheckSingleColumnTable(values); } -TYPED_TEST(TestParquetIO, SingleColumnIntRequiredChunkedWrite) { - std::shared_ptr values = NonNullArray(SMALL_SIZE, 128); - std::shared_ptr values_chunk = - NonNullArray(SMALL_SIZE / 4, 128); +TYPED_TEST(TestParquetIO, SingleColumnRequiredChunkedWrite) { + auto values = NonNullArray(SMALL_SIZE); + int64_t chunk_size = values->length() / 4; - std::shared_ptr schema = - this->MakeSchema(test_traits::parquet_enum, Repetition::REQUIRED); + std::shared_ptr schema = this->MakeSchema(Repetition::REQUIRED); FileWriter writer(default_memory_pool(), this->MakeWriter(schema)); for (int i = 0; i < 4; i++) { - ASSERT_NO_THROW(ASSERT_OK(writer.NewRowGroup(values_chunk->length()))); - ASSERT_NO_THROW(ASSERT_OK(writer.WriteFlatColumnChunk(values_chunk.get()))); + ASSERT_OK_NO_THROW(writer.NewRowGroup(chunk_size)); + ASSERT_OK_NO_THROW( + writer.WriteFlatColumnChunk(values.get(), i * chunk_size, chunk_size)); } - ASSERT_NO_THROW(ASSERT_OK(writer.Close())); + ASSERT_OK_NO_THROW(writer.Close()); - std::shared_ptr out; - this->ReadSingleColumnFile(this->ReaderFromSink(), &out); - ASSERT_TRUE(values->Equals(out)); + this->ReadAndCheckSingleColumnFile(values.get()); } TYPED_TEST(TestParquetIO, SingleColumnTableRequiredChunkedWrite) { - std::shared_ptr values = NonNullArray(LARGE_SIZE, 128); + auto values = NonNullArray(LARGE_SIZE); std::shared_ptr
table = MakeSimpleTable(values, false); this->sink_ = std::make_shared(); - ASSERT_NO_THROW( - ASSERT_OK(WriteFlatTable(table.get(), default_memory_pool(), this->sink_, 512))); - - std::shared_ptr
out; - this->ReadTableFromFile(this->ReaderFromSink(), &out); - ASSERT_EQ(1, out->num_columns()); - ASSERT_EQ(LARGE_SIZE, out->num_rows()); + ASSERT_OK_NO_THROW(WriteFlatTable( + table.get(), default_memory_pool(), this->sink_, 512, default_writer_properties())); - std::shared_ptr chunked_array = out->column(0)->data(); - ASSERT_EQ(1, chunked_array->num_chunks()); - ASSERT_TRUE(values->Equals(chunked_array->chunk(0))); + this->ReadAndCheckSingleColumnTable(values); } TYPED_TEST(TestParquetIO, SingleColumnOptionalChunkedWrite) { - std::shared_ptr values = NullableArray(SMALL_SIZE, 128, 10); - std::shared_ptr values_chunk_nulls = - NullableArray(SMALL_SIZE / 4, 128, 10); - std::shared_ptr values_chunk = - NullableArray(SMALL_SIZE / 4, 128, 0); - - std::shared_ptr schema = - this->MakeSchema(test_traits::parquet_enum, Repetition::OPTIONAL); + int64_t chunk_size = SMALL_SIZE / 4; + auto values = NullableArray(SMALL_SIZE, 10); + + std::shared_ptr schema = this->MakeSchema(Repetition::OPTIONAL); FileWriter writer(default_memory_pool(), this->MakeWriter(schema)); - ASSERT_NO_THROW(ASSERT_OK(writer.NewRowGroup(values_chunk_nulls->length()))); - ASSERT_NO_THROW(ASSERT_OK(writer.WriteFlatColumnChunk(values_chunk_nulls.get()))); - for (int i = 0; i < 3; i++) { - ASSERT_NO_THROW(ASSERT_OK(writer.NewRowGroup(values_chunk->length()))); - ASSERT_NO_THROW(ASSERT_OK(writer.WriteFlatColumnChunk(values_chunk.get()))); + for (int i = 0; i < 4; i++) { + ASSERT_OK_NO_THROW(writer.NewRowGroup(chunk_size)); + ASSERT_OK_NO_THROW( + writer.WriteFlatColumnChunk(values.get(), i * chunk_size, chunk_size)); } - ASSERT_NO_THROW(ASSERT_OK(writer.Close())); + ASSERT_OK_NO_THROW(writer.Close()); - std::shared_ptr out; - this->ReadSingleColumnFile(this->ReaderFromSink(), &out); - ASSERT_TRUE(values->Equals(out)); + this->ReadAndCheckSingleColumnFile(values.get()); } TYPED_TEST(TestParquetIO, SingleColumnTableOptionalChunkedWrite) { // This also tests max_definition_level = 1 - std::shared_ptr values = NullableArray(LARGE_SIZE, 128, 100); + auto values = NullableArray(LARGE_SIZE, 100); std::shared_ptr
table = MakeSimpleTable(values, true); this->sink_ = std::make_shared(); - ASSERT_NO_THROW( - ASSERT_OK(WriteFlatTable(table.get(), default_memory_pool(), this->sink_, 512))); + ASSERT_OK_NO_THROW(WriteFlatTable( + table.get(), default_memory_pool(), this->sink_, 512, default_writer_properties())); - std::shared_ptr
out;
-  this->ReadTableFromFile(this->ReaderFromSink(), &out);
-  ASSERT_EQ(1, out->num_columns());
-  ASSERT_EQ(LARGE_SIZE, out->num_rows());
+  this->ReadAndCheckSingleColumnTable(values);
+}
-  std::shared_ptr<ChunkedArray> chunked_array = out->column(0)->data();
-  ASSERT_EQ(1, chunked_array->num_chunks());
-  ASSERT_TRUE(values->Equals(chunked_array->chunk(0)));
+using TestUInt32ParquetIO = TestParquetIO<UInt32Type>;
+
+TEST_F(TestUInt32ParquetIO, Parquet_2_0_Compatibility) {
+  // This also tests max_definition_level = 1
+  std::shared_ptr<Array> values = NullableArray<UInt32Type>(LARGE_SIZE, 100);
+  std::shared_ptr<Table> table = MakeSimpleTable(values, true);
+
+  // Parquet 2.0 roundtrip should yield a uint32_t column again
+  this->sink_ = std::make_shared<InMemoryOutputStream>();
+  std::shared_ptr<::parquet::WriterProperties> properties =
+      ::parquet::WriterProperties::Builder()
+          .version(ParquetVersion::PARQUET_2_0)
+          ->build();
+  ASSERT_OK_NO_THROW(
+      WriteFlatTable(table.get(), default_memory_pool(), this->sink_, 512, properties));
+  this->ReadAndCheckSingleColumnTable(values);
+}
+
+TEST_F(TestUInt32ParquetIO, Parquet_1_0_Compatibility) {
+  // This also tests max_definition_level = 1
+  std::shared_ptr<Array> values = NullableArray<UInt32Type>(LARGE_SIZE, 100);
+  std::shared_ptr<Table>
table = MakeSimpleTable(values, true); + + // Parquet 1.0 returns an int64_t column as there is no way to tell a Parquet 1.0 + // reader that a column is unsigned. + this->sink_ = std::make_shared(); + std::shared_ptr<::parquet::WriterProperties> properties = + ::parquet::WriterProperties::Builder() + .version(ParquetVersion::PARQUET_1_0) + ->build(); + ASSERT_OK_NO_THROW( + WriteFlatTable(table.get(), default_memory_pool(), this->sink_, 512, properties)); + + std::shared_ptr expected_values; + std::shared_ptr int64_data = + std::make_shared(default_memory_pool()); + { + ASSERT_OK(int64_data->Resize(sizeof(int64_t) * values->length())); + int64_t* int64_data_ptr = reinterpret_cast(int64_data->mutable_data()); + const uint32_t* uint32_data_ptr = + reinterpret_cast(values->data()->data()); + // std::copy might be faster but this is explicit on the casts) + for (int64_t i = 0; i < values->length(); i++) { + int64_data_ptr[i] = static_cast(uint32_data_ptr[i]); + } + } + ASSERT_OK(MakePrimitiveArray(std::make_shared(), values->length(), + int64_data, values->null_count(), values->null_bitmap(), &expected_values)); + this->ReadAndCheckSingleColumnTable(expected_values); +} + +template +using ParquetCDataType = typename ParquetDataType::c_type; + +template +class TestPrimitiveParquetIO : public TestParquetIO { + public: + typedef typename TestType::c_type T; + + void TestFile(std::vector& values, int num_chunks, + std::unique_ptr* file_reader) { + std::shared_ptr schema = this->MakeSchema(Repetition::REQUIRED); + std::unique_ptr file_writer = this->MakeWriter(schema); + size_t chunk_size = values.size() / num_chunks; + // Convert to Parquet's expected physical type + std::vector values_buffer( + sizeof(ParquetCDataType) * values.size()); + auto values_parquet = + reinterpret_cast*>(values_buffer.data()); + std::copy(values.cbegin(), values.cend(), values_parquet); + for (int i = 0; i < num_chunks; i++) { + auto row_group_writer = file_writer->AppendRowGroup(chunk_size); + auto column_writer = + static_cast*>(row_group_writer->NextColumn()); + ParquetCDataType* data = values_parquet + i * chunk_size; + column_writer->WriteBatch(chunk_size, nullptr, nullptr, data); + column_writer->Close(); + row_group_writer->Close(); + } + file_writer->Close(); + *file_reader = this->ReaderFromSink(); + } + + void TestSingleColumnRequiredTableRead(int num_chunks) { + std::vector values(SMALL_SIZE, test_traits::value); + std::unique_ptr file_reader; + ASSERT_NO_THROW(TestFile(values, num_chunks, &file_reader)); + + std::shared_ptr
out; + this->ReadTableFromFile(std::move(file_reader), &out); + ASSERT_EQ(1, out->num_columns()); + ASSERT_EQ(SMALL_SIZE, out->num_rows()); + + std::shared_ptr chunked_array = out->column(0)->data(); + ASSERT_EQ(1, chunked_array->num_chunks()); + ExpectArray(values.data(), chunked_array->chunk(0).get()); + } + + void TestSingleColumnRequiredRead(int num_chunks) { + std::vector values(SMALL_SIZE, test_traits::value); + std::unique_ptr file_reader; + ASSERT_NO_THROW(TestFile(values, num_chunks, &file_reader)); + + std::shared_ptr out; + this->ReadSingleColumnFile(std::move(file_reader), &out); + + ExpectArray(values.data(), out.get()); + } +}; + +typedef ::testing::Types PrimitiveTestTypes; + +TYPED_TEST_CASE(TestPrimitiveParquetIO, PrimitiveTestTypes); + +TYPED_TEST(TestPrimitiveParquetIO, SingleColumnRequiredRead) { + this->TestSingleColumnRequiredRead(1); +} + +TYPED_TEST(TestPrimitiveParquetIO, SingleColumnRequiredTableRead) { + this->TestSingleColumnRequiredTableRead(1); +} + +TYPED_TEST(TestPrimitiveParquetIO, SingleColumnRequiredChunkedRead) { + this->TestSingleColumnRequiredRead(4); +} + +TYPED_TEST(TestPrimitiveParquetIO, SingleColumnRequiredChunkedTableRead) { + this->TestSingleColumnRequiredTableRead(4); } } // namespace parquet diff --git a/cpp/src/arrow/parquet/parquet-schema-test.cc b/cpp/src/arrow/parquet/parquet-schema-test.cc index 8de739491b56f..819cdd3ec4394 100644 --- a/cpp/src/arrow/parquet/parquet-schema-test.cc +++ b/cpp/src/arrow/parquet/parquet-schema-test.cc @@ -183,7 +183,9 @@ class TestConvertArrowSchema : public ::testing::Test { Status ConvertSchema(const std::vector>& fields) { arrow_schema_ = std::make_shared(fields); - return ToParquetSchema(arrow_schema_.get(), &result_schema_); + std::shared_ptr<::parquet::WriterProperties> properties = + ::parquet::default_writer_properties(); + return ToParquetSchema(arrow_schema_.get(), *properties.get(), &result_schema_); } protected: diff --git a/cpp/src/arrow/parquet/reader.cc b/cpp/src/arrow/parquet/reader.cc index 3b4882d4439d5..7b05665b230f0 100644 --- a/cpp/src/arrow/parquet/reader.cc +++ b/cpp/src/arrow/parquet/reader.cc @@ -17,6 +17,7 @@ #include "arrow/parquet/reader.h" +#include #include #include #include @@ -27,6 +28,7 @@ #include "arrow/schema.h" #include "arrow/table.h" #include "arrow/types/primitive.h" +#include "arrow/types/string.h" #include "arrow/util/status.h" using parquet::ColumnReader; @@ -36,6 +38,19 @@ using parquet::TypedColumnReader; namespace arrow { namespace parquet { +template +struct ArrowTypeTraits { + typedef NumericBuilder builder_type; +}; + +template <> +struct ArrowTypeTraits { + typedef BooleanBuilder builder_type; +}; + +template +using BuilderType = typename ArrowTypeTraits::builder_type; + class FileReader::Impl { public: Impl(MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileReader> reader); @@ -61,9 +76,45 @@ class FlatColumnReader::Impl { template Status TypedReadBatch(int batch_size, std::shared_ptr* out); + template + Status ReadNullableFlatBatch(const int16_t* def_levels, + typename ParquetType::c_type* values, int64_t values_read, int64_t levels_read, + BuilderType* builder); + template + Status ReadNonNullableBatch(typename ParquetType::c_type* values, int64_t values_read, + BuilderType* builder); + private: void NextRowGroup(); + template + struct can_copy_ptr { + static constexpr bool value = + std::is_same::value || + (std::is_integral{} && std::is_integral{} && + (sizeof(InType) == sizeof(OutType))); + }; + + template ::value>::type* = nullptr> + Status 
ConvertPhysicalType( + const InType* in_ptr, int64_t length, const OutType** out_ptr) { + *out_ptr = reinterpret_cast(in_ptr); + return Status::OK(); + } + + template ::value>::type* = nullptr> + Status ConvertPhysicalType( + const InType* in_ptr, int64_t length, const OutType** out_ptr) { + RETURN_NOT_OK(values_builder_buffer_.Resize(length * sizeof(OutType))); + OutType* mutable_out_ptr = + reinterpret_cast(values_builder_buffer_.mutable_data()); + std::copy(in_ptr, in_ptr + length, mutable_out_ptr); + *out_ptr = mutable_out_ptr; + return Status::OK(); + } + MemoryPool* pool_; const ::parquet::ColumnDescriptor* descr_; ::parquet::ParquetFileReader* reader_; @@ -155,13 +206,53 @@ FlatColumnReader::Impl::Impl(MemoryPool* pool, const ::parquet::ColumnDescriptor NextRowGroup(); } +template +Status FlatColumnReader::Impl::ReadNonNullableBatch(typename ParquetType::c_type* values, + int64_t values_read, BuilderType* builder) { + using ArrowCType = typename ArrowType::c_type; + using ParquetCType = typename ParquetType::c_type; + + DCHECK(builder); + const ArrowCType* values_ptr; + RETURN_NOT_OK( + (ConvertPhysicalType(values, values_read, &values_ptr))); + RETURN_NOT_OK(builder->Append(values_ptr, values_read)); + return Status::OK(); +} + +template +Status FlatColumnReader::Impl::ReadNullableFlatBatch(const int16_t* def_levels, + typename ParquetType::c_type* values, int64_t values_read, int64_t levels_read, + BuilderType* builder) { + using ArrowCType = typename ArrowType::c_type; + + DCHECK(builder); + RETURN_NOT_OK(values_builder_buffer_.Resize(levels_read * sizeof(ArrowCType))); + RETURN_NOT_OK(valid_bytes_buffer_.Resize(levels_read * sizeof(uint8_t))); + auto values_ptr = reinterpret_cast(values_builder_buffer_.mutable_data()); + uint8_t* valid_bytes = valid_bytes_buffer_.mutable_data(); + int values_idx = 0; + for (int64_t i = 0; i < levels_read; i++) { + if (def_levels[i] < descr_->max_definition_level()) { + valid_bytes[i] = 0; + } else { + valid_bytes[i] = 1; + values_ptr[i] = values[values_idx++]; + } + } + RETURN_NOT_OK(builder->Append(values_ptr, levels_read, valid_bytes)); + return Status::OK(); +} + template Status FlatColumnReader::Impl::TypedReadBatch( int batch_size, std::shared_ptr* out) { + using ParquetCType = typename ParquetType::c_type; + int values_to_read = batch_size; - NumericBuilder builder(pool_, field_->type); + BuilderType builder(pool_, field_->type); while ((values_to_read > 0) && column_reader_) { - values_buffer_.Resize(values_to_read * sizeof(typename ParquetType::c_type)); + values_buffer_.Resize(values_to_read * sizeof(ParquetCType)); if (descr_->max_definition_level() > 0) { def_levels_buffer_.Resize(values_to_read * sizeof(int16_t)); } @@ -169,31 +260,62 @@ Status FlatColumnReader::Impl::TypedReadBatch( int64_t values_read; int64_t levels_read; int16_t* def_levels = reinterpret_cast(def_levels_buffer_.mutable_data()); - auto values = - reinterpret_cast(values_buffer_.mutable_data()); + auto values = reinterpret_cast(values_buffer_.mutable_data()); PARQUET_CATCH_NOT_OK(levels_read = reader->ReadBatch( values_to_read, def_levels, nullptr, values, &values_read)); values_to_read -= levels_read; if (descr_->max_definition_level() == 0) { - RETURN_NOT_OK(builder.Append(values, values_read)); + RETURN_NOT_OK( + (ReadNonNullableBatch(values, values_read, &builder))); + } else { + // As per the defintion and checks for flat columns: + // descr_->max_definition_level() == 1 + RETURN_NOT_OK((ReadNullableFlatBatch( + def_levels, values, values_read, levels_read, 
&builder))); + } + if (!column_reader_->HasNext()) { NextRowGroup(); } + } + *out = builder.Finish(); + return Status::OK(); +} + +template <> +Status FlatColumnReader::Impl::TypedReadBatch( + int batch_size, std::shared_ptr* out) { + int values_to_read = batch_size; + StringBuilder builder(pool_, field_->type); + while ((values_to_read > 0) && column_reader_) { + values_buffer_.Resize(values_to_read * sizeof(::parquet::ByteArray)); + if (descr_->max_definition_level() > 0) { + def_levels_buffer_.Resize(values_to_read * sizeof(int16_t)); + } + auto reader = + dynamic_cast*>(column_reader_.get()); + int64_t values_read; + int64_t levels_read; + int16_t* def_levels = reinterpret_cast(def_levels_buffer_.mutable_data()); + auto values = reinterpret_cast<::parquet::ByteArray*>(values_buffer_.mutable_data()); + PARQUET_CATCH_NOT_OK(levels_read = reader->ReadBatch( + values_to_read, def_levels, nullptr, values, &values_read)); + values_to_read -= levels_read; + if (descr_->max_definition_level() == 0) { + for (int64_t i = 0; i < levels_read; i++) { + RETURN_NOT_OK( + builder.Append(reinterpret_cast(values[i].ptr), values[i].len)); + } } else { // descr_->max_definition_level() == 1 - RETURN_NOT_OK(values_builder_buffer_.Resize( - levels_read * sizeof(typename ParquetType::c_type))); - RETURN_NOT_OK(valid_bytes_buffer_.Resize(levels_read * sizeof(uint8_t))); - auto values_ptr = reinterpret_cast( - values_builder_buffer_.mutable_data()); - uint8_t* valid_bytes = valid_bytes_buffer_.mutable_data(); int values_idx = 0; for (int64_t i = 0; i < levels_read; i++) { if (def_levels[i] < descr_->max_definition_level()) { - valid_bytes[i] = 0; + RETURN_NOT_OK(builder.AppendNull()); } else { - valid_bytes[i] = 1; - values_ptr[i] = values[values_idx++]; + RETURN_NOT_OK( + builder.Append(reinterpret_cast(values[values_idx].ptr), + values[values_idx].len)); + values_idx++; } } - builder.Append(values_ptr, levels_read, valid_bytes); } if (!column_reader_->HasNext()) { NextRowGroup(); } } @@ -214,10 +336,18 @@ Status FlatColumnReader::Impl::NextBatch(int batch_size, std::shared_ptr* } switch (field_->type->type) { + TYPED_BATCH_CASE(BOOL, BooleanType, ::parquet::BooleanType) + TYPED_BATCH_CASE(UINT8, UInt8Type, ::parquet::Int32Type) + TYPED_BATCH_CASE(INT8, Int8Type, ::parquet::Int32Type) + TYPED_BATCH_CASE(UINT16, UInt16Type, ::parquet::Int32Type) + TYPED_BATCH_CASE(INT16, Int16Type, ::parquet::Int32Type) + TYPED_BATCH_CASE(UINT32, UInt32Type, ::parquet::Int32Type) TYPED_BATCH_CASE(INT32, Int32Type, ::parquet::Int32Type) + TYPED_BATCH_CASE(UINT64, UInt64Type, ::parquet::Int64Type) TYPED_BATCH_CASE(INT64, Int64Type, ::parquet::Int64Type) TYPED_BATCH_CASE(FLOAT, FloatType, ::parquet::FloatType) TYPED_BATCH_CASE(DOUBLE, DoubleType, ::parquet::DoubleType) + TYPED_BATCH_CASE(STRING, StringType, ::parquet::ByteArrayType) default: return Status::NotImplemented(field_->type->ToString()); } diff --git a/cpp/src/arrow/parquet/schema.cc b/cpp/src/arrow/parquet/schema.cc index c7979db349453..a79342afe2f9d 100644 --- a/cpp/src/arrow/parquet/schema.cc +++ b/cpp/src/arrow/parquet/schema.cc @@ -42,7 +42,12 @@ namespace parquet { const auto BOOL = std::make_shared(); const auto UINT8 = std::make_shared(); +const auto INT8 = std::make_shared(); +const auto UINT16 = std::make_shared(); +const auto INT16 = std::make_shared(); +const auto UINT32 = std::make_shared(); const auto INT32 = std::make_shared(); +const auto UINT64 = std::make_shared(); const auto INT64 = std::make_shared(); const auto FLOAT = std::make_shared(); const auto 
@@ -92,6 +97,21 @@ static Status FromInt32(const PrimitiveNode* node, TypePtr* out) { case LogicalType::NONE: *out = INT32; break; + case LogicalType::UINT_8: + *out = UINT8; + break; + case LogicalType::INT_8: + *out = INT8; + break; + case LogicalType::UINT_16: + *out = UINT16; + break; + case LogicalType::INT_16: + *out = INT16; + break; + case LogicalType::UINT_32: + *out = UINT32; + break; case LogicalType::DECIMAL: *out = MakeDecimalType(node); break; @@ -107,6 +127,9 @@ static Status FromInt64(const PrimitiveNode* node, TypePtr* out) { case LogicalType::NONE: *out = INT64; break; + case LogicalType::UINT_64: + *out = UINT64; + break; case LogicalType::DECIMAL: *out = MakeDecimalType(node); break; @@ -187,20 +210,21 @@ Status FromParquetSchema( } Status StructToNode(const std::shared_ptr<StructType>& type, const std::string& name, - bool nullable, NodePtr* out) { + bool nullable, const ::parquet::WriterProperties& properties, NodePtr* out) { Repetition::type repetition = Repetition::REQUIRED; if (nullable) { repetition = Repetition::OPTIONAL; } std::vector<NodePtr> children(type->num_children()); for (int i = 0; i < type->num_children(); i++) { - RETURN_NOT_OK(FieldToNode(type->child(i), &children[i])); + RETURN_NOT_OK(FieldToNode(type->child(i), properties, &children[i])); } *out = GroupNode::Make(name, repetition, children); return Status::OK(); } -Status FieldToNode(const std::shared_ptr<Field>& field, NodePtr* out) { +Status FieldToNode(const std::shared_ptr<Field>& field, + const ::parquet::WriterProperties& properties, NodePtr* out) { LogicalType::type logical_type = LogicalType::NONE; ParquetType::type type; Repetition::type repetition = Repetition::REQUIRED; @@ -231,8 +255,12 @@ Status FieldToNode(const std::shared_ptr<Field>& field, NodePtr* out) { logical_type = LogicalType::INT_16; break; case Type::UINT32: - type = ParquetType::INT32; - logical_type = LogicalType::UINT_32; + if (properties.version() == ::parquet::ParquetVersion::PARQUET_1_0) { + type = ParquetType::INT64; + } else { + type = ParquetType::INT32; + logical_type = LogicalType::UINT_32; + } break; case Type::INT32: type = ParquetType::INT32; @@ -277,7 +305,7 @@ Status FieldToNode(const std::shared_ptr<Field>& field, NodePtr* out) { break; case Type::STRUCT: { auto struct_type = std::static_pointer_cast<StructType>(field->type); - return StructToNode(struct_type, field->name, field->nullable, out); + return StructToNode(struct_type, field->name, field->nullable, properties, out); } break; default: // TODO: LIST, DENSE_UNION, SPARSE_UNION, JSON_SCALAR, DECIMAL, DECIMAL_TEXT, VARCHAR @@ -287,11 +315,12 @@ Status FieldToNode(const std::shared_ptr<Field>& field, NodePtr* out) { return Status::OK(); } -Status ToParquetSchema( - const Schema* arrow_schema, std::shared_ptr<::parquet::SchemaDescriptor>* out) { +Status ToParquetSchema(const Schema* arrow_schema, + const ::parquet::WriterProperties& properties, + std::shared_ptr<::parquet::SchemaDescriptor>* out) { std::vector<NodePtr> nodes(arrow_schema->num_fields()); for (int i = 0; i < arrow_schema->num_fields(); i++) { - RETURN_NOT_OK(FieldToNode(arrow_schema->field(i), &nodes[i])); + RETURN_NOT_OK(FieldToNode(arrow_schema->field(i), properties, &nodes[i])); } NodePtr schema = GroupNode::Make("schema", Repetition::REPEATED, nodes); diff --git a/cpp/src/arrow/parquet/schema.h b/cpp/src/arrow/parquet/schema.h index ec5f96062e89f..39bee059522a3 100644 --- a/cpp/src/arrow/parquet/schema.h +++ b/cpp/src/arrow/parquet/schema.h @@ -21,6 +21,7 @@ #include <memory> #include "parquet/api/schema.h" +#include "parquet/api/writer.h" #include "arrow/schema.h" #include "arrow/type.h"
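FieldToNode's UINT32 handling above is the schema half of a version-dependent mapping: Parquet 1.0 has no UINT_32 logical type, so unsigned 32-bit columns are widened to a plain 64-bit physical type, while 2.0 keeps INT32 and annotates it. A small sketch of just that decision, with hypothetical enums standing in for the parquet-cpp types:

#include <utility>

enum class PhysicalType { INT32, INT64 };
enum class LogicalType { NONE, UINT_32 };
enum class Version { PARQUET_1_0, PARQUET_2_0 };

// Maps an Arrow uint32 column to a (physical, logical) Parquet type pair.
std::pair<PhysicalType, LogicalType> MapUInt32(Version v) {
  if (v == Version::PARQUET_1_0) {
    return {PhysicalType::INT64, LogicalType::NONE};  // lossless widening
  }
  return {PhysicalType::INT32, LogicalType::UINT_32};  // annotated in 2.0
}

int main() {
  return MapUInt32(Version::PARQUET_1_0).first == PhysicalType::INT64 ? 0 : 1;
}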
#include "arrow/schema.h" #include "arrow/type.h" @@ -36,10 +37,12 @@ Status NodeToField(const ::parquet::schema::NodePtr& node, std::shared_ptr* out); -Status FieldToNode(const std::shared_ptr& field, ::parquet::schema::NodePtr* out); +Status FieldToNode(const std::shared_ptr& field, + const ::parquet::WriterProperties& properties, ::parquet::schema::NodePtr* out); -Status ToParquetSchema( - const Schema* arrow_schema, std::shared_ptr<::parquet::SchemaDescriptor>* out); +Status ToParquetSchema(const Schema* arrow_schema, + const ::parquet::WriterProperties& properties, + std::shared_ptr<::parquet::SchemaDescriptor>* out); } // namespace parquet diff --git a/cpp/src/arrow/parquet/test-util.h b/cpp/src/arrow/parquet/test-util.h index cc8723bf6ecab..68a7fb94c2aed 100644 --- a/cpp/src/arrow/parquet/test-util.h +++ b/cpp/src/arrow/parquet/test-util.h @@ -18,26 +18,90 @@ #include #include +#include "arrow/test-util.h" #include "arrow/types/primitive.h" +#include "arrow/types/string.h" namespace arrow { namespace parquet { template -std::shared_ptr NonNullArray( - size_t size, typename ArrowType::c_type value) { - std::vector values(size, value); +using is_arrow_float = std::is_floating_point; + +template +using is_arrow_int = std::is_integral; + +template +using is_arrow_string = std::is_same; + +template +typename std::enable_if::value, + std::shared_ptr>::type +NonNullArray(size_t size) { + std::vector values; + ::arrow::test::random_real(size, 0, 0, 1, &values); NumericBuilder builder(default_memory_pool(), std::make_shared()); builder.Append(values.data(), values.size()); return std::static_pointer_cast(builder.Finish()); } -// This helper function only supports (size/2) nulls yet. +template +typename std::enable_if::value, + std::shared_ptr>::type +NonNullArray(size_t size) { + std::vector values; + ::arrow::test::randint(size, 0, 64, &values); + NumericBuilder builder(default_memory_pool(), std::make_shared()); + builder.Append(values.data(), values.size()); + return std::static_pointer_cast(builder.Finish()); +} + +template +typename std::enable_if::value, + std::shared_ptr>::type +NonNullArray(size_t size) { + StringBuilder builder(default_memory_pool(), std::make_shared()); + for (size_t i = 0; i < size; i++) { + builder.Append("test-string"); + } + return std::static_pointer_cast(builder.Finish()); +} + +template <> +std::shared_ptr NonNullArray(size_t size) { + std::vector values; + ::arrow::test::randint(size, 0, 1, &values); + BooleanBuilder builder(default_memory_pool(), std::make_shared()); + builder.Append(values.data(), values.size()); + return std::static_pointer_cast(builder.Finish()); +} + +// This helper function only supports (size/2) nulls. +template +typename std::enable_if::value, + std::shared_ptr>::type +NullableArray(size_t size, size_t num_nulls) { + std::vector values; + ::arrow::test::random_real(size, 0, 0, 1, &values); + std::vector valid_bytes(size, 1); + + for (size_t i = 0; i < num_nulls; i++) { + valid_bytes[i * 2] = 0; + } + + NumericBuilder builder(default_memory_pool(), std::make_shared()); + builder.Append(values.data(), values.size(), valid_bytes.data()); + return std::static_pointer_cast(builder.Finish()); +} + +// This helper function only supports (size/2) nulls. 
+// This helper function supports at most (size/2) nulls. template <typename ArrowType> -std::shared_ptr<PrimitiveArray> NullableArray( - size_t size, typename ArrowType::c_type value, size_t num_nulls) { - std::vector<typename ArrowType::c_type> values(size, value); +typename std::enable_if<is_arrow_int<ArrowType>::value, + std::shared_ptr<PrimitiveArray>>::type +NullableArray(size_t size, size_t num_nulls) { + std::vector<typename ArrowType::c_type> values; + ::arrow::test::randint(size, 0, 64, &values); std::vector<uint8_t> valid_bytes(size, 1); for (size_t i = 0; i < num_nulls; i++) { @@ -49,14 +113,49 @@ std::shared_ptr<PrimitiveArray> NullableArray( return std::static_pointer_cast<PrimitiveArray>(builder.Finish()); } -std::shared_ptr<Column> MakeColumn(const std::string& name, - const std::shared_ptr<PrimitiveArray>& array, bool nullable) { +// This helper function supports at most (size/2) nulls. +template <typename ArrowType> +typename std::enable_if<is_arrow_string<ArrowType>::value, + std::shared_ptr<StringArray>>::type +NullableArray(size_t size, size_t num_nulls) { + std::vector<uint8_t> valid_bytes(size, 1); + + for (size_t i = 0; i < num_nulls; i++) { + valid_bytes[i * 2] = 0; + } + + StringBuilder builder(default_memory_pool(), std::make_shared<StringType>()); + for (size_t i = 0; i < size; i++) { + builder.Append("test-string"); + } + return std::static_pointer_cast<StringArray>(builder.Finish()); +} + +// This helper function supports at most (size/2) nulls. +template <> +std::shared_ptr<PrimitiveArray> NullableArray<BooleanType>( + size_t size, size_t num_nulls) { + std::vector<uint8_t> values; + ::arrow::test::randint(size, 0, 1, &values); + std::vector<uint8_t> valid_bytes(size, 1); + + for (size_t i = 0; i < num_nulls; i++) { + valid_bytes[i * 2] = 0; + } + + BooleanBuilder builder(default_memory_pool(), std::make_shared<BooleanType>()); + builder.Append(values.data(), values.size(), valid_bytes.data()); + return std::static_pointer_cast<PrimitiveArray>(builder.Finish()); +} + +std::shared_ptr<Column> MakeColumn( + const std::string& name, const std::shared_ptr<Array>& array, bool nullable) { auto field = std::make_shared<Field>(name, array->type(), nullable); return std::make_shared<Column>(field, array); }
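The "at most (size/2) nulls" caveat comes from the validity pattern these helpers build: every even index is flipped to null, so any larger num_nulls would write past the end of the vector. A sketch of just that pattern:

#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<uint8_t> MakeValidBytes(size_t size, size_t num_nulls) {
  std::vector<uint8_t> valid_bytes(size, 1);
  for (size_t i = 0; i < num_nulls; i++) {
    valid_bytes[i * 2] = 0;  // nulls land on slots 0, 2, 4, ...
  }
  return valid_bytes;
}

int main() {
  auto v = MakeValidBytes(6, 3);  // -> 0 1 0 1 0 1
  return (v[0] == 0 && v[1] == 1 && v[4] == 0 && v[5] == 1) ? 0 : 1;
}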
std::shared_ptr<Table> MakeSimpleTable( - const std::shared_ptr<PrimitiveArray>& values, bool nullable) { + const std::shared_ptr<Array>& values, bool nullable) { std::shared_ptr<Column> column = MakeColumn("col", values, nullable); std::vector<std::shared_ptr<Column>> columns({column}); std::vector<std::shared_ptr<Field>> fields({column->field()}); @@ -72,6 +171,23 @@ void ExpectArray(T* expected, Array* result) { } } +template <typename ArrowType> +void ExpectArray(typename ArrowType::c_type* expected, Array* result) { + PrimitiveArray* p_array = static_cast<PrimitiveArray*>(result); + for (int64_t i = 0; i < result->length(); i++) { + EXPECT_EQ(expected[i], + reinterpret_cast<const typename ArrowType::c_type*>(p_array->data()->data())[i]); + } +} + +template <> +void ExpectArray<BooleanType>(uint8_t* expected, Array* result) { + BooleanBuilder builder(default_memory_pool(), std::make_shared<BooleanType>()); + builder.Append(expected, result->length()); + std::shared_ptr<Array> expected_array = builder.Finish(); + EXPECT_TRUE(result->Equals(expected_array)); +} + } // namespace parquet } // namespace arrow diff --git a/cpp/src/arrow/parquet/writer.cc b/cpp/src/arrow/parquet/writer.cc index 4005e3b2b0c1b..63449bb20b1a1 100644 --- a/cpp/src/arrow/parquet/writer.cc +++ b/cpp/src/arrow/parquet/writer.cc @@ -25,11 +25,13 @@ #include "arrow/table.h" #include "arrow/types/construct.h" #include "arrow/types/primitive.h" +#include "arrow/types/string.h" #include "arrow/parquet/schema.h" #include "arrow/parquet/utils.h" #include "arrow/util/status.h" using parquet::ParquetFileWriter; +using parquet::ParquetVersion; using parquet::schema::GroupNode; namespace arrow { @@ -41,10 +43,40 @@ class FileWriter::Impl { Impl(MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileWriter> writer); Status NewRowGroup(int64_t chunk_size); - template <typename ParquetType> + template <typename ParquetType, typename ArrowType> Status TypedWriteBatch(::parquet::ColumnWriter* writer, const PrimitiveArray* data, int64_t offset, int64_t length); + + // TODO(uwe): Same code as in reader.cc; the only difference is the name of the + // temporary buffer + template <typename InType, typename OutType> + struct can_copy_ptr { + static constexpr bool value = + std::is_same<InType, OutType>::value || + (std::is_integral<InType>{} && std::is_integral<OutType>{} && + (sizeof(InType) == sizeof(OutType))); + }; + + template <typename InType, typename OutType, + typename std::enable_if<can_copy_ptr<InType, OutType>::value>::type* = nullptr> + Status ConvertPhysicalType(const InType* in_ptr, int64_t, const OutType** out_ptr) { + *out_ptr = reinterpret_cast<const OutType*>(in_ptr); + return Status::OK(); + } + + template <typename InType, typename OutType, + typename std::enable_if<!can_copy_ptr<InType, OutType>::value>::type* = nullptr> + Status ConvertPhysicalType( + const InType* in_ptr, int64_t length, const OutType** out_ptr) { + RETURN_NOT_OK(data_buffer_.Resize(length * sizeof(OutType))); + OutType* mutable_out_ptr = reinterpret_cast<OutType*>(data_buffer_.mutable_data()); + std::copy(in_ptr, in_ptr + length, mutable_out_ptr); + *out_ptr = mutable_out_ptr; + return Status::OK(); + } + Status WriteFlatColumnChunk(const PrimitiveArray* data, int64_t offset, int64_t length); + Status WriteFlatColumnChunk(const StringArray* data, int64_t offset, int64_t length); Status Close(); virtual ~Impl() {} @@ -53,6 +85,8 @@ class FileWriter::Impl { friend class FileWriter; MemoryPool* pool_; + // Buffer used for storing the data of an array converted to the physical type + // as expected by parquet-cpp.
PoolBuffer data_buffer_; PoolBuffer def_levels_buffer_; std::unique_ptr<::parquet::ParquetFileWriter> writer_; @@ -72,36 +106,95 @@ Status FileWriter::Impl::NewRowGroup(int64_t chunk_size) { return Status::OK(); } -template <typename ParquetType> +template <typename ParquetType, typename ArrowType> Status FileWriter::Impl::TypedWriteBatch(::parquet::ColumnWriter* column_writer, const PrimitiveArray* data, int64_t offset, int64_t length) { - // TODO: DCHECK((offset + length) <= data->length()); - auto data_ptr = - reinterpret_cast<const typename ParquetType::c_type*>(data->data()->data()) + - offset; + using ArrowCType = typename ArrowType::c_type; + using ParquetCType = typename ParquetType::c_type; + + DCHECK((offset + length) <= data->length()); + auto data_ptr = reinterpret_cast<const ArrowCType*>(data->data()->data()) + offset; auto writer = reinterpret_cast<::parquet::TypedColumnWriter<ParquetType>*>(column_writer); if (writer->descr()->max_definition_level() == 0) { // no nulls, just dump the data - PARQUET_CATCH_NOT_OK(writer->WriteBatch(length, nullptr, nullptr, data_ptr)); + const ParquetCType* data_writer_ptr; + RETURN_NOT_OK((ConvertPhysicalType<ArrowCType, ParquetCType>( + data_ptr, length, &data_writer_ptr))); + PARQUET_CATCH_NOT_OK(writer->WriteBatch(length, nullptr, nullptr, data_writer_ptr)); } else if (writer->descr()->max_definition_level() == 1) { RETURN_NOT_OK(def_levels_buffer_.Resize(length * sizeof(int16_t))); int16_t* def_levels_ptr = reinterpret_cast<int16_t*>(def_levels_buffer_.mutable_data()); if (data->null_count() == 0) { std::fill(def_levels_ptr, def_levels_ptr + length, 1); - PARQUET_CATCH_NOT_OK(writer->WriteBatch(length, def_levels_ptr, nullptr, data_ptr)); + const ParquetCType* data_writer_ptr; + RETURN_NOT_OK((ConvertPhysicalType<ArrowCType, ParquetCType>( + data_ptr, length, &data_writer_ptr))); + PARQUET_CATCH_NOT_OK( + writer->WriteBatch(length, def_levels_ptr, nullptr, data_writer_ptr)); } else { - RETURN_NOT_OK(data_buffer_.Resize(length * sizeof(typename ParquetType::c_type))); - auto buffer_ptr = - reinterpret_cast<typename ParquetType::c_type*>(data_buffer_.mutable_data()); + RETURN_NOT_OK(data_buffer_.Resize(length * sizeof(ParquetCType))); + auto buffer_ptr = reinterpret_cast<ParquetCType*>(data_buffer_.mutable_data()); int buffer_idx = 0; for (int i = 0; i < length; i++) { if (data->IsNull(offset + i)) { def_levels_ptr[i] = 0; } else { def_levels_ptr[i] = 1; - buffer_ptr[buffer_idx++] = data_ptr[i]; + buffer_ptr[buffer_idx++] = static_cast<ParquetCType>(data_ptr[i]); + } + } + PARQUET_CATCH_NOT_OK( + writer->WriteBatch(length, def_levels_ptr, nullptr, buffer_ptr)); + } + } else { + return Status::NotImplemented("no support for max definition level > 1 yet"); + } + PARQUET_CATCH_NOT_OK(writer->Close()); + return Status::OK(); +}
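ConvertPhysicalType above (declared in the Impl class earlier in this diff) picks between a zero-copy pointer reinterpret and a widening element copy based on can_copy_ptr. A standalone model of the same mechanism; the names mirror the patch, but this is an illustrative sketch, not the library code:

#include <cstdint>
#include <type_traits>
#include <vector>

template <typename InType, typename OutType>
struct can_copy_ptr {
  static constexpr bool value =
      std::is_same<InType, OutType>::value ||
      (std::is_integral<InType>{} && std::is_integral<OutType>{} &&
          (sizeof(InType) == sizeof(OutType)));
};

// Zero-copy: same memory layout, just reinterpret the pointer.
template <typename InType, typename OutType,
    typename std::enable_if<can_copy_ptr<InType, OutType>::value>::type* = nullptr>
const OutType* Convert(const InType* in, int64_t, std::vector<OutType>*) {
  return reinterpret_cast<const OutType*>(in);
}

// Widening copy into a scratch buffer, e.g. uint32_t -> int64_t for Parquet 1.0.
template <typename InType, typename OutType,
    typename std::enable_if<!can_copy_ptr<InType, OutType>::value>::type* = nullptr>
const OutType* Convert(const InType* in, int64_t length, std::vector<OutType>* scratch) {
  scratch->assign(in, in + length);
  return scratch->data();
}

int main() {
  const uint32_t in[] = {1, 2, 3};
  std::vector<int64_t> scratch;
  const int64_t* widened = Convert<uint32_t, int64_t>(in, 3, &scratch);  // copies
  std::vector<int32_t> unused;
  const int32_t* aliased = Convert<uint32_t, int32_t>(in, 3, &unused);   // zero-copy
  return (widened[2] == 3 && aliased[0] == 1) ? 0 : 1;
}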
+// This specialization is quite similar but differs in two points: +// * the offset is applied to the pointer as late as possible, since we have +// sub-byte accesses +// * Arrow data is stored bit-packed, thus we cannot use std::copy to convert +// from ArrowType::c_type to ParquetType::c_type +template <> +Status FileWriter::Impl::TypedWriteBatch<::parquet::BooleanType, BooleanType>( + ::parquet::ColumnWriter* column_writer, const PrimitiveArray* data, int64_t offset, + int64_t length) { + DCHECK((offset + length) <= data->length()); + RETURN_NOT_OK(data_buffer_.Resize(length)); + auto data_ptr = reinterpret_cast<const uint8_t*>(data->data()->data()); + auto buffer_ptr = reinterpret_cast<bool*>(data_buffer_.mutable_data()); + auto writer = reinterpret_cast<::parquet::TypedColumnWriter<::parquet::BooleanType>*>( + column_writer); + if (writer->descr()->max_definition_level() == 0) { + // no nulls, just dump the data + for (int64_t i = 0; i < length; i++) { + buffer_ptr[i] = util::get_bit(data_ptr, offset + i); + } + PARQUET_CATCH_NOT_OK(writer->WriteBatch(length, nullptr, nullptr, buffer_ptr)); + } else if (writer->descr()->max_definition_level() == 1) { + RETURN_NOT_OK(def_levels_buffer_.Resize(length * sizeof(int16_t))); + int16_t* def_levels_ptr = + reinterpret_cast<int16_t*>(def_levels_buffer_.mutable_data()); + if (data->null_count() == 0) { + std::fill(def_levels_ptr, def_levels_ptr + length, 1); + for (int64_t i = 0; i < length; i++) { + buffer_ptr[i] = util::get_bit(data_ptr, offset + i); + } + // TODO(PARQUET-644): write boolean values as a packed bitmap + PARQUET_CATCH_NOT_OK( + writer->WriteBatch(length, def_levels_ptr, nullptr, buffer_ptr)); + } else { + int buffer_idx = 0; + for (int i = 0; i < length; i++) { + if (data->IsNull(offset + i)) { + def_levels_ptr[i] = 0; + } else { + def_levels_ptr[i] = 1; + buffer_ptr[buffer_idx++] = util::get_bit(data_ptr, offset + i); } } PARQUET_CATCH_NOT_OK( @@ -120,9 +213,9 @@ Status FileWriter::Impl::Close() { return Status::OK(); } -#define TYPED_BATCH_CASE(ENUM, ArrowType, ParquetType) \ - case Type::ENUM: \ - return TypedWriteBatch<ParquetType>(writer, data, offset, length); \ +#define TYPED_BATCH_CASE(ENUM, ArrowType, ParquetType) \ + case Type::ENUM: \ + return TypedWriteBatch<ParquetType, ArrowType>(writer, data, offset, length); \ break; Status FileWriter::Impl::WriteFlatColumnChunk( @@ -130,15 +223,76 @@ Status FileWriter::Impl::WriteFlatColumnChunk( ::parquet::ColumnWriter* writer; PARQUET_CATCH_NOT_OK(writer = row_group_writer_->NextColumn()); switch (data->type_enum()) { - TYPED_BATCH_CASE(INT32, Int32Type, ::parquet::Int32Type) - TYPED_BATCH_CASE(INT64, Int64Type, ::parquet::Int64Type) - TYPED_BATCH_CASE(FLOAT, FloatType, ::parquet::FloatType) - TYPED_BATCH_CASE(DOUBLE, DoubleType, ::parquet::DoubleType) + TYPED_BATCH_CASE(BOOL, BooleanType, ::parquet::BooleanType) + TYPED_BATCH_CASE(UINT8, UInt8Type, ::parquet::Int32Type) + TYPED_BATCH_CASE(INT8, Int8Type, ::parquet::Int32Type) + TYPED_BATCH_CASE(UINT16, UInt16Type, ::parquet::Int32Type) + TYPED_BATCH_CASE(INT16, Int16Type, ::parquet::Int32Type) + case Type::UINT32: + if (writer_->properties()->version() == ParquetVersion::PARQUET_1_0) { + // Parquet 1.0 readers cannot read the UINT_32 logical type. Thus we need + // to use the larger Int64Type to store them losslessly.
+ return TypedWriteBatch<::parquet::Int64Type, UInt32Type>( + writer, data, offset, length); + } else { + return TypedWriteBatch<::parquet::Int32Type, UInt32Type>( + writer, data, offset, length); + } + TYPED_BATCH_CASE(INT32, Int32Type, ::parquet::Int32Type) + TYPED_BATCH_CASE(UINT64, UInt64Type, ::parquet::Int64Type) + TYPED_BATCH_CASE(INT64, Int64Type, ::parquet::Int64Type) + TYPED_BATCH_CASE(FLOAT, FloatType, ::parquet::FloatType) + TYPED_BATCH_CASE(DOUBLE, DoubleType, ::parquet::DoubleType) default: return Status::NotImplemented(data->type()->ToString()); } } +Status FileWriter::Impl::WriteFlatColumnChunk( + const StringArray* data, int64_t offset, int64_t length) { + ::parquet::ColumnWriter* column_writer; + PARQUET_CATCH_NOT_OK(column_writer = row_group_writer_->NextColumn()); + DCHECK((offset + length) <= data->length()); + RETURN_NOT_OK(data_buffer_.Resize(length * sizeof(::parquet::ByteArray))); + auto buffer_ptr = reinterpret_cast<::parquet::ByteArray*>(data_buffer_.mutable_data()); + auto values = std::dynamic_pointer_cast<UInt8Array>(data->values()); + auto data_ptr = reinterpret_cast<const uint8_t*>(values->data()->data()); + DCHECK(values != nullptr); + auto writer = reinterpret_cast<::parquet::TypedColumnWriter<::parquet::ByteArrayType>*>( + column_writer); + if (writer->descr()->max_definition_level() > 0) { + RETURN_NOT_OK(def_levels_buffer_.Resize(length * sizeof(int16_t))); + } + int16_t* def_levels_ptr = reinterpret_cast<int16_t*>(def_levels_buffer_.mutable_data()); + if (writer->descr()->max_definition_level() == 0 || data->null_count() == 0) { + // no nulls, just dump the data + for (int64_t i = 0; i < length; i++) { + buffer_ptr[i] = ::parquet::ByteArray( + data->value_length(i + offset), data_ptr + data->value_offset(i + offset)); + } + if (writer->descr()->max_definition_level() > 0) { + std::fill(def_levels_ptr, def_levels_ptr + length, 1); + } + PARQUET_CATCH_NOT_OK(writer->WriteBatch(length, def_levels_ptr, nullptr, buffer_ptr)); + } else if (writer->descr()->max_definition_level() == 1) { + int buffer_idx = 0; + for (int64_t i = 0; i < length; i++) { + if (data->IsNull(offset + i)) { + def_levels_ptr[i] = 0; + } else { + def_levels_ptr[i] = 1; + buffer_ptr[buffer_idx++] = ::parquet::ByteArray( + data->value_length(i + offset), data_ptr + data->value_offset(i + offset)); + } + } + PARQUET_CATCH_NOT_OK(writer->WriteBatch(length, def_levels_ptr, nullptr, buffer_ptr)); + } else { + return Status::NotImplemented("no support for max definition level > 1 yet"); + } + PARQUET_CATCH_NOT_OK(writer->Close()); + return Status::OK(); +}
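The StringArray writer above avoids copying string payloads: each Parquet value becomes a (length, pointer) view into the column's shared character buffer. A simplified, self-contained sketch of that offsets-plus-data layout (ByteArray again a hypothetical stand-in struct):

#include <cstdint>
#include <cstring>
#include <vector>

struct ByteArray {  // stand-in for ::parquet::ByteArray
  uint32_t len;
  const uint8_t* ptr;
};

int main() {
  // Arrow-style string column: one contiguous buffer plus value offsets.
  const char data[] = "foobarbaz";
  const int32_t offsets[] = {0, 3, 6, 9};  // "foo", "bar", "baz"
  const uint8_t* bytes = reinterpret_cast<const uint8_t*>(data);

  std::vector<ByteArray> views;
  for (int i = 0; i < 3; i++) {
    views.push_back(
        {static_cast<uint32_t>(offsets[i + 1] - offsets[i]), bytes + offsets[i]});
  }
  // views can now be handed to the Parquet column writer without copying.
  return std::memcmp(views[1].ptr, "bar", views[1].len) == 0 ? 0 : 1;
}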
FileWriter::FileWriter( MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileWriter> writer) : impl_(new FileWriter::Impl(pool, std::move(writer))) {} @@ -148,10 +302,20 @@ Status FileWriter::NewRowGroup(int64_t chunk_size) { } Status FileWriter::WriteFlatColumnChunk( - const PrimitiveArray* data, int64_t offset, int64_t length) { + const Array* array, int64_t offset, int64_t length) { int64_t real_length = length; - if (length == -1) { real_length = data->length(); } - return impl_->WriteFlatColumnChunk(data, offset, real_length); + if (length == -1) { real_length = array->length(); } + if (array->type_enum() == Type::STRING) { + auto string_array = dynamic_cast<const StringArray*>(array); + DCHECK(string_array); + return impl_->WriteFlatColumnChunk(string_array, offset, real_length); + } else { + auto primitive_array = dynamic_cast<const PrimitiveArray*>(array); + if (!primitive_array) { + return Status::NotImplemented("Table must consist of PrimitiveArray instances"); + } + return impl_->WriteFlatColumnChunk(primitive_array, offset, real_length); + } } Status FileWriter::Close() { @@ -165,40 +329,30 @@ MemoryPool* FileWriter::memory_pool() const { FileWriter::~FileWriter() {} Status WriteFlatTable(const Table* table, MemoryPool* pool, - std::shared_ptr<::parquet::OutputStream> sink, int64_t chunk_size) { + const std::shared_ptr<::parquet::OutputStream>& sink, int64_t chunk_size, + const std::shared_ptr<::parquet::WriterProperties>& properties) { std::shared_ptr<::parquet::SchemaDescriptor> parquet_schema; - RETURN_NOT_OK(ToParquetSchema(table->schema().get(), &parquet_schema)); + RETURN_NOT_OK( + ToParquetSchema(table->schema().get(), *properties.get(), &parquet_schema)); auto schema_node = std::static_pointer_cast<GroupNode>(parquet_schema->schema()); std::unique_ptr<ParquetFileWriter> parquet_writer = - ParquetFileWriter::Open(sink, schema_node); + ParquetFileWriter::Open(sink, schema_node, properties); FileWriter writer(pool, std::move(parquet_writer)); - // TODO: Support writing chunked arrays. + // TODO(ARROW-232) Support writing chunked arrays. for (int i = 0; i < table->num_columns(); i++) { if (table->column(i)->data()->num_chunks() != 1) { return Status::NotImplemented("No support for writing chunked arrays yet."); } } - // Cast to PrimitiveArray instances as we work with them. - std::vector<std::shared_ptr<PrimitiveArray>> arrays(table->num_columns()); - for (int i = 0; i < table->num_columns(); i++) { - // num_chunks == 1 as per above loop - std::shared_ptr<Array> array = table->column(i)->data()->chunk(0); - auto primitive_array = std::dynamic_pointer_cast<PrimitiveArray>(array); - if (!primitive_array) { - PARQUET_IGNORE_NOT_OK(writer.Close()); - return Status::NotImplemented("Table must consist of PrimitiveArray instances"); - } - arrays[i] = primitive_array; - } - for (int chunk = 0; chunk * chunk_size < table->num_rows(); chunk++) { int64_t offset = chunk * chunk_size; int64_t size = std::min(chunk_size, table->num_rows() - offset); RETURN_NOT_OK_ELSE(writer.NewRowGroup(size), PARQUET_IGNORE_NOT_OK(writer.Close())); for (int i = 0; i < table->num_columns(); i++) { - RETURN_NOT_OK_ELSE(writer.WriteFlatColumnChunk(arrays[i].get(), offset, size), + std::shared_ptr<Array> array = table->column(i)->data()->chunk(0); + RETURN_NOT_OK_ELSE(writer.WriteFlatColumnChunk(array.get(), offset, size), PARQUET_IGNORE_NOT_OK(writer.Close())); } } diff --git a/cpp/src/arrow/parquet/writer.h b/cpp/src/arrow/parquet/writer.h index 93693f511846b..cfd80d80b7997 100644 --- a/cpp/src/arrow/parquet/writer.h +++ b/cpp/src/arrow/parquet/writer.h @@ -25,10 +25,12 @@ namespace arrow { +class Array; class MemoryPool; class PrimitiveArray; class RowBatch; class Status; +class StringArray; class Table; namespace parquet { @@ -43,8 +45,7 @@ class FileWriter { FileWriter(MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileWriter> writer); Status NewRowGroup(int64_t chunk_size); - Status WriteFlatColumnChunk( - const PrimitiveArray* data, int64_t offset = 0, int64_t length = -1); + Status WriteFlatColumnChunk(const Array* data, int64_t offset = 0, int64_t length = -1); Status Close(); virtual ~FileWriter(); @@ -62,7 +63,9 @@ class FileWriter { * The table shall only consist of nullable, non-repeated columns of primitive type. */ Status WriteFlatTable(const Table* table, MemoryPool* pool, - std::shared_ptr<::parquet::OutputStream> sink, int64_t chunk_size); + const std::shared_ptr<::parquet::OutputStream>& sink, int64_t chunk_size, + const std::shared_ptr<::parquet::WriterProperties>& properties = + ::parquet::default_writer_properties()); } // namespace parquet
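WriteFlatTable above slices every column into row groups of at most chunk_size rows. The loop's offset arithmetic in isolation (the sizes here are arbitrary illustration values; parquet.pyx below caps chunk_size at min(num_rows, 2**16)):

#include <algorithm>
#include <cstdint>

int main() {
  const int64_t num_rows = 25000;
  const int64_t chunk_size = 10000;  // arbitrary row-group size
  int64_t written = 0;
  for (int64_t chunk = 0; chunk * chunk_size < num_rows; chunk++) {
    int64_t offset = chunk * chunk_size;
    int64_t size = std::min(chunk_size, num_rows - offset);
    // NewRowGroup(size); then WriteFlatColumnChunk(array, offset, size) per column.
    written += size;  // final group holds the 5000-row remainder
  }
  return written == num_rows ? 0 : 1;
}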
diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index 2f81161d1d6d1..055dac7444488 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -50,6 +50,8 @@ if (!s.ok()) { FAIL() << s.ToString(); } \ } while (0) +#define ASSERT_OK_NO_THROW(expr) ASSERT_NO_THROW(ASSERT_OK(expr)) + #define EXPECT_OK(expr) \ do { \ Status s = (expr); \ diff --git a/cpp/src/arrow/types/primitive.cc b/cpp/src/arrow/types/primitive.cc index 08fc8478e6de5..f4b47f9d2f503 100644 --- a/cpp/src/arrow/types/primitive.cc +++ b/cpp/src/arrow/types/primitive.cc @@ -133,6 +133,11 @@ Status PrimitiveBuilder<T>::Append( RETURN_NOT_OK(Reserve(length)); for (int i = 0; i < length; ++i) { + // Skip reading from uninitialized memory + // TODO: This is primarily to keep valgrind happy, but it may or may not + // have a performance impact. + if ((valid_bytes != nullptr) && !valid_bytes[i]) continue; + if (values[i] > 0) { util::set_bit(raw_data_, length_ + i); } else { diff --git a/python/pyarrow/includes/parquet.pxd b/python/pyarrow/includes/parquet.pxd index 0918344070eb0..a2f83ea5ea566 100644 --- a/python/pyarrow/includes/parquet.pxd +++ b/python/pyarrow/includes/parquet.pxd @@ -32,6 +32,10 @@ cdef extern from "parquet/api/schema.h" namespace "parquet::schema" nogil: pass cdef extern from "parquet/api/schema.h" namespace "parquet" nogil: + enum ParquetVersion" parquet::ParquetVersion::type": + PARQUET_1_0" parquet::ParquetVersion::PARQUET_1_0" + PARQUET_2_0" parquet::ParquetVersion::PARQUET_2_0" + cdef cppclass SchemaDescriptor: shared_ptr[Node] schema() GroupNode* group() @@ -80,6 +84,11 @@ cdef extern from "parquet/api/writer.h" namespace "parquet" nogil: LocalFileOutputStream(const c_string& path) void Close() + cdef cppclass WriterProperties: + cppclass Builder: + Builder* version(ParquetVersion version) + shared_ptr[WriterProperties] build() + cdef extern from "arrow/parquet/reader.h" namespace "arrow::parquet" nogil: cdef cppclass FileReader: @@ -93,5 +102,7 @@ cdef extern from "arrow/parquet/schema.h" namespace "arrow::parquet" nogil: cdef extern from "arrow/parquet/writer.h" namespace "arrow::parquet" nogil: - cdef CStatus WriteFlatTable(const CTable* table, MemoryPool* pool, shared_ptr[OutputStream] sink, int64_t chunk_size) + cdef CStatus WriteFlatTable(const CTable* table, MemoryPool* pool, + const shared_ptr[OutputStream]& sink, int64_t chunk_size, + const shared_ptr[WriterProperties]& properties) diff --git a/python/pyarrow/parquet.pyx b/python/pyarrow/parquet.pyx index 3d5355ebe433a..0b2b20880332b 100644 --- a/python/pyarrow/parquet.pyx +++ b/python/pyarrow/parquet.pyx @@ -24,6 +24,7 @@ cimport pyarrow.includes.pyarrow as pyarrow from pyarrow.includes.parquet cimport * from pyarrow.compat import tobytes +from pyarrow.error import ArrowException from pyarrow.error cimport check_cstatus from pyarrow.table cimport Table @@ -42,11 +43,13 @@ def read_table(filename, columns=None): # in Cython (due to missing rvalue support) reader = unique_ptr[FileReader](new FileReader(default_memory_pool(), ParquetFileReader.OpenFile(tobytes(filename)))) - check_cstatus(reader.get().ReadFlatTable(&ctable)) + with nogil: + check_cstatus(reader.get().ReadFlatTable(&ctable)) + table.init(ctable) return table
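The version argument of write_table (shown just below) is a thin Cython veneer over the C++ WriterProperties::Builder declared in the .pxd above. The equivalent C++ is roughly the following sketch, assuming the parquet-cpp writer API exactly as that .pxd declares it:

#include <memory>

#include "parquet/api/writer.h"  // declares WriterProperties, per the .pxd above

std::shared_ptr<::parquet::WriterProperties> MakeProperties(bool use_1_0) {
  ::parquet::WriterProperties::Builder builder;
  builder.version(use_1_0 ? ::parquet::ParquetVersion::PARQUET_1_0
                          : ::parquet::ParquetVersion::PARQUET_2_0);
  return builder.build();  // shared_ptr passed through to WriteFlatTable
}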
-def write_table(table, filename, chunk_size=None): +def write_table(table, filename, chunk_size=None, version=None): """ Write a Table to Parquet format @@ -56,16 +59,29 @@ filename : string chunk_size : int The maximum number of rows in each Parquet RowGroup + version : {"1.0", "2.0"}, default "1.0" + The Parquet format version to write """ cdef Table table_ = table cdef CTable* ctable_ = table_.table cdef shared_ptr[OutputStream] sink + cdef WriterProperties.Builder properties_builder cdef int64_t chunk_size_ = 0 if chunk_size is None: chunk_size_ = min(ctable_.num_rows(), int(2**16)) else: chunk_size_ = chunk_size + if version is not None: + if version == "1.0": + properties_builder.version(PARQUET_1_0) + elif version == "2.0": + properties_builder.version(PARQUET_2_0) + else: + raise ArrowException("Unsupported Parquet format version") + sink.reset(new LocalFileOutputStream(tobytes(filename))) - check_cstatus(WriteFlatTable(ctable_, default_memory_pool(), sink, chunk_size_)) + with nogil: + check_cstatus(WriteFlatTable(ctable_, default_memory_pool(), sink, + chunk_size_, properties_builder.build())) diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index d92cf4ca6563e..de9cfbb46e1a2 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -42,18 +42,55 @@ def test_single_pylist_column_roundtrip(tmpdir): data_read = col_read.data.chunk(0) assert data_written.equals(data_read) -def test_pandas_rountrip(tmpdir): +def test_pandas_parquet_2_0_rountrip(tmpdir): size = 10000 + np.random.seed(0) df = pd.DataFrame({ + 'uint8': np.arange(size, dtype=np.uint8), + 'uint16': np.arange(size, dtype=np.uint16), + 'uint32': np.arange(size, dtype=np.uint32), + 'uint64': np.arange(size, dtype=np.uint64), + 'int8': np.arange(size, dtype=np.int16), + 'int16': np.arange(size, dtype=np.int16), 'int32': np.arange(size, dtype=np.int32), 'int64': np.arange(size, dtype=np.int64), 'float32': np.arange(size, dtype=np.float32), - 'float64': np.arange(size, dtype=np.float64) + 'float64': np.arange(size, dtype=np.float64), + 'bool': np.random.randn(size) > 0, + 'str': [str(x) for x in range(size)], + 'str_with_nulls': [None] + [str(x) for x in range(size - 2)] + [None] }) filename = tmpdir.join('pandas_rountrip.parquet') arrow_table = A.from_pandas_dataframe(df) - A.parquet.write_table(arrow_table, filename.strpath) + A.parquet.write_table(arrow_table, filename.strpath, version="2.0") table_read = pyarrow.parquet.read_table(filename.strpath) df_read = table_read.to_pandas() pdt.assert_frame_equal(df, df_read) +def test_pandas_parquet_1_0_rountrip(tmpdir): + size = 10000 + np.random.seed(0) + df = pd.DataFrame({ + 'uint8': np.arange(size, dtype=np.uint8), + 'uint16': np.arange(size, dtype=np.uint16), + 'uint32': np.arange(size, dtype=np.uint32), + 'uint64': np.arange(size, dtype=np.uint64), + 'int8': np.arange(size, dtype=np.int16), + 'int16': np.arange(size, dtype=np.int16), + 'int32': np.arange(size, dtype=np.int32), + 'int64': np.arange(size, dtype=np.int64), + 'float32': np.arange(size, dtype=np.float32), + 'float64': np.arange(size, dtype=np.float64), + 'bool': np.random.randn(size) > 0 + }) + filename = tmpdir.join('pandas_rountrip.parquet') + arrow_table = A.from_pandas_dataframe(df) + A.parquet.write_table(arrow_table, filename.strpath, version="1.0") + table_read =
pyarrow.parquet.read_table(filename.strpath) + df_read = table_read.to_pandas() + + # uint32 columns are written as int64 when targeting Parquet version 1.0 + df['uint32'] = df['uint32'].values.astype(np.int64) + + pdt.assert_frame_equal(df, df_read) + From fab4c82d2668e4f8c450053c34dd70ea99365fac Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 1 Jul 2016 14:25:46 -0700 Subject: [PATCH 0094/1644] ARROW-234: Build libhdfs IO extension in conda artifacts Author: Wes McKinney Closes #97 from wesm/ARROW-234 and squashes the following commits: 3edb8d1 [Wes McKinney] Enable ARROW_HDFS extension in conda artifact --- cpp/conda.recipe/build.sh | 1 + 1 file changed, 1 insertion(+) diff --git a/cpp/conda.recipe/build.sh b/cpp/conda.recipe/build.sh index b10dd03349bd3..7e60ccc911faa 100644 --- a/cpp/conda.recipe/build.sh +++ b/cpp/conda.recipe/build.sh @@ -49,6 +49,7 @@ cmake \ -DCMAKE_BUILD_TYPE=release \ -DCMAKE_INSTALL_PREFIX=$PREFIX \ -DCMAKE_SHARED_LINKER_FLAGS=$SHARED_LINKER_FLAGS \ + -DARROW_HDFS=on \ -DARROW_IPC=on \ -DARROW_PARQUET=on \ .. From 77598fa59a92c07dedf7d93307e5c72c5b2724d0 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 10 Jul 2016 13:17:50 -0700 Subject: [PATCH 0095/1644] ARROW-233: Add visibility macros, add static build option This also resolves ARROW-213. Builds off work done in PARQUET-489. I inserted a hack to deal with the fact that the boost libs in apt won't statically link properly. We'll deal with that some other time. Author: Wes McKinney Closes #100 from wesm/ARROW-233 and squashes the following commits: 0253827 [Wes McKinney] Remove -Wno-unused-local-typedef 69b03b0 [Wes McKinney] - Add visibility macros. Hide boost symbols in arrow_io - Hack around Travis CI inability to use its boost static libraries - Use parquet_shared name - More informative verbose test logs - Fix some gtest-1.7.0 crankiness - Fix a valgrind shared_ptr possible memory leak stemming from static variable referenced at compile-time in libarrow_parquet - Fix a bunch of compiler warnings in release builds --- ci/travis_install_conda.sh | 1 - ci/travis_script_cpp.sh | 2 +- ci/travis_script_python.sh | 6 +- cpp/CMakeLists.txt | 217 ++++++++++++----------- cpp/build-support/run-test.sh | 10 +- cpp/conda.recipe/build.sh | 13 +- cpp/src/arrow/array.h | 5 +- cpp/src/arrow/builder.h | 3 +- cpp/src/arrow/column.h | 5 +- cpp/src/arrow/io/CMakeLists.txt | 53 ++++-- cpp/src/arrow/io/hdfs-io-test.cc | 2 +- cpp/src/arrow/io/hdfs.h | 17 +- cpp/src/arrow/io/libhdfs_shim.cc | 3 +- cpp/src/arrow/io/symbols.map | 18 ++ cpp/src/arrow/ipc/CMakeLists.txt | 2 +- cpp/src/arrow/parquet/CMakeLists.txt | 4 +- cpp/src/arrow/parquet/parquet-io-test.cc | 18 +- cpp/src/arrow/parquet/reader.cc | 2 +- cpp/src/arrow/parquet/reader.h | 6 +- cpp/src/arrow/parquet/schema.h | 10 +- cpp/src/arrow/parquet/writer.cc | 4 +- cpp/src/arrow/parquet/writer.h | 6 +- cpp/src/arrow/schema.h | 4 +- cpp/src/arrow/symbols.map | 15 ++ cpp/src/arrow/table.h | 6 +- cpp/src/arrow/type.h | 39 ++-- cpp/src/arrow/types/construct.h | 11 +- cpp/src/arrow/types/decimal.h | 3 +- cpp/src/arrow/types/list.h | 7 +- cpp/src/arrow/types/primitive.h | 13 +- cpp/src/arrow/types/string-test.cc | 8 +- cpp/src/arrow/types/string.cc | 11 +- cpp/src/arrow/types/string.h | 16 +- cpp/src/arrow/types/struct-test.cc | 8 +- cpp/src/arrow/types/struct.h | 5 +- cpp/src/arrow/util/CMakeLists.txt | 1 + cpp/src/arrow/util/buffer.h | 12 +- cpp/src/arrow/util/memory-pool-test.cc | 2 +- cpp/src/arrow/util/memory-pool.h | 6 +- cpp/src/arrow/util/status.cc | 3 +
cpp/src/arrow/util/status.h | 4 +- cpp/src/arrow/util/visibility.h | 32 ++++ python/conda.recipe/build.sh | 15 +- python/src/pyarrow/adapters/builtin.h | 2 + python/src/pyarrow/adapters/pandas.h | 5 + python/src/pyarrow/common.h | 6 +- python/src/pyarrow/config.h | 4 + python/src/pyarrow/helpers.h | 3 + python/src/pyarrow/status.h | 4 +- python/src/pyarrow/visibility.h | 32 ++++ 50 files changed, 439 insertions(+), 245 deletions(-) create mode 100644 cpp/src/arrow/io/symbols.map create mode 100644 cpp/src/arrow/symbols.map create mode 100644 cpp/src/arrow/util/visibility.h create mode 100644 python/src/pyarrow/visibility.h diff --git a/ci/travis_install_conda.sh b/ci/travis_install_conda.sh index be7f59a4733bd..3a8f57bf8f1bf 100644 --- a/ci/travis_install_conda.sh +++ b/ci/travis_install_conda.sh @@ -25,4 +25,3 @@ conda install --yes conda-build jinja2 anaconda-client # faster builds, please conda install -y nomkl - diff --git a/ci/travis_script_cpp.sh b/ci/travis_script_cpp.sh index 9cf4f8e352109..a3585507f0a6d 100755 --- a/ci/travis_script_cpp.sh +++ b/ci/travis_script_cpp.sh @@ -16,6 +16,6 @@ make lint # make check-clang-tidy # fi -ctest -L unittest +ctest -VV -L unittest popd diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index 6d35785356ab4..4a377428ae43a 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -7,7 +7,6 @@ PYTHON_DIR=$TRAVIS_BUILD_DIR/python # Re-use conda installation from C++ export MINICONDA=$TRAVIS_BUILD_DIR/miniconda export PATH="$MINICONDA/bin:$PATH" -export LD_LIBRARY_PATH="$MINICONDA/lib:$LD_LIBRARY_PATH" export PARQUET_HOME=$MINICONDA # Share environment with C++ @@ -32,12 +31,15 @@ python_version_tests() { # Expensive dependencies install from Continuum package repo conda install -y pip numpy pandas cython + conda install -y parquet-cpp arrow-cpp -c apache/channel/dev + # Other stuff pip install pip install -r requirements.txt export ARROW_HOME=$ARROW_CPP_INSTALL - python setup.py build_ext --inplace + python setup.py build_ext \ + --inplace python -m pytest -vv -r sxX pyarrow } diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 18b47599b93d0..a39a752123155 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -44,12 +44,22 @@ endif(CCACHE_FOUND) # Top level cmake dir if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") + option(ARROW_BUILD_STATIC + "Build the libarrow static libraries" + ON) + + option(ARROW_BUILD_SHARED + "Build the libarrow shared libraries" + ON) + option(ARROW_PARQUET "Build the Parquet adapter and link to libparquet" OFF) + option(ARROW_TEST_MEMCHECK - "Run the test suite using valgrind --tool=memcheck" - OFF) + "Run the test suite using valgrind --tool=memcheck" + OFF) + option(ARROW_BUILD_TESTS "Build the Arrow googletest unit tests" ON) @@ -66,6 +76,10 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") "Build the Arrow IO extensions for the Hadoop file system" OFF) + option(ARROW_BOOST_USE_SHARED + "Rely on boost shared libraries where relevant" + ON) + option(ARROW_SSE3 "Build Arrow with SSE3" ON) @@ -172,18 +186,6 @@ if ("${COMPILER_FAMILY}" STREQUAL "clang") set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CMAKE_CLANG_OPTIONS}") endif() -# Sanity check linking option. 
-if (NOT ARROW_LINK) - set(ARROW_LINK "d") -elseif(NOT ("auto" MATCHES "^${ARROW_LINK}" OR - "dynamic" MATCHES "^${ARROW_LINK}" OR - "static" MATCHES "^${ARROW_LINK}")) - message(FATAL_ERROR "Unknown value for ARROW_LINK, must be auto|dynamic|static") -else() - # Remove all but the first letter. - string(SUBSTRING "${ARROW_LINK}" 0 1 ARROW_LINK) -endif() - # ASAN / TSAN / UBSAN include(san-config) @@ -203,61 +205,11 @@ if ("${ARROW_GENERATE_COVERAGE}") # For coverage to work properly, we need to use static linkage. Otherwise, # __gcov_flush() doesn't properly flush coverage from every module. # See http://stackoverflow.com/questions/28164543/using-gcov-flush-within-a-library-doesnt-force-the-other-modules-to-yield-gc - if("${ARROW_LINK}" STREQUAL "a") - message("Using static linking for coverage build") - set(ARROW_LINK "s") - elseif("${ARROW_LINK}" STREQUAL "d") - message(SEND_ERROR "Cannot use coverage with dynamic linking") - endif() -endif() - -# If we still don't know what kind of linking to perform, choose based on -# build type (developers like fast builds). -if ("${ARROW_LINK}" STREQUAL "a") - if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG" OR - "${CMAKE_BUILD_TYPE}" STREQUAL "FASTDEBUG") - message("Using dynamic linking for ${CMAKE_BUILD_TYPE} builds") - set(ARROW_LINK "d") - else() - message("Using static linking for ${CMAKE_BUILD_TYPE} builds") - set(ARROW_LINK "s") + if(NOT ARROW_BUILD_STATIC) + message(SEND_ERROR "Coverage requires the static lib to be built") endif() endif() -# Are we using the gold linker? It doesn't work with dynamic linking as -# weak symbols aren't properly overridden, causing tcmalloc to be omitted. -# Let's flag this as an error in RELEASE builds (we shouldn't release a -# product like this). -# -# See https://sourceware.org/bugzilla/show_bug.cgi?id=16979 for details. -# -# The gold linker is only for ELF binaries, which OSX doesn't use. We can -# just skip. -if (NOT APPLE) - execute_process(COMMAND ${CMAKE_CXX_COMPILER} -Wl,--version OUTPUT_VARIABLE LINKER_OUTPUT) -endif () -if (LINKER_OUTPUT MATCHES "gold") - if ("${ARROW_LINK}" STREQUAL "d" AND - "${CMAKE_BUILD_TYPE}" STREQUAL "RELEASE") - message(SEND_ERROR "Cannot use gold with dynamic linking in a RELEASE build " - "as it would cause tcmalloc symbols to get dropped") - else() - message("Using gold linker") - endif() - set(ARROW_USING_GOLD 1) -else() - message("Using ld linker") -endif() - -# Having set ARROW_LINK due to build type and/or sanitizer, it's now safe to -# act on its value. -if ("${ARROW_LINK}" STREQUAL "d") - set(BUILD_SHARED_LIBS ON) - - # Position independent code is only necessary when producing shared objects. - add_definitions(-fPIC) -endif() - # set compile output directory string (TOLOWER ${CMAKE_BUILD_TYPE} BUILD_SUBDIR_NAME) @@ -290,6 +242,15 @@ set(LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}") set(EXECUTABLE_OUTPUT_PATH "${BUILD_OUTPUT_ROOT_DIRECTORY}") include_directories(src) +############################################################ +# Visibility +############################################################ +# For generate_export_header() and add_compiler_export_flags(). +include(GenerateExportHeader) + +# Sets -fvisibility=hidden for gcc +add_compiler_export_flags() + ############################################################ # Benchmarking ############################################################ @@ -360,7 +321,7 @@ endfunction() # # Arguments after the test name will be passed to set_tests_properties(). 
function(ADD_ARROW_TEST REL_TEST_NAME) - if(NO_TESTS) + if(NO_TESTS OR NOT ARROW_BUILD_STATIC) return() endif() get_filename_component(TEST_NAME ${REL_TEST_NAME} NAME_WE) @@ -377,13 +338,13 @@ function(ADD_ARROW_TEST REL_TEST_NAME) endif() if (ARROW_TEST_MEMCHECK) - SET_PROPERTY(TARGET ${TEST_NAME} - APPEND_STRING PROPERTY - COMPILE_FLAGS " -DARROW_VALGRIND") - add_test(${TEST_NAME} - valgrind --tool=memcheck --leak-check=full --error-exitcode=1 ${TEST_PATH}) + SET_PROPERTY(TARGET ${TEST_NAME} + APPEND_STRING PROPERTY + COMPILE_FLAGS " -DARROW_VALGRIND") + add_test(${TEST_NAME} + valgrind --tool=memcheck --leak-check=full --error-exitcode=1 ${TEST_PATH}) else() - add_test(${TEST_NAME} + add_test(${TEST_NAME} ${BUILD_SUPPORT_DIR}/run-test.sh ${CMAKE_BINARY_DIR} test ${TEST_PATH}) endif() set_tests_properties(${TEST_NAME} PROPERTIES LABELS "unittest") @@ -427,19 +388,34 @@ function(ADD_THIRDPARTY_LIB LIB_NAME) message(SEND_ERROR "Error: unrecognized arguments: ${ARG_UNPARSED_ARGUMENTS}") endif() - if(("${ARROW_LINK}" STREQUAL "s" AND ARG_STATIC_LIB) OR (NOT ARG_SHARED_LIB)) + if(ARG_STATIC_LIB AND ARG_SHARED_LIB) if(NOT ARG_STATIC_LIB) message(FATAL_ERROR "No static or shared library provided for ${LIB_NAME}") endif() + + SET(AUG_LIB_NAME "${LIB_NAME}_static") + add_library(${AUG_LIB_NAME} STATIC IMPORTED) + set_target_properties(${AUG_LIB_NAME} + PROPERTIES IMPORTED_LOCATION "${ARG_STATIC_LIB}") + message("Added static library dependency ${LIB_NAME}: ${ARG_STATIC_LIB}") + + SET(AUG_LIB_NAME "${LIB_NAME}_shared") + add_library(${AUG_LIB_NAME} SHARED IMPORTED) + set_target_properties(${AUG_LIB_NAME} + PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}") + message("Added shared library dependency ${LIB_NAME}: ${ARG_SHARED_LIB}") + elseif(ARG_STATIC_LIB) add_library(${LIB_NAME} STATIC IMPORTED) set_target_properties(${LIB_NAME} PROPERTIES IMPORTED_LOCATION "${ARG_STATIC_LIB}") message("Added static library dependency ${LIB_NAME}: ${ARG_STATIC_LIB}") - else() + elseif(ARG_SHARED_LIB) add_library(${LIB_NAME} SHARED IMPORTED) set_target_properties(${LIB_NAME} PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}") message("Added shared library dependency ${LIB_NAME}: ${ARG_SHARED_LIB}") + else() + message(FATAL_ERROR "No static or shared library provided for ${LIB_NAME}") endif() if(ARG_DEPS) @@ -538,9 +514,17 @@ endif() ############################################################ # Linker setup ############################################################ -set(ARROW_MIN_TEST_LIBS arrow arrow_test_main ${ARROW_BASE_LIBS}) +set(ARROW_MIN_TEST_LIBS + arrow_static + arrow_test_main + ${ARROW_BASE_LIBS}) + set(ARROW_TEST_LINK_LIBS ${ARROW_MIN_TEST_LIBS}) -set(ARROW_BENCHMARK_LINK_LIBS arrow arrow_benchmark_main ${ARROW_BASE_LIBS}) + +set(ARROW_BENCHMARK_LINK_LIBS + arrow_static + arrow_benchmark_main + ${ARROW_BASE_LIBS}) ############################################################ # "make ctags" target @@ -576,14 +560,14 @@ endif (UNIX) if (UNIX) file(GLOB_RECURSE LINT_FILES - "${CMAKE_CURRENT_SOURCE_DIR}/src/*.h" - "${CMAKE_CURRENT_SOURCE_DIR}/src/*.cc" - ) + "${CMAKE_CURRENT_SOURCE_DIR}/src/*.h" + "${CMAKE_CURRENT_SOURCE_DIR}/src/*.cc" + ) FOREACH(item ${LINT_FILES}) - IF(NOT (item MATCHES "_generated.h")) + IF(NOT (item MATCHES "_generated.h")) LIST(APPEND FILTERED_LINT_FILES ${item}) - ENDIF() + ENDIF() ENDFOREACH(item ${LINT_FILES}) # Full lint @@ -628,7 +612,10 @@ endif() # Subdirectories ############################################################ -set(LIBARROW_LINK_LIBS +set(ARROW_LINK_LIBS +) + 
+set(ARROW_PRIVATE_LINK_LIBS ) set(ARROW_SRCS @@ -660,35 +647,67 @@ set(ARROW_SRCS src/arrow/util/status.cc ) -set(LIBARROW_LINKAGE "SHARED") - -add_library(arrow - ${LIBARROW_LINKAGE} +add_library(arrow_objlib OBJECT ${ARROW_SRCS} ) +# Necessary to make static linking into other shared libraries work properly +set_property(TARGET arrow_objlib PROPERTY POSITION_INDEPENDENT_CODE 1) + +if(NOT APPLE) + # Localize thirdparty symbols using a linker version script. This hides them + # from the client application. The OS X linker does not support the + # version-script option. + set(SHARED_LINK_FLAGS "-Wl,--version-script=${CMAKE_CURRENT_SOURCE_DIR}/src/arrow/symbols.map") +endif() + +if (ARROW_BUILD_SHARED) + add_library(arrow_shared SHARED $<TARGET_OBJECTS:arrow_objlib>) + if(APPLE) + set_target_properties(arrow_shared PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") + endif() + set_target_properties(arrow_shared + PROPERTIES + LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}" + LINK_FLAGS "${SHARED_LINK_FLAGS}" + OUTPUT_NAME "arrow") + target_link_libraries(arrow_shared + LINK_PUBLIC ${ARROW_LINK_LIBS} + LINK_PRIVATE ${ARROW_PRIVATE_LINK_LIBS}) + + install(TARGETS arrow_shared + LIBRARY DESTINATION lib + ARCHIVE DESTINATION lib) +endif() + +if (ARROW_BUILD_STATIC) + add_library(arrow_static STATIC $<TARGET_OBJECTS:arrow_objlib>) + set_target_properties(arrow_static + PROPERTIES + LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}" + OUTPUT_NAME "arrow") + + target_link_libraries(arrow_static + LINK_PUBLIC ${ARROW_LINK_LIBS} + LINK_PRIVATE ${ARROW_PRIVATE_LINK_LIBS}) + + install(TARGETS arrow_static + LIBRARY DESTINATION lib + ARCHIVE DESTINATION lib) +endif() + if (APPLE) - set_target_properties(arrow + set_target_properties(arrow_shared PROPERTIES BUILD_WITH_INSTALL_RPATH ON INSTALL_NAME_DIR "@rpath") endif() -set_target_properties(arrow - PROPERTIES - LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}" -) -target_link_libraries(arrow ${LIBARROW_LINK_LIBS}) - add_subdirectory(src/arrow) add_subdirectory(src/arrow/io) add_subdirectory(src/arrow/util) add_subdirectory(src/arrow/types) -install(TARGETS arrow - LIBRARY DESTINATION lib - ARCHIVE DESTINATION lib) - #---------------------------------------------------------------------- # Parquet adapter library @@ -715,7 +734,7 @@ if(ARROW_IPC) include_directories(SYSTEM ${FLATBUFFERS_INCLUDE_DIR}) add_library(flatbuffers STATIC IMPORTED) set_target_properties(flatbuffers PROPERTIES - IMPORTED_LOCATION ${FLATBUFFERS_STATIC_LIB}) + IMPORTED_LOCATION ${FLATBUFFERS_STATIC_LIB}) add_subdirectory(src/arrow/ipc) endif() diff --git a/cpp/build-support/run-test.sh b/cpp/build-support/run-test.sh index 0e628e26ecd52..f563da53679be 100755 --- a/cpp/build-support/run-test.sh +++ b/cpp/build-support/run-test.sh @@ -79,16 +79,16 @@ function setup_sanitizers() { TSAN_OPTIONS="$TSAN_OPTIONS suppressions=$ROOT/build-support/tsan-suppressions.txt" TSAN_OPTIONS="$TSAN_OPTIONS history_size=7" export TSAN_OPTIONS - + # Enable leak detection even under LLVM 3.4, where it was disabled by default. # This flag only takes effect when running an ASAN build. ASAN_OPTIONS="$ASAN_OPTIONS detect_leaks=1" export ASAN_OPTIONS - + # Set up suppressions for LeakSanitizer LSAN_OPTIONS="$LSAN_OPTIONS suppressions=$ROOT/build-support/lsan-suppressions.txt" export LSAN_OPTIONS - + # Suppressions require symbolization. We'll default to using the symbolizer in # thirdparty.
if [ -z "$ASAN_SYMBOLIZER_PATH" ]; then @@ -107,7 +107,7 @@ function run_test() { | $ROOT/build-support/asan_symbolize.py \ | c++filt \ | $ROOT/build-support/stacktrace_addr2line.pl $TEST_EXECUTABLE \ - | $pipe_cmd > $LOGFILE + | $pipe_cmd 2>&1 | tee $LOGFILE STATUS=$? # TSAN doesn't always exit with a non-zero exit code due to a bug: @@ -198,7 +198,7 @@ for ATTEMPT_NUMBER in $(seq 1 $TEST_EXECUTION_ATTEMPTS) ; do fi done -if [ $RUN_TYPE = "test" ]; then +if [ $RUN_TYPE = "test" ]; then post_process_tests fi diff --git a/cpp/conda.recipe/build.sh b/cpp/conda.recipe/build.sh index 7e60ccc911faa..2f2b748266747 100644 --- a/cpp/conda.recipe/build.sh +++ b/cpp/conda.recipe/build.sh @@ -39,16 +39,17 @@ pwd source thirdparty/versions.sh export GTEST_HOME=`pwd`/thirdparty/$GTEST_BASEDIR -if [ `uname` == Linux ]; then - SHARED_LINKER_FLAGS='-static-libstdc++' -elif [ `uname` == Darwin ]; then - SHARED_LINKER_FLAGS='' -fi +# if [ `uname` == Linux ]; then +# SHARED_LINKER_FLAGS='-static-libstdc++' +# elif [ `uname` == Darwin ]; then +# SHARED_LINKER_FLAGS='' +# fi + +# -DCMAKE_SHARED_LINKER_FLAGS=$SHARED_LINKER_FLAGS \ cmake \ -DCMAKE_BUILD_TYPE=release \ -DCMAKE_INSTALL_PREFIX=$PREFIX \ - -DCMAKE_SHARED_LINKER_FLAGS=$SHARED_LINKER_FLAGS \ -DARROW_HDFS=on \ -DARROW_IPC=on \ -DARROW_PARQUET=on \ diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 76dc0f598141f..c7ffb23ca18a1 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -24,6 +24,7 @@ #include "arrow/type.h" #include "arrow/util/bit-util.h" #include "arrow/util/macros.h" +#include "arrow/util/visibility.h" namespace arrow { @@ -35,7 +36,7 @@ class Status; // // The base class is only required to have a null bitmap buffer if the null // count is greater than 0 -class Array { +class ARROW_EXPORT Array { public: Array(const std::shared_ptr& type, int32_t length, int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); @@ -83,7 +84,7 @@ class Array { }; // Degenerate null type Array -class NullArray : public Array { +class ARROW_EXPORT NullArray : public Array { public: NullArray(const std::shared_ptr& type, int32_t length) : Array(type, length, length, nullptr) {} diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index 7d3f4398d73e3..5d9fb992ff0b5 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -25,6 +25,7 @@ #include "arrow/type.h" #include "arrow/util/macros.h" #include "arrow/util/status.h" +#include "arrow/util/visibility.h" namespace arrow { @@ -38,7 +39,7 @@ static constexpr int32_t MIN_BUILDER_CAPACITY = 1 << 5; // This class provides a facilities for incrementally building the null bitmap // (see Append methods) and as a side effect the current number of slots and // the null count. 
diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index 7d3f4398d73e3..5d9fb992ff0b5 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -25,6 +25,7 @@ #include "arrow/type.h" #include "arrow/util/macros.h" #include "arrow/util/status.h" +#include "arrow/util/visibility.h" namespace arrow { @@ -38,7 +39,7 @@ static constexpr int32_t MIN_BUILDER_CAPACITY = 1 << 5; // This class provides a facilities for incrementally building the null bitmap // (see Append methods) and as a side effect the current number of slots and // the null count. -class ArrayBuilder { +class ARROW_EXPORT ArrayBuilder { public: explicit ArrayBuilder(MemoryPool* pool, const TypePtr& type) : pool_(pool), diff --git a/cpp/src/arrow/column.h b/cpp/src/arrow/column.h index e409566e1f139..d5168cb032ba5 100644 --- a/cpp/src/arrow/column.h +++ b/cpp/src/arrow/column.h @@ -24,6 +24,7 @@ #include "arrow/type.h" +#include "arrow/util/visibility.h" namespace arrow { @@ -34,7 +35,7 @@ typedef std::vector<std::shared_ptr<Array>> ArrayVector; // A data structure managing a list of primitive Arrow arrays logically as one // large array -class ChunkedArray { +class ARROW_EXPORT ChunkedArray { public: explicit ChunkedArray(const ArrayVector& chunks); @@ -56,7 +57,7 @@ class ChunkedArray { // An immutable column data structure consisting of a field (type metadata) and // a logical chunked data array (which can be validated as all being the same // type). -class Column { +class ARROW_EXPORT Column { public: Column(const std::shared_ptr<Field>& field, const ArrayVector& chunks); Column(const std::shared_ptr<Field>& field, const std::shared_ptr<Array>& data); diff --git a/cpp/src/arrow/io/CMakeLists.txt b/cpp/src/arrow/io/CMakeLists.txt index 33b654f81903f..b8c0e138afb06 100644 --- a/cpp/src/arrow/io/CMakeLists.txt +++ b/cpp/src/arrow/io/CMakeLists.txt @@ -19,13 +19,18 @@ # arrow_io : Arrow IO interfaces set(ARROW_IO_LINK_LIBS - arrow + arrow_shared ) -set(ARROW_IO_PRIVATE_LINK_LIBS - boost_system - boost_filesystem -) +if (ARROW_BOOST_USE_SHARED) + set(ARROW_IO_PRIVATE_LINK_LIBS + boost_system_shared + boost_filesystem_shared) +else() + set(ARROW_IO_PRIVATE_LINK_LIBS + boost_system_static + boost_filesystem_static) +endif() set(ARROW_IO_TEST_LINK_LIBS arrow_io ${ARROW_IO_PRIVATE_LINK_LIBS}) set(ARROW_IO_SRCS
+ set(ARROW_IO_LINK_FLAGS "-Wl,--version-script=${CMAKE_CURRENT_SOURCE_DIR}/symbols.map") +endif() -SET_TARGET_PROPERTIES(arrow_io PROPERTIES LINKER_LANGUAGE CXX) +SET_TARGET_PROPERTIES(arrow_io PROPERTIES + LINKER_LANGUAGE CXX + LINK_FLAGS "${ARROW_IO_LINK_FLAGS}") if (APPLE) set_target_properties(arrow_io diff --git a/cpp/src/arrow/io/hdfs-io-test.cc b/cpp/src/arrow/io/hdfs-io-test.cc index 11d67aeba2026..d1bf140ae68e2 100644 --- a/cpp/src/arrow/io/hdfs-io-test.cc +++ b/cpp/src/arrow/io/hdfs-io-test.cc @@ -227,7 +227,7 @@ TEST_F(TestHdfsClient, ListDirectory) { // Do it again, appends! ASSERT_OK(client_->ListDirectory(scratch_dir_, &listing)); - ASSERT_EQ(6, listing.size()); + ASSERT_EQ(6, static_cast(listing.size())); // Argh, well, shouldn't expect the listing to be in any particular order for (size_t i = 0; i < listing.size(); ++i) { diff --git a/cpp/src/arrow/io/hdfs.h b/cpp/src/arrow/io/hdfs.h index a1972db96157a..532e3c536a188 100644 --- a/cpp/src/arrow/io/hdfs.h +++ b/cpp/src/arrow/io/hdfs.h @@ -25,6 +25,7 @@ #include "arrow/io/interfaces.h" #include "arrow/util/macros.h" +#include "arrow/util/visibility.h" namespace arrow { @@ -32,8 +33,6 @@ class Status; namespace io { -Status ConnectLibHdfs(); - class HdfsClient; class HdfsReadableFile; class HdfsWriteableFile; @@ -64,7 +63,7 @@ struct HdfsConnectionConfig { // TODO: Kerberos, etc. }; -class HdfsClient : public FileSystemClient { +class ARROW_EXPORT HdfsClient : public FileSystemClient { public: ~HdfsClient(); @@ -149,14 +148,14 @@ class HdfsClient : public FileSystemClient { friend class HdfsReadableFile; friend class HdfsWriteableFile; - class HdfsClientImpl; + class ARROW_NO_EXPORT HdfsClientImpl; std::unique_ptr impl_; HdfsClient(); DISALLOW_COPY_AND_ASSIGN(HdfsClient); }; -class HdfsReadableFile : public RandomAccessFile { +class ARROW_EXPORT HdfsReadableFile : public RandomAccessFile { public: ~HdfsReadableFile(); @@ -175,7 +174,7 @@ class HdfsReadableFile : public RandomAccessFile { Status Read(int32_t nbytes, int32_t* bytes_read, uint8_t* buffer) override; private: - class HdfsReadableFileImpl; + class ARROW_NO_EXPORT HdfsReadableFileImpl; std::unique_ptr impl_; friend class HdfsClient::HdfsClientImpl; @@ -184,7 +183,7 @@ class HdfsReadableFile : public RandomAccessFile { DISALLOW_COPY_AND_ASSIGN(HdfsReadableFile); }; -class HdfsWriteableFile : public WriteableFile { +class ARROW_EXPORT HdfsWriteableFile : public WriteableFile { public: ~HdfsWriteableFile(); @@ -197,7 +196,7 @@ class HdfsWriteableFile : public WriteableFile { Status Tell(int64_t* position) override; private: - class HdfsWriteableFileImpl; + class ARROW_NO_EXPORT HdfsWriteableFileImpl; std::unique_ptr impl_; friend class HdfsClient::HdfsClientImpl; @@ -207,6 +206,8 @@ class HdfsWriteableFile : public WriteableFile { DISALLOW_COPY_AND_ASSIGN(HdfsWriteableFile); }; +Status ARROW_EXPORT ConnectLibHdfs(); + } // namespace io } // namespace arrow diff --git a/cpp/src/arrow/io/libhdfs_shim.cc b/cpp/src/arrow/io/libhdfs_shim.cc index f75266536e5b3..003570d4fdee6 100644 --- a/cpp/src/arrow/io/libhdfs_shim.cc +++ b/cpp/src/arrow/io/libhdfs_shim.cc @@ -55,6 +55,7 @@ extern "C" { #include // NOLINT #include "arrow/util/status.h" +#include "arrow/util/visibility.h" namespace fs = boost::filesystem; @@ -496,7 +497,7 @@ static arrow::Status try_dlopen( namespace arrow { namespace io { -Status ConnectLibHdfs() { +Status ARROW_EXPORT ConnectLibHdfs() { static std::mutex lock; std::lock_guard guard(lock); diff --git a/cpp/src/arrow/io/symbols.map 
b/cpp/src/arrow/io/symbols.map new file mode 100644 index 0000000000000..b4ad98cd7f2d0 --- /dev/null +++ b/cpp/src/arrow/io/symbols.map @@ -0,0 +1,18 @@ +{ + # Symbols marked as 'local' are not exported by the DSO and thus may not + # be used by client applications. + local: + # devtoolset / static-libstdc++ symbols + __cxa_*; + + extern "C++" { + # boost + boost::*; + + # devtoolset or -static-libstdc++ - the Red Hat devtoolset statically + # links c++11 symbols into binaries so that the result may be executed on + # a system with an older libstdc++ which doesn't include the necessary + # c++11 symbols. + std::*; + }; +}; diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index 383684f42f952..82634169ed925 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -48,4 +48,4 @@ add_custom_command( ) add_custom_target(metadata_fbs DEPENDS ${FBS_OUTPUT_FILES}) -add_dependencies(arrow metadata_fbs) +add_dependencies(arrow_objlib metadata_fbs) diff --git a/cpp/src/arrow/parquet/CMakeLists.txt b/cpp/src/arrow/parquet/CMakeLists.txt index f00bb53c0848f..00f19b354e379 100644 --- a/cpp/src/arrow/parquet/CMakeLists.txt +++ b/cpp/src/arrow/parquet/CMakeLists.txt @@ -25,8 +25,8 @@ set(PARQUET_SRCS ) set(PARQUET_LIBS - arrow - ${PARQUET_SHARED_LIB} + arrow_shared + parquet_shared ) add_library(arrow_parquet SHARED diff --git a/cpp/src/arrow/parquet/parquet-io-test.cc b/cpp/src/arrow/parquet/parquet-io-test.cc index 572cae16e58c0..bfc27d26d63a1 100644 --- a/cpp/src/arrow/parquet/parquet-io-test.cc +++ b/cpp/src/arrow/parquet/parquet-io-test.cc @@ -411,7 +411,7 @@ class TestPrimitiveParquetIO : public TestParquetIO { public: typedef typename TestType::c_type T; - void TestFile(std::vector& values, int num_chunks, + void MakeTestFile(std::vector& values, int num_chunks, std::unique_ptr* file_reader) { std::shared_ptr schema = this->MakeSchema(Repetition::REQUIRED); std::unique_ptr file_writer = this->MakeWriter(schema); @@ -435,10 +435,10 @@ class TestPrimitiveParquetIO : public TestParquetIO { *file_reader = this->ReaderFromSink(); } - void TestSingleColumnRequiredTableRead(int num_chunks) { + void CheckSingleColumnRequiredTableRead(int num_chunks) { std::vector values(SMALL_SIZE, test_traits::value); std::unique_ptr file_reader; - ASSERT_NO_THROW(TestFile(values, num_chunks, &file_reader)); + ASSERT_NO_THROW(MakeTestFile(values, num_chunks, &file_reader)); std::shared_ptr
out; this->ReadTableFromFile(std::move(file_reader), &out); @@ -450,10 +450,10 @@ class TestPrimitiveParquetIO : public TestParquetIO { ExpectArray(values.data(), chunked_array->chunk(0).get()); } - void TestSingleColumnRequiredRead(int num_chunks) { + void CheckSingleColumnRequiredRead(int num_chunks) { std::vector values(SMALL_SIZE, test_traits::value); std::unique_ptr file_reader; - ASSERT_NO_THROW(TestFile(values, num_chunks, &file_reader)); + ASSERT_NO_THROW(MakeTestFile(values, num_chunks, &file_reader)); std::shared_ptr out; this->ReadSingleColumnFile(std::move(file_reader), &out); @@ -469,19 +469,19 @@ typedef ::testing::TypesTestSingleColumnRequiredRead(1); + this->CheckSingleColumnRequiredRead(1); } TYPED_TEST(TestPrimitiveParquetIO, SingleColumnRequiredTableRead) { - this->TestSingleColumnRequiredTableRead(1); + this->CheckSingleColumnRequiredTableRead(1); } TYPED_TEST(TestPrimitiveParquetIO, SingleColumnRequiredChunkedRead) { - this->TestSingleColumnRequiredRead(4); + this->CheckSingleColumnRequiredRead(4); } TYPED_TEST(TestPrimitiveParquetIO, SingleColumnRequiredChunkedTableRead) { - this->TestSingleColumnRequiredTableRead(4); + this->CheckSingleColumnRequiredTableRead(4); } } // namespace parquet diff --git a/cpp/src/arrow/parquet/reader.cc b/cpp/src/arrow/parquet/reader.cc index 7b05665b230f0..c7c400e957343 100644 --- a/cpp/src/arrow/parquet/reader.cc +++ b/cpp/src/arrow/parquet/reader.cc @@ -213,7 +213,7 @@ Status FlatColumnReader::Impl::ReadNonNullableBatch(typename ParquetType::c_type using ParquetCType = typename ParquetType::c_type; DCHECK(builder); - const ArrowCType* values_ptr; + const ArrowCType* values_ptr = nullptr; RETURN_NOT_OK( (ConvertPhysicalType(values, values_read, &values_ptr))); RETURN_NOT_OK(builder->Append(values_ptr, values_read)); diff --git a/cpp/src/arrow/parquet/reader.h b/cpp/src/arrow/parquet/reader.h index db7a15753d8e8..2c8a9dfd025f0 100644 --- a/cpp/src/arrow/parquet/reader.h +++ b/cpp/src/arrow/parquet/reader.h @@ -23,6 +23,8 @@ #include "parquet/api/reader.h" #include "parquet/api/schema.h" +#include "arrow/util/visibility.h" + namespace arrow { class Array; @@ -77,7 +79,7 @@ class FlatColumnReader; // // This is additionally complicated "chunky" repeated fields or very large byte // arrays -class FileReader { +class ARROW_EXPORT FileReader { public: FileReader(MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileReader> reader); @@ -107,7 +109,7 @@ class FileReader { // // We also do not expose any internal Parquet details, such as row groups. This // might change in the future. 
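Pieced together from the tests in this patch series, typical use of the reader API exported here looks roughly like the sketch below; parquet_reader stands in for a std::unique_ptr<::parquet::ParquetFileReader> opened elsewhere:

    arrow::parquet::FileReader reader(default_memory_pool(), std::move(parquet_reader));

    // Column-at-a-time access through the FlatColumnReader declared next:
    std::unique_ptr<arrow::parquet::FlatColumnReader> column_reader;
    RETURN_NOT_OK(reader.GetFlatColumn(0, &column_reader));
    std::shared_ptr<Array> batch;
    RETURN_NOT_OK(column_reader->NextBatch(100, &batch));  // up to 100 values

    // Or materialize the whole file at once:
    std::shared_ptr<Table> table;
    RETURN_NOT_OK(reader.ReadFlatTable(&table));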
-class FlatColumnReader { +class ARROW_EXPORT FlatColumnReader { public: virtual ~FlatColumnReader(); diff --git a/cpp/src/arrow/parquet/schema.h b/cpp/src/arrow/parquet/schema.h index 39bee059522a3..88b5977d223a4 100644 --- a/cpp/src/arrow/parquet/schema.h +++ b/cpp/src/arrow/parquet/schema.h @@ -25,6 +25,7 @@ #include "arrow/schema.h" #include "arrow/type.h" +#include "arrow/util/visibility.h" namespace arrow { @@ -32,15 +33,16 @@ class Status; namespace parquet { -Status NodeToField(const ::parquet::schema::NodePtr& node, std::shared_ptr* out); +Status ARROW_EXPORT NodeToField( + const ::parquet::schema::NodePtr& node, std::shared_ptr* out); -Status FromParquetSchema( +Status ARROW_EXPORT FromParquetSchema( const ::parquet::SchemaDescriptor* parquet_schema, std::shared_ptr* out); -Status FieldToNode(const std::shared_ptr& field, +Status ARROW_EXPORT FieldToNode(const std::shared_ptr& field, const ::parquet::WriterProperties& properties, ::parquet::schema::NodePtr* out); -Status ToParquetSchema(const Schema* arrow_schema, +Status ARROW_EXPORT ToParquetSchema(const Schema* arrow_schema, const ::parquet::WriterProperties& properties, std::shared_ptr<::parquet::SchemaDescriptor>* out); diff --git a/cpp/src/arrow/parquet/writer.cc b/cpp/src/arrow/parquet/writer.cc index 63449bb20b1a1..0139edd3bb8d9 100644 --- a/cpp/src/arrow/parquet/writer.cc +++ b/cpp/src/arrow/parquet/writer.cc @@ -118,7 +118,7 @@ Status FileWriter::Impl::TypedWriteBatch(::parquet::ColumnWriter* column_writer, reinterpret_cast<::parquet::TypedColumnWriter*>(column_writer); if (writer->descr()->max_definition_level() == 0) { // no nulls, just dump the data - const ParquetCType* data_writer_ptr; + const ParquetCType* data_writer_ptr = nullptr; RETURN_NOT_OK((ConvertPhysicalType( data_ptr, length, &data_writer_ptr))); PARQUET_CATCH_NOT_OK(writer->WriteBatch(length, nullptr, nullptr, data_writer_ptr)); @@ -128,7 +128,7 @@ Status FileWriter::Impl::TypedWriteBatch(::parquet::ColumnWriter* column_writer, reinterpret_cast(def_levels_buffer_.mutable_data()); if (data->null_count() == 0) { std::fill(def_levels_ptr, def_levels_ptr + length, 1); - const ParquetCType* data_writer_ptr; + const ParquetCType* data_writer_ptr = nullptr; RETURN_NOT_OK((ConvertPhysicalType( data_ptr, length, &data_writer_ptr))); PARQUET_CATCH_NOT_OK( diff --git a/cpp/src/arrow/parquet/writer.h b/cpp/src/arrow/parquet/writer.h index cfd80d80b7997..45d0fd59868e5 100644 --- a/cpp/src/arrow/parquet/writer.h +++ b/cpp/src/arrow/parquet/writer.h @@ -23,6 +23,8 @@ #include "parquet/api/schema.h" #include "parquet/api/writer.h" +#include "arrow/util/visibility.h" + namespace arrow { class Array; @@ -40,7 +42,7 @@ namespace parquet { * Start a new RowGroup/Chunk with NewRowGroup * Write column-by-column the whole column chunk */ -class FileWriter { +class ARROW_EXPORT FileWriter { public: FileWriter(MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileWriter> writer); @@ -62,7 +64,7 @@ class FileWriter { * * The table shall only consist of nullable, non-repeated columns of primitive type. 
*/ -Status WriteFlatTable(const Table* table, MemoryPool* pool, +Status ARROW_EXPORT WriteFlatTable(const Table* table, MemoryPool* pool, const std::shared_ptr<::parquet::OutputStream>& sink, int64_t chunk_size, const std::shared_ptr<::parquet::WriterProperties>& properties = ::parquet::default_writer_properties()); diff --git a/cpp/src/arrow/schema.h b/cpp/src/arrow/schema.h index a8b0d8444ac92..4301968e01578 100644 --- a/cpp/src/arrow/schema.h +++ b/cpp/src/arrow/schema.h @@ -22,11 +22,13 @@ #include #include +#include "arrow/util/visibility.h" + namespace arrow { struct Field; -class Schema { +class ARROW_EXPORT Schema { public: explicit Schema(const std::vector>& fields); diff --git a/cpp/src/arrow/symbols.map b/cpp/src/arrow/symbols.map new file mode 100644 index 0000000000000..2ca8d7306105f --- /dev/null +++ b/cpp/src/arrow/symbols.map @@ -0,0 +1,15 @@ +{ + # Symbols marked as 'local' are not exported by the DSO and thus may not + # be used by client applications. + local: + # devtoolset / static-libstdc++ symbols + __cxa_*; + + extern "C++" { + # devtoolset or -static-libstdc++ - the Red Hat devtoolset statically + # links c++11 symbols into binaries so that the result may be executed on + # a system with an older libstdc++ which doesn't include the necessary + # c++11 symbols. + std::*; + }; +}; diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h index 756b2a19593f4..2088fdf0b6415 100644 --- a/cpp/src/arrow/table.h +++ b/cpp/src/arrow/table.h @@ -23,6 +23,8 @@ #include #include +#include "arrow/util/visibility.h" + namespace arrow { class Array; @@ -33,7 +35,7 @@ class Status; // A row batch is a simpler and more rigid table data structure intended for // use primarily in shared memory IPC. It contains a schema (metadata) and a // corresponding vector of equal-length Arrow arrays -class RowBatch { +class ARROW_EXPORT RowBatch { public: // num_rows is a parameter to allow for row batches of a particular size not // having any materialized columns. 
Each array should have the same length as
@@ -63,7 +65,7 @@ class RowBatch {
 };
 
 // Immutable container of fixed-length columns conforming to a particular schema
-class Table {
+class ARROW_EXPORT Table {
  public:
   // If columns is zero-length, the table's number of rows is zero
   Table(const std::string& name, const std::shared_ptr<Schema>& schema,
diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h
index 8fb41211ba945..4cb37fd1dead8 100644
--- a/cpp/src/arrow/type.h
+++ b/cpp/src/arrow/type.h
@@ -24,6 +24,7 @@
 #include
 
 #include "arrow/util/macros.h"
+#include "arrow/util/visibility.h"
 
 namespace arrow {
 
@@ -101,7 +102,7 @@ struct Type {
 
 struct Field;
 
-struct DataType {
+struct ARROW_EXPORT DataType {
   Type::type type;
 
   std::vector<std::shared_ptr<Field>> children_;
@@ -133,7 +134,7 @@ typedef std::shared_ptr<DataType> TypePtr;
 
 // A field is a piece of metadata that includes (for now) a name and a data
 // type
-struct Field {
+struct ARROW_EXPORT Field {
   // Field name
   std::string name;
@@ -163,7 +164,7 @@ typedef std::shared_ptr<Field> FieldPtr;
 
 template <typename Derived>
-struct PrimitiveType : public DataType {
+struct ARROW_EXPORT PrimitiveType : public DataType {
   PrimitiveType() : DataType(Derived::type_enum) {}
 
   std::string ToString() const override;
@@ -185,55 +186,55 @@ inline std::string PrimitiveType<Derived>::ToString() const {
 \
   static const char* name() { return NAME; }
 
-struct NullType : public PrimitiveType<NullType> {
+struct ARROW_EXPORT NullType : public PrimitiveType<NullType> {
   PRIMITIVE_DECL(NullType, void, NA, 0, "null");
 };
 
-struct BooleanType : public PrimitiveType<BooleanType> {
+struct ARROW_EXPORT BooleanType : public PrimitiveType<BooleanType> {
   PRIMITIVE_DECL(BooleanType, uint8_t, BOOL, 1, "bool");
 };
 
-struct UInt8Type : public PrimitiveType<UInt8Type> {
+struct ARROW_EXPORT UInt8Type : public PrimitiveType<UInt8Type> {
   PRIMITIVE_DECL(UInt8Type, uint8_t, UINT8, 1, "uint8");
 };
 
-struct Int8Type : public PrimitiveType<Int8Type> {
+struct ARROW_EXPORT Int8Type : public PrimitiveType<Int8Type> {
   PRIMITIVE_DECL(Int8Type, int8_t, INT8, 1, "int8");
 };
 
-struct UInt16Type : public PrimitiveType<UInt16Type> {
+struct ARROW_EXPORT UInt16Type : public PrimitiveType<UInt16Type> {
   PRIMITIVE_DECL(UInt16Type, uint16_t, UINT16, 2, "uint16");
 };
 
-struct Int16Type : public PrimitiveType<Int16Type> {
+struct ARROW_EXPORT Int16Type : public PrimitiveType<Int16Type> {
   PRIMITIVE_DECL(Int16Type, int16_t, INT16, 2, "int16");
 };
 
-struct UInt32Type : public PrimitiveType<UInt32Type> {
+struct ARROW_EXPORT UInt32Type : public PrimitiveType<UInt32Type> {
   PRIMITIVE_DECL(UInt32Type, uint32_t, UINT32, 4, "uint32");
 };
 
-struct Int32Type : public PrimitiveType<Int32Type> {
+struct ARROW_EXPORT Int32Type : public PrimitiveType<Int32Type> {
   PRIMITIVE_DECL(Int32Type, int32_t, INT32, 4, "int32");
 };
 
-struct UInt64Type : public PrimitiveType<UInt64Type> {
+struct ARROW_EXPORT UInt64Type : public PrimitiveType<UInt64Type> {
   PRIMITIVE_DECL(UInt64Type, uint64_t, UINT64, 8, "uint64");
 };
 
-struct Int64Type : public PrimitiveType<Int64Type> {
+struct ARROW_EXPORT Int64Type : public PrimitiveType<Int64Type> {
   PRIMITIVE_DECL(Int64Type, int64_t, INT64, 8, "int64");
 };
 
-struct FloatType : public PrimitiveType<FloatType> {
+struct ARROW_EXPORT FloatType : public PrimitiveType<FloatType> {
   PRIMITIVE_DECL(FloatType, float, FLOAT, 4, "float");
 };
 
-struct DoubleType : public PrimitiveType<DoubleType> {
+struct ARROW_EXPORT DoubleType : public PrimitiveType<DoubleType> {
   PRIMITIVE_DECL(DoubleType, double, DOUBLE, 8, "double");
 };
 
-struct ListType : public DataType {
+struct ARROW_EXPORT ListType : public DataType {
   // List can contain any other logical value type
   explicit ListType(const std::shared_ptr<DataType>& value_type)
       : ListType(value_type, Type::LIST) {}
@@ -260,7 +261,7 @@ struct ListType : public DataType {
 };
 
 // BinaryType represents
lists of 1-byte values. -struct BinaryType : public ListType { +struct ARROW_EXPORT BinaryType : public ListType { BinaryType() : BinaryType(Type::BINARY) {} static char const* name() { return "binary"; } std::string ToString() const override; @@ -272,7 +273,7 @@ struct BinaryType : public ListType { }; // UTF encoded strings -struct StringType : public BinaryType { +struct ARROW_EXPORT StringType : public BinaryType { StringType() : BinaryType(Type::STRING) {} static char const* name() { return "string"; } @@ -283,7 +284,7 @@ struct StringType : public BinaryType { explicit StringType(Type::type logical_type) : BinaryType(logical_type) {} }; -struct StructType : public DataType { +struct ARROW_EXPORT StructType : public DataType { explicit StructType(const std::vector>& fields) : DataType(Type::STRUCT) { children_ = fields; diff --git a/cpp/src/arrow/types/construct.h b/cpp/src/arrow/types/construct.h index d0370840ca108..afdadbe079013 100644 --- a/cpp/src/arrow/types/construct.h +++ b/cpp/src/arrow/types/construct.h @@ -21,6 +21,9 @@ #include #include #include + +#include "arrow/util/visibility.h" + namespace arrow { class Array; @@ -31,18 +34,18 @@ struct Field; class MemoryPool; class Status; -Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, +Status ARROW_EXPORT MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, std::shared_ptr* out); // Create new arrays for logical types that are backed by primitive arrays. -Status MakePrimitiveArray(const std::shared_ptr& type, int32_t length, - const std::shared_ptr& data, int32_t null_count, +Status ARROW_EXPORT MakePrimitiveArray(const std::shared_ptr& type, + int32_t length, const std::shared_ptr& data, int32_t null_count, const std::shared_ptr& null_bitmap, std::shared_ptr* out); // Create new list arrays for logical types that are backed by ListArrays (e.g. list of // primitives and strings) // TODO(emkornfield) split up string vs list? -Status MakeListArray(const std::shared_ptr& type, int32_t length, +Status ARROW_EXPORT MakeListArray(const std::shared_ptr& type, int32_t length, const std::shared_ptr& offests, const std::shared_ptr& values, int32_t null_count, const std::shared_ptr& null_bitmap, std::shared_ptr* out); diff --git a/cpp/src/arrow/types/decimal.h b/cpp/src/arrow/types/decimal.h index 598df3ef70d2d..6c497c597d987 100644 --- a/cpp/src/arrow/types/decimal.h +++ b/cpp/src/arrow/types/decimal.h @@ -21,10 +21,11 @@ #include #include "arrow/type.h" +#include "arrow/util/visibility.h" namespace arrow { -struct DecimalType : public DataType { +struct ARROW_EXPORT DecimalType : public DataType { explicit DecimalType(int precision_, int scale_) : DataType(Type::DECIMAL), precision(precision_), scale(scale_) {} int precision; diff --git a/cpp/src/arrow/types/list.h b/cpp/src/arrow/types/list.h index 2f6f85d66ca60..f3894510d091a 100644 --- a/cpp/src/arrow/types/list.h +++ b/cpp/src/arrow/types/list.h @@ -31,12 +31,13 @@ #include "arrow/util/buffer.h" #include "arrow/util/logging.h" #include "arrow/util/status.h" +#include "arrow/util/visibility.h" namespace arrow { class MemoryPool; -class ListArray : public Array { +class ARROW_EXPORT ListArray : public Array { public: ListArray(const TypePtr& type, int32_t length, std::shared_ptr offsets, const ArrayPtr& values, int32_t null_count = 0, @@ -96,7 +97,7 @@ class ListArray : public Array { // represent multiple different logical types. 
If no logical type is provided // at construction time, the class defaults to List where t is taken from the // value_builder/values that the object is constructed with. -class ListBuilder : public ArrayBuilder { +class ARROW_EXPORT ListBuilder : public ArrayBuilder { public: // Use this constructor to incrementally build the value array along with offsets and // null bitmap. @@ -116,6 +117,8 @@ class ListBuilder : public ArrayBuilder { offset_builder_(pool), values_(values) {} + virtual ~ListBuilder() {} + Status Init(int32_t elements) override { DCHECK_LT(elements, std::numeric_limits::max()); RETURN_NOT_OK(ArrayBuilder::Init(elements)); diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index f1ec417d51014..18f954adc0894 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -29,6 +29,7 @@ #include "arrow/util/bit-util.h" #include "arrow/util/buffer.h" #include "arrow/util/status.h" +#include "arrow/util/visibility.h" namespace arrow { @@ -36,7 +37,7 @@ class MemoryPool; // Base class for fixed-size logical types. See MakePrimitiveArray // (types/construct.h) for constructing a specific subclass. -class PrimitiveArray : public Array { +class ARROW_EXPORT PrimitiveArray : public Array { public: virtual ~PrimitiveArray() {} @@ -53,7 +54,7 @@ class PrimitiveArray : public Array { }; #define NUMERIC_ARRAY_DECL(NAME, TypeClass, T) \ - class NAME : public PrimitiveArray { \ + class ARROW_EXPORT NAME : public PrimitiveArray { \ public: \ using value_type = T; \ \ @@ -102,7 +103,7 @@ NUMERIC_ARRAY_DECL(FloatArray, FloatType, float); NUMERIC_ARRAY_DECL(DoubleArray, DoubleType, double); template -class PrimitiveBuilder : public ArrayBuilder { +class ARROW_EXPORT PrimitiveBuilder : public ArrayBuilder { public: typedef typename Type::c_type value_type; @@ -149,7 +150,7 @@ class PrimitiveBuilder : public ArrayBuilder { }; template -class NumericBuilder : public PrimitiveBuilder { +class ARROW_EXPORT NumericBuilder : public PrimitiveBuilder { public: using typename PrimitiveBuilder::value_type; using PrimitiveBuilder::PrimitiveBuilder; @@ -262,7 +263,7 @@ typedef NumericBuilder Int64Builder; typedef NumericBuilder FloatBuilder; typedef NumericBuilder DoubleBuilder; -class BooleanArray : public PrimitiveArray { +class ARROW_EXPORT BooleanArray : public PrimitiveArray { public: BooleanArray(int32_t length, const std::shared_ptr& data, int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); @@ -288,7 +289,7 @@ struct type_traits { } }; -class BooleanBuilder : public PrimitiveBuilder { +class ARROW_EXPORT BooleanBuilder : public PrimitiveBuilder { public: explicit BooleanBuilder(MemoryPool* pool, const TypePtr& type) : PrimitiveBuilder(pool, type) {} diff --git a/cpp/src/arrow/types/string-test.cc b/cpp/src/arrow/types/string-test.cc index a141fc113211a..6807b00e8ca99 100644 --- a/cpp/src/arrow/types/string-test.cc +++ b/cpp/src/arrow/types/string-test.cc @@ -115,7 +115,7 @@ TEST_F(TestStringContainer, TestListFunctions) { int pos = 0; for (size_t i = 0; i < expected_.size(); ++i) { ASSERT_EQ(pos, strings_->value_offset(i)); - ASSERT_EQ(expected_[i].size(), strings_->value_length(i)); + ASSERT_EQ(static_cast(expected_[i].size()), strings_->value_length(i)); pos += expected_[i].size(); } } @@ -189,7 +189,7 @@ TEST_F(TestStringBuilder, TestScalarAppend) { ASSERT_FALSE(result_->IsNull(i)); result_->GetValue(i, &length); ASSERT_EQ(pos, result_->offset(i)); - ASSERT_EQ(strings[i % N].size(), length); + ASSERT_EQ(static_cast(strings[i % 
N].size()), length); ASSERT_EQ(strings[i % N], result_->GetString(i)); pos += length; @@ -267,7 +267,7 @@ TEST_F(TestBinaryContainer, TestListFunctions) { int pos = 0; for (size_t i = 0; i < expected_.size(); ++i) { ASSERT_EQ(pos, strings_->value_offset(i)); - ASSERT_EQ(expected_[i].size(), strings_->value_length(i)); + ASSERT_EQ(static_cast(expected_[i].size()), strings_->value_length(i)); pos += expected_[i].size(); } } @@ -339,7 +339,7 @@ TEST_F(TestBinaryBuilder, TestScalarAppend) { } else { ASSERT_FALSE(result_->IsNull(i)); const uint8_t* vals = result_->GetValue(i, &length); - ASSERT_EQ(strings[i % N].size(), length); + ASSERT_EQ(static_cast(strings[i % N].size()), length); ASSERT_EQ(0, std::memcmp(vals, strings[i % N].data(), length)); } } diff --git a/cpp/src/arrow/types/string.cc b/cpp/src/arrow/types/string.cc index da02c7d1d8a9e..2f0037024c78d 100644 --- a/cpp/src/arrow/types/string.cc +++ b/cpp/src/arrow/types/string.cc @@ -61,6 +61,15 @@ Status StringArray::Validate() const { return BinaryArray::Validate(); } -TypePtr BinaryBuilder::value_type_ = TypePtr(new UInt8Type()); +// This used to be a static member variable of BinaryBuilder, but it can cause +// valgrind to report a (spurious?) memory leak when needed in other shared +// libraries. The problem came up while adding explicit visibility to libarrow +// and libarrow_parquet +static TypePtr kBinaryValueType = TypePtr(new UInt8Type()); + +BinaryBuilder::BinaryBuilder(MemoryPool* pool, const TypePtr& type) + : ListBuilder(pool, std::make_shared(pool, kBinaryValueType), type) { + byte_builder_ = static_cast(value_builder_.get()); +} } // namespace arrow diff --git a/cpp/src/arrow/types/string.h b/cpp/src/arrow/types/string.h index b3c00d298b35c..bab0c58f617b2 100644 --- a/cpp/src/arrow/types/string.h +++ b/cpp/src/arrow/types/string.h @@ -28,13 +28,14 @@ #include "arrow/types/list.h" #include "arrow/types/primitive.h" #include "arrow/util/status.h" +#include "arrow/util/visibility.h" namespace arrow { class Buffer; class MemoryPool; -class BinaryArray : public ListArray { +class ARROW_EXPORT BinaryArray : public ListArray { public: BinaryArray(int32_t length, const std::shared_ptr& offsets, const ArrayPtr& values, int32_t null_count = 0, @@ -62,7 +63,7 @@ class BinaryArray : public ListArray { const uint8_t* raw_bytes_; }; -class StringArray : public BinaryArray { +class ARROW_EXPORT StringArray : public BinaryArray { public: StringArray(int32_t length, const std::shared_ptr& offsets, const ArrayPtr& values, int32_t null_count = 0, @@ -87,12 +88,10 @@ class StringArray : public BinaryArray { }; // BinaryBuilder : public ListBuilder -class BinaryBuilder : public ListBuilder { +class ARROW_EXPORT BinaryBuilder : public ListBuilder { public: - explicit BinaryBuilder(MemoryPool* pool, const TypePtr& type) - : ListBuilder(pool, std::make_shared(pool, value_type_), type) { - byte_builder_ = static_cast(value_builder_.get()); - } + explicit BinaryBuilder(MemoryPool* pool, const TypePtr& type); + virtual ~BinaryBuilder() {} Status Append(const uint8_t* value, int32_t length) { RETURN_NOT_OK(ListBuilder::Append()); @@ -105,11 +104,10 @@ class BinaryBuilder : public ListBuilder { protected: UInt8Builder* byte_builder_; - static TypePtr value_type_; }; // String builder -class StringBuilder : public BinaryBuilder { +class ARROW_EXPORT StringBuilder : public BinaryBuilder { public: explicit StringBuilder(MemoryPool* pool, const TypePtr& type) : BinaryBuilder(pool, type) {} diff --git a/cpp/src/arrow/types/struct-test.cc 
b/cpp/src/arrow/types/struct-test.cc
index d2bd2971d0438..ccf5a52dc831c 100644
--- a/cpp/src/arrow/types/struct-test.cc
+++ b/cpp/src/arrow/types/struct-test.cc
@@ -116,7 +116,7 @@ class TestStructBuilder : public TestBuilder {
     ASSERT_OK(MakeBuilder(pool_, type_, &tmp));
     builder_ = std::dynamic_pointer_cast<StructBuilder>(tmp);
-    ASSERT_EQ(2, builder_->field_builders().size());
+    ASSERT_EQ(2, static_cast<int>(builder_->field_builders().size()));
   }
 
   void Done() { result_ = std::dynamic_pointer_cast<StructArray>(builder_->Finish()); }
@@ -132,7 +132,7 @@ TEST_F(TestStructBuilder, TestAppendNull) {
   ASSERT_OK(builder_->AppendNull());
   ASSERT_OK(builder_->AppendNull());
-  ASSERT_EQ(2, builder_->field_builders().size());
+  ASSERT_EQ(2, static_cast<int>(builder_->field_builders().size()));
 
   ListBuilder* list_vb = static_cast<ListBuilder*>(builder_->field_builder(0).get());
   ASSERT_OK(list_vb->AppendNull());
@@ -148,7 +148,7 @@ TEST_F(TestStructBuilder, TestAppendNull) {
 
   ASSERT_OK(result_->Validate());
 
-  ASSERT_EQ(2, result_->fields().size());
+  ASSERT_EQ(2, static_cast<int>(result_->fields().size()));
   ASSERT_EQ(2, result_->length());
   ASSERT_EQ(2, result_->field(0)->length());
   ASSERT_EQ(2, result_->field(1)->length());
@@ -174,7 +174,7 @@ TEST_F(TestStructBuilder, TestBasics) {
   ListBuilder* list_vb = static_cast<ListBuilder*>(builder_->field_builder(0).get());
   Int8Builder* char_vb = static_cast<Int8Builder*>(list_vb->value_builder().get());
   Int32Builder* int_vb = static_cast<Int32Builder*>(builder_->field_builder(1).get());
-  ASSERT_EQ(2, builder_->field_builders().size());
+  ASSERT_EQ(2, static_cast<int>(builder_->field_builders().size()));
 
   EXPECT_OK(builder_->Resize(list_lengths.size()));
   EXPECT_OK(char_vb->Resize(list_values.size()));
diff --git a/cpp/src/arrow/types/struct.h b/cpp/src/arrow/types/struct.h
index 78afd29eb8df5..63955eb31bb7d 100644
--- a/cpp/src/arrow/types/struct.h
+++ b/cpp/src/arrow/types/struct.h
@@ -25,10 +25,11 @@
 #include "arrow/type.h"
 #include "arrow/types/list.h"
 #include "arrow/types/primitive.h"
+#include "arrow/util/visibility.h"
 
 namespace arrow {
 
-class StructArray : public Array {
+class ARROW_EXPORT StructArray : public Array {
  public:
   StructArray(const TypePtr& type, int32_t length, std::vector<ArrayPtr>& field_arrays,
       int32_t null_count = 0, std::shared_ptr<Buffer> null_bitmap = nullptr)
@@ -64,7 +65,7 @@ class StructArray : public Array {
 // Append, Resize and Reserve methods are acting on StructBuilder.
 // Please make sure all these methods of all child-builders' are consistently
 // called to maintain data-structure consistency.
-class StructBuilder : public ArrayBuilder {
+class ARROW_EXPORT StructBuilder : public ArrayBuilder {
  public:
   StructBuilder(MemoryPool* pool, const std::shared_ptr<DataType>& type,
       const std::vector<std::shared_ptr<ArrayBuilder>>& field_builders)
diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt
index d2a4b091fada5..4e941fb5f5cec 100644
--- a/cpp/src/arrow/util/CMakeLists.txt
+++ b/cpp/src/arrow/util/CMakeLists.txt
@@ -27,6 +27,7 @@ install(FILES
   macros.h
   memory-pool.h
   status.h
+  visibility.h
   DESTINATION include/arrow/util)
 
 #######################################
diff --git a/cpp/src/arrow/util/buffer.h b/cpp/src/arrow/util/buffer.h
index f845d67761fe4..1aeebc69b4e14 100644
--- a/cpp/src/arrow/util/buffer.h
+++ b/cpp/src/arrow/util/buffer.h
@@ -26,6 +26,7 @@
 #include "arrow/util/bit-util.h"
 #include "arrow/util/macros.h"
 #include "arrow/util/status.h"
+#include "arrow/util/visibility.h"
 
 namespace arrow {
 
@@ -41,7 +42,7 @@ class Status;
 // Capacity is the number of bytes that were allocated for the buffer in
 // total.
// The following invariant is always true: Size < Capacity -class Buffer : public std::enable_shared_from_this { +class ARROW_EXPORT Buffer : public std::enable_shared_from_this { public: Buffer(const uint8_t* data, int64_t size) : data_(data), size_(size), capacity_(size) {} virtual ~Buffer(); @@ -95,7 +96,7 @@ class Buffer : public std::enable_shared_from_this { }; // A Buffer whose contents can be mutated. May or may not own its data. -class MutableBuffer : public Buffer { +class ARROW_EXPORT MutableBuffer : public Buffer { public: MutableBuffer(uint8_t* data, int64_t size) : Buffer(data, size) { mutable_data_ = data; @@ -112,7 +113,7 @@ class MutableBuffer : public Buffer { uint8_t* mutable_data_; }; -class ResizableBuffer : public MutableBuffer { +class ARROW_EXPORT ResizableBuffer : public MutableBuffer { public: // Change buffer reported size to indicated size, allocating memory if // necessary. This will ensure that the capacity of the buffer is a multiple @@ -129,7 +130,7 @@ class ResizableBuffer : public MutableBuffer { }; // A Buffer whose lifetime is tied to a particular MemoryPool -class PoolBuffer : public ResizableBuffer { +class ARROW_EXPORT PoolBuffer : public ResizableBuffer { public: explicit PoolBuffer(MemoryPool* pool = nullptr); virtual ~PoolBuffer(); @@ -145,7 +146,8 @@ static constexpr int64_t MIN_BUFFER_CAPACITY = 1024; class BufferBuilder { public: - explicit BufferBuilder(MemoryPool* pool) : pool_(pool), capacity_(0), size_(0) {} + explicit BufferBuilder(MemoryPool* pool) + : pool_(pool), data_(nullptr), capacity_(0), size_(0) {} // Resizes the buffer to the nearest multiple of 64 bytes per Layout.md Status Resize(int32_t elements) { diff --git a/cpp/src/arrow/util/memory-pool-test.cc b/cpp/src/arrow/util/memory-pool-test.cc index 4ab9736c2b468..8e7dfd60baa62 100644 --- a/cpp/src/arrow/util/memory-pool-test.cc +++ b/cpp/src/arrow/util/memory-pool-test.cc @@ -31,7 +31,7 @@ TEST(DefaultMemoryPool, MemoryTracking) { uint8_t* data; ASSERT_OK(pool->Allocate(100, &data)); - EXPECT_EQ(0, reinterpret_cast(data) % 64); + EXPECT_EQ(static_cast(0), reinterpret_cast(data) % 64); ASSERT_EQ(100, pool->bytes_allocated()); pool->Free(data, 100); diff --git a/cpp/src/arrow/util/memory-pool.h b/cpp/src/arrow/util/memory-pool.h index 824c7248e2e86..4c1d699addd50 100644 --- a/cpp/src/arrow/util/memory-pool.h +++ b/cpp/src/arrow/util/memory-pool.h @@ -20,11 +20,13 @@ #include +#include "arrow/util/visibility.h" + namespace arrow { class Status; -class MemoryPool { +class ARROW_EXPORT MemoryPool { public: virtual ~MemoryPool(); @@ -34,7 +36,7 @@ class MemoryPool { virtual int64_t bytes_allocated() const = 0; }; -MemoryPool* default_memory_pool(); +ARROW_EXPORT MemoryPool* default_memory_pool(); } // namespace arrow diff --git a/cpp/src/arrow/util/status.cc b/cpp/src/arrow/util/status.cc index d194ed5572f52..8dd07d0d064e7 100644 --- a/cpp/src/arrow/util/status.cc +++ b/cpp/src/arrow/util/status.cc @@ -58,6 +58,9 @@ std::string Status::CodeAsString() const { case StatusCode::NotImplemented: type = "NotImplemented"; break; + default: + type = "Unknown"; + break; } return std::string(type); } diff --git a/cpp/src/arrow/util/status.h b/cpp/src/arrow/util/status.h index d1a742500084c..6ba2035bcd3a4 100644 --- a/cpp/src/arrow/util/status.h +++ b/cpp/src/arrow/util/status.h @@ -19,6 +19,8 @@ #include #include +#include "arrow/util/visibility.h" + // Return the given status if it is not OK. 
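A hedged sketch of the early-return style that the macro defined just below enables (the function is invented, and the int64_t signatures anticipate the io interface changes made later in this series):

    Status CopyPrefix(io::ReadableFile* in, io::WriteableFile* out) {
      uint8_t scratch[4096];
      int64_t bytes_read = 0;
      // Each failing call hands its Status back to the caller instead of throwing.
      ARROW_RETURN_NOT_OK(in->Read(sizeof(scratch), &bytes_read, scratch));
      ARROW_RETURN_NOT_OK(out->Write(scratch, bytes_read));
      return Status::OK();
    }

The macro itself follows: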
#define ARROW_RETURN_NOT_OK(s) \ do { \ @@ -82,7 +84,7 @@ enum class StatusCode : char { NotImplemented = 10, }; -class Status { +class ARROW_EXPORT Status { public: // Create a success status. Status() : state_(NULL) {} diff --git a/cpp/src/arrow/util/visibility.h b/cpp/src/arrow/util/visibility.h new file mode 100644 index 0000000000000..b197c198297c8 --- /dev/null +++ b/cpp/src/arrow/util/visibility.h @@ -0,0 +1,32 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_UTIL_VISIBILITY_H +#define ARROW_UTIL_VISIBILITY_H + +#if defined(_WIN32) || defined(__CYGWIN__) +#define ARROW_EXPORT __declspec(dllexport) +#else // Not Windows +#ifndef ARROW_EXPORT +#define ARROW_EXPORT __attribute__((visibility("default"))) +#endif +#ifndef ARROW_NO_EXPORT +#define ARROW_NO_EXPORT __attribute__((visibility("hidden"))) +#endif +#endif // Non-Windows + +#endif // ARROW_UTIL_VISIBILITY_H diff --git a/python/conda.recipe/build.sh b/python/conda.recipe/build.sh index a164c1af51833..f32710073c7c2 100644 --- a/python/conda.recipe/build.sh +++ b/python/conda.recipe/build.sh @@ -19,13 +19,14 @@ if [ "$(uname)" == "Darwin" ]; then export MACOSX_DEPLOYMENT_TARGET=10.7 fi -echo Setting the compiler... -if [ `uname` == Linux ]; then - EXTRA_CMAKE_ARGS=-DCMAKE_SHARED_LINKER_FLAGS=-static-libstdc++ -elif [ `uname` == Darwin ]; then - EXTRA_CMAKE_ARGS= -fi +# echo Setting the compiler... +# if [ `uname` == Linux ]; then +# EXTRA_CMAKE_ARGS=-DCMAKE_SHARED_LINKER_FLAGS=-static-libstdc++ +# elif [ `uname` == Darwin ]; then +# EXTRA_CMAKE_ARGS= +# fi cd .. 
-$PYTHON setup.py build_ext --extra-cmake-args=$EXTRA_CMAKE_ARGS || exit 1 +# $PYTHON setup.py build_ext --extra-cmake-args=$EXTRA_CMAKE_ARGS || exit 1 +$PYTHON setup.py build_ext || exit 1 $PYTHON setup.py install || exit 1 diff --git a/python/src/pyarrow/adapters/builtin.h b/python/src/pyarrow/adapters/builtin.h index 88869c2048003..4e997e31dd690 100644 --- a/python/src/pyarrow/adapters/builtin.h +++ b/python/src/pyarrow/adapters/builtin.h @@ -28,6 +28,7 @@ #include #include "pyarrow/common.h" +#include "pyarrow/visibility.h" namespace arrow { class Array; } @@ -35,6 +36,7 @@ namespace pyarrow { class Status; +PYARROW_EXPORT Status ConvertPySequence(PyObject* obj, std::shared_ptr* out); } // namespace pyarrow diff --git a/python/src/pyarrow/adapters/pandas.h b/python/src/pyarrow/adapters/pandas.h index 17922349de6c1..c3377685bcce9 100644 --- a/python/src/pyarrow/adapters/pandas.h +++ b/python/src/pyarrow/adapters/pandas.h @@ -25,6 +25,8 @@ #include +#include "pyarrow/visibility.h" + namespace arrow { class Array; @@ -36,12 +38,15 @@ namespace pyarrow { class Status; +PYARROW_EXPORT Status ArrowToPandas(const std::shared_ptr& col, PyObject* py_ref, PyObject** out); +PYARROW_EXPORT Status PandasMaskedToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, std::shared_ptr* out); +PYARROW_EXPORT Status PandasToArrow(arrow::MemoryPool* pool, PyObject* ao, std::shared_ptr* out); diff --git a/python/src/pyarrow/common.h b/python/src/pyarrow/common.h index 0211e8948f2f7..fb0ba3e482296 100644 --- a/python/src/pyarrow/common.h +++ b/python/src/pyarrow/common.h @@ -22,6 +22,8 @@ #include "arrow/util/buffer.h" +#include "pyarrow/visibility.h" + namespace arrow { class MemoryPool; } namespace pyarrow { @@ -94,9 +96,9 @@ struct PyObjectStringify { return Status::UnknownError(message); \ } -arrow::MemoryPool* GetMemoryPool(); +PYARROW_EXPORT arrow::MemoryPool* GetMemoryPool(); -class NumPyBuffer : public arrow::Buffer { +class PYARROW_EXPORT NumPyBuffer : public arrow::Buffer { public: NumPyBuffer(PyArrayObject* arr) : Buffer(nullptr, 0) { diff --git a/python/src/pyarrow/config.h b/python/src/pyarrow/config.h index 48ae715d842b1..82936b1a5f317 100644 --- a/python/src/pyarrow/config.h +++ b/python/src/pyarrow/config.h @@ -21,6 +21,7 @@ #include #include "pyarrow/numpy_interop.h" +#include "pyarrow/visibility.h" #if PY_MAJOR_VERSION >= 3 #define PyString_Check PyUnicode_Check @@ -28,10 +29,13 @@ namespace pyarrow { +PYARROW_EXPORT extern PyObject* numpy_nan; +PYARROW_EXPORT void pyarrow_init(); +PYARROW_EXPORT void pyarrow_set_numpy_nan(PyObject* obj); } // namespace pyarrow diff --git a/python/src/pyarrow/helpers.h b/python/src/pyarrow/helpers.h index ec42bb31d3b9b..fa9c713b0c22c 100644 --- a/python/src/pyarrow/helpers.h +++ b/python/src/pyarrow/helpers.h @@ -21,6 +21,8 @@ #include #include +#include "pyarrow/visibility.h" + namespace pyarrow { using arrow::DataType; @@ -40,6 +42,7 @@ extern const std::shared_ptr FLOAT; extern const std::shared_ptr DOUBLE; extern const std::shared_ptr STRING; +PYARROW_EXPORT std::shared_ptr GetPrimitiveType(Type::type type); } // namespace pyarrow diff --git a/python/src/pyarrow/status.h b/python/src/pyarrow/status.h index cb8c8add210e4..67cd66c58eeb3 100644 --- a/python/src/pyarrow/status.h +++ b/python/src/pyarrow/status.h @@ -17,6 +17,8 @@ #include #include +#include "pyarrow/visibility.h" + namespace pyarrow { #define PY_RETURN_NOT_OK(s) do { \ @@ -38,7 +40,7 @@ enum class StatusCode: char { UnknownError = 10 }; -class Status { +class PYARROW_EXPORT Status { 
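  // Note: PYARROW_EXPORT mirrors ARROW_EXPORT. Per pyarrow/visibility.h,
  // added below, it expands to __declspec(dllexport) on Windows/Cygwin and
  // to __attribute__((visibility("default"))) elsewhere, so classes such as
  // this Status remain reachable even when the rest of the extension's
  // symbols are hidden by default.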
public: // Create a success status. Status() : state_(NULL) { } diff --git a/python/src/pyarrow/visibility.h b/python/src/pyarrow/visibility.h new file mode 100644 index 0000000000000..9f0c13b4b2083 --- /dev/null +++ b/python/src/pyarrow/visibility.h @@ -0,0 +1,32 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef PYARROW_VISIBILITY_H +#define PYARROW_VISIBILITY_H + +#if defined(_WIN32) || defined(__CYGWIN__) +#define PYARROW_EXPORT __declspec(dllexport) +#else // Not Windows +#ifndef PYARROW_EXPORT +#define PYARROW_EXPORT __attribute__((visibility("default"))) +#endif +#ifndef PYARROW_NO_EXPORT +#define PYARROW_NO_EXPORT __attribute__((visibility("hidden"))) +#endif +#endif // Non-Windows + +#endif // PYARROW_VISIBILITY_H From ff6132f8a1c2a98cf7c94ae327342c8b38aecb18 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 11 Jul 2016 22:58:57 -0700 Subject: [PATCH 0096/1644] ARROW-237: Implement parquet-cpp's abstract IO interfaces for memory allocation and file reading Part of ARROW-227 and ARROW-236 Author: Wes McKinney Closes #101 from wesm/ARROW-237 and squashes the following commits: 00c8211 [Wes McKinney] Draft implementations of parquet-cpp allocator and read-only file interfaces --- cpp/src/arrow/io/hdfs-io-test.cc | 2 +- cpp/src/arrow/io/hdfs.cc | 16 +- cpp/src/arrow/io/hdfs.h | 8 +- cpp/src/arrow/io/interfaces.h | 14 +- cpp/src/arrow/parquet/CMakeLists.txt | 5 + cpp/src/arrow/parquet/io.cc | 94 ++++ cpp/src/arrow/parquet/io.h | 80 +++ cpp/src/arrow/parquet/parquet-io-test.cc | 511 ++++-------------- .../parquet/parquet-reader-writer-test.cc | 489 +++++++++++++++++ cpp/src/arrow/parquet/utils.h | 15 +- python/pyarrow/includes/libarrow_io.pxd | 12 +- python/pyarrow/io.pyx | 8 +- 12 files changed, 810 insertions(+), 444 deletions(-) create mode 100644 cpp/src/arrow/parquet/io.cc create mode 100644 cpp/src/arrow/parquet/io.h create mode 100644 cpp/src/arrow/parquet/parquet-reader-writer-test.cc diff --git a/cpp/src/arrow/io/hdfs-io-test.cc b/cpp/src/arrow/io/hdfs-io-test.cc index d1bf140ae68e2..e48a28142fa46 100644 --- a/cpp/src/arrow/io/hdfs-io-test.cc +++ b/cpp/src/arrow/io/hdfs-io-test.cc @@ -266,7 +266,7 @@ TEST_F(TestHdfsClient, ReadableMethods) { ASSERT_EQ(size, file_size); uint8_t buffer[50]; - int32_t bytes_read = 0; + int64_t bytes_read = 0; ASSERT_OK(file->Read(50, &bytes_read, buffer)); ASSERT_EQ(0, std::memcmp(buffer, data.data(), 50)); diff --git a/cpp/src/arrow/io/hdfs.cc b/cpp/src/arrow/io/hdfs.cc index 6da6ea4e71bd8..800c3edf4f31a 100644 --- a/cpp/src/arrow/io/hdfs.cc +++ b/cpp/src/arrow/io/hdfs.cc @@ -100,7 +100,7 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { return Status::OK(); } - Status ReadAt(int64_t position, int32_t nbytes, int32_t* bytes_read, uint8_t* buffer) { + Status 
ReadAt(int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) { tSize ret = hdfsPread(fs_, file_, static_cast(position), reinterpret_cast(buffer), nbytes); RETURN_NOT_OK(CheckReadResult(ret)); @@ -108,7 +108,7 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { return Status::OK(); } - Status Read(int32_t nbytes, int32_t* bytes_read, uint8_t* buffer) { + Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) { tSize ret = hdfsRead(fs_, file_, reinterpret_cast(buffer), nbytes); RETURN_NOT_OK(CheckReadResult(ret)); *bytes_read = ret; @@ -138,11 +138,11 @@ Status HdfsReadableFile::Close() { } Status HdfsReadableFile::ReadAt( - int64_t position, int32_t nbytes, int32_t* bytes_read, uint8_t* buffer) { + int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) { return impl_->ReadAt(position, nbytes, bytes_read, buffer); } -Status HdfsReadableFile::Read(int32_t nbytes, int32_t* bytes_read, uint8_t* buffer) { +Status HdfsReadableFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) { return impl_->Read(nbytes, bytes_read, buffer); } @@ -177,7 +177,7 @@ class HdfsWriteableFile::HdfsWriteableFileImpl : public HdfsAnyFileImpl { return Status::OK(); } - Status Write(const uint8_t* buffer, int32_t nbytes, int32_t* bytes_written) { + Status Write(const uint8_t* buffer, int64_t nbytes, int64_t* bytes_written) { tSize ret = hdfsWrite(fs_, file_, reinterpret_cast(buffer), nbytes); CHECK_FAILURE(ret, "Write"); *bytes_written = ret; @@ -198,12 +198,12 @@ Status HdfsWriteableFile::Close() { } Status HdfsWriteableFile::Write( - const uint8_t* buffer, int32_t nbytes, int32_t* bytes_read) { + const uint8_t* buffer, int64_t nbytes, int64_t* bytes_read) { return impl_->Write(buffer, nbytes, bytes_read); } -Status HdfsWriteableFile::Write(const uint8_t* buffer, int32_t nbytes) { - int32_t bytes_written_dummy = 0; +Status HdfsWriteableFile::Write(const uint8_t* buffer, int64_t nbytes) { + int64_t bytes_written_dummy = 0; return Write(buffer, nbytes, &bytes_written_dummy); } diff --git a/cpp/src/arrow/io/hdfs.h b/cpp/src/arrow/io/hdfs.h index 532e3c536a188..b6449fcb88a75 100644 --- a/cpp/src/arrow/io/hdfs.h +++ b/cpp/src/arrow/io/hdfs.h @@ -164,14 +164,14 @@ class ARROW_EXPORT HdfsReadableFile : public RandomAccessFile { Status GetSize(int64_t* size) override; Status ReadAt( - int64_t position, int32_t nbytes, int32_t* bytes_read, uint8_t* buffer) override; + int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override; Status Seek(int64_t position) override; Status Tell(int64_t* position) override; // NOTE: If you wish to read a particular range of a file in a multithreaded // context, you may prefer to use ReadAt to avoid locking issues - Status Read(int32_t nbytes, int32_t* bytes_read, uint8_t* buffer) override; + Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override; private: class ARROW_NO_EXPORT HdfsReadableFileImpl; @@ -189,9 +189,9 @@ class ARROW_EXPORT HdfsWriteableFile : public WriteableFile { Status Close() override; - Status Write(const uint8_t* buffer, int32_t nbytes) override; + Status Write(const uint8_t* buffer, int64_t nbytes) override; - Status Write(const uint8_t* buffer, int32_t nbytes, int32_t* bytes_written); + Status Write(const uint8_t* buffer, int64_t nbytes, int64_t* bytes_written); Status Tell(int64_t* position) override; diff --git a/cpp/src/arrow/io/interfaces.h b/cpp/src/arrow/io/interfaces.h index 4bd8a8ffc2f9d..25361d5633d12 100644 --- a/cpp/src/arrow/io/interfaces.h 
+++ b/cpp/src/arrow/io/interfaces.h @@ -15,8 +15,8 @@ // specific language governing permissions and limitations // under the License. -#ifndef ARROW_IO_INTERFACES -#define ARROW_IO_INTERFACES +#ifndef ARROW_IO_INTERFACES_H +#define ARROW_IO_INTERFACES_H #include @@ -40,17 +40,17 @@ class FileSystemClient { }; class FileBase { + public: virtual Status Close() = 0; - virtual Status Tell(int64_t* position) = 0; }; class ReadableFile : public FileBase { public: virtual Status ReadAt( - int64_t position, int32_t nbytes, int32_t* bytes_read, uint8_t* buffer) = 0; + int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) = 0; - virtual Status Read(int32_t nbytes, int32_t* bytes_read, uint8_t* buffer) = 0; + virtual Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) = 0; virtual Status GetSize(int64_t* size) = 0; }; @@ -62,10 +62,10 @@ class RandomAccessFile : public ReadableFile { class WriteableFile : public FileBase { public: - virtual Status Write(const uint8_t* buffer, int32_t nbytes) = 0; + virtual Status Write(const uint8_t* buffer, int64_t nbytes) = 0; }; } // namespace io } // namespace arrow -#endif // ARROW_IO_INTERFACES +#endif // ARROW_IO_INTERFACES_H diff --git a/cpp/src/arrow/parquet/CMakeLists.txt b/cpp/src/arrow/parquet/CMakeLists.txt index 00f19b354e379..f2a90b71a4968 100644 --- a/cpp/src/arrow/parquet/CMakeLists.txt +++ b/cpp/src/arrow/parquet/CMakeLists.txt @@ -19,6 +19,7 @@ # arrow_parquet : Arrow <-> Parquet adapter set(PARQUET_SRCS + io.cc reader.cc schema.cc writer.cc @@ -48,8 +49,12 @@ ARROW_TEST_LINK_LIBRARIES(parquet-schema-test arrow_parquet) ADD_ARROW_TEST(parquet-io-test) ARROW_TEST_LINK_LIBRARIES(parquet-io-test arrow_parquet) +ADD_ARROW_TEST(parquet-reader-writer-test) +ARROW_TEST_LINK_LIBRARIES(parquet-reader-writer-test arrow_parquet) + # Headers: top level install(FILES + io.h reader.h schema.h utils.h diff --git a/cpp/src/arrow/parquet/io.cc b/cpp/src/arrow/parquet/io.cc new file mode 100644 index 0000000000000..c81aa8c4da9ca --- /dev/null +++ b/cpp/src/arrow/parquet/io.cc @@ -0,0 +1,94 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
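Stepping back to the interfaces.h hunk above: a sketch of what the widened, all-int64_t contract asks of an implementor. Compare the BufferReader helper in the rewritten test file further below; the InMemoryFile class here is invented for illustration and assumes <algorithm> and <cstring> are included:

    class InMemoryFile : public io::RandomAccessFile {
     public:
      InMemoryFile(const uint8_t* data, int64_t size)
          : data_(data), size_(size), position_(0) {}

      Status Close() override { return Status::OK(); }
      Status Tell(int64_t* position) override {
        *position = position_;
        return Status::OK();
      }
      Status Seek(int64_t position) override {
        position_ = position;
        return Status::OK();
      }
      Status GetSize(int64_t* size) override {
        *size = size_;
        return Status::OK();
      }

      // Byte counts are now 64-bit end to end, so reads past 2 GB no longer
      // truncate the way the old int32_t signatures could.
      Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override {
        *bytes_read = std::min(nbytes, size_ - position_);
        std::memcpy(buffer, data_ + position_, static_cast<size_t>(*bytes_read));
        position_ += *bytes_read;
        return Status::OK();
      }
      Status ReadAt(int64_t position, int64_t nbytes, int64_t* bytes_read,
          uint8_t* buffer) override {
        ARROW_RETURN_NOT_OK(Seek(position));
        return Read(nbytes, bytes_read, buffer);
      }

     private:
      const uint8_t* data_;
      int64_t size_;
      int64_t position_;
    };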
+ +#include "arrow/parquet/io.h" + +#include +#include + +#include "parquet/api/io.h" + +#include "arrow/parquet/utils.h" +#include "arrow/util/memory-pool.h" +#include "arrow/util/status.h" + +// To assist with readability +using ArrowROFile = arrow::io::RandomAccessFile; + +namespace arrow { +namespace parquet { + +// ---------------------------------------------------------------------- +// ParquetAllocator + +ParquetAllocator::ParquetAllocator() : pool_(default_memory_pool()) {} + +ParquetAllocator::ParquetAllocator(MemoryPool* pool) : pool_(pool) {} + +ParquetAllocator::~ParquetAllocator() {} + +uint8_t* ParquetAllocator::Malloc(int64_t size) { + uint8_t* result; + PARQUET_THROW_NOT_OK(pool_->Allocate(size, &result)); + return result; +} + +void ParquetAllocator::Free(uint8_t* buffer, int64_t size) { + // Does not report Status + pool_->Free(buffer, size); +} + +// ---------------------------------------------------------------------- +// ParquetReadSource + +ParquetReadSource::ParquetReadSource( + const std::shared_ptr& file, ParquetAllocator* allocator) + : file_(file), allocator_(allocator) {} + +void ParquetReadSource::Close() { + PARQUET_THROW_NOT_OK(file_->Close()); +} + +int64_t ParquetReadSource::Tell() const { + int64_t position; + PARQUET_THROW_NOT_OK(file_->Tell(&position)); + return position; +} + +void ParquetReadSource::Seek(int64_t position) { + PARQUET_THROW_NOT_OK(file_->Seek(position)); +} + +int64_t ParquetReadSource::Read(int64_t nbytes, uint8_t* out) { + int64_t bytes_read; + PARQUET_THROW_NOT_OK(file_->Read(nbytes, &bytes_read, out)); + return bytes_read; +} + +std::shared_ptr<::parquet::Buffer> ParquetReadSource::Read(int64_t nbytes) { + // TODO(wesm): This code is duplicated from parquet/util/input.cc; suggests + // that there should be more code sharing amongst file-like sources + auto result = std::make_shared<::parquet::OwnedMutableBuffer>(0, allocator_); + result->Resize(nbytes); + + int64_t bytes_read = Read(nbytes, result->mutable_data()); + if (bytes_read < nbytes) { result->Resize(bytes_read); } + return result; +} + +} // namespace parquet +} // namespace arrow diff --git a/cpp/src/arrow/parquet/io.h b/cpp/src/arrow/parquet/io.h new file mode 100644 index 0000000000000..ef8871da4df61 --- /dev/null +++ b/cpp/src/arrow/parquet/io.h @@ -0,0 +1,80 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +// Bridges Arrow's IO interfaces and Parquet-cpp's IO interfaces + +#ifndef ARROW_PARQUET_IO_H +#define ARROW_PARQUET_IO_H + +#include +#include + +#include "parquet/api/io.h" + +#include "arrow/io/interfaces.h" +#include "arrow/util/visibility.h" + +namespace arrow { + +class MemoryPool; + +namespace parquet { + +// An implementation of the Parquet MemoryAllocator API that plugs into an +// existing Arrow memory pool. This way we can direct all allocations to a +// single place rather than tracking allocations in different locations (for +// example: without utilizing parquet-cpp's default allocator) +class ARROW_EXPORT ParquetAllocator : public ::parquet::MemoryAllocator { + public: + // Uses the default memory pool + ParquetAllocator(); + + explicit ParquetAllocator(MemoryPool* pool); + virtual ~ParquetAllocator(); + + uint8_t* Malloc(int64_t size) override; + void Free(uint8_t* buffer, int64_t size) override; + + MemoryPool* pool() { return pool_; } + + private: + MemoryPool* pool_; +}; + +class ARROW_EXPORT ParquetReadSource : public ::parquet::RandomAccessSource { + public: + ParquetReadSource( + const std::shared_ptr& file, ParquetAllocator* allocator); + + void Close() override; + int64_t Tell() const override; + void Seek(int64_t pos) override; + int64_t Read(int64_t nbytes, uint8_t* out) override; + std::shared_ptr<::parquet::Buffer> Read(int64_t nbytes) override; + + private: + // An Arrow readable file of some kind + std::shared_ptr file_; + + // The allocator is required for creating managed buffers + ParquetAllocator* allocator_; +}; + +} // namespace parquet +} // namespace arrow + +#endif // ARROW_PARQUET_IO_H diff --git a/cpp/src/arrow/parquet/parquet-io-test.cc b/cpp/src/arrow/parquet/parquet-io-test.cc index bfc27d26d63a1..7e724b31e3801 100644 --- a/cpp/src/arrow/parquet/parquet-io-test.cc +++ b/cpp/src/arrow/parquet/parquet-io-test.cc @@ -15,475 +15,164 @@ // specific language governing permissions and limitations // under the License. 
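Before the rewritten tests, a short usage sketch of the two bridge classes declared above (variable names invented; failures surface as parquet exceptions, matching the PARQUET_THROW_NOT_OK calls in io.cc):

    // Route parquet-cpp's allocations through an Arrow MemoryPool:
    arrow::parquet::ParquetAllocator allocator(arrow::default_memory_pool());
    uint8_t* scratch = allocator.Malloc(64);  // throws on allocation failure
    allocator.Free(scratch, 64);

    // Adapt an Arrow file (arrow_file is assumed to be a
    // std::shared_ptr<io::RandomAccessFile>) into a parquet-cpp read source:
    arrow::parquet::ParquetReadSource source(arrow_file, &allocator);
    std::shared_ptr<::parquet::Buffer> magic = source.Read(4);  // "PAR1" header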
+#include +#include +#include +#include + #include "gtest/gtest.h" -#include "arrow/test-util.h" -#include "arrow/parquet/test-util.h" -#include "arrow/parquet/reader.h" -#include "arrow/parquet/writer.h" -#include "arrow/types/construct.h" -#include "arrow/types/primitive.h" -#include "arrow/types/string.h" +#include "arrow/parquet/io.h" #include "arrow/util/memory-pool.h" #include "arrow/util/status.h" -#include "parquet/api/reader.h" -#include "parquet/api/writer.h" - -using ParquetBuffer = parquet::Buffer; -using parquet::BufferReader; -using parquet::default_writer_properties; -using parquet::InMemoryOutputStream; -using parquet::LogicalType; -using parquet::ParquetFileReader; -using parquet::ParquetFileWriter; -using parquet::RandomAccessSource; -using parquet::Repetition; -using parquet::SchemaDescriptor; -using parquet::ParquetVersion; -using ParquetType = parquet::Type; -using parquet::schema::GroupNode; -using parquet::schema::NodePtr; -using parquet::schema::PrimitiveNode; +#include "parquet/api/io.h" namespace arrow { - namespace parquet { -const int SMALL_SIZE = 100; -const int LARGE_SIZE = 10000; - -template -struct test_traits {}; +// Allocator tests -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::BOOLEAN; - static constexpr LogicalType::type logical_enum = LogicalType::NONE; - static uint8_t const value; -}; - -const uint8_t test_traits::value(1); +TEST(TestParquetAllocator, DefaultCtor) { + ParquetAllocator allocator; -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::INT32; - static constexpr LogicalType::type logical_enum = LogicalType::UINT_8; - static uint8_t const value; -}; + const int buffer_size = 10; -const uint8_t test_traits::value(64); - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::INT32; - static constexpr LogicalType::type logical_enum = LogicalType::INT_8; - static int8_t const value; -}; - -const int8_t test_traits::value(-64); - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::INT32; - static constexpr LogicalType::type logical_enum = LogicalType::UINT_16; - static uint16_t const value; -}; + uint8_t* buffer = nullptr; + ASSERT_NO_THROW(buffer = allocator.Malloc(buffer_size);); -const uint16_t test_traits::value(1024); + // valgrind will complain if we write into nullptr + memset(buffer, 0, buffer_size); -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::INT32; - static constexpr LogicalType::type logical_enum = LogicalType::INT_16; - static int16_t const value; -}; - -const int16_t test_traits::value(-1024); - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::INT32; - static constexpr LogicalType::type logical_enum = LogicalType::UINT_32; - static uint32_t const value; -}; - -const uint32_t test_traits::value(1024); - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::INT32; - static constexpr LogicalType::type logical_enum = LogicalType::NONE; - static int32_t const value; -}; - -const int32_t test_traits::value(-1024); - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::INT64; - static constexpr LogicalType::type logical_enum = LogicalType::UINT_64; - static uint64_t const value; -}; - -const uint64_t test_traits::value(1024); - -template <> -struct 
test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::INT64; - static constexpr LogicalType::type logical_enum = LogicalType::NONE; - static int64_t const value; -}; - -const int64_t test_traits::value(-1024); - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::FLOAT; - static constexpr LogicalType::type logical_enum = LogicalType::NONE; - static float const value; -}; - -const float test_traits::value(2.1f); - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::DOUBLE; - static constexpr LogicalType::type logical_enum = LogicalType::NONE; - static double const value; -}; - -const double test_traits::value(4.2); - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::BYTE_ARRAY; - static constexpr LogicalType::type logical_enum = LogicalType::UTF8; - static std::string const value; -}; - -const std::string test_traits::value("Test"); - -template -using ParquetDataType = ::parquet::DataType::parquet_enum>; - -template -using ParquetWriter = ::parquet::TypedColumnWriter>; + allocator.Free(buffer, buffer_size); +} -template -class TestParquetIO : public ::testing::Test { +// Pass through to the default memory pool +class TrackingPool : public MemoryPool { public: - virtual void SetUp() {} - - std::shared_ptr MakeSchema(Repetition::type repetition) { - auto pnode = PrimitiveNode::Make("column1", repetition, - test_traits::parquet_enum, test_traits::logical_enum); - NodePtr node_ = - GroupNode::Make("schema", Repetition::REQUIRED, std::vector({pnode})); - return std::static_pointer_cast(node_); - } - - std::unique_ptr MakeWriter( - const std::shared_ptr& schema) { - sink_ = std::make_shared(); - return ParquetFileWriter::Open(sink_, schema); - } - - std::unique_ptr ReaderFromSink() { - std::shared_ptr buffer = sink_->GetBuffer(); - std::unique_ptr source(new BufferReader(buffer)); - return ParquetFileReader::Open(std::move(source)); - } - - void ReadSingleColumnFile( - std::unique_ptr file_reader, std::shared_ptr* out) { - arrow::parquet::FileReader reader(default_memory_pool(), std::move(file_reader)); - std::unique_ptr column_reader; - ASSERT_OK_NO_THROW(reader.GetFlatColumn(0, &column_reader)); - ASSERT_NE(nullptr, column_reader.get()); - - ASSERT_OK(column_reader->NextBatch(SMALL_SIZE, out)); - ASSERT_NE(nullptr, out->get()); - } + TrackingPool() : pool_(default_memory_pool()), bytes_allocated_(0) {} - void ReadAndCheckSingleColumnFile(Array* values) { - std::shared_ptr out; - ReadSingleColumnFile(ReaderFromSink(), &out); - ASSERT_TRUE(values->Equals(out)); + Status Allocate(int64_t size, uint8_t** out) override { + RETURN_NOT_OK(pool_->Allocate(size, out)); + bytes_allocated_ += size; + return Status::OK(); } - void ReadTableFromFile( - std::unique_ptr file_reader, std::shared_ptr
* out) { - arrow::parquet::FileReader reader(default_memory_pool(), std::move(file_reader)); - ASSERT_OK_NO_THROW(reader.ReadFlatTable(out)); - ASSERT_NE(nullptr, out->get()); + void Free(uint8_t* buffer, int64_t size) override { + pool_->Free(buffer, size); + bytes_allocated_ -= size; } - void ReadAndCheckSingleColumnTable(const std::shared_ptr& values) { - std::shared_ptr
out; - ReadTableFromFile(ReaderFromSink(), &out); - ASSERT_EQ(1, out->num_columns()); - ASSERT_EQ(values->length(), out->num_rows()); - - std::shared_ptr chunked_array = out->column(0)->data(); - ASSERT_EQ(1, chunked_array->num_chunks()); - ASSERT_TRUE(values->Equals(chunked_array->chunk(0))); - } + int64_t bytes_allocated() const override { return bytes_allocated_; } - template - void WriteFlatColumn(const std::shared_ptr& schema, - const std::shared_ptr& values) { - FileWriter writer(default_memory_pool(), MakeWriter(schema)); - ASSERT_OK_NO_THROW(writer.NewRowGroup(values->length())); - ASSERT_OK_NO_THROW(writer.WriteFlatColumnChunk(values.get())); - ASSERT_OK_NO_THROW(writer.Close()); - } - - std::shared_ptr sink_; + private: + MemoryPool* pool_; + int64_t bytes_allocated_; }; -// We habe separate tests for UInt32Type as this is currently the only type -// where a roundtrip does not yield the identical Array structure. -// There we write an UInt32 Array but receive an Int64 Array as result for -// Parquet version 1.0. +TEST(TestParquetAllocator, CustomPool) { + TrackingPool pool; -typedef ::testing::Types TestTypes; + ParquetAllocator allocator(&pool); -TYPED_TEST_CASE(TestParquetIO, TestTypes); + ASSERT_EQ(&pool, allocator.pool()); -TYPED_TEST(TestParquetIO, SingleColumnRequiredWrite) { - auto values = NonNullArray(SMALL_SIZE); + const int buffer_size = 10; - std::shared_ptr schema = this->MakeSchema(Repetition::REQUIRED); - this->WriteFlatColumn(schema, values); + uint8_t* buffer = nullptr; + ASSERT_NO_THROW(buffer = allocator.Malloc(buffer_size);); - this->ReadAndCheckSingleColumnFile(values.get()); -} - -TYPED_TEST(TestParquetIO, SingleColumnTableRequiredWrite) { - auto values = NonNullArray(SMALL_SIZE); - std::shared_ptr
table = MakeSimpleTable(values, false); - this->sink_ = std::make_shared(); - ASSERT_OK_NO_THROW(WriteFlatTable(table.get(), default_memory_pool(), this->sink_, - values->length(), default_writer_properties())); - - std::shared_ptr
out; - this->ReadTableFromFile(this->ReaderFromSink(), &out); - ASSERT_EQ(1, out->num_columns()); - ASSERT_EQ(100, out->num_rows()); - - std::shared_ptr chunked_array = out->column(0)->data(); - ASSERT_EQ(1, chunked_array->num_chunks()); - ASSERT_TRUE(values->Equals(chunked_array->chunk(0))); -} + ASSERT_EQ(buffer_size, pool.bytes_allocated()); -TYPED_TEST(TestParquetIO, SingleColumnOptionalReadWrite) { - // This also tests max_definition_level = 1 - auto values = NullableArray(SMALL_SIZE, 10); + // valgrind will complain if we write into nullptr + memset(buffer, 0, buffer_size); - std::shared_ptr schema = this->MakeSchema(Repetition::OPTIONAL); - this->WriteFlatColumn(schema, values); + allocator.Free(buffer, buffer_size); - this->ReadAndCheckSingleColumnFile(values.get()); + ASSERT_EQ(0, pool.bytes_allocated()); } -TYPED_TEST(TestParquetIO, SingleColumnTableOptionalReadWrite) { - // This also tests max_definition_level = 1 - std::shared_ptr values = NullableArray(SMALL_SIZE, 10); - std::shared_ptr
table = MakeSimpleTable(values, true); - this->sink_ = std::make_shared(); - ASSERT_OK_NO_THROW(WriteFlatTable(table.get(), default_memory_pool(), this->sink_, - values->length(), default_writer_properties())); +// ---------------------------------------------------------------------- +// Read source tests - this->ReadAndCheckSingleColumnTable(values); -} - -TYPED_TEST(TestParquetIO, SingleColumnRequiredChunkedWrite) { - auto values = NonNullArray(SMALL_SIZE); - int64_t chunk_size = values->length() / 4; +class BufferReader : public io::RandomAccessFile { + public: + BufferReader(const uint8_t* buffer, int buffer_size) + : buffer_(buffer), buffer_size_(buffer_size), position_(0) {} - std::shared_ptr schema = this->MakeSchema(Repetition::REQUIRED); - FileWriter writer(default_memory_pool(), this->MakeWriter(schema)); - for (int i = 0; i < 4; i++) { - ASSERT_OK_NO_THROW(writer.NewRowGroup(chunk_size)); - ASSERT_OK_NO_THROW( - writer.WriteFlatColumnChunk(values.get(), i * chunk_size, chunk_size)); + Status Close() override { + // no-op + return Status::OK(); } - ASSERT_OK_NO_THROW(writer.Close()); - - this->ReadAndCheckSingleColumnFile(values.get()); -} -TYPED_TEST(TestParquetIO, SingleColumnTableRequiredChunkedWrite) { - auto values = NonNullArray(LARGE_SIZE); - std::shared_ptr
table = MakeSimpleTable(values, false); - this->sink_ = std::make_shared(); - ASSERT_OK_NO_THROW(WriteFlatTable( - table.get(), default_memory_pool(), this->sink_, 512, default_writer_properties())); - - this->ReadAndCheckSingleColumnTable(values); -} - -TYPED_TEST(TestParquetIO, SingleColumnOptionalChunkedWrite) { - int64_t chunk_size = SMALL_SIZE / 4; - auto values = NullableArray(SMALL_SIZE, 10); - - std::shared_ptr schema = this->MakeSchema(Repetition::OPTIONAL); - FileWriter writer(default_memory_pool(), this->MakeWriter(schema)); - for (int i = 0; i < 4; i++) { - ASSERT_OK_NO_THROW(writer.NewRowGroup(chunk_size)); - ASSERT_OK_NO_THROW( - writer.WriteFlatColumnChunk(values.get(), i * chunk_size, chunk_size)); + Status Tell(int64_t* position) override { + *position = position_; + return Status::OK(); } - ASSERT_OK_NO_THROW(writer.Close()); - this->ReadAndCheckSingleColumnFile(values.get()); -} - -TYPED_TEST(TestParquetIO, SingleColumnTableOptionalChunkedWrite) { - // This also tests max_definition_level = 1 - auto values = NullableArray(LARGE_SIZE, 100); - std::shared_ptr
table = MakeSimpleTable(values, true); - this->sink_ = std::make_shared(); - ASSERT_OK_NO_THROW(WriteFlatTable( - table.get(), default_memory_pool(), this->sink_, 512, default_writer_properties())); - - this->ReadAndCheckSingleColumnTable(values); -} - -using TestUInt32ParquetIO = TestParquetIO; - -TEST_F(TestUInt32ParquetIO, Parquet_2_0_Compability) { - // This also tests max_definition_level = 1 - std::shared_ptr values = NullableArray(LARGE_SIZE, 100); - std::shared_ptr
table = MakeSimpleTable(values, true); - - // Parquet 2.0 roundtrip should yield an uint32_t column again - this->sink_ = std::make_shared(); - std::shared_ptr<::parquet::WriterProperties> properties = - ::parquet::WriterProperties::Builder() - .version(ParquetVersion::PARQUET_2_0) - ->build(); - ASSERT_OK_NO_THROW( - WriteFlatTable(table.get(), default_memory_pool(), this->sink_, 512, properties)); - this->ReadAndCheckSingleColumnTable(values); -} + Status ReadAt( + int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override { + RETURN_NOT_OK(Seek(position)); + return Read(nbytes, bytes_read, buffer); + } -TEST_F(TestUInt32ParquetIO, Parquet_1_0_Compability) { - // This also tests max_definition_level = 1 - std::shared_ptr values = NullableArray(LARGE_SIZE, 100); - std::shared_ptr
table = MakeSimpleTable(values, true); - - // Parquet 1.0 returns an int64_t column as there is no way to tell a Parquet 1.0 - // reader that a column is unsigned. - this->sink_ = std::make_shared(); - std::shared_ptr<::parquet::WriterProperties> properties = - ::parquet::WriterProperties::Builder() - .version(ParquetVersion::PARQUET_1_0) - ->build(); - ASSERT_OK_NO_THROW( - WriteFlatTable(table.get(), default_memory_pool(), this->sink_, 512, properties)); - - std::shared_ptr expected_values; - std::shared_ptr int64_data = - std::make_shared(default_memory_pool()); - { - ASSERT_OK(int64_data->Resize(sizeof(int64_t) * values->length())); - int64_t* int64_data_ptr = reinterpret_cast(int64_data->mutable_data()); - const uint32_t* uint32_data_ptr = - reinterpret_cast(values->data()->data()); - // std::copy might be faster but this is explicit on the casts) - for (int64_t i = 0; i < values->length(); i++) { - int64_data_ptr[i] = static_cast(uint32_data_ptr[i]); - } + Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override { + memcpy(buffer, buffer_ + position_, nbytes); + *bytes_read = std::min(nbytes, buffer_size_ - position_); + position_ += *bytes_read; + return Status::OK(); } - ASSERT_OK(MakePrimitiveArray(std::make_shared(), values->length(), - int64_data, values->null_count(), values->null_bitmap(), &expected_values)); - this->ReadAndCheckSingleColumnTable(expected_values); -} -template -using ParquetCDataType = typename ParquetDataType::c_type; + Status GetSize(int64_t* size) override { + *size = buffer_size_; + return Status::OK(); + } -template -class TestPrimitiveParquetIO : public TestParquetIO { - public: - typedef typename TestType::c_type T; - - void MakeTestFile(std::vector& values, int num_chunks, - std::unique_ptr* file_reader) { - std::shared_ptr schema = this->MakeSchema(Repetition::REQUIRED); - std::unique_ptr file_writer = this->MakeWriter(schema); - size_t chunk_size = values.size() / num_chunks; - // Convert to Parquet's expected physical type - std::vector values_buffer( - sizeof(ParquetCDataType) * values.size()); - auto values_parquet = - reinterpret_cast*>(values_buffer.data()); - std::copy(values.cbegin(), values.cend(), values_parquet); - for (int i = 0; i < num_chunks; i++) { - auto row_group_writer = file_writer->AppendRowGroup(chunk_size); - auto column_writer = - static_cast*>(row_group_writer->NextColumn()); - ParquetCDataType* data = values_parquet + i * chunk_size; - column_writer->WriteBatch(chunk_size, nullptr, nullptr, data); - column_writer->Close(); - row_group_writer->Close(); + Status Seek(int64_t position) override { + if (position < 0 || position >= buffer_size_) { + return Status::IOError("position out of bounds"); } - file_writer->Close(); - *file_reader = this->ReaderFromSink(); + + position_ = position; + return Status::OK(); } - void CheckSingleColumnRequiredTableRead(int num_chunks) { - std::vector values(SMALL_SIZE, test_traits::value); - std::unique_ptr file_reader; - ASSERT_NO_THROW(MakeTestFile(values, num_chunks, &file_reader)); + private: + const uint8_t* buffer_; + int buffer_size_; + int64_t position_; +}; - std::shared_ptr
out; - this->ReadTableFromFile(std::move(file_reader), &out); - ASSERT_EQ(1, out->num_columns()); - ASSERT_EQ(SMALL_SIZE, out->num_rows()); +TEST(TestParquetReadSource, Basics) { + std::string data = "this is the data"; + auto data_buffer = reinterpret_cast(data.c_str()); - std::shared_ptr chunked_array = out->column(0)->data(); - ASSERT_EQ(1, chunked_array->num_chunks()); - ExpectArray(values.data(), chunked_array->chunk(0).get()); - } + ParquetAllocator allocator; + auto file = std::make_shared(data_buffer, data.size()); + auto source = std::make_shared(file, &allocator); - void CheckSingleColumnRequiredRead(int num_chunks) { - std::vector values(SMALL_SIZE, test_traits::value); - std::unique_ptr file_reader; - ASSERT_NO_THROW(MakeTestFile(values, num_chunks, &file_reader)); + ASSERT_EQ(0, source->Tell()); + ASSERT_NO_THROW(source->Seek(5)); + ASSERT_EQ(5, source->Tell()); + ASSERT_NO_THROW(source->Seek(0)); - std::shared_ptr out; - this->ReadSingleColumnFile(std::move(file_reader), &out); - - ExpectArray(values.data(), out.get()); - } -}; + // Seek out of bounds + ASSERT_THROW(source->Seek(100), ::parquet::ParquetException); -typedef ::testing::Types PrimitiveTestTypes; + uint8_t buffer[50]; -TYPED_TEST_CASE(TestPrimitiveParquetIO, PrimitiveTestTypes); + ASSERT_NO_THROW(source->Read(4, buffer)); + ASSERT_EQ(0, std::memcmp(buffer, "this", 4)); + ASSERT_EQ(4, source->Tell()); -TYPED_TEST(TestPrimitiveParquetIO, SingleColumnRequiredRead) { - this->CheckSingleColumnRequiredRead(1); -} + std::shared_ptr<::parquet::Buffer> pq_buffer; -TYPED_TEST(TestPrimitiveParquetIO, SingleColumnRequiredTableRead) { - this->CheckSingleColumnRequiredTableRead(1); -} + ASSERT_NO_THROW(pq_buffer = source->Read(7)); -TYPED_TEST(TestPrimitiveParquetIO, SingleColumnRequiredChunkedRead) { - this->CheckSingleColumnRequiredRead(4); -} + auto expected_buffer = std::make_shared<::parquet::Buffer>(data_buffer + 4, 7); -TYPED_TEST(TestPrimitiveParquetIO, SingleColumnRequiredChunkedTableRead) { - this->CheckSingleColumnRequiredTableRead(4); + ASSERT_TRUE(expected_buffer->Equals(*pq_buffer.get())); } } // namespace parquet - } // namespace arrow diff --git a/cpp/src/arrow/parquet/parquet-reader-writer-test.cc b/cpp/src/arrow/parquet/parquet-reader-writer-test.cc new file mode 100644 index 0000000000000..bfc27d26d63a1 --- /dev/null +++ b/cpp/src/arrow/parquet/parquet-reader-writer-test.cc @@ -0,0 +1,489 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include "gtest/gtest.h" + +#include "arrow/test-util.h" +#include "arrow/parquet/test-util.h" +#include "arrow/parquet/reader.h" +#include "arrow/parquet/writer.h" +#include "arrow/types/construct.h" +#include "arrow/types/primitive.h" +#include "arrow/types/string.h" +#include "arrow/util/memory-pool.h" +#include "arrow/util/status.h" + +#include "parquet/api/reader.h" +#include "parquet/api/writer.h" + +using ParquetBuffer = parquet::Buffer; +using parquet::BufferReader; +using parquet::default_writer_properties; +using parquet::InMemoryOutputStream; +using parquet::LogicalType; +using parquet::ParquetFileReader; +using parquet::ParquetFileWriter; +using parquet::RandomAccessSource; +using parquet::Repetition; +using parquet::SchemaDescriptor; +using parquet::ParquetVersion; +using ParquetType = parquet::Type; +using parquet::schema::GroupNode; +using parquet::schema::NodePtr; +using parquet::schema::PrimitiveNode; + +namespace arrow { + +namespace parquet { + +const int SMALL_SIZE = 100; +const int LARGE_SIZE = 10000; + +template +struct test_traits {}; + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::BOOLEAN; + static constexpr LogicalType::type logical_enum = LogicalType::NONE; + static uint8_t const value; +}; + +const uint8_t test_traits::value(1); + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::INT32; + static constexpr LogicalType::type logical_enum = LogicalType::UINT_8; + static uint8_t const value; +}; + +const uint8_t test_traits::value(64); + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::INT32; + static constexpr LogicalType::type logical_enum = LogicalType::INT_8; + static int8_t const value; +}; + +const int8_t test_traits::value(-64); + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::INT32; + static constexpr LogicalType::type logical_enum = LogicalType::UINT_16; + static uint16_t const value; +}; + +const uint16_t test_traits::value(1024); + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::INT32; + static constexpr LogicalType::type logical_enum = LogicalType::INT_16; + static int16_t const value; +}; + +const int16_t test_traits::value(-1024); + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::INT32; + static constexpr LogicalType::type logical_enum = LogicalType::UINT_32; + static uint32_t const value; +}; + +const uint32_t test_traits::value(1024); + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::INT32; + static constexpr LogicalType::type logical_enum = LogicalType::NONE; + static int32_t const value; +}; + +const int32_t test_traits::value(-1024); + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::INT64; + static constexpr LogicalType::type logical_enum = LogicalType::UINT_64; + static uint64_t const value; +}; + +const uint64_t test_traits::value(1024); + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::INT64; + static constexpr LogicalType::type logical_enum = LogicalType::NONE; + static int64_t const value; +}; + +const int64_t test_traits::value(-1024); + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::FLOAT; + static constexpr LogicalType::type 
logical_enum = LogicalType::NONE; + static float const value; +}; + +const float test_traits::value(2.1f); + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::DOUBLE; + static constexpr LogicalType::type logical_enum = LogicalType::NONE; + static double const value; +}; + +const double test_traits::value(4.2); + +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::BYTE_ARRAY; + static constexpr LogicalType::type logical_enum = LogicalType::UTF8; + static std::string const value; +}; + +const std::string test_traits::value("Test"); + +template +using ParquetDataType = ::parquet::DataType::parquet_enum>; + +template +using ParquetWriter = ::parquet::TypedColumnWriter>; + +template +class TestParquetIO : public ::testing::Test { + public: + virtual void SetUp() {} + + std::shared_ptr MakeSchema(Repetition::type repetition) { + auto pnode = PrimitiveNode::Make("column1", repetition, + test_traits::parquet_enum, test_traits::logical_enum); + NodePtr node_ = + GroupNode::Make("schema", Repetition::REQUIRED, std::vector({pnode})); + return std::static_pointer_cast(node_); + } + + std::unique_ptr MakeWriter( + const std::shared_ptr& schema) { + sink_ = std::make_shared(); + return ParquetFileWriter::Open(sink_, schema); + } + + std::unique_ptr ReaderFromSink() { + std::shared_ptr buffer = sink_->GetBuffer(); + std::unique_ptr source(new BufferReader(buffer)); + return ParquetFileReader::Open(std::move(source)); + } + + void ReadSingleColumnFile( + std::unique_ptr file_reader, std::shared_ptr* out) { + arrow::parquet::FileReader reader(default_memory_pool(), std::move(file_reader)); + std::unique_ptr column_reader; + ASSERT_OK_NO_THROW(reader.GetFlatColumn(0, &column_reader)); + ASSERT_NE(nullptr, column_reader.get()); + + ASSERT_OK(column_reader->NextBatch(SMALL_SIZE, out)); + ASSERT_NE(nullptr, out->get()); + } + + void ReadAndCheckSingleColumnFile(Array* values) { + std::shared_ptr out; + ReadSingleColumnFile(ReaderFromSink(), &out); + ASSERT_TRUE(values->Equals(out)); + } + + void ReadTableFromFile( + std::unique_ptr file_reader, std::shared_ptr
* out) { + arrow::parquet::FileReader reader(default_memory_pool(), std::move(file_reader)); + ASSERT_OK_NO_THROW(reader.ReadFlatTable(out)); + ASSERT_NE(nullptr, out->get()); + } + + void ReadAndCheckSingleColumnTable(const std::shared_ptr& values) { + std::shared_ptr
out; + ReadTableFromFile(ReaderFromSink(), &out); + ASSERT_EQ(1, out->num_columns()); + ASSERT_EQ(values->length(), out->num_rows()); + + std::shared_ptr chunked_array = out->column(0)->data(); + ASSERT_EQ(1, chunked_array->num_chunks()); + ASSERT_TRUE(values->Equals(chunked_array->chunk(0))); + } + + template + void WriteFlatColumn(const std::shared_ptr& schema, + const std::shared_ptr& values) { + FileWriter writer(default_memory_pool(), MakeWriter(schema)); + ASSERT_OK_NO_THROW(writer.NewRowGroup(values->length())); + ASSERT_OK_NO_THROW(writer.WriteFlatColumnChunk(values.get())); + ASSERT_OK_NO_THROW(writer.Close()); + } + + std::shared_ptr sink_; +}; + +// We habe separate tests for UInt32Type as this is currently the only type +// where a roundtrip does not yield the identical Array structure. +// There we write an UInt32 Array but receive an Int64 Array as result for +// Parquet version 1.0. + +typedef ::testing::Types TestTypes; + +TYPED_TEST_CASE(TestParquetIO, TestTypes); + +TYPED_TEST(TestParquetIO, SingleColumnRequiredWrite) { + auto values = NonNullArray(SMALL_SIZE); + + std::shared_ptr schema = this->MakeSchema(Repetition::REQUIRED); + this->WriteFlatColumn(schema, values); + + this->ReadAndCheckSingleColumnFile(values.get()); +} + +TYPED_TEST(TestParquetIO, SingleColumnTableRequiredWrite) { + auto values = NonNullArray(SMALL_SIZE); + std::shared_ptr
table = MakeSimpleTable(values, false); + this->sink_ = std::make_shared(); + ASSERT_OK_NO_THROW(WriteFlatTable(table.get(), default_memory_pool(), this->sink_, + values->length(), default_writer_properties())); + + std::shared_ptr
out; + this->ReadTableFromFile(this->ReaderFromSink(), &out); + ASSERT_EQ(1, out->num_columns()); + ASSERT_EQ(100, out->num_rows()); + + std::shared_ptr chunked_array = out->column(0)->data(); + ASSERT_EQ(1, chunked_array->num_chunks()); + ASSERT_TRUE(values->Equals(chunked_array->chunk(0))); +} + +TYPED_TEST(TestParquetIO, SingleColumnOptionalReadWrite) { + // This also tests max_definition_level = 1 + auto values = NullableArray(SMALL_SIZE, 10); + + std::shared_ptr schema = this->MakeSchema(Repetition::OPTIONAL); + this->WriteFlatColumn(schema, values); + + this->ReadAndCheckSingleColumnFile(values.get()); +} + +TYPED_TEST(TestParquetIO, SingleColumnTableOptionalReadWrite) { + // This also tests max_definition_level = 1 + std::shared_ptr values = NullableArray(SMALL_SIZE, 10); + std::shared_ptr
table = MakeSimpleTable(values, true); + this->sink_ = std::make_shared(); + ASSERT_OK_NO_THROW(WriteFlatTable(table.get(), default_memory_pool(), this->sink_, + values->length(), default_writer_properties())); + + this->ReadAndCheckSingleColumnTable(values); +} + +TYPED_TEST(TestParquetIO, SingleColumnRequiredChunkedWrite) { + auto values = NonNullArray(SMALL_SIZE); + int64_t chunk_size = values->length() / 4; + + std::shared_ptr schema = this->MakeSchema(Repetition::REQUIRED); + FileWriter writer(default_memory_pool(), this->MakeWriter(schema)); + for (int i = 0; i < 4; i++) { + ASSERT_OK_NO_THROW(writer.NewRowGroup(chunk_size)); + ASSERT_OK_NO_THROW( + writer.WriteFlatColumnChunk(values.get(), i * chunk_size, chunk_size)); + } + ASSERT_OK_NO_THROW(writer.Close()); + + this->ReadAndCheckSingleColumnFile(values.get()); +} + +TYPED_TEST(TestParquetIO, SingleColumnTableRequiredChunkedWrite) { + auto values = NonNullArray(LARGE_SIZE); + std::shared_ptr
table = MakeSimpleTable(values, false); + this->sink_ = std::make_shared(); + ASSERT_OK_NO_THROW(WriteFlatTable( + table.get(), default_memory_pool(), this->sink_, 512, default_writer_properties())); + + this->ReadAndCheckSingleColumnTable(values); +} + +TYPED_TEST(TestParquetIO, SingleColumnOptionalChunkedWrite) { + int64_t chunk_size = SMALL_SIZE / 4; + auto values = NullableArray(SMALL_SIZE, 10); + + std::shared_ptr schema = this->MakeSchema(Repetition::OPTIONAL); + FileWriter writer(default_memory_pool(), this->MakeWriter(schema)); + for (int i = 0; i < 4; i++) { + ASSERT_OK_NO_THROW(writer.NewRowGroup(chunk_size)); + ASSERT_OK_NO_THROW( + writer.WriteFlatColumnChunk(values.get(), i * chunk_size, chunk_size)); + } + ASSERT_OK_NO_THROW(writer.Close()); + + this->ReadAndCheckSingleColumnFile(values.get()); +} + +TYPED_TEST(TestParquetIO, SingleColumnTableOptionalChunkedWrite) { + // This also tests max_definition_level = 1 + auto values = NullableArray(LARGE_SIZE, 100); + std::shared_ptr
table = MakeSimpleTable(values, true); + this->sink_ = std::make_shared(); + ASSERT_OK_NO_THROW(WriteFlatTable( + table.get(), default_memory_pool(), this->sink_, 512, default_writer_properties())); + + this->ReadAndCheckSingleColumnTable(values); +} + +using TestUInt32ParquetIO = TestParquetIO; + +TEST_F(TestUInt32ParquetIO, Parquet_2_0_Compability) { + // This also tests max_definition_level = 1 + std::shared_ptr values = NullableArray(LARGE_SIZE, 100); + std::shared_ptr
table = MakeSimpleTable(values, true); + + // Parquet 2.0 roundtrip should yield an uint32_t column again + this->sink_ = std::make_shared(); + std::shared_ptr<::parquet::WriterProperties> properties = + ::parquet::WriterProperties::Builder() + .version(ParquetVersion::PARQUET_2_0) + ->build(); + ASSERT_OK_NO_THROW( + WriteFlatTable(table.get(), default_memory_pool(), this->sink_, 512, properties)); + this->ReadAndCheckSingleColumnTable(values); +} + +TEST_F(TestUInt32ParquetIO, Parquet_1_0_Compability) { + // This also tests max_definition_level = 1 + std::shared_ptr values = NullableArray(LARGE_SIZE, 100); + std::shared_ptr
table = MakeSimpleTable(values, true); + + // Parquet 1.0 returns an int64_t column as there is no way to tell a Parquet 1.0 + // reader that a column is unsigned. + this->sink_ = std::make_shared(); + std::shared_ptr<::parquet::WriterProperties> properties = + ::parquet::WriterProperties::Builder() + .version(ParquetVersion::PARQUET_1_0) + ->build(); + ASSERT_OK_NO_THROW( + WriteFlatTable(table.get(), default_memory_pool(), this->sink_, 512, properties)); + + std::shared_ptr expected_values; + std::shared_ptr int64_data = + std::make_shared(default_memory_pool()); + { + ASSERT_OK(int64_data->Resize(sizeof(int64_t) * values->length())); + int64_t* int64_data_ptr = reinterpret_cast(int64_data->mutable_data()); + const uint32_t* uint32_data_ptr = + reinterpret_cast(values->data()->data()); + // std::copy might be faster but this is explicit on the casts) + for (int64_t i = 0; i < values->length(); i++) { + int64_data_ptr[i] = static_cast(uint32_data_ptr[i]); + } + } + ASSERT_OK(MakePrimitiveArray(std::make_shared(), values->length(), + int64_data, values->null_count(), values->null_bitmap(), &expected_values)); + this->ReadAndCheckSingleColumnTable(expected_values); +} + +template +using ParquetCDataType = typename ParquetDataType::c_type; + +template +class TestPrimitiveParquetIO : public TestParquetIO { + public: + typedef typename TestType::c_type T; + + void MakeTestFile(std::vector& values, int num_chunks, + std::unique_ptr* file_reader) { + std::shared_ptr schema = this->MakeSchema(Repetition::REQUIRED); + std::unique_ptr file_writer = this->MakeWriter(schema); + size_t chunk_size = values.size() / num_chunks; + // Convert to Parquet's expected physical type + std::vector values_buffer( + sizeof(ParquetCDataType) * values.size()); + auto values_parquet = + reinterpret_cast*>(values_buffer.data()); + std::copy(values.cbegin(), values.cend(), values_parquet); + for (int i = 0; i < num_chunks; i++) { + auto row_group_writer = file_writer->AppendRowGroup(chunk_size); + auto column_writer = + static_cast*>(row_group_writer->NextColumn()); + ParquetCDataType* data = values_parquet + i * chunk_size; + column_writer->WriteBatch(chunk_size, nullptr, nullptr, data); + column_writer->Close(); + row_group_writer->Close(); + } + file_writer->Close(); + *file_reader = this->ReaderFromSink(); + } + + void CheckSingleColumnRequiredTableRead(int num_chunks) { + std::vector values(SMALL_SIZE, test_traits::value); + std::unique_ptr file_reader; + ASSERT_NO_THROW(MakeTestFile(values, num_chunks, &file_reader)); + + std::shared_ptr
out; + this->ReadTableFromFile(std::move(file_reader), &out); + ASSERT_EQ(1, out->num_columns()); + ASSERT_EQ(SMALL_SIZE, out->num_rows()); + + std::shared_ptr chunked_array = out->column(0)->data(); + ASSERT_EQ(1, chunked_array->num_chunks()); + ExpectArray(values.data(), chunked_array->chunk(0).get()); + } + + void CheckSingleColumnRequiredRead(int num_chunks) { + std::vector values(SMALL_SIZE, test_traits::value); + std::unique_ptr file_reader; + ASSERT_NO_THROW(MakeTestFile(values, num_chunks, &file_reader)); + + std::shared_ptr out; + this->ReadSingleColumnFile(std::move(file_reader), &out); + + ExpectArray(values.data(), out.get()); + } +}; + +typedef ::testing::Types PrimitiveTestTypes; + +TYPED_TEST_CASE(TestPrimitiveParquetIO, PrimitiveTestTypes); + +TYPED_TEST(TestPrimitiveParquetIO, SingleColumnRequiredRead) { + this->CheckSingleColumnRequiredRead(1); +} + +TYPED_TEST(TestPrimitiveParquetIO, SingleColumnRequiredTableRead) { + this->CheckSingleColumnRequiredTableRead(1); +} + +TYPED_TEST(TestPrimitiveParquetIO, SingleColumnRequiredChunkedRead) { + this->CheckSingleColumnRequiredRead(4); +} + +TYPED_TEST(TestPrimitiveParquetIO, SingleColumnRequiredChunkedTableRead) { + this->CheckSingleColumnRequiredTableRead(4); +} + +} // namespace parquet + +} // namespace arrow diff --git a/cpp/src/arrow/parquet/utils.h b/cpp/src/arrow/parquet/utils.h index 409bcd9065cda..bcc46be60e6ec 100644 --- a/cpp/src/arrow/parquet/utils.h +++ b/cpp/src/arrow/parquet/utils.h @@ -18,12 +18,12 @@ #ifndef ARROW_PARQUET_UTILS_H #define ARROW_PARQUET_UTILS_H -#include "arrow/util/status.h" +#include +#include "arrow/util/status.h" #include "parquet/exception.h" namespace arrow { - namespace parquet { #define PARQUET_CATCH_NOT_OK(s) \ @@ -36,8 +36,17 @@ namespace parquet { (s); \ } catch (const ::parquet::ParquetException& e) {} -} // namespace parquet +#define PARQUET_THROW_NOT_OK(s) \ + do { \ + ::arrow::Status _s = (s); \ + if (!_s.ok()) { \ + std::stringstream ss; \ + ss << "Arrow error: " << _s.ToString(); \ + throw ::parquet::ParquetException(ss.str()); \ + } \ + } while (0); +} // namespace parquet } // namespace arrow #endif // ARROW_PARQUET_UTILS_H diff --git a/python/pyarrow/includes/libarrow_io.pxd b/python/pyarrow/includes/libarrow_io.pxd index d874ba3091237..d0fb8f9f000b9 100644 --- a/python/pyarrow/includes/libarrow_io.pxd +++ b/python/pyarrow/includes/libarrow_io.pxd @@ -51,17 +51,17 @@ cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: cdef cppclass HdfsReadableFile(CHdfsFile): CStatus GetSize(int64_t* size) - CStatus Read(int32_t nbytes, int32_t* bytes_read, + CStatus Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) - CStatus ReadAt(int64_t position, int32_t nbytes, - int32_t* bytes_read, uint8_t* buffer) + CStatus ReadAt(int64_t position, int64_t nbytes, + int64_t* bytes_read, uint8_t* buffer) cdef cppclass HdfsWriteableFile(CHdfsFile): - CStatus Write(const uint8_t* buffer, int32_t nbytes) + CStatus Write(const uint8_t* buffer, int64_t nbytes) - CStatus Write(const uint8_t* buffer, int32_t nbytes, - int32_t* bytes_written) + CStatus Write(const uint8_t* buffer, int64_t nbytes, + int64_t* bytes_written) cdef cppclass CHdfsClient" arrow::io::HdfsClient": @staticmethod diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index 8b97671e45373..071eea5ba6e60 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -383,7 +383,7 @@ cdef class HdfsFile: Read indicated number of bytes from the file, up to EOF """ cdef: - int32_t bytes_read = 0 + int64_t bytes_read 
= 0 uint8_t* buf self._assert_readable() @@ -394,7 +394,7 @@ cdef class HdfsFile: if buf == NULL: raise MemoryError("Failed to allocate {0} bytes".format(nbytes)) - cdef int32_t total_bytes = 0 + cdef int64_t total_bytes = 0 cdef int rpc_chunksize = min(self.buffer_size, nbytes) @@ -423,7 +423,7 @@ cdef class HdfsFile: memory). First seeks to the beginning of the file. """ cdef: - int32_t bytes_read = 0 + int64_t bytes_read = 0 uint8_t* buf self._assert_readable() @@ -499,6 +499,6 @@ cdef class HdfsFile: data = tobytes(data) cdef const uint8_t* buf = cp.PyBytes_AS_STRING(data) - cdef int32_t bufsize = len(data) + cdef int64_t bufsize = len(data) with nogil: check_cstatus(self.wr_file.get().Write(buf, bufsize)) From 62390d8427445b033ba7f7cf3150184222d2c2c1 Mon Sep 17 00:00:00 2001 From: Micah Kornfield Date: Tue, 12 Jul 2016 17:34:36 -0700 Subject: [PATCH 0097/1644] ARROW-106: [C++] Add IPC to binary/string types Author: Micah Kornfield Closes #103 from emkornfield/emk_add_string_rpc and squashes the following commits: 9c563fe [Micah Kornfield] ARROW-106: [C++] Add IPC to binary/string types --- cpp/src/arrow/ipc/adapter.cc | 10 ++---- cpp/src/arrow/ipc/ipc-adapter-test.cc | 52 +++++++++++++++++++++++++-- cpp/src/arrow/types/construct.cc | 4 +++ 3 files changed, 57 insertions(+), 9 deletions(-) diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index 45cc288cd6b9e..bac1172700615 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -33,6 +33,7 @@ #include "arrow/types/construct.h" #include "arrow/types/list.h" #include "arrow/types/primitive.h" +#include "arrow/types/string.h" #include "arrow/util/buffer.h" #include "arrow/util/logging.h" #include "arrow/util/status.h" @@ -81,14 +82,9 @@ static bool IsListType(const DataType* type) { // code consider using pattern like: // http://stackoverflow.com/questions/26784685/c-macro-for-calling-function-based-on-enum-type // - // TODO(emkornfield) Fix type systems so these are all considered lists and - // the types behave the same way? 
- // case Type::BINARY: - // case Type::CHAR: + case Type::BINARY: case Type::LIST: - // see todo on common types - // case Type::STRING: - // case Type::VARCHAR: + case Type::STRING: return true; default: return false; diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc index eb47ac6fee8a1..2bfb459d6e06a 100644 --- a/cpp/src/arrow/ipc/ipc-adapter-test.cc +++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc @@ -31,6 +31,7 @@ #include "arrow/test-util.h" #include "arrow/types/list.h" #include "arrow/types/primitive.h" +#include "arrow/types/string.h" #include "arrow/util/bit-util.h" #include "arrow/util/buffer.h" #include "arrow/util/memory-pool.h" @@ -105,6 +106,52 @@ Status MakeIntRowBatch(std::shared_ptr* out) { return Status::OK(); } +template +Status MakeRandomBinaryArray( + const TypePtr& type, int32_t length, MemoryPool* pool, ArrayPtr* array) { + const std::vector values = { + "", "", "abc", "123", "efg", "456!@#!@#", "12312"}; + Builder builder(pool, type); + const auto values_len = values.size(); + for (int32_t i = 0; i < length; ++i) { + int values_index = i % values_len; + if (values_index == 0) { + RETURN_NOT_OK(builder.AppendNull()); + } else { + const std::string& value = values[values_index]; + RETURN_NOT_OK( + builder.Append(reinterpret_cast(value.data()), value.size())); + } + } + *array = builder.Finish(); + return Status::OK(); +} + +Status MakeStringTypesRowBatch(std::shared_ptr* out) { + const int32_t length = 500; + auto string_type = std::make_shared(); + auto binary_type = std::make_shared(); + auto f0 = std::make_shared("f0", string_type); + auto f1 = std::make_shared("f1", binary_type); + std::shared_ptr schema(new Schema({f0, f1})); + + std::shared_ptr a0, a1; + MemoryPool* pool = default_memory_pool(); + + { + auto status = + MakeRandomBinaryArray(string_type, length, pool, &a0); + RETURN_NOT_OK(status); + } + { + auto status = + MakeRandomBinaryArray(binary_type, length, pool, &a1); + RETURN_NOT_OK(status); + } + out->reset(new RowBatch(schema, length, {a0, a1})); + return Status::OK(); +} + Status MakeListRowBatch(std::shared_ptr* out) { // Make the schema auto f0 = std::make_shared("f0", LIST_INT32); @@ -191,9 +238,10 @@ Status MakeDeeplyNestedList(std::shared_ptr* out) { return Status::OK(); } -INSTANTIATE_TEST_CASE_P(RoundTripTests, TestWriteRowBatch, +INSTANTIATE_TEST_CASE_P( + RoundTripTests, TestWriteRowBatch, ::testing::Values(&MakeIntRowBatch, &MakeListRowBatch, &MakeNonNullRowBatch, - &MakeZeroLengthRowBatch, &MakeDeeplyNestedList)); + &MakeZeroLengthRowBatch, &MakeDeeplyNestedList, &MakeStringTypesRowBatch)); void TestGetRowBatchSize(std::shared_ptr batch) { MockMemorySource mock_source(1 << 16); diff --git a/cpp/src/arrow/types/construct.cc b/cpp/src/arrow/types/construct.cc index 2d913a737486f..5ae9c5ab6d4f9 100644 --- a/cpp/src/arrow/types/construct.cc +++ b/cpp/src/arrow/types/construct.cc @@ -124,9 +124,13 @@ Status MakeListArray(const TypePtr& type, int32_t length, const std::shared_ptr& null_bitmap, ArrayPtr* out) { switch (type->type) { case Type::BINARY: + out->reset(new BinaryArray(type, length, offsets, values, null_count, null_bitmap)); + break; + case Type::LIST: out->reset(new ListArray(type, length, offsets, values, null_count, null_bitmap)); break; + case Type::DECIMAL_TEXT: case Type::STRING: out->reset(new StringArray(type, length, offsets, values, null_count, null_bitmap)); From 55bfa834312685991d615301ac0b4fcc7c11640b Mon Sep 17 00:00:00 2001 From: Jihoon Son Date: Mon, 18 Jul 2016 15:07:48 -0700 
Subject: [PATCH 0098/1644] =?UTF-8?q?ARROW-238:=20Change=20InternalMemoryP?= =?UTF-8?q?ool::Free()=20to=20return=20Status::Invalid=20when=20ther?= =?UTF-8?q?=E2=80=A6?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit …e is insufficient memory. Author: Jihoon Son Closes #102 from jihoonson/ARROW-238 and squashes the following commits: cb9e7b1 [Jihoon Son] Disable FreeLargeMemory test for release builds f903130 [Jihoon Son] Free allocated memory after death 0077a70 [Jihoon Son] Adjust the amount of memory allocation b1af59b [Jihoon Son] Change to ASSERT_EXIT b4159f0 [Jihoon Son] Reflect comments e89a1f9 [Jihoon Son] Change python implementation as well. 7651570 [Jihoon Son] Change InternalMemoryPool::Free() to return Status::Invalid when there is insufficient memory. --- cpp/src/arrow/util/logging.h | 2 +- cpp/src/arrow/util/memory-pool-test.cc | 14 ++++++++++++++ cpp/src/arrow/util/memory-pool.cc | 2 ++ 3 files changed, 17 insertions(+), 1 deletion(-) diff --git a/cpp/src/arrow/util/logging.h b/cpp/src/arrow/util/logging.h index fccc5e3085de5..54f67593bec5e 100644 --- a/cpp/src/arrow/util/logging.h +++ b/cpp/src/arrow/util/logging.h @@ -40,7 +40,7 @@ namespace arrow { #define ARROW_CHECK(condition) \ (condition) ? 0 : ::arrow::internal::FatalLog(ARROW_FATAL) \ - << __FILE__ << __LINE__ << "Check failed: " #condition " " + << __FILE__ << __LINE__ << " Check failed: " #condition " " #ifdef NDEBUG #define ARROW_DFATAL ARROW_WARNING diff --git a/cpp/src/arrow/util/memory-pool-test.cc b/cpp/src/arrow/util/memory-pool-test.cc index 8e7dfd60baa62..919f3740982cf 100644 --- a/cpp/src/arrow/util/memory-pool-test.cc +++ b/cpp/src/arrow/util/memory-pool-test.cc @@ -46,4 +46,18 @@ TEST(DefaultMemoryPool, OOM) { ASSERT_RAISES(OutOfMemory, pool->Allocate(to_alloc, &data)); } +TEST(DefaultMemoryPoolDeathTest, FreeLargeMemory) { + MemoryPool* pool = default_memory_pool(); + + uint8_t* data; + ASSERT_OK(pool->Allocate(100, &data)); + +#ifndef NDEBUG + EXPECT_EXIT(pool->Free(data, 120), ::testing::ExitedWithCode(1), + ".*Check failed: \\(bytes_allocated_\\) >= \\(size\\)"); +#endif + + pool->Free(data, 100); +} + } // namespace arrow diff --git a/cpp/src/arrow/util/memory-pool.cc b/cpp/src/arrow/util/memory-pool.cc index 0a58e5aa21f72..fed149bc3598c 100644 --- a/cpp/src/arrow/util/memory-pool.cc +++ b/cpp/src/arrow/util/memory-pool.cc @@ -23,6 +23,7 @@ #include #include "arrow/util/status.h" +#include "arrow/util/logging.h" namespace arrow { @@ -81,6 +82,7 @@ int64_t InternalMemoryPool::bytes_allocated() const { void InternalMemoryPool::Free(uint8_t* buffer, int64_t size) { std::lock_guard guard(pool_lock_); + DCHECK_GE(bytes_allocated_, size); std::free(buffer); bytes_allocated_ -= size; } From 59e5f9806515e8a5360870c93082316f74d7ec7c Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 18 Jul 2016 15:37:27 -0700 Subject: [PATCH 0099/1644] ARROW-236: Bridging IO interfaces under the hood in pyarrow Author: Wes McKinney Closes #104 from wesm/ARROW-236 and squashes the following commits: 73648e0 [Wes McKinney] cpplint f2cd77f [Wes McKinney] Check in io.pxd 94bcd30 [Wes McKinney] Do not let Parquet close an Arrow file 9b9d94d [Wes McKinney] Barely working direct HDFS-Parquet reads 06ddd06 [Wes McKinney] Slight refactoring of read table to be able to also handle classes wrapping C++ file interfaces c7a913e [Wes McKinney] Provide a means to expose abstract native file handles e6724de [Wes McKinney] Implement alternate ctor to construct parquet::FileReader from an 
arrow::io::RandomAccessFile --- cpp/src/arrow/io/interfaces.h | 1 + cpp/src/arrow/parquet/io.cc | 19 +++++-- cpp/src/arrow/parquet/io.h | 10 +++- cpp/src/arrow/parquet/parquet-io-test.cc | 8 ++- cpp/src/arrow/parquet/reader.cc | 20 +++++++ cpp/src/arrow/parquet/reader.h | 13 ++++- cpp/src/arrow/parquet/writer.cc | 1 - cpp/src/arrow/parquet/writer.h | 2 +- python/pyarrow/includes/libarrow_io.pxd | 49 +++++++++++------ python/pyarrow/includes/parquet.pxd | 24 +++++++- python/pyarrow/io.pxd | 32 +++++++++++ python/pyarrow/io.pyx | 19 ++++++- python/pyarrow/parquet.pyx | 70 +++++++++++++++++++----- 13 files changed, 216 insertions(+), 52 deletions(-) create mode 100644 python/pyarrow/io.pxd diff --git a/cpp/src/arrow/io/interfaces.h b/cpp/src/arrow/io/interfaces.h index 25361d5633d12..c21285253714e 100644 --- a/cpp/src/arrow/io/interfaces.h +++ b/cpp/src/arrow/io/interfaces.h @@ -19,6 +19,7 @@ #define ARROW_IO_INTERFACES_H #include +#include namespace arrow { diff --git a/cpp/src/arrow/parquet/io.cc b/cpp/src/arrow/parquet/io.cc index c81aa8c4da9ca..b6fdd67d15b6c 100644 --- a/cpp/src/arrow/parquet/io.cc +++ b/cpp/src/arrow/parquet/io.cc @@ -55,12 +55,23 @@ void ParquetAllocator::Free(uint8_t* buffer, int64_t size) { // ---------------------------------------------------------------------- // ParquetReadSource -ParquetReadSource::ParquetReadSource( - const std::shared_ptr& file, ParquetAllocator* allocator) - : file_(file), allocator_(allocator) {} +ParquetReadSource::ParquetReadSource(ParquetAllocator* allocator) + : file_(nullptr), allocator_(allocator) {} + +Status ParquetReadSource::Open(const std::shared_ptr& file) { + int64_t file_size; + RETURN_NOT_OK(file->GetSize(&file_size)); + + file_ = file; + size_ = file_size; + return Status::OK(); +} void ParquetReadSource::Close() { - PARQUET_THROW_NOT_OK(file_->Close()); + // TODO(wesm): Make this a no-op for now. This leaves Python wrappers for + // these classes in a borked state. Probably better to explicitly close. 
+ + // PARQUET_THROW_NOT_OK(file_->Close()); } int64_t ParquetReadSource::Tell() const { diff --git a/cpp/src/arrow/parquet/io.h b/cpp/src/arrow/parquet/io.h index ef8871da4df61..1c59695c6c151 100644 --- a/cpp/src/arrow/parquet/io.h +++ b/cpp/src/arrow/parquet/io.h @@ -49,7 +49,9 @@ class ARROW_EXPORT ParquetAllocator : public ::parquet::MemoryAllocator { uint8_t* Malloc(int64_t size) override; void Free(uint8_t* buffer, int64_t size) override; - MemoryPool* pool() { return pool_; } + void set_pool(MemoryPool* pool) { pool_ = pool; } + + MemoryPool* pool() const { return pool_; } private: MemoryPool* pool_; @@ -57,8 +59,10 @@ class ARROW_EXPORT ParquetAllocator : public ::parquet::MemoryAllocator { class ARROW_EXPORT ParquetReadSource : public ::parquet::RandomAccessSource { public: - ParquetReadSource( - const std::shared_ptr& file, ParquetAllocator* allocator); + explicit ParquetReadSource(ParquetAllocator* allocator); + + // We need to ask for the file size on opening the file, and this can fail + Status Open(const std::shared_ptr& file); void Close() override; int64_t Tell() const override; diff --git a/cpp/src/arrow/parquet/parquet-io-test.cc b/cpp/src/arrow/parquet/parquet-io-test.cc index 7e724b31e3801..6615457c483f5 100644 --- a/cpp/src/arrow/parquet/parquet-io-test.cc +++ b/cpp/src/arrow/parquet/parquet-io-test.cc @@ -23,6 +23,7 @@ #include "gtest/gtest.h" #include "arrow/parquet/io.h" +#include "arrow/test-util.h" #include "arrow/util/memory-pool.h" #include "arrow/util/status.h" @@ -147,9 +148,12 @@ TEST(TestParquetReadSource, Basics) { std::string data = "this is the data"; auto data_buffer = reinterpret_cast(data.c_str()); - ParquetAllocator allocator; + ParquetAllocator allocator(default_memory_pool()); + auto file = std::make_shared(data_buffer, data.size()); - auto source = std::make_shared(file, &allocator); + auto source = std::make_shared(&allocator); + + ASSERT_OK(source->Open(file)); ASSERT_EQ(0, source->Tell()); ASSERT_NO_THROW(source->Seek(5)); diff --git a/cpp/src/arrow/parquet/reader.cc b/cpp/src/arrow/parquet/reader.cc index c7c400e957343..e92967e5363d2 100644 --- a/cpp/src/arrow/parquet/reader.cc +++ b/cpp/src/arrow/parquet/reader.cc @@ -23,6 +23,7 @@ #include #include "arrow/column.h" +#include "arrow/parquet/io.h" #include "arrow/parquet/schema.h" #include "arrow/parquet/utils.h" #include "arrow/schema.h" @@ -35,6 +36,10 @@ using parquet::ColumnReader; using parquet::Repetition; using parquet::TypedColumnReader; +// Help reduce verbosity +using ParquetRAS = parquet::RandomAccessSource; +using ParquetReader = parquet::ParquetFileReader; + namespace arrow { namespace parquet { @@ -181,6 +186,21 @@ FileReader::FileReader( FileReader::~FileReader() {} +// Static ctor +Status OpenFile(const std::shared_ptr& file, + ParquetAllocator* allocator, std::unique_ptr* reader) { + std::unique_ptr source(new ParquetReadSource(allocator)); + RETURN_NOT_OK(source->Open(file)); + + // TODO(wesm): reader properties + std::unique_ptr pq_reader; + PARQUET_CATCH_NOT_OK(pq_reader = ParquetReader::Open(std::move(source))); + + // Use the same memory pool as the ParquetAllocator + reader->reset(new FileReader(allocator->pool(), std::move(pq_reader))); + return Status::OK(); +} + Status FileReader::GetFlatColumn(int i, std::unique_ptr* out) { return impl_->GetFlatColumn(i, out); } diff --git a/cpp/src/arrow/parquet/reader.h b/cpp/src/arrow/parquet/reader.h index 2c8a9dfd025f0..f1492f64521cb 100644 --- a/cpp/src/arrow/parquet/reader.h +++ b/cpp/src/arrow/parquet/reader.h @@ -23,6 +23,8 
@@ #include "parquet/api/reader.h" #include "parquet/api/schema.h" +#include "arrow/io/interfaces.h" +#include "arrow/parquet/io.h" #include "arrow/util/visibility.h" namespace arrow { @@ -99,7 +101,7 @@ class ARROW_EXPORT FileReader { virtual ~FileReader(); private: - class Impl; + class ARROW_NO_EXPORT Impl; std::unique_ptr impl_; }; @@ -125,15 +127,20 @@ class ARROW_EXPORT FlatColumnReader { Status NextBatch(int batch_size, std::shared_ptr* out); private: - class Impl; + class ARROW_NO_EXPORT Impl; std::unique_ptr impl_; explicit FlatColumnReader(std::unique_ptr impl); friend class FileReader; }; -} // namespace parquet +// Helper function to create a file reader from an implementation of an Arrow +// readable file +ARROW_EXPORT +Status OpenFile(const std::shared_ptr& file, + ParquetAllocator* allocator, std::unique_ptr* reader); +} // namespace parquet } // namespace arrow #endif // ARROW_PARQUET_READER_H diff --git a/cpp/src/arrow/parquet/writer.cc b/cpp/src/arrow/parquet/writer.cc index 0139edd3bb8d9..f9514aa2ad2ff 100644 --- a/cpp/src/arrow/parquet/writer.cc +++ b/cpp/src/arrow/parquet/writer.cc @@ -35,7 +35,6 @@ using parquet::ParquetVersion; using parquet::schema::GroupNode; namespace arrow { - namespace parquet { class FileWriter::Impl { diff --git a/cpp/src/arrow/parquet/writer.h b/cpp/src/arrow/parquet/writer.h index 45d0fd59868e5..5aa1ba587176a 100644 --- a/cpp/src/arrow/parquet/writer.h +++ b/cpp/src/arrow/parquet/writer.h @@ -55,7 +55,7 @@ class ARROW_EXPORT FileWriter { MemoryPool* memory_pool() const; private: - class Impl; + class ARROW_NO_EXPORT Impl; std::unique_ptr impl_; }; diff --git a/python/pyarrow/includes/libarrow_io.pxd b/python/pyarrow/includes/libarrow_io.pxd index d0fb8f9f000b9..734ace6c923b4 100644 --- a/python/pyarrow/includes/libarrow_io.pxd +++ b/python/pyarrow/includes/libarrow_io.pxd @@ -19,11 +19,37 @@ from pyarrow.includes.common cimport * -cdef extern from "arrow/io/interfaces.h" nogil: +cdef extern from "arrow/io/interfaces.h" namespace "arrow::io" nogil: + enum FileMode" arrow::io::FileMode::type": + FileMode_READ" arrow::io::FileMode::READ" + FileMode_WRITE" arrow::io::FileMode::WRITE" + FileMode_READWRITE" arrow::io::FileMode::READWRITE" + enum ObjectType" arrow::io::ObjectType::type": ObjectType_FILE" arrow::io::ObjectType::FILE" ObjectType_DIRECTORY" arrow::io::ObjectType::DIRECTORY" + cdef cppclass FileBase: + CStatus Close() + CStatus Tell(int64_t* position) + + cdef cppclass ReadableFile(FileBase): + CStatus GetSize(int64_t* size) + CStatus Read(int64_t nbytes, int64_t* bytes_read, + uint8_t* buffer) + + CStatus ReadAt(int64_t position, int64_t nbytes, + int64_t* bytes_read, uint8_t* buffer) + + cdef cppclass RandomAccessFile(ReadableFile): + CStatus Seek(int64_t position) + + cdef cppclass WriteableFile(FileBase): + CStatus Write(const uint8_t* buffer, int64_t nbytes) + # CStatus Write(const uint8_t* buffer, int64_t nbytes, + # int64_t* bytes_written) + + cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: CStatus ConnectLibHdfs() @@ -44,24 +70,11 @@ cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: int64_t block_size int16_t permissions - cdef cppclass CHdfsFile: - CStatus Close() - CStatus Seek(int64_t position) - CStatus Tell(int64_t* position) - - cdef cppclass HdfsReadableFile(CHdfsFile): - CStatus GetSize(int64_t* size) - CStatus Read(int64_t nbytes, int64_t* bytes_read, - uint8_t* buffer) - - CStatus ReadAt(int64_t position, int64_t nbytes, - int64_t* bytes_read, uint8_t* buffer) - - cdef cppclass 
HdfsWriteableFile(CHdfsFile): - CStatus Write(const uint8_t* buffer, int64_t nbytes) + cdef cppclass HdfsReadableFile(RandomAccessFile): + pass - CStatus Write(const uint8_t* buffer, int64_t nbytes, - int64_t* bytes_written) + cdef cppclass HdfsWriteableFile(WriteableFile): + pass cdef cppclass CHdfsClient" arrow::io::HdfsClient": @staticmethod diff --git a/python/pyarrow/includes/parquet.pxd b/python/pyarrow/includes/parquet.pxd index a2f83ea5ea566..fe24f593e3294 100644 --- a/python/pyarrow/includes/parquet.pxd +++ b/python/pyarrow/includes/parquet.pxd @@ -19,6 +19,7 @@ from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport CSchema, CStatus, CTable, MemoryPool +from pyarrow.includes.libarrow_io cimport RandomAccessFile cdef extern from "parquet/api/schema.h" namespace "parquet::schema" nogil: @@ -90,19 +91,36 @@ cdef extern from "parquet/api/writer.h" namespace "parquet" nogil: shared_ptr[WriterProperties] build() +cdef extern from "arrow/parquet/io.h" namespace "arrow::parquet" nogil: + cdef cppclass ParquetAllocator: + ParquetAllocator() + ParquetAllocator(MemoryPool* pool) + MemoryPool* pool() + void set_pool(MemoryPool* pool) + + cdef cppclass ParquetReadSource: + ParquetReadSource(ParquetAllocator* allocator) + Open(const shared_ptr[RandomAccessFile]& file) + + cdef extern from "arrow/parquet/reader.h" namespace "arrow::parquet" nogil: + CStatus OpenFile(const shared_ptr[RandomAccessFile]& file, + ParquetAllocator* allocator, + unique_ptr[FileReader]* reader) + cdef cppclass FileReader: FileReader(MemoryPool* pool, unique_ptr[ParquetFileReader] reader) CStatus ReadFlatTable(shared_ptr[CTable]* out); cdef extern from "arrow/parquet/schema.h" namespace "arrow::parquet" nogil: - CStatus FromParquetSchema(const SchemaDescriptor* parquet_schema, shared_ptr[CSchema]* out) - CStatus ToParquetSchema(const CSchema* arrow_schema, shared_ptr[SchemaDescriptor]* out) + CStatus FromParquetSchema(const SchemaDescriptor* parquet_schema, + shared_ptr[CSchema]* out) + CStatus ToParquetSchema(const CSchema* arrow_schema, + shared_ptr[SchemaDescriptor]* out) cdef extern from "arrow/parquet/writer.h" namespace "arrow::parquet" nogil: cdef CStatus WriteFlatTable(const CTable* table, MemoryPool* pool, const shared_ptr[OutputStream]& sink, int64_t chunk_size, const shared_ptr[WriterProperties]& properties) - diff --git a/python/pyarrow/io.pxd b/python/pyarrow/io.pxd new file mode 100644 index 0000000000000..b92af72704ae8 --- /dev/null +++ b/python/pyarrow/io.pxd @@ -0,0 +1,32 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +# distutils: language = c++ + +from pyarrow.includes.common cimport * +from pyarrow.includes.libarrow cimport * +from pyarrow.includes.libarrow_io cimport RandomAccessFile, WriteableFile + + +cdef class NativeFileInterface: + + # By implementing these "virtual" functions (all functions in Cython + # extension classes are technically virtual in the C++ sense)m we can + # expose the arrow::io abstract file interfaces to other components + # throughout the suite of Arrow C++ libraries + cdef read_handle(self, shared_ptr[RandomAccessFile]* file) + cdef write_handle(self, shared_ptr[WriteableFile]* file) diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index 071eea5ba6e60..b8bf883562060 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -164,7 +164,7 @@ cdef class HdfsClient: .ListDirectory(c_path, &listing)) cdef const HdfsPathInfo* info - for i in range(listing.size()): + for i in range( listing.size()): info = &listing[i] # Try to trim off the hdfs://HOST:PORT piece @@ -314,8 +314,15 @@ cdef class HdfsClient: f = self.open(path, 'rb', buffer_size=buffer_size) f.download(stream) +cdef class NativeFileInterface: -cdef class HdfsFile: + cdef read_handle(self, shared_ptr[RandomAccessFile]* file): + raise NotImplementedError + + cdef write_handle(self, shared_ptr[WriteableFile]* file): + raise NotImplementedError + +cdef class HdfsFile(NativeFileInterface): cdef: shared_ptr[HdfsReadableFile] rd_file shared_ptr[HdfsWriteableFile] wr_file @@ -357,6 +364,14 @@ cdef class HdfsFile: if self.is_readonly: raise IOError("only valid on writeonly files") + cdef read_handle(self, shared_ptr[RandomAccessFile]* file): + self._assert_readable() + file[0] = self.rd_file + + cdef write_handle(self, shared_ptr[WriteableFile]* file): + self._assert_writeable() + file[0] = self.wr_file + def size(self): cdef int64_t size self._assert_readable() diff --git a/python/pyarrow/parquet.pyx b/python/pyarrow/parquet.pyx index 0b2b20880332b..ebba1a17ac742 100644 --- a/python/pyarrow/parquet.pyx +++ b/python/pyarrow/parquet.pyx @@ -20,34 +20,75 @@ # cython: embedsignature = True from pyarrow.includes.libarrow cimport * -cimport pyarrow.includes.pyarrow as pyarrow from pyarrow.includes.parquet cimport * +from pyarrow.includes.libarrow_io cimport RandomAccessFile, WriteableFile +cimport pyarrow.includes.pyarrow as pyarrow from pyarrow.compat import tobytes from pyarrow.error import ArrowException from pyarrow.error cimport check_cstatus +from pyarrow.io import NativeFileInterface from pyarrow.table cimport Table -def read_table(filename, columns=None): +from pyarrow.io cimport NativeFileInterface + +import six + + +cdef class ParquetReader: + cdef: + ParquetAllocator allocator + unique_ptr[FileReader] reader + + def __cinit__(self): + self.allocator.set_pool(default_memory_pool()) + + cdef open_local_file(self, file_path): + cdef c_string path = tobytes(file_path) + + # Must be in one expression to avoid calling std::move which is not + # possible in Cython (due to missing rvalue support) + + # TODO(wesm): ParquetFileReader::OpenFIle can throw? 
+ self.reader = unique_ptr[FileReader]( + new FileReader(default_memory_pool(), + ParquetFileReader.OpenFile(path))) + + cdef open_native_file(self, NativeFileInterface file): + cdef shared_ptr[RandomAccessFile] cpp_handle + file.read_handle(&cpp_handle) + + check_cstatus(OpenFile(cpp_handle, &self.allocator, &self.reader)) + + def read_all(self): + cdef: + Table table = Table() + shared_ptr[CTable] ctable + + with nogil: + check_cstatus(self.reader.get() + .ReadFlatTable(&ctable)) + + table.init(ctable) + return table + + +def read_table(source, columns=None): """ Read a Table from Parquet format Returns ------- table: pyarrow.Table """ - cdef unique_ptr[FileReader] reader - cdef Table table = Table() - cdef shared_ptr[CTable] ctable - - # Must be in one expression to avoid calling std::move which is not possible - # in Cython (due to missing rvalue support) - reader = unique_ptr[FileReader](new FileReader(default_memory_pool(), - ParquetFileReader.OpenFile(tobytes(filename)))) - with nogil: - check_cstatus(reader.get().ReadFlatTable(&ctable)) + cdef ParquetReader reader = ParquetReader() + + if isinstance(source, six.string_types): + reader.open_local_file(source) + elif isinstance(source, NativeFileInterface): + reader.open_native_file(source) + + return reader.read_all() - table.init(ctable) - return table def write_table(table, filename, chunk_size=None, version=None): """ @@ -84,4 +125,3 @@ def write_table(table, filename, chunk_size=None, version=None): with nogil: check_cstatus(WriteFlatTable(ctable_, default_memory_pool(), sink, chunk_size_, properties_builder.build())) - From a2fb756a43441a72e10ae74fa0e483e01bc5917e Mon Sep 17 00:00:00 2001 From: Steven Phillips Date: Tue, 19 Jul 2016 13:39:48 -0700 Subject: [PATCH 0100/1644] ARROW-241: Add missing implementation for splitAndTransfer in UnionVector Use simple implementation that actually just copies --- java/vector/src/main/codegen/templates/UnionVector.java | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java index 6042a5bf68352..482944828ade1 100644 --- a/java/vector/src/main/codegen/templates/UnionVector.java +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -264,7 +264,11 @@ public void transfer() { @Override public void splitAndTransfer(int startIndex, int length) { - + to.allocateNew(); + for (int i = 0; i < length; i++) { + to.copyFromSafe(startIndex + i, i, org.apache.arrow.vector.complex.UnionVector.this); + } + to.getMutator().setValueCount(length); } @Override From dc79ceb05c05e626e2324863cfc3f386ecccce90 Mon Sep 17 00:00:00 2001 From: Jihoon Son Date: Mon, 1 Aug 2016 11:29:02 -0700 Subject: [PATCH 0101/1644] ARROW-244: Some global APIs of IPC module should be visible to the outside Author: Jihoon Son Closes #109 from jihoonson/ARROW-244 and squashes the following commits: 51d9a87 [Jihoon Son] Make line length shorter than 90 2da5466 [Jihoon Son] Make some APIs of IPC module visible --- cpp/src/arrow/ipc/adapter.h | 11 +++++++---- cpp/src/arrow/ipc/memory.h | 5 +++-- 2 files changed, 10 insertions(+), 6 deletions(-) diff --git a/cpp/src/arrow/ipc/adapter.h b/cpp/src/arrow/ipc/adapter.h index 0d2b77f5acefe..a34a5c4fcc99f 100644 --- a/cpp/src/arrow/ipc/adapter.h +++ b/cpp/src/arrow/ipc/adapter.h @@ -24,6 +24,8 @@ #include #include +#include "arrow/util/visibility.h" + namespace arrow { class Array; @@ -54,20 +56,21 @@ constexpr int kMaxIpcRecursionDepth = 64; // // Finally, 
the memory offset to the start of the metadata / data header is // returned in an out-variable -Status WriteRowBatch(MemorySource* dst, const RowBatch* batch, int64_t position, - int64_t* header_offset, int max_recursion_depth = kMaxIpcRecursionDepth); +ARROW_EXPORT Status WriteRowBatch(MemorySource* dst, const RowBatch* batch, + int64_t position, int64_t* header_offset, + int max_recursion_depth = kMaxIpcRecursionDepth); // int64_t GetRowBatchMetadata(const RowBatch* batch); // Compute the precise number of bytes needed in a contiguous memory segment to // write the row batch. This involves generating the complete serialized // Flatbuffers metadata. -Status GetRowBatchSize(const RowBatch* batch, int64_t* size); +ARROW_EXPORT Status GetRowBatchSize(const RowBatch* batch, int64_t* size); // ---------------------------------------------------------------------- // "Read" path; does not copy data if the MemorySource does not -class RowBatchReader { +class ARROW_EXPORT RowBatchReader { public: static Status Open( MemorySource* source, int64_t position, std::shared_ptr* out); diff --git a/cpp/src/arrow/ipc/memory.h b/cpp/src/arrow/ipc/memory.h index c6fd7a718991b..377401d85c00a 100644 --- a/cpp/src/arrow/ipc/memory.h +++ b/cpp/src/arrow/ipc/memory.h @@ -25,6 +25,7 @@ #include #include "arrow/util/macros.h" +#include "arrow/util/visibility.h" namespace arrow { @@ -69,7 +70,7 @@ class BufferOutputStream : public OutputStream { int64_t position_; }; -class MemorySource { +class ARROW_EXPORT MemorySource { public: // Indicates the access permissions of the memory source enum AccessMode { READ_ONLY, READ_WRITE }; @@ -100,7 +101,7 @@ class MemorySource { }; // A memory source that uses memory-mapped files for memory interactions -class MemoryMappedSource : public MemorySource { +class ARROW_EXPORT MemoryMappedSource : public MemorySource { public: static Status Open(const std::string& path, AccessMode access_mode, std::shared_ptr* out); From 356d015bb7de3a12167ac8ea02dbda9bbdc8c27f Mon Sep 17 00:00:00 2001 From: MechCoder Date: Wed, 13 Jul 2016 17:24:26 -0700 Subject: [PATCH 0102/1644] ARROW-240: Provide more detailed installation instructions for pyarrow. Closes --- python/README.md | 33 +++++++++++++++++++++++++++++++-- 1 file changed, 31 insertions(+), 2 deletions(-) diff --git a/python/README.md b/python/README.md index c79fa9786f476..bafe71b05ec22 100644 --- a/python/README.md +++ b/python/README.md @@ -4,11 +4,40 @@ This library provides a Pythonic API wrapper for the reference Arrow C++ implementation, along with tools for interoperability with pandas, NumPy, and other traditional Python scientific computing packages. -#### Development details +### Development details This project is layered in two pieces: * pyarrow, a C++ library for easier interoperability between Arrow C++, NumPy, and pandas * Cython extensions and pure Python code under arrow/ which expose Arrow C++ - and pyarrow to pure Python users \ No newline at end of file + and pyarrow to pure Python users + +#### PyArrow Dependencies: +These are the various projects that PyArrow depends on. + +1. **g++ and gcc Version >= 4.8** +2. **cmake > 2.8.6** +3. **boost** +4. **Parquet-cpp** + + The preferred way to install parquet-cpp is to use conda. + You need to set the ``PARQUET_HOME`` environment variable to where parquet-cpp is installed. + ```bash + conda install -y --channel apache/channel/dev parquet-cpp + ``` +5. 
**Arrow-cpp and its dependencies**
+
+   The Arrow C++ library must be built with all options enabled and installed
+   with the ``ARROW_HOME`` environment variable set to the installation
+   location. See https://github.com/apache/arrow/blob/master/cpp/README.md
+   for instructions. Alternatively, you can install arrow-cpp from conda.
+   ```bash
+   conda install arrow-cpp -c apache/channel/dev
+   ```
+6. **Python dependencies: numpy, pandas, cython, pytest**
+
+#### Install pyarrow
+   ```bash
+   python setup.py build_ext --inplace
+   ```
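+
+As a quick smoke test of the build, a minimal sketch that exercises only the
+`read_table` and `write_table` functions added in this series (the file name
+`example.parquet` is a placeholder you would supply):
+ ```python
+ import pyarrow.parquet as pq
+
+ # read_table accepts a local file path or a pyarrow NativeFile handle
+ table = pq.read_table('example.parquet')
+
+ # write the table back out; chunk_size and version are optional
+ pq.write_table(table, 'example-copy.parquet')
+ ```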
From 3a2dfba59a2482226cc3c49a11a779dd9ce3dfd7 Mon Sep 17 00:00:00 2001
From: Laurent Goujon
Date: Mon, 1 Aug 2016 16:31:54 -0700
Subject: [PATCH 0103/1644] ARROW-101: Fix java compiler warnings

Fixes several warnings emitted by the java compiler regarding the use of raw
types and unclosed resources.

Author: Laurent Goujon

Closes #60 from laurentgo/laurent/fix-generic-warnings and squashes the
following commits:

96ccc67 [Laurent Goujon] [ARROW-101] Fix java compiler resources warnings
61bde83 [Laurent Goujon] [ARROW-101] Fix java compiler rawtypes warnings
---
 .../src/main/java/org/apache/arrow/vector/ZeroVector.java | 5 +++--
 .../arrow/vector/complex/impl/PromotableWriter.java       | 8 ++++----
 .../org/apache/arrow/vector/util/JsonStringArrayList.java | 2 +-
 .../org/apache/arrow/vector/util/JsonStringHashMap.java   | 2 +-
 4 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java b/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java
index 78de8706fb7d4..c94e8d1db090c 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java
@@ -19,6 +19,7 @@
 
 import io.netty.buffer.ArrowBuf;
 
+import java.util.Collections;
 import java.util.Iterator;
 
 import org.apache.arrow.memory.BufferAllocator;
@@ -109,8 +110,8 @@ public TransferPair getTransferPair(BufferAllocator allocator) {
   // }
 
   @Override
-  public Iterator iterator() {
-    return Iterators.emptyIterator();
+  public Iterator<ValueVector> iterator() {
+    return Collections.emptyIterator();
   }
 
   @Override
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java
index ea62e02360802..45509f688ba88 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java
@@ -85,16 +85,16 @@ private void setWriter(ValueVector v) {
     state = State.SINGLE;
     vector = v;
     type = v.getField().getType().getMinorType();
-    Class writerClass = BasicTypeHelper
+    Class<? extends FieldWriter> writerClass = BasicTypeHelper
        .getWriterImpl(v.getField().getType().getMinorType(), v.getField().getDataMode());
     if (writerClass.equals(SingleListWriter.class)) {
       writerClass = UnionListWriter.class;
     }
-    Class vectorClass = BasicTypeHelper.getValueVectorClass(v.getField().getType().getMinorType(), v.getField()
+    Class<? extends ValueVector> vectorClass = BasicTypeHelper.getValueVectorClass(v.getField().getType().getMinorType(), v.getField()
        .getDataMode());
     try {
-      Constructor constructor = null;
-      for (Constructor c : writerClass.getConstructors()) {
+      Constructor<?> constructor = null;
+      for (Constructor<?> c : writerClass.getConstructors()) {
         if (c.getParameterTypes().length == 3) {
           constructor = c;
         }
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/JsonStringArrayList.java b/java/vector/src/main/java/org/apache/arrow/vector/util/JsonStringArrayList.java
index 7aeaa12ef9fcf..6291bfeaee666 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/util/JsonStringArrayList.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/util/JsonStringArrayList.java
@@ -42,7 +42,7 @@ public boolean equals(Object obj) {
     if (!(obj instanceof List)) {
       return false;
     }
-    List other = (List) obj;
+    List<?> other = (List<?>) obj;
     return this.size() == other.size() && this.containsAll(other);
   }
 
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/JsonStringHashMap.java b/java/vector/src/main/java/org/apache/arrow/vector/util/JsonStringHashMap.java
index 750dd592aa49c..e8ce5221eebd9 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/util/JsonStringHashMap.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/util/JsonStringHashMap.java
@@ -46,7 +46,7 @@ public boolean equals(Object obj) {
     if (!(obj instanceof Map)) {
       return false;
     }
-    Map other = (Map) obj;
+    Map<?, ?> other = (Map<?, ?>) obj;
     if (this.size() != other.size()) {
       return false;
     }
From 56835c338f01aebcace01312e82431306e7fd578 Mon Sep 17 00:00:00 2001
From: adeneche
Date: Mon, 1 Aug 2016 15:28:08 -0700
Subject: [PATCH 0104/1644] ARROW-246: [Java] UnionVector doesn't call allocateNew() when creating its vectorType

---
 .../main/codegen/templates/UnionVector.java   |   2 +
 .../arrow/vector/DirtyBufferAllocator.java    | 120 ++++++++++++++++++
 .../apache/arrow/vector/TestUnionVector.java  |  88 +++++++++++++
 3 files changed, 210 insertions(+)
 create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/DirtyBufferAllocator.java
 create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/TestUnionVector.java

diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java
index 482944828ade1..692436d12854d 100644
--- a/java/vector/src/main/codegen/templates/UnionVector.java
+++ b/java/vector/src/main/codegen/templates/UnionVector.java
@@ -73,6 +73,8 @@ public UnionVector(MaterializedField field, BufferAllocator allocator, CallBack
     this.allocator = allocator;
     this.internalMap = new MapVector("internal", allocator, callBack);
     this.typeVector = internalMap.addOrGet("types", new MajorType(MinorType.UINT1, DataMode.REQUIRED), UInt1Vector.class);
+    this.typeVector.allocateNew();
+    this.typeVector.zeroVector();
     this.field.addChild(internalMap.getField().clone());
     this.majorType = field.getType();
     this.callBack = callBack;
diff --git a/java/vector/src/test/java/org/apache/arrow/vector/DirtyBufferAllocator.java b/java/vector/src/test/java/org/apache/arrow/vector/DirtyBufferAllocator.java
new file mode 100644
index 0000000000000..cc6b9ec51d61c
--- /dev/null
+++ b/java/vector/src/test/java/org/apache/arrow/vector/DirtyBufferAllocator.java
@@ -0,0 +1,120 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector; + +import org.apache.arrow.memory.AllocationReservation; +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.BufferManager; + +import io.netty.buffer.ArrowBuf; +import io.netty.buffer.ByteBufAllocator; + +/** + * Wrapper around a buffer delegate that populates any allocated buffer with a constant + * value. Useful for testing if value vectors are properly resetting their buffers. + */ +public class DirtyBufferAllocator implements BufferAllocator { + + private final BufferAllocator delegate; + private final byte fillValue; + + DirtyBufferAllocator(final BufferAllocator delegate, final byte fillValue) { + this.delegate = delegate; + this.fillValue = fillValue; + } + + @Override + public ArrowBuf buffer(int size) { + return buffer(size, null); + } + + @Override + public ArrowBuf buffer(int size, BufferManager manager) { + ArrowBuf buffer = delegate.buffer(size, manager); + // contaminate the buffer + for (int i = 0; i < buffer.capacity(); i++) { + buffer.setByte(i, fillValue); + } + + return buffer; + } + + @Override + public ByteBufAllocator getAsByteBufAllocator() { + return delegate.getAsByteBufAllocator(); + } + + @Override + public BufferAllocator newChildAllocator(String name, long initReservation, long maxAllocation) { + return delegate.newChildAllocator(name, initReservation, maxAllocation); + } + + @Override + public void close() { + delegate.close(); + } + + @Override + public long getAllocatedMemory() { + return delegate.getAllocatedMemory(); + } + + @Override + public void setLimit(long newLimit) { + delegate.setLimit(newLimit); + } + + @Override + public long getLimit() { + return delegate.getLimit(); + } + + @Override + public long getPeakMemoryAllocation() { + return delegate.getPeakMemoryAllocation(); + } + + @Override + public AllocationReservation newReservation() { + return delegate.newReservation(); + } + + @Override + public ArrowBuf getEmpty() { + return delegate.getEmpty(); + } + + @Override + public String getName() { + return delegate.getName(); + } + + @Override + public boolean isOverLimit() { + return delegate.isOverLimit(); + } + + @Override + public String toVerboseString() { + return delegate.toVerboseString(); + } + + @Override + public void assertOpen() { + delegate.assertOpen(); + }} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestUnionVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestUnionVector.java new file mode 100644 index 0000000000000..8f19b3191ba15 --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestUnionVector.java @@ -0,0 +1,88 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector; + +import static org.junit.Assert.assertEquals; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.complex.UnionVector; +import org.apache.arrow.vector.holders.NullableUInt4Holder; +import org.apache.arrow.vector.holders.UInt4Holder; +import org.apache.arrow.vector.types.MaterializedField; +import org.apache.arrow.vector.types.Types; +import org.junit.After; +import org.junit.Before; +import org.junit.Test; + +public class TestUnionVector { + private final static String EMPTY_SCHEMA_PATH = ""; + + private BufferAllocator allocator; + + @Before + public void init() { + allocator = new RootAllocator(Long.MAX_VALUE); + } + + @After + public void terminate() throws Exception { + allocator.close(); + } + + @Test + public void testUnionVector() throws Exception { + final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, UInt4Holder.TYPE); + + final BufferAllocator alloc = new DirtyBufferAllocator(allocator, (byte) 100); + + UnionVector unionVector = new UnionVector(field, alloc, null); + + final NullableUInt4Holder uInt4Holder = new NullableUInt4Holder(); + uInt4Holder.value = 100; + uInt4Holder.isSet = 1; + + try { + // write some data + final UnionVector.Mutator mutator = unionVector.getMutator(); + mutator.setType(0, Types.MinorType.UINT4); + mutator.setSafe(0, uInt4Holder); + mutator.setType(2, Types.MinorType.UINT4); + mutator.setSafe(2, uInt4Holder); + mutator.setValueCount(4); + + // check that what we wrote is correct + final UnionVector.Accessor accessor = unionVector.getAccessor(); + assertEquals(4, accessor.getValueCount()); + + assertEquals(false, accessor.isNull(0)); + assertEquals(100, accessor.getObject(0)); + + assertEquals(true, accessor.isNull(1)); + + assertEquals(false, accessor.isNull(2)); + assertEquals(100, accessor.getObject(2)); + + assertEquals(true, accessor.isNull(3)); + + } finally { + unionVector.clear(); + } + } + +} From 5df7d4dee5fd57e91d9bb83f44f2269f61b79fb3 Mon Sep 17 00:00:00 2001 From: Jihoon Son Date: Thu, 4 Aug 2016 15:29:01 -0700 Subject: [PATCH 0105/1644] ARROW-247: Missing explicit destructor in RowBatchReader causes an incomplete type error Author: Jihoon Son Closes #111 from jihoonson/ARROW-247 and squashes the following commits: cc7281c [Jihoon Son] Make destructor virtual 795d3d3 [Jihoon Son] Merge branch 'master' of https://github.com/apache/arrow into ARROW-247 df297ef [Jihoon Son] Trigger travis 65d64c8 [Jihoon Son] Make the comment into two line 9555260 [Jihoon Son] Add a comment f671a32 [Jihoon Son] Add explicit destructor for RowBatchReader --- cpp/src/arrow/ipc/adapter.cc | 4 ++++ cpp/src/arrow/ipc/adapter.h | 2 ++ 2 files changed, 6 insertions(+) diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index bac1172700615..84f7830092cf4 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -369,6 +369,10 @@ Status RowBatchReader::Open(MemorySource* source, int64_t position, return Status::OK(); } +// Here the explicit destructor is required for 
compilers to be aware of +// the complete information of RowBatchReader::Impl class +RowBatchReader::~RowBatchReader() {} + Status RowBatchReader::GetRowBatch( const std::shared_ptr& schema, std::shared_ptr* out) { return impl_->AssembleBatch(schema, out); diff --git a/cpp/src/arrow/ipc/adapter.h b/cpp/src/arrow/ipc/adapter.h index a34a5c4fcc99f..6231af66aa180 100644 --- a/cpp/src/arrow/ipc/adapter.h +++ b/cpp/src/arrow/ipc/adapter.h @@ -78,6 +78,8 @@ class ARROW_EXPORT RowBatchReader { static Status Open(MemorySource* source, int64_t position, int max_recursion_depth, std::shared_ptr* out); + virtual ~RowBatchReader(); + // Reassemble the row batch. A Schema is required to be able to construct the // right array containers Status GetRowBatch( From 34e7f48cb71428c4d78cf00d8fdf0045532d6607 Mon Sep 17 00:00:00 2001 From: adeneche Date: Fri, 5 Aug 2016 10:26:47 -0700 Subject: [PATCH 0106/1644] ARROW-250: Fix for ARROW-246 may cause memory leaks this closes #112 --- .../main/codegen/templates/UnionVector.java | 3 +- .../vector/complex/impl/PromotableWriter.java | 1 + .../arrow/vector/DirtyBufferAllocator.java | 120 ------------------ .../arrow/vector/DirtyRootAllocator.java | 53 ++++++++ .../apache/arrow/vector/TestUnionVector.java | 14 +- .../complex/impl/TestPromotableWriter.java | 98 ++++++++++++++ 6 files changed, 157 insertions(+), 132 deletions(-) delete mode 100644 java/vector/src/test/java/org/apache/arrow/vector/DirtyBufferAllocator.java create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/DirtyRootAllocator.java create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java index 692436d12854d..0f089b7e91537 100644 --- a/java/vector/src/main/codegen/templates/UnionVector.java +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -73,8 +73,6 @@ public UnionVector(MaterializedField field, BufferAllocator allocator, CallBack this.allocator = allocator; this.internalMap = new MapVector("internal", allocator, callBack); this.typeVector = internalMap.addOrGet("types", new MajorType(MinorType.UINT1, DataMode.REQUIRED), UInt1Vector.class); - this.typeVector.allocateNew(); - this.typeVector.zeroVector(); this.field.addChild(internalMap.getField().clone()); this.majorType = field.getType(); this.callBack = callBack; @@ -193,6 +191,7 @@ public int getValueCapacity() { @Override public void close() { + clear(); } @Override diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java index 45509f688ba88..462ec9dd86a9b 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java @@ -155,6 +155,7 @@ private FieldWriter promoteToUnion() { tp.transfer(); if (parentContainer != null) { unionVector = parentContainer.addOrGet(name, new MajorType(MinorType.UNION, DataMode.OPTIONAL), UnionVector.class); + unionVector.allocateNew(); } else if (listVector != null) { unionVector = listVector.promoteToUnion(); } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/DirtyBufferAllocator.java b/java/vector/src/test/java/org/apache/arrow/vector/DirtyBufferAllocator.java deleted file mode 100644 index cc6b9ec51d61c..0000000000000 --- 
a/java/vector/src/test/java/org/apache/arrow/vector/DirtyBufferAllocator.java +++ /dev/null @@ -1,120 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.arrow.vector; - -import org.apache.arrow.memory.AllocationReservation; -import org.apache.arrow.memory.BufferAllocator; -import org.apache.arrow.memory.BufferManager; - -import io.netty.buffer.ArrowBuf; -import io.netty.buffer.ByteBufAllocator; - -/** - * Wrapper around a buffer delegate that populates any allocated buffer with a constant - * value. Useful for testing if value vectors are properly resetting their buffers. - */ -public class DirtyBufferAllocator implements BufferAllocator { - - private final BufferAllocator delegate; - private final byte fillValue; - - DirtyBufferAllocator(final BufferAllocator delegate, final byte fillValue) { - this.delegate = delegate; - this.fillValue = fillValue; - } - - @Override - public ArrowBuf buffer(int size) { - return buffer(size, null); - } - - @Override - public ArrowBuf buffer(int size, BufferManager manager) { - ArrowBuf buffer = delegate.buffer(size, manager); - // contaminate the buffer - for (int i = 0; i < buffer.capacity(); i++) { - buffer.setByte(i, fillValue); - } - - return buffer; - } - - @Override - public ByteBufAllocator getAsByteBufAllocator() { - return delegate.getAsByteBufAllocator(); - } - - @Override - public BufferAllocator newChildAllocator(String name, long initReservation, long maxAllocation) { - return delegate.newChildAllocator(name, initReservation, maxAllocation); - } - - @Override - public void close() { - delegate.close(); - } - - @Override - public long getAllocatedMemory() { - return delegate.getAllocatedMemory(); - } - - @Override - public void setLimit(long newLimit) { - delegate.setLimit(newLimit); - } - - @Override - public long getLimit() { - return delegate.getLimit(); - } - - @Override - public long getPeakMemoryAllocation() { - return delegate.getPeakMemoryAllocation(); - } - - @Override - public AllocationReservation newReservation() { - return delegate.newReservation(); - } - - @Override - public ArrowBuf getEmpty() { - return delegate.getEmpty(); - } - - @Override - public String getName() { - return delegate.getName(); - } - - @Override - public boolean isOverLimit() { - return delegate.isOverLimit(); - } - - @Override - public String toVerboseString() { - return delegate.toVerboseString(); - } - - @Override - public void assertOpen() { - delegate.assertOpen(); - }} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/DirtyRootAllocator.java b/java/vector/src/test/java/org/apache/arrow/vector/DirtyRootAllocator.java new file mode 100644 index 0000000000000..f775f1d2d67af --- /dev/null +++ 
b/java/vector/src/test/java/org/apache/arrow/vector/DirtyRootAllocator.java @@ -0,0 +1,53 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector; + +import org.apache.arrow.memory.BufferManager; +import org.apache.arrow.memory.RootAllocator; + +import io.netty.buffer.ArrowBuf; + +/** + * Root allocator that returns buffers pre-filled with a given value.
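+ * The tests in this module use (byte) 100 as the fill value.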
+ * Useful for testing if value vectors are properly zeroing their buffers. + */ +public class DirtyRootAllocator extends RootAllocator { + + private final byte fillValue; + + public DirtyRootAllocator(final long limit, final byte fillValue) { + super(limit); + this.fillValue = fillValue; + } + + @Override + public ArrowBuf buffer(int size) { + return buffer(size, null); + } + + @Override + public ArrowBuf buffer(int size, BufferManager manager) { + ArrowBuf buffer = super.buffer(size, manager); + // contaminate the buffer + for (int i = 0; i < buffer.capacity(); i++) { + buffer.setByte(i, fillValue); + } + + return buffer; + } +} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestUnionVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestUnionVector.java index 8f19b3191ba15..e4d28c3f88ca6 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestUnionVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestUnionVector.java @@ -20,7 +20,6 @@ import static org.junit.Assert.assertEquals; import org.apache.arrow.memory.BufferAllocator; -import org.apache.arrow.memory.RootAllocator; import org.apache.arrow.vector.complex.UnionVector; import org.apache.arrow.vector.holders.NullableUInt4Holder; import org.apache.arrow.vector.holders.UInt4Holder; @@ -37,7 +36,7 @@ public class TestUnionVector { @Before public void init() { - allocator = new RootAllocator(Long.MAX_VALUE); + allocator = new DirtyRootAllocator(Long.MAX_VALUE, (byte) 100); } @After @@ -49,15 +48,13 @@ public void terminate() throws Exception { public void testUnionVector() throws Exception { final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, UInt4Holder.TYPE); - final BufferAllocator alloc = new DirtyBufferAllocator(allocator, (byte) 100); - - UnionVector unionVector = new UnionVector(field, alloc, null); - final NullableUInt4Holder uInt4Holder = new NullableUInt4Holder(); uInt4Holder.value = 100; uInt4Holder.isSet = 1; - try { + try (UnionVector unionVector = new UnionVector(field, allocator, null)) { + unionVector.allocateNew(); + // write some data final UnionVector.Mutator mutator = unionVector.getMutator(); mutator.setType(0, Types.MinorType.UINT4); @@ -79,9 +76,6 @@ public void testUnionVector() throws Exception { assertEquals(100, accessor.getObject(2)); assertEquals(true, accessor.isNull(3)); - - } finally { - unionVector.clear(); } } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java new file mode 100644 index 0000000000000..4c24444d81d18 --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java @@ -0,0 +1,98 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.complex.impl; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertFalse; +import static org.junit.Assert.assertTrue; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.DirtyRootAllocator; +import org.apache.arrow.vector.complex.AbstractMapVector; +import org.apache.arrow.vector.complex.MapVector; +import org.apache.arrow.vector.complex.UnionVector; +import org.apache.arrow.vector.holders.UInt4Holder; +import org.apache.arrow.vector.types.MaterializedField; +import org.junit.After; +import org.junit.Before; +import org.junit.Test; + +public class TestPromotableWriter { + private final static String EMPTY_SCHEMA_PATH = ""; + + private BufferAllocator allocator; + + @Before + public void init() { + allocator = new DirtyRootAllocator(Long.MAX_VALUE, (byte) 100); + } + + @After + public void terminate() throws Exception { + allocator.close(); + } + + @Test + public void testPromoteToUnion() throws Exception { + final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, UInt4Holder.TYPE); + + try (final AbstractMapVector container = new MapVector(field, allocator, null); + final MapVector v = container.addOrGet("test", MapVector.TYPE, MapVector.class); + final PromotableWriter writer = new PromotableWriter(v, container)) { + + container.allocateNew(); + + writer.start(); + + writer.setPosition(0); + writer.bit("A").writeBit(0); + + writer.setPosition(1); + writer.bit("A").writeBit(1); + + writer.setPosition(2); + writer.integer("A").writeInt(10); + + // we don't write anything in 3 + + writer.setPosition(4); + writer.integer("A").writeInt(100); + + writer.end(); + + container.getMutator().setValueCount(5); + + final UnionVector uv = v.getChild("A", UnionVector.class); + final UnionVector.Accessor accessor = uv.getAccessor(); + + assertFalse("0 shouldn't be null", accessor.isNull(0)); + assertEquals(false, accessor.getObject(0)); + + assertFalse("1 shouldn't be null", accessor.isNull(1)); + assertEquals(true, accessor.getObject(1)); + + assertFalse("2 shouldn't be null", accessor.isNull(2)); + assertEquals(10, accessor.getObject(2)); + + assertTrue("3 should be null", accessor.isNull(3)); + + assertFalse("4 shouldn't be null", accessor.isNull(4)); + assertEquals(100, accessor.getObject(4)); + } + } +} From 2742d37cc3f890ffd68ba46920240c18ae5528ae Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Fri, 12 Aug 2016 15:58:20 -0700 Subject: [PATCH 0107/1644] ARROW-254: remove Bit type as it is redundant with Boolean The only use of Bit is for the nullability (or validity) vector which is best understood as a boolean type. We should remove it as it is not used. 
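To make the redundancy concrete, a minimal sketch in plain Python (illustrative
only, not from the patch) of how a validity vector is just a packed boolean
array, one bit per slot:

```python
def bitmap_bytes(length):
    # one validity bit per value, rounded up to a whole number of bytes
    return (length + 7) // 8

def is_valid(bitmap, i):
    # Arrow numbers bits least-significant-bit first within each byte
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1
```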
Author: Julien Le Dem Closes #116 from julienledem/arrow_254_remove_bit_type and squashes the following commits: 1cada12 [Julien Le Dem] ARROW-254: remove Bit type --- cpp/src/arrow/ipc/metadata-internal.cc | 2 -- format/Message.fbs | 4 ---- 2 files changed, 6 deletions(-) diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index 1b1d50f96eaf5..5c439120b173a 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -99,8 +99,6 @@ static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, return Status::Invalid("Type metadata cannot be none"); case flatbuf::Type_Int: return IntFromFlatbuffer(static_cast(type_data), out); - case flatbuf::Type_Bit: - return Status::NotImplemented("Type is not implemented"); case flatbuf::Type_FloatingPoint: return FloatFromFlatuffer( static_cast(type_data), out); diff --git a/format/Message.fbs b/format/Message.fbs index fc849eedf791a..e0a956c3b257a 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -20,9 +20,6 @@ table Union { mode: UnionMode; } -table Bit { -} - table Int { bitWidth: int; // 1 to 64 is_signed: bool; @@ -62,7 +59,6 @@ table JSONScalar { union Type { Int, - Bit, FloatingPoint, Binary, Utf8, From dc01f099d966b92f4de7679b4a1caf97c363e08e Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Fri, 12 Aug 2016 16:00:18 -0700 Subject: [PATCH 0108/1644] ARROW-253: restrict ints to 8, 16, 32, or 64 bits in V1 Author: Julien Le Dem Closes #115 from julienledem/arrow_253_int_8_16_32_64 and squashes the following commits: d8df119 [Julien Le Dem] ARROW-253: restrict ints to 8, 16, 32, or 64 bits in V1 --- cpp/src/arrow/ipc/metadata-internal.cc | 9 ++++----- format/Message.fbs | 2 +- 2 files changed, 5 insertions(+), 6 deletions(-) diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index 5c439120b173a..e6b47de70ed70 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -55,12 +55,12 @@ const std::shared_ptr DOUBLE = std::make_shared(); static Status IntFromFlatbuffer( const flatbuf::Int* int_data, std::shared_ptr* out) { - if (int_data->bitWidth() % 8 != 0) { - return Status::NotImplemented("Integers not in cstdint are not implemented"); - } if (int_data->bitWidth() > 64) { return Status::NotImplemented("Integers with more than 64 bits not implemented"); } + if (int_data->bitWidth() < 8) { + return Status::NotImplemented("Integers with less than 8 bits not implemented"); + } switch (int_data->bitWidth()) { case 8: @@ -76,8 +76,7 @@ static Status IntFromFlatbuffer( *out = int_data->is_signed() ? 
INT64 : UINT64;
       break;
     default:
-      *out = nullptr;
-      break;
+      return Status::NotImplemented("Integers not in cstdint are not implemented");
   }
   return Status::OK();
 }
diff --git a/format/Message.fbs b/format/Message.fbs
index e0a956c3b257a..6a351b9dbf0a6 100644
--- a/format/Message.fbs
+++ b/format/Message.fbs
@@ -21,7 +21,7 @@ table Union {
 }
 
 table Int {
-  bitWidth: int; // 1 to 64
+  bitWidth: int; // restricted to 8, 16, 32, and 64 in v1
   is_signed: bool;
 }
 
From e8724f8379324c59d285d2380005577a49290c42 Mon Sep 17 00:00:00 2001
From: Jihoon Son
Date: Sat, 13 Aug 2016 13:50:02 +0900
Subject: [PATCH 0109/1644] ARROW-260: Fix flaky oversized tests

- Limit max allocation bytes for a vector to 1 MB (1048576 bytes)
- Remove System.setProperty() in TestValueVector
- Move tests which test OversizedAllocationException for ValueVector into a
  separate class and add a disclaimer
- Add a comment for the new test

This closes #118.
---
 java/pom.xml                                  |   3 +
 ...TestOversizedAllocationForValueVector.java | 137 ++++++++++++++++++
 .../apache/arrow/vector/TestValueVector.java  | 131 +----------------
 3 files changed, 145 insertions(+), 126 deletions(-)
 create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/TestOversizedAllocationForValueVector.java

diff --git a/java/pom.xml b/java/pom.xml
index ea42894fda22e..71f59caf2798e 100644
--- a/java/pom.xml
+++ b/java/pom.xml
@@ -303,6 +303,9 @@
             ${project.build.directory}
+
+            -Darrow.vector.max_allocation_bytes=1048576
diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestOversizedAllocationForValueVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestOversizedAllocationForValueVector.java
new file mode 100644
index 0000000000000..4dee86c9d595a
--- /dev/null
+++ b/java/vector/src/test/java/org/apache/arrow/vector/TestOversizedAllocationForValueVector.java
@@ -0,0 +1,137 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.arrow.vector;
+
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+import org.apache.arrow.vector.holders.UInt4Holder;
+import org.apache.arrow.vector.types.MaterializedField;
+import org.apache.arrow.vector.util.OversizedAllocationException;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+/**
+ * This class tests that OversizedAllocationException occurs when an over-large
+ * memory allocation is requested for a vector. Arrow normally allows allocating
+ * up to Integer.MAX_VALUE bytes, but exercising that limit may cause OOM in
+ * tests. Thus, the max allocation size is limited to 1 MB in this class.
+ * Please see the surefire option in pom.xml.
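+ * For example, a UInt4Vector stores 4 bytes per value, so the 1 MB cap
+ * (1048576 bytes) is reached at 1048576 / 4 = 262144 values; the tests below
+ * drive reAlloc() past this cap and expect OversizedAllocationException.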
+ */ +public class TestOversizedAllocationForValueVector { + + private final static String EMPTY_SCHEMA_PATH = ""; + + private BufferAllocator allocator; + + @Before + public void init() { + allocator = new RootAllocator(Long.MAX_VALUE); + } + + @After + public void terminate() throws Exception { + allocator.close(); + } + + @Test(expected = OversizedAllocationException.class) + public void testFixedVectorReallocation() { + final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, UInt4Holder.TYPE); + final UInt4Vector vector = new UInt4Vector(field, allocator); + // edge case 1: buffer size = max value capacity + final int expectedValueCapacity = BaseValueVector.MAX_ALLOCATION_SIZE / 4; + try { + vector.allocateNew(expectedValueCapacity); + assertEquals(expectedValueCapacity, vector.getValueCapacity()); + vector.reAlloc(); + assertEquals(expectedValueCapacity * 2, vector.getValueCapacity()); + } finally { + vector.close(); + } + + // common case: value count < max value capacity + try { + vector.allocateNew(BaseValueVector.MAX_ALLOCATION_SIZE / 8); + vector.reAlloc(); // value allocation reaches to MAX_VALUE_ALLOCATION + vector.reAlloc(); // this should throw an IOOB + } finally { + vector.close(); + } + } + + @Test(expected = OversizedAllocationException.class) + public void testBitVectorReallocation() { + final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, UInt4Holder.TYPE); + final BitVector vector = new BitVector(field, allocator); + // edge case 1: buffer size ~ max value capacity + final int expectedValueCapacity = 1 << 29; + try { + vector.allocateNew(expectedValueCapacity); + assertEquals(expectedValueCapacity, vector.getValueCapacity()); + vector.reAlloc(); + assertEquals(expectedValueCapacity * 2, vector.getValueCapacity()); + } finally { + vector.close(); + } + + // common: value count < MAX_VALUE_ALLOCATION + try { + vector.allocateNew(expectedValueCapacity); + for (int i=0; i<3;i++) { + vector.reAlloc(); // expand buffer size + } + assertEquals(Integer.MAX_VALUE, vector.getValueCapacity()); + vector.reAlloc(); // buffer size ~ max allocation + assertEquals(Integer.MAX_VALUE, vector.getValueCapacity()); + vector.reAlloc(); // overflow + } finally { + vector.close(); + } + } + + + @Test(expected = OversizedAllocationException.class) + public void testVariableVectorReallocation() { + final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, UInt4Holder.TYPE); + final VarCharVector vector = new VarCharVector(field, allocator); + // edge case 1: value count = MAX_VALUE_ALLOCATION + final int expectedAllocationInBytes = BaseValueVector.MAX_ALLOCATION_SIZE; + final int expectedOffsetSize = 10; + try { + vector.allocateNew(expectedAllocationInBytes, 10); + assertTrue(expectedOffsetSize <= vector.getValueCapacity()); + assertTrue(expectedAllocationInBytes <= vector.getBuffer().capacity()); + vector.reAlloc(); + assertTrue(expectedOffsetSize * 2 <= vector.getValueCapacity()); + assertTrue(expectedAllocationInBytes * 2 <= vector.getBuffer().capacity()); + } finally { + vector.close(); + } + + // common: value count < MAX_VALUE_ALLOCATION + try { + vector.allocateNew(BaseValueVector.MAX_ALLOCATION_SIZE / 2, 0); + vector.reAlloc(); // value allocation reaches to MAX_VALUE_ALLOCATION + vector.reAlloc(); // this tests if it overflows + } finally { + vector.close(); + } + } +} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java 
index b5c4509c8b540..ce091ab1ed06b 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java @@ -17,29 +17,13 @@ */ package org.apache.arrow.vector; -import static org.junit.Assert.assertArrayEquals; -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertTrue; - -import java.nio.charset.Charset; - import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; import org.apache.arrow.vector.complex.ListVector; import org.apache.arrow.vector.complex.MapVector; import org.apache.arrow.vector.complex.RepeatedListVector; import org.apache.arrow.vector.complex.RepeatedMapVector; -import org.apache.arrow.vector.holders.BitHolder; -import org.apache.arrow.vector.holders.IntHolder; -import org.apache.arrow.vector.holders.NullableFloat4Holder; -import org.apache.arrow.vector.holders.NullableUInt4Holder; -import org.apache.arrow.vector.holders.NullableVar16CharHolder; -import org.apache.arrow.vector.holders.NullableVarCharHolder; -import org.apache.arrow.vector.holders.RepeatedFloat4Holder; -import org.apache.arrow.vector.holders.RepeatedIntHolder; -import org.apache.arrow.vector.holders.RepeatedVarBinaryHolder; -import org.apache.arrow.vector.holders.UInt4Holder; -import org.apache.arrow.vector.holders.VarCharHolder; +import org.apache.arrow.vector.holders.*; import org.apache.arrow.vector.types.MaterializedField; import org.apache.arrow.vector.types.Types; import org.apache.arrow.vector.types.Types.MinorType; @@ -47,40 +31,19 @@ import org.apache.arrow.vector.util.OversizedAllocationException; import org.junit.After; import org.junit.Before; -import org.junit.Rule; import org.junit.Test; -import org.junit.rules.ExternalResource; + +import java.nio.charset.Charset; + +import static org.junit.Assert.*; public class TestValueVector { - //private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(TestValueVector.class); private final static String EMPTY_SCHEMA_PATH = ""; private BufferAllocator allocator; - // Rule to adjust MAX_ALLOCATION_SIZE and restore it back after the tests - @Rule - public final ExternalResource rule = new ExternalResource() { - private final String systemValue = System.getProperty(BaseValueVector.MAX_ALLOCATION_SIZE_PROPERTY); - private final String testValue = Long.toString(32*1024*1024); - - @Override - protected void before() throws Throwable { - System.setProperty(BaseValueVector.MAX_ALLOCATION_SIZE_PROPERTY, testValue); - } - - @Override - protected void after() { - if (systemValue != null) { - System.setProperty(BaseValueVector.MAX_ALLOCATION_SIZE_PROPERTY, systemValue); - } - else { - System.clearProperty(BaseValueVector.MAX_ALLOCATION_SIZE_PROPERTY); - } - } - }; - @Before public void init() { allocator = new RootAllocator(Long.MAX_VALUE); @@ -96,90 +59,6 @@ public void terminate() throws Exception { allocator.close(); } - @Test(expected = OversizedAllocationException.class) - public void testFixedVectorReallocation() { - final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, UInt4Holder.TYPE); - final UInt4Vector vector = new UInt4Vector(field, allocator); - // edge case 1: buffer size = max value capacity - final int expectedValueCapacity = BaseValueVector.MAX_ALLOCATION_SIZE / 4; - try { - vector.allocateNew(expectedValueCapacity); - assertEquals(expectedValueCapacity, vector.getValueCapacity()); - vector.reAlloc(); - assertEquals(expectedValueCapacity * 2, 
vector.getValueCapacity()); - } finally { - vector.close(); - } - - // common case: value count < max value capacity - try { - vector.allocateNew(BaseValueVector.MAX_ALLOCATION_SIZE / 8); - vector.reAlloc(); // value allocation reaches to MAX_VALUE_ALLOCATION - vector.reAlloc(); // this should throw an IOOB - } finally { - vector.close(); - } - } - - @Test(expected = OversizedAllocationException.class) - public void testBitVectorReallocation() { - final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, UInt4Holder.TYPE); - final BitVector vector = new BitVector(field, allocator); - // edge case 1: buffer size ~ max value capacity - final int expectedValueCapacity = 1 << 29; - try { - vector.allocateNew(expectedValueCapacity); - assertEquals(expectedValueCapacity, vector.getValueCapacity()); - vector.reAlloc(); - assertEquals(expectedValueCapacity * 2, vector.getValueCapacity()); - } finally { - vector.close(); - } - - // common: value count < MAX_VALUE_ALLOCATION - try { - vector.allocateNew(expectedValueCapacity); - for (int i=0; i<3;i++) { - vector.reAlloc(); // expand buffer size - } - assertEquals(Integer.MAX_VALUE, vector.getValueCapacity()); - vector.reAlloc(); // buffer size ~ max allocation - assertEquals(Integer.MAX_VALUE, vector.getValueCapacity()); - vector.reAlloc(); // overflow - } finally { - vector.close(); - } - } - - - @Test(expected = OversizedAllocationException.class) - public void testVariableVectorReallocation() { - final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, UInt4Holder.TYPE); - final VarCharVector vector = new VarCharVector(field, allocator); - // edge case 1: value count = MAX_VALUE_ALLOCATION - final int expectedAllocationInBytes = BaseValueVector.MAX_ALLOCATION_SIZE; - final int expectedOffsetSize = 10; - try { - vector.allocateNew(expectedAllocationInBytes, 10); - assertTrue(expectedOffsetSize <= vector.getValueCapacity()); - assertTrue(expectedAllocationInBytes <= vector.getBuffer().capacity()); - vector.reAlloc(); - assertTrue(expectedOffsetSize * 2 <= vector.getValueCapacity()); - assertTrue(expectedAllocationInBytes * 2 <= vector.getBuffer().capacity()); - } finally { - vector.close(); - } - - // common: value count < MAX_VALUE_ALLOCATION - try { - vector.allocateNew(BaseValueVector.MAX_ALLOCATION_SIZE / 2, 0); - vector.reAlloc(); // value allocation reaches to MAX_VALUE_ALLOCATION - vector.reAlloc(); // this tests if it overflows - } finally { - vector.close(); - } - } - @Test public void testFixedType() { final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, UInt4Holder.TYPE); From 689cd270e923d4f3f15913843c2569b36e87c4db Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Mon, 15 Aug 2016 09:25:51 -0700 Subject: [PATCH 0110/1644] ARROW-245: add endianness to RecordBatch Author: Julien Le Dem Closes #113 from julienledem/arrow_245_endianness and squashes the following commits: e4cd749 [Julien Le Dem] fix linter error c727844 [Julien Le Dem] Fix NOTICE; typo; doc wording 88aaee3 [Julien Le Dem] move endianness to Schema e5f7355 [Julien Le Dem] clarifying big endian support 36caf3c [Julien Le Dem] autodetect endianness 7477de1 [Julien Le Dem] update Layout.md endianness; add image source file eea3edd [Julien Le Dem] update cpp to use the new field 9b56874 [Julien Le Dem] ARROW-245: add endianness to RecordBatch --- NOTICE.txt | 5 +++++ cpp/src/arrow/ipc/metadata-internal.cc | 20 ++++++++++++++++++-- format/Arrow.graffle | Bin 0 -> 3646 bytes format/Arrow.png | Bin 0 -> 86598 bytes 
 format/Layout.md                       |   9 ++++++++-
 format/Message.fbs                     |  11 +++++++++++
 6 files changed, 42 insertions(+), 3 deletions(-)
 create mode 100644 format/Arrow.graffle
 create mode 100644 format/Arrow.png

diff --git a/NOTICE.txt b/NOTICE.txt
index 0310c897cd743..a85101617cec8 100644
--- a/NOTICE.txt
+++ b/NOTICE.txt
@@ -7,3 +7,8 @@ The Apache Software Foundation (http://www.apache.org/).
 This product includes software from the SFrame project (BSD, 3-clause).
 * Copyright (C) 2015 Dato, Inc.
 * Copyright (c) 2009 Carnegie Mellon University.
+
+This product includes software from the Numpy project (BSD-new)
+ https://github.com/numpy/numpy/blob/e1f191c46f2eebd6cb892a4bfe14d9dd43a06c4e/numpy/core/src/multiarray/multiarraymodule.c#L2910
+ * Copyright (c) 1995, 1996, 1997 Jim Hugunin, hugunin@mit.edu
+ * Copyright (c) 2005 Travis E. Oliphant oliphant@ee.byu.edu Brigham Young University
diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc
index e6b47de70ed70..1d3edf0117f91 100644
--- a/cpp/src/arrow/ipc/metadata-internal.cc
+++ b/cpp/src/arrow/ipc/metadata-internal.cc
@@ -243,6 +243,17 @@ Status FieldFromFlatbuffer(const flatbuf::Field* field, std::shared_ptr<Field>*
 // Implement MessageBuilder
 
+// will return the endianness of the system we are running on
+// based on the NUMPY_API function. See NOTICE.txt
+flatbuf::Endianness endianness() {
+  union {
+    uint32_t i;
+    char c[4];
+  } bint = {0x01020304};
+
+  return bint.c[0] == 1 ? flatbuf::Endianness_Big : flatbuf::Endianness_Little;
+}
+
 Status MessageBuilder::SetSchema(const Schema* schema) {
   header_type_ = flatbuf::MessageHeader_Schema;
@@ -254,7 +265,11 @@
     field_offsets.push_back(offset);
   }
 
-  header_ = flatbuf::CreateSchema(fbb_, fbb_.CreateVector(field_offsets)).Union();
+  header_ = flatbuf::CreateSchema(
+      fbb_,
+      endianness(),
+      fbb_.CreateVector(field_offsets))
+      .Union();
   body_length_ = 0;
   return Status::OK();
 }
@@ -263,7 +278,8 @@ Status MessageBuilder::SetRecordBatch(int32_t length, int64_t body_length,
     const std::vector<flatbuf::FieldNode>& nodes, const std::vector<flatbuf::Buffer>& buffers) {
   header_type_ = flatbuf::MessageHeader_RecordBatch;
-  header_ = flatbuf::CreateRecordBatch(fbb_, length, fbb_.CreateVectorOfStructs(nodes),
+  header_ = flatbuf::CreateRecordBatch(fbb_, length,
+      fbb_.CreateVectorOfStructs(nodes),
       fbb_.CreateVectorOfStructs(buffers))
       .Union();
   body_length_ = body_length;
diff --git a/format/Arrow.graffle b/format/Arrow.graffle
new file mode 100644
index 0000000000000000000000000000000000000000..453e85025d8d324310e27daa3692a98e3bceb332
GIT binary patch
literal 3646
[base85-encoded binary patch data for format/Arrow.graffle (3646 bytes) and format/Arrow.png (86598 bytes) omitted]
zXy%7W;n7b67sT372zk&uSCJf7w|&WKfP#WrwyivF8>IungM|;RfGFs`fuq^l@FL!m z#%7E#w)f-5mWs~uWbm3S(w9t|!62dU0_rHo8-QXluo9Zh zIsg}(AQ@5ZhlF_ymx7aW(+`H7WVr$kX4g5DV4g6s02LLR>>T(I>x|L?;ICmDWP~eA zY%Cg)2x7`dpga|;%;@ubjJyv()m9Do0t?1pKM%eHUW1}a?VUyc@4G-wi~aC1yowZF zh-cz?Hc8Gx23L;Xbk5SL>uXAeU=V?hu|OZNHh_%3(khVtwSwB;k2j}ZA_%Fo>Tt(H zmO;n}6yI4cEecuj*qk~5Nsh0#Ik^U@pQKDe-(6NETd z2aPN6lC^U~Noc-$o_NX&>nDJ8zYjqXTih%Numw7UY4%mavV}x9t)HU}_nrt%9NN(! zQvy~(Ek_t$1r0AA1WX?(zx#}#T~6T7dC|ecrDo3xy-I#SFZdq((kzm924QHWQKKoS z^M)60m+8QZe}T?@bKrG1#oI=p3NMh0C3aq@rKhQk2?`mz_YxH(0k!DeXNlWS zv!YuukGqBFg)vK7$*jKtQ5@10mG_*5anhz{_dCM=BCu;jEi-lBBg19)Rb+e8M-k*M ztNhPFKoA5f#c3`#U#mA^gIdYghW>!IQ^^tb0y6VFK%~XOR|Y6{{|&;~8=QxrhT#TH z3<%1l4uA~UmRaDetwj10D5q9AUjsqJ(QNe!pog9V01|EuG+6~1Vm=E)JNGU@fr1SI zn7`_A5T(eb!?K)R5*q;r1hSd6#1E>IuhwHZJ_0h#mrR{Aw9l7x`)u(Ejb~+q0s#VU zlh;m5h7mNBPKCAoeBz|>1|hN?EqE`D2KvYih*q^(T6hFOb>=#{qdL4S5U`s!Kz5HL zB1V{GzTOiJ=Rg$)#6g7clH9o-ubsrLCQ7`9?wyth)eacqEsU@_?5$C^_cqKIMD(jV z*%s;&K*x#u1Qa)53gr=GZ{^+rGSxf9SY^MZ6?+%3r}`{8;nu7jp6gBzB<}d#w0ujk7f7C#~N6Oxg%G9@W-)pT$5ucFDs6Us*_7wrz15q*!Oo$6I|_nV1$Smn&nv2x(m>85aJakh3tvES=%_)1j=xx~n!alS*~A{q!6B923$b}bXuXe)7 zBzYpRPfS$C5VBxaxclt&nsqEfe27pANP6>tfS$MzNr%N0uN~U~uE$j4UgO2LY)v!q zHZgcX-0)-cuk&TiJlRvhhAM8Q z9SjibOv{XoQRFjWpjY(4vCwSu-K)AMbXQ|0OIhR$WQ+MfQK?Ooc6RhTZ_zHlUujX> zg$R80eSvYM!-YMXV34<{dDz&k&ET@r4Uu~QPd%kheO4l!qc7WRdVeYd zy<~WsLi#;a|HY3*vsnNPL3Syo{Gj2j0?ZNe9ZS7Z+nX8d_VbHh$yYr;n^d>tT4ye4 zA*fUK@uL=*S$ zQ!>vv8TPYCXi6mN4R{Ah2n#{~t#y{mXTLeclnSOPaFFpK@71cBMe?$J^_#s%|Grai_5AzM&&eQK22S;4c-fK<4Wy@1 zmsfx0;$UuEzn-WP{}^SbI;BuT;}l?9=7XEp&HF(&8fR>n$a&ge3%IXf@OlI73G2Jm zLM>7bn=#}ns)MPOBTEL7pK<4wFO&)YXFj+gB1$3~ZCXV^fuqB){g@oSL z_QsqSlKp^^Y;Axn1g~*^P#T8meO}|Q6!^5qD;PuQ^A-Q^h}&By;I^P)K@}{uRU2cMWA>!WF!g!npW7@06k9lhofvPl?Bl zbIm3@lv)J6Y#b)keM51)9C)QZ7`RmK6~M+xK~^pWdRDh$iKGqr0N&HVJt3uIUR4+U zAnZ7Jd+4NiT&m3$aZktq8E4mA^l931;@|UiYMo_KErSCUASqdcs>&d#sg7qa5MH_8_liD>=^~VM4I+q= zdW_WU=D)xx)B)vn0(jWm*0P-d>ZIu5>k#{b%@*y%`SC$o2lVrAK)$-6WV?uG+4f~CBH%DQch{a>k$$RtaZs*^!kQ;{ct8J zUOeZM^3vQwtC|j=4V5LHn`c5Kc>J!W3)F*2<;E?LM1-yS_7VtI!XUZ~Nmtfi6uGho zxf26@`$+A^-ox;8%e-ats-c!$epw&yam%>pt`*DP0(lmHt>TCYDNCBo$Y2Rr+lJKF z%Hd3TL$gT6f#dfHXC2~fa4WTTWchcJg!Qp_nlCLvyR1?sTIhMp_l=}sN&5a*~}lJLg5Asp;z>#YUluHi@FY0xDo z_JDfjXgoLP!K5H&=DGmoANoPZQ8+bnWP=lVE2;v^vLo*(9SPiyRKrn(Syu})s+xI* z7|En(?=k;y-RL#CN!^1`q>vzqB5nBXCGxf#pXwJBM=@l zVq1RlRDA*&DaqCqOpDOQCMqwRhXOCHnZlekt*ZSrB;LZIp`oPj~2&hEXNUq|tz`4EODJ1~sBEbw8gZ45z{x z4L1qaYAMefbr|0~_{v9IfQV41f46@z@$~NaQ}GUZ#KpT^^i zP)(VHVsJZDjnJlZHIp~9=mYii`Xw9P*I`;=qW!sAy7mJs3c;l4v7X-`Xej=PNOa;$ z{b>kbX7Rz|B?czS@%nq-quE<(eqw*>Y|C1I)w6>QN^5RYN|>b89=Q2Z!G2e{QKp#~ zmWl!oAs!g;vc1vddVtZqowRZuRCxN$!3CX=X^{gj=b=hEXNbrUh$ZX1i7JsP%a!K0 zj-%J-#y?@x;Z6!CE(E||X=9>H===bv%LnXb&u5jF1qzj+_(RL$NRAN=_pstsI!NLI zdK6+ud|aE>B7j~qG{}3`c6U3=b=zI|w>_>IHG2)f>zd@XAVNeYBbiGVgk)Hg_?O#@ z7(AB&NngrvI|N&Qg~CqWq3B>Ij1xYN#&01D3?qQ7I6;i!_Mv?NtO=GNyVhz4B)9n1 zX-!8Q`9^k2W?U4XZ8eQ)JhB5pWRx*xjm2ltS)dS!*Z96sz9m{W(j?KNN;Qm{;VIT; z-8TX&Ad=Ml7H@OzSHJfrYvQ^4ZfX=i3@e^6zFLZ6AJj9x3CCpK_>CtDfD~4??1Kmk z(R}c36i~DsUmI_)pWf2`-cQ{v%okPHwd{1P&Q3iPCCj;iR2A|&Gi{iLMO6wgEq(2p zMX!F$>b`yG-TDviVTw5ap46aUC^_^ ziJv-gkSXLZcw_><&~IrfcjkP99Jha^dYfwio!2DYqS26t{#b4vX{8hD56yGn)iLdG z|D2pL(>@x`=Cm`I-s>q*h_Hx2`TZ7Dm4);=yr#C18tpru&OEzOP5Y4XCCe@UIkue= zZWh`a!2#yk)&!ChEwFnEYhPLXM2L|7gCz;HS68|>i;@Le_-WUf7n-lytv{BKd&DUx ztsiu-_uwq!f_-*6)%o|b>2kMu5VilV?Pwf6*W=Nf`+*9P@^NgyXjlpepHNzd`jsE( z5`N7Akb3sE6UAxH(<^XCIx}5P>Xty`$41U<~kd#aiF zi2JE<`l4!)q0apGsBp(z4XSLZW65t6A%(&K2Z;?ewC@)I5BPZd!ztdqLkeGdX|r(( z)UF6#AME|rXD7#?qp{DQ5m?b8P_;n1~BVYG-*j(TJo2Fx7hF^+q4@`r!u`wqV;o)Y)7OrGu`9c= 
zNkpYJdKZEKhW`Fpu;}V_eQ4)Vw%ZHTB@~a(e0T|daMprX%s9Tg3j(jZfr|HgWkW(j z!FC=L^YMOG_O>b_y|Y~tbTeM1MPQaff@KLe|%Cr)6PHQ(kUfC6BvmUmd! zs6%K~czi?4_=#m79iHL)OoLJ`KS(K>-`x$b3V;yKyHgL9NpREgf_kT1f>T918E2d9Kr!JspGLnlKr_N1(xp-pe z1-^XrxA5xE?P)_)B(I(4um0eh|M@W(`fZif#|?Pf!HWHx1WFo*1#S4T%Kz#AQr{xs z!bf@_N!4Hb954YNpkRu89lSU7&yV4+n&B7CJG1_evBX_CXp#cwhW`)Ea}0hl#%=Py z?=;aKh*fPGUq0Ub_mzq0Q}Bzw;Mw0F#jqTr=Qw}h|8d_S)xj_7!he#s|9ZP32q(%X z;eKHG|9&`o_=Wu=fHeQUMqq@p9oX1+pS;odZxR0d1H3f+;{Ur1KQ6}qOK-!KZTjAITJ64Ab_3pMJ>f@xFRM%-1EE3W5%@o-?9TpEsi+it;^ ziST2kpVuNh)%_V{Aidr1QhKEhIOAH`twX;4zt@(MH@=fLR0-jy5J;aZ5k{b|>2pvhV}Higna1l;b6=$d^R3{RRUFSeW9{-$|; z^)*28XNQ=M2g_7<$^$!Y)(J7T28u~Ar9UcT9lCSQIim6Q#~aH+ZyvA z=lmBg-=)L~+dol_U3>xPSH4UFXbfAy)KeZ>V1Kd%pkMgw)^8x;w}rUAl3B6V zB{qtF@#`P~zf<7Ou7k{E?CcHz8o@L4Xi)s0C_(VSAf79T|4ZY8Ywm;~ll8)z@`Q;l zw?oC=UZCnZ8>_9Ss&>g3(bFu~-5a1Cw46-bEA zCj3h~>3%<5t~?c%7D%CJzo*UXYe-JxJwgupK)sU<|ROV)N;@7n<*(&rt)_YSQ>);6wIYKeGSE>qgartUB*B@5@Wzhw&a(z}e0xV=H zvnuE-jUrYLQN$?pi6+lv3CDvWF8uyq?MT7c9$&$4EZf8lKOXzT@ZTR#hGgLE^tA|% znudLuN(Ea`w^ius*%eF?0dGFvsjmlquUjPA&~9IV4FkMhLlIv$(b z)4HeQEvI&0eBJl|vLxVVhq32IGy!|s>C$cviMeCoY1>!O0mh(9Pn z7a#xWyRqqx02gQ)tazAf`+?~S9hXy`>2kXVQ!@=C;eCbjy7~Vj_K6mhJ<=2kAH)^* zsZ2|Yr&EH{!&-{xOSv5ngv5-%gWdhiG0q3GrOU?^SbWT<5zp`?0*iYaq;Rxz3*QK zrMQ_M>H24D&bML)gjf%iVFFwu28~eS zL9=>3Xjtv^Wzb{1*8k~_{0y(`3tYWvGE~$zuYh0a0Y`uWq&73)vkN%9P zVFY##{QZ_tR6-;*umVJyc5hN$nu;-^M=BcQxP^-vooD$GU&r6~gs_My()NZV7jnj{ zbq6K$VGZicUb*w<58zMt-yLPgm7QEDlc)Nt$nhmS!)6N(xU-4mPs&*YmP#PUI zXQcb1k!IWbdzDw!rk+s)%Ig>K>P6IdH4^r@1P)jkw+Ql4f4_*hSsKOHam1P|kN?H( z%A(~fitc>{d(%K803X|{bQYoY4-1}-`1_vb+MvIJNFH`n9T`o><`I%n`wVFNApVxH zq)n5MRfgr|%j)xQ-j}GKYs}3Z+BYPseI;*vju*rIDnNyLNAjPz^W)lFq=9g$hV)vS zqgZcFZ4NY8Y`BJoftQ!@RrR}|A}^4NcvbE~y0s#HKtA{A-+3GMrWHAjZ>Gx@6tz=vPD=Enc~4j}g;QI>?5RzVa5p}0iJdhK0pTfTTBsaP
{_qU9|G**tg3Q-WTbn-oS(-;B zkOs@w>E1l`ig<)M13uc9MKi;SDM*FvX8JKjM05cz0QMXWgWvlXJ3QB*D=M{H;&tKg zHDwl$R#cpRb?2mNYN@$!y$CqqZf~#lOf6nR0@=osRu)M!Gzka-$QgA_EI6J|4fuM` zq2KH{28Axu?z2EA-gSBw8t!y*cv*IaS*f>0=6SVXE&zbxv$UshT=JVfO0fR^R?#ygm9AAx=VVj2$A*~cjsN$IrM zHTgpT`a?KV)Fv75pB2eJ5}6D3obIrt8=+%)?(4_%fS?rW;Vf<~mHoLmUT8Q`rXa(6 zR^y&w-99PU1@|fIuGmc<;oIsH!e)g6@0U=wL_l9W0*(rqs=nKH41vBw3Ui$s1Hp&Y zbYt^##YH~o1$6>1S%1sNfr`WZB7)w?iQCWYNXUidk)Vo5W1!2*!4bzP=LGn}QT8`q zW}n9xVGyS-n{mX=^k=|dqnaMb2;Y*Rx$366b!RMb5JI|-bpFxprifx`>QvQ!+LW7Q zfL0cPJ<(@DZ1oE$0})NXFr*U#W1`0xM?=7(L3Am`J#HiR2I2eR4;2!>|LyTwYhZhaKBV!G-AKj zCoqZ_?ROm!a_AiapY1j8?)T8JFPuWxE9F1bQLp+hu{=u)gcKRXr|qdaEGqMx(~8I2 zMALZH&Uh+5p`rR?ENGUiPdt7aaTk{he!O|Ew{)bL)zsw}rXO}UFmb1V-V!s36a#>Z zF`P67`;IL}j#tY4kiZJo=Wbvq2>4no%TFtRmft&lbf+ITW+kZOS!*U@lF&9OKZy|L^)3i@HTiS}s>6_U; z>F7fwaKJLWYJxcf;-HTo+^1^6AdCOp^figcPK0JKt)2jvxf28BeCHW@&W?hfjHeF} zGqn&{P>n}BQumVrL4dS{iQjYYT{gnoUuOYMx4#xWOv^`VPAqL)+%aEu|(;sUWe{YZvn^RyliF7CF@AI zVUoR}R?={TQDubUoM_RZiRy;ECuWSKAQxs4piutjVG}Zl&r)BR1r-$*fBUZXw~OWN zm7y)z_Jv#7g{*?{6QBw9Ebf)(^}Ab`A2Ji&15&VNE-S!9%n%^6}!0K z_WN6yT>#nq3E0$SrA`N5vLM3#&=s#gfe+c(l7tbkAvgp_sH8N3SZYNhIr7fx*ltCA zvt!iFV}FV+_?(0cn*C0?Aq2PB+)hm+b~D~r-22Jew<7<#1mb3rBtYhi!B^X|(Aa(G z*3p~oX7e8!KfgE#4%|~9@;N-uYcJ=H6lE`&{ePMD{ERPnY7($F5dKYhGX+X*2nrg+ z0Dt}Sfa+Z3p5z^{Vuzl*y?+4z79C5IGB4F-`uo5YFCa`PDH6&+Jqln(FI_Z0sLJS`Koida4re|H23+{?4Kk$#Mb(E2%_yjkb7}+ z17dpl^p&I*>}=oyYyH*p9NRvT@4cdBGbA4F0mp%@;=a|0=lqW-87(2}JJ|`iA`(j{ zE+3GdJhOan#}}t-VH}VtS-@*9XI9^}LaOTGrOJBCzg|qd$BYEVq|wua-PBmT+shhv z@B)Gwbt2rS?t(nzRU+?H?xp?xfW$n)?yr!gAIRhZ0IvK0k97b(3Jq+lb(nTh4ODvT zhuu8FMZqC9HHRP>MGRyAUcT(Y-XQoo0sOt^z2P~vTVO%FfITj$|28-kA9{pL{P@BjrGdy@8n8G9IF2TqT$GffKy%x%zo-XH;o$CbrE9z6 z7x8%L9hg~V3ngC1$i8wOZymPuTjQ?9MiPH~5%Ac~9IXeA4}-N?aI!njE{#hZD1W^) z(LpJvg>2CLoS&cKSKQp{3jtN9mpfk_BwSa5S~#fx7b9A24Bg@$VH3^W>P*~CY54)K>< z2S|Qzs?@_*!B18TJ_f`PgE-E4X!i3kFG!%)*ixkrA+()u%Tr>9Zn?Cvd%H2LD| zcc6vvKfV%gVP;>H1ww9l+pSk95#xPJ_QKi<5Z_c1^Pc6}3*t4<2V{#`m<>;(!3V*k zFF%Tz7T1s1l8)!^VvGB8SI8$Z5bVA{vtil#-JSR+FArz%G#g2Vb?nhkY)Y=U7DEig z$V!u&$uGowDPweKm~p6|UjodI$$1bApuKlJVKwVoxsiUNp#4o-8~Nb=EdMw=%xe+H z26hmxKeINnyGpl%LG;h%Skocmx?#f2r@bc?dQID~@~9opr7y+epC7Y-$Bz0+V{INW38c-%jv5Kt*@y=doz>Y^AcwBq{6TF)OMRLJBywWF<5Wd?y zDKIf?^7DKmRY1KW8g+P6O^-9-7F?dSJ%~y1^I|Ky1|qRdI{}HJ8(E^CcQIpJJQA_( z$8~>irk|^s+<~!Z?jcI3K$(bC^rzoj%a}&2rs|=Sm_?WyuWNh~FXZKI4;lM9S;2ibnKzjewa}Yg8RxRt&_G#vftMH$d0A93FMilxv&cWC z;z@y8`@pjs-7y4NETV(wQS#(e^>L96&ohrCSYr(n@Y_H6mHpnQzwX*R4U+BvJ-1OH zdJV^JjD+%2NNcjk}%d2~^Id2$SafK~j*aDM$a)CtM$|wyWEr=r>4RBQjP;?Z* z0%dzQi!}b7xQ<2>scF2Qz1*aHrV)B{96w=qA(>Us70py-nOnbOe)trr&hpi3z|7 zH{gGB*H~^E!?oCd^?ckRjFZqp=DSDRBnNmk&iVB&J_SI&eH1*1`@udTjBe{>`(=-b z_RH(vzk(|De86+?3@z#fl_CNDhk7;}{^Tl-D)xmhkK#$nJu}{XEz=V%em@LJ9W7 zs>jOg&!tOXp&99ox}RmNu9iq;4{cl5tW){`x=(L*W+-%|NQ6jbrqzfr8te;agCPZK`DQ%REn=d>t%ir9g8Z}rW{WASP`sS7AP0-R&K?C^eiaV8s_N+6E2cn% z)hX_~hFys#fSE+VlT}!UuAS-HQ{QA9yPkTyfUy`yTj0SD@Y@K|m5?Vih0!I0Rlru2 zuzHk52tZ%SXg6Gcsjcs;+7qy1vgWZxkAB{QMqzR-!#QuzrjCLLcX|z&^if=V*FS(u z!8b3dS#NbL9FR=K#g-V0NHU0uW0kaTS`_p!rfUQ$35C$fG$byXUR593gPI{Sn`ptX zD=1a<3wywq&=pm%o1B12jJ)kRlynvHjPF6f-pHOuW@Q<~U%dDIVLXKfT7GYec<)_b zSC+NRg{Sh+-{`X{-Vbi>-Ecy8c?_ncw}0$?Puspa zLB#J{Zcy+%%sxxs%mjn*@JFyS+c>OsegzaKK_br#2*bA&D0RN(2eI#aR$A6tI6#NE zWn=DveTQl@C1W%tXOi{$Q`gOnb@gx+`E3P|{aFKP|5S5GIdA=z!Yz8QidxMnF5aI! 
zKH-RWYKD2`+-m<>W$OVqCUjKV&X==Z>eCb@j?$0POs@f&)cqCsedD}X85cNk z@DcacArOg@y2K7juRTa?NxSx~Li>L5<`pP%_Yg+%Z`2@%gPtIfz;W+fme92-sQ+ET z_N2GYVv?oyp6vvSWw@j9TBp<2nNsR08Nl!qaRz5P3Qx(sb) zmzmFab|e&CRC$e0)^kvn;{!yFIA1eCclx&rg6Bb&aLvzAvmyVQJn&kfZWC2x^0>f% z0+MxtKA}Oz=h5?Sd$j=h95^MdYE#Z(}@f+(u!Ms3= znk$Oqt%q=?ei|pq^U$}z=(0||p{6RRTjamI5-;Ww{*WlYiw&j4BHKcRxpaNmM3<7W z_Z!mZ4E!`riry9|GoCp43d*=`}cejEw2aQ#rDP0Y*Rwzx6hMOqwX+m zk%a}CVy-i4F0D8W1Me-Kr|%i;982Ti1nNAqD{!(pS|0BL~Qm-_tk3F#~d6yYAf%^waI~3aa-8PyvR%f$L}$bmeY*hf)>>{K8&lz zuiP?}b)@W()W$b+u)~P4J#DMrbOomn>2>M(8?a@Fr0j`?0d~{S+Wij1DRZ6& z^_(%EKMud@j|JvdTp5GbgQ-?Dn2$W_zB{C#%!2W3_9h$08>BX`G~aeX#)bt`(H(oa zlY;&xZ3orCYkLOo?A$+U%*etTWUkA86T?&o3`GhmOrF(NpEAaEL&7=Cg{+;QZ80p` zL@Nfgb4l%#C(8qstg#Tbj3E1WLVP1fqP%!a4`Ci%on((xS^fPFr)rmRs4j{t9Qe!JNCJ@ zn!YKKZF?9dm{wtTJ6a`DaA5XlVujyD0;H*%gJ9pmuM#1CV)R=p7OSGi81rRp_4E6x z7?F7T#??xr$^#jZ3X`}hw2bWLkoG2e5%pj+z~jrxRaK^~YP>9S-8vhUn)nuyzX<1*Qd?M_%L zc46L8(-38L53exJ!Y$e(lVQk!Jx&&s38+$Jwn$*S4=EuUCf%L;nJJDip+ku}^K1<0 z>Pt>)*KS>f7MBa0hAepJ+S&mlR6@tp8KgWiBxm0e=e-69VGAp!p2tuWP|0X(zRA<>&r5?YGQ;&hyj<`m`bRDv z5JS z%Td(1$MWfP-}Y=iWn8sQA9_c7>99}2d>iEfUO#k;2AnBbPx|8Z^qfX(TR_Gpf_K%Eq>-i zP8kYR2|1Nm+pi~yTlma3@xJ>;)9}PSx!L%o;gp+?ghr^)>B_fqMHORb=og)k%<;QT z>6V^_rd$fwJ%hgYp#eFK#uYqWxpX9l*R?7%v&`~FH&h9i(0~etEKmc@kopz{FK^{| z3#HH*YJ|0p5GbDJJY@`mitQ$9RHh{fuse_@9<+%LblnrHgrSt#UxX{hb>Dorf?lSz z{?2u}A~Fc0Mm4SKL1&t$Mx+n3mVHQ`RBRyn1R24u*UVNqeemCU2A@Ll{>T_NZV4w7 z4sd|$P16Qlb8VS{)iG@hz_(w<>RJy-6`eavL1dtYdEsJctlJ{!H}w7F#8_V0m{ z;s$sQsM>pc!^yj}^3Z7MW|Ve48XY|vLSai!Vi&;^^%FJwgt2{|$E8k*q1&okk_t-%CaVTRA)lnkTh~X;KxI#*CjQnrhnl3*R z;;=6Kw>BhH(u`xjEGX_`#*wBmh>wlVfSycV^{*7@Cw2nWlnz{N3He^`!tWTQ1c2IO z3>{&%E3|E9uqrRAV(;GV2g*GMD+NYO-n`D{ z4P6XlNS-Z23m&4FqPnxy-la!0!|@XH=neEvlahy6>v!9jIAz`6A7arc>|GGyEF|d!phW^ z-ZW&0@9aNZ=Y1r9OEdL#ou(?L|6&? zh*av?`1p^hI}T2EWg4tPEp%kGOOkKr1;!XQiKdrVt08Hez(rA%_`R6QH&@Q({J!k@ z-M97StI$Dh0o2)t&<^gh%U*+qC~E=79?a*FArEMWVldmZ#v$)!`9z0PPE3fW*b4{Y zfHHhG;|%$y>dW-U8G-c@b{8W-271^fP4CFgCR?WS-=Qbn6N_iPOai>Ek5}*nP6PL9 zi)vWZ310yCg*3e#ihks-{{dA53Zz&g?&-%R>5qVJxcUTZNZN@Y#-U{c10v?gX`2EV zHQk=SuMLVjE)*&vk_m$RaSVnxDPaahqqN+s=6E8ZIQTB@=M5Nqd=g5ObpCwy^3(+k zKp;*as-9t&q>Pm8ygm9>F*HdhKvVUyq?J4x-KW^Dtf8$>aKV&HdM+GnG1O)DF#Bhp z3DAs;aX+#+Q@b{J_Vdtg^pe4dkg?TeA52y+CcElh&$s~x@i4`wHK-B;#kRVZ9+E~V zC(EUKGIJ)8ifWPc?bF9$p5YBBy~DTGssenW`;OuuMFWrHXX0CwaB(G-=k=g(OPrd6 zx~@&CLdmB0V9n~?Mn#41Xq(_P7}Wg_OhjNalU$uJ+BrimWvdGUF{?T3un1hju9mZt zuoxW9QJDCh(76FsOx)Frp-?i>a!9qFB2nem#$7z!1JK1lJ3I{X;Z<=@yRq|{GY0-)itK>uOrqKgO|yLEZMl}36gE$;zdkIK!`_?ClW|+mjjd~rG)hL_ zFaG+yg1=q|a*H2$?}`lS%#7`uibMA}*4Mx>5tm8y?d`&B1THSJnb1OWBqW8zNe$WV zHewUdr;c}W*_6K{%8s+7U>9~q@*aU(fsp{xJ6|2Bip9g*DBrHVW*x?l7Y{Oh|1`4< z%5CSR1we~g(-4ie*cs=!w37V)q7TFy$yCqSJg1D57v~4uVBR^HS&rZHDtYt~P6qO! 
ztHt{;8v@J>wpE8=oD4u4kEan`b+bVO_}0bqYjCS&BZIOiqV+@xV}zn$*t>6l{y6EQ zF6{NPMxsG7^&^%ITS)AnY~aKjfnf^vu}vpy$OyczlVFUsVkI|;Ydi%#6ChIRK0)ol zgwxlFd9PYWR`Ij5{y{@YdMqRR=>5Bob%!y2$HSt5h7j}jh1{uqXQkpVRoF^3Bb{yc#qGgK%rO?FRZWj zDrE-Ud`XjXZRxFBWa7~5QAScSP zXQ9HPpdIROz8dCZ-s(tW{x~yM$~%TQ)0{5njEEdMX*A^reHLO6^}>ojKK(J^KKB(~ z7F(4F(X@v?**8zGt-<{3KD@D6ux_1hd~nc~>UO>-L-4L@?gQIsa#aWmO-RWC zok@rhEvI5`8WvZ2li8af7~6IsRM7GgNf)1{pumi~c_UH!NXJ&z@Zp0;VC)%wk6I?y z9S#|UTNX;gy5LPpFpf3KQYgPjEz@b9xC+XS)D!`oz_ zDY0I(34Ob5_4e1m68@GD*vyN$#kmG_ko$zK;YzcSiU1QtfG@*SZ{%`&;RTpY#G_RADGIL+j!NSd+{gI3=#9Y^(kY zO>>x=O!axeikaLo*nkL%1liDeudv`R-E}NItfdN(?$3cngsCamrfvv(o5>`){*x94 ztlLxlo?Zb^)3(x4yYJqjmc8GIFIp;ED4rsLNhLfsK=M+C;r0b%t~X;)7VNKZ&?uX3 zY7-aRj`1@%iiAW+rDDjy9d5OQMO5z2Zg#;A19R6Cc8s%{s$a53O;BeN*O)PZ7eZur zSx?fejKn1bT$t!clQ~e%OC!{%!m|#D?;-7PVA!`Lxdv;&{2+85BF#AO+g&~WVo)$Y zzq&ZlQjUM}Y0kx!g;6<<_yU_3_t^Gs*OpWOI8aAUkg7Vw>1aIj2xLj)4eLos+gPYS zTtA@#wyx?tIpAlZGM}YSH+v+{C&9jfwd71A35i9pjEfz!}R=4 z9m~r6J$L&|^9iAmQ)tLAcg1{xuGEo`b)eOzNLtOpvHKz+?}=;I{IEw)E@~rHE=gpG z%2{p$3pXa@Nkl$V+-4YnTM(yG0?ci=a2>49yh65)Zi$~7Z&LosE)t~?(qx|GQ#yxw zQ7e2Fg=R72F=xRsc)_xL4;7p@`At$Dv0Qkw>SOy#9}dkvsZfij>5(FoQN%3L@4t3k z2&X_|5j6?|kBK^a&G)6%-o<`^+%^Zf2hok4qYw+54(}-eI^Oe{O4&w+k~6Bm^Y;p&f62Fg8L^W^To`X6g76}#@5th3LvqJVVFghT01o( zuiJ6JsUzerK#pq{$GsDhTskEG9nz=FYRyoX0t7tvtK+eDwMT2fl!i%{X7zsV^L~4N zM-UPxjpxgnsD=AJcee_cM?Pjl$qA*hwWY-3cW<+M)G31Mr!}iOR={^bV;fDnSnl?2 zi)a-rVEOVLhwOPuZK{I3)xNN_2LTiBi>wG~J_LHj=O9Z`3FiV!Vn#Z<8n+`dVi-O_ zPUr}J1F=A58jz|6SkmP3x76^*dQh&>jfh3v@n#>$p(1Nwz-fb9*VU7;3+$(;LVP>}%4nqiL+C-EuL5 zz`_G*8rD~v8`yg%Zy0Cj`1aH9@myj)XcMu?{Jp~X*V2Go1S>{lLFwlKmx4*FAd3^} zeG3;ka=eQ>EWo@VkJ?yl=()e(RjJ-$)A1$lkUz&P@7TMzCdi~}Zb({}w)Ukx65X5X z9sO9={S%u_Kr+HNl+)XU4P$&34F=CvyCS6W_dLy38_3e}CI-GSx-5BWy0jR;qq6veM?S{7`~jM|IABQ*F_WElb95eDBt^e|@U;>VqzqI|BZc zDmhcm3VRU#YWJhanT7VnSmLpVAf^0ZH*DWJ3i-nxXCm%G`}ix)3%&P3uc>XLFWQUe zmc78?`8G>^-?ks;s~$fX{rA|vUEW1$7*qBQEpX48U$ z5S|<%@coPD<2*b<=T+&bwD;$~{`IloF|<BaF}AM_F{k-jIXaIL%x%- zTV>0)Zuu28KL<@XoY1_ETOB(p$&HZu%36>s-v)UfBr-k5O2=C1t#ghb;AmxOjlYOX+hx%J8v^AKjxWROGM zPe-5TlAu5|%PpXMEgh1x`odR0InGkuY!zdP*_%6WSxE%Y9pc9BZhDN#D6Es zcr=#~xt97(_7$ZQSFV-pSD{7iCd8t}$Tv(9zW_q4%&F`A_+1Xg@;?k zH7^<>okkLe$ae)j3@MF;`MKFQHwKR9JeIIv3>nXRRKE!uyN0ca@4 z70MTXqQbw@GeQqm>QKER^z&G7b!ruE=bet!1GCk?#OegU7@mvtM9gsbzuN0Z@^2V{ z1QAN{>Klvvx4lQ3IvvX12O;?QVUx^Ve>hG*(K<&g?gm%p+2B1ik?%18b%Z@QK2Ty6 z=CBQ(I?dJK{Qr4siX)dx_TRLqaqIPKpYdy<5wFp1$@sQG+lCF&3giAoO!*aR^W|aD zHlO^`ae2Ez!wwOLJ*A&;*uUXu~!r3C||{j`hE|E~OZN z^PT|}#b$4beG44riKmWjGy==);?taSz)t&(T*5aJdmVws0F3&kuNT@`49R3!olNB|Fz8jdK5g6Ya8)S6U=9cCdikIN=~fjIjEAtUa5%p>_L%Fz{sSus@W_o!h0iws!08rQ zWS}6x5G*J$-mpz2R*rVO#vdt4ev{~e*^`4l{gYPp=l}7}7#FuhIR{(m9bDR}%&UdL zte^%Hccyr*vzNT`$9Ld5wZE^%FP)wWIji+8Xn%cPkS1xi5i3F>18@nH zu4M?FAb!xG>AK;P44}BQSuPnFpa_X|3MNq}B$Z#Ac3*9{a(YrPSE+oQF3??Ov*6*ek z0E#0CXDwv@=X&C0fwTq@=s=N~wjv91uUP2SZM%R1D7AF_z&}?|5B~@=UMBmf)xJI# zA~=u~zXLJdzS#76Knh%P{%p?Ed;jMFp=9Xz?}CE~xRasd?`2V#NJZOP&ms2h)*`t( zc3%2_T0;~hP$wYJI`loBl6OkZ@}n|%>Ynu7+qp-7WcaT6(|`6G0h0wcYb5{D3LN@T zu2)w$G?>q7Eu_4Y69mS{FEf1Di-~56}4Cfp|gPK z>|j;FQd+IrL`{RlXw$ylZw8wCM>`6;CMh~y#|7Pv3;UXv|GRsMH?X;SL@_iCHc3T- zik3|U5Oq-3WjAXU%>*E<#5bW<5B}#NbiX#d!B$BNc3Zp|7+k`pI(iL#X!bM`f6j;Z zYyER?ai|vsH?n-%5xFQ^nfVB zhMgs2!k4|nEdQ<+9O)ez4V>WqrR4;<8!OpwN@xE~Otlp(_GJ?Gj_3c$hJ=ihJT zNIwidb>-#=A|1Re!_2GS=o}=hx?P%{0B~dgK@k6p{D2cOGYxwjYei z-g{q@%L)w?*IJfIBC+Lu`iUhN^+lfk<4I_?j2pjW@UDUW~Bu3WEZ0 zCyYFTaCmw9f%t-k_MgX$o*-T~MBJO@A}!ZLATA-Y$Y$j-4pzcQD4;=NsIDR!=2k`2 zmo&e$V^_U^w>{``ZI zw6@z3nB#zTYy!9eU3-^{pVf=*4+CoqRLL;lm%krXxKcM>E&KQ+1;b&?YeBLd6l?l} 
zv=)4C2kdsE70_yg55Av%avPU0bsQm+WZeL_c)Klo$q@51 z`k6WceFpfmook=#BLqk8*d=FbxR|5^uRgu znH1lFzNt%A8GU5ZzRkjQj{7FEChPV(K8|KbgjfK!zAm%Aee@C;(x*zu?Lh;vBj)q8 z4--owKZ#mw_aPS4CNnaaJd2x{-?sQxw%?iAJArtK(WRqssx;8pDj6F|Fk(r(qjS>a z-Y0B>s!aJOqj~9a$Ni#0iL;nu>#|j-Lf|6v;DWS3i~NI%tYt+rViR*s1`gGEXiNm3 zOinIVSM);GeycjFq2Bn4;!W6r85KGQS|SJlkXI=TS02#~Dn3xwCV~f%bl-tjRk*}5 zZK)J>i%F1#I6#9EQG5G;v`_HcGmIlOvtp}cu&TyE7yGXxpLivWa8>;*5V}K4gl`@_ zdxE9vhOO;wG6{Hay?`?%_fxo^8bC^__!*2ks zg_q0`orWeVxCTpbY0rIPvF#h=-Vdl6aK_>J*MQj#jD#z#j74QDz(&Ad@U*J_>zUwL z38fe%uh)Zcq60MhyvRHxy&XF7YTyV_6`=g$VAYH3J~V=kE|{(>I}ZZ$0X*@HBHN+a zIuEmx1mj9Bt+1HQJz80^`d}+YboSQZycI8ZWT#srOSm1w*&N~qUxSvx>@|2vf9!}D zRI^Z7)PRC)nsFRxst}$Sc$D;vD=%(dXuSW->T_t?GQ^}!plx=tnTUR!8?wwLkf7%( z1uznvDC!%btpk{RgOT?$mQ(*6zj!PL2!FN5pG#se^x@v;TCq4>qUKC~vUlv(Q06Ue z^lE43T646toY3g6<6W8l;2TR1l=lm8$~!>Ul(^-}b=|g0*WbOU{O4&27gyE(6^3E!d(X!ZQdk;dAbg;PQ3@ZVRP3hzh*x z(@F>y*l|OnlH>dTy}2zzjv?eadk9S_x$mGUZ>8qFgg{vB#LN`cce}5o= z!9dIA9!RT!o_u0t4N?Z6rau5LXN2JaXz?oeGsI#UvVA|bY~Nqx4OG+NbMMO_zh{)u z_eGlD6%O~COfUJ)4azAMKuwwiEH6M&fi5r)5w1oQF!FtA(2llg3_4pX=-If(^KD~x zHIfESL=oRXmP0R*tLsmb%V<_^uR5zZDMB%Ib=0|yiBp!StEcl?9pNO2-qkI?JIO6y z}8 zr6g+{4k0lD1Lc|$$MqZ8XSm<%?jlqHp@|QLTBWkl_vTpt5<<=NvV++akFNt!hHZ=w zlC?sni0D5(QKGMfi24+g{E!m^prD`-Svgr;QXcK_4N^7YDb&#gvq)W9O{C`&@p^y$ zDVOw*8nsT3X%+SAsKK$|_VkIF(ik}0y_pM(+v!Pg>c!D3QnWmfE6+ZOMh^+A>figFeZ2*Gpp-$SlS(sa&C!HM9rf z0A#CqT{qnO;NLp!1T-JUGFI`!6%RQJGQ%6_IG0$J=x;z;u=3)`A#_9PTfa5R6UX^A zgH#?$+#1xY{|Z>)5R}%v0Jn68kz1XNu0gm{PUs#u}0J5(%>? zr(;>4QpL^D3QB7?VYp+0es845n+DZc3pRah+$@`pA3terR+VR`)y!X%KeR59kK3VS zjyl0lP**oDhqu~ByT9c_aKwUMtCvr`>T*-=py~!N$tQUE%-?54 zw4gr$xK*3s=Ks>qFt|(bjE8l(eUz_WV0-^a15ltEXY`v_yCh~u_sKLnWFM8>pz~xk zs@qFJojQ26l0D+%mijw|KJPq;HK(Nc_r{oitb!118S?C_o)mj=>~>^dP~}Os(W5R{ z#*&El1Ay_VB2yJHw~(pG;kLG_lEY9NyC)Y*UVV$jwrVrSM`T#)6Z?tiETL}N$>aUU zYFJveKO!pNp(hqnakH14D-G7ZV0jgvIi(#j;Ej~GOr-EX8F{HJ0Y829Yp=`E`jW=R z-3l7V52fp&!RSU#O%bF83@_9CDB~!h#W2xqL7sb)*?0|Sgx=~GpjdkFj`R#zPdv$@ z-JE&(;&+b>9y?BsjK@4Ypd(ph=Ub&X z8&n+DpQoFEbBtHq$DctbaQcxjOL-^rGmRWouZ{ZlfjdPH%u4Qr6XxICKgIUsTtcO< z)P~&3lf&z$2M#@vf3xetbrDzdzBMpkg|t~aAQ+~6+{#aA!hJPm6Q&^T2q~bjD=5l# zc8PAIU}OY3z?1D6<{z4$3BI`5pzMtLbXpl&}q}QYU}lA4kre7A2+Hjb@jSv>&jr?3#F+kWtM-p$AEHwnE4{^E$FyP zk1XfxyRfF40F$(`+F-s+lz>@iqXfOjLf)&a7U)7>k*OH07;mZ^4_s? 
zKa|fGy z{K>&x0>Nx4&~`d_h_yUvwRf023MS()e1dXZWraDWs&{a{igWvZIe%<=Hud=!!Wa$b zcHucX&u4473ws|jB}QqIlUuIp=*8`(hyYpg=!+-28D@Zxj@avoBx>Y^5Dd`=->9o& z$<7TtOVjc-7V6nfPr)$^zs=Q|Q2`n{s%U*S zE_p2d`*$iQfnMdqHGe|NM?Oj<;5=ITOx%>y)gA!ey z`JGF%faT+yn(^-SF@no-XtF7Gf*+{;v5Kn(Ucz*g8^%AOwgTMEvc!^~gLO@cuM}xMsXgQYuAN9h2bRNQbQ<#qwmKPdXOG!wl>KV`1KytABkcsUY z$79sT=vMCWtCA0Y0t7*$RD8ygXx|WTLh-w1&v)j|XSaVIis1izD6XTtbTm#1XYU<_ zXXeW#lCFf-$~{`I@AfsU_awAhc>=hP_dLyXcET6WHh7(VE=e}4Tg~v`h%$pUI%~dW z=7-^YmP?|sZv41U=o|Rs_J~I|3s}tMHvmJ`{?fAk-?enT2(z_{WiSZAF9^(Xu+_eE zEaW$BD_!fM_Gu-HdrJLGh?PrbH-1gmCF2(Lf z671}Q``1k0NuEKP?unuJGmk*^`niRft~_9UW4a~wOROq`Q?AH@92(St!rw*#zmd4w zq5bvI(v8bQ@8RUw__p@Q``;buA-^Pe2#1WRgLLw++tLUiqDss|vwIvC4uU#AXgXd2 zd#CTHmNI?>#?P0LYj7i-`*81`raeHrM&CJJ`5EYWYrE0uK=aw^b4nEX;3PJ5>v|2S ze4!WwmW(nS;lRf_grd4P8+wqNG7w@&xQr+mze8)8GI0Eb2)dkRX|z&MVFh%t#+y*) z*n!bYUA3PRlBeBXb00Vmpi{fw2o1wA#EwU!`qfF~T4gYZOB56_5;E2+tcN&+CNTlC zbsKcdo}2NChb+9CR9Mf;1ADhjegU>(bw=2Q&+YSg}tMEFTz^ zC(WLRn634q(y(4JTM8oEc>VG-L$YZlw@V!kZjBI1fHQS0Vc# z>V$ST{b%s;-c3qYhdn!CA4>h2@WTUW0t=r49M%5-VBBQ0RfOe%h+LI%cGvw1z@@n1 z0L+{EfMHJJukn7?tXc%S1YD(Kn8gqh>f`A}vpAj;3ia}1;A9580Q3!% zy7_PC0v*#;F0VLWEF%rOrquRaRLNu9lV)8ATyGA*pdzAS;dnhV>4`9=wWn!2J`J_d zr2wLGt$GOU@;Mmsdv?=Q^Fgyb_~C*ZJ`;r0DXBc6fyIuDpN84^dUt68 zE-YD}tiAJS{ddc+D=>}0WJb298)1?XHeYwgLfhJN0VhaQdV~@WI4s)xMbIi6c=7<{ zT@b#p&y=SB1t6X7=1xye!wjOekc1~1MmSvR&!8Z_LB_D5)eBHL?ZXXlPD(1Z2sh;K z6sTE&W>2~;XACjbe{g5m#z=%C+{9KHjV3_doF3TLpypZ^-_)B zW5Z+~ibe2UqI^zync4Km@<5DLop_4ANg;Ib0G2pIv_^I#NnP2e9DBY<+4oHu$wBcv zA?bUGK@fHB@dSuepa-tF4Q@@Z;oNc)Htm6=oLKtCJncPB4|I$>UD<<4P+niVd0E9| ziWQ#&#y$l{?adyXBqj0@Dg)Bh_a0A&Y@fiRZaq1w0bRmpWBQZ!I_6IIyDN`8L8oBv zJr|=Auhagh4Dd|*iL9)0HrL-$U=G zOw?!=)2DPEgmo9SQ_-G!3X>s~CNY9Mr?jO`rKzgmjzyiH>6+HMuAKN5wCZG}#VfBX zi~3o3dB`$ct$m};}WB4p2xmYFszVE;M;cZ|gXyMef|QKq!Zfj?(x zlX1R*Jw08O(dBy?94jgW7(a|X1#9FVSXuKG2DhIrz4rRy85<(q#$On)1=Eq0 zi$XKAml^Y@g&%}ybHsz-;#aOUwH{X?#(cY-o#0nhNX{@Tov)vdrTx}kuA5AXj#JI0 z0RJvqicenU8@jDV&9@$;&DM>jN{t&nsr%-{XJ3l;nucMNJ-(6qNvJW=CMv=YbvgpZ zS@P16LR*)r$jkdHUSzfcPVMwkA;7e%epFlJWBzjhLH~^%H?pH$37c|Xow$Xq(ZXM& zJZSML2n1*zrm}-H|zm^aBKj9lW2%qPRO^n`y6k;>~_u|a|dLs zmIOP`xM@*htT`q^xiiS9X+{Abjo9a`U^c8V==~E8%fK@#2~!C2*tfyqiU;;cX;?NM zLaMaonzscTGnB5IR7>hjRD1GQXpyYM3I_3ggDw!`O>3+$6oU1AA>af^{8JRH!OfekWFFY}72 zgG^-nl*UeyFU{TAgcSfj$lF5+K#tqvc=09^A6EySXtpa>Fubtdn5toUv^0XQn zbNV%;#`V5wXK8|%@W?(?7Ez5%QSr!E+K1%)Zp~~qCZ6lJHPvV%;%TO~1&BGSiG$L3 zqfnbF=3;`bCaI@;-_!yfp$7Nzg}@5x{qZv#^B^6xg$xg$HCs2d_nz*Vc;;Wdwqj3Z zS#dedG8)neRquo}$WV^75sKhx(_$Hrr^ZE+Q+RLU7~1a<60HO zyQBugkW8r0F6o({;C;_F1sW+_F>4CSHM23(;wtU$6Khvae7_nXzlVvOzG_kUhZi7- z3Okb}oQ(2vjUKgbFj&799zyh`4+0;E#qCy}B@$7z+K%ch$v zU3q5=--J_J1*ZR2amic?-WB`Yi2A4>t5WnT)nV}Vq+;~Lc1$Jy6ZdW5Iq=o2iD0Z+h#K9}fol*!1~X{QE=(U6}M zI+(KfFo!dXkP{DM^op?OYhDz+HB0s!CraQfYeVOAlloT3M>1p4LW9J18S+eb)Km+D z4q^JwR_Qs<)ZoS8JF#*4KY%(N)$SaFKOV$#Uxby6Q`_+D^{PwNNh%_Bc~kS6VnNp( zbq}}Qn>W5AL$L^8Dw54A4y?VpKov-4l(0FeNubs7+ z4+ZtT?tYwDcEcvG@<6ag?MXm&R@~*7EM=bSOT#OkVi?EF=QbHMi>Q}PVl_e0*Mug>|?~^BGUY!%6g|+ZL;wrUbh=YZn2uXAudaP9~jf0Mz|I0 zv;Q{+Q-nF%bG(CxOY-i?Qk&#+?()EuDZ2#xcY9zi^KoHaR$`nSEu*m5CAlkT8p^fb z_$*R7ldxA?2`dp8OjySD+;!BDdO3DhX-ZYc3!INT~MCi3x528T!Q{1Wl2FK@3h=(Akp`FyjV zGWVt^s@w1E=GCHaL0!@-OS|i>%f_%X$1*en>GnFMCiD`|H^_it6R#4NWPYRk{y<%a zw(u<%;$^Fd(v4 z+M8DOHP~A6Y8`ENMTcN#+Efu1Bm*fw{ z^=Hf%seH;Qy6+|`c5=jJ4gzC>>j5c@Yil7Rh8TrL4{ZsAwlP@SV`5b&`uCv9*d*+F z+0Fe;MrFd6rNFYH_FZ+J{Z$AI*;d{Zh~nifg-hq>1L$i4k8${G>-ZEqO0>a z9z{kp4bWKy{8|r@H4x7EVQ?Y!U48fkkV8Nx(*yNe^KBd&`yfk@2${a*(Fo+L6%|5e%I<`E}O-u`qr4GJE)OH7x6 z_O^%Q;@ZiQBCH~*VPfMk9U$RGaM5YuS3o!kV&Rlz7^$=oY7JB!-c&{ULl2{nf^9cd^K`=&m_v~!l2+|} 
zslh!(Oa-?kw=vcBk9_kV_E-UX|Fo%vqM5|^hM-Hd3gzH&xvijKwCerVd`8a(epXAf0LC8(XLu8QJ*A>x8XI$EP(8~dc2O*G94rtSE46;zH(pI4){l$_pBkSKoy?zcI~t!6>ldS&qi98XA;YurK2uz|pa3moO5Ebuq1hA= zwUUl>PPTpxFL|6mV_~ZGZIZ-rFgSnTTn=48;f@Wk8^{<$asv8~slGI{KZW0{ zOQW%O#G9ZJ7ePjwxg*H&WbL*;LmHcMK&d z($=jqW9kN{6*s^_OB^`8x8|S|Y@W0F01{`DPw-f8H4A;=z0%07afix|*XE^_qO(}f z`(rY5U?-C*er9NHGHCX8-GvQO;q*oP+@zWu4t9%1@_=1u~9kVu*{I2y4fwhKPdUmW5| zi`tpTxZmoDbQ9h>@7h|Uf3qGGHmBnDKWYHK0Xj=g9*Oa`X`}8;|K%V z3Af$RVg}jHq`U!94ibR~^E__E8nEukcAv41s#cIY?Tci$m{9YFKMX0Kzgt&|&)UHd z_grfRbT5PRKsYN3j&K<4e*F%Zs(@ge0V0O6>c})f8`#Cf zbmB3CcjDZ5s?8KB!o4(88`n0Z>4@yVtMJXVS{#w)y`_qzQ# zMi!D3_$G3U|2$MCaC_Po3!NSHrqXG5e4ms7Va;mPJFJERpD;almLIaP%R)2MGIMXI zv+CU}%`I%yzq|lOX;M&u6hSVnQYcMUc~5upt(zOYI@G?Mlh}vRd@{MnACXc95x`oLU z6lgwAmCRMRal`&o&P?B1oyS0aVbk|uXOIlrdg(dp_$*7&m*={p$2;y1a{c=sv#s_& zr8seR3k-yOzGM(?uFP4|^PVo93c6u=c1!L7U)YT&VZ>Mc2_iR8N>q%IbMSOaTl2K# z@F!m?In#D`H*yyQe$<%&leqpaDvxWV#GzRFu3ws+7YTcKGuXGtoZqMTOq^Su;HH`M zskwPY(}>kJWNB7x1bNV9{8Oh9IRk+(D-O?Dy96LQ0nef zl7ad+bP+L(>8He#Zeu4*&B>S*F0oOi3m9kMbI?Du$u_Q%IXGjut$kxJh)%*}pF-mS z8#1spWJhQ2<|eZDz!pDRVDPf%Y;#sh8n?(2n;%)|<<*3{27>Kq=Z9yM+OH1PPz(~j zGYFBN?u`-54V5)0k$+0!l6pMSmn=7gULhwssFOUYvhXfbD1K>=SId6-wk%4R;A*O; zy$`9kdXTKHQd>w8D!*y5O1Er~XUHwEW@sTwu)ki1$8-$Tv%&m2NzNig6btFoX`+I0 zNF&D+A_w4ak~!M7$s=2LdeYN_B2;4}v>AnRv2T93-@lPP# zry_3kCk|NtQZpr&Pma_mQX(EgdXwNEO<}a$=;0ag0p^PzBxWpQtGMO(tpAUGt4JWi zc)zEEitGykM{LhGEYWUR{RgwEb1T_swr`3)j0qWmfg|*>^P}|TA}K6(4ynyr%57_s zY+FAa+HcsA+L4-sr6%(vyS+xgK<7|xNP;WKwczG&L2WgO^`R%Ix=HjX@)|M+wi4&5 zNbf05B-fMy3{|}q%e|3pPo1`b1|Ymnl6>Vl(BpWp_5(|v(K=@d75$qBu(f@22808O zyplwc+fJdkJ`%7*`07v9;m#3MFX?=e9?I@ z>=$C13`wy7^_gM)&VsuiZE3I^sMKm{DWS2|3QvE#&YLt!il2X}_LN@8 z7H)Jn{l2#A#aB=3Wgj%`;p*hvi61w}%gr_4Cw2&}i|n#%+TfQXk`FdM-{LRaaw6o8 z8K&bGEW8q1a`Oq!U-5_0Mu|hHgT9ApF)iZ2soQ_*3>=(q`RwCAn@|UF<#WYIJHKLY z9wT6B+TYvC-A2a}Z&1a`Yt_W#e~?OFEB?fBW`_Iq*JXLub+o6jJxT2Ed{N5s9s>z{ zCgD%CqHXI%!si0qa-&w~z`PY@L(w7JV;O8uyEeU%zB4*9(hDF5x~e<2HIypNaj7>Voe=NWeG6**=fR zg;#@DTP1`CXs7gu&JZk$O?5|v2FZ|D+Zdv0RGW985We%2N>#UDY52Ytp`B}nl|<_s zYTLTK4dtj7rz1~=i}IFY`qq=Gp1CO239vZnSVfN$73yUS@xL=ug;#_2!&z|<@9@PC zJX=pntyE}EP*9Hue%>r-WoT_PUoe`GfWJ(6d(!@_Y$Zi|Ya^pTH}CEGyRp($xfJDVuy#UJ^Jb!mIwOR*mN7&N+L(RA>Alq0KM}7q?NrY6tzA>u-#ew6mNv+% zwFq~e{3DmI4*q9>Oiq3!U-GjuT|C7RCPNZQpEioL9ohhrnz1m6x()4TJUN(4REG9+ z5wB3|Qaz7TsVGpT7xqY|o%U@dB@T2(abYz(tN!+392dF@`bJ#p1zx8#%cbc zC@gH>H_MRIv#40ediewoqV6&LbOCe{T$z}l_YR^*+AcFzd9GCsdY>xlQRj9qCJ-T1 zRX0(8cr}QVkG`P%wnCs?|6ZFw2T zNOWDfo+`s}*Ik>nlZs5bcU2ti_09;SV~$E}Xhk=#9{*0FIy?U$DiYG(H1MBedf_C9FMGyfIN-piHF#rC<2Yy)}~;f^aXoPt1%@8*krgBc#lxs zXUE@5FQz&H zcz_>H*2mDs`)0nw#r=pa`L09tJ?vUdiQ(?%?5gcH??%ygrNYWC0m*B>_D>fy=eAK* z7C}+=Y;9S`@Z|n4u63#_GqaAX+~PmyB@4IGzSi_V#_b8bx?K7|YQgL!?HK}b0$G;e zMbhv!zt3D1z878chS(ji%^3IcpRit{Ri+xV4yw>}@Bt2Mt6?DHmY(v{y>EJpg#pq} z8U!pJ4+$d`!v+^Boh*(9CmA?VRVU1K-}}*ae)dgTk5damU_AC|MvZIVgx`^K<86GJ zXEO#v1Kre7x~cB?YZ**dl?w(m!|q2vtMevl3uoP+JJ9V-)h}&)EqA*6)K~lCBHZW$ z{>F1ACn5ugFzz3Ie@^~=cpw!`=(_7^#d~o|Lo|mtvBw96YIF2dtJ9u$P9?DF(seV` z)TA@Cu%M3H9l3mERsAUx;F2)Tx<$)#2aUmae*O-6^kLFsJ-5~;shNi8L%IslWXdPv zf9Prve6;q|jD)C2D#!hMt(S@hWsgz{aTcs+fk(zI0_ApQfkxCI0X?oPx2bv$hCnZz z%dWIlGP#VbL&n7!(;wqE0X;*-WnyaQ5$}YPm@kJk*K2_Fy3g zAv*P&pGL^Imazmwc-JFC*PR}E_b}Wix%Nl`cp80y)l#41WFloElNif$tCb!)gO8^d z6)9aC`*1!;$|yfvK|~}7-~zreiehsqQ~HBv<1urIT>9oB0{m$YtWYBZVVgkPWF@oMs!*y=P3j*=~Bp24qbHD8|{ z7b!AR{3!WGoOI^4j5fNCj#H|Lf}0uuyr`)uNcq!YnFf`KzA@Ja8}3v@9}G5|%*K=~ z(tI3h;4wWw!Wm+JQQ#c`3JdLO-0tP0?NpC!z<4SgfWSCvaEUfsXSqIf00kL1ppl0ItWcF_4PW!7D zu}mkpa0*GKLJq%lJaHZ)obmlUg(3g`dd{?p^Ql(vU!tHX6JBq+F 
z{QDmPAYV)Je{Vgqis?{jMlM$KQZbG0%%31hoFYa=)`{IKuZQWP*plURIzJebBkF`&KNrwrkSdRWZ4Ds)U&deG)meQo4-Df~ z=)K8oxzo&L+WIf=8nuoB<$$V9Lba_LoQ*)U4(iOsI$J+h1%Ut%`uBQ>l;UQltW zv;7k8!o==>1JFy7E2L(7sgH1&fRLEN2Bxr}1N)#I3ghjZnO~4+)EaRdu441>W^9Ce z#cnXL&?IO<>arWyrc#y5pqY%c#6Ex(ZjwDB?5TKt;%0IJaCer?&7;krlfdEhYr5g3~(+majZCA4os$jGX z*jKevYj`*0A|{-B3(hZy+m~5+W}<9_PW6)S`~E+ZWqxH7U>7F*=V&?XV9%|t1$VZc zv6}RAJrEmcxZ7az+*JY}I(@NS*5o@2>jNeK6+k|6nwKeJ0A-Oc2I;MrDYuEHCcqI) z89M9XF(2;k{9-#!?Q-#N6a4SD?axmp`0~u4()*pc7XL;06#)dl zk;j0!ysOs4o6u|55c{6$Wn_*jlnxF`6ewgG?C6h`7yv89{mu-wnLc@Ny)-_9P+A0T zzady4?qpX40jEG9UF$=sVjRvVtzI|InFjP>6`fDc0aCh4%XQ&&$Px1|JX2EWSBq-$ z+$#ylqO8%!{|+8Ql`(wiuoW&p*Of2xSTvDv`{vsYB7#ZN$!GFp@s+gM2{v(RNBAR} z(XQk{IzH*sIPM$WzNaq9NQL+WChjk(RWXUsSR?F7hUX$o&_uJMwgQY#Z(SIP!R)_j zA5vRy$(C?BxBW{ztGw?qV)%ER;MZ0NNAMg?u_eJG0FYmV7_yuX?}-JFd6pygc#M9t zyTG|`3u$}=K&WDCTd7)n#{HPxQ$Be&t*CQh3M&EYx9FB35%R+S=YL!t7#xjVK%d_E z&>svOm@X3U@J{mhhnlCyVa3vJV`W^s%7```?FS?TmN*0mIYTxgo^Aom)Yywx|C|#` z+QlQ83rzRBO1COO=dds7c+5rM4v0tMik?>gw|@vZuUaqwU~jb_@wUI%c=IX{*2;gdJFypbyHe@4Y%U zjNexV$@K`dC8u@?RIrXL`p@?uC!ysDPXu!<%(x%yiZxz#{C+$@{{Ns?gKa<%FVg@l zcUmffTfae%T6Wniz9rzvHIT-T#`-{N3M3;8w*WE9?--0Qb}xa#GLHz{C7EVsx#~=Z z_M#K-w>5Fu*+kH7rK-H6jui#7WllKg%n%tuh zV~89yP7L22oO)=cyv*lFSq*YM$P5yD@5JV$e~Ts|IT|jB>HQ66rl|y)mk@;&n~itD zNYO8EgY?3&R(07029-rur0LkfqebBUGVQ;PhCfFw?DvLfl>CYl4K5=O297kxQSzoX zFL{qLEBbK##BS2zdW9}fNs(cN_Cb?0Z8r@PhPRnAg6K%FOhsq`J-GN|Q%?jACl3cs zWq(D(`XZpkKDJT7Tg^Y9f+U+CJr>S1heNT(hy}@9r%$mh@4OaCsW|?0!XzFc1(@j z6DF@M8-OHg^t{3@&KYCt_J|c;OjQG}M;fDfgzTkM|6IVt|L+A5HdWxi{Dg4^g91ts zYyVoky^p_?d58`f*>FXE&4r-}=|zQA$~LDr6%$WNho@@K@U7}tQvRue-eER=x6r-c zYm%c0OSCvDtIoRpb^Kxvtryn(*UozQ;x5TfXGvFg{P`yv@vkNF$EQ-QOsD1BR|XgU89KrC4C2ydnwm(9v_GG{hy|xwcENT< zGAIT8VQ96?X%V~I-XN8nv^Uk!>7RQlMwd$-tvj*6VOQaN%(D$468_zWxK1G(p^?!% z@@oh#=exdExipCkD2YUr0MFgO6i z$?G}fE+4@cEltRlA`m7vIXsbf*>@p+1pEBleN9wzD4p^}pe+5diW1%fWipQ(uc~5s z4S162-S@th|6kGqQ*?#gzlz^T9Rglt&_y=C2Rf{frbr?i#(<^Xi0NsCj6r(RF(!5x zw7?$jj5m4e)yl1sU1oN)`gU%Jq*^&<0hA8%%n*F3) zT|5Eu(xqTw(9bdxbR&Q?@d)IfJO566O*wZqXugQ=r!+{aD)^<08@&WMl=lRX(_y0PLglsCnS%;Ty|;rIcFFT>0E>h%0>O>5-*12p$||)ff`{NwE-c8ZF5n>be2!4 z{weFZKxCMEjc`K`+c!S|Tg^sAY6^C5ATk}@NI(#>jOplT?&UJGX0XLm#ja|-0&|(DeJ&ci zMBx(ee|A2Q_kmyEF6I4c_-BhS5uioO@~tR-1(UK=<}%{tV03#LK_IG~B!3x+!!e3% zQ~gT%1sgBh#1Spnqq=1*Ep@EyG@;@rE|Od{e8oDeC4>n2&r7d~5ku{XRsC}#?P(i{ zuK`@+>N(VLX4l+8N-~1F*-Th)0Ause_2(!1^X}T-Gs1*7m_l$oOe4@dKKt1=&O3AZribVo|!~G>B)`Qbp4_UXO78F3@@s<)A`(rgwXn#A6u545Tk1`in>Z zDztqQ-1dVGbp!lA_>wUJgnW0ZKS8vDC*{8ykqrgZgiI3EhC@vC*;+rSxHD9P{dSY! 
zIm3uhLAIQf%E0~l2KpUMgd)vUbZer32!T5tGYX{u&M8)_A4^jD7~@X$HiU<-1cQpD zq#DA1K;7sGdy)RCMex7Qvj{rc>Nf77KfAId&mjwkfa5*DjmHy&PSBLA)Oo+GM6;xB!@NKf;5 zpGUIA8`y0VNtaAMb)?*_d-nG;>E-|1@$xTaEd|n6nf|-2`p=*2YQP(~r2e13_3xkm zzxYYkj}Huqjp(kA;CJ;$MwK=2;Z9 zLOyBE_vAq!<^pQ~$pgD;_VM2(fij{K4JAWr}~3Fo_0h_gFo6Ixkq#xL~=Z zT^M;RjFKNwuvi*qp0pMsD~azejNbWmb>j6z=m#aQ;sN!=waT3ryK|unv@T69+m*eTP{j^~`FWV{g?~Mn3ehQohh){BjXpcJo9{o43($Ew#M&It8 zElvBIx>zDm28<90V7dx?1|%2>y02oB?Di+Wz>0bSvSE)uK2r~Dn)AO_{X5vN04x{) zL?px$9XyqiPoz~z|ILn^y8{W0XCq(O{F5yTPRrVaszH*J$o>ek$ZxtQ)@D9wfn8Z)(it#qA&j0v(wVmW}wZZvoa(`kl={ubw zz=#y#WFVv&r-Qv->Ef1s@TIW?sUr}Ml4D($5hrm-Bhh?$2yr`oA_qV$FL*3*{>^@l ze@_XJOMq|=yHpp&dVdF4N&#U53dDh&Ft8qNv5jBn!uWuoRR4x<92_in>>zclsu)~f zfMoX-ln)~W&tV%A86^0aNE;u{LawX!)Y+1hZ{ICt13l~G@`kZ{9n#cNs%4*E`m`5cUi{_L ztIHQZ^~?IcHKPtE+XoECz4t4$DP~}d0E74%pkWV#iNrJk8)6INkpd2{dypW&2HwY% zyXUTqr_qt}_ymC8=h*_7Ko{!YMwA$Ex#}+O=(oK`D{iP(B^`6yMS_MQQh;1M#I7eC z_y{Zs6>lxH31!zxJ_BFNz@7JT7kIh~r@N!?zJ?U7k}W8#yM^<|kq0D02x`;3UHBb# zSlAf38WU5cuQjue__T0-etJCYZ0t&Ye5goimv_)1iGG}hQ9Ca)ChOiypnt)TV+MjRBe`wPouL2C6??Ia#StTWaQGX5hC+J3!IRcEWX}mQ_igkaKU#i_w5TQQ4JjFT z5CuU~R&*$@-}sUA#4gw*=HElB$7f*Yd&?v3NRgg4JM}#H#32v{8xL)hFt;1kV|uc; z8zvDf+cj_xVC8Ez6R}-vJe>ZCq#@Ira}g0SYvH~k-2dDBi>EW#L#Ps53>(^%!+1j%$M z#bni1W*VO9TL2xGnqw4X~nQKZ;E0f_y|3P)?5_poDQs^HM|kgl z2oDf<)pV95+B!N0Y$9EKp?h@8TOVFU$NNsv&GXT83*cVM__LA)(pGW_X)Z-9>3ila zt$Y`b9XHFmq7t&GpC)~_)8y|aP0Meu5vhCZbOL;_jq_;ma7{dAn`jdC@)m$uE^gp{ za=4oFWo#<<3wd~`wo_cYT@B7#@nvPv;kya&MHwO6BFpla01nj5#?9S^3fE8eSyy`= zkltN3Y-eW?xwl#NIhd#M$hE6IGA5d9gCj*A-5fvBl2Tf0`RoImAIMM0iBbqZDV?>_zllZB&P-5IgqbT0M25tJtx=kFul2Rge{!P;3| zM5D3TI@l(RJ2cDi`6L0UQ!TAz5_^zuC%M9q4wfpYdWG@+zy7OMc80z;R9xTQMk`4uob=6ZX+r%DkIyA-Bnj?CKo? z$-le6zE_%=Ogx$7m$7PT-Cr^%)IWLi93{doNwyqWaEEBg0 zJaLVgSYGS}pG(o&1^uyVH0t(37%M`_YiJpwvV~?q(HnF9R6y$W(=1ffO7R?pH^}#q9568nzzgfXR;J3$17?M+N^vqw@LeQ-2p~Iec;2RSys3i|eBV zH108N#YIcFb-ir)T=&G8;fi|xhi$6#jw7yFw?ZJM4T;jV_yomG<}dNEa!7q7W|;cz zweQJc!hs}UB_tDxDRfGr&Qk^8v-pM4mI&$S0QHr?fuh_ECH-dN75f% zOxdL{Zbqdvhs&Dz{!H*|qhmzodU6^zmQzYQ;*gA4Mbi_Bes)~Z)!+5M)cL`asP1#I z_aPu4B?M^>KTyObk~R3QfOLm+t#)!%IQ!zQ?we^vKiN2s5|!)CD}Ih@usg-+EYX7d zX8(mSJ!tCeRTR_~i(vFoKuC4B#~huVy7BWo=QHge5SYu?Od0TC+T4aa@)%ugqIn7! zO2iyC$|Tmtipy}HyeZyW^>^9ViY9H?;{GiA&MT;c@va?Z-7n5Vw?CX^v5T=h#wtFI zyUqCNL_A@!Qq$^BNy=-Atx;RwQx|aEdOrjDPXtL?t-E7Jr0zG{HOl?F893{ohVYnjahEO2IF?BsFy$YDI`Dz zcgoB@Jm&?q64W_o8POC(mh``=+#~3$$*i3Z4%I_{Mh(&_8u3Lk?N0Xs@O7+_C#N4U zL)dJSwH5Db5~$c_OFGYH*JysfM?5d%z@2=PL70fuA0!Q+EtALPX;K0A0 z3is6E=v?$;7ZGIr^0l1rw}{8LQ#>tMqH}k2!c^60FNRz-i1D5Jc+YlH$MiJ2XMbRU zQmd?Yk;jBN>p=AY-XVqBs-#byROV6T+upB_1m&A*Fc{HR*a zUTi;fsuVJ7r7n8Y1w>-Y?=*C? zxQsj#(Q|7P2;3L$NOwK`=VXcy`;U{^&jMeqc67SXELwQ9Mv}at&m$NL?g!mEqLCBt z*N~{JW0m1h(^j=CXLn~T;;!5empdvLEXXGQS^tda3MOybbV)$d8vWCo&UVV2d9C}J zQ`UDNQ_uYn5SI${Tc!}~-@dl;oZ3r%LTQU5%^aOidD`Xs?}c+x{!*h zRGOC`+eGGhOiR-nisAcaLzvx6DIyCg zg1IyXr{u)Mb+nJ@QetJ#`ghB*xwQT4eXFXYwT*y{<8-g zdZF4`C5x0^b!@0>d~(JrtkzeuJY_Q6%I2R5)-!u~VLOpsWgC!vr(=Pqr;I-{=4N#3 ze%z%#daagCuwN}=U9yooRfjF&cx(DGIqes=zmgINx~@0IP~@M|m0~`R`xcWWBc@u) z&3VSq(CzPa5cqp-47|HWqTHt2&|lZ^%KF6L!=MBcPs2AjTL!&cQY8BOg&rJJZ2hqX zuKT|T)V4EyB|mZYe>YwrIsggCJ9DXt!ODUovouRZCNB?#7AFNVY9!{SbHh`WzR$Rp z-*7jRy`otaLxt|DX7utzt8fg`4F2&?mr#3ttI+uPqmP|HZldrM$riJoRx~}sX zckgbU{`0DckRK9~>Z3Gh$!nQ%o`3 zvR0;zqszPMVw#PXL}K?lYoc4IRN{LHZm~*^;cZDPINf8z<06A(r!}&pn3xUZjy`87 zX61Ny5_Qk-*|*a*$Wg9|lhmI&s)Zu5G;eGuI^52e9>V{au(^_x`? 
zFSR9E-2Z%v`v3J)D)-Enjmlt(4`70uO!x)2yH<_Luz6*WGquTJ}|D4*y|J!emUu8x9K5XkF7AV))qT|g8=`(wOmy8fJ z_PhFmUdws|5JhJsqtrOWCOd&>G!Hc=JsC?Zx+VF7g89%TgPkljQTg2t7w_tXx zuDJUi`TEAmlHjJN`&5&mujr?EP45bR2^B#;Q3N`Jkolb%*br!u#$eI=3L`q+UjYy? zWR1-@k0k4WE!nR{jYC@?2p|9fe0#o2ufJ43YGkhq2>dGXiZKqNI+nuEJP8lAvGIjG z7gcE11V6D5>iT9{zujtU!n&J_nm} zqT05Nkj#z*dvOowt3rT01dNdfV6T>Ah?1)VVeQP^P7?Bd{8^Wp(pbeIE|$E=%kZ2c z#d|@ukL1D}sOgGt?~eNUr1lb=W6ry}QUZZ7epf-inJss58|J&J0D2&Q@tWUyzwd?5 z4^57wpP*oYVyYk941U0+!hj2!tK0tR|FH*svu+H-?5(a3GMIBEVwP?m0=j1B+r|K~ zC;6gbU%;KR+@{EFmVY3C}MgH47B^)sh1JQCpuyx=!qlb zw_r3{@a;^MyA-4jWeotAVw z7)3^@^*va2^jXS}@tYl6hVaN{Aw0Z^ips0-V3c7KE7^`9;ZXGGFRJ{Yo}GO=K|fAW z(IT%a4s>#WR(?Uw;FxNDP0HGC5ZUYV2YmVT1OA6tc|ZmO*2+69_ki*J8eAuA1M6Mo z4DyyauB|l@M9F+ml~KJ5HO0T4A0F)stE{zY6CP<%LL-27`jt5XLb)4zA^V#h0O$e< zN9l@T;avjMmpz!pf4P61Y9?28at-d}ui*7=K6DN<*H=Iq0c(5?%_sQPDnUQ??9I;= z7|i>{avabb4y4j1C@o|_LN_I+I)ZC07GQXrYr(COi0)VN!x!NnEetj?DHKMwtU+3*0em zXt|gfthvr5e^{s%q~>s@%g2rPs-YAX6zzL%h+P+81FLs72lR}JDmqAqnO~$e>fi5w zqdd{>neIgRb&?}`KjrRpg1er~ruzC^OPJj{1iS2~pSk3aa(Bfz+JI1n4T1WWvAA5* ziDz`t9lUL_pN^FYRy}Cpexm|7ZX?fspG#o*cwXEE!?uD-xECRSl$Eo1*1lN6F@aAB&X;bxb%IwxPmI?YL1LGvwXei2 zl`4{6QtIhrV>1ehPw3#vU2Zq1ZD`3p6J09f=BYGirdvmUs6 zF3?u7)$|m})R(Kfz2|eIm%wnzJ!W8i3V`XjvL&&`Ghf|PF!kR_)i#Xo>B!z;Nzrg> zk6$X_7L+0CeS7cjXNQyxOv+wg@9lPVc$V)}r|^Wx8ZJ9Fd#{hAFMkB%i*e=O1>|5I z7*qYc`te-yz5q0KEX4J9xij~d-Z3;;`qj7q+ejpJ3VF27oG{EU1B_Bf7)?{TO(Vk? zzn5{fOp>(9Yz~V=$3Mju>J{Xu#tM09|7n(xAGi|mzp*n@L5A>cjPT`O%XeA?tcau4 zEFx=(_axT|>VGoh<+$mLp@_v8ULZU9QvXC-Cp40q3c!4XvUczgA zV@3TFi@c1&cp=3mT`MmYxDbt~fi(RYD-{ZF2>T85 z5r&^kQMqR$t#^t@l_g1evh{A#J6ymo(bJ}CfO+i5%SmV6tfYK@tu^7z0r1IW;$=k& z$`McQuee`BYTNG7vM(FSs7hUMU%~;OAAWj<$tzly4US?NZ|Un?DeO_BoU_;{UzYEm zu>6vgx?=PyPMkB~I`I-@+puaCw*3I7>LQwsKU8ge*LBcYL|YzgjLa&mUIo!^o+xM>Ha6MTdb1YoVP6}-VTJgFz>_J?H*>pVko)F;m=tSNN{c@ zShT10TB&d224-Te!}WcVYb$ELjNUQ&x2Gk^DSu}@O^J@n%j24qM{B0gD_xf8%q_Rt znAfhP`d<@viNr5_Rc`sU=QC98BYpRn2V&%kyxFaEX1HTbD^MIbt7-eS)7Y0{EEtrv zHZqn9oBF9)9_rb%wgJhvS2Uvd7|BhD-u9g>w{~mB=3&Pb6mxNeRu+~K6v2*L{~&ir z8Mu2i;!D$m5 z;`l6a3H|8eY|V_3M4De+_c~NV*lyvav6&(LScoGI~{hZG?h zGDV?`4VnvytW?6P%sxXXp;8etB$^XViu%x?GHb1*l1heB2^lh-`^~u4KIb~uxvq1r z^UvOY>}zlG8{XghKEwUo&;88g<8u5|;<;PvhhehTRx-Z3nwc)isukUSg)fg^Io*18 zZ9u-CR3NiO?q}wo86(8ptdh*jbP~FfzS~N#>=<3z)~8+P5_^1ilG~XJ5VMRxbvp9T znWCjq!p>1e`gz-lMrOB>@tK4MH;-wQO zxPCtUYvRngL&dEGsgEC`Xz>i>sNXt+sd`_~InD?OD%{w@r#7@L&NV6f?Ftuq^bZY* za~qi}_WZ5y$PvaJS)vd)Lo}JOrcChkz4fl!vJ7M?C0ar2UF#cDFISeu1g0#NP3wHP zFSRn+cS}>9s^aJ3K>J0Ra^y`!^^O*t=hlHZpwlb1-HW{|b^LG1hp2y&71WQFRf+WE zSTVcy&rqt(4NdP*F=4U!M}1o!mFu20ND9`j$w-kG*y31u@%{RV=*75~O=CnXby%~4 zR9WP%7`NP7FC=Txpx3(R(R@rNR_HzZoJmue@2%sn5Rgwbo|wE)lnMIjB`h1Y)wNNo zVkrMru^v#sIhW&dCQnoF+@^-e;1w?H#Mb>}`%O^kuM-KmDUhTP-^d!HT;>`=`}t>~ zDSkf&3+zG&l~r-JciAk9JGkKq4%GJO`N4vwT ze&2TLztTfEMP-%Jc`N^ze2MWDig!d^n1lVM&crn){)qED%FvWwO_y(tn~+KO%pt74 zEdKxFO@x%wf8J`q^BV`yQNB_RVvi%4*5oH=#~XXFJN=$r2U;IMud;WFTi91c)v-upBs_;ec zIQBBRP@Hnt@QY>$Ct6LMtg-!IOfM9|8lGy{vf|&rXr6^%6jnVxtvv0O$c^Gp!8diS ze7%$WLRaKGs{b|%(jc9=8vH>-#u!w}83E;jlOIvlsTf#TJ5ny%#41;JW););r&T&Y zS<1QP#FgvhYL6+9Qj^hRVvJSWHJl$t?YTeTidm%{7r@%x8;)T7+=8mu#$Vvh?7tC} z*v+B&&?9LK=EBNwhA?cRhyBkpy}7;yE3L(szU!pl=$)Do=xhDZZB6NhvF_vs%+ zCsi#oaVby}{26D!Kk_mGas{@cE@-f-829b9ZE+qPvnG>eJ1Dw7^KjH zl&LJpnJ?v0E0IcgQ-CbwyTmg&(H}2;M+XTTSmO;OVQZ!bW6H8PZVA2#jZw&WA{bz2 zEWpsd-6!0sk_VcY1yB(q@<<%L4^YU*h%VYN%IJZxr-18!fPJwR?FvYXAbAtx$v~W5 zQDc*;2c7DjDAuol8XXd!OVE5E=@AI~-KbJ2)46ky0V7$ocyV#f7bwrC!s&v9vMfl0 zFN}z#AjZ)7Y(BYHXpdMs%;TMTU%a-*Qi+Hf1q}rS2R7W^Lt+h-MYkc#etowUX&}(z 
zBaO_v8g58ep{6#l1yxaalXeqhdko-~)nnNB2?tL@Fn$p5^;a%Ko~_)iRK))qUIZTyeH$?wH$(r=dKls}5c){iKZ2n6 zX}~Bq=Z;r-@`EG;k@Vw?zq;Y3%GUq^g3N<*M#g2@xM<4tOF{t;M#JTKAQ+`sh&gK|{3s02(HDGCD{M5b`J1q=R@3_{6_OaQ(6c4VG&uwN|?r#$naT zw2o+v#PPF=Rksj>Ntt+EedZ1QKAcUrFueBYr;12lP7Mj?GnDMvL@hpPRY){@;+c33 zE{$eoWNtSiE2*j%7y*$$@v1<1Uk9aF^-m@V#Z&pA@iVHZtVNP}`w=|89~m_34#uui zKhsfJpf~^=JW9_f@aga z3cH}hT*rG?Z(gCiMX)%2IO9`r$z$CG>*m6i&~`GT|FgF>zs^P<(@-#zrPA@{5(lMn zz5NdUggpL%Qf^(1vNli*^-oT12Xm3iUd|i+Zm6K<7{wfoaCJW_`i&0SeN)~@$cLWV z@139iF6`%8Y%mRKbx^O=ZEgiIZLutdQWz*@r&QLclxA=zbN@k8tdsKM{Gz%AjCe7P zta8_qU~LXH_3h`$cKrR-EH*HY#@IBNZ+nEtZKd3*@Y;Q@C zxBb~M8T~Ogs!aL1Yr;1Oa0yv{xVwHKxfp#HlNNm3H$gK!7lGntH-Ze$t1j~$JJ zoJ3aMVikP7OtygFjI2h2i-H--cL)3D1Ofse@Y6>>0}Jl zJm)`ri6Hi~u6?(}#A}S*5$D<#=-_9ARo{`iO`UeK3@O;9WznZwr<~i^|BCm+&RF?9 z$~Wuj+;V)Rr9&lN50+ugn%*T*|nMngTNq%wrW$ zQRhoNNNcYjuY!GjGJ?@r4?17E!GcM@hIjf7E)ue!S@SmZp!dkR6^J}0O5ca2e|ce5 zD|Y9)8$0Yh=5|Y?GtkKIleS zk*_qdgfw2VNoJI7At|zj9v!9Kl{tV5tuL@~7!KaPGlmmRfP z9Pf~ax})hN8b%~Me`t)dK)!Xugw5l@*Mw;>+P3XQYTNayK++itNTEGkvlG^NQay%85KE0 z!joyyFceu24p8AG9@`PT6g_e&!{uB1K~~}yPgHlJBqJ7aHVg$As+pQnxOQUQjOoGk zaSSOsu|fNRka=tU*eaiZv4^^HL0VQX*lLK$ZxLCjU!$q>O ze?Ilw3M)FFg75UnfpCz3+ne{5|G}08MjnYscC_9S$aug2UjH}R1t3i*8z=Vzop!sw z2u(9@Sy?#c2?3WI6A)7zHOAI0cn#kS9D6_V>a11-4A65HkZ^XTuR9EIR3P-^IFnRW)Xlcy2l6iG(+|Ce_m%DW@GkYxG{N zd=L~UK17a?*1B}iBHy2k)BAiLp@qwxVH}vR0FGlE;GDNZp9w8R8V5QZyNJ$@nAhED z>(D^fVQ|22Kkgz-0e>^wU~85Izn@_EcAmb5o!2n01hPqtoZYyx9{7y^`eQf z*8Zq!bMd^*gU7!_Ji&=YYq)HG7nQOv7WSPR_i;~2(AVrM{P%yD;UQvX(|&E)_-~Wg z%#U9zt|~eGA2u-!OV+jO-OkCwF>V{<7gLrV%Tf7#4~ulBx<~T4T8F1uLE0RWCzh`2 zm7HZ^PS@Hrs6I!^d&&1#cc5`QZz6}_%tC3Gvw2&7?f6wEEv_7??UYzf66z)_IGv*G z7z0*^_S`Fz9)0YOKsrQDlk#Gm(mWOMD8=Iy{vX_gUMToK%Ul|YmoNU+c2;N;cKICD z9LH42ZAl?6#dr7byfpE|G&?gkVi@S5^X(A(WU{<@^dG4U+wlFa`y)t(teci{Xv$4r z+%988tiB-~^5hb4W)Vv8!3T!0nV-R>5*frG2#7f*-CKLpXUh5{5J%+|_hR-+Bsjw3 z840rz32TqZhdqQgqkr>JbL`OT*o$OjCWuv~BOMY)D%__qMcOAUan_J*FFR^dKiSAI zy!F9>r1@cnb z!8f=VM&;cF%^m$4Hc1+vpqfqsR>&Sk6J55G+&)Tjj2XuJZ(2-7V+F1Mj2tcavx9B85gX6x(W1Tsz8ryJsdvDcKXNz zvK1W_HTlDtnA>wtN_Wo|Fib?FZpzsVtp{Tacd~hFmou)jPk4YpGKT||wiAOX3$`sI z3e>sv%%w9+Qm?JM9?JUtHiyJRBk``msWTDpfK6Zfd9QyyY{dr(#1fqyO5pF32D6?s z{Rl;_t$sPAfV(u+JHZUgi{|+`qdN|*@b#ZMi0e53E6knMM11E(U+% zA6KI%5QlRt?yux&p5gZzI^M(WJr*S?k_#2gBPmv@vmwZ1q=1F_phfWUU<{z;PWcYh zJN4JubhH8;HsZk5Ni$)5kyFU>ui4%A$L%mYG3jdrs)PO%9C8sd{15tyq6Z3O8pxCY*hKd5sA163quPr)Yl{nkFitDDb8h5hv{V zW(=-dCc}++bUT*#Tq5NF6w%Q}y%qzuNFNKpg7+jsLLufSOnKe_V%g)aRT$01W7yxc zdEcu~fTK`)|AdII9-S(w`x^@oe%0IrV&sh&WtC&JfIJG7#wQGiI!oKUrb8X3c-{*? 
zB#O8YR0h%y&%fISfVRbuaEf4Gj^D>@gnEp8P?6>DCj*U9Ep+5%4pp!|J%kZh2s*dT z2dASy9eRpjOQ1LE3+~^D!K<0&qZl6SqmyD(oojGuHyF3%1#*71Yo%(HwMmittW>a&IxW!%&R?m_0MVyw3)&viv_ z`nRdvKVvne+8b3 z=Xr#kU}!J%xK;0iFIlQ!z`sQ1vY#Cw&Rj7WF5pN-UvIqzn85Yf4)J!qz_f~6Rt$;w z$5F&POm)Iukyb8c+z}Y5qFAAF5N;u^XxB;YtdGVB)*Xg^azu&>F)WyC&FB2N+M}f- zN9?&Ls%-+QcOX#fFm3&Qa1#QQc+HAL7mx#e+Ethd8KckbjqBnzjdHTkZqPY3pFQ z1Ca|tB&A&lUv%f!O|wzQ>>P7ca|&!q4sDPpc5;xy-C%n>_|1;nq|pjzA~stzW{q=2 z37olrs^hWH`eEG_q;(9SuKvsAf9Z0Kbu#Jtd!CteE-qVXn@3_Q??>q(5M*QZNIu6f zEg#%y@jDf1DT*GakQ&o{1U)aC+%?$yeeJ(J!)4fim#-hTOs>uNKd80FNBKYei*$>- zZ*%*`!fVDY*2L|e8zV`1c+8Gug`tFB`(D{U4jbzp;jQKDDsU_g)j4$mg$J~z^DCOh z@@{YaEMy1}5yOsJ`c>S$WLCH;bRJjO@jE3Y<|vEwGOqSJ-FTS7*&H{WZC3dS!y`a8 z8`cr?3Q*Eto72f5r~qdn`(*78v&r|P`|#CAizu{vrZt|$)lP(bsdWo_L%>uVPYB4g zmPfu394~|!Do^|w?KT_mme(aG@j8~$omIgjr1z}|^HoWsFW@8ul%r&*tt)AMJkmc!CY4 zLI&2NCSdA^K=DNu7=7qn-SPU?SAhH_LHEZfVJ}c!2Yz)MNF<61TR|ciEZ{q(gV#Ho zVlUJCnGAu&q=zl=rWm(-3h{OA9W}4oiub4x41-3aC)EzBk%As zkXG;=U&4DLTbJ1Tkbfcl($`HsK{GmN}7}zrsrDd9I=KjWBN6y?J9Upv}I)0j7;F~*p;gPbh zlh6tEDMS_edOi*AIb_PW^)DXJr46lnihc@5#~J=oYW77+*6zbMf5Vjx3;pKijt;b@ zDONQ4$l`L}M*!-=Dqg9-Wl4r0I8xJ<^R5_$I637A8~AZq-;L|#7VoWDDpF%Y6CX6*L{sSwVd3dV|81$l z0v_>!nv7lYcyy!3%Hd~UM4GKGxn;BW!DQGGLNQ^8(jTxQND`qV413i02abTRrc`gI zcv>YTtNr#UD9>HbQg8RYH#HBzi9607oVcELi?6X>^D^z0m9|)=9R}4o@X_u-OyDDW zgrB(5iU7TypDD!?Vci>P-C{!JxbLmrzPY14L&Ev7J^ddWU)C{(#uE!JDD5086yb#Z zjXX5AJJt)zC+>{r?dGYzWYk;A7skg!)ubNGw?OgmCA?|uypx7h1X(=vgAi$^uWBn~ zNQ7S(i=wITGU+&zn^dq%D*4IuedbHUZFu90P`{CjFxFC|-OO;ZO$)yJ$?d$@_peFE z;!jZ8{*p3mcKK$^PU%3A!=>eNTLPjNSe-*hsWp=G3FAt8k|UI)>+PCAXZ?hc*7_U1 zhZszEFd5Y`Yn2M9q6$hA~IVnHn9?XBR z{^Dxo?9EwmrNBP(f=|M=jb(ekQ*lv~4F0zNn9QIyq;OASkncHemHeklI{^CeNQD3Oui!^Y+;4pyr6 znPvx5eC9T-S&Rh>H#unPbo@R`jCGSrv(uJR)$Uwx?Dujr<9mS{#onYnp@5H~zm6U* zawqKPk4a3aX%E|>p(*EuH!iHUq4Lg&y-Dzua5d>r3FOXH&|}nm&XUFz-y7NC?QC;`f&A zR(YFDDjn`~n{+yF1or5AeH=qe=M?ByjrDNiYIy_(?NCaZTA;D-Nq>R#)CZdk!iy_{^Eh^U{2yi;OqdL79E!t{RwbBsEQ#H`($74YOS|@Z% zpoVbl)Ii~1h&VrD420AyNETEBk2|5fDjVkx8{#E=Y-t8kojjC#ku6__t_PAT!uV`U zP`VD^3?io&SnZBXC7@gtKs3kmJo0BGh_Dh{U!_2$zTNXYOgAX!pLW2hVYtQ{Op-uI zeKazb9r5yaMA9aeh;+?AB~P9Jn4&2z34Ms4c*=j$oJs19mOzP%UO=(W!0)2lRZPRB z4E2q@$Vz!7bmC152Xb1Cu7}|%XNsyfVPaMrg0YqcDI6Unh_Y6m$T?SafO>}Pc6nY! 
zn$alC!U)#Q68Z3Hts2%^XewC3zOUJ6;mn^9==(va-3^7k-yy%_kIZ|CSvn@m<0LN{tkU_&y`i6# z>nf}b3IfNI=Cfy9B@_*$A*V7LF$qvWA2{Ada2a?~F+QTlOjOsad8x?e9;EBFbJCG$ ziP<42;B2@L3GLWMluI(NDc7T_2=y#$F3!f z?^l^0)67cwcyyYWLb)DNMQa%7q5!ANG;+uBAy7M(-f*9yxWWn(qRxYnEt{Fyx|W34 zvm-o_HD+|uN%Diht4D|o`zyW7N({e>Jt~kI`Ih-^6V38IuDhM8)*yq&$jbsrcQTpWoq=8Ymd4vNp2yDLD?EQ!4-gHI0+OJ z=cUAO&q;m!hAu6^=Q@*ntztzkmdwTWp+_l19Ac+@?h=!JbwTM>qRYgrrr#)++~{)a_FYJZ z6e(zSnf*nmI>e2Yd6<%@;fTc~8XlM^zjOw(u&Sdo<*sOMc=0M*gFn&2CMe?Q$*jLV z#uKUw~{;t3G5E$aqUt z&vXLbzBkWzs<4WB0D&g`Z=FOZ*;5%tN(2-eIuR{_H?jR$LOay@=JH7X`NICicR<L6X_`JTguJsZlEC8}0qk za&g)}dP>5q@t&cFT)9CeAGmHlOb}MTnot|}GI^Y`NB1CL`$j9*f8(UnOD>Zp?YD6& zJBcUqSde5nVd%z99zB7$x30MQ*;Clc*so}W5}x{k@KFZ!pzvO z@K~Tq2p>0@#%fgp{(qqV{g$x~e$G00_t8koN-rBV>7K82=?NKqnh)o=3CqNUv z-n!R*GZTK}dhyM8j$=d2>kZ)X)#$L7Yhd32Y5IDj723v>zV; zRLkjq5z|El@IZb_Nv!_+5Py-XtZ zSbgI&zN~a2NQUXXQKHQ5h8Zb#=bDGIk40ZuSHflG@l(}$VrQ1ABGMThe*PD6dhs_X z1aGd+Fm2~DZVFjWpbx;oU$Kn9j32k4WzDqo%wj8Vi!f4WgFqt3Foh~p-7<=SZe`s- z)4KwZyvB)Xwt$hi>&=}X_;OFR0W$;7jxvpu@G65Fe|dA~+Qy+(8b2t0g7LoK5K3$} zW_|bmc5ij6zQ#{r&XY#F1!Nefw~xA9cy4A;oJMcEU7#84tLiTcU@GEqL$@WbQ5xK1el;k zCgYscbgVYfG3Xt{|5rd4vLk%AYMwd(=h!^OXLvTVqz!moaW|$&t3mISZ%J z1H;L|>C9M1ki4c*wzlh1+L=GtoT$cHzd|IPdlYC@X#b0YUJl|WSKd^%{-Bm5H&Oju z%#9_%i`dL{dh|2YjB@UG;iAxG-0_-M%5ZUSd2ZUP83fhuWNtB(Sonv~oDs~mwxEU_SmNPY zON4e%A6x^Nfn7s=V=?UcQ#p!QmrJnsK-VYx7WE&}Z~0)+rwq(7I}|A_`|U0t!)D11 zq&USgFh)H35m&>=PidJ>!kr2#2Ah*mR0zH3_OJ5w3~H!)k0YF6&BDybLy*|2CmOnc zf(rsnhD&6AOjk3l*e(KL;$E4zV#Kk9Uf}bxv)OW{M1qrJsCj!u(a8yW5zj|*knox% z#L^52HKmRsrZjRY2?uVqzoY)>{NeKD44J5~8N9AOOlP?X1$O@CyVYHzeGuwWs1xH` znEp89EOR$aW0w;STAK;O#&xcX2c=Gej4JeKIZ(00)yjpUtV5+{O%{k zRoYi87)LyspY0Ydnfa>X55ox~P17ZeD~YP&-`-1KfziNRzP+qG0+F%KTjp53&q^h% zPQREO<~q7Cv&@7B?Z+dFG&V|W3WZy;+JZ2HewJGGGpHI?`smznlqVC%rLw~rUUCe_Aa(cTf&fObp)NZLgvhni7w&3GHDJoNa z>PY-{W$DHnP+zt9^5lEPzX>vY96D}JW=a9B~bgVf6sAw#z4hq z7#8RK{dEnU5h7b0zm}@*@y4m(%NVFgHV)xfWzuEy71+u(t_)07$Cez;=_rc zNS-Y4s4k#0NlDa`YFj10O7BIcZm-bV6$pip>#ZoJs}TM|L{6Il zD*pj=45cSJ$4K?DOp7@*hzcoy3Hf5VU^7;#UxuY64NMza`y zC+dn`%aO+;EFfJ9n=!BvxjBd10yT!oj5%hZ!>p=_TJI0dz3OpbDJ{8$f#RM~7OCUYUj&$thu&%ahnSVquJ%I0e?%XKQ=V zc*RkqJa_9&J7l0Z`FVh&zpqAs5Y&F|l%!XgG`NC%i!6k~@}qJ*%cinx>@>E@Hy{{{QB|=6FL{K%~<}P zFTy*@lPUQ{qyVZU?M=qikXw50r+Y{~wl+F;8OL&+?sCn;LIdyKZ}SOG|HxD%mzk3C zC@54&+UV_!#H@I#MLoH!?sA_H>#JKPB=^LjW*AA`h${V|-E0|M@Ajpz&K4hHbf`!IAE<=P3T<~5-mt!TVRfKbA3S_d%bAujhSAuB05}=BoC7*`lCMNzvO12)M zgLgafo;&xMwWEe`4T-4JaeukuBM$&TNptgW|EOZ$Fol(&uK>36+2ZAUz-aVRV(1UG zS#8O!YmC)C%X;l^%er$$4T1tzoMk<%S?%JCWZw<_(r*}t_h%bLiNBWCOxC(}Rv>XN zZFwZCwO{t^^kC++dQ+#oka`1%lE_rwqx1f7T((U8?~wZ(2;j?7k{0l+e=J^>RN#>C z_yvcAOo+g&M{->MdY-91hxwdfm_*bjIn&HoGv_W4plitvL1-?h&U(>;Uo?r#QlpH> z>V>pt<*tbrX${AvO>$TaQKyhCVshqPATF~{)70<7!{``($Im9N%eW(>d^Ul70wXWA ziruA5LOW@{`t#w}4VFE0Qy-PMgbk@JEbvklOky4ASiE_}pBv9=a*Q>5fW0`qc%U$5>)08Gvjp~Ncs--HL z%+fzXJH?H>M{>~y)ayiJfFRD7Og}a1;{tK(eI1n&IR$|S7h9dyQdXfy3uNAyCJvK8 zD5M&3wX@O3bAVFFwDr0(+sr{AjgOw+KC;bfCHNO>9i-Azt0Y(h`Z``6w;XZ*tL|Ld z`r-L7e=VxiZfbnGC=}XUm@f%*K1=21ZgMMQ`XaLXBj@PtK}7vT@JDQCC(yvEB+bYY z#M`I{u<-P?*G1kjZMn(I!OCY&TT)!hVr6SlPDB$n#r`FDUQ@Al5q*#FCb?go!KYqT zj>3Z3fb4$$HJJ4mW_=WKR&=PZEGhpj!7`Sd7Lpjd@6|d@L*sl%oaw z#zS*8uDwUblVl#kJZhz9!TkXsA2o+>hoeSG`A#XXgioUWKR-W`In2F?CSLL@_C_-j z2YJ6mo(atSs-{H96K-$*$X_2N=0I6hRva9M;F>zRyGWwU(Q?4RG2Vx!DTi>TFl>6v z=XsvERe~JYIDq-|-vvE!vr|Xgc^JPaKj!lJCwyW@@$r7qDkjfa4ZYX9l^>r>+J0Jc z20~?*DCSGj9&2ajtIpfkisDnE6}>EGxj@_-O61sVr2?tUS>pXOwf2Tcs)B{bb z#W!!Zj3?W6cC21ss@g)mM*l|mv2rEj)_P~DxeF+N@1!YtBBXF$QP_kv%&MYX(v|C7 zGO%&OSiehfU?I_3B3}UI`URmr2cDry?yl(gTJ?Qg%76$-i~MG?a*aT}l=Ry`1HtkI 
z=51WtRmo{J#yaS@zj(3(vFtvs(XlZJ&%hZIvMc}1fz>gpi5Sv-Tur@*A~-j<)@1i4 zd8y~)e<~Y$YP_3(u0?jS0&m0X)wCkQYk1jhQ~h=^MQ=}!14&M1b9U&IP@52^r?L9G_2dWffHf$E~%nLIB^3CouDdI zm$Z_L*+rvo@Q~{$f=1sm>Btx8)I=HM0E+C}zVxARD}`eR1k)Yy9*8aoDBqnm=iM-9 ze+D`0Dq5trM?K%y!lqEJ`d0uFa@`QiVWhY{;6l`SAB6GfMLC)QmLzq3rC=Tbgg!g& z?B=QbDKz=w^l79{;|7ix;0q{ej36W?odFQ4IlX`IArF-qJRFsE2r9!IhEYc_8GL}` zn(zdndt(w#sB#FDuW#|;BP`Z!eGjuS-@F{$gX*yKzqY0!XC+z*oAx5|yZ&lA#p()n zEJ|68P@|GN%mCN+{4*3nAm+UFKIm$!)F^6hM<9e$ zTI;7+43Uin2<39;2+6KIO%JAetSmtR+F~)7fU8D-1z->YM!J$(zfDS_?gRhvErpL> z3&k@NaYLUy>zUb(tq9r6y867!(_jn09ydu1f152(uKPt0jYkST&=ZFA;jsgVj%Jk&|FPem-%sdmW za%OZ>=18H|L?pjgIr^NO=7i+$RsC*&V zR%x9Tiz84;^SW%0o)180L9Po>9&-S^U7DeZcm;m}>al7VHCE?CRQ=FQaVWBg~X*)}RVw(wt=#a)L*!W1$A%Oq=BmzfR_VyCS4ypi(GC>e3s&`2+ z>yWn)G*6%V5S)%$WY5`a=@FQ>K62Z#X1yl+Gy ziDvtJPaMU%5OexOZxs#<$Eg=4KzkIpH zFYqg#S5wtiW|eimFLfWv4=F+4u^-s38Y^VN#EJHJWznaT8>xKg()0wA`XN^$AAJ~; z%A1pmspb1Eg0_qZilisA=$o!GN|}AXy@o+!_v4F^GCH$&rWv$}t3pXq@oZTeoeTB9 z=-Pi&eU6bXx?_u}mykra&t=F(sLt<%Gk^^_=*vn+wqcaGBV_kzRw)znK~Wc`L%reL zHTE`WN%-&ps?}Is^#(Bk#S~A<_g-l7&t4+Xk;O|%W!rM#JzK5xI`9V6{Fv>IK;!Nw zN}ug0>T|GI}G*1p|1 z=OA~^y=m=2-zj3FZRLHOI1@;sf3k|I{T_qFcz_p~LyN14h zBwQ4P<#Oj4+)6Z4Aw$Q{%**Dd1pR6+-^)0MSuUkrhs>lfctQ+HeM-6>8s+c7W1;TU zsx^}FmdsvVR#;~+Wz#KX;Bzb`!~cTGUM-wx7^t_aquQs{@T@&kIQxO)TYp{S{obnk z&?P8lRMCo0N5`*;KI2%+@ZyT^xtalyQ7p1l2t?jX>j7cvNjask+buS%jZd28s^HJg zYnbNd-$pZLzq;fJvv_Zo?%b3TQReIw&8EAEl9IDThZ_gGpOVMi`X9(AG~P0h)!Te$ z*`3tRu|7aLK{)Vr;B@e0I=;HnnpT*y74VZmXZzi%^uTZNOw89liH+LIrtp!TbBwA# z=-H~3Z<>kx9>WGC_O!fmkq#p~+gGe*R*Qm8s=KrT*t!Rs7g1ibsFaWN#Az2e+&Ip` zuX{9N7SZ9)J^rP(NwH#0KhN4|8UAnI7yfES zTXz*-V9UTZ8dEA|$A(};tuBs7LOL7LesYO2s%DRDi`6}NdpKM`RNR>Tpj@>`h*?V& z4C*s=`gtRYSkywhBhUcT5R?8vRWOlrES8Iq$c$ror`5r$-=W?@TUX?_a2wwilyLVQ zoFz;cr>9?WioIm090X{inVi8RmuXw}m=*;?5;sCA&*kIUE-KIej~;U$O5ye{63ApF zGL61acs(Z&ZlUlg@2He9?+^~*8SUWJmg0XQM~oVg3O&I?%s$AW%f#Bk7%L-u45+@( zPR?7g%1GOEN7_5=l^W$6qotk94s*Hnlg=~&?1tCI>JmdkuDzcEx? z;ns4rL|^|aUu(iy)AAWM9h||p2S~$9Rt!^Jhx~g`U2eO7R@-jk;#oxmjeA?8|`(|4UH^$b0M1eqMdoZY;7raid6N{IM}*qgB^ZWURpp~hb*>%Xo_ zV@`FU`NX@)fi^JedQJ z0Z<|k-pKU49o=aJqVfX8F?`iXPSHJ%R||GFF*WvY;9hmQ>sb8L64&L*9;!+SD-+C0 zRm=LzUR!C09Z~hj+8SXP4TXPKso}2XpUnGQ_!J?N3%^cv<9Cs4p`!TFif*I#06j5 zRFDVCDWo#U>w$-ogf>~y7p_<}M5O2mMhBX!5CSMA5syPFPtP6!dPo``Idva^_=9fq zzB_rQ6jmx9%{YXKnIa`^mpA+PE&TufQK7P6R{cF4?@J?5aH;h6_xB=7D)~SHQ`V!K z?-sYm;jZv^z=CnlW~(z#HQUqYK`VDlZMX!66m}u(*;(Z#SqxfqN-1i~F0Lj^BD0kW zj9V03+#X68B4EaYBps|4`g0`t(kEA7>VKsK0F3QvuT!*2j4uVH?_3)|E^B+&xmL;A z^Sn8`K0)yWKPH{v4J+i$%9i~Cu346Zuy6OPymRxq&dsJvKG1%vWgm%mAunZ@6ynP| z5WGZ+!m1*o16dOiG@PL9GU_%T?DapxP_#XhqRdjExRVta&3@FcvzK_$Z?&KXAh@X6 z6XjP93L1kFZWVN)l9C@U-RiA_($Nd|`G2CZNz2kRJ`S&n>JkfGsyCTI@047 zt02EKL*3_tQ3PY1vA!l=mtt)%6GFHawZ&+#th?|ePU_nF{Olr|4Pax@4vRNv-XH$} zWaX9U%u;JMZpRA*#8MC-KilQ@F{HO8KV%g7RKOSm z7MwIlV!Yza$Y0}~*sWYK5YDyB{hJFWfK?3iW~ohX3)Z_L@%%n}&M7cc(k_t<$?FS9 zS!)N~hRXlivrnpAJUjqUc*0k9y?cGAw(pnh9T}QRDSOT8PnXuoMLyMU;MDElUf9q*ZCTpT zauT*X;f1tuQ@3zg!-!SWp4-1xxY?xWkScV*(-$Fy{7oMo9roR2(*1NF(^D>uqz7?^ zBzhWT%U;sBH5Zvdp&hN8=NP}s_G0XOLA{|q=g~6P$X8A2aVr)seIZ&+9u!L0RyaTRp?1!S%DrKB=rs#0|G!MtE)_1#_? 
z85G_<^>u#<=l-3(N9r6ab+t?`F1&cCwS5?g^?-h`uOd44tv@u<1@#TE7<(s>^sBASE79v zDeDf1%aq7g+Q`)WDWd3bqrbu)EsNue&&7!@$TW}eh)7gsuP~$<9QvhWbqQ5n?eL;K zYRbl45tj|y-Y$M+a#FW>dt}n4>U%GezOfnh?5rD}^+N8qjl{A1M)^MauC1zQ?YA^0 zy4og2X`{#pSEu{izMb9fyEpKU+0um^g{^At3;(p;I`z!IPdR-)n+xmaB^QdvZ&WTU|r&PZ4n~ay)4KRI?{O6C`RvIs>mQVO&>Z;`4>0ZoY zCZAX?`~U05EAZn}b{EH=cl?+1XL!03yt&{0_2d655AA6DX}HdS!g}9=eyuU^e%aM8 zJvm&>w_ECF@3b6$?ej~TF5TYq&Y^}sb!;7{9{LdL<%LQg>JtC4Y}D1VG}-g6N`IoK zd)jJJg*eKnhx)Y?#=qIPef^7CM!hJue2(-#f4r2U5h`PNgk#DI8Ss@c-+_ax~fhKJxgo{&$t9cC_7pH^$US`R~T~@5cDQ eV3!#Wm;;tBmOrnYo5h6xST40OySl_N^nU;(ZqPgc literal 0 HcmV?d00001 diff --git a/format/Layout.md b/format/Layout.md index 815c47f2c934b..5eaefeebf210a 100644 --- a/format/Layout.md +++ b/format/Layout.md @@ -78,7 +78,14 @@ Base requirements ## Byte Order ([Endianness][3]) -The Arrow format is little endian. +The Arrow format is little endian by default. +The Schema metadata has an endianness field indicating endianness of RecordBatches. +Typically this is the endianness of the system where the RecordBatch was generated. +The main use case is exchanging RecordBatches between systems with the same Endianness. +At first we will return an error when trying to read a Schema with an endianness +that does not match the underlying system. The reference implementation is focused on +Little Endian and provides tests for it. Eventually we may provide automatic conversion +via byte swapping. ## Alignment and Padding diff --git a/format/Message.fbs b/format/Message.fbs index 6a351b9dbf0a6..3f688c156e3ea 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -87,10 +87,21 @@ table Field { children: [Field]; } +/// ---------------------------------------------------------------------- +/// Endianness of the platform that produces the RecordBatch + +enum Endianness:int { Little, Big } + /// ---------------------------------------------------------------------- /// A Schema describes the columns in a row batch table Schema { + + /// endianness of the buffer + /// it is Little Endian by default + /// if endianness doesn't match the underlying system then the vectors need to be converted + endianness: Endianness=Little; + fields: [Field]; } From 268e108c2d9101eccf2624fccf1fddf6f7f97b8b Mon Sep 17 00:00:00 2001 From: Jihoon Son Date: Mon, 15 Aug 2016 22:08:56 -0700 Subject: [PATCH 0111/1644] ARROW-251: Expose APIs for getting code and message of the status Author: Jihoon Son Closes #114 from jihoonson/ARROW-251 and squashes the following commits: d1186bf [Jihoon Son] Fix compilation failure 4275c70 [Jihoon Son] Add tests for status 1162084 [Jihoon Son] Merge branch 'master' of https://github.com/apache/arrow into ARROW-251 a76b888 [Jihoon Son] Make code() public and add message() --- cpp/src/arrow/util/CMakeLists.txt | 1 + cpp/src/arrow/util/status-test.cc | 38 +++++++++++++++++++++++++++++++ cpp/src/arrow/util/status.h | 16 +++++++++---- 3 files changed, 51 insertions(+), 4 deletions(-) create mode 100644 cpp/src/arrow/util/status-test.cc diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt index 4e941fb5f5cec..13c0d7514feef 100644 --- a/cpp/src/arrow/util/CMakeLists.txt +++ b/cpp/src/arrow/util/CMakeLists.txt @@ -70,3 +70,4 @@ endif() ADD_ARROW_TEST(bit-util-test) ADD_ARROW_TEST(buffer-test) ADD_ARROW_TEST(memory-pool-test) +ADD_ARROW_TEST(status-test) \ No newline at end of file diff --git a/cpp/src/arrow/util/status-test.cc b/cpp/src/arrow/util/status-test.cc new file mode 100644 index 0000000000000..45e0ff361ac22 --- /dev/null +++ 
From 268e108c2d9101eccf2624fccf1fddf6f7f97b8b Mon Sep 17 00:00:00 2001
From: Jihoon Son
Date: Mon, 15 Aug 2016 22:08:56 -0700
Subject: [PATCH 0111/1644] ARROW-251: Expose APIs for getting code and message
 of the status

Author: Jihoon Son

Closes #114 from jihoonson/ARROW-251 and squashes the following commits:

d1186bf [Jihoon Son] Fix compilation failure
4275c70 [Jihoon Son] Add tests for status
1162084 [Jihoon Son] Merge branch 'master' of https://github.com/apache/arrow into ARROW-251
a76b888 [Jihoon Son] Make code() public and add message()
---
 cpp/src/arrow/util/CMakeLists.txt |  1 +
 cpp/src/arrow/util/status-test.cc | 38 +++++++++++++++++++++++++++++++
 cpp/src/arrow/util/status.h       | 16 +++++++++----
 3 files changed, 51 insertions(+), 4 deletions(-)
 create mode 100644 cpp/src/arrow/util/status-test.cc

diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt
index 4e941fb5f5cec..13c0d7514feef 100644
--- a/cpp/src/arrow/util/CMakeLists.txt
+++ b/cpp/src/arrow/util/CMakeLists.txt
@@ -70,3 +70,4 @@ endif()
 ADD_ARROW_TEST(bit-util-test)
 ADD_ARROW_TEST(buffer-test)
 ADD_ARROW_TEST(memory-pool-test)
+ADD_ARROW_TEST(status-test)
\ No newline at end of file

diff --git a/cpp/src/arrow/util/status-test.cc b/cpp/src/arrow/util/status-test.cc
new file mode 100644
index 0000000000000..45e0ff361ac22
--- /dev/null
+++ b/cpp/src/arrow/util/status-test.cc
@@ -0,0 +1,38 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#include "gtest/gtest.h"
+
+#include "arrow/util/status.h"
+#include "arrow/test-util.h"
+
+namespace arrow {
+
+TEST(StatusTest, TestCodeAndMessage) {
+  Status ok = Status::OK();
+  ASSERT_EQ(StatusCode::OK, ok.code());
+  Status file_error = Status::IOError("file error");
+  ASSERT_EQ(StatusCode::IOError, file_error.code());
+  ASSERT_EQ("file error", file_error.message());
+}
+
+TEST(StatusTest, TestToString) {
+  Status file_error = Status::IOError("file error");
+  ASSERT_EQ("IOError: file error", file_error.ToString());
+}
+
+} // namespace arrow

diff --git a/cpp/src/arrow/util/status.h b/cpp/src/arrow/util/status.h
index 6ba2035bcd3a4..d5585313c728b 100644
--- a/cpp/src/arrow/util/status.h
+++ b/cpp/src/arrow/util/status.h
@@ -138,6 +138,18 @@ class ARROW_EXPORT Status {
   // Get the POSIX code associated with this Status, or -1 if there is none.
   int16_t posix_code() const;

+  StatusCode code() const {
+    return ((state_ == NULL) ? StatusCode::OK : static_cast<StatusCode>(state_[4]));
+  }
+
+  std::string message() const {
+    uint32_t length;
+    memcpy(&length, state_, sizeof(length));
+    std::string msg;
+    msg.append((state_ + 7), length);
+    return msg;
+  }
+
  private:
   // OK status has a NULL state_. Otherwise, state_ is a new[] array
   // of the following form:
   //    state_[7..] == message
   const char* state_;

-  StatusCode code() const {
-    return ((state_ == NULL) ? StatusCode::OK : static_cast<StatusCode>(state_[4]));
-  }
-
   Status(StatusCode code, const std::string& msg, int16_t posix_code);
   static const char* CopyState(const char* s);
 };
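A short usage sketch, not taken from the patch, of why code() and message() are
exposed: callers can branch on the machine-readable StatusCode instead of parsing
ToString() output. It assumes only the accessors visible in the diff above plus
the pre-existing Status::ok() helper.

```cpp
#include <iostream>

#include "arrow/util/status.h"

// Branch on the status code; message() returns the text without the
// "IOError: " prefix that ToString() prepends.
void Report(const arrow::Status& s) {
  if (s.code() == arrow::StatusCode::IOError) {
    std::cerr << "I/O failure: " << s.message() << std::endl;
  } else if (!s.ok()) {
    std::cerr << s.ToString() << std::endl;  // e.g. "IOError: file error"
  }
}
```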
From 246a126b23dc20bca7b665ec76d75ca4a68cd1f1 Mon Sep 17 00:00:00 2001
From: Micah Kornfield
Date: Mon, 15 Aug 2016 23:04:46 -0700
Subject: [PATCH 0112/1644] ARROW-107: [C++] Implement IPC for structs

Some other changes (I tried to isolate each in its own commit):
1.  Changed NumericTypes to be a single templated type instead of separate
    macros (this made debugging easier).
2.  Fixed an existing unit test for IPC that used row counts inconsistent
    with the row batch size.
3.  Some minor make-format changes.

Author: Micah Kornfield

Closes #117 from emkornfield/emk_struct_ipc and squashes the following commits:

777e338 [Micah Kornfield] fix formatting
9008046 [Micah Kornfield] use TypeClass::c_type
e46b0d8 [Micah Kornfield] add skip for memory pool test
fc63bff [Micah Kornfield] make lint and make format
9aa972b [Micah Kornfield] change macro to templates instead (makes debugging easier)
3e01e7f [Micah Kornfield] Implement struct round-trip. Fix unit test for non null to have consistent batch sizes
8eaf1e7 [Micah Kornfield] fix formatting
---
 cpp/src/.clang-tidy-ignore             |  1 +
 cpp/src/arrow/ipc/adapter.cc           | 24 ++++++-
 cpp/src/arrow/ipc/ipc-adapter-test.cc  | 46 ++++++++++--
 cpp/src/arrow/ipc/metadata-internal.cc | 10 +--
 cpp/src/arrow/types/primitive.h        | 97 +++++++++++++-------------
 cpp/src/arrow/util/memory-pool-test.cc |  2 +-
 6 files changed, 115 insertions(+), 65 deletions(-)

diff --git a/cpp/src/.clang-tidy-ignore b/cpp/src/.clang-tidy-ignore
index a128c38889672..5ab4d20d61942 100644
--- a/cpp/src/.clang-tidy-ignore
+++ b/cpp/src/.clang-tidy-ignore
@@ -1 +1,2 @@
 ipc-adapter-test.cc
+memory-pool-test.cc

diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc
index 84f7830092cf4..3259980058b8e 100644
--- a/cpp/src/arrow/ipc/adapter.cc
+++ b/cpp/src/arrow/ipc/adapter.cc
@@ -34,6 +34,7 @@
 #include "arrow/types/list.h"
 #include "arrow/types/primitive.h"
 #include "arrow/types/string.h"
+#include "arrow/types/struct.h"
 #include "arrow/util/buffer.h"
 #include "arrow/util/logging.h"
 #include "arrow/util/status.h"
@@ -118,8 +119,11 @@ Status VisitArray(const Array* arr, std::vector<flatbuf::FieldNode>* field_nodes
     RETURN_NOT_OK(VisitArray(
         list_arr->values().get(), field_nodes, buffers, max_recursion_depth - 1));
   } else if (arr->type_enum() == Type::STRUCT) {
-    // TODO(wesm)
-    return Status::NotImplemented("Struct type");
+    const auto struct_arr = static_cast<const StructArray*>(arr);
+    for (auto& field : struct_arr->fields()) {
+      RETURN_NOT_OK(
+          VisitArray(field.get(), field_nodes, buffers, max_recursion_depth - 1));
+    }
   } else {
     return Status::NotImplemented("Unrecognized type");
   }
@@ -313,6 +317,22 @@ class RowBatchReader::Impl {
       return MakeListArray(type, field_meta.length, offsets, values_array,
           field_meta.null_count, null_bitmap, out);
     }
+
+    if (type->type == Type::STRUCT) {
+      const int num_children = type->num_children();
+      std::vector<std::shared_ptr<Array>> fields;
+      fields.reserve(num_children);
+      for (int child_idx = 0; child_idx < num_children; ++child_idx) {
+        std::shared_ptr<Array> field_array;
+        RETURN_NOT_OK(NextArray(
+            type->child(child_idx).get(), max_recursion_depth - 1, &field_array));
+        fields.push_back(field_array);
+      }
+      out->reset(new StructArray(
+          type, field_meta.length, fields, field_meta.null_count, null_bitmap));
+      return Status::OK();
+    }
+
     return Status::NotImplemented("Non-primitive types not complete yet");
   }
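The pattern behind these changes: a struct contributes its own field node and
validity bitmap, then simply recurses into its children in declaration order,
so the writer (VisitArray) and the reader (the STRUCT branch of NextArray)
walk the type tree in the same pre-order and agree on buffer placement. A toy
sketch of that shared traversal, using simplified stand-in types rather than
the real Array classes:

#include <memory>
#include <string>
#include <vector>

// Simplified stand-in for a nested Arrow type/array node.
struct Node {
  std::string name;
  std::vector<std::shared_ptr<Node>> children;
};

// Pre-order traversal mirroring VisitArray()/NextArray() above: the parent
// is recorded first, then each child in order, with a recursion-depth guard.
void Flatten(const Node& node, int max_depth, std::vector<std::string>* out) {
  if (max_depth <= 0) { return; }  // mirrors max_recursion_depth
  out->push_back(node.name);
  for (const auto& child : node.children) {
    Flatten(*child, max_depth - 1, out);
  }
}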
diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc
index 2bfb459d6e06a..6740e0fc5acc2 100644
--- a/cpp/src/arrow/ipc/ipc-adapter-test.cc
+++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc
@@ -32,6 +32,7 @@
 #include "arrow/types/list.h"
 #include "arrow/types/primitive.h"
 #include "arrow/types/string.h"
+#include "arrow/types/struct.h"
 #include "arrow/util/bit-util.h"
 #include "arrow/util/buffer.h"
 #include "arrow/util/memory-pool.h"
@@ -205,15 +206,16 @@ Status MakeNonNullRowBatch(std::shared_ptr<RowBatch>* out) {
   // Example data
   MemoryPool* pool = default_memory_pool();
-  const int length = 200;
+  const int length = 50;
   std::shared_ptr<Array> leaf_values, list_array, list_list_array, flat_array;
 
   RETURN_NOT_OK(MakeRandomInt32Array(1000, true, pool, &leaf_values));
   bool include_nulls = false;
-  RETURN_NOT_OK(MakeRandomListArray(leaf_values, 50, include_nulls, pool, &list_array));
   RETURN_NOT_OK(
-      MakeRandomListArray(list_array, 50, include_nulls, pool, &list_list_array));
-  RETURN_NOT_OK(MakeRandomInt32Array(0, include_nulls, pool, &flat_array));
+      MakeRandomListArray(leaf_values, length, include_nulls, pool, &list_array));
+  RETURN_NOT_OK(
+      MakeRandomListArray(list_array, length, include_nulls, pool, &list_list_array));
+  RETURN_NOT_OK(MakeRandomInt32Array(length, include_nulls, pool, &flat_array));
   out->reset(new RowBatch(schema, length, {list_array, list_list_array, flat_array}));
   return Status::OK();
 }
@@ -238,10 +240,40 @@ Status MakeDeeplyNestedList(std::shared_ptr<RowBatch>* out) {
   return Status::OK();
 }
 
-INSTANTIATE_TEST_CASE_P(
-    RoundTripTests, TestWriteRowBatch,
+Status MakeStruct(std::shared_ptr<RowBatch>* out) {
+  // reuse constructed list columns
+  std::shared_ptr<RowBatch> list_batch;
+  RETURN_NOT_OK(MakeListRowBatch(&list_batch));
+  std::vector<ArrayPtr> columns = {
+      list_batch->column(0), list_batch->column(1), list_batch->column(2)};
+  auto list_schema = list_batch->schema();
+
+  // Define schema
+  std::shared_ptr<DataType> type(new StructType(
+      {list_schema->field(0), list_schema->field(1), list_schema->field(2)}));
+  auto f0 = std::make_shared<Field>("non_null_struct", type);
+  auto f1 = std::make_shared<Field>("null_struct", type);
+  std::shared_ptr<Schema> schema(new Schema({f0, f1}));
+
+  // construct individual nullable/non-nullable struct arrays
+  ArrayPtr no_nulls(new StructArray(type, list_batch->num_rows(), columns));
+  std::vector<uint8_t> null_bytes(list_batch->num_rows(), 1);
+  null_bytes[0] = 0;
+  std::shared_ptr<Buffer> null_bitmask;
+  RETURN_NOT_OK(util::bytes_to_bits(null_bytes, &null_bitmask));
+  ArrayPtr with_nulls(
+      new StructArray(type, list_batch->num_rows(), columns, 1, null_bitmask));
+
+  // construct batch
+  std::vector<ArrayPtr> arrays = {no_nulls, with_nulls};
+  out->reset(new RowBatch(schema, list_batch->num_rows(), arrays));
+  return Status::OK();
+}
+
+INSTANTIATE_TEST_CASE_P(RoundTripTests, TestWriteRowBatch,
     ::testing::Values(&MakeIntRowBatch, &MakeListRowBatch, &MakeNonNullRowBatch,
-        &MakeZeroLengthRowBatch, &MakeDeeplyNestedList, &MakeStringTypesRowBatch));
+        &MakeZeroLengthRowBatch, &MakeDeeplyNestedList,
+        &MakeStringTypesRowBatch, &MakeStruct));
 
 void TestGetRowBatchSize(std::shared_ptr<RowBatch> batch) {
   MockMemorySource mock_source(1 << 16);

diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc
index 1d3edf0117f91..8cd416ff5853f 100644
--- a/cpp/src/arrow/ipc/metadata-internal.cc
+++ b/cpp/src/arrow/ipc/metadata-internal.cc
@@ -265,11 +265,8 @@ Status MessageBuilder::SetSchema(const Schema* schema) {
     field_offsets.push_back(offset);
   }
 
-  header_ = flatbuf::CreateSchema(
-      fbb_,
-      endianness(),
-      fbb_.CreateVector(field_offsets))
-      .Union();
+  header_ =
+      flatbuf::CreateSchema(fbb_, endianness(), fbb_.CreateVector(field_offsets)).Union();
   body_length_ = 0;
   return Status::OK();
 }
@@ -278,8 +275,7 @@ Status MessageBuilder::SetRecordBatch(int32_t length, int64_t body_length,
     const std::vector<flatbuf::FieldNode>& nodes,
     const std::vector<flatbuf::Buffer>& buffers) {
   header_type_ = flatbuf::MessageHeader_RecordBatch;
-  header_ = flatbuf::CreateRecordBatch(fbb_, length,
-      fbb_.CreateVectorOfStructs(nodes),
+  header_ = flatbuf::CreateRecordBatch(fbb_, length, fbb_.CreateVectorOfStructs(nodes),
                 fbb_.CreateVectorOfStructs(buffers))
                 .Union();
   body_length_ = body_length;
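MakeStruct() above marks row 0 as null by filling a byte-per-slot vector
(1 = valid, 0 = null) and converting it with util::bytes_to_bits(). A
self-contained sketch of that packing, assuming Arrow's
least-significant-bit-first bitmap order (BytesToBits here is an illustrative
stand-in, not the Arrow helper itself):

#include <cstdint>
#include <vector>

// Pack one byte per value (1 = valid, 0 = null) into an LSB-first bitmap,
// mirroring what util::bytes_to_bits() does for MakeStruct() above.
std::vector<uint8_t> BytesToBits(const std::vector<uint8_t>& bytes) {
  std::vector<uint8_t> bitmap((bytes.size() + 7) / 8, 0);
  for (size_t i = 0; i < bytes.size(); ++i) {
    if (bytes[i]) {
      bitmap[i / 8] |= static_cast<uint8_t>(1 << (i % 8));
    }
  }
  return bitmap;
}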
diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h
index 18f954adc0894..770de765f1fcc 100644
--- a/cpp/src/arrow/types/primitive.h
+++ b/cpp/src/arrow/types/primitive.h
@@ -53,54 +53,55 @@ class ARROW_EXPORT PrimitiveArray : public Array {
   const uint8_t* raw_data_;
 };
 
-#define NUMERIC_ARRAY_DECL(NAME, TypeClass, T)                                       \
-  class ARROW_EXPORT NAME : public PrimitiveArray {                                  \
-   public:                                                                           \
-    using value_type = T;                                                            \
-                                                                                     \
-    NAME(int32_t length, const std::shared_ptr<Buffer>& data, int32_t null_count = 0, \
-        const std::shared_ptr<Buffer>& null_bitmap = nullptr)                        \
-        : PrimitiveArray(                                                            \
-              std::make_shared<TypeClass>(), length, data, null_count, null_bitmap) {} \
-    NAME(const TypePtr& type, int32_t length, const std::shared_ptr<Buffer>& data,   \
-        int32_t null_count = 0, const std::shared_ptr<Buffer>& null_bitmap = nullptr) \
-        : PrimitiveArray(type, length, data, null_count, null_bitmap) {}             \
-                                                                                     \
-    bool EqualsExact(const NAME& other) const {                                      \
-      return PrimitiveArray::EqualsExact(*static_cast<const PrimitiveArray*>(&other)); \
-    }                                                                                \
-                                                                                     \
-    bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx,    \
-        const ArrayPtr& arr) const override {                                        \
-      if (this == arr.get()) { return true; }                                        \
-      if (!arr) { return false; }                                                    \
-      if (this->type_enum() != arr->type_enum()) { return false; }                   \
-      const auto other = static_cast<const NAME*>(arr.get());                        \
-      for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) {  \
-        const bool is_null = IsNull(i);                                              \
-        if (is_null != arr->IsNull(o_i) ||                                           \
-            (!is_null && Value(i) != other->Value(o_i))) {                           \
-          return false;                                                              \
-        }                                                                            \
-      }                                                                              \
-      return true;                                                                   \
-    }                                                                                \
-                                                                                     \
-    const T* raw_data() const { return reinterpret_cast<const T*>(raw_data_); }      \
-                                                                                     \
-    T Value(int i) const { return raw_data()[i]; }                                   \
-  };
-
-NUMERIC_ARRAY_DECL(UInt8Array, UInt8Type, uint8_t);
-NUMERIC_ARRAY_DECL(Int8Array, Int8Type, int8_t);
-NUMERIC_ARRAY_DECL(UInt16Array, UInt16Type, uint16_t);
-NUMERIC_ARRAY_DECL(Int16Array, Int16Type, int16_t);
-NUMERIC_ARRAY_DECL(UInt32Array, UInt32Type, uint32_t);
-NUMERIC_ARRAY_DECL(Int32Array, Int32Type, int32_t);
-NUMERIC_ARRAY_DECL(UInt64Array, UInt64Type, uint64_t);
-NUMERIC_ARRAY_DECL(Int64Array, Int64Type, int64_t);
-NUMERIC_ARRAY_DECL(FloatArray, FloatType, float);
-NUMERIC_ARRAY_DECL(DoubleArray, DoubleType, double);
+template <typename TypeClass>
+class ARROW_EXPORT NumericArray : public PrimitiveArray {
+ public:
+  using value_type = typename TypeClass::c_type;
+  NumericArray(int32_t length, const std::shared_ptr<Buffer>& data,
+      int32_t null_count = 0, const std::shared_ptr<Buffer>& null_bitmap = nullptr)
+      : PrimitiveArray(
+            std::make_shared<TypeClass>(), length, data, null_count, null_bitmap) {}
+  NumericArray(const TypePtr& type, int32_t length, const std::shared_ptr<Buffer>& data,
+      int32_t null_count = 0, const std::shared_ptr<Buffer>& null_bitmap = nullptr)
+      : PrimitiveArray(type, length, data, null_count, null_bitmap) {}
+
+  bool EqualsExact(const NumericArray<TypeClass>& other) const {
+    return PrimitiveArray::EqualsExact(*static_cast<const PrimitiveArray*>(&other));
+  }
+
+  bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx,
+      const ArrayPtr& arr) const override {
+    if (this == arr.get()) { return true; }
+    if (!arr) { return false; }
+    if (this->type_enum() != arr->type_enum()) { return false; }
+    const auto other = static_cast<NumericArray<TypeClass>*>(arr.get());
+    for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) {
+      const bool is_null = IsNull(i);
+      if (is_null != arr->IsNull(o_i) || (!is_null && Value(i) != other->Value(o_i))) {
+        return false;
+      }
+    }
+    return true;
+  }
+
+  const value_type* raw_data() const {
+    return reinterpret_cast<const value_type*>(raw_data_);
+  }
+
+  value_type Value(int i) const { return raw_data()[i]; }
+};
+
+#define NUMERIC_ARRAY_DECL(NAME, TypeClass) using NAME = NumericArray<TypeClass>;
+
+NUMERIC_ARRAY_DECL(UInt8Array, UInt8Type);
+NUMERIC_ARRAY_DECL(Int8Array, Int8Type);
+NUMERIC_ARRAY_DECL(UInt16Array, UInt16Type);
+NUMERIC_ARRAY_DECL(Int16Array, Int16Type);
+NUMERIC_ARRAY_DECL(UInt32Array, UInt32Type);
+NUMERIC_ARRAY_DECL(Int32Array, Int32Type);
+NUMERIC_ARRAY_DECL(UInt64Array, UInt64Type);
+NUMERIC_ARRAY_DECL(Int64Array, Int64Type);
+NUMERIC_ARRAY_DECL(FloatArray, FloatType);
+NUMERIC_ARRAY_DECL(DoubleArray, DoubleType);
 
 template <typename Type>
 class ARROW_EXPORT PrimitiveBuilder : public ArrayBuilder {
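With the ten macro-generated classes collapsed into NumericArray<TypeClass>,
there is a single definition to debug, and the element type is recoverable as
value_type (that is, TypeClass::c_type). A small sketch of the kind of generic
code this enables (SumValues is an illustrative helper, not part of the patch):

#include <cstdint>

#include "arrow/types/primitive.h"

// Illustrative generic helper (not part of the patch): because every alias
// such as Int32Array is now NumericArray<TypeClass>, one template covers all
// of them, with value_type supplied by TypeClass::c_type.
template <typename ArrayType>
typename ArrayType::value_type SumValues(const ArrayType& arr, int32_t length) {
  typename ArrayType::value_type total = 0;
  for (int32_t i = 0; i < length; ++i) {
    if (!arr.IsNull(i)) { total += arr.Value(i); }
  }
  return total;
}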
diff --git a/cpp/src/arrow/util/memory-pool-test.cc b/cpp/src/arrow/util/memory-pool-test.cc
index 919f3740982cf..deb7ffd03ba75 100644
--- a/cpp/src/arrow/util/memory-pool-test.cc
+++ b/cpp/src/arrow/util/memory-pool-test.cc
@@ -54,7 +54,7 @@ TEST(DefaultMemoryPoolDeathTest, FreeLargeMemory) {
 
 #ifndef NDEBUG
   EXPECT_EXIT(pool->Free(data, 120), ::testing::ExitedWithCode(1),
-      ".*Check failed: \\(bytes_allocated_\\) >= \\(size\\)");
+              ".*Check failed: \\(bytes_allocated_\\) >= \\(size\\)");
 #endif
 
   pool->Free(data, 100);

From e7e399db5fc6913e67426514279f81766a0778d2 Mon Sep 17 00:00:00 2001
From: Steven Phillips
Date: Tue, 24 May 2016 13:38:09 -0700
Subject: [PATCH 0113/1644] ARROW-259: Use Flatbuffer Field type instead of
 MaterializedField

Remove MaterializedField, MajorType, RepeatedTypes.
Add code to convert from the FlatBuf representation to Pojo,
and add tests for the conversion.
---
 format/Message.fbs                             |  22 +-
 header                                         |  16 +
 java/format/pom.xml                            | 163 ++++
 java/memory/pom.xml                            |   2 +-
 java/pom.xml                                   |   3 +-
 java/vector/pom.xml                            |   7 +-
 java/vector/src/main/codegen/config.fmpp       |   1 +
 .../src/main/codegen/data/ArrowTypes.tdd       |  80 ++
 .../main/codegen/data/ValueVectorTypes.tdd     |  59 +-
 .../src/main/codegen/includes/vv_imports.ftl   |   4 +
 .../templates/AbstractFieldReader.java         |   8 +-
 .../templates/AbstractFieldWriter.java         |  10 +-
 .../AbstractPromotableFieldWriter.java         |   4 -
 .../src/main/codegen/templates/ArrowType.java  | 129 +++
 .../main/codegen/templates/BaseReader.java     |   5 +-
 .../main/codegen/templates/BaseWriter.java     |   3 +-
 .../codegen/templates/BasicTypeHelper.java     | 539 ------------
 .../main/codegen/templates/ComplexCopier.java  |  18 +-
 .../codegen/templates/ComplexReaders.java      |  72 +-
 .../codegen/templates/ComplexWriters.java      |  30 +-
 .../codegen/templates/FixedValueVectors.java   |  94 +-
 .../codegen/templates/HolderReaderImpl.java    |  98 +--
 .../main/codegen/templates/ListWriters.java    | 234 -----
 .../main/codegen/templates/MapWriters.java     |  42 +-
 .../main/codegen/templates/NullReader.java     |  23 +-
 .../templates/NullableValueVectors.java        | 104 ++-
 .../templates/RepeatedValueVectors.java        | 421 ---------
 .../codegen/templates/UnionListWriter.java     |  23 +-
 .../main/codegen/templates/UnionReader.java    |  28 +-
 .../main/codegen/templates/UnionVector.java    | 105 ++-
 .../main/codegen/templates/UnionWriter.java    |  16 +-
 .../main/codegen/templates/ValueHolders.java   |  43 +-
 .../templates/VariableLengthVectors.java       |  73 +-
 .../arrow/vector/BaseDataValueVector.java      |   5 +-
 .../apache/arrow/vector/BaseValueVector.java   |  31 +-
 .../org/apache/arrow/vector/BitVector.java     |  43 +-
 .../org/apache/arrow/vector/ObjectVector.java  | 220 -----
 .../arrow/vector/ValueHolderHelper.java        | 203 -----
 .../org/apache/arrow/vector/ValueVector.java   |  10 +-
 .../apache/arrow/vector/VectorDescriptor.java  |  83 --
 .../org/apache/arrow/vector/ZeroVector.java    |  30 +-
 .../complex/AbstractContainerVector.java       |  49 +-
 .../vector/complex/AbstractMapVector.java      |  47 +-
 .../complex/BaseRepeatedValueVector.java       |  63 +-
 .../vector/complex/ContainerVectorLike.java    |  43 -
 .../arrow/vector/complex/ListVector.java       |  89 +-
 .../arrow/vector/complex/MapVector.java        |  97 +--
 .../vector/complex/RepeatedListVector.java     | 427 ----------
 .../vector/complex/RepeatedMapVector.java      | 584 -------------
 .../vector/complex/RepeatedValueVector.java    |   2 +-
 .../complex/impl/AbstractBaseReader.java       |  19 +-
.../complex/impl/AbstractBaseWriter.java | 16 +- .../complex/impl/ComplexWriterImpl.java | 22 +- .../vector/complex/impl/PromotableWriter.java | 48 +- .../complex/impl/RepeatedListReaderImpl.java | 145 ---- .../complex/impl/RepeatedMapReaderImpl.java | 192 ----- .../impl/SingleLikeRepeatedMapReaderImpl.java | 89 -- .../complex/impl/SingleListReaderImpl.java | 14 +- .../complex/impl/SingleMapReaderImpl.java | 10 +- .../vector/complex/impl/UnionListReader.java | 19 +- .../arrow/vector/holders/ObjectHolder.java | 38 - .../arrow/vector/holders/UnionHolder.java | 7 +- .../arrow/vector/types/MaterializedField.java | 217 ----- .../org/apache/arrow/vector/types/Types.java | 596 ++++++++++--- .../apache/arrow/vector/types/pojo/Field.java | 105 +++ .../arrow/vector/types/pojo/Schema.java | 74 ++ .../vector/util/ByteFunctionHelpers.java | 50 -- .../arrow/vector/util/CoreDecimalUtility.java | 91 -- .../arrow/vector/util/DecimalUtility.java | 802 +++++++++--------- .../arrow/vector/util/MapWithOrdinal.java | 12 + .../arrow/vector/TestDecimalVector.java | 63 ++ ...TestOversizedAllocationForValueVector.java | 11 +- .../apache/arrow/vector/TestUnionVector.java | 5 +- .../apache/arrow/vector/TestValueVector.java | 137 +-- .../complex/impl/TestPromotableWriter.java | 7 +- .../complex/writer/TestComplexWriter.java | 270 ++++++ .../apache/arrow/vector/pojo/TestConvert.java | 80 ++ 77 files changed, 2464 insertions(+), 5180 deletions(-) create mode 100644 header create mode 100644 java/format/pom.xml create mode 100644 java/vector/src/main/codegen/data/ArrowTypes.tdd create mode 100644 java/vector/src/main/codegen/templates/ArrowType.java delete mode 100644 java/vector/src/main/codegen/templates/BasicTypeHelper.java delete mode 100644 java/vector/src/main/codegen/templates/ListWriters.java delete mode 100644 java/vector/src/main/codegen/templates/RepeatedValueVectors.java delete mode 100644 java/vector/src/main/java/org/apache/arrow/vector/ObjectVector.java delete mode 100644 java/vector/src/main/java/org/apache/arrow/vector/ValueHolderHelper.java delete mode 100644 java/vector/src/main/java/org/apache/arrow/vector/VectorDescriptor.java delete mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/ContainerVectorLike.java delete mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedListVector.java delete mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedMapVector.java delete mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/impl/RepeatedListReaderImpl.java delete mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/impl/RepeatedMapReaderImpl.java delete mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleLikeRepeatedMapReaderImpl.java delete mode 100644 java/vector/src/main/java/org/apache/arrow/vector/holders/ObjectHolder.java delete mode 100644 java/vector/src/main/java/org/apache/arrow/vector/types/MaterializedField.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java delete mode 100644 java/vector/src/main/java/org/apache/arrow/vector/util/CoreDecimalUtility.java create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/TestDecimalVector.java create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java create mode 100644 
java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java

diff --git a/format/Message.fbs b/format/Message.fbs
index 3f688c156e3ea..2928207db8cc0 100644
--- a/format/Message.fbs
+++ b/format/Message.fbs
@@ -1,10 +1,13 @@
-namespace apache.arrow.flatbuf;
+namespace org.apache.arrow.flatbuf;
 
 /// ----------------------------------------------------------------------
 /// Logical types and their metadata (if any)
 ///
 /// These are stored in the flatbuffer in the Type union below
 
+table Null {
+}
+
 /// A Tuple in the flatbuffer metadata is the same as an Arrow Struct
 /// (according to the physical memory layout). We used Tuple here as Struct is
 /// a reserved word in Flatbuffers
@@ -45,10 +48,22 @@ table Decimal {
   scale: int;
 }
 
+table Date {
+}
+
+table Time {
+}
+
 table Timestamp {
   timezone: string;
 }
 
+table IntervalDay {
+}
+
+table IntervalYear {
+}
+
 table JSONScalar {
   dense:bool=true;
 }
@@ -58,13 +73,18 @@ table JSONScalar {
 /// add new logical types to Type without breaking backwards compatibility
 union Type {
+  Null,
   Int,
   FloatingPoint,
   Binary,
   Utf8,
   Bool,
   Decimal,
+  Date,
+  Time,
   Timestamp,
+  IntervalDay,
+  IntervalYear,
   List,
   Tuple,
   Union,

diff --git a/header b/header
new file mode 100644
index 0000000000000..70665d1a26295
--- /dev/null
+++ b/header
@@ -0,0 +1,16 @@
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+
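In a FlatBuffers union such as Type, each member's discriminant is its
position in the declaration (0 is reserved for NONE), which is what the
generated typeType() accessor returns and what the conversion code later in
this patch switches on. A hand-written C++ mirror of that dispatch for the
newly added members (illustrative only; the real enum is generated by flatc
from Message.fbs):

#include <cstdint>
#include <string>

// Illustrative mirror of the Type union ordering after this patch; members
// are numbered by position, and the list continues beyond Union.
enum class TypeTag : uint8_t {
  NONE = 0, Null, Int, FloatingPoint, Binary, Utf8, Bool, Decimal,
  Date, Time, Timestamp, IntervalDay, IntervalYear, List, Tuple, Union
  // ... remaining members elided
};

// Sketch of discriminant-based dispatch, as the generated conversion code
// (for example ArrowType.getTypeForField below) performs on typeType().
std::string TypeName(TypeTag tag) {
  switch (tag) {
    case TypeTag::Null:         return "Null";
    case TypeTag::Date:         return "Date";
    case TypeTag::Time:         return "Time";
    case TypeTag::IntervalDay:  return "IntervalDay";
    case TypeTag::IntervalYear: return "IntervalYear";
    default:                    return "other";
  }
}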
diff --git a/java/format/pom.xml b/java/format/pom.xml
new file mode 100644
index 0000000000000..ea27a3072bc9e
--- /dev/null
+++ b/java/format/pom.xml
@@ -0,0 +1,163 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+  <modelVersion>4.0.0</modelVersion>
+
+  <parent>
+    <artifactId>arrow-java-root</artifactId>
+    <groupId>org.apache.arrow</groupId>
+    <version>0.1-decimal</version>
+  </parent>
+
+  <artifactId>arrow-format</artifactId>
+  <packaging>jar</packaging>
+  <name>Arrow Format</name>
+
+  <properties>
+    <fbs.version>1.2.0-3f79e055</fbs.version>
+    <maven-compiler-plugin.version>3.3</maven-compiler-plugin.version>
+    <maven-dependency-plugin.version>2.10</maven-dependency-plugin.version>
+    <os-maven-plugin.version>1.5.0.Final</os-maven-plugin.version>
+  </properties>
+
+  <dependencies>
+    <dependency>
+      <groupId>com.vlkan</groupId>
+      <artifactId>flatbuffers</artifactId>
+      <version>${fbs.version}</version>
+    </dependency>
+  </dependencies>
+
+  <build>
+    <extensions>
+      <extension>
+        <groupId>kr.motd.maven</groupId>
+        <artifactId>os-maven-plugin</artifactId>
+        <version>${os-maven-plugin.version}</version>
+      </extension>
+    </extensions>
+    <plugins>
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-dependency-plugin</artifactId>
+        <version>${maven-dependency-plugin.version}</version>
+        <executions>
+          <execution>
+            <id>copy-flatc</id>
+            <phase>generate-sources</phase>
+            <goals>
+              <goal>copy</goal>
+            </goals>
+            <configuration>
+              <artifactItems>
+                <artifactItem>
+                  <groupId>com.vlkan</groupId>
+                  <artifactId>flatc-${os.detected.classifier}</artifactId>
+                  <version>${fbs.version}</version>
+                  <type>exe</type>
+                  <overWrite>true</overWrite>
+                  <outputDirectory>${project.build.directory}</outputDirectory>
+                </artifactItem>
+              </artifactItems>
+            </configuration>
+          </execution>
+        </executions>
+      </plugin>
+      <plugin>
+        <groupId>org.codehaus.mojo</groupId>
+        <artifactId>exec-maven-plugin</artifactId>
+        <version>1.4.0</version>
+        <executions>
+          <execution>
+            <id>script-chmod</id>
+            <goals>
+              <goal>exec</goal>
+            </goals>
+            <phase>generate-sources</phase>
+            <configuration>
+              <executable>chmod</executable>
+              <arguments>
+                <argument>+x</argument>
+                <argument>${project.build.directory}/flatc-${os.detected.classifier}-${fbs.version}.exe</argument>
+              </arguments>
+            </configuration>
+          </execution>
+          <execution>
+            <goals>
+              <goal>exec</goal>
+            </goals>
+            <phase>generate-sources</phase>
+            <configuration>
+              <executable>${project.build.directory}/flatc-${os.detected.classifier}-${fbs.version}.exe</executable>
+              <arguments>
+                <argument>-j</argument>
+                <argument>-o</argument>
+                <argument>target/generated-sources/</argument>
+                <argument>../../format/Message.fbs</argument>
+              </arguments>
+            </configuration>
+          </execution>
+        </executions>
+      </plugin>
+      <plugin>
+        <groupId>com.mycila</groupId>
+        <artifactId>license-maven-plugin</artifactId>
+        <version>2.3</version>
+        <configuration>
+          <header>${basedir}/../../header</header>
+          <includes>
+            <include>**/*.java</include>
+          </includes>
+        </configuration>
+        <executions>
+          <execution>
+            <phase>process-sources</phase>
+            <goals>
+              <goal>format</goal>
+            </goals>
+          </execution>
+        </executions>
+      </plugin>
+      <plugin>
+        <groupId>org.codehaus.mojo</groupId>
+        <artifactId>build-helper-maven-plugin</artifactId>
+        <version>1.9.1</version>
+        <executions>
+          <execution>
+            <id>add-sources-as-resources</id>
+            <phase>generate-sources</phase>
+            <goals>
+              <goal>add-source</goal>
+            </goals>
+            <configuration>
+              <sources>
+                <source>${project.build.directory}/generated-sources</source>
+              </sources>
+            </configuration>
+          </execution>
+        </executions>
+      </plugin>
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-checkstyle-plugin</artifactId>
+        <configuration>
+          <skip>true</skip>
+        </configuration>
+      </plugin>
+    </plugins>
+  </build>
+</project>
diff --git a/java/memory/pom.xml b/java/memory/pom.xml
index 44332f5ed14a8..12ff4c81d86c0 100644
--- a/java/memory/pom.xml
+++ b/java/memory/pom.xml
@@ -15,7 +15,7 @@
   <parent>
     <groupId>org.apache.arrow</groupId>
     <artifactId>arrow-java-root</artifactId>
-    <version>0.1-SNAPSHOT</version>
+    <version>0.1-decimal</version>
   </parent>
   <artifactId>arrow-memory</artifactId>
   <name>arrow-memory</name>

diff --git a/java/pom.xml b/java/pom.xml
index 71f59caf2798e..92ab109f939e1 100644
--- a/java/pom.xml
+++ b/java/pom.xml
@@ -21,7 +21,7 @@
   <groupId>org.apache.arrow</groupId>
   <artifactId>arrow-java-root</artifactId>
-  <version>0.1-SNAPSHOT</version>
+  <version>0.1-decimal</version>
   <packaging>pom</packaging>
 
   <name>Apache Arrow Java Root POM</name>
@@ -465,6 +465,7 @@
 
   <modules>
+    <module>format</module>
     <module>memory</module>
     <module>vector</module>

diff --git a/java/vector/pom.xml b/java/vector/pom.xml
index df5389261ba57..fac788cef14d9 100644
--- a/java/vector/pom.xml
+++ b/java/vector/pom.xml
@@ -15,13 +15,18 @@
   <parent>
     <groupId>org.apache.arrow</groupId>
     <artifactId>arrow-java-root</artifactId>
-    <version>0.1-SNAPSHOT</version>
+    <version>0.1-decimal</version>
   </parent>
 
   <artifactId>vector</artifactId>
   <name>vectors</name>
 
   <dependencies>
+    <dependency>
+      <groupId>org.apache.arrow</groupId>
+      <artifactId>arrow-format</artifactId>
+      <version>0.1-decimal</version>
+    </dependency>
     <dependency>
       <groupId>org.apache.arrow</groupId>
       <artifactId>arrow-memory</artifactId>

diff --git a/java/vector/src/main/codegen/config.fmpp b/java/vector/src/main/codegen/config.fmpp
index 663677cbb5a76..6d92ba830ee2c 100644
--- a/java/vector/src/main/codegen/config.fmpp
+++ b/java/vector/src/main/codegen/config.fmpp
@@ -17,6 +17,7 @@
 data: {
     # TODO:  Rename to ~valueVectorModesAndTypes for clarity.
     vv: tdd(../data/ValueVectorTypes.tdd),
+    arrowTypes: tdd(../data/ArrowTypes.tdd)
 }
 
 freemarkerLinks: {

diff --git a/java/vector/src/main/codegen/data/ArrowTypes.tdd b/java/vector/src/main/codegen/data/ArrowTypes.tdd
new file mode 100644
index 0000000000000..4ab7f8562f907
--- /dev/null
+++ b/java/vector/src/main/codegen/data/ArrowTypes.tdd
@@ -0,0 +1,80 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+ +{ + types: [ + { + name: "Null", + fields: [] + }, + { + name: "Tuple", + fields: [] + }, + { + name: "List", + fields: [] + }, + { + name: "Union", + fields: [] + }, + { + name: "Int", + fields: [{name: "bitWidth", type: int}, {name: "isSigned", type: boolean}] + }, + { + name: "FloatingPoint", + fields: [{name: precision, type: int}] + }, + { + name: "Utf8", + fields: [] + }, + { + name: "Binary", + fields: [] + }, + { + name: "Bool", + fields: [] + }, + { + name: "Decimal", + fields: [{name: "precision", type: int}, {name: "scale", type: int}] + }, + { + name: "Date", + fields: [] + }, + { + name: "Time", + fields: [] + }, + { + name: "Timestamp", + fields: [{name: "timezone", type: "String"}] + }, + { + name: "IntervalDay", + fields: [] + }, + { + name: "IntervalYear", + fields: [] + } + ] +} diff --git a/java/vector/src/main/codegen/data/ValueVectorTypes.tdd b/java/vector/src/main/codegen/data/ValueVectorTypes.tdd index e747c30c5d1cb..421dd7ef92e63 100644 --- a/java/vector/src/main/codegen/data/ValueVectorTypes.tdd +++ b/java/vector/src/main/codegen/data/ValueVectorTypes.tdd @@ -17,8 +17,7 @@ { modes: [ {name: "Optional", prefix: "Nullable"}, - {name: "Required", prefix: ""}, - {name: "Repeated", prefix: "Repeated"} + {name: "Required", prefix: ""} ], types: [ { @@ -61,9 +60,8 @@ { class: "Int", valueHolder: "IntHolder"}, { class: "UInt4", valueHolder: "UInt4Holder" }, { class: "Float4", javaType: "float" , boxedType: "Float", fields: [{name: "value", type: "float"}]}, - { class: "Time", javaType: "int", friendlyType: "DateTime" }, { class: "IntervalYear", javaType: "int", friendlyType: "Period" } - { class: "Decimal9", maxPrecisionDigits: 9, friendlyType: "BigDecimal", fields: [{name:"value", type:"int"}, {name: "scale", type: "int", include: false}, {name: "precision", type: "int", include: false}] }, + { class: "Time", javaType: "int", friendlyType: "DateTime" } ] }, { @@ -78,15 +76,11 @@ { class: "Float8", javaType: "double" , boxedType: "Double", fields: [{name: "value", type: "double"}], }, { class: "Date", javaType: "long", friendlyType: "DateTime" }, { class: "TimeStamp", javaType: "long", friendlyType: "DateTime" } - { class: "Decimal18", maxPrecisionDigits: 18, friendlyType: "BigDecimal", fields: [{name:"value", type:"long"}, {name: "scale", type: "int", include: false}, {name: "precision", type: "int", include: false}] }, - <#-- - { class: "Money", maxPrecisionDigits: 2, scale: 1, }, - --> ] }, { major: "Fixed", - width: 12, + width: 8, javaType: "ArrowBuf", boxedType: "ArrowBuf", minor: [ @@ -96,51 +90,11 @@ { major: "Fixed", width: 16, - javaType: "ArrowBuf" - boxedType: "ArrowBuf", - minor: [ - { class: "Interval", daysOffset: 4, millisecondsOffset: 8, friendlyType: "Period", fields: [ {name: "months", type: "int"}, {name: "days", type:"int"}, {name: "milliseconds", type:"int"}] } - ] - }, - { - major: "Fixed", - width: 12, - javaType: "ArrowBuf", - boxedType: "ArrowBuf", - minor: [ - <#-- - { class: "TimeTZ" }, - { class: "Interval" } - --> - { class: "Decimal28Dense", maxPrecisionDigits: 28, nDecimalDigits: 3, friendlyType: "BigDecimal", fields: [{name: "start", type: "int"}, {name: "buffer", type: "ArrowBuf"}, {name: "scale", type: "int", include: false}, {name: "precision", type: "int", include: false}] } - ] - }, - { - major: "Fixed", - width: 16, - javaType: "ArrowBuf", - boxedType: "ArrowBuf", - - minor: [ - { class: "Decimal38Dense", maxPrecisionDigits: 38, nDecimalDigits: 4, friendlyType: "BigDecimal", fields: [{name: "start", type: "int"}, {name: "buffer", 
type: "ArrowBuf"}, {name: "scale", type: "int", include: false}, {name: "precision", type: "int", include: false}] } - ] - }, - { - major: "Fixed", - width: 24, - javaType: "ArrowBuf", - boxedType: "ArrowBuf", - minor: [ - { class: "Decimal38Sparse", maxPrecisionDigits: 38, nDecimalDigits: 6, friendlyType: "BigDecimal", fields: [{name: "start", type: "int"}, {name: "buffer", type: "ArrowBuf"}, {name: "scale", type: "int", include: false}, {name: "precision", type: "int", include: false}] } - ] - }, - { - major: "Fixed", - width: 20, javaType: "ArrowBuf", boxedType: "ArrowBuf", + minor: [ - { class: "Decimal28Sparse", maxPrecisionDigits: 28, nDecimalDigits: 5, friendlyType: "BigDecimal", fields: [{name: "start", type: "int"}, {name: "buffer", type: "ArrowBuf"}, {name: "scale", type: "int", include: false}, {name: "precision", type: "int", include: false}] } + { class: "Decimal", maxPrecisionDigits: 38, nDecimalDigits: 4, friendlyType: "BigDecimal", fields: [{name: "start", type: "int"}, {name: "buffer", type: "ArrowBuf"}, {name: "scale", type: "int", include: false}, {name: "precision", type: "int", include: false}] } ] }, { @@ -151,8 +105,7 @@ fields: [{name: "start", type: "int"}, {name: "end", type: "int"}, {name: "buffer", type: "ArrowBuf"}], minor: [ { class: "VarBinary" , friendlyType: "byte[]" }, - { class: "VarChar" , friendlyType: "Text" }, - { class: "Var16Char" , friendlyType: "String" } + { class: "VarChar" , friendlyType: "Text" } ] }, { diff --git a/java/vector/src/main/codegen/includes/vv_imports.ftl b/java/vector/src/main/codegen/includes/vv_imports.ftl index 2d808b1b3cb3f..9b4b79bfd7b90 100644 --- a/java/vector/src/main/codegen/includes/vv_imports.ftl +++ b/java/vector/src/main/codegen/includes/vv_imports.ftl @@ -17,6 +17,8 @@ import com.google.common.collect.ObjectArrays; import com.google.common.base.Charsets; import com.google.common.collect.ObjectArrays; +import com.google.flatbuffers.FlatBufferBuilder; + import com.google.common.base.Preconditions; import io.netty.buffer.*; @@ -25,6 +27,8 @@ import org.apache.commons.lang3.ArrayUtils; import org.apache.arrow.memory.*; import org.apache.arrow.vector.types.Types; import org.apache.arrow.vector.types.Types.*; +import org.apache.arrow.vector.types.pojo.*; +import org.apache.arrow.vector.types.pojo.ArrowType.*; import org.apache.arrow.vector.types.*; import org.apache.arrow.vector.*; import org.apache.arrow.vector.holders.*; diff --git a/java/vector/src/main/codegen/templates/AbstractFieldReader.java b/java/vector/src/main/codegen/templates/AbstractFieldReader.java index b83dba2879190..e0d0fc9715ba2 100644 --- a/java/vector/src/main/codegen/templates/AbstractFieldReader.java +++ b/java/vector/src/main/codegen/templates/AbstractFieldReader.java @@ -41,7 +41,13 @@ public boolean isSet() { return true; } - <#list ["Object", "BigDecimal", "Integer", "Long", "Boolean", + @Override + public Field getField() { + fail("getField"); + return null; + } + + <#list ["Object", "BigDecimal", "Integer", "Long", "Boolean", "Character", "DateTime", "Period", "Double", "Float", "Text", "String", "Byte", "Short", "byte[]"] as friendlyType> <#assign safeType=friendlyType /> diff --git a/java/vector/src/main/codegen/templates/AbstractFieldWriter.java b/java/vector/src/main/codegen/templates/AbstractFieldWriter.java index 6ee9dad44e929..de076fc46ffb2 100644 --- a/java/vector/src/main/codegen/templates/AbstractFieldWriter.java +++ b/java/vector/src/main/codegen/templates/AbstractFieldWriter.java @@ -31,10 +31,6 @@ */ @SuppressWarnings("unused") 
abstract class AbstractFieldWriter extends AbstractBaseWriter implements FieldWriter { - AbstractFieldWriter(FieldWriter parent) { - super(parent); - } - @Override public void start() { throw new IllegalStateException(String.format("You tried to start when you are using a ValueWriter of type %s.", this.getClass().getSimpleName())); @@ -62,9 +58,15 @@ public void write(${name}Holder holder) { fail("${name}"); } + <#if minor.class == "Decimal"> + public void writeDecimal(int start, ArrowBuf buffer) { + fail("${name}"); + } + <#else> public void write${minor.class}(<#list fields as field>${field.type} ${field.name}<#if field_has_next>, ) { fail("${name}"); } + diff --git a/java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java b/java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java index 549dbf107ea67..7e60320cfb8ac 100644 --- a/java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java +++ b/java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java @@ -37,10 +37,6 @@ */ @SuppressWarnings("unused") abstract class AbstractPromotableFieldWriter extends AbstractFieldWriter { - AbstractPromotableFieldWriter(FieldWriter parent) { - super(parent); - } - /** * Retrieve the FieldWriter, promoting if it is not a FieldWriter of the specified type * @param type diff --git a/java/vector/src/main/codegen/templates/ArrowType.java b/java/vector/src/main/codegen/templates/ArrowType.java new file mode 100644 index 0000000000000..6dfaf216ad042 --- /dev/null +++ b/java/vector/src/main/codegen/templates/ArrowType.java @@ -0,0 +1,129 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +import org.apache.arrow.flatbuf.Field; +import org.apache.arrow.flatbuf.Type; +import org.apache.arrow.vector.types.pojo.ArrowType.Int; + +import java.util.Objects; + +<@pp.dropOutputFile /> +<@pp.changeOutputFile name="/org/apache/arrow/vector/types/pojo/ArrowType.java" /> + + +<#include "/@includes/license.ftl" /> +package org.apache.arrow.vector.types.pojo; + +import com.google.flatbuffers.FlatBufferBuilder; +import org.apache.arrow.flatbuf.Type; + +import java.util.Objects; + +public abstract class ArrowType { + + public abstract byte getTypeType(); + public abstract int getType(FlatBufferBuilder builder); + + + <#list arrowTypes.types as type> + <#assign name = type.name> + <#assign fields = type.fields> + public static class ${name} extends ArrowType { + public static final byte TYPE_TYPE = Type.${name}; + <#if type.fields?size == 0> + public static final ${name} INSTANCE = new ${name}(); + + + <#list fields as field> + <#assign fieldName = field.name> + <#assign fieldType = field.type> + ${fieldType} ${fieldName}; + + + <#if type.fields?size != 0> + public ${type.name}(<#list type.fields as field>${field.type} ${field.name}<#if field_has_next>, ) { + <#list type.fields as field> + this.${field.name} = ${field.name}; + + } + + + @Override + public byte getTypeType() { + return TYPE_TYPE; + } + + @Override + public int getType(FlatBufferBuilder builder) { + org.apache.arrow.flatbuf.${type.name}.start${type.name}(builder); + <#list type.fields as field> + org.apache.arrow.flatbuf.${type.name}.add${field.name?cap_first}(builder, <#if field.type == "String">builder.createString(${field.name})<#else>${field.name}); + + return org.apache.arrow.flatbuf.${type.name}.end${type.name}(builder); + } + + <#list fields as field> + public ${field.type} get${field.name?cap_first}() { + return ${field.name}; + } + + + @Override + public int hashCode() { + return Objects.hash(<#list type.fields as field>${field.name}<#if field_has_next>, ); + } + + @Override + public boolean equals(Object obj) { + if (!(obj instanceof ${type.name})) { + return false; + } + <#if type.fields?size == 0> + return true; + <#else> + ${type.name} that = (${type.name}) obj; + return + <#list type.fields as field>Objects.equals(this.${field.name}, that.${field.name}) <#if field_has_next>&&<#else>; + + + } + } + + + public static org.apache.arrow.vector.types.pojo.ArrowType getTypeForField(org.apache.arrow.flatbuf.Field field) { + switch(field.typeType()) { + <#list arrowTypes.types as type> + <#assign name = type.name> + <#assign nameLower = type.name?lower_case> + <#assign fields = type.fields> + case Type.${type.name}: + org.apache.arrow.flatbuf.${type.name} ${nameLower}Type = (org.apache.arrow.flatbuf.${type.name}) field.type(new org.apache.arrow.flatbuf.${type.name}()); + return new ${type.name}(<#list type.fields as field>${nameLower}Type.${field.name}()<#if field_has_next>, ); + + default: + throw new UnsupportedOperationException("Unsupported type: " + field.typeType()); + } + } + + public static Int getInt(org.apache.arrow.flatbuf.Field field) { + org.apache.arrow.flatbuf.Int intType = (org.apache.arrow.flatbuf.Int) field.type(new org.apache.arrow.flatbuf.Int()); + return new Int(intType.bitWidth(), intType.isSigned()); + } +} + + diff --git a/java/vector/src/main/codegen/templates/BaseReader.java b/java/vector/src/main/codegen/templates/BaseReader.java index 8f12b1da80424..72fea58d0bc9e 100644 --- a/java/vector/src/main/codegen/templates/BaseReader.java +++ 
b/java/vector/src/main/codegen/templates/BaseReader.java @@ -30,8 +30,8 @@ @SuppressWarnings("unused") public interface BaseReader extends Positionable{ - MajorType getType(); - MaterializedField getField(); + Field getField(); + MinorType getMinorType(); void reset(); void read(UnionHolder holder); void read(int index, UnionHolder holder); @@ -60,7 +60,6 @@ public interface RepeatedListReader extends ListReader{ public interface ScalarReader extends <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> ${name}Reader, - <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> Repeated${name}Reader, BaseReader {} interface ComplexReader{ diff --git a/java/vector/src/main/codegen/templates/BaseWriter.java b/java/vector/src/main/codegen/templates/BaseWriter.java index 299b2389bb35c..08bd39eae2358 100644 --- a/java/vector/src/main/codegen/templates/BaseWriter.java +++ b/java/vector/src/main/codegen/templates/BaseWriter.java @@ -31,12 +31,11 @@ */ @SuppressWarnings("unused") public interface BaseWriter extends AutoCloseable, Positionable { - FieldWriter getParent(); int getValueCapacity(); public interface MapWriter extends BaseWriter { - MaterializedField getField(); + Field getField(); /** * Whether this writer is a map writer and is empty (has no children). diff --git a/java/vector/src/main/codegen/templates/BasicTypeHelper.java b/java/vector/src/main/codegen/templates/BasicTypeHelper.java deleted file mode 100644 index 0bae715e35283..0000000000000 --- a/java/vector/src/main/codegen/templates/BasicTypeHelper.java +++ /dev/null @@ -1,539 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -<@pp.dropOutputFile /> -<@pp.changeOutputFile name="/org/apache/arrow/vector/util/BasicTypeHelper.java" /> - -<#include "/@includes/license.ftl" /> - -package org.apache.arrow.vector.util; - -<#include "/@includes/vv_imports.ftl" /> -import org.apache.arrow.vector.complex.UnionVector; -import org.apache.arrow.vector.complex.RepeatedMapVector; -import org.apache.arrow.vector.util.CallBack; - -public class BasicTypeHelper { - static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(BasicTypeHelper.class); - - private static final int WIDTH_ESTIMATE = 50; - - // Default length when casting to varchar : 65536 = 2^16 - // This only defines an absolute maximum for values, setting - // a high value like this will not inflate the size for small values - public static final int VARCHAR_DEFAULT_CAST_LEN = 65536; - - protected static String buildErrorMessage(final String operation, final MinorType type, final DataMode mode) { - return String.format("Unable to %s for minor type [%s] and mode [%s]", operation, type, mode); - } - - protected static String buildErrorMessage(final String operation, final MajorType type) { - return buildErrorMessage(operation, type.getMinorType(), type.getMode()); - } - - public static int getSize(MajorType major) { - switch (major.getMinorType()) { -<#list vv.types as type> - <#list type.minor as minor> - case ${minor.class?upper_case}: - return ${type.width}<#if minor.class?substring(0, 3) == "Var" || - minor.class?substring(0, 3) == "PRO" || - minor.class?substring(0, 3) == "MSG"> + WIDTH_ESTIMATE; - - -// case FIXEDCHAR: return major.getWidth(); -// case FIXED16CHAR: return major.getWidth(); -// case FIXEDBINARY: return major.getWidth(); - } - throw new UnsupportedOperationException(buildErrorMessage("get size", major)); - } - - public static ValueVector getNewVector(String name, BufferAllocator allocator, MajorType type, CallBack callback){ - MaterializedField field = MaterializedField.create(name, type); - return getNewVector(field, allocator, callback); - } - - - public static Class getValueVectorClass(MinorType type, DataMode mode){ - switch (type) { - case UNION: - return UnionVector.class; - case MAP: - switch (mode) { - case OPTIONAL: - case REQUIRED: - return MapVector.class; - case REPEATED: - return RepeatedMapVector.class; - } - - case LIST: - switch (mode) { - case REPEATED: - return RepeatedListVector.class; - case REQUIRED: - case OPTIONAL: - return ListVector.class; - } - -<#list vv.types as type> - <#list type.minor as minor> - case ${minor.class?upper_case}: - switch (mode) { - case REQUIRED: - return ${minor.class}Vector.class; - case OPTIONAL: - return Nullable${minor.class}Vector.class; - case REPEATED: - return Repeated${minor.class}Vector.class; - } - - - case GENERIC_OBJECT : - return ObjectVector.class ; - default: - break; - } - throw new UnsupportedOperationException(buildErrorMessage("get value vector class", type, mode)); - } - public static Class getReaderClassName( MinorType type, DataMode mode, boolean isSingularRepeated){ - switch (type) { - case MAP: - switch (mode) { - case REQUIRED: - if (!isSingularRepeated) - return SingleMapReaderImpl.class; - else - return SingleLikeRepeatedMapReaderImpl.class; - case REPEATED: - return RepeatedMapReaderImpl.class; - } - case LIST: - switch (mode) { - case REQUIRED: - return SingleListReaderImpl.class; - case REPEATED: - return RepeatedListReaderImpl.class; - } - -<#list vv.types as type> - <#list type.minor as minor> - case ${minor.class?upper_case}: - switch (mode) { - 
case REQUIRED: - return ${minor.class}ReaderImpl.class; - case OPTIONAL: - return Nullable${minor.class}ReaderImpl.class; - case REPEATED: - return Repeated${minor.class}ReaderImpl.class; - } - - - default: - break; - } - throw new UnsupportedOperationException(buildErrorMessage("get reader class name", type, mode)); - } - - public static Class getWriterInterface( MinorType type, DataMode mode){ - switch (type) { - case UNION: return UnionWriter.class; - case MAP: return MapWriter.class; - case LIST: return ListWriter.class; -<#list vv.types as type> - <#list type.minor as minor> - case ${minor.class?upper_case}: return ${minor.class}Writer.class; - - - default: - break; - } - throw new UnsupportedOperationException(buildErrorMessage("get writer interface", type, mode)); - } - - public static Class getWriterImpl( MinorType type, DataMode mode){ - switch (type) { - case UNION: - return UnionWriter.class; - case MAP: - switch (mode) { - case REQUIRED: - case OPTIONAL: - return SingleMapWriter.class; - case REPEATED: - return RepeatedMapWriter.class; - } - case LIST: - switch (mode) { - case REQUIRED: - case OPTIONAL: - return UnionListWriter.class; - case REPEATED: - return RepeatedListWriter.class; - } - -<#list vv.types as type> - <#list type.minor as minor> - case ${minor.class?upper_case}: - switch (mode) { - case REQUIRED: - return ${minor.class}WriterImpl.class; - case OPTIONAL: - return Nullable${minor.class}WriterImpl.class; - case REPEATED: - return Repeated${minor.class}WriterImpl.class; - } - - - default: - break; - } - throw new UnsupportedOperationException(buildErrorMessage("get writer implementation", type, mode)); - } - - public static Class getHolderReaderImpl( MinorType type, DataMode mode){ - switch (type) { -<#list vv.types as type> - <#list type.minor as minor> - case ${minor.class?upper_case}: - switch (mode) { - case REQUIRED: - return ${minor.class}HolderReaderImpl.class; - case OPTIONAL: - return Nullable${minor.class}HolderReaderImpl.class; - case REPEATED: - return Repeated${minor.class}HolderReaderImpl.class; - } - - - default: - break; - } - throw new UnsupportedOperationException(buildErrorMessage("get holder reader implementation", type, mode)); - } - - public static ValueVector getNewVector(MaterializedField field, BufferAllocator allocator){ - return getNewVector(field, allocator, null); - } - public static ValueVector getNewVector(MaterializedField field, BufferAllocator allocator, CallBack callBack){ - field = field.clone(); - MajorType type = field.getType(); - - switch (type.getMinorType()) { - - case UNION: - return new UnionVector(field, allocator, callBack); - - case MAP: - switch (type.getMode()) { - case REQUIRED: - case OPTIONAL: - return new MapVector(field, allocator, callBack); - case REPEATED: - return new RepeatedMapVector(field, allocator, callBack); - } - case LIST: - switch (type.getMode()) { - case REPEATED: - return new RepeatedListVector(field, allocator, callBack); - case OPTIONAL: - case REQUIRED: - return new ListVector(field, allocator, callBack); - } -<#list vv. types as type> - <#list type.minor as minor> - case ${minor.class?upper_case}: - switch (type.getMode()) { - case REQUIRED: - return new ${minor.class}Vector(field, allocator); - case OPTIONAL: - return new Nullable${minor.class}Vector(field, allocator); - case REPEATED: - return new Repeated${minor.class}Vector(field, allocator); - } - - - case GENERIC_OBJECT: - return new ObjectVector(field, allocator) ; - default: - break; - } - // All ValueVector types have been handled. 
- throw new UnsupportedOperationException(buildErrorMessage("get new vector", type)); - } - - public static ValueHolder getValue(ValueVector vector, int index) { - MajorType type = vector.getField().getType(); - ValueHolder holder; - switch(type.getMinorType()) { -<#list vv.types as type> - <#list type.minor as minor> - case ${minor.class?upper_case} : - <#if minor.class?starts_with("Var") || minor.class == "IntervalDay" || minor.class == "Interval" || - minor.class?starts_with("Decimal28") || minor.class?starts_with("Decimal38")> - switch (type.getMode()) { - case REQUIRED: - holder = new ${minor.class}Holder(); - ((${minor.class}Vector) vector).getAccessor().get(index, (${minor.class}Holder)holder); - return holder; - case OPTIONAL: - holder = new Nullable${minor.class}Holder(); - ((Nullable${minor.class}Holder)holder).isSet = ((Nullable${minor.class}Vector) vector).getAccessor().isSet(index); - if (((Nullable${minor.class}Holder)holder).isSet == 1) { - ((Nullable${minor.class}Vector) vector).getAccessor().get(index, (Nullable${minor.class}Holder)holder); - } - return holder; - } - <#else> - switch (type.getMode()) { - case REQUIRED: - holder = new ${minor.class}Holder(); - ((${minor.class}Holder)holder).value = ((${minor.class}Vector) vector).getAccessor().get(index); - return holder; - case OPTIONAL: - holder = new Nullable${minor.class}Holder(); - ((Nullable${minor.class}Holder)holder).isSet = ((Nullable${minor.class}Vector) vector).getAccessor().isSet(index); - if (((Nullable${minor.class}Holder)holder).isSet == 1) { - ((Nullable${minor.class}Holder)holder).value = ((Nullable${minor.class}Vector) vector).getAccessor().get(index); - } - return holder; - } - - - - case GENERIC_OBJECT: - holder = new ObjectHolder(); - ((ObjectHolder)holder).obj = ((ObjectVector) vector).getAccessor().getObject(index) ; - break; - } - - throw new UnsupportedOperationException(buildErrorMessage("get value", type)); - } - - public static void setValue(ValueVector vector, int index, ValueHolder holder) { - MajorType type = vector.getField().getType(); - - switch(type.getMinorType()) { -<#list vv.types as type> - <#list type.minor as minor> - case ${minor.class?upper_case} : - switch (type.getMode()) { - case REQUIRED: - ((${minor.class}Vector) vector).getMutator().setSafe(index, (${minor.class}Holder) holder); - return; - case OPTIONAL: - if (((Nullable${minor.class}Holder) holder).isSet == 1) { - ((Nullable${minor.class}Vector) vector).getMutator().setSafe(index, (Nullable${minor.class}Holder) holder); - } - return; - } - - - case GENERIC_OBJECT: - ((ObjectVector) vector).getMutator().setSafe(index, (ObjectHolder) holder); - return; - default: - throw new UnsupportedOperationException(buildErrorMessage("set value", type)); - } - } - - public static void setValueSafe(ValueVector vector, int index, ValueHolder holder) { - MajorType type = vector.getField().getType(); - - switch(type.getMinorType()) { - <#list vv.types as type> - <#list type.minor as minor> - case ${minor.class?upper_case} : - switch (type.getMode()) { - case REQUIRED: - ((${minor.class}Vector) vector).getMutator().setSafe(index, (${minor.class}Holder) holder); - return; - case OPTIONAL: - if (((Nullable${minor.class}Holder) holder).isSet == 1) { - ((Nullable${minor.class}Vector) vector).getMutator().setSafe(index, (Nullable${minor.class}Holder) holder); - } else { - ((Nullable${minor.class}Vector) vector).getMutator().isSafe(index); - } - return; - } - - - case GENERIC_OBJECT: - ((ObjectVector) vector).getMutator().setSafe(index, 
(ObjectHolder) holder); - default: - throw new UnsupportedOperationException(buildErrorMessage("set value safe", type)); - } - } - - public static boolean compareValues(ValueVector v1, int v1index, ValueVector v2, int v2index) { - MajorType type1 = v1.getField().getType(); - MajorType type2 = v2.getField().getType(); - - if (type1.getMinorType() != type2.getMinorType()) { - return false; - } - - switch(type1.getMinorType()) { -<#list vv.types as type> - <#list type.minor as minor> - case ${minor.class?upper_case} : - if ( ((${minor.class}Vector) v1).getAccessor().get(v1index) == - ((${minor.class}Vector) v2).getAccessor().get(v2index) ) - return true; - break; - - - default: - break; - } - return false; - } - - /** - * Create a ValueHolder of MajorType. - * @param type - * @return - */ - public static ValueHolder createValueHolder(MajorType type) { - switch(type.getMinorType()) { - <#list vv.types as type> - <#list type.minor as minor> - case ${minor.class?upper_case} : - - switch (type.getMode()) { - case REQUIRED: - return new ${minor.class}Holder(); - case OPTIONAL: - return new Nullable${minor.class}Holder(); - case REPEATED: - return new Repeated${minor.class}Holder(); - } - - - case GENERIC_OBJECT: - return new ObjectHolder(); - default: - throw new UnsupportedOperationException(buildErrorMessage("create value holder", type)); - } - } - - public static boolean isNull(ValueHolder holder) { - MajorType type = getValueHolderType(holder); - - switch(type.getMinorType()) { - <#list vv.types as type> - <#list type.minor as minor> - case ${minor.class?upper_case} : - - switch (type.getMode()) { - case REQUIRED: - return true; - case OPTIONAL: - return ((Nullable${minor.class}Holder) holder).isSet == 0; - case REPEATED: - return true; - } - - - default: - throw new UnsupportedOperationException(buildErrorMessage("check is null", type)); - } - } - - public static ValueHolder deNullify(ValueHolder holder) { - MajorType type = getValueHolderType(holder); - - switch(type.getMinorType()) { - <#list vv.types as type> - <#list type.minor as minor> - case ${minor.class?upper_case} : - - switch (type.getMode()) { - case REQUIRED: - return holder; - case OPTIONAL: - if( ((Nullable${minor.class}Holder) holder).isSet == 1) { - ${minor.class}Holder newHolder = new ${minor.class}Holder(); - - <#assign fields = minor.fields!type.fields /> - <#list fields as field> - newHolder.${field.name} = ((Nullable${minor.class}Holder) holder).${field.name}; - - - return newHolder; - } else { - throw new UnsupportedOperationException("You can not convert a null value into a non-null value!"); - } - case REPEATED: - return holder; - } - - - default: - throw new UnsupportedOperationException(buildErrorMessage("deNullify", type)); - } - } - - public static ValueHolder nullify(ValueHolder holder) { - MajorType type = getValueHolderType(holder); - - switch(type.getMinorType()) { - <#list vv.types as type> - <#list type.minor as minor> - case ${minor.class?upper_case} : - switch (type.getMode()) { - case REQUIRED: - Nullable${minor.class}Holder newHolder = new Nullable${minor.class}Holder(); - newHolder.isSet = 1; - <#assign fields = minor.fields!type.fields /> - <#list fields as field> - newHolder.${field.name} = ((${minor.class}Holder) holder).${field.name}; - - return newHolder; - case OPTIONAL: - return holder; - case REPEATED: - throw new UnsupportedOperationException("You can not convert repeated type " + type + " to nullable type!"); - } - - - default: - throw new 
UnsupportedOperationException(buildErrorMessage("nullify", type)); - } - } - - public static MajorType getValueHolderType(ValueHolder holder) { - - if (0 == 1) { - return null; - } - <#list vv.types as type> - <#list type.minor as minor> - else if (holder instanceof ${minor.class}Holder) { - return ((${minor.class}Holder) holder).TYPE; - } else if (holder instanceof Nullable${minor.class}Holder) { - return ((Nullable${minor.class}Holder) holder).TYPE; - } - - - - throw new UnsupportedOperationException("ValueHolder is not supported for 'getValueHolderType' method."); - - } - -} diff --git a/java/vector/src/main/codegen/templates/ComplexCopier.java b/java/vector/src/main/codegen/templates/ComplexCopier.java index 3614231c8342e..a5756a47ad785 100644 --- a/java/vector/src/main/codegen/templates/ComplexCopier.java +++ b/java/vector/src/main/codegen/templates/ComplexCopier.java @@ -42,13 +42,7 @@ public static void copy(FieldReader input, FieldWriter output) { } private static void writeValue(FieldReader reader, FieldWriter writer) { - final DataMode m = reader.getType().getMode(); - final MinorType mt = reader.getType().getMinorType(); - - switch(m){ - case OPTIONAL: - case REQUIRED: - + final MinorType mt = reader.getMinorType(); switch (mt) { @@ -89,12 +83,10 @@ private static void writeValue(FieldReader reader, FieldWriter writer) { } - break; - } } private static FieldWriter getMapWriterForReader(FieldReader reader, MapWriter writer, String name) { - switch (reader.getType().getMinorType()) { + switch (reader.getMinorType()) { <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> <#assign fields = minor.fields!type.fields /> <#assign uncappedName = name?uncap_first/> @@ -108,12 +100,12 @@ private static FieldWriter getMapWriterForReader(FieldReader reader, MapWriter w case LIST: return (FieldWriter) writer.list(name); default: - throw new UnsupportedOperationException(reader.getType().toString()); + throw new UnsupportedOperationException(reader.getMinorType().toString()); } } private static FieldWriter getListWriterForReader(FieldReader reader, ListWriter writer) { - switch (reader.getType().getMinorType()) { + switch (reader.getMinorType()) { <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> <#assign fields = minor.fields!type.fields /> <#assign uncappedName = name?uncap_first/> @@ -127,7 +119,7 @@ private static FieldWriter getListWriterForReader(FieldReader reader, ListWriter case LIST: return (FieldWriter) writer.list(); default: - throw new UnsupportedOperationException(reader.getType().toString()); + throw new UnsupportedOperationException(reader.getMinorType().toString()); } } } diff --git a/java/vector/src/main/codegen/templates/ComplexReaders.java b/java/vector/src/main/codegen/templates/ComplexReaders.java index 34c657126015e..74a19a605e21e 100644 --- a/java/vector/src/main/codegen/templates/ComplexReaders.java +++ b/java/vector/src/main/codegen/templates/ComplexReaders.java @@ -27,10 +27,10 @@ <@pp.dropOutputFile /> <#list vv.types as type> <#list type.minor as minor> -<#list ["", "Repeated"] as mode> +<#list [""] as mode> <#assign lowerName = minor.class?uncap_first /> <#if lowerName == "int" ><#assign lowerName = "integer" /> -<#assign name = mode + minor.class?cap_first /> +<#assign name = minor.class?cap_first /> <#assign javaType = (minor.javaType!type.javaType) /> <#assign friendlyType = (minor.friendlyType!minor.boxedType!type.boxedType) /> <#assign safeType=friendlyType /> @@ -38,9 +38,9 @@ 
<#assign hasFriendly = minor.friendlyType!"no" == "no" /> -<#list ["", "Nullable"] as nullMode> -<#if (mode == "Repeated" && nullMode == "") || mode == "" > -<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/${nullMode}${name}ReaderImpl.java" /> +<#list ["Nullable"] as nullMode> +<#if mode == "" > +<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/${name}ReaderImpl.java" /> <#include "/@includes/license.ftl" /> package org.apache.arrow.vector.complex.impl; @@ -48,20 +48,20 @@ <#include "/@includes/vv_imports.ftl" /> @SuppressWarnings("unused") -public class ${nullMode}${name}ReaderImpl extends AbstractFieldReader { +public class ${name}ReaderImpl extends AbstractFieldReader { private final ${nullMode}${name}Vector vector; - public ${nullMode}${name}ReaderImpl(${nullMode}${name}Vector vector){ + public ${name}ReaderImpl(${nullMode}${name}Vector vector){ super(); this.vector = vector; } - public MajorType getType(){ - return vector.getField().getType(); + public MinorType getMinorType(){ + return vector.getMinorType(); } - public MaterializedField getField(){ + public Field getField(){ return vector.getField(); } @@ -73,50 +73,13 @@ public boolean isSet(){ } - - - - <#if mode == "Repeated"> - - public void copyAsValue(${minor.class?cap_first}Writer writer){ - Repeated${minor.class?cap_first}WriterImpl impl = (Repeated${minor.class?cap_first}WriterImpl) writer; - impl.vector.copyFromSafe(idx(), impl.idx(), vector); - } - - public void copyAsField(String name, MapWriter writer){ - Repeated${minor.class?cap_first}WriterImpl impl = (Repeated${minor.class?cap_first}WriterImpl) writer.list(name).${lowerName}(); - impl.vector.copyFromSafe(idx(), impl.idx(), vector); - } - - public int size(){ - return vector.getAccessor().getInnerValueCountAt(idx()); - } - - public void read(int arrayIndex, ${minor.class?cap_first}Holder h){ - vector.getAccessor().get(idx(), arrayIndex, h); - } - public void read(int arrayIndex, Nullable${minor.class?cap_first}Holder h){ - vector.getAccessor().get(idx(), arrayIndex, h); - } - - public ${friendlyType} read${safeType}(int arrayIndex){ - return vector.getAccessor().getSingleObject(idx(), arrayIndex); - } - - - public List readObject(){ - return (List) (Object) vector.getAccessor().getObject(idx()); - } - - <#else> - public void copyAsValue(${minor.class?cap_first}Writer writer){ - ${nullMode}${minor.class?cap_first}WriterImpl impl = (${nullMode}${minor.class?cap_first}WriterImpl) writer; + ${minor.class?cap_first}WriterImpl impl = (${minor.class?cap_first}WriterImpl) writer; impl.vector.copyFromSafe(idx(), impl.idx(), vector); } public void copyAsField(String name, MapWriter writer){ - ${nullMode}${minor.class?cap_first}WriterImpl impl = (${nullMode}${minor.class?cap_first}WriterImpl) writer.${lowerName}(name); + ${minor.class?cap_first}WriterImpl impl = (${minor.class?cap_first}WriterImpl) writer.${lowerName}(name); impl.vector.copyFromSafe(idx(), impl.idx(), vector); } @@ -141,9 +104,6 @@ public void copyValue(FieldWriter w){ public Object readObject(){ return vector.getAccessor().getObject(idx()); } - - - } @@ -156,18 +116,10 @@ public Object readObject(){ @SuppressWarnings("unused") public interface ${name}Reader extends BaseReader{ - <#if mode == "Repeated"> - public int size(); - public void read(int arrayIndex, ${minor.class?cap_first}Holder h); - public void read(int arrayIndex, Nullable${minor.class?cap_first}Holder h); - public Object readObject(int arrayIndex); - public ${friendlyType} read${safeType}(int arrayIndex); - 
<#else> public void read(${minor.class?cap_first}Holder h); public void read(Nullable${minor.class?cap_first}Holder h); public Object readObject(); public ${friendlyType} read${safeType}(); - public boolean isSet(); public void copyAsValue(${minor.class}Writer writer); public void copyAsField(String name, ${minor.class}Writer writer); diff --git a/java/vector/src/main/codegen/templates/ComplexWriters.java b/java/vector/src/main/codegen/templates/ComplexWriters.java index 8f9a6e7b97117..3457545cea5d7 100644 --- a/java/vector/src/main/codegen/templates/ComplexWriters.java +++ b/java/vector/src/main/codegen/templates/ComplexWriters.java @@ -19,8 +19,8 @@ <@pp.dropOutputFile /> <#list vv.types as type> <#list type.minor as minor> -<#list ["", "Nullable", "Repeated"] as mode> -<#assign name = mode + minor.class?cap_first /> +<#list ["Nullable"] as mode> +<#assign name = minor.class?cap_first /> <#assign eName = name /> <#assign javaType = (minor.javaType!type.javaType) /> <#assign fields = minor.fields!type.fields /> @@ -38,17 +38,16 @@ @SuppressWarnings("unused") public class ${eName}WriterImpl extends AbstractFieldWriter { - private final ${name}Vector.Mutator mutator; - final ${name}Vector vector; + private final Nullable${name}Vector.Mutator mutator; + final Nullable${name}Vector vector; - public ${eName}WriterImpl(${name}Vector vector, AbstractFieldWriter parent) { - super(parent); + public ${eName}WriterImpl(Nullable${name}Vector vector) { this.mutator = vector.getMutator(); this.vector = vector; } @Override - public MaterializedField getField() { + public Field getField() { return vector.getField(); } @@ -89,12 +88,10 @@ public void write(Nullable${minor.class?cap_first}Holder h) { vector.getMutator().setValueCount(idx()+1); } - <#if !(minor.class == "Decimal9" || minor.class == "Decimal18" || minor.class == "Decimal28Sparse" || minor.class == "Decimal38Sparse" || minor.class == "Decimal28Dense" || minor.class == "Decimal38Dense")> public void write${minor.class}(<#list fields as field>${field.type} ${field.name}<#if field_has_next>, ) { mutator.addSafe(idx(), <#list fields as field>${field.name}<#if field_has_next>, ); vector.getMutator().setValueCount(idx()+1); } - public void setPosition(int idx) { super.setPosition(idx); @@ -114,11 +111,17 @@ public void write(Nullable${minor.class}Holder h) { vector.getMutator().setValueCount(idx()+1); } - <#if !(minor.class == "Decimal9" || minor.class == "Decimal18" || minor.class == "Decimal28Sparse" || minor.class == "Decimal38Sparse" || minor.class == "Decimal28Dense" || minor.class == "Decimal38Dense")> + <#if minor.class == "Decimal"> + public void writeDecimal(int start, ArrowBuf buffer) { + mutator.setSafe(idx(), 1, start, buffer); + vector.getMutator().setValueCount(idx()+1); + } + <#else> public void write${minor.class}(<#list fields as field>${field.type} ${field.name}<#if field_has_next>, ) { - mutator.setSafe(idx(), <#if mode == "Nullable">1, <#list fields as field>${field.name}<#if field_has_next>, ); + mutator.setSafe(idx()<#if mode == "Nullable">, 1<#list fields as field><#if field.include!true >, ${field.name}); vector.getMutator().setValueCount(idx()+1); } + <#if mode == "Nullable"> @@ -128,7 +131,6 @@ public void writeNull() { } - } <@pp.changeOutputFile name="/org/apache/arrow/vector/complex/writer/${eName}Writer.java" /> @@ -141,7 +143,9 @@ public void writeNull() { public interface ${eName}Writer extends BaseWriter { public void write(${minor.class}Holder h); - <#if !(minor.class == "Decimal9" || minor.class == "Decimal18" || 
minor.class == "Decimal28Sparse" || minor.class == "Decimal38Sparse" || minor.class == "Decimal28Dense" || minor.class == "Decimal38Dense")> + <#if minor.class == "Decimal"> + public void writeDecimal(int start, ArrowBuf buffer); + <#else> public void write${minor.class}(<#list fields as field>${field.type} ${field.name}<#if field_has_next>, ); } diff --git a/java/vector/src/main/codegen/templates/FixedValueVectors.java b/java/vector/src/main/codegen/templates/FixedValueVectors.java index 18fcac93bb6f0..fe2b5c5b5bc92 100644 --- a/java/vector/src/main/codegen/templates/FixedValueVectors.java +++ b/java/vector/src/main/codegen/templates/FixedValueVectors.java @@ -43,20 +43,42 @@ public final class ${minor.class}Vector extends BaseDataValueVector implements FixedWidthVector{ private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(${minor.class}Vector.class); - private final FieldReader reader = new ${minor.class}ReaderImpl(${minor.class}Vector.this); private final Accessor accessor = new Accessor(); private final Mutator mutator = new Mutator(); private int allocationSizeInBytes = INITIAL_VALUE_ALLOCATION * ${type.width}; private int allocationMonitor = 0; - public ${minor.class}Vector(MaterializedField field, BufferAllocator allocator) { - super(field, allocator); + <#if minor.class == "Decimal"> + + private int precision; + private int scale; + + public ${minor.class}Vector(String name, BufferAllocator allocator, int precision, int scale) { + super(name, allocator); + this.precision = precision; + this.scale = scale; + } + <#else> + public ${minor.class}Vector(String name, BufferAllocator allocator) { + super(name, allocator); + } + + + + @Override + public MinorType getMinorType() { + return MinorType.${minor.class?upper_case}; + } + + @Override + public Field getField() { + throw new UnsupportedOperationException("internal vector"); } @Override public FieldReader getReader(){ - return reader; + throw new UnsupportedOperationException("non-nullable vectors cannot be used in readers"); } @Override @@ -162,7 +184,7 @@ public void reAlloc() { throw new OversizedAllocationException("Unable to expand the buffer. Max allowed buffer size is reached."); } - logger.debug("Reallocating vector [{}]. # of bytes: [{}] -> [{}]", field, allocationSizeInBytes, newAllocationSize); + logger.debug("Reallocating vector [{}]. 
# of bytes: [{}] -> [{}]", name, allocationSizeInBytes, newAllocationSize); final ArrowBuf newBuf = allocator.buffer((int)newAllocationSize); newBuf.setBytes(0, data, 0, data.capacity()); final int halfNewCapacity = newBuf.capacity() / 2; @@ -181,30 +203,13 @@ public void zeroVector() { data.setZero(0, data.capacity()); } -// @Override -// public void load(SerializedField metadata, ArrowBuf buffer) { -// Preconditions.checkArgument(this.field.getPath().equals(metadata.getNamePart().getName()), "The field %s doesn't match the provided metadata %s.", this.field, metadata); -// final int actualLength = metadata.getBufferLength(); -// final int valueCount = metadata.getValueCount(); -// final int expectedLength = valueCount * ${type.width}; -// assert actualLength == expectedLength : String.format("Expected to load %d bytes but actually loaded %d bytes", expectedLength, actualLength); -// -// clear(); -// if (data != null) { -// data.release(1); -// } -// data = buffer.slice(0, actualLength); -// data.retain(1); -// data.writerIndex(actualLength); -// } - public TransferPair getTransferPair(BufferAllocator allocator){ - return new TransferImpl(getField(), allocator); + return new TransferImpl(name, allocator); } @Override public TransferPair getTransferPair(String ref, BufferAllocator allocator){ - return new TransferImpl(getField().withPath(ref), allocator); + return new TransferImpl(ref, allocator); } @Override @@ -230,8 +235,12 @@ public void splitAndTransferTo(int startIndex, int length, ${minor.class}Vector private class TransferImpl implements TransferPair{ private ${minor.class}Vector to; - public TransferImpl(MaterializedField field, BufferAllocator allocator){ - to = new ${minor.class}Vector(field, allocator); + public TransferImpl(String name, BufferAllocator allocator){ + <#if minor.class == "Decimal"> + to = new ${minor.class}Vector(name, allocator, precision, scale); + <#else> + to = new ${minor.class}Vector(name, allocator); + } public TransferImpl(${minor.class}Vector to) { @@ -260,7 +269,7 @@ public void copyValueSafe(int fromIndex, int toIndex) { } public void copyFrom(int fromIndex, int thisIndex, ${minor.class}Vector from){ - <#if (type.width > 8)> + <#if (type.width > 8 || minor.class == "IntervalDay")> from.data.getBytes(fromIndex * ${type.width}, data, thisIndex * ${type.width}, ${type.width}); <#else> <#-- type.width <= 8 --> data.set${(minor.javaType!type.javaType)?cap_first}(thisIndex * ${type.width}, @@ -298,7 +307,7 @@ public boolean isNull(int index){ return false; } - <#if (type.width > 8)> + <#if (type.width > 8 || minor.class == "IntervalDay")> public ${minor.javaType!type.javaType} get(int index) { return data.slice(index * ${type.width}, ${type.width}); @@ -416,31 +425,30 @@ public StringBuilder getAsStringBuilder(int index) { append(millis)); } - <#elseif (minor.class == "Decimal28Sparse") || (minor.class == "Decimal38Sparse") || (minor.class == "Decimal28Dense") || (minor.class == "Decimal38Dense")> + <#elseif minor.class == "Decimal"> public void get(int index, ${minor.class}Holder holder) { holder.start = index * ${type.width}; holder.buffer = data; - holder.scale = getField().getScale(); - holder.precision = getField().getPrecision(); + holder.scale = scale; + holder.precision = precision; } public void get(int index, Nullable${minor.class}Holder holder) { holder.isSet = 1; holder.start = index * ${type.width}; holder.buffer = data; - holder.scale = getField().getScale(); - holder.precision = getField().getPrecision(); + holder.scale = scale; + 
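A sketch of the byte-level decoding the rewritten Decimal accessor performs (a plain byte[] stands in for the ArrowBuf, and the fixed 16-byte width is an assumption standing in for ${type.width}):

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Decode one decimal slot: the unscaled value is stored as big-endian
// two's complement, and the vector-level scale places the decimal point.
final class DecimalDecodeDemo {
  static BigDecimal decode(byte[] buffer, int index, int scale) {
    byte[] bytes = new byte[16];
    System.arraycopy(buffer, index * 16, bytes, 0, 16);
    // BigInteger(byte[]) reads big-endian two's complement;
    // the result is unscaled * 10^(-scale).
    return new BigDecimal(new BigInteger(bytes), scale);
  }

  public static void main(String[] args) {
    byte[] buf = new byte[16];
    buf[15] = 123; // unscaled value 123
    System.out.println(decode(buf, 0, 2)); // 1.23
  }
}
```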
holder.precision = precision; } - @Override - public ${friendlyType} getObject(int index) { - <#if (minor.class == "Decimal28Sparse") || (minor.class == "Decimal38Sparse")> - // Get the BigDecimal object - return org.apache.arrow.vector.util.DecimalUtility.getBigDecimalFromSparse(data, index * ${type.width}, ${minor.nDecimalDigits}, getField().getScale()); - <#else> - return org.apache.arrow.vector.util.DecimalUtility.getBigDecimalFromDense(data, index * ${type.width}, ${minor.nDecimalDigits}, getField().getScale(), ${minor.maxPrecisionDigits}, ${type.width}); - + @Override + public ${friendlyType} getObject(int index) { + byte[] bytes = new byte[${type.width}]; + int start = ${type.width} * index; + data.getBytes(start, bytes, 0, ${type.width}); + ${friendlyType} value = new BigDecimal(new BigInteger(bytes), scale); + return value; } <#else> @@ -581,7 +589,7 @@ public final class Mutator extends BaseDataValueVector.BaseMutator { * @param index position of the bit to set * @param value value to set */ - <#if (type.width > 8)> + <#if (type.width > 8) || minor.class == "IntervalDay"> public void set(int index, <#if (type.width > 4)>${minor.javaType!type.javaType}<#else>int value) { data.setBytes(index * ${type.width}, value, 0, ${type.width}); } @@ -653,7 +661,7 @@ public void setSafe(int index, Nullable${minor.class}Holder holder){ setSafe(index, holder.days, holder.milliseconds); } - <#elseif (minor.class == "Decimal28Sparse" || minor.class == "Decimal38Sparse") || (minor.class == "Decimal28Dense") || (minor.class == "Decimal38Dense")> + <#elseif minor.class == "Decimal"> public void set(int index, ${minor.class}Holder holder){ set(index, holder.start, holder.buffer); diff --git a/java/vector/src/main/codegen/templates/HolderReaderImpl.java b/java/vector/src/main/codegen/templates/HolderReaderImpl.java index 3005fca0385aa..1ed9287b00eec 100644 --- a/java/vector/src/main/codegen/templates/HolderReaderImpl.java +++ b/java/vector/src/main/codegen/templates/HolderReaderImpl.java @@ -19,9 +19,8 @@ <@pp.dropOutputFile /> <#list vv.types as type> <#list type.minor as minor> -<#list ["", "Nullable", "Repeated"] as holderMode> +<#list ["", "Nullable"] as holderMode> <#assign nullMode = holderMode /> -<#if holderMode == "Repeated"><#assign nullMode = "Nullable" /> <#assign lowerName = minor.class?uncap_first /> <#if lowerName == "int" ><#assign lowerName = "integer" /> @@ -50,42 +49,18 @@ public class ${holderMode}${name}HolderReaderImpl extends AbstractFieldReader { private ${nullMode}${name}Holder holder; -<#if holderMode == "Repeated" > - private int index = -1; - private ${holderMode}${name}Holder repeatedHolder; - - public ${holderMode}${name}HolderReaderImpl(${holderMode}${name}Holder holder) { -<#if holderMode == "Repeated" > - this.holder = new ${nullMode}${name}Holder(); - this.repeatedHolder = holder; -<#else> this.holder = holder; - } @Override public int size() { -<#if holderMode == "Repeated"> - return repeatedHolder.end - repeatedHolder.start; -<#else> throw new UnsupportedOperationException("You can't call size on a Holder value reader."); - } @Override public boolean next() { -<#if holderMode == "Repeated"> - if(index + 1 < repeatedHolder.end) { - index++; - repeatedHolder.vector.getAccessor().get(repeatedHolder.start + index, holder); - return true; - } else { - return false; - } -<#else> throw new UnsupportedOperationException("You can't call next on a single value reader."); - } @@ -95,19 +70,13 @@ public void setPosition(int index) { } @Override - public MajorType getType() { 
-<#if holderMode == "Repeated"> - return this.repeatedHolder.TYPE; -<#else> - return this.holder.TYPE; - + public MinorType getMinorType() { + return MinorType.${name?upper_case}; } @Override public boolean isSet() { - <#if holderMode == "Repeated"> - return this.repeatedHolder.end!=this.repeatedHolder.start; - <#elseif nullMode == "Nullable"> + <#if holderMode == "Nullable"> return this.holder.isSet == 1; <#else> return true; @@ -115,7 +84,6 @@ public boolean isSet() { } -<#if holderMode != "Repeated"> @Override public void read(${name}Holder h) { <#list fields as field> @@ -130,19 +98,7 @@ public void read(Nullable${name}Holder h) { h.isSet = isSet() ? 1 : 0; } - -<#if holderMode == "Repeated"> - @Override - public ${friendlyType} read${safeType}(int index){ - repeatedHolder.vector.getAccessor().get(repeatedHolder.start + index, holder); - ${friendlyType} value = read${safeType}(); - if (this.index > -1) { - repeatedHolder.vector.getAccessor().get(repeatedHolder.start + this.index, holder); - } - return value; - } - @Override public ${friendlyType} read${safeType}(){ @@ -176,29 +132,10 @@ public void read(Nullable${name}Holder h) { Period p = new Period(); return p.plusDays(holder.days).plusMillis(holder.milliseconds); -<#elseif minor.class == "Decimal9" || - minor.class == "Decimal18" > - BigInteger value = BigInteger.valueOf(holder.value); - return new BigDecimal(value, holder.scale); - -<#elseif minor.class == "Decimal28Dense" || - minor.class == "Decimal38Dense"> - return org.apache.arrow.vector.util.DecimalUtility.getBigDecimalFromDense(holder.buffer, - holder.start, - holder.nDecimalDigits, - holder.scale, - holder.maxPrecision, - holder.WIDTH); - -<#elseif minor.class == "Decimal28Sparse" || - minor.class == "Decimal38Sparse"> - return org.apache.arrow.vector.util.DecimalUtility.getBigDecimalFromSparse(holder.buffer, - holder.start, - holder.nDecimalDigits, - holder.scale); - <#elseif minor.class == "Bit" > return new Boolean(holder.value != 0); +<#elseif minor.class == "Decimal" > + return (BigDecimal) readSingleObject(); <#else> ${friendlyType} value = new ${friendlyType}(this.holder.value); return value; @@ -208,15 +145,7 @@ public void read(Nullable${name}Holder h) { @Override public Object readObject() { -<#if holderMode == "Repeated" > - List valList = Lists.newArrayList(); - for (int i = repeatedHolder.start; i < repeatedHolder.end; i++) { - valList.add(repeatedHolder.vector.getAccessor().getObject(i)); - } - return valList; -<#else> return readSingleObject(); - } private Object readSingleObject() { @@ -239,6 +168,9 @@ private Object readSingleObject() { Text text = new Text(); text.set(value); return text; +<#elseif minor.class == "Decimal" > + return new BigDecimal(new BigInteger(value), holder.scale); + <#elseif minor.class == "Interval"> @@ -249,11 +181,6 @@ private Object readSingleObject() { Period p = new Period(); return p.plusDays(holder.days).plusMillis(holder.milliseconds); -<#elseif minor.class == "Decimal9" || - minor.class == "Decimal18" > - BigInteger value = BigInteger.valueOf(holder.value); - return new BigDecimal(value, holder.scale); - <#elseif minor.class == "Decimal28Dense" || minor.class == "Decimal38Dense"> return org.apache.arrow.vector.util.DecimalUtility.getBigDecimalFromDense(holder.buffer, @@ -272,13 +199,18 @@ private Object readSingleObject() { <#elseif minor.class == "Bit" > return new Boolean(holder.value != 0); +<#elseif minor.class == "Decimal"> + byte[] bytes = new byte[${type.width}]; + holder.buffer.getBytes(holder.start, bytes, 0, 
${type.width}); + ${friendlyType} value = new BigDecimal(new BigInteger(bytes), holder.scale); + return value; <#else> ${friendlyType} value = new ${friendlyType}(this.holder.value); return value; } -<#if holderMode != "Repeated" && nullMode != "Nullable"> +<#if nullMode != "Nullable"> public void copyAsValue(${minor.class?cap_first}Writer writer){ writer.write(holder); } diff --git a/java/vector/src/main/codegen/templates/ListWriters.java b/java/vector/src/main/codegen/templates/ListWriters.java deleted file mode 100644 index 94b812b83dc96..0000000000000 --- a/java/vector/src/main/codegen/templates/ListWriters.java +++ /dev/null @@ -1,234 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -<@pp.dropOutputFile /> - -<#list ["Single", "Repeated"] as mode> -<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/${mode}ListWriter.java" /> - - -<#include "/@includes/license.ftl" /> - -package org.apache.arrow.vector.complex.impl; -<#if mode == "Single"> - <#assign containerClass = "AbstractContainerVector" /> - <#assign index = "idx()"> -<#else> - <#assign containerClass = "RepeatedListVector" /> - <#assign index = "currentChildIndex"> - - - -<#include "/@includes/vv_imports.ftl" /> - -/* - * This class is generated using FreeMarker and the ${.template_name} template. - */ -@SuppressWarnings("unused") -public class ${mode}ListWriter extends AbstractFieldWriter { - private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(${mode}ListWriter.class); - - static enum Mode { INIT, IN_MAP, IN_LIST <#list vv.types as type><#list type.minor as minor>, IN_${minor.class?upper_case} } - - private final String name; - protected final ${containerClass} container; - private Mode mode = Mode.INIT; - private FieldWriter writer; - protected RepeatedValueVector innerVector; - - <#if mode == "Repeated">private int currentChildIndex = 0; - public ${mode}ListWriter(String name, ${containerClass} container, FieldWriter parent){ - super(parent); - this.name = name; - this.container = container; - } - - public ${mode}ListWriter(${containerClass} container, FieldWriter parent){ - super(parent); - this.name = null; - this.container = container; - } - - @Override - public void allocate() { - if(writer != null) { - writer.allocate(); - } - - <#if mode == "Repeated"> - container.allocateNew(); - - } - - @Override - public void clear() { - if (writer != null) { - writer.clear(); - } - } - - @Override - public void close() { - clear(); - container.close(); - if (innerVector != null) { - innerVector.close(); - } - } - - @Override - public int getValueCapacity() { - return innerVector == null ? 
0 : innerVector.getValueCapacity(); - } - - public void setValueCount(int count){ - if(innerVector != null) innerVector.getMutator().setValueCount(count); - } - - @Override - public MapWriter map() { - switch(mode) { - case INIT: - int vectorCount = container.size(); - final RepeatedMapVector vector = container.addOrGet(name, RepeatedMapVector.TYPE, RepeatedMapVector.class); - innerVector = vector; - writer = new RepeatedMapWriter(vector, this); - if(vectorCount != container.size()) { - writer.allocate(); - } - writer.setPosition(${index}); - mode = Mode.IN_MAP; - return writer; - case IN_MAP: - return writer; - } - - throw new RuntimeException(getUnsupportedErrorMsg("MAP", mode.name())); - - } - - @Override - public ListWriter list() { - switch(mode) { - case INIT: - final int vectorCount = container.size(); - final RepeatedListVector vector = container.addOrGet(name, RepeatedListVector.TYPE, RepeatedListVector.class); - innerVector = vector; - writer = new RepeatedListWriter(null, vector, this); - if(vectorCount != container.size()) { - writer.allocate(); - } - writer.setPosition(${index}); - mode = Mode.IN_LIST; - return writer; - case IN_LIST: - return writer; - } - - throw new RuntimeException(getUnsupportedErrorMsg("LIST", mode.name())); - - } - - <#list vv.types as type><#list type.minor as minor> - <#assign lowerName = minor.class?uncap_first /> - <#assign upperName = minor.class?upper_case /> - <#assign capName = minor.class?cap_first /> - <#if lowerName == "int" ><#assign lowerName = "integer" /> - - private static final MajorType ${upperName}_TYPE = Types.repeated(MinorType.${upperName}); - - @Override - public ${capName}Writer ${lowerName}() { - switch(mode) { - case INIT: - final int vectorCount = container.size(); - final Repeated${capName}Vector vector = container.addOrGet(name, ${upperName}_TYPE, Repeated${capName}Vector.class); - innerVector = vector; - writer = new Repeated${capName}WriterImpl(vector, this); - if(vectorCount != container.size()) { - writer.allocate(); - } - writer.setPosition(${index}); - mode = Mode.IN_${upperName}; - return writer; - case IN_${upperName}: - return writer; - } - - throw new RuntimeException(getUnsupportedErrorMsg("${upperName}", mode.name())); - - } - - - public MaterializedField getField() { - return container.getField(); - } - - <#if mode == "Repeated"> - - public void startList() { - final RepeatedListVector list = (RepeatedListVector) container; - final RepeatedListVector.RepeatedMutator mutator = list.getMutator(); - - // make sure that the current vector can support the end position of this list. - if(container.getValueCapacity() <= idx()) { - mutator.setValueCount(idx()+1); - } - - // update the repeated vector to state that there is current+1 objects. - final RepeatedListHolder h = new RepeatedListHolder(); - list.getAccessor().get(idx(), h); - if (h.start >= h.end) { - mutator.startNewValue(idx()); - } - currentChildIndex = container.getMutator().add(idx()); - if(writer != null) { - writer.setPosition(currentChildIndex); - } - } - - public void endList() { - // noop, we initialize state at start rather than end. - } - <#else> - - public void setPosition(int index) { - super.setPosition(index); - if(writer != null) { - writer.setPosition(index); - } - } - - public void startList() { - // noop - } - - public void endList() { - // noop - } - - - private String getUnsupportedErrorMsg(String expected, String found) { - final String f = found.substring(3); - return String.format("In a list of type %s, encountered a value of type %s. 
"+ - "Arrow does not support lists of different types.", - f, expected - ); - } -} - diff --git a/java/vector/src/main/codegen/templates/MapWriters.java b/java/vector/src/main/codegen/templates/MapWriters.java index 42f39820393e5..af2922826ec4d 100644 --- a/java/vector/src/main/codegen/templates/MapWriters.java +++ b/java/vector/src/main/codegen/templates/MapWriters.java @@ -17,7 +17,7 @@ */ <@pp.dropOutputFile /> -<#list ["Single", "Repeated"] as mode> +<#list ["Single"] as mode> <@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/${mode}MapWriter.java" /> <#if mode == "Single"> <#assign containerClass = "MapVector" /> @@ -51,16 +51,8 @@ public class ${mode}MapWriter extends AbstractFieldWriter { private final Map fields = Maps.newHashMap(); <#if mode == "Repeated">private int currentChildIndex = 0; - private final boolean unionEnabled; - - public ${mode}MapWriter(${containerClass} container, FieldWriter parent, boolean unionEnabled) { - super(parent); + public ${mode}MapWriter(${containerClass} container) { this.container = container; - this.unionEnabled = unionEnabled; - } - - public ${mode}MapWriter(${containerClass} container, FieldWriter parent) { - this(container, parent, false); } @Override @@ -74,7 +66,7 @@ public boolean isEmptyMap() { } @Override - public MaterializedField getField() { + public Field getField() { return container.getField(); } @@ -83,12 +75,8 @@ public MapWriter map(String name) { FieldWriter writer = fields.get(name.toLowerCase()); if(writer == null){ int vectorCount=container.size(); - MapVector vector = container.addOrGet(name, MapVector.TYPE, MapVector.class); - if(!unionEnabled){ - writer = new SingleMapWriter(vector, this); - } else { - writer = new PromotableWriter(vector, container); - } + MapVector vector = container.addOrGet(name, MinorType.MAP, MapVector.class); + writer = new PromotableWriter(vector, container); if(vectorCount != container.size()) { writer.allocate(); } @@ -125,11 +113,7 @@ public ListWriter list(String name) { FieldWriter writer = fields.get(name.toLowerCase()); int vectorCount = container.size(); if(writer == null) { - if (!unionEnabled){ - writer = new SingleListWriter(name,container,this); - } else{ - writer = new PromotableWriter(container.addOrGet(name, Types.optional(MinorType.LIST), ListVector.class), container); - } + writer = new PromotableWriter(container.addOrGet(name, MinorType.LIST, ListVector.class), container); if (container.size() > vectorCount) { writer.allocate(); } @@ -206,9 +190,7 @@ public void end() { } public ${minor.class}Writer ${lowerName}(String name, int scale, int precision) { - final MajorType ${upperName}_TYPE = new MajorType(MinorType.${upperName}, DataMode.OPTIONAL, precision, scale, 0, null); <#else> - private static final MajorType ${upperName}_TYPE = Types.optional(MinorType.${upperName}); @Override public ${minor.class}Writer ${lowerName}(String name) { @@ -216,15 +198,9 @@ public void end() { if(writer == null) { ValueVector vector; ValueVector currentVector = container.getChild(name); - if (unionEnabled){ - ${vectName}Vector v = container.addOrGet(name, ${upperName}_TYPE, ${vectName}Vector.class); - writer = new PromotableWriter(v, container); - vector = v; - } else { - ${vectName}Vector v = container.addOrGet(name, ${upperName}_TYPE, ${vectName}Vector.class); - writer = new ${vectName}WriterImpl(v, this); - vector = v; - } + ${vectName}Vector v = container.addOrGet(name, MinorType.${upperName}, ${vectName}Vector.class); + writer = new PromotableWriter(v, container); + vector = v; 
if (currentVector == null || currentVector != vector) { vector.allocateNewSafe(); } diff --git a/java/vector/src/main/codegen/templates/NullReader.java b/java/vector/src/main/codegen/templates/NullReader.java index 3ef6c7dcc49a6..ba0c088add7c9 100644 --- a/java/vector/src/main/codegen/templates/NullReader.java +++ b/java/vector/src/main/codegen/templates/NullReader.java @@ -16,6 +16,9 @@ * limitations under the License. */ +import org.apache.arrow.vector.types.pojo.ArrowType.Null; +import org.apache.arrow.vector.types.pojo.Field; + <@pp.dropOutputFile /> <@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/NullReader.java" /> @@ -31,25 +34,31 @@ public class NullReader extends AbstractBaseReader implements FieldReader{ public static final NullReader INSTANCE = new NullReader(); - public static final NullReader EMPTY_LIST_INSTANCE = new NullReader(Types.repeated(MinorType.NULL)); - public static final NullReader EMPTY_MAP_INSTANCE = new NullReader(Types.required(MinorType.MAP)); - private MajorType type; + public static final NullReader EMPTY_LIST_INSTANCE = new NullReader(MinorType.NULL); + public static final NullReader EMPTY_MAP_INSTANCE = new NullReader(MinorType.MAP); + private MinorType type; private NullReader(){ super(); - type = Types.required(MinorType.NULL); + type = MinorType.NULL; } - private NullReader(MajorType type){ + private NullReader(MinorType type){ super(); this.type = type; } @Override - public MajorType getType() { + public MinorType getMinorType() { return type; } - + + + @Override + public Field getField() { + return new Field("", true, new Null(), null); + } + public void copyAsValue(MapWriter writer) {} public void copyAsValue(ListWriter writer) {} diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index b0029f7ad4c37..df508979c48b5 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -42,19 +42,79 @@ public final class ${className} extends BaseDataValueVector implements <#if type.major == "VarLen">VariableWidth<#else>FixedWidthVector, NullableVector{ private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(${className}.class); - private final FieldReader reader = new Nullable${minor.class}ReaderImpl(Nullable${minor.class}Vector.this); + private final FieldReader reader = new ${minor.class}ReaderImpl(Nullable${minor.class}Vector.this); - private final MaterializedField bitsField = MaterializedField.create("$bits$", new MajorType(MinorType.UINT1, DataMode.REQUIRED)); - private final MaterializedField valuesField = MaterializedField.create("$values$", new MajorType(field.getType().getMinorType(), DataMode.REQUIRED, field.getPrecision(), field.getScale())); + private final String bitsField = "$bits$"; + private final String valuesField = "$values$"; + private final Field field; final UInt1Vector bits = new UInt1Vector(bitsField, allocator); - final ${valuesName} values = new ${minor.class}Vector(valuesField, allocator); + final ${valuesName} values; - private final Mutator mutator = new Mutator(); - private final Accessor accessor = new Accessor(); + private final Mutator mutator; + private final Accessor accessor; - public ${className}(MaterializedField field, BufferAllocator allocator) { - super(field, allocator); + <#if minor.class == "Decimal"> + private final int precision; + private final int scale; + + public ${className}(String 
name, BufferAllocator allocator, int precision, int scale) { + super(name, allocator); + values = new ${minor.class}Vector(valuesField, allocator, precision, scale); + this.precision = precision; + this.scale = scale; + mutator = new Mutator(); + accessor = new Accessor(); + field = new Field(name, true, new Decimal(precision, scale), null); + } + <#else> + public ${className}(String name, BufferAllocator allocator) { + super(name, allocator); + values = new ${minor.class}Vector(valuesField, allocator); + mutator = new Mutator(); + accessor = new Accessor(); + <#if minor.class == "TinyInt" || + minor.class == "SmallInt" || + minor.class == "Int" || + minor.class == "BigInt"> + field = new Field(name, true, new Int(${type.width} * 8, true), null); + <#elseif minor.class == "UInt1" || + minor.class == "UInt2" || + minor.class == "UInt4" || + minor.class == "UInt8"> + field = new Field(name, true, new Int(${type.width} * 8, false), null); + <#elseif minor.class == "Date"> + field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Date(), null); + <#elseif minor.class == "Time"> + field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Time(), null); + <#elseif minor.class == "Float4"> + field = new Field(name, true, new FloatingPoint(0), null); + <#elseif minor.class == "Float8"> + field = new Field(name, true, new FloatingPoint(1), null); + <#elseif minor.class == "TimeStamp"> + field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(""), null); + <#elseif minor.class == "IntervalDay"> + field = new Field(name, true, new IntervalDay(), null); + <#elseif minor.class == "IntervalYear"> + field = new Field(name, true, new IntervalYear(), null); + <#elseif minor.class == "VarChar"> + field = new Field(name, true, new Utf8(), null); + <#elseif minor.class == "VarBinary"> + field = new Field(name, true, new Binary(), null); + <#elseif minor.class == "Bit"> + field = new Field(name, true, new Bool(), null); + + } + + + @Override + public Field getField() { + return field; + } + + @Override + public MinorType getMinorType() { + return MinorType.${minor.class?upper_case}; } @Override @@ -240,12 +300,13 @@ public void zeroVector() { @Override public TransferPair getTransferPair(BufferAllocator allocator){ - return new TransferImpl(getField(), allocator); + return new TransferImpl(name, allocator); + } @Override public TransferPair getTransferPair(String ref, BufferAllocator allocator){ - return new TransferImpl(getField().withPath(ref), allocator); + return new TransferImpl(ref, allocator); } @Override @@ -273,8 +334,12 @@ public void splitAndTransferTo(int startIndex, int length, Nullable${minor.class private class TransferImpl implements TransferPair { Nullable${minor.class}Vector to; - public TransferImpl(MaterializedField field, BufferAllocator allocator){ - to = new Nullable${minor.class}Vector(field, allocator); + public TransferImpl(String name, BufferAllocator allocator){ + <#if minor.class == "Decimal"> + to = new Nullable${minor.class}Vector(name, allocator, precision, scale); + <#else> + to = new Nullable${minor.class}Vector(name, allocator); + } public TransferImpl(Nullable${minor.class}Vector to){ @@ -312,17 +377,6 @@ public Mutator getMutator(){ return mutator; } - public ${minor.class}Vector convertToRequiredVector(){ - ${minor.class}Vector v = new ${minor.class}Vector(getField().getOtherNullableVersion(), allocator); - if (v.data != null) { - v.data.release(1); - } - v.data = values.data; - v.data.retain(1); 
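A toy model of the bits/values pairing used throughout this file (plain arrays stand in for the "$bits$" UInt1 validity vector and the "$values$" payload vector; a sketch only):

```java
// Element i is null iff bits[i] == 0; the payload slot is then undefined.
final class NullableIntColumn {
  private final byte[] bits;   // validity: 1 = value present
  private final int[] values;  // payload, meaningful only where bits[i] == 1

  NullableIntColumn(int capacity) {
    bits = new byte[capacity];
    values = new int[capacity];
  }

  void set(int i, int v) { bits[i] = 1; values[i] = v; }
  void setNull(int i)    { bits[i] = 0; }
  boolean isNull(int i)  { return bits[i] == 0; }

  Integer getObject(int i) { return isNull(i) ? null : values[i]; }

  public static void main(String[] args) {
    NullableIntColumn c = new NullableIntColumn(2);
    c.set(0, 7);
    c.setNull(1);
    System.out.println(c.getObject(0)); // 7
    System.out.println(c.getObject(1)); // null
  }
}
```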
- clear(); - return v; - } - public void copyFrom(int fromIndex, int thisIndex, Nullable${minor.class}Vector from){ final Accessor fromAccessor = from.getAccessor(); if (!fromAccessor.isNull(fromIndex)) { @@ -389,8 +443,8 @@ public void get(int index, Nullable${minor.class}Holder holder){ holder.isSet = bAccessor.get(index); <#if minor.class.startsWith("Decimal")> - holder.scale = getField().getScale(); - holder.precision = getField().getPrecision(); + holder.scale = scale; + holder.precision = precision; } diff --git a/java/vector/src/main/codegen/templates/RepeatedValueVectors.java b/java/vector/src/main/codegen/templates/RepeatedValueVectors.java deleted file mode 100644 index ceae53bbf58cf..0000000000000 --- a/java/vector/src/main/codegen/templates/RepeatedValueVectors.java +++ /dev/null @@ -1,421 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ - -<@pp.dropOutputFile /> -<#list vv.types as type> -<#list type.minor as minor> -<#assign friendlyType = (minor.friendlyType!minor.boxedType!type.boxedType) /> -<#assign fields = minor.fields!type.fields /> - -<@pp.changeOutputFile name="/org/apache/arrow/vector/Repeated${minor.class}Vector.java" /> -<#include "/@includes/license.ftl" /> - -package org.apache.arrow.vector; - -<#include "/@includes/vv_imports.ftl" /> - -/** - * Repeated${minor.class} implements a vector with multple values per row (e.g. JSON array or - * repeated protobuf field). The implementation uses two additional value vectors; one to convert - * the index offset to the underlying element offset, and another to store the number of values - * in the vector. - * - * NB: this class is automatically generated from ${.template_name} and ValueVectorTypes.tdd using FreeMarker. - */ - -public final class Repeated${minor.class}Vector extends BaseRepeatedValueVector implements Repeated<#if type.major == "VarLen">VariableWidth<#else>FixedWidthVectorLike { - //private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(Repeated${minor.class}Vector.class); - - // we maintain local reference to concrete vector type for performance reasons. 
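As a sketch of the offsets/values layout the class comment above describes (plain arrays stand in for the offsets and data vectors; an illustration, not code from the template):

```java
import java.util.Arrays;

// Row i owns values[offsets[i] .. offsets[i+1]); offsets has rowCount + 1
// entries, so an empty row is just a repeated offset.
final class RepeatedIntColumn {
  // Example: row 0 -> {1, 2}, row 1 -> {} (empty), row 2 -> {3, 4, 5}
  static final int[] OFFSETS = {0, 2, 2, 5};
  static final int[] VALUES  = {1, 2, 3, 4, 5};

  static int[] row(int i) {
    return Arrays.copyOfRange(VALUES, OFFSETS[i], OFFSETS[i + 1]);
  }

  static int innerValueCountAt(int i) {
    return OFFSETS[i + 1] - OFFSETS[i];
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(row(2))); // [3, 4, 5]
    System.out.println(innerValueCountAt(1));    // 0
  }
}
```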
- ${minor.class}Vector values; - private final FieldReader reader = new Repeated${minor.class}ReaderImpl(Repeated${minor.class}Vector.this); - private final Mutator mutator = new Mutator(); - private final Accessor accessor = new Accessor(); - - public Repeated${minor.class}Vector(MaterializedField field, BufferAllocator allocator) { - super(field, allocator); - addOrGetVector(VectorDescriptor.create(new MajorType(field.getType().getMinorType(), DataMode.REQUIRED))); - } - - @Override - public Mutator getMutator() { - return mutator; - } - - @Override - public Accessor getAccessor() { - return accessor; - } - - @Override - public FieldReader getReader() { - return reader; - } - - @Override - public ${minor.class}Vector getDataVector() { - return values; - } - - @Override - public TransferPair getTransferPair(BufferAllocator allocator) { - return new TransferImpl(getField(), allocator); - } - - @Override - public TransferPair getTransferPair(String ref, BufferAllocator allocator){ - return new TransferImpl(getField().withPath(ref), allocator); - } - - @Override - public TransferPair makeTransferPair(ValueVector to) { - return new TransferImpl((Repeated${minor.class}Vector) to); - } - - @Override - public AddOrGetResult<${minor.class}Vector> addOrGetVector(VectorDescriptor descriptor) { - final AddOrGetResult<${minor.class}Vector> result = super.addOrGetVector(descriptor); - if (result.isCreated()) { - values = result.getVector(); - } - return result; - } - - public void transferTo(Repeated${minor.class}Vector target) { - target.clear(); - offsets.transferTo(target.offsets); - values.transferTo(target.values); - clear(); - } - - public void splitAndTransferTo(final int startIndex, final int groups, Repeated${minor.class}Vector to) { - final UInt4Vector.Accessor a = offsets.getAccessor(); - final UInt4Vector.Mutator m = to.offsets.getMutator(); - - final int startPos = a.get(startIndex); - final int endPos = a.get(startIndex + groups); - final int valuesToCopy = endPos - startPos; - - values.splitAndTransferTo(startPos, valuesToCopy, to.values); - to.offsets.clear(); - to.offsets.allocateNew(groups + 1); - int normalizedPos = 0; - for (int i=0; i < groups + 1;i++ ) { - normalizedPos = a.get(startIndex+i) - startPos; - m.set(i, normalizedPos); - } - m.setValueCount(groups == 0 ? 
0 : groups + 1); - } - - private class TransferImpl implements TransferPair { - final Repeated${minor.class}Vector to; - - public TransferImpl(MaterializedField field, BufferAllocator allocator) { - this.to = new Repeated${minor.class}Vector(field, allocator); - } - - public TransferImpl(Repeated${minor.class}Vector to) { - this.to = to; - } - - @Override - public Repeated${minor.class}Vector getTo() { - return to; - } - - @Override - public void transfer() { - transferTo(to); - } - - @Override - public void splitAndTransfer(int startIndex, int length) { - splitAndTransferTo(startIndex, length, to); - } - - @Override - public void copyValueSafe(int fromIndex, int toIndex) { - to.copyFromSafe(fromIndex, toIndex, Repeated${minor.class}Vector.this); - } - } - - public void copyFrom(int inIndex, int outIndex, Repeated${minor.class}Vector v) { - final Accessor vAccessor = v.getAccessor(); - final int count = vAccessor.getInnerValueCountAt(inIndex); - mutator.startNewValue(outIndex); - for (int i = 0; i < count; i++) { - mutator.add(outIndex, vAccessor.get(inIndex, i)); - } - } - - public void copyFromSafe(int inIndex, int outIndex, Repeated${minor.class}Vector v) { - final Accessor vAccessor = v.getAccessor(); - final int count = vAccessor.getInnerValueCountAt(inIndex); - mutator.startNewValue(outIndex); - for (int i = 0; i < count; i++) { - mutator.addSafe(outIndex, vAccessor.get(inIndex, i)); - } - } - - public boolean allocateNewSafe() { - /* boolean to keep track if all the memory allocation were successful - * Used in the case of composite vectors when we need to allocate multiple - * buffers for multiple vectors. If one of the allocations failed we need to - * clear all the memory that we allocated - */ - boolean success = false; - try { - if(!offsets.allocateNewSafe()) return false; - if(!values.allocateNewSafe()) return false; - success = true; - } finally { - if (!success) { - clear(); - } - } - offsets.zeroVector(); - mutator.reset(); - return true; - } - - @Override - public void allocateNew() { - try { - offsets.allocateNew(); - values.allocateNew(); - } catch (OutOfMemoryException e) { - clear(); - throw e; - } - offsets.zeroVector(); - mutator.reset(); - } - - <#if type.major == "VarLen"> -// @Override -// protected SerializedField.Builder getMetadataBuilder() { -// return super.getMetadataBuilder() -// .setVarByteLength(values.getVarByteLength()); -// } - - public void allocateNew(int totalBytes, int valueCount, int innerValueCount) { - try { - offsets.allocateNew(valueCount + 1); - values.allocateNew(totalBytes, innerValueCount); - } catch (OutOfMemoryException e) { - clear(); - throw e; - } - offsets.zeroVector(); - mutator.reset(); - } - - public int getByteCapacity(){ - return values.getByteCapacity(); - } - - <#else> - - @Override - public void allocateNew(int valueCount, int innerValueCount) { - clear(); - /* boolean to keep track if all the memory allocation were successful - * Used in the case of composite vectors when we need to allocate multiple - * buffers for multiple vectors. 
If one of the allocations failed we need to// - * clear all the memory that we allocated - */ - boolean success = false; - try { - offsets.allocateNew(valueCount + 1); - values.allocateNew(innerValueCount); - } catch(OutOfMemoryException e){ - clear(); - throw e; - } - offsets.zeroVector(); - mutator.reset(); - } - - - - // This is declared a subclass of the accessor declared inside of FixedWidthVector, this is also used for - // variable length vectors, as they should ahve consistent interface as much as possible, if they need to diverge - // in the future, the interface shold be declared in the respective value vector superclasses for fixed and variable - // and we should refer to each in the generation template - public final class Accessor extends BaseRepeatedValueVector.BaseRepeatedAccessor { - @Override - public List<${friendlyType}> getObject(int index) { - final List<${friendlyType}> vals = new JsonStringArrayList<>(); - final UInt4Vector.Accessor offsetsAccessor = offsets.getAccessor(); - final int start = offsetsAccessor.get(index); - final int end = offsetsAccessor.get(index + 1); - final ${minor.class}Vector.Accessor valuesAccessor = values.getAccessor(); - for(int i = start; i < end; i++) { - vals.add(valuesAccessor.getObject(i)); - } - return vals; - } - - public ${friendlyType} getSingleObject(int index, int arrayIndex) { - final int start = offsets.getAccessor().get(index); - return values.getAccessor().getObject(start + arrayIndex); - } - - /** - * Get a value for the given record. Each element in the repeated field is accessed by - * the positionIndex param. - * - * @param index record containing the repeated field - * @param positionIndex position within the repeated field - * @return element at the given position in the given record - */ - public <#if type.major == "VarLen">byte[] - <#else>${minor.javaType!type.javaType} - get(int index, int positionIndex) { - return values.getAccessor().get(offsets.getAccessor().get(index) + positionIndex); - } - - public void get(int index, Repeated${minor.class}Holder holder) { - holder.start = offsets.getAccessor().get(index); - holder.end = offsets.getAccessor().get(index+1); - holder.vector = values; - } - - public void get(int index, int positionIndex, ${minor.class}Holder holder) { - final int offset = offsets.getAccessor().get(index); - assert offset >= 0; - assert positionIndex < getInnerValueCountAt(index); - values.getAccessor().get(offset + positionIndex, holder); - } - - public void get(int index, int positionIndex, Nullable${minor.class}Holder holder) { - final int offset = offsets.getAccessor().get(index); - assert offset >= 0; - if (positionIndex >= getInnerValueCountAt(index)) { - holder.isSet = 0; - return; - } - values.getAccessor().get(offset + positionIndex, holder); - } - } - - public final class Mutator extends BaseRepeatedValueVector.BaseRepeatedMutator implements RepeatedMutator { - private Mutator() {} - - /** - * Add an element to the given record index. This is similar to the set() method in other - * value vectors, except that it permits setting multiple values for a single record. 
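The splitAndTransferTo method earlier in this file rebases the copied offsets against the first transferred element; a worked sketch of that rebasing under the same offsets convention (hypothetical helper, not template code):

```java
import java.util.Arrays;

// Slicing rows [start, start + groups) must subtract the first element
// offset so the destination's offsets begin at zero again.
final class OffsetSliceDemo {
  static int[] sliceOffsets(int[] offsets, int start, int groups) {
    int base = offsets[start];
    int[] out = new int[groups + 1];
    for (int i = 0; i <= groups; i++) {
      out[i] = offsets[start + i] - base; // rebase into the new buffer
    }
    return out;
  }

  public static void main(String[] args) {
    int[] offsets = {0, 2, 2, 5, 9};
    // Two rows starting at row 2, i.e. element ranges [2,5) and [5,9):
    System.out.println(Arrays.toString(sliceOffsets(offsets, 2, 2))); // [0, 3, 7]
  }
}
```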
- * - * @param index record of the element to add - * @param value value to add to the given row - */ - public void add(int index, <#if type.major == "VarLen">byte[]<#elseif (type.width < 4)>int<#else>${minor.javaType!type.javaType} value) { - int nextOffset = offsets.getAccessor().get(index+1); - values.getMutator().set(nextOffset, value); - offsets.getMutator().set(index+1, nextOffset+1); - } - - <#if type.major == "VarLen"> - public void addSafe(int index, byte[] bytes) { - addSafe(index, bytes, 0, bytes.length); - } - - public void addSafe(int index, byte[] bytes, int start, int length) { - final int nextOffset = offsets.getAccessor().get(index+1); - values.getMutator().setSafe(nextOffset, bytes, start, length); - offsets.getMutator().setSafe(index+1, nextOffset+1); - } - - <#else> - - public void addSafe(int index, ${minor.javaType!type.javaType} srcValue) { - final int nextOffset = offsets.getAccessor().get(index+1); - values.getMutator().setSafe(nextOffset, srcValue); - offsets.getMutator().setSafe(index+1, nextOffset+1); - } - - - - public void setSafe(int index, Repeated${minor.class}Holder h) { - final ${minor.class}Holder ih = new ${minor.class}Holder(); - final ${minor.class}Vector.Accessor hVectorAccessor = h.vector.getAccessor(); - mutator.startNewValue(index); - for(int i = h.start; i < h.end; i++){ - hVectorAccessor.get(i, ih); - mutator.addSafe(index, ih); - } - } - - public void addSafe(int index, ${minor.class}Holder holder) { - int nextOffset = offsets.getAccessor().get(index+1); - values.getMutator().setSafe(nextOffset, holder); - offsets.getMutator().setSafe(index+1, nextOffset+1); - } - - public void addSafe(int index, Nullable${minor.class}Holder holder) { - final int nextOffset = offsets.getAccessor().get(index+1); - values.getMutator().setSafe(nextOffset, holder); - offsets.getMutator().setSafe(index+1, nextOffset+1); - } - - <#if (fields?size > 1) && !(minor.class == "Decimal9" || minor.class == "Decimal18" || minor.class == "Decimal28Sparse" || minor.class == "Decimal38Sparse" || minor.class == "Decimal28Dense" || minor.class == "Decimal38Dense")> - public void addSafe(int arrayIndex, <#list fields as field>${field.type} ${field.name}<#if field_has_next>, ) { - int nextOffset = offsets.getAccessor().get(arrayIndex+1); - values.getMutator().setSafe(nextOffset, <#list fields as field>${field.name}<#if field_has_next>, ); - offsets.getMutator().setSafe(arrayIndex+1, nextOffset+1); - } - - - protected void add(int index, ${minor.class}Holder holder) { - int nextOffset = offsets.getAccessor().get(index+1); - values.getMutator().set(nextOffset, holder); - offsets.getMutator().set(index+1, nextOffset+1); - } - - public void add(int index, Repeated${minor.class}Holder holder) { - - ${minor.class}Vector.Accessor accessor = holder.vector.getAccessor(); - ${minor.class}Holder innerHolder = new ${minor.class}Holder(); - - for(int i = holder.start; i < holder.end; i++) { - accessor.get(i, innerHolder); - add(index, innerHolder); - } - } - - @Override - public void generateTestData(final int valCount) { - final int[] sizes = {1, 2, 0, 6}; - int size = 0; - int runningOffset = 0; - final UInt4Vector.Mutator offsetsMutator = offsets.getMutator(); - for(int i = 1; i < valCount + 1; i++, size++) { - runningOffset += sizes[size % sizes.length]; - offsetsMutator.set(i, runningOffset); - } - values.getMutator().generateTestData(valCount * 9); - setValueCount(size); - } - - @Override - public void reset() { - } - } -} - - diff --git 
a/java/vector/src/main/codegen/templates/UnionListWriter.java b/java/vector/src/main/codegen/templates/UnionListWriter.java index 9a6b08fc561f9..49d57e716bc8a 100644 --- a/java/vector/src/main/codegen/templates/UnionListWriter.java +++ b/java/vector/src/main/codegen/templates/UnionListWriter.java @@ -43,7 +43,6 @@ public class UnionListWriter extends AbstractFieldWriter { private int lastIndex = 0; public UnionListWriter(ListVector vector) { - super(null); this.vector = vector; this.writer = new PromotableWriter(vector.getDataVector(), vector); this.offsets = vector.getOffsetVector(); @@ -64,10 +63,14 @@ public void clear() { } @Override - public MaterializedField getField() { + public Field getField() { return null; } + public void setValueCount(int count) { + vector.getMutator().setValueCount(count); + } + @Override public int getValueCapacity() { return vector.getValueCapacity(); @@ -78,6 +81,12 @@ public void close() throws Exception { } + @Override + public void setPosition(int index) { + super.setPosition(index); + startList(); + } + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> <#assign fields = minor.fields!type.fields /> <#assign uncappedName = name?uncap_first/> @@ -91,7 +100,7 @@ public void close() throws Exception { @Override public ${name}Writer <#if uncappedName == "int">integer<#else>${uncappedName}(String name) { - assert inMap; +// assert inMap; mapName = name; final int nextOffset = offsets.getAccessor().get(idx() + 1); vector.getMutator().setNotNull(idx()); @@ -146,7 +155,7 @@ public void endList() { @Override public void start() { - assert inMap; +// assert inMap; final int nextOffset = offsets.getAccessor().get(idx() + 1); vector.getMutator().setNotNull(idx()); offsets.getMutator().setSafe(idx() + 1, nextOffset); @@ -155,11 +164,11 @@ public void start() { @Override public void end() { - if (inMap) { +// if (inMap) { inMap = false; final int nextOffset = offsets.getAccessor().get(idx() + 1); offsets.getMutator().setSafe(idx() + 1, nextOffset + 1); - } +// } } <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> @@ -170,7 +179,7 @@ public void end() { @Override public void write${name}(<#list fields as field>${field.type} ${field.name}<#if field_has_next>, ) { - assert !inMap; +// assert !inMap; final int nextOffset = offsets.getAccessor().get(idx() + 1); vector.getMutator().setNotNull(idx()); writer.setPosition(nextOffset); diff --git a/java/vector/src/main/codegen/templates/UnionReader.java b/java/vector/src/main/codegen/templates/UnionReader.java index 44c3e55dcc6f1..7351ae3776f57 100644 --- a/java/vector/src/main/codegen/templates/UnionReader.java +++ b/java/vector/src/main/codegen/templates/UnionReader.java @@ -17,6 +17,8 @@ */ +import org.apache.arrow.vector.types.Types.MinorType; + <@pp.dropOutputFile /> <@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/UnionReader.java" /> @@ -37,18 +39,18 @@ public UnionReader(UnionVector data) { this.data = data; } - private static MajorType[] TYPES = new MajorType[43]; + public MinorType getMinorType() { + return TYPES[data.getTypeValue(idx())]; + } + + private static MinorType[] TYPES = new MinorType[43]; static { for (MinorType minorType : MinorType.values()) { - TYPES[minorType.ordinal()] = new MajorType(minorType, DataMode.OPTIONAL); + TYPES[minorType.ordinal()] = minorType; } } - public MajorType getType() { - return TYPES[data.getTypeValue(idx())]; - } - public boolean isSet(){ return !data.getAccessor().isNull(idx()); 
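A toy model of the per-slot dispatch UnionReader performs here: one type-id byte per slot (the MinorType ordinal) selects which child supplies the value (Kind and the arrays are hypothetical stand-ins for the union's internal vectors):

```java
// NULL as ordinal 0 mirrors the NULL case handled by the reader's switch.
enum Kind { NULL, INT, VARCHAR }

final class UnionDispatchDemo {
  static final byte[] TYPE_IDS = {1, 0, 2};           // per-slot Kind ordinal
  static final int[] INTS      = {7, 0, 0};           // int child storage
  static final String[] STRS   = {null, null, "hi"};  // varchar child storage

  static Object get(int slot) {
    switch (Kind.values()[TYPE_IDS[slot]]) {
      case NULL:    return null;        // NULL type id means "no value set"
      case INT:     return INTS[slot];
      case VARCHAR: return STRS[slot];
      default:      throw new AssertionError();
    }
  }

  public static void main(String[] args) {
    for (int i = 0; i < 3; i++) {
      System.out.println(get(i)); // 7, null, hi
    }
  }
}
```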
} @@ -69,7 +71,7 @@ private FieldReader getReaderForIndex(int index) { return reader; } switch (MinorType.values()[typeValue]) { - case LATE: + case NULL: return NullReader.INSTANCE; case MAP: return (FieldReader) getMap(); @@ -119,9 +121,9 @@ public void copyAsValue(UnionWriter writer) { writer.data.copyFrom(idx(), writer.idx(), data); } - <#list ["Object", "BigDecimal", "Integer", "Long", "Boolean", - "Character", "DateTime", "Period", "Double", "Float", - "Text", "String", "Byte", "Short", "byte[]"] as friendlyType> + <#list ["Object", "Integer", "Long", "Boolean", + "Character", "DateTime", "Double", "Float", + "Text", "Byte", "Short", "byte[]"] as friendlyType> <#assign safeType=friendlyType /> <#if safeType=="byte[]"><#assign safeType="ByteArray" /> @@ -141,11 +143,11 @@ public void copyAsValue(UnionWriter writer) { <#if safeType=="byte[]"><#assign safeType="ByteArray" /> <#if !minor.class?starts_with("Decimal")> - private Nullable${name}ReaderImpl ${uncappedName}Reader; + private ${name}ReaderImpl ${uncappedName}Reader; - private Nullable${name}ReaderImpl get${name}() { + private ${name}ReaderImpl get${name}() { if (${uncappedName}Reader == null) { - ${uncappedName}Reader = new Nullable${name}ReaderImpl(data.get${name}Vector()); + ${uncappedName}Reader = new ${name}ReaderImpl(data.get${name}Vector()); ${uncappedName}Reader.setPosition(idx()); readers[MinorType.${name?upper_case}.ordinal()] = ${uncappedName}Reader; } diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java index 0f089b7e91537..e2f19f4b33ba5 100644 --- a/java/vector/src/main/codegen/templates/UnionVector.java +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -16,6 +16,16 @@ * limitations under the License. */ +import com.google.flatbuffers.FlatBufferBuilder; +import org.apache.arrow.flatbuf.Field; +import org.apache.arrow.flatbuf.Type; +import org.apache.arrow.flatbuf.Union; +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.types.pojo.ArrowType; + +import java.util.ArrayList; +import java.util.List; + <@pp.dropOutputFile /> <@pp.changeOutputFile name="/org/apache/arrow/vector/complex/UnionVector.java" /> @@ -29,7 +39,6 @@ import java.util.Iterator; import org.apache.arrow.vector.complex.impl.ComplexCopier; import org.apache.arrow.vector.util.CallBack; -import org.apache.arrow.vector.util.BasicTypeHelper; /* * This class is generated using freemarker and the ${.template_name} template. 
@@ -47,34 +56,30 @@ */ public class UnionVector implements ValueVector { - private MaterializedField field; + private String name; private BufferAllocator allocator; private Accessor accessor = new Accessor(); private Mutator mutator = new Mutator(); int valueCount; MapVector internalMap; - private UInt1Vector typeVector; + UInt1Vector typeVector; private MapVector mapVector; private ListVector listVector; private FieldReader reader; - private NullableBitVector bit; private int singleType = 0; private ValueVector singleVector; - private MajorType majorType; private final CallBack callBack; - public UnionVector(MaterializedField field, BufferAllocator allocator, CallBack callBack) { - this.field = field.clone(); + public UnionVector(String name, BufferAllocator allocator, CallBack callBack) { + this.name = name; this.allocator = allocator; this.internalMap = new MapVector("internal", allocator, callBack); - this.typeVector = internalMap.addOrGet("types", new MajorType(MinorType.UINT1, DataMode.REQUIRED), UInt1Vector.class); - this.field.addChild(internalMap.getField().clone()); - this.majorType = field.getType(); + this.typeVector = new UInt1Vector("types", allocator); this.callBack = callBack; } @@ -82,34 +87,20 @@ public BufferAllocator getAllocator() { return allocator; } - public List getSubTypes() { - return majorType.getSubTypes(); - } - - public void addSubType(MinorType type) { - if (majorType.getSubTypes().contains(type)) { - return; - } - List subTypes = this.majorType.getSubTypes(); - List newSubTypes = new ArrayList<>(subTypes); - newSubTypes.add(type); - majorType = new MajorType(this.majorType.getMinorType(), this.majorType.getMode(), this.majorType.getPrecision(), - this.majorType.getScale(), this.majorType.getTimezone(), newSubTypes); - field = MaterializedField.create(field.getName(), majorType); - if (callBack != null) { - callBack.doWork(); - } + @Override + public MinorType getMinorType() { + return MinorType.UNION; } - private static final MajorType MAP_TYPE = new MajorType(MinorType.MAP, DataMode.OPTIONAL); - public MapVector getMap() { if (mapVector == null) { int vectorCount = internalMap.size(); - mapVector = internalMap.addOrGet("map", MAP_TYPE, MapVector.class); - addSubType(MinorType.MAP); + mapVector = internalMap.addOrGet("map", MinorType.MAP, MapVector.class); if (internalMap.size() > vectorCount) { mapVector.allocateNew(); + if (callBack != null) { + callBack.doWork(); + } } } return mapVector; @@ -121,15 +112,16 @@ public MapVector getMap() { <#if !minor.class?starts_with("Decimal")> private Nullable${name}Vector ${uncappedName}Vector; - private static final MajorType ${name?upper_case}_TYPE = new MajorType(MinorType.${name?upper_case}, DataMode.OPTIONAL); public Nullable${name}Vector get${name}Vector() { if (${uncappedName}Vector == null) { int vectorCount = internalMap.size(); - ${uncappedName}Vector = internalMap.addOrGet("${uncappedName}", ${name?upper_case}_TYPE, Nullable${name}Vector.class); - addSubType(MinorType.${name?upper_case}); + ${uncappedName}Vector = internalMap.addOrGet("${uncappedName}", MinorType.${name?upper_case}, Nullable${name}Vector.class); if (internalMap.size() > vectorCount) { ${uncappedName}Vector.allocateNew(); + if (callBack != null) { + callBack.doWork(); + } } } return ${uncappedName}Vector; @@ -139,15 +131,15 @@ public MapVector getMap() { - private static final MajorType LIST_TYPE = new MajorType(MinorType.LIST, DataMode.OPTIONAL); - public ListVector getList() { if (listVector == null) { int vectorCount = 
internalMap.size(); - listVector = internalMap.addOrGet("list", LIST_TYPE, ListVector.class); - addSubType(MinorType.LIST); + listVector = internalMap.addOrGet("list", MinorType.LIST, ListVector.class); if (internalMap.size() > vectorCount) { listVector.allocateNew(); + if (callBack != null) { + callBack.doWork(); + } } } return listVector; @@ -164,6 +156,7 @@ public UInt1Vector getTypeVector() { @Override public void allocateNew() throws OutOfMemoryException { internalMap.allocateNew(); + typeVector.allocateNew(); if (typeVector != null) { typeVector.zeroVector(); } @@ -172,6 +165,7 @@ public void allocateNew() throws OutOfMemoryException { @Override public boolean allocateNewSafe() { boolean safe = internalMap.allocateNewSafe(); + safe = safe && typeVector.allocateNewSafe(); if (safe) { if (typeVector != null) { typeVector.zeroVector(); @@ -196,22 +190,27 @@ public void close() { @Override public void clear() { + typeVector.clear(); internalMap.clear(); } @Override - public MaterializedField getField() { - return field; + public Field getField() { + List childFields = new ArrayList<>(); + for (ValueVector v : internalMap.getChildren()) { + childFields.add(v.getField()); + } + return new Field(name, true, new ArrowType.Union(), childFields); } @Override public TransferPair getTransferPair(BufferAllocator allocator) { - return new TransferImpl(field, allocator); + return new TransferImpl(name, allocator); } @Override public TransferPair getTransferPair(String ref, BufferAllocator allocator) { - return new TransferImpl(field.withPath(ref), allocator); + return new TransferImpl(ref, allocator); } @Override @@ -219,10 +218,9 @@ public TransferPair makeTransferPair(ValueVector target) { return new TransferImpl((UnionVector) target); } - public void transferTo(UnionVector target) { + public void transferTo(org.apache.arrow.vector.complex.UnionVector target) { internalMap.makeTransferPair(target.internalMap).transfer(); target.valueCount = valueCount; - target.majorType = majorType; } public void copyFrom(int inIndex, int outIndex, UnionVector from) { @@ -236,13 +234,14 @@ public void copyFromSafe(int inIndex, int outIndex, UnionVector from) { } public ValueVector addVector(ValueVector v) { - String name = v.getField().getType().getMinorType().name().toLowerCase(); - MajorType type = v.getField().getType(); + String name = v.getMinorType().name().toLowerCase(); Preconditions.checkState(internalMap.getChild(name) == null, String.format("%s vector already exists", name)); - final ValueVector newVector = internalMap.addOrGet(name, type, (Class) BasicTypeHelper.getValueVectorClass(type.getMinorType(), type.getMode())); + final ValueVector newVector = internalMap.addOrGet(name, v.getMinorType(), v.getClass()); v.makeTransferPair(newVector).transfer(); internalMap.putChild(name, newVector); - addSubType(v.getField().getType().getMinorType()); + if (callBack != null) { + callBack.doWork(); + } return newVector; } @@ -250,8 +249,8 @@ private class TransferImpl implements TransferPair { UnionVector to; - public TransferImpl(MaterializedField field, BufferAllocator allocator) { - to = new UnionVector(field, allocator, null); + public TransferImpl(String name, BufferAllocator allocator) { + to = new UnionVector(name, allocator, null); } public TransferImpl(UnionVector to) { @@ -357,7 +356,7 @@ public class Accessor extends BaseValueVector.BaseAccessor { public Object getObject(int index) { int type = typeVector.getAccessor().get(index); switch (MinorType.values()[type]) { - case LATE: + case NULL: return 
null; <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> <#assign fields = minor.fields!type.fields /> @@ -421,7 +420,7 @@ public void setSafe(int index, UnionHolder holder) { writer = new UnionWriter(UnionVector.this); } writer.setPosition(index); - MinorType type = reader.getType().getMinorType(); + MinorType type = reader.getMinorType(); switch (type) { <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> <#assign fields = minor.fields!type.fields /> @@ -460,7 +459,7 @@ public void setSafe(int index, Nullable${name}Holder holder) { public void setType(int index, MinorType type) { - typeVector.getMutator().setSafe(index, type.ordinal()); + typeVector.getMutator().setSafe(index, (byte) type.ordinal()); } @Override diff --git a/java/vector/src/main/codegen/templates/UnionWriter.java b/java/vector/src/main/codegen/templates/UnionWriter.java index c9c29e0dd5f92..1137e2cb0207a 100644 --- a/java/vector/src/main/codegen/templates/UnionWriter.java +++ b/java/vector/src/main/codegen/templates/UnionWriter.java @@ -37,17 +37,7 @@ public class UnionWriter extends AbstractFieldWriter implements FieldWriter { private UnionListWriter listWriter; private List writers = Lists.newArrayList(); - public UnionWriter(BufferAllocator allocator) { - super(null); - } - public UnionWriter(UnionVector vector) { - super(null); - data = vector; - } - - public UnionWriter(UnionVector vector, FieldWriter parent) { - super(null); data = vector; } @@ -84,7 +74,7 @@ public void endList() { private MapWriter getMapWriter() { if (mapWriter == null) { - mapWriter = new SingleMapWriter(data.getMap(), null, true); + mapWriter = new SingleMapWriter(data.getMap()); mapWriter.setPosition(idx()); writers.add(mapWriter); } @@ -120,7 +110,7 @@ public ListWriter asList() { private ${name}Writer get${name}Writer() { if (${uncappedName}Writer == null) { - ${uncappedName}Writer = new Nullable${name}WriterImpl(data.get${name}Vector(), null); + ${uncappedName}Writer = new ${name}WriterImpl(data.get${name}Vector()); ${uncappedName}Writer.setPosition(idx()); writers.add(${uncappedName}Writer); } @@ -217,7 +207,7 @@ public void close() throws Exception { } @Override - public MaterializedField getField() { + public Field getField() { return data.getField(); } diff --git a/java/vector/src/main/codegen/templates/ValueHolders.java b/java/vector/src/main/codegen/templates/ValueHolders.java index 2b14194574a58..d744c523265f7 100644 --- a/java/vector/src/main/codegen/templates/ValueHolders.java +++ b/java/vector/src/main/codegen/templates/ValueHolders.java @@ -31,10 +31,6 @@ public final class ${className} implements ValueHolder{ - public static final MajorType TYPE = new MajorType(MinorType.${minor.class?upper_case}, DataMode.${mode.name?upper_case}); - - public MajorType getType() {return TYPE;} - <#if mode.name == "Repeated"> /** The first index (inclusive) into the Vector. 
**/ @@ -49,48 +45,13 @@ public final class ${className} implements ValueHolder{ <#else> public static final int WIDTH = ${type.width}; - <#if mode.name == "Optional">public int isSet; + <#if mode.name == "Optional">public int isSet; + <#else>public final int isSet = 1; <#assign fields = minor.fields!type.fields /> <#list fields as field> public ${field.type} ${field.name}; - <#if minor.class.startsWith("Decimal")> - public static final int maxPrecision = ${minor.maxPrecisionDigits}; - <#if minor.class.startsWith("Decimal28") || minor.class.startsWith("Decimal38")> - public static final int nDecimalDigits = ${minor.nDecimalDigits}; - - public static int getInteger(int index, int start, ArrowBuf buffer) { - int value = buffer.getInt(start + (index * 4)); - - if (index == 0) { - /* the first byte contains sign bit, return value without it */ - <#if minor.class.endsWith("Sparse")> - value = (value & 0x7FFFFFFF); - <#elseif minor.class.endsWith("Dense")> - value = (value & 0x0000007F); - - } - return value; - } - - public static void setInteger(int index, int value, int start, ArrowBuf buffer) { - buffer.setInt(start + (index * 4), value); - } - - public static void setSign(boolean sign, int start, ArrowBuf buffer) { - // Set MSB to 1 if sign is negative - if (sign == true) { - int value = getInteger(0, start, buffer); - setInteger(0, (value | 0x80000000), start, buffer); - } - } - - public static boolean getSign(int start, ArrowBuf buffer) { - return ((buffer.getInt(start) & 0x80000000) != 0); - } - - @Deprecated public int hashCode(){ throw new UnsupportedOperationException(); diff --git a/java/vector/src/main/codegen/templates/VariableLengthVectors.java b/java/vector/src/main/codegen/templates/VariableLengthVectors.java index 84fb3eb55674f..bcd639ab8c30c 100644 --- a/java/vector/src/main/codegen/templates/VariableLengthVectors.java +++ b/java/vector/src/main/codegen/templates/VariableLengthVectors.java @@ -56,9 +56,7 @@ public final class ${minor.class}Vector extends BaseDataValueVector implements V private static final int MIN_BYTE_COUNT = 4096; public final static String OFFSETS_VECTOR_NAME = "$offsets$"; - private final MaterializedField offsetsField = MaterializedField.create(OFFSETS_VECTOR_NAME, new MajorType(MinorType.UINT4, DataMode.REQUIRED)); - final UInt${type.width}Vector offsetVector = new UInt${type.width}Vector(offsetsField, allocator); - private final FieldReader reader = new ${minor.class}ReaderImpl(${minor.class}Vector.this); + final UInt${type.width}Vector offsetVector = new UInt${type.width}Vector(OFFSETS_VECTOR_NAME, allocator); private final Accessor accessor; private final Mutator mutator; @@ -68,16 +66,42 @@ public final class ${minor.class}Vector extends BaseDataValueVector implements V private int allocationSizeInBytes = INITIAL_BYTE_COUNT; private int allocationMonitor = 0; - public ${minor.class}Vector(MaterializedField field, BufferAllocator allocator) { - super(field, allocator); + <#if minor.class == "Decimal"> + + private final int precision; + private final int scale; + + public ${minor.class}Vector(String name, BufferAllocator allocator, int precision, int scale) { + super(name, allocator); + this.oAccessor = offsetVector.getAccessor(); + this.accessor = new Accessor(); + this.mutator = new Mutator(); + this.precision = precision; + this.scale = scale; + } + <#else> + + public ${minor.class}Vector(String name, BufferAllocator allocator) { + super(name, allocator); this.oAccessor = offsetVector.getAccessor(); this.accessor = new Accessor(); this.mutator = new 
Mutator(); } + + + @Override + public Field getField() { + throw new UnsupportedOperationException("internal vector"); + } + + @Override + public MinorType getMinorType() { + return MinorType.${minor.class?upper_case}; + } @Override public FieldReader getReader(){ - return reader; + throw new UnsupportedOperationException("internal vector"); } @Override @@ -125,27 +149,6 @@ public int getVarByteLength(){ return offsetVector.getAccessor().get(valueCount); } -// @Override -// public SerializedField getMetadata() { -// return getMetadataBuilder() // -// .addChild(offsetVector.getMetadata()) -// .setValueCount(getAccessor().getValueCount()) // -// .setBufferLength(getBufferSize()) // -// .build(); -// } -// -// @Override -// public void load(SerializedField metadata, ArrowBuf buffer) { -// the bits vector is the first child (the order in which the children are added in getMetadataBuilder is significant) -// final SerializedField offsetField = metadata.getChild(0); -// offsetVector.load(offsetField, buffer); -// -// final int capacity = buffer.capacity(); -// final int offsetsLength = offsetField.getBufferLength(); -// data = buffer.slice(offsetsLength, capacity - offsetsLength); -// data.retain(); -// } - @Override public void clear() { super.clear(); @@ -175,12 +178,12 @@ public long getOffsetAddr(){ @Override public TransferPair getTransferPair(BufferAllocator allocator){ - return new TransferImpl(getField(), allocator); + return new TransferImpl(name, allocator); } @Override public TransferPair getTransferPair(String ref, BufferAllocator allocator){ - return new TransferImpl(getField().withPath(ref), allocator); + return new TransferImpl(ref, allocator); } @Override @@ -241,8 +244,12 @@ public boolean copyFromSafe(int fromIndex, int thisIndex, ${minor.class}Vector f private class TransferImpl implements TransferPair{ ${minor.class}Vector to; - public TransferImpl(MaterializedField field, BufferAllocator allocator){ - to = new ${minor.class}Vector(field, allocator); + public TransferImpl(String name, BufferAllocator allocator){ + <#if minor.class == "Decimal"> + to = new ${minor.class}Vector(name, allocator, precision, scale); + <#else> + to = new ${minor.class}Vector(name, allocator); + } public TransferImpl(${minor.class}Vector to){ @@ -426,10 +433,10 @@ public void get(int index, Nullable${minor.class}Holder holder){ return text; } <#break> - <#case "Var16Char"> + <#case "Decimal"> @Override public ${friendlyType} getObject(int index) { - return new String(get(index), Charsets.UTF_16); + return new BigDecimal(new BigInteger(get(index)), scale); } <#break> <#default> diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java index b129ea9bcb95f..05b7cf1006723 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java @@ -20,7 +20,6 @@ import io.netty.buffer.ArrowBuf; import org.apache.arrow.memory.BufferAllocator; -import org.apache.arrow.vector.types.MaterializedField; public abstract class BaseDataValueVector extends BaseValueVector { @@ -29,8 +28,8 @@ public abstract class BaseDataValueVector extends BaseValueVector { protected ArrowBuf data; - public BaseDataValueVector(MaterializedField field, BufferAllocator allocator) { - super(field, allocator); + public BaseDataValueVector(String name, BufferAllocator allocator) { + super(name, allocator); data = allocator.getEmpty(); 
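In the VariableLengthVectors hunk above, the Var16Char case of getObject(index) becomes a Decimal case that rebuilds a BigDecimal from the stored unscaled bytes plus the vector's scale. A self-contained, JDK-only round trip showing why new BigDecimal(new BigInteger(bytes), scale) recovers the original value:

import java.math.BigDecimal;
import java.math.BigInteger;

public class DecimalBytesDemo {
  public static void main(String[] args) {
    BigDecimal original = new BigDecimal("12345.67");   // scale = 2

    // Store: the unscaled value (1234567) as big-endian two's-complement bytes.
    byte[] bytes = original.unscaledValue().toByteArray();

    // Read back, exactly as the generated getObject(index) does above.
    BigDecimal decoded = new BigDecimal(new BigInteger(bytes), 2);

    System.out.println(decoded);                        // prints 12345.67
  }
}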
} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseValueVector.java index 932e6f13caf2b..884cdf0910b8e 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BaseValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseValueVector.java @@ -19,8 +19,8 @@ import java.util.Iterator; +import com.google.flatbuffers.FlatBufferBuilder; import org.apache.arrow.memory.BufferAllocator; -import org.apache.arrow.vector.types.MaterializedField; import org.apache.arrow.vector.util.TransferPair; import org.slf4j.Logger; import org.slf4j.LoggerFactory; @@ -38,16 +38,16 @@ public abstract class BaseValueVector implements ValueVector { public static final int INITIAL_VALUE_ALLOCATION = 4096; protected final BufferAllocator allocator; - protected final MaterializedField field; + protected final String name; - protected BaseValueVector(MaterializedField field, BufferAllocator allocator) { - this.field = Preconditions.checkNotNull(field, "field cannot be null"); + protected BaseValueVector(String name, BufferAllocator allocator) { this.allocator = Preconditions.checkNotNull(allocator, "allocator cannot be null"); + this.name = name; } @Override public String toString() { - return super.toString() + "[field = " + field + ", ...]"; + return super.toString() + "[name = " + name + ", ...]"; } @Override @@ -60,30 +60,11 @@ public void close() { clear(); } - @Override - public MaterializedField getField() { - return field; - } - - public MaterializedField getField(String ref){ - return getField().withPath(ref); - } - @Override public TransferPair getTransferPair(BufferAllocator allocator) { - return getTransferPair(getField().getPath(), allocator); + return getTransferPair(name, allocator); } -// public static SerializedField getMetadata(BaseValueVector vector) { -// return getMetadataBuilder(vector).build(); -// } -// -// protected static SerializedField.Builder getMetadataBuilder(BaseValueVector vector) { -// return SerializedFieldHelper.getAsBuilder(vector.getField()) -// .setValueCount(vector.getAccessor().getValueCount()) -// .setBufferLength(vector.getBufferSize()); -// } - public abstract static class BaseAccessor implements ValueVector.Accessor { protected BaseAccessor() { } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java index c5bcb2decc43b..fee6e9cdef73d 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java @@ -21,11 +21,11 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.OutOfMemoryException; -import org.apache.arrow.vector.complex.impl.BitReaderImpl; import org.apache.arrow.vector.complex.reader.FieldReader; import org.apache.arrow.vector.holders.BitHolder; import org.apache.arrow.vector.holders.NullableBitHolder; -import org.apache.arrow.vector.types.MaterializedField; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.util.OversizedAllocationException; import org.apache.arrow.vector.util.TransferPair; @@ -37,7 +37,6 @@ public final class BitVector extends BaseDataValueVector implements FixedWidthVector { static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(BitVector.class); - private final FieldReader reader = new BitReaderImpl(BitVector.this); private final 
Accessor accessor = new Accessor(); private final Mutator mutator = new Mutator(); @@ -45,13 +44,23 @@ public final class BitVector extends BaseDataValueVector implements FixedWidthVe private int allocationSizeInBytes = INITIAL_VALUE_ALLOCATION; private int allocationMonitor = 0; - public BitVector(MaterializedField field, BufferAllocator allocator) { - super(field, allocator); + public BitVector(String name, BufferAllocator allocator) { + super(name, allocator); + } + + @Override + public Field getField() { + throw new UnsupportedOperationException("internal vector"); + } + + @Override + public MinorType getMinorType() { + return MinorType.BIT; } @Override public FieldReader getReader() { - return reader; + throw new UnsupportedOperationException("internal vector"); } @Override @@ -180,20 +189,6 @@ public boolean copyFromSafe(int inIndex, int outIndex, BitVector from) { return true; } -// @Override -// public void load(SerializedField metadata, DrillBuf buffer) { -// Preconditions.checkArgument(this.field.getPath().equals(metadata.getNamePart().getName()), "The field %s doesn't match the provided metadata %s.", this.field, metadata); -// final int valueCount = metadata.getValueCount(); -// final int expectedLength = getSizeFromCount(valueCount); -// final int actualLength = metadata.getBufferLength(); -// assert expectedLength == actualLength: "expected and actual buffer sizes do not match"; -// -// clear(); -// data = buffer.slice(0, actualLength); -// data.retain(); -// this.valueCount = valueCount; -// } - @Override public Mutator getMutator() { return new Mutator(); @@ -206,12 +201,12 @@ public Accessor getAccessor() { @Override public TransferPair getTransferPair(BufferAllocator allocator) { - return new TransferImpl(getField(), allocator); + return new TransferImpl(name, allocator); } @Override public TransferPair getTransferPair(String ref, BufferAllocator allocator) { - return new TransferImpl(getField().withPath(ref), allocator); + return new TransferImpl(ref, allocator); } @Override @@ -270,8 +265,8 @@ public void splitAndTransferTo(int startIndex, int length, BitVector target) { private class TransferImpl implements TransferPair { BitVector to; - public TransferImpl(MaterializedField field, BufferAllocator allocator) { - this.to = new BitVector(field, allocator); + public TransferImpl(String name, BufferAllocator allocator) { + this.to = new BitVector(name, allocator); } public TransferImpl(BitVector to) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ObjectVector.java b/java/vector/src/main/java/org/apache/arrow/vector/ObjectVector.java deleted file mode 100644 index b806b180e7014..0000000000000 --- a/java/vector/src/main/java/org/apache/arrow/vector/ObjectVector.java +++ /dev/null @@ -1,220 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
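BitVector's transfer machinery above is now keyed by a plain name rather than a MaterializedField. A toy sketch of the transfer-pair idiom itself, using hypothetical stand-in types rather than Arrow classes: the target takes ownership of the backing storage and the source is left cleared:

class SketchVector {
  final String name;
  int[] data = new int[0];

  SketchVector(String name) { this.name = name; }

  TransferSketch makeTransferPair(SketchVector target) {
    return new TransferSketch(this, target);
  }

  static final class TransferSketch {
    private final SketchVector from;
    private final SketchVector to;

    TransferSketch(SketchVector from, SketchVector to) {
      this.from = from;
      this.to = to;
    }

    void transfer() {
      to.data = from.data;       // target takes ownership of the storage
      from.data = new int[0];    // source is left cleared, as vector.clear() would leave it
    }
  }
}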
- * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.arrow.vector; - -import io.netty.buffer.ArrowBuf; - -import java.util.ArrayList; -import java.util.Iterator; -import java.util.List; - -import org.apache.arrow.memory.BufferAllocator; -import org.apache.arrow.memory.OutOfMemoryException; -import org.apache.arrow.vector.complex.reader.FieldReader; -import org.apache.arrow.vector.holders.ObjectHolder; -import org.apache.arrow.vector.types.MaterializedField; -import org.apache.arrow.vector.util.TransferPair; - -public class ObjectVector extends BaseValueVector { - private final Accessor accessor = new Accessor(); - private final Mutator mutator = new Mutator(); - private int maxCount = 0; - private int count = 0; - private int allocationSize = 4096; - - private List objectArrayList = new ArrayList<>(); - - public ObjectVector(MaterializedField field, BufferAllocator allocator) { - super(field, allocator); - } - - public void addNewArray() { - objectArrayList.add(new Object[allocationSize]); - maxCount += allocationSize; - } - - @Override - public FieldReader getReader() { - throw new UnsupportedOperationException("ObjectVector does not support this"); - } - - public final class Mutator implements ValueVector.Mutator { - - public void set(int index, Object obj) { - int listOffset = index / allocationSize; - if (listOffset >= objectArrayList.size()) { - addNewArray(); - } - objectArrayList.get(listOffset)[index % allocationSize] = obj; - } - - public boolean setSafe(int index, long value) { - set(index, value); - return true; - } - - protected void set(int index, ObjectHolder holder) { - set(index, holder.obj); - } - - public boolean setSafe(int index, ObjectHolder holder){ - set(index, holder); - return true; - } - - @Override - public void setValueCount(int valueCount) { - count = valueCount; - } - - @Override - public void reset() { - count = 0; - maxCount = 0; - objectArrayList = new ArrayList<>(); - addNewArray(); - } - - @Override - public void generateTestData(int values) { - } - } - - @Override - public void setInitialCapacity(int numRecords) { - // NoOp - } - - @Override - public void allocateNew() throws OutOfMemoryException { - addNewArray(); - } - - public void allocateNew(int valueCount) throws OutOfMemoryException { - while (maxCount < valueCount) { - addNewArray(); - } - } - - @Override - public boolean allocateNewSafe() { - allocateNew(); - return true; - } - - @Override - public int getBufferSize() { - throw new UnsupportedOperationException("ObjectVector does not support this"); - } - - @Override - public int getBufferSizeFor(final int valueCount) { - throw new UnsupportedOperationException("ObjectVector does not support this"); - } - - @Override - public void close() { - clear(); - } - - @Override - public void clear() { - objectArrayList.clear(); - maxCount = 0; - count = 0; - } - - @Override - public MaterializedField getField() { - return field; - } - - @Override - public TransferPair getTransferPair(BufferAllocator allocator) { - throw new UnsupportedOperationException("ObjectVector does not support this"); - } - - @Override - public TransferPair makeTransferPair(ValueVector to) { - throw new UnsupportedOperationException("ObjectVector does not support this"); - } - - @Override - public TransferPair getTransferPair(String ref, BufferAllocator allocator) { - throw new UnsupportedOperationException("ObjectVector does not support this"); - } - - @Override - public int getValueCapacity() { - 
return maxCount; - } - - @Override - public Accessor getAccessor() { - return accessor; - } - - @Override - public ArrowBuf[] getBuffers(boolean clear) { - throw new UnsupportedOperationException("ObjectVector does not support this"); - } - -// @Override -// public void load(UserBitShared.SerializedField metadata, DrillBuf buffer) { -// throw new UnsupportedOperationException("ObjectVector does not support this"); -// } -// -// @Override -// public UserBitShared.SerializedField getMetadata() { -// throw new UnsupportedOperationException("ObjectVector does not support this"); -// } - - @Override - public Mutator getMutator() { - return mutator; - } - - @Override - public Iterator iterator() { - throw new UnsupportedOperationException("ObjectVector does not support this"); - } - - public final class Accessor extends BaseAccessor { - @Override - public Object getObject(int index) { - int listOffset = index / allocationSize; - if (listOffset >= objectArrayList.size()) { - addNewArray(); - } - return objectArrayList.get(listOffset)[index % allocationSize]; - } - - @Override - public int getValueCount() { - return count; - } - - public Object get(int index) { - return getObject(index); - } - - public void get(int index, ObjectHolder holder){ - holder.obj = getObject(index); - } - } -} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ValueHolderHelper.java b/java/vector/src/main/java/org/apache/arrow/vector/ValueHolderHelper.java deleted file mode 100644 index 61ce285d61b0c..0000000000000 --- a/java/vector/src/main/java/org/apache/arrow/vector/ValueHolderHelper.java +++ /dev/null @@ -1,203 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
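The file whose removal begins here (ValueHolderHelper, continuing below) provided factory methods for holder structs; with it gone, callers populate holders directly. A minimal sketch, assuming only holder classes and fields (value, isSet) that appear in the deleted code that follows:

import org.apache.arrow.vector.holders.IntHolder;
import org.apache.arrow.vector.holders.NullableBitHolder;

public class HolderDemo {
  public static void main(String[] args) {
    IntHolder h = new IntHolder();        // plain mutable struct, no factory needed
    h.value = 42;

    NullableBitHolder b = new NullableBitHolder();
    b.isSet = 1;                          // 1 marks the slot as set (non-null)
    b.value = 1;                          // bit value: 1 == true
  }
}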
- */ -package org.apache.arrow.vector; - -import io.netty.buffer.ArrowBuf; - -import java.math.BigDecimal; - -import org.apache.arrow.memory.BufferAllocator; -import org.apache.arrow.vector.holders.BigIntHolder; -import org.apache.arrow.vector.holders.BitHolder; -import org.apache.arrow.vector.holders.DateHolder; -import org.apache.arrow.vector.holders.Decimal18Holder; -import org.apache.arrow.vector.holders.Decimal28SparseHolder; -import org.apache.arrow.vector.holders.Decimal38SparseHolder; -import org.apache.arrow.vector.holders.Decimal9Holder; -import org.apache.arrow.vector.holders.Float4Holder; -import org.apache.arrow.vector.holders.Float8Holder; -import org.apache.arrow.vector.holders.IntHolder; -import org.apache.arrow.vector.holders.IntervalDayHolder; -import org.apache.arrow.vector.holders.IntervalYearHolder; -import org.apache.arrow.vector.holders.NullableBitHolder; -import org.apache.arrow.vector.holders.TimeHolder; -import org.apache.arrow.vector.holders.TimeStampHolder; -import org.apache.arrow.vector.holders.VarCharHolder; -import org.apache.arrow.vector.util.DecimalUtility; - -import com.google.common.base.Charsets; - - -public class ValueHolderHelper { - static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ValueHolderHelper.class); - - public static IntHolder getIntHolder(int value) { - IntHolder holder = new IntHolder(); - holder.value = value; - - return holder; - } - - public static BigIntHolder getBigIntHolder(long value) { - BigIntHolder holder = new BigIntHolder(); - holder.value = value; - - return holder; - } - - public static Float4Holder getFloat4Holder(float value) { - Float4Holder holder = new Float4Holder(); - holder.value = value; - - return holder; - } - - public static Float8Holder getFloat8Holder(double value) { - Float8Holder holder = new Float8Holder(); - holder.value = value; - - return holder; - } - - public static DateHolder getDateHolder(long value) { - DateHolder holder = new DateHolder(); - holder.value = value; - return holder; - } - - public static TimeHolder getTimeHolder(int value) { - TimeHolder holder = new TimeHolder(); - holder.value = value; - return holder; - } - - public static TimeStampHolder getTimeStampHolder(long value) { - TimeStampHolder holder = new TimeStampHolder(); - holder.value = value; - return holder; - } - - public static BitHolder getBitHolder(int value) { - BitHolder holder = new BitHolder(); - holder.value = value; - - return holder; - } - - public static NullableBitHolder getNullableBitHolder(boolean isNull, int value) { - NullableBitHolder holder = new NullableBitHolder(); - holder.isSet = isNull? 0 : 1; - if (! 
isNull) { - holder.value = value; - } - - return holder; - } - - public static VarCharHolder getVarCharHolder(ArrowBuf buf, String s){ - VarCharHolder vch = new VarCharHolder(); - - byte[] b = s.getBytes(Charsets.UTF_8); - vch.start = 0; - vch.end = b.length; - vch.buffer = buf.reallocIfNeeded(b.length); - vch.buffer.setBytes(0, b); - return vch; - } - - public static VarCharHolder getVarCharHolder(BufferAllocator a, String s){ - VarCharHolder vch = new VarCharHolder(); - - byte[] b = s.getBytes(Charsets.UTF_8); - vch.start = 0; - vch.end = b.length; - vch.buffer = a.buffer(b.length); // - vch.buffer.setBytes(0, b); - return vch; - } - - - public static IntervalYearHolder getIntervalYearHolder(int intervalYear) { - IntervalYearHolder holder = new IntervalYearHolder(); - - holder.value = intervalYear; - return holder; - } - - public static IntervalDayHolder getIntervalDayHolder(int days, int millis) { - IntervalDayHolder dch = new IntervalDayHolder(); - - dch.days = days; - dch.milliseconds = millis; - return dch; - } - - public static Decimal9Holder getDecimal9Holder(int decimal, int scale, int precision) { - Decimal9Holder dch = new Decimal9Holder(); - - dch.scale = scale; - dch.precision = precision; - dch.value = decimal; - - return dch; - } - - public static Decimal18Holder getDecimal18Holder(long decimal, int scale, int precision) { - Decimal18Holder dch = new Decimal18Holder(); - - dch.scale = scale; - dch.precision = precision; - dch.value = decimal; - - return dch; - } - - public static Decimal28SparseHolder getDecimal28Holder(ArrowBuf buf, String decimal) { - - Decimal28SparseHolder dch = new Decimal28SparseHolder(); - - BigDecimal bigDecimal = new BigDecimal(decimal); - - dch.scale = bigDecimal.scale(); - dch.precision = bigDecimal.precision(); - Decimal28SparseHolder.setSign(bigDecimal.signum() == -1, dch.start, dch.buffer); - dch.start = 0; - dch.buffer = buf.reallocIfNeeded(5 * DecimalUtility.INTEGER_SIZE); - DecimalUtility - .getSparseFromBigDecimal(bigDecimal, dch.buffer, dch.start, dch.scale, dch.precision, dch.nDecimalDigits); - - return dch; - } - - public static Decimal38SparseHolder getDecimal38Holder(ArrowBuf buf, String decimal) { - - Decimal38SparseHolder dch = new Decimal38SparseHolder(); - - BigDecimal bigDecimal = new BigDecimal(decimal); - - dch.scale = bigDecimal.scale(); - dch.precision = bigDecimal.precision(); - Decimal38SparseHolder.setSign(bigDecimal.signum() == -1, dch.start, dch.buffer); - dch.start = 0; - dch.buffer = buf.reallocIfNeeded(dch.maxPrecision * DecimalUtility.INTEGER_SIZE); - DecimalUtility - .getSparseFromBigDecimal(bigDecimal, dch.buffer, dch.start, dch.scale, dch.precision, dch.nDecimalDigits); - - return dch; - } -} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java index a170c59abd7cc..35321c947db0b 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java @@ -24,8 +24,9 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.OutOfMemoryException; import org.apache.arrow.vector.complex.reader.FieldReader; -import org.apache.arrow.vector.types.MaterializedField; +import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.util.TransferPair; +import org.apache.arrow.vector.types.pojo.Field; /** * An abstraction that is used to store a sequence of values in an individual column. 
@@ -33,8 +34,7 @@ * A {@link ValueVector value vector} stores underlying data in-memory in a columnar fashion that is compact and * efficient. The column whose data is stored, is referred by {@link #getField()}. * - * A vector when instantiated, relies on a {@link org.apache.drill.exec.record.DeadBuf dead buffer}. It is important - * that vector is allocated before attempting to read or write. + * It is important that the vector is allocated before attempting to read or write. * * There are a few "rules" around vectors: * @@ -94,7 +94,9 @@ public interface ValueVector extends Closeable, Iterable { /** * Get information about how this field is materialized. */ - MaterializedField getField(); + Field getField(); + + MinorType getMinorType(); /** * Returns a {@link org.apache.arrow.vector.util.TransferPair transfer pair}, creating a new target vector of diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VectorDescriptor.java b/java/vector/src/main/java/org/apache/arrow/vector/VectorDescriptor.java deleted file mode 100644 index fdad99a333258..0000000000000 --- a/java/vector/src/main/java/org/apache/arrow/vector/VectorDescriptor.java +++ /dev/null @@ -1,83 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License.
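In the ValueVector hunk above, the interface now exposes Field getField() and MinorType getMinorType() in place of MaterializedField. A sketch of consuming the new metadata API by walking a Field tree; Field#getName() appears elsewhere in this patch, while Field#getChildren() is an assumed accessor used here only for illustration:

import java.util.List;
import org.apache.arrow.vector.types.pojo.Field;

public class SchemaWalk {
  // Print a field and, recursively, its children with increasing indentation.
  public static void printSchema(Field field, String indent) {
    System.out.println(indent + field.getName());
    List<Field> children = field.getChildren();   // assumed accessor
    if (children != null) {
      for (Field child : children) {
        printSchema(child, indent + "  ");
      }
    }
  }
}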
- */ -package org.apache.arrow.vector; - -import java.util.Collection; - -import com.google.common.base.Preconditions; - -import org.apache.arrow.vector.types.MaterializedField; -import org.apache.arrow.vector.types.Types.MajorType; - -public class VectorDescriptor { - private static final String DEFAULT_NAME = "NONE"; - - private final MaterializedField field; - - public VectorDescriptor(final MajorType type) { - this(DEFAULT_NAME, type); - } - - public VectorDescriptor(final String name, final MajorType type) { - this(MaterializedField.create(name, type)); - } - - public VectorDescriptor(final MaterializedField field) { - this.field = Preconditions.checkNotNull(field, "field cannot be null"); - } - - public MaterializedField getField() { - return field; - } - - public MajorType getType() { - return field.getType(); - } - - public String getName() { - return field.getLastName(); - } - - public Collection getChildren() { - return field.getChildren(); - } - - public boolean hasName() { - return getName() != DEFAULT_NAME; - } - - public VectorDescriptor withName(final String name) { - return new VectorDescriptor(field.withPath(name)); - } - - public VectorDescriptor withType(final MajorType type) { - return new VectorDescriptor(field.withType(type)); - } - - public static VectorDescriptor create(final String name, final MajorType type) { - return new VectorDescriptor(name, type); - } - - public static VectorDescriptor create(final MajorType type) { - return new VectorDescriptor(type); - } - - public static VectorDescriptor create(final MaterializedField field) { - return new VectorDescriptor(field); - } -} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java b/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java index c94e8d1db090c..705a24b02fe78 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java @@ -17,18 +17,20 @@ */ package org.apache.arrow.vector; +import com.google.flatbuffers.FlatBufferBuilder; import io.netty.buffer.ArrowBuf; import java.util.Collections; import java.util.Iterator; +import org.apache.arrow.flatbuf.Type; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.OutOfMemoryException; import org.apache.arrow.vector.complex.impl.NullReader; import org.apache.arrow.vector.complex.reader.FieldReader; -import org.apache.arrow.vector.types.MaterializedField; -import org.apache.arrow.vector.types.Types; import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.ArrowType.Null; +import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.util.TransferPair; import com.google.common.collect.Iterators; @@ -36,7 +38,7 @@ public class ZeroVector implements ValueVector { public final static ZeroVector INSTANCE = new ZeroVector(); - private final MaterializedField field = MaterializedField.create("[DEFAULT]", Types.required(MinorType.LATE)); + private final String name = "[DEFAULT]"; private final TransferPair defaultPair = new TransferPair() { @Override @@ -91,24 +93,21 @@ public void close() { } public void clear() { } @Override - public MaterializedField getField() { - return field; + public Field getField() { + return new Field(name, true, new Null(), null); } + @Override + public MinorType getMinorType() { + return MinorType.NULL; + } + + @Override public TransferPair getTransferPair(BufferAllocator allocator) { return defaultPair; } -// @Override -// public 
UserBitShared.SerializedField getMetadata() { -// return getField() -// .getAsBuilder() -// .setBufferLength(getBufferSize()) -// .setValueCount(getAccessor().getValueCount()) -// .build(); -// } - @Override public Iterator iterator() { return Collections.emptyIterator(); @@ -176,7 +175,4 @@ public Mutator getMutator() { public FieldReader getReader() { return NullReader.INSTANCE; } - -// @Override -// public void load(UserBitShared.SerializedField metadata, DrillBuf buffer) { } } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java index 9fae2382ecb24..ed7797576d679 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java @@ -21,12 +21,10 @@ import javax.annotation.Nullable; +import org.apache.arrow.flatbuf.Field; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.OutOfMemoryException; import org.apache.arrow.vector.ValueVector; -import org.apache.arrow.vector.types.MaterializedField; -import org.apache.arrow.vector.types.Types.DataMode; -import org.apache.arrow.vector.types.Types.MajorType; import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.util.CallBack; @@ -43,12 +41,12 @@ public abstract class AbstractContainerVector implements ValueVector { static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(AbstractContainerVector.class); - protected MaterializedField field; + protected final String name; protected final BufferAllocator allocator; protected final CallBack callBack; - protected AbstractContainerVector(MaterializedField field, BufferAllocator allocator, CallBack callBack) { - this.field = Preconditions.checkNotNull(field); + protected AbstractContainerVector(String name, BufferAllocator allocator, CallBack callBack) { + this.name = name; this.allocator = allocator; this.callBack = callBack; } @@ -64,14 +62,6 @@ public BufferAllocator getAllocator() { return allocator; } - /** - * Returns the field definition of this instance. - */ - @Override - public MaterializedField getField() { - return field; - } - /** * Returns a {@link org.apache.arrow.vector.ValueVector} corresponding to the given field name if exists or null. */ @@ -79,19 +69,6 @@ public ValueVector getChild(String name) { return getChild(name, ValueVector.class); } - /** - * Returns a sequence of field names in the order that they show up in the schema. - */ - protected Collection getChildFieldNames() { - return Sets.newLinkedHashSet(Iterables.transform(field.getChildren(), new Function() { - @Nullable - @Override - public String apply(MaterializedField field) { - return Preconditions.checkNotNull(field).getLastName(); - } - })); - } - /** * Clears out all underlying child vectors. */ @@ -109,22 +86,6 @@ protected T typeify(ValueVector v, Class clazz) { throw new IllegalStateException(String.format("Vector requested [%s] was different than type stored [%s]. Arrow doesn't yet support hetergenous types.", clazz.getSimpleName(), v.getClass().getSimpleName())); } - MajorType getLastPathType() { - if((this.getField().getType().getMinorType() == MinorType.LIST && - this.getField().getType().getMode() == DataMode.REPEATED)) { // Use Repeated scalar type instead of Required List. - VectorWithOrdinal vord = getChildVectorWithOrdinal(null); - ValueVector v = vord.vector; - if (! 
(v instanceof AbstractContainerVector)) { - return v.getField().getType(); - } - } else if (this.getField().getType().getMinorType() == MinorType.MAP && - this.getField().getType().getMode() == DataMode.REPEATED) { // Use Required Map - return new MajorType(MinorType.MAP, DataMode.REQUIRED); - } - - return this.getField().getType(); - } - protected boolean supportsDirectRead() { return false; } @@ -133,7 +94,7 @@ protected boolean supportsDirectRead() { public abstract int size(); // add a new vector with the input MajorType or return the existing vector if we already added one with the same type - public abstract T addOrGet(String name, MajorType type, Class clazz); + public abstract T addOrGet(String name, MinorType minorType, Class clazz, int... precisionScale); // return the child vector with the input name public abstract T getChild(String name, Class clazz); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java index de6ae829b476d..5964f80079141 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java @@ -17,17 +17,17 @@ */ package org.apache.arrow.vector.complex; +import com.google.common.collect.ImmutableList; import io.netty.buffer.ArrowBuf; -import java.util.Collection; +import java.util.ArrayList; import java.util.Iterator; import java.util.List; +import org.apache.arrow.flatbuf.Field; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.ValueVector; -import org.apache.arrow.vector.types.MaterializedField; -import org.apache.arrow.vector.types.Types.MajorType; -import org.apache.arrow.vector.util.BasicTypeHelper; +import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.util.MapWithOrdinal; @@ -43,17 +43,8 @@ public abstract class AbstractMapVector extends AbstractContainerVector { // Maintains a map with key as field name and value is the vector itself private final MapWithOrdinal vectors = new MapWithOrdinal<>(); - protected AbstractMapVector(MaterializedField field, BufferAllocator allocator, CallBack callBack) { - super(field.clone(), allocator, callBack); - MaterializedField clonedField = field.clone(); - // create the hierarchy of the child vectors based on the materialized field - for (MaterializedField child : clonedField.getChildren()) { - if (!child.equals(BaseRepeatedValueVector.OFFSETS_FIELD)) { - final String fieldName = child.getLastName(); - final ValueVector v = BasicTypeHelper.getNewVector(child, allocator, callBack); - putVector(fieldName, v); - } - } + protected AbstractMapVector(String name, BufferAllocator allocator, CallBack callBack) { + super(name, allocator, callBack); } @Override @@ -109,8 +100,8 @@ public boolean allocateNewSafe() { * * * - * @param name name of the field - * @param type type of the field + * @param name the name of the field + * @param minorType the minorType for the vector * @param clazz class of expected vector type * @param class type of expected vector type * @throws java.lang.IllegalStateException raised if there is a hard schema change @@ -118,7 +109,7 @@ public boolean allocateNewSafe() { * @return resultant {@link org.apache.arrow.vector.ValueVector} */ @Override - public T addOrGet(String name, MajorType type, Class clazz) { + public T addOrGet(String name, MinorType minorType, Class 
clazz, int... precisionScale) { final ValueVector existing = getChild(name); boolean create = false; if (existing == null) { @@ -130,7 +121,7 @@ public T addOrGet(String name, MajorType type, Class create = true; } if (create) { - final T vector = (T) BasicTypeHelper.getNewVector(name, allocator, type, callBack); + final T vector = (T) minorType.getNewVector(name, allocator, callBack, precisionScale); putChild(name, vector); if (callBack!=null) { callBack.doWork(); @@ -177,7 +168,6 @@ public T getChild(String name, Class clazz) { */ protected void putChild(String name, ValueVector vector) { putVector(name, vector); - field.addChild(vector.getField()); } /** @@ -199,8 +189,21 @@ protected void putVector(String name, ValueVector vector) { /** * Returns a sequence of underlying child vectors. */ - protected Collection getChildren() { - return vectors.values(); + protected List getChildren() { + int size = vectors.size(); + List children = new ArrayList<>(); + for (int i = 0; i < size; i++) { + children.add(vectors.getByOrdinal(i)); + } + return children; + } + + protected List getChildFieldNames() { + ImmutableList.Builder builder = ImmutableList.builder(); + for (ValueVector child : getChildren()) { + builder.add(child.getField().getName()); + } + return builder.build(); } /** diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java index 6518897fb780d..42262741df93d 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java @@ -22,22 +22,18 @@ import java.util.Collections; import java.util.Iterator; +import org.apache.arrow.flatbuf.Type; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.AddOrGetResult; import org.apache.arrow.vector.BaseValueVector; import org.apache.arrow.vector.UInt4Vector; import org.apache.arrow.vector.ValueVector; -import org.apache.arrow.vector.VectorDescriptor; import org.apache.arrow.vector.ZeroVector; -import org.apache.arrow.vector.types.MaterializedField; -import org.apache.arrow.vector.types.Types.DataMode; -import org.apache.arrow.vector.types.Types.MajorType; -import org.apache.arrow.vector.types.Types.MinorType; -import org.apache.arrow.vector.util.BasicTypeHelper; -import org.apache.arrow.vector.util.SchemaChangeRuntimeException; import com.google.common.base.Preconditions; import com.google.common.collect.ObjectArrays; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.util.SchemaChangeRuntimeException; public abstract class BaseRepeatedValueVector extends BaseValueVector implements RepeatedValueVector { @@ -45,19 +41,16 @@ public abstract class BaseRepeatedValueVector extends BaseValueVector implements public final static String OFFSETS_VECTOR_NAME = "$offsets$"; public final static String DATA_VECTOR_NAME = "$data$"; - public final static MaterializedField OFFSETS_FIELD = - MaterializedField.create(OFFSETS_VECTOR_NAME, new MajorType(MinorType.UINT4, DataMode.REQUIRED)); - protected final UInt4Vector offsets; protected ValueVector vector; - protected BaseRepeatedValueVector(MaterializedField field, BufferAllocator allocator) { - this(field, allocator, DEFAULT_DATA_VECTOR); + protected BaseRepeatedValueVector(String name, BufferAllocator allocator) { + this(name, allocator, DEFAULT_DATA_VECTOR); } - protected 
BaseRepeatedValueVector(MaterializedField field, BufferAllocator allocator, ValueVector vector) { - super(field, allocator); - this.offsets = new UInt4Vector(OFFSETS_FIELD, allocator); + protected BaseRepeatedValueVector(String name, BufferAllocator allocator, ValueVector vector) { + super(name, allocator); + this.offsets = new UInt4Vector(OFFSETS_VECTOR_NAME, allocator); this.vector = Preconditions.checkNotNull(vector, "data vector cannot be null"); } @@ -109,13 +102,6 @@ public int getValueCapacity() { return Math.min(vector.getValueCapacity(), offsetValueCapacity); } -// @Override -// protected UserBitShared.SerializedField.Builder getMetadataBuilder() { -// return super.getMetadataBuilder() -// .addChild(offsets.getMetadata()) -// .addChild(vector.getMetadata()); -// } - @Override public int getBufferSize() { if (getAccessor().getValueCount() == 0) { @@ -157,47 +143,24 @@ public ArrowBuf[] getBuffers(boolean clear) { return buffers; } -// @Override -// public void load(UserBitShared.SerializedField metadata, DrillBuf buffer) { -// final UserBitShared.SerializedField offsetMetadata = metadata.getChild(0); -// offsets.load(offsetMetadata, buffer); -// -// final UserBitShared.SerializedField vectorMetadata = metadata.getChild(1); -// if (getDataVector() == DEFAULT_DATA_VECTOR) { -// addOrGetVector(VectorDescriptor.create(vectorMetadata.getMajorType())); -// } -// -// final int offsetLength = offsetMetadata.getBufferLength(); -// final int vectorLength = vectorMetadata.getBufferLength(); -// vector.load(vectorMetadata, buffer.slice(offsetLength, vectorLength)); -// } - /** * Returns 1 if inner vector is explicitly set via #addOrGetVector else 0 - * - * @see {@link ContainerVectorLike#size} */ - @Override public int size() { return vector == DEFAULT_DATA_VECTOR ? 0:1; } - @Override - public AddOrGetResult addOrGetVector(VectorDescriptor descriptor) { + public AddOrGetResult addOrGetVector(MinorType minorType) { boolean created = false; - if (vector == DEFAULT_DATA_VECTOR && descriptor.getType().getMinorType() != MinorType.LATE) { - final MaterializedField field = descriptor.withName(DATA_VECTOR_NAME).getField(); - vector = BasicTypeHelper.getNewVector(field, allocator); + if (vector instanceof ZeroVector) { + vector = minorType.getNewVector(DATA_VECTOR_NAME, allocator, null); // returned vector must have the same field - assert field.equals(vector.getField()); - getField().addChild(field); created = true; } - final MajorType actual = vector.getField().getType(); - if (!actual.equals(descriptor.getType())) { + if (vector.getField().getType().getTypeType() != minorType.getType().getTypeType()) { final String msg = String.format("Inner vector type mismatch. Requested type: [%s], actual type: [%s]", - descriptor.getType(), actual); + Type.name(minorType.getType().getTypeType()), Type.name(vector.getField().getType().getTypeType())); throw new SchemaChangeRuntimeException(msg); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/ContainerVectorLike.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/ContainerVectorLike.java deleted file mode 100644 index 655b55a6aa2c6..0000000000000 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/ContainerVectorLike.java +++ /dev/null @@ -1,43 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. 
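The addOrGetVector(MinorType) rewrite above materializes the inner data vector on first use (replacing the ZeroVector placeholder) and rejects any later request for a different type. A compact sketch of that placeholder-then-promote pattern, with plain strings standing in for vector types and the error message mirroring the one in the hunk:

class InnerVectorSketch {
  private static final String PLACEHOLDER = "$zero$";
  private String innerType = PLACEHOLDER;

  boolean addOrGet(String requestedType) {
    boolean created = false;
    if (innerType.equals(PLACEHOLDER)) {
      innerType = requestedType;              // first concrete type wins
      created = true;
    }
    if (!innerType.equals(requestedType)) {   // a later mismatch is a schema change
      throw new IllegalStateException(String.format(
          "Inner vector type mismatch. Requested type: [%s], actual type: [%s]",
          requestedType, innerType));
    }
    return created;
  }
}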
The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.arrow.vector.complex; - -import org.apache.arrow.vector.AddOrGetResult; -import org.apache.arrow.vector.ValueVector; -import org.apache.arrow.vector.VectorDescriptor; - -/** - * A mix-in used for introducing container vector-like behaviour. - */ -public interface ContainerVectorLike { - - /** - * Creates and adds a child vector if none with the same name exists, else returns the vector instance. - * - * @param descriptor vector descriptor - * @return result of operation wrapping vector corresponding to the given descriptor and whether it's newly created - * @throws org.apache.arrow.vector.util.SchemaChangeRuntimeException - * if schema change is not permissible between the given and existing data vector types. - */ - AddOrGetResult addOrGetVector(VectorDescriptor descriptor); - - /** - * Returns the number of child vectors in this container vector-like instance. - */ - int size(); -} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java index 3e60c76802380..c6c6b090db6b0 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java @@ -18,6 +18,8 @@ ******************************************************************************/ package org.apache.arrow.vector.complex; +import com.google.common.collect.ImmutableList; +import com.google.flatbuffers.FlatBufferBuilder; import io.netty.buffer.ArrowBuf; import java.util.List; @@ -28,17 +30,14 @@ import org.apache.arrow.vector.UInt1Vector; import org.apache.arrow.vector.UInt4Vector; import org.apache.arrow.vector.ValueVector; -import org.apache.arrow.vector.VectorDescriptor; import org.apache.arrow.vector.ZeroVector; import org.apache.arrow.vector.complex.impl.ComplexCopier; import org.apache.arrow.vector.complex.impl.UnionListReader; import org.apache.arrow.vector.complex.impl.UnionListWriter; import org.apache.arrow.vector.complex.reader.FieldReader; import org.apache.arrow.vector.complex.writer.FieldWriter; -import org.apache.arrow.vector.types.MaterializedField; -import org.apache.arrow.vector.types.Types.DataMode; -import org.apache.arrow.vector.types.Types.MajorType; import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.util.JsonStringArrayList; import org.apache.arrow.vector.util.TransferPair; @@ -55,11 +54,10 @@ public class ListVector extends BaseRepeatedValueVector { private UnionListReader reader; private CallBack callBack; - public ListVector(MaterializedField field, BufferAllocator allocator, CallBack callBack) { - super(field, allocator); - this.bits = new UInt1Vector(MaterializedField.create("$bits$", new MajorType(MinorType.UINT1, DataMode.REQUIRED)), allocator); + public ListVector(String name, 
BufferAllocator allocator, CallBack callBack) { + super(name, allocator); + this.bits = new UInt1Vector("$bits$", allocator); offsets = getOffsetVector(); - this.field.addChild(getDataVector().getField()); this.writer = new UnionListWriter(this); this.reader = new UnionListReader(this); this.callBack = callBack; @@ -75,15 +73,6 @@ public void allocateNew() throws OutOfMemoryException { bits.allocateNewSafe(); } - public void transferTo(ListVector target) { - offsets.makeTransferPair(target.offsets).transfer(); - bits.makeTransferPair(target.bits).transfer(); - if (target.getDataVector() instanceof ZeroVector) { - target.addOrGetVector(new VectorDescriptor(vector.getField().getType())); - } - getDataVector().makeTransferPair(target.getDataVector()).transfer(); - } - public void copyFromSafe(int inIndex, int outIndex, ListVector from) { copyFrom(inIndex, outIndex, from); } @@ -103,7 +92,7 @@ public ValueVector getDataVector() { @Override public TransferPair getTransferPair(String ref, BufferAllocator allocator) { - return new TransferImpl(field.withPath(ref), allocator); + return new TransferImpl(ref, allocator); } @Override @@ -114,20 +103,28 @@ public TransferPair makeTransferPair(ValueVector target) { private class TransferImpl implements TransferPair { ListVector to; + TransferPair pairs[] = new TransferPair[3]; - public TransferImpl(MaterializedField field, BufferAllocator allocator) { - to = new ListVector(field, allocator, null); - to.addOrGetVector(new VectorDescriptor(vector.getField().getType())); + public TransferImpl(String name, BufferAllocator allocator) { + this(new ListVector(name, allocator, null)); } public TransferImpl(ListVector to) { this.to = to; - to.addOrGetVector(new VectorDescriptor(vector.getField().getType())); + to.addOrGetVector(vector.getMinorType()); + pairs[0] = offsets.makeTransferPair(to.offsets); + pairs[1] = bits.makeTransferPair(to.bits); + if (to.getDataVector() instanceof ZeroVector) { + to.addOrGetVector(vector.getMinorType()); + } + pairs[2] = getDataVector().makeTransferPair(to.getDataVector()); } @Override public void transfer() { - transferTo(to); + for (TransferPair pair : pairs) { + pair.transfer(); + } } @Override @@ -190,17 +187,8 @@ public boolean allocateNewSafe() { return success; } -// @Override -// protected UserBitShared.SerializedField.Builder getMetadataBuilder() { -// return getField().getAsBuilder() -// .setValueCount(getAccessor().getValueCount()) -// .setBufferLength(getBufferSize()) -// .addChild(offsets.getMetadata()) -// .addChild(bits.getMetadata()) -// .addChild(vector.getMetadata()); -// } - public AddOrGetResult addOrGetVector(VectorDescriptor descriptor) { - AddOrGetResult result = super.addOrGetVector(descriptor); + public AddOrGetResult addOrGetVector(MinorType minorType) { + AddOrGetResult result = super.addOrGetVector(minorType); reader = new UnionListReader(this); return result; } @@ -213,6 +201,17 @@ public int getBufferSize() { return offsets.getBufferSize() + bits.getBufferSize() + vector.getBufferSize(); } + @Override + public Field getField() { + return new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.List(), + ImmutableList.of(getDataVector().getField())); + } + + @Override + public MinorType getMinorType() { + return MinorType.LIST; + } + @Override public void clear() { offsets.clear(); @@ -235,28 +234,8 @@ public ArrowBuf[] getBuffers(boolean clear) { return buffers; } -// @Override -// public void load(UserBitShared.SerializedField metadata, DrillBuf buffer) { -// final 
UserBitShared.SerializedField offsetMetadata = metadata.getChild(0); -// offsets.load(offsetMetadata, buffer); -// -// final int offsetLength = offsetMetadata.getBufferLength(); -// final UserBitShared.SerializedField bitMetadata = metadata.getChild(1); -// final int bitLength = bitMetadata.getBufferLength(); -// bits.load(bitMetadata, buffer.slice(offsetLength, bitLength)); -// -// final UserBitShared.SerializedField vectorMetadata = metadata.getChild(2); -// if (getDataVector() == DEFAULT_DATA_VECTOR) { -// addOrGetVector(VectorDescriptor.create(vectorMetadata.getMajorType())); -// } -// -// final int vectorLength = vectorMetadata.getBufferLength(); -// vector.load(vectorMetadata, buffer.slice(offsetLength + bitLength, vectorLength)); -// } - public UnionVector promoteToUnion() { - MaterializedField newField = MaterializedField.create(getField().getPath(), new MajorType(MinorType.UNION, DataMode.OPTIONAL)); - UnionVector vector = new UnionVector(newField, allocator, null); + UnionVector vector = new UnionVector(name, allocator, null); replaceDataVector(vector); reader = new UnionListReader(this); return vector; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java index cc0953a1af8ba..0cb613e2f7acf 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java @@ -19,8 +19,10 @@ import io.netty.buffer.ArrowBuf; +import java.util.ArrayList; import java.util.Collection; import java.util.Iterator; +import java.util.List; import java.util.Map; import javax.annotation.Nullable; @@ -28,14 +30,13 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.BaseValueVector; import org.apache.arrow.vector.ValueVector; -import org.apache.arrow.vector.complex.RepeatedMapVector.MapSingleCopier; import org.apache.arrow.vector.complex.impl.SingleMapReaderImpl; import org.apache.arrow.vector.complex.reader.FieldReader; import org.apache.arrow.vector.holders.ComplexHolder; -import org.apache.arrow.vector.types.MaterializedField; -import org.apache.arrow.vector.types.Types.DataMode; -import org.apache.arrow.vector.types.Types.MajorType; import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.ArrowType.Tuple; +import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.util.JsonStringHashMap; import org.apache.arrow.vector.util.TransferPair; @@ -47,19 +48,13 @@ public class MapVector extends AbstractMapVector { //private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(MapVector.class); - public final static MajorType TYPE = new MajorType(MinorType.MAP, DataMode.OPTIONAL); - private final SingleMapReaderImpl reader = new SingleMapReaderImpl(MapVector.this); private final Accessor accessor = new Accessor(); private final Mutator mutator = new Mutator(); int valueCount; - public MapVector(String path, BufferAllocator allocator, CallBack callBack){ - this(MaterializedField.create(path, TYPE), allocator, callBack); - } - - public MapVector(MaterializedField field, BufferAllocator allocator, CallBack callBack){ - super(field, allocator, callBack); + public MapVector(String name, BufferAllocator allocator, CallBack callBack){ + super(name, allocator, callBack); } @Override @@ -69,7 +64,6 @@ public 
FieldReader getReader() { } transient private MapTransferPair ephPair; - transient private MapSingleCopier ephPair2; public void copyFromSafe(int fromIndex, int thisIndex, MapVector from) { if(ephPair == null || ephPair.from != from) { @@ -78,13 +72,6 @@ public void copyFromSafe(int fromIndex, int thisIndex, MapVector from) { ephPair.copyValueSafe(fromIndex, thisIndex); } - public void copyFromSafe(int fromSubIndex, int thisIndex, RepeatedMapVector from) { - if(ephPair2 == null || ephPair2.from != from) { - ephPair2 = from.makeSingularCopier(this); - } - ephPair2.copySafe(fromSubIndex, thisIndex); - } - @Override protected boolean supportsDirectRead() { return true; @@ -139,7 +126,7 @@ public ArrowBuf[] getBuffers(boolean clear) { @Override public TransferPair getTransferPair(BufferAllocator allocator) { - return new MapTransferPair(this, getField().getPath(), allocator); + return new MapTransferPair(this, name, allocator); } @Override @@ -157,8 +144,8 @@ protected static class MapTransferPair implements TransferPair{ private final MapVector from; private final MapVector to; - public MapTransferPair(MapVector from, String path, BufferAllocator allocator) { - this(from, new MapVector(MaterializedField.create(path, TYPE), allocator, from.callBack), false); + public MapTransferPair(MapVector from, String name, BufferAllocator allocator) { + this(from, new MapVector(name, allocator, from.callBack), false); } public MapTransferPair(MapVector from, MapVector to) { @@ -170,7 +157,6 @@ protected MapTransferPair(MapVector from, MapVector to, boolean allocate) { this.to = to; this.pairs = new TransferPair[from.size()]; this.to.ephPair = null; - this.to.ephPair2 = null; int i = 0; ValueVector vector; @@ -189,7 +175,7 @@ protected MapTransferPair(MapVector from, MapVector to, boolean allocate) { // (This is similar to what happens in ScanBatch where the children cannot be added till they are // read). To take care of this, we ensure that the hashCode of the MaterializedField does not // include the hashCode of the children but is based only on MaterializedField$key. - final ValueVector newVector = to.addOrGet(child, vector.getField().getType(), vector.getClass()); + final ValueVector newVector = to.addOrGet(child, vector.getMinorType(), vector.getClass()); if (allocate && to.size() != preSize) { newVector.allocateNew(); } @@ -251,46 +237,6 @@ public Accessor getAccessor() { return accessor; } -// @Override -// public void load(SerializedField metadata, DrillBuf buf) { -// final List fields = metadata.getChildList(); -// valueCount = metadata.getValueCount(); -// -// int bufOffset = 0; -// for (final SerializedField child : fields) { -// final MaterializedField fieldDef = SerializedFieldHelper.create(child); -// -// ValueVector vector = getChild(fieldDef.getLastName()); -// if (vector == null) { -// if we arrive here, we didn't have a matching vector. 
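// [Editor's sketch] The MapTransferPair changes above replace MaterializedField-driven child
// creation with the name + MinorType form, e.g. to.addOrGet(child, vector.getMinorType(),
// vector.getClass()). A minimal, hypothetical illustration of a map-to-map transfer under the
// new API; RootAllocator comes from arrow-memory and is an assumption here, not part of this patch.
//
//   import org.apache.arrow.memory.BufferAllocator;
//   import org.apache.arrow.memory.RootAllocator;
//   import org.apache.arrow.vector.complex.MapVector;
//   import org.apache.arrow.vector.util.TransferPair;
//
//   BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
//   MapVector from = new MapVector("source", allocator, null);  // new String-name constructor
//   // ... add children via addOrGet(name, minorType, class) and write values ...
//   TransferPair tp = from.getTransferPair(allocator);          // mirrors children by name/type
//   tp.transfer();                                              // buffer ownership moves to the target
//   MapVector to = (MapVector) tp.getTo();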
-// vector = BasicTypeHelper.getNewVector(fieldDef, allocator); -// putChild(fieldDef.getLastName(), vector); -// } -// if (child.getValueCount() == 0) { -// vector.clear(); -// } else { -// vector.load(child, buf.slice(bufOffset, child.getBufferLength())); -// } -// bufOffset += child.getBufferLength(); -// } -// -// assert bufOffset == buf.capacity(); -// } -// -// @Override -// public SerializedField getMetadata() { -// SerializedField.Builder b = getField() // -// .getAsBuilder() // -// .setBufferLength(getBufferSize()) // -// .setValueCount(valueCount); -// -// -// for(ValueVector v : getChildren()) { -// b.addChild(v.getMetadata()); -// } -// return b.build(); -// } - @Override public Mutator getMutator() { return mutator; @@ -303,13 +249,6 @@ public Object getObject(int index) { Map vv = new JsonStringHashMap<>(); for (String child:getChildFieldNames()) { ValueVector v = getChild(child); - // TODO(DRILL-4001): Resolve this hack: - // The index/value count check in the following if statement is a hack - // to work around the current fact that RecordBatchLoader.load and - // MapVector.load leave child vectors with a length of zero (as opposed - // to matching the lengths of siblings and the parent map vector) - // because they don't remove (or set the lengths of) vectors from - // previous batches that aren't in the current batch. if (v != null && index < v.getAccessor().getValueCount()) { Object value = v.getAccessor().getObject(index); if (value != null) { @@ -360,6 +299,20 @@ public void clear() { valueCount = 0; } + @Override + public Field getField() { + List children = new ArrayList<>(); + for (ValueVector child : getChildren()) { + children.add(child.getField()); + } + return new Field(name, false, Tuple.INSTANCE, children); + } + + @Override + public MinorType getMinorType() { + return MinorType.MAP; + } + @Override public void close() { final Collection vectors = getChildren(); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedListVector.java deleted file mode 100644 index f337f9c4a60e0..0000000000000 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedListVector.java +++ /dev/null @@ -1,427 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.arrow.vector.complex; - -import io.netty.buffer.ArrowBuf; - -import java.util.Iterator; -import java.util.List; - -import org.apache.arrow.memory.BufferAllocator; -import org.apache.arrow.memory.OutOfMemoryException; -import org.apache.arrow.vector.AddOrGetResult; -import org.apache.arrow.vector.UInt4Vector; -import org.apache.arrow.vector.ValueVector; -import org.apache.arrow.vector.VectorDescriptor; -import org.apache.arrow.vector.complex.impl.NullReader; -import org.apache.arrow.vector.complex.impl.RepeatedListReaderImpl; -import org.apache.arrow.vector.complex.reader.FieldReader; -import org.apache.arrow.vector.holders.ComplexHolder; -import org.apache.arrow.vector.holders.RepeatedListHolder; -import org.apache.arrow.vector.types.MaterializedField; -import org.apache.arrow.vector.types.Types.DataMode; -import org.apache.arrow.vector.types.Types.MajorType; -import org.apache.arrow.vector.types.Types.MinorType; -import org.apache.arrow.vector.util.CallBack; -import org.apache.arrow.vector.util.JsonStringArrayList; -import org.apache.arrow.vector.util.TransferPair; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Lists; - -public class RepeatedListVector extends AbstractContainerVector - implements RepeatedValueVector, RepeatedFixedWidthVectorLike { - - public final static MajorType TYPE = new MajorType(MinorType.LIST, DataMode.REPEATED); - private final RepeatedListReaderImpl reader = new RepeatedListReaderImpl(null, this); - final DelegateRepeatedVector delegate; - - protected static class DelegateRepeatedVector extends BaseRepeatedValueVector { - - private final RepeatedListAccessor accessor = new RepeatedListAccessor(); - private final RepeatedListMutator mutator = new RepeatedListMutator(); - private final EmptyValuePopulator emptyPopulator; - private transient DelegateTransferPair ephPair; - - public class RepeatedListAccessor extends BaseRepeatedValueVector.BaseRepeatedAccessor { - - @Override - public Object getObject(int index) { - final List list = new JsonStringArrayList<>(); - final int start = offsets.getAccessor().get(index); - final int until = offsets.getAccessor().get(index+1); - for (int i = start; i < until; i++) { - list.add(vector.getAccessor().getObject(i)); - } - return list; - } - - public void get(int index, RepeatedListHolder holder) { - assert index <= getValueCapacity(); - holder.start = getOffsetVector().getAccessor().get(index); - holder.end = getOffsetVector().getAccessor().get(index+1); - } - - public void get(int index, ComplexHolder holder) { - final FieldReader reader = getReader(); - reader.setPosition(index); - holder.reader = reader; - } - - public void get(int index, int arrayIndex, ComplexHolder holder) { - final RepeatedListHolder listHolder = new RepeatedListHolder(); - get(index, listHolder); - int offset = listHolder.start + arrayIndex; - if (offset >= listHolder.end) { - holder.reader = NullReader.INSTANCE; - } else { - FieldReader r = getDataVector().getReader(); - r.setPosition(offset); - holder.reader = r; - } - } - } - - public class RepeatedListMutator extends BaseRepeatedValueVector.BaseRepeatedMutator { - - public int add(int index) { - final int curEnd = getOffsetVector().getAccessor().get(index+1); - getOffsetVector().getMutator().setSafe(index + 1, curEnd + 1); - return curEnd; - } - - @Override - public void startNewValue(int index) { - emptyPopulator.populate(index+1); - super.startNewValue(index); - } - - @Override - public void setValueCount(int valueCount) { - 
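// [Editor's note] populate(valueCount) backfills the offset entries of any rows that were
// skipped without a startNewValue() call, keeping offsets[0..valueCount] monotonic before the
// new count takes effect; without it, unwritten trailing rows would read stale offsets.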
emptyPopulator.populate(valueCount); - super.setValueCount(valueCount); - } - } - - - public class DelegateTransferPair implements TransferPair { - private final DelegateRepeatedVector target; - private final TransferPair[] children; - - public DelegateTransferPair(DelegateRepeatedVector target) { - this.target = Preconditions.checkNotNull(target); - if (target.getDataVector() == DEFAULT_DATA_VECTOR) { - target.addOrGetVector(VectorDescriptor.create(getDataVector().getField())); - target.getDataVector().allocateNew(); - } - this.children = new TransferPair[] { - getOffsetVector().makeTransferPair(target.getOffsetVector()), - getDataVector().makeTransferPair(target.getDataVector()) - }; - } - - @Override - public void transfer() { - for (TransferPair child:children) { - child.transfer(); - } - } - - @Override - public ValueVector getTo() { - return target; - } - - @Override - public void splitAndTransfer(int startIndex, int length) { - target.allocateNew(); - for (int i = 0; i < length; i++) { - copyValueSafe(startIndex + i, i); - } - } - - @Override - public void copyValueSafe(int srcIndex, int destIndex) { - final RepeatedListHolder holder = new RepeatedListHolder(); - getAccessor().get(srcIndex, holder); - target.emptyPopulator.populate(destIndex+1); - final TransferPair vectorTransfer = children[1]; - int newIndex = target.getOffsetVector().getAccessor().get(destIndex); - //todo: make this a bulk copy. - for (int i = holder.start; i < holder.end; i++, newIndex++) { - vectorTransfer.copyValueSafe(i, newIndex); - } - target.getOffsetVector().getMutator().setSafe(destIndex + 1, newIndex); - } - } - - public DelegateRepeatedVector(String path, BufferAllocator allocator) { - this(MaterializedField.create(path, TYPE), allocator); - } - - public DelegateRepeatedVector(MaterializedField field, BufferAllocator allocator) { - super(field, allocator); - emptyPopulator = new EmptyValuePopulator(getOffsetVector()); - } - - @Override - public void allocateNew() throws OutOfMemoryException { - if (!allocateNewSafe()) { - throw new OutOfMemoryException(); - } - } - - @Override - public TransferPair getTransferPair(String ref, BufferAllocator allocator) { - return makeTransferPair(new DelegateRepeatedVector(ref, allocator)); - } - - @Override - public TransferPair makeTransferPair(ValueVector target) { - return new DelegateTransferPair(DelegateRepeatedVector.class.cast(target)); - } - - @Override - public RepeatedListAccessor getAccessor() { - return accessor; - } - - @Override - public RepeatedListMutator getMutator() { - return mutator; - } - - @Override - public FieldReader getReader() { - throw new UnsupportedOperationException(); - } - - public void copyFromSafe(int fromIndex, int thisIndex, DelegateRepeatedVector from) { - if(ephPair == null || ephPair.target != from) { - ephPair = DelegateTransferPair.class.cast(from.makeTransferPair(this)); - } - ephPair.copyValueSafe(fromIndex, thisIndex); - } - - } - - protected class RepeatedListTransferPair implements TransferPair { - private final TransferPair delegate; - - public RepeatedListTransferPair(TransferPair delegate) { - this.delegate = delegate; - } - - public void transfer() { - delegate.transfer(); - } - - @Override - public void splitAndTransfer(int startIndex, int length) { - delegate.splitAndTransfer(startIndex, length); - } - - @Override - public ValueVector getTo() { - final DelegateRepeatedVector delegateVector = DelegateRepeatedVector.class.cast(delegate.getTo()); - return new RepeatedListVector(getField(), allocator, callBack, 
delegateVector); - } - - @Override - public void copyValueSafe(int from, int to) { - delegate.copyValueSafe(from, to); - } - } - - public RepeatedListVector(String path, BufferAllocator allocator, CallBack callBack) { - this(MaterializedField.create(path, TYPE), allocator, callBack); - } - - public RepeatedListVector(MaterializedField field, BufferAllocator allocator, CallBack callBack) { - this(field, allocator, callBack, new DelegateRepeatedVector(field, allocator)); - } - - protected RepeatedListVector(MaterializedField field, BufferAllocator allocator, CallBack callBack, DelegateRepeatedVector delegate) { - super(field, allocator, callBack); - this.delegate = Preconditions.checkNotNull(delegate); - - final List children = Lists.newArrayList(field.getChildren()); - final int childSize = children.size(); - assert childSize < 3; - final boolean hasChild = childSize > 0; - if (hasChild) { - // the last field is data field - final MaterializedField child = children.get(childSize-1); - addOrGetVector(VectorDescriptor.create(child)); - } - } - - - @Override - public RepeatedListReaderImpl getReader() { - return reader; - } - - @Override - public DelegateRepeatedVector.RepeatedListAccessor getAccessor() { - return delegate.getAccessor(); - } - - @Override - public DelegateRepeatedVector.RepeatedListMutator getMutator() { - return delegate.getMutator(); - } - - @Override - public UInt4Vector getOffsetVector() { - return delegate.getOffsetVector(); - } - - @Override - public ValueVector getDataVector() { - return delegate.getDataVector(); - } - - @Override - public void allocateNew() throws OutOfMemoryException { - delegate.allocateNew(); - } - - @Override - public boolean allocateNewSafe() { - return delegate.allocateNewSafe(); - } - - @Override - public AddOrGetResult addOrGetVector(VectorDescriptor descriptor) { - final AddOrGetResult result = delegate.addOrGetVector(descriptor); - if (result.isCreated() && callBack != null) { - callBack.doWork(); - } - return result; - } - - @Override - public int size() { - return delegate.size(); - } - - @Override - public int getBufferSize() { - return delegate.getBufferSize(); - } - - @Override - public int getBufferSizeFor(final int valueCount) { - return delegate.getBufferSizeFor(valueCount); - } - - @Override - public void close() { - delegate.close(); - } - - @Override - public void clear() { - delegate.clear(); - } - - @Override - public TransferPair getTransferPair(BufferAllocator allocator) { - return new RepeatedListTransferPair(delegate.getTransferPair(allocator)); - } - - @Override - public TransferPair getTransferPair(String ref, BufferAllocator allocator) { - return new RepeatedListTransferPair(delegate.getTransferPair(ref, allocator)); - } - - @Override - public TransferPair makeTransferPair(ValueVector to) { - final RepeatedListVector target = RepeatedListVector.class.cast(to); - return new RepeatedListTransferPair(delegate.makeTransferPair(target.delegate)); - } - - @Override - public int getValueCapacity() { - return delegate.getValueCapacity(); - } - - @Override - public ArrowBuf[] getBuffers(boolean clear) { - return delegate.getBuffers(clear); - } - - -// @Override -// public void load(SerializedField metadata, DrillBuf buf) { -// delegate.load(metadata, buf); -// } - -// @Override -// public SerializedField getMetadata() { -// return delegate.getMetadata(); -// } - - @Override - public Iterator iterator() { - return delegate.iterator(); - } - - @Override - public void setInitialCapacity(int numRecords) { - 
delegate.setInitialCapacity(numRecords); - } - - /** - * @deprecated - * prefer using {@link #addOrGetVector(org.apache.arrow.vector.VectorDescriptor)} instead. - */ - @Override - public T addOrGet(String name, MajorType type, Class clazz) { - final AddOrGetResult result = addOrGetVector(VectorDescriptor.create(type)); - return result.getVector(); - } - - @Override - public T getChild(String name, Class clazz) { - if (name != null) { - return null; - } - return typeify(delegate.getDataVector(), clazz); - } - - @Override - public void allocateNew(int valueCount, int innerValueCount) { - clear(); - getOffsetVector().allocateNew(valueCount + 1); - getMutator().reset(); - } - - @Override - public VectorWithOrdinal getChildVectorWithOrdinal(String name) { - if (name != null) { - return null; - } - return new VectorWithOrdinal(delegate.getDataVector(), 0); - } - - public void copyFromSafe(int fromIndex, int thisIndex, RepeatedListVector from) { - delegate.copyFromSafe(fromIndex, thisIndex, from.delegate); - } -} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedMapVector.java deleted file mode 100644 index 686414e71cadf..0000000000000 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedMapVector.java +++ /dev/null @@ -1,584 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.arrow.vector.complex; - -import io.netty.buffer.ArrowBuf; - -import java.util.Iterator; -import java.util.List; -import java.util.Map; - -import org.apache.arrow.memory.BufferAllocator; -import org.apache.arrow.memory.OutOfMemoryException; -import org.apache.arrow.vector.AddOrGetResult; -import org.apache.arrow.vector.AllocationHelper; -import org.apache.arrow.vector.UInt4Vector; -import org.apache.arrow.vector.ValueVector; -import org.apache.arrow.vector.VectorDescriptor; -import org.apache.arrow.vector.complex.impl.NullReader; -import org.apache.arrow.vector.complex.impl.RepeatedMapReaderImpl; -import org.apache.arrow.vector.complex.reader.FieldReader; -import org.apache.arrow.vector.holders.ComplexHolder; -import org.apache.arrow.vector.holders.RepeatedMapHolder; -import org.apache.arrow.vector.types.MaterializedField; -import org.apache.arrow.vector.types.Types.DataMode; -import org.apache.arrow.vector.types.Types.MajorType; -import org.apache.arrow.vector.types.Types.MinorType; -import org.apache.arrow.vector.util.CallBack; -import org.apache.arrow.vector.util.JsonStringArrayList; -import org.apache.arrow.vector.util.TransferPair; -import org.apache.commons.lang3.ArrayUtils; - -import com.google.common.base.Preconditions; -import com.google.common.collect.Maps; - -public class RepeatedMapVector extends AbstractMapVector - implements RepeatedValueVector, RepeatedFixedWidthVectorLike { - //private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(RepeatedMapVector.class); - - public final static MajorType TYPE = new MajorType(MinorType.MAP, DataMode.REPEATED); - - final UInt4Vector offsets; // offsets to start of each record (considering record indices are 0-indexed) - private final RepeatedMapReaderImpl reader = new RepeatedMapReaderImpl(RepeatedMapVector.this); - private final RepeatedMapAccessor accessor = new RepeatedMapAccessor(); - private final Mutator mutator = new Mutator(); - private final EmptyValuePopulator emptyPopulator; - - public RepeatedMapVector(MaterializedField field, BufferAllocator allocator, CallBack callBack){ - super(field, allocator, callBack); - this.offsets = new UInt4Vector(BaseRepeatedValueVector.OFFSETS_FIELD, allocator); - this.emptyPopulator = new EmptyValuePopulator(offsets); - } - - @Override - public UInt4Vector getOffsetVector() { - return offsets; - } - - @Override - public ValueVector getDataVector() { - throw new UnsupportedOperationException(); - } - - @Override - public AddOrGetResult addOrGetVector(VectorDescriptor descriptor) { - throw new UnsupportedOperationException(); - } - - @Override - public void setInitialCapacity(int numRecords) { - offsets.setInitialCapacity(numRecords + 1); - for(final ValueVector v : (Iterable) this) { - v.setInitialCapacity(numRecords * RepeatedValueVector.DEFAULT_REPEAT_PER_RECORD); - } - } - - @Override - public RepeatedMapReaderImpl getReader() { - return reader; - } - - @Override - public void allocateNew(int groupCount, int innerValueCount) { - clear(); - try { - offsets.allocateNew(groupCount + 1); - for (ValueVector v : getChildren()) { - AllocationHelper.allocatePrecomputedChildCount(v, groupCount, 50, innerValueCount); - } - } catch (OutOfMemoryException e){ - clear(); - throw e; - } - offsets.zeroVector(); - mutator.reset(); - } - - public Iterator fieldNameIterator() { - return getChildFieldNames().iterator(); - } - - @Override - public List getPrimitiveVectors() { - final List primitiveVectors = super.getPrimitiveVectors(); - 
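// [Editor's note] offsets is owned directly by RepeatedMapVector rather than registered as a
// named child, so it must be appended here explicitly; getBuffers(boolean) later in this class
// makes the same adjustment when reconciling buffer sizes.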
primitiveVectors.add(offsets); - return primitiveVectors; - } - - @Override - public int getBufferSize() { - if (getAccessor().getValueCount() == 0) { - return 0; - } - long bufferSize = offsets.getBufferSize(); - for (final ValueVector v : (Iterable) this) { - bufferSize += v.getBufferSize(); - } - return (int) bufferSize; - } - - @Override - public int getBufferSizeFor(final int valueCount) { - if (valueCount == 0) { - return 0; - } - - long bufferSize = 0; - for (final ValueVector v : (Iterable) this) { - bufferSize += v.getBufferSizeFor(valueCount); - } - - return (int) bufferSize; - } - - @Override - public void close() { - offsets.close(); - super.close(); - } - - @Override - public TransferPair getTransferPair(BufferAllocator allocator) { - return new RepeatedMapTransferPair(this, getField().getPath(), allocator); - } - - @Override - public TransferPair makeTransferPair(ValueVector to) { - return new RepeatedMapTransferPair(this, (RepeatedMapVector)to); - } - - MapSingleCopier makeSingularCopier(MapVector to) { - return new MapSingleCopier(this, to); - } - - protected static class MapSingleCopier { - private final TransferPair[] pairs; - public final RepeatedMapVector from; - - public MapSingleCopier(RepeatedMapVector from, MapVector to) { - this.from = from; - this.pairs = new TransferPair[from.size()]; - - int i = 0; - ValueVector vector; - for (final String child:from.getChildFieldNames()) { - int preSize = to.size(); - vector = from.getChild(child); - if (vector == null) { - continue; - } - final ValueVector newVector = to.addOrGet(child, vector.getField().getType(), vector.getClass()); - if (to.size() != preSize) { - newVector.allocateNew(); - } - pairs[i++] = vector.makeTransferPair(newVector); - } - } - - public void copySafe(int fromSubIndex, int toIndex) { - for (TransferPair p : pairs) { - p.copyValueSafe(fromSubIndex, toIndex); - } - } - } - - public TransferPair getTransferPairToSingleMap(String reference, BufferAllocator allocator) { - return new SingleMapTransferPair(this, reference, allocator); - } - - @Override - public TransferPair getTransferPair(String ref, BufferAllocator allocator) { - return new RepeatedMapTransferPair(this, ref, allocator); - } - - @Override - public boolean allocateNewSafe() { - /* boolean to keep track if all the memory allocation were successful - * Used in the case of composite vectors when we need to allocate multiple - * buffers for multiple vectors. 
If one of the allocations failed we need to - * clear all the memory that we allocated - */ - boolean success = false; - try { - if (!offsets.allocateNewSafe()) { - return false; - } - success = super.allocateNewSafe(); - } finally { - if (!success) { - clear(); - } - } - offsets.zeroVector(); - return success; - } - - protected static class SingleMapTransferPair implements TransferPair { - private final TransferPair[] pairs; - private final RepeatedMapVector from; - private final MapVector to; - private static final MajorType MAP_TYPE = new MajorType(MinorType.MAP, DataMode.REQUIRED); - - public SingleMapTransferPair(RepeatedMapVector from, String path, BufferAllocator allocator) { - this(from, new MapVector(MaterializedField.create(path, MAP_TYPE), allocator, from.callBack), false); - } - - public SingleMapTransferPair(RepeatedMapVector from, MapVector to) { - this(from, to, true); - } - - public SingleMapTransferPair(RepeatedMapVector from, MapVector to, boolean allocate) { - this.from = from; - this.to = to; - this.pairs = new TransferPair[from.size()]; - int i = 0; - ValueVector vector; - for (final String child : from.getChildFieldNames()) { - int preSize = to.size(); - vector = from.getChild(child); - if (vector == null) { - continue; - } - final ValueVector newVector = to.addOrGet(child, vector.getField().getType(), vector.getClass()); - if (allocate && to.size() != preSize) { - newVector.allocateNew(); - } - pairs[i++] = vector.makeTransferPair(newVector); - } - } - - - @Override - public void transfer() { - for (TransferPair p : pairs) { - p.transfer(); - } - to.getMutator().setValueCount(from.getAccessor().getValueCount()); - from.clear(); - } - - @Override - public ValueVector getTo() { - return to; - } - - @Override - public void copyValueSafe(int from, int to) { - for (TransferPair p : pairs) { - p.copyValueSafe(from, to); - } - } - - @Override - public void splitAndTransfer(int startIndex, int length) { - for (TransferPair p : pairs) { - p.splitAndTransfer(startIndex, length); - } - to.getMutator().setValueCount(length); - } - } - - private static class RepeatedMapTransferPair implements TransferPair{ - - private final TransferPair[] pairs; - private final RepeatedMapVector to; - private final RepeatedMapVector from; - - public RepeatedMapTransferPair(RepeatedMapVector from, String path, BufferAllocator allocator) { - this(from, new RepeatedMapVector(MaterializedField.create(path, TYPE), allocator, from.callBack), false); - } - - public RepeatedMapTransferPair(RepeatedMapVector from, RepeatedMapVector to) { - this(from, to, true); - } - - public RepeatedMapTransferPair(RepeatedMapVector from, RepeatedMapVector to, boolean allocate) { - this.from = from; - this.to = to; - this.pairs = new TransferPair[from.size()]; - this.to.ephPair = null; - - int i = 0; - ValueVector vector; - for (final String child : from.getChildFieldNames()) { - final int preSize = to.size(); - vector = from.getChild(child); - if (vector == null) { - continue; - } - - final ValueVector newVector = to.addOrGet(child, vector.getField().getType(), vector.getClass()); - if (to.size() != preSize) { - newVector.allocateNew(); - } - - pairs[i++] = vector.makeTransferPair(newVector); - } - } - - @Override - public void transfer() { - from.offsets.transferTo(to.offsets); - for (TransferPair p : pairs) { - p.transfer(); - } - from.clear(); - } - - @Override - public ValueVector getTo() { - return to; - } - - @Override - public void copyValueSafe(int srcIndex, int destIndex) { - RepeatedMapHolder holder = new 
RepeatedMapHolder(); - from.getAccessor().get(srcIndex, holder); - to.emptyPopulator.populate(destIndex + 1); - int newIndex = to.offsets.getAccessor().get(destIndex); - //todo: make these bulk copies - for (int i = holder.start; i < holder.end; i++, newIndex++) { - for (TransferPair p : pairs) { - p.copyValueSafe(i, newIndex); - } - } - to.offsets.getMutator().setSafe(destIndex + 1, newIndex); - } - - @Override - public void splitAndTransfer(final int groupStart, final int groups) { - final UInt4Vector.Accessor a = from.offsets.getAccessor(); - final UInt4Vector.Mutator m = to.offsets.getMutator(); - - final int startPos = a.get(groupStart); - final int endPos = a.get(groupStart + groups); - final int valuesToCopy = endPos - startPos; - - to.offsets.clear(); - to.offsets.allocateNew(groups + 1); - - int normalizedPos; - for (int i = 0; i < groups + 1; i++) { - normalizedPos = a.get(groupStart + i) - startPos; - m.set(i, normalizedPos); - } - - m.setValueCount(groups + 1); - to.emptyPopulator.populate(groups); - - for (final TransferPair p : pairs) { - p.splitAndTransfer(startPos, valuesToCopy); - } - } - } - - - transient private RepeatedMapTransferPair ephPair; - - public void copyFromSafe(int fromIndex, int thisIndex, RepeatedMapVector from) { - if (ephPair == null || ephPair.from != from) { - ephPair = (RepeatedMapTransferPair) from.makeTransferPair(this); - } - ephPair.copyValueSafe(fromIndex, thisIndex); - } - - @Override - public int getValueCapacity() { - return Math.max(offsets.getValueCapacity() - 1, 0); - } - - @Override - public RepeatedMapAccessor getAccessor() { - return accessor; - } - - @Override - public ArrowBuf[] getBuffers(boolean clear) { - final int expectedBufferSize = getBufferSize(); - final int actualBufferSize = super.getBufferSize(); - - Preconditions.checkArgument(expectedBufferSize == actualBufferSize + offsets.getBufferSize()); - return ArrayUtils.addAll(offsets.getBuffers(clear), super.getBuffers(clear)); - } - - -// @Override -// public void load(SerializedField metadata, DrillBuf buffer) { -// final List children = metadata.getChildList(); -// -// final SerializedField offsetField = children.get(0); -// offsets.load(offsetField, buffer); -// int bufOffset = offsetField.getBufferLength(); -// -// for (int i = 1; i < children.size(); i++) { -// final SerializedField child = children.get(i); -// final MaterializedField fieldDef = SerializedFieldHelper.create(child); -// ValueVector vector = getChild(fieldDef.getLastName()); -// if (vector == null) { - // if we arrive here, we didn't have a matching vector. 
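// [Editor's note] A worked example of the offset re-basing in splitAndTransfer above: with
// source offsets [0, 2, 5, 9], groupStart = 1 and groups = 2 give startPos = 2 and endPos = 9,
// so valuesToCopy = 7 inner values move and the target offsets become [0, 3, 7].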
-// vector = BasicTypeHelper.getNewVector(fieldDef, allocator); -// putChild(fieldDef.getLastName(), vector); -// } -// final int vectorLength = child.getBufferLength(); -// vector.load(child, buffer.slice(bufOffset, vectorLength)); -// bufOffset += vectorLength; -// } -// -// assert bufOffset == buffer.capacity(); -// } -// -// -// @Override -// public SerializedField getMetadata() { -// SerializedField.Builder builder = getField() // -// .getAsBuilder() // -// .setBufferLength(getBufferSize()) // - // while we don't need to actually read this on load, we need it to make sure we don't skip deserialization of this vector -// .setValueCount(accessor.getValueCount()); -// builder.addChild(offsets.getMetadata()); -// for (final ValueVector child : getChildren()) { -// builder.addChild(child.getMetadata()); -// } -// return builder.build(); -// } - - @Override - public Mutator getMutator() { - return mutator; - } - - public class RepeatedMapAccessor implements RepeatedAccessor { - @Override - public Object getObject(int index) { - final List list = new JsonStringArrayList<>(); - final int end = offsets.getAccessor().get(index+1); - String fieldName; - for (int i = offsets.getAccessor().get(index); i < end; i++) { - final Map vv = Maps.newLinkedHashMap(); - for (final MaterializedField field : getField().getChildren()) { - if (!field.equals(BaseRepeatedValueVector.OFFSETS_FIELD)) { - fieldName = field.getLastName(); - final Object value = getChild(fieldName).getAccessor().getObject(i); - if (value != null) { - vv.put(fieldName, value); - } - } - } - list.add(vv); - } - return list; - } - - @Override - public int getValueCount() { - return Math.max(offsets.getAccessor().getValueCount() - 1, 0); - } - - @Override - public int getInnerValueCount() { - final int valueCount = getValueCount(); - if (valueCount == 0) { - return 0; - } - return offsets.getAccessor().get(valueCount); - } - - @Override - public int getInnerValueCountAt(int index) { - return offsets.getAccessor().get(index+1) - offsets.getAccessor().get(index); - } - - @Override - public boolean isEmpty(int index) { - return false; - } - - @Override - public boolean isNull(int index) { - return false; - } - - public void get(int index, RepeatedMapHolder holder) { - assert index < getValueCapacity() : - String.format("Attempted to access index %d when value capacity is %d", - index, getValueCapacity()); - final UInt4Vector.Accessor offsetsAccessor = offsets.getAccessor(); - holder.start = offsetsAccessor.get(index); - holder.end = offsetsAccessor.get(index + 1); - } - - public void get(int index, ComplexHolder holder) { - final FieldReader reader = getReader(); - reader.setPosition(index); - holder.reader = reader; - } - - public void get(int index, int arrayIndex, ComplexHolder holder) { - final RepeatedMapHolder h = new RepeatedMapHolder(); - get(index, h); - final int offset = h.start + arrayIndex; - - if (offset >= h.end) { - holder.reader = NullReader.INSTANCE; - } else { - reader.setSinglePosition(index, arrayIndex); - holder.reader = reader; - } - } - } - - public class Mutator implements RepeatedMutator { - @Override - public void startNewValue(int index) { - emptyPopulator.populate(index + 1); - offsets.getMutator().setSafe(index + 1, offsets.getAccessor().get(index)); - } - - @Override - public void setValueCount(int topLevelValueCount) { - emptyPopulator.populate(topLevelValueCount); - offsets.getMutator().setValueCount(topLevelValueCount == 0 ? 
0 : topLevelValueCount + 1); - int childValueCount = offsets.getAccessor().get(topLevelValueCount); - for (final ValueVector v : getChildren()) { - v.getMutator().setValueCount(childValueCount); - } - } - - @Override - public void reset() {} - - @Override - public void generateTestData(int values) {} - - public int add(int index) { - final int prevEnd = offsets.getAccessor().get(index + 1); - offsets.getMutator().setSafe(index + 1, prevEnd + 1); - return prevEnd; - } - } - - @Override - public void clear() { - getMutator().reset(); - - offsets.clear(); - for(final ValueVector vector : getChildren()) { - vector.clear(); - } - } -} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedValueVector.java index 99c0a0aeb1e2c..54db393e8310d 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedValueVector.java @@ -28,7 +28,7 @@ * uses the offset vector to determine the sequence of cells pertaining to an individual value. * */ -public interface RepeatedValueVector extends ValueVector, ContainerVectorLike { +public interface RepeatedValueVector extends ValueVector { final static int DEFAULT_REPEAT_PER_RECORD = 5; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseReader.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseReader.java index 264e241e73935..259a954233c06 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseReader.java @@ -19,20 +19,20 @@ import java.util.Iterator; +import com.google.flatbuffers.FlatBufferBuilder; +import org.apache.arrow.flatbuf.Type; +import org.apache.arrow.flatbuf.Union; +import org.apache.arrow.flatbuf.UnionMode; import org.apache.arrow.vector.complex.reader.FieldReader; import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter; import org.apache.arrow.vector.complex.writer.FieldWriter; import org.apache.arrow.vector.holders.UnionHolder; -import org.apache.arrow.vector.types.MaterializedField; -import org.apache.arrow.vector.types.Types.DataMode; -import org.apache.arrow.vector.types.Types.MajorType; -import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.Field; abstract class AbstractBaseReader implements FieldReader{ static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(AbstractBaseReader.class); - private static final MajorType LATE_BIND_TYPE = new MajorType(MinorType.LATE, DataMode.OPTIONAL); private int index; @@ -58,15 +58,6 @@ public Iterator iterator() { throw new IllegalStateException("The current reader doesn't support reading as a map."); } - public MajorType getType(){ - throw new IllegalStateException("The current reader doesn't support getting type information."); - } - - @Override - public MaterializedField getField() { - return MaterializedField.create("unknown", LATE_BIND_TYPE); - } - @Override public boolean next() { throw new IllegalStateException("The current reader doesn't support getting next information."); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseWriter.java index 4e1e103a12e7c..e6cf098f16f59 100644 --- 
a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseWriter.java @@ -23,25 +23,11 @@ abstract class AbstractBaseWriter implements FieldWriter { //private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(AbstractBaseWriter.class); - final FieldWriter parent; private int index; - public AbstractBaseWriter(FieldWriter parent) { - this.parent = parent; - } - @Override public String toString() { - return super.toString() + "[index = " + index + ", parent = " + parent + "]"; - } - - @Override - public FieldWriter getParent() { - return parent; - } - - public boolean isRoot() { - return parent == null; + return super.toString() + "[index = " + index + "]"; } int idx() { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java index 4e2051fd4efee..4d2adfb32561e 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java @@ -17,20 +17,20 @@ */ package org.apache.arrow.vector.complex.impl; +import org.apache.arrow.vector.complex.ListVector; import org.apache.arrow.vector.complex.MapVector; import org.apache.arrow.vector.complex.StateTool; import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter; -import org.apache.arrow.vector.types.MaterializedField; -import org.apache.arrow.vector.types.Types; import org.apache.arrow.vector.types.Types.MinorType; import com.google.common.base.Preconditions; +import org.apache.arrow.vector.types.pojo.Field; public class ComplexWriterImpl extends AbstractFieldWriter implements ComplexWriter { // private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ComplexWriterImpl.class); private SingleMapWriter mapRoot; - private SingleListWriter listRoot; + private UnionListWriter listRoot; private final MapVector container; Mode mode = Mode.INIT; @@ -40,7 +40,6 @@ public class ComplexWriterImpl extends AbstractFieldWriter implements ComplexWri private enum Mode { INIT, MAP, LIST }; public ComplexWriterImpl(String name, MapVector container, boolean unionEnabled){ - super(null); this.name = name; this.container = container; this.unionEnabled = unionEnabled; @@ -51,7 +50,7 @@ public ComplexWriterImpl(String name, MapVector container){ } @Override - public MaterializedField getField() { + public Field getField() { return container.getField(); } @@ -123,7 +122,7 @@ public MapWriter directMap(){ case INIT: MapVector map = (MapVector) container; - mapRoot = new SingleMapWriter(map, this, unionEnabled); + mapRoot = new SingleMapWriter(map); mapRoot.setPosition(idx()); mode = Mode.MAP; break; @@ -143,8 +142,8 @@ public MapWriter rootAsMap() { switch(mode){ case INIT: - MapVector map = container.addOrGet(name, Types.required(MinorType.MAP), MapVector.class); - mapRoot = new SingleMapWriter(map, this, unionEnabled); + MapVector map = container.addOrGet(name, MinorType.MAP, MapVector.class); + mapRoot = new SingleMapWriter(map); mapRoot.setPosition(idx()); mode = Mode.MAP; break; @@ -174,7 +173,12 @@ public ListWriter rootAsList() { switch(mode){ case INIT: - listRoot = new SingleListWriter(name, container, this); + int vectorCount = container.size(); + ListVector listVector = container.addOrGet(name, MinorType.LIST, ListVector.class); + if 
(container.size() > vectorCount) { + listVector.allocateNew(); + } + listRoot = new UnionListWriter(listVector); listRoot.setPosition(idx()); mode = Mode.LIST; break; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java index 462ec9dd86a9b..586b1283fe879 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java @@ -17,20 +17,14 @@ */ package org.apache.arrow.vector.complex.impl; -import java.lang.reflect.Constructor; - import org.apache.arrow.vector.ValueVector; -import org.apache.arrow.vector.VectorDescriptor; import org.apache.arrow.vector.ZeroVector; import org.apache.arrow.vector.complex.AbstractMapVector; import org.apache.arrow.vector.complex.ListVector; import org.apache.arrow.vector.complex.UnionVector; import org.apache.arrow.vector.complex.writer.FieldWriter; -import org.apache.arrow.vector.types.MaterializedField; -import org.apache.arrow.vector.types.Types.DataMode; -import org.apache.arrow.vector.types.Types.MajorType; import org.apache.arrow.vector.types.Types.MinorType; -import org.apache.arrow.vector.util.BasicTypeHelper; +import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.util.TransferPair; /** @@ -56,14 +50,12 @@ private enum State { private FieldWriter writer; public PromotableWriter(ValueVector v, AbstractMapVector parentContainer) { - super(null); this.parentContainer = parentContainer; this.listVector = null; init(v); } public PromotableWriter(ValueVector v, ListVector listVector) { - super(null); this.listVector = listVector; this.parentContainer = null; init(v); @@ -84,30 +76,8 @@ private void init(ValueVector v) { private void setWriter(ValueVector v) { state = State.SINGLE; vector = v; - type = v.getField().getType().getMinorType(); - Class writerClass = BasicTypeHelper - .getWriterImpl(v.getField().getType().getMinorType(), v.getField().getDataMode()); - if (writerClass.equals(SingleListWriter.class)) { - writerClass = UnionListWriter.class; - } - Class vectorClass = BasicTypeHelper.getValueVectorClass(v.getField().getType().getMinorType(), v.getField() - .getDataMode()); - try { - Constructor constructor = null; - for (Constructor c : writerClass.getConstructors()) { - if (c.getParameterTypes().length == 3) { - constructor = c; - } - } - if (constructor == null) { - constructor = writerClass.getConstructor(vectorClass, AbstractFieldWriter.class); - writer = (FieldWriter) constructor.newInstance(vector, null); - } else { - writer = (FieldWriter) constructor.newInstance(vector, null, true); - } - } catch (ReflectiveOperationException e) { - throw new RuntimeException(e); - } + type = v.getMinorType(); + writer = type.getNewFieldWriter(vector); } @Override @@ -129,7 +99,7 @@ protected FieldWriter getWriter(MinorType type) { if (type == null) { return null; } - ValueVector v = listVector.addOrGetVector(new VectorDescriptor(new MajorType(type, DataMode.OPTIONAL))).getVector(); + ValueVector v = listVector.addOrGetVector(type).getVector(); v.allocateNew(); setWriter(v); writer.setPosition(position); @@ -150,11 +120,11 @@ protected FieldWriter getWriter() { } private FieldWriter promoteToUnion() { - String name = vector.getField().getLastName(); - TransferPair tp = vector.getTransferPair(vector.getField().getType().getMinorType().name().toLowerCase(), vector.getAllocator()); + 
String name = vector.getField().getName(); + TransferPair tp = vector.getTransferPair(vector.getMinorType().name().toLowerCase(), vector.getAllocator()); tp.transfer(); if (parentContainer != null) { - unionVector = parentContainer.addOrGet(name, new MajorType(MinorType.UNION, DataMode.OPTIONAL), UnionVector.class); + unionVector = parentContainer.addOrGet(name, MinorType.UNION, UnionVector.class); unionVector.allocateNew(); } else if (listVector != null) { unionVector = listVector.promoteToUnion(); @@ -163,7 +133,7 @@ private FieldWriter promoteToUnion() { writer = new UnionWriter(unionVector); writer.setPosition(idx()); for (int i = 0; i < idx(); i++) { - unionVector.getMutator().setType(i, vector.getField().getType().getMinorType()); + unionVector.getMutator().setType(i, vector.getMinorType()); } vector = null; state = State.UNION; @@ -181,7 +151,7 @@ public void clear() { } @Override - public MaterializedField getField() { + public Field getField() { return getWriter().getField(); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/RepeatedListReaderImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/RepeatedListReaderImpl.java deleted file mode 100644 index dd1a152e2f603..0000000000000 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/RepeatedListReaderImpl.java +++ /dev/null @@ -1,145 +0,0 @@ -/******************************************************************************* - - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- ******************************************************************************/ -package org.apache.arrow.vector.complex.impl; - - -import org.apache.arrow.vector.ValueVector; -import org.apache.arrow.vector.complex.RepeatedListVector; -import org.apache.arrow.vector.complex.reader.FieldReader; -import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter; -import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; -import org.apache.arrow.vector.holders.RepeatedListHolder; -import org.apache.arrow.vector.types.Types.DataMode; -import org.apache.arrow.vector.types.Types.MajorType; -import org.apache.arrow.vector.types.Types.MinorType; - -public class RepeatedListReaderImpl extends AbstractFieldReader{ - private static final int NO_VALUES = Integer.MAX_VALUE - 1; - private static final MajorType TYPE = new MajorType(MinorType.LIST, DataMode.REPEATED); - private final String name; - private final RepeatedListVector container; - private FieldReader reader; - - public RepeatedListReaderImpl(String name, RepeatedListVector container) { - super(); - this.name = name; - this.container = container; - } - - @Override - public MajorType getType() { - return TYPE; - } - - @Override - public void copyAsValue(ListWriter writer) { - if (currentOffset == NO_VALUES) { - return; - } - RepeatedListWriter impl = (RepeatedListWriter) writer; - impl.container.copyFromSafe(idx(), impl.idx(), container); - } - - @Override - public void copyAsField(String name, MapWriter writer) { - if (currentOffset == NO_VALUES) { - return; - } - RepeatedListWriter impl = (RepeatedListWriter) writer.list(name); - impl.container.copyFromSafe(idx(), impl.idx(), container); - } - - private int currentOffset; - private int maxOffset; - - @Override - public void reset() { - super.reset(); - currentOffset = 0; - maxOffset = 0; - if (reader != null) { - reader.reset(); - } - reader = null; - } - - @Override - public int size() { - return maxOffset - currentOffset; - } - - @Override - public void setPosition(int index) { - if (index < 0 || index == NO_VALUES) { - currentOffset = NO_VALUES; - return; - } - - super.setPosition(index); - RepeatedListHolder h = new RepeatedListHolder(); - container.getAccessor().get(index, h); - if (h.start == h.end) { - currentOffset = NO_VALUES; - } else { - currentOffset = h.start-1; - maxOffset = h.end; - if(reader != null) { - reader.setPosition(currentOffset); - } - } - } - - @Override - public boolean next() { - if (currentOffset +1 < maxOffset) { - currentOffset++; - if (reader != null) { - reader.setPosition(currentOffset); - } - return true; - } else { - currentOffset = NO_VALUES; - return false; - } - } - - @Override - public Object readObject() { - return container.getAccessor().getObject(idx()); - } - - @Override - public FieldReader reader() { - if (reader == null) { - ValueVector child = container.getChild(name); - if (child == null) { - reader = NullReader.INSTANCE; - } else { - reader = child.getReader(); - } - reader.setPosition(currentOffset); - } - return reader; - } - - public boolean isSet() { - return true; - } - -} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/RepeatedMapReaderImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/RepeatedMapReaderImpl.java deleted file mode 100644 index 09a831d8329fc..0000000000000 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/RepeatedMapReaderImpl.java +++ /dev/null @@ -1,192 +0,0 @@ 
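// [Editor's note] The deleted reader above and the one that follows share the same iteration
// contract: NO_VALUES (Integer.MAX_VALUE - 1) marks an empty or exhausted position,
// setPosition(i) seeds currentOffset = start - 1, and next() advances until maxOffset is
// reached. A hypothetical caller under that contract:
//
//   reader.setPosition(row);
//   while (reader.next()) {
//     Object cell = reader.reader().readObject();
//   }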
-/******************************************************************************* - - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - ******************************************************************************/ -package org.apache.arrow.vector.complex.impl; - -import java.util.Map; - -import org.apache.arrow.vector.ValueVector; -import org.apache.arrow.vector.complex.RepeatedMapVector; -import org.apache.arrow.vector.complex.reader.FieldReader; -import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; -import org.apache.arrow.vector.holders.RepeatedMapHolder; -import org.apache.arrow.vector.types.Types.MajorType; - -import com.google.common.collect.Maps; - -@SuppressWarnings("unused") -public class RepeatedMapReaderImpl extends AbstractFieldReader{ - private static final int NO_VALUES = Integer.MAX_VALUE - 1; - - private final RepeatedMapVector vector; - private final Map fields = Maps.newHashMap(); - - public RepeatedMapReaderImpl(RepeatedMapVector vector) { - this.vector = vector; - } - - private void setChildrenPosition(int index) { - for (FieldReader r : fields.values()) { - r.setPosition(index); - } - } - - @Override - public FieldReader reader(String name) { - FieldReader reader = fields.get(name); - if (reader == null) { - ValueVector child = vector.getChild(name); - if (child == null) { - reader = NullReader.INSTANCE; - } else { - reader = child.getReader(); - } - fields.put(name, reader); - reader.setPosition(currentOffset); - } - return reader; - } - - @Override - public FieldReader reader() { - if (currentOffset == NO_VALUES) { - return NullReader.INSTANCE; - } - - setChildrenPosition(currentOffset); - return new SingleLikeRepeatedMapReaderImpl(vector, this); - } - - private int currentOffset; - private int maxOffset; - - @Override - public void reset() { - super.reset(); - currentOffset = 0; - maxOffset = 0; - for (FieldReader reader:fields.values()) { - reader.reset(); - } - fields.clear(); - } - - @Override - public int size() { - if (isNull()) { - return 0; - } - return maxOffset - (currentOffset < 0 ? 
0 : currentOffset); - } - - @Override - public void setPosition(int index) { - if (index < 0 || index == NO_VALUES) { - currentOffset = NO_VALUES; - return; - } - - super.setPosition(index); - RepeatedMapHolder h = new RepeatedMapHolder(); - vector.getAccessor().get(index, h); - if (h.start == h.end) { - currentOffset = NO_VALUES; - } else { - currentOffset = h.start-1; - maxOffset = h.end; - setChildrenPosition(currentOffset); - } - } - - public void setSinglePosition(int index, int childIndex) { - super.setPosition(index); - RepeatedMapHolder h = new RepeatedMapHolder(); - vector.getAccessor().get(index, h); - if (h.start == h.end) { - currentOffset = NO_VALUES; - } else { - int singleOffset = h.start + childIndex; - assert singleOffset < h.end; - currentOffset = singleOffset; - maxOffset = singleOffset + 1; - setChildrenPosition(singleOffset); - } - } - - @Override - public boolean next() { - if (currentOffset +1 < maxOffset) { - setChildrenPosition(++currentOffset); - return true; - } else { - currentOffset = NO_VALUES; - return false; - } - } - - public boolean isNull() { - return currentOffset == NO_VALUES; - } - - @Override - public Object readObject() { - return vector.getAccessor().getObject(idx()); - } - - @Override - public MajorType getType() { - return vector.getField().getType(); - } - - @Override - public java.util.Iterator iterator() { - return vector.fieldNameIterator(); - } - - @Override - public boolean isSet() { - return true; - } - - @Override - public void copyAsValue(MapWriter writer) { - if (currentOffset == NO_VALUES) { - return; - } - RepeatedMapWriter impl = (RepeatedMapWriter) writer; - impl.container.copyFromSafe(idx(), impl.idx(), vector); - } - - public void copyAsValueSingle(MapWriter writer) { - if (currentOffset == NO_VALUES) { - return; - } - SingleMapWriter impl = (SingleMapWriter) writer; - impl.container.copyFromSafe(currentOffset, impl.idx(), vector); - } - - @Override - public void copyAsField(String name, MapWriter writer) { - if (currentOffset == NO_VALUES) { - return; - } - RepeatedMapWriter impl = (RepeatedMapWriter) writer.map(name); - impl.container.copyFromSafe(idx(), impl.idx(), vector); - } - -} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleLikeRepeatedMapReaderImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleLikeRepeatedMapReaderImpl.java deleted file mode 100644 index 086d26e119440..0000000000000 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleLikeRepeatedMapReaderImpl.java +++ /dev/null @@ -1,89 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ - -package org.apache.arrow.vector.complex.impl; - -import java.util.Iterator; - -import org.apache.arrow.vector.complex.RepeatedMapVector; -import org.apache.arrow.vector.complex.reader.FieldReader; -import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; -import org.apache.arrow.vector.types.Types; -import org.apache.arrow.vector.types.Types.MajorType; -import org.apache.arrow.vector.types.Types.MinorType; - -public class SingleLikeRepeatedMapReaderImpl extends AbstractFieldReader{ - - private RepeatedMapReaderImpl delegate; - - public SingleLikeRepeatedMapReaderImpl(RepeatedMapVector vector, FieldReader delegate) { - this.delegate = (RepeatedMapReaderImpl) delegate; - } - - @Override - public int size() { - throw new UnsupportedOperationException("You can't call size on a single map reader."); - } - - @Override - public boolean next() { - throw new UnsupportedOperationException("You can't call next on a single map reader."); - } - - @Override - public MajorType getType() { - return Types.required(MinorType.MAP); - } - - - @Override - public void copyAsValue(MapWriter writer) { - delegate.copyAsValueSingle(writer); - } - - public void copyAsValueSingle(MapWriter writer){ - delegate.copyAsValueSingle(writer); - } - - @Override - public FieldReader reader(String name) { - return delegate.reader(name); - } - - @Override - public void setPosition(int index) { - delegate.setPosition(index); - } - - @Override - public Object readObject() { - return delegate.readObject(); - } - - @Override - public Iterator iterator() { - return delegate.iterator(); - } - - @Override - public boolean isSet() { - return ! delegate.isNull(); - } - - -} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleListReaderImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleListReaderImpl.java index f16f628603d69..b8f58658eae15 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleListReaderImpl.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleListReaderImpl.java @@ -24,14 +24,11 @@ import org.apache.arrow.vector.complex.reader.FieldReader; import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter; import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; -import org.apache.arrow.vector.types.Types; -import org.apache.arrow.vector.types.Types.MajorType; import org.apache.arrow.vector.types.Types.MinorType; @SuppressWarnings("unused") public class SingleListReaderImpl extends AbstractFieldReader{ - private static final MajorType TYPE = Types.optional(MinorType.LIST); private final String name; private final AbstractContainerVector container; private FieldReader reader; @@ -42,12 +39,6 @@ public SingleListReaderImpl(String name, AbstractContainerVector container) { this.container = container; } - @Override - public MajorType getType() { - return TYPE; - } - - @Override public void setPosition(int index) { super.setPosition(index); @@ -70,6 +61,11 @@ public FieldReader reader() { return reader; } + @Override + public MinorType getMinorType() { + return MinorType.LIST; + } + @Override public boolean isSet() { return false; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleMapReaderImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleMapReaderImpl.java index 84b99801419c4..1c43240901c4f 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleMapReaderImpl.java +++ 
b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleMapReaderImpl.java @@ -27,9 +27,9 @@ import org.apache.arrow.vector.complex.MapVector; import org.apache.arrow.vector.complex.reader.FieldReader; import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; -import org.apache.arrow.vector.types.Types.MajorType; import com.google.common.collect.Maps; +import org.apache.arrow.vector.types.Types.MinorType; @SuppressWarnings("unused") public class SingleMapReaderImpl extends AbstractFieldReader{ @@ -77,13 +77,13 @@ public Object readObject() { } @Override - public boolean isSet() { - return true; + public MinorType getMinorType() { + return MinorType.MAP; } @Override - public MajorType getType(){ - return vector.getField().getType(); + public boolean isSet() { + return true; } @Override diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionListReader.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionListReader.java index 9b54d02e571de..39cf00421154b 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionListReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionListReader.java @@ -25,8 +25,6 @@ import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter; import org.apache.arrow.vector.complex.writer.FieldWriter; import org.apache.arrow.vector.holders.UnionHolder; -import org.apache.arrow.vector.types.Types.DataMode; -import org.apache.arrow.vector.types.Types.MajorType; import org.apache.arrow.vector.types.Types.MinorType; public class UnionListReader extends AbstractFieldReader { @@ -46,12 +44,6 @@ public boolean isSet() { return true; } - MajorType type = new MajorType(MinorType.LIST, DataMode.OPTIONAL); - - public MajorType getType() { - return type; - } - private int currentOffset; private int maxOffset; @@ -72,6 +64,11 @@ public Object readObject() { return vector.getAccessor().getObject(idx()); } + @Override + public MinorType getMinorType() { + return MinorType.LIST; + } + @Override public void read(int index, UnionHolder holder) { setPosition(idx()); @@ -82,6 +79,12 @@ public void read(int index, UnionHolder holder) { holder.isSet = data.getReader().isSet() ? 1 : 0; } + @Override + public int size() { + int size = maxOffset - currentOffset - 1; + return size < 0 ? 0 : size; + } + @Override public boolean next() { if (currentOffset + 1 < maxOffset) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/holders/ObjectHolder.java b/java/vector/src/main/java/org/apache/arrow/vector/holders/ObjectHolder.java deleted file mode 100644 index 5a5fe0305d83a..0000000000000 --- a/java/vector/src/main/java/org/apache/arrow/vector/holders/ObjectHolder.java +++ /dev/null @@ -1,38 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
- * See the License for the specific language governing permissions and - * limitations under the License. - */ - -package org.apache.arrow.vector.holders; - -import org.apache.arrow.vector.types.Types; - -/* - * Holder class for the vector ObjectVector. This holder internally stores a - * reference to an object. The ObjectVector maintains an array of these objects. - * This holder can be used only as workspace variables in aggregate functions. - * Using this holder should be avoided and we should stick to native holder types. - */ -@Deprecated -public class ObjectHolder implements ValueHolder { - public static final Types.MajorType TYPE = Types.required(Types.MinorType.GENERIC_OBJECT); - - public Types.MajorType getType() { - return TYPE; - } - - public Object obj; -} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/holders/UnionHolder.java b/java/vector/src/main/java/org/apache/arrow/vector/holders/UnionHolder.java index b868a620f985b..b1b695e58a954 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/holders/UnionHolder.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/holders/UnionHolder.java @@ -18,17 +18,14 @@ package org.apache.arrow.vector.holders; import org.apache.arrow.vector.complex.reader.FieldReader; -import org.apache.arrow.vector.types.Types.DataMode; -import org.apache.arrow.vector.types.Types.MajorType; import org.apache.arrow.vector.types.Types.MinorType; public class UnionHolder implements ValueHolder { - public static final MajorType TYPE = new MajorType(MinorType.UNION, DataMode.OPTIONAL); public FieldReader reader; public int isSet; - public MajorType getType() { - return reader.getType(); + public MinorType getMinorType() { + return reader.getMinorType(); } public boolean isSet() { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/MaterializedField.java b/java/vector/src/main/java/org/apache/arrow/vector/types/MaterializedField.java deleted file mode 100644 index c73098b2a85d7..0000000000000 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/MaterializedField.java +++ /dev/null @@ -1,217 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.arrow.vector.types; - -import java.util.ArrayList; -import java.util.Collection; -import java.util.Iterator; -import java.util.LinkedHashSet; -import java.util.Objects; - -import org.apache.arrow.vector.types.Types.DataMode; -import org.apache.arrow.vector.types.Types.MajorType; -import org.apache.arrow.vector.util.BasicTypeHelper; - - -public class MaterializedField { - private final String name; - private final MajorType type; - // use an ordered set as existing code relies on order (e,g. 
parquet writer) - private final LinkedHashSet children; - - MaterializedField(String name, MajorType type, LinkedHashSet children) { - this.name = name; - this.type = type; - this.children = children; - } - - public Collection getChildren() { - return new ArrayList<>(children); - } - - public MaterializedField newWithChild(MaterializedField child) { - MaterializedField newField = clone(); - newField.addChild(child); - return newField; - } - - public void addChild(MaterializedField field){ - children.add(field); - } - - public MaterializedField clone() { - return withPathAndType(name, getType()); - } - - public MaterializedField withType(MajorType type) { - return withPathAndType(name, type); - } - - public MaterializedField withPath(String name) { - return withPathAndType(name, getType()); - } - - public MaterializedField withPathAndType(String name, final MajorType type) { - final LinkedHashSet newChildren = new LinkedHashSet<>(children.size()); - for (final MaterializedField child:children) { - newChildren.add(child.clone()); - } - return new MaterializedField(name, type, newChildren); - } - -// public String getLastName(){ -// PathSegment seg = key.path.getRootSegment(); -// while (seg.getChild() != null) { -// seg = seg.getChild(); -// } -// return seg.getNameSegment().getPath(); -// } - - - // TODO: rewrite without as direct match rather than conversion then match. -// public boolean matches(SerializedField booleanfield){ -// MaterializedField f = create(field); -// return f.equals(this); -// } - - public static MaterializedField create(String name, MajorType type){ - return new MaterializedField(name, type, new LinkedHashSet()); - } - -// public String getName(){ -// StringBuilder sb = new StringBuilder(); -// boolean first = true; -// for(NamePart np : def.getNameList()){ -// if(np.getType() == Type.ARRAY){ -// sb.append("[]"); -// }else{ -// if(first){ -// first = false; -// }else{ -// sb.append("."); -// } -// sb.append('`'); -// sb.append(np.getName()); -// sb.append('`'); -// -// } -// } -// return sb.toString(); -// } - - public String getPath() { - return getName(); - } - - public String getLastName() { - return getName(); - } - - public String getName() { - return name; - } - -// public int getWidth() { -// return type.getWidth(); -// } - - public MajorType getType() { - return type; - } - - public int getScale() { - return type.getScale(); - } - public int getPrecision() { - return type.getPrecision(); - } - public boolean isNullable() { - return type.getMode() == DataMode.OPTIONAL; - } - - public DataMode getDataMode() { - return type.getMode(); - } - - public MaterializedField getOtherNullableVersion(){ - MajorType mt = type; - DataMode newDataMode = null; - switch (mt.getMode()){ - case OPTIONAL: - newDataMode = DataMode.REQUIRED; - break; - case REQUIRED: - newDataMode = DataMode.OPTIONAL; - break; - default: - throw new UnsupportedOperationException(); - } - return new MaterializedField(name, new MajorType(mt.getMinorType(), newDataMode, mt.getPrecision(), mt.getScale(), mt.getTimezone(), mt.getSubTypes()), children); - } - - public Class getValueClass() { - return BasicTypeHelper.getValueVectorClass(getType().getMinorType(), getDataMode()); - } - - @Override - public int hashCode() { - return Objects.hash(this.name, this.type, this.children); - } - - @Override - public boolean equals(Object obj) { - if (this == obj) { - return true; - } - if (obj == null) { - return false; - } - if (getClass() != obj.getClass()) { - return false; - } - MaterializedField other = 
(MaterializedField) obj; - // DRILL-1872: Compute equals only on key. See also the comment - // in MapVector$MapTransferPair - - return this.name.equalsIgnoreCase(other.name) && - Objects.equals(this.type, other.type); - } - - - @Override - public String toString() { - final int maxLen = 10; - String childStr = children != null && !children.isEmpty() ? toString(children, maxLen) : ""; - return name + "(" + type.getMinorType().name() + ":" + type.getMode().name() + ")" + childStr; - } - - - private String toString(Collection collection, int maxLen) { - StringBuilder builder = new StringBuilder(); - builder.append("["); - int i = 0; - for (Iterator iterator = collection.iterator(); iterator.hasNext() && i < maxLen; i++) { - if (i > 0){ - builder.append(", "); - } - builder.append(iterator.next()); - } - builder.append("]"); - return builder.toString(); - } -} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java index 88999cb8f5ab8..5ea1456a051f7 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -17,150 +17,508 @@ */ package org.apache.arrow.vector.types; -import java.util.ArrayList; -import java.util.List; +import org.apache.arrow.flatbuf.Type; +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.NullableBigIntVector; +import org.apache.arrow.vector.NullableBitVector; +import org.apache.arrow.vector.NullableDateVector; +import org.apache.arrow.vector.NullableDecimalVector; +import org.apache.arrow.vector.NullableFloat4Vector; +import org.apache.arrow.vector.NullableFloat8Vector; +import org.apache.arrow.vector.NullableIntVector; +import org.apache.arrow.vector.NullableIntervalDayVector; +import org.apache.arrow.vector.NullableIntervalYearVector; +import org.apache.arrow.vector.NullableSmallIntVector; +import org.apache.arrow.vector.NullableTimeStampVector; +import org.apache.arrow.vector.NullableTimeVector; +import org.apache.arrow.vector.NullableTinyIntVector; +import org.apache.arrow.vector.NullableUInt1Vector; +import org.apache.arrow.vector.NullableUInt2Vector; +import org.apache.arrow.vector.NullableUInt4Vector; +import org.apache.arrow.vector.NullableUInt8Vector; +import org.apache.arrow.vector.NullableVarBinaryVector; +import org.apache.arrow.vector.NullableVarCharVector; +import org.apache.arrow.vector.SmallIntVector; +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.ZeroVector; +import org.apache.arrow.vector.complex.ListVector; +import org.apache.arrow.vector.complex.MapVector; +import org.apache.arrow.vector.complex.UnionVector; +import org.apache.arrow.vector.complex.impl.BigIntWriterImpl; +import org.apache.arrow.vector.complex.impl.BitWriterImpl; +import org.apache.arrow.vector.complex.impl.DateWriterImpl; +import org.apache.arrow.vector.complex.impl.Float4WriterImpl; +import org.apache.arrow.vector.complex.impl.Float8WriterImpl; +import org.apache.arrow.vector.complex.impl.IntWriterImpl; +import org.apache.arrow.vector.complex.impl.IntervalDayWriterImpl; +import org.apache.arrow.vector.complex.impl.IntervalYearWriterImpl; +import org.apache.arrow.vector.complex.impl.SingleMapWriter; +import org.apache.arrow.vector.complex.impl.SmallIntWriterImpl; +import org.apache.arrow.vector.complex.impl.TimeStampWriterImpl; +import org.apache.arrow.vector.complex.impl.TimeWriterImpl; +import 
org.apache.arrow.vector.complex.impl.TinyIntWriterImpl; +import org.apache.arrow.vector.complex.impl.UInt1WriterImpl; +import org.apache.arrow.vector.complex.impl.UInt2WriterImpl; +import org.apache.arrow.vector.complex.impl.UInt4WriterImpl; +import org.apache.arrow.vector.complex.impl.UInt8WriterImpl; +import org.apache.arrow.vector.complex.impl.UnionListWriter; +import org.apache.arrow.vector.complex.impl.UnionWriter; +import org.apache.arrow.vector.complex.impl.VarBinaryWriterImpl; +import org.apache.arrow.vector.complex.impl.VarCharWriterImpl; +import org.apache.arrow.vector.complex.writer.FieldWriter; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.ArrowType.Binary; +import org.apache.arrow.vector.types.pojo.ArrowType.Bool; +import org.apache.arrow.vector.types.pojo.ArrowType.Date; +import org.apache.arrow.vector.types.pojo.ArrowType.FloatingPoint; +import org.apache.arrow.vector.types.pojo.ArrowType.Int; +import org.apache.arrow.vector.types.pojo.ArrowType.IntervalDay; +import org.apache.arrow.vector.types.pojo.ArrowType.IntervalYear; +import org.apache.arrow.vector.types.pojo.ArrowType.List; +import org.apache.arrow.vector.types.pojo.ArrowType.Null; +import org.apache.arrow.vector.types.pojo.ArrowType.Time; +import org.apache.arrow.vector.types.pojo.ArrowType.Timestamp; +import org.apache.arrow.vector.types.pojo.ArrowType.Tuple; +import org.apache.arrow.vector.types.pojo.ArrowType.Union; +import org.apache.arrow.vector.types.pojo.ArrowType.Utf8; +import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.util.CallBack; + +import java.util.HashMap; import java.util.Map; -import java.util.Objects; public class Types { + + public static final Field NULL_FIELD = new Field("", true, Null.INSTANCE, null); + public static final Field TINYINT_FIELD = new Field("", true, new Int(8, true), null); + public static final Field SMALLINT_FIELD = new Field("", true, new Int(16, true), null); + public static final Field INT_FIELD = new Field("", true, new Int(32, true), null); + public static final Field BIGINT_FIELD = new Field("", true, new Int(64, true), null); + public static final Field UINT1_FIELD = new Field("", true, new Int(8, false), null); + public static final Field UINT2_FIELD = new Field("", true, new Int(16, false), null); + public static final Field UINT4_FIELD = new Field("", true, new Int(32, false), null); + public static final Field UINT8_FIELD = new Field("", true, new Int(64, false), null); + public static final Field DATE_FIELD = new Field("", true, Date.INSTANCE, null); + public static final Field TIME_FIELD = new Field("", true, Time.INSTANCE, null); + public static final Field TIMESTAMP_FIELD = new Field("", true, new Timestamp(""), null); + public static final Field INTERVALDAY_FIELD = new Field("", true, IntervalDay.INSTANCE, null); + public static final Field INTERVALYEAR_FIELD = new Field("", true, IntervalYear.INSTANCE, null); + public static final Field FLOAT4_FIELD = new Field("", true, new FloatingPoint(0), null); + public static final Field FLOAT8_FIELD = new Field("", true, new FloatingPoint(1), null); + public static final Field LIST_FIELD = new Field("", true, List.INSTANCE, null); + public static final Field VARCHAR_FIELD = new Field("", true, Utf8.INSTANCE, null); + public static final Field VARBINARY_FIELD = new Field("", true, Binary.INSTANCE, null); + public static final Field BIT_FIELD = new Field("", true, Bool.INSTANCE, null); + + public enum MinorType { - LATE, // late binding type - 
MAP, // an empty map column. Useful for conceptual setup. Children listed within here - - TINYINT, // single byte signed integer - SMALLINT, // two byte signed integer - INT, // four byte signed integer - BIGINT, // eight byte signed integer - DECIMAL9, // a decimal supporting precision between 1 and 9 - DECIMAL18, // a decimal supporting precision between 10 and 18 - DECIMAL28SPARSE, // a decimal supporting precision between 19 and 28 - DECIMAL38SPARSE, // a decimal supporting precision between 29 and 38 - MONEY, // signed decimal with two digit precision - DATE, // days since 4713bc - TIME, // time in micros before or after 2000/1/1 - TIMETZ, // time in micros before or after 2000/1/1 with timezone - TIMESTAMPTZ, // unix epoch time in millis - TIMESTAMP, // TBD - INTERVAL, // TBD - FLOAT4, // 4 byte ieee 754 - FLOAT8, // 8 byte ieee 754 - BIT, // single bit value (boolean) - FIXEDCHAR, // utf8 fixed length string, padded with spaces - FIXED16CHAR, - FIXEDBINARY, // fixed length binary, padded with 0 bytes - VARCHAR, // utf8 variable length string - VAR16CHAR, // utf16 variable length string - VARBINARY, // variable length binary - UINT1, // unsigned 1 byte integer - UINT2, // unsigned 2 byte integer - UINT4, // unsigned 4 byte integer - UINT8, // unsigned 8 byte integer - DECIMAL28DENSE, // dense decimal representation, supporting precision between 19 and 28 - DECIMAL38DENSE, // dense decimal representation, supporting precision between 28 and 38 - NULL, // a value of unknown type (e.g. a missing reference). - INTERVALYEAR, // Interval type specifying YEAR to MONTH - INTERVALDAY, // Interval type specifying DAY to SECONDS - LIST, - GENERIC_OBJECT, - UNION - } + NULL(Null.INSTANCE) { + @Override + public Field getField() { + return NULL_FIELD; + } - public enum DataMode { - REQUIRED, - OPTIONAL, - REPEATED - } + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return ZeroVector.INSTANCE; + } - public static class MajorType { - private MinorType minorType; - private DataMode mode; - private int precision; - private int scale; - private int timezone; - private int width; - private List subTypes; - - public MajorType(MinorType minorType, DataMode mode) { - this(minorType, mode, 0, 0, 0, 0, null); - } + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return null; + } + }, + MAP(Tuple.INSTANCE) { + @Override + public Field getField() { + throw new UnsupportedOperationException("Cannot get simple field for Map type"); + } - public MajorType(MinorType minorType, DataMode mode, int precision, int scale) { - this(minorType, mode, precision, scale, 0, 0, null); - } + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return new MapVector(name, allocator, callBack); + } - public MajorType(MinorType minorType, DataMode mode, int precision, int scale, int timezone, List subTypes) { - this(minorType, mode, precision, scale, timezone, 0, subTypes); - } + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new SingleMapWriter((MapVector) vector); + } + }, // an empty map column. Useful for conceptual setup. 
Children listed within here - public MajorType(MinorType minorType, DataMode mode, int precision, int scale, int timezone, int width, List subTypes) { - this.minorType = minorType; - this.mode = mode; - this.precision = precision; - this.scale = scale; - this.timezone = timezone; - this.width = width; - this.subTypes = subTypes; - if (subTypes == null) { - this.subTypes = new ArrayList<>(); + TINYINT(new Int(8, true)) { + @Override + public Field getField() { + return TINYINT_FIELD; } - } - public MinorType getMinorType() { - return minorType; - } + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return new NullableTinyIntVector(name, allocator); + } - public DataMode getMode() { - return mode; - } + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new TinyIntWriterImpl((NullableTinyIntVector) vector); + } + }, // single byte signed integer + SMALLINT(new Int(16, true)) { + @Override + public Field getField() { + return SMALLINT_FIELD; + } - public int getPrecision() { - return precision; - } + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return new NullableSmallIntVector(name, allocator); + } - public int getScale() { - return scale; - } + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new SmallIntWriterImpl((NullableSmallIntVector) vector); + } + }, // two byte signed integer + INT(new Int(32, true)) { + @Override + public Field getField() { + return INT_FIELD; + } - public int getTimezone() { - return timezone; - } + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return new NullableIntVector(name, allocator); + } - public List getSubTypes() { - return subTypes; - } + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new IntWriterImpl((NullableIntVector) vector); + } + }, // four byte signed integer + BIGINT(new Int(64, true)) { + @Override + public Field getField() { + return BIGINT_FIELD; + } - public int getWidth() { - return width; - } + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return new NullableBigIntVector(name, allocator); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new BigIntWriterImpl((NullableBigIntVector) vector); + } + }, // eight byte signed integer + DATE(Date.INSTANCE) { + @Override + public Field getField() { + return DATE_FIELD; + } + + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return new NullableDateVector(name, allocator); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new DateWriterImpl((NullableDateVector) vector); + } + }, // days since 4713bc + TIME(Time.INSTANCE) { + @Override + public Field getField() { + return TIME_FIELD; + } + + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int...
precisionScale) { + return new NullableTimeVector(name, allocator); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new TimeWriterImpl((NullableTimeVector) vector); + } + }, // time in micros before or after 2000/1/1 + TIMESTAMP(new Timestamp("")) { + @Override + public Field getField() { + return TIMESTAMP_FIELD; + } + + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return new NullableTimeStampVector(name, allocator); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new TimeStampWriterImpl((NullableTimeStampVector) vector); + } + }, + INTERVALDAY(IntervalDay.INSTANCE) { + @Override + public Field getField() { + return INTERVALDAY_FIELD; + } + + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return new NullableIntervalDayVector(name, allocator); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new IntervalDayWriterImpl((NullableIntervalDayVector) vector); + } + }, + INTERVALYEAR(IntervalYear.INSTANCE) { + @Override + public Field getField() { + return INTERVALYEAR_FIELD; + } + + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return new NullableIntervalYearVector(name, allocator); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new IntervalYearWriterImpl((NullableIntervalYearVector) vector); + } + }, + FLOAT4(new FloatingPoint(0)) { + @Override + public Field getField() { + return FLOAT4_FIELD; + } + + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return new NullableFloat4Vector(name, allocator); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new Float4WriterImpl((NullableFloat4Vector) vector); + } + }, // 4 byte ieee 754 + FLOAT8(new FloatingPoint(1)) { + @Override + public Field getField() { + return FLOAT8_FIELD; + } + + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return new NullableFloat8Vector(name, allocator); + } + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new Float8WriterImpl((NullableFloat8Vector) vector); + } + }, // 8 byte ieee 754 + BIT(Bool.INSTANCE) { + @Override + public Field getField() { + return BIT_FIELD; + } + + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return new NullableBitVector(name, allocator); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new BitWriterImpl((NullableBitVector) vector); + } + }, // single bit value (boolean) + VARCHAR(Utf8.INSTANCE) { + @Override + public Field getField() { + return VARCHAR_FIELD; + } + + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int...
precisionScale) { + return new NullableVarCharVector(name, allocator); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new VarCharWriterImpl((NullableVarCharVector) vector); + } + }, // utf8 variable length string + VARBINARY(Binary.INSTANCE) { + @Override + public Field getField() { + return VARBINARY_FIELD; + } + + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return new NullableVarBinaryVector(name, allocator); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new VarBinaryWriterImpl((NullableVarBinaryVector) vector); + } + }, // variable length binary + DECIMAL(null) { + @Override + public ArrowType getType() { + throw new UnsupportedOperationException("Cannot get simple type for Decimal type"); + } + @Override + public Field getField() { + throw new UnsupportedOperationException("Cannot get simple field for Decimal type"); + } + + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return new NullableDecimalVector(name, allocator, precisionScale[0], precisionScale[1]); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new VarBinaryWriterImpl((NullableVarBinaryVector) vector); + } + }, // decimal with precision and scale + UINT1(new Int(8, false)) { + @Override + public Field getField() { + return UINT1_FIELD; + } + + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return new NullableUInt1Vector(name, allocator); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new UInt1WriterImpl((NullableUInt1Vector) vector); + } + }, // unsigned 1 byte integer + UINT2(new Int(16, false)) { + @Override + public Field getField() { + return UINT2_FIELD; + } + + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return new NullableUInt2Vector(name, allocator); + } - @Override - public boolean equals(Object other) { - if (other == null) { - return false + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new UInt2WriterImpl((NullableUInt2Vector) vector); } - if (!(other instanceof MajorType)) { - return false + }, // unsigned 2 byte integer + UINT4(new Int(32, false)) { + @Override + public Field getField() { + return UINT4_FIELD; } - MajorType that = (MajorType) other; - return this.minorType == that.minorType && - this.mode == that.mode && - this.precision == that.precision && - this.scale == that.scale && - this.timezone == that.timezone && - this.width == that.width && - Objects.equals(this.subTypes, that.subTypes); + + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return new NullableUInt4Vector(name, allocator); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new UInt4WriterImpl((NullableUInt4Vector) vector); + } + }, // unsigned 4 byte integer + UINT8(new Int(64, false)) { + @Override + public Field getField() { + return UINT8_FIELD; + } + + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int...
precisionScale) { + return new NullableUInt8Vector(name, allocator); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new UInt8WriterImpl((NullableUInt8Vector) vector); + } + }, // unsigned 8 byte integer + LIST(List.INSTANCE) { + @Override + public Field getField() { + throw new UnsupportedOperationException("Cannot get simple field for List type"); + } + + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return new ListVector(name, allocator, callBack); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new UnionListWriter((ListVector) vector); + } + }, + UNION(Union.INSTANCE) { + @Override + public Field getField() { + throw new UnsupportedOperationException("Cannot get simple field for Union type"); + } + + @Override + public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return new UnionVector(name, allocator, callBack); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new UnionWriter((UnionVector) vector); + } + }; + + private final ArrowType type; + + MinorType(ArrowType type) { + this.type = type; } - } + public ArrowType getType() { + return type; + } + + public abstract Field getField(); - public static MajorType required(MinorType minorType) { - return new MajorType(minorType, DataMode.REQUIRED); + public abstract ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale); + + public abstract FieldWriter getNewFieldWriter(ValueVector vector); } - public static MajorType optional(MinorType minorType) { - return new MajorType(minorType, DataMode.OPTIONAL); + + private static final Map ARROW_TYPE_MINOR_TYPE_MAP; + + public static MinorType getMinorTypeForArrowType(ArrowType arrowType) { + if (arrowType.getTypeType() == Type.Decimal) { + return MinorType.DECIMAL; + } + return ARROW_TYPE_MINOR_TYPE_MAP.get(arrowType); } - public static MajorType repeated(MinorType minorType) { - return new MajorType(minorType, DataMode.REPEATED); + + static { + ARROW_TYPE_MINOR_TYPE_MAP = new HashMap<>(); + for (MinorType minorType : MinorType.values()) { + if (minorType != MinorType.DECIMAL) { + ARROW_TYPE_MINOR_TYPE_MAP.put(minorType.getType(), minorType); + } + } } + } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java new file mode 100644 index 0000000000000..49d0503e47036 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java @@ -0,0 +1,105 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + *
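As context for the reworked Types.java above: each MinorType constant now carries its ArrowType together with factories for the matching vector and writer, replacing the old MajorType/DataMode dispatch. The following is a minimal usage sketch, not part of the patch; the method name, field name, and allocator are illustrative assumptions, and the round-trip check assumes the ArrowType pojos implement value equality, as the HashMap lookup in getMinorTypeForArrowType requires.

// Hedged sketch only: "allocator" may be any BufferAllocator implementation.
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.ValueVector;
import org.apache.arrow.vector.complex.writer.FieldWriter;
import org.apache.arrow.vector.types.Types;
import org.apache.arrow.vector.types.Types.MinorType;

public class MinorTypeUsageSketch {
  public static ValueVector newIntVector(String name, BufferAllocator allocator) {
    MinorType minorType = MinorType.INT;
    // The enum constant builds the concrete vector (no CallBack needed here)...
    ValueVector vector = minorType.getNewVector(name, allocator, null);
    // ...and the matching writer, so callers need no per-type switch table.
    FieldWriter writer = minorType.getNewFieldWriter(vector);
    // The ArrowType carried by the constant maps back to the same MinorType;
    // DECIMAL is special-cased because its ArrowType is parameterized.
    assert Types.getMinorTypeForArrowType(minorType.getType()) == MinorType.INT;
    return vector;
  }
}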
+ * http://www.apache.org/licenses/LICENSE-2.0 + *
+ * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.types.pojo; + + +import com.google.common.collect.ImmutableList; +import com.google.flatbuffers.FlatBufferBuilder; + +import java.util.List; +import java.util.Objects; + +import static org.apache.arrow.vector.types.pojo.ArrowType.getTypeForField; + +public class Field { + private final String name; + private final boolean nullable; + private final ArrowType type; + private final List children; + + public Field(String name, boolean nullable, ArrowType type, List children) { + this.name = name; + this.nullable = nullable; + this.type = type; + if (children == null) { + this.children = ImmutableList.of(); + } else { + this.children = children; + } + } + + public static Field convertField(org.apache.arrow.flatbuf.Field field) { + String name = field.name(); + boolean nullable = field.nullable(); + ArrowType type = getTypeForField(field); + ImmutableList.Builder childrenBuilder = ImmutableList.builder(); + for (int i = 0; i < field.childrenLength(); i++) { + childrenBuilder.add(convertField(field.children(i))); + } + List children = childrenBuilder.build(); + return new Field(name, nullable, type, children); + } + + public int getField(FlatBufferBuilder builder) { + int nameOffset = builder.createString(name); + int typeOffset = type.getType(builder); + int[] childrenData = new int[children.size()]; + for (int i = 0; i < children.size(); i++) { + childrenData[i] = children.get(i).getField(builder); + } + int childrenOffset = org.apache.arrow.flatbuf.Field.createChildrenVector(builder, childrenData); + org.apache.arrow.flatbuf.Field.startField(builder); + org.apache.arrow.flatbuf.Field.addName(builder, nameOffset); + org.apache.arrow.flatbuf.Field.addNullable(builder, nullable); + org.apache.arrow.flatbuf.Field.addTypeType(builder, type.getTypeType()); + org.apache.arrow.flatbuf.Field.addType(builder, typeOffset); + org.apache.arrow.flatbuf.Field.addChildren(builder, childrenOffset); + return org.apache.arrow.flatbuf.Field.endField(builder); + } + + public String getName() { + return name; + } + + public boolean isNullable() { + return nullable; + } + + public ArrowType getType() { + return type; + } + + public List getChildren() { + return children; + } + + @Override + public boolean equals(Object obj) { + if (!(obj instanceof Field)) { + return false; + } + Field that = (Field) obj; + return Objects.equals(this.name, that.name) && + Objects.equals(this.nullable, that.nullable) && + Objects.equals(this.type, that.type) && + (Objects.equals(this.children, that.children) || + (this.children == null && that.children.size() == 0) || + (this.children.size() == 0 && that.children == null)); + + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java new file mode 100644 index 0000000000000..9e2894170b24b --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java @@ -0,0 +1,74 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + *
+ * http://www.apache.org/licenses/LICENSE-2.0 + *
+ * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.types.pojo; + + +import com.google.common.collect.ImmutableList; +import com.google.flatbuffers.FlatBufferBuilder; + +import java.nio.ByteBuffer; +import java.util.List; +import java.util.Objects; + +import static org.apache.arrow.vector.types.pojo.ArrowType.getTypeForField; +import static org.apache.arrow.vector.types.pojo.Field.convertField; + +public class Schema { + private List fields; + + public Schema(List fields) { + this.fields = ImmutableList.copyOf(fields); + } + + public int getSchema(FlatBufferBuilder builder) { + int[] fieldOffsets = new int[fields.size()]; + for (int i = 0; i < fields.size(); i++) { + fieldOffsets[i] = fields.get(i).getField(builder); + } + int fieldsOffset = org.apache.arrow.flatbuf.Schema.createFieldsVector(builder, fieldOffsets); + org.apache.arrow.flatbuf.Schema.startSchema(builder); + org.apache.arrow.flatbuf.Schema.addFields(builder, fieldsOffset); + return org.apache.arrow.flatbuf.Schema.endSchema(builder); + } + + public List getFields() { + return fields; + } + + @Override + public int hashCode() { + return Objects.hashCode(fields); + } + + @Override + public boolean equals(Object obj) { + if (!(obj instanceof Schema)) { + return false; + } + return Objects.equals(this.fields, ((Schema) obj).fields); + } + + public static Schema convertSchema(org.apache.arrow.flatbuf.Schema schema) { + ImmutableList.Builder childrenBuilder = ImmutableList.builder(); + for (int i = 0; i < schema.fieldsLength(); i++) { + childrenBuilder.add(convertField(schema.fields(i))); + } + List fields = childrenBuilder.build(); + return new Schema(fields); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java b/java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java index b6dd13a06a82d..68b9fb25f2112 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/ByteFunctionHelpers.java @@ -180,54 +180,4 @@ private static final int memcmp(final long laddr, int lStart, int lEnd, final by return lLen > rLen ? 1 : -1; } - /* - * Following are helper functions to interact with sparse decimal represented in a byte array. 
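The new pojo Field and Schema classes added above are meant to round-trip through the generated FlatBuffers classes. Here is a minimal sketch under stated assumptions (it is not part of the patch; the builder capacity and column name are arbitrary, it assumes the generated org.apache.arrow.flatbuf.Schema exposes the standard getRootAsSchema entry point, and the equality check assumes the ArrowType pojos implement value equality):

import com.google.common.collect.ImmutableList;
import com.google.flatbuffers.FlatBufferBuilder;
import org.apache.arrow.vector.types.pojo.ArrowType.Int;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class SchemaRoundTripSketch {
  public static void main(String[] args) {
    // One nullable 32-bit signed integer column named "id" (no children).
    Field id = new Field("id", true, new Int(32, true), null);
    Schema schema = new Schema(ImmutableList.of(id));

    // Serialize: write the schema table and finish the buffer at its root.
    FlatBufferBuilder builder = new FlatBufferBuilder(1024);
    builder.finish(schema.getSchema(builder));

    // Deserialize through the generated flatbuf class and compare.
    Schema roundTripped = Schema.convertSchema(
        org.apache.arrow.flatbuf.Schema.getRootAsSchema(builder.dataBuffer()));
    System.out.println(schema.equals(roundTripped)); // expected: true
  }
}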
- */ - - // Get the integer ignore the sign - public static int getInteger(byte[] b, int index) { - return getInteger(b, index, true); - } - // Get the integer, ignore the sign - public static int getInteger(byte[] b, int index, boolean ignoreSign) { - int startIndex = index * DecimalUtility.INTEGER_SIZE; - - if (index == 0 && ignoreSign == true) { - return (b[startIndex + 3] & 0xFF) | - (b[startIndex + 2] & 0xFF) << 8 | - (b[startIndex + 1] & 0xFF) << 16 | - (b[startIndex] & 0x7F) << 24; - } - - return ((b[startIndex + 3] & 0xFF) | - (b[startIndex + 2] & 0xFF) << 8 | - (b[startIndex + 1] & 0xFF) << 16 | - (b[startIndex] & 0xFF) << 24); - - } - - // Set integer in the byte array - public static void setInteger(byte[] b, int index, int value) { - int startIndex = index * DecimalUtility.INTEGER_SIZE; - b[startIndex] = (byte) ((value >> 24) & 0xFF); - b[startIndex + 1] = (byte) ((value >> 16) & 0xFF); - b[startIndex + 2] = (byte) ((value >> 8) & 0xFF); - b[startIndex + 3] = (byte) ((value) & 0xFF); - } - - // Set the sign in a sparse decimal representation - public static void setSign(byte[] b, boolean sign) { - int value = getInteger(b, 0); - if (sign == true) { - setInteger(b, 0, value | 0x80000000); - } else { - setInteger(b, 0, value & 0x7FFFFFFF); - } - } - - // Get the sign - public static boolean getSign(byte[] b) { - return ((getInteger(b, 0, false) & 0x80000000) != 0); - } - } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/CoreDecimalUtility.java b/java/vector/src/main/java/org/apache/arrow/vector/util/CoreDecimalUtility.java deleted file mode 100644 index 1eb2c13cd4c59..0000000000000 --- a/java/vector/src/main/java/org/apache/arrow/vector/util/CoreDecimalUtility.java +++ /dev/null @@ -1,91 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.arrow.vector.util; - -import java.math.BigDecimal; - -import org.apache.arrow.vector.types.Types; - -public class CoreDecimalUtility { - static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(CoreDecimalUtility.class); - - public static long getDecimal18FromBigDecimal(BigDecimal input, int scale, int precision) { - // Truncate or pad to set the input to the correct scale - input = input.setScale(scale, BigDecimal.ROUND_HALF_UP); - - return (input.unscaledValue().longValue()); - } - - public static int getMaxPrecision(Types.MinorType decimalType) { - if (decimalType == Types.MinorType.DECIMAL9) { - return 9; - } else if (decimalType == Types.MinorType.DECIMAL18) { - return 18; - } else if (decimalType == Types.MinorType.DECIMAL28SPARSE) { - return 28; - } else if (decimalType == Types.MinorType.DECIMAL38SPARSE) { - return 38; - } - return 0; - } - - /* - * Function returns the Minor decimal type given the precision - */ - public static Types.MinorType getDecimalDataType(int precision) { - if (precision <= 9) { - return Types.MinorType.DECIMAL9; - } else if (precision <= 18) { - return Types.MinorType.DECIMAL18; - } else if (precision <= 28) { - return Types.MinorType.DECIMAL28SPARSE; - } else { - return Types.MinorType.DECIMAL38SPARSE; - } - } - - /* - * Given a precision it provides the max precision of that decimal data type; - * For eg: given the precision 12, we would use DECIMAL18 to store the data - * which has a max precision range of 18 digits - */ - public static int getPrecisionRange(int precision) { - return getMaxPrecision(getDecimalDataType(precision)); - } - public static int getDecimal9FromBigDecimal(BigDecimal input, int scale, int precision) { - // Truncate/ or pad to set the input to the correct scale - input = input.setScale(scale, BigDecimal.ROUND_HALF_UP); - - return (input.unscaledValue().intValue()); - } - - /* - * Helper function to detect if the given data type is Decimal - */ - public static boolean isDecimalType(Types.MajorType type) { - return isDecimalType(type.getMinorType()); - } - - public static boolean isDecimalType(Types.MinorType minorType) { - if (minorType == Types.MinorType.DECIMAL9 || minorType == Types.MinorType.DECIMAL18 || - minorType == Types.MinorType.DECIMAL28SPARSE || minorType == Types.MinorType.DECIMAL38SPARSE) { - return true; - } - return false; - } -} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java b/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java index a3763cd34f1a1..4eb0d9f2216c1 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java @@ -26,140 +26,139 @@ import java.nio.ByteBuffer; import java.util.Arrays; -import org.apache.arrow.vector.holders.Decimal38SparseHolder; - -public class DecimalUtility extends CoreDecimalUtility{ - - public final static int MAX_DIGITS = 9; - public final static int DIGITS_BASE = 1000000000; - public final static int DIGITS_MAX = 999999999; - public final static int INTEGER_SIZE = (Integer.SIZE/8); - - public final static String[] decimalToString = {"", - "0", - "00", - "000", - "0000", - "00000", - "000000", - "0000000", - "00000000", - "000000000"}; - - public final static long[] scale_long_constants = { - 1, - 10, - 100, - 1000, - 10000, - 100000, - 1000000, - 10000000, - 100000000, - 1000000000, - 10000000000l, - 100000000000l, - 1000000000000l, - 10000000000000l, - 100000000000000l, 
- 1000000000000000l, - 10000000000000000l, - 100000000000000000l, - 1000000000000000000l}; - - /* - * Simple function that returns the static precomputed - * power of ten, instead of using Math.pow - */ - public static long getPowerOfTen(int power) { - assert power >= 0 && power < scale_long_constants.length; - return scale_long_constants[(power)]; - } - - /* - * Math.pow returns a double and while multiplying with large digits - * in the decimal data type we encounter noise. So instead of multiplying - * with Math.pow we use the static constants to perform the multiplication - */ - public static long adjustScaleMultiply(long input, int factor) { - int index = Math.abs(factor); - assert index >= 0 && index < scale_long_constants.length; - if (factor >= 0) { - return input * scale_long_constants[index]; - } else { - return input / scale_long_constants[index]; - } - } - public static long adjustScaleDivide(long input, int factor) { - int index = Math.abs(factor); - assert index >= 0 && index < scale_long_constants.length; - if (factor >= 0) { - return input / scale_long_constants[index]; - } else { - return input * scale_long_constants[index]; - } +public class DecimalUtility { + + public final static int MAX_DIGITS = 9; + public final static int DIGITS_BASE = 1000000000; + public final static int DIGITS_MAX = 999999999; + public final static int INTEGER_SIZE = (Integer.SIZE/8); + + public final static String[] decimalToString = {"", + "0", + "00", + "000", + "0000", + "00000", + "000000", + "0000000", + "00000000", + "000000000"}; + + public final static long[] scale_long_constants = { + 1, + 10, + 100, + 1000, + 10000, + 100000, + 1000000, + 10000000, + 100000000, + 1000000000, + 10000000000l, + 100000000000l, + 1000000000000l, + 10000000000000l, + 100000000000000l, + 1000000000000000l, + 10000000000000000l, + 100000000000000000l, + 1000000000000000000l}; + + /* + * Simple function that returns the static precomputed + * power of ten, instead of using Math.pow + */ + public static long getPowerOfTen(int power) { + assert power >= 0 && power < scale_long_constants.length; + return scale_long_constants[(power)]; + } + + /* + * Math.pow returns a double and while multiplying with large digits + * in the decimal data type we encounter noise. 
So instead of multiplying + * with Math.pow we use the static constants to perform the multiplication + */ + public static long adjustScaleMultiply(long input, int factor) { + int index = Math.abs(factor); + assert index >= 0 && index < scale_long_constants.length; + if (factor >= 0) { + return input * scale_long_constants[index]; + } else { + return input / scale_long_constants[index]; } + } - /* Given the number of actual digits this function returns the - * number of indexes it will occupy in the array of integers - * which are stored in base 1 billion - */ - public static int roundUp(int ndigits) { - return (ndigits + MAX_DIGITS - 1)/MAX_DIGITS; + public static long adjustScaleDivide(long input, int factor) { + int index = Math.abs(factor); + assert index >= 0 && index < scale_long_constants.length; + if (factor >= 0) { + return input / scale_long_constants[index]; + } else { + return input * scale_long_constants[index]; } + } - /* Returns a string representation of the given integer - * If the length of the given integer is less than the - * passed length, this function will prepend zeroes to the string - */ - public static StringBuilder toStringWithZeroes(int number, int desiredLength) { - String value = ((Integer) number).toString(); - int length = value.length(); + /* Given the number of actual digits this function returns the + * number of indexes it will occupy in the array of integers + * which are stored in base 1 billion + */ + public static int roundUp(int ndigits) { + return (ndigits + MAX_DIGITS - 1)/MAX_DIGITS; + } - StringBuilder str = new StringBuilder(); - str.append(decimalToString[desiredLength - length]); - str.append(value); + /* Returns a string representation of the given integer + * If the length of the given integer is less than the + * passed length, this function will prepend zeroes to the string + */ + public static StringBuilder toStringWithZeroes(int number, int desiredLength) { + String value = ((Integer) number).toString(); + int length = value.length(); - return str; - } + StringBuilder str = new StringBuilder(); + str.append(decimalToString[desiredLength - length]); + str.append(value); - public static StringBuilder toStringWithZeroes(long number, int desiredLength) { - String value = ((Long) number).toString(); - int length = value.length(); + return str; + } - StringBuilder str = new StringBuilder(); + public static StringBuilder toStringWithZeroes(long number, int desiredLength) { + String value = ((Long) number).toString(); + int length = value.length(); - // Desired length can be > MAX_DIGITS - int zeroesLength = desiredLength - length; - while (zeroesLength > MAX_DIGITS) { - str.append(decimalToString[MAX_DIGITS]); - zeroesLength -= MAX_DIGITS; - } - str.append(decimalToString[zeroesLength]); - str.append(value); + StringBuilder str = new StringBuilder(); - return str; + // Desired length can be > MAX_DIGITS + int zeroesLength = desiredLength - length; + while (zeroesLength > MAX_DIGITS) { + str.append(decimalToString[MAX_DIGITS]); + zeroesLength -= MAX_DIGITS; } + str.append(decimalToString[zeroesLength]); + str.append(value); + + return str; + } public static BigDecimal getBigDecimalFromIntermediate(ByteBuf data, int startIndex, int nDecimalDigits, int scale) { - // In the intermediate representation we don't pad the scale with zeroes, so set truncate = false - return getBigDecimalFromArrowBuf(data, startIndex, nDecimalDigits, scale, false); - } + // In the intermediate representation we don't pad the scale with zeroes, so set truncate = false + 
return getBigDecimalFromArrowBuf(data, startIndex, nDecimalDigits, scale, false); + } - public static BigDecimal getBigDecimalFromSparse(ArrowBuf data, int startIndex, int nDecimalDigits, int scale) { + public static BigDecimal getBigDecimalFromSparse(ArrowBuf data, int startIndex, int nDecimalDigits, int scale) { - // In the sparse representation we pad the scale with zeroes for ease of arithmetic, need to truncate - return getBigDecimalFromArrowBuf(data, startIndex, nDecimalDigits, scale, true); - } + // In the sparse representation we pad the scale with zeroes for ease of arithmetic, need to truncate + return getBigDecimalFromArrowBuf(data, startIndex, nDecimalDigits, scale, true); + } - public static BigDecimal getBigDecimalFromArrowBuf(ArrowBuf bytebuf, int start, int length, int scale) { - byte[] value = new byte[length]; - bytebuf.getBytes(start, value, 0, length); - BigInteger unscaledValue = new BigInteger(value); - return new BigDecimal(unscaledValue, scale); - } + public static BigDecimal getBigDecimalFromArrowBuf(ArrowBuf bytebuf, int start, int length, int scale) { + byte[] value = new byte[length]; + bytebuf.getBytes(start, value, 0, length); + BigInteger unscaledValue = new BigInteger(value); + return new BigDecimal(unscaledValue, scale); + } public static BigDecimal getBigDecimalFromByteBuffer(ByteBuffer bytebuf, int start, int length, int scale) { byte[] value = new byte[length]; @@ -168,115 +167,123 @@ public static BigDecimal getBigDecimalFromByteBuffer(ByteBuffer bytebuf, int sta return new BigDecimal(unscaledValue, scale); } - /* Create a BigDecimal object using the data in the ArrowBuf. - * This function assumes that data is provided in a non-dense format - * It works on both sparse and intermediate representations. - */ + public static void writeBigDecimalToArrowBuf(ArrowBuf bytebuf, int startIndex, BigDecimal value) { + byte[] bytes = value.unscaledValue().toByteArray(); + if (bytes.length > 16) { + throw new UnsupportedOperationException("Decimal size greater than 16 bytes"); + } + bytebuf.setBytes(startIndex + 16 - bytes.length, bytes, 0, bytes.length); + } + + /* Create a BigDecimal object using the data in the ArrowBuf. + * This function assumes that data is provided in a non-dense format + * It works on both sparse and intermediate representations. + */ public static BigDecimal getBigDecimalFromArrowBuf(ByteBuf data, int startIndex, int nDecimalDigits, int scale, - boolean truncateScale) { + boolean truncateScale) { - // For sparse decimal type we have padded zeroes at the end, strip them while converting to BigDecimal. - int actualDigits; + // For sparse decimal type we have padded zeroes at the end, strip them while converting to BigDecimal. 
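
The writeBigDecimalToArrowBuf method introduced above copies the two's-complement unscaled bytes right-aligned into a 16-byte slot; as written it leaves the leading pad bytes alone, which a later commit in this series (ARROW-265) revisits for negative values. A hedged round-trip sketch using plain byte arrays in place of ArrowBuf:

    import java.math.BigDecimal;
    import java.math.BigInteger;
    import java.util.Arrays;

    public class DecimalBytesSketch {
      static final int DECIMAL_BYTE_LENGTH = 16; // slot width assumed from the patch

      // Right-align the unscaled value and sign-extend the padding
      // (0x00 for non-negative values, 0xFF for negative ones).
      static byte[] write(BigDecimal value) {
        byte[] unscaled = value.unscaledValue().toByteArray();
        if (unscaled.length > DECIMAL_BYTE_LENGTH) {
          throw new UnsupportedOperationException("Decimal size greater than 16 bytes");
        }
        byte[] slot = new byte[DECIMAL_BYTE_LENGTH];
        if (value.signum() < 0) {
          Arrays.fill(slot, 0, DECIMAL_BYTE_LENGTH - unscaled.length, (byte) 0xFF);
        }
        System.arraycopy(unscaled, 0, slot, DECIMAL_BYTE_LENGTH - unscaled.length, unscaled.length);
        return slot;
      }

      static BigDecimal read(byte[] slot, int scale) {
        return new BigDecimal(new BigInteger(slot), scale);
      }

      public static void main(String[] args) {
        System.out.println(read(write(new BigDecimal("-12345.678")), 3)); // -12345.678
      }
    }
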
+ int actualDigits; - // Initialize the BigDecimal, first digit in the ArrowBuf has the sign so mask it out - BigInteger decimalDigits = BigInteger.valueOf((data.getInt(startIndex)) & 0x7FFFFFFF); + // Initialize the BigDecimal, first digit in the ArrowBuf has the sign so mask it out + BigInteger decimalDigits = BigInteger.valueOf((data.getInt(startIndex)) & 0x7FFFFFFF); - BigInteger base = BigInteger.valueOf(DIGITS_BASE); + BigInteger base = BigInteger.valueOf(DIGITS_BASE); - for (int i = 1; i < nDecimalDigits; i++) { + for (int i = 1; i < nDecimalDigits; i++) { - BigInteger temp = BigInteger.valueOf(data.getInt(startIndex + (i * INTEGER_SIZE))); - decimalDigits = decimalDigits.multiply(base); - decimalDigits = decimalDigits.add(temp); - } + BigInteger temp = BigInteger.valueOf(data.getInt(startIndex + (i * INTEGER_SIZE))); + decimalDigits = decimalDigits.multiply(base); + decimalDigits = decimalDigits.add(temp); + } - // Truncate any additional padding we might have added - if (truncateScale == true && scale > 0 && (actualDigits = scale % MAX_DIGITS) != 0) { - BigInteger truncate = BigInteger.valueOf((int)Math.pow(10, (MAX_DIGITS - actualDigits))); - decimalDigits = decimalDigits.divide(truncate); - } + // Truncate any additional padding we might have added + if (truncateScale == true && scale > 0 && (actualDigits = scale % MAX_DIGITS) != 0) { + BigInteger truncate = BigInteger.valueOf((int)Math.pow(10, (MAX_DIGITS - actualDigits))); + decimalDigits = decimalDigits.divide(truncate); + } - // set the sign - if ((data.getInt(startIndex) & 0x80000000) != 0) { - decimalDigits = decimalDigits.negate(); - } + // set the sign + if ((data.getInt(startIndex) & 0x80000000) != 0) { + decimalDigits = decimalDigits.negate(); + } - BigDecimal decimal = new BigDecimal(decimalDigits, scale); + BigDecimal decimal = new BigDecimal(decimalDigits, scale); - return decimal; - } + return decimal; + } - /* This function returns a BigDecimal object from the dense decimal representation. - * First step is to convert the dense representation into an intermediate representation - * and then invoke getBigDecimalFromArrowBuf() to get the BigDecimal object - */ - public static BigDecimal getBigDecimalFromDense(ArrowBuf data, int startIndex, int nDecimalDigits, int scale, int maxPrecision, int width) { + /* This function returns a BigDecimal object from the dense decimal representation. + * First step is to convert the dense representation into an intermediate representation + * and then invoke getBigDecimalFromArrowBuf() to get the BigDecimal object + */ + public static BigDecimal getBigDecimalFromDense(ArrowBuf data, int startIndex, int nDecimalDigits, int scale, int maxPrecision, int width) { /* This method converts the dense representation to * an intermediate representation. The intermediate * representation has one more integer than the dense * representation. 
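* As a worked example of the decoding loop above (illustrative numbers,
* not part of the patch): with DIGITS_BASE = 10^9, the stored ints
* [1, 234567890] accumulate to 1 * 10^9 + 234567890 = 1234567890, so with
* scale = 2 and no padding to strip this yields the BigDecimal
* 12345678.90, negated when the sign bit of the first int is set.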
*/ - byte[] intermediateBytes = new byte[((nDecimalDigits + 1) * INTEGER_SIZE)]; - - // Start storing from the least significant byte of the first integer - int intermediateIndex = 3; - - int[] mask = {0x03, 0x0F, 0x3F, 0xFF}; - int[] reverseMask = {0xFC, 0xF0, 0xC0, 0x00}; - - int maskIndex; - int shiftOrder; - byte shiftBits; - - // TODO: Some of the logic here is common with casting from Dense to Sparse types, factor out common code - if (maxPrecision == 38) { - maskIndex = 0; - shiftOrder = 6; - shiftBits = 0x00; - intermediateBytes[intermediateIndex++] = (byte) (data.getByte(startIndex) & 0x7F); - } else if (maxPrecision == 28) { - maskIndex = 1; - shiftOrder = 4; - shiftBits = (byte) ((data.getByte(startIndex) & 0x03) << shiftOrder); - intermediateBytes[intermediateIndex++] = (byte) (((data.getByte(startIndex) & 0x3C) & 0xFF) >>> 2); - } else { - throw new UnsupportedOperationException("Dense types with max precision 38 and 28 are only supported"); - } + byte[] intermediateBytes = new byte[((nDecimalDigits + 1) * INTEGER_SIZE)]; + + // Start storing from the least significant byte of the first integer + int intermediateIndex = 3; + + int[] mask = {0x03, 0x0F, 0x3F, 0xFF}; + int[] reverseMask = {0xFC, 0xF0, 0xC0, 0x00}; + + int maskIndex; + int shiftOrder; + byte shiftBits; + + // TODO: Some of the logic here is common with casting from Dense to Sparse types, factor out common code + if (maxPrecision == 38) { + maskIndex = 0; + shiftOrder = 6; + shiftBits = 0x00; + intermediateBytes[intermediateIndex++] = (byte) (data.getByte(startIndex) & 0x7F); + } else if (maxPrecision == 28) { + maskIndex = 1; + shiftOrder = 4; + shiftBits = (byte) ((data.getByte(startIndex) & 0x03) << shiftOrder); + intermediateBytes[intermediateIndex++] = (byte) (((data.getByte(startIndex) & 0x3C) & 0xFF) >>> 2); + } else { + throw new UnsupportedOperationException("Dense types with max precision 38 and 28 are only supported"); + } - int inputIndex = 1; - boolean sign = false; + int inputIndex = 1; + boolean sign = false; - if ((data.getByte(startIndex) & 0x80) != 0) { - sign = true; - } + if ((data.getByte(startIndex) & 0x80) != 0) { + sign = true; + } - while (inputIndex < width) { + while (inputIndex < width) { - intermediateBytes[intermediateIndex] = (byte) ((shiftBits) | (((data.getByte(startIndex + inputIndex) & reverseMask[maskIndex]) & 0xFF) >>> (8 - shiftOrder))); + intermediateBytes[intermediateIndex] = (byte) ((shiftBits) | (((data.getByte(startIndex + inputIndex) & reverseMask[maskIndex]) & 0xFF) >>> (8 - shiftOrder))); - shiftBits = (byte) ((data.getByte(startIndex + inputIndex) & mask[maskIndex]) << shiftOrder); + shiftBits = (byte) ((data.getByte(startIndex + inputIndex) & mask[maskIndex]) << shiftOrder); - inputIndex++; - intermediateIndex++; + inputIndex++; + intermediateIndex++; - if (((inputIndex - 1) % INTEGER_SIZE) == 0) { - shiftBits = (byte) ((shiftBits & 0xFF) >>> 2); - maskIndex++; - shiftOrder -= 2; - } + if (((inputIndex - 1) % INTEGER_SIZE) == 0) { + shiftBits = (byte) ((shiftBits & 0xFF) >>> 2); + maskIndex++; + shiftOrder -= 2; + } - } + } /* copy the last byte */ - intermediateBytes[intermediateIndex] = shiftBits; + intermediateBytes[intermediateIndex] = shiftBits; - if (sign == true) { - intermediateBytes[0] = (byte) (intermediateBytes[0] | 0x80); - } + if (sign == true) { + intermediateBytes[0] = (byte) (intermediateBytes[0] | 0x80); + } final ByteBuf intermediate = UnpooledByteBufAllocator.DEFAULT.buffer(intermediateBytes.length); try { - intermediate.setBytes(0, 
intermediateBytes); + intermediate.setBytes(0, intermediateBytes); BigDecimal ret = getBigDecimalFromIntermediate(intermediate, 0, nDecimalDigits + 1, scale); return ret; @@ -284,299 +291,296 @@ public static BigDecimal getBigDecimalFromDense(ArrowBuf data, int startIndex, i intermediate.release(); } - } + } - /* - * Function converts the BigDecimal and stores it in out internal sparse representation - */ - public static void getSparseFromBigDecimal(BigDecimal input, ByteBuf data, int startIndex, int scale, int precision, - int nDecimalDigits) { + public static void getSparseFromBigDecimal(BigDecimal input, ByteBuf data, int startIndex, int scale, int precision, + int nDecimalDigits) { - // Initialize the buffer - for (int i = 0; i < nDecimalDigits; i++) { - data.setInt(startIndex + (i * INTEGER_SIZE), 0); - } + // Initialize the buffer + for (int i = 0; i < nDecimalDigits; i++) { + data.setInt(startIndex + (i * INTEGER_SIZE), 0); + } - boolean sign = false; + boolean sign = false; - if (input.signum() == -1) { - // negative input - sign = true; - input = input.abs(); - } + if (input.signum() == -1) { + // negative input + sign = true; + input = input.abs(); + } - // Truncate the input as per the scale provided - input = input.setScale(scale, BigDecimal.ROUND_HALF_UP); + // Truncate the input as per the scale provided + input = input.setScale(scale, BigDecimal.ROUND_HALF_UP); - // Separate out the integer part - BigDecimal integerPart = input.setScale(0, BigDecimal.ROUND_DOWN); + // Separate out the integer part + BigDecimal integerPart = input.setScale(0, BigDecimal.ROUND_DOWN); - int destIndex = nDecimalDigits - roundUp(scale) - 1; + int destIndex = nDecimalDigits - roundUp(scale) - 1; - // we use base 1 billion integer digits for out integernal representation - BigDecimal base = new BigDecimal(DIGITS_BASE); + // we use base 1 billion integer digits for out integernal representation + BigDecimal base = new BigDecimal(DIGITS_BASE); - while (integerPart.compareTo(BigDecimal.ZERO) == 1) { - // store the modulo as the integer value - data.setInt(startIndex + (destIndex * INTEGER_SIZE), (integerPart.remainder(base)).intValue()); - destIndex--; - // Divide by base 1 billion - integerPart = (integerPart.divide(base)).setScale(0, BigDecimal.ROUND_DOWN); - } + while (integerPart.compareTo(BigDecimal.ZERO) == 1) { + // store the modulo as the integer value + data.setInt(startIndex + (destIndex * INTEGER_SIZE), (integerPart.remainder(base)).intValue()); + destIndex--; + // Divide by base 1 billion + integerPart = (integerPart.divide(base)).setScale(0, BigDecimal.ROUND_DOWN); + } /* Sparse representation contains padding of additional zeroes * so each digit contains MAX_DIGITS for ease of arithmetic */ - int actualDigits; - if ((actualDigits = (scale % MAX_DIGITS)) != 0) { - // Pad additional zeroes - scale = scale + (MAX_DIGITS - actualDigits); - input = input.setScale(scale, BigDecimal.ROUND_DOWN); - } - - //separate out the fractional part - BigDecimal fractionalPart = input.remainder(BigDecimal.ONE).movePointRight(scale); + int actualDigits; + if ((actualDigits = (scale % MAX_DIGITS)) != 0) { + // Pad additional zeroes + scale = scale + (MAX_DIGITS - actualDigits); + input = input.setScale(scale, BigDecimal.ROUND_DOWN); + } - destIndex = nDecimalDigits - 1; + //separate out the fractional part + BigDecimal fractionalPart = input.remainder(BigDecimal.ONE).movePointRight(scale); - while (scale > 0) { - // Get next set of MAX_DIGITS (9) store it in the ArrowBuf - fractionalPart = 
fractionalPart.movePointLeft(MAX_DIGITS); - BigDecimal temp = fractionalPart.remainder(BigDecimal.ONE); + destIndex = nDecimalDigits - 1; - data.setInt(startIndex + (destIndex * INTEGER_SIZE), (temp.unscaledValue().intValue())); - destIndex--; + while (scale > 0) { + // Get next set of MAX_DIGITS (9) store it in the ArrowBuf + fractionalPart = fractionalPart.movePointLeft(MAX_DIGITS); + BigDecimal temp = fractionalPart.remainder(BigDecimal.ONE); - fractionalPart = fractionalPart.setScale(0, BigDecimal.ROUND_DOWN); - scale -= MAX_DIGITS; - } + data.setInt(startIndex + (destIndex * INTEGER_SIZE), (temp.unscaledValue().intValue())); + destIndex--; - // Set the negative sign - if (sign == true) { - data.setInt(startIndex, data.getInt(startIndex) | 0x80000000); - } + fractionalPart = fractionalPart.setScale(0, BigDecimal.ROUND_DOWN); + scale -= MAX_DIGITS; + } + // Set the negative sign + if (sign == true) { + data.setInt(startIndex, data.getInt(startIndex) | 0x80000000); } + } - public static long getDecimal18FromBigDecimal(BigDecimal input, int scale, int precision) { - // Truncate or pad to set the input to the correct scale - input = input.setScale(scale, BigDecimal.ROUND_HALF_UP); - return (input.unscaledValue().longValue()); - } + public static long getDecimal18FromBigDecimal(BigDecimal input, int scale, int precision) { + // Truncate or pad to set the input to the correct scale + input = input.setScale(scale, BigDecimal.ROUND_HALF_UP); - public static BigDecimal getBigDecimalFromPrimitiveTypes(int input, int scale, int precision) { - return BigDecimal.valueOf(input, scale); - } + return (input.unscaledValue().longValue()); + } - public static BigDecimal getBigDecimalFromPrimitiveTypes(long input, int scale, int precision) { - return BigDecimal.valueOf(input, scale); - } + public static BigDecimal getBigDecimalFromPrimitiveTypes(int input, int scale, int precision) { + return BigDecimal.valueOf(input, scale); + } + + public static BigDecimal getBigDecimalFromPrimitiveTypes(long input, int scale, int precision) { + return BigDecimal.valueOf(input, scale); + } - public static int compareDenseBytes(ArrowBuf left, int leftStart, boolean leftSign, ArrowBuf right, int rightStart, boolean rightSign, int width) { + public static int compareDenseBytes(ArrowBuf left, int leftStart, boolean leftSign, ArrowBuf right, int rightStart, boolean rightSign, int width) { - int invert = 1; + int invert = 1; /* If signs are different then simply look at the * sign of the two inputs and determine which is greater */ - if (leftSign != rightSign) { + if (leftSign != rightSign) { - return((leftSign == true) ? -1 : 1); - } else if(leftSign == true) { + return((leftSign == true) ? 
-1 : 1); + } else if(leftSign == true) { /* Both inputs are negative, at the end we will * have to invert the comparison */ - invert = -1; - } - - int cmp = 0; - - for (int i = 0; i < width; i++) { - byte leftByte = left.getByte(leftStart + i); - byte rightByte = right.getByte(rightStart + i); - // Unsigned byte comparison - if ((leftByte & 0xFF) > (rightByte & 0xFF)) { - cmp = 1; - break; - } else if ((leftByte & 0xFF) < (rightByte & 0xFF)) { - cmp = -1; - break; - } - } - cmp *= invert; // invert the comparison if both were negative values - - return cmp; + invert = -1; } - public static int getIntegerFromSparseBuffer(ArrowBuf buffer, int start, int index) { - int value = buffer.getInt(start + (index * 4)); + int cmp = 0; - if (index == 0) { - /* the first byte contains sign bit, return value without it */ - value = (value & 0x7FFFFFFF); + for (int i = 0; i < width; i++) { + byte leftByte = left.getByte(leftStart + i); + byte rightByte = right.getByte(rightStart + i); + // Unsigned byte comparison + if ((leftByte & 0xFF) > (rightByte & 0xFF)) { + cmp = 1; + break; + } else if ((leftByte & 0xFF) < (rightByte & 0xFF)) { + cmp = -1; + break; } - return value; } + cmp *= invert; // invert the comparison if both were negative values - public static void setInteger(ArrowBuf buffer, int start, int index, int value) { - buffer.setInt(start + (index * 4), value); + return cmp; + } + + public static int getIntegerFromSparseBuffer(ArrowBuf buffer, int start, int index) { + int value = buffer.getInt(start + (index * 4)); + + if (index == 0) { + /* the first byte contains sign bit, return value without it */ + value = (value & 0x7FFFFFFF); } + return value; + } - public static int compareSparseBytes(ArrowBuf left, int leftStart, boolean leftSign, int leftScale, int leftPrecision, ArrowBuf right, int rightStart, boolean rightSign, int rightPrecision, int rightScale, int width, int nDecimalDigits, boolean absCompare) { + public static void setInteger(ArrowBuf buffer, int start, int index, int value) { + buffer.setInt(start + (index * 4), value); + } - int invert = 1; + public static int compareSparseBytes(ArrowBuf left, int leftStart, boolean leftSign, int leftScale, int leftPrecision, ArrowBuf right, int rightStart, boolean rightSign, int rightPrecision, int rightScale, int width, int nDecimalDigits, boolean absCompare) { - if (absCompare == false) { - if (leftSign != rightSign) { - return (leftSign == true) ? -1 : 1; - } + int invert = 1; - // Both values are negative invert the outcome of the comparison - if (leftSign == true) { - invert = -1; - } + if (absCompare == false) { + if (leftSign != rightSign) { + return (leftSign == true) ? 
-1 : 1; } - int cmp = compareSparseBytesInner(left, leftStart, leftSign, leftScale, leftPrecision, right, rightStart, rightSign, rightPrecision, rightScale, width, nDecimalDigits); - return cmp * invert; + // Both values are negative invert the outcome of the comparison + if (leftSign == true) { + invert = -1; + } } - public static int compareSparseBytesInner(ArrowBuf left, int leftStart, boolean leftSign, int leftScale, int leftPrecision, ArrowBuf right, int rightStart, boolean rightSign, int rightPrecision, int rightScale, int width, int nDecimalDigits) { + + int cmp = compareSparseBytesInner(left, leftStart, leftSign, leftScale, leftPrecision, right, rightStart, rightSign, rightPrecision, rightScale, width, nDecimalDigits); + return cmp * invert; + } + public static int compareSparseBytesInner(ArrowBuf left, int leftStart, boolean leftSign, int leftScale, int leftPrecision, ArrowBuf right, int rightStart, boolean rightSign, int rightPrecision, int rightScale, int width, int nDecimalDigits) { /* compute the number of integer digits in each decimal */ - int leftInt = leftPrecision - leftScale; - int rightInt = rightPrecision - rightScale; + int leftInt = leftPrecision - leftScale; + int rightInt = rightPrecision - rightScale; /* compute the number of indexes required for storing integer digits */ - int leftIntRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(leftInt); - int rightIntRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(rightInt); + int leftIntRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(leftInt); + int rightIntRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(rightInt); /* compute number of indexes required for storing scale */ - int leftScaleRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(leftScale); - int rightScaleRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(rightScale); + int leftScaleRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(leftScale); + int rightScaleRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(rightScale); /* compute index of the most significant integer digits */ - int leftIndex1 = nDecimalDigits - leftScaleRoundedUp - leftIntRoundedUp; - int rightIndex1 = nDecimalDigits - rightScaleRoundedUp - rightIntRoundedUp; + int leftIndex1 = nDecimalDigits - leftScaleRoundedUp - leftIntRoundedUp; + int rightIndex1 = nDecimalDigits - rightScaleRoundedUp - rightIntRoundedUp; - int leftStopIndex = nDecimalDigits - leftScaleRoundedUp; - int rightStopIndex = nDecimalDigits - rightScaleRoundedUp; + int leftStopIndex = nDecimalDigits - leftScaleRoundedUp; + int rightStopIndex = nDecimalDigits - rightScaleRoundedUp; /* Discard the zeroes in the integer part */ - while (leftIndex1 < leftStopIndex) { - if (getIntegerFromSparseBuffer(left, leftStart, leftIndex1) != 0) { - break; - } + while (leftIndex1 < leftStopIndex) { + if (getIntegerFromSparseBuffer(left, leftStart, leftIndex1) != 0) { + break; + } /* Digit in this location is zero, decrement the actual number * of integer digits */ - leftIntRoundedUp--; - leftIndex1++; - } + leftIntRoundedUp--; + leftIndex1++; + } /* If we reached the stop index then the number of integers is zero */ - if (leftIndex1 == leftStopIndex) { - leftIntRoundedUp = 0; - } + if (leftIndex1 == leftStopIndex) { + leftIntRoundedUp = 0; + } - while (rightIndex1 < rightStopIndex) { - if (getIntegerFromSparseBuffer(right, rightStart, rightIndex1) != 0) { - break; - } + while (rightIndex1 < rightStopIndex) { + if 
(getIntegerFromSparseBuffer(right, rightStart, rightIndex1) != 0) { + break; + } /* Digit in this location is zero, decrement the actual number * of integer digits */ - rightIntRoundedUp--; - rightIndex1++; - } + rightIntRoundedUp--; + rightIndex1++; + } - if (rightIndex1 == rightStopIndex) { - rightIntRoundedUp = 0; - } + if (rightIndex1 == rightStopIndex) { + rightIntRoundedUp = 0; + } /* We have the accurate number of non-zero integer digits, * if the number of integer digits are different then we can determine * which decimal is larger and needn't go down to comparing individual values */ - if (leftIntRoundedUp > rightIntRoundedUp) { - return 1; - } - else if (rightIntRoundedUp > leftIntRoundedUp) { - return -1; - } + if (leftIntRoundedUp > rightIntRoundedUp) { + return 1; + } + else if (rightIntRoundedUp > leftIntRoundedUp) { + return -1; + } /* The number of integer digits are the same, set the each index * to the first non-zero integer and compare each digit */ - leftIndex1 = nDecimalDigits - leftScaleRoundedUp - leftIntRoundedUp; - rightIndex1 = nDecimalDigits - rightScaleRoundedUp - rightIntRoundedUp; + leftIndex1 = nDecimalDigits - leftScaleRoundedUp - leftIntRoundedUp; + rightIndex1 = nDecimalDigits - rightScaleRoundedUp - rightIntRoundedUp; - while (leftIndex1 < leftStopIndex && rightIndex1 < rightStopIndex) { - if (getIntegerFromSparseBuffer(left, leftStart, leftIndex1) > getIntegerFromSparseBuffer(right, rightStart, rightIndex1)) { - return 1; - } - else if (getIntegerFromSparseBuffer(right, rightStart, rightIndex1) > getIntegerFromSparseBuffer(left, leftStart, leftIndex1)) { - return -1; - } - - leftIndex1++; - rightIndex1++; + while (leftIndex1 < leftStopIndex && rightIndex1 < rightStopIndex) { + if (getIntegerFromSparseBuffer(left, leftStart, leftIndex1) > getIntegerFromSparseBuffer(right, rightStart, rightIndex1)) { + return 1; + } + else if (getIntegerFromSparseBuffer(right, rightStart, rightIndex1) > getIntegerFromSparseBuffer(left, leftStart, leftIndex1)) { + return -1; } + leftIndex1++; + rightIndex1++; + } + /* The integer part of both the decimal's are equal, now compare * each individual fractional part. 
Set the index to be at the * beginning of the fractional part */ - leftIndex1 = leftStopIndex; - rightIndex1 = rightStopIndex; + leftIndex1 = leftStopIndex; + rightIndex1 = rightStopIndex; /* Stop indexes will be the end of the array */ - leftStopIndex = nDecimalDigits; - rightStopIndex = nDecimalDigits; + leftStopIndex = nDecimalDigits; + rightStopIndex = nDecimalDigits; /* compare the two fractional parts of the decimal */ - while (leftIndex1 < leftStopIndex && rightIndex1 < rightStopIndex) { - if (getIntegerFromSparseBuffer(left, leftStart, leftIndex1) > getIntegerFromSparseBuffer(right, rightStart, rightIndex1)) { - return 1; - } - else if (getIntegerFromSparseBuffer(right, rightStart, rightIndex1) > getIntegerFromSparseBuffer(left, leftStart, leftIndex1)) { - return -1; - } - - leftIndex1++; - rightIndex1++; + while (leftIndex1 < leftStopIndex && rightIndex1 < rightStopIndex) { + if (getIntegerFromSparseBuffer(left, leftStart, leftIndex1) > getIntegerFromSparseBuffer(right, rightStart, rightIndex1)) { + return 1; + } + else if (getIntegerFromSparseBuffer(right, rightStart, rightIndex1) > getIntegerFromSparseBuffer(left, leftStart, leftIndex1)) { + return -1; } + leftIndex1++; + rightIndex1++; + } + /* Till now the fractional part of the decimals are equal, check * if one of the decimal has fractional part that is remaining * and is non-zero */ - while (leftIndex1 < leftStopIndex) { - if (getIntegerFromSparseBuffer(left, leftStart, leftIndex1) != 0) { - return 1; - } - leftIndex1++; + while (leftIndex1 < leftStopIndex) { + if (getIntegerFromSparseBuffer(left, leftStart, leftIndex1) != 0) { + return 1; } + leftIndex1++; + } - while(rightIndex1 < rightStopIndex) { - if (getIntegerFromSparseBuffer(right, rightStart, rightIndex1) != 0) { - return -1; - } - rightIndex1++; + while(rightIndex1 < rightStopIndex) { + if (getIntegerFromSparseBuffer(right, rightStart, rightIndex1) != 0) { + return -1; } + rightIndex1++; + } /* Both decimal values are equal */ - return 0; - } + return 0; + } - public static BigDecimal getBigDecimalFromByteArray(byte[] bytes, int start, int length, int scale) { - byte[] value = Arrays.copyOfRange(bytes, start, start + length); - BigInteger unscaledValue = new BigInteger(value); - return new BigDecimal(unscaledValue, scale); - } + public static BigDecimal getBigDecimalFromByteArray(byte[] bytes, int start, int length, int scale) { + byte[] value = Arrays.copyOfRange(bytes, start, start + length); + BigInteger unscaledValue = new BigInteger(value); + return new BigDecimal(unscaledValue, scale); + } public static void roundDecimal(ArrowBuf result, int start, int nDecimalDigits, int desiredScale, int currentScale) { int newScaleRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(desiredScale); @@ -704,34 +708,6 @@ public static int getFirstFractionalDigit(ArrowBuf data, int scale, int start, i int index = nDecimalDigits - roundUp(scale); return (int) (adjustScaleDivide(data.getInt(start + (index * INTEGER_SIZE)), MAX_DIGITS - 1)); } - - public static int compareSparseSamePrecScale(ArrowBuf left, int lStart, byte[] right, int length) { - // check the sign first - boolean lSign = (left.getInt(lStart) & 0x80000000) != 0; - boolean rSign = ByteFunctionHelpers.getSign(right); - int cmp = 0; - - if (lSign != rSign) { - return (lSign == false) ? 1 : -1; - } - - // invert the comparison if we are comparing negative numbers - int invert = (lSign == true) ? 
-1 : 1; - - // compare byte by byte - int n = 0; - int lPos = lStart; - int rPos = 0; - while (n < length/4) { - int leftInt = Decimal38SparseHolder.getInteger(n, lStart, left); - int rightInt = ByteFunctionHelpers.getInteger(right, n); - if (leftInt != rightInt) { - cmp = (leftInt - rightInt ) > 0 ? 1 : -1; - break; - } - n++; - } - return cmp * invert; - } } + diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/MapWithOrdinal.java b/java/vector/src/main/java/org/apache/arrow/vector/util/MapWithOrdinal.java index dea433e99e80f..d7f9d382e4865 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/util/MapWithOrdinal.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/MapWithOrdinal.java @@ -18,7 +18,9 @@ package org.apache.arrow.vector.util; import java.util.AbstractMap; +import java.util.ArrayList; import java.util.Collection; +import java.util.List; import java.util.Map; import java.util.Set; @@ -241,6 +243,16 @@ public Set keySet() { return delegate.keySet(); } + public List keyList() { + int size = size(); + Set keys = keySet(); + List children = new ArrayList<>(size); + for (K key : keys) { + children.add(getOrdinal(key), key); + } + return children; + } + @Override public Set> entrySet() { return delegate.entrySet(); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestDecimalVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestDecimalVector.java new file mode 100644 index 0000000000000..7ab7db3117b81 --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestDecimalVector.java @@ -0,0 +1,63 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; +import org.apache.arrow.vector.util.DecimalUtility; +import org.junit.Test; + +import java.math.BigDecimal; +import java.math.BigInteger; + +import static org.junit.Assert.assertEquals; + +public class TestDecimalVector { + + private static long[] intValues; + + static { + intValues = new long[30]; + for (int i = 0; i < intValues.length; i++) { + intValues[i] = 1 << i + 1; + } + } + private int scale = 3; + + @Test + public void test() { + BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); + NullableDecimalVector decimalVector = new NullableDecimalVector("decimal", allocator, 10, scale); + decimalVector.allocateNew(); + BigDecimal[] values = new BigDecimal[intValues.length]; + for (int i = 0; i < intValues.length; i++) { + BigDecimal decimal = new BigDecimal(BigInteger.valueOf(intValues[i]), scale); + values[i] = decimal; + decimalVector.getMutator().setIndexDefined(i); + DecimalUtility.writeBigDecimalToArrowBuf(decimalVector.getBuffer(), i * 16, decimal); + } + + decimalVector.getMutator().setValueCount(intValues.length); + + for (int i = 0; i < intValues.length; i++) { + BigDecimal value = decimalVector.getAccessor().getObject(i); + assertEquals(values[i], value); + } + } +} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestOversizedAllocationForValueVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestOversizedAllocationForValueVector.java index 4dee86c9d595a..9baebc5a2992c 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestOversizedAllocationForValueVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestOversizedAllocationForValueVector.java @@ -20,8 +20,6 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; -import org.apache.arrow.vector.holders.UInt4Holder; -import org.apache.arrow.vector.types.MaterializedField; import org.apache.arrow.vector.util.OversizedAllocationException; import org.junit.After; import org.junit.Before; @@ -53,8 +51,7 @@ public void terminate() throws Exception { @Test(expected = OversizedAllocationException.class) public void testFixedVectorReallocation() { - final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, UInt4Holder.TYPE); - final UInt4Vector vector = new UInt4Vector(field, allocator); + final UInt4Vector vector = new UInt4Vector(EMPTY_SCHEMA_PATH, allocator); // edge case 1: buffer size = max value capacity final int expectedValueCapacity = BaseValueVector.MAX_ALLOCATION_SIZE / 4; try { @@ -78,8 +75,7 @@ public void testFixedVectorReallocation() { @Test(expected = OversizedAllocationException.class) public void testBitVectorReallocation() { - final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, UInt4Holder.TYPE); - final BitVector vector = new BitVector(field, allocator); + final BitVector vector = new BitVector(EMPTY_SCHEMA_PATH, allocator); // edge case 1: buffer size ~ max value capacity final int expectedValueCapacity = 1 << 29; try { @@ -109,8 +105,7 @@ 
public void testBitVectorReallocation() { @Test(expected = OversizedAllocationException.class) public void testVariableVectorReallocation() { - final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, UInt4Holder.TYPE); - final VarCharVector vector = new VarCharVector(field, allocator); + final VarCharVector vector = new VarCharVector(EMPTY_SCHEMA_PATH, allocator); // edge case 1: value count = MAX_VALUE_ALLOCATION final int expectedAllocationInBytes = BaseValueVector.MAX_ALLOCATION_SIZE; final int expectedOffsetSize = 10; diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestUnionVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestUnionVector.java index e4d28c3f88ca6..1bb50b73a9057 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestUnionVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestUnionVector.java @@ -22,8 +22,6 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.complex.UnionVector; import org.apache.arrow.vector.holders.NullableUInt4Holder; -import org.apache.arrow.vector.holders.UInt4Holder; -import org.apache.arrow.vector.types.MaterializedField; import org.apache.arrow.vector.types.Types; import org.junit.After; import org.junit.Before; @@ -46,13 +44,12 @@ public void terminate() throws Exception { @Test public void testUnionVector() throws Exception { - final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, UInt4Holder.TYPE); final NullableUInt4Holder uInt4Holder = new NullableUInt4Holder(); uInt4Holder.value = 100; uInt4Holder.isSet = 1; - try (UnionVector unionVector = new UnionVector(field, allocator, null)) { + try (UnionVector unionVector = new UnionVector(EMPTY_SCHEMA_PATH, allocator, null)) { unionVector.allocateNew(); // write some data diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java index ce091ab1ed06b..21cdc4f4d8d3b 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java @@ -19,15 +19,7 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; -import org.apache.arrow.vector.complex.ListVector; -import org.apache.arrow.vector.complex.MapVector; -import org.apache.arrow.vector.complex.RepeatedListVector; -import org.apache.arrow.vector.complex.RepeatedMapVector; -import org.apache.arrow.vector.holders.*; -import org.apache.arrow.vector.types.MaterializedField; -import org.apache.arrow.vector.types.Types; import org.apache.arrow.vector.types.Types.MinorType; -import org.apache.arrow.vector.util.BasicTypeHelper; import org.apache.arrow.vector.util.OversizedAllocationException; import org.junit.After; import org.junit.Before; @@ -50,9 +42,9 @@ public void init() { } private final static Charset utf8Charset = Charset.forName("UTF-8"); - private final static byte[] STR1 = new String("AAAAA1").getBytes(utf8Charset); - private final static byte[] STR2 = new String("BBBBBBBBB2").getBytes(utf8Charset); - private final static byte[] STR3 = new String("CCCC3").getBytes(utf8Charset); + private final static byte[] STR1 = "AAAAA1".getBytes(utf8Charset); + private final static byte[] STR2 = "BBBBBBBBB2".getBytes(utf8Charset); + private final static byte[] STR3 = "CCCC3".getBytes(utf8Charset); @After public void terminate() throws Exception { @@ -61,10 +53,9 @@ public void terminate() throws Exception 
{ @Test public void testFixedType() { - final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, UInt4Holder.TYPE); // Create a new value vector for 1024 integers. - try (final UInt4Vector vector = new UInt4Vector(field, allocator)) { + try (final UInt4Vector vector = new UInt4Vector(EMPTY_SCHEMA_PATH, allocator)) { final UInt4Vector.Mutator m = vector.getMutator(); vector.allocateNew(1024); @@ -86,10 +77,9 @@ public void testFixedType() { @Test public void testNullableVarLen2() { - final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, NullableVarCharHolder.TYPE); // Create a new value vector for 1024 integers. - try (final NullableVarCharVector vector = new NullableVarCharVector(field, allocator)) { + try (final NullableVarCharVector vector = new NullableVarCharVector(EMPTY_SCHEMA_PATH, allocator)) { final NullableVarCharVector.Mutator m = vector.getMutator(); vector.allocateNew(1024 * 10, 1024); @@ -115,45 +105,11 @@ public void testNullableVarLen2() { } } - @Test - public void testRepeatedIntVector() { - final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, RepeatedIntHolder.TYPE); - - // Create a new value vector. - try (final RepeatedIntVector vector1 = new RepeatedIntVector(field, allocator)) { - - // Populate the vector. - final int[] values = {2, 3, 5, 7, 11, 13, 17, 19, 23, 27}; // some tricksy primes - final int nRecords = 7; - final int nElements = values.length; - vector1.allocateNew(nRecords, nRecords * nElements); - final RepeatedIntVector.Mutator mutator = vector1.getMutator(); - for (int recordIndex = 0; recordIndex < nRecords; ++recordIndex) { - mutator.startNewValue(recordIndex); - for (int elementIndex = 0; elementIndex < nElements; ++elementIndex) { - mutator.add(recordIndex, recordIndex * values[elementIndex]); - } - } - mutator.setValueCount(nRecords); - - // Verify the contents. - final RepeatedIntVector.Accessor accessor1 = vector1.getAccessor(); - assertEquals(nRecords, accessor1.getValueCount()); - for (int recordIndex = 0; recordIndex < nRecords; ++recordIndex) { - for (int elementIndex = 0; elementIndex < nElements; ++elementIndex) { - final int value = accessor1.get(recordIndex, elementIndex); - assertEquals(recordIndex * values[elementIndex], value); - } - } - } - } - @Test public void testNullableFixedType() { - final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, NullableUInt4Holder.TYPE); // Create a new value vector for 1024 integers. 
- try (final NullableUInt4Vector vector = new NullableUInt4Vector(field, allocator)) { + try (final NullableUInt4Vector vector = new NullableUInt4Vector(EMPTY_SCHEMA_PATH, allocator)) { final NullableUInt4Vector.Mutator m = vector.getMutator(); vector.allocateNew(1024); @@ -222,10 +178,8 @@ public void testNullableFixedType() { @Test public void testNullableFloat() { - final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, NullableFloat4Holder.TYPE); - // Create a new value vector for 1024 integers - try (final NullableFloat4Vector vector = (NullableFloat4Vector) BasicTypeHelper.getNewVector(field, allocator)) { + try (final NullableFloat4Vector vector = (NullableFloat4Vector) MinorType.FLOAT4.getNewVector(EMPTY_SCHEMA_PATH, allocator, null)) { final NullableFloat4Vector.Mutator m = vector.getMutator(); vector.allocateNew(1024); @@ -271,10 +225,8 @@ public void testNullableFloat() { @Test public void testBitVector() { - final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, BitHolder.TYPE); - // Create a new value vector for 1024 integers - try (final BitVector vector = new BitVector(field, allocator)) { + try (final BitVector vector = new BitVector(EMPTY_SCHEMA_PATH, allocator)) { final BitVector.Mutator m = vector.getMutator(); vector.allocateNew(1024); @@ -311,10 +263,8 @@ public void testBitVector() { @Test public void testReAllocNullableFixedWidthVector() { - final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, NullableFloat4Holder.TYPE); - // Create a new value vector for 1024 integers - try (final NullableFloat4Vector vector = (NullableFloat4Vector) BasicTypeHelper.getNewVector(field, allocator)) { + try (final NullableFloat4Vector vector = (NullableFloat4Vector) MinorType.FLOAT4.getNewVector(EMPTY_SCHEMA_PATH, allocator, null)) { final NullableFloat4Vector.Mutator m = vector.getMutator(); vector.allocateNew(1024); @@ -346,10 +296,8 @@ public void testReAllocNullableFixedWidthVector() { @Test public void testReAllocNullableVariableWidthVector() { - final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, NullableVarCharHolder.TYPE); - // Create a new value vector for 1024 integers - try (final NullableVarCharVector vector = (NullableVarCharVector) BasicTypeHelper.getNewVector(field, allocator)) { + try (final NullableVarCharVector vector = (NullableVarCharVector) MinorType.VARCHAR.getNewVector(EMPTY_SCHEMA_PATH, allocator, null)) { final NullableVarCharVector.Mutator m = vector.getMutator(); vector.allocateNew(); @@ -376,69 +324,4 @@ public void testReAllocNullableVariableWidthVector() { } } - @Test - public void testVVInitialCapacity() throws Exception { - final MaterializedField[] fields = new MaterializedField[9]; - final ValueVector[] valueVectors = new ValueVector[9]; - - fields[0] = MaterializedField.create(EMPTY_SCHEMA_PATH, BitHolder.TYPE); - fields[1] = MaterializedField.create(EMPTY_SCHEMA_PATH, IntHolder.TYPE); - fields[2] = MaterializedField.create(EMPTY_SCHEMA_PATH, VarCharHolder.TYPE); - fields[3] = MaterializedField.create(EMPTY_SCHEMA_PATH, NullableVar16CharHolder.TYPE); - fields[4] = MaterializedField.create(EMPTY_SCHEMA_PATH, RepeatedFloat4Holder.TYPE); - fields[5] = MaterializedField.create(EMPTY_SCHEMA_PATH, RepeatedVarBinaryHolder.TYPE); - - fields[6] = MaterializedField.create(EMPTY_SCHEMA_PATH, MapVector.TYPE); - fields[6].addChild(fields[0] /*bit*/); - fields[6].addChild(fields[2] /*varchar*/); - - fields[7] = MaterializedField.create(EMPTY_SCHEMA_PATH, 
RepeatedMapVector.TYPE); - fields[7].addChild(fields[1] /*int*/); - fields[7].addChild(fields[3] /*optional var16char*/); - - fields[8] = MaterializedField.create(EMPTY_SCHEMA_PATH, RepeatedListVector.TYPE); - fields[8].addChild(fields[1] /*int*/); - - final int initialCapacity = 1024; - - try { - for (int i = 0; i < valueVectors.length; i++) { - valueVectors[i] = BasicTypeHelper.getNewVector(fields[i], allocator); - valueVectors[i].setInitialCapacity(initialCapacity); - valueVectors[i].allocateNew(); - } - - for (int i = 0; i < valueVectors.length; i++) { - final ValueVector vv = valueVectors[i]; - final int vvCapacity = vv.getValueCapacity(); - - // this can't be equality because Nullables will be allocated using power of two sized buffers (thus need 1025 - // spots in one vector > power of two is 2048, available capacity will be 2048 => 2047) - assertTrue(String.format("Incorrect value capacity for %s [%d]", vv.getField(), vvCapacity), - initialCapacity <= vvCapacity); - } - } finally { - for (ValueVector v : valueVectors) { - v.close(); - } - } - } - - @Test - public void testListVectorShouldNotThrowOversizedAllocationException() throws Exception { - final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, - Types.optional(MinorType.LIST)); - ListVector vector = new ListVector(field, allocator, null); - ListVector vectorFrom = new ListVector(field, allocator, null); - vectorFrom.allocateNew(); - - for (int i = 0; i < 10000; i++) { - vector.allocateNew(); - vector.copyFromSafe(0, 0, vectorFrom); - vector.clear(); - } - - vectorFrom.clear(); - vector.clear(); - } } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java index 4c24444d81d18..24f00f14df001 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java @@ -27,7 +27,7 @@ import org.apache.arrow.vector.complex.MapVector; import org.apache.arrow.vector.complex.UnionVector; import org.apache.arrow.vector.holders.UInt4Holder; -import org.apache.arrow.vector.types.MaterializedField; +import org.apache.arrow.vector.types.Types.MinorType; import org.junit.After; import org.junit.Before; import org.junit.Test; @@ -49,10 +49,9 @@ public void terminate() throws Exception { @Test public void testPromoteToUnion() throws Exception { - final MaterializedField field = MaterializedField.create(EMPTY_SCHEMA_PATH, UInt4Holder.TYPE); - try (final AbstractMapVector container = new MapVector(field, allocator, null); - final MapVector v = container.addOrGet("test", MapVector.TYPE, MapVector.class); + try (final AbstractMapVector container = new MapVector(EMPTY_SCHEMA_PATH, allocator, null); + final MapVector v = container.addOrGet("test", MinorType.MAP, MapVector.class); final PromotableWriter writer = new PromotableWriter(v, container)) { container.allocateNew(); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java new file mode 100644 index 0000000000000..bc17a2b2835c2 --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java @@ -0,0 +1,270 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. 
See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.complex.writer; + +import io.netty.buffer.ArrowBuf; +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.complex.ListVector; +import org.apache.arrow.vector.complex.MapVector; +import org.apache.arrow.vector.complex.UnionVector; +import org.apache.arrow.vector.complex.impl.ComplexWriterImpl; +import org.apache.arrow.vector.complex.impl.SingleMapReaderImpl; +import org.apache.arrow.vector.complex.impl.UnionListReader; +import org.apache.arrow.vector.complex.impl.UnionListWriter; +import org.apache.arrow.vector.complex.impl.UnionReader; +import org.apache.arrow.vector.complex.impl.UnionWriter; +import org.apache.arrow.vector.complex.reader.BaseReader.MapReader; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter; +import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter; +import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; +import org.apache.arrow.vector.types.pojo.ArrowType.Int; +import org.apache.arrow.vector.types.pojo.ArrowType.Union; +import org.apache.arrow.vector.types.pojo.ArrowType.Utf8; +import org.apache.arrow.vector.types.pojo.Field; +import org.junit.Assert; +import org.junit.Test; + +public class TestComplexWriter { + + static final BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); + + private static final int COUNT = 100; + + @Test + public void simpleNestedTypes() { + MapVector parent = new MapVector("parent", allocator, null); + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + IntWriter intWriter = rootWriter.integer("int"); + BigIntWriter bigIntWriter = rootWriter.bigInt("bigInt"); + for (int i = 0; i < COUNT; i++) { + intWriter.setPosition(i); + intWriter.writeInt(i); + bigIntWriter.setPosition(i); + bigIntWriter.writeBigInt(i); + } + writer.setValueCount(COUNT); + MapReader rootReader = new SingleMapReaderImpl(parent).reader("root"); + for (int i = 0; i < COUNT; i++) { + rootReader.setPosition(i); + Assert.assertEquals(i, rootReader.reader("int").readInteger().intValue()); + Assert.assertEquals(i, rootReader.reader("bigInt").readLong().longValue()); + } + + parent.close(); + } + + @Test + public void listScalarType() { + ListVector listVector = new ListVector("list", allocator, null); + listVector.allocateNew(); + UnionListWriter listWriter = new UnionListWriter(listVector); + for (int i = 0; i < COUNT; i++) { + listWriter.setPosition(i); + listWriter.startList(); + for (int j = 0; j < i % 7; j++) { + listWriter.writeInt(j); + } + listWriter.endList(); + } + listWriter.setValueCount(COUNT); + UnionListReader listReader = new UnionListReader(listVector); + for (int i = 0; i < COUNT; i++) { + listReader.setPosition(i); + for (int j = 0; j < i % 7; j++) { + listReader.next(); + Assert.assertEquals(j, listReader.reader().readInteger().intValue()); + } + } + } + + + @Test + public void listMapType() { + ListVector listVector = new ListVector("list", allocator, null); + listVector.allocateNew(); + UnionListWriter listWriter = new UnionListWriter(listVector); + 
MapWriter mapWriter = listWriter.map(); + for (int i = 0; i < COUNT; i++) { + listWriter.setPosition(i); + listWriter.startList(); + for (int j = 0; j < i % 7; j++) { + mapWriter.start(); + mapWriter.integer("int").writeInt(j); + mapWriter.bigInt("bigInt").writeBigInt(j); + mapWriter.end(); + } + listWriter.endList(); + } + listWriter.setValueCount(COUNT); + UnionListReader listReader = new UnionListReader(listVector); + for (int i = 0; i < COUNT; i++) { + listReader.setPosition(i); + for (int j = 0; j < i % 7; j++) { + listReader.next(); + Assert.assertEquals("record: " + i, j, listReader.reader().reader("int").readInteger().intValue()); + Assert.assertEquals(j, listReader.reader().reader("bigInt").readLong().longValue()); + } + } + } + + @Test + public void listListType() { + ListVector listVector = new ListVector("list", allocator, null); + listVector.allocateNew(); + UnionListWriter listWriter = new UnionListWriter(listVector); + for (int i = 0; i < COUNT; i++) { + listWriter.setPosition(i); + listWriter.startList(); + for (int j = 0; j < i % 7; j++) { + ListWriter innerListWriter = listWriter.list(); + innerListWriter.startList(); + for (int k = 0; k < i % 13; k++) { + innerListWriter.integer().writeInt(k); + } + innerListWriter.endList(); + } + listWriter.endList(); + } + listWriter.setValueCount(COUNT); + UnionListReader listReader = new UnionListReader(listVector); + for (int i = 0; i < COUNT; i++) { + listReader.setPosition(i); + for (int j = 0; j < i % 7; j++) { + listReader.next(); + FieldReader innerListReader = listReader.reader(); + for (int k = 0; k < i % 13; k++) { + innerListReader.next(); + Assert.assertEquals("record: " + i, k, innerListReader.reader().readInteger().intValue()); + } + } + } + listVector.clear(); + } + + @Test + public void unionListListType() { + ListVector listVector = new ListVector("list", allocator, null); + listVector.allocateNew(); + UnionListWriter listWriter = new UnionListWriter(listVector); + for (int i = 0; i < COUNT; i++) { + listWriter.setPosition(i); + listWriter.startList(); + for (int j = 0; j < i % 7; j++) { + ListWriter innerListWriter = listWriter.list(); + innerListWriter.startList(); + for (int k = 0; k < i % 13; k++) { + if (k % 2 == 0) { + innerListWriter.integer().writeInt(k); + } else { + innerListWriter.bigInt().writeBigInt(k); + } + } + innerListWriter.endList(); + } + listWriter.endList(); + } + listWriter.setValueCount(COUNT); + UnionListReader listReader = new UnionListReader(listVector); + for (int i = 0; i < COUNT; i++) { + listReader.setPosition(i); + for (int j = 0; j < i % 7; j++) { + listReader.next(); + FieldReader innerListReader = listReader.reader(); + for (int k = 0; k < i % 13; k++) { + innerListReader.next(); + if (k % 2 == 0) { + Assert.assertEquals("record: " + i, k, innerListReader.reader().readInteger().intValue()); + } else { + Assert.assertEquals("record: " + i, k, innerListReader.reader().readLong().longValue()); + } + } + } + } + listVector.clear(); + } + + @Test + public void simpleUnion() { + UnionVector vector = new UnionVector("union", allocator, null); + UnionWriter unionWriter = new UnionWriter(vector); + unionWriter.allocate(); + for (int i = 0; i < COUNT; i++) { + unionWriter.setPosition(i); + if (i % 2 == 0) { + unionWriter.writeInt(i); + } else { + unionWriter.writeFloat4((float) i); + } + } + vector.getMutator().setValueCount(COUNT); + UnionReader unionReader = new UnionReader(vector); + for (int i = 0; i < COUNT; i++) { + unionReader.setPosition(i); + if (i % 2 == 0) { + 
Assert.assertEquals(i, i, unionReader.readInteger()); + } else { + Assert.assertEquals((float) i, unionReader.readFloat(), 1e-12); + } + } + vector.close(); + } + + @Test + public void promotableWriter() { + MapVector parent = new MapVector("parent", allocator, null); + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + for (int i = 0; i < 100; i++) { + BigIntWriter bigIntWriter = rootWriter.bigInt("a"); + bigIntWriter.setPosition(i); + bigIntWriter.writeBigInt(i); + } + Field field = parent.getField().getChildren().get(0).getChildren().get(0); + Assert.assertEquals("a", field.getName()); + Assert.assertEquals(Int.TYPE_TYPE, field.getType().getTypeType()); + Int intType = (Int) field.getType(); + + Assert.assertEquals(64, intType.getBitWidth()); + Assert.assertTrue(intType.getIsSigned()); + for (int i = 100; i < 200; i++) { + VarCharWriter varCharWriter = rootWriter.varChar("a"); + varCharWriter.setPosition(i); + byte[] bytes = Integer.toString(i).getBytes(); + ArrowBuf tempBuf = allocator.buffer(bytes.length); + tempBuf.setBytes(0, bytes); + varCharWriter.writeVarChar(0, bytes.length, tempBuf); + } + field = parent.getField().getChildren().get(0).getChildren().get(0); + Assert.assertEquals("a", field.getName()); + Assert.assertEquals(Union.TYPE_TYPE, field.getType().getTypeType()); + Assert.assertEquals(Int.TYPE_TYPE, field.getChildren().get(0).getType().getTypeType()); + Assert.assertEquals(Utf8.TYPE_TYPE, field.getChildren().get(1).getType().getTypeType()); + MapReader rootReader = new SingleMapReaderImpl(parent).reader("root"); + for (int i = 0; i < 100; i++) { + rootReader.setPosition(i); + Assert.assertEquals(i, rootReader.reader("a").readLong().intValue()); + } + for (int i = 100; i < 200; i++) { + rootReader.setPosition(i); + Assert.assertEquals(Integer.toString(i), rootReader.reader("a").readText().toString()); + } + } +} \ No newline at end of file diff --git a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java new file mode 100644 index 0000000000000..06a1149c0d6c1 --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java @@ -0,0 +1,80 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.pojo; + +import com.google.common.collect.ImmutableList; +import com.google.flatbuffers.FlatBufferBuilder; +import org.apache.arrow.vector.types.pojo.ArrowType.FloatingPoint; +import org.apache.arrow.vector.types.pojo.ArrowType.Int; +import org.apache.arrow.vector.types.pojo.ArrowType.Tuple; +import org.apache.arrow.vector.types.pojo.ArrowType.Utf8; +import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.Schema; +import org.junit.Test; + +import java.util.List; + +import static org.junit.Assert.assertEquals; + +/** + * Test conversion between Flatbuf and Pojo field representations + */ +public class TestConvert { + + @Test + public void simple() { + Field initialField = new Field("a", true, new Int(32, true), null); + run(initialField); + } + + @Test + public void complex() { + ImmutableList.Builder childrenBuilder = ImmutableList.builder(); + childrenBuilder.add(new Field("child1", true, Utf8.INSTANCE, null)); + childrenBuilder.add(new Field("child2", true, new FloatingPoint(0), ImmutableList.of())); + + Field initialField = new Field("a", true, Tuple.INSTANCE, childrenBuilder.build()); + run(initialField); + } + + @Test + public void schema() { + ImmutableList.Builder childrenBuilder = ImmutableList.builder(); + childrenBuilder.add(new Field("child1", true, Utf8.INSTANCE, null)); + childrenBuilder.add(new Field("child2", true, new FloatingPoint(0), ImmutableList.of())); + Schema initialSchema = new Schema(childrenBuilder.build()); + run(initialSchema); + + } + + private void run(Field initialField) { + FlatBufferBuilder builder = new FlatBufferBuilder(); + builder.finish(initialField.getField(builder)); + org.apache.arrow.flatbuf.Field flatBufField = org.apache.arrow.flatbuf.Field.getRootAsField(builder.dataBuffer()); + Field finalField = Field.convertField(flatBufField); + assertEquals(initialField, finalField); + } + + private void run(Schema initialSchema) { + FlatBufferBuilder builder = new FlatBufferBuilder(); + builder.finish(initialSchema.getSchema(builder)); + org.apache.arrow.flatbuf.Schema flatBufSchema = org.apache.arrow.flatbuf.Schema.getRootAsSchema(builder.dataBuffer()); + Schema finalSchema = Schema.convertSchema(flatBufSchema); + assertEquals(initialSchema, finalSchema); + } +} From fd2e52491bc39ae5aa0ddb7dbc21109172cea1c2 Mon Sep 17 00:00:00 2001 From: Steven Phillips Date: Thu, 18 Aug 2016 16:31:32 -0700 Subject: [PATCH 0114/1644] Revert version to 0.1-SNAPSHOT --- java/format/pom.xml | 2 +- java/memory/pom.xml | 2 +- java/pom.xml | 2 +- java/vector/pom.xml | 4 ++-- 4 files changed, 5 insertions(+), 5 deletions(-) diff --git a/java/format/pom.xml b/java/format/pom.xml index ea27a3072bc9e..cb11b5ff3c45d 100644 --- a/java/format/pom.xml +++ b/java/format/pom.xml @@ -16,7 +16,7 @@ arrow-java-root org.apache.arrow - 0.1-decimal + 0.1-SNAPSHOT arrow-format diff --git a/java/memory/pom.xml b/java/memory/pom.xml index 12ff4c81d86c0..44332f5ed14a8 100644 --- a/java/memory/pom.xml +++ b/java/memory/pom.xml @@ -15,7 +15,7 @@ org.apache.arrow arrow-java-root - 0.1-decimal + 0.1-SNAPSHOT arrow-memory arrow-memory diff --git a/java/pom.xml b/java/pom.xml index 92ab109f939e1..8eb25af7545f4 
100644 --- a/java/pom.xml +++ b/java/pom.xml @@ -21,7 +21,7 @@ org.apache.arrow arrow-java-root - 0.1-decimal + 0.1-SNAPSHOT pom Apache Arrow Java Root POM diff --git a/java/vector/pom.xml b/java/vector/pom.xml index fac788cef14d9..1a2921f6ea521 100644 --- a/java/vector/pom.xml +++ b/java/vector/pom.xml @@ -15,7 +15,7 @@ org.apache.arrow arrow-java-root - 0.1-decimal + 0.1-SNAPSHOT vector vectors @@ -25,7 +25,7 @@ org.apache.arrow arrow-format - 0.1-decimal + ${project.version} org.apache.arrow From 282fcacc86c9232c9dc1b1030e9fc9299bbc3f8d Mon Sep 17 00:00:00 2001 From: Steven Phillips Date: Fri, 19 Aug 2016 14:28:05 -0700 Subject: [PATCH 0115/1644] ARROW-265: Pad negative decimal values with1 --- .../codegen/templates/FixedValueVectors.java | 8 +- .../codegen/templates/HolderReaderImpl.java | 5 +- .../arrow/vector/util/DecimalUtility.java | 579 +----------------- .../arrow/vector/TestDecimalVector.java | 7 +- 4 files changed, 27 insertions(+), 572 deletions(-) diff --git a/java/vector/src/main/codegen/templates/FixedValueVectors.java b/java/vector/src/main/codegen/templates/FixedValueVectors.java index fe2b5c5b5bc92..37946f6b76ea6 100644 --- a/java/vector/src/main/codegen/templates/FixedValueVectors.java +++ b/java/vector/src/main/codegen/templates/FixedValueVectors.java @@ -16,6 +16,8 @@ * limitations under the License. */ +import org.apache.arrow.vector.util.DecimalUtility; + import java.lang.Override; <@pp.dropOutputFile /> @@ -444,11 +446,7 @@ public void get(int index, Nullable${minor.class}Holder holder) { @Override public ${friendlyType} getObject(int index) { - byte[] bytes = new byte[${type.width}]; - int start = ${type.width} * index; - data.getBytes(start, bytes, 0, ${type.width}); - ${friendlyType} value = new BigDecimal(new BigInteger(bytes), scale); - return value; + return org.apache.arrow.vector.util.DecimalUtility.getBigDecimalFromArrowBuf(data, index, scale); } <#else> diff --git a/java/vector/src/main/codegen/templates/HolderReaderImpl.java b/java/vector/src/main/codegen/templates/HolderReaderImpl.java index 1ed9287b00eec..d66577bc1e444 100644 --- a/java/vector/src/main/codegen/templates/HolderReaderImpl.java +++ b/java/vector/src/main/codegen/templates/HolderReaderImpl.java @@ -156,9 +156,11 @@ private Object readSingleObject() { <#if type.major == "VarLen"> + <#if minor.class != "Decimal"> int length = holder.end - holder.start; byte[] value = new byte [length]; holder.buffer.getBytes(holder.start, value, 0, length); + <#if minor.class == "VarBinary"> return value; @@ -169,8 +171,7 @@ private Object readSingleObject() { text.set(value); return text; <#elseif minor.class == "Decimal" > - return new BigDecimal(new BigInteger(value), holder.scale); - + return org.apache.arrow.vector.util.DecimalUtility.getBigDecimalFromArrowBuf(holder.buffer, holder.start, holder.scale); <#elseif minor.class == "Interval"> diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java b/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java index 4eb0d9f2216c1..e171e87360d86 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java @@ -66,6 +66,8 @@ public class DecimalUtility { 100000000000000000l, 1000000000000000000l}; + public static final int DECIMAL_BYTE_LENGTH = 16; + /* * Simple function that returns the static precomputed * power of ten, instead of using Math.pow @@ -100,14 +102,6 @@ public static long adjustScaleDivide(long 
input, int factor) { } } - /* Given the number of actual digits this function returns the - * number of indexes it will occupy in the array of integers - * which are stored in base 1 billion - */ - public static int roundUp(int ndigits) { - return (ndigits + MAX_DIGITS - 1)/MAX_DIGITS; - } - /* Returns a string representation of the given integer * If the length of the given integer is less than the * passed length, this function will prepend zeroes to the string @@ -141,572 +135,33 @@ public static StringBuilder toStringWithZeroes(long number, int desiredLength) { return str; } - public static BigDecimal getBigDecimalFromIntermediate(ByteBuf data, int startIndex, int nDecimalDigits, int scale) { - - // In the intermediate representation we don't pad the scale with zeroes, so set truncate = false - return getBigDecimalFromArrowBuf(data, startIndex, nDecimalDigits, scale, false); - } - - public static BigDecimal getBigDecimalFromSparse(ArrowBuf data, int startIndex, int nDecimalDigits, int scale) { - - // In the sparse representation we pad the scale with zeroes for ease of arithmetic, need to truncate - return getBigDecimalFromArrowBuf(data, startIndex, nDecimalDigits, scale, true); - } - - public static BigDecimal getBigDecimalFromArrowBuf(ArrowBuf bytebuf, int start, int length, int scale) { - byte[] value = new byte[length]; - bytebuf.getBytes(start, value, 0, length); + public static BigDecimal getBigDecimalFromArrowBuf(ArrowBuf bytebuf, int index, int scale) { + byte[] value = new byte[DECIMAL_BYTE_LENGTH]; + final int startIndex = index * DECIMAL_BYTE_LENGTH; + bytebuf.getBytes(startIndex, value, 0, DECIMAL_BYTE_LENGTH); BigInteger unscaledValue = new BigInteger(value); return new BigDecimal(unscaledValue, scale); } - public static BigDecimal getBigDecimalFromByteBuffer(ByteBuffer bytebuf, int start, int length, int scale) { - byte[] value = new byte[length]; + public static BigDecimal getBigDecimalFromByteBuffer(ByteBuffer bytebuf, int start, int scale) { + byte[] value = new byte[DECIMAL_BYTE_LENGTH]; bytebuf.get(value); BigInteger unscaledValue = new BigInteger(value); return new BigDecimal(unscaledValue, scale); } - public static void writeBigDecimalToArrowBuf(ArrowBuf bytebuf, int startIndex, BigDecimal value) { - byte[] bytes = value.unscaledValue().toByteArray(); - if (bytes.length > 16) { + public static void writeBigDecimalToArrowBuf(BigDecimal value, ArrowBuf bytebuf, int index) { + final byte[] bytes = value.unscaledValue().toByteArray(); + final int startIndex = index * DECIMAL_BYTE_LENGTH; + if (bytes.length > DECIMAL_BYTE_LENGTH) { throw new UnsupportedOperationException("Decimal size greater than 16 bytes"); } - bytebuf.setBytes(startIndex + 16 - bytes.length, bytes, 0, bytes.length); - } - - /* Create a BigDecimal object using the data in the ArrowBuf. - * This function assumes that data is provided in a non-dense format - * It works on both sparse and intermediate representations. - */ - public static BigDecimal getBigDecimalFromArrowBuf(ByteBuf data, int startIndex, int nDecimalDigits, int scale, - boolean truncateScale) { - - // For sparse decimal type we have padded zeroes at the end, strip them while converting to BigDecimal. 
- int actualDigits; - - // Initialize the BigDecimal, first digit in the ArrowBuf has the sign so mask it out - BigInteger decimalDigits = BigInteger.valueOf((data.getInt(startIndex)) & 0x7FFFFFFF); - - BigInteger base = BigInteger.valueOf(DIGITS_BASE); - - for (int i = 1; i < nDecimalDigits; i++) { - - BigInteger temp = BigInteger.valueOf(data.getInt(startIndex + (i * INTEGER_SIZE))); - decimalDigits = decimalDigits.multiply(base); - decimalDigits = decimalDigits.add(temp); - } - - // Truncate any additional padding we might have added - if (truncateScale == true && scale > 0 && (actualDigits = scale % MAX_DIGITS) != 0) { - BigInteger truncate = BigInteger.valueOf((int)Math.pow(10, (MAX_DIGITS - actualDigits))); - decimalDigits = decimalDigits.divide(truncate); - } - - // set the sign - if ((data.getInt(startIndex) & 0x80000000) != 0) { - decimalDigits = decimalDigits.negate(); + final int padLength = DECIMAL_BYTE_LENGTH - bytes.length; + final int padValue = value.signum() == -1 ? 0xFF : 0; + for (int i = 0; i < padLength; i++) { + bytebuf.setByte(startIndex + i, padValue); } - - BigDecimal decimal = new BigDecimal(decimalDigits, scale); - - return decimal; - } - - /* This function returns a BigDecimal object from the dense decimal representation. - * First step is to convert the dense representation into an intermediate representation - * and then invoke getBigDecimalFromArrowBuf() to get the BigDecimal object - */ - public static BigDecimal getBigDecimalFromDense(ArrowBuf data, int startIndex, int nDecimalDigits, int scale, int maxPrecision, int width) { - - /* This method converts the dense representation to - * an intermediate representation. The intermediate - * representation has one more integer than the dense - * representation. - */ - byte[] intermediateBytes = new byte[((nDecimalDigits + 1) * INTEGER_SIZE)]; - - // Start storing from the least significant byte of the first integer - int intermediateIndex = 3; - - int[] mask = {0x03, 0x0F, 0x3F, 0xFF}; - int[] reverseMask = {0xFC, 0xF0, 0xC0, 0x00}; - - int maskIndex; - int shiftOrder; - byte shiftBits; - - // TODO: Some of the logic here is common with casting from Dense to Sparse types, factor out common code - if (maxPrecision == 38) { - maskIndex = 0; - shiftOrder = 6; - shiftBits = 0x00; - intermediateBytes[intermediateIndex++] = (byte) (data.getByte(startIndex) & 0x7F); - } else if (maxPrecision == 28) { - maskIndex = 1; - shiftOrder = 4; - shiftBits = (byte) ((data.getByte(startIndex) & 0x03) << shiftOrder); - intermediateBytes[intermediateIndex++] = (byte) (((data.getByte(startIndex) & 0x3C) & 0xFF) >>> 2); - } else { - throw new UnsupportedOperationException("Dense types with max precision 38 and 28 are only supported"); - } - - int inputIndex = 1; - boolean sign = false; - - if ((data.getByte(startIndex) & 0x80) != 0) { - sign = true; - } - - while (inputIndex < width) { - - intermediateBytes[intermediateIndex] = (byte) ((shiftBits) | (((data.getByte(startIndex + inputIndex) & reverseMask[maskIndex]) & 0xFF) >>> (8 - shiftOrder))); - - shiftBits = (byte) ((data.getByte(startIndex + inputIndex) & mask[maskIndex]) << shiftOrder); - - inputIndex++; - intermediateIndex++; - - if (((inputIndex - 1) % INTEGER_SIZE) == 0) { - shiftBits = (byte) ((shiftBits & 0xFF) >>> 2); - maskIndex++; - shiftOrder -= 2; - } - - } - /* copy the last byte */ - intermediateBytes[intermediateIndex] = shiftBits; - - if (sign == true) { - intermediateBytes[0] = (byte) (intermediateBytes[0] | 0x80); - } - - final ByteBuf intermediate = 
UnpooledByteBufAllocator.DEFAULT.buffer(intermediateBytes.length); - try { - intermediate.setBytes(0, intermediateBytes); - - BigDecimal ret = getBigDecimalFromIntermediate(intermediate, 0, nDecimalDigits + 1, scale); - return ret; - } finally { - intermediate.release(); - } - - } - - public static void getSparseFromBigDecimal(BigDecimal input, ByteBuf data, int startIndex, int scale, int precision, - int nDecimalDigits) { - - // Initialize the buffer - for (int i = 0; i < nDecimalDigits; i++) { - data.setInt(startIndex + (i * INTEGER_SIZE), 0); - } - - boolean sign = false; - - if (input.signum() == -1) { - // negative input - sign = true; - input = input.abs(); - } - - // Truncate the input as per the scale provided - input = input.setScale(scale, BigDecimal.ROUND_HALF_UP); - - // Separate out the integer part - BigDecimal integerPart = input.setScale(0, BigDecimal.ROUND_DOWN); - - int destIndex = nDecimalDigits - roundUp(scale) - 1; - - // we use base 1 billion integer digits for out integernal representation - BigDecimal base = new BigDecimal(DIGITS_BASE); - - while (integerPart.compareTo(BigDecimal.ZERO) == 1) { - // store the modulo as the integer value - data.setInt(startIndex + (destIndex * INTEGER_SIZE), (integerPart.remainder(base)).intValue()); - destIndex--; - // Divide by base 1 billion - integerPart = (integerPart.divide(base)).setScale(0, BigDecimal.ROUND_DOWN); - } - - /* Sparse representation contains padding of additional zeroes - * so each digit contains MAX_DIGITS for ease of arithmetic - */ - int actualDigits; - if ((actualDigits = (scale % MAX_DIGITS)) != 0) { - // Pad additional zeroes - scale = scale + (MAX_DIGITS - actualDigits); - input = input.setScale(scale, BigDecimal.ROUND_DOWN); - } - - //separate out the fractional part - BigDecimal fractionalPart = input.remainder(BigDecimal.ONE).movePointRight(scale); - - destIndex = nDecimalDigits - 1; - - while (scale > 0) { - // Get next set of MAX_DIGITS (9) store it in the ArrowBuf - fractionalPart = fractionalPart.movePointLeft(MAX_DIGITS); - BigDecimal temp = fractionalPart.remainder(BigDecimal.ONE); - - data.setInt(startIndex + (destIndex * INTEGER_SIZE), (temp.unscaledValue().intValue())); - destIndex--; - - fractionalPart = fractionalPart.setScale(0, BigDecimal.ROUND_DOWN); - scale -= MAX_DIGITS; - } - - // Set the negative sign - if (sign == true) { - data.setInt(startIndex, data.getInt(startIndex) | 0x80000000); - } - - } - - - public static long getDecimal18FromBigDecimal(BigDecimal input, int scale, int precision) { - // Truncate or pad to set the input to the correct scale - input = input.setScale(scale, BigDecimal.ROUND_HALF_UP); - - return (input.unscaledValue().longValue()); - } - - public static BigDecimal getBigDecimalFromPrimitiveTypes(int input, int scale, int precision) { - return BigDecimal.valueOf(input, scale); - } - - public static BigDecimal getBigDecimalFromPrimitiveTypes(long input, int scale, int precision) { - return BigDecimal.valueOf(input, scale); - } - - - public static int compareDenseBytes(ArrowBuf left, int leftStart, boolean leftSign, ArrowBuf right, int rightStart, boolean rightSign, int width) { - - int invert = 1; - - /* If signs are different then simply look at the - * sign of the two inputs and determine which is greater - */ - if (leftSign != rightSign) { - - return((leftSign == true) ? 
-1 : 1); - } else if(leftSign == true) { - /* Both inputs are negative, at the end we will - * have to invert the comparison - */ - invert = -1; - } - - int cmp = 0; - - for (int i = 0; i < width; i++) { - byte leftByte = left.getByte(leftStart + i); - byte rightByte = right.getByte(rightStart + i); - // Unsigned byte comparison - if ((leftByte & 0xFF) > (rightByte & 0xFF)) { - cmp = 1; - break; - } else if ((leftByte & 0xFF) < (rightByte & 0xFF)) { - cmp = -1; - break; - } - } - cmp *= invert; // invert the comparison if both were negative values - - return cmp; - } - - public static int getIntegerFromSparseBuffer(ArrowBuf buffer, int start, int index) { - int value = buffer.getInt(start + (index * 4)); - - if (index == 0) { - /* the first byte contains sign bit, return value without it */ - value = (value & 0x7FFFFFFF); - } - return value; - } - - public static void setInteger(ArrowBuf buffer, int start, int index, int value) { - buffer.setInt(start + (index * 4), value); - } - - public static int compareSparseBytes(ArrowBuf left, int leftStart, boolean leftSign, int leftScale, int leftPrecision, ArrowBuf right, int rightStart, boolean rightSign, int rightPrecision, int rightScale, int width, int nDecimalDigits, boolean absCompare) { - - int invert = 1; - - if (absCompare == false) { - if (leftSign != rightSign) { - return (leftSign == true) ? -1 : 1; - } - - // Both values are negative invert the outcome of the comparison - if (leftSign == true) { - invert = -1; - } - } - - int cmp = compareSparseBytesInner(left, leftStart, leftSign, leftScale, leftPrecision, right, rightStart, rightSign, rightPrecision, rightScale, width, nDecimalDigits); - return cmp * invert; - } - public static int compareSparseBytesInner(ArrowBuf left, int leftStart, boolean leftSign, int leftScale, int leftPrecision, ArrowBuf right, int rightStart, boolean rightSign, int rightPrecision, int rightScale, int width, int nDecimalDigits) { - /* compute the number of integer digits in each decimal */ - int leftInt = leftPrecision - leftScale; - int rightInt = rightPrecision - rightScale; - - /* compute the number of indexes required for storing integer digits */ - int leftIntRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(leftInt); - int rightIntRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(rightInt); - - /* compute number of indexes required for storing scale */ - int leftScaleRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(leftScale); - int rightScaleRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(rightScale); - - /* compute index of the most significant integer digits */ - int leftIndex1 = nDecimalDigits - leftScaleRoundedUp - leftIntRoundedUp; - int rightIndex1 = nDecimalDigits - rightScaleRoundedUp - rightIntRoundedUp; - - int leftStopIndex = nDecimalDigits - leftScaleRoundedUp; - int rightStopIndex = nDecimalDigits - rightScaleRoundedUp; - - /* Discard the zeroes in the integer part */ - while (leftIndex1 < leftStopIndex) { - if (getIntegerFromSparseBuffer(left, leftStart, leftIndex1) != 0) { - break; - } - - /* Digit in this location is zero, decrement the actual number - * of integer digits - */ - leftIntRoundedUp--; - leftIndex1++; - } - - /* If we reached the stop index then the number of integers is zero */ - if (leftIndex1 == leftStopIndex) { - leftIntRoundedUp = 0; - } - - while (rightIndex1 < rightStopIndex) { - if (getIntegerFromSparseBuffer(right, rightStart, rightIndex1) != 0) { - break; - } - - /* Digit in this location is zero, 
decrement the actual number - * of integer digits - */ - rightIntRoundedUp--; - rightIndex1++; - } - - if (rightIndex1 == rightStopIndex) { - rightIntRoundedUp = 0; - } - - /* We have the accurate number of non-zero integer digits, - * if the number of integer digits are different then we can determine - * which decimal is larger and needn't go down to comparing individual values - */ - if (leftIntRoundedUp > rightIntRoundedUp) { - return 1; - } - else if (rightIntRoundedUp > leftIntRoundedUp) { - return -1; - } - - /* The number of integer digits are the same, set the each index - * to the first non-zero integer and compare each digit - */ - leftIndex1 = nDecimalDigits - leftScaleRoundedUp - leftIntRoundedUp; - rightIndex1 = nDecimalDigits - rightScaleRoundedUp - rightIntRoundedUp; - - while (leftIndex1 < leftStopIndex && rightIndex1 < rightStopIndex) { - if (getIntegerFromSparseBuffer(left, leftStart, leftIndex1) > getIntegerFromSparseBuffer(right, rightStart, rightIndex1)) { - return 1; - } - else if (getIntegerFromSparseBuffer(right, rightStart, rightIndex1) > getIntegerFromSparseBuffer(left, leftStart, leftIndex1)) { - return -1; - } - - leftIndex1++; - rightIndex1++; - } - - /* The integer part of both the decimal's are equal, now compare - * each individual fractional part. Set the index to be at the - * beginning of the fractional part - */ - leftIndex1 = leftStopIndex; - rightIndex1 = rightStopIndex; - - /* Stop indexes will be the end of the array */ - leftStopIndex = nDecimalDigits; - rightStopIndex = nDecimalDigits; - - /* compare the two fractional parts of the decimal */ - while (leftIndex1 < leftStopIndex && rightIndex1 < rightStopIndex) { - if (getIntegerFromSparseBuffer(left, leftStart, leftIndex1) > getIntegerFromSparseBuffer(right, rightStart, rightIndex1)) { - return 1; - } - else if (getIntegerFromSparseBuffer(right, rightStart, rightIndex1) > getIntegerFromSparseBuffer(left, leftStart, leftIndex1)) { - return -1; - } - - leftIndex1++; - rightIndex1++; - } - - /* Till now the fractional part of the decimals are equal, check - * if one of the decimal has fractional part that is remaining - * and is non-zero - */ - while (leftIndex1 < leftStopIndex) { - if (getIntegerFromSparseBuffer(left, leftStart, leftIndex1) != 0) { - return 1; - } - leftIndex1++; - } - - while(rightIndex1 < rightStopIndex) { - if (getIntegerFromSparseBuffer(right, rightStart, rightIndex1) != 0) { - return -1; - } - rightIndex1++; - } - - /* Both decimal values are equal */ - return 0; - } - - public static BigDecimal getBigDecimalFromByteArray(byte[] bytes, int start, int length, int scale) { - byte[] value = Arrays.copyOfRange(bytes, start, start + length); - BigInteger unscaledValue = new BigInteger(value); - return new BigDecimal(unscaledValue, scale); - } - - public static void roundDecimal(ArrowBuf result, int start, int nDecimalDigits, int desiredScale, int currentScale) { - int newScaleRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(desiredScale); - int origScaleRoundedUp = org.apache.arrow.vector.util.DecimalUtility.roundUp(currentScale); - - if (desiredScale < currentScale) { - - boolean roundUp = false; - - //Extract the first digit to be truncated to check if we need to round up - int truncatedScaleIndex = desiredScale + 1; - if (truncatedScaleIndex <= currentScale) { - int extractDigitIndex = nDecimalDigits - origScaleRoundedUp -1; - extractDigitIndex += org.apache.arrow.vector.util.DecimalUtility.roundUp(truncatedScaleIndex); - int extractDigit = 
getIntegerFromSparseBuffer(result, start, extractDigitIndex); - int temp = org.apache.arrow.vector.util.DecimalUtility.MAX_DIGITS - (truncatedScaleIndex % org.apache.arrow.vector.util.DecimalUtility.MAX_DIGITS); - if (temp != 0) { - extractDigit = extractDigit / (int) (Math.pow(10, temp)); - } - if ((extractDigit % 10) > 4) { - roundUp = true; - } - } - - // Get the source index beyond which we will truncate - int srcIntIndex = nDecimalDigits - origScaleRoundedUp - 1; - int srcIndex = srcIntIndex + newScaleRoundedUp; - - // Truncate the remaining fractional part, move the integer part - int destIndex = nDecimalDigits - 1; - if (srcIndex != destIndex) { - while (srcIndex >= 0) { - setInteger(result, start, destIndex--, getIntegerFromSparseBuffer(result, start, srcIndex--)); - } - - // Set the remaining portion of the decimal to be zeroes - while (destIndex >= 0) { - setInteger(result, start, destIndex--, 0); - } - srcIndex = nDecimalDigits - 1; - } - - // We truncated the decimal digit. Now we need to truncate within the base 1 billion fractional digit - int truncateFactor = org.apache.arrow.vector.util.DecimalUtility.MAX_DIGITS - (desiredScale % org.apache.arrow.vector.util.DecimalUtility.MAX_DIGITS); - if (truncateFactor != org.apache.arrow.vector.util.DecimalUtility.MAX_DIGITS) { - truncateFactor = (int) Math.pow(10, truncateFactor); - int fractionalDigits = getIntegerFromSparseBuffer(result, start, nDecimalDigits - 1); - fractionalDigits /= truncateFactor; - setInteger(result, start, nDecimalDigits - 1, fractionalDigits * truncateFactor); - } - - // Finally round up the digit if needed - if (roundUp == true) { - srcIndex = nDecimalDigits - 1; - int carry; - if (truncateFactor != org.apache.arrow.vector.util.DecimalUtility.MAX_DIGITS) { - carry = truncateFactor; - } else { - carry = 1; - } - - while (srcIndex >= 0) { - int value = getIntegerFromSparseBuffer(result, start, srcIndex); - value += carry; - - if (value >= org.apache.arrow.vector.util.DecimalUtility.DIGITS_BASE) { - setInteger(result, start, srcIndex--, value % org.apache.arrow.vector.util.DecimalUtility.DIGITS_BASE); - carry = value / org.apache.arrow.vector.util.DecimalUtility.DIGITS_BASE; - } else { - setInteger(result, start, srcIndex--, value); - carry = 0; - break; - } - } - } - } else if (desiredScale > currentScale) { - // Add fractional digits to the decimal - - // Check if we need to shift the decimal digits to the left - if (newScaleRoundedUp > origScaleRoundedUp) { - int srcIndex = 0; - int destIndex = newScaleRoundedUp - origScaleRoundedUp; - - // Check while extending scale, we are not overwriting integer part - while (srcIndex < destIndex) { - if (getIntegerFromSparseBuffer(result, start, srcIndex++) != 0) { - throw new RuntimeException("Truncate resulting in loss of integer part, reduce scale specified"); - } - } - - srcIndex = 0; - while (destIndex < nDecimalDigits) { - setInteger(result, start, srcIndex++, getIntegerFromSparseBuffer(result, start, destIndex++)); - } - - // Clear the remaining part - while (srcIndex < nDecimalDigits) { - setInteger(result, start, srcIndex++, 0); - } - } - } - } - - public static int getFirstFractionalDigit(int decimal, int scale) { - if (scale == 0) { - return 0; - } - int temp = (int) adjustScaleDivide(decimal, scale - 1); - return Math.abs(temp % 10); - } - - public static int getFirstFractionalDigit(long decimal, int scale) { - if (scale == 0) { - return 0; - } - long temp = adjustScaleDivide(decimal, scale - 1); - return (int) (Math.abs(temp % 10)); - } - - public static 
int getFirstFractionalDigit(ArrowBuf data, int scale, int start, int nDecimalDigits) { - if (scale == 0) { - return 0; - } - - int index = nDecimalDigits - roundUp(scale); - return (int) (adjustScaleDivide(data.getInt(start + (index * INTEGER_SIZE)), MAX_DIGITS - 1)); + bytebuf.setBytes(startIndex + padLength, bytes, 0, bytes.length); } } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestDecimalVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestDecimalVector.java index 7ab7db3117b81..cca35e44a215d 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestDecimalVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestDecimalVector.java @@ -33,9 +33,10 @@ public class TestDecimalVector { private static long[] intValues; static { - intValues = new long[30]; - for (int i = 0; i < intValues.length; i++) { + intValues = new long[60]; + for (int i = 0; i < intValues.length / 2; i++) { intValues[i] = 1 << i + 1; + intValues[2 * i] = -1 * (1 << i + 1); } } private int scale = 3; @@ -50,7 +51,7 @@ public void test() { BigDecimal decimal = new BigDecimal(BigInteger.valueOf(intValues[i]), scale); values[i] = decimal; decimalVector.getMutator().setIndexDefined(i); - DecimalUtility.writeBigDecimalToArrowBuf(decimalVector.getBuffer(), i * 16, decimal); + DecimalUtility.writeBigDecimalToArrowBuf(decimal, decimalVector.getBuffer(), i); } decimalVector.getMutator().setValueCount(intValues.length); From c2eb1612df34bee7baddc8851d24826d3c33faa6 Mon Sep 17 00:00:00 2001 From: Steven Phillips Date: Fri, 19 Aug 2016 17:39:36 -0700 Subject: [PATCH 0116/1644] ARROW-265: Fix few decimal bugs --- .../AbstractPromotableFieldWriter.java | 19 ++++++++++++++++--- .../codegen/templates/FixedValueVectors.java | 2 +- .../main/codegen/templates/MapWriters.java | 2 +- .../org/apache/arrow/vector/types/Types.java | 3 ++- .../arrow/vector/util/DecimalUtility.java | 3 +-- 5 files changed, 21 insertions(+), 8 deletions(-) diff --git a/java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java b/java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java index 7e60320cfb8ac..d21dcd0f6461c 100644 --- a/java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java +++ b/java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java @@ -82,7 +82,18 @@ public void write(${name}Holder holder) { getWriter(MinorType.${name?upper_case}).write${minor.class}(<#list fields as field>${field.name}<#if field_has_next>, ); } + <#else> + @Override + public void write(DecimalHolder holder) { + getWriter(MinorType.DECIMAL).write(holder); + } + + public void writeDecimal(int start, ArrowBuf buffer) { + getWriter(MinorType.DECIMAL).writeDecimal(start, buffer); + } + + public void writeNull() { @@ -113,8 +124,11 @@ public ListWriter list(String name) { <#if lowerName == "int" ><#assign lowerName = "integer" /> <#assign upperName = minor.class?upper_case /> <#assign capName = minor.class?cap_first /> - <#if !minor.class?starts_with("Decimal") > - + <#if minor.class?starts_with("Decimal") > + public ${capName}Writer ${lowerName}(String name, int scale, int precision) { + return getWriter(MinorType.MAP).${lowerName}(name, scale, precision); + } + @Override public ${capName}Writer ${lowerName}(String name) { return getWriter(MinorType.MAP).${lowerName}(name); @@ -125,7 +139,6 @@ public ListWriter list(String name) { return getWriter(MinorType.LIST).${lowerName}(); } - public void copyReader(FieldReader reader) { diff --git 
a/java/vector/src/main/codegen/templates/FixedValueVectors.java b/java/vector/src/main/codegen/templates/FixedValueVectors.java index 37946f6b76ea6..7958222f5c1bb 100644 --- a/java/vector/src/main/codegen/templates/FixedValueVectors.java +++ b/java/vector/src/main/codegen/templates/FixedValueVectors.java @@ -446,7 +446,7 @@ public void get(int index, Nullable${minor.class}Holder holder) { @Override public ${friendlyType} getObject(int index) { - return org.apache.arrow.vector.util.DecimalUtility.getBigDecimalFromArrowBuf(data, index, scale); + return org.apache.arrow.vector.util.DecimalUtility.getBigDecimalFromArrowBuf(data, ${type.width} * index, scale); } <#else> diff --git a/java/vector/src/main/codegen/templates/MapWriters.java b/java/vector/src/main/codegen/templates/MapWriters.java index af2922826ec4d..8a8983a1497cc 100644 --- a/java/vector/src/main/codegen/templates/MapWriters.java +++ b/java/vector/src/main/codegen/templates/MapWriters.java @@ -198,7 +198,7 @@ public void end() { if(writer == null) { ValueVector vector; ValueVector currentVector = container.getChild(name); - ${vectName}Vector v = container.addOrGet(name, MinorType.${upperName}, ${vectName}Vector.class); + ${vectName}Vector v = container.addOrGet(name, MinorType.${upperName}, ${vectName}Vector.class<#if minor.class == "Decimal"> , new int[] {precision, scale}); writer = new PromotableWriter(v, container); vector = v; if (currentVector == null || currentVector != vector) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java index 5ea1456a051f7..c34882a8fb12a 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -47,6 +47,7 @@ import org.apache.arrow.vector.complex.impl.BigIntWriterImpl; import org.apache.arrow.vector.complex.impl.BitWriterImpl; import org.apache.arrow.vector.complex.impl.DateWriterImpl; +import org.apache.arrow.vector.complex.impl.DecimalWriterImpl; import org.apache.arrow.vector.complex.impl.Float4WriterImpl; import org.apache.arrow.vector.complex.impl.Float8WriterImpl; import org.apache.arrow.vector.complex.impl.IntWriterImpl; @@ -386,7 +387,7 @@ public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack @Override public FieldWriter getNewFieldWriter(ValueVector vector) { - return new VarBinaryWriterImpl((NullableVarBinaryVector) vector); + return new DecimalWriterImpl((NullableDecimalVector) vector); } }, // variable length binary UINT1(new Int(8, false)) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java b/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java index e171e87360d86..4c439b2cc1066 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/DecimalUtility.java @@ -135,9 +135,8 @@ public static StringBuilder toStringWithZeroes(long number, int desiredLength) { return str; } - public static BigDecimal getBigDecimalFromArrowBuf(ArrowBuf bytebuf, int index, int scale) { + public static BigDecimal getBigDecimalFromArrowBuf(ArrowBuf bytebuf, int startIndex, int scale) { byte[] value = new byte[DECIMAL_BYTE_LENGTH]; - final int startIndex = index * DECIMAL_BYTE_LENGTH; bytebuf.getBytes(startIndex, value, 0, DECIMAL_BYTE_LENGTH); BigInteger unscaledValue = new BigInteger(value); return new BigDecimal(unscaledValue, scale); From 
812201a2db1ebabd0f65ebd2774ec8f0880bb8cb Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 19 Aug 2016 18:05:16 -0700 Subject: [PATCH 0117/1644] ARROW-266: [C++] Fix broken build due to Flatbuffers namespace change Author: Wes McKinney Closes #122 from wesm/ARROW-266 and squashes the following commits: 6193323 [Wes McKinney] Fix broken build due to Flatbuffers namespace change --- cpp/src/arrow/ipc/adapter.cc | 2 +- cpp/src/arrow/ipc/metadata-internal.cc | 2 +- cpp/src/arrow/ipc/metadata-internal.h | 2 +- cpp/src/arrow/ipc/metadata.cc | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index 3259980058b8e..40d372bbd3520 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -41,7 +41,7 @@ namespace arrow { -namespace flatbuf = apache::arrow::flatbuf; +namespace flatbuf = org::apache::arrow::flatbuf; namespace ipc { diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index 8cd416ff5853f..16ba20f7e90ee 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -37,7 +37,7 @@ typedef flatbuffers::Offset Offset; namespace arrow { -namespace flatbuf = apache::arrow::flatbuf; +namespace flatbuf = org::apache::arrow::flatbuf; namespace ipc { diff --git a/cpp/src/arrow/ipc/metadata-internal.h b/cpp/src/arrow/ipc/metadata-internal.h index 871b5bc4bf606..5faa8c947b55d 100644 --- a/cpp/src/arrow/ipc/metadata-internal.h +++ b/cpp/src/arrow/ipc/metadata-internal.h @@ -28,7 +28,7 @@ namespace arrow { -namespace flatbuf = apache::arrow::flatbuf; +namespace flatbuf = org::apache::arrow::flatbuf; class Buffer; struct Field; diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index 4fc8ec50eb716..e510755110e04 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -33,7 +33,7 @@ namespace arrow { -namespace flatbuf = apache::arrow::flatbuf; +namespace flatbuf = org::apache::arrow::flatbuf; namespace ipc { From 78619686f44da5a28319032551b07ddfadc26468 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Sat, 20 Aug 2016 10:39:50 -0700 Subject: [PATCH 0118/1644] ARROW-252: Add implementation guidelines to the documentation Author: Julien Le Dem Closes #120 from julienledem/arrow_252_impl_guidelines and squashes the following commits: caf6994 [Julien Le Dem] ARROW-252: review feedback 6b68ce1 [Julien Le Dem] ARROW-252: Add implementation guidelines to the documentation --- format/Guidelines.md | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) create mode 100644 format/Guidelines.md diff --git a/format/Guidelines.md b/format/Guidelines.md new file mode 100644 index 0000000000000..14f1057850439 --- /dev/null +++ b/format/Guidelines.md @@ -0,0 +1,17 @@ +# Implementation guidelines + +An execution engine (or framework, or UDF executor, or storage engine, etc) can implements only a subset of the Arrow spec and/or extend it given the following constraints: + +## Implementing a subset the spec +### If only producing (and not consuming) arrow vectors. +Any subset of the vector spec and the corresponding metadata can be implemented. + +### If consuming and producing vectors +There is a minimal subset of vectors to be supported. +Production of a subset of vectors and their corresponding metadata is always fine. 
+Consumption of vectors should at least convert the unsupported input vectors to the supported subset (for example Timestamp.millis to timestamp.micros or int32 to int64) + +## Extensibility +An execution engine implementor can also extend their memory representation with their own vectors internally as long as they are never exposed. Before sending data to another system expecting Arrow data these custom vectors should be converted to a type that exist in the Arrow spec. +An example of this is operating on compressed data. +These custom vectors are not exchanged externaly and there is no support for custom metadata. From 8960a2ed4c0d400be32003beb183f150e019c4ec Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Sat, 20 Aug 2016 13:02:45 -0700 Subject: [PATCH 0119/1644] ARROW-255: Finalize Dictionary representation Author: Julien Le Dem Closes #119 from julienledem/arrow_255_dictionary and squashes the following commits: 316745d [Julien Le Dem] ARROW-255: fix typo and linter errors e28a3c8 [Julien Le Dem] ARROW-255: review feedback 8c27943 [Julien Le Dem] ARROW-255: Finalize Dictionary representation --- cpp/src/arrow/ipc/metadata-internal.cc | 3 ++- cpp/src/arrow/type.h | 11 +++++--- format/Layout.md | 37 ++++++++++++++++++++++++++ format/Message.fbs | 6 ++++- 4 files changed, 52 insertions(+), 5 deletions(-) diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index 16ba20f7e90ee..50db730d20832 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -220,7 +220,8 @@ static Status FieldToFlatbuffer( auto fb_children = fbb.CreateVector(children); *offset = flatbuf::CreateField( - fbb, fb_name, field->nullable, type_enum, type_data, fb_children); + fbb, fb_name, field->nullable, type_enum, type_data, field->dictionary, + fb_children); return Status::OK(); } diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 4cb37fd1dead8..02677d5e18b90 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -144,8 +144,13 @@ struct ARROW_EXPORT Field { // Fields can be nullable bool nullable; - Field(const std::string& name, const TypePtr& type, bool nullable = true) - : name(name), type(type), nullable(nullable) {} + // optional dictionary id if the field is dictionary encoded + // 0 means it's not dictionary encoded + int64_t dictionary; + + Field(const std::string& name, const TypePtr& type, bool nullable = true, + int64_t dictionary = 0) + : name(name), type(type), nullable(nullable), dictionary(dictionary) {} bool operator==(const Field& other) const { return this->Equals(other); } @@ -154,7 +159,7 @@ struct ARROW_EXPORT Field { bool Equals(const Field& other) const { return (this == &other) || (this->name == other.name && this->nullable == other.nullable && - this->type->Equals(other.type.get())); + this->dictionary == dictionary && this->type->Equals(other.type.get())); } bool Equals(const std::shared_ptr& other) const { return Equals(*other.get()); } diff --git a/format/Layout.md b/format/Layout.md index 5eaefeebf210a..a953930e172e7 100644 --- a/format/Layout.md +++ b/format/Layout.md @@ -583,6 +583,43 @@ even if the null bitmap of the parent union array indicates the slot is null. Additionally, a child array may have a non-null slot even if the the types array indicates that a slot contains a different type at the index. +## Dictionary encoding + +When a field is dictionary encoded, the values are represented by an array of Int32 representing the index of the value in the dictionary. 
+The Dictionary is received as a DictionaryBacth whose id is referenced by a dictionary attribute defined in the metadata (Message.fbs) in the Field table. +The dictionary has the same layout as the type of the field would dictate. Each entry in the dictionary can be accessed by its index in the DictionaryBatch. +When a Schema references a Dictionary id, it must send a DictionaryBatch for this id before any RecordBatch. + +As an example, you could have the following data: +``` +type: List + +[ + ['a', 'b'], + ['a', 'b'], + ['a', 'b'], + ['c', 'd', 'e'], + ['c', 'd', 'e'], + ['c', 'd', 'e'], + ['c', 'd', 'e'], + ['a', 'b'] +] +``` +In dictionary-encoded form, this could appear as: +``` +data List (dictionary-encoded, dictionary id i) +indices: [0, 0, 0, 1, 1, 1, 0] + +dictionary i + +type: List + +[ + ['a', 'b'], + ['c', 'd', 'e'], +] +``` + ## References Apache Drill Documentation - [Value Vectors][6] diff --git a/format/Message.fbs b/format/Message.fbs index 2928207db8cc0..a78009b6e5f94 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -104,6 +104,10 @@ table Field { name: string; nullable: bool; type: Type; + // present only if the field is dictionary encoded + // will point to a dictionary provided by a DictionaryBatch message + dictionary: long; + // children apply only to Nested data types like Struct, List and Union children: [Field]; } @@ -185,8 +189,8 @@ table RecordBatch { /// For sending dictionary encoding information. Any Field can be /// dictionary-encoded, but in this case none of its children may be /// dictionary-encoded. +/// There is one dictionary batch per dictionary /// -/// TODO(wesm): To be documented in more detail table DictionaryBatch { id: long; From ec51d566708f5d6ea0a94a6d53152dc8cc98d6aa Mon Sep 17 00:00:00 2001 From: Steven Phillips Date: Mon, 22 Aug 2016 13:10:06 -0700 Subject: [PATCH 0120/1644] ARROW-269: Include typeVector buffers UnionVector.getBuffers() --- .../main/codegen/templates/UnionVector.java | 24 +++++++++---------- 1 file changed, 11 insertions(+), 13 deletions(-) diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java index e2f19f4b33ba5..1fef490d4ec3c 100644 --- a/java/vector/src/main/codegen/templates/UnionVector.java +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -16,7 +16,9 @@ * limitations under the License. 
*/ +import com.google.common.collect.ImmutableList; import com.google.flatbuffers.FlatBufferBuilder; +import io.netty.buffer.ArrowBuf; import org.apache.arrow.flatbuf.Field; import org.apache.arrow.flatbuf.Type; import org.apache.arrow.flatbuf.Union; @@ -35,6 +37,7 @@ package org.apache.arrow.vector.complex; <#include "/@includes/vv_imports.ftl" /> +import com.google.common.collect.ImmutableList; import java.util.ArrayList; import java.util.Iterator; import org.apache.arrow.vector.complex.impl.ComplexCopier; @@ -219,6 +222,7 @@ public TransferPair makeTransferPair(ValueVector target) { } public void transferTo(org.apache.arrow.vector.complex.UnionVector target) { + typeVector.makeTransferPair(target.typeVector).transfer(); internalMap.makeTransferPair(target.internalMap).transfer(); target.valueCount = valueCount; } @@ -307,20 +311,9 @@ public FieldWriter getWriter() { return mutator.writer; } -// @Override -// public UserBitShared.SerializedField getMetadata() { -// SerializedField.Builder b = getField() // -// .getAsBuilder() // -// .setBufferLength(getBufferSize()) // -// .setValueCount(valueCount); -// -// b.addChild(internalMap.getMetadata()); -// return b.build(); -// } - @Override public int getBufferSize() { - return internalMap.getBufferSize(); + return typeVector.getBufferSize() + internalMap.getBufferSize(); } @Override @@ -339,7 +332,11 @@ public int getBufferSizeFor(final int valueCount) { @Override public ArrowBuf[] getBuffers(boolean clear) { - return internalMap.getBuffers(clear); + ImmutableList.Builder builder = ImmutableList.builder(); + builder.add(typeVector.getBuffers(clear)); + builder.add(internalMap.getBuffers(clear)); + List list = builder.build(); + return list.toArray(new ArrowBuf[list.size()]); } @Override @@ -411,6 +408,7 @@ public class Mutator extends BaseValueVector.BaseMutator { @Override public void setValueCount(int valueCount) { UnionVector.this.valueCount = valueCount; + typeVector.getMutator().setValueCount(valueCount); internalMap.getMutator().setValueCount(valueCount); } From 803afeb502dcdd802fada2ed0d66c145546b8a78 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Fri, 26 Aug 2016 08:20:13 -0700 Subject: [PATCH 0121/1644] ARROW-264: File format This is work in progress Author: Julien Le Dem Closes #123 from julienledem/arrow_264_file_format and squashes the following commits: 252de6d [Julien Le Dem] remove outdated comment 04d797f [Julien Le Dem] maps are not nullable yet e8359b3 [Julien Le Dem] align on 8 byte boundaries; more tests 8b8b823 [Julien Le Dem] refactoring 31e95e6 [Julien Le Dem] fix list vector b824938 [Julien Le Dem] fix types; add licenses; more tests; more complex 2fd3bc1 [Julien Le Dem] cleanup 50fe680 [Julien Le Dem] nested support b0bf6bc [Julien Le Dem] cleanup 4247b1a [Julien Le Dem] fix whitespace d6a1788 [Julien Le Dem] refactoring 81863c5 [Julien Le Dem] fixed loader aa1b766 [Julien Le Dem] better test 2067e01 [Julien Le Dem] update format aacf61e [Julien Le Dem] fix pom b907aa5 [Julien Le Dem] simplify e43f26b [Julien Le Dem] add layout spec 0cc9718 [Julien Le Dem] add vector type ac6902a [Julien Le Dem] ARROW-264: File format 807db51 [Julien Le Dem] move information to schema f2f0596 [Julien Le Dem] Update FieldNode structure to be more explicit and reflect schema --- cpp/src/arrow/ipc/metadata-internal.cc | 1 + format/File.fbs | 28 ++ format/Message.fbs | 21 +- java/format/pom.xml | 1 + .../main/java/io/netty/buffer/ArrowBuf.java | 71 ++-- .../src/main/codegen/data/ArrowTypes.tdd | 4 +- 
.../src/main/codegen/templates/ArrowType.java | 29 +- .../templates/NullableValueVectors.java | 49 ++- .../main/codegen/templates/UnionVector.java | 40 ++- .../arrow/vector/BaseDataValueVector.java | 38 +- .../org/apache/arrow/vector/BufferBacked.java | 31 ++ .../org/apache/arrow/vector/FieldVector.java | 65 ++++ .../org/apache/arrow/vector/ValueVector.java | 6 +- .../org/apache/arrow/vector/VectorLoader.java | 99 ++++++ .../apache/arrow/vector/VectorUnloader.java | 78 +++++ .../org/apache/arrow/vector/ZeroVector.java | 39 ++- .../complex/AbstractContainerVector.java | 21 +- .../vector/complex/AbstractMapVector.java | 42 ++- .../complex/BaseRepeatedValueVector.java | 21 +- .../arrow/vector/complex/ListVector.java | 58 ++- .../arrow/vector/complex/MapVector.java | 59 +++- .../complex/impl/ComplexWriterImpl.java | 2 +- .../vector/complex/impl/PromotableWriter.java | 3 +- .../apache/arrow/vector/file/ArrowBlock.java | 82 +++++ .../apache/arrow/vector/file/ArrowFooter.java | 144 ++++++++ .../apache/arrow/vector/file/ArrowReader.java | 151 ++++++++ .../apache/arrow/vector/file/ArrowWriter.java | 179 ++++++++++ .../file/InvalidArrowFileException.java | 27 ++ .../arrow/vector/schema/ArrowBuffer.java | 81 +++++ .../arrow/vector/schema/ArrowFieldNode.java | 53 +++ .../arrow/vector/schema/ArrowRecordBatch.java | 127 +++++++ .../arrow/vector/schema/ArrowVectorType.java | 47 +++ .../arrow/vector/schema/FBSerializable.java | 24 ++ .../arrow/vector/schema/FBSerializables.java | 37 ++ .../arrow/vector/schema/TypeLayout.java | 208 +++++++++++ .../arrow/vector/schema/VectorLayout.java | 93 +++++ .../org/apache/arrow/vector/types/Types.java | 70 ++-- .../apache/arrow/vector/types/pojo/Field.java | 42 ++- .../arrow/vector/types/pojo/Schema.java | 13 +- .../arrow/vector/TestVectorUnloadLoad.java | 89 +++++ .../ByteArrayReadableSeekableByteChannel.java | 80 +++++ .../arrow/vector/file/TestArrowFile.java | 331 ++++++++++++++++++ .../arrow/vector/file/TestArrowFooter.java | 56 +++ .../vector/file/TestArrowReaderWriter.java | 106 ++++++ .../apache/arrow/vector/pojo/TestConvert.java | 38 +- 45 files changed, 2722 insertions(+), 162 deletions(-) create mode 100644 format/File.fbs create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/BufferBacked.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/FieldVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/file/ArrowBlock.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFooter.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/file/InvalidArrowFileException.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowBuffer.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowFieldNode.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowRecordBatch.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowVectorType.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/schema/FBSerializable.java create mode 100644 
java/vector/src/main/java/org/apache/arrow/vector/schema/FBSerializables.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/schema/VectorLayout.java create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/file/ByteArrayReadableSeekableByteChannel.java create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFooter.java create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index 50db730d20832..c921e4d8e0114 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -219,6 +219,7 @@ static Status FieldToFlatbuffer( RETURN_NOT_OK(TypeToFlatbuffer(fbb, field->type, &children, &type_enum, &type_data)); auto fb_children = fbb.CreateVector(children); + // TODO: produce the list of VectorTypes *offset = flatbuf::CreateField( fbb, fb_name, field->nullable, type_enum, type_data, field->dictionary, fb_children); diff --git a/format/File.fbs b/format/File.fbs new file mode 100644 index 0000000000000..f7ad1e1594a91 --- /dev/null +++ b/format/File.fbs @@ -0,0 +1,28 @@ +include "Message.fbs"; + +namespace org.apache.arrow.flatbuf; + +/// ---------------------------------------------------------------------- +/// Arrow File metadata +/// + +table Footer { + + schema: org.apache.arrow.flatbuf.Schema; + + dictionaries: [ Block ]; + + recordBatches: [ Block ]; +} + +struct Block { + + offset: long; + + metaDataLength: int; + + bodyLength: long; + +} + +root_type Footer; diff --git a/format/Message.fbs b/format/Message.fbs index a78009b6e5f94..b02f3fa38694e 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -17,7 +17,7 @@ table Tuple { table List { } -enum UnionMode:int { Sparse, Dense } +enum UnionMode:short { Sparse, Dense } table Union { mode: UnionMode; @@ -28,7 +28,7 @@ table Int { is_signed: bool; } -enum Precision:int {SINGLE, DOUBLE} +enum Precision:short {SINGLE, DOUBLE} table FloatingPoint { precision: Precision; @@ -91,6 +91,17 @@ union Type { JSONScalar } +enum VectorType: short { + /// used in List type Dense Union and variable length primitive types (String, Binary) + OFFSET, + /// fixed length primitive values + VALUES, + /// Bit vector indicated if each value is null + VALIDITY, + /// Type vector used in Union type + TYPE +} + /// ---------------------------------------------------------------------- /// A field represents a named column in a record / row batch or child of a /// nested type. @@ -109,12 +120,16 @@ table Field { dictionary: long; // children apply only to Nested data types like Struct, List and Union children: [Field]; + /// the buffers produced for this type (as derived from the Type) + /// does not include children + /// each recordbatch will return instances of those Buffers. 
+ buffers: [ VectorType ]; } /// ---------------------------------------------------------------------- /// Endianness of the platform that produces the RecordBatch -enum Endianness:int { Little, Big } +enum Endianness:short { Little, Big } /// ---------------------------------------------------------------------- /// A Schema describes the columns in a row batch diff --git a/java/format/pom.xml b/java/format/pom.xml index cb11b5ff3c45d..dc5897581b5b3 100644 --- a/java/format/pom.xml +++ b/java/format/pom.xml @@ -106,6 +106,7 @@ -o target/generated-sources/ ../../format/Message.fbs + ../../format/File.fbs diff --git a/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java b/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java index bbec26aa85c74..d10f00247e6ee 100644 --- a/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java +++ b/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java @@ -17,8 +17,6 @@ */ package io.netty.buffer; -import io.netty.util.internal.PlatformDependent; - import java.io.IOException; import java.io.InputStream; import java.io.OutputStream; @@ -30,16 +28,18 @@ import java.util.concurrent.atomic.AtomicInteger; import java.util.concurrent.atomic.AtomicLong; +import org.apache.arrow.memory.AllocationManager.BufferLedger; import org.apache.arrow.memory.BaseAllocator; +import org.apache.arrow.memory.BaseAllocator.Verbosity; import org.apache.arrow.memory.BoundsChecking; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.BufferManager; -import org.apache.arrow.memory.AllocationManager.BufferLedger; -import org.apache.arrow.memory.BaseAllocator.Verbosity; import org.apache.arrow.memory.util.HistoricalLog; import com.google.common.base.Preconditions; +import io.netty.util.internal.PlatformDependent; + public final class ArrowBuf extends AbstractByteBuf implements AutoCloseable { private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ArrowBuf.class); @@ -307,7 +307,7 @@ public ByteOrder order() { } @Override - public ByteBuf order(ByteOrder endianness) { + public ArrowBuf order(ByteOrder endianness) { return this; } @@ -344,7 +344,7 @@ public ByteBuf copy(int index, int length) { } @Override - public ByteBuf slice() { + public ArrowBuf slice() { return slice(readerIndex(), readableBytes()); } @@ -467,7 +467,7 @@ public boolean equals(Object obj) { } @Override - public ByteBuf retain(int increment) { + public ArrowBuf retain(int increment) { Preconditions.checkArgument(increment > 0, "retain(%d) argument is not positive", increment); if (isEmpty) { @@ -484,7 +484,7 @@ public ByteBuf retain(int increment) { } @Override - public ByteBuf retain() { + public ArrowBuf retain() { return retain(1); } @@ -535,49 +535,49 @@ public short getShort(int index) { } @Override - public ByteBuf setShort(int index, int value) { + public ArrowBuf setShort(int index, int value) { chk(index, 2); PlatformDependent.putShort(addr(index), (short) value); return this; } @Override - public ByteBuf setInt(int index, int value) { + public ArrowBuf setInt(int index, int value) { chk(index, 4); PlatformDependent.putInt(addr(index), value); return this; } @Override - public ByteBuf setLong(int index, long value) { + public ArrowBuf setLong(int index, long value) { chk(index, 8); PlatformDependent.putLong(addr(index), value); return this; } @Override - public ByteBuf setChar(int index, int value) { + public ArrowBuf setChar(int index, int value) { chk(index, 2); PlatformDependent.putShort(addr(index), (short) value); return this; } @Override - 
public ByteBuf setFloat(int index, float value) { + public ArrowBuf setFloat(int index, float value) { chk(index, 4); PlatformDependent.putInt(addr(index), Float.floatToRawIntBits(value)); return this; } @Override - public ByteBuf setDouble(int index, double value) { + public ArrowBuf setDouble(int index, double value) { chk(index, 8); PlatformDependent.putLong(addr(index), Double.doubleToRawLongBits(value)); return this; } @Override - public ByteBuf writeShort(int value) { + public ArrowBuf writeShort(int value) { ensure(2); PlatformDependent.putShort(addr(writerIndex), (short) value); writerIndex += 2; @@ -585,7 +585,7 @@ public ByteBuf writeShort(int value) { } @Override - public ByteBuf writeInt(int value) { + public ArrowBuf writeInt(int value) { ensure(4); PlatformDependent.putInt(addr(writerIndex), value); writerIndex += 4; @@ -593,7 +593,7 @@ public ByteBuf writeInt(int value) { } @Override - public ByteBuf writeLong(long value) { + public ArrowBuf writeLong(long value) { ensure(8); PlatformDependent.putLong(addr(writerIndex), value); writerIndex += 8; @@ -601,7 +601,7 @@ public ByteBuf writeLong(long value) { } @Override - public ByteBuf writeChar(int value) { + public ArrowBuf writeChar(int value) { ensure(2); PlatformDependent.putShort(addr(writerIndex), (short) value); writerIndex += 2; @@ -609,7 +609,7 @@ public ByteBuf writeChar(int value) { } @Override - public ByteBuf writeFloat(float value) { + public ArrowBuf writeFloat(float value) { ensure(4); PlatformDependent.putInt(addr(writerIndex), Float.floatToRawIntBits(value)); writerIndex += 4; @@ -617,7 +617,7 @@ public ByteBuf writeFloat(float value) { } @Override - public ByteBuf writeDouble(double value) { + public ArrowBuf writeDouble(double value) { ensure(8); PlatformDependent.putLong(addr(writerIndex), Double.doubleToRawLongBits(value)); writerIndex += 8; @@ -625,19 +625,19 @@ public ByteBuf writeDouble(double value) { } @Override - public ByteBuf getBytes(int index, byte[] dst, int dstIndex, int length) { + public ArrowBuf getBytes(int index, byte[] dst, int dstIndex, int length) { udle.getBytes(index + offset, dst, dstIndex, length); return this; } @Override - public ByteBuf getBytes(int index, ByteBuffer dst) { + public ArrowBuf getBytes(int index, ByteBuffer dst) { udle.getBytes(index + offset, dst); return this; } @Override - public ByteBuf setByte(int index, int value) { + public ArrowBuf setByte(int index, int value) { chk(index, 1); PlatformDependent.putByte(addr(index), (byte) value); return this; @@ -699,13 +699,13 @@ protected void _setLong(int index, long value) { } @Override - public ByteBuf getBytes(int index, ByteBuf dst, int dstIndex, int length) { + public ArrowBuf getBytes(int index, ByteBuf dst, int dstIndex, int length) { udle.getBytes(index + offset, dst, dstIndex, length); return this; } @Override - public ByteBuf getBytes(int index, OutputStream out, int length) throws IOException { + public ArrowBuf getBytes(int index, OutputStream out, int length) throws IOException { udle.getBytes(index + offset, out, length); return this; } @@ -724,12 +724,12 @@ public int getBytes(int index, GatheringByteChannel out, int length) throws IOEx } @Override - public ByteBuf setBytes(int index, ByteBuf src, int srcIndex, int length) { + public ArrowBuf setBytes(int index, ByteBuf src, int srcIndex, int length) { udle.setBytes(index + offset, src, srcIndex, length); return this; } - public ByteBuf setBytes(int index, ByteBuffer src, int srcIndex, int length) { + public ArrowBuf setBytes(int index, ByteBuffer src, 
int srcIndex, int length) { if (src.isDirect()) { checkIndex(index, length); PlatformDependent.copyMemory(PlatformDependent.directBufferAddress(src) + srcIndex, this.memoryAddress() + index, @@ -749,13 +749,13 @@ public ByteBuf setBytes(int index, ByteBuffer src, int srcIndex, int length) { } @Override - public ByteBuf setBytes(int index, byte[] src, int srcIndex, int length) { + public ArrowBuf setBytes(int index, byte[] src, int srcIndex, int length) { udle.setBytes(index + offset, src, srcIndex, length); return this; } @Override - public ByteBuf setBytes(int index, ByteBuffer src) { + public ArrowBuf setBytes(int index, ByteBuffer src) { udle.setBytes(index + offset, src); return this; } @@ -860,4 +860,17 @@ public void print(StringBuilder sb, int indent, Verbosity verbosity) { } } + @Override + public ArrowBuf readerIndex(int readerIndex) { + super.readerIndex(readerIndex); + return this; + } + + @Override + public ArrowBuf writerIndex(int writerIndex) { + super.writerIndex(writerIndex); + return this; + } + + } diff --git a/java/vector/src/main/codegen/data/ArrowTypes.tdd b/java/vector/src/main/codegen/data/ArrowTypes.tdd index 4ab7f8562f907..2ecad3d31400f 100644 --- a/java/vector/src/main/codegen/data/ArrowTypes.tdd +++ b/java/vector/src/main/codegen/data/ArrowTypes.tdd @@ -30,7 +30,7 @@ }, { name: "Union", - fields: [] + fields: [{name: "mode", type: short}] }, { name: "Int", @@ -38,7 +38,7 @@ }, { name: "FloatingPoint", - fields: [{name: precision, type: int}] + fields: [{name: precision, type: short}] }, { name: "Utf8", diff --git a/java/vector/src/main/codegen/templates/ArrowType.java b/java/vector/src/main/codegen/templates/ArrowType.java index 6dfaf216ad042..29dee20040a53 100644 --- a/java/vector/src/main/codegen/templates/ArrowType.java +++ b/java/vector/src/main/codegen/templates/ArrowType.java @@ -24,9 +24,8 @@ <@pp.dropOutputFile /> <@pp.changeOutputFile name="/org/apache/arrow/vector/types/pojo/ArrowType.java" /> - - <#include "/@includes/license.ftl" /> + package org.apache.arrow.vector.types.pojo; import com.google.flatbuffers.FlatBufferBuilder; @@ -38,7 +37,13 @@ public abstract class ArrowType { public abstract byte getTypeType(); public abstract int getType(FlatBufferBuilder builder); + public abstract T accept(ArrowTypeVisitor visitor); + public static interface ArrowTypeVisitor { + <#list arrowTypes.types as type> + T visit(${type.name} type); + + } <#list arrowTypes.types as type> <#assign name = type.name> @@ -70,9 +75,14 @@ public byte getTypeType() { @Override public int getType(FlatBufferBuilder builder) { + <#list type.fields as field> + <#if field.type == "String"> + int ${field.name} = builder.createString(this.${field.name}); + + org.apache.arrow.flatbuf.${type.name}.start${type.name}(builder); <#list type.fields as field> - org.apache.arrow.flatbuf.${type.name}.add${field.name?cap_first}(builder, <#if field.type == "String">builder.createString(${field.name})<#else>${field.name}); + org.apache.arrow.flatbuf.${type.name}.add${field.name?cap_first}(builder, ${field.name}); return org.apache.arrow.flatbuf.${type.name}.end${type.name}(builder); } @@ -83,6 +93,14 @@ public int getType(FlatBufferBuilder builder) { } + public String toString() { + return "${name}{" + <#list fields as field> + + ", " + ${field.name} + + + "}"; + } + @Override public int hashCode() { return Objects.hash(<#list type.fields as field>${field.name}<#if field_has_next>, ); @@ -102,6 +120,11 @@ public boolean equals(Object obj) { } + + @Override + public T accept(ArrowTypeVisitor 
visitor) { + return visitor.visit(this); + } } diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index df508979c48b5..6b1aa040a5ba2 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -29,6 +29,9 @@ package org.apache.arrow.vector; +import org.apache.arrow.vector.schema.ArrowFieldNode; +import java.util.Collections; + <#include "/@includes/vv_imports.ftl" /> /** @@ -39,7 +42,7 @@ * NB: this class is automatically generated from ${.template_name} and ValueVectorTypes.tdd using FreeMarker. */ @SuppressWarnings("unused") -public final class ${className} extends BaseDataValueVector implements <#if type.major == "VarLen">VariableWidth<#else>FixedWidthVector, NullableVector{ +public final class ${className} extends BaseDataValueVector implements <#if type.major == "VarLen">VariableWidth<#else>FixedWidthVector, NullableVector, FieldVector { private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(${className}.class); private final FieldReader reader = new ${minor.class}ReaderImpl(Nullable${minor.class}Vector.this); @@ -54,6 +57,8 @@ public final class ${className} extends BaseDataValueVector implements <#if type private final Mutator mutator; private final Accessor accessor; + private final List innerVectors; + <#if minor.class == "Decimal"> private final int precision; private final int scale; @@ -66,6 +71,10 @@ public final class ${className} extends BaseDataValueVector implements <#if type mutator = new Mutator(); accessor = new Accessor(); field = new Field(name, true, new Decimal(precision, scale), null); + innerVectors = Collections.unmodifiableList(Arrays.asList( + bits, + values + )); } <#else> public ${className}(String name, BufferAllocator allocator) { @@ -88,9 +97,9 @@ public final class ${className} extends BaseDataValueVector implements <#if type <#elseif minor.class == "Time"> field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Time(), null); <#elseif minor.class == "Float4"> - field = new Field(name, true, new FloatingPoint(0), null); + field = new Field(name, true, new FloatingPoint(org.apache.arrow.flatbuf.Precision.SINGLE), null); <#elseif minor.class == "Float8"> - field = new Field(name, true, new FloatingPoint(1), null); + field = new Field(name, true, new FloatingPoint(org.apache.arrow.flatbuf.Precision.DOUBLE), null); <#elseif minor.class == "TimeStamp"> field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(""), null); <#elseif minor.class == "IntervalDay"> @@ -104,9 +113,43 @@ public final class ${className} extends BaseDataValueVector implements <#if type <#elseif minor.class == "Bit"> field = new Field(name, true, new Bool(), null); + innerVectors = Collections.unmodifiableList(Arrays.asList( + bits, + <#if type.major = "VarLen"> + values.offsetVector, + + values + )); } + @Override + public List getFieldInnerVectors() { + return innerVectors; + } + + @Override + public void initializeChildrenFromFields(List children) { + if (!children.isEmpty()) { + throw new IllegalArgumentException("primitive type vector ${className} can not have children: " + children); + } + } + + @Override + public List getChildrenFromFields() { + return Collections.emptyList(); + } + + @Override + public void loadFieldBuffers(ArrowFieldNode fieldNode, List ownBuffers) { + 
org.apache.arrow.vector.BaseDataValueVector.load(getFieldInnerVectors(), ownBuffers); + // TODO: do something with the sizes in fieldNode? + } + + public List getFieldBuffers() { + return org.apache.arrow.vector.BaseDataValueVector.unload(getFieldInnerVectors()); + } + @Override public Field getField() { return field; diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java index 1fef490d4ec3c..72125fa50fb82 100644 --- a/java/vector/src/main/codegen/templates/UnionVector.java +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -42,6 +42,10 @@ import java.util.Iterator; import org.apache.arrow.vector.complex.impl.ComplexCopier; import org.apache.arrow.vector.util.CallBack; +import org.apache.arrow.vector.schema.ArrowFieldNode; + +import static org.apache.arrow.flatbuf.UnionMode.Sparse; + /* * This class is generated using freemarker and the ${.template_name} template. @@ -57,7 +61,7 @@ * For performance reasons, UnionVector stores a cached reference to each subtype vector, to avoid having to do the map lookup * each time the vector is accessed. */ -public class UnionVector implements ValueVector { +public class UnionVector implements FieldVector { private String name; private BufferAllocator allocator; @@ -95,6 +99,34 @@ public MinorType getMinorType() { return MinorType.UNION; } + @Override + public void initializeChildrenFromFields(List children) { + getMap().initializeChildrenFromFields(children); + } + + @Override + public List getChildrenFromFields() { + return getMap().getChildrenFromFields(); + } + + @Override + public void loadFieldBuffers(ArrowFieldNode fieldNode, List ownBuffers) { + // TODO + throw new UnsupportedOperationException(); + } + + @Override + public List getFieldBuffers() { + // TODO + throw new UnsupportedOperationException(); + } + + @Override + public List getFieldInnerVectors() { + // TODO + throw new UnsupportedOperationException(); + } + public MapVector getMap() { if (mapVector == null) { int vectorCount = internalMap.size(); @@ -203,7 +235,7 @@ public Field getField() { for (ValueVector v : internalMap.getChildren()) { childFields.add(v.getField()); } - return new Field(name, true, new ArrowType.Union(), childFields); + return new Field(name, true, new ArrowType.Union(Sparse), childFields); } @Override @@ -237,10 +269,10 @@ public void copyFromSafe(int inIndex, int outIndex, UnionVector from) { copyFrom(inIndex, outIndex, from); } - public ValueVector addVector(ValueVector v) { + public FieldVector addVector(FieldVector v) { String name = v.getMinorType().name().toLowerCase(); Preconditions.checkState(internalMap.getChild(name) == null, String.format("%s vector already exists", name)); - final ValueVector newVector = internalMap.addOrGet(name, v.getMinorType(), v.getClass()); + final FieldVector newVector = internalMap.addOrGet(name, v.getMinorType(), v.getClass()); v.makeTransferPair(newVector).transfer(); internalMap.putChild(name, newVector); if (callBack != null) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java index 05b7cf1006723..c22258d42651b 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java @@ -17,15 +17,38 @@ */ package org.apache.arrow.vector; -import io.netty.buffer.ArrowBuf; +import java.util.ArrayList; +import java.util.List; import 
org.apache.arrow.memory.BufferAllocator; +import io.netty.buffer.ArrowBuf; + -public abstract class BaseDataValueVector extends BaseValueVector { +public abstract class BaseDataValueVector extends BaseValueVector implements BufferBacked { protected final static byte[] emptyByteArray = new byte[]{}; // Nullable vectors use this + public static void load(List vectors, List buffers) { + int expectedSize = vectors.size(); + if (buffers.size() != expectedSize) { + throw new IllegalArgumentException("Illegal buffer count, expected " + expectedSize + ", got: " + buffers.size()); + } + for (int i = 0; i < expectedSize; i++) { + vectors.get(i).load(buffers.get(i)); + } + } + + public static List unload(List vectors) { + List result = new ArrayList<>(vectors.size()); + for (BufferBacked vector : vectors) { + result.add(vector.unLoad()); + } + return result; + } + + // TODO: Nullable vectors extend BaseDataValueVector but do not use the data field + // We should fix the inheritance tree protected ArrowBuf data; public BaseDataValueVector(String name, BufferAllocator allocator) { @@ -82,6 +105,17 @@ public ArrowBuf getBuffer() { return data; } + @Override + public void load(ArrowBuf data) { + this.data.release(); + this.data = data.retain(allocator); + } + + @Override + public ArrowBuf unLoad() { + return this.data.readerIndex(0); + } + /** * This method has a similar effect of allocateNew() without actually clearing and reallocating * the value vector. The purpose is to move the value vector to a "mutate" state diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BufferBacked.java b/java/vector/src/main/java/org/apache/arrow/vector/BufferBacked.java new file mode 100644 index 0000000000000..d1c262d226556 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/BufferBacked.java @@ -0,0 +1,31 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector; + +import io.netty.buffer.ArrowBuf; + +/** + * Content is backed by a buffer and can be loaded/unloaded + */ +public interface BufferBacked { + + void load(ArrowBuf data); + + ArrowBuf unLoad(); + +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/FieldVector.java b/java/vector/src/main/java/org/apache/arrow/vector/FieldVector.java new file mode 100644 index 0000000000000..b28433cfd0d94 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/FieldVector.java @@ -0,0 +1,65 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
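The two static helpers above, load and unload, define the buffer-exchange contract the rest of this patch builds on: each inner vector is BufferBacked, and a list of inner vectors is zipped with an equally sized, same-order list of buffers. A minimal sketch of moving buffers between two vectors of the same type under that contract (the helper name transferBuffers is illustrative, not part of the patch):

    import java.util.List;

    import org.apache.arrow.vector.BaseDataValueVector;
    import org.apache.arrow.vector.FieldVector;

    import io.netty.buffer.ArrowBuf;

    class BufferRoundTrip {
      // Moves the inner buffers (validity, offsets, data, ...) of one vector
      // into another vector of the same type; count and order must match.
      static void transferBuffers(FieldVector from, FieldVector to) {
        List<ArrowBuf> buffers = BaseDataValueVector.unload(from.getFieldInnerVectors());
        BaseDataValueVector.load(to.getFieldInnerVectors(), buffers);
      }
    }

Whether reference counts need further adjustment depends on the allocator; load above releases each vector's old buffer and retains the new one.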
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector; + +import java.util.List; + +import org.apache.arrow.vector.schema.ArrowFieldNode; +import org.apache.arrow.vector.types.pojo.Field; + +import io.netty.buffer.ArrowBuf; + +/** + * A vector corresponding to a Field in the schema + * It has inner vectors backed by buffers (validity, offsets, data, ...) + */ +public interface FieldVector extends ValueVector { + + /** + * Initializes the child vectors + * to be later loaded with loadBuffers + * @param children the schema + */ + void initializeChildrenFromFields(List children); + + /** + * the returned list is the same size as the list passed to initializeChildrenFromFields + * @return the children according to schema (empty for primitive types) + */ + List getChildrenFromFields(); + + /** + * loads data in the vectors + * (ownBuffers must be the same size as getFieldVectors()) + * @param fieldNode the fieldNode + * @param ownBuffers the buffers for this Field (own buffers only, children not included) + */ + void loadFieldBuffers(ArrowFieldNode fieldNode, List ownBuffers); + + /** + * (same size as getFieldVectors() since it is their content) + * @return the buffers containing the data for this vector (ready for reading) + */ + List getFieldBuffers(); + + /** + * @return the inner vectors for this field as defined by the TypeLayout + */ + List getFieldInnerVectors(); + +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java index 35321c947db0b..ba7790e47ef95 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java @@ -19,14 +19,14 @@ import java.io.Closeable; -import io.netty.buffer.ArrowBuf; - import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.OutOfMemoryException; import org.apache.arrow.vector.complex.reader.FieldReader; import org.apache.arrow.vector.types.Types.MinorType; -import org.apache.arrow.vector.util.TransferPair; import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.util.TransferPair; + +import io.netty.buffer.ArrowBuf; /** * An abstraction that is used to store a sequence of values in an individual column. diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java b/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java new file mode 100644 index 0000000000000..58ac68b82825d --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java @@ -0,0 +1,99 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
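FieldVector makes the whole vector tree navigable in schema order, which is exactly what the loader and unloader below rely on. A small sketch of a depth-first walk over such a tree (the class and method are illustrative only):

    import org.apache.arrow.vector.FieldVector;

    class VectorTreeWalk {
      // Visits a vector, then its children, mirroring the order in which
      // an Arrow record batch lays out its field nodes.
      static void visit(FieldVector vector) {
        System.out.println(vector.getField().getName());
        for (FieldVector child : vector.getChildrenFromFields()) {
          visit(child);
        }
      }
    }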
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector; + +import static com.google.common.base.Preconditions.checkArgument; + +import java.util.ArrayList; +import java.util.Iterator; +import java.util.List; + +import org.apache.arrow.vector.schema.ArrowFieldNode; +import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.schema.VectorLayout; +import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.Schema; + +import com.google.common.collect.Iterators; + +import io.netty.buffer.ArrowBuf; + +/** + * Loads buffers into vectors + */ +public class VectorLoader { + private final List fieldVectors; + private final List fields; + + /** + * Will create children in root based on the schema + * @param schema the expected schema + * @param root the root to add vectors to based on schema + */ + public VectorLoader(Schema schema, FieldVector root) { + super(); + this.fields = schema.getFields(); + root.initializeChildrenFromFields(fields); + this.fieldVectors = root.getChildrenFromFields(); + if (this.fieldVectors.size() != fields.size()) { + throw new IllegalArgumentException("The root vector did not create the right number of children. Found " + fieldVectors.size() + ", expected " + fields.size()); + } + } + + /** + * Loads the record batch into the vectors; + * will not close the record batch + * @param recordBatch + */ + public void load(ArrowRecordBatch recordBatch) { + Iterator buffers = recordBatch.getBuffers().iterator(); + Iterator nodes = recordBatch.getNodes().iterator(); + for (int i = 0; i < fields.size(); ++i) { + Field field = fields.get(i); + FieldVector fieldVector = fieldVectors.get(i); + loadBuffers(fieldVector, field, buffers, nodes); + } + if (nodes.hasNext() || buffers.hasNext()) { + throw new IllegalArgumentException("not all nodes and buffers were consumed. 
nodes: " + Iterators.toString(nodes) + " buffers: " + Iterators.toString(buffers)); + } + } + + private void loadBuffers(FieldVector vector, Field field, Iterator buffers, Iterator nodes) { + ArrowFieldNode fieldNode = nodes.next(); + List typeLayout = field.getTypeLayout().getVectors(); + List ownBuffers = new ArrayList<>(typeLayout.size()); + for (int j = 0; j < typeLayout.size(); j++) { + ownBuffers.add(buffers.next()); + } + try { + vector.loadFieldBuffers(fieldNode, ownBuffers); + } catch (RuntimeException e) { + throw new IllegalArgumentException("Could not load buffers for field " + field); + } + List children = field.getChildren(); + if (children.size() > 0) { + List childrenFromFields = vector.getChildrenFromFields(); + checkArgument(children.size() == childrenFromFields.size(), "should have as many children as in the schema: found " + childrenFromFields.size() + " expected " + children.size()); + for (int i = 0; i < childrenFromFields.size(); i++) { + Field child = children.get(i); + FieldVector fieldVector = childrenFromFields.get(i); + loadBuffers(fieldVector, child, buffers, nodes); + } + } + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java b/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java new file mode 100644 index 0000000000000..e4d37bf47d114 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java @@ -0,0 +1,78 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector; + +import java.util.ArrayList; +import java.util.List; + +import org.apache.arrow.vector.ValueVector.Accessor; +import org.apache.arrow.vector.schema.ArrowFieldNode; +import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.schema.ArrowVectorType; +import org.apache.arrow.vector.types.pojo.Schema; + +import io.netty.buffer.ArrowBuf; + +public class VectorUnloader { + + private final Schema schema; + private final int valueCount; + private final List vectors; + + public VectorUnloader(FieldVector parent) { + super(); + this.schema = new Schema(parent.getField().getChildren()); + this.valueCount = parent.getAccessor().getValueCount(); + this.vectors = parent.getChildrenFromFields(); + } + + public Schema getSchema() { + return schema; + } + + public ArrowRecordBatch getRecordBatch() { + List nodes = new ArrayList<>(); + List buffers = new ArrayList<>(); + for (FieldVector vector : vectors) { + appendNodes(vector, nodes, buffers); + } + return new ArrowRecordBatch(valueCount, nodes, buffers); + } + + private void appendNodes(FieldVector vector, List nodes, List buffers) { + Accessor accessor = vector.getAccessor(); + int nullCount = 0; + // TODO: should not have to do that + // we can do that a lot more efficiently (for example with Long.bitCount(i)) + for (int i = 0; i < accessor.getValueCount(); i++) { + if (accessor.isNull(i)) { + nullCount ++; + } + } + nodes.add(new ArrowFieldNode(accessor.getValueCount(), nullCount)); + List fieldBuffers = vector.getFieldBuffers(); + List expectedBuffers = vector.getField().getTypeLayout().getVectorTypes(); + if (fieldBuffers.size() != expectedBuffers.size()) { + throw new IllegalArgumentException("wrong number of buffers for field " + vector.getField() + ". 
found: " + fieldBuffers); + } + buffers.addAll(fieldBuffers); + for (FieldVector child : vector.getChildrenFromFields()) { + appendNodes(child, nodes, buffers); + } + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java b/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java index 705a24b02fe78..c2482adefecfb 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java @@ -17,25 +17,23 @@ */ package org.apache.arrow.vector; -import com.google.flatbuffers.FlatBufferBuilder; -import io.netty.buffer.ArrowBuf; - import java.util.Collections; import java.util.Iterator; +import java.util.List; -import org.apache.arrow.flatbuf.Type; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.OutOfMemoryException; import org.apache.arrow.vector.complex.impl.NullReader; import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.schema.ArrowFieldNode; import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.ArrowType.Null; import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.util.TransferPair; -import com.google.common.collect.Iterators; +import io.netty.buffer.ArrowBuf; -public class ZeroVector implements ValueVector { +public class ZeroVector implements FieldVector { public final static ZeroVector INSTANCE = new ZeroVector(); private final String name = "[DEFAULT]"; @@ -175,4 +173,33 @@ public Mutator getMutator() { public FieldReader getReader() { return NullReader.INSTANCE; } + + @Override + public void initializeChildrenFromFields(List children) { + if (!children.isEmpty()) { + throw new IllegalArgumentException("Zero vector has no children"); + } + } + + @Override + public List getChildrenFromFields() { + return Collections.emptyList(); + } + + @Override + public void loadFieldBuffers(ArrowFieldNode fieldNode, List ownBuffers) { + if (!ownBuffers.isEmpty()) { + throw new IllegalArgumentException("Zero vector has no buffers"); + } + } + + @Override + public List getFieldBuffers() { + return Collections.emptyList(); + } + + @Override + public List getFieldInnerVectors() { + return Collections.emptyList(); + } } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java index ed7797576d679..2f68886a169b3 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java @@ -17,22 +17,13 @@ */ package org.apache.arrow.vector.complex; -import java.util.Collection; - -import javax.annotation.Nullable; - -import org.apache.arrow.flatbuf.Field; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.OutOfMemoryException; +import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.ValueVector; import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.util.CallBack; -import com.google.common.base.Function; -import com.google.common.base.Preconditions; -import com.google.common.collect.Iterables; -import com.google.common.collect.Sets; - /** * Base class for composite vectors. 
* @@ -65,8 +56,8 @@ public BufferAllocator getAllocator() { /** * Returns a {@link org.apache.arrow.vector.ValueVector} corresponding to the given field name if exists or null. */ - public ValueVector getChild(String name) { - return getChild(name, ValueVector.class); + public FieldVector getChild(String name) { + return getChild(name, FieldVector.class); } /** @@ -81,7 +72,7 @@ public void close() { protected T typeify(ValueVector v, Class clazz) { if (clazz.isAssignableFrom(v.getClass())) { - return (T) v; + return clazz.cast(v); } throw new IllegalStateException(String.format("Vector requested [%s] was different from the type stored [%s]. Arrow doesn't yet support heterogeneous types.", clazz.getSimpleName(), v.getClass().getSimpleName())); } @@ -94,10 +85,10 @@ protected boolean supportsDirectRead() { public abstract int size(); // add a new vector with the input MajorType or return the existing vector if we already added one with the same type - public abstract T addOrGet(String name, MinorType minorType, Class clazz, int... precisionScale); + public abstract T addOrGet(String name, MinorType minorType, Class clazz, int... precisionScale); // return the child vector with the input name - public abstract T getChild(String name, Class clazz); + public abstract T getChild(String name, Class clazz); // return the child vector's ordinal in the composite container public abstract VectorWithOrdinal getChildVectorWithOrdinal(String name); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java index 5964f80079141..23b4997f4f586 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java @@ -17,23 +17,24 @@ */ package org.apache.arrow.vector.complex; -import com.google.common.collect.ImmutableList; -import io.netty.buffer.ArrowBuf; - import java.util.ArrayList; +import java.util.Collections; import java.util.Iterator; import java.util.List; -import org.apache.arrow.flatbuf.Field; import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.ValueVector; import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.util.MapWithOrdinal; import com.google.common.base.Preconditions; +import com.google.common.collect.ImmutableList; import com.google.common.collect.Lists; +import io.netty.buffer.ArrowBuf; + /* * Base class for MapVectors. Currently used by RepeatedMapVector and MapVector */ @@ -41,7 +42,7 @@ public abstract class AbstractMapVector extends AbstractContainerVector { private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(AbstractContainerVector.class); // Maintains a map with key as field name and value is the vector itself - private final MapWithOrdinal vectors = new MapWithOrdinal<>(); + private final MapWithOrdinal vectors = new MapWithOrdinal<>(); protected AbstractMapVector(String name, BufferAllocator allocator, CallBack callBack) { super(name, allocator, callBack); @@ -109,19 +110,19 @@ public boolean allocateNewSafe() { * @return resultant {@link org.apache.arrow.vector.ValueVector} */ @Override - public T addOrGet(String name, MinorType minorType, Class clazz, int... 
precisionScale) { final ValueVector existing = getChild(name); boolean create = false; if (existing == null) { create = true; } else if (clazz.isAssignableFrom(existing.getClass())) { - return (T) existing; + return clazz.cast(existing); } else if (nullFilled(existing)) { existing.clear(); create = true; } if (create) { - final T vector = (T) minorType.getNewVector(name, allocator, callBack, precisionScale); + final T vector = clazz.cast(minorType.getNewVector(name, allocator, callBack, precisionScale)); putChild(name, vector); if (callBack!=null) { callBack.doWork(); @@ -153,7 +154,7 @@ public ValueVector getChildByOrdinal(int id) { * field name if exists or null. */ @Override - public T getChild(String name, Class clazz) { + public T getChild(String name, Class clazz) { final ValueVector v = vectors.get(name.toLowerCase()); if (v == null) { return null; @@ -161,12 +162,25 @@ public T getChild(String name, Class clazz) { return typeify(v, clazz); } + protected ValueVector add(String name, MinorType minorType, int... precisionScale) { + final ValueVector existing = getChild(name); + if (existing != null) { + throw new IllegalStateException(String.format("Vector already exists: Existing[%s], Requested[%s] ", existing.getClass().getSimpleName(), minorType)); + } + FieldVector vector = minorType.getNewVector(name, allocator, callBack, precisionScale); + putChild(name, vector); + if (callBack!=null) { + callBack.doWork(); + } + return vector; + } + /** * Inserts the vector with the given name if it does not exist else replaces it with the new value. * * Note that this method does not enforce any vector type check nor throws a schema change exception. */ - protected void putChild(String name, ValueVector vector) { + protected void putChild(String name, FieldVector vector) { putVector(name, vector); } @@ -175,7 +189,7 @@ protected void putChild(String name, ValueVector vector) { * @param name field name * @param vector vector to be inserted */ - protected void putVector(String name, ValueVector vector) { + protected void putVector(String name, FieldVector vector) { final ValueVector old = vectors.put( Preconditions.checkNotNull(name, "field name cannot be null").toLowerCase(), Preconditions.checkNotNull(vector, "vector cannot be null") @@ -189,9 +203,9 @@ protected void putVector(String name, ValueVector vector) { /** * Returns a sequence of underlying child vectors. 
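Two small points in the container changes above are worth spelling out: addOrGet is get-or-create (an existing child of the right type is returned rather than recreated), and the unchecked (T) casts were replaced by clazz.cast(...), which fails at the offending call site instead of deferring a heap-pollution surprise. In miniature:

    class CastStyles {
      // Before: compiles with an unchecked warning; a wrong T blows up later,
      // far away from the line that caused it.
      @SuppressWarnings("unchecked")
      static <T> T uncheckedCast(Object v) {
        return (T) v;
      }

      // After: throws ClassCastException right here if v is not a T.
      static <T> T checkedCast(Object v, Class<T> clazz) {
        return clazz.cast(v);
      }
    }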
*/ - protected List getChildren() { + protected List getChildren() { int size = vectors.size(); - List children = new ArrayList<>(); + List children = new ArrayList<>(); for (int i = 0; i < size; i++) { children.add(vectors.getByOrdinal(i)); } @@ -216,7 +230,7 @@ public int size() { @Override public Iterator iterator() { - return vectors.values().iterator(); + return Collections.unmodifiableCollection(vectors.values()).iterator(); } /** diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java index 42262741df93d..517d20c77a93c 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java @@ -17,8 +17,6 @@ */ package org.apache.arrow.vector.complex; -import io.netty.buffer.ArrowBuf; - import java.util.Collections; import java.util.Iterator; @@ -26,29 +24,32 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.AddOrGetResult; import org.apache.arrow.vector.BaseValueVector; +import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.UInt4Vector; import org.apache.arrow.vector.ValueVector; import org.apache.arrow.vector.ZeroVector; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.util.SchemaChangeRuntimeException; import com.google.common.base.Preconditions; import com.google.common.collect.ObjectArrays; -import org.apache.arrow.vector.types.Types.MinorType; -import org.apache.arrow.vector.util.SchemaChangeRuntimeException; + +import io.netty.buffer.ArrowBuf; public abstract class BaseRepeatedValueVector extends BaseValueVector implements RepeatedValueVector { - public final static ValueVector DEFAULT_DATA_VECTOR = ZeroVector.INSTANCE; + public final static FieldVector DEFAULT_DATA_VECTOR = ZeroVector.INSTANCE; public final static String OFFSETS_VECTOR_NAME = "$offsets$"; public final static String DATA_VECTOR_NAME = "$data$"; protected final UInt4Vector offsets; - protected ValueVector vector; + protected FieldVector vector; protected BaseRepeatedValueVector(String name, BufferAllocator allocator) { this(name, allocator, DEFAULT_DATA_VECTOR); } - protected BaseRepeatedValueVector(String name, BufferAllocator allocator, ValueVector vector) { + protected BaseRepeatedValueVector(String name, BufferAllocator allocator, FieldVector vector) { super(name, allocator); this.offsets = new UInt4Vector(OFFSETS_VECTOR_NAME, allocator); this.vector = Preconditions.checkNotNull(vector, "data vector cannot be null"); @@ -83,7 +84,7 @@ public UInt4Vector getOffsetVector() { } @Override - public ValueVector getDataVector() { + public FieldVector getDataVector() { return vector; } @@ -121,7 +122,7 @@ public int getBufferSizeFor(int valueCount) { @Override public Iterator iterator() { - return Collections.singleton(getDataVector()).iterator(); + return Collections.singleton(getDataVector()).iterator(); } @Override @@ -167,7 +168,7 @@ public AddOrGetResult addOrGetVector(MinorType minorT return new AddOrGetResult<>((T)vector, created); } - protected void replaceDataVector(ValueVector v) { + protected void replaceDataVector(FieldVector v) { vector.clear(); vector = v; } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java index c6c6b090db6b0..2984c362514fc 100644 --- 
a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java @@ -18,15 +18,18 @@ ******************************************************************************/ package org.apache.arrow.vector.complex; -import com.google.common.collect.ImmutableList; -import com.google.flatbuffers.FlatBufferBuilder; -import io.netty.buffer.ArrowBuf; +import static java.util.Collections.singletonList; +import java.util.Arrays; +import java.util.Collections; import java.util.List; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.OutOfMemoryException; import org.apache.arrow.vector.AddOrGetResult; +import org.apache.arrow.vector.BaseDataValueVector; +import org.apache.arrow.vector.BufferBacked; +import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.UInt1Vector; import org.apache.arrow.vector.UInt4Vector; import org.apache.arrow.vector.ValueVector; @@ -36,18 +39,24 @@ import org.apache.arrow.vector.complex.impl.UnionListWriter; import org.apache.arrow.vector.complex.reader.FieldReader; import org.apache.arrow.vector.complex.writer.FieldWriter; +import org.apache.arrow.vector.schema.ArrowFieldNode; +import org.apache.arrow.vector.types.Types; import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.util.JsonStringArrayList; import org.apache.arrow.vector.util.TransferPair; +import com.google.common.collect.ImmutableList; import com.google.common.collect.ObjectArrays; -public class ListVector extends BaseRepeatedValueVector { +import io.netty.buffer.ArrowBuf; + +public class ListVector extends BaseRepeatedValueVector implements FieldVector { - UInt4Vector offsets; + final UInt4Vector offsets; final UInt1Vector bits; + private final List innerVectors; private Mutator mutator = new Mutator(); private Accessor accessor = new Accessor(); private UnionListWriter writer; @@ -57,12 +66,46 @@ public class ListVector extends BaseRepeatedValueVector { public ListVector(String name, BufferAllocator allocator, CallBack callBack) { super(name, allocator); this.bits = new UInt1Vector("$bits$", allocator); - offsets = getOffsetVector(); + this.offsets = getOffsetVector(); + this.innerVectors = Collections.unmodifiableList(Arrays.asList(bits, offsets)); this.writer = new UnionListWriter(this); this.reader = new UnionListReader(this); this.callBack = callBack; } + @Override + public void initializeChildrenFromFields(List children) { + if (children.size() != 1) { + throw new IllegalArgumentException("Lists have only one child. 
Found: " + children); + } + Field field = children.get(0); + MinorType minorType = Types.getMinorTypeForArrowType(field.getType()); + AddOrGetResult addOrGetVector = addOrGetVector(minorType); + if (!addOrGetVector.isCreated()) { + throw new IllegalArgumentException("Child vector already existed: " + addOrGetVector.getVector()); + } + } + + @Override + public List getChildrenFromFields() { + return singletonList(getDataVector()); + } + + @Override + public void loadFieldBuffers(ArrowFieldNode fieldNode, List ownBuffers) { + BaseDataValueVector.load(getFieldInnerVectors(), ownBuffers); + } + + @Override + public List getFieldBuffers() { + return BaseDataValueVector.unload(getFieldInnerVectors()); + } + + @Override + public List getFieldInnerVectors() { + return innerVectors; + } + public UnionListWriter getWriter() { return writer; } @@ -86,7 +129,7 @@ public void copyFrom(int inIndex, int outIndex, ListVector from) { } @Override - public ValueVector getDataVector() { + public FieldVector getDataVector() { return vector; } @@ -298,4 +341,5 @@ public void setValueCount(int valueCount) { bits.getMutator().setValueCount(valueCount); } } + } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java index 0cb613e2f7acf..e3696588e6006 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java @@ -17,10 +17,10 @@ */ package org.apache.arrow.vector.complex; -import io.netty.buffer.ArrowBuf; - import java.util.ArrayList; +import java.util.Arrays; import java.util.Collection; +import java.util.Collections; import java.util.Iterator; import java.util.List; import java.util.Map; @@ -28,13 +28,17 @@ import javax.annotation.Nullable; import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.BaseDataValueVector; import org.apache.arrow.vector.BaseValueVector; +import org.apache.arrow.vector.BufferBacked; +import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.ValueVector; import org.apache.arrow.vector.complex.impl.SingleMapReaderImpl; import org.apache.arrow.vector.complex.reader.FieldReader; import org.apache.arrow.vector.holders.ComplexHolder; +import org.apache.arrow.vector.schema.ArrowFieldNode; +import org.apache.arrow.vector.types.Types; import org.apache.arrow.vector.types.Types.MinorType; -import org.apache.arrow.vector.types.pojo.ArrowType; import org.apache.arrow.vector.types.pojo.ArrowType.Tuple; import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.util.CallBack; @@ -45,7 +49,9 @@ import com.google.common.collect.Ordering; import com.google.common.primitives.Ints; -public class MapVector extends AbstractMapVector { +import io.netty.buffer.ArrowBuf; + +public class MapVector extends AbstractMapVector implements FieldVector { //private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(MapVector.class); private final SingleMapReaderImpl reader = new SingleMapReaderImpl(MapVector.this); @@ -53,6 +59,9 @@ public class MapVector extends AbstractMapVector { private final Mutator mutator = new Mutator(); int valueCount; + // TODO: validity vector + private final List innerVectors = Collections.unmodifiableList(Arrays.asList()); + public MapVector(String name, BufferAllocator allocator, CallBack callBack){ super(name, allocator, callBack); } @@ -120,7 +129,7 @@ public ArrowBuf[] getBuffers(boolean clear) { 
int expectedSize = getBufferSize(); int actualSize = super.getBufferSize(); - Preconditions.checkArgument(expectedSize == actualSize); + Preconditions.checkArgument(expectedSize == actualSize, expectedSize + " != " + actualSize); return super.getBuffers(clear); } @@ -159,7 +168,7 @@ protected MapTransferPair(MapVector from, MapVector to, boolean allocate) { this.to.ephPair = null; int i = 0; - ValueVector vector; + FieldVector vector; for (String child:from.getChildFieldNames()) { int preSize = to.size(); vector = from.getChild(child); @@ -175,7 +184,7 @@ protected MapTransferPair(MapVector from, MapVector to, boolean allocate) { // (This is similar to what happens in ScanBatch where the children cannot be added till they are // read). To take care of this, we ensure that the hashCode of the MaterializedField does not // include the hashCode of the children but is based only on MaterializedField$key. - final ValueVector newVector = to.addOrGet(child, vector.getMinorType(), vector.getClass()); + final FieldVector newVector = to.addOrGet(child, vector.getMinorType(), vector.getClass()); if (allocate && to.size() != preSize) { newVector.allocateNew(); } @@ -315,13 +324,45 @@ public MinorType getMinorType() { @Override public void close() { - final Collection vectors = getChildren(); - for (final ValueVector v : vectors) { + final Collection vectors = getChildren(); + for (final FieldVector v : vectors) { v.close(); } vectors.clear(); + valueCount = 0; super.close(); } + + @Override + public void initializeChildrenFromFields(List children) { + for (Field field : children) { + MinorType minorType = Types.getMinorTypeForArrowType(field.getType()); + FieldVector vector = (FieldVector)this.add(field.getName(), minorType); + vector.initializeChildrenFromFields(field.getChildren()); + } + } + + @Override + public List getChildrenFromFields() { + return getChildren(); + } + + @Override + public void loadFieldBuffers(ArrowFieldNode fieldNode, List ownBuffers) { + BaseDataValueVector.load(getFieldInnerVectors(), ownBuffers); + // TODO: something with fieldNode? 
+ } + + @Override + public List getFieldBuffers() { + return BaseDataValueVector.unload(getFieldInnerVectors()); + } + + @Override + public List getFieldInnerVectors() { + return innerVectors; + } + } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java index 4d2adfb32561e..89bfefc8f19e3 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java @@ -22,9 +22,9 @@ import org.apache.arrow.vector.complex.StateTool; import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter; import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.Field; import com.google.common.base.Preconditions; -import org.apache.arrow.vector.types.pojo.Field; public class ComplexWriterImpl extends AbstractFieldWriter implements ComplexWriter { // private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ComplexWriterImpl.class); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java index 586b1283fe879..c282688530b87 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java @@ -17,6 +17,7 @@ */ package org.apache.arrow.vector.complex.impl; +import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.ValueVector; import org.apache.arrow.vector.ZeroVector; import org.apache.arrow.vector.complex.AbstractMapVector; @@ -129,7 +130,7 @@ private FieldWriter promoteToUnion() { } else if (listVector != null) { unionVector = listVector.promoteToUnion(); } - unionVector.addVector(tp.getTo()); + unionVector.addVector((FieldVector)tp.getTo()); writer = new UnionWriter(unionVector); writer.setPosition(idx()); for (int i = 0; i < idx(); i++) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowBlock.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowBlock.java new file mode 100644 index 0000000000000..90fb02b059707 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowBlock.java @@ -0,0 +1,82 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector.file; + +import org.apache.arrow.flatbuf.Block; +import org.apache.arrow.vector.schema.FBSerializable; + +import com.google.flatbuffers.FlatBufferBuilder; + +public class ArrowBlock implements FBSerializable { + + private final long offset; + private final int metadataLength; + private final long bodyLength; + + public ArrowBlock(long offset, int metadataLength, long bodyLength) { + super(); + this.offset = offset; + this.metadataLength = metadataLength; + this.bodyLength = bodyLength; + } + + public long getOffset() { + return offset; + } + + public int getMetadataLength() { + return metadataLength; + } + + public long getBodyLength() { + return bodyLength; + } + + @Override + public int writeTo(FlatBufferBuilder builder) { + return Block.createBlock(builder, offset, metadataLength, bodyLength); + } + + @Override + public int hashCode() { + final int prime = 31; + int result = 1; + result = prime * result + (int) (bodyLength ^ (bodyLength >>> 32)); + result = prime * result + metadataLength; + result = prime * result + (int) (offset ^ (offset >>> 32)); + return result; + } + + @Override + public boolean equals(Object obj) { + if (this == obj) + return true; + if (obj == null) + return false; + if (getClass() != obj.getClass()) + return false; + ArrowBlock other = (ArrowBlock) obj; + if (bodyLength != other.bodyLength) + return false; + if (metadataLength != other.metadataLength) + return false; + if (offset != other.offset) + return false; + return true; + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFooter.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFooter.java new file mode 100644 index 0000000000000..01e175b31b8db --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFooter.java @@ -0,0 +1,144 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
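An ArrowBlock is the footer's pointer to one record batch: a file offset, the length of the flatbuffer metadata, and the length of the body that follows it. The reader below uses exactly this arithmetic to seek (a sketch; the helper class is illustrative):

    import org.apache.arrow.vector.file.ArrowBlock;

    class BlockMath {
      // The batch body starts immediately after its metadata; both lengths
      // come straight from the footer block.
      static long bodyStart(ArrowBlock block) {
        return block.getOffset() + block.getMetadataLength();
      }
    }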
+ */ +package org.apache.arrow.vector.file; + +import java.util.ArrayList; +import java.util.List; + +import org.apache.arrow.flatbuf.Block; +import org.apache.arrow.flatbuf.Footer; +import org.apache.arrow.vector.schema.FBSerializable; +import org.apache.arrow.vector.types.pojo.Schema; + +import com.google.flatbuffers.FlatBufferBuilder; + +public class ArrowFooter implements FBSerializable { + + private final Schema schema; + + private final List dictionaries; + + private final List recordBatches; + + public ArrowFooter(Schema schema, List dictionaries, List recordBatches) { + super(); + this.schema = schema; + this.dictionaries = dictionaries; + this.recordBatches = recordBatches; + } + + public ArrowFooter(Footer footer) { + this( + Schema.convertSchema(footer.schema()), + dictionaries(footer), + recordBatches(footer) + ); + } + + private static List recordBatches(Footer footer) { + List recordBatches = new ArrayList<>(); + Block tempBLock = new Block(); + int recordBatchesLength = footer.recordBatchesLength(); + for (int i = 0; i < recordBatchesLength; i++) { + Block block = footer.recordBatches(tempBLock, i); + recordBatches.add(new ArrowBlock(block.offset(), block.metaDataLength(), block.bodyLength())); + } + return recordBatches; + } + + private static List dictionaries(Footer footer) { + List dictionaries = new ArrayList<>(); + Block tempBLock = new Block(); + int dictionariesLength = footer.dictionariesLength(); + for (int i = 0; i < dictionariesLength; i++) { + Block block = footer.dictionaries(tempBLock, i); + dictionaries.add(new ArrowBlock(block.offset(), block.metaDataLength(), block.bodyLength())); + } + return dictionaries; + } + + public Schema getSchema() { + return schema; + } + + public List getDictionaries() { + return dictionaries; + } + + public List getRecordBatches() { + return recordBatches; + } + + @Override + public int writeTo(FlatBufferBuilder builder) { + int schemaIndex = schema.getSchema(builder); + Footer.startDictionariesVector(builder, dictionaries.size()); + int dicsOffset = endVector(builder, dictionaries); + Footer.startRecordBatchesVector(builder, recordBatches.size()); + int rbsOffset = endVector(builder, recordBatches); + Footer.startFooter(builder); + Footer.addSchema(builder, schemaIndex); + Footer.addDictionaries(builder, dicsOffset); + Footer.addRecordBatches(builder, rbsOffset); + return Footer.endFooter(builder); + } + + private int endVector(FlatBufferBuilder builder, List blocks) { + for (ArrowBlock block : blocks) { + block.writeTo(builder); + } + return builder.endVector(); + } + + @Override + public int hashCode() { + final int prime = 31; + int result = 1; + result = prime * result + ((dictionaries == null) ? 0 : dictionaries.hashCode()); + result = prime * result + ((recordBatches == null) ? 0 : recordBatches.hashCode()); + result = prime * result + ((schema == null) ? 
0 : schema.hashCode()); + return result; + } + + @Override + public boolean equals(Object obj) { + if (this == obj) + return true; + if (obj == null) + return false; + if (getClass() != obj.getClass()) + return false; + ArrowFooter other = (ArrowFooter) obj; + if (dictionaries == null) { + if (other.dictionaries != null) + return false; + } else if (!dictionaries.equals(other.dictionaries)) + return false; + if (recordBatches == null) { + if (other.recordBatches != null) + return false; + } else if (!recordBatches.equals(other.recordBatches)) + return false; + if (schema == null) { + if (other.schema != null) + return false; + } else if (!schema.equals(other.schema)) + return false; + return true; + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java new file mode 100644 index 0000000000000..bbcd3e9f470e3 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java @@ -0,0 +1,151 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
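The footer that ArrowFooter models is what readFooter() below reconstructs from the file trailer. The trailer layout it assumes is: the footer flatbuffer, then a 4-byte little-endian footer length, then the 6-byte "ARROW1" magic ending the file. The offset arithmetic, spelled out as a sketch:

    class FooterMath {
      // Given the file size, the trailing-magic length, and the footer length
      // read from the int32 slot just before the magic, locate the footer.
      static long footerOffset(long fileSize, int magicLength, int footerLength) {
        long footerLengthOffset = fileSize - magicLength - 4; // the int32 slot
        return footerLengthOffset - footerLength;             // footer start
      }
    }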
+ */ +package org.apache.arrow.vector.file; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.nio.channels.SeekableByteChannel; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; + +import org.apache.arrow.flatbuf.Buffer; +import org.apache.arrow.flatbuf.FieldNode; +import org.apache.arrow.flatbuf.Footer; +import org.apache.arrow.flatbuf.RecordBatch; +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.schema.ArrowFieldNode; +import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import io.netty.buffer.ArrowBuf; + +public class ArrowReader implements AutoCloseable { + private static final Logger LOGGER = LoggerFactory.getLogger(ArrowReader.class); + + private static final byte[] MAGIC = "ARROW1".getBytes(); + + private final SeekableByteChannel in; + + private final BufferAllocator allocator; + + private ArrowFooter footer; + + public ArrowReader(SeekableByteChannel in, BufferAllocator allocator) { + super(); + this.in = in; + this.allocator = allocator; + } + + private int readFully(ArrowBuf buffer, int l) throws IOException { + int n = readFully(buffer.nioBuffer(buffer.writerIndex(), l)); + buffer.writerIndex(n); + if (n != l) { + throw new IllegalStateException(n + " != " + l); + } + return n; + } + + private int readFully(ByteBuffer buffer) throws IOException { + int total = 0; + int n; + do { + n = in.read(buffer); + total += n; + } while (n >= 0 && buffer.remaining() > 0); + buffer.flip(); + return total; + } + + private static int bytesToInt(byte[] bytes) { + return ((int)(bytes[3] & 255) << 24) + + ((int)(bytes[2] & 255) << 16) + + ((int)(bytes[1] & 255) << 8) + + ((int)(bytes[0] & 255) << 0); + } + + public ArrowFooter readFooter() throws IOException { + if (footer == null) { + if (in.size() <= (MAGIC.length * 2 + 4)) { + throw new InvalidArrowFileException("file too small: " + in.size()); + } + ByteBuffer buffer = ByteBuffer.allocate(4 + MAGIC.length); + long footerLengthOffset = in.size() - buffer.remaining(); + in.position(footerLengthOffset); + readFully(buffer); + byte[] array = buffer.array(); + if (!Arrays.equals(MAGIC, Arrays.copyOfRange(array, 4, array.length))) { + throw new InvalidArrowFileException("missing Magic number " + Arrays.toString(buffer.array())); + } + int footerLength = bytesToInt(array); + if (footerLength <= 0 || footerLength + MAGIC.length * 2 + 4 > in.size()) { + throw new InvalidArrowFileException("invalid footer length: " + footerLength); + } + long footerOffset = footerLengthOffset - footerLength; + LOGGER.debug(String.format("Footer starts at %d, length: %d", footerOffset, footerLength)); + ByteBuffer footerBuffer = ByteBuffer.allocate(footerLength); + in.position(footerOffset); + readFully(footerBuffer); + Footer footerFB = Footer.getRootAsFooter(footerBuffer); + this.footer = new ArrowFooter(footerFB); + } + return footer; + } + + // TODO: read dictionaries + + public ArrowRecordBatch readRecordBatch(ArrowBlock recordBatchBlock) throws IOException { + LOGGER.debug(String.format("RecordBatch at %d, metadata: %d, body: %d", recordBatchBlock.getOffset(), recordBatchBlock.getMetadataLength(), recordBatchBlock.getBodyLength())); + int l = (int)(recordBatchBlock.getMetadataLength() + recordBatchBlock.getBodyLength()); + if (l < 0) { + throw new InvalidArrowFileException("block invalid: " + recordBatchBlock); + } + final ArrowBuf buffer = allocator.buffer(l); + LOGGER.debug("allocated buffer " + buffer); + 
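+    // The whole block (metadata + body) is read into this single buffer; the
+    // flatbuffer metadata is parsed from the front, and the body is sliced
+    // back out at getMetadataLength() below.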
+    in.position(recordBatchBlock.getOffset());
+    int n = readFully(buffer, l);
+    if (n != l) {
+      throw new IllegalStateException(n + " != " + l);
+    }
+    RecordBatch recordBatchFB = RecordBatch.getRootAsRecordBatch(buffer.nioBuffer().asReadOnlyBuffer());
+    int nodesLength = recordBatchFB.nodesLength();
+    final ArrowBuf body = buffer.slice(recordBatchBlock.getMetadataLength(), (int)recordBatchBlock.getBodyLength());
+    List<ArrowFieldNode> nodes = new ArrayList<>();
+    for (int i = 0; i < nodesLength; ++i) {
+      FieldNode node = recordBatchFB.nodes(i);
+      nodes.add(new ArrowFieldNode(node.length(), node.nullCount()));
+    }
+    List<ArrowBuf> buffers = new ArrayList<>();
+    for (int i = 0; i < recordBatchFB.buffersLength(); ++i) {
+      Buffer bufferFB = recordBatchFB.buffers(i);
+      LOGGER.debug(String.format("Buffer in RecordBatch at %d, length: %d", bufferFB.offset(), bufferFB.length()));
+      ArrowBuf vectorBuffer = body.slice((int)bufferFB.offset(), (int)bufferFB.length());
+      buffers.add(vectorBuffer);
+    }
+    ArrowRecordBatch arrowRecordBatch = new ArrowRecordBatch(recordBatchFB.length(), nodes, buffers);
+    LOGGER.debug("released buffer " + buffer);
+    buffer.release();
+    return arrowRecordBatch;
+  }
+
+  public void close() throws IOException {
+    in.close();
+  }
+
+}
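For orientation, a minimal read-side sketch showing how the reader is meant to be driven (not part of the patch; the file name and `allocator` are illustrative, and the `FileChannel` returned by `FileInputStream.getChannel()` satisfies the `SeekableByteChannel` the constructor expects):

    try (FileInputStream in = new FileInputStream("example.arrow");
         ArrowReader reader = new ArrowReader(in.getChannel(), allocator)) {
      ArrowFooter footer = reader.readFooter();
      for (ArrowBlock block : footer.getRecordBatches()) {
        try (ArrowRecordBatch batch = reader.readRecordBatch(block)) {
          // hand the batch to a VectorLoader here
        }
      }
    }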
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java
new file mode 100644
index 0000000000000..9881a229c23ea
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java
@@ -0,0 +1,179 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector.file;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.channels.WritableByteChannel;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.arrow.vector.schema.ArrowBuffer;
+import org.apache.arrow.vector.schema.ArrowRecordBatch;
+import org.apache.arrow.vector.schema.FBSerializable;
+import org.apache.arrow.vector.types.pojo.Schema;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.google.flatbuffers.FlatBufferBuilder;
+
+import io.netty.buffer.ArrowBuf;
+
+public class ArrowWriter implements AutoCloseable {
+  private static final Logger LOGGER = LoggerFactory.getLogger(ArrowWriter.class);
+
+  private static final byte[] MAGIC = "ARROW1".getBytes();
+
+  private final WritableByteChannel out;
+
+  private final Schema schema;
+
+  private final List<ArrowBlock> recordBatches = new ArrayList<>();
+
+  private long currentPosition = 0;
+
+  private boolean started = false;
+
+  public ArrowWriter(WritableByteChannel out, Schema schema) {
+    this.out = out;
+    this.schema = schema;
+  }
+
+  private void start() throws IOException {
+    writeMagic();
+  }
+
+  private long write(byte[] buffer) throws IOException {
+    return write(ByteBuffer.wrap(buffer));
+  }
+
+  private long writeZeros(int zeroCount) throws IOException {
+    return write(new byte[zeroCount]);
+  }
+
+  private long align() throws IOException {
+    if (currentPosition % 8 != 0) { // align on 8 byte boundaries
+      return writeZeros(8 - (int)(currentPosition % 8));
+    }
+    return 0;
+  }
+
+  private long write(ByteBuffer buffer) throws IOException {
+    long length = buffer.remaining();
+    out.write(buffer);
+    currentPosition += length;
+    return length;
+  }
+
+  /** serializes an int to 4 little-endian bytes (the footer length encoding) */
+  private static byte[] intToBytes(int value) {
+    byte[] outBuffer = new byte[4];
+    outBuffer[3] = (byte)(value >>> 24);
+    outBuffer[2] = (byte)(value >>> 16);
+    outBuffer[1] = (byte)(value >>> 8);
+    outBuffer[0] = (byte)(value >>> 0);
+    return outBuffer;
+  }
+
+  private long writeIntLittleEndian(int v) throws IOException {
+    return write(intToBytes(v));
+  }
+
+  // TODO: write dictionaries
+
+  public void writeRecordBatch(ArrowRecordBatch recordBatch) throws IOException {
+    checkStarted();
+    align();
+    // write metadata header
+    long offset = currentPosition;
+    write(recordBatch);
+    align();
+    // write body
+    long bodyOffset = currentPosition;
+    List<ArrowBuf> buffers = recordBatch.getBuffers();
+    List<ArrowBuffer> buffersLayout = recordBatch.getBuffersLayout();
+    if (buffers.size() != buffersLayout.size()) {
+      throw new IllegalStateException("the layout does not match: " + buffers.size() + " != " + buffersLayout.size());
+    }
+    for (int i = 0; i < buffers.size(); i++) {
+      ArrowBuf buffer = buffers.get(i);
+      ArrowBuffer layout = buffersLayout.get(i);
+      long startPosition = bodyOffset + layout.getOffset();
+      if (startPosition != currentPosition) {
+        writeZeros((int)(startPosition - currentPosition));
+      }
+      write(buffer);
+      if (currentPosition != startPosition + layout.getSize()) {
+        throw new IllegalStateException("wrong buffer size: " + currentPosition + " != " + (startPosition + layout.getSize()));
+      }
+    }
+    int metadataLength = (int)(bodyOffset - offset);
+    if (metadataLength <= 0) {
+      throw new InvalidArrowFileException("invalid recordBatch");
+    }
+    long bodyLength = currentPosition - bodyOffset;
+    LOGGER.debug(String.format("RecordBatch at %d, metadata: %d, body: %d", offset, metadataLength, bodyLength));
+    // add metadata to footer
+    recordBatches.add(new ArrowBlock(offset, metadataLength,
bodyLength)); + } + + private void write(ArrowBuf buffer) throws IOException { + write(buffer.nioBuffer(buffer.readerIndex(), buffer.readableBytes())); + } + + private void checkStarted() throws IOException { + if (!started) { + started = true; + start(); + } + } + + public void close() throws IOException { + try { + long footerStart = currentPosition; + writeFooter(); + int footerLength = (int)(currentPosition - footerStart); + if (footerLength <= 0 ) { + throw new InvalidArrowFileException("invalid footer"); + } + writeIntLittleEndian(footerLength); + LOGGER.debug(String.format("Footer starts at %d, length: %d", footerStart, footerLength)); + writeMagic(); + } finally { + out.close(); + } + } + + private void writeMagic() throws IOException { + write(MAGIC); + LOGGER.debug(String.format("magic written, now at %d", currentPosition)); + } + + private void writeFooter() throws IOException { + // TODO: dictionaries + write(new ArrowFooter(schema, Collections.emptyList(), recordBatches)); + } + + private long write(FBSerializable writer) throws IOException { + FlatBufferBuilder builder = new FlatBufferBuilder(); + int root = writer.writeTo(builder); + builder.finish(root); + return write(builder.dataBuffer()); + } + +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/InvalidArrowFileException.java b/java/vector/src/main/java/org/apache/arrow/vector/file/InvalidArrowFileException.java new file mode 100644 index 0000000000000..3ec75dcb12a2b --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/InvalidArrowFileException.java @@ -0,0 +1,27 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.file; + +public class InvalidArrowFileException extends RuntimeException { + private static final long serialVersionUID = 1L; + + public InvalidArrowFileException(String message) { + super(message); + } + +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowBuffer.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowBuffer.java new file mode 100644 index 0000000000000..3aa3e52582b4f --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowBuffer.java @@ -0,0 +1,81 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.schema; + +import org.apache.arrow.flatbuf.Buffer; + +import com.google.flatbuffers.FlatBufferBuilder; + +public class ArrowBuffer implements FBSerializable { + + private int page; + private long offset; + private long size; + + public ArrowBuffer(int page, long offset, long size) { + super(); + this.page = page; + this.offset = offset; + this.size = size; + } + + public int getPage() { + return page; + } + + public long getOffset() { + return offset; + } + + public long getSize() { + return size; + } + + @Override + public int hashCode() { + final int prime = 31; + int result = 1; + result = prime * result + (int) (offset ^ (offset >>> 32)); + result = prime * result + page; + result = prime * result + (int) (size ^ (size >>> 32)); + return result; + } + + @Override + public boolean equals(Object obj) { + if (this == obj) + return true; + if (obj == null) + return false; + if (getClass() != obj.getClass()) + return false; + ArrowBuffer other = (ArrowBuffer) obj; + if (offset != other.offset) + return false; + if (page != other.page) + return false; + if (size != other.size) + return false; + return true; + } + + @Override + public int writeTo(FlatBufferBuilder builder) { + return Buffer.createBuffer(builder, page, offset, size); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowFieldNode.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowFieldNode.java new file mode 100644 index 0000000000000..71dd0abc6bcef --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowFieldNode.java @@ -0,0 +1,53 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector.schema; + +import org.apache.arrow.flatbuf.FieldNode; + +import com.google.flatbuffers.FlatBufferBuilder; + +public class ArrowFieldNode implements FBSerializable { + + private final int length; + private final int nullCount; + + public ArrowFieldNode(int length, int nullCount) { + super(); + this.length = length; + this.nullCount = nullCount; + } + + @Override + public int writeTo(FlatBufferBuilder builder) { + return FieldNode.createFieldNode(builder, length, nullCount); + } + + public int getNullCount() { + return nullCount; + } + + public int getLength() { + return length; + } + + @Override + public String toString() { + return "ArrowFieldNode [length=" + length + ", nullCount=" + nullCount + "]"; + } + +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowRecordBatch.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowRecordBatch.java new file mode 100644 index 0000000000000..9162efd29f864 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowRecordBatch.java @@ -0,0 +1,127 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */
+package org.apache.arrow.vector.schema;
+
+import static org.apache.arrow.vector.schema.FBSerializables.writeAllStructsToVector;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.arrow.flatbuf.RecordBatch;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.google.flatbuffers.FlatBufferBuilder;
+
+import io.netty.buffer.ArrowBuf;
+
+public class ArrowRecordBatch implements FBSerializable, AutoCloseable {
+  private static final Logger LOGGER = LoggerFactory.getLogger(ArrowRecordBatch.class);
+
+  /** number of records */
+  private final int length;
+
+  /** Nodes correspond to the pre-ordered flattened logical schema */
+  private final List<ArrowFieldNode> nodes;
+
+  private final List<ArrowBuf> buffers;
+
+  private final List<ArrowBuffer> buffersLayout;
+
+  private boolean closed = false;
+
+  /**
+   * @param length how many rows in this batch
+   * @param nodes field level info
+   * @param buffers will be retained until this recordBatch is closed
+   */
+  public ArrowRecordBatch(int length, List<ArrowFieldNode> nodes, List<ArrowBuf> buffers) {
+    super();
+    this.length = length;
+    this.nodes = nodes;
+    this.buffers = buffers;
+    List<ArrowBuffer> arrowBuffers = new ArrayList<>();
+    long offset = 0;
+    for (ArrowBuf arrowBuf : buffers) {
+      arrowBuf.retain();
+      long size = arrowBuf.readableBytes();
+      arrowBuffers.add(new ArrowBuffer(0, offset, size));
+      LOGGER.debug(String.format("Buffer in RecordBatch at %d, length: %d", offset, size));
+      offset += size;
+      if (offset % 8 != 0) { // align on 8 byte boundaries
+        offset += 8 - (offset % 8);
+      }
+    }
+    this.buffersLayout = Collections.unmodifiableList(arrowBuffers);
+  }
+
+  public int getLength() {
+    return length;
+  }
+
+  /**
+   * @return the FieldNodes corresponding to the schema
+   */
+  public List<ArrowFieldNode> getNodes() {
+    return nodes;
+  }
+
+  /**
+   * @return the buffers containing the data
+   */
+  public List<ArrowBuf> getBuffers() {
+    if (closed) {
+      throw new IllegalStateException("already closed");
+    }
+    return buffers;
+  }
+
+  /**
+   * @return the serialized layout if we send the buffers on the wire
+   */
+  public List<ArrowBuffer> getBuffersLayout() {
+    return buffersLayout;
+  }
+
+  @Override
+  public int writeTo(FlatBufferBuilder builder) {
+    RecordBatch.startNodesVector(builder, nodes.size());
+    int nodesOffset = writeAllStructsToVector(builder, nodes);
+    RecordBatch.startBuffersVector(builder, buffers.size());
+    int buffersOffset = writeAllStructsToVector(builder, buffersLayout);
+    RecordBatch.startRecordBatch(builder);
+    RecordBatch.addLength(builder, length);
+    RecordBatch.addNodes(builder, nodesOffset);
+    RecordBatch.addBuffers(builder, buffersOffset);
+    return RecordBatch.endRecordBatch(builder);
+  }
+
+  /**
+   * releases the buffers
+   */
+  public void close() {
+    if (!closed) {
+      closed = true;
+      for (ArrowBuf arrowBuf : buffers) {
+        arrowBuf.release();
+      }
+    }
+  }
+
+}
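The constructor above lays buffers end to end, padding each start up to an 8-byte boundary. A standalone sketch of that arithmetic with illustrative sizes (not part of the patch):

    long offset = 0;
    for (long size : new long[] { 5, 16 }) {  // e.g. a 5-byte and a 16-byte buffer
      System.out.println("buffer at offset " + offset + ", length " + size);
      offset += size;
      if (offset % 8 != 0) {                  // same padding rule as the constructor
        offset += 8 - (offset % 8);
      }
    }
    // prints offsets 0 and 8: the 5-byte buffer is padded out to a multiple of 8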
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowVectorType.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowVectorType.java
new file mode 100644
index 0000000000000..e3d3e34e0ae24
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowVectorType.java
@@ -0,0 +1,47 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector.schema;
+
+import org.apache.arrow.flatbuf.VectorType;
+
+public class ArrowVectorType {
+
+  public static final ArrowVectorType VALUES = new ArrowVectorType(VectorType.VALUES);
+  public static final ArrowVectorType OFFSET = new ArrowVectorType(VectorType.OFFSET);
+  public static final ArrowVectorType VALIDITY = new ArrowVectorType(VectorType.VALIDITY);
+  public static final ArrowVectorType TYPE = new ArrowVectorType(VectorType.TYPE);
+
+  private final short type;
+
+  public ArrowVectorType(short type) {
+    this.type = type;
+  }
+
+  public short getType() {
+    return type;
+  }
+
+  @Override
+  public String toString() {
+    try {
+      return VectorType.name(type);
+    } catch (ArrayIndexOutOfBoundsException e) {
+      return "Unknown type " + type;
+    }
+  }
+}
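A quick map of how these four constants get combined, summarizing the TypeLayout visitor introduced below (reference only, not part of the patch):

    // Int(32)  -> VALIDITY(1 bit) + VALUES(32 bits)
    // Utf8     -> VALIDITY(1 bit) + OFFSET(32 bits) + VALUES(8 bits)
    // Union    -> VALIDITY + TYPE, plus OFFSET in the dense mode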
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/FBSerializable.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/FBSerializable.java
new file mode 100644
index 0000000000000..d23ed91948e5d
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/FBSerializable.java
@@ -0,0 +1,24 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector.schema;
+
+import com.google.flatbuffers.FlatBufferBuilder;
+
+public interface FBSerializable {
+  int writeTo(FlatBufferBuilder builder);
+}
\ No newline at end of file
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/FBSerializables.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/FBSerializables.java
new file mode 100644
index 0000000000000..31c17ad6df02b
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/FBSerializables.java
@@ -0,0 +1,37 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector.schema;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+import com.google.flatbuffers.FlatBufferBuilder;
+
+public class FBSerializables {
+
+  public static int writeAllStructsToVector(FlatBufferBuilder builder, List<? extends FBSerializable> all) {
+    // struct vectors have to be created in reverse order
+    List<FBSerializable> reversed = new ArrayList<>(all);
+    Collections.reverse(reversed);
+    for (FBSerializable element : reversed) {
+      element.writeTo(builder);
+    }
+    return builder.endVector();
+  }
+}
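A usage sketch for the helper above, mirroring how ArrowRecordBatch.writeTo drives it (node counts are illustrative and imports are elided; the start-vector call is the generated FlatBuffers API already used in this patch):

    FlatBufferBuilder builder = new FlatBufferBuilder();
    List<ArrowFieldNode> nodes = asList(new ArrowFieldNode(10, 0), new ArrowFieldNode(10, 2));
    RecordBatch.startNodesVector(builder, nodes.size());
    int nodesOffset = FBSerializables.writeAllStructsToVector(builder, nodes);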
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java
new file mode 100644
index 0000000000000..1275e0eb5dc45
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java
@@ -0,0 +1,208 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector.schema;
+
+import static java.util.Arrays.asList;
+import static org.apache.arrow.flatbuf.Precision.DOUBLE;
+import static org.apache.arrow.flatbuf.Precision.SINGLE;
+import static org.apache.arrow.vector.schema.VectorLayout.booleanVector;
+import static org.apache.arrow.vector.schema.VectorLayout.byteVector;
+import static org.apache.arrow.vector.schema.VectorLayout.dataVector;
+import static org.apache.arrow.vector.schema.VectorLayout.offsetVector;
+import static org.apache.arrow.vector.schema.VectorLayout.typeVector;
+import static org.apache.arrow.vector.schema.VectorLayout.validityVector;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+
+import org.apache.arrow.flatbuf.UnionMode;
+import org.apache.arrow.vector.types.pojo.ArrowType;
+import org.apache.arrow.vector.types.pojo.ArrowType.ArrowTypeVisitor;
+import org.apache.arrow.vector.types.pojo.ArrowType.Binary;
+import org.apache.arrow.vector.types.pojo.ArrowType.Bool;
+import org.apache.arrow.vector.types.pojo.ArrowType.Date;
+import org.apache.arrow.vector.types.pojo.ArrowType.Decimal;
+import org.apache.arrow.vector.types.pojo.ArrowType.FloatingPoint;
+import org.apache.arrow.vector.types.pojo.ArrowType.Int;
+import org.apache.arrow.vector.types.pojo.ArrowType.IntervalDay;
+import org.apache.arrow.vector.types.pojo.ArrowType.IntervalYear;
+import org.apache.arrow.vector.types.pojo.ArrowType.Null;
+import org.apache.arrow.vector.types.pojo.ArrowType.Time;
+import org.apache.arrow.vector.types.pojo.ArrowType.Timestamp;
+import org.apache.arrow.vector.types.pojo.ArrowType.Tuple;
+import org.apache.arrow.vector.types.pojo.ArrowType.Union;
+import org.apache.arrow.vector.types.pojo.ArrowType.Utf8;
+
+/**
+ * The layout of vectors for a given type.
+ * It defines its own vectors followed by the vectors for the children
+ * if it is a nested type (Tuple, List, Union)
+ */
+public class TypeLayout {
+
+  public static TypeLayout getTypeLayout(final ArrowType arrowType) {
+    TypeLayout layout = arrowType.accept(new ArrowTypeVisitor<TypeLayout>() {
+
+      @Override public TypeLayout visit(Int type) {
+        return newFixedWidthTypeLayout(dataVector(type.getBitWidth()));
+      }
+
+      @Override public TypeLayout visit(Union type) {
+        List<VectorLayout> vectors;
+        switch (type.getMode()) {
+        case UnionMode.Dense:
+          vectors = asList(
+              // TODO: validate this
+              validityVector(),
+              typeVector(),
+              offsetVector() // offset to find the vector
+              );
+          break;
+        case UnionMode.Sparse:
+          vectors = asList(
+              validityVector(),
+              typeVector()
+              );
+          break;
+        default:
+          throw new UnsupportedOperationException("Unsupported Union Mode: " + type.getMode());
+        }
+        return new TypeLayout(vectors);
+      }
+
+      @Override public TypeLayout visit(Tuple type) {
+        List<VectorLayout> vectors = asList(
+            // TODO: add validity vector in Map
+//            validityVector()
+            );
+        return new TypeLayout(vectors);
+      }
+
+      @Override public TypeLayout visit(Timestamp type) {
+        return newFixedWidthTypeLayout(dataVector(64));
+      }
+
+      @Override public TypeLayout visit(org.apache.arrow.vector.types.pojo.ArrowType.List type) {
+        List<VectorLayout> vectors = asList(
+            validityVector(),
+            offsetVector()
+            );
+        return new TypeLayout(vectors);
+      }
+
+      @Override public TypeLayout visit(FloatingPoint type) {
+        int bitWidth;
+        switch (type.getPrecision()) {
+        case SINGLE:
+          bitWidth = 32;
+          break;
+        case DOUBLE:
+          bitWidth = 64;
+          break;
+        default:
+          throw new UnsupportedOperationException("Unsupported Precision: " + type.getPrecision());
+        }
+        return newFixedWidthTypeLayout(dataVector(bitWidth));
+      }
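+      // layouts produced so far by this visitor, for orientation:
+      //   Int(N)        -> validity(1 bit) + values(N bits)
+      //   Union(Dense)  -> validity + type + offset; Sparse omits the offset
+      //   List          -> validity + offset (child vectors are described separately)
+      //   FloatingPoint -> validity + values(32 or 64 bits)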
+
+      @Override public TypeLayout visit(Decimal type) {
+        // TODO: check size
+        return newFixedWidthTypeLayout(dataVector(64)); // actually depends on the type fields
+      }
+
+      @Override public TypeLayout visit(Bool type) {
+        return newFixedWidthTypeLayout(booleanVector());
+      }
+
+      @Override public TypeLayout visit(Binary type) {
+        return newVariableWidthTypeLayout();
+      }
+
+      @Override public TypeLayout visit(Utf8 type) {
+        return newVariableWidthTypeLayout();
+      }
+
+      private TypeLayout newVariableWidthTypeLayout() {
+        return newPrimitiveTypeLayout(validityVector(), offsetVector(), byteVector());
+      }
+
+      private TypeLayout newPrimitiveTypeLayout(VectorLayout... vectors) {
+        return new TypeLayout(asList(vectors));
+      }
+
+      public TypeLayout newFixedWidthTypeLayout(VectorLayout dataVector) {
+        return newPrimitiveTypeLayout(validityVector(), dataVector);
+      }
+
+      @Override
+      public TypeLayout visit(Null type) {
+        return new TypeLayout(Collections.<VectorLayout>emptyList());
+      }
+
+      @Override
+      public TypeLayout visit(Date type) {
+        return newFixedWidthTypeLayout(dataVector(64));
+      }
+
+      @Override
+      public TypeLayout visit(Time type) {
+        return newFixedWidthTypeLayout(dataVector(64));
+      }
+
+      @Override
+      public TypeLayout visit(IntervalDay type) { // TODO: check size
+        return newFixedWidthTypeLayout(dataVector(64));
+      }
+
+      @Override
+      public TypeLayout visit(IntervalYear type) { // TODO: check size
+        return newFixedWidthTypeLayout(dataVector(64));
+      }
+    });
+    return layout;
+  }
+
+  private final List<VectorLayout> vectors;
+
+  public TypeLayout(List<VectorLayout> vectors) {
+    super();
+    this.vectors = vectors;
+  }
+
+  public TypeLayout(VectorLayout... vectors) {
+    this(asList(vectors));
+  }
+
+  public List<VectorLayout> getVectors() {
+    return vectors;
+  }
+
+  public List<ArrowVectorType> getVectorTypes() {
+    List<ArrowVectorType> types = new ArrayList<>(vectors.size());
+    for (VectorLayout vector : vectors) {
+      types.add(vector.getType());
+    }
+    return types;
+  }
+
+  public String toString() {
+    return "TypeLayout{" + vectors + "}";
+  }
+}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/VectorLayout.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/VectorLayout.java
new file mode 100644
index 0000000000000..421ebcb837677
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/VectorLayout.java
@@ -0,0 +1,93 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */ +package org.apache.arrow.vector.schema; + +import static org.apache.arrow.vector.schema.ArrowVectorType.OFFSET; +import static org.apache.arrow.vector.schema.ArrowVectorType.TYPE; +import static org.apache.arrow.vector.schema.ArrowVectorType.VALIDITY; +import static org.apache.arrow.vector.schema.ArrowVectorType.VALUES; + +public class VectorLayout { + + private static final VectorLayout VALIDITY_VECTOR = new VectorLayout(VALIDITY, 1); + private static final VectorLayout OFFSET_VECTOR = new VectorLayout(OFFSET, 32); + private static final VectorLayout TYPE_VECTOR = new VectorLayout(TYPE, 32); + private static final VectorLayout BOOLEAN_VECTOR = new VectorLayout(VALUES, 1); + private static final VectorLayout VALUES_64 = new VectorLayout(VALUES, 64); + private static final VectorLayout VALUES_32 = new VectorLayout(VALUES, 32); + private static final VectorLayout VALUES_16 = new VectorLayout(VALUES, 16); + private static final VectorLayout VALUES_8 = new VectorLayout(VALUES, 8); + + public static VectorLayout typeVector() { + return TYPE_VECTOR; + } + + public static VectorLayout offsetVector() { + return OFFSET_VECTOR; + } + + public static VectorLayout dataVector(int typeBitWidth) { + switch (typeBitWidth) { + case 8: + return VALUES_8; + case 16: + return VALUES_16; + case 32: + return VALUES_32; + case 64: + return VALUES_64; + default: + throw new IllegalArgumentException("only 8, 16, 32, or 64 bits supported"); + } + } + + public static VectorLayout booleanVector() { + return BOOLEAN_VECTOR; + } + + public static VectorLayout validityVector() { + return VALIDITY_VECTOR; + } + + public static VectorLayout byteVector() { + return dataVector(8); + } + + private final int typeBitWidth; + + private final ArrowVectorType type; + + private VectorLayout(ArrowVectorType type, int typeBitWidth) { + super(); + this.type = type; + this.typeBitWidth = typeBitWidth; + } + + public int getTypeBitWidth() { + return typeBitWidth; + } + + public ArrowVectorType getType() { + return type; + } + + @Override + public String toString() { + return String.format("{width=%s,type=%s}", typeBitWidth, type); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java index c34882a8fb12a..4d0d9ee114ad8 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -17,8 +17,14 @@ */ package org.apache.arrow.vector.types; +import java.util.HashMap; +import java.util.Map; + +import org.apache.arrow.flatbuf.Precision; import org.apache.arrow.flatbuf.Type; +import org.apache.arrow.flatbuf.UnionMode; import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.NullableBigIntVector; import org.apache.arrow.vector.NullableBitVector; import org.apache.arrow.vector.NullableDateVector; @@ -38,7 +44,6 @@ import org.apache.arrow.vector.NullableUInt8Vector; import org.apache.arrow.vector.NullableVarBinaryVector; import org.apache.arrow.vector.NullableVarCharVector; -import org.apache.arrow.vector.SmallIntVector; import org.apache.arrow.vector.ValueVector; import org.apache.arrow.vector.ZeroVector; import org.apache.arrow.vector.complex.ListVector; @@ -85,9 +90,6 @@ import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.util.CallBack; -import java.util.HashMap; -import java.util.Map; - public class Types { public static final Field NULL_FIELD = new 
Field("", true, Null.INSTANCE, null); @@ -104,8 +106,8 @@ public class Types { public static final Field TIMESTAMP_FIELD = new Field("", true, new Timestamp(""), null); public static final Field INTERVALDAY_FIELD = new Field("", true, IntervalDay.INSTANCE, null); public static final Field INTERVALYEAR_FIELD = new Field("", true, IntervalYear.INSTANCE, null); - public static final Field FLOAT4_FIELD = new Field("", true, new FloatingPoint(0), null); - public static final Field FLOAT8_FIELD = new Field("", true, new FloatingPoint(1), null); + public static final Field FLOAT4_FIELD = new Field("", true, new FloatingPoint(Precision.SINGLE), null); + public static final Field FLOAT8_FIELD = new Field("", true, new FloatingPoint(Precision.DOUBLE), null); public static final Field LIST_FIELD = new Field("", true, List.INSTANCE, null); public static final Field VARCHAR_FIELD = new Field("", true, Utf8.INSTANCE, null); public static final Field VARBINARY_FIELD = new Field("", true, Binary.INSTANCE, null); @@ -120,7 +122,7 @@ public Field getField() { } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return ZeroVector.INSTANCE; } @@ -136,7 +138,7 @@ public Field getField() { } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return new MapVector(name, allocator, callBack); } @@ -153,7 +155,7 @@ public Field getField() { } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return new NullableTinyIntVector(name, allocator); } @@ -169,8 +171,8 @@ public Field getField() { } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new SmallIntVector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + return new NullableSmallIntVector(name, allocator); } @Override @@ -185,7 +187,7 @@ public Field getField() { } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return new NullableIntVector(name, allocator); } @@ -201,7 +203,7 @@ public Field getField() { } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return new NullableBigIntVector(name, allocator); } @@ -217,7 +219,7 @@ public Field getField() { } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... 
precisionScale) { return new NullableDateVector(name, allocator); } @@ -233,7 +235,7 @@ public Field getField() { } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return new NullableTimeVector(name, allocator); } @@ -249,7 +251,7 @@ public Field getField() { } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return new NullableTimeStampVector(name, allocator); } @@ -265,7 +267,7 @@ public Field getField() { } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return new NullableIntervalDayVector(name, allocator); } @@ -281,7 +283,7 @@ public Field getField() { } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return new NullableIntervalDayVector(name, allocator); } @@ -290,14 +292,14 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { return new IntervalYearWriterImpl((NullableIntervalYearVector) vector); } }, - FLOAT4(new FloatingPoint(0)) { + FLOAT4(new FloatingPoint(Precision.SINGLE)) { @Override public Field getField() { return FLOAT4_FIELD; } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return new NullableFloat4Vector(name, allocator); } @@ -306,14 +308,14 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { return new Float4WriterImpl((NullableFloat4Vector) vector); } }, // 4 byte ieee 754 - FLOAT8(new FloatingPoint(1)) { + FLOAT8(new FloatingPoint(Precision.DOUBLE)) { @Override public Field getField() { return FLOAT8_FIELD; } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return new NullableFloat8Vector(name, allocator); } @@ -329,7 +331,7 @@ public Field getField() { } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return new NullableBitVector(name, allocator); } @@ -345,7 +347,7 @@ public Field getField() { } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return new NullableVarCharVector(name, allocator); } @@ -361,7 +363,7 @@ public Field getField() { } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... 
precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return new NullableVarBinaryVector(name, allocator); } @@ -381,7 +383,7 @@ public Field getField() { } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return new NullableDecimalVector(name, allocator, precisionScale[0], precisionScale[1]); } @@ -397,7 +399,7 @@ public Field getField() { } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return new NullableUInt1Vector(name, allocator); } @@ -413,7 +415,7 @@ public Field getField() { } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return new NullableUInt2Vector(name, allocator); } @@ -429,7 +431,7 @@ public Field getField() { } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return new NullableUInt4Vector(name, allocator); } @@ -445,7 +447,7 @@ public Field getField() { } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return new NullableUInt8Vector(name, allocator); } @@ -461,7 +463,7 @@ public Field getField() { } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return new ListVector(name, allocator, callBack); } @@ -470,14 +472,14 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { return new UnionListWriter((ListVector) vector); } }, - UNION(Union.INSTANCE) { + UNION(new Union(UnionMode.Sparse)) { @Override public Field getField() { throw new UnsupportedOperationException("Cannot get simple field for Union type"); } @Override - public ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { return new UnionVector(name, allocator, callBack); } @@ -499,7 +501,7 @@ public ArrowType getType() { public abstract Field getField(); - public abstract ValueVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale); + public abstract FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... 
precisionScale); public abstract FieldWriter getNewFieldWriter(ValueVector vector); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java index 49d0503e47036..36712b9bea31e 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java @@ -18,19 +18,24 @@ package org.apache.arrow.vector.types.pojo; -import com.google.common.collect.ImmutableList; -import com.google.flatbuffers.FlatBufferBuilder; +import static org.apache.arrow.vector.types.pojo.ArrowType.getTypeForField; +import java.util.ArrayList; import java.util.List; import java.util.Objects; -import static org.apache.arrow.vector.types.pojo.ArrowType.getTypeForField; +import org.apache.arrow.vector.schema.ArrowVectorType; +import org.apache.arrow.vector.schema.TypeLayout; + +import com.google.common.collect.ImmutableList; +import com.google.flatbuffers.FlatBufferBuilder; public class Field { private final String name; private final boolean nullable; private final ArrowType type; private final List children; + private final TypeLayout typeLayout; public Field(String name, boolean nullable, ArrowType type, List children) { this.name = name; @@ -41,18 +46,32 @@ public Field(String name, boolean nullable, ArrowType type, List children } else { this.children = children; } + this.typeLayout = TypeLayout.getTypeLayout(type); } public static Field convertField(org.apache.arrow.flatbuf.Field field) { String name = field.name(); boolean nullable = field.nullable(); ArrowType type = getTypeForField(field); + List buffers = new ArrayList<>(); + for (int i = 0; i < field.buffersLength(); ++i) { + buffers.add(new ArrowVectorType(field.buffers(i))); + } ImmutableList.Builder childrenBuilder = ImmutableList.builder(); for (int i = 0; i < field.childrenLength(); i++) { childrenBuilder.add(convertField(field.children(i))); } List children = childrenBuilder.build(); - return new Field(name, nullable, type, children); + Field result = new Field(name, nullable, type, children); + TypeLayout typeLayout = result.getTypeLayout(); + if (typeLayout.getVectors().size() != field.buffersLength()) { + List types = new ArrayList<>(); + for (int i = 0; i < field.buffersLength(); i++) { + types.add(new ArrowVectorType(field.buffers(i))); + } + throw new IllegalArgumentException("Deserialized field does not match expected vectors. 
expected: " + typeLayout.getVectorTypes() + " got " + types); + } + return result; } public int getField(FlatBufferBuilder builder) { @@ -63,12 +82,18 @@ public int getField(FlatBufferBuilder builder) { childrenData[i] = children.get(i).getField(builder); } int childrenOffset = org.apache.arrow.flatbuf.Field.createChildrenVector(builder, childrenData); + short[] buffersData = new short[typeLayout.getVectors().size()]; + for (int i = 0; i < buffersData.length; i++) { + buffersData[i] = typeLayout.getVectors().get(i).getType().getType(); + } + int buffersOffset = org.apache.arrow.flatbuf.Field.createBuffersVector(builder, buffersData ); org.apache.arrow.flatbuf.Field.startField(builder); org.apache.arrow.flatbuf.Field.addName(builder, nameOffset); org.apache.arrow.flatbuf.Field.addNullable(builder, nullable); org.apache.arrow.flatbuf.Field.addTypeType(builder, type.getTypeType()); org.apache.arrow.flatbuf.Field.addType(builder, typeOffset); org.apache.arrow.flatbuf.Field.addChildren(builder, childrenOffset); + org.apache.arrow.flatbuf.Field.addBuffers(builder, buffersOffset); return org.apache.arrow.flatbuf.Field.endField(builder); } @@ -88,6 +113,10 @@ public List getChildren() { return children; } + public TypeLayout getTypeLayout() { + return typeLayout; + } + @Override public boolean equals(Object obj) { if (!(obj instanceof Field)) { @@ -102,4 +131,9 @@ public boolean equals(Object obj) { (this.children.size() == 0 && that.children == null)); } + + @Override + public String toString() { + return String.format("Field{name=%s, type=%s, children=%s, layout=%s}", name, type, children, typeLayout); + } } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java index 9e2894170b24b..231be9bd55ca7 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java @@ -18,15 +18,13 @@ package org.apache.arrow.vector.types.pojo; -import com.google.common.collect.ImmutableList; -import com.google.flatbuffers.FlatBufferBuilder; +import static org.apache.arrow.vector.types.pojo.Field.convertField; -import java.nio.ByteBuffer; import java.util.List; import java.util.Objects; -import static org.apache.arrow.vector.types.pojo.ArrowType.getTypeForField; -import static org.apache.arrow.vector.types.pojo.Field.convertField; +import com.google.common.collect.ImmutableList; +import com.google.flatbuffers.FlatBufferBuilder; public class Schema { private List fields; @@ -71,4 +69,9 @@ public static Schema convertSchema(org.apache.arrow.flatbuf.Schema schema) { List fields = childrenBuilder.build(); return new Schema(fields); } + + @Override + public String toString() { + return "Schema" + fields; + } } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java b/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java new file mode 100644 index 0000000000000..85bb2cfc99f81 --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java @@ -0,0 +1,89 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector; + +import java.io.IOException; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.complex.MapVector; +import org.apache.arrow.vector.complex.impl.ComplexWriterImpl; +import org.apache.arrow.vector.complex.impl.SingleMapReaderImpl; +import org.apache.arrow.vector.complex.reader.BaseReader.MapReader; +import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter; +import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; +import org.apache.arrow.vector.complex.writer.BigIntWriter; +import org.apache.arrow.vector.complex.writer.IntWriter; +import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.Schema; +import org.junit.AfterClass; +import org.junit.Assert; +import org.junit.Test; + +public class TestVectorUnloadLoad { + + static final BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); + + @Test + public void test() throws IOException { + int count = 10000; + Schema schema; + + try ( + BufferAllocator originalVectorsAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", originalVectorsAllocator, null)) { + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + IntWriter intWriter = rootWriter.integer("int"); + BigIntWriter bigIntWriter = rootWriter.bigInt("bigInt"); + for (int i = 0; i < count; i++) { + intWriter.setPosition(i); + intWriter.writeInt(i); + bigIntWriter.setPosition(i); + bigIntWriter.writeBigInt(i); + } + writer.setValueCount(count); + + VectorUnloader vectorUnloader = new VectorUnloader((MapVector)parent.getChild("root")); + schema = vectorUnloader.getSchema(); + + try ( + ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch(); + BufferAllocator finalVectorsAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); + MapVector newParent = new MapVector("parent", finalVectorsAllocator, null)) { + MapVector root = newParent.addOrGet("root", MinorType.MAP, MapVector.class); + VectorLoader vectorLoader = new VectorLoader(schema, root); + + vectorLoader.load(recordBatch); + + MapReader rootReader = new SingleMapReaderImpl(newParent).reader("root"); + for (int i = 0; i < count; i++) { + rootReader.setPosition(i); + Assert.assertEquals(i, rootReader.reader("int").readInteger().intValue()); + Assert.assertEquals(i, rootReader.reader("bigInt").readLong().longValue()); + } + } + } + } + + @AfterClass + public static void afterClass() { + allocator.close(); + } +} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/ByteArrayReadableSeekableByteChannel.java b/java/vector/src/test/java/org/apache/arrow/vector/file/ByteArrayReadableSeekableByteChannel.java new file mode 100644 index 
0000000000000..7c423d5881aea --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/ByteArrayReadableSeekableByteChannel.java @@ -0,0 +1,80 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.file; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.nio.channels.SeekableByteChannel; + +public class ByteArrayReadableSeekableByteChannel implements SeekableByteChannel { + private byte[] byteArray; + private int position = 0; + + public ByteArrayReadableSeekableByteChannel(byte[] byteArray) { + if (byteArray == null) { + throw new NullPointerException(); + } + this.byteArray = byteArray; + } + + @Override + public boolean isOpen() { + return byteArray != null; + } + + @Override + public void close() throws IOException { + byteArray = null; + } + + @Override + public int read(final ByteBuffer dst) throws IOException { + int remainingInBuf = byteArray.length - this.position; + int length = Math.min(dst.remaining(), remainingInBuf); + dst.put(this.byteArray, this.position, length); + this.position += length; + return length; + } + + @Override + public long position() throws IOException { + return this.position; + } + + @Override + public SeekableByteChannel position(final long newPosition) throws IOException { + this.position = (int)newPosition; + return this; + } + + @Override + public long size() throws IOException { + return this.byteArray.length; + } + + @Override + public int write(final ByteBuffer src) throws IOException { + throw new UnsupportedOperationException("Read only"); + } + + @Override + public SeekableByteChannel truncate(final long size) throws IOException { + throw new UnsupportedOperationException("Read only"); + } + +} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java new file mode 100644 index 0000000000000..11de0a2ef00a0 --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java @@ -0,0 +1,331 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.file; + +import java.io.File; +import java.io.FileInputStream; +import java.io.FileNotFoundException; +import java.io.FileOutputStream; +import java.io.IOException; +import java.util.List; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.ValueVector.Accessor; +import org.apache.arrow.vector.VectorLoader; +import org.apache.arrow.vector.VectorUnloader; +import org.apache.arrow.vector.complex.MapVector; +import org.apache.arrow.vector.complex.impl.ComplexWriterImpl; +import org.apache.arrow.vector.complex.impl.SingleMapReaderImpl; +import org.apache.arrow.vector.complex.reader.BaseReader.MapReader; +import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter; +import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter; +import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; +import org.apache.arrow.vector.complex.writer.BigIntWriter; +import org.apache.arrow.vector.complex.writer.IntWriter; +import org.apache.arrow.vector.schema.ArrowBuffer; +import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.Schema; +import org.junit.After; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Test; + +import io.netty.buffer.ArrowBuf; + +public class TestArrowFile { + private static final int COUNT = 10; + private BufferAllocator allocator; + + @Before + public void init() { + allocator = new RootAllocator(Integer.MAX_VALUE); + } + + @After + public void tearDown() { + allocator.close(); + } + + @Test + public void testWrite() throws IOException { + File file = new File("target/mytest_write.arrow"); + int count = COUNT; + try ( + BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", vectorAllocator, null)) { + writeData(count, parent); + write((MapVector)parent.getChild("root"), file); + } + } + + @Test + public void testWriteComplex() throws IOException { + File file = new File("target/mytest_write_complex.arrow"); + int count = COUNT; + try ( + BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", vectorAllocator, null)) { + writeComplexData(count, parent); + validateComplexContent(count, parent); + write((MapVector)parent.getChild("root"), file); + } + } + + private void writeComplexData(int count, MapVector parent) { + ArrowBuf varchar = allocator.buffer(3); + varchar.readerIndex(0); + varchar.setByte(0, 'a'); + varchar.setByte(1, 'b'); + varchar.setByte(2, 'c'); + varchar.writerIndex(3); + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + IntWriter intWriter = rootWriter.integer("int"); + BigIntWriter bigIntWriter = rootWriter.bigInt("bigInt"); + ListWriter listWriter = rootWriter.list("list"); + MapWriter mapWriter = rootWriter.map("map"); + for (int i = 0; i < count; i++) { + intWriter.setPosition(i); + intWriter.writeInt(i); + bigIntWriter.setPosition(i); + bigIntWriter.writeBigInt(i); + listWriter.setPosition(i); + listWriter.startList(); + for (int j = 0; j < i % 3; j++) { + listWriter.varChar().writeVarChar(0, 3, varchar); + } + 
listWriter.endList(); + mapWriter.setPosition(i); + mapWriter.start(); + mapWriter.timeStamp("timestamp").writeTimeStamp(i); + mapWriter.end(); + } + writer.setValueCount(count); + varchar.release(); + } + + + private void writeData(int count, MapVector parent) { + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + IntWriter intWriter = rootWriter.integer("int"); + BigIntWriter bigIntWriter = rootWriter.bigInt("bigInt"); + for (int i = 0; i < count; i++) { + intWriter.setPosition(i); + intWriter.writeInt(i); + bigIntWriter.setPosition(i); + bigIntWriter.writeBigInt(i); + } + writer.setValueCount(count); + } + + @Test + public void testWriteRead() throws IOException { + File file = new File("target/mytest.arrow"); + int count = COUNT; + + // write + try ( + BufferAllocator originalVectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", originalVectorAllocator, null)) { + writeData(count, parent); + write((MapVector)parent.getChild("root"), file); + } + + // read + try ( + BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + FileInputStream fileInputStream = new FileInputStream(file); + ArrowReader arrowReader = new ArrowReader(fileInputStream.getChannel(), readerAllocator); + BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", vectorAllocator, null) + ) { + ArrowFooter footer = arrowReader.readFooter(); + Schema schema = footer.getSchema(); + System.out.println("reading schema: " + schema); + + // initialize vectors + + MapVector root = parent.addOrGet("root", MinorType.MAP, MapVector.class); + + VectorLoader vectorLoader = new VectorLoader(schema, root); + + List recordBatches = footer.getRecordBatches(); + for (ArrowBlock rbBlock : recordBatches) { + Assert.assertEquals(0, rbBlock.getOffset() % 8); + Assert.assertEquals(0, rbBlock.getMetadataLength() % 8); + try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { + List buffersLayout = recordBatch.getBuffersLayout(); + for (ArrowBuffer arrowBuffer : buffersLayout) { + Assert.assertEquals(0, arrowBuffer.getOffset() % 8); + } + vectorLoader.load(recordBatch); + } + + validateContent(count, parent); + } + } + } + + private void validateContent(int count, MapVector parent) { + MapReader rootReader = new SingleMapReaderImpl(parent).reader("root"); + for (int i = 0; i < count; i++) { + rootReader.setPosition(i); + Assert.assertEquals(i, rootReader.reader("int").readInteger().intValue()); + Assert.assertEquals(i, rootReader.reader("bigInt").readLong().longValue()); + } + } + + @Test + public void testWriteReadComplex() throws IOException { + File file = new File("target/mytest_complex.arrow"); + int count = COUNT; + + // write + try ( + BufferAllocator originalVectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", originalVectorAllocator, null)) { + writeComplexData(count, parent); + write((MapVector)parent.getChild("root"), file); + } + + // read + try ( + BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + FileInputStream fileInputStream = new FileInputStream(file); + ArrowReader arrowReader = new ArrowReader(fileInputStream.getChannel(), readerAllocator); + BufferAllocator vectorAllocator = allocator.newChildAllocator("final 
vectors", 0, Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", vectorAllocator, null) + ) { + ArrowFooter footer = arrowReader.readFooter(); + Schema schema = footer.getSchema(); + System.out.println("reading schema: " + schema); + + // initialize vectors + + MapVector root = parent.addOrGet("root", MinorType.MAP, MapVector.class); + + VectorLoader vectorLoader = new VectorLoader(schema, root); + + List recordBatches = footer.getRecordBatches(); + for (ArrowBlock rbBlock : recordBatches) { + try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { + vectorLoader.load(recordBatch); + } + validateComplexContent(count, parent); + } + } + } + + public void printVectors(List vectors) { + for (FieldVector vector : vectors) { + System.out.println(vector.getField().getName()); + Accessor accessor = vector.getAccessor(); + int valueCount = accessor.getValueCount(); + for (int i = 0; i < valueCount; i++) { + System.out.println(accessor.getObject(i)); + } + } + } + + private void validateComplexContent(int count, MapVector parent) { + printVectors(parent.getChildrenFromFields()); + + MapReader rootReader = new SingleMapReaderImpl(parent).reader("root"); + for (int i = 0; i < count; i++) { + rootReader.setPosition(i); + Assert.assertEquals(i, rootReader.reader("int").readInteger().intValue()); + Assert.assertEquals(i, rootReader.reader("bigInt").readLong().longValue()); + Assert.assertEquals(i % 3, rootReader.reader("list").size()); + Assert.assertEquals(i, rootReader.reader("map").reader("timestamp").readDateTime().getMillis() % COUNT); + } + } + + private void write(MapVector parent, File file) throws FileNotFoundException, IOException { + VectorUnloader vectorUnloader = new VectorUnloader(parent); + Schema schema = vectorUnloader.getSchema(); + System.out.println("writing schema: " + schema); + try ( + FileOutputStream fileOutputStream = new FileOutputStream(file); + ArrowWriter arrowWriter = new ArrowWriter(fileOutputStream.getChannel(), schema); + ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch(); + ) { + arrowWriter.writeRecordBatch(recordBatch); + } + } + + @Test + public void testWriteReadMultipleRBs() throws IOException { + File file = new File("target/mytest_multiple.arrow"); + int count = COUNT; + + // write + try ( + BufferAllocator originalVectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", originalVectorAllocator, null); + FileOutputStream fileOutputStream = new FileOutputStream(file);) { + writeData(count, parent); + VectorUnloader vectorUnloader = new VectorUnloader(parent.getChild("root")); + Schema schema = vectorUnloader.getSchema(); + Assert.assertEquals(2, schema.getFields().size()); + try (ArrowWriter arrowWriter = new ArrowWriter(fileOutputStream.getChannel(), schema);) { + try (ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch()) { + arrowWriter.writeRecordBatch(recordBatch); + } + parent.allocateNew(); + writeData(count, parent); + try (ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch()) { + arrowWriter.writeRecordBatch(recordBatch); + } + } + } + + // read + try ( + BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + FileInputStream fileInputStream = new FileInputStream(file); + ArrowReader arrowReader = new ArrowReader(fileInputStream.getChannel(), readerAllocator); + BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); + MapVector 
parent = new MapVector("parent", vectorAllocator, null); + ) { + ArrowFooter footer = arrowReader.readFooter(); + Schema schema = footer.getSchema(); + System.out.println("reading schema: " + schema); + MapVector root = parent.addOrGet("root", MinorType.MAP, MapVector.class); + VectorLoader vectorLoader = new VectorLoader(schema, root); + List recordBatches = footer.getRecordBatches(); + Assert.assertEquals(2, recordBatches.size()); + for (ArrowBlock rbBlock : recordBatches) { + Assert.assertEquals(0, rbBlock.getOffset() % 8); + Assert.assertEquals(0, rbBlock.getMetadataLength() % 8); + try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { + List buffersLayout = recordBatch.getBuffersLayout(); + for (ArrowBuffer arrowBuffer : buffersLayout) { + Assert.assertEquals(0, arrowBuffer.getOffset() % 8); + } + vectorLoader.load(recordBatch); + validateContent(count, parent); + } + } + } + } + +} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFooter.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFooter.java new file mode 100644 index 0000000000000..707dba2af9898 --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFooter.java @@ -0,0 +1,56 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector.file; + +import static java.util.Arrays.asList; +import static org.junit.Assert.assertEquals; + +import java.nio.ByteBuffer; +import java.util.Collections; + +import org.apache.arrow.flatbuf.Footer; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.Schema; +import org.junit.Test; + +import com.google.flatbuffers.FlatBufferBuilder; + +public class TestArrowFooter { + + @Test + public void test() { + Schema schema = new Schema(asList( + new Field("a", true, new ArrowType.Int(8, true), Collections.emptyList()) + )); + ArrowFooter footer = new ArrowFooter(schema, Collections.emptyList(), Collections.emptyList()); + ArrowFooter newFooter = roundTrip(footer); + assertEquals(footer, newFooter); + } + + + private ArrowFooter roundTrip(ArrowFooter footer) { + FlatBufferBuilder builder = new FlatBufferBuilder(); + int i = footer.writeTo(builder); + builder.finish(i); + ByteBuffer dataBuffer = builder.dataBuffer(); + ArrowFooter newFooter = new ArrowFooter(Footer.getRootAsFooter(dataBuffer)); + return newFooter; + } + +} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java new file mode 100644 index 0000000000000..f90329aca11dd --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java @@ -0,0 +1,106 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector.file; + +import static java.util.Arrays.asList; +import static org.junit.Assert.assertArrayEquals; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +import java.io.ByteArrayOutputStream; +import java.io.IOException; +import java.nio.channels.Channels; +import java.util.Collections; +import java.util.List; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.file.ArrowBlock; +import org.apache.arrow.vector.file.ArrowFooter; +import org.apache.arrow.vector.file.ArrowReader; +import org.apache.arrow.vector.file.ArrowWriter; +import org.apache.arrow.vector.schema.ArrowFieldNode; +import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.Schema; +import org.junit.Before; +import org.junit.Test; + +import io.netty.buffer.ArrowBuf; + +public class TestArrowReaderWriter { + + private BufferAllocator allocator; + + @Before + public void init() { + allocator = new RootAllocator(Long.MAX_VALUE); + } + + ArrowBuf buf(byte[] bytes) { + ArrowBuf buffer = allocator.buffer(bytes.length); + buffer.writeBytes(bytes); + return buffer; + } + + byte[] array(ArrowBuf buf) { + byte[] bytes = new byte[buf.readableBytes()]; + buf.readBytes(bytes); + return bytes; + } + + @Test + public void test() throws IOException { + Schema schema = new Schema(asList(new Field("testField", true, new ArrowType.Int(8, true), Collections.emptyList()))); + byte[] validity = new byte[] { (byte)255, 0}; + // second half is "undefined" + byte[] values = new byte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}; + + ByteArrayOutputStream out = new ByteArrayOutputStream(); + try (ArrowWriter writer = new ArrowWriter(Channels.newChannel(out), schema)) { + ArrowBuf validityb = buf(validity); + ArrowBuf valuesb = buf(values); + writer.writeRecordBatch(new ArrowRecordBatch(16, asList(new ArrowFieldNode(16, 8)), asList(validityb, valuesb))); + } + + byte[] byteArray = out.toByteArray(); + + try (ArrowReader reader = new ArrowReader(new ByteArrayReadableSeekableByteChannel(byteArray), allocator)) { + ArrowFooter footer = reader.readFooter(); + Schema readSchema = footer.getSchema(); + assertEquals(schema, readSchema); + assertTrue(readSchema.getFields().get(0).getTypeLayout().getVectorTypes().toString(), readSchema.getFields().get(0).getTypeLayout().getVectors().size() > 0); + // TODO: dictionaries + List recordBatches = footer.getRecordBatches(); + assertEquals(1, recordBatches.size()); + ArrowRecordBatch recordBatch = reader.readRecordBatch(recordBatches.get(0)); + List nodes = recordBatch.getNodes(); + assertEquals(1, nodes.size()); + ArrowFieldNode node = nodes.get(0); + assertEquals(16, node.getLength()); + assertEquals(8, node.getNullCount()); + List buffers = recordBatch.getBuffers(); + assertEquals(2, buffers.size()); + assertArrayEquals(validity, array(buffers.get(0))); + assertArrayEquals(values, array(buffers.get(1))); + + } + } + +} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java index 06a1149c0d6c1..61327f1970e83 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java @@ -17,19 +17,24 @@ */ package 
org.apache.arrow.vector.pojo; -import com.google.common.collect.ImmutableList; -import com.google.flatbuffers.FlatBufferBuilder; +import static org.apache.arrow.flatbuf.Precision.DOUBLE; +import static org.apache.arrow.flatbuf.Precision.SINGLE; +import static org.junit.Assert.assertEquals; + +import org.apache.arrow.flatbuf.UnionMode; import org.apache.arrow.vector.types.pojo.ArrowType.FloatingPoint; import org.apache.arrow.vector.types.pojo.ArrowType.Int; +import org.apache.arrow.vector.types.pojo.ArrowType.List; +import org.apache.arrow.vector.types.pojo.ArrowType.Timestamp; import org.apache.arrow.vector.types.pojo.ArrowType.Tuple; +import org.apache.arrow.vector.types.pojo.ArrowType.Union; import org.apache.arrow.vector.types.pojo.ArrowType.Utf8; import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.types.pojo.Schema; import org.junit.Test; -import java.util.List; - -import static org.junit.Assert.assertEquals; +import com.google.common.collect.ImmutableList; +import com.google.flatbuffers.FlatBufferBuilder; /** * Test conversion between Flatbuf and Pojo field representations @@ -46,7 +51,7 @@ public void simple() { public void complex() { ImmutableList.Builder childrenBuilder = ImmutableList.builder(); childrenBuilder.add(new Field("child1", true, Utf8.INSTANCE, null)); - childrenBuilder.add(new Field("child2", true, new FloatingPoint(0), ImmutableList.of())); + childrenBuilder.add(new Field("child2", true, new FloatingPoint(SINGLE), ImmutableList.of())); Field initialField = new Field("a", true, Tuple.INSTANCE, childrenBuilder.build()); run(initialField); @@ -56,10 +61,29 @@ public void complex() { public void schema() { ImmutableList.Builder childrenBuilder = ImmutableList.builder(); childrenBuilder.add(new Field("child1", true, Utf8.INSTANCE, null)); - childrenBuilder.add(new Field("child2", true, new FloatingPoint(0), ImmutableList.of())); + childrenBuilder.add(new Field("child2", true, new FloatingPoint(SINGLE), ImmutableList.of())); Schema initialSchema = new Schema(childrenBuilder.build()); run(initialSchema); + } + @Test + public void nestedSchema() { + ImmutableList.Builder childrenBuilder = ImmutableList.builder(); + childrenBuilder.add(new Field("child1", true, Utf8.INSTANCE, null)); + childrenBuilder.add(new Field("child2", true, new FloatingPoint(SINGLE), ImmutableList.of())); + childrenBuilder.add(new Field("child3", true, new Tuple(), ImmutableList.of( + new Field("child3.1", true, Utf8.INSTANCE, null), + new Field("child3.2", true, new FloatingPoint(DOUBLE), ImmutableList.of()) + ))); + childrenBuilder.add(new Field("child4", true, new List(), ImmutableList.of( + new Field("child4.1", true, Utf8.INSTANCE, null) + ))); + childrenBuilder.add(new Field("child5", true, new Union(UnionMode.Sparse), ImmutableList.of( + new Field("child5.1", true, new Timestamp("UTC"), null), + new Field("child5.2", true, new FloatingPoint(DOUBLE), ImmutableList.of()) + ))); + Schema initialSchema = new Schema(childrenBuilder.build()); + run(initialSchema); } private void run(Field initialField) { From 907cc5a1295c4e9227ac50abf5babbe497f1edd1 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 28 Aug 2016 13:43:01 -0400 Subject: [PATCH 0122/1644] ARROW-262: Start metadata specification document The purpose of this patch is to: * Provide exposition and a place to clarify / provide examples illustrating the canonical metadata * Begin providing definitions of logical types * Where relevant, the data header metadata generated by a particular logical type (for example: 
strings produce one fewer buffer compared with List<UInt8> even though the
effective memory layout is the same as the nested type without any nulls in
its child array)

This is not a complete specification and will require follow-up JIRAs to
address more logical types and fill other gaps.

Author: Wes McKinney

Closes #121 from wesm/ARROW-262 and squashes the following commits:

bba5e82 [Wes McKinney] int->short
8cc52fd [Wes McKinney] Drafting Metadata specification document
---
 format/Message.fbs |   3 +-
 format/Metadata.md | 258 +++++++++++++++++++++++++++++++++++++++++++++
 format/README.md   |   1 +
 3 files changed, 261 insertions(+), 1 deletion(-)
 create mode 100644 format/Metadata.md

diff --git a/format/Message.fbs b/format/Message.fbs
index b02f3fa38694e..71428b581031f 100644
--- a/format/Message.fbs
+++ b/format/Message.fbs
@@ -28,12 +28,13 @@ table Int {
   is_signed: bool;
 }
 
-enum Precision:short {SINGLE, DOUBLE}
+enum Precision:short {HALF, SINGLE, DOUBLE}
 
 table FloatingPoint {
   precision: Precision;
 }
 
+/// Unicode with UTF-8 encoding
 table Utf8 {
 }
 
diff --git a/format/Metadata.md b/format/Metadata.md
new file mode 100644
index 0000000000000..e227b8d4afd84
--- /dev/null
+++ b/format/Metadata.md
@@ -0,0 +1,258 @@
+# Metadata: Logical types, schemas, data headers
+
+This is documentation for the Arrow metadata specification, which enables
+systems to communicate the
+
+* Logical array types (which are implemented using the physical memory layouts
+  specified in [Layout.md][1])
+
+* Schemas for table-like collections of Arrow data structures
+
+* "Data headers" indicating the physical locations of memory buffers sufficient
+  to reconstruct Arrow data structures without copying memory.
+
+## Canonical implementation
+
+We are using [Flatbuffers][2] for low-overhead reading and writing of the Arrow
+metadata. See [Message.fbs][3].
+
+## Schemas
+
+The `Schema` type describes a table-like structure consisting of any number of
+Arrow arrays, each of which can be interpreted as a column in the table. A
+schema by itself does not describe the physical structure of any particular set
+of data.
+
+A schema consists of a sequence of **fields**, which are metadata describing
+the columns. The Flatbuffers IDL for a field is:
+
+```
+table Field {
+  // Name is not required, e.g. in a List
+  name: string;
+  nullable: bool;
+  type: Type;
+  children: [Field];
+}
+```
+
+The `type` is the logical type of the field. Nested types, such as List,
+Struct, and Union, have a sequence of child fields.
+
+## Record data headers
+
+A record batch is a collection of top-level named, equal length Arrow arrays
+(or vectors). If one of the arrays contains nested data, its child arrays are
+not required to be the same length as the top-level arrays.
+
+A record batch can be thought of as a realization of a particular schema. The
+metadata describing a particular record batch is called a "data header". Here
+is the Flatbuffers IDL for a record batch data header:
+
+```
+table RecordBatch {
+  length: int;
+  nodes: [FieldNode];
+  buffers: [Buffer];
+}
+```
+
+The `nodes` and `buffers` fields are produced by a depth-first traversal /
+flattening of a schema (possibly containing nested types) for a given in-memory
+data set.
+
+### Buffers
+
+A buffer is metadata describing a contiguous memory region relative to some
+virtual address space. This may include:
+
+* Shared memory, e.g. a memory-mapped file
+* An RPC message received in-memory
+* Data in a file
+
+The key form of the Buffer type is:
+
+```
+struct Buffer {
+  offset: long;
+  length: long;
+}
+```
+
+In the context of a record batch, each field has some number of buffers
+associated with it, which are derived from their physical memory layout.
+
+Each logical type (separate from its children, if it is a nested type) has a
+deterministic number of buffers associated with it. These will be specified in
+the logical types section.
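A sketch of how a reader could resolve one of these `Buffer` entries against the base of a mapped region (illustrative only, not part of this patch; the `sliceBuffer` helper and its `region` argument are hypothetical):

```
// Java: zero-copy view of the memory covered by a Buffer {offset, length},
// addressed relative to the start of the shared or mapped region.
static java.nio.ByteBuffer sliceBuffer(java.nio.ByteBuffer region, long offset, long length) {
  java.nio.ByteBuffer dup = region.duplicate(); // leave the caller's cursor untouched
  dup.position((int) offset);                   // offset is relative to the region base
  dup.limit((int) (offset + length));
  return dup.slice();                           // no bytes are copied
}
```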
+### Field metadata
+
+The `FieldNode` values contain metadata about each level in a nested type
+hierarchy.
+
+```
+struct FieldNode {
+  /// The number of value slots in the Arrow array at this level of a nested
+  /// tree
+  length: int;
+
+  /// The number of observed nulls.
+  null_count: int;
+}
+```
+
+## Flattening of nested data
+
+Nested types are flattened in the record batch in depth-first order. When
+visiting each field in the nested type tree, the metadata is appended to the
+top-level `nodes` array and the buffers associated with that field (but not
+its children) are appended to the `buffers` array.
+
+For example, let's consider the schema
+
+```
+col1: Struct<a: Int32, b: List<Int64>, c: Float64>
+col2: Utf8
+```
+
+The flattened version of this is:
+
+```
+FieldNode 0: Struct name='col1'
+FieldNode 1: Int32 name='a'
+FieldNode 2: List name='b'
+FieldNode 3: Int64 name='item'  # arbitrary
+FieldNode 4: Float64 name='c'
+FieldNode 5: Utf8 name='col2'
+```
+
+For the buffers produced, we would have the following (as described in more
+detail for each type below):
+
+```
+buffer 0: field 0 validity bitmap
+
+buffer 1: field 1 validity bitmap
+buffer 2: field 1 values
+
+buffer 3: field 2 validity bitmap
+buffer 4: field 2 list offsets
+
+buffer 5: field 3 validity bitmap
+buffer 6: field 3 values
+
+buffer 7: field 4 validity bitmap
+buffer 8: field 4 values
+
+buffer 9: field 5 validity bitmap
+buffer 10: field 5 offsets
+buffer 11: field 5 data
+```
+
+## Logical types
+
+A logical type consists of a type name and metadata along with an explicit
+mapping to a physical memory representation. These may fall into some different
+categories:
+
+* Types represented as fixed-width primitive arrays (for example: C-style
+  integers and floating point numbers)
+* Types having equivalent memory layout to a physical nested type (e.g. strings
+  use the list representation, but logically are not nested types)
+
+### Integers
+
+In the first version of Arrow we provide the standard 8-bit through 64-bit
+C integer types, both signed and unsigned:
+
+* Signed types: Int8, Int16, Int32, Int64
+* Unsigned types: UInt8, UInt16, UInt32, UInt64
+
+The IDL looks like:
+
+```
+table Int {
+  bitWidth: int;
+  is_signed: bool;
+}
+```
+
+The integer endianness is currently set globally at the schema level. If a
+schema is set to be little-endian, then all integer types occurring within must
+be little-endian. Integers that are part of other data representations, such as
+list offsets and union types, must have the same endianness as the entire
+record batch.
+
+### Floating point numbers
+
+We provide 3 types of floating point numbers as fixed bit-width primitive
+arrays:
+
+- Half precision, 16-bit width
+- Single precision, 32-bit width
+- Double precision, 64-bit width
+
+The IDL looks like:
+
+```
+enum Precision:short {HALF, SINGLE, DOUBLE}
+
+table FloatingPoint {
+  precision: Precision;
+}
+```
+
+### Boolean
+
+The Boolean logical type is represented as a 1-bit wide primitive physical
+type. The bits are numbered using least-significant bit (LSB) ordering.
+
+Like other fixed bit-width primitive types, boolean data appears as 2 buffers
+in the data header (one bitmap for the validity vector and one for the values).
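Concretely, under LSB numbering, bit `i` of such a 1-bit buffer lives in byte `i / 8` at position `i % 8` counting from the least-significant bit. A minimal sketch (not part of this patch; `getBit` is a hypothetical helper):

```
// Java: test slot i of a 1-bit wide buffer (boolean values or validity bitmap).
static boolean getBit(byte[] buffer, int i) {
  // byte i >> 3 holds bit i; mask 1 << (i & 7) selects it, LSB first.
  return (buffer[i >> 3] & (1 << (i & 7))) != 0;
}
```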
+### List
+
+The `List` logical type is the logical (and identically-named) counterpart to
+the List physical type.
+
+In data header form, the list field node contains 2 buffers:
+
+* Validity bitmap
+* List offsets
+
+The buffers associated with a list's child field are handled recursively
+according to the child logical type (e.g. `List<Utf8>` vs. `List<Boolean>`).
+
+### Utf8 and Binary
+
+We specify two logical types for variable length bytes:
+
+* `Utf8` data is unicode values with UTF-8 encoding
+* `Binary` is any other variable length bytes
+
+These types both have the same memory layout as the nested type `List<byte>`,
+with the constraint that the inner bytes can contain no null values. From a
+logical type perspective they are primitive, not nested types.
+
+In data header form, while `List<byte>` would appear as 2 field nodes (`List`
+and `UInt8`) and 4 buffers (2 for each of the nodes, as per above), these types
+have a simplified representation: a single field node (of `Utf8` or `Binary`
+logical type, which have no children) and 3 buffers:
+
+* Validity bitmap
+* List offsets
+* Byte data
+
+### Decimal
+
+TBD
+
+### Timestamp
+
+TBD
+
+## Dictionary encoding
+
+[1]: https://github.com/apache/arrow/blob/master/format/Layout.md
+[2]: http://github.com/google/flatbuffers
+[3]: https://github.com/apache/arrow/blob/master/format/Message.fbs
diff --git a/format/README.md b/format/README.md
index c84e00772c3d6..3b0e50364d83c 100644
--- a/format/README.md
+++ b/format/README.md
@@ -6,6 +6,7 @@
 Currently, the Arrow specification consists of these pieces:
 
+- Metadata specification (see Metadata.md)
 - Physical memory layout specification (see Layout.md)
 - Metadata serialized representation (see Message.fbs)

From e081a4c27a5a592251f9f325a05479d4120e30e6 Mon Sep 17 00:00:00 2001
From: Julien Le Dem
Date: Sun, 28 Aug 2016 13:45:34 -0400
Subject: [PATCH 0123/1644] ARROW-271: Update Field structure to be more explicit

This is a proposal. I have not updated the code depending on this yet.
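As a concrete illustration of the proposal (an editorial sketch, not part of the patch; it assumes the Java `Field`/`TypeLayout` API shown in the diff below, and the class name `LayoutSketch` is invented):

```
import java.util.Collections;

import org.apache.arrow.vector.schema.TypeLayout;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;

public class LayoutSketch {
  public static void main(String[] args) {
    // A nullable int32 field with no children; the constructor derives the
    // same layout internally from the logical type.
    Field field = new Field("a", true, new ArrowType.Int(32, true), Collections.emptyList());
    // Layout is deterministic per type: a 1-bit VALIDITY vector followed by
    // a 32-bit DATA vector.
    TypeLayout layout = TypeLayout.getTypeLayout(new ArrowType.Int(32, true));
    // Prints something like: TypeLayout{[{width=1,type=VALIDITY}, {width=32,type=DATA}]}
    System.out.println(layout);
  }
}
```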
Author: Julien Le Dem

Closes #124 from julienledem/record_batch and squashes the following commits:

8e42d74 [Julien Le Dem] ARROW-271: Update Field structure to be more explicit add bit_width to vector layout
---
 format/Message.fbs | 26 ++++++---
 .../templates/NullableValueVectors.java | 6 ++-
 .../arrow/vector/schema/ArrowVectorType.java | 2 +-
 .../arrow/vector/schema/TypeLayout.java | 22 +++++++-
 .../arrow/vector/schema/VectorLayout.java | 54 +++++++++++++++----
 .../apache/arrow/vector/types/pojo/Field.java | 43 ++++++++-------
 .../apache/arrow/vector/pojo/TestConvert.java | 2 +
 7 files changed, 115 insertions(+), 40 deletions(-)

diff --git a/format/Message.fbs b/format/Message.fbs
index 71428b581031f..9c95724897757 100644
--- a/format/Message.fbs
+++ b/format/Message.fbs
@@ -92,17 +92,31 @@ union Type {
   JSONScalar
 }
 
+/// ----------------------------------------------------------------------
+/// The possible types of a vector
+
 enum VectorType: short {
-  /// used in List type Dense Union and variable length primitive types (String, Binary)
+  /// used in List type, Dense Union and variable length primitive types (String, Binary)
   OFFSET,
-  /// fixed length primitive values
-  VALUES,
-  /// Bit vector indicated if each value is null
+  /// actual data, either fixed width primitive types in slots or variable width delimited by an OFFSET vector
+  DATA,
+  /// Bit vector indicating if each value is null
   VALIDITY,
   /// Type vector used in Union type
   TYPE
 }
+
+/// ----------------------------------------------------------------------
+/// represents the physical layout of a buffer
+/// buffers have fixed width slots of a given type
+
+table VectorLayout {
+  /// the width of a slot in the buffer (typically 1, 8, 16, 32 or 64)
+  bit_width: short;
+  /// the purpose of the vector
+  type: VectorType;
+}
 
 /// ----------------------------------------------------------------------
 /// A field represents a named column in a record / row batch or child of a
 /// nested type.
@@ -121,10 +135,10 @@ table Field {
   dictionary: long;
   // children apply only to Nested data types like Struct, List and Union
   children: [Field];
-  /// the buffers produced for this type (as derived from the Type)
+  /// layout of buffers produced for this type (as derived from the Type)
   /// does not include children
   /// each recordbatch will return instances of those Buffers.
-  buffers: [ VectorType ];
+  layout: [ VectorLayout ];
 }
 
 /// ----------------------------------------------------------------------
diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java
index 6b1aa040a5ba2..bb2c00121605c 100644
--- a/java/vector/src/main/codegen/templates/NullableValueVectors.java
+++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java
@@ -34,6 +34,8 @@
 
 <#include "/@includes/vv_imports.ftl" />
 
+import org.apache.arrow.flatbuf.Precision;
+
 /**
  * Nullable${minor.class} implements a vector of values which could be null. Elements in the vector
  * are first checked against a fixed length vector of boolean values.
Then the element is retrieved @@ -97,9 +99,9 @@ public final class ${className} extends BaseDataValueVector implements <#if type <#elseif minor.class == "Time"> field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Time(), null); <#elseif minor.class == "Float4"> - field = new Field(name, true, new FloatingPoint(org.apache.arrow.flatbuf.Precision.SINGLE), null); + field = new Field(name, true, new FloatingPoint(Precision.SINGLE), null); <#elseif minor.class == "Float8"> - field = new Field(name, true, new FloatingPoint(org.apache.arrow.flatbuf.Precision.DOUBLE), null); + field = new Field(name, true, new FloatingPoint(Precision.DOUBLE), null); <#elseif minor.class == "TimeStamp"> field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(""), null); <#elseif minor.class == "IntervalDay"> diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowVectorType.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowVectorType.java index e3d3e34e0ae24..9b7fa45bb9ae3 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowVectorType.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowVectorType.java @@ -21,7 +21,7 @@ public class ArrowVectorType { - public static final ArrowVectorType VALUES = new ArrowVectorType(VectorType.VALUES); + public static final ArrowVectorType DATA = new ArrowVectorType(VectorType.DATA); public static final ArrowVectorType OFFSET = new ArrowVectorType(VectorType.OFFSET); public static final ArrowVectorType VALIDITY = new ArrowVectorType(VectorType.VALIDITY); public static final ArrowVectorType TYPE = new ArrowVectorType(VectorType.TYPE); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java index 1275e0eb5dc45..15cd49865bdce 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java @@ -49,6 +49,8 @@ import org.apache.arrow.vector.types.pojo.ArrowType.Union; import org.apache.arrow.vector.types.pojo.ArrowType.Utf8; +import com.google.common.base.Preconditions; + /** * The layout of vectors for a given type * It defines its own vectors followed by the vectors for the children @@ -182,7 +184,7 @@ public TypeLayout visit(IntervalYear type) { // TODO: check size public TypeLayout(List vectors) { super(); - this.vectors = vectors; + this.vectors = Preconditions.checkNotNull(vectors); } public TypeLayout(VectorLayout... 
vectors) { @@ -205,4 +207,22 @@ public List getVectorTypes() { public String toString() { return "TypeLayout{" + vectors + "}"; } + + @Override + public int hashCode() { + return vectors.hashCode(); + } + + @Override + public boolean equals(Object obj) { + if (this == obj) + return true; + if (obj == null) + return false; + if (getClass() != obj.getClass()) + return false; + TypeLayout other = (TypeLayout) obj; + return vectors.equals(other.vectors); + } + } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/VectorLayout.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/VectorLayout.java index 421ebcb837677..532e9d2328b0f 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/schema/VectorLayout.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/VectorLayout.java @@ -17,21 +17,24 @@ */ package org.apache.arrow.vector.schema; +import static org.apache.arrow.vector.schema.ArrowVectorType.DATA; import static org.apache.arrow.vector.schema.ArrowVectorType.OFFSET; import static org.apache.arrow.vector.schema.ArrowVectorType.TYPE; import static org.apache.arrow.vector.schema.ArrowVectorType.VALIDITY; -import static org.apache.arrow.vector.schema.ArrowVectorType.VALUES; -public class VectorLayout { +import com.google.common.base.Preconditions; +import com.google.flatbuffers.FlatBufferBuilder; + +public class VectorLayout implements FBSerializable { private static final VectorLayout VALIDITY_VECTOR = new VectorLayout(VALIDITY, 1); private static final VectorLayout OFFSET_VECTOR = new VectorLayout(OFFSET, 32); private static final VectorLayout TYPE_VECTOR = new VectorLayout(TYPE, 32); - private static final VectorLayout BOOLEAN_VECTOR = new VectorLayout(VALUES, 1); - private static final VectorLayout VALUES_64 = new VectorLayout(VALUES, 64); - private static final VectorLayout VALUES_32 = new VectorLayout(VALUES, 32); - private static final VectorLayout VALUES_16 = new VectorLayout(VALUES, 16); - private static final VectorLayout VALUES_8 = new VectorLayout(VALUES, 8); + private static final VectorLayout BOOLEAN_VECTOR = new VectorLayout(DATA, 1); + private static final VectorLayout VALUES_64 = new VectorLayout(DATA, 64); + private static final VectorLayout VALUES_32 = new VectorLayout(DATA, 32); + private static final VectorLayout VALUES_16 = new VectorLayout(DATA, 16); + private static final VectorLayout VALUES_8 = new VectorLayout(DATA, 8); public static VectorLayout typeVector() { return TYPE_VECTOR; @@ -68,14 +71,21 @@ public static VectorLayout byteVector() { return dataVector(8); } - private final int typeBitWidth; + private final short typeBitWidth; private final ArrowVectorType type; private VectorLayout(ArrowVectorType type, int typeBitWidth) { super(); - this.type = type; - this.typeBitWidth = typeBitWidth; + this.type = Preconditions.checkNotNull(type); + this.typeBitWidth = (short)typeBitWidth; + if (typeBitWidth <= 0) { + throw new IllegalArgumentException("bitWidth invalid: " + typeBitWidth); + } + } + + public VectorLayout(org.apache.arrow.flatbuf.VectorLayout layout) { + this(new ArrowVectorType(layout.type()), layout.bitWidth()); } public int getTypeBitWidth() { @@ -90,4 +100,28 @@ public ArrowVectorType getType() { public String toString() { return String.format("{width=%s,type=%s}", typeBitWidth, type); } + + @Override + public int hashCode() { + return 31 * (31 + type.hashCode()) + typeBitWidth; + } + + @Override + public boolean equals(Object obj) { + if (this == obj) + return true; + if (obj == null) + return false; + 
if (getClass() != obj.getClass()) + return false; + VectorLayout other = (VectorLayout) obj; + return type.equals(other.type) && (typeBitWidth == other.typeBitWidth); + } + + @Override + public int writeTo(FlatBufferBuilder builder) {; + return org.apache.arrow.flatbuf.VectorLayout.createVectorLayout(builder, typeBitWidth, type.getType()); + } + + } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java index 36712b9bea31e..cfa1ed40aeb8c 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java @@ -20,12 +20,11 @@ import static org.apache.arrow.vector.types.pojo.ArrowType.getTypeForField; -import java.util.ArrayList; import java.util.List; import java.util.Objects; -import org.apache.arrow.vector.schema.ArrowVectorType; import org.apache.arrow.vector.schema.TypeLayout; +import org.apache.arrow.vector.schema.VectorLayout; import com.google.common.collect.ImmutableList; import com.google.flatbuffers.FlatBufferBuilder; @@ -37,7 +36,7 @@ public class Field { private final List children; private final TypeLayout typeLayout; - public Field(String name, boolean nullable, ArrowType type, List children) { + private Field(String name, boolean nullable, ArrowType type, List children, TypeLayout typeLayout) { this.name = name; this.nullable = nullable; this.type = type; @@ -46,34 +45,37 @@ public Field(String name, boolean nullable, ArrowType type, List children } else { this.children = children; } - this.typeLayout = TypeLayout.getTypeLayout(type); + this.typeLayout = typeLayout; + } + + public Field(String name, boolean nullable, ArrowType type, List children) { + this(name, nullable, type, children, TypeLayout.getTypeLayout(type)); } public static Field convertField(org.apache.arrow.flatbuf.Field field) { String name = field.name(); boolean nullable = field.nullable(); ArrowType type = getTypeForField(field); - List buffers = new ArrayList<>(); - for (int i = 0; i < field.buffersLength(); ++i) { - buffers.add(new ArrowVectorType(field.buffers(i))); + ImmutableList.Builder layout = ImmutableList.builder(); + for (int i = 0; i < field.layoutLength(); ++i) { + layout.add(new org.apache.arrow.vector.schema.VectorLayout(field.layout(i))); } ImmutableList.Builder childrenBuilder = ImmutableList.builder(); for (int i = 0; i < field.childrenLength(); i++) { childrenBuilder.add(convertField(field.children(i))); } List children = childrenBuilder.build(); - Field result = new Field(name, nullable, type, children); - TypeLayout typeLayout = result.getTypeLayout(); - if (typeLayout.getVectors().size() != field.buffersLength()) { - List types = new ArrayList<>(); - for (int i = 0; i < field.buffersLength(); i++) { - types.add(new ArrowVectorType(field.buffers(i))); - } - throw new IllegalArgumentException("Deserialized field does not match expected vectors. expected: " + typeLayout.getVectorTypes() + " got " + types); - } + Field result = new Field(name, nullable, type, children, new TypeLayout(layout.build())); return result; } + public void validate() { + TypeLayout expectedLayout = TypeLayout.getTypeLayout(type); + if (!expectedLayout.equals(typeLayout)) { + throw new IllegalArgumentException("Deserialized field does not match expected vectors. 
expected: " + expectedLayout + " got " + typeLayout); + } + } + public int getField(FlatBufferBuilder builder) { int nameOffset = builder.createString(name); int typeOffset = type.getType(builder); @@ -82,18 +84,19 @@ public int getField(FlatBufferBuilder builder) { childrenData[i] = children.get(i).getField(builder); } int childrenOffset = org.apache.arrow.flatbuf.Field.createChildrenVector(builder, childrenData); - short[] buffersData = new short[typeLayout.getVectors().size()]; + int[] buffersData = new int[typeLayout.getVectors().size()]; for (int i = 0; i < buffersData.length; i++) { - buffersData[i] = typeLayout.getVectors().get(i).getType().getType(); + VectorLayout vectorLayout = typeLayout.getVectors().get(i); + buffersData[i] = vectorLayout.writeTo(builder); } - int buffersOffset = org.apache.arrow.flatbuf.Field.createBuffersVector(builder, buffersData ); + int layoutOffset = org.apache.arrow.flatbuf.Field.createLayoutVector(builder, buffersData); org.apache.arrow.flatbuf.Field.startField(builder); org.apache.arrow.flatbuf.Field.addName(builder, nameOffset); org.apache.arrow.flatbuf.Field.addNullable(builder, nullable); org.apache.arrow.flatbuf.Field.addTypeType(builder, type.getTypeType()); org.apache.arrow.flatbuf.Field.addType(builder, typeOffset); org.apache.arrow.flatbuf.Field.addChildren(builder, childrenOffset); - org.apache.arrow.flatbuf.Field.addBuffers(builder, buffersOffset); + org.apache.arrow.flatbuf.Field.addLayout(builder, layoutOffset); return org.apache.arrow.flatbuf.Field.endField(builder); } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java index 61327f1970e83..e557cc84f3bae 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java @@ -22,6 +22,8 @@ import static org.junit.Assert.assertEquals; import org.apache.arrow.flatbuf.UnionMode; +import static org.junit.Assert.assertEquals; + import org.apache.arrow.vector.types.pojo.ArrowType.FloatingPoint; import org.apache.arrow.vector.types.pojo.ArrowType.Int; import org.apache.arrow.vector.types.pojo.ArrowType.List; From 0a411fd29ed1baac6f1524be82fc15e08f2b28db Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sun, 28 Aug 2016 15:25:35 -0400 Subject: [PATCH 0124/1644] ARROW-242: Support Timestamp Data Type For the Pandas<->Parquet bridge this is a lossy conversion but must be explicitly activated by the user. Regarding Parquet 1.0: Yes, the logical type is not supported but should be simply ignored by the reader. Implementation for INT96 timestamps is not in the scope of this PR. Author: Uwe L. Korn Closes #107 from xhochy/arrow-242 and squashes the following commits: 8db6968 [Uwe L. Korn] Add missing include 34126b1 [Uwe L. 
Korn] ARROW-242: Support Timestamp Data Type --- .../parquet/parquet-reader-writer-test.cc | 12 +- cpp/src/arrow/parquet/parquet-schema-test.cc | 23 +++- cpp/src/arrow/parquet/reader.cc | 1 + cpp/src/arrow/parquet/schema.cc | 13 ++- cpp/src/arrow/parquet/writer.cc | 1 + cpp/src/arrow/types/construct.cc | 3 +- cpp/src/arrow/types/datetime.h | 12 +- cpp/src/arrow/types/primitive.cc | 1 + cpp/src/arrow/types/primitive.h | 11 ++ python/pyarrow/array.pyx | 40 ++++++- python/pyarrow/includes/libarrow.pxd | 1 + python/pyarrow/tests/test_convert_pandas.py | 24 +++- python/pyarrow/tests/test_parquet.py | 4 +- python/src/pyarrow/adapters/pandas.cc | 107 ++++++++++++++++-- 14 files changed, 232 insertions(+), 21 deletions(-) diff --git a/cpp/src/arrow/parquet/parquet-reader-writer-test.cc b/cpp/src/arrow/parquet/parquet-reader-writer-test.cc index bfc27d26d63a1..d7b39dda377d3 100644 --- a/cpp/src/arrow/parquet/parquet-reader-writer-test.cc +++ b/cpp/src/arrow/parquet/parquet-reader-writer-test.cc @@ -137,6 +137,15 @@ struct test_traits { const int64_t test_traits::value(-1024); +template <> +struct test_traits { + static constexpr ParquetType::type parquet_enum = ParquetType::INT64; + static constexpr LogicalType::type logical_enum = LogicalType::TIMESTAMP_MILLIS; + static int64_t const value; +}; + +const int64_t test_traits::value(14695634030000); + template <> struct test_traits { static constexpr ParquetType::type parquet_enum = ParquetType::FLOAT; @@ -248,7 +257,8 @@ class TestParquetIO : public ::testing::Test { // Parquet version 1.0. typedef ::testing::Types TestTypes; + Int32Type, UInt64Type, Int64Type, TimestampType, FloatType, DoubleType, + StringType> TestTypes; TYPED_TEST_CASE(TestParquetIO, TestTypes); diff --git a/cpp/src/arrow/parquet/parquet-schema-test.cc b/cpp/src/arrow/parquet/parquet-schema-test.cc index 819cdd3ec4394..a2bcd3e05c307 100644 --- a/cpp/src/arrow/parquet/parquet-schema-test.cc +++ b/cpp/src/arrow/parquet/parquet-schema-test.cc @@ -22,6 +22,7 @@ #include "arrow/test-util.h" #include "arrow/type.h" +#include "arrow/types/datetime.h" #include "arrow/types/decimal.h" #include "arrow/util/status.h" @@ -45,6 +46,9 @@ const auto INT64 = std::make_shared(); const auto FLOAT = std::make_shared(); const auto DOUBLE = std::make_shared(); const auto UTF8 = std::make_shared(); +const auto TIMESTAMP_MS = std::make_shared(TimestampType::Unit::MILLI); +// TODO: This requires parquet-cpp implementing the MICROS enum value +// const auto TIMESTAMP_US = std::make_shared(TimestampType::Unit::MICRO); const auto BINARY = std::make_shared(std::make_shared("", UINT8)); const auto DECIMAL_8_4 = std::make_shared(8, 4); @@ -89,6 +93,14 @@ TEST_F(TestConvertParquetSchema, ParquetFlatPrimitives) { PrimitiveNode::Make("int64", Repetition::REQUIRED, ParquetType::INT64)); arrow_fields.push_back(std::make_shared("int64", INT64, false)); + parquet_fields.push_back(PrimitiveNode::Make("timestamp", Repetition::REQUIRED, + ParquetType::INT64, LogicalType::TIMESTAMP_MILLIS)); + arrow_fields.push_back(std::make_shared("timestamp", TIMESTAMP_MS, false)); + + // parquet_fields.push_back(PrimitiveNode::Make("timestamp", Repetition::REQUIRED, + // ParquetType::INT64, LogicalType::TIMESTAMP_MICROS)); + // arrow_fields.push_back(std::make_shared("timestamp", TIMESTAMP_US, false)); + parquet_fields.push_back( PrimitiveNode::Make("float", Repetition::OPTIONAL, ParquetType::FLOAT)); arrow_fields.push_back(std::make_shared("float", FLOAT)); @@ -153,9 +165,6 @@ TEST_F(TestConvertParquetSchema, UnsupportedThings) { 
unsupported_nodes.push_back(PrimitiveNode::Make( "int32", Repetition::OPTIONAL, ParquetType::INT32, LogicalType::DATE)); - unsupported_nodes.push_back(PrimitiveNode::Make( - "int64", Repetition::OPTIONAL, ParquetType::INT64, LogicalType::TIMESTAMP_MILLIS)); - for (const NodePtr& node : unsupported_nodes) { ASSERT_RAISES(NotImplemented, ConvertSchema({node})); } @@ -209,6 +218,14 @@ TEST_F(TestConvertArrowSchema, ParquetFlatPrimitives) { PrimitiveNode::Make("int64", Repetition::REQUIRED, ParquetType::INT64)); arrow_fields.push_back(std::make_shared("int64", INT64, false)); + parquet_fields.push_back(PrimitiveNode::Make("timestamp", Repetition::REQUIRED, + ParquetType::INT64, LogicalType::TIMESTAMP_MILLIS)); + arrow_fields.push_back(std::make_shared("timestamp", TIMESTAMP_MS, false)); + + // parquet_fields.push_back(PrimitiveNode::Make("timestamp", Repetition::REQUIRED, + // ParquetType::INT64, LogicalType::TIMESTAMP_MICROS)); + // arrow_fields.push_back(std::make_shared("timestamp", TIMESTAMP_US, false)); + parquet_fields.push_back( PrimitiveNode::Make("float", Repetition::OPTIONAL, ParquetType::FLOAT)); arrow_fields.push_back(std::make_shared("float", FLOAT)); diff --git a/cpp/src/arrow/parquet/reader.cc b/cpp/src/arrow/parquet/reader.cc index e92967e5363d2..9f6212570dcdc 100644 --- a/cpp/src/arrow/parquet/reader.cc +++ b/cpp/src/arrow/parquet/reader.cc @@ -368,6 +368,7 @@ Status FlatColumnReader::Impl::NextBatch(int batch_size, std::shared_ptr* TYPED_BATCH_CASE(FLOAT, FloatType, ::parquet::FloatType) TYPED_BATCH_CASE(DOUBLE, DoubleType, ::parquet::DoubleType) TYPED_BATCH_CASE(STRING, StringType, ::parquet::ByteArrayType) + TYPED_BATCH_CASE(TIMESTAMP, TimestampType, ::parquet::Int64Type) default: return Status::NotImplemented(field_->type->ToString()); } diff --git a/cpp/src/arrow/parquet/schema.cc b/cpp/src/arrow/parquet/schema.cc index a79342afe2f9d..cd91df32271c1 100644 --- a/cpp/src/arrow/parquet/schema.cc +++ b/cpp/src/arrow/parquet/schema.cc @@ -52,6 +52,7 @@ const auto INT64 = std::make_shared(); const auto FLOAT = std::make_shared(); const auto DOUBLE = std::make_shared(); const auto UTF8 = std::make_shared(); +const auto TIMESTAMP_MS = std::make_shared(TimestampType::Unit::MILLI); const auto BINARY = std::make_shared(std::make_shared("", UINT8)); TypePtr MakeDecimalType(const PrimitiveNode* node) { @@ -133,6 +134,9 @@ static Status FromInt64(const PrimitiveNode* node, TypePtr* out) { case LogicalType::DECIMAL: *out = MakeDecimalType(node); break; + case LogicalType::TIMESTAMP_MILLIS: + *out = TIMESTAMP_MS; + break; default: return Status::NotImplemented("Unhandled logical type for int64"); break; @@ -289,10 +293,15 @@ Status FieldToNode(const std::shared_ptr& field, type = ParquetType::INT32; logical_type = LogicalType::DATE; break; - case Type::TIMESTAMP: + case Type::TIMESTAMP: { + auto timestamp_type = static_cast(field->type.get()); + if (timestamp_type->unit != TimestampType::Unit::MILLI) { + return Status::NotImplemented( + "Other timestamp units than millisecond are not yet support with parquet."); + } type = ParquetType::INT64; logical_type = LogicalType::TIMESTAMP_MILLIS; - break; + } break; case Type::TIMESTAMP_DOUBLE: type = ParquetType::INT64; // This is specified as seconds since the UNIX epoch diff --git a/cpp/src/arrow/parquet/writer.cc b/cpp/src/arrow/parquet/writer.cc index f9514aa2ad2ff..ddee573fa1eb9 100644 --- a/cpp/src/arrow/parquet/writer.cc +++ b/cpp/src/arrow/parquet/writer.cc @@ -240,6 +240,7 @@ Status FileWriter::Impl::WriteFlatColumnChunk( 
TYPED_BATCH_CASE(INT32, Int32Type, ::parquet::Int32Type) TYPED_BATCH_CASE(UINT64, UInt64Type, ::parquet::Int64Type) TYPED_BATCH_CASE(INT64, Int64Type, ::parquet::Int64Type) + TYPED_BATCH_CASE(TIMESTAMP, TimestampType, ::parquet::Int64Type) TYPED_BATCH_CASE(FLOAT, FloatType, ::parquet::FloatType) TYPED_BATCH_CASE(DOUBLE, DoubleType, ::parquet::DoubleType) default: diff --git a/cpp/src/arrow/types/construct.cc b/cpp/src/arrow/types/construct.cc index 5ae9c5ab6d4f9..0b71ea965516c 100644 --- a/cpp/src/arrow/types/construct.cc +++ b/cpp/src/arrow/types/construct.cc @@ -51,6 +51,7 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, BUILDER_CASE(INT32, Int32Builder); BUILDER_CASE(UINT64, UInt64Builder); BUILDER_CASE(INT64, Int64Builder); + BUILDER_CASE(TIMESTAMP, TimestampBuilder); BUILDER_CASE(BOOL, BooleanBuilder); @@ -105,7 +106,7 @@ Status MakePrimitiveArray(const TypePtr& type, int32_t length, MAKE_PRIMITIVE_ARRAY_CASE(UINT64, UInt64Array); MAKE_PRIMITIVE_ARRAY_CASE(INT64, Int64Array); MAKE_PRIMITIVE_ARRAY_CASE(TIME, Int64Array); - MAKE_PRIMITIVE_ARRAY_CASE(TIMESTAMP, Int64Array); + MAKE_PRIMITIVE_ARRAY_CASE(TIMESTAMP, TimestampArray); MAKE_PRIMITIVE_ARRAY_CASE(FLOAT, FloatArray); MAKE_PRIMITIVE_ARRAY_CASE(DOUBLE, DoubleArray); MAKE_PRIMITIVE_ARRAY_CASE(TIMESTAMP_DOUBLE, DoubleArray); diff --git a/cpp/src/arrow/types/datetime.h b/cpp/src/arrow/types/datetime.h index b782455546c33..241a126d1007f 100644 --- a/cpp/src/arrow/types/datetime.h +++ b/cpp/src/arrow/types/datetime.h @@ -18,6 +18,8 @@ #ifndef ARROW_TYPES_DATETIME_H #define ARROW_TYPES_DATETIME_H +#include + #include "arrow/type.h" namespace arrow { @@ -34,15 +36,23 @@ struct DateType : public DataType { static char const* name() { return "date"; } }; -struct TimestampType : public DataType { +struct ARROW_EXPORT TimestampType : public DataType { enum class Unit : char { SECOND = 0, MILLI = 1, MICRO = 2, NANO = 3 }; + typedef int64_t c_type; + static constexpr Type::type type_enum = Type::TIMESTAMP; + + int value_size() const override { return sizeof(int64_t); } + Unit unit; explicit TimestampType(Unit unit = Unit::MILLI) : DataType(Type::TIMESTAMP), unit(unit) {} TimestampType(const TimestampType& other) : TimestampType(other.unit) {} + virtual ~TimestampType() {} + + std::string ToString() const override { return "timestamp"; } static char const* name() { return "timestamp"; } }; diff --git a/cpp/src/arrow/types/primitive.cc b/cpp/src/arrow/types/primitive.cc index f4b47f9d2f503..375e94f2bc1c4 100644 --- a/cpp/src/arrow/types/primitive.cc +++ b/cpp/src/arrow/types/primitive.cc @@ -158,6 +158,7 @@ template class PrimitiveBuilder; template class PrimitiveBuilder; template class PrimitiveBuilder; template class PrimitiveBuilder; +template class PrimitiveBuilder; template class PrimitiveBuilder; template class PrimitiveBuilder; template class PrimitiveBuilder; diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index 770de765f1fcc..c643783f681bd 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -26,6 +26,7 @@ #include "arrow/array.h" #include "arrow/builder.h" #include "arrow/type.h" +#include "arrow/types/datetime.h" #include "arrow/util/bit-util.h" #include "arrow/util/buffer.h" #include "arrow/util/status.h" @@ -100,6 +101,7 @@ NUMERIC_ARRAY_DECL(UInt32Array, UInt32Type); NUMERIC_ARRAY_DECL(Int32Array, Int32Type); NUMERIC_ARRAY_DECL(UInt64Array, UInt64Type); NUMERIC_ARRAY_DECL(Int64Array, Int64Type); +NUMERIC_ARRAY_DECL(TimestampArray, TimestampType); 
NUMERIC_ARRAY_DECL(FloatArray, FloatType); NUMERIC_ARRAY_DECL(DoubleArray, DoubleType); @@ -235,7 +237,15 @@ struct type_traits { static inline int bytes_required(int elements) { return elements * sizeof(int64_t); } }; + +template <> +struct type_traits { + typedef TimestampArray ArrayType; + + static inline int bytes_required(int elements) { return elements * sizeof(int64_t); } +}; template <> + struct type_traits { typedef FloatArray ArrayType; @@ -260,6 +270,7 @@ typedef NumericBuilder Int8Builder; typedef NumericBuilder Int16Builder; typedef NumericBuilder Int32Builder; typedef NumericBuilder Int64Builder; +typedef NumericBuilder TimestampBuilder; typedef NumericBuilder FloatBuilder; typedef NumericBuilder DoubleBuilder; diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 619e5ef7e3943..5229b429f58b4 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -19,6 +19,8 @@ # distutils: language = c++ # cython: embedsignature = True +import numpy as np + from pyarrow.includes.libarrow cimport * cimport pyarrow.includes.pyarrow as pyarrow @@ -186,6 +188,7 @@ cdef dict _array_classes = { Type_DOUBLE: DoubleArray, Type_LIST: ListArray, Type_STRING: StringArray, + Type_TIMESTAMP: Int64Array, } cdef object box_arrow_array(const shared_ptr[CArray]& sp_array): @@ -217,11 +220,28 @@ def from_pylist(object list_obj, DataType type=None): return box_arrow_array(sp_array) -def from_pandas_series(object series, object mask=None): +def from_pandas_series(object series, object mask=None, timestamps_to_ms=False): + """ + Convert pandas.Series to an Arrow Array. + + Parameters + ---------- + series: pandas.Series or numpy.ndarray + + mask: pandas.Series or numpy.ndarray + array to mask null entries in the series + + timestamps_to_ms: bool + Convert datetime columns to ms resolution. This is needed for + compability with other functionality like Parquet I/O which + only supports milliseconds. + """ cdef: shared_ptr[CArray] out series_values = series_as_ndarray(series) + if series_values.dtype.type == np.datetime64 and timestamps_to_ms: + series_values = series_values.astype('datetime64[ms]') if mask is None: check_status(pyarrow.PandasToArrow(pyarrow.GetMemoryPool(), @@ -234,14 +254,28 @@ def from_pandas_series(object series, object mask=None): return box_arrow_array(out) -def from_pandas_dataframe(object df, name=None): +def from_pandas_dataframe(object df, name=None, timestamps_to_ms=False): + """ + Convert pandas.DataFrame to an Arrow Table + + Parameters + ---------- + df: pandas.DataFrame + + name: str + + timestamps_to_ms: bool + Convert datetime columns to ms resolution. This is needed for + compability with other functionality like Parquet I/O which + only supports milliseconds. 
+ """ cdef: list names = [] list arrays = [] for name in df.columns: col = df[name] - arr = from_pandas_series(col) + arr = from_pandas_series(col, timestamps_to_ms=timestamps_to_ms) names.append(name) arrays.append(arr) diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 91ce069df8f42..854d07d691dbf 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -38,6 +38,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: Type_FLOAT" arrow::Type::FLOAT" Type_DOUBLE" arrow::Type::DOUBLE" + Type_TIMESTAMP" arrow::Type::TIMESTAMP" Type_STRING" arrow::Type::STRING" Type_LIST" arrow::Type::LIST" diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index 6dc9c689e249b..55302996f4557 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -33,8 +33,9 @@ def setUp(self): def tearDown(self): pass - def _check_pandas_roundtrip(self, df, expected=None): - table = A.from_pandas_dataframe(df) + def _check_pandas_roundtrip(self, df, expected=None, + timestamps_to_ms=False): + table = A.from_pandas_dataframe(df, timestamps_to_ms=timestamps_to_ms) result = table.to_pandas() if expected is None: expected = df @@ -164,6 +165,25 @@ def test_strings(self): expected = pd.DataFrame({'strings': values * repeats}) self._check_pandas_roundtrip(df, expected) + def test_timestamps_notimezone(self): + df = pd.DataFrame({ + 'datetime64': np.array([ + '2007-07-13T01:23:34.123', + '2006-01-13T12:34:56.432', + '2010-08-13T05:46:57.437'], + dtype='datetime64[ms]') + }) + self._check_pandas_roundtrip(df, timestamps_to_ms=True) + + df = pd.DataFrame({ + 'datetime64': np.array([ + '2007-07-13T01:23:34.123456789', + '2006-01-13T12:34:56.432539784', + '2010-08-13T05:46:57.437699912'], + dtype='datetime64[ns]') + }) + self._check_pandas_roundtrip(df, timestamps_to_ms=False) + # def test_category(self): # repeats = 1000 # values = [b'foo', None, u'bar', 'qux', np.nan] diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index de9cfbb46e1a2..d89d947b7b6ac 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -57,11 +57,13 @@ def test_pandas_parquet_2_0_rountrip(tmpdir): 'float32': np.arange(size, dtype=np.float32), 'float64': np.arange(size, dtype=np.float64), 'bool': np.random.randn(size) > 0, + # Pandas only support ns resolution, Arrow at the moment only ms + 'datetime': np.arange("2016-01-01T00:00:00.001", size, dtype='datetime64[ms]'), 'str': [str(x) for x in range(size)], 'str_with_nulls': [None] + [str(x) for x in range(size - 2)] + [None] }) filename = tmpdir.join('pandas_rountrip.parquet') - arrow_table = A.from_pandas_dataframe(df) + arrow_table = A.from_pandas_dataframe(df, timestamps_to_ms=True) A.parquet.write_table(arrow_table, filename.strpath, version="2.0") table_read = pyarrow.parquet.read_table(filename.strpath) df_read = table_read.to_pandas() diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index 8dcc2b1c92e11..a4e7fb6f3bb70 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -38,6 +38,7 @@ namespace pyarrow { using arrow::Array; using arrow::Column; +using arrow::DataType; namespace util = arrow::util; // ---------------------------------------------------------------------- @@ -50,7 +51,7 @@ struct npy_traits { template <> struct npy_traits { typedef uint8_t 
value_type; - using ArrayType = arrow::BooleanArray; + using TypeClass = arrow::BooleanType; static constexpr bool supports_nulls = false; static inline bool isnull(uint8_t v) { @@ -62,7 +63,7 @@ struct npy_traits { template <> \ struct npy_traits { \ typedef T value_type; \ - using ArrayType = arrow::CapType##Array; \ + using TypeClass = arrow::CapType##Type; \ \ static constexpr bool supports_nulls = false; \ static inline bool isnull(T v) { \ @@ -82,7 +83,7 @@ NPY_INT_DECL(UINT64, UInt64, uint64_t); template <> struct npy_traits { typedef float value_type; - using ArrayType = arrow::FloatArray; + using TypeClass = arrow::FloatType; static constexpr bool supports_nulls = true; @@ -94,7 +95,7 @@ struct npy_traits { template <> struct npy_traits { typedef double value_type; - using ArrayType = arrow::DoubleArray; + using TypeClass = arrow::DoubleType; static constexpr bool supports_nulls = true; @@ -103,6 +104,22 @@ struct npy_traits { } }; +template <> +struct npy_traits { + typedef double value_type; + using TypeClass = arrow::TimestampType; + + static constexpr bool supports_nulls = true; + + static inline bool isnull(int64_t v) { + // NaT = -2**63 + // = -0x8000000000000000 + // = -9223372036854775808; + // = std::numeric_limits::min() + return v == std::numeric_limits::min(); + } +}; + template <> struct npy_traits { typedef PyObject* value_type; @@ -206,6 +223,8 @@ class ArrowSerializer { return Status::OK(); } + Status MakeDataType(std::shared_ptr* out); + arrow::MemoryPool* pool_; PyArrayObject* arr_; @@ -253,6 +272,39 @@ static int64_t ValuesToBitmap(const void* data, int64_t length, uint8_t* bitmap) return null_count; } +template +inline Status ArrowSerializer::MakeDataType(std::shared_ptr* out) { + out->reset(new typename npy_traits::TypeClass()); + return Status::OK(); +} + +template <> +inline Status ArrowSerializer::MakeDataType(std::shared_ptr* out) { + PyArray_Descr* descr = PyArray_DESCR(arr_); + auto date_dtype = reinterpret_cast(descr->c_metadata); + arrow::TimestampType::Unit unit; + + switch (date_dtype->meta.base) { + case NPY_FR_s: + unit = arrow::TimestampType::Unit::SECOND; + break; + case NPY_FR_ms: + unit = arrow::TimestampType::Unit::MILLI; + break; + case NPY_FR_us: + unit = arrow::TimestampType::Unit::MICRO; + break; + case NPY_FR_ns: + unit = arrow::TimestampType::Unit::NANO; + break; + default: + return Status::ValueError("Unknown NumPy datetime unit"); + } + + out->reset(new arrow::TimestampType(unit)); + return Status::OK(); +} + template inline Status ArrowSerializer::Convert(std::shared_ptr* out) { typedef npy_traits traits; @@ -269,9 +321,9 @@ inline Status ArrowSerializer::Convert(std::shared_ptr* out) { } RETURN_NOT_OK(ConvertData()); - *out = std::make_shared(length_, data_, null_count, - null_bitmap_); - + std::shared_ptr type; + RETURN_NOT_OK(MakeDataType(&type)); + RETURN_ARROW_NOT_OK(MakePrimitiveArray(type, length_, data_, null_count, null_bitmap_, out)); return Status::OK(); } @@ -402,6 +454,7 @@ Status PandasMaskedToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, TO_ARROW_CASE(UINT64); TO_ARROW_CASE(FLOAT32); TO_ARROW_CASE(FLOAT64); + TO_ARROW_CASE(DATETIME); TO_ARROW_CASE(OBJECT); default: std::stringstream ss; @@ -476,6 +529,17 @@ struct arrow_traits { typedef typename npy_traits::value_type T; }; +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_DATETIME; + static constexpr bool supports_nulls = true; + static constexpr int64_t na_value = std::numeric_limits::min(); + static constexpr bool is_boolean = 
false; + static constexpr bool is_integer = true; + static constexpr bool is_floating = false; + typedef typename npy_traits::value_type T; +}; + template <> struct arrow_traits { static constexpr int npy_type = NPY_OBJECT; @@ -494,6 +558,30 @@ static inline PyObject* make_pystring(const uint8_t* data, int32_t length) { #endif } +inline void set_numpy_metadata(int type, DataType* datatype, PyArrayObject* out) { + if (type == NPY_DATETIME) { + auto timestamp_type = static_cast(datatype); + // We only support ms resolution at the moment + PyArray_Descr* descr = PyArray_DESCR(out); + auto date_dtype = reinterpret_cast(descr->c_metadata); + + switch (timestamp_type->unit) { + case arrow::TimestampType::Unit::SECOND: + date_dtype->meta.base = NPY_FR_s; + break; + case arrow::TimestampType::Unit::MILLI: + date_dtype->meta.base = NPY_FR_ms; + break; + case arrow::TimestampType::Unit::MICRO: + date_dtype->meta.base = NPY_FR_us; + break; + case arrow::TimestampType::Unit::NANO: + date_dtype->meta.base = NPY_FR_ns; + break; + } + } +} + template class ArrowDeserializer { public: @@ -522,6 +610,8 @@ class ArrowDeserializer { return Status::OK(); } + set_numpy_metadata(type, col_->type().get(), out_); + return Status::OK(); } @@ -538,6 +628,8 @@ class ArrowDeserializer { return Status::OK(); } + set_numpy_metadata(type, col_->type().get(), out_); + if (PyArray_SetBaseObject(out_, py_ref_) == -1) { // Error occurred, trust that SetBaseObject set the error state return Status::OK(); @@ -713,6 +805,7 @@ Status ArrowToPandas(const std::shared_ptr& col, PyObject* py_ref, FROM_ARROW_CASE(FLOAT); FROM_ARROW_CASE(DOUBLE); FROM_ARROW_CASE(STRING); + FROM_ARROW_CASE(TIMESTAMP); default: return Status::NotImplemented("Arrow type reading not implemented"); } From e197b2d6e41d0cf6be7c097d6b09c3be29d12cc0 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Mon, 29 Aug 2016 16:08:23 -0700 Subject: [PATCH 0125/1644] ARROW-279: rename vector module to arrow-vector Author: Julien Le Dem Closes #127 from julienledem/rename_vector and squashes the following commits: cf8a2aa [Julien Le Dem] ARROW-279: rename vector module to arrow-vector --- java/memory/pom.xml | 2 +- java/vector/pom.xml | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/java/memory/pom.xml b/java/memory/pom.xml index 44332f5ed14a8..b91b5981559c3 100644 --- a/java/memory/pom.xml +++ b/java/memory/pom.xml @@ -18,7 +18,7 @@ 0.1-SNAPSHOT arrow-memory - arrow-memory + Arrow Memory diff --git a/java/vector/pom.xml b/java/vector/pom.xml index 1a2921f6ea521..08f9bc8da4e2c 100644 --- a/java/vector/pom.xml +++ b/java/vector/pom.xml @@ -17,8 +17,8 @@ arrow-java-root 0.1-SNAPSHOT - vector - vectors + arrow-vector + Arrow Vectors From 2d8ec789365f3c0f82b1f22d76160d5af150dd31 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Tue, 6 Sep 2016 11:46:56 -0700 Subject: [PATCH 0126/1644] ARROW-274: Add NullableMapVector to support nullable maps Author: Julien Le Dem Closes #128 from julienledem/nullable_map and squashes the following commits: d98580a [Julien Le Dem] review feedback ee1dd45 [Julien Le Dem] Fix complex writers/readers 8780f48 [Julien Le Dem] ARROW-274: Add NullableMapVector to support nullable maps --- .../main/codegen/templates/MapWriters.java | 55 ++-- .../codegen/templates/UnionListWriter.java | 2 + .../main/codegen/templates/UnionVector.java | 6 +- .../main/codegen/templates/UnionWriter.java | 2 +- .../apache/arrow/vector/NullableVector.java | 2 +- .../apache/arrow/vector/VectorUnloader.java | 4 +- 
.../arrow/vector/complex/MapVector.java | 53 +--- .../vector/complex/NullableMapVector.java | 260 ++++++++++++++++++ .../complex/impl/AbstractBaseReader.java | 7 +- .../complex/impl/ComplexWriterImpl.java | 11 +- .../complex/impl/NullableMapReaderImpl.java | 45 +++ .../complex/impl/SingleMapReaderImpl.java | 4 +- .../arrow/vector/schema/TypeLayout.java | 3 +- .../org/apache/arrow/vector/types/Types.java | 8 +- .../arrow/vector/TestVectorUnloadLoad.java | 5 +- .../complex/impl/TestPromotableWriter.java | 4 +- .../complex/writer/TestComplexWriter.java | 33 ++- .../arrow/vector/file/TestArrowFile.java | 39 +-- .../apache/arrow/vector/pojo/TestConvert.java | 2 - 19 files changed, 408 insertions(+), 137 deletions(-) create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/impl/NullableMapReaderImpl.java diff --git a/java/vector/src/main/codegen/templates/MapWriters.java b/java/vector/src/main/codegen/templates/MapWriters.java index 8a8983a1497cc..7f319a9ca34d8 100644 --- a/java/vector/src/main/codegen/templates/MapWriters.java +++ b/java/vector/src/main/codegen/templates/MapWriters.java @@ -17,14 +17,13 @@ */ <@pp.dropOutputFile /> -<#list ["Single"] as mode> +<#list ["Nullable", "Single"] as mode> <@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/${mode}MapWriter.java" /> +<#assign index = "idx()"> <#if mode == "Single"> <#assign containerClass = "MapVector" /> -<#assign index = "idx()"> <#else> -<#assign containerClass = "RepeatedMapVector" /> -<#assign index = "currentChildIndex"> +<#assign containerClass = "NullableMapVector" /> <#include "/@includes/license.ftl" /> @@ -49,9 +48,13 @@ public class ${mode}MapWriter extends AbstractFieldWriter { protected final ${containerClass} container; private final Map fields = Maps.newHashMap(); - <#if mode == "Repeated">private int currentChildIndex = 0; public ${mode}MapWriter(${containerClass} container) { + <#if mode == "Single"> + if (container instanceof NullableMapVector) { + throw new IllegalArgumentException("Invalid container: " + container); + } + this.container = container; } @@ -75,12 +78,12 @@ public MapWriter map(String name) { FieldWriter writer = fields.get(name.toLowerCase()); if(writer == null){ int vectorCount=container.size(); - MapVector vector = container.addOrGet(name, MinorType.MAP, MapVector.class); + NullableMapVector vector = container.addOrGet(name, MinorType.MAP, NullableMapVector.class); writer = new PromotableWriter(vector, container); if(vectorCount != container.size()) { writer.allocate(); } - writer.setPosition(${index}); + writer.setPosition(idx()); fields.put(name.toLowerCase(), writer); } return writer; @@ -117,40 +120,12 @@ public ListWriter list(String name) { if (container.size() > vectorCount) { writer.allocate(); } - writer.setPosition(${index}); + writer.setPosition(idx()); fields.put(name.toLowerCase(), writer); } return writer; } - <#if mode == "Repeated"> - public void start() { - // update the repeated vector to state that there is current+1 objects. - final RepeatedMapHolder h = new RepeatedMapHolder(); - final RepeatedMapVector map = (RepeatedMapVector) container; - final RepeatedMapVector.Mutator mutator = map.getMutator(); - - // Make sure that the current vector can support the end position of this list. 
- if(container.getValueCapacity() <= idx()) { - mutator.setValueCount(idx()+1); - } - - map.getAccessor().get(idx(), h); - if (h.start >= h.end) { - container.getMutator().startNewValue(idx()); - } - currentChildIndex = container.getMutator().add(idx()); - for(final FieldWriter w : fields.values()) { - w.setPosition(currentChildIndex); - } - } - - - public void end() { - // noop - } - <#else> - public void setValueCount(int count) { container.getMutator().setValueCount(count); } @@ -165,14 +140,16 @@ public void setPosition(int index) { @Override public void start() { + <#if mode == "Single"> + <#else> + container.getMutator().setIndexDefined(idx()); + } @Override public void end() { } - - <#list vv.types as type><#list type.minor as minor> <#assign lowerName = minor.class?uncap_first /> <#if lowerName == "int" ><#assign lowerName = "integer" /> @@ -204,7 +181,7 @@ public void end() { if (currentVector == null || currentVector != vector) { vector.allocateNewSafe(); } - writer.setPosition(${index}); + writer.setPosition(idx()); fields.put(name.toLowerCase(), writer); } return writer; diff --git a/java/vector/src/main/codegen/templates/UnionListWriter.java b/java/vector/src/main/codegen/templates/UnionListWriter.java index 49d57e716bc8a..d502803d71616 100644 --- a/java/vector/src/main/codegen/templates/UnionListWriter.java +++ b/java/vector/src/main/codegen/templates/UnionListWriter.java @@ -160,11 +160,13 @@ public void start() { vector.getMutator().setNotNull(idx()); offsets.getMutator().setSafe(idx() + 1, nextOffset); writer.setPosition(nextOffset); + writer.start(); } @Override public void end() { // if (inMap) { + writer.end(); inMap = false; final int nextOffset = offsets.getAccessor().get(idx() + 1); offsets.getMutator().setSafe(idx() + 1, nextOffset + 1); diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java index 72125fa50fb82..3014bbba9d52d 100644 --- a/java/vector/src/main/codegen/templates/UnionVector.java +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -72,7 +72,7 @@ public class UnionVector implements FieldVector { MapVector internalMap; UInt1Vector typeVector; - private MapVector mapVector; + private NullableMapVector mapVector; private ListVector listVector; private FieldReader reader; @@ -127,10 +127,10 @@ public List getFieldInnerVectors() { throw new UnsupportedOperationException(); } - public MapVector getMap() { + public NullableMapVector getMap() { if (mapVector == null) { int vectorCount = internalMap.size(); - mapVector = internalMap.addOrGet("map", MinorType.MAP, MapVector.class); + mapVector = internalMap.addOrGet("map", MinorType.MAP, NullableMapVector.class); if (internalMap.size() > vectorCount) { mapVector.allocateNew(); if (callBack != null) { diff --git a/java/vector/src/main/codegen/templates/UnionWriter.java b/java/vector/src/main/codegen/templates/UnionWriter.java index 1137e2cb0207a..460ec1c0d9586 100644 --- a/java/vector/src/main/codegen/templates/UnionWriter.java +++ b/java/vector/src/main/codegen/templates/UnionWriter.java @@ -74,7 +74,7 @@ public void endList() { private MapWriter getMapWriter() { if (mapWriter == null) { - mapWriter = new SingleMapWriter(data.getMap()); + mapWriter = new NullableMapWriter(data.getMap()); mapWriter.setPosition(idx()); writers.add(mapWriter); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/NullableVector.java b/java/vector/src/main/java/org/apache/arrow/vector/NullableVector.java index 00c33fc2d6e6c..0212b3c0d7b95 
100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/NullableVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/NullableVector.java @@ -17,7 +17,7 @@ */ package org.apache.arrow.vector; -public interface NullableVector extends ValueVector{ +public interface NullableVector extends ValueVector { ValueVector getValuesVector(); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java b/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java index e4d37bf47d114..3375a7d5c311b 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java @@ -68,7 +68,9 @@ private void appendNodes(FieldVector vector, List nodes, List fieldBuffers = vector.getFieldBuffers(); List expectedBuffers = vector.getField().getTypeLayout().getVectorTypes(); if (fieldBuffers.size() != expectedBuffers.size()) { - throw new IllegalArgumentException("wrong number of buffers for field " + vector.getField() + ". found: " + fieldBuffers); + throw new IllegalArgumentException(String.format( + "wrong number of buffers for field %s in vector %s. found: %s", + vector.getField(), vector.getClass().getSimpleName(), fieldBuffers)); } buffers.addAll(fieldBuffers); for (FieldVector child : vector.getChildrenFromFields()) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java index e3696588e6006..1b8483a3d41be 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java @@ -18,9 +18,7 @@ package org.apache.arrow.vector.complex; import java.util.ArrayList; -import java.util.Arrays; import java.util.Collection; -import java.util.Collections; import java.util.Iterator; import java.util.List; import java.util.Map; @@ -28,15 +26,12 @@ import javax.annotation.Nullable; import org.apache.arrow.memory.BufferAllocator; -import org.apache.arrow.vector.BaseDataValueVector; import org.apache.arrow.vector.BaseValueVector; -import org.apache.arrow.vector.BufferBacked; import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.ValueVector; import org.apache.arrow.vector.complex.impl.SingleMapReaderImpl; import org.apache.arrow.vector.complex.reader.FieldReader; import org.apache.arrow.vector.holders.ComplexHolder; -import org.apache.arrow.vector.schema.ArrowFieldNode; import org.apache.arrow.vector.types.Types; import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.ArrowType.Tuple; @@ -49,26 +44,20 @@ import com.google.common.collect.Ordering; import com.google.common.primitives.Ints; -import io.netty.buffer.ArrowBuf; - -public class MapVector extends AbstractMapVector implements FieldVector { +public class MapVector extends AbstractMapVector { //private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(MapVector.class); - private final SingleMapReaderImpl reader = new SingleMapReaderImpl(MapVector.this); + private final SingleMapReaderImpl reader = new SingleMapReaderImpl(this); private final Accessor accessor = new Accessor(); private final Mutator mutator = new Mutator(); int valueCount; - // TODO: validity vector - private final List innerVectors = Collections.unmodifiableList(Arrays.asList()); - - public MapVector(String name, BufferAllocator allocator, CallBack callBack){ + public MapVector(String 
name, BufferAllocator allocator, CallBack callBack) { super(name, allocator, callBack); } @Override public FieldReader getReader() { - //return new SingleMapReaderImpl(MapVector.this); return reader; } @@ -124,18 +113,9 @@ public int getBufferSizeFor(final int valueCount) { return (int) bufferSize; } - @Override - public ArrowBuf[] getBuffers(boolean clear) { - int expectedSize = getBufferSize(); - int actualSize = super.getBufferSize(); - - Preconditions.checkArgument(expectedSize == actualSize, expectedSize + " != " + actualSize); - return super.getBuffers(clear); - } - @Override public TransferPair getTransferPair(BufferAllocator allocator) { - return new MapTransferPair(this, name, allocator); + return new MapTransferPair(this, new MapVector(name, allocator, callBack), false); } @Override @@ -145,7 +125,7 @@ public TransferPair makeTransferPair(ValueVector to) { @Override public TransferPair getTransferPair(String ref, BufferAllocator allocator) { - return new MapTransferPair(this, ref, allocator); + return new MapTransferPair(this, new MapVector(ref, allocator, callBack), false); } protected static class MapTransferPair implements TransferPair{ @@ -153,10 +133,6 @@ protected static class MapTransferPair implements TransferPair{ private final MapVector from; private final MapVector to; - public MapTransferPair(MapVector from, String name, BufferAllocator allocator) { - this(from, new MapVector(name, allocator, from.callBack), false); - } - public MapTransferPair(MapVector from, MapVector to) { this(from, to, true); } @@ -335,7 +311,6 @@ public void close() { super.close(); } - @Override public void initializeChildrenFromFields(List children) { for (Field field : children) { MinorType minorType = Types.getMinorTypeForArrowType(field.getType()); @@ -344,25 +319,9 @@ public void initializeChildrenFromFields(List children) { } } - @Override + public List getChildrenFromFields() { return getChildren(); } - @Override - public void loadFieldBuffers(ArrowFieldNode fieldNode, List ownBuffers) { - BaseDataValueVector.load(getFieldInnerVectors(), ownBuffers); - // TODO: something with fieldNode? - } - - @Override - public List getFieldBuffers() { - return BaseDataValueVector.unload(getFieldInnerVectors()); - } - - @Override - public List getFieldInnerVectors() { - return innerVectors; - } - } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java new file mode 100644 index 0000000000000..6b257c095d28e --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java @@ -0,0 +1,260 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector.complex; + +import static com.google.common.base.Preconditions.checkNotNull; + +import java.util.Arrays; +import java.util.Collections; +import java.util.List; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.BaseDataValueVector; +import org.apache.arrow.vector.BufferBacked; +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.NullableVectorDefinitionSetter; +import org.apache.arrow.vector.UInt1Vector; +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.complex.impl.NullableMapReaderImpl; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.holders.ComplexHolder; +import org.apache.arrow.vector.schema.ArrowFieldNode; +import org.apache.arrow.vector.util.CallBack; +import org.apache.arrow.vector.util.TransferPair; + +import com.google.common.collect.ObjectArrays; + +import io.netty.buffer.ArrowBuf; + +public class NullableMapVector extends MapVector implements FieldVector { + + private final NullableMapReaderImpl reader = new NullableMapReaderImpl(this); + + protected final UInt1Vector bits; + + private final List innerVectors; + + private final Accessor accessor; + private final Mutator mutator; + + public NullableMapVector(String name, BufferAllocator allocator, CallBack callBack) { + super(name, checkNotNull(allocator), callBack); + this.bits = new UInt1Vector("$bits$", allocator); + this.innerVectors = Collections.unmodifiableList(Arrays.asList(bits)); + this.accessor = new Accessor(); + this.mutator = new Mutator(); + } + + @Override + public void loadFieldBuffers(ArrowFieldNode fieldNode, List ownBuffers) { + BaseDataValueVector.load(getFieldInnerVectors(), ownBuffers); + this.valueCount = fieldNode.getLength(); + } + + @Override + public List getFieldBuffers() { + return BaseDataValueVector.unload(getFieldInnerVectors()); + } + + @Override + public List getFieldInnerVectors() { + return innerVectors; + } + + @Override + public FieldReader getReader() { + return reader; + } + + @Override + public TransferPair getTransferPair(BufferAllocator allocator) { + return new NullableMapTransferPair(this, new NullableMapVector(name, allocator, callBack), false); + } + + @Override + public TransferPair makeTransferPair(ValueVector to) { + return new NullableMapTransferPair(this, (NullableMapVector) to, true); + } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator) { + return new NullableMapTransferPair(this, new NullableMapVector(ref, allocator, callBack), false); + } + + protected class NullableMapTransferPair extends MapTransferPair { + + private NullableMapVector target; + + protected NullableMapTransferPair(NullableMapVector from, NullableMapVector to, boolean allocate) { + super(from, to, allocate); + this.target = to; + } + + @Override + public void transfer() { + bits.transferTo(target.bits); + super.transfer(); + } + + @Override + public void copyValueSafe(int fromIndex, int toIndex) { + target.bits.copyFromSafe(fromIndex, toIndex, bits); + super.copyValueSafe(fromIndex, toIndex); + } + + @Override + public void splitAndTransfer(int startIndex, int length) { + bits.splitAndTransferTo(startIndex, length, target.bits); + super.splitAndTransfer(startIndex, length); + } + } + + @Override + public int getValueCapacity() { + return Math.min(bits.getValueCapacity(), super.getValueCapacity()); + } + + @Override + public ArrowBuf[] getBuffers(boolean clear) { + return 
ObjectArrays.concat(bits.getBuffers(clear), super.getBuffers(clear), ArrowBuf.class); + } + + @Override + public void close() { + bits.close(); + super.close(); + } + + @Override + public void clear() { + bits.clear(); + super.clear(); + } + + + @Override + public int getBufferSize(){ + return super.getBufferSize() + bits.getBufferSize(); + } + + @Override + public int getBufferSizeFor(final int valueCount) { + if (valueCount == 0) { + return 0; + } + return super.getBufferSizeFor(valueCount) + + bits.getBufferSizeFor(valueCount); + } + + @Override + public void setInitialCapacity(int numRecords) { + bits.setInitialCapacity(numRecords); + super.setInitialCapacity(numRecords); + } + + @Override + public boolean allocateNewSafe() { + /* Boolean to keep track if all the memory allocations were successful + * Used in the case of composite vectors when we need to allocate multiple + * buffers for multiple vectors. If one of the allocations failed we need to + * clear all the memory that we allocated + */ + boolean success = false; + try { + success = super.allocateNewSafe() && bits.allocateNewSafe(); + } finally { + if (!success) { + clear(); + } + } + bits.zeroVector(); + return success; + } + public final class Accessor extends MapVector.Accessor { + final UInt1Vector.Accessor bAccessor = bits.getAccessor(); + + @Override + public Object getObject(int index) { + if (isNull(index)) { + return null; + } else { + return super.getObject(index); + } + } + + @Override + public void get(int index, ComplexHolder holder) { + holder.isSet = isSet(index); + super.get(index, holder); + } + + @Override + public boolean isNull(int index) { + return isSet(index) == 0; + } + + public int isSet(int index){ + return bAccessor.get(index); + } + + } + + public final class Mutator extends MapVector.Mutator implements NullableVectorDefinitionSetter { + + private Mutator(){ + } + + @Override + public void setIndexDefined(int index){ + bits.getMutator().setSafe(index, 1); + } + + public void setNull(int index){ + bits.getMutator().setSafe(index, 0); + } + + @Override + public void setValueCount(int valueCount) { + assert valueCount >= 0; + super.setValueCount(valueCount); + bits.getMutator().setValueCount(valueCount); + } + + @Override + public void generateTestData(int valueCount){ + super.generateTestData(valueCount); + bits.getMutator().generateTestDataAlt(valueCount); + } + + @Override + public void reset(){ + bits.getMutator().setValueCount(0); + } + + } + + @Override + public Accessor getAccessor() { + return accessor; + } + + @Override + public Mutator getMutator() { + return mutator; + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseReader.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseReader.java index 259a954233c06..e7c3c8c7e4b42 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseReader.java @@ -19,15 +19,10 @@ import java.util.Iterator; -import com.google.flatbuffers.FlatBufferBuilder; -import org.apache.arrow.flatbuf.Type; -import org.apache.arrow.flatbuf.Union; -import org.apache.arrow.flatbuf.UnionMode; import org.apache.arrow.vector.complex.reader.FieldReader; import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter; import org.apache.arrow.vector.complex.writer.FieldWriter; import org.apache.arrow.vector.holders.UnionHolder; -import org.apache.arrow.vector.types.pojo.Field; abstract 
class AbstractBaseReader implements FieldReader{ @@ -44,7 +39,7 @@ public void setPosition(int index){ this.index = index; } - int idx(){ + protected int idx(){ return index; } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java index 89bfefc8f19e3..761b1b43c08aa 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java @@ -19,6 +19,7 @@ import org.apache.arrow.vector.complex.ListVector; import org.apache.arrow.vector.complex.MapVector; +import org.apache.arrow.vector.complex.NullableMapVector; import org.apache.arrow.vector.complex.StateTool; import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter; import org.apache.arrow.vector.types.Types.MinorType; @@ -29,7 +30,7 @@ public class ComplexWriterImpl extends AbstractFieldWriter implements ComplexWriter { // private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ComplexWriterImpl.class); - private SingleMapWriter mapRoot; + private NullableMapWriter mapRoot; private UnionListWriter listRoot; private final MapVector container; @@ -121,8 +122,8 @@ public MapWriter directMap(){ switch(mode){ case INIT: - MapVector map = (MapVector) container; - mapRoot = new SingleMapWriter(map); + NullableMapVector map = (NullableMapVector) container; + mapRoot = new NullableMapWriter(map); mapRoot.setPosition(idx()); mode = Mode.MAP; break; @@ -142,8 +143,8 @@ public MapWriter rootAsMap() { switch(mode){ case INIT: - MapVector map = container.addOrGet(name, MinorType.MAP, MapVector.class); - mapRoot = new SingleMapWriter(map); + NullableMapVector map = container.addOrGet(name, MinorType.MAP, NullableMapVector.class); + mapRoot = new NullableMapWriter(map); mapRoot.setPosition(idx()); mode = Mode.MAP; break; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/NullableMapReaderImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/NullableMapReaderImpl.java new file mode 100644 index 0000000000000..18b35c194a184 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/NullableMapReaderImpl.java @@ -0,0 +1,45 @@ +/******************************************************************************* + + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ ******************************************************************************/ +package org.apache.arrow.vector.complex.impl; + +import org.apache.arrow.vector.complex.MapVector; +import org.apache.arrow.vector.complex.NullableMapVector; +import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; + +public class NullableMapReaderImpl extends SingleMapReaderImpl { + + private NullableMapVector nullableMapVector; + + public NullableMapReaderImpl(MapVector vector) { + super((NullableMapVector)vector); + this.nullableMapVector = (NullableMapVector)vector; + } + + @Override + public void copyAsValue(MapWriter writer){ + NullableMapWriter impl = (NullableMapWriter) writer; + impl.container.copyFromSafe(idx(), impl.idx(), nullableMapVector); + } + + @Override + public void copyAsField(String name, MapWriter writer){ + NullableMapWriter impl = (NullableMapWriter) writer.map(name); + impl.container.copyFromSafe(idx(), impl.idx(), nullableMapVector); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleMapReaderImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleMapReaderImpl.java index 1c43240901c4f..ae17b4bbb10dd 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleMapReaderImpl.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/SingleMapReaderImpl.java @@ -1,5 +1,3 @@ - - /******************************************************************************* * Licensed to the Apache Software Foundation (ASF) under one @@ -27,9 +25,9 @@ import org.apache.arrow.vector.complex.MapVector; import org.apache.arrow.vector.complex.reader.FieldReader; import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; +import org.apache.arrow.vector.types.Types.MinorType; import com.google.common.collect.Maps; -import org.apache.arrow.vector.types.Types.MinorType; @SuppressWarnings("unused") public class SingleMapReaderImpl extends AbstractFieldReader{ diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java index 15cd49865bdce..9f1efd056cb08 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java @@ -90,8 +90,7 @@ public static TypeLayout getTypeLayout(final ArrowType arrowType) { @Override public TypeLayout visit(Tuple type) { List vectors = asList( - // TODO: add validity vector in Map -// validityVector() + validityVector() ); return new TypeLayout(vectors); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java index 4d0d9ee114ad8..5eef8a008a923 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -47,7 +47,7 @@ import org.apache.arrow.vector.ValueVector; import org.apache.arrow.vector.ZeroVector; import org.apache.arrow.vector.complex.ListVector; -import org.apache.arrow.vector.complex.MapVector; +import org.apache.arrow.vector.complex.NullableMapVector; import org.apache.arrow.vector.complex.UnionVector; import org.apache.arrow.vector.complex.impl.BigIntWriterImpl; import org.apache.arrow.vector.complex.impl.BitWriterImpl; @@ -58,7 +58,7 @@ import org.apache.arrow.vector.complex.impl.IntWriterImpl; import org.apache.arrow.vector.complex.impl.IntervalDayWriterImpl; import 
org.apache.arrow.vector.complex.impl.IntervalYearWriterImpl; -import org.apache.arrow.vector.complex.impl.SingleMapWriter; +import org.apache.arrow.vector.complex.impl.NullableMapWriter; import org.apache.arrow.vector.complex.impl.SmallIntWriterImpl; import org.apache.arrow.vector.complex.impl.TimeStampWriterImpl; import org.apache.arrow.vector.complex.impl.TimeWriterImpl; @@ -139,12 +139,12 @@ public Field getField() { @Override public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new MapVector(name, allocator, callBack); + return new NullableMapVector(name, allocator, callBack); } @Override public FieldWriter getNewFieldWriter(ValueVector vector) { - return new SingleMapWriter((MapVector) vector); + return new NullableMapWriter((NullableMapVector) vector); } }, // an empty map column. Useful for conceptual setup. Children listed within here diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java b/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java index 85bb2cfc99f81..7dcb8977c0d7f 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java @@ -22,6 +22,7 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; import org.apache.arrow.vector.complex.MapVector; +import org.apache.arrow.vector.complex.NullableMapVector; import org.apache.arrow.vector.complex.impl.ComplexWriterImpl; import org.apache.arrow.vector.complex.impl.SingleMapReaderImpl; import org.apache.arrow.vector.complex.reader.BaseReader.MapReader; @@ -60,14 +61,14 @@ public void test() throws IOException { } writer.setValueCount(count); - VectorUnloader vectorUnloader = new VectorUnloader((MapVector)parent.getChild("root")); + VectorUnloader vectorUnloader = new VectorUnloader(parent.getChild("root")); schema = vectorUnloader.getSchema(); try ( ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch(); BufferAllocator finalVectorsAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); MapVector newParent = new MapVector("parent", finalVectorsAllocator, null)) { - MapVector root = newParent.addOrGet("root", MinorType.MAP, MapVector.class); + FieldVector root = newParent.addOrGet("root", MinorType.MAP, NullableMapVector.class); VectorLoader vectorLoader = new VectorLoader(schema, root); vectorLoader.load(recordBatch); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java index 24f00f14df001..689c96fda9202 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java @@ -25,8 +25,8 @@ import org.apache.arrow.vector.DirtyRootAllocator; import org.apache.arrow.vector.complex.AbstractMapVector; import org.apache.arrow.vector.complex.MapVector; +import org.apache.arrow.vector.complex.NullableMapVector; import org.apache.arrow.vector.complex.UnionVector; -import org.apache.arrow.vector.holders.UInt4Holder; import org.apache.arrow.vector.types.Types.MinorType; import org.junit.After; import org.junit.Before; @@ -51,7 +51,7 @@ public void terminate() throws Exception { public void testPromoteToUnion() throws Exception { try (final AbstractMapVector container = 
new MapVector(EMPTY_SCHEMA_PATH, allocator, null); - final MapVector v = container.addOrGet("test", MinorType.MAP, MapVector.class); + final NullableMapVector v = container.addOrGet("test", MinorType.MAP, NullableMapVector.class); final PromotableWriter writer = new PromotableWriter(v, container)) { container.allocateNew(); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java index bc17a2b2835c2..fa710dae5eee8 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java @@ -17,7 +17,6 @@ */ package org.apache.arrow.vector.complex.writer; -import io.netty.buffer.ArrowBuf; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; import org.apache.arrow.vector.complex.ListVector; @@ -41,6 +40,8 @@ import org.junit.Assert; import org.junit.Test; +import io.netty.buffer.ArrowBuf; + public class TestComplexWriter { static final BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); @@ -71,6 +72,36 @@ public void simpleNestedTypes() { parent.close(); } + @Test + public void nullableMap() { + MapVector parent = new MapVector("parent", allocator, null); + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + MapWriter mapWriter = rootWriter.map("map"); + BigIntWriter nested = mapWriter.bigInt("nested"); + for (int i = 0; i < COUNT; i++) { + if (i % 2 == 0) { + mapWriter.setPosition(i); + mapWriter.start(); + nested.writeBigInt(i); + mapWriter.end(); + } + } + writer.setValueCount(COUNT); + MapReader rootReader = new SingleMapReaderImpl(parent).reader("root"); + for (int i = 0; i < COUNT; i++) { + rootReader.setPosition(i); + if (i % 2 == 0) { + Assert.assertNotNull(rootReader.reader("map").readObject()); + Assert.assertEquals(i, rootReader.reader("map").reader("nested").readLong().longValue()); + } else { + Assert.assertNull(rootReader.reader("map").readObject()); + } + } + + parent.close(); + } + @Test public void listScalarType() { ListVector listVector = new ListVector("list", allocator, null); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java index 11de0a2ef00a0..ad301689cd1e2 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java @@ -31,6 +31,7 @@ import org.apache.arrow.vector.VectorLoader; import org.apache.arrow.vector.VectorUnloader; import org.apache.arrow.vector.complex.MapVector; +import org.apache.arrow.vector.complex.NullableMapVector; import org.apache.arrow.vector.complex.impl.ComplexWriterImpl; import org.apache.arrow.vector.complex.impl.SingleMapReaderImpl; import org.apache.arrow.vector.complex.reader.BaseReader.MapReader; @@ -47,10 +48,13 @@ import org.junit.Assert; import org.junit.Before; import org.junit.Test; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; import io.netty.buffer.ArrowBuf; public class TestArrowFile { + private static final Logger LOGGER = LoggerFactory.getLogger(TestArrowFile.class); private static final int COUNT = 10; private BufferAllocator allocator; @@ -72,7 +76,7 @@ public void testWrite() throws IOException { BufferAllocator vectorAllocator = 
allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); MapVector parent = new MapVector("parent", vectorAllocator, null)) { writeData(count, parent); - write((MapVector)parent.getChild("root"), file); + write(parent.getChild("root"), file); } } @@ -82,10 +86,10 @@ public void testWriteComplex() throws IOException { int count = COUNT; try ( BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); - MapVector parent = new MapVector("parent", vectorAllocator, null)) { + NullableMapVector parent = new NullableMapVector("parent", vectorAllocator, null)) { writeComplexData(count, parent); validateComplexContent(count, parent); - write((MapVector)parent.getChild("root"), file); + write(parent.getChild("root"), file); } } @@ -147,7 +151,7 @@ public void testWriteRead() throws IOException { BufferAllocator originalVectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); MapVector parent = new MapVector("parent", originalVectorAllocator, null)) { writeData(count, parent); - write((MapVector)parent.getChild("root"), file); + write(parent.getChild("root"), file); } // read @@ -160,11 +164,11 @@ public void testWriteRead() throws IOException { ) { ArrowFooter footer = arrowReader.readFooter(); Schema schema = footer.getSchema(); - System.out.println("reading schema: " + schema); + LOGGER.debug("reading schema: " + schema); // initialize vectors - MapVector root = parent.addOrGet("root", MinorType.MAP, MapVector.class); + NullableMapVector root = parent.addOrGet("root", MinorType.MAP, NullableMapVector.class); VectorLoader vectorLoader = new VectorLoader(schema, root); @@ -204,7 +208,7 @@ public void testWriteReadComplex() throws IOException { BufferAllocator originalVectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); MapVector parent = new MapVector("parent", originalVectorAllocator, null)) { writeComplexData(count, parent); - write((MapVector)parent.getChild("root"), file); + write(parent.getChild("root"), file); } // read @@ -213,16 +217,15 @@ public void testWriteReadComplex() throws IOException { FileInputStream fileInputStream = new FileInputStream(file); ArrowReader arrowReader = new ArrowReader(fileInputStream.getChannel(), readerAllocator); BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); - MapVector parent = new MapVector("parent", vectorAllocator, null) + NullableMapVector parent = new NullableMapVector("parent", vectorAllocator, null) ) { ArrowFooter footer = arrowReader.readFooter(); Schema schema = footer.getSchema(); - System.out.println("reading schema: " + schema); + LOGGER.debug("reading schema: " + schema); // initialize vectors - MapVector root = parent.addOrGet("root", MinorType.MAP, MapVector.class); - + NullableMapVector root = parent.addOrGet("root", MinorType.MAP, NullableMapVector.class); VectorLoader vectorLoader = new VectorLoader(schema, root); List recordBatches = footer.getRecordBatches(); @@ -237,16 +240,16 @@ public void testWriteReadComplex() throws IOException { public void printVectors(List vectors) { for (FieldVector vector : vectors) { - System.out.println(vector.getField().getName()); + LOGGER.debug(vector.getField().getName()); Accessor accessor = vector.getAccessor(); int valueCount = accessor.getValueCount(); for (int i = 0; i < valueCount; i++) { - System.out.println(accessor.getObject(i)); + LOGGER.debug(String.valueOf(accessor.getObject(i))); } } } - private void 
validateComplexContent(int count, MapVector parent) { + private void validateComplexContent(int count, NullableMapVector parent) { printVectors(parent.getChildrenFromFields()); MapReader rootReader = new SingleMapReaderImpl(parent).reader("root"); @@ -259,10 +262,10 @@ private void validateComplexContent(int count, MapVector parent) { } } - private void write(MapVector parent, File file) throws FileNotFoundException, IOException { + private void write(FieldVector parent, File file) throws FileNotFoundException, IOException { VectorUnloader vectorUnloader = new VectorUnloader(parent); Schema schema = vectorUnloader.getSchema(); - System.out.println("writing schema: " + schema); + LOGGER.debug("writing schema: " + schema); try ( FileOutputStream fileOutputStream = new FileOutputStream(file); ArrowWriter arrowWriter = new ArrowWriter(fileOutputStream.getChannel(), schema); @@ -308,8 +311,8 @@ public void testWriteReadMultipleRBs() throws IOException { ) { ArrowFooter footer = arrowReader.readFooter(); Schema schema = footer.getSchema(); - System.out.println("reading schema: " + schema); - MapVector root = parent.addOrGet("root", MinorType.MAP, MapVector.class); + LOGGER.debug("reading schema: " + schema); + NullableMapVector root = parent.addOrGet("root", MinorType.MAP, NullableMapVector.class); VectorLoader vectorLoader = new VectorLoader(schema, root); List recordBatches = footer.getRecordBatches(); Assert.assertEquals(2, recordBatches.size()); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java index e557cc84f3bae..61327f1970e83 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java @@ -22,8 +22,6 @@ import static org.junit.Assert.assertEquals; import org.apache.arrow.flatbuf.UnionMode; -import static org.junit.Assert.assertEquals; - import org.apache.arrow.vector.types.pojo.ArrowType.FloatingPoint; import org.apache.arrow.vector.types.pojo.ArrowType.Int; import org.apache.arrow.vector.types.pojo.ArrowType.List; From 637584becb2db88fc510824c22b87e6effb2232f Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 6 Sep 2016 23:59:30 -0400 Subject: [PATCH 0127/1644] ARROW-284: Disable arrow_parquet module in Travis CI to triage builds Author: Wes McKinney Closes #132 from wesm/ARROW-284 and squashes the following commits: e3410cf [Wes McKinney] Install miniconda in $HOME to avoid long prefix issues in conda-build 2.0 9fd94f5 [Wes McKinney] Do not run death test when valgrind is enabled. 
Gracefully skip pyarrow.parquet when ARROW_PARQUET=off ccf56f8 [Wes McKinney] Disable arrow_parquet module in Travis CI --- ci/travis_before_script_cpp.sh | 4 +-- ci/travis_install_conda.sh | 4 ++- ci/travis_script_python.sh | 6 ++-- cpp/cmake_modules/FindParquet.cmake | 1 + cpp/src/arrow/util/memory-pool-test.cc | 6 ++++ python/CMakeLists.txt | 41 ++++++++++++++++---------- python/cmake_modules/FindArrow.cmake | 26 +++++++++------- python/pyarrow/tests/test_io.py | 1 + python/pyarrow/tests/test_parquet.py | 38 ++++++++++++++++-------- python/pyarrow/tests/test_table.py | 7 +---- python/setup.py | 27 ++++++++++------- 11 files changed, 101 insertions(+), 60 deletions(-) diff --git a/ci/travis_before_script_cpp.sh b/ci/travis_before_script_cpp.sh index 08551f3b009a8..2f02ef247af82 100755 --- a/ci/travis_before_script_cpp.sh +++ b/ci/travis_before_script_cpp.sh @@ -25,8 +25,8 @@ echo $GTEST_HOME CMAKE_COMMON_FLAGS="\ -DARROW_BUILD_BENCHMARKS=ON \ --DARROW_PARQUET=ON \ --DARROW_HDFS=on \ +-DARROW_PARQUET=OFF \ +-DARROW_HDFS=ON \ -DCMAKE_INSTALL_PREFIX=$ARROW_CPP_INSTALL" if [ $TRAVIS_OS_NAME == "linux" ]; then diff --git a/ci/travis_install_conda.sh b/ci/travis_install_conda.sh index 3a8f57bf8f1bf..e9225259e6d58 100644 --- a/ci/travis_install_conda.sh +++ b/ci/travis_install_conda.sh @@ -9,7 +9,9 @@ else fi wget -O miniconda.sh $MINICONDA_URL -export MINICONDA=$TRAVIS_BUILD_DIR/miniconda + +export MINICONDA=$HOME/miniconda + bash miniconda.sh -b -p $MINICONDA export PATH="$MINICONDA/bin:$PATH" conda update -y -q conda diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index 4a377428ae43a..61c8e444361df 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -5,7 +5,7 @@ set -e PYTHON_DIR=$TRAVIS_BUILD_DIR/python # Re-use conda installation from C++ -export MINICONDA=$TRAVIS_BUILD_DIR/miniconda +export MINICONDA=$HOME/miniconda export PATH="$MINICONDA/bin:$PATH" export PARQUET_HOME=$MINICONDA @@ -31,7 +31,9 @@ python_version_tests() { # Expensive dependencies install from Continuum package repo conda install -y pip numpy pandas cython - conda install -y parquet-cpp arrow-cpp -c apache/channel/dev + # conda install -y parquet-cpp + + conda install -y arrow-cpp -c apache/channel/dev # Other stuff pip install pip install -r requirements.txt diff --git a/cpp/cmake_modules/FindParquet.cmake b/cpp/cmake_modules/FindParquet.cmake index e3350d6e13da6..36f4828a999d4 100644 --- a/cpp/cmake_modules/FindParquet.cmake +++ b/cpp/cmake_modules/FindParquet.cmake @@ -72,6 +72,7 @@ else () endif () mark_as_advanced( + PARQUET_FOUND PARQUET_INCLUDE_DIR PARQUET_LIBS PARQUET_LIBRARIES diff --git a/cpp/src/arrow/util/memory-pool-test.cc b/cpp/src/arrow/util/memory-pool-test.cc index deb7ffd03ba75..e767e9555244d 100644 --- a/cpp/src/arrow/util/memory-pool-test.cc +++ b/cpp/src/arrow/util/memory-pool-test.cc @@ -46,6 +46,10 @@ TEST(DefaultMemoryPool, OOM) { ASSERT_RAISES(OutOfMemory, pool->Allocate(to_alloc, &data)); } +// Death tests and valgrind are known to not play well 100% of the time. 
See +// googletest documentation +#ifndef ARROW_VALGRIND + TEST(DefaultMemoryPoolDeathTest, FreeLargeMemory) { MemoryPool* pool = default_memory_pool(); @@ -60,4 +64,6 @@ TEST(DefaultMemoryPoolDeathTest, FreeLargeMemory) { pool->Free(data, 100); } +#endif // ARROW_VALGRIND + } // namespace arrow diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index fdbfce99656ca..522895808de5e 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -340,8 +340,10 @@ if (PYARROW_BUILD_TESTS) endif() ## Parquet -find_package(Parquet REQUIRED) -include_directories(SYSTEM ${PARQUET_INCLUDE_DIR}) +find_package(Parquet) +if(PARQUET_FOUND) + include_directories(SYSTEM ${PARQUET_INCLUDE_DIR}) +endif() ## Arrow find_package(Arrow REQUIRED) @@ -350,8 +352,6 @@ ADD_THIRDPARTY_LIB(arrow SHARED_LIB ${ARROW_SHARED_LIB}) ADD_THIRDPARTY_LIB(arrow_io SHARED_LIB ${ARROW_IO_SHARED_LIB}) -ADD_THIRDPARTY_LIB(arrow_parquet - SHARED_LIB ${ARROW_PARQUET_SHARED_LIB}) ############################################################ # Linker setup @@ -418,6 +418,16 @@ endif() add_subdirectory(src/pyarrow) add_subdirectory(src/pyarrow/util) +set(CYTHON_EXTENSIONS + array + config + error + io + scalar + schema + table +) + set(PYARROW_SRCS src/pyarrow/common.cc src/pyarrow/config.cc @@ -431,9 +441,19 @@ set(PYARROW_SRCS set(LINK_LIBS arrow arrow_io - arrow_parquet ) +if(PARQUET_FOUND AND ARROW_PARQUET_FOUND) + ADD_THIRDPARTY_LIB(arrow_parquet + SHARED_LIB ${ARROW_PARQUET_SHARED_LIB}) + set(LINK_LIBS + ${LINK_LIBS} + arrow_parquet) + set(CYTHON_EXTENSIONS + ${CYTHON_EXTENSIONS} + parquet) +endif() + SET(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE) add_library(pyarrow SHARED @@ -448,17 +468,6 @@ endif() # Setup and build Cython modules ############################################################ -set(CYTHON_EXTENSIONS - array - config - error - io - parquet - scalar - schema - table -) - foreach(module ${CYTHON_EXTENSIONS}) string(REPLACE "." 
";" directories ${module}) list(GET directories -1 module_name) diff --git a/python/cmake_modules/FindArrow.cmake b/python/cmake_modules/FindArrow.cmake index 6bd305615fce2..5d5efc431a48f 100644 --- a/python/cmake_modules/FindArrow.cmake +++ b/python/cmake_modules/FindArrow.cmake @@ -52,7 +52,7 @@ find_library(ARROW_IO_LIB_PATH NAMES arrow_io ${ARROW_SEARCH_LIB_PATH} NO_DEFAULT_PATH) -if (ARROW_INCLUDE_DIR AND ARROW_LIB_PATH AND ARROW_PARQUET_LIB_PATH) +if (ARROW_INCLUDE_DIR AND ARROW_LIB_PATH) set(ARROW_FOUND TRUE) set(ARROW_LIB_NAME libarrow) set(ARROW_IO_LIB_NAME libarrow_io) @@ -64,18 +64,9 @@ if (ARROW_INCLUDE_DIR AND ARROW_LIB_PATH AND ARROW_PARQUET_LIB_PATH) set(ARROW_IO_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_IO_LIB_NAME}.a) set(ARROW_IO_SHARED_LIB ${ARROW_LIBS}/${ARROW_IO_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) - - set(ARROW_PARQUET_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_PARQUET_LIB_NAME}.a) - set(ARROW_PARQUET_SHARED_LIB ${ARROW_LIBS}/${ARROW_PARQUET_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) -else () - set(ARROW_FOUND FALSE) -endif () - -if (ARROW_FOUND) if (NOT Arrow_FIND_QUIETLY) message(STATUS "Found the Arrow core library: ${ARROW_LIB_PATH}") message(STATUS "Found the Arrow IO library: ${ARROW_IO_LIB_PATH}") - message(STATUS "Found the Arrow Parquet library: ${ARROW_PARQUET_LIB_PATH}") endif () else () if (NOT Arrow_FIND_QUIETLY) @@ -88,8 +79,23 @@ else () message(STATUS "${ARROW_ERR_MSG}") endif (Arrow_FIND_REQUIRED) endif () + set(ARROW_FOUND FALSE) endif () +if(ARROW_PARQUET_LIB_PATH) + set(ARROW_PARQUET_FOUND TRUE) + set(ARROW_PARQUET_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_PARQUET_LIB_NAME}.a) + set(ARROW_PARQUET_SHARED_LIB ${ARROW_LIBS}/${ARROW_PARQUET_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) + if (NOT Arrow_FIND_QUIETLY) + message(STATUS "Found the Arrow Parquet library: ${ARROW_PARQUET_LIB_PATH}") + endif () +else() + if (NOT Arrow_FIND_QUIETLY) + message(STATUS "Could not find Arrow Parquet library") + endif() + set(ARROW_PARQUET_FOUND FALSE) +endif() + mark_as_advanced( ARROW_INCLUDE_DIR ARROW_LIBS diff --git a/python/pyarrow/tests/test_io.py b/python/pyarrow/tests/test_io.py index 328e923b941a4..eb92e8ea93a1a 100644 --- a/python/pyarrow/tests/test_io.py +++ b/python/pyarrow/tests/test_io.py @@ -46,6 +46,7 @@ def hdfs_test_client(): HDFS_TMP_PATH = '/tmp/pyarrow-test-{0}'.format(random.randint(0, 1000)) + @pytest.fixture(scope='session') def hdfs(request): fixture = hdfs_test_client() diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index d89d947b7b6ac..8a2d8cab57267 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -15,33 +15,45 @@ # specific language governing permissions and limitations # under the License. 
-from pyarrow.compat import unittest -import pyarrow as arrow -import pyarrow.parquet +import pytest -A = arrow +import pyarrow as A import numpy as np -import os.path import pandas as pd import pandas.util.testing as pdt +try: + import pyarrow.parquet as pq + HAVE_PARQUET = True +except ImportError: + HAVE_PARQUET = False +# XXX: Make Parquet tests opt-in rather than skip-if-not-build +parquet = pytest.mark.skipif(not HAVE_PARQUET, + reason='Parquet support not built') + + +@parquet def test_single_pylist_column_roundtrip(tmpdir): for dtype in [int, float]: - filename = tmpdir.join('single_{}_column.parquet'.format(dtype.__name__)) + filename = tmpdir.join('single_{}_column.parquet' + .format(dtype.__name__)) data = [A.from_pylist(list(map(dtype, range(5))))] table = A.Table.from_arrays(('a', 'b'), data, 'table_name') A.parquet.write_table(table, filename.strpath) - table_read = pyarrow.parquet.read_table(filename.strpath) - for col_written, col_read in zip(table.itercolumns(), table_read.itercolumns()): + table_read = pq.read_table(filename.strpath) + for col_written, col_read in zip(table.itercolumns(), + table_read.itercolumns()): assert col_written.name == col_read.name assert col_read.data.num_chunks == 1 data_written = col_written.data.chunk(0) data_read = col_read.data.chunk(0) assert data_written.equals(data_read) + +@parquet def test_pandas_parquet_2_0_rountrip(tmpdir): size = 10000 np.random.seed(0) @@ -58,17 +70,20 @@ def test_pandas_parquet_2_0_rountrip(tmpdir): 'float64': np.arange(size, dtype=np.float64), 'bool': np.random.randn(size) > 0, # Pandas only support ns resolution, Arrow at the moment only ms - 'datetime': np.arange("2016-01-01T00:00:00.001", size, dtype='datetime64[ms]'), + 'datetime': np.arange("2016-01-01T00:00:00.001", size, + dtype='datetime64[ms]'), 'str': [str(x) for x in range(size)], 'str_with_nulls': [None] + [str(x) for x in range(size - 2)] + [None] }) filename = tmpdir.join('pandas_rountrip.parquet') arrow_table = A.from_pandas_dataframe(df, timestamps_to_ms=True) A.parquet.write_table(arrow_table, filename.strpath, version="2.0") - table_read = pyarrow.parquet.read_table(filename.strpath) + table_read = pq.read_table(filename.strpath) df_read = table_read.to_pandas() pdt.assert_frame_equal(df, df_read) + +@parquet def test_pandas_parquet_1_0_rountrip(tmpdir): size = 10000 np.random.seed(0) @@ -88,11 +103,10 @@ def test_pandas_parquet_1_0_rountrip(tmpdir): filename = tmpdir.join('pandas_rountrip.parquet') arrow_table = A.from_pandas_dataframe(df) A.parquet.write_table(arrow_table, filename.strpath, version="1.0") - table_read = pyarrow.parquet.read_table(filename.strpath) + table_read = pq.read_table(filename.strpath) df_read = table_read.to_pandas() # We pass uint32_t as int64_t if we write Parquet version 1.0 df['uint32'] = df['uint32'].values.astype(np.int64) pdt.assert_frame_equal(df, df_read) - diff --git a/python/pyarrow/tests/test_table.py b/python/pyarrow/tests/test_table.py index 83fcbb8faff5d..abf143199fe15 100644 --- a/python/pyarrow/tests/test_table.py +++ b/python/pyarrow/tests/test_table.py @@ -16,11 +16,7 @@ # under the License. 
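The datetime64 columns in the tests above lean on NumPy's integer encoding of timestamps, which the pandas.cc traits added earlier exploit: values are int64 ticks since the Unix epoch in the dtype's unit, and NaT is the int64 minimum. A quick check of that assumption (a sketch, not part of the patch; requires only NumPy):

    import numpy as np

    # datetime64 values are int64 ticks since the Unix epoch, in the dtype's unit.
    ms = np.array(['2007-07-13T01:23:34.123'], dtype='datetime64[ms]')
    print(ms.view('int64'))  # milliseconds since 1970-01-01

    # NaT is the int64 minimum (-2**63), exactly what the
    # npy_traits<NPY_DATETIME>::isnull specialization tests for.
    nat = np.array(['NaT'], dtype='datetime64[ms]')
    assert nat.view('int64')[0] == np.iinfo(np.int64).min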
from pyarrow.compat import unittest -import pyarrow as arrow - -A = arrow - -import pandas as pd +import pyarrow as A class TestRowBatch(unittest.TestCase): @@ -76,4 +72,3 @@ def test_pandas(self): assert set(df.columns) == set(('a', 'b')) assert df.shape == (5, 2) assert df.ix[0, 'b'] == -10 - diff --git a/python/setup.py b/python/setup.py index 59410d75a61e2..a5db2b025e6ef 100644 --- a/python/setup.py +++ b/python/setup.py @@ -97,6 +97,18 @@ def initialize_options(self): _build_ext.initialize_options(self) self.extra_cmake_args = '' + CYTHON_MODULE_NAMES = [ + 'array', + 'config', + 'error', + 'io', + 'parquet', + 'scalar', + 'schema', + 'table'] + + CYTHON_ALLOWED_FAILURES = ['parquet'] + def _run_cmake(self): # The directory containing this setup.py source = osp.dirname(osp.abspath(__file__)) @@ -172,10 +184,13 @@ def _run_cmake(self): # Move the built C-extension to the place expected by the Python build self._found_names = [] - for name in self.get_cmake_cython_names(): + for name in self.CYTHON_MODULE_NAMES: built_path = self.get_ext_built(name) if not os.path.exists(built_path): print(built_path) + if name in self.CYTHON_ALLOWED_FAILURES: + print('Cython module {0} failure permitted'.format(name)) + continue raise RuntimeError('libpyarrow C-extension failed to build:', os.path.abspath(built_path)) @@ -213,16 +228,6 @@ def get_ext_built(self, name): suffix = sysconfig.get_config_var('SO') return name + suffix - def get_cmake_cython_names(self): - return ['array', - 'config', - 'error', - 'io', - 'parquet', - 'scalar', - 'schema', - 'table'] - def get_names(self): return self._found_names From 214b861ae8f40f5fba544247d40c8995b93eca83 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 7 Sep 2016 00:20:51 -0400 Subject: [PATCH 0128/1644] ARROW-283: [C++] Account for upstream changes in parquet-cpp Author: Wes McKinney Closes #131 from wesm/ARROW-283 and squashes the following commits: 52dfb28 [Wes McKinney] Update arrow_parquet for API changes in parquet-cpp --- cpp/src/arrow/parquet/reader.cc | 22 ++++++++++++++-------- 1 file changed, 14 insertions(+), 8 deletions(-) diff --git a/cpp/src/arrow/parquet/reader.cc b/cpp/src/arrow/parquet/reader.cc index 9f6212570dcdc..440ec84e2c74e 100644 --- a/cpp/src/arrow/parquet/reader.cc +++ b/cpp/src/arrow/parquet/reader.cc @@ -149,11 +149,13 @@ bool FileReader::Impl::CheckForFlatColumn(const ::parquet::ColumnDescriptor* des } Status FileReader::Impl::GetFlatColumn(int i, std::unique_ptr* out) { - if (!CheckForFlatColumn(reader_->descr()->Column(i))) { + const ::parquet::SchemaDescriptor* schema = reader_->metadata()->schema_descriptor(); + + if (!CheckForFlatColumn(schema->Column(i))) { return Status::Invalid("The requested column is not flat"); } std::unique_ptr impl( - new FlatColumnReader::Impl(pool_, reader_->descr()->Column(i), reader_.get(), i)); + new FlatColumnReader::Impl(pool_, schema->Column(i), reader_.get(), i)); *out = std::unique_ptr(new FlatColumnReader(std::move(impl))); return Status::OK(); } @@ -161,16 +163,20 @@ Status FileReader::Impl::GetFlatColumn(int i, std::unique_ptr* Status FileReader::Impl::ReadFlatColumn(int i, std::shared_ptr* out) { std::unique_ptr flat_column_reader; RETURN_NOT_OK(GetFlatColumn(i, &flat_column_reader)); - return flat_column_reader->NextBatch(reader_->num_rows(), out); + return flat_column_reader->NextBatch(reader_->metadata()->num_rows(), out); } Status FileReader::Impl::ReadFlatTable(std::shared_ptr
<Table>
* table) { - const std::string& name = reader_->descr()->schema()->name(); + auto descr = reader_->metadata()->schema_descriptor(); + + const std::string& name = descr->schema()->name(); std::shared_ptr schema; - RETURN_NOT_OK(FromParquetSchema(reader_->descr(), &schema)); + RETURN_NOT_OK(FromParquetSchema(descr, &schema)); + + int num_columns = reader_->metadata()->num_columns(); - std::vector> columns(reader_->num_columns()); - for (int i = 0; i < reader_->num_columns(); i++) { + std::vector> columns(num_columns); + for (int i = 0; i < num_columns; i++) { std::shared_ptr array; RETURN_NOT_OK(ReadFlatColumn(i, &array)); columns[i] = std::make_shared(schema->field(i), array); @@ -375,7 +381,7 @@ Status FlatColumnReader::Impl::NextBatch(int batch_size, std::shared_ptr* } void FlatColumnReader::Impl::NextRowGroup() { - if (next_row_group_ < reader_->num_row_groups()) { + if (next_row_group_ < reader_->metadata()->num_row_groups()) { column_reader_ = reader_->RowGroup(next_row_group_)->Column(column_index_); next_row_group_++; } else { From 270ab4e94dba3ec45cfd2297d4f901d51d4a053b Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 7 Sep 2016 14:15:16 -0700 Subject: [PATCH 0129/1644] ARROW-278: [Format] Rename Tuple to Struct_ in flatbuffers IDL "Struct" is a reserved keyword in generated bindings for C++. We had used "Tuple" to sidestep this but we discussed and decided to mangle "Struct" instead in the Flatbuffers. Author: Wes McKinney Closes #130 from wesm/ARROW-278 and squashes the following commits: 841a721 [Wes McKinney] Rename Tuple to Struct_ in flatbuffers IDL --- cpp/src/arrow/ipc/metadata-internal.cc | 6 +++--- format/Message.fbs | 10 +++++----- java/vector/src/main/codegen/data/ArrowTypes.tdd | 2 +- .../org/apache/arrow/vector/complex/MapVector.java | 4 ++-- .../org/apache/arrow/vector/schema/TypeLayout.java | 6 +++--- .../main/java/org/apache/arrow/vector/types/Types.java | 4 ++-- .../java/org/apache/arrow/vector/pojo/TestConvert.java | 6 +++--- 7 files changed, 19 insertions(+), 19 deletions(-) diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index c921e4d8e0114..1c15218c0ba12 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -115,7 +115,7 @@ static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, } *out = std::make_shared(children[0]); return Status::OK(); - case flatbuf::Type_Tuple: + case flatbuf::Type_Struct_: *out = std::make_shared(children); return Status::OK(); case flatbuf::Type_Union: @@ -153,7 +153,7 @@ static Status StructToFlatbuffer(FBB& fbb, const std::shared_ptr& type RETURN_NOT_OK(FieldToFlatbuffer(fbb, type->child(i), &field)); out_children->push_back(field); } - *offset = flatbuf::CreateTuple(fbb).Union(); + *offset = flatbuf::CreateStruct_(fbb).Union(); return Status::OK(); } @@ -197,7 +197,7 @@ static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, *out_type = flatbuf::Type_List; return ListToFlatbuffer(fbb, type, children, offset); case Type::STRUCT: - *out_type = flatbuf::Type_Tuple; + *out_type = flatbuf::Type_Struct_; return StructToFlatbuffer(fbb, type, children, offset); default: *out_type = flatbuf::Type_NONE; // Make clang-tidy happy diff --git a/format/Message.fbs b/format/Message.fbs index 9c95724897757..78bdaeb35f5a5 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -8,10 +8,10 @@ namespace org.apache.arrow.flatbuf; table Null { } -/// A Tuple in the flatbuffer metadata is the same as an Arrow Struct -/// 
(according to the physical memory layout). We used Tuple here as Struct is -/// a reserved word in Flatbuffers -table Tuple { +/// A Struct_ in the flatbuffer metadata is the same as an Arrow Struct +/// (according to the physical memory layout). We used Struct_ here as +/// Struct is a reserved word in Flatbuffers +table Struct_ { } table List { @@ -87,7 +87,7 @@ union Type { IntervalDay, IntervalYear, List, - Tuple, + Struct_, Union, JSONScalar } diff --git a/java/vector/src/main/codegen/data/ArrowTypes.tdd b/java/vector/src/main/codegen/data/ArrowTypes.tdd index 2ecad3d31400f..5cb43bed2b69a 100644 --- a/java/vector/src/main/codegen/data/ArrowTypes.tdd +++ b/java/vector/src/main/codegen/data/ArrowTypes.tdd @@ -21,7 +21,7 @@ fields: [] }, { - name: "Tuple", + name: "Struct_", fields: [] }, { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java index 1b8483a3d41be..aaecb956434e9 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java @@ -34,7 +34,7 @@ import org.apache.arrow.vector.holders.ComplexHolder; import org.apache.arrow.vector.types.Types; import org.apache.arrow.vector.types.Types.MinorType; -import org.apache.arrow.vector.types.pojo.ArrowType.Tuple; +import org.apache.arrow.vector.types.pojo.ArrowType.Struct_; import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.util.JsonStringHashMap; @@ -290,7 +290,7 @@ public Field getField() { for (ValueVector child : getChildren()) { children.add(child.getField()); } - return new Field(name, false, Tuple.INSTANCE, children); + return new Field(name, false, Struct_.INSTANCE, children); } @Override diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java index 9f1efd056cb08..885ac2ac3d7f2 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java @@ -45,7 +45,7 @@ import org.apache.arrow.vector.types.pojo.ArrowType.Null; import org.apache.arrow.vector.types.pojo.ArrowType.Time; import org.apache.arrow.vector.types.pojo.ArrowType.Timestamp; -import org.apache.arrow.vector.types.pojo.ArrowType.Tuple; +import org.apache.arrow.vector.types.pojo.ArrowType.Struct_; import org.apache.arrow.vector.types.pojo.ArrowType.Union; import org.apache.arrow.vector.types.pojo.ArrowType.Utf8; @@ -54,7 +54,7 @@ /** * The layout of vectors for a given type * It defines its own vectors followed by the vectors for the children - * if it is a nested type (Tuple, List, Union) + * if it is a nested type (Struct_, List, Union) */ public class TypeLayout { @@ -88,7 +88,7 @@ public static TypeLayout getTypeLayout(final ArrowType arrowType) { return new TypeLayout(vectors); } - @Override public TypeLayout visit(Tuple type) { + @Override public TypeLayout visit(Struct_ type) { List vectors = asList( validityVector() ); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java index 5eef8a008a923..66ef7562ceda1 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -84,7 +84,7 @@ import 
org.apache.arrow.vector.types.pojo.ArrowType.Null; import org.apache.arrow.vector.types.pojo.ArrowType.Time; import org.apache.arrow.vector.types.pojo.ArrowType.Timestamp; -import org.apache.arrow.vector.types.pojo.ArrowType.Tuple; +import org.apache.arrow.vector.types.pojo.ArrowType.Struct_; import org.apache.arrow.vector.types.pojo.ArrowType.Union; import org.apache.arrow.vector.types.pojo.ArrowType.Utf8; import org.apache.arrow.vector.types.pojo.Field; @@ -131,7 +131,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { return null; } }, - MAP(Tuple.INSTANCE) { + MAP(Struct_.INSTANCE) { @Override public Field getField() { throw new UnsupportedOperationException("Cannot get simple field for Map type"); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java index 61327f1970e83..448117d84dc3e 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java @@ -26,7 +26,7 @@ import org.apache.arrow.vector.types.pojo.ArrowType.Int; import org.apache.arrow.vector.types.pojo.ArrowType.List; import org.apache.arrow.vector.types.pojo.ArrowType.Timestamp; -import org.apache.arrow.vector.types.pojo.ArrowType.Tuple; +import org.apache.arrow.vector.types.pojo.ArrowType.Struct_; import org.apache.arrow.vector.types.pojo.ArrowType.Union; import org.apache.arrow.vector.types.pojo.ArrowType.Utf8; import org.apache.arrow.vector.types.pojo.Field; @@ -53,7 +53,7 @@ public void complex() { childrenBuilder.add(new Field("child1", true, Utf8.INSTANCE, null)); childrenBuilder.add(new Field("child2", true, new FloatingPoint(SINGLE), ImmutableList.of())); - Field initialField = new Field("a", true, Tuple.INSTANCE, childrenBuilder.build()); + Field initialField = new Field("a", true, Struct_.INSTANCE, childrenBuilder.build()); run(initialField); } @@ -71,7 +71,7 @@ public void nestedSchema() { ImmutableList.Builder childrenBuilder = ImmutableList.builder(); childrenBuilder.add(new Field("child1", true, Utf8.INSTANCE, null)); childrenBuilder.add(new Field("child2", true, new FloatingPoint(SINGLE), ImmutableList.of())); - childrenBuilder.add(new Field("child3", true, new Tuple(), ImmutableList.of( + childrenBuilder.add(new Field("child3", true, new Struct_(), ImmutableList.of( new Field("child3.1", true, Utf8.INSTANCE, null), new Field("child3.2", true, new FloatingPoint(DOUBLE), ImmutableList.of()) ))); From 52089d609dff3d8d2abe99c7b94f7af9fe4735bd Mon Sep 17 00:00:00 2001 From: Laurent Goujon Date: Thu, 8 Sep 2016 11:35:08 -0700 Subject: [PATCH 0130/1644] ARROW-285: Optional flatc download For platforms which don't have a flatc compiler artifact on maven central, allow to skip the download and manually provide a flatc compiler usage: mvn -Dflatc.download.skip -Dflatc.executable=/usr/local/bin/flatc Author: Laurent Goujon Closes #129 from laurentgo/laurent/optional-flatc-download and squashes the following commits: 229c6d5 [Laurent Goujon] Optional flatc download --- java/format/pom.xml | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/java/format/pom.xml b/java/format/pom.xml index dc5897581b5b3..4cf68bbe057e9 100644 --- a/java/format/pom.xml +++ b/java/format/pom.xml @@ -25,6 +25,8 @@ 1.2.0-3f79e055 + false + ${project.build.directory}/flatc-${os.detected.classifier}-${fbs.version}.exe 3.3 2.10 1.5.0.Final @@ -71,6 +73,7 @@ ${project.build.directory} + ${flatc.download.skip} @@ 
-92,6 +95,7 @@ +x ${project.build.directory}/flatc-${os.detected.classifier}-${fbs.version}.exe + ${flatc.download.skip} @@ -100,11 +104,11 @@ generate-sources - ${project.build.directory}/flatc-${os.detected.classifier}-${fbs.version}.exe + ${flatc.executable} -j -o - target/generated-sources/ + target/generated-sources/flatc ../../format/Message.fbs ../../format/File.fbs From a5f28617499a63ec44886bed35253f790e3674e1 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Thu, 8 Sep 2016 22:58:37 -0400 Subject: [PATCH 0131/1644] ARROW-286: Build thirdparty dependencies in parallel Author: Uwe L. Korn Closes #133 from xhochy/ARROW-286 and squashes the following commits: cb5a990 [Uwe L. Korn] ARROW-286: Build thirdparty dependencies in parallel --- cpp/thirdparty/build_thirdparty.sh | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/cpp/thirdparty/build_thirdparty.sh b/cpp/thirdparty/build_thirdparty.sh index f1738ff748299..6cc776d09042a 100755 --- a/cpp/thirdparty/build_thirdparty.sh +++ b/cpp/thirdparty/build_thirdparty.sh @@ -62,7 +62,7 @@ if [ -n "$F_ALL" -o -n "$F_GTEST" ]; then CXXFLAGS=-fPIC cmake . || { echo "cmake $GOOGLETEST_ERROR"; exit 1; } fi - make VERBOSE=1 || { echo "Make $GOOGLETEST_ERROR" ; exit 1; } + make -j$PARALLEL VERBOSE=1 || { echo "Make $GOOGLETEST_ERROR" ; exit 1; } fi # build google benchmark @@ -76,7 +76,7 @@ if [ -n "$F_ALL" -o -n "$F_GBENCHMARK" ]; then fi cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$PREFIX -DCMAKE_CXX_FLAGS="-fPIC $CMAKE_CXX_FLAGS" . || { echo "cmake $GBENCHMARK_ERROR" ; exit 1; } - make VERBOSE=1 install || { echo "make $GBENCHMARK_ERROR" ; exit 1; } + make -j$PARALLEL VERBOSE=1 install || { echo "make $GBENCHMARK_ERROR" ; exit 1; } fi FLATBUFFERS_ERROR="failed for flatbuffers" From 077c72bc6adf07c5311785596cb03088ae11ae5e Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 9 Sep 2016 00:02:35 -0400 Subject: [PATCH 0132/1644] ARROW-256: [Format] Add a version number to the IPC/RPC metadata See "Schema evolution examples" in https://google.github.io/flatbuffers/flatbuffers_guide_writing_schema.html. In the future, if we need to add some other message types (like `RecordBatchV2`), then this should permit this without too much trouble. 
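To make the compatibility argument concrete: Flatbuffers schemas evolve safely as long as new enum values and table fields are only appended, never reordered or removed, so readers built against the old schema keep working. A minimal sketch of what a hypothetical later revision could look like (the `V2` value and `RecordBatchV2` entry are illustrative only, not part of this patch):

    enum MetadataVersion:short {
      V1_SNAPSHOT,    // existing value; its wire value stays 0
      V2              // hypothetical future version, appended at the end
    }

    union MessageHeader {
      // ... existing members unchanged ...
      RecordBatchV2   // hypothetical new message type, appended last
    }

An old reader can then detect an unfamiliar version or header type and fail gracefully instead of misparsing the payload.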
Author: Wes McKinney Closes #125 from wesm/ARROW-256 and squashes the following commits: 60ee5c0 [Wes McKinney] Rename current version to V1_SNAPSHOT to reflect changing nature bab2749 [Wes McKinney] Add a version number / enum to the Message and File metadata --- cpp/src/arrow/ipc/metadata-internal.cc | 3 ++- cpp/src/arrow/ipc/metadata-internal.h | 3 +++ format/File.fbs | 1 + format/Message.fbs | 5 +++++ 4 files changed, 11 insertions(+), 1 deletion(-) diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index 1c15218c0ba12..8cc902c2967da 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -295,7 +295,8 @@ Status WriteDataHeader(int32_t length, int64_t body_length, } Status MessageBuilder::Finish() { - auto message = flatbuf::CreateMessage(fbb_, header_type_, header_, body_length_); + auto message = flatbuf::CreateMessage(fbb_, kMetadataVersion, + header_type_, header_, body_length_); fbb_.Finish(message); return Status::OK(); } diff --git a/cpp/src/arrow/ipc/metadata-internal.h b/cpp/src/arrow/ipc/metadata-internal.h index 5faa8c947b55d..db9a83f6a8dfb 100644 --- a/cpp/src/arrow/ipc/metadata-internal.h +++ b/cpp/src/arrow/ipc/metadata-internal.h @@ -37,6 +37,9 @@ class Status; namespace ipc { +static constexpr flatbuf::MetadataVersion kMetadataVersion = + flatbuf::MetadataVersion_V1_SNAPSHOT; + Status FieldFromFlatbuffer(const flatbuf::Field* field, std::shared_ptr* out); class MessageBuilder { diff --git a/format/File.fbs b/format/File.fbs index f7ad1e1594a91..a29bbc694bc13 100644 --- a/format/File.fbs +++ b/format/File.fbs @@ -7,6 +7,7 @@ namespace org.apache.arrow.flatbuf; /// table Footer { + version: org.apache.arrow.flatbuf.MetadataVersion; schema: org.apache.arrow.flatbuf.Schema; diff --git a/format/Message.fbs b/format/Message.fbs index 78bdaeb35f5a5..657904a7032a5 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -1,5 +1,9 @@ namespace org.apache.arrow.flatbuf; +enum MetadataVersion:short { + V1_SNAPSHOT +} + /// ---------------------------------------------------------------------- /// Logical types and their metadata (if any) /// @@ -237,6 +241,7 @@ union MessageHeader { } table Message { + version: org.apache.arrow.flatbuf.MetadataVersion; header: MessageHeader; bodyLength: long; } From 6b8abb4402ff1f39fc5944a7df6e3b4755691d87 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Mon, 12 Sep 2016 23:15:10 -0400 Subject: [PATCH 0133/1644] ARROW-289: Install test-util.h Author: Uwe L. Korn Closes #135 from xhochy/arrow-289 and squashes the following commits: 5e4aadf [Uwe L. 
Korn] ARROW-289: Install test-util.h --- cpp/src/arrow/CMakeLists.txt | 1 + cpp/src/arrow/test-util.h | 12 ++++++------ cpp/src/arrow/util/CMakeLists.txt | 1 + cpp/src/arrow/util/bit-util.h | 4 +++- 4 files changed, 11 insertions(+), 7 deletions(-) diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index 2d42edcfbd499..a9b2feca28cb7 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -24,6 +24,7 @@ install(FILES schema.h table.h type.h + test-util.h DESTINATION include/arrow) ####################################### diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index 055dac7444488..e632ffb1d892d 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -40,22 +40,22 @@ #define ASSERT_RAISES(ENUM, expr) \ do { \ - Status s = (expr); \ + ::arrow::Status s = (expr); \ if (!s.Is##ENUM()) { FAIL() << s.ToString(); } \ } while (0) #define ASSERT_OK(expr) \ do { \ - Status s = (expr); \ + ::arrow::Status s = (expr); \ if (!s.ok()) { FAIL() << s.ToString(); } \ } while (0) #define ASSERT_OK_NO_THROW(expr) ASSERT_NO_THROW(ASSERT_OK(expr)) -#define EXPECT_OK(expr) \ - do { \ - Status s = (expr); \ - EXPECT_TRUE(s.ok()); \ +#define EXPECT_OK(expr) \ + do { \ + ::arrow::Status s = (expr); \ + EXPECT_TRUE(s.ok()); \ } while (0) namespace arrow { diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt index 13c0d7514feef..fd23c1aa3b8b2 100644 --- a/cpp/src/arrow/util/CMakeLists.txt +++ b/cpp/src/arrow/util/CMakeLists.txt @@ -26,6 +26,7 @@ install(FILES logging.h macros.h memory-pool.h + random.h status.h visibility.h DESTINATION include/arrow/util) diff --git a/cpp/src/arrow/util/bit-util.h b/cpp/src/arrow/util/bit-util.h index a6c8dd904d8e0..873a1959865f5 100644 --- a/cpp/src/arrow/util/bit-util.h +++ b/cpp/src/arrow/util/bit-util.h @@ -22,6 +22,8 @@ #include #include +#include "arrow/util/visibility.h" + namespace arrow { class Buffer; @@ -76,7 +78,7 @@ static inline bool is_multiple_of_64(int64_t n) { } void bytes_to_bits(const std::vector& bytes, uint8_t* bits); -Status bytes_to_bits(const std::vector&, std::shared_ptr*); +ARROW_EXPORT Status bytes_to_bits(const std::vector&, std::shared_ptr*); } // namespace util From 6f99156c3bb01329e33f74a57d9aaff1ed8304bc Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Tue, 13 Sep 2016 11:13:54 -0700 Subject: [PATCH 0134/1644] ARROW-287: Make nullable vectors use a BitVecor instead of UInt1Vector for bits Author: Julien Le Dem Closes #134 from julienledem/bits and squashes the following commits: d4e5084 [Julien Le Dem] add nullable vector test that verifies Bit based buffers 15fde9d [Julien Le Dem] ARROW-287: Make nullable vectors use a BitVecor instead of UInt1Vector for bits --- .../templates/NullableValueVectors.java | 6 +- .../org/apache/arrow/vector/BitVector.java | 16 ++++- .../arrow/vector/complex/ListVector.java | 6 +- .../vector/complex/NullableMapVector.java | 8 +-- .../apache/arrow/vector/TestValueVector.java | 67 +++++++++++++++++-- 5 files changed, 87 insertions(+), 16 deletions(-) diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index bb2c00121605c..486cfeefc7a3b 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -53,7 +53,7 @@ public final class ${className} extends BaseDataValueVector implements <#if type private final String valuesField = 
"$values$"; private final Field field; - final UInt1Vector bits = new UInt1Vector(bitsField, allocator); + final BitVector bits = new BitVector(bitsField, allocator); final ${valuesName} values; private final Mutator mutator; @@ -446,7 +446,7 @@ public void copyFromSafe(int fromIndex, int thisIndex, Nullable${minor.class}Vec } public final class Accessor extends BaseDataValueVector.BaseAccessor <#if type.major = "VarLen">implements VariableWidthVector.VariableWidthAccessor { - final UInt1Vector.Accessor bAccessor = bits.getAccessor(); + final BitVector.Accessor bAccessor = bits.getAccessor(); final ${valuesName}.Accessor vAccessor = values.getAccessor(); /** @@ -545,7 +545,7 @@ public void setIndexDefined(int index){ public void set(int index, <#if type.major == "VarLen">byte[]<#elseif (type.width < 4)>int<#else>${minor.javaType!type.javaType} value) { setCount++; final ${valuesName}.Mutator valuesMutator = values.getMutator(); - final UInt1Vector.Mutator bitsMutator = bits.getMutator(); + final BitVector.Mutator bitsMutator = bits.getMutator(); <#if type.major == "VarLen"> for (int i = lastSet + 1; i < index; i++) { valuesMutator.set(i, emptyByteArray); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java index fee6e9cdef73d..c12db5045c2db 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java @@ -17,8 +17,6 @@ */ package org.apache.arrow.vector; -import io.netty.buffer.ArrowBuf; - import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.OutOfMemoryException; import org.apache.arrow.vector.complex.reader.FieldReader; @@ -29,6 +27,8 @@ import org.apache.arrow.vector.util.OversizedAllocationException; import org.apache.arrow.vector.util.TransferPair; +import io.netty.buffer.ArrowBuf; + /** * Bit implements a vector of bit-width values. Elements in the vector are accessed by position from the logical start * of the vector. The width of each element is 1 bit. 
The equivalent Java primitive is an int containing the value '0' @@ -435,6 +435,18 @@ public final void generateTestData(int values) { setValueCount(values); } + public void generateTestDataAlt(int size) { + setValueCount(size); + boolean even = true; + final int valueCount = getAccessor().getValueCount(); + for(int i = 0; i < valueCount; i++, even = !even) { + if(even){ + set(i, (byte) 1); + }else{ + set(i, (byte) 0); + } + } + } } @Override diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java index 2984c362514fc..dd99c734f7ff8 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java @@ -28,9 +28,9 @@ import org.apache.arrow.memory.OutOfMemoryException; import org.apache.arrow.vector.AddOrGetResult; import org.apache.arrow.vector.BaseDataValueVector; +import org.apache.arrow.vector.BitVector; import org.apache.arrow.vector.BufferBacked; import org.apache.arrow.vector.FieldVector; -import org.apache.arrow.vector.UInt1Vector; import org.apache.arrow.vector.UInt4Vector; import org.apache.arrow.vector.ValueVector; import org.apache.arrow.vector.ZeroVector; @@ -55,7 +55,7 @@ public class ListVector extends BaseRepeatedValueVector implements FieldVector { final UInt4Vector offsets; - final UInt1Vector bits; + final BitVector bits; private final List innerVectors; private Mutator mutator = new Mutator(); private Accessor accessor = new Accessor(); @@ -65,7 +65,7 @@ public class ListVector extends BaseRepeatedValueVector implements FieldVector { public ListVector(String name, BufferAllocator allocator, CallBack callBack) { super(name, allocator); - this.bits = new UInt1Vector("$bits$", allocator); + this.bits = new BitVector("$bits$", allocator); this.offsets = getOffsetVector(); this.innerVectors = Collections.unmodifiableList(Arrays.asList(bits, offsets)); this.writer = new UnionListWriter(this); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java index 6b257c095d28e..8e1bbfabdc907 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java @@ -25,10 +25,10 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.BaseDataValueVector; +import org.apache.arrow.vector.BitVector; import org.apache.arrow.vector.BufferBacked; import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.NullableVectorDefinitionSetter; -import org.apache.arrow.vector.UInt1Vector; import org.apache.arrow.vector.ValueVector; import org.apache.arrow.vector.complex.impl.NullableMapReaderImpl; import org.apache.arrow.vector.complex.reader.FieldReader; @@ -45,7 +45,7 @@ public class NullableMapVector extends MapVector implements FieldVector { private final NullableMapReaderImpl reader = new NullableMapReaderImpl(this); - protected final UInt1Vector bits; + protected final BitVector bits; private final List innerVectors; @@ -54,7 +54,7 @@ public class NullableMapVector extends MapVector implements FieldVector { public NullableMapVector(String name, BufferAllocator allocator, CallBack callBack) { super(name, checkNotNull(allocator), callBack); - this.bits = new UInt1Vector("$bits$", allocator); + this.bits = new BitVector("$bits$", 
allocator); this.innerVectors = Collections.unmodifiableList(Arrays.asList(bits)); this.accessor = new Accessor(); this.mutator = new Mutator(); @@ -186,7 +186,7 @@ public boolean allocateNewSafe() { return success; } public final class Accessor extends MapVector.Accessor { - final UInt1Vector.Accessor bAccessor = bits.getAccessor(); + final BitVector.Accessor bAccessor = bits.getAccessor(); @Override public Object getObject(int index) { diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java index 21cdc4f4d8d3b..124452e96ee42 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java @@ -17,17 +17,23 @@ */ package org.apache.arrow.vector; +import static org.junit.Assert.assertArrayEquals; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +import java.nio.charset.Charset; +import java.util.List; + import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.schema.TypeLayout; import org.apache.arrow.vector.types.Types.MinorType; -import org.apache.arrow.vector.util.OversizedAllocationException; +import org.apache.arrow.vector.types.pojo.Field; import org.junit.After; import org.junit.Before; import org.junit.Test; -import java.nio.charset.Charset; - -import static org.junit.Assert.*; +import io.netty.buffer.ArrowBuf; public class TestValueVector { @@ -223,6 +229,59 @@ public void testNullableFloat() { } } + @Test + public void testNullableInt() { + // Create a new value vector for 1024 integers + try (final NullableIntVector vector = (NullableIntVector) MinorType.INT.getNewVector(EMPTY_SCHEMA_PATH, allocator, null)) { + final NullableIntVector.Mutator m = vector.getMutator(); + vector.allocateNew(1024); + + // Put and set a few values. + m.set(0, 1); + m.set(1, 2); + m.set(100, 3); + m.set(1022, 4); + m.set(1023, 5); + + m.setValueCount(1024); + + final NullableIntVector.Accessor accessor = vector.getAccessor(); + assertEquals(1, accessor.get(0)); + assertEquals(2, accessor.get(1)); + assertEquals(3, accessor.get(100)); + assertEquals(4, accessor.get(1022)); + assertEquals(5, accessor.get(1023)); + + // Ensure null values. 
+ assertTrue(vector.getAccessor().isNull(3)); + + Field field = vector.getField(); + TypeLayout typeLayout = field.getTypeLayout(); + + List buffers = vector.getFieldBuffers(); + + assertEquals(2, typeLayout.getVectors().size()); + assertEquals(2, buffers.size()); + + ArrowBuf validityVectorBuf = buffers.get(0); + assertEquals(128, validityVectorBuf.readableBytes()); + assertEquals(3, validityVectorBuf.getByte(0)); // 1st and second bit defined + for (int i = 1; i < 12; i++) { + assertEquals(0, validityVectorBuf.getByte(i)); // nothing defined until 100 + } + assertEquals(16, validityVectorBuf.getByte(12)); // 100th bit is defined (12 * 8 + 4) + for (int i = 13; i < 127; i++) { + assertEquals(0, validityVectorBuf.getByte(i)); // nothing defined between 100th and 1022nd + } + assertEquals(-64, validityVectorBuf.getByte(127)); // 1022nd and 1023rd bit defined + + vector.allocateNew(2048); + // vector has been erased + assertTrue(vector.getAccessor().isNull(0)); + } + } + + @Test public void testBitVector() { // Create a new value vector for 1024 integers From 3487c2f0cdc2297a80ba3525c192745313b3da48 Mon Sep 17 00:00:00 2001 From: adeneche Date: Wed, 14 Sep 2016 14:46:27 -0700 Subject: [PATCH 0135/1644] ARROW-292: [Java] Upgrade Netty to 4.0.41 this closes #137 --- .../main/java/io/netty/buffer/ArrowBuf.java | 2 +- .../netty/buffer/PooledByteBufAllocatorL.java | 2 +- .../buffer/UnsafeDirectLittleEndian.java | 30 +++++++++++++++---- java/pom.xml | 2 +- 4 files changed, 28 insertions(+), 8 deletions(-) diff --git a/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java b/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java index d10f00247e6ee..b7a268a00706c 100644 --- a/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java +++ b/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java @@ -452,7 +452,7 @@ public String toString(int index, int length, Charset charset) { return ""; } - return ByteBufUtil.decodeString(nioBuffer(index, length), charset); + return ByteBufUtil.decodeString(this, index, length, charset); } @Override diff --git a/java/memory/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java b/java/memory/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java index 0b6e3f7f8392d..f6feb65cccd09 100644 --- a/java/memory/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java +++ b/java/memory/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java @@ -145,7 +145,7 @@ public boolean matches(String name, Metric metric) { } private UnsafeDirectLittleEndian newDirectBufferL(int initialCapacity, int maxCapacity) { - PoolThreadCache cache = threadCache.get(); + PoolThreadCache cache = threadCache(); PoolArena directArena = cache.directArena; if (directArena != null) { diff --git a/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java b/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java index a94c6d1988399..dc93602100e9c 100644 --- a/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java +++ b/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java @@ -20,6 +20,9 @@ import io.netty.util.internal.PlatformDependent; +import java.io.IOException; +import java.io.InputStream; +import java.io.OutputStream; import java.nio.ByteOrder; import java.util.concurrent.atomic.AtomicLong; @@ -93,11 +96,6 @@ public ByteBuf slice(int index, int length) { return new SlicedByteBuf(this, index, length); } - @Override - public ByteOrder order() { - return ByteOrder.LITTLE_ENDIAN; - } - @Override public ByteBuf 
order(ByteOrder endianness) { return this; @@ -254,6 +252,28 @@ public boolean release(int decrement) { return released; } + @Override + public int setBytes(int index, InputStream in, int length) throws IOException { + wrapped.checkIndex(index, length); + byte[] tmp = new byte[length]; + int readBytes = in.read(tmp); + if (readBytes > 0) { + PlatformDependent.copyMemory(tmp, 0, addr(index), readBytes); + } + return readBytes; + } + + @Override + public ByteBuf getBytes(int index, OutputStream out, int length) throws IOException { + wrapped.checkIndex(index, length); + if (length != 0) { + byte[] tmp = new byte[length]; + PlatformDependent.copyMemory(addr(index), tmp, 0, length); + out.write(tmp); + } + return this; + } + @Override public int hashCode() { return System.identityHashCode(this); diff --git a/java/pom.xml b/java/pom.xml index 8eb25af7545f4..a8e24ed054cd5 100644 --- a/java/pom.xml +++ b/java/pom.xml @@ -395,7 +395,7 @@ io.netty netty-handler - 4.0.27.Final + 4.0.41.Final From 17e90e1d88266ea224244647831f49d5bd1dac72 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Fri, 16 Sep 2016 16:01:17 -0700 Subject: [PATCH 0136/1644] ARROW-290: Specialize alloc() in ArrowBuf Author: Julien Le Dem Closes #136 from julienledem/alloc and squashes the following commits: a19d16f [Julien Le Dem] ARROW-290: Specialize alloc() in ArrowBuf --- .../src/main/java/io/netty/buffer/ArrowBuf.java | 9 +++++---- .../io/netty/buffer/UnsafeDirectLittleEndian.java | 2 ++ .../apache/arrow/memory/ArrowByteBufAllocator.java | 4 ++++ .../java/org/apache/arrow/memory/BaseAllocator.java | 11 +++++------ 4 files changed, 16 insertions(+), 10 deletions(-) diff --git a/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java b/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java index b7a268a00706c..a5989c1518def 100644 --- a/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java +++ b/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java @@ -29,6 +29,7 @@ import java.util.concurrent.atomic.AtomicLong; import org.apache.arrow.memory.AllocationManager.BufferLedger; +import org.apache.arrow.memory.ArrowByteBufAllocator; import org.apache.arrow.memory.BaseAllocator; import org.apache.arrow.memory.BaseAllocator.Verbosity; import org.apache.arrow.memory.BoundsChecking; @@ -52,7 +53,7 @@ public final class ArrowBuf extends AbstractByteBuf implements AutoCloseable { private final int offset; private final BufferLedger ledger; private final BufferManager bufManager; - private final ByteBufAllocator alloc; + private final ArrowByteBufAllocator alloc; private final boolean isEmpty; private volatile int length; private final HistoricalLog historicalLog = BaseAllocator.DEBUG ? 
@@ -63,7 +64,7 @@ public ArrowBuf( final BufferLedger ledger, final UnsafeDirectLittleEndian byteBuf, final BufferManager manager, - final ByteBufAllocator alloc, + final ArrowByteBufAllocator alloc, final int offset, final int length, boolean isEmpty) { @@ -297,8 +298,8 @@ public synchronized ArrowBuf capacity(int newCapacity) { } @Override - public ByteBufAllocator alloc() { - return udle.alloc(); + public ArrowByteBufAllocator alloc() { + return alloc; } @Override diff --git a/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java b/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java index dc93602100e9c..023a6a2892b80 100644 --- a/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java +++ b/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java @@ -26,6 +26,8 @@ import java.nio.ByteOrder; import java.util.concurrent.atomic.AtomicLong; +import io.netty.util.internal.PlatformDependent; + /** * The underlying class we use for little-endian access to memory. Is used underneath ArrowBufs to abstract away the * Netty classes and underlying Netty memory management. diff --git a/java/memory/src/main/java/org/apache/arrow/memory/ArrowByteBufAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/ArrowByteBufAllocator.java index f3f72fa57c33a..5dc5ac397bd93 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/ArrowByteBufAllocator.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/ArrowByteBufAllocator.java @@ -39,6 +39,10 @@ public ArrowByteBufAllocator(BufferAllocator allocator) { this.allocator = allocator; } + public BufferAllocator unwrap() { + return allocator; + } + @Override public ByteBuf buffer() { return buffer(DEFAULT_BUFFER_SIZE); diff --git a/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java index f1503c902d0be..dbb0705045c35 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java @@ -17,10 +17,6 @@ */ package org.apache.arrow.memory; -import io.netty.buffer.ArrowBuf; -import io.netty.buffer.ByteBufAllocator; -import io.netty.buffer.UnsafeDirectLittleEndian; - import java.util.Arrays; import java.util.IdentityHashMap; import java.util.Set; @@ -33,6 +29,9 @@ import com.google.common.base.Preconditions; +import io.netty.buffer.ArrowBuf; +import io.netty.buffer.UnsafeDirectLittleEndian; + public abstract class BaseAllocator extends Accountant implements BufferAllocator { private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(BaseAllocator.class); @@ -47,7 +46,7 @@ public abstract class BaseAllocator extends Accountant implements BufferAllocato private final Object DEBUG_LOCK = DEBUG ? 
new Object() : null; private final BaseAllocator parentAllocator; - private final ByteBufAllocator thisAsByteBufAllocator; + private final ArrowByteBufAllocator thisAsByteBufAllocator; private final IdentityHashMap childAllocators; private final ArrowBuf empty; @@ -247,7 +246,7 @@ private ArrowBuf bufferWithoutReservation(final int size, BufferManager bufferMa } @Override - public ByteBufAllocator getAsByteBufAllocator() { + public ArrowByteBufAllocator getAsByteBufAllocator() { return thisAsByteBufAllocator; } From 559b865226ec0f5d78e87957c2ff0f7711bec9a8 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 18 Sep 2016 16:01:58 -0400 Subject: [PATCH 0137/1644] ARROW-280: [C++] Refactor IPC / memory map IO to use common arrow_io interfaces. Create arrow_ipc leaf library Several things here * Clean up IO interface class structure to be able to indicate precise characteristics of an implementation * Make the IPC reader/writer use more generic interfaces -- writing only needs an output stream, reading only needs a random access reader. This will unblock ARROW-267 * Create a separate arrow_ipc shared library Author: Wes McKinney Closes #138 from wesm/ARROW-280 and squashes the following commits: 6a59eb6 [Wes McKinney] * Restructure IO interfaces to accommodate more configurations. * Refactor memory mapped IO interfaces to be in line with other arrow::io classes. * Split arrow_ipc into a leaf library * Refactor pyarrow and arrow_parquet to suit. Move BufferReader to arrow_io. Pyarrow parquet tests currently segfault --- cpp/CMakeLists.txt | 6 - cpp/src/arrow/io/CMakeLists.txt | 11 +- cpp/src/arrow/io/hdfs.cc | 35 ++- cpp/src/arrow/io/hdfs.h | 29 +- cpp/src/arrow/io/interfaces.h | 71 ++++- .../io/{hdfs-io-test.cc => io-hdfs-test.cc} | 2 +- .../io-memory-test.cc} | 50 ++-- cpp/src/arrow/io/libhdfs_shim.cc | 3 +- cpp/src/arrow/io/memory.cc | 262 ++++++++++++++++++ cpp/src/arrow/io/memory.h | 130 +++++++++ cpp/src/arrow/io/test-common.h | 63 +++++ cpp/src/arrow/ipc/CMakeLists.txt | 58 +++- cpp/src/arrow/ipc/adapter.cc | 61 ++-- cpp/src/arrow/ipc/adapter.h | 39 +-- cpp/src/arrow/ipc/ipc-adapter-test.cc | 33 ++- cpp/src/arrow/ipc/memory.cc | 182 ------------ cpp/src/arrow/ipc/memory.h | 150 ---------- cpp/src/arrow/ipc/metadata-internal.cc | 9 +- cpp/src/arrow/ipc/metadata-internal.h | 2 +- cpp/src/arrow/ipc/metadata.h | 11 +- cpp/src/arrow/ipc/symbols.map | 18 ++ cpp/src/arrow/ipc/test-common.h | 25 -- cpp/src/arrow/ipc/util.h | 56 ++++ cpp/src/arrow/parquet/CMakeLists.txt | 1 + cpp/src/arrow/parquet/io.cc | 4 +- cpp/src/arrow/parquet/io.h | 4 +- cpp/src/arrow/parquet/parquet-io-test.cc | 51 +--- cpp/src/arrow/parquet/parquet-schema-test.cc | 3 +- cpp/src/arrow/parquet/reader.cc | 8 +- cpp/src/arrow/parquet/reader.h | 2 +- cpp/src/arrow/parquet/schema.cc | 2 +- cpp/src/arrow/parquet/writer.cc | 2 +- cpp/src/arrow/type.h | 4 +- cpp/src/arrow/util/memory-pool-test.cc | 2 +- python/pyarrow/includes/libarrow_io.pxd | 42 ++- python/pyarrow/includes/parquet.pxd | 18 +- python/pyarrow/io.pxd | 7 +- python/pyarrow/io.pyx | 14 +- python/pyarrow/parquet.pyx | 6 +- 39 files changed, 873 insertions(+), 603 deletions(-) rename cpp/src/arrow/io/{hdfs-io-test.cc => io-hdfs-test.cc} (99%) rename cpp/src/arrow/{ipc/ipc-memory-test.cc => io/io-memory-test.cc} (66%) create mode 100644 cpp/src/arrow/io/memory.cc create mode 100644 cpp/src/arrow/io/memory.h create mode 100644 cpp/src/arrow/io/test-common.h delete mode 100644 cpp/src/arrow/ipc/memory.cc delete mode 100644 cpp/src/arrow/ipc/memory.h create mode 100644 
cpp/src/arrow/ipc/symbols.map create mode 100644 cpp/src/arrow/ipc/util.h diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index a39a752123155..be95dabf31897 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -626,12 +626,6 @@ set(ARROW_SRCS src/arrow/table.cc src/arrow/type.cc - # IPC / Shared memory library; to be turned into an optional component - src/arrow/ipc/adapter.cc - src/arrow/ipc/memory.cc - src/arrow/ipc/metadata.cc - src/arrow/ipc/metadata-internal.cc - src/arrow/types/construct.cc src/arrow/types/decimal.cc src/arrow/types/json.cc diff --git a/cpp/src/arrow/io/CMakeLists.txt b/cpp/src/arrow/io/CMakeLists.txt index b8c0e138afb06..87e227ef80d80 100644 --- a/cpp/src/arrow/io/CMakeLists.txt +++ b/cpp/src/arrow/io/CMakeLists.txt @@ -20,6 +20,7 @@ set(ARROW_IO_LINK_LIBS arrow_shared + dl ) if (ARROW_BOOST_USE_SHARED) @@ -37,6 +38,7 @@ set(ARROW_IO_TEST_LINK_LIBS ${ARROW_IO_PRIVATE_LINK_LIBS}) set(ARROW_IO_SRCS + memory.cc ) if(ARROW_HDFS) @@ -71,8 +73,8 @@ if(ARROW_HDFS) ${ARROW_HDFS_SRCS} ${ARROW_IO_SRCS}) - ADD_ARROW_TEST(hdfs-io-test) - ARROW_TEST_LINK_LIBRARIES(hdfs-io-test + ADD_ARROW_TEST(io-hdfs-test) + ARROW_TEST_LINK_LIBRARIES(io-hdfs-test ${ARROW_IO_TEST_LINK_LIBS}) endif() @@ -101,10 +103,15 @@ if (APPLE) INSTALL_NAME_DIR "@rpath") endif() +ADD_ARROW_TEST(io-memory-test) +ARROW_TEST_LINK_LIBRARIES(io-memory-test + ${ARROW_IO_TEST_LINK_LIBS}) + # Headers: top level install(FILES hdfs.h interfaces.h + memory.h DESTINATION include/arrow/io) install(TARGETS arrow_io diff --git a/cpp/src/arrow/io/hdfs.cc b/cpp/src/arrow/io/hdfs.cc index 800c3edf4f31a..a6b4b2f3846b1 100644 --- a/cpp/src/arrow/io/hdfs.cc +++ b/cpp/src/arrow/io/hdfs.cc @@ -142,6 +142,15 @@ Status HdfsReadableFile::ReadAt( return impl_->ReadAt(position, nbytes, bytes_read, buffer); } +Status HdfsReadableFile::ReadAt( + int64_t position, int64_t nbytes, std::shared_ptr* out) { + return Status::NotImplemented("Not yet implemented"); +} + +bool HdfsReadableFile::supports_zero_copy() const { + return false; +} + Status HdfsReadableFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) { return impl_->Read(nbytes, bytes_read, buffer); } @@ -162,9 +171,9 @@ Status HdfsReadableFile::Tell(int64_t* position) { // File writing // Private implementation for writeable-only files -class HdfsWriteableFile::HdfsWriteableFileImpl : public HdfsAnyFileImpl { +class HdfsOutputStream::HdfsOutputStreamImpl : public HdfsAnyFileImpl { public: - HdfsWriteableFileImpl() {} + HdfsOutputStreamImpl() {} Status Close() { if (is_open_) { @@ -185,29 +194,29 @@ class HdfsWriteableFile::HdfsWriteableFileImpl : public HdfsAnyFileImpl { } }; -HdfsWriteableFile::HdfsWriteableFile() { - impl_.reset(new HdfsWriteableFileImpl()); +HdfsOutputStream::HdfsOutputStream() { + impl_.reset(new HdfsOutputStreamImpl()); } -HdfsWriteableFile::~HdfsWriteableFile() { +HdfsOutputStream::~HdfsOutputStream() { impl_->Close(); } -Status HdfsWriteableFile::Close() { +Status HdfsOutputStream::Close() { return impl_->Close(); } -Status HdfsWriteableFile::Write( +Status HdfsOutputStream::Write( const uint8_t* buffer, int64_t nbytes, int64_t* bytes_read) { return impl_->Write(buffer, nbytes, bytes_read); } -Status HdfsWriteableFile::Write(const uint8_t* buffer, int64_t nbytes) { +Status HdfsOutputStream::Write(const uint8_t* buffer, int64_t nbytes) { int64_t bytes_written_dummy = 0; return Write(buffer, nbytes, &bytes_written_dummy); } -Status HdfsWriteableFile::Tell(int64_t* position) { +Status HdfsOutputStream::Tell(int64_t* position) { return 
impl_->Tell(position); } @@ -347,7 +356,7 @@ class HdfsClient::HdfsClientImpl { Status OpenWriteable(const std::string& path, bool append, int32_t buffer_size, int16_t replication, int64_t default_block_size, - std::shared_ptr* file) { + std::shared_ptr* file) { int flags = O_WRONLY; if (append) flags |= O_APPEND; @@ -362,7 +371,7 @@ class HdfsClient::HdfsClientImpl { } // std::make_shared does not work with private ctors - *file = std::shared_ptr(new HdfsWriteableFile()); + *file = std::shared_ptr(new HdfsOutputStream()); (*file)->impl_->set_members(path, fs_, handle); return Status::OK(); @@ -440,13 +449,13 @@ Status HdfsClient::OpenReadable( Status HdfsClient::OpenWriteable(const std::string& path, bool append, int32_t buffer_size, int16_t replication, int64_t default_block_size, - std::shared_ptr* file) { + std::shared_ptr* file) { return impl_->OpenWriteable( path, append, buffer_size, replication, default_block_size, file); } Status HdfsClient::OpenWriteable( - const std::string& path, bool append, std::shared_ptr* file) { + const std::string& path, bool append, std::shared_ptr* file) { return OpenWriteable(path, append, 0, 0, 0, file); } diff --git a/cpp/src/arrow/io/hdfs.h b/cpp/src/arrow/io/hdfs.h index b6449fcb88a75..39720cc17e422 100644 --- a/cpp/src/arrow/io/hdfs.h +++ b/cpp/src/arrow/io/hdfs.h @@ -29,13 +29,14 @@ namespace arrow { +class Buffer; class Status; namespace io { class HdfsClient; class HdfsReadableFile; -class HdfsWriteableFile; +class HdfsOutputStream; struct HdfsPathInfo { ObjectType::type kind; @@ -139,14 +140,14 @@ class ARROW_EXPORT HdfsClient : public FileSystemClient { // @param default_block_size, 0 for default Status OpenWriteable(const std::string& path, bool append, int32_t buffer_size, int16_t replication, int64_t default_block_size, - std::shared_ptr* file); + std::shared_ptr* file); Status OpenWriteable( - const std::string& path, bool append, std::shared_ptr* file); + const std::string& path, bool append, std::shared_ptr* file); private: friend class HdfsReadableFile; - friend class HdfsWriteableFile; + friend class HdfsOutputStream; class ARROW_NO_EXPORT HdfsClientImpl; std::unique_ptr impl_; @@ -155,7 +156,7 @@ class ARROW_EXPORT HdfsClient : public FileSystemClient { DISALLOW_COPY_AND_ASSIGN(HdfsClient); }; -class ARROW_EXPORT HdfsReadableFile : public RandomAccessFile { +class ARROW_EXPORT HdfsReadableFile : public ReadableFileInterface { public: ~HdfsReadableFile(); @@ -166,6 +167,10 @@ class ARROW_EXPORT HdfsReadableFile : public RandomAccessFile { Status ReadAt( int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override; + Status ReadAt(int64_t position, int64_t nbytes, std::shared_ptr* out) override; + + bool supports_zero_copy() const override; + Status Seek(int64_t position) override; Status Tell(int64_t* position) override; @@ -183,9 +188,11 @@ class ARROW_EXPORT HdfsReadableFile : public RandomAccessFile { DISALLOW_COPY_AND_ASSIGN(HdfsReadableFile); }; -class ARROW_EXPORT HdfsWriteableFile : public WriteableFile { +// Naming this file OutputStream because it does not support seeking (like the +// WriteableFile interface) +class ARROW_EXPORT HdfsOutputStream : public OutputStream { public: - ~HdfsWriteableFile(); + ~HdfsOutputStream(); Status Close() override; @@ -196,14 +203,14 @@ class ARROW_EXPORT HdfsWriteableFile : public WriteableFile { Status Tell(int64_t* position) override; private: - class ARROW_NO_EXPORT HdfsWriteableFileImpl; - std::unique_ptr impl_; + class ARROW_NO_EXPORT HdfsOutputStreamImpl; + 
std::unique_ptr impl_; friend class HdfsClient::HdfsClientImpl; - HdfsWriteableFile(); + HdfsOutputStream(); - DISALLOW_COPY_AND_ASSIGN(HdfsWriteableFile); + DISALLOW_COPY_AND_ASSIGN(HdfsOutputStream); }; Status ARROW_EXPORT ConnectLibHdfs(); diff --git a/cpp/src/arrow/io/interfaces.h b/cpp/src/arrow/io/interfaces.h index c21285253714e..fa34b43b2c920 100644 --- a/cpp/src/arrow/io/interfaces.h +++ b/cpp/src/arrow/io/interfaces.h @@ -21,8 +21,11 @@ #include #include +#include "arrow/util/macros.h" + namespace arrow { +class Buffer; class Status; namespace io { @@ -40,30 +43,78 @@ class FileSystemClient { virtual ~FileSystemClient() {} }; -class FileBase { +class FileInterface { public: + virtual ~FileInterface() {} virtual Status Close() = 0; virtual Status Tell(int64_t* position) = 0; + + FileMode::type mode() const { return mode_; } + + protected: + FileInterface() {} + FileMode::type mode_; + + void set_mode(FileMode::type mode) { mode_ = mode; } + + private: + DISALLOW_COPY_AND_ASSIGN(FileInterface); }; -class ReadableFile : public FileBase { +class Seekable { public: - virtual Status ReadAt( - int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) = 0; + virtual Status Seek(int64_t position) = 0; +}; - virtual Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) = 0; +class Writeable { + public: + virtual Status Write(const uint8_t* data, int64_t nbytes) = 0; +}; - virtual Status GetSize(int64_t* size) = 0; +class Readable { + public: + virtual Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) = 0; +}; + +class OutputStream : public FileInterface, public Writeable { + protected: + OutputStream() {} }; -class RandomAccessFile : public ReadableFile { +class InputStream : public FileInterface, public Readable { + protected: + InputStream() {} +}; + +class ReadableFileInterface : public InputStream, public Seekable { public: - virtual Status Seek(int64_t position) = 0; + virtual Status ReadAt( + int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* out) = 0; + + virtual Status GetSize(int64_t* size) = 0; + + // Does not copy if not necessary + virtual Status ReadAt( + int64_t position, int64_t nbytes, std::shared_ptr* out) = 0; + + virtual bool supports_zero_copy() const = 0; + + protected: + ReadableFileInterface() { set_mode(FileMode::READ); } }; -class WriteableFile : public FileBase { +class WriteableFileInterface : public OutputStream, public Seekable { public: - virtual Status Write(const uint8_t* buffer, int64_t nbytes) = 0; + virtual Status WriteAt(int64_t position, const uint8_t* data, int64_t nbytes) = 0; + + protected: + WriteableFileInterface() { set_mode(FileMode::READ); } +}; + +class ReadWriteFileInterface : public ReadableFileInterface, + public WriteableFileInterface { + protected: + ReadWriteFileInterface() { ReadableFileInterface::set_mode(FileMode::READWRITE); } }; } // namespace io diff --git a/cpp/src/arrow/io/hdfs-io-test.cc b/cpp/src/arrow/io/io-hdfs-test.cc similarity index 99% rename from cpp/src/arrow/io/hdfs-io-test.cc rename to cpp/src/arrow/io/io-hdfs-test.cc index e48a28142fa46..7901932dee676 100644 --- a/cpp/src/arrow/io/hdfs-io-test.cc +++ b/cpp/src/arrow/io/io-hdfs-test.cc @@ -49,7 +49,7 @@ class TestHdfsClient : public ::testing::Test { Status WriteDummyFile(const std::string& path, const uint8_t* buffer, int64_t size, bool append = false, int buffer_size = 0, int replication = 0, int default_block_size = 0) { - std::shared_ptr file; + std::shared_ptr file; RETURN_NOT_OK(client_->OpenWriteable( path, 
append, buffer_size, replication, default_block_size, &file)); diff --git a/cpp/src/arrow/ipc/ipc-memory-test.cc b/cpp/src/arrow/io/io-memory-test.cc similarity index 66% rename from cpp/src/arrow/ipc/ipc-memory-test.cc rename to cpp/src/arrow/io/io-memory-test.cc index a2dbd35728c49..6de35dab59b4f 100644 --- a/cpp/src/arrow/ipc/ipc-memory-test.cc +++ b/cpp/src/arrow/io/io-memory-test.cc @@ -24,20 +24,20 @@ #include "gtest/gtest.h" -#include "arrow/ipc/memory.h" -#include "arrow/ipc/test-common.h" +#include "arrow/io/memory.h" +#include "arrow/io/test-common.h" namespace arrow { -namespace ipc { +namespace io { -class TestMemoryMappedSource : public ::testing::Test, public MemoryMapFixture { +class TestMemoryMappedFile : public ::testing::Test, public MemoryMapFixture { public: void TearDown() { MemoryMapFixture::TearDown(); } }; -TEST_F(TestMemoryMappedSource, InvalidUsages) {} +TEST_F(TestMemoryMappedFile, InvalidUsages) {} -TEST_F(TestMemoryMappedSource, WriteRead) { +TEST_F(TestMemoryMappedFile, WriteRead) { const int64_t buffer_size = 1024; std::vector buffer(buffer_size); @@ -48,14 +48,13 @@ TEST_F(TestMemoryMappedSource, WriteRead) { std::string path = "ipc-write-read-test"; CreateFile(path, reps * buffer_size); - std::shared_ptr result; - ASSERT_OK(MemoryMappedSource::Open(path, MemorySource::READ_WRITE, &result)); + std::shared_ptr result; + ASSERT_OK(MemoryMappedFile::Open(path, FileMode::READWRITE, &result)); int64_t position = 0; - std::shared_ptr out_buffer; for (int i = 0; i < reps; ++i) { - ASSERT_OK(result->Write(position, buffer.data(), buffer_size)); + ASSERT_OK(result->Write(buffer.data(), buffer_size)); ASSERT_OK(result->ReadAt(position, buffer_size, &out_buffer)); ASSERT_EQ(0, memcmp(out_buffer->data(), buffer.data(), buffer_size)); @@ -64,7 +63,7 @@ TEST_F(TestMemoryMappedSource, WriteRead) { } } -TEST_F(TestMemoryMappedSource, ReadOnly) { +TEST_F(TestMemoryMappedFile, ReadOnly) { const int64_t buffer_size = 1024; std::vector buffer(buffer_size); @@ -75,19 +74,18 @@ TEST_F(TestMemoryMappedSource, ReadOnly) { std::string path = "ipc-read-only-test"; CreateFile(path, reps * buffer_size); - std::shared_ptr rwmmap; - ASSERT_OK(MemoryMappedSource::Open(path, MemorySource::READ_WRITE, &rwmmap)); + std::shared_ptr rwmmap; + ASSERT_OK(MemoryMappedFile::Open(path, FileMode::READWRITE, &rwmmap)); int64_t position = 0; for (int i = 0; i < reps; ++i) { - ASSERT_OK(rwmmap->Write(position, buffer.data(), buffer_size)); - + ASSERT_OK(rwmmap->Write(buffer.data(), buffer_size)); position += buffer_size; } rwmmap->Close(); - std::shared_ptr rommap; - ASSERT_OK(MemoryMappedSource::Open(path, MemorySource::READ_ONLY, &rommap)); + std::shared_ptr rommap; + ASSERT_OK(MemoryMappedFile::Open(path, FileMode::READ, &rommap)); position = 0; std::shared_ptr out_buffer; @@ -100,7 +98,7 @@ TEST_F(TestMemoryMappedSource, ReadOnly) { rommap->Close(); } -TEST_F(TestMemoryMappedSource, InvalidMode) { +TEST_F(TestMemoryMappedFile, InvalidMode) { const int64_t buffer_size = 1024; std::vector buffer(buffer_size); @@ -109,19 +107,19 @@ TEST_F(TestMemoryMappedSource, InvalidMode) { std::string path = "ipc-invalid-mode-test"; CreateFile(path, buffer_size); - std::shared_ptr rommap; - ASSERT_OK(MemoryMappedSource::Open(path, MemorySource::READ_ONLY, &rommap)); + std::shared_ptr rommap; + ASSERT_OK(MemoryMappedFile::Open(path, FileMode::READ, &rommap)); - ASSERT_RAISES(IOError, rommap->Write(0, buffer.data(), buffer_size)); + ASSERT_RAISES(IOError, rommap->Write(buffer.data(), buffer_size)); } 
-TEST_F(TestMemoryMappedSource, InvalidFile) {
+TEST_F(TestMemoryMappedFile, InvalidFile) {
   std::string non_existent_path = "invalid-file-name-asfd";

-  std::shared_ptr<MemoryMappedSource> result;
-  ASSERT_RAISES(IOError,
-      MemoryMappedSource::Open(non_existent_path, MemorySource::READ_ONLY, &result));
+  std::shared_ptr<MemoryMappedFile> result;
+  ASSERT_RAISES(
+      IOError, MemoryMappedFile::Open(non_existent_path, FileMode::READ, &result));
 }

-}  // namespace ipc
+}  // namespace io
 }  // namespace arrow
diff --git a/cpp/src/arrow/io/libhdfs_shim.cc b/cpp/src/arrow/io/libhdfs_shim.cc
index 003570d4fdee6..0b805abf94c1b 100644
--- a/cpp/src/arrow/io/libhdfs_shim.cc
+++ b/cpp/src/arrow/io/libhdfs_shim.cc
@@ -51,8 +51,7 @@ extern "C" {
 #include
 #include

-#include   // NOLINT
-#include   // NOLINT
+#include   // NOLINT

 #include "arrow/util/status.h"
 #include "arrow/util/visibility.h"
diff --git a/cpp/src/arrow/io/memory.cc b/cpp/src/arrow/io/memory.cc
new file mode 100644
index 0000000000000..1dd6c3a02304a
--- /dev/null
+++ b/cpp/src/arrow/io/memory.cc
@@ -0,0 +1,262 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
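The implementation that follows hides its state behind a pimpl and layers Arrow's Status-based API over the classic POSIX mapping sequence: open the file, measure it, then mmap() it with protection flags matching the requested FileMode. Condensed to its essentials, and with the error handling of the real code elided, that sequence is roughly:

    #include <sys/mman.h>

    #include <cstdint>
    #include <cstdio>

    // Rough sketch of the mapping sequence; the real implementation keeps
    // the FILE* open for the lifetime of the map and checks every step.
    uint8_t* MapWholeFile(const char* path, bool writable, int64_t* size) {
      FILE* file = fopen(path, writable ? "r+b" : "rb");
      fseek(file, 0L, SEEK_END);
      *size = ftell(file);
      fseek(file, 0L, SEEK_SET);
      int prot = PROT_READ | (writable ? PROT_WRITE : 0);
      void* mem = mmap(nullptr, *size, prot, MAP_SHARED, fileno(file), 0);
      return mem == MAP_FAILED ? nullptr : static_cast<uint8_t*>(mem);
    }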
+ +#include "arrow/io/memory.h" + +#include // For memory-mapping + +#include +#include +#include +#include +#include +#include +#include + +#include "arrow/io/interfaces.h" + +#include "arrow/util/buffer.h" +#include "arrow/util/status.h" + +namespace arrow { +namespace io { + +// Implement MemoryMappedFile + +class MemoryMappedFile::MemoryMappedFileImpl { + public: + MemoryMappedFileImpl() + : file_(nullptr), is_open_(false), is_writable_(false), data_(nullptr) {} + + ~MemoryMappedFileImpl() { + if (is_open_) { + munmap(data_, size_); + fclose(file_); + } + } + + Status Open(const std::string& path, FileMode::type mode) { + if (is_open_) { return Status::IOError("A file is already open"); } + + int prot_flags = PROT_READ; + + if (mode == FileMode::READWRITE) { + file_ = fopen(path.c_str(), "r+b"); + prot_flags |= PROT_WRITE; + is_writable_ = true; + } else { + file_ = fopen(path.c_str(), "rb"); + } + if (file_ == nullptr) { + std::stringstream ss; + ss << "Unable to open file, errno: " << errno; + return Status::IOError(ss.str()); + } + + fseek(file_, 0L, SEEK_END); + if (ferror(file_)) { return Status::IOError("Unable to seek to end of file"); } + size_ = ftell(file_); + + fseek(file_, 0L, SEEK_SET); + is_open_ = true; + position_ = 0; + + void* result = mmap(nullptr, size_, prot_flags, MAP_SHARED, fileno(file_), 0); + if (result == MAP_FAILED) { + std::stringstream ss; + ss << "Memory mapping file failed, errno: " << errno; + return Status::IOError(ss.str()); + } + data_ = reinterpret_cast(result); + + return Status::OK(); + } + + int64_t size() const { return size_; } + + Status Seek(int64_t position) { + if (position < 0 || position >= size_) { + return Status::Invalid("position is out of bounds"); + } + position_ = position; + return Status::OK(); + } + + int64_t position() { return position_; } + + void advance(int64_t nbytes) { position_ = std::min(size_, position_ + nbytes); } + + uint8_t* data() { return data_; } + + uint8_t* head() { return data_ + position_; } + + bool writable() { return is_writable_; } + + bool opened() { return is_open_; } + + private: + FILE* file_; + int64_t position_; + int64_t size_; + bool is_open_; + bool is_writable_; + + // The memory map + uint8_t* data_; +}; + +MemoryMappedFile::MemoryMappedFile(FileMode::type mode) { + ReadableFileInterface::set_mode(mode); +} + +Status MemoryMappedFile::Open(const std::string& path, FileMode::type mode, + std::shared_ptr* out) { + std::shared_ptr result(new MemoryMappedFile(mode)); + + result->impl_.reset(new MemoryMappedFileImpl()); + RETURN_NOT_OK(result->impl_->Open(path, mode)); + + *out = result; + return Status::OK(); +} + +Status MemoryMappedFile::GetSize(int64_t* size) { + *size = impl_->size(); + return Status::OK(); +} + +Status MemoryMappedFile::Tell(int64_t* position) { + *position = impl_->position(); + return Status::OK(); +} + +Status MemoryMappedFile::Seek(int64_t position) { + return impl_->Seek(position); +} + +Status MemoryMappedFile::Close() { + // munmap handled in pimpl dtor + return Status::OK(); +} + +Status MemoryMappedFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) { + nbytes = std::min(nbytes, impl_->size() - impl_->position()); + std::memcpy(out, impl_->head(), nbytes); + *bytes_read = nbytes; + impl_->advance(nbytes); + return Status::OK(); +} + +Status MemoryMappedFile::ReadAt( + int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* out) { + RETURN_NOT_OK(impl_->Seek(position)); + return Read(nbytes, bytes_read, out); +} + +Status MemoryMappedFile::ReadAt( + 
+    int64_t position, int64_t nbytes, std::shared_ptr<Buffer>* out) {
+  nbytes = std::min(nbytes, impl_->size() - position);
+  RETURN_NOT_OK(impl_->Seek(position));
+  *out = std::make_shared<Buffer>(impl_->head(), nbytes);
+  impl_->advance(nbytes);
+  return Status::OK();
+}
+
+bool MemoryMappedFile::supports_zero_copy() const {
+  return true;
+}
+
+Status MemoryMappedFile::WriteAt(int64_t position, const uint8_t* data, int64_t nbytes) {
+  if (!impl_->opened() || !impl_->writable()) {
+    return Status::IOError("Unable to write");
+  }
+
+  RETURN_NOT_OK(impl_->Seek(position));
+  return WriteInternal(data, nbytes);
+}
+
+Status MemoryMappedFile::Write(const uint8_t* data, int64_t nbytes) {
+  if (!impl_->opened() || !impl_->writable()) {
+    return Status::IOError("Unable to write");
+  }
+  if (nbytes + impl_->position() > impl_->size()) {
+    return Status::Invalid("Cannot write past end of memory map");
+  }
+
+  return WriteInternal(data, nbytes);
+}
+
+Status MemoryMappedFile::WriteInternal(const uint8_t* data, int64_t nbytes) {
+  memcpy(impl_->head(), data, nbytes);
+  impl_->advance(nbytes);
+  return Status::OK();
+}
+
+// ----------------------------------------------------------------------
+// In-memory buffer reader
+
+Status BufferReader::Close() {
+  // no-op
+  return Status::OK();
+}
+
+Status BufferReader::Tell(int64_t* position) {
+  *position = position_;
+  return Status::OK();
+}
+
+Status BufferReader::ReadAt(
+    int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) {
+  RETURN_NOT_OK(Seek(position));
+  return Read(nbytes, bytes_read, buffer);
+}
+
+Status BufferReader::ReadAt(
+    int64_t position, int64_t nbytes, std::shared_ptr<Buffer>* out) {
+  int64_t size = std::min(nbytes, buffer_size_ - position);
+  *out = std::make_shared<Buffer>(buffer_ + position, size);
+  position_ = position + size;
+  return Status::OK();
+}
+
+bool BufferReader::supports_zero_copy() const {
+  return true;
+}
+
+Status BufferReader::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) {
+  *bytes_read = std::min(nbytes, buffer_size_ - position_);
+  memcpy(buffer, buffer_ + position_, *bytes_read);
+  position_ += *bytes_read;
+  return Status::OK();
+}
+
+Status BufferReader::GetSize(int64_t* size) {
+  *size = buffer_size_;
+  return Status::OK();
+}
+
+Status BufferReader::Seek(int64_t position) {
+  if (position < 0 || position >= buffer_size_) {
+    return Status::IOError("position out of bounds");
+  }
+
+  position_ = position;
+  return Status::OK();
+}
+
+}  // namespace io
+}  // namespace arrow
diff --git a/cpp/src/arrow/io/memory.h b/cpp/src/arrow/io/memory.h
new file mode 100644
index 0000000000000..6fe47c3b5157a
--- /dev/null
+++ b/cpp/src/arrow/io/memory.h
@@ -0,0 +1,130 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
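A central design decision in the header that follows is the pairing of supports_zero_copy() with the Buffer-returning ReadAt overload: when the source is a memory map, the returned Buffer simply aliases the mapping, so no allocation or copy occurs, but the file must then outlive the Buffer. A sketch contrasting the two overloads, with names taken from the declarations below:

    #include <memory>

    #include "arrow/io/interfaces.h"
    #include "arrow/util/buffer.h"
    #include "arrow/util/status.h"

    // Hedged illustration of the two read paths.
    arrow::Status CompareReads(arrow::io::ReadableFileInterface* file) {
      // Copying overload: the caller owns the destination bytes.
      uint8_t scratch[64];
      int64_t bytes_read = 0;
      RETURN_NOT_OK(file->ReadAt(0, sizeof(scratch), &bytes_read, scratch));

      // Buffer overload: when supports_zero_copy() is true, the Buffer
      // aliases the source (e.g. the mmap'd region) rather than copying it,
      // so the file must stay open while the Buffer is in use.
      std::shared_ptr<arrow::Buffer> zero_copy;
      RETURN_NOT_OK(file->ReadAt(0, sizeof(scratch), &zero_copy));
      return arrow::Status::OK();
    }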
+ +// Public API for different memory sharing / IO mechanisms + +#ifndef ARROW_IO_MEMORY_H +#define ARROW_IO_MEMORY_H + +#include +#include +#include + +#include "arrow/io/interfaces.h" + +#include "arrow/util/macros.h" +#include "arrow/util/visibility.h" + +namespace arrow { + +class Buffer; +class MutableBuffer; +class Status; + +namespace io { + +// An output stream that writes to a MutableBuffer, such as one obtained from a +// memory map +// +// TODO(wesm): Implement this class +class ARROW_EXPORT BufferOutputStream : public OutputStream { + public: + explicit BufferOutputStream(const std::shared_ptr& buffer) + : buffer_(buffer) {} + + // Implement the OutputStream interface + Status Close() override; + Status Tell(int64_t* position) override; + Status Write(const uint8_t* data, int64_t length) override; + + // Returns the number of bytes remaining in the buffer + int64_t bytes_remaining() const; + + private: + std::shared_ptr buffer_; + int64_t capacity_; + int64_t position_; +}; + +// A memory source that uses memory-mapped files for memory interactions +class ARROW_EXPORT MemoryMappedFile : public ReadWriteFileInterface { + public: + static Status Open(const std::string& path, FileMode::type mode, + std::shared_ptr* out); + + Status Close() override; + + Status Tell(int64_t* position) override; + + Status Seek(int64_t position) override; + + // Required by ReadableFileInterface, copies memory into out + Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) override; + + Status ReadAt( + int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* out) override; + + // Read into a buffer, zero copy if possible + Status ReadAt(int64_t position, int64_t nbytes, std::shared_ptr* out) override; + + bool supports_zero_copy() const override; + + Status Write(const uint8_t* data, int64_t nbytes) override; + + Status WriteAt(int64_t position, const uint8_t* data, int64_t nbytes) override; + + // @return: the size in bytes of the memory source + Status GetSize(int64_t* size) override; + + private: + explicit MemoryMappedFile(FileMode::type mode); + + Status WriteInternal(const uint8_t* data, int64_t nbytes); + + // Hide the internal details of this class for now + class MemoryMappedFileImpl; + std::unique_ptr impl_; +}; + +class ARROW_EXPORT BufferReader : public ReadableFileInterface { + public: + BufferReader(const uint8_t* buffer, int buffer_size) + : buffer_(buffer), buffer_size_(buffer_size), position_(0) {} + + Status Close() override; + Status Tell(int64_t* position) override; + + Status ReadAt( + int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override; + Status ReadAt(int64_t position, int64_t nbytes, std::shared_ptr* out) override; + + Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override; + Status GetSize(int64_t* size) override; + Status Seek(int64_t position) override; + + bool supports_zero_copy() const override; + + private: + const uint8_t* buffer_; + int buffer_size_; + int64_t position_; +}; + +} // namespace io +} // namespace arrow + +#endif // ARROW_IO_MEMORY_H diff --git a/cpp/src/arrow/io/test-common.h b/cpp/src/arrow/io/test-common.h new file mode 100644 index 0000000000000..1954d479e3930 --- /dev/null +++ b/cpp/src/arrow/io/test-common.h @@ -0,0 +1,63 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. 
The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_IO_TEST_COMMON_H +#define ARROW_IO_TEST_COMMON_H + +#include +#include +#include +#include +#include + +#include "arrow/io/memory.h" +#include "arrow/test-util.h" +#include "arrow/util/buffer.h" +#include "arrow/util/memory-pool.h" + +namespace arrow { +namespace io { + +class MemoryMapFixture { + public: + void TearDown() { + for (auto path : tmp_files_) { + std::remove(path.c_str()); + } + } + + void CreateFile(const std::string path, int64_t size) { + FILE* file = fopen(path.c_str(), "w"); + if (file != nullptr) { tmp_files_.push_back(path); } + ftruncate(fileno(file), size); + fclose(file); + } + + Status InitMemoryMap( + int64_t size, const std::string& path, std::shared_ptr* mmap) { + CreateFile(path, size); + return MemoryMappedFile::Open(path, FileMode::READWRITE, mmap); + } + + private: + std::vector tmp_files_; +}; + +} // namespace io +} // namespace arrow + +#endif // ARROW_IO_TEST_COMMON_H diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index 82634169ed925..e5553a6358115 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -19,16 +19,50 @@ # arrow_ipc ####################################### -# Headers: top level -install(FILES - adapter.h - metadata.h - memory.h - DESTINATION include/arrow/ipc) +set(ARROW_IPC_LINK_LIBS + arrow_io + arrow_shared +) + +set(ARROW_IPC_PRIVATE_LINK_LIBS + ) + +set(ARROW_IPC_TEST_LINK_LIBS + arrow_ipc + ${ARROW_IPC_PRIVATE_LINK_LIBS}) + +set(ARROW_IPC_SRCS + adapter.cc + metadata.cc + metadata-internal.cc +) + +# TODO(wesm): SHARED and STATIC targets +add_library(arrow_ipc SHARED + ${ARROW_IPC_SRCS} +) +target_link_libraries(arrow_ipc + LINK_PUBLIC ${ARROW_IPC_LINK_LIBS} + LINK_PRIVATE ${ARROW_IPC_PRIVATE_LINK_LIBS}) + +if(NOT APPLE) + # Localize thirdparty symbols using a linker version script. This hides them + # from the client application. The OS X linker does not support the + # version-script option. 
+ set(ARROW_IPC_LINK_FLAGS "-Wl,--version-script=${CMAKE_CURRENT_SOURCE_DIR}/symbols.map") +endif() + +SET_TARGET_PROPERTIES(arrow_ipc PROPERTIES + LINKER_LANGUAGE CXX + LINK_FLAGS "${ARROW_IPC_LINK_FLAGS}") ADD_ARROW_TEST(ipc-adapter-test) -ADD_ARROW_TEST(ipc-memory-test) +ARROW_TEST_LINK_LIBRARIES(ipc-adapter-test + ${ARROW_IPC_TEST_LINK_LIBS}) + ADD_ARROW_TEST(ipc-metadata-test) +ARROW_TEST_LINK_LIBRARIES(ipc-metadata-test + ${ARROW_IPC_TEST_LINK_LIBS}) # make clean will delete the generated file set_source_files_properties(Metadata_generated.h PROPERTIES GENERATED TRUE) @@ -49,3 +83,13 @@ add_custom_command( add_custom_target(metadata_fbs DEPENDS ${FBS_OUTPUT_FILES}) add_dependencies(arrow_objlib metadata_fbs) + +# Headers: top level +install(FILES + adapter.h + metadata.h + DESTINATION include/arrow/ipc) + +install(TARGETS arrow_ipc + LIBRARY DESTINATION lib + ARCHIVE DESTINATION lib) diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index 40d372bbd3520..0e101c8930395 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -24,9 +24,11 @@ #include "arrow/array.h" #include "arrow/ipc/Message_generated.h" -#include "arrow/ipc/memory.h" #include "arrow/ipc/metadata-internal.h" #include "arrow/ipc/metadata.h" +#include "arrow/ipc/util.h" +#include "arrow/io/interfaces.h" +#include "arrow/io/memory.h" #include "arrow/schema.h" #include "arrow/table.h" #include "arrow/type.h" @@ -144,10 +146,15 @@ class RowBatchWriter { return Status::OK(); } - Status Write(MemorySource* dst, int64_t position, int64_t* data_header_offset) { + Status Write(io::OutputStream* dst, int64_t* data_header_offset) { // Write out all the buffers contiguously and compute the total size of the // memory payload int64_t offset = 0; + + // Get the starting position + int64_t position; + RETURN_NOT_OK(dst->Tell(&position)); + for (size_t i = 0; i < buffers_.size(); ++i) { const Buffer* buffer = buffers_[i].get(); int64_t size = 0; @@ -171,7 +178,7 @@ class RowBatchWriter { buffer_meta_.push_back(flatbuf::Buffer(0, position + offset, size)); if (size > 0) { - RETURN_NOT_OK(dst->Write(position + offset, buffer->data(), size)); + RETURN_NOT_OK(dst->Write(buffer->data(), size)); offset += size; } } @@ -180,7 +187,7 @@ class RowBatchWriter { // memory, the data header can be converted to a flatbuffer and written out // // Note: The memory written here is prefixed by the size of the flatbuffer - // itself as an int32_t. On reading from a MemorySource, you will have to + // itself as an int32_t. 
On reading from a input, you will have to // determine the data header size then request a buffer such that you can // construct the flatbuffer data accessor object (see arrow::ipc::Message) std::shared_ptr data_header; @@ -188,8 +195,7 @@ class RowBatchWriter { batch_->num_rows(), offset, field_nodes_, buffer_meta_, &data_header)); // Write the data header at the end - RETURN_NOT_OK( - dst->Write(position + offset, data_header->data(), data_header->size())); + RETURN_NOT_OK(dst->Write(data_header->data(), data_header->size())); *data_header_offset = position + offset; return Status::OK(); @@ -199,9 +205,9 @@ class RowBatchWriter { Status GetTotalSize(int64_t* size) { // emulates the behavior of Write without actually writing int64_t data_header_offset; - MockMemorySource source(0); - RETURN_NOT_OK(Write(&source, 0, &data_header_offset)); - *size = source.GetExtentBytesWritten(); + MockOutputStream dst; + RETURN_NOT_OK(Write(&dst, &data_header_offset)); + *size = dst.GetExtentBytesWritten(); return Status::OK(); } @@ -214,12 +220,12 @@ class RowBatchWriter { int max_recursion_depth_; }; -Status WriteRowBatch(MemorySource* dst, const RowBatch* batch, int64_t position, - int64_t* header_offset, int max_recursion_depth) { +Status WriteRowBatch(io::OutputStream* dst, const RowBatch* batch, int64_t* header_offset, + int max_recursion_depth) { DCHECK_GT(max_recursion_depth, 0); RowBatchWriter serializer(batch, max_recursion_depth); RETURN_NOT_OK(serializer.AssemblePayload()); - return serializer.Write(dst, position, header_offset); + return serializer.Write(dst, header_offset); } Status GetRowBatchSize(const RowBatch* batch, int64_t* size) { @@ -234,11 +240,11 @@ Status GetRowBatchSize(const RowBatch* batch, int64_t* size) { static constexpr int64_t INIT_METADATA_SIZE = 4096; -class RowBatchReader::Impl { +class RowBatchReader::RowBatchReaderImpl { public: - Impl(MemorySource* source, const std::shared_ptr& metadata, - int max_recursion_depth) - : source_(source), metadata_(metadata), max_recursion_depth_(max_recursion_depth) { + RowBatchReaderImpl(io::ReadableFileInterface* file, + const std::shared_ptr& metadata, int max_recursion_depth) + : file_(file), metadata_(metadata), max_recursion_depth_(max_recursion_depth) { num_buffers_ = metadata->num_buffers(); num_flattened_fields_ = metadata->num_fields(); } @@ -339,10 +345,11 @@ class RowBatchReader::Impl { Status GetBuffer(int buffer_index, std::shared_ptr* out) { BufferMetadata metadata = metadata_->buffer(buffer_index); RETURN_NOT_OK(CheckMultipleOf64(metadata.length)); - return source_->ReadAt(metadata.offset, metadata.length, out); + return file_->ReadAt(metadata.offset, metadata.length, out); } - MemorySource* source_; + private: + io::ReadableFileInterface* file_; std::shared_ptr metadata_; int field_index_; @@ -352,22 +359,22 @@ class RowBatchReader::Impl { int num_flattened_fields_; }; -Status RowBatchReader::Open( - MemorySource* source, int64_t position, std::shared_ptr* out) { - return Open(source, position, kMaxIpcRecursionDepth, out); +Status RowBatchReader::Open(io::ReadableFileInterface* file, int64_t position, + std::shared_ptr* out) { + return Open(file, position, kMaxIpcRecursionDepth, out); } -Status RowBatchReader::Open(MemorySource* source, int64_t position, +Status RowBatchReader::Open(io::ReadableFileInterface* file, int64_t position, int max_recursion_depth, std::shared_ptr* out) { std::shared_ptr metadata; - RETURN_NOT_OK(source->ReadAt(position, INIT_METADATA_SIZE, &metadata)); + RETURN_NOT_OK(file->ReadAt(position, 
INIT_METADATA_SIZE, &metadata)); int32_t metadata_size = *reinterpret_cast(metadata->data()); - // We may not need to call source->ReadAt again + // We may not need to call ReadAt again if (metadata_size > static_cast(INIT_METADATA_SIZE - sizeof(int32_t))) { // We don't have enough data, read the indicated metadata size. - RETURN_NOT_OK(source->ReadAt(position + sizeof(int32_t), metadata_size, &metadata)); + RETURN_NOT_OK(file->ReadAt(position + sizeof(int32_t), metadata_size, &metadata)); } // TODO(wesm): buffer slicing here would be better in case ReadAt returns @@ -383,14 +390,14 @@ Status RowBatchReader::Open(MemorySource* source, int64_t position, std::shared_ptr batch_meta = message->GetRecordBatch(); std::shared_ptr result(new RowBatchReader()); - result->impl_.reset(new Impl(source, batch_meta, max_recursion_depth)); + result->impl_.reset(new RowBatchReaderImpl(file, batch_meta, max_recursion_depth)); *out = result; return Status::OK(); } // Here the explicit destructor is required for compilers to be aware of -// the complete information of RowBatchReader::Impl class +// the complete information of RowBatchReader::RowBatchReaderImpl class RowBatchReader::~RowBatchReader() {} Status RowBatchReader::GetRowBatch( diff --git a/cpp/src/arrow/ipc/adapter.h b/cpp/src/arrow/ipc/adapter.h index 6231af66aa180..215b46f8f65d4 100644 --- a/cpp/src/arrow/ipc/adapter.h +++ b/cpp/src/arrow/ipc/adapter.h @@ -33,9 +33,15 @@ class RowBatch; class Schema; class Status; +namespace io { + +class ReadableFileInterface; +class OutputStream; + +} // namespace io + namespace ipc { -class MemorySource; class RecordBatchMessage; // ---------------------------------------------------------------------- @@ -43,22 +49,21 @@ class RecordBatchMessage; // We have trouble decoding flatbuffers if the size i > 70, so 64 is a nice round number // TODO(emkornfield) investigate this more constexpr int kMaxIpcRecursionDepth = 64; -// Write the RowBatch (collection of equal-length Arrow arrays) to the memory -// source at the indicated position + +// Write the RowBatch (collection of equal-length Arrow arrays) to the output +// stream // -// First, each of the memory buffers are written out end-to-end in starting at -// the indicated position. 
+// First, each of the memory buffers are written out end-to-end // // Then, this function writes the batch metadata as a flatbuffer (see // format/Message.fbs -- the RecordBatch message type) like so: // // // -// Finally, the memory offset to the start of the metadata / data header is -// returned in an out-variable -ARROW_EXPORT Status WriteRowBatch(MemorySource* dst, const RowBatch* batch, - int64_t position, int64_t* header_offset, - int max_recursion_depth = kMaxIpcRecursionDepth); +// Finally, the absolute offset (relative to the start of the output stream) to +// the start of the metadata / data header is returned in an out-variable +ARROW_EXPORT Status WriteRowBatch(io::OutputStream* dst, const RowBatch* batch, + int64_t* header_offset, int max_recursion_depth = kMaxIpcRecursionDepth); // int64_t GetRowBatchMetadata(const RowBatch* batch); @@ -68,16 +73,16 @@ ARROW_EXPORT Status WriteRowBatch(MemorySource* dst, const RowBatch* batch, ARROW_EXPORT Status GetRowBatchSize(const RowBatch* batch, int64_t* size); // ---------------------------------------------------------------------- -// "Read" path; does not copy data if the MemorySource does not +// "Read" path; does not copy data if the input supports zero copy reads class ARROW_EXPORT RowBatchReader { public: - static Status Open( - MemorySource* source, int64_t position, std::shared_ptr* out); - - static Status Open(MemorySource* source, int64_t position, int max_recursion_depth, + static Status Open(io::ReadableFileInterface* file, int64_t position, std::shared_ptr* out); + static Status Open(io::ReadableFileInterface* file, int64_t position, + int max_recursion_depth, std::shared_ptr* out); + virtual ~RowBatchReader(); // Reassemble the row batch. A Schema is required to be able to construct the @@ -86,8 +91,8 @@ class ARROW_EXPORT RowBatchReader { const std::shared_ptr& schema, std::shared_ptr* out); private: - class Impl; - std::unique_ptr impl_; + class RowBatchReaderImpl; + std::unique_ptr impl_; }; } // namespace ipc diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc index 6740e0fc5acc2..ca4d0152b9060 100644 --- a/cpp/src/arrow/ipc/ipc-adapter-test.cc +++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc @@ -24,9 +24,11 @@ #include "gtest/gtest.h" +#include "arrow/io/memory.h" +#include "arrow/io/test-common.h" #include "arrow/ipc/adapter.h" -#include "arrow/ipc/memory.h" #include "arrow/ipc/test-common.h" +#include "arrow/ipc/util.h" #include "arrow/test-util.h" #include "arrow/types/list.h" @@ -49,17 +51,18 @@ const auto LIST_LIST_INT32 = std::make_shared(LIST_INT32); typedef Status MakeRowBatch(std::shared_ptr* out); class TestWriteRowBatch : public ::testing::TestWithParam, - public MemoryMapFixture { + public io::MemoryMapFixture { public: void SetUp() { pool_ = default_memory_pool(); } - void TearDown() { MemoryMapFixture::TearDown(); } + void TearDown() { io::MemoryMapFixture::TearDown(); } Status RoundTripHelper(const RowBatch& batch, int memory_map_size, std::shared_ptr* batch_result) { std::string path = "test-write-row-batch"; - MemoryMapFixture::InitMemoryMap(memory_map_size, path, &mmap_); + io::MemoryMapFixture::InitMemoryMap(memory_map_size, path, &mmap_); int64_t header_location; - RETURN_NOT_OK(WriteRowBatch(mmap_.get(), &batch, 0, &header_location)); + + RETURN_NOT_OK(WriteRowBatch(mmap_.get(), &batch, &header_location)); std::shared_ptr reader; RETURN_NOT_OK(RowBatchReader::Open(mmap_.get(), header_location, &reader)); @@ -69,7 +72,7 @@ class TestWriteRowBatch : public 
::testing::TestWithParam, } protected: - std::shared_ptr mmap_; + std::shared_ptr mmap_; MemoryPool* pool_; }; @@ -276,12 +279,12 @@ INSTANTIATE_TEST_CASE_P(RoundTripTests, TestWriteRowBatch, &MakeStringTypesRowBatch, &MakeStruct)); void TestGetRowBatchSize(std::shared_ptr batch) { - MockMemorySource mock_source(1 << 16); + ipc::MockOutputStream mock; int64_t mock_header_location = -1; int64_t size = -1; - ASSERT_OK(WriteRowBatch(&mock_source, batch.get(), 0, &mock_header_location)); + ASSERT_OK(WriteRowBatch(&mock, batch.get(), &mock_header_location)); ASSERT_OK(GetRowBatchSize(batch.get(), &size)); - ASSERT_EQ(mock_source.GetExtentBytesWritten(), size); + ASSERT_EQ(mock.GetExtentBytesWritten(), size); } TEST_F(TestWriteRowBatch, IntegerGetRowBatchSize) { @@ -303,10 +306,10 @@ TEST_F(TestWriteRowBatch, IntegerGetRowBatchSize) { TestGetRowBatchSize(batch); } -class RecursionLimits : public ::testing::Test, public MemoryMapFixture { +class RecursionLimits : public ::testing::Test, public io::MemoryMapFixture { public: void SetUp() { pool_ = default_memory_pool(); } - void TearDown() { MemoryMapFixture::TearDown(); } + void TearDown() { io::MemoryMapFixture::TearDown(); } Status WriteToMmap(int recursion_level, bool override_level, int64_t* header_out = nullptr, std::shared_ptr* schema_out = nullptr) { @@ -329,19 +332,19 @@ class RecursionLimits : public ::testing::Test, public MemoryMapFixture { std::string path = "test-write-past-max-recursion"; const int memory_map_size = 1 << 16; - MemoryMapFixture::InitMemoryMap(memory_map_size, path, &mmap_); + io::MemoryMapFixture::InitMemoryMap(memory_map_size, path, &mmap_); int64_t header_location; int64_t* header_out_param = header_out == nullptr ? &header_location : header_out; if (override_level) { return WriteRowBatch( - mmap_.get(), batch.get(), 0, header_out_param, recursion_level + 1); + mmap_.get(), batch.get(), header_out_param, recursion_level + 1); } else { - return WriteRowBatch(mmap_.get(), batch.get(), 0, header_out_param); + return WriteRowBatch(mmap_.get(), batch.get(), header_out_param); } } protected: - std::shared_ptr mmap_; + std::shared_ptr mmap_; MemoryPool* pool_; }; diff --git a/cpp/src/arrow/ipc/memory.cc b/cpp/src/arrow/ipc/memory.cc deleted file mode 100644 index a6c56d64f4aed..0000000000000 --- a/cpp/src/arrow/ipc/memory.cc +++ /dev/null @@ -1,182 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
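Everything removed below used position-explicit calls such as MemorySource::Write(position, data, nbytes); the replacement decomposes that one method into the Seekable/Writeable primitives, so old call sites translate mechanically. A sketch of the equivalence, using the new interface names:

    // Before (removed API):  source->Write(position, data, nbytes);
    // After, equivalently:
    arrow::Status PositionalWrite(arrow::io::ReadWriteFileInterface* file,
                                  int64_t position, const uint8_t* data,
                                  int64_t nbytes) {
      return file->WriteAt(position, data, nbytes);  // or Seek(), then Write()
    }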
- -#include "arrow/ipc/memory.h" - -#include // For memory-mapping - -#include -#include -#include -#include -#include -#include -#include - -#include "arrow/util/buffer.h" -#include "arrow/util/status.h" - -namespace arrow { -namespace ipc { - -MemorySource::MemorySource(AccessMode access_mode) : access_mode_(access_mode) {} - -MemorySource::~MemorySource() {} - -// Implement MemoryMappedSource - -class MemoryMappedSource::Impl { - public: - Impl() : file_(nullptr), is_open_(false), is_writable_(false), data_(nullptr) {} - - ~Impl() { - if (is_open_) { - munmap(data_, size_); - fclose(file_); - } - } - - Status Open(const std::string& path, MemorySource::AccessMode mode) { - if (is_open_) { return Status::IOError("A file is already open"); } - - int prot_flags = PROT_READ; - - if (mode == MemorySource::READ_WRITE) { - file_ = fopen(path.c_str(), "r+b"); - prot_flags |= PROT_WRITE; - is_writable_ = true; - } else { - file_ = fopen(path.c_str(), "rb"); - } - if (file_ == nullptr) { - std::stringstream ss; - ss << "Unable to open file, errno: " << errno; - return Status::IOError(ss.str()); - } - - fseek(file_, 0L, SEEK_END); - if (ferror(file_)) { return Status::IOError("Unable to seek to end of file"); } - size_ = ftell(file_); - - fseek(file_, 0L, SEEK_SET); - is_open_ = true; - - void* result = mmap(nullptr, size_, prot_flags, MAP_SHARED, fileno(file_), 0); - if (result == MAP_FAILED) { - std::stringstream ss; - ss << "Memory mapping file failed, errno: " << errno; - return Status::IOError(ss.str()); - } - data_ = reinterpret_cast(result); - - return Status::OK(); - } - - int64_t size() const { return size_; } - - uint8_t* data() { return data_; } - - bool writable() { return is_writable_; } - - bool opened() { return is_open_; } - - private: - FILE* file_; - int64_t size_; - bool is_open_; - bool is_writable_; - - // The memory map - uint8_t* data_; -}; - -MemoryMappedSource::MemoryMappedSource(AccessMode access_mode) - : MemorySource(access_mode) {} - -Status MemoryMappedSource::Open(const std::string& path, AccessMode access_mode, - std::shared_ptr* out) { - std::shared_ptr result(new MemoryMappedSource(access_mode)); - - result->impl_.reset(new Impl()); - RETURN_NOT_OK(result->impl_->Open(path, access_mode)); - - *out = result; - return Status::OK(); -} - -int64_t MemoryMappedSource::Size() const { - return impl_->size(); -} - -Status MemoryMappedSource::Close() { - // munmap handled in ::Impl dtor - return Status::OK(); -} - -Status MemoryMappedSource::ReadAt( - int64_t position, int64_t nbytes, std::shared_ptr* out) { - if (position < 0 || position >= impl_->size()) { - return Status::Invalid("position is out of bounds"); - } - - nbytes = std::min(nbytes, impl_->size() - position); - *out = std::make_shared(impl_->data() + position, nbytes); - return Status::OK(); -} - -Status MemoryMappedSource::Write(int64_t position, const uint8_t* data, int64_t nbytes) { - if (!impl_->opened() || !impl_->writable()) { - return Status::IOError("Unable to write"); - } - if (position < 0 || position >= impl_->size()) { - return Status::Invalid("position is out of bounds"); - } - - // TODO(wesm): verify we are not writing past the end of the buffer - uint8_t* dst = impl_->data() + position; - memcpy(dst, data, nbytes); - - return Status::OK(); -} - -MockMemorySource::MockMemorySource(int64_t size) - : size_(size), extent_bytes_written_(0) {} - -Status MockMemorySource::Close() { - return Status::OK(); -} - -Status MockMemorySource::ReadAt( - int64_t position, int64_t nbytes, std::shared_ptr* out) { 
- return Status::OK(); -} - -Status MockMemorySource::Write(int64_t position, const uint8_t* data, int64_t nbytes) { - extent_bytes_written_ = std::max(extent_bytes_written_, position + nbytes); - return Status::OK(); -} - -int64_t MockMemorySource::Size() const { - return size_; -} - -int64_t MockMemorySource::GetExtentBytesWritten() const { - return extent_bytes_written_; -} - -} // namespace ipc -} // namespace arrow diff --git a/cpp/src/arrow/ipc/memory.h b/cpp/src/arrow/ipc/memory.h deleted file mode 100644 index 377401d85c00a..0000000000000 --- a/cpp/src/arrow/ipc/memory.h +++ /dev/null @@ -1,150 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -// Public API for different interprocess memory sharing mechanisms - -#ifndef ARROW_IPC_MEMORY_H -#define ARROW_IPC_MEMORY_H - -#include -#include -#include - -#include "arrow/util/macros.h" -#include "arrow/util/visibility.h" - -namespace arrow { - -class Buffer; -class MutableBuffer; -class Status; - -namespace ipc { - -// Abstract output stream -class OutputStream { - public: - virtual ~OutputStream() {} - // Close the output stream - virtual Status Close() = 0; - - // The current position in the output stream - virtual int64_t Tell() const = 0; - - // Write bytes to the stream - virtual Status Write(const uint8_t* data, int64_t length) = 0; -}; - -// An output stream that writes to a MutableBuffer, such as one obtained from a -// memory map -class BufferOutputStream : public OutputStream { - public: - explicit BufferOutputStream(const std::shared_ptr& buffer) - : buffer_(buffer) {} - - // Implement the OutputStream interface - Status Close() override; - int64_t Tell() const override; - Status Write(const uint8_t* data, int64_t length) override; - - // Returns the number of bytes remaining in the buffer - int64_t bytes_remaining() const; - - private: - std::shared_ptr buffer_; - int64_t capacity_; - int64_t position_; -}; - -class ARROW_EXPORT MemorySource { - public: - // Indicates the access permissions of the memory source - enum AccessMode { READ_ONLY, READ_WRITE }; - - virtual ~MemorySource(); - - // Retrieve a buffer of memory from the source of the indicates size and at - // the indicated location - // @returns: arrow::Status indicating success / failure. 
The buffer is set - // into the *out argument - virtual Status ReadAt( - int64_t position, int64_t nbytes, std::shared_ptr* out) = 0; - - virtual Status Close() = 0; - - virtual Status Write(int64_t position, const uint8_t* data, int64_t nbytes) = 0; - - // @return: the size in bytes of the memory source - virtual int64_t Size() const = 0; - - protected: - explicit MemorySource(AccessMode access_mode = AccessMode::READ_WRITE); - - AccessMode access_mode_; - - private: - DISALLOW_COPY_AND_ASSIGN(MemorySource); -}; - -// A memory source that uses memory-mapped files for memory interactions -class ARROW_EXPORT MemoryMappedSource : public MemorySource { - public: - static Status Open(const std::string& path, AccessMode access_mode, - std::shared_ptr* out); - - Status Close() override; - - Status ReadAt(int64_t position, int64_t nbytes, std::shared_ptr* out) override; - - Status Write(int64_t position, const uint8_t* data, int64_t nbytes) override; - - // @return: the size in bytes of the memory source - int64_t Size() const override; - - private: - explicit MemoryMappedSource(AccessMode access_mode); - // Hide the internal details of this class for now - class Impl; - std::unique_ptr impl_; -}; - -// A MemorySource that tracks the size of allocations from a memory source -class MockMemorySource : public MemorySource { - public: - explicit MockMemorySource(int64_t size); - - Status Close() override; - - Status ReadAt(int64_t position, int64_t nbytes, std::shared_ptr* out) override; - - Status Write(int64_t position, const uint8_t* data, int64_t nbytes) override; - - int64_t Size() const override; - - // @return: the smallest number of bytes containing the modified region of the - // MockMemorySource - int64_t GetExtentBytesWritten() const; - - private: - int64_t size_; - int64_t extent_bytes_written_; -}; - -} // namespace ipc -} // namespace arrow - -#endif // ARROW_IPC_MEMORY_H diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index 8cc902c2967da..05e9c7ad4d359 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -220,9 +220,8 @@ static Status FieldToFlatbuffer( auto fb_children = fbb.CreateVector(children); // TODO: produce the list of VectorTypes - *offset = flatbuf::CreateField( - fbb, fb_name, field->nullable, type_enum, type_data, field->dictionary, - fb_children); + *offset = flatbuf::CreateField(fbb, fb_name, field->nullable, type_enum, type_data, + field->dictionary, fb_children); return Status::OK(); } @@ -295,8 +294,8 @@ Status WriteDataHeader(int32_t length, int64_t body_length, } Status MessageBuilder::Finish() { - auto message = flatbuf::CreateMessage(fbb_, kMetadataVersion, - header_type_, header_, body_length_); + auto message = + flatbuf::CreateMessage(fbb_, kMetadataVersion, header_type_, header_, body_length_); fbb_.Finish(message); return Status::OK(); } diff --git a/cpp/src/arrow/ipc/metadata-internal.h b/cpp/src/arrow/ipc/metadata-internal.h index db9a83f6a8dfb..d38df840ba05e 100644 --- a/cpp/src/arrow/ipc/metadata-internal.h +++ b/cpp/src/arrow/ipc/metadata-internal.h @@ -38,7 +38,7 @@ class Status; namespace ipc { static constexpr flatbuf::MetadataVersion kMetadataVersion = - flatbuf::MetadataVersion_V1_SNAPSHOT; + flatbuf::MetadataVersion_V1_SNAPSHOT; Status FieldFromFlatbuffer(const flatbuf::Field* field, std::shared_ptr* out); diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h index 838a4a676ea35..d5ec53317e6f2 100644 --- a/cpp/src/arrow/ipc/metadata.h +++ 
b/cpp/src/arrow/ipc/metadata.h @@ -23,6 +23,8 @@ #include #include +#include "arrow/util/visibility.h" + namespace arrow { class Buffer; @@ -36,6 +38,7 @@ namespace ipc { // Message read/write APIs // Serialize arrow::Schema as a Flatbuffer +ARROW_EXPORT Status WriteSchema(const Schema* schema, std::shared_ptr* out); //---------------------------------------------------------------------- @@ -47,7 +50,7 @@ Status WriteSchema(const Schema* schema, std::shared_ptr* out); class Message; // Container for serialized Schema metadata contained in an IPC message -class SchemaMessage { +class ARROW_EXPORT SchemaMessage { public: // Accepts an opaque flatbuffer pointer SchemaMessage(const std::shared_ptr& message, const void* schema); @@ -82,7 +85,7 @@ struct BufferMetadata { }; // Container for serialized record batch metadata contained in an IPC message -class RecordBatchMessage { +class ARROW_EXPORT RecordBatchMessage { public: // Accepts an opaque flatbuffer pointer RecordBatchMessage(const std::shared_ptr& message, const void* batch_meta); @@ -102,13 +105,13 @@ class RecordBatchMessage { std::unique_ptr impl_; }; -class DictionaryBatchMessage { +class ARROW_EXPORT DictionaryBatchMessage { public: int64_t id() const; std::unique_ptr data() const; }; -class Message : public std::enable_shared_from_this { +class ARROW_EXPORT Message : public std::enable_shared_from_this { public: enum Type { NONE, SCHEMA, DICTIONARY_BATCH, RECORD_BATCH }; diff --git a/cpp/src/arrow/ipc/symbols.map b/cpp/src/arrow/ipc/symbols.map new file mode 100644 index 0000000000000..b4ad98cd7f2d0 --- /dev/null +++ b/cpp/src/arrow/ipc/symbols.map @@ -0,0 +1,18 @@ +{ + # Symbols marked as 'local' are not exported by the DSO and thus may not + # be used by client applications. + local: + # devtoolset / static-libstdc++ symbols + __cxa_*; + + extern "C++" { + # boost + boost::*; + + # devtoolset or -static-libstdc++ - the Red Hat devtoolset statically + # links c++11 symbols into binaries so that the result may be executed on + # a system with an older libstdc++ which doesn't include the necessary + # c++11 symbols. + std::*; + }; +}; diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index e7dbb84d790a1..f6582fc883bdc 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -34,31 +34,6 @@ namespace arrow { namespace ipc { -class MemoryMapFixture { - public: - void TearDown() { - for (auto path : tmp_files_) { - std::remove(path.c_str()); - } - } - - void CreateFile(const std::string path, int64_t size) { - FILE* file = fopen(path.c_str(), "w"); - if (file != nullptr) { tmp_files_.push_back(path); } - ftruncate(fileno(file), size); - fclose(file); - } - - Status InitMemoryMap( - int64_t size, const std::string& path, std::shared_ptr* mmap) { - CreateFile(path, size); - return MemoryMappedSource::Open(path, MemorySource::READ_WRITE, mmap); - } - - private: - std::vector tmp_files_; -}; - Status MakeRandomInt32Array( int32_t length, bool include_nulls, MemoryPool* pool, std::shared_ptr* array) { std::shared_ptr data; diff --git a/cpp/src/arrow/ipc/util.h b/cpp/src/arrow/ipc/util.h new file mode 100644 index 0000000000000..3f4001b21a91b --- /dev/null +++ b/cpp/src/arrow/ipc/util.h @@ -0,0 +1,56 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. 
The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_IPC_UTIL_H +#define ARROW_IPC_UTIL_H + +#include + +#include "arrow/array.h" +#include "arrow/io/interfaces.h" +#include "arrow/util/status.h" + +namespace arrow { +namespace ipc { + +// A helper class to tracks the size of allocations +class MockOutputStream : public io::OutputStream { + public: + MockOutputStream() : extent_bytes_written_(0) {} + + Status Close() override { return Status::OK(); } + + Status Write(const uint8_t* data, int64_t nbytes) override { + extent_bytes_written_ += nbytes; + return Status::OK(); + } + + Status Tell(int64_t* position) override { + *position = extent_bytes_written_; + return Status::OK(); + } + + int64_t GetExtentBytesWritten() const { return extent_bytes_written_; } + + private: + int64_t extent_bytes_written_; +}; + +} // namespace ipc +} // namespace arrow + +#endif // ARROW_IPC_UTIL_H diff --git a/cpp/src/arrow/parquet/CMakeLists.txt b/cpp/src/arrow/parquet/CMakeLists.txt index f2a90b71a4968..c400e14ea47f7 100644 --- a/cpp/src/arrow/parquet/CMakeLists.txt +++ b/cpp/src/arrow/parquet/CMakeLists.txt @@ -27,6 +27,7 @@ set(PARQUET_SRCS set(PARQUET_LIBS arrow_shared + arrow_io parquet_shared ) diff --git a/cpp/src/arrow/parquet/io.cc b/cpp/src/arrow/parquet/io.cc index b6fdd67d15b6c..a50d753f3054e 100644 --- a/cpp/src/arrow/parquet/io.cc +++ b/cpp/src/arrow/parquet/io.cc @@ -27,7 +27,7 @@ #include "arrow/util/status.h" // To assist with readability -using ArrowROFile = arrow::io::RandomAccessFile; +using ArrowROFile = arrow::io::ReadableFileInterface; namespace arrow { namespace parquet { @@ -58,7 +58,7 @@ void ParquetAllocator::Free(uint8_t* buffer, int64_t size) { ParquetReadSource::ParquetReadSource(ParquetAllocator* allocator) : file_(nullptr), allocator_(allocator) {} -Status ParquetReadSource::Open(const std::shared_ptr& file) { +Status ParquetReadSource::Open(const std::shared_ptr& file) { int64_t file_size; RETURN_NOT_OK(file->GetSize(&file_size)); diff --git a/cpp/src/arrow/parquet/io.h b/cpp/src/arrow/parquet/io.h index 1c59695c6c151..1734863acf1ea 100644 --- a/cpp/src/arrow/parquet/io.h +++ b/cpp/src/arrow/parquet/io.h @@ -62,7 +62,7 @@ class ARROW_EXPORT ParquetReadSource : public ::parquet::RandomAccessSource { explicit ParquetReadSource(ParquetAllocator* allocator); // We need to ask for the file size on opening the file, and this can fail - Status Open(const std::shared_ptr& file); + Status Open(const std::shared_ptr& file); void Close() override; int64_t Tell() const override; @@ -72,7 +72,7 @@ class ARROW_EXPORT ParquetReadSource : public ::parquet::RandomAccessSource { private: // An Arrow readable file of some kind - std::shared_ptr file_; + std::shared_ptr file_; // The allocator is required for creating managed buffers ParquetAllocator* allocator_; diff --git a/cpp/src/arrow/parquet/parquet-io-test.cc b/cpp/src/arrow/parquet/parquet-io-test.cc index 6615457c483f5..208b3e867d374 100644 --- a/cpp/src/arrow/parquet/parquet-io-test.cc +++ 
b/cpp/src/arrow/parquet/parquet-io-test.cc @@ -22,6 +22,7 @@ #include "gtest/gtest.h" +#include "arrow/io/memory.h" #include "arrow/parquet/io.h" #include "arrow/test-util.h" #include "arrow/util/memory-pool.h" @@ -96,61 +97,13 @@ TEST(TestParquetAllocator, CustomPool) { // ---------------------------------------------------------------------- // Read source tests -class BufferReader : public io::RandomAccessFile { - public: - BufferReader(const uint8_t* buffer, int buffer_size) - : buffer_(buffer), buffer_size_(buffer_size), position_(0) {} - - Status Close() override { - // no-op - return Status::OK(); - } - - Status Tell(int64_t* position) override { - *position = position_; - return Status::OK(); - } - - Status ReadAt( - int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override { - RETURN_NOT_OK(Seek(position)); - return Read(nbytes, bytes_read, buffer); - } - - Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override { - memcpy(buffer, buffer_ + position_, nbytes); - *bytes_read = std::min(nbytes, buffer_size_ - position_); - position_ += *bytes_read; - return Status::OK(); - } - - Status GetSize(int64_t* size) override { - *size = buffer_size_; - return Status::OK(); - } - - Status Seek(int64_t position) override { - if (position < 0 || position >= buffer_size_) { - return Status::IOError("position out of bounds"); - } - - position_ = position; - return Status::OK(); - } - - private: - const uint8_t* buffer_; - int buffer_size_; - int64_t position_; -}; - TEST(TestParquetReadSource, Basics) { std::string data = "this is the data"; auto data_buffer = reinterpret_cast(data.c_str()); ParquetAllocator allocator(default_memory_pool()); - auto file = std::make_shared(data_buffer, data.size()); + auto file = std::make_shared(data_buffer, data.size()); auto source = std::make_shared(&allocator); ASSERT_OK(source->Open(file)); diff --git a/cpp/src/arrow/parquet/parquet-schema-test.cc b/cpp/src/arrow/parquet/parquet-schema-test.cc index a2bcd3e05c307..63ad8fba46517 100644 --- a/cpp/src/arrow/parquet/parquet-schema-test.cc +++ b/cpp/src/arrow/parquet/parquet-schema-test.cc @@ -178,8 +178,7 @@ class TestConvertArrowSchema : public ::testing::Test { NodePtr schema_node = GroupNode::Make("schema", Repetition::REPEATED, nodes); const GroupNode* expected_schema_node = static_cast(schema_node.get()); - const GroupNode* result_schema_node = - static_cast(result_schema_->schema().get()); + const GroupNode* result_schema_node = result_schema_->group_node(); ASSERT_EQ(expected_schema_node->field_count(), result_schema_node->field_count()); diff --git a/cpp/src/arrow/parquet/reader.cc b/cpp/src/arrow/parquet/reader.cc index 440ec84e2c74e..0c2fc6e8fc718 100644 --- a/cpp/src/arrow/parquet/reader.cc +++ b/cpp/src/arrow/parquet/reader.cc @@ -149,7 +149,7 @@ bool FileReader::Impl::CheckForFlatColumn(const ::parquet::ColumnDescriptor* des } Status FileReader::Impl::GetFlatColumn(int i, std::unique_ptr* out) { - const ::parquet::SchemaDescriptor* schema = reader_->metadata()->schema_descriptor(); + const ::parquet::SchemaDescriptor* schema = reader_->metadata()->schema(); if (!CheckForFlatColumn(schema->Column(i))) { return Status::Invalid("The requested column is not flat"); @@ -167,9 +167,9 @@ Status FileReader::Impl::ReadFlatColumn(int i, std::shared_ptr* out) { } Status FileReader::Impl::ReadFlatTable(std::shared_ptr
* table) { - auto descr = reader_->metadata()->schema_descriptor(); + auto descr = reader_->metadata()->schema(); - const std::string& name = descr->schema()->name(); + const std::string& name = descr->name(); std::shared_ptr schema; RETURN_NOT_OK(FromParquetSchema(descr, &schema)); @@ -193,7 +193,7 @@ FileReader::FileReader( FileReader::~FileReader() {} // Static ctor -Status OpenFile(const std::shared_ptr& file, +Status OpenFile(const std::shared_ptr& file, ParquetAllocator* allocator, std::unique_ptr* reader) { std::unique_ptr source(new ParquetReadSource(allocator)); RETURN_NOT_OK(source->Open(file)); diff --git a/cpp/src/arrow/parquet/reader.h b/cpp/src/arrow/parquet/reader.h index f1492f64521cb..a9c64eca997b5 100644 --- a/cpp/src/arrow/parquet/reader.h +++ b/cpp/src/arrow/parquet/reader.h @@ -137,7 +137,7 @@ class ARROW_EXPORT FlatColumnReader { // Helper function to create a file reader from an implementation of an Arrow // readable file ARROW_EXPORT -Status OpenFile(const std::shared_ptr& file, +Status OpenFile(const std::shared_ptr& file, ParquetAllocator* allocator, std::unique_ptr* reader); } // namespace parquet diff --git a/cpp/src/arrow/parquet/schema.cc b/cpp/src/arrow/parquet/schema.cc index cd91df32271c1..ff32e51bacd8b 100644 --- a/cpp/src/arrow/parquet/schema.cc +++ b/cpp/src/arrow/parquet/schema.cc @@ -202,7 +202,7 @@ Status FromParquetSchema( // TODO(wesm): Consider adding an arrow::Schema name attribute, which comes // from the root Parquet node const GroupNode* schema_node = - static_cast(parquet_schema->schema().get()); + static_cast(parquet_schema->group_node()); std::vector> fields(schema_node->field_count()); for (int i = 0; i < schema_node->field_count(); i++) { diff --git a/cpp/src/arrow/parquet/writer.cc b/cpp/src/arrow/parquet/writer.cc index ddee573fa1eb9..2b47f1461c9f4 100644 --- a/cpp/src/arrow/parquet/writer.cc +++ b/cpp/src/arrow/parquet/writer.cc @@ -334,7 +334,7 @@ Status WriteFlatTable(const Table* table, MemoryPool* pool, std::shared_ptr<::parquet::SchemaDescriptor> parquet_schema; RETURN_NOT_OK( ToParquetSchema(table->schema().get(), *properties.get(), &parquet_schema)); - auto schema_node = std::static_pointer_cast(parquet_schema->schema()); + auto schema_node = std::static_pointer_cast(parquet_schema->schema_root()); std::unique_ptr parquet_writer = ParquetFileWriter::Open(sink, schema_node, properties); FileWriter writer(pool, std::move(parquet_writer)); diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 02677d5e18b90..b4c3721a72895 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -149,7 +149,7 @@ struct ARROW_EXPORT Field { int64_t dictionary; Field(const std::string& name, const TypePtr& type, bool nullable = true, - int64_t dictionary = 0) + int64_t dictionary = 0) : name(name), type(type), nullable(nullable), dictionary(dictionary) {} bool operator==(const Field& other) const { return this->Equals(other); } @@ -159,7 +159,7 @@ struct ARROW_EXPORT Field { bool Equals(const Field& other) const { return (this == &other) || (this->name == other.name && this->nullable == other.nullable && - this->dictionary == dictionary && this->type->Equals(other.type.get())); + this->dictionary == dictionary && this->type->Equals(other.type.get())); } bool Equals(const std::shared_ptr& other) const { return Equals(*other.get()); } diff --git a/cpp/src/arrow/util/memory-pool-test.cc b/cpp/src/arrow/util/memory-pool-test.cc index e767e9555244d..5d60376f794ff 100644 --- a/cpp/src/arrow/util/memory-pool-test.cc +++ 
b/cpp/src/arrow/util/memory-pool-test.cc @@ -64,6 +64,6 @@ TEST(DefaultMemoryPoolDeathTest, FreeLargeMemory) { pool->Free(data, 100); } -#endif // ARROW_VALGRIND +#endif // ARROW_VALGRIND } // namespace arrow diff --git a/python/pyarrow/includes/libarrow_io.pxd b/python/pyarrow/includes/libarrow_io.pxd index 734ace6c923b4..f338a436814de 100644 --- a/python/pyarrow/includes/libarrow_io.pxd +++ b/python/pyarrow/includes/libarrow_io.pxd @@ -29,25 +29,41 @@ cdef extern from "arrow/io/interfaces.h" namespace "arrow::io" nogil: ObjectType_FILE" arrow::io::ObjectType::FILE" ObjectType_DIRECTORY" arrow::io::ObjectType::DIRECTORY" - cdef cppclass FileBase: + cdef cppclass FileInterface: CStatus Close() CStatus Tell(int64_t* position) + FileMode mode() - cdef cppclass ReadableFile(FileBase): + cdef cppclass Readable: + CStatus Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) + + cdef cppclass Seekable: + CStatus Seek(int64_t position) + + cdef cppclass Writeable: + CStatus Write(const uint8_t* data, int64_t nbytes) + + cdef cppclass OutputStream(FileInterface, Writeable): + pass + + cdef cppclass InputStream(FileInterface, Readable): + pass + + cdef cppclass ReadableFileInterface(InputStream, Seekable): CStatus GetSize(int64_t* size) - CStatus Read(int64_t nbytes, int64_t* bytes_read, - uint8_t* buffer) CStatus ReadAt(int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) + CStatus ReadAt(int64_t position, int64_t nbytes, + int64_t* bytes_read, shared_ptr[Buffer]* out) - cdef cppclass RandomAccessFile(ReadableFile): - CStatus Seek(int64_t position) + cdef cppclass WriteableFileInterface(OutputStream, Seekable): + CStatus WriteAt(int64_t position, const uint8_t* data, + int64_t nbytes) - cdef cppclass WriteableFile(FileBase): - CStatus Write(const uint8_t* buffer, int64_t nbytes) - # CStatus Write(const uint8_t* buffer, int64_t nbytes, - # int64_t* bytes_written) + cdef cppclass ReadWriteFileInterface(ReadableFileInterface, + WriteableFileInterface): + pass cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: @@ -70,10 +86,10 @@ cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: int64_t block_size int16_t permissions - cdef cppclass HdfsReadableFile(RandomAccessFile): + cdef cppclass HdfsReadableFile(ReadableFileInterface): pass - cdef cppclass HdfsWriteableFile(WriteableFile): + cdef cppclass HdfsOutputStream(OutputStream): pass cdef cppclass CHdfsClient" arrow::io::HdfsClient": @@ -103,4 +119,4 @@ cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: CStatus OpenWriteable(const c_string& path, c_bool append, int32_t buffer_size, int16_t replication, int64_t default_block_size, - shared_ptr[HdfsWriteableFile]* handle) + shared_ptr[HdfsOutputStream]* handle) diff --git a/python/pyarrow/includes/parquet.pxd b/python/pyarrow/includes/parquet.pxd index fe24f593e3294..f932a93149354 100644 --- a/python/pyarrow/includes/parquet.pxd +++ b/python/pyarrow/includes/parquet.pxd @@ -19,7 +19,7 @@ from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport CSchema, CStatus, CTable, MemoryPool -from pyarrow.includes.libarrow_io cimport RandomAccessFile +from pyarrow.includes.libarrow_io cimport ReadableFileInterface cdef extern from "parquet/api/schema.h" namespace "parquet::schema" nogil: @@ -78,10 +78,10 @@ cdef extern from "parquet/api/reader.h" namespace "parquet" nogil: unique_ptr[ParquetFileReader] OpenFile(const c_string& path) cdef extern from "parquet/api/writer.h" namespace "parquet" nogil: - cdef cppclass OutputStream: + cdef 
cppclass ParquetOutputStream" parquet::OutputStream": pass - cdef cppclass LocalFileOutputStream(OutputStream): + cdef cppclass LocalFileOutputStream(ParquetOutputStream): LocalFileOutputStream(const c_string& path) void Close() @@ -100,11 +100,11 @@ cdef extern from "arrow/parquet/io.h" namespace "arrow::parquet" nogil: cdef cppclass ParquetReadSource: ParquetReadSource(ParquetAllocator* allocator) - Open(const shared_ptr[RandomAccessFile]& file) + Open(const shared_ptr[ReadableFileInterface]& file) cdef extern from "arrow/parquet/reader.h" namespace "arrow::parquet" nogil: - CStatus OpenFile(const shared_ptr[RandomAccessFile]& file, + CStatus OpenFile(const shared_ptr[ReadableFileInterface]& file, ParquetAllocator* allocator, unique_ptr[FileReader]* reader) @@ -121,6 +121,8 @@ cdef extern from "arrow/parquet/schema.h" namespace "arrow::parquet" nogil: cdef extern from "arrow/parquet/writer.h" namespace "arrow::parquet" nogil: - cdef CStatus WriteFlatTable(const CTable* table, MemoryPool* pool, - const shared_ptr[OutputStream]& sink, int64_t chunk_size, - const shared_ptr[WriterProperties]& properties) + cdef CStatus WriteFlatTable( + const CTable* table, MemoryPool* pool, + const shared_ptr[ParquetOutputStream]& sink, + int64_t chunk_size, + const shared_ptr[WriterProperties]& properties) diff --git a/python/pyarrow/io.pxd b/python/pyarrow/io.pxd index b92af72704ae8..f55fc0ab53ac1 100644 --- a/python/pyarrow/io.pxd +++ b/python/pyarrow/io.pxd @@ -19,7 +19,8 @@ from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport * -from pyarrow.includes.libarrow_io cimport RandomAccessFile, WriteableFile +from pyarrow.includes.libarrow_io cimport (ReadableFileInterface, + OutputStream) cdef class NativeFileInterface: @@ -28,5 +29,5 @@ cdef class NativeFileInterface: # extension classes are technically virtual in the C++ sense)m we can # expose the arrow::io abstract file interfaces to other components # throughout the suite of Arrow C++ libraries - cdef read_handle(self, shared_ptr[RandomAccessFile]* file) - cdef write_handle(self, shared_ptr[WriteableFile]* file) + cdef read_handle(self, shared_ptr[ReadableFileInterface]* file) + cdef write_handle(self, shared_ptr[OutputStream]* file) diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index b8bf883562060..f2eee260c331b 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -316,16 +316,16 @@ cdef class HdfsClient: cdef class NativeFileInterface: - cdef read_handle(self, shared_ptr[RandomAccessFile]* file): + cdef read_handle(self, shared_ptr[ReadableFileInterface]* file): raise NotImplementedError - cdef write_handle(self, shared_ptr[WriteableFile]* file): + cdef write_handle(self, shared_ptr[OutputStream]* file): raise NotImplementedError cdef class HdfsFile(NativeFileInterface): cdef: shared_ptr[HdfsReadableFile] rd_file - shared_ptr[HdfsWriteableFile] wr_file + shared_ptr[HdfsOutputStream] wr_file bint is_readonly bint is_open object parent @@ -364,13 +364,13 @@ cdef class HdfsFile(NativeFileInterface): if self.is_readonly: raise IOError("only valid on writeonly files") - cdef read_handle(self, shared_ptr[RandomAccessFile]* file): + cdef read_handle(self, shared_ptr[ReadableFileInterface]* file): self._assert_readable() - file[0] = self.rd_file + file[0] = self.rd_file - cdef write_handle(self, shared_ptr[WriteableFile]* file): + cdef write_handle(self, shared_ptr[OutputStream]* file): self._assert_writeable() - file[0] = self.wr_file + file[0] = self.wr_file def size(self): cdef int64_t size diff --git 
a/python/pyarrow/parquet.pyx b/python/pyarrow/parquet.pyx index ebba1a17ac742..fb36b2967c096 100644 --- a/python/pyarrow/parquet.pyx +++ b/python/pyarrow/parquet.pyx @@ -21,7 +21,7 @@ from pyarrow.includes.libarrow cimport * from pyarrow.includes.parquet cimport * -from pyarrow.includes.libarrow_io cimport RandomAccessFile, WriteableFile +from pyarrow.includes.libarrow_io cimport ReadableFileInterface cimport pyarrow.includes.pyarrow as pyarrow from pyarrow.compat import tobytes @@ -55,7 +55,7 @@ cdef class ParquetReader: ParquetFileReader.OpenFile(path))) cdef open_native_file(self, NativeFileInterface file): - cdef shared_ptr[RandomAccessFile] cpp_handle + cdef shared_ptr[ReadableFileInterface] cpp_handle file.read_handle(&cpp_handle) check_cstatus(OpenFile(cpp_handle, &self.allocator, &self.reader)) @@ -105,7 +105,7 @@ def write_table(table, filename, chunk_size=None, version=None): """ cdef Table table_ = table cdef CTable* ctable_ = table_.table - cdef shared_ptr[OutputStream] sink + cdef shared_ptr[ParquetOutputStream] sink cdef WriterProperties.Builder properties_builder cdef int64_t chunk_size_ = 0 if chunk_size is None: From 5f1556c011446a9fc524e91042c859365ed7afc1 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Mon, 19 Sep 2016 14:08:32 -0700 Subject: [PATCH 0138/1644] ARROW-297: Fix Arrow pom for release Author: Julien Le Dem Closes #140 from julienledem/fix_pom_for_release and squashes the following commits: 9618eaf [Julien Le Dem] ARROW-297: Fix Arrow pom for release --- java/format/pom.xml | 19 ++++++++++--------- java/pom.xml | 2 +- 2 files changed, 11 insertions(+), 10 deletions(-) diff --git a/java/format/pom.xml b/java/format/pom.xml index 4cf68bbe057e9..78300047862f4 100644 --- a/java/format/pom.xml +++ b/java/format/pom.xml @@ -27,6 +27,7 @@ 1.2.0-3f79e055 false ${project.build.directory}/flatc-${os.detected.classifier}-${fbs.version}.exe + ${project.build.directory}/generated-sources/flatc 3.3 2.10 1.5.0.Final @@ -51,7 +52,7 @@ - + org.apache.maven.plugins maven-dependency-plugin ${maven-dependency-plugin.version} @@ -83,7 +84,7 @@ exec-maven-plugin 1.4.0 - + script-chmod exec @@ -98,7 +99,7 @@ ${flatc.download.skip} - + exec @@ -108,7 +109,7 @@ -j -o - target/generated-sources/flatc + ${flatc.generated.files} ../../format/Message.fbs ../../format/File.fbs @@ -116,7 +117,7 @@ - + com.mycila license-maven-plugin 2.3 @@ -135,26 +136,26 @@ - + org.codehaus.mojo build-helper-maven-plugin 1.9.1 - add-sources-as-resources + add-generated-sources-to-classpath generate-sources add-source - ${project.build.directory}/generated-sources + ${flatc.generated.files} - + org.apache.maven.plugins maven-checkstyle-plugin diff --git a/java/pom.xml b/java/pom.xml index a8e24ed054cd5..fc2c18d0e517d 100644 --- a/java/pom.xml +++ b/java/pom.xml @@ -16,7 +16,7 @@ org.apache apache - 14 + 18 org.apache.arrow From 53583281b2af3e4ecedd3b130cef588680a44c4f Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Wed, 21 Sep 2016 13:38:52 -0700 Subject: [PATCH 0139/1644] ARROW-298: create release scripts Author: Julien Le Dem Closes #141 from julienledem/release and squashes the following commits: 1a5114d [Julien Le Dem] ARROW-298: create release scripts --- dev/release/00-prepare.sh | 46 ++++++++++++++++++++++ dev/release/01-perform.sh | 27 +++++++++++++ dev/release/02-source.sh | 80 +++++++++++++++++++++++++++++++++++++++ dev/release/README | 15 ++++++++ java/README.md | 14 +++++++ 5 files changed, 182 insertions(+) create mode 100644 dev/release/00-prepare.sh create mode 100644 
dev/release/01-perform.sh create mode 100644 dev/release/02-source.sh create mode 100644 dev/release/README create mode 100644 java/README.md diff --git a/dev/release/00-prepare.sh b/dev/release/00-prepare.sh new file mode 100644 index 0000000000000..3c1fb9a093892 --- /dev/null +++ b/dev/release/00-prepare.sh @@ -0,0 +1,46 @@ +#!/bin/bash +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# + +SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" + +if [ -z "$1" ]; then + echo "Usage: $0 " + exit +fi + +if [ -z "$2" ]; then + echo "Usage: $0 " + exit +fi + +version=$1 + +tag=apache-arrow-${version} + +nextVersion=$2 + +cd "${SOURCE_DIR}/../../java" + +mvn release:clean +mvn release:prepare -Dtag=${tag} -DreleaseVersion=${version} -DautoVersionSubmodules -DdevelopmentVersion=${nextVersion}-SNAPSHOT + +cd - + +echo "Finish staging binary artifacts by running: sh dev/release/01-perform.sh" diff --git a/dev/release/01-perform.sh b/dev/release/01-perform.sh new file mode 100644 index 0000000000000..d7140f6cba1e7 --- /dev/null +++ b/dev/release/01-perform.sh @@ -0,0 +1,27 @@ +#!/bin/bash +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# + +SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" + +cd "${SOURCE_DIR}/../../java" + +mvn release:perform + +cd - diff --git a/dev/release/02-source.sh b/dev/release/02-source.sh new file mode 100644 index 0000000000000..f44692d5e9d83 --- /dev/null +++ b/dev/release/02-source.sh @@ -0,0 +1,80 @@ +#!/bin/bash +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# + +if [ -z "$1" ]; then + echo "Usage: $0 " + exit +fi + +if [ -z "$2" ]; then + echo "Usage: $0 " + exit +fi + +version=$1 +rc=$2 + +if [ -d tmp/ ]; then + echo "Cannot run: tmp/ exists" + exit +fi + +tag=apache-arrow-$version +tagrc=${tag}-rc${rc} + +echo "Preparing source for $tagrc" + +release_hash=`git rev-list $tag 2> /dev/null | head -n 1 ` + +if [ -z "$release_hash" ]; then + echo "Cannot continue: unknown git tag: $tag" + exit +fi + +echo "Using commit $release_hash" + +tarball=$tag.tar.gz + +# be conservative and use the release hash, even though git produces the same +# archive (identical hashes) using the scm tag +git archive $release_hash --prefix $tag/ -o $tarball + +# sign the archive +gpg --armor --output ${tarball}.asc --detach-sig $tarball +gpg --print-md MD5 $tarball > ${tarball}.md5 +shasum $tarball > ${tarball}.sha + +# check out the parquet RC folder +svn co --depth=empty https://dist.apache.org/repos/dist/dev/arrow tmp + +# add the release candidate for the tag +mkdir -p tmp/$tagrc +cp ${tarball}* tmp/$tagrc +svn add tmp/$tagrc +svn ci -m 'Apache Arrow $version RC${rc}' tmp/$tagrc + +# clean up +rm -rf tmp + +echo "Success! The release candidate is available here:" +echo " https://dist.apache.org/repos/dist/dev/arrow/$tagrc" +echo "" +echo "Commit SHA1: $release_hash" + diff --git a/dev/release/README b/dev/release/README new file mode 100644 index 0000000000000..4fcc5d9728c26 --- /dev/null +++ b/dev/release/README @@ -0,0 +1,15 @@ +requirements: +- being a committer to be able to push to dist and maven repository +- a gpg key to sign the artifacts + +to release, run the following (replace 0.1.0 with version to release): +# prepare release v 0.1.0 (run tests, sign artifacts). Next version will be 0.1.1-SNAPSHOT +dev/release/00-prepare.sh 0.1.0 0.1.1 +# tag and push to maven repo (repo will have to be finalized separately) +dev/release/01-perform.sh +# create the source release +dev/release/02-source.sh 0.1.0 0 + +useful commands: +to set the mvn version in the poms +mvn versions:set -DnewVersion=0.1-SNAPSHOT diff --git a/java/README.md b/java/README.md new file mode 100644 index 0000000000000..5e1d30d9fd26e --- /dev/null +++ b/java/README.md @@ -0,0 +1,14 @@ +# Arrow Java + +## Setup Build Environment + +install: + - java 7 or later + - maven 3.3 or later + +## Building running tests + +``` +cd java +mvn install +``` From 430bd9576ceb14456cd6853f6d75ca19b333efc2 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Wed, 21 Sep 2016 18:14:00 -0400 Subject: [PATCH 0140/1644] ARROW-299: Use absolute namespace in macros Author: Uwe L. Korn Closes #142 from xhochy/arrow-299 and squashes the following commits: b7967fa [Uwe L. 
Korn] ARROW-299: Use absolute namespace in macros --- cpp/src/arrow/util/logging.h | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/cpp/src/arrow/util/logging.h b/cpp/src/arrow/util/logging.h index 54f67593bec5e..d320d6adb7caa 100644 --- a/cpp/src/arrow/util/logging.h +++ b/cpp/src/arrow/util/logging.h @@ -35,7 +35,7 @@ namespace arrow { #define ARROW_ERROR 2 #define ARROW_FATAL 3 -#define ARROW_LOG_INTERNAL(level) arrow::internal::CerrLog(level) +#define ARROW_LOG_INTERNAL(level) ::arrow::internal::CerrLog(level) #define ARROW_LOG(level) ARROW_LOG_INTERNAL(ARROW_##level) #define ARROW_CHECK(condition) \ @@ -47,25 +47,25 @@ namespace arrow { #define DCHECK(condition) \ while (false) \ - arrow::internal::NullLog() + ::arrow::internal::NullLog() #define DCHECK_EQ(val1, val2) \ while (false) \ - arrow::internal::NullLog() + ::arrow::internal::NullLog() #define DCHECK_NE(val1, val2) \ while (false) \ - arrow::internal::NullLog() + ::arrow::internal::NullLog() #define DCHECK_LE(val1, val2) \ while (false) \ - arrow::internal::NullLog() + ::arrow::internal::NullLog() #define DCHECK_LT(val1, val2) \ while (false) \ - arrow::internal::NullLog() + ::arrow::internal::NullLog() #define DCHECK_GE(val1, val2) \ while (false) \ - arrow::internal::NullLog() + ::arrow::internal::NullLog() #define DCHECK_GT(val1, val2) \ while (false) \ - arrow::internal::NullLog() + ::arrow::internal::NullLog() #else #define ARROW_DFATAL ARROW_FATAL From 7e39747eec05379710e1a42ecbaf1d9795bc3cf0 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 21 Sep 2016 18:15:58 -0400 Subject: [PATCH 0141/1644] ARROW-267: [C++] Implement file format layout for IPC/RPC Standing up the PR to get some feedback. I still have to implement the read path for record batches and then add a test suite. I'd also like to add some documentation about the structure of the file format and some of the implicit assumptions (e.g. word alignment) -- I put a placeholder `IPC.md` document here for this. I also conformed the language re: record batches (had been using "row batch" in the C++ code) to make things more sane. Note we are not yet able to write OS files here, see ARROW-293. Will tackle that in a follow up PR, and then we should be in a position to integration test. Author: Wes McKinney Closes #139 from wesm/ARROW-267 and squashes the following commits: 9bdbbd4 [Wes McKinney] Get test suite passing, add missing metadata adapters for string, binary 4d3cc1d [Wes McKinney] cpplint 2ec1aad [Wes McKinney] Draft failing file roundtrip test 358309b [Wes McKinney] Move record batch test fixtures into test-common.h b88bce0 [Wes McKinney] Finish draft of FileReader::GetRecordBatch. Add body end offset to ipc adapter edf36e7 [Wes McKinney] Start drafting FileReader IPC implementation. Change record batch data header to write metadata size int32_t as suffix rather than prefix 95157f2 [Wes McKinney] Make record batch writes aligned on word boundaries 7c50251 [Wes McKinney] Make the interface for WriteRecordBatch more flexible (not require constructing a RecordBatch object) ab4056f [Wes McKinney] Drafting file reader/writer API. 
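One change from this patch worth making concrete: the record batch data header is now written as a suffix, with its int32_t size trailing it, so a reader that knows only the absolute end offset of the header can recover everything by seeking backwards. Below is a minimal standalone sketch of that recovery step; all names are hypothetical, and the real logic is RecordBatchReader::Open in adapter.cc later in this patch.

```
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Given the absolute end offset of a record batch header, recover the
// flatbuffer metadata bytes. Layout: <body><metadata><int32 metadata size>.
std::vector<uint8_t> ReadHeaderSuffix(const std::vector<uint8_t>& file,
                                      int64_t header_end_offset) {
  // The 4 bytes ending at header_end_offset hold the metadata size.
  int32_t metadata_size = 0;
  std::memcpy(&metadata_size,
              file.data() + header_end_offset - sizeof(int32_t),
              sizeof(int32_t));
  if (metadata_size <= 0 ||
      metadata_size + static_cast<int64_t>(sizeof(int32_t)) > header_end_offset) {
    throw std::runtime_error("metadata size invalid");
  }
  // The metadata itself sits immediately before the size field.
  const int64_t metadata_start =
      header_end_offset - sizeof(int32_t) - metadata_size;
  return std::vector<uint8_t>(file.begin() + metadata_start,
                              file.begin() + metadata_start + metadata_size);
}
```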
Implement BufferOutputStream and write file footers to an OutputStream 113ac7b [Wes McKinney] Draft file footer metadata write/read path with simple unit test --- NOTICE.txt | 6 + cpp/src/arrow/io/memory.cc | 37 ++++ cpp/src/arrow/io/memory.h | 18 +- cpp/src/arrow/ipc/CMakeLists.txt | 18 +- cpp/src/arrow/ipc/adapter.cc | 126 ++++++----- cpp/src/arrow/ipc/adapter.h | 47 ++-- cpp/src/arrow/ipc/file.cc | 210 ++++++++++++++++++ cpp/src/arrow/ipc/file.h | 146 +++++++++++++ cpp/src/arrow/ipc/ipc-adapter-test.cc | 284 +++++-------------------- cpp/src/arrow/ipc/ipc-file-test.cc | 125 +++++++++++ cpp/src/arrow/ipc/ipc-metadata-test.cc | 77 ++++++- cpp/src/arrow/ipc/metadata-internal.cc | 46 ++-- cpp/src/arrow/ipc/metadata-internal.h | 9 + cpp/src/arrow/ipc/metadata.cc | 171 ++++++++++++--- cpp/src/arrow/ipc/metadata.h | 64 +++++- cpp/src/arrow/ipc/test-common.h | 193 ++++++++++++++++- cpp/src/arrow/ipc/util.h | 8 + cpp/src/arrow/parquet/reader.h | 2 +- cpp/src/arrow/parquet/writer.h | 2 +- cpp/src/arrow/table.cc | 4 +- cpp/src/arrow/table.h | 16 +- format/IPC.md | 3 + format/README.md | 1 + 23 files changed, 1231 insertions(+), 382 deletions(-) create mode 100644 cpp/src/arrow/ipc/file.cc create mode 100644 cpp/src/arrow/ipc/file.h create mode 100644 cpp/src/arrow/ipc/ipc-file-test.cc create mode 100644 format/IPC.md diff --git a/NOTICE.txt b/NOTICE.txt index a85101617cec8..ce6e567dcb518 100644 --- a/NOTICE.txt +++ b/NOTICE.txt @@ -12,3 +12,9 @@ This product includes software from the Numpy project (BSD-new) https://github.com/numpy/numpy/blob/e1f191c46f2eebd6cb892a4bfe14d9dd43a06c4e/numpy/core/src/multiarray/multiarraymodule.c#L2910 * Copyright (c) 1995, 1996, 1997 Jim Hugunin, hugunin@mit.edu * Copyright (c) 2005 Travis E. Oliphant oliphant@ee.byu.edu Brigham Young University + +This product includes software from the Feather project (Apache 2.0) +https://github.com/wesm/feather + +This product includes software from the DyND project (BSD 2-clause) +https://github.com/libdynd diff --git a/cpp/src/arrow/io/memory.cc b/cpp/src/arrow/io/memory.cc index 1dd6c3a02304a..c168c91c5f87c 100644 --- a/cpp/src/arrow/io/memory.cc +++ b/cpp/src/arrow/io/memory.cc @@ -206,6 +206,43 @@ Status MemoryMappedFile::WriteInternal(const uint8_t* data, int64_t nbytes) { return Status::OK(); } +// ---------------------------------------------------------------------- +// OutputStream that writes to resizable buffer + +static constexpr int64_t kBufferMinimumSize = 256; + +BufferOutputStream::BufferOutputStream(const std::shared_ptr& buffer) + : buffer_(buffer), + capacity_(buffer->size()), + position_(0), + mutable_data_(buffer->mutable_data()) {} + +Status BufferOutputStream::Close() { + return Status::OK(); +} + +Status BufferOutputStream::Tell(int64_t* position) { + *position = position_; + return Status::OK(); +} + +Status BufferOutputStream::Write(const uint8_t* data, int64_t nbytes) { + RETURN_NOT_OK(Reserve(nbytes)); + std::memcpy(mutable_data_ + position_, data, nbytes); + position_ += nbytes; + return Status::OK(); +} + +Status BufferOutputStream::Reserve(int64_t nbytes) { + while (position_ + nbytes > capacity_) { + int64_t new_capacity = std::max(kBufferMinimumSize, capacity_ * 2); + RETURN_NOT_OK(buffer_->Resize(new_capacity)); + capacity_ = new_capacity; + } + mutable_data_ = buffer_->mutable_data(); + return Status::OK(); +} + // ---------------------------------------------------------------------- // In-memory buffer reader diff --git a/cpp/src/arrow/io/memory.h b/cpp/src/arrow/io/memory.h index 
6fe47c3b5157a..51601a0a62678 100644 --- a/cpp/src/arrow/io/memory.h +++ b/cpp/src/arrow/io/memory.h @@ -32,32 +32,30 @@ namespace arrow { class Buffer; -class MutableBuffer; +class ResizableBuffer; class Status; namespace io { // An output stream that writes to a MutableBuffer, such as one obtained from a // memory map -// -// TODO(wesm): Implement this class class ARROW_EXPORT BufferOutputStream : public OutputStream { public: - explicit BufferOutputStream(const std::shared_ptr& buffer) - : buffer_(buffer) {} + explicit BufferOutputStream(const std::shared_ptr& buffer); // Implement the OutputStream interface Status Close() override; Status Tell(int64_t* position) override; - Status Write(const uint8_t* data, int64_t length) override; - - // Returns the number of bytes remaining in the buffer - int64_t bytes_remaining() const; + Status Write(const uint8_t* data, int64_t nbytes) override; private: - std::shared_ptr buffer_; + // Ensures there is sufficient space available to write nbytes + Status Reserve(int64_t nbytes); + + std::shared_ptr buffer_; int64_t capacity_; int64_t position_; + uint8_t* mutable_data_; }; // A memory source that uses memory-mapped files for memory interactions diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index e5553a6358115..bde8c5bf73888 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -33,6 +33,7 @@ set(ARROW_IPC_TEST_LINK_LIBS set(ARROW_IPC_SRCS adapter.cc + file.cc metadata.cc metadata-internal.cc ) @@ -60,6 +61,10 @@ ADD_ARROW_TEST(ipc-adapter-test) ARROW_TEST_LINK_LIBRARIES(ipc-adapter-test ${ARROW_IPC_TEST_LINK_LIBS}) +ADD_ARROW_TEST(ipc-file-test) +ARROW_TEST_LINK_LIBRARIES(ipc-file-test + ${ARROW_IPC_TEST_LINK_LIBS}) + ADD_ARROW_TEST(ipc-metadata-test) ARROW_TEST_LINK_LIBRARIES(ipc-metadata-test ${ARROW_IPC_TEST_LINK_LIBS}) @@ -70,14 +75,20 @@ set_source_files_properties(Metadata_generated.h PROPERTIES GENERATED TRUE) set(OUTPUT_DIR ${CMAKE_SOURCE_DIR}/src/arrow/ipc) set(FBS_OUTPUT_FILES "${OUTPUT_DIR}/Message_generated.h") -set(FBS_SRC ${CMAKE_SOURCE_DIR}/../format/Message.fbs) -get_filename_component(ABS_FBS_SRC ${FBS_SRC} ABSOLUTE) +set(FBS_SRC + ${CMAKE_SOURCE_DIR}/../format/Message.fbs + ${CMAKE_SOURCE_DIR}/../format/File.fbs) + +foreach(FIL ${FBS_SRC}) + get_filename_component(ABS_FIL ${FIL} ABSOLUTE) + list(APPEND ABS_FBS_SRC ${ABS_FIL}) +endforeach() add_custom_command( OUTPUT ${FBS_OUTPUT_FILES} COMMAND ${FLATBUFFERS_COMPILER} -c -o ${OUTPUT_DIR} ${ABS_FBS_SRC} DEPENDS ${ABS_FBS_SRC} - COMMENT "Running flatc compiler on ${FBS_SRC}" + COMMENT "Running flatc compiler on ${ABS_FBS_SRC}" VERBATIM ) @@ -87,6 +98,7 @@ add_dependencies(arrow_objlib metadata_fbs) # Headers: top level install(FILES adapter.h + file.h metadata.h DESTINATION include/arrow/ipc) diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index 0e101c8930395..89b7fb987c63d 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -95,7 +95,7 @@ static bool IsListType(const DataType* type) { } // ---------------------------------------------------------------------- -// Row batch write path +// Record batch write path Status VisitArray(const Array* arr, std::vector* field_nodes, std::vector>* buffers, int max_recursion_depth) { @@ -132,28 +132,32 @@ Status VisitArray(const Array* arr, std::vector* field_nodes return Status::OK(); } -class RowBatchWriter { +class RecordBatchWriter { public: - RowBatchWriter(const RowBatch* batch, int max_recursion_depth) - : 
batch_(batch), max_recursion_depth_(max_recursion_depth) {} + RecordBatchWriter(const std::vector>& columns, int32_t num_rows, + int max_recursion_depth) + : columns_(&columns), + num_rows_(num_rows), + max_recursion_depth_(max_recursion_depth) {} Status AssemblePayload() { // Perform depth-first traversal of the row-batch - for (int i = 0; i < batch_->num_columns(); ++i) { - const Array* arr = batch_->column(i).get(); + for (size_t i = 0; i < columns_->size(); ++i) { + const Array* arr = (*columns_)[i].get(); RETURN_NOT_OK(VisitArray(arr, &field_nodes_, &buffers_, max_recursion_depth_)); } return Status::OK(); } - Status Write(io::OutputStream* dst, int64_t* data_header_offset) { - // Write out all the buffers contiguously and compute the total size of the - // memory payload - int64_t offset = 0; - + Status Write( + io::OutputStream* dst, int64_t* body_end_offset, int64_t* header_end_offset) { // Get the starting position - int64_t position; - RETURN_NOT_OK(dst->Tell(&position)); + int64_t start_position; + RETURN_NOT_OK(dst->Tell(&start_position)); + + // Keep track of the current position so we can determine the size of the + // message body + int64_t position = start_position; for (size_t i = 0; i < buffers_.size(); ++i) { const Buffer* buffer = buffers_[i].get(); @@ -175,14 +179,16 @@ class RowBatchWriter { // are using from any OS-level shared memory. The thought is that systems // may (in the future) associate integer page id's with physical memory // pages (according to whatever is the desired shared memory mechanism) - buffer_meta_.push_back(flatbuf::Buffer(0, position + offset, size)); + buffer_meta_.push_back(flatbuf::Buffer(0, position, size)); if (size > 0) { RETURN_NOT_OK(dst->Write(buffer->data(), size)); - offset += size; + position += size; } } + *body_end_offset = position; + // Now that we have computed the locations of all of the buffers in shared // memory, the data header can be converted to a flatbuffer and written out // @@ -192,27 +198,43 @@ class RowBatchWriter { // construct the flatbuffer data accessor object (see arrow::ipc::Message) std::shared_ptr data_header; RETURN_NOT_OK(WriteDataHeader( - batch_->num_rows(), offset, field_nodes_, buffer_meta_, &data_header)); + num_rows_, position - start_position, field_nodes_, buffer_meta_, &data_header)); // Write the data header at the end RETURN_NOT_OK(dst->Write(data_header->data(), data_header->size())); - *data_header_offset = position + offset; + position += data_header->size(); + *header_end_offset = position; + + return Align(dst, &position); + } + + Status Align(io::OutputStream* dst, int64_t* position) { + // Write all buffers here on word boundaries + // TODO(wesm): Is there benefit to 64-byte padding in IPC? + int64_t remainder = PaddedLength(*position) - *position; + if (remainder > 0) { + RETURN_NOT_OK(dst->Write(kPaddingBytes, remainder)); + *position += remainder; + } return Status::OK(); } // This must be called after invoking AssemblePayload Status GetTotalSize(int64_t* size) { // emulates the behavior of Write without actually writing + int64_t body_offset; int64_t data_header_offset; MockOutputStream dst; - RETURN_NOT_OK(Write(&dst, &data_header_offset)); + RETURN_NOT_OK(Write(&dst, &body_offset, &data_header_offset)); *size = dst.GetExtentBytesWritten(); return Status::OK(); } private: - const RowBatch* batch_; + // Do not copy this vector. 
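The aligned-write logic above leans on PaddedLength and kPaddingBytes from ipc/util.h, which this patch extends but whose bodies are not shown in this hunk. A plausible self-contained reading of the 8-byte round-up follows; this is an assumption about the helper's semantics, not the committed code.

```
#include <cassert>
#include <cstdint>

// Assumed semantics of ipc/util.h's PaddedLength: round an offset up to the
// next 8-byte boundary. kPaddingBytes would then be a zero-filled source
// buffer supplying the filler bytes.
constexpr int64_t kArrowAlignment = 8;

int64_t PaddedLength(int64_t offset) {
  return ((offset + kArrowAlignment - 1) / kArrowAlignment) * kArrowAlignment;
}

int main() {
  assert(PaddedLength(0) == 0);
  assert(PaddedLength(5) == 8);
  assert(PaddedLength(8) == 8);
  assert(PaddedLength(13) == 16);
  return 0;
}
```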
Ownership must be retained elsewhere + const std::vector>* columns_; + int32_t num_rows_; std::vector field_nodes_; std::vector buffer_meta_; @@ -220,29 +242,29 @@ class RowBatchWriter { int max_recursion_depth_; }; -Status WriteRowBatch(io::OutputStream* dst, const RowBatch* batch, int64_t* header_offset, - int max_recursion_depth) { +Status WriteRecordBatch(const std::vector>& columns, + int32_t num_rows, io::OutputStream* dst, int64_t* body_end_offset, + int64_t* header_end_offset, int max_recursion_depth) { DCHECK_GT(max_recursion_depth, 0); - RowBatchWriter serializer(batch, max_recursion_depth); + RecordBatchWriter serializer(columns, num_rows, max_recursion_depth); RETURN_NOT_OK(serializer.AssemblePayload()); - return serializer.Write(dst, header_offset); + return serializer.Write(dst, body_end_offset, header_end_offset); } -Status GetRowBatchSize(const RowBatch* batch, int64_t* size) { - RowBatchWriter serializer(batch, kMaxIpcRecursionDepth); +Status GetRecordBatchSize(const RecordBatch* batch, int64_t* size) { + RecordBatchWriter serializer( + batch->columns(), batch->num_rows(), kMaxIpcRecursionDepth); RETURN_NOT_OK(serializer.AssemblePayload()); RETURN_NOT_OK(serializer.GetTotalSize(size)); return Status::OK(); } // ---------------------------------------------------------------------- -// Row batch read path +// Record batch read path -static constexpr int64_t INIT_METADATA_SIZE = 4096; - -class RowBatchReader::RowBatchReaderImpl { +class RecordBatchReader::RecordBatchReaderImpl { public: - RowBatchReaderImpl(io::ReadableFileInterface* file, + RecordBatchReaderImpl(io::ReadableFileInterface* file, const std::shared_ptr& metadata, int max_recursion_depth) : file_(file), metadata_(metadata), max_recursion_depth_(max_recursion_depth) { num_buffers_ = metadata->num_buffers(); @@ -250,7 +272,7 @@ class RowBatchReader::RowBatchReaderImpl { } Status AssembleBatch( - const std::shared_ptr& schema, std::shared_ptr* out) { + const std::shared_ptr& schema, std::shared_ptr* out) { std::vector> arrays(schema->num_fields()); // The field_index and buffer_index are incremented in NextArray based on @@ -263,7 +285,7 @@ class RowBatchReader::RowBatchReaderImpl { RETURN_NOT_OK(NextArray(field, max_recursion_depth_, &arrays[i])); } - *out = std::make_shared(schema, metadata_->length(), arrays); + *out = std::make_shared(schema, metadata_->length(), arrays); return Status::OK(); } @@ -359,29 +381,31 @@ class RowBatchReader::RowBatchReaderImpl { int num_flattened_fields_; }; -Status RowBatchReader::Open(io::ReadableFileInterface* file, int64_t position, - std::shared_ptr* out) { - return Open(file, position, kMaxIpcRecursionDepth, out); +Status RecordBatchReader::Open(io::ReadableFileInterface* file, int64_t offset, + std::shared_ptr* out) { + return Open(file, offset, kMaxIpcRecursionDepth, out); } -Status RowBatchReader::Open(io::ReadableFileInterface* file, int64_t position, - int max_recursion_depth, std::shared_ptr* out) { - std::shared_ptr metadata; - RETURN_NOT_OK(file->ReadAt(position, INIT_METADATA_SIZE, &metadata)); +Status RecordBatchReader::Open(io::ReadableFileInterface* file, int64_t offset, + int max_recursion_depth, std::shared_ptr* out) { + std::shared_ptr buffer; + RETURN_NOT_OK(file->ReadAt(offset - sizeof(int32_t), sizeof(int32_t), &buffer)); - int32_t metadata_size = *reinterpret_cast(metadata->data()); + int32_t metadata_size = *reinterpret_cast(buffer->data()); - // We may not need to call ReadAt again - if (metadata_size > static_cast(INIT_METADATA_SIZE - sizeof(int32_t))) 
{ - // We don't have enough data, read the indicated metadata size. - RETURN_NOT_OK(file->ReadAt(position + sizeof(int32_t), metadata_size, &metadata)); + if (metadata_size + static_cast(sizeof(int32_t)) > offset) { + return Status::Invalid("metadata size invalid"); } + // Read the metadata + RETURN_NOT_OK( + file->ReadAt(offset - metadata_size - sizeof(int32_t), metadata_size, &buffer)); + // TODO(wesm): buffer slicing here would be better in case ReadAt returns // allocated memory std::shared_ptr message; - RETURN_NOT_OK(Message::Open(metadata, &message)); + RETURN_NOT_OK(Message::Open(buffer, &message)); if (message->type() != Message::RECORD_BATCH) { return Status::Invalid("Metadata message is not a record batch"); @@ -389,19 +413,19 @@ Status RowBatchReader::Open(io::ReadableFileInterface* file, int64_t position, std::shared_ptr batch_meta = message->GetRecordBatch(); - std::shared_ptr result(new RowBatchReader()); - result->impl_.reset(new RowBatchReaderImpl(file, batch_meta, max_recursion_depth)); + std::shared_ptr result(new RecordBatchReader()); + result->impl_.reset(new RecordBatchReaderImpl(file, batch_meta, max_recursion_depth)); *out = result; return Status::OK(); } // Here the explicit destructor is required for compilers to be aware of -// the complete information of RowBatchReader::RowBatchReaderImpl class -RowBatchReader::~RowBatchReader() {} +// the complete information of RecordBatchReader::RecordBatchReaderImpl class +RecordBatchReader::~RecordBatchReader() {} -Status RowBatchReader::GetRowBatch( - const std::shared_ptr& schema, std::shared_ptr* out) { +Status RecordBatchReader::GetRecordBatch( + const std::shared_ptr& schema, std::shared_ptr* out) { return impl_->AssembleBatch(schema, out); } diff --git a/cpp/src/arrow/ipc/adapter.h b/cpp/src/arrow/ipc/adapter.h index 215b46f8f65d4..3fde18dde835b 100644 --- a/cpp/src/arrow/ipc/adapter.h +++ b/cpp/src/arrow/ipc/adapter.h @@ -23,13 +23,14 @@ #include #include +#include #include "arrow/util/visibility.h" namespace arrow { class Array; -class RowBatch; +class RecordBatch; class Schema; class Status; @@ -50,7 +51,7 @@ class RecordBatchMessage; // TODO(emkornfield) investigate this more constexpr int kMaxIpcRecursionDepth = 64; -// Write the RowBatch (collection of equal-length Arrow arrays) to the output +// Write the RecordBatch (collection of equal-length Arrow arrays) to the output // stream // // First, each of the memory buffers are written out end-to-end @@ -60,39 +61,43 @@ constexpr int kMaxIpcRecursionDepth = 64; // // // -// Finally, the absolute offset (relative to the start of the output stream) to -// the start of the metadata / data header is returned in an out-variable -ARROW_EXPORT Status WriteRowBatch(io::OutputStream* dst, const RowBatch* batch, - int64_t* header_offset, int max_recursion_depth = kMaxIpcRecursionDepth); +// Finally, the absolute offsets (relative to the start of the output stream) +// to the end of the body and end of the metadata / data header (suffixed by +// the header size) is returned in out-variables +ARROW_EXPORT Status WriteRecordBatch(const std::vector>& columns, + int32_t num_rows, io::OutputStream* dst, int64_t* body_end_offset, + int64_t* header_end_offset, int max_recursion_depth = kMaxIpcRecursionDepth); -// int64_t GetRowBatchMetadata(const RowBatch* batch); +// int64_t GetRecordBatchMetadata(const RecordBatch* batch); // Compute the precise number of bytes needed in a contiguous memory segment to -// write the row batch. 
This involves generating the complete serialized +// write the record batch. This involves generating the complete serialized // Flatbuffers metadata. -ARROW_EXPORT Status GetRowBatchSize(const RowBatch* batch, int64_t* size); +ARROW_EXPORT Status GetRecordBatchSize(const RecordBatch* batch, int64_t* size); // ---------------------------------------------------------------------- // "Read" path; does not copy data if the input supports zero copy reads -class ARROW_EXPORT RowBatchReader { +class ARROW_EXPORT RecordBatchReader { public: - static Status Open(io::ReadableFileInterface* file, int64_t position, - std::shared_ptr* out); + // The offset is the absolute position to the *end* of the record batch data + // header + static Status Open(io::ReadableFileInterface* file, int64_t offset, + std::shared_ptr* out); - static Status Open(io::ReadableFileInterface* file, int64_t position, - int max_recursion_depth, std::shared_ptr* out); + static Status Open(io::ReadableFileInterface* file, int64_t offset, + int max_recursion_depth, std::shared_ptr* out); - virtual ~RowBatchReader(); + virtual ~RecordBatchReader(); - // Reassemble the row batch. A Schema is required to be able to construct the - // right array containers - Status GetRowBatch( - const std::shared_ptr& schema, std::shared_ptr* out); + // Reassemble the record batch. A Schema is required to be able to construct + // the right array containers + Status GetRecordBatch( + const std::shared_ptr& schema, std::shared_ptr* out); private: - class RowBatchReaderImpl; - std::unique_ptr impl_; + class RecordBatchReaderImpl; + std::unique_ptr impl_; }; } // namespace ipc diff --git a/cpp/src/arrow/ipc/file.cc b/cpp/src/arrow/ipc/file.cc new file mode 100644 index 0000000000000..2bf10dde266bd --- /dev/null +++ b/cpp/src/arrow/ipc/file.cc @@ -0,0 +1,210 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
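GetRecordBatchSize computes an exact byte count by doing a dry-run write: RecordBatchWriter::GetTotalSize, shown earlier, serializes the batch to a MockOutputStream that only advances a counter and then reads off the extent. A standalone sketch of that pattern, with a hypothetical name standing in for the ipc/util.h version:

```
#include <cstdint>

// Dry-run sink: accepts writes but only tracks how many bytes would have
// been written. Hypothetical, standalone rendition of MockOutputStream.
class CountingSink {
 public:
  void Write(const uint8_t* /*data*/, int64_t nbytes) { extent_ += nbytes; }
  void Tell(int64_t* position) const { *position = extent_; }
  int64_t GetExtentBytesWritten() const { return extent_; }

 private:
  int64_t extent_ = 0;
};
```

Sizing a batch then reduces to running the normal serialization path against such a sink and reading the extent, which keeps the size computation and the write path from drifting apart.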
+ +#include "arrow/ipc/file.h" + +#include +#include +#include +#include + +#include "arrow/ipc/adapter.h" +#include "arrow/ipc/metadata.h" +#include "arrow/ipc/util.h" +#include "arrow/io/interfaces.h" +#include "arrow/util/buffer.h" +#include "arrow/util/logging.h" +#include "arrow/util/status.h" + +namespace arrow { +namespace ipc { + +static constexpr const char* kArrowMagicBytes = "ARROW1"; + +// ---------------------------------------------------------------------- +// Writer implementation + +FileWriter::FileWriter(io::OutputStream* sink, const std::shared_ptr& schema) + : sink_(sink), schema_(schema), position_(-1), started_(false) {} + +Status FileWriter::UpdatePosition() { + return sink_->Tell(&position_); +} + +Status FileWriter::Open(io::OutputStream* sink, const std::shared_ptr& schema, + std::shared_ptr* out) { + *out = std::shared_ptr(new FileWriter(sink, schema)); // ctor is private + RETURN_NOT_OK((*out)->UpdatePosition()); + return Status::OK(); +} + +Status FileWriter::Write(const uint8_t* data, int64_t nbytes) { + RETURN_NOT_OK(sink_->Write(data, nbytes)); + position_ += nbytes; + return Status::OK(); +} + +Status FileWriter::Align() { + int64_t remainder = PaddedLength(position_) - position_; + if (remainder > 0) { return Write(kPaddingBytes, remainder); } + return Status::OK(); +} + +Status FileWriter::WriteAligned(const uint8_t* data, int64_t nbytes) { + RETURN_NOT_OK(Write(data, nbytes)); + return Align(); +} + +Status FileWriter::Start() { + RETURN_NOT_OK(WriteAligned( + reinterpret_cast(kArrowMagicBytes), strlen(kArrowMagicBytes))); + started_ = true; + return Status::OK(); +} + +Status FileWriter::CheckStarted() { + if (!started_) { return Start(); } + return Status::OK(); +} + +Status FileWriter::WriteRecordBatch( + const std::vector>& columns, int32_t num_rows) { + RETURN_NOT_OK(CheckStarted()); + + int64_t offset = position_; + + int64_t body_end_offset; + int64_t header_end_offset; + RETURN_NOT_OK(arrow::ipc::WriteRecordBatch( + columns, num_rows, sink_, &body_end_offset, &header_end_offset)); + RETURN_NOT_OK(UpdatePosition()); + + DCHECK(position_ % 8 == 0) << "ipc::WriteRecordBatch did not perform aligned writes"; + + // There may be padding ever the end of the metadata, so we cannot rely on + // position_ + int32_t metadata_length = header_end_offset - body_end_offset; + int32_t body_length = body_end_offset - offset; + + // Append metadata, to be written in the footer later + record_batches_.emplace_back(offset, metadata_length, body_length); + + return Status::OK(); +} + +Status FileWriter::Close() { + // Write metadata + int64_t initial_position = position_; + RETURN_NOT_OK(WriteFileFooter(schema_.get(), dictionaries_, record_batches_, sink_)); + RETURN_NOT_OK(UpdatePosition()); + + // Write footer length + int32_t footer_length = position_ - initial_position; + + if (footer_length <= 0) { return Status::Invalid("Invalid file footer"); } + + RETURN_NOT_OK(Write(reinterpret_cast(&footer_length), sizeof(int32_t))); + + // Write magic bytes to end file + return Write( + reinterpret_cast(kArrowMagicBytes), strlen(kArrowMagicBytes)); +} + +// ---------------------------------------------------------------------- +// Reader implementation + +FileReader::FileReader( + const std::shared_ptr& file, int64_t footer_offset) + : file_(file), footer_offset_(footer_offset) {} + +FileReader::~FileReader() {} + +Status FileReader::Open(const std::shared_ptr& file, + std::shared_ptr* reader) { + int64_t footer_offset; + RETURN_NOT_OK(file->GetSize(&footer_offset)); + 
return Open(file, footer_offset, reader); +} + +Status FileReader::Open(const std::shared_ptr& file, + int64_t footer_offset, std::shared_ptr* reader) { + *reader = std::shared_ptr(new FileReader(file, footer_offset)); + return (*reader)->ReadFooter(); +} + +Status FileReader::ReadFooter() { + int magic_size = static_cast(strlen(kArrowMagicBytes)); + + if (footer_offset_ <= magic_size * 2 + 4) { + std::stringstream ss; + ss << "File is too small: " << footer_offset_; + return Status::Invalid(ss.str()); + } + + std::shared_ptr buffer; + int file_end_size = magic_size + sizeof(int32_t); + RETURN_NOT_OK(file_->ReadAt(footer_offset_ - file_end_size, file_end_size, &buffer)); + + if (memcmp(buffer->data() + sizeof(int32_t), kArrowMagicBytes, magic_size)) { + return Status::Invalid("Not an Arrow file"); + } + + int32_t footer_length = *reinterpret_cast(buffer->data()); + + if (footer_length <= 0 || footer_length + magic_size * 2 + 4 > footer_offset_) { + return Status::Invalid("File is smaller than indicated metadata size"); + } + + // Now read the footer + RETURN_NOT_OK(file_->ReadAt( + footer_offset_ - footer_length - file_end_size, footer_length, &buffer)); + RETURN_NOT_OK(FileFooter::Open(buffer, &footer_)); + + // Get the schema + return footer_->GetSchema(&schema_); +} + +const std::shared_ptr& FileReader::schema() const { + return schema_; +} + +int FileReader::num_dictionaries() const { + return footer_->num_dictionaries(); +} + +int FileReader::num_record_batches() const { + return footer_->num_record_batches(); +} + +MetadataVersion::type FileReader::version() const { + return footer_->version(); +} + +Status FileReader::GetRecordBatch(int i, std::shared_ptr* batch) { + DCHECK_GE(i, 0); + DCHECK_LT(i, num_record_batches()); + FileBlock block = footer_->record_batch(i); + int64_t metadata_end_offset = block.offset + block.body_length + block.metadata_length; + + std::shared_ptr reader; + RETURN_NOT_OK(RecordBatchReader::Open(file_.get(), metadata_end_offset, &reader)); + + return reader->GetRecordBatch(schema_, batch); +} + +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/file.h b/cpp/src/arrow/ipc/file.h new file mode 100644 index 0000000000000..4b79c98281bbc --- /dev/null +++ b/cpp/src/arrow/ipc/file.h @@ -0,0 +1,146 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
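The tail layout that ReadFooter validates above is <footer flatbuffer><int32 footer length><"ARROW1">. For reference, here is a standalone sketch of the same checks against an in-memory buffer, following the logic shown but using exceptions in place of Status.

```
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

constexpr char kMagic[] = "ARROW1";

// Validate the end of an Arrow file and return the footer length.
int32_t CheckTail(const std::vector<uint8_t>& file) {
  const int64_t magic_size = static_cast<int64_t>(std::strlen(kMagic));
  const int64_t size = static_cast<int64_t>(file.size());
  // Magic appears at both ends of the file, plus 4 bytes of footer length.
  if (size <= magic_size * 2 + 4) {
    throw std::runtime_error("File is too small");
  }
  // The last bytes are <int32 footer length><magic>.
  const uint8_t* tail = file.data() + size - magic_size - sizeof(int32_t);
  if (std::memcmp(tail + sizeof(int32_t), kMagic, magic_size) != 0) {
    throw std::runtime_error("Not an Arrow file");
  }
  int32_t footer_length = 0;
  std::memcpy(&footer_length, tail, sizeof(int32_t));
  if (footer_length <= 0 || footer_length + magic_size * 2 + 4 > size) {
    throw std::runtime_error("File is smaller than indicated metadata size");
  }
  return footer_length;
}
```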
+ +// Implement Arrow file layout for IPC/RPC purposes and short-lived storage + +#ifndef ARROW_IPC_FILE_H +#define ARROW_IPC_FILE_H + +#include +#include +#include + +#include "arrow/ipc/metadata.h" +#include "arrow/util/visibility.h" + +namespace arrow { + +class Array; +class Buffer; +struct Field; +class RecordBatch; +class Schema; +class Status; + +namespace io { + +class OutputStream; +class ReadableFileInterface; + +} // namespace io + +namespace ipc { + +class ARROW_EXPORT FileWriter { + public: + static Status Open(io::OutputStream* sink, const std::shared_ptr& schema, + std::shared_ptr* out); + + // TODO(wesm): Write dictionaries + + Status WriteRecordBatch( + const std::vector>& columns, int32_t num_rows); + + Status Close(); + + private: + FileWriter(io::OutputStream* sink, const std::shared_ptr& schema); + + Status CheckStarted(); + Status Start(); + + Status UpdatePosition(); + + // Adds padding bytes if necessary to ensure all memory blocks are written on + // 8-byte boundaries. + Status Align(); + + // Write data and update position + Status Write(const uint8_t* data, int64_t nbytes); + + // Write and align + Status WriteAligned(const uint8_t* data, int64_t nbytes); + + io::OutputStream* sink_; + std::shared_ptr schema_; + int64_t position_; + bool started_; + + std::vector dictionaries_; + std::vector record_batches_; +}; + +class ARROW_EXPORT FileReader { + public: + ~FileReader(); + + // Open a file-like object that is assumed to be self-contained; i.e., the + // end of the file interface is the end of the Arrow file. Note that there + // can be any amount of data preceding the Arrow-formatted data, because we + // need only locate the end of the Arrow file stream to discover the metadata + // and then proceed to read the data into memory. + static Status Open(const std::shared_ptr& file, + std::shared_ptr* reader); + + // If the file is embedded within some larger file or memory region, you can + // pass the absolute memory offset to the end of the file (which contains the + // metadata footer). The metadata must have been written with memory offsets + // relative to the start of the containing file + // + // @param file: the data source + // @param footer_offset: the position of the end of the Arrow "file" + static Status Open(const std::shared_ptr& file, + int64_t footer_offset, std::shared_ptr* reader); + + const std::shared_ptr& schema() const; + + // Shared dictionaries for dictionary-encoding cross record batches + // TODO(wesm): Implement dictionary reading when we also have dictionary + // encoding + int num_dictionaries() const; + + int num_record_batches() const; + + MetadataVersion::type version() const; + + // Read a record batch from the file. Does not copy memory if the input + // source supports zero-copy. + // + // TODO(wesm): Make the copy/zero-copy behavior configurable (e.g. provide an + // "always copy" option) + Status GetRecordBatch(int i, std::shared_ptr* batch); + + private: + FileReader( + const std::shared_ptr& file, int64_t footer_offset); + + Status ReadFooter(); + + std::shared_ptr file_; + + // The location where the Arrow file layout ends. May be the end of the file + // or some other location if embedded in a larger file. 
+ int64_t footer_offset_; + + std::unique_ptr footer_; + std::shared_ptr schema_; +}; + +} // namespace ipc +} // namespace arrow + +#endif // ARROW_IPC_FILE_H diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc index ca4d0152b9060..f5611d4840c97 100644 --- a/cpp/src/arrow/ipc/ipc-adapter-test.cc +++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc @@ -43,31 +43,27 @@ namespace arrow { namespace ipc { -// TODO(emkornfield) convert to google style kInt32, etc? -const auto INT32 = std::make_shared(); -const auto LIST_INT32 = std::make_shared(INT32); -const auto LIST_LIST_INT32 = std::make_shared(LIST_INT32); - -typedef Status MakeRowBatch(std::shared_ptr* out); - -class TestWriteRowBatch : public ::testing::TestWithParam, - public io::MemoryMapFixture { +class TestWriteRecordBatch : public ::testing::TestWithParam, + public io::MemoryMapFixture { public: void SetUp() { pool_ = default_memory_pool(); } void TearDown() { io::MemoryMapFixture::TearDown(); } - Status RoundTripHelper(const RowBatch& batch, int memory_map_size, - std::shared_ptr* batch_result) { + Status RoundTripHelper(const RecordBatch& batch, int memory_map_size, + std::shared_ptr* batch_result) { std::string path = "test-write-row-batch"; io::MemoryMapFixture::InitMemoryMap(memory_map_size, path, &mmap_); - int64_t header_location; - RETURN_NOT_OK(WriteRowBatch(mmap_.get(), &batch, &header_location)); + int64_t body_end_offset; + int64_t header_end_offset; - std::shared_ptr reader; - RETURN_NOT_OK(RowBatchReader::Open(mmap_.get(), header_location, &reader)); + RETURN_NOT_OK(WriteRecordBatch(batch.columns(), batch.num_rows(), mmap_.get(), + &body_end_offset, &header_end_offset)); - RETURN_NOT_OK(reader->GetRowBatch(batch.schema(), batch_result)); + std::shared_ptr reader; + RETURN_NOT_OK(RecordBatchReader::Open(mmap_.get(), header_end_offset, &reader)); + + RETURN_NOT_OK(reader->GetRecordBatch(batch.schema(), batch_result)); return Status::OK(); } @@ -76,10 +72,10 @@ class TestWriteRowBatch : public ::testing::TestWithParam, MemoryPool* pool_; }; -TEST_P(TestWriteRowBatch, RoundTrip) { - std::shared_ptr batch; +TEST_P(TestWriteRecordBatch, RoundTrip) { + std::shared_ptr batch; ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue - std::shared_ptr batch_result; + std::shared_ptr batch_result; ASSERT_OK(RoundTripHelper(*batch, 1 << 16, &batch_result)); // do checks @@ -93,217 +89,39 @@ TEST_P(TestWriteRowBatch, RoundTrip) { } } -Status MakeIntRowBatch(std::shared_ptr* out) { - const int length = 1000; - - // Make the schema - auto f0 = std::make_shared("f0", INT32); - auto f1 = std::make_shared("f1", INT32); - std::shared_ptr schema(new Schema({f0, f1})); - - // Example data - std::shared_ptr a0, a1; - MemoryPool* pool = default_memory_pool(); - RETURN_NOT_OK(MakeRandomInt32Array(length, false, pool, &a0)); - RETURN_NOT_OK(MakeRandomInt32Array(length, true, pool, &a1)); - out->reset(new RowBatch(schema, length, {a0, a1})); - return Status::OK(); -} +INSTANTIATE_TEST_CASE_P(RoundTripTests, TestWriteRecordBatch, + ::testing::Values(&MakeIntRecordBatch, &MakeListRecordBatch, &MakeNonNullRecordBatch, + &MakeZeroLengthRecordBatch, &MakeDeeplyNestedList, + &MakeStringTypesRecordBatch, &MakeStruct)); -template -Status MakeRandomBinaryArray( - const TypePtr& type, int32_t length, MemoryPool* pool, ArrayPtr* array) { - const std::vector values = { - "", "", "abc", "123", "efg", "456!@#!@#", "12312"}; - Builder builder(pool, type); - const auto values_len = values.size(); - for (int32_t i = 0; i 
< length; ++i) { - int values_index = i % values_len; - if (values_index == 0) { - RETURN_NOT_OK(builder.AppendNull()); - } else { - const std::string& value = values[values_index]; - RETURN_NOT_OK( - builder.Append(reinterpret_cast(value.data()), value.size())); - } - } - *array = builder.Finish(); - return Status::OK(); -} - -Status MakeStringTypesRowBatch(std::shared_ptr* out) { - const int32_t length = 500; - auto string_type = std::make_shared(); - auto binary_type = std::make_shared(); - auto f0 = std::make_shared("f0", string_type); - auto f1 = std::make_shared("f1", binary_type); - std::shared_ptr schema(new Schema({f0, f1})); - - std::shared_ptr a0, a1; - MemoryPool* pool = default_memory_pool(); - - { - auto status = - MakeRandomBinaryArray(string_type, length, pool, &a0); - RETURN_NOT_OK(status); - } - { - auto status = - MakeRandomBinaryArray(binary_type, length, pool, &a1); - RETURN_NOT_OK(status); - } - out->reset(new RowBatch(schema, length, {a0, a1})); - return Status::OK(); -} - -Status MakeListRowBatch(std::shared_ptr* out) { - // Make the schema - auto f0 = std::make_shared("f0", LIST_INT32); - auto f1 = std::make_shared("f1", LIST_LIST_INT32); - auto f2 = std::make_shared("f2", INT32); - std::shared_ptr schema(new Schema({f0, f1, f2})); - - // Example data - - MemoryPool* pool = default_memory_pool(); - const int length = 200; - std::shared_ptr leaf_values, list_array, list_list_array, flat_array; - const bool include_nulls = true; - RETURN_NOT_OK(MakeRandomInt32Array(1000, include_nulls, pool, &leaf_values)); - RETURN_NOT_OK( - MakeRandomListArray(leaf_values, length, include_nulls, pool, &list_array)); - RETURN_NOT_OK( - MakeRandomListArray(list_array, length, include_nulls, pool, &list_list_array)); - RETURN_NOT_OK(MakeRandomInt32Array(length, include_nulls, pool, &flat_array)); - out->reset(new RowBatch(schema, length, {list_array, list_list_array, flat_array})); - return Status::OK(); -} - -Status MakeZeroLengthRowBatch(std::shared_ptr* out) { - // Make the schema - auto f0 = std::make_shared("f0", LIST_INT32); - auto f1 = std::make_shared("f1", LIST_LIST_INT32); - auto f2 = std::make_shared("f2", INT32); - std::shared_ptr schema(new Schema({f0, f1, f2})); - - // Example data - MemoryPool* pool = default_memory_pool(); - const int length = 200; - const bool include_nulls = true; - std::shared_ptr leaf_values, list_array, list_list_array, flat_array; - RETURN_NOT_OK(MakeRandomInt32Array(0, include_nulls, pool, &leaf_values)); - RETURN_NOT_OK(MakeRandomListArray(leaf_values, 0, include_nulls, pool, &list_array)); - RETURN_NOT_OK( - MakeRandomListArray(list_array, 0, include_nulls, pool, &list_list_array)); - RETURN_NOT_OK(MakeRandomInt32Array(0, include_nulls, pool, &flat_array)); - out->reset(new RowBatch(schema, length, {list_array, list_list_array, flat_array})); - return Status::OK(); -} - -Status MakeNonNullRowBatch(std::shared_ptr* out) { - // Make the schema - auto f0 = std::make_shared("f0", LIST_INT32); - auto f1 = std::make_shared("f1", LIST_LIST_INT32); - auto f2 = std::make_shared("f2", INT32); - std::shared_ptr schema(new Schema({f0, f1, f2})); - - // Example data - MemoryPool* pool = default_memory_pool(); - const int length = 50; - std::shared_ptr leaf_values, list_array, list_list_array, flat_array; - - RETURN_NOT_OK(MakeRandomInt32Array(1000, true, pool, &leaf_values)); - bool include_nulls = false; - RETURN_NOT_OK( - MakeRandomListArray(leaf_values, length, include_nulls, pool, &list_array)); - RETURN_NOT_OK( - MakeRandomListArray(list_array, 
length, include_nulls, pool, &list_list_array)); - RETURN_NOT_OK(MakeRandomInt32Array(length, include_nulls, pool, &flat_array)); - out->reset(new RowBatch(schema, length, {list_array, list_list_array, flat_array})); - return Status::OK(); -} - -Status MakeDeeplyNestedList(std::shared_ptr* out) { - const int batch_length = 5; - TypePtr type = INT32; - - MemoryPool* pool = default_memory_pool(); - ArrayPtr array; - const bool include_nulls = true; - RETURN_NOT_OK(MakeRandomInt32Array(1000, include_nulls, pool, &array)); - for (int i = 0; i < 63; ++i) { - type = std::static_pointer_cast(std::make_shared(type)); - RETURN_NOT_OK(MakeRandomListArray(array, batch_length, include_nulls, pool, &array)); - } - - auto f0 = std::make_shared("f0", type); - std::shared_ptr schema(new Schema({f0})); - std::vector arrays = {array}; - out->reset(new RowBatch(schema, batch_length, arrays)); - return Status::OK(); -} - -Status MakeStruct(std::shared_ptr* out) { - // reuse constructed list columns - std::shared_ptr list_batch; - RETURN_NOT_OK(MakeListRowBatch(&list_batch)); - std::vector columns = { - list_batch->column(0), list_batch->column(1), list_batch->column(2)}; - auto list_schema = list_batch->schema(); - - // Define schema - std::shared_ptr type(new StructType( - {list_schema->field(0), list_schema->field(1), list_schema->field(2)})); - auto f0 = std::make_shared("non_null_struct", type); - auto f1 = std::make_shared("null_struct", type); - std::shared_ptr schema(new Schema({f0, f1})); - - // construct individual nullable/non-nullable struct arrays - ArrayPtr no_nulls(new StructArray(type, list_batch->num_rows(), columns)); - std::vector null_bytes(list_batch->num_rows(), 1); - null_bytes[0] = 0; - std::shared_ptr null_bitmask; - RETURN_NOT_OK(util::bytes_to_bits(null_bytes, &null_bitmask)); - ArrayPtr with_nulls( - new StructArray(type, list_batch->num_rows(), columns, 1, null_bitmask)); - - // construct batch - std::vector arrays = {no_nulls, with_nulls}; - out->reset(new RowBatch(schema, list_batch->num_rows(), arrays)); - return Status::OK(); -} - -INSTANTIATE_TEST_CASE_P(RoundTripTests, TestWriteRowBatch, - ::testing::Values(&MakeIntRowBatch, &MakeListRowBatch, &MakeNonNullRowBatch, - &MakeZeroLengthRowBatch, &MakeDeeplyNestedList, - &MakeStringTypesRowBatch, &MakeStruct)); - -void TestGetRowBatchSize(std::shared_ptr batch) { +void TestGetRecordBatchSize(std::shared_ptr batch) { ipc::MockOutputStream mock; - int64_t mock_header_location = -1; + int64_t mock_header_offset = -1; + int64_t mock_body_offset = -1; int64_t size = -1; - ASSERT_OK(WriteRowBatch(&mock, batch.get(), &mock_header_location)); - ASSERT_OK(GetRowBatchSize(batch.get(), &size)); + ASSERT_OK(WriteRecordBatch(batch->columns(), batch->num_rows(), &mock, + &mock_body_offset, &mock_header_offset)); + ASSERT_OK(GetRecordBatchSize(batch.get(), &size)); ASSERT_EQ(mock.GetExtentBytesWritten(), size); } -TEST_F(TestWriteRowBatch, IntegerGetRowBatchSize) { - std::shared_ptr batch; +TEST_F(TestWriteRecordBatch, IntegerGetRecordBatchSize) { + std::shared_ptr batch; - ASSERT_OK(MakeIntRowBatch(&batch)); - TestGetRowBatchSize(batch); + ASSERT_OK(MakeIntRecordBatch(&batch)); + TestGetRecordBatchSize(batch); - ASSERT_OK(MakeListRowBatch(&batch)); - TestGetRowBatchSize(batch); + ASSERT_OK(MakeListRecordBatch(&batch)); + TestGetRecordBatchSize(batch); - ASSERT_OK(MakeZeroLengthRowBatch(&batch)); - TestGetRowBatchSize(batch); + ASSERT_OK(MakeZeroLengthRecordBatch(&batch)); + TestGetRecordBatchSize(batch); - 
ASSERT_OK(MakeNonNullRowBatch(&batch)); - TestGetRowBatchSize(batch); + ASSERT_OK(MakeNonNullRecordBatch(&batch)); + TestGetRecordBatchSize(batch); ASSERT_OK(MakeDeeplyNestedList(&batch)); - TestGetRowBatchSize(batch); + TestGetRecordBatchSize(batch); } class RecursionLimits : public ::testing::Test, public io::MemoryMapFixture { @@ -314,7 +132,7 @@ class RecursionLimits : public ::testing::Test, public io::MemoryMapFixture { Status WriteToMmap(int recursion_level, bool override_level, int64_t* header_out = nullptr, std::shared_ptr* schema_out = nullptr) { const int batch_length = 5; - TypePtr type = INT32; + TypePtr type = kInt32; ArrayPtr array; const bool include_nulls = true; RETURN_NOT_OK(MakeRandomInt32Array(1000, include_nulls, pool_, &array)); @@ -328,18 +146,22 @@ class RecursionLimits : public ::testing::Test, public io::MemoryMapFixture { std::shared_ptr schema(new Schema({f0})); if (schema_out != nullptr) { *schema_out = schema; } std::vector arrays = {array}; - auto batch = std::make_shared(schema, batch_length, arrays); + auto batch = std::make_shared(schema, batch_length, arrays); std::string path = "test-write-past-max-recursion"; const int memory_map_size = 1 << 16; io::MemoryMapFixture::InitMemoryMap(memory_map_size, path, &mmap_); - int64_t header_location; - int64_t* header_out_param = header_out == nullptr ? &header_location : header_out; + + int64_t body_offset; + int64_t header_offset; + + int64_t* header_out_param = header_out == nullptr ? &header_offset : header_out; if (override_level) { - return WriteRowBatch( - mmap_.get(), batch.get(), header_out_param, recursion_level + 1); + return WriteRecordBatch(batch->columns(), batch->num_rows(), mmap_.get(), + &body_offset, header_out_param, recursion_level + 1); } else { - return WriteRowBatch(mmap_.get(), batch.get(), header_out_param); + return WriteRecordBatch(batch->columns(), batch->num_rows(), mmap_.get(), + &body_offset, header_out_param); } } @@ -353,14 +175,14 @@ TEST_F(RecursionLimits, WriteLimit) { } TEST_F(RecursionLimits, ReadLimit) { - int64_t header_location = -1; + int64_t header_offset = -1; std::shared_ptr schema; - ASSERT_OK(WriteToMmap(64, true, &header_location, &schema)); + ASSERT_OK(WriteToMmap(64, true, &header_offset, &schema)); - std::shared_ptr reader; - ASSERT_OK(RowBatchReader::Open(mmap_.get(), header_location, &reader)); - std::shared_ptr batch_result; - ASSERT_RAISES(Invalid, reader->GetRowBatch(schema, &batch_result)); + std::shared_ptr reader; + ASSERT_OK(RecordBatchReader::Open(mmap_.get(), header_offset, &reader)); + std::shared_ptr batch_result; + ASSERT_RAISES(Invalid, reader->GetRecordBatch(schema, &batch_result)); } } // namespace ipc diff --git a/cpp/src/arrow/ipc/ipc-file-test.cc b/cpp/src/arrow/ipc/ipc-file-test.cc new file mode 100644 index 0000000000000..cd424bf385cae --- /dev/null +++ b/cpp/src/arrow/ipc/ipc-file-test.cc @@ -0,0 +1,125 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include +#include +#include +#include +#include +#include + +#include "gtest/gtest.h" + +#include "arrow/io/memory.h" +#include "arrow/io/test-common.h" +#include "arrow/ipc/adapter.h" +#include "arrow/ipc/file.h" +#include "arrow/ipc/test-common.h" +#include "arrow/ipc/util.h" + +#include "arrow/test-util.h" +#include "arrow/types/list.h" +#include "arrow/types/primitive.h" +#include "arrow/types/string.h" +#include "arrow/types/struct.h" +#include "arrow/util/bit-util.h" +#include "arrow/util/buffer.h" +#include "arrow/util/memory-pool.h" +#include "arrow/util/status.h" + +namespace arrow { +namespace ipc { + +class TestFileFormat : public ::testing::TestWithParam { + public: + void SetUp() { + pool_ = default_memory_pool(); + buffer_ = std::make_shared(pool_); + sink_.reset(new io::BufferOutputStream(buffer_)); + } + void TearDown() {} + + Status RoundTripHelper( + const RecordBatch& batch, std::vector>* out_batches) { + // Write the file + RETURN_NOT_OK(FileWriter::Open(sink_.get(), batch.schema(), &file_writer_)); + int num_batches = 3; + for (int i = 0; i < num_batches; ++i) { + RETURN_NOT_OK(file_writer_->WriteRecordBatch(batch.columns(), batch.num_rows())); + } + RETURN_NOT_OK(file_writer_->Close()); + + // Current offset into stream is the end of the file + int64_t footer_offset; + RETURN_NOT_OK(sink_->Tell(&footer_offset)); + + // Open the file + auto reader = std::make_shared(buffer_->data(), buffer_->size()); + RETURN_NOT_OK(FileReader::Open(reader, footer_offset, &file_reader_)); + + EXPECT_EQ(num_batches, file_reader_->num_record_batches()); + + out_batches->resize(num_batches); + for (int i = 0; i < num_batches; ++i) { + RETURN_NOT_OK(file_reader_->GetRecordBatch(i, &(*out_batches)[i])); + } + + return Status::OK(); + } + + void CompareBatch(const RecordBatch* left, const RecordBatch* right) { + ASSERT_TRUE(left->schema()->Equals(right->schema())); + ASSERT_EQ(left->num_columns(), right->num_columns()) + << left->schema()->ToString() << " result: " << right->schema()->ToString(); + EXPECT_EQ(left->num_rows(), right->num_rows()); + for (int i = 0; i < left->num_columns(); ++i) { + EXPECT_TRUE(left->column(i)->Equals(right->column(i))) + << "Idx: " << i << " Name: " << left->column_name(i); + } + } + + protected: + MemoryPool* pool_; + + std::unique_ptr sink_; + std::shared_ptr buffer_; + + std::shared_ptr file_writer_; + std::shared_ptr file_reader_; +}; + +TEST_P(TestFileFormat, RoundTrip) { + std::shared_ptr batch; + ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue + + std::vector> out_batches; + + ASSERT_OK(RoundTripHelper(*batch, &out_batches)); + + // Compare batches. 
Same + for (size_t i = 0; i < out_batches.size(); ++i) { + CompareBatch(batch.get(), out_batches[i].get()); + } +} + +INSTANTIATE_TEST_CASE_P(RoundTripTests, TestFileFormat, + ::testing::Values(&MakeIntRecordBatch, &MakeListRecordBatch, &MakeNonNullRecordBatch, + &MakeZeroLengthRecordBatch, &MakeDeeplyNestedList, + &MakeStringTypesRecordBatch, &MakeStruct)); + +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/ipc-metadata-test.cc b/cpp/src/arrow/ipc/ipc-metadata-test.cc index 51d79cfb4c4bb..1dc3969233237 100644 --- a/cpp/src/arrow/ipc/ipc-metadata-test.cc +++ b/cpp/src/arrow/ipc/ipc-metadata-test.cc @@ -21,6 +21,7 @@ #include "gtest/gtest.h" +#include "arrow/io/memory.h" #include "arrow/ipc/metadata.h" #include "arrow/schema.h" #include "arrow/test-util.h" @@ -31,6 +32,8 @@ namespace arrow { class Buffer; +namespace ipc { + static inline void assert_schema_equal(const Schema* lhs, const Schema* rhs) { if (!lhs->Equals(*rhs)) { std::stringstream ss; @@ -46,14 +49,14 @@ class TestSchemaMessage : public ::testing::Test { void CheckRoundtrip(const Schema* schema) { std::shared_ptr buffer; - ASSERT_OK(ipc::WriteSchema(schema, &buffer)); + ASSERT_OK(WriteSchema(schema, &buffer)); - std::shared_ptr message; - ASSERT_OK(ipc::Message::Open(buffer, &message)); + std::shared_ptr message; + ASSERT_OK(Message::Open(buffer, &message)); - ASSERT_EQ(ipc::Message::SCHEMA, message->type()); + ASSERT_EQ(Message::SCHEMA, message->type()); - std::shared_ptr schema_msg = message->GetSchema(); + std::shared_ptr schema_msg = message->GetSchema(); ASSERT_EQ(schema->num_fields(), schema_msg->num_fields()); std::shared_ptr schema2; @@ -94,4 +97,68 @@ TEST_F(TestSchemaMessage, NestedFields) { CheckRoundtrip(&schema); } +class TestFileFooter : public ::testing::Test { + public: + void SetUp() {} + + void CheckRoundtrip(const Schema* schema, const std::vector& dictionaries, + const std::vector& record_batches) { + auto buffer = std::make_shared(); + io::BufferOutputStream stream(buffer); + + ASSERT_OK(WriteFileFooter(schema, dictionaries, record_batches, &stream)); + + std::unique_ptr footer; + ASSERT_OK(FileFooter::Open(buffer, &footer)); + + ASSERT_EQ(MetadataVersion::V1_SNAPSHOT, footer->version()); + + // Check schema + std::shared_ptr schema2; + ASSERT_OK(footer->GetSchema(&schema2)); + assert_schema_equal(schema, schema2.get()); + + // Check blocks + ASSERT_EQ(dictionaries.size(), footer->num_dictionaries()); + ASSERT_EQ(record_batches.size(), footer->num_record_batches()); + + for (int i = 0; i < footer->num_dictionaries(); ++i) { + CheckBlocks(dictionaries[i], footer->dictionary(i)); + } + + for (int i = 0; i < footer->num_record_batches(); ++i) { + CheckBlocks(record_batches[i], footer->record_batch(i)); + } + } + + void CheckBlocks(const FileBlock& left, const FileBlock& right) { + ASSERT_EQ(left.offset, right.offset); + ASSERT_EQ(left.metadata_length, right.metadata_length); + ASSERT_EQ(left.body_length, right.body_length); + } + + private: + std::shared_ptr example_schema_; +}; + +TEST_F(TestFileFooter, Basics) { + auto f0 = std::make_shared("f0", std::make_shared()); + auto f1 = std::make_shared("f1", std::make_shared()); + Schema schema({f0, f1}); + + std::vector dictionaries; + dictionaries.emplace_back(8, 92, 900); + dictionaries.emplace_back(1000, 100, 1900); + dictionaries.emplace_back(3000, 100, 2900); + + std::vector record_batches; + record_batches.emplace_back(6000, 100, 900); + record_batches.emplace_back(7000, 100, 1900); + record_batches.emplace_back(9000, 100, 2900); + 
record_batches.emplace_back(12000, 100, 3900); + + CheckRoundtrip(&schema, dictionaries, record_batches); +} + +} // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index 05e9c7ad4d359..7102012c29a84 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -31,10 +31,6 @@ #include "arrow/util/buffer.h" #include "arrow/util/status.h" -typedef flatbuffers::FlatBufferBuilder FBB; -typedef flatbuffers::Offset FieldOffset; -typedef flatbuffers::Offset Offset; - namespace arrow { namespace flatbuf = org::apache::arrow::flatbuf; @@ -52,6 +48,8 @@ const std::shared_ptr UINT32 = std::make_shared(); const std::shared_ptr UINT64 = std::make_shared(); const std::shared_ptr FLOAT = std::make_shared(); const std::shared_ptr DOUBLE = std::make_shared(); +const std::shared_ptr STRING = std::make_shared(); +const std::shared_ptr BINARY = std::make_shared(); static Status IntFromFlatbuffer( const flatbuf::Int* int_data, std::shared_ptr* out) { @@ -102,8 +100,11 @@ static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, return FloatFromFlatuffer( static_cast(type_data), out); case flatbuf::Type_Binary: + *out = BINARY; + return Status::OK(); case flatbuf::Type_Utf8: - return Status::NotImplemented("Type is not implemented"); + *out = STRING; + return Status::OK(); case flatbuf::Type_Bool: *out = BOOL; return Status::OK(); @@ -193,6 +194,14 @@ static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, *out_type = flatbuf::Type_FloatingPoint; *offset = FloatToFlatbuffer(fbb, flatbuf::Precision_DOUBLE); break; + case Type::BINARY: + *out_type = flatbuf::Type_Binary; + *offset = flatbuf::CreateBinary(fbb).Union(); + break; + case Type::STRING: + *out_type = flatbuf::Type_Utf8; + *offset = flatbuf::CreateUtf8(fbb).Union(); + break; case Type::LIST: *out_type = flatbuf::Type_List; return ListToFlatbuffer(fbb, type, children, offset); @@ -255,19 +264,26 @@ flatbuf::Endianness endianness() { return bint.c[0] == 1 ? 
flatbuf::Endianness_Big : flatbuf::Endianness_Little; } -Status MessageBuilder::SetSchema(const Schema* schema) { - header_type_ = flatbuf::MessageHeader_Schema; - +Status SchemaToFlatbuffer( + FBB& fbb, const Schema* schema, flatbuffers::Offset* out) { std::vector field_offsets; for (int i = 0; i < schema->num_fields(); ++i) { const std::shared_ptr& field = schema->field(i); FieldOffset offset; - RETURN_NOT_OK(FieldToFlatbuffer(fbb_, field, &offset)); + RETURN_NOT_OK(FieldToFlatbuffer(fbb, field, &offset)); field_offsets.push_back(offset); } - header_ = - flatbuf::CreateSchema(fbb_, endianness(), fbb_.CreateVector(field_offsets)).Union(); + *out = flatbuf::CreateSchema(fbb, endianness(), fbb.CreateVector(field_offsets)); + return Status::OK(); +} + +Status MessageBuilder::SetSchema(const Schema* schema) { + flatbuffers::Offset fb_schema; + RETURN_NOT_OK(SchemaToFlatbuffer(fbb_, schema, &fb_schema)); + + header_type_ = flatbuf::MessageHeader_Schema; + header_ = fb_schema.Union(); body_length_ = 0; return Status::OK(); } @@ -301,17 +317,17 @@ Status MessageBuilder::Finish() { } Status MessageBuilder::GetBuffer(std::shared_ptr* out) { - // The message buffer is prefixed by the size of the complete flatbuffer as + // The message buffer is suffixed by the size of the complete flatbuffer as // int32_t - // + // int32_t size = fbb_.GetSize(); auto result = std::make_shared(); RETURN_NOT_OK(result->Resize(size + sizeof(int32_t))); uint8_t* dst = result->mutable_data(); - memcpy(dst, reinterpret_cast(&size), sizeof(int32_t)); - memcpy(dst + sizeof(int32_t), fbb_.GetBufferPointer(), size); + memcpy(dst, fbb_.GetBufferPointer(), size); + memcpy(dst + size, reinterpret_cast(&size), sizeof(int32_t)); *out = result; return Status::OK(); diff --git a/cpp/src/arrow/ipc/metadata-internal.h b/cpp/src/arrow/ipc/metadata-internal.h index d38df840ba05e..c404cfde22ca3 100644 --- a/cpp/src/arrow/ipc/metadata-internal.h +++ b/cpp/src/arrow/ipc/metadata-internal.h @@ -24,7 +24,9 @@ #include "flatbuffers/flatbuffers.h" +#include "arrow/ipc/File_generated.h" #include "arrow/ipc/Message_generated.h" +#include "arrow/ipc/metadata.h" namespace arrow { @@ -37,11 +39,18 @@ class Status; namespace ipc { +using FBB = flatbuffers::FlatBufferBuilder; +using FieldOffset = flatbuffers::Offset; +using Offset = flatbuffers::Offset; + static constexpr flatbuf::MetadataVersion kMetadataVersion = flatbuf::MetadataVersion_V1_SNAPSHOT; Status FieldFromFlatbuffer(const flatbuf::Field* field, std::shared_ptr* out); +Status SchemaToFlatbuffer( + FBB& fbb, const Schema* schema, flatbuffers::Offset* out); + class MessageBuilder { public: Status SetSchema(const Schema* schema); diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index e510755110e04..66df8a6711fa9 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -23,7 +23,8 @@ #include "flatbuffers/flatbuffers.h" -// Generated C++ flatbuffer IDL +#include "arrow/io/interfaces.h" +#include "arrow/ipc/File_generated.h" #include "arrow/ipc/Message_generated.h" #include "arrow/ipc/metadata-internal.h" @@ -47,9 +48,10 @@ Status WriteSchema(const Schema* schema, std::shared_ptr* out) { //---------------------------------------------------------------------- // Message reader -class Message::Impl { +class Message::MessageImpl { public: - explicit Impl(const std::shared_ptr& buffer, const flatbuf::Message* message) + explicit MessageImpl( + const std::shared_ptr& buffer, const flatbuf::Message* message) : buffer_(buffer), message_(message) {} 
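A consequence of the GetBuffer change above: the int32 size now trails the flatbuffer payload instead of preceding it, so the payload starts at offset zero and its length sits in the final four bytes. A sketch of recovering that length under the new layout; the helper name is illustrative, not part of Arrow:

```cpp
#include <cstdint>
#include <cstring>

// Buffer layout after the change: [flatbuffer bytes][int32 size].
// Reads the trailing length word; assumes total_len >= sizeof(int32_t).
inline int32_t PayloadSize(const uint8_t* buf, int64_t total_len) {
  int32_t size = 0;
  std::memcpy(&size, buf + total_len - sizeof(int32_t), sizeof(int32_t));
  return size;  // the payload occupies buf[0, size)
}
```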
Message::Type type() const { @@ -76,31 +78,16 @@ class Message::Impl { const flatbuf::Message* message_; }; -class SchemaMessage::Impl { - public: - explicit Impl(const void* schema) - : schema_(static_cast(schema)) {} - - const flatbuf::Field* field(int i) const { return schema_->fields()->Get(i); } - - int num_fields() const { return schema_->fields()->size(); } - - private: - const flatbuf::Schema* schema_; -}; - Message::Message() {} Status Message::Open( const std::shared_ptr& buffer, std::shared_ptr* out) { std::shared_ptr result(new Message()); - // The buffer is prefixed by its size as int32_t - const uint8_t* fb_head = buffer->data() + sizeof(int32_t); - const flatbuf::Message* message = flatbuf::GetMessage(fb_head); + const flatbuf::Message* message = flatbuf::GetMessage(buffer->data()); // TODO(wesm): verify message - result->impl_.reset(new Impl(buffer, message)); + result->impl_.reset(new MessageImpl(buffer, message)); *out = result; return Status::OK(); @@ -122,10 +109,26 @@ std::shared_ptr Message::GetSchema() { return std::make_shared(this->shared_from_this(), impl_->header()); } +// ---------------------------------------------------------------------- +// SchemaMessage + +class SchemaMessage::SchemaMessageImpl { + public: + explicit SchemaMessageImpl(const void* schema) + : schema_(static_cast(schema)) {} + + const flatbuf::Field* field(int i) const { return schema_->fields()->Get(i); } + + int num_fields() const { return schema_->fields()->size(); } + + private: + const flatbuf::Schema* schema_; +}; + SchemaMessage::SchemaMessage( const std::shared_ptr& message, const void* schema) { message_ = message; - impl_.reset(new Impl(schema)); + impl_.reset(new SchemaMessageImpl(schema)); } int SchemaMessage::num_fields() const { @@ -146,9 +149,12 @@ Status SchemaMessage::GetSchema(std::shared_ptr* out) const { return Status::OK(); } -class RecordBatchMessage::Impl { +// ---------------------------------------------------------------------- +// RecordBatchMessage + +class RecordBatchMessage::RecordBatchMessageImpl { public: - explicit Impl(const void* batch) + explicit RecordBatchMessageImpl(const void* batch) : batch_(static_cast(batch)) { nodes_ = batch_->nodes(); buffers_ = batch_->buffers(); @@ -177,7 +183,7 @@ std::shared_ptr Message::GetRecordBatch() { RecordBatchMessage::RecordBatchMessage( const std::shared_ptr& message, const void* batch) { message_ = message; - impl_.reset(new Impl(batch)); + impl_.reset(new RecordBatchMessageImpl(batch)); } // TODO(wesm): Copying the flatbuffer data isn't great, but this will do for @@ -213,5 +219,122 @@ int RecordBatchMessage::num_fields() const { return impl_->num_fields(); } +// ---------------------------------------------------------------------- +// File footer + +static flatbuffers::Offset> +FileBlocksToFlatbuffer(FBB& fbb, const std::vector& blocks) { + std::vector fb_blocks; + + for (const FileBlock& block : blocks) { + fb_blocks.emplace_back(block.offset, block.metadata_length, block.body_length); + } + + return fbb.CreateVectorOfStructs(fb_blocks); +} + +Status WriteFileFooter(const Schema* schema, const std::vector& dictionaries, + const std::vector& record_batches, io::OutputStream* out) { + FBB fbb; + + flatbuffers::Offset fb_schema; + RETURN_NOT_OK(SchemaToFlatbuffer(fbb, schema, &fb_schema)); + + auto fb_dictionaries = FileBlocksToFlatbuffer(fbb, dictionaries); + auto fb_record_batches = FileBlocksToFlatbuffer(fbb, record_batches); + + auto footer = flatbuf::CreateFooter( + fbb, kMetadataVersion, fb_schema, 
fb_dictionaries, fb_record_batches); + + fbb.Finish(footer); + + int32_t size = fbb.GetSize(); + + return out->Write(fbb.GetBufferPointer(), size); +} + +static inline FileBlock FileBlockFromFlatbuffer(const flatbuf::Block* block) { + return FileBlock(block->offset(), block->metaDataLength(), block->bodyLength()); +} + +class FileFooter::FileFooterImpl { + public: + FileFooterImpl(const std::shared_ptr& buffer, const flatbuf::Footer* footer) + : buffer_(buffer), footer_(footer) {} + + int num_dictionaries() const { return footer_->dictionaries()->size(); } + + int num_record_batches() const { return footer_->recordBatches()->size(); } + + MetadataVersion::type version() const { + switch (footer_->version()) { + case flatbuf::MetadataVersion_V1_SNAPSHOT: + return MetadataVersion::V1_SNAPSHOT; + // Add cases as other versions become available + default: + return MetadataVersion::V1_SNAPSHOT; + } + } + + FileBlock record_batch(int i) const { + return FileBlockFromFlatbuffer(footer_->recordBatches()->Get(i)); + } + + FileBlock dictionary(int i) const { + return FileBlockFromFlatbuffer(footer_->dictionaries()->Get(i)); + } + + Status GetSchema(std::shared_ptr* out) const { + auto schema_msg = std::make_shared(nullptr, footer_->schema()); + return schema_msg->GetSchema(out); + } + + private: + // Retain reference to memory + std::shared_ptr buffer_; + + const flatbuf::Footer* footer_; +}; + +FileFooter::FileFooter() {} + +FileFooter::~FileFooter() {} + +Status FileFooter::Open( + const std::shared_ptr& buffer, std::unique_ptr* out) { + const flatbuf::Footer* footer = flatbuf::GetFooter(buffer->data()); + + *out = std::unique_ptr(new FileFooter()); + + // TODO(wesm): Verify the footer + (*out)->impl_.reset(new FileFooterImpl(buffer, footer)); + + return Status::OK(); +} + +int FileFooter::num_dictionaries() const { + return impl_->num_dictionaries(); +} + +int FileFooter::num_record_batches() const { + return impl_->num_record_batches(); +} + +MetadataVersion::type FileFooter::version() const { + return impl_->version(); +} + +FileBlock FileFooter::record_batch(int i) const { + return impl_->record_batch(i); +} + +FileBlock FileFooter::dictionary(int i) const { + return impl_->dictionary(i); +} + +Status FileFooter::GetSchema(std::shared_ptr* out) const { + return impl_->GetSchema(out); +} + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h index d5ec53317e6f2..2f0e853bf97f0 100644 --- a/cpp/src/arrow/ipc/metadata.h +++ b/cpp/src/arrow/ipc/metadata.h @@ -22,6 +22,7 @@ #include #include +#include #include "arrow/util/visibility.h" @@ -32,17 +33,24 @@ struct Field; class Schema; class Status; +namespace io { + +class OutputStream; + +} // namespace io + namespace ipc { +struct MetadataVersion { + enum type { V1_SNAPSHOT }; +}; + //---------------------------------------------------------------------- -// Message read/write APIs // Serialize arrow::Schema as a Flatbuffer ARROW_EXPORT Status WriteSchema(const Schema* schema, std::shared_ptr* out); -//---------------------------------------------------------------------- - // Read interface classes. 
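Each entry the footer writes is an (offset, metadata length, body length) triple — the FileBlock struct declared further below — and that triple is what makes the file format randomly accessible. A small illustrative helper for reasoning about the layout (the names here are ours, not Arrow's):

```cpp
#include <cstdint>

struct Block {  // mirrors ipc::FileBlock
  int64_t offset;
  int32_t metadata_length;
  int64_t body_length;
};

// First byte past a block: its metadata is followed directly by its body,
// so consecutive blocks can be laid out end to end (modulo alignment).
inline int64_t EndOffset(const Block& b) {
  return b.offset + b.metadata_length + b.body_length;
}
```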
We do not fully deserialize the flatbuffers so that // individual fields metadata can be retrieved from very large schema without // @@ -68,8 +76,8 @@ class ARROW_EXPORT SchemaMessage { // Parent, owns the flatbuffer data std::shared_ptr message_; - class Impl; - std::unique_ptr impl_; + class SchemaMessageImpl; + std::unique_ptr impl_; }; // Field metadata @@ -101,8 +109,8 @@ class ARROW_EXPORT RecordBatchMessage { // Parent, owns the flatbuffer data std::shared_ptr message_; - class Impl; - std::unique_ptr impl_; + class RecordBatchMessageImpl; + std::unique_ptr impl_; }; class ARROW_EXPORT DictionaryBatchMessage { @@ -133,8 +141,46 @@ class ARROW_EXPORT Message : public std::enable_shared_from_this { Message(); // Hide serialization details from user API - class Impl; - std::unique_ptr impl_; + class MessageImpl; + std::unique_ptr impl_; +}; + +// ---------------------------------------------------------------------- +// File footer for file-like representation + +struct FileBlock { + FileBlock(int64_t offset, int32_t metadata_length, int64_t body_length) + : offset(offset), metadata_length(metadata_length), body_length(body_length) {} + + int64_t offset; + int32_t metadata_length; + int64_t body_length; +}; + +ARROW_EXPORT +Status WriteFileFooter(const Schema* schema, const std::vector& dictionaries, + const std::vector& record_batches, io::OutputStream* out); + +class ARROW_EXPORT FileFooter { + public: + ~FileFooter(); + + static Status Open( + const std::shared_ptr& buffer, std::unique_ptr* out); + + int num_dictionaries() const; + int num_record_batches() const; + MetadataVersion::type version() const; + + FileBlock record_batch(int i) const; + FileBlock dictionary(int i) const; + + Status GetSchema(std::shared_ptr* out) const; + + private: + FileFooter(); + class FileFooterImpl; + std::unique_ptr impl_; }; } // namespace ipc diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index f6582fc883bdc..7d02bc302f40e 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -25,21 +25,28 @@ #include #include "arrow/array.h" +#include "arrow/table.h" #include "arrow/test-util.h" #include "arrow/types/list.h" #include "arrow/types/primitive.h" +#include "arrow/types/string.h" +#include "arrow/types/struct.h" #include "arrow/util/buffer.h" #include "arrow/util/memory-pool.h" namespace arrow { namespace ipc { +const auto kInt32 = std::make_shared(); +const auto kListInt32 = std::make_shared(kInt32); +const auto kListListInt32 = std::make_shared(kListInt32); + Status MakeRandomInt32Array( int32_t length, bool include_nulls, MemoryPool* pool, std::shared_ptr* array) { std::shared_ptr data; test::MakeRandomInt32PoolBuffer(length, pool, &data); - const auto INT32 = std::make_shared(); - Int32Builder builder(pool, INT32); + const auto kInt32 = std::make_shared(); + Int32Builder builder(pool, kInt32); if (include_nulls) { std::shared_ptr valid_bytes; test::MakeRandomBytePoolBuffer(length, pool, &valid_bytes); @@ -87,6 +94,188 @@ Status MakeRandomListArray(const std::shared_ptr& child_array, int num_li return (*array)->Validate(); } +typedef Status MakeRecordBatch(std::shared_ptr* out); + +Status MakeIntRecordBatch(std::shared_ptr* out) { + const int length = 1000; + + // Make the schema + auto f0 = std::make_shared("f0", kInt32); + auto f1 = std::make_shared("f1", kInt32); + std::shared_ptr schema(new Schema({f0, f1})); + + // Example data + std::shared_ptr a0, a1; + MemoryPool* pool = default_memory_pool(); + 
RETURN_NOT_OK(MakeRandomInt32Array(length, false, pool, &a0)); + RETURN_NOT_OK(MakeRandomInt32Array(length, true, pool, &a1)); + out->reset(new RecordBatch(schema, length, {a0, a1})); + return Status::OK(); +} + +template +Status MakeRandomBinaryArray( + const TypePtr& type, int32_t length, MemoryPool* pool, ArrayPtr* array) { + const std::vector values = { + "", "", "abc", "123", "efg", "456!@#!@#", "12312"}; + Builder builder(pool, type); + const auto values_len = values.size(); + for (int32_t i = 0; i < length; ++i) { + int values_index = i % values_len; + if (values_index == 0) { + RETURN_NOT_OK(builder.AppendNull()); + } else { + const std::string& value = values[values_index]; + RETURN_NOT_OK( + builder.Append(reinterpret_cast(value.data()), value.size())); + } + } + *array = builder.Finish(); + return Status::OK(); +} + +Status MakeStringTypesRecordBatch(std::shared_ptr* out) { + const int32_t length = 500; + auto string_type = std::make_shared(); + auto binary_type = std::make_shared(); + auto f0 = std::make_shared("f0", string_type); + auto f1 = std::make_shared("f1", binary_type); + std::shared_ptr schema(new Schema({f0, f1})); + + std::shared_ptr a0, a1; + MemoryPool* pool = default_memory_pool(); + + { + auto status = + MakeRandomBinaryArray(string_type, length, pool, &a0); + RETURN_NOT_OK(status); + } + { + auto status = + MakeRandomBinaryArray(binary_type, length, pool, &a1); + RETURN_NOT_OK(status); + } + out->reset(new RecordBatch(schema, length, {a0, a1})); + return Status::OK(); +} + +Status MakeListRecordBatch(std::shared_ptr* out) { + // Make the schema + auto f0 = std::make_shared("f0", kListInt32); + auto f1 = std::make_shared("f1", kListListInt32); + auto f2 = std::make_shared("f2", kInt32); + std::shared_ptr schema(new Schema({f0, f1, f2})); + + // Example data + + MemoryPool* pool = default_memory_pool(); + const int length = 200; + std::shared_ptr leaf_values, list_array, list_list_array, flat_array; + const bool include_nulls = true; + RETURN_NOT_OK(MakeRandomInt32Array(1000, include_nulls, pool, &leaf_values)); + RETURN_NOT_OK( + MakeRandomListArray(leaf_values, length, include_nulls, pool, &list_array)); + RETURN_NOT_OK( + MakeRandomListArray(list_array, length, include_nulls, pool, &list_list_array)); + RETURN_NOT_OK(MakeRandomInt32Array(length, include_nulls, pool, &flat_array)); + out->reset(new RecordBatch(schema, length, {list_array, list_list_array, flat_array})); + return Status::OK(); +} + +Status MakeZeroLengthRecordBatch(std::shared_ptr* out) { + // Make the schema + auto f0 = std::make_shared("f0", kListInt32); + auto f1 = std::make_shared("f1", kListListInt32); + auto f2 = std::make_shared("f2", kInt32); + std::shared_ptr schema(new Schema({f0, f1, f2})); + + // Example data + MemoryPool* pool = default_memory_pool(); + const int length = 200; + const bool include_nulls = true; + std::shared_ptr leaf_values, list_array, list_list_array, flat_array; + RETURN_NOT_OK(MakeRandomInt32Array(0, include_nulls, pool, &leaf_values)); + RETURN_NOT_OK(MakeRandomListArray(leaf_values, 0, include_nulls, pool, &list_array)); + RETURN_NOT_OK( + MakeRandomListArray(list_array, 0, include_nulls, pool, &list_list_array)); + RETURN_NOT_OK(MakeRandomInt32Array(0, include_nulls, pool, &flat_array)); + out->reset(new RecordBatch(schema, length, {list_array, list_list_array, flat_array})); + return Status::OK(); +} + +Status MakeNonNullRecordBatch(std::shared_ptr* out) { + // Make the schema + auto f0 = std::make_shared("f0", kListInt32); + auto f1 = std::make_shared("f1", 
kListListInt32); + auto f2 = std::make_shared("f2", kInt32); + std::shared_ptr schema(new Schema({f0, f1, f2})); + + // Example data + MemoryPool* pool = default_memory_pool(); + const int length = 50; + std::shared_ptr leaf_values, list_array, list_list_array, flat_array; + + RETURN_NOT_OK(MakeRandomInt32Array(1000, true, pool, &leaf_values)); + bool include_nulls = false; + RETURN_NOT_OK( + MakeRandomListArray(leaf_values, length, include_nulls, pool, &list_array)); + RETURN_NOT_OK( + MakeRandomListArray(list_array, length, include_nulls, pool, &list_list_array)); + RETURN_NOT_OK(MakeRandomInt32Array(length, include_nulls, pool, &flat_array)); + out->reset(new RecordBatch(schema, length, {list_array, list_list_array, flat_array})); + return Status::OK(); +} + +Status MakeDeeplyNestedList(std::shared_ptr* out) { + const int batch_length = 5; + TypePtr type = kInt32; + + MemoryPool* pool = default_memory_pool(); + ArrayPtr array; + const bool include_nulls = true; + RETURN_NOT_OK(MakeRandomInt32Array(1000, include_nulls, pool, &array)); + for (int i = 0; i < 63; ++i) { + type = std::static_pointer_cast(std::make_shared(type)); + RETURN_NOT_OK(MakeRandomListArray(array, batch_length, include_nulls, pool, &array)); + } + + auto f0 = std::make_shared("f0", type); + std::shared_ptr schema(new Schema({f0})); + std::vector arrays = {array}; + out->reset(new RecordBatch(schema, batch_length, arrays)); + return Status::OK(); +} + +Status MakeStruct(std::shared_ptr* out) { + // reuse constructed list columns + std::shared_ptr list_batch; + RETURN_NOT_OK(MakeListRecordBatch(&list_batch)); + std::vector columns = { + list_batch->column(0), list_batch->column(1), list_batch->column(2)}; + auto list_schema = list_batch->schema(); + + // Define schema + std::shared_ptr type(new StructType( + {list_schema->field(0), list_schema->field(1), list_schema->field(2)})); + auto f0 = std::make_shared("non_null_struct", type); + auto f1 = std::make_shared("null_struct", type); + std::shared_ptr schema(new Schema({f0, f1})); + + // construct individual nullable/non-nullable struct arrays + ArrayPtr no_nulls(new StructArray(type, list_batch->num_rows(), columns)); + std::vector null_bytes(list_batch->num_rows(), 1); + null_bytes[0] = 0; + std::shared_ptr null_bitmask; + RETURN_NOT_OK(util::bytes_to_bits(null_bytes, &null_bitmask)); + ArrayPtr with_nulls( + new StructArray(type, list_batch->num_rows(), columns, 1, null_bitmask)); + + // construct batch + std::vector arrays = {no_nulls, with_nulls}; + out->reset(new RecordBatch(schema, list_batch->num_rows(), arrays)); + return Status::OK(); +} + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/util.h b/cpp/src/arrow/ipc/util.h index 3f4001b21a91b..94079a3827777 100644 --- a/cpp/src/arrow/ipc/util.h +++ b/cpp/src/arrow/ipc/util.h @@ -27,6 +27,14 @@ namespace arrow { namespace ipc { +// Align on 8-byte boundaries +static constexpr int kArrowAlignment = 8; +static constexpr uint8_t kPaddingBytes[kArrowAlignment] = {0}; + +static inline int64_t PaddedLength(int64_t nbytes, int64_t alignment = kArrowAlignment) { + return ((nbytes + alignment - 1) / alignment) * alignment; +} + // A helper class that tracks the size of allocations class MockOutputStream : public io::OutputStream { public: diff --git a/cpp/src/arrow/parquet/reader.h b/cpp/src/arrow/parquet/reader.h index a9c64eca997b5..2689bebea30ef 100644 --- a/cpp/src/arrow/parquet/reader.h +++ b/cpp/src/arrow/parquet/reader.h @@ -31,7 +31,7 @@ namespace arrow { class Array; class MemoryPool; -class
RowBatch; +class RecordBatch; class Status; class Table; diff --git a/cpp/src/arrow/parquet/writer.h b/cpp/src/arrow/parquet/writer.h index 5aa1ba587176a..ecc6a9f8be3de 100644 --- a/cpp/src/arrow/parquet/writer.h +++ b/cpp/src/arrow/parquet/writer.h @@ -30,7 +30,7 @@ namespace arrow { class Array; class MemoryPool; class PrimitiveArray; -class RowBatch; +class RecordBatch; class Status; class StringArray; class Table; diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc index d9573eae74ddd..3a250df81d0fb 100644 --- a/cpp/src/arrow/table.cc +++ b/cpp/src/arrow/table.cc @@ -27,11 +27,11 @@ namespace arrow { -RowBatch::RowBatch(const std::shared_ptr& schema, int num_rows, +RecordBatch::RecordBatch(const std::shared_ptr& schema, int num_rows, const std::vector>& columns) : schema_(schema), num_rows_(num_rows), columns_(columns) {} -const std::string& RowBatch::column_name(int i) const { +const std::string& RecordBatch::column_name(int i) const { return schema_->field(i)->name; } diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h index 2088fdf0b6415..36b3c8ecaf43f 100644 --- a/cpp/src/arrow/table.h +++ b/cpp/src/arrow/table.h @@ -32,15 +32,15 @@ class Column; class Schema; class Status; -// A row batch is a simpler and more rigid table data structure intended for +// A record batch is a simpler and more rigid table data structure intended for // use primarily in shared memory IPC. It contains a schema (metadata) and a -// corresponding vector of equal-length Arrow arrays -class ARROW_EXPORT RowBatch { +// corresponding sequence of equal-length Arrow arrays +class ARROW_EXPORT RecordBatch { public: - // num_rows is a parameter to allow for row batches of a particular size not + // num_rows is a parameter to allow for record batches of a particular size not // having any materialized columns. Each array should have the same length as // num_rows - RowBatch(const std::shared_ptr& schema, int num_rows, + RecordBatch(const std::shared_ptr& schema, int32_t num_rows, const std::vector>& columns); // @returns: the table's schema @@ -50,17 +50,19 @@ class ARROW_EXPORT RowBatch { // Note: Does not boundscheck const std::shared_ptr& column(int i) const { return columns_[i]; } + const std::vector>& columns() const { return columns_; } + const std::string& column_name(int i) const; // @returns: the number of columns in the table int num_columns() const { return columns_.size(); } // @returns: the number of rows (the corresponding length of each column) - int64_t num_rows() const { return num_rows_; } + int32_t num_rows() const { return num_rows_; } private: std::shared_ptr schema_; - int num_rows_; + int32_t num_rows_; std::vector> columns_; }; diff --git a/format/IPC.md b/format/IPC.md new file mode 100644 index 0000000000000..1f39e762ab70d --- /dev/null +++ b/format/IPC.md @@ -0,0 +1,3 @@ +# Interprocess messaging / communication (IPC) + +## File format diff --git a/format/README.md b/format/README.md index 3b0e50364d83c..78e15207ee95a 100644 --- a/format/README.md +++ b/format/README.md @@ -9,6 +9,7 @@ Currently, the Arrow specification consists of these pieces: - Metadata specification (see Metadata.md) - Physical memory layout specification (see Layout.md) - Metadata serialized representation (see Message.fbs) +- Mechanics of messaging between Arrow systems (IPC, RPC, etc.) 
(see IPC.md) The metadata currently uses Google's [flatbuffers library][1] for serializing a couple related pieces of information: From 32fd692f3aced29cc65a786d5ec63f8cd484853c Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 25 Sep 2016 19:28:26 -0400 Subject: [PATCH 0142/1644] ARROW-296: [Python / C++] Remove arrow::parquet, make pyarrow link against parquet_arrow This patch depends on PARQUET-728 (to run the full test suite, including pyarrow Parquet tests) Author: Wes McKinney Closes #145 from wesm/ARROW-296 and squashes the following commits: d67b4f9 [Wes McKinney] Refactor to link against parquet_arrow, fix up cmake files --- cpp/CMakeLists.txt | 18 - cpp/cmake_modules/FindParquet.cmake | 44 +- cpp/doc/Parquet.md | 15 +- cpp/src/arrow/parquet/CMakeLists.txt | 67 --- cpp/src/arrow/parquet/io.cc | 105 ---- cpp/src/arrow/parquet/io.h | 84 --- cpp/src/arrow/parquet/parquet-io-test.cc | 135 ----- .../parquet/parquet-reader-writer-test.cc | 499 ------------------ cpp/src/arrow/parquet/parquet-schema-test.cc | 261 --------- cpp/src/arrow/parquet/reader.cc | 401 -------------- cpp/src/arrow/parquet/reader.h | 146 ----- cpp/src/arrow/parquet/schema.cc | 344 ------------ cpp/src/arrow/parquet/schema.h | 53 -- cpp/src/arrow/parquet/test-util.h | 193 ------- cpp/src/arrow/parquet/utils.h | 52 -- cpp/src/arrow/parquet/writer.cc | 365 ------------- cpp/src/arrow/parquet/writer.h | 76 --- cpp/src/arrow/types/string.cc | 2 +- python/CMakeLists.txt | 14 +- python/cmake_modules/FindArrow.cmake | 22 - python/pyarrow/includes/parquet.pxd | 10 +- 21 files changed, 55 insertions(+), 2851 deletions(-) delete mode 100644 cpp/src/arrow/parquet/CMakeLists.txt delete mode 100644 cpp/src/arrow/parquet/io.cc delete mode 100644 cpp/src/arrow/parquet/io.h delete mode 100644 cpp/src/arrow/parquet/parquet-io-test.cc delete mode 100644 cpp/src/arrow/parquet/parquet-reader-writer-test.cc delete mode 100644 cpp/src/arrow/parquet/parquet-schema-test.cc delete mode 100644 cpp/src/arrow/parquet/reader.cc delete mode 100644 cpp/src/arrow/parquet/reader.h delete mode 100644 cpp/src/arrow/parquet/schema.cc delete mode 100644 cpp/src/arrow/parquet/schema.h delete mode 100644 cpp/src/arrow/parquet/test-util.h delete mode 100644 cpp/src/arrow/parquet/utils.h delete mode 100644 cpp/src/arrow/parquet/writer.cc delete mode 100644 cpp/src/arrow/parquet/writer.h diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index be95dabf31897..f3f4a7dac0100 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -52,10 +52,6 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") "Build the libarrow shared libraries" ON) - option(ARROW_PARQUET - "Build the Parquet adapter and link to libparquet" - OFF) - option(ARROW_TEST_MEMCHECK "Run the test suite using valgrind --tool=memcheck" OFF) @@ -702,20 +698,6 @@ add_subdirectory(src/arrow/io) add_subdirectory(src/arrow/util) add_subdirectory(src/arrow/types) -#---------------------------------------------------------------------- -# Parquet adapter library - -if(ARROW_PARQUET) - find_package(Parquet REQUIRED) - include_directories(SYSTEM ${PARQUET_INCLUDE_DIR}) - ADD_THIRDPARTY_LIB(parquet - STATIC_LIB ${PARQUET_STATIC_LIB} - SHARED_LIB ${PARQUET_SHARED_LIB}) - - add_subdirectory(src/arrow/parquet) - list(APPEND LINK_LIBS arrow_parquet parquet) -endif() - #---------------------------------------------------------------------- # IPC library diff --git a/cpp/cmake_modules/FindParquet.cmake b/cpp/cmake_modules/FindParquet.cmake index 36f4828a999d4..7445e0919acb6 100644 --- 
a/cpp/cmake_modules/FindParquet.cmake +++ b/cpp/cmake_modules/FindParquet.cmake @@ -29,15 +29,20 @@ endif() # Try the parameterized roots, if they exist if ( _parquet_roots ) - find_path( PARQUET_INCLUDE_DIR NAMES parquet/api/reader.h - PATHS ${_parquet_roots} NO_DEFAULT_PATH - PATH_SUFFIXES "include" ) - find_library( PARQUET_LIBRARIES NAMES parquet - PATHS ${_parquet_roots} NO_DEFAULT_PATH - PATH_SUFFIXES "lib" ) + find_path( PARQUET_INCLUDE_DIR NAMES parquet/api/reader.h + PATHS ${_parquet_roots} NO_DEFAULT_PATH + PATH_SUFFIXES "include" ) + find_library( PARQUET_LIBRARIES NAMES parquet + PATHS ${_parquet_roots} NO_DEFAULT_PATH + PATH_SUFFIXES "lib" ) + + find_library(PARQUET_ARROW_LIBRARIES NAMES parquet_arrow + PATHS ${_parquet_roots} NO_DEFAULT_PATH + PATH_SUFFIXES "lib") else () - find_path( PARQUET_INCLUDE_DIR NAMES parquet/api/reader.h ) - find_library( PARQUET_LIBRARIES NAMES parquet ) + find_path(PARQUET_INCLUDE_DIR NAMES parquet/api/reader.h ) + find_library(PARQUET_LIBRARIES NAMES parquet) + find_library(PARQUET_ARROW_LIBRARIES NAMES parquet_arrow) endif () @@ -51,6 +56,18 @@ else () set(PARQUET_FOUND FALSE) endif () +if (PARQUET_INCLUDE_DIR AND PARQUET_ARROW_LIBRARIES) + set(PARQUET_ARROW_FOUND TRUE) + get_filename_component(PARQUET_ARROW_LIBS ${PARQUET_ARROW_LIBRARIES} PATH) + set(PARQUET_ARROW_LIB_NAME libparquet_arrow) + set(PARQUET_ARROW_STATIC_LIB + ${PARQUET_ARROW_LIBS}/${PARQUET_ARROW_LIB_NAME}.a) + set(PARQUET_ARROW_SHARED_LIB + ${PARQUET_ARROW_LIBS}/${PARQUET_ARROW_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) +else () + set(PARQUET_ARROW_FOUND FALSE) +endif () + if (PARQUET_FOUND) if (NOT Parquet_FIND_QUIETLY) message(STATUS "Found the Parquet library: ${PARQUET_LIBRARIES}") @@ -71,6 +88,12 @@ else () endif () endif () +if (PARQUET_ARROW_FOUND) + if (NOT Parquet_FIND_QUIETLY) + message(STATUS "Found the Parquet Arrow library: ${PARQUET_ARROW_LIBS}") + endif() +endif() + mark_as_advanced( PARQUET_FOUND PARQUET_INCLUDE_DIR @@ -78,4 +101,9 @@ mark_as_advanced( PARQUET_LIBRARIES PARQUET_STATIC_LIB PARQUET_SHARED_LIB + + PARQUET_ARROW_FOUND + PARQUET_ARROW_LIBS + PARQUET_ARROW_STATIC_LIB + PARQUET_ARROW_SHARED_LIB ) diff --git a/cpp/doc/Parquet.md b/cpp/doc/Parquet.md index 370ac833388fc..96471d94835f3 100644 --- a/cpp/doc/Parquet.md +++ b/cpp/doc/Parquet.md @@ -1,24 +1,19 @@ ## Building Arrow-Parquet integration -To build the Arrow C++'s Parquet adapter library, you must first build [parquet-cpp][1]: +To use Arrow C++ with Parquet, you must first build the Arrow C++ libraries and +install them someplace. Then, you can build [parquet-cpp][1] with the Arrow +adapter library: ```bash # Set this to your preferred install location -export PARQUET_HOME=$HOME/local +export ARROW_HOME=$HOME/local git clone https://github.com/apache/parquet-cpp.git cd parquet-cpp source setup_build_env.sh -cmake -DCMAKE_INSTALL_PREFIX=$PARQUET_HOME +cmake -DCMAKE_INSTALL_PREFIX=$PARQUET_HOME -DPARQUET_ARROW=on make -j4 make install ``` -Make sure that `$PARQUET_HOME` is set to the installation location. Now, build -Arrow with the Parquet adapter enabled: - -```bash -cmake -DARROW_PARQUET=ON -``` - [1]: https://github.com/apache/parquet-cpp \ No newline at end of file diff --git a/cpp/src/arrow/parquet/CMakeLists.txt b/cpp/src/arrow/parquet/CMakeLists.txt deleted file mode 100644 index c400e14ea47f7..0000000000000 --- a/cpp/src/arrow/parquet/CMakeLists.txt +++ /dev/null @@ -1,67 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. 
See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - -# ---------------------------------------------------------------------- -# arrow_parquet : Arrow <-> Parquet adapter - -set(PARQUET_SRCS - io.cc - reader.cc - schema.cc - writer.cc -) - -set(PARQUET_LIBS - arrow_shared - arrow_io - parquet_shared -) - -add_library(arrow_parquet SHARED - ${PARQUET_SRCS} -) -target_link_libraries(arrow_parquet ${PARQUET_LIBS}) -SET_TARGET_PROPERTIES(arrow_parquet PROPERTIES LINKER_LANGUAGE CXX) - -if (APPLE) - set_target_properties(arrow_parquet - PROPERTIES - BUILD_WITH_INSTALL_RPATH ON - INSTALL_NAME_DIR "@rpath") -endif() - -ADD_ARROW_TEST(parquet-schema-test) -ARROW_TEST_LINK_LIBRARIES(parquet-schema-test arrow_parquet) - -ADD_ARROW_TEST(parquet-io-test) -ARROW_TEST_LINK_LIBRARIES(parquet-io-test arrow_parquet) - -ADD_ARROW_TEST(parquet-reader-writer-test) -ARROW_TEST_LINK_LIBRARIES(parquet-reader-writer-test arrow_parquet) - -# Headers: top level -install(FILES - io.h - reader.h - schema.h - utils.h - writer.h - DESTINATION include/arrow/parquet) - -install(TARGETS arrow_parquet - LIBRARY DESTINATION lib - ARCHIVE DESTINATION lib) diff --git a/cpp/src/arrow/parquet/io.cc b/cpp/src/arrow/parquet/io.cc deleted file mode 100644 index a50d753f3054e..0000000000000 --- a/cpp/src/arrow/parquet/io.cc +++ /dev/null @@ -1,105 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
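The deleted io.cc that follows bridged two error-handling conventions: Arrow reports failures as Status values, while parquet-cpp throws ParquetException, so calls crossing the boundary went through PARQUET_THROW_NOT_OK. The shape of that bridge in miniature, using stand-in types rather than the real ones:

```cpp
#include <stdexcept>
#include <string>

// Stand-in for a Status-style result; the real code used arrow::Status.
struct Result {
  bool ok;
  std::string message;
};

// At the boundary of an exception-based API, a failed status becomes a
// throw -- the job PARQUET_THROW_NOT_OK did in the removed adapter.
inline void ThrowIfNotOK(const Result& r) {
  if (!r.ok) throw std::runtime_error(r.message);
}
```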
- -#include "arrow/parquet/io.h" - -#include -#include - -#include "parquet/api/io.h" - -#include "arrow/parquet/utils.h" -#include "arrow/util/memory-pool.h" -#include "arrow/util/status.h" - -// To assist with readability -using ArrowROFile = arrow::io::ReadableFileInterface; - -namespace arrow { -namespace parquet { - -// ---------------------------------------------------------------------- -// ParquetAllocator - -ParquetAllocator::ParquetAllocator() : pool_(default_memory_pool()) {} - -ParquetAllocator::ParquetAllocator(MemoryPool* pool) : pool_(pool) {} - -ParquetAllocator::~ParquetAllocator() {} - -uint8_t* ParquetAllocator::Malloc(int64_t size) { - uint8_t* result; - PARQUET_THROW_NOT_OK(pool_->Allocate(size, &result)); - return result; -} - -void ParquetAllocator::Free(uint8_t* buffer, int64_t size) { - // Does not report Status - pool_->Free(buffer, size); -} - -// ---------------------------------------------------------------------- -// ParquetReadSource - -ParquetReadSource::ParquetReadSource(ParquetAllocator* allocator) - : file_(nullptr), allocator_(allocator) {} - -Status ParquetReadSource::Open(const std::shared_ptr& file) { - int64_t file_size; - RETURN_NOT_OK(file->GetSize(&file_size)); - - file_ = file; - size_ = file_size; - return Status::OK(); -} - -void ParquetReadSource::Close() { - // TODO(wesm): Make this a no-op for now. This leaves Python wrappers for - // these classes in a borked state. Probably better to explicitly close. - - // PARQUET_THROW_NOT_OK(file_->Close()); -} - -int64_t ParquetReadSource::Tell() const { - int64_t position; - PARQUET_THROW_NOT_OK(file_->Tell(&position)); - return position; -} - -void ParquetReadSource::Seek(int64_t position) { - PARQUET_THROW_NOT_OK(file_->Seek(position)); -} - -int64_t ParquetReadSource::Read(int64_t nbytes, uint8_t* out) { - int64_t bytes_read; - PARQUET_THROW_NOT_OK(file_->Read(nbytes, &bytes_read, out)); - return bytes_read; -} - -std::shared_ptr<::parquet::Buffer> ParquetReadSource::Read(int64_t nbytes) { - // TODO(wesm): This code is duplicated from parquet/util/input.cc; suggests - // that there should be more code sharing amongst file-like sources - auto result = std::make_shared<::parquet::OwnedMutableBuffer>(0, allocator_); - result->Resize(nbytes); - - int64_t bytes_read = Read(nbytes, result->mutable_data()); - if (bytes_read < nbytes) { result->Resize(bytes_read); } - return result; -} - -} // namespace parquet -} // namespace arrow diff --git a/cpp/src/arrow/parquet/io.h b/cpp/src/arrow/parquet/io.h deleted file mode 100644 index 1734863acf1ea..0000000000000 --- a/cpp/src/arrow/parquet/io.h +++ /dev/null @@ -1,84 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
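ParquetReadSource, implemented above and declared in the io.h below, is an interface adapter: it wraps an Arrow readable file and re-exposes it through parquet-cpp's RandomAccessSource contract, keeping Seek/Tell/Read semantics intact. The same adapter shape reduced to a sketch, with invented minimal interfaces:

```cpp
#include <cstdint>
#include <memory>
#include <utility>

// Invented minimal positional-read interface standing in for the real one.
class ByteSource {
 public:
  virtual ~ByteSource() = default;
  virtual int64_t ReadAt(int64_t pos, int64_t nbytes, uint8_t* out) = 0;
};

// Adapter that layers a cursor on top, the way ParquetReadSource layered
// parquet-cpp's Seek/Tell/Read contract over an Arrow file.
class SeekableAdapter {
 public:
  explicit SeekableAdapter(std::shared_ptr<ByteSource> src)
      : src_(std::move(src)) {}

  void Seek(int64_t pos) { pos_ = pos; }
  int64_t Tell() const { return pos_; }
  int64_t Read(int64_t nbytes, uint8_t* out) {
    int64_t bytes_read = src_->ReadAt(pos_, nbytes, out);
    pos_ += bytes_read;
    return bytes_read;
  }

 private:
  std::shared_ptr<ByteSource> src_;
  int64_t pos_ = 0;
};
```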
- -// Bridges Arrow's IO interfaces and Parquet-cpp's IO interfaces - -#ifndef ARROW_PARQUET_IO_H -#define ARROW_PARQUET_IO_H - -#include -#include - -#include "parquet/api/io.h" - -#include "arrow/io/interfaces.h" -#include "arrow/util/visibility.h" - -namespace arrow { - -class MemoryPool; - -namespace parquet { - -// An implementation of the Parquet MemoryAllocator API that plugs into an -// existing Arrow memory pool. This way we can direct all allocations to a -// single place rather than tracking allocations in different locations (for -// example: without utilizing parquet-cpp's default allocator) -class ARROW_EXPORT ParquetAllocator : public ::parquet::MemoryAllocator { - public: - // Uses the default memory pool - ParquetAllocator(); - - explicit ParquetAllocator(MemoryPool* pool); - virtual ~ParquetAllocator(); - - uint8_t* Malloc(int64_t size) override; - void Free(uint8_t* buffer, int64_t size) override; - - void set_pool(MemoryPool* pool) { pool_ = pool; } - - MemoryPool* pool() const { return pool_; } - - private: - MemoryPool* pool_; -}; - -class ARROW_EXPORT ParquetReadSource : public ::parquet::RandomAccessSource { - public: - explicit ParquetReadSource(ParquetAllocator* allocator); - - // We need to ask for the file size on opening the file, and this can fail - Status Open(const std::shared_ptr& file); - - void Close() override; - int64_t Tell() const override; - void Seek(int64_t pos) override; - int64_t Read(int64_t nbytes, uint8_t* out) override; - std::shared_ptr<::parquet::Buffer> Read(int64_t nbytes) override; - - private: - // An Arrow readable file of some kind - std::shared_ptr file_; - - // The allocator is required for creating managed buffers - ParquetAllocator* allocator_; -}; - -} // namespace parquet -} // namespace arrow - -#endif // ARROW_PARQUET_IO_H diff --git a/cpp/src/arrow/parquet/parquet-io-test.cc b/cpp/src/arrow/parquet/parquet-io-test.cc deleted file mode 100644 index 208b3e867d374..0000000000000 --- a/cpp/src/arrow/parquet/parquet-io-test.cc +++ /dev/null @@ -1,135 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
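The deleted test below defines TrackingPool, a decorator that forwards to a real memory pool while keeping a net byte count, so a test can assert that everything allocated was freed. The same decorator idea in miniature, with invented types:

```cpp
#include <cstdint>

// Invented minimal allocator interface for the sketch.
class Allocator {
 public:
  virtual ~Allocator() = default;
  virtual void* Allocate(int64_t size) = 0;
  virtual void Free(void* ptr, int64_t size) = 0;
};

// Decorator: forwards to an inner allocator and tracks net outstanding
// bytes, the same accounting TrackingPool layers over the default pool.
class TrackingAllocator : public Allocator {
 public:
  explicit TrackingAllocator(Allocator* inner) : inner_(inner) {}

  void* Allocate(int64_t size) override {
    void* ptr = inner_->Allocate(size);
    outstanding_ += size;
    return ptr;
  }
  void Free(void* ptr, int64_t size) override {
    inner_->Free(ptr, size);
    outstanding_ -= size;
  }
  int64_t outstanding_bytes() const { return outstanding_; }

 private:
  Allocator* inner_;
  int64_t outstanding_ = 0;
};
```

A leak test then reduces to wrapping the pool under test and asserting outstanding_bytes() returns to zero, exactly as the deleted CustomPool test did with bytes_allocated().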
- -#include -#include -#include -#include - -#include "gtest/gtest.h" - -#include "arrow/io/memory.h" -#include "arrow/parquet/io.h" -#include "arrow/test-util.h" -#include "arrow/util/memory-pool.h" -#include "arrow/util/status.h" - -#include "parquet/api/io.h" - -namespace arrow { -namespace parquet { - -// Allocator tests - -TEST(TestParquetAllocator, DefaultCtor) { - ParquetAllocator allocator; - - const int buffer_size = 10; - - uint8_t* buffer = nullptr; - ASSERT_NO_THROW(buffer = allocator.Malloc(buffer_size);); - - // valgrind will complain if we write into nullptr - memset(buffer, 0, buffer_size); - - allocator.Free(buffer, buffer_size); -} - -// Pass through to the default memory pool -class TrackingPool : public MemoryPool { - public: - TrackingPool() : pool_(default_memory_pool()), bytes_allocated_(0) {} - - Status Allocate(int64_t size, uint8_t** out) override { - RETURN_NOT_OK(pool_->Allocate(size, out)); - bytes_allocated_ += size; - return Status::OK(); - } - - void Free(uint8_t* buffer, int64_t size) override { - pool_->Free(buffer, size); - bytes_allocated_ -= size; - } - - int64_t bytes_allocated() const override { return bytes_allocated_; } - - private: - MemoryPool* pool_; - int64_t bytes_allocated_; -}; - -TEST(TestParquetAllocator, CustomPool) { - TrackingPool pool; - - ParquetAllocator allocator(&pool); - - ASSERT_EQ(&pool, allocator.pool()); - - const int buffer_size = 10; - - uint8_t* buffer = nullptr; - ASSERT_NO_THROW(buffer = allocator.Malloc(buffer_size);); - - ASSERT_EQ(buffer_size, pool.bytes_allocated()); - - // valgrind will complain if we write into nullptr - memset(buffer, 0, buffer_size); - - allocator.Free(buffer, buffer_size); - - ASSERT_EQ(0, pool.bytes_allocated()); -} - -// ---------------------------------------------------------------------- -// Read source tests - -TEST(TestParquetReadSource, Basics) { - std::string data = "this is the data"; - auto data_buffer = reinterpret_cast(data.c_str()); - - ParquetAllocator allocator(default_memory_pool()); - - auto file = std::make_shared(data_buffer, data.size()); - auto source = std::make_shared(&allocator); - - ASSERT_OK(source->Open(file)); - - ASSERT_EQ(0, source->Tell()); - ASSERT_NO_THROW(source->Seek(5)); - ASSERT_EQ(5, source->Tell()); - ASSERT_NO_THROW(source->Seek(0)); - - // Seek out of bounds - ASSERT_THROW(source->Seek(100), ::parquet::ParquetException); - - uint8_t buffer[50]; - - ASSERT_NO_THROW(source->Read(4, buffer)); - ASSERT_EQ(0, std::memcmp(buffer, "this", 4)); - ASSERT_EQ(4, source->Tell()); - - std::shared_ptr<::parquet::Buffer> pq_buffer; - - ASSERT_NO_THROW(pq_buffer = source->Read(7)); - - auto expected_buffer = std::make_shared<::parquet::Buffer>(data_buffer + 4, 7); - - ASSERT_TRUE(expected_buffer->Equals(*pq_buffer.get())); -} - -} // namespace parquet -} // namespace arrow diff --git a/cpp/src/arrow/parquet/parquet-reader-writer-test.cc b/cpp/src/arrow/parquet/parquet-reader-writer-test.cc deleted file mode 100644 index d7b39dda377d3..0000000000000 --- a/cpp/src/arrow/parquet/parquet-reader-writer-test.cc +++ /dev/null @@ -1,499 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. 
You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#include "gtest/gtest.h" - -#include "arrow/test-util.h" -#include "arrow/parquet/test-util.h" -#include "arrow/parquet/reader.h" -#include "arrow/parquet/writer.h" -#include "arrow/types/construct.h" -#include "arrow/types/primitive.h" -#include "arrow/types/string.h" -#include "arrow/util/memory-pool.h" -#include "arrow/util/status.h" - -#include "parquet/api/reader.h" -#include "parquet/api/writer.h" - -using ParquetBuffer = parquet::Buffer; -using parquet::BufferReader; -using parquet::default_writer_properties; -using parquet::InMemoryOutputStream; -using parquet::LogicalType; -using parquet::ParquetFileReader; -using parquet::ParquetFileWriter; -using parquet::RandomAccessSource; -using parquet::Repetition; -using parquet::SchemaDescriptor; -using parquet::ParquetVersion; -using ParquetType = parquet::Type; -using parquet::schema::GroupNode; -using parquet::schema::NodePtr; -using parquet::schema::PrimitiveNode; - -namespace arrow { - -namespace parquet { - -const int SMALL_SIZE = 100; -const int LARGE_SIZE = 10000; - -template -struct test_traits {}; - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::BOOLEAN; - static constexpr LogicalType::type logical_enum = LogicalType::NONE; - static uint8_t const value; -}; - -const uint8_t test_traits::value(1); - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::INT32; - static constexpr LogicalType::type logical_enum = LogicalType::UINT_8; - static uint8_t const value; -}; - -const uint8_t test_traits::value(64); - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::INT32; - static constexpr LogicalType::type logical_enum = LogicalType::INT_8; - static int8_t const value; -}; - -const int8_t test_traits::value(-64); - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::INT32; - static constexpr LogicalType::type logical_enum = LogicalType::UINT_16; - static uint16_t const value; -}; - -const uint16_t test_traits::value(1024); - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::INT32; - static constexpr LogicalType::type logical_enum = LogicalType::INT_16; - static int16_t const value; -}; - -const int16_t test_traits::value(-1024); - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::INT32; - static constexpr LogicalType::type logical_enum = LogicalType::UINT_32; - static uint32_t const value; -}; - -const uint32_t test_traits::value(1024); - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::INT32; - static constexpr LogicalType::type logical_enum = LogicalType::NONE; - static int32_t const value; -}; - -const int32_t test_traits::value(-1024); - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::INT64; - static constexpr LogicalType::type logical_enum = LogicalType::UINT_64; - static uint64_t const value; -}; - -const uint64_t 
test_traits::value(1024); - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::INT64; - static constexpr LogicalType::type logical_enum = LogicalType::NONE; - static int64_t const value; -}; - -const int64_t test_traits::value(-1024); - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::INT64; - static constexpr LogicalType::type logical_enum = LogicalType::TIMESTAMP_MILLIS; - static int64_t const value; -}; - -const int64_t test_traits::value(14695634030000); - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::FLOAT; - static constexpr LogicalType::type logical_enum = LogicalType::NONE; - static float const value; -}; - -const float test_traits::value(2.1f); - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::DOUBLE; - static constexpr LogicalType::type logical_enum = LogicalType::NONE; - static double const value; -}; - -const double test_traits::value(4.2); - -template <> -struct test_traits { - static constexpr ParquetType::type parquet_enum = ParquetType::BYTE_ARRAY; - static constexpr LogicalType::type logical_enum = LogicalType::UTF8; - static std::string const value; -}; - -const std::string test_traits::value("Test"); - -template -using ParquetDataType = ::parquet::DataType::parquet_enum>; - -template -using ParquetWriter = ::parquet::TypedColumnWriter>; - -template -class TestParquetIO : public ::testing::Test { - public: - virtual void SetUp() {} - - std::shared_ptr MakeSchema(Repetition::type repetition) { - auto pnode = PrimitiveNode::Make("column1", repetition, - test_traits::parquet_enum, test_traits::logical_enum); - NodePtr node_ = - GroupNode::Make("schema", Repetition::REQUIRED, std::vector({pnode})); - return std::static_pointer_cast(node_); - } - - std::unique_ptr MakeWriter( - const std::shared_ptr& schema) { - sink_ = std::make_shared(); - return ParquetFileWriter::Open(sink_, schema); - } - - std::unique_ptr ReaderFromSink() { - std::shared_ptr buffer = sink_->GetBuffer(); - std::unique_ptr source(new BufferReader(buffer)); - return ParquetFileReader::Open(std::move(source)); - } - - void ReadSingleColumnFile( - std::unique_ptr file_reader, std::shared_ptr* out) { - arrow::parquet::FileReader reader(default_memory_pool(), std::move(file_reader)); - std::unique_ptr column_reader; - ASSERT_OK_NO_THROW(reader.GetFlatColumn(0, &column_reader)); - ASSERT_NE(nullptr, column_reader.get()); - - ASSERT_OK(column_reader->NextBatch(SMALL_SIZE, out)); - ASSERT_NE(nullptr, out->get()); - } - - void ReadAndCheckSingleColumnFile(Array* values) { - std::shared_ptr out; - ReadSingleColumnFile(ReaderFromSink(), &out); - ASSERT_TRUE(values->Equals(out)); - } - - void ReadTableFromFile( - std::unique_ptr file_reader, std::shared_ptr
<Table>* out) {
-    arrow::parquet::FileReader reader(default_memory_pool(), std::move(file_reader));
-    ASSERT_OK_NO_THROW(reader.ReadFlatTable(out));
-    ASSERT_NE(nullptr, out->get());
-  }
-
-  void ReadAndCheckSingleColumnTable(const std::shared_ptr<Array>& values) {
-    std::shared_ptr<Table> out;
-    ReadTableFromFile(ReaderFromSink(), &out);
-    ASSERT_EQ(1, out->num_columns());
-    ASSERT_EQ(values->length(), out->num_rows());
-
-    std::shared_ptr<ChunkedArray> chunked_array = out->column(0)->data();
-    ASSERT_EQ(1, chunked_array->num_chunks());
-    ASSERT_TRUE(values->Equals(chunked_array->chunk(0)));
-  }
-
-  template <typename ArrayType>
-  void WriteFlatColumn(const std::shared_ptr<GroupNode>& schema,
-      const std::shared_ptr<ArrayType>& values) {
-    FileWriter writer(default_memory_pool(), MakeWriter(schema));
-    ASSERT_OK_NO_THROW(writer.NewRowGroup(values->length()));
-    ASSERT_OK_NO_THROW(writer.WriteFlatColumnChunk(values.get()));
-    ASSERT_OK_NO_THROW(writer.Close());
-  }
-
-  std::shared_ptr<InMemoryOutputStream> sink_;
-};
-
-// We have separate tests for UInt32Type as this is currently the only type
-// where a roundtrip does not yield the identical Array structure. There we
-// write a UInt32 Array but receive an Int64 Array as a result when using
-// Parquet version 1.0.
-
-typedef ::testing::Types<BooleanType, UInt8Type, Int8Type, UInt16Type, Int16Type,
-    Int32Type, UInt64Type, Int64Type, TimestampType, FloatType, DoubleType,
-    StringType> TestTypes;
-
-TYPED_TEST_CASE(TestParquetIO, TestTypes);
-
-TYPED_TEST(TestParquetIO, SingleColumnRequiredWrite) {
-  auto values = NonNullArray<TypeParam>(SMALL_SIZE);
-
-  std::shared_ptr<GroupNode> schema = this->MakeSchema(Repetition::REQUIRED);
-  this->WriteFlatColumn(schema, values);
-
-  this->ReadAndCheckSingleColumnFile(values.get());
-}
-
-TYPED_TEST(TestParquetIO, SingleColumnTableRequiredWrite) {
-  auto values = NonNullArray<TypeParam>(SMALL_SIZE);
-  std::shared_ptr<Table> table = MakeSimpleTable(values, false);
-  this->sink_ = std::make_shared<InMemoryOutputStream>();
-  ASSERT_OK_NO_THROW(WriteFlatTable(table.get(), default_memory_pool(), this->sink_,
-      values->length(), default_writer_properties()));
-
-  std::shared_ptr<Table> out;
-  this->ReadTableFromFile(this->ReaderFromSink(), &out);
-  ASSERT_EQ(1, out->num_columns());
-  ASSERT_EQ(100, out->num_rows());
-
-  std::shared_ptr<ChunkedArray> chunked_array = out->column(0)->data();
-  ASSERT_EQ(1, chunked_array->num_chunks());
-  ASSERT_TRUE(values->Equals(chunked_array->chunk(0)));
-}
-
-TYPED_TEST(TestParquetIO, SingleColumnOptionalReadWrite) {
-  // This also tests max_definition_level = 1
-  auto values = NullableArray<TypeParam>(SMALL_SIZE, 10);
-
-  std::shared_ptr<GroupNode> schema = this->MakeSchema(Repetition::OPTIONAL);
-  this->WriteFlatColumn(schema, values);
-
-  this->ReadAndCheckSingleColumnFile(values.get());
-}
-
-TYPED_TEST(TestParquetIO, SingleColumnTableOptionalReadWrite) {
-  // This also tests max_definition_level = 1
-  std::shared_ptr<Array> values = NullableArray<TypeParam>(SMALL_SIZE, 10);
-  std::shared_ptr<Table>
table = MakeSimpleTable(values, true); - this->sink_ = std::make_shared(); - ASSERT_OK_NO_THROW(WriteFlatTable(table.get(), default_memory_pool(), this->sink_, - values->length(), default_writer_properties())); - - this->ReadAndCheckSingleColumnTable(values); -} - -TYPED_TEST(TestParquetIO, SingleColumnRequiredChunkedWrite) { - auto values = NonNullArray(SMALL_SIZE); - int64_t chunk_size = values->length() / 4; - - std::shared_ptr schema = this->MakeSchema(Repetition::REQUIRED); - FileWriter writer(default_memory_pool(), this->MakeWriter(schema)); - for (int i = 0; i < 4; i++) { - ASSERT_OK_NO_THROW(writer.NewRowGroup(chunk_size)); - ASSERT_OK_NO_THROW( - writer.WriteFlatColumnChunk(values.get(), i * chunk_size, chunk_size)); - } - ASSERT_OK_NO_THROW(writer.Close()); - - this->ReadAndCheckSingleColumnFile(values.get()); -} - -TYPED_TEST(TestParquetIO, SingleColumnTableRequiredChunkedWrite) { - auto values = NonNullArray(LARGE_SIZE); - std::shared_ptr
table = MakeSimpleTable(values, false); - this->sink_ = std::make_shared(); - ASSERT_OK_NO_THROW(WriteFlatTable( - table.get(), default_memory_pool(), this->sink_, 512, default_writer_properties())); - - this->ReadAndCheckSingleColumnTable(values); -} - -TYPED_TEST(TestParquetIO, SingleColumnOptionalChunkedWrite) { - int64_t chunk_size = SMALL_SIZE / 4; - auto values = NullableArray(SMALL_SIZE, 10); - - std::shared_ptr schema = this->MakeSchema(Repetition::OPTIONAL); - FileWriter writer(default_memory_pool(), this->MakeWriter(schema)); - for (int i = 0; i < 4; i++) { - ASSERT_OK_NO_THROW(writer.NewRowGroup(chunk_size)); - ASSERT_OK_NO_THROW( - writer.WriteFlatColumnChunk(values.get(), i * chunk_size, chunk_size)); - } - ASSERT_OK_NO_THROW(writer.Close()); - - this->ReadAndCheckSingleColumnFile(values.get()); -} - -TYPED_TEST(TestParquetIO, SingleColumnTableOptionalChunkedWrite) { - // This also tests max_definition_level = 1 - auto values = NullableArray(LARGE_SIZE, 100); - std::shared_ptr
<Table> table = MakeSimpleTable(values, true);
-  this->sink_ = std::make_shared<InMemoryOutputStream>();
-  ASSERT_OK_NO_THROW(WriteFlatTable(
-      table.get(), default_memory_pool(), this->sink_, 512, default_writer_properties()));
-
-  this->ReadAndCheckSingleColumnTable(values);
-}
-
-using TestUInt32ParquetIO = TestParquetIO<UInt32Type>;
-
-TEST_F(TestUInt32ParquetIO, Parquet_2_0_Compatibility) {
-  // This also tests max_definition_level = 1
-  std::shared_ptr<Array> values = NullableArray<UInt32Type>(LARGE_SIZE, 100);
-  std::shared_ptr<Table> table = MakeSimpleTable(values, true);
-
-  // Parquet 2.0 roundtrip should yield a uint32_t column again
-  this->sink_ = std::make_shared<InMemoryOutputStream>();
-  std::shared_ptr<::parquet::WriterProperties> properties =
-      ::parquet::WriterProperties::Builder()
-          .version(ParquetVersion::PARQUET_2_0)
-          ->build();
-  ASSERT_OK_NO_THROW(
-      WriteFlatTable(table.get(), default_memory_pool(), this->sink_, 512, properties));
-  this->ReadAndCheckSingleColumnTable(values);
-}
-
-TEST_F(TestUInt32ParquetIO, Parquet_1_0_Compatibility) {
-  // This also tests max_definition_level = 1
-  std::shared_ptr<Array> values = NullableArray<UInt32Type>(LARGE_SIZE, 100);
-  std::shared_ptr<Table>
table = MakeSimpleTable(values, true); - - // Parquet 1.0 returns an int64_t column as there is no way to tell a Parquet 1.0 - // reader that a column is unsigned. - this->sink_ = std::make_shared(); - std::shared_ptr<::parquet::WriterProperties> properties = - ::parquet::WriterProperties::Builder() - .version(ParquetVersion::PARQUET_1_0) - ->build(); - ASSERT_OK_NO_THROW( - WriteFlatTable(table.get(), default_memory_pool(), this->sink_, 512, properties)); - - std::shared_ptr expected_values; - std::shared_ptr int64_data = - std::make_shared(default_memory_pool()); - { - ASSERT_OK(int64_data->Resize(sizeof(int64_t) * values->length())); - int64_t* int64_data_ptr = reinterpret_cast(int64_data->mutable_data()); - const uint32_t* uint32_data_ptr = - reinterpret_cast(values->data()->data()); - // std::copy might be faster but this is explicit on the casts) - for (int64_t i = 0; i < values->length(); i++) { - int64_data_ptr[i] = static_cast(uint32_data_ptr[i]); - } - } - ASSERT_OK(MakePrimitiveArray(std::make_shared(), values->length(), - int64_data, values->null_count(), values->null_bitmap(), &expected_values)); - this->ReadAndCheckSingleColumnTable(expected_values); -} - -template -using ParquetCDataType = typename ParquetDataType::c_type; - -template -class TestPrimitiveParquetIO : public TestParquetIO { - public: - typedef typename TestType::c_type T; - - void MakeTestFile(std::vector& values, int num_chunks, - std::unique_ptr* file_reader) { - std::shared_ptr schema = this->MakeSchema(Repetition::REQUIRED); - std::unique_ptr file_writer = this->MakeWriter(schema); - size_t chunk_size = values.size() / num_chunks; - // Convert to Parquet's expected physical type - std::vector values_buffer( - sizeof(ParquetCDataType) * values.size()); - auto values_parquet = - reinterpret_cast*>(values_buffer.data()); - std::copy(values.cbegin(), values.cend(), values_parquet); - for (int i = 0; i < num_chunks; i++) { - auto row_group_writer = file_writer->AppendRowGroup(chunk_size); - auto column_writer = - static_cast*>(row_group_writer->NextColumn()); - ParquetCDataType* data = values_parquet + i * chunk_size; - column_writer->WriteBatch(chunk_size, nullptr, nullptr, data); - column_writer->Close(); - row_group_writer->Close(); - } - file_writer->Close(); - *file_reader = this->ReaderFromSink(); - } - - void CheckSingleColumnRequiredTableRead(int num_chunks) { - std::vector values(SMALL_SIZE, test_traits::value); - std::unique_ptr file_reader; - ASSERT_NO_THROW(MakeTestFile(values, num_chunks, &file_reader)); - - std::shared_ptr
out; - this->ReadTableFromFile(std::move(file_reader), &out); - ASSERT_EQ(1, out->num_columns()); - ASSERT_EQ(SMALL_SIZE, out->num_rows()); - - std::shared_ptr chunked_array = out->column(0)->data(); - ASSERT_EQ(1, chunked_array->num_chunks()); - ExpectArray(values.data(), chunked_array->chunk(0).get()); - } - - void CheckSingleColumnRequiredRead(int num_chunks) { - std::vector values(SMALL_SIZE, test_traits::value); - std::unique_ptr file_reader; - ASSERT_NO_THROW(MakeTestFile(values, num_chunks, &file_reader)); - - std::shared_ptr out; - this->ReadSingleColumnFile(std::move(file_reader), &out); - - ExpectArray(values.data(), out.get()); - } -}; - -typedef ::testing::Types PrimitiveTestTypes; - -TYPED_TEST_CASE(TestPrimitiveParquetIO, PrimitiveTestTypes); - -TYPED_TEST(TestPrimitiveParquetIO, SingleColumnRequiredRead) { - this->CheckSingleColumnRequiredRead(1); -} - -TYPED_TEST(TestPrimitiveParquetIO, SingleColumnRequiredTableRead) { - this->CheckSingleColumnRequiredTableRead(1); -} - -TYPED_TEST(TestPrimitiveParquetIO, SingleColumnRequiredChunkedRead) { - this->CheckSingleColumnRequiredRead(4); -} - -TYPED_TEST(TestPrimitiveParquetIO, SingleColumnRequiredChunkedTableRead) { - this->CheckSingleColumnRequiredTableRead(4); -} - -} // namespace parquet - -} // namespace arrow diff --git a/cpp/src/arrow/parquet/parquet-schema-test.cc b/cpp/src/arrow/parquet/parquet-schema-test.cc deleted file mode 100644 index 63ad8fba46517..0000000000000 --- a/cpp/src/arrow/parquet/parquet-schema-test.cc +++ /dev/null @@ -1,261 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
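The typed tests deleted above all follow one write-then-read shape: serialize an Arrow table into an in-memory Parquet file, materialize it back, and compare. A condensed sketch of that pattern, assuming only the names visible in this diff (WriteFlatTable, InMemoryOutputStream, BufferReader, FileReader); the function name is illustrative, and this mirrors the deleted test fixture rather than a maintained API:

// Sketch: in-memory write/read roundtrip in the style of the fixture above.
Status RoundTripExample(
    const std::shared_ptr<Table>& table, std::shared_ptr<Table>* out) {
  // Write the table into an in-memory Parquet file, one row group per 512 rows.
  auto sink = std::make_shared<InMemoryOutputStream>();
  RETURN_NOT_OK(WriteFlatTable(
      table.get(), default_memory_pool(), sink, 512, default_writer_properties()));

  // Wrap the written buffer as a Parquet input source and read it back.
  std::unique_ptr<RandomAccessSource> source(new BufferReader(sink->GetBuffer()));
  FileReader reader(default_memory_pool(), ParquetFileReader::Open(std::move(source)));
  return reader.ReadFlatTable(out);
}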
- -#include -#include - -#include "gtest/gtest.h" - -#include "arrow/test-util.h" -#include "arrow/type.h" -#include "arrow/types/datetime.h" -#include "arrow/types/decimal.h" -#include "arrow/util/status.h" - -#include "arrow/parquet/schema.h" - -using ParquetType = parquet::Type; -using parquet::LogicalType; -using parquet::Repetition; -using parquet::schema::NodePtr; -using parquet::schema::GroupNode; -using parquet::schema::PrimitiveNode; - -namespace arrow { - -namespace parquet { - -const auto BOOL = std::make_shared(); -const auto UINT8 = std::make_shared(); -const auto INT32 = std::make_shared(); -const auto INT64 = std::make_shared(); -const auto FLOAT = std::make_shared(); -const auto DOUBLE = std::make_shared(); -const auto UTF8 = std::make_shared(); -const auto TIMESTAMP_MS = std::make_shared(TimestampType::Unit::MILLI); -// TODO: This requires parquet-cpp implementing the MICROS enum value -// const auto TIMESTAMP_US = std::make_shared(TimestampType::Unit::MICRO); -const auto BINARY = std::make_shared(std::make_shared("", UINT8)); -const auto DECIMAL_8_4 = std::make_shared(8, 4); - -class TestConvertParquetSchema : public ::testing::Test { - public: - virtual void SetUp() {} - - void CheckFlatSchema(const std::shared_ptr& expected_schema) { - ASSERT_EQ(expected_schema->num_fields(), result_schema_->num_fields()); - for (int i = 0; i < expected_schema->num_fields(); ++i) { - auto lhs = result_schema_->field(i); - auto rhs = expected_schema->field(i); - EXPECT_TRUE(lhs->Equals(rhs)) << i << " " << lhs->ToString() - << " != " << rhs->ToString(); - } - } - - Status ConvertSchema(const std::vector& nodes) { - NodePtr schema = GroupNode::Make("schema", Repetition::REPEATED, nodes); - descr_.Init(schema); - return FromParquetSchema(&descr_, &result_schema_); - } - - protected: - ::parquet::SchemaDescriptor descr_; - std::shared_ptr result_schema_; -}; - -TEST_F(TestConvertParquetSchema, ParquetFlatPrimitives) { - std::vector parquet_fields; - std::vector> arrow_fields; - - parquet_fields.push_back( - PrimitiveNode::Make("boolean", Repetition::REQUIRED, ParquetType::BOOLEAN)); - arrow_fields.push_back(std::make_shared("boolean", BOOL, false)); - - parquet_fields.push_back( - PrimitiveNode::Make("int32", Repetition::REQUIRED, ParquetType::INT32)); - arrow_fields.push_back(std::make_shared("int32", INT32, false)); - - parquet_fields.push_back( - PrimitiveNode::Make("int64", Repetition::REQUIRED, ParquetType::INT64)); - arrow_fields.push_back(std::make_shared("int64", INT64, false)); - - parquet_fields.push_back(PrimitiveNode::Make("timestamp", Repetition::REQUIRED, - ParquetType::INT64, LogicalType::TIMESTAMP_MILLIS)); - arrow_fields.push_back(std::make_shared("timestamp", TIMESTAMP_MS, false)); - - // parquet_fields.push_back(PrimitiveNode::Make("timestamp", Repetition::REQUIRED, - // ParquetType::INT64, LogicalType::TIMESTAMP_MICROS)); - // arrow_fields.push_back(std::make_shared("timestamp", TIMESTAMP_US, false)); - - parquet_fields.push_back( - PrimitiveNode::Make("float", Repetition::OPTIONAL, ParquetType::FLOAT)); - arrow_fields.push_back(std::make_shared("float", FLOAT)); - - parquet_fields.push_back( - PrimitiveNode::Make("double", Repetition::OPTIONAL, ParquetType::DOUBLE)); - arrow_fields.push_back(std::make_shared("double", DOUBLE)); - - parquet_fields.push_back( - PrimitiveNode::Make("binary", Repetition::OPTIONAL, ParquetType::BYTE_ARRAY)); - arrow_fields.push_back(std::make_shared("binary", BINARY)); - - parquet_fields.push_back(PrimitiveNode::Make( - "string", 
Repetition::OPTIONAL, ParquetType::BYTE_ARRAY, LogicalType::UTF8)); - arrow_fields.push_back(std::make_shared("string", UTF8)); - - parquet_fields.push_back(PrimitiveNode::Make("flba-binary", Repetition::OPTIONAL, - ParquetType::FIXED_LEN_BYTE_ARRAY, LogicalType::NONE, 12)); - arrow_fields.push_back(std::make_shared("flba-binary", BINARY)); - - auto arrow_schema = std::make_shared(arrow_fields); - ASSERT_OK(ConvertSchema(parquet_fields)); - - CheckFlatSchema(arrow_schema); -} - -TEST_F(TestConvertParquetSchema, ParquetFlatDecimals) { - std::vector parquet_fields; - std::vector> arrow_fields; - - parquet_fields.push_back(PrimitiveNode::Make("flba-decimal", Repetition::OPTIONAL, - ParquetType::FIXED_LEN_BYTE_ARRAY, LogicalType::DECIMAL, 4, 8, 4)); - arrow_fields.push_back(std::make_shared("flba-decimal", DECIMAL_8_4)); - - parquet_fields.push_back(PrimitiveNode::Make("binary-decimal", Repetition::OPTIONAL, - ParquetType::BYTE_ARRAY, LogicalType::DECIMAL, -1, 8, 4)); - arrow_fields.push_back(std::make_shared("binary-decimal", DECIMAL_8_4)); - - parquet_fields.push_back(PrimitiveNode::Make("int32-decimal", Repetition::OPTIONAL, - ParquetType::INT32, LogicalType::DECIMAL, -1, 8, 4)); - arrow_fields.push_back(std::make_shared("int32-decimal", DECIMAL_8_4)); - - parquet_fields.push_back(PrimitiveNode::Make("int64-decimal", Repetition::OPTIONAL, - ParquetType::INT64, LogicalType::DECIMAL, -1, 8, 4)); - arrow_fields.push_back(std::make_shared("int64-decimal", DECIMAL_8_4)); - - auto arrow_schema = std::make_shared(arrow_fields); - ASSERT_OK(ConvertSchema(parquet_fields)); - - CheckFlatSchema(arrow_schema); -} - -TEST_F(TestConvertParquetSchema, UnsupportedThings) { - std::vector unsupported_nodes; - - unsupported_nodes.push_back( - PrimitiveNode::Make("int96", Repetition::REQUIRED, ParquetType::INT96)); - - unsupported_nodes.push_back( - GroupNode::Make("repeated-group", Repetition::REPEATED, {})); - - unsupported_nodes.push_back(PrimitiveNode::Make( - "int32", Repetition::OPTIONAL, ParquetType::INT32, LogicalType::DATE)); - - for (const NodePtr& node : unsupported_nodes) { - ASSERT_RAISES(NotImplemented, ConvertSchema({node})); - } -} - -class TestConvertArrowSchema : public ::testing::Test { - public: - virtual void SetUp() {} - - void CheckFlatSchema(const std::vector& nodes) { - NodePtr schema_node = GroupNode::Make("schema", Repetition::REPEATED, nodes); - const GroupNode* expected_schema_node = - static_cast(schema_node.get()); - const GroupNode* result_schema_node = result_schema_->group_node(); - - ASSERT_EQ(expected_schema_node->field_count(), result_schema_node->field_count()); - - for (int i = 0; i < expected_schema_node->field_count(); i++) { - auto lhs = result_schema_node->field(i); - auto rhs = expected_schema_node->field(i); - EXPECT_TRUE(lhs->Equals(rhs.get())); - } - } - - Status ConvertSchema(const std::vector>& fields) { - arrow_schema_ = std::make_shared(fields); - std::shared_ptr<::parquet::WriterProperties> properties = - ::parquet::default_writer_properties(); - return ToParquetSchema(arrow_schema_.get(), *properties.get(), &result_schema_); - } - - protected: - std::shared_ptr arrow_schema_; - std::shared_ptr<::parquet::SchemaDescriptor> result_schema_; -}; - -TEST_F(TestConvertArrowSchema, ParquetFlatPrimitives) { - std::vector parquet_fields; - std::vector> arrow_fields; - - parquet_fields.push_back( - PrimitiveNode::Make("boolean", Repetition::REQUIRED, ParquetType::BOOLEAN)); - arrow_fields.push_back(std::make_shared("boolean", BOOL, false)); - - 
parquet_fields.push_back( - PrimitiveNode::Make("int32", Repetition::REQUIRED, ParquetType::INT32)); - arrow_fields.push_back(std::make_shared("int32", INT32, false)); - - parquet_fields.push_back( - PrimitiveNode::Make("int64", Repetition::REQUIRED, ParquetType::INT64)); - arrow_fields.push_back(std::make_shared("int64", INT64, false)); - - parquet_fields.push_back(PrimitiveNode::Make("timestamp", Repetition::REQUIRED, - ParquetType::INT64, LogicalType::TIMESTAMP_MILLIS)); - arrow_fields.push_back(std::make_shared("timestamp", TIMESTAMP_MS, false)); - - // parquet_fields.push_back(PrimitiveNode::Make("timestamp", Repetition::REQUIRED, - // ParquetType::INT64, LogicalType::TIMESTAMP_MICROS)); - // arrow_fields.push_back(std::make_shared("timestamp", TIMESTAMP_US, false)); - - parquet_fields.push_back( - PrimitiveNode::Make("float", Repetition::OPTIONAL, ParquetType::FLOAT)); - arrow_fields.push_back(std::make_shared("float", FLOAT)); - - parquet_fields.push_back( - PrimitiveNode::Make("double", Repetition::OPTIONAL, ParquetType::DOUBLE)); - arrow_fields.push_back(std::make_shared("double", DOUBLE)); - - // TODO: String types need to be clarified a bit more in the Arrow spec - parquet_fields.push_back(PrimitiveNode::Make( - "string", Repetition::OPTIONAL, ParquetType::BYTE_ARRAY, LogicalType::UTF8)); - arrow_fields.push_back(std::make_shared("string", UTF8)); - - ASSERT_OK(ConvertSchema(arrow_fields)); - - CheckFlatSchema(parquet_fields); -} - -TEST_F(TestConvertArrowSchema, ParquetFlatDecimals) { - std::vector parquet_fields; - std::vector> arrow_fields; - - // TODO: Test Decimal Arrow -> Parquet conversion - - ASSERT_OK(ConvertSchema(arrow_fields)); - - CheckFlatSchema(parquet_fields); -} - -TEST(TestNodeConversion, DateAndTime) {} - -} // namespace parquet - -} // namespace arrow diff --git a/cpp/src/arrow/parquet/reader.cc b/cpp/src/arrow/parquet/reader.cc deleted file mode 100644 index 0c2fc6e8fc718..0000000000000 --- a/cpp/src/arrow/parquet/reader.cc +++ /dev/null @@ -1,401 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
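reader.cc, deleted below, implements the column-streaming read path. For context, a minimal sketch of how a caller drives it, assuming an already-opened Arrow-readable file; the function name is illustrative and error handling is abbreviated:

// Sketch: stream one column in fixed-size batches via the API defined below.
Status ScanFirstColumn(const std::shared_ptr<io::ReadableFileInterface>& file) {
  ParquetAllocator allocator(default_memory_pool());
  std::unique_ptr<FileReader> reader;
  RETURN_NOT_OK(OpenFile(file, &allocator, &reader));

  std::unique_ptr<FlatColumnReader> column;
  RETURN_NOT_OK(reader->GetFlatColumn(0, &column));

  std::shared_ptr<Array> batch;
  do {
    // Reads at most 1024 values; sets batch to nullptr once exhausted.
    RETURN_NOT_OK(column->NextBatch(1024, &batch));
  } while (batch != nullptr);
  return Status::OK();
}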
- -#include "arrow/parquet/reader.h" - -#include -#include -#include -#include - -#include "arrow/column.h" -#include "arrow/parquet/io.h" -#include "arrow/parquet/schema.h" -#include "arrow/parquet/utils.h" -#include "arrow/schema.h" -#include "arrow/table.h" -#include "arrow/types/primitive.h" -#include "arrow/types/string.h" -#include "arrow/util/status.h" - -using parquet::ColumnReader; -using parquet::Repetition; -using parquet::TypedColumnReader; - -// Help reduce verbosity -using ParquetRAS = parquet::RandomAccessSource; -using ParquetReader = parquet::ParquetFileReader; - -namespace arrow { -namespace parquet { - -template -struct ArrowTypeTraits { - typedef NumericBuilder builder_type; -}; - -template <> -struct ArrowTypeTraits { - typedef BooleanBuilder builder_type; -}; - -template -using BuilderType = typename ArrowTypeTraits::builder_type; - -class FileReader::Impl { - public: - Impl(MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileReader> reader); - virtual ~Impl() {} - - bool CheckForFlatColumn(const ::parquet::ColumnDescriptor* descr); - Status GetFlatColumn(int i, std::unique_ptr* out); - Status ReadFlatColumn(int i, std::shared_ptr* out); - Status ReadFlatTable(std::shared_ptr
* out); - - private: - MemoryPool* pool_; - std::unique_ptr<::parquet::ParquetFileReader> reader_; -}; - -class FlatColumnReader::Impl { - public: - Impl(MemoryPool* pool, const ::parquet::ColumnDescriptor* descr, - ::parquet::ParquetFileReader* reader, int column_index); - virtual ~Impl() {} - - Status NextBatch(int batch_size, std::shared_ptr* out); - template - Status TypedReadBatch(int batch_size, std::shared_ptr* out); - - template - Status ReadNullableFlatBatch(const int16_t* def_levels, - typename ParquetType::c_type* values, int64_t values_read, int64_t levels_read, - BuilderType* builder); - template - Status ReadNonNullableBatch(typename ParquetType::c_type* values, int64_t values_read, - BuilderType* builder); - - private: - void NextRowGroup(); - - template - struct can_copy_ptr { - static constexpr bool value = - std::is_same::value || - (std::is_integral{} && std::is_integral{} && - (sizeof(InType) == sizeof(OutType))); - }; - - template ::value>::type* = nullptr> - Status ConvertPhysicalType( - const InType* in_ptr, int64_t length, const OutType** out_ptr) { - *out_ptr = reinterpret_cast(in_ptr); - return Status::OK(); - } - - template ::value>::type* = nullptr> - Status ConvertPhysicalType( - const InType* in_ptr, int64_t length, const OutType** out_ptr) { - RETURN_NOT_OK(values_builder_buffer_.Resize(length * sizeof(OutType))); - OutType* mutable_out_ptr = - reinterpret_cast(values_builder_buffer_.mutable_data()); - std::copy(in_ptr, in_ptr + length, mutable_out_ptr); - *out_ptr = mutable_out_ptr; - return Status::OK(); - } - - MemoryPool* pool_; - const ::parquet::ColumnDescriptor* descr_; - ::parquet::ParquetFileReader* reader_; - int column_index_; - int next_row_group_; - std::shared_ptr column_reader_; - std::shared_ptr field_; - - PoolBuffer values_buffer_; - PoolBuffer def_levels_buffer_; - PoolBuffer values_builder_buffer_; - PoolBuffer valid_bytes_buffer_; -}; - -FileReader::Impl::Impl( - MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileReader> reader) - : pool_(pool), reader_(std::move(reader)) {} - -bool FileReader::Impl::CheckForFlatColumn(const ::parquet::ColumnDescriptor* descr) { - if ((descr->max_repetition_level() > 0) || (descr->max_definition_level() > 1)) { - return false; - } else if ((descr->max_definition_level() == 1) && - (descr->schema_node()->repetition() != Repetition::OPTIONAL)) { - return false; - } - return true; -} - -Status FileReader::Impl::GetFlatColumn(int i, std::unique_ptr* out) { - const ::parquet::SchemaDescriptor* schema = reader_->metadata()->schema(); - - if (!CheckForFlatColumn(schema->Column(i))) { - return Status::Invalid("The requested column is not flat"); - } - std::unique_ptr impl( - new FlatColumnReader::Impl(pool_, schema->Column(i), reader_.get(), i)); - *out = std::unique_ptr(new FlatColumnReader(std::move(impl))); - return Status::OK(); -} - -Status FileReader::Impl::ReadFlatColumn(int i, std::shared_ptr* out) { - std::unique_ptr flat_column_reader; - RETURN_NOT_OK(GetFlatColumn(i, &flat_column_reader)); - return flat_column_reader->NextBatch(reader_->metadata()->num_rows(), out); -} - -Status FileReader::Impl::ReadFlatTable(std::shared_ptr
* table) { - auto descr = reader_->metadata()->schema(); - - const std::string& name = descr->name(); - std::shared_ptr schema; - RETURN_NOT_OK(FromParquetSchema(descr, &schema)); - - int num_columns = reader_->metadata()->num_columns(); - - std::vector> columns(num_columns); - for (int i = 0; i < num_columns; i++) { - std::shared_ptr array; - RETURN_NOT_OK(ReadFlatColumn(i, &array)); - columns[i] = std::make_shared(schema->field(i), array); - } - - *table = std::make_shared
(name, schema, columns); - return Status::OK(); -} - -FileReader::FileReader( - MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileReader> reader) - : impl_(new FileReader::Impl(pool, std::move(reader))) {} - -FileReader::~FileReader() {} - -// Static ctor -Status OpenFile(const std::shared_ptr& file, - ParquetAllocator* allocator, std::unique_ptr* reader) { - std::unique_ptr source(new ParquetReadSource(allocator)); - RETURN_NOT_OK(source->Open(file)); - - // TODO(wesm): reader properties - std::unique_ptr pq_reader; - PARQUET_CATCH_NOT_OK(pq_reader = ParquetReader::Open(std::move(source))); - - // Use the same memory pool as the ParquetAllocator - reader->reset(new FileReader(allocator->pool(), std::move(pq_reader))); - return Status::OK(); -} - -Status FileReader::GetFlatColumn(int i, std::unique_ptr* out) { - return impl_->GetFlatColumn(i, out); -} - -Status FileReader::ReadFlatColumn(int i, std::shared_ptr* out) { - return impl_->ReadFlatColumn(i, out); -} - -Status FileReader::ReadFlatTable(std::shared_ptr
<Table>* out) {
-  return impl_->ReadFlatTable(out);
-}
-
-FlatColumnReader::Impl::Impl(MemoryPool* pool, const ::parquet::ColumnDescriptor* descr,
-    ::parquet::ParquetFileReader* reader, int column_index)
-    : pool_(pool),
-      descr_(descr),
-      reader_(reader),
-      column_index_(column_index),
-      next_row_group_(0),
-      values_buffer_(pool),
-      def_levels_buffer_(pool) {
-  NodeToField(descr_->schema_node(), &field_);
-  NextRowGroup();
-}
-
-template <typename ArrowType, typename ParquetType>
-Status FlatColumnReader::Impl::ReadNonNullableBatch(typename ParquetType::c_type* values,
-    int64_t values_read, BuilderType<ArrowType>* builder) {
-  using ArrowCType = typename ArrowType::c_type;
-  using ParquetCType = typename ParquetType::c_type;
-
-  DCHECK(builder);
-  const ArrowCType* values_ptr = nullptr;
-  RETURN_NOT_OK(
-      (ConvertPhysicalType<ParquetCType, ArrowCType>(values, values_read, &values_ptr)));
-  RETURN_NOT_OK(builder->Append(values_ptr, values_read));
-  return Status::OK();
-}
-
-template <typename ArrowType, typename ParquetType>
-Status FlatColumnReader::Impl::ReadNullableFlatBatch(const int16_t* def_levels,
-    typename ParquetType::c_type* values, int64_t values_read, int64_t levels_read,
-    BuilderType<ArrowType>* builder) {
-  using ArrowCType = typename ArrowType::c_type;
-
-  DCHECK(builder);
-  RETURN_NOT_OK(values_builder_buffer_.Resize(levels_read * sizeof(ArrowCType)));
-  RETURN_NOT_OK(valid_bytes_buffer_.Resize(levels_read * sizeof(uint8_t)));
-  auto values_ptr = reinterpret_cast<ArrowCType*>(values_builder_buffer_.mutable_data());
-  uint8_t* valid_bytes = valid_bytes_buffer_.mutable_data();
-  int values_idx = 0;
-  for (int64_t i = 0; i < levels_read; i++) {
-    if (def_levels[i] < descr_->max_definition_level()) {
-      valid_bytes[i] = 0;
-    } else {
-      valid_bytes[i] = 1;
-      values_ptr[i] = values[values_idx++];
-    }
-  }
-  RETURN_NOT_OK(builder->Append(values_ptr, levels_read, valid_bytes));
-  return Status::OK();
-}
-
-template <typename ArrowType, typename ParquetType>
-Status FlatColumnReader::Impl::TypedReadBatch(
-    int batch_size, std::shared_ptr<Array>* out) {
-  using ParquetCType = typename ParquetType::c_type;
-
-  int values_to_read = batch_size;
-  BuilderType<ArrowType> builder(pool_, field_->type);
-  while ((values_to_read > 0) && column_reader_) {
-    values_buffer_.Resize(values_to_read * sizeof(ParquetCType));
-    if (descr_->max_definition_level() > 0) {
-      def_levels_buffer_.Resize(values_to_read * sizeof(int16_t));
-    }
-    auto reader = dynamic_cast<TypedColumnReader<ParquetType>*>(column_reader_.get());
-    int64_t values_read;
-    int64_t levels_read;
-    int16_t* def_levels = reinterpret_cast<int16_t*>(def_levels_buffer_.mutable_data());
-    auto values = reinterpret_cast<ParquetCType*>(values_buffer_.mutable_data());
-    PARQUET_CATCH_NOT_OK(levels_read = reader->ReadBatch(
-        values_to_read, def_levels, nullptr, values, &values_read));
-    values_to_read -= levels_read;
-    if (descr_->max_definition_level() == 0) {
-      RETURN_NOT_OK(
-          (ReadNonNullableBatch<ArrowType, ParquetType>(values, values_read, &builder)));
-    } else {
-      // As per the definition and checks for flat columns:
-      // descr_->max_definition_level() == 1
-      RETURN_NOT_OK((ReadNullableFlatBatch<ArrowType, ParquetType>(
-          def_levels, values, values_read, levels_read, &builder)));
-    }
-    if (!column_reader_->HasNext()) { NextRowGroup(); }
-  }
-  *out = builder.Finish();
-  return Status::OK();
-}
-
-template <>
-Status FlatColumnReader::Impl::TypedReadBatch<StringType, ::parquet::ByteArrayType>(
-    int batch_size, std::shared_ptr<Array>* out) {
-  int values_to_read = batch_size;
-  StringBuilder builder(pool_, field_->type);
-  while ((values_to_read > 0) && column_reader_) {
-    values_buffer_.Resize(values_to_read * sizeof(::parquet::ByteArray));
-    if (descr_->max_definition_level() > 0) {
-      def_levels_buffer_.Resize(values_to_read * sizeof(int16_t));
-    }
-    auto reader =
dynamic_cast*>(column_reader_.get()); - int64_t values_read; - int64_t levels_read; - int16_t* def_levels = reinterpret_cast(def_levels_buffer_.mutable_data()); - auto values = reinterpret_cast<::parquet::ByteArray*>(values_buffer_.mutable_data()); - PARQUET_CATCH_NOT_OK(levels_read = reader->ReadBatch( - values_to_read, def_levels, nullptr, values, &values_read)); - values_to_read -= levels_read; - if (descr_->max_definition_level() == 0) { - for (int64_t i = 0; i < levels_read; i++) { - RETURN_NOT_OK( - builder.Append(reinterpret_cast(values[i].ptr), values[i].len)); - } - } else { - // descr_->max_definition_level() == 1 - int values_idx = 0; - for (int64_t i = 0; i < levels_read; i++) { - if (def_levels[i] < descr_->max_definition_level()) { - RETURN_NOT_OK(builder.AppendNull()); - } else { - RETURN_NOT_OK( - builder.Append(reinterpret_cast(values[values_idx].ptr), - values[values_idx].len)); - values_idx++; - } - } - } - if (!column_reader_->HasNext()) { NextRowGroup(); } - } - *out = builder.Finish(); - return Status::OK(); -} - -#define TYPED_BATCH_CASE(ENUM, ArrowType, ParquetType) \ - case Type::ENUM: \ - return TypedReadBatch(batch_size, out); \ - break; - -Status FlatColumnReader::Impl::NextBatch(int batch_size, std::shared_ptr* out) { - if (!column_reader_) { - // Exhausted all row groups. - *out = nullptr; - return Status::OK(); - } - - switch (field_->type->type) { - TYPED_BATCH_CASE(BOOL, BooleanType, ::parquet::BooleanType) - TYPED_BATCH_CASE(UINT8, UInt8Type, ::parquet::Int32Type) - TYPED_BATCH_CASE(INT8, Int8Type, ::parquet::Int32Type) - TYPED_BATCH_CASE(UINT16, UInt16Type, ::parquet::Int32Type) - TYPED_BATCH_CASE(INT16, Int16Type, ::parquet::Int32Type) - TYPED_BATCH_CASE(UINT32, UInt32Type, ::parquet::Int32Type) - TYPED_BATCH_CASE(INT32, Int32Type, ::parquet::Int32Type) - TYPED_BATCH_CASE(UINT64, UInt64Type, ::parquet::Int64Type) - TYPED_BATCH_CASE(INT64, Int64Type, ::parquet::Int64Type) - TYPED_BATCH_CASE(FLOAT, FloatType, ::parquet::FloatType) - TYPED_BATCH_CASE(DOUBLE, DoubleType, ::parquet::DoubleType) - TYPED_BATCH_CASE(STRING, StringType, ::parquet::ByteArrayType) - TYPED_BATCH_CASE(TIMESTAMP, TimestampType, ::parquet::Int64Type) - default: - return Status::NotImplemented(field_->type->ToString()); - } -} - -void FlatColumnReader::Impl::NextRowGroup() { - if (next_row_group_ < reader_->metadata()->num_row_groups()) { - column_reader_ = reader_->RowGroup(next_row_group_)->Column(column_index_); - next_row_group_++; - } else { - column_reader_ = nullptr; - } -} - -FlatColumnReader::FlatColumnReader(std::unique_ptr impl) : impl_(std::move(impl)) {} - -FlatColumnReader::~FlatColumnReader() {} - -Status FlatColumnReader::NextBatch(int batch_size, std::shared_ptr* out) { - return impl_->NextBatch(batch_size, out); -} - -} // namespace parquet -} // namespace arrow diff --git a/cpp/src/arrow/parquet/reader.h b/cpp/src/arrow/parquet/reader.h deleted file mode 100644 index 2689bebea30ef..0000000000000 --- a/cpp/src/arrow/parquet/reader.h +++ /dev/null @@ -1,146 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. 
You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied. See the License for the
-// specific language governing permissions and limitations
-// under the License.
-
-#ifndef ARROW_PARQUET_READER_H
-#define ARROW_PARQUET_READER_H
-
-#include <memory>
-
-#include "parquet/api/reader.h"
-#include "parquet/api/schema.h"
-
-#include "arrow/io/interfaces.h"
-#include "arrow/parquet/io.h"
-#include "arrow/util/visibility.h"
-
-namespace arrow {
-
-class Array;
-class MemoryPool;
-class RecordBatch;
-class Status;
-class Table;
-
-namespace parquet {
-
-class FlatColumnReader;
-
-// Arrow read adapter class for deserializing Parquet files as Arrow row
-// batches.
-//
-// TODO(wesm): nested data does not always make sense with this user
-// interface unless you are only reading a single leaf node from a branch of
-// a table. For example:
-//
-// repeated group data {
-//   optional group record {
-//     optional int32 val1;
-//     optional byte_array val2;
-//     optional bool val3;
-//   }
-//   optional int32 val4;
-// }
-//
-// In the Parquet file, there are 4 leaf nodes:
-//
-// * data.record.val1
-// * data.record.val2
-// * data.record.val3
-// * data.val4
-//
-// When materializing this data in an Arrow array, we would have:
-//
-// data: list<struct<
-//   record: struct<
-//    val1: int32,
-//    val2: string (= list<uint8>),
-//    val3: bool,
-//   >,
-//   val4: int32
-// >>
-//
-// However, in the Parquet format, each leaf node has its own repetition and
-// definition levels describing the structure of the intermediate nodes in
-// this array structure. Thus, we will need to scan the leaf data for a group
-// of leaf nodes part of the same type tree to create a single result Arrow
-// nested array structure.
-//
-// This is additionally complicated by "chunky" repeated fields or very large
-// byte arrays
-class ARROW_EXPORT FileReader {
- public:
-  FileReader(MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileReader> reader);
-
-  // Since the distribution of columns amongst a Parquet file's row groups may
-  // be uneven (the number of values in each column chunk can be different), we
-  // provide a column-oriented read interface. The ColumnReader hides the
-  // details of paging through the file's row groups and yielding
-  // fully-materialized arrow::Array instances
-  //
-  // Returns error status if the column of interest is not flat.
-  Status GetFlatColumn(int i, std::unique_ptr<FlatColumnReader>* out);
-  // Read column as a whole into an Array.
-  Status ReadFlatColumn(int i, std::shared_ptr<Array>* out);
-  // Read a table of flat columns into a Table.
-  Status ReadFlatTable(std::shared_ptr<Table>* out);
-
-  virtual ~FileReader();
-
- private:
-  class ARROW_NO_EXPORT Impl;
-  std::unique_ptr<Impl> impl_;
-};
-
-// At this point, the column reader is a stream iterator. It only knows how to
-// read the next batch of values for a particular column from the file until it
-// runs out.
-//
-// We also do not expose any internal Parquet details, such as row groups. This
-// might change in the future.
-class ARROW_EXPORT FlatColumnReader {
- public:
-  virtual ~FlatColumnReader();
-
-  // Scan the next array of the indicated size. The actual size of the
-  // returned array may be less than the passed size depending on how much
-  // data is available in the file.
-  //
-  // When all the data in the file has been exhausted, the result is set to
-  // nullptr.
-  //
-  // Returns Status::OK on a successful read, including if you have exhausted
-  // the data available in the file.
-  Status NextBatch(int batch_size, std::shared_ptr<Array>* out);
-
- private:
-  class ARROW_NO_EXPORT Impl;
-  std::unique_ptr<Impl> impl_;
-  explicit FlatColumnReader(std::unique_ptr<Impl> impl);
-
-  friend class FileReader;
-};
-
-// Helper function to create a file reader from an implementation of an Arrow
-// readable file
-ARROW_EXPORT
-Status OpenFile(const std::shared_ptr<io::ReadableFileInterface>& file,
-    ParquetAllocator* allocator, std::unique_ptr<FileReader>* reader);
-
-}  // namespace parquet
-}  // namespace arrow
-
-#endif  // ARROW_PARQUET_READER_H
diff --git a/cpp/src/arrow/parquet/schema.cc b/cpp/src/arrow/parquet/schema.cc
deleted file mode 100644
index ff32e51bacd8b..0000000000000
--- a/cpp/src/arrow/parquet/schema.cc
+++ /dev/null
@@ -1,344 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements. See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership. The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License. You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied. See the License for the
-// specific language governing permissions and limitations
-// under the License.
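schema.cc, deleted below, implements both directions of schema conversion as a recursive walk over the node tree. A sketch of a full schema roundtrip through the two entry points (the function name is illustrative); the writer properties drive the version-dependent mappings such as UINT32:

// Sketch: Parquet schema -> Arrow schema -> Parquet schema.
Status RoundTripSchema(const ::parquet::SchemaDescriptor* descr,
    std::shared_ptr<::parquet::SchemaDescriptor>* out) {
  std::shared_ptr<Schema> arrow_schema;
  RETURN_NOT_OK(FromParquetSchema(descr, &arrow_schema));
  // default_writer_properties() selects PARQUET_1_0 semantics here.
  auto properties = ::parquet::default_writer_properties();
  return ToParquetSchema(arrow_schema.get(), *properties, out);
}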
- -#include "arrow/parquet/schema.h" - -#include -#include - -#include "parquet/api/schema.h" - -#include "arrow/parquet/utils.h" -#include "arrow/types/decimal.h" -#include "arrow/types/string.h" -#include "arrow/util/status.h" - -using parquet::Repetition; -using parquet::schema::Node; -using parquet::schema::NodePtr; -using parquet::schema::GroupNode; -using parquet::schema::PrimitiveNode; - -using ParquetType = parquet::Type; -using parquet::LogicalType; - -namespace arrow { - -namespace parquet { - -const auto BOOL = std::make_shared(); -const auto UINT8 = std::make_shared(); -const auto INT8 = std::make_shared(); -const auto UINT16 = std::make_shared(); -const auto INT16 = std::make_shared(); -const auto UINT32 = std::make_shared(); -const auto INT32 = std::make_shared(); -const auto UINT64 = std::make_shared(); -const auto INT64 = std::make_shared(); -const auto FLOAT = std::make_shared(); -const auto DOUBLE = std::make_shared(); -const auto UTF8 = std::make_shared(); -const auto TIMESTAMP_MS = std::make_shared(TimestampType::Unit::MILLI); -const auto BINARY = std::make_shared(std::make_shared("", UINT8)); - -TypePtr MakeDecimalType(const PrimitiveNode* node) { - int precision = node->decimal_metadata().precision; - int scale = node->decimal_metadata().scale; - return std::make_shared(precision, scale); -} - -static Status FromByteArray(const PrimitiveNode* node, TypePtr* out) { - switch (node->logical_type()) { - case LogicalType::UTF8: - *out = UTF8; - break; - case LogicalType::DECIMAL: - *out = MakeDecimalType(node); - break; - default: - // BINARY - *out = BINARY; - break; - } - return Status::OK(); -} - -static Status FromFLBA(const PrimitiveNode* node, TypePtr* out) { - switch (node->logical_type()) { - case LogicalType::NONE: - *out = BINARY; - break; - case LogicalType::DECIMAL: - *out = MakeDecimalType(node); - break; - default: - return Status::NotImplemented("unhandled type"); - break; - } - - return Status::OK(); -} - -static Status FromInt32(const PrimitiveNode* node, TypePtr* out) { - switch (node->logical_type()) { - case LogicalType::NONE: - *out = INT32; - break; - case LogicalType::UINT_8: - *out = UINT8; - break; - case LogicalType::INT_8: - *out = INT8; - break; - case LogicalType::UINT_16: - *out = UINT16; - break; - case LogicalType::INT_16: - *out = INT16; - break; - case LogicalType::UINT_32: - *out = UINT32; - break; - case LogicalType::DECIMAL: - *out = MakeDecimalType(node); - break; - default: - return Status::NotImplemented("Unhandled logical type for int32"); - break; - } - return Status::OK(); -} - -static Status FromInt64(const PrimitiveNode* node, TypePtr* out) { - switch (node->logical_type()) { - case LogicalType::NONE: - *out = INT64; - break; - case LogicalType::UINT_64: - *out = UINT64; - break; - case LogicalType::DECIMAL: - *out = MakeDecimalType(node); - break; - case LogicalType::TIMESTAMP_MILLIS: - *out = TIMESTAMP_MS; - break; - default: - return Status::NotImplemented("Unhandled logical type for int64"); - break; - } - return Status::OK(); -} - -// TODO: Logical Type Handling -Status NodeToField(const NodePtr& node, std::shared_ptr* out) { - std::shared_ptr type; - - if (node->is_repeated()) { - return Status::NotImplemented("No support yet for repeated node types"); - } - - if (node->is_group()) { - const GroupNode* group = static_cast(node.get()); - std::vector> fields(group->field_count()); - for (int i = 0; i < group->field_count(); i++) { - RETURN_NOT_OK(NodeToField(group->field(i), &fields[i])); - } - type = 
std::make_shared(fields); - } else { - // Primitive (leaf) node - const PrimitiveNode* primitive = static_cast(node.get()); - - switch (primitive->physical_type()) { - case ParquetType::BOOLEAN: - type = BOOL; - break; - case ParquetType::INT32: - RETURN_NOT_OK(FromInt32(primitive, &type)); - break; - case ParquetType::INT64: - RETURN_NOT_OK(FromInt64(primitive, &type)); - break; - case ParquetType::INT96: - // TODO: Do we have that type in Arrow? - // type = TypePtr(new Int96Type()); - return Status::NotImplemented("int96"); - case ParquetType::FLOAT: - type = FLOAT; - break; - case ParquetType::DOUBLE: - type = DOUBLE; - break; - case ParquetType::BYTE_ARRAY: - // TODO: Do we have that type in Arrow? - RETURN_NOT_OK(FromByteArray(primitive, &type)); - break; - case ParquetType::FIXED_LEN_BYTE_ARRAY: - RETURN_NOT_OK(FromFLBA(primitive, &type)); - break; - } - } - - *out = std::make_shared(node->name(), type, !node->is_required()); - return Status::OK(); -} - -Status FromParquetSchema( - const ::parquet::SchemaDescriptor* parquet_schema, std::shared_ptr* out) { - // TODO(wesm): Consider adding an arrow::Schema name attribute, which comes - // from the root Parquet node - const GroupNode* schema_node = - static_cast(parquet_schema->group_node()); - - std::vector> fields(schema_node->field_count()); - for (int i = 0; i < schema_node->field_count(); i++) { - RETURN_NOT_OK(NodeToField(schema_node->field(i), &fields[i])); - } - - *out = std::make_shared(fields); - return Status::OK(); -} - -Status StructToNode(const std::shared_ptr& type, const std::string& name, - bool nullable, const ::parquet::WriterProperties& properties, NodePtr* out) { - Repetition::type repetition = Repetition::REQUIRED; - if (nullable) { repetition = Repetition::OPTIONAL; } - - std::vector children(type->num_children()); - for (int i = 0; i < type->num_children(); i++) { - RETURN_NOT_OK(FieldToNode(type->child(i), properties, &children[i])); - } - - *out = GroupNode::Make(name, repetition, children); - return Status::OK(); -} - -Status FieldToNode(const std::shared_ptr& field, - const ::parquet::WriterProperties& properties, NodePtr* out) { - LogicalType::type logical_type = LogicalType::NONE; - ParquetType::type type; - Repetition::type repetition = Repetition::REQUIRED; - if (field->nullable) { repetition = Repetition::OPTIONAL; } - int length = -1; - - switch (field->type->type) { - // TODO: - // case Type::NA: - // break; - case Type::BOOL: - type = ParquetType::BOOLEAN; - break; - case Type::UINT8: - type = ParquetType::INT32; - logical_type = LogicalType::UINT_8; - break; - case Type::INT8: - type = ParquetType::INT32; - logical_type = LogicalType::INT_8; - break; - case Type::UINT16: - type = ParquetType::INT32; - logical_type = LogicalType::UINT_16; - break; - case Type::INT16: - type = ParquetType::INT32; - logical_type = LogicalType::INT_16; - break; - case Type::UINT32: - if (properties.version() == ::parquet::ParquetVersion::PARQUET_1_0) { - type = ParquetType::INT64; - } else { - type = ParquetType::INT32; - logical_type = LogicalType::UINT_32; - } - break; - case Type::INT32: - type = ParquetType::INT32; - break; - case Type::UINT64: - type = ParquetType::INT64; - logical_type = LogicalType::UINT_64; - break; - case Type::INT64: - type = ParquetType::INT64; - break; - case Type::FLOAT: - type = ParquetType::FLOAT; - break; - case Type::DOUBLE: - type = ParquetType::DOUBLE; - break; - case Type::STRING: - type = ParquetType::BYTE_ARRAY; - logical_type = LogicalType::UTF8; - break; - case Type::BINARY: - 
type = ParquetType::BYTE_ARRAY;
-      break;
-    case Type::DATE:
-      type = ParquetType::INT32;
-      logical_type = LogicalType::DATE;
-      break;
-    case Type::TIMESTAMP: {
-      auto timestamp_type = static_cast<TimestampType*>(field->type.get());
-      if (timestamp_type->unit != TimestampType::Unit::MILLI) {
-        return Status::NotImplemented(
-            "Timestamp units other than millisecond are not yet supported with Parquet.");
-      }
-      type = ParquetType::INT64;
-      logical_type = LogicalType::TIMESTAMP_MILLIS;
-    } break;
-    case Type::TIMESTAMP_DOUBLE:
-      type = ParquetType::INT64;
-      // This is specified as seconds since the UNIX epoch
-      // TODO: Converted type in Parquet?
-      // logical_type = LogicalType::TIMESTAMP_MILLIS;
-      break;
-    case Type::TIME:
-      type = ParquetType::INT64;
-      logical_type = LogicalType::TIME_MILLIS;
-      break;
-    case Type::STRUCT: {
-      auto struct_type = std::static_pointer_cast<StructType>(field->type);
-      return StructToNode(struct_type, field->name, field->nullable, properties, out);
-    } break;
-    default:
-      // TODO: LIST, DENSE_UNION, SPARSE_UNION, JSON_SCALAR, DECIMAL, DECIMAL_TEXT, VARCHAR
-      return Status::NotImplemented("unhandled type");
-  }
-  *out = PrimitiveNode::Make(field->name, repetition, type, logical_type, length);
-  return Status::OK();
-}
-
-Status ToParquetSchema(const Schema* arrow_schema,
-    const ::parquet::WriterProperties& properties,
-    std::shared_ptr<::parquet::SchemaDescriptor>* out) {
-  std::vector<NodePtr> nodes(arrow_schema->num_fields());
-  for (int i = 0; i < arrow_schema->num_fields(); i++) {
-    RETURN_NOT_OK(FieldToNode(arrow_schema->field(i), properties, &nodes[i]));
-  }
-
-  NodePtr schema = GroupNode::Make("schema", Repetition::REPEATED, nodes);
-  *out = std::make_shared<::parquet::SchemaDescriptor>();
-  PARQUET_CATCH_NOT_OK((*out)->Init(schema));
-
-  return Status::OK();
-}
-
-}  // namespace parquet
-
-}  // namespace arrow
diff --git a/cpp/src/arrow/parquet/schema.h b/cpp/src/arrow/parquet/schema.h
deleted file mode 100644
index 88b5977d223a4..0000000000000
--- a/cpp/src/arrow/parquet/schema.h
+++ /dev/null
@@ -1,53 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements. See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership. The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License. You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied. See the License for the
-// specific language governing permissions and limitations
-// under the License.
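To make the FieldToNode mapping above concrete, two representative conversions written against the declarations in schema.h below; a sketch only, with variable names illustrative and the resulting nodes stated in comments:

// timestamp(ms) becomes physical INT64 annotated TIMESTAMP_MILLIS.
auto ts_field = std::make_shared<Field>(
    "ts", std::make_shared<TimestampType>(TimestampType::Unit::MILLI));
::parquet::schema::NodePtr node;
RETURN_NOT_OK(FieldToNode(ts_field, *::parquet::default_writer_properties(), &node));

// uint32 is version-dependent: INT64 under PARQUET_1_0 (that format version
// has no unsigned 32-bit annotation), INT32 + UINT_32 under PARQUET_2_0.
auto u32_field = std::make_shared<Field>("u32", std::make_shared<UInt32Type>());
RETURN_NOT_OK(FieldToNode(u32_field, *::parquet::default_writer_properties(), &node));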
- -#ifndef ARROW_PARQUET_SCHEMA_H -#define ARROW_PARQUET_SCHEMA_H - -#include - -#include "parquet/api/schema.h" -#include "parquet/api/writer.h" - -#include "arrow/schema.h" -#include "arrow/type.h" -#include "arrow/util/visibility.h" - -namespace arrow { - -class Status; - -namespace parquet { - -Status ARROW_EXPORT NodeToField( - const ::parquet::schema::NodePtr& node, std::shared_ptr* out); - -Status ARROW_EXPORT FromParquetSchema( - const ::parquet::SchemaDescriptor* parquet_schema, std::shared_ptr* out); - -Status ARROW_EXPORT FieldToNode(const std::shared_ptr& field, - const ::parquet::WriterProperties& properties, ::parquet::schema::NodePtr* out); - -Status ARROW_EXPORT ToParquetSchema(const Schema* arrow_schema, - const ::parquet::WriterProperties& properties, - std::shared_ptr<::parquet::SchemaDescriptor>* out); - -} // namespace parquet - -} // namespace arrow - -#endif // ARROW_PARQUET_SCHEMA_H diff --git a/cpp/src/arrow/parquet/test-util.h b/cpp/src/arrow/parquet/test-util.h deleted file mode 100644 index 68a7fb94c2aed..0000000000000 --- a/cpp/src/arrow/parquet/test-util.h +++ /dev/null @@ -1,193 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
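The helpers in the deleted test-util.h below build arrays through the builder APIs with an explicit valid-bytes vector; as their comments note, they null out every second slot, so they support at most size/2 nulls. A usage sketch (the assert-style expectations are illustrative, derived from the construction loop below):

// Nulls occupy slots 0, 2, ..., 2*(num_nulls-1); values fill the rest.
auto doubles = NullableArray<DoubleType>(/*size=*/100, /*num_nulls=*/10);
assert(doubles->null_count() == 10);
assert(doubles->IsNull(0));
assert(!doubles->IsNull(1));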
-
-#include <string>
-#include <vector>
-
-#include "arrow/test-util.h"
-#include "arrow/types/primitive.h"
-#include "arrow/types/string.h"
-
-namespace arrow {
-
-namespace parquet {
-
-template <typename ArrowType>
-using is_arrow_float = std::is_floating_point<typename ArrowType::c_type>;
-
-template <typename ArrowType>
-using is_arrow_int = std::is_integral<typename ArrowType::c_type>;
-
-template <typename ArrowType>
-using is_arrow_string = std::is_same<ArrowType, StringType>;
-
-template <typename ArrowType>
-typename std::enable_if<is_arrow_float<ArrowType>::value,
-    std::shared_ptr<PrimitiveArray>>::type
-NonNullArray(size_t size) {
-  std::vector<typename ArrowType::c_type> values;
-  ::arrow::test::random_real<typename ArrowType::c_type>(size, 0, 0, 1, &values);
-  NumericBuilder<ArrowType> builder(default_memory_pool(), std::make_shared<ArrowType>());
-  builder.Append(values.data(), values.size());
-  return std::static_pointer_cast<PrimitiveArray>(builder.Finish());
-}
-
-template <typename ArrowType>
-typename std::enable_if<is_arrow_int<ArrowType>::value,
-    std::shared_ptr<PrimitiveArray>>::type
-NonNullArray(size_t size) {
-  std::vector<typename ArrowType::c_type> values;
-  ::arrow::test::randint<typename ArrowType::c_type>(size, 0, 64, &values);
-  NumericBuilder<ArrowType> builder(default_memory_pool(), std::make_shared<ArrowType>());
-  builder.Append(values.data(), values.size());
-  return std::static_pointer_cast<PrimitiveArray>(builder.Finish());
-}
-
-template <typename ArrowType>
-typename std::enable_if<is_arrow_string<ArrowType>::value,
-    std::shared_ptr<StringArray>>::type
-NonNullArray(size_t size) {
-  StringBuilder builder(default_memory_pool(), std::make_shared<StringType>());
-  for (size_t i = 0; i < size; i++) {
-    builder.Append("test-string");
-  }
-  return std::static_pointer_cast<StringArray>(builder.Finish());
-}
-
-template <>
-std::shared_ptr<PrimitiveArray> NonNullArray<BooleanType>(size_t size) {
-  std::vector<uint8_t> values;
-  ::arrow::test::randint<uint8_t>(size, 0, 1, &values);
-  BooleanBuilder builder(default_memory_pool(), std::make_shared<BooleanType>());
-  builder.Append(values.data(), values.size());
-  return std::static_pointer_cast<PrimitiveArray>(builder.Finish());
-}
-
-// This helper function only supports (size/2) nulls.
-template <typename ArrowType>
-typename std::enable_if<is_arrow_float<ArrowType>::value,
-    std::shared_ptr<PrimitiveArray>>::type
-NullableArray(size_t size, size_t num_nulls) {
-  std::vector<typename ArrowType::c_type> values;
-  ::arrow::test::random_real<typename ArrowType::c_type>(size, 0, 0, 1, &values);
-  std::vector<uint8_t> valid_bytes(size, 1);
-
-  for (size_t i = 0; i < num_nulls; i++) {
-    valid_bytes[i * 2] = 0;
-  }
-
-  NumericBuilder<ArrowType> builder(default_memory_pool(), std::make_shared<ArrowType>());
-  builder.Append(values.data(), values.size(), valid_bytes.data());
-  return std::static_pointer_cast<PrimitiveArray>(builder.Finish());
-}
-
-// This helper function only supports (size/2) nulls.
-template <typename ArrowType>
-typename std::enable_if<is_arrow_int<ArrowType>::value,
-    std::shared_ptr<PrimitiveArray>>::type
-NullableArray(size_t size, size_t num_nulls) {
-  std::vector<typename ArrowType::c_type> values;
-  ::arrow::test::randint<typename ArrowType::c_type>(size, 0, 64, &values);
-  std::vector<uint8_t> valid_bytes(size, 1);
-
-  for (size_t i = 0; i < num_nulls; i++) {
-    valid_bytes[i * 2] = 0;
-  }
-
-  NumericBuilder<ArrowType> builder(default_memory_pool(), std::make_shared<ArrowType>());
-  builder.Append(values.data(), values.size(), valid_bytes.data());
-  return std::static_pointer_cast<PrimitiveArray>(builder.Finish());
-}
-
-// This helper function only supports (size/2) nulls yet.
-template <typename ArrowType>
-typename std::enable_if<is_arrow_string<ArrowType>::value,
-    std::shared_ptr<StringArray>>::type
-NullableArray(size_t size, size_t num_nulls) {
-  std::vector<uint8_t> valid_bytes(size, 1);
-
-  for (size_t i = 0; i < num_nulls; i++) {
-    valid_bytes[i * 2] = 0;
-  }
-
-  StringBuilder builder(default_memory_pool(), std::make_shared<StringType>());
-  for (size_t i = 0; i < size; i++) {
-    builder.Append("test-string");
-  }
-  return std::static_pointer_cast<StringArray>(builder.Finish());
-}
-
-// This helper function only supports (size/2) nulls yet.
-template <>
-std::shared_ptr<PrimitiveArray> NullableArray<BooleanType>(
-    size_t size, size_t num_nulls) {
-  std::vector<uint8_t> values;
-  ::arrow::test::randint<uint8_t>(size, 0, 1, &values);
-  std::vector<uint8_t> valid_bytes(size, 1);
-
-  for (size_t i = 0; i < num_nulls; i++) {
-    valid_bytes[i * 2] = 0;
-  }
-
-  BooleanBuilder builder(default_memory_pool(), std::make_shared<BooleanType>());
-  builder.Append(values.data(), values.size(), valid_bytes.data());
-  return std::static_pointer_cast<PrimitiveArray>(builder.Finish());
-}
-
-std::shared_ptr<Column> MakeColumn(
-    const std::string& name, const std::shared_ptr<Array>& array, bool nullable) {
-  auto field = std::make_shared<Field>(name, array->type(), nullable);
-  return std::make_shared<Column>(field, array);
-}
-
-std::shared_ptr<Table> MakeSimpleTable(
-    const std::shared_ptr<Array>& values, bool nullable) {
-  std::shared_ptr<Column> column = MakeColumn("col", values, nullable);
-  std::vector<std::shared_ptr<Column>> columns({column});
-  std::vector<std::shared_ptr<Field>> fields({column->field()});
-  auto schema = std::make_shared<Schema>(fields);
-  return std::make_shared<Table>
("table", schema, columns); -} - -template -void ExpectArray(T* expected, Array* result) { - PrimitiveArray* p_array = static_cast(result); - for (int i = 0; i < result->length(); i++) { - EXPECT_EQ(expected[i], reinterpret_cast(p_array->data()->data())[i]); - } -} - -template -void ExpectArray(typename ArrowType::c_type* expected, Array* result) { - PrimitiveArray* p_array = static_cast(result); - for (int64_t i = 0; i < result->length(); i++) { - EXPECT_EQ(expected[i], - reinterpret_cast(p_array->data()->data())[i]); - } -} - -template <> -void ExpectArray(uint8_t* expected, Array* result) { - BooleanBuilder builder(default_memory_pool(), std::make_shared()); - builder.Append(expected, result->length()); - std::shared_ptr expected_array = builder.Finish(); - EXPECT_TRUE(result->Equals(expected_array)); -} - -} // namespace parquet - -} // namespace arrow diff --git a/cpp/src/arrow/parquet/utils.h b/cpp/src/arrow/parquet/utils.h deleted file mode 100644 index bcc46be60e6ec..0000000000000 --- a/cpp/src/arrow/parquet/utils.h +++ /dev/null @@ -1,52 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#ifndef ARROW_PARQUET_UTILS_H -#define ARROW_PARQUET_UTILS_H - -#include - -#include "arrow/util/status.h" -#include "parquet/exception.h" - -namespace arrow { -namespace parquet { - -#define PARQUET_CATCH_NOT_OK(s) \ - try { \ - (s); \ - } catch (const ::parquet::ParquetException& e) { return Status::Invalid(e.what()); } - -#define PARQUET_IGNORE_NOT_OK(s) \ - try { \ - (s); \ - } catch (const ::parquet::ParquetException& e) {} - -#define PARQUET_THROW_NOT_OK(s) \ - do { \ - ::arrow::Status _s = (s); \ - if (!_s.ok()) { \ - std::stringstream ss; \ - ss << "Arrow error: " << _s.ToString(); \ - throw ::parquet::ParquetException(ss.str()); \ - } \ - } while (0); - -} // namespace parquet -} // namespace arrow - -#endif // ARROW_PARQUET_UTILS_H diff --git a/cpp/src/arrow/parquet/writer.cc b/cpp/src/arrow/parquet/writer.cc deleted file mode 100644 index 2b47f1461c9f4..0000000000000 --- a/cpp/src/arrow/parquet/writer.cc +++ /dev/null @@ -1,365 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. 
See the License for the -// specific language governing permissions and limitations -// under the License. - -#include "arrow/parquet/writer.h" - -#include -#include - -#include "arrow/array.h" -#include "arrow/column.h" -#include "arrow/table.h" -#include "arrow/types/construct.h" -#include "arrow/types/primitive.h" -#include "arrow/types/string.h" -#include "arrow/parquet/schema.h" -#include "arrow/parquet/utils.h" -#include "arrow/util/status.h" - -using parquet::ParquetFileWriter; -using parquet::ParquetVersion; -using parquet::schema::GroupNode; - -namespace arrow { -namespace parquet { - -class FileWriter::Impl { - public: - Impl(MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileWriter> writer); - - Status NewRowGroup(int64_t chunk_size); - template - Status TypedWriteBatch(::parquet::ColumnWriter* writer, const PrimitiveArray* data, - int64_t offset, int64_t length); - - // TODO(uwe): Same code as in reader.cc the only difference is the name of the temporary - // buffer - template - struct can_copy_ptr { - static constexpr bool value = - std::is_same::value || - (std::is_integral{} && std::is_integral{} && - (sizeof(InType) == sizeof(OutType))); - }; - - template ::value>::type* = nullptr> - Status ConvertPhysicalType(const InType* in_ptr, int64_t, const OutType** out_ptr) { - *out_ptr = reinterpret_cast(in_ptr); - return Status::OK(); - } - - template ::value>::type* = nullptr> - Status ConvertPhysicalType( - const InType* in_ptr, int64_t length, const OutType** out_ptr) { - RETURN_NOT_OK(data_buffer_.Resize(length * sizeof(OutType))); - OutType* mutable_out_ptr = reinterpret_cast(data_buffer_.mutable_data()); - std::copy(in_ptr, in_ptr + length, mutable_out_ptr); - *out_ptr = mutable_out_ptr; - return Status::OK(); - } - - Status WriteFlatColumnChunk(const PrimitiveArray* data, int64_t offset, int64_t length); - Status WriteFlatColumnChunk(const StringArray* data, int64_t offset, int64_t length); - Status Close(); - - virtual ~Impl() {} - - private: - friend class FileWriter; - - MemoryPool* pool_; - // Buffer used for storing the data of an array converted to the physical type - // as expected by parquet-cpp. 
- PoolBuffer data_buffer_; - PoolBuffer def_levels_buffer_; - std::unique_ptr<::parquet::ParquetFileWriter> writer_; - ::parquet::RowGroupWriter* row_group_writer_; -}; - -FileWriter::Impl::Impl( - MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileWriter> writer) - : pool_(pool), - data_buffer_(pool), - writer_(std::move(writer)), - row_group_writer_(nullptr) {} - -Status FileWriter::Impl::NewRowGroup(int64_t chunk_size) { - if (row_group_writer_ != nullptr) { PARQUET_CATCH_NOT_OK(row_group_writer_->Close()); } - PARQUET_CATCH_NOT_OK(row_group_writer_ = writer_->AppendRowGroup(chunk_size)); - return Status::OK(); -} - -template -Status FileWriter::Impl::TypedWriteBatch(::parquet::ColumnWriter* column_writer, - const PrimitiveArray* data, int64_t offset, int64_t length) { - using ArrowCType = typename ArrowType::c_type; - using ParquetCType = typename ParquetType::c_type; - - DCHECK((offset + length) <= data->length()); - auto data_ptr = reinterpret_cast(data->data()->data()) + offset; - auto writer = - reinterpret_cast<::parquet::TypedColumnWriter*>(column_writer); - if (writer->descr()->max_definition_level() == 0) { - // no nulls, just dump the data - const ParquetCType* data_writer_ptr = nullptr; - RETURN_NOT_OK((ConvertPhysicalType( - data_ptr, length, &data_writer_ptr))); - PARQUET_CATCH_NOT_OK(writer->WriteBatch(length, nullptr, nullptr, data_writer_ptr)); - } else if (writer->descr()->max_definition_level() == 1) { - RETURN_NOT_OK(def_levels_buffer_.Resize(length * sizeof(int16_t))); - int16_t* def_levels_ptr = - reinterpret_cast(def_levels_buffer_.mutable_data()); - if (data->null_count() == 0) { - std::fill(def_levels_ptr, def_levels_ptr + length, 1); - const ParquetCType* data_writer_ptr = nullptr; - RETURN_NOT_OK((ConvertPhysicalType( - data_ptr, length, &data_writer_ptr))); - PARQUET_CATCH_NOT_OK( - writer->WriteBatch(length, def_levels_ptr, nullptr, data_writer_ptr)); - } else { - RETURN_NOT_OK(data_buffer_.Resize(length * sizeof(ParquetCType))); - auto buffer_ptr = reinterpret_cast(data_buffer_.mutable_data()); - int buffer_idx = 0; - for (int i = 0; i < length; i++) { - if (data->IsNull(offset + i)) { - def_levels_ptr[i] = 0; - } else { - def_levels_ptr[i] = 1; - buffer_ptr[buffer_idx++] = static_cast(data_ptr[i]); - } - } - PARQUET_CATCH_NOT_OK( - writer->WriteBatch(length, def_levels_ptr, nullptr, buffer_ptr)); - } - } else { - return Status::NotImplemented("no support for max definition level > 1 yet"); - } - PARQUET_CATCH_NOT_OK(writer->Close()); - return Status::OK(); -} - -// This specialization seems quite similar but it significantly differs in two points: -// * offset is added at the most latest time to the pointer as we have sub-byte access -// * Arrow data is stored bitwise thus we cannot use std::copy to transform from -// ArrowType::c_type to ParquetType::c_type -template <> -Status FileWriter::Impl::TypedWriteBatch<::parquet::BooleanType, BooleanType>( - ::parquet::ColumnWriter* column_writer, const PrimitiveArray* data, int64_t offset, - int64_t length) { - DCHECK((offset + length) <= data->length()); - RETURN_NOT_OK(data_buffer_.Resize(length)); - auto data_ptr = reinterpret_cast(data->data()->data()); - auto buffer_ptr = reinterpret_cast(data_buffer_.mutable_data()); - auto writer = reinterpret_cast<::parquet::TypedColumnWriter<::parquet::BooleanType>*>( - column_writer); - if (writer->descr()->max_definition_level() == 0) { - // no nulls, just dump the data - for (int64_t i = 0; i < length; i++) { - buffer_ptr[i] = util::get_bit(data_ptr, offset + i); 
- } - PARQUET_CATCH_NOT_OK(writer->WriteBatch(length, nullptr, nullptr, buffer_ptr)); - } else if (writer->descr()->max_definition_level() == 1) { - RETURN_NOT_OK(def_levels_buffer_.Resize(length * sizeof(int16_t))); - int16_t* def_levels_ptr = - reinterpret_cast(def_levels_buffer_.mutable_data()); - if (data->null_count() == 0) { - std::fill(def_levels_ptr, def_levels_ptr + length, 1); - for (int64_t i = 0; i < length; i++) { - buffer_ptr[i] = util::get_bit(data_ptr, offset + i); - } - // TODO(PARQUET-644): write boolean values as a packed bitmap - PARQUET_CATCH_NOT_OK( - writer->WriteBatch(length, def_levels_ptr, nullptr, buffer_ptr)); - } else { - int buffer_idx = 0; - for (int i = 0; i < length; i++) { - if (data->IsNull(offset + i)) { - def_levels_ptr[i] = 0; - } else { - def_levels_ptr[i] = 1; - buffer_ptr[buffer_idx++] = util::get_bit(data_ptr, offset + i); - } - } - PARQUET_CATCH_NOT_OK( - writer->WriteBatch(length, def_levels_ptr, nullptr, buffer_ptr)); - } - } else { - return Status::NotImplemented("no support for max definition level > 1 yet"); - } - PARQUET_CATCH_NOT_OK(writer->Close()); - return Status::OK(); -} - -Status FileWriter::Impl::Close() { - if (row_group_writer_ != nullptr) { PARQUET_CATCH_NOT_OK(row_group_writer_->Close()); } - PARQUET_CATCH_NOT_OK(writer_->Close()); - return Status::OK(); -} - -#define TYPED_BATCH_CASE(ENUM, ArrowType, ParquetType) \ - case Type::ENUM: \ - return TypedWriteBatch(writer, data, offset, length); \ - break; - -Status FileWriter::Impl::WriteFlatColumnChunk( - const PrimitiveArray* data, int64_t offset, int64_t length) { - ::parquet::ColumnWriter* writer; - PARQUET_CATCH_NOT_OK(writer = row_group_writer_->NextColumn()); - switch (data->type_enum()) { - TYPED_BATCH_CASE(BOOL, BooleanType, ::parquet::BooleanType) - TYPED_BATCH_CASE(UINT8, UInt8Type, ::parquet::Int32Type) - TYPED_BATCH_CASE(INT8, Int8Type, ::parquet::Int32Type) - TYPED_BATCH_CASE(UINT16, UInt16Type, ::parquet::Int32Type) - TYPED_BATCH_CASE(INT16, Int16Type, ::parquet::Int32Type) - case Type::UINT32: - if (writer_->properties()->version() == ParquetVersion::PARQUET_1_0) { - // Parquet 1.0 reader cannot read the UINT_32 logical type. Thus we need - // to use the larger Int64Type to store them lossless. 
- return TypedWriteBatch<::parquet::Int64Type, UInt32Type>( - writer, data, offset, length); - } else { - return TypedWriteBatch<::parquet::Int32Type, UInt32Type>( - writer, data, offset, length); - } - TYPED_BATCH_CASE(INT32, Int32Type, ::parquet::Int32Type) - TYPED_BATCH_CASE(UINT64, UInt64Type, ::parquet::Int64Type) - TYPED_BATCH_CASE(INT64, Int64Type, ::parquet::Int64Type) - TYPED_BATCH_CASE(TIMESTAMP, TimestampType, ::parquet::Int64Type) - TYPED_BATCH_CASE(FLOAT, FloatType, ::parquet::FloatType) - TYPED_BATCH_CASE(DOUBLE, DoubleType, ::parquet::DoubleType) - default: - return Status::NotImplemented(data->type()->ToString()); - } -} - -Status FileWriter::Impl::WriteFlatColumnChunk( - const StringArray* data, int64_t offset, int64_t length) { - ::parquet::ColumnWriter* column_writer; - PARQUET_CATCH_NOT_OK(column_writer = row_group_writer_->NextColumn()); - DCHECK((offset + length) <= data->length()); - RETURN_NOT_OK(data_buffer_.Resize(length * sizeof(::parquet::ByteArray))); - auto buffer_ptr = reinterpret_cast<::parquet::ByteArray*>(data_buffer_.mutable_data()); - auto values = std::dynamic_pointer_cast(data->values()); - auto data_ptr = reinterpret_cast(values->data()->data()); - DCHECK(values != nullptr); - auto writer = reinterpret_cast<::parquet::TypedColumnWriter<::parquet::ByteArrayType>*>( - column_writer); - if (writer->descr()->max_definition_level() > 0) { - RETURN_NOT_OK(def_levels_buffer_.Resize(length * sizeof(int16_t))); - } - int16_t* def_levels_ptr = reinterpret_cast(def_levels_buffer_.mutable_data()); - if (writer->descr()->max_definition_level() == 0 || data->null_count() == 0) { - // no nulls, just dump the data - for (int64_t i = 0; i < length; i++) { - buffer_ptr[i] = ::parquet::ByteArray( - data->value_length(i + offset), data_ptr + data->value_offset(i)); - } - if (writer->descr()->max_definition_level() > 0) { - std::fill(def_levels_ptr, def_levels_ptr + length, 1); - } - PARQUET_CATCH_NOT_OK(writer->WriteBatch(length, def_levels_ptr, nullptr, buffer_ptr)); - } else if (writer->descr()->max_definition_level() == 1) { - int buffer_idx = 0; - for (int64_t i = 0; i < length; i++) { - if (data->IsNull(offset + i)) { - def_levels_ptr[i] = 0; - } else { - def_levels_ptr[i] = 1; - buffer_ptr[buffer_idx++] = ::parquet::ByteArray( - data->value_length(i + offset), data_ptr + data->value_offset(i + offset)); - } - } - PARQUET_CATCH_NOT_OK(writer->WriteBatch(length, def_levels_ptr, nullptr, buffer_ptr)); - } else { - return Status::NotImplemented("no support for max definition level > 1 yet"); - } - PARQUET_CATCH_NOT_OK(writer->Close()); - return Status::OK(); -} - -FileWriter::FileWriter( - MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileWriter> writer) - : impl_(new FileWriter::Impl(pool, std::move(writer))) {} - -Status FileWriter::NewRowGroup(int64_t chunk_size) { - return impl_->NewRowGroup(chunk_size); -} - -Status FileWriter::WriteFlatColumnChunk( - const Array* array, int64_t offset, int64_t length) { - int64_t real_length = length; - if (length == -1) { real_length = array->length(); } - if (array->type_enum() == Type::STRING) { - auto string_array = dynamic_cast(array); - DCHECK(string_array); - return impl_->WriteFlatColumnChunk(string_array, offset, real_length); - } else { - auto primitive_array = dynamic_cast(array); - if (!primitive_array) { - return Status::NotImplemented("Table must consist of PrimitiveArray instances"); - } - return impl_->WriteFlatColumnChunk(primitive_array, offset, real_length); - } -} - -Status FileWriter::Close() { - return 
impl_->Close(); -} - -MemoryPool* FileWriter::memory_pool() const { - return impl_->pool_; -} - -FileWriter::~FileWriter() {} - -Status WriteFlatTable(const Table* table, MemoryPool* pool, - const std::shared_ptr<::parquet::OutputStream>& sink, int64_t chunk_size, - const std::shared_ptr<::parquet::WriterProperties>& properties) { - std::shared_ptr<::parquet::SchemaDescriptor> parquet_schema; - RETURN_NOT_OK( - ToParquetSchema(table->schema().get(), *properties.get(), &parquet_schema)); - auto schema_node = std::static_pointer_cast(parquet_schema->schema_root()); - std::unique_ptr parquet_writer = - ParquetFileWriter::Open(sink, schema_node, properties); - FileWriter writer(pool, std::move(parquet_writer)); - - // TODO(ARROW-232) Support writing chunked arrays. - for (int i = 0; i < table->num_columns(); i++) { - if (table->column(i)->data()->num_chunks() != 1) { - return Status::NotImplemented("No support for writing chunked arrays yet."); - } - } - - for (int chunk = 0; chunk * chunk_size < table->num_rows(); chunk++) { - int64_t offset = chunk * chunk_size; - int64_t size = std::min(chunk_size, table->num_rows() - offset); - RETURN_NOT_OK_ELSE(writer.NewRowGroup(size), PARQUET_IGNORE_NOT_OK(writer.Close())); - for (int i = 0; i < table->num_columns(); i++) { - std::shared_ptr array = table->column(i)->data()->chunk(0); - RETURN_NOT_OK_ELSE(writer.WriteFlatColumnChunk(array.get(), offset, size), - PARQUET_IGNORE_NOT_OK(writer.Close())); - } - } - - return writer.Close(); -} - -} // namespace parquet - -} // namespace arrow diff --git a/cpp/src/arrow/parquet/writer.h b/cpp/src/arrow/parquet/writer.h deleted file mode 100644 index ecc6a9f8be3de..0000000000000 --- a/cpp/src/arrow/parquet/writer.h +++ /dev/null @@ -1,76 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#ifndef ARROW_PARQUET_WRITER_H -#define ARROW_PARQUET_WRITER_H - -#include - -#include "parquet/api/schema.h" -#include "parquet/api/writer.h" - -#include "arrow/util/visibility.h" - -namespace arrow { - -class Array; -class MemoryPool; -class PrimitiveArray; -class RecordBatch; -class Status; -class StringArray; -class Table; - -namespace parquet { - -/** - * Iterative API: - * Start a new RowGroup/Chunk with NewRowGroup - * Write column-by-column the whole column chunk - */ -class ARROW_EXPORT FileWriter { - public: - FileWriter(MemoryPool* pool, std::unique_ptr<::parquet::ParquetFileWriter> writer); - - Status NewRowGroup(int64_t chunk_size); - Status WriteFlatColumnChunk(const Array* data, int64_t offset = 0, int64_t length = -1); - Status Close(); - - virtual ~FileWriter(); - - MemoryPool* memory_pool() const; - - private: - class ARROW_NO_EXPORT Impl; - std::unique_ptr impl_; -}; - -/** - * Write a flat Table to Parquet. 
- * - * The table shall only consist of nullable, non-repeated columns of primitive type. - */ -Status ARROW_EXPORT WriteFlatTable(const Table* table, MemoryPool* pool, - const std::shared_ptr<::parquet::OutputStream>& sink, int64_t chunk_size, - const std::shared_ptr<::parquet::WriterProperties>& properties = - ::parquet::default_writer_properties()); - -} // namespace parquet - -} // namespace arrow - -#endif // ARROW_PARQUET_WRITER_H diff --git a/cpp/src/arrow/types/string.cc b/cpp/src/arrow/types/string.cc index 2f0037024c78d..745ed8f7edb99 100644 --- a/cpp/src/arrow/types/string.cc +++ b/cpp/src/arrow/types/string.cc @@ -64,7 +64,7 @@ Status StringArray::Validate() const { // This used to be a static member variable of BinaryBuilder, but it can cause // valgrind to report a (spurious?) memory leak when needed in other shared // libraries. The problem came up while adding explicit visibility to libarrow -// and libarrow_parquet +// and libparquet_arrow static TypePtr kBinaryValueType = TypePtr(new UInt8Type()); BinaryBuilder::BinaryBuilder(MemoryPool* pool, const TypePtr& type) diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 522895808de5e..6357e3c1725e3 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -1,5 +1,5 @@ # Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file +# or more cod ntributor license agreements. See the NOTICE file # distributed with this work for additional information # regarding copyright ownership. The ASF licenses this file # to you under the Apache License, Version 2.0 (the @@ -294,12 +294,12 @@ function(ADD_THIRDPARTY_LIB LIB_NAME) add_library(${LIB_NAME} STATIC IMPORTED) set_target_properties(${LIB_NAME} PROPERTIES IMPORTED_LOCATION "${ARG_STATIC_LIB}") - message("Added static library dependency ${LIB_NAME}: ${ARG_STATIC_LIB}") + message(STATUS "Added static library dependency ${LIB_NAME}: ${ARG_STATIC_LIB}") else() add_library(${LIB_NAME} SHARED IMPORTED) set_target_properties(${LIB_NAME} PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}") - message("Added shared library dependency ${LIB_NAME}: ${ARG_SHARED_LIB}") + message(STATUS "Added shared library dependency ${LIB_NAME}: ${ARG_SHARED_LIB}") endif() if(ARG_DEPS) @@ -443,12 +443,12 @@ set(LINK_LIBS arrow_io ) -if(PARQUET_FOUND AND ARROW_PARQUET_FOUND) - ADD_THIRDPARTY_LIB(arrow_parquet - SHARED_LIB ${ARROW_PARQUET_SHARED_LIB}) +if(PARQUET_FOUND AND PARQUET_ARROW_FOUND) + ADD_THIRDPARTY_LIB(parquet_arrow + SHARED_LIB ${PARQUET_ARROW_SHARED_LIB}) set(LINK_LIBS ${LINK_LIBS} - arrow_parquet) + parquet_arrow) set(CYTHON_EXTENSIONS ${CYTHON_EXTENSIONS} parquet) diff --git a/python/cmake_modules/FindArrow.cmake b/python/cmake_modules/FindArrow.cmake index 5d5efc431a48f..9919746520b4c 100644 --- a/python/cmake_modules/FindArrow.cmake +++ b/python/cmake_modules/FindArrow.cmake @@ -42,11 +42,6 @@ find_library(ARROW_LIB_PATH NAMES arrow ${ARROW_SEARCH_LIB_PATH} NO_DEFAULT_PATH) -find_library(ARROW_PARQUET_LIB_PATH NAMES arrow_parquet - PATHS - ${ARROW_SEARCH_LIB_PATH} - NO_DEFAULT_PATH) - find_library(ARROW_IO_LIB_PATH NAMES arrow_io PATHS ${ARROW_SEARCH_LIB_PATH} @@ -56,7 +51,6 @@ if (ARROW_INCLUDE_DIR AND ARROW_LIB_PATH) set(ARROW_FOUND TRUE) set(ARROW_LIB_NAME libarrow) set(ARROW_IO_LIB_NAME libarrow_io) - set(ARROW_PARQUET_LIB_NAME libarrow_parquet) set(ARROW_LIBS ${ARROW_SEARCH_LIB_PATH}) set(ARROW_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_LIB_NAME}.a) @@ -82,20 +76,6 @@ else () set(ARROW_FOUND FALSE) endif () 
-if(ARROW_PARQUET_LIB_PATH) - set(ARROW_PARQUET_FOUND TRUE) - set(ARROW_PARQUET_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_PARQUET_LIB_NAME}.a) - set(ARROW_PARQUET_SHARED_LIB ${ARROW_LIBS}/${ARROW_PARQUET_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) - if (NOT Arrow_FIND_QUIETLY) - message(STATUS "Found the Arrow Parquet library: ${ARROW_PARQUET_LIB_PATH}") - endif () -else() - if (NOT Arrow_FIND_QUIETLY) - message(STATUS "Could not find Arrow Parquet library") - endif() - set(ARROW_PARQUET_FOUND FALSE) -endif() - mark_as_advanced( ARROW_INCLUDE_DIR ARROW_LIBS @@ -103,6 +83,4 @@ mark_as_advanced( ARROW_SHARED_LIB ARROW_IO_STATIC_LIB ARROW_IO_SHARED_LIB - ARROW_PARQUET_STATIC_LIB - ARROW_PARQUET_SHARED_LIB ) diff --git a/python/pyarrow/includes/parquet.pxd b/python/pyarrow/includes/parquet.pxd index f932a93149354..9085b0bb29866 100644 --- a/python/pyarrow/includes/parquet.pxd +++ b/python/pyarrow/includes/parquet.pxd @@ -44,6 +44,7 @@ cdef extern from "parquet/api/schema.h" namespace "parquet" nogil: cdef cppclass ColumnDescriptor: pass + cdef extern from "parquet/api/reader.h" namespace "parquet" nogil: cdef cppclass ColumnReader: pass @@ -77,6 +78,7 @@ cdef extern from "parquet/api/reader.h" namespace "parquet" nogil: @staticmethod unique_ptr[ParquetFileReader] OpenFile(const c_string& path) + cdef extern from "parquet/api/writer.h" namespace "parquet" nogil: cdef cppclass ParquetOutputStream" parquet::OutputStream": pass @@ -91,7 +93,7 @@ cdef extern from "parquet/api/writer.h" namespace "parquet" nogil: shared_ptr[WriterProperties] build() -cdef extern from "arrow/parquet/io.h" namespace "arrow::parquet" nogil: +cdef extern from "parquet/arrow/io.h" namespace "parquet::arrow" nogil: cdef cppclass ParquetAllocator: ParquetAllocator() ParquetAllocator(MemoryPool* pool) @@ -103,7 +105,7 @@ cdef extern from "arrow/parquet/io.h" namespace "arrow::parquet" nogil: Open(const shared_ptr[ReadableFileInterface]& file) -cdef extern from "arrow/parquet/reader.h" namespace "arrow::parquet" nogil: +cdef extern from "parquet/arrow/reader.h" namespace "parquet::arrow" nogil: CStatus OpenFile(const shared_ptr[ReadableFileInterface]& file, ParquetAllocator* allocator, unique_ptr[FileReader]* reader) @@ -113,14 +115,14 @@ cdef extern from "arrow/parquet/reader.h" namespace "arrow::parquet" nogil: CStatus ReadFlatTable(shared_ptr[CTable]* out); -cdef extern from "arrow/parquet/schema.h" namespace "arrow::parquet" nogil: +cdef extern from "parquet/arrow/schema.h" namespace "parquet::arrow" nogil: CStatus FromParquetSchema(const SchemaDescriptor* parquet_schema, shared_ptr[CSchema]* out) CStatus ToParquetSchema(const CSchema* arrow_schema, shared_ptr[SchemaDescriptor]* out) -cdef extern from "arrow/parquet/writer.h" namespace "arrow::parquet" nogil: +cdef extern from "parquet/arrow/writer.h" namespace "parquet::arrow" nogil: cdef CStatus WriteFlatTable( const CTable* table, MemoryPool* pool, const shared_ptr[ParquetOutputStream]& sink, From 45d88328dd73a331b8099c07dc1332cc585ff8d2 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 27 Sep 2016 09:45:05 -0400 Subject: [PATCH 0143/1644] ARROW-293: [C++] Implement Arrow IO interfaces for operating system files I started with the code I put together previously for Feather and conformed it to the `arrow::io` API. There's a bunch of Windows compatibility stuff; I left this until we add CI for Windows and can sort this out. We should also refactor the memory mapped file interfaces to be based on this common code (see ARROW-294). 
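
[Editor's sketch, not part of the patch] To see how the pieces of this commit are meant to compose, here is a minimal round-trip sketch against the API it introduces. Only FileOutputStream::Open/Write/Close, ReadableFile::Open/GetSize/ReadAt/Close, and the RETURN_NOT_OK macro come from the patch itself; the RoundTrip helper, the anonymous-namespace wrapper, and the "testdata" payload are illustrative assumptions.

// Illustrative only: round-trips a few bytes through the new OS-file classes.
#include <cstdint>
#include <cstring>
#include <memory>
#include <string>

#include "arrow/io/file.h"
#include "arrow/util/status.h"

namespace {

// Hypothetical helper, not part of the patch.
arrow::Status RoundTrip(const std::string& path) {
  std::shared_ptr<arrow::io::FileOutputStream> out;
  RETURN_NOT_OK(arrow::io::FileOutputStream::Open(path, &out));
  const char* data = "testdata";
  RETURN_NOT_OK(out->Write(reinterpret_cast<const uint8_t*>(data),
      static_cast<int64_t>(std::strlen(data))));
  RETURN_NOT_OK(out->Close());

  std::shared_ptr<arrow::io::ReadableFile> in;
  RETURN_NOT_OK(arrow::io::ReadableFile::Open(path, &in));
  int64_t size = 0;
  RETURN_NOT_OK(in->GetSize(&size));  // 8, for the bytes written above

  uint8_t buffer[8];
  int64_t bytes_read = 0;
  // Reads the trailing "data" bytes starting at offset 4.
  RETURN_NOT_OK(in->ReadAt(4, 4, &bytes_read, buffer));
  return in->Close();
}

}  // namespace

Note that in this patch ReadAt is implemented as Seek followed by Read, so it advances the file position; the io-file-test.cc cases below assert exactly that behavior.
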
Author: Wes McKinney Closes #146 from wesm/ARROW-293 and squashes the following commits: a2653b7 [Wes McKinney] cpplint d56ef06 [Wes McKinney] Test the rest of ReadableFile methods 43126ca [Wes McKinney] Drafting OS file IO implementations based on Feather implementation. Work on test suite --- cpp/CMakeLists.txt | 2 +- cpp/src/arrow/io/CMakeLists.txt | 6 + cpp/src/arrow/io/file.cc | 485 ++++++++++++++++++++++++++ cpp/src/arrow/io/file.h | 96 +++++ cpp/src/arrow/io/io-file-test.cc | 290 +++++++++++++++ cpp/src/arrow/io/libhdfs_shim.cc | 2 +- cpp/src/arrow/io/memory.h | 2 +- cpp/src/arrow/io/mman.h | 189 ++++++++++ cpp/src/arrow/ipc/adapter.cc | 4 +- cpp/src/arrow/ipc/file.cc | 2 +- cpp/src/arrow/types/primitive-test.cc | 3 +- cpp/src/arrow/util/logging.h | 6 +- cpp/src/arrow/util/memory-pool.cc | 4 +- cpp/src/arrow/util/status-test.cc | 2 +- 14 files changed, 1080 insertions(+), 13 deletions(-) create mode 100644 cpp/src/arrow/io/file.cc create mode 100644 cpp/src/arrow/io/file.h create mode 100644 cpp/src/arrow/io/io-file-test.cc create mode 100644 cpp/src/arrow/io/mman.h diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index f3f4a7dac0100..d65c715319694 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -570,7 +570,7 @@ if (UNIX) add_custom_target(lint ${BUILD_SUPPORT_DIR}/cpplint.py --verbose=2 --linelength=90 - --filter=-whitespace/comments,-readability/todo,-build/header_guard,-build/c++11,-runtime/references + --filter=-whitespace/comments,-readability/todo,-build/header_guard,-build/c++11,-runtime/references,-build/include_order ${FILTERED_LINT_FILES}) endif (UNIX) diff --git a/cpp/src/arrow/io/CMakeLists.txt b/cpp/src/arrow/io/CMakeLists.txt index 87e227ef80d80..d2e3491b75f12 100644 --- a/cpp/src/arrow/io/CMakeLists.txt +++ b/cpp/src/arrow/io/CMakeLists.txt @@ -38,6 +38,7 @@ set(ARROW_IO_TEST_LINK_LIBS ${ARROW_IO_PRIVATE_LINK_LIBS}) set(ARROW_IO_SRCS + file.cc memory.cc ) @@ -103,12 +104,17 @@ if (APPLE) INSTALL_NAME_DIR "@rpath") endif() +ADD_ARROW_TEST(io-file-test) +ARROW_TEST_LINK_LIBRARIES(io-file-test + ${ARROW_IO_TEST_LINK_LIBS}) + ADD_ARROW_TEST(io-memory-test) ARROW_TEST_LINK_LIBRARIES(io-memory-test ${ARROW_IO_TEST_LINK_LIBS}) # Headers: top level install(FILES + file.h hdfs.h interfaces.h memory.h diff --git a/cpp/src/arrow/io/file.cc b/cpp/src/arrow/io/file.cc new file mode 100644 index 0000000000000..87bae7f3928ec --- /dev/null +++ b/cpp/src/arrow/io/file.cc @@ -0,0 +1,485 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +// Ensure 64-bit off_t for platforms where it matters +#ifdef _FILE_OFFSET_BITS +#undef _FILE_OFFSET_BITS +#endif + +#define _FILE_OFFSET_BITS 64 + +#include "arrow/io/file.h" + +#if _WIN32 || _WIN64 +#if _WIN64 +#define ENVIRONMENT64 +#else +#define ENVIRONMENT32 +#endif +#endif + +// sys/mman.h not present in Visual Studio or Cygwin +#ifdef _WIN32 +#ifndef NOMINMAX +#define NOMINMAX +#endif +#include "arrow/io/mman.h" +#undef Realloc +#undef Free +#include +#else +#include +#endif + +#include +#include +#include + +#ifndef _MSC_VER // POSIX-like platforms + +#include + +// Not available on some platforms +#ifndef errno_t +#define errno_t int +#endif + +#endif // _MSC_VER + +// defines that +#if defined(__MINGW32__) +#define ARROW_WRITE_SHMODE S_IRUSR | S_IWUSR +#elif defined(_MSC_VER) // Visual Studio + +#else // gcc / clang on POSIX platforms +#define ARROW_WRITE_SHMODE S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH +#endif + +// ---------------------------------------------------------------------- +// C++ standard library + +#include +#include +#include +#include +#include +#include + +#if defined(_MSC_VER) +#include +#include +#endif + +// ---------------------------------------------------------------------- +// file compatibility stuff + +#if defined(__MINGW32__) // MinGW +// nothing +#elif defined(_MSC_VER) // Visual Studio +#include +#else // POSIX / Linux +// nothing +#endif + +#include + +// POSIX systems do not have this +#ifndef O_BINARY +#define O_BINARY 0 +#endif + +// ---------------------------------------------------------------------- +// Other Arrow includes + +#include "arrow/io/interfaces.h" + +#include "arrow/util/buffer.h" +#include "arrow/util/memory-pool.h" +#include "arrow/util/status.h" + +namespace arrow { +namespace io { + +// ---------------------------------------------------------------------- +// Cross-platform file compatability layer + +static inline Status CheckOpenResult( + int ret, int errno_actual, const char* filename, size_t filename_length) { + if (ret == -1) { + // TODO: errno codes to strings + std::stringstream ss; + ss << "Failed to open file: "; +#if defined(_MSC_VER) + // using wchar_t + + // this requires c++11 + std::wstring_convert, wchar_t> converter; + std::wstring wide_string( + reinterpret_cast(filename), filename_length / sizeof(wchar_t)); + std::string byte_string = converter.to_bytes(wide_string); + ss << byte_string; +#else + ss << filename; +#endif + return Status::IOError(ss.str()); + } + return Status::OK(); +} + +#define CHECK_LSEEK(retval) \ + if ((retval) == -1) return Status::IOError("lseek failed"); + +static inline int64_t lseek64_compat(int fd, int64_t pos, int whence) { +#if defined(_MSC_VER) + return _lseeki64(fd, pos, whence); +#else + return lseek(fd, pos, whence); +#endif +} + +static inline Status FileOpenReadable(const std::string& filename, int* fd) { + int ret; + errno_t errno_actual = 0; +#if defined(_MSC_VER) + // https://msdn.microsoft.com/en-us/library/w64k0ytk.aspx + + // See GH #209. 
Here we are assuming that the filename has been encoded in + // utf-16le so that unicode filenames can be supported + const int nwchars = static_cast(filename.size()) / sizeof(wchar_t); + std::vector wpath(nwchars + 1); + memcpy(wpath.data(), filename.data(), filename.size()); + memcpy(wpath.data() + nwchars, L"\0", sizeof(wchar_t)); + + errno_actual = _wsopen_s(fd, wpath.data(), _O_RDONLY | _O_BINARY, _SH_DENYNO, _S_IREAD); + ret = *fd; +#else + ret = *fd = open(filename.c_str(), O_RDONLY | O_BINARY); + errno_actual = errno; +#endif + + return CheckOpenResult(ret, errno_actual, filename.c_str(), filename.size()); +} + +static inline Status FileOpenWriteable(const std::string& filename, int* fd) { + int ret; + errno_t errno_actual = 0; + +#if defined(_MSC_VER) + // https://msdn.microsoft.com/en-us/library/w64k0ytk.aspx + // Same story with wchar_t as above + const int nwchars = static_cast(filename.size()) / sizeof(wchar_t); + std::vector wpath(nwchars + 1); + memcpy(wpath.data(), filename.data(), filename.size()); + memcpy(wpath.data() + nwchars, L"\0", sizeof(wchar_t)); + + errno_actual = _wsopen_s( + fd, wpath.data(), _O_WRONLY | _O_CREAT | _O_BINARY, _SH_DENYNO, _S_IWRITE); + ret = *fd; + +#else + ret = *fd = open(filename.c_str(), O_WRONLY | O_CREAT | O_BINARY, ARROW_WRITE_SHMODE); +#endif + return CheckOpenResult(ret, errno_actual, filename.c_str(), filename.size()); +} + +static inline Status FileTell(int fd, int64_t* pos) { + int64_t current_pos; + +#if defined(_MSC_VER) + current_pos = _telli64(fd); + if (current_pos == -1) { return Status::IOError("_telli64 failed"); } +#else + current_pos = lseek64_compat(fd, 0, SEEK_CUR); + CHECK_LSEEK(current_pos); +#endif + + *pos = current_pos; + return Status::OK(); +} + +static inline Status FileSeek(int fd, int64_t pos) { + int64_t ret = lseek64_compat(fd, pos, SEEK_SET); + CHECK_LSEEK(ret); + return Status::OK(); +} + +static inline Status FileRead( + int fd, uint8_t* buffer, int64_t nbytes, int64_t* bytes_read) { +#if defined(_MSC_VER) + if (nbytes > INT32_MAX) { return Status::IOError("Unable to read > 2GB blocks yet"); } + *bytes_read = _read(fd, buffer, static_cast(nbytes)); +#else + *bytes_read = read(fd, buffer, nbytes); +#endif + + if (*bytes_read == -1) { + // TODO(wesm): errno to string + return Status::IOError("Error reading bytes from file"); + } + + return Status::OK(); +} + +static inline Status FileWrite(int fd, const uint8_t* buffer, int64_t nbytes) { + int ret; +#if defined(_MSC_VER) + if (nbytes > INT32_MAX) { + return Status::IOError("Unable to write > 2GB blocks to file yet"); + } + ret = _write(fd, buffer, static_cast(nbytes)); +#else + ret = write(fd, buffer, nbytes); +#endif + + if (ret == -1) { + // TODO(wesm): errno to string + return Status::IOError("Error writing bytes to file"); + } + return Status::OK(); +} + +static inline Status FileGetSize(int fd, int64_t* size) { + int64_t ret; + + // Save current position + int64_t current_position = lseek64_compat(fd, 0, SEEK_CUR); + CHECK_LSEEK(current_position); + + // move to end of the file + ret = lseek64_compat(fd, 0, SEEK_END); + CHECK_LSEEK(ret); + + // Get file length + ret = lseek64_compat(fd, 0, SEEK_CUR); + CHECK_LSEEK(ret); + + *size = ret; + + // Restore file position + ret = lseek64_compat(fd, current_position, SEEK_SET); + CHECK_LSEEK(ret); + + return Status::OK(); +} + +static inline Status FileClose(int fd) { + int ret; + +#if defined(_MSC_VER) + ret = _close(fd); +#else + ret = close(fd); +#endif + + if (ret == -1) { return Status::IOError("error 
closing file"); } + return Status::OK(); +} + +class OSFile { + public: + OSFile() : fd_(-1), is_open_(false), size_(-1) {} + + ~OSFile() {} + + Status OpenWritable(const std::string& path) { + RETURN_NOT_OK(FileOpenWriteable(path, &fd_)); + path_ = path; + is_open_ = true; + return Status::OK(); + } + + Status OpenReadable(const std::string& path) { + RETURN_NOT_OK(FileOpenReadable(path, &fd_)); + RETURN_NOT_OK(FileGetSize(fd_, &size_)); + + // The position should be 0 after GetSize + // RETURN_NOT_OK(Seek(0)); + + path_ = path; + is_open_ = true; + return Status::OK(); + } + + Status Close() { + if (is_open_) { + RETURN_NOT_OK(FileClose(fd_)); + is_open_ = false; + } + return Status::OK(); + } + + Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) { + return FileRead(fd_, out, nbytes, bytes_read); + } + + Status Seek(int64_t pos) { + if (pos > size_) { pos = size_; } + return FileSeek(fd_, pos); + } + + Status Tell(int64_t* pos) const { return FileTell(fd_, pos); } + + Status Write(const uint8_t* data, int64_t length) { + if (length < 0) { return Status::IOError("Length must be non-negative"); } + return FileWrite(fd_, data, length); + } + + int fd() const { return fd_; } + + bool is_open() const { return is_open_; } + const std::string& path() const { return path_; } + + int64_t size() const { return size_; } + + private: + std::string path_; + + // File descriptor + int fd_; + + bool is_open_; + int64_t size_; +}; + +// ---------------------------------------------------------------------- +// ReadableFile implementation + +class ReadableFile::ReadableFileImpl : public OSFile { + public: + explicit ReadableFileImpl(MemoryPool* pool) : OSFile(), pool_(pool) {} + + Status Open(const std::string& path) { return OpenReadable(path); } + + Status ReadBuffer(int64_t nbytes, std::shared_ptr* out) { + auto buffer = std::make_shared(pool_); + RETURN_NOT_OK(buffer->Resize(nbytes)); + + int64_t bytes_read = 0; + RETURN_NOT_OK(Read(nbytes, &bytes_read, buffer->mutable_data())); + + // XXX: heuristic + if (bytes_read < nbytes / 2) { RETURN_NOT_OK(buffer->Resize(bytes_read)); } + + *out = buffer; + return Status::OK(); + } + + private: + MemoryPool* pool_; +}; + +ReadableFile::ReadableFile(MemoryPool* pool) { + impl_.reset(new ReadableFileImpl(pool)); +} + +ReadableFile::~ReadableFile() { + impl_->Close(); +} + +Status ReadableFile::Open(const std::string& path, std::shared_ptr* file) { + *file = std::shared_ptr(new ReadableFile(default_memory_pool())); + return (*file)->impl_->Open(path); +} + +Status ReadableFile::Open(const std::string& path, MemoryPool* memory_pool, + std::shared_ptr* file) { + *file = std::shared_ptr(new ReadableFile(memory_pool)); + return (*file)->impl_->Open(path); +} + +Status ReadableFile::Close() { + return impl_->Close(); +} + +Status ReadableFile::Tell(int64_t* pos) { + return impl_->Tell(pos); +} + +Status ReadableFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) { + return impl_->Read(nbytes, bytes_read, out); +} + +Status ReadableFile::ReadAt( + int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* out) { + RETURN_NOT_OK(Seek(position)); + return impl_->Read(nbytes, bytes_read, out); +} + +Status ReadableFile::ReadAt( + int64_t position, int64_t nbytes, std::shared_ptr* out) { + RETURN_NOT_OK(Seek(position)); + return impl_->ReadBuffer(nbytes, out); +} + +Status ReadableFile::GetSize(int64_t* size) { + *size = impl_->size(); + return Status::OK(); +} + +Status ReadableFile::Seek(int64_t pos) { + return impl_->Seek(pos); +} + +bool 
ReadableFile::supports_zero_copy() const { + return false; +} + +int ReadableFile::file_descriptor() const { + return impl_->fd(); +} + +// ---------------------------------------------------------------------- +// FileOutputStream + +class FileOutputStream::FileOutputStreamImpl : public OSFile { + public: + Status Open(const std::string& path) { return OpenWritable(path); } +}; + +FileOutputStream::FileOutputStream() { + impl_.reset(new FileOutputStreamImpl()); +} + +FileOutputStream::~FileOutputStream() { + impl_->Close(); +} + +Status FileOutputStream::Open( + const std::string& path, std::shared_ptr* file) { + // private ctor + *file = std::shared_ptr(new FileOutputStream()); + return (*file)->impl_->Open(path); +} + +Status FileOutputStream::Close() { + return impl_->Close(); +} + +Status FileOutputStream::Tell(int64_t* pos) { + return impl_->Tell(pos); +} + +Status FileOutputStream::Write(const uint8_t* data, int64_t length) { + return impl_->Write(data, length); +} + +int FileOutputStream::file_descriptor() const { + return impl_->fd(); +} + +} // namespace io +} // namespace arrow diff --git a/cpp/src/arrow/io/file.h b/cpp/src/arrow/io/file.h new file mode 100644 index 0000000000000..5e714ea966790 --- /dev/null +++ b/cpp/src/arrow/io/file.h @@ -0,0 +1,96 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +// IO interface implementations for OS files + +#ifndef ARROW_IO_FILE_H +#define ARROW_IO_FILE_H + +#include +#include +#include + +#include "arrow/io/interfaces.h" +#include "arrow/util/macros.h" +#include "arrow/util/visibility.h" + +namespace arrow { + +class Buffer; +class MemoryPool; +class Status; + +namespace io { + +class ARROW_EXPORT FileOutputStream : public OutputStream { + public: + ~FileOutputStream(); + + static Status Open(const std::string& path, std::shared_ptr* file); + + // OutputStream interface + Status Close() override; + Status Tell(int64_t* position) override; + Status Write(const uint8_t* data, int64_t nbytes) override; + + int file_descriptor() const; + + private: + FileOutputStream(); + + class ARROW_NO_EXPORT FileOutputStreamImpl; + std::unique_ptr impl_; +}; + +// Operating system file +class ARROW_EXPORT ReadableFile : public ReadableFileInterface { + public: + ~ReadableFile(); + + // Open file, allocate memory (if needed) from default memory pool + static Status Open(const std::string& path, std::shared_ptr* file); + + // Open file with one's own memory pool for memory allocations + static Status Open(const std::string& path, MemoryPool* memory_pool, + std::shared_ptr* file); + + Status Close() override; + Status Tell(int64_t* position) override; + + Status ReadAt( + int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override; + Status ReadAt(int64_t position, int64_t nbytes, std::shared_ptr* out) override; + + Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override; + Status GetSize(int64_t* size) override; + Status Seek(int64_t position) override; + + bool supports_zero_copy() const override; + + int file_descriptor() const; + + private: + explicit ReadableFile(MemoryPool* pool); + + class ARROW_NO_EXPORT ReadableFileImpl; + std::unique_ptr impl_; +}; + +} // namespace io +} // namespace arrow + +#endif // ARROW_IO_FILE_H diff --git a/cpp/src/arrow/io/io-file-test.cc b/cpp/src/arrow/io/io-file-test.cc new file mode 100644 index 0000000000000..cde769ffb6155 --- /dev/null +++ b/cpp/src/arrow/io/io-file-test.cc @@ -0,0 +1,290 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include +#include +#include +#include +#include +#include +#include +#include + +#include "gtest/gtest.h" + +#include "arrow/io/file.h" +#include "arrow/io/test-common.h" +#include "arrow/util/memory-pool.h" + +namespace arrow { +namespace io { + +static bool FileExists(const std::string& path) { + return std::ifstream(path.c_str()).good(); +} + +static bool FileIsClosed(int fd) { + if (-1 != fcntl(fd, F_GETFD)) { return false; } + return errno == EBADF; +} + +class FileTestFixture : public ::testing::Test { + public: + void SetUp() { + path_ = "arrow-test-io-file-output-stream.txt"; + EnsureFileDeleted(); + } + + void TearDown() { EnsureFileDeleted(); } + + void EnsureFileDeleted() { + if (FileExists(path_)) { std::remove(path_.c_str()); } + } + + protected: + std::string path_; +}; + +// ---------------------------------------------------------------------- +// File output tests + +class TestFileOutputStream : public FileTestFixture { + public: + void OpenFile() { ASSERT_OK(FileOutputStream::Open(path_, &file_)); } + + protected: + std::shared_ptr file_; +}; + +TEST_F(TestFileOutputStream, DestructorClosesFile) { + int fd; + { + std::shared_ptr file; + ASSERT_OK(FileOutputStream::Open(path_, &file)); + fd = file->file_descriptor(); + } + ASSERT_TRUE(FileIsClosed(fd)); +} + +TEST_F(TestFileOutputStream, Close) { + OpenFile(); + + const char* data = "testdata"; + ASSERT_OK(file_->Write(reinterpret_cast(data), strlen(data))); + + int fd = file_->file_descriptor(); + file_->Close(); + + ASSERT_TRUE(FileIsClosed(fd)); + + // Idempotent + file_->Close(); + + std::shared_ptr rd_file; + ASSERT_OK(ReadableFile::Open(path_, &rd_file)); + + int64_t size = 0; + ASSERT_OK(rd_file->GetSize(&size)); + ASSERT_EQ(strlen(data), size); +} + +TEST_F(TestFileOutputStream, InvalidWrites) { + OpenFile(); + + const char* data = ""; + + ASSERT_RAISES(IOError, file_->Write(reinterpret_cast(data), -1)); +} + +TEST_F(TestFileOutputStream, Tell) { + OpenFile(); + + int64_t position; + + ASSERT_OK(file_->Tell(&position)); + ASSERT_EQ(0, position); + + const char* data = "testdata"; + ASSERT_OK(file_->Write(reinterpret_cast(data), 8)); + ASSERT_OK(file_->Tell(&position)); + ASSERT_EQ(8, position); +} + +// ---------------------------------------------------------------------- +// File input tests + +class TestReadableFile : public FileTestFixture { + public: + void OpenFile() { ASSERT_OK(ReadableFile::Open(path_, &file_)); } + + void MakeTestFile() { + std::string data = "testdata"; + std::ofstream stream; + stream.open(path_.c_str()); + stream << data; + } + + protected: + std::shared_ptr file_; +}; + +TEST_F(TestReadableFile, DestructorClosesFile) { + MakeTestFile(); + + int fd; + { + std::shared_ptr file; + ASSERT_OK(ReadableFile::Open(path_, &file)); + fd = file->file_descriptor(); + } + ASSERT_TRUE(FileIsClosed(fd)); +} + +TEST_F(TestReadableFile, Close) { + MakeTestFile(); + OpenFile(); + + int fd = file_->file_descriptor(); + file_->Close(); + + ASSERT_TRUE(FileIsClosed(fd)); + + // Idempotent + file_->Close(); +} + +TEST_F(TestReadableFile, SeekTellSize) { + MakeTestFile(); + OpenFile(); + + int64_t position; + ASSERT_OK(file_->Tell(&position)); + ASSERT_EQ(0, position); + + ASSERT_OK(file_->Seek(4)); + ASSERT_OK(file_->Tell(&position)); + ASSERT_EQ(4, position); + + ASSERT_OK(file_->Seek(100)); + ASSERT_OK(file_->Tell(&position)); + + // now at EOF + ASSERT_EQ(8, position); + + int64_t size; + ASSERT_OK(file_->GetSize(&size)); + ASSERT_EQ(8, size); + + // does not support zero copy + 
ASSERT_FALSE(file_->supports_zero_copy()); +} + +TEST_F(TestReadableFile, Read) { + uint8_t buffer[50]; + + MakeTestFile(); + OpenFile(); + + int64_t bytes_read; + ASSERT_OK(file_->Read(4, &bytes_read, buffer)); + ASSERT_EQ(4, bytes_read); + ASSERT_EQ(0, std::memcmp(buffer, "test", 4)); + + ASSERT_OK(file_->Read(10, &bytes_read, buffer)); + ASSERT_EQ(4, bytes_read); + ASSERT_EQ(0, std::memcmp(buffer, "data", 4)); +} + +TEST_F(TestReadableFile, ReadAt) { + uint8_t buffer[50]; + const char* test_data = "testdata"; + + MakeTestFile(); + OpenFile(); + + int64_t bytes_read; + int64_t position; + + ASSERT_OK(file_->ReadAt(0, 4, &bytes_read, buffer)); + ASSERT_EQ(4, bytes_read); + ASSERT_EQ(0, std::memcmp(buffer, "test", 4)); + + // position advanced + ASSERT_OK(file_->Tell(&position)); + ASSERT_EQ(4, position); + + ASSERT_OK(file_->ReadAt(4, 10, &bytes_read, buffer)); + ASSERT_EQ(4, bytes_read); + ASSERT_EQ(0, std::memcmp(buffer, "data", 4)); + + // position advanced to EOF + ASSERT_OK(file_->Tell(&position)); + ASSERT_EQ(8, position); + + // Check buffer API + std::shared_ptr buffer2; + + ASSERT_OK(file_->ReadAt(0, 4, &buffer2)); + ASSERT_EQ(4, buffer2->size()); + + Buffer expected(reinterpret_cast(test_data), 4); + ASSERT_TRUE(buffer2->Equals(expected)); + + // position advanced + ASSERT_OK(file_->Tell(&position)); + ASSERT_EQ(4, position); +} + +TEST_F(TestReadableFile, NonExistentFile) { + ASSERT_RAISES(IOError, ReadableFile::Open("0xDEADBEEF.txt", &file_)); +} + +class MyMemoryPool : public MemoryPool { + public: + MyMemoryPool() : num_allocations_(0) {} + + Status Allocate(int64_t size, uint8_t** out) override { + *out = reinterpret_cast(std::malloc(size)); + ++num_allocations_; + return Status::OK(); + } + + void Free(uint8_t* buffer, int64_t size) override { std::free(buffer); } + + int64_t bytes_allocated() const override { return -1; } + + int64_t num_allocations() const { return num_allocations_; } + + private: + int64_t num_allocations_; +}; + +TEST_F(TestReadableFile, CustomMemoryPool) { + MakeTestFile(); + + MyMemoryPool pool; + ASSERT_OK(ReadableFile::Open(path_, &pool, &file_)); + + std::shared_ptr buffer; + ASSERT_OK(file_->ReadAt(0, 4, &buffer)); + ASSERT_OK(file_->ReadAt(4, 8, &buffer)); + + ASSERT_EQ(2, pool.num_allocations()); +} + +} // namespace io +} // namespace arrow diff --git a/cpp/src/arrow/io/libhdfs_shim.cc b/cpp/src/arrow/io/libhdfs_shim.cc index 0b805abf94c1b..f256c31b4f4b2 100644 --- a/cpp/src/arrow/io/libhdfs_shim.cc +++ b/cpp/src/arrow/io/libhdfs_shim.cc @@ -33,8 +33,8 @@ #ifndef _WIN32 #include #else -#include #include +#include // TODO(wesm): address when/if we add windows support // #include diff --git a/cpp/src/arrow/io/memory.h b/cpp/src/arrow/io/memory.h index 51601a0a62678..6989d732ca752 100644 --- a/cpp/src/arrow/io/memory.h +++ b/cpp/src/arrow/io/memory.h @@ -94,7 +94,7 @@ class ARROW_EXPORT MemoryMappedFile : public ReadWriteFileInterface { Status WriteInternal(const uint8_t* data, int64_t nbytes); // Hide the internal details of this class for now - class MemoryMappedFileImpl; + class ARROW_NO_EXPORT MemoryMappedFileImpl; std::unique_ptr impl_; }; diff --git a/cpp/src/arrow/io/mman.h b/cpp/src/arrow/io/mman.h new file mode 100644 index 0000000000000..00d1f93601df3 --- /dev/null +++ b/cpp/src/arrow/io/mman.h @@ -0,0 +1,189 @@ +// Copyright https://code.google.com/p/mman-win32/ +// +// Licensed under the MIT License; +// You may obtain a copy of the License at +// +// https://opensource.org/licenses/MIT + +#ifndef _MMAN_WIN32_H +#define _MMAN_WIN32_H 
+ +// Allow use of features specific to Windows XP or later. +#ifndef _WIN32_WINNT +// Change this to the appropriate value to target other versions of Windows. +#define _WIN32_WINNT 0x0501 + +#endif + +#include +#include +#include +#include + +#define PROT_NONE 0 +#define PROT_READ 1 +#define PROT_WRITE 2 +#define PROT_EXEC 4 + +#define MAP_FILE 0 +#define MAP_SHARED 1 +#define MAP_PRIVATE 2 +#define MAP_TYPE 0xf +#define MAP_FIXED 0x10 +#define MAP_ANONYMOUS 0x20 +#define MAP_ANON MAP_ANONYMOUS + +#define MAP_FAILED ((void*)-1) + +/* Flags for msync. */ +#define MS_ASYNC 1 +#define MS_SYNC 2 +#define MS_INVALIDATE 4 + +#ifndef FILE_MAP_EXECUTE +#define FILE_MAP_EXECUTE 0x0020 +#endif + +static int __map_mman_error(const DWORD err, const int deferr) { + if (err == 0) return 0; + // TODO: implement + return err; +} + +static DWORD __map_mmap_prot_page(const int prot) { + DWORD protect = 0; + + if (prot == PROT_NONE) return protect; + + if ((prot & PROT_EXEC) != 0) { + protect = ((prot & PROT_WRITE) != 0) ? PAGE_EXECUTE_READWRITE : PAGE_EXECUTE_READ; + } else { + protect = ((prot & PROT_WRITE) != 0) ? PAGE_READWRITE : PAGE_READONLY; + } + + return protect; +} + +static DWORD __map_mmap_prot_file(const int prot) { + DWORD desiredAccess = 0; + + if (prot == PROT_NONE) return desiredAccess; + + if ((prot & PROT_READ) != 0) desiredAccess |= FILE_MAP_READ; + if ((prot & PROT_WRITE) != 0) desiredAccess |= FILE_MAP_WRITE; + if ((prot & PROT_EXEC) != 0) desiredAccess |= FILE_MAP_EXECUTE; + + return desiredAccess; +} + +void* mmap(void* addr, size_t len, int prot, int flags, int fildes, off_t off) { + HANDLE fm, h; + + void* map = MAP_FAILED; + +#ifdef _MSC_VER +#pragma warning(push) +#pragma warning(disable : 4293) +#endif + + const DWORD dwFileOffsetLow = + (sizeof(off_t) <= sizeof(DWORD)) ? (DWORD)off : (DWORD)(off & 0xFFFFFFFFL); + const DWORD dwFileOffsetHigh = + (sizeof(off_t) <= sizeof(DWORD)) ? (DWORD)0 : (DWORD)((off >> 32) & 0xFFFFFFFFL); + const DWORD protect = __map_mmap_prot_page(prot); + const DWORD desiredAccess = __map_mmap_prot_file(prot); + + const off_t maxSize = off + (off_t)len; + + const DWORD dwMaxSizeLow = + (sizeof(off_t) <= sizeof(DWORD)) ? (DWORD)maxSize : (DWORD)(maxSize & 0xFFFFFFFFL); + const DWORD dwMaxSizeHigh = (sizeof(off_t) <= sizeof(DWORD)) + ? (DWORD)0 + : (DWORD)((maxSize >> 32) & 0xFFFFFFFFL); + +#ifdef _MSC_VER +#pragma warning(pop) +#endif + + errno = 0; + + if (len == 0 + /* Unsupported flag combinations */ + || (flags & MAP_FIXED) != 0 + /* Usupported protection combinations */ + || prot == PROT_EXEC) { + errno = EINVAL; + return MAP_FAILED; + } + + h = ((flags & MAP_ANONYMOUS) == 0) ? 
(HANDLE)_get_osfhandle(fildes) + : INVALID_HANDLE_VALUE; + + if ((flags & MAP_ANONYMOUS) == 0 && h == INVALID_HANDLE_VALUE) { + errno = EBADF; + return MAP_FAILED; + } + + fm = CreateFileMapping(h, NULL, protect, dwMaxSizeHigh, dwMaxSizeLow, NULL); + + if (fm == NULL) { + errno = __map_mman_error(GetLastError(), EPERM); + return MAP_FAILED; + } + + map = MapViewOfFile(fm, desiredAccess, dwFileOffsetHigh, dwFileOffsetLow, len); + + CloseHandle(fm); + + if (map == NULL) { + errno = __map_mman_error(GetLastError(), EPERM); + return MAP_FAILED; + } + + return map; +} + +int munmap(void* addr, size_t len) { + if (UnmapViewOfFile(addr)) return 0; + + errno = __map_mman_error(GetLastError(), EPERM); + + return -1; +} + +int mprotect(void* addr, size_t len, int prot) { + DWORD newProtect = __map_mmap_prot_page(prot); + DWORD oldProtect = 0; + + if (VirtualProtect(addr, len, newProtect, &oldProtect)) return 0; + + errno = __map_mman_error(GetLastError(), EPERM); + + return -1; +} + +int msync(void* addr, size_t len, int flags) { + if (FlushViewOfFile(addr, len)) return 0; + + errno = __map_mman_error(GetLastError(), EPERM); + + return -1; +} + +int mlock(const void* addr, size_t len) { + if (VirtualLock((LPVOID)addr, len)) return 0; + + errno = __map_mman_error(GetLastError(), EPERM); + + return -1; +} + +int munlock(const void* addr, size_t len) { + if (VirtualUnlock((LPVOID)addr, len)) return 0; + + errno = __map_mman_error(GetLastError(), EPERM); + + return -1; +} + +#endif diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index 89b7fb987c63d..99974a4a4c7b7 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -23,12 +23,12 @@ #include #include "arrow/array.h" +#include "arrow/io/interfaces.h" +#include "arrow/io/memory.h" #include "arrow/ipc/Message_generated.h" #include "arrow/ipc/metadata-internal.h" #include "arrow/ipc/metadata.h" #include "arrow/ipc/util.h" -#include "arrow/io/interfaces.h" -#include "arrow/io/memory.h" #include "arrow/schema.h" #include "arrow/table.h" #include "arrow/type.h" diff --git a/cpp/src/arrow/ipc/file.cc b/cpp/src/arrow/ipc/file.cc index 2bf10dde266bd..c68244d50258c 100644 --- a/cpp/src/arrow/ipc/file.cc +++ b/cpp/src/arrow/ipc/file.cc @@ -22,10 +22,10 @@ #include #include +#include "arrow/io/interfaces.h" #include "arrow/ipc/adapter.h" #include "arrow/ipc/metadata.h" #include "arrow/ipc/util.h" -#include "arrow/io/interfaces.h" #include "arrow/util/buffer.h" #include "arrow/util/logging.h" #include "arrow/util/status.h" diff --git a/cpp/src/arrow/types/primitive-test.cc b/cpp/src/arrow/types/primitive-test.cc index 87eb0fe3a8bf7..ffebb9269bdc3 100644 --- a/cpp/src/arrow/types/primitive-test.cc +++ b/cpp/src/arrow/types/primitive-test.cc @@ -238,7 +238,8 @@ void TestPrimitiveBuilder::Check( } typedef ::testing::Types Primitives; + PInt32, PInt64, PFloat, PDouble> + Primitives; TYPED_TEST_CASE(TestPrimitiveBuilder, Primitives); diff --git a/cpp/src/arrow/util/logging.h b/cpp/src/arrow/util/logging.h index d320d6adb7caa..b22f07dd6345f 100644 --- a/cpp/src/arrow/util/logging.h +++ b/cpp/src/arrow/util/logging.h @@ -117,10 +117,10 @@ class CerrLog { // return so we create a new class to give it a hint. 
class FatalLog : public CerrLog { public: - FatalLog(int /* severity */) // NOLINT - : CerrLog(ARROW_FATAL) {} + explicit FatalLog(int /* severity */) // NOLINT + : CerrLog(ARROW_FATAL){} // NOLINT - [[noreturn]] ~FatalLog() { + [[noreturn]] ~FatalLog() { if (has_logged_) { std::cerr << std::endl; } std::exit(1); } diff --git a/cpp/src/arrow/util/memory-pool.cc b/cpp/src/arrow/util/memory-pool.cc index fed149bc3598c..9f83afe4cb20f 100644 --- a/cpp/src/arrow/util/memory-pool.cc +++ b/cpp/src/arrow/util/memory-pool.cc @@ -17,13 +17,13 @@ #include "arrow/util/memory-pool.h" -#include #include #include #include +#include -#include "arrow/util/status.h" #include "arrow/util/logging.h" +#include "arrow/util/status.h" namespace arrow { diff --git a/cpp/src/arrow/util/status-test.cc b/cpp/src/arrow/util/status-test.cc index 45e0ff361ac22..e0ff20fea1233 100644 --- a/cpp/src/arrow/util/status-test.cc +++ b/cpp/src/arrow/util/status-test.cc @@ -17,8 +17,8 @@ #include "gtest/gtest.h" -#include "arrow/util/status.h" #include "arrow/test-util.h" +#include "arrow/util/status.h" namespace arrow { From 03134b11ffd4f63bda2f3cb448713600df6d8fdb Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Tue, 27 Sep 2016 09:45:32 -0700 Subject: [PATCH 0144/1644] ARROW-270: Define more generic Interval logical type Author: Julien Le Dem Closes #144 from julienledem/interval and squashes the following commits: eb76fed [Julien Le Dem] ARROW-270: Define more generic Interval logical type --- format/Message.fbs | 10 ++++----- .../src/main/codegen/data/ArrowTypes.tdd | 8 ++----- .../templates/NullableValueVectors.java | 4 ++-- .../arrow/vector/schema/TypeLayout.java | 21 +++++++++++-------- .../org/apache/arrow/vector/types/Types.java | 14 ++++++------- 5 files changed, 27 insertions(+), 30 deletions(-) diff --git a/format/Message.fbs b/format/Message.fbs index 657904a7032a5..07da862c32d5d 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -63,10 +63,9 @@ table Timestamp { timezone: string; } -table IntervalDay { -} - -table IntervalYear { +enum IntervalUnit: short { YEAR_MONTH, DAY_TIME} +table Interval { + unit: IntervalUnit; } table JSONScalar { @@ -88,8 +87,7 @@ union Type { Date, Time, Timestamp, - IntervalDay, - IntervalYear, + Interval, List, Struct_, Union, diff --git a/java/vector/src/main/codegen/data/ArrowTypes.tdd b/java/vector/src/main/codegen/data/ArrowTypes.tdd index 5cb43bed2b69a..9f81f0e3800ed 100644 --- a/java/vector/src/main/codegen/data/ArrowTypes.tdd +++ b/java/vector/src/main/codegen/data/ArrowTypes.tdd @@ -69,12 +69,8 @@ fields: [{name: "timezone", type: "String"}] }, { - name: "IntervalDay", - fields: [] - }, - { - name: "IntervalYear", - fields: [] + name: "Interval", + fields: [{name: "unit", type: short}] } ] } diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index 486cfeefc7a3b..8f325afad3920 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -105,9 +105,9 @@ public final class ${className} extends BaseDataValueVector implements <#if type <#elseif minor.class == "TimeStamp"> field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(""), null); <#elseif minor.class == "IntervalDay"> - field = new Field(name, true, new IntervalDay(), null); + field = new Field(name, true, new Interval(org.apache.arrow.flatbuf.IntervalUnit.DAY_TIME), null); <#elseif minor.class == 
"IntervalYear"> - field = new Field(name, true, new IntervalYear(), null); + field = new Field(name, true, new Interval(org.apache.arrow.flatbuf.IntervalUnit.YEAR_MONTH), null); <#elseif minor.class == "VarChar"> field = new Field(name, true, new Utf8(), null); <#elseif minor.class == "VarBinary"> diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java index 885ac2ac3d7f2..072385a215582 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java @@ -31,6 +31,7 @@ import java.util.Collections; import java.util.List; +import org.apache.arrow.flatbuf.IntervalUnit; import org.apache.arrow.flatbuf.UnionMode; import org.apache.arrow.vector.types.pojo.ArrowType; import org.apache.arrow.vector.types.pojo.ArrowType.ArrowTypeVisitor; @@ -40,12 +41,11 @@ import org.apache.arrow.vector.types.pojo.ArrowType.Decimal; import org.apache.arrow.vector.types.pojo.ArrowType.FloatingPoint; import org.apache.arrow.vector.types.pojo.ArrowType.Int; -import org.apache.arrow.vector.types.pojo.ArrowType.IntervalDay; -import org.apache.arrow.vector.types.pojo.ArrowType.IntervalYear; +import org.apache.arrow.vector.types.pojo.ArrowType.Interval; import org.apache.arrow.vector.types.pojo.ArrowType.Null; +import org.apache.arrow.vector.types.pojo.ArrowType.Struct_; import org.apache.arrow.vector.types.pojo.ArrowType.Time; import org.apache.arrow.vector.types.pojo.ArrowType.Timestamp; -import org.apache.arrow.vector.types.pojo.ArrowType.Struct_; import org.apache.arrow.vector.types.pojo.ArrowType.Union; import org.apache.arrow.vector.types.pojo.ArrowType.Utf8; @@ -167,14 +167,17 @@ public TypeLayout visit(Time type) { } @Override - public TypeLayout visit(IntervalDay type) { // TODO: check size - return newFixedWidthTypeLayout(dataVector(64)); + public TypeLayout visit(Interval type) { // TODO: check size + switch (type.getUnit()) { + case IntervalUnit.DAY_TIME: + return newFixedWidthTypeLayout(dataVector(64)); + case IntervalUnit.YEAR_MONTH: + return newFixedWidthTypeLayout(dataVector(64)); + default: + throw new UnsupportedOperationException("Unknown unit " + type.getUnit()); + } } - @Override - public TypeLayout visit(IntervalYear type) { // TODO: check size - return newFixedWidthTypeLayout(dataVector(64)); - } }); return layout; } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java index 66ef7562ceda1..181d835368265 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -20,6 +20,7 @@ import java.util.HashMap; import java.util.Map; +import org.apache.arrow.flatbuf.IntervalUnit; import org.apache.arrow.flatbuf.Precision; import org.apache.arrow.flatbuf.Type; import org.apache.arrow.flatbuf.UnionMode; @@ -78,13 +79,12 @@ import org.apache.arrow.vector.types.pojo.ArrowType.Date; import org.apache.arrow.vector.types.pojo.ArrowType.FloatingPoint; import org.apache.arrow.vector.types.pojo.ArrowType.Int; -import org.apache.arrow.vector.types.pojo.ArrowType.IntervalDay; -import org.apache.arrow.vector.types.pojo.ArrowType.IntervalYear; +import org.apache.arrow.vector.types.pojo.ArrowType.Interval; import org.apache.arrow.vector.types.pojo.ArrowType.List; import org.apache.arrow.vector.types.pojo.ArrowType.Null; +import 
org.apache.arrow.vector.types.pojo.ArrowType.Struct_; import org.apache.arrow.vector.types.pojo.ArrowType.Time; import org.apache.arrow.vector.types.pojo.ArrowType.Timestamp; -import org.apache.arrow.vector.types.pojo.ArrowType.Struct_; import org.apache.arrow.vector.types.pojo.ArrowType.Union; import org.apache.arrow.vector.types.pojo.ArrowType.Utf8; import org.apache.arrow.vector.types.pojo.Field; @@ -104,8 +104,8 @@ public class Types { public static final Field DATE_FIELD = new Field("", true, Date.INSTANCE, null); public static final Field TIME_FIELD = new Field("", true, Time.INSTANCE, null); public static final Field TIMESTAMP_FIELD = new Field("", true, new Timestamp(""), null); - public static final Field INTERVALDAY_FIELD = new Field("", true, IntervalDay.INSTANCE, null); - public static final Field INTERVALYEAR_FIELD = new Field("", true, IntervalYear.INSTANCE, null); + public static final Field INTERVALDAY_FIELD = new Field("", true, new Interval(IntervalUnit.DAY_TIME), null); + public static final Field INTERVALYEAR_FIELD = new Field("", true, new Interval(IntervalUnit.YEAR_MONTH), null); public static final Field FLOAT4_FIELD = new Field("", true, new FloatingPoint(Precision.SINGLE), null); public static final Field FLOAT8_FIELD = new Field("", true, new FloatingPoint(Precision.DOUBLE), null); public static final Field LIST_FIELD = new Field("", true, List.INSTANCE, null); @@ -260,7 +260,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { return new TimeStampWriterImpl((NullableTimeStampVector) vector); } }, - INTERVALDAY(IntervalDay.INSTANCE) { + INTERVALDAY(new Interval(IntervalUnit.DAY_TIME)) { @Override public Field getField() { return INTERVALDAY_FIELD; @@ -276,7 +276,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { return new IntervalDayWriterImpl((NullableIntervalDayVector) vector); } }, - INTERVALYEAR(IntervalYear.INSTANCE) { + INTERVALYEAR(new Interval(IntervalUnit.YEAR_MONTH)) { @Override public Field getField() { return INTERVALYEAR_FIELD; From bae33d622421e6377ab3e9c81dd054c796ab48a3 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Tue, 27 Sep 2016 10:39:09 -0700 Subject: [PATCH 0145/1644] ARROW-304: NullableMapReaderImpl.isSet() always returns true Author: Julien Le Dem Closes #147 from julienledem/isSet and squashes the following commits: c06e048 [Julien Le Dem] review feedback 5a33785 [Julien Le Dem] review feedback af5d613 [Julien Le Dem] ARROW-304: NullableMapReaderImpl.isSet() always returns true --- .../complex/impl/NullableMapReaderImpl.java | 5 ++ .../vector/complex/impl/UnionListReader.java | 2 +- .../complex/writer/TestComplexWriter.java | 57 ++++++++++++++++--- 3 files changed, 55 insertions(+), 9 deletions(-) diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/NullableMapReaderImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/NullableMapReaderImpl.java index 18b35c194a184..7c389e61ae202 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/NullableMapReaderImpl.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/NullableMapReaderImpl.java @@ -42,4 +42,9 @@ public void copyAsField(String name, MapWriter writer){ NullableMapWriter impl = (NullableMapWriter) writer.map(name); impl.container.copyFromSafe(idx(), impl.idx(), nullableMapVector); } + + @Override + public boolean isSet(){ + return !nullableMapVector.getAccessor().isNull(idx()); + } } diff --git 
a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionListReader.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionListReader.java index 39cf00421154b..6c7c230226ea3 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionListReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionListReader.java @@ -41,7 +41,7 @@ public UnionListReader(ListVector vector) { @Override public boolean isSet() { - return true; + return !vector.getAccessor().isNull(idx()); } private int currentOffset; diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java index fa710dae5eee8..c1da104da5780 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java @@ -17,6 +17,14 @@ */ package org.apache.arrow.vector.complex.writer; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertFalse; +import static org.junit.Assert.assertNotNull; +import static org.junit.Assert.assertNull; +import static org.junit.Assert.assertTrue; + +import java.util.List; + import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; import org.apache.arrow.vector.complex.ListVector; @@ -77,28 +85,33 @@ public void nullableMap() { MapVector parent = new MapVector("parent", allocator, null); ComplexWriter writer = new ComplexWriterImpl("root", parent); MapWriter rootWriter = writer.rootAsMap(); - MapWriter mapWriter = rootWriter.map("map"); - BigIntWriter nested = mapWriter.bigInt("nested"); for (int i = 0; i < COUNT; i++) { + rootWriter.setPosition(i); + rootWriter.start(); if (i % 2 == 0) { + MapWriter mapWriter = rootWriter.map("map"); mapWriter.setPosition(i); mapWriter.start(); - nested.writeBigInt(i); + mapWriter.bigInt("nested").writeBigInt(i); mapWriter.end(); } + rootWriter.end(); } writer.setValueCount(COUNT); MapReader rootReader = new SingleMapReaderImpl(parent).reader("root"); for (int i = 0; i < COUNT; i++) { rootReader.setPosition(i); + assertTrue("index is set: " + i, rootReader.isSet()); + FieldReader map = rootReader.reader("map"); if (i % 2 == 0) { - Assert.assertNotNull(rootReader.reader("map").readObject()); - Assert.assertEquals(i, rootReader.reader("map").reader("nested").readLong().longValue()); + assertTrue("index is set: " + i, map.isSet()); + assertNotNull("index is set: " + i, map.readObject()); + assertEquals(i, map.reader("nested").readLong().longValue()); } else { - Assert.assertNull(rootReader.reader("map").readObject()); + assertFalse("index is not set: " + i, map.isSet()); + assertNull("index is not set: " + i, map.readObject()); } } - parent.close(); } @@ -121,11 +134,39 @@ public void listScalarType() { listReader.setPosition(i); for (int j = 0; j < i % 7; j++) { listReader.next(); - Assert.assertEquals(j, listReader.reader().readInteger().intValue()); + assertEquals(j, listReader.reader().readInteger().intValue()); } } } + @Test + public void listScalarTypeNullable() { + ListVector listVector = new ListVector("list", allocator, null); + listVector.allocateNew(); + UnionListWriter listWriter = new UnionListWriter(listVector); + for (int i = 0; i < COUNT; i++) { + if (i % 2 == 0) { + listWriter.setPosition(i); + listWriter.startList(); + for (int j = 0; j < i % 7; j++) { + 
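+          // each even-numbered row i receives the values 0 .. (i % 7) - 1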
+          listWriter.writeInt(j);
+        }
+        listWriter.endList();
+      }
+    }
+    listWriter.setValueCount(COUNT);
+    UnionListReader listReader = new UnionListReader(listVector);
+    for (int i = 0; i < COUNT; i++) {
+      listReader.setPosition(i);
+      if (i % 2 == 0) {
+        assertTrue("index is set: " + i, listReader.isSet());
+        assertEquals("correct length at: " + i, i % 7, ((List) listReader.readObject()).size());
+      } else {
+        assertFalse("index is not set: " + i, listReader.isSet());
+        assertNull("index is not set: " + i, listReader.readObject());
+      }
+    }
+  }
 
   @Test
   public void listMapType() {

From 768c7d0be7dde9942235b5312c1c46ab035af86b Mon Sep 17 00:00:00 2001
From: Julien Le Dem
Date: Tue, 27 Sep 2016 11:54:35 -0700
Subject: [PATCH 0146/1644] ARROW-257: Add a typeids Vector to Union type

Author: Julien Le Dem

Closes #143 from julienledem/union and squashes the following commits:

cd1b711 [Julien Le Dem] ARROW-257: Add a typeids Vector to Union type
---
 format/Message.fbs                                 |  5 +++
 .../src/main/codegen/data/ArrowTypes.tdd           |  2 +-
 .../src/main/codegen/templates/ArrowType.java      | 38 +++++++++++++++----
 .../main/codegen/templates/UnionVector.java        |  7 +++-
 .../org/apache/arrow/vector/types/Types.java       |  2 +-
 .../apache/arrow/vector/pojo/TestConvert.java      |  5 ++-
 6 files changed, 45 insertions(+), 14 deletions(-)

diff --git a/format/Message.fbs b/format/Message.fbs
index 07da862c32d5d..288f5a1b6b2d0 100644
--- a/format/Message.fbs
+++ b/format/Message.fbs
@@ -23,8 +23,13 @@ table List {
 
 enum UnionMode:short { Sparse, Dense }
 
+/// A union is a complex type with children in Field
+/// By default ids in the type vector refer to the offsets in the children
+/// optionally typeIds provides an indirection between the child offset and the type id
+/// for each child typeIds[offset] is the id used in the type vector
 table Union {
   mode: UnionMode;
+  typeIds: [ int ]; // optional, describes typeid of each child.
 }
 
 table Int {

diff --git a/java/vector/src/main/codegen/data/ArrowTypes.tdd b/java/vector/src/main/codegen/data/ArrowTypes.tdd
index 9f81f0e3800ed..9624fecf6aad1 100644
--- a/java/vector/src/main/codegen/data/ArrowTypes.tdd
+++ b/java/vector/src/main/codegen/data/ArrowTypes.tdd
@@ -30,7 +30,7 @@
     },
     {
       name: "Union",
-      fields: [{name: "mode", type: short}]
+      fields: [{name: "mode", type: short}, {name: "typeIds", type: "int[]"}]
     },
     {
       name: "Int",

diff --git a/java/vector/src/main/codegen/templates/ArrowType.java b/java/vector/src/main/codegen/templates/ArrowType.java
index 29dee20040a53..30f2c68efe0b3 100644
--- a/java/vector/src/main/codegen/templates/ArrowType.java
+++ b/java/vector/src/main/codegen/templates/ArrowType.java
@@ -33,12 +33,23 @@
 
 import java.util.Objects;
 
+/**
+ * Arrow types
+ **/
 public abstract class ArrowType {
 
   public abstract byte getTypeType();
   public abstract int getType(FlatBufferBuilder builder);
   public abstract <T> T accept(ArrowTypeVisitor<T> visitor);
+  /**
+   * to visit the ArrowTypes
+   *
+   * type.accept(new ArrowTypeVisitor<T>() {
+   *   ...
+ * }); + * + */ public static interface ArrowTypeVisitor { <#list arrowTypes.types as type> T visit(${type.name} type); @@ -55,9 +66,7 @@ public static class ${name} extends ArrowType { <#list fields as field> - <#assign fieldName = field.name> - <#assign fieldType = field.type> - ${fieldType} ${fieldName}; + ${field.type} ${field.name}; <#if type.fields?size != 0> @@ -79,6 +88,9 @@ public int getType(FlatBufferBuilder builder) { <#if field.type == "String"> int ${field.name} = builder.createString(this.${field.name}); + <#if field.type == "int[]"> + int ${field.name} = org.apache.arrow.flatbuf.${type.name}.create${field.name?cap_first}Vector(builder, this.${field.name}); + org.apache.arrow.flatbuf.${type.name}.start${type.name}(builder); <#list type.fields as field> @@ -96,7 +108,7 @@ public int getType(FlatBufferBuilder builder) { public String toString() { return "${name}{" <#list fields as field> - + ", " + ${field.name} + + <#if field.type == "int[]">java.util.Arrays.toString(${field.name})<#else>${field.name}<#if field_has_next> + ", " + "}"; } @@ -115,8 +127,7 @@ public boolean equals(Object obj) { return true; <#else> ${type.name} that = (${type.name}) obj; - return - <#list type.fields as field>Objects.equals(this.${field.name}, that.${field.name}) <#if field_has_next>&&<#else>; + return <#list type.fields as field>Objects.deepEquals(this.${field.name}, that.${field.name}) <#if field_has_next>&&<#else>; } @@ -134,9 +145,20 @@ public static org.apache.arrow.vector.types.pojo.ArrowType getTypeForField(org.a <#assign name = type.name> <#assign nameLower = type.name?lower_case> <#assign fields = type.fields> - case Type.${type.name}: + case Type.${type.name}: { org.apache.arrow.flatbuf.${type.name} ${nameLower}Type = (org.apache.arrow.flatbuf.${type.name}) field.type(new org.apache.arrow.flatbuf.${type.name}()); - return new ${type.name}(<#list type.fields as field>${nameLower}Type.${field.name}()<#if field_has_next>, ); + <#list type.fields as field> + <#if field.type == "int[]"> + ${field.type} ${field.name} = new int[${nameLower}Type.${field.name}Length()]; + for (int i = 0; i< ${field.name}.length; ++i) { + ${field.name}[i] = ${nameLower}Type.${field.name}(i); + } + <#else> + ${field.type} ${field.name} = ${nameLower}Type.${field.name}(); + + + return new ${type.name}(<#list type.fields as field>${field.name}<#if field_has_next>, ); + } default: throw new UnsupportedOperationException("Unsupported type: " + field.typeType()); diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java index 3014bbba9d52d..b14314d2b0dbb 100644 --- a/java/vector/src/main/codegen/templates/UnionVector.java +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -232,10 +232,13 @@ public void clear() { @Override public Field getField() { List childFields = new ArrayList<>(); - for (ValueVector v : internalMap.getChildren()) { + List children = internalMap.getChildren(); + int[] typeIds = new int[children.size()]; + for (ValueVector v : children) { + typeIds[childFields.size()] = v.getMinorType().ordinal(); childFields.add(v.getField()); } - return new Field(name, true, new ArrowType.Union(Sparse), childFields); + return new Field(name, true, new ArrowType.Union(Sparse, typeIds), childFields); } @Override diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java index 181d835368265..6e63ae232781a 100644 --- 
a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -472,7 +472,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { return new UnionListWriter((ListVector) vector); } }, - UNION(new Union(UnionMode.Sparse)) { + UNION(new Union(UnionMode.Sparse, null)) { @Override public Field getField() { throw new UnsupportedOperationException("Cannot get simple field for Union type"); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java index 448117d84dc3e..ed740cd0f1b78 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java @@ -22,11 +22,12 @@ import static org.junit.Assert.assertEquals; import org.apache.arrow.flatbuf.UnionMode; +import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.ArrowType.FloatingPoint; import org.apache.arrow.vector.types.pojo.ArrowType.Int; import org.apache.arrow.vector.types.pojo.ArrowType.List; -import org.apache.arrow.vector.types.pojo.ArrowType.Timestamp; import org.apache.arrow.vector.types.pojo.ArrowType.Struct_; +import org.apache.arrow.vector.types.pojo.ArrowType.Timestamp; import org.apache.arrow.vector.types.pojo.ArrowType.Union; import org.apache.arrow.vector.types.pojo.ArrowType.Utf8; import org.apache.arrow.vector.types.pojo.Field; @@ -78,7 +79,7 @@ public void nestedSchema() { childrenBuilder.add(new Field("child4", true, new List(), ImmutableList.of( new Field("child4.1", true, Utf8.INSTANCE, null) ))); - childrenBuilder.add(new Field("child5", true, new Union(UnionMode.Sparse), ImmutableList.of( + childrenBuilder.add(new Field("child5", true, new Union(UnionMode.Sparse, new int[] { MinorType.TIMESTAMP.ordinal(), MinorType.FLOAT8.ordinal() } ), ImmutableList.of( new Field("child5.1", true, new Timestamp("UTC"), null), new Field("child5.2", true, new FloatingPoint(DOUBLE), ImmutableList.of()) ))); From bd195e304d82dcf6e2cea266b4d0871bd2b88564 Mon Sep 17 00:00:00 2001 From: adeneche Date: Wed, 28 Sep 2016 07:26:05 -0700 Subject: [PATCH 0147/1644] ARROW-308: UnionListWriter.setPosition() should not call startList() --- .../codegen/templates/UnionListWriter.java | 1 - .../complex/writer/TestComplexWriter.java | 32 ++++++++++++++++++- 2 files changed, 31 insertions(+), 2 deletions(-) diff --git a/java/vector/src/main/codegen/templates/UnionListWriter.java b/java/vector/src/main/codegen/templates/UnionListWriter.java index d502803d71616..04531a72128a0 100644 --- a/java/vector/src/main/codegen/templates/UnionListWriter.java +++ b/java/vector/src/main/codegen/templates/UnionListWriter.java @@ -84,7 +84,6 @@ public void close() throws Exception { @Override public void setPosition(int index) { super.setPosition(index); - startList(); } <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java index c1da104da5780..398aea915b343 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java @@ -52,7 +52,7 @@ public class TestComplexWriter { - static final BufferAllocator 
allocator = new RootAllocator(Integer.MAX_VALUE); + private static final BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); private static final int COUNT = 100; @@ -115,6 +115,36 @@ public void nullableMap() { parent.close(); } + @Test + public void listOfLists() { + MapVector parent = new MapVector("parent", allocator, null); + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + + rootWriter.start(); + rootWriter.bigInt("int").writeBigInt(0); + rootWriter.list("list").startList(); + rootWriter.list("list").bigInt().writeBigInt(0); + rootWriter.list("list").endList(); + rootWriter.end(); + + rootWriter.setPosition(1); + rootWriter.start(); + rootWriter.bigInt("int").writeBigInt(1); + rootWriter.end(); + + writer.setValueCount(2); + + MapReader rootReader = new SingleMapReaderImpl(parent).reader("root"); + + rootReader.setPosition(0); + assertTrue("row 0 list is not set", rootReader.reader("list").isSet()); + assertEquals(Long.valueOf(0), rootReader.reader("list").reader().readLong()); + + rootReader.setPosition(1); + assertFalse("row 1 list is set", rootReader.reader("list").isSet()); + } + @Test public void listScalarType() { ListVector listVector = new ListVector("list", allocator, null); From bf30235fa3672936013db82ed9dd8949433d802e Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Wed, 28 Sep 2016 21:44:37 -0400 Subject: [PATCH 0148/1644] ARROW-306: Add option to pass cmake arguments via environment variable Author: Uwe L. Korn Closes #149 from xhochy/arrow-306 and squashes the following commits: 11a3e66 [Uwe L. Korn] ARROW-306: Add option to pass cmake arguments via environment variable --- python/setup.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python/setup.py b/python/setup.py index a5db2b025e6ef..d1be122888e7b 100644 --- a/python/setup.py +++ b/python/setup.py @@ -95,7 +95,7 @@ def run(self): def initialize_options(self): _build_ext.initialize_options(self) - self.extra_cmake_args = '' + self.extra_cmake_args = os.environ.get('PYARROW_CMAKE_OPTIONS', '') CYTHON_MODULE_NAMES = [ 'array', From 30f60832a5f4bd3063699061796d2107fb7a9738 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Wed, 28 Sep 2016 21:45:46 -0400 Subject: [PATCH 0149/1644] ARROW-305: Add compression and use_dictionary options to Parquet Author: Uwe L. Korn Closes #148 from xhochy/arrow-305 and squashes the following commits: 93d653b [Uwe L. 
Korn] ARROW-305: Add compression and use_dictionary options to Parquet interface --- python/pyarrow/includes/parquet.pxd | 12 +++++++ python/pyarrow/parquet.pyx | 49 +++++++++++++++++++++++++++- python/pyarrow/tests/test_parquet.py | 40 +++++++++++++++++++++++ 3 files changed, 100 insertions(+), 1 deletion(-) diff --git a/python/pyarrow/includes/parquet.pxd b/python/pyarrow/includes/parquet.pxd index 9085b0bb29866..754eeccecc8e9 100644 --- a/python/pyarrow/includes/parquet.pxd +++ b/python/pyarrow/includes/parquet.pxd @@ -37,6 +37,13 @@ cdef extern from "parquet/api/schema.h" namespace "parquet" nogil: PARQUET_1_0" parquet::ParquetVersion::PARQUET_1_0" PARQUET_2_0" parquet::ParquetVersion::PARQUET_2_0" + enum Compression" parquet::Compression::type": + UNCOMPRESSED" parquet::Compression::UNCOMPRESSED" + SNAPPY" parquet::Compression::SNAPPY" + GZIP" parquet::Compression::GZIP" + LZO" parquet::Compression::LZO" + BROTLI" parquet::Compression::BROTLI" + cdef cppclass SchemaDescriptor: shared_ptr[Node] schema() GroupNode* group() @@ -90,6 +97,11 @@ cdef extern from "parquet/api/writer.h" namespace "parquet" nogil: cdef cppclass WriterProperties: cppclass Builder: Builder* version(ParquetVersion version) + Builder* compression(Compression codec) + Builder* compression(const c_string& path, Compression codec) + Builder* disable_dictionary() + Builder* enable_dictionary() + Builder* enable_dictionary(const c_string& path) shared_ptr[WriterProperties] build() diff --git a/python/pyarrow/parquet.pyx b/python/pyarrow/parquet.pyx index fb36b2967c096..099e148abc16f 100644 --- a/python/pyarrow/parquet.pyx +++ b/python/pyarrow/parquet.pyx @@ -90,7 +90,8 @@ def read_table(source, columns=None): return reader.read_all() -def write_table(table, filename, chunk_size=None, version=None): +def write_table(table, filename, chunk_size=None, version=None, + use_dictionary=True, compression=None): """ Write a Table to Parquet format @@ -102,6 +103,11 @@ def write_table(table, filename, chunk_size=None, version=None): The maximum number of rows in each Parquet RowGroup version : {"1.0", "2.0"}, default "1.0" The Parquet format version, defaults to 1.0 + use_dictionary : bool or list + Specify if we should use dictionary encoding in general or only for + some columns. + compression : str or dict + Specify the compression codec, either on a general basis or per-column. 
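+
+    Examples
+    --------
+    Sketches only; the column name 'int64' is illustrative::
+
+        write_table(table, 'data.parquet', version='2.0', compression='SNAPPY')
+        write_table(table, 'data.parquet', use_dictionary=['int64'],
+                    compression={'int64': 'GZIP'})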
""" cdef Table table_ = table cdef CTable* ctable_ = table_.table @@ -121,6 +127,47 @@ def write_table(table, filename, chunk_size=None, version=None): else: raise ArrowException("Unsupported Parquet format version") + if isinstance(use_dictionary, bool): + if use_dictionary: + properties_builder.enable_dictionary() + else: + properties_builder.disable_dictionary() + else: + # Deactivate dictionary encoding by default + properties_builder.disable_dictionary() + for column in use_dictionary: + properties_builder.enable_dictionary(column) + + if isinstance(compression, basestring): + if compression == "NONE": + properties_builder.compression(UNCOMPRESSED) + elif compression == "SNAPPY": + properties_builder.compression(SNAPPY) + elif compression == "GZIP": + properties_builder.compression(GZIP) + elif compression == "LZO": + properties_builder.compression(LZO) + elif compression == "BROTLI": + properties_builder.compression(BROTLI) + else: + raise ArrowException("Unsupport compression codec") + elif compression is not None: + # Deactivate dictionary encoding by default + properties_builder.disable_dictionary() + for column, codec in compression.iteritems(): + if codec == "NONE": + properties_builder.compression(column, UNCOMPRESSED) + elif codec == "SNAPPY": + properties_builder.compression(column, SNAPPY) + elif codec == "GZIP": + properties_builder.compression(column, GZIP) + elif codec == "LZO": + properties_builder.compression(column, LZO) + elif codec == "BROTLI": + properties_builder.compression(column, BROTLI) + else: + raise ArrowException("Unsupport compression codec") + sink.reset(new LocalFileOutputStream(tobytes(filename))) with nogil: check_cstatus(WriteFlatTable(ctable_, default_memory_pool(), sink, diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index 8a2d8cab57267..0f9f2e40813ce 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -110,3 +110,43 @@ def test_pandas_parquet_1_0_rountrip(tmpdir): df['uint32'] = df['uint32'].values.astype(np.int64) pdt.assert_frame_equal(df, df_read) + +@parquet +def test_pandas_parquet_configuration_options(tmpdir): + size = 10000 + np.random.seed(0) + df = pd.DataFrame({ + 'uint8': np.arange(size, dtype=np.uint8), + 'uint16': np.arange(size, dtype=np.uint16), + 'uint32': np.arange(size, dtype=np.uint32), + 'uint64': np.arange(size, dtype=np.uint64), + 'int8': np.arange(size, dtype=np.int16), + 'int16': np.arange(size, dtype=np.int16), + 'int32': np.arange(size, dtype=np.int32), + 'int64': np.arange(size, dtype=np.int64), + 'float32': np.arange(size, dtype=np.float32), + 'float64': np.arange(size, dtype=np.float64), + 'bool': np.random.randn(size) > 0 + }) + filename = tmpdir.join('pandas_rountrip.parquet') + arrow_table = A.from_pandas_dataframe(df) + + for use_dictionary in [True, False]: + A.parquet.write_table( + arrow_table, + filename.strpath, + version="2.0", + use_dictionary=use_dictionary) + table_read = pq.read_table(filename.strpath) + df_read = table_read.to_pandas() + pdt.assert_frame_equal(df, df_read) + + for compression in ['NONE', 'SNAPPY', 'GZIP']: + A.parquet.write_table( + arrow_table, + filename.strpath, + version="2.0", + compression=compression) + table_read = pq.read_table(filename.strpath) + df_read = table_read.to_pandas() + pdt.assert_frame_equal(df, df_read) From 391ab64d05fc9c5ea89fcc9a9938604954047ada Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Fri, 30 Sep 2016 08:53:52 -0700 Subject: [PATCH 0150/1644] ARROW-309: 
Types.getMinorTypeForArrowType() does not work for Union type Author: Julien Le Dem Closes #151 from julienledem/fix_union and squashes the following commits: 01bea42 [Julien Le Dem] fix union --- .../org/apache/arrow/vector/types/Types.java | 145 +++++++++++++----- 1 file changed, 107 insertions(+), 38 deletions(-) diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java index 6e63ae232781a..2ff93d4b98d11 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -17,12 +17,8 @@ */ package org.apache.arrow.vector.types; -import java.util.HashMap; -import java.util.Map; - import org.apache.arrow.flatbuf.IntervalUnit; import org.apache.arrow.flatbuf.Precision; -import org.apache.arrow.flatbuf.Type; import org.apache.arrow.flatbuf.UnionMode; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.FieldVector; @@ -74,9 +70,11 @@ import org.apache.arrow.vector.complex.impl.VarCharWriterImpl; import org.apache.arrow.vector.complex.writer.FieldWriter; import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.ArrowType.ArrowTypeVisitor; import org.apache.arrow.vector.types.pojo.ArrowType.Binary; import org.apache.arrow.vector.types.pojo.ArrowType.Bool; import org.apache.arrow.vector.types.pojo.ArrowType.Date; +import org.apache.arrow.vector.types.pojo.ArrowType.Decimal; import org.apache.arrow.vector.types.pojo.ArrowType.FloatingPoint; import org.apache.arrow.vector.types.pojo.ArrowType.Int; import org.apache.arrow.vector.types.pojo.ArrowType.Interval; @@ -92,26 +90,25 @@ public class Types { - public static final Field NULL_FIELD = new Field("", true, Null.INSTANCE, null); - public static final Field TINYINT_FIELD = new Field("", true, new Int(8, true), null); - public static final Field SMALLINT_FIELD = new Field("", true, new Int(16, true), null); - public static final Field INT_FIELD = new Field("", true, new Int(32, true), null); - public static final Field BIGINT_FIELD = new Field("", true, new Int(64, true), null); - public static final Field UINT1_FIELD = new Field("", true, new Int(8, false), null); - public static final Field UINT2_FIELD = new Field("", true, new Int(16, false), null); - public static final Field UINT4_FIELD = new Field("", true, new Int(32, false), null); - public static final Field UINT8_FIELD = new Field("", true, new Int(64, false), null); - public static final Field DATE_FIELD = new Field("", true, Date.INSTANCE, null); - public static final Field TIME_FIELD = new Field("", true, Time.INSTANCE, null); - public static final Field TIMESTAMP_FIELD = new Field("", true, new Timestamp(""), null); - public static final Field INTERVALDAY_FIELD = new Field("", true, new Interval(IntervalUnit.DAY_TIME), null); - public static final Field INTERVALYEAR_FIELD = new Field("", true, new Interval(IntervalUnit.YEAR_MONTH), null); - public static final Field FLOAT4_FIELD = new Field("", true, new FloatingPoint(Precision.SINGLE), null); - public static final Field FLOAT8_FIELD = new Field("", true, new FloatingPoint(Precision.DOUBLE), null); - public static final Field LIST_FIELD = new Field("", true, List.INSTANCE, null); - public static final Field VARCHAR_FIELD = new Field("", true, Utf8.INSTANCE, null); - public static final Field VARBINARY_FIELD = new Field("", true, Binary.INSTANCE, null); - public static final Field BIT_FIELD = 
new Field("", true, Bool.INSTANCE, null); + private static final Field NULL_FIELD = new Field("", true, Null.INSTANCE, null); + private static final Field TINYINT_FIELD = new Field("", true, new Int(8, true), null); + private static final Field SMALLINT_FIELD = new Field("", true, new Int(16, true), null); + private static final Field INT_FIELD = new Field("", true, new Int(32, true), null); + private static final Field BIGINT_FIELD = new Field("", true, new Int(64, true), null); + private static final Field UINT1_FIELD = new Field("", true, new Int(8, false), null); + private static final Field UINT2_FIELD = new Field("", true, new Int(16, false), null); + private static final Field UINT4_FIELD = new Field("", true, new Int(32, false), null); + private static final Field UINT8_FIELD = new Field("", true, new Int(64, false), null); + private static final Field DATE_FIELD = new Field("", true, Date.INSTANCE, null); + private static final Field TIME_FIELD = new Field("", true, Time.INSTANCE, null); + private static final Field TIMESTAMP_FIELD = new Field("", true, new Timestamp(""), null); + private static final Field INTERVALDAY_FIELD = new Field("", true, new Interval(IntervalUnit.DAY_TIME), null); + private static final Field INTERVALYEAR_FIELD = new Field("", true, new Interval(IntervalUnit.YEAR_MONTH), null); + private static final Field FLOAT4_FIELD = new Field("", true, new FloatingPoint(Precision.SINGLE), null); + private static final Field FLOAT8_FIELD = new Field("", true, new FloatingPoint(Precision.DOUBLE), null); + private static final Field VARCHAR_FIELD = new Field("", true, Utf8.INSTANCE, null); + private static final Field VARBINARY_FIELD = new Field("", true, Binary.INSTANCE, null); + private static final Field BIT_FIELD = new Field("", true, Bool.INSTANCE, null); public enum MinorType { @@ -427,7 +424,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { UINT4(new Int(32, false)) { @Override public Field getField() { - return UINT8_FIELD; + return UINT4_FIELD; } @Override @@ -506,22 +503,94 @@ public ArrowType getType() { public abstract FieldWriter getNewFieldWriter(ValueVector vector); } - private static final Map ARROW_TYPE_MINOR_TYPE_MAP; - public static MinorType getMinorTypeForArrowType(ArrowType arrowType) { - if (arrowType.getTypeType() == Type.Decimal) { - return MinorType.DECIMAL; - } - return ARROW_TYPE_MINOR_TYPE_MAP.get(arrowType); - } + return arrowType.accept(new ArrowTypeVisitor() { + @Override public MinorType visit(Null type) { + return MinorType.NULL; + } - static { - ARROW_TYPE_MINOR_TYPE_MAP = new HashMap<>(); - for (MinorType minorType : MinorType.values()) { - if (minorType != MinorType.DECIMAL) { - ARROW_TYPE_MINOR_TYPE_MAP.put(minorType.getType(), minorType); + @Override public MinorType visit(Struct_ type) { + return MinorType.MAP; } - } + + @Override public MinorType visit(List type) { + return MinorType.LIST; + } + + @Override public MinorType visit(Union type) { + return MinorType.UNION; + } + + @Override + public MinorType visit(Int type) { + switch (type.getBitWidth()) { + case 8: + return type.getIsSigned() ? MinorType.TINYINT : MinorType.UINT1; + case 16: + return type.getIsSigned() ? MinorType.SMALLINT : MinorType.UINT2; + case 32: + return type.getIsSigned() ? MinorType.INT : MinorType.UINT4; + case 64: + return type.getIsSigned() ? 
MinorType.BIGINT : MinorType.UINT8; + default: + throw new IllegalArgumentException("only 8, 16, 32, 64 supported: " + type); + } + } + + @Override + public MinorType visit(FloatingPoint type) { + switch (type.getPrecision()) { + case Precision.HALF: + throw new UnsupportedOperationException("NYI: " + type); + case Precision.SINGLE: + return MinorType.FLOAT4; + case Precision.DOUBLE: + return MinorType.FLOAT8; + default: + throw new IllegalArgumentException("unknown precision: " + type); + } + } + + @Override public MinorType visit(Utf8 type) { + return MinorType.VARCHAR; + } + + @Override public MinorType visit(Binary type) { + return MinorType.VARBINARY; + } + + @Override public MinorType visit(Bool type) { + return MinorType.BIT; + } + + @Override public MinorType visit(Decimal type) { + return MinorType.DECIMAL; + } + + @Override public MinorType visit(Date type) { + return MinorType.DATE; + } + + @Override public MinorType visit(Time type) { + return MinorType.TIME; + } + + @Override public MinorType visit(Timestamp type) { + return MinorType.TIMESTAMP; + } + + @Override + public MinorType visit(Interval type) { + switch (type.getUnit()) { + case IntervalUnit.DAY_TIME: + return MinorType.INTERVALDAY; + case IntervalUnit.YEAR_MONTH: + return MinorType.INTERVALYEAR; + default: + throw new IllegalArgumentException("unknown unit: " + type); + } + } + }); } } From c7b0480f5c8dadb78b9586dc4e40f3964929d8ef Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Mon, 3 Oct 2016 14:54:15 -0700 Subject: [PATCH 0151/1644] ARROW-314: JSONScalar is unnecessary and unused Author: Julien Le Dem Closes #153 from julienledem/jsonscalar and squashes the following commits: 905027c [Julien Le Dem] ARROW-314: JSONScalar is unnecessary and unused --- format/Message.fbs | 7 +------ 1 file changed, 1 insertion(+), 6 deletions(-) diff --git a/format/Message.fbs b/format/Message.fbs index 288f5a1b6b2d0..e1758bf3638e8 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -73,10 +73,6 @@ table Interval { unit: IntervalUnit; } -table JSONScalar { - dense:bool=true; -} - /// ---------------------------------------------------------------------- /// Top-level Type value, enabling extensible type-specific metadata. We can /// add new logical types to Type without breaking backwards compatibility @@ -95,8 +91,7 @@ union Type { Interval, List, Struct_, - Union, - JSONScalar + Union } /// ---------------------------------------------------------------------- From c3930a062b2d71e3d277d4db1785e24e9183276f Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Mon, 3 Oct 2016 15:17:32 -0700 Subject: [PATCH 0152/1644] ARROW-301: Add user field metadata to IPC schemas Author: Julien Le Dem Closes #154 from julienledem/custom and squashes the following commits: 47a02b7 [Julien Le Dem] ARROW-301: Add user field metadata to IPC schemas --- format/Message.fbs | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/format/Message.fbs b/format/Message.fbs index e1758bf3638e8..3d877a2f234af 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -119,6 +119,16 @@ table VectorLayout { type: VectorType; } + +/// ---------------------------------------------------------------------- +/// user defined key value pairs to add custom metadata to arrow +/// key namespacing is the responsibility of the user + +table KeyValue { + key: string; + value: [ubyte]; +} + /// ---------------------------------------------------------------------- /// A field represents a named column in a record / row batch or child of a /// nested type. 
@@ -141,6 +151,8 @@ table Field {
   /// does not include children
   /// each recordbatch will return instances of those Buffers.
   layout: [ VectorLayout ];
+  // User-defined metadata
+  custom_metadata: [ KeyValue ];
 }
 
 /// ----------------------------------------------------------------------
@@ -159,6 +171,8 @@ table Schema {
 
   endianness: Endianness=Little;
 
   fields: [Field];
+  // User-defined metadata
+  custom_metadata: [ KeyValue ];
 }
 
 /// ----------------------------------------------------------------------

From c7e6a0716308766766aaaf4faa2effc5445640c6 Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Mon, 3 Oct 2016 23:14:41 -0400
Subject: [PATCH 0153/1644] ARROW-302: [C++/Python] Implement C++ IO interfaces
 for interacting with Python file and bytes objects

This will enable code (such as arrow IPC or Parquet) that only knows about
Arrow's IO subsystem to interact with Python objects in various ways. In
other words, when we have in C++:

```
std::shared_ptr<arrow::io::ReadableFileInterface> handle = ...;
handle->Read(nbytes, &out);
```

then the C++ file handle could be invoking the `read` method of a Python
object. Same goes for `arrow::io::OutputStream` and `write` methods.

There's data-copying overhead in some places because of the rigid memory
ownership semantics of the `PyBytes` type, but this can't be avoided here.

Another nice thing is that if we have some data in a Python bytes object
that we want to expose to some other C++ component, we can wrap it in the
`PyBytesReader` which provides zero-copy read access to the underlying data.

Author: Wes McKinney

Closes #152 from wesm/ARROW-302 and squashes the following commits:

2de9f97 [Wes McKinney] Fix compiler warning / bug from OS X
316b845 [Wes McKinney] Code review comments
e791893 [Wes McKinney] Python 2.7 fix
0fc4cf1 [Wes McKinney] cpplint
e9b8c60 [Wes McKinney] Test the size() method and fix bug with missing whence
6481e91 [Wes McKinney] Add a zero-copy reader for PyBytes
7e357eb [Wes McKinney] Get basic Python file read/write working
d470133 [Wes McKinney] Share default implementations of ReadAt, add Buffer-based Read API
737a8db [Wes McKinney] Refactoring, more code sharing with native file interfaces
8be433f [Wes McKinney] Draft PyReadableFile implementation, not yet tested
20a3f28 [Wes McKinney] Draft API for Arrow IO wrappers for Python files
---
 cpp/CMakeLists.txt                      |   2 +
 cpp/src/arrow/io/CMakeLists.txt         |   1 +
 cpp/src/arrow/io/file.cc                |  10 +-
 cpp/src/arrow/io/file.h                 |   6 +-
 cpp/src/arrow/io/hdfs.cc                |  46 ++++-
 cpp/src/arrow/io/hdfs.h                 |  13 +-
 cpp/src/arrow/io/interfaces.cc          |  48 ++++++
 cpp/src/arrow/io/interfaces.h           |  26 +--
 cpp/src/arrow/io/memory.cc              |  40 ++---
 cpp/src/arrow/io/memory.h               |  21 ++-
 python/CMakeLists.txt                   |   1 +
 python/pyarrow/__init__.py              |   5 +-
 python/pyarrow/array.pyx                |  31 ----
 python/pyarrow/error.pxd                |   4 +-
 python/pyarrow/error.pyx                |   2 +-
 python/pyarrow/includes/libarrow_io.pxd |  29 ++++
 python/pyarrow/includes/pyarrow.pxd     |  34 +++-
 python/pyarrow/io.pxd                   |  13 +-
 python/pyarrow/io.pyx                   | 136 ++++++++++-----
 python/pyarrow/parquet.pyx              |   8 +-
 python/pyarrow/table.pyx                |  37 +++-
 python/pyarrow/tests/test_hdfs.py       | 128 ++++++++++++++
 python/pyarrow/tests/test_io.py         | 121 ++++++-------
 python/src/pyarrow/adapters/pandas.cc   |   2 +-
 python/src/pyarrow/common.cc            |  15 ++
 python/src/pyarrow/common.h             |  30 +++-
 python/src/pyarrow/io.cc                | 215 ++++++++++++++++++++++++
 python/src/pyarrow/io.h                 |  97 +++++++++++
 28 files changed, 878 insertions(+), 243 deletions(-)
 create mode 100644 cpp/src/arrow/io/interfaces.cc
 create mode 100644 python/pyarrow/tests/test_hdfs.py
 create mode
100644 python/src/pyarrow/io.cc create mode 100644 python/src/pyarrow/io.h diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index d65c715319694..f70c8ab4bccef 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -166,6 +166,8 @@ else() message(FATAL_ERROR "Unknown build type: ${CMAKE_BUILD_TYPE}") endif () +message(STATUS "Build Type: ${CMAKE_BUILD_TYPE}") + # Add common flags set(CMAKE_CXX_FLAGS "${CXX_COMMON_FLAGS} ${CMAKE_CXX_FLAGS}") diff --git a/cpp/src/arrow/io/CMakeLists.txt b/cpp/src/arrow/io/CMakeLists.txt index d2e3491b75f12..47bb089386371 100644 --- a/cpp/src/arrow/io/CMakeLists.txt +++ b/cpp/src/arrow/io/CMakeLists.txt @@ -39,6 +39,7 @@ set(ARROW_IO_TEST_LINK_LIBS set(ARROW_IO_SRCS file.cc + interfaces.cc memory.cc ) diff --git a/cpp/src/arrow/io/file.cc b/cpp/src/arrow/io/file.cc index 87bae7f3928ec..93f0ad91ee86c 100644 --- a/cpp/src/arrow/io/file.cc +++ b/cpp/src/arrow/io/file.cc @@ -413,15 +413,7 @@ Status ReadableFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) { return impl_->Read(nbytes, bytes_read, out); } -Status ReadableFile::ReadAt( - int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* out) { - RETURN_NOT_OK(Seek(position)); - return impl_->Read(nbytes, bytes_read, out); -} - -Status ReadableFile::ReadAt( - int64_t position, int64_t nbytes, std::shared_ptr* out) { - RETURN_NOT_OK(Seek(position)); +Status ReadableFile::Read(int64_t nbytes, std::shared_ptr* out) { return impl_->ReadBuffer(nbytes, out); } diff --git a/cpp/src/arrow/io/file.h b/cpp/src/arrow/io/file.h index 5e714ea966790..10fe16e511210 100644 --- a/cpp/src/arrow/io/file.h +++ b/cpp/src/arrow/io/file.h @@ -71,11 +71,9 @@ class ARROW_EXPORT ReadableFile : public ReadableFileInterface { Status Close() override; Status Tell(int64_t* position) override; - Status ReadAt( - int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override; - Status ReadAt(int64_t position, int64_t nbytes, std::shared_ptr* out) override; - Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override; + Status Read(int64_t nbytes, std::shared_ptr* out) override; + Status GetSize(int64_t* size) override; Status Seek(int64_t position) override; diff --git a/cpp/src/arrow/io/hdfs.cc b/cpp/src/arrow/io/hdfs.cc index a6b4b2f3846b1..b74f84604f18c 100644 --- a/cpp/src/arrow/io/hdfs.cc +++ b/cpp/src/arrow/io/hdfs.cc @@ -22,6 +22,8 @@ #include #include "arrow/io/hdfs.h" +#include "arrow/util/buffer.h" +#include "arrow/util/memory-pool.h" #include "arrow/util/status.h" namespace arrow { @@ -89,7 +91,7 @@ class HdfsAnyFileImpl { // Private implementation for read-only files class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { public: - HdfsReadableFileImpl() {} + explicit HdfsReadableFileImpl(MemoryPool* pool) : pool_(pool) {} Status Close() { if (is_open_) { @@ -108,6 +110,19 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { return Status::OK(); } + Status ReadAt(int64_t position, int64_t nbytes, std::shared_ptr* out) { + auto buffer = std::make_shared(pool_); + RETURN_NOT_OK(buffer->Resize(nbytes)); + + int64_t bytes_read = 0; + RETURN_NOT_OK(ReadAt(position, nbytes, &bytes_read, buffer->mutable_data())); + + if (bytes_read < nbytes) { RETURN_NOT_OK(buffer->Resize(bytes_read)); } + + *out = buffer; + return Status::OK(); + } + Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) { tSize ret = hdfsRead(fs_, file_, reinterpret_cast(buffer), nbytes); RETURN_NOT_OK(CheckReadResult(ret)); @@ -115,6 +130,19 @@ class 
HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { return Status::OK(); } + Status Read(int64_t nbytes, std::shared_ptr* out) { + auto buffer = std::make_shared(pool_); + RETURN_NOT_OK(buffer->Resize(nbytes)); + + int64_t bytes_read = 0; + RETURN_NOT_OK(Read(nbytes, &bytes_read, buffer->mutable_data())); + + if (bytes_read < nbytes) { RETURN_NOT_OK(buffer->Resize(bytes_read)); } + + *out = buffer; + return Status::OK(); + } + Status GetSize(int64_t* size) { hdfsFileInfo* entry = hdfsGetPathInfo(fs_, path_.c_str()); if (entry == nullptr) { return Status::IOError("HDFS: GetPathInfo failed"); } @@ -123,10 +151,16 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { hdfsFreeFileInfo(entry, 1); return Status::OK(); } + + void set_memory_pool(MemoryPool* pool) { pool_ = pool; } + + private: + MemoryPool* pool_; }; -HdfsReadableFile::HdfsReadableFile() { - impl_.reset(new HdfsReadableFileImpl()); +HdfsReadableFile::HdfsReadableFile(MemoryPool* pool) { + if (pool == nullptr) { pool = default_memory_pool(); } + impl_.reset(new HdfsReadableFileImpl(pool)); } HdfsReadableFile::~HdfsReadableFile() { @@ -144,7 +178,7 @@ Status HdfsReadableFile::ReadAt( Status HdfsReadableFile::ReadAt( int64_t position, int64_t nbytes, std::shared_ptr* out) { - return Status::NotImplemented("Not yet implemented"); + return impl_->ReadAt(position, nbytes, out); } bool HdfsReadableFile::supports_zero_copy() const { @@ -155,6 +189,10 @@ Status HdfsReadableFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buff return impl_->Read(nbytes, bytes_read, buffer); } +Status HdfsReadableFile::Read(int64_t nbytes, std::shared_ptr* buffer) { + return impl_->Read(nbytes, buffer); +} + Status HdfsReadableFile::GetSize(int64_t* size) { return impl_->GetSize(size); } diff --git a/cpp/src/arrow/io/hdfs.h b/cpp/src/arrow/io/hdfs.h index 39720cc17e422..4a4e3ec5f5134 100644 --- a/cpp/src/arrow/io/hdfs.h +++ b/cpp/src/arrow/io/hdfs.h @@ -164,6 +164,12 @@ class ARROW_EXPORT HdfsReadableFile : public ReadableFileInterface { Status GetSize(int64_t* size) override; + // NOTE: If you wish to read a particular range of a file in a multithreaded + // context, you may prefer to use ReadAt to avoid locking issues + Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override; + + Status Read(int64_t nbytes, std::shared_ptr* out) override; + Status ReadAt( int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override; @@ -174,17 +180,16 @@ class ARROW_EXPORT HdfsReadableFile : public ReadableFileInterface { Status Seek(int64_t position) override; Status Tell(int64_t* position) override; - // NOTE: If you wish to read a particular range of a file in a multithreaded - // context, you may prefer to use ReadAt to avoid locking issues - Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override; + void set_memory_pool(MemoryPool* pool); private: + explicit HdfsReadableFile(MemoryPool* pool = nullptr); + class ARROW_NO_EXPORT HdfsReadableFileImpl; std::unique_ptr impl_; friend class HdfsClient::HdfsClientImpl; - HdfsReadableFile(); DISALLOW_COPY_AND_ASSIGN(HdfsReadableFile); }; diff --git a/cpp/src/arrow/io/interfaces.cc b/cpp/src/arrow/io/interfaces.cc new file mode 100644 index 0000000000000..44986cee1afc9 --- /dev/null +++ b/cpp/src/arrow/io/interfaces.cc @@ -0,0 +1,48 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. 
See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/io/interfaces.h" + +#include +#include + +#include "arrow/util/buffer.h" +#include "arrow/util/status.h" + +namespace arrow { +namespace io { + +FileInterface::~FileInterface() {} + +ReadableFileInterface::ReadableFileInterface() { + set_mode(FileMode::READ); +} + +Status ReadableFileInterface::ReadAt( + int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* out) { + RETURN_NOT_OK(Seek(position)); + return Read(nbytes, bytes_read, out); +} + +Status ReadableFileInterface::ReadAt( + int64_t position, int64_t nbytes, std::shared_ptr* out) { + RETURN_NOT_OK(Seek(position)); + return Read(nbytes, out); +} + +} // namespace io +} // namespace arrow diff --git a/cpp/src/arrow/io/interfaces.h b/cpp/src/arrow/io/interfaces.h index fa34b43b2c920..db0c059c6e286 100644 --- a/cpp/src/arrow/io/interfaces.h +++ b/cpp/src/arrow/io/interfaces.h @@ -22,10 +22,12 @@ #include #include "arrow/util/macros.h" +#include "arrow/util/visibility.h" namespace arrow { class Buffer; +class MemoryPool; class Status; namespace io { @@ -43,9 +45,9 @@ class FileSystemClient { virtual ~FileSystemClient() {} }; -class FileInterface { +class ARROW_EXPORT FileInterface { public: - virtual ~FileInterface() {} + virtual ~FileInterface() = 0; virtual Status Close() = 0; virtual Status Tell(int64_t* position) = 0; @@ -54,7 +56,6 @@ class FileInterface { protected: FileInterface() {} FileMode::type mode_; - void set_mode(FileMode::type mode) { mode_ = mode; } private: @@ -74,6 +75,9 @@ class Writeable { class Readable { public: virtual Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) = 0; + + // Does not copy if not necessary + virtual Status Read(int64_t nbytes, std::shared_ptr* out) = 0; }; class OutputStream : public FileInterface, public Writeable { @@ -86,21 +90,21 @@ class InputStream : public FileInterface, public Readable { InputStream() {} }; -class ReadableFileInterface : public InputStream, public Seekable { +class ARROW_EXPORT ReadableFileInterface : public InputStream, public Seekable { public: - virtual Status ReadAt( - int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* out) = 0; - virtual Status GetSize(int64_t* size) = 0; - // Does not copy if not necessary + virtual bool supports_zero_copy() const = 0; + + // Read at position, provide default implementations using Read(...), but can + // be overridden virtual Status ReadAt( - int64_t position, int64_t nbytes, std::shared_ptr* out) = 0; + int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* out); - virtual bool supports_zero_copy() const = 0; + virtual Status ReadAt(int64_t position, int64_t nbytes, std::shared_ptr* out); protected: - ReadableFileInterface() { set_mode(FileMode::READ); } + ReadableFileInterface(); }; class WriteableFileInterface : public OutputStream, public Seekable { diff --git 
a/cpp/src/arrow/io/memory.cc b/cpp/src/arrow/io/memory.cc index c168c91c5f87c..7d6e02e25b43c 100644 --- a/cpp/src/arrow/io/memory.cc +++ b/cpp/src/arrow/io/memory.cc @@ -123,6 +123,8 @@ MemoryMappedFile::MemoryMappedFile(FileMode::type mode) { ReadableFileInterface::set_mode(mode); } +MemoryMappedFile::~MemoryMappedFile() {} + Status MemoryMappedFile::Open(const std::string& path, FileMode::type mode, std::shared_ptr* out) { std::shared_ptr result(new MemoryMappedFile(mode)); @@ -161,16 +163,8 @@ Status MemoryMappedFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) return Status::OK(); } -Status MemoryMappedFile::ReadAt( - int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* out) { - RETURN_NOT_OK(impl_->Seek(position)); - return Read(nbytes, bytes_read, out); -} - -Status MemoryMappedFile::ReadAt( - int64_t position, int64_t nbytes, std::shared_ptr* out) { - nbytes = std::min(nbytes, impl_->size() - position); - RETURN_NOT_OK(impl_->Seek(position)); +Status MemoryMappedFile::Read(int64_t nbytes, std::shared_ptr* out) { + nbytes = std::min(nbytes, impl_->size() - impl_->position()); *out = std::make_shared(impl_->head(), nbytes); impl_->advance(nbytes); return Status::OK(); @@ -246,6 +240,11 @@ Status BufferOutputStream::Reserve(int64_t nbytes) { // ---------------------------------------------------------------------- // In-memory buffer reader +BufferReader::BufferReader(const uint8_t* buffer, int buffer_size) + : buffer_(buffer), buffer_size_(buffer_size), position_(0) {} + +BufferReader::~BufferReader() {} + Status BufferReader::Close() { // no-op return Status::OK(); @@ -256,20 +255,6 @@ Status BufferReader::Tell(int64_t* position) { return Status::OK(); } -Status BufferReader::ReadAt( - int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) { - RETURN_NOT_OK(Seek(position)); - return Read(nbytes, bytes_read, buffer); -} - -Status BufferReader::ReadAt( - int64_t position, int64_t nbytes, std::shared_ptr* out) { - int64_t size = std::min(nbytes, buffer_size_ - position_); - *out = std::make_shared(buffer_ + position, size); - position_ += nbytes; - return Status::OK(); -} - bool BufferReader::supports_zero_copy() const { return true; } @@ -281,6 +266,13 @@ Status BufferReader::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) return Status::OK(); } +Status BufferReader::Read(int64_t nbytes, std::shared_ptr* out) { + int64_t size = std::min(nbytes, buffer_size_ - position_); + *out = std::make_shared(buffer_ + position_, size); + position_ += nbytes; + return Status::OK(); +} + Status BufferReader::GetSize(int64_t* size) { *size = buffer_size_; return Status::OK(); diff --git a/cpp/src/arrow/io/memory.h b/cpp/src/arrow/io/memory.h index 6989d732ca752..df2fe8d6efbfc 100644 --- a/cpp/src/arrow/io/memory.h +++ b/cpp/src/arrow/io/memory.h @@ -61,6 +61,8 @@ class ARROW_EXPORT BufferOutputStream : public OutputStream { // A memory source that uses memory-mapped files for memory interactions class ARROW_EXPORT MemoryMappedFile : public ReadWriteFileInterface { public: + ~MemoryMappedFile(); + static Status Open(const std::string& path, FileMode::type mode, std::shared_ptr* out); @@ -73,11 +75,8 @@ class ARROW_EXPORT MemoryMappedFile : public ReadWriteFileInterface { // Required by ReadableFileInterface, copies memory into out Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) override; - Status ReadAt( - int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* out) override; - - // Read into a buffer, zero copy if possible - 
Status ReadAt(int64_t position, int64_t nbytes, std::shared_ptr* out) override; + // Zero copy read + Status Read(int64_t nbytes, std::shared_ptr* out) override; bool supports_zero_copy() const override; @@ -100,17 +99,17 @@ class ARROW_EXPORT MemoryMappedFile : public ReadWriteFileInterface { class ARROW_EXPORT BufferReader : public ReadableFileInterface { public: - BufferReader(const uint8_t* buffer, int buffer_size) - : buffer_(buffer), buffer_size_(buffer_size), position_(0) {} + BufferReader(const uint8_t* buffer, int buffer_size); + ~BufferReader(); Status Close() override; Status Tell(int64_t* position) override; - Status ReadAt( - int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override; - Status ReadAt(int64_t position, int64_t nbytes, std::shared_ptr* out) override; - Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override; + + // Zero copy read + Status Read(int64_t nbytes, std::shared_ptr* out) override; + Status GetSize(int64_t* size) override; Status Seek(int64_t position) override; diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 6357e3c1725e3..77a771ab21c06 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -432,6 +432,7 @@ set(PYARROW_SRCS src/pyarrow/common.cc src/pyarrow/config.cc src/pyarrow/helpers.cc + src/pyarrow/io.cc src/pyarrow/status.cc src/pyarrow/adapters/builtin.cc diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 40a09c2feaef0..7561f6d46df21 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -41,6 +41,5 @@ list_, struct, field, DataType, Field, Schema, schema) -from pyarrow.array import RowBatch, from_pandas_dataframe - -from pyarrow.table import Column, Table +from pyarrow.array import RowBatch +from pyarrow.table import Column, Table, from_pandas_dataframe diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 5229b429f58b4..cdbe73ad21f7d 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -35,7 +35,6 @@ from pyarrow.scalar import NA from pyarrow.schema cimport Schema import pyarrow.schema as schema -from pyarrow.table cimport Table def total_allocated_bytes(): cdef MemoryPool* pool = pyarrow.GetMemoryPool() @@ -254,35 +253,6 @@ def from_pandas_series(object series, object mask=None, timestamps_to_ms=False): return box_arrow_array(out) -def from_pandas_dataframe(object df, name=None, timestamps_to_ms=False): - """ - Convert pandas.DataFrame to an Arrow Table - - Parameters - ---------- - df: pandas.DataFrame - - name: str - - timestamps_to_ms: bool - Convert datetime columns to ms resolution. This is needed for - compability with other functionality like Parquet I/O which - only supports milliseconds. - """ - cdef: - list names = [] - list arrays = [] - - for name in df.columns: - col = df[name] - arr = from_pandas_series(col, timestamps_to_ms=timestamps_to_ms) - - names.append(name) - arrays.append(arr) - - return Table.from_arrays(names, arrays, name=name) - - cdef object series_as_ndarray(object obj): import pandas as pd @@ -324,4 +294,3 @@ cdef class RowBatch: def __getitem__(self, i): return self.arrays[i] - diff --git a/python/pyarrow/error.pxd b/python/pyarrow/error.pxd index 1fb6fad396a8b..891d1ac1c7ea0 100644 --- a/python/pyarrow/error.pxd +++ b/python/pyarrow/error.pxd @@ -16,7 +16,7 @@ # under the License. 
from pyarrow.includes.libarrow cimport CStatus -from pyarrow.includes.pyarrow cimport * +from pyarrow.includes.pyarrow cimport PyStatus cdef int check_cstatus(const CStatus& status) nogil except -1 -cdef int check_status(const Status& status) nogil except -1 +cdef int check_status(const PyStatus& status) nogil except -1 diff --git a/python/pyarrow/error.pyx b/python/pyarrow/error.pyx index 244019321a7fd..a2c53fed8c6a0 100644 --- a/python/pyarrow/error.pyx +++ b/python/pyarrow/error.pyx @@ -30,7 +30,7 @@ cdef int check_cstatus(const CStatus& status) nogil except -1: with gil: raise ArrowException(frombytes(c_message)) -cdef int check_status(const Status& status) nogil except -1: +cdef int check_status(const PyStatus& status) nogil except -1: if status.ok(): return 0 diff --git a/python/pyarrow/includes/libarrow_io.pxd b/python/pyarrow/includes/libarrow_io.pxd index f338a436814de..56d8d4cf61494 100644 --- a/python/pyarrow/includes/libarrow_io.pxd +++ b/python/pyarrow/includes/libarrow_io.pxd @@ -18,6 +18,7 @@ # distutils: language = c++ from pyarrow.includes.common cimport * +from pyarrow.includes.libarrow cimport MemoryPool cdef extern from "arrow/io/interfaces.h" namespace "arrow::io" nogil: enum FileMode" arrow::io::FileMode::type": @@ -35,6 +36,7 @@ cdef extern from "arrow/io/interfaces.h" namespace "arrow::io" nogil: FileMode mode() cdef cppclass Readable: + CStatus ReadB" Read"(int64_t nbytes, shared_ptr[Buffer]* out) CStatus Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) cdef cppclass Seekable: @@ -66,6 +68,24 @@ cdef extern from "arrow/io/interfaces.h" namespace "arrow::io" nogil: pass +cdef extern from "arrow/io/file.h" namespace "arrow::io" nogil: + cdef cppclass FileOutputStream(OutputStream): + @staticmethod + CStatus Open(const c_string& path, shared_ptr[FileOutputStream]* file) + + int file_descriptor() + + cdef cppclass ReadableFile(ReadableFileInterface): + @staticmethod + CStatus Open(const c_string& path, shared_ptr[ReadableFile]* file) + + @staticmethod + CStatus Open(const c_string& path, MemoryPool* memory_pool, + shared_ptr[ReadableFile]* file) + + int file_descriptor() + + cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: CStatus ConnectLibHdfs() @@ -120,3 +140,12 @@ cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: int32_t buffer_size, int16_t replication, int64_t default_block_size, shared_ptr[HdfsOutputStream]* handle) + + +cdef extern from "arrow/io/memory.h" namespace "arrow::io" nogil: + cdef cppclass BufferReader(ReadableFileInterface): + BufferReader(const uint8_t* data, int64_t nbytes) + + cdef cppclass BufferOutputStream(OutputStream): + # TODO(wesm) + pass diff --git a/python/pyarrow/includes/pyarrow.pxd b/python/pyarrow/includes/pyarrow.pxd index 92c814706fdd6..4c971665ff6aa 100644 --- a/python/pyarrow/includes/pyarrow.pxd +++ b/python/pyarrow/includes/pyarrow.pxd @@ -18,15 +18,18 @@ # distutils: language = c++ from pyarrow.includes.common cimport * -from pyarrow.includes.libarrow cimport (CArray, CColumn, CDataType, +from pyarrow.includes.libarrow cimport (CArray, CColumn, CDataType, CStatus, Type, MemoryPool) +cimport pyarrow.includes.libarrow_io as arrow_io + + cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil: # We can later add more of the common status factory methods as needed - cdef Status Status_OK "Status::OK"() + cdef PyStatus PyStatus_OK "Status::OK"() - cdef cppclass Status: - Status() + cdef cppclass PyStatus "pyarrow::Status": + PyStatus() c_string ToString() @@ -40,12 +43,25 @@ cdef extern from 
"pyarrow/api.h" namespace "pyarrow" nogil: c_bool IsArrowError() shared_ptr[CDataType] GetPrimitiveType(Type type) - Status ConvertPySequence(object obj, shared_ptr[CArray]* out) + PyStatus ConvertPySequence(object obj, shared_ptr[CArray]* out) - Status PandasToArrow(MemoryPool* pool, object ao, shared_ptr[CArray]* out) - Status PandasMaskedToArrow(MemoryPool* pool, object ao, object mo, - shared_ptr[CArray]* out) + PyStatus PandasToArrow(MemoryPool* pool, object ao, + shared_ptr[CArray]* out) + PyStatus PandasMaskedToArrow(MemoryPool* pool, object ao, object mo, + shared_ptr[CArray]* out) - Status ArrowToPandas(const shared_ptr[CColumn]& arr, object py_ref, PyObject** out) + PyStatus ArrowToPandas(const shared_ptr[CColumn]& arr, object py_ref, + PyObject** out) MemoryPool* GetMemoryPool() + + +cdef extern from "pyarrow/io.h" namespace "pyarrow" nogil: + cdef cppclass PyReadableFile(arrow_io.ReadableFileInterface): + PyReadableFile(object fo) + + cdef cppclass PyOutputStream(arrow_io.OutputStream): + PyOutputStream(object fo) + + cdef cppclass PyBytesReader(arrow_io.BufferReader): + PyBytesReader(object fo) diff --git a/python/pyarrow/io.pxd b/python/pyarrow/io.pxd index f55fc0ab53ac1..1dbb3fd76bbfd 100644 --- a/python/pyarrow/io.pxd +++ b/python/pyarrow/io.pxd @@ -23,11 +23,16 @@ from pyarrow.includes.libarrow_io cimport (ReadableFileInterface, OutputStream) -cdef class NativeFileInterface: +cdef class NativeFile: + cdef: + shared_ptr[ReadableFileInterface] rd_file + shared_ptr[OutputStream] wr_file + bint is_readonly + bint is_open # By implementing these "virtual" functions (all functions in Cython - # extension classes are technically virtual in the C++ sense)m we can - # expose the arrow::io abstract file interfaces to other components - # throughout the suite of Arrow C++ libraries + # extension classes are technically virtual in the C++ sense) we can expose + # the arrow::io abstract file interfaces to other components throughout the + # suite of Arrow C++ libraries cdef read_handle(self, shared_ptr[ReadableFileInterface]* file) cdef write_handle(self, shared_ptr[OutputStream]* file) diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index f2eee260c331b..e6e2b625e87ca 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -242,6 +242,9 @@ cdef class HdfsClient: cdef int16_t c_replication = replication or 0 cdef int64_t c_default_block_size = default_block_size or 0 + cdef shared_ptr[HdfsOutputStream] wr_handle + cdef shared_ptr[HdfsReadableFile] rd_handle + if mode in ('wb', 'ab'): if mode == 'ab': append = True @@ -251,13 +254,17 @@ cdef class HdfsClient: self.client.get() .OpenWriteable(c_path, append, c_buffer_size, c_replication, c_default_block_size, - &out.wr_file)) + &wr_handle)) + + out.wr_file = wr_handle out.is_readonly = False else: with nogil: check_cstatus(self.client.get() - .OpenReadable(c_path, &out.rd_file)) + .OpenReadable(c_path, &rd_handle)) + + out.rd_file = rd_handle out.is_readonly = True if c_buffer_size == 0: @@ -314,25 +321,8 @@ cdef class HdfsClient: f = self.open(path, 'rb', buffer_size=buffer_size) f.download(stream) -cdef class NativeFileInterface: - - cdef read_handle(self, shared_ptr[ReadableFileInterface]* file): - raise NotImplementedError - - cdef write_handle(self, shared_ptr[OutputStream]* file): - raise NotImplementedError - -cdef class HdfsFile(NativeFileInterface): - cdef: - shared_ptr[HdfsReadableFile] rd_file - shared_ptr[HdfsOutputStream] wr_file - bint is_readonly - bint is_open - object parent - cdef readonly: - int32_t 
buffer_size - object mode +cdef class NativeFile: def __cinit__(self): self.is_open = False @@ -356,14 +346,6 @@ cdef class HdfsFile(NativeFileInterface): check_cstatus(self.wr_file.get().Close()) self.is_open = False - cdef _assert_readable(self): - if not self.is_readonly: - raise IOError("only valid on readonly files") - - cdef _assert_writeable(self): - if self.is_readonly: - raise IOError("only valid on writeonly files") - cdef read_handle(self, shared_ptr[ReadableFileInterface]* file): self._assert_readable() file[0] = self.rd_file @@ -372,6 +354,14 @@ cdef class HdfsFile(NativeFileInterface): self._assert_writeable() file[0] = self.wr_file + def _assert_readable(self): + if not self.is_readonly: + raise IOError("only valid on readonly files") + + def _assert_writeable(self): + if self.is_readonly: + raise IOError("only valid on writeonly files") + def size(self): cdef int64_t size self._assert_readable() @@ -393,6 +383,83 @@ cdef class HdfsFile(NativeFileInterface): with nogil: check_cstatus(self.rd_file.get().Seek(position)) + def write(self, data): + """ + Write bytes-like (unicode, encoded to UTF-8) to file + """ + self._assert_writeable() + + data = tobytes(data) + + cdef const uint8_t* buf = cp.PyBytes_AS_STRING(data) + cdef int64_t bufsize = len(data) + with nogil: + check_cstatus(self.wr_file.get().Write(buf, bufsize)) + + def read(self, int nbytes): + cdef: + int64_t bytes_read = 0 + uint8_t* buf + shared_ptr[Buffer] out + + self._assert_readable() + + with nogil: + check_cstatus(self.rd_file.get() + .ReadB(nbytes, &out)) + + result = cp.PyBytes_FromStringAndSize( + out.get().data(), out.get().size()) + + return result + + +# ---------------------------------------------------------------------- +# Python file-like objects + +cdef class PythonFileInterface(NativeFile): + cdef: + object handle + + def __cinit__(self, handle, mode='w'): + self.handle = handle + + if mode.startswith('w'): + self.wr_file.reset(new pyarrow.PyOutputStream(handle)) + self.is_readonly = 0 + elif mode.startswith('r'): + self.rd_file.reset(new pyarrow.PyReadableFile(handle)) + self.is_readonly = 1 + else: + raise ValueError('Invalid file mode: {0}'.format(mode)) + + self.is_open = True + + +cdef class BytesReader(NativeFile): + cdef: + object obj + + def __cinit__(self, obj): + if not isinstance(obj, bytes): + raise ValueError('Must pass bytes object') + + self.obj = obj + self.is_readonly = 1 + self.is_open = True + + self.rd_file.reset(new pyarrow.PyBytesReader(obj)) + +# ---------------------------------------------------------------------- +# Specialization for HDFS + + +cdef class HdfsFile(NativeFile): + cdef readonly: + int32_t buffer_size + object mode + object parent + def read(self, int nbytes): """ Read indicated number of bytes from the file, up to EOF @@ -504,16 +571,3 @@ cdef class HdfsFile(NativeFileInterface): writer_thread.join() if exc_info is not None: raise exc_info[0], exc_info[1], exc_info[2] - - def write(self, data): - """ - Write bytes-like (unicode, encoded to UTF-8) to file - """ - self._assert_writeable() - - data = tobytes(data) - - cdef const uint8_t* buf = cp.PyBytes_AS_STRING(data) - cdef int64_t bufsize = len(data) - with nogil: - check_cstatus(self.wr_file.get().Write(buf, bufsize)) diff --git a/python/pyarrow/parquet.pyx b/python/pyarrow/parquet.pyx index 099e148abc16f..ca0176a7c0403 100644 --- a/python/pyarrow/parquet.pyx +++ b/python/pyarrow/parquet.pyx @@ -27,10 +27,10 @@ cimport pyarrow.includes.pyarrow as pyarrow from pyarrow.compat import tobytes from 
pyarrow.error import ArrowException from pyarrow.error cimport check_cstatus -from pyarrow.io import NativeFileInterface +from pyarrow.io import NativeFile from pyarrow.table cimport Table -from pyarrow.io cimport NativeFileInterface +from pyarrow.io cimport NativeFile import six @@ -54,7 +54,7 @@ cdef class ParquetReader: new FileReader(default_memory_pool(), ParquetFileReader.OpenFile(path))) - cdef open_native_file(self, NativeFileInterface file): + cdef open_native_file(self, NativeFile file): cdef shared_ptr[ReadableFileInterface] cpp_handle file.read_handle(&cpp_handle) @@ -84,7 +84,7 @@ def read_table(source, columns=None): if isinstance(source, six.string_types): reader.open_local_file(source) - elif isinstance(source, NativeFileInterface): + elif isinstance(source, NativeFile): reader.open_native_file(source) return reader.read_all() diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index f02d36f520be6..ade82aa676164 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -25,10 +25,12 @@ cimport pyarrow.includes.pyarrow as pyarrow import pyarrow.config from pyarrow.array cimport Array, box_arrow_array -from pyarrow.compat import frombytes, tobytes from pyarrow.error cimport check_status from pyarrow.schema cimport box_data_type, box_schema +from pyarrow.compat import frombytes, tobytes + + cdef class ChunkedArray: ''' Do not call this class's constructor directly. @@ -161,7 +163,7 @@ cdef class Table: @staticmethod def from_pandas(df, name=None): - pass + return from_pandas_dataframe(df, name=name) @staticmethod def from_arrays(names, arrays, name=None): @@ -264,3 +266,34 @@ cdef class Table: def __get__(self): return (self.num_rows, self.num_columns) + + +def from_pandas_dataframe(object df, name=None, timestamps_to_ms=False): + """ + Convert pandas.DataFrame to an Arrow Table + + Parameters + ---------- + df: pandas.DataFrame + + name: str + + timestamps_to_ms: bool + Convert datetime columns to ms resolution. This is needed for + compability with other functionality like Parquet I/O which + only supports milliseconds. + """ + from pyarrow.array import from_pandas_series + + cdef: + list names = [] + list arrays = [] + + for name in df.columns: + col = df[name] + arr = from_pandas_series(col, timestamps_to_ms=timestamps_to_ms) + + names.append(name) + arrays.append(arr) + + return Table.from_arrays(names, arrays, name=name) diff --git a/python/pyarrow/tests/test_hdfs.py b/python/pyarrow/tests/test_hdfs.py new file mode 100644 index 0000000000000..ed8d41994cdd0 --- /dev/null +++ b/python/pyarrow/tests/test_hdfs.py @@ -0,0 +1,128 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +from io import BytesIO +from os.path import join as pjoin +import os +import random + +import pytest + +import pyarrow.io as io + +# ---------------------------------------------------------------------- +# HDFS tests + + +def hdfs_test_client(): + host = os.environ.get('ARROW_HDFS_TEST_HOST', 'localhost') + user = os.environ['ARROW_HDFS_TEST_USER'] + try: + port = int(os.environ.get('ARROW_HDFS_TEST_PORT', 20500)) + except ValueError: + raise ValueError('Env variable ARROW_HDFS_TEST_PORT was not ' + 'an integer') + + return io.HdfsClient.connect(host, port, user) + + +libhdfs = pytest.mark.skipif(not io.have_libhdfs(), + reason='No libhdfs available on system') + + +HDFS_TMP_PATH = '/tmp/pyarrow-test-{0}'.format(random.randint(0, 1000)) + + +@pytest.fixture(scope='session') +def hdfs(request): + fixture = hdfs_test_client() + + def teardown(): + fixture.delete(HDFS_TMP_PATH, recursive=True) + fixture.close() + request.addfinalizer(teardown) + return fixture + + +@libhdfs +def test_hdfs_close(): + client = hdfs_test_client() + assert client.is_open + client.close() + assert not client.is_open + + with pytest.raises(Exception): + client.ls('/') + + +@libhdfs +def test_hdfs_mkdir(hdfs): + path = pjoin(HDFS_TMP_PATH, 'test-dir/test-dir') + parent_path = pjoin(HDFS_TMP_PATH, 'test-dir') + + hdfs.mkdir(path) + assert hdfs.exists(path) + + hdfs.delete(parent_path, recursive=True) + assert not hdfs.exists(path) + + +@libhdfs +def test_hdfs_ls(hdfs): + base_path = pjoin(HDFS_TMP_PATH, 'ls-test') + hdfs.mkdir(base_path) + + dir_path = pjoin(base_path, 'a-dir') + f1_path = pjoin(base_path, 'a-file-1') + + hdfs.mkdir(dir_path) + + f = hdfs.open(f1_path, 'wb') + f.write('a' * 10) + + contents = sorted(hdfs.ls(base_path, False)) + assert contents == [dir_path, f1_path] + + +@libhdfs +def test_hdfs_download_upload(hdfs): + base_path = pjoin(HDFS_TMP_PATH, 'upload-test') + + data = b'foobarbaz' + buf = BytesIO(data) + buf.seek(0) + + hdfs.upload(base_path, buf) + + out_buf = BytesIO() + hdfs.download(base_path, out_buf) + out_buf.seek(0) + assert out_buf.getvalue() == data + + +@libhdfs +def test_hdfs_file_context_manager(hdfs): + path = pjoin(HDFS_TMP_PATH, 'ctx-manager') + + data = b'foo' + with hdfs.open(path, 'wb') as f: + f.write(data) + + with hdfs.open(path, 'rb') as f: + assert f.size() == 3 + result = f.read(10) + assert result == data diff --git a/python/pyarrow/tests/test_io.py b/python/pyarrow/tests/test_io.py index eb92e8ea93a1a..9a41ebe3e8c74 100644 --- a/python/pyarrow/tests/test_io.py +++ b/python/pyarrow/tests/test_io.py @@ -16,112 +16,85 @@ # under the License. 
from io import BytesIO -from os.path import join as pjoin -import os -import random - import pytest +from pyarrow.compat import u import pyarrow.io as io -#---------------------------------------------------------------------- -# HDFS tests +# ---------------------------------------------------------------------- +# Python file-like objects -def hdfs_test_client(): - host = os.environ.get('ARROW_HDFS_TEST_HOST', 'localhost') - user = os.environ['ARROW_HDFS_TEST_USER'] - try: - port = int(os.environ.get('ARROW_HDFS_TEST_PORT', 20500)) - except ValueError: - raise ValueError('Env variable ARROW_HDFS_TEST_PORT was not ' - 'an integer') +def test_python_file_write(): + buf = BytesIO() - return io.HdfsClient.connect(host, port, user) + f = io.PythonFileInterface(buf) + assert f.tell() == 0 -libhdfs = pytest.mark.skipif(not io.have_libhdfs(), - reason='No libhdfs available on system') + s1 = b'enga\xc3\xb1ado' + s2 = b'foobar' + f.write(s1.decode('utf8')) + assert f.tell() == len(s1) -HDFS_TMP_PATH = '/tmp/pyarrow-test-{0}'.format(random.randint(0, 1000)) + f.write(s2) + expected = s1 + s2 -@pytest.fixture(scope='session') -def hdfs(request): - fixture = hdfs_test_client() - def teardown(): - fixture.delete(HDFS_TMP_PATH, recursive=True) - fixture.close() - request.addfinalizer(teardown) - return fixture + result = buf.getvalue() + assert result == expected + f.close() -@libhdfs -def test_hdfs_close(): - client = hdfs_test_client() - assert client.is_open - client.close() - assert not client.is_open - with pytest.raises(Exception): - client.ls('/') +def test_python_file_read(): + data = b'some sample data' + buf = BytesIO(data) + f = io.PythonFileInterface(buf, mode='r') -@libhdfs -def test_hdfs_mkdir(hdfs): - path = pjoin(HDFS_TMP_PATH, 'test-dir/test-dir') - parent_path = pjoin(HDFS_TMP_PATH, 'test-dir') + assert f.size() == len(data) - hdfs.mkdir(path) - assert hdfs.exists(path) + assert f.tell() == 0 - hdfs.delete(parent_path, recursive=True) - assert not hdfs.exists(path) + assert f.read(4) == b'some' + assert f.tell() == 4 + f.seek(0) + assert f.tell() == 0 -@libhdfs -def test_hdfs_ls(hdfs): - base_path = pjoin(HDFS_TMP_PATH, 'ls-test') - hdfs.mkdir(base_path) + f.seek(5) + assert f.tell() == 5 - dir_path = pjoin(base_path, 'a-dir') - f1_path = pjoin(base_path, 'a-file-1') + assert f.read(50) == b'sample data' - hdfs.mkdir(dir_path) + f.close() - f = hdfs.open(f1_path, 'wb') - f.write('a' * 10) - contents = sorted(hdfs.ls(base_path, False)) - assert contents == [dir_path, f1_path] +def test_bytes_reader(): + # Like a BytesIO, but zero-copy underneath for C++ consumers + data = b'some sample data' + f = io.BytesReader(data) + assert f.tell() == 0 -@libhdfs -def test_hdfs_download_upload(hdfs): - base_path = pjoin(HDFS_TMP_PATH, 'upload-test') + assert f.size() == len(data) - data = b'foobarbaz' - buf = BytesIO(data) - buf.seek(0) + assert f.read(4) == b'some' + assert f.tell() == 4 - hdfs.upload(base_path, buf) + f.seek(0) + assert f.tell() == 0 - out_buf = BytesIO() - hdfs.download(base_path, out_buf) - out_buf.seek(0) - assert out_buf.getvalue() == data + f.seek(5) + assert f.tell() == 5 + assert f.read(50) == b'sample data' -@libhdfs -def test_hdfs_file_context_manager(hdfs): - path = pjoin(HDFS_TMP_PATH, 'ctx-manager') + f.close() - data = b'foo' - with hdfs.open(path, 'wb') as f: - f.write(data) - with hdfs.open(path, 'rb') as f: - assert f.size() == 3 - result = f.read(10) - assert result == data +def test_bytes_reader_non_bytes(): + with pytest.raises(ValueError): + 
io.BytesReader(u('some sample data')) diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index a4e7fb6f3bb70..d224074d652cb 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -618,7 +618,7 @@ class ArrowDeserializer { Status OutputFromData(int type, void* data) { // Zero-Copy. We can pass the data pointer directly to NumPy. Py_INCREF(py_ref_); - OwnedRef py_ref(py_ref); + OwnedRef py_ref(py_ref_); npy_intp dims[1] = {col_->length()}; out_ = reinterpret_cast(PyArray_SimpleNewFromData(1, dims, type, data)); diff --git a/python/src/pyarrow/common.cc b/python/src/pyarrow/common.cc index a2748f99b6733..82b14fdf40173 100644 --- a/python/src/pyarrow/common.cc +++ b/python/src/pyarrow/common.cc @@ -68,4 +68,19 @@ arrow::MemoryPool* GetMemoryPool() { return &memory_pool; } +// ---------------------------------------------------------------------- +// PyBytesBuffer + +PyBytesBuffer::PyBytesBuffer(PyObject* obj) + : Buffer(reinterpret_cast(PyBytes_AS_STRING(obj)), + PyBytes_GET_SIZE(obj)), + obj_(obj) { + Py_INCREF(obj_); +} + +PyBytesBuffer::~PyBytesBuffer() { + PyGILGuard lock; + Py_DECREF(obj_); +} + } // namespace pyarrow diff --git a/python/src/pyarrow/common.h b/python/src/pyarrow/common.h index fb0ba3e482296..bc599f84fab50 100644 --- a/python/src/pyarrow/common.h +++ b/python/src/pyarrow/common.h @@ -19,9 +19,8 @@ #define PYARROW_COMMON_H #include "pyarrow/config.h" - #include "arrow/util/buffer.h" - +#include "arrow/util/macros.h" #include "pyarrow/visibility.h" namespace arrow { class MemoryPool; } @@ -83,6 +82,20 @@ struct PyObjectStringify { } }; +class PyGILGuard { + public: + PyGILGuard() { + state_ = PyGILState_Ensure(); + } + + ~PyGILGuard() { + PyGILState_Release(state_); + } + private: + PyGILState_STATE state_; + DISALLOW_COPY_AND_ASSIGN(PyGILGuard); +}; + // TODO(wesm): We can just let errors pass through. To be explored later #define RETURN_IF_PYERROR() \ if (PyErr_Occurred()) { \ @@ -100,8 +113,8 @@ PYARROW_EXPORT arrow::MemoryPool* GetMemoryPool(); class PYARROW_EXPORT NumPyBuffer : public arrow::Buffer { public: - NumPyBuffer(PyArrayObject* arr) : - Buffer(nullptr, 0) { + NumPyBuffer(PyArrayObject* arr) + : Buffer(nullptr, 0) { arr_ = arr; Py_INCREF(arr); @@ -117,6 +130,15 @@ class PYARROW_EXPORT NumPyBuffer : public arrow::Buffer { PyArrayObject* arr_; }; +class PYARROW_EXPORT PyBytesBuffer : public arrow::Buffer { + public: + PyBytesBuffer(PyObject* obj); + ~PyBytesBuffer(); + + private: + PyObject* obj_; +}; + } // namespace pyarrow #endif // PYARROW_COMMON_H diff --git a/python/src/pyarrow/io.cc b/python/src/pyarrow/io.cc new file mode 100644 index 0000000000000..35054e9025ad4 --- /dev/null +++ b/python/src/pyarrow/io.cc @@ -0,0 +1,215 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. 
See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "pyarrow/io.h" + +#include +#include + +#include +#include +#include + +#include "pyarrow/common.h" +#include "pyarrow/status.h" + +namespace pyarrow { + +// ---------------------------------------------------------------------- +// Python file + +PythonFile::PythonFile(PyObject* file) + : file_(file) { + Py_INCREF(file_); +} + +PythonFile::~PythonFile() { + Py_DECREF(file_); +} + +static arrow::Status CheckPyError() { + if (PyErr_Occurred()) { + PyObject *exc_type, *exc_value, *traceback; + PyErr_Fetch(&exc_type, &exc_value, &traceback); + PyObjectStringify stringified(exc_value); + std::string message(stringified.bytes); + Py_DECREF(exc_type); + Py_DECREF(exc_value); + Py_DECREF(traceback); + PyErr_Clear(); + return arrow::Status::IOError(message); + } + return arrow::Status::OK(); +} + +arrow::Status PythonFile::Close() { + // whence: 0 for relative to start of file, 2 for end of file + PyObject* result = PyObject_CallMethod(file_, "close", "()"); + Py_XDECREF(result); + ARROW_RETURN_NOT_OK(CheckPyError()); + return arrow::Status::OK(); +} + +arrow::Status PythonFile::Seek(int64_t position, int whence) { + // whence: 0 for relative to start of file, 2 for end of file + PyObject* result = PyObject_CallMethod(file_, "seek", "(ii)", position, whence); + Py_XDECREF(result); + ARROW_RETURN_NOT_OK(CheckPyError()); + return arrow::Status::OK(); +} + +arrow::Status PythonFile::Read(int64_t nbytes, PyObject** out) { + PyObject* result = PyObject_CallMethod(file_, "read", "(i)", nbytes); + ARROW_RETURN_NOT_OK(CheckPyError()); + *out = result; + return arrow::Status::OK(); +} + +arrow::Status PythonFile::Write(const uint8_t* data, int64_t nbytes) { + PyObject* py_data = PyBytes_FromStringAndSize( + reinterpret_cast(data), nbytes); + ARROW_RETURN_NOT_OK(CheckPyError()); + + PyObject* result = PyObject_CallMethod(file_, "write", "(O)", py_data); + Py_DECREF(py_data); + Py_XDECREF(result); + ARROW_RETURN_NOT_OK(CheckPyError()); + return arrow::Status::OK(); +} + +arrow::Status PythonFile::Tell(int64_t* position) { + PyObject* result = PyObject_CallMethod(file_, "tell", "()"); + ARROW_RETURN_NOT_OK(CheckPyError()); + + *position = PyLong_AsLongLong(result); + Py_DECREF(result); + + // PyLong_AsLongLong can raise OverflowError + ARROW_RETURN_NOT_OK(CheckPyError()); + + return arrow::Status::OK(); +} + +// ---------------------------------------------------------------------- +// Seekable input stream + +PyReadableFile::PyReadableFile(PyObject* file) { + file_.reset(new PythonFile(file)); +} + +PyReadableFile::~PyReadableFile() {} + +arrow::Status PyReadableFile::Close() { + PyGILGuard lock; + return file_->Close(); +} + +arrow::Status PyReadableFile::Seek(int64_t position) { + PyGILGuard lock; + return file_->Seek(position, 0); +} + +arrow::Status PyReadableFile::Tell(int64_t* position) { + PyGILGuard lock; + return file_->Tell(position); +} + +arrow::Status PyReadableFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) { + PyGILGuard lock; + PyObject* bytes_obj; + ARROW_RETURN_NOT_OK(file_->Read(nbytes, &bytes_obj)); + + *bytes_read = PyBytes_GET_SIZE(bytes_obj); + std::memcpy(out, PyBytes_AS_STRING(bytes_obj), *bytes_read); + Py_DECREF(bytes_obj); + + return arrow::Status::OK(); +} + +arrow::Status PyReadableFile::Read(int64_t nbytes, std::shared_ptr* out) { + PyGILGuard lock; + + PyObject* bytes_obj; + ARROW_RETURN_NOT_OK(file_->Read(nbytes, &bytes_obj)); + + *out = 
std::make_shared(bytes_obj); + Py_DECREF(bytes_obj); + + return arrow::Status::OK(); +} + +arrow::Status PyReadableFile::GetSize(int64_t* size) { + PyGILGuard lock; + + int64_t current_position;; + ARROW_RETURN_NOT_OK(file_->Tell(¤t_position)); + + ARROW_RETURN_NOT_OK(file_->Seek(0, 2)); + + int64_t file_size; + ARROW_RETURN_NOT_OK(file_->Tell(&file_size)); + + // Restore previous file position + ARROW_RETURN_NOT_OK(file_->Seek(current_position, 0)); + + *size = file_size; + return arrow::Status::OK(); +} + +bool PyReadableFile::supports_zero_copy() const { + return false; +} + +// ---------------------------------------------------------------------- +// Output stream + +PyOutputStream::PyOutputStream(PyObject* file) { + file_.reset(new PythonFile(file)); +} + +PyOutputStream::~PyOutputStream() {} + +arrow::Status PyOutputStream::Close() { + PyGILGuard lock; + return file_->Close(); +} + +arrow::Status PyOutputStream::Tell(int64_t* position) { + PyGILGuard lock; + return file_->Tell(position); +} + +arrow::Status PyOutputStream::Write(const uint8_t* data, int64_t nbytes) { + PyGILGuard lock; + return file_->Write(data, nbytes); +} + +// ---------------------------------------------------------------------- +// A readable file that is backed by a PyBytes + +PyBytesReader::PyBytesReader(PyObject* obj) + : arrow::io::BufferReader(reinterpret_cast(PyBytes_AS_STRING(obj)), + PyBytes_GET_SIZE(obj)), + obj_(obj) { + Py_INCREF(obj_); +} + +PyBytesReader::~PyBytesReader() { + Py_DECREF(obj_); +} + +} // namespace pyarrow diff --git a/python/src/pyarrow/io.h b/python/src/pyarrow/io.h new file mode 100644 index 0000000000000..e14aa8cfb27e3 --- /dev/null +++ b/python/src/pyarrow/io.h @@ -0,0 +1,97 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef PYARROW_IO_H +#define PYARROW_IO_H + +#include "arrow/io/interfaces.h" +#include "arrow/io/memory.h" + +#include "pyarrow/config.h" +#include "pyarrow/visibility.h" + +namespace arrow { class MemoryPool; } + +namespace pyarrow { + +// A common interface to a Python file-like object. 
Must acquire GIL before +// calling any methods +class PythonFile { + public: + PythonFile(PyObject* file); + ~PythonFile(); + + arrow::Status Close(); + arrow::Status Seek(int64_t position, int whence); + arrow::Status Read(int64_t nbytes, PyObject** out); + arrow::Status Tell(int64_t* position); + arrow::Status Write(const uint8_t* data, int64_t nbytes); + + private: + PyObject* file_; +}; + +class PYARROW_EXPORT PyReadableFile : public arrow::io::ReadableFileInterface { + public: + explicit PyReadableFile(PyObject* file); + virtual ~PyReadableFile(); + + arrow::Status Close() override; + + arrow::Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) override; + arrow::Status Read(int64_t nbytes, std::shared_ptr* out) override; + + arrow::Status GetSize(int64_t* size) override; + + arrow::Status Seek(int64_t position) override; + + arrow::Status Tell(int64_t* position) override; + + bool supports_zero_copy() const override; + + private: + std::unique_ptr file_; +}; + +class PYARROW_EXPORT PyOutputStream : public arrow::io::OutputStream { + public: + explicit PyOutputStream(PyObject* file); + virtual ~PyOutputStream(); + + arrow::Status Close() override; + arrow::Status Tell(int64_t* position) override; + arrow::Status Write(const uint8_t* data, int64_t nbytes) override; + + private: + std::unique_ptr file_; +}; + +// A zero-copy reader backed by a PyBytes object +class PYARROW_EXPORT PyBytesReader : public arrow::io::BufferReader { + public: + explicit PyBytesReader(PyObject* obj); + virtual ~PyBytesReader(); + + private: + PyObject* obj_; +}; + +// TODO(wesm): seekable output files + +} // namespace pyarrow + +#endif // PYARROW_IO_H From c3cfa3d3b3ce017776508f42fe9410bfb99cd94f Mon Sep 17 00:00:00 2001 From: "Christopher C. Aycock" Date: Tue, 4 Oct 2016 14:07:59 -0700 Subject: [PATCH 0154/1644] ARROW-313: Build on any version of XCode Author: Christopher C. Aycock Closes #155 from chrisaycock/ARROW-313 and squashes the following commits: e47cc01 [Christopher C. Aycock] ARROW-313: Build on any version of XCode --- cpp/cmake_modules/CompilerInfo.cmake | 4 ++-- python/cmake_modules/CompilerInfo.cmake | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/cpp/cmake_modules/CompilerInfo.cmake b/cpp/cmake_modules/CompilerInfo.cmake index e1c821cca5d45..02f6fd46997a3 100644 --- a/cpp/cmake_modules/CompilerInfo.cmake +++ b/cpp/cmake_modules/CompilerInfo.cmake @@ -29,9 +29,9 @@ elseif("${COMPILER_VERSION_FULL}" MATCHES ".*based on LLVM.*") string(REGEX REPLACE ".*based on LLVM ([0-9]+\\.[0.9]+).*" "\\1" COMPILER_VERSION "${COMPILER_VERSION_FULL}") -# clang on Mac OS X, XCode 7. No version replacement is done +# clang on Mac OS X, XCode 7+. No version replacement is done # because Apple no longer advertises the upstream LLVM version. -elseif("${COMPILER_VERSION_FULL}" MATCHES "clang-70[0-9]\\..*") +elseif("${COMPILER_VERSION_FULL}" MATCHES "clang-.*") set(COMPILER_FAMILY "clang") # gcc diff --git a/python/cmake_modules/CompilerInfo.cmake b/python/cmake_modules/CompilerInfo.cmake index 55f989a1a6c9d..8e85bdea96ea5 100644 --- a/python/cmake_modules/CompilerInfo.cmake +++ b/python/cmake_modules/CompilerInfo.cmake @@ -32,9 +32,9 @@ elseif("${COMPILER_VERSION_FULL}" MATCHES ".*based on LLVM.*") string(REGEX REPLACE ".*based on LLVM ([0-9]+\\.[0.9]+).*" "\\1" COMPILER_VERSION "${COMPILER_VERSION_FULL}") -# clang on Mac OS X, XCode 7. No version replacement is done +# clang on Mac OS X, XCode 7+. 
No version replacement is done # because Apple no longer advertises the upstream LLVM version. -elseif("${COMPILER_VERSION_FULL}" MATCHES "clang-70[0-9]\\..*") +elseif("${COMPILER_VERSION_FULL}" MATCHES "clang-.*") set(COMPILER_FAMILY "clang") # gcc From 7fb4d24a35269db99fa112c0512d4a32c372dd74 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Tue, 4 Oct 2016 15:11:56 -0700 Subject: [PATCH 0155/1644] ARROW-315: finalize timestamp Author: Julien Le Dem Closes #156 from julienledem/timestamp and squashes the following commits: 0ee017f [Julien Le Dem] review feedback 86cae98 [Julien Le Dem] ARROW-315: finalize timestamp --- format/Message.fbs | 5 +- .../src/main/codegen/data/ArrowTypes.tdd | 2 +- .../templates/NullableValueVectors.java | 2 +- .../org/apache/arrow/vector/types/Types.java | 46 +++++++++++-------- .../apache/arrow/vector/pojo/TestConvert.java | 3 +- 5 files changed, 34 insertions(+), 24 deletions(-) diff --git a/format/Message.fbs b/format/Message.fbs index 3d877a2f234af..d8fa65006c24d 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -64,8 +64,11 @@ table Date { table Time { } +enum TimeUnit: short { SECOND, MILLISECOND, MICROSECOND, NANOSECOND } + +/// time from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC. table Timestamp { - timezone: string; + unit: TimeUnit; } enum IntervalUnit: short { YEAR_MONTH, DAY_TIME} diff --git a/java/vector/src/main/codegen/data/ArrowTypes.tdd b/java/vector/src/main/codegen/data/ArrowTypes.tdd index 9624fecf6aad1..11ac99af42414 100644 --- a/java/vector/src/main/codegen/data/ArrowTypes.tdd +++ b/java/vector/src/main/codegen/data/ArrowTypes.tdd @@ -66,7 +66,7 @@ }, { name: "Timestamp", - fields: [{name: "timezone", type: "String"}] + fields: [{name: "unit", type: short}] }, { name: "Interval", diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index 8f325afad3920..bafa31760205a 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -103,7 +103,7 @@ public final class ${className} extends BaseDataValueVector implements <#if type <#elseif minor.class == "Float8"> field = new Field(name, true, new FloatingPoint(Precision.DOUBLE), null); <#elseif minor.class == "TimeStamp"> - field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(""), null); + field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(org.apache.arrow.flatbuf.TimeUnit.MILLISECOND), null); <#elseif minor.class == "IntervalDay"> field = new Field(name, true, new Interval(org.apache.arrow.flatbuf.IntervalUnit.DAY_TIME), null); <#elseif minor.class == "IntervalYear"> diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java index 2ff93d4b98d11..d9593673156bf 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -19,6 +19,7 @@ import org.apache.arrow.flatbuf.IntervalUnit; import org.apache.arrow.flatbuf.Precision; +import org.apache.arrow.flatbuf.TimeUnit; import org.apache.arrow.flatbuf.UnionMode; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.FieldVector; @@ -101,7 +102,7 @@ public class Types { private static final Field UINT8_FIELD = new Field("", true, new Int(64, false), null); private static 
final Field DATE_FIELD = new Field("", true, Date.INSTANCE, null); private static final Field TIME_FIELD = new Field("", true, Time.INSTANCE, null); - private static final Field TIMESTAMP_FIELD = new Field("", true, new Timestamp(""), null); + private static final Field TIMESTAMP_FIELD = new Field("", true, new Timestamp(org.apache.arrow.flatbuf.TimeUnit.MILLISECOND), null); private static final Field INTERVALDAY_FIELD = new Field("", true, new Interval(IntervalUnit.DAY_TIME), null); private static final Field INTERVALYEAR_FIELD = new Field("", true, new Interval(IntervalUnit.YEAR_MONTH), null); private static final Field FLOAT4_FIELD = new Field("", true, new FloatingPoint(Precision.SINGLE), null); @@ -143,8 +144,7 @@ public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack public FieldWriter getNewFieldWriter(ValueVector vector) { return new NullableMapWriter((NullableMapVector) vector); } - }, // an empty map column. Useful for conceptual setup. Children listed within here - + }, TINYINT(new Int(8, true)) { @Override public Field getField() { @@ -160,7 +160,7 @@ public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack public FieldWriter getNewFieldWriter(ValueVector vector) { return new TinyIntWriterImpl((NullableTinyIntVector) vector); } - }, // single byte signed integer + }, SMALLINT(new Int(16, true)) { @Override public Field getField() { @@ -176,7 +176,7 @@ public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack public FieldWriter getNewFieldWriter(ValueVector vector) { return new SmallIntWriterImpl((NullableSmallIntVector) vector); } - }, // two byte signed integer + }, INT(new Int(32, true)) { @Override public Field getField() { @@ -192,7 +192,7 @@ public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack public FieldWriter getNewFieldWriter(ValueVector vector) { return new IntWriterImpl((NullableIntVector) vector); } - }, // four byte signed integer + }, BIGINT(new Int(64, true)) { @Override public Field getField() { @@ -208,7 +208,7 @@ public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack public FieldWriter getNewFieldWriter(ValueVector vector) { return new BigIntWriterImpl((NullableBigIntVector) vector); } - }, // eight byte signed integer + }, DATE(Date.INSTANCE) { @Override public Field getField() { @@ -224,7 +224,7 @@ public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack public FieldWriter getNewFieldWriter(ValueVector vector) { return new DateWriterImpl((NullableDateVector) vector); } - }, // days since 4713bc + }, TIME(Time.INSTANCE) { @Override public Field getField() { @@ -240,8 +240,9 @@ public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack public FieldWriter getNewFieldWriter(ValueVector vector) { return new TimeWriterImpl((NullableTimeVector) vector); } - }, // time in micros before or after 2000/1/1 - TIMESTAMP(new Timestamp("")) { + }, + // time in millis from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC. 
+ TIMESTAMP(new Timestamp(org.apache.arrow.flatbuf.TimeUnit.MILLISECOND)) { @Override public Field getField() { return TIMESTAMP_FIELD; @@ -289,6 +290,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { return new IntervalYearWriterImpl((NullableIntervalYearVector) vector); } }, + // 4 byte ieee 754 FLOAT4(new FloatingPoint(Precision.SINGLE)) { @Override public Field getField() { @@ -304,7 +306,8 @@ public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack public FieldWriter getNewFieldWriter(ValueVector vector) { return new Float4WriterImpl((NullableFloat4Vector) vector); } - }, // 4 byte ieee 754 + }, + // 8 byte ieee 754 FLOAT8(new FloatingPoint(Precision.DOUBLE)) { @Override public Field getField() { @@ -320,7 +323,7 @@ public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack public FieldWriter getNewFieldWriter(ValueVector vector) { return new Float8WriterImpl((NullableFloat8Vector) vector); } - }, // 8 byte ieee 754 + }, BIT(Bool.INSTANCE) { @Override public Field getField() { @@ -336,7 +339,7 @@ public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack public FieldWriter getNewFieldWriter(ValueVector vector) { return new BitWriterImpl((NullableBitVector) vector); } - }, // single bit value (boolean) + }, VARCHAR(Utf8.INSTANCE) { @Override public Field getField() { @@ -352,7 +355,7 @@ public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack public FieldWriter getNewFieldWriter(ValueVector vector) { return new VarCharWriterImpl((NullableVarCharVector) vector); } - }, // utf8 variable length string + }, VARBINARY(Binary.INSTANCE) { @Override public Field getField() { @@ -368,7 +371,7 @@ public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack public FieldWriter getNewFieldWriter(ValueVector vector) { return new VarBinaryWriterImpl((NullableVarBinaryVector) vector); } - }, // variable length binary + }, DECIMAL(null) { @Override public ArrowType getType() { @@ -388,7 +391,7 @@ public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack public FieldWriter getNewFieldWriter(ValueVector vector) { return new DecimalWriterImpl((NullableDecimalVector) vector); } - }, // variable length binary + }, UINT1(new Int(8, false)) { @Override public Field getField() { @@ -404,7 +407,7 @@ public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack public FieldWriter getNewFieldWriter(ValueVector vector) { return new UInt1WriterImpl((NullableUInt1Vector) vector); } - }, // unsigned 1 byte integer + }, UINT2(new Int(16, false)) { @Override public Field getField() { @@ -420,7 +423,7 @@ public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack public FieldWriter getNewFieldWriter(ValueVector vector) { return new UInt2WriterImpl((NullableUInt2Vector) vector); } - }, // unsigned 2 byte integer + }, UINT4(new Int(32, false)) { @Override public Field getField() { @@ -436,7 +439,7 @@ public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack public FieldWriter getNewFieldWriter(ValueVector vector) { return new UInt4WriterImpl((NullableUInt4Vector) vector); } - }, // unsigned 4 byte integer + }, UINT8(new Int(64, false)) { @Override public Field getField() { @@ -452,7 +455,7 @@ public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack public FieldWriter getNewFieldWriter(ValueVector vector) { return new UInt8WriterImpl((NullableUInt8Vector) vector); } - }, // unsigned 8 byte 
integer + }, LIST(List.INSTANCE) { @Override public Field getField() { @@ -576,6 +579,9 @@ public MinorType visit(FloatingPoint type) { } @Override public MinorType visit(Timestamp type) { + if (type.getUnit() != TimeUnit.MILLISECOND) { + throw new UnsupportedOperationException("Only milliseconds supported: " + type); + } return MinorType.TIMESTAMP; } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java index ed740cd0f1b78..3da8db298b4a3 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java @@ -21,6 +21,7 @@ import static org.apache.arrow.flatbuf.Precision.SINGLE; import static org.junit.Assert.assertEquals; +import org.apache.arrow.flatbuf.TimeUnit; import org.apache.arrow.flatbuf.UnionMode; import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.ArrowType.FloatingPoint; @@ -80,7 +81,7 @@ public void nestedSchema() { new Field("child4.1", true, Utf8.INSTANCE, null) ))); childrenBuilder.add(new Field("child5", true, new Union(UnionMode.Sparse, new int[] { MinorType.TIMESTAMP.ordinal(), MinorType.FLOAT8.ordinal() } ), ImmutableList.of( - new Field("child5.1", true, new Timestamp("UTC"), null), + new Field("child5.1", true, new Timestamp(TimeUnit.MILLISECOND), null), new Field("child5.2", true, new FloatingPoint(DOUBLE), ImmutableList.of()) ))); Schema initialSchema = new Schema(childrenBuilder.build()); From dd1b95b90e73c3b0b69bfd6284e329eea41f689d Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 5 Oct 2016 16:15:16 -0700 Subject: [PATCH 0156/1644] ARROW-318: Revise python/README.md given recent changes in codebase Also removes a redundant LICENSE.txt from cpp/ Author: Wes McKinney Closes #157 from wesm/ARROW-318 and squashes the following commits: 9e802f2 [Wes McKinney] Update python/README.md. Remove redundant LICENSE.txt from cpp/ --- cpp/LICENSE.txt | 202 ----------------------------------------------- python/README.md | 36 ++++----- 2 files changed, 14 insertions(+), 224 deletions(-) delete mode 100644 cpp/LICENSE.txt diff --git a/cpp/LICENSE.txt b/cpp/LICENSE.txt deleted file mode 100644 index d645695673349..0000000000000 --- a/cpp/LICENSE.txt +++ /dev/null @@ -1,202 +0,0 @@ - - Apache License - Version 2.0, January 2004 - http://www.apache.org/licenses/ - - TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION - - 1. Definitions. - - "License" shall mean the terms and conditions for use, reproduction, - and distribution as defined by Sections 1 through 9 of this document. - - "Licensor" shall mean the copyright owner or entity authorized by - the copyright owner that is granting the License. - - "Legal Entity" shall mean the union of the acting entity and all - other entities that control, are controlled by, or are under common - control with that entity. For the purposes of this definition, - "control" means (i) the power, direct or indirect, to cause the - direction or management of such entity, whether by contract or - otherwise, or (ii) ownership of fifty percent (50%) or more of the - outstanding shares, or (iii) beneficial ownership of such entity. - - "You" (or "Your") shall mean an individual or Legal Entity - exercising permissions granted by this License. 
- - "Source" form shall mean the preferred form for making modifications, - including but not limited to software source code, documentation - source, and configuration files. - - "Object" form shall mean any form resulting from mechanical - transformation or translation of a Source form, including but - not limited to compiled object code, generated documentation, - and conversions to other media types. - - "Work" shall mean the work of authorship, whether in Source or - Object form, made available under the License, as indicated by a - copyright notice that is included in or attached to the work - (an example is provided in the Appendix below). - - "Derivative Works" shall mean any work, whether in Source or Object - form, that is based on (or derived from) the Work and for which the - editorial revisions, annotations, elaborations, or other modifications - represent, as a whole, an original work of authorship. For the purposes - of this License, Derivative Works shall not include works that remain - separable from, or merely link (or bind by name) to the interfaces of, - the Work and Derivative Works thereof. - - "Contribution" shall mean any work of authorship, including - the original version of the Work and any modifications or additions - to that Work or Derivative Works thereof, that is intentionally - submitted to Licensor for inclusion in the Work by the copyright owner - or by an individual or Legal Entity authorized to submit on behalf of - the copyright owner. For the purposes of this definition, "submitted" - means any form of electronic, verbal, or written communication sent - to the Licensor or its representatives, including but not limited to - communication on electronic mailing lists, source code control systems, - and issue tracking systems that are managed by, or on behalf of, the - Licensor for the purpose of discussing and improving the Work, but - excluding communication that is conspicuously marked or otherwise - designated in writing by the copyright owner as "Not a Contribution." - - "Contributor" shall mean Licensor and any individual or Legal Entity - on behalf of whom a Contribution has been received by Licensor and - subsequently incorporated within the Work. - - 2. Grant of Copyright License. Subject to the terms and conditions of - this License, each Contributor hereby grants to You a perpetual, - worldwide, non-exclusive, no-charge, royalty-free, irrevocable - copyright license to reproduce, prepare Derivative Works of, - publicly display, publicly perform, sublicense, and distribute the - Work and such Derivative Works in Source or Object form. - - 3. Grant of Patent License. Subject to the terms and conditions of - this License, each Contributor hereby grants to You a perpetual, - worldwide, non-exclusive, no-charge, royalty-free, irrevocable - (except as stated in this section) patent license to make, have made, - use, offer to sell, sell, import, and otherwise transfer the Work, - where such license applies only to those patent claims licensable - by such Contributor that are necessarily infringed by their - Contribution(s) alone or by combination of their Contribution(s) - with the Work to which such Contribution(s) was submitted. 
If You - institute patent litigation against any entity (including a - cross-claim or counterclaim in a lawsuit) alleging that the Work - or a Contribution incorporated within the Work constitutes direct - or contributory patent infringement, then any patent licenses - granted to You under this License for that Work shall terminate - as of the date such litigation is filed. - - 4. Redistribution. You may reproduce and distribute copies of the - Work or Derivative Works thereof in any medium, with or without - modifications, and in Source or Object form, provided that You - meet the following conditions: - - (a) You must give any other recipients of the Work or - Derivative Works a copy of this License; and - - (b) You must cause any modified files to carry prominent notices - stating that You changed the files; and - - (c) You must retain, in the Source form of any Derivative Works - that You distribute, all copyright, patent, trademark, and - attribution notices from the Source form of the Work, - excluding those notices that do not pertain to any part of - the Derivative Works; and - - (d) If the Work includes a "NOTICE" text file as part of its - distribution, then any Derivative Works that You distribute must - include a readable copy of the attribution notices contained - within such NOTICE file, excluding those notices that do not - pertain to any part of the Derivative Works, in at least one - of the following places: within a NOTICE text file distributed - as part of the Derivative Works; within the Source form or - documentation, if provided along with the Derivative Works; or, - within a display generated by the Derivative Works, if and - wherever such third-party notices normally appear. The contents - of the NOTICE file are for informational purposes only and - do not modify the License. You may add Your own attribution - notices within Derivative Works that You distribute, alongside - or as an addendum to the NOTICE text from the Work, provided - that such additional attribution notices cannot be construed - as modifying the License. - - You may add Your own copyright statement to Your modifications and - may provide additional or different license terms and conditions - for use, reproduction, or distribution of Your modifications, or - for any such Derivative Works as a whole, provided Your use, - reproduction, and distribution of the Work otherwise complies with - the conditions stated in this License. - - 5. Submission of Contributions. Unless You explicitly state otherwise, - any Contribution intentionally submitted for inclusion in the Work - by You to the Licensor shall be under the terms and conditions of - this License, without any additional terms or conditions. - Notwithstanding the above, nothing herein shall supersede or modify - the terms of any separate license agreement you may have executed - with Licensor regarding such Contributions. - - 6. Trademarks. This License does not grant permission to use the trade - names, trademarks, service marks, or product names of the Licensor, - except as required for reasonable and customary use in describing the - origin of the Work and reproducing the content of the NOTICE file. - - 7. Disclaimer of Warranty. 
Unless required by applicable law or - agreed to in writing, Licensor provides the Work (and each - Contributor provides its Contributions) on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - implied, including, without limitation, any warranties or conditions - of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A - PARTICULAR PURPOSE. You are solely responsible for determining the - appropriateness of using or redistributing the Work and assume any - risks associated with Your exercise of permissions under this License. - - 8. Limitation of Liability. In no event and under no legal theory, - whether in tort (including negligence), contract, or otherwise, - unless required by applicable law (such as deliberate and grossly - negligent acts) or agreed to in writing, shall any Contributor be - liable to You for damages, including any direct, indirect, special, - incidental, or consequential damages of any character arising as a - result of this License or out of the use or inability to use the - Work (including but not limited to damages for loss of goodwill, - work stoppage, computer failure or malfunction, or any and all - other commercial damages or losses), even if such Contributor - has been advised of the possibility of such damages. - - 9. Accepting Warranty or Additional Liability. While redistributing - the Work or Derivative Works thereof, You may choose to offer, - and charge a fee for, acceptance of support, warranty, indemnity, - or other liability obligations and/or rights consistent with this - License. However, in accepting such obligations, You may act only - on Your own behalf and on Your sole responsibility, not on behalf - of any other Contributor, and only if You agree to indemnify, - defend, and hold each Contributor harmless for any liability - incurred by, or claims asserted against, such Contributor by reason - of your accepting any such warranty or additional liability. - - END OF TERMS AND CONDITIONS - - APPENDIX: How to apply the Apache License to your work. - - To apply the Apache License to your work, attach the following - boilerplate notice, with the fields enclosed by brackets "[]" - replaced with your own identifying information. (Don't include - the brackets!) The text should be enclosed in the appropriate - comment syntax for the file format. We also recommend that a - file or class name and description of purpose be included on the - same "printed page" as the copyright notice for easier - identification within third-party archives. - - Copyright [yyyy] [name of copyright owner] - - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. diff --git a/python/README.md b/python/README.md index bafe71b05ec22..3235d18377d56 100644 --- a/python/README.md +++ b/python/README.md @@ -19,25 +19,17 @@ These are the various projects that PyArrow depends on. 1. **g++ and gcc Version >= 4.8** 2. **cmake > 2.8.6** 3. **boost** -4. **Parquet-cpp** - - The preferred way to install parquet-cpp is to use conda. 
- You need to set the ``PARQUET_HOME`` environment variable to where parquet-cpp is installed. - ```bash - conda install -y --channel apache/channel/dev parquet-cpp - ``` -5. **Arrow-cpp and its dependencies*** - - The Arrow C++ library must be built with all options enabled and installed with ``ARROW_HOME`` environment variable set to - the installation location. Look at (https://github.com/apache/arrow/blob/master/cpp/README.md) for - instructions. Alternatively you could just install arrow-cpp - from conda. - ```bash - conda install arrow-cpp -c apache/channel/dev - ``` -6. **Python dependencies: numpy, pandas, cython, pytest** - -#### Install pyarrow - ```bash - python setup.py build_ext --inplace - ``` +4. **Arrow-cpp and its dependencies*** + +The Arrow C++ library must be built with all options enabled and installed with +``ARROW_HOME`` environment variable set to the installation location. Look at +(https://github.com/apache/arrow/blob/master/cpp/README.md) for instructions. + +5. **Python dependencies: numpy, pandas, cython, pytest** + +#### Build pyarrow and run the unit tests + +```bash +python setup.py build_ext --inplace +py.test pyarrow +``` From 04cf8746f3588d7bfadcc0b9c8dbe71707bdd196 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Wed, 5 Oct 2016 16:20:31 -0700 Subject: [PATCH 0157/1644] ARROW-321: fix arrow licenses Author: Julien Le Dem Closes #159 from julienledem/fix_licenses and squashes the following commits: 0c97810 [Julien Le Dem] fix NOTICE 1489289 [Julien Le Dem] more licenses 0eb2aeb [Julien Le Dem] more licenses 9ac1159 [Julien Le Dem] more licenses eafa0e1 [Julien Le Dem] more licenses 30b0fa1 [Julien Le Dem] more licenses bcfc75f [Julien Le Dem] add ci 51db31b [Julien Le Dem] ARROW-321: fix arrow licenses --- NOTICE.txt | 20 +++++++++++ README.md | 16 ++++++++- ci/travis_before_script_cpp.sh | 13 ++++++++ ci/travis_conda_build.sh | 12 +++++++ ci/travis_install_conda.sh | 12 +++++++ ci/travis_script_cpp.sh | 12 +++++++ ci/travis_script_java.sh | 12 +++++++ ci/travis_script_python.sh | 12 +++++++ cpp/README.md | 14 ++++++++ cpp/cmake_modules/FindGPerf.cmake | 12 +++++++ cpp/cmake_modules/san-config.cmake | 12 +++++++ cpp/conda.recipe/build.sh | 12 +++++++ cpp/conda.recipe/meta.yaml | 12 +++++++ cpp/doc/HDFS.md | 16 ++++++++- cpp/doc/Parquet.md | 16 ++++++++- cpp/setup_build_env.sh | 12 +++++++ cpp/src/arrow/io/symbols.map | 12 +++++++ cpp/src/arrow/ipc/symbols.map | 12 +++++++ cpp/src/arrow/symbols.map | 12 +++++++ cpp/thirdparty/build_thirdparty.sh | 12 +++++++ cpp/thirdparty/download_thirdparty.sh | 12 +++++++ cpp/thirdparty/set_thirdparty_env.sh | 12 +++++++ cpp/thirdparty/versions.sh | 12 +++++++ dev/release/02-source.sh | 33 ++++++++++++++++++- dev/release/README | 12 ++++++- format/File.fbs | 17 ++++++++++ format/Guidelines.md | 13 ++++++++ format/IPC.md | 14 ++++++++ format/Layout.md | 14 ++++++++ format/Message.fbs | 17 ++++++++++ format/Metadata.md | 14 ++++++++ format/README.md | 16 ++++++++- java/README.md | 14 ++++++++ java/vector/src/main/codegen/config.fmpp | 22 +++++-------- .../src/main/codegen/data/ArrowTypes.tdd | 22 +++++-------- .../main/codegen/data/ValueVectorTypes.tdd | 22 +++++-------- python/README.md | 14 ++++++++ python/asv.conf.json | 12 +++++++ python/conda.recipe/build.sh | 13 ++++++++ python/conda.recipe/meta.yaml | 12 +++++++ python/doc/Benchmarks.md | 13 ++++++++ python/doc/INSTALL.md | 16 ++++++++- python/pyarrow/config.pyx | 12 +++++++ 43 files changed, 575 insertions(+), 46 deletions(-) diff --git a/NOTICE.txt b/NOTICE.txt 
index ce6e567dcb518..679bb59e6a97d 100644 --- a/NOTICE.txt +++ b/NOTICE.txt @@ -18,3 +18,23 @@ https://github.com/wesm/feather This product includes software from the DyND project (BSD 2-clause) https://github.com/libdynd + +This product includes software from the LLVM project + * distributed under the University of Illinois Open Source + +This product includes software from the google-lint project + * Copyright (c) 2009 Google Inc. All rights reserved. + +This product includes software from the mman-win32 project + * Copyright https://code.google.com/p/mman-win32/ + * Licensed under the MIT License; + +This product includes software from the LevelDB project + * Copyright (c) 2011 The LevelDB Authors. All rights reserved. + * Use of this source code is governed by a BSD-style license that can be + * Moved from Kudu http://github.com/cloudera/kudu + +This product includes software from the CMake project + * Copyright 2001-2009 Kitware, Inc. + * Copyright 2012-2014 Continuum Analytics, Inc. + * All rights reserved. diff --git a/README.md b/README.md index 84bae78cc7fbe..89114ee39b4a0 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,17 @@ + + ## Apache Arrow
@@ -42,4 +56,4 @@ integrations in other projects, we'd be happy to have you involved: [1]: mailto:dev-subscribe@arrow.apache.org [2]: https://github.com/apache/arrow/tree/master/format -[3]: https://issues.apache.org/jira/browse/ARROW \ No newline at end of file +[3]: https://issues.apache.org/jira/browse/ARROW diff --git a/ci/travis_before_script_cpp.sh b/ci/travis_before_script_cpp.sh index 2f02ef247af82..acd820bbed2d4 100755 --- a/ci/travis_before_script_cpp.sh +++ b/ci/travis_before_script_cpp.sh @@ -1,5 +1,18 @@ #!/usr/bin/env bash +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + + set -ex source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh diff --git a/ci/travis_conda_build.sh b/ci/travis_conda_build.sh index a787df79a5574..17a33ae9717bc 100755 --- a/ci/travis_conda_build.sh +++ b/ci/travis_conda_build.sh @@ -1,5 +1,17 @@ #!/usr/bin/env bash +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + set -ex source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh diff --git a/ci/travis_install_conda.sh b/ci/travis_install_conda.sh index e9225259e6d58..ffa017cbaf5dd 100644 --- a/ci/travis_install_conda.sh +++ b/ci/travis_install_conda.sh @@ -1,5 +1,17 @@ #!/usr/bin/env bash +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + set -e if [ $TRAVIS_OS_NAME == "linux" ]; then diff --git a/ci/travis_script_cpp.sh b/ci/travis_script_cpp.sh index a3585507f0a6d..c3bd3b5f207a8 100755 --- a/ci/travis_script_cpp.sh +++ b/ci/travis_script_cpp.sh @@ -1,5 +1,17 @@ #!/usr/bin/env bash +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + set -e : ${CPP_BUILD_DIR=$TRAVIS_BUILD_DIR/cpp-build} diff --git a/ci/travis_script_java.sh b/ci/travis_script_java.sh index 2d11eaeb4c57d..4679f9c6daf87 100755 --- a/ci/travis_script_java.sh +++ b/ci/travis_script_java.sh @@ -1,5 +1,17 @@ #!/usr/bin/env bash +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + set -e JAVA_DIR=${TRAVIS_BUILD_DIR}/java diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index 61c8e444361df..a75ff0778bc82 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -1,5 +1,17 @@ #!/usr/bin/env bash +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + set -e PYTHON_DIR=$TRAVIS_BUILD_DIR/python diff --git a/cpp/README.md b/cpp/README.md index 129c5f15b150c..a1c3ef28447f5 100644 --- a/cpp/README.md +++ b/cpp/README.md @@ -1,3 +1,17 @@ + + # Arrow C++ ## Setup Build Environment diff --git a/cpp/cmake_modules/FindGPerf.cmake b/cpp/cmake_modules/FindGPerf.cmake index e8310799c3671..e90d4d0039590 100644 --- a/cpp/cmake_modules/FindGPerf.cmake +++ b/cpp/cmake_modules/FindGPerf.cmake @@ -1,3 +1,15 @@ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + # -*- cmake -*- # - Find Google perftools diff --git a/cpp/cmake_modules/san-config.cmake b/cpp/cmake_modules/san-config.cmake index b847c96657a4a..fe52fef12ea5d 100644 --- a/cpp/cmake_modules/san-config.cmake +++ b/cpp/cmake_modules/san-config.cmake @@ -1,3 +1,15 @@ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + # Clang does not support using ASAN and TSAN simultaneously. if ("${ARROW_USE_ASAN}" AND "${ARROW_USE_TSAN}") message(SEND_ERROR "Can only enable one of ASAN or TSAN at a time") diff --git a/cpp/conda.recipe/build.sh b/cpp/conda.recipe/build.sh index 2f2b748266747..6d7454e927265 100644 --- a/cpp/conda.recipe/build.sh +++ b/cpp/conda.recipe/build.sh @@ -1,5 +1,17 @@ #!/bin/bash +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + set -e set -x diff --git a/cpp/conda.recipe/meta.yaml b/cpp/conda.recipe/meta.yaml index 75f3a8ba3d98f..31f150c1f0b00 100644 --- a/cpp/conda.recipe/meta.yaml +++ b/cpp/conda.recipe/meta.yaml @@ -1,3 +1,15 @@ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + package: name: arrow-cpp version: "0.1" diff --git a/cpp/doc/HDFS.md b/cpp/doc/HDFS.md index e0d5dfda21d93..83311db2d2dc2 100644 --- a/cpp/doc/HDFS.md +++ b/cpp/doc/HDFS.md @@ -1,3 +1,17 @@ + + ## Using Arrow's HDFS (Apache Hadoop Distributed File System) interface ### Build requirements @@ -36,4 +50,4 @@ will set it automatically for you: ```shell export JAVA_HOME=$(/usr/libexec/java_home) -``` \ No newline at end of file +``` diff --git a/cpp/doc/Parquet.md b/cpp/doc/Parquet.md index 96471d94835f3..34b83e78d0a5c 100644 --- a/cpp/doc/Parquet.md +++ b/cpp/doc/Parquet.md @@ -1,3 +1,17 @@ + + ## Building Arrow-Parquet integration To use Arrow C++ with Parquet, you must first build the Arrow C++ libraries and @@ -16,4 +30,4 @@ make -j4 make install ``` -[1]: https://github.com/apache/parquet-cpp \ No newline at end of file +[1]: https://github.com/apache/parquet-cpp diff --git a/cpp/setup_build_env.sh b/cpp/setup_build_env.sh index fa779fdd5c2a3..546216753b382 100755 --- a/cpp/setup_build_env.sh +++ b/cpp/setup_build_env.sh @@ -1,5 +1,17 @@ #!/bin/bash +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + SOURCE_DIR=$(cd "$(dirname "${BASH_SOURCE:-$0}")"; pwd) ./thirdparty/download_thirdparty.sh || { echo "download_thirdparty.sh failed" ; return; } diff --git a/cpp/src/arrow/io/symbols.map b/cpp/src/arrow/io/symbols.map index b4ad98cd7f2d0..1e87caef9c8c1 100644 --- a/cpp/src/arrow/io/symbols.map +++ b/cpp/src/arrow/io/symbols.map @@ -1,3 +1,15 @@ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + { # Symbols marked as 'local' are not exported by the DSO and thus may not # be used by client applications. diff --git a/cpp/src/arrow/ipc/symbols.map b/cpp/src/arrow/ipc/symbols.map index b4ad98cd7f2d0..1e87caef9c8c1 100644 --- a/cpp/src/arrow/ipc/symbols.map +++ b/cpp/src/arrow/ipc/symbols.map @@ -1,3 +1,15 @@ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + { # Symbols marked as 'local' are not exported by the DSO and thus may not # be used by client applications. diff --git a/cpp/src/arrow/symbols.map b/cpp/src/arrow/symbols.map index 2ca8d7306105f..cc8c9ba3c94bf 100644 --- a/cpp/src/arrow/symbols.map +++ b/cpp/src/arrow/symbols.map @@ -1,3 +1,15 @@ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + { # Symbols marked as 'local' are not exported by the DSO and thus may not # be used by client applications. 
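The `symbols.map` files above are GNU ld version scripts: every symbol matched by the `local` rule stays private to the shared library, so client applications can only link against the intended exported surface. The effect can be checked on a built library; the following is a minimal sketch, assuming a Linux build that produced `libarrow.so` (the library path is illustrative):

```bash
#!/usr/bin/env bash
# Inspect which symbols a built DSO actually exports. Names matched by the
# 'local' rule in symbols.map should be absent from the dynamic symbol table.
LIB=${1:-./libarrow.so}   # illustrative path; point at your build output

# Demangled list of defined (i.e. exported) dynamic symbols
nm -D --defined-only "$LIB" | c++filt

# Warn if any exported function (type T) falls outside the arrow:: namespace
if nm -D --defined-only "$LIB" | c++filt | awk '$2 == "T" && $3 !~ /^arrow::/' | grep -q .; then
  echo "warning: symbols outside arrow:: are exported" >&2
fi
```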
diff --git a/cpp/thirdparty/build_thirdparty.sh b/cpp/thirdparty/build_thirdparty.sh index 6cc776d09042a..5011e29c01a71 100755 --- a/cpp/thirdparty/build_thirdparty.sh +++ b/cpp/thirdparty/build_thirdparty.sh @@ -1,5 +1,17 @@ #!/bin/bash +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + set -x set -e TP_DIR=$(cd "$(dirname "${BASH_SOURCE:-$0}")"; pwd) diff --git a/cpp/thirdparty/download_thirdparty.sh b/cpp/thirdparty/download_thirdparty.sh index d299afc15222b..b50e7bc06a14c 100755 --- a/cpp/thirdparty/download_thirdparty.sh +++ b/cpp/thirdparty/download_thirdparty.sh @@ -1,5 +1,17 @@ #!/bin/bash +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + set -x set -e diff --git a/cpp/thirdparty/set_thirdparty_env.sh b/cpp/thirdparty/set_thirdparty_env.sh index 7e9531cd50864..135972ee9bdce 100755 --- a/cpp/thirdparty/set_thirdparty_env.sh +++ b/cpp/thirdparty/set_thirdparty_env.sh @@ -1,5 +1,17 @@ #!/usr/bash +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + SOURCE_DIR=$(cd "$(dirname "${BASH_SOURCE:-$0}")"; pwd) source $SOURCE_DIR/versions.sh diff --git a/cpp/thirdparty/versions.sh b/cpp/thirdparty/versions.sh index cb455b4eadd3b..a7b21e19fccd6 100755 --- a/cpp/thirdparty/versions.sh +++ b/cpp/thirdparty/versions.sh @@ -1,3 +1,15 @@ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. 
+
 GTEST_VERSION=1.7.0
 GTEST_URL="https://github.com/google/googletest/archive/release-${GTEST_VERSION}.tar.gz"
 GTEST_BASEDIR=googletest-release-$GTEST_VERSION
diff --git a/dev/release/02-source.sh b/dev/release/02-source.sh
index f44692d5e9d83..1bbe2e92753ce 100644
--- a/dev/release/02-source.sh
+++ b/dev/release/02-source.sh
@@ -56,12 +56,43 @@ tarball=$tag.tar.gz
 # archive (identical hashes) using the scm tag
 git archive $release_hash --prefix $tag/ -o $tarball
 
+# download apache rat
+curl -s https://repo1.maven.org/maven2/org/apache/rat/apache-rat/0.12/apache-rat-0.12.jar > apache-rat-0.12.jar
+
+RAT="java -jar apache-rat-0.12.jar -d "
+
+# generate the rat report
+$RAT $tarball \
+  -e ".*" \
+  -e mman.h \
+  -e "*_generated.h" \
+  -e random.h \
+  -e status.cc \
+  -e status.h \
+  -e asan_symbolize.py \
+  -e cpplint.py \
+  -e FindPythonLibsNew.cmake \
+  -e pax_global_header \
+  -e MANIFEST.in \
+  -e __init__.pxd \
+  -e __init__.py \
+  -e requirements.txt \
+  > rat.txt
+UNAPPROVED=`cat rat.txt | grep "Unknown Licenses" | head -n 1 | cut -d " " -f 1`
+
+if [ "0" -eq "${UNAPPROVED}" ]; then
+  echo "No unapproved licenses"
+else
+  echo "${UNAPPROVED} unapproved licenses. Check the rat report: rat.txt"
+  exit 1
+fi
+
 # sign the archive
 gpg --armor --output ${tarball}.asc --detach-sig $tarball
 gpg --print-md MD5 $tarball > ${tarball}.md5
 shasum $tarball > ${tarball}.sha
 
-# check out the parquet RC folder
+# check out the arrow RC folder
 svn co --depth=empty https://dist.apache.org/repos/dist/dev/arrow tmp
 
 # add the release candidate for the tag
diff --git a/dev/release/README b/dev/release/README
index 4fcc5d9728c26..07402030bf699 100644
--- a/dev/release/README
+++ b/dev/release/README
@@ -3,6 +3,9 @@ requirements:
 - a gpg key to sign the artifacts
 
 to release, run the following (replace 0.1.0 with the version to release):
+
+# create a release branch
+git checkout -b release-0_1_0
 # prepare release v 0.1.0 (run tests, sign artifacts). Next version will be 0.1.1-SNAPSHOT
 dev/release/00-prepare.sh 0.1.0 0.1.1
 # tag and push to maven repo (repo will have to be finalized separately)
@@ -11,5 +14,12 @@ dev/release/01-perform.sh
 dev/release/02-source.sh 0.1.0 0
 
 useful commands:
-to set the mvn version in the poms
+- to set the mvn version in the poms
 mvn versions:set -DnewVersion=0.1-SNAPSHOT
+- reset your workspace
+git reset --hard
+- set up the gpg agent
+eval $(gpg-agent --daemon --allow-preset-passphrase)
+gpg --use-agent -s LICENSE.txt
+- delete the tag locally
+git tag -d apache-arrow-0.1.0
diff --git a/format/File.fbs b/format/File.fbs
index a29bbc694bc13..f28dc204d58d9 100644
--- a/format/File.fbs
+++ b/format/File.fbs
@@ -1,3 +1,20 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+ include "Message.fbs"; namespace org.apache.arrow.flatbuf; diff --git a/format/Guidelines.md b/format/Guidelines.md index 14f1057850439..c75da9f98bebe 100644 --- a/format/Guidelines.md +++ b/format/Guidelines.md @@ -1,3 +1,16 @@ + # Implementation guidelines An execution engine (or framework, or UDF executor, or storage engine, etc) can implements only a subset of the Arrow spec and/or extend it given the following constraints: diff --git a/format/IPC.md b/format/IPC.md index 1f39e762ab70d..3f78126ef55d3 100644 --- a/format/IPC.md +++ b/format/IPC.md @@ -1,3 +1,17 @@ + + # Interprocess messaging / communication (IPC) ## File format diff --git a/format/Layout.md b/format/Layout.md index a953930e172e7..251af9dd8a128 100644 --- a/format/Layout.md +++ b/format/Layout.md @@ -1,3 +1,17 @@ + + # Arrow: Physical memory layout ## Definitions / Terminology diff --git a/format/Message.fbs b/format/Message.fbs index d8fa65006c24d..2ec9fd1817bd5 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -1,3 +1,20 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + namespace org.apache.arrow.flatbuf; enum MetadataVersion:short { diff --git a/format/Metadata.md b/format/Metadata.md index e227b8d4afd84..fa5f623ac9797 100644 --- a/format/Metadata.md +++ b/format/Metadata.md @@ -1,3 +1,17 @@ + + # Metadata: Logical types, schemas, data headers This is documentation for the Arrow metadata specification, which enables diff --git a/format/README.md b/format/README.md index 78e15207ee95a..048badb12214b 100644 --- a/format/README.md +++ b/format/README.md @@ -1,3 +1,17 @@ + + ## Arrow specification documents > **Work-in-progress specification documents**. These are discussion documents @@ -21,4 +35,4 @@ couple related pieces of information: schema, and enable a system to send and receive Arrow row batches in a form that can be precisely disassembled or reconstructed. -[1]: http://github.com/google/flatbuffers \ No newline at end of file +[1]: http://github.com/google/flatbuffers diff --git a/java/README.md b/java/README.md index 5e1d30d9fd26e..a57e35afbbd20 100644 --- a/java/README.md +++ b/java/README.md @@ -1,3 +1,17 @@ + + # Arrow Java ## Setup Build Environment diff --git a/java/vector/src/main/codegen/config.fmpp b/java/vector/src/main/codegen/config.fmpp index 6d92ba830ee2c..92881dc914f2a 100644 --- a/java/vector/src/main/codegen/config.fmpp +++ b/java/vector/src/main/codegen/config.fmpp @@ -1,18 +1,14 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. 
The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at # -# http:# www.apache.org/licenses/LICENSE-2.0 +# http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. data: { # TODO: Rename to ~valueVectorModesAndTypes for clarity. diff --git a/java/vector/src/main/codegen/data/ArrowTypes.tdd b/java/vector/src/main/codegen/data/ArrowTypes.tdd index 11ac99af42414..c0b942bc3595d 100644 --- a/java/vector/src/main/codegen/data/ArrowTypes.tdd +++ b/java/vector/src/main/codegen/data/ArrowTypes.tdd @@ -1,18 +1,14 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at # -# http:# www.apache.org/licenses/LICENSE-2.0 +# http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. { types: [ diff --git a/java/vector/src/main/codegen/data/ValueVectorTypes.tdd b/java/vector/src/main/codegen/data/ValueVectorTypes.tdd index 421dd7ef92e63..f7790bb3d6ddf 100644 --- a/java/vector/src/main/codegen/data/ValueVectorTypes.tdd +++ b/java/vector/src/main/codegen/data/ValueVectorTypes.tdd @@ -1,18 +1,14 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. 
The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at # -# http:# www.apache.org/licenses/LICENSE-2.0 +# http://www.apache.org/licenses/LICENSE-2.0 # -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. { modes: [ diff --git a/python/README.md b/python/README.md index 3235d18377d56..6febcbcbcbfe7 100644 --- a/python/README.md +++ b/python/README.md @@ -1,3 +1,17 @@ + + ## Python library for Apache Arrow This library provides a Pythonic API wrapper for the reference Arrow C++ diff --git a/python/asv.conf.json b/python/asv.conf.json index 96beba64c2e6e..0c059fd79c1f2 100644 --- a/python/asv.conf.json +++ b/python/asv.conf.json @@ -1,3 +1,15 @@ +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. See accompanying LICENSE file. + { // The version of the config file format. Do not change, unless // you know what you are doing. diff --git a/python/conda.recipe/build.sh b/python/conda.recipe/build.sh index f32710073c7c2..fafe71e7adb75 100644 --- a/python/conda.recipe/build.sh +++ b/python/conda.recipe/build.sh @@ -1,4 +1,17 @@ #!/bin/bash + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + set -ex # Build dependency diff --git a/python/conda.recipe/meta.yaml b/python/conda.recipe/meta.yaml index 98ae4141e3bd7..b37dfde0a0d6f 100644 --- a/python/conda.recipe/meta.yaml +++ b/python/conda.recipe/meta.yaml @@ -1,3 +1,15 @@ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + package: name: pyarrow version: "0.1" diff --git a/python/doc/Benchmarks.md b/python/doc/Benchmarks.md index 8edfb6209e4af..1c36801858278 100644 --- a/python/doc/Benchmarks.md +++ b/python/doc/Benchmarks.md @@ -1,3 +1,16 @@ + ## Benchmark Requirements The benchmarks are run using [asv][1] which is also their only requirement. diff --git a/python/doc/INSTALL.md b/python/doc/INSTALL.md index d30a03046eda7..81eed565d9123 100644 --- a/python/doc/INSTALL.md +++ b/python/doc/INSTALL.md @@ -1,3 +1,17 @@ + + ## Building pyarrow (Apache Arrow Python library) First, clone the master git repository: @@ -84,4 +98,4 @@ Out[2]: ] ``` -[1]: https://cmake.org/ \ No newline at end of file +[1]: https://cmake.org/ diff --git a/python/pyarrow/config.pyx b/python/pyarrow/config.pyx index 1047a472fe338..778c15a5e655b 100644 --- a/python/pyarrow/config.pyx +++ b/python/pyarrow/config.pyx @@ -1,3 +1,15 @@ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. 
+
 # cython: profile=False
 # distutils: language = c++
 # cython: embedsignature = True

From f1a4bd176bc2139ba785522200d7630408328911 Mon Sep 17 00:00:00 2001
From: adeneche
Date: Wed, 5 Oct 2016 20:19:42 -0700
Subject: [PATCH 0158/1644] ARROW-320: ComplexCopier.copy(FieldReader,
 FieldWriter) should not start a list if reader is not set

Author: adeneche

Closes #160 from adeneche/ARROW-320 and squashes the following commits:

5c6ebc5 [adeneche] ARROW-320: ComplexCopier.copy(FieldReader, FieldWriter) should not start a list if reader is not set
---
 .../main/codegen/templates/ComplexCopier.java |  14 +--
 .../apache/arrow/vector/TestListVector.java   |  87 +++++++++++++++++++
 2 files changed, 95 insertions(+), 6 deletions(-)
 create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java

diff --git a/java/vector/src/main/codegen/templates/ComplexCopier.java b/java/vector/src/main/codegen/templates/ComplexCopier.java
index a5756a47ad785..0dffe5e30bea0 100644
--- a/java/vector/src/main/codegen/templates/ComplexCopier.java
+++ b/java/vector/src/main/codegen/templates/ComplexCopier.java
@@ -47,23 +47,25 @@ private static void writeValue(FieldReader reader, FieldWriter writer) {
   switch (mt) {
 
   case LIST:
-    writer.startList();
-    while (reader.next()) {
-      writeValue(reader.reader(), getListWriterForReader(reader.reader(), writer));
+    if (reader.isSet()) {
+      writer.startList();
+      while (reader.next()) {
+        writeValue(reader.reader(), getListWriterForReader(reader.reader(), writer));
+      }
+      writer.endList();
     }
-    writer.endList();
     break;
   case MAP:
-    writer.start();
     if (reader.isSet()) {
+      writer.start();
       for(String name : reader){
         FieldReader childReader = reader.reader(name);
         if(childReader.isSet()){
          writeValue(childReader, getMapWriterForReader(childReader, writer, name));
        }
      }
+      writer.end();
     }
-    writer.end();
     break;
 <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first />
 <#assign fields = minor.fields!type.fields />
diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java
new file mode 100644
index 0000000000000..bb7103365557f
--- /dev/null
+++ b/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java
@@ -0,0 +1,87 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */ +package org.apache.arrow.vector; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.complex.ListVector; +import org.apache.arrow.vector.complex.impl.ComplexCopier; +import org.apache.arrow.vector.complex.impl.UnionListReader; +import org.apache.arrow.vector.complex.impl.UnionListWriter; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.types.pojo.Field; +import org.junit.After; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Test; + +public class TestListVector { + private final static String EMPTY_SCHEMA_PATH = ""; + + private BufferAllocator allocator; + + @Before + public void init() { + allocator = new DirtyRootAllocator(Long.MAX_VALUE, (byte) 100); + } + + @After + public void terminate() throws Exception { + allocator.close(); + } + + @Test + public void testCopyFrom() throws Exception { + try (ListVector inVector = new ListVector("input", allocator, null); + ListVector outVector = new ListVector("output", allocator, null)) { + UnionListWriter writer = inVector.getWriter(); + writer.allocate(); + + // populate input vector with the following records + // [1, 2, 3] + // null + // [] + writer.setPosition(0); // optional + writer.startList(); + writer.bigInt().writeBigInt(1); + writer.bigInt().writeBigInt(2); + writer.bigInt().writeBigInt(3); + writer.endList(); + + writer.setPosition(2); + writer.startList(); + writer.endList(); + + writer.setValueCount(3); + + // copy values from input to output + outVector.allocateNew(); + for (int i = 0; i < 3; i++) { + outVector.copyFrom(i, i, inVector); + } + outVector.getMutator().setValueCount(3); + + // assert the output vector is correct + FieldReader reader = outVector.getReader(); + Assert.assertTrue("shouldn't be null", reader.isSet()); + reader.setPosition(1); + Assert.assertFalse("should be null", reader.isSet()); + reader.setPosition(2); + Assert.assertTrue("shouldn't be null", reader.isSet()); + } + } +} From 3f85cee51e45165c4be8d251849d2b3765b9b4dd Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Fri, 7 Oct 2016 12:12:58 -0700 Subject: [PATCH 0159/1644] ARROW-324: Update arrow metadata diagram Author: Julien Le Dem Closes #161 from julienledem/diagram and squashes the following commits: f018cf5 [Julien Le Dem] ARROW-324: Update arrow metadata diagram --- format/Arrow.graffle | Bin 3646 -> 4142 bytes format/Arrow.png | Bin 86598 -> 112671 bytes format/Metadata.md | 10 ++++++++++ 3 files changed, 10 insertions(+) diff --git a/format/Arrow.graffle b/format/Arrow.graffle index 453e85025d8d324310e27daa3692a98e3bceb332..f4eead9223160f6e16f97bb8e9d6ed6a05e290e6 100644 GIT binary patch literal 4142 zcmV+}5Yg`+iwFP!000030PS6EcbhmC{yg~=Y`^U8nbeVl0Fmi*j}xbpbkZafr|q`m zIXyDSHd_oX;HH_j|9vH}4aPWjQsX#SZcqHel@Q`S&wY`Qp8e-b-&H=dAas21*C((B zo+!+-eB1H5zdm_=_2Mac@}K8B&;HRkKfHQ(aik1fCybPf*9WI3hsu+udwcc4z-4=T zjjM)oaeC6cQg|JEdq-zalqbC?8tm`wU0+|1+>#~nf`2VB$APa zx>!UcOXYBste=JjQcWsE7KB9=WjM6pXu|NQocLvdv(|8dNZwpv#pAyn*(a`Dkky) zqwaH+t6%(KEXuzKeQwZbra=fm{V5<%fu=xxUnTnhD?bDNnrWS=#1Hc2Uk+V|dCDnM z8myl$({RFpOK+MMb!Sct6~|M4Rm5Aq*bB#HS2u&1B9!AIFB|J(!?%Wg=0$nHPI?c` zbl*saL%}Hp0H0RM9Fj6n(jcg5CdLN8p#UQU@vfq(Y7JvkBL;znfeFzC*a+lDz%64?tsbfV176Tinw%=ywZ2r_kyb*j|*S=QLN!vt;wDe1=n#ys(%?G71#=TfZ!4eVC*MbiQQ3n7Z%bRymm!tomc;!PM;@n*&% z7{?+Sh(&m5@utIAL^rsIsMiTR8Vv*Lwmw8Z4#4`@Se)`@7F zTfLDv6@SJ*EoJ%+{XP%LzbFxRu)_Tk&EI@{V&Pk3119JMj~t!4DbXp>xgvC41Ww-( znXdBjxn=IDdLQ>xCk5C(52Bvm<+^`!%9I|b_=@CX-?o4hApm4ofdmu~ zVXlIFbe(kAQ<&|rrQK)^bN)3Es%i5)XWiuflL^VK$7ek63VGH)=2Qzj<&qima;h8P 
[remaining base85-encoded binary patch data omitted]
z*3n3;NHFirdVh@bYg8s$pGNR+r2&+7qz^&KYnd!Uhnik0%t%Q|J3}E(6%tvN0qc7J zB`jKN9w141bFg$dJ`Z3Oh35xf9D^A4?W{e$W?ejF?{p|U1Tna+yP!Sl6Kynq~ zh%bHgHOvT0;W}~}@~v7XPX?CBsNO^906Bt&5bQv@aQk7DvWr3=``P+4@qa9{P;3y* z_$i2sD)c8urH;=fD5~%m(R5j*4m~ifB!K z00JCSm<6=sxbQZ7@yBI%0kEdpVaODtx_F>{Iu)c>sfSY)$HfF&>(8s)3b3wT5r2w& z*_n>MaYbPF=lkarw}}D6T?vo^XE~Y6bL<(eK=Hd?cenDpbs|Tvf2jxqiFh-Ljs9`icfhJYVF6mS?C19;zsSpf3fYO!_R4bs|!nlK#IYbO;wS-~oiS0c~PWvh}>S$e8xm z5-Ph6_B6QHX~gr_iW;ScSoOpyh`0dbL_D1ox%d5)9n)W1?PgYC; z`<0?@i+x{$P9LDSnVE*&(LyDfvYnfTzm~6wro2pOddjZspNFvMcSG|)q5k4-)Acr| z*+#(Oyp7dz_?;pOVqqZG74=ke{52v498i?+01!jF+4o;CiY8obi`Tn${VLI~6P7tc zeu$R+Jb2~x=K&&2A(-Sw$h%k|%h{Bj0XQ!4-rz>y(R@^VUx(zcKqSwIiWQ3Q>i;}O zy?`8sG8O@D@UdkqXHPuM#}{{{;wXwVfzLDCgrmvk7d;!INo;?se{kq;yl4Rwsw=ZU ziH?Dlv=vRneAC#5jJEx%0#2=v#_#Ln217P!=x##w*Zn84AQeN_F*Fd3zoItcG4kq_ ziDOExf5kz2QV?>fpFF92`mbmV4iXCWkd6GK$oX5~rH%kwRPWvU&iSwNsk0ZvAs#0d z{&m-XhR_gIlw0ST$tS;mpKlyw81)savVUDd!8P!x?9)B6sSgwiL6pz#!$qNB&Ya1g*;?Nf^5D};D!HJ5?uQoW)JgJ9SQT}KAo%PsB9WlH zkS&Nq*|rL@aRi2{g{zjaY8?S-iOdkYm2^SO!3hw18fjI`pJ)3|5#kUD=4kdkt}FkZ z^WZ2OC~NoqZ~vZD*{Dfsn>`Tv?@6kOBBnI9v+MsoOKnjBC9AZy;NK@qK@}(0-?H)j zYo(M?feGwT`i$rQWZIX2*zv#H`aNX--+o&1IXEP2liC)%WzY-Vf(M^p%eDopeKQ<_ z+fS4Z9c?e%FRdpxfV@|0)05wSrSTx6|T^T@}yxHM04Rdym?GIk0h7hO(%tTz2kJ@B3ry~%DH2hQsr#(gu0^{>k*T4 z%fIL3I z>PVy?6n|p!=1X0iGawS20+tp?AuT|Hjt1HXywJD?gSjBpLy}+&NhH|>j;S=q&3f>A; z*^}EFRk9KdvjS4w+W3=qC>c$UuUIXy)B$)s^Q^5~6iP-8?NL7xa-F0vXyfy%`)?+% zYuyE2SD@~3c>dsA19i*hTei*5Z`gHJ`NJpyD zc7Yn?)~l~n<5kQ7r_u~0w^4-oUXL?MNdhiDv1#j6J-PFjHS$KcQDeQt+7^8Vl%};W zt(ODbRWktQCP@&ZJeA(109jfa6G(5h0nzm8aN};PYYHcYU`HBBu)0zUFv$9TG=N+QAqb^swT;k5yKK?Afc#&2=I+7hBgI;50*-N(o$OiyZsrkXS(9^ z;ee;}PKnr?&1GR?hJxn|EYA9!P?QHG@~Oes+IM{Ze&3wCqKWzfQN9sSzD&Aw^E85q z5W>ag0*D#`LzMQ2lH;gsGc9K7U88{91yHbVvhnsQ(4GAE*A4BrZwFWvQY|qAgDUm% zXBUH|eJS}&A6ZK1*$shGs6&E$z((-px%b>QaO-7p-3i=(|Ndq&N}M}P_TJrm^Ci!@ z<(Oj*SX8b(hI&DW1vGuo`Kme|m7r`VTZb|fp} zz5iujBgYxFsC{St>itb6iDO`Jdx{IBb`sovaRR9g0m**us<%Ru&!_2+zy-4n10`HB z7qEteVsutL@hG*1-hV^kV|>ab&_+qzaPIwRKAf8gg>;~tldaif{wwRA*Z3Crf=b>3 zJzS8;1ZZ~+%2$a}p5R)CkCa_24%It*NZ$^Y1v#G`@Z5gwdo%_pB2OUUk*q{em?B<+VY_xwQSFvtv2aXZwDwG$(u2eOD(N3t?k7TKTzP$mb@-@tT=gQ=&_oP22pCD65B7A@(FmX9jq7=&q}ko>W_JN zQRdUP+X~%LW583(9255ynMfiV&`g;U3Hb^$7CADj77<5(&L47|>dUHUbtFgBXU#OoQK7FzIFYvbyljS5TU|tJSLiv}_MOv}n6&tK*7NDQB~{ zX4Sjb(MiLA5~HBaac+?ySe9U%2Keov7E3FSE=nD)fBM+Vr1bllk&cRa2w@7*-pAIcjmiswq3*RG7)DuHX(~35Hh9RLVTood~V5f zN9m|<^UOzd722VmXsy9@pvaN5<7A1qRQ~)0lT_>;kdh-(P)61nt$=Vyp||D52$TLn zlSY85H%VPKkICC*zB8+KN{$}y)XY6B#*Cda*4`a!Fb$xKdFWm5Oqk4j74Fw4lRSbc z@+bP_kUbqu0)SK?FMu)gbFt_fpmV!6?2T)YP^)GI1PO3=&xH#(D2HpqGw?oUu3LRV zYRTRT@UKtK3Qi*upthx@Vwr%}Ty#oOe(6Av%8P~25Q&5fGS%Yog&k#ya?D&H*+Ab$ zfhcs~lDK=!H!`l^_8TYC(i=~84cH~#>#c>v^m8$I3yu=PQ^bngRN#+~-+T$u;%LOW zBjj1=zAW_QPV%&Hru=xdq;(8Pl-_N9-Uax(T1letdJ<;ip zhP7KW=Rbl=L?|T@CB2ha$qi$|@cO8jpdHlF44%)2rNQC1JB<3nsyK-aqU@BPUQk$MfeUA^SGt*WT6r^A0c3w`C{56Zrrha6y{L+c5Iq| z8FXimW8HmAI&%uwMlxz40OPMHL zx#&ivmEk>E*lXj?c5Ev5=zgzhpi<)S>RZoSq*0N#z-%I7GkGq)5|lr!ps?_gI@4EV z-s@vdj1z)P-cia{*cyaiz2gMrr>;dM8oo}m7MBe*7_NVSRf+=s_sHv=A7*2<0B{G- zXeour5o9q^ThNwJVA}sINRv=|kd#Ux)A5e?tS36@0YPZwCD0ke^fZ6cecKgl`WVlA zA+U`gNX;!MkvUv~Xt8NoQeU9ku^bupS6t*bJEYR<#Mp<|STivW*7Hdlqz zQs~4olay(pKl>Opl8%wC~|9!g7uC_KelafyZ zip{`sVTs6SHEPFMcVL7B6X~jX&m54K%(^lwO(EsYl(a=ClEl7%=wmTO!7?g{`HTEs z4?}BaCBqS(H)3v_&^TxlbQ^DvR}b38QH0V~>=Xu#yp*#hWfJ@BzcTUNt&od_*8 zn3O8>AXSz-zaqm@0TI5F@kCv6%SEFqfq?qzT-J*%*M9b57b{uXBlQSp?5287GNI%x z!C2NY#*kW~uQfoPyf&=H!RuoY;8kKy5#a&j|k z&LZoIj6nf{#xHxLZl!oLLNSSyTzXpv4;B!cT_k$S&q5{)N53i8k5-w<^@MXQR3pOg 
z@Xw`e!a|YankA1T(qj6iagFQwWvVgR%HLf%$hRa9cYYA5@T16ry>+J9t zCPj_k_p*_53$KhCxk>2xG1jGrKgLW6~D8&82#=NOpR zht^5VZnUJ8z36tR)P2m11rB%*MNq>n5cGjTRbp=|{ZwXn>!-;J@g#VaS0QHUTZ zFBf-Jdnts=?3hU6*3=%xjInpN7Z6aP8(V!3rw!W=Ur3O00>z9xRy!9#tRmH$a4pSrpCetn!KAloaE83p+Q2bims zF|t*Gl`O^$X7geg+X^_`^r-TBnyMS(DQViT%Bi*il~v}sHsfRI%UX(r%(5r0)sSUh zPP;rYY8^oR$%~Sh4{rhaJ3Jwo`L2=G%2&$94{<^(6BuGrCw8A98kqj*;=nuyny=u* zgMF;N#`ab3EX4M92jP4Ti^Ss!)=ApDYs)+|$~Z^R&By?uSi^TwN<#dz#AU>g!2=0W zWJma`^DUO$A5n#4AeF_^%Rc98i8)4;Or4S*luV9hiS}$ z(vef%XG}a^RN1^TT0^!2qD^V;WeslE?x)ggBw}IK`jEp7{kr+rC@P}Z$CETV{htaX z*$U_t=)_(aJ`aWfo6{0WHy3$6lgGS1291oLhq&H%V3#hs2w-ivLu&kf21W#B2 z*14nvtSm9MdV#y0&*ao6u4OEphJ@#wUs)5g4Uas3j^{4%5xI%g-k50OK$Z3u5R489 zXs1D`HmZ7g5?)5hS|?8r+bT4QRv&6^EhNe%$h!B$XU0AGZhJ6HdjaGzMB1zN#c!rU zK^_#JzSjz<*KX)9ycA|>$jzfk?P!l;hNTgO3Veqtw14ySQ?+Oyd-z@Zn&1BVqhZFZ zVkaB}nvktB_Z=9fm9;0hd*-?Y9c#Y$_juNJ>NrG1JDnzsoh6{%iOiVBwfZt~v&{to ztCD{qCRx7M8=ERFiISP-{y-Fz??t-d! zp6>vA9^@)c`eHT7B2>YKf$gEpBPv+kh5P8cq{gTC)zUAA>><>wZ8W!wS@|t%1l@=1 zMK>1l?owX2t|0E=!NQ}2d-pS~+fGm`$tZ@2=g!}Yrs~30hGpPsON2^sWNs%b#p7#0 z6~=5W#Ao681%9J!#>c+b_{7sCe2Y*U-nm_1pzW|qW;fGU6!=n2 zo>1>;EVX+d>#t&>7*mM0=B0ROrRYGL2-}Grjk6M@7A&O0pI_vF8>`L}*M>d_@QT1|S zc#cS_Q1qSToPD*o@s;7`F5E#O+40l%d9uwY596CvsKYAjzONLmdT2Mva!eJg#hjRp|hKJ7COJYMw-d;?vS+nE>_Ctt6v3i`Z z*(S%Oo;z7q;of*waZ&(Jjx%=k_LM~<>%ATc5%1>0u&(1ELzhapo-8^$PO`PeD`Vwe z<_>95u5>j2ur5)EaAaz8!F#&+TBzNmb(Xyx_x7mjvkp~GyW{}00S&3Fj04zj3HUeX z5)z3(2`nz_yx7yVG01o<{A+gHzd3kcxPBsnP+j%eO$aW<^H(-JA4_nK_H?t81zD0b z+cG%F5<@Ui8Zo8unsfCv&5`6P-Ht^w1CTgx!-W6eqyhl;9Z+;rwn=BL0d@!gr2e2%E9U8p#J6G~RXa3PYry5} zev)CTc`OI)%NtUFxtIe?4*L-ywMg`CkX7!YO(9e5UG2gsQ2n9hRq;0r5xyIjfB2E3`F=55GK*i%_E0D8BD z^wea#7gpPWt@0EYEQf`4iZwvLmh{Jhv%#SnTbTK#FTlU^NuRShUFR@ofdiZTwqp&v z`Bs(b;Aj4Z#S~lqUEQ5vBIkxB@=@1wKt#0y^pW{)*`1OUKiw?ISGMu;VYwD6Lv6!Y z{9k{2;-CpWE5UuhN*M;NpTOu`B>HOTDTShBHSmPX0uS=tMvr?srgPjO<0v*w@^$H* zT`0L(z%~GR&zGN%-NDrKBe-fpi6UM4UIRP=oTfv721o#{Nf;=5sC)Mqa5*Q*6T#A6 z+Hor$Ybyfz&l_@}xIfnse_ig7T0~PzOc&)Y90r|3o&X5WS*PmIgMv_h8_*-$pxxJ! 
z(;qMoIj%iaAC0WyOT5mv<~zk^Tsi@h^h_y=06ZE~>wVxata`gMfHI12DJ)e0Pz1d8 zLp*hZp=g=UoVVCT!`piS4*~o=aSk#QUs3y|C=6yLO7HZkP~0Vb^%8`9(PhP2V5IsC z*t>)a$h%YZS4!6#&QMPIUkylb-jP-Sa*qL>AZ|NI`GXFmz}BF1mo|1^jeX(k{=gfH zYO$WS&B6CXf*$J!pmm>$D7(2UmvIo)-mKWJWn=a79bZ!%c?i4PA@HFj&wv)*bIS5x zoz@Ca#SwC+{`WN`n%+9jyYFwAJz>`+oCcg8ssXtjlVS!K_gjE}1|Gciwf#Qui~y!( zyxITI*zWNw;05)t5qt&^JN-0AqXD29gSOhS4*|e!2|WGf3XvzmS)@)c!5i#rp02_6 zE2g;P(vHH1D=xtA7^MiT9|!rDXY5tJJ*#|A2xFS{UnH|PcyE4svXj!;i-A7r+4+Km*zz%w_Swx^ph;2V9#rN2)L1HI6BO1CwM?@aQS%ZwI+Q zX!4p`ggr<}mN_!Q56-gCPf)7~CDu?S?5ujvI}Vm|H+zV`TFL#3jZpvp~-Z%26YAhBDY2>!-zu zDDp4>7M>TM$w?|0X{L^&?az>42)k`a%G0A3O#a$=~T1bWy0rQ7Slt<*GH# zz*pPHgiY&$#sCRTlHEl=0ZZp7ZR=5?oSnZrDf&}uyCjr9CM3;5W1x-a;}iy2A83ik zjZiavKwBsPsM2KscL(q12{vl2DOvKT-Cu0zBG#(rQ#pYNg%*#-(wIS(5N(G@Ou&^}6Q9VG zVBH}Ac4hm$oqCNBD!1rPGu7~Syl?K*TSw%^IN@2&a`p_8g;QFBM^AwHRdxB8XPDWf zXJ{~Y+g?5X{0uzK15kKod>b)OixJDdGM~Qq#oQ7(e{Wp4ekIF}{^~5ISSVpR$pQ<% zFiby_5+A*HF;y?20G%Oyvc8MRO?hT`fJSZ|);4tYGtZo7A7hlj@Nx=-xo{O!GI*BK zI@T<(v8OF`2CrMV?KfHU9U#;E*EXiFOL}Q`+nP5HffqyAqzOzfSU8fg^0Ra^4g=mp zK8n1t0-+`gAc+z@pFR|xM-h|njhA_aIq(yD;xTT&=mSi#}k{m zW7#%DvNouityvM8I$#MTJG@YiPI^wr$YK2U4zu0~!+OM`;PMk@ijWO##4YXfhg$Of zP-*{*_3$=e8AU3}I^)-wL>m?q zNP&iYKy5~8Y3~(l#zqEzetmhPQ-Kf8Tm?E^)RPx%>J$=X4k$x$aNGF17WsM}gD@|* z`{2WS6w&5neIG-!l2MT__&|0LFxY}k;}RUgZhL*{W5UoW+>lpWaoX4FGc6*iwv=y# zHDh=KJ1?rG(@Zv9a2e32lt%1_bbA}q!f{WnXn&7zl=><_o!gP)%z4i z=ICjyTgqlUGtqe_E~B&Cvp8%641TMOX-QUQYP6EOdvSQc@CS|zG#cMneLh;FRu`lH zDc}TNv_$Rux}s%{HN*`iKkjT7sl$#Mt zw$6S>1)lUyZqD>N_QQJ{vnpM~{8`uPBwwN$tzH_dTeFM5+dD2^%nMbq?%lQd{tolKq3>g@iq193`2NDp=z0yNU9_{7inkreU zt`kDZB1b+QKdk1k5<({t^<)Bbzu0rD88<-Ge*w9a%5|r;q>Q7_hBmv+K82J+y1@mG z_6e7RAHgI4Z2sFr{^6zk;L^&NeoJb(F7p0I!Nh}hpdHGdw%|ra8d124Gz-6A1Y+Qx z4kNJk5+|Ykx~KWxkYtj0@ z%zMiz0t>n1)DlyVsI;Xck$?K|-c=#U>(mH__0o#|s0QykGPRd+F&92wXs($_b%_XO zp%1uN_|#pdCxF1q4`D7;uw!I4xyabyF<+1|t8dkm#6EfMxTYo@l6?EBPC@y=B=r@{ z+~F;dYnOL`vUalvI*H3&(WZf2(-s$)OaV;}@cXu9J)v;D4tXR(9*Pkc)a(B&$m2W8 zF*s@o@%7r^T#F(V5Tjh!x@TUDK5M?$?=|{#TS3F?es1M1D4%S5Cqb{HJ?kI0MBMa7 zi~wY>_=ykuW;X_Reu8e&%r90fy(QVLv|>n61z8r(N4@6gMd7B+euY;@a3Cc~g*V%p zjkwJlPGi{fWDwj78`J`B;+hY^w z3#_Ovis+szc$fsPZ61&&-s*BVv$ERYcCs3MyU|A=)5T(2fzx-VIY)#3Cw%__1APYkPoV^;^u zE|4$p(qop#O?4*k1-=AbB+;qB)wEK%amO+ka4Q-bh_D8^K@=1obsHjDzPR#L{kis ztzTo}{h=b>{Z#rEljmCgWi^ZOd0Gv2c8(FdjhS`i+LOi2_Q!e*DS@~46g>Kjb-_Cn zP%;g@jHh}*IqYnx-o)_HV28SS3iQa~>iu&}Q6ILg2$R!GP&QTbr3$Ip^rDyVM-z7- z6YrLSHisM1aEp{D53WK~y+@UY<`xF*(LAyCc01A0x99F$2r5$1Rxj6`Bb518Z5%Nx$4Xdd`&}-|c7>I6`-PW7wu^^lD$i z+}8?{r1XT?h*saMwXeQWdJ{QLZeoDy)N{CTJmZu*K-{+y!xfB^OoUjm>r0j~P$$Ql zmidl55B*jWT-mgHFKxp@D< z$_AcPvyCDkp|zsvR2l{CING&V`78_~Pj^V+NufVj zN-ODx;!>xJAJ_}pIXiNIcuZa1P+0*jk3j2bp8C=rlCZj8GZeDGE;!)>>T3h7V;)pT zTu29wc{h@G2*xH55jH*z;#s1l`@ykY1p5Y^4Qrb$siTGqgecu_AbY!Oa7r1BmW&r$QE0C>$+V;brGN=gRckgT0G~^1@$eM9@*w&ET{D;`=qqSt z(d$tzA5z#(Dt$Fh>XG-zZ6leT13*4s!v_uO1#GY~qyg)Y2Vj+JkI8SyN2EtR#su#r zp$(&deOfz~k?38>Iq4LTKlkZ@l$I$gwETdzW2zlg10#6upeb+oCiZh;KXT?Ad7mPC zskk+lPl8hDW3Ybr5G$6*I^qE6Rf%G-nTMYn=G0~YX_;HBFGSaHVacX)!PAjC>epEv% z9dP6#ve8FystFKk`*k;gkgg!51dG^}`-mt>(7-S~uQ*sD6AC2pD5Y!x3(8@7KOofu z^TOTXYnFsoROO?!Jn#g>Px*bUM^V6w$;e}5_t{>?0FhkrI z$G~-!jE2pfKrI@pr7cjcPJI4-B;*0+a7u|O-ZI-g%;t-7=XeymFv*ANPQgu_?+2wh z7D)%yMR((%K=WXhnilUX12{Yhllzi(JUk9f?DUFw7XRWYz!h%v@O&Bq#$i4jws0c% zuOp>6H`k`8DQ2bxvye9a>C@Pp+C9mGFwG-;TJa=itbsd}b>L+{kPxi#jaU7w`f&7? 
zDxlnnF?zEBcX`#wn&-OC51xKIhAI+G(gX%+MK z*x9U@@yJ> zckZ`qos4k{h+-~Yfo=;l!bUfDB<4c<12}`I-Y>Ec;Ni2#NO`Qn`PJem+j}FJL#8y_ zLfQjbtn?!nK<-fD&g|5tb5{cliyIzjuQYkoCLNB4|mr~=W>M4 z;Z#E0uN3`PU5?%x=l;5=oGRB{>7bBSB$9>};Y%wnaTC1V-hfZx4U2*-KaQ^yRYt{x zD>EQfQg6iht9JUE((T(JZyiFRqN|@}-d^G7$&e-L23xSYS7iL+81A0*9Ci25eb4O- zi||iI@2FdH^zD6q>w(G>&!}wF{EnG4SZOUA4Ekq{gIR|+}W}&CelRt4b z5tu6%JS~pR7EBk0kD5XXKYz%Mc!sL{u9L|Dz|3o~qu~y}^m*1!0;*%D>%eSU9E#E& z$bsDZ#w^Z#pPt%8KBmUkVhV5^Zw@_NRhYu1v@NyEd*u!z$e4p(np;_9ik04`@7Ot& z(C?tZ;(eLswz_(X?0910y(=@R6E_j~frnaLRaD-+%;*uiJ~`!f%}fh=@P?;4&6&}l zmYLo*f17{{uLMm!cHLi_#?3>h^#;`P6jUugx_{Boa_@b4{127dl6_M44b(*o9@4CF zX5SWxA?As#K0OV%$S|&Lq3rw`Awh9Tx#xju+7LT6_n;cld>c4C9WR8oQaqe8=Dp5I zD)7X&fm!9}i)RDmfnPU1h53~5K)_lbZ8B&IT(sw|EtSs!DMBp7Tx()ac975f`sfCD zBrCf+3&u0x6oiZg0E<0Y0&PN{1fi>snFQ0P;*+lF%Mxwvm34j3M(&9~dX+*%oF^g7x7`7#xcF)y;&t}Twf z0s2u8GclVo@+u};T{x+0JpZvTSxTcRg7LU+!%xsPIM%&TEW{PD+tuijlj5UT#;uD!2(|1CpWq8MbO$bX#?Arq`EC=^_E-k*pii z?8gzT#cytDWE*aK@=)ieHTmP-gF>r^Yj!KpBrMFY?0K?O5^Th-FP?a$rys=6Fx7i#2Zb zPKkZy)3$T}hCzuK*1)rW!TpI|KttMyP!BYE>)|4{GR&6DTL`7nH}8W(9=?@wSW!=J z25;AWh;TgV(^+nsQhHr!@p$8AY137D|e3=rcV`9(d8H7nSKJ z(cK?mGaF=m3Sz8H`Zc!Mp~t!FxawL-hQr63gzWFXGr2l)kx|+6$LVm%kBO4hJDQnd zC2Q$M3RCv|`dk~3{h`P4jQCfm59HSxQ<~;)SYhSo{g?{|f*Q1|Y2&4%Ap@*08)V9p z7SD2;JNf9#h}ckCx1E;#35juGe`t+x_oFIOka&rU&99MQ*|+x4M@{~H`}VJ+DHCOX{vTQHaDB+J(Fi_2-iynQo`AQT^FL482Om(%mjliw$G1B$bIA5vQ13I*~R~o zey_^=mo5IN7!@cuaN5FIiQ~)M`ZyM$A?D=5zn6HT`G!M>4n&LH_3ADK}l#lB~3fx9YQi2p`*K` zpTLx)T|G^AV(oy^3Cly>{B>2kM1|Bo>$T zO@$v}V>@El@Uz|&{1?NkLL;b+WL-;z4L`MGSBml1O3b}6SlMw@N2egju_ROEuRL3n zcE2X5Z2naazq4MGP$WT`MxEdG$GBUN+f!1;6S#v`DuS|Q^3mozG z*$4KmxVL9CE_rFhJd-3chU(r$L!?NG*nSXP&> zz*O7!lh=C%1@2e=@0HFNLTw?99`R z+;S03Q^g%k*71jRk}96_Hl0s2sW(X6k<>au@E({$R|wyF6TMSY7RL;llf7|vMo(BM zuZT9bQwk+rwjpbb+x&^eHd?0I2js5bM2z2m7}zDe2wym1Up-ToFezG{{d5YA_Hrlt z0!=;tsbAlohXqmD--TL6&KT8_$7rIA<~f>pSz8hJDvhxq&X;-L%e{u2bv@>Wk`+Sb zttkaKI=_{ML#gum`lA3JRk;#>n zyDK>Eio$$1mTxyyD3ad8xi&2Aeu=8d$3U|BZ5Eo7ry-;Nlkj%-Sf&YXj5bkh{8~n) zc7h|WQtXq!%>Jf+r-ERTb371=(;YXm51f1XC~eIoDuYg(U4*8EZx~e88n!KL%zCp_ z8_!Z*i35hHz;RkJ#YDmC(}@&Q9;^V-?1B981|UQ=CB>y=-un1Y%NnvmL}vEt>;{9? 
zi^PxgOVd7L9l>Gy((`t)b$-w=rdM2*WEp$cnkuA;mTh7u&-%n>Fd7_31wU{Xw*QzR z`}jk@OoKVsu7)z9Mf|ZZQ2y(VS6}nXeQ;*Eiz|tOQH;;zy(D$-T%Yf@r<_jIBD#uG z!7^2$VWx*)*|kwz@aLwkCh8+;@u4vz0V8hS*L(Is8I`i$IU%2u7^++`F$pQB4Hju| z{lOhA+acGfV%a_z9$ud!O6a`~&7!GNYIIF!LEF)@^K}b*x8vYV8Q^Mxo-eH!KXE-imSu5chKLom5IF#Om@dLA}NEhA`UWg^-nv6GYN&GQt+Gi*In!!yrcj@3;wr09F6Gjn- z=B2FDRqJrLe?OWAUwthtE|tDOH}~KJq)z~ZA^X4IP%3`bhixR6v`~~NK?Sdc7V{sa z+2{)SV0OeGea5U;(xOepJSwGp(6YE%CP#^v z@+s*5{Zef#G<((h(=z^}Ir8rxR;z*iM|Af7{h)t7vzh^bs@gIh+4%qdn%}RM{C{&} z#JjapXV3n-67=XFP|Aku~9o4_9hqc%qD?6Nm5+a%1Y>LMGcYo2080j$)1boji^F;vq0}ZS8 z;AEYT^@_g*1vlu(aPQi&zdtIU>YvBjoJQ)azxHna^$zipkcP6gfi_3DE|;@I`5!M+ zpk^QXqw_5wcY~a9z?zPLM&Z|x?bu2$JiqZ4Y8iq7U(6kzTkF!u%(FXEEo40&HOk{C z)nNSrhnzaF=}%+PpRXcs=n zezg#H4&3G!m4p9WBgaK(IWp`z>^X~EV_Vz6wOnNFOvjo7@+eT^qy(c+JMk~Pxuk5(?vS{wSUY(13B)-yNL!(dWCyEJlGTgW!Xzfh5YM>H{wt4?Uh@cRVY1xZJ^` zJiEz68#P-|*bg6@nedrc)V;^ni`h361SX@|<-=8QWYss3)cI&9KFR_Gijk@Qqyd_7 z0e>pATiKr`GHqL=iOjp)Avxya%tvaQnuI@%T`m}O1ivB4pKXW8&`ld)qHA|<;^j`gRWxhQiEK?{FBCmT8p^q9(fb|w;dDfdyw$S;HMUJIKaJZ__W z_3x?oNBWa^OD|q4|8vw!nlU(LkO$)!A#QK|0)QvlfEw5rZ=q?)<%O%|!N0bL?1@;a zpt4HIKkaP(+~{{b=}0S5gYKiFY3g&tJqS1_d-I7MGc_go?*)BG1#1)?IFeAynxDO< z>~Qf~&}-b%$hG{+p#J0(KGk60*6Rkk0}$Q`3dj%QG$R(Bta#bmQBcTCY8zZ_bh+L^ z`(63QA0pV!K77bOO#Yf0W^EV_|ctExBCq1Q52@r_cyb-c(Ug8KIu9=}&)x2a7 zyF_*lymk|JvM2iH(YL4ja}_?+8xl0cY<^%)1GNVl-0DkFzo7rjh1rD~q;;NxnL6}i z$E{domnNG`0@29u&STDNxf12l91PPwf&8pMasei{!HGf(@5S+k#T}L#3pvC-Z0qw9 zpI2?>pye_NDvx;}7ohkd0@TNS*CL^!_C$ybjdD|!`N1DRcMIl2qb??lzfQ0dP%3&19o_d|=;=3X(hY621kjZ3nJ7w@TF)ArDd~ zUSjk2M~};Vd39S!s2(^C?6~Udp8%$L4vMAv`VA{=)q&^~XZV^!&9n(jJdIl$ADc=^ zWZ&-Amg`hb>PR|#MTDe)rnOyUB*Z~U4^f}{`22>Ot>RBm`GIPxa76ssPM&&1jsbXM zhNiC!KsJTq2j_T=kDRW(k=QhVx4QMGJ@eH1?4<^`wOibW;B#+Ula_tpq&6s1>W@K7 zgV7$+GzGLO-F_-J(IU$Wm8M#!aWd$H_wW<|095U>qrn?EF}$&_0W?kf#1s_34FXko zcGVhMkSHj5AcTU>EMIK>4xT>mN7%JZSxCwUI){O?S8a)Q8q(WCdWPyY#xb}Dl8(zg zwGV7kkh=*Cp-&i_A@5Qj2ucxzX1v?+>vS2m1@842?I&<1V?ZJ+t2!FXY+iEk;j>rC zfM4Fm%KN!nE9SFs;AZqH+A#s=Q5(9|OaxwcNqkSZP{g=Leo$kQjNZ+si8=dzXWQ1= z>tg%pftFRnc9CB+5h}SZ;6-oB9Te$c&aYF9O*j(0a~HV)Qy-%QY*2tE`PIF+VUs^W zQ>Czoa83Szej~z~CP>pEbaIiwnvvx5+Q1jRNszwLjj~Vi30S=v(B;{bT&fnZ*7XTPd$U;PWOeU z4n}<<$fpj{Z^EM}umLO1m2;=agbsmo?dClVrT(q{E3e%95b+Pl2#RZtj$lN@=nL$Y>b8p%tw^P zZMdnvVHw`}1kpRBQ9%5Z19Y8@WBn(AmgojH;R$uEYg}#1h-G!?e>`B3B3mo zNF4LLb&Zc@d>#$T@{&pOynFO^x)}eiPk>PrHBMtHBg23hQQpbtex=zPmFmxS!oB@fjzgX& zJ=Go>IB2oVdZ(f3R7r;VWl)o4l~v(~is*g6nT)UTPX4857aG9mb*(&EgvM#NfSN0y zWWuL<`66B9>-n<1v3HJo?nE9w+>I%U+5x(U_wX^mv>Y9vWr=cetA$LmnakQ{U2GUe z9s_Rz%6tQ2t%DO7Oh_Tbf6xuuGE#25Q#&}fs;+s5qmj*l`;~vy45Sh%h@*hmN`$u+ zY`s&GUZ6Yp8<6!i(M^J)$9DQXCDzuO1KVo@PYlWp3@bB|y;C+*dNWFK!uDo9vclrL zOr|CYQl<$Ri5bE^SRV#M5pN5rF#OYBr-DDWeEw|lB|PEY(0Y;1AksE+<6GU3_80Y& zdq46R5HSXXeVC9v{v&0E*~XF=1{!_(Dnt_%zefsK-KSlS_EX-C$5Fx0ZoJyb?~5I0 z8Eu(kz&~&^Zc1W3;%ILln_U^@7rTm0Bw@w0$YWL3Zeat;V&z?1J1@(3rwi$v`24EI zzRD=fCUwUR41Dr)&8tyKB;@UpBubu=-`qg{xtAXb)WecF<&qDfiqS>8_Y&?Ha3J4_ zUlLsc`k_^>Jb8CSyV8YggscuQz*Z7xAF-6bVUHulFtfeW4$KiDpy4N2WC4WgWfp8s;N;K)AiWI&U53 z_kl1%Z(^LWP>dST3B~g7joxFPBfEK8Y%p8d9v{L-&XTkqEZWQW6g5KEmm_!nDFt+7 zuzO}z)Wy@0J4z(VV4+Eu_3Hi`MaqHuWUgEO6yG7^Kp4FijyWt@BREP=hWET;Ck)Cx zIF8h+VMnw>YpN-&j?3+1;ws7S6> z)G>%Sek@MJ(dg!@9H}5+O)M@V_cdvVj^x=BG`AO}dcYXqKtf??#8tp?^0$*7i%3XfH#vUAMJCpl4zNmL)W`2mh=da=x$ z)|8;}uRl+#HI~vE`VJA3gY6j*2VS}}AUu`N8hNHwcWZ7Z65Y|5=W@1Xn6!Wbqk988 zLo}Zzld$n|lv4e3zWupm{Rfwo18sVza?SEwC#Du$P9Y0M!nS6k-mZ(?=t`H}xAGdS zYGgB1EkSbh`uU4YysFyRN6!h`pUZuP(Pa81O=`4^OZ9xwHc( z_lo(Y`-_czK-RtPt`G?d*$3*J8i2-_5qTR3y|?e>bO2WndLU_oKEOnt4W!r|NHgzz z(U4GzolMXGCB9d>>s6E&O;8?$nN6aHt6|$|OJmtSWR>*7U5H6gHL|K4eWTS9)|2f% 
z(dg4~QMbH@it0UfdQu{QK?IkL+R5%(g2eKTb@oIqisBp)x>UGuML<*+thT%I7}~I2FP{xMXdiL=}KFx@rJVl>?r?rODZez4ncR_8*Q+uz^<7`g318op$ zfj+|b+pgzCa6d4KvJ}bU)N+KVdy0RX;~c9{(VoL>+cy5srcZBNF8RF&QJ8uH;Z;gF zW<$P{j%uB^v**+kSp$MR84gDqK(^<(c1u_Z)A)(jjf}6Kf#RP=_yX}1a4WXg-m!}C z?yGYeQPDg$gIym-KzA}Von4Z~Bp3}JdT-uI&(sfObG1$tD$!VAuDl3z#F5t|>fuHX zk5%FWN$3x#=EE0v?jMAQ6Ek$6iT(E}#+tYf7O49tr@Hx3Xh&)~(?|j(3n^D@r*lb* z*y**ofB+F(G6kd#O^M1WWfSXrZcKNb_cA7;B2MxY9rq4BI?;ILS`-+aBkM>r6^qE@ zvu7E(M(cpC9VtHEB$G&VWgEY9JP?B4<@*o5`TqF)8Va>KkxP@XP)WDaI5446J<8IE zK`~*W%6#r_pI4BBKVskPKLYRD2B(V`Ff8oZOErP-{3Z7cV+e}{X>3TDQ7W%?6)V4* zdmz@2nCC>@G6!L#uzIX|(6c%o2AZ$vNLRnd*)<1l2DI|wcK4|4)5T0Dpy&;fG#`Bg zse*4GpU-=xr^z1Fct7!OjwMj!xBZDm*Y!3`6sU!Iez=PrySOR8ylP+)3%R|j#lhNq zh~(z;4cb0lZi2%K(cWC4rEP?6;ci(*S)o`C8(&-Rzz`{l=>tLV@LPA=4*@x1^u^7< z-p>gSWA|Q-#YPrVHokof?Ui<_CtOfloAKIBnqIv5{LsT(dj|B4x30ar5i2+)LOk?R zrzIA6?N?KErU5V8#hZ1;jZeKoS@OI@v+g48SvTE!8_a>?EXfn*SS> zFk4|7{WhP4I6h=A86k@7+<#q+R+xo`>tHzk1=V9H6kPpCbBXllLK?*CqgH(v73f{~ z%!T>AR&9D%hLhMOmspQGT$MrtR}Rog9VY!h_H!v z7{8KlKeY0j8BxE;Fb!Q>nsW7gQ9>YGpi&t}L^_oXL~@k_bC~K+p*{@HVnT`^NU+~S z@sjJDn^xT9kG^SxmhRqJPjjt5&h`?H8}u`!SzoVi>Z4xUT`{{5^H^st+dR>f=?A-7 zdLjD{b7^`*B1Xg*WgifO8OFzRqqosx-P@8GDYONTGsVrnFHY!xEwLAwltD|igk0BE zmiOw0;GK8#n~Y5f>1uqo{GVIjB}vieI9!^9U#NA%MT7!rQ?-D2(8_xHb`Bg)*OP%_ zxmbfF6ctu)CpLq+FkD^|ox)h)!BQ^PoPZ}g;E6A9`j3b>P>;9af>8sMTF0v^Grd{0 zgW(tcIK(n`!mlt;nmc%fFqwnPM12`CXji>_h8O!NjljGVUi151q!ZYT+c0Nu0*p-1uY zW#wUi?kQBCH199+LkjolTJ zpQALolwRtGhrhPQ&fkFoP7)dAv71J>;FY|mk6S~oJA(osU>uhV6Ae~jk1%~=1$P%V^}mnh#qz3jWV^VXT|Ehe5<+|+_P*v9X;Oj7;W{j!fvm`o9ZW^4uaecPW!`}emR zHbqG1zG;@kX9pZ%4zR7uI%l@PQK}p4v&CeZ6E$LIGVL_~Y^{HPuR#NKYPl{1GOttg zKGO!@*W7i( zO=zWQRt5{Mh{RnMHUSdoB)!R3x5J=6=-y;3W9`0c8qr@D4Dwp$-W$6=h^5KR;@6JJ zRf2w@UAATMZdqp>3bao}Z+v|Z0xwxLmV@Hkt%vCTYy*)=8ZkB9FOa9yh=N z^Q#|=fp~D&CHv{y8KiK`t~O~lQ;esn`&@q>>^`eYyAECuF|^4l-g{CdXVgFF1g;gv zP+WDWVLQOtOwG)@XAQs}7+dMec1OQ($8YzBqvB#C3Lm;3vmP|jc-uxr=t2e}0uOVZ_IH`i@UbGzPD^T)z>`>>VuxB<&`OJU{j;ceEnOmMPoL6$Zrd7<} zO+)zaQePLIq&KG4Hy5UNU9DNpC5+-D7ZXff@rhE1cM5vdx(sw7ACAR-WV(eXyFs|Y z@y~*2AOHkHgm~Tcwn^@jwy3a78J7uKRzZu@y?^q?rv|R)#D{9U{T&8-R2HD3e|w#RvmGSUSwC3)C*)@+%(#{+DVS z3@wde)hGULd#osu6RQS1SpX=J)vNfzM}~XbAO+M4hx8G1Y0TJe9I};9sC&hEn>hhi>t4H^H ztME-|uTXcZ<-EnQDpsF8XjiM4c#uIl`;cAI$`z(&RtsA+fy%||Ld4ev+y8M>Fu!05 z7{XYua~D1{?9i%RHAyeqdY1y8Ba;8AImTd82ZP;9$&t;!XT>l33A8$m8>#)SJYF_- zhakAKR@SjLJMRD-UJ!TD-9CKXxv9HVY}~%p$1}*2^%vt@CMZr%*vC{WdOxZMFJu3( zY8SZ5$`pxMbPk%ksU#gyi&8WANjd=&I*wp0PbVoD5tS*|D*Sfv#E zOW%O6{GIB$Vuo2tD!O0pauqNt6}ru5Ky_@BSy(`K^R@ke3C7M8Wk{*_c-Y^CsqRV6 zSKK%j`xLb1 zH=Rnl$|AG6{fpA}L>i*Uw+~Mv@7BNQxj>wq8~x_d&bM&%K_y?HL2q5NvGNb=q*v44F_{8^M$2prE}5Q%n- zx;)XCQq7iX->hZJCj`U=5N@z%2#}wDai4xHPUqa4N(Fc(3I>}KtbtKiAYCn2#Fi)xH8G93I04@50cN@K4U|p!840C8wu&tY4 z-;c#BeLc!u`|b5VS^&NQ!vYx*R41tUnX%$$59`x&oO>GO)b zXhH9ZeOs8VWoiQA{08nRwwnB+j?WG{?p zKydpG#|*|e#|=c+{3jT?^~HwC+~GQBr4G9K1j$YmvwwOlrNZQ^P{$K)C9-pq?=N7F zyR-y(%v#6m8Hwe3Qp{8CQt{ws{BarSAOeVNx=|C#`7~PJ$D{a)hj3 z(6eJ*%iqWn$179yMaD8_@LaG#SIC8Hu&l9$L&5NZ!W8u1=zb%HS2OI8X+HW^aGQ%S ze|VnF^T->B?)*6)XG^~eGduU!M`jqIz1JU5uATn1ad=Tu;*z;Bj(xF6@T*o-_{ohe z@x~=t`LFLSXIM$k=Rdz-l|RgAB1GBH+2ip7G2W0BTAohxpWZ#m8U;dhxF-4V`R|JW zU6>q66pyu)%EiY?a|Pe~wvUb9z21HIe%-$TPlq1rt*YI3PYte|H)XHr+uIkGkr7be zQDs`q39C|6*BZ5=--&&1cfH#=dJj*Y!41=?sb^=7buTa^6_H}y=LZW9edVwm-1F4v z&#_Lx*yxu_R6EqzG5MJE4R@k~N6Kgf&mMkn@LYlga$vE+d(SM86(_LJXPlwpqAfaI z1P#^F*y$zOKr>`1aDnK~i3iu2b?!PbW^OFc24_S^&_`W4CL}R)~a*4!UQJt=~2<*ug}>K_|0jHTtG5zu|vawM+!IS+mfNxwI#&#`D35G!8@ z-f?DoT1?WWY*{#n{&n%eqC{LT_qwb>B6$-1C*V!)UwH#+U0ixTtR72)S~Hj~3hjpfB@gQOq%?;oF90 
zy?7pUzUon=>VrrC=Pr4dYp9nfM3D0kv>dE3!52J(qy(@V=xJ?YehD+y z;M4s2a;|~vh5xXOTU=C-Br49s=qu=hOH~edqIP;y?Yj7y2n(M|9mEU@mn_l$#Pogi zla{@#{xNGy?*b|I)VL}U~ip=6= z?hf|)%sPA9Hgw9X1dZb~dm$U_+9#H}>+cpS5~Z+8os=wM*DRqRE%H$0Z#gy#Op{tn zP>C779^-B7Zd*-2hg!Kqdw<>Lr$DHXX|q>SF`&N}Oay2HN&+6-gyR(eQcZ*5Et{~2 z(D-gTfMBT*Vl*?UADanLu~C=6*eu7ulY+f<8=`6eA1_0j6N3BX9HWq^>~NhkY5q2U zT$wdHA$ACGEcvTnzseSL^u4B->d)ffQgG-r$bdZYkPLIOcXJGg?It=?G{PWg4KmYw zxbAr(1hPQ*X227Ks`kETLIzI2=XA!-UY;m>AZ($qj~XE6`AXbzPe>|ZDqm9|CY$h2 zyl;wzMDZj*ZjqcT6<^8mM#hvu-4dl*cJdiqM%tv1S~#qO1?lZ=$l_mH;oPWQRA@SQ z>>_{OxU|ZVgSHZQ0PPm)eZI=q87iIO^9w&YrlHB;sl6avsip-KGZf5R2``1V0&)>s z?gRRNHwIw3L>ic}xJJXs**Ru`Ud6M(hj{TnDJgd zJ_N%xxd_Gt8EdYQ|2llka!`Mog$s!Zl||xLo5+tnBuQaCa8M7l1of^m zgo=bvuR~dAf??s=wR{eYN0q=C0Vt0zwGnow*i0Y2fBAQ+Ur~q}(vdsRY;sa3_@k~c z{i_J$SJLU>n9psyGtDrb#FD`0*S&u@12r9n_>DGr^{|PXK_EH@Xn~kzxY@%AIHh)B zP!(lMngsB*`_=0S;%o?>po{;G#t=F?<9rS+Pj{(;$hBt|!;0C=Um1^zpb1rI)6lL9{`^h05^DI<+A8vnE%8<~Kz!kAy$CFk5I#yH~fFriQ_v}R9j5X%&Tk;%lmjDaNJr;WZ_2h=Xqi#28>DkA;FWqVlL4QRJxodCyr=Y+>>hePD6vME<-u%8^n_Y z`$S?L3ysc`excZxFVtlPIcHRn&Se4^2_EVP5JJx95a?CY#EV0ewx74hgkzwe$Vh;A zV0ZQ@D7(@4jhG^_6wG!Rfi&Ud*M$dD5J!X@Uk5{KdPP@_=|fZxEjdyg21tp&miPpM z7piQ9IBoU;J+z(WjI{&$O|Ydp9t)5p-~4?@h=sw-hN{1X5vC^-UbM^AIMu z#nrnzNFoLU3PJHFp#q|Dfo2$=VHMkhU&JP6h|%aK+kS|3z*D@iPf_R3)X^`H2ybN9 zMzF7-3R$V+Kz+1rxf`|!7^cCE;tr5@Q9pYdjA_S(VbYMC1g(}GtANHHN4o_PWNU_S zQMJheO38nN0qwL?jC@x+}aB zDk3$w=n|S8+PwBrG|BRc-fKq>d2=|y{x4O8ni{LlraX<^*(@Kgtv1%GV*v%j6M72M_Q<(*6INesz_F)o< zl`7jFq+%p?p5bRsQ6?*;-BN1uC8KFtbL26aFs>C$mGrHEr;jIILNvdi<>37L^G!&R zrxDTBn!+}z7)2jMa>Ug6rcWQ|^oZCho3i;Hz~#hV2t1GZ6ABwlwqwREr#(V}g3u}Y z3i2sodVrq--{WK-<|Nsg<$bMI6U}{WTW+*&pdy|oJGlOxZ3l9xlrRhke+U$8X-P(y z(Wq2nAehc}nQp#xLwEXB;DE#5xN=LEREKGOZTGawTIfyE`6;bmz!f?2`p!jO&h5xf zn^vIqVZBY_^&ZD7)vY`~?@FILcm2RkGX}Ke-ivUBTE?jJwcyb9f@Dfjmev(?-)q8^ zPPS|J!ZAq^J<4wTaZYWK%L&YUu}K{*f=QjRg^u@-^dkNODRh_Gsf|6SGz6H{#K;kV zIV7F3#juJUTac?XX~9j?9^7)RKv^Gt&L*4)ZgED2_@a={01`pABM>s!iQU$ zR)JiwL!S&vFBf0YRlG!|%A0r)<9r7$U-+^FV`fFnMOvs`P5kwR_+TLQ9}q?oE0$PJ z^#g6KAkw$^;^z5`=Au?*HpRzCNl!6^SrTo^PMB~SO?{$=T&JB0;BmVsfs`oxP~yyP zv034ZfSMdHDoc^<072yyKsK@_Gw95!c2Y^JN8&q3+DGEi#eK)6kU&LHgm}3W)S5ch z9HUGY9nK#BM+HSYGvkZ^((C`oPiq1a=Z8uwY~LuUsANJ6PA$b z#=zx^_ih5|X+%4qDX;r1l*_*N?1h3az!qdbf7(;P(vpAm8Y0|O@GJ2T`Dk8&qHKb| z?p+~&(W?c_!-L0G?c}HNR&~&Gn`pcr0JG+D`5I%vsCc>`U~okq#?!N>xHTLp`JFSz zOAwhl0Au9Wip=;@4Wm10gFrTsYjyUW18&?mX5B>B{V)V$B9n*FYZez+a!@;qiVRoI z76Y^-V&G??jk_6+pvk8CK1bOJqFjt$;9K!Qv~sAVR$i5$M~wx}9(3lPgNsP(dfqf| zz=TGv&e@&;SY|rN@x~#6zHMZbm*$D<&`OcJQEo}_Rt=@Pta@~COi50mQ#mQ5zU2z@ zGWt&Vd(TL42~CunbHn$W{J#Lm^!?VHORjk$f7*vqKwXg0!^tiTb(<5NF72mJL40P* z$*uV&6$lZso{30sa~U6*6FNd@4tx%mc$|s+{wP3Gwuz8%ZUKAtq2n6DgKdjy5;#bM z7($aw)LWSao(4j$_9=}G^h<3htVhlS{5ncTmIEnm-g^PmgVy3o2`O~12<1@qA*7@o zbN^K5oPsJ0xdC>c{5Plxl`#jN-1K$M1IfO(EkikSdP6N#$+!vuTl&k7c7!^$CRpd~KK_W~2fa0LCP0|`IX zpWOXPAM*3nZf{pJsg=+u7cVvGQ)Ql5H~l|2r$H=)U`y|=@885p9ccE46&Z+0=GtvV zJ{TRb2rfTZ#<_pDYiow%FF@EnHGhPP_G5`mgahTpqVV7fDVcoYdPgN?aVjh^iK^%h zJn?QlvW0tFzxd#Z6tq#gVCpZO(Dr42`3san=AgXoBmCkBLD%J-tW(tZ#KUQ>DRuK| zc~?Q~4&MW11j@mVjeTmn_y4xue|*57-@**sgB^crkBXefWbFG(rUJw2yz_n4%pYn)k8FJ24%p&U)%`{FXMLy)aUUK7hy{ z4+hTzO?pnURthtDwoLlwuwkA27>@7y%v>)dTlIntze|ySV|UmD1az-(NhGGXv_j>L z1nCnhXLE6s{d&&RE#MhgaIosIcJZp7~Momyz=Cy-n&9HPAu?jTC*f@A=aZ6V9OCHu{)@PFNyK;un-H zZI+G;=O#p&yYAD<_9gk46zRRk?$6F z5=W*&2R^IQ&7A3*3|dEIj@flHi-7t8oay;aSxVu938K$`BILjSFp&n@f}NSgr+dbe z*YAE7p_+nW4IPgQ;?Y1=q0qQPWmU|~V#_-bi-^Xd879BGRa)>o+Xnue(xog7@+Nxg zHmfb}Nwsa9MRKb;ZYx7&L0aB>Ql?Rf&8;3F7zYVAT0e6=nL5|oiuwsm01QZ$XHZi1 zFnI3`X|esETMy0V&dzA=ax%GkTq5PtHCtg1`hu!-y^c30Z&(*NMkj0EGfv|dK0$c7 
z=KAPW(z+t%4Z+7&Znt$nmeH)yFEP%?8CQnbU`1 zxG@;eUHN|TUnN}j0ZR-D?N)jBnrD8L=Pa8MBQU=?QEq72MblXSi)cXH3gv#PFh^Oa zkY{#c7wF&7uyH2r*CpQSbQ4qT0F$QZ+n*Tg4uuy0xyhE(+{b7&{;QZKlF^EZ42Er| zQCipkW)$=fOzLl*_;Sd?m7Ocu_cN59$&cmK)GhAWH z{-vMzw@!-WqQph|$@V11fW_IwGK@i&NS$L>@{R@*A19XWFWXz`znYM7u*r7*Qa=zk zbMJ>yCQ^?aXUW*(0j!A%b8aAmK)wx$&;^aYL|pko_Fp9UccPn5KdEExFT>yCUc$jbfLTW$2j8xYMn`2^zTGied971i@i=Jq9 zUY#G3q~&Bzd}+xod-(W5D&OResz?8ckwrOttz)f$KqW`C&eqP>j-C^4?S7 zW=HRN)mr2s&I_lLqax)=-O_N7Ml0W298@?a%o?4Oc`IxyI8v{4|7+NA%IL)`$VZ}} zI4A1~V4il=$_z+2DrzwP1P^czhAJDJiHrKBZ3&f}VtB}0+H0A_7o2i!A zpRL}d@1DdaF7Cu*ymxbmw9m~X9{jhjL;g+R zrV66!pB57O38mEBo6CSifShO(RDjEPC-qdufiCjunv3caD#$klV4gWL?;>Q9X!ROG z?#0URPcNQ+-3Edxr?g$?#*)|R^j|S>d?wUYPtF;|Wk3}a7%DdJGQ}05?xt+EWu5#8 z^%JY({wVHI`Vr=!bCvG5dP3~vl_PmF4l>sIiz8ql!s(*MSg^+IdQxgk|0s=8u4?7^ zKew44dx-C;-+XCmx*kMF(W=!J7m`RL z;pe5=?$)OPSKIYgsfUJQMMj_8>f9avELV~J$BnNwy6waPsClw(tR619`<4hqOKjglf9ZU{vr3g2;t%?71uuH>B(P!MDf(=)ivQ5fd7^k^nV7Y ze6H_txfiXh8%)3c5cd@zR!tL)pMR42KS#yYkkr+-TEjZgFU&w=75a!hZAUJhA^ZD` z9m>07a9g+IPA`GRTuDRzkA7fSMm6rZ+@n&oz`(UDsDNge11C$8rve)0^aC~Z^pK$R zWs}}jLF>42(Iu-zS`p2%GYy>xQ>-E0Sv5I`J4YfSXl!_JquA4}|}bc6l2B?TsWXDe-%<5cfH~EIO`v-(@ZXU|#YSi_QZt%KhEj1+Fws zKrT02@%{bSxyI!%FP@_xqWl!n1$YBbK5p50Uuv07N#8#G5UE&PWlDM*04J#CPd!g; z9AdnPK`$K-`VjhA1jWQ~kY$~7jzc13j-%ag$IgA;@2#d_$e38YxNGVs<_l{wZ(Bn} zm3#5@!3%w#W;f+?b@OQA%4}4thC}70r+Vmbh#zF23lC~{X zL$w{Ij7S(C6G-1x&=Ej+Ole|nL1|6e_Il+0uO7(KP(>V}hzfBXivqfnLggffI!ny`&)m;ngf7V@^L9j3gR^~j|o0{(%r#%agb}n%1)>{X18eE_O(RInE_9izPilpFHW+E)k0Fm*~ZYp2!bPCm19d2QTZVNJoA(w5ObH1CXcN zUc74D&QtazJx#ZGj&K0iomr1>kUvHgIZljWw!-?pv5{Ma-Jj+~7-_rtwP=S^@ttIQ zI33${G%kr{GN{;l(G=xn?hR(IJhrEq|Y%21s!Lyb|_Xwa&&$r)zN)@ zSURC3d9UpDW9BbGvdh@{-Si(=z04-h$b>Uj()HfPeIpMw1ug_1f3_1b_>T)S3#P^5qLzXlh{7QfwU6bxVCR4%N7J zeyIORY?j{3(W_r;Xkm{22mok;?lucUeroQnc==}L*d6jgVV}2$igko$Rojnv?_2|m zQCO8hws&HmJVeg0x*%U5`iAO~nlZfun^}1O=x~YGb>GvBmI1EwH~%~5oX3{1EZO45 zrRyL0Q5Q%`L3)lG?5!clSM?XEs}$Mc)SPShFJaKzUY)ZJ$~6WO8D)>IOMLqX^iUg} zrjtV|nh|u70r(qF<(W8jp0RbeG2f-`A2$(TF8U3I~_t+ARXPyLLZzfHJ)qdl?~iroM?{pTm~D0=?Nt}>Sx{gFj2 zT~gN(NGtcexph=X1^wU{!&?94Q^#Az8uXRUAGzN3tAAG;Jy{?3d~u?w+<58sa8%+D z`Z|LI&y6PS@Vu+oYIAPAGwiLTqT`jS1|peg8F~kL;SE+PHbJ`9aRps3XXB6q{5K^rcY~3*E`dIk1?RW6%K@VLK9!Kx>NNhnEm{E{k zut#l`{S)ATBZB#)&K+a|210*hHVHQCNgDQwq*{BU66l8UuD1I+kZo%p`0vuJorf4@ zgV0)tN}yhQ}lf7}{Hv=!xS0|n6+5Yu*8{rxBw*DWc z&O98-#(U#4V=(p^`xawgvhVvY`z|4A>}1KFHAB`?M1|~9C>4=Z%8)GCB`IZ3Duqat z_?>yT@9(<&;eD@|@jTBw&pG$G@6R!{%EQZ%A?YE^GO4W9sdOTPCd?em&HuKGr1m{y z*E?`i$V!C>2RO+z(5Db| z!di6p+FZzj-jW>svE?W$zBq*z#r457$LKV6H}AyjJ65(f^Er7B&5s`ldFRa76t&gs*1vK-YuuT@y_8gg@a6#AtLHg-o+;uK50~ z6s$3!z*K2{j$L}s?)n)#I)>2&CcL@%H?3L81C|%1Tko{qSlxE0A&&r`{`!1y9R2Fn zkshkzyD{`l`ML!v!EYUJI%aJ_SL1yvBPgx%?y z%?d9uw|=?vnEX><6eY)K0?o8`0OBkILgjXv#8>mLrRdO_1Q^njcb|jSJm;fl$s)+= zc*l*)gu+jeJWPQLl{uLNwkmMco{cyzXsPlwLio}!Xn9w*PLH3dA$RwmrV6l~8x_#d z4g)ysG4xu{MS8uz;L-s`Xa}X(gj=6q-yJRsUcSYd(sj0l!HK?pW~CD1Hqj{XcO0VY zUOEJgkL{E$zYIV_LE+VQNX5RT0AMg!K<$k)k9I*_}-H@DBF&=~ZhHQbh85L0y_4G8CwTs7@ zN+d(0_;l~}!Rs716Xj5l^J0uP|M7R{k#z`h&~e}ebMQo6nQYFx!f&4DL{E`drOO8Y z+T=R>a$Gt9mYoq8Mk=}sMG`c~t^7^enlWpHsr1{f_BZw+Ja!}f84QU$-SjqNT6iIR z(g5#bx)xx8I(KX~<>&4|T z)|Qa>CnkpBJwQRkbBnC6@M}hAJHU11^D@sO1O~Oh13q7U4?Hxm>El36gIf&XC&2{( ztfZ6V;2r(={lFui%_1}8N}f0HUZ=4BR4bg>+ApzjRV_?#VpM87hgaQz9KhO36kR*s z9XwV|W9YRV3aS0A&#%tu(ch5B!#*TR2)_2#yJ4z|c;lzM}49X&Ky4%sXl&<)SBH zPNQydl}oUzP{D7Mk-*U-urmdM!Ly-u^$!WF<8>@?v?pw%-J;+4Flle*YHI?^_B5Sc zsBON4gJ?5ii}eQfknk^WRz=E)D_W|d_{rkndIQ)6mP8cNDA9O;Hkt-Ms}y=OiniA8X$gEX}zxsNi$>(XN4 
zX%mqzaOsN5)~D_P0%LGn6M_o7QB+qbw{F5b6`VQ$ok_vjC=^sSyr!cGs$)S;Tmj1ak-tU9uG4Alm(P>`b^#L5s#^q7%P9EwhU-F zmDh%BDsC?pm)0_X=eQ z`DZ7YDA?C*3qc=GrXl;>4EUHiM&4vkQg7u3C$+586qwegK$+>`Nw5)dk&{Fut(V+H z?ptTRHodOcc)rnd?PMMI|B8<_USz!*4eESa-ACb5XGq(mytgEVJk#_-Fl)j4Q}*Ac zX)v@lcny=WsXJje?rm+ghp6PEXoyNg6mEpz~pjoFo@wB)np{D7fWL{NcfU3g>dycsZxK2W)rI6ymK3cuSU*wR@)d~SAk=1Kf1!2@|hxlKmAcz=Gry_H?f=bzs+ ziP6Q#csG$(9{CNI3z&*82Q4^=N}=Z5m-%%m^MtVDxj%HLu{zpQw@fc)UN{p;-({|6 zMfl7i<7WDy+AKdvaUb=Kt=CY7VgPGflz*=A@GU2COu_(rf=rj{+|%FRyuR(GT&<_J zev)P5b>JT}C3*Vw0Mk2)m;QOq#Iwz_PonIKb>!_+sSuJ9KC}_YW=&U^A!91JY#=k-BgG9SHyHdcKP#+{OI1TEVml zHqd==G(~YtU#iX8oI&{7@z_wrW+Kj6vdj#XDr8_!LRk^BAolRxbuL$X6s>4diZi+* zE;OAPlPIm?YD8`1=kbmKLM%oJM#ZrmAK!utx-yI02!*j`#J*D;VK? zMWI2-y`Yh7S}V8q#*mOABpmyYySh{_lPUDA<`E_VOlAVuYcA75iGLk+RR))A%0fb2 zl6vhNb||hw?Hj-D1mvJ!3@KeW8J1c1Fb>CQ5K?*;g)Gqh+3MED`X2AJQb^A#VUONV zgW9KN`R%x}Zxn5!LF4CS3&+mn_nh|8 zA6aMH<8N}smLiV2RoCs2!Tj}`zuY{V}y)uDSvg@z>x5ZWKk5r4|&PrcTncO2U%|LVcNMLiK?RD#gI)+ly4 z1%JMxQDv||b_P5xW0N^DKLsH@8?%1L*vz+J_55Fdk}!%FtG5ob6wFe3Zq;<&4CinP zSU^&QF=!Wj8|v(t#pAcG1PEho_-B=>GQWQi4aEqDozD>(tS_Opzjou?QGZdh_~DrNBK%RdDv`JW8&s8M`F;A=Rsyv{Fsf(A*O1!Wyt#$%aBsq`6ihkftsWW#HRfsbi#~>FA@?Z@i5-KhEGrb z@9w@Pjto~{2%NQ>doQX|d9RnX?DM6gA;lLK&2-=)ls9ZHZ46n1x5+R8DwGbHi8$TU z+8sREFxNrNA6XUv-;g83Im`8KPZQL&K{V)fDH`|=9dKg0=~7yVKf7g2xD4?QjQ2vY z_6q+UfYy9O!0lK9p0!!?`9_svA+yxcmVtA57E8_K8@TjlC9JuoW={e#;NbJ_W0W;48ym3D<96ovv6ljG(xP`x} za}bN~1B3G?Z?K#FlX!m85a78K$73m;1e8*&#tpiUkYyA{Mno-aGfHi#)96S=mn<*a2V87`RI# zVjIHFtka9PwG010DFZ+dpy>1%+L0V?S{@+J@OuWo95MTQMoNJtQ1*-Nf~^*c;huSR z)!z^(=MYQEqr2BEnnHZyJKvQ>o)Y|Ni~8a;@ydvOF%QnB^sz_HFiG3g>j#Hql}=FC zZ_l3__X$$O)Mdf4Q?}CcK{S z2wH#AY#6og|38%nLpG*E2lta{Ls%_9fHcJQ*vTl{+(PWpL_B)ZaohrpVxE1;+b}J1 zd2haPYPgec-TBDz&<;v?o!hu5C^eN1(UAl!=E37z95>PIPn{Y3Odv3P>odfXe5@PI zYcj}<>od4ddgc*6$JQK;JKE&S;FEt7k(Q!P-hF5fC*5pnk{wdaS?Qg88g!y1YY8L zHgw^$0e{tj+sGd)=e<O89*s7-bO`z4zw&aIxy$;6urVZcHW07Y{Ck=y{{rr@=3P@5l~Q(vu*i>mE${ zZ4gNG(Cub4E!I30QlnMi9v9G^%~h&=sWE+4o172i3v2{sWoXEe7|)k(d7Q*LChgCmqZlwS6H;r3y-WoWGQ1ZWI4 zvih4qlKl{DwP{v~Ct2`6MugLP8sw-y5$TlCV0Q>g+y}L7x1>;6`UGm?N$ZCxUskoY z&o^(CDBrLYM(w{)n#e`s9w(v>R<#gpwKhKGmNbipWQEKo%&8{d2b#ww_mKX2wcIF6 z$z^{-waxN^x{s_#OzGD%mq%R&;HClawt>qpUHsmMP`2&w zBIEq0(QTO3lUrNsQg}z?4Qv;Ooa{WM=kT!0Tg^Xsu*WbQqSpD2RVJ7JL$RaGA5w}6 zY!(O^UH-z=f{lE()SPrEF`JQ#p~zCYluC2C7W&Gb^Qd7$gxz08U-FYrZx=^UrL?hM z(B@8Rig=doZ9bd+7;D7^o>|f0^K2rhpq!?=PXQ+}@oe&Aer`^(6+Mc!zIzk{(-VvK z`^{oE@MkGp$Fk*;C%Fl)_QxG;MxUBq6r9{=87ucs09(u{SCPVP%AiV?AH*x!#=~UG zW62oDH>&z(n!Z3@hE}+PPt=oec@K9%8OgRFP$r0|rIiG&w1bWZ4p-5e1~3rJa?^{W zw8k0b)*=geWsv)^myImzDo30MWvFv%;)sEn~FStryK;hrxfP*Pb{N-zjF! 
z`~Ww5ZAIl`$@lGZejD5brsEZQ$c?+E$C}XSDnBLL?(*hYjl3aS)bBoA;*MwQ z4~n?*yuqa?g0?(C<&{KX8n#7YKwL0NWRq`c!4Vd zRi*m^ffz~Qu=IXWS?++w%~!Q2d6#w}NS6}ejNYOLowtX5*` zsc2I!A2cm%x%nzz=gY^P@vh+)e{Oz_s9gJH^lGf$Y8$ zk=7}6o$y2C83ix^~U+EZ{ z4};(k2@}Oa2dWK5hCx$gSmSErFKa_<5CB zQ`5vMt;i3eofn3fDf(T-?hokqauQ|BF)H0t<2n|P^(FhT>L@P3*Z1StlR24Z zPo?apIU(kE*XMIr)tqXeeo{J(Zb_DI$UoUXgeu;Oo|$&xPg7vrZ9_2{-F!~`p>w&A zZF^CFEg*MM;AS!m0p9p_nN$Qhn#O1op3i2-5sK3|RnxU@$NaC~jwwZq($buV_=VA` zkWD0dtlRlagA49}Y3busaL~3c9x}}q$JHKttTOefn%*CD;!Ekf@tmFv6)92airc@l zkL)v7BwyDth2+bpOkW;B3d-f_#Gl9R=HHDI(nd@%N3jNZRH??M_aB>N?s$h@cXf30C17e zIjf7N|NYT_e*>JWjco|krb}-ioAF?bgvy~4${Ga8Rz`+{;Uz68CnAz zAwQ^&%3Ta<2l6zAUjhZNscAqcRL3!$YOlM(zR?B=l{EH@`+p$gO;d+`%+p3upxf;5!0D*DdMK53-pM1|tt) z`9=05#zrs0I*}?qG9aaQq~D3OEATT z0?`_9X*As6F25cQqYuoXIcN;KzYI4sP3!~L`+2C&GX$ z1!iG!cJ@%<{|4llO31PavT~B|UQKPkuY-P3AbZy`U&nI(yRAk_Yujxa4GE_oQZ3;4 z*CCJ)6K7K&`E>OCDrtC7|EQEM%JtjS113u#e2{S;r@upX&7*^yxoisZ^>Ke39UO_Q zuaamO^OYE4--H#G$=z{4n37U%3@%pmyMTNW@?K+N7_M88=yX3*INc;0xfRNlwc6e5 z8$kf&nZJVP`x1R+bNiXsHR}^x^uJxsH#q?u<&w)eJ_xGt0d_EcL2)1A{G(I>T>Z!2 z!ZpAZkaIwD0=59AcTe%ZOjJx%18B`f0r$iohbJa$6-xpGJXr|SGhkcuuQ>4u1kErc z@CH(?lSf_Diw!dOTm}@;e=-VBHaO@JY)Rj5tvX`SCCa6&(ypwBQ;uD4iu*|Q9>}>R zotF?IjOA!W_BiA>#tT#e!l%riGel7x*L&%`Y1BN&q72-sar8U4N5*dfN1}H3@P7s& zh)(K2{{bQytBjl47)4zYKZ@3DmO?m$$z^w01cjtot7LLs%`;D0cSd{Shy|A}VmHFy zf-QYa>WoXjOP`Bb^p}bAh7-@gv()%w64wx}lCgNI15LUFJx$^x*VfQVb))z}^Wr}z zXvZrQ2v#B1jat5=-&XQ{vpt8X5>)4(>?{>8)!S)TJe13*RpjL6NMwW#`0THV=uXee z_ctLbXL{bj^|SoBZ5TLNSp@MeBcGue8}-G3&ZdG}Vzr1{*(svSsBgT;tDK?!JRbB_ zP>2kXJk)W{k0R^T*{YJROel$tRDA7Xm&?ShNcS95_^Jw=~d10S)RK;G$Dc&@3teW3o&!NV++C_#S8UK-_89Xpz+$I31R z*x5HE9Ym{9H5zw@p%Aj;oM+yki%xc0J4B;*TfeZ0q2+ixWgS|Z`)B=0>%@+*eCx%Zr^+e<3qtcdC!<%CkH`Q|&R z_Q5yU)fQ?kR{v<=yA^i3S$Av_bB$K~3Z~lRV-cHsWy<=?G=~;YOo@Z9-Q~+UX zYm`IIWnye856TXYZNdww?49Wz7oy?BPFJ9J$M8v_`j-TzQC)XF^@ycNSEQOOyd=)A-%=3^_7$NuYWedn`%Qg0Rmp~%3f`(@e4G07w`Etzhf>po-suo? 
zt!Fl28!*osKp#<*EF%>BZwQKSLx-MUoTF|ZU5maooo7h=BOIM5$8BZpb$?w39z$*M ztHL|Fm=cYsvx&#H;!E$xyoUkAs0$1M1o)F_oWIZQ4m&B9QP%G!z71`K=yY4`hQ^mc zfgd- zbN&FKO;WADTT>RrpmtOv>$dw7?!%P*LA!LlVrXqY3qN(j{t%Ufv{$BnobfjUun2~= z>PdLTVYCC?)G;1JSwtDD5jOOOO4oaY2~8v~JgLbNdGMh#&88RZiJ?!E7WRjaH&Y>Z zY9fD~_)pl&wL}Fmfg_+OILo``NYff~?qmI!*hfvOpCC$mT=?)=8f*zu6ym1O$IG0P z;6baG_wm~`V81@|=LdO)YJ2WpZS)}-xv%mJp&GRg0gUp>KUcL3l%=*~$%`c?(L=8R zcQmIc*iYz`$GRgGsik&;2?i?D^7lyGGjFE9cSIwKQT0DTo1$}CLRZhdetkl`qyYV8 z=8foeA3+?Umvs0YWjzmgpnMkTwFCKZnQw{V$*nnV7GlwE3iv9lfU}T{*t@4tv4lxR z<#`c2A3Xs;!}^cV&YB=M^_ABD6|Y8@5L28w`sc{g&YuDh*y}@-m_B%v%cmQC*}{V5?n9lc7*O{biM4;=NyjZb3T`f-OaOJ}t3o=Q4DB zDkjoVTx1+@xrC~v%Jjr2xpA&JAufG=>4KSPZ?d}CFD}vRn;Jj@wc(IhrQe_nrlFj8 z4GHXnVw|u3I1Mi{=GEQfGp)M94&kgUTQ5`S(GE&4JDP~5)Fpp@GkV8^0?VoKD*sj5 zfAyt=Ad%)HJQ0JduCpGy)$Jibyg&fkbk`!S#mUcIrQjnB{g`;su}(lQcOP6E%wPG7 z6HwFl_Ilz={C8t&7S37MCj(2=(}ifPL1(9kI5z+YWX5x0O~URC1mq|BY-Jq_efCt) z%RP9Cnkk)Uo0QLYBXg+id9Vij4`Q2slybW`L@_7|GV*_;3OdfxpIeBT98)yWh%oPG zEfj69a6T-j+%Q6_PB!x0dO}W?b8{}xB<$3?AdigF^PkR^3T-9y4R#5bAq+G=vc~@q zZA;4Bqq(2SE%@po6IU8XpTx-P+@r}H83Mk8E};8Z)LiRGirYE(82Opv-nOgB8Ejs$ z#Xvk@<0h|>%1{UCnf2z+dVYoqaF9B^kYm<3IVG8@kq_p#b8Sxl#j%omakJD{YdAjM z(Oc}cA4bv;j4e#1__8R<4*Qf~O(*@&UyXrn%b8+tMAI2|pBrB5Q)$g|?Q!A#?y|^3 z<1$oK&a(1Vo|kk8x>;ae$ue-GH}h;;61*8Wh1H^iF61&Vv5Lg4Dxh%_Na6c z*;3@-+PR0bPFHCYulVxeR*6Cf-Et}+wfiaYcGQ)}Qn#(#DI$@)_o!}&Tyl^7z{bli zUc7=>Hpwj1)7`{YB}K11M8BjGy1k(t=}r+YufY>>yp>~3tPu61!#7w)U{U}lQd&@~ zP;>w*3TEsr>$Kc=MXwPBEP}G((wP|DzKpXl-L@eXNVzbZw`~(~Cq|RpPF#A8$FmnW zhWi~8%<9Tyf}z`UGUm3;fOy*Hp|0c4@J&%hOe+i% zsS~udDgT~@l^@NSYBxK$>|m2=RJFO|*!{udZ4_F^*}2={Od2IEN`VpEHd3Qc&jdqk zfnDK@t%)Nhc`xnG6tawnNiIFmLkTk`gY|1oE!oR!-ttYUqR%aI+0g3!$U)2Oax@su zs!KiIZOwlF$X9TV;{pGa#0T2_C};lJ ziR8-U6-+0SDzEnETSC07uX^?|%Pc>1*(I~|glRz4ql&jN5dF}5Q7CVrcGVj<%awfq zXJ90D>ppkvc~a8*h9{yei@;l4<3!Hu&wooE>3*gAb&Q1#d`&m9_Cp}g=mh>@R z<)0<9(axjmv|bsXB|B)A7cR~6Bdl~!pyfiug7aC?b`*SZ_B6E$EufbiAmnR^zC)iiKQ3Y)i2;LK1m$DhG4)O4Eb#{nkvCh|hY zW6t&nd>wU>XXYCBTOZNWNPl0rm3|Y{qxECvyBMQpMDjgl#=AO}YfwCX3_cK9KeI%a zclYvpt^NVAll>D7Xj`nU%4&=KU)kzRwqzO>(cE+w+7##YpZo>QFSm2QO+T61zjR`r z2xGLcP!{FC{vrG%h9?>3U@u?8En?`gM8@sFC}FZgu1-)6JXMgZzwz1r7^?(IR9L?{ z?nt}52mmt*=TH5dm3fdB{V?X+(+7DnTjamvk9ICvAG0>MNWbnMBw%WIm>gBoQ^J6~ ztOfi@P$Oh^6S#%mpwpT7pN1`TgV$5>&5_zlKLEdy6d#zU{69|u{9qL>2253%PKEwW z(fbKLCbN6-mrl@J7czzH#9sgbc#rFc@PKc%3;<8uKr4l)Z%Rte3_xX~|_=lDe84lSGZBfRap5Wfhi6K4#!hVyd$SSnDjh=%g>KnHo zOuIG^Iv}n^wq^kuI6iXG1=#OsGWxxli98^NITo2o9k6Pyoe3*Fpe6hQK%~xJ4i>f? 
z(|=%QjjX%Cue%5B;bU8^tV`_VPh?uzo%15dyzt8Q@q^6hrQ!17+oh@}-Gw!dl&2H# zy?Flpi=h{_80p57VY7Yiaw-qquAWNxP{Y=$sSN(pG0;syYWM-LFMz3?qxIc%vgd4M zhCem8lKWonw{)a3IsEweX0%vR1WCLD2r~S?Ij%=`+XhN~{+!r(0YAVSz|}C zSk42b0|p>+b(|#l@*d1u$ubEzWnpq`s0$HfBL*CH97ecDU?dEr-4zIGcamN-&vinN z4F&iBuXT|^nu_U%6|&QlyiQ#iya%DTUGIZIEzPl8r%P*cCGNLJ7*!MGapI|i^}NZwg5`b)Oo?jV|Gd*ceMfjuYx9LAb|Y|?7wbGDUcd-BTQ zD4gIXUzlKLr3>@;ZqsPlDX+Qn^_4x^-U{!eiv-}6q~Ss`uR-{pq3pxOJeUuAx{&q` z!<3P1M~2lg;bx!@yKHRCXu~F?M+T!L!4hsaw%QJ1>E|3z<`r*b!){%Zw3f_9DUdb}*X*=w4iE;O}gYTOhX{9BmOQm;Vsmvld_(rx4 zFz{({wo^Qc1J+7Vv8dHm&RA#4V_*!(d<}3Q;JseEWmT^82j-Jj{RD-PJYS;|dJjqQ zkj?@?ZLi272hv{I0aHrc8a5f++8e9HVcZGjPHaW=$EL3WJ1i!StoK*$SHv4jfIMOL zG+;-@yyR2QPs{t5?A71JH{Lw|jFh)B_%h#&BmC)0JoR+m*ZAxOM1+r68DI*$Y)2kN z1^W3ozt92l*k*vn#S8dsxmSn%Q3*_`A=t_$ShC<($NOjlButMM!`g_K+50{!V=lrG zy$XC!5a*&JUMxBuZ}L5_-qnS8MFf`VLsGuD@zQVYVkB1+006pvI(}dWUL?WHPhUxB*fCh@#!);Jp z9GFf56i3!=LuBR-Xr@3)<*H$~Ro#`BTh~$mosG!aj9JsWoI_u)HM(v$+iqxA>lx#a z!Z7%gZ;#U7t8uV;ki6UO)!`#=?#>X7)J?OX0h;5+P$z}TvPNh|^WB%Gd=VD(#(LsZ zLfOx+Q#}b}$|r2C2%2oD1i96jG6{TD6T)gcH~NGAG~WK`bBwa`S2cTo@ zTW0K%HoS(HB;yNje1s^0yx@&-a)JvVj*f|$7vg@DG%=Rov{P}_cURVH%;vf}5O5tq zd$BC4c5|k7_T&G2UiPwjJlu)A=m?r~$75hbzBJMXR?C9j-7 z0*o!Y4qm0HtIZqzFW+P}fgQzy6$?Ud1@l1)&%{b)IR@vi8|8s7R!YaznP#?aQ*)6E z?s>8mEqXyxvbLcFtVf*QpSimm$~BuZs5?NLv%7!d)!N&Ia@;;@@6&*d@UO?h?d)!v zbunz&tDhk#y(ZQ1Kz>Gd(roTzT3*3NTI=Uu>RJ5UA%gH1UlZVwQk1bTltM^te>p;g z6N^-ju1%#MJaCJyOKg;z7R5KQbJ0t^V~>pM8Dybhf3BvuQP?S@xw_5N)NX=Y^ahBcdrAOPcxmr4&ab-hp`Nc7aVv=M@~nj@rV* zzkr$YOrt9@aE6QH6b;{Ha!4e3+WE>Po6gw1TLE#vvC;mO0#Mhv@xgNs8`!)K)VNs9 z-9CSn_}ii%QvUhp%@1|(Ue#Ye=j)Y!o%O=I?g*aSXzd2E01-QVAnoHJ+Q0I3s+A%d z&!_Q5%ujy6MJHdWsYX21wC%9Ng4?a8VVm$MkW;w$Pk$8nM&&y-bv^$ecZlFtIFEG|6N+Uii7pSz3HYr+2K1j56o$kJp9i@8DhBLK?S5oxd}V8 zIY}zzWV-axmQ>DTHAbWlXgT)T2a~n!qpY)QncR#;*M&mo&QAeG%hyTkAoDHC{8}&o z2?}qyb-8!E9!o=6RI}o2-OsSj*?ZwKlTf1O*uHM(`S062HY~o!JO%_C7DU9a##*Wl z3tC2iIOzO+!OpU%Fl{KSN$XG+AzIYMu_?7XvWSu**tu#W`&&@97vJLxwM>(8!fS=l zEeY*jfO~HK8hX6xXKw^k`;Gco=Uo8vAMBhg+Hi zvPtepW0pCw52)kAZh7)vhuUCK?U}?O6eVpor+kW?t$k>Oi)mpto0OfJa==(Mnc()- zYK5B*5YchaUZ+tG7Sho%v0IyKe`w>7G`T>eVeM?s`ZJ^CRp|y>()caHbYA%*Z9l5} z%wj)WZ4y=YG2m1y4qM((ZwL7%fVi&6)G>5Ls#hJ*&HFx2X}9kROQr1CM5=7aE;tQ# zo`#vnGW(97k%j-%8l_J^YjzMlt1H!0_Lz0bv^l0g$<4go2h7!^H)e7Whbh5iD$VjZ+KZ8rgrPt7H+&|-bMQu>#qOE6j8r0zdl{JCJcpaLCU^0hs zgU|1!*Yp1H*Oq6?<)tt&dOgms$16~Tz5}5z`c#iG#yq_3?eXmDyU&>3KSTB8<8|L-o5Ro>>Dm zuUuh5utAifwlIYPg;b`Q@9#zU!{VE8v|NzBmEMZ>p36wxo8+~-bJzTtFXdJv5|4v_ zV9+Jof8?TvzZZ{Klb+bFLt(f0(sJr(>FcVz*$$2a$$vhR#4>N$vX0T0Qr0vu8`y1w zk)d%e%X%<2mZb33&`AgVtIF?-o1R7ZprSm=LyJec?0< zCG~{fGA8sWMy**=TOVP=QgmWUGmWxb;_<|(^d1@(k?|KL-6;=r5SBATszWMQR@w|W zI`@zmZMX5E8|U7da+^XZ_8@w)rL*U9=8QW|ttkB7xL?y~+gj-}1+I0c^MZTbtdB^Qnb6keIg1T8p0 za4%5h&ujkd2j`!mxLFoGM7mMgkB4)#HMAjU#7gP5vpwIhUsla#fWqUS{6EJ&xq>zX8}* zrH-CokLFSDT3a0R6jG}RwuO2X3nS80lz=>a_BB(56P=cels6siY?s@=nQ$vtoKYFM|GQp}D`l0M-y?FF2K7$o zNt-dgt^DB0!DC|v_%?n-!KSF@U!fcz83VgkD|kVZBZ!lqXMr#{=A47nWgo5o_~z%r zPT8zwB5Ky z#px!?eymrAx?jJCG0H?(|BiXukktF_$(xF2i+n;HEW73U>mvdzP!p8IsdXJs-%hIC zv6p1hmdm9v?`&sY`UPaW^-rn$n7q^Z4~c|~S_=XM%_i)>B0Fi57s+;(oqFawK5dfS%-;^3=nN4gohZpgPHr{b>~KKv zTU65p5Azx%G}B8thN_Zgi-JPM^@A1&hsI;rL(e`xb4+)3AR8g&Wy)>BYRHW}o5G=a ztGcG;z)i3g3lkqRtnW7d>@ddM=kAZu`D%@9h@E@Bq<&-Dnv(ysv|rQHq$fcPu2i@D zE0~`kG*B9IKZnq~^3prIQ^)o897alUhglC&)VyR^HxGWmYCHMHqU6V=Fq^eHfkMn4 zYtRqPFD_FHXNW4v2yr#KQ2$}7d?brlN>)Xgo=Qnh{W?c|3XRuq$A1yrDY)$!S=$Eo ze+*`(IAqgqv@3_EY!9hW+eT&IWv=iSwxw%`PJ%^>7_;il9~xqMuUA06a=6S}PUtfL zLz+(OQH|3|ou5okgq50;EzdJ&!7cl|o!A>%$b@x?gI1iZW7|p&yvoa%Vs!idDX0gAn9T~B5=f`HmKN+lzH_XgOaadI4BDfnOE5` 
zl2pwUg(CSHWkHNV236Iw%vF5zVOBXAuJdVx0U>!o^a)3VQnOn>_sYX9Y4 zi@`um%Q*RXU9}8SJaw-z^4JC|%!U^|ZOqQ`v+ZVQRtuVYd^DtG3`d+?MBFv23VN8$&LgfDNbP=_y_EJ(pdes1aNp;KKRm{Nn zx&e>0i1O7Y-Mgzl%Oc8#H(dUHJK1`GP8q$(m<0WGwm9G7!b*2J+qCNK9Voau z1Lw^O>^V+u4=4^d>#S#ekz9C~KLCCC!;(psn6H&zLAT+KXL&WCkPJeJ;nZ)z*@M&P zGI?0MFl=th5NBqkzO}WG+C#B#Ug(DF6X|cVf7Z&J%ct+Q{x_{TlLJZ5LZ+5wKKYo* z9Mx4M?(o-9!r#X$5ZxLl2+8$6pUbJVDL z?J7FrdwK-j-3s3S1-`J!-gx4cTAbfB2Py?-LGBPJ1Y~pr*BQ2bl%Lt5ak^x6{yrbx+GL2h0#diqHBl@+hKffh0Hn5X#ky3KfxHl5kQ41WKn(QB0$En~d#_M+R<~ohK9Eb^bQEJE zLlXJf|B-8|$&aZNf07%#64b1Li8 z!d0O%J-v~K5q?`$mq4J2f?o~+vv9Ucn6n6OWNl1n=p^JTzTrEs`k%h|E;C1ai?f^3 zEqo%=kk*|{058yTtPb-ooKtK8!l{x`3{_doJk9Id%pcpN^IAs99 zGQPNF0q>ZQDsbo;*qYcu;-mEPTUZDx_{gj$F%Go2ZC?~?*4Nl#jm4bowx|q zhYa2;iz@?fUWaApNNeb>wdH##<{U%xY=c%N#Q!HEdS#kc<~;2$GN%D9qO9B3`i0kl zJ@$PlePhf*RwN84%E~9F-+fS2`UFa$fDb;IG3Bt(VCw|CP#UPRe?YOh1^A8~P&DaR z<0vi2`+!gG{HA-mxeF!@<`ThTL+|bEIrv3wu_aEv(&$4fH3fmi{S$cA>9~>?{}>)Ei0Ix{-E2g~zPafY9~R1;3(wq=bvgPpTMyMBfTb2sOo1gC7xGp{O>*p09yP*xyk`e5|k~9z>j|eu4dHrGVh=wNG z7AaSr7C8oa`%QB-I;DM-!238Ge#5f{GKppKxB8mAugD58L~$KK<>9*6PAWuSh@Jq9 zK#n{61&>1h7LP?R?+tmQYXcI2=C^e>laq5wu9{+0XL-ke_BW#L_CrZM9BH?p)#v3` zVBUlgHel&-WeBkoWlv+ie$bGY)w^%&2Ps4ICBd9^lx8}_Ry{wPirn)vluBLD_9XuK!mpxNXf}1&T(`>1LV216I2XJBWeWg3{yip9nIPz7!!FKB%mV!(bmLvK!xKMPA z>n2H!+kBAO zyK_Is!EoHL6?exV-SD@T7gx*Sv~67OaOZZT)mWJoFEMp^Fh{9;yK!Ifs3GGnP3{?0 zQnm)Hpk)hxKzR)A7x_4YHwOv}@WH z^{yq4@o4}WGv*~&4Dai<(7^stJ19eOwBIidnv?Cen%`O}ERfX?Gigd5aXj*46|fMc z7LnJ{+3swG#9=_nFE0K!=>8~=sLU`@2Ii&SyG*^SvTUi0>A0Hv{=DMjp0wJIW@?8Q z-~9@;==@zJ@jORwb~s`7z^{7ki(WgHhQ?tc6_0s^rWeuA`*!#K{5GWDdv>I2SlyF9 zym~;g)55E-Um|j4`>00#U`ErrEuz0M#xb1cUIi75&5rofNz-f z5YmngMnCxJc+l>9QM(p6WZ}VEnXhsW7!5oafK1x2OE<-*(V6EiR1fxA z>!I!`W3aw2BGd9Xk(6q;OrB7$-dmZrV}H@P5~h?H&Y@$0Rjnu{&Lbv#Mt|v3(=gSN zM{dFNtF7RR_E`46eEB~Y+g}|Ix$nu1LD~0MX>E9O;UW;0h{@^4B(zHz*NbD@m-RP` zJ(y|*5@BM@yy=TEXS-*HKX$nNUY7CT%ff%o2FYg8S?8s#og2-vOi$*K3QzQ9;a8~k zB-53*-@aKn+DtBGQ}eeb(5{yNR1E?vl} z&LeYwOfa%h>gXbJ_L#gWt>9YfKjp4)9yA~b4RtiIM_|Ph;exf(88IA^uNWf1a(pbX6_aQEw%f2;?Nz0KyvNz{xc<#d|IEcN zq+|;Dt-2F^hscJq-k&tV%esxZVlUe7ndcY5YxX6aQc@*%7p5{Om$n4HYVSJ?8^{-7?TUU4`G4vvm*v=dHtu8E;iT;-pot3 z6ll8_ZiGL@4BK&d_E`(3Ccl#-C_tZBFMh0-jDm$yato*wTAN~(^R!^(y#C!n_Q@qp zmw;CiM&yl5uiYZ7Xsw(b22sf{A%^Yp#TB1i{BMcYnd&K|IJO1CB4{|5i-=?hnUYie zWk59{?|M-9KepZjn(Fxf1HSjVxYxXPWL#TDWzV?wUL{Eg$taak$h;S2Z`m?R8i=Tr zbweSWhRV2BC`C&|Jnvime&_dpp648=Q|DxTzn}4bzuvF01ry)Zft7vi_F#5oub};F z@mhMp<76RZtQxX4&x_cy5zM6{VfykL7<#Q&?my|QvICvLM1#*EC>A%qw1B)GdY;we zG1;F$sCwe<`Zi&S0&ym$_tBOTU0C0#))t7#K^L`xPYyk+;1+!Ux7pH1YMMQI*vDIo z0teK6=RzJY>)soIrlJ{izLLe~*}b;Gil+(Y>A|1gKCN2?XYsG#FW?UmQYLqLM1uh+ zqK1kK7Y`}EgbBZp6*b2O*zUq@U|tIXX{xixiDJKYtc4Gp2Xl&{@Tckd#UT++ziKjsEx>d>g#s=XV*+`29NfOvhX-Gsg&yGg zr?zEC34cOk0wdy){ODIXrYKAvRiDdu0AA5Do&P}!XA^)qqhLM6?^51zp@Uf!?g2$` zL8|*4$GrO$v@ch0Qm-;j8nzB_q0U&a2mL!p!`JbDWGZ)(zw)&2`Ng_~euJ*2mRPHM zfJd?kyj{@}Np9f)_|J~OJzVto>be8vJG30`x#qNMIXRdW19fdQ?xWu|Tjj@_KY3J9 zd-}kX!n>Uw^GalSP15>mUv(6iRfC4v$JwxHCA+-8^FVvjherR|5HXwMj?E!;_shS* zILhJe=JsIJc9zJuWBi}olGY92rq;NGnKLbWbn5nBkVxH}Sb`t33tZm-6Px*E3#G@H z>q^MvuzVfKnxbZZre4pxe$(xHT0|_hk!aAY$5rKYJ$t4BZ|qGhbZwh3P7sgF@(3iv z54?ui@MQQmONBRY>prc#d$_M;(n;jPW4ImP^oTyt_gjTA;k})?ttpMU;)VUq9ADoY zE>)hKAw0#k*0z zWt8id$w)MNNf;0nAOq6CSIK|I;3LxcI%Jh(u02|j${SiOixYed?1Woypm(Pxi)#rQ z?9rP}+PpN@9s2EJ+Ip>FczL*nt*rtFz&{F80{xcM6C^VUaXT5hZlS>TrC%_)o>o+k zB*N`wf(}qeDm?^=H&ZMy(o5eYG01y`0+4ZXZ&&&JCP&EcAFshICr?~Q4Wm#F8GSUj29Mv&Nztv|~qk<`O$r#JH5 z6PQHB#JwN$8KE;X$-wWimQssF?GmW%5=|*&)f3bL)rP71aZBk1te--%CNVB5>Y~@I z*3+5mx_4m!j&kf_QF11`~2kL ze)Z)wfAc4=zW{auTCLD=|E~OG@;;HoD|Y(Zm4mswJ-Ix 
z<%gVOXP>AP(vX22>1vy7reZioYGMewv9R|F4yot5$_E*{8cvf(VEIN+@(w5c++R{% zY8}BPE=hu;`>m$@-cP3=!dioCr{fjED@zRlI2wqRF+R6al>@Co4Gj zowgC)8-J<}9~6K)b#f`Ed!}h5cR2-pc!7;`C<<|%S~RvY%)WCv)gA zX;m?+{Sc2u3h%d(l?=Ku80~+7l%>ZvbM{raU~S9!AGVfVgk-t$BXLi!gmQ%RO1^dP z9E+PgH9X=?u!;4G^xz-hU^%}Qrx z+E_xUge6pR;nmT`nmNq^No0ihWI{a2V8nv1v9~ejdwxQc?Xq=tl#X=&MN~L;=?^UK zV?-Q$rF6{JOg4@f5BZ<^x0J;_3q7-#w(?n<^@S)2{Jma$3Hvkvzx+iDHX_enXnCAZ zT%Q;uxtTsp!oyZWe}fC|0di;t&0J&D5w{9W}e#{Czj&cJfZ7>PkLMxs;+I z@iAxv&;(ISGEocs1iVT4cQu`!*_buxDfQVZx#o6PSWBWQt~TentE^X=3I%_i1Vq>0 z@P=lLZpu6^nsh)elH2*xcazwuyS_Ct&H8gI-QKlUot2GDe@CX;prG0L1?w#cPLbjRvRurpcJN^6m zaTbS>bGB8BI)auCS6%+BUlsJ6uH>1s72NEiBy!0HS@Zi*6Opd3`Ii%C%m~5iM$L7^ znO*r2*FJsOYC8Ekm_kPOJsEpF%^BmYVcZ|vq;DyZL5ezd<(VEnXOqiZr^YHIPK{^R zepdqEf;QZ)--6)&YkDr$c%}O+u667i@!$xsvrMWwnN~DKOa#R|(36wd1G1#m%`0yN zGwWM@nKQq&dfWfKqyGJgBO#1`)?qMJC&3j_?ULuY1R^nF7DVM>KDiI6i>@*)F|TaCF;DA+kEA3cR@HSNv;e!Yh3yxiSPk8q~rTxT!tFJ zhJS>?wIk?b_u}PUB>e(!vc8;qYlgA`l55&l6`(p)vy z8RzHLyHP!xs}-|-Is1+eaAzRmWK$b2Q=8<>s?3RHD7*$GR5~^~)jk26sv6AhJ0?f& zqe@t*CWhU#ky1xzw>G|EJvV-~?fj-vn%grITrBLKGVKm6sbP}2qnF5@o{Tp#bkB$aqi$ z6)o^Z>yeL|0T3O~_*i9-a@T((hxi>cvi9nf+y8#*Ei0;X9Y4?1h1bpo;%f8RhsLxx z%7}(#*3^6uA@zVxxmr-f!krbATL<14+T*tlRk}Y-ted9hlmMu|{5JCYurah1{XUh& zM&q3@s-^PB?4g>5*gXif+7Gl=5H$@8e^U)4LhU|oH$xTC0+nuf5VCQPwu3}z!f!8x zQO8pqYRu_wAFOx|_BTO5n7d||w)epA^ubQ=P4JHLi?woxSfH9Szr?Rze9wzV)Wg+6 z?Zi5X$}Yg;l49{&acFG;gEh43h=GzC_?*JDsaA8n_xN^n9aaOy!boE0Q}H#aLT4+} zu?qUO0OKdNH1BxC53YV_w@gn8s*h~`TnT0tl?cC@9Zr#7HNBDBZLynecWU zdN{D0`VN}G80>CO1#69^>+saG2;sO%)6luvQ`dE5!=W#vhGL5B`pjwGw0y6(mTED{ z+WYXed>4Np)TWu#Ztu+n(T^bF9s@V52si~#%#fOpj?)$GTZe|^@*zN&0^)PiNDwN3 z;b{ce+>aEY)#KS4HFbV}=UDRIw<9SwlX6YQ!$>P2{$a>uh)SMxBu*Q=c)hPl>Cj{F z8Y6!)BuA1FLN|OKh{zeExBdf0Z2ccFf|>s5vlVE0{QK2kdZl&EQ{#l^&$a!2#Z`C0 z!)st3g!xkLcddSb6P)Xbl@GC!o`Y8{=b25|BeDq|u@eL|Z^7(!Ye8qn#}K5< z0|2l>R&3*U-ro@Y*9)T`=mM%p5D$@eQ#at&!y+9 z*o2T-?+Y+ek(c>6RJ3Ot_}aoW6`d!$f14$!cPW@IvH5xZ_(v}JM==o-CpwY)L4q7` zGoz_@S(3^^PF<{HtgRUfyr*bDTWp>(_Yr~-ip;qj@}#Re>}Zuq z6OMG0>pfBgQW=Ik7qs}wkA?O)W>xvW{@#S!?36iXOUD;wb@Wy*mAxnT=n=oUsU8|gSEoCl z`^LIdfcUuep-?QUS#7e5n|$U+H7YkxFhs`KB23bk>z?|;DvM}pVPyWe0QX@C>3*p6 z-E7s&dtLN7&geUlO7Fd19C8f4l%*p^uJ! zOPj}H+x)$whgOpn2mAt!m`GFT;9CX?I5_)xKS#g+gW=sWAfBL)y#9<`7(46GuHYY@ zVb(k9v|~P|S2JNzK<=7HlUR!wrODcctSWvX+@C%{zo081|BYnluj^dyEO%Y7+R8=y`W&7eE=E&- zc4WHpAB=78AS$oy>!=5XOnP+w_ny{e2McbRTUm}pHbRZSw)CgF{ODY<#)aD3^XEat z~6MdnUZigUmW#wIsy%4X^yzTMH-&?*jzl5 zM8)m=c$Db>b4rt5;ECdLh4Rf&U^tOb@?en@mtJ&h_E+zxdk0U57o1GwASWP#mZ+_> z*EX~*W>=K2eFgyRN~GV(xT&z2G5_m*9I3lekWTb27+fmWTc)7t6OoSoB%FyC6Er-! 
zyExGIZMl=DKSynfSCh9DbA6ENA}&lQd2RINTwHpVt7HzynO2WuG|CaRgAXYg!=+^g zt8kdOA15;#-qYttYTvb}8iF&4SJBh;EKv|?X1z!#%eA}bVfuJx!wrKjRT zrX995$E`HV#mNBK(EK07dDM<5OuZV+BeTK1Jjcc>ByE;yXge6_{qs!WtGkW&H}8_T zBfVsSNIE;dNgcp9vWw{n3og~}A8e1V6Ch|JboF7TJthMu9@W3_Ci@WAZd*OM%npYT z5!Vwtzy%NyD3v;Wt)5Hll|VoW%v6T4T@&HQW2HvCPQmFghEP1VIoV_P&hA!n0lPa& zn`3N)LostOof=827{0Sn|C4T?DrXFvAXCt$mYmY`D9vv4h}s^qt@!U`!2@p}>%As= zCEScP!FB6OqYwXil<|Y#`BcwpQDp^z3+3V@JdGwsQ1boKsDWXE1TH?O^iPrk>ADD6 z(&u2mWWPj5uzKvfhvx<#UuF?7tCvQ^eiPXr2D4F2-+N|lp6#)BJD9d%OVuXuLC=W^4pl8r*lp7) zPT%C@6gi|L(X=LHP!J`wm32%zcPn+_#cO5j1oWs?!o~ONVv3B3huqf!!sk4r-otsT z{E7k0xy(b3EV7kKa9kClEKV0!8d!O*9J!p7_)7R;-freqrm$6e$~UwC$BI|B%+po` zb5wACzd?hfnEOL5bCn8!u}BPy#8TGi1{wExvBBh^w#@LiyjIqs=<%a1RN$R}zGsMN z=@iz#?C0@BQ{amX2lF$i6a8DCi)XZ8M<)QH^i3&yCX&vN2KHW(H}O$zSpn#fbeH0B zm(vh}0Sea+tbw7uQTDSh%T#538%wIg#-{;Wxwc2PQu`-JQ2z*>Ji!A^GykEHFDqsh zk4T7nV&&|vv!B`=^yc%0Hq(O@`D-8W9}O=CS|7>IFhhfJ+jOs>rfGfuvmyBdIX4c+6VyncwSkTt&0oPwJI3mxlxNZMQ^Jzrd{=y%n_&*@J0T%KqEsbu%Dk1=(;znau2tqtv3H zjrN57Gn?IPCK4$Ai(m9)3LD+Q7Kl892s~SqC@h<#X32ukpXoEMdZv@Qp}C*O{e?OV z8|4jC9ggPzX&DmHHf{v>OZ72O(tcqx_DG2@V7b-U(2yTUmuTu!ZzX3yeF_HnNVLM8 z-@DrKZ+cOu;6=V}8cCJGbb6Wy$>M#@t0=y=P6c`~=N4spCmd!ZDsF8d^>rRkq1o`{HK7&&#o z(c-g&&276|;QU2lp;~GiMQ24Ng+dy(f7i2xI#p3r!HHD!i0J%B_F}6iyEUb;ISCAV zuXgxnvkJ!1p#(g#rE~7HT~D z1E}Upv-UV~lq^d+AMOV(F*`vBf{-oN5e+zyC3lMjMM=7|S$u z8a8GN5rT3oZQdFVl`=$ElfVwI*2jWaS_uWYs}^@7j9`wqTzvW?`FI(`nLX;=#qp)k zC7HwKeh4nTa6H83a0z^Zx&=X-9Jo%mS7SIa8#T{E7dox(bfjq~`tG%YiM>Z*g(a!MrMZ7NA3t2pSo%~KQ6-(7=p64xB z{%&t|^?c&wd?yX(%}&L^f%reT!JF7A|1JUOh@KB(Su;Kez%t?Gb3bEo9UQDAQ-9)Fo`wU~d3+D=9oS%m-Xk(t~3>j$-);4|0H)SU|P ztp3~T8sTX)x>{ro1svFam`)DzI^Wwd#j57OrQUc@JRlZo81-o9nExYTrplkI(0&H) zg4RM2--ki)3UGFG?MNiFhOnW3LsSEm1h?DHDW8c~q4Sr3imwB-kYD(Q?maz>hWuzQG5+5+k)@Ykd@b^fXX+7v8Y@-W0?ch z>5fkCV&XtuUHg58z#9*#g9#XxwcnA?;#c#-r#gRz1puJQ2GjL&0~}%ddM*{czs`%) zJ}~NQ=$Hydg@2)-k`2iK*%F_dQZ?xFxhFbhY~-WOGEe(WJ@TjQ;o5Ip=zWYCGFJw zD2VMtewSyYn_>%H01N+ONl53Bc?(V`t>GKKi!d9>(B&OSTy4-}Q()N(b(Q5XTs`Lu zO*p_f%n!P)Hz&Wg#ZBFvD+!cU%oaoj+y9R+p338gd9Vu#5cg0>3Q&uD$g?{@3&}zE z)HB(MY}o>xm)ha^%rP0?ZaLXISEwjOjpLumNyjF`WkkGKdtoEL9Rfr^F_m1sd+cu7 zP|cmsbMYyk@>qgFlVBjt9E;NcB!V?106I05h{Fv%SGE@(Nn5q)Px{{H3|s#L3C6 zh_f+#!B>a};Sum+B~c_)??MHerhkPtmXq7>3{e$*PGB$F!AAab3wt?6=~ljX|12px zebc}RYm$bq33>Sl*5ExH0iB9<>U`tc?nM`w9eRRRbk%6!)0yaLp{b_4jYk&kg=K}o zfTHX@-dZH$OY5AW>OX3!>bCBs9;**6cT2@_U>Z7J6kxn~%+P!@ZQk+X>luZXgyCz( zg`;?n4Y+`x@YjJApBh8^^`T>OS}9jVTjTYA1s4(n}?wG$NvI|Oiy}j9#Ty~ z5PO7suTx$Ju6%Dg(8 zqsBHg);d}EzHaxO?#D8S=cl(L3v~qzJ7G~J8>(sRCedUHJ!?1U6c2L-Xh6O)he_ z>O_08)E-QA;g)s4)!k6wFonluF!IU)R(FpH5vRMH)8Q2asAAUAT8`9k+5Lfy z4H&CB&jW72M%w~#q_{Z{RSWK(Il|dyfIs1XIc-De_+$vH(TF?*`v4$VTYOc?hh65v zTFT_%Qf;ETNZ40E*e_u0J6YKrky#FAd7588pE>Qr&sn$Og=yd#;=^M`8+=Bk+>fLk zq=}9JCey5y4~)?bd|+$Zg|IF<;k#YGHe0Zh1iQfUZ@xO=p+{42PLHq=-1N6rs;~VA zy133wnrsOh%#)`%MicZRcK_m79qjAvZfjJ@VF+ND0~frCX@EfGK9FAPlHpqH;?Zb3 zWwZWsDOO}|JsuH0z4}2kyuyyCxbLQSE=UNEOcro%jDV`ZKmf= zPqZbNaS-6ip$V^W@Z?N&5ubuvAFsD|zDP;A$>Yl9mu4p=$#bmc z+&f2AKnr)cU7Dxbc#m<0e01QOkCJ6`($-PydqpktwuXd#?{WF_KY7v)>0r*5xp=LE zSG6t{)zG8(B9lTavFR^L2Z(uOWntY2JFH#KI27lO!sF|4+t)Z;EuL8dD&O zS1kGlzQ5RbKbTnuN)umqKuk}8M0nU&s>7k(&Uygr7k}KOWLRXX>DCD$k)K`cI5+e* z?WzI!99JY3x4geu^&E?iPHia!-f+YNZut9=W*GR(QbP00NKs9ekpsEZ_q1uC?BMI&HyHTyzP0oc9O#Ewn~sLQ&pe=leQ=MP z8FT-l$JTNSZ?P3d*S9#!2u;wKn-@qD(AGTDn&*>>jx~Q=9=Yz=T%0$Bj^;Whp`)Dc zjbC>)+sdT8o9GTH3>h_Xh!0o}=o4k@=gt1cFvrJ#sBy#_D<)CuI$-Cj<{`he?sw&= zn$mcMIfaO*HL#m$M&+YkG4#E;0)=Ys`bDGn(FFQXfXnMdG=GHvZ^gO)yLs)TD-Rd+ zx1;9&|Gcxh^W8jqH8$#S6k|p+`kTKG@yZ`qDLUw?y!5+0rJ}^*<@9S7h6H()>B54Q 
zv(kZfWM9)&eJVB~uQK@MRNeSp{t^#v3EZl6qHG_*@=sslhng08(~TL*TIhWD%suNO zK+cn3P2r@$i}ZQ$eT=bjqx5%bHyMTEsQyW{P>Am)6q#ocM2V#IsI0Zm@8GP8|6CHm zdnoo04SLS%710eM&;X>;4TwXyKQu zYGjL5Z(Qb}__bP>Y~6Hx!?`KJN34;Vw)(XN7EEuDOE_Mn`b{6eAUkeX%xiW=I|JYD zm&LySC04qS64~rB?HGX~`A+!wy{+52%>%dSI@cdBI3clNnL{V|&b9pDHh+KY-Hq63 zeW1;M5|g+nv80&gx{^Cc!0yeg{#(J2oOFqwDh`zyKVlI;W`Eu~l8U9K=MCrTcc)CW zK2qsaFL@;9de0Udi@Z_nCVH(plgsru<%2m76kl6P6`(MxgQ=Ql|5T>iysU2ASU z`b4Aap^!ETW3L6|E0wiHHTG)HV2%W zr$}@G=L_RAB%!@Ed|jAb2i#92oIu@}{_2d9XAogRN#Wm<%b$oCF!3qHGdS@2Ri?8w zKA%Jf^XJE_8vFdi^i0m7;%7H%iWE*V9FGof6D{?k7aVr8vGS+HafcRcS+tY0hD^Mb zXUa+epioy9N;!@GiusFlvZKP4LysgrdIHRMR$5pu(CEhCdali+ zCR0GmU}#K!mu@M0ceO4Y#Q+ar`O~P*r!(N0`|iTduk3pzm0VX$uhA0MI|kMsb?yQcS3GT9EBg-X9ltzL6AeNeh;COGcSw>(WD(xOZ0o5?CUr?JcAFu6E1+%tG zcD--RgMN4$jUmD0wwBFLk2&bY@;k)}W8w}fXO`=5-nm>>;RlF3D|A(uHm22uOP*-Maft>fxb&W69^y3r z=#0K+im2S(-M-ut?96sN#=?bd;>_byVm1%U`1p(_Eh?x@KHdtwel+@tXlm>_GRqEy zdg!8UHT>Bifq+7FiFt8kvdzgxrR;^WOAR$y+v2b|eI$+M8&jMyBMUEbEx;uVX+Wd4 z@{~#2$%t$nmHfN8j^+zj15m}7=*>K6`__j)s7{2n=J{7qUUqarlKqddg&1zc2)CYk zqr~-T;^LF}^imcL#`h3RjqFY#Uylxs7OSg^+bz9&(q7nSORx}Z6co9;ShhZ2eD_V^ z$lzF}S_+MpO>AG+kJ4C-SRTT~YF(hO%`Lj?r60AFhYuVh;6CnU*EQfDO9X`G2C$7z zh*Fb+QjrOV#-JlhzjUdK;aL*MRQV5S9zYeOQ=JQuVIm(!W_^X^VxU~yk0!AXm;AhX zv;F|#*3~ik5Dg6PL|8%(;@)KtPlYNfUST<&blU0#V#K%b;=<6?-&$ptq16#6r-2DC zHHwv+N#U7#DI?*#Zn?ZLiclg{oCIN2Y_qtfn1}S|F}kE=3J}xuDiKeSMQnZ#&h3}C zAiH%^V|^a-sH?c0bMUm%1BztF_w>`Q$BD;{(8xT6a!ukyTv#!{9}R$q-4=M+E+k-X zfPGaqX^8j2;@LWGJS$3j|A)KxmrfkZ0aJ;hga9fSmxTygfyB~1Ppc91oBT)LS&cL~ zzF59Re2yrz?v-Rh7JLHE+?wF8{9lL}ng?Bz?{l8f3$rDEVehmR-aJ@P&l~h||GM#% z;Gu9m=BB=QtTf&|-htEHbKTKN#$BnFWPGQj2nOL)m#j`k5A`ud-1CeHZDE=!g+M|z zSvfgpmsZIB874IIIv-(Lh`8t+_mkK!d=(?FiI#Xv@4(zwsLE=aw1lVlv&GQ#g0biqE+l0f;*YV_t*MO5^I@2+Jc&i$)``&$X0DKj>Y3$(Gmoca=%kLaNcHkH0`3%9w3z3(?|YZz24 zx&M?7DB?3 zBEU|)9tb3nrfEdYJAfh(_KJ|PDs9Ug^{%p?Aikm`uTDXe;R`MHKCwgZHO?4O51GE`+l;I_4>Q z4FLN);=@f7^o#-G=&n?Sbn}9*8Lwp$Q`Vu0a{V5Qse+tOw8+!w%LScAx+^`F*bX#j z_-f}9&3u!Y+WY<$W`7rLJfW7#O3SGX<)Bh~VccXN2t)|u{$~I>=>e@rlHhUn2hU=d#?BM#&8}jTCUv+QNNZXk9c6e z{e04fU8Bu8k*Xr!^W@3x`D`%8wC2A)OnMF!f%jU|za1=F{?i7oQGtO?rk62)&>k_Q z%5G>A&3kTnd5$diXTF*dRs3i(AtvJLLRZVUUno|k9wfC{u8J%7qvd^~2VYncFPTP< z^5NAmKZf`CeO10Y^rTXT>z}}YfJa|9#$l*|p8pO?=%jJPtcrlE*6A$nrFb5)s-&A* z&-HG8+_UcJUnU>@&YF=^f$KZUz;n47vWBD8ZpjCgpIS0Ne8Js4?`Ru(SA4bb$L8~>+ESvd-D z!uJnVS^=w>U&VJHs2jg*=Np7^>>c9uF0T&jJV6)H+1OZ@3KG#%PkdQ;)@g7r zVzT=q#YDR$2+BMFH!-7*IA7q>nV~8Ac!|fw0VqQVO}iuTRDj|Smh!Q*$}0#H`n-#v zJO=gWQ;?OP4B|470U8^^5eosh702A~LcVkhASd~sjDP%R;owkFMnUW*#vlFw|Cw=! 
z%8N_ir!Na47jl;Mw<1=`jLB9sI}6B3%y*36nK2mNAU5ree8J=Xj<)VtQA_`By1WUu^w9lcIJkNn?0mZx8J5A_sd1+sJ*TmN> zY86Z%-!Zk{u6Lf6x(B&bopyoDI=>c0%kO4T>AxJ7Lsr_=*_clazq)j`pKRHtIy87^ zoSwI&5UHYtbzA*7s=C;Up(F7uSjp87XfO_*f+UY|9}|al2r$@BUoKe$LIg1FjLUqe zUtx%UiCyjEIA01Zy*HdVGIAfCfs)A)r0(r0@BZV|QAhw66afJ*KfQ7{#Q$WcdIeHV z#eQM;w}yOimyHC;hYu=28GHb|PG~!ozuW=V4^UcsnU|^|{43Wf0Q;=Rmut)fxcLPs zsuf^n{A4T-w-`gm?EY{EcrT4|-yv7+&7FElA=R_|)v#zgJH4Zy%geunQ1mz>@=e=o zYo8~2S;pyRic>U-jg!LH$TY0Aiju7w`)&(AAdD04P?6y=l&DuW3U@4f-|@-Bu@A=q zdxKB`R|RVA8+MfnGF>_1;xFnL&0+Pejo3?5Iz5i8i{_d*KWbUU*mC;C^^B*~5aT%k zstXs8W#2(w&UMxX_nknH+akrTdpZ9v<}JpCj1>6rAO0)rn{+OJjjru{3zmjAV#oG` zwuABD5{O+m555WGt=8sS_pQIxvqyu}4^zGcM}OiVIvtRrYhiDK_{XA}M?|y2P8f$`XjW%1xo{b82VR>|(CGoQ{AQH`q!kx$~4a z)^C9V;<3mJabFwX`X%1GEB2k4yqwd~cYrT~CJ4BmJOBUbqPgyBz@`V*C>*ZD6xFs1&f4XE_PvaY$kw=}!9wLcElZi9iro8O_ISZV-{J{Q+bJAM+pEkQ{c0GeVk zo3;g~2^SOoAyJBN$xryN*IIn5^X9;^Xfc#$s{F;BMzu2{ zx3^z9cr9M)w8z1$;2WgR#QW`@jc|w@d*J)!?H(Hg9gC&5Ea#X`(W_?Ugx_F% zCx5$3tqm?)NJVHFOy_;|u%+R3V&2&n0eAAs-s0Capwmx3`tYpt3x?&Fpg~2G9k{{t zkp52O>sSLYuxrWHURu4p{BBs5d=1IM*3A4!DGl2k$74HcR*hu<<12zFr7f!kqslO* zNzL)IBbOr7F)m!IA5&@RwK!Ad$SJ!=2RwBE&~wZ$3d3O zCJvSkK$+lVSCDs3B)r8@+9)8M-vpbE{soiNyGc=GU@W{}S5o@N`)Uw|C;N8Xj5@=` zpFnwkO>rS9X?Pq3B4y~-G?qjSGSrQn-4LBAwAG0=;ccZ zNHb;+FzDsJ5~*4K@crb#wF}-GF2`f|<=lmz+RTDOv;l!U1H8ayp{$XKx z9@+ey_B-J|$Q!qU?^O<|zvo)C!E+Lnh&O4)QEH{xZmq)3=gi^>nK=AqDeHiSJbax6 z`_2s*9qHL(ns$c=rIR^~Wtlk+#Z}Z1ezr@>Oj>{Gn7RMkw9i5*@lT&iFrzjF{|LL@8%zh7PUMFiq<)@#Ts^%<3TCvChemGcYD*9?-Ku%Dk(jzlFLcMaNR(>$MVz~KAx zHdo~Ktz@z2;C3O(MEcavSSoY_2egsS%dI`mSI56lk*vvys!Zvb2lz zy7~~X;|-JkEeTu#LV#X`lOj(N2^67LQaK4FhWtvdnFMbdcmC32q-#i3DJ-`d1c~%9 zLvVfmx~%K0?hDoLoKjC$W-~;MSKWz$ogHzkiU*pdLknf(P`9iQ8$2S0=L6UG>F>QP z!x4T4M~L0Yz3%uH28q8utnAy0VegNPmj=t?tE!Qw4))t;{heW&7@?5AbpBlZBI#o2 z^#*xYH!zW(|q-wkB9s^cxyvJ6}Zw!qWDairHa(t%o4xF6a2aa271 zG!gwI8gD}{kLE+xWUs~!_@E-nA5ao)yjNb#d63LZPXoS7({49ZRc`)3wtD?i8RqsZeRX@!2hr*vQOWtJ@HCo;G3esV} zFR(q!@SE1zFb1Xn9sgC`1{O|_<{gDC`uh;EiR7PF2jWP|cp{fN?-LjpC-oro>SXRV z2S8`8g1mZ+KTxEZF`qTp^V?w4L(H*4Xe*AgL?XdWIaP!;fbc#>dI-So57MGCb6P_A z_*0rDziimb->1k8c?G;TG!D`H~quZyUk-Q zf6#9c^cuYK;j)~5t5ZEHg(*;{eX_N?1=Bw`A^g-5t8qVuS1a``;>LFBy*%kt^O

[... base85-encoded GIT binary patch data elided ...]
literal 86598
[... base85-encoded GIT binary patch data elided ...]
zB#O8YR0h%y&%fISfVRbuaEf4Gj^D>@gnEp8P?6>DCj*U9Ep+5%4pp!|J%kZh2s*dT z2dASy9eRpjOQ1LE3+~^D!K<0&qZl6SqmyD(oojGuHyF3%1#*71Yo%(HwMmittW>a&IxW!%&R?m_0MVyw3)&viv_ z`nRdvKVvne+8b3 z=Xr#kU}!J%xK;0iFIlQ!z`sQ1vY#Cw&Rj7WF5pN-UvIqzn85Yf4)J!qz_f~6Rt$;w z$5F&POm)Iukyb8c+z}Y5qFAAF5N;u^XxB;YtdGVB)*Xg^azu&>F)WyC&FB2N+M}f- zN9?&Ls%-+QcOX#fFm3&Qa1#QQc+HAL7mx#e+Ethd8KckbjqBnzjdHTkZqPY3pFQ z1Ca|tB&A&lUv%f!O|wzQ>>P7ca|&!q4sDPpc5;xy-C%n>_|1;nq|pjzA~stzW{q=2 z37olrs^hWH`eEG_q;(9SuKvsAf9Z0Kbu#Jtd!CteE-qVXn@3_Q??>q(5M*QZNIu6f zEg#%y@jDf1DT*GakQ&o{1U)aC+%?$yeeJ(J!)4fim#-hTOs>uNKd80FNBKYei*$>- zZ*%*`!fVDY*2L|e8zV`1c+8Gug`tFB`(D{U4jbzp;jQKDDsU_g)j4$mg$J~z^DCOh z@@{YaEMy1}5yOsJ`c>S$WLCH;bRJjO@jE3Y<|vEwGOqSJ-FTS7*&H{WZC3dS!y`a8 z8`cr?3Q*Eto72f5r~qdn`(*78v&r|P`|#CAizu{vrZt|$)lP(bsdWo_L%>uVPYB4g zmPfu394~|!Do^|w?KT_mme(aG@j8~$omIgjr1z}|^HoWsFW@8ul%r&*tt)AMJkmc!CY4 zLI&2NCSdA^K=DNu7=7qn-SPU?SAhH_LHEZfVJ}c!2Yz)MNF<61TR|ciEZ{q(gV#Ho zVlUJCnGAu&q=zl=rWm(-3h{OA9W}4oiub4x41-3aC)EzBk%As zkXG;=U&4DLTbJ1Tkbfcl($`HsK{GmN}7}zrsrDd9I=KjWBN6y?J9Upv}I)0j7;F~*p;gPbh zlh6tEDMS_edOi*AIb_PW^)DXJr46lnihc@5#~J=oYW77+*6zbMf5Vjx3;pKijt;b@ zDONQ4$l`L}M*!-=Dqg9-Wl4r0I8xJ<^R5_$I637A8~AZq-;L|#7VoWDDpF%Y6CX6*L{sSwVd3dV|81$l z0v_>!nv7lYcyy!3%Hd~UM4GKGxn;BW!DQGGLNQ^8(jTxQND`qV413i02abTRrc`gI zcv>YTtNr#UD9>HbQg8RYH#HBzi9607oVcELi?6X>^D^z0m9|)=9R}4o@X_u-OyDDW zgrB(5iU7TypDD!?Vci>P-C{!JxbLmrzPY14L&Ev7J^ddWU)C{(#uE!JDD5086yb#Z zjXX5AJJt)zC+>{r?dGYzWYk;A7skg!)ubNGw?OgmCA?|uypx7h1X(=vgAi$^uWBn~ zNQ7S(i=wITGU+&zn^dq%D*4IuedbHUZFu90P`{CjFxFC|-OO;ZO$)yJ$?d$@_peFE z;!jZ8{*p3mcKK$^PU%3A!=>eNTLPjNSe-*hsWp=G3FAt8k|UI)>+PCAXZ?hc*7_U1 zhZszEFd5Y`Yn2M9q6$hA~IVnHn9?XBR z{^Dxo?9EwmrNBP(f=|M=jb(ekQ*lv~4F0zNn9QIyq;OASkncHemHeklI{^CeNQD3Oui!^Y+;4pyr6 znPvx5eC9T-S&Rh>H#unPbo@R`jCGSrv(uJR)$Uwx?Dujr<9mS{#onYnp@5H~zm6U* zawqKPk4a3aX%E|>p(*EuH!iHUq4Lg&y-Dzua5d>r3FOXH&|}nm&XUFz-y7NC?QC;`f&A zR(YFDDjn`~n{+yF1or5AeH=qe=M?ByjrDNiYIy_(?NCaZTA;D-Nq>R#)CZdk!iy_{^Eh^U{2yi;OqdL79E!t{RwbBsEQ#H`($74YOS|@Z% zpoVbl)Ii~1h&VrD420AyNETEBk2|5fDjVkx8{#E=Y-t8kojjC#ku6__t_PAT!uV`U zP`VD^3?io&SnZBXC7@gtKs3kmJo0BGh_Dh{U!_2$zTNXYOgAX!pLW2hVYtQ{Op-uI zeKazb9r5yaMA9aeh;+?AB~P9Jn4&2z34Ms4c*=j$oJs19mOzP%UO=(W!0)2lRZPRB z4E2q@$Vz!7bmC152Xb1Cu7}|%XNsyfVPaMrg0YqcDI6Unh_Y6m$T?SafO>}Pc6nY! 
zn$alC!U)#Q68Z3Hts2%^XewC3zOUJ6;mn^9==(va-3^7k-yy%_kIZ|CSvn@m<0LN{tkU_&y`i6# z>nf}b3IfNI=Cfy9B@_*$A*V7LF$qvWA2{Ada2a?~F+QTlOjOsad8x?e9;EBFbJCG$ ziP<42;B2@L3GLWMluI(NDc7T_2=y#$F3!f z?^l^0)67cwcyyYWLb)DNMQa%7q5!ANG;+uBAy7M(-f*9yxWWn(qRxYnEt{Fyx|W34 zvm-o_HD+|uN%Diht4D|o`zyW7N({e>Jt~kI`Ih-^6V38IuDhM8)*yq&$jbsrcQTpWoq=8Ymd4vNp2yDLD?EQ!4-gHI0+OJ z=cUAO&q;m!hAu6^=Q@*ntztzkmdwTWp+_l19Ac+@?h=!JbwTM>qRYgrrr#)++~{)a_FYJZ z6e(zSnf*nmI>e2Yd6<%@;fTc~8XlM^zjOw(u&Sdo<*sOMc=0M*gFn&2CMe?Q$*jLV z#uKUw~{;t3G5E$aqUt z&vXLbzBkWzs<4WB0D&g`Z=FOZ*;5%tN(2-eIuR{_H?jR$LOay@=JH7X`NICicR<L6X_`JTguJsZlEC8}0qk za&g)}dP>5q@t&cFT)9CeAGmHlOb}MTnot|}GI^Y`NB1CL`$j9*f8(UnOD>Zp?YD6& zJBcUqSde5nVd%z99zB7$x30MQ*;Clc*so}W5}x{k@KFZ!pzvO z@K~Tq2p>0@#%fgp{(qqV{g$x~e$G00_t8koN-rBV>7K82=?NKqnh)o=3CqNUv z-n!R*GZTK}dhyM8j$=d2>kZ)X)#$L7Yhd32Y5IDj723v>zV; zRLkjq5z|El@IZb_Nv!_+5Py-XtZ zSbgI&zN~a2NQUXXQKHQ5h8Zb#=bDGIk40ZuSHflG@l(}$VrQ1ABGMThe*PD6dhs_X z1aGd+Fm2~DZVFjWpbx;oU$Kn9j32k4WzDqo%wj8Vi!f4WgFqt3Foh~p-7<=SZe`s- z)4KwZyvB)Xwt$hi>&=}X_;OFR0W$;7jxvpu@G65Fe|dA~+Qy+(8b2t0g7LoK5K3$} zW_|bmc5ij6zQ#{r&XY#F1!Nefw~xA9cy4A;oJMcEU7#84tLiTcU@GEqL$@WbQ5xK1el;k zCgYscbgVYfG3Xt{|5rd4vLk%AYMwd(=h!^OXLvTVqz!moaW|$&t3mISZ%J z1H;L|>C9M1ki4c*wzlh1+L=GtoT$cHzd|IPdlYC@X#b0YUJl|WSKd^%{-Bm5H&Oju z%#9_%i`dL{dh|2YjB@UG;iAxG-0_-M%5ZUSd2ZUP83fhuWNtB(Sonv~oDs~mwxEU_SmNPY zON4e%A6x^Nfn7s=V=?UcQ#p!QmrJnsK-VYx7WE&}Z~0)+rwq(7I}|A_`|U0t!)D11 zq&USgFh)H35m&>=PidJ>!kr2#2Ah*mR0zH3_OJ5w3~H!)k0YF6&BDybLy*|2CmOnc zf(rsnhD&6AOjk3l*e(KL;$E4zV#Kk9Uf}bxv)OW{M1qrJsCj!u(a8yW5zj|*knox% z#L^52HKmRsrZjRY2?uVqzoY)>{NeKD44J5~8N9AOOlP?X1$O@CyVYHzeGuwWs1xH` znEp89EOR$aW0w;STAK;O#&xcX2c=Gej4JeKIZ(00)yjpUtV5+{O%{k zRoYi87)LyspY0Ydnfa>X55ox~P17ZeD~YP&-`-1KfziNRzP+qG0+F%KTjp53&q^h% zPQREO<~q7Cv&@7B?Z+dFG&V|W3WZy;+JZ2HewJGGGpHI?`smznlqVC%rLw~rUUCe_Aa(cTf&fObp)NZLgvhni7w&3GHDJoNa z>PY-{W$DHnP+zt9^5lEPzX>vY96D}JW=a9B~bgVf6sAw#z4hq z7#8RK{dEnU5h7b0zm}@*@y4m(%NVFgHV)xfWzuEy71+u(t_)07$Cez;=_rc zNS-Y4s4k#0NlDa`YFj10O7BIcZm-bV6$pip>#ZoJs}TM|L{6Il zD*pj=45cSJ$4K?DOp7@*hzcoy3Hf5VU^7;#UxuY64NMza`y zC+dn`%aO+;EFfJ9n=!BvxjBd10yT!oj5%hZ!>p=_TJI0dz3OpbDJ{8$f#RM~7OCUYUj&$thu&%ahnSVquJ%I0e?%XKQ=V zc*RkqJa_9&J7l0Z`FVh&zpqAs5Y&F|l%!XgG`NC%i!6k~@}qJ*%cinx>@>E@Hy{{{QB|=6FL{K%~<}P zFTy*@lPUQ{qyVZU?M=qikXw50r+Y{~wl+F;8OL&+?sCn;LIdyKZ}SOG|HxD%mzk3C zC@54&+UV_!#H@I#MLoH!?sA_H>#JKPB=^LjW*AA`h${V|-E0|M@Ajpz&K4hHbf`!IAE<=P3T<~5-mt!TVRfKbA3S_d%bAujhSAuB05}=BoC7*`lCMNzvO12)M zgLgafo;&xMwWEe`4T-4JaeukuBM$&TNptgW|EOZ$Fol(&uK>36+2ZAUz-aVRV(1UG zS#8O!YmC)C%X;l^%er$$4T1tzoMk<%S?%JCWZw<_(r*}t_h%bLiNBWCOxC(}Rv>XN zZFwZCwO{t^^kC++dQ+#oka`1%lE_rwqx1f7T((U8?~wZ(2;j?7k{0l+e=J^>RN#>C z_yvcAOo+g&M{->MdY-91hxwdfm_*bjIn&HoGv_W4plitvL1-?h&U(>;Uo?r#QlpH> z>V>pt<*tbrX${AvO>$TaQKyhCVshqPATF~{)70<7!{``($Im9N%eW(>d^Ul70wXWA ziruA5LOW@{`t#w}4VFE0Qy-PMgbk@JEbvklOky4ASiE_}pBv9=a*Q>5fW0`qc%U$5>)08Gvjp~Ncs--HL z%+fzXJH?H>M{>~y)ayiJfFRD7Og}a1;{tK(eI1n&IR$|S7h9dyQdXfy3uNAyCJvK8 zD5M&3wX@O3bAVFFwDr0(+sr{AjgOw+KC;bfCHNO>9i-Azt0Y(h`Z``6w;XZ*tL|Ld z`r-L7e=VxiZfbnGC=}XUm@f%*K1=21ZgMMQ`XaLXBj@PtK}7vT@JDQCC(yvEB+bYY z#M`I{u<-P?*G1kjZMn(I!OCY&TT)!hVr6SlPDB$n#r`FDUQ@Al5q*#FCb?go!KYqT zj>3Z3fb4$$HJJ4mW_=WKR&=PZEGhpj!7`Sd7Lpjd@6|d@L*sl%oaw z#zS*8uDwUblVl#kJZhz9!TkXsA2o+>hoeSG`A#XXgioUWKR-W`In2F?CSLL@_C_-j z2YJ6mo(atSs-{H96K-$*$X_2N=0I6hRva9M;F>zRyGWwU(Q?4RG2Vx!DTi>TFl>6v z=XsvERe~JYIDq-|-vvE!vr|Xgc^JPaKj!lJCwyW@@$r7qDkjfa4ZYX9l^>r>+J0Jc z20~?*DCSGj9&2ajtIpfkisDnE6}>EGxj@_-O61sVr2?tUS>pXOwf2Tcs)B{bb z#W!!Zj3?W6cC21ss@g)mM*l|mv2rEj)_P~DxeF+N@1!YtBBXF$QP_kv%&MYX(v|C7 zGO%&OSiehfU?I_3B3}UI`URmr2cDry?yl(gTJ?Qg%76$-i~MG?a*aT}l=Ry`1HtkI 
z=51WtRmo{J#yaS@zj(3(vFtvs(XlZJ&%hZIvMc}1fz>gpi5Sv-Tur@*A~-j<)@1i4 zd8y~)e<~Y$YP_3(u0?jS0&m0X)wCkQYk1jhQ~h=^MQ=}!14&M1b9U&IP@52^r?L9G_2dWffHf$E~%nLIB^3CouDdI zm$Z_L*+rvo@Q~{$f=1sm>Btx8)I=HM0E+C}zVxARD}`eR1k)Yy9*8aoDBqnm=iM-9 ze+D`0Dq5trM?K%y!lqEJ`d0uFa@`QiVWhY{;6l`SAB6GfMLC)QmLzq3rC=Tbgg!g& z?B=QbDKz=w^l79{;|7ix;0q{ej36W?odFQ4IlX`IArF-qJRFsE2r9!IhEYc_8GL}` zn(zdndt(w#sB#FDuW#|;BP`Z!eGjuS-@F{$gX*yKzqY0!XC+z*oAx5|yZ&lA#p()n zEJ|68P@|GN%mCN+{4*3nAm+UFKIm$!)F^6hM<9e$ zTI;7+43Uin2<39;2+6KIO%JAetSmtR+F~)7fU8D-1z->YM!J$(zfDS_?gRhvErpL> z3&k@NaYLUy>zUb(tq9r6y867!(_jn09ydu1f152(uKPt0jYkST&=ZFA;jsgVj%Jk&|FPem-%sdmW za%OZ>=18H|L?pjgIr^NO=7i+$RsC*&V zR%x9Tiz84;^SW%0o)180L9Po>9&-S^U7DeZcm;m}>al7VHCE?CRQ=FQaVWBg~X*)}RVw(wt=#a)L*!W1$A%Oq=BmzfR_VyCS4ypi(GC>e3s&`2+ z>yWn)G*6%V5S)%$WY5`a=@FQ>K62Z#X1yl+Gy ziDvtJPaMU%5OexOZxs#<$Eg=4KzkIpH zFYqg#S5wtiW|eimFLfWv4=F+4u^-s38Y^VN#EJHJWznaT8>xKg()0wA`XN^$AAJ~; z%A1pmspb1Eg0_qZilisA=$o!GN|}AXy@o+!_v4F^GCH$&rWv$}t3pXq@oZTeoeTB9 z=-Pi&eU6bXx?_u}mykra&t=F(sLt<%Gk^^_=*vn+wqcaGBV_kzRw)znK~Wc`L%reL zHTE`WN%-&ps?}Is^#(Bk#S~A<_g-l7&t4+Xk;O|%W!rM#JzK5xI`9V6{Fv>IK;!Nw zN}ug0>T|GI}G*1p|1 z=OA~^y=m=2-zj3FZRLHOI1@;sf3k|I{T_qFcz_p~LyN14h zBwQ4P<#Oj4+)6Z4Aw$Q{%**Dd1pR6+-^)0MSuUkrhs>lfctQ+HeM-6>8s+c7W1;TU zsx^}FmdsvVR#;~+Wz#KX;Bzb`!~cTGUM-wx7^t_aquQs{@T@&kIQxO)TYp{S{obnk z&?P8lRMCo0N5`*;KI2%+@ZyT^xtalyQ7p1l2t?jX>j7cvNjask+buS%jZd28s^HJg zYnbNd-$pZLzq;fJvv_Zo?%b3TQReIw&8EAEl9IDThZ_gGpOVMi`X9(AG~P0h)!Te$ z*`3tRu|7aLK{)Vr;B@e0I=;HnnpT*y74VZmXZzi%^uTZNOw89liH+LIrtp!TbBwA# z=-H~3Z<>kx9>WGC_O!fmkq#p~+gGe*R*Qm8s=KrT*t!Rs7g1ibsFaWN#Az2e+&Ip` zuX{9N7SZ9)J^rP(NwH#0KhN4|8UAnI7yfES zTXz*-V9UTZ8dEA|$A(};tuBs7LOL7LesYO2s%DRDi`6}NdpKM`RNR>Tpj@>`h*?V& z4C*s=`gtRYSkywhBhUcT5R?8vRWOlrES8Iq$c$ror`5r$-=W?@TUX?_a2wwilyLVQ zoFz;cr>9?WioIm090X{inVi8RmuXw}m=*;?5;sCA&*kIUE-KIej~;U$O5ye{63ApF zGL61acs(Z&ZlUlg@2He9?+^~*8SUWJmg0XQM~oVg3O&I?%s$AW%f#Bk7%L-u45+@( zPR?7g%1GOEN7_5=l^W$6qotk94s*Hnlg=~&?1tCI>JmdkuDzcEx? z;ns4rL|^|aUu(iy)AAWM9h||p2S~$9Rt!^Jhx~g`U2eO7R@-jk;#oxmjeA?8|`(|4UH^$b0M1eqMdoZY;7raid6N{IM}*qgB^ZWURpp~hb*>%Xo_ zV@`FU`NX@)fi^JedQJ z0Z<|k-pKU49o=aJqVfX8F?`iXPSHJ%R||GFF*WvY;9hmQ>sb8L64&L*9;!+SD-+C0 zRm=LzUR!C09Z~hj+8SXP4TXPKso}2XpUnGQ_!J?N3%^cv<9Cs4p`!TFif*I#06j5 zRFDVCDWo#U>w$-ogf>~y7p_<}M5O2mMhBX!5CSMA5syPFPtP6!dPo``Idva^_=9fq zzB_rQ6jmx9%{YXKnIa`^mpA+PE&TufQK7P6R{cF4?@J?5aH;h6_xB=7D)~SHQ`V!K z?-sYm;jZv^z=CnlW~(z#HQUqYK`VDlZMX!66m}u(*;(Z#SqxfqN-1i~F0Lj^BD0kW zj9V03+#X68B4EaYBps|4`g0`t(kEA7>VKsK0F3QvuT!*2j4uVH?_3)|E^B+&xmL;A z^Sn8`K0)yWKPH{v4J+i$%9i~Cu346Zuy6OPymRxq&dsJvKG1%vWgm%mAunZ@6ynP| z5WGZ+!m1*o16dOiG@PL9GU_%T?DapxP_#XhqRdjExRVta&3@FcvzK_$Z?&KXAh@X6 z6XjP93L1kFZWVN)l9C@U-RiA_($Nd|`G2CZNz2kRJ`S&n>JkfGsyCTI@047 zt02EKL*3_tQ3PY1vA!l=mtt)%6GFHawZ&+#th?|ePU_nF{Olr|4Pax@4vRNv-XH$} zWaX9U%u;JMZpRA*#8MC-KilQ@F{HO8KV%g7RKOSm z7MwIlV!Yza$Y0}~*sWYK5YDyB{hJFWfK?3iW~ohX3)Z_L@%%n}&M7cc(k_t<$?FS9 zS!)N~hRXlivrnpAJUjqUc*0k9y?cGAw(pnh9T}QRDSOT8PnXuoMLyMU;MDElUf9q*ZCTpT zauT*X;f1tuQ@3zg!-!SWp4-1xxY?xWkScV*(-$Fy{7oMo9roR2(*1NF(^D>uqz7?^ zBzhWT%U;sBH5Zvdp&hN8=NP}s_G0XOLA{|q=g~6P$X8A2aVr)seIZ&+9u!L0RyaTRp?1!S%DrKB=rs#0|G!MtE)_1#_? 
z85G_<^>u#<=l-3(N9r6ab+t?`F1&cCwS5?g^?-h`uOd44tv@u<1@#TE7<(s>^sBASE79v zDeDf1%aq7g+Q`)WDWd3bqrbu)EsNue&&7!@$TW}eh)7gsuP~$<9QvhWbqQ5n?eL;K zYRbl45tj|y-Y$M+a#FW>dt}n4>U%GezOfnh?5rD}^+N8qjl{A1M)^MauC1zQ?YA^0 zy4og2X`{#pSEu{izMb9fyEpKU+0um^g{^At3;(p;I`z!IPdR-)n+xmaB^QdvZ&WTU|r&PZ4n~ay)4KRI?{O6C`RvIs>mQVO&>Z;`4>0ZoY zCZAX?`~U05EAZn}b{EH=cl?+1XL!03yt&{0_2d655AA6DX}HdS!g}9=eyuU^e%aM8 zJvm&>w_ECF@3b6$?ej~TF5TYq&Y^}sb!;7{9{LdL<%LQg>JtC4Y}D1VG}-g6N`IoK zd)jJJg*eKnhx)Y?#=qIPef^7CM!hJue2(-#f4r2U5h`PNgk#DI8Ss@c-+_ax~fhKJxgo{&$t9cC_7pH^$US`R~T~@5cDQ eV3!#Wm;;tBmOrnYo5h6xST40OySl_N^nU;(ZqPgc diff --git a/format/Metadata.md b/format/Metadata.md index fa5f623ac9797..3388a7e6cf00d 100644 --- a/format/Metadata.md +++ b/format/Metadata.md @@ -46,7 +46,17 @@ table Field { name: string; nullable: bool; type: Type; + // present only if the field is dictionary encoded + // will point to a dictionary provided by a DictionaryBatch message + dictionary: long; + // children apply only to Nested data types like Struct, List and Union children: [Field]; + /// layout of buffers produced for this type (as derived from the Type) + /// does not include children + /// each recordbatch will return instances of those Buffers. + layout: [ VectorLayout ]; + // User-defined metadata + custom_metadata: [ KeyValue ]; } ``` From 2d8e82056afdcf125e6e512f96007389ce79c1c7 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Fri, 7 Oct 2016 12:13:58 -0700 Subject: [PATCH 0160/1644] ARROW-319: Add canonical Arrow Schema json representation Author: Julien Le Dem Closes #158 from julienledem/json and squashes the following commits: 796cc6d [Julien Le Dem] add json documentation f0b2a39 [Julien Le Dem] add sanity checks 7dd6d45 [Julien Le Dem] fix typo 248d3ec [Julien Le Dem] more tests f2bc3fb [Julien Le Dem] ARROW-319: Add canonical Arrow Schema json representation --- format/Metadata.md | 81 +++++++++ .../src/main/codegen/templates/ArrowType.java | 165 ++++++++++++++++-- .../arrow/vector/schema/ArrowVectorType.java | 43 ++++- .../arrow/vector/schema/TypeLayout.java | 11 +- .../arrow/vector/schema/VectorLayout.java | 5 +- .../apache/arrow/vector/types/pojo/Field.java | 23 ++- .../arrow/vector/types/pojo/Schema.java | 90 ++++++++-- .../arrow/vector/types/pojo/TestSchema.java | 119 +++++++++++++ 8 files changed, 501 insertions(+), 36 deletions(-) create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java diff --git a/format/Metadata.md b/format/Metadata.md index 3388a7e6cf00d..653a4c73e830e 100644 --- a/format/Metadata.md +++ b/format/Metadata.md @@ -63,6 +63,87 @@ table Field { The `type` is the logical type of the field. Nested types, such as List, Struct, and Union, have a sequence of child fields. 
+a JSON representation of the schema is also provided: +Field: +``` +{ + "name" : "name_of_the_field", + "nullable" : false, + "type" : /* Type */, + "children" : [ /* Field */ ], + "typeLayout" : { + "vectors" : [ /* VectorLayout */ ] + } +} +``` +VectorLayout: +``` +{ + "type" : "DATA|OFFSET|VALIDITY|TYPE", + "typeBitWidth" : /* int */ +} +``` +Type: +``` +{ + "name" : "null|struct|list|union|int|floatingpoint|utf8|binary|bool|decimal|date|time|timestamp|interval" + // fields as defined in the flatbuff depending on the type name +} +``` +Union: +``` +{ + "name" : "union", + "mode" : "Sparse|Dense", + "typeIds" : [ /* integer */ ] +} +``` +Int: +``` +{ + "name" : "int", + "bitWidth" : /* integer */, + "isSigned" : /* boolean */ +} +``` +FloatingPoint: +``` +{ + "name" : "floatingpoint", + "precision" : "HALF|SINGLE|DOUBLE" +} +``` +Decimal: +``` +{ + "name" : "decimal", + "precision" : /* integer */, + "scale" : /* integer */ +} +``` +Timestamp: +``` +{ + "name" : "timestamp", + "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND" +} +``` +Interval: +``` +{ + "name" : "interval", + "unit" : "YEAR_MONTH|DAY_TIME" +} +``` +Schema: +``` +{ + "fields" : [ + /* Field */ + ] +} +``` + ## Record data headers A record batch is a collection of top-level named, equal length Arrow arrays diff --git a/java/vector/src/main/codegen/templates/ArrowType.java b/java/vector/src/main/codegen/templates/ArrowType.java index 30f2c68efe0b3..4069e6061b66e 100644 --- a/java/vector/src/main/codegen/templates/ArrowType.java +++ b/java/vector/src/main/codegen/templates/ArrowType.java @@ -16,12 +16,6 @@ * limitations under the License. */ -import org.apache.arrow.flatbuf.Field; -import org.apache.arrow.flatbuf.Type; -import org.apache.arrow.vector.types.pojo.ArrowType.Int; - -import java.util.Objects; - <@pp.dropOutputFile /> <@pp.changeOutputFile name="/org/apache/arrow/vector/types/pojo/ArrowType.java" /> <#include "/@includes/license.ftl" /> @@ -31,13 +25,150 @@ import com.google.flatbuffers.FlatBufferBuilder; import org.apache.arrow.flatbuf.Type; +import java.io.IOException; import java.util.Objects; +import org.apache.arrow.flatbuf.Precision; +import org.apache.arrow.flatbuf.UnionMode; +import org.apache.arrow.flatbuf.TimeUnit; +import org.apache.arrow.flatbuf.IntervalUnit; + +import com.fasterxml.jackson.annotation.JsonCreator; +import com.fasterxml.jackson.annotation.JsonIgnore; +import com.fasterxml.jackson.annotation.JsonProperty; +import com.fasterxml.jackson.annotation.JsonSubTypes; +import com.fasterxml.jackson.annotation.JsonTypeInfo; +import com.fasterxml.jackson.core.JsonGenerator; +import com.fasterxml.jackson.core.JsonParser; +import com.fasterxml.jackson.core.JsonProcessingException; +import com.fasterxml.jackson.databind.DeserializationContext; +import com.fasterxml.jackson.databind.JsonDeserializer; +import com.fasterxml.jackson.databind.JsonSerializer; +import com.fasterxml.jackson.databind.SerializerProvider; +import com.fasterxml.jackson.databind.annotation.JsonDeserialize; +import com.fasterxml.jackson.databind.annotation.JsonSerialize; + /** * Arrow types **/ +@JsonTypeInfo( + use = JsonTypeInfo.Id.NAME, + include = JsonTypeInfo.As.PROPERTY, + property = "name") +@JsonSubTypes({ +<#list arrowTypes.types as type> + @JsonSubTypes.Type(value = ArrowType.${type.name}.class, name = "${type.name?remove_ending("_")?lower_case}"), + +}) public abstract class ArrowType { + private static class FloatingPointPrecisionSerializer extends JsonSerializer { + @Override + public void serialize(Short 
precision, + JsonGenerator jsonGenerator, + SerializerProvider serializerProvider) + throws IOException, JsonProcessingException { + jsonGenerator.writeObject(Precision.name(precision)); + } + } + + private static class FloatingPointPrecisionDeserializer extends JsonDeserializer { + @Override + public Short deserialize(JsonParser p, DeserializationContext ctxt) throws IOException, JsonProcessingException { + String name = p.getText(); + switch(name) { + case "HALF": + return Precision.HALF; + case "SINGLE": + return Precision.SINGLE; + case "DOUBLE": + return Precision.DOUBLE; + default: + throw new IllegalArgumentException("unknown precision: " + name); + } + } + } + + private static class UnionModeSerializer extends JsonSerializer { + @Override + public void serialize(Short mode, + JsonGenerator jsonGenerator, + SerializerProvider serializerProvider) + throws IOException, JsonProcessingException { + jsonGenerator.writeObject(UnionMode.name(mode)); + } + } + + private static class UnionModeDeserializer extends JsonDeserializer { + @Override + public Short deserialize(JsonParser p, DeserializationContext ctxt) throws IOException, JsonProcessingException { + String name = p.getText(); + switch(name) { + case "Sparse": + return UnionMode.Sparse; + case "Dense": + return UnionMode.Dense; + default: + throw new IllegalArgumentException("unknown union mode: " + name); + } + } + } + + private static class TimestampUnitSerializer extends JsonSerializer { + @Override + public void serialize(Short unit, + JsonGenerator jsonGenerator, + SerializerProvider serializerProvider) + throws IOException, JsonProcessingException { + jsonGenerator.writeObject(TimeUnit.name(unit)); + } + } + + private static class TimestampUnitDeserializer extends JsonDeserializer { + @Override + public Short deserialize(JsonParser p, DeserializationContext ctxt) throws IOException, JsonProcessingException { + String name = p.getText(); + switch(name) { + case "SECOND": + return TimeUnit.SECOND; + case "MILLISECOND": + return TimeUnit.MILLISECOND; + case "MICROSECOND": + return TimeUnit.MICROSECOND; + case "NANOSECOND": + return TimeUnit.NANOSECOND; + default: + throw new IllegalArgumentException("unknown time unit: " + name); + } + } + } + + private static class IntervalUnitSerializer extends JsonSerializer { + @Override + public void serialize(Short unit, + JsonGenerator jsonGenerator, + SerializerProvider serializerProvider) + throws IOException, JsonProcessingException { + jsonGenerator.writeObject(IntervalUnit.name(unit)); + } + } + + private static class IntervalUnitDeserializer extends JsonDeserializer { + @Override + public Short deserialize(JsonParser p, DeserializationContext ctxt) throws IOException, JsonProcessingException { + String name = p.getText(); + switch(name) { + case "YEAR_MONTH": + return IntervalUnit.YEAR_MONTH; + case "DAY_TIME": + return IntervalUnit.DAY_TIME; + default: + throw new IllegalArgumentException("unknown interval unit: " + name); + } + } + } + + @JsonIgnore public abstract byte getTypeType(); public abstract int getType(FlatBufferBuilder builder); public abstract T accept(ArrowTypeVisitor visitor); @@ -70,7 +201,12 @@ public static class ${name} extends ArrowType { <#if type.fields?size != 0> - public ${type.name}(<#list type.fields as field>${field.type} ${field.name}<#if field_has_next>, ) { + @JsonCreator + public ${type.name}( + <#list type.fields as field> + <#if field.type == "short"> @JsonDeserialize(using = ${type.name}${field.name?cap_first}Deserializer.class) 
@JsonProperty("${field.name}") ${field.type} ${field.name}<#if field_has_next>, + + ) { <#list type.fields as field> this.${field.name} = ${field.name}; @@ -86,20 +222,29 @@ public byte getTypeType() { public int getType(FlatBufferBuilder builder) { <#list type.fields as field> <#if field.type == "String"> - int ${field.name} = builder.createString(this.${field.name}); + int ${field.name} = this.${field.name} == null ? -1 : builder.createString(this.${field.name}); <#if field.type == "int[]"> - int ${field.name} = org.apache.arrow.flatbuf.${type.name}.create${field.name?cap_first}Vector(builder, this.${field.name}); + int ${field.name} = this.${field.name} == null ? -1 : org.apache.arrow.flatbuf.${type.name}.create${field.name?cap_first}Vector(builder, this.${field.name}); org.apache.arrow.flatbuf.${type.name}.start${type.name}(builder); <#list type.fields as field> - org.apache.arrow.flatbuf.${type.name}.add${field.name?cap_first}(builder, ${field.name}); + <#if field.type == "String" || field.type == "int[]"> + if (this.${field.name} != null) { + org.apache.arrow.flatbuf.${type.name}.add${field.name?cap_first}(builder, ${field.name}); + } + <#else> + org.apache.arrow.flatbuf.${type.name}.add${field.name?cap_first}(builder, this.${field.name}); + return org.apache.arrow.flatbuf.${type.name}.end${type.name}(builder); } <#list fields as field> + <#if field.type == "short"> + @JsonSerialize(using = ${type.name}${field.name?cap_first}Serializer.class) + public ${field.type} get${field.name?cap_first}() { return ${field.name}; } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowVectorType.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowVectorType.java index 9b7fa45bb9ae3..8fe8e484496cd 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowVectorType.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowVectorType.java @@ -17,8 +17,15 @@ */ package org.apache.arrow.vector.schema; +import java.util.Map; + import org.apache.arrow.flatbuf.VectorType; +import com.fasterxml.jackson.annotation.JsonCreator; +import com.fasterxml.jackson.annotation.JsonValue; +import com.google.common.collect.ImmutableMap; +import com.google.common.collect.ImmutableMap.Builder; + public class ArrowVectorType { public static final ArrowVectorType DATA = new ArrowVectorType(VectorType.DATA); @@ -26,22 +33,52 @@ public class ArrowVectorType { public static final ArrowVectorType VALIDITY = new ArrowVectorType(VectorType.VALIDITY); public static final ArrowVectorType TYPE = new ArrowVectorType(VectorType.TYPE); + private static final Map typeByName; + static { + ArrowVectorType[] types = { DATA, OFFSET, VALIDITY, TYPE }; + Builder builder = ImmutableMap.builder(); + for (ArrowVectorType type: types) { + builder.put(type.getName(), type); + } + typeByName = builder.build(); + } + + public static ArrowVectorType fromName(String name) { + ArrowVectorType type = typeByName.get(name); + if (type == null) { + throw new IllegalArgumentException("Unknown type " + name); + } + return type; + } + private final short type; public ArrowVectorType(short type) { this.type = type; + // validate that the type is valid + getName(); + } + + @JsonCreator + private ArrowVectorType(String name) { + this.type = fromName(name).type; } public short getType() { return type; } - @Override - public String toString() { + @JsonValue + public String getName() { try { return VectorType.name(type); } catch (ArrayIndexOutOfBoundsException e) { - return "Unlnown type " + 
type; + throw new IllegalArgumentException("Unknown type " + type); } } + + @Override + public String toString() { + return getName(); + } } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java index 072385a215582..06ae203bf4422 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java @@ -19,6 +19,7 @@ import static java.util.Arrays.asList; import static org.apache.arrow.flatbuf.Precision.DOUBLE; +import static org.apache.arrow.flatbuf.Precision.HALF; import static org.apache.arrow.flatbuf.Precision.SINGLE; import static org.apache.arrow.vector.schema.VectorLayout.booleanVector; import static org.apache.arrow.vector.schema.VectorLayout.byteVector; @@ -49,6 +50,9 @@ import org.apache.arrow.vector.types.pojo.ArrowType.Union; import org.apache.arrow.vector.types.pojo.ArrowType.Utf8; +import com.fasterxml.jackson.annotation.JsonCreator; +import com.fasterxml.jackson.annotation.JsonIgnore; +import com.fasterxml.jackson.annotation.JsonProperty; import com.google.common.base.Preconditions; /** @@ -110,6 +114,9 @@ public static TypeLayout getTypeLayout(final ArrowType arrowType) { @Override public TypeLayout visit(FloatingPoint type) { int bitWidth; switch (type.getPrecision()) { + case HALF: + bitWidth = 16; + break; case SINGLE: bitWidth = 32; break; @@ -184,7 +191,8 @@ public TypeLayout visit(Interval type) { // TODO: check size private final List vectors; - public TypeLayout(List vectors) { + @JsonCreator + public TypeLayout(@JsonProperty("vectors") List vectors) { super(); this.vectors = Preconditions.checkNotNull(vectors); } @@ -198,6 +206,7 @@ public List getVectors() { return vectors; } + @JsonIgnore public List getVectorTypes() { List types = new ArrayList<>(vectors.size()); for (VectorLayout vector : vectors) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/VectorLayout.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/VectorLayout.java index 532e9d2328b0f..931c00a02817b 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/schema/VectorLayout.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/VectorLayout.java @@ -22,6 +22,8 @@ import static org.apache.arrow.vector.schema.ArrowVectorType.TYPE; import static org.apache.arrow.vector.schema.ArrowVectorType.VALIDITY; +import com.fasterxml.jackson.annotation.JsonCreator; +import com.fasterxml.jackson.annotation.JsonProperty; import com.google.common.base.Preconditions; import com.google.flatbuffers.FlatBufferBuilder; @@ -75,7 +77,8 @@ public static VectorLayout byteVector() { private final ArrowVectorType type; - private VectorLayout(ArrowVectorType type, int typeBitWidth) { + @JsonCreator + private VectorLayout(@JsonProperty("type") ArrowVectorType type, @JsonProperty("typeBitWidth") int typeBitWidth) { super(); this.type = Preconditions.checkNotNull(type); this.typeBitWidth = (short)typeBitWidth; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java index cfa1ed40aeb8c..49ba524ab0a4f 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java @@ -18,6 +18,7 @@ package org.apache.arrow.vector.types.pojo; +import static 
com.google.common.base.Preconditions.checkNotNull; import static org.apache.arrow.vector.types.pojo.ArrowType.getTypeForField; import java.util.List; @@ -26,6 +27,8 @@ import org.apache.arrow.vector.schema.TypeLayout; import org.apache.arrow.vector.schema.VectorLayout; +import com.fasterxml.jackson.annotation.JsonCreator; +import com.fasterxml.jackson.annotation.JsonProperty; import com.google.common.collect.ImmutableList; import com.google.flatbuffers.FlatBufferBuilder; @@ -36,20 +39,26 @@ public class Field { private final List children; private final TypeLayout typeLayout; - private Field(String name, boolean nullable, ArrowType type, List children, TypeLayout typeLayout) { + @JsonCreator + private Field( + @JsonProperty("name") String name, + @JsonProperty("nullable") boolean nullable, + @JsonProperty("type") ArrowType type, + @JsonProperty("children") List children, + @JsonProperty("typeLayout") TypeLayout typeLayout) { this.name = name; this.nullable = nullable; - this.type = type; + this.type = checkNotNull(type); if (children == null) { this.children = ImmutableList.of(); } else { this.children = children; } - this.typeLayout = typeLayout; + this.typeLayout = checkNotNull(typeLayout); } public Field(String name, boolean nullable, ArrowType type, List children) { - this(name, nullable, type, children, TypeLayout.getTypeLayout(type)); + this(name, nullable, type, children, TypeLayout.getTypeLayout(checkNotNull(type))); } public static Field convertField(org.apache.arrow.flatbuf.Field field) { @@ -77,7 +86,7 @@ public void validate() { } public int getField(FlatBufferBuilder builder) { - int nameOffset = builder.createString(name); + int nameOffset = name == null ? -1 : builder.createString(name); int typeOffset = type.getType(builder); int[] childrenData = new int[children.size()]; for (int i = 0; i < children.size(); i++) { @@ -91,7 +100,9 @@ public int getField(FlatBufferBuilder builder) { } int layoutOffset = org.apache.arrow.flatbuf.Field.createLayoutVector(builder, buffersData); org.apache.arrow.flatbuf.Field.startField(builder); - org.apache.arrow.flatbuf.Field.addName(builder, nameOffset); + if (name != null) { + org.apache.arrow.flatbuf.Field.addName(builder, nameOffset); + } org.apache.arrow.flatbuf.Field.addNullable(builder, nullable); org.apache.arrow.flatbuf.Field.addTypeType(builder, type.getTypeType()); org.apache.arrow.flatbuf.Field.addType(builder, typeOffset); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java index 231be9bd55ca7..44b877eb730d5 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java @@ -18,19 +18,91 @@ package org.apache.arrow.vector.types.pojo; +import static com.google.common.base.Preconditions.checkNotNull; import static org.apache.arrow.vector.types.pojo.Field.convertField; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Collections; import java.util.List; import java.util.Objects; +import com.fasterxml.jackson.annotation.JsonCreator; +import com.fasterxml.jackson.annotation.JsonProperty; +import com.fasterxml.jackson.core.JsonProcessingException; +import com.fasterxml.jackson.databind.ObjectMapper; +import com.fasterxml.jackson.databind.ObjectReader; +import com.fasterxml.jackson.databind.ObjectWriter; import com.google.common.collect.ImmutableList; import com.google.flatbuffers.FlatBufferBuilder; +/** 
+ * An Arrow Schema + */ public class Schema { - private List fields; - public Schema(List fields) { - this.fields = ImmutableList.copyOf(fields); + /** + * @param the list of the fields + * @param name the name of the field to return + * @return the corresponding field + * @throws IllegalArgumentException if the field was not found + */ + public static Field findField(List fields, String name) { + for (Field field : fields) { + if (field.getName().equals(name)) { + return field; + } + } + throw new IllegalArgumentException(String.format("field %s not found in %s", name, fields)); + } + + private static final ObjectMapper mapper = new ObjectMapper(); + private static final ObjectWriter writer = mapper.writerWithDefaultPrettyPrinter(); + private static final ObjectReader reader = mapper.readerFor(Schema.class); + + public static Schema fromJSON(String json) throws IOException { + return reader.readValue(checkNotNull(json)); + } + + public static Schema convertSchema(org.apache.arrow.flatbuf.Schema schema) { + ImmutableList.Builder childrenBuilder = ImmutableList.builder(); + for (int i = 0; i < schema.fieldsLength(); i++) { + childrenBuilder.add(convertField(schema.fields(i))); + } + List fields = childrenBuilder.build(); + return new Schema(fields); + } + + private final List fields; + + @JsonCreator + public Schema(@JsonProperty("fields") Iterable fields) { + List fieldList = new ArrayList<>(); + for (Field field : fields) { + fieldList.add(field); + } + this.fields = Collections.unmodifiableList(fieldList); + } + + public List getFields() { + return fields; + } + + /** + * @param name the name of the field to return + * @return the corresponding field + */ + public Field findField(String name) { + return findField(getFields(), name); + } + + public String toJson() { + try { + return writer.writeValueAsString(this); + } catch (JsonProcessingException e) { + // this should not happen + throw new RuntimeException(e); + } } public int getSchema(FlatBufferBuilder builder) { @@ -44,9 +116,6 @@ public int getSchema(FlatBufferBuilder builder) { return org.apache.arrow.flatbuf.Schema.endSchema(builder); } - public List getFields() { - return fields; - } @Override public int hashCode() { @@ -61,15 +130,6 @@ public boolean equals(Object obj) { return Objects.equals(this.fields, ((Schema) obj).fields); } - public static Schema convertSchema(org.apache.arrow.flatbuf.Schema schema) { - ImmutableList.Builder childrenBuilder = ImmutableList.builder(); - for (int i = 0; i < schema.fieldsLength(); i++) { - childrenBuilder.add(convertField(schema.fields(i))); - } - List fields = childrenBuilder.build(); - return new Schema(fields); - } - @Override public String toString() { return "Schema" + fields; diff --git a/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java b/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java new file mode 100644 index 0000000000000..0ef8be7ef1b8a --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java @@ -0,0 +1,119 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + *

+ * http://www.apache.org/licenses/LICENSE-2.0 + *

+ * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.types.pojo; + +import static java.util.Arrays.asList; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +import java.io.IOException; + +import org.apache.arrow.flatbuf.IntervalUnit; +import org.apache.arrow.flatbuf.Precision; +import org.apache.arrow.flatbuf.TimeUnit; +import org.apache.arrow.flatbuf.UnionMode; +import org.junit.Test; + +public class TestSchema { + + private static Field field(String name, boolean nullable, ArrowType type, Field... children) { + return new Field(name, nullable, type, asList(children)); + } + + private static Field field(String name, ArrowType type, Field... children) { + return field(name, true, type, children); + } + + @Test + public void testAll() throws IOException { + Schema schema = new Schema(asList( + field("a", false, new ArrowType.Null()), + field("b", new ArrowType.Struct_(), field("ba", new ArrowType.Null())), + field("c", new ArrowType.List(), field("ca", new ArrowType.Null())), + field("d", new ArrowType.Union(UnionMode.Sparse, new int[] {1, 2, 3}), field("da", new ArrowType.Null())), + field("e", new ArrowType.Int(8, true)), + field("f", new ArrowType.FloatingPoint(Precision.SINGLE)), + field("g", new ArrowType.Utf8()), + field("h", new ArrowType.Binary()), + field("i", new ArrowType.Bool()), + field("j", new ArrowType.Decimal(5, 5)), + field("k", new ArrowType.Date()), + field("l", new ArrowType.Time()), + field("m", new ArrowType.Timestamp(TimeUnit.MILLISECOND)), + field("n", new ArrowType.Interval(IntervalUnit.DAY_TIME)) + )); + roundTrip(schema); + } + + @Test + public void testUnion() throws IOException { + Schema schema = new Schema(asList( + field("d", new ArrowType.Union(UnionMode.Sparse, new int[] {1, 2, 3}), field("da", new ArrowType.Null())) + )); + roundTrip(schema); + contains(schema, "Sparse"); + } + + @Test + public void testTS() throws IOException { + Schema schema = new Schema(asList( + field("a", new ArrowType.Timestamp(TimeUnit.SECOND)), + field("b", new ArrowType.Timestamp(TimeUnit.MILLISECOND)), + field("c", new ArrowType.Timestamp(TimeUnit.MICROSECOND)), + field("d", new ArrowType.Timestamp(TimeUnit.NANOSECOND)) + )); + roundTrip(schema); + contains(schema, "SECOND", "MILLISECOND", "MICROSECOND", "NANOSECOND"); + } + + @Test + public void testInterval() throws IOException { + Schema schema = new Schema(asList( + field("a", new ArrowType.Interval(IntervalUnit.YEAR_MONTH)), + field("b", new ArrowType.Interval(IntervalUnit.DAY_TIME)) + )); + roundTrip(schema); + contains(schema, "YEAR_MONTH", "DAY_TIME"); + } + + @Test + public void testFP() throws IOException { + Schema schema = new Schema(asList( + field("a", new ArrowType.FloatingPoint(Precision.HALF)), + field("b", new ArrowType.FloatingPoint(Precision.SINGLE)), + field("c", new ArrowType.FloatingPoint(Precision.DOUBLE)) + )); + roundTrip(schema); + contains(schema, "HALF", "SINGLE", "DOUBLE"); + } + + private void roundTrip(Schema schema) throws IOException { + String json = schema.toJson(); + Schema actual = Schema.fromJSON(json); + assertEquals(schema.toJson(), actual.toJson()); + assertEquals(schema, actual); + } + + private void contains(Schema schema, String... 
s) throws IOException { + String json = schema.toJson(); + for (String string : s) { + assertTrue(json + " contains " + string, json.contains(string)); + } + } + +} From 1196691e221c5b00bbf9bf47eead6f684b61fe62 Mon Sep 17 00:00:00 2001 From: Steven Phillips Date: Fri, 7 Oct 2016 13:12:35 -0700 Subject: [PATCH 0161/1644] ARROW-326: Initialize nested writers in MapWriter based on the underlying MapVector's field Closes #163 --- .../main/codegen/templates/MapWriters.java | 22 +++++++++++++++++++ .../complex/impl/TestPromotableWriter.java | 21 +++++++++++++++++- 2 files changed, 42 insertions(+), 1 deletion(-) diff --git a/java/vector/src/main/codegen/templates/MapWriters.java b/java/vector/src/main/codegen/templates/MapWriters.java index 7f319a9ca34d8..9fe20df7a1df0 100644 --- a/java/vector/src/main/codegen/templates/MapWriters.java +++ b/java/vector/src/main/codegen/templates/MapWriters.java @@ -56,6 +56,28 @@ public class ${mode}MapWriter extends AbstractFieldWriter { } this.container = container; + for (Field child : container.getField().getChildren()) { + switch (Types.getMinorTypeForArrowType(child.getType())) { + case MAP: + map(child.getName()); + break; + case LIST: + list(child.getName()); + break; + case UNION: + UnionWriter writer = new UnionWriter(container.addOrGet(child.getName(), MinorType.UNION, UnionVector.class)); + fields.put(child.getName().toLowerCase(), writer); + break; +<#list vv.types as type><#list type.minor as minor> +<#assign lowerName = minor.class?uncap_first /> +<#if lowerName == "int" ><#assign lowerName = "integer" /> +<#assign upperName = minor.class?upper_case /> + case ${upperName}: + ${lowerName}(child.getName()); + break; + + } + } } @Override diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java index 689c96fda9202..d439cebeda6ac 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java @@ -21,13 +21,16 @@ import static org.junit.Assert.assertFalse; import static org.junit.Assert.assertTrue; +import org.apache.arrow.flatbuf.Type; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.DirtyRootAllocator; import org.apache.arrow.vector.complex.AbstractMapVector; import org.apache.arrow.vector.complex.MapVector; import org.apache.arrow.vector.complex.NullableMapVector; import org.apache.arrow.vector.complex.UnionVector; +import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.Field; import org.junit.After; import org.junit.Before; import org.junit.Test; @@ -50,7 +53,7 @@ public void terminate() throws Exception { @Test public void testPromoteToUnion() throws Exception { - try (final AbstractMapVector container = new MapVector(EMPTY_SCHEMA_PATH, allocator, null); + try (final MapVector container = new MapVector(EMPTY_SCHEMA_PATH, allocator, null); final NullableMapVector v = container.addOrGet("test", MinorType.MAP, NullableMapVector.class); final PromotableWriter writer = new PromotableWriter(v, container)) { @@ -92,6 +95,22 @@ public void testPromoteToUnion() throws Exception { assertFalse("4 shouldn't be null", accessor.isNull(4)); assertEquals(100, accessor.getObject(4)); + + container.clear(); + container.allocateNew(); + + ComplexWriterImpl 
newWriter = new ComplexWriterImpl(EMPTY_SCHEMA_PATH, container); + + MapWriter newMapWriter = newWriter.rootAsMap(); + + newMapWriter.start(); + + newMapWriter.setPosition(2); + newMapWriter.integer("A").writeInt(10); + + Field childField = container.getField().getChildren().get(0).getChildren().get(0); + assertEquals("Child field should be union type: " + childField.getName(), Type.Union, childField.getType().getTypeType()); + } } } From eb1491a96d1fb92bf9c8bfc1acb7a8768af53a7e Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Fri, 7 Oct 2016 17:09:00 -0700 Subject: [PATCH 0162/1644] ARROW-325: make TestArrowFile not dependent on timezone Author: Julien Le Dem Closes #162 from julienledem/tz and squashes the following commits: 74b5ee8 [Julien Le Dem] ARROW-325: make TestArrowFile not dependent on timezone --- .../org/apache/arrow/vector/file/TestArrowFile.java | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java index ad301689cd1e2..7a5e7b58db98c 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java @@ -40,10 +40,12 @@ import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; import org.apache.arrow.vector.complex.writer.BigIntWriter; import org.apache.arrow.vector.complex.writer.IntWriter; +import org.apache.arrow.vector.holders.NullableTimeStampHolder; import org.apache.arrow.vector.schema.ArrowBuffer; import org.apache.arrow.vector.schema.ArrowRecordBatch; import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.Schema; +import org.joda.time.DateTimeZone; import org.junit.After; import org.junit.Assert; import org.junit.Before; @@ -58,14 +60,18 @@ public class TestArrowFile { private static final int COUNT = 10; private BufferAllocator allocator; + private DateTimeZone defaultTimezone = DateTimeZone.getDefault(); + @Before public void init() { + DateTimeZone.setDefault(DateTimeZone.forOffsetHours(2)); allocator = new RootAllocator(Integer.MAX_VALUE); } @After public void tearDown() { allocator.close(); + DateTimeZone.setDefault(defaultTimezone); } @Test @@ -258,7 +264,9 @@ private void validateComplexContent(int count, NullableMapVector parent) { Assert.assertEquals(i, rootReader.reader("int").readInteger().intValue()); Assert.assertEquals(i, rootReader.reader("bigInt").readLong().longValue()); Assert.assertEquals(i % 3, rootReader.reader("list").size()); - Assert.assertEquals(i, rootReader.reader("map").reader("timestamp").readDateTime().getMillis() % COUNT); + NullableTimeStampHolder h = new NullableTimeStampHolder(); + rootReader.reader("map").reader("timestamp").read(h); + Assert.assertEquals(i, h.value % COUNT); } } From e7080ef9f1bd91505996edd4e4b7643cc54f6b5f Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Fri, 7 Oct 2016 17:14:58 -0700 Subject: [PATCH 0163/1644] [maven-release-plugin] prepare release apache-arrow-0.1.0 --- java/format/pom.xml | 5 ++--- java/memory/pom.xml | 5 ++--- java/pom.xml | 7 +++---- java/vector/pom.xml | 5 ++--- 4 files changed, 9 insertions(+), 13 deletions(-) diff --git a/java/format/pom.xml b/java/format/pom.xml index 78300047862f4..c81cfed04d9fa 100644 --- a/java/format/pom.xml +++ b/java/format/pom.xml @@ -9,14 +9,13 @@ License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either 
express or implied. See the License for the specific language governing permissions and limitations under the License. -->
-<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
-  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
   <modelVersion>4.0.0</modelVersion>

   <parent>
     <artifactId>arrow-java-root</artifactId>
     <groupId>org.apache.arrow</groupId>
-    <version>0.1-SNAPSHOT</version>
+    <version>0.1.0</version>
   </parent>

   <artifactId>arrow-format</artifactId>

diff --git a/java/memory/pom.xml b/java/memory/pom.xml
index b91b5981559c3..8af2313079159 100644
--- a/java/memory/pom.xml
+++ b/java/memory/pom.xml
@@ -9,13 +9,12 @@ License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->
-<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
-  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
   <modelVersion>4.0.0</modelVersion>
   <parent>
     <groupId>org.apache.arrow</groupId>
     <artifactId>arrow-java-root</artifactId>
-    <version>0.1-SNAPSHOT</version>
+    <version>0.1.0</version>
   </parent>
   <artifactId>arrow-memory</artifactId>
   <name>Arrow Memory</name>

diff --git a/java/pom.xml b/java/pom.xml
index fc2c18d0e517d..8ca8eac76a752 100644
--- a/java/pom.xml
+++ b/java/pom.xml
@@ -9,8 +9,7 @@ License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->
-<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
-  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
   <modelVersion>4.0.0</modelVersion>

   <groupId>org.apache.arrow</groupId>
   <artifactId>arrow-java-root</artifactId>
-  <version>0.1-SNAPSHOT</version>
+  <version>0.1.0</version>
   <packaging>pom</packaging>
   <name>Apache Arrow Java Root POM</name>
@@ -42,7 +41,7 @@
     <connection>scm:git:https://git-wip-us.apache.org/repos/asf/arrow.git</connection>
     <developerConnection>scm:git:https://git-wip-us.apache.org/repos/asf/arrow.git</developerConnection>
     <url>https://github.com/apache/arrow</url>
-    <tag>HEAD</tag>
+    <tag>apache-arrow-0.1.0</tag>
   </scm>

diff --git a/java/vector/pom.xml b/java/vector/pom.xml
index 08f9bc8da4e2c..ae48d22a6f4e3 100644
--- a/java/vector/pom.xml
+++ b/java/vector/pom.xml
@@ -9,13 +9,12 @@ License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. -->
-<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
-  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
   <modelVersion>4.0.0</modelVersion>
   <parent>
     <groupId>org.apache.arrow</groupId>
     <artifactId>arrow-java-root</artifactId>
-    <version>0.1-SNAPSHOT</version>
+    <version>0.1.0</version>
   </parent>
   <artifactId>arrow-vector</artifactId>
   <name>Arrow Vectors</name>

From 17cd7a6466741d22053d132ea306ad6f05351419 Mon Sep 17 00:00:00 2001
From: Julien Le Dem
Date: Fri, 7 Oct 2016 17:15:08 -0700
Subject: [PATCH 0164/1644] [maven-release-plugin] prepare for next development iteration

---
 java/format/pom.xml | 2 +-
 java/memory/pom.xml | 2 +-
 java/pom.xml        | 4 ++--
 java/vector/pom.xml | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/java/format/pom.xml b/java/format/pom.xml
index c81cfed04d9fa..eb045d655e982 100644
--- a/java/format/pom.xml
+++ b/java/format/pom.xml
@@ -15,7 +15,7 @@
     <artifactId>arrow-java-root</artifactId>
     <groupId>org.apache.arrow</groupId>
-    <version>0.1.0</version>
+    <version>0.1.1-SNAPSHOT</version>
   </parent>

   <artifactId>arrow-format</artifactId>

diff --git a/java/memory/pom.xml b/java/memory/pom.xml
index 8af2313079159..6ed14480860f2 100644
--- a/java/memory/pom.xml
+++ b/java/memory/pom.xml
@@ -14,7 +14,7 @@
     <groupId>org.apache.arrow</groupId>
     <artifactId>arrow-java-root</artifactId>
-    <version>0.1.0</version>
+    <version>0.1.1-SNAPSHOT</version>
   </parent>
   <artifactId>arrow-memory</artifactId>
   <name>Arrow Memory</name>

diff --git a/java/pom.xml b/java/pom.xml
index 8ca8eac76a752..0147de7035794 100644
--- a/java/pom.xml
+++ b/java/pom.xml
@@ -20,7 +20,7 @@
   <groupId>org.apache.arrow</groupId>
   <artifactId>arrow-java-root</artifactId>
-  <version>0.1.0</version>
+  <version>0.1.1-SNAPSHOT</version>
   <packaging>pom</packaging>
   <name>Apache Arrow Java Root POM</name>
@@ -41,7 +41,7 @@
     <connection>scm:git:https://git-wip-us.apache.org/repos/asf/arrow.git</connection>
     <developerConnection>scm:git:https://git-wip-us.apache.org/repos/asf/arrow.git</developerConnection>
     <url>https://github.com/apache/arrow</url>
-    <tag>apache-arrow-0.1.0</tag>
+    <tag>HEAD</tag>
   </scm>

diff --git a/java/vector/pom.xml b/java/vector/pom.xml
index ae48d22a6f4e3..1d06bdece01f8 100644
--- a/java/vector/pom.xml
+++ b/java/vector/pom.xml
@@ -14,7 +14,7 @@
     <groupId>org.apache.arrow</groupId>
     <artifactId>arrow-java-root</artifactId>
-    <version>0.1.0</version>
+    <version>0.1.1-SNAPSHOT</version>
   </parent>
   <artifactId>arrow-vector</artifactId>
   <name>Arrow Vectors</name>

From a9747ceac2b6399c6acf027de8074d8661d5eb1d Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Mon, 10 Oct 2016 11:21:49 -0400
Subject: [PATCH 0165/1644] ARROW-312: Read and write Arrow IPC
file format from Python This also adds some IO scaffolding for interacting with `arrow::Buffer` objects from Python and assorted additions to help with testing. Author: Wes McKinney Closes #164 from wesm/ARROW-312 and squashes the following commits: 7df3e5f [Wes McKinney] Set BUILD_WITH_INSTALL_RPATH on arrow_ipc be8cee0 [Wes McKinney] Link Cython modules to libarrow* libraries 5716601 [Wes McKinney] Fix accidental deletion 77fb03b [Wes McKinney] Add / test Buffer wrapper. Test that we can write an arrow file to a wrapped buffer. Resize buffer in BufferOutputStream on close 316537d [Wes McKinney] Get ready to wrap Arrow buffers in a Python object 4822d32 [Wes McKinney] Implement RecordBatch::Equals, compare in Python ipc file writes a931e49 [Wes McKinney] Permit buffers (write padding) in a non-multiple of 64 in an IPC context, to allow zero-copy writing of NumPy arrays 2c49cd4 [Wes McKinney] Some debugging ca1562b [Wes McKinney] Draft implementations of Arrow file read/write from Python --- cpp/src/arrow/io/io-memory-test.cc | 25 ++ cpp/src/arrow/io/memory.cc | 13 +- cpp/src/arrow/ipc/CMakeLists.txt | 7 + cpp/src/arrow/ipc/adapter.cc | 16 +- cpp/src/arrow/ipc/util.h | 6 +- cpp/src/arrow/table-test.cc | 27 ++ cpp/src/arrow/table.cc | 16 ++ cpp/src/arrow/table.h | 2 + cpp/src/arrow/types/primitive-test.cc | 3 +- cpp/src/arrow/util/bit-util.h | 13 + cpp/src/arrow/util/buffer.cc | 16 +- cpp/src/arrow/util/buffer.h | 1 - cpp/src/arrow/util/logging.h | 4 +- python/CMakeLists.txt | 8 +- python/cmake_modules/FindArrow.cmake | 11 + python/pyarrow/__init__.py | 3 +- python/pyarrow/array.pyx | 44 +-- python/pyarrow/includes/common.pxd | 4 - python/pyarrow/includes/libarrow.pxd | 29 +- python/pyarrow/includes/libarrow_io.pxd | 14 +- python/pyarrow/includes/libarrow_ipc.pxd | 52 ++++ python/pyarrow/includes/pyarrow.pxd | 13 +- python/pyarrow/io.pxd | 6 + python/pyarrow/io.pyx | 340 ++++++++++++++--------- python/pyarrow/ipc.pyx | 155 +++++++++++ python/pyarrow/table.pxd | 17 +- python/pyarrow/table.pyx | 194 ++++++++++--- python/pyarrow/tests/test_array.py | 4 + python/pyarrow/tests/test_io.py | 41 +++ python/pyarrow/tests/test_ipc.py | 116 ++++++++ python/pyarrow/tests/test_table.py | 82 +++--- python/setup.py | 1 + python/src/pyarrow/adapters/builtin.cc | 2 +- python/src/pyarrow/adapters/pandas.cc | 8 + python/src/pyarrow/common.cc | 2 +- python/src/pyarrow/common.h | 20 +- python/src/pyarrow/io.cc | 6 +- 37 files changed, 1012 insertions(+), 309 deletions(-) create mode 100644 python/pyarrow/includes/libarrow_ipc.pxd create mode 100644 python/pyarrow/ipc.pyx create mode 100644 python/pyarrow/tests/test_ipc.py diff --git a/cpp/src/arrow/io/io-memory-test.cc b/cpp/src/arrow/io/io-memory-test.cc index 6de35dab59b4f..a49faf3bd8578 100644 --- a/cpp/src/arrow/io/io-memory-test.cc +++ b/cpp/src/arrow/io/io-memory-test.cc @@ -121,5 +121,30 @@ TEST_F(TestMemoryMappedFile, InvalidFile) { IOError, MemoryMappedFile::Open(non_existent_path, FileMode::READ, &result)); } +class TestBufferOutputStream : public ::testing::Test { + public: + void SetUp() { + buffer_.reset(new PoolBuffer(default_memory_pool())); + stream_.reset(new BufferOutputStream(buffer_)); + } + + protected: + std::shared_ptr buffer_; + std::unique_ptr stream_; +}; + +TEST_F(TestBufferOutputStream, CloseResizes) { + std::string data = "data123456"; + + const int64_t nbytes = static_cast(data.size()); + const int K = 100; + for (int i = 0; i < K; ++i) { + EXPECT_OK(stream_->Write(reinterpret_cast(data.c_str()), nbytes)); + } + + 
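// Worked example of the rounding arithmetic above (an editorial illustration
// with assumed inputs, not part of the patch): with round_to = 64,
// force_carry_addend = 63 and truncate_bitmask = ~63,
//
//   (0   + 63) & ~63 == 0
//   (1   + 63) & ~63 == 64
//   (100 + 63) & ~63 == 128
//   (128 + 63) & ~63 == 128
//
// so a size rounds up to the next multiple of 64 and exact multiples are left
// unchanged. This is the per-buffer padding rule applied in adapter.cc above:
//
//   padding = util::RoundUpToMultipleOf64(size) - size;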
ASSERT_OK(stream_->Close()); + ASSERT_EQ(K * nbytes, buffer_->size()); +} + } // namespace io } // namespace arrow diff --git a/cpp/src/arrow/io/memory.cc b/cpp/src/arrow/io/memory.cc index 7d6e02e25b43c..c7d0ae5d56474 100644 --- a/cpp/src/arrow/io/memory.cc +++ b/cpp/src/arrow/io/memory.cc @@ -212,7 +212,11 @@ BufferOutputStream::BufferOutputStream(const std::shared_ptr& b mutable_data_(buffer->mutable_data()) {} Status BufferOutputStream::Close() { - return Status::OK(); + if (position_ < capacity_) { + return buffer_->Resize(position_); + } else { + return Status::OK(); + } } Status BufferOutputStream::Tell(int64_t* position) { @@ -228,8 +232,11 @@ Status BufferOutputStream::Write(const uint8_t* data, int64_t nbytes) { } Status BufferOutputStream::Reserve(int64_t nbytes) { - while (position_ + nbytes > capacity_) { - int64_t new_capacity = std::max(kBufferMinimumSize, capacity_ * 2); + int64_t new_capacity = capacity_; + while (position_ + nbytes > new_capacity) { + new_capacity = std::max(kBufferMinimumSize, new_capacity * 2); + } + if (new_capacity > capacity_) { RETURN_NOT_OK(buffer_->Resize(new_capacity)); capacity_ = new_capacity; } diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index bde8c5bf73888..8dcd9ac107189 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -57,6 +57,13 @@ SET_TARGET_PROPERTIES(arrow_ipc PROPERTIES LINKER_LANGUAGE CXX LINK_FLAGS "${ARROW_IPC_LINK_FLAGS}") +if (APPLE) + set_target_properties(arrow_ipc + PROPERTIES + BUILD_WITH_INSTALL_RPATH ON + INSTALL_NAME_DIR "@rpath") +endif() + ADD_ARROW_TEST(ipc-adapter-test) ARROW_TEST_LINK_LIBRARIES(ipc-adapter-test ${ARROW_IPC_TEST_LINK_LIBS}) diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index 99974a4a4c7b7..cd8ab53a31d1f 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -162,15 +162,14 @@ class RecordBatchWriter { for (size_t i = 0; i < buffers_.size(); ++i) { const Buffer* buffer = buffers_[i].get(); int64_t size = 0; + int64_t padding = 0; // The buffer might be null if we are handling zero row lengths. if (buffer) { - // We use capacity here, because size might not reflect the padding - // requirements of buffers but capacity always should. - size = buffer->capacity(); - // check that padding is appropriate - RETURN_NOT_OK(CheckMultipleOf64(size)); + size = buffer->size(); + padding = util::RoundUpToMultipleOf64(size) - size; } + // TODO(wesm): We currently have no notion of shared memory page id's, // but we've included it in the metadata IDL for when we have it in the // future. Use page=0 for now @@ -179,12 +178,17 @@ class RecordBatchWriter { // are using from any OS-level shared memory. 
The thought is that systems
      // may (in the future) associate integer page id's with physical memory
      // pages (according to whatever is the desired shared memory mechanism)
-      buffer_meta_.push_back(flatbuf::Buffer(0, position, size));
+      buffer_meta_.push_back(flatbuf::Buffer(0, position, size + padding));
 
       if (size > 0) {
         RETURN_NOT_OK(dst->Write(buffer->data(), size));
         position += size;
       }
+
+      if (padding > 0) {
+        RETURN_NOT_OK(dst->Write(kPaddingBytes, padding));
+        position += padding;
+      }
     }
 
     *body_end_offset = position;
diff --git a/cpp/src/arrow/ipc/util.h b/cpp/src/arrow/ipc/util.h
index 94079a3827777..9000d1bb0c6c3 100644
--- a/cpp/src/arrow/ipc/util.h
+++ b/cpp/src/arrow/ipc/util.h
@@ -29,7 +29,11 @@ namespace ipc {
 
 // Align on 8-byte boundaries
 static constexpr int kArrowAlignment = 8;
-static constexpr uint8_t kPaddingBytes[kArrowAlignment] = {0};
+
+// Buffers are padded to 64-byte boundaries (for SIMD)
+static constexpr int kArrowBufferAlignment = 64;
+
+static constexpr uint8_t kPaddingBytes[kArrowBufferAlignment] = {0};
 
 static inline int64_t PaddedLength(int64_t nbytes, int64_t alignment = kArrowAlignment) {
   return ((nbytes + alignment - 1) / alignment) * alignment;
diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc
index 385e7d831500a..743fb669700ea 100644
--- a/cpp/src/arrow/table-test.cc
+++ b/cpp/src/arrow/table-test.cc
@@ -123,4 +123,31 @@ TEST_F(TestTable, InvalidColumns) {
   ASSERT_RAISES(Invalid, table_->ValidateColumns());
 }
 
+class TestRecordBatch : public TestBase {};
+
+TEST_F(TestRecordBatch, Equals) {
+  const int length = 10;
+
+  auto f0 = std::make_shared<Field>("f0", INT32);
+  auto f1 = std::make_shared<Field>("f1", UINT8);
+  auto f2 = std::make_shared<Field>("f2", INT16);
+
+  vector<shared_ptr<Field>> fields = {f0, f1, f2};
+  auto schema = std::make_shared<Schema>(fields);
+
+  auto a0 = MakePrimitive<Int32Array>(length);
+  auto a1 = MakePrimitive<UInt8Array>(length);
+  auto a2 = MakePrimitive<Int16Array>(length);
+
+  RecordBatch b1(schema, length, {a0, a1, a2});
+  RecordBatch b2(schema, 5, {a0, a1, a2});
+  RecordBatch b3(schema, length, {a0, a1});
+  RecordBatch b4(schema, length, {a0, a1, a1});
+
+  ASSERT_TRUE(b1.Equals(b1));
+  ASSERT_FALSE(b1.Equals(b2));
+  ASSERT_FALSE(b1.Equals(b3));
+  ASSERT_FALSE(b1.Equals(b4));
+}
+
 } // namespace arrow
diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc
index 3a250df81d0fb..af84f27eab557 100644
--- a/cpp/src/arrow/table.cc
+++ b/cpp/src/arrow/table.cc
@@ -21,6 +21,7 @@
 #include
 #include
 
+#include "arrow/array.h"
 #include "arrow/column.h"
 #include "arrow/schema.h"
 #include "arrow/util/status.h"
@@ -35,6 +36,21 @@ const std::string& RecordBatch::column_name(int i) const {
   return schema_->field(i)->name;
 }
 
+bool RecordBatch::Equals(const RecordBatch& other) const {
+  if (num_columns() != other.num_columns() || num_rows_ != other.num_rows()) {
+    return false;
+  }
+
+  for (int i = 0; i < num_columns(); ++i) {
+    if (!column(i)->Equals(other.column(i))) { return false; }
+  }
+
+  return true;
+}
+
+// ----------------------------------------------------------------------
+// Table methods
+
 Table::Table(const std::string& name, const std::shared_ptr<Schema>& schema,
     const std::vector<std::shared_ptr<Column>>& columns)
     : name_(name), schema_(schema), columns_(columns) {
diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h
index 36b3c8ecaf43f..1a856c8a436d5 100644
--- a/cpp/src/arrow/table.h
+++ b/cpp/src/arrow/table.h
@@ -43,6 +43,8 @@ class ARROW_EXPORT RecordBatch {
   RecordBatch(const std::shared_ptr<Schema>& schema, int32_t num_rows,
       const std::vector<std::shared_ptr<Array>>& columns);
 
+  bool Equals(const RecordBatch& other) const;
+
   // 
@returns: the table's schema
   const std::shared_ptr<Schema>& schema() const { return schema_; }
 
diff --git a/cpp/src/arrow/types/primitive-test.cc b/cpp/src/arrow/types/primitive-test.cc
index ffebb9269bdc3..87eb0fe3a8bf7 100644
--- a/cpp/src/arrow/types/primitive-test.cc
+++ b/cpp/src/arrow/types/primitive-test.cc
@@ -238,8 +238,7 @@ void TestPrimitiveBuilder::Check(
 }
 
 typedef ::testing::Types<PBoolean, PUInt8, PUInt16, PUInt32, PUInt64, PInt8, PInt16,
-    PInt32, PInt64, PFloat, PDouble>
-    Primitives;
+    PInt32, PInt64, PFloat, PDouble> Primitives;
 
 TYPED_TEST_CASE(TestPrimitiveBuilder, Primitives);
 
diff --git a/cpp/src/arrow/util/bit-util.h b/cpp/src/arrow/util/bit-util.h
index 873a1959865f5..3087ce7784d11 100644
--- a/cpp/src/arrow/util/bit-util.h
+++ b/cpp/src/arrow/util/bit-util.h
@@ -19,6 +19,7 @@
 #define ARROW_UTIL_BIT_UTIL_H
 
 #include <cstdint>
+#include <limits>
 #include <memory>
 #include <vector>
 
@@ -77,6 +78,18 @@ static inline bool is_multiple_of_64(int64_t n) {
   return (n & 63) == 0;
 }
 
+inline int64_t RoundUpToMultipleOf64(int64_t num) {
+  // TODO(wesm): is this definitely needed?
+  // DCHECK_GE(num, 0);
+  constexpr int64_t round_to = 64;
+  constexpr int64_t force_carry_addend = round_to - 1;
+  constexpr int64_t truncate_bitmask = ~(round_to - 1);
+  constexpr int64_t max_roundable_num = std::numeric_limits<int64_t>::max() - round_to;
+  if (num <= max_roundable_num) { return (num + force_carry_addend) & truncate_bitmask; }
+  // handle overflow case. This should result in a malloc error upstream
+  return num;
+}
+
 void bytes_to_bits(const std::vector<uint8_t>& bytes, uint8_t* bits);
 ARROW_EXPORT Status bytes_to_bits(const std::vector<uint8_t>&, std::shared_ptr<Buffer>*);
diff --git a/cpp/src/arrow/util/buffer.cc b/cpp/src/arrow/util/buffer.cc
index 703ef8384ac07..6faa048e4e52b 100644
--- a/cpp/src/arrow/util/buffer.cc
+++ b/cpp/src/arrow/util/buffer.cc
@@ -20,25 +20,13 @@
 #include
 #include
 
+#include "arrow/util/bit-util.h"
 #include "arrow/util/logging.h"
 #include "arrow/util/memory-pool.h"
 #include "arrow/util/status.h"
 
 namespace arrow {
 
-namespace {
-int64_t RoundUpToMultipleOf64(int64_t num) {
-  DCHECK_GE(num, 0);
-  constexpr int64_t round_to = 64;
-  constexpr int64_t force_carry_addend = round_to - 1;
-  constexpr int64_t truncate_bitmask = ~(round_to - 1);
-  constexpr int64_t max_roundable_num = std::numeric_limits<int64_t>::max() - round_to;
-  if (num <= max_roundable_num) { return (num + force_carry_addend) & truncate_bitmask; }
-  // handle overflow case. 
This should result in a malloc error upstream - return num; -} -} // namespace - Buffer::Buffer(const std::shared_ptr& parent, int64_t offset, int64_t size) { data_ = parent->data() + offset; size_ = size; @@ -64,7 +52,7 @@ PoolBuffer::~PoolBuffer() { Status PoolBuffer::Reserve(int64_t new_capacity) { if (!mutable_data_ || new_capacity > capacity_) { uint8_t* new_data; - new_capacity = RoundUpToMultipleOf64(new_capacity); + new_capacity = util::RoundUpToMultipleOf64(new_capacity); if (mutable_data_) { RETURN_NOT_OK(pool_->Allocate(new_capacity, &new_data)); memcpy(new_data, mutable_data_, size_); diff --git a/cpp/src/arrow/util/buffer.h b/cpp/src/arrow/util/buffer.h index 1aeebc69b4e14..01e4259c31fd2 100644 --- a/cpp/src/arrow/util/buffer.h +++ b/cpp/src/arrow/util/buffer.h @@ -23,7 +23,6 @@ #include #include -#include "arrow/util/bit-util.h" #include "arrow/util/macros.h" #include "arrow/util/status.h" #include "arrow/util/visibility.h" diff --git a/cpp/src/arrow/util/logging.h b/cpp/src/arrow/util/logging.h index b22f07dd6345f..06ee8411e283c 100644 --- a/cpp/src/arrow/util/logging.h +++ b/cpp/src/arrow/util/logging.h @@ -118,9 +118,9 @@ class CerrLog { class FatalLog : public CerrLog { public: explicit FatalLog(int /* severity */) // NOLINT - : CerrLog(ARROW_FATAL){} // NOLINT + : CerrLog(ARROW_FATAL) {} // NOLINT - [[noreturn]] ~FatalLog() { + [[noreturn]] ~FatalLog() { if (has_logged_) { std::cerr << std::endl; } std::exit(1); } diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 77a771ab21c06..55f6d0543a108 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -352,6 +352,8 @@ ADD_THIRDPARTY_LIB(arrow SHARED_LIB ${ARROW_SHARED_LIB}) ADD_THIRDPARTY_LIB(arrow_io SHARED_LIB ${ARROW_IO_SHARED_LIB}) +ADD_THIRDPARTY_LIB(arrow_ipc + SHARED_LIB ${ARROW_IPC_SHARED_LIB}) ############################################################ # Linker setup @@ -415,6 +417,8 @@ if (UNIX) set(CMAKE_BUILD_WITH_INSTALL_RPATH TRUE) endif() +SET(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE) + add_subdirectory(src/pyarrow) add_subdirectory(src/pyarrow/util) @@ -423,6 +427,7 @@ set(CYTHON_EXTENSIONS config error io + ipc scalar schema table @@ -442,6 +447,7 @@ set(PYARROW_SRCS set(LINK_LIBS arrow arrow_io + arrow_ipc ) if(PARQUET_FOUND AND PARQUET_ARROW_FOUND) @@ -455,8 +461,6 @@ if(PARQUET_FOUND AND PARQUET_ARROW_FOUND) parquet) endif() -SET(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE) - add_library(pyarrow SHARED ${PYARROW_SRCS}) target_link_libraries(pyarrow ${LINK_LIBS}) diff --git a/python/cmake_modules/FindArrow.cmake b/python/cmake_modules/FindArrow.cmake index 9919746520b4c..3c359aac55309 100644 --- a/python/cmake_modules/FindArrow.cmake +++ b/python/cmake_modules/FindArrow.cmake @@ -47,10 +47,16 @@ find_library(ARROW_IO_LIB_PATH NAMES arrow_io ${ARROW_SEARCH_LIB_PATH} NO_DEFAULT_PATH) +find_library(ARROW_IPC_LIB_PATH NAMES arrow_ipc + PATHS + ${ARROW_SEARCH_LIB_PATH} + NO_DEFAULT_PATH) + if (ARROW_INCLUDE_DIR AND ARROW_LIB_PATH) set(ARROW_FOUND TRUE) set(ARROW_LIB_NAME libarrow) set(ARROW_IO_LIB_NAME libarrow_io) + set(ARROW_IPC_LIB_NAME libarrow_ipc) set(ARROW_LIBS ${ARROW_SEARCH_LIB_PATH}) set(ARROW_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_LIB_NAME}.a) @@ -58,9 +64,14 @@ if (ARROW_INCLUDE_DIR AND ARROW_LIB_PATH) set(ARROW_IO_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_IO_LIB_NAME}.a) set(ARROW_IO_SHARED_LIB ${ARROW_LIBS}/${ARROW_IO_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) + + set(ARROW_IPC_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_IPC_LIB_NAME}.a) + set(ARROW_IPC_SHARED_LIB 
${ARROW_LIBS}/${ARROW_IPC_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) + if (NOT Arrow_FIND_QUIETLY) message(STATUS "Found the Arrow core library: ${ARROW_LIB_PATH}") message(STATUS "Found the Arrow IO library: ${ARROW_IO_LIB_PATH}") + message(STATUS "Found the Arrow IPC library: ${ARROW_IPC_LIB_PATH}") endif () else () if (NOT Arrow_FIND_QUIETLY) diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 7561f6d46df21..8b131aaa8f4af 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -41,5 +41,4 @@ list_, struct, field, DataType, Field, Schema, schema) -from pyarrow.array import RowBatch -from pyarrow.table import Column, Table, from_pandas_dataframe +from pyarrow.table import Column, RecordBatch, Table, from_pandas_dataframe diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index cdbe73ad21f7d..84ab4a48c9b65 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -37,7 +37,7 @@ import pyarrow.schema as schema def total_allocated_bytes(): - cdef MemoryPool* pool = pyarrow.GetMemoryPool() + cdef MemoryPool* pool = pyarrow.get_memory_pool() return pool.bytes_allocated() @@ -243,12 +243,14 @@ def from_pandas_series(object series, object mask=None, timestamps_to_ms=False): series_values = series_values.astype('datetime64[ms]') if mask is None: - check_status(pyarrow.PandasToArrow(pyarrow.GetMemoryPool(), - series_values, &out)) + with nogil: + check_status(pyarrow.PandasToArrow(pyarrow.get_memory_pool(), + series_values, &out)) else: mask = series_as_ndarray(mask) - check_status(pyarrow.PandasMaskedToArrow( - pyarrow.GetMemoryPool(), series_values, mask, &out)) + with nogil: + check_status(pyarrow.PandasMaskedToArrow( + pyarrow.get_memory_pool(), series_values, mask, &out)) return box_arrow_array(out) @@ -262,35 +264,3 @@ cdef object series_as_ndarray(object obj): result = obj return result - -#---------------------------------------------------------------------- -# Table-like data structures - -cdef class RowBatch: - """ - - """ - cdef readonly: - Schema schema - int num_rows - list arrays - - def __cinit__(self, Schema schema, int num_rows, list arrays): - self.schema = schema - self.num_rows = num_rows - self.arrays = arrays - - if len(self.schema) != len(arrays): - raise ValueError('Mismatch number of data arrays and ' - 'schema fields') - - def __len__(self): - return self.num_rows - - property num_columns: - - def __get__(self): - return len(self.arrays) - - def __getitem__(self, i): - return self.arrays[i] diff --git a/python/pyarrow/includes/common.pxd b/python/pyarrow/includes/common.pxd index 133797bc37b5c..05c0123ee7b7e 100644 --- a/python/pyarrow/includes/common.pxd +++ b/python/pyarrow/includes/common.pxd @@ -47,7 +47,3 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: c_bool IsKeyError() c_bool IsNotImplemented() c_bool IsInvalid() - - cdef cppclass Buffer: - uint8_t* data() - int64_t size() diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 854d07d691dbf..3ae1789170303 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -54,6 +54,18 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass MemoryPool" arrow::MemoryPool": int64_t bytes_allocated() + cdef cppclass CBuffer" arrow::Buffer": + uint8_t* data() + int64_t size() + + cdef cppclass ResizableBuffer(CBuffer): + CStatus Resize(int64_t nbytes) + CStatus Reserve(int64_t nbytes) + + cdef cppclass PoolBuffer(ResizableBuffer): + PoolBuffer() + 
PoolBuffer(MemoryPool*) + cdef MemoryPool* default_memory_pool() cdef cppclass CListType" arrow::ListType"(CDataType): @@ -149,6 +161,21 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: const shared_ptr[CDataType]& type() const shared_ptr[CChunkedArray]& data() + cdef cppclass CRecordBatch" arrow::RecordBatch": + CRecordBatch(const shared_ptr[CSchema]& schema, int32_t num_rows, + const vector[shared_ptr[CArray]]& columns) + + c_bool Equals(const CRecordBatch& other) + + const shared_ptr[CSchema]& schema() + const shared_ptr[CArray]& column(int i) + const c_string& column_name(int i) + + const vector[shared_ptr[CArray]]& columns() + + int num_columns() + int32_t num_rows() + cdef cppclass CTable" arrow::Table": CTable(const c_string& name, const shared_ptr[CSchema]& schema, const vector[shared_ptr[CColumn]]& columns) @@ -186,7 +213,7 @@ cdef extern from "arrow/ipc/metadata.h" namespace "arrow::ipc" nogil: MessageType_DICTIONARY_BATCH" arrow::ipc::Message::DICTIONARY_BATCH" cdef cppclass Message: - CStatus Open(const shared_ptr[Buffer]& buf, + CStatus Open(const shared_ptr[CBuffer]& buf, shared_ptr[Message]* out) int64_t body_length() MessageType type() diff --git a/python/pyarrow/includes/libarrow_io.pxd b/python/pyarrow/includes/libarrow_io.pxd index 56d8d4cf61494..8074915508fbe 100644 --- a/python/pyarrow/includes/libarrow_io.pxd +++ b/python/pyarrow/includes/libarrow_io.pxd @@ -18,7 +18,7 @@ # distutils: language = c++ from pyarrow.includes.common cimport * -from pyarrow.includes.libarrow cimport MemoryPool +from pyarrow.includes.libarrow cimport * cdef extern from "arrow/io/interfaces.h" namespace "arrow::io" nogil: enum FileMode" arrow::io::FileMode::type": @@ -36,7 +36,7 @@ cdef extern from "arrow/io/interfaces.h" namespace "arrow::io" nogil: FileMode mode() cdef cppclass Readable: - CStatus ReadB" Read"(int64_t nbytes, shared_ptr[Buffer]* out) + CStatus ReadB" Read"(int64_t nbytes, shared_ptr[CBuffer]* out) CStatus Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) cdef cppclass Seekable: @@ -57,7 +57,7 @@ cdef extern from "arrow/io/interfaces.h" namespace "arrow::io" nogil: CStatus ReadAt(int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) CStatus ReadAt(int64_t position, int64_t nbytes, - int64_t* bytes_read, shared_ptr[Buffer]* out) + int64_t* bytes_read, shared_ptr[CBuffer]* out) cdef cppclass WriteableFileInterface(OutputStream, Seekable): CStatus WriteAt(int64_t position, const uint8_t* data, @@ -143,9 +143,9 @@ cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: cdef extern from "arrow/io/memory.h" namespace "arrow::io" nogil: - cdef cppclass BufferReader(ReadableFileInterface): - BufferReader(const uint8_t* data, int64_t nbytes) + cdef cppclass CBufferReader" arrow::io::BufferReader"\ + (ReadableFileInterface): + CBufferReader(const uint8_t* data, int64_t nbytes) cdef cppclass BufferOutputStream(OutputStream): - # TODO(wesm) - pass + BufferOutputStream(const shared_ptr[ResizableBuffer]& buffer) diff --git a/python/pyarrow/includes/libarrow_ipc.pxd b/python/pyarrow/includes/libarrow_ipc.pxd new file mode 100644 index 0000000000000..eda5b9bae9e31 --- /dev/null +++ b/python/pyarrow/includes/libarrow_ipc.pxd @@ -0,0 +1,52 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
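The CBuffer/ResizableBuffer/PoolBuffer declarations above are what expose Arrow's pooled allocations to Cython, and they make memory accounting observable from Python. As a rough sketch only (not part of the patch; it assumes a pyarrow built from this series and an otherwise idle default pool):

    import pyarrow

    # test_array.py asserts that the shared pool starts out empty
    assert pyarrow.total_allocated_bytes() == 0

    arr = pyarrow.from_pylist([1, 2, 3, 4, 5])

    # Array construction draws on the pool returned by get_memory_pool(),
    # so the counter should now be positive while `arr` is alive.
    assert pyarrow.total_allocated_bytes() > 0
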
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# distutils: language = c++ + +from pyarrow.includes.common cimport * +from pyarrow.includes.libarrow cimport (MemoryPool, CArray, CSchema, + CRecordBatch) +from pyarrow.includes.libarrow_io cimport (OutputStream, ReadableFileInterface) + +cdef extern from "arrow/ipc/file.h" namespace "arrow::ipc" nogil: + + cdef cppclass CFileWriter " arrow::ipc::FileWriter": + @staticmethod + CStatus Open(OutputStream* sink, const shared_ptr[CSchema]& schema, + shared_ptr[CFileWriter]* out) + + CStatus WriteRecordBatch(const vector[shared_ptr[CArray]]& columns, + int32_t num_rows) + + CStatus Close() + + cdef cppclass CFileReader " arrow::ipc::FileReader": + + @staticmethod + CStatus Open(const shared_ptr[ReadableFileInterface]& file, + shared_ptr[CFileReader]* out) + + @staticmethod + CStatus Open2" Open"(const shared_ptr[ReadableFileInterface]& file, + int64_t footer_offset, shared_ptr[CFileReader]* out) + + const shared_ptr[CSchema]& schema() + + int num_dictionaries() + int num_record_batches() + + CStatus GetRecordBatch(int i, shared_ptr[CRecordBatch]* batch) diff --git a/python/pyarrow/includes/pyarrow.pxd b/python/pyarrow/includes/pyarrow.pxd index 4c971665ff6aa..2fa5a7d63256a 100644 --- a/python/pyarrow/includes/pyarrow.pxd +++ b/python/pyarrow/includes/pyarrow.pxd @@ -18,8 +18,8 @@ # distutils: language = c++ from pyarrow.includes.common cimport * -from pyarrow.includes.libarrow cimport (CArray, CColumn, CDataType, CStatus, - Type, MemoryPool) +from pyarrow.includes.libarrow cimport (CArray, CBuffer, CColumn, + CDataType, CStatus, Type, MemoryPool) cimport pyarrow.includes.libarrow_io as arrow_io @@ -53,7 +53,12 @@ cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil: PyStatus ArrowToPandas(const shared_ptr[CColumn]& arr, object py_ref, PyObject** out) - MemoryPool* GetMemoryPool() + MemoryPool* get_memory_pool() + + +cdef extern from "pyarrow/common.h" namespace "pyarrow" nogil: + cdef cppclass PyBytesBuffer(CBuffer): + PyBytesBuffer(object o) cdef extern from "pyarrow/io.h" namespace "pyarrow" nogil: @@ -63,5 +68,5 @@ cdef extern from "pyarrow/io.h" namespace "pyarrow" nogil: cdef cppclass PyOutputStream(arrow_io.OutputStream): PyOutputStream(object fo) - cdef cppclass PyBytesReader(arrow_io.BufferReader): + cdef cppclass PyBytesReader(arrow_io.CBufferReader): PyBytesReader(object fo) diff --git a/python/pyarrow/io.pxd b/python/pyarrow/io.pxd index 1dbb3fd76bbfd..d6966cdaadd3a 100644 --- a/python/pyarrow/io.pxd +++ b/python/pyarrow/io.pxd @@ -22,6 +22,11 @@ from pyarrow.includes.libarrow cimport * from pyarrow.includes.libarrow_io cimport (ReadableFileInterface, OutputStream) +cdef class Buffer: + cdef: + shared_ptr[CBuffer] buffer + + cdef init(self, const shared_ptr[CBuffer]& buffer) cdef class NativeFile: cdef: @@ -29,6 +34,7 @@ cdef class NativeFile: shared_ptr[OutputStream] wr_file bint is_readonly bint is_open + bint own_file # By implementing these "virtual" functions (all functions in Cython # extension 
classes are technically virtual in the C++ sense) we can expose diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index e6e2b625e87ca..00a492fc0baf2 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -36,6 +36,217 @@ import re import sys import threading + +cdef class NativeFile: + + def __cinit__(self): + self.is_open = False + self.own_file = False + + def __dealloc__(self): + if self.is_open and self.own_file: + self.close() + + def __enter__(self): + return self + + def __exit__(self, exc_type, exc_value, tb): + self.close() + + def close(self): + if self.is_open: + with nogil: + if self.is_readonly: + check_cstatus(self.rd_file.get().Close()) + else: + check_cstatus(self.wr_file.get().Close()) + self.is_open = False + + cdef read_handle(self, shared_ptr[ReadableFileInterface]* file): + self._assert_readable() + file[0] = self.rd_file + + cdef write_handle(self, shared_ptr[OutputStream]* file): + self._assert_writeable() + file[0] = self.wr_file + + def _assert_readable(self): + if not self.is_readonly: + raise IOError("only valid on readonly files") + + if not self.is_open: + raise IOError("file not open") + + def _assert_writeable(self): + if self.is_readonly: + raise IOError("only valid on writeonly files") + + if not self.is_open: + raise IOError("file not open") + + def size(self): + cdef int64_t size + self._assert_readable() + with nogil: + check_cstatus(self.rd_file.get().GetSize(&size)) + return size + + def tell(self): + cdef int64_t position + with nogil: + if self.is_readonly: + check_cstatus(self.rd_file.get().Tell(&position)) + else: + check_cstatus(self.wr_file.get().Tell(&position)) + return position + + def seek(self, int64_t position): + self._assert_readable() + with nogil: + check_cstatus(self.rd_file.get().Seek(position)) + + def write(self, data): + """ + Write bytes-like (unicode, encoded to UTF-8) to file + """ + self._assert_writeable() + + data = tobytes(data) + + cdef const uint8_t* buf = cp.PyBytes_AS_STRING(data) + cdef int64_t bufsize = len(data) + with nogil: + check_cstatus(self.wr_file.get().Write(buf, bufsize)) + + def read(self, int nbytes): + cdef: + int64_t bytes_read = 0 + uint8_t* buf + shared_ptr[CBuffer] out + + self._assert_readable() + + with nogil: + check_cstatus(self.rd_file.get() + .ReadB(nbytes, &out)) + + result = cp.PyBytes_FromStringAndSize( + out.get().data(), out.get().size()) + + return result + + +# ---------------------------------------------------------------------- +# Python file-like objects + +cdef class PythonFileInterface(NativeFile): + cdef: + object handle + + def __cinit__(self, handle, mode='w'): + self.handle = handle + + if mode.startswith('w'): + self.wr_file.reset(new pyarrow.PyOutputStream(handle)) + self.is_readonly = 0 + elif mode.startswith('r'): + self.rd_file.reset(new pyarrow.PyReadableFile(handle)) + self.is_readonly = 1 + else: + raise ValueError('Invalid file mode: {0}'.format(mode)) + + self.is_open = True + + +cdef class BytesReader(NativeFile): + cdef: + object obj + + def __cinit__(self, obj): + if not isinstance(obj, bytes): + raise ValueError('Must pass bytes object') + + self.obj = obj + self.is_readonly = 1 + self.is_open = True + + self.rd_file.reset(new pyarrow.PyBytesReader(obj)) + +# ---------------------------------------------------------------------- +# Arrow buffers + + +cdef class Buffer: + + def __cinit__(self): + pass + + cdef init(self, const shared_ptr[CBuffer]& buffer): + self.buffer = buffer + + def __len__(self): + return self.size + + property size: + + def 
__get__(self): + return self.buffer.get().size() + + def __getitem__(self, key): + # TODO(wesm): buffer slicing + raise NotImplementedError + + def to_pybytes(self): + return cp.PyBytes_FromStringAndSize( + self.buffer.get().data(), + self.buffer.get().size()) + + +cdef shared_ptr[PoolBuffer] allocate_buffer(): + cdef shared_ptr[PoolBuffer] result + result.reset(new PoolBuffer(pyarrow.get_memory_pool())) + return result + + +cdef class InMemoryOutputStream(NativeFile): + + cdef: + shared_ptr[PoolBuffer] buffer + + def __cinit__(self): + self.buffer = allocate_buffer() + self.wr_file.reset(new BufferOutputStream( + self.buffer)) + self.is_readonly = 0 + self.is_open = True + + def get_result(self): + cdef Buffer result = Buffer() + + check_cstatus(self.wr_file.get().Close()) + result.init( self.buffer) + + self.is_open = False + return result + + +def buffer_from_bytes(object obj): + """ + Construct an Arrow buffer from a Python bytes object + """ + if not isinstance(obj, bytes): + raise ValueError('Must pass bytes object') + + cdef shared_ptr[CBuffer] buf + buf.reset(new pyarrow.PyBytesBuffer(obj)) + + cdef Buffer result = Buffer() + result.init(buf) + return result + +# ---------------------------------------------------------------------- +# HDFS IO implementation + _HDFS_PATH_RE = re.compile('hdfs://(.*):(\d+)(.*)') try: @@ -274,6 +485,7 @@ cdef class HdfsClient: out.buffer_size = c_buffer_size out.parent = self out.is_open = True + out.own_file = True return out @@ -322,134 +534,6 @@ cdef class HdfsClient: f.download(stream) -cdef class NativeFile: - - def __cinit__(self): - self.is_open = False - - def __dealloc__(self): - if self.is_open: - self.close() - - def __enter__(self): - return self - - def __exit__(self, exc_type, exc_value, tb): - self.close() - - def close(self): - if self.is_open: - with nogil: - if self.is_readonly: - check_cstatus(self.rd_file.get().Close()) - else: - check_cstatus(self.wr_file.get().Close()) - self.is_open = False - - cdef read_handle(self, shared_ptr[ReadableFileInterface]* file): - self._assert_readable() - file[0] = self.rd_file - - cdef write_handle(self, shared_ptr[OutputStream]* file): - self._assert_writeable() - file[0] = self.wr_file - - def _assert_readable(self): - if not self.is_readonly: - raise IOError("only valid on readonly files") - - def _assert_writeable(self): - if self.is_readonly: - raise IOError("only valid on writeonly files") - - def size(self): - cdef int64_t size - self._assert_readable() - with nogil: - check_cstatus(self.rd_file.get().GetSize(&size)) - return size - - def tell(self): - cdef int64_t position - with nogil: - if self.is_readonly: - check_cstatus(self.rd_file.get().Tell(&position)) - else: - check_cstatus(self.wr_file.get().Tell(&position)) - return position - - def seek(self, int64_t position): - self._assert_readable() - with nogil: - check_cstatus(self.rd_file.get().Seek(position)) - - def write(self, data): - """ - Write bytes-like (unicode, encoded to UTF-8) to file - """ - self._assert_writeable() - - data = tobytes(data) - - cdef const uint8_t* buf = cp.PyBytes_AS_STRING(data) - cdef int64_t bufsize = len(data) - with nogil: - check_cstatus(self.wr_file.get().Write(buf, bufsize)) - - def read(self, int nbytes): - cdef: - int64_t bytes_read = 0 - uint8_t* buf - shared_ptr[Buffer] out - - self._assert_readable() - - with nogil: - check_cstatus(self.rd_file.get() - .ReadB(nbytes, &out)) - - result = cp.PyBytes_FromStringAndSize( - out.get().data(), out.get().size()) - - return result - - -# 
---------------------------------------------------------------------- -# Python file-like objects - -cdef class PythonFileInterface(NativeFile): - cdef: - object handle - - def __cinit__(self, handle, mode='w'): - self.handle = handle - - if mode.startswith('w'): - self.wr_file.reset(new pyarrow.PyOutputStream(handle)) - self.is_readonly = 0 - elif mode.startswith('r'): - self.rd_file.reset(new pyarrow.PyReadableFile(handle)) - self.is_readonly = 1 - else: - raise ValueError('Invalid file mode: {0}'.format(mode)) - - self.is_open = True - - -cdef class BytesReader(NativeFile): - cdef: - object obj - - def __cinit__(self, obj): - if not isinstance(obj, bytes): - raise ValueError('Must pass bytes object') - - self.obj = obj - self.is_readonly = 1 - self.is_open = True - - self.rd_file.reset(new pyarrow.PyBytesReader(obj)) - # ---------------------------------------------------------------------- # Specialization for HDFS diff --git a/python/pyarrow/ipc.pyx b/python/pyarrow/ipc.pyx new file mode 100644 index 0000000000000..f8da3a70da819 --- /dev/null +++ b/python/pyarrow/ipc.pyx @@ -0,0 +1,155 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# Cython wrappers for arrow::ipc + +# cython: profile=False +# distutils: language = c++ +# cython: embedsignature = True + +from pyarrow.includes.libarrow cimport * +from pyarrow.includes.libarrow_io cimport * +from pyarrow.includes.libarrow_ipc cimport * +cimport pyarrow.includes.pyarrow as pyarrow + +from pyarrow.error cimport check_cstatus +from pyarrow.io cimport NativeFile +from pyarrow.schema cimport Schema +from pyarrow.table cimport RecordBatch + +from pyarrow.compat import frombytes, tobytes +import pyarrow.io as io + +cimport cpython as cp + + +cdef get_reader(source, shared_ptr[ReadableFileInterface]* reader): + cdef NativeFile nf + + if isinstance(source, bytes): + source = io.BytesReader(source) + elif not isinstance(source, io.NativeFile) and hasattr(source, 'read'): + # Optimistically hope this is file-like + source = io.PythonFileInterface(source, mode='r') + + if isinstance(source, NativeFile): + nf = source + + # TODO: what about read-write sources (e.g. 
memory maps) + if not nf.is_readonly: + raise IOError('Native file is not readable') + + nf.read_handle(reader) + else: + raise TypeError('Unable to read from object of type: {0}' + .format(type(source))) + + +cdef get_writer(source, shared_ptr[OutputStream]* writer): + cdef NativeFile nf + + if not isinstance(source, io.NativeFile) and hasattr(source, 'write'): + # Optimistically hope this is file-like + source = io.PythonFileInterface(source, mode='w') + + if isinstance(source, io.NativeFile): + nf = source + + if nf.is_readonly: + raise IOError('Native file is not writeable') + + nf.write_handle(writer) + else: + raise TypeError('Unable to read from object of type: {0}' + .format(type(source))) + + +cdef class ArrowFileWriter: + cdef: + shared_ptr[CFileWriter] writer + shared_ptr[OutputStream] sink + bint closed + + def __cinit__(self, sink, Schema schema): + self.closed = True + get_writer(sink, &self.sink) + + with nogil: + check_cstatus(CFileWriter.Open(self.sink.get(), schema.sp_schema, + &self.writer)) + + self.closed = False + + def __dealloc__(self): + if not self.closed: + self.close() + + def write_record_batch(self, RecordBatch batch): + cdef CRecordBatch* bptr = batch.batch + with nogil: + check_cstatus(self.writer.get() + .WriteRecordBatch(bptr.columns(), bptr.num_rows())) + + def close(self): + with nogil: + check_cstatus(self.writer.get().Close()) + self.closed = True + + +cdef class ArrowFileReader: + cdef: + shared_ptr[CFileReader] reader + + def __cinit__(self, source, footer_offset=None): + cdef shared_ptr[ReadableFileInterface] reader + get_reader(source, &reader) + + cdef int64_t offset = 0 + if footer_offset is not None: + offset = footer_offset + + with nogil: + if offset != 0: + check_cstatus(CFileReader.Open2(reader, offset, &self.reader)) + else: + check_cstatus(CFileReader.Open(reader, &self.reader)) + + property num_dictionaries: + + def __get__(self): + return self.reader.get().num_dictionaries() + + property num_record_batches: + + def __get__(self): + return self.reader.get().num_record_batches() + + def get_record_batch(self, int i): + cdef: + shared_ptr[CRecordBatch] batch + RecordBatch result + + if i < 0 or i >= self.num_record_batches: + raise ValueError('Batch number {0} out of range'.format(i)) + + with nogil: + check_cstatus(self.reader.get().GetRecordBatch(i, &batch)) + + result = RecordBatch() + result.init(batch) + + return result diff --git a/python/pyarrow/table.pxd b/python/pyarrow/table.pxd index 0a5c122c95cff..79c9ae3b0a194 100644 --- a/python/pyarrow/table.pxd +++ b/python/pyarrow/table.pxd @@ -16,7 +16,10 @@ # under the License. 
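The ArrowFileWriter/ArrowFileReader wrappers above, together with get_reader()/get_writer(), are the entire user-facing surface of the new pyarrow.ipc module. A minimal round trip, sketched for illustration only (not part of the patch; `df` stands for some pre-existing pandas.DataFrame), looks like:

    import io

    import pyarrow as A
    import pyarrow.ipc as ipc

    batch = A.RecordBatch.from_pandas(df)   # df: assumed pandas.DataFrame

    sink = io.BytesIO()
    writer = ipc.ArrowFileWriter(sink, batch.schema)
    writer.write_record_batch(batch)
    writer.close()

    # get_reader() accepts bytes, pyarrow NativeFiles, or file-like objects
    reader = ipc.ArrowFileReader(sink.getvalue())
    assert reader.num_record_batches == 1
    assert reader.get_record_batch(0).equals(batch)
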
from pyarrow.includes.common cimport shared_ptr -from pyarrow.includes.libarrow cimport CChunkedArray, CColumn, CTable +from pyarrow.includes.libarrow cimport (CChunkedArray, CColumn, CTable, + CRecordBatch) + +from pyarrow.schema cimport Schema cdef class ChunkedArray: @@ -41,6 +44,16 @@ cdef class Table: cdef: shared_ptr[CTable] sp_table CTable* table - + cdef init(self, const shared_ptr[CTable]& table) cdef _check_nullptr(self) + + +cdef class RecordBatch: + cdef: + shared_ptr[CRecordBatch] sp_batch + CRecordBatch* batch + Schema _schema + + cdef init(self, const shared_ptr[CRecordBatch]& table) + cdef _check_nullptr(self) diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index ade82aa676164..a1cadcd1e0f69 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -19,6 +19,8 @@ # distutils: language = c++ # cython: embedsignature = True +from cython.operator cimport dereference as deref + from pyarrow.includes.libarrow cimport * cimport pyarrow.includes.pyarrow as pyarrow @@ -45,8 +47,8 @@ cdef class ChunkedArray: cdef _check_nullptr(self): if self.chunked_array == NULL: - raise ReferenceError("ChunkedArray object references a NULL pointer." - "Not initialized.") + raise ReferenceError("ChunkedArray object references a NULL " + "pointer. Not initialized.") def length(self): self._check_nullptr() @@ -144,6 +146,130 @@ cdef class Column: return chunked_array +cdef _schema_from_arrays(arrays, names, shared_ptr[CSchema]* schema): + cdef: + Array arr + c_string c_name + vector[shared_ptr[CField]] fields + + cdef int K = len(arrays) + + fields.resize(K) + for i in range(K): + arr = arrays[i] + c_name = tobytes(names[i]) + fields[i].reset(new CField(c_name, arr.type.sp_type, True)) + + schema.reset(new CSchema(fields)) + + + +cdef _dataframe_to_arrays(df, name, timestamps_to_ms): + from pyarrow.array import from_pandas_series + + cdef: + list names = [] + list arrays = [] + + for name in df.columns: + col = df[name] + arr = from_pandas_series(col, timestamps_to_ms=timestamps_to_ms) + + names.append(name) + arrays.append(arr) + + return names, arrays + + +cdef class RecordBatch: + + def __cinit__(self): + self.batch = NULL + self._schema = None + + cdef init(self, const shared_ptr[CRecordBatch]& batch): + self.sp_batch = batch + self.batch = batch.get() + + cdef _check_nullptr(self): + if self.batch == NULL: + raise ReferenceError("Object not initialized") + + def __len__(self): + self._check_nullptr() + return self.batch.num_rows() + + property num_columns: + + def __get__(self): + self._check_nullptr() + return self.batch.num_columns() + + property num_rows: + + def __get__(self): + return len(self) + + property schema: + + def __get__(self): + cdef Schema schema + self._check_nullptr() + if self._schema is None: + schema = Schema() + schema.init_schema(self.batch.schema()) + self._schema = schema + + return self._schema + + def __getitem__(self, i): + cdef Array arr = Array() + arr.init(self.batch.column(i)) + return arr + + def equals(self, RecordBatch other): + self._check_nullptr() + other._check_nullptr() + + return self.batch.Equals(deref(other.batch)) + + @classmethod + def from_pandas(cls, df): + """ + Convert pandas.DataFrame to an Arrow RecordBatch + """ + names, arrays = _dataframe_to_arrays(df, None, False) + return cls.from_arrays(names, arrays) + + @staticmethod + def from_arrays(names, arrays): + cdef: + Array arr + RecordBatch result + c_string c_name + shared_ptr[CSchema] schema + shared_ptr[CRecordBatch] batch + vector[shared_ptr[CArray]] 
c_arrays + int32_t num_rows + + if len(arrays) == 0: + raise ValueError('Record batch cannot contain no arrays (for now)') + + num_rows = len(arrays[0]) + _schema_from_arrays(arrays, names, &schema) + + for i in range(len(arrays)): + arr = arrays[i] + c_arrays.push_back(arr.sp_array) + + batch.reset(new CRecordBatch(schema, num_rows, c_arrays)) + + result = RecordBatch() + result.init(batch) + + return result + + cdef class Table: ''' Do not call this class's constructor directly. @@ -161,38 +287,50 @@ cdef class Table: raise ReferenceError("Table object references a NULL pointer." "Not initialized.") - @staticmethod - def from_pandas(df, name=None): - return from_pandas_dataframe(df, name=name) + @classmethod + def from_pandas(cls, df, name=None, timestamps_to_ms=False): + """ + Convert pandas.DataFrame to an Arrow Table + + Parameters + ---------- + df: pandas.DataFrame + + name: str + + timestamps_to_ms: bool + Convert datetime columns to ms resolution. This is needed for + compability with other functionality like Parquet I/O which + only supports milliseconds. + """ + names, arrays = _dataframe_to_arrays(df, name=name, + timestamps_to_ms=timestamps_to_ms) + return cls.from_arrays(names, arrays, name=name) @staticmethod def from_arrays(names, arrays, name=None): cdef: Array arr - Table result c_string c_name vector[shared_ptr[CField]] fields vector[shared_ptr[CColumn]] columns + Table result shared_ptr[CSchema] schema shared_ptr[CTable] table - cdef int K = len(arrays) + _schema_from_arrays(arrays, names, &schema) - fields.resize(K) + cdef int K = len(arrays) columns.resize(K) for i in range(K): arr = arrays[i] - c_name = tobytes(names[i]) - - fields[i].reset(new CField(c_name, arr.type.sp_type, True)) - columns[i].reset(new CColumn(fields[i], arr.sp_array)) + columns[i].reset(new CColumn(schema.get().field(i), arr.sp_array)) if name is None: c_name = '' else: c_name = tobytes(name) - schema.reset(new CSchema(fields)) table.reset(new CTable(c_name, schema, columns)) result = Table() @@ -268,32 +406,4 @@ cdef class Table: -def from_pandas_dataframe(object df, name=None, timestamps_to_ms=False): - """ - Convert pandas.DataFrame to an Arrow Table - - Parameters - ---------- - df: pandas.DataFrame - - name: str - - timestamps_to_ms: bool - Convert datetime columns to ms resolution. This is needed for - compability with other functionality like Parquet I/O which - only supports milliseconds. 
- """ - from pyarrow.array import from_pandas_series - - cdef: - list names = [] - list arrays = [] - - for name in df.columns: - col = df[name] - arr = from_pandas_series(col, timestamps_to_ms=timestamps_to_ms) - - names.append(name) - arrays.append(arr) - - return Table.from_arrays(names, arrays, name=name) +from_pandas_dataframe = Table.from_pandas diff --git a/python/pyarrow/tests/test_array.py b/python/pyarrow/tests/test_array.py index 86147f8df5a11..0a17f691ccd1f 100644 --- a/python/pyarrow/tests/test_array.py +++ b/python/pyarrow/tests/test_array.py @@ -19,6 +19,10 @@ import pyarrow.formatting as fmt +def test_total_bytes_allocated(): + assert pyarrow.total_allocated_bytes() == 0 + + def test_repr_on_pre_init_array(): arr = pyarrow.array.Array() assert len(repr(arr)) > 0 diff --git a/python/pyarrow/tests/test_io.py b/python/pyarrow/tests/test_io.py index 9a41ebe3e8c74..211a12bcd92fe 100644 --- a/python/pyarrow/tests/test_io.py +++ b/python/pyarrow/tests/test_io.py @@ -98,3 +98,44 @@ def test_bytes_reader(): def test_bytes_reader_non_bytes(): with pytest.raises(ValueError): io.BytesReader(u('some sample data')) + + + +# ---------------------------------------------------------------------- +# Buffers + + +def test_buffer_bytes(): + val = b'some data' + + buf = io.buffer_from_bytes(val) + assert isinstance(buf, io.Buffer) + + result = buf.to_pybytes() + + assert result == val + + +def test_memory_output_stream(): + # 10 bytes + val = b'dataabcdef' + + f = io.InMemoryOutputStream() + + K = 1000 + for i in range(K): + f.write(val) + + buf = f.get_result() + + assert len(buf) == len(val) * K + assert buf.to_pybytes() == val * K + + +def test_inmemory_write_after_closed(): + f = io.InMemoryOutputStream() + f.write(b'ok') + f.get_result() + + with pytest.raises(IOError): + f.write(b'not ok') diff --git a/python/pyarrow/tests/test_ipc.py b/python/pyarrow/tests/test_ipc.py new file mode 100644 index 0000000000000..b9e9e6ed0c423 --- /dev/null +++ b/python/pyarrow/tests/test_ipc.py @@ -0,0 +1,116 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
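The table.pyx changes above give RecordBatch a construction path that does not go through pandas, and io.pyx's buffer_from_bytes wraps a Python bytes object as an arrow::Buffer without a copy on the C++ side. A short illustrative sketch (not part of the patch) combining the two APIs:

    import pyarrow as A
    import pyarrow.io as arrow_io

    data = [A.from_pylist([1, 2, 3]), A.from_pylist([-1, 0, 1])]
    batch = A.RecordBatch.from_arrays(['c0', 'c1'], data)

    assert batch.num_rows == 3
    assert batch.equals(batch)      # dispatches to C++ RecordBatch::Equals

    buf = arrow_io.buffer_from_bytes(b'some data')
    assert isinstance(buf, arrow_io.Buffer)
    assert buf.to_pybytes() == b'some data'
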
+ +import io + +import numpy as np +import pandas as pd + +import pyarrow as A +import pyarrow.io as arrow_io +import pyarrow.ipc as ipc + + +class RoundtripTest(object): + # Also tests writing zero-copy NumPy array with additional padding + + def __init__(self): + self.sink = self._get_sink() + + def _get_sink(self): + return io.BytesIO() + + def _get_source(self): + return self.sink.getvalue() + + def run(self): + nrows = 5 + df = pd.DataFrame({ + 'one': np.random.randn(nrows), + 'two': ['foo', np.nan, 'bar', 'bazbaz', 'qux']}) + + batch = A.RecordBatch.from_pandas(df) + writer = ipc.ArrowFileWriter(self.sink, batch.schema) + + num_batches = 5 + frames = [] + batches = [] + for i in range(num_batches): + unique_df = df.copy() + unique_df['one'] = np.random.randn(nrows) + + batch = A.RecordBatch.from_pandas(unique_df) + writer.write_record_batch(batch) + frames.append(unique_df) + batches.append(batch) + + writer.close() + + file_contents = self._get_source() + reader = ipc.ArrowFileReader(file_contents) + + assert reader.num_record_batches == num_batches + + for i in range(num_batches): + # it works. Must convert back to DataFrame + batch = reader.get_record_batch(i) + assert batches[i].equals(batch) + + +class InMemoryStreamTest(RoundtripTest): + + def _get_sink(self): + return arrow_io.InMemoryOutputStream() + + def _get_source(self): + return self.sink.get_result() + + +def test_ipc_file_simple_roundtrip(): + helper = RoundtripTest() + helper.run() + + +# XXX: For benchmarking + +def big_batch(): + df = pd.DataFrame( + np.random.randn(2**4, 2**20).T, + columns=[str(i) for i in range(2**4)] + ) + + df = pd.concat([df] * 2 ** 3, ignore_index=True) + + return A.RecordBatch.from_pandas(df) + + +def write_to_memory(batch): + sink = io.BytesIO() + write_file(batch, sink) + return sink.getvalue() + + +def write_file(batch, sink): + writer = ipc.ArrowFileWriter(sink, batch.schema) + writer.write_record_batch(batch) + writer.close() + + +def read_file(source): + reader = ipc.ArrowFileReader(source) + return [reader.get_record_batch(i) + for i in range(reader.num_record_batches)] diff --git a/python/pyarrow/tests/test_table.py b/python/pyarrow/tests/test_table.py index abf143199fe15..c5130329e02bc 100644 --- a/python/pyarrow/tests/test_table.py +++ b/python/pyarrow/tests/test_table.py @@ -15,60 +15,52 @@ # specific language governing permissions and limitations # under the License. 
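The "additional padding" mentioned in RoundtripTest's comment is the write-side behavior introduced in adapter.cc earlier in this patch: each buffer's metadata records its true size rounded up to a 64-byte boundary, and zero bytes are appended after the data, so NumPy-backed buffers no longer have to be reallocated just to satisfy alignment. The arithmetic is the same bit trick as RoundUpToMultipleOf64; a Python rendering for illustration (not part of the patch):

    def round_up_to_multiple_of_64(n):
        # Same as ((n + 63) // 64) * 64 for non-negative n; the C++
        # version adds 63 to force the carry, then masks off the low bits.
        return (n + 63) & ~63

    size = 10                                      # bytes actually written
    padding = round_up_to_multiple_of_64(size) - size
    assert padding == 54                           # zero bytes appended
    assert round_up_to_multiple_of_64(64) == 64    # aligned: no padding
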
-from pyarrow.compat import unittest import pyarrow as A -class TestRowBatch(unittest.TestCase): +def test_recordbatch_basics(): + data = [ + A.from_pylist(range(5)), + A.from_pylist([-10, -5, 0, 5, 10]) + ] - def test_basics(self): - data = [ - A.from_pylist(range(5)), - A.from_pylist([-10, -5, 0, 5, 10]) - ] - num_rows = 5 + batch = A.RecordBatch.from_arrays(['c0', 'c1'], data) - descr = A.schema([A.field('c0', data[0].type), - A.field('c1', data[1].type)]) + assert len(batch) == 5 + assert batch.num_rows == 5 + assert batch.num_columns == len(data) - batch = A.RowBatch(descr, num_rows, data) - assert len(batch) == num_rows - assert batch.num_rows == num_rows - assert batch.num_columns == len(data) +def test_table_basics(): + data = [ + A.from_pylist(range(5)), + A.from_pylist([-10, -5, 0, 5, 10]) + ] + table = A.Table.from_arrays(('a', 'b'), data, 'table_name') + assert table.name == 'table_name' + assert len(table) == 5 + assert table.num_rows == 5 + assert table.num_columns == 2 + assert table.shape == (5, 2) + for col in table.itercolumns(): + for chunk in col.data.iterchunks(): + assert chunk is not None -class TestTable(unittest.TestCase): - def test_basics(self): - data = [ - A.from_pylist(range(5)), - A.from_pylist([-10, -5, 0, 5, 10]) - ] - table = A.Table.from_arrays(('a', 'b'), data, 'table_name') - assert table.name == 'table_name' - assert len(table) == 5 - assert table.num_rows == 5 - assert table.num_columns == 2 - assert table.shape == (5, 2) +def test_table_pandas(): + data = [ + A.from_pylist(range(5)), + A.from_pylist([-10, -5, 0, 5, 10]) + ] + table = A.Table.from_arrays(('a', 'b'), data, 'table_name') - for col in table.itercolumns(): - for chunk in col.data.iterchunks(): - assert chunk is not None + # TODO: Use this part once from_pandas is implemented + # data = {'a': range(5), 'b': [-10, -5, 0, 5, 10]} + # df = pd.DataFrame(data) + # A.Table.from_pandas(df) - def test_pandas(self): - data = [ - A.from_pylist(range(5)), - A.from_pylist([-10, -5, 0, 5, 10]) - ] - table = A.Table.from_arrays(('a', 'b'), data, 'table_name') - - # TODO: Use this part once from_pandas is implemented - # data = {'a': range(5), 'b': [-10, -5, 0, 5, 10]} - # df = pd.DataFrame(data) - # A.Table.from_pandas(df) - - df = table.to_pandas() - assert set(df.columns) == set(('a', 'b')) - assert df.shape == (5, 2) - assert df.ix[0, 'b'] == -10 + df = table.to_pandas() + assert set(df.columns) == set(('a', 'b')) + assert df.shape == (5, 2) + assert df.loc[0, 'b'] == -10 diff --git a/python/setup.py b/python/setup.py index d1be122888e7b..d040ea7e892c5 100644 --- a/python/setup.py +++ b/python/setup.py @@ -102,6 +102,7 @@ def initialize_options(self): 'config', 'error', 'io', + 'ipc', 'parquet', 'scalar', 'schema', diff --git a/python/src/pyarrow/adapters/builtin.cc b/python/src/pyarrow/adapters/builtin.cc index 78ef1b31f34f1..680f3a539b5fa 100644 --- a/python/src/pyarrow/adapters/builtin.cc +++ b/python/src/pyarrow/adapters/builtin.cc @@ -426,7 +426,7 @@ Status ConvertPySequence(PyObject* obj, std::shared_ptr* out) { // Give the sequence converter an array builder std::shared_ptr builder; - RETURN_ARROW_NOT_OK(arrow::MakeBuilder(GetMemoryPool(), type, &builder)); + RETURN_ARROW_NOT_OK(arrow::MakeBuilder(get_memory_pool(), type, &builder)); converter->Init(builder); PY_RETURN_NOT_OK(converter->AppendData(obj)); diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index d224074d652cb..ae24b7ee5847b 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ 
b/python/src/pyarrow/adapters/pandas.cc @@ -602,6 +602,8 @@ class ArrowDeserializer { } Status AllocateOutput(int type) { + PyAcquireGIL lock; + npy_intp dims[1] = {col_->length()}; out_ = reinterpret_cast(PyArray_SimpleNew(1, dims, type)); @@ -616,6 +618,8 @@ class ArrowDeserializer { } Status OutputFromData(int type, void* data) { + PyAcquireGIL lock; + // Zero-Copy. We can pass the data pointer directly to NumPy. Py_INCREF(py_ref_); OwnedRef py_ref(py_ref_); @@ -706,6 +710,8 @@ class ArrowDeserializer { inline typename std::enable_if< arrow_traits::is_boolean, Status>::type ConvertValues(const std::shared_ptr& arr) { + PyAcquireGIL lock; + arrow::BooleanArray* bool_arr = static_cast(arr.get()); if (arr->null_count() > 0) { @@ -743,6 +749,8 @@ class ArrowDeserializer { inline typename std::enable_if< T2 == arrow::Type::STRING, Status>::type ConvertValues(const std::shared_ptr& arr) { + PyAcquireGIL lock; + RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); PyObject** out_values = reinterpret_cast(PyArray_DATA(out_)); diff --git a/python/src/pyarrow/common.cc b/python/src/pyarrow/common.cc index 82b14fdf40173..09f3efb5a03bc 100644 --- a/python/src/pyarrow/common.cc +++ b/python/src/pyarrow/common.cc @@ -63,7 +63,7 @@ class PyArrowMemoryPool : public arrow::MemoryPool { int64_t bytes_allocated_; }; -arrow::MemoryPool* GetMemoryPool() { +arrow::MemoryPool* get_memory_pool() { static PyArrowMemoryPool memory_pool; return &memory_pool; } diff --git a/python/src/pyarrow/common.h b/python/src/pyarrow/common.h index bc599f84fab50..96eed1654a777 100644 --- a/python/src/pyarrow/common.h +++ b/python/src/pyarrow/common.h @@ -109,7 +109,8 @@ class PyGILGuard { return Status::UnknownError(message); \ } -PYARROW_EXPORT arrow::MemoryPool* GetMemoryPool(); +// Return the common PyArrow memory pool +PYARROW_EXPORT arrow::MemoryPool* get_memory_pool(); class PYARROW_EXPORT NumPyBuffer : public arrow::Buffer { public: @@ -120,6 +121,7 @@ class PYARROW_EXPORT NumPyBuffer : public arrow::Buffer { data_ = reinterpret_cast(PyArray_DATA(arr_)); size_ = PyArray_SIZE(arr_); + capacity_ = size_ * PyArray_DESCR(arr_)->elsize; } virtual ~NumPyBuffer() { @@ -139,6 +141,22 @@ class PYARROW_EXPORT PyBytesBuffer : public arrow::Buffer { PyObject* obj_; }; + +class PyAcquireGIL { + public: + PyAcquireGIL() { + state_ = PyGILState_Ensure(); + } + + ~PyAcquireGIL() { + PyGILState_Release(state_); + } + + private: + PyGILState_STATE state_; + DISALLOW_COPY_AND_ASSIGN(PyAcquireGIL); +}; + } // namespace pyarrow #endif // PYARROW_COMMON_H diff --git a/python/src/pyarrow/io.cc b/python/src/pyarrow/io.cc index 35054e9025ad4..9879b3474bcd0 100644 --- a/python/src/pyarrow/io.cc +++ b/python/src/pyarrow/io.cc @@ -47,9 +47,9 @@ static arrow::Status CheckPyError() { PyErr_Fetch(&exc_type, &exc_value, &traceback); PyObjectStringify stringified(exc_value); std::string message(stringified.bytes); - Py_DECREF(exc_type); - Py_DECREF(exc_value); - Py_DECREF(traceback); + Py_XDECREF(exc_type); + Py_XDECREF(exc_value); + Py_XDECREF(traceback); PyErr_Clear(); return arrow::Status::IOError(message); } From fb799bc8f818574aacf380b2694aec011d2c18dd Mon Sep 17 00:00:00 2001 From: Leif Walsh Date: Mon, 10 Oct 2016 22:49:47 -0400 Subject: [PATCH 0166/1644] ARROW-112: Changed constexprs to kValue naming. Consistent with Google style. Author: Leif Walsh Closes #168 from leifwalsh/constant-name-fix-no-enum and squashes the following commits: 37a0b34 [Leif Walsh] ARROW-112: Changed constexprs to kValue naming. 
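Returning briefly to the GIL changes in the previous commit: the PyAcquireGIL guard in common.h and the `with nogil` blocks in array.pyx are what make it reasonable to drive conversions from worker threads. A hypothetical sketch of the pattern, illustrative only and not part of either patch:

    from concurrent.futures import ThreadPoolExecutor

    import numpy as np
    import pandas as pd

    from pyarrow.array import from_pandas_series

    chunks = [pd.Series(np.random.randn(1000)) for _ in range(8)]

    # PandasToArrow now runs inside `with nogil`, so these conversions can
    # overlap across threads; the C++ side re-acquires the GIL only where
    # it must touch Python objects (the PyAcquireGIL guard above).
    with ThreadPoolExecutor(4) as pool:
        arrays = list(pool.map(from_pandas_series, chunks))

    assert len(arrays) == 8
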
--- cpp/src/arrow/builder.h | 2 +- cpp/src/arrow/types/json.cc | 6 +++--- cpp/src/arrow/types/primitive-test.cc | 8 ++++---- cpp/src/arrow/types/primitive.cc | 2 +- cpp/src/arrow/util/bit-util.h | 10 +++++----- cpp/src/arrow/util/buffer.h | 2 -- 6 files changed, 14 insertions(+), 16 deletions(-) diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index 5d9fb992ff0b5..646a6f24e9df8 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -33,7 +33,7 @@ class Array; class MemoryPool; class PoolBuffer; -static constexpr int32_t MIN_BUILDER_CAPACITY = 1 << 5; +static constexpr int32_t kMinBuilderCapacity = 1 << 5; // Base class for all data array builders. // This class provides a facilities for incrementally building the null bitmap diff --git a/cpp/src/arrow/types/json.cc b/cpp/src/arrow/types/json.cc index a4e0d085620a0..89240fc22bb2c 100644 --- a/cpp/src/arrow/types/json.cc +++ b/cpp/src/arrow/types/json.cc @@ -30,8 +30,8 @@ static const TypePtr String(new StringType()); static const TypePtr Double(new DoubleType()); static const TypePtr Bool(new BooleanType()); -static const std::vector json_types = {Null, Int32, String, Double, Bool}; -TypePtr JSONScalar::dense_type = TypePtr(new DenseUnionType(json_types)); -TypePtr JSONScalar::sparse_type = TypePtr(new SparseUnionType(json_types)); +static const std::vector kJsonTypes = {Null, Int32, String, Double, Bool}; +TypePtr JSONScalar::dense_type = TypePtr(new DenseUnionType(kJsonTypes)); +TypePtr JSONScalar::sparse_type = TypePtr(new SparseUnionType(kJsonTypes)); } // namespace arrow diff --git a/cpp/src/arrow/types/primitive-test.cc b/cpp/src/arrow/types/primitive-test.cc index 87eb0fe3a8bf7..5ac2867932df7 100644 --- a/cpp/src/arrow/types/primitive-test.cc +++ b/cpp/src/arrow/types/primitive-test.cc @@ -460,7 +460,7 @@ TYPED_TEST(TestPrimitiveBuilder, TestAdvance) { TYPED_TEST(TestPrimitiveBuilder, TestResize) { DECL_TYPE(); - int cap = MIN_BUILDER_CAPACITY * 2; + int cap = kMinBuilderCapacity * 2; ASSERT_OK(this->builder_->Reserve(cap)); ASSERT_EQ(cap, this->builder_->capacity()); @@ -472,13 +472,13 @@ TYPED_TEST(TestPrimitiveBuilder, TestResize) { TYPED_TEST(TestPrimitiveBuilder, TestReserve) { ASSERT_OK(this->builder_->Reserve(10)); ASSERT_EQ(0, this->builder_->length()); - ASSERT_EQ(MIN_BUILDER_CAPACITY, this->builder_->capacity()); + ASSERT_EQ(kMinBuilderCapacity, this->builder_->capacity()); ASSERT_OK(this->builder_->Reserve(90)); ASSERT_OK(this->builder_->Advance(100)); - ASSERT_OK(this->builder_->Reserve(MIN_BUILDER_CAPACITY)); + ASSERT_OK(this->builder_->Reserve(kMinBuilderCapacity)); - ASSERT_EQ(util::next_power2(MIN_BUILDER_CAPACITY + 100), this->builder_->capacity()); + ASSERT_EQ(util::next_power2(kMinBuilderCapacity + 100), this->builder_->capacity()); } } // namespace arrow diff --git a/cpp/src/arrow/types/primitive.cc b/cpp/src/arrow/types/primitive.cc index 375e94f2bc1c4..9ba2ebdcc2d5b 100644 --- a/cpp/src/arrow/types/primitive.cc +++ b/cpp/src/arrow/types/primitive.cc @@ -86,7 +86,7 @@ Status PrimitiveBuilder::Init(int32_t capacity) { template Status PrimitiveBuilder::Resize(int32_t capacity) { // XXX: Set floor size for now - if (capacity < MIN_BUILDER_CAPACITY) { capacity = MIN_BUILDER_CAPACITY; } + if (capacity < kMinBuilderCapacity) { capacity = kMinBuilderCapacity; } if (capacity_ == 0) { RETURN_NOT_OK(Init(capacity)); diff --git a/cpp/src/arrow/util/bit-util.h b/cpp/src/arrow/util/bit-util.h index 3087ce7784d11..c33ef272f05e2 100644 --- a/cpp/src/arrow/util/bit-util.h +++ 
b/cpp/src/arrow/util/bit-util.h @@ -44,22 +44,22 @@ static inline int64_t ceil_2bytes(int64_t size) { return (size + 15) & ~15; } -static constexpr uint8_t BITMASK[] = {1, 2, 4, 8, 16, 32, 64, 128}; +static constexpr uint8_t kBitmask[] = {1, 2, 4, 8, 16, 32, 64, 128}; static inline bool get_bit(const uint8_t* bits, int i) { - return static_cast(bits[i / 8] & BITMASK[i % 8]); + return static_cast(bits[i / 8] & kBitmask[i % 8]); } static inline bool bit_not_set(const uint8_t* bits, int i) { - return (bits[i / 8] & BITMASK[i % 8]) == 0; + return (bits[i / 8] & kBitmask[i % 8]) == 0; } static inline void clear_bit(uint8_t* bits, int i) { - bits[i / 8] &= ~BITMASK[i % 8]; + bits[i / 8] &= ~kBitmask[i % 8]; } static inline void set_bit(uint8_t* bits, int i) { - bits[i / 8] |= BITMASK[i % 8]; + bits[i / 8] |= kBitmask[i % 8]; } static inline int64_t next_power2(int64_t n) { diff --git a/cpp/src/arrow/util/buffer.h b/cpp/src/arrow/util/buffer.h index 01e4259c31fd2..bc0df86221c45 100644 --- a/cpp/src/arrow/util/buffer.h +++ b/cpp/src/arrow/util/buffer.h @@ -141,8 +141,6 @@ class ARROW_EXPORT PoolBuffer : public ResizableBuffer { MemoryPool* pool_; }; -static constexpr int64_t MIN_BUFFER_CAPACITY = 1024; - class BufferBuilder { public: explicit BufferBuilder(MemoryPool* pool) From 8c8d341e12efcedecd3c2545aaf349bf5f899bc1 Mon Sep 17 00:00:00 2001 From: Steven Phillips Date: Mon, 10 Oct 2016 13:42:41 -0700 Subject: [PATCH 0167/1644] ARROW-326: Include scale and precision when materializing decimal writer closes #166 --- java/vector/src/main/codegen/templates/MapWriters.java | 5 +++++ .../arrow/vector/complex/impl/TestPromotableWriter.java | 9 ++++++--- 2 files changed, 11 insertions(+), 3 deletions(-) diff --git a/java/vector/src/main/codegen/templates/MapWriters.java b/java/vector/src/main/codegen/templates/MapWriters.java index 9fe20df7a1df0..696bbf655cac9 100644 --- a/java/vector/src/main/codegen/templates/MapWriters.java +++ b/java/vector/src/main/codegen/templates/MapWriters.java @@ -73,7 +73,12 @@ public class ${mode}MapWriter extends AbstractFieldWriter { <#if lowerName == "int" ><#assign lowerName = "integer" /> <#assign upperName = minor.class?upper_case /> case ${upperName}: + <#if lowerName == "decimal" > + Decimal decimal = (Decimal)child.getType(); + decimal(child.getName(), decimal.getScale(), decimal.getPrecision()); + <#else> ${lowerName}(child.getName()); + break; } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java index d439cebeda6ac..176ad5195b3a1 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java @@ -67,6 +67,8 @@ public void testPromoteToUnion() throws Exception { writer.setPosition(1); writer.bit("A").writeBit(1); + writer.decimal("dec", 10,10); + writer.setPosition(2); writer.integer("A").writeInt(10); @@ -108,9 +110,10 @@ public void testPromoteToUnion() throws Exception { newMapWriter.setPosition(2); newMapWriter.integer("A").writeInt(10); - Field childField = container.getField().getChildren().get(0).getChildren().get(0); - assertEquals("Child field should be union type: " + childField.getName(), Type.Union, childField.getType().getTypeType()); - + Field childField1 = container.getField().getChildren().get(0).getChildren().get(0); + Field childField2 = 
container.getField().getChildren().get(0).getChildren().get(1); + assertEquals("Child field should be union type: " + childField1.getName(), Type.Union, childField1.getType().getTypeType()); + assertEquals("Child field should be decimal type: " + childField2.getName(), Type.Decimal, childField2.getType().getTypeType()); } } } From 994aa5a903917aca0c9dd372341d4dcbc8be3aa5 Mon Sep 17 00:00:00 2001 From: Leif Walsh Date: Tue, 11 Oct 2016 14:00:36 -0400 Subject: [PATCH 0168/1644] ARROW-189: Build 3rd party with ExternalProject. When third party env vars *_HOME are not present, use cmake's ExternalProject to fetch and build them. When those vars are present, we just use them. Author: Leif Walsh Closes #167 from leifwalsh/cmake-externalproject and squashes the following commits: e4fb63a [Leif Walsh] ARROW-189: Remove 3rd party from conda build. 7892bae [Leif Walsh] ARROW-189: Fix darwin build. 8630428 [Leif Walsh] ARROW-189: Addressed CR comments. 8215abc [Leif Walsh] ARROW-189: Build 3rd party with ExternalProject. --- ci/travis_before_script_cpp.sh | 8 -- ci/travis_script_python.sh | 5 -- cpp/CMakeLists.txt | 107 ++++++++++++++++++++++---- cpp/README.md | 18 +---- cpp/conda.recipe/build.sh | 10 --- cpp/doc/Parquet.md | 1 - cpp/setup_build_env.sh | 21 ----- cpp/src/arrow/ipc/CMakeLists.txt | 10 ++- cpp/thirdparty/build_thirdparty.sh | 104 ------------------------- cpp/thirdparty/download_thirdparty.sh | 44 ----------- cpp/thirdparty/set_thirdparty_env.sh | 24 ------ cpp/thirdparty/versions.sh | 23 ------ 12 files changed, 101 insertions(+), 274 deletions(-) delete mode 100755 cpp/setup_build_env.sh delete mode 100755 cpp/thirdparty/build_thirdparty.sh delete mode 100755 cpp/thirdparty/download_thirdparty.sh delete mode 100755 cpp/thirdparty/set_thirdparty_env.sh delete mode 100755 cpp/thirdparty/versions.sh diff --git a/ci/travis_before_script_cpp.sh b/ci/travis_before_script_cpp.sh index acd820bbed2d4..2d4224b33336f 100755 --- a/ci/travis_before_script_cpp.sh +++ b/ci/travis_before_script_cpp.sh @@ -26,14 +26,6 @@ pushd $CPP_BUILD_DIR CPP_DIR=$TRAVIS_BUILD_DIR/cpp -# Build an isolated thirdparty -cp -r $CPP_DIR/thirdparty . -cp $CPP_DIR/setup_build_env.sh . 
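On the decimal change in ARROW-326 above: the writer has to be materialized with the child field's scale and precision because Arrow transports a decimal as an unscaled integer, which is uninterpretable without its scale. A small stdlib-only Python illustration of that convention (the helper name is hypothetical, not an Arrow API):

```python
# Hypothetical helper showing why a decimal writer must carry scale:
# the stored value is an unscaled integer, and scale says where the
# decimal point goes; precision bounds the total digit count.
from decimal import Decimal

def decimal_from_unscaled(unscaled: int, scale: int) -> Decimal:
    # scaleb(-scale) shifts the decimal point `scale` places to the left.
    return Decimal(unscaled).scaleb(-scale)

# The same unscaled integer denotes different values under different scales.
assert decimal_from_unscaled(12345, 2) == Decimal("123.45")
assert decimal_from_unscaled(12345, 4) == Decimal("1.2345")
```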
- -source setup_build_env.sh - -echo $GTEST_HOME - : ${ARROW_CPP_INSTALL=$TRAVIS_BUILD_DIR/cpp-install} CMAKE_COMMON_FLAGS="\ diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index a75ff0778bc82..97f0563240c75 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -21,11 +21,6 @@ export MINICONDA=$HOME/miniconda export PATH="$MINICONDA/bin:$PATH" export PARQUET_HOME=$MINICONDA -# Share environment with C++ -pushd $CPP_BUILD_DIR -source setup_build_env.sh -popd - pushd $PYTHON_DIR python_version_tests() { diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index f70c8ab4bccef..d682dc76f8ced 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -21,10 +21,15 @@ project(arrow) set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_SOURCE_DIR}/cmake_modules") include(CMakeParseArguments) +include(ExternalProject) set(BUILD_SUPPORT_DIR "${CMAKE_SOURCE_DIR}/build-support") set(THIRDPARTY_DIR "${CMAKE_SOURCE_DIR}/thirdparty") +set(GTEST_VERSION "1.7.0") +set(GBENCHMARK_VERSION "1.0.0") +set(FLATBUFFERS_VERSION "1.3.0") + find_package(ClangTools) if ("$ENV{CMAKE_EXPORT_COMPILE_COMMANDS}" STREQUAL "1" OR CLANG_TIDY_FOUND) # Generate a Clang compile_commands.json "compilation database" file for use @@ -422,16 +427,6 @@ function(ADD_THIRDPARTY_LIB LIB_NAME) endif() endfunction() -## GTest -if ("$ENV{GTEST_HOME}" STREQUAL "") - set(GTest_HOME ${THIRDPARTY_DIR}/googletest-release-1.7.0) -endif() - -## Google Benchmark -if ("$ENV{GBENCHMARK_HOME}" STREQUAL "") - set(GBENCHMARK_HOME ${THIRDPARTY_DIR}/installed) -endif() - # ---------------------------------------------------------------------- # Add Boost dependencies (code adapted from Apache Kudu (incubating)) @@ -476,18 +471,78 @@ include_directories(SYSTEM ${Boost_INCLUDE_DIR}) if(ARROW_BUILD_TESTS) add_custom_target(unittest ctest -L unittest) - find_package(GTest REQUIRED) + + if("$ENV{GTEST_HOME}" STREQUAL "") + if(APPLE) + set(GTEST_CMAKE_CXX_FLAGS "-fPIC -std=c++11 -stdlib=libc++ -DGTEST_USE_OWN_TR1_TUPLE=1 -Wno-unused-value -Wno-ignored-attributes") + else() + set(GTEST_CMAKE_CXX_FLAGS "-fPIC") + endif() + + ExternalProject_Add(googletest_ep + URL "https://github.com/google/googletest/archive/release-${GTEST_VERSION}.tar.gz" + CMAKE_ARGS -DCMAKE_CXX_FLAGS=${GTEST_CMAKE_CXX_FLAGS} + # googletest doesn't define install rules, so just build in the + # source dir and don't try to install. See its README for + # details. 
+ BUILD_IN_SOURCE 1 + INSTALL_COMMAND "") + + set(GTEST_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/googletest_ep-prefix/src/googletest_ep") + set(GTEST_INCLUDE_DIR "${GTEST_PREFIX}/include") + set(GTEST_STATIC_LIB "${GTEST_PREFIX}/libgtest.a") + set(GTEST_VENDORED 1) + else() + find_package(GTest REQUIRED) + set(GTEST_VENDORED 0) + endif() + + message(STATUS "GTest include dir: ${GTEST_INCLUDE_DIR}") + message(STATUS "GTest static library: ${GTEST_STATIC_LIB}") include_directories(SYSTEM ${GTEST_INCLUDE_DIR}) ADD_THIRDPARTY_LIB(gtest STATIC_LIB ${GTEST_STATIC_LIB}) + + if(GTEST_VENDORED) + add_dependencies(gtest googletest_ep) + endif() endif() if(ARROW_BUILD_BENCHMARKS) add_custom_target(runbenchmark ctest -L benchmark) - find_package(GBenchmark REQUIRED) + + if("$ENV{GBENCHMARK_HOME}" STREQUAL "") + if(APPLE) + set(GBENCHMARK_CMAKE_CXX_FLAGS "-std=c++11 -stdlib=libc++") + else() + set(GBENCHMARK_CMAKE_CXX_FLAGS "--std=c++11") + endif() + + set(GBENCHMARK_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/gbenchmark_ep/src/gbenchmark_ep-install") + ExternalProject_Add(gbenchmark_ep + URL "https://github.com/google/benchmark/archive/v${GBENCHMARK_VERSION}.tar.gz" + CMAKE_ARGS + "-DCMAKE_BUILD_TYPE=Release" + "-DCMAKE_INSTALL_PREFIX:PATH=${GBENCHMARK_PREFIX}" + "-DCMAKE_CXX_FLAGS=-fPIC ${GBENCHMARK_CMAKE_CXX_FLAGS}") + + set(GBENCHMARK_INCLUDE_DIR "${GBENCHMARK_PREFIX}/include") + set(GBENCHMARK_STATIC_LIB "${GBENCHMARK_PREFIX}/lib/libbenchmark.a") + set(GBENCHMARK_VENDORED 1) + else() + find_package(GBenchmark REQUIRED) + set(GBENCHMARK_VENDORED 0) + endif() + + message(STATUS "GBenchmark include dir: ${GBENCHMARK_INCLUDE_DIR}") + message(STATUS "GBenchmark static library: ${GBENCHMARK_STATIC_LIB}") include_directories(SYSTEM ${GBENCHMARK_INCLUDE_DIR}) ADD_THIRDPARTY_LIB(benchmark STATIC_LIB ${GBENCHMARK_STATIC_LIB}) + + if(GBENCHMARK_VENDORED) + add_dependencies(benchmark gbenchmark_ep) + endif() endif() ## Google PerfTools @@ -705,14 +760,34 @@ add_subdirectory(src/arrow/types) ## Flatbuffers if(ARROW_IPC) - find_package(Flatbuffers REQUIRED) + if("$ENV{FLATBUFFERS_HOME}" STREQUAL "") + set(FLATBUFFERS_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/flatbuffers_ep-prefix/src/flatbuffers_ep-install") + ExternalProject_Add(flatbuffers_ep + URL "https://github.com/google/flatbuffers/archive/v${FLATBUFFERS_VERSION}.tar.gz" + CMAKE_ARGS + "-DCMAKE_CXX_FLAGS=-fPIC" + "-DCMAKE_INSTALL_PREFIX:PATH=${FLATBUFFERS_PREFIX}" + "-DFLATBUFFERS_BUILD_TESTS=OFF") + + set(FLATBUFFERS_INCLUDE_DIR "${FLATBUFFERS_PREFIX}/include") + set(FLATBUFFERS_STATIC_LIB "${FLATBUFFERS_PREFIX}/libflatbuffers.a") + set(FLATBUFFERS_COMPILER "${FLATBUFFERS_PREFIX}/bin/flatc") + set(FLATBUFFERS_VENDORED 1) + else() + find_package(Flatbuffers REQUIRED) + set(FLATBUFFERS_VENDORED 0) + endif() + message(STATUS "Flatbuffers include dir: ${FLATBUFFERS_INCLUDE_DIR}") message(STATUS "Flatbuffers static library: ${FLATBUFFERS_STATIC_LIB}") message(STATUS "Flatbuffers compiler: ${FLATBUFFERS_COMPILER}") include_directories(SYSTEM ${FLATBUFFERS_INCLUDE_DIR}) - add_library(flatbuffers STATIC IMPORTED) - set_target_properties(flatbuffers PROPERTIES - IMPORTED_LOCATION ${FLATBUFFERS_STATIC_LIB}) + ADD_THIRDPARTY_LIB(flatbuffers + STATIC_LIB ${FLATBUFFERS_STATIC_LIB}) + + if(FLATBUFFERS_VENDORED) + add_dependencies(flatbuffers flatbuffers_ep) + endif() add_subdirectory(src/arrow/ipc) endif() diff --git a/cpp/README.md b/cpp/README.md index a1c3ef28447f5..190e6f85b429d 100644 --- a/cpp/README.md +++ b/cpp/README.md @@ -22,23 +22,6 @@ out-of-source builds with the latter 
one being preferred. Arrow requires a C++11-enabled compiler. On Linux, gcc 4.8 and higher should be sufficient. -To build the thirdparty build dependencies, run: - -``` -./thirdparty/download_thirdparty.sh -./thirdparty/build_thirdparty.sh -source ./thirdparty/set_thirdparty_env.sh -``` - -You can also run from the root of the C++ tree - -``` -source setup_build_env.sh -``` - -Arrow is configured to use the `thirdparty` directory by default for its build -dependencies. To set up a custom toolchain see below. - Simple debug build: mkdir debug @@ -76,6 +59,7 @@ variables * Googletest: `GTEST_HOME` (only required to build the unit tests) * Google Benchmark: `GBENCHMARK_HOME` (only required if building benchmarks) * Flatbuffers: `FLATBUFFERS_HOME` (only required for the IPC extensions) +* Hadoop: `HADOOP_HOME` (only required for the HDFS I/O extensions) ## Continuous Integration diff --git a/cpp/conda.recipe/build.sh b/cpp/conda.recipe/build.sh index 6d7454e927265..0536fd99b5ca5 100644 --- a/cpp/conda.recipe/build.sh +++ b/cpp/conda.recipe/build.sh @@ -38,19 +38,9 @@ cd .. rm -rf conda-build mkdir conda-build - -cp -r thirdparty conda-build/ - cd conda-build pwd -# Build googletest for running unit tests -./thirdparty/download_thirdparty.sh -./thirdparty/build_thirdparty.sh gtest - -source thirdparty/versions.sh -export GTEST_HOME=`pwd`/thirdparty/$GTEST_BASEDIR - # if [ `uname` == Linux ]; then # SHARED_LINKER_FLAGS='-static-libstdc++' # elif [ `uname` == Darwin ]; then diff --git a/cpp/doc/Parquet.md b/cpp/doc/Parquet.md index 34b83e78d0a5c..4985dd3b0bc2d 100644 --- a/cpp/doc/Parquet.md +++ b/cpp/doc/Parquet.md @@ -24,7 +24,6 @@ export ARROW_HOME=$HOME/local git clone https://github.com/apache/parquet-cpp.git cd parquet-cpp -source setup_build_env.sh cmake -DCMAKE_INSTALL_PREFIX=$PARQUET_HOME -DPARQUET_ARROW=on make -j4 make install diff --git a/cpp/setup_build_env.sh b/cpp/setup_build_env.sh deleted file mode 100755 index 546216753b382..0000000000000 --- a/cpp/setup_build_env.sh +++ /dev/null @@ -1,21 +0,0 @@ -#!/bin/bash - -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. 
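The policy ARROW-189 encodes in CMake — use an externally provided dependency when its `*_HOME` variable is set, otherwise vendor it with `ExternalProject_Add` — reduces to a small decision rule. A hypothetical Python rendering of that rule (none of these names are Arrow or CMake APIs):

```python
import os

# Hypothetical model of the new CMakeLists.txt policy: if GTEST_HOME /
# GBENCHMARK_HOME / FLATBUFFERS_HOME is set, use that installation;
# if not, fetch and build the dependency (the ExternalProject analogue).
def resolve_dependency(env_var: str, vendored_build):
    home = os.environ.get(env_var, "")
    if home == "":  # mirrors: if("$ENV{GTEST_HOME}" STREQUAL "")
        return {"vendored": True, "prefix": vendored_build()}
    return {"vendored": False, "prefix": home}

dep = resolve_dependency("GTEST_HOME", lambda: "/build/googletest_ep-prefix")
print(dep)  # vendored unless GTEST_HOME was exported in the environment
```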
- -SOURCE_DIR=$(cd "$(dirname "${BASH_SOURCE:-$0}")"; pwd) - -./thirdparty/download_thirdparty.sh || { echo "download_thirdparty.sh failed" ; return; } -./thirdparty/build_thirdparty.sh || { echo "build_thirdparty.sh failed" ; return; } -source ./thirdparty/set_thirdparty_env.sh || { echo "source set_thirdparty_env.sh failed" ; return; } - -echo "Build env initialized" diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index 8dcd9ac107189..d2db339de7ea2 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -42,6 +42,9 @@ set(ARROW_IPC_SRCS add_library(arrow_ipc SHARED ${ARROW_IPC_SRCS} ) +if(FLATBUFFERS_VENDORED) + add_dependencies(arrow_ipc flatbuffers_ep) +endif() target_link_libraries(arrow_ipc LINK_PUBLIC ${ARROW_IPC_LINK_LIBS} LINK_PRIVATE ${ARROW_IPC_PRIVATE_LINK_LIBS}) @@ -91,10 +94,15 @@ foreach(FIL ${FBS_SRC}) list(APPEND ABS_FBS_SRC ${ABS_FIL}) endforeach() +if(FLATBUFFERS_VENDORED) + set(FBS_DEPENDS ${ABS_FBS_SRC} flatbuffers_ep) +else() + set(FBS_DEPENDS ${ABS_FBS_SRC}) +endif() add_custom_command( OUTPUT ${FBS_OUTPUT_FILES} COMMAND ${FLATBUFFERS_COMPILER} -c -o ${OUTPUT_DIR} ${ABS_FBS_SRC} - DEPENDS ${ABS_FBS_SRC} + DEPENDS ${FBS_DEPENDS} COMMENT "Running flatc compiler on ${ABS_FBS_SRC}" VERBATIM ) diff --git a/cpp/thirdparty/build_thirdparty.sh b/cpp/thirdparty/build_thirdparty.sh deleted file mode 100755 index 5011e29c01a71..0000000000000 --- a/cpp/thirdparty/build_thirdparty.sh +++ /dev/null @@ -1,104 +0,0 @@ -#!/bin/bash - -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. - -set -x -set -e -TP_DIR=$(cd "$(dirname "${BASH_SOURCE:-$0}")"; pwd) - -source $TP_DIR/versions.sh -PREFIX=$TP_DIR/installed - -################################################################################ - -if [ "$#" = "0" ]; then - F_ALL=1 -else - # Allow passing specific libs to build on the command line - for arg in "$*"; do - case $arg in - "gtest") F_GTEST=1 ;; - "gbenchmark") F_GBENCHMARK=1 ;; - "flatbuffers") F_FLATBUFFERS=1 ;; - *) echo "Unknown module: $arg"; exit 1 ;; - esac - done -fi - -################################################################################ - -# Determine how many parallel jobs to use for make based on the number of cores -if [[ "$OSTYPE" =~ ^linux ]]; then - PARALLEL=$(grep -c processor /proc/cpuinfo) -elif [[ "$OSTYPE" == "darwin"* ]]; then - PARALLEL=$(sysctl -n hw.ncpu) -else - echo Unsupported platform $OSTYPE - exit 1 -fi - -mkdir -p "$PREFIX/include" -mkdir -p "$PREFIX/lib" - -# On some systems, autotools installs libraries to lib64 rather than lib. Fix -# this by setting up lib64 as a symlink to lib. We have to do this step first -# to handle cases where one third-party library depends on another. -ln -sf lib "$PREFIX/lib64" - -# use the compiled tools -export PATH=$PREFIX/bin:$PATH - -type cmake >/dev/null 2>&1 || { echo >&2 "cmake not installed. Aborting."; exit 1; } -type make >/dev/null 2>&1 || { echo >&2 "make not installed. 
Aborting."; exit 1; } - -STANDARD_DARWIN_FLAGS="-std=c++11 -stdlib=libc++" - -# build googletest -GOOGLETEST_ERROR="failed for googletest!" -if [ -n "$F_ALL" -o -n "$F_GTEST" ]; then - cd $TP_DIR/$GTEST_BASEDIR - - if [[ "$OSTYPE" == "darwin"* ]]; then - CXXFLAGS=-fPIC cmake -DCMAKE_CXX_FLAGS="$STANDARD_DARWIN_FLAGS -DGTEST_USE_OWN_TR1_TUPLE=1 -Wno-unused-value -Wno-ignored-attributes" || { echo "cmake $GOOGLETEST_ERROR" ; exit 1; } - else - CXXFLAGS=-fPIC cmake . || { echo "cmake $GOOGLETEST_ERROR"; exit 1; } - fi - - make -j$PARALLEL VERBOSE=1 || { echo "Make $GOOGLETEST_ERROR" ; exit 1; } -fi - -# build google benchmark -GBENCHMARK_ERROR="failed for google benchmark" -if [ -n "$F_ALL" -o -n "$F_GBENCHMARK" ]; then - cd $TP_DIR/$GBENCHMARK_BASEDIR - - CMAKE_CXX_FLAGS="--std=c++11" - if [[ "$OSTYPE" == "darwin"* ]]; then - CMAKE_CXX_FLAGS=$STANDARD_DARWIN_FLAGS - fi - cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$PREFIX -DCMAKE_CXX_FLAGS="-fPIC $CMAKE_CXX_FLAGS" . || { echo "cmake $GBENCHMARK_ERROR" ; exit 1; } - - make -j$PARALLEL VERBOSE=1 install || { echo "make $GBENCHMARK_ERROR" ; exit 1; } -fi - -FLATBUFFERS_ERROR="failed for flatbuffers" -if [ -n "$F_ALL" -o -n "$F_FLATBUFFERS" ]; then - cd $TP_DIR/$FLATBUFFERS_BASEDIR - - CXXFLAGS=-fPIC cmake -DCMAKE_INSTALL_PREFIX:PATH=$PREFIX -DFLATBUFFERS_BUILD_TESTS=OFF . || { echo "cmake $FLATBUFFERS_ERROR" ; exit 1; } - make VERBOSE=1 -j$PARALLEL || { echo "make $FLATBUFFERS_ERROR" ; exit 1; } - make install || { echo "install $FLATBUFFERS_ERROR" ; exit 1; } -fi - -echo "---------------------" -echo "Thirdparty dependencies built and installed into $PREFIX successfully" diff --git a/cpp/thirdparty/download_thirdparty.sh b/cpp/thirdparty/download_thirdparty.sh deleted file mode 100755 index b50e7bc06a14c..0000000000000 --- a/cpp/thirdparty/download_thirdparty.sh +++ /dev/null @@ -1,44 +0,0 @@ -#!/bin/bash - -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. - -set -x -set -e - -TP_DIR=$(cd "$(dirname "${BASH_SOURCE:-$0}")"; pwd) - -source $TP_DIR/versions.sh - -download_extract_and_cleanup() { - type curl >/dev/null 2>&1 || { echo >&2 "curl not installed. Aborting."; exit 1; } - filename=$TP_DIR/$(basename "$1") - curl -#LC - "$1" -o $filename - tar xzf $filename -C $TP_DIR - rm $filename -} - -if [ ! -d ${GTEST_BASEDIR} ]; then - echo "Fetching gtest" - download_extract_and_cleanup $GTEST_URL -fi - -echo ${GBENCHMARK_BASEDIR} -if [ ! -d ${GBENCHMARK_BASEDIR} ]; then - echo "Fetching google benchmark" - download_extract_and_cleanup $GBENCHMARK_URL -fi - -if [ ! 
-d ${FLATBUFFERS_BASEDIR} ]; then - echo "Fetching flatbuffers" - download_extract_and_cleanup $FLATBUFFERS_URL -fi diff --git a/cpp/thirdparty/set_thirdparty_env.sh b/cpp/thirdparty/set_thirdparty_env.sh deleted file mode 100755 index 135972ee9bdce..0000000000000 --- a/cpp/thirdparty/set_thirdparty_env.sh +++ /dev/null @@ -1,24 +0,0 @@ -#!/usr/bash - -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. - -SOURCE_DIR=$(cd "$(dirname "${BASH_SOURCE:-$0}")"; pwd) -source $SOURCE_DIR/versions.sh - -if [ -z "$THIRDPARTY_DIR" ]; then - THIRDPARTY_DIR=$SOURCE_DIR -fi - -export GTEST_HOME=$THIRDPARTY_DIR/$GTEST_BASEDIR -export GBENCHMARK_HOME=$THIRDPARTY_DIR/installed -export FLATBUFFERS_HOME=$THIRDPARTY_DIR/installed diff --git a/cpp/thirdparty/versions.sh b/cpp/thirdparty/versions.sh deleted file mode 100755 index a7b21e19fccd6..0000000000000 --- a/cpp/thirdparty/versions.sh +++ /dev/null @@ -1,23 +0,0 @@ -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. 
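Looping back to the constant-rename patch at the top of this section: the `bit-util.h` helpers that now use `kBitmask` implement Arrow's one-bit-per-slot validity bitmaps, and the builder tests rely on a capacity rule of "floor at `kMinBuilderCapacity`, round up to a power of two". A Python sketch of both behaviors (illustrative only, not Arrow APIs):

```python
# Python mirror of the bit-util.h helpers (illustrative, not Arrow APIs).
K_BITMASK = [1, 2, 4, 8, 16, 32, 64, 128]  # one mask per bit of a byte

def get_bit(bits: bytearray, i: int) -> bool:
    return bool(bits[i // 8] & K_BITMASK[i % 8])

def set_bit(bits: bytearray, i: int) -> None:
    bits[i // 8] |= K_BITMASK[i % 8]

def clear_bit(bits: bytearray, i: int) -> None:
    bits[i // 8] &= ~K_BITMASK[i % 8] & 0xFF

def next_power2(n: int) -> int:
    # Smallest power of two >= n, mirroring util::next_power2.
    return 1 << (n - 1).bit_length()

bits = bytearray(2)  # a 16-slot validity bitmap, all bits clear
set_bit(bits, 9)
assert get_bit(bits, 9) and not get_bit(bits, 8)
clear_bit(bits, 9)
assert not get_bit(bits, 9)

# The arithmetic behind TestReserve: 100 occupied slots plus a
# Reserve(kMinBuilderCapacity) request round up to next_power2(132) == 256.
assert next_power2(32 + 100) == 256
```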
- -GTEST_VERSION=1.7.0 -GTEST_URL="https://github.com/google/googletest/archive/release-${GTEST_VERSION}.tar.gz" -GTEST_BASEDIR=googletest-release-$GTEST_VERSION - -GBENCHMARK_VERSION=1.0.0 -GBENCHMARK_URL="https://github.com/google/benchmark/archive/v${GBENCHMARK_VERSION}.tar.gz" -GBENCHMARK_BASEDIR=benchmark-$GBENCHMARK_VERSION - -FLATBUFFERS_VERSION=1.3.0 -FLATBUFFERS_URL="https://github.com/google/flatbuffers/archive/v${FLATBUFFERS_VERSION}.tar.gz" -FLATBUFFERS_BASEDIR=flatbuffers-$FLATBUFFERS_VERSION From caa843bdaf395b915a739bf5e1d6c5eabe1f4693 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Tue, 11 Oct 2016 17:29:25 -0700 Subject: [PATCH 0169/1644] ARROW-333: Make writers update their internal schema even when no data is written Make PromotableWriter predefine writers when asked Author: Julien Le Dem Closes #170 from julienledem/promotable_writer_preset and squashes the following commits: 972eb9c [Julien Le Dem] ARROW-333: Make writers update their internal schema even when no data is written Make PromotableWriter predefine writers when asked --- .../main/codegen/templates/MapWriters.java | 15 +++++++++ .../main/codegen/templates/UnionWriter.java | 24 ++++++++++++++ .../vector/complex/impl/PromotableWriter.java | 14 ++++---- .../complex/writer/TestComplexWriter.java | 32 +++++++++++++++++-- 4 files changed, 76 insertions(+), 9 deletions(-) diff --git a/java/vector/src/main/codegen/templates/MapWriters.java b/java/vector/src/main/codegen/templates/MapWriters.java index 696bbf655cac9..51327b43af0fa 100644 --- a/java/vector/src/main/codegen/templates/MapWriters.java +++ b/java/vector/src/main/codegen/templates/MapWriters.java @@ -112,6 +112,11 @@ public MapWriter map(String name) { } writer.setPosition(idx()); fields.put(name.toLowerCase(), writer); + } else { + if (writer instanceof PromotableWriter) { + // ensure writers are initialized + ((PromotableWriter)writer).getWriter(MinorType.MAP); + } } return writer; } @@ -149,6 +154,11 @@ public ListWriter list(String name) { } writer.setPosition(idx()); fields.put(name.toLowerCase(), writer); + } else { + if (writer instanceof PromotableWriter) { + // ensure writers are initialized + ((PromotableWriter)writer).getWriter(MinorType.LIST); + } } return writer; } @@ -210,6 +220,11 @@ public void end() { } writer.setPosition(idx()); fields.put(name.toLowerCase(), writer); + } else { + if (writer instanceof PromotableWriter) { + // ensure writers are initialized + ((PromotableWriter)writer).getWriter(MinorType.${upperName}); + } } return writer; } diff --git a/java/vector/src/main/codegen/templates/UnionWriter.java b/java/vector/src/main/codegen/templates/UnionWriter.java index 460ec1c0d9586..efb66f168f5f8 100644 --- a/java/vector/src/main/codegen/templates/UnionWriter.java +++ b/java/vector/src/main/codegen/templates/UnionWriter.java @@ -25,6 +25,8 @@ package org.apache.arrow.vector.complex.impl; <#include "/@includes/vv_imports.ftl" /> +import org.apache.arrow.vector.complex.writer.BaseWriter; +import org.apache.arrow.vector.types.Types.MinorType; /* * This class is generated using freemarker and the ${.template_name} template. 
@@ -100,6 +102,28 @@ public ListWriter asList() { return getListWriter(); } + BaseWriter getWriter(MinorType minorType) { + switch (minorType) { + case MAP: + return getMapWriter(); + case LIST: + return getListWriter(); + <#list vv.types as type> + <#list type.minor as minor> + <#assign name = minor.class?cap_first /> + <#assign fields = minor.fields!type.fields /> + <#assign uncappedName = name?uncap_first/> + <#if !minor.class?starts_with("Decimal")> + case ${name?upper_case}: + return get${name}Writer(); + + + + default: + throw new UnsupportedOperationException("Unknown type: " + minorType); + } + } + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> <#assign fields = minor.fields!type.fields /> <#assign uncappedName = name?uncap_first/> diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java index c282688530b87..94ff82c04bd18 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java @@ -94,19 +94,19 @@ public void setPosition(int index) { protected FieldWriter getWriter(MinorType type) { if (state == State.UNION) { - return writer; - } - if (state == State.UNTYPED) { + ((UnionWriter)writer).getWriter(type); + } else if (state == State.UNTYPED) { if (type == null) { + // ??? return null; } ValueVector v = listVector.addOrGetVector(type).getVector(); v.allocateNew(); setWriter(v); writer.setPosition(position); - } - if (type != this.type) { - return promoteToUnion(); + } else if (type != this.type) { + promoteToUnion(); + ((UnionWriter)writer).getWriter(type); } return writer; } @@ -133,7 +133,7 @@ private FieldWriter promoteToUnion() { unionVector.addVector((FieldVector)tp.getTo()); writer = new UnionWriter(unionVector); writer.setPosition(idx()); - for (int i = 0; i < idx(); i++) { + for (int i = 0; i <= idx(); i++) { unionVector.getMutator().setType(i, vector.getMinorType()); } vector = null; diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java index 398aea915b343..9419f88de5b74 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java @@ -45,6 +45,7 @@ import org.apache.arrow.vector.types.pojo.ArrowType.Union; import org.apache.arrow.vector.types.pojo.ArrowType.Utf8; import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.util.Text; import org.junit.Assert; import org.junit.Test; @@ -362,11 +363,38 @@ public void promotableWriter() { MapReader rootReader = new SingleMapReaderImpl(parent).reader("root"); for (int i = 0; i < 100; i++) { rootReader.setPosition(i); - Assert.assertEquals(i, rootReader.reader("a").readLong().intValue()); + FieldReader reader = rootReader.reader("a"); + Long value = reader.readLong(); + Assert.assertNotNull("index: " + i, value); + Assert.assertEquals(i, value.intValue()); } for (int i = 100; i < 200; i++) { rootReader.setPosition(i); - Assert.assertEquals(Integer.toString(i), rootReader.reader("a").readText().toString()); + FieldReader reader = rootReader.reader("a"); + Text value = reader.readText(); + Assert.assertEquals(Integer.toString(i), value.toString()); } 
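The promotion logic amended here can be read as a three-state machine — untyped, single-typed, union — where requesting a writer of a second distinct type triggers promotion; the new `promotableWriterSchema` test just below asserts exactly that shape (a bigint writer then a varchar writer for field "a" yields a union field). A toy Python model of the states (illustrative names, not Arrow APIs):

```python
# Toy model of the PromotableWriter state machine (not an Arrow API).
UNTYPED, SINGLE, UNION = "untyped", "single", "union"

class PromotableField:
    def __init__(self):
        self.state = UNTYPED
        self.types = []

    def get_writer(self, minor_type):
        if self.state == UNTYPED:
            self.state, self.types = SINGLE, [minor_type]
        elif minor_type not in self.types:
            self.state = UNION            # promote on a second distinct type
            self.types.append(minor_type)
        return self  # the real code returns a concrete FieldWriter

field = PromotableField()
field.get_writer("BIGINT")   # schema so far: a plain bigint field
field.get_writer("VARCHAR")  # schema now: union<bigint, varchar>
assert field.state == UNION and field.types == ["BIGINT", "VARCHAR"]
```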
} + + /** + * Even without writing to the writer, the union schema is created correctly + */ + @Test + public void promotableWriterSchema() { + MapVector parent = new MapVector("parent", allocator, null); + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + BigIntWriter bigIntWriter = rootWriter.bigInt("a"); + VarCharWriter varCharWriter = rootWriter.varChar("a"); + + Field field = parent.getField().getChildren().get(0).getChildren().get(0); + Assert.assertEquals("a", field.getName()); + Assert.assertEquals(Union.TYPE_TYPE, field.getType().getTypeType()); + + Assert.assertEquals(Int.TYPE_TYPE, field.getChildren().get(0).getType().getTypeType()); + Int intType = (Int) field.getChildren().get(0).getType(); + Assert.assertEquals(64, intType.getBitWidth()); + Assert.assertTrue(intType.getIsSigned()); + Assert.assertEquals(Utf8.TYPE_TYPE, field.getChildren().get(1).getType().getTypeType()); + } } \ No newline at end of file From 3919a277884cf504fdca5d730cf128e36db6f700 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 11 Oct 2016 23:08:48 -0400 Subject: [PATCH 0170/1644] ARROW-332: Add RecordBatch.to_pandas method This makes testing and IPC data wrangling a little easier. Author: Wes McKinney Closes #165 from wesm/ARROW-332 and squashes the following commits: 5f19b97 [Wes McKinney] Add simple arrow::Array->NumPy-for-pandas conversion helper and RecordBatch.to_pandas --- python/pyarrow/includes/pyarrow.pxd | 7 +++-- python/pyarrow/io.pyx | 12 ++++++++ python/pyarrow/table.pyx | 25 ++++++++++++++-- python/pyarrow/tests/test_ipc.py | 40 ++++++++++++++++++++++++-- python/pyarrow/tests/test_table.py | 41 ++++++++++++++++++++------- python/src/pyarrow/adapters/pandas.cc | 19 +++++++++++-- python/src/pyarrow/adapters/pandas.h | 7 ++++- python/src/pyarrow/common.h | 4 +-- python/src/pyarrow/io.cc | 2 +- 9 files changed, 133 insertions(+), 24 deletions(-) diff --git a/python/pyarrow/includes/pyarrow.pxd b/python/pyarrow/includes/pyarrow.pxd index 2fa5a7d63256a..7c47f21854e33 100644 --- a/python/pyarrow/includes/pyarrow.pxd +++ b/python/pyarrow/includes/pyarrow.pxd @@ -50,8 +50,11 @@ cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil: PyStatus PandasMaskedToArrow(MemoryPool* pool, object ao, object mo, shared_ptr[CArray]* out) - PyStatus ArrowToPandas(const shared_ptr[CColumn]& arr, object py_ref, - PyObject** out) + PyStatus ConvertArrayToPandas(const shared_ptr[CArray]& arr, + object py_ref, PyObject** out) + + PyStatus ConvertColumnToPandas(const shared_ptr[CColumn]& arr, + object py_ref, PyObject** out) MemoryPool* get_memory_pool() diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index 00a492fc0baf2..8970e06effdd0 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -230,6 +230,18 @@ cdef class InMemoryOutputStream(NativeFile): return result +cdef class BufferReader(NativeFile): + cdef: + Buffer buffer + + def __cinit__(self, Buffer buffer): + self.buffer = buffer + self.rd_file.reset(new CBufferReader(buffer.buffer.get().data(), + buffer.buffer.get().size())) + self.is_readonly = 1 + self.is_open = True + + def buffer_from_bytes(object obj): """ Construct an Arrow buffer from a Python bytes object diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index a1cadcd1e0f69..969571262ca44 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -100,7 +100,7 @@ cdef class Column: import pandas as pd - check_status(pyarrow.ArrowToPandas(self.sp_column, self, &arr)) + 
check_status(pyarrow.ConvertColumnToPandas(self.sp_column, self, &arr)) return pd.Series(arr, name=self.name) cdef _check_nullptr(self): @@ -233,6 +233,27 @@ cdef class RecordBatch: return self.batch.Equals(deref(other.batch)) + def to_pandas(self): + """ + Convert the arrow::RecordBatch to a pandas DataFrame + """ + cdef: + PyObject* np_arr + shared_ptr[CArray] arr + Column column + + import pandas as pd + + names = [] + data = [] + for i in range(self.batch.num_columns()): + arr = self.batch.column(i) + check_status(pyarrow.ConvertArrayToPandas(arr, self, &np_arr)) + names.append(frombytes(self.batch.column_name(i))) + data.append( np_arr) + + return pd.DataFrame(dict(zip(names, data)), columns=names) + @classmethod def from_pandas(cls, df): """ @@ -354,7 +375,7 @@ cdef class Table: for i in range(self.table.num_columns()): col = self.table.column(i) column = self.column(i) - check_status(pyarrow.ArrowToPandas(col, column, &arr)) + check_status(pyarrow.ConvertColumnToPandas(col, column, &arr)) names.append(frombytes(col.get().name())) data.append( arr) diff --git a/python/pyarrow/tests/test_ipc.py b/python/pyarrow/tests/test_ipc.py index b9e9e6ed0c423..14cbb30d5d48b 100644 --- a/python/pyarrow/tests/test_ipc.py +++ b/python/pyarrow/tests/test_ipc.py @@ -18,6 +18,8 @@ import io import numpy as np + +from pandas.util.testing import assert_frame_equal import pandas as pd import pyarrow as A @@ -85,17 +87,40 @@ def test_ipc_file_simple_roundtrip(): helper.run() +def test_ipc_zero_copy_numpy(): + df = pd.DataFrame({'foo': [1.5]}) + + batch = A.RecordBatch.from_pandas(df) + sink = arrow_io.InMemoryOutputStream() + write_file(batch, sink) + buffer = sink.get_result() + reader = arrow_io.BufferReader(buffer) + + batches = read_file(reader) + + data = batches[0].to_pandas() + rdf = pd.DataFrame(data) + assert_frame_equal(df, rdf) + + # XXX: For benchmarking def big_batch(): + K = 2**4 + N = 2**20 df = pd.DataFrame( - np.random.randn(2**4, 2**20).T, - columns=[str(i) for i in range(2**4)] + np.random.randn(K, N).T, + columns=[str(i) for i in range(K)] ) df = pd.concat([df] * 2 ** 3, ignore_index=True) + return df + - return A.RecordBatch.from_pandas(df) +def write_to_memory2(batch): + sink = arrow_io.InMemoryOutputStream() + write_file(batch, sink) + return sink.get_result() def write_to_memory(batch): @@ -114,3 +139,12 @@ def read_file(source): reader = ipc.ArrowFileReader(source) return [reader.get_record_batch(i) for i in range(reader.num_record_batches)] + +# df = big_batch() +# batch = A.RecordBatch.from_pandas(df) +# mem = write_to_memory(batch) +# batches = read_file(mem) +# data = batches[0].to_pandas() +# rdf = pd.DataFrame(data) + +# [x.to_pandas() for x in batches] diff --git a/python/pyarrow/tests/test_table.py b/python/pyarrow/tests/test_table.py index c5130329e02bc..4c9d302106af8 100644 --- a/python/pyarrow/tests/test_table.py +++ b/python/pyarrow/tests/test_table.py @@ -15,28 +15,47 @@ # specific language governing permissions and limitations # under the License. 
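Taken together, the pieces this patch adds — `BufferReader` on the I/O side and `RecordBatch.to_pandas` on the table side — give a compact in-memory round trip, condensed below from the new `test_ipc_zero_copy_numpy` test. `read_file` is reproduced from the test module; `write_file` is the test module's companion helper whose definition lies outside this hunk, so it is left opaque here:

```python
import pandas as pd
from pandas.util.testing import assert_frame_equal

import pyarrow as A
import pyarrow.io as arrow_io   # import paths assumed from the test module
import pyarrow.ipc as ipc       # import paths assumed from the test module

def read_file(source):
    # Helper shown in test_ipc.py: collect every batch in the file.
    reader = ipc.ArrowFileReader(source)
    return [reader.get_record_batch(i)
            for i in range(reader.num_record_batches)]

df = pd.DataFrame({'foo': [1.5]})
batch = A.RecordBatch.from_pandas(df)

sink = arrow_io.InMemoryOutputStream()
write_file(batch, sink)  # test helper: writes the Arrow file format

# BufferReader (new in this patch) wraps the result without copying.
reader = arrow_io.BufferReader(sink.get_result())
rdf = read_file(reader)[0].to_pandas()  # to_pandas() is new in this patch
assert_frame_equal(df, rdf)
```

The same patch also corrects `NumPyBuffer` to report its size in bytes (`PyArray_SIZE` times the element size) rather than in elements — see the `common.h` hunk below.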
-import pyarrow as A +import numpy as np + +from pandas.util.testing import assert_frame_equal +import pandas as pd + +import pyarrow as pa def test_recordbatch_basics(): data = [ - A.from_pylist(range(5)), - A.from_pylist([-10, -5, 0, 5, 10]) + pa.from_pylist(range(5)), + pa.from_pylist([-10, -5, 0, 5, 10]) ] - batch = A.RecordBatch.from_arrays(['c0', 'c1'], data) + batch = pa.RecordBatch.from_arrays(['c0', 'c1'], data) assert len(batch) == 5 assert batch.num_rows == 5 assert batch.num_columns == len(data) +def test_recordbatch_from_to_pandas(): + data = pd.DataFrame({ + 'c1': np.array([1, 2, 3, 4, 5], dtype='int64'), + 'c2': np.array([1, 2, 3, 4, 5], dtype='uint32'), + 'c2': np.random.randn(5), + 'c3': ['foo', 'bar', None, 'baz', 'qux'], + 'c4': [False, True, False, True, False] + }) + + batch = pa.RecordBatch.from_pandas(data) + result = batch.to_pandas() + assert_frame_equal(data, result) + + def test_table_basics(): data = [ - A.from_pylist(range(5)), - A.from_pylist([-10, -5, 0, 5, 10]) + pa.from_pylist(range(5)), + pa.from_pylist([-10, -5, 0, 5, 10]) ] - table = A.Table.from_arrays(('a', 'b'), data, 'table_name') + table = pa.Table.from_arrays(('a', 'b'), data, 'table_name') assert table.name == 'table_name' assert len(table) == 5 assert table.num_rows == 5 @@ -50,15 +69,15 @@ def test_table_basics(): def test_table_pandas(): data = [ - A.from_pylist(range(5)), - A.from_pylist([-10, -5, 0, 5, 10]) + pa.from_pylist(range(5)), + pa.from_pylist([-10, -5, 0, 5, 10]) ] - table = A.Table.from_arrays(('a', 'b'), data, 'table_name') + table = pa.Table.from_arrays(('a', 'b'), data, 'table_name') # TODO: Use this part once from_pandas is implemented # data = {'a': range(5), 'b': [-10, -5, 0, 5, 10]} # df = pd.DataFrame(data) - # A.Table.from_pandas(df) + # pa.Table.from_pandas(df) df = table.to_pandas() assert set(df.columns) == set(('a', 'b')) diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index ae24b7ee5847b..b2fcd37aec944 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -21,6 +21,8 @@ #include "pyarrow/numpy_interop.h" +#include "pyarrow/adapters/pandas.h" + #include #include #include @@ -38,6 +40,7 @@ namespace pyarrow { using arrow::Array; using arrow::Column; +using arrow::Field; using arrow::DataType; namespace util = arrow::util; @@ -106,7 +109,7 @@ struct npy_traits { template <> struct npy_traits { - typedef double value_type; + typedef int64_t value_type; using TypeClass = arrow::TimestampType; static constexpr bool supports_nulls = true; @@ -163,6 +166,8 @@ class ArrowSerializer { Status ConvertData(); Status ConvertObjectStrings(std::shared_ptr* out) { + PyAcquireGIL lock; + PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); arrow::TypePtr string_type(new arrow::StringType()); arrow::StringBuilder string_builder(pool_, string_type); @@ -197,6 +202,8 @@ class ArrowSerializer { } Status ConvertBooleans(std::shared_ptr* out) { + PyAcquireGIL lock; + PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); int nbytes = util::bytes_for_bits(length_); @@ -798,7 +805,15 @@ class ArrowDeserializer { } \ break; -Status ArrowToPandas(const std::shared_ptr& col, PyObject* py_ref, +Status ConvertArrayToPandas(const std::shared_ptr& arr, PyObject* py_ref, + PyObject** out) { + static std::string dummy_name = "dummy"; + auto field = std::make_shared(dummy_name, arr->type()); + auto col = std::make_shared(field, arr); + return ConvertColumnToPandas(col, py_ref, out); +} + +Status 
ConvertColumnToPandas(const std::shared_ptr& col, PyObject* py_ref, PyObject** out) { switch(col->type()->type) { FROM_ARROW_CASE(BOOL); diff --git a/python/src/pyarrow/adapters/pandas.h b/python/src/pyarrow/adapters/pandas.h index c3377685bcce9..141d1219e64db 100644 --- a/python/src/pyarrow/adapters/pandas.h +++ b/python/src/pyarrow/adapters/pandas.h @@ -31,6 +31,7 @@ namespace arrow { class Array; class Column; +class MemoryPool; } // namespace arrow @@ -39,7 +40,11 @@ namespace pyarrow { class Status; PYARROW_EXPORT -Status ArrowToPandas(const std::shared_ptr& col, PyObject* py_ref, +Status ConvertArrayToPandas(const std::shared_ptr& arr, PyObject* py_ref, + PyObject** out); + +PYARROW_EXPORT +Status ConvertColumnToPandas(const std::shared_ptr& col, PyObject* py_ref, PyObject** out); PYARROW_EXPORT diff --git a/python/src/pyarrow/common.h b/python/src/pyarrow/common.h index 96eed1654a777..50c2577b93c9b 100644 --- a/python/src/pyarrow/common.h +++ b/python/src/pyarrow/common.h @@ -120,8 +120,8 @@ class PYARROW_EXPORT NumPyBuffer : public arrow::Buffer { Py_INCREF(arr); data_ = reinterpret_cast(PyArray_DATA(arr_)); - size_ = PyArray_SIZE(arr_); - capacity_ = size_ * PyArray_DESCR(arr_)->elsize; + size_ = PyArray_SIZE(arr_) * PyArray_DESCR(arr_)->elsize; + capacity_ = size_; } virtual ~NumPyBuffer() { diff --git a/python/src/pyarrow/io.cc b/python/src/pyarrow/io.cc index 9879b3474bcd0..7bf32ffa8d22b 100644 --- a/python/src/pyarrow/io.cc +++ b/python/src/pyarrow/io.cc @@ -85,7 +85,7 @@ arrow::Status PythonFile::Write(const uint8_t* data, int64_t nbytes) { ARROW_RETURN_NOT_OK(CheckPyError()); PyObject* result = PyObject_CallMethod(file_, "write", "(O)", py_data); - Py_DECREF(py_data); + Py_XDECREF(py_data); Py_XDECREF(result); ARROW_RETURN_NOT_OK(CheckPyError()); return arrow::Status::OK(); From bf749f55a1e24d79b08813a39ce51e9aaf6fb425 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Tue, 11 Oct 2016 20:11:48 -0700 Subject: [PATCH 0171/1644] ARROW-275: Add tests for UnionVector in Arrow File Author: Julien Le Dem Closes #169 from julienledem/union_test and squashes the following commits: 120f504 [Julien Le Dem] ARROW-275: Add tests for UnionVector in Arrow File --- .../main/codegen/templates/UnionReader.java | 4 + .../main/codegen/templates/UnionVector.java | 30 ++--- .../org/apache/arrow/vector/VectorLoader.java | 2 + .../arrow/vector/schema/TypeLayout.java | 3 +- .../arrow/vector/file/TestArrowFile.java | 110 +++++++++++++++++- 5 files changed, 127 insertions(+), 22 deletions(-) diff --git a/java/vector/src/main/codegen/templates/UnionReader.java b/java/vector/src/main/codegen/templates/UnionReader.java index 7351ae3776f57..c56e95c89dc81 100644 --- a/java/vector/src/main/codegen/templates/UnionReader.java +++ b/java/vector/src/main/codegen/templates/UnionReader.java @@ -134,6 +134,10 @@ public void copyAsValue(UnionWriter writer) { + public int size() { + return getReaderForIndex(idx()).size(); + } + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> <#assign uncappedName = name?uncap_first/> <#assign boxedType = (minor.boxedType!type.boxedType) /> diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java index b14314d2b0dbb..5ca3f90148449 100644 --- a/java/vector/src/main/codegen/templates/UnionVector.java +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -15,17 +15,6 @@ * See the License for the specific language governing permissions and * limitations under the 
License. */ - -import com.google.common.collect.ImmutableList; -import com.google.flatbuffers.FlatBufferBuilder; -import io.netty.buffer.ArrowBuf; -import org.apache.arrow.flatbuf.Field; -import org.apache.arrow.flatbuf.Type; -import org.apache.arrow.flatbuf.Union; -import org.apache.arrow.vector.ValueVector; -import org.apache.arrow.vector.types.pojo.ArrowType; - -import java.util.ArrayList; import java.util.List; <@pp.dropOutputFile /> @@ -39,7 +28,9 @@ <#include "/@includes/vv_imports.ftl" /> import com.google.common.collect.ImmutableList; import java.util.ArrayList; +import java.util.Collections; import java.util.Iterator; +import org.apache.arrow.vector.BaseDataValueVector; import org.apache.arrow.vector.complex.impl.ComplexCopier; import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.schema.ArrowFieldNode; @@ -47,6 +38,7 @@ import static org.apache.arrow.flatbuf.UnionMode.Sparse; + /* * This class is generated using freemarker and the ${.template_name} template. */ @@ -81,6 +73,7 @@ public class UnionVector implements FieldVector { private ValueVector singleVector; private final CallBack callBack; + private final List innerVectors; public UnionVector(String name, BufferAllocator allocator, CallBack callBack) { this.name = name; @@ -88,6 +81,7 @@ public UnionVector(String name, BufferAllocator allocator, CallBack callBack) { this.internalMap = new MapVector("internal", allocator, callBack); this.typeVector = new UInt1Vector("types", allocator); this.callBack = callBack; + this.innerVectors = Collections.unmodifiableList(Arrays.asList(typeVector)); } public BufferAllocator getAllocator() { @@ -101,30 +95,28 @@ public MinorType getMinorType() { @Override public void initializeChildrenFromFields(List children) { - getMap().initializeChildrenFromFields(children); + internalMap.initializeChildrenFromFields(children); } @Override public List getChildrenFromFields() { - return getMap().getChildrenFromFields(); + return internalMap.getChildrenFromFields(); } @Override public void loadFieldBuffers(ArrowFieldNode fieldNode, List ownBuffers) { - // TODO - throw new UnsupportedOperationException(); + BaseDataValueVector.load(getFieldInnerVectors(), ownBuffers); + this.valueCount = fieldNode.getLength(); } @Override public List getFieldBuffers() { - // TODO - throw new UnsupportedOperationException(); + return BaseDataValueVector.unload(getFieldInnerVectors()); } @Override public List getFieldInnerVectors() { - // TODO - throw new UnsupportedOperationException(); + return this.innerVectors; } public NullableMapVector getMap() { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java b/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java index 58ac68b82825d..b7040da9d8203 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java @@ -74,6 +74,8 @@ public void load(ArrowRecordBatch recordBatch) { } private void loadBuffers(FieldVector vector, Field field, Iterator buffers, Iterator nodes) { + checkArgument(nodes.hasNext(), + "no more field nodes for field " + field + " and vector " + vector); ArrowFieldNode fieldNode = nodes.next(); List typeLayout = field.getTypeLayout().getVectors(); List ownBuffers = new ArrayList<>(typeLayout.size()); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java index 06ae203bf4422..c5f53fe508d9f 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java @@ -82,8 +82,7 @@ public static TypeLayout getTypeLayout(final ArrowType arrowType) { break; case UnionMode.Sparse: vectors = asList( - validityVector(), - typeVector() + typeVector() // type of the value at the index or 0 if null ); break; default: diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java index 7a5e7b58db98c..0f28d53295c37 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java @@ -266,7 +266,7 @@ private void validateComplexContent(int count, NullableMapVector parent) { Assert.assertEquals(i % 3, rootReader.reader("list").size()); NullableTimeStampHolder h = new NullableTimeStampHolder(); rootReader.reader("map").reader("timestamp").read(h); - Assert.assertEquals(i, h.value % COUNT); + Assert.assertEquals(i, h.value); } } @@ -339,4 +339,112 @@ public void testWriteReadMultipleRBs() throws IOException { } } + @Test + public void testWriteReadUnion() throws IOException { + File file = new File("target/mytest_write_union.arrow"); + int count = COUNT; + try ( + BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + NullableMapVector parent = new NullableMapVector("parent", vectorAllocator, null)) { + + writeUnionData(count, parent); + + printVectors(parent.getChildrenFromFields()); + + validateUnionData(count, parent); + + write(parent.getChild("root"), file); + } + // read + try ( + BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + FileInputStream fileInputStream = new FileInputStream(file); + ArrowReader arrowReader = new ArrowReader(fileInputStream.getChannel(), readerAllocator); + BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); + NullableMapVector parent = new NullableMapVector("parent", vectorAllocator, null) + ) { + ArrowFooter footer = arrowReader.readFooter(); + Schema schema = footer.getSchema(); + LOGGER.debug("reading schema: " + schema); + + // initialize vectors + + NullableMapVector root = parent.addOrGet("root", MinorType.MAP, NullableMapVector.class); + VectorLoader vectorLoader = new VectorLoader(schema, root); + + List recordBatches = footer.getRecordBatches(); + for (ArrowBlock rbBlock : recordBatches) { + try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { + vectorLoader.load(recordBatch); + } + validateUnionData(count, parent); + } + } + } + + public void validateUnionData(int count, MapVector parent) { + MapReader rootReader = new SingleMapReaderImpl(parent).reader("root"); + for (int i = 0; i < count; i++) { + rootReader.setPosition(i); + switch (i % 4) { + case 0: + Assert.assertEquals(i, rootReader.reader("union").readInteger().intValue()); + break; + case 1: + Assert.assertEquals(i, rootReader.reader("union").readLong().longValue()); + break; + case 2: + Assert.assertEquals(i % 3, rootReader.reader("union").size()); + break; + case 3: + NullableTimeStampHolder h = new NullableTimeStampHolder(); + rootReader.reader("union").reader("timestamp").read(h); + Assert.assertEquals(i, h.value); + break; + } + } + } + + public void writeUnionData(int count, NullableMapVector parent) { + ArrowBuf varchar = 
allocator.buffer(3); + varchar.readerIndex(0); + varchar.setByte(0, 'a'); + varchar.setByte(1, 'b'); + varchar.setByte(2, 'c'); + varchar.writerIndex(3); + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + IntWriter intWriter = rootWriter.integer("union"); + BigIntWriter bigIntWriter = rootWriter.bigInt("union"); + ListWriter listWriter = rootWriter.list("union"); + MapWriter mapWriter = rootWriter.map("union"); + for (int i = 0; i < count; i++) { + switch (i % 4) { + case 0: + intWriter.setPosition(i); + intWriter.writeInt(i); + break; + case 1: + bigIntWriter.setPosition(i); + bigIntWriter.writeBigInt(i); + break; + case 2: + listWriter.setPosition(i); + listWriter.startList(); + for (int j = 0; j < i % 3; j++) { + listWriter.varChar().writeVarChar(0, 3, varchar); + } + listWriter.endList(); + break; + case 3: + mapWriter.setPosition(i); + mapWriter.start(); + mapWriter.timeStamp("timestamp").writeTimeStamp(i); + mapWriter.end(); + break; + } + } + writer.setValueCount(count); + varchar.release(); + } } From 4ecf327636c1373f601679fac18b7fcf7f382e1b Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sun, 16 Oct 2016 16:21:59 -0400 Subject: [PATCH 0172/1644] ARROW-191: Python: Provide infrastructure for manylinux1 wheels Author: Uwe L. Korn Closes #173 from xhochy/ARROW-191 and squashes the following commits: 278f8b0 [Uwe L. Korn] ARROW-191: Python: Provide infrastructure for manylinux1 wheels --- NOTICE.txt | 3 + .../Dockerfile-parquet_arrow-base-x86_64 | 40 ++++++++++ python/manylinux1/Dockerfile-x86_64 | 47 ++++++++++++ python/manylinux1/README.md | 40 ++++++++++ python/manylinux1/build_arrow.sh | 76 +++++++++++++++++++ 5 files changed, 206 insertions(+) create mode 100644 python/manylinux1/Dockerfile-parquet_arrow-base-x86_64 create mode 100644 python/manylinux1/Dockerfile-x86_64 create mode 100644 python/manylinux1/README.md create mode 100755 python/manylinux1/build_arrow.sh diff --git a/NOTICE.txt b/NOTICE.txt index 679bb59e6a97d..5c699ca022c1b 100644 --- a/NOTICE.txt +++ b/NOTICE.txt @@ -38,3 +38,6 @@ This product includes software from the CMake project * Copyright 2001-2009 Kitware, Inc. * Copyright 2012-2014 Continuum Analytics, Inc. * All rights reserved. + +This product includes software from https://github.com/matthew-brett/multibuild (BSD 2-clause) + * Copyright (c) 2013-2016, Matt Terry and Matthew Brett; all rights reserved. diff --git a/python/manylinux1/Dockerfile-parquet_arrow-base-x86_64 b/python/manylinux1/Dockerfile-parquet_arrow-base-x86_64 new file mode 100644 index 0000000000000..714fa1a91b39e --- /dev/null +++ b/python/manylinux1/Dockerfile-parquet_arrow-base-x86_64 @@ -0,0 +1,40 @@ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + +FROM arrow-base-x86_64 + +WORKDIR / +ADD http://zlib.net/zlib-1.2.8.tar.gz /zlib-1.2.8.tar.gz +RUN tar xf zlib-1.2.8.tar.gz +WORKDIR zlib-1.2.8 +RUN CFLAGS=-fPIC cmake -DCMAKE_INSTALL_PREFIX:PATH=/usr -DCMAKE_BUILD_TYPE=Release . 
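A note on the union layout exercised by ARROW-275 above: after the `TypeLayout` change, a sparse union is described by a `uint8` types buffer plus its child vectors — every child keeps a slot at every index, and the type byte (0 denoting null, per the new comment) selects which child is live. A toy Python reader of that layout (illustrative, not an Arrow API):

```python
# Toy sparse-union layout: one type byte per row, plus per-child arrays
# that all have a slot at every index (illustrative, not an Arrow API).
types = bytes([1, 2, 0, 1])       # 0 marks a null row
children = {
    1: [10, 0, 0, 40],            # e.g. the INT child vector
    2: ["", "b", "", ""],         # e.g. the VARCHAR child vector
}

def union_value(i):
    t = types[i]
    return None if t == 0 else children[t][i]

assert [union_value(i) for i in range(4)] == [10, "b", None, 40]
```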
+RUN make -j5 install + +WORKDIR / +ADD https://github.com/google/snappy/releases/download/1.1.3/snappy-1.1.3.tar.gz /snappy-1.1.3.tar.gz +RUN tar xf snappy-1.1.3.tar.gz +WORKDIR /snappy-1.1.3 +RUN ./configure --with-pic --prefix=/usr +RUN make -j5 install + +WORKDIR / +ADD http://archive.apache.org/dist/thrift/0.9.1/thrift-0.9.1.tar.gz /thrift-0.9.1.tar.gz +RUN tar xf thrift-0.9.1.tar.gz +WORKDIR /thrift-0.9.1 +RUN ./configure LDFLAGS='-L/usr/lib64' CXXFLAGS='-fPIC' --without-qt4 --without-c_glib --without-csharp --without-java --without-erlang --without-nodejs --without-lua --without-python --without-perl --without-php --without-php_extension --without-ruby --without-haskell --without-go --without-d --without-tests --with-cpp --prefix=/usr --disable-shared --enable-static +RUN make -j5 install + +WORKDIR / +RUN git clone https://github.com/apache/parquet-cpp.git +WORKDIR /parquet-cpp +RUN ARROW_HOME=/usr THRIFT_HOME=/usr cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr -DPARQUET_BUILD_TESTS=OFF -DPARQUET_ARROW=ON . +RUN make -j5 install diff --git a/python/manylinux1/Dockerfile-x86_64 b/python/manylinux1/Dockerfile-x86_64 new file mode 100644 index 0000000000000..e62a60111af4a --- /dev/null +++ b/python/manylinux1/Dockerfile-x86_64 @@ -0,0 +1,47 @@ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + +FROM quay.io/pypa/manylinux1_x86_64:latest + +# Install dependencies +RUN yum install -y flex openssl-devel + +WORKDIR / +ADD http://downloads.sourceforge.net/project/boost/boost/1.60.0/boost_1_60_0.tar.gz /boost_1_60_0.tar.gz +RUN tar xf boost_1_60_0.tar.gz +WORKDIR /boost_1_60_0 +RUN ./bootstrap.sh +RUN ./bjam cxxflags=-fPIC cflags=-fPIC --prefix=/usr --with-filesystem --with-date_time --with-system install + +WORKDIR / +ADD https://cmake.org/files/v3.5/cmake-3.5.2.tar.gz /cmake-3.5.2.tar.gz +RUN tar xf cmake-3.5.2.tar.gz +WORKDIR /cmake-3.5.2 +RUN ./configure --prefix=/usr +RUN make -j5 install + +WORKDIR / +ADD https://github.com/google/flatbuffers/archive/v1.3.0.tar.gz /flatbuffers-1.3.0.tar.gz +RUN tar xf flatbuffers-1.3.0.tar.gz +WORKDIR /flatbuffers-1.3.0 +RUN CXXFLAGS='-fPIC' cmake -DFLATBUFFERS_BUILD_TESTS=OFF -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr . +RUN make -j5 install + +WORKDIR / +RUN git clone https://github.com/matthew-brett/multibuild.git +WORKDIR /multibuild +RUN git checkout ffe59955ad8690c2f8bb74766cb7e9b0d0ee3963 + +ADD arrow /arrow +WORKDIR /arrow/cpp +RUN FLATBUFFERS_HOME=/usr cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr -DARROW_HDFS=ON -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=ON -DARROW_BOOST_USE_SHARED=OFF . 
+RUN make -j5 install diff --git a/python/manylinux1/README.md b/python/manylinux1/README.md new file mode 100644 index 0000000000000..8cd9f6db004e5 --- /dev/null +++ b/python/manylinux1/README.md @@ -0,0 +1,40 @@ + + +## Manylinux1 wheels for Apache Arrow + +This folder provides base Docker images and the infrastructure to build +`manylinux1`-compatible Python wheels that should be installable on all +Linux distributions published in the last four years. + +The process is split into two parts: There are base Docker images that build +the native, Python-independent dependencies. For these you can select whether +to also build the dependencies used for Parquet support. Based on +these images, there is also a bash script that builds the pyarrow wheels +for all supported Python versions and places them in the `dist` folder. + +### Build instructions + +```bash +# Create a clean copy of the arrow source tree +git clone ../../ arrow +# Build the native base image +docker build -t arrow-base-x86_64 -f Dockerfile-x86_64 . +# (optionally) build parquet-cpp +docker build -t parquet_arrow-base-x86_64 -f Dockerfile-parquet_arrow-base-x86_64 . +# Build the python packages +docker run --rm -v $PWD:/io parquet_arrow-base-x86_64 /io/build_arrow.sh +# Now the new packages are located in the dist/ folder +ls -l dist/ +``` diff --git a/python/manylinux1/build_arrow.sh b/python/manylinux1/build_arrow.sh new file mode 100755 index 0000000000000..0786b6f490a16 --- /dev/null +++ b/python/manylinux1/build_arrow.sh @@ -0,0 +1,76 @@ +#!/bin/bash +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. +# +# Usage: +# docker run --rm -v $PWD:/io arrow-base-x86_64 /io/build_arrow.sh +# or with Parquet support +# docker run --rm -v $PWD:/io parquet_arrow-base-x86_64 /io/build_arrow.sh + +# Built upon the scripts in https://github.com/matthew-brett/manylinux-builds +# * Copyright (c) 2013-2016, Matt Terry and Matthew Brett (BSD 2-clause) + +PYTHON_VERSIONS="${PYTHON_VERSIONS:-2.7 3.4 3.5}" + +# Package index with only manylinux1 builds +MANYLINUX_URL=https://nipy.bic.berkeley.edu/manylinux + +source /multibuild/manylinux_utils.sh + +cd /arrow/python + +# PyArrow build configuration +export PYARROW_CMAKE_OPTIONS='-DCMAKE_BUILD_TYPE=Release' +# Needed, as otherwise arrow_io is sometimes not linked +export LDFLAGS="-Wl,--no-as-needed" +export ARROW_HOME="/usr" + +# Ensure the target directory exists +mkdir -p /io/dist +# Temporary directory to store the wheels that should be sent through auditwheel +rm_mkdir unfixed_wheels + +PY35_BIN=/opt/python/cp35-cp35m/bin +$PY35_BIN/pip install 'pyelftools<0.24' +$PY35_BIN/pip install 'git+https://github.com/xhochy/auditwheel.git@pyarrow-fixes' + +# Override repair_wheelhouse function +function repair_wheelhouse { + local in_dir=$1 + local out_dir=$2 + for whl in $in_dir/*.whl; do + if [[ $whl == *none-any.whl ]]; then + cp $whl $out_dir + else + # Store libraries directly in . not .libs to fix problems with libpyarrow.so linkage.
+ auditwheel -v repair -L . $whl -w $out_dir/ + fi + done + chmod -R a+rwX $out_dir +} + +for PYTHON in ${PYTHON_VERSIONS}; do + PYTHON_INTERPRETER="$(cpython_path $PYTHON)/bin/python" + PIP="$(cpython_path $PYTHON)/bin/pip" + PIPI_IO="$PIP install -f $MANYLINUX_URL" + PATH="$PATH:$(cpython_path $PYTHON)" + + $PIPI_IO "numpy==1.9.0" + $PIPI_IO "cython==0.24" + + PATH="$PATH:$(cpython_path $PYTHON)/bin" $PYTHON_INTERPRETER setup.py bdist_wheel + + rm_mkdir fixed_wheels + repair_wheelhouse dist /io/dist +done + From 8520061d38c4aa407ac6453aff786833efa5cbaa Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sun, 16 Oct 2016 16:23:04 -0400 Subject: [PATCH 0173/1644] ARROW-336: Run Apache Rat in Travis builds @julienledem Integrated the rat call in the cpp build. It should fail if licenses are not matching. We could also make a separate `lint` Travis build but for the moment this seemed overkill to me. Author: Uwe L. Korn Closes #174 from xhochy/ARROW-336 and squashes the following commits: 25f797c [Uwe L. Korn] Make run-rat executable 6b6221f [Uwe L. Korn] ARROW-336: Run Apache Rat in Travis builds --- ci/travis_script_cpp.sh | 4 +++ dev/release/02-source.sh | 37 ++++------------------------ dev/release/run-rat.sh | 53 ++++++++++++++++++++++++++++++++++++++++ 3 files changed, 62 insertions(+), 32 deletions(-) create mode 100755 dev/release/run-rat.sh diff --git a/ci/travis_script_cpp.sh b/ci/travis_script_cpp.sh index c3bd3b5f207a8..d555cab3e640c 100755 --- a/ci/travis_script_cpp.sh +++ b/ci/travis_script_cpp.sh @@ -16,6 +16,10 @@ set -e : ${CPP_BUILD_DIR=$TRAVIS_BUILD_DIR/cpp-build} +# Check licenses according to Apache policy +git archive HEAD -o arrow-src.tar.gz +./dev/release/run-rat.sh arrow-src.tar.gz + pushd $CPP_BUILD_DIR make lint diff --git a/dev/release/02-source.sh b/dev/release/02-source.sh index 1bbe2e92753ce..bdaa5cc9340fe 100644 --- a/dev/release/02-source.sh +++ b/dev/release/02-source.sh @@ -7,9 +7,9 @@ # to you under the Apache License, Version 2.0 (the # "License"); you may not use this file except in compliance # with the License. You may obtain a copy of the License at -# +# # http://www.apache.org/licenses/LICENSE-2.0 -# +# # Unless required by applicable law or agreed to in writing, # software distributed under the License is distributed on an # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY @@ -18,6 +18,8 @@ # under the License. # +SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" + if [ -z "$1" ]; then echo "Usage: $0 " exit @@ -56,36 +58,7 @@ tarball=$tag.tar.gz # archive (identical hashes) using the scm tag git archive $release_hash --prefix $tag/ -o $tarball -# download apache rat -curl -s https://repo1.maven.org/maven2/org/apache/rat/apache-rat/0.12/apache-rat-0.12.jar > apache-rat-0.12.jar - -RAT="java -jar apache-rat-0.12.jar -d " - -# generate the rat report -$RAT $tarball \ - -e ".*" \ - -e mman.h \ - -e "*_generated.h" \ - -e random.h \ - -e status.cc \ - -e status.h \ - -e asan_symbolize.py \ - -e cpplint.py \ - -e FindPythonLibsNew.cmake \ - -e pax_global_header \ - -e MANIFEST.in \ - -e __init__.pxd \ - -e __init__.py \ - -e requirements.txt \ - > rat.txt -UNAPPROVED=`cat rat.txt | grep "Unknown Licenses" | head -n 1 | cut -d " " -f 1` - -if [ "0" -eq "${UNAPPROVED}" ]; then - echo "No unnaproved licenses" -else - echo "${UNAPPROVED} unapproved licences. 
Check rat report: rat.txt"
-  exit
-fi
+${SOURCE_DIR}/run-rat.sh $tarball
 
 # sign the archive
 gpg --armor --output ${tarball}.asc --detach-sig $tarball
diff --git a/dev/release/run-rat.sh b/dev/release/run-rat.sh
new file mode 100755
index 0000000000000..d8ec6507fc4e5
--- /dev/null
+++ b/dev/release/run-rat.sh
@@ -0,0 +1,53 @@
+#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+# download apache rat
+curl -s https://repo1.maven.org/maven2/org/apache/rat/apache-rat/0.12/apache-rat-0.12.jar > apache-rat-0.12.jar
+
+RAT="java -jar apache-rat-0.12.jar -d "
+
+# generate the rat report
+$RAT $1 \
+  -e ".*" \
+  -e mman.h \
+  -e "*_generated.h" \
+  -e random.h \
+  -e status.cc \
+  -e status.h \
+  -e asan_symbolize.py \
+  -e cpplint.py \
+  -e FindPythonLibsNew.cmake \
+  -e pax_global_header \
+  -e MANIFEST.in \
+  -e __init__.pxd \
+  -e __init__.py \
+  -e requirements.txt \
+  > rat.txt
+cat rat.txt
+UNAPPROVED=`cat rat.txt | grep "Unknown Licenses" | head -n 1 | cut -d " " -f 1`
+
+if [ "0" -eq "${UNAPPROVED}" ]; then
+  echo "No unapproved licenses"
+else
+  echo "${UNAPPROVED} unapproved licenses. Check rat report: rat.txt"
+  exit 1
+fi
+

From 8e8b17f992aa3bb3a642a93b44beb9b87d589fea Mon Sep 17 00:00:00 2001
From: "Uwe L. Korn"
Date: Sun, 16 Oct 2016 16:23:54 -0400
Subject: [PATCH 0174/1644] ARROW-97: API documentation via sphinx-apidoc

Author: Uwe L. Korn

Closes #175 from xhochy/ARROW-97 and squashes the following commits:

2ec3e11 [Uwe L. Korn] Add license headers
d838e81 [Uwe L.
Korn] ARROW-97: API documentation via sphinx-apidoc --- ci/travis_script_python.sh | 7 + python/README.md | 7 + python/doc/.gitignore | 3 + python/doc/Makefile | 237 +++++++++++++++++++++++ python/doc/conf.py | 369 ++++++++++++++++++++++++++++++++++++ python/doc/index.rst | 28 +++ python/doc/requirements.txt | 3 + python/pyarrow/parquet.pyx | 8 +- 8 files changed, 661 insertions(+), 1 deletion(-) create mode 100644 python/doc/.gitignore create mode 100644 python/doc/Makefile create mode 100644 python/doc/conf.py create mode 100644 python/doc/index.rst create mode 100644 python/doc/requirements.txt diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index 97f0563240c75..55cb2a76f6db1 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -51,6 +51,13 @@ python_version_tests() { --inplace python -m pytest -vv -r sxX pyarrow + + # Build documentation once + if [[ "$PYTHON_VERSION" == "3.5" ]] + then + pip install -r doc/requirements.txt + python setup.py build_sphinx + fi } # run tests for python 2.7 and 3.5 diff --git a/python/README.md b/python/README.md index 6febcbcbcbfe7..e11f64564558c 100644 --- a/python/README.md +++ b/python/README.md @@ -47,3 +47,10 @@ The Arrow C++ library must be built with all options enabled and installed with python setup.py build_ext --inplace py.test pyarrow ``` + +#### Build the documentation + +```bash +pip install -r doc/requirements.txt +python setup.py build_sphinx +``` diff --git a/python/doc/.gitignore b/python/doc/.gitignore new file mode 100644 index 0000000000000..87d04134d6fc3 --- /dev/null +++ b/python/doc/.gitignore @@ -0,0 +1,3 @@ +# auto-generated module documentation +pyarrow*.rst +modules.rst diff --git a/python/doc/Makefile b/python/doc/Makefile new file mode 100644 index 0000000000000..7257583952481 --- /dev/null +++ b/python/doc/Makefile @@ -0,0 +1,237 @@ + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. +# Makefile for Sphinx documentation +# + +# You can set these variables from the command line. +SPHINXOPTS = +SPHINXBUILD = sphinx-build +PAPER = +BUILDDIR = _build + +# Internal variables. +PAPEROPT_a4 = -D latex_paper_size=a4 +PAPEROPT_letter = -D latex_paper_size=letter +ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . +# the i18n builder cannot share the environment and doctrees with the others +I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . 
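+# A usage sketch (assumes the packages from python/doc/requirements.txt are
+# installed so that sphinx-build is available):
+#
+#   make html                  # render the docs into $(BUILDDIR)/html
+#   make html SPHINXOPTS=-W    # same, but turn Sphinx warnings into errors
+#   make clean                 # wipe $(BUILDDIR)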
+
+.PHONY: help
+help:
+	@echo "Please use \`make <target>' where <target> is one of"
+	@echo "  html       to make standalone HTML files"
+	@echo "  dirhtml    to make HTML files named index.html in directories"
+	@echo "  singlehtml to make a single large HTML file"
+	@echo "  pickle     to make pickle files"
+	@echo "  json       to make JSON files"
+	@echo "  htmlhelp   to make HTML files and a HTML help project"
+	@echo "  qthelp     to make HTML files and a qthelp project"
+	@echo "  applehelp  to make an Apple Help Book"
+	@echo "  devhelp    to make HTML files and a Devhelp project"
+	@echo "  epub       to make an epub"
+	@echo "  epub3      to make an epub3"
+	@echo "  latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
+	@echo "  latexpdf   to make LaTeX files and run them through pdflatex"
+	@echo "  latexpdfja to make LaTeX files and run them through platex/dvipdfmx"
+	@echo "  text       to make text files"
+	@echo "  man        to make manual pages"
+	@echo "  texinfo    to make Texinfo files"
+	@echo "  info       to make Texinfo files and run them through makeinfo"
+	@echo "  gettext    to make PO message catalogs"
+	@echo "  changes    to make an overview of all changed/added/deprecated items"
+	@echo "  xml        to make Docutils-native XML files"
+	@echo "  pseudoxml  to make pseudoxml-XML files for display purposes"
+	@echo "  linkcheck  to check all external links for integrity"
+	@echo "  doctest    to run all doctests embedded in the documentation (if enabled)"
+	@echo "  coverage   to run coverage check of the documentation (if enabled)"
+	@echo "  dummy      to check syntax errors of document sources"
+
+.PHONY: clean
+clean:
+	rm -rf $(BUILDDIR)/*
+
+.PHONY: html
+html:
+	$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
+	@echo
+	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
+
+.PHONY: dirhtml
+dirhtml:
+	$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
+	@echo
+	@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."
+
+.PHONY: singlehtml
+singlehtml:
+	$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
+	@echo
+	@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."
+
+.PHONY: pickle
+pickle:
+	$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
+	@echo
+	@echo "Build finished; now you can process the pickle files."
+
+.PHONY: json
+json:
+	$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
+	@echo
+	@echo "Build finished; now you can process the JSON files."
+
+.PHONY: htmlhelp
+htmlhelp:
+	$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
+	@echo
+	@echo "Build finished; now you can run HTML Help Workshop with the" \
+	      ".hhp project file in $(BUILDDIR)/htmlhelp."
+
+.PHONY: qthelp
+qthelp:
+	$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
+	@echo
+	@echo "Build finished; now you can run "qcollectiongenerator" with the" \
+	      ".qhcp project file in $(BUILDDIR)/qthelp, like this:"
+	@echo "# qcollectiongenerator $(BUILDDIR)/qthelp/pyarrow.qhcp"
+	@echo "To view the help file:"
+	@echo "# assistant -collectionFile $(BUILDDIR)/qthelp/pyarrow.qhc"
+
+.PHONY: applehelp
+applehelp:
+	$(SPHINXBUILD) -b applehelp $(ALLSPHINXOPTS) $(BUILDDIR)/applehelp
+	@echo
+	@echo "Build finished. The help book is in $(BUILDDIR)/applehelp."
+	@echo "N.B. You won't be able to view it unless you put it in" \
+	      "~/Library/Documentation/Help or install it in your application" \
+	      "bundle."
+
+.PHONY: devhelp
+devhelp:
+	$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
+	@echo
+	@echo "Build finished."
+ @echo "To view the help file:" + @echo "# mkdir -p $$HOME/.local/share/devhelp/pyarrow" + @echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/pyarrow" + @echo "# devhelp" + +.PHONY: epub +epub: + $(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub + @echo + @echo "Build finished. The epub file is in $(BUILDDIR)/epub." + +.PHONY: epub3 +epub3: + $(SPHINXBUILD) -b epub3 $(ALLSPHINXOPTS) $(BUILDDIR)/epub3 + @echo + @echo "Build finished. The epub3 file is in $(BUILDDIR)/epub3." + +.PHONY: latex +latex: + $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex + @echo + @echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex." + @echo "Run \`make' in that directory to run these through (pdf)latex" \ + "(use \`make latexpdf' here to do that automatically)." + +.PHONY: latexpdf +latexpdf: + $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex + @echo "Running LaTeX files through pdflatex..." + $(MAKE) -C $(BUILDDIR)/latex all-pdf + @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." + +.PHONY: latexpdfja +latexpdfja: + $(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex + @echo "Running LaTeX files through platex and dvipdfmx..." + $(MAKE) -C $(BUILDDIR)/latex all-pdf-ja + @echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex." + +.PHONY: text +text: + $(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text + @echo + @echo "Build finished. The text files are in $(BUILDDIR)/text." + +.PHONY: man +man: + $(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man + @echo + @echo "Build finished. The manual pages are in $(BUILDDIR)/man." + +.PHONY: texinfo +texinfo: + $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo + @echo + @echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo." + @echo "Run \`make' in that directory to run these through makeinfo" \ + "(use \`make info' here to do that automatically)." + +.PHONY: info +info: + $(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo + @echo "Running Texinfo files through makeinfo..." + make -C $(BUILDDIR)/texinfo info + @echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo." + +.PHONY: gettext +gettext: + $(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale + @echo + @echo "Build finished. The message catalogs are in $(BUILDDIR)/locale." + +.PHONY: changes +changes: + $(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes + @echo + @echo "The overview file is in $(BUILDDIR)/changes." + +.PHONY: linkcheck +linkcheck: + $(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck + @echo + @echo "Link check complete; look for any errors in the above output " \ + "or in $(BUILDDIR)/linkcheck/output.txt." + +.PHONY: doctest +doctest: + $(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest + @echo "Testing of doctests in the sources finished, look at the " \ + "results in $(BUILDDIR)/doctest/output.txt." + +.PHONY: coverage +coverage: + $(SPHINXBUILD) -b coverage $(ALLSPHINXOPTS) $(BUILDDIR)/coverage + @echo "Testing of coverage in the sources finished, look at the " \ + "results in $(BUILDDIR)/coverage/python.txt." + +.PHONY: xml +xml: + $(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml + @echo + @echo "Build finished. The XML files are in $(BUILDDIR)/xml." + +.PHONY: pseudoxml +pseudoxml: + $(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml + @echo + @echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml." 
+ +.PHONY: dummy +dummy: + $(SPHINXBUILD) -b dummy $(ALLSPHINXOPTS) $(BUILDDIR)/dummy + @echo + @echo "Build finished. Dummy builder generates no files." diff --git a/python/doc/conf.py b/python/doc/conf.py new file mode 100644 index 0000000000000..99ac3512ec9d4 --- /dev/null +++ b/python/doc/conf.py @@ -0,0 +1,369 @@ +# -*- coding: utf-8 -*- +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. +# +# This file is execfile()d with the current directory set to its +# containing dir. +# +# Note that not all possible configuration values are present in this +# autogenerated file. +# +# All configuration values have a default; values that are commented out +# serve to show the default. + +# If extensions (or modules to document with autodoc) are in another directory, +# add these directories to sys.path here. If the directory is relative to the +# documentation root, use os.path.abspath to make it absolute, like shown here. +# +import inspect +import os +import sys + +from sphinx import apidoc + +import sphinx_rtd_theme + + +__location__ = os.path.join(os.getcwd(), os.path.dirname( + inspect.getfile(inspect.currentframe()))) +output_dir = os.path.join(__location__) +module_dir = os.path.join(__location__, "..", "pyarrow") +cmd_line_template = "sphinx-apidoc -f -e -o {outputdir} {moduledir}" +cmd_line = cmd_line_template.format(outputdir=output_dir, moduledir=module_dir) +apidoc.main(cmd_line.split(" ")) + +sys.path.insert(0, os.path.abspath('..')) + +# -- General configuration ------------------------------------------------ + +# If your documentation needs a minimal Sphinx version, state it here. +# +# needs_sphinx = '1.0' + +# Add any Sphinx extension module names here, as strings. They can be +# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom +# ones. +extensions = [ + 'sphinx.ext.autodoc', + 'sphinx.ext.autosummary', + 'sphinx.ext.doctest', + 'sphinx.ext.mathjax', + 'sphinx.ext.viewcode', + 'numpydoc' +] + +# Add any paths that contain templates here, relative to this directory. +templates_path = ['_templates'] + +# The suffix(es) of source filenames. +# You can specify multiple suffix as a list of string: +# +# source_suffix = ['.rst', '.md'] +source_suffix = '.rst' + +# The encoding of source files. +# +# source_encoding = 'utf-8-sig' + +# The master toctree document. +master_doc = 'index' + +# General information about the project. +project = u'pyarrow' +copyright = u'2016 Apache Software Foundation' +author = u'Apache Software Foundation' + +# The version info for the project you're documenting, acts as replacement for +# |version| and |release|, also used in various other places throughout the +# built documents. +# +# The short X.Y version. +version = u'' +# The full version, including alpha/beta/rc tags. +release = u'' + +# The language for content autogenerated by Sphinx. Refer to documentation +# for a list of supported languages. +# +# This is also used if you do content translation via gettext catalogs. 
+# Usually you set "language" from the command line for these cases.
+language = None
+
+# There are two options for replacing |today|: either, you set today to some
+# non-false value, then it is used:
+#
+# today = ''
+#
+# Else, today_fmt is used as the format for a strftime call.
+#
+# today_fmt = '%B %d, %Y'
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# These patterns also affect html_static_path and html_extra_path
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+
+# The reST default role (used for this markup: `text`) to use for all
+# documents.
+#
+# default_role = None
+
+# If true, '()' will be appended to :func: etc. cross-reference text.
+#
+# add_function_parentheses = True
+
+# If true, the current module name will be prepended to all description
+# unit titles (such as .. function::).
+#
+# add_module_names = True
+
+# If true, sectionauthor and moduleauthor directives will be shown in the
+# output. They are ignored by default.
+#
+# show_authors = False
+
+# The name of the Pygments (syntax highlighting) style to use.
+pygments_style = 'sphinx'
+
+# A list of ignored prefixes for module index sorting.
+# modindex_common_prefix = []
+
+# If true, keep warnings as "system message" paragraphs in the built documents.
+# keep_warnings = False
+
+# If true, `todo` and `todoList` produce output, else they produce nothing.
+todo_include_todos = False
+
+
+# -- Options for HTML output ----------------------------------------------
+
+# The theme to use for HTML and HTML Help pages.  See the documentation for
+# a list of builtin themes.
+#
+html_theme = 'sphinx_rtd_theme'
+
+# Theme options are theme-specific and customize the look and feel of a theme
+# further.  For a list of options available for each theme, see the
+# documentation.
+#
+# html_theme_options = {}
+
+# Add any paths that contain custom themes here, relative to this directory.
+html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
+
+# The name for this set of Sphinx documents.
+# "<project> v<release> documentation" by default.
+#
+# html_title = u'pyarrow v0.1.0'
+
+# A shorter title for the navigation bar.  Default is the same as html_title.
+#
+# html_short_title = None
+
+# The name of an image file (relative to this directory) to place at the top
+# of the sidebar.
+#
+# html_logo = None
+
+# The name of an image file (relative to this directory) to use as a favicon of
+# the docs.  This file should be a Windows icon file (.ico) being 16x16 or 32x32
+# pixels large.
+#
+# html_favicon = None
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory.  They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
+
+# Add any extra paths that contain custom files (such as robots.txt or
+# .htaccess) here, relative to this directory.  These files are copied
+# directly to the root of the documentation.
+#
+# html_extra_path = []
+
+# If not None, a 'Last updated on:' timestamp is inserted at every page
+# bottom, using the given strftime format.
+# The empty string is equivalent to '%b %d, %Y'.
+#
+# html_last_updated_fmt = None
+
+# If true, SmartyPants will be used to convert quotes and dashes to
+# typographically correct entities.
+#
+# html_use_smartypants = True
+
+# Custom sidebar templates, maps document names to template names.
+# +# html_sidebars = {} + +# Additional templates that should be rendered to pages, maps page names to +# template names. +# +# html_additional_pages = {} + +# If false, no module index is generated. +# +# html_domain_indices = True + +# If false, no index is generated. +# +# html_use_index = True + +# If true, the index is split into individual pages for each letter. +# +# html_split_index = False + +# If true, links to the reST sources are added to the pages. +# +# html_show_sourcelink = True + +# If true, "Created using Sphinx" is shown in the HTML footer. Default is True. +# +# html_show_sphinx = True + +# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True. +# +# html_show_copyright = True + +# If true, an OpenSearch description file will be output, and all pages will +# contain a tag referring to it. The value of this option must be the +# base URL from which the finished HTML is served. +# +# html_use_opensearch = '' + +# This is the file name suffix for HTML files (e.g. ".xhtml"). +# html_file_suffix = None + +# Language to be used for generating the HTML full-text search index. +# Sphinx supports the following languages: +# 'da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'ja' +# 'nl', 'no', 'pt', 'ro', 'ru', 'sv', 'tr', 'zh' +# +# html_search_language = 'en' + +# A dictionary with options for the search language support, empty by default. +# 'ja' uses this config value. +# 'zh' user can custom change `jieba` dictionary path. +# +# html_search_options = {'type': 'default'} + +# The name of a javascript file (relative to the configuration directory) that +# implements a search results scorer. If empty, the default will be used. +# +# html_search_scorer = 'scorer.js' + +# Output file base name for HTML help builder. +htmlhelp_basename = 'pyarrowdoc' + +# -- Options for LaTeX output --------------------------------------------- + +latex_elements = { + # The paper size ('letterpaper' or 'a4paper'). + # + # 'papersize': 'letterpaper', + + # The font size ('10pt', '11pt' or '12pt'). + # + # 'pointsize': '10pt', + + # Additional stuff for the LaTeX preamble. + # + # 'preamble': '', + + # Latex figure (float) alignment + # + # 'figure_align': 'htbp', +} + +# Grouping the document tree into LaTeX files. List of tuples +# (source start file, target name, title, +# author, documentclass [howto, manual, or own class]). +latex_documents = [ + (master_doc, 'pyarrow.tex', u'pyarrow Documentation', + u'Apache Arrow Team', 'manual'), +] + +# The name of an image file (relative to this directory) to place at the top of +# the title page. +# +# latex_logo = None + +# For "manual" documents, if this is true, then toplevel headings are parts, +# not chapters. +# +# latex_use_parts = False + +# If true, show page references after internal links. +# +# latex_show_pagerefs = False + +# If true, show URL addresses after external links. +# +# latex_show_urls = False + +# Documents to append as an appendix to all manuals. +# +# latex_appendices = [] + +# It false, will not define \strong, \code, itleref, \crossref ... but only +# \sphinxstrong, ..., \sphinxtitleref, ... To help avoid clash with user added +# packages. +# +# latex_keep_old_macro_names = True + +# If false, no module index is generated. +# +# latex_domain_indices = True + + +# -- Options for manual page output --------------------------------------- + +# One entry per manual page. List of tuples +# (source start file, name, description, authors, manual section). 
+man_pages = [
+    (master_doc, 'pyarrow', u'pyarrow Documentation',
+     [author], 1)
+]
+
+# If true, show URL addresses after external links.
+#
+# man_show_urls = False
+
+
+# -- Options for Texinfo output -------------------------------------------
+
+# Grouping the document tree into Texinfo files. List of tuples
+# (source start file, target name, title, author,
+#  dir menu entry, description, category)
+texinfo_documents = [
+    (master_doc, 'pyarrow', u'pyarrow Documentation',
+     author, 'pyarrow', 'One line description of project.',
+     'Miscellaneous'),
+]
+
+# Documents to append as an appendix to all manuals.
+#
+# texinfo_appendices = []
+
+# If false, no module index is generated.
+#
+# texinfo_domain_indices = True
+
+# How to display URL addresses: 'footnote', 'no', or 'inline'.
+#
+# texinfo_show_urls = 'footnote'
+
+# If true, do not generate a @detailmenu in the "Top" node's menu.
+#
+# texinfo_no_detailmenu = False
diff --git a/python/doc/index.rst b/python/doc/index.rst
new file mode 100644
index 0000000000000..550e544eef9e8
--- /dev/null
+++ b/python/doc/index.rst
@@ -0,0 +1,28 @@
+Apache Arrow (Python)
+=====================
+
+Arrow is a columnar in-memory analytics layer designed to accelerate big data.
+It houses a set of canonical in-memory representations of flat and hierarchical
+data along with multiple language bindings for structure manipulation. It also
+provides IPC and common algorithm implementations.
+
+This is the documentation of the Python API of Apache Arrow. For more details
+on the format and other language bindings see
+`the main page for Arrow <https://arrow.apache.org/>`_. Here we will only
+detail the usage of the Python API for Arrow and the leaf libraries that add
+additional functionality such as reading Apache Parquet files into Arrow
+structures.
+
+.. toctree::
+   :maxdepth: 4
+   :hidden:
+
+   Module Reference <modules>
+
+Indices and tables
+==================
+
+* :ref:`genindex`
+* :ref:`modindex`
+* :ref:`search`
+
diff --git a/python/doc/requirements.txt b/python/doc/requirements.txt
new file mode 100644
index 0000000000000..ce0793c31de26
--- /dev/null
+++ b/python/doc/requirements.txt
@@ -0,0 +1,3 @@
+numpydoc
+sphinx
+sphinx_rtd_theme
diff --git a/python/pyarrow/parquet.pyx b/python/pyarrow/parquet.pyx
index ca0176a7c0403..2abe57b33ed48 100644
--- a/python/pyarrow/parquet.pyx
+++ b/python/pyarrow/parquet.pyx
@@ -34,6 +34,10 @@ from pyarrow.io cimport NativeFile
 
 import six
 
+__all__ = [
+    'read_table',
+    'write_table'
+]
 
 cdef class ParquetReader:
     cdef:
@@ -76,9 +80,11 @@ cdef class ParquetReader:
 
 def read_table(source, columns=None):
     """
     Read a Table from Parquet format
+
     Returns
     -------
-    table: pyarrow.Table
+    pyarrow.table.Table
+        Content of the file as a table (of columns)
     """
     cdef ParquetReader reader = ParquetReader()

From 732a2059d0c4493e451c566160b9d5d01dfe87be Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Mon, 17 Oct 2016 13:44:34 -0400
Subject: [PATCH 0175/1644] ARROW-261: Refactor String/Binary code paths to
 reflect unnested (non-list-based) structure

Per discussions on the mailing list. This should in theory match the Java
implementation.

Author: Wes McKinney

Closes #176 from wesm/ARROW-261 and squashes the following commits:

dca39ce [Wes McKinney] Make binary/string constants static to avoid memory-access-related segfaults in third party libraries
1e65b01 [Wes McKinney] Deprecate pyarrow::Status in favor of just arrow::Status.
Conform pyarrow use of ArrayBuilder::Finish 9a1f77e [Wes McKinney] Add license header to index.rst bd70cab [Wes McKinney] Complete refactoring, fix up IPC tests for flattened string/binary buffer/metadata layout ae64f2e [Wes McKinney] Refactoring to reflect collaprsed list-like structure of Binary and String types. Not yet complete --- cpp/CMakeLists.txt | 1 - cpp/src/arrow/builder.h | 2 +- cpp/src/arrow/ipc/adapter.cc | 47 ++++---- cpp/src/arrow/ipc/test-common.h | 19 ++-- cpp/src/arrow/type.h | 20 +--- cpp/src/arrow/types/CMakeLists.txt | 1 - cpp/src/arrow/types/binary.h | 28 ----- cpp/src/arrow/types/construct.cc | 31 +----- cpp/src/arrow/types/construct.h | 8 -- cpp/src/arrow/types/json.cc | 37 ------- cpp/src/arrow/types/json.h | 36 ------ cpp/src/arrow/types/list-test.cc | 15 ++- cpp/src/arrow/types/list.cc | 42 ++++++- cpp/src/arrow/types/list.h | 49 ++------- cpp/src/arrow/types/primitive-test.cc | 26 +++-- cpp/src/arrow/types/primitive.cc | 31 ++++-- cpp/src/arrow/types/primitive.h | 31 +++--- cpp/src/arrow/types/string-test.cc | 33 +++--- cpp/src/arrow/types/string.cc | 101 ++++++++++++++--- cpp/src/arrow/types/string.h | 49 +++++---- cpp/src/arrow/types/struct-test.cc | 21 +++- cpp/src/arrow/types/struct.cc | 14 +++ cpp/src/arrow/types/struct.h | 17 +-- cpp/src/arrow/util/status.cc | 6 + cpp/src/arrow/util/status.h | 17 ++- python/CMakeLists.txt | 2 - python/doc/index.rst | 18 ++- python/pyarrow/error.pxd | 4 +- python/pyarrow/error.pyx | 10 +- python/pyarrow/includes/pyarrow.pxd | 35 ++---- python/pyarrow/io.pyx | 56 +++++----- python/pyarrow/ipc.pyx | 18 +-- python/pyarrow/parquet.pyx | 14 +-- python/src/pyarrow/adapters/builtin.cc | 39 ++++--- python/src/pyarrow/adapters/builtin.h | 9 +- python/src/pyarrow/adapters/pandas.cc | 32 +++--- python/src/pyarrow/adapters/pandas.h | 15 ++- python/src/pyarrow/api.h | 2 - python/src/pyarrow/common.cc | 12 +- python/src/pyarrow/common.h | 7 -- python/src/pyarrow/io.cc | 59 +++++----- python/src/pyarrow/status.cc | 92 ---------------- python/src/pyarrow/status.h | 146 ------------------------- 43 files changed, 484 insertions(+), 768 deletions(-) delete mode 100644 cpp/src/arrow/types/binary.h delete mode 100644 cpp/src/arrow/types/json.cc delete mode 100644 cpp/src/arrow/types/json.h delete mode 100644 python/src/pyarrow/status.cc delete mode 100644 python/src/pyarrow/status.h diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index d682dc76f8ced..6f954830b6334 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -681,7 +681,6 @@ set(ARROW_SRCS src/arrow/types/construct.cc src/arrow/types/decimal.cc - src/arrow/types/json.cc src/arrow/types/list.cc src/arrow/types/primitive.cc src/arrow/types/string.cc diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index 646a6f24e9df8..cef17e5aabab9 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -93,7 +93,7 @@ class ARROW_EXPORT ArrayBuilder { // Creates new array object to hold the contents of the builder and transfers // ownership of the data. This resets all variables on the builder. 
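// A sketch of the builder call pattern after this change (variable names
// below are illustrative only, not part of the patch):
//
//   Int32Builder builder(pool, std::make_shared<Int32Type>());
//   RETURN_NOT_OK(builder.Append(values, length));
//   std::shared_ptr<Array> out;
//   RETURN_NOT_OK(builder.Finish(&out));  // previously: out = builder.Finish();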
- virtual std::shared_ptr Finish() = 0; + virtual Status Finish(std::shared_ptr* out) = 0; const std::shared_ptr& type() const { return type_; } diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index cd8ab53a31d1f..f84cb264f70e1 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -78,22 +78,6 @@ static bool IsPrimitive(const DataType* type) { } } -static bool IsListType(const DataType* type) { - DCHECK(type != nullptr); - switch (type->type) { - // TODO(emkornfield) grouping like this are used in a few places in the - // code consider using pattern like: - // http://stackoverflow.com/questions/26784685/c-macro-for-calling-function-based-on-enum-type - // - case Type::BINARY: - case Type::LIST: - case Type::STRING: - return true; - default: - return false; - } -} - // ---------------------------------------------------------------------- // Record batch write path @@ -115,7 +99,11 @@ Status VisitArray(const Array* arr, std::vector* field_nodes if (IsPrimitive(arr_type)) { const auto prim_arr = static_cast(arr); buffers->push_back(prim_arr->data()); - } else if (IsListType(arr_type)) { + } else if (arr->type_enum() == Type::STRING || arr->type_enum() == Type::BINARY) { + const auto binary_arr = static_cast(arr); + buffers->push_back(binary_arr->offsets()); + buffers->push_back(binary_arr->data()); + } else if (arr->type_enum() == Type::LIST) { const auto list_arr = static_cast(arr); buffers->push_back(list_arr->offset_buffer()); RETURN_NOT_OK(VisitArray( @@ -331,9 +319,21 @@ class RecordBatchReader::RecordBatchReaderImpl { } return MakePrimitiveArray( type, field_meta.length, data, field_meta.null_count, null_bitmap, out); - } + } else if (type->type == Type::STRING || type->type == Type::BINARY) { + std::shared_ptr offsets; + std::shared_ptr values; + RETURN_NOT_OK(GetBuffer(buffer_index_++, &offsets)); + RETURN_NOT_OK(GetBuffer(buffer_index_++, &values)); - if (IsListType(type.get())) { + if (type->type == Type::STRING) { + *out = std::make_shared( + field_meta.length, offsets, values, field_meta.null_count, null_bitmap); + } else { + *out = std::make_shared( + field_meta.length, offsets, values, field_meta.null_count, null_bitmap); + } + return Status::OK(); + } else if (type->type == Type::LIST) { std::shared_ptr offsets; RETURN_NOT_OK(GetBuffer(buffer_index_++, &offsets)); const int num_children = type->num_children(); @@ -346,11 +346,10 @@ class RecordBatchReader::RecordBatchReaderImpl { std::shared_ptr values_array; RETURN_NOT_OK( NextArray(type->child(0).get(), max_recursion_depth - 1, &values_array)); - return MakeListArray(type, field_meta.length, offsets, values_array, - field_meta.null_count, null_bitmap, out); - } - - if (type->type == Type::STRUCT) { + *out = std::make_shared(type, field_meta.length, offsets, values_array, + field_meta.null_count, null_bitmap); + return Status::OK(); + } else if (type->type == Type::STRUCT) { const int num_children = type->num_children(); std::vector fields; fields.reserve(num_children); diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index 7d02bc302f40e..13bbbebde8aa1 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -42,7 +42,7 @@ const auto kListInt32 = std::make_shared(kInt32); const auto kListListInt32 = std::make_shared(kListInt32); Status MakeRandomInt32Array( - int32_t length, bool include_nulls, MemoryPool* pool, std::shared_ptr* array) { + int32_t length, bool include_nulls, MemoryPool* pool, std::shared_ptr* out) { 
std::shared_ptr data; test::MakeRandomInt32PoolBuffer(length, pool, &data); const auto kInt32 = std::make_shared(); @@ -52,16 +52,14 @@ Status MakeRandomInt32Array( test::MakeRandomBytePoolBuffer(length, pool, &valid_bytes); RETURN_NOT_OK(builder.Append( reinterpret_cast(data->data()), length, valid_bytes->data())); - *array = builder.Finish(); - return Status::OK(); + return builder.Finish(out); } RETURN_NOT_OK(builder.Append(reinterpret_cast(data->data()), length)); - *array = builder.Finish(); - return Status::OK(); + return builder.Finish(out); } Status MakeRandomListArray(const std::shared_ptr& child_array, int num_lists, - bool include_nulls, MemoryPool* pool, std::shared_ptr* array) { + bool include_nulls, MemoryPool* pool, std::shared_ptr* out) { // Create the null list values std::vector valid_lists(num_lists); const double null_percent = include_nulls ? 0.1 : 0; @@ -90,8 +88,8 @@ Status MakeRandomListArray(const std::shared_ptr& child_array, int num_li } ListBuilder builder(pool, child_array); RETURN_NOT_OK(builder.Append(offsets.data(), num_lists, valid_lists.data())); - *array = builder.Finish(); - return (*array)->Validate(); + RETURN_NOT_OK(builder.Finish(out)); + return (*out)->Validate(); } typedef Status MakeRecordBatch(std::shared_ptr* out); @@ -115,7 +113,7 @@ Status MakeIntRecordBatch(std::shared_ptr* out) { template Status MakeRandomBinaryArray( - const TypePtr& type, int32_t length, MemoryPool* pool, ArrayPtr* array) { + const TypePtr& type, int32_t length, MemoryPool* pool, ArrayPtr* out) { const std::vector values = { "", "", "abc", "123", "efg", "456!@#!@#", "12312"}; Builder builder(pool, type); @@ -130,8 +128,7 @@ Status MakeRandomBinaryArray( builder.Append(reinterpret_cast(value.data()), value.size())); } } - *array = builder.Finish(); - return Status::OK(); + return builder.Finish(out); } Status MakeStringTypesRecordBatch(std::shared_ptr* out) { diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index b4c3721a72895..ea8516fc34798 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -242,7 +242,7 @@ struct ARROW_EXPORT DoubleType : public PrimitiveType { struct ARROW_EXPORT ListType : public DataType { // List can contain any other logical value type explicit ListType(const std::shared_ptr& value_type) - : ListType(value_type, Type::LIST) {} + : ListType(std::make_shared("item", value_type)) {} explicit ListType(const std::shared_ptr& value_field) : DataType(Type::LIST) { children_ = {value_field}; @@ -255,26 +255,17 @@ struct ARROW_EXPORT ListType : public DataType { static char const* name() { return "list"; } std::string ToString() const override; - - protected: - // Constructor for classes that are implemented as List Arrays. - ListType(const std::shared_ptr& value_type, Type::type logical_type) - : DataType(logical_type) { - // TODO ARROW-187 this can technically fail, make a constructor method ? - children_ = {std::make_shared("item", value_type)}; - } }; // BinaryType type is reprsents lists of 1-byte values. -struct ARROW_EXPORT BinaryType : public ListType { +struct ARROW_EXPORT BinaryType : public DataType { BinaryType() : BinaryType(Type::BINARY) {} static char const* name() { return "binary"; } std::string ToString() const override; protected: // Allow subclasses to change the logical type. 
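// (Layout sketch for the refactored types: a BinaryArray no longer wraps a
// List<UInt8> child array. It owns an int32 offsets buffer with length + 1
// entries plus a single data buffer, so value i occupies the byte range
// [offsets[i], offsets[i+1]) of data(). StringArray shares this layout and
// merely interprets the bytes as UTF-8.)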
- explicit BinaryType(Type::type logical_type) - : ListType(std::shared_ptr(new UInt8Type()), logical_type) {} + explicit BinaryType(Type::type logical_type) : DataType(logical_type) {} }; // UTF encoded strings @@ -284,9 +275,6 @@ struct ARROW_EXPORT StringType : public BinaryType { static char const* name() { return "string"; } std::string ToString() const override; - - protected: - explicit StringType(Type::type logical_type) : BinaryType(logical_type) {} }; struct ARROW_EXPORT StructType : public DataType { @@ -300,7 +288,7 @@ struct ARROW_EXPORT StructType : public DataType { // These will be defined elsewhere template -struct type_traits {}; +struct TypeTraits {}; } // namespace arrow diff --git a/cpp/src/arrow/types/CMakeLists.txt b/cpp/src/arrow/types/CMakeLists.txt index 72a8e77664610..9f7816989827d 100644 --- a/cpp/src/arrow/types/CMakeLists.txt +++ b/cpp/src/arrow/types/CMakeLists.txt @@ -25,7 +25,6 @@ install(FILES construct.h datetime.h decimal.h - json.h list.h primitive.h string.h diff --git a/cpp/src/arrow/types/binary.h b/cpp/src/arrow/types/binary.h deleted file mode 100644 index 201fbb6e79536..0000000000000 --- a/cpp/src/arrow/types/binary.h +++ /dev/null @@ -1,28 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
- -#ifndef ARROW_TYPES_BINARY_H -#define ARROW_TYPES_BINARY_H - -#include -#include - -#include "arrow/type.h" - -namespace arrow {} // namespace arrow - -#endif // ARROW_TYPES_BINARY_H diff --git a/cpp/src/arrow/types/construct.cc b/cpp/src/arrow/types/construct.cc index 0b71ea965516c..67245f8ea1fda 100644 --- a/cpp/src/arrow/types/construct.cc +++ b/cpp/src/arrow/types/construct.cc @@ -59,6 +59,7 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, BUILDER_CASE(DOUBLE, DoubleBuilder); BUILDER_CASE(STRING, StringBuilder); + BUILDER_CASE(BINARY, BinaryBuilder); case Type::LIST: { std::shared_ptr value_builder; @@ -105,10 +106,10 @@ Status MakePrimitiveArray(const TypePtr& type, int32_t length, MAKE_PRIMITIVE_ARRAY_CASE(INT32, Int32Array); MAKE_PRIMITIVE_ARRAY_CASE(UINT64, UInt64Array); MAKE_PRIMITIVE_ARRAY_CASE(INT64, Int64Array); - MAKE_PRIMITIVE_ARRAY_CASE(TIME, Int64Array); - MAKE_PRIMITIVE_ARRAY_CASE(TIMESTAMP, TimestampArray); MAKE_PRIMITIVE_ARRAY_CASE(FLOAT, FloatArray); MAKE_PRIMITIVE_ARRAY_CASE(DOUBLE, DoubleArray); + MAKE_PRIMITIVE_ARRAY_CASE(TIME, Int64Array); + MAKE_PRIMITIVE_ARRAY_CASE(TIMESTAMP, TimestampArray); MAKE_PRIMITIVE_ARRAY_CASE(TIMESTAMP_DOUBLE, DoubleArray); default: return Status::NotImplemented(type->ToString()); @@ -120,30 +121,4 @@ Status MakePrimitiveArray(const TypePtr& type, int32_t length, #endif } -Status MakeListArray(const TypePtr& type, int32_t length, - const std::shared_ptr& offsets, const ArrayPtr& values, int32_t null_count, - const std::shared_ptr& null_bitmap, ArrayPtr* out) { - switch (type->type) { - case Type::BINARY: - out->reset(new BinaryArray(type, length, offsets, values, null_count, null_bitmap)); - break; - - case Type::LIST: - out->reset(new ListArray(type, length, offsets, values, null_count, null_bitmap)); - break; - - case Type::DECIMAL_TEXT: - case Type::STRING: - out->reset(new StringArray(type, length, offsets, values, null_count, null_bitmap)); - break; - default: - return Status::NotImplemented(type->ToString()); - } -#ifdef NDEBUG - return Status::OK(); -#else - return (*out)->Validate(); -#endif -} - } // namespace arrow diff --git a/cpp/src/arrow/types/construct.h b/cpp/src/arrow/types/construct.h index afdadbe079013..e18e946d1a64c 100644 --- a/cpp/src/arrow/types/construct.h +++ b/cpp/src/arrow/types/construct.h @@ -42,14 +42,6 @@ Status ARROW_EXPORT MakePrimitiveArray(const std::shared_ptr& type, int32_t length, const std::shared_ptr& data, int32_t null_count, const std::shared_ptr& null_bitmap, std::shared_ptr* out); -// Create new list arrays for logical types that are backed by ListArrays (e.g. list of -// primitives and strings) -// TODO(emkornfield) split up string vs list? -Status ARROW_EXPORT MakeListArray(const std::shared_ptr& type, int32_t length, - const std::shared_ptr& offests, const std::shared_ptr& values, - int32_t null_count, const std::shared_ptr& null_bitmap, - std::shared_ptr* out); - } // namespace arrow #endif // ARROW_BUILDER_H_ diff --git a/cpp/src/arrow/types/json.cc b/cpp/src/arrow/types/json.cc deleted file mode 100644 index 89240fc22bb2c..0000000000000 --- a/cpp/src/arrow/types/json.cc +++ /dev/null @@ -1,37 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. 
The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#include "arrow/types/json.h" - -#include - -#include "arrow/type.h" -#include "arrow/types/union.h" - -namespace arrow { - -static const TypePtr Null(new NullType()); -static const TypePtr Int32(new Int32Type()); -static const TypePtr String(new StringType()); -static const TypePtr Double(new DoubleType()); -static const TypePtr Bool(new BooleanType()); - -static const std::vector kJsonTypes = {Null, Int32, String, Double, Bool}; -TypePtr JSONScalar::dense_type = TypePtr(new DenseUnionType(kJsonTypes)); -TypePtr JSONScalar::sparse_type = TypePtr(new SparseUnionType(kJsonTypes)); - -} // namespace arrow diff --git a/cpp/src/arrow/types/json.h b/cpp/src/arrow/types/json.h deleted file mode 100644 index 9de961f79a60a..0000000000000 --- a/cpp/src/arrow/types/json.h +++ /dev/null @@ -1,36 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
- -#ifndef ARROW_TYPES_JSON_H -#define ARROW_TYPES_JSON_H - -#include "arrow/type.h" - -namespace arrow { - -struct JSONScalar : public DataType { - bool dense; - - static TypePtr dense_type; - static TypePtr sparse_type; - - explicit JSONScalar(bool dense = true) : DataType(Type::JSON_SCALAR), dense(dense) {} -}; - -} // namespace arrow - -#endif // ARROW_TYPES_JSON_H diff --git a/cpp/src/arrow/types/list-test.cc b/cpp/src/arrow/types/list-test.cc index 2e41b4a61caf2..12c539495a28b 100644 --- a/cpp/src/arrow/types/list-test.cc +++ b/cpp/src/arrow/types/list-test.cc @@ -76,7 +76,11 @@ class TestListBuilder : public TestBuilder { builder_ = std::dynamic_pointer_cast(tmp); } - void Done() { result_ = std::dynamic_pointer_cast(builder_->Finish()); } + void Done() { + std::shared_ptr out; + EXPECT_OK(builder_->Finish(&out)); + result_ = std::dynamic_pointer_cast(out); + } protected: TypePtr value_type_; @@ -98,14 +102,17 @@ TEST_F(TestListBuilder, Equality) { // setup two equal arrays ASSERT_OK(builder_->Append(equal_offsets.data(), equal_offsets.size())); ASSERT_OK(vb->Append(equal_values.data(), equal_values.size())); - array = builder_->Finish(); + + ASSERT_OK(builder_->Finish(&array)); ASSERT_OK(builder_->Append(equal_offsets.data(), equal_offsets.size())); ASSERT_OK(vb->Append(equal_values.data(), equal_values.size())); - equal_array = builder_->Finish(); + + ASSERT_OK(builder_->Finish(&equal_array)); // now an unequal one ASSERT_OK(builder_->Append(unequal_offsets.data(), unequal_offsets.size())); ASSERT_OK(vb->Append(unequal_values.data(), unequal_values.size())); - unequal_array = builder_->Finish(); + + ASSERT_OK(builder_->Finish(&unequal_array)); // Test array equality EXPECT_TRUE(array->Equals(array)); diff --git a/cpp/src/arrow/types/list.cc b/cpp/src/arrow/types/list.cc index 6334054caf84a..ef2ec22cb5336 100644 --- a/cpp/src/arrow/types/list.cc +++ b/cpp/src/arrow/types/list.cc @@ -25,7 +25,7 @@ bool ListArray::EqualsExact(const ListArray& other) const { if (null_count_ != other.null_count_) { return false; } bool equal_offsets = - offset_buf_->Equals(*other.offset_buf_, (length_ + 1) * sizeof(int32_t)); + offset_buffer_->Equals(*other.offset_buffer_, (length_ + 1) * sizeof(int32_t)); if (!equal_offsets) { return false; } bool equal_null_bitmap = true; if (null_count_ > 0) { @@ -72,10 +72,10 @@ bool ListArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_st Status ListArray::Validate() const { if (length_ < 0) { return Status::Invalid("Length was negative"); } - if (!offset_buf_) { return Status::Invalid("offset_buf_ was null"); } - if (offset_buf_->size() / static_cast(sizeof(int32_t)) < length_) { + if (!offset_buffer_) { return Status::Invalid("offset_buffer_ was null"); } + if (offset_buffer_->size() / static_cast(sizeof(int32_t)) < length_) { std::stringstream ss; - ss << "offset buffer size (bytes): " << offset_buf_->size() + ss << "offset buffer size (bytes): " << offset_buffer_->size() << " isn't large enough for length: " << length_; return Status::Invalid(ss.str()); } @@ -121,4 +121,38 @@ Status ListArray::Validate() const { return Status::OK(); } +Status ListBuilder::Init(int32_t elements) { + DCHECK_LT(elements, std::numeric_limits::max()); + RETURN_NOT_OK(ArrayBuilder::Init(elements)); + // one more then requested for offsets + return offset_builder_.Resize((elements + 1) * sizeof(int32_t)); +} + +Status ListBuilder::Resize(int32_t capacity) { + DCHECK_LT(capacity, std::numeric_limits::max()); + // one more then requested for offsets + 
RETURN_NOT_OK(offset_builder_.Resize((capacity + 1) * sizeof(int32_t))); + return ArrayBuilder::Resize(capacity); +} + +Status ListBuilder::Finish(std::shared_ptr* out) { + std::shared_ptr items = values_; + if (!items) { RETURN_NOT_OK(value_builder_->Finish(&items)); } + + RETURN_NOT_OK(offset_builder_.Append(items->length())); + std::shared_ptr offsets = offset_builder_.Finish(); + + *out = std::make_shared( + type_, length_, offsets, items, null_count_, null_bitmap_); + + Reset(); + + return Status::OK(); +} + +void ListBuilder::Reset() { + capacity_ = length_ = null_count_ = 0; + null_bitmap_ = nullptr; +} + } // namespace arrow diff --git a/cpp/src/arrow/types/list.h b/cpp/src/arrow/types/list.h index f3894510d091a..9440ffed4bf8a 100644 --- a/cpp/src/arrow/types/list.h +++ b/cpp/src/arrow/types/list.h @@ -43,9 +43,9 @@ class ARROW_EXPORT ListArray : public Array { const ArrayPtr& values, int32_t null_count = 0, std::shared_ptr null_bitmap = nullptr) : Array(type, length, null_count, null_bitmap) { - offset_buf_ = offsets; - offsets_ = offsets == nullptr ? nullptr - : reinterpret_cast(offset_buf_->data()); + offset_buffer_ = offsets; + offsets_ = offsets == nullptr ? nullptr : reinterpret_cast( + offset_buffer_->data()); values_ = values; } @@ -57,7 +57,7 @@ class ARROW_EXPORT ListArray : public Array { // with this array. const std::shared_ptr& values() const { return values_; } const std::shared_ptr offset_buffer() const { - return std::static_pointer_cast(offset_buf_); + return std::static_pointer_cast(offset_buffer_); } const std::shared_ptr& value_type() const { return values_->type(); } @@ -77,7 +77,7 @@ class ARROW_EXPORT ListArray : public Array { const ArrayPtr& arr) const override; protected: - std::shared_ptr offset_buf_; + std::shared_ptr offset_buffer_; const int32_t* offsets_; ArrayPtr values_; }; @@ -119,19 +119,9 @@ class ARROW_EXPORT ListBuilder : public ArrayBuilder { virtual ~ListBuilder() {} - Status Init(int32_t elements) override { - DCHECK_LT(elements, std::numeric_limits::max()); - RETURN_NOT_OK(ArrayBuilder::Init(elements)); - // one more then requested for offsets - return offset_builder_.Resize((elements + 1) * sizeof(int32_t)); - } - - Status Resize(int32_t capacity) override { - DCHECK_LT(capacity, std::numeric_limits::max()); - // one more then requested for offsets - RETURN_NOT_OK(offset_builder_.Resize((capacity + 1) * sizeof(int32_t))); - return ArrayBuilder::Resize(capacity); - } + Status Init(int32_t elements) override; + Status Resize(int32_t capacity) override; + Status Finish(std::shared_ptr* out) override; // Vector append // @@ -145,27 +135,6 @@ class ARROW_EXPORT ListBuilder : public ArrayBuilder { return Status::OK(); } - // The same as Finalize but allows for overridding the c++ type - template - std::shared_ptr Transfer() { - std::shared_ptr items = values_; - if (!items) { items = value_builder_->Finish(); } - - offset_builder_.Append(items->length()); - - const auto offsets_buffer = offset_builder_.Finish(); - auto result = std::make_shared( - type_, length_, offsets_buffer, items, null_count_, null_bitmap_); - - // TODO(emkornfield) make a reset method - capacity_ = length_ = null_count_ = 0; - null_bitmap_ = nullptr; - - return result; - } - - std::shared_ptr Finish() override { return Transfer(); } - // Start a new variable-length list slot // // This function should be called before beginning to append elements to the @@ -188,6 +157,8 @@ class ARROW_EXPORT ListBuilder : public ArrayBuilder { BufferBuilder offset_builder_; 
std::shared_ptr value_builder_; std::shared_ptr values_; + + void Reset(); }; } // namespace arrow diff --git a/cpp/src/arrow/types/primitive-test.cc b/cpp/src/arrow/types/primitive-test.cc index 5ac2867932df7..121bd4794f291 100644 --- a/cpp/src/arrow/types/primitive-test.cc +++ b/cpp/src/arrow/types/primitive-test.cc @@ -123,8 +123,11 @@ class TestPrimitiveBuilder : public TestBuilder { auto expected = std::make_shared(size, ex_data, ex_null_count, ex_null_bitmap); - std::shared_ptr result = - std::dynamic_pointer_cast(builder->Finish()); + + std::shared_ptr out; + ASSERT_OK(builder->Finish(&out)); + + std::shared_ptr result = std::dynamic_pointer_cast(out); // Builder is now reset ASSERT_EQ(0, builder->length()); @@ -216,8 +219,10 @@ void TestPrimitiveBuilder::Check( auto expected = std::make_shared(size, ex_data, ex_null_count, ex_null_bitmap); - std::shared_ptr result = - std::dynamic_pointer_cast(builder->Finish()); + + std::shared_ptr out; + ASSERT_OK(builder->Finish(&out)); + std::shared_ptr result = std::dynamic_pointer_cast(out); // Builder is now reset ASSERT_EQ(0, builder->length()); @@ -254,7 +259,7 @@ TYPED_TEST(TestPrimitiveBuilder, TestInit) { int n = 1000; ASSERT_OK(this->builder_->Reserve(n)); ASSERT_EQ(util::next_power2(n), this->builder_->capacity()); - ASSERT_EQ(util::next_power2(type_traits::bytes_required(n)), + ASSERT_EQ(util::next_power2(TypeTraits::bytes_required(n)), this->builder_->data()->size()); // unsure if this should go in all builder classes @@ -267,7 +272,8 @@ TYPED_TEST(TestPrimitiveBuilder, TestAppendNull) { ASSERT_OK(this->builder_->AppendNull()); } - auto result = this->builder_->Finish(); + std::shared_ptr result; + ASSERT_OK(this->builder_->Finish(&result)); for (int i = 0; i < size; ++i) { ASSERT_TRUE(result->IsNull(i)) << i; @@ -298,7 +304,8 @@ TYPED_TEST(TestPrimitiveBuilder, TestArrayDtorDealloc) { } do { - std::shared_ptr result = this->builder_->Finish(); + std::shared_ptr result; + ASSERT_OK(this->builder_->Finish(&result)); } while (false); ASSERT_EQ(memory_before, this->pool_->bytes_allocated()); @@ -315,8 +322,7 @@ Status MakeArray(const vector& valid_bytes, const vector& draws, int RETURN_NOT_OK(builder->AppendNull()); } } - *out = builder->Finish(); - return Status::OK(); + return builder->Finish(out); } TYPED_TEST(TestPrimitiveBuilder, Equality) { @@ -465,7 +471,7 @@ TYPED_TEST(TestPrimitiveBuilder, TestResize) { ASSERT_OK(this->builder_->Reserve(cap)); ASSERT_EQ(cap, this->builder_->capacity()); - ASSERT_EQ(type_traits::bytes_required(cap), this->builder_->data()->size()); + ASSERT_EQ(TypeTraits::bytes_required(cap), this->builder_->data()->size()); ASSERT_EQ(util::bytes_for_bits(cap), this->builder_->null_bitmap()->size()); } diff --git a/cpp/src/arrow/types/primitive.cc b/cpp/src/arrow/types/primitive.cc index 9ba2ebdcc2d5b..3a05ccfdf1861 100644 --- a/cpp/src/arrow/types/primitive.cc +++ b/cpp/src/arrow/types/primitive.cc @@ -69,12 +69,25 @@ bool PrimitiveArray::Equals(const std::shared_ptr& arr) const { return EqualsExact(*static_cast(arr.get())); } +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; + template Status PrimitiveBuilder::Init(int32_t capacity) { RETURN_NOT_OK(ArrayBuilder::Init(capacity)); data_ = 
std::make_shared(pool_); - int64_t nbytes = type_traits::bytes_required(capacity); + int64_t nbytes = TypeTraits::bytes_required(capacity); RETURN_NOT_OK(data_->Resize(nbytes)); // TODO(emkornfield) valgrind complains without this memset(data_->mutable_data(), 0, nbytes); @@ -93,10 +106,9 @@ Status PrimitiveBuilder::Resize(int32_t capacity) { } else { RETURN_NOT_OK(ArrayBuilder::Resize(capacity)); const int64_t old_bytes = data_->size(); - const int64_t new_bytes = type_traits::bytes_required(capacity); + const int64_t new_bytes = TypeTraits::bytes_required(capacity); RETURN_NOT_OK(data_->Resize(new_bytes)); raw_data_ = reinterpret_cast(data_->mutable_data()); - memset(data_->mutable_data() + old_bytes, 0, new_bytes - old_bytes); } return Status::OK(); @@ -108,7 +120,7 @@ Status PrimitiveBuilder::Append( RETURN_NOT_OK(Reserve(length)); if (length > 0) { - memcpy(raw_data_ + length_, values, type_traits::bytes_required(length)); + memcpy(raw_data_ + length_, values, TypeTraits::bytes_required(length)); } // length_ is update by these @@ -118,13 +130,18 @@ Status PrimitiveBuilder::Append( } template -std::shared_ptr PrimitiveBuilder::Finish() { - std::shared_ptr result = std::make_shared::ArrayType>( +Status PrimitiveBuilder::Finish(std::shared_ptr* out) { + const int64_t bytes_required = TypeTraits::bytes_required(length_); + if (bytes_required > 0 && bytes_required < data_->size()) { + // Trim buffers + RETURN_NOT_OK(data_->Resize(bytes_required)); + } + *out = std::make_shared::ArrayType>( type_, length_, data_, null_count_, null_bitmap_); data_ = null_bitmap_ = nullptr; capacity_ = length_ = null_count_ = 0; - return result; + return Status::OK(); } template <> diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index c643783f681bd..f21470d96e45b 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -91,7 +91,9 @@ class ARROW_EXPORT NumericArray : public PrimitiveArray { value_type Value(int i) const { return raw_data()[i]; } }; -#define NUMERIC_ARRAY_DECL(NAME, TypeClass) using NAME = NumericArray; +#define NUMERIC_ARRAY_DECL(NAME, TypeClass) \ + using NAME = NumericArray; \ + extern template class ARROW_EXPORT NumericArray; NUMERIC_ARRAY_DECL(UInt8Array, UInt8Type); NUMERIC_ARRAY_DECL(Int8Array, Int8Type); @@ -139,8 +141,7 @@ class ARROW_EXPORT PrimitiveBuilder : public ArrayBuilder { Status Append( const value_type* values, int32_t length, const uint8_t* valid_bytes = nullptr); - std::shared_ptr Finish() override; - + Status Finish(std::shared_ptr* out) override; Status Init(int32_t capacity) override; // Increase the capacity of the builder to accommodate at least the indicated @@ -183,77 +184,77 @@ class ARROW_EXPORT NumericBuilder : public PrimitiveBuilder { }; template <> -struct type_traits { +struct TypeTraits { typedef UInt8Array ArrayType; static inline int bytes_required(int elements) { return elements; } }; template <> -struct type_traits { +struct TypeTraits { typedef Int8Array ArrayType; static inline int bytes_required(int elements) { return elements; } }; template <> -struct type_traits { +struct TypeTraits { typedef UInt16Array ArrayType; static inline int bytes_required(int elements) { return elements * sizeof(uint16_t); } }; template <> -struct type_traits { +struct TypeTraits { typedef Int16Array ArrayType; static inline int bytes_required(int elements) { return elements * sizeof(int16_t); } }; template <> -struct type_traits { +struct TypeTraits { typedef UInt32Array ArrayType; static inline int 
bytes_required(int elements) { return elements * sizeof(uint32_t); } }; template <> -struct type_traits { +struct TypeTraits { typedef Int32Array ArrayType; static inline int bytes_required(int elements) { return elements * sizeof(int32_t); } }; template <> -struct type_traits { +struct TypeTraits { typedef UInt64Array ArrayType; static inline int bytes_required(int elements) { return elements * sizeof(uint64_t); } }; template <> -struct type_traits { +struct TypeTraits { typedef Int64Array ArrayType; static inline int bytes_required(int elements) { return elements * sizeof(int64_t); } }; template <> -struct type_traits { +struct TypeTraits { typedef TimestampArray ArrayType; static inline int bytes_required(int elements) { return elements * sizeof(int64_t); } }; template <> -struct type_traits { +struct TypeTraits { typedef FloatArray ArrayType; static inline int bytes_required(int elements) { return elements * sizeof(float); } }; template <> -struct type_traits { +struct TypeTraits { typedef DoubleArray ArrayType; static inline int bytes_required(int elements) { return elements * sizeof(double); } @@ -293,7 +294,7 @@ class ARROW_EXPORT BooleanArray : public PrimitiveArray { }; template <> -struct type_traits { +struct TypeTraits { typedef BooleanArray ArrayType; static inline int bytes_required(int elements) { diff --git a/cpp/src/arrow/types/string-test.cc b/cpp/src/arrow/types/string-test.cc index 6807b00e8ca99..d897e30a3c6a2 100644 --- a/cpp/src/arrow/types/string-test.cc +++ b/cpp/src/arrow/types/string-test.cc @@ -66,18 +66,13 @@ class TestStringContainer : public ::testing::Test { void MakeArray() { length_ = offsets_.size() - 1; - int nchars = chars_.size(); - value_buf_ = test::to_buffer(chars_); - values_ = ArrayPtr(new UInt8Array(nchars, value_buf_)); - offsets_buf_ = test::to_buffer(offsets_); - null_bitmap_ = test::bytes_to_null_buffer(valid_bytes_); null_count_ = test::null_count(valid_bytes_); strings_ = std::make_shared( - length_, offsets_buf_, values_, null_count_, null_bitmap_); + length_, offsets_buf_, value_buf_, null_count_, null_bitmap_); } protected: @@ -94,7 +89,6 @@ class TestStringContainer : public ::testing::Test { int null_count_; int length_; - ArrayPtr values_; std::shared_ptr strings_; }; @@ -122,7 +116,7 @@ TEST_F(TestStringContainer, TestListFunctions) { TEST_F(TestStringContainer, TestDestructor) { auto arr = std::make_shared( - length_, offsets_buf_, values_, null_count_, null_bitmap_); + length_, offsets_buf_, value_buf_, null_count_, null_bitmap_); } TEST_F(TestStringContainer, TestGetString) { @@ -147,7 +141,10 @@ class TestStringBuilder : public TestBuilder { } void Done() { - result_ = std::dynamic_pointer_cast(builder_->Finish()); + std::shared_ptr out; + EXPECT_OK(builder_->Finish(&out)); + + result_ = std::dynamic_pointer_cast(out); result_->Validate(); } @@ -178,7 +175,7 @@ TEST_F(TestStringBuilder, TestScalarAppend) { ASSERT_EQ(reps * N, result_->length()); ASSERT_EQ(reps, result_->null_count()); - ASSERT_EQ(reps * 6, result_->values()->length()); + ASSERT_EQ(reps * 6, result_->data()->size()); int32_t length; int32_t pos = 0; @@ -218,18 +215,14 @@ class TestBinaryContainer : public ::testing::Test { void MakeArray() { length_ = offsets_.size() - 1; - int nchars = chars_.size(); - value_buf_ = test::to_buffer(chars_); - values_ = ArrayPtr(new UInt8Array(nchars, value_buf_)); - offsets_buf_ = test::to_buffer(offsets_); null_bitmap_ = test::bytes_to_null_buffer(valid_bytes_); null_count_ = test::null_count(valid_bytes_); strings_ = 
std::make_shared( - length_, offsets_buf_, values_, null_count_, null_bitmap_); + length_, offsets_buf_, value_buf_, null_count_, null_bitmap_); } protected: @@ -246,7 +239,6 @@ class TestBinaryContainer : public ::testing::Test { int null_count_; int length_; - ArrayPtr values_; std::shared_ptr strings_; }; @@ -274,7 +266,7 @@ TEST_F(TestBinaryContainer, TestListFunctions) { TEST_F(TestBinaryContainer, TestDestructor) { auto arr = std::make_shared( - length_, offsets_buf_, values_, null_count_, null_bitmap_); + length_, offsets_buf_, value_buf_, null_count_, null_bitmap_); } TEST_F(TestBinaryContainer, TestGetValue) { @@ -298,7 +290,10 @@ class TestBinaryBuilder : public TestBuilder { } void Done() { - result_ = std::dynamic_pointer_cast(builder_->Finish()); + std::shared_ptr out; + EXPECT_OK(builder_->Finish(&out)); + + result_ = std::dynamic_pointer_cast(out); result_->Validate(); } @@ -330,7 +325,7 @@ TEST_F(TestBinaryBuilder, TestScalarAppend) { ASSERT_OK(result_->Validate()); ASSERT_EQ(reps * N, result_->length()); ASSERT_EQ(reps, result_->null_count()); - ASSERT_EQ(reps * 6, result_->values()->length()); + ASSERT_EQ(reps * 6, result_->data()->size()); int32_t length; for (int i = 0; i < N * reps; ++i) { diff --git a/cpp/src/arrow/types/string.cc b/cpp/src/arrow/types/string.cc index 745ed8f7edb99..d692e13773f56 100644 --- a/cpp/src/arrow/types/string.cc +++ b/cpp/src/arrow/types/string.cc @@ -17,6 +17,7 @@ #include "arrow/types/string.h" +#include #include #include @@ -24,37 +25,77 @@ namespace arrow { -const std::shared_ptr BINARY(new BinaryType()); -const std::shared_ptr STRING(new StringType()); +static std::shared_ptr kBinary = std::make_shared(); +static std::shared_ptr kString = std::make_shared(); BinaryArray::BinaryArray(int32_t length, const std::shared_ptr& offsets, - const ArrayPtr& values, int32_t null_count, + const std::shared_ptr& data, int32_t null_count, const std::shared_ptr& null_bitmap) - : BinaryArray(BINARY, length, offsets, values, null_count, null_bitmap) {} + : BinaryArray(kBinary, length, offsets, data, null_count, null_bitmap) {} BinaryArray::BinaryArray(const TypePtr& type, int32_t length, - const std::shared_ptr& offsets, const ArrayPtr& values, int32_t null_count, - const std::shared_ptr& null_bitmap) - : ListArray(type, length, offsets, values, null_count, null_bitmap), - bytes_(std::dynamic_pointer_cast(values).get()), - raw_bytes_(bytes_->raw_data()) { - // Check in case the dynamic cast fails. - DCHECK(bytes_); + const std::shared_ptr& offsets, const std::shared_ptr& data, + int32_t null_count, const std::shared_ptr& null_bitmap) + : Array(type, length, null_count, null_bitmap), + offset_buffer_(offsets), + offsets_(reinterpret_cast(offset_buffer_->data())), + data_buffer_(data), + data_(nullptr) { + if (data_buffer_ != nullptr) { data_ = data_buffer_->data(); } } Status BinaryArray::Validate() const { - if (values()->null_count() > 0) { - std::stringstream ss; - ss << type()->ToString() << " can have null values in the value array"; - Status::Invalid(ss.str()); + // TODO(wesm): what to do here? 
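+  // A possible answer (a hypothetical sketch, not implemented in this patch):
+  // check that the offsets are non-decreasing and end inside the data buffer:
+  //
+  //   for (int32_t i = 0; i < length_; ++i) {
+  //     if (offsets_[i + 1] < offsets_[i]) {
+  //       return Status::Invalid("offsets must be monotonically non-decreasing");
+  //     }
+  //   }
+  //   if (data_buffer_ != nullptr && offsets_[length_] > data_buffer_->size()) {
+  //     return Status::Invalid("final offset exceeds size of data buffer");
+  //   }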
+ return Status::OK(); +} + +bool BinaryArray::EqualsExact(const BinaryArray& other) const { + if (!Array::EqualsExact(other)) { return false; } + + bool equal_offsets = + offset_buffer_->Equals(*other.offset_buffer_, (length_ + 1) * sizeof(int32_t)); + if (!equal_offsets) { return false; } + + return data_buffer_->Equals(*other.data_buffer_, data_buffer_->size()); +} + +bool BinaryArray::Equals(const std::shared_ptr& arr) const { + if (this == arr.get()) { return true; } + if (this->type_enum() != arr->type_enum()) { return false; } + return EqualsExact(*static_cast(arr.get())); +} + +bool BinaryArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, + const std::shared_ptr& arr) const { + if (this == arr.get()) { return true; } + if (!arr) { return false; } + if (this->type_enum() != arr->type_enum()) { return false; } + const auto other = static_cast(arr.get()); + for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { + const bool is_null = IsNull(i); + if (is_null != arr->IsNull(o_i)) { return false; } + if (is_null) continue; + const int32_t begin_offset = offset(i); + const int32_t end_offset = offset(i + 1); + const int32_t other_begin_offset = other->offset(o_i); + const int32_t other_end_offset = other->offset(o_i + 1); + // Underlying can't be equal if the size isn't equal + if (end_offset - begin_offset != other_end_offset - other_begin_offset) { + return false; + } + + if (std::memcmp(data_ + begin_offset, other->data_ + other_begin_offset, + end_offset - begin_offset)) { + return false; + } } - return ListArray::Validate(); + return true; } StringArray::StringArray(int32_t length, const std::shared_ptr& offsets, - const ArrayPtr& values, int32_t null_count, + const std::shared_ptr& data, int32_t null_count, const std::shared_ptr& null_bitmap) - : StringArray(STRING, length, offsets, values, null_count, null_bitmap) {} + : BinaryArray(kString, length, offsets, data, null_count, null_bitmap) {} Status StringArray::Validate() const { // TODO(emkornfield) Validate proper UTF8 code points? 
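  // One hypothetical way to do that (not part of this patch) is to scan each
  // non-null slot and reject malformed byte sequences:
  //
  //   for (int32_t i = 0; i < length_; ++i) {
  //     if (IsNull(i)) continue;
  //     int32_t nbytes;
  //     const uint8_t* bytes = GetValue(i, &nbytes);
  //     if (!IsValidUtf8(bytes, nbytes)) {  // IsValidUtf8: an assumed helper
  //       return Status::Invalid("string slot contains invalid UTF-8");
  //     }
  //   }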
@@ -72,4 +113,28 @@ BinaryBuilder::BinaryBuilder(MemoryPool* pool, const TypePtr& type)
   byte_builder_ = static_cast<UInt8Builder*>(value_builder_.get());
 }

+Status BinaryBuilder::Finish(std::shared_ptr<Array>* out) {
+  std::shared_ptr<Array> result;
+  RETURN_NOT_OK(ListBuilder::Finish(&result));
+
+  const auto list = std::dynamic_pointer_cast<ListArray>(result);
+  auto values = std::dynamic_pointer_cast<UInt8Array>(list->values());
+
+  *out = std::make_shared<BinaryArray>(list->length(), list->offset_buffer(),
+      values->data(), list->null_count(), list->null_bitmap());
+  return Status::OK();
+}
+
+Status StringBuilder::Finish(std::shared_ptr<Array>* out) {
+  std::shared_ptr<Array> result;
+  RETURN_NOT_OK(ListBuilder::Finish(&result));
+
+  const auto list = std::dynamic_pointer_cast<ListArray>(result);
+  auto values = std::dynamic_pointer_cast<UInt8Array>(list->values());
+
+  *out = std::make_shared<StringArray>(list->length(), list->offset_buffer(),
+      values->data(), list->null_count(), list->null_bitmap());
+  return Status::OK();
+}
+
 }  // namespace arrow
diff --git a/cpp/src/arrow/types/string.h b/cpp/src/arrow/types/string.h
index bab0c58f617b2..aaba49c60237d 100644
--- a/cpp/src/arrow/types/string.h
+++ b/cpp/src/arrow/types/string.h
@@ -35,15 +35,16 @@ namespace arrow {
 class Buffer;
 class MemoryPool;

-class ARROW_EXPORT BinaryArray : public ListArray {
+class ARROW_EXPORT BinaryArray : public Array {
  public:
   BinaryArray(int32_t length, const std::shared_ptr<Buffer>& offsets,
-      const ArrayPtr& values, int32_t null_count = 0,
+      const std::shared_ptr<Buffer>& data, int32_t null_count = 0,
       const std::shared_ptr<Buffer>& null_bitmap = nullptr);
+
   // Constructor that allows sub-classes/builders to propagate their logical type up the
   // class hierarchy.
   BinaryArray(const TypePtr& type, int32_t length, const std::shared_ptr<Buffer>& offsets,
-      const ArrayPtr& values, int32_t null_count = 0,
+      const std::shared_ptr<Buffer>& data, int32_t null_count = 0,
       const std::shared_ptr<Buffer>& null_bitmap = nullptr);

   // Return the pointer to the given element's bytes
@@ -53,28 +54,38 @@ class ARROW_EXPORT BinaryArray : public ListArray {
     DCHECK(out_length);
     const int32_t pos = offsets_[i];
     *out_length = offsets_[i + 1] - pos;
-    return raw_bytes_ + pos;
+    return data_ + pos;
   }

+  std::shared_ptr<Buffer> data() const { return data_buffer_; }
+  std::shared_ptr<Buffer> offsets() const { return offset_buffer_; }
+
+  int32_t offset(int i) const { return offsets_[i]; }
+
+  // Neither of these functions will perform boundschecking
+  int32_t value_offset(int i) const { return offsets_[i]; }
+  int32_t value_length(int i) const { return offsets_[i + 1] - offsets_[i]; }
+
+  bool EqualsExact(const BinaryArray& other) const;
+  bool Equals(const std::shared_ptr<Array>& arr) const override;
+  bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx,
+      const ArrayPtr& arr) const override;
+
   Status Validate() const override;

  private:
-  UInt8Array* bytes_;
-  const uint8_t* raw_bytes_;
+  std::shared_ptr<Buffer> offset_buffer_;
+  const int32_t* offsets_;
+
+  std::shared_ptr<Buffer> data_buffer_;
+  const uint8_t* data_;
 };

 class ARROW_EXPORT StringArray : public BinaryArray {
  public:
   StringArray(int32_t length, const std::shared_ptr<Buffer>& offsets,
-      const ArrayPtr& values, int32_t null_count = 0,
+      const std::shared_ptr<Buffer>& data, int32_t null_count = 0,
       const std::shared_ptr<Buffer>& null_bitmap = nullptr);

-  // Constructor that allows overriding the logical type, so subclasses can propagate
-  // there
-  // up the class hierarchy.
- StringArray(const TypePtr& type, int32_t length, const std::shared_ptr& offsets, - const ArrayPtr& values, int32_t null_count = 0, - const std::shared_ptr& null_bitmap = nullptr) - : BinaryArray(type, length, offsets, values, null_count, null_bitmap) {} // Construct a std::string // TODO: std::bad_alloc possibility @@ -98,9 +109,7 @@ class ARROW_EXPORT BinaryBuilder : public ListBuilder { return byte_builder_->Append(value, length); } - std::shared_ptr Finish() override { - return ListBuilder::Transfer(); - } + Status Finish(std::shared_ptr* out) override; protected: UInt8Builder* byte_builder_; @@ -112,6 +121,8 @@ class ARROW_EXPORT StringBuilder : public BinaryBuilder { explicit StringBuilder(MemoryPool* pool, const TypePtr& type) : BinaryBuilder(pool, type) {} + Status Finish(std::shared_ptr* out) override; + Status Append(const std::string& value) { return Append(value.c_str(), value.size()); } Status Append(const char* value, int32_t length) { @@ -119,10 +130,6 @@ class ARROW_EXPORT StringBuilder : public BinaryBuilder { } Status Append(const std::vector& values, uint8_t* null_bytes); - - std::shared_ptr Finish() override { - return ListBuilder::Transfer(); - } }; } // namespace arrow diff --git a/cpp/src/arrow/types/struct-test.cc b/cpp/src/arrow/types/struct-test.cc index ccf5a52dc831c..8e82c389a9423 100644 --- a/cpp/src/arrow/types/struct-test.cc +++ b/cpp/src/arrow/types/struct-test.cc @@ -119,7 +119,11 @@ class TestStructBuilder : public TestBuilder { ASSERT_EQ(2, static_cast(builder_->field_builders().size())); } - void Done() { result_ = std::dynamic_pointer_cast(builder_->Finish()); } + void Done() { + std::shared_ptr out; + ASSERT_OK(builder_->Finish(&out)); + result_ = std::dynamic_pointer_cast(out); + } protected: std::vector value_fields_; @@ -294,7 +298,8 @@ TEST_F(TestStructBuilder, TestEquality) { for (int32_t value : int_values) { int_vb->UnsafeAppend(value); } - array = builder_->Finish(); + + ASSERT_OK(builder_->Finish(&array)); ASSERT_OK(builder_->Resize(list_lengths.size())); ASSERT_OK(char_vb->Resize(list_values.size())); @@ -308,7 +313,8 @@ TEST_F(TestStructBuilder, TestEquality) { for (int32_t value : int_values) { int_vb->UnsafeAppend(value); } - equal_array = builder_->Finish(); + + ASSERT_OK(builder_->Finish(&equal_array)); ASSERT_OK(builder_->Resize(list_lengths.size())); ASSERT_OK(char_vb->Resize(list_values.size())); @@ -323,7 +329,8 @@ TEST_F(TestStructBuilder, TestEquality) { for (int32_t value : int_values) { int_vb->UnsafeAppend(value); } - unequal_bitmap_array = builder_->Finish(); + + ASSERT_OK(builder_->Finish(&unequal_bitmap_array)); ASSERT_OK(builder_->Resize(list_lengths.size())); ASSERT_OK(char_vb->Resize(list_values.size())); @@ -339,7 +346,8 @@ TEST_F(TestStructBuilder, TestEquality) { for (int32_t value : int_values) { int_vb->UnsafeAppend(value); } - unequal_offsets_array = builder_->Finish(); + + ASSERT_OK(builder_->Finish(&unequal_offsets_array)); ASSERT_OK(builder_->Resize(list_lengths.size())); ASSERT_OK(char_vb->Resize(list_values.size())); @@ -354,7 +362,8 @@ TEST_F(TestStructBuilder, TestEquality) { for (int32_t value : unequal_int_values) { int_vb->UnsafeAppend(value); } - unequal_values_array = builder_->Finish(); + + ASSERT_OK(builder_->Finish(&unequal_values_array)); // Test array equality EXPECT_TRUE(array->Equals(array)); diff --git a/cpp/src/arrow/types/struct.cc b/cpp/src/arrow/types/struct.cc index e8176f08268b4..369c29d15ef96 100644 --- a/cpp/src/arrow/types/struct.cc +++ b/cpp/src/arrow/types/struct.cc @@ -87,4 +87,18 @@ 
Status StructArray::Validate() const { return Status::OK(); } +Status StructBuilder::Finish(std::shared_ptr* out) { + std::vector> fields(field_builders_.size()); + for (size_t i = 0; i < field_builders_.size(); ++i) { + RETURN_NOT_OK(field_builders_[i]->Finish(&fields[i])); + } + + *out = std::make_shared(type_, length_, fields, null_count_, null_bitmap_); + + null_bitmap_ = nullptr; + capacity_ = length_ = null_count_ = 0; + + return Status::OK(); +} + } // namespace arrow diff --git a/cpp/src/arrow/types/struct.h b/cpp/src/arrow/types/struct.h index 63955eb31bb7d..65b8daf214a69 100644 --- a/cpp/src/arrow/types/struct.h +++ b/cpp/src/arrow/types/struct.h @@ -73,6 +73,8 @@ class ARROW_EXPORT StructBuilder : public ArrayBuilder { field_builders_ = field_builders; } + Status Finish(std::shared_ptr* out) override; + // Null bitmap is of equal length to every child field, and any zero byte // will be considered as a null for that field, but users must using app- // end methods or advance methods of the child builders' independently to @@ -83,21 +85,6 @@ class ARROW_EXPORT StructBuilder : public ArrayBuilder { return Status::OK(); } - std::shared_ptr Finish() override { - std::vector fields; - for (auto it : field_builders_) { - fields.push_back(it->Finish()); - } - - auto result = - std::make_shared(type_, length_, fields, null_count_, null_bitmap_); - - null_bitmap_ = nullptr; - capacity_ = length_ = null_count_ = 0; - - return result; - } - // Append an element to the Struct. All child-builders' Append method must // be called independently to maintain data-structure consistency. Status Append(bool is_valid = true) { diff --git a/cpp/src/arrow/util/status.cc b/cpp/src/arrow/util/status.cc index 8dd07d0d064e7..08e9ae3946e51 100644 --- a/cpp/src/arrow/util/status.cc +++ b/cpp/src/arrow/util/status.cc @@ -49,12 +49,18 @@ std::string Status::CodeAsString() const { case StatusCode::KeyError: type = "Key error"; break; + case StatusCode::TypeError: + type = "Type error"; + break; case StatusCode::Invalid: type = "Invalid"; break; case StatusCode::IOError: type = "IOError"; break; + case StatusCode::UnknownError: + type = "Unknown error"; + break; case StatusCode::NotImplemented: type = "NotImplemented"; break; diff --git a/cpp/src/arrow/util/status.h b/cpp/src/arrow/util/status.h index d5585313c728b..05f5b749b60cb 100644 --- a/cpp/src/arrow/util/status.h +++ b/cpp/src/arrow/util/status.h @@ -78,9 +78,10 @@ enum class StatusCode : char { OK = 0, OutOfMemory = 1, KeyError = 2, - Invalid = 3, - IOError = 4, - + TypeError = 3, + Invalid = 4, + IOError = 5, + UnknownError = 9, NotImplemented = 10, }; @@ -106,6 +107,14 @@ class ARROW_EXPORT Status { return Status(StatusCode::KeyError, msg, -1); } + static Status TypeError(const std::string& msg) { + return Status(StatusCode::TypeError, msg, -1); + } + + static Status UnknownError(const std::string& msg) { + return Status(StatusCode::UnknownError, msg, -1); + } + static Status NotImplemented(const std::string& msg) { return Status(StatusCode::NotImplemented, msg, -1); } @@ -125,6 +134,8 @@ class ARROW_EXPORT Status { bool IsKeyError() const { return code() == StatusCode::KeyError; } bool IsInvalid() const { return code() == StatusCode::Invalid; } bool IsIOError() const { return code() == StatusCode::IOError; } + + bool IsUnknownError() const { return code() == StatusCode::UnknownError; } bool IsNotImplemented() const { return code() == StatusCode::NotImplemented; } // Return a string representation of this status suitable for printing. 
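The two status codes added above follow Status's existing factory-plus-predicate pattern, and status.cc teaches CodeAsString() the matching labels. A minimal caller-side sketch (illustrative only, using just the API visible in this hunk):

#include <iostream>
#include "arrow/util/status.h"

int main() {
  arrow::Status s = arrow::Status::TypeError("unsupported logical type");
  if (!s.ok()) {
    // With this patch applied, prints "Type error: unsupported logical type"
    std::cout << s.ToString() << std::endl;
  }
  return 0;
}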
diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 55f6d0543a108..4357fa05ff864 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -438,8 +438,6 @@ set(PYARROW_SRCS src/pyarrow/config.cc src/pyarrow/helpers.cc src/pyarrow/io.cc - src/pyarrow/status.cc - src/pyarrow/adapters/builtin.cc src/pyarrow/adapters/pandas.cc ) diff --git a/python/doc/index.rst b/python/doc/index.rst index 550e544eef9e8..88725badc1e24 100644 --- a/python/doc/index.rst +++ b/python/doc/index.rst @@ -1,3 +1,20 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + Apache Arrow (Python) ===================== @@ -25,4 +42,3 @@ Indices and tables * :ref:`genindex` * :ref:`modindex` * :ref:`search` - diff --git a/python/pyarrow/error.pxd b/python/pyarrow/error.pxd index 891d1ac1c7ea0..4fb46c25fafe4 100644 --- a/python/pyarrow/error.pxd +++ b/python/pyarrow/error.pxd @@ -16,7 +16,5 @@ # under the License. from pyarrow.includes.libarrow cimport CStatus -from pyarrow.includes.pyarrow cimport PyStatus -cdef int check_cstatus(const CStatus& status) nogil except -1 -cdef int check_status(const PyStatus& status) nogil except -1 +cdef int check_status(const CStatus& status) nogil except -1 diff --git a/python/pyarrow/error.pyx b/python/pyarrow/error.pyx index a2c53fed8c6a0..b8a82b3754c1b 100644 --- a/python/pyarrow/error.pyx +++ b/python/pyarrow/error.pyx @@ -22,15 +22,7 @@ from pyarrow.compat import frombytes class ArrowException(Exception): pass -cdef int check_cstatus(const CStatus& status) nogil except -1: - if status.ok(): - return 0 - - cdef c_string c_message = status.ToString() - with gil: - raise ArrowException(frombytes(c_message)) - -cdef int check_status(const PyStatus& status) nogil except -1: +cdef int check_status(const CStatus& status) nogil except -1: if status.ok(): return 0 diff --git a/python/pyarrow/includes/pyarrow.pxd b/python/pyarrow/includes/pyarrow.pxd index 7c47f21854e33..e1da1914c5743 100644 --- a/python/pyarrow/includes/pyarrow.pxd +++ b/python/pyarrow/includes/pyarrow.pxd @@ -25,36 +25,19 @@ cimport pyarrow.includes.libarrow_io as arrow_io cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil: - # We can later add more of the common status factory methods as needed - cdef PyStatus PyStatus_OK "Status::OK"() - - cdef cppclass PyStatus "pyarrow::Status": - PyStatus() - - c_string ToString() - - c_bool ok() - c_bool IsOutOfMemory() - c_bool IsKeyError() - c_bool IsTypeError() - c_bool IsIOError() - c_bool IsValueError() - c_bool IsNotImplemented() - c_bool IsArrowError() - shared_ptr[CDataType] GetPrimitiveType(Type type) - PyStatus ConvertPySequence(object obj, shared_ptr[CArray]* out) + CStatus ConvertPySequence(object obj, shared_ptr[CArray]* out) - PyStatus PandasToArrow(MemoryPool* pool, object ao, - shared_ptr[CArray]* 
out) - PyStatus PandasMaskedToArrow(MemoryPool* pool, object ao, object mo, - shared_ptr[CArray]* out) + CStatus PandasToArrow(MemoryPool* pool, object ao, + shared_ptr[CArray]* out) + CStatus PandasMaskedToArrow(MemoryPool* pool, object ao, object mo, + shared_ptr[CArray]* out) - PyStatus ConvertArrayToPandas(const shared_ptr[CArray]& arr, - object py_ref, PyObject** out) + CStatus ConvertArrayToPandas(const shared_ptr[CArray]& arr, + object py_ref, PyObject** out) - PyStatus ConvertColumnToPandas(const shared_ptr[CColumn]& arr, - object py_ref, PyObject** out) + CStatus ConvertColumnToPandas(const shared_ptr[CColumn]& arr, + object py_ref, PyObject** out) MemoryPool* get_memory_pool() diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index 8970e06effdd0..16ebfa1138e46 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -28,7 +28,7 @@ cimport pyarrow.includes.pyarrow as pyarrow from pyarrow.includes.libarrow_io cimport * from pyarrow.compat import frombytes, tobytes -from pyarrow.error cimport check_cstatus +from pyarrow.error cimport check_status cimport cpython as cp @@ -57,9 +57,9 @@ cdef class NativeFile: if self.is_open: with nogil: if self.is_readonly: - check_cstatus(self.rd_file.get().Close()) + check_status(self.rd_file.get().Close()) else: - check_cstatus(self.wr_file.get().Close()) + check_status(self.wr_file.get().Close()) self.is_open = False cdef read_handle(self, shared_ptr[ReadableFileInterface]* file): @@ -88,22 +88,22 @@ cdef class NativeFile: cdef int64_t size self._assert_readable() with nogil: - check_cstatus(self.rd_file.get().GetSize(&size)) + check_status(self.rd_file.get().GetSize(&size)) return size def tell(self): cdef int64_t position with nogil: if self.is_readonly: - check_cstatus(self.rd_file.get().Tell(&position)) + check_status(self.rd_file.get().Tell(&position)) else: - check_cstatus(self.wr_file.get().Tell(&position)) + check_status(self.wr_file.get().Tell(&position)) return position def seek(self, int64_t position): self._assert_readable() with nogil: - check_cstatus(self.rd_file.get().Seek(position)) + check_status(self.rd_file.get().Seek(position)) def write(self, data): """ @@ -116,7 +116,7 @@ cdef class NativeFile: cdef const uint8_t* buf = cp.PyBytes_AS_STRING(data) cdef int64_t bufsize = len(data) with nogil: - check_cstatus(self.wr_file.get().Write(buf, bufsize)) + check_status(self.wr_file.get().Write(buf, bufsize)) def read(self, int nbytes): cdef: @@ -127,8 +127,7 @@ cdef class NativeFile: self._assert_readable() with nogil: - check_cstatus(self.rd_file.get() - .ReadB(nbytes, &out)) + check_status(self.rd_file.get().ReadB(nbytes, &out)) result = cp.PyBytes_FromStringAndSize( out.get().data(), out.get().size()) @@ -223,7 +222,7 @@ cdef class InMemoryOutputStream(NativeFile): def get_result(self): cdef Buffer result = Buffer() - check_cstatus(self.wr_file.get().Close()) + check_status(self.wr_file.get().Close()) result.init( self.buffer) self.is_open = False @@ -270,7 +269,7 @@ except ImportError: def have_libhdfs(): try: - check_cstatus(ConnectLibHdfs()) + check_status(ConnectLibHdfs()) return True except: return False @@ -304,7 +303,7 @@ cdef class HdfsClient: def close(self): self._ensure_client() with nogil: - check_cstatus(self.client.get().Disconnect()) + check_status(self.client.get().Disconnect()) self.is_open = False cdef _ensure_client(self): @@ -341,8 +340,7 @@ cdef class HdfsClient: conf.user = tobytes(user) with nogil: - check_cstatus( - CHdfsClient.Connect(&conf, &out.client)) + 
check_status(CHdfsClient.Connect(&conf, &out.client)) out.is_open = True return out @@ -383,8 +381,8 @@ cdef class HdfsClient: self._ensure_client() with nogil: - check_cstatus(self.client.get() - .ListDirectory(c_path, &listing)) + check_status(self.client.get() + .ListDirectory(c_path, &listing)) cdef const HdfsPathInfo* info for i in range( listing.size()): @@ -422,8 +420,8 @@ cdef class HdfsClient: cdef c_string c_path = tobytes(path) with nogil: - check_cstatus(self.client.get() - .CreateDirectory(c_path)) + check_status(self.client.get() + .CreateDirectory(c_path)) def delete(self, path, bint recursive=False): """ @@ -439,8 +437,8 @@ cdef class HdfsClient: cdef c_string c_path = tobytes(path) with nogil: - check_cstatus(self.client.get() - .Delete(c_path, recursive)) + check_status(self.client.get() + .Delete(c_path, recursive)) def open(self, path, mode='rb', buffer_size=None, replication=None, default_block_size=None): @@ -473,7 +471,7 @@ cdef class HdfsClient: append = True with nogil: - check_cstatus( + check_status( self.client.get() .OpenWriteable(c_path, append, c_buffer_size, c_replication, c_default_block_size, @@ -484,8 +482,8 @@ cdef class HdfsClient: out.is_readonly = False else: with nogil: - check_cstatus(self.client.get() - .OpenReadable(c_path, &rd_handle)) + check_status(self.client.get() + .OpenReadable(c_path, &rd_handle)) out.rd_file = rd_handle out.is_readonly = True @@ -579,9 +577,9 @@ cdef class HdfsFile(NativeFile): try: with nogil: while total_bytes < nbytes: - check_cstatus(self.rd_file.get() - .Read(rpc_chunksize, &bytes_read, - buf + total_bytes)) + check_status(self.rd_file.get() + .Read(rpc_chunksize, &bytes_read, + buf + total_bytes)) total_bytes += bytes_read @@ -647,8 +645,8 @@ cdef class HdfsFile(NativeFile): try: while True: with nogil: - check_cstatus(self.rd_file.get() - .Read(self.buffer_size, &bytes_read, buf)) + check_status(self.rd_file.get() + .Read(self.buffer_size, &bytes_read, buf)) total_bytes += bytes_read diff --git a/python/pyarrow/ipc.pyx b/python/pyarrow/ipc.pyx index f8da3a70da819..46deb5ad0c35d 100644 --- a/python/pyarrow/ipc.pyx +++ b/python/pyarrow/ipc.pyx @@ -26,7 +26,7 @@ from pyarrow.includes.libarrow_io cimport * from pyarrow.includes.libarrow_ipc cimport * cimport pyarrow.includes.pyarrow as pyarrow -from pyarrow.error cimport check_cstatus +from pyarrow.error cimport check_status from pyarrow.io cimport NativeFile from pyarrow.schema cimport Schema from pyarrow.table cimport RecordBatch @@ -89,8 +89,8 @@ cdef class ArrowFileWriter: get_writer(sink, &self.sink) with nogil: - check_cstatus(CFileWriter.Open(self.sink.get(), schema.sp_schema, - &self.writer)) + check_status(CFileWriter.Open(self.sink.get(), schema.sp_schema, + &self.writer)) self.closed = False @@ -101,12 +101,12 @@ cdef class ArrowFileWriter: def write_record_batch(self, RecordBatch batch): cdef CRecordBatch* bptr = batch.batch with nogil: - check_cstatus(self.writer.get() - .WriteRecordBatch(bptr.columns(), bptr.num_rows())) + check_status(self.writer.get() + .WriteRecordBatch(bptr.columns(), bptr.num_rows())) def close(self): with nogil: - check_cstatus(self.writer.get().Close()) + check_status(self.writer.get().Close()) self.closed = True @@ -124,9 +124,9 @@ cdef class ArrowFileReader: with nogil: if offset != 0: - check_cstatus(CFileReader.Open2(reader, offset, &self.reader)) + check_status(CFileReader.Open2(reader, offset, &self.reader)) else: - check_cstatus(CFileReader.Open(reader, &self.reader)) + check_status(CFileReader.Open(reader, &self.reader)) 
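(Everywhere above, the old check_cstatus helper is gone: with pyarrow's private Status class removed, every Cython call site funnels through the single check_status(CStatus) wrapper, which returns 0 for an OK status and otherwise reacquires the GIL just long enough to raise ArrowException with the message from Status::ToString().)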
property num_dictionaries: @@ -147,7 +147,7 @@ cdef class ArrowFileReader: raise ValueError('Batch number {0} out of range'.format(i)) with nogil: - check_cstatus(self.reader.get().GetRecordBatch(i, &batch)) + check_status(self.reader.get().GetRecordBatch(i, &batch)) result = RecordBatch() result.init(batch) diff --git a/python/pyarrow/parquet.pyx b/python/pyarrow/parquet.pyx index 2abe57b33ed48..019dd2c1de489 100644 --- a/python/pyarrow/parquet.pyx +++ b/python/pyarrow/parquet.pyx @@ -26,7 +26,7 @@ cimport pyarrow.includes.pyarrow as pyarrow from pyarrow.compat import tobytes from pyarrow.error import ArrowException -from pyarrow.error cimport check_cstatus +from pyarrow.error cimport check_status from pyarrow.io import NativeFile from pyarrow.table cimport Table @@ -62,7 +62,7 @@ cdef class ParquetReader: cdef shared_ptr[ReadableFileInterface] cpp_handle file.read_handle(&cpp_handle) - check_cstatus(OpenFile(cpp_handle, &self.allocator, &self.reader)) + check_status(OpenFile(cpp_handle, &self.allocator, &self.reader)) def read_all(self): cdef: @@ -70,8 +70,8 @@ cdef class ParquetReader: shared_ptr[CTable] ctable with nogil: - check_cstatus(self.reader.get() - .ReadFlatTable(&ctable)) + check_status(self.reader.get() + .ReadFlatTable(&ctable)) table.init(ctable) return table @@ -80,7 +80,7 @@ cdef class ParquetReader: def read_table(source, columns=None): """ Read a Table from Parquet format - + Returns ------- pyarrow.table.Table @@ -176,5 +176,5 @@ def write_table(table, filename, chunk_size=None, version=None, sink.reset(new LocalFileOutputStream(tobytes(filename))) with nogil: - check_cstatus(WriteFlatTable(ctable_, default_memory_pool(), sink, - chunk_size_, properties_builder.build())) + check_status(WriteFlatTable(ctable_, default_memory_pool(), sink, + chunk_size_, properties_builder.build())) diff --git a/python/src/pyarrow/adapters/builtin.cc b/python/src/pyarrow/adapters/builtin.cc index 680f3a539b5fa..c034fbd977747 100644 --- a/python/src/pyarrow/adapters/builtin.cc +++ b/python/src/pyarrow/adapters/builtin.cc @@ -20,13 +20,14 @@ #include "pyarrow/adapters/builtin.h" -#include +#include "arrow/api.h" +#include "arrow/util/status.h" #include "pyarrow/helpers.h" -#include "pyarrow/status.h" using arrow::ArrayBuilder; using arrow::DataType; +using arrow::Status; using arrow::Type; namespace pyarrow { @@ -129,7 +130,7 @@ class SeqVisitor { PyObject* item = item_ref.obj(); if (PyList_Check(item)) { - PY_RETURN_NOT_OK(Visit(item, level + 1)); + RETURN_NOT_OK(Visit(item, level + 1)); } else if (PyDict_Check(item)) { return Status::NotImplemented("No type inference for dicts"); } else { @@ -164,9 +165,9 @@ class SeqVisitor { Status Validate() const { if (scalars_.total_count() > 0) { if (num_nesting_levels() > 1) { - return Status::ValueError("Mixed nesting levels not supported"); + return Status::Invalid("Mixed nesting levels not supported"); } else if (max_observed_level() < max_nesting_level_) { - return Status::ValueError("Mixed nesting levels not supported"); + return Status::Invalid("Mixed nesting levels not supported"); } } return Status::OK(); @@ -216,8 +217,8 @@ static Status InferArrowType(PyObject* obj, int64_t* size, } SeqVisitor seq_visitor; - PY_RETURN_NOT_OK(seq_visitor.Visit(obj)); - PY_RETURN_NOT_OK(seq_visitor.Validate()); + RETURN_NOT_OK(seq_visitor.Visit(obj)); + RETURN_NOT_OK(seq_visitor.Validate()); *out_type = seq_visitor.GetType(); @@ -259,7 +260,7 @@ class BoolConverter : public TypedConverter { public: Status AppendData(PyObject* seq) override { Py_ssize_t 
size = PySequence_Size(seq); - RETURN_ARROW_NOT_OK(typed_builder_->Reserve(size)); + RETURN_NOT_OK(typed_builder_->Reserve(size)); for (int64_t i = 0; i < size; ++i) { OwnedRef item(PySequence_GetItem(seq, i)); if (item.obj() == Py_None) { @@ -281,7 +282,7 @@ class Int64Converter : public TypedConverter { Status AppendData(PyObject* seq) override { int64_t val; Py_ssize_t size = PySequence_Size(seq); - RETURN_ARROW_NOT_OK(typed_builder_->Reserve(size)); + RETURN_NOT_OK(typed_builder_->Reserve(size)); for (int64_t i = 0; i < size; ++i) { OwnedRef item(PySequence_GetItem(seq, i)); if (item.obj() == Py_None) { @@ -301,7 +302,7 @@ class DoubleConverter : public TypedConverter { Status AppendData(PyObject* seq) override { double val; Py_ssize_t size = PySequence_Size(seq); - RETURN_ARROW_NOT_OK(typed_builder_->Reserve(size)); + RETURN_NOT_OK(typed_builder_->Reserve(size)); for (int64_t i = 0; i < size; ++i) { OwnedRef item(PySequence_GetItem(seq, i)); if (item.obj() == Py_None) { @@ -330,7 +331,7 @@ class StringConverter : public TypedConverter { OwnedRef holder(item); if (item == Py_None) { - RETURN_ARROW_NOT_OK(typed_builder_->AppendNull()); + RETURN_NOT_OK(typed_builder_->AppendNull()); continue; } else if (PyUnicode_Check(item)) { tmp.reset(PyUnicode_AsUTF8String(item)); @@ -344,7 +345,7 @@ class StringConverter : public TypedConverter { // No error checking length = PyBytes_GET_SIZE(bytes_obj); bytes = PyBytes_AS_STRING(bytes_obj); - RETURN_ARROW_NOT_OK(typed_builder_->Append(bytes, length)); + RETURN_NOT_OK(typed_builder_->Append(bytes, length)); } return Status::OK(); } @@ -359,10 +360,10 @@ class ListConverter : public TypedConverter { for (int64_t i = 0; i < size; ++i) { OwnedRef item(PySequence_GetItem(seq, i)); if (item.obj() == Py_None) { - RETURN_ARROW_NOT_OK(typed_builder_->AppendNull()); + RETURN_NOT_OK(typed_builder_->AppendNull()); } else { typed_builder_->Append(); - PY_RETURN_NOT_OK(value_converter_->AppendData(item.obj())); + RETURN_NOT_OK(value_converter_->AppendData(item.obj())); } } return Status::OK(); @@ -408,7 +409,7 @@ Status ListConverter::Init(const std::shared_ptr& builder) { Status ConvertPySequence(PyObject* obj, std::shared_ptr* out) { std::shared_ptr type; int64_t size; - PY_RETURN_NOT_OK(InferArrowType(obj, &size, &type)); + RETURN_NOT_OK(InferArrowType(obj, &size, &type)); // Handle NA / NullType case if (type->type == Type::NA) { @@ -426,14 +427,12 @@ Status ConvertPySequence(PyObject* obj, std::shared_ptr* out) { // Give the sequence converter an array builder std::shared_ptr builder; - RETURN_ARROW_NOT_OK(arrow::MakeBuilder(get_memory_pool(), type, &builder)); + RETURN_NOT_OK(arrow::MakeBuilder(get_memory_pool(), type, &builder)); converter->Init(builder); - PY_RETURN_NOT_OK(converter->AppendData(obj)); + RETURN_NOT_OK(converter->AppendData(obj)); - *out = builder->Finish(); - - return Status::OK(); + return builder->Finish(out); } } // namespace pyarrow diff --git a/python/src/pyarrow/adapters/builtin.h b/python/src/pyarrow/adapters/builtin.h index 4e997e31dd690..2ddfdaaf44134 100644 --- a/python/src/pyarrow/adapters/builtin.h +++ b/python/src/pyarrow/adapters/builtin.h @@ -30,14 +30,15 @@ #include "pyarrow/common.h" #include "pyarrow/visibility.h" -namespace arrow { class Array; } +namespace arrow { +class Array; +class Status; +} namespace pyarrow { -class Status; - PYARROW_EXPORT -Status ConvertPySequence(PyObject* obj, std::shared_ptr* out); +arrow::Status ConvertPySequence(PyObject* obj, std::shared_ptr* out); } // namespace pyarrow diff --git 
a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index b2fcd37aec944..5902b8341696d 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -31,10 +31,10 @@ #include "arrow/api.h" #include "arrow/util/bit-util.h" +#include "arrow/util/status.h" #include "pyarrow/common.h" #include "pyarrow/config.h" -#include "pyarrow/status.h" namespace pyarrow { @@ -42,6 +42,8 @@ using arrow::Array; using arrow::Column; using arrow::Field; using arrow::DataType; +using arrow::Status; + namespace util = arrow::util; // ---------------------------------------------------------------------- @@ -149,7 +151,7 @@ class ArrowSerializer { int null_bytes = util::bytes_for_bits(length_); null_bitmap_ = std::make_shared(pool_); - RETURN_ARROW_NOT_OK(null_bitmap_->Resize(null_bytes)); + RETURN_NOT_OK(null_bitmap_->Resize(null_bytes)); null_bitmap_data_ = null_bitmap_->mutable_data(); memset(null_bitmap_data_, 0, null_bytes); @@ -171,9 +173,9 @@ class ArrowSerializer { PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); arrow::TypePtr string_type(new arrow::StringType()); arrow::StringBuilder string_builder(pool_, string_type); - RETURN_ARROW_NOT_OK(string_builder.Resize(length_)); + RETURN_NOT_OK(string_builder.Resize(length_)); - arrow::Status s; + Status s; PyObject* obj; for (int64_t i = 0; i < length_; ++i) { obj = objects[i]; @@ -187,18 +189,16 @@ class ArrowSerializer { s = string_builder.Append(PyBytes_AS_STRING(obj), length); Py_DECREF(obj); if (!s.ok()) { - return Status::ArrowError(s.ToString()); + return s; } } else if (PyBytes_Check(obj)) { const int32_t length = PyBytes_GET_SIZE(obj); - RETURN_ARROW_NOT_OK(string_builder.Append(PyBytes_AS_STRING(obj), length)); + RETURN_NOT_OK(string_builder.Append(PyBytes_AS_STRING(obj), length)); } else { string_builder.AppendNull(); } } - *out = std::shared_ptr(string_builder.Finish()); - - return Status::OK(); + return string_builder.Finish(out); } Status ConvertBooleans(std::shared_ptr* out) { @@ -208,7 +208,7 @@ class ArrowSerializer { int nbytes = util::bytes_for_bits(length_); auto data = std::make_shared(pool_); - RETURN_ARROW_NOT_OK(data->Resize(nbytes)); + RETURN_NOT_OK(data->Resize(nbytes)); uint8_t* bitmap = data->mutable_data(); memset(bitmap, 0, nbytes); @@ -305,7 +305,7 @@ inline Status ArrowSerializer::MakeDataType(std::shared_ptrreset(new arrow::TimestampType(unit)); @@ -330,7 +330,7 @@ inline Status ArrowSerializer::Convert(std::shared_ptr* out) { RETURN_NOT_OK(ConvertData()); std::shared_ptr type; RETURN_NOT_OK(MakeDataType(&type)); - RETURN_ARROW_NOT_OK(MakePrimitiveArray(type, length_, data_, null_count, null_bitmap_, out)); + RETURN_NOT_OK(MakePrimitiveArray(type, length_, data_, null_count, null_bitmap_, out)); return Status::OK(); } @@ -389,7 +389,7 @@ template inline Status ArrowSerializer::ConvertData() { // TODO(wesm): strided arrays if (is_strided()) { - return Status::ValueError("no support for strided data yet"); + return Status::Invalid("no support for strided data yet"); } data_ = std::make_shared(arr_); @@ -399,12 +399,12 @@ inline Status ArrowSerializer::ConvertData() { template <> inline Status ArrowSerializer::ConvertData() { if (is_strided()) { - return Status::ValueError("no support for strided data yet"); + return Status::Invalid("no support for strided data yet"); } int nbytes = util::bytes_for_bits(length_); auto buffer = std::make_shared(pool_); - RETURN_ARROW_NOT_OK(buffer->Resize(nbytes)); + RETURN_NOT_OK(buffer->Resize(nbytes)); const uint8_t* 
values = reinterpret_cast(PyArray_DATA(arr_)); @@ -446,7 +446,7 @@ Status PandasMaskedToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, } if (PyArray_NDIM(arr) != 1) { - return Status::ValueError("only handle 1-dimensional arrays"); + return Status::Invalid("only handle 1-dimensional arrays"); } switch(PyArray_DESCR(arr)->type_num) { diff --git a/python/src/pyarrow/adapters/pandas.h b/python/src/pyarrow/adapters/pandas.h index 141d1219e64db..532495dd792db 100644 --- a/python/src/pyarrow/adapters/pandas.h +++ b/python/src/pyarrow/adapters/pandas.h @@ -32,27 +32,26 @@ namespace arrow { class Array; class Column; class MemoryPool; +class Status; } // namespace arrow namespace pyarrow { -class Status; - PYARROW_EXPORT -Status ConvertArrayToPandas(const std::shared_ptr& arr, PyObject* py_ref, - PyObject** out); +arrow::Status ConvertArrayToPandas(const std::shared_ptr& arr, + PyObject* py_ref, PyObject** out); PYARROW_EXPORT -Status ConvertColumnToPandas(const std::shared_ptr& col, PyObject* py_ref, - PyObject** out); +arrow::Status ConvertColumnToPandas(const std::shared_ptr& col, + PyObject* py_ref, PyObject** out); PYARROW_EXPORT -Status PandasMaskedToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, +arrow::Status PandasMaskedToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, std::shared_ptr* out); PYARROW_EXPORT -Status PandasToArrow(arrow::MemoryPool* pool, PyObject* ao, +arrow::Status PandasToArrow(arrow::MemoryPool* pool, PyObject* ao, std::shared_ptr* out); } // namespace pyarrow diff --git a/python/src/pyarrow/api.h b/python/src/pyarrow/api.h index 72be6afe02c76..6dbbc45d40ccc 100644 --- a/python/src/pyarrow/api.h +++ b/python/src/pyarrow/api.h @@ -18,8 +18,6 @@ #ifndef PYARROW_API_H #define PYARROW_API_H -#include "pyarrow/status.h" - #include "pyarrow/helpers.h" #include "pyarrow/adapters/builtin.h" diff --git a/python/src/pyarrow/common.cc b/python/src/pyarrow/common.cc index 09f3efb5a03bc..fa875f2b9aba1 100644 --- a/python/src/pyarrow/common.cc +++ b/python/src/pyarrow/common.cc @@ -21,10 +21,10 @@ #include #include -#include -#include +#include "arrow/util/memory-pool.h" +#include "arrow/util/status.h" -#include "pyarrow/status.h" +using arrow::Status; namespace pyarrow { @@ -33,18 +33,18 @@ class PyArrowMemoryPool : public arrow::MemoryPool { PyArrowMemoryPool() : bytes_allocated_(0) {} virtual ~PyArrowMemoryPool() {} - arrow::Status Allocate(int64_t size, uint8_t** out) override { + Status Allocate(int64_t size, uint8_t** out) override { std::lock_guard guard(pool_lock_); *out = static_cast(std::malloc(size)); if (*out == nullptr) { std::stringstream ss; ss << "malloc of size " << size << " failed"; - return arrow::Status::OutOfMemory(ss.str()); + return Status::OutOfMemory(ss.str()); } bytes_allocated_ += size; - return arrow::Status::OK(); + return Status::OK(); } int64_t bytes_allocated() const override { diff --git a/python/src/pyarrow/common.h b/python/src/pyarrow/common.h index 50c2577b93c9b..7f3131ef03dd8 100644 --- a/python/src/pyarrow/common.h +++ b/python/src/pyarrow/common.h @@ -29,13 +29,6 @@ namespace pyarrow { #define PYARROW_IS_PY2 PY_MAJOR_VERSION <= 2 -#define RETURN_ARROW_NOT_OK(s) do { \ - arrow::Status _s = (s); \ - if (!_s.ok()) { \ - return Status::ArrowError(s.ToString()); \ - } \ - } while (0); - class OwnedRef { public: OwnedRef() : obj_(nullptr) {} diff --git a/python/src/pyarrow/io.cc b/python/src/pyarrow/io.cc index 7bf32ffa8d22b..e6dbc12d429b0 100644 --- a/python/src/pyarrow/io.cc +++ b/python/src/pyarrow/io.cc @@ 
-20,12 +20,13 @@ #include #include -#include -#include -#include +#include "arrow/io/memory.h" +#include "arrow/util/memory-pool.h" +#include "arrow/util/status.h" #include "pyarrow/common.h" -#include "pyarrow/status.h" + +using arrow::Status; namespace pyarrow { @@ -41,7 +42,7 @@ PythonFile::~PythonFile() { Py_DECREF(file_); } -static arrow::Status CheckPyError() { +static Status CheckPyError() { if (PyErr_Occurred()) { PyObject *exc_type, *exc_value, *traceback; PyErr_Fetch(&exc_type, &exc_value, &traceback); @@ -51,35 +52,35 @@ static arrow::Status CheckPyError() { Py_XDECREF(exc_value); Py_XDECREF(traceback); PyErr_Clear(); - return arrow::Status::IOError(message); + return Status::IOError(message); } - return arrow::Status::OK(); + return Status::OK(); } -arrow::Status PythonFile::Close() { +Status PythonFile::Close() { // whence: 0 for relative to start of file, 2 for end of file PyObject* result = PyObject_CallMethod(file_, "close", "()"); Py_XDECREF(result); ARROW_RETURN_NOT_OK(CheckPyError()); - return arrow::Status::OK(); + return Status::OK(); } -arrow::Status PythonFile::Seek(int64_t position, int whence) { +Status PythonFile::Seek(int64_t position, int whence) { // whence: 0 for relative to start of file, 2 for end of file PyObject* result = PyObject_CallMethod(file_, "seek", "(ii)", position, whence); Py_XDECREF(result); ARROW_RETURN_NOT_OK(CheckPyError()); - return arrow::Status::OK(); + return Status::OK(); } -arrow::Status PythonFile::Read(int64_t nbytes, PyObject** out) { +Status PythonFile::Read(int64_t nbytes, PyObject** out) { PyObject* result = PyObject_CallMethod(file_, "read", "(i)", nbytes); ARROW_RETURN_NOT_OK(CheckPyError()); *out = result; - return arrow::Status::OK(); + return Status::OK(); } -arrow::Status PythonFile::Write(const uint8_t* data, int64_t nbytes) { +Status PythonFile::Write(const uint8_t* data, int64_t nbytes) { PyObject* py_data = PyBytes_FromStringAndSize( reinterpret_cast(data), nbytes); ARROW_RETURN_NOT_OK(CheckPyError()); @@ -88,10 +89,10 @@ arrow::Status PythonFile::Write(const uint8_t* data, int64_t nbytes) { Py_XDECREF(py_data); Py_XDECREF(result); ARROW_RETURN_NOT_OK(CheckPyError()); - return arrow::Status::OK(); + return Status::OK(); } -arrow::Status PythonFile::Tell(int64_t* position) { +Status PythonFile::Tell(int64_t* position) { PyObject* result = PyObject_CallMethod(file_, "tell", "()"); ARROW_RETURN_NOT_OK(CheckPyError()); @@ -101,7 +102,7 @@ arrow::Status PythonFile::Tell(int64_t* position) { // PyLong_AsLongLong can raise OverflowError ARROW_RETURN_NOT_OK(CheckPyError()); - return arrow::Status::OK(); + return Status::OK(); } // ---------------------------------------------------------------------- @@ -113,22 +114,22 @@ PyReadableFile::PyReadableFile(PyObject* file) { PyReadableFile::~PyReadableFile() {} -arrow::Status PyReadableFile::Close() { +Status PyReadableFile::Close() { PyGILGuard lock; return file_->Close(); } -arrow::Status PyReadableFile::Seek(int64_t position) { +Status PyReadableFile::Seek(int64_t position) { PyGILGuard lock; return file_->Seek(position, 0); } -arrow::Status PyReadableFile::Tell(int64_t* position) { +Status PyReadableFile::Tell(int64_t* position) { PyGILGuard lock; return file_->Tell(position); } -arrow::Status PyReadableFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) { +Status PyReadableFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) { PyGILGuard lock; PyObject* bytes_obj; ARROW_RETURN_NOT_OK(file_->Read(nbytes, &bytes_obj)); @@ -137,10 +138,10 @@ arrow::Status 
PyReadableFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* std::memcpy(out, PyBytes_AS_STRING(bytes_obj), *bytes_read); Py_DECREF(bytes_obj); - return arrow::Status::OK(); + return Status::OK(); } -arrow::Status PyReadableFile::Read(int64_t nbytes, std::shared_ptr* out) { +Status PyReadableFile::Read(int64_t nbytes, std::shared_ptr* out) { PyGILGuard lock; PyObject* bytes_obj; @@ -149,10 +150,10 @@ arrow::Status PyReadableFile::Read(int64_t nbytes, std::shared_ptr(bytes_obj); Py_DECREF(bytes_obj); - return arrow::Status::OK(); + return Status::OK(); } -arrow::Status PyReadableFile::GetSize(int64_t* size) { +Status PyReadableFile::GetSize(int64_t* size) { PyGILGuard lock; int64_t current_position;; @@ -167,7 +168,7 @@ arrow::Status PyReadableFile::GetSize(int64_t* size) { ARROW_RETURN_NOT_OK(file_->Seek(current_position, 0)); *size = file_size; - return arrow::Status::OK(); + return Status::OK(); } bool PyReadableFile::supports_zero_copy() const { @@ -183,17 +184,17 @@ PyOutputStream::PyOutputStream(PyObject* file) { PyOutputStream::~PyOutputStream() {} -arrow::Status PyOutputStream::Close() { +Status PyOutputStream::Close() { PyGILGuard lock; return file_->Close(); } -arrow::Status PyOutputStream::Tell(int64_t* position) { +Status PyOutputStream::Tell(int64_t* position) { PyGILGuard lock; return file_->Tell(position); } -arrow::Status PyOutputStream::Write(const uint8_t* data, int64_t nbytes) { +Status PyOutputStream::Write(const uint8_t* data, int64_t nbytes) { PyGILGuard lock; return file_->Write(data, nbytes); } diff --git a/python/src/pyarrow/status.cc b/python/src/pyarrow/status.cc deleted file mode 100644 index 1cd54f6a78560..0000000000000 --- a/python/src/pyarrow/status.cc +++ /dev/null @@ -1,92 +0,0 @@ -// Copyright (c) 2011 The LevelDB Authors. All rights reserved. -// Use of this source code is governed by a BSD-style license that can be -// found in the LICENSE file. See the AUTHORS file for names of contributors. -// -// A Status encapsulates the result of an operation. It may indicate success, -// or it may indicate an error with an associated error message. -// -// Multiple threads can invoke const methods on a Status without -// external synchronization, but if any of the threads may call a -// non-const method, all threads accessing the same Status must use -// external synchronization. 
- -#include "pyarrow/status.h" - -#include -#include -#include - -namespace pyarrow { - -Status::Status(StatusCode code, const std::string& msg, int16_t posix_code) { - assert(code != StatusCode::OK); - const uint32_t size = msg.size(); - char* result = new char[size + 7]; - memcpy(result, &size, sizeof(size)); - result[4] = static_cast(code); - memcpy(result + 5, &posix_code, sizeof(posix_code)); - memcpy(result + 7, msg.c_str(), msg.size()); - state_ = result; -} - -const char* Status::CopyState(const char* state) { - uint32_t size; - memcpy(&size, state, sizeof(size)); - char* result = new char[size + 7]; - memcpy(result, state, size + 7); - return result; -} - -std::string Status::CodeAsString() const { - if (state_ == NULL) { - return "OK"; - } - - const char* type; - switch (code()) { - case StatusCode::OK: - type = "OK"; - break; - case StatusCode::OutOfMemory: - type = "Out of memory"; - break; - case StatusCode::KeyError: - type = "Key error"; - break; - case StatusCode::TypeError: - type = "Value error"; - break; - case StatusCode::ValueError: - type = "Value error"; - break; - case StatusCode::IOError: - type = "IO error"; - break; - case StatusCode::NotImplemented: - type = "Not implemented"; - break; - case StatusCode::ArrowError: - type = "Arrow C++ error"; - break; - case StatusCode::UnknownError: - type = "Unknown error"; - break; - } - return std::string(type); -} - -std::string Status::ToString() const { - std::string result(CodeAsString()); - if (state_ == NULL) { - return result; - } - - result.append(": "); - - uint32_t length; - memcpy(&length, state_, sizeof(length)); - result.append(reinterpret_cast(state_ + 7), length); - return result; -} - -} // namespace pyarrow diff --git a/python/src/pyarrow/status.h b/python/src/pyarrow/status.h deleted file mode 100644 index 67cd66c58eeb3..0000000000000 --- a/python/src/pyarrow/status.h +++ /dev/null @@ -1,146 +0,0 @@ -// Copyright (c) 2011 The LevelDB Authors. All rights reserved. -// Use of this source code is governed by a BSD-style license that can be -// found in the LICENSE file. See the AUTHORS file for names of contributors. -// -// A Status encapsulates the result of an operation. It may indicate success, -// or it may indicate an error with an associated error message. -// -// Multiple threads can invoke const methods on a Status without -// external synchronization, but if any of the threads may call a -// non-const method, all threads accessing the same Status must use -// external synchronization. - -#ifndef PYARROW_STATUS_H_ -#define PYARROW_STATUS_H_ - -#include -#include -#include - -#include "pyarrow/visibility.h" - -namespace pyarrow { - -#define PY_RETURN_NOT_OK(s) do { \ - Status _s = (s); \ - if (!_s.ok()) return _s; \ - } while (0); - -enum class StatusCode: char { - OK = 0, - OutOfMemory = 1, - KeyError = 2, - TypeError = 3, - ValueError = 4, - IOError = 5, - NotImplemented = 6, - - ArrowError = 7, - - UnknownError = 10 -}; - -class PYARROW_EXPORT Status { - public: - // Create a success status. - Status() : state_(NULL) { } - ~Status() { delete[] state_; } - - // Copy the specified status. - Status(const Status& s); - void operator=(const Status& s); - - // Return a success status. - static Status OK() { return Status(); } - - // Return error status of an appropriate type. 
- static Status OutOfMemory(const std::string& msg, int16_t posix_code = -1) { - return Status(StatusCode::OutOfMemory, msg, posix_code); - } - - static Status KeyError(const std::string& msg) { - return Status(StatusCode::KeyError, msg, -1); - } - - static Status TypeError(const std::string& msg) { - return Status(StatusCode::TypeError, msg, -1); - } - - static Status IOError(const std::string& msg) { - return Status(StatusCode::IOError, msg, -1); - } - - static Status ValueError(const std::string& msg) { - return Status(StatusCode::ValueError, msg, -1); - } - - static Status NotImplemented(const std::string& msg) { - return Status(StatusCode::NotImplemented, msg, -1); - } - - static Status UnknownError(const std::string& msg) { - return Status(StatusCode::UnknownError, msg, -1); - } - - static Status ArrowError(const std::string& msg) { - return Status(StatusCode::ArrowError, msg, -1); - } - - // Returns true iff the status indicates success. - bool ok() const { return (state_ == NULL); } - - bool IsOutOfMemory() const { return code() == StatusCode::OutOfMemory; } - bool IsKeyError() const { return code() == StatusCode::KeyError; } - bool IsIOError() const { return code() == StatusCode::IOError; } - bool IsTypeError() const { return code() == StatusCode::TypeError; } - bool IsValueError() const { return code() == StatusCode::ValueError; } - - bool IsUnknownError() const { return code() == StatusCode::UnknownError; } - - bool IsArrowError() const { return code() == StatusCode::ArrowError; } - - // Return a string representation of this status suitable for printing. - // Returns the string "OK" for success. - std::string ToString() const; - - // Return a string representation of the status code, without the message - // text or posix code information. - std::string CodeAsString() const; - - // Get the POSIX code associated with this Status, or -1 if there is none. - int16_t posix_code() const; - - private: - // OK status has a NULL state_. Otherwise, state_ is a new[] array - // of the following form: - // state_[0..3] == length of message - // state_[4] == code - // state_[5..6] == posix_code - // state_[7..] == message - const char* state_; - - StatusCode code() const { - return ((state_ == NULL) ? - StatusCode::OK : static_cast(state_[4])); - } - - Status(StatusCode code, const std::string& msg, int16_t posix_code); - static const char* CopyState(const char* s); -}; - -inline Status::Status(const Status& s) { - state_ = (s.state_ == NULL) ? NULL : CopyState(s.state_); -} - -inline void Status::operator=(const Status& s) { - // The following condition catches both aliasing (when this == &s), - // and the common case where both s and *this are ok. - if (state_ != s.state_) { - delete[] state_; - state_ = (s.state_ == NULL) ? NULL : CopyState(s.state_); - } -} - -} // namespace pyarrow - -#endif // PYARROW_STATUS_H_ From 676c32ccea6274c75b2750453c1ddbc5f645c037 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 17 Oct 2016 21:18:30 -0400 Subject: [PATCH 0176/1644] ARROW-317: Add Slice, Copy methods to Buffer There's also a little bit of naming cleanup in `bit-util.h`, pardon the diff noise. 
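The buffer.h and buffer.cc hunks listed in the diffstat below fall outside this excerpt, so the following is only a plausible sketch of a zero-copy slice and a pool-backed copy; the parent-retaining Buffer constructor and the free-function shapes are assumptions, not the patch's actual signatures:

#include <cstring>

#include "arrow/util/buffer.h"
#include "arrow/util/memory-pool.h"
#include "arrow/util/status.h"

namespace arrow {

// Zero-copy slice: the child shares memory with, and keeps alive, the parent.
std::shared_ptr<Buffer> SliceBuffer(
    const std::shared_ptr<Buffer>& parent, int64_t offset, int64_t length) {
  return std::make_shared<Buffer>(parent, offset, length);  // assumed ctor
}

// Deep copy: allocate fresh memory from a pool and memcpy the bytes across.
Status CopyBuffer(const Buffer& source, MemoryPool* pool,
    std::shared_ptr<Buffer>* out) {
  auto result = std::make_shared<PoolBuffer>(pool);
  RETURN_NOT_OK(result->Resize(source.size()));
  std::memcpy(result->mutable_data(), source.data(), source.size());
  *out = result;
  return Status::OK();
}

}  // namespace arrow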
Author: Wes McKinney Closes #177 from wesm/ARROW-317 and squashes the following commits: 0666b22 [Wes McKinney] Fix up pyarrow usage of BitUtil 3ab4e7a [Wes McKinney] Add Slice, Copy methods to Buffer cb9519d [Wes McKinney] Use more conforming names in bit-util.h --- cpp/src/arrow/array.cc | 3 +- cpp/src/arrow/array.h | 2 +- cpp/src/arrow/builder.cc | 12 ++++---- cpp/src/arrow/column-benchmark.cc | 2 +- cpp/src/arrow/ipc/adapter.cc | 5 ++-- cpp/src/arrow/ipc/test-common.h | 3 +- cpp/src/arrow/test-util.h | 6 ++-- cpp/src/arrow/types/list.cc | 2 +- cpp/src/arrow/types/primitive-test.cc | 16 +++++------ cpp/src/arrow/types/primitive.cc | 13 +++++---- cpp/src/arrow/types/primitive.h | 12 ++++---- cpp/src/arrow/util/bit-util-test.cc | 36 +++++++++++------------ cpp/src/arrow/util/bit-util.cc | 10 +++---- cpp/src/arrow/util/bit-util.h | 29 +++++++++---------- cpp/src/arrow/util/buffer-test.cc | 41 +++++++++++++++++++++++++++ cpp/src/arrow/util/buffer.cc | 28 +++++++++++++++++- cpp/src/arrow/util/buffer.h | 23 +++++++++++---- python/src/pyarrow/adapters/pandas.cc | 20 ++++++------- 18 files changed, 173 insertions(+), 90 deletions(-) diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index d6b081f315532..e432a53781f17 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -19,6 +19,7 @@ #include +#include "arrow/util/bit-util.h" #include "arrow/util/buffer.h" #include "arrow/util/status.h" @@ -43,7 +44,7 @@ bool Array::EqualsExact(const Array& other) const { return false; } if (null_count_ > 0) { - return null_bitmap_->Equals(*other.null_bitmap_, util::bytes_for_bits(length_)); + return null_bitmap_->Equals(*other.null_bitmap_, BitUtil::BytesForBits(length_)); } return true; } diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index c7ffb23ca18a1..ff37323f60519 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -45,7 +45,7 @@ class ARROW_EXPORT Array { // Determine if a slot is null. For inner loops. 
Does *not* boundscheck bool IsNull(int i) const { - return null_count_ > 0 && util::bit_not_set(null_bitmap_data_, i); + return null_count_ > 0 && BitUtil::BitNotSet(null_bitmap_data_, i); } int32_t length() const { return length_; } diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index 1fba96169228f..151b257a3d894 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -31,7 +31,7 @@ Status ArrayBuilder::AppendToBitmap(bool is_valid) { // TODO(emkornfield) doubling isn't great default allocation practice // see https://github.com/facebook/folly/blob/master/folly/docs/FBVector.md // fo discussion - RETURN_NOT_OK(Resize(util::next_power2(capacity_ + 1))); + RETURN_NOT_OK(Resize(BitUtil::NextPower2(capacity_ + 1))); } UnsafeAppendToBitmap(is_valid); return Status::OK(); @@ -45,7 +45,7 @@ Status ArrayBuilder::AppendToBitmap(const uint8_t* valid_bytes, int32_t length) } Status ArrayBuilder::Init(int32_t capacity) { - int32_t to_alloc = util::ceil_byte(capacity) / 8; + int32_t to_alloc = BitUtil::CeilByte(capacity) / 8; null_bitmap_ = std::make_shared(pool_); RETURN_NOT_OK(null_bitmap_->Resize(to_alloc)); // Buffers might allocate more then necessary to satisfy padding requirements @@ -58,7 +58,7 @@ Status ArrayBuilder::Init(int32_t capacity) { Status ArrayBuilder::Resize(int32_t new_bits) { if (!null_bitmap_) { return Init(new_bits); } - int32_t new_bytes = util::ceil_byte(new_bits) / 8; + int32_t new_bytes = BitUtil::CeilByte(new_bits) / 8; int32_t old_bytes = null_bitmap_->size(); RETURN_NOT_OK(null_bitmap_->Resize(new_bytes)); null_bitmap_data_ = null_bitmap_->mutable_data(); @@ -82,7 +82,7 @@ Status ArrayBuilder::Advance(int32_t elements) { Status ArrayBuilder::Reserve(int32_t elements) { if (length_ + elements > capacity_) { // TODO(emkornfield) power of 2 growth is potentially suboptimal - int32_t new_capacity = util::next_power2(length_ + elements); + int32_t new_capacity = BitUtil::NextPower2(length_ + elements); return Resize(new_capacity); } return Status::OK(); @@ -96,7 +96,7 @@ Status ArrayBuilder::SetNotNull(int32_t length) { void ArrayBuilder::UnsafeAppendToBitmap(bool is_valid) { if (is_valid) { - util::set_bit(null_bitmap_data_, length_); + BitUtil::SetBit(null_bitmap_data_, length_); } else { ++null_count_; } @@ -118,7 +118,7 @@ void ArrayBuilder::UnsafeSetNotNull(int32_t length) { const int32_t new_length = length + length_; // TODO(emkornfield) Optimize for large values of length? 
for (int32_t i = length_; i < new_length; ++i) { - util::set_bit(null_bitmap_data_, i); + BitUtil::SetBit(null_bitmap_data_, i); } length_ = new_length; } diff --git a/cpp/src/arrow/column-benchmark.cc b/cpp/src/arrow/column-benchmark.cc index edea0948860de..f429a813c6f20 100644 --- a/cpp/src/arrow/column-benchmark.cc +++ b/cpp/src/arrow/column-benchmark.cc @@ -29,7 +29,7 @@ std::shared_ptr MakePrimitive(int32_t length, int32_t null_count = 0) { auto data = std::make_shared(pool); auto null_bitmap = std::make_shared(pool); data->Resize(length * sizeof(typename ArrayType::value_type)); - null_bitmap->Resize(util::bytes_for_bits(length)); + null_bitmap->Resize(BitUtil::BytesForBits(length)); return std::make_shared(length, data, 10, null_bitmap); } } // anonymous namespace diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index f84cb264f70e1..74786bf85ffb4 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -37,6 +37,7 @@ #include "arrow/types/primitive.h" #include "arrow/types/string.h" #include "arrow/types/struct.h" +#include "arrow/util/bit-util.h" #include "arrow/util/buffer.h" #include "arrow/util/logging.h" #include "arrow/util/status.h" @@ -49,7 +50,7 @@ namespace ipc { namespace { Status CheckMultipleOf64(int64_t size) { - if (util::is_multiple_of_64(size)) { return Status::OK(); } + if (BitUtil::IsMultipleOf64(size)) { return Status::OK(); } return Status::Invalid( "Attempted to write a buffer that " "wasn't a multiple of 64 bytes"); @@ -155,7 +156,7 @@ class RecordBatchWriter { // The buffer might be null if we are handling zero row lengths. if (buffer) { size = buffer->size(); - padding = util::RoundUpToMultipleOf64(size) - size; + padding = BitUtil::RoundUpToMultipleOf64(size) - size; } // TODO(wesm): We currently have no notion of shared memory page id's, diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index 13bbbebde8aa1..784e238e977c7 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -31,6 +31,7 @@ #include "arrow/types/primitive.h" #include "arrow/types/string.h" #include "arrow/types/struct.h" +#include "arrow/util/bit-util.h" #include "arrow/util/buffer.h" #include "arrow/util/memory-pool.h" @@ -263,7 +264,7 @@ Status MakeStruct(std::shared_ptr* out) { std::vector null_bytes(list_batch->num_rows(), 1); null_bytes[0] = 0; std::shared_ptr null_bitmask; - RETURN_NOT_OK(util::bytes_to_bits(null_bytes, &null_bitmask)); + RETURN_NOT_OK(BitUtil::BytesToBits(null_bytes, &null_bitmask)); ArrayPtr with_nulls( new StructArray(type, list_batch->num_rows(), columns, 1, null_bitmask)); diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index e632ffb1d892d..ac56f5ed0871c 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -69,7 +69,7 @@ class TestBase : public ::testing::Test { auto data = std::make_shared(pool_); auto null_bitmap = std::make_shared(pool_); EXPECT_OK(data->Resize(length * sizeof(typename ArrayType::value_type))); - EXPECT_OK(null_bitmap->Resize(util::bytes_for_bits(length))); + EXPECT_OK(null_bitmap->Resize(BitUtil::BytesForBits(length))); return std::make_shared(length, data, 10, null_bitmap); } @@ -152,7 +152,7 @@ static inline int bitmap_popcount(const uint8_t* data, int length) { // versions of popcount but the code complexity is likely not worth it) const int loop_tail_index = fast_counts * pop_len; for (int i = loop_tail_index; i < length; ++i) { - if (util::get_bit(data, i)) { ++count; } + if 
(BitUtil::GetBit(data, i)) { ++count; } } return count; @@ -170,7 +170,7 @@ std::shared_ptr bytes_to_null_buffer(const std::vector& bytes) std::shared_ptr out; // TODO(wesm): error checking - util::bytes_to_bits(bytes, &out); + BitUtil::BytesToBits(bytes, &out); return out; } diff --git a/cpp/src/arrow/types/list.cc b/cpp/src/arrow/types/list.cc index ef2ec22cb5336..4b1e821472795 100644 --- a/cpp/src/arrow/types/list.cc +++ b/cpp/src/arrow/types/list.cc @@ -30,7 +30,7 @@ bool ListArray::EqualsExact(const ListArray& other) const { bool equal_null_bitmap = true; if (null_count_ > 0) { equal_null_bitmap = - null_bitmap_->Equals(*other.null_bitmap_, util::bytes_for_bits(length_)); + null_bitmap_->Equals(*other.null_bitmap_, BitUtil::BytesForBits(length_)); } if (!equal_null_bitmap) { return false; } diff --git a/cpp/src/arrow/types/primitive-test.cc b/cpp/src/arrow/types/primitive-test.cc index 121bd4794f291..e47f6dc74fb7e 100644 --- a/cpp/src/arrow/types/primitive-test.cc +++ b/cpp/src/arrow/types/primitive-test.cc @@ -236,7 +236,7 @@ void TestPrimitiveBuilder::Check( for (int i = 0; i < result->length(); ++i) { if (nullable) { ASSERT_EQ(valid_bytes_[i] == 0, result->IsNull(i)) << i; } - bool actual = util::get_bit(result->raw_data(), i); + bool actual = BitUtil::GetBit(result->raw_data(), i); ASSERT_EQ(static_cast(draws_[i]), actual) << i; } ASSERT_TRUE(result->EqualsExact(*expected.get())); @@ -258,8 +258,8 @@ TYPED_TEST(TestPrimitiveBuilder, TestInit) { int n = 1000; ASSERT_OK(this->builder_->Reserve(n)); - ASSERT_EQ(util::next_power2(n), this->builder_->capacity()); - ASSERT_EQ(util::next_power2(TypeTraits::bytes_required(n)), + ASSERT_EQ(BitUtil::NextPower2(n), this->builder_->capacity()); + ASSERT_EQ(BitUtil::NextPower2(TypeTraits::bytes_required(n)), this->builder_->data()->size()); // unsure if this should go in all builder classes @@ -409,10 +409,10 @@ TYPED_TEST(TestPrimitiveBuilder, TestAppendScalar) { } ASSERT_EQ(size, this->builder_->length()); - ASSERT_EQ(util::next_power2(size), this->builder_->capacity()); + ASSERT_EQ(BitUtil::NextPower2(size), this->builder_->capacity()); ASSERT_EQ(size, this->builder_nn_->length()); - ASSERT_EQ(util::next_power2(size), this->builder_nn_->capacity()); + ASSERT_EQ(BitUtil::NextPower2(size), this->builder_nn_->capacity()); this->Check(this->builder_, true); this->Check(this->builder_nn_, false); @@ -444,7 +444,7 @@ TYPED_TEST(TestPrimitiveBuilder, TestAppendVector) { ASSERT_OK(this->builder_nn_->Append(draws.data() + K, size - K)); ASSERT_EQ(size, this->builder_->length()); - ASSERT_EQ(util::next_power2(size), this->builder_->capacity()); + ASSERT_EQ(BitUtil::NextPower2(size), this->builder_->capacity()); this->Check(this->builder_, true); this->Check(this->builder_nn_, false); @@ -472,7 +472,7 @@ TYPED_TEST(TestPrimitiveBuilder, TestResize) { ASSERT_EQ(cap, this->builder_->capacity()); ASSERT_EQ(TypeTraits::bytes_required(cap), this->builder_->data()->size()); - ASSERT_EQ(util::bytes_for_bits(cap), this->builder_->null_bitmap()->size()); + ASSERT_EQ(BitUtil::BytesForBits(cap), this->builder_->null_bitmap()->size()); } TYPED_TEST(TestPrimitiveBuilder, TestReserve) { @@ -484,7 +484,7 @@ TYPED_TEST(TestPrimitiveBuilder, TestReserve) { ASSERT_OK(this->builder_->Advance(100)); ASSERT_OK(this->builder_->Reserve(kMinBuilderCapacity)); - ASSERT_EQ(util::next_power2(kMinBuilderCapacity + 100), this->builder_->capacity()); + ASSERT_EQ(BitUtil::NextPower2(kMinBuilderCapacity + 100), this->builder_->capacity()); } } // namespace arrow diff --git 
a/cpp/src/arrow/types/primitive.cc b/cpp/src/arrow/types/primitive.cc index 3a05ccfdf1861..d2288bafa71da 100644 --- a/cpp/src/arrow/types/primitive.cc +++ b/cpp/src/arrow/types/primitive.cc @@ -19,6 +19,7 @@ #include +#include "arrow/util/bit-util.h" #include "arrow/util/buffer.h" #include "arrow/util/logging.h" @@ -41,7 +42,7 @@ bool PrimitiveArray::EqualsExact(const PrimitiveArray& other) const { if (null_count_ > 0) { bool equal_bitmap = - null_bitmap_->Equals(*other.null_bitmap_, util::ceil_byte(length_) / 8); + null_bitmap_->Equals(*other.null_bitmap_, BitUtil::CeilByte(length_) / 8); if (!equal_bitmap) { return false; } const uint8_t* this_data = raw_data_; @@ -156,9 +157,9 @@ Status PrimitiveBuilder::Append( if ((valid_bytes != nullptr) && !valid_bytes[i]) continue; if (values[i] > 0) { - util::set_bit(raw_data_, length_ + i); + BitUtil::SetBit(raw_data_, length_ + i); } else { - util::clear_bit(raw_data_, length_ + i); + BitUtil::ClearBit(raw_data_, length_ + i); } } @@ -196,20 +197,20 @@ bool BooleanArray::EqualsExact(const BooleanArray& other) const { if (null_count_ > 0) { bool equal_bitmap = - null_bitmap_->Equals(*other.null_bitmap_, util::bytes_for_bits(length_)); + null_bitmap_->Equals(*other.null_bitmap_, BitUtil::BytesForBits(length_)); if (!equal_bitmap) { return false; } const uint8_t* this_data = raw_data_; const uint8_t* other_data = other.raw_data_; for (int i = 0; i < length_; ++i) { - if (!IsNull(i) && util::get_bit(this_data, i) != util::get_bit(other_data, i)) { + if (!IsNull(i) && BitUtil::GetBit(this_data, i) != BitUtil::GetBit(other_data, i)) { return false; } } return true; } else { - return data_->Equals(*other.data_, util::bytes_for_bits(length_)); + return data_->Equals(*other.data_, BitUtil::BytesForBits(length_)); } } diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index f21470d96e45b..c71df584ffe3f 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -173,7 +173,7 @@ class ARROW_EXPORT NumericBuilder : public PrimitiveBuilder { // Does not capacity-check; make sure to call Reserve beforehand void UnsafeAppend(value_type val) { - util::set_bit(null_bitmap_data_, length_); + BitUtil::SetBit(null_bitmap_data_, length_); raw_data_[length_++] = val; } @@ -290,7 +290,7 @@ class ARROW_EXPORT BooleanArray : public PrimitiveArray { const uint8_t* raw_data() const { return reinterpret_cast(raw_data_); } - bool Value(int i) const { return util::get_bit(raw_data(), i); } + bool Value(int i) const { return BitUtil::GetBit(raw_data(), i); } }; template <> @@ -298,7 +298,7 @@ struct TypeTraits { typedef BooleanArray ArrayType; static inline int bytes_required(int elements) { - return util::bytes_for_bits(elements); + return BitUtil::BytesForBits(elements); } }; @@ -314,11 +314,11 @@ class ARROW_EXPORT BooleanBuilder : public PrimitiveBuilder { // Scalar append Status Append(bool val) { Reserve(1); - util::set_bit(null_bitmap_data_, length_); + BitUtil::SetBit(null_bitmap_data_, length_); if (val) { - util::set_bit(raw_data_, length_); + BitUtil::SetBit(raw_data_, length_); } else { - util::clear_bit(raw_data_, length_); + BitUtil::ClearBit(raw_data_, length_); } ++length_; return Status::OK(); diff --git a/cpp/src/arrow/util/bit-util-test.cc b/cpp/src/arrow/util/bit-util-test.cc index e1d8a0808b41a..cfdee04f6e2ea 100644 --- a/cpp/src/arrow/util/bit-util-test.cc +++ b/cpp/src/arrow/util/bit-util-test.cc @@ -22,33 +22,33 @@ namespace arrow { TEST(UtilTests, TestIsMultipleOf64) { - using 
util::is_multiple_of_64; - EXPECT_TRUE(is_multiple_of_64(64)); - EXPECT_TRUE(is_multiple_of_64(0)); - EXPECT_TRUE(is_multiple_of_64(128)); - EXPECT_TRUE(is_multiple_of_64(192)); - EXPECT_FALSE(is_multiple_of_64(23)); - EXPECT_FALSE(is_multiple_of_64(32)); + using BitUtil::IsMultipleOf64; + EXPECT_TRUE(IsMultipleOf64(64)); + EXPECT_TRUE(IsMultipleOf64(0)); + EXPECT_TRUE(IsMultipleOf64(128)); + EXPECT_TRUE(IsMultipleOf64(192)); + EXPECT_FALSE(IsMultipleOf64(23)); + EXPECT_FALSE(IsMultipleOf64(32)); } TEST(UtilTests, TestNextPower2) { - using util::next_power2; + using BitUtil::NextPower2; - ASSERT_EQ(8, next_power2(6)); - ASSERT_EQ(8, next_power2(8)); + ASSERT_EQ(8, NextPower2(6)); + ASSERT_EQ(8, NextPower2(8)); - ASSERT_EQ(1, next_power2(1)); - ASSERT_EQ(256, next_power2(131)); + ASSERT_EQ(1, NextPower2(1)); + ASSERT_EQ(256, NextPower2(131)); - ASSERT_EQ(1024, next_power2(1000)); + ASSERT_EQ(1024, NextPower2(1000)); - ASSERT_EQ(4096, next_power2(4000)); + ASSERT_EQ(4096, NextPower2(4000)); - ASSERT_EQ(65536, next_power2(64000)); + ASSERT_EQ(65536, NextPower2(64000)); - ASSERT_EQ(1LL << 32, next_power2((1LL << 32) - 1)); - ASSERT_EQ(1LL << 31, next_power2((1LL << 31) - 1)); - ASSERT_EQ(1LL << 62, next_power2((1LL << 62) - 1)); + ASSERT_EQ(1LL << 32, NextPower2((1LL << 32) - 1)); + ASSERT_EQ(1LL << 31, NextPower2((1LL << 31) - 1)); + ASSERT_EQ(1LL << 62, NextPower2((1LL << 62) - 1)); } } // namespace arrow diff --git a/cpp/src/arrow/util/bit-util.cc b/cpp/src/arrow/util/bit-util.cc index 475576e87cadd..7e1cb1867171e 100644 --- a/cpp/src/arrow/util/bit-util.cc +++ b/cpp/src/arrow/util/bit-util.cc @@ -24,20 +24,20 @@ namespace arrow { -void util::bytes_to_bits(const std::vector& bytes, uint8_t* bits) { +void BitUtil::BytesToBits(const std::vector& bytes, uint8_t* bits) { for (size_t i = 0; i < bytes.size(); ++i) { - if (bytes[i] > 0) { set_bit(bits, i); } + if (bytes[i] > 0) { SetBit(bits, i); } } } -Status util::bytes_to_bits( +Status BitUtil::BytesToBits( const std::vector& bytes, std::shared_ptr* out) { - int bit_length = util::bytes_for_bits(bytes.size()); + int bit_length = BitUtil::BytesForBits(bytes.size()); auto buffer = std::make_shared(); RETURN_NOT_OK(buffer->Resize(bit_length)); memset(buffer->mutable_data(), 0, bit_length); - bytes_to_bits(bytes, buffer->mutable_data()); + BytesToBits(bytes, buffer->mutable_data()); *out = buffer; return Status::OK(); diff --git a/cpp/src/arrow/util/bit-util.h b/cpp/src/arrow/util/bit-util.h index c33ef272f05e2..13b7e19593d93 100644 --- a/cpp/src/arrow/util/bit-util.h +++ b/cpp/src/arrow/util/bit-util.h @@ -30,39 +30,39 @@ namespace arrow { class Buffer; class Status; -namespace util { +namespace BitUtil { -static inline int64_t ceil_byte(int64_t size) { +static inline int64_t CeilByte(int64_t size) { return (size + 7) & ~7; } -static inline int64_t bytes_for_bits(int64_t size) { - return ceil_byte(size) / 8; +static inline int64_t BytesForBits(int64_t size) { + return CeilByte(size) / 8; } -static inline int64_t ceil_2bytes(int64_t size) { +static inline int64_t Ceil2Bytes(int64_t size) { return (size + 15) & ~15; } static constexpr uint8_t kBitmask[] = {1, 2, 4, 8, 16, 32, 64, 128}; -static inline bool get_bit(const uint8_t* bits, int i) { +static inline bool GetBit(const uint8_t* bits, int i) { return static_cast(bits[i / 8] & kBitmask[i % 8]); } -static inline bool bit_not_set(const uint8_t* bits, int i) { +static inline bool BitNotSet(const uint8_t* bits, int i) { return (bits[i / 8] & kBitmask[i % 8]) == 0; } -static inline void 
clear_bit(uint8_t* bits, int i) { +static inline void ClearBit(uint8_t* bits, int i) { bits[i / 8] &= ~kBitmask[i % 8]; } -static inline void set_bit(uint8_t* bits, int i) { +static inline void SetBit(uint8_t* bits, int i) { bits[i / 8] |= kBitmask[i % 8]; } -static inline int64_t next_power2(int64_t n) { +static inline int64_t NextPower2(int64_t n) { n--; n |= n >> 1; n |= n >> 2; @@ -74,7 +74,7 @@ static inline int64_t next_power2(int64_t n) { return n; } -static inline bool is_multiple_of_64(int64_t n) { +static inline bool IsMultipleOf64(int64_t n) { return (n & 63) == 0; } @@ -90,11 +90,10 @@ inline int64_t RoundUpToMultipleOf64(int64_t num) { return num; } -void bytes_to_bits(const std::vector& bytes, uint8_t* bits); -ARROW_EXPORT Status bytes_to_bits(const std::vector&, std::shared_ptr*); - -} // namespace util +void BytesToBits(const std::vector& bytes, uint8_t* bits); +ARROW_EXPORT Status BytesToBits(const std::vector&, std::shared_ptr*); +} // namespace BitUtil } // namespace arrow #endif // ARROW_UTIL_BIT_UTIL_H diff --git a/cpp/src/arrow/util/buffer-test.cc b/cpp/src/arrow/util/buffer-test.cc index cc4ec98e4fb29..095b07b7ab309 100644 --- a/cpp/src/arrow/util/buffer-test.cc +++ b/cpp/src/arrow/util/buffer-test.cc @@ -31,6 +31,18 @@ namespace arrow { class TestBuffer : public ::testing::Test {}; +TEST_F(TestBuffer, IsMutableFlag) { + Buffer buf(nullptr, 0); + + ASSERT_FALSE(buf.is_mutable()); + + MutableBuffer mbuf(nullptr, 0); + ASSERT_TRUE(mbuf.is_mutable()); + + PoolBuffer pbuf; + ASSERT_TRUE(pbuf.is_mutable()); +} + TEST_F(TestBuffer, Resize) { PoolBuffer buf; @@ -96,4 +108,33 @@ TEST_F(TestBuffer, EqualsWithSameBuffer) { pool->Free(rawBuffer, bufferSize); } +TEST_F(TestBuffer, Copy) { + std::string data_str = "some data to copy"; + + auto data = reinterpret_cast(data_str.c_str()); + + Buffer buf(data, data_str.size()); + + std::shared_ptr out; + + ASSERT_OK(buf.Copy(5, 4, &out)); + + Buffer expected(data + 5, 4); + ASSERT_TRUE(out->Equals(expected)); +} + +TEST_F(TestBuffer, SliceBuffer) { + std::string data_str = "some data to slice"; + + auto data = reinterpret_cast(data_str.c_str()); + + auto buf = std::make_shared(data, data_str.size()); + + std::shared_ptr out = SliceBuffer(buf, 5, 4); + Buffer expected(data + 5, 4); + ASSERT_TRUE(out->Equals(expected)); + + ASSERT_EQ(2, buf.use_count()); +} + } // namespace arrow diff --git a/cpp/src/arrow/util/buffer.cc b/cpp/src/arrow/util/buffer.cc index 6faa048e4e52b..a230259e5930d 100644 --- a/cpp/src/arrow/util/buffer.cc +++ b/cpp/src/arrow/util/buffer.cc @@ -36,6 +36,32 @@ Buffer::Buffer(const std::shared_ptr& parent, int64_t offset, int64_t si Buffer::~Buffer() {} +Status Buffer::Copy( + int64_t start, int64_t nbytes, MemoryPool* pool, std::shared_ptr* out) const { + // Sanity checks + DCHECK_LT(start, size_); + DCHECK_LE(nbytes, size_ - start); + + auto new_buffer = std::make_shared(pool); + RETURN_NOT_OK(new_buffer->Resize(nbytes)); + + std::memcpy(new_buffer->mutable_data(), data() + start, nbytes); + + *out = new_buffer; + return Status::OK(); +} + +Status Buffer::Copy(int64_t start, int64_t nbytes, std::shared_ptr* out) const { + return Copy(start, nbytes, default_memory_pool(), out); +} + +std::shared_ptr SliceBuffer( + const std::shared_ptr& buffer, int64_t offset, int64_t length) { + DCHECK_LT(offset, buffer->size()); + DCHECK_LE(length, buffer->size() - offset); + return std::make_shared(buffer, offset, length); +} + std::shared_ptr MutableBuffer::GetImmutableView() { return std::make_shared(this->get_shared_ptr(), 0, 
size());
}

@@ -52,7 +78,7 @@ PoolBuffer::~PoolBuffer() {
 Status PoolBuffer::Reserve(int64_t new_capacity) {
   if (!mutable_data_ || new_capacity > capacity_) {
     uint8_t* new_data;
-    new_capacity = util::RoundUpToMultipleOf64(new_capacity);
+    new_capacity = BitUtil::RoundUpToMultipleOf64(new_capacity);
     if (mutable_data_) {
       RETURN_NOT_OK(pool_->Allocate(new_capacity, &new_data));
       memcpy(new_data, mutable_data_, size_);
diff --git a/cpp/src/arrow/util/buffer.h b/cpp/src/arrow/util/buffer.h
index bc0df86221c45..04ad6c2dffde4 100644
--- a/cpp/src/arrow/util/buffer.h
+++ b/cpp/src/arrow/util/buffer.h
@@ -43,7 +43,8 @@ class Status;
 // The following invariant is always true: Size < Capacity
 class ARROW_EXPORT Buffer : public std::enable_shared_from_this<Buffer> {
  public:
-  Buffer(const uint8_t* data, int64_t size) : data_(data), size_(size), capacity_(size) {}
+  Buffer(const uint8_t* data, int64_t size)
+      : is_mutable_(false), data_(data), size_(size), capacity_(size) {}
   virtual ~Buffer();

   // An offset into data that is owned by another buffer, but we want to be
@@ -57,6 +58,8 @@ class ARROW_EXPORT Buffer : public std::enable_shared_from_this<Buffer> {
   std::shared_ptr<Buffer> get_shared_ptr() { return shared_from_this(); }

+  bool is_mutable() const { return is_mutable_; }
+
   // Return true if both buffers are the same size and contain the same bytes
   // up to the number of compared bytes
   bool Equals(const Buffer& other, int64_t nbytes) const {
@@ -71,18 +74,22 @@ class ARROW_EXPORT Buffer : public std::enable_shared_from_this<Buffer> {
         (data_ == other.data_ || !memcmp(data_, other.data_, size_)));
   }

+  // Copy a section of the buffer into a new Buffer
+  Status Copy(int64_t start, int64_t nbytes, MemoryPool* pool,
+      std::shared_ptr<Buffer>* out) const;
+
+  // Copy using the default memory pool
+  Status Copy(int64_t start, int64_t nbytes, std::shared_ptr<Buffer>* out) const;
+
   int64_t capacity() const { return capacity_; }
   const uint8_t* data() const { return data_; }
   int64_t size() const { return size_; }

-  // Returns true if this Buffer is referencing memory (possibly) owned by some
-  // other buffer
-  bool is_shared() const { return static_cast<bool>(parent_); }
-
   const std::shared_ptr<Buffer> parent() const { return parent_; }

  protected:
+  bool is_mutable_;
   const uint8_t* data_;
   int64_t size_;
   int64_t capacity_;
@@ -94,10 +101,16 @@ class ARROW_EXPORT Buffer : public std::enable_shared_from_this<Buffer> {
   DISALLOW_COPY_AND_ASSIGN(Buffer);
 };

+// Construct a view on the passed buffer at the indicated offset and length. This
+// function cannot fail and does no error checking (except in debug builds)
+std::shared_ptr<Buffer> SliceBuffer(
+    const std::shared_ptr<Buffer>& buffer, int64_t offset, int64_t length);
+
 // A Buffer whose contents can be mutated. May or may not own its data.
class ARROW_EXPORT MutableBuffer : public Buffer { public: MutableBuffer(uint8_t* data, int64_t size) : Buffer(data, size) { + is_mutable_ = true; mutable_data_ = data; } diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index 5902b8341696d..7e70be75da5fc 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -44,7 +44,7 @@ using arrow::Field; using arrow::DataType; using arrow::Status; -namespace util = arrow::util; +namespace BitUtil = arrow::BitUtil; // ---------------------------------------------------------------------- // Serialization @@ -148,7 +148,7 @@ class ArrowSerializer { } Status InitNullBitmap() { - int null_bytes = util::bytes_for_bits(length_); + int null_bytes = BitUtil::BytesForBits(length_); null_bitmap_ = std::make_shared(pool_); RETURN_NOT_OK(null_bitmap_->Resize(null_bytes)); @@ -206,7 +206,7 @@ class ArrowSerializer { PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); - int nbytes = util::bytes_for_bits(length_); + int nbytes = BitUtil::BytesForBits(length_); auto data = std::make_shared(pool_); RETURN_NOT_OK(data->Resize(nbytes)); uint8_t* bitmap = data->mutable_data(); @@ -215,12 +215,12 @@ class ArrowSerializer { int64_t null_count = 0; for (int64_t i = 0; i < length_; ++i) { if (objects[i] == Py_True) { - util::set_bit(bitmap, i); - util::set_bit(null_bitmap_data_, i); + BitUtil::SetBit(bitmap, i); + BitUtil::SetBit(null_bitmap_data_, i); } else if (objects[i] != Py_False) { ++null_count; } else { - util::set_bit(null_bitmap_data_, i); + BitUtil::SetBit(null_bitmap_data_, i); } } @@ -253,7 +253,7 @@ static int64_t MaskToBitmap(PyArrayObject* mask, int64_t length, uint8_t* bitmap if (mask_values[i]) { ++null_count; } else { - util::set_bit(bitmap, i); + BitUtil::SetBit(bitmap, i); } } return null_count; @@ -272,7 +272,7 @@ static int64_t ValuesToBitmap(const void* data, int64_t length, uint8_t* bitmap) if (traits::isnull(values[i])) { ++null_count; } else { - util::set_bit(bitmap, i); + BitUtil::SetBit(bitmap, i); } } @@ -402,7 +402,7 @@ inline Status ArrowSerializer::ConvertData() { return Status::Invalid("no support for strided data yet"); } - int nbytes = util::bytes_for_bits(length_); + int nbytes = BitUtil::BytesForBits(length_); auto buffer = std::make_shared(pool_); RETURN_NOT_OK(buffer->Resize(nbytes)); @@ -413,7 +413,7 @@ inline Status ArrowSerializer::ConvertData() { memset(bitmap, 0, nbytes); for (int i = 0; i < length_; ++i) { if (values[i] > 0) { - util::set_bit(bitmap, i); + BitUtil::SetBit(bitmap, i); } } From e2c0a18316504a0177129cb66b25a9dc54291587 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 17 Oct 2016 22:46:44 -0400 Subject: [PATCH 0177/1644] ARROW-327: [Python] Remove conda builds from Travis CI setup We'll do these builds in conda-forge Author: Wes McKinney Closes #178 from wesm/ARROW-327 and squashes the following commits: 1303d6e [Wes McKinney] Remove conda builds --- .travis.yml | 18 ---------------- ci/travis_conda_build.sh | 45 ---------------------------------------- 2 files changed, 63 deletions(-) delete mode 100755 ci/travis_conda_build.sh diff --git a/.travis.yml b/.travis.yml index 97229b1ceb3bc..a53756c962e88 100644 --- a/.travis.yml +++ b/.travis.yml @@ -41,24 +41,6 @@ matrix: script: - $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh - $TRAVIS_BUILD_DIR/ci/travis_script_python.sh - - compiler: gcc - env: ARROW_TEST_GROUP=packaging - os: linux - before_script: - - export CC="gcc-4.9" - - export CXX="g++-4.9" - script: - - 
$TRAVIS_BUILD_DIR/ci/travis_conda_build.sh - - os: osx - env: ARROW_TEST_GROUP=packaging - language: objective-c - osx_image: xcode6.4 - compiler: clang - addons: - before_script: - before_install: - script: - - $TRAVIS_BUILD_DIR/ci/travis_conda_build.sh - language: java os: linux jdk: oraclejdk7 diff --git a/ci/travis_conda_build.sh b/ci/travis_conda_build.sh deleted file mode 100755 index 17a33ae9717bc..0000000000000 --- a/ci/travis_conda_build.sh +++ /dev/null @@ -1,45 +0,0 @@ -#!/usr/bin/env bash - -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. - -set -ex - -source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh - -# Build libarrow - -cd $TRAVIS_BUILD_DIR/cpp - -conda build conda.recipe --channel apache/channel/dev -CONDA_PACKAGE=`conda build --output conda.recipe | grep bz2` - -if [ $TRAVIS_BRANCH == "master" ] && [ $TRAVIS_PULL_REQUEST == "false" ]; then - anaconda --token $ANACONDA_TOKEN upload $CONDA_PACKAGE --user apache --channel dev; -fi - -# Build pyarrow - -cd $TRAVIS_BUILD_DIR/python - -build_for_python_version() { - PY_VERSION=$1 - conda build conda.recipe --python $PY_VERSION --channel apache/channel/dev - CONDA_PACKAGE=`conda build --python $PY_VERSION --output conda.recipe | grep bz2` - - if [ $TRAVIS_BRANCH == "master" ] && [ $TRAVIS_PULL_REQUEST == "false" ]; then - anaconda --token $ANACONDA_TOKEN upload $CONDA_PACKAGE --user apache --channel dev; - fi -} - -build_for_python_version 2.7 -build_for_python_version 3.5 From 446ec9bd628244bf675887f5a030d3a94c07645e Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 17 Oct 2016 22:49:56 -0400 Subject: [PATCH 0178/1644] ARROW-334: [Python] Remove INSTALL_RPATH_USE_LINK_PATH Will try to verify whether this resolves the issue. See https://travis-ci.org/conda-forge/staged-recipes/builds/166897102 Author: Wes McKinney Closes #171 from wesm/ARROW-334 and squashes the following commits: ed8fa39 [Wes McKinney] Switch by to xcode 6.4 b8224ce [Wes McKinney] Escape dollar sign in ORIGIN b76b7ac [Wes McKinney] Fix LD_LIBRARY_PATH 3c8d2dd [Wes McKinney] Clean up Travis CI scripts a bit. 
Put in LD_LIBRARY_PATH 30488d7 [Wes McKinney] Don't conda install arrow-cpp during Travis build afb1dc0 [Wes McKinney] Remove INSTALL_RPATH_USE_LINK_PATH --- .travis.yml | 1 - ci/travis_before_script_cpp.sh | 4 ---- ci/travis_script_python.sh | 15 ++++++--------- python/CMakeLists.txt | 4 +--- 4 files changed, 7 insertions(+), 17 deletions(-) diff --git a/.travis.yml b/.travis.yml index a53756c962e88..052c22ccc3790 100644 --- a/.travis.yml +++ b/.travis.yml @@ -32,7 +32,6 @@ matrix: - $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh - $TRAVIS_BUILD_DIR/ci/travis_script_python.sh - compiler: clang - language: objective-c osx_image: xcode6.4 os: osx addons: diff --git a/ci/travis_before_script_cpp.sh b/ci/travis_before_script_cpp.sh index 2d4224b33336f..20307736e672a 100755 --- a/ci/travis_before_script_cpp.sh +++ b/ci/travis_before_script_cpp.sh @@ -15,10 +15,6 @@ set -ex -source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh -conda install -y --channel apache/channel/dev parquet-cpp -export PARQUET_HOME=$MINICONDA - : ${CPP_BUILD_DIR=$TRAVIS_BUILD_DIR/cpp-build} mkdir $CPP_BUILD_DIR diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index 55cb2a76f6db1..179567b595416 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -14,12 +14,16 @@ set -e +source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh + PYTHON_DIR=$TRAVIS_BUILD_DIR/python # Re-use conda installation from C++ export MINICONDA=$HOME/miniconda export PATH="$MINICONDA/bin:$PATH" -export PARQUET_HOME=$MINICONDA + +export ARROW_HOME=$ARROW_CPP_INSTALL +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ARROW_CPP_INSTALL/lib pushd $PYTHON_DIR @@ -38,17 +42,10 @@ python_version_tests() { # Expensive dependencies install from Continuum package repo conda install -y pip numpy pandas cython - # conda install -y parquet-cpp - - conda install -y arrow-cpp -c apache/channel/dev - # Other stuff pip install pip install -r requirements.txt - export ARROW_HOME=$ARROW_CPP_INSTALL - - python setup.py build_ext \ - --inplace + python setup.py build_ext --inplace python -m pytest -vv -r sxX pyarrow diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 4357fa05ff864..b8be8665af079 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -417,8 +417,6 @@ if (UNIX) set(CMAKE_BUILD_WITH_INSTALL_RPATH TRUE) endif() -SET(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE) - add_subdirectory(src/pyarrow) add_subdirectory(src/pyarrow/util) @@ -494,7 +492,7 @@ foreach(module ${CYTHON_EXTENSIONS}) if(APPLE) set(module_install_rpath "@loader_path") else() - set(module_install_rpath "$ORIGIN") + set(module_install_rpath "\$ORIGIN") endif() list(LENGTH directories i) while(${i} GREATER 0) From 2f84493371bd8fae30b8e042984c9d6ba5419c5f Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Fri, 21 Oct 2016 16:27:00 -0400 Subject: [PATCH 0179/1644] ARROW-342: Set Python version on release Author: Uwe L. Korn Closes #179 from xhochy/ARROW-342 and squashes the following commits: 15d0ce3 [Uwe L. 
Korn] ARROW-342: Set Python version on release --- dev/release/00-prepare.sh | 9 +++++++-- python/.gitignore | 1 + python/pyarrow/__init__.py | 1 + python/setup.py | 24 ++++++++++++++++++++---- 4 files changed, 29 insertions(+), 6 deletions(-) diff --git a/dev/release/00-prepare.sh b/dev/release/00-prepare.sh index 3c1fb9a093892..3423a3e6c5bf9 100644 --- a/dev/release/00-prepare.sh +++ b/dev/release/00-prepare.sh @@ -7,9 +7,9 @@ # to you under the Apache License, Version 2.0 (the # "License"); you may not use this file except in compliance # with the License. You may obtain a copy of the License at -# +# # http://www.apache.org/licenses/LICENSE-2.0 -# +# # Unless required by applicable law or agreed to in writing, # software distributed under the License is distributed on an # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY @@ -43,4 +43,9 @@ mvn release:prepare -Dtag=${tag} -DreleaseVersion=${version} -DautoVersionSubmod cd - +cd "${SOURCE_DIR}/../../python" +sed -i "s/VERSION = '[^']*'/VERSION = '${version}'/g" setup.py +sed -i "s/ISRELEASED = False/ISRELEASED = True/g" setup.py +cd - + echo "Finish staging binary artifacts by running: sh dev/release/01-perform.sh" diff --git a/python/.gitignore b/python/.gitignore index 7e2e360557ad8..07f28355a252f 100644 --- a/python/.gitignore +++ b/python/.gitignore @@ -25,6 +25,7 @@ MANIFEST # Generated sources *.c *.cpp +pyarrow/version.py # Python files # setup.py working directory diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 8b131aaa8f4af..775ce7ec47578 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -42,3 +42,4 @@ DataType, Field, Schema, schema) from pyarrow.table import Column, RecordBatch, Table, from_pandas_dataframe +from pyarrow.version import version as __version__ diff --git a/python/setup.py b/python/setup.py index d040ea7e892c5..990497775148d 100644 --- a/python/setup.py +++ b/python/setup.py @@ -50,10 +50,25 @@ if Cython.__version__ < '0.19.1': raise Exception('Please upgrade to Cython 0.19.1 or newer') -MAJOR = 0 -MINOR = 1 -MICRO = 0 -VERSION = '%d.%d.%ddev' % (MAJOR, MINOR, MICRO) +VERSION = '0.1.0' +ISRELEASED = False + +if not ISRELEASED: + VERSION += '.dev' + +setup_dir = os.path.abspath(os.path.dirname(__file__)) + + +def write_version_py(filename=os.path.join(setup_dir, 'pyarrow/version.py')): + a = open(filename, 'w') + file_content = "\n".join(["", + "# THIS FILE IS GENERATED FROM SETUP.PY", + "version = '%(version)s'", + "isrelease = '%(isrelease)s'"]) + + a.write(file_content % {'version': VERSION, + 'isrelease': str(ISRELEASED)}) + a.close() class clean(_clean): @@ -238,6 +253,7 @@ def get_outputs(self): return [self._get_cmake_ext_path(name) for name in self.get_names()] +write_version_py() DESC = """\ Python library for Apache Arrow""" From 3d2e4df219d6b06a3d78821bbca6ba17188908c2 Mon Sep 17 00:00:00 2001 From: adeneche Date: Wed, 26 Oct 2016 12:09:26 -0700 Subject: [PATCH 0180/1644] =?UTF-8?q?ARROW-337:=20UnionListWriter.list()?= =?UTF-8?q?=20is=20doing=20more=20than=20it=20should,=20this=20=E2=80=A6?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit …can cause data corruption The general idea is to use the "inner" writer's position to update the offset. This involves making sure various writers do indeed update their positions. 
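To make the offset bookkeeping concrete, here is a toy C++ sketch of the invariant this fix enforces (the patch itself touches the Java writers; all names below are hypothetical stand-ins, not Arrow APIs): the list's closing offset is read from the inner writer's actual position, never maintained by separate increments that can drift after repositioning.

```cpp
#include <cstdint>
#include <vector>

// Toy model of a list writer over a flat child vector (hypothetical names).
struct ToyListWriter {
  std::vector<int32_t> offsets{0};  // offsets[i+1] - offsets[i] = length of list i
  std::vector<int32_t> child;       // flattened child values
  int32_t child_pos = 0;            // the "inner" writer's position

  void StartList() {
    // Explicitly re-sync the inner writer to the last recorded end offset,
    // in case the outer position was moved by a setPosition()-style call.
    child_pos = offsets.back();
  }

  void WriteInt(int32_t v) {
    if (child_pos >= static_cast<int32_t>(child.size())) {
      child.resize(child_pos + 1);
    }
    child[child_pos++] = v;  // the inner writer advances itself
  }

  void EndList() {
    // Derive the end offset from where the inner writer actually is, rather
    // than incrementing a separate counter that can fall out of sync.
    offsets.push_back(child_pos);
  }
};

int main() {
  ToyListWriter w;
  w.StartList(); w.WriteInt(1); w.WriteInt(2); w.EndList();  // list 0 = [1, 2]
  w.StartList(); w.EndList();                                // list 1 = []
  return w.offsets.back() == 2 ? 0 : 1;  // offsets stay consistent: {0, 2, 2}
}
```

The Java diff below follows the same pattern: `startList()` re-syncs with `writer.setPosition(offsets.getAccessor().get(idx() + 1))`, and `endList()` records `writer.idx()` into the offsets vector.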
UnionListWriter.startList() should explicitly set the inner writer position in case setPosition() was called to move the union list writer's position Author: adeneche Closes #183 from adeneche/ARROW-337 and squashes the following commits: 1ae7e00 [adeneche] updated TestComplexWriter to ensure position is set properly by the various writers 7d5aefc [adeneche] ARROW-337: UnionListWriter.list() is doing more than it should, this can cause data corruption --- .../AbstractPromotableFieldWriter.java | 2 + .../main/codegen/templates/MapWriters.java | 1 + .../codegen/templates/UnionListWriter.java | 32 +-- .../apache/arrow/vector/TestListVector.java | 4 - .../complex/writer/TestComplexWriter.java | 201 +++++++++++++----- 5 files changed, 154 insertions(+), 86 deletions(-) diff --git a/java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java b/java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java index d21dcd0f6461c..60dd0c7b7adf8 100644 --- a/java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java +++ b/java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java @@ -58,6 +58,7 @@ public void start() { @Override public void end() { getWriter(MinorType.MAP).end(); + setPosition(idx() + 1); } @Override @@ -68,6 +69,7 @@ public void startList() { @Override public void endList() { getWriter(MinorType.LIST).endList(); + setPosition(idx() + 1); } <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> diff --git a/java/vector/src/main/codegen/templates/MapWriters.java b/java/vector/src/main/codegen/templates/MapWriters.java index 51327b43af0fa..f41b60072c873 100644 --- a/java/vector/src/main/codegen/templates/MapWriters.java +++ b/java/vector/src/main/codegen/templates/MapWriters.java @@ -185,6 +185,7 @@ public void start() { @Override public void end() { + setPosition(idx()+1); } <#list vv.types as type><#list type.minor as minor> diff --git a/java/vector/src/main/codegen/templates/UnionListWriter.java b/java/vector/src/main/codegen/templates/UnionListWriter.java index 04531a72128a0..bb39fe8d29426 100644 --- a/java/vector/src/main/codegen/templates/UnionListWriter.java +++ b/java/vector/src/main/codegen/templates/UnionListWriter.java @@ -101,11 +101,7 @@ public void setPosition(int index) { public ${name}Writer <#if uncappedName == "int">integer<#else>${uncappedName}(String name) { // assert inMap; mapName = name; - final int nextOffset = offsets.getAccessor().get(idx() + 1); - vector.getMutator().setNotNull(idx()); - writer.setPosition(nextOffset); - ${name}Writer ${uncappedName}Writer = writer.<#if uncappedName == "int">integer<#else>${uncappedName}(name); - return ${uncappedName}Writer; + return writer.<#if uncappedName == "int">integer<#else>${uncappedName}(name); } @@ -120,18 +116,11 @@ public MapWriter map() { @Override public ListWriter list() { - final int nextOffset = offsets.getAccessor().get(idx() + 1); - vector.getMutator().setNotNull(idx()); - offsets.getMutator().setSafe(idx() + 1, nextOffset + 1); - writer.setPosition(nextOffset); return writer; } @Override public ListWriter list(String name) { - final int nextOffset = offsets.getAccessor().get(idx() + 1); - vector.getMutator().setNotNull(idx()); - writer.setPosition(nextOffset); ListWriter listWriter = writer.list(name); return listWriter; } @@ -145,30 +134,26 @@ public MapWriter map(String name) { @Override public void startList() { vector.getMutator().startNewValue(idx()); + writer.setPosition(offsets.getAccessor().get(idx() + 1)); 
} @Override public void endList() { - + offsets.getMutator().set(idx() + 1, writer.idx()); + setPosition(idx() + 1); } @Override public void start() { // assert inMap; - final int nextOffset = offsets.getAccessor().get(idx() + 1); - vector.getMutator().setNotNull(idx()); - offsets.getMutator().setSafe(idx() + 1, nextOffset); - writer.setPosition(nextOffset); writer.start(); } @Override public void end() { // if (inMap) { - writer.end(); - inMap = false; - final int nextOffset = offsets.getAccessor().get(idx() + 1); - offsets.getMutator().setSafe(idx() + 1, nextOffset + 1); + writer.end(); + inMap = false; // } } @@ -181,11 +166,8 @@ public void end() { @Override public void write${name}(<#list fields as field>${field.type} ${field.name}<#if field_has_next>, ) { // assert !inMap; - final int nextOffset = offsets.getAccessor().get(idx() + 1); - vector.getMutator().setNotNull(idx()); - writer.setPosition(nextOffset); writer.write${name}(<#list fields as field>${field.name}<#if field_has_next>, ); - offsets.getMutator().setSafe(idx() + 1, nextOffset + 1); + writer.setPosition(writer.idx()+1); } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java index bb7103365557f..1f0baaed776a1 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java @@ -19,18 +19,14 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.complex.ListVector; -import org.apache.arrow.vector.complex.impl.ComplexCopier; -import org.apache.arrow.vector.complex.impl.UnionListReader; import org.apache.arrow.vector.complex.impl.UnionListWriter; import org.apache.arrow.vector.complex.reader.FieldReader; -import org.apache.arrow.vector.types.pojo.Field; import org.junit.After; import org.junit.Assert; import org.junit.Before; import org.junit.Test; public class TestListVector { - private final static String EMPTY_SCHEMA_PATH = ""; private BufferAllocator allocator; diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java index 9419f88de5b74..6e0e617f299f8 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java @@ -65,10 +65,10 @@ public void simpleNestedTypes() { IntWriter intWriter = rootWriter.integer("int"); BigIntWriter bigIntWriter = rootWriter.bigInt("bigInt"); for (int i = 0; i < COUNT; i++) { - intWriter.setPosition(i); + rootWriter.start(); intWriter.writeInt(i); - bigIntWriter.setPosition(i); bigIntWriter.writeBigInt(i); + rootWriter.end(); } writer.setValueCount(COUNT); MapReader rootReader = new SingleMapReaderImpl(parent).reader("root"); @@ -83,23 +83,52 @@ public void simpleNestedTypes() { @Test public void nullableMap() { - MapVector parent = new MapVector("parent", allocator, null); - ComplexWriter writer = new ComplexWriterImpl("root", parent); - MapWriter rootWriter = writer.rootAsMap(); - for (int i = 0; i < COUNT; i++) { - rootWriter.setPosition(i); - rootWriter.start(); - if (i % 2 == 0) { - MapWriter mapWriter = rootWriter.map("map"); - mapWriter.setPosition(i); - mapWriter.start(); - mapWriter.bigInt("nested").writeBigInt(i); - mapWriter.end(); + try (MapVector mapVector = new MapVector("parent", allocator, null)) { 
+ ComplexWriter writer = new ComplexWriterImpl("root", mapVector); + MapWriter rootWriter = writer.rootAsMap(); + for (int i = 0; i < COUNT; i++) { + rootWriter.start(); + if (i % 2 == 0) { + MapWriter mapWriter = rootWriter.map("map"); + mapWriter.setPosition(i); + mapWriter.start(); + mapWriter.bigInt("nested").writeBigInt(i); + mapWriter.end(); + } + rootWriter.end(); } - rootWriter.end(); + writer.setValueCount(COUNT); + checkNullableMap(mapVector); } - writer.setValueCount(COUNT); - MapReader rootReader = new SingleMapReaderImpl(parent).reader("root"); + } + + /** + * This test is similar to {@link #nullableMap()} ()} but we get the inner map writer once at the beginning + */ + @Test + public void nullableMap2() { + try (MapVector mapVector = new MapVector("parent", allocator, null)) { + ComplexWriter writer = new ComplexWriterImpl("root", mapVector); + MapWriter rootWriter = writer.rootAsMap(); + MapWriter mapWriter = rootWriter.map("map"); + + for (int i = 0; i < COUNT; i++) { + rootWriter.start(); + if (i % 2 == 0) { + mapWriter.setPosition(i); + mapWriter.start(); + mapWriter.bigInt("nested").writeBigInt(i); + mapWriter.end(); + } + rootWriter.end(); + } + writer.setValueCount(COUNT); + checkNullableMap(mapVector); + } + } + + private void checkNullableMap(MapVector mapVector) { + MapReader rootReader = new SingleMapReaderImpl(mapVector).reader("root"); for (int i = 0; i < COUNT; i++) { rootReader.setPosition(i); assertTrue("index is set: " + i, rootReader.isSet()); @@ -113,11 +142,10 @@ public void nullableMap() { assertNull("index is not set: " + i, map.readObject()); } } - parent.close(); } @Test - public void listOfLists() { + public void testList() { MapVector parent = new MapVector("parent", allocator, null); ComplexWriter writer = new ComplexWriterImpl("root", parent); MapWriter rootWriter = writer.rootAsMap(); @@ -129,7 +157,6 @@ public void listOfLists() { rootWriter.list("list").endList(); rootWriter.end(); - rootWriter.setPosition(1); rootWriter.start(); rootWriter.bigInt("int").writeBigInt(1); rootWriter.end(); @@ -152,7 +179,6 @@ public void listScalarType() { listVector.allocateNew(); UnionListWriter listWriter = new UnionListWriter(listVector); for (int i = 0; i < COUNT; i++) { - listWriter.setPosition(i); listWriter.startList(); for (int j = 0; j < i % 7; j++) { listWriter.writeInt(j); @@ -206,7 +232,6 @@ public void listMapType() { UnionListWriter listWriter = new UnionListWriter(listVector); MapWriter mapWriter = listWriter.map(); for (int i = 0; i < COUNT; i++) { - listWriter.setPosition(i); listWriter.startList(); for (int j = 0; j < i % 7; j++) { mapWriter.start(); @@ -230,23 +255,53 @@ public void listMapType() { @Test public void listListType() { - ListVector listVector = new ListVector("list", allocator, null); - listVector.allocateNew(); - UnionListWriter listWriter = new UnionListWriter(listVector); - for (int i = 0; i < COUNT; i++) { - listWriter.setPosition(i); - listWriter.startList(); - for (int j = 0; j < i % 7; j++) { - ListWriter innerListWriter = listWriter.list(); - innerListWriter.startList(); - for (int k = 0; k < i % 13; k++) { - innerListWriter.integer().writeInt(k); + try (ListVector listVector = new ListVector("list", allocator, null)) { + listVector.allocateNew(); + UnionListWriter listWriter = new UnionListWriter(listVector); + for (int i = 0; i < COUNT; i++) { + listWriter.startList(); + for (int j = 0; j < i % 7; j++) { + ListWriter innerListWriter = listWriter.list(); + innerListWriter.startList(); + for (int k = 0; k < i % 13; k++) { 
+ innerListWriter.integer().writeInt(k); + } + innerListWriter.endList(); } - innerListWriter.endList(); + listWriter.endList(); } - listWriter.endList(); + listWriter.setValueCount(COUNT); + checkListOfLists(listVector); } - listWriter.setValueCount(COUNT); + } + + /** + * This test is similar to {@link #listListType()} but we get the inner list writer once at the beginning + */ + @Test + public void listListType2() { + try (ListVector listVector = new ListVector("list", allocator, null)) { + listVector.allocateNew(); + UnionListWriter listWriter = new UnionListWriter(listVector); + ListWriter innerListWriter = listWriter.list(); + + for (int i = 0; i < COUNT; i++) { + listWriter.startList(); + for (int j = 0; j < i % 7; j++) { + innerListWriter.startList(); + for (int k = 0; k < i % 13; k++) { + innerListWriter.integer().writeInt(k); + } + innerListWriter.endList(); + } + listWriter.endList(); + } + listWriter.setValueCount(COUNT); + checkListOfLists(listVector); + } + } + + private void checkListOfLists(final ListVector listVector) { UnionListReader listReader = new UnionListReader(listVector); for (int i = 0; i < COUNT; i++) { listReader.setPosition(i); @@ -259,32 +314,65 @@ public void listListType() { } } } - listVector.clear(); } @Test public void unionListListType() { - ListVector listVector = new ListVector("list", allocator, null); - listVector.allocateNew(); - UnionListWriter listWriter = new UnionListWriter(listVector); - for (int i = 0; i < COUNT; i++) { - listWriter.setPosition(i); - listWriter.startList(); - for (int j = 0; j < i % 7; j++) { - ListWriter innerListWriter = listWriter.list(); - innerListWriter.startList(); - for (int k = 0; k < i % 13; k++) { - if (k % 2 == 0) { - innerListWriter.integer().writeInt(k); - } else { - innerListWriter.bigInt().writeBigInt(k); + try (ListVector listVector = new ListVector("list", allocator, null)) { + listVector.allocateNew(); + UnionListWriter listWriter = new UnionListWriter(listVector); + for (int i = 0; i < COUNT; i++) { + listWriter.startList(); + for (int j = 0; j < i % 7; j++) { + ListWriter innerListWriter = listWriter.list(); + innerListWriter.startList(); + for (int k = 0; k < i % 13; k++) { + if (k % 2 == 0) { + innerListWriter.integer().writeInt(k); + } else { + innerListWriter.bigInt().writeBigInt(k); + } } + innerListWriter.endList(); } - innerListWriter.endList(); + listWriter.endList(); } - listWriter.endList(); + listWriter.setValueCount(COUNT); + checkUnionList(listVector); } - listWriter.setValueCount(COUNT); + } + + /** + * This test is similar to {@link #unionListListType()} but we get the inner list writer once at the beginning + */ + @Test + public void unionListListType2() { + try (ListVector listVector = new ListVector("list", allocator, null)) { + listVector.allocateNew(); + UnionListWriter listWriter = new UnionListWriter(listVector); + ListWriter innerListWriter = listWriter.list(); + + for (int i = 0; i < COUNT; i++) { + listWriter.startList(); + for (int j = 0; j < i % 7; j++) { + innerListWriter.startList(); + for (int k = 0; k < i % 13; k++) { + if (k % 2 == 0) { + innerListWriter.integer().writeInt(k); + } else { + innerListWriter.bigInt().writeBigInt(k); + } + } + innerListWriter.endList(); + } + listWriter.endList(); + } + listWriter.setValueCount(COUNT); + checkUnionList(listVector); + } + } + + private void checkUnionList(ListVector listVector) { UnionListReader listReader = new UnionListReader(listVector); for (int i = 0; i < COUNT; i++) { listReader.setPosition(i); @@ -301,7 +389,6 @@ public 
void unionListListType() { } } } - listVector.clear(); } @Test @@ -384,8 +471,8 @@ public void promotableWriterSchema() { MapVector parent = new MapVector("parent", allocator, null); ComplexWriter writer = new ComplexWriterImpl("root", parent); MapWriter rootWriter = writer.rootAsMap(); - BigIntWriter bigIntWriter = rootWriter.bigInt("a"); - VarCharWriter varCharWriter = rootWriter.varChar("a"); + rootWriter.bigInt("a"); + rootWriter.varChar("a"); Field field = parent.getField().getChildren().get(0).getChildren().get(0); Assert.assertEquals("a", field.getName()); From 6178bf7b0f0cf66f52536f5d5fb5ee104e696f3c Mon Sep 17 00:00:00 2001 From: "Christopher C. Aycock" Date: Fri, 28 Oct 2016 21:13:02 -0400 Subject: [PATCH 0181/1644] ARROW-350: Added Kerberos to HDFS client Author: Christopher C. Aycock Closes #185 from chrisaycock/ARROW-350 and squashes the following commits: c2a4e64 [Christopher C. Aycock] Renamed 'kerb' parameter to 'kerb_ticket' f1d63de [Christopher C. Aycock] ARROW-350: Added Kerberos to HDFS client 8f1052f [Christopher C. Aycock] ARROW-345: Proper locations of libhdfs and libjvm on Mac --- cpp/doc/HDFS.md | 22 ++++++- cpp/src/arrow/io/hdfs.cc | 16 ++++- cpp/src/arrow/io/hdfs.h | 9 +-- cpp/src/arrow/io/libhdfs_shim.cc | 87 ++++++++++++++++++------- python/pyarrow/includes/libarrow_io.pxd | 1 + python/pyarrow/io.pyx | 29 ++++++--- 6 files changed, 124 insertions(+), 40 deletions(-) diff --git a/cpp/doc/HDFS.md b/cpp/doc/HDFS.md index 83311db2d2dc2..6b1bb8c452461 100644 --- a/cpp/doc/HDFS.md +++ b/cpp/doc/HDFS.md @@ -43,7 +43,7 @@ LD_LIBRARY_PATH), and relies on some environment variables. export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob` ``` -#### Setting $JAVA_HOME automatically on OS X +### Mac Specifics The installed location of Java on OS X can vary, however the following snippet will set it automatically for you: @@ -51,3 +51,23 @@ will set it automatically for you: ```shell export JAVA_HOME=$(/usr/libexec/java_home) ``` + +Homebrew's Hadoop does not have native libs. Apache doesn't build these, so +users must build Hadoop to get the native libs. See this Stack Overflow +answer for details: + +http://stackoverflow.com/a/40051353/478288 + +Be sure to include the path to the native libs in `JAVA_LIBRARY_PATH`: + +```shell +export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH +``` + +If you get an error about needing to install Java 6, then add *BundledApp* and +*JNI* to the `JVMCapabilities` in `$JAVA_HOME/../Info.plist`. 
See + +https://oliverdowling.com.au/2015/10/09/oracles-jre-8-on-mac-os-x-el-capitan/ + +https://derflounder.wordpress.com/2015/08/08/modifying-oracles-java-sdk-to-run-java-applications-on-os-x/ + diff --git a/cpp/src/arrow/io/hdfs.cc b/cpp/src/arrow/io/hdfs.cc index b74f84604f18c..6490a7574eea2 100644 --- a/cpp/src/arrow/io/hdfs.cc +++ b/cpp/src/arrow/io/hdfs.cc @@ -287,12 +287,25 @@ class HdfsClient::HdfsClientImpl { Status Connect(const HdfsConnectionConfig* config) { RETURN_NOT_OK(ConnectLibHdfs()); - fs_ = hdfsConnectAsUser(config->host.c_str(), config->port, config->user.c_str()); + // connect to HDFS with the builder object + hdfsBuilder* builder = hdfsNewBuilder(); + if (!config->host.empty()) { + hdfsBuilderSetNameNode(builder, config->host.c_str()); + } + hdfsBuilderSetNameNodePort(builder, config->port); + if (!config->user.empty()) { + hdfsBuilderSetUserName(builder, config->user.c_str()); + } + if (!config->kerb_ticket.empty()) { + hdfsBuilderSetKerbTicketCachePath(builder, config->kerb_ticket.c_str()); + } + fs_ = hdfsBuilderConnect(builder); if (fs_ == nullptr) { return Status::IOError("HDFS connection failed"); } namenode_host_ = config->host; port_ = config->port; user_ = config->user; + kerb_ticket_ = config->kerb_ticket; return Status::OK(); } @@ -425,6 +438,7 @@ class HdfsClient::HdfsClientImpl { std::string namenode_host_; std::string user_; int port_; + std::string kerb_ticket_; hdfsFS fs_; }; diff --git a/cpp/src/arrow/io/hdfs.h b/cpp/src/arrow/io/hdfs.h index 4a4e3ec5f5134..48699c914503e 100644 --- a/cpp/src/arrow/io/hdfs.h +++ b/cpp/src/arrow/io/hdfs.h @@ -60,19 +60,16 @@ struct HdfsConnectionConfig { std::string host; int port; std::string user; - - // TODO: Kerberos, etc. + std::string kerb_ticket; }; class ARROW_EXPORT HdfsClient : public FileSystemClient { public: ~HdfsClient(); - // Connect to an HDFS cluster at indicated host, port, and as user + // Connect to an HDFS cluster given a configuration // - // @param host (in) - // @param port (in) - // @param user (in): user to identify as + // @param config (in): configuration for connecting // @param fs (out): the created client // @returns Status static Status Connect( diff --git a/cpp/src/arrow/io/libhdfs_shim.cc b/cpp/src/arrow/io/libhdfs_shim.cc index f256c31b4f4b2..07eb6250bbe55 100644 --- a/cpp/src/arrow/io/libhdfs_shim.cc +++ b/cpp/src/arrow/io/libhdfs_shim.cc @@ -73,9 +73,17 @@ static HINSTANCE libjvm_handle = NULL; // NOTE(wesm): cpplint does not like use of short and other imprecise C types -static hdfsFS (*ptr_hdfsConnectAsUser)( - const char* host, tPort port, const char* user) = NULL; -static hdfsFS (*ptr_hdfsConnect)(const char* host, tPort port) = NULL; +static hdfsBuilder* (*ptr_hdfsNewBuilder)(void) = NULL; +static void (*ptr_hdfsBuilderSetNameNode)( + hdfsBuilder* bld, const char* nn) = NULL; +static void (*ptr_hdfsBuilderSetNameNodePort)( + hdfsBuilder* bld, tPort port) = NULL; +static void (*ptr_hdfsBuilderSetUserName)( + hdfsBuilder* bld, const char* userName) = NULL; +static void (*ptr_hdfsBuilderSetKerbTicketCachePath)( + hdfsBuilder* bld, const char* kerbTicketCachePath) = NULL; +static hdfsFS (*ptr_hdfsBuilderConnect)(hdfsBuilder* bld) = NULL; + static int (*ptr_hdfsDisconnect)(hdfsFS fs) = NULL; static hdfsFile (*ptr_hdfsOpenFile)(hdfsFS fs, const char* path, int flags, @@ -149,18 +157,29 @@ static void* get_symbol(const char* symbol) { #endif } -hdfsFS hdfsConnectAsUser(const char* host, tPort port, const char* user) { - return ptr_hdfsConnectAsUser(host, port, user); +hdfsBuilder* 
hdfsNewBuilder(void) { + return ptr_hdfsNewBuilder(); } -// Returns NULL on failure -hdfsFS hdfsConnect(const char* host, tPort port) { - if (ptr_hdfsConnect) { - return ptr_hdfsConnect(host, port); - } else { - // TODO: error reporting when shim setup fails - return NULL; - } +void hdfsBuilderSetNameNode(hdfsBuilder* bld, const char* nn) { + ptr_hdfsBuilderSetNameNode(bld, nn); +} + +void hdfsBuilderSetNameNodePort(hdfsBuilder* bld, tPort port) { + ptr_hdfsBuilderSetNameNodePort(bld, port); +} + +void hdfsBuilderSetUserName(hdfsBuilder* bld, const char* userName) { + ptr_hdfsBuilderSetUserName(bld, userName); +} + +void hdfsBuilderSetKerbTicketCachePath(hdfsBuilder* bld, + const char* kerbTicketCachePath) { + ptr_hdfsBuilderSetKerbTicketCachePath(bld , kerbTicketCachePath); +} + +hdfsFS hdfsBuilderConnect(hdfsBuilder* bld) { + return ptr_hdfsBuilderConnect(bld); } int hdfsDisconnect(hdfsFS fs) { @@ -342,18 +361,36 @@ int hdfsUtime(hdfsFS fs, const char* path, tTime mtime, tTime atime) { } static std::vector get_potential_libhdfs_paths() { - std::vector libhdfs_potential_paths = { - // find one in the local directory - fs::path("./libhdfs.so"), fs::path("./hdfs.dll"), - // find a global libhdfs.so - fs::path("libhdfs.so"), fs::path("hdfs.dll"), + std::vector libhdfs_potential_paths; + std::string file_name; + + // OS-specific file name +#ifdef __WIN32 + file_name = "hdfs.dll"; +#elif __APPLE__ + file_name = "libhdfs.dylib"; +#else + file_name = "libhdfs.so"; +#endif + + // Common paths + std::vector search_paths = { + fs::path(""), + fs::path(".") }; + // Path from environment variable const char* hadoop_home = std::getenv("HADOOP_HOME"); if (hadoop_home != nullptr) { - auto path = fs::path(hadoop_home) / "lib/native/libhdfs.so"; - libhdfs_potential_paths.push_back(path); + auto path = fs::path(hadoop_home) / "lib/native"; + search_paths.push_back(path); } + + // All paths with file name + for (auto& path : search_paths) { + libhdfs_potential_paths.push_back(path / file_name); + } + return libhdfs_potential_paths; } @@ -371,7 +408,7 @@ static std::vector get_potential_libjvm_paths() { file_name = "jvm.dll"; #elif __APPLE__ search_prefixes = {""}; - search_suffixes = {""}; + search_suffixes = {"", "/jre/lib/server"}; file_name = "libjvm.dylib"; // SFrame uses /usr/libexec/java_home to find JAVA_HOME; for now we are @@ -513,8 +550,12 @@ Status ARROW_EXPORT ConnectLibHdfs() { return Status::IOError("Prior attempt to load libhdfs failed"); } - GET_SYMBOL_REQUIRED(hdfsConnect); - GET_SYMBOL_REQUIRED(hdfsConnectAsUser); + GET_SYMBOL_REQUIRED(hdfsNewBuilder); + GET_SYMBOL_REQUIRED(hdfsBuilderSetNameNode); + GET_SYMBOL_REQUIRED(hdfsBuilderSetNameNodePort); + GET_SYMBOL_REQUIRED(hdfsBuilderSetUserName); + GET_SYMBOL_REQUIRED(hdfsBuilderSetKerbTicketCachePath); + GET_SYMBOL_REQUIRED(hdfsBuilderConnect); GET_SYMBOL_REQUIRED(hdfsCreateDirectory); GET_SYMBOL_REQUIRED(hdfsDelete); GET_SYMBOL_REQUIRED(hdfsDisconnect); diff --git a/python/pyarrow/includes/libarrow_io.pxd b/python/pyarrow/includes/libarrow_io.pxd index 8074915508fbe..77034159d2f3a 100644 --- a/python/pyarrow/includes/libarrow_io.pxd +++ b/python/pyarrow/includes/libarrow_io.pxd @@ -93,6 +93,7 @@ cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: c_string host int port c_string user + c_string kerb_ticket cdef cppclass HdfsPathInfo: ObjectType kind; diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index 16ebfa1138e46..0e6b81e984431 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -288,9 +288,6 @@ cdef 
class HdfsClient: shared_ptr[CHdfsClient] client cdef readonly: - object host - int port - object user bint is_open def __cinit__(self): @@ -301,6 +298,9 @@ cdef class HdfsClient: self.close() def close(self): + """ + Disconnect from the HDFS cluster + """ self._ensure_client() with nogil: check_status(self.client.get().Disconnect()) @@ -313,14 +313,21 @@ cdef class HdfsClient: raise IOError('HDFS client is closed') @classmethod - def connect(cls, host, port, user): + def connect(cls, host="default", port=0, user=None, kerb_ticket=None): """ + Connect to an HDFS cluster. All parameters are optional and should + only be set if the defaults need to be overridden. + + Authentication should be automatic if the HDFS cluster uses Kerberos. + However, if a username is specified, then the ticket cache will likely + be required. Parameters ---------- - host : - port : - user : + host : NameNode. Set to "default" for fs.defaultFS from core-site.xml. + port : NameNode's port. Set to 0 for default or logical (HA) nodes. + user : Username when connecting to HDFS; None implies login user. + kerb_ticket : Path to Kerberos ticket cache. Notes ----- @@ -335,9 +342,13 @@ cdef class HdfsClient: HdfsClient out = HdfsClient() HdfsConnectionConfig conf - conf.host = tobytes(host) + if host is not None: + conf.host = tobytes(host) conf.port = port - conf.user = tobytes(user) + if user is not None: + conf.user = tobytes(user) + if kerb_ticket is not None: + conf.kerb_ticket = tobytes(kerb_ticket) with nogil: check_status(CHdfsClient.Connect(&conf, &out.client)) From da24c1a0a2aba7ccd42cc3cbcf240eeb22d7ffb6 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sat, 29 Oct 2016 10:02:15 +0200 Subject: [PATCH 0182/1644] ARROW-339: Python 3 compatibility in merge_arrow_pr.py Author: Wes McKinney Closes #188 from wesm/ARROW-339 and squashes the following commits: 1f3617f [Wes McKinney] Remove cherry-picking cruft 6b99632 [Wes McKinney] Python 3 compatibility in merge_arrow_pr.py --- dev/merge_arrow_pr.py | 193 +++++++++++++++++++----------------------- 1 file changed, 88 insertions(+), 105 deletions(-) diff --git a/dev/merge_arrow_pr.py b/dev/merge_arrow_pr.py index 8f47f93b26dd1..aa899edd62ca4 100755 --- a/dev/merge_arrow_pr.py +++ b/dev/merge_arrow_pr.py @@ -17,22 +17,24 @@ # limitations under the License. # -# Utility for creating well-formed pull request merges and pushing them to Apache. +# Utility for creating well-formed pull request merges and pushing them to +# Apache. # usage: ./apache-pr-merge.py (see config env vars below) # # This utility assumes you already have a local Arrow git clone and that you # have added remotes corresponding to both (i) the Github Apache Arrow mirror # and (ii) the apache git repo. 
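One side effect of the rewrite below: the new `requests`-based `get_json` drops both the `ARROW_GITHUB_API_TOKEN` header support and the explicit HTTP-error exit that the `urllib2` version had. A minimal sketch of how the token could be kept on top of `requests` (not part of this patch):

```python
import os
import requests

def get_json(url):
    # Reattach the Authorization header the urllib2 version sent
    headers = {}
    token = os.environ.get('ARROW_GITHUB_API_TOKEN')
    if token:
        headers['Authorization'] = 'token %s' % token
    return requests.get(url, headers=headers).json()
```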
-import json import os import re import subprocess import sys -import tempfile -import urllib2 +import requests import getpass +from six.moves import input +import six + try: import jira.client JIRA_IMPORTED = True @@ -42,8 +44,8 @@ # Location of your Arrow git clone ARROW_HOME = os.path.abspath(__file__).rsplit("/", 2)[0] PROJECT_NAME = ARROW_HOME.rsplit("/", 1)[1] -print "ARROW_HOME = " + ARROW_HOME -print "PROJECT_NAME = " + PROJECT_NAME +print("ARROW_HOME = " + ARROW_HOME) +print("PROJECT_NAME = " + PROJECT_NAME) # Remote name which points to the Gihub site PR_REMOTE_NAME = os.environ.get("PR_REMOTE_NAME", "apache-github") @@ -65,46 +67,38 @@ def get_json(url): - try: - from urllib2 import urlopen, Request - env_var = 'ARROW_GITHUB_API_TOKEN' - - if env_var in os.environ: - token = os.environ[env_var] - request = Request(url) - request.add_header('Authorization', 'token %s' % token) - response = urlopen(request) - else: - response = urlopen(url) - return json.load(response) - except urllib2.HTTPError as e: - print "Unable to fetch URL, exiting: %s" % url - sys.exit(-1) + req = requests.get(url) + return req.json() def fail(msg): - print msg + print(msg) clean_up() sys.exit(-1) def run_cmd(cmd): + if isinstance(cmd, six.string_types): + cmd = cmd.split(' ') + try: - if isinstance(cmd, list): - return subprocess.check_output(cmd) - else: - return subprocess.check_output(cmd.split(" ")) + output = subprocess.check_output(cmd) except subprocess.CalledProcessError as e: # this avoids hiding the stdout / stderr of failed processes - print 'Command failed: %s' % cmd - print 'With output:' - print '--------------' - print e.output - print '--------------' + print('Command failed: %s' % cmd) + print('With output:') + print('--------------') + print(e.output) + print('--------------') raise e + if isinstance(output, six.binary_type): + output = output.decode('utf-8') + return output + + def continue_maybe(prompt): - result = raw_input("\n%s (y/n): " % prompt) + result = input("\n%s (y/n): " % prompt) if result.lower() != "y": fail("Okay, exiting") @@ -113,38 +107,44 @@ def continue_maybe(prompt): def clean_up(): - print "Restoring head pointer to %s" % original_head + print("Restoring head pointer to %s" % original_head) run_cmd("git checkout %s" % original_head) branches = run_cmd("git branch").replace(" ", "").split("\n") - for branch in filter(lambda x: x.startswith(BRANCH_PREFIX), branches): - print "Deleting local branch %s" % branch + for branch in [x for x in branches if x.startswith(BRANCH_PREFIX)]: + print("Deleting local branch %s" % branch) run_cmd("git branch -D %s" % branch) # merge the requested PR and return the merge hash def merge_pr(pr_num, target_ref): pr_branch_name = "%s_MERGE_PR_%s" % (BRANCH_PREFIX, pr_num) - target_branch_name = "%s_MERGE_PR_%s_%s" % (BRANCH_PREFIX, pr_num, target_ref.upper()) - run_cmd("git fetch %s pull/%s/head:%s" % (PR_REMOTE_NAME, pr_num, pr_branch_name)) - run_cmd("git fetch %s %s:%s" % (PUSH_REMOTE_NAME, target_ref, target_branch_name)) + target_branch_name = "%s_MERGE_PR_%s_%s" % (BRANCH_PREFIX, pr_num, + target_ref.upper()) + run_cmd("git fetch %s pull/%s/head:%s" % (PR_REMOTE_NAME, pr_num, + pr_branch_name)) + run_cmd("git fetch %s %s:%s" % (PUSH_REMOTE_NAME, target_ref, + target_branch_name)) run_cmd("git checkout %s" % target_branch_name) had_conflicts = False try: run_cmd(['git', 'merge', pr_branch_name, '--squash']) except Exception as e: - msg = "Error merging: %s\nWould you like to manually fix-up this merge?" 
% e + msg = ("Error merging: %s\nWould you like to " + "manually fix-up this merge?" % e) continue_maybe(msg) - msg = "Okay, please fix any conflicts and 'git add' conflicting files... Finished?" + msg = ("Okay, please fix any conflicts and 'git add' " + "conflicting files... Finished?") continue_maybe(msg) had_conflicts = True commit_authors = run_cmd(['git', 'log', 'HEAD..%s' % pr_branch_name, '--pretty=format:%an <%ae>']).split("\n") distinct_authors = sorted(set(commit_authors), - key=lambda x: commit_authors.count(x), reverse=True) + key=lambda x: commit_authors.count(x), + reverse=True) primary_author = distinct_authors[0] commits = run_cmd(['git', 'log', 'HEAD..%s' % pr_branch_name, '--pretty=format:%h [%an] %s']).split("\n\n") @@ -152,7 +152,7 @@ def merge_pr(pr_num, target_ref): merge_message_flags = [] merge_message_flags += ["-m", title] - if body != None: + if body is not None: merge_message_flags += ["-m", body] authors = "\n".join(["Author: %s" % a for a in distinct_authors]) @@ -162,14 +162,17 @@ def merge_pr(pr_num, target_ref): if had_conflicts: committer_name = run_cmd("git config --get user.name").strip() committer_email = run_cmd("git config --get user.email").strip() - message = "This patch had conflicts when merged, resolved by\nCommitter: %s <%s>" % ( - committer_name, committer_email) + message = ("This patch had conflicts when merged, " + "resolved by\nCommitter: %s <%s>" % + (committer_name, committer_email)) merge_message_flags += ["-m", message] - # The string "Closes #%s" string is required for GitHub to correctly close the PR + # The string "Closes #%s" string is required for GitHub to correctly close + # the PR merge_message_flags += [ "-m", - "Closes #%s from %s and squashes the following commits:" % (pr_num, pr_repo_desc)] + "Closes #%s from %s and squashes the following commits:" + % (pr_num, pr_repo_desc)] for c in commits: merge_message_flags += ["-m", c] @@ -182,7 +185,8 @@ def merge_pr(pr_num, target_ref): target_branch_name, PUSH_REMOTE_NAME)) try: - run_cmd('git push %s %s:%s' % (PUSH_REMOTE_NAME, target_branch_name, target_ref)) + run_cmd('git push %s %s:%s' % (PUSH_REMOTE_NAME, target_branch_name, + target_ref)) except Exception as e: clean_up() fail("Exception while pushing: %s" % e) @@ -194,65 +198,42 @@ def merge_pr(pr_num, target_ref): return merge_hash -def cherry_pick(pr_num, merge_hash, default_branch): - pick_ref = raw_input("Enter a branch name [%s]: " % default_branch) - if pick_ref == "": - pick_ref = default_branch - - pick_branch_name = "%s_PICK_PR_%s_%s" % (BRANCH_PREFIX, pr_num, pick_ref.upper()) - - run_cmd("git fetch %s %s:%s" % (PUSH_REMOTE_NAME, pick_ref, pick_branch_name)) - run_cmd("git checkout %s" % pick_branch_name) - run_cmd("git cherry-pick -sx %s" % merge_hash) - - continue_maybe("Pick complete (local ref %s). Push to %s?" % ( - pick_branch_name, PUSH_REMOTE_NAME)) - - try: - run_cmd('git push %s %s:%s' % (PUSH_REMOTE_NAME, pick_branch_name, pick_ref)) - except Exception as e: - clean_up() - fail("Exception while pushing: %s" % e) - - pick_hash = run_cmd("git rev-parse %s" % pick_branch_name)[:8] - clean_up() - - print("Pull request #%s picked into %s!" 
% (pr_num, pick_ref)) - print("Pick hash: %s" % pick_hash) - return pick_ref - - def fix_version_from_branch(branch, versions): - # Note: Assumes this is a sorted (newest->oldest) list of un-released versions + # Note: Assumes this is a sorted (newest->oldest) list of un-released + # versions if branch == "master": return versions[0] else: branch_ver = branch.replace("branch-", "") - return filter(lambda x: x.name.startswith(branch_ver), versions)[-1] + return [x for x in versions if x.name.startswith(branch_ver)][-1] + def exctract_jira_id(title): m = re.search(r'^(ARROW-[0-9]+)\b.*$', title) if m and m.groups > 0: return m.group(1) else: - fail("PR title should be prefixed by a jira id \"ARROW-XXX: ...\", found: \"%s\"" % title) + fail("PR title should be prefixed by a jira id " + "\"ARROW-XXX: ...\", found: \"%s\"" % title) + def check_jira(title): jira_id = exctract_jira_id(title) asf_jira = jira.client.JIRA({'server': JIRA_API_BASE}, basic_auth=(JIRA_USERNAME, JIRA_PASSWORD)) try: - issue = asf_jira.issue(jira_id) + asf_jira.issue(jira_id) except Exception as e: fail("ASF JIRA could not find %s\n%s" % (jira_id, e)) + def resolve_jira(title, merge_branches, comment): asf_jira = jira.client.JIRA({'server': JIRA_API_BASE}, basic_auth=(JIRA_USERNAME, JIRA_PASSWORD)) default_jira_id = exctract_jira_id(title) - jira_id = raw_input("Enter a JIRA id [%s]: " % default_jira_id) + jira_id = input("Enter a JIRA id [%s]: " % default_jira_id) if jira_id == "": jira_id = default_jira_id @@ -271,30 +252,33 @@ def resolve_jira(title, merge_branches, comment): if cur_status == "Resolved" or cur_status == "Closed": fail("JIRA issue %s already has status '%s'" % (jira_id, cur_status)) - print ("=== JIRA %s ===" % jira_id) - print ("summary\t\t%s\nassignee\t%s\nstatus\t\t%s\nurl\t\t%s/%s\n" % ( - cur_summary, cur_assignee, cur_status, JIRA_BASE, jira_id)) + print("=== JIRA %s ===" % jira_id) + print("summary\t\t%s\nassignee\t%s\nstatus\t\t%s\nurl\t\t%s/%sf\n" + % (cur_summary, cur_assignee, cur_status, JIRA_BASE, jira_id)) resolve = filter(lambda a: a['name'] == "Resolve Issue", asf_jira.transitions(jira_id))[0] asf_jira.transition_issue(jira_id, resolve["id"], comment=comment) - print "Succesfully resolved %s!" % (jira_id) + print("Succesfully resolved %s!" % (jira_id)) if not JIRA_USERNAME: - JIRA_USERNAME = raw_input("Env JIRA_USERNAME not set, please enter your JIRA username:") + JIRA_USERNAME = input("Env JIRA_USERNAME not set, " + "please enter your JIRA username:") if not JIRA_PASSWORD: - JIRA_PASSWORD = getpass.getpass("Env JIRA_PASSWORD not set, please enter your JIRA password:") + JIRA_PASSWORD = getpass.getpass("Env JIRA_PASSWORD not set, please enter " + "your JIRA password:") branches = get_json("%s/branches" % GITHUB_API_BASE) -branch_names = filter(lambda x: x.startswith("branch-"), [x['name'] for x in branches]) +branch_names = [x['name'] for x in branches if x['name'].startswith('branch-')] + # Assumes branch names can be sorted lexicographically # Julien: I commented this out as we don't have any "branch-*" branch yet -#latest_branch = sorted(branch_names, reverse=True)[0] +# latest_branch = sorted(branch_names, reverse=True)[0] -pr_num = raw_input("Which pull request would you like to merge? (e.g. 34): ") +pr_num = input("Which pull request would you like to merge? (e.g. 
34): ") pr = get_json("%s/pulls/%s" % (GITHUB_API_BASE, pr_num)) url = pr["url"] @@ -307,42 +291,41 @@ def resolve_jira(title, merge_branches, comment): pr_repo_desc = "%s/%s" % (user_login, base_ref) if pr["merged"] is True: - print "Pull request %s has already been merged, assuming you want to backport" % pr_num + print("Pull request %s has already been merged, " + "assuming you want to backport" % pr_num) merge_commit_desc = run_cmd([ 'git', 'log', '--merges', '--first-parent', '--grep=pull request #%s' % pr_num, '--oneline']).split("\n")[0] if merge_commit_desc == "": - fail("Couldn't find any merge commit for #%s, you may need to update HEAD." % pr_num) + fail("Couldn't find any merge commit for #%s, " + "you may need to update HEAD." % pr_num) merge_hash = merge_commit_desc[:7] message = merge_commit_desc[8:] - print "Found: %s" % message - maybe_cherry_pick(pr_num, merge_hash, latest_branch) + print("Found: %s" % message) sys.exit(0) if not bool(pr["mergeable"]): - msg = "Pull request %s is not mergeable in its current form.\n" % pr_num + \ - "Continue? (experts only!)" + msg = ("Pull request %s is not mergeable in its current form.\n" + % pr_num + "Continue? (experts only!)") continue_maybe(msg) -print ("\n=== Pull Request #%s ===" % pr_num) -print ("title\t%s\nsource\t%s\ntarget\t%s\nurl\t%s" % ( - title, pr_repo_desc, target_ref, url)) +print("\n=== Pull Request #%s ===" % pr_num) +print("title\t%s\nsource\t%s\ntarget\t%s\nurl\t%s" + % (title, pr_repo_desc, target_ref, url)) continue_maybe("Proceed with merging pull request #%s?" % pr_num) merged_refs = [target_ref] merge_hash = merge_pr(pr_num, target_ref) -pick_prompt = "Would you like to pick %s into another branch?" % merge_hash -while raw_input("\n%s (y/n): " % pick_prompt).lower() == "y": - merged_refs = merged_refs + [cherry_pick(pr_num, merge_hash, latest_branch)] - if JIRA_IMPORTED: continue_maybe("Would you like to update the associated JIRA?") - jira_comment = "Issue resolved by pull request %s\n[%s/%s]" % (pr_num, GITHUB_BASE, pr_num) + jira_comment = ("Issue resolved by pull request %s\n[%s/%s]" + % (pr_num, GITHUB_BASE, pr_num)) resolve_jira(title, merged_refs, jira_comment) else: - print "Could not find jira-python library. Run 'sudo pip install jira-python' to install." - print "Exiting without trying to close the associated JIRA." + print("Could not find jira-python library. " + "Run 'sudo pip install jira-python' to install.") + print("Exiting without trying to close the associated JIRA.") From d946e7917d55cb220becd6469ae93430f2e60764 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sat, 29 Oct 2016 04:36:03 -0400 Subject: [PATCH 0183/1644] ARROW-354: Fix comparison of arrays of empty strings Author: Uwe L. Korn Closes #189 from xhochy/ARROW-354 and squashes the following commits: 8f75d78 [Uwe L. 
Korn] ARROW-354: Fix comparison of arrays of empty strings --- cpp/src/arrow/types/string-test.cc | 12 ++++++++++++ cpp/src/arrow/types/string.cc | 2 ++ 2 files changed, 14 insertions(+) diff --git a/cpp/src/arrow/types/string-test.cc b/cpp/src/arrow/types/string-test.cc index d897e30a3c6a2..af87a14a8b32e 100644 --- a/cpp/src/arrow/types/string-test.cc +++ b/cpp/src/arrow/types/string-test.cc @@ -129,6 +129,18 @@ TEST_F(TestStringContainer, TestGetString) { } } +TEST_F(TestStringContainer, TestEmptyStringComparison) { + offsets_ = {0, 0, 0, 0, 0, 0}; + offsets_buf_ = test::to_buffer(offsets_); + length_ = offsets_.size() - 1; + + auto strings_a = std::make_shared( + length_, offsets_buf_, nullptr, null_count_, null_bitmap_); + auto strings_b = std::make_shared( + length_, offsets_buf_, nullptr, null_count_, null_bitmap_); + ASSERT_TRUE(strings_a->Equals(strings_b)); +} + // ---------------------------------------------------------------------- // String builder tests diff --git a/cpp/src/arrow/types/string.cc b/cpp/src/arrow/types/string.cc index d692e13773f56..f6d26df3167c9 100644 --- a/cpp/src/arrow/types/string.cc +++ b/cpp/src/arrow/types/string.cc @@ -56,6 +56,8 @@ bool BinaryArray::EqualsExact(const BinaryArray& other) const { offset_buffer_->Equals(*other.offset_buffer_, (length_ + 1) * sizeof(int32_t)); if (!equal_offsets) { return false; } + if (!data_buffer_ && !(other.data_buffer_)) { return true; } + return data_buffer_->Equals(*other.data_buffer_, data_buffer_->size()); } From 772bc6ea6e5d452ccff1df8d5e83299e434c0d04 Mon Sep 17 00:00:00 2001 From: Peter Hoffmann Date: Sun, 30 Oct 2016 11:11:28 +0100 Subject: [PATCH 0184/1644] ARROW-349: Add six as a requirement fixes https://issues.apache.org/jira/browse/ARROW-349 Author: Peter Hoffmann Closes #184 from hoffmann/patch-1 and squashes the following commits: 1bffc69 [Peter Hoffmann] Add six as a requirement --- python/setup.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python/setup.py b/python/setup.py index 990497775148d..cdfdc243e2597 100644 --- a/python/setup.py +++ b/python/setup.py @@ -271,7 +271,7 @@ def get_outputs(self): 'clean': clean, 'build_ext': build_ext }, - install_requires=['cython >= 0.23', 'numpy >= 1.9'], + install_requires=['cython >= 0.23', 'numpy >= 1.9', 'six >= 1.0.0'], description=DESC, license='Apache License, Version 2.0', maintainer="Apache Arrow Developers", From ca088dd19eb4283c71252de39782d811f985649a Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 31 Oct 2016 21:16:29 -0400 Subject: [PATCH 0185/1644] ARROW-339: [Dev] Lingering Python 3 fixes I missed a couple Python 3 things. I'll leave this open until one of us successfully merged another patch with this before we merge it. 
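Condensed, the two Python 3 gotchas this patch cleans up look roughly like this (illustrative sketch, not part of the diff):

```python
import re

# 1. 'm.groups' without parentheses is a bound method; 'm.groups > 0' is
#    always truthy on Python 2 and a TypeError on Python 3, so the guard
#    reduces to a plain truthiness check on the match object.
m = re.search(r'^(ARROW-[0-9]+)\b.*$', 'ARROW-339: some title')
if m:
    print(m.group(1))

# 2. filter() returns a lazy iterator on Python 3, so indexing it fails;
#    a list comprehension works identically on both interpreters.
transitions = [{'name': 'Start Progress'}, {'name': 'Resolve Issue'}]
resolve = [x for x in transitions if x['name'] == 'Resolve Issue'][0]
print(resolve['name'])
```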
Author: Wes McKinney Closes #191 from wesm/ARROW-339-2 and squashes the following commits: 78bf094 [Wes McKinney] Lingering Python 3 fixes --- dev/merge_arrow_pr.py | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/dev/merge_arrow_pr.py b/dev/merge_arrow_pr.py index aa899edd62ca4..f7e7a37c36e5c 100755 --- a/dev/merge_arrow_pr.py +++ b/dev/merge_arrow_pr.py @@ -210,7 +210,7 @@ def fix_version_from_branch(branch, versions): def exctract_jira_id(title): m = re.search(r'^(ARROW-[0-9]+)\b.*$', title) - if m and m.groups > 0: + if m: return m.group(1) else: fail("PR title should be prefixed by a jira id " @@ -256,8 +256,8 @@ def resolve_jira(title, merge_branches, comment): print("summary\t\t%s\nassignee\t%s\nstatus\t\t%s\nurl\t\t%s/%sf\n" % (cur_summary, cur_assignee, cur_status, JIRA_BASE, jira_id)) - resolve = filter(lambda a: a['name'] == "Resolve Issue", - asf_jira.transitions(jira_id))[0] + resolve = [x for x in asf_jira.transitions(jira_id) + if x['name'] == "Resolve Issue"][0] asf_jira.transition_issue(jira_id, resolve["id"], comment=comment) print("Succesfully resolved %s!" % (jira_id)) From d4148759a266d90dacd1ca2b7b7ff0df7e02578a Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 1 Nov 2016 14:21:07 -0400 Subject: [PATCH 0186/1644] ARROW-348: [Python] Add build-type command line option to setup.py, build CMake extensions in a build type subdirectory This also resolves ARROW-230. Author: Wes McKinney Closes #187 from wesm/ARROW-348 and squashes the following commits: 3cdaeaf [Wes McKinney] Cast build_type to lowercase in case env variable is uppercase 74bfa71 [Wes McKinney] Pull default build type from environment variable d0b3154 [Wes McKinney] Tweak readme 6017948 [Wes McKinney] Add built-type command line option to setup.py, build extensions in release type subdirectory to avoid conflicts with setuptools --- python/CMakeLists.txt | 3 +-- python/README.md | 9 +++++++++ python/setup.py | 34 ++++++++++++++++------------------ 3 files changed, 26 insertions(+), 20 deletions(-) diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index b8be8665af079..179f02fbc9daa 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -203,8 +203,7 @@ if (${CMAKE_SOURCE_DIR} STREQUAL ${CMAKE_CURRENT_BINARY_DIR}) EXECUTE_PROCESS(COMMAND ln ${MORE_ARGS} -sf ${BUILD_OUTPUT_ROOT_DIRECTORY} ${CMAKE_CURRENT_BINARY_DIR}/build/latest) else() - set(BUILD_OUTPUT_ROOT_DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}") - # set(BUILD_OUTPUT_ROOT_DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}/${BUILD_SUBDIR_NAME}/") + set(BUILD_OUTPUT_ROOT_DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}/${BUILD_SUBDIR_NAME}/") endif() # where to put generated archives (.a files) diff --git a/python/README.md b/python/README.md index e11f64564558c..2a3e1ba9542f5 100644 --- a/python/README.md +++ b/python/README.md @@ -48,6 +48,15 @@ python setup.py build_ext --inplace py.test pyarrow ``` +To change the build type, use the `--build-type` option: + +```bash +python setup.py build_ext --build-type=release --inplace +``` + +To pass through other build options to CMake, set the environment variable +`$PYARROW_CMAKE_OPTIONS`. + #### Build the documentation ```bash diff --git a/python/setup.py b/python/setup.py index cdfdc243e2597..b3012e694243a 100644 --- a/python/setup.py +++ b/python/setup.py @@ -39,14 +39,6 @@ # Check if we're running 64-bit Python is_64_bit = sys.maxsize > 2**32 -# Check if this is a debug build of Python. 
-# if hasattr(sys, 'gettotalrefcount'): -# build_type = 'Debug' -# else: -# build_type = 'Release' - -build_type = 'Debug' - if Cython.__version__ < '0.19.1': raise Exception('Please upgrade to Cython 0.19.1 or newer') @@ -104,13 +96,14 @@ def run(self): # github.com/libdynd/dynd-python description = "Build the C-extensions for arrow" - user_options = ([('extra-cmake-args=', None, - 'extra arguments for CMake')] + - _build_ext.user_options) + user_options = ([('extra-cmake-args=', None, 'extra arguments for CMake'), + ('build-type=', None, 'build type (debug or release)')] + + _build_ext.user_options) def initialize_options(self): _build_ext.initialize_options(self) self.extra_cmake_args = os.environ.get('PYARROW_CMAKE_OPTIONS', '') + self.build_type = os.environ.get('PYARROW_BUILD_TYPE', 'debug').lower() CYTHON_MODULE_NAMES = [ 'array', @@ -152,9 +145,12 @@ def _run_cmake(self): static_lib_option = '' build_tests_option = '' + build_type_option = '-DCMAKE_BUILD_TYPE={0}'.format(self.build_type) + if sys.platform != 'win32': cmake_command = ['cmake', self.extra_cmake_args, pyexe_option, build_tests_option, + build_type_option, static_lib_option, source] self.spawn(cmake_command) @@ -170,7 +166,8 @@ def _run_cmake(self): # Generate the build files extra_cmake_args = shlex.split(self.extra_cmake_args) cmake_command = (['cmake'] + extra_cmake_args + - [source, pyexe_option, + [source, + pyexe_option, static_lib_option, build_tests_option, '-G', cmake_generator]) @@ -179,7 +176,7 @@ def _run_cmake(self): self.spawn(cmake_command) # Do the build - self.spawn(['cmake', '--build', '.', '--config', build_type]) + self.spawn(['cmake', '--build', '.', '--config', self.build_type]) if self.inplace: # a bit hacky @@ -188,14 +185,15 @@ def _run_cmake(self): # Move the built libpyarrow library to the place expected by the Python # build if sys.platform != 'win32': - name, = glob.glob('libpyarrow.*') + name, = glob.glob(pjoin(self.build_type, 'libpyarrow.*')) try: os.makedirs(pjoin(build_lib, 'pyarrow')) except OSError: pass - shutil.move(name, pjoin(build_lib, 'pyarrow', name)) + shutil.move(name, + pjoin(build_lib, 'pyarrow', os.path.split(name)[1])) else: - shutil.move(pjoin(build_type, 'pyarrow.dll'), + shutil.move(pjoin(self.build_type, 'pyarrow.dll'), pjoin(build_lib, 'pyarrow', 'pyarrow.dll')) # Move the built C-extension to the place expected by the Python build @@ -239,10 +237,10 @@ def get_ext_built(self, name): if sys.platform == 'win32': head, tail = os.path.split(name) suffix = sysconfig.get_config_var('SO') - return pjoin(head, build_type, tail + suffix) + return pjoin(head, self.build_type, tail + suffix) else: suffix = sysconfig.get_config_var('SO') - return name + suffix + return pjoin(self.build_type, name + suffix) def get_names(self): return self._found_names From c7db80e729c4b3e984c3ef5630ccbff43f3042b8 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Tue, 1 Nov 2016 14:25:01 -0400 Subject: [PATCH 0187/1644] ARROW-355: Add tests for serialising arrays of empty strings to Parquet Depends on https://issues.apache.org/jira/browse/PARQUET-759 Author: Uwe L. Korn Closes #190 from xhochy/ARROW-355 and squashes the following commits: e5099ce [Uwe L. 
Korn] ARROW-355: Add tests for serialising arrays of empty strings to Parquet --- python/pyarrow/tests/test_parquet.py | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index 0f9f2e40813ce..922ad3aa9ff73 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -73,7 +73,8 @@ def test_pandas_parquet_2_0_rountrip(tmpdir): 'datetime': np.arange("2016-01-01T00:00:00.001", size, dtype='datetime64[ms]'), 'str': [str(x) for x in range(size)], - 'str_with_nulls': [None] + [str(x) for x in range(size - 2)] + [None] + 'str_with_nulls': [None] + [str(x) for x in range(size - 2)] + [None], + 'empty_str': [''] * size }) filename = tmpdir.join('pandas_rountrip.parquet') arrow_table = A.from_pandas_dataframe(df, timestamps_to_ms=True) @@ -98,7 +99,10 @@ def test_pandas_parquet_1_0_rountrip(tmpdir): 'int64': np.arange(size, dtype=np.int64), 'float32': np.arange(size, dtype=np.float32), 'float64': np.arange(size, dtype=np.float64), - 'bool': np.random.randn(size) > 0 + 'bool': np.random.randn(size) > 0, + 'str': [str(x) for x in range(size)], + 'str_with_nulls': [None] + [str(x) for x in range(size - 2)] + [None], + 'empty_str': [''] * size }) filename = tmpdir.join('pandas_rountrip.parquet') arrow_table = A.from_pandas_dataframe(df) From e70d97dbc8dc86161083e94c45d5828f79211f6b Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 2 Nov 2016 08:06:29 +0100 Subject: [PATCH 0188/1644] ARROW-358: Add explicit environment variable to locate libhdfs in one's environment Author: Wes McKinney Closes #195 from wesm/ARROW-358 and squashes the following commits: c00d251 [Wes McKinney] Add explicit environment variable to locate libhdfs in one's environment --- cpp/src/arrow/io/libhdfs_shim.cc | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/cpp/src/arrow/io/libhdfs_shim.cc b/cpp/src/arrow/io/libhdfs_shim.cc index 07eb6250bbe55..1fee595d0718b 100644 --- a/cpp/src/arrow/io/libhdfs_shim.cc +++ b/cpp/src/arrow/io/libhdfs_shim.cc @@ -386,6 +386,11 @@ static std::vector get_potential_libhdfs_paths() { search_paths.push_back(path); } + const char* libhdfs_dir = std::getenv("ARROW_LIBHDFS_DIR"); + if (libhdfs_dir != nullptr) { + search_paths.push_back(fs::path(libhdfs_dir)); + } + // All paths with file name for (auto& path : search_paths) { libhdfs_potential_paths.push_back(path / file_name); From 2a059bd277c58bca80412cbda258a253b801d1a4 Mon Sep 17 00:00:00 2001 From: "Christopher C. Aycock" Date: Wed, 2 Nov 2016 12:15:53 -0400 Subject: [PATCH 0189/1644] ARROW-359: Document ARROW_LIBHDFS_DIR Author: Christopher C. Aycock Closes #196 from chrisaycock/ARROW-359 and squashes the following commits: 52ec78e [Christopher C. Aycock] ARROW-359: Document ARROW_LIBHDFS_DIR --- cpp/doc/HDFS.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/cpp/doc/HDFS.md b/cpp/doc/HDFS.md index 6b1bb8c452461..180d31e54d573 100644 --- a/cpp/doc/HDFS.md +++ b/cpp/doc/HDFS.md @@ -33,16 +33,18 @@ interface to the Java Hadoop client. This library is loaded **at runtime** (rather than at link / library load time, since the library may not be in your LD_LIBRARY_PATH), and relies on some environment variables. -* `HADOOP_HOME`: the root of your installed Hadoop distribution. Check in the - `lib/native` directory to look for `libhdfs.so` if you have any questions - about which directory you're after. 
-* `JAVA_HOME`: the location of your Java SDK installation +* `HADOOP_HOME`: the root of your installed Hadoop distribution. Often has +`lib/native/libhdfs.so`. +* `JAVA_HOME`: the location of your Java SDK installation. * `CLASSPATH`: must contain the Hadoop jars. You can set these using: ```shell export CLASSPATH=`$HADOOP_HOME/bin/hadoop classpath --glob` ``` +* `ARROW_LIBHDFS_DIR` (optional): explicit location of `libhdfs.so` if it is +installed somewhere other than `$HADOOP_HOME/lib/native`. + ### Mac Specifics The installed location of Java on OS X can vary, however the following snippet From 17c9ae7c4ceb328c897fb6c9025c763a879ebefa Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Wed, 2 Nov 2016 12:20:15 -0400 Subject: [PATCH 0190/1644] ARROW-357: Use a single RowGroup for Parquet files as default. This is not the optimal choice, we should rather have an option to optimise for the underlying block size of the filesystem but without the infrastructure for that in ``parquet-cpp``, writing a single RowGroup is the much better choice. Author: Uwe L. Korn Closes #192 from xhochy/ARROW-357 and squashes the following commits: 9eccefd [Uwe L. Korn] ARROW-357: Use a single RowGroup for Parquet files as default. --- python/pyarrow/parquet.pyx | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/python/pyarrow/parquet.pyx b/python/pyarrow/parquet.pyx index 019dd2c1de489..a56c1e1456d17 100644 --- a/python/pyarrow/parquet.pyx +++ b/python/pyarrow/parquet.pyx @@ -106,7 +106,8 @@ def write_table(table, filename, chunk_size=None, version=None, table : pyarrow.Table filename : string chunk_size : int - The maximum number of rows in each Parquet RowGroup + The maximum number of rows in each Parquet RowGroup. As a default, + we will write a single RowGroup per file. 
version : {"1.0", "2.0"}, default "1.0" The Parquet format version, defaults to 1.0 use_dictionary : bool or list @@ -121,7 +122,7 @@ def write_table(table, filename, chunk_size=None, version=None, cdef WriterProperties.Builder properties_builder cdef int64_t chunk_size_ = 0 if chunk_size is None: - chunk_size_ = min(ctable_.num_rows(), int(2**16)) + chunk_size_ = ctable_.num_rows() else: chunk_size_ = chunk_size From 25e010607542aa7330bd881e145180fe606776c5 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 3 Nov 2016 13:22:19 -0400 Subject: [PATCH 0191/1644] ARROW-323: [Python] Opt-in to pyarrow.parquet extension rather than attempting and failing silently Added a couple ways to do this, either via the `--with-parquet` command line option (preferred) or by passing through an option to CMake Author: Wes McKinney Closes #194 from wesm/ARROW-323 and squashes the following commits: 07c05cc [Wes McKinney] Update readme to illustrate proper use of with build_ext 3bd9a8d [Wes McKinney] Add --with-parquet option to setup.py 374e254 [Wes McKinney] Add to README about building the parquet extension cab55cb [Wes McKinney] Opt in to building the pyarrow.parquet extension, do not silently fail --- python/CMakeLists.txt | 8 +++++++- python/README.md | 20 +++++++++++++++++++- python/setup.py | 38 ++++++++++++++++++++++++-------------- 3 files changed, 50 insertions(+), 16 deletions(-) diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 179f02fbc9daa..6ad55f8c9a7b8 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -50,6 +50,9 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") option(PYARROW_BUILD_TESTS "Build the PyArrow C++ googletest unit tests" OFF) + option(PYARROW_BUILD_PARQUET + "Build the PyArrow Parquet integration" + OFF) endif() find_program(CCACHE_FOUND ccache) @@ -445,7 +448,10 @@ set(LINK_LIBS arrow_ipc ) -if(PARQUET_FOUND AND PARQUET_ARROW_FOUND) +if (PYARROW_BUILD_PARQUET) + if(NOT (PARQUET_FOUND AND PARQUET_ARROW_FOUND)) + message(FATAL_ERROR "Unable to locate Parquet libraries") + endif() ADD_THIRDPARTY_LIB(parquet_arrow SHARED_LIB ${PARQUET_ARROW_SHARED_LIB}) set(LINK_LIBS diff --git a/python/README.md b/python/README.md index 2a3e1ba9542f5..4fce0d26b2850 100644 --- a/python/README.md +++ b/python/README.md @@ -48,7 +48,8 @@ python setup.py build_ext --inplace py.test pyarrow ``` -To change the build type, use the `--build-type` option: +To change the build type, use the `--build-type` option or set +`$PYARROW_BUILD_TYPE`: ```bash python setup.py build_ext --build-type=release --inplace @@ -57,9 +58,26 @@ python setup.py build_ext --build-type=release --inplace To pass through other build options to CMake, set the environment variable `$PYARROW_CMAKE_OPTIONS`. +#### Build the pyarrow Parquet file extension + +To build the integration with [parquet-cpp][1], pass `--with-parquet` to +the `build_ext` option in setup.py: + +``` +python setup.py build_ext --with-parquet install +``` + +Alternately, add `-DPYARROW_BUILD_PARQUET=on` to the general CMake options. 
+ +``` +export PYARROW_CMAKE_OPTIONS=-DPYARROW_BUILD_PARQUET=on +``` + #### Build the documentation ```bash pip install -r doc/requirements.txt python setup.py build_sphinx ``` + +[1]: https://github.com/apache/parquet-cpp \ No newline at end of file diff --git a/python/setup.py b/python/setup.py index b3012e694243a..341cc64aa2cc8 100644 --- a/python/setup.py +++ b/python/setup.py @@ -97,13 +97,15 @@ def run(self): description = "Build the C-extensions for arrow" user_options = ([('extra-cmake-args=', None, 'extra arguments for CMake'), - ('build-type=', None, 'build type (debug or release)')] - + _build_ext.user_options) + ('build-type=', None, 'build type (debug or release)'), + ('with-parquet', None, 'build the Parquet extension')] + + _build_ext.user_options) def initialize_options(self): _build_ext.initialize_options(self) self.extra_cmake_args = os.environ.get('PYARROW_CMAKE_OPTIONS', '') self.build_type = os.environ.get('PYARROW_BUILD_TYPE', 'debug').lower() + self.with_parquet = False CYTHON_MODULE_NAMES = [ 'array', @@ -116,8 +118,6 @@ def initialize_options(self): 'schema', 'table'] - CYTHON_ALLOWED_FAILURES = ['parquet'] - def _run_cmake(self): # The directory containing this setup.py source = osp.dirname(osp.abspath(__file__)) @@ -141,17 +141,24 @@ def _run_cmake(self): if (cachedir != build_temp): return - pyexe_option = '-DPYTHON_EXECUTABLE=%s' % sys.executable static_lib_option = '' build_tests_option = '' - build_type_option = '-DCMAKE_BUILD_TYPE={0}'.format(self.build_type) + cmake_options = [ + '-DPYTHON_EXECUTABLE=%s' % sys.executable, + static_lib_option, + build_tests_option, + ] + + if self.with_parquet: + cmake_options.append('-DPYARROW_BUILD_PARQUET=on') if sys.platform != 'win32': - cmake_command = ['cmake', self.extra_cmake_args, pyexe_option, - build_tests_option, - build_type_option, - static_lib_option, source] + cmake_options.append('-DCMAKE_BUILD_TYPE={0}' + .format(self.build_type)) + + cmake_command = (['cmake', self.extra_cmake_args] + + cmake_options + [source]) self.spawn(cmake_command) args = ['make', 'VERBOSE=1'] @@ -166,10 +173,8 @@ def _run_cmake(self): # Generate the build files extra_cmake_args = shlex.split(self.extra_cmake_args) cmake_command = (['cmake'] + extra_cmake_args + + cmake_options + [source, - pyexe_option, - static_lib_option, - build_tests_option, '-G', cmake_generator]) if "-G" in self.extra_cmake_args: cmake_command = cmake_command[:-2] @@ -202,7 +207,7 @@ def _run_cmake(self): built_path = self.get_ext_built(name) if not os.path.exists(built_path): print(built_path) - if name in self.CYTHON_ALLOWED_FAILURES: + if self._failure_permitted(name): print('Cython module {0} failure permitted'.format(name)) continue raise RuntimeError('libpyarrow C-extension failed to build:', @@ -219,6 +224,11 @@ def _run_cmake(self): os.chdir(saved_cwd) + def _failure_permitted(self, name): + if name == 'parquet' and not self.with_parquet: + return True + return False + def _get_inplace_dir(self): pass From e8bc1fe3ba7f94b39f38571a435f93f387e67d37 Mon Sep 17 00:00:00 2001 From: Bryan Cutler Date: Sun, 6 Nov 2016 12:10:06 +0100 Subject: [PATCH 0192/1644] ARROW-368: Added note for LD_LIBRARY_PATH in Python README Added note to use LD_LIBRARY_PATH env var to add $ARROW_HOME/lib path so PyArrow can locate Arrow-Cpp shared libs. 
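A quick way to verify the note before importing, sketched in Python under the assumption that `ARROW_HOME` is set as in the build instructions:

```python
import os

lib_dir = os.path.join(os.environ['ARROW_HOME'], 'lib')
ld_paths = os.environ.get('LD_LIBRARY_PATH', '').split(os.pathsep)
if lib_dir not in ld_paths:
    # The dynamic linker reads LD_LIBRARY_PATH at process startup, so it
    # must be exported before Python is launched.
    raise RuntimeError('add %s to LD_LIBRARY_PATH before starting Python'
                       % lib_dir)

import pyarrow  # noqa: E402 -- the Arrow-cpp shared libs should now resolve
```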
Author: Bryan Cutler Closes #199 from BryanCutler/pyarrow-README-note-LD_LIBRARY_PATH-ARROW-368 and squashes the following commits: 15861c4 [Bryan Cutler] Added note for LD_LIBRARY_PATH in Python README --- python/README.md | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/python/README.md b/python/README.md index 4fce0d26b2850..88ab17e71730f 100644 --- a/python/README.md +++ b/python/README.md @@ -33,12 +33,19 @@ These are the various projects that PyArrow depends on. 1. **g++ and gcc Version >= 4.8** 2. **cmake > 2.8.6** 3. **boost** -4. **Arrow-cpp and its dependencies*** +4. **Arrow-cpp and its dependencies** The Arrow C++ library must be built with all options enabled and installed with ``ARROW_HOME`` environment variable set to the installation location. Look at (https://github.com/apache/arrow/blob/master/cpp/README.md) for instructions. +Ensure PyArrow can locate the Arrow-cpp shared libraries by setting the +LD_LIBRARY_PATH environment variable. + +```bash +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ARROW_HOME/lib +``` + 5. **Python dependencies: numpy, pandas, cython, pytest** #### Build pyarrow and run the unit tests From 121e82682344b04bdb26edf16344a9fb2cee240c Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sun, 6 Nov 2016 16:08:44 -0500 Subject: [PATCH 0193/1644] ARROW-361: Python: Support reading a column-selection from Parquet files Author: Uwe L. Korn Closes #197 from xhochy/ARROW-361 and squashes the following commits: c1fb939 [Uwe L. Korn] Cache column indices 0c32213 [Uwe L. Korn] ARROW-361: Python: Support reading a column-selection from Parquet files --- python/pyarrow/includes/parquet.pxd | 25 ++++++++++--- python/pyarrow/parquet.pyx | 53 +++++++++++++++++++++++++++- python/pyarrow/tests/test_parquet.py | 16 +++++++++ 3 files changed, 89 insertions(+), 5 deletions(-) diff --git a/python/pyarrow/includes/parquet.pxd b/python/pyarrow/includes/parquet.pxd index 754eeccecc8e9..57c35ba89445b 100644 --- a/python/pyarrow/includes/parquet.pxd +++ b/python/pyarrow/includes/parquet.pxd @@ -18,7 +18,7 @@ # distutils: language = c++ from pyarrow.includes.common cimport * -from pyarrow.includes.libarrow cimport CSchema, CStatus, CTable, MemoryPool +from pyarrow.includes.libarrow cimport CArray, CSchema, CStatus, CTable, MemoryPool from pyarrow.includes.libarrow_io cimport ReadableFileInterface @@ -32,6 +32,9 @@ cdef extern from "parquet/api/schema.h" namespace "parquet::schema" nogil: cdef cppclass PrimitiveNode(Node): pass + cdef cppclass ColumnPath: + c_string ToDotString() + cdef extern from "parquet/api/schema.h" namespace "parquet" nogil: enum ParquetVersion" parquet::ParquetVersion::type": PARQUET_1_0" parquet::ParquetVersion::PARQUET_1_0" @@ -44,13 +47,14 @@ cdef extern from "parquet/api/schema.h" namespace "parquet" nogil: LZO" parquet::Compression::LZO" BROTLI" parquet::Compression::BROTLI" + cdef cppclass ColumnDescriptor: + shared_ptr[ColumnPath] path() + cdef cppclass SchemaDescriptor: + const ColumnDescriptor* Column(int i) shared_ptr[Node] schema() GroupNode* group() - cdef cppclass ColumnDescriptor: - pass - cdef extern from "parquet/api/reader.h" namespace "parquet" nogil: cdef cppclass ColumnReader: @@ -80,10 +84,21 @@ cdef extern from "parquet/api/reader.h" namespace "parquet" nogil: cdef cppclass RowGroupReader: pass + cdef cppclass FileMetaData: + uint32_t size() + int num_columns() + int64_t num_rows() + int num_row_groups() + int32_t version() + const c_string created_by() + int num_schema_elements() + const SchemaDescriptor* 
schema() + cdef cppclass ParquetFileReader: # TODO: Some default arguments are missing @staticmethod unique_ptr[ParquetFileReader] OpenFile(const c_string& path) + const FileMetaData* metadata(); cdef extern from "parquet/api/writer.h" namespace "parquet" nogil: @@ -124,7 +139,9 @@ cdef extern from "parquet/arrow/reader.h" namespace "parquet::arrow" nogil: cdef cppclass FileReader: FileReader(MemoryPool* pool, unique_ptr[ParquetFileReader] reader) + CStatus ReadFlatColumn(int i, shared_ptr[CArray]* out); CStatus ReadFlatTable(shared_ptr[CTable]* out); + const ParquetFileReader* parquet_reader(); cdef extern from "parquet/arrow/schema.h" namespace "parquet::arrow" nogil: diff --git a/python/pyarrow/parquet.pyx b/python/pyarrow/parquet.pyx index a56c1e1456d17..2152f89474195 100644 --- a/python/pyarrow/parquet.pyx +++ b/python/pyarrow/parquet.pyx @@ -24,6 +24,7 @@ from pyarrow.includes.parquet cimport * from pyarrow.includes.libarrow_io cimport ReadableFileInterface cimport pyarrow.includes.pyarrow as pyarrow +from pyarrow.array cimport Array from pyarrow.compat import tobytes from pyarrow.error import ArrowException from pyarrow.error cimport check_status @@ -43,6 +44,7 @@ cdef class ParquetReader: cdef: ParquetAllocator allocator unique_ptr[FileReader] reader + column_idx_map def __cinit__(self): self.allocator.set_pool(default_memory_pool()) @@ -76,11 +78,55 @@ cdef class ParquetReader: table.init(ctable) return table + def column_name_idx(self, column_name): + """ + Find the matching index of a column in the schema. + + Parameter + --------- + column_name: str + Name of the column, separation of nesting levels is done via ".". + + Returns + ------- + column_idx: int + Integer index of the position of the column + """ + cdef: + const FileMetaData* metadata = self.reader.get().parquet_reader().metadata() + int i = 0 + + if self.column_idx_map is None: + self.column_idx_map = {} + for i in range(0, metadata.num_columns()): + self.column_idx_map[str(metadata.schema().Column(i).path().get().ToDotString())] = i + + return self.column_idx_map[column_name] + + def read_column(self, int column_index): + cdef: + Array array = Array() + shared_ptr[CArray] carray + + with nogil: + check_status(self.reader.get().ReadFlatColumn(column_index, &carray)) + + array.init(carray) + return array + def read_table(source, columns=None): """ Read a Table from Parquet format + Parameters + ---------- + source: str or pyarrow.io.NativeFile + Readable source. For passing Python file objects or byte buffers, see + pyarrow.io.PythonFileInterface or pyarrow.io.BytesReader. + columns: list + If not None, only these columns will be read from the file. 
+ Returns ------- pyarrow.table.Table @@ -93,7 +139,12 @@ def read_table(source, columns=None): elif isinstance(source, NativeFile): reader.open_native_file(source) - return reader.read_all() + if columns is None: + return reader.read_all() + else: + column_idxs = [reader.column_name_idx(column) for column in columns] + arrays = [reader.read_column(column_idx) for column_idx in column_idxs] + return Table.from_arrays(columns, arrays) def write_table(table, filename, chunk_size=None, version=None, diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index 922ad3aa9ff73..c1d44ce0d4230 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -115,6 +115,22 @@ def test_pandas_parquet_1_0_rountrip(tmpdir): pdt.assert_frame_equal(df, df_read) +@parquet +def test_pandas_column_selection(tmpdir): + size = 10000 + np.random.seed(0) + df = pd.DataFrame({ + 'uint8': np.arange(size, dtype=np.uint8), + 'uint16': np.arange(size, dtype=np.uint16) + }) + filename = tmpdir.join('pandas_rountrip.parquet') + arrow_table = A.from_pandas_dataframe(df) + A.parquet.write_table(arrow_table, filename.strpath) + table_read = pq.read_table(filename.strpath, columns=['uint8']) + df_read = table_read.to_pandas() + + pdt.assert_frame_equal(df[['uint8']], df_read) + @parquet def test_pandas_parquet_configuration_options(tmpdir): size = 10000 From 79344b335849c2eb43954b0751018051814019d6 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 7 Nov 2016 13:52:32 -0500 Subject: [PATCH 0194/1644] ARROW-362: Fix memory leak in zero-copy arrow to NumPy/pandas conversion close #198 Author: Wes McKinney Author: Uwe L. Korn Closes #200 from wesm/ARROW-362 and squashes the following commits: 99df96b [Wes McKinney] Force gc to avoid non-deterministic failure d85228f [Wes McKinney] Be more careful about reference counts in zero-copy handoff, add pyarrow.Array.to_pandas method cc7a6b3 [Uwe L. 
Korn] ARROW-362: Remove redunant reference count --- python/pyarrow/array.pyx | 21 ++++++++++++++ python/pyarrow/includes/common.pxd | 7 +++++ python/pyarrow/includes/pyarrow.pxd | 4 +-- python/pyarrow/table.pyx | 18 ++++++++---- python/pyarrow/tests/test_array.py | 29 ++++++++++++++++++++ python/pyarrow/tests/test_convert_builtin.py | 4 +++ python/src/pyarrow/adapters/pandas.cc | 4 +-- 7 files changed, 76 insertions(+), 11 deletions(-) diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 84ab4a48c9b65..fbe4e3879062c 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -22,6 +22,7 @@ import numpy as np from pyarrow.includes.libarrow cimport * +from pyarrow.includes.common cimport PyObject_to_object cimport pyarrow.includes.pyarrow as pyarrow import pyarrow.config @@ -35,6 +36,8 @@ from pyarrow.scalar import NA from pyarrow.schema cimport Schema import pyarrow.schema as schema +cimport cpython + def total_allocated_bytes(): cdef MemoryPool* pool = pyarrow.get_memory_pool() @@ -111,6 +114,24 @@ cdef class Array: def slice(self, start, end): pass + def to_pandas(self): + """ + Convert to an array object suitable for use in pandas + + See also + -------- + Column.to_pandas + Table.to_pandas + RecordBatch.to_pandas + """ + cdef: + PyObject* np_arr + + check_status(pyarrow.ConvertArrayToPandas( + self.sp_array, self, &np_arr)) + + return PyObject_to_object(np_arr) + cdef class NullArray(Array): pass diff --git a/python/pyarrow/includes/common.pxd b/python/pyarrow/includes/common.pxd index 05c0123ee7b7e..f689bdc3fd819 100644 --- a/python/pyarrow/includes/common.pxd +++ b/python/pyarrow/includes/common.pxd @@ -47,3 +47,10 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: c_bool IsKeyError() c_bool IsNotImplemented() c_bool IsInvalid() + + +cdef inline object PyObject_to_object(PyObject* o): + # Cast to "object" increments reference count + cdef object result = o + cpython.Py_DECREF(result) + return result diff --git a/python/pyarrow/includes/pyarrow.pxd b/python/pyarrow/includes/pyarrow.pxd index e1da1914c5743..a5444c236bcc8 100644 --- a/python/pyarrow/includes/pyarrow.pxd +++ b/python/pyarrow/includes/pyarrow.pxd @@ -34,10 +34,10 @@ cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil: shared_ptr[CArray]* out) CStatus ConvertArrayToPandas(const shared_ptr[CArray]& arr, - object py_ref, PyObject** out) + PyObject* py_ref, PyObject** out) CStatus ConvertColumnToPandas(const shared_ptr[CColumn]& arr, - object py_ref, PyObject** out) + PyObject* py_ref, PyObject** out) MemoryPool* get_memory_pool() diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index 969571262ca44..c71bc712bffb1 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -22,6 +22,7 @@ from cython.operator cimport dereference as deref from pyarrow.includes.libarrow cimport * +from pyarrow.includes.common cimport PyObject_to_object cimport pyarrow.includes.pyarrow as pyarrow import pyarrow.config @@ -32,6 +33,7 @@ from pyarrow.schema cimport box_data_type, box_schema from pyarrow.compat import frombytes, tobytes +cimport cpython cdef class ChunkedArray: ''' @@ -100,8 +102,10 @@ cdef class Column: import pandas as pd - check_status(pyarrow.ConvertColumnToPandas(self.sp_column, self, &arr)) - return pd.Series(arr, name=self.name) + check_status(pyarrow.ConvertColumnToPandas(self.sp_column, + self, &arr)) + + return pd.Series(PyObject_to_object(arr), name=self.name) cdef _check_nullptr(self): if self.column == NULL: @@ -248,9 +252,10 @@ cdef class 
RecordBatch: data = [] for i in range(self.batch.num_columns()): arr = self.batch.column(i) - check_status(pyarrow.ConvertArrayToPandas(arr, self, &np_arr)) + check_status(pyarrow.ConvertArrayToPandas(arr, self, + &np_arr)) names.append(frombytes(self.batch.column_name(i))) - data.append( np_arr) + data.append(PyObject_to_object(np_arr)) return pd.DataFrame(dict(zip(names, data)), columns=names) @@ -375,9 +380,10 @@ cdef class Table: for i in range(self.table.num_columns()): col = self.table.column(i) column = self.column(i) - check_status(pyarrow.ConvertColumnToPandas(col, column, &arr)) + check_status(pyarrow.ConvertColumnToPandas( + col, column, &arr)) names.append(frombytes(col.get().name())) - data.append( arr) + data.append(PyObject_to_object(arr)) return pd.DataFrame(dict(zip(names, data)), columns=names) diff --git a/python/pyarrow/tests/test_array.py b/python/pyarrow/tests/test_array.py index 0a17f691ccd1f..ead17dbec4e35 100644 --- a/python/pyarrow/tests/test_array.py +++ b/python/pyarrow/tests/test_array.py @@ -15,6 +15,8 @@ # specific language governing permissions and limitations # under the License. +import sys + import pyarrow import pyarrow.formatting as fmt @@ -71,3 +73,30 @@ def test_long_array_format(): 99 ]""" assert result == expected + + +def test_to_pandas_zero_copy(): + import gc + + arr = pyarrow.from_pylist(range(10)) + + for i in range(10): + np_arr = arr.to_pandas() + assert sys.getrefcount(np_arr) == 2 + np_arr = None # noqa + + assert sys.getrefcount(arr) == 2 + + for i in range(10): + arr = pyarrow.from_pylist(range(10)) + np_arr = arr.to_pandas() + arr = None + gc.collect() + + # Ensure base is still valid + + # Because of py.test's assert inspection magic, if you put getrefcount + # on the line being examined, it will be 1 higher than you expect + base_refcount = sys.getrefcount(np_arr.base) + assert base_refcount == 2 + np_arr.sum() diff --git a/python/pyarrow/tests/test_convert_builtin.py b/python/pyarrow/tests/test_convert_builtin.py index 2beb6b39d73ed..8937f8db6941f 100644 --- a/python/pyarrow/tests/test_convert_builtin.py +++ b/python/pyarrow/tests/test_convert_builtin.py @@ -47,6 +47,10 @@ def test_integer(self): def test_garbage_collection(self): import gc + + # Force the cyclic garbage collector to run + gc.collect() + bytes_before = pyarrow.total_allocated_bytes() pyarrow.from_pylist([1, None, 3, None]) gc.collect() diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index 7e70be75da5fc..6a3966b748806 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -628,8 +628,6 @@ class ArrowDeserializer { PyAcquireGIL lock; // Zero-Copy. We can pass the data pointer directly to NumPy. - Py_INCREF(py_ref_); - OwnedRef py_ref(py_ref_); npy_intp dims[1] = {col_->length()}; out_ = reinterpret_cast(PyArray_SimpleNewFromData(1, dims, type, data)); @@ -646,7 +644,7 @@ class ArrowDeserializer { return Status::OK(); } else { // PyArray_SetBaseObject steals our reference to py_ref_ - py_ref.release(); + Py_INCREF(py_ref_); } // Arrow data is immutable. 
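The behavior ARROW-362 locks in, condensed from the new tests above into a standalone sketch (same `from_pylist`/`to_pandas` calls as the test suite):

```python
import gc
import sys

import pyarrow

arr = pyarrow.from_pylist(list(range(10)))
np_arr = arr.to_pandas()   # zero-copy: NumPy wraps the Arrow buffer directly
arr = None
gc.collect()

# The NumPy array's .base keeps the underlying Arrow data alive, so the
# values stay valid (and the refcount stays balanced) after the Arrow
# handle itself is dropped.
print(np_arr.sum())
print(sys.getrefcount(np_arr.base))
```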
From 6996c17f70dc13659c37dfaa39bc28e7777ca6a6 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Tue, 8 Nov 2016 13:29:34 -0500 Subject: [PATCH 0195/1644] ARROW-312: [Java] IPC file round trip tool for integration testing Author: Julien Le Dem Author: Wes McKinney Closes #186 from wesm/roundtrip-tool and squashes the following commits: aee552a [Julien Le Dem] missing file 9d5c078 [Julien Le Dem] fix read-write bug 7f20b36 [Julien Le Dem] simple roundtrip a04091f [Wes McKinney] Drafting file round trip helper executable --- .../main/java/io/netty/buffer/ArrowBuf.java | 7 +- .../arrow/memory/TestBaseAllocator.java | 24 ++- java/pom.xml | 1 + java/tools/pom.xml | 73 ++++++++ .../org/apache/arrow/tools/FileRoundtrip.java | 135 +++++++++++++++ .../apache/arrow/tools/TestFileRoundtrip.java | 159 ++++++++++++++++++ java/vector/pom.xml | 32 ++-- .../templates/NullableValueVectors.java | 2 +- .../org/apache/arrow/vector/VectorLoader.java | 21 +-- .../apache/arrow/vector/VectorSchemaRoot.java | 140 +++++++++++++++ .../apache/arrow/vector/VectorUnloader.java | 13 +- .../arrow/vector/schema/ArrowBuffer.java | 6 + .../arrow/vector/schema/ArrowRecordBatch.java | 8 + .../arrow/vector/TestVectorUnloadLoad.java | 42 +++-- .../arrow/vector/file/TestArrowFile.java | 149 ++++++++-------- 15 files changed, 681 insertions(+), 131 deletions(-) create mode 100644 java/tools/pom.xml create mode 100644 java/tools/src/main/java/org/apache/arrow/tools/FileRoundtrip.java create mode 100644 java/tools/src/test/java/org/apache/arrow/tools/TestFileRoundtrip.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/VectorSchemaRoot.java diff --git a/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java b/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java index a5989c1518def..95d2be5a43a36 100644 --- a/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java +++ b/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java @@ -179,7 +179,10 @@ public ArrowBuf retain(BufferAllocator target) { historicalLog.recordEvent("retain(%s)", target.getName()); } final BufferLedger otherLedger = this.ledger.getLedgerForAllocator(target); - return otherLedger.newArrowBuf(offset, length, null); + ArrowBuf newArrowBuf = otherLedger.newArrowBuf(offset, length, null); + newArrowBuf.readerIndex(this.readerIndex); + newArrowBuf.writerIndex(this.writerIndex); + return newArrowBuf; } /** @@ -214,6 +217,8 @@ public TransferResult transferOwnership(BufferAllocator target) { final BufferLedger otherLedger = this.ledger.getLedgerForAllocator(target); final ArrowBuf newBuf = otherLedger.newArrowBuf(offset, length, null); + newBuf.readerIndex(this.readerIndex); + newBuf.writerIndex(this.writerIndex); final boolean allocationFit = this.ledger.transferBalance(otherLedger); return new TransferResult(allocationFit, newBuf); } diff --git a/java/memory/src/test/java/org/apache/arrow/memory/TestBaseAllocator.java b/java/memory/src/test/java/org/apache/arrow/memory/TestBaseAllocator.java index aa6b70c5c74e2..3c96d57f4e64d 100644 --- a/java/memory/src/test/java/org/apache/arrow/memory/TestBaseAllocator.java +++ b/java/memory/src/test/java/org/apache/arrow/memory/TestBaseAllocator.java @@ -22,16 +22,13 @@ import static org.junit.Assert.assertNotNull; import static org.junit.Assert.assertTrue; import static org.junit.Assert.fail; -import io.netty.buffer.ArrowBuf; -import io.netty.buffer.ArrowBuf.TransferResult; -import org.apache.arrow.memory.AllocationReservation; -import org.apache.arrow.memory.BufferAllocator; -import 
org.apache.arrow.memory.OutOfMemoryException; -import org.apache.arrow.memory.RootAllocator; import org.junit.Ignore; import org.junit.Test; +import io.netty.buffer.ArrowBuf; +import io.netty.buffer.ArrowBuf.TransferResult; + public class TestBaseAllocator { // private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(TestBaseAllocator.class); @@ -134,6 +131,7 @@ public void testAllocator_transferOwnership() throws Exception { final ArrowBuf arrowBuf1 = childAllocator1.buffer(MAX_ALLOCATION / 4); rootAllocator.verify(); TransferResult transferOwnership = arrowBuf1.transferOwnership(childAllocator2); + assertEquiv(arrowBuf1, transferOwnership.buffer); final boolean allocationFit = transferOwnership.allocationFit; rootAllocator.verify(); assertTrue(allocationFit); @@ -160,6 +158,7 @@ public void testAllocator_shareOwnership() throws Exception { rootAllocator.verify(); assertNotNull(arrowBuf2); assertNotEquals(arrowBuf2, arrowBuf1); + assertEquiv(arrowBuf1, arrowBuf2); // release original buffer (thus transferring ownership to allocator 2. (should leave allocator 1 in empty state) arrowBuf1.release(); @@ -172,6 +171,7 @@ public void testAllocator_shareOwnership() throws Exception { assertNotNull(arrowBuf3); assertNotEquals(arrowBuf3, arrowBuf1); assertNotEquals(arrowBuf3, arrowBuf2); + assertEquiv(arrowBuf1, arrowBuf3); rootAllocator.verify(); arrowBuf2.release(); @@ -452,8 +452,10 @@ public void testAllocator_transferSliced() throws Exception { rootAllocator.verify(); TransferResult result1 = arrowBuf2s.transferOwnership(childAllocator1); + assertEquiv(arrowBuf2s, result1.buffer); rootAllocator.verify(); TransferResult result2 = arrowBuf1s.transferOwnership(childAllocator2); + assertEquiv(arrowBuf1s, result2.buffer); rootAllocator.verify(); result1.buffer.release(); @@ -482,7 +484,9 @@ public void testAllocator_shareSliced() throws Exception { rootAllocator.verify(); final ArrowBuf arrowBuf2s1 = arrowBuf2s.retain(childAllocator1); + assertEquiv(arrowBuf2s, arrowBuf2s1); final ArrowBuf arrowBuf1s2 = arrowBuf1s.retain(childAllocator2); + assertEquiv(arrowBuf1s, arrowBuf1s2); rootAllocator.verify(); arrowBuf1s.release(); // releases arrowBuf1 @@ -512,11 +516,13 @@ public void testAllocator_transferShared() throws Exception { rootAllocator.verify(); assertNotNull(arrowBuf2); assertNotEquals(arrowBuf2, arrowBuf1); + assertEquiv(arrowBuf1, arrowBuf2); TransferResult result = arrowBuf1.transferOwnership(childAllocator3); allocationFit = result.allocationFit; final ArrowBuf arrowBuf3 = result.buffer; assertTrue(allocationFit); + assertEquiv(arrowBuf1, arrowBuf3); rootAllocator.verify(); // Since childAllocator3 now has childAllocator1's buffer, 1, can close @@ -533,6 +539,7 @@ public void testAllocator_transferShared() throws Exception { allocationFit = result.allocationFit; final ArrowBuf arrowBuf4 = result2.buffer; assertTrue(allocationFit); + assertEquiv(arrowBuf3, arrowBuf4); rootAllocator.verify(); arrowBuf3.release(); @@ -645,4 +652,9 @@ public void multiple() throws Exception { } } + + public void assertEquiv(ArrowBuf origBuf, ArrowBuf newBuf) { + assertEquals(origBuf.readerIndex(), newBuf.readerIndex()); + assertEquals(origBuf.writerIndex(), newBuf.writerIndex()); + } } diff --git a/java/pom.xml b/java/pom.xml index 0147de7035794..7221a140d96ec 100644 --- a/java/pom.xml +++ b/java/pom.xml @@ -467,5 +467,6 @@ format memory vector + tools diff --git a/java/tools/pom.xml b/java/tools/pom.xml new file mode 100644 index 0000000000000..84b0b5eb4253c --- /dev/null +++ 
b/java/tools/pom.xml @@ -0,0 +1,73 @@ + + + + 4.0.0 + + org.apache.arrow + arrow-java-root + 0.1.1-SNAPSHOT + + arrow-tools + Arrow Tools + + + + org.apache.arrow + arrow-format + ${project.version} + + + org.apache.arrow + arrow-memory + ${project.version} + + + org.apache.arrow + arrow-vector + ${project.version} + + + org.apache.commons + commons-lang3 + 3.4 + + + commons-cli + commons-cli + 1.2 + + + + + + + maven-assembly-plugin + 2.6 + + + jar-with-dependencies + + + + + make-assembly + package + + single + + + + + + + + diff --git a/java/tools/src/main/java/org/apache/arrow/tools/FileRoundtrip.java b/java/tools/src/main/java/org/apache/arrow/tools/FileRoundtrip.java new file mode 100644 index 0000000000000..db7a1c23f9ca6 --- /dev/null +++ b/java/tools/src/main/java/org/apache/arrow/tools/FileRoundtrip.java @@ -0,0 +1,135 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.arrow.tools; + +import java.io.File; +import java.io.FileInputStream; +import java.io.FileOutputStream; +import java.io.IOException; +import java.io.PrintStream; +import java.util.List; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.VectorLoader; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.VectorUnloader; +import org.apache.arrow.vector.file.ArrowBlock; +import org.apache.arrow.vector.file.ArrowFooter; +import org.apache.arrow.vector.file.ArrowReader; +import org.apache.arrow.vector.file.ArrowWriter; +import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.types.pojo.Schema; +import org.apache.commons.cli.CommandLine; +import org.apache.commons.cli.CommandLineParser; +import org.apache.commons.cli.Options; +import org.apache.commons.cli.ParseException; +import org.apache.commons.cli.PosixParser; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public class FileRoundtrip { + private static final Logger LOGGER = LoggerFactory.getLogger(FileRoundtrip.class); + + public static void main(String[] args) { + System.exit(new FileRoundtrip(System.out, System.err).run(args)); + } + + private final Options options; + private final PrintStream out; + private final PrintStream err; + + FileRoundtrip(PrintStream out, PrintStream err) { + this.out = out; + this.err = err; + this.options = new Options(); + this.options.addOption("i", "in", true, "input file"); + this.options.addOption("o", "out", true, "output file"); + + } + + private File validateFile(String type, String fileName) { + if (fileName == null) { + throw new IllegalArgumentException("missing " + type + " file parameter"); + } + File f = new File(fileName); + if (!f.exists() || f.isDirectory()) { + throw new 
IllegalArgumentException(type + " file not found: " + f.getAbsolutePath()); + } + return f; + } + + int run(String[] args) { + try { + CommandLineParser parser = new PosixParser(); + CommandLine cmd = parser.parse(options, args, false); + + String inFileName = cmd.getOptionValue("in"); + String outFileName = cmd.getOptionValue("out"); + + File inFile = validateFile("input", inFileName); + File outFile = validateFile("output", outFileName); + BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); // TODO: close + try( + FileInputStream fileInputStream = new FileInputStream(inFile); + ArrowReader arrowReader = new ArrowReader(fileInputStream.getChannel(), allocator);) { + + ArrowFooter footer = arrowReader.readFooter(); + Schema schema = footer.getSchema(); + LOGGER.debug("Input file size: " + inFile.length()); + LOGGER.debug("Found schema: " + schema); + + try ( + FileOutputStream fileOutputStream = new FileOutputStream(outFile); + ArrowWriter arrowWriter = new ArrowWriter(fileOutputStream.getChannel(), schema); + ) { + + // initialize vectors + + List recordBatches = footer.getRecordBatches(); + for (ArrowBlock rbBlock : recordBatches) { + try (ArrowRecordBatch inRecordBatch = arrowReader.readRecordBatch(rbBlock); + VectorSchemaRoot root = new VectorSchemaRoot(schema, allocator);) { + + VectorLoader vectorLoader = new VectorLoader(root); + vectorLoader.load(inRecordBatch); + + VectorUnloader vectorUnloader = new VectorUnloader(root); + ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch(); + arrowWriter.writeRecordBatch(recordBatch); + } + } + } + LOGGER.debug("Output file size: " + outFile.length()); + } + } catch (ParseException e) { + return fatalError("Invalid parameters", e); + } catch (IOException e) { + return fatalError("Error accessing files", e); + } + return 0; + } + + private int fatalError(String message, Throwable e) { + err.println(message); + LOGGER.error(message, e); + return 1; + } + +} diff --git a/java/tools/src/test/java/org/apache/arrow/tools/TestFileRoundtrip.java b/java/tools/src/test/java/org/apache/arrow/tools/TestFileRoundtrip.java new file mode 100644 index 0000000000000..339725e5af1e0 --- /dev/null +++ b/java/tools/src/test/java/org/apache/arrow/tools/TestFileRoundtrip.java @@ -0,0 +1,159 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ +package org.apache.arrow.tools; + +import static org.junit.Assert.assertEquals; + +import java.io.File; +import java.io.FileInputStream; +import java.io.FileNotFoundException; +import java.io.FileOutputStream; +import java.io.IOException; +import java.util.List; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.VectorLoader; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.VectorUnloader; +import org.apache.arrow.vector.complex.MapVector; +import org.apache.arrow.vector.complex.impl.ComplexWriterImpl; +import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter; +import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; +import org.apache.arrow.vector.complex.writer.BigIntWriter; +import org.apache.arrow.vector.complex.writer.IntWriter; +import org.apache.arrow.vector.file.ArrowBlock; +import org.apache.arrow.vector.file.ArrowFooter; +import org.apache.arrow.vector.file.ArrowReader; +import org.apache.arrow.vector.file.ArrowWriter; +import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.types.pojo.Schema; +import org.junit.After; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.TemporaryFolder; + +public class TestFileRoundtrip { + private static final int COUNT = 10; + + @Rule + public TemporaryFolder testFolder = new TemporaryFolder(); + + private BufferAllocator allocator; + + @Before + public void init() { + allocator = new RootAllocator(Integer.MAX_VALUE); + } + + @After + public void tearDown() { + allocator.close(); + } + + private void writeData(int count, MapVector parent) { + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + IntWriter intWriter = rootWriter.integer("int"); + BigIntWriter bigIntWriter = rootWriter.bigInt("bigInt"); + for (int i = 0; i < count; i++) { + intWriter.setPosition(i); + intWriter.writeInt(i); + bigIntWriter.setPosition(i); + bigIntWriter.writeBigInt(i); + } + writer.setValueCount(count); + } + + @Test + public void test() throws Exception { + File testInFile = testFolder.newFile("testIn.arrow"); + File testOutFile = testFolder.newFile("testOut.arrow"); + + writeInput(testInFile); + + String[] args = { "-i", testInFile.getAbsolutePath(), "-o", testOutFile.getAbsolutePath()}; + int result = new FileRoundtrip(System.out, System.err).run(args); + assertEquals(0, result); + + validateOutput(testOutFile); + } + + private void validateOutput(File testOutFile) throws Exception { + // read + try ( + BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + FileInputStream fileInputStream = new FileInputStream(testOutFile); + ArrowReader arrowReader = new ArrowReader(fileInputStream.getChannel(), readerAllocator); + BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); + ) { + ArrowFooter footer = arrowReader.readFooter(); + Schema schema = footer.getSchema(); + + // initialize vectors + try (VectorSchemaRoot root = new VectorSchemaRoot(schema, readerAllocator)) { + VectorLoader vectorLoader = new VectorLoader(root); + + List recordBatches = footer.getRecordBatches(); + for (ArrowBlock rbBlock : recordBatches) { + try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { + vectorLoader.load(recordBatch); + } + 
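+          // every batch read back from the copied file must still contain the rows produced by writeInput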
validateContent(COUNT, root); + } + } + } + } + + private void validateContent(int count, VectorSchemaRoot root) { + Assert.assertEquals(count, root.getRowCount()); + for (int i = 0; i < count; i++) { + Assert.assertEquals(i, root.getVector("int").getAccessor().getObject(i)); + Assert.assertEquals(Long.valueOf(i), root.getVector("bigInt").getAccessor().getObject(i)); + } + } + + public void writeInput(File testInFile) throws FileNotFoundException, IOException { + int count = COUNT; + try ( + BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", vectorAllocator, null)) { + writeData(count, parent); + write(parent.getChild("root"), testInFile); + } + } + + private void write(FieldVector parent, File file) throws FileNotFoundException, IOException { + Schema schema = new Schema(parent.getField().getChildren()); + int valueCount = parent.getAccessor().getValueCount(); + List fields = parent.getChildrenFromFields(); + VectorUnloader vectorUnloader = new VectorUnloader(schema, valueCount, fields); + try ( + FileOutputStream fileOutputStream = new FileOutputStream(file); + ArrowWriter arrowWriter = new ArrowWriter(fileOutputStream.getChannel(), schema); + ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch(); + ) { + arrowWriter.writeRecordBatch(recordBatch); + } + } + +} diff --git a/java/vector/pom.xml b/java/vector/pom.xml index 1d06bdece01f8..64b68bf8a1588 100644 --- a/java/vector/pom.xml +++ b/java/vector/pom.xml @@ -1,13 +1,13 @@ - 4.0.0 @@ -56,8 +56,6 @@ commons-lang3 3.4 - - @@ -72,13 +70,13 @@ false - - + + - ${basedir}/src/main/codegen codegen @@ -129,7 +127,7 @@ - org.eclipse.m2e @@ -160,8 +158,8 @@ - - + + diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index bafa31760205a..48af7a2bafe4d 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -145,7 +145,7 @@ public List getChildrenFromFields() { @Override public void loadFieldBuffers(ArrowFieldNode fieldNode, List ownBuffers) { org.apache.arrow.vector.BaseDataValueVector.load(getFieldInnerVectors(), ownBuffers); - // TODO: do something with the sizes in fieldNode? 
+ bits.valueCount = fieldNode.getLength(); } public List getFieldBuffers() { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java b/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java index b7040da9d8203..4afd82315d9c3 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java @@ -27,7 +27,6 @@ import org.apache.arrow.vector.schema.ArrowRecordBatch; import org.apache.arrow.vector.schema.VectorLayout; import org.apache.arrow.vector.types.pojo.Field; -import org.apache.arrow.vector.types.pojo.Schema; import com.google.common.collect.Iterators; @@ -37,22 +36,16 @@ * Loads buffers into vectors */ public class VectorLoader { - private final List fieldVectors; - private final List fields; + private final VectorSchemaRoot root; /** * Creates a loader that populates the vectors already held by the given root. * @param root the root whose vectors are filled when a record batch is loaded */ - public VectorLoader(Schema schema, FieldVector root) { + public VectorLoader(VectorSchemaRoot root) { super(); - this.fields = schema.getFields(); - root.initializeChildrenFromFields(fields); - this.fieldVectors = root.getChildrenFromFields(); - if (this.fieldVectors.size() != fields.size()) { - throw new IllegalArgumentException("The root vector did not create the right number of children. found " + fieldVectors.size() + " expected " + fields.size()); - } + this.root = root; } /** @@ -63,16 +56,19 @@ public VectorLoader(Schema schema, FieldVector root) { public void load(ArrowRecordBatch recordBatch) { Iterator buffers = recordBatch.getBuffers().iterator(); Iterator nodes = recordBatch.getNodes().iterator(); + List fields = root.getSchema().getFields(); for (int i = 0; i < fields.size(); ++i) { Field field = fields.get(i); - FieldVector fieldVector = fieldVectors.get(i); + FieldVector fieldVector = root.getVector(field.getName()); loadBuffers(fieldVector, field, buffers, nodes); } + root.setRowCount(recordBatch.getLength()); if (nodes.hasNext() || buffers.hasNext()) { throw new IllegalArgumentException("not all nodes and buffers were consumed. nodes: " + Iterators.toString(nodes) + " buffers: " + Iterators.toString(buffers)); } } + private void loadBuffers(FieldVector vector, Field field, Iterator buffers, Iterator nodes) { checkArgument(nodes.hasNext(), "no more field nodes for field " + field + " and vector " + vector); @@ -85,7 +81,7 @@ private void loadBuffers(FieldVector vector, Field field, Iterator buf try { vector.loadFieldBuffers(fieldNode, ownBuffers); } catch (RuntimeException e) { - throw new IllegalArgumentException("Could not load buffers for field " + field); + throw new IllegalArgumentException("Could not load buffers for field " + field, e); } List children = field.getChildren(); if (children.size() > 0) { @@ -98,4 +94,5 @@ private void loadBuffers(FieldVector vector, Field field, Iterator buf } } } + } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VectorSchemaRoot.java b/java/vector/src/main/java/org/apache/arrow/vector/VectorSchemaRoot.java new file mode 100644 index 0000000000000..1cbe18787ef45 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/VectorSchemaRoot.java @@ -0,0 +1,140 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership.
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector; + +import java.util.ArrayList; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.types.Types; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.Schema; + +public class VectorSchemaRoot implements AutoCloseable { + + private final Schema schema; + private int rowCount; + private final List fieldVectors; + private final Map fieldVectorsMap = new HashMap<>(); + + public VectorSchemaRoot(FieldVector parent) { + this.schema = new Schema(parent.getField().getChildren()); + this.rowCount = parent.getAccessor().getValueCount(); + this.fieldVectors = parent.getChildrenFromFields(); + for (int i = 0; i < schema.getFields().size(); ++i) { + Field field = schema.getFields().get(i); + FieldVector vector = fieldVectors.get(i); + fieldVectorsMap.put(field.getName(), vector); + } + } + + public VectorSchemaRoot(Schema schema, BufferAllocator allocator) { + super(); + this.schema = schema; + List fieldVectors = new ArrayList<>(); + for (Field field : schema.getFields()) { + MinorType minorType = Types.getMinorTypeForArrowType(field.getType()); + FieldVector vector = minorType.getNewVector(field.getName(), allocator, null); + vector.initializeChildrenFromFields(field.getChildren()); + fieldVectors.add(vector); + fieldVectorsMap.put(field.getName(), vector); + } + this.fieldVectors = Collections.unmodifiableList(fieldVectors); + if (this.fieldVectors.size() != schema.getFields().size()) { + throw new IllegalArgumentException("The root vector did not create the right number of children. 
found " + fieldVectors.size() + " expected " + schema.getFields().size()); + } + } + + public List getFieldVectors() { + return fieldVectors; + } + + public FieldVector getVector(String name) { + return fieldVectorsMap.get(name); + } + + public Schema getSchema() { + return schema; + } + + public int getRowCount() { + return rowCount; + } + + public void setRowCount(int rowCount) { + this.rowCount = rowCount; + } + + @Override + public void close() { + RuntimeException ex = null; + for (FieldVector fieldVector : fieldVectors) { + try { + fieldVector.close(); + } catch (RuntimeException e) { + ex = chain(ex, e); + } + } + if (ex!= null) { + throw ex; + } + } + + private RuntimeException chain(RuntimeException root, RuntimeException e) { + if (root == null) { + root = e; + } else { + root.addSuppressed(e); + } + return root; + } + + private void printRow(StringBuilder sb, List row) { + boolean first = true; + for (Object v : row) { + if (first) { + first = false; + } else { + sb.append("\t"); + } + sb.append(v); + } + sb.append("\n"); + } + + public String contentToTSVString() { + StringBuilder sb = new StringBuilder(); + List row = new ArrayList<>(schema.getFields().size()); + for (Field field : schema.getFields()) { + row.add(field.getName()); + } + printRow(sb, row); + for (int i = 0; i < rowCount; i++) { + row.clear(); + for (FieldVector v : fieldVectors) { + row.add(v.getAccessor().getObject(i)); + } + printRow(sb, row); + } + return sb.toString(); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java b/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java index 3375a7d5c311b..e2462180ffadc 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java @@ -34,11 +34,15 @@ public class VectorUnloader { private final int valueCount; private final List vectors; - public VectorUnloader(FieldVector parent) { + public VectorUnloader(Schema schema, int valueCount, List vectors) { super(); - this.schema = new Schema(parent.getField().getChildren()); - this.valueCount = parent.getAccessor().getValueCount(); - this.vectors = parent.getChildrenFromFields(); + this.schema = schema; + this.valueCount = valueCount; + this.vectors = vectors; + } + + public VectorUnloader(VectorSchemaRoot root) { + this(root.getSchema(), root.getRowCount(), root.getFieldVectors()); } public Schema getSchema() { @@ -77,4 +81,5 @@ private void appendNodes(FieldVector vector, List nodes, List fields = root.getChildrenFromFields(); + return new VectorUnloader(schema, valueCount, fields); + } + @AfterClass public static void afterClass() { allocator.close(); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java index 0f28d53295c37..e97bc14d169b7 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java @@ -17,6 +17,8 @@ */ package org.apache.arrow.vector.file; +import static org.apache.arrow.vector.TestVectorUnloadLoad.newVectorUnloader; + import java.io.File; import java.io.FileInputStream; import java.io.FileNotFoundException; @@ -29,12 +31,12 @@ import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.ValueVector.Accessor; import org.apache.arrow.vector.VectorLoader; +import org.apache.arrow.vector.VectorSchemaRoot; import 
org.apache.arrow.vector.VectorUnloader; import org.apache.arrow.vector.complex.MapVector; import org.apache.arrow.vector.complex.NullableMapVector; import org.apache.arrow.vector.complex.impl.ComplexWriterImpl; -import org.apache.arrow.vector.complex.impl.SingleMapReaderImpl; -import org.apache.arrow.vector.complex.reader.BaseReader.MapReader; +import org.apache.arrow.vector.complex.reader.FieldReader; import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter; import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter; import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; @@ -43,7 +45,6 @@ import org.apache.arrow.vector.holders.NullableTimeStampHolder; import org.apache.arrow.vector.schema.ArrowBuffer; import org.apache.arrow.vector.schema.ArrowRecordBatch; -import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.Schema; import org.joda.time.DateTimeZone; import org.junit.After; @@ -94,8 +95,9 @@ public void testWriteComplex() throws IOException { BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); NullableMapVector parent = new NullableMapVector("parent", vectorAllocator, null)) { writeComplexData(count, parent); - validateComplexContent(count, parent); - write(parent.getChild("root"), file); + FieldVector root = parent.getChild("root"); + validateComplexContent(count, new VectorSchemaRoot(root)); + write(root, file); } } @@ -174,33 +176,31 @@ public void testWriteRead() throws IOException { // initialize vectors - NullableMapVector root = parent.addOrGet("root", MinorType.MAP, NullableMapVector.class); - - VectorLoader vectorLoader = new VectorLoader(schema, root); - - List recordBatches = footer.getRecordBatches(); - for (ArrowBlock rbBlock : recordBatches) { - Assert.assertEquals(0, rbBlock.getOffset() % 8); - Assert.assertEquals(0, rbBlock.getMetadataLength() % 8); - try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { - List buffersLayout = recordBatch.getBuffersLayout(); - for (ArrowBuffer arrowBuffer : buffersLayout) { - Assert.assertEquals(0, arrowBuffer.getOffset() % 8); + try (VectorSchemaRoot root = new VectorSchemaRoot(schema, vectorAllocator)) { + VectorLoader vectorLoader = new VectorLoader(root); + + List recordBatches = footer.getRecordBatches(); + for (ArrowBlock rbBlock : recordBatches) { + Assert.assertEquals(0, rbBlock.getOffset() % 8); + Assert.assertEquals(0, rbBlock.getMetadataLength() % 8); + try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { + List buffersLayout = recordBatch.getBuffersLayout(); + for (ArrowBuffer arrowBuffer : buffersLayout) { + Assert.assertEquals(0, arrowBuffer.getOffset() % 8); + } + vectorLoader.load(recordBatch); } - vectorLoader.load(recordBatch); - } - validateContent(count, parent); + validateContent(count, root); + } } } } - private void validateContent(int count, MapVector parent) { - MapReader rootReader = new SingleMapReaderImpl(parent).reader("root"); + private void validateContent(int count, VectorSchemaRoot root) { for (int i = 0; i < count; i++) { - rootReader.setPosition(i); - Assert.assertEquals(i, rootReader.reader("int").readInteger().intValue()); - Assert.assertEquals(i, rootReader.reader("bigInt").readLong().longValue()); + Assert.assertEquals(i, root.getVector("int").getAccessor().getObject(i)); + Assert.assertEquals(Long.valueOf(i), root.getVector("bigInt").getAccessor().getObject(i)); } } @@ -231,15 +231,15 @@ public void testWriteReadComplex() 
throws IOException { // initialize vectors - NullableMapVector root = parent.addOrGet("root", MinorType.MAP, NullableMapVector.class); - VectorLoader vectorLoader = new VectorLoader(schema, root); - - List recordBatches = footer.getRecordBatches(); - for (ArrowBlock rbBlock : recordBatches) { - try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { - vectorLoader.load(recordBatch); + try (VectorSchemaRoot root = new VectorSchemaRoot(schema, vectorAllocator)) { + VectorLoader vectorLoader = new VectorLoader(root); + List recordBatches = footer.getRecordBatches(); + for (ArrowBlock rbBlock : recordBatches) { + try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { + vectorLoader.load(recordBatch); + } + validateComplexContent(count, root); } - validateComplexContent(count, parent); } } } @@ -255,23 +255,23 @@ public void printVectors(List vectors) { } } - private void validateComplexContent(int count, NullableMapVector parent) { - printVectors(parent.getChildrenFromFields()); - - MapReader rootReader = new SingleMapReaderImpl(parent).reader("root"); + private void validateComplexContent(int count, VectorSchemaRoot root) { + Assert.assertEquals(count, root.getRowCount()); + printVectors(root.getFieldVectors()); for (int i = 0; i < count; i++) { - rootReader.setPosition(i); - Assert.assertEquals(i, rootReader.reader("int").readInteger().intValue()); - Assert.assertEquals(i, rootReader.reader("bigInt").readLong().longValue()); - Assert.assertEquals(i % 3, rootReader.reader("list").size()); + Assert.assertEquals(i, root.getVector("int").getAccessor().getObject(i)); + Assert.assertEquals(Long.valueOf(i), root.getVector("bigInt").getAccessor().getObject(i)); + Assert.assertEquals(i % 3, ((List)root.getVector("list").getAccessor().getObject(i)).size()); NullableTimeStampHolder h = new NullableTimeStampHolder(); - rootReader.reader("map").reader("timestamp").read(h); + FieldReader mapReader = root.getVector("map").getReader(); + mapReader.setPosition(i); + mapReader.reader("timestamp").read(h); Assert.assertEquals(i, h.value); } } private void write(FieldVector parent, File file) throws FileNotFoundException, IOException { - VectorUnloader vectorUnloader = new VectorUnloader(parent); + VectorUnloader vectorUnloader = newVectorUnloader(parent); Schema schema = vectorUnloader.getSchema(); LOGGER.debug("writing schema: " + schema); try ( @@ -294,7 +294,7 @@ public void testWriteReadMultipleRBs() throws IOException { MapVector parent = new MapVector("parent", originalVectorAllocator, null); FileOutputStream fileOutputStream = new FileOutputStream(file);) { writeData(count, parent); - VectorUnloader vectorUnloader = new VectorUnloader(parent.getChild("root")); + VectorUnloader vectorUnloader = newVectorUnloader(parent.getChild("root")); Schema schema = vectorUnloader.getSchema(); Assert.assertEquals(2, schema.getFields().size()); try (ArrowWriter arrowWriter = new ArrowWriter(fileOutputStream.getChannel(), schema);) { @@ -320,20 +320,21 @@ public void testWriteReadMultipleRBs() throws IOException { ArrowFooter footer = arrowReader.readFooter(); Schema schema = footer.getSchema(); LOGGER.debug("reading schema: " + schema); - NullableMapVector root = parent.addOrGet("root", MinorType.MAP, NullableMapVector.class); - VectorLoader vectorLoader = new VectorLoader(schema, root); - List recordBatches = footer.getRecordBatches(); - Assert.assertEquals(2, recordBatches.size()); - for (ArrowBlock rbBlock : recordBatches) { - Assert.assertEquals(0, rbBlock.getOffset() % 8); 
- Assert.assertEquals(0, rbBlock.getMetadataLength() % 8); - try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { - List buffersLayout = recordBatch.getBuffersLayout(); - for (ArrowBuffer arrowBuffer : buffersLayout) { - Assert.assertEquals(0, arrowBuffer.getOffset() % 8); + try (VectorSchemaRoot root = new VectorSchemaRoot(schema, vectorAllocator);) { + VectorLoader vectorLoader = new VectorLoader(root); + List recordBatches = footer.getRecordBatches(); + Assert.assertEquals(2, recordBatches.size()); + for (ArrowBlock rbBlock : recordBatches) { + Assert.assertEquals(0, rbBlock.getOffset() % 8); + Assert.assertEquals(0, rbBlock.getMetadataLength() % 8); + try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { + List buffersLayout = recordBatch.getBuffersLayout(); + for (ArrowBuffer arrowBuffer : buffersLayout) { + Assert.assertEquals(0, arrowBuffer.getOffset() % 8); + } + vectorLoader.load(recordBatch); + validateContent(count, root); } - vectorLoader.load(recordBatch); - validateContent(count, parent); } } } @@ -351,7 +352,7 @@ public void testWriteReadUnion() throws IOException { printVectors(parent.getChildrenFromFields()); - validateUnionData(count, parent); + validateUnionData(count, new VectorSchemaRoot(parent.getChild("root"))); write(parent.getChild("root"), file); } @@ -361,44 +362,42 @@ public void testWriteReadUnion() throws IOException { FileInputStream fileInputStream = new FileInputStream(file); ArrowReader arrowReader = new ArrowReader(fileInputStream.getChannel(), readerAllocator); BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); - NullableMapVector parent = new NullableMapVector("parent", vectorAllocator, null) ) { ArrowFooter footer = arrowReader.readFooter(); Schema schema = footer.getSchema(); LOGGER.debug("reading schema: " + schema); // initialize vectors - - NullableMapVector root = parent.addOrGet("root", MinorType.MAP, NullableMapVector.class); - VectorLoader vectorLoader = new VectorLoader(schema, root); - - List recordBatches = footer.getRecordBatches(); - for (ArrowBlock rbBlock : recordBatches) { - try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { - vectorLoader.load(recordBatch); + try (VectorSchemaRoot root = new VectorSchemaRoot(schema, vectorAllocator);) { + VectorLoader vectorLoader = new VectorLoader(root); + List recordBatches = footer.getRecordBatches(); + for (ArrowBlock rbBlock : recordBatches) { + try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { + vectorLoader.load(recordBatch); + } + validateUnionData(count, root); } - validateUnionData(count, parent); } } } - public void validateUnionData(int count, MapVector parent) { - MapReader rootReader = new SingleMapReaderImpl(parent).reader("root"); + public void validateUnionData(int count, VectorSchemaRoot root) { + FieldReader unionReader = root.getVector("union").getReader(); for (int i = 0; i < count; i++) { - rootReader.setPosition(i); + unionReader.setPosition(i); switch (i % 4) { case 0: - Assert.assertEquals(i, rootReader.reader("union").readInteger().intValue()); + Assert.assertEquals(i, unionReader.readInteger().intValue()); break; case 1: - Assert.assertEquals(i, rootReader.reader("union").readLong().longValue()); + Assert.assertEquals(i, unionReader.readLong().longValue()); break; case 2: - Assert.assertEquals(i % 3, rootReader.reader("union").size()); + Assert.assertEquals(i % 3, unionReader.size()); break; case 3: NullableTimeStampHolder h = new 
NullableTimeStampHolder(); - rootReader.reader("union").reader("timestamp").read(h); + unionReader.reader("timestamp").read(h); Assert.assertEquals(i, h.value); break; } From 4fa7ac4f6ca30c34a73fb84d9d56d54aed96491b Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Wed, 9 Nov 2016 08:55:51 -0800 Subject: [PATCH 0196/1644] ARROW-372: json vector serialization format This format serializes the vectors in JSON. It is not a generic JSON-to-Arrow converter, but rather a human-readable representation of the vectors to help with tests. Author: Julien Le Dem Closes #201 from julienledem/json_file and squashes the following commits: 2e63bec [Julien Le Dem] add missing license 5588729 [Julien Le Dem] refactor tests, improve format 5ef5356 [Julien Le Dem] improve format to allow empty column name 746430c [Julien Le Dem] ARROW-372: Create JSON arrow file format for integration tests --- .../vector/file/json/JsonFileReader.java | 223 ++++++++++++++++++ .../vector/file/json/JsonFileWriter.java | 167 +++++++++++++ .../arrow/vector/file/BaseFileTest.java | 220 +++++++++++++++++ .../arrow/vector/file/TestArrowFile.java | 200 +--------------- .../arrow/vector/file/json/TestJSONFile.java | 120 ++++++++++ 5 files changed, 741 insertions(+), 189 deletions(-) create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/file/json/TestJSONFile.java diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java new file mode 100644 index 0000000000000..859a3a0e80a50 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java @@ -0,0 +1,223 @@ +/******************************************************************************* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ ******************************************************************************/ +package org.apache.arrow.vector.file.json; + +import static com.fasterxml.jackson.core.JsonToken.END_ARRAY; +import static com.fasterxml.jackson.core.JsonToken.END_OBJECT; +import static com.fasterxml.jackson.core.JsonToken.START_ARRAY; +import static com.fasterxml.jackson.core.JsonToken.START_OBJECT; +import static java.nio.charset.StandardCharsets.UTF_8; + +import java.io.File; +import java.io.IOException; +import java.util.List; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.BigIntVector; +import org.apache.arrow.vector.BitVector; +import org.apache.arrow.vector.BufferBacked; +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.Float4Vector; +import org.apache.arrow.vector.Float8Vector; +import org.apache.arrow.vector.IntVector; +import org.apache.arrow.vector.SmallIntVector; +import org.apache.arrow.vector.TimeStampVector; +import org.apache.arrow.vector.TinyIntVector; +import org.apache.arrow.vector.UInt1Vector; +import org.apache.arrow.vector.UInt2Vector; +import org.apache.arrow.vector.UInt4Vector; +import org.apache.arrow.vector.UInt8Vector; +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.ValueVector.Mutator; +import org.apache.arrow.vector.VarCharVector; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.schema.ArrowVectorType; +import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.Schema; + +import com.fasterxml.jackson.core.JsonParseException; +import com.fasterxml.jackson.core.JsonParser; +import com.fasterxml.jackson.core.JsonToken; +import com.fasterxml.jackson.databind.MappingJsonFactory; +import com.google.common.base.Objects; + +public class JsonFileReader { + private final File inputFile; + private final JsonParser parser; + private final BufferAllocator allocator; + private Schema schema; + + public JsonFileReader(File inputFile, BufferAllocator allocator) throws JsonParseException, IOException { + super(); + this.inputFile = inputFile; + this.allocator = allocator; + MappingJsonFactory jsonFactory = new MappingJsonFactory(); + this.parser = jsonFactory.createParser(inputFile); + } + + public Schema start() throws JsonParseException, IOException { + readToken(START_OBJECT); + { + this.schema = readNextField("schema", Schema.class); + nextFieldIs("batches"); + readToken(START_ARRAY); + return schema; + } + } + + public VectorSchemaRoot read() throws IOException { + VectorSchemaRoot recordBatch = new VectorSchemaRoot(schema, allocator); + readToken(START_OBJECT); + { + int count = readNextField("count", Integer.class); + recordBatch.setRowCount(count); + nextFieldIs("columns"); + readToken(START_ARRAY); + { + for (Field field : schema.getFields()) { + FieldVector vector = recordBatch.getVector(field.getName()); + readVector(field, vector); + } + } + readToken(END_ARRAY); + } + readToken(END_OBJECT); + return recordBatch; + } + + private void readVector(Field field, FieldVector vector) throws JsonParseException, IOException { + List vectorTypes = field.getTypeLayout().getVectorTypes(); + List fieldInnerVectors = vector.getFieldInnerVectors(); + if (vectorTypes.size() != fieldInnerVectors.size()) { + throw new IllegalArgumentException("vector types and inner vectors are not the same size: " + vectorTypes.size() + " != " + fieldInnerVectors.size()); + } + readToken(START_OBJECT); + { + String name = readNextField("name", String.class); + 
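+        // the "name" recorded for the JSON column must match the schema field currently being read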
if (!Objects.equal(field.getName(), name)) { + throw new IllegalArgumentException("Expected field " + field.getName() + " but got " + name); + } + int count = readNextField("count", Integer.class); + for (int v = 0; v < vectorTypes.size(); v++) { + ArrowVectorType vectorType = vectorTypes.get(v); + BufferBacked innerVector = fieldInnerVectors.get(v); + nextFieldIs(vectorType.getName()); + readToken(START_ARRAY); + ValueVector valueVector = (ValueVector)innerVector; + valueVector.allocateNew(); + Mutator mutator = valueVector.getMutator(); + mutator.setValueCount(count); + for (int i = 0; i < count; i++) { + parser.nextToken(); + setValueFromParser(valueVector, i); + } + readToken(END_ARRAY); + } + // if children + List fields = field.getChildren(); + if (!fields.isEmpty()) { + List vectorChildren = vector.getChildrenFromFields(); + if (fields.size() != vectorChildren.size()) { + throw new IllegalArgumentException("fields and children are not the same size: " + fields.size() + " != " + vectorChildren.size()); + } + nextFieldIs("children"); + readToken(START_ARRAY); + for (int i = 0; i < fields.size(); i++) { + Field childField = fields.get(i); + FieldVector childVector = vectorChildren.get(i); + readVector(childField, childVector); + } + readToken(END_ARRAY); + } + } + readToken(END_OBJECT); + } + + private void setValueFromParser(ValueVector valueVector, int i) throws IOException { + switch (valueVector.getMinorType()) { + case BIT: + ((BitVector)valueVector).getMutator().set(i, parser.readValueAs(Boolean.class) ? 1 : 0); + break; + case TINYINT: + ((TinyIntVector)valueVector).getMutator().set(i, parser.readValueAs(Integer.class)); + break; + case SMALLINT: + ((SmallIntVector)valueVector).getMutator().set(i, parser.readValueAs(Integer.class)); + break; + case INT: + ((IntVector)valueVector).getMutator().set(i, parser.readValueAs(Integer.class)); + break; + case BIGINT: + ((BigIntVector)valueVector).getMutator().set(i, parser.readValueAs(Long.class)); + break; + case UINT1: + ((UInt1Vector)valueVector).getMutator().set(i, parser.readValueAs(Integer.class)); + break; + case UINT2: + ((UInt2Vector)valueVector).getMutator().set(i, parser.readValueAs(Integer.class)); + break; + case UINT4: + ((UInt4Vector)valueVector).getMutator().set(i, parser.readValueAs(Integer.class)); + break; + case UINT8: + ((UInt8Vector)valueVector).getMutator().set(i, parser.readValueAs(Long.class)); + break; + case FLOAT4: + ((Float4Vector)valueVector).getMutator().set(i, parser.readValueAs(Float.class)); + break; + case FLOAT8: + ((Float8Vector)valueVector).getMutator().set(i, parser.readValueAs(Double.class)); + break; + case VARCHAR: + ((VarCharVector)valueVector).getMutator().setSafe(i, parser.readValueAs(String.class).getBytes(UTF_8)); + break; + case TIMESTAMP: + ((TimeStampVector)valueVector).getMutator().set(i, parser.readValueAs(Long.class)); + break; + default: + throw new UnsupportedOperationException("minor type: " + valueVector.getMinorType()); + } + } + + public void close() throws IOException { + readToken(END_ARRAY); + readToken(END_OBJECT); + parser.close(); + } + + private T readNextField(String expectedFieldName, Class c) throws IOException, JsonParseException { + nextFieldIs(expectedFieldName); + parser.nextToken(); + return parser.readValueAs(c); + } + + private void nextFieldIs(String expectedFieldName) throws IOException, JsonParseException { + String name = parser.nextFieldName(); + if (name == null || !name.equals(expectedFieldName)) { + throw new IllegalStateException("Expected " + 
expectedFieldName + " but got " + name); + } + } + + private void readToken(JsonToken expected) throws JsonParseException, IOException { + JsonToken t = parser.nextToken(); + if (t != expected) { + throw new IllegalStateException("Expected " + expected + " but got " + t); + } + } + +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java new file mode 100644 index 0000000000000..47c1a7dabef11 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java @@ -0,0 +1,167 @@ +/******************************************************************************* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + ******************************************************************************/ +package org.apache.arrow.vector.file.json; + +import java.io.File; +import java.io.IOException; +import java.util.List; + +import org.apache.arrow.vector.BitVector; +import org.apache.arrow.vector.BufferBacked; +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.TimeStampVector; +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.ValueVector.Accessor; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.schema.ArrowVectorType; +import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.Schema; + +import com.fasterxml.jackson.core.JsonEncoding; +import com.fasterxml.jackson.core.JsonGenerator; +import com.fasterxml.jackson.core.util.DefaultPrettyPrinter; +import com.fasterxml.jackson.core.util.DefaultPrettyPrinter.NopIndenter; +import com.fasterxml.jackson.databind.MappingJsonFactory; + +public class JsonFileWriter { + + public static final class JSONWriteConfig { + private final boolean pretty; + private JSONWriteConfig(boolean pretty) { + this.pretty = pretty; + } + private JSONWriteConfig() { + this.pretty = false; + } + public JSONWriteConfig pretty(boolean pretty) { + return new JSONWriteConfig(pretty); + } + } + + public static JSONWriteConfig config() { + return new JSONWriteConfig(); + } + + private final JsonGenerator generator; + private Schema schema; + + public JsonFileWriter(File outputFile) throws IOException { + this(outputFile, config()); + } + + public JsonFileWriter(File outputFile, JSONWriteConfig config) throws IOException { + MappingJsonFactory jsonFactory = new MappingJsonFactory(); + this.generator = jsonFactory.createGenerator(outputFile, JsonEncoding.UTF8); + if (config.pretty) { + DefaultPrettyPrinter prettyPrinter = new DefaultPrettyPrinter(); + prettyPrinter.indentArraysWith(NopIndenter.instance); + this.generator.setPrettyPrinter(prettyPrinter); + } + } + + public void 
start(Schema schema) throws IOException { + this.schema = schema; + generator.writeStartObject(); + generator.writeObjectField("schema", schema); + generator.writeArrayFieldStart("batches"); + } + + public void write(VectorSchemaRoot recordBatch) throws IOException { + if (!recordBatch.getSchema().equals(schema)) { + throw new IllegalArgumentException("record batches must have the same schema: " + schema); + } + generator.writeStartObject(); + { + generator.writeObjectField("count", recordBatch.getRowCount()); + generator.writeArrayFieldStart("columns"); + for (Field field : schema.getFields()) { + FieldVector vector = recordBatch.getVector(field.getName()); + writeVector(field, vector); + } + generator.writeEndArray(); + } + generator.writeEndObject(); + } + + private void writeVector(Field field, FieldVector vector) throws IOException { + List vectorTypes = field.getTypeLayout().getVectorTypes(); + List fieldInnerVectors = vector.getFieldInnerVectors(); + if (vectorTypes.size() != fieldInnerVectors.size()) { + throw new IllegalArgumentException("vector types and inner vectors are not the same size: " + vectorTypes.size() + " != " + fieldInnerVectors.size()); + } + generator.writeStartObject(); + { + generator.writeObjectField("name", field.getName()); + int valueCount = vector.getAccessor().getValueCount(); + generator.writeObjectField("count", valueCount); + for (int v = 0; v < vectorTypes.size(); v++) { + ArrowVectorType vectorType = vectorTypes.get(v); + BufferBacked innerVector = fieldInnerVectors.get(v); + generator.writeArrayFieldStart(vectorType.getName()); + ValueVector valueVector = (ValueVector)innerVector; + for (int i = 0; i < valueCount; i++) { + writeValueToGenerator(valueVector, i); + } + generator.writeEndArray(); + } + List fields = field.getChildren(); + List children = vector.getChildrenFromFields(); + if (fields.size() != children.size()) { + throw new IllegalArgumentException("fields and children are not the same size: " + fields.size() + " != " + children.size()); + } + if (fields.size() > 0) { + generator.writeArrayFieldStart("children"); + for (int i = 0; i < fields.size(); i++) { + Field childField = fields.get(i); + FieldVector childVector = children.get(i); + writeVector(childField, childVector); + } + generator.writeEndArray(); + } + } + generator.writeEndObject(); + } + + private void writeValueToGenerator(ValueVector valueVector, int i) throws IOException { + switch (valueVector.getMinorType()) { + case TIMESTAMP: + generator.writeNumber(((TimeStampVector)valueVector).getAccessor().get(i)); + break; + case BIT: + generator.writeNumber(((BitVector)valueVector).getAccessor().get(i)); + break; + default: + // TODO: each type + Accessor accessor = valueVector.getAccessor(); + Object value = accessor.getObject(i); + if (value instanceof Number || value instanceof Boolean) { + generator.writeObject(value); + } else { + generator.writeObject(value.toString()); + } + break; + } + } + + public void close() throws IOException { + generator.writeEndArray(); + generator.writeEndObject(); + generator.close(); + } + +} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java b/java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java new file mode 100644 index 0000000000000..6e577b500a6bd --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java @@ -0,0 +1,220 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. 
See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.file; + +import java.util.List; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.ValueVector.Accessor; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.complex.MapVector; +import org.apache.arrow.vector.complex.NullableMapVector; +import org.apache.arrow.vector.complex.impl.ComplexWriterImpl; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter; +import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter; +import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; +import org.apache.arrow.vector.complex.writer.BigIntWriter; +import org.apache.arrow.vector.complex.writer.IntWriter; +import org.apache.arrow.vector.holders.NullableTimeStampHolder; +import org.joda.time.DateTimeZone; +import org.junit.After; +import org.junit.Assert; +import org.junit.Before; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import io.netty.buffer.ArrowBuf; + +/** + * Helps testing the file formats + */ +public class BaseFileTest { + private static final Logger LOGGER = LoggerFactory.getLogger(BaseFileTest.class); + protected static final int COUNT = 10; + protected BufferAllocator allocator; + + private DateTimeZone defaultTimezone = DateTimeZone.getDefault(); + + @Before + public void init() { + DateTimeZone.setDefault(DateTimeZone.forOffsetHours(2)); + allocator = new RootAllocator(Integer.MAX_VALUE); + } + + @After + public void tearDown() { + allocator.close(); + DateTimeZone.setDefault(defaultTimezone); + } + + protected void validateContent(int count, VectorSchemaRoot root) { + for (int i = 0; i < count; i++) { + Assert.assertEquals(i, root.getVector("int").getAccessor().getObject(i)); + Assert.assertEquals(Long.valueOf(i), root.getVector("bigInt").getAccessor().getObject(i)); + } + } + + protected void writeComplexData(int count, MapVector parent) { + ArrowBuf varchar = allocator.buffer(3); + varchar.readerIndex(0); + varchar.setByte(0, 'a'); + varchar.setByte(1, 'b'); + varchar.setByte(2, 'c'); + varchar.writerIndex(3); + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + IntWriter intWriter = rootWriter.integer("int"); + BigIntWriter bigIntWriter = rootWriter.bigInt("bigInt"); + ListWriter listWriter = rootWriter.list("list"); + MapWriter mapWriter = rootWriter.map("map"); + for (int i = 0; i < count; i++) { + if (i % 5 != 3) { + intWriter.setPosition(i); + intWriter.writeInt(i); + } + bigIntWriter.setPosition(i); + bigIntWriter.writeBigInt(i); + listWriter.setPosition(i); + listWriter.startList(); + for (int j = 0; j < i % 3; j++) { + 
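+          // each of the i % 3 list entries is the 3-byte string "abc" staged in the varchar buffer above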
listWriter.varChar().writeVarChar(0, 3, varchar); + } + listWriter.endList(); + mapWriter.setPosition(i); + mapWriter.start(); + mapWriter.timeStamp("timestamp").writeTimeStamp(i); + mapWriter.end(); + } + writer.setValueCount(count); + varchar.release(); + } + + public void printVectors(List vectors) { + for (FieldVector vector : vectors) { + LOGGER.debug(vector.getField().getName()); + Accessor accessor = vector.getAccessor(); + int valueCount = accessor.getValueCount(); + for (int i = 0; i < valueCount; i++) { + LOGGER.debug(String.valueOf(accessor.getObject(i))); + } + } + } + + protected void validateComplexContent(int count, VectorSchemaRoot root) { + Assert.assertEquals(count, root.getRowCount()); + printVectors(root.getFieldVectors()); + for (int i = 0; i < count; i++) { + Object intVal = root.getVector("int").getAccessor().getObject(i); + if (i % 5 != 3) { + Assert.assertEquals(i, intVal); + } else { + Assert.assertNull(intVal); + } + Assert.assertEquals(Long.valueOf(i), root.getVector("bigInt").getAccessor().getObject(i)); + Assert.assertEquals(i % 3, ((List)root.getVector("list").getAccessor().getObject(i)).size()); + NullableTimeStampHolder h = new NullableTimeStampHolder(); + FieldReader mapReader = root.getVector("map").getReader(); + mapReader.setPosition(i); + mapReader.reader("timestamp").read(h); + Assert.assertEquals(i, h.value); + } + } + + protected void writeData(int count, MapVector parent) { + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + IntWriter intWriter = rootWriter.integer("int"); + BigIntWriter bigIntWriter = rootWriter.bigInt("bigInt"); + for (int i = 0; i < count; i++) { + intWriter.setPosition(i); + intWriter.writeInt(i); + bigIntWriter.setPosition(i); + bigIntWriter.writeBigInt(i); + } + writer.setValueCount(count); + } + + public void validateUnionData(int count, VectorSchemaRoot root) { + FieldReader unionReader = root.getVector("union").getReader(); + for (int i = 0; i < count; i++) { + unionReader.setPosition(i); + switch (i % 4) { + case 0: + Assert.assertEquals(i, unionReader.readInteger().intValue()); + break; + case 1: + Assert.assertEquals(i, unionReader.readLong().longValue()); + break; + case 2: + Assert.assertEquals(i % 3, unionReader.size()); + break; + case 3: + NullableTimeStampHolder h = new NullableTimeStampHolder(); + unionReader.reader("timestamp").read(h); + Assert.assertEquals(i, h.value); + break; + } + } + } + + public void writeUnionData(int count, NullableMapVector parent) { + ArrowBuf varchar = allocator.buffer(3); + varchar.readerIndex(0); + varchar.setByte(0, 'a'); + varchar.setByte(1, 'b'); + varchar.setByte(2, 'c'); + varchar.writerIndex(3); + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + IntWriter intWriter = rootWriter.integer("union"); + BigIntWriter bigIntWriter = rootWriter.bigInt("union"); + ListWriter listWriter = rootWriter.list("union"); + MapWriter mapWriter = rootWriter.map("union"); + for (int i = 0; i < count; i++) { + switch (i % 4) { + case 0: + intWriter.setPosition(i); + intWriter.writeInt(i); + break; + case 1: + bigIntWriter.setPosition(i); + bigIntWriter.writeBigInt(i); + break; + case 2: + listWriter.setPosition(i); + listWriter.startList(); + for (int j = 0; j < i % 3; j++) { + listWriter.varChar().writeVarChar(0, 3, varchar); + } + listWriter.endList(); + break; + case 3: + mapWriter.setPosition(i); + mapWriter.start(); + mapWriter.timeStamp("timestamp").writeTimeStamp(i); + 
mapWriter.end(); + break; + } + } + writer.setValueCount(count); + varchar.release(); + } +} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java index e97bc14d169b7..c9e60ee047bfe 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java @@ -27,53 +27,22 @@ import java.util.List; import org.apache.arrow.memory.BufferAllocator; -import org.apache.arrow.memory.RootAllocator; import org.apache.arrow.vector.FieldVector; -import org.apache.arrow.vector.ValueVector.Accessor; import org.apache.arrow.vector.VectorLoader; import org.apache.arrow.vector.VectorSchemaRoot; import org.apache.arrow.vector.VectorUnloader; import org.apache.arrow.vector.complex.MapVector; import org.apache.arrow.vector.complex.NullableMapVector; -import org.apache.arrow.vector.complex.impl.ComplexWriterImpl; -import org.apache.arrow.vector.complex.reader.FieldReader; -import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter; -import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter; -import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; -import org.apache.arrow.vector.complex.writer.BigIntWriter; -import org.apache.arrow.vector.complex.writer.IntWriter; -import org.apache.arrow.vector.holders.NullableTimeStampHolder; import org.apache.arrow.vector.schema.ArrowBuffer; import org.apache.arrow.vector.schema.ArrowRecordBatch; import org.apache.arrow.vector.types.pojo.Schema; -import org.joda.time.DateTimeZone; -import org.junit.After; import org.junit.Assert; -import org.junit.Before; import org.junit.Test; import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import io.netty.buffer.ArrowBuf; - -public class TestArrowFile { +public class TestArrowFile extends BaseFileTest { private static final Logger LOGGER = LoggerFactory.getLogger(TestArrowFile.class); - private static final int COUNT = 10; - private BufferAllocator allocator; - - private DateTimeZone defaultTimezone = DateTimeZone.getDefault(); - - @Before - public void init() { - DateTimeZone.setDefault(DateTimeZone.forOffsetHours(2)); - allocator = new RootAllocator(Integer.MAX_VALUE); - } - - @After - public void tearDown() { - allocator.close(); - DateTimeZone.setDefault(defaultTimezone); - } @Test public void testWrite() throws IOException { @@ -101,54 +70,6 @@ public void testWriteComplex() throws IOException { } } - private void writeComplexData(int count, MapVector parent) { - ArrowBuf varchar = allocator.buffer(3); - varchar.readerIndex(0); - varchar.setByte(0, 'a'); - varchar.setByte(1, 'b'); - varchar.setByte(2, 'c'); - varchar.writerIndex(3); - ComplexWriter writer = new ComplexWriterImpl("root", parent); - MapWriter rootWriter = writer.rootAsMap(); - IntWriter intWriter = rootWriter.integer("int"); - BigIntWriter bigIntWriter = rootWriter.bigInt("bigInt"); - ListWriter listWriter = rootWriter.list("list"); - MapWriter mapWriter = rootWriter.map("map"); - for (int i = 0; i < count; i++) { - intWriter.setPosition(i); - intWriter.writeInt(i); - bigIntWriter.setPosition(i); - bigIntWriter.writeBigInt(i); - listWriter.setPosition(i); - listWriter.startList(); - for (int j = 0; j < i % 3; j++) { - listWriter.varChar().writeVarChar(0, 3, varchar); - } - listWriter.endList(); - mapWriter.setPosition(i); - mapWriter.start(); - mapWriter.timeStamp("timestamp").writeTimeStamp(i); - mapWriter.end(); - } - 
writer.setValueCount(count); - varchar.release(); - } - - - private void writeData(int count, MapVector parent) { - ComplexWriter writer = new ComplexWriterImpl("root", parent); - MapWriter rootWriter = writer.rootAsMap(); - IntWriter intWriter = rootWriter.integer("int"); - BigIntWriter bigIntWriter = rootWriter.bigInt("bigInt"); - for (int i = 0; i < count; i++) { - intWriter.setPosition(i); - intWriter.writeInt(i); - bigIntWriter.setPosition(i); - bigIntWriter.writeBigInt(i); - } - writer.setValueCount(count); - } - @Test public void testWriteRead() throws IOException { File file = new File("target/mytest.arrow"); @@ -197,13 +118,6 @@ public void testWriteRead() throws IOException { } } - private void validateContent(int count, VectorSchemaRoot root) { - for (int i = 0; i < count; i++) { - Assert.assertEquals(i, root.getVector("int").getAccessor().getObject(i)); - Assert.assertEquals(Long.valueOf(i), root.getVector("bigInt").getAccessor().getObject(i)); - } - } - @Test public void testWriteReadComplex() throws IOException { File file = new File("target/mytest_complex.arrow"); @@ -244,45 +158,6 @@ public void testWriteReadComplex() throws IOException { } } - public void printVectors(List vectors) { - for (FieldVector vector : vectors) { - LOGGER.debug(vector.getField().getName()); - Accessor accessor = vector.getAccessor(); - int valueCount = accessor.getValueCount(); - for (int i = 0; i < valueCount; i++) { - LOGGER.debug(String.valueOf(accessor.getObject(i))); - } - } - } - - private void validateComplexContent(int count, VectorSchemaRoot root) { - Assert.assertEquals(count, root.getRowCount()); - printVectors(root.getFieldVectors()); - for (int i = 0; i < count; i++) { - Assert.assertEquals(i, root.getVector("int").getAccessor().getObject(i)); - Assert.assertEquals(Long.valueOf(i), root.getVector("bigInt").getAccessor().getObject(i)); - Assert.assertEquals(i % 3, ((List)root.getVector("list").getAccessor().getObject(i)).size()); - NullableTimeStampHolder h = new NullableTimeStampHolder(); - FieldReader mapReader = root.getVector("map").getReader(); - mapReader.setPosition(i); - mapReader.reader("timestamp").read(h); - Assert.assertEquals(i, h.value); - } - } - - private void write(FieldVector parent, File file) throws FileNotFoundException, IOException { - VectorUnloader vectorUnloader = newVectorUnloader(parent); - Schema schema = vectorUnloader.getSchema(); - LOGGER.debug("writing schema: " + schema); - try ( - FileOutputStream fileOutputStream = new FileOutputStream(file); - ArrowWriter arrowWriter = new ArrowWriter(fileOutputStream.getChannel(), schema); - ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch(); - ) { - arrowWriter.writeRecordBatch(recordBatch); - } - } - @Test public void testWriteReadMultipleRBs() throws IOException { File file = new File("target/mytest_multiple.arrow"); @@ -381,69 +256,16 @@ public void testWriteReadUnion() throws IOException { } } - public void validateUnionData(int count, VectorSchemaRoot root) { - FieldReader unionReader = root.getVector("union").getReader(); - for (int i = 0; i < count; i++) { - unionReader.setPosition(i); - switch (i % 4) { - case 0: - Assert.assertEquals(i, unionReader.readInteger().intValue()); - break; - case 1: - Assert.assertEquals(i, unionReader.readLong().longValue()); - break; - case 2: - Assert.assertEquals(i % 3, unionReader.size()); - break; - case 3: - NullableTimeStampHolder h = new NullableTimeStampHolder(); - unionReader.reader("timestamp").read(h); - Assert.assertEquals(i, h.value); - break; - } - } 
- } - - public void writeUnionData(int count, NullableMapVector parent) { - ArrowBuf varchar = allocator.buffer(3); - varchar.readerIndex(0); - varchar.setByte(0, 'a'); - varchar.setByte(1, 'b'); - varchar.setByte(2, 'c'); - varchar.writerIndex(3); - ComplexWriter writer = new ComplexWriterImpl("root", parent); - MapWriter rootWriter = writer.rootAsMap(); - IntWriter intWriter = rootWriter.integer("union"); - BigIntWriter bigIntWriter = rootWriter.bigInt("union"); - ListWriter listWriter = rootWriter.list("union"); - MapWriter mapWriter = rootWriter.map("union"); - for (int i = 0; i < count; i++) { - switch (i % 4) { - case 0: - intWriter.setPosition(i); - intWriter.writeInt(i); - break; - case 1: - bigIntWriter.setPosition(i); - bigIntWriter.writeBigInt(i); - break; - case 2: - listWriter.setPosition(i); - listWriter.startList(); - for (int j = 0; j < i % 3; j++) { - listWriter.varChar().writeVarChar(0, 3, varchar); - } - listWriter.endList(); - break; - case 3: - mapWriter.setPosition(i); - mapWriter.start(); - mapWriter.timeStamp("timestamp").writeTimeStamp(i); - mapWriter.end(); - break; - } + private void write(FieldVector parent, File file) throws FileNotFoundException, IOException { + VectorUnloader vectorUnloader = newVectorUnloader(parent); + Schema schema = vectorUnloader.getSchema(); + LOGGER.debug("writing schema: " + schema); + try ( + FileOutputStream fileOutputStream = new FileOutputStream(file); + ArrowWriter arrowWriter = new ArrowWriter(fileOutputStream.getChannel(), schema); + ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch(); + ) { + arrowWriter.writeRecordBatch(recordBatch); } - writer.setValueCount(count); - varchar.release(); } } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/json/TestJSONFile.java b/java/vector/src/test/java/org/apache/arrow/vector/file/json/TestJSONFile.java new file mode 100644 index 0000000000000..7d25003f8b335 --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/json/TestJSONFile.java @@ -0,0 +1,120 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector.file.json; + +import java.io.File; +import java.io.IOException; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.complex.MapVector; +import org.apache.arrow.vector.complex.NullableMapVector; +import org.apache.arrow.vector.file.BaseFileTest; +import org.apache.arrow.vector.types.pojo.Schema; +import org.junit.Test; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public class TestJSONFile extends BaseFileTest { + private static final Logger LOGGER = LoggerFactory.getLogger(TestJSONFile.class); + + @Test + public void testWriteReadComplexJSON() throws IOException { + File file = new File("target/mytest_complex.json"); + int count = COUNT; + + // write + try ( + BufferAllocator originalVectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", originalVectorAllocator, null)) { + writeComplexData(count, parent); + writeJSON(file, new VectorSchemaRoot(parent.getChild("root"))); + } + + // read + try ( + BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + ) { + JsonFileReader reader = new JsonFileReader(file, readerAllocator); + Schema schema = reader.start(); + LOGGER.debug("reading schema: " + schema); + + // initialize vectors + try (VectorSchemaRoot root = reader.read();) { + validateComplexContent(count, root); + } + reader.close(); + } + } + + @Test + public void testWriteComplexJSON() throws IOException { + File file = new File("target/mytest_write_complex.json"); + int count = COUNT; + try ( + BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + NullableMapVector parent = new NullableMapVector("parent", vectorAllocator, null)) { + writeComplexData(count, parent); + VectorSchemaRoot root = new VectorSchemaRoot(parent.getChild("root")); + validateComplexContent(root.getRowCount(), root); + writeJSON(file, root); + } + } + + public void writeJSON(File file, VectorSchemaRoot root) throws IOException { + JsonFileWriter writer = new JsonFileWriter(file, JsonFileWriter.config().pretty(true)); + writer.start(root.getSchema()); + writer.write(root); + writer.close(); + } + + + @Test + public void testWriteReadUnionJSON() throws IOException { + File file = new File("target/mytest_write_union.json"); + int count = COUNT; + try ( + BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + NullableMapVector parent = new NullableMapVector("parent", vectorAllocator, null)) { + + writeUnionData(count, parent); + + printVectors(parent.getChildrenFromFields()); + + VectorSchemaRoot root = new VectorSchemaRoot(parent.getChild("root")); + validateUnionData(count, root); + + writeJSON(file, root); + } + // read + try ( + BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); + ) { + JsonFileReader reader = new JsonFileReader(file, readerAllocator); + Schema schema = reader.start(); + LOGGER.debug("reading schema: " + schema); + + // initialize vectors + try (VectorSchemaRoot root = reader.read();) { + validateUnionData(count, root); + } + } + } + +} From 7f048a4b8bdc6a20cd8f6eeca928ecbb6db7dd96 Mon Sep 17 00:00:00 2001 From: "Uwe L. 
Korn" Date: Fri, 11 Nov 2016 14:18:09 -0500 Subject: [PATCH 0197/1644] ARROW-356: Add documentation about reading Parquet Assumes #192. Author: Uwe L. Korn Closes #193 from xhochy/ARROW-356 and squashes the following commits: 530484f [Uwe L. Korn] Mention new setup instructions 06b2f9c [Uwe L. Korn] Add tables describing dtype support 0467e0e [Uwe L. Korn] Move installation instructions into Sphinx docs 744202a [Uwe L. Korn] Document Pandas<->Arrow conversion b5b4df5 [Uwe L. Korn] ARROW-356: Add documentation about reading Parquet --- python/doc/INSTALL.md | 101 -------------------------- python/doc/index.rst | 16 +++-- python/doc/install.rst | 151 +++++++++++++++++++++++++++++++++++++++ python/doc/pandas.rst | 114 +++++++++++++++++++++++++++++ python/doc/parquet.rst | 66 +++++++++++++++++ python/pyarrow/table.pyx | 15 ++++ 6 files changed, 355 insertions(+), 108 deletions(-) delete mode 100644 python/doc/INSTALL.md create mode 100644 python/doc/install.rst create mode 100644 python/doc/pandas.rst create mode 100644 python/doc/parquet.rst diff --git a/python/doc/INSTALL.md b/python/doc/INSTALL.md deleted file mode 100644 index 81eed565d9123..0000000000000 --- a/python/doc/INSTALL.md +++ /dev/null @@ -1,101 +0,0 @@ - - -## Building pyarrow (Apache Arrow Python library) - -First, clone the master git repository: - -```bash -git clone https://github.com/apache/arrow.git arrow -``` - -#### System requirements - -Building pyarrow requires: - -* A C++11 compiler - - * Linux: gcc >= 4.8 or clang >= 3.5 - * OS X: XCode 6.4 or higher preferred - -* [cmake][1] - -#### Python requirements - -You will need Python (CPython) 2.7, 3.4, or 3.5 installed. Earlier releases and -are not being targeted. - -> This library targets CPython only due to an emphasis on interoperability with -> pandas and NumPy, which are only available for CPython. - -The build requires NumPy, Cython, and a few other Python dependencies: - -```bash -pip install cython -cd arrow/python -pip install -r requirements.txt -``` - -#### Installing Arrow C++ library - -First, you should choose an installation location for Arrow C++. In the future -using the default system install location will work, but for now we are being -explicit: - -```bash -export ARROW_HOME=$HOME/local -``` - -Now, we build Arrow: - -```bash -cd arrow/cpp - -mkdir dev-build -cd dev-build - -cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME .. - -make - -# Use sudo here if $ARROW_HOME requires it -make install -``` - -#### Install `pyarrow` - -```bash -cd arrow/python - -python setup.py install -``` - -> On XCode 6 and prior there are some known OS X `@rpath` issues. If you are -> unable to import pyarrow, upgrading XCode may be the solution. - - -```python -In [1]: import pyarrow - -In [2]: pyarrow.from_pylist([1,2,3]) -Out[2]: - -[ - 1, - 2, - 3 -] -``` - -[1]: https://cmake.org/ diff --git a/python/doc/index.rst b/python/doc/index.rst index 88725badc1e24..6725ae707d90b 100644 --- a/python/doc/index.rst +++ b/python/doc/index.rst @@ -31,14 +31,16 @@ additional functionality such as reading Apache Parquet files into Arrow structures. .. toctree:: - :maxdepth: 4 - :hidden: + :maxdepth: 2 + :caption: Getting Started + Installing pyarrow + Pandas Module Reference -Indices and tables -================== +.. 
toctree:: + :maxdepth: 2 + :caption: Additional Features + + Parquet format -* :ref:`genindex` -* :ref:`modindex` -* :ref:`search` diff --git a/python/doc/install.rst b/python/doc/install.rst new file mode 100644 index 0000000000000..1bab017301633 --- /dev/null +++ b/python/doc/install.rst @@ -0,0 +1,151 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +Install PyArrow +=============== + +Conda +----- + +To install the latest version of PyArrow from conda-forge using conda: + +.. code-block:: bash + + conda install -c conda-forge pyarrow + +Pip +--- + +Install the latest version from PyPI: + +.. code-block:: bash + + pip install pyarrow + +.. note:: + Currently there are only binary artifacts available for Linux and MacOS. + Otherwise this will only pull the python sources and assumes an existing + installation of the C++ part of Arrow. + To retrieve the binary artifacts, you'll need a recent ``pip`` version that + supports features like the ``manylinux1`` tag. + +Building from source +-------------------- + +First, clone the master git repository: + +.. code-block:: bash + + git clone https://github.com/apache/arrow.git arrow + +System requirements +~~~~~~~~~~~~~~~~~~~ + +Building pyarrow requires: + +* A C++11 compiler + + * Linux: gcc >= 4.8 or clang >= 3.5 + * OS X: XCode 6.4 or higher preferred + +* `CMake `_ + +Python requirements +~~~~~~~~~~~~~~~~~~~ + +You will need Python (CPython) 2.7, 3.4, or 3.5 installed. Earlier releases +are not being targeted. + +.. note:: + This library targets CPython only due to an emphasis on interoperability with + pandas and NumPy, which are only available for CPython. + +The build requires NumPy, Cython, and a few other Python dependencies: + +.. code-block:: bash + + pip install cython + cd arrow/python + pip install -r requirements.txt + +Installing Arrow C++ library +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +First, you should choose an installation location for Arrow C++. In the future +using the default system install location will work, but for now we are being +explicit: + +.. code-block:: bash + + export ARROW_HOME=$HOME/local + +Now, we build Arrow: + +.. code-block:: bash + + cd arrow/cpp + + mkdir dev-build + cd dev-build + + cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME .. + + make + + # Use sudo here if $ARROW_HOME requires it + make install + +To get the optional Parquet support, you should also build and install +`parquet-cpp `_. + +Install `pyarrow` +~~~~~~~~~~~~~~~~~ + + +.. code-block:: bash + + cd arrow/python + + # --with-parquet enables the Apache Parquet support in PyArrow + # --build-type=release disables debugging information and turns on + # compiler optimizations for native code + python setup.py build_ext --with-parquet --build-type=release install + python setup.py install + +.. 
warning:: + On XCode 6 and prior there are some known OS X `@rpath` issues. If you are + unable to import pyarrow, upgrading XCode may be the solution. + +.. note:: + In development installations, you will also need to set a correct + ``LD_LIBRARY_PATH``. This is usually done with + ``export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH``. + + +.. code-block:: python + + In [1]: import pyarrow + + In [2]: pyarrow.from_pylist([1,2,3]) + Out[2]: + + [ + 1, + 2, + 3 + ] + diff --git a/python/doc/pandas.rst b/python/doc/pandas.rst new file mode 100644 index 0000000000000..7c70074817835 --- /dev/null +++ b/python/doc/pandas.rst @@ -0,0 +1,114 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +Pandas Interface +================ + +To interface with Pandas, PyArrow provides various conversion routines to +consume Pandas structures and convert back to them. + +DataFrames +---------- + +The equivalent of a Pandas DataFrame in Arrow is a :class:`pyarrow.table.Table`. +Both consist of a set of named columns of equal length. While Pandas only +supports flat columns, the Table also provides nested columns; it can therefore +represent more data than a DataFrame, so a full conversion is not always possible. + +Conversion from a Table to a DataFrame is done by calling +:meth:`pyarrow.table.Table.to_pandas`. The inverse is then achieved by using +:meth:`pyarrow.from_pandas_dataframe`. This conversion routine provides the +convenience parameter ``timestamps_to_ms``. Although Arrow supports timestamps of +different resolutions, Pandas only supports nanosecond timestamps and most +other systems (e.g. Parquet) only work on millisecond timestamps. This parameter +can be used to perform the time conversion as part of the Pandas to Arrow +conversion. + +.. code-block:: python + + import pyarrow as pa + import pandas as pd + + df = pd.DataFrame({"a": [1, 2, 3]}) + # Convert from Pandas to Arrow + table = pa.from_pandas_dataframe(df) + # Convert back to Pandas + df_new = table.to_pandas() + + +Series +------ + +In Arrow, the most similar structure to a Pandas Series is an Array. +It is a vector that contains data of a single type, stored as linear memory. You can +convert a Pandas Series to an Arrow Array using :meth:`pyarrow.array.from_pandas_series`. +As Arrow Arrays are always nullable, you can supply an optional mask using +the ``mask`` parameter to mark all null entries. + +Type differences +---------------- + +With the current design of Pandas and Arrow, it is not possible to convert all +column types unmodified. One of the main issues here is that Pandas has no +support for nullable columns of arbitrary type. Also ``datetime64`` is currently +fixed to nanosecond resolution. On the other hand, Arrow might still be missing +support for some types. 
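+ +A short sketch of how the integer rules below play out in practice, using only the ``from_pandas_dataframe`` and ``to_pandas`` routines introduced above (the column name is arbitrary): + +.. code-block:: python + + import numpy as np + import pandas as pd + import pyarrow as pa + + df = pd.DataFrame({"a": [1, 2, 3]}) + df_new = pa.from_pandas_dataframe(df).to_pandas() + # without nulls, integer dtypes survive the round trip + assert df_new["a"].dtype == np.dtype("int64") + + # an Arrow integer column that does contain nulls has no exact Pandas + # counterpart and comes back as float64 with NaN entries (see below)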
+ +Pandas -> Arrow Conversion +~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++------------------------+--------------------------+ +| Source Type (Pandas) | Destination Type (Arrow) | ++========================+==========================+ +| ``bool`` | ``BOOL`` | ++------------------------+--------------------------+ +| ``(u)int{8,16,32,64}`` | ``(U)INT{8,16,32,64}`` | ++------------------------+--------------------------+ +| ``float32`` | ``FLOAT`` | ++------------------------+--------------------------+ +| ``float64`` | ``DOUBLE`` | ++------------------------+--------------------------+ +| ``str`` / ``unicode`` | ``STRING`` | ++------------------------+--------------------------+ +| ``pd.Timestamp`` | ``TIMESTAMP(unit=ns)`` | ++------------------------+--------------------------+ +| ``pd.Categorical`` | *not supported* | ++------------------------+--------------------------+ + +Arrow -> Pandas Conversion +~~~~~~~~~~~~~~~~~~~~~~~~~~ + ++-------------------------------------+--------------------------------------------------------+ +| Source Type (Arrow) | Destination Type (Pandas) | ++=====================================+========================================================+ +| ``BOOL`` | ``bool`` | ++-------------------------------------+--------------------------------------------------------+ +| ``BOOL`` *with nulls* | ``object`` (with values ``True``, ``False``, ``None``) | ++-------------------------------------+--------------------------------------------------------+ +| ``(U)INT{8,16,32,64}`` | ``(u)int{8,16,32,64}`` | ++-------------------------------------+--------------------------------------------------------+ +| ``(U)INT{8,16,32,64}`` *with nulls* | ``float64`` | ++-------------------------------------+--------------------------------------------------------+ +| ``FLOAT`` | ``float32`` | ++-------------------------------------+--------------------------------------------------------+ +| ``DOUBLE`` | ``float64`` | ++-------------------------------------+--------------------------------------------------------+ +| ``STRING`` | ``str`` | ++-------------------------------------+--------------------------------------------------------+ +| ``TIMESTAMP(unit=*)`` | ``pd.Timestamp`` (``np.datetime64[ns]``) | ++-------------------------------------+--------------------------------------------------------+ + diff --git a/python/doc/parquet.rst b/python/doc/parquet.rst new file mode 100644 index 0000000000000..674ed80f27ce3 --- /dev/null +++ b/python/doc/parquet.rst @@ -0,0 +1,66 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +Reading/Writing Parquet files +============================= + +If you have built ``pyarrow`` with Parquet support, i.e. 
``parquet-cpp`` was +found during the build, you can read files in the Parquet format to/from Arrow +memory structures. The Parquet support code is located in the +:mod:`pyarrow.parquet` module and your package needs to be built with the +``--with-parquet`` flag for ``build_ext``. + +Reading Parquet +--------------- + +To read a Parquet file into Arrow memory, you can use the following code +snippet. It will read the whole Parquet file into memory as an +:class:`pyarrow.table.Table`. + +.. code-block:: python + + import pyarrow + import pyarrow.parquet + + A = pyarrow + + table = A.parquet.read_table('') + +Writing Parquet +--------------- + +Given an instance of :class:`pyarrow.table.Table`, the simplest way to +persist it to Parquet is by using the :meth:`pyarrow.parquet.write_table` +method. + +.. code-block:: python + + import pyarrow + import pyarrow.parquet + + A = pyarrow + + table = A.Table(..) + A.parquet.write_table(table, '') + +By default this will write the Table as a single RowGroup using ``DICTIONARY`` +encoding. To increase the potential for a query engine to process a Parquet +file in parallel, set the ``chunk_size`` to a fraction of the total number of rows. + +If you also want to compress the columns, you can select a compression +method using the ``compression`` argument. Typically, ``GZIP`` is the choice if +you want to minimize size and ``SNAPPY`` if you want to maximize performance. diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index c71bc712bffb1..5459f26b80aa4 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -298,6 +298,8 @@ cdef class RecordBatch: cdef class Table: ''' + A collection of top-level named, equal-length Arrow arrays. + Do not call this class's constructor directly. ''' @@ -335,6 +337,19 @@ cdef class Table: @staticmethod def from_arrays(names, arrays, name=None): + """ + Construct a Table from Arrow Arrays + + Parameters + ---------- + + names: list of str + Names for the table columns + arrays: list of pyarrow.array.Array + Equal-length arrays that should form the table. 
+ name: str + (optional) name for the Table + """ cdef: Array arr c_string c_name From 48f9780a8677546cb143a09b25b0b57c1946ba07 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 11 Nov 2016 14:20:36 -0500 Subject: [PATCH 0198/1644] ARROW-375: Fix unicode Python 3 issue in columns argument of parquet.read_table Author: Wes McKinney Closes #204 from wesm/ARROW-375 and squashes the following commits: 9e6f2a6 [Wes McKinney] BUG: convert unicode to utf8 bytes for column filtering --- python/pyarrow/parquet.pyx | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/python/pyarrow/parquet.pyx b/python/pyarrow/parquet.pyx index 2152f89474195..a6e3ac30684b4 100644 --- a/python/pyarrow/parquet.pyx +++ b/python/pyarrow/parquet.pyx @@ -93,15 +93,18 @@ cdef class ParquetReader: Integer index of the position of the column """ cdef: - const FileMetaData* metadata = self.reader.get().parquet_reader().metadata() + const FileMetaData* metadata = (self.reader.get() + .parquet_reader().metadata()) int i = 0 if self.column_idx_map is None: self.column_idx_map = {} for i in range(0, metadata.num_columns()): - self.column_idx_map[str(metadata.schema().Column(i).path().get().ToDotString())] = i + col_bytes = tobytes(metadata.schema().Column(i) + .path().get().ToDotString()) + self.column_idx_map[col_bytes] = i - return self.column_idx_map[column_name] + return self.column_idx_map[tobytes(column_name)] def read_column(self, int column_index): cdef: @@ -109,7 +112,8 @@ cdef class ParquetReader: shared_ptr[CArray] carray with nogil: - check_status(self.reader.get().ReadFlatColumn(column_index, &carray)) + check_status(self.reader.get() + .ReadFlatColumn(column_index, &carray)) array.init(carray) return array From 78288b5fca8ff527257e487d45c7e68f7dbd8cd2 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Wed, 16 Nov 2016 16:18:50 -0500 Subject: [PATCH 0199/1644] ARROW-371: Handle pandas-nullable types correctly Author: Uwe L. Korn Closes #205 from xhochy/ARROW-371 and squashes the following commits: 1f73e8b [Uwe L. 
Korn] ARROW-371: Handle pandas-nullable types correctly --- python/pyarrow/tests/test_convert_pandas.py | 22 +++++++++- python/src/pyarrow/adapters/pandas.cc | 46 ++++++++++----------- 2 files changed, 44 insertions(+), 24 deletions(-) diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index 55302996f4557..b527ca7e80816 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -165,7 +165,7 @@ def test_strings(self): expected = pd.DataFrame({'strings': values * repeats}) self._check_pandas_roundtrip(df, expected) - def test_timestamps_notimezone(self): + def test_timestamps_notimezone_no_nulls(self): df = pd.DataFrame({ 'datetime64': np.array([ '2007-07-13T01:23:34.123', @@ -184,6 +184,26 @@ def test_timestamps_notimezone(self): }) self._check_pandas_roundtrip(df, timestamps_to_ms=False) + def test_timestamps_notimezone_nulls(self): + df = pd.DataFrame({ + 'datetime64': np.array([ + '2007-07-13T01:23:34.123', + None, + '2010-08-13T05:46:57.437'], + dtype='datetime64[ms]') + }) + df.info() + self._check_pandas_roundtrip(df, timestamps_to_ms=True) + + df = pd.DataFrame({ + 'datetime64': np.array([ + '2007-07-13T01:23:34.123456789', + None, + '2010-08-13T05:46:57.437699912'], + dtype='datetime64[ns]') + }) + self._check_pandas_roundtrip(df, timestamps_to_ms=False) + # def test_category(self): # repeats = 1000 # values = [b'foo', None, u'bar', 'qux', np.nan] diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index 6a3966b748806..1f5b7009e6a08 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -489,20 +489,20 @@ struct arrow_traits { static constexpr int npy_type = NPY_BOOL; static constexpr bool supports_nulls = false; static constexpr bool is_boolean = true; - static constexpr bool is_integer = false; - static constexpr bool is_floating = false; + static constexpr bool is_pandas_numeric_not_nullable = false; + static constexpr bool is_pandas_numeric_nullable = false; }; -#define INT_DECL(TYPE) \ - template <> \ - struct arrow_traits { \ - static constexpr int npy_type = NPY_##TYPE; \ - static constexpr bool supports_nulls = false; \ - static constexpr double na_value = NAN; \ - static constexpr bool is_boolean = false; \ - static constexpr bool is_integer = true; \ - static constexpr bool is_floating = false; \ - typedef typename npy_traits::value_type T; \ +#define INT_DECL(TYPE) \ + template <> \ + struct arrow_traits { \ + static constexpr int npy_type = NPY_##TYPE; \ + static constexpr bool supports_nulls = false; \ + static constexpr double na_value = NAN; \ + static constexpr bool is_boolean = false; \ + static constexpr bool is_pandas_numeric_not_nullable = true; \ + static constexpr bool is_pandas_numeric_nullable = false; \ + typedef typename npy_traits::value_type T; \ }; INT_DECL(INT8); @@ -520,8 +520,8 @@ struct arrow_traits { static constexpr bool supports_nulls = true; static constexpr float na_value = NAN; static constexpr bool is_boolean = false; - static constexpr bool is_integer = false; - static constexpr bool is_floating = true; + static constexpr bool is_pandas_numeric_not_nullable = false; + static constexpr bool is_pandas_numeric_nullable = true; typedef typename npy_traits::value_type T; }; @@ -531,8 +531,8 @@ struct arrow_traits { static constexpr bool supports_nulls = true; static constexpr double na_value = NAN; static constexpr bool is_boolean = false; - static constexpr bool 
is_integer = false; - static constexpr bool is_floating = true; + static constexpr bool is_pandas_numeric_not_nullable = false; + static constexpr bool is_pandas_numeric_nullable = true; typedef typename npy_traits::value_type T; }; @@ -542,8 +542,8 @@ struct arrow_traits { static constexpr bool supports_nulls = true; static constexpr int64_t na_value = std::numeric_limits::min(); static constexpr bool is_boolean = false; - static constexpr bool is_integer = true; - static constexpr bool is_floating = false; + static constexpr bool is_pandas_numeric_not_nullable = false; + static constexpr bool is_pandas_numeric_nullable = true; typedef typename npy_traits::value_type T; }; @@ -552,8 +552,8 @@ struct arrow_traits { static constexpr int npy_type = NPY_OBJECT; static constexpr bool supports_nulls = true; static constexpr bool is_boolean = false; - static constexpr bool is_integer = false; - static constexpr bool is_floating = false; + static constexpr bool is_pandas_numeric_not_nullable = false; + static constexpr bool is_pandas_numeric_nullable = false; }; @@ -655,7 +655,7 @@ class ArrowDeserializer { template inline typename std::enable_if< - arrow_traits::is_floating, Status>::type + arrow_traits::is_pandas_numeric_nullable, Status>::type ConvertValues(const std::shared_ptr& arr) { typedef typename arrow_traits::T T; @@ -668,7 +668,7 @@ class ArrowDeserializer { T* out_values = reinterpret_cast(PyArray_DATA(out_)); for (int64_t i = 0; i < arr->length(); ++i) { - out_values[i] = arr->IsNull(i) ? NAN : in_values[i]; + out_values[i] = arr->IsNull(i) ? arrow_traits::na_value : in_values[i]; } } else { // Zero-Copy. We can pass the data pointer directly to NumPy. @@ -683,7 +683,7 @@ class ArrowDeserializer { // Integer specialization template inline typename std::enable_if< - arrow_traits::is_integer, Status>::type + arrow_traits::is_pandas_numeric_not_nullable, Status>::type ConvertValues(const std::shared_ptr& arr) { typedef typename arrow_traits::T T; From 84170962712b976fd6f68f10ba55e219155a57db Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Fri, 18 Nov 2016 11:09:28 -0500 Subject: [PATCH 0200/1644] ARROW-367: converter json <=> Arrow file format for Integration tests Author: Julien Le Dem Closes #203 from julienledem/integration and squashes the following commits: b3cd326 [Julien Le Dem] add license fdbe03f [Julien Le Dem] ARROW-367: converter json <=> Arrow file format for Integration tests --- .../org/apache/arrow/tools/Integration.java | 262 ++++++++++++++++++ .../arrow/tools/ArrowFileTestFixtures.java | 122 ++++++++ .../apache/arrow/tools/TestFileRoundtrip.java | 101 +------ .../apache/arrow/tools/TestIntegration.java | 143 ++++++++++ .../vector/file/json/JsonFileReader.java | 37 +-- .../vector/file/json/JsonFileWriter.java | 3 +- 6 files changed, 554 insertions(+), 114 deletions(-) create mode 100644 java/tools/src/main/java/org/apache/arrow/tools/Integration.java create mode 100644 java/tools/src/test/java/org/apache/arrow/tools/ArrowFileTestFixtures.java create mode 100644 java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java diff --git a/java/tools/src/main/java/org/apache/arrow/tools/Integration.java b/java/tools/src/main/java/org/apache/arrow/tools/Integration.java new file mode 100644 index 0000000000000..29f0ee29e3ca8 --- /dev/null +++ b/java/tools/src/main/java/org/apache/arrow/tools/Integration.java @@ -0,0 +1,262 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. 
See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.arrow.tools; + +import java.io.File; +import java.io.FileInputStream; +import java.io.FileOutputStream; +import java.io.IOException; +import java.util.Arrays; +import java.util.Iterator; +import java.util.List; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.VectorLoader; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.VectorUnloader; +import org.apache.arrow.vector.file.ArrowBlock; +import org.apache.arrow.vector.file.ArrowFooter; +import org.apache.arrow.vector.file.ArrowReader; +import org.apache.arrow.vector.file.ArrowWriter; +import org.apache.arrow.vector.file.json.JsonFileReader; +import org.apache.arrow.vector.file.json.JsonFileWriter; +import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.Schema; +import org.apache.commons.cli.CommandLine; +import org.apache.commons.cli.CommandLineParser; +import org.apache.commons.cli.Options; +import org.apache.commons.cli.ParseException; +import org.apache.commons.cli.PosixParser; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.google.common.base.Objects; + +public class Integration { + private static final Logger LOGGER = LoggerFactory.getLogger(Integration.class); + + public static void main(String[] args) { + try { + new Integration().run(args); + } catch (ParseException e) { + fatalError("Invalid parameters", e); + } catch (IOException e) { + fatalError("Error accessing files", e); + } catch (RuntimeException e) { + fatalError("Incompatible files", e); + } + } + + private final Options options; + + enum Command { + ARROW_TO_JSON(true, false) { + @Override + public void execute(File arrowFile, File jsonFile) throws IOException { + try( + BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); + FileInputStream fileInputStream = new FileInputStream(arrowFile); + ArrowReader arrowReader = new ArrowReader(fileInputStream.getChannel(), allocator);) { + ArrowFooter footer = arrowReader.readFooter(); + Schema schema = footer.getSchema(); + LOGGER.debug("Input file size: " + arrowFile.length()); + LOGGER.debug("Found schema: " + schema); + try (JsonFileWriter writer = new JsonFileWriter(jsonFile);) { + writer.start(schema); + List recordBatches = footer.getRecordBatches(); + for (ArrowBlock rbBlock : recordBatches) { + try (ArrowRecordBatch inRecordBatch = arrowReader.readRecordBatch(rbBlock); + VectorSchemaRoot root = new VectorSchemaRoot(schema, allocator);) { + VectorLoader vectorLoader = new VectorLoader(root); + vectorLoader.load(inRecordBatch); + writer.write(root); + } + } + } + LOGGER.debug("Output file size: " + jsonFile.length()); + } + } 
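+ // the (true, false) constructor flags declare that the arrow input must already exist and the json output must not; run() enforces this through validateFile before calling execute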
+ }, + JSON_TO_ARROW(false, true) { + @Override + public void execute(File arrowFile, File jsonFile) throws IOException { + try ( + BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); + JsonFileReader reader = new JsonFileReader(jsonFile, allocator); + ) { + Schema schema = reader.start(); + LOGGER.debug("Input file size: " + jsonFile.length()); + LOGGER.debug("Found schema: " + schema); + try ( + FileOutputStream fileOutputStream = new FileOutputStream(arrowFile); + ArrowWriter arrowWriter = new ArrowWriter(fileOutputStream.getChannel(), schema); + ) { + + // initialize vectors + VectorSchemaRoot root; + while ((root = reader.read()) != null) { + VectorUnloader vectorUnloader = new VectorUnloader(root); + try (ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch();) { + arrowWriter.writeRecordBatch(recordBatch); + } + root.close(); + } + } + LOGGER.debug("Output file size: " + arrowFile.length()); + } + } + }, + VALIDATE(true, true) { + @Override + public void execute(File arrowFile, File jsonFile) throws IOException { + try ( + BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); + JsonFileReader jsonReader = new JsonFileReader(jsonFile, allocator); + FileInputStream fileInputStream = new FileInputStream(arrowFile); + ArrowReader arrowReader = new ArrowReader(fileInputStream.getChannel(), allocator); + ) { + Schema jsonSchema = jsonReader.start(); + ArrowFooter footer = arrowReader.readFooter(); + Schema arrowSchema = footer.getSchema(); + LOGGER.debug("Arrow Input file size: " + arrowFile.length()); + LOGGER.debug("ARROW schema: " + arrowSchema); + LOGGER.debug("JSON Input file size: " + jsonFile.length()); + LOGGER.debug("JSON schema: " + jsonSchema); + compareSchemas(jsonSchema, arrowSchema); + + List recordBatches = footer.getRecordBatches(); + Iterator iterator = recordBatches.iterator(); + VectorSchemaRoot jsonRoot; + while ((jsonRoot = jsonReader.read()) != null && iterator.hasNext()) { + ArrowBlock rbBlock = iterator.next(); + try (ArrowRecordBatch inRecordBatch = arrowReader.readRecordBatch(rbBlock); + VectorSchemaRoot arrowRoot = new VectorSchemaRoot(arrowSchema, allocator);) { + VectorLoader vectorLoader = new VectorLoader(arrowRoot); + vectorLoader.load(inRecordBatch); + // TODO: compare + compare(arrowRoot, jsonRoot); + } + jsonRoot.close(); + } + boolean hasMoreJSON = jsonRoot != null; + boolean hasMoreArrow = iterator.hasNext(); + if (hasMoreJSON || hasMoreArrow) { + throw new IllegalArgumentException("Unexpected RecordBatches. 
J:" + hasMoreJSON + " A:" + hasMoreArrow); + } + } + } + }; + + public final boolean arrowExists; + public final boolean jsonExists; + + Command(boolean arrowExists, boolean jsonExists) { + this.arrowExists = arrowExists; + this.jsonExists = jsonExists; + } + + abstract public void execute(File arrowFile, File jsonFile) throws IOException; + + } + + Integration() { + this.options = new Options(); + this.options.addOption("a", "arrow", true, "arrow file"); + this.options.addOption("j", "json", true, "json file"); + this.options.addOption("c", "command", true, "command to execute: " + Arrays.toString(Command.values())); + } + + private File validateFile(String type, String fileName, boolean shouldExist) { + if (fileName == null) { + throw new IllegalArgumentException("missing " + type + " file parameter"); + } + File f = new File(fileName); + if (shouldExist && (!f.exists() || f.isDirectory())) { + throw new IllegalArgumentException(type + " file not found: " + f.getAbsolutePath()); + } + if (!shouldExist && f.exists()) { + throw new IllegalArgumentException(type + " file already exists: " + f.getAbsolutePath()); + } + return f; + } + + void run(String[] args) throws ParseException, IOException { + CommandLineParser parser = new PosixParser(); + CommandLine cmd = parser.parse(options, args, false); + + + Command command = toCommand(cmd.getOptionValue("command")); + File arrowFile = validateFile("arrow", cmd.getOptionValue("arrow"), command.arrowExists); + File jsonFile = validateFile("json", cmd.getOptionValue("json"), command.jsonExists); + command.execute(arrowFile, jsonFile); + } + + private Command toCommand(String commandName) { + try { + return Command.valueOf(commandName); + } catch (IllegalArgumentException e) { + throw new IllegalArgumentException("Unknown command: " + commandName + " expected one of " + Arrays.toString(Command.values())); + } + } + + private static void fatalError(String message, Throwable e) { + System.err.println(message); + LOGGER.error(message, e); + System.exit(1); + } + + + private static void compare(VectorSchemaRoot arrowRoot, VectorSchemaRoot jsonRoot) { + compareSchemas(jsonRoot.getSchema(), arrowRoot.getSchema()); + if (arrowRoot.getRowCount() != jsonRoot.getRowCount()) { + throw new IllegalArgumentException("Different row count:\n" + arrowRoot.getRowCount() + "\n" + jsonRoot.getRowCount()); + } + List arrowVectors = arrowRoot.getFieldVectors(); + List jsonVectors = jsonRoot.getFieldVectors(); + if (arrowVectors.size() != jsonVectors.size()) { + throw new IllegalArgumentException("Different column count:\n" + arrowVectors.size() + "\n" + jsonVectors.size()); + } + for (int i = 0; i < arrowVectors.size(); i++) { + Field field = arrowRoot.getSchema().getFields().get(i); + FieldVector arrowVector = arrowVectors.get(i); + FieldVector jsonVector = jsonVectors.get(i); + int valueCount = arrowVector.getAccessor().getValueCount(); + if (valueCount != jsonVector.getAccessor().getValueCount()) { + throw new IllegalArgumentException("Different value count for field " + field + " : " + valueCount + " != " + jsonVector.getAccessor().getValueCount()); + } + for (int j = 0; j < valueCount; j++) { + Object arrow = arrowVector.getAccessor().getObject(j); + Object json = jsonVector.getAccessor().getObject(j); + if (!Objects.equal(arrow, json)) { + throw new IllegalArgumentException( + "Different values in column:\n" + field + " at index " + j + ": " + arrow + " != " + json); + } + } + } + } + + private static void compareSchemas(Schema jsonSchema, Schema arrowSchema) { + 
if (!arrowSchema.equals(jsonSchema)) { + throw new IllegalArgumentException("Different schemas:\n" + arrowSchema + "\n" + jsonSchema); + } + } +} diff --git a/java/tools/src/test/java/org/apache/arrow/tools/ArrowFileTestFixtures.java b/java/tools/src/test/java/org/apache/arrow/tools/ArrowFileTestFixtures.java new file mode 100644 index 0000000000000..4cfc52fe08631 --- /dev/null +++ b/java/tools/src/test/java/org/apache/arrow/tools/ArrowFileTestFixtures.java @@ -0,0 +1,122 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.arrow.tools; + +import java.io.File; +import java.io.FileInputStream; +import java.io.FileNotFoundException; +import java.io.FileOutputStream; +import java.io.IOException; +import java.util.List; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.VectorLoader; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.VectorUnloader; +import org.apache.arrow.vector.complex.MapVector; +import org.apache.arrow.vector.complex.impl.ComplexWriterImpl; +import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter; +import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; +import org.apache.arrow.vector.complex.writer.BigIntWriter; +import org.apache.arrow.vector.complex.writer.IntWriter; +import org.apache.arrow.vector.file.ArrowBlock; +import org.apache.arrow.vector.file.ArrowFooter; +import org.apache.arrow.vector.file.ArrowReader; +import org.apache.arrow.vector.file.ArrowWriter; +import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.types.pojo.Schema; +import org.junit.Assert; + +public class ArrowFileTestFixtures { + static final int COUNT = 10; + + static void writeData(int count, MapVector parent) { + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + IntWriter intWriter = rootWriter.integer("int"); + BigIntWriter bigIntWriter = rootWriter.bigInt("bigInt"); + for (int i = 0; i < count; i++) { + intWriter.setPosition(i); + intWriter.writeInt(i); + bigIntWriter.setPosition(i); + bigIntWriter.writeBigInt(i); + } + writer.setValueCount(count); + } + + static void validateOutput(File testOutFile, BufferAllocator allocator) throws Exception { + // read + try ( + BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + FileInputStream fileInputStream = new FileInputStream(testOutFile); + ArrowReader arrowReader = new ArrowReader(fileInputStream.getChannel(), readerAllocator); + BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); + ) { + ArrowFooter footer = arrowReader.readFooter(); + Schema schema = 
footer.getSchema(); + + // initialize vectors + try (VectorSchemaRoot root = new VectorSchemaRoot(schema, readerAllocator)) { + VectorLoader vectorLoader = new VectorLoader(root); + + List recordBatches = footer.getRecordBatches(); + for (ArrowBlock rbBlock : recordBatches) { + try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { + vectorLoader.load(recordBatch); + } + validateContent(COUNT, root); + } + } + } + } + + static void validateContent(int count, VectorSchemaRoot root) { + Assert.assertEquals(count, root.getRowCount()); + for (int i = 0; i < count; i++) { + Assert.assertEquals(i, root.getVector("int").getAccessor().getObject(i)); + Assert.assertEquals(Long.valueOf(i), root.getVector("bigInt").getAccessor().getObject(i)); + } + } + + static void write(FieldVector parent, File file) throws FileNotFoundException, IOException { + Schema schema = new Schema(parent.getField().getChildren()); + int valueCount = parent.getAccessor().getValueCount(); + List fields = parent.getChildrenFromFields(); + VectorUnloader vectorUnloader = new VectorUnloader(schema, valueCount, fields); + try ( + FileOutputStream fileOutputStream = new FileOutputStream(file); + ArrowWriter arrowWriter = new ArrowWriter(fileOutputStream.getChannel(), schema); + ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch(); + ) { + arrowWriter.writeRecordBatch(recordBatch); + } + } + + + static void writeInput(File testInFile, BufferAllocator allocator) throws FileNotFoundException, IOException { + int count = ArrowFileTestFixtures.COUNT; + try ( + BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", vectorAllocator, null)) { + writeData(count, parent); + write(parent.getChild("root"), testInFile); + } + } +} diff --git a/java/tools/src/test/java/org/apache/arrow/tools/TestFileRoundtrip.java b/java/tools/src/test/java/org/apache/arrow/tools/TestFileRoundtrip.java index 339725e5af1e0..ee39f5e92c7b0 100644 --- a/java/tools/src/test/java/org/apache/arrow/tools/TestFileRoundtrip.java +++ b/java/tools/src/test/java/org/apache/arrow/tools/TestFileRoundtrip.java @@ -18,42 +18,21 @@ */ package org.apache.arrow.tools; +import static org.apache.arrow.tools.ArrowFileTestFixtures.validateOutput; +import static org.apache.arrow.tools.ArrowFileTestFixtures.writeInput; import static org.junit.Assert.assertEquals; import java.io.File; -import java.io.FileInputStream; -import java.io.FileNotFoundException; -import java.io.FileOutputStream; -import java.io.IOException; -import java.util.List; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; -import org.apache.arrow.vector.FieldVector; -import org.apache.arrow.vector.VectorLoader; -import org.apache.arrow.vector.VectorSchemaRoot; -import org.apache.arrow.vector.VectorUnloader; -import org.apache.arrow.vector.complex.MapVector; -import org.apache.arrow.vector.complex.impl.ComplexWriterImpl; -import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter; -import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; -import org.apache.arrow.vector.complex.writer.BigIntWriter; -import org.apache.arrow.vector.complex.writer.IntWriter; -import org.apache.arrow.vector.file.ArrowBlock; -import org.apache.arrow.vector.file.ArrowFooter; -import org.apache.arrow.vector.file.ArrowReader; -import org.apache.arrow.vector.file.ArrowWriter; -import org.apache.arrow.vector.schema.ArrowRecordBatch; -import 
org.apache.arrow.vector.types.pojo.Schema; import org.junit.After; -import org.junit.Assert; import org.junit.Before; import org.junit.Rule; import org.junit.Test; import org.junit.rules.TemporaryFolder; public class TestFileRoundtrip { - private static final int COUNT = 10; @Rule public TemporaryFolder testFolder = new TemporaryFolder(); @@ -70,90 +49,18 @@ public void tearDown() { allocator.close(); } - private void writeData(int count, MapVector parent) { - ComplexWriter writer = new ComplexWriterImpl("root", parent); - MapWriter rootWriter = writer.rootAsMap(); - IntWriter intWriter = rootWriter.integer("int"); - BigIntWriter bigIntWriter = rootWriter.bigInt("bigInt"); - for (int i = 0; i < count; i++) { - intWriter.setPosition(i); - intWriter.writeInt(i); - bigIntWriter.setPosition(i); - bigIntWriter.writeBigInt(i); - } - writer.setValueCount(count); - } - @Test public void test() throws Exception { File testInFile = testFolder.newFile("testIn.arrow"); File testOutFile = testFolder.newFile("testOut.arrow"); - writeInput(testInFile); + writeInput(testInFile, allocator); String[] args = { "-i", testInFile.getAbsolutePath(), "-o", testOutFile.getAbsolutePath()}; int result = new FileRoundtrip(System.out, System.err).run(args); assertEquals(0, result); - validateOutput(testOutFile); - } - - private void validateOutput(File testOutFile) throws Exception { - // read - try ( - BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); - FileInputStream fileInputStream = new FileInputStream(testOutFile); - ArrowReader arrowReader = new ArrowReader(fileInputStream.getChannel(), readerAllocator); - BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); - ) { - ArrowFooter footer = arrowReader.readFooter(); - Schema schema = footer.getSchema(); - - // initialize vectors - try (VectorSchemaRoot root = new VectorSchemaRoot(schema, readerAllocator)) { - VectorLoader vectorLoader = new VectorLoader(root); - - List recordBatches = footer.getRecordBatches(); - for (ArrowBlock rbBlock : recordBatches) { - try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { - vectorLoader.load(recordBatch); - } - validateContent(COUNT, root); - } - } - } - } - - private void validateContent(int count, VectorSchemaRoot root) { - Assert.assertEquals(count, root.getRowCount()); - for (int i = 0; i < count; i++) { - Assert.assertEquals(i, root.getVector("int").getAccessor().getObject(i)); - Assert.assertEquals(Long.valueOf(i), root.getVector("bigInt").getAccessor().getObject(i)); - } - } - - public void writeInput(File testInFile) throws FileNotFoundException, IOException { - int count = COUNT; - try ( - BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); - MapVector parent = new MapVector("parent", vectorAllocator, null)) { - writeData(count, parent); - write(parent.getChild("root"), testInFile); - } - } - - private void write(FieldVector parent, File file) throws FileNotFoundException, IOException { - Schema schema = new Schema(parent.getField().getChildren()); - int valueCount = parent.getAccessor().getValueCount(); - List fields = parent.getChildrenFromFields(); - VectorUnloader vectorUnloader = new VectorUnloader(schema, valueCount, fields); - try ( - FileOutputStream fileOutputStream = new FileOutputStream(file); - ArrowWriter arrowWriter = new ArrowWriter(fileOutputStream.getChannel(), schema); - ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch(); 
-        ) {
-      arrowWriter.writeRecordBatch(recordBatch);
-    }
+    validateOutput(testOutFile, allocator);
   }
 }
diff --git a/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java b/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java
new file mode 100644
index 0000000000000..bb69ed1498e26
--- /dev/null
+++ b/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java
@@ -0,0 +1,143 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.arrow.tools;
+
+import static org.apache.arrow.tools.ArrowFileTestFixtures.validateOutput;
+import static org.apache.arrow.tools.ArrowFileTestFixtures.write;
+import static org.apache.arrow.tools.ArrowFileTestFixtures.writeData;
+import static org.apache.arrow.tools.ArrowFileTestFixtures.writeInput;
+import static org.junit.Assert.fail;
+
+import java.io.File;
+import java.io.FileNotFoundException;
+import java.io.IOException;
+
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.memory.RootAllocator;
+import org.apache.arrow.tools.Integration.Command;
+import org.apache.arrow.vector.complex.MapVector;
+import org.apache.arrow.vector.complex.impl.ComplexWriterImpl;
+import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter;
+import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter;
+import org.apache.arrow.vector.complex.writer.BigIntWriter;
+import org.apache.arrow.vector.complex.writer.IntWriter;
+import org.junit.After;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TemporaryFolder;
+
+public class TestIntegration {
+
+  @Rule
+  public TemporaryFolder testFolder = new TemporaryFolder();
+
+  private BufferAllocator allocator;
+
+  @Before
+  public void init() {
+    allocator = new RootAllocator(Integer.MAX_VALUE);
+  }
+
+  @After
+  public void tearDown() {
+    allocator.close();
+  }
+
+  @Test
+  public void testValid() throws Exception {
+    File testInFile = testFolder.newFile("testIn.arrow");
+    File testJSONFile = testFolder.newFile("testOut.json");
+    testJSONFile.delete();
+    File testOutFile = testFolder.newFile("testOut.arrow");
+    testOutFile.delete();
+
+    // generate an arrow file
+    writeInput(testInFile, allocator);
+
+    Integration integration = new Integration();
+
+    // convert it to json
+    String[] args1 = { "-arrow", testInFile.getAbsolutePath(), "-json", testJSONFile.getAbsolutePath(), "-command", Command.ARROW_TO_JSON.name()};
+    integration.run(args1);
+
+    // convert back to arrow
+    String[] args2 = { "-arrow", testOutFile.getAbsolutePath(), "-json", testJSONFile.getAbsolutePath(), "-command", Command.JSON_TO_ARROW.name()};
+    integration.run(args2);
+
+    // check it is the same
+    validateOutput(testOutFile, allocator);
+
+    // validate
arrow against json + String[] args3 = { "-arrow", testInFile.getAbsolutePath(), "-json", testJSONFile.getAbsolutePath(), "-command", Command.VALIDATE.name()}; + integration.run(args3); + } + + @Test + public void testInvalid() throws Exception { + File testValidInFile = testFolder.newFile("testValidIn.arrow"); + File testInvalidInFile = testFolder.newFile("testInvalidIn.arrow"); + File testJSONFile = testFolder.newFile("testInvalidOut.json"); + testJSONFile.delete(); + + // generate an arrow file + writeInput(testValidInFile, allocator); + // generate a different arrow file + writeInput2(testInvalidInFile, allocator); + + Integration integration = new Integration(); + + // convert the "valid" file to json + String[] args1 = { "-arrow", testValidInFile.getAbsolutePath(), "-json", testJSONFile.getAbsolutePath(), "-command", Command.ARROW_TO_JSON.name()}; + integration.run(args1); + + // compare the "invalid" file to the "valid" json + String[] args3 = { "-arrow", testInvalidInFile.getAbsolutePath(), "-json", testJSONFile.getAbsolutePath(), "-command", Command.VALIDATE.name()}; + // this should fail + try { + integration.run(args3); + fail("should have failed"); + } catch (IllegalArgumentException e) { + Assert.assertTrue(e.getMessage(), e.getMessage().contains("Different values in column")); + Assert.assertTrue(e.getMessage(), e.getMessage().contains("999")); + } + + } + + static void writeInput2(File testInFile, BufferAllocator allocator) throws FileNotFoundException, IOException { + int count = ArrowFileTestFixtures.COUNT; + try ( + BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", vectorAllocator, null)) { + writeData(count, parent); + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + IntWriter intWriter = rootWriter.integer("int"); + BigIntWriter bigIntWriter = rootWriter.bigInt("bigInt"); + intWriter.setPosition(5); + intWriter.writeInt(999); + bigIntWriter.setPosition(4); + bigIntWriter.writeBigInt(777L); + writer.setValueCount(count); + write(parent.getChild("root"), testInFile); + } + } + +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java index 859a3a0e80a50..f07b517250732 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java @@ -56,7 +56,7 @@ import com.fasterxml.jackson.databind.MappingJsonFactory; import com.google.common.base.Objects; -public class JsonFileReader { +public class JsonFileReader implements AutoCloseable { private final File inputFile; private final JsonParser parser; private final BufferAllocator allocator; @@ -81,23 +81,29 @@ public Schema start() throws JsonParseException, IOException { } public VectorSchemaRoot read() throws IOException { - VectorSchemaRoot recordBatch = new VectorSchemaRoot(schema, allocator); - readToken(START_OBJECT); - { - int count = readNextField("count", Integer.class); - recordBatch.setRowCount(count); - nextFieldIs("columns"); - readToken(START_ARRAY); + JsonToken t = parser.nextToken(); + if (t == START_OBJECT) { + VectorSchemaRoot recordBatch = new VectorSchemaRoot(schema, allocator); { - for (Field field : schema.getFields()) { - FieldVector vector = recordBatch.getVector(field.getName()); - readVector(field, vector); + int count 
= readNextField("count", Integer.class); + recordBatch.setRowCount(count); + nextFieldIs("columns"); + readToken(START_ARRAY); + { + for (Field field : schema.getFields()) { + FieldVector vector = recordBatch.getVector(field.getName()); + readVector(field, vector); + } } + readToken(END_ARRAY); } - readToken(END_ARRAY); + readToken(END_OBJECT); + return recordBatch; + } else if (t == END_ARRAY) { + return null; + } else { + throw new IllegalArgumentException("Invalid token: " + t); } - readToken(END_OBJECT); - return recordBatch; } private void readVector(Field field, FieldVector vector) throws JsonParseException, IOException { @@ -194,9 +200,8 @@ private void setValueFromParser(ValueVector valueVector, int i) throws IOExcepti } } + @Override public void close() throws IOException { - readToken(END_ARRAY); - readToken(END_OBJECT); parser.close(); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java index 47c1a7dabef11..812b3da32f83c 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java @@ -38,7 +38,7 @@ import com.fasterxml.jackson.core.util.DefaultPrettyPrinter.NopIndenter; import com.fasterxml.jackson.databind.MappingJsonFactory; -public class JsonFileWriter { +public class JsonFileWriter implements AutoCloseable { public static final class JSONWriteConfig { private final boolean pretty; @@ -158,6 +158,7 @@ private void writeValueToGenerator(ValueVector valueVector, int i) throws IOExce } } + @Override public void close() throws IOException { generator.writeEndArray(); generator.writeEndObject(); From ed6ec3b76e1ac27fab85cd4bc74fbd61e8dfb27f Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 18 Nov 2016 14:58:46 -0500 Subject: [PATCH 0201/1644] ARROW-373: [C++] JSON serialization format for testing C++ version of ARROW-372 Author: Wes McKinney Closes #202 from wesm/ARROW-373 and squashes the following commits: d13a05f [Wes McKinney] Compiler warning 72c24fe [Wes McKinney] Add a minimal literal JSON example a2cf47b [Wes McKinney] cpplint 3d9fcc2 [Wes McKinney] Complete round trip json file test with multiple record batches 2753449 [Wes McKinney] Complete draft json roundtrip implementation. tests not complete yet 3d6bbbd [Wes McKinney] Start high level writer scaffold 6bbd669 [Wes McKinney] Tweaks e2e86b5 [Wes McKinney] Test JSON array roundtrip for numeric types, strings, lists, structs 82f108b [Wes McKinney] Refactoring. Array test scaffold 0891378 [Wes McKinney] Declare loop variables 6566343 [Wes McKinney] Recursively construct children for list/struct 35c2f85 [Wes McKinney] Refactoring. Start drafting string/list reader f26402a [Wes McKinney] Install type_traits.h. cpplint 4fc7294 [Wes McKinney] Refactoring, type attribute consistency. Array reader compiles 2c93cce [Wes McKinney] WIP JSON array reader code path 932ba7a [Wes McKinney] Add ArrayVisitor methods, add enough metaprogramming to detect presence of c_type type member 15c1094 [Wes McKinney] Add type traits, refactoring, drafting json array writing. not working yet 209ba48 [Wes McKinney] More types refactoring. 
Strange linker error in pyarrow 379da3c [Wes McKinney] Implement union metadata JSON serialization 5fbea41 [Wes McKinney] Implement some more json types and add convenience factory functions 1c08233 [Wes McKinney] JSON schema roundtrip passing for many types 86c9559 [Wes McKinney] Add convenience factory functions for common types 3b9d14e [Wes McKinney] Add type-specific JSON metadata to schema writer 820b0f2 [Wes McKinney] Drafting JSON schema read/write 68ee7ab [Wes McKinney] Move forward declarations into type_fwd.h 1edf2a9 [Wes McKinney] Prototyping out visitor pattern for json serialization 24c1d5d [Wes McKinney] Some Types refactoring, add TypeVisitor abstract class. Add RapidJSON as external project --- cpp/CMakeLists.txt | 19 + cpp/src/arrow/CMakeLists.txt | 2 + cpp/src/arrow/array.cc | 15 + cpp/src/arrow/array.h | 12 + cpp/src/arrow/column-test.cc | 1 + cpp/src/arrow/io/hdfs.cc | 8 +- cpp/src/arrow/io/libhdfs_shim.cc | 26 +- cpp/src/arrow/ipc/CMakeLists.txt | 7 + cpp/src/arrow/ipc/adapter.cc | 2 +- cpp/src/arrow/ipc/ipc-json-test.cc | 353 ++++++++ cpp/src/arrow/ipc/json-internal.cc | 1113 +++++++++++++++++++++++++ cpp/src/arrow/ipc/json-internal.h | 111 +++ cpp/src/arrow/ipc/json.cc | 219 +++++ cpp/src/arrow/ipc/json.h | 92 ++ cpp/src/arrow/ipc/test-common.h | 14 +- cpp/src/arrow/schema-test.cc | 52 +- cpp/src/arrow/schema.cc | 15 + cpp/src/arrow/schema.h | 12 +- cpp/src/arrow/test-util.h | 51 +- cpp/src/arrow/type.cc | 122 ++- cpp/src/arrow/type.h | 338 ++++++-- cpp/src/arrow/type_fwd.h | 157 ++++ cpp/src/arrow/type_traits.h | 197 +++++ cpp/src/arrow/types/CMakeLists.txt | 1 - cpp/src/arrow/types/collection.h | 41 - cpp/src/arrow/types/datetime.h | 37 +- cpp/src/arrow/types/decimal.h | 14 +- cpp/src/arrow/types/list-test.cc | 2 +- cpp/src/arrow/types/list.cc | 4 + cpp/src/arrow/types/list.h | 8 +- cpp/src/arrow/types/primitive-test.cc | 36 +- cpp/src/arrow/types/primitive.cc | 97 ++- cpp/src/arrow/types/primitive.h | 190 ++--- cpp/src/arrow/types/string-test.cc | 12 +- cpp/src/arrow/types/string.cc | 16 +- cpp/src/arrow/types/string.h | 24 +- cpp/src/arrow/types/struct-test.cc | 2 +- cpp/src/arrow/types/struct.cc | 4 + cpp/src/arrow/types/struct.h | 4 + cpp/src/arrow/types/test-common.h | 16 + cpp/src/arrow/types/union.cc | 23 +- cpp/src/arrow/types/union.h | 21 - cpp/src/arrow/util/logging.h | 4 +- format/Metadata.md | 5 + 44 files changed, 3049 insertions(+), 450 deletions(-) create mode 100644 cpp/src/arrow/ipc/ipc-json-test.cc create mode 100644 cpp/src/arrow/ipc/json-internal.cc create mode 100644 cpp/src/arrow/ipc/json-internal.h create mode 100644 cpp/src/arrow/ipc/json.cc create mode 100644 cpp/src/arrow/ipc/json.h create mode 100644 cpp/src/arrow/type_fwd.h create mode 100644 cpp/src/arrow/type_traits.h delete mode 100644 cpp/src/arrow/types/collection.h diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 6f954830b6334..0bff7528578d1 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -545,6 +545,25 @@ if(ARROW_BUILD_BENCHMARKS) endif() endif() +# RapidJSON, header only dependency +if("$ENV{RAPIDJSON_HOME}" STREQUAL "") + ExternalProject_Add(rapidjson_ep + PREFIX "${CMAKE_BINARY_DIR}" + URL "https://github.com/miloyip/rapidjson/archive/v1.1.0.tar.gz" + URL_MD5 "badd12c511e081fec6c89c43a7027bce" + CONFIGURE_COMMAND "" + BUILD_COMMAND "" + BUILD_IN_SOURCE 1 + INSTALL_COMMAND "") + + ExternalProject_Get_Property(rapidjson_ep SOURCE_DIR) + set(RAPIDJSON_INCLUDE_DIR "${SOURCE_DIR}/include") +else() + set(RAPIDJSON_INCLUDE_DIR "$ENV{RAPIDJSON_HOME}/include") 
+endif()
+message(STATUS "RapidJSON include dir: ${RAPIDJSON_INCLUDE_DIR}")
+include_directories(SYSTEM ${RAPIDJSON_INCLUDE_DIR})
+
 ## Google PerfTools
 ##
 ## Disabled with TSAN/ASAN as well as with gold+dynamic linking (see comment
diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt
index a9b2feca28cb7..81851bc5b3eb1 100644
--- a/cpp/src/arrow/CMakeLists.txt
+++ b/cpp/src/arrow/CMakeLists.txt
@@ -24,6 +24,8 @@ install(FILES
   schema.h
   table.h
   type.h
+  type_fwd.h
+  type_traits.h
   test-util.h
   DESTINATION include/arrow)
diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc
index e432a53781f17..3262425e99b66 100644
--- a/cpp/src/arrow/array.cc
+++ b/cpp/src/arrow/array.cc
@@ -18,6 +18,7 @@
 #include "arrow/array.h"
 
 #include <cstdint>
+#include <cstring>
 
 #include "arrow/util/bit-util.h"
 #include "arrow/util/buffer.h"
@@ -25,6 +26,16 @@
 namespace arrow {
 
+Status GetEmptyBitmap(
+    MemoryPool* pool, int32_t length, std::shared_ptr<MutableBuffer>* result) {
+  auto buffer = std::make_shared<PoolBuffer>(pool);
+  RETURN_NOT_OK(buffer->Resize(BitUtil::BytesForBits(length)));
+  memset(buffer->mutable_data(), 0, buffer->size());
+
+  *result = buffer;
+  return Status::OK();
+}
+
 // ----------------------------------------------------------------------
 // Base array class
 
@@ -66,4 +77,8 @@ bool NullArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_st
   return true;
 }
 
+Status NullArray::Accept(ArrayVisitor* visitor) const {
+  return visitor->Visit(*this);
+}
+
 } // namespace arrow
diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h
index ff37323f60519..ff2b70e213b1b 100644
--- a/cpp/src/arrow/array.h
+++ b/cpp/src/arrow/array.h
@@ -29,6 +29,8 @@
 namespace arrow {
 
 class Buffer;
+class MemoryPool;
+class MutableBuffer;
 class Status;
 
 // Immutable data array with some logical type and some length. Any memory is
@@ -70,6 +72,8 @@ class ARROW_EXPORT Array {
   // returning Status::OK. This can be an expensive check.
   virtual Status Validate() const;
 
+  virtual Status Accept(ArrayVisitor* visitor) const = 0;
+
  protected:
   std::shared_ptr<DataType> type_;
   int32_t null_count_;
@@ -86,6 +90,8 @@ class ARROW_EXPORT Array {
 
 // Degenerate null type Array
 class ARROW_EXPORT NullArray : public Array {
  public:
+  using TypeClass = NullType;
+
   NullArray(const std::shared_ptr<DataType>& type, int32_t length)
       : Array(type, length, length, nullptr) {}
 
@@ -94,9 +100,15 @@ class ARROW_EXPORT NullArray : public Array {
   bool Equals(const std::shared_ptr<Array>& arr) const override;
   bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_index,
       const std::shared_ptr<Array>& arr) const override;
+
+  Status Accept(ArrayVisitor* visitor) const override;
 };
 
 typedef std::shared_ptr<Array> ArrayPtr;
+
+Status ARROW_EXPORT GetEmptyBitmap(
+    MemoryPool* pool, int32_t length, std::shared_ptr<MutableBuffer>* result);
+
 } // namespace arrow
 
 #endif
diff --git a/cpp/src/arrow/column-test.cc b/cpp/src/arrow/column-test.cc
index 1edf313d49bf6..ac3636d1b6dab 100644
--- a/cpp/src/arrow/column-test.cc
+++ b/cpp/src/arrow/column-test.cc
@@ -22,6 +22,7 @@
 
 #include "gtest/gtest.h"
 
+#include "arrow/array.h"
 #include "arrow/column.h"
 #include "arrow/schema.h"
 #include "arrow/test-util.h"
diff --git a/cpp/src/arrow/io/hdfs.cc b/cpp/src/arrow/io/hdfs.cc
index 6490a7574eea2..13491e780e21b 100644
--- a/cpp/src/arrow/io/hdfs.cc
+++ b/cpp/src/arrow/io/hdfs.cc
@@ -289,13 +289,9 @@ class HdfsClient::HdfsClientImpl {
   // connect to HDFS with the builder object
   hdfsBuilder* builder = hdfsNewBuilder();
-  if (!config->host.empty()) {
-    hdfsBuilderSetNameNode(builder, config->host.c_str());
-  }
+  if (!config->host.empty()) { hdfsBuilderSetNameNode(builder, config->host.c_str()); }
   hdfsBuilderSetNameNodePort(builder, config->port);
-  if (!config->user.empty()) {
-    hdfsBuilderSetUserName(builder, config->user.c_str());
-  }
+  if (!config->user.empty()) { hdfsBuilderSetUserName(builder, config->user.c_str()); }
   if (!config->kerb_ticket.empty()) {
     hdfsBuilderSetKerbTicketCachePath(builder, config->kerb_ticket.c_str());
   }
diff --git a/cpp/src/arrow/io/libhdfs_shim.cc b/cpp/src/arrow/io/libhdfs_shim.cc
index 1fee595d0718b..36b8a4ec980a9 100644
--- a/cpp/src/arrow/io/libhdfs_shim.cc
+++ b/cpp/src/arrow/io/libhdfs_shim.cc
@@ -74,12 +74,9 @@ static HINSTANCE libjvm_handle = NULL;
 // NOTE(wesm): cpplint does not like use of short and other imprecise C types
 static hdfsBuilder* (*ptr_hdfsNewBuilder)(void) = NULL;
-static void (*ptr_hdfsBuilderSetNameNode)(
-    hdfsBuilder* bld, const char* nn) = NULL;
-static void (*ptr_hdfsBuilderSetNameNodePort)(
-    hdfsBuilder* bld, tPort port) = NULL;
-static void (*ptr_hdfsBuilderSetUserName)(
-    hdfsBuilder* bld, const char* userName) = NULL;
+static void (*ptr_hdfsBuilderSetNameNode)(hdfsBuilder* bld, const char* nn) = NULL;
+static void (*ptr_hdfsBuilderSetNameNodePort)(hdfsBuilder* bld, tPort port) = NULL;
+static void (*ptr_hdfsBuilderSetUserName)(hdfsBuilder* bld, const char* userName) = NULL;
 static void (*ptr_hdfsBuilderSetKerbTicketCachePath)(
     hdfsBuilder* bld, const char* kerbTicketCachePath) = NULL;
 static hdfsFS (*ptr_hdfsBuilderConnect)(hdfsBuilder* bld) = NULL;
@@ -173,9 +170,9 @@ void hdfsBuilderSetUserName(hdfsBuilder* bld, const char* userName) {
   ptr_hdfsBuilderSetUserName(bld, userName);
 }
 
-void hdfsBuilderSetKerbTicketCachePath(hdfsBuilder* bld,
-    const char* kerbTicketCachePath) {
-  ptr_hdfsBuilderSetKerbTicketCachePath(bld , kerbTicketCachePath);
+void hdfsBuilderSetKerbTicketCachePath(
+    hdfsBuilder* bld, const char* kerbTicketCachePath) {
+ 
ptr_hdfsBuilderSetKerbTicketCachePath(bld, kerbTicketCachePath); } hdfsFS hdfsBuilderConnect(hdfsBuilder* bld) { @@ -364,7 +361,7 @@ static std::vector get_potential_libhdfs_paths() { std::vector libhdfs_potential_paths; std::string file_name; - // OS-specific file name +// OS-specific file name #ifdef __WIN32 file_name = "hdfs.dll"; #elif __APPLE__ @@ -374,10 +371,7 @@ static std::vector get_potential_libhdfs_paths() { #endif // Common paths - std::vector search_paths = { - fs::path(""), - fs::path(".") - }; + std::vector search_paths = {fs::path(""), fs::path(".")}; // Path from environment variable const char* hadoop_home = std::getenv("HADOOP_HOME"); @@ -387,9 +381,7 @@ static std::vector get_potential_libhdfs_paths() { } const char* libhdfs_dir = std::getenv("ARROW_LIBHDFS_DIR"); - if (libhdfs_dir != nullptr) { - search_paths.push_back(fs::path(libhdfs_dir)); - } + if (libhdfs_dir != nullptr) { search_paths.push_back(fs::path(libhdfs_dir)); } // All paths with file name for (auto& path : search_paths) { diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index d2db339de7ea2..6955bcb6c233e 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -34,6 +34,8 @@ set(ARROW_IPC_TEST_LINK_LIBS set(ARROW_IPC_SRCS adapter.cc file.cc + json.cc + json-internal.cc metadata.cc metadata-internal.cc ) @@ -79,6 +81,10 @@ ADD_ARROW_TEST(ipc-metadata-test) ARROW_TEST_LINK_LIBRARIES(ipc-metadata-test ${ARROW_IPC_TEST_LINK_LIBS}) +ADD_ARROW_TEST(ipc-json-test) +ARROW_TEST_LINK_LIBRARIES(ipc-json-test + ${ARROW_IPC_TEST_LINK_LIBS}) + # make clean will delete the generated file set_source_files_properties(Metadata_generated.h PROPERTIES GENERATED TRUE) @@ -114,6 +120,7 @@ add_dependencies(arrow_objlib metadata_fbs) install(FILES adapter.h file.h + json.h metadata.h DESTINATION include/arrow/ipc) diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index 74786bf85ffb4..da718c08d5480 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -106,7 +106,7 @@ Status VisitArray(const Array* arr, std::vector* field_nodes buffers->push_back(binary_arr->data()); } else if (arr->type_enum() == Type::LIST) { const auto list_arr = static_cast(arr); - buffers->push_back(list_arr->offset_buffer()); + buffers->push_back(list_arr->offsets()); RETURN_NOT_OK(VisitArray( list_arr->values().get(), field_nodes, buffers, max_recursion_depth - 1)); } else if (arr->type_enum() == Type::STRUCT) { diff --git a/cpp/src/arrow/ipc/ipc-json-test.cc b/cpp/src/arrow/ipc/ipc-json-test.cc new file mode 100644 index 0000000000000..a51371c62005b --- /dev/null +++ b/cpp/src/arrow/ipc/ipc-json-test.cc @@ -0,0 +1,353 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
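The new test file below leans on one pattern throughout: serialize into a rapidjson StringBuffer through a SAX-style Writer, re-parse the text with a rapidjson Document, and compare against the original object. A minimal self-contained sketch of that pattern (illustration only, not part of the patch; it uses only rapidjson 1.1.0 calls that also appear in the diff):

#include <cassert>

#include "rapidjson/document.h"
#include "rapidjson/stringbuffer.h"
#include "rapidjson/writer.h"

int main() {
  // Write: emit {"count": 5} through the SAX interface.
  rapidjson::StringBuffer sb;
  rapidjson::Writer<rapidjson::StringBuffer> writer(sb);
  writer.StartObject();
  writer.Key("count");
  writer.Int(5);
  writer.EndObject();

  // Read back: parse into a DOM and check the round trip.
  rapidjson::Document d;
  d.Parse(sb.GetString());
  assert(!d.HasParseError());
  assert(d["count"].GetInt() == 5);
  return 0;
}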
+
+#include <cstdint>
+#include <cstdio>
+#include <cstring>
+#include <memory>
+#include <sstream>
+#include <string>
+#include <vector>
+
+#include "gtest/gtest.h"
+
+#include "arrow/array.h"
+#include "arrow/ipc/json-internal.h"
+#include "arrow/ipc/json.h"
+#include "arrow/table.h"
+#include "arrow/test-util.h"
+#include "arrow/type.h"
+#include "arrow/type_traits.h"
+#include "arrow/types/primitive.h"
+#include "arrow/types/string.h"
+#include "arrow/types/struct.h"
+#include "arrow/util/memory-pool.h"
+#include "arrow/util/status.h"
+
+namespace arrow {
+namespace ipc {
+
+void TestSchemaRoundTrip(const Schema& schema) {
+  rj::StringBuffer sb;
+  rj::Writer<rj::StringBuffer> writer(sb);
+
+  ASSERT_OK(WriteJsonSchema(schema, &writer));
+
+  rj::Document d;
+  d.Parse(sb.GetString());
+
+  std::shared_ptr<Schema> out;
+  ASSERT_OK(ReadJsonSchema(d, &out));
+
+  ASSERT_TRUE(schema.Equals(out));
+}
+
+void TestArrayRoundTrip(const Array& array) {
+  static std::string name = "dummy";
+
+  rj::StringBuffer sb;
+  rj::Writer<rj::StringBuffer> writer(sb);
+
+  ASSERT_OK(WriteJsonArray(name, array, &writer));
+
+  std::string array_as_json = sb.GetString();
+
+  rj::Document d;
+  d.Parse(array_as_json);
+
+  if (d.HasParseError()) { FAIL() << "JSON parsing failed"; }
+
+  std::shared_ptr<Array> out;
+  ASSERT_OK(ReadJsonArray(default_memory_pool(), d, array.type(), &out));
+
+  ASSERT_TRUE(array.Equals(out)) << array_as_json;
+}
+
+template <typename T, typename C_TYPE>
+void CheckPrimitive(const std::shared_ptr<DataType>& type,
+    const std::vector<bool>& is_valid, const std::vector<C_TYPE>& values) {
+  MemoryPool* pool = default_memory_pool();
+  typename TypeTraits<T>::BuilderType builder(pool, type);
+
+  for (size_t i = 0; i < values.size(); ++i) {
+    if (is_valid[i]) {
+      ASSERT_OK(builder.Append(values[i]));
+    } else {
+      ASSERT_OK(builder.AppendNull());
+    }
+  }
+
+  std::shared_ptr<Array> array;
+  ASSERT_OK(builder.Finish(&array));
+  TestArrayRoundTrip(*array.get());
+}
+
+template <typename TYPE, typename C_TYPE>
+void MakeArray(const std::shared_ptr<DataType>& type, const std::vector<bool>& is_valid,
+    const std::vector<C_TYPE>& values, std::shared_ptr<Array>* out) {
+  std::shared_ptr<Buffer> values_buffer;
+  std::shared_ptr<Buffer> values_bitmap;
+
+  ASSERT_OK(test::CopyBufferFromVector(values, &values_buffer));
+  ASSERT_OK(test::GetBitmapFromBoolVector(is_valid, &values_bitmap));
+
+  using ArrayType = typename TypeTraits<TYPE>::ArrayType;
+
+  int32_t null_count = 0;
+  for (bool val : is_valid) {
+    if (!val) { ++null_count; }
+  }
+
+  *out = std::make_shared<ArrayType>(type, static_cast<int32_t>(values.size()),
+      values_buffer, null_count, values_bitmap);
+}
+
+TEST(TestJsonSchemaWriter, FlatTypes) {
+  std::vector<std::shared_ptr<Field>> fields = {field("f0", int8()),
+      field("f1", int16(), false), field("f2", int32()), field("f3", int64(), false),
+      field("f4", uint8()), field("f5", uint16()), field("f6", uint32()),
+      field("f7", uint64()), field("f8", float32()), field("f9", float64()),
+      field("f10", utf8()), field("f11", binary()), field("f12", list(int32())),
+      field("f13", struct_({field("s1", int32()), field("s2", utf8())})),
+      field("f14", date()), field("f15", timestamp(TimeUnit::NANO)),
+      field("f16", time(TimeUnit::MICRO)),
+      field("f17", union_({field("u1", int8()), field("u2", time(TimeUnit::MILLI))},
+                       {0, 1}, UnionMode::DENSE))};
+
+  Schema schema(fields);
+  TestSchemaRoundTrip(schema);
+}
+
+template <typename T>
+void PrimitiveTypesCheckOne() {
+  using c_type = typename T::c_type;
+
+  std::vector<bool> is_valid = {true, false, true, true, true, false, true, true};
+  std::vector<c_type> values = {0, 1, 2, 3, 4, 5, 6, 7};
+  CheckPrimitive<T, c_type>(std::make_shared<T>(), is_valid, values);
+}
+
+TEST(TestJsonArrayWriter, PrimitiveTypes) {
+  PrimitiveTypesCheckOne<Int8Type>();
+  PrimitiveTypesCheckOne<Int16Type>();
+  PrimitiveTypesCheckOne<Int32Type>();
+  PrimitiveTypesCheckOne<Int64Type>();
+  PrimitiveTypesCheckOne<UInt8Type>();
+  PrimitiveTypesCheckOne<UInt16Type>();
+  PrimitiveTypesCheckOne<UInt32Type>();
+  PrimitiveTypesCheckOne<UInt64Type>();
+  PrimitiveTypesCheckOne<FloatType>();
+  PrimitiveTypesCheckOne<DoubleType>();
+
+  std::vector<bool> is_valid = {true, false, true, true, true, false, true, true};
+  std::vector<std::string> values = {"foo", "bar", "", "baz", "qux", "foo", "a", "1"};
+
+  CheckPrimitive<StringType, std::string>(utf8(), is_valid, values);
+  CheckPrimitive<BinaryType, std::string>(binary(), is_valid, values);
+}
+
+TEST(TestJsonArrayWriter, NestedTypes) {
+  auto value_type = int32();
+
+  std::vector<bool> values_is_valid = {true, false, true, true, false, true, true};
+  std::vector<int32_t> values = {0, 1, 2, 3, 4, 5, 6};
+
+  std::shared_ptr<Array> values_array;
+  MakeArray<Int32Type, int32_t>(int32(), values_is_valid, values, &values_array);
+
+  // List
+  std::vector<bool> list_is_valid = {true, false, true, true, true};
+  std::vector<int32_t> offsets = {0, 0, 0, 1, 4, 7};
+
+  std::shared_ptr<Buffer> list_bitmap;
+  ASSERT_OK(test::GetBitmapFromBoolVector(list_is_valid, &list_bitmap));
+  std::shared_ptr<Buffer> offsets_buffer = test::GetBufferFromVector(offsets);
+
+  ListArray list_array(list(value_type), 5, offsets_buffer, values_array, 1, list_bitmap);
+
+  TestArrayRoundTrip(list_array);
+
+  // Struct
+  std::vector<bool> struct_is_valid = {true, false, true, true, true, false, true};
+  std::shared_ptr<Buffer> struct_bitmap;
+  ASSERT_OK(test::GetBitmapFromBoolVector(struct_is_valid, &struct_bitmap));
+
+  auto struct_type =
+      struct_({field("f1", int32()), field("f2", int32()), field("f3", int32())});
+
+  std::vector<std::shared_ptr<Array>> fields = {values_array, values_array, values_array};
+  StructArray struct_array(
+      struct_type, static_cast<int32_t>(struct_is_valid.size()), fields, 2, struct_bitmap);
+  TestArrayRoundTrip(struct_array);
+}
+
+// Data generation for test case below
+void MakeBatchArrays(const std::shared_ptr<Schema>& schema, const int num_rows,
+    std::vector<std::shared_ptr<Array>>* arrays) {
+  std::vector<bool> is_valid;
+  test::random_is_valid(num_rows, 0.25, &is_valid);
+
+  std::vector<int8_t> v1_values;
+  std::vector<int32_t> v2_values;
+
+  test::randint<int8_t>(num_rows, 0, 100, &v1_values);
+  test::randint<int32_t>(num_rows, 0, 100, &v2_values);
+
+  std::shared_ptr<Array> v1;
+  MakeArray<Int8Type, int8_t>(schema->field(0)->type, is_valid, v1_values, &v1);
+
+  std::shared_ptr<Array> v2;
+  MakeArray<Int32Type, int32_t>(schema->field(1)->type, is_valid, v2_values, &v2);
+
+  static const int kBufferSize = 10;
+  static uint8_t buffer[kBufferSize];
+  static uint32_t seed = 0;
+  StringBuilder string_builder(default_memory_pool(), utf8());
+  for (int i = 0; i < num_rows; ++i) {
+    if (!is_valid[i]) {
+      string_builder.AppendNull();
+    } else {
+      test::random_ascii(kBufferSize, seed++, buffer);
+      string_builder.Append(buffer, kBufferSize);
+    }
+  }
+  std::shared_ptr<Array> v3;
+  ASSERT_OK(string_builder.Finish(&v3));
+
+  arrays->emplace_back(v1);
+  arrays->emplace_back(v2);
+  arrays->emplace_back(v3);
+}
+
+TEST(TestJsonFileReadWrite, BasicRoundTrip) {
+  auto v1_type = int8();
+  auto v2_type = int32();
+  auto v3_type = utf8();
+
+  std::shared_ptr<Schema> schema(
+      new Schema({field("f1", v1_type), field("f2", v2_type), field("f3", v3_type)}));
+
+  std::unique_ptr<JsonWriter> writer;
+  ASSERT_OK(JsonWriter::Open(schema, &writer));
+
+  const int nbatches = 3;
+  std::vector<std::shared_ptr<RecordBatch>> batches;
+  for (int i = 0; i < nbatches; ++i) {
+    int32_t num_rows = 5 + i * 5;
+    std::vector<std::shared_ptr<Array>> arrays;
+
+    MakeBatchArrays(schema, num_rows, &arrays);
+    batches.emplace_back(std::make_shared<RecordBatch>(schema, num_rows, arrays));
+    ASSERT_OK(writer->WriteRecordBatch(arrays, num_rows));
+  }
+
+  std::string result;
+  ASSERT_OK(writer->Finish(&result));
+
+  std::unique_ptr<JsonReader> reader;
+
+  auto buffer = std::make_shared<Buffer>(
+      reinterpret_cast<const uint8_t*>(result.c_str()),
+      static_cast<int32_t>(result.size()));
+
+  ASSERT_OK(JsonReader::Open(buffer, &reader));
+  ASSERT_TRUE(reader->schema()->Equals(*schema.get()));
+
+  ASSERT_EQ(nbatches, reader->num_record_batches());
+
+  for (int i = 0; i < nbatches; ++i) {
+    std::shared_ptr<RecordBatch> batch;
+    ASSERT_OK(reader->GetRecordBatch(i, &batch));
+    ASSERT_TRUE(batch->Equals(*batches[i].get()));
+  }
+}
+
+TEST(TestJsonFileReadWrite, MinimalFormatExample) {
+  static const char* example = R"example(
+{
+  "schema": {
+    "fields": [
+      {
+        "name": "foo",
+        "type": {"name": "int", "isSigned": true, "bitWidth": 32},
+        "nullable": true, "children": [],
+        "typeLayout": [
+          {"type": "VALIDITY", "typeBitWidth": 1},
+          {"type": "DATA", "typeBitWidth": 32}
+        ]
+      },
+      {
+        "name": "bar",
+        "type": {"name": "floatingpoint", "precision": "DOUBLE"},
+        "nullable": true, "children": [],
+        "typeLayout": [
+          {"type": "VALIDITY", "typeBitWidth": 1},
+          {"type": "DATA", "typeBitWidth": 64}
+        ]
+      }
+    ]
+  },
+  "batches": [
+    {
+      "count": 5,
+      "columns": [
+        {
+          "name": "foo",
+          "count": 5,
+          "DATA": [1, 2, 3, 4, 5],
+          "VALIDITY": [1, 0, 1, 1, 1]
+        },
+        {
+          "name": "bar",
+          "count": 5,
+          "DATA": [1.0, 2.0, 3.0, 4.0, 5.0],
+          "VALIDITY": [1, 0, 0, 1, 1]
+        }
+      ]
+    }
+  ]
+}
+)example";
+
+  auto buffer = std::make_shared<Buffer>(
+      reinterpret_cast<const uint8_t*>(example), strlen(example));
+
+  std::unique_ptr<JsonReader> reader;
+  ASSERT_OK(JsonReader::Open(buffer, &reader));
+
+  Schema ex_schema({field("foo", int32()), field("bar", float64())});
+
+  ASSERT_TRUE(reader->schema()->Equals(ex_schema));
+  ASSERT_EQ(1, reader->num_record_batches());
+
+  std::shared_ptr<RecordBatch> batch;
+  ASSERT_OK(reader->GetRecordBatch(0, &batch));
+
+  std::vector<bool> foo_valid = {true, false, true, true, true};
+  std::vector<int32_t> foo_values = {1, 2, 3, 4, 5};
+  std::shared_ptr<Array> foo;
+  MakeArray<Int32Type, int32_t>(int32(), foo_valid, foo_values, &foo);
+  ASSERT_TRUE(batch->column(0)->Equals(foo));
+
+  std::vector<bool> bar_valid = {true, false, false, true, true};
+  std::vector<double> bar_values = {1, 2, 3, 4, 5};
+  std::shared_ptr<Array> bar;
+  MakeArray<DoubleType, double>(float64(), bar_valid, bar_values, &bar);
+  ASSERT_TRUE(batch->column(1)->Equals(bar));
+}
+
+}  // namespace ipc
+}  // namespace arrow
diff --git a/cpp/src/arrow/ipc/json-internal.cc b/cpp/src/arrow/ipc/json-internal.cc
new file mode 100644
index 0000000000000..31fe35b44cef7
--- /dev/null
+++ b/cpp/src/arrow/ipc/json-internal.cc
@@ -0,0 +1,1113 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
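json-internal.cc below implements both directions behind four entry points: WriteJsonSchema, ReadJsonSchema, WriteJsonArray, and ReadJsonArray. A hedged sketch of how a schema round trip composes out of them, mirroring TestSchemaRoundTrip in the test file above (sketch only; it assumes the rj alias and the RETURN_NOT_OK macro this file already uses):

Status RoundTripSchema(const Schema& in, std::shared_ptr<Schema>* out) {
  rj::StringBuffer sb;
  rj::Writer<rj::StringBuffer> writer(sb);
  RETURN_NOT_OK(WriteJsonSchema(in, &writer));  // schema -> JSON text

  rj::Document d;
  d.Parse(sb.GetString());                      // JSON text -> DOM
  return ReadJsonSchema(d, out);                // DOM -> schema
}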
+ +#include "arrow/ipc/json-internal.h" + +#include +#include +#include +#include +#include +#include + +#include "rapidjson/stringbuffer.h" +#include "rapidjson/writer.h" + +#include "arrow/array.h" +#include "arrow/schema.h" +#include "arrow/type.h" +#include "arrow/type_traits.h" +#include "arrow/types/list.h" +#include "arrow/types/primitive.h" +#include "arrow/types/string.h" +#include "arrow/types/struct.h" +#include "arrow/util/bit-util.h" +#include "arrow/util/memory-pool.h" +#include "arrow/util/status.h" + +namespace arrow { +namespace ipc { + +using RjArray = rj::Value::ConstArray; +using RjObject = rj::Value::ConstObject; + +enum class BufferType : char { DATA, OFFSET, TYPE, VALIDITY }; + +static std::string GetBufferTypeName(BufferType type) { + switch (type) { + case BufferType::DATA: + return "DATA"; + case BufferType::OFFSET: + return "OFFSET"; + case BufferType::TYPE: + return "TYPE"; + case BufferType::VALIDITY: + return "VALIDITY"; + default: + break; + } + return "UNKNOWN"; +} + +static std::string GetFloatingPrecisionName(FloatingPointMeta::Precision precision) { + switch (precision) { + case FloatingPointMeta::HALF: + return "HALF"; + case FloatingPointMeta::SINGLE: + return "SINGLE"; + case FloatingPointMeta::DOUBLE: + return "DOUBLE"; + default: + break; + } + return "UNKNOWN"; +} + +static std::string GetTimeUnitName(TimeUnit unit) { + switch (unit) { + case TimeUnit::SECOND: + return "SECOND"; + case TimeUnit::MILLI: + return "MILLISECOND"; + case TimeUnit::MICRO: + return "MICROSECOND"; + case TimeUnit::NANO: + return "NANOSECOND"; + default: + break; + } + return "UNKNOWN"; +} + +class BufferLayout { + public: + BufferLayout(BufferType type, int bit_width) : type_(type), bit_width_(bit_width) {} + + BufferType type() const { return type_; } + int bit_width() const { return bit_width_; } + + private: + BufferType type_; + int bit_width_; +}; + +static const BufferLayout kValidityBuffer(BufferType::VALIDITY, 1); +static const BufferLayout kOffsetBuffer(BufferType::OFFSET, 32); +static const BufferLayout kTypeBuffer(BufferType::TYPE, 32); +static const BufferLayout kBooleanBuffer(BufferType::DATA, 1); +static const BufferLayout kValues64(BufferType::DATA, 64); +static const BufferLayout kValues32(BufferType::DATA, 32); +static const BufferLayout kValues16(BufferType::DATA, 16); +static const BufferLayout kValues8(BufferType::DATA, 8); + +class JsonSchemaWriter : public TypeVisitor { + public: + explicit JsonSchemaWriter(const Schema& schema, RjWriter* writer) + : schema_(schema), writer_(writer) {} + + Status Write() { + writer_->StartObject(); + writer_->Key("fields"); + writer_->StartArray(); + for (const std::shared_ptr& field : schema_.fields()) { + RETURN_NOT_OK(VisitField(*field.get())); + } + writer_->EndArray(); + writer_->EndObject(); + return Status::OK(); + } + + Status VisitField(const Field& field) { + writer_->StartObject(); + + writer_->Key("name"); + writer_->String(field.name.c_str()); + + writer_->Key("nullable"); + writer_->Bool(field.nullable); + + // Visit the type + RETURN_NOT_OK(field.type->Accept(this)); + writer_->EndObject(); + + return Status::OK(); + } + + void SetNoChildren() { + writer_->Key("children"); + writer_->StartArray(); + writer_->EndArray(); + } + + template + typename std::enable_if::value || + std::is_base_of::value || + std::is_base_of::value, + void>::type + WriteTypeMetadata(const T& type) {} + + template + typename std::enable_if::value, void>::type + WriteTypeMetadata(const T& type) { + writer_->Key("bitWidth"); + 
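+    // Together with "isSigned" just below, "bitWidth" is all the schema
+    // reader needs to map this back to the exact Int/UInt type (see
+    // JsonSchemaReader::GetInteger further down).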
writer_->Int(type.bit_width()); + writer_->Key("isSigned"); + writer_->Bool(type.is_signed()); + } + + template + typename std::enable_if::value, void>::type + WriteTypeMetadata(const T& type) { + writer_->Key("precision"); + writer_->String(GetFloatingPrecisionName(type.precision())); + } + + template + typename std::enable_if::value, void>::type + WriteTypeMetadata(const T& type) { + writer_->Key("unit"); + switch (type.unit) { + case IntervalType::Unit::YEAR_MONTH: + writer_->String("YEAR_MONTH"); + break; + case IntervalType::Unit::DAY_TIME: + writer_->String("DAY_TIME"); + break; + } + } + + template + typename std::enable_if::value || + std::is_base_of::value, + void>::type + WriteTypeMetadata(const T& type) { + writer_->Key("unit"); + writer_->String(GetTimeUnitName(type.unit)); + } + + template + typename std::enable_if::value, void>::type + WriteTypeMetadata(const T& type) { + writer_->Key("precision"); + writer_->Int(type.precision); + writer_->Key("scale"); + writer_->Int(type.scale); + } + + template + typename std::enable_if::value, void>::type + WriteTypeMetadata(const T& type) { + writer_->Key("mode"); + switch (type.mode) { + case UnionMode::SPARSE: + writer_->String("SPARSE"); + break; + case UnionMode::DENSE: + writer_->String("DENSE"); + break; + } + + // Write type ids + writer_->Key("typeIds"); + writer_->StartArray(); + for (size_t i = 0; i < type.type_ids.size(); ++i) { + writer_->Uint(type.type_ids[i]); + } + writer_->EndArray(); + } + + // TODO(wesm): Other Type metadata + + template + void WriteName(const std::string& typeclass, const T& type) { + writer_->Key("type"); + writer_->StartObject(); + writer_->Key("name"); + writer_->String(typeclass); + WriteTypeMetadata(type); + writer_->EndObject(); + } + + template + Status WritePrimitive(const std::string& typeclass, const T& type, + const std::vector& buffer_layout) { + WriteName(typeclass, type); + SetNoChildren(); + WriteBufferLayout(buffer_layout); + return Status::OK(); + } + + template + Status WriteVarBytes(const std::string& typeclass, const T& type) { + WriteName(typeclass, type); + SetNoChildren(); + WriteBufferLayout({kValidityBuffer, kOffsetBuffer, kValues8}); + return Status::OK(); + } + + void WriteBufferLayout(const std::vector& buffer_layout) { + writer_->Key("typeLayout"); + writer_->StartArray(); + + for (const BufferLayout& buffer : buffer_layout) { + writer_->StartObject(); + writer_->Key("type"); + writer_->String(GetBufferTypeName(buffer.type())); + + writer_->Key("typeBitWidth"); + writer_->Int(buffer.bit_width()); + + writer_->EndObject(); + } + writer_->EndArray(); + } + + Status WriteChildren(const std::vector>& children) { + writer_->Key("children"); + writer_->StartArray(); + for (const std::shared_ptr& field : children) { + RETURN_NOT_OK(VisitField(*field.get())); + } + writer_->EndArray(); + return Status::OK(); + } + + Status Visit(const NullType& type) override { return WritePrimitive("null", type, {}); } + + Status Visit(const BooleanType& type) override { + return WritePrimitive("bool", type, {kValidityBuffer, kBooleanBuffer}); + } + + Status Visit(const Int8Type& type) override { + return WritePrimitive("int", type, {kValidityBuffer, kValues8}); + } + + Status Visit(const Int16Type& type) override { + return WritePrimitive("int", type, {kValidityBuffer, kValues16}); + } + + Status Visit(const Int32Type& type) override { + return WritePrimitive("int", type, {kValidityBuffer, kValues32}); + } + + Status Visit(const Int64Type& type) override { + return WritePrimitive("int", type, 
{kValidityBuffer, kValues64}); + } + + Status Visit(const UInt8Type& type) override { + return WritePrimitive("int", type, {kValidityBuffer, kValues8}); + } + + Status Visit(const UInt16Type& type) override { + return WritePrimitive("int", type, {kValidityBuffer, kValues16}); + } + + Status Visit(const UInt32Type& type) override { + return WritePrimitive("int", type, {kValidityBuffer, kValues32}); + } + + Status Visit(const UInt64Type& type) override { + return WritePrimitive("int", type, {kValidityBuffer, kValues64}); + } + + Status Visit(const HalfFloatType& type) override { + return WritePrimitive("floatingpoint", type, {kValidityBuffer, kValues16}); + } + + Status Visit(const FloatType& type) override { + return WritePrimitive("floatingpoint", type, {kValidityBuffer, kValues32}); + } + + Status Visit(const DoubleType& type) override { + return WritePrimitive("floatingpoint", type, {kValidityBuffer, kValues64}); + } + + Status Visit(const StringType& type) override { return WriteVarBytes("utf8", type); } + + Status Visit(const BinaryType& type) override { return WriteVarBytes("binary", type); } + + Status Visit(const DateType& type) override { + return WritePrimitive("date", type, {kValidityBuffer, kValues64}); + } + + Status Visit(const TimeType& type) override { + return WritePrimitive("time", type, {kValidityBuffer, kValues64}); + } + + Status Visit(const TimestampType& type) override { + return WritePrimitive("timestamp", type, {kValidityBuffer, kValues64}); + } + + Status Visit(const IntervalType& type) override { + return WritePrimitive("interval", type, {kValidityBuffer, kValues64}); + } + + Status Visit(const DecimalType& type) override { return Status::NotImplemented("NYI"); } + + Status Visit(const ListType& type) override { + WriteName("list", type); + RETURN_NOT_OK(WriteChildren(type.children())); + WriteBufferLayout({kValidityBuffer, kOffsetBuffer}); + return Status::OK(); + } + + Status Visit(const StructType& type) override { + WriteName("struct", type); + WriteChildren(type.children()); + WriteBufferLayout({kValidityBuffer, kTypeBuffer}); + return Status::OK(); + } + + Status Visit(const UnionType& type) override { + WriteName("union", type); + WriteChildren(type.children()); + + if (type.mode == UnionMode::SPARSE) { + WriteBufferLayout({kValidityBuffer, kTypeBuffer}); + } else { + WriteBufferLayout({kValidityBuffer, kTypeBuffer, kOffsetBuffer}); + } + return Status::OK(); + } + + private: + const Schema& schema_; + RjWriter* writer_; +}; + +class JsonArrayWriter : public ArrayVisitor { + public: + explicit JsonArrayWriter(const std::string& name, const Array& array, RjWriter* writer) + : name_(name), array_(array), writer_(writer) {} + + Status Write() { return VisitArray(name_, array_); } + + Status VisitArray(const std::string& name, const Array& arr) { + writer_->StartObject(); + writer_->Key("name"); + writer_->String(name); + + writer_->Key("count"); + writer_->Int(arr.length()); + + RETURN_NOT_OK(arr.Accept(this)); + + writer_->EndObject(); + return Status::OK(); + } + + template + typename std::enable_if::value, void>::type WriteDataValues( + const T& arr) { + const auto data = arr.raw_data(); + for (int i = 0; i < arr.length(); ++i) { + writer_->Int64(data[i]); + } + } + + template + typename std::enable_if::value, void>::type WriteDataValues( + const T& arr) { + const auto data = arr.raw_data(); + for (int i = 0; i < arr.length(); ++i) { + writer_->Uint64(data[i]); + } + } + + template + typename std::enable_if::value, void>::type WriteDataValues( + const T& 
arr) { + const auto data = arr.raw_data(); + for (int i = 0; i < arr.length(); ++i) { + writer_->Double(data[i]); + } + } + + // String (Utf8), Binary + template + typename std::enable_if::value, void>::type + WriteDataValues(const T& arr) { + for (int i = 0; i < arr.length(); ++i) { + int32_t length; + const char* buf = reinterpret_cast(arr.GetValue(i, &length)); + writer_->String(buf, length); + } + } + + template + typename std::enable_if::value, void>::type + WriteDataValues(const T& arr) { + for (int i = 0; i < arr.length(); ++i) { + writer_->Bool(arr.Value(i)); + } + } + + template + void WriteDataField(const T& arr) { + writer_->Key("DATA"); + writer_->StartArray(); + WriteDataValues(arr); + writer_->EndArray(); + } + + template + void WriteOffsetsField(const T* offsets, int32_t length) { + writer_->Key("OFFSETS"); + writer_->StartArray(); + for (int i = 0; i < length; ++i) { + writer_->Int64(offsets[i]); + } + writer_->EndArray(); + } + + void WriteValidityField(const Array& arr) { + writer_->Key("VALIDITY"); + writer_->StartArray(); + if (arr.null_count() > 0) { + for (int i = 0; i < arr.length(); ++i) { + writer_->Int(arr.IsNull(i) ? 0 : 1); + } + } else { + for (int i = 0; i < arr.length(); ++i) { + writer_->Int(1); + } + } + writer_->EndArray(); + } + + void SetNoChildren() { + writer_->Key("children"); + writer_->StartArray(); + writer_->EndArray(); + } + + template + Status WritePrimitive(const T& array) { + WriteValidityField(array); + WriteDataField(array); + SetNoChildren(); + return Status::OK(); + } + + template + Status WriteVarBytes(const T& array) { + WriteValidityField(array); + WriteOffsetsField(array.raw_offsets(), array.length() + 1); + WriteDataField(array); + SetNoChildren(); + return Status::OK(); + } + + Status WriteChildren(const std::vector>& fields, + const std::vector>& arrays) { + writer_->Key("children"); + writer_->StartArray(); + for (size_t i = 0; i < fields.size(); ++i) { + RETURN_NOT_OK(VisitArray(fields[i]->name, *arrays[i].get())); + } + writer_->EndArray(); + return Status::OK(); + } + + Status Visit(const NullArray& array) override { + SetNoChildren(); + return Status::OK(); + } + + Status Visit(const BooleanArray& array) override { return WritePrimitive(array); } + + Status Visit(const Int8Array& array) override { return WritePrimitive(array); } + + Status Visit(const Int16Array& array) override { return WritePrimitive(array); } + + Status Visit(const Int32Array& array) override { return WritePrimitive(array); } + + Status Visit(const Int64Array& array) override { return WritePrimitive(array); } + + Status Visit(const UInt8Array& array) override { return WritePrimitive(array); } + + Status Visit(const UInt16Array& array) override { return WritePrimitive(array); } + + Status Visit(const UInt32Array& array) override { return WritePrimitive(array); } + + Status Visit(const UInt64Array& array) override { return WritePrimitive(array); } + + Status Visit(const HalfFloatArray& array) override { return WritePrimitive(array); } + + Status Visit(const FloatArray& array) override { return WritePrimitive(array); } + + Status Visit(const DoubleArray& array) override { return WritePrimitive(array); } + + Status Visit(const StringArray& array) override { return WriteVarBytes(array); } + + Status Visit(const BinaryArray& array) override { return WriteVarBytes(array); } + + Status Visit(const DateArray& array) override { return Status::NotImplemented("date"); } + + Status Visit(const TimeArray& array) override { return Status::NotImplemented("time"); } + + 
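+  // Temporal and decimal serialization is still stubbed out: the date/time
+  // visitors return Status::NotImplemented so unsupported data fails loudly
+  // instead of being written incorrectly.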
Status Visit(const TimestampArray& array) override { + return Status::NotImplemented("timestamp"); + } + + Status Visit(const IntervalArray& array) override { + return Status::NotImplemented("interval"); + } + + Status Visit(const DecimalArray& array) override { + return Status::NotImplemented("decimal"); + } + + Status Visit(const ListArray& array) override { + WriteValidityField(array); + WriteOffsetsField(array.raw_offsets(), array.length() + 1); + auto type = static_cast(array.type().get()); + return WriteChildren(type->children(), {array.values()}); + } + + Status Visit(const StructArray& array) override { + WriteValidityField(array); + auto type = static_cast(array.type().get()); + return WriteChildren(type->children(), array.fields()); + } + + Status Visit(const UnionArray& array) override { + return Status::NotImplemented("union"); + } + + private: + const std::string& name_; + const Array& array_; + RjWriter* writer_; +}; + +class JsonSchemaReader { + public: + explicit JsonSchemaReader(const rj::Value& json_schema) : json_schema_(json_schema) {} + + Status GetSchema(std::shared_ptr* schema) { + const auto& obj_schema = json_schema_.GetObject(); + + const auto& json_fields = obj_schema.FindMember("fields"); + RETURN_NOT_ARRAY("fields", json_fields, obj_schema); + + std::vector> fields; + RETURN_NOT_OK(GetFieldsFromArray(json_fields->value, &fields)); + + *schema = std::make_shared(fields); + return Status::OK(); + } + + Status GetFieldsFromArray( + const rj::Value& obj, std::vector>* fields) { + const auto& values = obj.GetArray(); + + fields->resize(values.Size()); + for (size_t i = 0; i < fields->size(); ++i) { + RETURN_NOT_OK(GetField(values[i], &(*fields)[i])); + } + return Status::OK(); + } + + Status GetField(const rj::Value& obj, std::shared_ptr* field) { + if (!obj.IsObject()) { return Status::Invalid("Field was not a JSON object"); } + const auto& json_field = obj.GetObject(); + + const auto& json_name = json_field.FindMember("name"); + RETURN_NOT_STRING("name", json_name, json_field); + + const auto& json_nullable = json_field.FindMember("nullable"); + RETURN_NOT_BOOL("nullable", json_nullable, json_field); + + const auto& json_type = json_field.FindMember("type"); + RETURN_NOT_OBJECT("type", json_type, json_field); + + const auto& json_children = json_field.FindMember("children"); + RETURN_NOT_ARRAY("children", json_children, json_field); + + std::vector> children; + RETURN_NOT_OK(GetFieldsFromArray(json_children->value, &children)); + + std::shared_ptr type; + RETURN_NOT_OK(GetType(json_type->value.GetObject(), children, &type)); + + *field = std::make_shared( + json_name->value.GetString(), type, json_nullable->value.GetBool()); + return Status::OK(); + } + + Status GetInteger( + const rj::Value::ConstObject& json_type, std::shared_ptr* type) { + const auto& json_bit_width = json_type.FindMember("bitWidth"); + RETURN_NOT_INT("bitWidth", json_bit_width, json_type); + + const auto& json_is_signed = json_type.FindMember("isSigned"); + RETURN_NOT_BOOL("isSigned", json_is_signed, json_type); + + bool is_signed = json_is_signed->value.GetBool(); + int bit_width = json_bit_width->value.GetInt(); + + switch (bit_width) { + case 8: + *type = is_signed ? int8() : uint8(); + break; + case 16: + *type = is_signed ? int16() : uint16(); + break; + case 32: + *type = is_signed ? int32() : uint32(); + break; + case 64: + *type = is_signed ? 
int64() : uint64(); + break; + default: + std::stringstream ss; + ss << "Invalid bit width: " << bit_width; + return Status::Invalid(ss.str()); + } + return Status::OK(); + } + + Status GetFloatingPoint(const RjObject& json_type, std::shared_ptr* type) { + const auto& json_precision = json_type.FindMember("precision"); + RETURN_NOT_STRING("precision", json_precision, json_type); + + std::string precision = json_precision->value.GetString(); + + if (precision == "DOUBLE") { + *type = float64(); + } else if (precision == "SINGLE") { + *type = float32(); + } else if (precision == "HALF") { + *type = float16(); + } else { + std::stringstream ss; + ss << "Invalid precision: " << precision; + return Status::Invalid(ss.str()); + } + return Status::OK(); + } + + template + Status GetTimeLike(const RjObject& json_type, std::shared_ptr* type) { + const auto& json_unit = json_type.FindMember("unit"); + RETURN_NOT_STRING("unit", json_unit, json_type); + + std::string unit_str = json_unit->value.GetString(); + + TimeUnit unit; + + if (unit_str == "SECOND") { + unit = TimeUnit::SECOND; + } else if (unit_str == "MILLISECOND") { + unit = TimeUnit::MILLI; + } else if (unit_str == "MICROSECOND") { + unit = TimeUnit::MICRO; + } else if (unit_str == "NANOSECOND") { + unit = TimeUnit::NANO; + } else { + std::stringstream ss; + ss << "Invalid time unit: " << unit_str; + return Status::Invalid(ss.str()); + } + + *type = std::make_shared(unit); + + return Status::OK(); + } + + Status GetUnion(const RjObject& json_type, + const std::vector>& children, + std::shared_ptr* type) { + const auto& json_mode = json_type.FindMember("mode"); + RETURN_NOT_STRING("mode", json_mode, json_type); + + std::string mode_str = json_mode->value.GetString(); + UnionMode mode; + + if (mode_str == "SPARSE") { + mode = UnionMode::SPARSE; + } else if (mode_str == "DENSE") { + mode = UnionMode::DENSE; + } else { + std::stringstream ss; + ss << "Invalid union mode: " << mode_str; + return Status::Invalid(ss.str()); + } + + const auto& json_type_ids = json_type.FindMember("typeIds"); + RETURN_NOT_ARRAY("typeIds", json_type_ids, json_type); + + std::vector type_ids; + const auto& id_array = json_type_ids->value.GetArray(); + for (const rj::Value& val : id_array) { + DCHECK(val.IsUint()); + type_ids.push_back(val.GetUint()); + } + + *type = union_(children, type_ids, mode); + + return Status::OK(); + } + + Status GetType(const RjObject& json_type, + const std::vector>& children, + std::shared_ptr* type) { + const auto& json_type_name = json_type.FindMember("name"); + RETURN_NOT_STRING("name", json_type_name, json_type); + + std::string type_name = json_type_name->value.GetString(); + + if (type_name == "int") { + return GetInteger(json_type, type); + } else if (type_name == "floatingpoint") { + return GetFloatingPoint(json_type, type); + } else if (type_name == "bool") { + *type = boolean(); + } else if (type_name == "utf8") { + *type = utf8(); + } else if (type_name == "binary") { + *type = binary(); + } else if (type_name == "null") { + *type = null(); + } else if (type_name == "date") { + *type = date(); + } else if (type_name == "time") { + return GetTimeLike(json_type, type); + } else if (type_name == "timestamp") { + return GetTimeLike(json_type, type); + } else if (type_name == "list") { + *type = list(children[0]); + } else if (type_name == "struct") { + *type = struct_(children); + } else { + return GetUnion(json_type, children, type); + } + return Status::OK(); + } + + private: + const rj::Value& json_schema_; +}; + +class 
JsonArrayReader { + public: + explicit JsonArrayReader(MemoryPool* pool) : pool_(pool) {} + + Status GetValidityBuffer(const std::vector& is_valid, int32_t* null_count, + std::shared_ptr* validity_buffer) { + int length = static_cast(is_valid.size()); + + std::shared_ptr out_buffer; + RETURN_NOT_OK(GetEmptyBitmap(pool_, length, &out_buffer)); + uint8_t* bitmap = out_buffer->mutable_data(); + + *null_count = 0; + for (int i = 0; i < length; ++i) { + if (!is_valid[i]) { + ++(*null_count); + continue; + } + BitUtil::SetBit(bitmap, i); + } + + *validity_buffer = out_buffer; + return Status::OK(); + } + + template + typename std::enable_if::value || + std::is_base_of::value, + Status>::type + ReadArray(const RjObject& json_array, int32_t length, const std::vector& is_valid, + const std::shared_ptr& type, std::shared_ptr* array) { + typename TypeTraits::BuilderType builder(pool_, type); + + const auto& json_data = json_array.FindMember("DATA"); + RETURN_NOT_ARRAY("DATA", json_data, json_array); + + const auto& json_data_arr = json_data->value.GetArray(); + + DCHECK_EQ(static_cast(json_data_arr.Size()), length); + for (int i = 0; i < length; ++i) { + if (!is_valid[i]) { + builder.AppendNull(); + continue; + } + + const rj::Value& val = json_data_arr[i]; + if (IsSignedInt::value) { + DCHECK(val.IsInt()); + builder.Append(val.GetInt64()); + } else if (IsUnsignedInt::value) { + DCHECK(val.IsUint()); + builder.Append(val.GetUint64()); + } else if (IsFloatingPoint::value) { + DCHECK(val.IsFloat()); + builder.Append(val.GetFloat()); + } else if (std::is_base_of::value) { + DCHECK(val.IsBool()); + builder.Append(val.GetBool()); + } else { + // We are in the wrong function + return Status::Invalid(type->ToString()); + } + } + + return builder.Finish(array); + } + + template + typename std::enable_if::value, Status>::type ReadArray( + const RjObject& json_array, int32_t length, const std::vector& is_valid, + const std::shared_ptr& type, std::shared_ptr* array) { + typename TypeTraits::BuilderType builder(pool_, type); + + const auto& json_data = json_array.FindMember("DATA"); + RETURN_NOT_ARRAY("DATA", json_data, json_array); + + const auto& json_data_arr = json_data->value.GetArray(); + + DCHECK_EQ(static_cast(json_data_arr.Size()), length); + for (int i = 0; i < length; ++i) { + if (!is_valid[i]) { + builder.AppendNull(); + continue; + } + + const rj::Value& val = json_data_arr[i]; + DCHECK(val.IsString()); + builder.Append(val.GetString()); + } + + return builder.Finish(array); + } + + template + typename std::enable_if::value, Status>::type ReadArray( + const RjObject& json_array, int32_t length, const std::vector& is_valid, + const std::shared_ptr& type, std::shared_ptr* array) { + const auto& json_offsets = json_array.FindMember("OFFSETS"); + RETURN_NOT_ARRAY("OFFSETS", json_offsets, json_array); + const auto& json_offsets_arr = json_offsets->value.GetArray(); + + int32_t null_count = 0; + std::shared_ptr validity_buffer; + RETURN_NOT_OK(GetValidityBuffer(is_valid, &null_count, &validity_buffer)); + + auto offsets_buffer = std::make_shared(pool_); + RETURN_NOT_OK(offsets_buffer->Resize((length + 1) * sizeof(int32_t))); + int32_t* offsets = reinterpret_cast(offsets_buffer->mutable_data()); + + for (int i = 0; i < length + 1; ++i) { + const rj::Value& val = json_offsets_arr[i]; + DCHECK(val.IsInt()); + offsets[i] = val.GetInt(); + } + + std::vector> children; + RETURN_NOT_OK(GetChildren(json_array, type, &children)); + DCHECK_EQ(children.size(), 1); + + *array = std::make_shared( + type, length, 
offsets_buffer, children[0], null_count, validity_buffer); + + return Status::OK(); + } + + template + typename std::enable_if::value, Status>::type ReadArray( + const RjObject& json_array, int32_t length, const std::vector& is_valid, + const std::shared_ptr& type, std::shared_ptr* array) { + int32_t null_count = 0; + std::shared_ptr validity_buffer; + RETURN_NOT_OK(GetValidityBuffer(is_valid, &null_count, &validity_buffer)); + + std::vector> fields; + RETURN_NOT_OK(GetChildren(json_array, type, &fields)); + + *array = + std::make_shared(type, length, fields, null_count, validity_buffer); + + return Status::OK(); + } + + template + typename std::enable_if::value, Status>::type ReadArray( + const RjObject& json_array, int32_t length, const std::vector& is_valid, + const std::shared_ptr& type, std::shared_ptr* array) { + *array = std::make_shared(type, length); + return Status::OK(); + } + + Status GetChildren(const RjObject& json_array, const std::shared_ptr& type, + std::vector>* array) { + const auto& json_children = json_array.FindMember("children"); + RETURN_NOT_ARRAY("children", json_children, json_array); + const auto& json_children_arr = json_children->value.GetArray(); + + if (type->num_children() != static_cast(json_children_arr.Size())) { + std::stringstream ss; + ss << "Expected " << type->num_children() << " children, but got " + << json_children_arr.Size(); + return Status::Invalid(ss.str()); + } + + for (int i = 0; i < static_cast(json_children_arr.Size()); ++i) { + const rj::Value& json_child = json_children_arr[i]; + DCHECK(json_child.IsObject()); + + std::shared_ptr child_field = type->child(i); + + auto it = json_child.FindMember("name"); + RETURN_NOT_STRING("name", it, json_child); + + DCHECK_EQ(it->value.GetString(), child_field->name); + std::shared_ptr child; + RETURN_NOT_OK(GetArray(json_children_arr[i], child_field->type, &child)); + array->emplace_back(child); + } + + return Status::OK(); + } + + Status GetArray(const rj::Value& obj, const std::shared_ptr& type, + std::shared_ptr* array) { + if (!obj.IsObject()) { + return Status::Invalid("Array element was not a JSON object"); + } + const auto& json_array = obj.GetObject(); + + const auto& json_length = json_array.FindMember("count"); + RETURN_NOT_INT("count", json_length, json_array); + int32_t length = json_length->value.GetInt(); + + const auto& json_valid_iter = json_array.FindMember("VALIDITY"); + RETURN_NOT_ARRAY("VALIDITY", json_valid_iter, json_array); + + const auto& json_validity = json_valid_iter->value.GetArray(); + + DCHECK_EQ(static_cast(json_validity.Size()), length); + + std::vector is_valid; + for (const rj::Value& val : json_validity) { + DCHECK(val.IsInt()); + is_valid.push_back(static_cast(val.GetInt())); + } + +#define TYPE_CASE(TYPE) \ + case TYPE::type_id: \ + return ReadArray(json_array, length, is_valid, type, array); + +#define NOT_IMPLEMENTED_CASE(TYPE_ENUM) \ + case Type::TYPE_ENUM: { \ + std::stringstream ss; \ + ss << type->ToString(); \ + return Status::NotImplemented(ss.str()); \ + } + + switch (type->type) { + TYPE_CASE(NullType); + TYPE_CASE(BooleanType); + TYPE_CASE(UInt8Type); + TYPE_CASE(Int8Type); + TYPE_CASE(UInt16Type); + TYPE_CASE(Int16Type); + TYPE_CASE(UInt32Type); + TYPE_CASE(Int32Type); + TYPE_CASE(UInt64Type); + TYPE_CASE(Int64Type); + TYPE_CASE(HalfFloatType); + TYPE_CASE(FloatType); + TYPE_CASE(DoubleType); + TYPE_CASE(StringType); + TYPE_CASE(BinaryType); + NOT_IMPLEMENTED_CASE(DATE); + NOT_IMPLEMENTED_CASE(TIMESTAMP); + NOT_IMPLEMENTED_CASE(TIME); + 
NOT_IMPLEMENTED_CASE(INTERVAL); + TYPE_CASE(ListType); + TYPE_CASE(StructType); + NOT_IMPLEMENTED_CASE(UNION); + default: + std::stringstream ss; + ss << type->ToString(); + return Status::NotImplemented(ss.str()); + } + +#undef TYPE_CASE +#undef NOT_IMPLEMENTED_CASE + + return Status::OK(); + } + + private: + MemoryPool* pool_; +}; + +Status WriteJsonSchema(const Schema& schema, RjWriter* json_writer) { + JsonSchemaWriter converter(schema, json_writer); + return converter.Write(); +} + +Status ReadJsonSchema(const rj::Value& json_schema, std::shared_ptr* schema) { + JsonSchemaReader converter(json_schema); + return converter.GetSchema(schema); +} + +Status WriteJsonArray( + const std::string& name, const Array& array, RjWriter* json_writer) { + JsonArrayWriter converter(name, array, json_writer); + return converter.Write(); +} + +Status ReadJsonArray(MemoryPool* pool, const rj::Value& json_array, + const std::shared_ptr& type, std::shared_ptr* array) { + JsonArrayReader converter(pool); + return converter.GetArray(json_array, type, array); +} + +Status ReadJsonArray(MemoryPool* pool, const rj::Value& json_array, const Schema& schema, + std::shared_ptr* array) { + if (!json_array.IsObject()) { return Status::Invalid("Element was not a JSON object"); } + + const auto& json_obj = json_array.GetObject(); + + const auto& json_name = json_obj.FindMember("name"); + RETURN_NOT_STRING("name", json_name, json_obj); + + std::string name = json_name->value.GetString(); + + std::shared_ptr result = nullptr; + for (const std::shared_ptr& field : schema.fields()) { + if (field->name == name) { + result = field; + break; + } + } + + if (result == nullptr) { + std::stringstream ss; + ss << "Field named " << name << " not found in schema"; + return Status::KeyError(ss.str()); + } + + return ReadJsonArray(pool, json_array, result->type, array); +} + +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/json-internal.h b/cpp/src/arrow/ipc/json-internal.h new file mode 100644 index 0000000000000..0c167a4ec53a2 --- /dev/null +++ b/cpp/src/arrow/ipc/json-internal.h @@ -0,0 +1,111 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
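The entry points above compose into a schema round trip: serialize with WriteJsonSchema, parse the resulting string, and rebuild with ReadJsonSchema. A minimal sketch of that flow follows; RoundTripSchema is a hypothetical helper, not part of this patch, and it assumes the RjWriter alias and error macros from json-internal.h below.

```cpp
// Hypothetical helper sketching the WriteJsonSchema / ReadJsonSchema round trip.
Status RoundTripSchema(const std::shared_ptr<Schema>& schema) {
  rj::StringBuffer sb;
  RjWriter writer(sb);
  RETURN_NOT_OK(WriteJsonSchema(*schema, &writer));

  // Parse the serialized form back into a RapidJSON document.
  rj::Document d;
  d.Parse(sb.GetString());
  if (d.HasParseError()) { return Status::Invalid("schema JSON did not parse"); }

  std::shared_ptr<Schema> result;
  RETURN_NOT_OK(ReadJsonSchema(d, &result));
  return schema->Equals(result) ? Status::OK()
                                : Status::Invalid("schema changed in round trip");
}
```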
+
+#ifndef ARROW_IPC_JSON_INTERNAL_H
+#define ARROW_IPC_JSON_INTERNAL_H
+
+#define RAPIDJSON_HAS_STDSTRING 1
+#define RAPIDJSON_HAS_CXX11_RVALUE_REFS 1
+#define RAPIDJSON_HAS_CXX11_RANGE_FOR 1
+
+#include <memory>
+#include <sstream>
+#include <string>
+
+#include "rapidjson/document.h"
+#include "rapidjson/stringbuffer.h"
+#include "rapidjson/writer.h"
+
+#include "arrow/type_fwd.h"
+#include "arrow/util/visibility.h"
+
+namespace rj = rapidjson;
+using RjWriter = rj::Writer<rj::StringBuffer>;
+
+#define RETURN_NOT_FOUND(TOK, NAME, PARENT) \
+  if (NAME == PARENT.MemberEnd()) {         \
+    std::stringstream ss;                   \
+    ss << "field " << TOK << " not found";  \
+    return Status::Invalid(ss.str());       \
+  }
+
+#define RETURN_NOT_STRING(TOK, NAME, PARENT) \
+  RETURN_NOT_FOUND(TOK, NAME, PARENT);       \
+  if (!NAME->value.IsString()) {             \
+    std::stringstream ss;                    \
+    ss << "field was not a string"           \
+       << " line " << __LINE__;              \
+    return Status::Invalid(ss.str());        \
+  }
+
+#define RETURN_NOT_BOOL(TOK, NAME, PARENT) \
+  RETURN_NOT_FOUND(TOK, NAME, PARENT);     \
+  if (!NAME->value.IsBool()) {             \
+    std::stringstream ss;                  \
+    ss << "field was not a boolean"        \
+       << " line " << __LINE__;            \
+    return Status::Invalid(ss.str());      \
+  }
+
+#define RETURN_NOT_INT(TOK, NAME, PARENT) \
+  RETURN_NOT_FOUND(TOK, NAME, PARENT);    \
+  if (!NAME->value.IsInt()) {             \
+    std::stringstream ss;                 \
+    ss << "field was not an int"          \
+       << " line " << __LINE__;           \
+    return Status::Invalid(ss.str());     \
+  }
+
+#define RETURN_NOT_ARRAY(TOK, NAME, PARENT) \
+  RETURN_NOT_FOUND(TOK, NAME, PARENT);      \
+  if (!NAME->value.IsArray()) {             \
+    std::stringstream ss;                   \
+    ss << "field was not an array"          \
+       << " line " << __LINE__;             \
+    return Status::Invalid(ss.str());       \
+  }
+
+#define RETURN_NOT_OBJECT(TOK, NAME, PARENT) \
+  RETURN_NOT_FOUND(TOK, NAME, PARENT);       \
+  if (!NAME->value.IsObject()) {             \
+    std::stringstream ss;                    \
+    ss << "field was not an object"          \
+       << " line " << __LINE__;              \
+    return Status::Invalid(ss.str());        \
+  }
+
+namespace arrow {
+namespace ipc {
+
+// TODO(wesm): Only exporting these because arrow_ipc does not have a static
+// library at the moment. Better to not export
+Status ARROW_EXPORT WriteJsonSchema(const Schema& schema, RjWriter* json_writer);
+Status ARROW_EXPORT WriteJsonArray(
+    const std::string& name, const Array& array, RjWriter* json_writer);
+
+Status ARROW_EXPORT ReadJsonSchema(
+    const rj::Value& json_obj, std::shared_ptr<Schema>* schema);
+Status ARROW_EXPORT ReadJsonArray(MemoryPool* pool, const rj::Value& json_obj,
+    const std::shared_ptr<DataType>& type, std::shared_ptr<Array>* array);
+
+Status ARROW_EXPORT ReadJsonArray(MemoryPool* pool, const rj::Value& json_obj,
+    const Schema& schema, std::shared_ptr<Array>* array);
+
+}  // namespace ipc
+}  // namespace arrow
+
+#endif  // ARROW_IPC_JSON_INTERNAL_H
diff --git a/cpp/src/arrow/ipc/json.cc b/cpp/src/arrow/ipc/json.cc
new file mode 100644
index 0000000000000..2281611f8b879
--- /dev/null
+++ b/cpp/src/arrow/ipc/json.cc
@@ -0,0 +1,219 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/ipc/json.h" + +#include +#include +#include +#include + +#include "arrow/array.h" +#include "arrow/ipc/json-internal.h" +#include "arrow/schema.h" +#include "arrow/table.h" +#include "arrow/type.h" +#include "arrow/util/buffer.h" +#include "arrow/util/logging.h" +#include "arrow/util/memory-pool.h" +#include "arrow/util/status.h" + +namespace arrow { +namespace ipc { + +// ---------------------------------------------------------------------- +// Writer implementation + +class JsonWriter::JsonWriterImpl { + public: + explicit JsonWriterImpl(const std::shared_ptr& schema) : schema_(schema) { + writer_.reset(new RjWriter(string_buffer_)); + } + + Status Start() { + writer_->StartObject(); + + writer_->Key("schema"); + RETURN_NOT_OK(WriteJsonSchema(*schema_.get(), writer_.get())); + + // Record batches + writer_->Key("batches"); + writer_->StartArray(); + return Status::OK(); + } + + Status Finish(std::string* result) { + writer_->EndArray(); // Record batches + writer_->EndObject(); + + *result = string_buffer_.GetString(); + return Status::OK(); + } + + Status WriteRecordBatch( + const std::vector>& columns, int32_t num_rows) { + DCHECK_EQ(static_cast(columns.size()), schema_->num_fields()); + + writer_->StartObject(); + writer_->Key("count"); + writer_->Int(num_rows); + + writer_->Key("columns"); + writer_->StartArray(); + + for (int i = 0; i < schema_->num_fields(); ++i) { + const std::shared_ptr& column = columns[i]; + + DCHECK_EQ(num_rows, column->length()) + << "Array length did not match record batch length"; + + RETURN_NOT_OK( + WriteJsonArray(schema_->field(i)->name, *column.get(), writer_.get())); + } + + writer_->EndArray(); + writer_->EndObject(); + return Status::OK(); + } + + private: + std::shared_ptr schema_; + + rj::StringBuffer string_buffer_; + std::unique_ptr writer_; +}; + +JsonWriter::JsonWriter(const std::shared_ptr& schema) { + impl_.reset(new JsonWriterImpl(schema)); +} + +JsonWriter::~JsonWriter() {} + +Status JsonWriter::Open( + const std::shared_ptr& schema, std::unique_ptr* writer) { + *writer = std::unique_ptr(new JsonWriter(schema)); + return (*writer)->impl_->Start(); +} + +Status JsonWriter::Finish(std::string* result) { + return impl_->Finish(result); +} + +Status JsonWriter::WriteRecordBatch( + const std::vector>& columns, int32_t num_rows) { + return impl_->WriteRecordBatch(columns, num_rows); +} + +// ---------------------------------------------------------------------- +// Reader implementation + +class JsonReader::JsonReaderImpl { + public: + JsonReaderImpl(MemoryPool* pool, const std::shared_ptr& data) + : pool_(pool), data_(data) {} + + Status ParseAndReadSchema() { + doc_.Parse(reinterpret_cast(data_->data()), + static_cast(data_->size())); + if (doc_.HasParseError()) { return Status::IOError("JSON parsing failed"); } + + auto it = doc_.FindMember("schema"); + RETURN_NOT_OBJECT("schema", it, doc_); + RETURN_NOT_OK(ReadJsonSchema(it->value, &schema_)); + + it = doc_.FindMember("batches"); + RETURN_NOT_ARRAY("batches", it, doc_); + record_batches_ = &it->value; + + return Status::OK(); + } + + Status 
GetRecordBatch(int i, std::shared_ptr* batch) const { + DCHECK_GE(i, 0) << "i out of bounds"; + DCHECK_LT(i, static_cast(record_batches_->GetArray().Size())) + << "i out of bounds"; + + const auto& batch_val = record_batches_->GetArray()[i]; + DCHECK(batch_val.IsObject()); + + const auto& batch_obj = batch_val.GetObject(); + + auto it = batch_obj.FindMember("count"); + RETURN_NOT_INT("count", it, batch_obj); + int32_t num_rows = static_cast(it->value.GetInt()); + + it = batch_obj.FindMember("columns"); + RETURN_NOT_ARRAY("columns", it, batch_obj); + const auto& json_columns = it->value.GetArray(); + + std::vector> columns(json_columns.Size()); + for (size_t i = 0; i < columns.size(); ++i) { + const std::shared_ptr& type = schema_->field(i)->type; + RETURN_NOT_OK(ReadJsonArray(pool_, json_columns[i], type, &columns[i])); + } + + *batch = std::make_shared(schema_, num_rows, columns); + return Status::OK(); + } + + std::shared_ptr schema() const { return schema_; } + + int num_record_batches() const { + return static_cast(record_batches_->GetArray().Size()); + } + + private: + MemoryPool* pool_; + std::shared_ptr data_; + rj::Document doc_; + + const rj::Value* record_batches_; + + std::shared_ptr schema_; +}; + +JsonReader::JsonReader(MemoryPool* pool, const std::shared_ptr& data) { + impl_.reset(new JsonReaderImpl(pool, data)); +} + +JsonReader::~JsonReader() {} + +Status JsonReader::Open( + const std::shared_ptr& data, std::unique_ptr* reader) { + return Open(default_memory_pool(), data, reader); +} + +Status JsonReader::Open(MemoryPool* pool, const std::shared_ptr& data, + std::unique_ptr* reader) { + *reader = std::unique_ptr(new JsonReader(pool, data)); + return (*reader)->impl_->ParseAndReadSchema(); +} + +std::shared_ptr JsonReader::schema() const { + return impl_->schema(); +} + +int JsonReader::num_record_batches() const { + return impl_->num_record_batches(); +} + +Status JsonReader::GetRecordBatch(int i, std::shared_ptr* batch) const { + return impl_->GetRecordBatch(i, batch); +} + +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/json.h b/cpp/src/arrow/ipc/json.h new file mode 100644 index 0000000000000..7395be43b967d --- /dev/null +++ b/cpp/src/arrow/ipc/json.h @@ -0,0 +1,92 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
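The public classes declared in json.h below wrap these implementations, so a full write-then-read round trip is a one-screen exercise. A sketch under stated assumptions: RoundTripBatch is a hypothetical helper, and the borrowed-memory Buffer constructor is the same one used by GetBufferFromVector in test-util.h later in this patch, so the JSON string must outlive the buffer.

```cpp
// Hypothetical round trip through JsonWriter and JsonReader.
Status RoundTripBatch(const std::shared_ptr<Schema>& schema,
    const std::vector<std::shared_ptr<Array>>& columns, int32_t num_rows,
    std::shared_ptr<RecordBatch>* out) {
  std::unique_ptr<JsonWriter> writer;
  RETURN_NOT_OK(JsonWriter::Open(schema, &writer));
  RETURN_NOT_OK(writer->WriteRecordBatch(columns, num_rows));

  std::string json;
  RETURN_NOT_OK(writer->Finish(&json));

  // Borrowed-memory Buffer: `json` must stay alive while the reader parses it.
  auto data = std::make_shared<Buffer>(
      reinterpret_cast<const uint8_t*>(json.data()), static_cast<int64_t>(json.size()));

  std::unique_ptr<JsonReader> reader;
  RETURN_NOT_OK(JsonReader::Open(data, &reader));
  return reader->GetRecordBatch(0, out);
}
```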
+ +// Implement Arrow JSON serialization format + +#ifndef ARROW_IPC_JSON_H +#define ARROW_IPC_JSON_H + +#include +#include +#include + +#include "arrow/type_fwd.h" +#include "arrow/util/visibility.h" + +namespace arrow { +namespace io { + +class OutputStream; +class ReadableFileInterface; + +} // namespace io + +namespace ipc { + +class ARROW_EXPORT JsonWriter { + public: + ~JsonWriter(); + + static Status Open( + const std::shared_ptr& schema, std::unique_ptr* out); + + // TODO(wesm): Write dictionaries + + Status WriteRecordBatch( + const std::vector>& columns, int32_t num_rows); + + Status Finish(std::string* result); + + private: + explicit JsonWriter(const std::shared_ptr& schema); + + // Hide RapidJSON details from public API + class JsonWriterImpl; + std::unique_ptr impl_; +}; + +// TODO(wesm): Read from a file stream rather than an in-memory buffer +class ARROW_EXPORT JsonReader { + public: + ~JsonReader(); + + static Status Open(MemoryPool* pool, const std::shared_ptr& data, + std::unique_ptr* reader); + + // Use the default memory pool + static Status Open( + const std::shared_ptr& data, std::unique_ptr* reader); + + std::shared_ptr schema() const; + + int num_record_batches() const; + + // Read a record batch from the file + Status GetRecordBatch(int i, std::shared_ptr* batch) const; + + private: + JsonReader(MemoryPool* pool, const std::shared_ptr& data); + + // Hide RapidJSON details from public API + class JsonReaderImpl; + std::unique_ptr impl_; +}; + +} // namespace ipc +} // namespace arrow + +#endif // ARROW_IPC_JSON_H diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index 784e238e977c7..9abc20d876de4 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -27,6 +27,7 @@ #include "arrow/array.h" #include "arrow/table.h" #include "arrow/test-util.h" +#include "arrow/type.h" #include "arrow/types/list.h" #include "arrow/types/primitive.h" #include "arrow/types/string.h" @@ -39,15 +40,14 @@ namespace arrow { namespace ipc { const auto kInt32 = std::make_shared(); -const auto kListInt32 = std::make_shared(kInt32); -const auto kListListInt32 = std::make_shared(kListInt32); +const auto kListInt32 = list(kInt32); +const auto kListListInt32 = list(kListInt32); Status MakeRandomInt32Array( int32_t length, bool include_nulls, MemoryPool* pool, std::shared_ptr* out) { std::shared_ptr data; test::MakeRandomInt32PoolBuffer(length, pool, &data); - const auto kInt32 = std::make_shared(); - Int32Builder builder(pool, kInt32); + Int32Builder builder(pool, int32()); if (include_nulls) { std::shared_ptr valid_bytes; test::MakeRandomBytePoolBuffer(length, pool, &valid_bytes); @@ -134,8 +134,8 @@ Status MakeRandomBinaryArray( Status MakeStringTypesRecordBatch(std::shared_ptr* out) { const int32_t length = 500; - auto string_type = std::make_shared(); - auto binary_type = std::make_shared(); + auto string_type = utf8(); + auto binary_type = binary(); auto f0 = std::make_shared("f0", string_type); auto f1 = std::make_shared("f1", binary_type); std::shared_ptr schema(new Schema({f0, f1})); @@ -233,7 +233,7 @@ Status MakeDeeplyNestedList(std::shared_ptr* out) { const bool include_nulls = true; RETURN_NOT_OK(MakeRandomInt32Array(1000, include_nulls, pool, &array)); for (int i = 0; i < 63; ++i) { - type = std::static_pointer_cast(std::make_shared(type)); + type = std::static_pointer_cast(list(type)); RETURN_NOT_OK(MakeRandomListArray(array, batch_length, include_nulls, pool, &array)); } diff --git a/cpp/src/arrow/schema-test.cc 
b/cpp/src/arrow/schema-test.cc index 8cc80be120a44..4826199f73de7 100644 --- a/cpp/src/arrow/schema-test.cc +++ b/cpp/src/arrow/schema-test.cc @@ -29,23 +29,21 @@ using std::vector; namespace arrow { -const auto INT32 = std::make_shared(); - TEST(TestField, Basics) { - Field f0("f0", INT32); - Field f0_nn("f0", INT32, false); + Field f0("f0", int32()); + Field f0_nn("f0", int32(), false); ASSERT_EQ(f0.name, "f0"); - ASSERT_EQ(f0.type->ToString(), INT32->ToString()); + ASSERT_EQ(f0.type->ToString(), int32()->ToString()); ASSERT_TRUE(f0.nullable); ASSERT_FALSE(f0_nn.nullable); } TEST(TestField, Equals) { - Field f0("f0", INT32); - Field f0_nn("f0", INT32, false); - Field f0_other("f0", INT32); + Field f0("f0", int32()); + Field f0_nn("f0", int32(), false); + Field f0_other("f0", int32()); ASSERT_EQ(f0, f0_other); ASSERT_NE(f0, f0_nn); @@ -57,11 +55,11 @@ class TestSchema : public ::testing::Test { }; TEST_F(TestSchema, Basics) { - auto f0 = std::make_shared("f0", INT32); - auto f1 = std::make_shared("f1", std::make_shared(), false); - auto f1_optional = std::make_shared("f1", std::make_shared()); + auto f0 = field("f0", int32()); + auto f1 = field("f1", uint8(), false); + auto f1_optional = field("f1", uint8()); - auto f2 = std::make_shared("f2", std::make_shared()); + auto f2 = field("f2", utf8()); vector> fields = {f0, f1, f2}; auto schema = std::make_shared(fields); @@ -83,11 +81,10 @@ TEST_F(TestSchema, Basics) { } TEST_F(TestSchema, ToString) { - auto f0 = std::make_shared("f0", INT32); - auto f1 = std::make_shared("f1", std::make_shared(), false); - auto f2 = std::make_shared("f2", std::make_shared()); - auto f3 = std::make_shared( - "f3", std::make_shared(std::make_shared())); + auto f0 = field("f0", int32()); + auto f1 = field("f1", uint8(), false); + auto f2 = field("f2", utf8()); + auto f3 = field("f3", list(int16())); vector> fields = {f0, f1, f2, f3}; auto schema = std::make_shared(fields); @@ -101,4 +98,25 @@ f3: list)"; ASSERT_EQ(expected, result); } +TEST_F(TestSchema, GetFieldByName) { + auto f0 = field("f0", int32()); + auto f1 = field("f1", uint8(), false); + auto f2 = field("f2", utf8()); + auto f3 = field("f3", list(int16())); + + vector> fields = {f0, f1, f2, f3}; + auto schema = std::make_shared(fields); + + std::shared_ptr result; + + result = schema->GetFieldByName("f1"); + ASSERT_TRUE(f1->Equals(result)); + + result = schema->GetFieldByName("f3"); + ASSERT_TRUE(f3->Equals(result)); + + result = schema->GetFieldByName("not-found"); + ASSERT_TRUE(result == nullptr); +} + } // namespace arrow diff --git a/cpp/src/arrow/schema.cc b/cpp/src/arrow/schema.cc index ff3ea1990e551..cd8256e658ec3 100644 --- a/cpp/src/arrow/schema.cc +++ b/cpp/src/arrow/schema.cc @@ -42,6 +42,21 @@ bool Schema::Equals(const std::shared_ptr& other) const { return Equals(*other.get()); } +std::shared_ptr Schema::GetFieldByName(const std::string& name) { + if (fields_.size() > 0 && name_to_index_.size() == 0) { + for (size_t i = 0; i < fields_.size(); ++i) { + name_to_index_[fields_[i]->name] = i; + } + } + + auto it = name_to_index_.find(name); + if (it == name_to_index_.end()) { + return nullptr; + } else { + return fields_[it->second]; + } +} + std::string Schema::ToString() const { std::stringstream buffer; diff --git a/cpp/src/arrow/schema.h b/cpp/src/arrow/schema.h index 4301968e01578..0e1ab5c368e98 100644 --- a/cpp/src/arrow/schema.h +++ b/cpp/src/arrow/schema.h @@ -20,14 +20,14 @@ #include #include +#include #include +#include "arrow/type.h" #include "arrow/util/visibility.h" namespace 
arrow { -struct Field; - class ARROW_EXPORT Schema { public: explicit Schema(const std::vector>& fields); @@ -37,7 +37,12 @@ class ARROW_EXPORT Schema { bool Equals(const std::shared_ptr& other) const; // Return the ith schema element. Does not boundscheck - const std::shared_ptr& field(int i) const { return fields_[i]; } + std::shared_ptr field(int i) const { return fields_[i]; } + + // Returns nullptr if name not found + std::shared_ptr GetFieldByName(const std::string& name); + + const std::vector>& fields() const { return fields_; } // Render a string representation of the schema suitable for debugging std::string ToString() const; @@ -46,6 +51,7 @@ class ARROW_EXPORT Schema { private: std::vector> fields_; + std::unordered_map name_to_index_; }; } // namespace arrow diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index ac56f5ed0871c..ab4b980b3be63 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -27,6 +27,7 @@ #include "gtest/gtest.h" +#include "arrow/array.h" #include "arrow/column.h" #include "arrow/schema.h" #include "arrow/table.h" @@ -102,20 +103,57 @@ void random_real(int n, uint32_t seed, T min_value, T max_value, std::vector* } template -std::shared_ptr to_buffer(const std::vector& values) { +std::shared_ptr GetBufferFromVector(const std::vector& values) { return std::make_shared( reinterpret_cast(values.data()), values.size() * sizeof(T)); } +template +inline Status CopyBufferFromVector( + const std::vector& values, std::shared_ptr* result) { + int64_t nbytes = static_cast(values.size()) * sizeof(T); + + auto buffer = std::make_shared(default_memory_pool()); + RETURN_NOT_OK(buffer->Resize(nbytes)); + memcpy(buffer->mutable_data(), values.data(), nbytes); + + *result = buffer; + return Status::OK(); +} + +static inline Status GetBitmapFromBoolVector( + const std::vector& is_valid, std::shared_ptr* result) { + int length = static_cast(is_valid.size()); + + std::shared_ptr buffer; + RETURN_NOT_OK(GetEmptyBitmap(default_memory_pool(), length, &buffer)); + + uint8_t* bitmap = buffer->mutable_data(); + for (int i = 0; i < length; ++i) { + if (is_valid[i]) { BitUtil::SetBit(bitmap, i); } + } + + *result = buffer; + return Status::OK(); +} + // Sets approximately pct_null of the first n bytes in null_bytes to zero // and the rest to non-zero (true) values. 
-void random_null_bytes(int64_t n, double pct_null, uint8_t* null_bytes) { +static inline void random_null_bytes(int64_t n, double pct_null, uint8_t* null_bytes) { Random rng(random_seed()); for (int i = 0; i < n; ++i) { null_bytes[i] = rng.NextDoubleFraction() > pct_null; } } +static inline void random_is_valid( + int64_t n, double pct_null, std::vector* is_valid) { + Random rng(random_seed()); + for (int i = 0; i < n; ++i) { + is_valid->push_back(rng.NextDoubleFraction() > pct_null); + } +} + static inline void random_bytes(int n, uint32_t seed, uint8_t* out) { std::mt19937 gen(seed); std::uniform_int_distribution d(0, 255); @@ -125,6 +163,15 @@ static inline void random_bytes(int n, uint32_t seed, uint8_t* out) { } } +static inline void random_ascii(int n, uint32_t seed, uint8_t* out) { + std::mt19937 gen(seed); + std::uniform_int_distribution d(65, 122); + + for (int i = 0; i < n; ++i) { + out[i] = d(gen) & 0xFF; + } +} + template void rand_uniform_int(int n, uint32_t seed, T min_value, T max_value, T* out) { DCHECK(out); diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 4fd50b7c19365..589bdadb77c64 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -20,6 +20,8 @@ #include #include +#include "arrow/util/status.h" + namespace arrow { std::string Field::ToString() const { @@ -44,9 +46,24 @@ bool DataType::Equals(const DataType* other) const { return equals; } +std::string BooleanType::ToString() const { + return name(); +} + +FloatingPointMeta::Precision HalfFloatType::precision() const { + return FloatingPointMeta::HALF; +} + +FloatingPointMeta::Precision FloatType::precision() const { + return FloatingPointMeta::SINGLE; +} + +FloatingPointMeta::Precision DoubleType::precision() const { + return FloatingPointMeta::DOUBLE; +} + std::string StringType::ToString() const { - std::string result(name()); - return result; + return std::string("string"); } std::string ListType::ToString() const { @@ -56,7 +73,7 @@ std::string ListType::ToString() const { } std::string BinaryType::ToString() const { - return std::string(name()); + return std::string("binary"); } std::string StructType::ToString() const { @@ -71,4 +88,103 @@ std::string StructType::ToString() const { return s.str(); } +std::string UnionType::ToString() const { + std::stringstream s; + + if (mode == UnionMode::SPARSE) { + s << "union[sparse]<"; + } else { + s << "union[dense]<"; + } + + for (size_t i = 0; i < children_.size(); ++i) { + if (i) { s << ", "; } + s << children_[i]->ToString(); + } + s << ">"; + return s.str(); +} + +int NullType::bit_width() const { + return 0; +} + +std::string NullType::ToString() const { + return name(); +} + +// Visitors and template instantiation + +#define ACCEPT_VISITOR(TYPE) \ + Status TYPE::Accept(TypeVisitor* visitor) const { return visitor->Visit(*this); } + +ACCEPT_VISITOR(NullType); +ACCEPT_VISITOR(BooleanType); +ACCEPT_VISITOR(BinaryType); +ACCEPT_VISITOR(StringType); +ACCEPT_VISITOR(ListType); +ACCEPT_VISITOR(StructType); +ACCEPT_VISITOR(DecimalType); +ACCEPT_VISITOR(UnionType); +ACCEPT_VISITOR(DateType); +ACCEPT_VISITOR(TimeType); +ACCEPT_VISITOR(TimestampType); +ACCEPT_VISITOR(IntervalType); + +#define TYPE_FACTORY(NAME, KLASS) \ + std::shared_ptr NAME() { \ + static std::shared_ptr result = std::make_shared(); \ + return result; \ + } + +TYPE_FACTORY(null, NullType); +TYPE_FACTORY(boolean, BooleanType); +TYPE_FACTORY(int8, Int8Type); +TYPE_FACTORY(uint8, UInt8Type); +TYPE_FACTORY(int16, Int16Type); +TYPE_FACTORY(uint16, UInt16Type); +TYPE_FACTORY(int32, 
Int32Type); +TYPE_FACTORY(uint32, UInt32Type); +TYPE_FACTORY(int64, Int64Type); +TYPE_FACTORY(uint64, UInt64Type); +TYPE_FACTORY(float16, HalfFloatType); +TYPE_FACTORY(float32, FloatType); +TYPE_FACTORY(float64, DoubleType); +TYPE_FACTORY(utf8, StringType); +TYPE_FACTORY(binary, BinaryType); +TYPE_FACTORY(date, DateType); + +std::shared_ptr timestamp(TimeUnit unit) { + static std::shared_ptr result = std::make_shared(); + return result; +} + +std::shared_ptr time(TimeUnit unit) { + static std::shared_ptr result = std::make_shared(); + return result; +} + +std::shared_ptr list(const std::shared_ptr& value_type) { + return std::make_shared(value_type); +} + +std::shared_ptr list(const std::shared_ptr& value_field) { + return std::make_shared(value_field); +} + +std::shared_ptr struct_(const std::vector>& fields) { + return std::make_shared(fields); +} + +std::shared_ptr ARROW_EXPORT union_( + const std::vector>& child_fields, + const std::vector& type_ids, UnionMode mode) { + return std::make_shared(child_fields, type_ids, mode); +} + +std::shared_ptr field( + const std::string& name, const TypePtr& type, bool nullable, int64_t dictionary) { + return std::make_shared(name, type, nullable, dictionary); +} + } // namespace arrow diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index ea8516fc34798..5b4d7bc42bd3d 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -23,7 +23,9 @@ #include #include +#include "arrow/type_fwd.h" #include "arrow/util/macros.h" +#include "arrow/util/status.h" #include "arrow/util/visibility.h" namespace arrow { @@ -50,17 +52,20 @@ struct Type { UINT64 = 8, INT64 = 9, + // 2-byte floating point value + HALF_FLOAT = 10, + // 4-byte floating point value - FLOAT = 10, + FLOAT = 11, // 8-byte floating point value - DOUBLE = 11, + DOUBLE = 12, // UTF8 variable-length string as List STRING = 13, // Variable-length bytes (no guarantee of UTF8-ness) - BINARY = 15, + BINARY = 14, // By default, int32 days since the UNIX epoch DATE = 16, @@ -69,19 +74,16 @@ struct Type { // Default unit millisecond TIMESTAMP = 17, - // Timestamp as double seconds since the UNIX epoch - TIMESTAMP_DOUBLE = 18, - // Exact time encoded with int64, default unit millisecond - TIME = 19, + TIME = 18, + + // YEAR_MONTH or DAY_TIME interval in SQL style + INTERVAL = 19, // Precision- and scale-based decimal type. Storage type depends on the // parameters. 
DECIMAL = 20, - // Decimal value encoded as a text string - DECIMAL_TEXT = 21, - // A list of some logical data type LIST = 30, @@ -89,19 +91,16 @@ struct Type { STRUCT = 31, // Unions of logical types - DENSE_UNION = 32, - SPARSE_UNION = 33, + UNION = 32, - // Union - JSON_SCALAR = 50, + // Timestamp as double seconds since the UNIX epoch + TIMESTAMP_DOUBLE = 33, - // User-defined type - USER = 60 + // Decimal value encoded as a text string + DECIMAL_TEXT = 34, }; }; -struct Field; - struct ARROW_EXPORT DataType { Type::type type; @@ -123,15 +122,32 @@ struct ARROW_EXPORT DataType { const std::shared_ptr& child(int i) const { return children_[i]; } + const std::vector>& children() const { return children_; } + int num_children() const { return children_.size(); } - virtual int value_size() const { return -1; } + virtual Status Accept(TypeVisitor* visitor) const = 0; virtual std::string ToString() const = 0; }; typedef std::shared_ptr TypePtr; +struct ARROW_EXPORT FixedWidthMeta { + virtual int bit_width() const = 0; +}; + +struct ARROW_EXPORT IntegerMeta { + virtual bool is_signed() const = 0; +}; + +struct ARROW_EXPORT FloatingPointMeta { + enum Precision { HALF, SINGLE, DOUBLE }; + virtual Precision precision() const = 0; +}; + +struct NoExtraMeta {}; + // A field is a piece of metadata that includes (for now) a name and a data // type struct ARROW_EXPORT Field { @@ -139,7 +155,7 @@ struct ARROW_EXPORT Field { std::string name; // The field's data type - TypePtr type; + std::shared_ptr type; // Fields can be nullable bool nullable; @@ -148,8 +164,8 @@ struct ARROW_EXPORT Field { // 0 means it's not dictionary encoded int64_t dictionary; - Field(const std::string& name, const TypePtr& type, bool nullable = true, - int64_t dictionary = 0) + Field(const std::string& name, const std::shared_ptr& type, + bool nullable = true, int64_t dictionary = 0) : name(name), type(type), nullable(nullable), dictionary(dictionary) {} bool operator==(const Field& other) const { return this->Equals(other); } @@ -168,78 +184,112 @@ struct ARROW_EXPORT Field { }; typedef std::shared_ptr FieldPtr; -template -struct ARROW_EXPORT PrimitiveType : public DataType { - PrimitiveType() : DataType(Derived::type_enum) {} +struct PrimitiveCType : public DataType { + using DataType::DataType; +}; + +template +struct ARROW_EXPORT CTypeImpl : public PrimitiveCType, public FixedWidthMeta { + using c_type = C_TYPE; + static constexpr Type::type type_id = TYPE_ID; + + CTypeImpl() : PrimitiveCType(TYPE_ID) {} + int bit_width() const override { return sizeof(C_TYPE) * 8; } + + Status Accept(TypeVisitor* visitor) const override { + return visitor->Visit(*static_cast(this)); + } + + std::string ToString() const override { return std::string(DERIVED::name()); } +}; + +struct ARROW_EXPORT NullType : public DataType, public FixedWidthMeta { + static constexpr Type::type type_id = Type::NA; + + NullType() : DataType(Type::NA) {} + + int bit_width() const override; + Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; + + static std::string name() { return "null"; } +}; + +template +struct IntegerTypeImpl : public CTypeImpl, public IntegerMeta { + bool is_signed() const override { return std::is_signed::value; } }; -template -inline std::string PrimitiveType::ToString() const { - std::string result(static_cast(this)->name()); - return result; -} +struct ARROW_EXPORT BooleanType : public DataType, FixedWidthMeta { + static constexpr Type::type type_id = Type::BOOL; -#define 
PRIMITIVE_DECL(TYPENAME, C_TYPE, ENUM, SIZE, NAME) \ - typedef C_TYPE c_type; \ - static constexpr Type::type type_enum = Type::ENUM; \ - \ - TYPENAME() : PrimitiveType() {} \ - \ - virtual int value_size() const { return SIZE; } \ - \ - static const char* name() { return NAME; } + BooleanType() : DataType(Type::BOOL) {} -struct ARROW_EXPORT NullType : public PrimitiveType { - PRIMITIVE_DECL(NullType, void, NA, 0, "null"); + Status Accept(TypeVisitor* visitor) const override; + std::string ToString() const override; + + int bit_width() const override { return 1; } + static std::string name() { return "bool"; } }; -struct ARROW_EXPORT BooleanType : public PrimitiveType { - PRIMITIVE_DECL(BooleanType, uint8_t, BOOL, 1, "bool"); +struct ARROW_EXPORT UInt8Type : public IntegerTypeImpl { + static std::string name() { return "uint8"; } }; -struct ARROW_EXPORT UInt8Type : public PrimitiveType { - PRIMITIVE_DECL(UInt8Type, uint8_t, UINT8, 1, "uint8"); +struct ARROW_EXPORT Int8Type : public IntegerTypeImpl { + static std::string name() { return "int8"; } }; -struct ARROW_EXPORT Int8Type : public PrimitiveType { - PRIMITIVE_DECL(Int8Type, int8_t, INT8, 1, "int8"); +struct ARROW_EXPORT UInt16Type + : public IntegerTypeImpl { + static std::string name() { return "uint16"; } }; -struct ARROW_EXPORT UInt16Type : public PrimitiveType { - PRIMITIVE_DECL(UInt16Type, uint16_t, UINT16, 2, "uint16"); +struct ARROW_EXPORT Int16Type : public IntegerTypeImpl { + static std::string name() { return "int16"; } }; -struct ARROW_EXPORT Int16Type : public PrimitiveType { - PRIMITIVE_DECL(Int16Type, int16_t, INT16, 2, "int16"); +struct ARROW_EXPORT UInt32Type + : public IntegerTypeImpl { + static std::string name() { return "uint32"; } }; -struct ARROW_EXPORT UInt32Type : public PrimitiveType { - PRIMITIVE_DECL(UInt32Type, uint32_t, UINT32, 4, "uint32"); +struct ARROW_EXPORT Int32Type : public IntegerTypeImpl { + static std::string name() { return "int32"; } }; -struct ARROW_EXPORT Int32Type : public PrimitiveType { - PRIMITIVE_DECL(Int32Type, int32_t, INT32, 4, "int32"); +struct ARROW_EXPORT UInt64Type + : public IntegerTypeImpl { + static std::string name() { return "uint64"; } }; -struct ARROW_EXPORT UInt64Type : public PrimitiveType { - PRIMITIVE_DECL(UInt64Type, uint64_t, UINT64, 8, "uint64"); +struct ARROW_EXPORT Int64Type : public IntegerTypeImpl { + static std::string name() { return "int64"; } }; -struct ARROW_EXPORT Int64Type : public PrimitiveType { - PRIMITIVE_DECL(Int64Type, int64_t, INT64, 8, "int64"); +struct ARROW_EXPORT HalfFloatType + : public CTypeImpl, + public FloatingPointMeta { + Precision precision() const override; + static std::string name() { return "halffloat"; } }; -struct ARROW_EXPORT FloatType : public PrimitiveType { - PRIMITIVE_DECL(FloatType, float, FLOAT, 4, "float"); +struct ARROW_EXPORT FloatType : public CTypeImpl, + public FloatingPointMeta { + Precision precision() const override; + static std::string name() { return "float"; } }; -struct ARROW_EXPORT DoubleType : public PrimitiveType { - PRIMITIVE_DECL(DoubleType, double, DOUBLE, 8, "double"); +struct ARROW_EXPORT DoubleType : public CTypeImpl, + public FloatingPointMeta { + Precision precision() const override; + static std::string name() { return "double"; } }; -struct ARROW_EXPORT ListType : public DataType { +struct ARROW_EXPORT ListType : public DataType, public NoExtraMeta { + static constexpr Type::type type_id = Type::LIST; + // List can contain any other logical value type explicit ListType(const std::shared_ptr& 
value_type) : ListType(std::make_shared("item", value_type)) {} @@ -252,16 +302,21 @@ struct ARROW_EXPORT ListType : public DataType { const std::shared_ptr& value_type() const { return children_[0]->type; } - static char const* name() { return "list"; } - + Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; + + static std::string name() { return "list"; } }; // BinaryType type is reprsents lists of 1-byte values. -struct ARROW_EXPORT BinaryType : public DataType { +struct ARROW_EXPORT BinaryType : public DataType, public NoExtraMeta { + static constexpr Type::type type_id = Type::BINARY; + BinaryType() : BinaryType(Type::BINARY) {} - static char const* name() { return "binary"; } + + Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; + static std::string name() { return "binary"; } protected: // Allow subclasses to change the logical type. @@ -270,25 +325,160 @@ struct ARROW_EXPORT BinaryType : public DataType { // UTF encoded strings struct ARROW_EXPORT StringType : public BinaryType { - StringType() : BinaryType(Type::STRING) {} + static constexpr Type::type type_id = Type::STRING; - static char const* name() { return "string"; } + StringType() : BinaryType(Type::STRING) {} + Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; + static std::string name() { return "utf8"; } }; -struct ARROW_EXPORT StructType : public DataType { +struct ARROW_EXPORT StructType : public DataType, public NoExtraMeta { + static constexpr Type::type type_id = Type::STRUCT; + explicit StructType(const std::vector>& fields) : DataType(Type::STRUCT) { children_ = fields; } + Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; + static std::string name() { return "struct"; } +}; + +struct ARROW_EXPORT DecimalType : public DataType { + static constexpr Type::type type_id = Type::DECIMAL; + + explicit DecimalType(int precision_, int scale_) + : DataType(Type::DECIMAL), precision(precision_), scale(scale_) {} + int precision; + int scale; + + Status Accept(TypeVisitor* visitor) const override; + std::string ToString() const override; + static std::string name() { return "decimal"; } +}; + +enum class UnionMode : char { SPARSE, DENSE }; + +struct ARROW_EXPORT UnionType : public DataType { + static constexpr Type::type type_id = Type::UNION; + + UnionType(const std::vector>& child_fields, + const std::vector& type_ids, UnionMode mode = UnionMode::SPARSE) + : DataType(Type::UNION), mode(mode), type_ids(type_ids) { + children_ = child_fields; + } + + std::string ToString() const override; + static std::string name() { return "union"; } + Status Accept(TypeVisitor* visitor) const override; + + UnionMode mode; + std::vector type_ids; +}; + +struct ARROW_EXPORT DateType : public DataType, public NoExtraMeta { + static constexpr Type::type type_id = Type::DATE; + + DateType() : DataType(Type::DATE) {} + + Status Accept(TypeVisitor* visitor) const override; + std::string ToString() const override { return name(); } + static std::string name() { return "date"; } +}; + +enum class TimeUnit : char { SECOND = 0, MILLI = 1, MICRO = 2, NANO = 3 }; + +struct ARROW_EXPORT TimeType : public DataType { + static constexpr Type::type type_id = Type::TIME; + using Unit = TimeUnit; + + TimeUnit unit; + + explicit TimeType(TimeUnit unit = TimeUnit::MILLI) : DataType(Type::TIME), unit(unit) {} + TimeType(const TimeType& other) : TimeType(other.unit) {} + + Status Accept(TypeVisitor* 
visitor) const override; + std::string ToString() const override { return name(); } + static std::string name() { return "time"; } +}; + +struct ARROW_EXPORT TimestampType : public DataType, public FixedWidthMeta { + using Unit = TimeUnit; + + typedef int64_t c_type; + static constexpr Type::type type_id = Type::TIMESTAMP; + + int bit_width() const override { return sizeof(int64_t) * 8; } + + TimeUnit unit; + + explicit TimestampType(TimeUnit unit = TimeUnit::MILLI) + : DataType(Type::TIMESTAMP), unit(unit) {} + + TimestampType(const TimestampType& other) : TimestampType(other.unit) {} + + Status Accept(TypeVisitor* visitor) const override; + std::string ToString() const override { return name(); } + static std::string name() { return "timestamp"; } +}; + +struct ARROW_EXPORT IntervalType : public DataType, public FixedWidthMeta { + enum class Unit : char { YEAR_MONTH = 0, DAY_TIME = 1 }; + + typedef int64_t c_type; + static constexpr Type::type type_id = Type::INTERVAL; + + int bit_width() const override { return sizeof(int64_t) * 8; } + + Unit unit; + + explicit IntervalType(Unit unit = Unit::YEAR_MONTH) + : DataType(Type::INTERVAL), unit(unit) {} + + IntervalType(const IntervalType& other) : IntervalType(other.unit) {} + + Status Accept(TypeVisitor* visitor) const override; + std::string ToString() const override { return name(); } + static std::string name() { return "date"; } }; -// These will be defined elsewhere -template -struct TypeTraits {}; +// Factory functions + +std::shared_ptr ARROW_EXPORT null(); +std::shared_ptr ARROW_EXPORT boolean(); +std::shared_ptr ARROW_EXPORT int8(); +std::shared_ptr ARROW_EXPORT int16(); +std::shared_ptr ARROW_EXPORT int32(); +std::shared_ptr ARROW_EXPORT int64(); +std::shared_ptr ARROW_EXPORT uint8(); +std::shared_ptr ARROW_EXPORT uint16(); +std::shared_ptr ARROW_EXPORT uint32(); +std::shared_ptr ARROW_EXPORT uint64(); +std::shared_ptr ARROW_EXPORT float16(); +std::shared_ptr ARROW_EXPORT float32(); +std::shared_ptr ARROW_EXPORT float64(); +std::shared_ptr ARROW_EXPORT utf8(); +std::shared_ptr ARROW_EXPORT binary(); + +std::shared_ptr ARROW_EXPORT list(const std::shared_ptr& value_type); +std::shared_ptr ARROW_EXPORT list(const std::shared_ptr& value_type); + +std::shared_ptr ARROW_EXPORT date(); +std::shared_ptr ARROW_EXPORT timestamp(TimeUnit unit); +std::shared_ptr ARROW_EXPORT time(TimeUnit unit); + +std::shared_ptr ARROW_EXPORT struct_( + const std::vector>& fields); + +std::shared_ptr ARROW_EXPORT union_( + const std::vector>& child_fields, + const std::vector& type_ids, UnionMode mode = UnionMode::SPARSE); + +std::shared_ptr ARROW_EXPORT field(const std::string& name, + const std::shared_ptr& type, bool nullable = true, int64_t dictionary = 0); } // namespace arrow diff --git a/cpp/src/arrow/type_fwd.h b/cpp/src/arrow/type_fwd.h new file mode 100644 index 0000000000000..6d660f4fdee43 --- /dev/null +++ b/cpp/src/arrow/type_fwd.h @@ -0,0 +1,157 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_TYPE_FWD_H +#define ARROW_TYPE_FWD_H + +namespace arrow { + +class Status; + +struct DataType; +class Array; +class ArrayBuilder; +struct Field; + +class Buffer; +class MemoryPool; +class RecordBatch; +class Schema; + +struct NullType; +class NullArray; + +struct BooleanType; +class BooleanArray; +class BooleanBuilder; + +struct BinaryType; +class BinaryArray; +class BinaryBuilder; + +struct StringType; +class StringArray; +class StringBuilder; + +struct ListType; +class ListArray; +class ListBuilder; + +struct StructType; +class StructArray; +class StructBuilder; + +struct DecimalType; +class DecimalArray; + +struct UnionType; +class UnionArray; + +template +class NumericArray; + +template +class NumericBuilder; + +#define _NUMERIC_TYPE_DECL(KLASS) \ + struct KLASS##Type; \ + using KLASS##Array = NumericArray; \ + using KLASS##Builder = NumericBuilder; + +_NUMERIC_TYPE_DECL(Int8); +_NUMERIC_TYPE_DECL(Int16); +_NUMERIC_TYPE_DECL(Int32); +_NUMERIC_TYPE_DECL(Int64); +_NUMERIC_TYPE_DECL(UInt8); +_NUMERIC_TYPE_DECL(UInt16); +_NUMERIC_TYPE_DECL(UInt32); +_NUMERIC_TYPE_DECL(UInt64); +_NUMERIC_TYPE_DECL(HalfFloat); +_NUMERIC_TYPE_DECL(Float); +_NUMERIC_TYPE_DECL(Double); + +#undef _NUMERIC_TYPE_DECL + +struct DateType; +class DateArray; + +struct TimeType; +class TimeArray; + +struct TimestampType; +using TimestampArray = NumericArray; + +struct IntervalType; +using IntervalArray = NumericArray; + +class TypeVisitor { + public: + virtual Status Visit(const NullType& type) = 0; + virtual Status Visit(const BooleanType& type) = 0; + virtual Status Visit(const Int8Type& type) = 0; + virtual Status Visit(const Int16Type& type) = 0; + virtual Status Visit(const Int32Type& type) = 0; + virtual Status Visit(const Int64Type& type) = 0; + virtual Status Visit(const UInt8Type& type) = 0; + virtual Status Visit(const UInt16Type& type) = 0; + virtual Status Visit(const UInt32Type& type) = 0; + virtual Status Visit(const UInt64Type& type) = 0; + virtual Status Visit(const HalfFloatType& type) = 0; + virtual Status Visit(const FloatType& type) = 0; + virtual Status Visit(const DoubleType& type) = 0; + virtual Status Visit(const StringType& type) = 0; + virtual Status Visit(const BinaryType& type) = 0; + virtual Status Visit(const DateType& type) = 0; + virtual Status Visit(const TimeType& type) = 0; + virtual Status Visit(const TimestampType& type) = 0; + virtual Status Visit(const IntervalType& type) = 0; + virtual Status Visit(const DecimalType& type) = 0; + virtual Status Visit(const ListType& type) = 0; + virtual Status Visit(const StructType& type) = 0; + virtual Status Visit(const UnionType& type) = 0; +}; + +class ArrayVisitor { + public: + virtual Status Visit(const NullArray& array) = 0; + virtual Status Visit(const BooleanArray& array) = 0; + virtual Status Visit(const Int8Array& array) = 0; + virtual Status Visit(const Int16Array& array) = 0; + virtual Status Visit(const Int32Array& array) = 0; + virtual Status Visit(const Int64Array& array) = 0; + virtual Status Visit(const UInt8Array& array) = 0; + virtual Status Visit(const UInt16Array& array) = 0; + virtual 
Status Visit(const UInt32Array& array) = 0; + virtual Status Visit(const UInt64Array& array) = 0; + virtual Status Visit(const HalfFloatArray& array) = 0; + virtual Status Visit(const FloatArray& array) = 0; + virtual Status Visit(const DoubleArray& array) = 0; + virtual Status Visit(const StringArray& array) = 0; + virtual Status Visit(const BinaryArray& array) = 0; + virtual Status Visit(const DateArray& array) = 0; + virtual Status Visit(const TimeArray& array) = 0; + virtual Status Visit(const TimestampArray& array) = 0; + virtual Status Visit(const IntervalArray& array) = 0; + virtual Status Visit(const DecimalArray& array) = 0; + virtual Status Visit(const ListArray& array) = 0; + virtual Status Visit(const StructArray& array) = 0; + virtual Status Visit(const UnionArray& array) = 0; +}; + +} // namespace arrow + +#endif // ARROW_TYPE_FWD_H diff --git a/cpp/src/arrow/type_traits.h b/cpp/src/arrow/type_traits.h new file mode 100644 index 0000000000000..bbb807488e3d0 --- /dev/null +++ b/cpp/src/arrow/type_traits.h @@ -0,0 +1,197 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
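The type_traits.h header added next is what lets templated code such as JsonArrayReader::ReadArray above name the concrete builder and array classes for a type parameter. A minimal sketch of the pattern; MakeEmptyArray is a hypothetical helper, not part of the patch, and the builder constructor signature is the one ReadArray itself uses.

```cpp
// Hypothetical helper: TypeTraits<T> maps an Arrow type class to its builder,
// so one template covers every primitive type.
template <typename T>
Status MakeEmptyArray(MemoryPool* pool, const std::shared_ptr<DataType>& type,
    std::shared_ptr<Array>* out) {
  typename TypeTraits<T>::BuilderType builder(pool, type);
  return builder.Finish(out);  // zero-length array of the matching ArrayType
}

// Usage: MakeEmptyArray<Int32Type>(default_memory_pool(), int32(), &out)
// instantiates an Int32Builder and finishes into an Int32Array.
```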
+ +#ifndef ARROW_TYPE_TRAITS_H +#define ARROW_TYPE_TRAITS_H + +#include + +#include "arrow/type_fwd.h" +#include "arrow/util/bit-util.h" + +namespace arrow { + +template +struct TypeTraits {}; + +template <> +struct TypeTraits { + using ArrayType = UInt8Array; + using BuilderType = UInt8Builder; + static inline int bytes_required(int elements) { return elements; } +}; + +template <> +struct TypeTraits { + using ArrayType = Int8Array; + using BuilderType = Int8Builder; + static inline int bytes_required(int elements) { return elements; } +}; + +template <> +struct TypeTraits { + using ArrayType = UInt16Array; + using BuilderType = UInt16Builder; + + static inline int bytes_required(int elements) { return elements * sizeof(uint16_t); } +}; + +template <> +struct TypeTraits { + using ArrayType = Int16Array; + using BuilderType = Int16Builder; + + static inline int bytes_required(int elements) { return elements * sizeof(int16_t); } +}; + +template <> +struct TypeTraits { + using ArrayType = UInt32Array; + using BuilderType = UInt32Builder; + + static inline int bytes_required(int elements) { return elements * sizeof(uint32_t); } +}; + +template <> +struct TypeTraits { + using ArrayType = Int32Array; + using BuilderType = Int32Builder; + + static inline int bytes_required(int elements) { return elements * sizeof(int32_t); } +}; + +template <> +struct TypeTraits { + using ArrayType = UInt64Array; + using BuilderType = UInt64Builder; + + static inline int bytes_required(int elements) { return elements * sizeof(uint64_t); } +}; + +template <> +struct TypeTraits { + using ArrayType = Int64Array; + using BuilderType = Int64Builder; + + static inline int bytes_required(int elements) { return elements * sizeof(int64_t); } +}; + +template <> +struct TypeTraits { + using ArrayType = TimestampArray; + // using BuilderType = TimestampBuilder; + + static inline int bytes_required(int elements) { return elements * sizeof(int64_t); } +}; + +template <> +struct TypeTraits { + using ArrayType = HalfFloatArray; + using BuilderType = HalfFloatBuilder; + + static inline int bytes_required(int elements) { return elements * sizeof(uint16_t); } +}; + +template <> +struct TypeTraits { + using ArrayType = FloatArray; + using BuilderType = FloatBuilder; + + static inline int bytes_required(int elements) { return elements * sizeof(float); } +}; + +template <> +struct TypeTraits { + using ArrayType = DoubleArray; + using BuilderType = DoubleBuilder; + + static inline int bytes_required(int elements) { return elements * sizeof(double); } +}; + +template <> +struct TypeTraits { + using ArrayType = BooleanArray; + using BuilderType = BooleanBuilder; + + static inline int bytes_required(int elements) { + return BitUtil::BytesForBits(elements); + } +}; + +template <> +struct TypeTraits { + using ArrayType = StringArray; + using BuilderType = StringBuilder; +}; + +template <> +struct TypeTraits { + using ArrayType = BinaryArray; + using BuilderType = BinaryBuilder; +}; + +// Not all type classes have a c_type +template +struct as_void { + using type = void; +}; + +// The partial specialization will match if T has the ATTR_NAME member +#define GET_ATTR(ATTR_NAME, DEFAULT) \ + template \ + struct GetAttr_##ATTR_NAME { \ + using type = DEFAULT; \ + }; \ + \ + template \ + struct GetAttr_##ATTR_NAME::type> { \ + using type = typename T::ATTR_NAME; \ + }; + +GET_ATTR(c_type, void); +GET_ATTR(TypeClass, void); + +#undef GET_ATTR + +#define PRIMITIVE_TRAITS(T) \ + using TypeClass = typename std::conditional::value, T, \ + typename 
GetAttr_TypeClass<T>::type>::type;                                                 \
+  using c_type = typename GetAttr_c_type<TypeClass>::type;
+
+template <typename T>
+struct IsUnsignedInt {
+  PRIMITIVE_TRAITS(T);
+  static constexpr bool value =
+      std::is_integral<c_type>::value && std::is_unsigned<c_type>::value;
+};
+
+template <typename T>
+struct IsSignedInt {
+  PRIMITIVE_TRAITS(T);
+  static constexpr bool value =
+      std::is_integral<c_type>::value && std::is_signed<c_type>::value;
+};
+
+template <typename T>
+struct IsFloatingPoint {
+  PRIMITIVE_TRAITS(T);
+  static constexpr bool value = std::is_floating_point<c_type>::value;
+};
+
+}  // namespace arrow
+
+#endif  // ARROW_TYPE_TRAITS_H
diff --git a/cpp/src/arrow/types/CMakeLists.txt b/cpp/src/arrow/types/CMakeLists.txt
index 9f7816989827d..6d59acfdf2eec 100644
--- a/cpp/src/arrow/types/CMakeLists.txt
+++ b/cpp/src/arrow/types/CMakeLists.txt
@@ -21,7 +21,6 @@
 # Headers: top level
 install(FILES
-  collection.h
   construct.h
   datetime.h
   decimal.h
diff --git a/cpp/src/arrow/types/collection.h b/cpp/src/arrow/types/collection.h
deleted file mode 100644
index 1712030203fa2..0000000000000
--- a/cpp/src/arrow/types/collection.h
+++ /dev/null
@@ -1,41 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements. See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership. The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License. You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied. See the License for the
-// specific language governing permissions and limitations
-// under the License.
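Because PRIMITIVE_TRAITS resolves TypeClass and c_type before anything is tested, the IsSignedInt, IsUnsignedInt, and IsFloatingPoint predicates above can be instantiated directly on Arrow type classes, which is how the enable_if overloads of JsonArrayReader::ReadArray select their branches. A few hedged compile-time checks, assuming type.h and type_traits.h are both included:

```cpp
// Sketch: the predicates key off the c_type embedded in each type class.
static_assert(IsSignedInt<Int32Type>::value, "int32 is a signed integer type");
static_assert(!IsSignedInt<UInt32Type>::value, "uint32 is not signed");
static_assert(IsUnsignedInt<UInt8Type>::value, "uint8 is unsigned");
static_assert(IsFloatingPoint<DoubleType>::value, "double is floating point");
static_assert(!IsFloatingPoint<Int64Type>::value, "int64 is not floating point");
```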
- -#ifndef ARROW_TYPES_COLLECTION_H -#define ARROW_TYPES_COLLECTION_H - -#include -#include - -#include "arrow/type.h" - -namespace arrow { - -template -struct CollectionType : public DataType { - std::vector child_types_; - - CollectionType() : DataType(T) {} - - const TypePtr& child(int i) const { return child_types_[i]; } - - int num_children() const { return child_types_.size(); } -}; - -} // namespace arrow - -#endif // ARROW_TYPES_COLLECTION_H diff --git a/cpp/src/arrow/types/datetime.h b/cpp/src/arrow/types/datetime.h index 241a126d1007f..a8f863923129a 100644 --- a/cpp/src/arrow/types/datetime.h +++ b/cpp/src/arrow/types/datetime.h @@ -22,41 +22,6 @@ #include "arrow/type.h" -namespace arrow { - -struct DateType : public DataType { - enum class Unit : char { DAY = 0, MONTH = 1, YEAR = 2 }; - - Unit unit; - - explicit DateType(Unit unit = Unit::DAY) : DataType(Type::DATE), unit(unit) {} - - DateType(const DateType& other) : DateType(other.unit) {} - - static char const* name() { return "date"; } -}; - -struct ARROW_EXPORT TimestampType : public DataType { - enum class Unit : char { SECOND = 0, MILLI = 1, MICRO = 2, NANO = 3 }; - - typedef int64_t c_type; - static constexpr Type::type type_enum = Type::TIMESTAMP; - - int value_size() const override { return sizeof(int64_t); } - - Unit unit; - - explicit TimestampType(Unit unit = Unit::MILLI) - : DataType(Type::TIMESTAMP), unit(unit) {} - - TimestampType(const TimestampType& other) : TimestampType(other.unit) {} - virtual ~TimestampType() {} - - std::string ToString() const override { return "timestamp"; } - - static char const* name() { return "timestamp"; } -}; - -} // namespace arrow +namespace arrow {} // namespace arrow #endif // ARROW_TYPES_DATETIME_H diff --git a/cpp/src/arrow/types/decimal.h b/cpp/src/arrow/types/decimal.h index 6c497c597d987..b3ea3a56d8008 100644 --- a/cpp/src/arrow/types/decimal.h +++ b/cpp/src/arrow/types/decimal.h @@ -23,18 +23,6 @@ #include "arrow/type.h" #include "arrow/util/visibility.h" -namespace arrow { - -struct ARROW_EXPORT DecimalType : public DataType { - explicit DecimalType(int precision_, int scale_) - : DataType(Type::DECIMAL), precision(precision_), scale(scale_) {} - int precision; - int scale; - static char const* name() { return "decimal"; } - - std::string ToString() const override; -}; - -} // namespace arrow +namespace arrow {} // namespace arrow #endif // ARROW_TYPES_DECIMAL_H diff --git a/cpp/src/arrow/types/list-test.cc b/cpp/src/arrow/types/list-test.cc index 12c539495a28b..cb9a8c12d8ab9 100644 --- a/cpp/src/arrow/types/list-test.cc +++ b/cpp/src/arrow/types/list-test.cc @@ -141,7 +141,7 @@ TEST_F(TestListBuilder, TestAppendNull) { ASSERT_TRUE(result_->IsNull(0)); ASSERT_TRUE(result_->IsNull(1)); - ASSERT_EQ(0, result_->offsets()[0]); + ASSERT_EQ(0, result_->raw_offsets()[0]); ASSERT_EQ(0, result_->offset(1)); ASSERT_EQ(0, result_->offset(2)); diff --git a/cpp/src/arrow/types/list.cc b/cpp/src/arrow/types/list.cc index 4b1e821472795..d86563253bd5a 100644 --- a/cpp/src/arrow/types/list.cc +++ b/cpp/src/arrow/types/list.cc @@ -155,4 +155,8 @@ void ListBuilder::Reset() { null_bitmap_ = nullptr; } +Status ListArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + } // namespace arrow diff --git a/cpp/src/arrow/types/list.h b/cpp/src/arrow/types/list.h index 9440ffed4bf8a..bd93e8fdcfa1c 100644 --- a/cpp/src/arrow/types/list.h +++ b/cpp/src/arrow/types/list.h @@ -39,6 +39,8 @@ class MemoryPool; class ARROW_EXPORT ListArray : public Array { public: + using 
TypeClass = ListType; + ListArray(const TypePtr& type, int32_t length, std::shared_ptr offsets, const ArrayPtr& values, int32_t null_count = 0, std::shared_ptr null_bitmap = nullptr) @@ -56,13 +58,13 @@ class ARROW_EXPORT ListArray : public Array { // Return a shared pointer in case the requestor desires to share ownership // with this array. const std::shared_ptr& values() const { return values_; } - const std::shared_ptr offset_buffer() const { + std::shared_ptr offsets() const { return std::static_pointer_cast(offset_buffer_); } const std::shared_ptr& value_type() const { return values_->type(); } - const int32_t* offsets() const { return offsets_; } + const int32_t* raw_offsets() const { return offsets_; } int32_t offset(int i) const { return offsets_[i]; } @@ -76,6 +78,8 @@ class ARROW_EXPORT ListArray : public Array { bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, const ArrayPtr& arr) const override; + Status Accept(ArrayVisitor* visitor) const override; + protected: std::shared_ptr offset_buffer_; const int32_t* offsets_; diff --git a/cpp/src/arrow/types/primitive-test.cc b/cpp/src/arrow/types/primitive-test.cc index e47f6dc74fb7e..bdc8ec00be02c 100644 --- a/cpp/src/arrow/types/primitive-test.cc +++ b/cpp/src/arrow/types/primitive-test.cc @@ -25,6 +25,7 @@ #include "arrow/builder.h" #include "arrow/test-util.h" #include "arrow/type.h" +#include "arrow/type_traits.h" #include "arrow/types/construct.h" #include "arrow/types/primitive.h" #include "arrow/types/test-common.h" @@ -41,15 +42,15 @@ namespace arrow { class Array; -#define PRIMITIVE_TEST(KLASS, ENUM, NAME) \ - TEST(TypesTest, TestPrimitive_##ENUM) { \ - KLASS tp; \ - \ - ASSERT_EQ(tp.type, Type::ENUM); \ - ASSERT_EQ(tp.name(), string(NAME)); \ - \ - KLASS tp_copy = tp; \ - ASSERT_EQ(tp_copy.type, Type::ENUM); \ +#define PRIMITIVE_TEST(KLASS, ENUM, NAME) \ + TEST(TypesTest, TestPrimitive_##ENUM) { \ + KLASS tp; \ + \ + ASSERT_EQ(tp.type, Type::ENUM); \ + ASSERT_EQ(tp.ToString(), string(NAME)); \ + \ + KLASS tp_copy = tp; \ + ASSERT_EQ(tp_copy.type, Type::ENUM); \ } PRIMITIVE_TEST(Int8Type, INT8, "int8"); @@ -243,7 +244,8 @@ void TestPrimitiveBuilder::Check( } typedef ::testing::Types Primitives; + PInt32, PInt64, PFloat, PDouble> + Primitives; TYPED_TEST_CASE(TestPrimitiveBuilder, Primitives); @@ -311,20 +313,6 @@ TYPED_TEST(TestPrimitiveBuilder, TestArrayDtorDealloc) { ASSERT_EQ(memory_before, this->pool_->bytes_allocated()); } -template -Status MakeArray(const vector& valid_bytes, const vector& draws, int size, - Builder* builder, ArrayPtr* out) { - // Append the first 1000 - for (int i = 0; i < size; ++i) { - if (valid_bytes[i] > 0) { - RETURN_NOT_OK(builder->Append(draws[i])); - } else { - RETURN_NOT_OK(builder->AppendNull()); - } - } - return builder->Finish(out); -} - TYPED_TEST(TestPrimitiveBuilder, Equality) { DECL_T(); diff --git a/cpp/src/arrow/types/primitive.cc b/cpp/src/arrow/types/primitive.cc index d2288bafa71da..14667ee5b6eac 100644 --- a/cpp/src/arrow/types/primitive.cc +++ b/cpp/src/arrow/types/primitive.cc @@ -19,6 +19,7 @@ #include +#include "arrow/type_traits.h" #include "arrow/util/bit-util.h" #include "arrow/util/buffer.h" #include "arrow/util/logging.h" @@ -48,13 +49,14 @@ bool PrimitiveArray::EqualsExact(const PrimitiveArray& other) const { const uint8_t* this_data = raw_data_; const uint8_t* other_data = other.raw_data_; - int value_size = type_->value_size(); - DCHECK_GT(value_size, 0); + auto size_meta = dynamic_cast(type_.get()); + int value_byte_size = 
size_meta->bit_width() / 8; + DCHECK_GT(value_byte_size, 0); for (int i = 0; i < length_; ++i) { - if (!IsNull(i) && memcmp(this_data, other_data, value_size)) { return false; } - this_data += value_size; - other_data += value_size; + if (!IsNull(i) && memcmp(this_data, other_data, value_byte_size)) { return false; } + this_data += value_byte_size; + other_data += value_byte_size; } return true; } else { @@ -70,6 +72,11 @@ bool PrimitiveArray::Equals(const std::shared_ptr& arr) const { return EqualsExact(*static_cast(arr.get())); } +template +Status NumericArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + template class NumericArray; template class NumericArray; template class NumericArray; @@ -79,9 +86,9 @@ template class NumericArray; template class NumericArray; template class NumericArray; template class NumericArray; +template class NumericArray; template class NumericArray; template class NumericArray; -template class NumericArray; template Status PrimitiveBuilder::Init(int32_t capacity) { @@ -145,8 +152,65 @@ Status PrimitiveBuilder::Finish(std::shared_ptr* out) { return Status::OK(); } -template <> -Status PrimitiveBuilder::Append( +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; + +Status BooleanBuilder::Init(int32_t capacity) { + RETURN_NOT_OK(ArrayBuilder::Init(capacity)); + data_ = std::make_shared(pool_); + + int64_t nbytes = BitUtil::BytesForBits(capacity); + RETURN_NOT_OK(data_->Resize(nbytes)); + // TODO(emkornfield) valgrind complains without this + memset(data_->mutable_data(), 0, nbytes); + + raw_data_ = reinterpret_cast(data_->mutable_data()); + return Status::OK(); +} + +Status BooleanBuilder::Resize(int32_t capacity) { + // XXX: Set floor size for now + if (capacity < kMinBuilderCapacity) { capacity = kMinBuilderCapacity; } + + if (capacity_ == 0) { + RETURN_NOT_OK(Init(capacity)); + } else { + RETURN_NOT_OK(ArrayBuilder::Resize(capacity)); + const int64_t old_bytes = data_->size(); + const int64_t new_bytes = BitUtil::BytesForBits(capacity); + + RETURN_NOT_OK(data_->Resize(new_bytes)); + raw_data_ = reinterpret_cast(data_->mutable_data()); + memset(data_->mutable_data() + old_bytes, 0, new_bytes - old_bytes); + } + return Status::OK(); +} + +Status BooleanBuilder::Finish(std::shared_ptr* out) { + const int64_t bytes_required = BitUtil::BytesForBits(length_); + + if (bytes_required > 0 && bytes_required < data_->size()) { + // Trim buffers + RETURN_NOT_OK(data_->Resize(bytes_required)); + } + *out = std::make_shared(type_, length_, data_, null_count_, null_bitmap_); + + data_ = null_bitmap_ = nullptr; + capacity_ = length_ = null_count_ = 0; + return Status::OK(); +} + +Status BooleanBuilder::Append( const uint8_t* values, int32_t length, const uint8_t* valid_bytes) { RETURN_NOT_OK(Reserve(length)); @@ -168,19 +232,6 @@ Status PrimitiveBuilder::Append( return Status::OK(); } -template class PrimitiveBuilder; -template class PrimitiveBuilder; -template class PrimitiveBuilder; -template class PrimitiveBuilder; -template class PrimitiveBuilder; -template class PrimitiveBuilder; -template class PrimitiveBuilder; -template class PrimitiveBuilder; -template class PrimitiveBuilder; -template 
class PrimitiveBuilder; -template class PrimitiveBuilder; -template class PrimitiveBuilder; - BooleanArray::BooleanArray(int32_t length, const std::shared_ptr& data, int32_t null_count, const std::shared_ptr& null_bitmap) : PrimitiveArray( @@ -235,4 +286,8 @@ bool BooleanArray::RangeEquals(int32_t start_idx, int32_t end_idx, return true; } +Status BooleanArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + } // namespace arrow diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index c71df584ffe3f..a5a3704e2d2d3 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -26,6 +26,7 @@ #include "arrow/array.h" #include "arrow/builder.h" #include "arrow/type.h" +#include "arrow/type_fwd.h" #include "arrow/types/datetime.h" #include "arrow/util/bit-util.h" #include "arrow/util/buffer.h" @@ -54,9 +55,10 @@ class ARROW_EXPORT PrimitiveArray : public Array { const uint8_t* raw_data_; }; -template +template class ARROW_EXPORT NumericArray : public PrimitiveArray { public: + using TypeClass = TYPE; using value_type = typename TypeClass::c_type; NumericArray(int32_t length, const std::shared_ptr& data, int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr) @@ -88,29 +90,15 @@ class ARROW_EXPORT NumericArray : public PrimitiveArray { return reinterpret_cast(raw_data_); } + Status Accept(ArrayVisitor* visitor) const override; + value_type Value(int i) const { return raw_data()[i]; } }; -#define NUMERIC_ARRAY_DECL(NAME, TypeClass) \ - using NAME = NumericArray; \ - extern template class ARROW_EXPORT NumericArray; - -NUMERIC_ARRAY_DECL(UInt8Array, UInt8Type); -NUMERIC_ARRAY_DECL(Int8Array, Int8Type); -NUMERIC_ARRAY_DECL(UInt16Array, UInt16Type); -NUMERIC_ARRAY_DECL(Int16Array, Int16Type); -NUMERIC_ARRAY_DECL(UInt32Array, UInt32Type); -NUMERIC_ARRAY_DECL(Int32Array, Int32Type); -NUMERIC_ARRAY_DECL(UInt64Array, UInt64Type); -NUMERIC_ARRAY_DECL(Int64Array, Int64Type); -NUMERIC_ARRAY_DECL(TimestampArray, TimestampType); -NUMERIC_ARRAY_DECL(FloatArray, FloatType); -NUMERIC_ARRAY_DECL(DoubleArray, DoubleType); - template class ARROW_EXPORT PrimitiveBuilder : public ArrayBuilder { public: - typedef typename Type::c_type value_type; + using value_type = typename Type::c_type; explicit PrimitiveBuilder(MemoryPool* pool, const TypePtr& type) : ArrayBuilder(pool, type), data_(nullptr) {} @@ -183,101 +171,27 @@ class ARROW_EXPORT NumericBuilder : public PrimitiveBuilder { using PrimitiveBuilder::raw_data_; }; -template <> -struct TypeTraits { - typedef UInt8Array ArrayType; - - static inline int bytes_required(int elements) { return elements; } -}; - -template <> -struct TypeTraits { - typedef Int8Array ArrayType; - - static inline int bytes_required(int elements) { return elements; } -}; - -template <> -struct TypeTraits { - typedef UInt16Array ArrayType; - - static inline int bytes_required(int elements) { return elements * sizeof(uint16_t); } -}; - -template <> -struct TypeTraits { - typedef Int16Array ArrayType; - - static inline int bytes_required(int elements) { return elements * sizeof(int16_t); } -}; - -template <> -struct TypeTraits { - typedef UInt32Array ArrayType; - - static inline int bytes_required(int elements) { return elements * sizeof(uint32_t); } -}; - -template <> -struct TypeTraits { - typedef Int32Array ArrayType; - - static inline int bytes_required(int elements) { return elements * sizeof(int32_t); } -}; - -template <> -struct TypeTraits { - typedef UInt64Array ArrayType; - - static inline int 
bytes_required(int elements) { return elements * sizeof(uint64_t); } -}; - -template <> -struct TypeTraits { - typedef Int64Array ArrayType; - - static inline int bytes_required(int elements) { return elements * sizeof(int64_t); } -}; - -template <> -struct TypeTraits { - typedef TimestampArray ArrayType; - - static inline int bytes_required(int elements) { return elements * sizeof(int64_t); } -}; -template <> - -struct TypeTraits { - typedef FloatArray ArrayType; - - static inline int bytes_required(int elements) { return elements * sizeof(float); } -}; - -template <> -struct TypeTraits { - typedef DoubleArray ArrayType; - - static inline int bytes_required(int elements) { return elements * sizeof(double); } -}; - // Builders -typedef NumericBuilder UInt8Builder; -typedef NumericBuilder UInt16Builder; -typedef NumericBuilder UInt32Builder; -typedef NumericBuilder UInt64Builder; +using UInt8Builder = NumericBuilder; +using UInt16Builder = NumericBuilder; +using UInt32Builder = NumericBuilder; +using UInt64Builder = NumericBuilder; -typedef NumericBuilder Int8Builder; -typedef NumericBuilder Int16Builder; -typedef NumericBuilder Int32Builder; -typedef NumericBuilder Int64Builder; -typedef NumericBuilder TimestampBuilder; +using Int8Builder = NumericBuilder; +using Int16Builder = NumericBuilder; +using Int32Builder = NumericBuilder; +using Int64Builder = NumericBuilder; +using TimestampBuilder = NumericBuilder; -typedef NumericBuilder FloatBuilder; -typedef NumericBuilder DoubleBuilder; +using HalfFloatBuilder = NumericBuilder; +using FloatBuilder = NumericBuilder; +using DoubleBuilder = NumericBuilder; class ARROW_EXPORT BooleanArray : public PrimitiveArray { public: + using TypeClass = BooleanType; + BooleanArray(int32_t length, const std::shared_ptr& data, int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); BooleanArray(const TypePtr& type, int32_t length, const std::shared_ptr& data, @@ -288,28 +202,36 @@ class ARROW_EXPORT BooleanArray : public PrimitiveArray { bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, const ArrayPtr& arr) const override; + Status Accept(ArrayVisitor* visitor) const override; + const uint8_t* raw_data() const { return reinterpret_cast(raw_data_); } bool Value(int i) const { return BitUtil::GetBit(raw_data(), i); } }; -template <> -struct TypeTraits { - typedef BooleanArray ArrayType; - - static inline int bytes_required(int elements) { - return BitUtil::BytesForBits(elements); - } -}; - -class ARROW_EXPORT BooleanBuilder : public PrimitiveBuilder { +class ARROW_EXPORT BooleanBuilder : public ArrayBuilder { public: explicit BooleanBuilder(MemoryPool* pool, const TypePtr& type) - : PrimitiveBuilder(pool, type) {} + : ArrayBuilder(pool, type), data_(nullptr) {} virtual ~BooleanBuilder() {} - using PrimitiveBuilder::Append; + using ArrayBuilder::Advance; + + // Write nulls as uint8_t* (0 value indicates null) into pre-allocated memory + Status AppendNulls(const uint8_t* valid_bytes, int32_t length) { + RETURN_NOT_OK(Reserve(length)); + UnsafeAppendToBitmap(valid_bytes, length); + return Status::OK(); + } + + Status AppendNull() { + RETURN_NOT_OK(Reserve(1)); + UnsafeAppendToBitmap(false); + return Status::OK(); + } + + std::shared_ptr data() const { return data_; } // Scalar append Status Append(bool val) { @@ -324,9 +246,39 @@ class ARROW_EXPORT BooleanBuilder : public PrimitiveBuilder { return Status::OK(); } - Status Append(uint8_t val) { return Append(static_cast(val)); } + // Vector append + // + // If passed, 
valid_bytes is of equal length to values, and any zero byte + // will be considered as a null for that slot + Status Append( + const uint8_t* values, int32_t length, const uint8_t* valid_bytes = nullptr); + + Status Finish(std::shared_ptr* out) override; + Status Init(int32_t capacity) override; + + // Increase the capacity of the builder to accommodate at least the indicated + // number of elements + Status Resize(int32_t capacity) override; + + protected: + std::shared_ptr data_; + uint8_t* raw_data_; }; +// Only instantiate these templates once +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; + } // namespace arrow #endif // ARROW_TYPES_PRIMITIVE_H diff --git a/cpp/src/arrow/types/string-test.cc b/cpp/src/arrow/types/string-test.cc index af87a14a8b32e..3c4b12b7bc772 100644 --- a/cpp/src/arrow/types/string-test.cc +++ b/cpp/src/arrow/types/string-test.cc @@ -47,7 +47,7 @@ TEST(TypesTest, BinaryType) { TEST(TypesTest, TestStringType) { StringType str; ASSERT_EQ(str.type, Type::STRING); - ASSERT_EQ(str.name(), std::string("string")); + ASSERT_EQ(str.ToString(), std::string("string")); } // ---------------------------------------------------------------------- @@ -66,8 +66,8 @@ class TestStringContainer : public ::testing::Test { void MakeArray() { length_ = offsets_.size() - 1; - value_buf_ = test::to_buffer(chars_); - offsets_buf_ = test::to_buffer(offsets_); + value_buf_ = test::GetBufferFromVector(chars_); + offsets_buf_ = test::GetBufferFromVector(offsets_); null_bitmap_ = test::bytes_to_null_buffer(valid_bytes_); null_count_ = test::null_count(valid_bytes_); @@ -131,7 +131,7 @@ TEST_F(TestStringContainer, TestGetString) { TEST_F(TestStringContainer, TestEmptyStringComparison) { offsets_ = {0, 0, 0, 0, 0, 0}; - offsets_buf_ = test::to_buffer(offsets_); + offsets_buf_ = test::GetBufferFromVector(offsets_); length_ = offsets_.size() - 1; auto strings_a = std::make_shared( @@ -227,8 +227,8 @@ class TestBinaryContainer : public ::testing::Test { void MakeArray() { length_ = offsets_.size() - 1; - value_buf_ = test::to_buffer(chars_); - offsets_buf_ = test::to_buffer(offsets_); + value_buf_ = test::GetBufferFromVector(chars_); + offsets_buf_ = test::GetBufferFromVector(offsets_); null_bitmap_ = test::bytes_to_null_buffer(valid_bytes_); null_count_ = test::null_count(valid_bytes_); diff --git a/cpp/src/arrow/types/string.cc b/cpp/src/arrow/types/string.cc index f6d26df3167c9..db963dfa0de5f 100644 --- a/cpp/src/arrow/types/string.cc +++ b/cpp/src/arrow/types/string.cc @@ -94,6 +94,10 @@ bool BinaryArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_ return true; } +Status BinaryArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + StringArray::StringArray(int32_t length, const std::shared_ptr& offsets, const std::shared_ptr& data, int32_t null_count, const std::shared_ptr& null_bitmap) @@ -104,6 +108,10 @@ Status StringArray::Validate() const { return BinaryArray::Validate(); } +Status 
StringArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + // This used to be a static member variable of BinaryBuilder, but it can cause // valgrind to report a (spurious?) memory leak when needed in other shared // libraries. The problem came up while adding explicit visibility to libarrow @@ -122,8 +130,8 @@ Status BinaryBuilder::Finish(std::shared_ptr* out) { const auto list = std::dynamic_pointer_cast(result); auto values = std::dynamic_pointer_cast(list->values()); - *out = std::make_shared(list->length(), list->offset_buffer(), - values->data(), list->null_count(), list->null_bitmap()); + *out = std::make_shared(list->length(), list->offsets(), values->data(), + list->null_count(), list->null_bitmap()); return Status::OK(); } @@ -134,8 +142,8 @@ Status StringBuilder::Finish(std::shared_ptr* out) { const auto list = std::dynamic_pointer_cast(result); auto values = std::dynamic_pointer_cast(list->values()); - *out = std::make_shared(list->length(), list->offset_buffer(), - values->data(), list->null_count(), list->null_bitmap()); + *out = std::make_shared(list->length(), list->offsets(), values->data(), + list->null_count(), list->null_bitmap()); return Status::OK(); } diff --git a/cpp/src/arrow/types/string.h b/cpp/src/arrow/types/string.h index aaba49c60237d..c8752439f168c 100644 --- a/cpp/src/arrow/types/string.h +++ b/cpp/src/arrow/types/string.h @@ -37,6 +37,8 @@ class MemoryPool; class ARROW_EXPORT BinaryArray : public Array { public: + using TypeClass = BinaryType; + BinaryArray(int32_t length, const std::shared_ptr& offsets, const std::shared_ptr& data, int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); @@ -60,6 +62,8 @@ class ARROW_EXPORT BinaryArray : public Array { std::shared_ptr data() const { return data_buffer_; } std::shared_ptr offsets() const { return offset_buffer_; } + const int32_t* raw_offsets() const { return offsets_; } + int32_t offset(int i) const { return offsets_[i]; } // Neither of these functions will perform boundschecking @@ -73,6 +77,8 @@ class ARROW_EXPORT BinaryArray : public Array { Status Validate() const override; + Status Accept(ArrayVisitor* visitor) const override; + private: std::shared_ptr offset_buffer_; const int32_t* offsets_; @@ -83,6 +89,8 @@ class ARROW_EXPORT BinaryArray : public Array { class ARROW_EXPORT StringArray : public BinaryArray { public: + using TypeClass = StringType; + StringArray(int32_t length, const std::shared_ptr& offsets, const std::shared_ptr& data, int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); @@ -96,6 +104,8 @@ class ARROW_EXPORT StringArray : public BinaryArray { } Status Validate() const override; + + Status Accept(ArrayVisitor* visitor) const override; }; // BinaryBuilder : public ListBuilder @@ -109,6 +119,12 @@ class ARROW_EXPORT BinaryBuilder : public ListBuilder { return byte_builder_->Append(value, length); } + Status Append(const char* value, int32_t length) { + return Append(reinterpret_cast(value), length); + } + + Status Append(const std::string& value) { return Append(value.c_str(), value.size()); } + Status Finish(std::shared_ptr* out) override; protected: @@ -121,13 +137,9 @@ class ARROW_EXPORT StringBuilder : public BinaryBuilder { explicit StringBuilder(MemoryPool* pool, const TypePtr& type) : BinaryBuilder(pool, type) {} - Status Finish(std::shared_ptr* out) override; - - Status Append(const std::string& value) { return Append(value.c_str(), value.size()); } + using BinaryBuilder::Append; - Status Append(const char* value, 
int32_t length) { - return BinaryBuilder::Append(reinterpret_cast(value), length); - } + Status Finish(std::shared_ptr* out) override; Status Append(const std::vector& values, uint8_t* null_bytes); }; diff --git a/cpp/src/arrow/types/struct-test.cc b/cpp/src/arrow/types/struct-test.cc index 8e82c389a9423..197d7d4ad1f5e 100644 --- a/cpp/src/arrow/types/struct-test.cc +++ b/cpp/src/arrow/types/struct-test.cc @@ -80,7 +80,7 @@ void ValidateBasicStructArray(const StructArray* result, ASSERT_EQ(4, list_char_arr->length()); ASSERT_EQ(10, list_char_arr->values()->length()); for (size_t i = 0; i < list_offsets.size(); ++i) { - ASSERT_EQ(list_offsets[i], list_char_arr->offsets()[i]); + ASSERT_EQ(list_offsets[i], list_char_arr->raw_offsets()[i]); } for (size_t i = 0; i < list_values.size(); ++i) { ASSERT_EQ(list_values[i], char_arr->Value(i)); diff --git a/cpp/src/arrow/types/struct.cc b/cpp/src/arrow/types/struct.cc index 369c29d15ef96..0e0db23544bf7 100644 --- a/cpp/src/arrow/types/struct.cc +++ b/cpp/src/arrow/types/struct.cc @@ -87,6 +87,10 @@ Status StructArray::Validate() const { return Status::OK(); } +Status StructArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + Status StructBuilder::Finish(std::shared_ptr* out) { std::vector> fields(field_builders_.size()); for (size_t i = 0; i < field_builders_.size(); ++i) { diff --git a/cpp/src/arrow/types/struct.h b/cpp/src/arrow/types/struct.h index 65b8daf214a69..035af05132572 100644 --- a/cpp/src/arrow/types/struct.h +++ b/cpp/src/arrow/types/struct.h @@ -31,6 +31,8 @@ namespace arrow { class ARROW_EXPORT StructArray : public Array { public: + using TypeClass = StructType; + StructArray(const TypePtr& type, int32_t length, std::vector& field_arrays, int32_t null_count = 0, std::shared_ptr null_bitmap = nullptr) : Array(type, length, null_count, null_bitmap) { @@ -55,6 +57,8 @@ class ARROW_EXPORT StructArray : public Array { bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, const std::shared_ptr& arr) const override; + Status Accept(ArrayVisitor* visitor) const override; + protected: // The child arrays corresponding to each field of the struct data type. 
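[Annotation, as an aside before the patch continues: the `Accept(ArrayVisitor*)` overrides added throughout this commit — on `ListArray`, `BooleanArray`, `BinaryArray`, `StringArray`, `StructArray`, and `NumericArray<T>` — implement classic double dispatch. A condensed sketch under assumed names; Arrow's real visitor returns `Status`, while `void` is used here for brevity:]

```cpp
#include <iostream>

struct Int32ArrayLike;
struct StructArrayLike;

struct Visitor {
  virtual ~Visitor() = default;
  virtual void Visit(const Int32ArrayLike&) = 0;
  virtual void Visit(const StructArrayLike&) = 0;
};

struct ArrayLike {
  virtual ~ArrayLike() = default;
  // First dispatch: the virtual call selects the concrete array type.
  virtual void Accept(Visitor* visitor) const = 0;
};

struct Int32ArrayLike : ArrayLike {
  // Second dispatch: overload resolution selects the matching Visit().
  void Accept(Visitor* visitor) const override { visitor->Visit(*this); }
};

struct StructArrayLike : ArrayLike {
  void Accept(Visitor* visitor) const override { visitor->Visit(*this); }
};

struct TypeNamePrinter : Visitor {
  void Visit(const Int32ArrayLike&) override { std::cout << "int32\n"; }
  void Visit(const StructArrayLike&) override { std::cout << "struct\n"; }
};

int main() {
  StructArrayLike arr;
  TypeNamePrinter printer;
  static_cast<const ArrayLike&>(arr).Accept(&printer);  // prints "struct"
  return 0;
}
```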
std::vector field_arrays_; diff --git a/cpp/src/arrow/types/test-common.h b/cpp/src/arrow/types/test-common.h index 1957636b141fd..6e6ab85ad4eb7 100644 --- a/cpp/src/arrow/types/test-common.h +++ b/cpp/src/arrow/types/test-common.h @@ -24,6 +24,8 @@ #include "gtest/gtest.h" +#include "arrow/array.h" +#include "arrow/builder.h" #include "arrow/test-util.h" #include "arrow/type.h" #include "arrow/util/memory-pool.h" @@ -49,6 +51,20 @@ class TestBuilder : public ::testing::Test { unique_ptr builder_nn_; }; +template +Status MakeArray(const std::vector& valid_bytes, const std::vector& values, + int size, Builder* builder, ArrayPtr* out) { + // Append the first 1000 + for (int i = 0; i < size; ++i) { + if (valid_bytes[i] > 0) { + RETURN_NOT_OK(builder->Append(values[i])); + } else { + RETURN_NOT_OK(builder->AppendNull()); + } + } + return builder->Finish(out); +} + } // namespace arrow #endif // ARROW_TYPES_TEST_COMMON_H diff --git a/cpp/src/arrow/types/union.cc b/cpp/src/arrow/types/union.cc index c891b4a5357ef..cc2934b2e4adb 100644 --- a/cpp/src/arrow/types/union.cc +++ b/cpp/src/arrow/types/union.cc @@ -24,25 +24,4 @@ #include "arrow/type.h" -namespace arrow { - -static inline std::string format_union(const std::vector& child_types) { - std::stringstream s; - s << "union<"; - for (size_t i = 0; i < child_types.size(); ++i) { - if (i) { s << ", "; } - s << child_types[i]->ToString(); - } - s << ">"; - return s.str(); -} - -std::string DenseUnionType::ToString() const { - return format_union(child_types_); -} - -std::string SparseUnionType::ToString() const { - return format_union(child_types_); -} - -} // namespace arrow +namespace arrow {} // namespace arrow diff --git a/cpp/src/arrow/types/union.h b/cpp/src/arrow/types/union.h index d2ee9bde04d0d..44f39cc69942b 100644 --- a/cpp/src/arrow/types/union.h +++ b/cpp/src/arrow/types/union.h @@ -24,32 +24,11 @@ #include "arrow/array.h" #include "arrow/type.h" -#include "arrow/types/collection.h" namespace arrow { class Buffer; -struct DenseUnionType : public CollectionType { - typedef CollectionType Base; - - explicit DenseUnionType(const std::vector& child_types) : Base() { - child_types_ = child_types; - } - - virtual std::string ToString() const; -}; - -struct SparseUnionType : public CollectionType { - typedef CollectionType Base; - - explicit SparseUnionType(const std::vector& child_types) : Base() { - child_types_ = child_types; - } - - virtual std::string ToString() const; -}; - class UnionArray : public Array { protected: // The data are types encoded as int16 diff --git a/cpp/src/arrow/util/logging.h b/cpp/src/arrow/util/logging.h index 06ee8411e283c..b22f07dd6345f 100644 --- a/cpp/src/arrow/util/logging.h +++ b/cpp/src/arrow/util/logging.h @@ -118,9 +118,9 @@ class CerrLog { class FatalLog : public CerrLog { public: explicit FatalLog(int /* severity */) // NOLINT - : CerrLog(ARROW_FATAL) {} // NOLINT + : CerrLog(ARROW_FATAL){} // NOLINT - [[noreturn]] ~FatalLog() { + [[noreturn]] ~FatalLog() { if (has_logged_) { std::cerr << std::endl; } std::exit(1); } diff --git a/format/Metadata.md b/format/Metadata.md index 653a4c73e830e..a4878f347073f 100644 --- a/format/Metadata.md +++ b/format/Metadata.md @@ -98,6 +98,11 @@ Union: "typeIds" : [ /* integer */ ] } ``` + +The `typeIds` field in the Union are the codes used to denote each type, which +may be different from the index of the child array. This is so that the union +type ids do not have to be enumerated from 0. 
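[Annotation: to make the `typeIds` remark above concrete, here is a hypothetical union — not taken from the spec text — whose two children carry the type ids 5 and 10. A value tagged with type id 10 resolves to the second child (index 1), not to a child numbered 10:]

```
{
  "name" : "union",
  "mode" : "Sparse",
  "typeIds" : [5, 10]
}
```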
+ Int: ``` { From 997f502ce10d6ae3e0b7b1df55e167fda040fc18 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Mon, 21 Nov 2016 13:25:55 -0500 Subject: [PATCH 0202/1644] ARROW-382: Extend Python API documentation * Fix numpydoc compilation * Add simple examples to the API * Move away from deprecated Cython-property declaration * Add basic descriptions with return types to functions Author: Uwe L. Korn Closes #208 from xhochy/ARROW-382 and squashes the following commits: 31e0cb3 [Uwe L. Korn] ARROW-382: Extend Python API documentation --- python/doc/conf.py | 5 +- python/pyarrow/array.pyx | 53 ++++++- python/pyarrow/compat.py | 2 + python/pyarrow/table.pyx | 333 ++++++++++++++++++++++++++++++--------- 4 files changed, 318 insertions(+), 75 deletions(-) diff --git a/python/doc/conf.py b/python/doc/conf.py index 99ac3512ec9d4..4c324a8086c39 100644 --- a/python/doc/conf.py +++ b/python/doc/conf.py @@ -59,9 +59,12 @@ 'sphinx.ext.doctest', 'sphinx.ext.mathjax', 'sphinx.ext.viewcode', - 'numpydoc' + 'sphinx.ext.napoleon' ] +# numpydoc configuration +napoleon_use_rtype = False + # Add any paths that contain templates here, relative to this directory. templates_path = ['_templates'] diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index fbe4e3879062c..6c862751fc218 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -54,6 +54,41 @@ cdef class Array: @staticmethod def from_pandas(obj, mask=None): + """ + Create an array from a pandas.Series + + Parameters + ---------- + obj : pandas.Series or numpy.ndarray + vector holding the data + mask : numpy.ndarray, optional + boolean mask if the object is valid or null + + Returns + ------- + pyarrow.Array + + Examples + -------- + + >>> import pandas as pd + >>> import pyarrow as pa + >>> pa.Array.from_pandas(pd.Series([1, 2])) + + [ + 1, + 2 + ] + + + >>> import numpy as np + >>> pa.Array.from_pandas(pd.Series([1, 2]), np.array([0, 1], dtype=bool)) + + [ + 1, + NA + ] + """ return from_pandas_series(obj, mask) property null_count: @@ -228,6 +263,14 @@ cdef object box_arrow_array(const shared_ptr[CArray]& sp_array): def from_pylist(object list_obj, DataType type=None): """ Convert Python list to Arrow array + + Parameters + ---------- + list_obj : array_like + + Returns + ------- + pyarrow.array.Array """ cdef: shared_ptr[CArray] sp_array @@ -246,15 +289,19 @@ def from_pandas_series(object series, object mask=None, timestamps_to_ms=False): Parameters ---------- - series: pandas.Series or numpy.ndarray + series : pandas.Series or numpy.ndarray - mask: pandas.Series or numpy.ndarray + mask : pandas.Series or numpy.ndarray, optional array to mask null entries in the series - timestamps_to_ms: bool + timestamps_to_ms : bool, optional Convert datetime columns to ms resolution. This is needed for compability with other functionality like Parquet I/O which only supports milliseconds. 
+ + Returns + ------- + pyarrow.array.Array """ cdef: shared_ptr[CArray] out diff --git a/python/pyarrow/compat.py b/python/pyarrow/compat.py index 08f0f23796797..2dfdb5041d13e 100644 --- a/python/pyarrow/compat.py +++ b/python/pyarrow/compat.py @@ -90,3 +90,5 @@ def frombytes(o): integer_types = six.integer_types + (np.integer,) + +__all__ = [] diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index 5459f26b80aa4..a6715b141ce41 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -36,9 +36,13 @@ from pyarrow.compat import frombytes, tobytes cimport cpython cdef class ChunkedArray: - ''' + """ + Array backed via one or more memory chunks. + + Warning + ------- Do not call this class's constructor directly. - ''' + """ def __cinit__(self): self.chunked_array = NULL @@ -59,19 +63,42 @@ cdef class ChunkedArray: def __len__(self): return self.length() - property null_count: + @property + def null_count(self): + """ + Number of null entires - def __get__(self): - self._check_nullptr() - return self.chunked_array.null_count() + Returns + ------- + int + """ + self._check_nullptr() + return self.chunked_array.null_count() - property num_chunks: + @property + def num_chunks(self): + """ + Number of underlying chunks - def __get__(self): - self._check_nullptr() - return self.chunked_array.num_chunks() + Returns + ------- + int + """ + self._check_nullptr() + return self.chunked_array.num_chunks() def chunk(self, i): + """ + Select a chunk by its index + + Parameters + ---------- + i : int + + Returns + ------- + pyarrow.array.Array + """ self._check_nullptr() return box_arrow_array(self.chunked_array.chunk(i)) @@ -82,9 +109,13 @@ cdef class ChunkedArray: cdef class Column: - ''' + """ + Named vector of elements of equal type. + + Warning + ------- Do not call this class's constructor directly. 
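[Annotation, aside: since `Column` is never constructed directly, a hedged sketch of the indirect route, using only APIs documented in this patch — the `from_pandas_dataframe` helper shown later in its own docstring, plus `column`, `name`, `shape`, and the `data` property returning a `ChunkedArray`:]

```python
import pandas as pd
import pyarrow as pa

table = pa.table.from_pandas_dataframe(pd.DataFrame({'ints': [1, 2, 3]}))
col = table.column(0)
print(col.name, col.shape)   # 'ints' (3,)
print(col.data.num_chunks)   # chunk count of the backing ChunkedArray
```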
- ''' + """ def __cinit__(self): self.column = NULL @@ -95,7 +126,11 @@ cdef class Column: def to_pandas(self): """ - Convert the arrow::Column to a pandas Series + Convert the arrow::Column to a pandas.Series + + Returns + ------- + pandas.Series """ cdef: PyObject* arr @@ -120,34 +155,64 @@ cdef class Column: self._check_nullptr() return self.column.length() - property shape: + @property + def shape(self): + """ + Dimensions of this columns - def __get__(self): - self._check_nullptr() - return (self.length(),) + Returns + ------- + (int,) + """ + self._check_nullptr() + return (self.length(),) - property null_count: + @property + def null_count(self): + """ + Number of null entires - def __get__(self): - self._check_nullptr() - return self.column.null_count() + Returns + ------- + int + """ + self._check_nullptr() + return self.column.null_count() - property name: + @property + def name(self): + """ + Label of the column - def __get__(self): - return frombytes(self.column.name()) + Returns + ------- + str + """ + return frombytes(self.column.name()) - property type: + @property + def type(self): + """ + Type information for this column - def __get__(self): - return box_data_type(self.column.type()) + Returns + ------- + pyarrow.schema.DataType + """ + return box_data_type(self.column.type()) - property data: + @property + def data(self): + """ + The underlying data - def __get__(self): - cdef ChunkedArray chunked_array = ChunkedArray() - chunked_array.init(self.column.data()) - return chunked_array + Returns + ------- + pyarrow.table.ChunkedArray + """ + cdef ChunkedArray chunked_array = ChunkedArray() + chunked_array.init(self.column.data()) + return chunked_array cdef _schema_from_arrays(arrays, names, shared_ptr[CSchema]* schema): @@ -186,6 +251,13 @@ cdef _dataframe_to_arrays(df, name, timestamps_to_ms): cdef class RecordBatch: + """ + Batch of rows of columns of equal length + + Warning + ------- + Do not call this class's constructor directly, use one of the ``from_*`` methods instead. + """ def __cinit__(self): self.batch = NULL @@ -203,28 +275,48 @@ cdef class RecordBatch: self._check_nullptr() return self.batch.num_rows() - property num_columns: + @property + def num_columns(self): + """ + Number of columns - def __get__(self): - self._check_nullptr() - return self.batch.num_columns() + Returns + ------- + int + """ + self._check_nullptr() + return self.batch.num_columns() - property num_rows: + @property + def num_rows(self): + """ + Number of rows - def __get__(self): - return len(self) + Due to the definition of a RecordBatch, all columns have the same number of rows. 
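[Annotation, aside: a short sketch of that equal-length invariant, assuming the `RecordBatch.from_pandas` constructor documented further down in this patch:]

```python
import pandas as pd
import pyarrow as pa

batch = pa.RecordBatch.from_pandas(
    pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]}))
assert batch.num_rows == len(batch) == 3  # num_rows delegates to len()
assert batch.num_columns == 2
```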
- property schema: + Returns + ------- + int + """ + return len(self) - def __get__(self): - cdef Schema schema - self._check_nullptr() - if self._schema is None: - schema = Schema() - schema.init_schema(self.batch.schema()) - self._schema = schema + @property + def schema(self): + """ + Schema of the RecordBatch and its columns - return self._schema + Returns + ------- + pyarrow.schema.Schema + """ + cdef Schema schema + self._check_nullptr() + if self._schema is None: + schema = Schema() + schema.init_schema(self.batch.schema()) + self._schema = schema + + return self._schema def __getitem__(self, i): cdef Array arr = Array() @@ -240,6 +332,10 @@ cdef class RecordBatch: def to_pandas(self): """ Convert the arrow::RecordBatch to a pandas DataFrame + + Returns + ------- + pandas.DataFrame """ cdef: PyObject* np_arr @@ -263,12 +359,34 @@ cdef class RecordBatch: def from_pandas(cls, df): """ Convert pandas.DataFrame to an Arrow RecordBatch + + Parameters + ---------- + df: pandas.DataFrame + + Returns + ------- + pyarrow.table.RecordBatch """ names, arrays = _dataframe_to_arrays(df, None, False) return cls.from_arrays(names, arrays) @staticmethod def from_arrays(names, arrays): + """ + Construct a RecordBatch from multiple pyarrow.Arrays + + Parameters + ---------- + names: list of str + Labels for the columns + arrays: list of pyarrow.Array + column-wise data vectors + + Returns + ------- + pyarrow.table.RecordBatch + """ cdef: Array arr RecordBatch result @@ -297,11 +415,13 @@ cdef class RecordBatch: cdef class Table: - ''' + """ A collection of top-level named, equal length Arrow arrays. - Do not call this class's constructor directly. - ''' + Warning + ------- + Do not call this class's constructor directly, use one of the ``from_*`` methods instead. + """ def __cinit__(self): self.table = NULL @@ -330,6 +450,22 @@ cdef class Table: Convert datetime columns to ms resolution. This is needed for compability with other functionality like Parquet I/O which only supports milliseconds. + + Returns + ------- + pyarrow.table.Table + + Examples + -------- + + >>> import pandas as pd + >>> import pyarrow as pa + >>> df = pd.DataFrame({ + ... 'int': [1, 2], + ... 'str': ['a', 'b'] + ... }) + >>> pa.table.from_pandas_dataframe(df) + """ names, arrays = _dataframe_to_arrays(df, name=name, timestamps_to_ms=timestamps_to_ms) @@ -347,8 +483,13 @@ cdef class Table: Names for the table columns arrays: list of pyarrow.array.Array Equal-length arrays that should form the table. - name: str - (optional) name for the Table + name: str, optional + name for the Table + + Returns + ------- + pyarrow.table.Table + """ cdef: Array arr @@ -382,6 +523,10 @@ cdef class Table: def to_pandas(self): """ Convert the arrow::Table to a pandas DataFrame + + Returns + ------- + pandas.DataFrame """ cdef: PyObject* arr @@ -402,18 +547,41 @@ cdef class Table: return pd.DataFrame(dict(zip(names, data)), columns=names) - property name: + @property + def name(self): + """ + Label of the table - def __get__(self): - self._check_nullptr() - return frombytes(self.table.name()) + Returns + ------- + str + """ + self._check_nullptr() + return frombytes(self.table.name()) - property schema: + @property + def schema(self): + """ + Schema of the table and its columns - def __get__(self): - raise box_schema(self.table.schema()) + Returns + ------- + pyarrow.schema.Schema + """ + return box_schema(self.table.schema()) def column(self, index): + """ + Select a column by its numeric index. 
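[Annotation, aside before the parameter list resumes: column selection composes with the indexing and iteration helpers defined just below; a short hedged sketch using the `from_pandas_dataframe` entry point shown above:]

```python
import pandas as pd
import pyarrow as pa

table = pa.table.from_pandas_dataframe(pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]}))
assert table.shape == (2, 2)      # (num_rows, num_columns)
col = table.column(0)             # equivalent: table[0]
for c in table.itercolumns():     # columns in numerical order
    print(c.name, c.type)
```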
+ + Parameters + ---------- + index: int + + Returns + ------- + pyarrow.table.Column + """ self._check_nullptr() cdef Column column = Column() column.init(self.table.column(index)) @@ -423,28 +591,51 @@ cdef class Table: return self.column(i) def itercolumns(self): + """ + Iterator over all columns in their numerical order + """ for i in range(self.num_columns): yield self.column(i) - property num_columns: + @property + def num_columns(self): + """ + Number of columns in this table - def __get__(self): - self._check_nullptr() - return self.table.num_columns() + Returns + ------- + int + """ + self._check_nullptr() + return self.table.num_columns() - property num_rows: + @property + def num_rows(self): + """ + Number of rows in this table. - def __get__(self): - self._check_nullptr() - return self.table.num_rows() + Due to the definition of a table, all columns have the same number of rows. + + Returns + ------- + int + """ + self._check_nullptr() + return self.table.num_rows() def __len__(self): return self.num_rows - property shape: + @property + def shape(self): + """ + Dimensions of the table: (#rows, #columns) - def __get__(self): - return (self.num_rows, self.num_columns) + Returns + ------- + (int, int) + """ + return (self.num_rows, self.num_columns) From f082b17323354dc2af31f39c15c58b995ba08360 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 21 Nov 2016 17:48:09 -0500 Subject: [PATCH 0203/1644] ARROW-383: [C++] Integration testing CLI tool Modeled after Java version in ARROW-367 Author: Wes McKinney Closes #209 from wesm/ARROW-383 and squashes the following commits: 7b29b24 [Wes McKinney] Use git master gflags to avoid valgrind error f563d3a [Wes McKinney] Build integration test as a normal unit test executable, opt-in to integration testing bdf1f7a [Wes McKinney] Call the RunCommand method instead dbc6aab [Wes McKinney] Add unit tests for the integration validator ca1eade [Wes McKinney] Clean up includes 1752249 [Wes McKinney] Draft integration testing CLI tool, untested b773d0d [Wes McKinney] Add gflags external project and json-integration-test executable stub --- cpp/CMakeLists.txt | 44 +++ cpp/cmake_modules/FindGFlags.cmake | 60 ++++ cpp/src/arrow/io/file.cc | 7 +- cpp/src/arrow/ipc/CMakeLists.txt | 27 ++ cpp/src/arrow/ipc/adapter.h | 2 +- cpp/src/arrow/ipc/json-integration-test.cc | 381 +++++++++++++++++++++ 6 files changed, 517 insertions(+), 4 deletions(-) create mode 100644 cpp/cmake_modules/FindGFlags.cmake create mode 100644 cpp/src/arrow/ipc/json-integration-test.cc diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 0bff7528578d1..839ea17dee02e 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -26,6 +26,7 @@ include(ExternalProject) set(BUILD_SUPPORT_DIR "${CMAKE_SOURCE_DIR}/build-support") set(THIRDPARTY_DIR "${CMAKE_SOURCE_DIR}/thirdparty") +set(GFLAGS_VERSION "2.1.2") set(GTEST_VERSION "1.7.0") set(GBENCHMARK_VERSION "1.0.0") set(FLATBUFFERS_VERSION "1.3.0") @@ -506,6 +507,49 @@ if(ARROW_BUILD_TESTS) if(GTEST_VENDORED) add_dependencies(gtest googletest_ep) endif() + + # gflags (formerly Googleflags) command line parsing + if("$ENV{GFLAGS_HOME}" STREQUAL "") + if(APPLE) + set(GFLAGS_CMAKE_CXX_FLAGS "-fPIC -std=c++11 -stdlib=libc++") + else() + set(GFLAGS_CMAKE_CXX_FLAGS "-fPIC") + endif() + + set(GFLAGS_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/gflags_ep-prefix/src/gflags_ep") + ExternalProject_Add(gflags_ep + GIT_REPOSITORY https://github.com/gflags/gflags.git + GIT_TAG cce68f0c9c5d054017425e6e6fd54f696d36e8ee + # URL 
"https://github.com/gflags/gflags/archive/v${GFLAGS_VERSION}.tar.gz" + BUILD_IN_SOURCE 1 + CMAKE_ARGS -DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} + -DCMAKE_INSTALL_PREFIX=${GFLAGS_PREFIX} + -DBUILD_SHARED_LIBS=OFF + -DBUILD_STATIC_LIBS=ON + -DBUILD_PACKAGING=OFF + -DBUILD_TESTING=OFF + -BUILD_CONFIG_TESTS=OFF + -DINSTALL_HEADERS=ON + -DCMAKE_CXX_FLAGS=${GFLAGS_CMAKE_CXX_FLAGS}) + + set(GFLAGS_HOME "${GFLAGS_PREFIX}") + set(GFLAGS_INCLUDE_DIR "${GFLAGS_PREFIX}/include") + set(GFLAGS_STATIC_LIB "${GFLAGS_PREFIX}/lib/libgflags.a") + set(GFLAGS_VENDORED 1) + else() + set(GFLAGS_VENDORED 0) + find_package(GFlags REQUIRED) + endif() + + message(STATUS "GFlags include dir: ${GFLAGS_INCLUDE_DIR}") + message(STATUS "GFlags static library: ${GFLAGS_STATIC_LIB}") + include_directories(SYSTEM ${GFLAGS_INCLUDE_DIR}) + ADD_THIRDPARTY_LIB(gflags + STATIC_LIB ${GFLAGS_STATIC_LIB}) + + if(GFLAGS_VENDORED) + add_dependencies(gflags gflags_ep) + endif() endif() if(ARROW_BUILD_BENCHMARKS) diff --git a/cpp/cmake_modules/FindGFlags.cmake b/cpp/cmake_modules/FindGFlags.cmake new file mode 100644 index 0000000000000..eaea83530547b --- /dev/null +++ b/cpp/cmake_modules/FindGFlags.cmake @@ -0,0 +1,60 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +# - Find GFLAGS (gflags.h, libgflags.a, libgflags.so, and libgflags.so.0) +# This module defines +# GFLAGS_INCLUDE_DIR, directory containing headers +# GFLAGS_SHARED_LIB, path to libgflags shared library +# GFLAGS_STATIC_LIB, path to libgflags static library +# GFLAGS_FOUND, whether gflags has been found + +if( NOT "$ENV{GFLAGS_HOME}" STREQUAL "") + file( TO_CMAKE_PATH "$ENV{GFLAGS_HOME}" _native_path ) + list( APPEND _gflags_roots ${_native_path} ) +elseif ( GFlags_HOME ) + list( APPEND _gflags_roots ${GFlags_HOME} ) +endif() + +if ( _gflags_roots ) + find_path(GFLAGS_INCLUDE_DIR NAMES gflags/gflags.h + PATHS ${_gflags_roots} + NO_DEFAULT_PATH + PATH_SUFFIXES "include" ) + find_library(GFLAGS_SHARED_LIB NAMES gflags + PATHS ${_gflags_roots} + NO_DEFAULT_PATH + PATH_SUFFIXES "lib" ) + find_library(GFLAGS_SHARED_LIB NAMES libgflags.a + PATHS ${_gflags_roots} + NO_DEFAULT_PATH + PATH_SUFFIXES "lib" ) +else() + find_path(GFLAGS_INCLUDE_DIR gflags/gflags.h + # make sure we don't accidentally pick up a different version + NO_CMAKE_SYSTEM_PATH + NO_SYSTEM_ENVIRONMENT_PATH) + find_library(GFLAGS_SHARED_LIB gflags + NO_CMAKE_SYSTEM_PATH + NO_SYSTEM_ENVIRONMENT_PATH) + find_library(GFLAGS_STATIC_LIB libgflags.a + NO_CMAKE_SYSTEM_PATH + NO_SYSTEM_ENVIRONMENT_PATH) +endif() + +include(FindPackageHandleStandardArgs) +find_package_handle_standard_args(GFLAGS REQUIRED_VARS + GFLAGS_SHARED_LIB GFLAGS_STATIC_LIB GFLAGS_INCLUDE_DIR) diff --git a/cpp/src/arrow/io/file.cc b/cpp/src/arrow/io/file.cc index 93f0ad91ee86c..05fa6663e335d 100644 --- a/cpp/src/arrow/io/file.cc +++ b/cpp/src/arrow/io/file.cc @@ -186,12 +186,13 @@ static inline Status FileOpenWriteable(const std::string& filename, int* fd) { memcpy(wpath.data(), filename.data(), filename.size()); memcpy(wpath.data() + nwchars, L"\0", sizeof(wchar_t)); - errno_actual = _wsopen_s( - fd, wpath.data(), _O_WRONLY | _O_CREAT | _O_BINARY, _SH_DENYNO, _S_IWRITE); + errno_actual = _wsopen_s(fd, wpath.data(), _O_WRONLY | _O_CREAT | _O_BINARY | _O_TRUNC, + _SH_DENYNO, _S_IWRITE); ret = *fd; #else - ret = *fd = open(filename.c_str(), O_WRONLY | O_CREAT | O_BINARY, ARROW_WRITE_SHMODE); + ret = *fd = + open(filename.c_str(), O_WRONLY | O_CREAT | O_BINARY | O_TRUNC, ARROW_WRITE_SHMODE); #endif return CheckOpenResult(ret, errno_actual, filename.c_str(), filename.size()); } diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index 6955bcb6c233e..f9e7cf792ae51 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -85,6 +85,33 @@ ADD_ARROW_TEST(ipc-json-test) ARROW_TEST_LINK_LIBRARIES(ipc-json-test ${ARROW_IPC_TEST_LINK_LIBS}) +ADD_ARROW_TEST(json-integration-test) + +if (APPLE) + target_link_libraries(json-integration-test + arrow_static + arrow_io + arrow_ipc + gflags + gtest + boost_filesystem_static + boost_system_static + dl) + set_target_properties(json-integration-test + PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") +else() + target_link_libraries(json-integration-test + arrow_static + arrow_io + arrow_ipc + gflags + gtest + pthread + boost_filesystem_static + boost_system_static + dl) +endif() + # make clean will delete the generated file set_source_files_properties(Metadata_generated.h PROPERTIES GENERATED TRUE) diff --git a/cpp/src/arrow/ipc/adapter.h b/cpp/src/arrow/ipc/adapter.h index 3fde18dde835b..b02de284dfc7d 100644 --- a/cpp/src/arrow/ipc/adapter.h +++ b/cpp/src/arrow/ipc/adapter.h @@ -16,7 +16,7 @@ // under the License. 
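[Annotation: the `O_TRUNC`/`_O_TRUNC` flag added to `FileOpenWriteable` above is easy to gloss over. Without it, reopening an existing file for writing keeps the old length, so a shorter rewrite leaves stale trailing bytes. A standalone POSIX sketch of that failure mode — not Arrow code:]

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cassert>

int main() {
  int fd = open("demo.bin", O_WRONLY | O_CREAT, 0644);
  (void)write(fd, "0123456789", 10);
  close(fd);

  // Reopen WITHOUT O_TRUNC and write a shorter payload.
  fd = open("demo.bin", O_WRONLY | O_CREAT, 0644);
  (void)write(fd, "ab", 2);
  close(fd);

  struct stat st;
  stat("demo.bin", &st);
  assert(st.st_size == 10);  // "23456789" survives; adding O_TRUNC would give size 2
  unlink("demo.bin");
  return 0;
}
```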
// Public API for writing and accessing (with zero copy, if possible) Arrow -// data in shared memory +// IPC binary formatted data (e.g. in shared memory, or from some other IO source) #ifndef ARROW_IPC_ADAPTER_H #define ARROW_IPC_ADAPTER_H diff --git a/cpp/src/arrow/ipc/json-integration-test.cc b/cpp/src/arrow/ipc/json-integration-test.cc new file mode 100644 index 0000000000000..5eff8998afbc8 --- /dev/null +++ b/cpp/src/arrow/ipc/json-integration-test.cc @@ -0,0 +1,381 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include +#include +#include +#include +#include +#include +#include + +#include "gflags/gflags.h" +#include "gtest/gtest.h" + +#include // NOLINT + +#include "arrow/io/file.h" +#include "arrow/ipc/file.h" +#include "arrow/ipc/json.h" +#include "arrow/schema.h" +#include "arrow/table.h" +#include "arrow/test-util.h" +#include "arrow/util/status.h" + +DEFINE_string(arrow, "", "Arrow file name"); +DEFINE_string(json, "", "JSON file name"); +DEFINE_string(mode, "VALIDATE", + "Mode of integration testing tool (ARROW_TO_JSON, JSON_TO_ARROW, VALIDATE)"); +DEFINE_bool(integration, false, "Run in integration test mode"); +DEFINE_bool(verbose, true, "Verbose output"); + +namespace fs = boost::filesystem; + +namespace arrow { + +bool file_exists(const char* path) { + std::ifstream handle(path); + return handle.good(); +} + +// Convert JSON file to IPC binary format +static Status ConvertJsonToArrow( + const std::string& json_path, const std::string& arrow_path) { + std::shared_ptr in_file; + std::shared_ptr out_file; + + RETURN_NOT_OK(io::ReadableFile::Open(json_path, &in_file)); + RETURN_NOT_OK(io::FileOutputStream::Open(arrow_path, &out_file)); + + int64_t file_size = 0; + RETURN_NOT_OK(in_file->GetSize(&file_size)); + + std::shared_ptr json_buffer; + RETURN_NOT_OK(in_file->Read(file_size, &json_buffer)); + + std::unique_ptr reader; + RETURN_NOT_OK(ipc::JsonReader::Open(json_buffer, &reader)); + + if (FLAGS_verbose) { + std::cout << "Found schema: " << reader->schema()->ToString() << std::endl; + } + + std::shared_ptr writer; + RETURN_NOT_OK(ipc::FileWriter::Open(out_file.get(), reader->schema(), &writer)); + + for (int i = 0; i < reader->num_record_batches(); ++i) { + std::shared_ptr batch; + RETURN_NOT_OK(reader->GetRecordBatch(i, &batch)); + RETURN_NOT_OK(writer->WriteRecordBatch(batch->columns(), batch->num_rows())); + } + return writer->Close(); +} + +// Convert IPC binary format to JSON +static Status ConvertArrowToJson( + const std::string& arrow_path, const std::string& json_path) { + std::shared_ptr in_file; + std::shared_ptr out_file; + + RETURN_NOT_OK(io::ReadableFile::Open(arrow_path, &in_file)); + RETURN_NOT_OK(io::FileOutputStream::Open(json_path, &out_file)); + + std::shared_ptr reader; + 
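+  // (Editorial annotation, not part of the original patch.) FileReader::Open
+  // reads the Arrow file's footer to recover the schema and the location of
+  // each record batch; the loop below then replays every batch through a
+  // JsonWriter, which is how the binary file is converted back to JSON.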
RETURN_NOT_OK(ipc::FileReader::Open(in_file, &reader)); + + if (FLAGS_verbose) { + std::cout << "Found schema: " << reader->schema()->ToString() << std::endl; + } + + std::unique_ptr writer; + RETURN_NOT_OK(ipc::JsonWriter::Open(reader->schema(), &writer)); + + for (int i = 0; i < reader->num_record_batches(); ++i) { + std::shared_ptr batch; + RETURN_NOT_OK(reader->GetRecordBatch(i, &batch)); + RETURN_NOT_OK(writer->WriteRecordBatch(batch->columns(), batch->num_rows())); + } + + std::string result; + RETURN_NOT_OK(writer->Finish(&result)); + return out_file->Write(reinterpret_cast(result.c_str()), + static_cast(result.size())); +} + +static Status ValidateArrowVsJson( + const std::string& arrow_path, const std::string& json_path) { + // Construct JSON reader + std::shared_ptr json_file; + RETURN_NOT_OK(io::ReadableFile::Open(json_path, &json_file)); + + int64_t file_size = 0; + RETURN_NOT_OK(json_file->GetSize(&file_size)); + + std::shared_ptr json_buffer; + RETURN_NOT_OK(json_file->Read(file_size, &json_buffer)); + + std::unique_ptr json_reader; + RETURN_NOT_OK(ipc::JsonReader::Open(json_buffer, &json_reader)); + + // Construct Arrow reader + std::shared_ptr arrow_file; + RETURN_NOT_OK(io::ReadableFile::Open(arrow_path, &arrow_file)); + + std::shared_ptr arrow_reader; + RETURN_NOT_OK(ipc::FileReader::Open(arrow_file, &arrow_reader)); + + auto json_schema = json_reader->schema(); + auto arrow_schema = arrow_reader->schema(); + + if (!json_schema->Equals(arrow_schema)) { + std::stringstream ss; + ss << "JSON schema: \n" + << json_schema->ToString() << "\n" + << "Arrow schema: \n" + << arrow_schema->ToString(); + + if (FLAGS_verbose) { std::cout << ss.str() << std::endl; } + return Status::Invalid("Schemas did not match"); + } + + const int json_nbatches = json_reader->num_record_batches(); + const int arrow_nbatches = arrow_reader->num_record_batches(); + + if (json_nbatches != arrow_nbatches) { + std::stringstream ss; + ss << "Different number of record batches: " << json_nbatches << " (JSON) vs " + << arrow_nbatches << " (Arrow)"; + return Status::Invalid(ss.str()); + } + + std::shared_ptr arrow_batch; + std::shared_ptr json_batch; + for (int i = 0; i < json_nbatches; ++i) { + RETURN_NOT_OK(json_reader->GetRecordBatch(i, &json_batch)); + RETURN_NOT_OK(arrow_reader->GetRecordBatch(i, &arrow_batch)); + + if (!json_batch->Equals(*arrow_batch.get())) { + std::stringstream ss; + ss << "Record batch " << i << " did not match"; + return Status::Invalid(ss.str()); + } + } + + return Status::OK(); +} + +Status RunCommand(const std::string& json_path, const std::string& arrow_path, + const std::string& command) { + if (json_path == "") { return Status::Invalid("Must specify json file name"); } + + if (arrow_path == "") { return Status::Invalid("Must specify arrow file name"); } + + if (command == "ARROW_TO_JSON") { + if (!file_exists(arrow_path.c_str())) { + return Status::Invalid("Input file does not exist"); + } + + return ConvertArrowToJson(arrow_path, json_path); + } else if (command == "JSON_TO_ARROW") { + if (!file_exists(json_path.c_str())) { + return Status::Invalid("Input file does not exist"); + } + + return ConvertJsonToArrow(json_path, arrow_path); + } else if (command == "VALIDATE") { + if (!file_exists(json_path.c_str())) { + return Status::Invalid("JSON file does not exist"); + } + + if (!file_exists(arrow_path.c_str())) { + return Status::Invalid("Arrow file does not exist"); + } + + return ValidateArrowVsJson(arrow_path, json_path); + } else { + std::stringstream ss; + ss << 
"Unknown command: " << command; + return Status::Invalid(ss.str()); + } +} + +static std::string temp_path() { + return (fs::temp_directory_path() / fs::unique_path()).native(); +} + +class TestJSONIntegration : public ::testing::Test { + public: + void SetUp() {} + + std::string mkstemp() { + auto path = temp_path(); + tmp_paths_.push_back(path); + return path; + } + + Status WriteJson(const char* data, const std::string& path) { + do { + std::shared_ptr out; + RETURN_NOT_OK(io::FileOutputStream::Open(path, &out)); + RETURN_NOT_OK(out->Write( + reinterpret_cast(data), static_cast(strlen(data)))); + } while (0); + return Status::OK(); + } + + void TearDown() { + for (const std::string path : tmp_paths_) { + std::remove(path.c_str()); + } + } + + protected: + std::vector tmp_paths_; +}; + +static const char* JSON_EXAMPLE = R"example( +{ + "schema": { + "fields": [ + { + "name": "foo", + "type": {"name": "int", "isSigned": true, "bitWidth": 32}, + "nullable": true, "children": [], + "typeLayout": [ + {"type": "VALIDITY", "typeBitWidth": 1}, + {"type": "DATA", "typeBitWidth": 32} + ] + }, + { + "name": "bar", + "type": {"name": "floatingpoint", "precision": "DOUBLE"}, + "nullable": true, "children": [], + "typeLayout": [ + {"type": "VALIDITY", "typeBitWidth": 1}, + {"type": "DATA", "typeBitWidth": 64} + ] + } + ] + }, + "batches": [ + { + "count": 5, + "columns": [ + { + "name": "foo", + "count": 5, + "DATA": [1, 2, 3, 4, 5], + "VALIDITY": [1, 0, 1, 1, 1] + }, + { + "name": "bar", + "count": 5, + "DATA": [1.0, 2.0, 3.0, 4.0, 5.0], + "VALIDITY": [1, 0, 0, 1, 1] + } + ] + } + ] +} +)example"; + +static const char* JSON_EXAMPLE2 = R"example( +{ + "schema": { + "fields": [ + { + "name": "foo", + "type": {"name": "int", "isSigned": true, "bitWidth": 32}, + "nullable": true, "children": [], + "typeLayout": [ + {"type": "VALIDITY", "typeBitWidth": 1}, + {"type": "DATA", "typeBitWidth": 32} + ] + } + ] + }, + "batches": [ + { + "count": 5, + "columns": [ + { + "name": "foo", + "count": 5, + "DATA": [1, 2, 3, 4, 5], + "VALIDITY": [1, 0, 1, 1, 1] + } + ] + } + ] +} +)example"; + +TEST_F(TestJSONIntegration, ConvertAndValidate) { + std::string json_path = this->mkstemp(); + std::string arrow_path = this->mkstemp(); + + ASSERT_OK(WriteJson(JSON_EXAMPLE, json_path)); + + ASSERT_OK(RunCommand(json_path, arrow_path, "JSON_TO_ARROW")); + ASSERT_OK(RunCommand(json_path, arrow_path, "VALIDATE")); + + // Convert and overwrite + ASSERT_OK(RunCommand(json_path, arrow_path, "ARROW_TO_JSON")); + + // Convert back to arrow, and validate + ASSERT_OK(RunCommand(json_path, arrow_path, "JSON_TO_ARROW")); + ASSERT_OK(RunCommand(json_path, arrow_path, "VALIDATE")); +} + +TEST_F(TestJSONIntegration, ErrorStates) { + std::string json_path = this->mkstemp(); + std::string json_path2 = this->mkstemp(); + std::string arrow_path = this->mkstemp(); + + ASSERT_OK(WriteJson(JSON_EXAMPLE, json_path)); + ASSERT_OK(WriteJson(JSON_EXAMPLE2, json_path2)); + + ASSERT_OK(ConvertJsonToArrow(json_path, arrow_path)); + ASSERT_RAISES(Invalid, ValidateArrowVsJson(arrow_path, json_path2)); + + ASSERT_RAISES(IOError, ValidateArrowVsJson("does_not_exist-1234", json_path2)); + ASSERT_RAISES(IOError, ValidateArrowVsJson(arrow_path, "does_not_exist-1234")); + + ASSERT_RAISES(Invalid, RunCommand("", arrow_path, "VALIDATE")); + ASSERT_RAISES(Invalid, RunCommand(json_path, "", "VALIDATE")); +} + +} // namespace arrow + +int main(int argc, char** argv) { + gflags::ParseCommandLineFlags(&argc, &argv, true); + + int ret = 0; + + if (FLAGS_integration) { 
+ arrow::Status result = arrow::RunCommand(FLAGS_json, FLAGS_arrow, FLAGS_mode); + if (!result.ok()) { + std::cout << "Error message: " << result.ToString() << std::endl; + ret = 1; + } + } else { + ::testing::InitGoogleTest(&argc, argv); + ret = RUN_ALL_TESTS(); + } + gflags::ShutDownCommandLineFlags(); + return ret; +} From 197120cbc7ae419657bb3d22d1c343b49ec3e984 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sat, 26 Nov 2016 14:14:40 -0500 Subject: [PATCH 0204/1644] ARROW-390: Only specify dependencies for json-integration-test on ARROW_BUILD_TESTS=ON Author: Uwe L. Korn Closes #215 from xhochy/ARROW-390 and squashes the following commits: 36d34c7 [Uwe L. Korn] ARROW-390: Only specify dependencies for json-integration-test on ARROW_BUILD_TESTS=ON --- cpp/src/arrow/ipc/CMakeLists.txt | 48 +++++++++++++++++--------------- 1 file changed, 25 insertions(+), 23 deletions(-) diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index f9e7cf792ae51..6f401dba2495f 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -87,29 +87,31 @@ ARROW_TEST_LINK_LIBRARIES(ipc-json-test ADD_ARROW_TEST(json-integration-test) -if (APPLE) - target_link_libraries(json-integration-test - arrow_static - arrow_io - arrow_ipc - gflags - gtest - boost_filesystem_static - boost_system_static - dl) - set_target_properties(json-integration-test - PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") -else() - target_link_libraries(json-integration-test - arrow_static - arrow_io - arrow_ipc - gflags - gtest - pthread - boost_filesystem_static - boost_system_static - dl) +if (ARROW_BUILD_TESTS) + if (APPLE) + target_link_libraries(json-integration-test + arrow_static + arrow_io + arrow_ipc + gflags + gtest + boost_filesystem_static + boost_system_static + dl) + set_target_properties(json-integration-test + PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") + else() + target_link_libraries(json-integration-test + arrow_static + arrow_io + arrow_ipc + gflags + gtest + pthread + boost_filesystem_static + boost_system_static + dl) + endif() endif() # make clean will delete the generated file From 86f56a6073c3254487ede3aff1dc9d117d24adaf Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sat, 26 Nov 2016 14:22:47 -0500 Subject: [PATCH 0205/1644] ARROW-202: Integrate with appveyor ci for windows This only adds yet a successful compilation for windows. Tests don't run. Author: Uwe L. Korn Closes #213 from xhochy/ARROW-202 and squashes the following commits: d5088a6 [Uwe L. Korn] Correctly reference Kudu in LICENSE and NOTICE 72a583b [Uwe L. Korn] Differentiate Boost libraries based on build type 6c75699 [Uwe L. Korn] Add license header e33b08c [Uwe L. Korn] Pick up shared Boost libraries correctly 5da5f5d [Uwe L. 
Korn] ARROW-202: Integrate with appveyor ci for windows --- LICENSE.txt | 12 ++++++ NOTICE.txt | 14 ++++++ appveyor.yml | 39 +++++++++++++++++ cpp/CMakeLists.txt | 64 ++++++++++++++++------------ cpp/cmake_modules/CompilerInfo.cmake | 42 +++++++++++------- cpp/src/arrow/array-test.cc | 1 + cpp/src/arrow/io/CMakeLists.txt | 14 ++++-- cpp/src/arrow/io/io-file-test.cc | 9 +++- cpp/src/arrow/io/memory.cc | 13 +++++- cpp/src/arrow/io/mman.h | 12 +++--- cpp/src/arrow/io/test-common.h | 12 ++++++ cpp/src/arrow/test-util.h | 8 ++++ cpp/src/arrow/type.h | 2 +- cpp/src/arrow/util/CMakeLists.txt | 25 ++++++----- cpp/src/arrow/util/buffer.h | 4 +- cpp/src/arrow/util/memory-pool.cc | 14 ++++++ cpp/src/arrow/util/visibility.h | 1 + 17 files changed, 217 insertions(+), 69 deletions(-) create mode 100644 appveyor.yml diff --git a/LICENSE.txt b/LICENSE.txt index d645695673349..c3bec4385507e 100644 --- a/LICENSE.txt +++ b/LICENSE.txt @@ -200,3 +200,15 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +-------------------------------------------------------------------------------- + +This product includes code from Apache Kudu. + + * cpp/cmake_modules/CompilerInfo.cmake is based on Kudu's cmake_modules/CompilerInfo.cmake + +Copyright: 2016 The Apache Software Foundation. +Home page: https://kudu.apache.org/ +License: http://www.apache.org/licenses/LICENSE-2.0 + +-------------------------------------------------------------------------------- diff --git a/NOTICE.txt b/NOTICE.txt index 5c699ca022c1b..02cb4dd6ee001 100644 --- a/NOTICE.txt +++ b/NOTICE.txt @@ -41,3 +41,17 @@ This product includes software from the CMake project This product includes software from https://github.com/matthew-brett/multibuild (BSD 2-clause) * Copyright (c) 2013-2016, Matt Terry and Matthew Brett; all rights reserved. + +-------------------------------------------------------------------------------- + +This product includes code from Apache Kudu, which includes the following in +its NOTICE file: + + Apache Kudu + Copyright 2016 The Apache Software Foundation + + This product includes software developed at + The Apache Software Foundation (http://www.apache.org/). + + Portions of this software were developed at + Cloudera, Inc (http://www.cloudera.com/). diff --git a/appveyor.yml b/appveyor.yml new file mode 100644 index 0000000000000..67478487081b7 --- /dev/null +++ b/appveyor.yml @@ -0,0 +1,39 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
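# What follows is a deliberately minimal Windows CI setup: a single 64-bit
# Visual Studio 14 (2015) generator entry (the 32-bit variant is left
# commented out), Boost paths pointing at AppVeyor's preinstalled Boost 1.59,
# and a Debug build driven by CMake with most optional components switched
# off.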
+ +# Operating system (build VM template) +os: Visual Studio 2015 + +environment: + matrix: + - GENERATOR: Visual Studio 14 2015 Win64 + # - GENERATOR: Visual Studio 14 2015 + MSVC_DEFAULT_OPTIONS: ON + BOOST_ROOT: C:\Libraries\boost_1_59_0 + BOOST_LIBRARYDIR: C:\Libraries\boost_1_59_0\lib64-msvc-14.0 + +build_script: + - cd cpp + - mkdir build + - cd build + # A lot of features are still deactivated as they do not build on Windows + # * gbenchmark doesn't build with MSVC + - cmake -G "%GENERATOR%" -DARROW_BOOST_USE_SHARED=OFF -DARROW_IPC=OFF -DARROW_HDFS=OFF -DARROW_BUILD_BENCHMARKS=OFF .. + - cmake --build . --config Debug + +# test_script: +# - ctest -VV diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 839ea17dee02e..0edb8ce410b87 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -141,9 +141,11 @@ endif() # For CMAKE_BUILD_TYPE=Release # -O3: Enable all compiler optimizations # -g: Enable symbols for profiler tools (TODO: remove for shipping) -set(CXX_FLAGS_DEBUG "-ggdb -O0") -set(CXX_FLAGS_FASTDEBUG "-ggdb -O1") -set(CXX_FLAGS_RELEASE "-O3 -g -DNDEBUG") +if (NOT MSVC) + set(CXX_FLAGS_DEBUG "-ggdb -O0") + set(CXX_FLAGS_FASTDEBUG "-ggdb -O1") + set(CXX_FLAGS_RELEASE "-O3 -g -DNDEBUG") +endif() set(CXX_FLAGS_PROFILE_GEN "${CXX_FLAGS_RELEASE} -fprofile-generate") set(CXX_FLAGS_PROFILE_BUILD "${CXX_FLAGS_RELEASE} -fprofile-use") @@ -347,6 +349,8 @@ function(ADD_ARROW_TEST REL_TEST_NAME) COMPILE_FLAGS " -DARROW_VALGRIND") add_test(${TEST_NAME} valgrind --tool=memcheck --leak-check=full --error-exitcode=1 ${TEST_PATH}) + elseif(MSVC) + add_test(${TEST_NAME} ${TEST_PATH}) else() add_test(${TEST_NAME} ${BUILD_SUPPORT_DIR}/run-test.sh ${CMAKE_BINARY_DIR} test ${TEST_PATH}) @@ -431,40 +435,44 @@ endfunction() # ---------------------------------------------------------------------- # Add Boost dependencies (code adapted from Apache Kudu (incubating)) -# find boost headers and libs +# Find static boost headers and libs +# TODO Differentiate here between release and debug builds set(Boost_DEBUG TRUE) set(Boost_USE_MULTITHREADED ON) set(Boost_USE_STATIC_LIBS ON) find_package(Boost COMPONENTS system filesystem REQUIRED) -include_directories(SYSTEM ${Boost_INCLUDE_DIRS}) -set(BOOST_STATIC_LIBS ${Boost_LIBRARIES}) -list(LENGTH BOOST_STATIC_LIBS BOOST_STATIC_LIBS_LEN) +if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG") + set(BOOST_STATIC_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_DEBUG}) + set(BOOST_STATIC_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_DEBUG}) +else() + set(BOOST_STATIC_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_RELEASE}) + set(BOOST_STATIC_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_RELEASE}) +endif() -# Find Boost shared libraries. +# Find shared Boost libraries. 
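# As with the static libraries above, the per-configuration selection below
# relies on the Boost_<COMPONENT>_LIBRARY_DEBUG / _RELEASE variables that
# CMake's FindBoost module populates. This assumes CMAKE_BUILD_TYPE has been
# upper-cased earlier in this file; otherwise the case-sensitive
# STREQUAL "DEBUG" comparison would never match the conventional "Debug"
# spelling.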
set(Boost_USE_STATIC_LIBS OFF) find_package(Boost COMPONENTS system filesystem REQUIRED) -set(BOOST_SHARED_LIBS ${Boost_LIBRARIES}) -list(LENGTH BOOST_SHARED_LIBS BOOST_SHARED_LIBS_LEN) -list(SORT BOOST_SHARED_LIBS) +if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG") + set(BOOST_SHARED_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_DEBUG}) + set(BOOST_SHARED_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_DEBUG}) +else() + set(BOOST_SHARED_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_RELEASE}) + set(BOOST_SHARED_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_RELEASE}) +endif() message(STATUS "Boost include dir: " ${Boost_INCLUDE_DIRS}) message(STATUS "Boost libraries: " ${Boost_LIBRARIES}) -math(EXPR LAST_IDX "${BOOST_STATIC_LIBS_LEN} - 1") -foreach(IDX RANGE ${LAST_IDX}) - list(GET BOOST_STATIC_LIBS ${IDX} BOOST_STATIC_LIB) - list(GET BOOST_SHARED_LIBS ${IDX} BOOST_SHARED_LIB) +ADD_THIRDPARTY_LIB(boost_system + STATIC_LIB "${BOOST_STATIC_SYSTEM_LIBRARY}" + SHARED_LIB "${BOOST_SHARED_SYSTEM_LIBRARY}") + +ADD_THIRDPARTY_LIB(boost_filesystem + STATIC_LIB "${BOOST_STATIC_FILESYSTEM_LIBRARY}" + SHARED_LIB "${BOOST_SHARED_FILESYSTEM_LIBRARY}") + +SET(ARROW_BOOST_LIBS boost_system boost_filesystem) - # Remove the prefix/suffix from the library name. - # - # e.g. libboost_system-mt --> boost_system - get_filename_component(LIB_NAME ${BOOST_STATIC_LIB} NAME_WE) - string(REGEX REPLACE "lib([^-]*)(-mt)?" "\\1" LIB_NAME_NO_PREFIX_SUFFIX ${LIB_NAME}) - ADD_THIRDPARTY_LIB(${LIB_NAME_NO_PREFIX_SUFFIX} - STATIC_LIB "${BOOST_STATIC_LIB}" - SHARED_LIB "${BOOST_SHARED_LIB}") - list(APPEND ARROW_BOOST_LIBS ${LIB_NAME_NO_PREFIX_SUFFIX}) -endforeach() include_directories(SYSTEM ${Boost_INCLUDE_DIR}) # ---------------------------------------------------------------------- @@ -482,7 +490,7 @@ if(ARROW_BUILD_TESTS) ExternalProject_Add(googletest_ep URL "https://github.com/google/googletest/archive/release-${GTEST_VERSION}.tar.gz" - CMAKE_ARGS -DCMAKE_CXX_FLAGS=${GTEST_CMAKE_CXX_FLAGS} + CMAKE_ARGS -DCMAKE_CXX_FLAGS=${GTEST_CMAKE_CXX_FLAGS} -Dgtest_force_shared_crt=ON # googletest doesn't define install rules, so just build in the # source dir and don't try to install. See its README for # details. @@ -491,7 +499,7 @@ if(ARROW_BUILD_TESTS) set(GTEST_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/googletest_ep-prefix/src/googletest_ep") set(GTEST_INCLUDE_DIR "${GTEST_PREFIX}/include") - set(GTEST_STATIC_LIB "${GTEST_PREFIX}/libgtest.a") + set(GTEST_STATIC_LIB "${GTEST_PREFIX}/${CMAKE_CFG_INTDIR}/${CMAKE_STATIC_LIBRARY_PREFIX}gtest${CMAKE_STATIC_LIBRARY_SUFFIX}") set(GTEST_VENDORED 1) else() find_package(GTest REQUIRED) @@ -571,7 +579,7 @@ if(ARROW_BUILD_BENCHMARKS) "-DCMAKE_CXX_FLAGS=-fPIC ${GBENCHMARK_CMAKE_CXX_FLAGS}") set(GBENCHMARK_INCLUDE_DIR "${GBENCHMARK_PREFIX}/include") - set(GBENCHMARK_STATIC_LIB "${GBENCHMARK_PREFIX}/lib/libbenchmark.a") + set(GBENCHMARK_STATIC_LIB "${GBENCHMARK_PREFIX}/lib/${CMAKE_STATIC_LIBRARY_PREFIX}benchmark${CMAKE_STATIC_LIBRARY_SUFFIX}") set(GBENCHMARK_VENDORED 1) else() find_package(GBenchmark REQUIRED) diff --git a/cpp/cmake_modules/CompilerInfo.cmake b/cpp/cmake_modules/CompilerInfo.cmake index 02f6fd46997a3..187698f54507b 100644 --- a/cpp/cmake_modules/CompilerInfo.cmake +++ b/cpp/cmake_modules/CompilerInfo.cmake @@ -1,25 +1,32 @@ -# Copyright 2013 Cloudera, Inc. +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at # -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 # -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. # # Sets COMPILER_FAMILY to 'clang' or 'gcc' # Sets COMPILER_VERSION to the version execute_process(COMMAND "${CMAKE_CXX_COMPILER}" -v ERROR_VARIABLE COMPILER_VERSION_FULL) message(INFO " ${COMPILER_VERSION_FULL}") +message(INFO " ${CMAKE_CXX_COMPILER_ID}") + +if(MSVC) + set(COMPILER_FAMILY "msvc") # clang on Linux and Mac OS X before 10.9 -if("${COMPILER_VERSION_FULL}" MATCHES ".*clang version.*") +elseif("${COMPILER_VERSION_FULL}" MATCHES ".*clang version.*") set(COMPILER_FAMILY "clang") string(REGEX REPLACE ".*clang version ([0-9]+\\.[0-9]+).*" "\\1" COMPILER_VERSION "${COMPILER_VERSION_FULL}") @@ -29,10 +36,15 @@ elseif("${COMPILER_VERSION_FULL}" MATCHES ".*based on LLVM.*") string(REGEX REPLACE ".*based on LLVM ([0-9]+\\.[0.9]+).*" "\\1" COMPILER_VERSION "${COMPILER_VERSION_FULL}") -# clang on Mac OS X, XCode 7+. No version replacement is done -# because Apple no longer advertises the upstream LLVM version. -elseif("${COMPILER_VERSION_FULL}" MATCHES "clang-.*") +# clang on Mac OS X, XCode 7. +elseif("${COMPILER_VERSION_FULL}" MATCHES ".*clang-7") + set(COMPILER_FAMILY "clang") + set(COMPILER_VERSION "3.7.0svn") + +# clang on Mac OS X, XCode 8. 
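# As noted in the comment removed above, Apple stopped advertising the
# upstream LLVM version in its clang builds, so the pinned 3.8.0svn value
# below is a best guess at the LLVM release underlying the XCode 8 toolchain.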
+elseif("${COMPILER_VERSION_FULL}" MATCHES ".*clang-8") set(COMPILER_FAMILY "clang") + set(COMPILER_VERSION "3.8.0svn") # gcc elseif("${COMPILER_VERSION_FULL}" MATCHES ".*gcc version.*") diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc index 3b4736327b47c..158124468992a 100644 --- a/cpp/src/arrow/array-test.cc +++ b/cpp/src/arrow/array-test.cc @@ -18,6 +18,7 @@ #include #include #include +#include #include #include "gtest/gtest.h" diff --git a/cpp/src/arrow/io/CMakeLists.txt b/cpp/src/arrow/io/CMakeLists.txt index 47bb089386371..a1892a9294a78 100644 --- a/cpp/src/arrow/io/CMakeLists.txt +++ b/cpp/src/arrow/io/CMakeLists.txt @@ -18,10 +18,16 @@ # ---------------------------------------------------------------------- # arrow_io : Arrow IO interfaces -set(ARROW_IO_LINK_LIBS - arrow_shared - dl -) +if (MSVC) + set(ARROW_IO_LINK_LIBS + arrow_shared + ) +else() + set(ARROW_IO_LINK_LIBS + arrow_shared + dl + ) +endif() if (ARROW_BOOST_USE_SHARED) set(ARROW_IO_PRIVATE_LINK_LIBS diff --git a/cpp/src/arrow/io/io-file-test.cc b/cpp/src/arrow/io/io-file-test.cc index cde769ffb6155..54c21d2e62ce9 100644 --- a/cpp/src/arrow/io/io-file-test.cc +++ b/cpp/src/arrow/io/io-file-test.cc @@ -18,7 +18,9 @@ #include #include #include -#include +#ifndef _MSC_VER +# include +#endif #include #include #include @@ -38,7 +40,12 @@ static bool FileExists(const std::string& path) { } static bool FileIsClosed(int fd) { +#ifdef _MSC_VER + // Close file a second time, this should set errno to EBADF + close(fd); +#else if (-1 != fcntl(fd, F_GETFD)) { return false; } +#endif return errno == EBADF; } diff --git a/cpp/src/arrow/io/memory.cc b/cpp/src/arrow/io/memory.cc index c7d0ae5d56474..71b0f1e29b220 100644 --- a/cpp/src/arrow/io/memory.cc +++ b/cpp/src/arrow/io/memory.cc @@ -17,7 +17,18 @@ #include "arrow/io/memory.h" -#include // For memory-mapping +// sys/mman.h not present in Visual Studio or Cygwin +#ifdef _WIN32 +#ifndef NOMINMAX +#define NOMINMAX +#endif +#include "arrow/io/mman.h" +#undef Realloc +#undef Free +#include +#else +#include +#endif #include #include diff --git a/cpp/src/arrow/io/mman.h b/cpp/src/arrow/io/mman.h index 00d1f93601df3..27d9736b683fd 100644 --- a/cpp/src/arrow/io/mman.h +++ b/cpp/src/arrow/io/mman.h @@ -76,7 +76,7 @@ static DWORD __map_mmap_prot_file(const int prot) { return desiredAccess; } -void* mmap(void* addr, size_t len, int prot, int flags, int fildes, off_t off) { +static void* mmap(void* addr, size_t len, int prot, int flags, int fildes, off_t off) { HANDLE fm, h; void* map = MAP_FAILED; @@ -143,7 +143,7 @@ void* mmap(void* addr, size_t len, int prot, int flags, int fildes, off_t off) { return map; } -int munmap(void* addr, size_t len) { +static int munmap(void* addr, size_t len) { if (UnmapViewOfFile(addr)) return 0; errno = __map_mman_error(GetLastError(), EPERM); @@ -151,7 +151,7 @@ int munmap(void* addr, size_t len) { return -1; } -int mprotect(void* addr, size_t len, int prot) { +static int mprotect(void* addr, size_t len, int prot) { DWORD newProtect = __map_mmap_prot_page(prot); DWORD oldProtect = 0; @@ -162,7 +162,7 @@ int mprotect(void* addr, size_t len, int prot) { return -1; } -int msync(void* addr, size_t len, int flags) { +static int msync(void* addr, size_t len, int flags) { if (FlushViewOfFile(addr, len)) return 0; errno = __map_mman_error(GetLastError(), EPERM); @@ -170,7 +170,7 @@ int msync(void* addr, size_t len, int flags) { return -1; } -int mlock(const void* addr, size_t len) { +static int mlock(const void* addr, size_t len) { if 
(VirtualLock((LPVOID)addr, len)) return 0; errno = __map_mman_error(GetLastError(), EPERM); @@ -178,7 +178,7 @@ int mlock(const void* addr, size_t len) { return -1; } -int munlock(const void* addr, size_t len) { +static int munlock(const void* addr, size_t len) { if (VirtualUnlock((LPVOID)addr, len)) return 0; errno = __map_mman_error(GetLastError(), EPERM); diff --git a/cpp/src/arrow/io/test-common.h b/cpp/src/arrow/io/test-common.h index 1954d479e3930..f8fed883cf583 100644 --- a/cpp/src/arrow/io/test-common.h +++ b/cpp/src/arrow/io/test-common.h @@ -24,6 +24,14 @@ #include #include +#if defined(__MINGW32__) // MinGW +// nothing +#elif defined(_MSC_VER) // Visual Studio +#include +#else // POSIX / Linux +// nothing +#endif + #include "arrow/io/memory.h" #include "arrow/test-util.h" #include "arrow/util/buffer.h" @@ -43,7 +51,11 @@ class MemoryMapFixture { void CreateFile(const std::string path, int64_t size) { FILE* file = fopen(path.c_str(), "w"); if (file != nullptr) { tmp_files_.push_back(path); } +#ifdef _MSC_VER + _chsize(fileno(file), size); +#else ftruncate(fileno(file), size); +#endif fclose(file); } diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index ab4b980b3be63..93dd5b69b1bb7 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -59,6 +59,14 @@ EXPECT_TRUE(s.ok()); \ } while (0) +// Alias MSVC popcount to GCC name +#ifdef _MSC_VER +# include +# define __builtin_popcount __popcnt +# include +# define __builtin_popcountll _mm_popcnt_u64 +#endif + namespace arrow { class TestBase : public ::testing::Test { diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 5b4d7bc42bd3d..876d7ea464b1c 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -184,7 +184,7 @@ struct ARROW_EXPORT Field { }; typedef std::shared_ptr FieldPtr; -struct PrimitiveCType : public DataType { +struct ARROW_EXPORT PrimitiveCType : public DataType { using DataType::DataType; }; diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt index fd23c1aa3b8b2..6e19730219553 100644 --- a/cpp/src/arrow/util/CMakeLists.txt +++ b/cpp/src/arrow/util/CMakeLists.txt @@ -40,17 +40,20 @@ if (ARROW_BUILD_TESTS) test_main.cc) if (APPLE) - target_link_libraries(arrow_test_main - gtest - dl) - set_target_properties(arrow_test_main - PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") + target_link_libraries(arrow_test_main + gtest + dl) + set_target_properties(arrow_test_main + PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") + elseif(MSVC) + target_link_libraries(arrow_test_main + gtest) else() - target_link_libraries(arrow_test_main - gtest - pthread - dl - ) + target_link_libraries(arrow_test_main + gtest + pthread + dl + ) endif() endif() @@ -71,4 +74,4 @@ endif() ADD_ARROW_TEST(bit-util-test) ADD_ARROW_TEST(buffer-test) ADD_ARROW_TEST(memory-pool-test) -ADD_ARROW_TEST(status-test) \ No newline at end of file +ADD_ARROW_TEST(status-test) diff --git a/cpp/src/arrow/util/buffer.h b/cpp/src/arrow/util/buffer.h index 04ad6c2dffde4..330e15feae152 100644 --- a/cpp/src/arrow/util/buffer.h +++ b/cpp/src/arrow/util/buffer.h @@ -103,7 +103,7 @@ class ARROW_EXPORT Buffer : public std::enable_shared_from_this { // Construct a view on passed buffer at the indicated offset and length. This // function cannot fail and does not error checking (except in debug builds) -std::shared_ptr SliceBuffer( +ARROW_EXPORT std::shared_ptr SliceBuffer( const std::shared_ptr& buffer, int64_t offset, int64_t length); // A Buffer whose contents can be mutated. 
May or may not own its data. @@ -154,7 +154,7 @@ class ARROW_EXPORT PoolBuffer : public ResizableBuffer { MemoryPool* pool_; }; -class BufferBuilder { +class ARROW_EXPORT BufferBuilder { public: explicit BufferBuilder(MemoryPool* pool) : pool_(pool), data_(nullptr), capacity_(0), size_(0) {} diff --git a/cpp/src/arrow/util/memory-pool.cc b/cpp/src/arrow/util/memory-pool.cc index 9f83afe4cb20f..9aa706693e9f7 100644 --- a/cpp/src/arrow/util/memory-pool.cc +++ b/cpp/src/arrow/util/memory-pool.cc @@ -33,6 +33,15 @@ namespace { Status AllocateAligned(int64_t size, uint8_t** out) { // TODO(emkornfield) find something compatible with windows constexpr size_t kAlignment = 64; +#ifdef _MSC_VER + // Special code path for MSVC + *out = reinterpret_cast(_aligned_malloc(size, kAlignment)); + if (!*out) { + std::stringstream ss; + ss << "malloc of size " << size << " failed"; + return Status::OutOfMemory(ss.str()); + } +#else const int result = posix_memalign(reinterpret_cast(out), kAlignment, size); if (result == ENOMEM) { std::stringstream ss; @@ -45,6 +54,7 @@ Status AllocateAligned(int64_t size, uint8_t** out) { ss << "invalid alignment parameter: " << kAlignment; return Status::Invalid(ss.str()); } +#endif return Status::OK(); } } // namespace @@ -83,7 +93,11 @@ int64_t InternalMemoryPool::bytes_allocated() const { void InternalMemoryPool::Free(uint8_t* buffer, int64_t size) { std::lock_guard guard(pool_lock_); DCHECK_GE(bytes_allocated_, size); +#ifdef _MSC_VER + _aligned_free(buffer); +#else std::free(buffer); +#endif bytes_allocated_ -= size; } diff --git a/cpp/src/arrow/util/visibility.h b/cpp/src/arrow/util/visibility.h index b197c198297c8..9321cc550ec1f 100644 --- a/cpp/src/arrow/util/visibility.h +++ b/cpp/src/arrow/util/visibility.h @@ -20,6 +20,7 @@ #if defined(_WIN32) || defined(__CYGWIN__) #define ARROW_EXPORT __declspec(dllexport) +#define ARROW_NO_EXPORT #else // Not Windows #ifndef ARROW_EXPORT #define ARROW_EXPORT __attribute__((visibility("default"))) From e3c167bd101734f92c3a2be2eb7f56f1fba91e67 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 28 Nov 2016 21:29:19 -0500 Subject: [PATCH 0206/1644] ARROW-363: [Java/C++] integration testing harness, initial integration tests This also includes format reconciliation as discussed in ARROW-384. Author: Wes McKinney Closes #211 from wesm/ARROW-363 and squashes the following commits: 6982c3c [Wes McKinney] Permit end of buffer IPC reads if length is 0 4d46c8b [Wes McKinney] Fix logical error with offsets array in JsonFileWriter. Add broken string test case to simple.json 36ab5d6 [Wes McKinney] Increment MetadataVersion in flatbuffer 844257e [Wes McKinney] cpplint a2711f2 [Wes McKinney] Address other format incompatibilities, write vectorLayout to Arrow metadata 13608ef [Wes McKinney] Relax 64 byte padding. 
Do not write RecordBatch embedded in Message for now 6a66fc8 [Wes McKinney] Write record batch size prefix in Java 72ea42c [Wes McKinney] Note that padding is 64-bytes at start of file (for now) c2ffde4 [Wes McKinney] More notes about the file format aef4382 [Wes McKinney] cpplint 85128f7 [Wes McKinney] Refactor IPC/File record batch read/write structure to reflect discussion in ARROW-384 dbd6ed6 [Wes McKinney] Do not embed metadata length in WriteDataHeader c529d63 [Wes McKinney] Fix JSON integration test example to make it further d806aa6 [Wes McKinney] Exclude JSON files from Apache RAT checks a7e2d4b [Wes McKinney] Draft testing harness --- .gitignore | 26 ++ cpp/CMakeLists.txt | 1 - cpp/src/arrow/io/io-file-test.cc | 2 +- cpp/src/arrow/io/memory.cc | 25 +- cpp/src/arrow/io/memory.h | 8 +- cpp/src/arrow/ipc/adapter.cc | 251 ++++++++++-------- cpp/src/arrow/ipc/adapter.h | 65 ++--- cpp/src/arrow/ipc/file.cc | 31 ++- cpp/src/arrow/ipc/ipc-adapter-test.cc | 85 +++--- cpp/src/arrow/ipc/ipc-file-test.cc | 2 +- cpp/src/arrow/ipc/ipc-json-test.cc | 20 +- cpp/src/arrow/ipc/ipc-metadata-test.cc | 12 +- cpp/src/arrow/ipc/json-integration-test.cc | 30 ++- cpp/src/arrow/ipc/json-internal.cc | 110 +++----- cpp/src/arrow/ipc/metadata-internal.cc | 100 ++++--- cpp/src/arrow/ipc/metadata-internal.h | 6 +- cpp/src/arrow/ipc/metadata.cc | 115 ++++---- cpp/src/arrow/ipc/metadata.h | 50 ++-- cpp/src/arrow/ipc/test-common.h | 15 +- cpp/src/arrow/ipc/util.h | 6 +- cpp/src/arrow/test-util.h | 8 +- cpp/src/arrow/type.cc | 46 +++- cpp/src/arrow/type.h | 73 +++-- cpp/src/arrow/types/primitive.cc | 2 +- cpp/src/arrow/util/bit-util.h | 4 + dev/release/run-rat.sh | 3 +- format/IPC.md | 106 ++++++++ format/Message.fbs | 3 +- integration/data/simple.json | 66 +++++ integration/integration_test.py | 177 ++++++++++++ java/pom.xml | 6 +- java/tools/pom.xml | 6 + .../org/apache/arrow/tools/Integration.java | 1 + .../org/apache/arrow/vector/VectorLoader.java | 4 +- .../apache/arrow/vector/file/ArrowReader.java | 6 +- .../apache/arrow/vector/file/ArrowWriter.java | 23 +- .../vector/file/json/JsonFileReader.java | 9 +- .../vector/file/json/JsonFileWriter.java | 2 +- python/.gitignore | 10 - 39 files changed, 1024 insertions(+), 491 deletions(-) create mode 100644 .gitignore create mode 100644 integration/data/simple.json create mode 100644 integration/integration_test.py diff --git a/.gitignore b/.gitignore new file mode 100644 index 0000000000000..a00cbba065a03 --- /dev/null +++ b/.gitignore @@ -0,0 +1,26 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
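# The patterns below use ordinary gitignore glob syntax; they cover compiled
# C/C++ artifacts, Python byte-code files, and local build state.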
+ +# Compiled source +*.a +*.dll +*.o +*.py[ocd] +*.so +*.dylib +.build_cache_dir +MANIFEST diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 0edb8ce410b87..1a970081234fa 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -528,7 +528,6 @@ if(ARROW_BUILD_TESTS) ExternalProject_Add(gflags_ep GIT_REPOSITORY https://github.com/gflags/gflags.git GIT_TAG cce68f0c9c5d054017425e6e6fd54f696d36e8ee - # URL "https://github.com/gflags/gflags/archive/v${GFLAGS_VERSION}.tar.gz" BUILD_IN_SOURCE 1 CMAKE_ARGS -DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} -DCMAKE_INSTALL_PREFIX=${GFLAGS_PREFIX} diff --git a/cpp/src/arrow/io/io-file-test.cc b/cpp/src/arrow/io/io-file-test.cc index 54c21d2e62ce9..fad49cef89908 100644 --- a/cpp/src/arrow/io/io-file-test.cc +++ b/cpp/src/arrow/io/io-file-test.cc @@ -19,7 +19,7 @@ #include #include #ifndef _MSC_VER -# include +#include #endif #include #include diff --git a/cpp/src/arrow/io/memory.cc b/cpp/src/arrow/io/memory.cc index 71b0f1e29b220..af495e27e5642 100644 --- a/cpp/src/arrow/io/memory.cc +++ b/cpp/src/arrow/io/memory.cc @@ -258,8 +258,11 @@ Status BufferOutputStream::Reserve(int64_t nbytes) { // ---------------------------------------------------------------------- // In-memory buffer reader -BufferReader::BufferReader(const uint8_t* buffer, int buffer_size) - : buffer_(buffer), buffer_size_(buffer_size), position_(0) {} +BufferReader::BufferReader(const std::shared_ptr& buffer) + : buffer_(buffer), data_(buffer->data()), size_(buffer->size()), position_(0) {} + +BufferReader::BufferReader(const uint8_t* data, int64_t size) + : buffer_(nullptr), data_(data), size_(size), position_(0) {} BufferReader::~BufferReader() {} @@ -278,26 +281,32 @@ bool BufferReader::supports_zero_copy() const { } Status BufferReader::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) { - memcpy(buffer, buffer_ + position_, nbytes); - *bytes_read = std::min(nbytes, buffer_size_ - position_); + memcpy(buffer, data_ + position_, nbytes); + *bytes_read = std::min(nbytes, size_ - position_); position_ += *bytes_read; return Status::OK(); } Status BufferReader::Read(int64_t nbytes, std::shared_ptr* out) { - int64_t size = std::min(nbytes, buffer_size_ - position_); - *out = std::make_shared(buffer_ + position_, size); + int64_t size = std::min(nbytes, size_ - position_); + + if (buffer_ != nullptr) { + *out = SliceBuffer(buffer_, position_, size); + } else { + *out = std::make_shared(data_ + position_, size); + } + position_ += nbytes; return Status::OK(); } Status BufferReader::GetSize(int64_t* size) { - *size = buffer_size_; + *size = size_; return Status::OK(); } Status BufferReader::Seek(int64_t position) { - if (position < 0 || position >= buffer_size_) { + if (position < 0 || position >= size_) { return Status::IOError("position out of bounds"); } diff --git a/cpp/src/arrow/io/memory.h b/cpp/src/arrow/io/memory.h index df2fe8d6efbfc..b72f93b939148 100644 --- a/cpp/src/arrow/io/memory.h +++ b/cpp/src/arrow/io/memory.h @@ -99,7 +99,8 @@ class ARROW_EXPORT MemoryMappedFile : public ReadWriteFileInterface { class ARROW_EXPORT BufferReader : public ReadableFileInterface { public: - BufferReader(const uint8_t* buffer, int buffer_size); + explicit BufferReader(const std::shared_ptr& buffer); + BufferReader(const uint8_t* data, int64_t size); ~BufferReader(); Status Close() override; @@ -116,8 +117,9 @@ class ARROW_EXPORT BufferReader : public ReadableFileInterface { bool supports_zero_copy() const override; private: - const uint8_t* buffer_; - int buffer_size_; + 
std::shared_ptr buffer_; + const uint8_t* data_; + int64_t size_; int64_t position_; }; diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index da718c08d5480..edf716f662753 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -48,15 +48,6 @@ namespace flatbuf = org::apache::arrow::flatbuf; namespace ipc { -namespace { -Status CheckMultipleOf64(int64_t size) { - if (BitUtil::IsMultipleOf64(size)) { return Status::OK(); } - return Status::Invalid( - "Attempted to write a buffer that " - "wasn't a multiple of 64 bytes"); -} -} - static bool IsPrimitive(const DataType* type) { DCHECK(type != nullptr); switch (type->type) { @@ -124,30 +115,30 @@ Status VisitArray(const Array* arr, std::vector* field_nodes class RecordBatchWriter { public: RecordBatchWriter(const std::vector>& columns, int32_t num_rows, - int max_recursion_depth) + int64_t buffer_start_offset, int max_recursion_depth) : columns_(&columns), num_rows_(num_rows), + buffer_start_offset_(buffer_start_offset), max_recursion_depth_(max_recursion_depth) {} - Status AssemblePayload() { + Status AssemblePayload(int64_t* body_length) { + if (field_nodes_.size() > 0) { + field_nodes_.clear(); + buffer_meta_.clear(); + buffers_.clear(); + } + // Perform depth-first traversal of the row-batch for (size_t i = 0; i < columns_->size(); ++i) { const Array* arr = (*columns_)[i].get(); RETURN_NOT_OK(VisitArray(arr, &field_nodes_, &buffers_, max_recursion_depth_)); } - return Status::OK(); - } - Status Write( - io::OutputStream* dst, int64_t* body_end_offset, int64_t* header_end_offset) { - // Get the starting position - int64_t start_position; - RETURN_NOT_OK(dst->Tell(&start_position)); - - // Keep track of the current position so we can determine the size of the - // message body - int64_t position = start_position; + // The position for the start of a buffer relative to the passed frame of + // reference. May be 0 or some other position in an address space + int64_t offset = buffer_start_offset_; + // Construct the buffer metadata for the record batch header for (size_t i = 0; i < buffers_.size(); ++i) { const Buffer* buffer = buffers_[i].get(); int64_t size = 0; @@ -161,65 +152,103 @@ class RecordBatchWriter { // TODO(wesm): We currently have no notion of shared memory page id's, // but we've included it in the metadata IDL for when we have it in the - // future. Use page=0 for now + // future. Use page = -1 for now // // Note that page ids are a bespoke notion for Arrow and not a feature we // are using from any OS-level shared memory. 
The thought is that systems // may (in the future) associate integer page id's with physical memory // pages (according to whatever is the desired shared memory mechanism) - buffer_meta_.push_back(flatbuf::Buffer(0, position, size + padding)); - - if (size > 0) { - RETURN_NOT_OK(dst->Write(buffer->data(), size)); - position += size; - } - - if (padding > 0) { - RETURN_NOT_OK(dst->Write(kPaddingBytes, padding)); - position += padding; - } + buffer_meta_.push_back(flatbuf::Buffer(-1, offset, size + padding)); + offset += size + padding; } - *body_end_offset = position; + *body_length = offset - buffer_start_offset_; + DCHECK(BitUtil::IsMultipleOf64(*body_length)); + + return Status::OK(); + } + Status WriteMetadata( + int64_t body_length, io::OutputStream* dst, int32_t* metadata_length) { // Now that we have computed the locations of all of the buffers in shared // memory, the data header can be converted to a flatbuffer and written out // // Note: The memory written here is prefixed by the size of the flatbuffer - // itself as an int32_t. On reading from a input, you will have to - // determine the data header size then request a buffer such that you can - // construct the flatbuffer data accessor object (see arrow::ipc::Message) - std::shared_ptr data_header; - RETURN_NOT_OK(WriteDataHeader( - num_rows_, position - start_position, field_nodes_, buffer_meta_, &data_header)); + // itself as an int32_t. + std::shared_ptr metadata_fb; + RETURN_NOT_OK(WriteRecordBatchMetadata( + num_rows_, body_length, field_nodes_, buffer_meta_, &metadata_fb)); + + // Need to write 4 bytes (metadata size), the metadata, plus padding to + // fall on a 64-byte offset + int64_t padded_metadata_length = + BitUtil::RoundUpToMultipleOf64(metadata_fb->size() + 4); + + // The returned metadata size includes the length prefix, the flatbuffer, + // plus padding + *metadata_length = padded_metadata_length; - // Write the data header at the end - RETURN_NOT_OK(dst->Write(data_header->data(), data_header->size())); + // Write the flatbuffer size prefix + int32_t flatbuffer_size = metadata_fb->size(); + RETURN_NOT_OK( + dst->Write(reinterpret_cast(&flatbuffer_size), sizeof(int32_t))); - position += data_header->size(); - *header_end_offset = position; + // Write the flatbuffer + RETURN_NOT_OK(dst->Write(metadata_fb->data(), metadata_fb->size())); - return Align(dst, &position); + // Write any padding + int64_t padding = padded_metadata_length - metadata_fb->size() - 4; + if (padding > 0) { RETURN_NOT_OK(dst->Write(kPaddingBytes, padding)); } + + return Status::OK(); } - Status Align(io::OutputStream* dst, int64_t* position) { - // Write all buffers here on word boundaries - // TODO(wesm): Is there benefit to 64-byte padding in IPC? 
- int64_t remainder = PaddedLength(*position) - *position; - if (remainder > 0) { - RETURN_NOT_OK(dst->Write(kPaddingBytes, remainder)); - *position += remainder; + Status Write(io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length) { + RETURN_NOT_OK(AssemblePayload(body_length)); + +#ifndef NDEBUG + int64_t start_position, current_position; + RETURN_NOT_OK(dst->Tell(&start_position)); +#endif + + RETURN_NOT_OK(WriteMetadata(*body_length, dst, metadata_length)); + +#ifndef NDEBUG + RETURN_NOT_OK(dst->Tell(¤t_position)); + DCHECK(BitUtil::IsMultipleOf8(current_position)); +#endif + + // Now write the buffers + for (size_t i = 0; i < buffers_.size(); ++i) { + const Buffer* buffer = buffers_[i].get(); + int64_t size = 0; + int64_t padding = 0; + + // The buffer might be null if we are handling zero row lengths. + if (buffer) { + size = buffer->size(); + padding = BitUtil::RoundUpToMultipleOf64(size) - size; + } + + if (size > 0) { RETURN_NOT_OK(dst->Write(buffer->data(), size)); } + + if (padding > 0) { RETURN_NOT_OK(dst->Write(kPaddingBytes, padding)); } } + +#ifndef NDEBUG + RETURN_NOT_OK(dst->Tell(¤t_position)); + DCHECK(BitUtil::IsMultipleOf8(current_position)); +#endif + return Status::OK(); } - // This must be called after invoking AssemblePayload Status GetTotalSize(int64_t* size) { // emulates the behavior of Write without actually writing - int64_t body_offset; - int64_t data_header_offset; + int32_t metadata_length; + int64_t body_length; MockOutputStream dst; - RETURN_NOT_OK(Write(&dst, &body_offset, &data_header_offset)); + RETURN_NOT_OK(Write(&dst, &metadata_length, &body_length)); *size = dst.GetExtentBytesWritten(); return Status::OK(); } @@ -228,6 +257,7 @@ class RecordBatchWriter { // Do not copy this vector. Ownership must be retained elsewhere const std::vector>* columns_; int32_t num_rows_; + int64_t buffer_start_offset_; std::vector field_nodes_; std::vector buffer_meta_; @@ -236,18 +266,17 @@ class RecordBatchWriter { }; Status WriteRecordBatch(const std::vector>& columns, - int32_t num_rows, io::OutputStream* dst, int64_t* body_end_offset, - int64_t* header_end_offset, int max_recursion_depth) { + int32_t num_rows, int64_t buffer_start_offset, io::OutputStream* dst, + int32_t* metadata_length, int64_t* body_length, int max_recursion_depth) { DCHECK_GT(max_recursion_depth, 0); - RecordBatchWriter serializer(columns, num_rows, max_recursion_depth); - RETURN_NOT_OK(serializer.AssemblePayload()); - return serializer.Write(dst, body_end_offset, header_end_offset); + RecordBatchWriter serializer( + columns, num_rows, buffer_start_offset, max_recursion_depth); + return serializer.Write(dst, metadata_length, body_length); } Status GetRecordBatchSize(const RecordBatch* batch, int64_t* size) { RecordBatchWriter serializer( - batch->columns(), batch->num_rows(), kMaxIpcRecursionDepth); - RETURN_NOT_OK(serializer.AssemblePayload()); + batch->columns(), batch->num_rows(), 0, kMaxIpcRecursionDepth); RETURN_NOT_OK(serializer.GetTotalSize(size)); return Status::OK(); } @@ -255,30 +284,33 @@ Status GetRecordBatchSize(const RecordBatch* batch, int64_t* size) { // ---------------------------------------------------------------------- // Record batch read path -class RecordBatchReader::RecordBatchReaderImpl { +class RecordBatchReader { public: - RecordBatchReaderImpl(io::ReadableFileInterface* file, - const std::shared_ptr& metadata, int max_recursion_depth) - : file_(file), metadata_(metadata), max_recursion_depth_(max_recursion_depth) { + RecordBatchReader(const 
std::shared_ptr& metadata, + const std::shared_ptr& schema, int max_recursion_depth, + io::ReadableFileInterface* file) + : metadata_(metadata), + schema_(schema), + max_recursion_depth_(max_recursion_depth), + file_(file) { num_buffers_ = metadata->num_buffers(); num_flattened_fields_ = metadata->num_fields(); } - Status AssembleBatch( - const std::shared_ptr& schema, std::shared_ptr* out) { - std::vector> arrays(schema->num_fields()); + Status Read(std::shared_ptr* out) { + std::vector> arrays(schema_->num_fields()); // The field_index and buffer_index are incremented in NextArray based on // how much of the batch is "consumed" (through nested data reconstruction, // for example) field_index_ = 0; buffer_index_ = 0; - for (int i = 0; i < schema->num_fields(); ++i) { - const Field* field = schema->field(i).get(); + for (int i = 0; i < schema_->num_fields(); ++i) { + const Field* field = schema_->field(i).get(); RETURN_NOT_OK(NextArray(field, max_recursion_depth_, &arrays[i])); } - *out = std::make_shared(schema, metadata_->length(), arrays); + *out = std::make_shared(schema_, metadata_->length(), arrays); return Status::OK(); } @@ -370,67 +402,56 @@ class RecordBatchReader::RecordBatchReaderImpl { Status GetBuffer(int buffer_index, std::shared_ptr* out) { BufferMetadata metadata = metadata_->buffer(buffer_index); - RETURN_NOT_OK(CheckMultipleOf64(metadata.length)); - return file_->ReadAt(metadata.offset, metadata.length, out); + + if (metadata.length == 0) { + *out = std::make_shared(nullptr, 0); + return Status::OK(); + } else { + return file_->ReadAt(metadata.offset, metadata.length, out); + } } private: + std::shared_ptr metadata_; + std::shared_ptr schema_; + int max_recursion_depth_; io::ReadableFileInterface* file_; - std::shared_ptr metadata_; int field_index_; int buffer_index_; - int max_recursion_depth_; int num_buffers_; int num_flattened_fields_; }; -Status RecordBatchReader::Open(io::ReadableFileInterface* file, int64_t offset, - std::shared_ptr* out) { - return Open(file, offset, kMaxIpcRecursionDepth, out); -} - -Status RecordBatchReader::Open(io::ReadableFileInterface* file, int64_t offset, - int max_recursion_depth, std::shared_ptr* out) { +Status ReadRecordBatchMetadata(int64_t offset, int32_t metadata_length, + io::ReadableFileInterface* file, std::shared_ptr* metadata) { std::shared_ptr buffer; - RETURN_NOT_OK(file->ReadAt(offset - sizeof(int32_t), sizeof(int32_t), &buffer)); - - int32_t metadata_size = *reinterpret_cast(buffer->data()); + RETURN_NOT_OK(file->ReadAt(offset, metadata_length, &buffer)); - if (metadata_size + static_cast(sizeof(int32_t)) > offset) { - return Status::Invalid("metadata size invalid"); - } - - // Read the metadata - RETURN_NOT_OK( - file->ReadAt(offset - metadata_size - sizeof(int32_t), metadata_size, &buffer)); - - // TODO(wesm): buffer slicing here would be better in case ReadAt returns - // allocated memory - - std::shared_ptr message; - RETURN_NOT_OK(Message::Open(buffer, &message)); + int32_t flatbuffer_size = *reinterpret_cast(buffer->data()); - if (message->type() != Message::RECORD_BATCH) { - return Status::Invalid("Metadata message is not a record batch"); + if (flatbuffer_size + static_cast(sizeof(int32_t)) > metadata_length) { + std::stringstream ss; + ss << "flatbuffer size " << metadata_length << " invalid. 
File offset: " << offset + << ", metadata length: " << metadata_length; + return Status::Invalid(ss.str()); } - std::shared_ptr batch_meta = message->GetRecordBatch(); - - std::shared_ptr result(new RecordBatchReader()); - result->impl_.reset(new RecordBatchReaderImpl(file, batch_meta, max_recursion_depth)); - *out = result; - + *metadata = std::make_shared(buffer, sizeof(int32_t)); return Status::OK(); } -// Here the explicit destructor is required for compilers to be aware of -// the complete information of RecordBatchReader::RecordBatchReaderImpl class -RecordBatchReader::~RecordBatchReader() {} +Status ReadRecordBatch(const std::shared_ptr& metadata, + const std::shared_ptr& schema, io::ReadableFileInterface* file, + std::shared_ptr* out) { + return ReadRecordBatch(metadata, schema, kMaxIpcRecursionDepth, file, out); +} -Status RecordBatchReader::GetRecordBatch( - const std::shared_ptr& schema, std::shared_ptr* out) { - return impl_->AssembleBatch(schema, out); +Status ReadRecordBatch(const std::shared_ptr& metadata, + const std::shared_ptr& schema, int max_recursion_depth, + io::ReadableFileInterface* file, std::shared_ptr* out) { + RecordBatchReader reader(metadata, schema, max_recursion_depth, file); + return reader.Read(out); } } // namespace ipc diff --git a/cpp/src/arrow/ipc/adapter.h b/cpp/src/arrow/ipc/adapter.h index b02de284dfc7d..963b9ee368537 100644 --- a/cpp/src/arrow/ipc/adapter.h +++ b/cpp/src/arrow/ipc/adapter.h @@ -43,7 +43,7 @@ class OutputStream; namespace ipc { -class RecordBatchMessage; +class RecordBatchMetadata; // ---------------------------------------------------------------------- // Write path @@ -51,22 +51,30 @@ class RecordBatchMessage; // TODO(emkornfield) investigate this more constexpr int kMaxIpcRecursionDepth = 64; -// Write the RecordBatch (collection of equal-length Arrow arrays) to the output -// stream +// Write the RecordBatch (collection of equal-length Arrow arrays) to the +// output stream in a contiguous block. 
The record batch metadata is written as +// a flatbuffer (see format/Message.fbs -- the RecordBatch message type) +// prefixed by its size, followed by each of the memory buffers in the batch +// written end to end (with appropriate alignment and padding): // -// First, each of the memory buffers are written out end-to-end -// -// Then, this function writes the batch metadata as a flatbuffer (see -// format/Message.fbs -- the RecordBatch message type) like so: -// -// +// // // Finally, the absolute offsets (relative to the start of the output stream) // to the end of the body and end of the metadata / data header (suffixed by // the header size) is returned in out-variables +// +// @param(in) buffer_start_offset: the start offset to use in the buffer metadata, +// default should be 0 +// +// @param(out) metadata_length: the size of the length-prefixed flatbuffer +// including padding to a 64-byte boundary +// +// @param(out) body_length: the size of the contiguous buffer block plus +// padding bytes ARROW_EXPORT Status WriteRecordBatch(const std::vector>& columns, - int32_t num_rows, io::OutputStream* dst, int64_t* body_end_offset, - int64_t* header_end_offset, int max_recursion_depth = kMaxIpcRecursionDepth); + int32_t num_rows, int64_t buffer_start_offset, io::OutputStream* dst, + int32_t* metadata_length, int64_t* body_length, + int max_recursion_depth = kMaxIpcRecursionDepth); // int64_t GetRecordBatchMetadata(const RecordBatch* batch); @@ -78,27 +86,20 @@ ARROW_EXPORT Status GetRecordBatchSize(const RecordBatch* batch, int64_t* size); // ---------------------------------------------------------------------- // "Read" path; does not copy data if the input supports zero copy reads -class ARROW_EXPORT RecordBatchReader { - public: - // The offset is the absolute position to the *end* of the record batch data - // header - static Status Open(io::ReadableFileInterface* file, int64_t offset, - std::shared_ptr* out); - - static Status Open(io::ReadableFileInterface* file, int64_t offset, - int max_recursion_depth, std::shared_ptr* out); - - virtual ~RecordBatchReader(); - - // Reassemble the record batch. 
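// A sketch of the intended call sequence for the functions declared below,
// mirroring what FileReader::GetRecordBatch in file.cc does. Note that the
// ReadableFileInterface handed to ReadRecordBatch must match the frame of
// reference the buffer offsets were written against (see ARROW-384):
//
//   std::shared_ptr<RecordBatchMetadata> metadata;
//   RETURN_NOT_OK(
//       ReadRecordBatchMetadata(offset, metadata_length, file, &metadata));
//   std::shared_ptr<RecordBatch> batch;
//   RETURN_NOT_OK(ReadRecordBatch(metadata, schema, file, &batch));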
A Schema is required to be able to construct - // the right array containers - Status GetRecordBatch( - const std::shared_ptr& schema, std::shared_ptr* out); - - private: - class RecordBatchReaderImpl; - std::unique_ptr impl_; -}; +// Read the record batch flatbuffer metadata starting at the indicated file offset +// +// The flatbuffer is expected to be length-prefixed, so the metadata_length +// includes at least the length prefix and the flatbuffer +Status ARROW_EXPORT ReadRecordBatchMetadata(int64_t offset, int32_t metadata_length, + io::ReadableFileInterface* file, std::shared_ptr* metadata); + +Status ARROW_EXPORT ReadRecordBatch(const std::shared_ptr& metadata, + const std::shared_ptr& schema, io::ReadableFileInterface* file, + std::shared_ptr* out); + +Status ARROW_EXPORT ReadRecordBatch(const std::shared_ptr& metadata, + const std::shared_ptr& schema, int max_recursion_depth, + io::ReadableFileInterface* file, std::shared_ptr* out); } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/file.cc b/cpp/src/arrow/ipc/file.cc index c68244d50258c..06001cc1c77bc 100644 --- a/cpp/src/arrow/ipc/file.cc +++ b/cpp/src/arrow/ipc/file.cc @@ -23,6 +23,7 @@ #include #include "arrow/io/interfaces.h" +#include "arrow/io/memory.h" #include "arrow/ipc/adapter.h" #include "arrow/ipc/metadata.h" #include "arrow/ipc/util.h" @@ -87,19 +88,19 @@ Status FileWriter::WriteRecordBatch( int64_t offset = position_; - int64_t body_end_offset; - int64_t header_end_offset; + // There may be padding ever the end of the metadata, so we cannot rely on + // position_ + int32_t metadata_length; + int64_t body_length; + + // Frame of reference in file format is 0, see ARROW-384 + const int64_t buffer_start_offset = 0; RETURN_NOT_OK(arrow::ipc::WriteRecordBatch( - columns, num_rows, sink_, &body_end_offset, &header_end_offset)); + columns, num_rows, buffer_start_offset, sink_, &metadata_length, &body_length)); RETURN_NOT_OK(UpdatePosition()); DCHECK(position_ % 8 == 0) << "ipc::WriteRecordBatch did not perform aligned writes"; - // There may be padding ever the end of the metadata, so we cannot rely on - // position_ - int32_t metadata_length = header_end_offset - body_end_offset; - int32_t body_length = body_end_offset - offset; - // Append metadata, to be written in the footer later record_batches_.emplace_back(offset, metadata_length, body_length); @@ -198,12 +199,18 @@ Status FileReader::GetRecordBatch(int i, std::shared_ptr* batch) { DCHECK_GE(i, 0); DCHECK_LT(i, num_record_batches()); FileBlock block = footer_->record_batch(i); - int64_t metadata_end_offset = block.offset + block.body_length + block.metadata_length; - std::shared_ptr reader; - RETURN_NOT_OK(RecordBatchReader::Open(file_.get(), metadata_end_offset, &reader)); + std::shared_ptr metadata; + RETURN_NOT_OK(ReadRecordBatchMetadata( + block.offset, block.metadata_length, file_.get(), &metadata)); + + // TODO(wesm): ARROW-388 -- the buffer frame of reference is 0 (see + // ARROW-384). 
+ std::shared_ptr buffer_block; + RETURN_NOT_OK(file_->Read(block.body_length, &buffer_block)); + io::BufferReader reader(buffer_block); - return reader->GetRecordBatch(schema_, batch); + return ReadRecordBatch(metadata, schema_, &reader, batch); } } // namespace ipc diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc index f5611d4840c97..1accfde7c4842 100644 --- a/cpp/src/arrow/ipc/ipc-adapter-test.cc +++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc @@ -54,17 +54,24 @@ class TestWriteRecordBatch : public ::testing::TestWithParam, std::string path = "test-write-row-batch"; io::MemoryMapFixture::InitMemoryMap(memory_map_size, path, &mmap_); - int64_t body_end_offset; - int64_t header_end_offset; + int32_t metadata_length; + int64_t body_length; - RETURN_NOT_OK(WriteRecordBatch(batch.columns(), batch.num_rows(), mmap_.get(), - &body_end_offset, &header_end_offset)); + const int64_t buffer_offset = 0; - std::shared_ptr reader; - RETURN_NOT_OK(RecordBatchReader::Open(mmap_.get(), header_end_offset, &reader)); + RETURN_NOT_OK(WriteRecordBatch(batch.columns(), batch.num_rows(), buffer_offset, + mmap_.get(), &metadata_length, &body_length)); - RETURN_NOT_OK(reader->GetRecordBatch(batch.schema(), batch_result)); - return Status::OK(); + std::shared_ptr metadata; + RETURN_NOT_OK(ReadRecordBatchMetadata(0, metadata_length, mmap_.get(), &metadata)); + + // The buffer offsets start at 0, so we must construct a + // ReadableFileInterface according to that frame of reference + std::shared_ptr buffer_payload; + RETURN_NOT_OK(mmap_->ReadAt(metadata_length, body_length, &buffer_payload)); + io::BufferReader buffer_reader(buffer_payload); + + return ReadRecordBatch(metadata, batch.schema(), &buffer_reader, batch_result); } protected: @@ -96,11 +103,11 @@ INSTANTIATE_TEST_CASE_P(RoundTripTests, TestWriteRecordBatch, void TestGetRecordBatchSize(std::shared_ptr batch) { ipc::MockOutputStream mock; - int64_t mock_header_offset = -1; - int64_t mock_body_offset = -1; + int32_t mock_metadata_length = -1; + int64_t mock_body_length = -1; int64_t size = -1; - ASSERT_OK(WriteRecordBatch(batch->columns(), batch->num_rows(), &mock, - &mock_body_offset, &mock_header_offset)); + ASSERT_OK(WriteRecordBatch(batch->columns(), batch->num_rows(), 0, &mock, + &mock_metadata_length, &mock_body_length)); ASSERT_OK(GetRecordBatchSize(batch.get(), &size)); ASSERT_EQ(mock.GetExtentBytesWritten(), size); } @@ -129,39 +136,36 @@ class RecursionLimits : public ::testing::Test, public io::MemoryMapFixture { void SetUp() { pool_ = default_memory_pool(); } void TearDown() { io::MemoryMapFixture::TearDown(); } - Status WriteToMmap(int recursion_level, bool override_level, - int64_t* header_out = nullptr, std::shared_ptr* schema_out = nullptr) { + Status WriteToMmap(int recursion_level, bool override_level, int32_t* metadata_length, + int64_t* body_length, std::shared_ptr* schema) { const int batch_length = 5; - TypePtr type = kInt32; + TypePtr type = int32(); ArrayPtr array; const bool include_nulls = true; RETURN_NOT_OK(MakeRandomInt32Array(1000, include_nulls, pool_, &array)); for (int i = 0; i < recursion_level; ++i) { - type = std::static_pointer_cast(std::make_shared(type)); + type = list(type); RETURN_NOT_OK( MakeRandomListArray(array, batch_length, include_nulls, pool_, &array)); } - auto f0 = std::make_shared("f0", type); - std::shared_ptr schema(new Schema({f0})); - if (schema_out != nullptr) { *schema_out = schema; } + auto f0 = field("f0", type); + + *schema = std::shared_ptr(new Schema({f0})); 
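    // At this point `type` is a list type nested recursion_level levels deep
    // around an int32 leaf and `array` holds matching random data, so the
    // single-field batch assembled below exercises exactly the writer's and
    // reader's recursion depth limits.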
+ std::vector arrays = {array}; - auto batch = std::make_shared(schema, batch_length, arrays); + auto batch = std::make_shared(*schema, batch_length, arrays); std::string path = "test-write-past-max-recursion"; const int memory_map_size = 1 << 16; io::MemoryMapFixture::InitMemoryMap(memory_map_size, path, &mmap_); - int64_t body_offset; - int64_t header_offset; - - int64_t* header_out_param = header_out == nullptr ? &header_offset : header_out; if (override_level) { - return WriteRecordBatch(batch->columns(), batch->num_rows(), mmap_.get(), - &body_offset, header_out_param, recursion_level + 1); + return WriteRecordBatch(batch->columns(), batch->num_rows(), 0, mmap_.get(), + metadata_length, body_length, recursion_level + 1); } else { - return WriteRecordBatch(batch->columns(), batch->num_rows(), mmap_.get(), - &body_offset, header_out_param); + return WriteRecordBatch(batch->columns(), batch->num_rows(), 0, mmap_.get(), + metadata_length, body_length); } } @@ -171,18 +175,29 @@ class RecursionLimits : public ::testing::Test, public io::MemoryMapFixture { }; TEST_F(RecursionLimits, WriteLimit) { - ASSERT_RAISES(Invalid, WriteToMmap((1 << 8) + 1, false)); + int32_t metadata_length = -1; + int64_t body_length = -1; + std::shared_ptr schema; + ASSERT_RAISES( + Invalid, WriteToMmap((1 << 8) + 1, false, &metadata_length, &body_length, &schema)); } TEST_F(RecursionLimits, ReadLimit) { - int64_t header_offset = -1; + int32_t metadata_length = -1; + int64_t body_length = -1; std::shared_ptr schema; - ASSERT_OK(WriteToMmap(64, true, &header_offset, &schema)); + ASSERT_OK(WriteToMmap(64, true, &metadata_length, &body_length, &schema)); - std::shared_ptr reader; - ASSERT_OK(RecordBatchReader::Open(mmap_.get(), header_offset, &reader)); - std::shared_ptr batch_result; - ASSERT_RAISES(Invalid, reader->GetRecordBatch(schema, &batch_result)); + std::shared_ptr metadata; + ASSERT_OK(ReadRecordBatchMetadata(0, metadata_length, mmap_.get(), &metadata)); + + std::shared_ptr payload; + ASSERT_OK(mmap_->ReadAt(metadata_length, body_length, &payload)); + + io::BufferReader reader(payload); + + std::shared_ptr batch; + ASSERT_RAISES(Invalid, ReadRecordBatch(metadata, schema, &reader, &batch)); } } // namespace ipc diff --git a/cpp/src/arrow/ipc/ipc-file-test.cc b/cpp/src/arrow/ipc/ipc-file-test.cc index cd424bf385cae..a1feac401f24e 100644 --- a/cpp/src/arrow/ipc/ipc-file-test.cc +++ b/cpp/src/arrow/ipc/ipc-file-test.cc @@ -68,7 +68,7 @@ class TestFileFormat : public ::testing::TestWithParam { RETURN_NOT_OK(sink_->Tell(&footer_offset)); // Open the file - auto reader = std::make_shared(buffer_->data(), buffer_->size()); + auto reader = std::make_shared(buffer_); RETURN_NOT_OK(FileReader::Open(reader, footer_offset, &file_reader_)); EXPECT_EQ(num_batches, file_reader_->num_record_batches()); diff --git a/cpp/src/arrow/ipc/ipc-json-test.cc b/cpp/src/arrow/ipc/ipc-json-test.cc index a51371c62005b..e5c3a081fca53 100644 --- a/cpp/src/arrow/ipc/ipc-json-test.cc +++ b/cpp/src/arrow/ipc/ipc-json-test.cc @@ -284,19 +284,23 @@ TEST(TestJsonFileReadWrite, MinimalFormatExample) { "name": "foo", "type": {"name": "int", "isSigned": true, "bitWidth": 32}, "nullable": true, "children": [], - "typeLayout": [ - {"type": "VALIDITY", "typeBitWidth": 1}, - {"type": "DATA", "typeBitWidth": 32} - ] + "typeLayout": { + "vectors": [ + {"type": "VALIDITY", "typeBitWidth": 1}, + {"type": "DATA", "typeBitWidth": 32} + ] + } }, { "name": "bar", "type": {"name": "floatingpoint", "precision": "DOUBLE"}, "nullable": true, "children": [], - 
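  // WriteToMmap now reports the written sizes through out-parameters, so the
  // updated test declares them (and the schema) up front even though
  // WriteLimit itself only cares about the returned Status: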
"typeLayout": [ - {"type": "VALIDITY", "typeBitWidth": 1}, - {"type": "DATA", "typeBitWidth": 64} - ] + "typeLayout": { + "vectors": [ + {"type": "VALIDITY", "typeBitWidth": 1}, + {"type": "DATA", "typeBitWidth": 64} + ] + } } ] }, diff --git a/cpp/src/arrow/ipc/ipc-metadata-test.cc b/cpp/src/arrow/ipc/ipc-metadata-test.cc index 1dc3969233237..d29583f8488e0 100644 --- a/cpp/src/arrow/ipc/ipc-metadata-test.cc +++ b/cpp/src/arrow/ipc/ipc-metadata-test.cc @@ -43,7 +43,7 @@ static inline void assert_schema_equal(const Schema* lhs, const Schema* rhs) { } } -class TestSchemaMessage : public ::testing::Test { +class TestSchemaMetadata : public ::testing::Test { public: void SetUp() {} @@ -52,11 +52,11 @@ class TestSchemaMessage : public ::testing::Test { ASSERT_OK(WriteSchema(schema, &buffer)); std::shared_ptr message; - ASSERT_OK(Message::Open(buffer, &message)); + ASSERT_OK(Message::Open(buffer, 0, &message)); ASSERT_EQ(Message::SCHEMA, message->type()); - std::shared_ptr schema_msg = message->GetSchema(); + auto schema_msg = std::make_shared(message); ASSERT_EQ(schema->num_fields(), schema_msg->num_fields()); std::shared_ptr schema2; @@ -68,7 +68,7 @@ class TestSchemaMessage : public ::testing::Test { const std::shared_ptr INT32 = std::make_shared(); -TEST_F(TestSchemaMessage, PrimitiveFields) { +TEST_F(TestSchemaMetadata, PrimitiveFields) { auto f0 = std::make_shared("f0", std::make_shared()); auto f1 = std::make_shared("f1", std::make_shared()); auto f2 = std::make_shared("f2", std::make_shared()); @@ -85,7 +85,7 @@ TEST_F(TestSchemaMessage, PrimitiveFields) { CheckRoundtrip(&schema); } -TEST_F(TestSchemaMessage, NestedFields) { +TEST_F(TestSchemaMetadata, NestedFields) { auto type = std::make_shared(std::make_shared()); auto f0 = std::make_shared("f0", type); @@ -111,7 +111,7 @@ class TestFileFooter : public ::testing::Test { std::unique_ptr footer; ASSERT_OK(FileFooter::Open(buffer, &footer)); - ASSERT_EQ(MetadataVersion::V1_SNAPSHOT, footer->version()); + ASSERT_EQ(MetadataVersion::V2, footer->version()); // Check schema std::shared_ptr schema2; diff --git a/cpp/src/arrow/ipc/json-integration-test.cc b/cpp/src/arrow/ipc/json-integration-test.cc index 5eff8998afbc8..7a313f791e6c8 100644 --- a/cpp/src/arrow/ipc/json-integration-test.cc +++ b/cpp/src/arrow/ipc/json-integration-test.cc @@ -255,19 +255,23 @@ static const char* JSON_EXAMPLE = R"example( "name": "foo", "type": {"name": "int", "isSigned": true, "bitWidth": 32}, "nullable": true, "children": [], - "typeLayout": [ - {"type": "VALIDITY", "typeBitWidth": 1}, - {"type": "DATA", "typeBitWidth": 32} - ] + "typeLayout": { + "vectors": [ + {"type": "VALIDITY", "typeBitWidth": 1}, + {"type": "DATA", "typeBitWidth": 32} + ] + } }, { "name": "bar", "type": {"name": "floatingpoint", "precision": "DOUBLE"}, "nullable": true, "children": [], - "typeLayout": [ - {"type": "VALIDITY", "typeBitWidth": 1}, - {"type": "DATA", "typeBitWidth": 64} - ] + "typeLayout": { + "vectors": [ + {"type": "VALIDITY", "typeBitWidth": 1}, + {"type": "DATA", "typeBitWidth": 64} + ] + } } ] }, @@ -301,10 +305,12 @@ static const char* JSON_EXAMPLE2 = R"example( "name": "foo", "type": {"name": "int", "isSigned": true, "bitWidth": 32}, "nullable": true, "children": [], - "typeLayout": [ - {"type": "VALIDITY", "typeBitWidth": 1}, - {"type": "DATA", "typeBitWidth": 32} - ] + "typeLayout": { + "vectors": [ + {"type": "VALIDITY", "typeBitWidth": 1}, + {"type": "DATA", "typeBitWidth": 32} + ] + } } ] }, diff --git a/cpp/src/arrow/ipc/json-internal.cc 
b/cpp/src/arrow/ipc/json-internal.cc index 31fe35b44cef7..e56bcb32b9488 100644 --- a/cpp/src/arrow/ipc/json-internal.cc +++ b/cpp/src/arrow/ipc/json-internal.cc @@ -45,8 +45,6 @@ namespace ipc { using RjArray = rj::Value::ConstArray; using RjObject = rj::Value::ConstObject; -enum class BufferType : char { DATA, OFFSET, TYPE, VALIDITY }; - static std::string GetBufferTypeName(BufferType type) { switch (type) { case BufferType::DATA: @@ -93,27 +91,6 @@ static std::string GetTimeUnitName(TimeUnit unit) { return "UNKNOWN"; } -class BufferLayout { - public: - BufferLayout(BufferType type, int bit_width) : type_(type), bit_width_(bit_width) {} - - BufferType type() const { return type_; } - int bit_width() const { return bit_width_; } - - private: - BufferType type_; - int bit_width_; -}; - -static const BufferLayout kValidityBuffer(BufferType::VALIDITY, 1); -static const BufferLayout kOffsetBuffer(BufferType::OFFSET, 32); -static const BufferLayout kTypeBuffer(BufferType::TYPE, 32); -static const BufferLayout kBooleanBuffer(BufferType::DATA, 1); -static const BufferLayout kValues64(BufferType::DATA, 64); -static const BufferLayout kValues32(BufferType::DATA, 32); -static const BufferLayout kValues16(BufferType::DATA, 16); -static const BufferLayout kValues8(BufferType::DATA, 8); - class JsonSchemaWriter : public TypeVisitor { public: explicit JsonSchemaWriter(const Schema& schema, RjWriter* writer) @@ -154,9 +131,9 @@ class JsonSchemaWriter : public TypeVisitor { } template - typename std::enable_if::value || - std::is_base_of::value || - std::is_base_of::value, + typename std::enable_if< + std::is_base_of::value || std::is_base_of::value || + std::is_base_of::value || std::is_base_of::value, void>::type WriteTypeMetadata(const T& type) {} @@ -243,11 +220,10 @@ class JsonSchemaWriter : public TypeVisitor { } template - Status WritePrimitive(const std::string& typeclass, const T& type, - const std::vector& buffer_layout) { + Status WritePrimitive(const std::string& typeclass, const T& type) { WriteName(typeclass, type); SetNoChildren(); - WriteBufferLayout(buffer_layout); + WriteBufferLayout(type.GetBufferLayout()); return Status::OK(); } @@ -255,15 +231,17 @@ class JsonSchemaWriter : public TypeVisitor { Status WriteVarBytes(const std::string& typeclass, const T& type) { WriteName(typeclass, type); SetNoChildren(); - WriteBufferLayout({kValidityBuffer, kOffsetBuffer, kValues8}); + WriteBufferLayout(type.GetBufferLayout()); return Status::OK(); } - void WriteBufferLayout(const std::vector& buffer_layout) { + void WriteBufferLayout(const std::vector& buffer_layout) { writer_->Key("typeLayout"); + writer_->StartObject(); + writer_->Key("vectors"); writer_->StartArray(); - for (const BufferLayout& buffer : buffer_layout) { + for (const BufferDescr& buffer : buffer_layout) { writer_->StartObject(); writer_->Key("type"); writer_->String(GetBufferTypeName(buffer.type())); @@ -274,6 +252,7 @@ class JsonSchemaWriter : public TypeVisitor { writer_->EndObject(); } writer_->EndArray(); + writer_->EndObject(); } Status WriteChildren(const std::vector>& children) { @@ -286,74 +265,52 @@ class JsonSchemaWriter : public TypeVisitor { return Status::OK(); } - Status Visit(const NullType& type) override { return WritePrimitive("null", type, {}); } + Status Visit(const NullType& type) override { return WritePrimitive("null", type); } - Status Visit(const BooleanType& type) override { - return WritePrimitive("bool", type, {kValidityBuffer, kBooleanBuffer}); - } + Status Visit(const BooleanType& type) override { 
return WritePrimitive("bool", type); } - Status Visit(const Int8Type& type) override { - return WritePrimitive("int", type, {kValidityBuffer, kValues8}); - } + Status Visit(const Int8Type& type) override { return WritePrimitive("int", type); } - Status Visit(const Int16Type& type) override { - return WritePrimitive("int", type, {kValidityBuffer, kValues16}); - } + Status Visit(const Int16Type& type) override { return WritePrimitive("int", type); } - Status Visit(const Int32Type& type) override { - return WritePrimitive("int", type, {kValidityBuffer, kValues32}); - } + Status Visit(const Int32Type& type) override { return WritePrimitive("int", type); } - Status Visit(const Int64Type& type) override { - return WritePrimitive("int", type, {kValidityBuffer, kValues64}); - } + Status Visit(const Int64Type& type) override { return WritePrimitive("int", type); } - Status Visit(const UInt8Type& type) override { - return WritePrimitive("int", type, {kValidityBuffer, kValues8}); - } + Status Visit(const UInt8Type& type) override { return WritePrimitive("int", type); } - Status Visit(const UInt16Type& type) override { - return WritePrimitive("int", type, {kValidityBuffer, kValues16}); - } + Status Visit(const UInt16Type& type) override { return WritePrimitive("int", type); } - Status Visit(const UInt32Type& type) override { - return WritePrimitive("int", type, {kValidityBuffer, kValues32}); - } + Status Visit(const UInt32Type& type) override { return WritePrimitive("int", type); } - Status Visit(const UInt64Type& type) override { - return WritePrimitive("int", type, {kValidityBuffer, kValues64}); - } + Status Visit(const UInt64Type& type) override { return WritePrimitive("int", type); } Status Visit(const HalfFloatType& type) override { - return WritePrimitive("floatingpoint", type, {kValidityBuffer, kValues16}); + return WritePrimitive("floatingpoint", type); } Status Visit(const FloatType& type) override { - return WritePrimitive("floatingpoint", type, {kValidityBuffer, kValues32}); + return WritePrimitive("floatingpoint", type); } Status Visit(const DoubleType& type) override { - return WritePrimitive("floatingpoint", type, {kValidityBuffer, kValues64}); + return WritePrimitive("floatingpoint", type); } Status Visit(const StringType& type) override { return WriteVarBytes("utf8", type); } Status Visit(const BinaryType& type) override { return WriteVarBytes("binary", type); } - Status Visit(const DateType& type) override { - return WritePrimitive("date", type, {kValidityBuffer, kValues64}); - } + Status Visit(const DateType& type) override { return WritePrimitive("date", type); } - Status Visit(const TimeType& type) override { - return WritePrimitive("time", type, {kValidityBuffer, kValues64}); - } + Status Visit(const TimeType& type) override { return WritePrimitive("time", type); } Status Visit(const TimestampType& type) override { - return WritePrimitive("timestamp", type, {kValidityBuffer, kValues64}); + return WritePrimitive("timestamp", type); } Status Visit(const IntervalType& type) override { - return WritePrimitive("interval", type, {kValidityBuffer, kValues64}); + return WritePrimitive("interval", type); } Status Visit(const DecimalType& type) override { return Status::NotImplemented("NYI"); } @@ -361,26 +318,21 @@ class JsonSchemaWriter : public TypeVisitor { Status Visit(const ListType& type) override { WriteName("list", type); RETURN_NOT_OK(WriteChildren(type.children())); - WriteBufferLayout({kValidityBuffer, kOffsetBuffer}); + WriteBufferLayout(type.GetBufferLayout()); return 
Status::OK(); } Status Visit(const StructType& type) override { WriteName("struct", type); WriteChildren(type.children()); - WriteBufferLayout({kValidityBuffer, kTypeBuffer}); + WriteBufferLayout(type.GetBufferLayout()); return Status::OK(); } Status Visit(const UnionType& type) override { WriteName("union", type); WriteChildren(type.children()); - - if (type.mode == UnionMode::SPARSE) { - WriteBufferLayout({kValidityBuffer, kTypeBuffer}); - } else { - WriteBufferLayout({kValidityBuffer, kTypeBuffer, kOffsetBuffer}); - } + WriteBufferLayout(type.GetBufferLayout()); return Status::OK(); } diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index 7102012c29a84..b99522825d902 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -37,20 +37,6 @@ namespace flatbuf = org::apache::arrow::flatbuf; namespace ipc { -const std::shared_ptr BOOL = std::make_shared(); -const std::shared_ptr INT8 = std::make_shared(); -const std::shared_ptr INT16 = std::make_shared(); -const std::shared_ptr INT32 = std::make_shared(); -const std::shared_ptr INT64 = std::make_shared(); -const std::shared_ptr UINT8 = std::make_shared(); -const std::shared_ptr UINT16 = std::make_shared(); -const std::shared_ptr UINT32 = std::make_shared(); -const std::shared_ptr UINT64 = std::make_shared(); -const std::shared_ptr FLOAT = std::make_shared(); -const std::shared_ptr DOUBLE = std::make_shared(); -const std::shared_ptr STRING = std::make_shared(); -const std::shared_ptr BINARY = std::make_shared(); - static Status IntFromFlatbuffer( const flatbuf::Int* int_data, std::shared_ptr* out) { if (int_data->bitWidth() > 64) { @@ -62,16 +48,16 @@ static Status IntFromFlatbuffer( switch (int_data->bitWidth()) { case 8: - *out = int_data->is_signed() ? INT8 : UINT8; + *out = int_data->is_signed() ? int8() : uint8(); break; case 16: - *out = int_data->is_signed() ? INT16 : UINT16; + *out = int_data->is_signed() ? int16() : uint16(); break; case 32: - *out = int_data->is_signed() ? INT32 : UINT32; + *out = int_data->is_signed() ? int32() : uint32(); break; case 64: - *out = int_data->is_signed() ? INT64 : UINT64; + *out = int_data->is_signed() ? 
int64() : uint64(); break; default: return Status::NotImplemented("Integers not in cstdint are not implemented"); @@ -81,10 +67,12 @@ static Status IntFromFlatbuffer( static Status FloatFromFlatuffer( const flatbuf::FloatingPoint* float_data, std::shared_ptr* out) { - if (float_data->precision() == flatbuf::Precision_SINGLE) { - *out = FLOAT; + if (float_data->precision() == flatbuf::Precision_HALF) { + *out = float16(); + } else if (float_data->precision() == flatbuf::Precision_SINGLE) { + *out = float32(); } else { - *out = DOUBLE; + *out = float64(); } return Status::OK(); } @@ -100,13 +88,13 @@ static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, return FloatFromFlatuffer( static_cast(type_data), out); case flatbuf::Type_Binary: - *out = BINARY; + *out = binary(); return Status::OK(); case flatbuf::Type_Utf8: - *out = STRING; + *out = utf8(); return Status::OK(); case flatbuf::Type_Bool: - *out = BOOL; + *out = boolean(); return Status::OK(); case flatbuf::Type_Decimal: case flatbuf::Type_Timestamp: @@ -164,7 +152,32 @@ static Status StructToFlatbuffer(FBB& fbb, const std::shared_ptr& type break; static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, - std::vector* children, flatbuf::Type* out_type, Offset* offset) { + std::vector* children, std::vector* layout, + flatbuf::Type* out_type, Offset* offset) { + std::vector buffer_layout = type->GetBufferLayout(); + for (const BufferDescr& descr : buffer_layout) { + flatbuf::VectorType vector_type; + switch (descr.type()) { + case BufferType::OFFSET: + vector_type = flatbuf::VectorType_OFFSET; + break; + case BufferType::DATA: + vector_type = flatbuf::VectorType_DATA; + break; + case BufferType::VALIDITY: + vector_type = flatbuf::VectorType_VALIDITY; + break; + case BufferType::TYPE: + vector_type = flatbuf::VectorType_TYPE; + break; + default: + vector_type = flatbuf::VectorType_DATA; + break; + } + auto offset = flatbuf::CreateVectorLayout(fbb, descr.bit_width(), vector_type); + layout->push_back(offset); + } + switch (type->type) { case Type::BOOL: *out_type = flatbuf::Type_Bool; @@ -223,14 +236,18 @@ static Status FieldToFlatbuffer( flatbuf::Type type_enum; Offset type_data; + Offset type_layout; std::vector children; + std::vector layout; - RETURN_NOT_OK(TypeToFlatbuffer(fbb, field->type, &children, &type_enum, &type_data)); + RETURN_NOT_OK( + TypeToFlatbuffer(fbb, field->type, &children, &layout, &type_enum, &type_data)); auto fb_children = fbb.CreateVector(children); + auto fb_layout = fbb.CreateVector(layout); // TODO: produce the list of VectorTypes *offset = flatbuf::CreateField(fbb, fb_name, field->nullable, type_enum, type_data, - field->dictionary, fb_children); + field->dictionary, fb_children, fb_layout); return Status::OK(); } @@ -300,13 +317,26 @@ Status MessageBuilder::SetRecordBatch(int32_t length, int64_t body_length, return Status::OK(); } -Status WriteDataHeader(int32_t length, int64_t body_length, +Status WriteRecordBatchMetadata(int32_t length, int64_t body_length, const std::vector& nodes, const std::vector& buffers, std::shared_ptr* out) { - MessageBuilder message; - RETURN_NOT_OK(message.SetRecordBatch(length, body_length, nodes, buffers)); - RETURN_NOT_OK(message.Finish()); - return message.GetBuffer(out); + flatbuffers::FlatBufferBuilder fbb; + + auto batch = flatbuf::CreateRecordBatch( + fbb, length, fbb.CreateVectorOfStructs(nodes), fbb.CreateVectorOfStructs(buffers)); + + fbb.Finish(batch); + + int32_t size = fbb.GetSize(); + + auto result = std::make_shared(); + 
RETURN_NOT_OK(result->Resize(size)); + + uint8_t* dst = result->mutable_data(); + memcpy(dst, fbb.GetBufferPointer(), size); + + *out = result; + return Status::OK(); } Status MessageBuilder::Finish() { @@ -317,17 +347,13 @@ Status MessageBuilder::Finish() { } Status MessageBuilder::GetBuffer(std::shared_ptr* out) { - // The message buffer is suffixed by the size of the complete flatbuffer as - // int32_t - // int32_t size = fbb_.GetSize(); auto result = std::make_shared(); - RETURN_NOT_OK(result->Resize(size + sizeof(int32_t))); + RETURN_NOT_OK(result->Resize(size)); uint8_t* dst = result->mutable_data(); memcpy(dst, fbb_.GetBufferPointer(), size); - memcpy(dst + size, reinterpret_cast(&size), sizeof(int32_t)); *out = result; return Status::OK(); diff --git a/cpp/src/arrow/ipc/metadata-internal.h b/cpp/src/arrow/ipc/metadata-internal.h index c404cfde22ca3..4826ebe22899d 100644 --- a/cpp/src/arrow/ipc/metadata-internal.h +++ b/cpp/src/arrow/ipc/metadata-internal.h @@ -41,10 +41,10 @@ namespace ipc { using FBB = flatbuffers::FlatBufferBuilder; using FieldOffset = flatbuffers::Offset; +using VectorLayoutOffset = flatbuffers::Offset; using Offset = flatbuffers::Offset; -static constexpr flatbuf::MetadataVersion kMetadataVersion = - flatbuf::MetadataVersion_V1_SNAPSHOT; +static constexpr flatbuf::MetadataVersion kMetadataVersion = flatbuf::MetadataVersion_V2; Status FieldFromFlatbuffer(const flatbuf::Field* field, std::shared_ptr* out); @@ -70,7 +70,7 @@ class MessageBuilder { flatbuffers::FlatBufferBuilder fbb_; }; -Status WriteDataHeader(int32_t length, int64_t body_length, +Status WriteRecordBatchMetadata(int32_t length, int64_t body_length, const std::vector& nodes, const std::vector& buffers, std::shared_ptr* out); diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index 66df8a6711fa9..44d3939c04f1d 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -50,9 +50,15 @@ Status WriteSchema(const Schema* schema, std::shared_ptr* out) { class Message::MessageImpl { public: - explicit MessageImpl( - const std::shared_ptr& buffer, const flatbuf::Message* message) - : buffer_(buffer), message_(message) {} + explicit MessageImpl(const std::shared_ptr& buffer, int64_t offset) + : buffer_(buffer), offset_(offset), message_(nullptr) {} + + Status Open() { + message_ = flatbuf::GetMessage(buffer_->data() + offset_); + + // TODO(wesm): verify the message + return Status::OK(); + } Message::Type type() const { switch (message_->header_type()) { @@ -72,25 +78,23 @@ class Message::MessageImpl { int64_t body_length() const { return message_->bodyLength(); } private: - // Owns the memory this message accesses + // Retain reference to memory std::shared_ptr buffer_; + int64_t offset_; const flatbuf::Message* message_; }; -Message::Message() {} - -Status Message::Open( - const std::shared_ptr& buffer, std::shared_ptr* out) { - std::shared_ptr result(new Message()); - - const flatbuf::Message* message = flatbuf::GetMessage(buffer->data()); +Message::Message(const std::shared_ptr& buffer, int64_t offset) { + impl_.reset(new MessageImpl(buffer, offset)); +} - // TODO(wesm): verify message - result->impl_.reset(new MessageImpl(buffer, message)); - *out = result; +Status Message::Open(const std::shared_ptr& buffer, int64_t offset, + std::shared_ptr* out) { + // ctor is private - return Status::OK(); + *out = std::shared_ptr(new Message(buffer, offset)); + return (*out)->impl_->Open(); } Message::Type Message::type() const { @@ -101,20 +105,12 @@ int64_t 
Message::body_length() const { return impl_->body_length(); } -std::shared_ptr Message::get_shared_ptr() { - return this->shared_from_this(); -} - -std::shared_ptr Message::GetSchema() { - return std::make_shared(this->shared_from_this(), impl_->header()); -} - // ---------------------------------------------------------------------- -// SchemaMessage +// SchemaMetadata -class SchemaMessage::SchemaMessageImpl { +class SchemaMetadata::SchemaMetadataImpl { public: - explicit SchemaMessageImpl(const void* schema) + explicit SchemaMetadataImpl(const void* schema) : schema_(static_cast(schema)) {} const flatbuf::Field* field(int i) const { return schema_->fields()->Get(i); } @@ -125,22 +121,29 @@ class SchemaMessage::SchemaMessageImpl { const flatbuf::Schema* schema_; }; -SchemaMessage::SchemaMessage( - const std::shared_ptr& message, const void* schema) { +SchemaMetadata::SchemaMetadata( + const std::shared_ptr& message, const void* flatbuf) { + message_ = message; + impl_.reset(new SchemaMetadataImpl(flatbuf)); +} + +SchemaMetadata::SchemaMetadata(const std::shared_ptr& message) { message_ = message; - impl_.reset(new SchemaMessageImpl(schema)); + impl_.reset(new SchemaMetadataImpl(message->impl_->header())); } -int SchemaMessage::num_fields() const { +SchemaMetadata::~SchemaMetadata() {} + +int SchemaMetadata::num_fields() const { return impl_->num_fields(); } -Status SchemaMessage::GetField(int i, std::shared_ptr* out) const { +Status SchemaMetadata::GetField(int i, std::shared_ptr* out) const { const flatbuf::Field* field = impl_->field(i); return FieldFromFlatbuffer(field, out); } -Status SchemaMessage::GetSchema(std::shared_ptr* out) const { +Status SchemaMetadata::GetSchema(std::shared_ptr* out) const { std::vector> fields(num_fields()); for (int i = 0; i < this->num_fields(); ++i) { RETURN_NOT_OK(GetField(i, &fields[i])); @@ -150,11 +153,11 @@ Status SchemaMessage::GetSchema(std::shared_ptr* out) const { } // ---------------------------------------------------------------------- -// RecordBatchMessage +// RecordBatchMetadata -class RecordBatchMessage::RecordBatchMessageImpl { +class RecordBatchMetadata::RecordBatchMetadataImpl { public: - explicit RecordBatchMessageImpl(const void* batch) + explicit RecordBatchMetadataImpl(const void* batch) : batch_(static_cast(batch)) { nodes_ = batch_->nodes(); buffers_ = batch_->buffers(); @@ -176,19 +179,29 @@ class RecordBatchMessage::RecordBatchMessageImpl { const flatbuffers::Vector* buffers_; }; -std::shared_ptr Message::GetRecordBatch() { - return std::make_shared(this->shared_from_this(), impl_->header()); +RecordBatchMetadata::RecordBatchMetadata(const std::shared_ptr& message) { + message_ = message; + impl_.reset(new RecordBatchMetadataImpl(message->impl_->header())); } -RecordBatchMessage::RecordBatchMessage( - const std::shared_ptr& message, const void* batch) { - message_ = message; - impl_.reset(new RecordBatchMessageImpl(batch)); +RecordBatchMetadata::RecordBatchMetadata( + const std::shared_ptr& buffer, int64_t offset) { + message_ = nullptr; + buffer_ = buffer; + + const flatbuf::RecordBatch* metadata = + flatbuffers::GetRoot(buffer->data() + offset); + + // TODO(wesm): validate table + + impl_.reset(new RecordBatchMetadataImpl(metadata)); } +RecordBatchMetadata::~RecordBatchMetadata() {} + // TODO(wesm): Copying the flatbuffer data isn't great, but this will do for // now -FieldMetadata RecordBatchMessage::field(int i) const { +FieldMetadata RecordBatchMetadata::field(int i) const { const flatbuf::FieldNode* node = 
impl_->field(i); FieldMetadata result; @@ -197,7 +210,7 @@ FieldMetadata RecordBatchMessage::field(int i) const { return result; } -BufferMetadata RecordBatchMessage::buffer(int i) const { +BufferMetadata RecordBatchMetadata::buffer(int i) const { const flatbuf::Buffer* buffer = impl_->buffer(i); BufferMetadata result; @@ -207,15 +220,15 @@ BufferMetadata RecordBatchMessage::buffer(int i) const { return result; } -int32_t RecordBatchMessage::length() const { +int32_t RecordBatchMetadata::length() const { return impl_->length(); } -int RecordBatchMessage::num_buffers() const { +int RecordBatchMetadata::num_buffers() const { return impl_->num_buffers(); } -int RecordBatchMessage::num_fields() const { +int RecordBatchMetadata::num_fields() const { return impl_->num_fields(); } @@ -268,11 +281,13 @@ class FileFooter::FileFooterImpl { MetadataVersion::type version() const { switch (footer_->version()) { - case flatbuf::MetadataVersion_V1_SNAPSHOT: - return MetadataVersion::V1_SNAPSHOT; + case flatbuf::MetadataVersion_V1: + return MetadataVersion::V1; + case flatbuf::MetadataVersion_V2: + return MetadataVersion::V2; // Add cases as other versions become available default: - return MetadataVersion::V1_SNAPSHOT; + return MetadataVersion::V2; } } @@ -285,7 +300,7 @@ class FileFooter::FileFooterImpl { } Status GetSchema(std::shared_ptr* out) const { - auto schema_msg = std::make_shared(nullptr, footer_->schema()); + auto schema_msg = std::make_shared(nullptr, footer_->schema()); return schema_msg->GetSchema(out); } diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h index 2f0e853bf97f0..1c4ef64d62fad 100644 --- a/cpp/src/arrow/ipc/metadata.h +++ b/cpp/src/arrow/ipc/metadata.h @@ -42,7 +42,7 @@ class OutputStream; namespace ipc { struct MetadataVersion { - enum type { V1_SNAPSHOT }; + enum type { V1, V2 }; }; //---------------------------------------------------------------------- @@ -58,10 +58,14 @@ Status WriteSchema(const Schema* schema, std::shared_ptr* out); class Message; // Container for serialized Schema metadata contained in an IPC message -class ARROW_EXPORT SchemaMessage { +class ARROW_EXPORT SchemaMetadata { public: + explicit SchemaMetadata(const std::shared_ptr& message); + // Accepts an opaque flatbuffer pointer - SchemaMessage(const std::shared_ptr& message, const void* schema); + SchemaMetadata(const std::shared_ptr& message, const void* schema); + + ~SchemaMetadata(); int num_fields() const; @@ -76,8 +80,8 @@ class ARROW_EXPORT SchemaMessage { // Parent, owns the flatbuffer data std::shared_ptr message_; - class SchemaMessageImpl; - std::unique_ptr impl_; + class SchemaMetadataImpl; + std::unique_ptr impl_; }; // Field metadata @@ -93,10 +97,13 @@ struct BufferMetadata { }; // Container for serialized record batch metadata contained in an IPC message -class ARROW_EXPORT RecordBatchMessage { +class ARROW_EXPORT RecordBatchMetadata { public: - // Accepts an opaque flatbuffer pointer - RecordBatchMessage(const std::shared_ptr& message, const void* batch_meta); + explicit RecordBatchMetadata(const std::shared_ptr& message); + + RecordBatchMetadata(const std::shared_ptr& message, int64_t offset); + + ~RecordBatchMetadata(); FieldMetadata field(int i) const; BufferMetadata buffer(int i) const; @@ -108,37 +115,34 @@ class ARROW_EXPORT RecordBatchMessage { private: // Parent, owns the flatbuffer data std::shared_ptr message_; + std::shared_ptr buffer_; - class RecordBatchMessageImpl; - std::unique_ptr impl_; + class RecordBatchMetadataImpl; + std::unique_ptr impl_; }; 
-class ARROW_EXPORT DictionaryBatchMessage { +class ARROW_EXPORT DictionaryBatchMetadata { public: int64_t id() const; - std::unique_ptr data() const; + std::unique_ptr data() const; }; -class ARROW_EXPORT Message : public std::enable_shared_from_this { +class ARROW_EXPORT Message { public: enum Type { NONE, SCHEMA, DICTIONARY_BATCH, RECORD_BATCH }; - static Status Open( - const std::shared_ptr& buffer, std::shared_ptr* out); - - std::shared_ptr get_shared_ptr(); + static Status Open(const std::shared_ptr& buffer, int64_t offset, + std::shared_ptr* out); int64_t body_length() const; Type type() const; - // These methods only to be invoked if you have checked the message type - std::shared_ptr GetSchema(); - std::shared_ptr GetRecordBatch(); - std::shared_ptr GetDictionaryBatch(); - private: - Message(); + Message(const std::shared_ptr& buffer, int64_t offset); + + friend class RecordBatchMetadata; + friend class SchemaMetadata; // Hide serialization details from user API class MessageImpl; diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index 9abc20d876de4..65b378215222d 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -39,8 +39,7 @@ namespace arrow { namespace ipc { -const auto kInt32 = std::make_shared(); -const auto kListInt32 = list(kInt32); +const auto kListInt32 = list(int32()); const auto kListListInt32 = list(kListInt32); Status MakeRandomInt32Array( @@ -99,8 +98,8 @@ Status MakeIntRecordBatch(std::shared_ptr* out) { const int length = 1000; // Make the schema - auto f0 = std::make_shared("f0", kInt32); - auto f1 = std::make_shared("f1", kInt32); + auto f0 = std::make_shared("f0", int32()); + auto f1 = std::make_shared("f1", int32()); std::shared_ptr schema(new Schema({f0, f1})); // Example data @@ -161,7 +160,7 @@ Status MakeListRecordBatch(std::shared_ptr* out) { // Make the schema auto f0 = std::make_shared("f0", kListInt32); auto f1 = std::make_shared("f1", kListListInt32); - auto f2 = std::make_shared("f2", kInt32); + auto f2 = std::make_shared("f2", int32()); std::shared_ptr schema(new Schema({f0, f1, f2})); // Example data @@ -184,7 +183,7 @@ Status MakeZeroLengthRecordBatch(std::shared_ptr* out) { // Make the schema auto f0 = std::make_shared("f0", kListInt32); auto f1 = std::make_shared("f1", kListListInt32); - auto f2 = std::make_shared("f2", kInt32); + auto f2 = std::make_shared("f2", int32()); std::shared_ptr schema(new Schema({f0, f1, f2})); // Example data @@ -205,7 +204,7 @@ Status MakeNonNullRecordBatch(std::shared_ptr* out) { // Make the schema auto f0 = std::make_shared("f0", kListInt32); auto f1 = std::make_shared("f1", kListListInt32); - auto f2 = std::make_shared("f2", kInt32); + auto f2 = std::make_shared("f2", int32()); std::shared_ptr schema(new Schema({f0, f1, f2})); // Example data @@ -226,7 +225,7 @@ Status MakeNonNullRecordBatch(std::shared_ptr* out) { Status MakeDeeplyNestedList(std::shared_ptr* out) { const int batch_length = 5; - TypePtr type = kInt32; + TypePtr type = int32(); MemoryPool* pool = default_memory_pool(); ArrayPtr array; diff --git a/cpp/src/arrow/ipc/util.h b/cpp/src/arrow/ipc/util.h index 9000d1bb0c6c3..242d6624f1e7f 100644 --- a/cpp/src/arrow/ipc/util.h +++ b/cpp/src/arrow/ipc/util.h @@ -28,12 +28,10 @@ namespace arrow { namespace ipc { // Align on 8-byte boundaries -static constexpr int kArrowAlignment = 8; - // Buffers are padded to 64-byte boundaries (for SIMD) -static constexpr int kArrowBufferAlignment = 64; +static constexpr int kArrowAlignment = 64; -static 
constexpr uint8_t kPaddingBytes[kArrowBufferAlignment] = {0}; +static constexpr uint8_t kPaddingBytes[kArrowAlignment] = {0}; static inline int64_t PaddedLength(int64_t nbytes, int64_t alignment = kArrowAlignment) { return ((nbytes + alignment - 1) / alignment) * alignment; diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index 93dd5b69b1bb7..63c2166a5736b 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -61,10 +61,10 @@ // Alias MSVC popcount to GCC name #ifdef _MSC_VER -# include -# define __builtin_popcount __popcnt -# include -# define __builtin_popcountll _mm_popcnt_u64 +#include +#define __builtin_popcount __popcnt +#include +#define __builtin_popcountll _mm_popcnt_u64 #endif namespace arrow { diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 589bdadb77c64..80f295c487f13 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -105,10 +105,6 @@ std::string UnionType::ToString() const { return s.str(); } -int NullType::bit_width() const { - return 0; -} - std::string NullType::ToString() const { return name(); } @@ -187,4 +183,46 @@ std::shared_ptr field( return std::make_shared(name, type, nullable, dictionary); } +static const BufferDescr kValidityBuffer(BufferType::VALIDITY, 1); +static const BufferDescr kOffsetBuffer(BufferType::OFFSET, 32); +static const BufferDescr kTypeBuffer(BufferType::TYPE, 32); +static const BufferDescr kBooleanBuffer(BufferType::DATA, 1); +static const BufferDescr kValues64(BufferType::DATA, 64); +static const BufferDescr kValues32(BufferType::DATA, 32); +static const BufferDescr kValues16(BufferType::DATA, 16); +static const BufferDescr kValues8(BufferType::DATA, 8); + +std::vector FixedWidthType::GetBufferLayout() const { + return {kValidityBuffer, BufferDescr(BufferType::DATA, bit_width())}; +} + +std::vector NullType::GetBufferLayout() const { + return {}; +} + +std::vector BinaryType::GetBufferLayout() const { + return {kValidityBuffer, kOffsetBuffer, kValues8}; +} + +std::vector ListType::GetBufferLayout() const { + return {kValidityBuffer, kOffsetBuffer}; +} + +std::vector StructType::GetBufferLayout() const { + return {kValidityBuffer, kTypeBuffer}; +} + +std::vector UnionType::GetBufferLayout() const { + if (mode == UnionMode::SPARSE) { + return {kValidityBuffer, kTypeBuffer}; + } else { + return {kValidityBuffer, kTypeBuffer, kOffsetBuffer}; + } +} + +std::vector DecimalType::GetBufferLayout() const { + // TODO(wesm) + return {}; +} + } // namespace arrow diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 876d7ea464b1c..30777384dfb9f 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -101,6 +101,20 @@ struct Type { }; }; +enum class BufferType : char { DATA, OFFSET, TYPE, VALIDITY }; + +class BufferDescr { + public: + BufferDescr(BufferType type, int bit_width) : type_(type), bit_width_(bit_width) {} + + BufferType type() const { return type_; } + int bit_width() const { return bit_width_; } + + private: + BufferType type_; + int bit_width_; +}; + struct ARROW_EXPORT DataType { Type::type type; @@ -129,12 +143,18 @@ struct ARROW_EXPORT DataType { virtual Status Accept(TypeVisitor* visitor) const = 0; virtual std::string ToString() const = 0; + + virtual std::vector GetBufferLayout() const = 0; }; typedef std::shared_ptr TypePtr; -struct ARROW_EXPORT FixedWidthMeta { +struct ARROW_EXPORT FixedWidthType : public DataType { + using DataType::DataType; + virtual int bit_width() const = 0; + + std::vector GetBufferLayout() const override; }; struct ARROW_EXPORT 
IntegerMeta { @@ -184,12 +204,12 @@ struct ARROW_EXPORT Field { }; typedef std::shared_ptr FieldPtr; -struct ARROW_EXPORT PrimitiveCType : public DataType { - using DataType::DataType; +struct ARROW_EXPORT PrimitiveCType : public FixedWidthType { + using FixedWidthType::FixedWidthType; }; template -struct ARROW_EXPORT CTypeImpl : public PrimitiveCType, public FixedWidthMeta { +struct ARROW_EXPORT CTypeImpl : public PrimitiveCType { using c_type = C_TYPE; static constexpr Type::type type_id = TYPE_ID; @@ -204,16 +224,17 @@ struct ARROW_EXPORT CTypeImpl : public PrimitiveCType, public FixedWidthMeta { std::string ToString() const override { return std::string(DERIVED::name()); } }; -struct ARROW_EXPORT NullType : public DataType, public FixedWidthMeta { +struct ARROW_EXPORT NullType : public DataType { static constexpr Type::type type_id = Type::NA; NullType() : DataType(Type::NA) {} - int bit_width() const override; Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; static std::string name() { return "null"; } + + std::vector GetBufferLayout() const override; }; template @@ -221,10 +242,10 @@ struct IntegerTypeImpl : public CTypeImpl, public Inte bool is_signed() const override { return std::is_signed::value; } }; -struct ARROW_EXPORT BooleanType : public DataType, FixedWidthMeta { +struct ARROW_EXPORT BooleanType : public FixedWidthType { static constexpr Type::type type_id = Type::BOOL; - BooleanType() : DataType(Type::BOOL) {} + BooleanType() : FixedWidthType(Type::BOOL) {} Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; @@ -306,6 +327,8 @@ struct ARROW_EXPORT ListType : public DataType, public NoExtraMeta { std::string ToString() const override; static std::string name() { return "list"; } + + std::vector GetBufferLayout() const override; }; // BinaryType type is reprsents lists of 1-byte values. @@ -318,6 +341,8 @@ struct ARROW_EXPORT BinaryType : public DataType, public NoExtraMeta { std::string ToString() const override; static std::string name() { return "binary"; } + std::vector GetBufferLayout() const override; + protected: // Allow subclasses to change the logical type. 
explicit BinaryType(Type::type logical_type) : DataType(logical_type) {} @@ -345,6 +370,8 @@ struct ARROW_EXPORT StructType : public DataType, public NoExtraMeta { Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; static std::string name() { return "struct"; } + + std::vector GetBufferLayout() const override; }; struct ARROW_EXPORT DecimalType : public DataType { @@ -358,6 +385,8 @@ struct ARROW_EXPORT DecimalType : public DataType { Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; static std::string name() { return "decimal"; } + + std::vector GetBufferLayout() const override; }; enum class UnionMode : char { SPARSE, DENSE }; @@ -375,14 +404,20 @@ struct ARROW_EXPORT UnionType : public DataType { static std::string name() { return "union"; } Status Accept(TypeVisitor* visitor) const override; + std::vector GetBufferLayout() const override; + UnionMode mode; std::vector type_ids; }; -struct ARROW_EXPORT DateType : public DataType, public NoExtraMeta { +struct ARROW_EXPORT DateType : public FixedWidthType { static constexpr Type::type type_id = Type::DATE; - DateType() : DataType(Type::DATE) {} + using c_type = int32_t; + + DateType() : FixedWidthType(Type::DATE) {} + + int bit_width() const override { return sizeof(c_type) * 8; } Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override { return name(); } @@ -391,13 +426,17 @@ struct ARROW_EXPORT DateType : public DataType, public NoExtraMeta { enum class TimeUnit : char { SECOND = 0, MILLI = 1, MICRO = 2, NANO = 3 }; -struct ARROW_EXPORT TimeType : public DataType { +struct ARROW_EXPORT TimeType : public FixedWidthType { static constexpr Type::type type_id = Type::TIME; using Unit = TimeUnit; + using c_type = int64_t; TimeUnit unit; - explicit TimeType(TimeUnit unit = TimeUnit::MILLI) : DataType(Type::TIME), unit(unit) {} + int bit_width() const override { return sizeof(c_type) * 8; } + + explicit TimeType(TimeUnit unit = TimeUnit::MILLI) + : FixedWidthType(Type::TIME), unit(unit) {} TimeType(const TimeType& other) : TimeType(other.unit) {} Status Accept(TypeVisitor* visitor) const override; @@ -405,7 +444,7 @@ struct ARROW_EXPORT TimeType : public DataType { static std::string name() { return "time"; } }; -struct ARROW_EXPORT TimestampType : public DataType, public FixedWidthMeta { +struct ARROW_EXPORT TimestampType : public FixedWidthType { using Unit = TimeUnit; typedef int64_t c_type; @@ -416,7 +455,7 @@ struct ARROW_EXPORT TimestampType : public DataType, public FixedWidthMeta { TimeUnit unit; explicit TimestampType(TimeUnit unit = TimeUnit::MILLI) - : DataType(Type::TIMESTAMP), unit(unit) {} + : FixedWidthType(Type::TIMESTAMP), unit(unit) {} TimestampType(const TimestampType& other) : TimestampType(other.unit) {} @@ -425,10 +464,10 @@ struct ARROW_EXPORT TimestampType : public DataType, public FixedWidthMeta { static std::string name() { return "timestamp"; } }; -struct ARROW_EXPORT IntervalType : public DataType, public FixedWidthMeta { +struct ARROW_EXPORT IntervalType : public FixedWidthType { enum class Unit : char { YEAR_MONTH = 0, DAY_TIME = 1 }; - typedef int64_t c_type; + using c_type = int64_t; static constexpr Type::type type_id = Type::INTERVAL; int bit_width() const override { return sizeof(int64_t) * 8; } @@ -436,7 +475,7 @@ struct ARROW_EXPORT IntervalType : public DataType, public FixedWidthMeta { Unit unit; explicit IntervalType(Unit unit = Unit::YEAR_MONTH) - : DataType(Type::INTERVAL), unit(unit) {} + 
: FixedWidthType(Type::INTERVAL), unit(unit) {} IntervalType(const IntervalType& other) : IntervalType(other.unit) {} diff --git a/cpp/src/arrow/types/primitive.cc b/cpp/src/arrow/types/primitive.cc index 14667ee5b6eac..f42a3cac021cd 100644 --- a/cpp/src/arrow/types/primitive.cc +++ b/cpp/src/arrow/types/primitive.cc @@ -49,7 +49,7 @@ bool PrimitiveArray::EqualsExact(const PrimitiveArray& other) const { const uint8_t* this_data = raw_data_; const uint8_t* other_data = other.raw_data_; - auto size_meta = dynamic_cast(type_.get()); + auto size_meta = dynamic_cast(type_.get()); int value_byte_size = size_meta->bit_width() / 8; DCHECK_GT(value_byte_size, 0); diff --git a/cpp/src/arrow/util/bit-util.h b/cpp/src/arrow/util/bit-util.h index 13b7e19593d93..5c8055f9c6171 100644 --- a/cpp/src/arrow/util/bit-util.h +++ b/cpp/src/arrow/util/bit-util.h @@ -78,6 +78,10 @@ static inline bool IsMultipleOf64(int64_t n) { return (n & 63) == 0; } +static inline bool IsMultipleOf8(int64_t n) { + return (n & 7) == 0; +} + inline int64_t RoundUpToMultipleOf64(int64_t num) { // TODO(wesm): is this definitely needed? // DCHECK_GE(num, 0); diff --git a/dev/release/run-rat.sh b/dev/release/run-rat.sh index d8ec6507fc4e5..e26dd589695b1 100755 --- a/dev/release/run-rat.sh +++ b/dev/release/run-rat.sh @@ -28,6 +28,7 @@ $RAT $1 \ -e ".*" \ -e mman.h \ -e "*_generated.h" \ + -e "*.json" \ -e random.h \ -e status.cc \ -e status.h \ @@ -49,5 +50,3 @@ else echo "${UNAPPROVED} unapproved licences. Check rat report: rat.txt" exit 1 fi - - diff --git a/format/IPC.md b/format/IPC.md index 3f78126ef55d3..a55dcdff48117 100644 --- a/format/IPC.md +++ b/format/IPC.md @@ -15,3 +15,109 @@ # Interprocess messaging / communication (IPC) ## File format + +We define a self-contained "file format" containing an Arrow schema along with +one or more record batches defining a dataset. See [format/File.fbs][1] for the +precise details of the file metadata. + +In general, the file looks like: + +``` + + + +... + + +... + + + + +``` + +See the File.fbs document for details about the Flatbuffers metadata. The +record batches have a particular structure, defined next. + +### Record batches + +The record batch metadata is written as a flatbuffer (see +[format/Message.fbs][2] -- the RecordBatch message type) prefixed by its size, +followed by each of the memory buffers in the batch written end to end (with +appropriate alignment and padding): + +``` + + + + +``` + +The `RecordBatch` metadata contains a depth-first (pre-order) flattened set of +field metadata and physical memory buffers (some comments from [Message.fbs][2] +have been shortened / removed): + +``` +table RecordBatch { + length: int; + nodes: [FieldNode]; + buffers: [Buffer]; +} + +struct FieldNode { + /// The number of value slots in the Arrow array at this level of a nested + /// tree + length: int; + + /// The number of observed nulls. Fields with null_count == 0 may choose not + /// to write their physical validity bitmap out as a materialized buffer, + /// instead setting the length of the bitmap buffer to 0. + null_count: int; +} + +struct Buffer { + /// The shared memory page id where this buffer is located. Currently this is + /// not used + page: int; + + /// The relative offset into the shared memory page where the bytes for this + /// buffer starts + offset: long; + + /// The absolute length (in bytes) of the memory buffer. The memory is found + /// from offset (inclusive) to offset + length (non-inclusive). 
+ length: long; +} +``` + +In the context of a file, the `page` is not used, and the `Buffer` offsets use +as a frame of reference the start of the segment where they are written in the +file. So, while in a general IPC setting these offsets may be anyplace in one +or more shared memory regions, in the file format the offsets start from 0. + +The location of a record batch and the size of the metadata block as well as +the body of buffers is stored in the file footer: + +``` +struct Block { + offset: long; + metaDataLength: int; + bodyLength: long; +} +``` + +Some notes about this + +* The `Block` offset indicates the starting byte of the record batch. +* The metadata length includes the flatbuffer size, the record batch metadata + flatbuffer, and any padding bytes + + +### Dictionary batches + +Dictionary batches have not yet been implemented, while they are provided for +in the metadata. For the time being, the `DICTIONARY` segments shown above in +the file do not appear in any of the file implementations. + +[1]: https://github.com/apache/arrow/blob/master/format/File.fbs +[1]: https://github.com/apache/arrow/blob/master/format/Message.fbs \ No newline at end of file diff --git a/format/Message.fbs b/format/Message.fbs index 2ec9fd1817bd5..d07d0666dce87 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -18,7 +18,8 @@ namespace org.apache.arrow.flatbuf; enum MetadataVersion:short { - V1_SNAPSHOT + V1, + V2 } /// ---------------------------------------------------------------------- diff --git a/integration/data/simple.json b/integration/data/simple.json new file mode 100644 index 0000000000000..a91b405d4f0f0 --- /dev/null +++ b/integration/data/simple.json @@ -0,0 +1,66 @@ +{ + "schema": { + "fields": [ + { + "name": "foo", + "type": {"name": "int", "isSigned": true, "bitWidth": 32}, + "nullable": true, "children": [], + "typeLayout": { + "vectors": [ + {"type": "VALIDITY", "typeBitWidth": 1}, + {"type": "DATA", "typeBitWidth": 32} + ] + } + }, + { + "name": "bar", + "type": {"name": "floatingpoint", "precision": "DOUBLE"}, + "nullable": true, "children": [], + "typeLayout": { + "vectors": [ + {"type": "VALIDITY", "typeBitWidth": 1}, + {"type": "DATA", "typeBitWidth": 64} + ] + } + }, + { + "name": "baz", + "type": {"name": "utf8"}, + "nullable": true, "children": [], + "typeLayout": { + "vectors": [ + {"type": "VALIDITY", "typeBitWidth": 1}, + {"type": "OFFSET", "typeBitWidth": 32}, + {"type": "DATA", "typeBitWidth": 64} + ] + } + } + ] + }, + "batches": [ + { + "count": 5, + "columns": [ + { + "name": "foo", + "count": 5, + "VALIDITY": [1, 0, 1, 1, 1], + "DATA": [1, 2, 3, 4, 5] + }, + { + "name": "bar", + "count": 5, + "VALIDITY": [1, 0, 0, 1, 1], + "DATA": [1.0, 2.0, 3.0, 4.0, 5.0] + }, + { + "name": "baz", + "count": 5, + "VALIDITY": [1, 0, 0, 1, 1], + "OFFSET": [0, 2, 2, 2, 5, 9], + "DATA": ["aa", "", "", "bbb", "cccc"] + } + ] + } + ] +} diff --git a/integration/integration_test.py b/integration/integration_test.py new file mode 100644 index 0000000000000..6ea634d779566 --- /dev/null +++ b/integration/integration_test.py @@ -0,0 +1,177 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +import argparse +import glob +import itertools +import os +import six +import subprocess +import tempfile +import uuid + + +ARROW_HOME = os.path.abspath(__file__).rsplit("/", 2)[0] + + +def guid(): + return uuid.uuid4().hex + + +def run_cmd(cmd): + if isinstance(cmd, six.string_types): + cmd = cmd.split(' ') + + try: + output = subprocess.check_output(cmd, stderr=subprocess.STDOUT) + except subprocess.CalledProcessError as e: + # this avoids hiding the stdout / stderr of failed processes + print('Command failed: %s' % ' '.join(cmd)) + print('With output:') + print('--------------') + print(e.output) + print('--------------') + raise e + + if isinstance(output, six.binary_type): + output = output.decode('utf-8') + return output + + +class IntegrationRunner(object): + + def __init__(self, json_files, testers, debug=False): + self.json_files = json_files + self.testers = testers + self.temp_dir = tempfile.mkdtemp() + self.debug = debug + + def run(self): + for producer, consumer in itertools.product(self.testers, + self.testers): + if producer is consumer: + continue + + print('-- {0} producing, {1} consuming'.format(producer.name, + consumer.name)) + + for json_path in self.json_files: + print('Testing with {0}'.format(json_path)) + + arrow_path = os.path.join(self.temp_dir, guid()) + + producer.json_to_arrow(json_path, arrow_path) + consumer.validate(json_path, arrow_path) + + +class Tester(object): + + def __init__(self, debug=False): + self.debug = debug + + def json_to_arrow(self, json_path, arrow_path): + raise NotImplementedError + + def validate(self, json_path, arrow_path): + raise NotImplementedError + + +class JavaTester(Tester): + + ARROW_TOOLS_JAR = os.path.join(ARROW_HOME, + 'java/tools/target/arrow-tools-0.1.1-' + 'SNAPSHOT-jar-with-dependencies.jar') + + name = 'Java' + + def _run(self, arrow_path=None, json_path=None, command='VALIDATE'): + cmd = ['java', '-cp', self.ARROW_TOOLS_JAR, + 'org.apache.arrow.tools.Integration'] + + if arrow_path is not None: + cmd.extend(['-a', arrow_path]) + + if json_path is not None: + cmd.extend(['-j', json_path]) + + cmd.extend(['-c', command]) + + if self.debug: + print(' '.join(cmd)) + + return run_cmd(cmd) + + def validate(self, json_path, arrow_path): + return self._run(arrow_path, json_path, 'VALIDATE') + + def json_to_arrow(self, json_path, arrow_path): + return self._run(arrow_path, json_path, 'JSON_TO_ARROW') + + +class CPPTester(Tester): + + CPP_INTEGRATION_EXE = os.environ.get( + 'ARROW_CPP_TESTER', + os.path.join(ARROW_HOME, + 'cpp/test-build/debug/json-integration-test')) + + name = 'C++' + + def _run(self, arrow_path=None, json_path=None, command='VALIDATE'): + cmd = [self.CPP_INTEGRATION_EXE, '--integration'] + + if arrow_path is not None: + cmd.append('--arrow=' + arrow_path) + + if json_path is not None: + cmd.append('--json=' + json_path) + + cmd.append('--mode=' + command) + + if self.debug: + print(' '.join(cmd)) + + return run_cmd(cmd) + + def validate(self, json_path, arrow_path): + return self._run(arrow_path, json_path, 'VALIDATE') + + def json_to_arrow(self, json_path, arrow_path): + return 
self._run(arrow_path, json_path, 'JSON_TO_ARROW') + + +def get_json_files(): + glob_pattern = os.path.join(ARROW_HOME, 'integration', 'data', '*.json') + return glob.glob(glob_pattern) + + +def run_all_tests(debug=False): + testers = [JavaTester(debug=debug), CPPTester(debug=debug)] + json_files = get_json_files() + + runner = IntegrationRunner(json_files, testers, debug=debug) + runner.run() + + +if __name__ == '__main__': + parser = argparse.ArgumentParser(description='Arrow integration test CLI') + parser.add_argument('--debug', dest='debug', action='store_true', + default=False, + help='Run executables in debug mode as relevant') + + args = parser.parse_args() + run_all_tests(debug=args.debug) diff --git a/java/pom.xml b/java/pom.xml index 7221a140d96ec..a147d66c98318 100644 --- a/java/pom.xml +++ b/java/pom.xml @@ -24,7 +24,7 @@ pom Apache Arrow Java Root POM - Apache arrow is an open source, low latency SQL query engine for Hadoop and NoSQL. + Apache Arrow is open source, in-memory columnar data structures and low-overhead messaging http://arrow.apache.org/ @@ -442,8 +442,8 @@ test - + org.mockito mockito-core 1.9.5 diff --git a/java/tools/pom.xml b/java/tools/pom.xml index 84b0b5eb4253c..ef96328f7668a 100644 --- a/java/tools/pom.xml +++ b/java/tools/pom.xml @@ -45,6 +45,12 @@ commons-cli 1.2 + + ch.qos.logback + logback-classic + 1.0.13 + run + diff --git a/java/tools/src/main/java/org/apache/arrow/tools/Integration.java b/java/tools/src/main/java/org/apache/arrow/tools/Integration.java index 29f0ee29e3ca8..fa4bedca7a9bd 100644 --- a/java/tools/src/main/java/org/apache/arrow/tools/Integration.java +++ b/java/tools/src/main/java/org/apache/arrow/tools/Integration.java @@ -220,6 +220,7 @@ private Command toCommand(String commandName) { private static void fatalError(String message, Throwable e) { System.err.println(message); + System.err.println(e.getMessage()); LOGGER.error(message, e); System.exit(1); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java b/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java index 4afd82315d9c3..c5d642ee0cc72 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java @@ -81,7 +81,9 @@ private void loadBuffers(FieldVector vector, Field field, Iterator buf try { vector.loadFieldBuffers(fieldNode, ownBuffers); } catch (RuntimeException e) { - throw new IllegalArgumentException("Could not load buffers for field " + field, e); + e.printStackTrace(); + throw new IllegalArgumentException("Could not load buffers for field " + + field + " error message" + e.getMessage(), e); } List children = field.getChildren(); if (children.size() > 0) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java index bbcd3e9f470e3..cd520da54f2f5 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java @@ -123,7 +123,11 @@ public ArrowRecordBatch readRecordBatch(ArrowBlock recordBatchBlock) throws IOEx if (n != l) { throw new IllegalStateException(n + " != " + l); } - RecordBatch recordBatchFB = RecordBatch.getRootAsRecordBatch(buffer.nioBuffer().asReadOnlyBuffer()); + + // Record batch flatbuffer is prefixed by its size as int32le + final ArrowBuf metadata = buffer.slice(4, recordBatchBlock.getMetadataLength() - 4); + RecordBatch 
recordBatchFB = RecordBatch.getRootAsRecordBatch(metadata.nioBuffer().asReadOnlyBuffer()); + int nodesLength = recordBatchFB.nodesLength(); final ArrowBuf body = buffer.slice(recordBatchBlock.getMetadataLength(), (int)recordBatchBlock.getBodyLength()); List nodes = new ArrayList<>(); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java index 9881a229c23ea..1cd87ebc33594 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java @@ -99,9 +99,10 @@ private long writeIntLittleEndian(int v) throws IOException { public void writeRecordBatch(ArrowRecordBatch recordBatch) throws IOException { checkStarted(); align(); - // write metadata header + + // write metadata header with int32 size prefix long offset = currentPosition; - write(recordBatch); + write(recordBatch, true); align(); // write body long bodyOffset = currentPosition; @@ -117,6 +118,7 @@ public void writeRecordBatch(ArrowRecordBatch recordBatch) throws IOException { if (startPosition != currentPosition) { writeZeros((int)(startPosition - currentPosition)); } + write(buffer); if (currentPosition != startPosition + layout.getSize()) { throw new IllegalStateException("wrong buffer size: " + currentPosition + " != " + startPosition + layout.getSize()); @@ -133,7 +135,9 @@ public void writeRecordBatch(ArrowRecordBatch recordBatch) throws IOException { } private void write(ArrowBuf buffer) throws IOException { - write(buffer.nioBuffer(buffer.readerIndex(), buffer.readableBytes())); + ByteBuffer nioBuffer = buffer.nioBuffer(buffer.readerIndex(), buffer.readableBytes()); + LOGGER.debug("Writing buffer with size: " + nioBuffer.remaining()); + write(nioBuffer); } private void checkStarted() throws IOException { @@ -166,14 +170,21 @@ private void writeMagic() throws IOException { private void writeFooter() throws IOException { // TODO: dictionaries - write(new ArrowFooter(schema, Collections.emptyList(), recordBatches)); + write(new ArrowFooter(schema, Collections.emptyList(), recordBatches), false); } - private long write(FBSerializable writer) throws IOException { + private long write(FBSerializable writer, boolean withSizePrefix) throws IOException { FlatBufferBuilder builder = new FlatBufferBuilder(); int root = writer.writeTo(builder); builder.finish(root); - return write(builder.dataBuffer()); + + ByteBuffer buffer = builder.dataBuffer(); + + if (withSizePrefix) { + writeIntLittleEndian(buffer.remaining()); + } + + return write(buffer); } } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java index f07b517250732..f2059820d23d6 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java @@ -127,8 +127,13 @@ private void readVector(Field field, FieldVector vector) throws JsonParseExcepti ValueVector valueVector = (ValueVector)innerVector; valueVector.allocateNew(); Mutator mutator = valueVector.getMutator(); - mutator.setValueCount(count); - for (int i = 0; i < count; i++) { + + int innerVectorCount = count; + if (vectorType.getName() == "OFFSET") { + innerVectorCount++; + } + mutator.setValueCount(innerVectorCount); + for (int i = 0; i < innerVectorCount; i++) { parser.nextToken(); 
setValueFromParser(valueVector, i); }
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java index 812b3da32f83c..6ff357774486d 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java @@ -114,7 +114,7 @@ private void writeVector(Field field, FieldVector vector) throws IOException { BufferBacked innerVector = fieldInnerVectors.get(v); generator.writeArrayFieldStart(vectorType.getName()); ValueVector valueVector = (ValueVector)innerVector; - for (int i = 0; i < valueCount; i++) { + for (int i = 0; i < valueVector.getAccessor().getValueCount(); i++) { writeValueToGenerator(valueVector, i); } generator.writeEndArray();
diff --git a/python/.gitignore b/python/.gitignore index 07f28355a252f..c37efc4b56650 100644 --- a/python/.gitignore +++ b/python/.gitignore @@ -12,16 +12,6 @@ Testing/ # Editor temporary/working/backup files *flymake* -# Compiled source -*.a -*.dll -*.o -*.py[ocd] -*.so -*.dylib -.build_cache_dir -MANIFEST - # Generated sources *.c *.cpp
From 65b74b350209ee3f930a00a0270e1d7c3d485c93 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Tue, 29 Nov 2016 22:23:19 -0500 Subject: [PATCH 0207/1644] ARROW-393: [JAVA] JSON file reader fails to set the buffer size on String data vector Fixed by calling setValueCount after setting the values instead of before. Since we set the inner vectors of NullableVarCharVector directly, we don't have to worry about its lastSet field and the way null values are handled. Author: Julien Le Dem Closes #218 from julienledem/json_read_varchar and squashes the following commits: e147906 [Julien Le Dem] ARROW-393: [JAVA] JSON file reader fails to set the buffer size on String data vector --- .../org/apache/arrow/tools/Integration.java | 2 +- .../apache/arrow/tools/TestIntegration.java | 54 ++++++++++++++++++- .../vector/file/json/JsonFileReader.java | 8 ++- .../arrow/vector/schema/ArrowVectorType.java | 15 ++++++ 4 files changed, 72 insertions(+), 7 deletions(-)
diff --git a/java/tools/src/main/java/org/apache/arrow/tools/Integration.java b/java/tools/src/main/java/org/apache/arrow/tools/Integration.java index fa4bedca7a9bd..85af30da1e8ae 100644 --- a/java/tools/src/main/java/org/apache/arrow/tools/Integration.java +++ b/java/tools/src/main/java/org/apache/arrow/tools/Integration.java @@ -80,7 +80,7 @@ public void execute(File arrowFile, File jsonFile) throws IOException { Schema schema = footer.getSchema(); LOGGER.debug("Input file size: " + arrowFile.length()); LOGGER.debug("Found schema: " + schema); - try (JsonFileWriter writer = new JsonFileWriter(jsonFile);) { + try (JsonFileWriter writer = new JsonFileWriter(jsonFile, JsonFileWriter.config().pretty(true));) { writer.start(schema); List recordBatches = footer.getRecordBatches(); for (ArrowBlock rbBlock : recordBatches) {
diff --git a/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java b/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java index bb69ed1498e26..464144b95a1aa 100644 --- a/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java +++ b/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java @@ -24,9 +24,12 @@ import static org.apache.arrow.tools.ArrowFileTestFixtures.writeInput; import static org.junit.Assert.fail; +import java.io.BufferedReader; import java.io.File; import
java.io.FileNotFoundException; import java.io.IOException; +import java.io.StringReader; +import java.util.Map; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; @@ -44,6 +47,11 @@ import org.junit.Test; import org.junit.rules.TemporaryFolder; +import com.fasterxml.jackson.core.util.DefaultPrettyPrinter; +import com.fasterxml.jackson.core.util.DefaultPrettyPrinter.NopIndenter; +import com.fasterxml.jackson.databind.ObjectMapper; +import com.fasterxml.jackson.databind.SerializationFeature; + public class TestIntegration { @Rule @@ -69,7 +77,7 @@ public void testValid() throws Exception { File testOutFile = testFolder.newFile("testOut.arrow"); testOutFile.delete(); - // generate an arow file + // generate an arrow file writeInput(testInFile, allocator); Integration integration = new Integration(); @@ -90,6 +98,50 @@ public void testValid() throws Exception { integration.run(args3); } + @Test + public void testJSONRoundTripWithVariableWidth() throws Exception { + File testJSONFile = new File("../../integration/data/simple.json"); + File testOutFile = testFolder.newFile("testOut.arrow"); + File testRoundTripJSONFile = testFolder.newFile("testOut.json"); + testOutFile.delete(); + testRoundTripJSONFile.delete(); + + Integration integration = new Integration(); + + // convert to arrow + String[] args1 = { "-arrow", testOutFile.getAbsolutePath(), "-json", testJSONFile.getAbsolutePath(), "-command", Command.JSON_TO_ARROW.name()}; + integration.run(args1); + + // convert back to json + String[] args2 = { "-arrow", testOutFile.getAbsolutePath(), "-json", testRoundTripJSONFile.getAbsolutePath(), "-command", Command.ARROW_TO_JSON.name()}; + integration.run(args2); + + BufferedReader orig = readNormalized(testJSONFile); + BufferedReader rt = readNormalized(testRoundTripJSONFile); + String i, o; + int j = 0; + while ((i = orig.readLine()) != null && (o = rt.readLine()) != null) { + Assert.assertEquals("line: " + j, i, o); + ++j; + } + } + + private ObjectMapper om = new ObjectMapper(); + { + DefaultPrettyPrinter prettyPrinter = new DefaultPrettyPrinter(); + prettyPrinter.indentArraysWith(NopIndenter.instance); + om.setDefaultPrettyPrinter(prettyPrinter); + om.enable(SerializationFeature.INDENT_OUTPUT); + om.enable(SerializationFeature.ORDER_MAP_ENTRIES_BY_KEYS); + } + + private BufferedReader readNormalized(File f) throws IOException { + Map tree = om.readValue(f, Map.class); + String normalized = om.writeValueAsString(tree); + return new BufferedReader(new StringReader(normalized)); + } + + @Test public void testInvalid() throws Exception { File testValidInFile = testFolder.newFile("testValidIn.arrow"); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java index f2059820d23d6..26dd3f6dfe5ae 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java @@ -22,6 +22,7 @@ import static com.fasterxml.jackson.core.JsonToken.START_ARRAY; import static com.fasterxml.jackson.core.JsonToken.START_OBJECT; import static java.nio.charset.StandardCharsets.UTF_8; +import static org.apache.arrow.vector.schema.ArrowVectorType.OFFSET; import java.io.File; import java.io.IOException; @@ -128,15 +129,12 @@ private void readVector(Field field, FieldVector vector) throws JsonParseExcepti valueVector.allocateNew(); Mutator mutator = 
valueVector.getMutator(); - int innerVectorCount = count; - if (vectorType.getName() == "OFFSET") { - innerVectorCount++; - } - mutator.setValueCount(innerVectorCount); + int innerVectorCount = vectorType.equals(OFFSET) ? count + 1 : count; for (int i = 0; i < innerVectorCount; i++) { parser.nextToken(); setValueFromParser(valueVector, i); } + mutator.setValueCount(innerVectorCount); readToken(END_ARRAY); } // if children diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowVectorType.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowVectorType.java index 8fe8e484496cd..68da7052f2b8b 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowVectorType.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowVectorType.java @@ -81,4 +81,19 @@ public String getName() { public String toString() { return getName(); } + + @Override + public int hashCode() { + return type; + } + + @Override + public boolean equals(Object obj) { + if (obj instanceof ArrowVectorType) { + ArrowVectorType other = (ArrowVectorType) obj; + return type == other.type; + } + return false; + } + } From 859018b3c79bfc0cb2259bdfc3d5930a9a936432 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 30 Nov 2016 13:25:42 -0500 Subject: [PATCH 0208/1644] ARROW-392: [C++/Java] String IPC integration testing / fixes. Add array / record batch pretty-printing Was blocked by ARROW-393 Author: Wes McKinney Closes #217 from wesm/ARROW-392 and squashes the following commits: 1efeaed [Wes McKinney] Remove debug printing from Java 57e4926 [Wes McKinney] cpplint 56a1c41 [Wes McKinney] We are only padding to 8 byte boundaries e33bed3 [Wes McKinney] clang-format, add all-OK message to integration_test.py 8e8d6d3 [Wes McKinney] Implement simple C++ pretty printer for record batches. 
Debugging efforts --- cpp/CMakeLists.txt | 1 + cpp/src/arrow/CMakeLists.txt | 2 + cpp/src/arrow/ipc/ipc-json-test.cc | 20 --- cpp/src/arrow/ipc/json-integration-test.cc | 7 + cpp/src/arrow/ipc/json-internal.cc | 2 +- cpp/src/arrow/pretty_print-test.cc | 87 ++++++++++ cpp/src/arrow/pretty_print.cc | 192 +++++++++++++++++++++ cpp/src/arrow/pretty_print.h | 35 ++++ cpp/src/arrow/test-util.h | 22 +++ cpp/src/arrow/type_traits.h | 6 + format/IPC.md | 4 +- integration/data/simple.json | 2 +- integration/integration_test.py | 2 +- 13 files changed, 357 insertions(+), 25 deletions(-) create mode 100644 cpp/src/arrow/pretty_print-test.cc create mode 100644 cpp/src/arrow/pretty_print.cc create mode 100644 cpp/src/arrow/pretty_print.h diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 1a970081234fa..798d75fe55643 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -745,6 +745,7 @@ set(ARROW_SRCS src/arrow/array.cc src/arrow/builder.cc src/arrow/column.cc + src/arrow/pretty_print.cc src/arrow/schema.cc src/arrow/table.cc src/arrow/type.cc diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index 81851bc5b3eb1..6c0dea20ba7b5 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -21,6 +21,7 @@ install(FILES array.h column.h builder.h + pretty_print.h schema.h table.h type.h @@ -37,6 +38,7 @@ set(ARROW_TEST_LINK_LIBS ${ARROW_MIN_TEST_LIBS}) ADD_ARROW_TEST(array-test) ADD_ARROW_TEST(column-test) +ADD_ARROW_TEST(pretty_print-test) ADD_ARROW_TEST(schema-test) ADD_ARROW_TEST(table-test) diff --git a/cpp/src/arrow/ipc/ipc-json-test.cc b/cpp/src/arrow/ipc/ipc-json-test.cc index e5c3a081fca53..ba4d9ca982850 100644 --- a/cpp/src/arrow/ipc/ipc-json-test.cc +++ b/cpp/src/arrow/ipc/ipc-json-test.cc @@ -96,26 +96,6 @@ void CheckPrimitive(const std::shared_ptr& type, TestArrayRoundTrip(*array.get()); } -template -void MakeArray(const std::shared_ptr& type, const std::vector& is_valid, - const std::vector& values, std::shared_ptr* out) { - std::shared_ptr values_buffer; - std::shared_ptr values_bitmap; - - ASSERT_OK(test::CopyBufferFromVector(values, &values_buffer)); - ASSERT_OK(test::GetBitmapFromBoolVector(is_valid, &values_bitmap)); - - using ArrayType = typename TypeTraits::ArrayType; - - int32_t null_count = 0; - for (bool val : is_valid) { - if (!val) { ++null_count; } - } - - *out = std::make_shared(type, static_cast(values.size()), - values_buffer, null_count, values_bitmap); -} - TEST(TestJsonSchemaWriter, FlatTypes) { std::vector> fields = {field("f0", int8()), field("f1", int16(), false), field("f2", int32()), field("f3", int64(), false), diff --git a/cpp/src/arrow/ipc/json-integration-test.cc b/cpp/src/arrow/ipc/json-integration-test.cc index 7a313f791e6c8..c4e68472a19d4 100644 --- a/cpp/src/arrow/ipc/json-integration-test.cc +++ b/cpp/src/arrow/ipc/json-integration-test.cc @@ -31,6 +31,7 @@ #include "arrow/io/file.h" #include "arrow/ipc/file.h" #include "arrow/ipc/json.h" +#include "arrow/pretty_print.h" #include "arrow/schema.h" #include "arrow/table.h" #include "arrow/test-util.h" @@ -171,6 +172,12 @@ static Status ValidateArrowVsJson( if (!json_batch->Equals(*arrow_batch.get())) { std::stringstream ss; ss << "Record batch " << i << " did not match"; + + ss << "\nJSON: \n "; + RETURN_NOT_OK(PrettyPrint(*json_batch.get(), &ss)); + + ss << "\nArrow: \n "; + RETURN_NOT_OK(PrettyPrint(*arrow_batch.get(), &ss)); return Status::Invalid(ss.str()); } } diff --git a/cpp/src/arrow/ipc/json-internal.cc b/cpp/src/arrow/ipc/json-internal.cc index 
e56bcb32b9488..50f5b0cb1bd1e 100644 --- a/cpp/src/arrow/ipc/json-internal.cc +++ b/cpp/src/arrow/ipc/json-internal.cc @@ -343,7 +343,7 @@ class JsonSchemaWriter : public TypeVisitor { class JsonArrayWriter : public ArrayVisitor { public: - explicit JsonArrayWriter(const std::string& name, const Array& array, RjWriter* writer) + JsonArrayWriter(const std::string& name, const Array& array, RjWriter* writer) : name_(name), array_(array), writer_(writer) {} Status Write() { return VisitArray(name_, array_); } diff --git a/cpp/src/arrow/pretty_print-test.cc b/cpp/src/arrow/pretty_print-test.cc new file mode 100644 index 0000000000000..10af41d16af13 --- /dev/null +++ b/cpp/src/arrow/pretty_print-test.cc @@ -0,0 +1,87 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include +#include +#include +#include +#include +#include +#include + +#include "gtest/gtest.h" + +#include "arrow/array.h" +#include "arrow/pretty_print.h" +#include "arrow/test-util.h" +#include "arrow/type.h" +#include "arrow/type_traits.h" +#include "arrow/types/list.h" +#include "arrow/types/primitive.h" +#include "arrow/types/string.h" +#include "arrow/types/struct.h" + +namespace arrow { + +class TestArrayPrinter : public ::testing::Test { + public: + void SetUp() {} + + void Print(const Array& array) {} + + private: + std::ostringstream sink_; +}; + +template +void CheckPrimitive(const std::vector& is_valid, const std::vector& values, + const char* expected) { + std::ostringstream sink; + + MemoryPool* pool = default_memory_pool(); + typename TypeTraits::BuilderType builder(pool, std::make_shared()); + + for (size_t i = 0; i < values.size(); ++i) { + if (is_valid[i]) { + ASSERT_OK(builder.Append(values[i])); + } else { + ASSERT_OK(builder.AppendNull()); + } + } + + std::shared_ptr array; + ASSERT_OK(builder.Finish(&array)); + + ASSERT_OK(PrettyPrint(*array.get(), &sink)); + + std::string result = sink.str(); + ASSERT_EQ(std::string(expected, strlen(expected)), result); +} + +TEST_F(TestArrayPrinter, PrimitiveType) { + std::vector is_valid = {true, true, false, true, false}; + + std::vector values = {0, 1, 2, 3, 4}; + static const char* expected = R"expected([0, 1, null, 3, null])expected"; + CheckPrimitive(is_valid, values, expected); + + std::vector values2 = {"foo", "bar", "", "baz", ""}; + static const char* ex2 = R"expected(["foo", "bar", null, "baz", null])expected"; + CheckPrimitive(is_valid, values2, ex2); +} + +} // namespace arrow diff --git a/cpp/src/arrow/pretty_print.cc b/cpp/src/arrow/pretty_print.cc new file mode 100644 index 0000000000000..c0b4b08274ac1 --- /dev/null +++ b/cpp/src/arrow/pretty_print.cc @@ -0,0 +1,192 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. 
See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include +#include + +#include "arrow/array.h" +#include "arrow/pretty_print.h" +#include "arrow/table.h" +#include "arrow/type.h" +#include "arrow/type_traits.h" +#include "arrow/types/list.h" +#include "arrow/types/string.h" +#include "arrow/types/struct.h" +#include "arrow/util/status.h" + +namespace arrow { + +class ArrayPrinter : public ArrayVisitor { + public: + ArrayPrinter(const Array& array, std::ostream* sink) : array_(array), sink_(sink) {} + + Status Print() { return VisitArray(array_); } + + Status VisitArray(const Array& array) { return array.Accept(this); } + + template + typename std::enable_if::value, void>::type WriteDataValues( + const T& array) { + const auto data = array.raw_data(); + for (int i = 0; i < array.length(); ++i) { + if (i > 0) { (*sink_) << ", "; } + if (array.IsNull(i)) { + (*sink_) << "null"; + } else { + (*sink_) << data[i]; + } + } + } + + // String (Utf8), Binary + template + typename std::enable_if::value, void>::type + WriteDataValues(const T& array) { + int32_t length; + for (int i = 0; i < array.length(); ++i) { + if (i > 0) { (*sink_) << ", "; } + if (array.IsNull(i)) { + (*sink_) << "null"; + } else { + const char* buf = reinterpret_cast(array.GetValue(i, &length)); + (*sink_) << "\"" << std::string(buf, length) << "\""; + } + } + } + + template + typename std::enable_if::value, void>::type + WriteDataValues(const T& array) { + for (int i = 0; i < array.length(); ++i) { + if (i > 0) { (*sink_) << ", "; } + if (array.IsNull(i)) { + (*sink_) << "null"; + } else { + (*sink_) << (array.Value(i) ? 
"true" : "false"); + } + } + } + + void OpenArray() { (*sink_) << "["; } + + void CloseArray() { (*sink_) << "]"; } + + template + Status WritePrimitive(const T& array) { + OpenArray(); + WriteDataValues(array); + CloseArray(); + return Status::OK(); + } + + template + Status WriteVarBytes(const T& array) { + OpenArray(); + WriteDataValues(array); + CloseArray(); + return Status::OK(); + } + + Status Visit(const NullArray& array) override { return Status::OK(); } + + Status Visit(const BooleanArray& array) override { return WritePrimitive(array); } + + Status Visit(const Int8Array& array) override { return WritePrimitive(array); } + + Status Visit(const Int16Array& array) override { return WritePrimitive(array); } + + Status Visit(const Int32Array& array) override { return WritePrimitive(array); } + + Status Visit(const Int64Array& array) override { return WritePrimitive(array); } + + Status Visit(const UInt8Array& array) override { return WritePrimitive(array); } + + Status Visit(const UInt16Array& array) override { return WritePrimitive(array); } + + Status Visit(const UInt32Array& array) override { return WritePrimitive(array); } + + Status Visit(const UInt64Array& array) override { return WritePrimitive(array); } + + Status Visit(const HalfFloatArray& array) override { return WritePrimitive(array); } + + Status Visit(const FloatArray& array) override { return WritePrimitive(array); } + + Status Visit(const DoubleArray& array) override { return WritePrimitive(array); } + + Status Visit(const StringArray& array) override { return WriteVarBytes(array); } + + Status Visit(const BinaryArray& array) override { return WriteVarBytes(array); } + + Status Visit(const DateArray& array) override { return Status::NotImplemented("date"); } + + Status Visit(const TimeArray& array) override { return Status::NotImplemented("time"); } + + Status Visit(const TimestampArray& array) override { + return Status::NotImplemented("timestamp"); + } + + Status Visit(const IntervalArray& array) override { + return Status::NotImplemented("interval"); + } + + Status Visit(const DecimalArray& array) override { + return Status::NotImplemented("decimal"); + } + + Status Visit(const ListArray& array) override { + // auto type = static_cast(array.type().get()); + // for (size_t i = 0; i < fields.size(); ++i) { + // RETURN_NOT_OK(VisitArray(fields[i]->name, *arrays[i].get())); + // } + // return WriteChildren(type->children(), {array.values()}); + return Status::OK(); + } + + Status Visit(const StructArray& array) override { + // auto type = static_cast(array.type().get()); + // for (size_t i = 0; i < fields.size(); ++i) { + // RETURN_NOT_OK(VisitArray(fields[i]->name, *arrays[i].get())); + // } + // return WriteChildren(type->children(), array.fields()); + return Status::OK(); + } + + Status Visit(const UnionArray& array) override { + return Status::NotImplemented("union"); + } + + private: + const Array& array_; + std::ostream* sink_; +}; + +Status PrettyPrint(const Array& arr, std::ostream* sink) { + ArrayPrinter printer(arr, sink); + return printer.Print(); +} + +Status PrettyPrint(const RecordBatch& batch, std::ostream* sink) { + for (int i = 0; i < batch.num_columns(); ++i) { + const std::string& name = batch.column_name(i); + (*sink) << name << ": "; + RETURN_NOT_OK(PrettyPrint(*batch.column(i).get(), sink)); + (*sink) << "\n"; + } + return Status::OK(); +} + +} // namespace arrow diff --git a/cpp/src/arrow/pretty_print.h b/cpp/src/arrow/pretty_print.h new file mode 100644 index 0000000000000..dcb236d726949 --- 
/dev/null +++ b/cpp/src/arrow/pretty_print.h @@ -0,0 +1,35 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_PRETTY_PRINT_H +#define ARROW_PRETTY_PRINT_H + +#include + +#include "arrow/type_fwd.h" +#include "arrow/util/visibility.h" + +namespace arrow { + +class Status; + +Status ARROW_EXPORT PrettyPrint(const RecordBatch& batch, std::ostream* sink); +Status ARROW_EXPORT PrettyPrint(const Array& arr, std::ostream* sink); + +} // namespace arrow + +#endif // ARROW_PRETTY_PRINT_H diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index 63c2166a5736b..b86a1809cd0e9 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -32,6 +32,7 @@ #include "arrow/schema.h" #include "arrow/table.h" #include "arrow/type.h" +#include "arrow/type_traits.h" #include "arrow/util/bit-util.h" #include "arrow/util/buffer.h" #include "arrow/util/logging.h" @@ -250,6 +251,27 @@ Status MakeRandomBytePoolBuffer(int32_t length, MemoryPool* pool, } } // namespace test + +template +void MakeArray(const std::shared_ptr& type, const std::vector& is_valid, + const std::vector& values, std::shared_ptr* out) { + std::shared_ptr values_buffer; + std::shared_ptr values_bitmap; + + ASSERT_OK(test::CopyBufferFromVector(values, &values_buffer)); + ASSERT_OK(test::GetBitmapFromBoolVector(is_valid, &values_bitmap)); + + using ArrayType = typename TypeTraits::ArrayType; + + int32_t null_count = 0; + for (bool val : is_valid) { + if (!val) { ++null_count; } + } + + *out = std::make_shared(type, static_cast(values.size()), + values_buffer, null_count, values_bitmap); +} + } // namespace arrow #endif // ARROW_TEST_UTIL_H_ diff --git a/cpp/src/arrow/type_traits.h b/cpp/src/arrow/type_traits.h index bbb807488e3d0..c21c5002035f8 100644 --- a/cpp/src/arrow/type_traits.h +++ b/cpp/src/arrow/type_traits.h @@ -192,6 +192,12 @@ struct IsFloatingPoint { static constexpr bool value = std::is_floating_point::value; }; +template +struct IsNumeric { + PRIMITIVE_TRAITS(T); + static constexpr bool value = std::is_arithmetic::value; +}; + } // namespace arrow #endif // ARROW_TYPE_TRAITS_H diff --git a/format/IPC.md b/format/IPC.md index a55dcdff48117..d386e6048cf12 100644 --- a/format/IPC.md +++ b/format/IPC.md @@ -24,7 +24,7 @@ In general, the file looks like: ``` - + ... 
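The two IPC.md hunks around this point update the file-layout diagram; together with the ArrowWriter/ArrowReader changes earlier in this series, record batch metadata is now framed by its size as a little-endian int32. As a minimal illustrative sketch of consuming that framing with plain java.nio (the class and method names below are hypothetical, not Arrow API):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper: extract the flatbuffer payload from a metadata block
// whose first four bytes hold the payload size as a little-endian int32.
final class SizePrefix {
  static ByteBuffer payload(ByteBuffer block) {
    ByteBuffer b = block.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    int size = b.getInt();           // consume the 4-byte length prefix
    ByteBuffer payload = b.slice();  // remaining bytes start at the flatbuffer root
    payload.limit(size);             // restrict to the declared metadata size
    return payload;
  }
}
```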
@@ -49,7 +49,7 @@ appropriate alignment and padding): ``` - + ``` diff --git a/integration/data/simple.json b/integration/data/simple.json index a91b405d4f0f0..fb903e7ac4b63 100644 --- a/integration/data/simple.json +++ b/integration/data/simple.json @@ -31,7 +31,7 @@ "vectors": [ {"type": "VALIDITY", "typeBitWidth": 1}, {"type": "OFFSET", "typeBitWidth": 32}, - {"type": "DATA", "typeBitWidth": 64} + {"type": "DATA", "typeBitWidth": 8} ] } } diff --git a/integration/integration_test.py b/integration/integration_test.py index 6ea634d779566..88dc3ad7971ff 100644 --- a/integration/integration_test.py +++ b/integration/integration_test.py @@ -165,7 +165,7 @@ def run_all_tests(debug=False): runner = IntegrationRunner(json_files, testers, debug=debug) runner.run() - + print('-- All tests passed!') if __name__ == '__main__': parser = argparse.ArgumentParser(description='Arrow integration test CLI') From 072b7d671356721bc57da8703ed0939749cf4880 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Wed, 30 Nov 2016 18:14:31 -0500 Subject: [PATCH 0209/1644] ARROW-395: Arrow file format writes record batches in reverse order. Author: Julien Le Dem Closes #220 from julienledem/rb_order and squashes the following commits: ae5b7f8 [Julien Le Dem] ARROW-395: Arrow file format writes record batches in reverse order. --- .../apache/arrow/vector/file/ArrowFooter.java | 17 +++++-------- .../arrow/vector/file/TestArrowFile.java | 25 +++++++++++++------ 2 files changed, 23 insertions(+), 19 deletions(-) diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFooter.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFooter.java index 01e175b31b8db..3be19296cb56d 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFooter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFooter.java @@ -17,6 +17,8 @@ */ package org.apache.arrow.vector.file; +import static org.apache.arrow.vector.schema.FBSerializables.writeAllStructsToVector; + import java.util.ArrayList; import java.util.List; @@ -52,10 +54,10 @@ public ArrowFooter(Footer footer) { private static List recordBatches(Footer footer) { List recordBatches = new ArrayList<>(); - Block tempBLock = new Block(); + Block tempBlock = new Block(); int recordBatchesLength = footer.recordBatchesLength(); for (int i = 0; i < recordBatchesLength; i++) { - Block block = footer.recordBatches(tempBLock, i); + Block block = footer.recordBatches(tempBlock, i); recordBatches.add(new ArrowBlock(block.offset(), block.metaDataLength(), block.bodyLength())); } return recordBatches; @@ -88,9 +90,9 @@ public List getRecordBatches() { public int writeTo(FlatBufferBuilder builder) { int schemaIndex = schema.getSchema(builder); Footer.startDictionariesVector(builder, dictionaries.size()); - int dicsOffset = endVector(builder, dictionaries); + int dicsOffset = writeAllStructsToVector(builder, dictionaries); Footer.startRecordBatchesVector(builder, recordBatches.size()); - int rbsOffset = endVector(builder, recordBatches); + int rbsOffset = writeAllStructsToVector(builder, recordBatches); Footer.startFooter(builder); Footer.addSchema(builder, schemaIndex); Footer.addDictionaries(builder, dicsOffset); @@ -98,13 +100,6 @@ public int writeTo(FlatBufferBuilder builder) { return Footer.endFooter(builder); } - private int endVector(FlatBufferBuilder builder, List blocks) { - for (ArrowBlock block : blocks) { - block.writeTo(builder); - } - return builder.endVector(); - } - @Override public int hashCode() { final int prime = 31; 
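For context on the ArrowFooter change above: FlatBuffers builds struct vectors back to front, so appending structs in forward list order (as the removed endVector helper did) serializes them reversed. A minimal sketch of what a helper like writeAllStructsToVector plausibly does (an assumption about FBSerializables, not code taken from this patch):

```java
import java.util.List;
import com.google.flatbuffers.FlatBufferBuilder;

// Sketch only: iterate in reverse so the finished FlatBuffers struct vector
// preserves the logical order of the input list.
static int writeAllStructsToVector(FlatBufferBuilder builder, List<? extends FBSerializable> structs) {
  for (int i = structs.size() - 1; i >= 0; i--) {
    structs.get(i).writeTo(builder);
  }
  return builder.endVector();
}
```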
diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java index c9e60ee047bfe..5fa18b3ca5339 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java @@ -161,24 +161,27 @@ public void testWriteReadComplex() throws IOException { @Test public void testWriteReadMultipleRBs() throws IOException { File file = new File("target/mytest_multiple.arrow"); - int count = COUNT; + int[] counts = { 10, 5 }; // write try ( BufferAllocator originalVectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); MapVector parent = new MapVector("parent", originalVectorAllocator, null); FileOutputStream fileOutputStream = new FileOutputStream(file);) { - writeData(count, parent); - VectorUnloader vectorUnloader = newVectorUnloader(parent.getChild("root")); - Schema schema = vectorUnloader.getSchema(); + writeData(counts[0], parent); + VectorUnloader vectorUnloader0 = newVectorUnloader(parent.getChild("root")); + Schema schema = vectorUnloader0.getSchema(); Assert.assertEquals(2, schema.getFields().size()); try (ArrowWriter arrowWriter = new ArrowWriter(fileOutputStream.getChannel(), schema);) { - try (ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch()) { + try (ArrowRecordBatch recordBatch = vectorUnloader0.getRecordBatch()) { + Assert.assertEquals("RB #0", counts[0], recordBatch.getLength()); arrowWriter.writeRecordBatch(recordBatch); } parent.allocateNew(); - writeData(count, parent); - try (ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch()) { + writeData(counts[1], parent); // if we write the same data we don't catch that the metadata is stored in the wrong order. 
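+ // The two batches deliberately carry different row counts (10, then 5):
+ // if the footer recorded the blocks in reverse order, the counts[i]
+ // assertions on the read side would fail, which re-writing identical
+ // data could never detect.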
+ VectorUnloader vectorUnloader1 = newVectorUnloader(parent.getChild("root")); + try (ArrowRecordBatch recordBatch = vectorUnloader1.getRecordBatch()) { + Assert.assertEquals("RB #1", counts[1], recordBatch.getLength()); arrowWriter.writeRecordBatch(recordBatch); } } @@ -195,21 +198,27 @@ public void testWriteReadMultipleRBs() throws IOException { ArrowFooter footer = arrowReader.readFooter(); Schema schema = footer.getSchema(); LOGGER.debug("reading schema: " + schema); + int i = 0; try (VectorSchemaRoot root = new VectorSchemaRoot(schema, vectorAllocator);) { VectorLoader vectorLoader = new VectorLoader(root); List recordBatches = footer.getRecordBatches(); Assert.assertEquals(2, recordBatches.size()); + long previousOffset = 0; for (ArrowBlock rbBlock : recordBatches) { + Assert.assertTrue(rbBlock.getOffset() + " > " + previousOffset, rbBlock.getOffset() > previousOffset); + previousOffset = rbBlock.getOffset(); Assert.assertEquals(0, rbBlock.getOffset() % 8); Assert.assertEquals(0, rbBlock.getMetadataLength() % 8); try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { + Assert.assertEquals("RB #" + i, counts[i], recordBatch.getLength()); List buffersLayout = recordBatch.getBuffersLayout(); for (ArrowBuffer arrowBuffer : buffersLayout) { Assert.assertEquals(0, arrowBuffer.getOffset() % 8); } vectorLoader.load(recordBatch); - validateContent(count, root); + validateContent(counts[i], root); } + ++i; } } }
From 3b946b822445f21872c7cb42563c8d0c7bc84b80 Mon Sep 17 00:00:00 2001 From: Bryan Cutler Date: Thu, 1 Dec 2016 13:26:43 +0100 Subject: [PATCH 0210/1644] ARROW-396: [Python] Add pyarrow.schema.Schema.equals Added a pyarrow API for `Schema.equals` to check whether two schemas are equal, with a corresponding test case. Author: Bryan Cutler Closes #221 from BryanCutler/add-pyarrow-schema_equals-ARROW-396 and squashes the following commits: 910e943 [Bryan Cutler] added test case for pyarrow Schema equals 24cf982 [Bryan Cutler] added pyarrow Schema equals, and related def for CSchema --- python/pyarrow/includes/libarrow.pxd | 3 +++ python/pyarrow/schema.pyx | 9 +++++++++ python/pyarrow/tests/test_schema.py | 17 +++++++++++++++++ 3 files changed, 29 insertions(+)
diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 3ae1789170303..19da4085e1bb9 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -88,6 +88,9 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass CSchema" arrow::Schema": CSchema(const vector[shared_ptr[CField]]& fields) + + c_bool Equals(const shared_ptr[CSchema]& other) + const shared_ptr[CField]& field(int i) int num_fields() c_string ToString()
diff --git a/python/pyarrow/schema.pyx b/python/pyarrow/schema.pyx index 084c304aed2a2..e0badb9764143 100644 --- a/python/pyarrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ -110,6 +110,15 @@ cdef class Schema: self.schema = schema.get() self.sp_schema = schema + def equals(self, other): + """ + Test whether this schema is equal to the other schema + """ + cdef Schema _other + _other = other + + return self.sp_schema.get().Equals(_other.sp_schema) + @classmethod def from_fields(cls, fields): cdef:
diff --git a/python/pyarrow/tests/test_schema.py b/python/pyarrow/tests/test_schema.py index 2894ea8f84451..4aa8112a91769 100644 --- a/python/pyarrow/tests/test_schema.py +++ b/python/pyarrow/tests/test_schema.py @@ -69,3 +69,20 @@ def test_schema(self): foo: int32 bar: string baz: list""" + + def test_schema_equals(self): + fields = [
A.field('foo', A.int32()), + A.field('bar', A.string()), + A.field('baz', A.list_(A.int8())) + ] + + sch1 = A.schema(fields) + print(dir(sch1)) + sch2 = A.schema(fields) + assert sch1.equals(sch2) + + del fields[-1] + sch3 = A.schema(fields) + assert not sch1.equals(sch3) + From 33c731dbd69331b0d7ce0dc924791db4ca461009 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Fri, 2 Dec 2016 10:32:36 -0500 Subject: [PATCH 0211/1644] =?UTF-8?q?ARROW-398:=20Java=20file=20format=20r?= =?UTF-8?q?equires=20bitmaps=20of=20all=201's=20to=20be=20written=E2=80=A6?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit … when there are no nulls Author: Julien Le Dem Closes #222 from julienledem/empty_buf and squashes the following commits: c29da53 [Julien Le Dem] fix extraneous bits 4e87d88 [Julien Le Dem] ARROW-398: Java file format requires bitmaps of all 1's to be written when there are no nulls --- .../templates/NullableValueVectors.java | 2 +- .../main/codegen/templates/UnionVector.java | 2 +- .../arrow/vector/BaseDataValueVector.java | 7 +- .../org/apache/arrow/vector/BitVector.java | 36 ++++++++++ .../org/apache/arrow/vector/BufferBacked.java | 4 +- .../org/apache/arrow/vector/ValueVector.java | 17 ----- .../org/apache/arrow/vector/VectorLoader.java | 1 - .../arrow/vector/complex/ListVector.java | 2 +- .../vector/complex/NullableMapVector.java | 2 +- .../arrow/vector/TestVectorUnloadLoad.java | 69 +++++++++++++++++++ .../vector/file/TestArrowReaderWriter.java | 4 -- 11 files changed, 116 insertions(+), 30 deletions(-) diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index 48af7a2bafe4d..716fedcf866ef 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -144,7 +144,7 @@ public List getChildrenFromFields() { @Override public void loadFieldBuffers(ArrowFieldNode fieldNode, List ownBuffers) { - org.apache.arrow.vector.BaseDataValueVector.load(getFieldInnerVectors(), ownBuffers); + org.apache.arrow.vector.BaseDataValueVector.load(fieldNode, getFieldInnerVectors(), ownBuffers); bits.valueCount = fieldNode.getLength(); } diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java index 5ca3f90148449..9608b3c48ebd0 100644 --- a/java/vector/src/main/codegen/templates/UnionVector.java +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -105,7 +105,7 @@ public List getChildrenFromFields() { @Override public void loadFieldBuffers(ArrowFieldNode fieldNode, List ownBuffers) { - BaseDataValueVector.load(getFieldInnerVectors(), ownBuffers); + BaseDataValueVector.load(fieldNode, getFieldInnerVectors(), ownBuffers); this.valueCount = fieldNode.getLength(); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java index c22258d42651b..4c6d363f21cda 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java @@ -21,6 +21,7 @@ import java.util.List; import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.schema.ArrowFieldNode; import io.netty.buffer.ArrowBuf; @@ -29,13 +30,13 @@ public abstract class BaseDataValueVector extends BaseValueVector implements Buf protected final 
static byte[] emptyByteArray = new byte[]{}; // Nullable vectors use this - public static void load(List vectors, List buffers) { + public static void load(ArrowFieldNode fieldNode, List vectors, List buffers) { int expectedSize = vectors.size(); if (buffers.size() != expectedSize) { throw new IllegalArgumentException("Illegal buffer count, expected " + expectedSize + ", got: " + buffers.size()); } for (int i = 0; i < expectedSize; i++) { - vectors.get(i).load(buffers.get(i)); + vectors.get(i).load(fieldNode, buffers.get(i)); } } @@ -106,7 +107,7 @@ public ArrowBuf getBuffer() { } @Override - public void load(ArrowBuf data) { + public void load(ArrowFieldNode fieldNode, ArrowBuf data) { this.data.release(); this.data = data.retain(allocator); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java index c12db5045c2db..7ce1236b2ec30 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java @@ -22,6 +22,7 @@ import org.apache.arrow.vector.complex.reader.FieldReader; import org.apache.arrow.vector.holders.BitHolder; import org.apache.arrow.vector.holders.NullableBitHolder; +import org.apache.arrow.vector.schema.ArrowFieldNode; import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.util.OversizedAllocationException; @@ -48,6 +49,41 @@ public BitVector(String name, BufferAllocator allocator) { super(name, allocator); } + @Override + public void load(ArrowFieldNode fieldNode, ArrowBuf data) { + // When the vector is all nulls or all defined, the content of the buffer can be omitted + if (data.readableBytes() == 0 && fieldNode.getLength() != 0) { + data.release(); + int count = fieldNode.getLength(); + allocateNew(count); + int n = getSizeFromCount(count); + if (fieldNode.getNullCount() == 0) { + // all defined + // create an all 1s buffer + // set full bytes + int fullBytesCount = count / 8; + for (int i = 0; i < fullBytesCount; ++i) { + this.data.setByte(i, 0xFF); + } + int remainder = count % 8; + // set remaining bits + if (remainder > 0) { + byte bitMask = (byte) (0xFFL >>> ((8 - remainder) & 7));; + this.data.setByte(fullBytesCount, bitMask); + } + } else if (fieldNode.getNullCount() == fieldNode.getLength()) { + // all null + // create an all 0s buffer + zeroVector(); + } else { + throw new IllegalArgumentException("The buffer can be empty only if there's no data or it's all null or all defined"); + } + this.data.writerIndex(n); + } else { + super.load(fieldNode, data); + } + } + @Override public Field getField() { throw new UnsupportedOperationException("internal vector"); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BufferBacked.java b/java/vector/src/main/java/org/apache/arrow/vector/BufferBacked.java index d1c262d226556..3c8b3210d77ff 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BufferBacked.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BufferBacked.java @@ -17,6 +17,8 @@ */ package org.apache.arrow.vector; +import org.apache.arrow.vector.schema.ArrowFieldNode; + import io.netty.buffer.ArrowBuf; /** @@ -24,7 +26,7 @@ */ public interface BufferBacked { - void load(ArrowBuf data); + void load(ArrowFieldNode fieldNode, ArrowBuf data); ArrowBuf unLoad(); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java 
b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java index ba7790e47ef95..5b24a41850d75 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java @@ -130,13 +130,6 @@ public interface ValueVector extends Closeable, Iterable { */ FieldReader getReader(); - /** - * Get the metadata for this field. Used in serialization - * - * @return FieldMetadata for this field. - */ -// SerializedField getMetadata(); - /** * Returns the number of bytes that is used by this vector instance. */ @@ -166,16 +159,6 @@ public interface ValueVector extends Closeable, Iterable { */ ArrowBuf[] getBuffers(boolean clear); - /** - * Load the data provided in the buffer. Typically used when deserializing from the wire. - * - * @param metadata - * Metadata used to decode the incoming buffer. - * @param buffer - * The buffer that contains the ValueVector. - */ -// void load(SerializedField metadata, DrillBuf buffer); - /** * An abstraction that is used to read from this vector instance. */
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java b/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java index c5d642ee0cc72..757f061dd5a2f 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java @@ -81,7 +81,6 @@ private void loadBuffers(FieldVector vector, Field field, Iterator buf try { vector.loadFieldBuffers(fieldNode, ownBuffers); } catch (RuntimeException e) { - e.printStackTrace(); throw new IllegalArgumentException("Could not load buffers for field " + field + " error message: " + e.getMessage(), e); }
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java index dd99c734f7ff8..e18f99f95d780 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java @@ -93,7 +93,7 @@ public List getChildrenFromFields() { @Override public void loadFieldBuffers(ArrowFieldNode fieldNode, List ownBuffers) { - BaseDataValueVector.load(getFieldInnerVectors(), ownBuffers); + BaseDataValueVector.load(fieldNode, getFieldInnerVectors(), ownBuffers); } @Override
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java index 8e1bbfabdc907..f0ddf2727e9ea 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java @@ -62,7 +62,7 @@ public NullableMapVector(String name, BufferAllocator allocator, CallBack callBa @Override public void loadFieldBuffers(ArrowFieldNode fieldNode, List ownBuffers) { - BaseDataValueVector.load(getFieldInnerVectors(), ownBuffers); + BaseDataValueVector.load(fieldNode, getFieldInnerVectors(), ownBuffers); this.valueCount = fieldNode.getLength(); }
diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java b/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java index 78f69eedc1c27..9dfe8d840e49d 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java @@ -17,7 +17,13 @@ */
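+ // Context for the test below: when a validity buffer arrives empty for a
+ // field of length n with zero nulls, BitVector.load (earlier in this patch)
+ // rebuilds an all-ones bitmap: n / 8 bytes of 0xFF plus, when n % 8 != 0,
+ // one trailing byte with the low n % 8 bits set. For n = 10 that is one
+ // 0xFF byte followed by 0x03 (binary 00000011).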
package org.apache.arrow.vector; +import static java.util.Arrays.asList; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertFalse; +import static org.junit.Assert.assertTrue; + import java.io.IOException; +import java.util.Collections; import java.util.List; import org.apache.arrow.memory.BufferAllocator; @@ -29,12 +35,17 @@ import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; import org.apache.arrow.vector.complex.writer.BigIntWriter; import org.apache.arrow.vector.complex.writer.IntWriter; +import org.apache.arrow.vector.schema.ArrowFieldNode; import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.types.pojo.Schema; import org.junit.AfterClass; import org.junit.Assert; import org.junit.Test; +import io.netty.buffer.ArrowBuf; + public class TestVectorUnloadLoad { static final BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); @@ -88,6 +99,64 @@ public void testUnloadLoad() throws IOException { } } + /** + * The validity buffer can be empty if: + * - all values are defined + * - all values are null + * @throws IOException + */ + @Test + public void testLoadEmptyValidityBuffer() throws IOException { + Schema schema = new Schema(asList( + new Field("intDefined", true, new ArrowType.Int(32, true), Collections.emptyList()), + new Field("intNull", true, new ArrowType.Int(32, true), Collections.emptyList()) + )); + int count = 10; + ArrowBuf validity = allocator.getEmpty(); + ArrowBuf values = allocator.buffer(count * 4); // integers + for (int i = 0; i < count; i++) { + values.setInt(i * 4, i); + } + try ( + ArrowRecordBatch recordBatch = new ArrowRecordBatch(count, asList(new ArrowFieldNode(count, 0), new ArrowFieldNode(count, count)), asList(validity, values, validity, values)); + BufferAllocator finalVectorsAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); + VectorSchemaRoot newRoot = new VectorSchemaRoot(schema, finalVectorsAllocator); + ) { + + // load it + VectorLoader vectorLoader = new VectorLoader(newRoot); + + vectorLoader.load(recordBatch); + + NullableIntVector intDefinedVector = (NullableIntVector)newRoot.getVector("intDefined"); + NullableIntVector intNullVector = (NullableIntVector)newRoot.getVector("intNull"); + for (int i = 0; i < count; i++) { + assertFalse("#" + i, intDefinedVector.getAccessor().isNull(i)); + assertEquals("#" + i, i, intDefinedVector.getAccessor().get(i)); + assertTrue("#" + i, intNullVector.getAccessor().isNull(i)); + } + intDefinedVector.getMutator().setSafe(count + 10, 1234); + assertTrue(intDefinedVector.getAccessor().isNull(count + 1)); + // empty slots should still default to unset + intDefinedVector.getMutator().setSafe(count + 1, 789); + assertFalse(intDefinedVector.getAccessor().isNull(count + 1)); + assertEquals(789, intDefinedVector.getAccessor().get(count + 1)); + assertTrue(intDefinedVector.getAccessor().isNull(count)); + assertTrue(intDefinedVector.getAccessor().isNull(count + 2)); + assertTrue(intDefinedVector.getAccessor().isNull(count + 3)); + assertTrue(intDefinedVector.getAccessor().isNull(count + 4)); + assertTrue(intDefinedVector.getAccessor().isNull(count + 5)); + assertTrue(intDefinedVector.getAccessor().isNull(count + 6)); + assertTrue(intDefinedVector.getAccessor().isNull(count + 7)); + assertTrue(intDefinedVector.getAccessor().isNull(count + 8)); + assertTrue(intDefinedVector.getAccessor().isNull(count + 9)); 
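+ // setSafe(count + 10, 1234) grew the vector past the value count loaded
+ // from the record batch; the slots between count and count + 10 were never
+ // written, so they must still read as null.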
+ assertFalse(intDefinedVector.getAccessor().isNull(count + 10)); + assertEquals(1234, intDefinedVector.getAccessor().get(count + 10)); + } finally { + values.release(); + } + } + public static VectorUnloader newVectorUnloader(FieldVector root) { Schema schema = new Schema(root.getField().getChildren()); int valueCount = root.getAccessor().getValueCount(); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java index f90329aca11dd..8ed89fa347b3b 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java @@ -30,10 +30,6 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; -import org.apache.arrow.vector.file.ArrowBlock; -import org.apache.arrow.vector.file.ArrowFooter; -import org.apache.arrow.vector.file.ArrowReader; -import org.apache.arrow.vector.file.ArrowWriter; import org.apache.arrow.vector.schema.ArrowFieldNode; import org.apache.arrow.vector.schema.ArrowRecordBatch; import org.apache.arrow.vector.types.pojo.ArrowType; From 06be7aed062aca32b683f2ab3a94a201ae54b4f3 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Fri, 2 Dec 2016 11:48:24 -0500 Subject: [PATCH 0212/1644] ARROW-389: Python: Write Parquet files to pyarrow.io.NativeFile objects Author: Uwe L. Korn Closes #214 from xhochy/ARROW-389 and squashes the following commits: e66c895 [Uwe L. Korn] Switch image to deprecated group 876cd65 [Uwe L. Korn] ARROW-389: Python: Write Parquet files to pyarrow.io.NativeFile objects --- .travis.yml | 1 + python/pyarrow/includes/parquet.pxd | 7 +++++-- python/pyarrow/parquet.pyx | 18 ++++++++++++------ python/pyarrow/tests/test_parquet.py | 27 +++++++++++++++++++++++++++ 4 files changed, 45 insertions(+), 8 deletions(-) diff --git a/.travis.yml b/.travis.yml index 052c22ccc3790..bfc2f26b4f590 100644 --- a/.travis.yml +++ b/.travis.yml @@ -24,6 +24,7 @@ matrix: - compiler: gcc language: cpp os: linux + group: deprecated before_script: - export CC="gcc-4.9" - export CXX="g++-4.9" diff --git a/python/pyarrow/includes/parquet.pxd b/python/pyarrow/includes/parquet.pxd index 57c35ba89445b..cb791e16f926d 100644 --- a/python/pyarrow/includes/parquet.pxd +++ b/python/pyarrow/includes/parquet.pxd @@ -19,7 +19,7 @@ from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport CArray, CSchema, CStatus, CTable, MemoryPool -from pyarrow.includes.libarrow_io cimport ReadableFileInterface +from pyarrow.includes.libarrow_io cimport ReadableFileInterface, OutputStream cdef extern from "parquet/api/schema.h" namespace "parquet::schema" nogil: @@ -131,6 +131,9 @@ cdef extern from "parquet/arrow/io.h" namespace "parquet::arrow" nogil: ParquetReadSource(ParquetAllocator* allocator) Open(const shared_ptr[ReadableFileInterface]& file) + cdef cppclass ParquetWriteSink: + ParquetWriteSink(const shared_ptr[OutputStream]& file) + cdef extern from "parquet/arrow/reader.h" namespace "parquet::arrow" nogil: CStatus OpenFile(const shared_ptr[ReadableFileInterface]& file, @@ -154,6 +157,6 @@ cdef extern from "parquet/arrow/schema.h" namespace "parquet::arrow" nogil: cdef extern from "parquet/arrow/writer.h" namespace "parquet::arrow" nogil: cdef CStatus WriteFlatTable( const CTable* table, MemoryPool* pool, - const shared_ptr[ParquetOutputStream]& sink, + const shared_ptr[ParquetWriteSink]& sink, int64_t chunk_size, const 
shared_ptr[WriterProperties]& properties)
diff --git a/python/pyarrow/parquet.pyx b/python/pyarrow/parquet.pyx index a6e3ac30684b4..83fddb287a3f1 100644 --- a/python/pyarrow/parquet.pyx +++ b/python/pyarrow/parquet.pyx @@ -21,7 +21,7 @@ from pyarrow.includes.libarrow cimport * from pyarrow.includes.parquet cimport * -from pyarrow.includes.libarrow_io cimport ReadableFileInterface +from pyarrow.includes.libarrow_io cimport ReadableFileInterface, OutputStream, FileOutputStream cimport pyarrow.includes.pyarrow as pyarrow from pyarrow.array cimport Array @@ -151,7 +151,7 @@ def read_table(source, columns=None): return Table.from_arrays(columns, arrays) -def write_table(table, filename, chunk_size=None, version=None, +def write_table(table, sink, chunk_size=None, version=None, use_dictionary=True, compression=None): """ Write a Table to Parquet format @@ -159,7 +159,7 @@ def write_table(table, sink, chunk_size=None, version=None, Parameters ---------- table : pyarrow.Table - filename : string + sink: string or pyarrow.io.NativeFile chunk_size : int The maximum number of rows in each Parquet RowGroup. As a default, we will write a single RowGroup per file. @@ -173,7 +173,8 @@ def write_table(table, sink, chunk_size=None, version=None, """ cdef Table table_ = table cdef CTable* ctable_ = table_.table - cdef shared_ptr[ParquetOutputStream] sink + cdef shared_ptr[ParquetWriteSink] sink_ + cdef shared_ptr[FileOutputStream] filesink_ cdef WriterProperties.Builder properties_builder cdef int64_t chunk_size_ = 0 if chunk_size is None: @@ -230,7 +231,12 @@ def write_table(table, sink, chunk_size=None, version=None, else: raise ArrowException("Unsupported compression codec") - sink.reset(new LocalFileOutputStream(tobytes(filename))) + if isinstance(sink, six.string_types): + check_status(FileOutputStream.Open(tobytes(sink), &filesink_)) + sink_.reset(new ParquetWriteSink(filesink_)) + elif isinstance(sink, NativeFile): + sink_.reset(new ParquetWriteSink((sink).wr_file)) + with nogil: - check_status(WriteFlatTable(ctable_, default_memory_pool(), sink, + check_status(WriteFlatTable(ctable_, default_memory_pool(), sink_, chunk_size_, properties_builder.build()))
diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index c1d44ce0d4230..841830f6fba3b 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -18,6 +18,7 @@ import pytest import pyarrow as A +import pyarrow.io as paio import numpy as np import pandas as pd @@ -131,6 +132,32 @@ def test_pandas_column_selection(tmpdir): pdt.assert_frame_equal(df[['uint8']], df_read) +@parquet +def test_pandas_parquet_native_file_roundtrip(tmpdir): + size = 10000 + np.random.seed(0) + df = pd.DataFrame({ + 'uint8': np.arange(size, dtype=np.uint8), + 'uint16': np.arange(size, dtype=np.uint16), + 'uint32': np.arange(size, dtype=np.uint32), + 'uint64': np.arange(size, dtype=np.uint64), + 'int8': np.arange(size, dtype=np.int16), + 'int16': np.arange(size, dtype=np.int16), + 'int32': np.arange(size, dtype=np.int32), + 'int64': np.arange(size, dtype=np.int64), + 'float32': np.arange(size, dtype=np.float32), + 'float64': np.arange(size, dtype=np.float64), + 'bool': np.random.randn(size) > 0 + }) + arrow_table = A.from_pandas_dataframe(df) + imos = paio.InMemoryOutputStream() + pq.write_table(arrow_table, imos, version="2.0") + buf = imos.get_result() + reader = paio.BufferReader(buf) + df_read = pq.read_table(reader).to_pandas() + pdt.assert_frame_equal(df, df_read) + + @parquet def
test_pandas_parquet_configuration_options(tmpdir): size = 10000 From ebe7dc8f5ff32f5fa86625d4c622b4e075e95ae0 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Fri, 2 Dec 2016 11:51:22 -0500 Subject: [PATCH 0213/1644] ARROW-335: Improve Type apis and toString() by encapsulating flatbuffers better Author: Julien Le Dem Closes #172 from julienledem/tostring and squashes the following commits: 546aa02 [Julien Le Dem] fix rebase issues 262ae9f [Julien Le Dem] review feedback 41d5627 [Julien Le Dem] ARROW-335: Improve Type apis and toString() by encapsulating flatbuffers better --- .../src/main/codegen/data/ArrowTypes.tdd | 8 +- .../src/main/codegen/templates/ArrowType.java | 172 ++++-------------- .../templates/NullableValueVectors.java | 10 +- .../main/codegen/templates/UnionVector.java | 4 +- .../complex/BaseRepeatedValueVector.java | 5 +- .../arrow/vector/complex/MapVector.java | 4 +- .../arrow/vector/schema/TypeLayout.java | 19 +- .../arrow/vector/schema/VectorLayout.java | 2 +- .../vector/types/FloatingPointPrecision.java | 47 +++++ .../arrow/vector/types/IntervalUnit.java | 44 +++++ .../apache/arrow/vector/types/TimeUnit.java | 46 +++++ .../org/apache/arrow/vector/types/Types.java | 38 ++-- .../apache/arrow/vector/types/UnionMode.java | 44 +++++ .../apache/arrow/vector/types/pojo/Field.java | 14 +- .../arrow/vector/types/pojo/Schema.java | 3 +- .../complex/impl/TestPromotableWriter.java | 7 +- .../complex/writer/TestComplexWriter.java | 15 +- .../apache/arrow/vector/pojo/TestConvert.java | 14 +- .../arrow/vector/types/pojo/TestSchema.java | 36 +++- 19 files changed, 318 insertions(+), 214 deletions(-) create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/types/FloatingPointPrecision.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/types/IntervalUnit.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/types/TimeUnit.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/types/UnionMode.java diff --git a/java/vector/src/main/codegen/data/ArrowTypes.tdd b/java/vector/src/main/codegen/data/ArrowTypes.tdd index c0b942bc3595d..01465e585dad2 100644 --- a/java/vector/src/main/codegen/data/ArrowTypes.tdd +++ b/java/vector/src/main/codegen/data/ArrowTypes.tdd @@ -26,7 +26,7 @@ }, { name: "Union", - fields: [{name: "mode", type: short}, {name: "typeIds", type: "int[]"}] + fields: [{name: "mode", type: short, valueType: UnionMode}, {name: "typeIds", type: "int[]"}] }, { name: "Int", @@ -34,7 +34,7 @@ }, { name: "FloatingPoint", - fields: [{name: precision, type: short}] + fields: [{name: precision, type: short, valueType: FloatingPointPrecision}] }, { name: "Utf8", @@ -62,11 +62,11 @@ }, { name: "Timestamp", - fields: [{name: "unit", type: short}] + fields: [{name: "unit", type: short, valueType: TimeUnit}] }, { name: "Interval", - fields: [{name: "unit", type: short}] + fields: [{name: "unit", type: short, valueType: IntervalUnit}] } ] } diff --git a/java/vector/src/main/codegen/templates/ArrowType.java b/java/vector/src/main/codegen/templates/ArrowType.java index 4069e6061b66e..85ea3898e09c6 100644 --- a/java/vector/src/main/codegen/templates/ArrowType.java +++ b/java/vector/src/main/codegen/templates/ArrowType.java @@ -23,30 +23,18 @@ package org.apache.arrow.vector.types.pojo; import com.google.flatbuffers.FlatBufferBuilder; -import org.apache.arrow.flatbuf.Type; -import java.io.IOException; import java.util.Objects; -import org.apache.arrow.flatbuf.Precision; -import org.apache.arrow.flatbuf.UnionMode; -import 
org.apache.arrow.flatbuf.TimeUnit; -import org.apache.arrow.flatbuf.IntervalUnit; +import org.apache.arrow.flatbuf.Type; + +import org.apache.arrow.vector.types.*; import com.fasterxml.jackson.annotation.JsonCreator; import com.fasterxml.jackson.annotation.JsonIgnore; import com.fasterxml.jackson.annotation.JsonProperty; import com.fasterxml.jackson.annotation.JsonSubTypes; import com.fasterxml.jackson.annotation.JsonTypeInfo; -import com.fasterxml.jackson.core.JsonGenerator; -import com.fasterxml.jackson.core.JsonParser; -import com.fasterxml.jackson.core.JsonProcessingException; -import com.fasterxml.jackson.databind.DeserializationContext; -import com.fasterxml.jackson.databind.JsonDeserializer; -import com.fasterxml.jackson.databind.JsonSerializer; -import com.fasterxml.jackson.databind.SerializerProvider; -import com.fasterxml.jackson.databind.annotation.JsonDeserialize; -import com.fasterxml.jackson.databind.annotation.JsonSerialize; /** * Arrow types @@ -57,119 +45,31 @@ property = "name") @JsonSubTypes({ <#list arrowTypes.types as type> - @JsonSubTypes.Type(value = ArrowType.${type.name}.class, name = "${type.name?remove_ending("_")?lower_case}"), + @JsonSubTypes.Type(value = ArrowType.${type.name?remove_ending("_")}.class, name = "${type.name?remove_ending("_")?lower_case}"), }) public abstract class ArrowType { - private static class FloatingPointPrecisionSerializer extends JsonSerializer { - @Override - public void serialize(Short precision, - JsonGenerator jsonGenerator, - SerializerProvider serializerProvider) - throws IOException, JsonProcessingException { - jsonGenerator.writeObject(Precision.name(precision)); - } - } - - private static class FloatingPointPrecisionDeserializer extends JsonDeserializer { - @Override - public Short deserialize(JsonParser p, DeserializationContext ctxt) throws IOException, JsonProcessingException { - String name = p.getText(); - switch(name) { - case "HALF": - return Precision.HALF; - case "SINGLE": - return Precision.SINGLE; - case "DOUBLE": - return Precision.DOUBLE; - default: - throw new IllegalArgumentException("unknown precision: " + name); - } - } - } - - private static class UnionModeSerializer extends JsonSerializer { - @Override - public void serialize(Short mode, - JsonGenerator jsonGenerator, - SerializerProvider serializerProvider) - throws IOException, JsonProcessingException { - jsonGenerator.writeObject(UnionMode.name(mode)); - } - } - - private static class UnionModeDeserializer extends JsonDeserializer { - @Override - public Short deserialize(JsonParser p, DeserializationContext ctxt) throws IOException, JsonProcessingException { - String name = p.getText(); - switch(name) { - case "Sparse": - return UnionMode.Sparse; - case "Dense": - return UnionMode.Dense; - default: - throw new IllegalArgumentException("unknown union mode: " + name); - } - } - } - - private static class TimestampUnitSerializer extends JsonSerializer { - @Override - public void serialize(Short unit, - JsonGenerator jsonGenerator, - SerializerProvider serializerProvider) - throws IOException, JsonProcessingException { - jsonGenerator.writeObject(TimeUnit.name(unit)); - } - } - - private static class TimestampUnitDeserializer extends JsonDeserializer { - @Override - public Short deserialize(JsonParser p, DeserializationContext ctxt) throws IOException, JsonProcessingException { - String name = p.getText(); - switch(name) { - case "SECOND": - return TimeUnit.SECOND; - case "MILLISECOND": - return TimeUnit.MILLISECOND; - case "MICROSECOND": - return 
TimeUnit.MICROSECOND; - case "NANOSECOND": - return TimeUnit.NANOSECOND; - default: - throw new IllegalArgumentException("unknown time unit: " + name); - } - } - } + public static enum ArrowTypeID { + <#list arrowTypes.types as type> + <#assign name = type.name> + ${name?remove_ending("_")}(Type.${name}), + + NONE(Type.NONE); + + private final byte flatbufType; - private static class IntervalUnitSerializer extends JsonSerializer { - @Override - public void serialize(Short unit, - JsonGenerator jsonGenerator, - SerializerProvider serializerProvider) - throws IOException, JsonProcessingException { - jsonGenerator.writeObject(IntervalUnit.name(unit)); + public byte getFlatbufID() { + return this.flatbufType; } - } - private static class IntervalUnitDeserializer extends JsonDeserializer { - @Override - public Short deserialize(JsonParser p, DeserializationContext ctxt) throws IOException, JsonProcessingException { - String name = p.getText(); - switch(name) { - case "YEAR_MONTH": - return IntervalUnit.YEAR_MONTH; - case "DAY_TIME": - return IntervalUnit.DAY_TIME; - default: - throw new IllegalArgumentException("unknown interval unit: " + name); - } + private ArrowTypeID(byte flatbufType) { + this.flatbufType = flatbufType; } } @JsonIgnore - public abstract byte getTypeType(); + public abstract ArrowTypeID getTypeID(); public abstract int getType(FlatBufferBuilder builder); public abstract T accept(ArrowTypeVisitor visitor); @@ -183,28 +83,30 @@ public Short deserialize(JsonParser p, DeserializationContext ctxt) throws IOExc */ public static interface ArrowTypeVisitor { <#list arrowTypes.types as type> - T visit(${type.name} type); + T visit(${type.name?remove_ending("_")} type); } <#list arrowTypes.types as type> - <#assign name = type.name> + <#assign name = type.name?remove_ending("_")> <#assign fields = type.fields> public static class ${name} extends ArrowType { - public static final byte TYPE_TYPE = Type.${name}; + public static final ArrowTypeID TYPE_TYPE = ArrowTypeID.${name}; <#if type.fields?size == 0> public static final ${name} INSTANCE = new ${name}(); <#list fields as field> - ${field.type} ${field.name}; + <#assign fieldType = field.valueType!field.type> + ${fieldType} ${field.name}; <#if type.fields?size != 0> @JsonCreator public ${type.name}( <#list type.fields as field> - <#if field.type == "short"> @JsonDeserialize(using = ${type.name}${field.name?cap_first}Deserializer.class) @JsonProperty("${field.name}") ${field.type} ${field.name}<#if field_has_next>, + <#assign fieldType = field.valueType!field.type> + @JsonProperty("${field.name}") ${fieldType} ${field.name}<#if field_has_next>, ) { <#list type.fields as field> @@ -214,7 +116,7 @@ public static class ${name} extends ArrowType { @Override - public byte getTypeType() { + public ArrowTypeID getTypeID() { return TYPE_TYPE; } @@ -235,27 +137,29 @@ public int getType(FlatBufferBuilder builder) { org.apache.arrow.flatbuf.${type.name}.add${field.name?cap_first}(builder, ${field.name}); } <#else> - org.apache.arrow.flatbuf.${type.name}.add${field.name?cap_first}(builder, this.${field.name}); + org.apache.arrow.flatbuf.${type.name}.add${field.name?cap_first}(builder, this.${field.name}<#if field.valueType??>.getFlatbufID()); return org.apache.arrow.flatbuf.${type.name}.end${type.name}(builder); } <#list fields as field> - <#if field.type == "short"> - @JsonSerialize(using = ${type.name}${field.name?cap_first}Serializer.class) - - public ${field.type} get${field.name?cap_first}() { + <#assign fieldType = 
field.valueType!field.type> + public ${fieldType} get${field.name?cap_first}() { return ${field.name}; } public String toString() { - return "${name}{" + return "${name}" + <#if fields?size != 0> + + "(" <#list fields as field> - + <#if field.type == "int[]">java.util.Arrays.toString(${field.name})<#else>${field.name}<#if field_has_next> + ", " + + <#if field.type == "int[]">java.util.Arrays.toString(${field.name})<#else>${field.name}<#if field_has_next> + ", " - + "}"; + + ")" + + ; } @Override @@ -265,7 +169,7 @@ public int hashCode() { @Override public boolean equals(Object obj) { - if (!(obj instanceof ${type.name})) { + if (!(obj instanceof ${name})) { return false; } <#if type.fields?size == 0> @@ -287,7 +191,7 @@ public T accept(ArrowTypeVisitor visitor) { public static org.apache.arrow.vector.types.pojo.ArrowType getTypeForField(org.apache.arrow.flatbuf.Field field) { switch(field.typeType()) { <#list arrowTypes.types as type> - <#assign name = type.name> + <#assign name = type.name?remove_ending("_")> <#assign nameLower = type.name?lower_case> <#assign fields = type.fields> case Type.${type.name}: { @@ -302,7 +206,7 @@ public static org.apache.arrow.vector.types.pojo.ArrowType getTypeForField(org.a ${field.type} ${field.name} = ${nameLower}Type.${field.name}(); - return new ${type.name}(<#list type.fields as field>${field.name}<#if field_has_next>, ); + return new ${name}(<#list type.fields as field><#if field.valueType??>${field.valueType}.fromFlatbufID(${field.name})<#else>${field.name}<#if field_has_next>, ); } default: diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index 716fedcf866ef..2c4274c13ee58 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -99,15 +99,15 @@ public final class ${className} extends BaseDataValueVector implements <#if type <#elseif minor.class == "Time"> field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Time(), null); <#elseif minor.class == "Float4"> - field = new Field(name, true, new FloatingPoint(Precision.SINGLE), null); + field = new Field(name, true, new FloatingPoint(org.apache.arrow.vector.types.FloatingPointPrecision.SINGLE), null); <#elseif minor.class == "Float8"> - field = new Field(name, true, new FloatingPoint(Precision.DOUBLE), null); + field = new Field(name, true, new FloatingPoint(org.apache.arrow.vector.types.FloatingPointPrecision.DOUBLE), null); <#elseif minor.class == "TimeStamp"> - field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(org.apache.arrow.flatbuf.TimeUnit.MILLISECOND), null); + field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(org.apache.arrow.vector.types.TimeUnit.MILLISECOND), null); <#elseif minor.class == "IntervalDay"> - field = new Field(name, true, new Interval(org.apache.arrow.flatbuf.IntervalUnit.DAY_TIME), null); + field = new Field(name, true, new Interval(org.apache.arrow.vector.types.IntervalUnit.DAY_TIME), null); <#elseif minor.class == "IntervalYear"> - field = new Field(name, true, new Interval(org.apache.arrow.flatbuf.IntervalUnit.YEAR_MONTH), null); + field = new Field(name, true, new Interval(org.apache.arrow.vector.types.IntervalUnit.YEAR_MONTH), null); <#elseif minor.class == "VarChar"> field = new Field(name, true, new Utf8(), null); <#elseif minor.class == "VarBinary"> diff --git 
a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java index 9608b3c48ebd0..ea1fdf6bd60fb 100644 --- a/java/vector/src/main/codegen/templates/UnionVector.java +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -15,8 +15,6 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -import java.util.List; - <@pp.dropOutputFile /> <@pp.changeOutputFile name="/org/apache/arrow/vector/complex/UnionVector.java" /> @@ -35,7 +33,7 @@ import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.schema.ArrowFieldNode; -import static org.apache.arrow.flatbuf.UnionMode.Sparse; +import static org.apache.arrow.vector.types.UnionMode.Sparse; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java index 517d20c77a93c..7424df474ae89 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java @@ -20,7 +20,6 @@ import java.util.Collections; import java.util.Iterator; -import org.apache.arrow.flatbuf.Type; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.AddOrGetResult; import org.apache.arrow.vector.BaseValueVector; @@ -159,9 +158,9 @@ public AddOrGetResult addOrGetVector(MinorType minorT created = true; } - if (vector.getField().getType().getTypeType() != minorType.getType().getTypeType()) { + if (vector.getField().getType().getTypeID() != minorType.getType().getTypeID()) { final String msg = String.format("Inner vector type mismatch. Requested type: [%s], actual type: [%s]", - Type.name(minorType.getType().getTypeType()), Type.name(vector.getField().getType().getTypeType())); + minorType.getType().getTypeID(), vector.getField().getType().getTypeID()); throw new SchemaChangeRuntimeException(msg); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java index aaecb956434e9..c2f216b197e1d 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java @@ -34,7 +34,7 @@ import org.apache.arrow.vector.holders.ComplexHolder; import org.apache.arrow.vector.types.Types; import org.apache.arrow.vector.types.Types.MinorType; -import org.apache.arrow.vector.types.pojo.ArrowType.Struct_; +import org.apache.arrow.vector.types.pojo.ArrowType.Struct; import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.util.JsonStringHashMap; @@ -290,7 +290,7 @@ public Field getField() { for (ValueVector child : getChildren()) { children.add(child.getField()); } - return new Field(name, false, Struct_.INSTANCE, children); + return new Field(name, false, Struct.INSTANCE, children); } @Override diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java index c5f53fe508d9f..0b586914bdf85 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java @@ -18,9 +18,6 @@ package org.apache.arrow.vector.schema; import static 
java.util.Arrays.asList; -import static org.apache.arrow.flatbuf.Precision.DOUBLE; -import static org.apache.arrow.flatbuf.Precision.HALF; -import static org.apache.arrow.flatbuf.Precision.SINGLE; import static org.apache.arrow.vector.schema.VectorLayout.booleanVector; import static org.apache.arrow.vector.schema.VectorLayout.byteVector; import static org.apache.arrow.vector.schema.VectorLayout.dataVector; @@ -32,8 +29,6 @@ import java.util.Collections; import java.util.List; -import org.apache.arrow.flatbuf.IntervalUnit; -import org.apache.arrow.flatbuf.UnionMode; import org.apache.arrow.vector.types.pojo.ArrowType; import org.apache.arrow.vector.types.pojo.ArrowType.ArrowTypeVisitor; import org.apache.arrow.vector.types.pojo.ArrowType.Binary; @@ -44,7 +39,7 @@ import org.apache.arrow.vector.types.pojo.ArrowType.Int; import org.apache.arrow.vector.types.pojo.ArrowType.Interval; import org.apache.arrow.vector.types.pojo.ArrowType.Null; -import org.apache.arrow.vector.types.pojo.ArrowType.Struct_; +import org.apache.arrow.vector.types.pojo.ArrowType.Struct; import org.apache.arrow.vector.types.pojo.ArrowType.Time; import org.apache.arrow.vector.types.pojo.ArrowType.Timestamp; import org.apache.arrow.vector.types.pojo.ArrowType.Union; @@ -72,7 +67,7 @@ public static TypeLayout getTypeLayout(final ArrowType arrowType) { @Override public TypeLayout visit(Union type) { List vectors; switch (type.getMode()) { - case UnionMode.Dense: + case Dense: vectors = asList( // TODO: validate this validityVector(), @@ -80,7 +75,7 @@ public static TypeLayout getTypeLayout(final ArrowType arrowType) { offsetVector() // offset to find the vector ); break; - case UnionMode.Sparse: + case Sparse: vectors = asList( typeVector() // type of the value at the index or 0 if null ); @@ -91,7 +86,7 @@ public static TypeLayout getTypeLayout(final ArrowType arrowType) { return new TypeLayout(vectors); } - @Override public TypeLayout visit(Struct_ type) { + @Override public TypeLayout visit(Struct type) { List vectors = asList( validityVector() ); @@ -175,9 +170,9 @@ public TypeLayout visit(Time type) { @Override public TypeLayout visit(Interval type) { // TODO: check size switch (type.getUnit()) { - case IntervalUnit.DAY_TIME: + case DAY_TIME: return newFixedWidthTypeLayout(dataVector(64)); - case IntervalUnit.YEAR_MONTH: + case YEAR_MONTH: return newFixedWidthTypeLayout(dataVector(64)); default: throw new UnsupportedOperationException("Unknown unit " + type.getUnit()); @@ -215,7 +210,7 @@ public List getVectorTypes() { } public String toString() { - return "TypeLayout{" + vectors + "}"; + return vectors.toString(); } @Override diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/VectorLayout.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/VectorLayout.java index 931c00a02817b..2073795b2a199 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/schema/VectorLayout.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/VectorLayout.java @@ -101,7 +101,7 @@ public ArrowVectorType getType() { @Override public String toString() { - return String.format("{width=%s,type=%s}", typeBitWidth, type); + return String.format("%s(%s)", type, typeBitWidth); } @Override diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/FloatingPointPrecision.java b/java/vector/src/main/java/org/apache/arrow/vector/types/FloatingPointPrecision.java new file mode 100644 index 0000000000000..3206969fb7ead --- /dev/null +++ 
b/java/vector/src/main/java/org/apache/arrow/vector/types/FloatingPointPrecision.java @@ -0,0 +1,47 @@ +/******************************************************************************* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + ******************************************************************************/ +package org.apache.arrow.vector.types; + +import org.apache.arrow.flatbuf.Precision; + +public enum FloatingPointPrecision { + HALF(Precision.HALF), + SINGLE(Precision.SINGLE), + DOUBLE(Precision.DOUBLE); + + private static final FloatingPointPrecision[] valuesByFlatbufId = new FloatingPointPrecision[FloatingPointPrecision.values().length]; + static { + for (FloatingPointPrecision v : FloatingPointPrecision.values()) { + valuesByFlatbufId[v.flatbufID] = v; + } + } + + private short flatbufID; + + private FloatingPointPrecision(short flatbufID) { + this.flatbufID = flatbufID; + } + + public short getFlatbufID() { + return flatbufID; + } + + public static FloatingPointPrecision fromFlatbufID(short id) { + return valuesByFlatbufId[id]; + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/IntervalUnit.java b/java/vector/src/main/java/org/apache/arrow/vector/types/IntervalUnit.java new file mode 100644 index 0000000000000..b3ddf1fe497de --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/IntervalUnit.java @@ -0,0 +1,44 @@ +/******************************************************************************* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ ******************************************************************************/ +package org.apache.arrow.vector.types; + +public enum IntervalUnit { + YEAR_MONTH(org.apache.arrow.flatbuf.IntervalUnit.YEAR_MONTH), + DAY_TIME(org.apache.arrow.flatbuf.IntervalUnit.DAY_TIME); + + private static final IntervalUnit[] valuesByFlatbufId = new IntervalUnit[IntervalUnit.values().length]; + static { + for (IntervalUnit v : IntervalUnit.values()) { + valuesByFlatbufId[v.flatbufID] = v; + } + } + + private short flatbufID; + + private IntervalUnit(short flatbufID) { + this.flatbufID = flatbufID; + } + + public short getFlatbufID() { + return flatbufID; + } + + public static IntervalUnit fromFlatbufID(short id) { + return valuesByFlatbufId[id]; + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/TimeUnit.java b/java/vector/src/main/java/org/apache/arrow/vector/types/TimeUnit.java new file mode 100644 index 0000000000000..cea9866965854 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/TimeUnit.java @@ -0,0 +1,46 @@ +/******************************************************************************* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ ******************************************************************************/ +package org.apache.arrow.vector.types; + +public enum TimeUnit { + SECOND(org.apache.arrow.flatbuf.TimeUnit.SECOND), + MILLISECOND(org.apache.arrow.flatbuf.TimeUnit.MILLISECOND), + MICROSECOND(org.apache.arrow.flatbuf.TimeUnit.MICROSECOND), + NANOSECOND(org.apache.arrow.flatbuf.TimeUnit.NANOSECOND); + + private static final TimeUnit[] valuesByFlatbufId = new TimeUnit[TimeUnit.values().length]; + static { + for (TimeUnit v : TimeUnit.values()) { + valuesByFlatbufId[v.flatbufID] = v; + } + } + + private final short flatbufID; + + TimeUnit(short flatbufID) { + this.flatbufID = flatbufID; + } + + public short getFlatbufID() { + return flatbufID; + } + + public static TimeUnit fromFlatbufID(short id) { + return valuesByFlatbufId[id]; + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java index d9593673156bf..2a2fb74bee85c 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -17,10 +17,10 @@ */ package org.apache.arrow.vector.types; -import org.apache.arrow.flatbuf.IntervalUnit; -import org.apache.arrow.flatbuf.Precision; -import org.apache.arrow.flatbuf.TimeUnit; -import org.apache.arrow.flatbuf.UnionMode; +import static org.apache.arrow.vector.types.FloatingPointPrecision.DOUBLE; +import static org.apache.arrow.vector.types.FloatingPointPrecision.SINGLE; +import static org.apache.arrow.vector.types.UnionMode.Sparse; + import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.NullableBigIntVector; @@ -81,7 +81,7 @@ import org.apache.arrow.vector.types.pojo.ArrowType.Interval; import org.apache.arrow.vector.types.pojo.ArrowType.List; import org.apache.arrow.vector.types.pojo.ArrowType.Null; -import org.apache.arrow.vector.types.pojo.ArrowType.Struct_; +import org.apache.arrow.vector.types.pojo.ArrowType.Struct; import org.apache.arrow.vector.types.pojo.ArrowType.Time; import org.apache.arrow.vector.types.pojo.ArrowType.Timestamp; import org.apache.arrow.vector.types.pojo.ArrowType.Union; @@ -102,11 +102,11 @@ public class Types { private static final Field UINT8_FIELD = new Field("", true, new Int(64, false), null); private static final Field DATE_FIELD = new Field("", true, Date.INSTANCE, null); private static final Field TIME_FIELD = new Field("", true, Time.INSTANCE, null); - private static final Field TIMESTAMP_FIELD = new Field("", true, new Timestamp(org.apache.arrow.flatbuf.TimeUnit.MILLISECOND), null); + private static final Field TIMESTAMP_FIELD = new Field("", true, new Timestamp(TimeUnit.MILLISECOND), null); private static final Field INTERVALDAY_FIELD = new Field("", true, new Interval(IntervalUnit.DAY_TIME), null); private static final Field INTERVALYEAR_FIELD = new Field("", true, new Interval(IntervalUnit.YEAR_MONTH), null); - private static final Field FLOAT4_FIELD = new Field("", true, new FloatingPoint(Precision.SINGLE), null); - private static final Field FLOAT8_FIELD = new Field("", true, new FloatingPoint(Precision.DOUBLE), null); + private static final Field FLOAT4_FIELD = new Field("", true, new FloatingPoint(FloatingPointPrecision.SINGLE), null); + private static final Field FLOAT8_FIELD = new Field("", true, new FloatingPoint(FloatingPointPrecision.DOUBLE), null); private static final Field VARCHAR_FIELD = new Field("", true, 
Utf8.INSTANCE, null); private static final Field VARBINARY_FIELD = new Field("", true, Binary.INSTANCE, null); private static final Field BIT_FIELD = new Field("", true, Bool.INSTANCE, null); @@ -129,7 +129,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { return null; } }, - MAP(Struct_.INSTANCE) { + MAP(Struct.INSTANCE) { @Override public Field getField() { throw new UnsupportedOperationException("Cannot get simple field for Map type"); @@ -242,7 +242,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { } }, // time in millis from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC. - TIMESTAMP(new Timestamp(org.apache.arrow.flatbuf.TimeUnit.MILLISECOND)) { + TIMESTAMP(new Timestamp(org.apache.arrow.vector.types.TimeUnit.MILLISECOND)) { @Override public Field getField() { return TIMESTAMP_FIELD; @@ -291,7 +291,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { } }, // 4 byte ieee 754 - FLOAT4(new FloatingPoint(Precision.SINGLE)) { + FLOAT4(new FloatingPoint(SINGLE)) { @Override public Field getField() { return FLOAT4_FIELD; @@ -308,7 +308,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { } }, // 8 byte ieee 754 - FLOAT8(new FloatingPoint(Precision.DOUBLE)) { + FLOAT8(new FloatingPoint(DOUBLE)) { @Override public Field getField() { return FLOAT8_FIELD; @@ -472,7 +472,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { return new UnionListWriter((ListVector) vector); } }, - UNION(new Union(UnionMode.Sparse, null)) { + UNION(new Union(Sparse, null)) { @Override public Field getField() { throw new UnsupportedOperationException("Cannot get simple field for Union type"); @@ -512,7 +512,7 @@ public static MinorType getMinorTypeForArrowType(ArrowType arrowType) { return MinorType.NULL; } - @Override public MinorType visit(Struct_ type) { + @Override public MinorType visit(Struct type) { return MinorType.MAP; } @@ -543,11 +543,11 @@ public MinorType visit(Int type) { @Override public MinorType visit(FloatingPoint type) { switch (type.getPrecision()) { - case Precision.HALF: + case HALF: throw new UnsupportedOperationException("NYI: " + type); - case Precision.SINGLE: + case SINGLE: return MinorType.FLOAT4; - case Precision.DOUBLE: + case DOUBLE: return MinorType.FLOAT8; default: throw new IllegalArgumentException("unknown precision: " + type); @@ -588,9 +588,9 @@ public MinorType visit(FloatingPoint type) { @Override public MinorType visit(Interval type) { switch (type.getUnit()) { - case IntervalUnit.DAY_TIME: + case DAY_TIME: return MinorType.INTERVALDAY; - case IntervalUnit.YEAR_MONTH: + case YEAR_MONTH: return MinorType.INTERVALYEAR; default: throw new IllegalArgumentException("unknown unit: " + type); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/UnionMode.java b/java/vector/src/main/java/org/apache/arrow/vector/types/UnionMode.java new file mode 100644 index 0000000000000..8e957bc0b6e34 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/UnionMode.java @@ -0,0 +1,44 @@ +/******************************************************************************* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + ******************************************************************************/ +package org.apache.arrow.vector.types; + +public enum UnionMode { + Sparse(org.apache.arrow.flatbuf.UnionMode.Sparse), + Dense(org.apache.arrow.flatbuf.UnionMode.Dense); + + private static final UnionMode[] valuesByFlatbufId = new UnionMode[UnionMode.values().length]; + static { + for (UnionMode v : UnionMode.values()) { + valuesByFlatbufId[v.flatbufID] = v; + } + } + + private final short flatbufID; + + private UnionMode(short flatbufID) { + this.flatbufID = flatbufID; + } + + public short getFlatbufID() { + return flatbufID; + } + + public static UnionMode fromFlatbufID(short id) { + return valuesByFlatbufId[id]; + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java index 49ba524ab0a4f..412fc54b538da 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java @@ -29,6 +29,7 @@ import com.fasterxml.jackson.annotation.JsonCreator; import com.fasterxml.jackson.annotation.JsonProperty; +import com.google.common.base.Joiner; import com.google.common.collect.ImmutableList; import com.google.flatbuffers.FlatBufferBuilder; @@ -104,7 +105,7 @@ public int getField(FlatBufferBuilder builder) { org.apache.arrow.flatbuf.Field.addName(builder, nameOffset); } org.apache.arrow.flatbuf.Field.addNullable(builder, nullable); - org.apache.arrow.flatbuf.Field.addTypeType(builder, type.getTypeType()); + org.apache.arrow.flatbuf.Field.addTypeType(builder, type.getTypeID().getFlatbufID()); org.apache.arrow.flatbuf.Field.addType(builder, typeOffset); org.apache.arrow.flatbuf.Field.addChildren(builder, childrenOffset); org.apache.arrow.flatbuf.Field.addLayout(builder, layoutOffset); @@ -143,11 +144,18 @@ public boolean equals(Object obj) { (Objects.equals(this.children, that.children) || (this.children == null && that.children.size() == 0) || (this.children.size() == 0 && that.children == null)); - } @Override public String toString() { - return String.format("Field{name=%s, type=%s, children=%s, layout=%s}", name, type, children, typeLayout); + StringBuilder sb = new StringBuilder(); + if (name != null) { + sb.append(name).append(": "); + } + sb.append(type); + if (!children.isEmpty()) { + sb.append("<").append(Joiner.on(", ").join(children)).append(">"); + } + return sb.toString(); } } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java index 44b877eb730d5..5ca8ade7891ee 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java @@ -33,6 +33,7 @@ import com.fasterxml.jackson.databind.ObjectMapper; import com.fasterxml.jackson.databind.ObjectReader; import com.fasterxml.jackson.databind.ObjectWriter; +import com.google.common.base.Joiner; import com.google.common.collect.ImmutableList; import 
com.google.flatbuffers.FlatBufferBuilder; @@ -132,6 +133,6 @@ public boolean equals(Object obj) { @Override public String toString() { - return "Schema" + fields; + return "Schema<" + Joiner.on(", ").join(fields) + ">"; } } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java index 176ad5195b3a1..58312b3f9ff9c 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java @@ -21,15 +21,14 @@ import static org.junit.Assert.assertFalse; import static org.junit.Assert.assertTrue; -import org.apache.arrow.flatbuf.Type; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.DirtyRootAllocator; -import org.apache.arrow.vector.complex.AbstractMapVector; import org.apache.arrow.vector.complex.MapVector; import org.apache.arrow.vector.complex.NullableMapVector; import org.apache.arrow.vector.complex.UnionVector; import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.ArrowType.ArrowTypeID; import org.apache.arrow.vector.types.pojo.Field; import org.junit.After; import org.junit.Before; @@ -112,8 +111,8 @@ public void testPromoteToUnion() throws Exception { Field childField1 = container.getField().getChildren().get(0).getChildren().get(0); Field childField2 = container.getField().getChildren().get(0).getChildren().get(1); - assertEquals("Child field should be union type: " + childField1.getName(), Type.Union, childField1.getType().getTypeType()); - assertEquals("Child field should be decimal type: " + childField2.getName(), Type.Decimal, childField2.getType().getTypeType()); + assertEquals("Child field should be union type: " + childField1.getName(), ArrowTypeID.Union, childField1.getType().getTypeID()); + assertEquals("Child field should be decimal type: " + childField2.getName(), ArrowTypeID.Decimal, childField2.getType().getTypeID()); } } } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java index 6e0e617f299f8..caa438aff4761 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java @@ -41,6 +41,7 @@ import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter; import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter; import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; +import org.apache.arrow.vector.types.pojo.ArrowType.ArrowTypeID; import org.apache.arrow.vector.types.pojo.ArrowType.Int; import org.apache.arrow.vector.types.pojo.ArrowType.Union; import org.apache.arrow.vector.types.pojo.ArrowType.Utf8; @@ -429,7 +430,7 @@ public void promotableWriter() { } Field field = parent.getField().getChildren().get(0).getChildren().get(0); Assert.assertEquals("a", field.getName()); - Assert.assertEquals(Int.TYPE_TYPE, field.getType().getTypeType()); + Assert.assertEquals(Int.TYPE_TYPE, field.getType().getTypeID()); Int intType = (Int) field.getType(); Assert.assertEquals(64, intType.getBitWidth()); @@ -444,9 +445,9 @@ public void promotableWriter() { } field = 
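     // re-read after the type promotion above: the asserts that follow
     // expect the child field to now be a union with int and utf8 children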
parent.getField().getChildren().get(0).getChildren().get(0); Assert.assertEquals("a", field.getName()); - Assert.assertEquals(Union.TYPE_TYPE, field.getType().getTypeType()); - Assert.assertEquals(Int.TYPE_TYPE, field.getChildren().get(0).getType().getTypeType()); - Assert.assertEquals(Utf8.TYPE_TYPE, field.getChildren().get(1).getType().getTypeType()); + Assert.assertEquals(Union.TYPE_TYPE, field.getType().getTypeID()); + Assert.assertEquals(Int.TYPE_TYPE, field.getChildren().get(0).getType().getTypeID()); + Assert.assertEquals(Utf8.TYPE_TYPE, field.getChildren().get(1).getType().getTypeID()); MapReader rootReader = new SingleMapReaderImpl(parent).reader("root"); for (int i = 0; i < 100; i++) { rootReader.setPosition(i); @@ -476,12 +477,12 @@ public void promotableWriterSchema() { Field field = parent.getField().getChildren().get(0).getChildren().get(0); Assert.assertEquals("a", field.getName()); - Assert.assertEquals(Union.TYPE_TYPE, field.getType().getTypeType()); + Assert.assertEquals(ArrowTypeID.Union, field.getType().getTypeID()); - Assert.assertEquals(Int.TYPE_TYPE, field.getChildren().get(0).getType().getTypeType()); + Assert.assertEquals(ArrowTypeID.Int, field.getChildren().get(0).getType().getTypeID()); Int intType = (Int) field.getChildren().get(0).getType(); Assert.assertEquals(64, intType.getBitWidth()); Assert.assertTrue(intType.getIsSigned()); - Assert.assertEquals(Utf8.TYPE_TYPE, field.getChildren().get(1).getType().getTypeType()); + Assert.assertEquals(ArrowTypeID.Utf8, field.getChildren().get(1).getType().getTypeID()); } } \ No newline at end of file diff --git a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java index 3da8db298b4a3..5a238bcc0d0c3 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java @@ -17,17 +17,17 @@ */ package org.apache.arrow.vector.pojo; -import static org.apache.arrow.flatbuf.Precision.DOUBLE; -import static org.apache.arrow.flatbuf.Precision.SINGLE; +import static org.apache.arrow.vector.types.FloatingPointPrecision.DOUBLE; +import static org.apache.arrow.vector.types.FloatingPointPrecision.SINGLE; import static org.junit.Assert.assertEquals; -import org.apache.arrow.flatbuf.TimeUnit; -import org.apache.arrow.flatbuf.UnionMode; +import org.apache.arrow.vector.types.TimeUnit; import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.UnionMode; import org.apache.arrow.vector.types.pojo.ArrowType.FloatingPoint; import org.apache.arrow.vector.types.pojo.ArrowType.Int; import org.apache.arrow.vector.types.pojo.ArrowType.List; -import org.apache.arrow.vector.types.pojo.ArrowType.Struct_; +import org.apache.arrow.vector.types.pojo.ArrowType.Struct; import org.apache.arrow.vector.types.pojo.ArrowType.Timestamp; import org.apache.arrow.vector.types.pojo.ArrowType.Union; import org.apache.arrow.vector.types.pojo.ArrowType.Utf8; @@ -55,7 +55,7 @@ public void complex() { childrenBuilder.add(new Field("child1", true, Utf8.INSTANCE, null)); childrenBuilder.add(new Field("child2", true, new FloatingPoint(SINGLE), ImmutableList.of())); - Field initialField = new Field("a", true, Struct_.INSTANCE, childrenBuilder.build()); + Field initialField = new Field("a", true, Struct.INSTANCE, childrenBuilder.build()); run(initialField); } @@ -73,7 +73,7 @@ public void nestedSchema() { ImmutableList.Builder childrenBuilder = 
ImmutableList.builder();
     childrenBuilder.add(new Field("child1", true, Utf8.INSTANCE, null));
     childrenBuilder.add(new Field("child2", true, new FloatingPoint(SINGLE), ImmutableList.of()));
-    childrenBuilder.add(new Field("child3", true, new Struct_(), ImmutableList.of(
+    childrenBuilder.add(new Field("child3", true, new Struct(), ImmutableList.of(
         new Field("child3.1", true, Utf8.INSTANCE, null),
         new Field("child3.2", true, new FloatingPoint(DOUBLE), ImmutableList.of())
     )));
diff --git a/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java b/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java
index 0ef8be7ef1b8a..d60d17ea76db8 100644
--- a/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java
+++ b/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java
@@ -23,10 +23,10 @@

 import java.io.IOException;

-import org.apache.arrow.flatbuf.IntervalUnit;
-import org.apache.arrow.flatbuf.Precision;
-import org.apache.arrow.flatbuf.TimeUnit;
-import org.apache.arrow.flatbuf.UnionMode;
+import org.apache.arrow.vector.types.FloatingPointPrecision;
+import org.apache.arrow.vector.types.IntervalUnit;
+import org.apache.arrow.vector.types.TimeUnit;
+import org.apache.arrow.vector.types.UnionMode;
 import org.junit.Test;

 public class TestSchema {
@@ -39,15 +39,33 @@ private static Field field(String name, ArrowType type, Field... children) {
     return field(name, true, type, children);
   }

+  @Test
+  public void testComplex() throws IOException {
+    Schema schema = new Schema(asList(
+        field("a", false, new ArrowType.Int(8, true)),
+        field("b", new ArrowType.Struct(),
+            field("c", new ArrowType.Int(16, true)),
+            field("d", new ArrowType.Utf8())),
+        field("e", new ArrowType.List(), field(null, new ArrowType.Date())),
+        field("f", new ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE)),
+        field("g", new ArrowType.Timestamp(TimeUnit.MILLISECOND)),
+        field("h", new ArrowType.Interval(IntervalUnit.DAY_TIME))
+        ));
+    roundTrip(schema);
+    assertEquals(
+        "Schema<a: Int(8, true), b: Struct<c: Int(16, true), d: Utf8>, e: List<Date>, f: FloatingPoint(SINGLE), g: Timestamp(MILLISECOND), h: Interval(DAY_TIME)>",
+        schema.toString());
+  }
+
   @Test
   public void testAll() throws IOException {
     Schema schema = new Schema(asList(
         field("a", false, new ArrowType.Null()),
-        field("b", new ArrowType.Struct_(), field("ba", new ArrowType.Null())),
+        field("b", new ArrowType.Struct(), field("ba", new ArrowType.Null())),
         field("c", new ArrowType.List(), field("ca", new ArrowType.Null())),
         field("d", new ArrowType.Union(UnionMode.Sparse, new int[] {1, 2, 3}), field("da", new ArrowType.Null())),
         field("e", new ArrowType.Int(8, true)),
-        field("f", new ArrowType.FloatingPoint(Precision.SINGLE)),
+        field("f", new ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE)),
         field("g", new ArrowType.Utf8()),
         field("h", new ArrowType.Binary()),
         field("i", new ArrowType.Bool()),
@@ -94,9 +112,9 @@ public void testInterval() throws IOException {
   @Test
   public void testFP() throws IOException {
     Schema schema = new Schema(asList(
-        field("a", new ArrowType.FloatingPoint(Precision.HALF)),
-        field("b", new ArrowType.FloatingPoint(Precision.SINGLE)),
-        field("c", new ArrowType.FloatingPoint(Precision.DOUBLE))
+        field("a", new ArrowType.FloatingPoint(FloatingPointPrecision.HALF)),
+        field("b", new ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE)),
+        field("c", new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE))
         ));
     roundTrip(schema);
     contains(schema, "HALF", "SINGLE", "DOUBLE");

From
b5de9e56db08480050445dd883643448af12b81b Mon Sep 17 00:00:00 2001 From: Bryan Cutler Date: Fri, 2 Dec 2016 14:34:47 -0500 Subject: [PATCH 0214/1644] ARROW-369: [Python] Convert multiple record batches at once to Pandas Modified Pandas adapter to handle columns with multiple chunks with `ConvertColumnToPandas`. This modifies the pyarrow public API by adding a class `RecordBatchList` and static method `toPandas` which takes a list of Arrow RecordBatches and outputs a Pandas DataFrame. Adds unit test in test_table.py to do the conversion for each column with typed specialization. Author: Bryan Cutler Closes #216 from BryanCutler/multi-batch-toPandas-ARROW-369 and squashes the following commits: b6c9986 [Bryan Cutler] fixed formatting edf056e [Bryan Cutler] simplified with pyarrow.schema.Schema.equals 068bc1b [Bryan Cutler] Merge remote-tracking branch 'upstream/master' into multi-batch-toPandas-ARROW-369 da65345 [Bryan Cutler] fixed test case for schema checking 9edb0ba [Bryan Cutler] used auto keyword where some typecasting was done in ConvertValues bd2a720 [Bryan Cutler] added testcase for schema not equal, disabled now c3d7e8f [Bryan Cutler] Changed conversion to make Table from columns first, now conversion is now just a free function 3ee51e6 [Bryan Cutler] cleanup 398b18d [Bryan Cutler] Fixed case for Integer specialization without nulls 7b29a55 [Bryan Cutler] Initial working version of RecordBatch list to_pandas, need more tests and cleanup --- python/pyarrow/__init__.py | 4 +- python/pyarrow/includes/libarrow.pxd | 3 + python/pyarrow/table.pyx | 47 ++++++ python/pyarrow/tests/test_table.py | 35 +++++ python/src/pyarrow/adapters/pandas.cc | 209 ++++++++++++++++---------- 5 files changed, 219 insertions(+), 79 deletions(-) diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 775ce7ec47578..d4d0f00c52346 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -41,5 +41,7 @@ list_, struct, field, DataType, Field, Schema, schema) -from pyarrow.table import Column, RecordBatch, Table, from_pandas_dataframe +from pyarrow.table import (Column, RecordBatch, dataframe_from_batches, Table, + from_pandas_dataframe) + from pyarrow.version import version as __version__ diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 19da4085e1bb9..350ebe30c9b89 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -158,6 +158,9 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: CColumn(const shared_ptr[CField]& field, const shared_ptr[CArray]& data) + CColumn(const shared_ptr[CField]& field, + const vector[shared_ptr[CArray]]& chunks) + int64_t length() int64_t null_count() const c_string& name() diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index a6715b141ce41..45cf7becceefa 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -28,6 +28,7 @@ cimport pyarrow.includes.pyarrow as pyarrow import pyarrow.config from pyarrow.array cimport Array, box_arrow_array +from pyarrow.error import ArrowException from pyarrow.error cimport check_status from pyarrow.schema cimport box_data_type, box_schema @@ -414,6 +415,52 @@ cdef class RecordBatch: return result +def dataframe_from_batches(batches): + """ + Convert a list of Arrow RecordBatches to a pandas.DataFrame + + Parameters + ---------- + + batches: list of RecordBatch + RecordBatch list to be converted, schemas must be equal + """ + + cdef: + vector[shared_ptr[CArray]] c_array_chunks + 
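+        # c_array_chunks is a per-column scratch vector: one chunk is pushed
+        # for every input batch, and it is cleared before the next column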
vector[shared_ptr[CColumn]] c_columns + shared_ptr[CTable] c_table + Array arr + Schema schema + + import pandas as pd + + schema = batches[0].schema + + # check schemas are equal + if any((not schema.equals(other.schema) for other in batches[1:])): + raise ArrowException("Error converting list of RecordBatches to " + "DataFrame, not all schemas are equal") + + cdef int K = batches[0].num_columns + + # create chunked columns from the batches + c_columns.resize(K) + for i in range(K): + for batch in batches: + arr = batch[i] + c_array_chunks.push_back(arr.sp_array) + c_columns[i].reset(new CColumn(schema.sp_schema.get().field(i), + c_array_chunks)) + c_array_chunks.clear() + + # create a Table from columns and convert to DataFrame + c_table.reset(new CTable('', schema.sp_schema, c_columns)) + table = Table() + table.init(c_table) + return table.to_pandas() + + cdef class Table: """ A collection of top-level named, equal length Arrow arrays. diff --git a/python/pyarrow/tests/test_table.py b/python/pyarrow/tests/test_table.py index 4c9d302106af8..dc4f37a830e5f 100644 --- a/python/pyarrow/tests/test_table.py +++ b/python/pyarrow/tests/test_table.py @@ -19,6 +19,7 @@ from pandas.util.testing import assert_frame_equal import pandas as pd +import pytest import pyarrow as pa @@ -50,6 +51,40 @@ def test_recordbatch_from_to_pandas(): assert_frame_equal(data, result) +def test_recordbatchlist_to_pandas(): + data1 = pd.DataFrame({ + 'c1': np.array([1, 1, 2], dtype='uint32'), + 'c2': np.array([1.0, 2.0, 3.0], dtype='float64'), + 'c3': [True, None, False], + 'c4': ['foo', 'bar', None] + }) + + data2 = pd.DataFrame({ + 'c1': np.array([3, 5], dtype='uint32'), + 'c2': np.array([4.0, 5.0], dtype='float64'), + 'c3': [True, True], + 'c4': ['baz', 'qux'] + }) + + batch1 = pa.RecordBatch.from_pandas(data1) + batch2 = pa.RecordBatch.from_pandas(data2) + + result = pa.dataframe_from_batches([batch1, batch2]) + data = pd.concat([data1, data2], ignore_index=True) + assert_frame_equal(data, result) + + +def test_recordbatchlist_schema_equals(): + data1 = pd.DataFrame({'c1': np.array([1], dtype='uint32')}) + data2 = pd.DataFrame({'c1': np.array([4.0, 5.0], dtype='float64')}) + + batch1 = pa.RecordBatch.from_pandas(data1) + batch2 = pa.RecordBatch.from_pandas(data2) + + with pytest.raises(pa.ArrowException): + pa.dataframe_from_batches([batch1, batch2]) + + def test_table_basics(): data = [ pa.from_pylist(range(5)), diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index 1f5b7009e6a08..adb27e83ef120 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -597,14 +597,10 @@ class ArrowDeserializer { Status Convert(PyObject** out) { const std::shared_ptr data = col_->data(); - if (data->num_chunks() > 1) { - return Status::NotImplemented("Chunked column conversion NYI"); - } - - auto chunk = data->chunk(0); - RETURN_NOT_OK(ConvertValues(chunk)); + RETURN_NOT_OK(ConvertValues(data)); *out = reinterpret_cast(out_); + return Status::OK(); } @@ -653,28 +649,49 @@ class ArrowDeserializer { return Status::OK(); } + template + Status ConvertValuesZeroCopy(std::shared_ptr arr) { + typedef typename arrow_traits::T T; + + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + + // Zero-Copy. We can pass the data pointer directly to NumPy. 
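+    // Callers only take this path for a single-chunk column with a zero null
+    // count, so the Arrow buffer can back the NumPy array without a copy.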
+ void* data = const_cast(in_values); + int type = arrow_traits::npy_type; + RETURN_NOT_OK(OutputFromData(type, data)); + + return Status::OK(); + } + template inline typename std::enable_if< arrow_traits::is_pandas_numeric_nullable, Status>::type - ConvertValues(const std::shared_ptr& arr) { + ConvertValues(const std::shared_ptr& data) { typedef typename arrow_traits::T T; + size_t chunk_offset = 0; - arrow::PrimitiveArray* prim_arr = static_cast( - arr.get()); - const T* in_values = reinterpret_cast(prim_arr->data()->data()); + if (data->num_chunks() == 1 && data->null_count() == 0) { + return ConvertValuesZeroCopy(data->chunk(0)); + } + + RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); - if (arr->null_count() > 0) { - RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + for (int c = 0; c < data->num_chunks(); c++) { + const std::shared_ptr arr = data->chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + auto out_values = reinterpret_cast(PyArray_DATA(out_)) + chunk_offset; - T* out_values = reinterpret_cast(PyArray_DATA(out_)); - for (int64_t i = 0; i < arr->length(); ++i) { - out_values[i] = arr->IsNull(i) ? arrow_traits::na_value : in_values[i]; + if (arr->null_count() > 0) { + for (int64_t i = 0; i < arr->length(); ++i) { + out_values[i] = arr->IsNull(i) ? arrow_traits::na_value : in_values[i]; + } + } else { + memcpy(out_values, in_values, sizeof(T) * arr->length()); } - } else { - // Zero-Copy. We can pass the data pointer directly to NumPy. - void* data = const_cast(in_values); - int type = arrow_traits::npy_type; - RETURN_NOT_OK(OutputFromData(type, data)); + + chunk_offset += arr->length(); } return Status::OK(); @@ -684,27 +701,43 @@ class ArrowDeserializer { template inline typename std::enable_if< arrow_traits::is_pandas_numeric_not_nullable, Status>::type - ConvertValues(const std::shared_ptr& arr) { + ConvertValues(const std::shared_ptr& data) { typedef typename arrow_traits::T T; + size_t chunk_offset = 0; - arrow::PrimitiveArray* prim_arr = static_cast( - arr.get()); - - const T* in_values = reinterpret_cast(prim_arr->data()->data()); + if (data->num_chunks() == 1 && data->null_count() == 0) { + return ConvertValuesZeroCopy(data->chunk(0)); + } - if (arr->null_count() > 0) { + if (data->null_count() > 0) { RETURN_NOT_OK(AllocateOutput(NPY_FLOAT64)); - // Upcast to double, set NaN as appropriate - double* out_values = reinterpret_cast(PyArray_DATA(out_)); - for (int i = 0; i < arr->length(); ++i) { - out_values[i] = prim_arr->IsNull(i) ? NAN : in_values[i]; + for (int c = 0; c < data->num_chunks(); c++) { + const std::shared_ptr arr = data->chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + // Upcast to double, set NaN as appropriate + auto out_values = reinterpret_cast(PyArray_DATA(out_)) + chunk_offset; + + for (int i = 0; i < arr->length(); ++i) { + out_values[i] = prim_arr->IsNull(i) ? NAN : in_values[i]; + } + + chunk_offset += arr->length(); } } else { - // Zero-Copy. We can pass the data pointer directly to NumPy. 
- void* data = const_cast(in_values); - int type = arrow_traits::npy_type; - RETURN_NOT_OK(OutputFromData(type, data)); + RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + + for (int c = 0; c < data->num_chunks(); c++) { + const std::shared_ptr arr = data->chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + auto out_values = reinterpret_cast(PyArray_DATA(out_)) + chunk_offset; + + memcpy(out_values, in_values, sizeof(T) * arr->length()); + + chunk_offset += arr->length(); + } } return Status::OK(); @@ -714,35 +747,48 @@ class ArrowDeserializer { template inline typename std::enable_if< arrow_traits::is_boolean, Status>::type - ConvertValues(const std::shared_ptr& arr) { + ConvertValues(const std::shared_ptr& data) { + size_t chunk_offset = 0; PyAcquireGIL lock; - arrow::BooleanArray* bool_arr = static_cast(arr.get()); - - if (arr->null_count() > 0) { + if (data->null_count() > 0) { RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); - PyObject** out_values = reinterpret_cast(PyArray_DATA(out_)); - for (int64_t i = 0; i < arr->length(); ++i) { - if (bool_arr->IsNull(i)) { - Py_INCREF(Py_None); - out_values[i] = Py_None; - } else if (bool_arr->Value(i)) { - // True - Py_INCREF(Py_True); - out_values[i] = Py_True; - } else { - // False - Py_INCREF(Py_False); - out_values[i] = Py_False; + for (int c = 0; c < data->num_chunks(); c++) { + const std::shared_ptr arr = data->chunk(c); + auto bool_arr = static_cast(arr.get()); + auto out_values = reinterpret_cast(PyArray_DATA(out_)) + chunk_offset; + + for (int64_t i = 0; i < arr->length(); ++i) { + if (bool_arr->IsNull(i)) { + Py_INCREF(Py_None); + out_values[i] = Py_None; + } else if (bool_arr->Value(i)) { + // True + Py_INCREF(Py_True); + out_values[i] = Py_True; + } else { + // False + Py_INCREF(Py_False); + out_values[i] = Py_False; + } } + + chunk_offset += bool_arr->length(); } } else { RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); - uint8_t* out_values = reinterpret_cast(PyArray_DATA(out_)); - for (int64_t i = 0; i < arr->length(); ++i) { - out_values[i] = static_cast(bool_arr->Value(i)); + for (int c = 0; c < data->num_chunks(); c++) { + const std::shared_ptr arr = data->chunk(c); + auto bool_arr = static_cast(arr.get()); + auto out_values = reinterpret_cast(PyArray_DATA(out_)) + chunk_offset; + + for (int64_t i = 0; i < arr->length(); ++i) { + out_values[i] = static_cast(bool_arr->Value(i)); + } + + chunk_offset += bool_arr->length(); } } @@ -753,42 +799,49 @@ class ArrowDeserializer { template inline typename std::enable_if< T2 == arrow::Type::STRING, Status>::type - ConvertValues(const std::shared_ptr& arr) { + ConvertValues(const std::shared_ptr& data) { + size_t chunk_offset = 0; PyAcquireGIL lock; RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); - PyObject** out_values = reinterpret_cast(PyArray_DATA(out_)); - - arrow::StringArray* string_arr = static_cast(arr.get()); - - const uint8_t* data; - int32_t length; - if (arr->null_count() > 0) { - for (int64_t i = 0; i < arr->length(); ++i) { - if (string_arr->IsNull(i)) { - Py_INCREF(Py_None); - out_values[i] = Py_None; - } else { - data = string_arr->GetValue(i, &length); - - out_values[i] = make_pystring(data, length); + for (int c = 0; c < data->num_chunks(); c++) { + const std::shared_ptr arr = data->chunk(c); + auto string_arr = static_cast(arr.get()); + auto out_values = reinterpret_cast(PyArray_DATA(out_)) + chunk_offset; + + const uint8_t* data_ptr; + int32_t length; + if (data->null_count() > 0) { + for (int64_t i 
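+      // box every value of this chunk into a PyObject; null slots are
+      // filled with Py_None below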
= 0; i < arr->length(); ++i) { + if (string_arr->IsNull(i)) { + Py_INCREF(Py_None); + out_values[i] = Py_None; + } else { + data_ptr = string_arr->GetValue(i, &length); + + out_values[i] = make_pystring(data_ptr, length); + if (out_values[i] == nullptr) { + return Status::UnknownError("String initialization failed"); + } + } + } + } else { + for (int64_t i = 0; i < arr->length(); ++i) { + data_ptr = string_arr->GetValue(i, &length); + out_values[i] = make_pystring(data_ptr, length); if (out_values[i] == nullptr) { return Status::UnknownError("String initialization failed"); } } } - } else { - for (int64_t i = 0; i < arr->length(); ++i) { - data = string_arr->GetValue(i, &length); - out_values[i] = make_pystring(data, length); - if (out_values[i] == nullptr) { - return Status::UnknownError("String initialization failed"); - } - } + + chunk_offset += string_arr->length(); } + return Status::OK(); } + private: std::shared_ptr col_; PyObject* py_ref_; From 0ac01a5bf6747a5855d20632c9c7874483b9830a Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sun, 4 Dec 2016 10:50:43 -0500 Subject: [PATCH 0215/1644] ARROW-379: Use setuptools_scm for Python versioning Author: Uwe L. Korn Closes #224 from xhochy/ARROW-379 and squashes the following commits: 3a68d9f [Uwe L. Korn] Remove deprecated version import 15fe9b2 [Uwe L. Korn] Add license header aa9bd49 [Uwe L. Korn] ARROW-379: Use setuptools_scm for Python versioning --- dev/release/00-prepare.sh | 5 ----- python/.git_archival.txt | 1 + python/.gitattributes | 1 + python/pyarrow/__init__.py | 10 ++++++++-- python/setup.cfg | 20 ++++++++++++++++++++ python/setup.py | 23 ++--------------------- 6 files changed, 32 insertions(+), 28 deletions(-) create mode 100644 python/.git_archival.txt create mode 100644 python/.gitattributes create mode 100644 python/setup.cfg diff --git a/dev/release/00-prepare.sh b/dev/release/00-prepare.sh index 3423a3e6c5bf9..00af5e7768161 100644 --- a/dev/release/00-prepare.sh +++ b/dev/release/00-prepare.sh @@ -43,9 +43,4 @@ mvn release:prepare -Dtag=${tag} -DreleaseVersion=${version} -DautoVersionSubmod cd - -cd "${SOURCE_DIR}/../../python" -sed -i "s/VERSION = '[^']*'/VERSION = '${version}'/g" setup.py -sed -i "s/ISRELEASED = False/ISRELEASED = True/g" setup.py -cd - - echo "Finish staging binary artifacts by running: sh dev/release/01-perform.sh" diff --git a/python/.git_archival.txt b/python/.git_archival.txt new file mode 100644 index 0000000000000..95cb3eea4e336 --- /dev/null +++ b/python/.git_archival.txt @@ -0,0 +1 @@ +ref-names: $Format:%D$ diff --git a/python/.gitattributes b/python/.gitattributes new file mode 100644 index 0000000000000..00a7b00c94e08 --- /dev/null +++ b/python/.gitattributes @@ -0,0 +1 @@ +.git_archival.txt export-subst diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index d4d0f00c52346..f366317d04c96 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -17,6 +17,14 @@ # flake8: noqa +from pkg_resources import get_distribution, DistributionNotFound +try: + __version__ = get_distribution(__name__).version +except DistributionNotFound: + # package is not installed + pass + + import pyarrow.config from pyarrow.array import (Array, @@ -43,5 +51,3 @@ from pyarrow.table import (Column, RecordBatch, dataframe_from_batches, Table, from_pandas_dataframe) - -from pyarrow.version import version as __version__ diff --git a/python/setup.cfg b/python/setup.cfg new file mode 100644 index 0000000000000..caae3e081b6ca --- /dev/null +++ b/python/setup.cfg @@ -0,0 +1,20 @@ 
+# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +[build_sphinx] +source-dir = doc/ +build-dir = doc/_build diff --git a/python/setup.py b/python/setup.py index 341cc64aa2cc8..0f6bbda6ec3aa 100644 --- a/python/setup.py +++ b/python/setup.py @@ -42,27 +42,9 @@ if Cython.__version__ < '0.19.1': raise Exception('Please upgrade to Cython 0.19.1 or newer') -VERSION = '0.1.0' -ISRELEASED = False - -if not ISRELEASED: - VERSION += '.dev' - setup_dir = os.path.abspath(os.path.dirname(__file__)) -def write_version_py(filename=os.path.join(setup_dir, 'pyarrow/version.py')): - a = open(filename, 'w') - file_content = "\n".join(["", - "# THIS FILE IS GENERATED FROM SETUP.PY", - "version = '%(version)s'", - "isrelease = '%(isrelease)s'"]) - - a.write(file_content % {'version': VERSION, - 'isrelease': str(ISRELEASED)}) - a.close() - - class clean(_clean): def run(self): @@ -261,15 +243,12 @@ def get_outputs(self): return [self._get_cmake_ext_path(name) for name in self.get_names()] -write_version_py() - DESC = """\ Python library for Apache Arrow""" setup( name="pyarrow", packages=['pyarrow', 'pyarrow.tests'], - version=VERSION, zip_safe=False, package_data={'pyarrow': ['*.pxd', '*.pyx']}, # Dummy extension to trigger build_ext @@ -279,6 +258,8 @@ def get_outputs(self): 'clean': clean, 'build_ext': build_ext }, + use_scm_version = {"root": "..", "relative_to": __file__}, + setup_requires=['setuptools_scm', 'setuptools_scm_git_archive'], install_requires=['cython >= 0.23', 'numpy >= 1.9', 'six >= 1.0.0'], description=DESC, license='Apache License, Version 2.0', From 599d516a7306de4d1f9d7e0ddc888f13026efd49 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Mon, 5 Dec 2016 14:56:56 -0800 Subject: [PATCH 0216/1644] =?UTF-8?q?ARROW-401:=20Floating=20point=20vecto?= =?UTF-8?q?rs=20should=20do=20an=20approximate=20comparison=E2=80=A6?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit … in integration tests Author: Julien Le Dem Closes #223 from julienledem/arrow_401 and squashes the following commits: a9ee84d [Julien Le Dem] review feedback da64ca0 [Julien Le Dem] ARROW-401: Floating point vectors should do an approximate comparison in integration tests --- .../org/apache/arrow/tools/Integration.java | 51 ++++++++++- .../apache/arrow/tools/TestIntegration.java | 84 ++++++++++++++++++- 2 files changed, 130 insertions(+), 5 deletions(-) diff --git a/java/tools/src/main/java/org/apache/arrow/tools/Integration.java b/java/tools/src/main/java/org/apache/arrow/tools/Integration.java index 85af30da1e8ae..fd835a63a11ac 100644 --- a/java/tools/src/main/java/org/apache/arrow/tools/Integration.java +++ b/java/tools/src/main/java/org/apache/arrow/tools/Integration.java @@ -39,6 +39,8 @@ import org.apache.arrow.vector.file.json.JsonFileReader; 
import org.apache.arrow.vector.file.json.JsonFileWriter; import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.ArrowType.FloatingPoint; import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.types.pojo.Schema; import org.apache.commons.cli.CommandLine; @@ -247,7 +249,7 @@ private static void compare(VectorSchemaRoot arrowRoot, VectorSchemaRoot jsonRoo for (int j = 0; j < valueCount; j++) { Object arrow = arrowVector.getAccessor().getObject(j); Object json = jsonVector.getAccessor().getObject(j); - if (!Objects.equal(arrow, json)) { + if (!equals(field.getType(), arrow, json)) { throw new IllegalArgumentException( "Different values in column:\n" + field + " at index " + j + ": " + arrow + " != " + json); } @@ -255,6 +257,53 @@ private static void compare(VectorSchemaRoot arrowRoot, VectorSchemaRoot jsonRoo } } + private static boolean equals(ArrowType type, final Object arrow, final Object json) { + if (type instanceof ArrowType.FloatingPoint) { + FloatingPoint fpType = (FloatingPoint) type; + switch (fpType.getPrecision()) { + case DOUBLE: + return equalEnough((Double)arrow, (Double)json); + case SINGLE: + return equalEnough((Float)arrow, (Float)json); + case HALF: + default: + throw new UnsupportedOperationException("unsupported precision: " + fpType); + } + } + return Objects.equal(arrow, json); + } + + static boolean equalEnough(Float f1, Float f2) { + if (f1 == null || f2 == null) { + return f1 == null && f2 == null; + } + if (f1.isNaN()) { + return f2.isNaN(); + } + if (f1.isInfinite()) { + return f2.isInfinite() && Math.signum(f1) == Math.signum(f2); + } + float average = Math.abs((f1 + f2) / 2); + float differenceScaled = Math.abs(f1 - f2) / (average == 0.0f ? 1f : average); + return differenceScaled < 1.0E-6f; + } + + static boolean equalEnough(Double f1, Double f2) { + if (f1 == null || f2 == null) { + return f1 == null && f2 == null; + } + if (f1.isNaN()) { + return f2.isNaN(); + } + if (f1.isInfinite()) { + return f2.isInfinite() && Math.signum(f1) == Math.signum(f2); + } + double average = Math.abs((f1 + f2) / 2); + double differenceScaled = Math.abs(f1 - f2) / (average == 0.0d ? 
1d : average); + return differenceScaled < 1.0E-12d; + } + + private static void compareSchemas(Schema jsonSchema, Schema arrowSchema) { if (!arrowSchema.equals(jsonSchema)) { throw new IllegalArgumentException("Different schemas:\n" + arrowSchema + "\n" + jsonSchema); diff --git a/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java b/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java index 464144b95a1aa..ee6196b74e0dc 100644 --- a/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java +++ b/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java @@ -22,6 +22,10 @@ import static org.apache.arrow.tools.ArrowFileTestFixtures.write; import static org.apache.arrow.tools.ArrowFileTestFixtures.writeData; import static org.apache.arrow.tools.ArrowFileTestFixtures.writeInput; +import static org.apache.arrow.tools.Integration.equalEnough; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertFalse; +import static org.junit.Assert.assertTrue; import static org.junit.Assert.fail; import java.io.BufferedReader; @@ -39,9 +43,9 @@ import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter; import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; import org.apache.arrow.vector.complex.writer.BigIntWriter; +import org.apache.arrow.vector.complex.writer.Float8Writer; import org.apache.arrow.vector.complex.writer.IntWriter; import org.junit.After; -import org.junit.Assert; import org.junit.Before; import org.junit.Rule; import org.junit.Test; @@ -121,7 +125,7 @@ public void testJSONRoundTripWithVariableWidth() throws Exception { String i, o; int j = 0; while ((i = orig.readLine()) != null && (o = rt.readLine()) != null) { - Assert.assertEquals("line: " + j, i, o); + assertEquals("line: " + j, i, o); ++j; } } @@ -142,6 +146,33 @@ private BufferedReader readNormalized(File f) throws IOException { } + /** + * the test should not be sensitive to small variations in float representation + */ + @Test + public void testFloat() throws Exception { + File testValidInFile = testFolder.newFile("testValidFloatIn.arrow"); + File testInvalidInFile = testFolder.newFile("testAlsoValidFloatIn.arrow"); + File testJSONFile = testFolder.newFile("testValidOut.json"); + testJSONFile.delete(); + + // generate an arrow file + writeInputFloat(testValidInFile, allocator, 912.4140000000002, 912.414); + // generate a different arrow file + writeInputFloat(testInvalidInFile, allocator, 912.414, 912.4140000000002); + + Integration integration = new Integration(); + + // convert the "valid" file to json + String[] args1 = { "-arrow", testValidInFile.getAbsolutePath(), "-json", testJSONFile.getAbsolutePath(), "-command", Command.ARROW_TO_JSON.name()}; + integration.run(args1); + + // compare the "invalid" file to the "valid" json + String[] args3 = { "-arrow", testInvalidInFile.getAbsolutePath(), "-json", testJSONFile.getAbsolutePath(), "-command", Command.VALIDATE.name()}; + // this should fail + integration.run(args3); + } + @Test public void testInvalid() throws Exception { File testValidInFile = testFolder.newFile("testValidIn.arrow"); @@ -167,12 +198,28 @@ public void testInvalid() throws Exception { integration.run(args3); fail("should have failed"); } catch (IllegalArgumentException e) { - Assert.assertTrue(e.getMessage(), e.getMessage().contains("Different values in column")); - Assert.assertTrue(e.getMessage(), e.getMessage().contains("999")); + assertTrue(e.getMessage(), e.getMessage().contains("Different values in 
column")); + assertTrue(e.getMessage(), e.getMessage().contains("999")); } } + static void writeInputFloat(File testInFile, BufferAllocator allocator, double... f) throws FileNotFoundException, IOException { + try ( + BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", vectorAllocator, null)) { + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + Float8Writer floatWriter = rootWriter.float8("float"); + for (int i = 0; i < f.length; i++) { + floatWriter.setPosition(i); + floatWriter.writeFloat8(f[i]); + } + writer.setValueCount(f.length); + write(parent.getChild("root"), testInFile); + } + } + static void writeInput2(File testInFile, BufferAllocator allocator) throws FileNotFoundException, IOException { int count = ArrowFileTestFixtures.COUNT; try ( @@ -192,4 +239,33 @@ static void writeInput2(File testInFile, BufferAllocator allocator) throws FileN } } + @Test + public void testFloatComp() { + assertTrue(equalEnough(912.4140000000002F, 912.414F)); + assertTrue(equalEnough(912.4140000000002D, 912.414D)); + assertTrue(equalEnough(912.414F, 912.4140000000002F)); + assertTrue(equalEnough(912.414D, 912.4140000000002D)); + assertFalse(equalEnough(912.414D, 912.4140001D)); + assertFalse(equalEnough(null, 912.414D)); + assertTrue(equalEnough((Float)null, null)); + assertTrue(equalEnough((Double)null, null)); + assertFalse(equalEnough(912.414D, null)); + assertFalse(equalEnough(Double.MAX_VALUE, Double.MIN_VALUE)); + assertFalse(equalEnough(Double.MIN_VALUE, Double.MAX_VALUE)); + assertTrue(equalEnough(Double.MAX_VALUE, Double.MAX_VALUE)); + assertTrue(equalEnough(Double.MIN_VALUE, Double.MIN_VALUE)); + assertTrue(equalEnough(Double.NEGATIVE_INFINITY, Double.NEGATIVE_INFINITY)); + assertFalse(equalEnough(Double.NEGATIVE_INFINITY, Double.POSITIVE_INFINITY)); + assertTrue(equalEnough(Double.NaN, Double.NaN)); + assertFalse(equalEnough(1.0, Double.NaN)); + assertFalse(equalEnough(Float.MAX_VALUE, Float.MIN_VALUE)); + assertFalse(equalEnough(Float.MIN_VALUE, Float.MAX_VALUE)); + assertTrue(equalEnough(Float.MAX_VALUE, Float.MAX_VALUE)); + assertTrue(equalEnough(Float.MIN_VALUE, Float.MIN_VALUE)); + assertTrue(equalEnough(Float.NEGATIVE_INFINITY, Float.NEGATIVE_INFINITY)); + assertFalse(equalEnough(Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY)); + assertTrue(equalEnough(Float.NaN, Float.NaN)); + assertFalse(equalEnough(1.0F, Float.NaN)); + } + } From 82575ca3c22db18b2ea69f248b471a0317042b38 Mon Sep 17 00:00:00 2001 From: vkorukanti Date: Mon, 5 Dec 2016 21:28:14 -0800 Subject: [PATCH 0217/1644] ARROW-403: [Java] Create transfer pairs for internal vectors in UnionVector transfer impl @StevenMPhillips, @julienledem Could you please review the patch? 
Author: vkorukanti Closes #225 from vkorukanti/union_vector_schema and squashes the following commits: 431874f [vkorukanti] ARROW-403: [Java] Create transfer pairs for internal vectors in UnionVector transfer impl --- .../main/codegen/templates/UnionVector.java | 19 +++---- .../apache/arrow/vector/TestUnionVector.java | 54 +++++++++++++++++++ 2 files changed, 64 insertions(+), 9 deletions(-) diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java index ea1fdf6bd60fb..4e68b681d1404 100644 --- a/java/vector/src/main/codegen/templates/UnionVector.java +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -246,12 +246,6 @@ public TransferPair makeTransferPair(ValueVector target) { return new TransferImpl((UnionVector) target); } - public void transferTo(org.apache.arrow.vector.complex.UnionVector target) { - typeVector.makeTransferPair(target.typeVector).transfer(); - internalMap.makeTransferPair(target.internalMap).transfer(); - target.valueCount = valueCount; - } - public void copyFrom(int inIndex, int outIndex, UnionVector from) { from.getReader().setPosition(inIndex); getWriter().setPosition(outIndex); @@ -275,20 +269,27 @@ public FieldVector addVector(FieldVector v) { } private class TransferImpl implements TransferPair { - - UnionVector to; + private final TransferPair internalMapVectorTransferPair; + private final TransferPair typeVectorTransferPair; + private final UnionVector to; public TransferImpl(String name, BufferAllocator allocator) { to = new UnionVector(name, allocator, null); + internalMapVectorTransferPair = internalMap.makeTransferPair(to.internalMap); + typeVectorTransferPair = typeVector.makeTransferPair(to.typeVector); } public TransferImpl(UnionVector to) { this.to = to; + internalMapVectorTransferPair = internalMap.makeTransferPair(to.internalMap); + typeVectorTransferPair = typeVector.makeTransferPair(to.typeVector); } @Override public void transfer() { - transferTo(to); + internalMapVectorTransferPair.transfer(); + typeVectorTransferPair.transfer(); + to.valueCount = valueCount; } @Override diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestUnionVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestUnionVector.java index 1bb50b73a9057..a5b90ee90b8f9 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestUnionVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestUnionVector.java @@ -21,8 +21,12 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.complex.UnionVector; +import org.apache.arrow.vector.holders.NullableBitHolder; +import org.apache.arrow.vector.holders.NullableIntHolder; import org.apache.arrow.vector.holders.NullableUInt4Holder; import org.apache.arrow.vector.types.Types; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.util.TransferPair; import org.junit.After; import org.junit.Before; import org.junit.Test; @@ -76,4 +80,54 @@ public void testUnionVector() throws Exception { } } + @Test + public void testTransfer() throws Exception { + try (UnionVector srcVector = new UnionVector(EMPTY_SCHEMA_PATH, allocator, null)) { + srcVector.allocateNew(); + + // write some data + final UnionVector.Mutator mutator = srcVector.getMutator(); + mutator.setType(0, MinorType.INT); + mutator.setSafe(0, newIntHolder(5)); + mutator.setType(1, MinorType.BIT); + mutator.setSafe(1, newBitHolder(false)); + mutator.setType(3, MinorType.INT); + mutator.setSafe(3, 
newIntHolder(10)); + mutator.setType(5, MinorType.BIT); + mutator.setSafe(5, newBitHolder(false)); + mutator.setValueCount(6); + + try(UnionVector destVector = new UnionVector(EMPTY_SCHEMA_PATH, allocator, null)) { + TransferPair pair = srcVector.makeTransferPair(destVector); + + // Creating the transfer should transfer the type of the field at least. + assertEquals(srcVector.getField(), destVector.getField()); + + // transfer + pair.transfer(); + + assertEquals(srcVector.getField(), destVector.getField()); + + // now check the values are transferred + assertEquals(srcVector.getAccessor().getValueCount(), destVector.getAccessor().getValueCount()); + for(int i=0; i Date: Tue, 6 Dec 2016 11:41:08 -0500 Subject: [PATCH 0218/1644] ARROW-406: [C++] Set explicit 64K HDFS buffer size, test large reads We could not support reads in excess of the default buffer size (typically 64K) Author: Wes McKinney Closes #226 from wesm/ARROW-406 and squashes the following commits: d09b645 [Wes McKinney] cpplint 0028e90 [Wes McKinney] Set explicit 64K HDFS buffer size, test large reads using buffered chunks --- cpp/src/arrow/io/hdfs.cc | 33 +++++++++++++++++++++++++------- cpp/src/arrow/io/hdfs.h | 3 +++ cpp/src/arrow/io/io-hdfs-test.cc | 33 ++++++++++++++++++++++++++++++++ 3 files changed, 62 insertions(+), 7 deletions(-) diff --git a/cpp/src/arrow/io/hdfs.cc b/cpp/src/arrow/io/hdfs.cc index 13491e780e21b..8c6d49f92e606 100644 --- a/cpp/src/arrow/io/hdfs.cc +++ b/cpp/src/arrow/io/hdfs.cc @@ -17,6 +17,7 @@ #include +#include #include #include #include @@ -51,6 +52,8 @@ static Status CheckReadResult(int ret) { return Status::OK(); } +static constexpr int kDefaultHdfsBufferSize = 1 << 16; + // ---------------------------------------------------------------------- // File reading @@ -124,9 +127,16 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { } Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) { - tSize ret = hdfsRead(fs_, file_, reinterpret_cast(buffer), nbytes); - RETURN_NOT_OK(CheckReadResult(ret)); - *bytes_read = ret; + int64_t total_bytes = 0; + while (total_bytes < nbytes) { + tSize ret = hdfsRead(fs_, file_, reinterpret_cast(buffer + total_bytes), + std::min(buffer_size_, nbytes - total_bytes)); + RETURN_NOT_OK(CheckReadResult(ret)); + total_bytes += ret; + if (ret == 0) { break; } + } + + *bytes_read = total_bytes; return Status::OK(); } @@ -136,7 +146,6 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { int64_t bytes_read = 0; RETURN_NOT_OK(Read(nbytes, &bytes_read, buffer->mutable_data())); - if (bytes_read < nbytes) { RETURN_NOT_OK(buffer->Resize(bytes_read)); } *out = buffer; @@ -154,8 +163,11 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { void set_memory_pool(MemoryPool* pool) { pool_ = pool; } + void set_buffer_size(int32_t buffer_size) { buffer_size_ = buffer_size; } + private: MemoryPool* pool_; + int32_t buffer_size_; }; HdfsReadableFile::HdfsReadableFile(MemoryPool* pool) { @@ -384,8 +396,9 @@ class HdfsClient::HdfsClientImpl { return Status::OK(); } - Status OpenReadable(const std::string& path, std::shared_ptr* file) { - hdfsFile handle = hdfsOpenFile(fs_, path.c_str(), O_RDONLY, 0, 0, 0); + Status OpenReadable(const std::string& path, int32_t buffer_size, + std::shared_ptr* file) { + hdfsFile handle = hdfsOpenFile(fs_, path.c_str(), O_RDONLY, buffer_size, 0, 0); if (handle == nullptr) { // TODO(wesm): determine cause of failure @@ -397,6 +410,7 @@ class HdfsClient::HdfsClientImpl { // 
std::make_shared does not work with private ctors *file = std::shared_ptr(new HdfsReadableFile()); (*file)->impl_->set_members(path, fs_, handle); + (*file)->impl_->set_buffer_size(buffer_size); return Status::OK(); } @@ -490,9 +504,14 @@ Status HdfsClient::ListDirectory( return impl_->ListDirectory(path, listing); } +Status HdfsClient::OpenReadable(const std::string& path, int32_t buffer_size, + std::shared_ptr* file) { + return impl_->OpenReadable(path, buffer_size, file); +} + Status HdfsClient::OpenReadable( const std::string& path, std::shared_ptr* file) { - return impl_->OpenReadable(path, file); + return OpenReadable(path, kDefaultHdfsBufferSize, file); } Status HdfsClient::OpenWriteable(const std::string& path, bool append, diff --git a/cpp/src/arrow/io/hdfs.h b/cpp/src/arrow/io/hdfs.h index 48699c914503e..1c76f15c397ce 100644 --- a/cpp/src/arrow/io/hdfs.h +++ b/cpp/src/arrow/io/hdfs.h @@ -128,6 +128,9 @@ class ARROW_EXPORT HdfsClient : public FileSystemClient { // status if the file is not found. // // @param path complete file path + Status OpenReadable(const std::string& path, int32_t buffer_size, + std::shared_ptr* file); + Status OpenReadable(const std::string& path, std::shared_ptr* file); // FileMode::WRITE options diff --git a/cpp/src/arrow/io/io-hdfs-test.cc b/cpp/src/arrow/io/io-hdfs-test.cc index 7901932dee676..8338de6d96a55 100644 --- a/cpp/src/arrow/io/io-hdfs-test.cc +++ b/cpp/src/arrow/io/io-hdfs-test.cc @@ -293,6 +293,39 @@ TEST_F(TestHdfsClient, ReadableMethods) { ASSERT_EQ(60, position); } +TEST_F(TestHdfsClient, LargeFile) { + SKIP_IF_NO_LIBHDFS(); + + ASSERT_OK(MakeScratchDir()); + + auto path = ScratchPath("test-large-file"); + const int size = 1000000; + + std::vector data = RandomData(size); + ASSERT_OK(WriteDummyFile(path, data.data(), size)); + + std::shared_ptr file; + ASSERT_OK(client_->OpenReadable(path, &file)); + + auto buffer = std::make_shared(); + ASSERT_OK(buffer->Resize(size)); + int64_t bytes_read = 0; + + ASSERT_OK(file->Read(size, &bytes_read, buffer->mutable_data())); + ASSERT_EQ(0, std::memcmp(buffer->data(), data.data(), size)); + ASSERT_EQ(size, bytes_read); + + // explicit buffer size + std::shared_ptr file2; + ASSERT_OK(client_->OpenReadable(path, 1 << 18, &file2)); + + auto buffer2 = std::make_shared(); + ASSERT_OK(buffer2->Resize(size)); + ASSERT_OK(file2->Read(size, &bytes_read, buffer2->mutable_data())); + ASSERT_EQ(0, std::memcmp(buffer2->data(), data.data(), size)); + ASSERT_EQ(size, bytes_read); +} + TEST_F(TestHdfsClient, RenameFile) { SKIP_IF_NO_LIBHDFS(); From 72f80d450e0e8e20812fd80571b0c1d18e88114a Mon Sep 17 00:00:00 2001 From: Bryan Cutler Date: Wed, 7 Dec 2016 15:00:18 +0100 Subject: [PATCH 0219/1644] ARROW-409: [Python] Change record batches conversion to Table From discussion in ARROW-369, it is more consistent and flexible for the pyarrow.Table API to convert a RecordBatch list first a Table, then Table to pandas.DataFrame. For example: ``` table = pa.Table.from_batches(batches) df = table.to_pandas() ``` Also updated conversion to print schemas in exception message if not equal. 
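With mismatched inputs the construction now fails fast and reports what differed. Roughly (a hedged sketch, not a test from this patch; `batch1` and `batch2` stand in for two record batches whose schemas differ):

```
import pyarrow as pa

# Table.from_batches requires every batch to carry the same schema;
# on a mismatch it raises, and the message now includes both schemas.
try:
    table = pa.Table.from_batches([batch1, batch2])
except pa.ArrowException as exc:
    print(exc)  # "... not all schemas are equal: {...} != {...}"
```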
Author: Bryan Cutler Closes #229 from BryanCutler/pyarrow-table-from_batches-ARROW-409 and squashes the following commits: f5751e0 [Bryan Cutler] fixed schema check to print out if not equal 72ea875 [Bryan Cutler] changed batches conversion to Table instead --- python/pyarrow/__init__.py | 3 +- python/pyarrow/table.pyx | 94 +++++++++++++++--------------- python/pyarrow/tests/test_table.py | 5 +- 3 files changed, 52 insertions(+), 50 deletions(-) diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index f366317d04c96..5af93fb5865de 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -49,5 +49,4 @@ list_, struct, field, DataType, Field, Schema, schema) -from pyarrow.table import (Column, RecordBatch, dataframe_from_batches, Table, - from_pandas_dataframe) +from pyarrow.table import Column, RecordBatch, Table, from_pandas_dataframe diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index 45cf7becceefa..0a9805cfdf427 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -415,52 +415,6 @@ cdef class RecordBatch: return result -def dataframe_from_batches(batches): - """ - Convert a list of Arrow RecordBatches to a pandas.DataFrame - - Parameters - ---------- - - batches: list of RecordBatch - RecordBatch list to be converted, schemas must be equal - """ - - cdef: - vector[shared_ptr[CArray]] c_array_chunks - vector[shared_ptr[CColumn]] c_columns - shared_ptr[CTable] c_table - Array arr - Schema schema - - import pandas as pd - - schema = batches[0].schema - - # check schemas are equal - if any((not schema.equals(other.schema) for other in batches[1:])): - raise ArrowException("Error converting list of RecordBatches to " - "DataFrame, not all schemas are equal") - - cdef int K = batches[0].num_columns - - # create chunked columns from the batches - c_columns.resize(K) - for i in range(K): - for batch in batches: - arr = batch[i] - c_array_chunks.push_back(arr.sp_array) - c_columns[i].reset(new CColumn(schema.sp_schema.get().field(i), - c_array_chunks)) - c_array_chunks.clear() - - # create a Table from columns and convert to DataFrame - c_table.reset(new CTable('', schema.sp_schema, c_columns)) - table = Table() - table.init(c_table) - return table.to_pandas() - - cdef class Table: """ A collection of top-level named, equal length Arrow arrays. 
@@ -567,6 +521,54 @@ cdef class Table: return result + @staticmethod + def from_batches(batches): + """ + Construct a Table from a list of Arrow RecordBatches + + Parameters + ---------- + + batches: list of RecordBatch + RecordBatch list to be converted, schemas must be equal + """ + + cdef: + vector[shared_ptr[CArray]] c_array_chunks + vector[shared_ptr[CColumn]] c_columns + shared_ptr[CTable] c_table + Array arr + Schema schema + + import pandas as pd + + schema = batches[0].schema + + # check schemas are equal + for other in batches[1:]: + if not schema.equals(other.schema): + raise ArrowException("Error converting list of RecordBatches " + "to DataFrame, not all schemas are equal: {%s} != {%s}" + % (str(schema), str(other.schema))) + + cdef int K = batches[0].num_columns + + # create chunked columns from the batches + c_columns.resize(K) + for i in range(K): + for batch in batches: + arr = batch[i] + c_array_chunks.push_back(arr.sp_array) + c_columns[i].reset(new CColumn(schema.sp_schema.get().field(i), + c_array_chunks)) + c_array_chunks.clear() + + # create a Table from columns and convert to DataFrame + c_table.reset(new CTable('', schema.sp_schema, c_columns)) + table = Table() + table.init(c_table) + return table + def to_pandas(self): """ Convert the arrow::Table to a pandas DataFrame diff --git a/python/pyarrow/tests/test_table.py b/python/pyarrow/tests/test_table.py index dc4f37a830e5f..25463145c00ce 100644 --- a/python/pyarrow/tests/test_table.py +++ b/python/pyarrow/tests/test_table.py @@ -69,7 +69,8 @@ def test_recordbatchlist_to_pandas(): batch1 = pa.RecordBatch.from_pandas(data1) batch2 = pa.RecordBatch.from_pandas(data2) - result = pa.dataframe_from_batches([batch1, batch2]) + table = pa.Table.from_batches([batch1, batch2]) + result = table.to_pandas() data = pd.concat([data1, data2], ignore_index=True) assert_frame_equal(data, result) @@ -82,7 +83,7 @@ def test_recordbatchlist_schema_equals(): batch2 = pa.RecordBatch.from_pandas(data2) with pytest.raises(pa.ArrowException): - pa.dataframe_from_batches([batch1, batch2]) + pa.Table.from_batches([batch1, batch2]) def test_table_basics(): From c8eb49e4136365f8056e09c36746b6dbb02d2814 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 8 Dec 2016 20:58:48 +0100 Subject: [PATCH 0220/1644] ARROW-408: Remove defunct conda recipes These are better maintained on conda-forge since there's also an associated cross-platform build environment Author: Wes McKinney Closes #231 from wesm/ARROW-408 and squashes the following commits: 8c58b75 [Wes McKinney] Remove defunct conda recipes --- cpp/conda.recipe/build.sh | 62 ----------------------------------- cpp/conda.recipe/meta.yaml | 43 ------------------------ python/conda.recipe/build.sh | 45 ------------------------- python/conda.recipe/meta.yaml | 54 ------------------------------ 4 files changed, 204 deletions(-) delete mode 100644 cpp/conda.recipe/build.sh delete mode 100644 cpp/conda.recipe/meta.yaml delete mode 100644 python/conda.recipe/build.sh delete mode 100644 python/conda.recipe/meta.yaml diff --git a/cpp/conda.recipe/build.sh b/cpp/conda.recipe/build.sh deleted file mode 100644 index 0536fd99b5ca5..0000000000000 --- a/cpp/conda.recipe/build.sh +++ /dev/null @@ -1,62 +0,0 @@ -#!/bin/bash - -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. - -set -e -set -x - -cd $RECIPE_DIR - -# Build dependencies -export FLATBUFFERS_HOME=$PREFIX -export PARQUET_HOME=$PREFIX - -if [ "$(uname)" == "Darwin" ]; then - # C++11 finagling for Mac OSX - export CC=clang - export CXX=clang++ - export MACOSX_VERSION_MIN="10.7" - CXXFLAGS="${CXXFLAGS} -mmacosx-version-min=${MACOSX_VERSION_MIN}" - CXXFLAGS="${CXXFLAGS} -stdlib=libc++ -std=c++11" - export LDFLAGS="${LDFLAGS} -mmacosx-version-min=${MACOSX_VERSION_MIN}" - export LDFLAGS="${LDFLAGS} -stdlib=libc++ -std=c++11" - export LINKFLAGS="${LDFLAGS}" - export MACOSX_DEPLOYMENT_TARGET=10.7 -fi - -cd .. - -rm -rf conda-build -mkdir conda-build -cd conda-build -pwd - -# if [ `uname` == Linux ]; then -# SHARED_LINKER_FLAGS='-static-libstdc++' -# elif [ `uname` == Darwin ]; then -# SHARED_LINKER_FLAGS='' -# fi - -# -DCMAKE_SHARED_LINKER_FLAGS=$SHARED_LINKER_FLAGS \ - -cmake \ - -DCMAKE_BUILD_TYPE=release \ - -DCMAKE_INSTALL_PREFIX=$PREFIX \ - -DARROW_HDFS=on \ - -DARROW_IPC=on \ - -DARROW_PARQUET=on \ - .. - -make -ctest -L unittest -make install diff --git a/cpp/conda.recipe/meta.yaml b/cpp/conda.recipe/meta.yaml deleted file mode 100644 index 31f150c1f0b00..0000000000000 --- a/cpp/conda.recipe/meta.yaml +++ /dev/null @@ -1,43 +0,0 @@ -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. - -package: - name: arrow-cpp - version: "0.1" - -build: - number: {{environ.get('TRAVIS_BUILD_NUMBER', 0)}} # [unix] - skip: true # [win] - script_env: - - CC [linux] - - CXX [linux] - - LD_LIBRARY_PATH [linux] - -requirements: - build: - - cmake - - flatbuffers - - parquet-cpp - - run: - - parquet-cpp - -test: - commands: - - test -f $PREFIX/lib/libarrow.so # [linux] - - test -f $PREFIX/lib/libarrow_parquet.so # [linux] - - test -f $PREFIX/include/arrow/api.h - -about: - home: http://github.com/apache/arrow - license: Apache 2.0 - summary: 'C++ libraries for the reference Apache Arrow implementation' diff --git a/python/conda.recipe/build.sh b/python/conda.recipe/build.sh deleted file mode 100644 index fafe71e7adb75..0000000000000 --- a/python/conda.recipe/build.sh +++ /dev/null @@ -1,45 +0,0 @@ -#!/bin/bash - -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
-# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. - -set -ex - -# Build dependency -export ARROW_HOME=$PREFIX - -cd $RECIPE_DIR - -if [ "$(uname)" == "Darwin" ]; then - # C++11 finagling for Mac OSX - export CC=clang - export CXX=clang++ - export MACOSX_VERSION_MIN="10.7" - CXXFLAGS="${CXXFLAGS} -mmacosx-version-min=${MACOSX_VERSION_MIN}" - CXXFLAGS="${CXXFLAGS} -stdlib=libc++ -std=c++11" - export LDFLAGS="${LDFLAGS} -mmacosx-version-min=${MACOSX_VERSION_MIN}" - export LDFLAGS="${LDFLAGS} -stdlib=libc++ -std=c++11" - export LINKFLAGS="${LDFLAGS}" - export MACOSX_DEPLOYMENT_TARGET=10.7 -fi - -# echo Setting the compiler... -# if [ `uname` == Linux ]; then -# EXTRA_CMAKE_ARGS=-DCMAKE_SHARED_LINKER_FLAGS=-static-libstdc++ -# elif [ `uname` == Darwin ]; then -# EXTRA_CMAKE_ARGS= -# fi - -cd .. -# $PYTHON setup.py build_ext --extra-cmake-args=$EXTRA_CMAKE_ARGS || exit 1 -$PYTHON setup.py build_ext || exit 1 -$PYTHON setup.py install || exit 1 diff --git a/python/conda.recipe/meta.yaml b/python/conda.recipe/meta.yaml deleted file mode 100644 index b37dfde0a0d6f..0000000000000 --- a/python/conda.recipe/meta.yaml +++ /dev/null @@ -1,54 +0,0 @@ -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. - -package: - name: pyarrow - version: "0.1" - -build: - number: {{environ.get('TRAVIS_BUILD_NUMBER', 0)}} # [unix] - rpaths: - - lib # [unix] - - lib/python{{environ.get('PY_VER')}}/site-packages/pyarrow # [unix] - script_env: - - CC [linux] - - CXX [linux] - - LD_LIBRARY_PATH [linux] - skip: true # [win] - -requirements: - build: - - cmake - - python - - setuptools - - cython - - numpy - - pandas - - arrow-cpp - - pytest - - run: - - arrow-cpp - - parquet-cpp - - python - - numpy - - pandas - - six - -test: - imports: - - pyarrow - -about: - home: http://github.com/apache/arrow - license: Apache 2.0 - summary: 'Python bindings for Arrow C++ and interoperability tool for pandas and NumPy' From e139b8b7c11b7f36fa57a625a39d9c8779d033f4 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 9 Dec 2016 06:49:49 +0100 Subject: [PATCH 0221/1644] ARROW-404: [Python] Fix segfault caused by HdfsClient getting closed before an HdfsFile The one downside of this patch is that HdfsFile handles don't get garbage-collected until the cyclic GC runs -- I tried to fix this but couldn't get it working. So bytes don't always get flushed to HDFS until `close()` is called. 
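The keep-alive scheme, sketched in plain Python (the actual change below is a Cython extension class whose `__dealloc__` performs this cleanup; the sketch only shows the reference structure):

```
import weakref

class _FileNanny:
    """Ties an open file's lifetime to its client without a reference cycle."""

    def __init__(self, client, file_handle):
        # Strong reference: the client cannot be destroyed while a file
        # handle still points at its nanny.
        self.client = client
        # Weak reference: the nanny must not itself keep the file alive.
        self.file_handle_ref = weakref.ref(file_handle)

    def release(self):  # in the Cython class this runs in __dealloc__
        fh = self.file_handle_ref()
        if fh is not None:
            fh.close()  # make sure the file closes before the client can
        self.file_handle_ref = None
        self.client = None
```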
The flush issue should be addressed on the C++ side Author: Wes McKinney Closes #230 from wesm/ARROW-404 and squashes the following commits: 3a8e641 [Wes McKinney] Use weakref in _HdfsFileNanny to avoid cyclic gc 274d0c5 [Wes McKinney] amend comment 1539a2c [Wes McKinney] Ensure that HdfsClient does not get closed before an open file does when the last user-accessible client reference goes out of scope --- python/pyarrow/io.pyx | 86 ++++++++++++++++++++----------- python/pyarrow/tests/test_hdfs.py | 23 +++++++++ 2 files changed, 79 insertions(+), 30 deletions(-) diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index 0e6b81e984431..2fa5fb6b87885 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -504,7 +504,7 @@ cdef class HdfsClient: out.mode = mode out.buffer_size = c_buffer_size - out.parent = self + out.parent = _HdfsFileNanny(self, out) out.is_open = True out.own_file = True @@ -516,48 +516,69 @@ cdef class HdfsClient: """ write_queue = Queue(50) - f = self.open(path, 'wb') + with self.open(path, 'wb') as f: + done = False + exc_info = None + def bg_write(): + try: + while not done or write_queue.qsize() > 0: + try: + buf = write_queue.get(timeout=0.01) + except QueueEmpty: + continue - done = False - exc_info = None - def bg_write(): - try: - while not done or write_queue.qsize() > 0: - try: - buf = write_queue.get(timeout=0.01) - except QueueEmpty: - continue + f.write(buf) - f.write(buf) + except Exception as e: + exc_info = sys.exc_info() - except Exception as e: - exc_info = sys.exc_info() - - writer_thread = threading.Thread(target=bg_write) - writer_thread.start() + writer_thread = threading.Thread(target=bg_write) + writer_thread.start() - try: - while True: - buf = stream.read(buffer_size) - if not buf: - break + try: + while True: + buf = stream.read(buffer_size) + if not buf: + break - write_queue.put_nowait(buf) - finally: - done = True + write_queue.put_nowait(buf) + finally: + done = True - writer_thread.join() - if exc_info is not None: - raise exc_info[0], exc_info[1], exc_info[2] + writer_thread.join() + if exc_info is not None: + raise exc_info[0], exc_info[1], exc_info[2] def download(self, path, stream, buffer_size=None): - f = self.open(path, 'rb', buffer_size=buffer_size) - f.download(stream) + with self.open(path, 'rb', buffer_size=buffer_size) as f: + f.download(stream) # ---------------------------------------------------------------------- # Specialization for HDFS +# ARROW-404: Helper class to ensure that files are closed before the +# client. 
During deallocation of the extension class, the attributes are +# decref'd which can cause the client to get closed first if the file has the +# last remaining reference +cdef class _HdfsFileNanny: + cdef: + object client + object file_handle_ref + + def __cinit__(self, client, file_handle): + import weakref + self.client = client + self.file_handle_ref = weakref.ref(file_handle) + + def __dealloc__(self): + fh = self.file_handle_ref() + if fh: + fh.close() + # avoid cyclic GC + self.file_handle_ref = None + self.client = None + cdef class HdfsFile(NativeFile): cdef readonly: @@ -565,6 +586,11 @@ cdef class HdfsFile(NativeFile): object mode object parent + cdef object __weakref__ + + def __dealloc__(self): + self.parent = None + def read(self, int nbytes): """ Read indicated number of bytes from the file, up to EOF diff --git a/python/pyarrow/tests/test_hdfs.py b/python/pyarrow/tests/test_hdfs.py index ed8d41994cdd0..c23543b7f0d07 100644 --- a/python/pyarrow/tests/test_hdfs.py +++ b/python/pyarrow/tests/test_hdfs.py @@ -98,6 +98,29 @@ def test_hdfs_ls(hdfs): assert contents == [dir_path, f1_path] +def _make_test_file(hdfs, test_name, test_path, test_data): + base_path = pjoin(HDFS_TMP_PATH, test_name) + hdfs.mkdir(base_path) + + full_path = pjoin(base_path, test_path) + + f = hdfs.open(full_path, 'wb') + f.write(test_data) + + return full_path + + +@libhdfs +def test_hdfs_orphaned_file(): + hdfs = hdfs_test_client() + file_path = _make_test_file(hdfs, 'orphaned_file_test', 'fname', + 'foobarbaz') + + f = hdfs.open(file_path) + hdfs = None + f = None # noqa + + @libhdfs def test_hdfs_download_upload(hdfs): base_path = pjoin(HDFS_TMP_PATH, 'upload-test') From a5362c2cbed5f32a468d93d23c8365c9c5528b03 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Fri, 9 Dec 2016 11:39:49 -0500 Subject: [PATCH 0222/1644] ARROW-346: Use conda environment to build API docs As we cannot currently build pyarrow on readthedocs, we have to resort to building the API docs for the latest version of pyarrow on conda-forge. All other documentation will though be pulled directly from the source code. Author: Uwe L. Korn Closes #228 from xhochy/ARROW-346 and squashes the following commits: 6a4bdc1 [Uwe L. Korn] Add license header b741141 [Uwe L. Korn] ARROW-346: Use conda environment to build API docs --- .readthedocs.yml | 2 ++ python/doc/conf.py | 7 ++++++- python/doc/environment.yml | 25 +++++++++++++++++++++++++ 3 files changed, 33 insertions(+), 1 deletion(-) create mode 100644 .readthedocs.yml create mode 100644 python/doc/environment.yml diff --git a/.readthedocs.yml b/.readthedocs.yml new file mode 100644 index 0000000000000..2e1fe3fbc251a --- /dev/null +++ b/.readthedocs.yml @@ -0,0 +1,2 @@ +conda: + file: python/doc/environment.yml diff --git a/python/doc/conf.py b/python/doc/conf.py index 4c324a8086c39..e817bbdd42bd3 100644 --- a/python/doc/conf.py +++ b/python/doc/conf.py @@ -42,7 +42,12 @@ cmd_line = cmd_line_template.format(outputdir=output_dir, moduledir=module_dir) apidoc.main(cmd_line.split(" ")) -sys.path.insert(0, os.path.abspath('..')) +on_rtd = os.environ.get('READTHEDOCS') == 'True' + +if not on_rtd: + # Hack: On RTD we use the pyarrow package from conda-forge as we cannot + # build pyarrow there. 
+ sys.path.insert(0, os.path.abspath('..')) # -- General configuration ------------------------------------------------ diff --git a/python/doc/environment.yml b/python/doc/environment.yml new file mode 100644 index 0000000000000..8d1fe9bfb5d11 --- /dev/null +++ b/python/doc/environment.yml @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +channels: +- defaults +- conda-forge +dependencies: +- arrow-cpp +- parquet-cpp +- pyarrow +- numpydoc From d06c49144a60faa9af115e803694329e82623a5d Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Fri, 9 Dec 2016 11:41:21 -0500 Subject: [PATCH 0223/1644] =?UTF-8?q?ARROW-399:=20ListVector.loadFieldBuff?= =?UTF-8?q?ers=20ignores=20the=20ArrowFieldNode=20len=E2=80=A6?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit …gth metadata Author: Julien Le Dem Closes #227 from julienledem/arrow_399 and squashes the following commits: 93a77cb [Julien Le Dem] set padding; add test 462a36c [Julien Le Dem] ARROW-399: ListVector.loadFieldBuffers ignores the ArrowFieldNode length metadata --- .../codegen/templates/FixedValueVectors.java | 2 + .../templates/NullableValueVectors.java | 62 +++++-------- .../main/codegen/templates/UnionVector.java | 2 + .../arrow/vector/BaseDataValueVector.java | 17 ++++ .../org/apache/arrow/vector/BitVector.java | 2 +- .../org/apache/arrow/vector/VectorLoader.java | 2 +- .../arrow/vector/complex/ListVector.java | 2 + .../arrow/vector/TestVectorUnloadLoad.java | 92 ++++++++++++++++++- 8 files changed, 136 insertions(+), 45 deletions(-) diff --git a/java/vector/src/main/codegen/templates/FixedValueVectors.java b/java/vector/src/main/codegen/templates/FixedValueVectors.java index 7958222f5c1bb..be385d146dbac 100644 --- a/java/vector/src/main/codegen/templates/FixedValueVectors.java +++ b/java/vector/src/main/codegen/templates/FixedValueVectors.java @@ -45,6 +45,8 @@ public final class ${minor.class}Vector extends BaseDataValueVector implements FixedWidthVector{ private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(${minor.class}Vector.class); + public static final int TYPE_WIDTH = ${type.width}; + private final Accessor accessor = new Accessor(); private final Mutator mutator = new Mutator(); diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index 2c4274c13ee58..6a9ce65392f59 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -37,7 +37,7 @@ import org.apache.arrow.flatbuf.Precision; /** - * Nullable${minor.class} implements a vector of values which could be null. 
Elements in the vector + * ${className} implements a vector of values which could be null. Elements in the vector * are first checked against a fixed length vector of boolean values. Then the element is retrieved * from the base class (if not null). * @@ -47,7 +47,7 @@ public final class ${className} extends BaseDataValueVector implements <#if type.major == "VarLen">VariableWidth<#else>FixedWidthVector, NullableVector, FieldVector { private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(${className}.class); - private final FieldReader reader = new ${minor.class}ReaderImpl(Nullable${minor.class}Vector.this); + private final FieldReader reader = new ${minor.class}ReaderImpl(${className}.this); private final String bitsField = "$bits$"; private final String valuesField = "$values$"; @@ -67,7 +67,7 @@ public final class ${className} extends BaseDataValueVector implements <#if type public ${className}(String name, BufferAllocator allocator, int precision, int scale) { super(name, allocator); - values = new ${minor.class}Vector(valuesField, allocator, precision, scale); + values = new ${valuesName}(valuesField, allocator, precision, scale); this.precision = precision; this.scale = scale; mutator = new Mutator(); @@ -81,7 +81,7 @@ public final class ${className} extends BaseDataValueVector implements <#if type <#else> public ${className}(String name, BufferAllocator allocator) { super(name, allocator); - values = new ${minor.class}Vector(valuesField, allocator); + values = new ${valuesName}(valuesField, allocator); mutator = new Mutator(); accessor = new Accessor(); <#if minor.class == "TinyInt" || @@ -144,6 +144,13 @@ public List getChildrenFromFields() { @Override public void loadFieldBuffers(ArrowFieldNode fieldNode, List ownBuffers) { + <#if type.major = "VarLen"> + // variable width values: truncate offset vector buffer to size (#1) + org.apache.arrow.vector.BaseDataValueVector.truncateBufferBasedOnSize(ownBuffers, 1, values.offsetVector.getBufferSizeFor(fieldNode.getLength() + 1)); + <#else> + // fixed width values truncate value vector to size (#1) + org.apache.arrow.vector.BaseDataValueVector.truncateBufferBasedOnSize(ownBuffers, 1, values.getBufferSizeFor(fieldNode.getLength())); + org.apache.arrow.vector.BaseDataValueVector.load(fieldNode, getFieldInnerVectors(), ownBuffers); bits.valueCount = fieldNode.getLength(); } @@ -229,13 +236,6 @@ public void setInitialCapacity(int numRecords) { values.setInitialCapacity(numRecords); } -// @Override -// public SerializedField.Builder getMetadataBuilder() { -// return super.getMetadataBuilder() -// .addChild(bits.getMetadata()) -// .addChild(values.getMetadata()); -// } - @Override public void allocateNew() { if(!allocateNewSafe()){ @@ -329,20 +329,6 @@ public void zeroVector() { } - -// @Override -// public void load(SerializedField metadata, ArrowBuf buffer) { -// clear(); - // the bits vector is the first child (the order in which the children are added in getMetadataBuilder is significant) -// final SerializedField bitsField = metadata.getChild(0); -// bits.load(bitsField, buffer); -// -// final int capacity = buffer.capacity(); -// final int bitsLength = bitsField.getBufferLength(); -// final SerializedField valuesField = metadata.getChild(1); -// values.load(valuesField, buffer.slice(bitsLength, capacity - bitsLength)); -// } - @Override public TransferPair getTransferPair(BufferAllocator allocator){ return new TransferImpl(name, allocator); @@ -356,10 +342,10 @@ public TransferPair getTransferPair(String ref, 
BufferAllocator allocator){ @Override public TransferPair makeTransferPair(ValueVector to) { - return new TransferImpl((Nullable${minor.class}Vector) to); + return new TransferImpl((${className}) to); } - public void transferTo(Nullable${minor.class}Vector target){ + public void transferTo(${className} target){ bits.transferTo(target.bits); values.transferTo(target.values); <#if type.major == "VarLen"> @@ -368,7 +354,7 @@ public void transferTo(Nullable${minor.class}Vector target){ clear(); } - public void splitAndTransferTo(int startIndex, int length, Nullable${minor.class}Vector target) { + public void splitAndTransferTo(int startIndex, int length, ${className} target) { bits.splitAndTransferTo(startIndex, length, target.bits); values.splitAndTransferTo(startIndex, length, target.values); <#if type.major == "VarLen"> @@ -377,22 +363,22 @@ public void splitAndTransferTo(int startIndex, int length, Nullable${minor.class } private class TransferImpl implements TransferPair { - Nullable${minor.class}Vector to; + ${className} to; public TransferImpl(String name, BufferAllocator allocator){ <#if minor.class == "Decimal"> - to = new Nullable${minor.class}Vector(name, allocator, precision, scale); + to = new ${className}(name, allocator, precision, scale); <#else> - to = new Nullable${minor.class}Vector(name, allocator); + to = new ${className}(name, allocator); } - public TransferImpl(Nullable${minor.class}Vector to){ + public TransferImpl(${className} to){ this.to = to; } @Override - public Nullable${minor.class}Vector getTo(){ + public ${className} getTo(){ return to; } @@ -408,7 +394,7 @@ public void splitAndTransfer(int startIndex, int length) { @Override public void copyValueSafe(int fromIndex, int toIndex) { - to.copyFromSafe(fromIndex, toIndex, Nullable${minor.class}Vector.this); + to.copyFromSafe(fromIndex, toIndex, ${className}.this); } } @@ -422,14 +408,14 @@ public Mutator getMutator(){ return mutator; } - public void copyFrom(int fromIndex, int thisIndex, Nullable${minor.class}Vector from){ + public void copyFrom(int fromIndex, int thisIndex, ${className} from){ final Accessor fromAccessor = from.getAccessor(); if (!fromAccessor.isNull(fromIndex)) { mutator.set(thisIndex, fromAccessor.get(fromIndex)); } } - public void copyFromSafe(int fromIndex, int thisIndex, ${minor.class}Vector from){ + public void copyFromSafe(int fromIndex, int thisIndex, ${valuesName} from){ <#if type.major == "VarLen"> mutator.fillEmpties(thisIndex); @@ -437,7 +423,7 @@ public void copyFromSafe(int fromIndex, int thisIndex, ${minor.class}Vector from bits.getMutator().setSafe(thisIndex, 1); } - public void copyFromSafe(int fromIndex, int thisIndex, Nullable${minor.class}Vector from){ + public void copyFromSafe(int fromIndex, int thisIndex, ${className} from){ <#if type.major == "VarLen"> mutator.fillEmpties(thisIndex); @@ -640,7 +626,7 @@ public void set(int index, ${minor.class}Holder holder){ } public boolean isSafe(int outIndex) { - return outIndex < Nullable${minor.class}Vector.this.getValueCapacity(); + return outIndex < ${className}.this.getValueCapacity(); } <#assign fields = minor.fields!type.fields /> diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java index 4e68b681d1404..18acdf4a551b4 100644 --- a/java/vector/src/main/codegen/templates/UnionVector.java +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -103,6 +103,8 @@ public List getChildrenFromFields() { @Override public void loadFieldBuffers(ArrowFieldNode 
fieldNode, List<ArrowBuf> ownBuffers) {
+    // truncate types vector buffer to size (#0)
+    org.apache.arrow.vector.BaseDataValueVector.truncateBufferBasedOnSize(ownBuffers, 0, typeVector.getBufferSizeFor(fieldNode.getLength()));
     BaseDataValueVector.load(fieldNode, getFieldInnerVectors(), ownBuffers);
     this.valueCount = fieldNode.getLength();
   }
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java
index 4c6d363f21cda..b7df8d13ee607 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java
@@ -30,6 +30,9 @@ public abstract class BaseDataValueVector extends BaseValueVector implements Buf
 
   protected final static byte[] emptyByteArray = new byte[]{}; // Nullable vectors use this
 
+  /** maximum extra size at the end of the buffer */
+  private static final int MAX_BUFFER_PADDING = 64;
+
   public static void load(ArrowFieldNode fieldNode, List<BufferBacked> vectors, List<ArrowBuf> buffers) {
     int expectedSize = vectors.size();
     if (buffers.size() != expectedSize) {
@@ -40,6 +43,20 @@ public static void load(ArrowFieldNode fieldNode, List<BufferBacked> vectors, Li
     }
   }
 
+  public static void truncateBufferBasedOnSize(List<ArrowBuf> buffers, int bufferIndex, int byteSize) {
+    if (bufferIndex >= buffers.size()) {
+      throw new IllegalArgumentException("no buffer at index " + bufferIndex + ": " + buffers);
+    }
+    ArrowBuf buffer = buffers.get(bufferIndex);
+    if (buffer.writerIndex() < byteSize) {
+      throw new IllegalArgumentException("can not truncate buffer to a larger size " + byteSize + ": " + buffer.writerIndex());
+    }
+    if (buffer.writerIndex() - byteSize > MAX_BUFFER_PADDING) {
+      throw new IllegalArgumentException("Buffer too large to resize to " + byteSize + ": " + buffer.writerIndex());
+    }
+    buffer.writerIndex(byteSize);
+  }
+
   public static List<ArrowBuf> unload(List<BufferBacked> vectors) {
     List<ArrowBuf> result = new ArrayList<>(vectors.size());
     for (BufferBacked vector : vectors) {
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java
index 7ce1236b2ec30..48da8e77d6814 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java
@@ -68,7 +68,7 @@ public void load(ArrowFieldNode fieldNode, ArrowBuf data) {
       int remainder = count % 8;
       // set remaining bits
       if (remainder > 0) {
-        byte bitMask = (byte) (0xFFL >>> ((8 - remainder) & 7));;
+        byte bitMask = (byte) (0xFFL >>> ((8 - remainder) & 7));
         this.data.setByte(fullBytesCount, bitMask);
       }
     } else if (fieldNode.getNullCount() == fieldNode.getLength()) {
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java b/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java
index 757f061dd5a2f..5c1176cf95d26 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java
@@ -82,7 +82,7 @@ private void loadBuffers(FieldVector vector, Field field, Iterator<ArrowBuf> buf
       vector.loadFieldBuffers(fieldNode, ownBuffers);
     } catch (RuntimeException e) {
       throw new IllegalArgumentException("Could not load buffers for field " +
-          field + " error message" + e.getMessage(), e);
+          field + ". error message: " + e.getMessage(), e);
     }
     List<Field> children = field.getChildren();
     if (children.size() > 0) {
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
index e18f99f95d780..461bdbcda1b52 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java
@@ -93,6 +93,8 @@ public List<FieldVector> getChildrenFromFields() {
 
   @Override
   public void loadFieldBuffers(ArrowFieldNode fieldNode, List<ArrowBuf> ownBuffers) {
+    // variable width values: truncate offset vector buffer to size (#1)
+    org.apache.arrow.vector.BaseDataValueVector.truncateBufferBasedOnSize(ownBuffers, 1, offsets.getBufferSizeFor(fieldNode.getLength() + 1));
     BaseDataValueVector.load(fieldNode, getFieldInnerVectors(), ownBuffers);
   }
 
diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java b/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java
index 9dfe8d840e49d..7a70ffd904758 100644
--- a/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java
+++ b/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java
@@ -23,6 +23,7 @@
 import static org.junit.Assert.assertTrue;
 
 import java.io.IOException;
+import java.util.ArrayList;
 import java.util.Collections;
 import java.util.List;
 
@@ -32,6 +33,7 @@
 import org.apache.arrow.vector.complex.impl.ComplexWriterImpl;
 import org.apache.arrow.vector.complex.reader.FieldReader;
 import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter;
+import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter;
 import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter;
 import org.apache.arrow.vector.complex.writer.BigIntWriter;
 import org.apache.arrow.vector.complex.writer.IntWriter;
@@ -99,6 +101,79 @@ public void testUnloadLoad() throws IOException {
     }
   }
 
+  @Test
+  public void testUnloadLoadAddPadding() throws IOException {
+    int count = 10000;
+    Schema schema;
+    try (
+        BufferAllocator originalVectorsAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE);
+        MapVector parent = new MapVector("parent", originalVectorsAllocator, null)) {
+
+      // write some data
+      ComplexWriter writer = new ComplexWriterImpl("root", parent);
+      MapWriter rootWriter = writer.rootAsMap();
+      ListWriter list = rootWriter.list("list");
+      IntWriter intWriter = list.integer();
+      for (int i = 0; i < count; i++) {
+        list.setPosition(i);
+        list.startList();
+        for (int j = 0; j < i % 4 + 1; j++) {
+          intWriter.writeInt(i);
+        }
+        list.endList();
+      }
+      writer.setValueCount(count);
+
+      // unload it
+      FieldVector root = parent.getChild("root");
+      schema = new Schema(root.getField().getChildren());
+      VectorUnloader vectorUnloader = newVectorUnloader(root);
+      try (
+          ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch();
+          BufferAllocator finalVectorsAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE);
+          VectorSchemaRoot newRoot = new VectorSchemaRoot(schema, finalVectorsAllocator);
+          ) {
+        List<ArrowBuf> oldBuffers = recordBatch.getBuffers();
+        List<ArrowBuf> newBuffers = new ArrayList<>();
+        for (ArrowBuf oldBuffer : oldBuffers) {
+          int l = oldBuffer.readableBytes();
+          if (l % 64 != 0) {
+            // pad
+            l = l + 64 - l % 64;
+          }
+          ArrowBuf newBuffer = allocator.buffer(l);
+          for (int i = oldBuffer.readerIndex(); i < oldBuffer.writerIndex(); i++) {
+            newBuffer.setByte(i - oldBuffer.readerIndex(),
oldBuffer.getByte(i)); + } + newBuffer.readerIndex(0); + newBuffer.writerIndex(l); + newBuffers.add(newBuffer); + } + + try (ArrowRecordBatch newBatch = new ArrowRecordBatch(recordBatch.getLength(), recordBatch.getNodes(), newBuffers);) { + // load it + VectorLoader vectorLoader = new VectorLoader(newRoot); + + vectorLoader.load(newBatch); + + FieldReader reader = newRoot.getVector("list").getReader(); + for (int i = 0; i < count; i++) { + reader.setPosition(i); + List expected = new ArrayList<>(); + for (int j = 0; j < i % 4 + 1; j++) { + expected.add(i); + } + Assert.assertEquals(expected, reader.readObject()); + } + } + + for (ArrowBuf newBuf : newBuffers) { + newBuf.release(); + } + } + } + } + /** * The validity buffer can be empty if: * - all values are defined @@ -113,12 +188,17 @@ public void testLoadEmptyValidityBuffer() throws IOException { )); int count = 10; ArrowBuf validity = allocator.getEmpty(); - ArrowBuf values = allocator.buffer(count * 4); // integers - for (int i = 0; i < count; i++) { - values.setInt(i * 4, i); + ArrowBuf[] values = new ArrowBuf[2]; + for (int i = 0; i < values.length; i++) { + ArrowBuf arrowBuf = allocator.buffer(count * 4); // integers + values[i] = arrowBuf; + for (int j = 0; j < count; j++) { + arrowBuf.setInt(j * 4, j); + } + arrowBuf.writerIndex(count * 4); } try ( - ArrowRecordBatch recordBatch = new ArrowRecordBatch(count, asList(new ArrowFieldNode(count, 0), new ArrowFieldNode(count, count)), asList(validity, values, validity, values)); + ArrowRecordBatch recordBatch = new ArrowRecordBatch(count, asList(new ArrowFieldNode(count, 0), new ArrowFieldNode(count, count)), asList(validity, values[0], validity, values[1])); BufferAllocator finalVectorsAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); VectorSchemaRoot newRoot = new VectorSchemaRoot(schema, finalVectorsAllocator); ) { @@ -153,7 +233,9 @@ public void testLoadEmptyValidityBuffer() throws IOException { assertFalse(intDefinedVector.getAccessor().isNull(count + 10)); assertEquals(1234, intDefinedVector.getAccessor().get(count + 10)); } finally { - values.release(); + for (ArrowBuf arrowBuf : values) { + arrowBuf.release(); + } } } From 14ed1be2d89fedc31f4015456cda28216f926dcc Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Fri, 9 Dec 2016 12:04:25 -0500 Subject: [PATCH 0224/1644] ARROW-400: set struct length on load Adds unit test, closes #233 Author: Julien Le Dem Author: Wes McKinney Closes #234 from wesm/ARROW-400 and squashes the following commits: f516ba1 [Wes McKinney] Add unit test for ARROW-400 741ff71 [Julien Le Dem] ARROW-400: set struct length on json load --- integration/data/struct_example.json | 237 ++++++++++++++++++ .../arrow/vector/complex/MapVector.java | 2 +- .../vector/file/json/JsonFileReader.java | 4 + .../arrow/vector/file/json/TestJSONFile.java | 20 ++ 4 files changed, 262 insertions(+), 1 deletion(-) create mode 100644 integration/data/struct_example.json diff --git a/integration/data/struct_example.json b/integration/data/struct_example.json new file mode 100644 index 0000000000000..3ea062db7ba32 --- /dev/null +++ b/integration/data/struct_example.json @@ -0,0 +1,237 @@ +{ + "schema": { + "fields": [ + { + "name": "struct_nullable", + "type": { + "name": "struct" + }, + "nullable": true, + "children": [ + { + "name": "f1", + "type": { + "name": "int", + "isSigned": true, + "bitWidth": 32 + }, + "nullable": true, + "children": [], + "typeLayout": { + "vectors": [ + { + "type": "VALIDITY", + "typeBitWidth": 1 + }, + { + "type": 
"DATA", + "typeBitWidth": 32 + } + ] + } + }, + { + "name": "f2", + "type": { + "name": "utf8" + }, + "nullable": true, + "children": [], + "typeLayout": { + "vectors": [ + { + "type": "VALIDITY", + "typeBitWidth": 1 + }, + { + "type": "OFFSET", + "typeBitWidth": 32 + }, + { + "type": "DATA", + "typeBitWidth": 8 + } + ] + } + } + ], + "typeLayout": { + "vectors": [ + { + "type": "VALIDITY", + "typeBitWidth": 1 + } + ] + } + } + ] + }, + "batches": [ + { + "count": 7, + "columns": [ + { + "name": "struct_nullable", + "count": 7, + "VALIDITY": [ + 0, + 1, + 1, + 1, + 0, + 1, + 0 + ], + "children": [ + { + "name": "f1", + "count": 7, + "VALIDITY": [ + 1, + 0, + 1, + 1, + 1, + 0, + 0 + ], + "DATA": [ + 1402032511, + 290876774, + 137773603, + 410361374, + 1959836418, + 1995074679, + -163525262 + ] + }, + { + "name": "f2", + "count": 7, + "VALIDITY": [ + 0, + 1, + 1, + 1, + 0, + 1, + 0 + ], + "OFFSET": [ + 0, + 0, + 7, + 14, + 21, + 21, + 28, + 28 + ], + "DATA": [ + "", + "MhRNxD4", + "3F9HBxK", + "aVd88fp", + "", + "3loZrRf", + "" + ] + } + ] + } + ] + }, + { + "count": 10, + "columns": [ + { + "name": "struct_nullable", + "count": 10, + "VALIDITY": [ + 0, + 1, + 1, + 0, + 1, + 0, + 0, + 1, + 1, + 1 + ], + "children": [ + { + "name": "f1", + "count": 10, + "VALIDITY": [ + 0, + 0, + 0, + 0, + 0, + 0, + 1, + 0, + 0, + 0 + ], + "DATA": [ + -2041500147, + 1715692943, + -35444996, + 1425496657, + 112765084, + 1760754983, + 413888857, + 2039738337, + -1924327700, + 670528518 + ] + }, + { + "name": "f2", + "count": 10, + "VALIDITY": [ + 1, + 0, + 0, + 1, + 1, + 1, + 1, + 1, + 1, + 0 + ], + "OFFSET": [ + 0, + 7, + 7, + 7, + 14, + 21, + 28, + 35, + 42, + 49, + 49 + ], + "DATA": [ + "AS5oARE", + "", + "", + "JGdagcX", + "78SLiRw", + "vbGf7OY", + "5uh5fTs", + "0ilsf82", + "LjS9MbU", + "" + ] + } + ] + } + ] + } + ] +} \ No newline at end of file diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java index c2f216b197e1d..31a1bb74b8e98 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java @@ -50,7 +50,7 @@ public class MapVector extends AbstractMapVector { private final SingleMapReaderImpl reader = new SingleMapReaderImpl(this); private final Accessor accessor = new Accessor(); private final Mutator mutator = new Mutator(); - int valueCount; + public int valueCount; public MapVector(String name, BufferAllocator allocator, CallBack callBack) { super(name, allocator, callBack); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java index 26dd3f6dfe5ae..152867c1a11d7 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java @@ -47,6 +47,7 @@ import org.apache.arrow.vector.ValueVector.Mutator; import org.apache.arrow.vector.VarCharVector; import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.complex.NullableMapVector; import org.apache.arrow.vector.schema.ArrowVectorType; import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.types.pojo.Schema; @@ -153,6 +154,9 @@ private void readVector(Field field, FieldVector vector) throws JsonParseExcepti } readToken(END_ARRAY); } + if (vector instanceof NullableMapVector) 
{
+      ((NullableMapVector)vector).valueCount = count;
+    }
   }
   readToken(END_OBJECT);
 }
diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/json/TestJSONFile.java b/java/vector/src/test/java/org/apache/arrow/vector/file/json/TestJSONFile.java
index 7d25003f8b335..3720a13b0fce5 100644
--- a/java/vector/src/test/java/org/apache/arrow/vector/file/json/TestJSONFile.java
+++ b/java/vector/src/test/java/org/apache/arrow/vector/file/json/TestJSONFile.java
@@ -21,11 +21,13 @@
 import java.io.IOException;
 
 import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.FieldVector;
 import org.apache.arrow.vector.VectorSchemaRoot;
 import org.apache.arrow.vector.complex.MapVector;
 import org.apache.arrow.vector.complex.NullableMapVector;
 import org.apache.arrow.vector.file.BaseFileTest;
 import org.apache.arrow.vector.types.pojo.Schema;
+import org.junit.Assert;
 import org.junit.Test;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -117,4 +119,22 @@ public void testWriteReadUnionJSON() throws IOException {
     }
   }
 
+  @Test
+  public void testSetStructLength() throws IOException {
+    File file = new File("../../integration/data/struct_example.json");
+    try (
+        BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE);
+        ) {
+      JsonFileReader reader = new JsonFileReader(file, readerAllocator);
+      Schema schema = reader.start();
+      LOGGER.debug("reading schema: " + schema);
+
+      // initialize vectors
+      try (VectorSchemaRoot root = reader.read();) {
+        FieldVector vector = root.getVector("struct_nullable");
+        Assert.assertEquals(7, vector.getAccessor().getValueCount());
+      }
+    }
+  }
+
 }

From 8995c923043788f98afef4dd80f72de4688a8e0c Mon Sep 17 00:00:00 2001
From: Julien Le Dem
Date: Thu, 8 Dec 2016 21:24:29 -0800
Subject: [PATCH 0225/1644] ARROW-402: Fix reference counting issue with empty buffers.
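
The rule this change restores, shown as a minimal sketch (hedged: it assumes
the ArrowBuf retain/release API used elsewhere in this series, and the helper
method is hypothetical):

    static void loadThenRelease(BitVector vector, ArrowFieldNode node, ArrowBuf data) {
      vector.load(node, data); // load() must not release `data`; it does not own it
      data.release();          // the caller drops its own reference exactly once
    }

With the extra release inside load() removed, an empty buffer is no longer
released twice.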
Close #232

Change-Id: I87910c03d7ebca5a8edbf53d01f70c38ef339f04
---
 .../src/main/java/org/apache/arrow/vector/BitVector.java   | 1 -
 .../java/org/apache/arrow/vector/TestVectorUnloadLoad.java | 3 ++-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java
index 48da8e77d6814..26eeafd51d900 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java
@@ -53,7 +53,6 @@ public BitVector(String name, BufferAllocator allocator) {
   public void load(ArrowFieldNode fieldNode, ArrowBuf data) {
     // When the vector is all nulls or all defined, the content of the buffer can be omitted
     if (data.readableBytes() == 0 && fieldNode.getLength() != 0) {
-      data.release();
       int count = fieldNode.getLength();
       allocateNew(count);
       int n = getSizeFromCount(count);
diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java b/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java
index 7a70ffd904758..79c9d5046acd6 100644
--- a/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java
+++ b/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java
@@ -187,7 +187,7 @@ public void testLoadEmptyValidityBuffer() throws IOException {
         new Field("intNull", true, new ArrowType.Int(32, true), Collections.<Field>emptyList())
     ));
     int count = 10;
-    ArrowBuf validity = allocator.getEmpty();
+    ArrowBuf validity = allocator.buffer(10).slice(0, 0);
     ArrowBuf[] values = new ArrowBuf[2];
     for (int i = 0; i < values.length; i++) {
       ArrowBuf arrowBuf = allocator.buffer(count * 4); // integers
@@ -236,6 +236,7 @@ public void testLoadEmptyValidityBuffer() throws IOException {
       for (ArrowBuf arrowBuf : values) {
         arrowBuf.release();
       }
+      validity.release();
     }
   }

From 45ed7e7a36fb2a69de468c41132b6b3bbd270c92 Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Fri, 9 Dec 2016 19:50:19 -0500
Subject: [PATCH 0226/1644] ARROW-394: [Integration] Generate test cases for numeric types, strings, lists, structs

Automatically generating testing files from Python.

Author: Wes McKinney

Closes #219 from wesm/ARROW-394 and squashes the following commits:

7807f48 [Wes McKinney] OS X doesn't have std::fabs
c0c804c [Wes McKinney] abs -> fabs
8cd1902 [Wes McKinney] Fix compiler warning in OS X from incorrect type declaration
d51581a [Wes McKinney] Add missing apache license
527622d [Wes McKinney] ARROW-414: remove check for maximum buffer padding
2a7b0fc [Wes McKinney] Add JSON generation code to fuzz test numeric types, print integers more nicely. Add integration tests to Travis CI build matrix. Add ApproxEquals method for floating point comparisons.
Add boolean, string, struct, list to generated json test case --- .travis.yml | 10 + ci/travis_script_integration.sh | 49 ++ cpp/src/arrow/array.cc | 4 + cpp/src/arrow/array.h | 1 + cpp/src/arrow/ipc/ipc-metadata-test.cc | 4 +- cpp/src/arrow/ipc/json-integration-test.cc | 27 +- cpp/src/arrow/ipc/json-internal.cc | 8 +- cpp/src/arrow/ipc/metadata-internal.cc | 2 +- cpp/src/arrow/pretty_print-test.cc | 10 +- cpp/src/arrow/pretty_print.cc | 90 +++- cpp/src/arrow/pretty_print.h | 8 +- cpp/src/arrow/table.cc | 12 + cpp/src/arrow/table.h | 2 + cpp/src/arrow/type.cc | 2 +- cpp/src/arrow/type_traits.h | 6 + cpp/src/arrow/types/primitive.cc | 1 + cpp/src/arrow/types/primitive.h | 91 +++- integration/README.md | 59 ++ integration/integration_test.py | 508 +++++++++++++++++- .../arrow/vector/BaseDataValueVector.java | 6 - 20 files changed, 844 insertions(+), 56 deletions(-) create mode 100755 ci/travis_script_integration.sh create mode 100644 integration/README.md diff --git a/.travis.yml b/.travis.yml index bfc2f26b4f590..1634eba443615 100644 --- a/.travis.yml +++ b/.travis.yml @@ -46,6 +46,16 @@ matrix: jdk: oraclejdk7 script: - $TRAVIS_BUILD_DIR/ci/travis_script_java.sh + - language: java + os: linux + env: ARROW_TEST_GROUP=integration + jdk: oraclejdk7 + before_script: + - export CC="gcc-4.9" + - export CXX="g++-4.9" + - $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh + script: + - $TRAVIS_BUILD_DIR/ci/travis_script_integration.sh before_install: - ulimit -c unlimited -S diff --git a/ci/travis_script_integration.sh b/ci/travis_script_integration.sh new file mode 100755 index 0000000000000..d93411b907d47 --- /dev/null +++ b/ci/travis_script_integration.sh @@ -0,0 +1,49 @@ +#!/usr/bin/env bash + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. 
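+
+# A note on the wiring below (paths restated from this script, not new
+# assumptions): `mvn package` produces the arrow-tools fat JAR consumed by
+# integration_test.py via ARROW_JAVA_INTEGRATION_JAR, and the C++ tester is
+# expected at $CPP_BUILD_DIR/debug/json-integration-test, built earlier by
+# travis_before_script_cpp.sh.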
+ +set -e + +: ${CPP_BUILD_DIR=$TRAVIS_BUILD_DIR/cpp-build} + +JAVA_DIR=${TRAVIS_BUILD_DIR}/java + +pushd $JAVA_DIR + +mvn package + +popd + +pushd $TRAVIS_BUILD_DIR/integration + +VERSION=0.1.1-SNAPSHOT +export ARROW_JAVA_INTEGRATION_JAR=$JAVA_DIR/tools/target/arrow-tools-$VERSION-jar-with-dependencies.jar +export ARROW_CPP_TESTER=$CPP_BUILD_DIR/debug/json-integration-test + +source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh +export MINICONDA=$HOME/miniconda +export PATH="$MINICONDA/bin:$PATH" + +CONDA_ENV_NAME=arrow-integration-test +conda create -y -q -n $CONDA_ENV_NAME python=3.5 +source activate $CONDA_ENV_NAME + +# faster builds, please +conda install -y nomkl + +# Expensive dependencies install from Continuum package repo +conda install -y pip numpy six + +python integration_test.py --debug + +popd diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index 3262425e99b66..1f0bb66e91a3e 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -60,6 +60,10 @@ bool Array::EqualsExact(const Array& other) const { return true; } +bool Array::ApproxEquals(const std::shared_ptr& arr) const { + return Equals(arr); +} + Status Array::Validate() const { return Status::OK(); } diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index ff2b70e213b1b..78aa2b867e1ea 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -62,6 +62,7 @@ class ARROW_EXPORT Array { bool EqualsExact(const Array& arr) const; virtual bool Equals(const std::shared_ptr& arr) const = 0; + virtual bool ApproxEquals(const std::shared_ptr& arr) const; // Compare if the range of slots specified are equal for the given array and // this array. end_idx exclusive. This methods does not bounds check. diff --git a/cpp/src/arrow/ipc/ipc-metadata-test.cc b/cpp/src/arrow/ipc/ipc-metadata-test.cc index d29583f8488e0..de08e6dab73c6 100644 --- a/cpp/src/arrow/ipc/ipc-metadata-test.cc +++ b/cpp/src/arrow/ipc/ipc-metadata-test.cc @@ -70,7 +70,7 @@ const std::shared_ptr INT32 = std::make_shared(); TEST_F(TestSchemaMetadata, PrimitiveFields) { auto f0 = std::make_shared("f0", std::make_shared()); - auto f1 = std::make_shared("f1", std::make_shared()); + auto f1 = std::make_shared("f1", std::make_shared(), false); auto f2 = std::make_shared("f2", std::make_shared()); auto f3 = std::make_shared("f3", std::make_shared()); auto f4 = std::make_shared("f4", std::make_shared()); @@ -78,7 +78,7 @@ TEST_F(TestSchemaMetadata, PrimitiveFields) { auto f6 = std::make_shared("f6", std::make_shared()); auto f7 = std::make_shared("f7", std::make_shared()); auto f8 = std::make_shared("f8", std::make_shared()); - auto f9 = std::make_shared("f9", std::make_shared()); + auto f9 = std::make_shared("f9", std::make_shared(), false); auto f10 = std::make_shared("f10", std::make_shared()); Schema schema({f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10}); diff --git a/cpp/src/arrow/ipc/json-integration-test.cc b/cpp/src/arrow/ipc/json-integration-test.cc index c4e68472a19d4..291a719d4e58c 100644 --- a/cpp/src/arrow/ipc/json-integration-test.cc +++ b/cpp/src/arrow/ipc/json-integration-test.cc @@ -169,15 +169,15 @@ static Status ValidateArrowVsJson( RETURN_NOT_OK(json_reader->GetRecordBatch(i, &json_batch)); RETURN_NOT_OK(arrow_reader->GetRecordBatch(i, &arrow_batch)); - if (!json_batch->Equals(*arrow_batch.get())) { + if (!json_batch->ApproxEquals(*arrow_batch.get())) { std::stringstream ss; ss << "Record batch " << i << " did not match"; - ss << "\nJSON: \n "; - RETURN_NOT_OK(PrettyPrint(*json_batch.get(), &ss)); + ss << "\nJSON:\n"; + 
RETURN_NOT_OK(PrettyPrint(*json_batch.get(), 0, &ss)); - ss << "\nArrow: \n "; - RETURN_NOT_OK(PrettyPrint(*arrow_batch.get(), &ss)); + ss << "\nArrow:\n"; + RETURN_NOT_OK(PrettyPrint(*arrow_batch.get(), 0, &ss)); return Status::Invalid(ss.str()); } } @@ -299,6 +299,23 @@ static const char* JSON_EXAMPLE = R"example( "VALIDITY": [1, 0, 0, 1, 1] } ] + }, + { + "count": 4, + "columns": [ + { + "name": "foo", + "count": 4, + "DATA": [1, 2, 3, 4], + "VALIDITY": [1, 0, 1, 1] + }, + { + "name": "bar", + "count": 4, + "DATA": [1.0, 2.0, 3.0, 4.0], + "VALIDITY": [1, 0, 0, 1] + } + ] } ] } diff --git a/cpp/src/arrow/ipc/json-internal.cc b/cpp/src/arrow/ipc/json-internal.cc index 50f5b0cb1bd1e..ff9f59800be38 100644 --- a/cpp/src/arrow/ipc/json-internal.cc +++ b/cpp/src/arrow/ipc/json-internal.cc @@ -418,7 +418,7 @@ class JsonArrayWriter : public ArrayVisitor { template void WriteOffsetsField(const T* offsets, int32_t length) { - writer_->Key("OFFSETS"); + writer_->Key("OFFSET"); writer_->StartArray(); for (int i = 0; i < length; ++i) { writer_->Int64(offsets[i]); @@ -810,7 +810,7 @@ class JsonArrayReader { builder.Append(val.GetUint64()); } else if (IsFloatingPoint::value) { DCHECK(val.IsFloat()); - builder.Append(val.GetFloat()); + builder.Append(val.GetDouble()); } else if (std::is_base_of::value) { DCHECK(val.IsBool()); builder.Append(val.GetBool()); @@ -853,8 +853,8 @@ class JsonArrayReader { typename std::enable_if::value, Status>::type ReadArray( const RjObject& json_array, int32_t length, const std::vector& is_valid, const std::shared_ptr& type, std::shared_ptr* array) { - const auto& json_offsets = json_array.FindMember("OFFSETS"); - RETURN_NOT_ARRAY("OFFSETS", json_offsets, json_array); + const auto& json_offsets = json_array.FindMember("OFFSET"); + RETURN_NOT_ARRAY("OFFSET", json_offsets, json_array); const auto& json_offsets_arr = json_offsets->value.GetArray(); int32_t null_count = 0; diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index b99522825d902..7a2416165b203 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -264,7 +264,7 @@ Status FieldFromFlatbuffer(const flatbuf::Field* field, std::shared_ptr* RETURN_NOT_OK( TypeFromFlatbuffer(field->type_type(), field->type(), child_fields, &type)); - *out = std::make_shared(field->name()->str(), type); + *out = std::make_shared(field->name()->str(), type, field->nullable()); return Status::OK(); } diff --git a/cpp/src/arrow/pretty_print-test.cc b/cpp/src/arrow/pretty_print-test.cc index 10af41d16af13..b1e6a11cedd9b 100644 --- a/cpp/src/arrow/pretty_print-test.cc +++ b/cpp/src/arrow/pretty_print-test.cc @@ -48,8 +48,8 @@ class TestArrayPrinter : public ::testing::Test { }; template -void CheckPrimitive(const std::vector& is_valid, const std::vector& values, - const char* expected) { +void CheckPrimitive(int indent, const std::vector& is_valid, + const std::vector& values, const char* expected) { std::ostringstream sink; MemoryPool* pool = default_memory_pool(); @@ -66,7 +66,7 @@ void CheckPrimitive(const std::vector& is_valid, const std::vector std::shared_ptr array; ASSERT_OK(builder.Finish(&array)); - ASSERT_OK(PrettyPrint(*array.get(), &sink)); + ASSERT_OK(PrettyPrint(*array.get(), indent, &sink)); std::string result = sink.str(); ASSERT_EQ(std::string(expected, strlen(expected)), result); @@ -77,11 +77,11 @@ TEST_F(TestArrayPrinter, PrimitiveType) { std::vector values = {0, 1, 2, 3, 4}; static const char* expected = R"expected([0, 1, null, 3, 
null])expected"; - CheckPrimitive(is_valid, values, expected); + CheckPrimitive(0, is_valid, values, expected); std::vector values2 = {"foo", "bar", "", "baz", ""}; static const char* ex2 = R"expected(["foo", "bar", null, "baz", null])expected"; - CheckPrimitive(is_valid, values2, ex2); + CheckPrimitive(0, is_valid, values2, ex2); } } // namespace arrow diff --git a/cpp/src/arrow/pretty_print.cc b/cpp/src/arrow/pretty_print.cc index c0b4b08274ac1..c63a9e93e6a63 100644 --- a/cpp/src/arrow/pretty_print.cc +++ b/cpp/src/arrow/pretty_print.cc @@ -16,7 +16,9 @@ // under the License. #include +#include #include +#include #include "arrow/array.h" #include "arrow/pretty_print.h" @@ -32,20 +34,35 @@ namespace arrow { class ArrayPrinter : public ArrayVisitor { public: - ArrayPrinter(const Array& array, std::ostream* sink) : array_(array), sink_(sink) {} + ArrayPrinter(const Array& array, int indent, std::ostream* sink) + : array_(array), indent_(indent), sink_(sink) {} Status Print() { return VisitArray(array_); } Status VisitArray(const Array& array) { return array.Accept(this); } template - typename std::enable_if::value, void>::type WriteDataValues( + typename std::enable_if::value, void>::type WriteDataValues( const T& array) { const auto data = array.raw_data(); for (int i = 0; i < array.length(); ++i) { if (i > 0) { (*sink_) << ", "; } if (array.IsNull(i)) { (*sink_) << "null"; + } else { + (*sink_) << static_cast(data[i]); + } + } + } + + template + typename std::enable_if::value, void>::type WriteDataValues( + const T& array) { + const auto data = array.raw_data(); + for (int i = 0; i < array.length(); ++i) { + if (i > 0) { (*sink_) << ", "; } + if (array.IsNull(i)) { + Write("null"); } else { (*sink_) << data[i]; } @@ -60,7 +77,7 @@ class ArrayPrinter : public ArrayVisitor { for (int i = 0; i < array.length(); ++i) { if (i > 0) { (*sink_) << ", "; } if (array.IsNull(i)) { - (*sink_) << "null"; + Write("null"); } else { const char* buf = reinterpret_cast(array.GetValue(i, &length)); (*sink_) << "\"" << std::string(buf, length) << "\""; @@ -74,9 +91,9 @@ class ArrayPrinter : public ArrayVisitor { for (int i = 0; i < array.length(); ++i) { if (i > 0) { (*sink_) << ", "; } if (array.IsNull(i)) { - (*sink_) << "null"; + Write("null"); } else { - (*sink_) << (array.Value(i) ? "true" : "false"); + Write(array.Value(i) ? 
"true" : "false"); } } } @@ -148,20 +165,38 @@ class ArrayPrinter : public ArrayVisitor { } Status Visit(const ListArray& array) override { - // auto type = static_cast(array.type().get()); - // for (size_t i = 0; i < fields.size(); ++i) { - // RETURN_NOT_OK(VisitArray(fields[i]->name, *arrays[i].get())); - // } - // return WriteChildren(type->children(), {array.values()}); + Newline(); + Write("-- is_valid: "); + BooleanArray is_valid(array.length(), array.null_bitmap()); + PrettyPrint(is_valid, indent_ + 2, sink_); + + Newline(); + Write("-- offsets: "); + Int32Array offsets(array.length() + 1, array.offsets()); + PrettyPrint(offsets, indent_ + 2, sink_); + + Newline(); + Write("-- values: "); + PrettyPrint(*array.values().get(), indent_ + 2, sink_); + return Status::OK(); } Status Visit(const StructArray& array) override { - // auto type = static_cast(array.type().get()); - // for (size_t i = 0; i < fields.size(); ++i) { - // RETURN_NOT_OK(VisitArray(fields[i]->name, *arrays[i].get())); - // } - // return WriteChildren(type->children(), array.fields()); + Newline(); + Write("-- is_valid: "); + BooleanArray is_valid(array.length(), array.null_bitmap()); + PrettyPrint(is_valid, indent_ + 2, sink_); + + const std::vector>& fields = array.fields(); + for (size_t i = 0; i < fields.size(); ++i) { + Newline(); + std::stringstream ss; + ss << "-- child " << i << " type: " << fields[i]->type()->ToString() << " values: "; + Write(ss.str()); + PrettyPrint(*fields[i].get(), indent_ + 2, sink_); + } + return Status::OK(); } @@ -169,21 +204,38 @@ class ArrayPrinter : public ArrayVisitor { return Status::NotImplemented("union"); } + void Write(const char* data) { (*sink_) << data; } + + void Write(const std::string& data) { (*sink_) << data; } + + void Newline() { + (*sink_) << "\n"; + Indent(); + } + + void Indent() { + for (int i = 0; i < indent_; ++i) { + (*sink_) << " "; + } + } + private: const Array& array_; + int indent_; + std::ostream* sink_; }; -Status PrettyPrint(const Array& arr, std::ostream* sink) { - ArrayPrinter printer(arr, sink); +Status PrettyPrint(const Array& arr, int indent, std::ostream* sink) { + ArrayPrinter printer(arr, indent, sink); return printer.Print(); } -Status PrettyPrint(const RecordBatch& batch, std::ostream* sink) { +Status PrettyPrint(const RecordBatch& batch, int indent, std::ostream* sink) { for (int i = 0; i < batch.num_columns(); ++i) { const std::string& name = batch.column_name(i); (*sink) << name << ": "; - RETURN_NOT_OK(PrettyPrint(*batch.column(i).get(), sink)); + RETURN_NOT_OK(PrettyPrint(*batch.column(i).get(), indent + 2, sink)); (*sink) << "\n"; } return Status::OK(); diff --git a/cpp/src/arrow/pretty_print.h b/cpp/src/arrow/pretty_print.h index dcb236d726949..f508aa042945a 100644 --- a/cpp/src/arrow/pretty_print.h +++ b/cpp/src/arrow/pretty_print.h @@ -27,8 +27,12 @@ namespace arrow { class Status; -Status ARROW_EXPORT PrettyPrint(const RecordBatch& batch, std::ostream* sink); -Status ARROW_EXPORT PrettyPrint(const Array& arr, std::ostream* sink); +struct PrettyPrintOptions { + int indent; +}; + +Status ARROW_EXPORT PrettyPrint(const RecordBatch& batch, int indent, std::ostream* sink); +Status ARROW_EXPORT PrettyPrint(const Array& arr, int indent, std::ostream* sink); } // namespace arrow diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc index af84f27eab557..eb1258a73038a 100644 --- a/cpp/src/arrow/table.cc +++ b/cpp/src/arrow/table.cc @@ -48,6 +48,18 @@ bool RecordBatch::Equals(const RecordBatch& other) const { return true; } +bool 
RecordBatch::ApproxEquals(const RecordBatch& other) const { + if (num_columns() != other.num_columns() || num_rows_ != other.num_rows()) { + return false; + } + + for (int i = 0; i < num_columns(); ++i) { + if (!column(i)->ApproxEquals(other.column(i))) { return false; } + } + + return true; +} + // ---------------------------------------------------------------------- // Table methods diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h index 1a856c8a436d5..f2c334ff626a4 100644 --- a/cpp/src/arrow/table.h +++ b/cpp/src/arrow/table.h @@ -45,6 +45,8 @@ class ARROW_EXPORT RecordBatch { bool Equals(const RecordBatch& other) const; + bool ApproxEquals(const RecordBatch& other) const; + // @returns: the table's schema const std::shared_ptr& schema() const { return schema_; } diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 80f295c487f13..dc955ac62d36c 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -209,7 +209,7 @@ std::vector ListType::GetBufferLayout() const { } std::vector StructType::GetBufferLayout() const { - return {kValidityBuffer, kTypeBuffer}; + return {kValidityBuffer}; } std::vector UnionType::GetBufferLayout() const { diff --git a/cpp/src/arrow/type_traits.h b/cpp/src/arrow/type_traits.h index c21c5002035f8..3aaec0bd5935a 100644 --- a/cpp/src/arrow/type_traits.h +++ b/cpp/src/arrow/type_traits.h @@ -186,6 +186,12 @@ struct IsSignedInt { std::is_integral::value && std::is_signed::value; }; +template +struct IsInteger { + PRIMITIVE_TRAITS(T); + static constexpr bool value = std::is_integral::value; +}; + template struct IsFloatingPoint { PRIMITIVE_TRAITS(T); diff --git a/cpp/src/arrow/types/primitive.cc b/cpp/src/arrow/types/primitive.cc index f42a3cac021cd..75e5a9ff40e16 100644 --- a/cpp/src/arrow/types/primitive.cc +++ b/cpp/src/arrow/types/primitive.cc @@ -17,6 +17,7 @@ #include "arrow/types/primitive.h" +#include #include #include "arrow/type_traits.h" diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index a5a3704e2d2d3..c665218b4448c 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -18,8 +18,10 @@ #ifndef ARROW_TYPES_PRIMITIVE_H #define ARROW_TYPES_PRIMITIVE_H +#include #include #include +#include #include #include @@ -55,7 +57,7 @@ class ARROW_EXPORT PrimitiveArray : public Array { const uint8_t* raw_data_; }; -template +template class ARROW_EXPORT NumericArray : public PrimitiveArray { public: using TypeClass = TYPE; @@ -69,9 +71,11 @@ class ARROW_EXPORT NumericArray : public PrimitiveArray { : PrimitiveArray(type, length, data, null_count, null_bitmap) {} bool EqualsExact(const NumericArray& other) const { - return PrimitiveArray::EqualsExact(*static_cast(&other)); + return PrimitiveArray::EqualsExact(static_cast(other)); } + bool ApproxEquals(const std::shared_ptr& arr) const { return Equals(arr); } + bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, const ArrayPtr& arr) const override { if (this == arr.get()) { return true; } @@ -95,6 +99,78 @@ class ARROW_EXPORT NumericArray : public PrimitiveArray { value_type Value(int i) const { return raw_data()[i]; } }; +template <> +inline bool NumericArray::ApproxEquals( + const std::shared_ptr& arr) const { + if (this == arr.get()) { return true; } + if (!arr) { return false; } + if (this->type_enum() != arr->type_enum()) { return false; } + + const auto& other = *static_cast*>(arr.get()); + + if (this == &other) { return true; } + if (null_count_ != other.null_count_) { return false; } + + auto 
this_data = reinterpret_cast(raw_data_); + auto other_data = reinterpret_cast(other.raw_data_); + + static constexpr float EPSILON = 1E-5; + + if (length_ == 0 && other.length_ == 0) { return true; } + + if (null_count_ > 0) { + bool equal_bitmap = + null_bitmap_->Equals(*other.null_bitmap_, BitUtil::CeilByte(length_) / 8); + if (!equal_bitmap) { return false; } + + for (int i = 0; i < length_; ++i) { + if (IsNull(i)) continue; + if (fabs(this_data[i] - other_data[i]) > EPSILON) { return false; } + } + } else { + for (int i = 0; i < length_; ++i) { + if (fabs(this_data[i] - other_data[i]) > EPSILON) { return false; } + } + } + return true; +} + +template <> +inline bool NumericArray::ApproxEquals( + const std::shared_ptr& arr) const { + if (this == arr.get()) { return true; } + if (!arr) { return false; } + if (this->type_enum() != arr->type_enum()) { return false; } + + const auto& other = *static_cast*>(arr.get()); + + if (this == &other) { return true; } + if (null_count_ != other.null_count_) { return false; } + + auto this_data = reinterpret_cast(raw_data_); + auto other_data = reinterpret_cast(other.raw_data_); + + if (length_ == 0 && other.length_ == 0) { return true; } + + static constexpr double EPSILON = 1E-5; + + if (null_count_ > 0) { + bool equal_bitmap = + null_bitmap_->Equals(*other.null_bitmap_, BitUtil::CeilByte(length_) / 8); + if (!equal_bitmap) { return false; } + + for (int i = 0; i < length_; ++i) { + if (IsNull(i)) continue; + if (fabs(this_data[i] - other_data[i]) > EPSILON) { return false; } + } + } else { + for (int i = 0; i < length_; ++i) { + if (fabs(this_data[i] - other_data[i]) > EPSILON) { return false; } + } + } + return true; +} + template class ARROW_EXPORT PrimitiveBuilder : public ArrayBuilder { public: @@ -265,6 +341,13 @@ class ARROW_EXPORT BooleanBuilder : public ArrayBuilder { uint8_t* raw_data_; }; +// gcc and clang disagree about how to handle template visibility when you have +// explicit specializations https://llvm.org/bugs/show_bug.cgi?id=24815 +#if defined(__GNUC__) && !defined(__clang__) +#pragma GCC diagnostic push +#pragma GCC diagnostic ignored "-Wattributes" +#endif + // Only instantiate these templates once extern template class ARROW_EXPORT NumericArray; extern template class ARROW_EXPORT NumericArray; @@ -279,6 +362,10 @@ extern template class ARROW_EXPORT NumericArray; extern template class ARROW_EXPORT NumericArray; extern template class ARROW_EXPORT NumericArray; +#if defined(__GNUC__) && !defined(__clang__) +#pragma GCC diagnostic pop +#endif + } // namespace arrow #endif // ARROW_TYPES_PRIMITIVE_H diff --git a/integration/README.md b/integration/README.md new file mode 100644 index 0000000000000..b1e4e3a82a734 --- /dev/null +++ b/integration/README.md @@ -0,0 +1,59 @@ + + +# Arrow integration testing + +Our strategy for integration testing between Arrow implementations is as follows: + +* Test datasets are specified in a custom human-readable, JSON-based format + designed for Arrow + +* Each implementation provides a testing executable capable of converting + between the JSON and the binary Arrow file representation + +* The test executable is also capable of validating the contents of a binary + file against a corresponding JSON file + +## Running the existing integration tests + +First, build the Java and C++ projects. 
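+
+A hedged sketch of the C++ side (the CMake target name below is an
+assumption; what matters is that the resulting binary matches the
+`$ARROW_CPP_TESTER` path used later in this document):
+
+```bash
+mkdir -p cpp/test-build && cd cpp/test-build
+cmake ..
+make json-integration-test   # expected at cpp/test-build/debug/json-integration-test
+```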
+For Java, you must run
+
+```
+mvn package
+```
+
+Now, the integration tests rely on two environment variables which point to the
+Java `arrow-tools` JAR and the C++ `json-integration-test` executable:
+
+```bash
+JAVA_DIR=$ARROW_HOME/java
+CPP_BUILD_DIR=$ARROW_HOME/cpp/test-build
+
+VERSION=0.1.1-SNAPSHOT
+export ARROW_JAVA_INTEGRATION_JAR=$JAVA_DIR/tools/target/arrow-tools-$VERSION-jar-with-dependencies.jar
+export ARROW_CPP_TESTER=$CPP_BUILD_DIR/debug/json-integration-test
+```
+
+Here `$ARROW_HOME` is the location of your Arrow git clone. The
+`$CPP_BUILD_DIR` may be different depending on how you built with CMake
+(in-source or out-of-source).
+
+Once this is done, run the integration tests with (optionally adding `--debug`
+for additional output)
+
+```
+python integration_test.py
+
+python integration_test.py --debug  # additional output
+```
\ No newline at end of file
diff --git a/integration/integration_test.py b/integration/integration_test.py
index 88dc3ad7971ff..417354bc83d9e 100644
--- a/integration/integration_test.py
+++ b/integration/integration_test.py
@@ -15,23 +15,53 @@
+from collections import OrderedDict
 import argparse
 import glob
 import itertools
+import json
 import os
 import six
+import string
 import subprocess
 import tempfile
 import uuid
 
+import numpy as np
 
 ARROW_HOME = os.path.abspath(__file__).rsplit("/", 2)[0]
 
+# Control for flakiness
+np.random.seed(12345)
+
 
 def guid():
     return uuid.uuid4().hex
 
 
+# from pandas
+RANDS_CHARS = np.array(list(string.ascii_letters + string.digits),
+                       dtype=(np.str_, 1))
+
+
+def rands(nchars):
+    """
+    Generate one random byte string.
+
+    See `rands_array` if you want to create an array of random strings.
+
+    """
+    return ''.join(np.random.choice(RANDS_CHARS, nchars))
+
+
+def str_from_bytes(x):
+    if six.PY2:
+        return x
+    else:
+        return x.decode('utf-8')
+
+
+# from the merge_arrow_pr.py script
 def run_cmd(cmd):
     if isinstance(cmd, six.string_types):
         cmd = cmd.split(' ')
@@ -43,13 +73,469 @@ def run_cmd(cmd):
         print('Command failed: %s' % ' '.join(cmd))
         print('With output:')
         print('--------------')
-        print(e.output)
+        print(str_from_bytes(e.output))
         print('--------------')
         raise e
 
-    if isinstance(output, six.binary_type):
-        output = output.decode('utf-8')
-    return output
+    return str_from_bytes(output)
+
+# ----------------------------------------------------------------------
+# Data generation
+
+
+class DataType(object):
+
+    def __init__(self, name, nullable=True):
+        self.name = name
+        self.nullable = nullable
+
+    def get_json(self):
+        return OrderedDict([
+            ('name', self.name),
+            ('type', self._get_type()),
+            ('nullable', self.nullable),
+            ('children', self._get_children()),
+            ('typeLayout', self._get_type_layout())
+        ])
+
+    def _make_is_valid(self, size):
+        if self.nullable:
+            return np.random.randint(0, 2, size=size)
+        else:
+            return np.ones(size)
+
+
+class Column(object):
+
+    def __init__(self, name, count):
+        self.name = name
+        self.count = count
+
+    def _get_children(self):
+        return []
+
+    def _get_buffers(self):
+        return []
+
+    def get_json(self):
+        entries = [
+            ('name', self.name),
+            ('count', self.count)
+        ]
+
+        buffers = self._get_buffers()
+        entries.extend(buffers)
+
+        children = self._get_children()
+        if len(children) > 0:
+            entries.append(('children', children))
+
+        return OrderedDict(entries)
+
+
+class PrimitiveType(DataType):
+
+    def _get_children(self):
+        return []
+
+    def _get_type_layout(self):
+        return OrderedDict([
+            ('vectors',
[OrderedDict([('type', 'VALIDITY'), + ('typeBitWidth', 1)]), + OrderedDict([('type', 'DATA'), + ('typeBitWidth', self.bit_width)])])]) + + +class PrimitiveColumn(Column): + + def __init__(self, name, count, is_valid, values): + Column.__init__(self, name, count) + self.is_valid = is_valid + self.values = values + + def _get_buffers(self): + return [ + ('VALIDITY', [int(v) for v in self.is_valid]), + ('DATA', list(self.values)) + ] + + +TEST_INT_MIN = - 2**31 + 1 +TEST_INT_MAX = 2**31 - 1 + + +class IntegerType(PrimitiveType): + + def __init__(self, name, is_signed, bit_width, nullable=True): + PrimitiveType.__init__(self, name, nullable=nullable) + self.is_signed = is_signed + self.bit_width = bit_width + + @property + def numpy_type(self): + return ('int' if self.is_signed else 'uint') + str(self.bit_width) + + def _get_type(self): + return OrderedDict([ + ('name', 'int'), + ('isSigned', self.is_signed), + ('bitWidth', self.bit_width) + ]) + + def generate_column(self, size): + iinfo = np.iinfo(self.numpy_type) + values = [int(x) for x in + np.random.randint(max(iinfo.min, TEST_INT_MIN), + min(iinfo.max, TEST_INT_MAX), + size=size)] + + is_valid = self._make_is_valid(size) + return PrimitiveColumn(self.name, size, is_valid, values) + + +class FloatingPointType(PrimitiveType): + + def __init__(self, name, bit_width, nullable=True): + PrimitiveType.__init__(self, name, nullable=nullable) + + self.bit_width = bit_width + self.precision = { + 16: 'HALF', + 32: 'SINGLE', + 64: 'DOUBLE' + }[self.bit_width] + + @property + def numpy_type(self): + return 'float' + str(self.bit_width) + + def _get_type(self): + return OrderedDict([ + ('name', 'floatingpoint'), + ('precision', self.precision) + ]) + + def generate_column(self, size): + values = np.random.randn(size) * 1000 + values = np.round(values, 3) + + is_valid = self._make_is_valid(size) + return PrimitiveColumn(self.name, size, is_valid, values) + + +class BooleanType(PrimitiveType): + + bit_width = 1 + + def _get_type(self): + return OrderedDict([('name', 'bool')]) + + @property + def numpy_type(self): + return 'bool' + + def generate_column(self, size): + values = list(map(bool, np.random.randint(0, 2, size=size))) + is_valid = self._make_is_valid(size) + return PrimitiveColumn(self.name, size, is_valid, values) + + +class StringType(PrimitiveType): + + @property + def numpy_type(self): + return object + + def _get_type(self): + return OrderedDict([('name', 'utf8')]) + + def _get_type_layout(self): + return OrderedDict([ + ('vectors', + [OrderedDict([('type', 'VALIDITY'), + ('typeBitWidth', 1)]), + OrderedDict([('type', 'OFFSET'), + ('typeBitWidth', 32)]), + OrderedDict([('type', 'DATA'), + ('typeBitWidth', 8)])])]) + + def generate_column(self, size): + K = 7 + is_valid = self._make_is_valid(size) + values = [] + + for i in range(size): + if is_valid[i]: + values.append(rands(K)) + else: + values.append("") + + return StringColumn(self.name, size, is_valid, values) + + +class JSONSchema(object): + + def __init__(self, fields): + self.fields = fields + + def get_json(self): + return OrderedDict([ + ('fields', [field.get_json() for field in self.fields]) + ]) + + +class StringColumn(PrimitiveColumn): + + def _get_buffers(self): + offset = 0 + offsets = [0] + + data = [] + for i, v in enumerate(self.values): + if self.is_valid[i]: + offset += len(v) + else: + v = "" + + offsets.append(offset) + data.append(v) + + return [ + ('VALIDITY', [int(x) for x in self.is_valid]), + ('OFFSET', offsets), + ('DATA', data) + ] + + +class 
ListType(DataType): + + def __init__(self, name, value_type, nullable=True): + DataType.__init__(self, name, nullable=nullable) + self.value_type = value_type + + def _get_type(self): + return OrderedDict([ + ('name', 'list') + ]) + + def _get_children(self): + return [self.value_type.get_json()] + + def _get_type_layout(self): + return OrderedDict([ + ('vectors', + [OrderedDict([('type', 'VALIDITY'), + ('typeBitWidth', 1)]), + OrderedDict([('type', 'OFFSET'), + ('typeBitWidth', 32)])])]) + + def generate_column(self, size): + MAX_LIST_SIZE = 4 + + is_valid = self._make_is_valid(size) + list_sizes = np.random.randint(0, MAX_LIST_SIZE + 1, size=size) + offsets = [0] + + offset = 0 + for i in range(size): + if is_valid[i]: + offset += int(list_sizes[i]) + offsets.append(offset) + + # The offset now is the total number of elements in the child array + values = self.value_type.generate_column(offset) + + return ListColumn(self.name, size, is_valid, offsets, values) + + +class ListColumn(Column): + + def __init__(self, name, count, is_valid, offsets, values): + Column.__init__(self, name, count) + self.is_valid = is_valid + self.offsets = offsets + self.values = values + + def _get_buffers(self): + return [ + ('VALIDITY', [int(v) for v in self.is_valid]), + ('OFFSET', list(self.offsets)) + ] + + def _get_children(self): + return [self.values.get_json()] + + +class StructType(DataType): + + def __init__(self, name, field_types, nullable=True): + DataType.__init__(self, name, nullable=nullable) + self.field_types = field_types + + def _get_type(self): + return OrderedDict([ + ('name', 'struct') + ]) + + def _get_children(self): + return [type_.get_json() for type_ in self.field_types] + + def _get_type_layout(self): + return OrderedDict([ + ('vectors', + [OrderedDict([('type', 'VALIDITY'), + ('typeBitWidth', 1)])])]) + + def generate_column(self, size): + is_valid = self._make_is_valid(size) + + field_values = [type_.generate_column(size) + for type_ in self.field_types] + + return StructColumn(self.name, size, is_valid, field_values) + + +class StructColumn(Column): + + def __init__(self, name, count, is_valid, field_values): + Column.__init__(self, name, count) + self.is_valid = is_valid + self.field_values = field_values + + def _get_buffers(self): + return [ + ('VALIDITY', [int(v) for v in self.is_valid]) + ] + + def _get_children(self): + return [field.get_json() for field in self.field_values] + + +class JSONRecordBatch(object): + + def __init__(self, count, columns): + self.count = count + self.columns = columns + + def get_json(self): + return OrderedDict([ + ('count', self.count), + ('columns', [col.get_json() for col in self.columns]) + ]) + + +class JSONFile(object): + + def __init__(self, schema, batches): + self.schema = schema + self.batches = batches + + def get_json(self): + return OrderedDict([ + ('schema', self.schema.get_json()), + ('batches', [batch.get_json() for batch in self.batches]) + ]) + + def write(self, path): + with open(path, 'wb') as f: + f.write(json.dumps(self.get_json(), indent=2).encode('utf-8')) + + +def get_field(name, type_, nullable=True): + if type_ == 'utf8': + return StringType(name, nullable=nullable) + + dtype = np.dtype(type_) + + if dtype.kind in ('i', 'u'): + return IntegerType(name, dtype.kind == 'i', dtype.itemsize * 8, + nullable=nullable) + elif dtype.kind == 'f': + return FloatingPointType(name, dtype.itemsize * 8, + nullable=nullable) + elif dtype.kind == 'b': + return BooleanType(name, nullable=nullable) + else: + raise TypeError(dtype) + + 
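+# A hedged usage sketch of the generator classes above; `_example_json_file`
+# is a hypothetical helper, but everything it calls is defined in this module.
+def _example_json_file(path, size=5):
+    field = get_field('f0', 'int32')                  # IntegerType
+    batch = JSONRecordBatch(size, [field.generate_column(size)])
+    JSONFile(JSONSchema([field]), [batch]).write(path)
+
+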
+def generate_primitive_case(): + types = ['bool', 'int8', 'int16', 'int32', 'int64', + 'uint8', 'uint16', 'uint32', 'uint64', + 'float32', 'float64', 'utf8'] + + fields = [] + + for type_ in types: + fields.append(get_field(type_ + "_nullable", type_, True)) + fields.append(get_field(type_ + "_nonnullable", type_, False)) + + schema = JSONSchema(fields) + + batch_sizes = [7, 10] + batches = [] + for size in batch_sizes: + columns = [] + for field in fields: + col = field.generate_column(size) + columns.append(col) + + batches.append(JSONRecordBatch(size, columns)) + + return JSONFile(schema, batches) + + +def generate_nested_case(): + fields = [ + ListType('list_nullable', get_field('item', 'int32')), + StructType('struct_nullable', [get_field('f1', 'int32'), + get_field('f2', 'utf8')]), + + # TODO(wesm): this causes segfault + # ListType('list_nonnullable', get_field('item', 'int32'), False), + ] + + schema = JSONSchema(fields) + + batch_sizes = [7, 10] + batches = [] + for size in batch_sizes: + columns = [] + for field in fields: + col = field.generate_column(size) + columns.append(col) + + batches.append(JSONRecordBatch(size, columns)) + + return JSONFile(schema, batches) + + +def get_generated_json_files(): + temp_dir = tempfile.mkdtemp() + + def _temp_path(): + return + + file_objs = [] + + K = 10 + for i in range(K): + file_objs.append(generate_primitive_case()) + + file_objs.append(generate_nested_case()) + + generated_paths = [] + for file_obj in file_objs: + out_path = os.path.join(temp_dir, guid() + '.json') + file_obj.write(out_path) + generated_paths.append(out_path) + + return generated_paths + + +# ---------------------------------------------------------------------- +# Testing harness class IntegrationRunner(object): @@ -92,9 +578,11 @@ def validate(self, json_path, arrow_path): class JavaTester(Tester): - ARROW_TOOLS_JAR = os.path.join(ARROW_HOME, - 'java/tools/target/arrow-tools-0.1.1-' - 'SNAPSHOT-jar-with-dependencies.jar') + ARROW_TOOLS_JAR = os.environ.get( + 'ARROW_JAVA_INTEGRATION_JAR', + os.path.join(ARROW_HOME, + 'java/tools/target/arrow-tools-0.1.1-' + 'SNAPSHOT-jar-with-dependencies.jar')) name = 'Java' @@ -154,14 +642,16 @@ def json_to_arrow(self, json_path, arrow_path): return self._run(arrow_path, json_path, 'JSON_TO_ARROW') -def get_json_files(): +def get_static_json_files(): glob_pattern = os.path.join(ARROW_HOME, 'integration', 'data', '*.json') return glob.glob(glob_pattern) def run_all_tests(debug=False): testers = [JavaTester(debug=debug), CPPTester(debug=debug)] - json_files = get_json_files() + static_json_files = get_static_json_files() + generated_json_files = get_generated_json_files() + json_files = static_json_files + generated_json_files runner = IntegrationRunner(json_files, testers, debug=debug) runner.run() diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java index b7df8d13ee607..7fe1615da5a27 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java @@ -30,9 +30,6 @@ public abstract class BaseDataValueVector extends BaseValueVector implements Buf protected final static byte[] emptyByteArray = new byte[]{}; // Nullable vectors use this - /** maximum extra size at the end of the buffer */ - private static final int MAX_BUFFER_PADDING = 64; - public static void load(ArrowFieldNode fieldNode, List vectors, List buffers) { 
     int expectedSize = vectors.size();
     if (buffers.size() != expectedSize) {
@@ -51,9 +48,6 @@ public static void truncateBufferBasedOnSize(List<ArrowBuf> buffers, int bufferI
     if (buffer.writerIndex() < byteSize) {
       throw new IllegalArgumentException("can not truncate buffer to a larger size " + byteSize + ": " + buffer.writerIndex());
     }
-    if (buffer.writerIndex() - byteSize > MAX_BUFFER_PADDING) {
-      throw new IllegalArgumentException("Buffer too large to resize to " + byteSize + ": " + buffer.writerIndex());
-    }
     buffer.writerIndex(byteSize);
   }

From 73fe55683c36465972e21bef01b377c3b66579f9 Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Sat, 10 Dec 2016 09:05:48 +0100
Subject: [PATCH 0227/1644] ARROW-328: Return shared_ptr by value instead of const-ref

Author: Wes McKinney

Closes #235 from wesm/ARROW-328 and squashes the following commits:

f71decc [Wes McKinney] Return shared_ptr by value instead of const-ref
---
 cpp/src/arrow/array.h                    |  4 ++--
 cpp/src/arrow/builder.h                  |  4 ++--
 cpp/src/arrow/column.cc                  |  2 +-
 cpp/src/arrow/column.h                   |  8 ++++----
 cpp/src/arrow/ipc/file.cc                |  2 +-
 cpp/src/arrow/ipc/file.h                 |  2 +-
 cpp/src/arrow/ipc/metadata-internal.cc   |  2 +-
 cpp/src/arrow/table.h                    |  8 ++++----
 cpp/src/arrow/type.cc                    |  2 +-
 cpp/src/arrow/type.h                     |  6 +++---
 cpp/src/arrow/types/construct.cc         |  2 +-
 cpp/src/arrow/types/list.h               |  6 +++---
 cpp/src/arrow/types/primitive.h          |  2 +-
 cpp/src/arrow/types/struct.h             |  4 ++--
 cpp/src/arrow/util/buffer.h              |  2 +-
 python/pyarrow/includes/libarrow.pxd     | 22 +++++++++++-----------
 python/pyarrow/includes/libarrow_ipc.pxd |  2 +-
 17 files changed, 40 insertions(+), 40 deletions(-)

diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h
index 78aa2b867e1ea..91fb93e625494 100644
--- a/cpp/src/arrow/array.h
+++ b/cpp/src/arrow/array.h
@@ -53,10 +53,10 @@ class ARROW_EXPORT Array {
   int32_t length() const { return length_; }
   int32_t null_count() const { return null_count_; }
 
-  const std::shared_ptr<DataType>& type() const { return type_; }
+  std::shared_ptr<DataType> type() const { return type_; }
   Type::type type_enum() const { return type_->type; }
 
-  const std::shared_ptr<Buffer>& null_bitmap() const { return null_bitmap_; }
+  std::shared_ptr<Buffer> null_bitmap() const { return null_bitmap_; }
 
   const uint8_t* null_bitmap_data() const { return null_bitmap_data_; }
 
diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h
index cef17e5aabab9..73e49c0a69674 100644
--- a/cpp/src/arrow/builder.h
+++ b/cpp/src/arrow/builder.h
@@ -89,13 +89,13 @@ class ARROW_EXPORT ArrayBuilder {
   // this function responsibly.
   Status Advance(int32_t elements);
 
-  const std::shared_ptr<Buffer>& null_bitmap() const { return null_bitmap_; }
+  std::shared_ptr<Buffer> null_bitmap() const { return null_bitmap_; }
 
   // Creates new array object to hold the contents of the builder and transfers
   // ownership of the data. This resets all variables on the builder.
virtual Status Finish(std::shared_ptr* out) = 0; - const std::shared_ptr& type() const { return type_; } + std::shared_ptr type() const { return type_; } protected: MemoryPool* pool_; diff --git a/cpp/src/arrow/column.cc b/cpp/src/arrow/column.cc index 52e4c58e1dc3d..eca5f4d30a698 100644 --- a/cpp/src/arrow/column.cc +++ b/cpp/src/arrow/column.cc @@ -51,7 +51,7 @@ Column::Column( Status Column::ValidateData() { for (int i = 0; i < data_->num_chunks(); ++i) { - const std::shared_ptr& type = data_->chunk(i)->type(); + std::shared_ptr type = data_->chunk(i)->type(); if (!this->type()->Equals(type)) { std::stringstream ss; ss << "In chunk " << i << " expected type " << this->type()->ToString() diff --git a/cpp/src/arrow/column.h b/cpp/src/arrow/column.h index d5168cb032ba5..1caafec9db95c 100644 --- a/cpp/src/arrow/column.h +++ b/cpp/src/arrow/column.h @@ -46,7 +46,7 @@ class ARROW_EXPORT ChunkedArray { int num_chunks() const { return chunks_.size(); } - const std::shared_ptr& chunk(int i) const { return chunks_[i]; } + std::shared_ptr chunk(int i) const { return chunks_[i]; } protected: ArrayVector chunks_; @@ -68,16 +68,16 @@ class ARROW_EXPORT Column { int64_t null_count() const { return data_->null_count(); } - const std::shared_ptr& field() const { return field_; } + std::shared_ptr field() const { return field_; } // @returns: the column's name in the passed metadata const std::string& name() const { return field_->name; } // @returns: the column's type according to the metadata - const std::shared_ptr& type() const { return field_->type; } + std::shared_ptr type() const { return field_->type; } // @returns: the column's data as a chunked logical array - const std::shared_ptr& data() const { return data_; } + std::shared_ptr data() const { return data_; } // Verify that the column's array data is consistent with the passed field's // metadata Status ValidateData(); diff --git a/cpp/src/arrow/ipc/file.cc b/cpp/src/arrow/ipc/file.cc index 06001cc1c77bc..fa50058ea4200 100644 --- a/cpp/src/arrow/ipc/file.cc +++ b/cpp/src/arrow/ipc/file.cc @@ -179,7 +179,7 @@ Status FileReader::ReadFooter() { return footer_->GetSchema(&schema_); } -const std::shared_ptr& FileReader::schema() const { +std::shared_ptr FileReader::schema() const { return schema_; } diff --git a/cpp/src/arrow/ipc/file.h b/cpp/src/arrow/ipc/file.h index 4b79c98281bbc..4f35c37b03235 100644 --- a/cpp/src/arrow/ipc/file.h +++ b/cpp/src/arrow/ipc/file.h @@ -106,7 +106,7 @@ class ARROW_EXPORT FileReader { static Status Open(const std::shared_ptr& file, int64_t footer_offset, std::shared_ptr* reader); - const std::shared_ptr& schema() const; + std::shared_ptr schema() const; // Shared dictionaries for dictionary-encoding cross record batches // TODO(wesm): Implement dictionary reading when we also have dictionary diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index 7a2416165b203..5a2758912b759 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -285,7 +285,7 @@ Status SchemaToFlatbuffer( FBB& fbb, const Schema* schema, flatbuffers::Offset* out) { std::vector field_offsets; for (int i = 0; i < schema->num_fields(); ++i) { - const std::shared_ptr& field = schema->field(i); + std::shared_ptr field = schema->field(i); FieldOffset offset; RETURN_NOT_OK(FieldToFlatbuffer(fbb, field, &offset)); field_offsets.push_back(offset); diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h index f2c334ff626a4..bf5c39f11e411 100644 --- 
a/cpp/src/arrow/table.h +++ b/cpp/src/arrow/table.h @@ -48,11 +48,11 @@ class ARROW_EXPORT RecordBatch { bool ApproxEquals(const RecordBatch& other) const; // @returns: the table's schema - const std::shared_ptr& schema() const { return schema_; } + std::shared_ptr schema() const { return schema_; } // @returns: the i-th column // Note: Does not boundscheck - const std::shared_ptr& column(int i) const { return columns_[i]; } + std::shared_ptr column(int i) const { return columns_[i]; } const std::vector>& columns() const { return columns_; } @@ -88,11 +88,11 @@ class ARROW_EXPORT Table { const std::string& name() const { return name_; } // @returns: the table's schema - const std::shared_ptr& schema() const { return schema_; } + std::shared_ptr schema() const { return schema_; } // Note: Does not boundscheck // @returns: the i-th column - const std::shared_ptr& column(int i) const { return columns_[i]; } + std::shared_ptr column(int i) const { return columns_[i]; } // @returns: the number of columns in the table int num_columns() const { return columns_.size(); } diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index dc955ac62d36c..75f5086f37de0 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -81,7 +81,7 @@ std::string StructType::ToString() const { s << "struct<"; for (int i = 0; i < this->num_children(); ++i) { if (i > 0) { s << ", "; } - const std::shared_ptr& field = this->child(i); + std::shared_ptr field = this->child(i); s << field->name << ": " << field->type->ToString(); } s << ">"; diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 30777384dfb9f..966706cb520b2 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -134,7 +134,7 @@ struct ARROW_EXPORT DataType { return Equals(other.get()); } - const std::shared_ptr& child(int i) const { return children_[i]; } + std::shared_ptr child(int i) const { return children_[i]; } const std::vector>& children() const { return children_; } @@ -319,9 +319,9 @@ struct ARROW_EXPORT ListType : public DataType, public NoExtraMeta { children_ = {value_field}; } - const std::shared_ptr& value_field() const { return children_[0]; } + std::shared_ptr value_field() const { return children_[0]; } - const std::shared_ptr& value_type() const { return children_[0]->type; } + std::shared_ptr value_type() const { return children_[0]->type; } Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; diff --git a/cpp/src/arrow/types/construct.cc b/cpp/src/arrow/types/construct.cc index 67245f8ea1fda..ab9c59fd4639d 100644 --- a/cpp/src/arrow/types/construct.cc +++ b/cpp/src/arrow/types/construct.cc @@ -63,7 +63,7 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, case Type::LIST: { std::shared_ptr value_builder; - const std::shared_ptr& value_type = + std::shared_ptr value_type = static_cast(type.get())->value_type(); RETURN_NOT_OK(MakeBuilder(pool, value_type, &value_builder)); out->reset(new ListBuilder(pool, value_builder)); diff --git a/cpp/src/arrow/types/list.h b/cpp/src/arrow/types/list.h index bd93e8fdcfa1c..ec09a78afa66c 100644 --- a/cpp/src/arrow/types/list.h +++ b/cpp/src/arrow/types/list.h @@ -57,12 +57,12 @@ class ARROW_EXPORT ListArray : public Array { // Return a shared pointer in case the requestor desires to share ownership // with this array. 
- const std::shared_ptr& values() const { return values_; } + std::shared_ptr values() const { return values_; } std::shared_ptr offsets() const { return std::static_pointer_cast(offset_buffer_); } - const std::shared_ptr& value_type() const { return values_->type(); } + std::shared_ptr value_type() const { return values_->type(); } const int32_t* raw_offsets() const { return offsets_; } @@ -152,7 +152,7 @@ class ARROW_EXPORT ListBuilder : public ArrayBuilder { Status AppendNull() { return Append(false); } - const std::shared_ptr& value_builder() const { + std::shared_ptr value_builder() const { DCHECK(!values_) << "Using value builder is pointless when values_ is set"; return value_builder_; } diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h index c665218b4448c..ec578e1e0aee7 100644 --- a/cpp/src/arrow/types/primitive.h +++ b/cpp/src/arrow/types/primitive.h @@ -45,7 +45,7 @@ class ARROW_EXPORT PrimitiveArray : public Array { public: virtual ~PrimitiveArray() {} - const std::shared_ptr& data() const { return data_; } + std::shared_ptr data() const { return data_; } bool EqualsExact(const PrimitiveArray& other) const; bool Equals(const std::shared_ptr& arr) const override; diff --git a/cpp/src/arrow/types/struct.h b/cpp/src/arrow/types/struct.h index 035af05132572..1e2bf2d9a1223 100644 --- a/cpp/src/arrow/types/struct.h +++ b/cpp/src/arrow/types/struct.h @@ -46,7 +46,7 @@ class ARROW_EXPORT StructArray : public Array { // Return a shared pointer in case the requestor desires to share ownership // with this array. - const std::shared_ptr& field(int32_t pos) const { + std::shared_ptr field(int32_t pos) const { DCHECK_GT(field_arrays_.size(), 0); return field_arrays_[pos]; } @@ -99,7 +99,7 @@ class ARROW_EXPORT StructBuilder : public ArrayBuilder { Status AppendNull() { return Append(false); } - const std::shared_ptr field_builder(int pos) const { + std::shared_ptr field_builder(int pos) const { DCHECK_GT(field_builders_.size(), 0); return field_builders_[pos]; } diff --git a/cpp/src/arrow/util/buffer.h b/cpp/src/arrow/util/buffer.h index 330e15feae152..5c87395deebb0 100644 --- a/cpp/src/arrow/util/buffer.h +++ b/cpp/src/arrow/util/buffer.h @@ -86,7 +86,7 @@ class ARROW_EXPORT Buffer : public std::enable_shared_from_this { int64_t size() const { return size_; } - const std::shared_ptr parent() const { return parent_; } + std::shared_ptr parent() const { return parent_; } protected: bool is_mutable_; diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 350ebe30c9b89..15781ced4433a 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -91,12 +91,12 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: c_bool Equals(const shared_ptr[CSchema]& other) - const shared_ptr[CField]& field(int i) + shared_ptr[CField] field(int i) int num_fields() c_string ToString() cdef cppclass CArray" arrow::Array": - const shared_ptr[CDataType]& type() + shared_ptr[CDataType] type() int32_t length() int32_t null_count() @@ -142,8 +142,8 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: const int32_t* offsets() int32_t offset(int i) int32_t value_length(int i) - const shared_ptr[CArray]& values() - const shared_ptr[CDataType]& value_type() + shared_ptr[CArray] values() + shared_ptr[CDataType] value_type() cdef cppclass CStringArray" arrow::StringArray"(CListArray): c_string GetString(int i) @@ -152,7 +152,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: int64_t length() int64_t 
null_count() int num_chunks() - const shared_ptr[CArray]& chunk(int i) + shared_ptr[CArray] chunk(int i) cdef cppclass CColumn" arrow::Column": CColumn(const shared_ptr[CField]& field, @@ -164,8 +164,8 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: int64_t length() int64_t null_count() const c_string& name() - const shared_ptr[CDataType]& type() - const shared_ptr[CChunkedArray]& data() + shared_ptr[CDataType] type() + shared_ptr[CChunkedArray] data() cdef cppclass CRecordBatch" arrow::RecordBatch": CRecordBatch(const shared_ptr[CSchema]& schema, int32_t num_rows, @@ -173,8 +173,8 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: c_bool Equals(const CRecordBatch& other) - const shared_ptr[CSchema]& schema() - const shared_ptr[CArray]& column(int i) + shared_ptr[CSchema] schema() + shared_ptr[CArray] column(int i) const c_string& column_name(int i) const vector[shared_ptr[CArray]]& columns() @@ -191,8 +191,8 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: const c_string& name() - const shared_ptr[CSchema]& schema() - const shared_ptr[CColumn]& column(int i) + shared_ptr[CSchema] schema() + shared_ptr[CColumn] column(int i) cdef extern from "arrow/ipc/metadata.h" namespace "arrow::ipc" nogil: diff --git a/python/pyarrow/includes/libarrow_ipc.pxd b/python/pyarrow/includes/libarrow_ipc.pxd index eda5b9bae9e31..b3185b1c1671c 100644 --- a/python/pyarrow/includes/libarrow_ipc.pxd +++ b/python/pyarrow/includes/libarrow_ipc.pxd @@ -44,7 +44,7 @@ cdef extern from "arrow/ipc/file.h" namespace "arrow::ipc" nogil: CStatus Open2" Open"(const shared_ptr[ReadableFileInterface]& file, int64_t footer_offset, shared_ptr[CFileReader]* out) - const shared_ptr[CSchema]& schema() + shared_ptr[CSchema] schema() int num_dictionaries() int num_record_batches() From 2c10d7ccec3c07fb061e1988be16aecaf9916af4 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 12 Dec 2016 17:17:31 -0500 Subject: [PATCH 0228/1644] ARROW-418: [C++] Array / Builder class code reorganization, flattening I've been wanting to do this for a while -- it feels cleaner to me. I also am going to promote modules from arrow/util to the top level as well. I'm open to other ideas, too. Author: Wes McKinney Closes #236 from wesm/ARROW-418 and squashes the following commits: 6f556ea [Wes McKinney] Add missing math.h include for clang 9dc2e22 [Wes McKinney] Fix remaining old includes 6f7ae77 [Wes McKinney] Fixes, cpplint 66ac3f7 [Wes McKinney] Promote buffer.h/status.h/memory-pool.h to top level directory 8cdf059 [Wes McKinney] Consolidate Array and Builder classes in array.h, builder.h. 
Remove arrow/types subdirectory --- cpp/CMakeLists.txt | 15 +- cpp/src/arrow/CMakeLists.txt | 11 + cpp/src/arrow/api.h | 13 +- .../decimal-test.cc => array-decimal-test.cc} | 2 +- .../list-test.cc => array-list-test.cc} | 6 +- ...mitive-test.cc => array-primitive-test.cc} | 8 +- .../string-test.cc => array-string-test.cc} | 4 +- .../struct-test.cc => array-struct-test.cc} | 7 +- cpp/src/arrow/array-test.cc | 5 +- cpp/src/arrow/array.cc | 443 +++++++++++++++++- cpp/src/arrow/array.h | 373 ++++++++++++++- cpp/src/arrow/{util => }/buffer-test.cc | 4 +- cpp/src/arrow/{util => }/buffer.cc | 6 +- cpp/src/arrow/{util => }/buffer.h | 2 +- cpp/src/arrow/builder.cc | 329 ++++++++++++- cpp/src/arrow/builder.h | 315 ++++++++++++- cpp/src/arrow/column-benchmark.cc | 4 +- cpp/src/arrow/column-test.cc | 1 - cpp/src/arrow/column.cc | 2 +- cpp/src/arrow/io/file.cc | 6 +- cpp/src/arrow/io/hdfs.cc | 6 +- cpp/src/arrow/io/interfaces.cc | 4 +- cpp/src/arrow/io/io-file-test.cc | 2 +- cpp/src/arrow/io/io-hdfs-test.cc | 2 +- cpp/src/arrow/io/libhdfs_shim.cc | 2 +- cpp/src/arrow/io/memory.cc | 5 +- cpp/src/arrow/io/test-common.h | 4 +- cpp/src/arrow/ipc/adapter.cc | 9 +- cpp/src/arrow/ipc/file.cc | 4 +- cpp/src/arrow/ipc/ipc-adapter-test.cc | 10 +- cpp/src/arrow/ipc/ipc-file-test.cc | 11 +- cpp/src/arrow/ipc/ipc-json-test.cc | 18 +- cpp/src/arrow/ipc/ipc-metadata-test.cc | 2 +- cpp/src/arrow/ipc/json-integration-test.cc | 2 +- cpp/src/arrow/ipc/json-internal.cc | 10 +- cpp/src/arrow/ipc/json.cc | 6 +- cpp/src/arrow/ipc/metadata-internal.cc | 4 +- cpp/src/arrow/ipc/metadata.cc | 4 +- cpp/src/arrow/ipc/test-common.h | 9 +- cpp/src/arrow/ipc/util.h | 2 +- ...emory-pool-test.cc => memory_pool-test.cc} | 4 +- .../{util/memory-pool.cc => memory_pool.cc} | 4 +- .../{util/memory-pool.h => memory_pool.h} | 0 cpp/src/arrow/pretty_print-test.cc | 5 +- cpp/src/arrow/pretty_print.cc | 5 +- cpp/src/arrow/{util => }/status-test.cc | 2 +- cpp/src/arrow/{util => }/status.cc | 2 +- cpp/src/arrow/{util => }/status.h | 0 cpp/src/arrow/table-test.cc | 4 +- cpp/src/arrow/table.cc | 2 +- cpp/src/arrow/test-util.h | 43 +- cpp/src/arrow/type.cc | 8 +- cpp/src/arrow/type.h | 2 +- cpp/src/arrow/types/CMakeLists.txt | 39 -- cpp/src/arrow/types/construct.cc | 124 ----- cpp/src/arrow/types/construct.h | 47 -- cpp/src/arrow/types/datetime.h | 27 -- cpp/src/arrow/types/decimal.cc | 31 -- cpp/src/arrow/types/decimal.h | 28 -- cpp/src/arrow/types/list.cc | 162 ------- cpp/src/arrow/types/list.h | 170 ------- cpp/src/arrow/types/primitive.cc | 294 ------------ cpp/src/arrow/types/primitive.h | 371 --------------- cpp/src/arrow/types/string.cc | 150 ------ cpp/src/arrow/types/string.h | 149 ------ cpp/src/arrow/types/struct.cc | 108 ----- cpp/src/arrow/types/struct.h | 116 ----- cpp/src/arrow/types/test-common.h | 70 --- cpp/src/arrow/types/union.cc | 27 -- cpp/src/arrow/types/union.h | 48 -- cpp/src/arrow/util/CMakeLists.txt | 6 - cpp/src/arrow/util/bit-util.cc | 4 +- python/src/pyarrow/adapters/builtin.cc | 2 +- python/src/pyarrow/adapters/pandas.cc | 2 +- python/src/pyarrow/common.cc | 4 +- python/src/pyarrow/common.h | 5 +- python/src/pyarrow/io.cc | 4 +- 77 files changed, 1607 insertions(+), 2134 deletions(-) rename cpp/src/arrow/{types/decimal-test.cc => array-decimal-test.cc} (97%) rename cpp/src/arrow/{types/list-test.cc => array-list-test.cc} (97%) rename cpp/src/arrow/{types/primitive-test.cc => array-primitive-test.cc} (98%) rename cpp/src/arrow/{types/string-test.cc => array-string-test.cc} (98%) rename 
cpp/src/arrow/{types/struct-test.cc => array-struct-test.cc} (98%) rename cpp/src/arrow/{util => }/buffer-test.cc (98%) rename cpp/src/arrow/{util => }/buffer.cc (96%) rename cpp/src/arrow/{util => }/buffer.h (99%) rename cpp/src/arrow/{util/memory-pool-test.cc => memory_pool-test.cc} (96%) rename cpp/src/arrow/{util/memory-pool.cc => memory_pool.cc} (97%) rename cpp/src/arrow/{util/memory-pool.h => memory_pool.h} (100%) rename cpp/src/arrow/{util => }/status-test.cc (97%) rename cpp/src/arrow/{util => }/status.cc (98%) rename cpp/src/arrow/{util => }/status.h (100%) delete mode 100644 cpp/src/arrow/types/CMakeLists.txt delete mode 100644 cpp/src/arrow/types/construct.cc delete mode 100644 cpp/src/arrow/types/construct.h delete mode 100644 cpp/src/arrow/types/datetime.h delete mode 100644 cpp/src/arrow/types/decimal.cc delete mode 100644 cpp/src/arrow/types/decimal.h delete mode 100644 cpp/src/arrow/types/list.cc delete mode 100644 cpp/src/arrow/types/list.h delete mode 100644 cpp/src/arrow/types/primitive.cc delete mode 100644 cpp/src/arrow/types/primitive.h delete mode 100644 cpp/src/arrow/types/string.cc delete mode 100644 cpp/src/arrow/types/string.h delete mode 100644 cpp/src/arrow/types/struct.cc delete mode 100644 cpp/src/arrow/types/struct.h delete mode 100644 cpp/src/arrow/types/test-common.h delete mode 100644 cpp/src/arrow/types/union.cc delete mode 100644 cpp/src/arrow/types/union.h diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 798d75fe55643..adcca0e0b49e8 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -743,25 +743,17 @@ set(ARROW_PRIVATE_LINK_LIBS set(ARROW_SRCS src/arrow/array.cc + src/arrow/buffer.cc src/arrow/builder.cc src/arrow/column.cc + src/arrow/memory_pool.cc src/arrow/pretty_print.cc src/arrow/schema.cc + src/arrow/status.cc src/arrow/table.cc src/arrow/type.cc - src/arrow/types/construct.cc - src/arrow/types/decimal.cc - src/arrow/types/list.cc - src/arrow/types/primitive.cc - src/arrow/types/string.cc - src/arrow/types/struct.cc - src/arrow/types/union.cc - src/arrow/util/bit-util.cc - src/arrow/util/buffer.cc - src/arrow/util/memory-pool.cc - src/arrow/util/status.cc ) add_library(arrow_objlib OBJECT @@ -823,7 +815,6 @@ endif() add_subdirectory(src/arrow) add_subdirectory(src/arrow/io) add_subdirectory(src/arrow/util) -add_subdirectory(src/arrow/types) #---------------------------------------------------------------------- # IPC library diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index 6c0dea20ba7b5..7d7bc29f4abd8 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -20,9 +20,12 @@ install(FILES api.h array.h column.h + buffer.h builder.h + memory_pool.h pretty_print.h schema.h + status.h table.h type.h type_fwd.h @@ -37,9 +40,17 @@ install(FILES set(ARROW_TEST_LINK_LIBS ${ARROW_MIN_TEST_LIBS}) ADD_ARROW_TEST(array-test) +ADD_ARROW_TEST(array-decimal-test) +ADD_ARROW_TEST(array-list-test) +ADD_ARROW_TEST(array-primitive-test) +ADD_ARROW_TEST(array-string-test) +ADD_ARROW_TEST(array-struct-test) +ADD_ARROW_TEST(buffer-test) ADD_ARROW_TEST(column-test) +ADD_ARROW_TEST(memory_pool-test) ADD_ARROW_TEST(pretty_print-test) ADD_ARROW_TEST(schema-test) +ADD_ARROW_TEST(status-test) ADD_ARROW_TEST(table-test) ADD_ARROW_BENCHMARK(column-benchmark) diff --git a/cpp/src/arrow/api.h b/cpp/src/arrow/api.h index 2d317b49cb7b6..51437d863b8b9 100644 --- a/cpp/src/arrow/api.h +++ b/cpp/src/arrow/api.h @@ -21,20 +21,13 @@ #define ARROW_API_H #include "arrow/array.h" +#include "arrow/buffer.h" #include 
"arrow/builder.h" #include "arrow/column.h" +#include "arrow/memory_pool.h" #include "arrow/schema.h" +#include "arrow/status.h" #include "arrow/table.h" #include "arrow/type.h" -#include "arrow/types/construct.h" -#include "arrow/types/list.h" -#include "arrow/types/primitive.h" -#include "arrow/types/string.h" -#include "arrow/types/struct.h" - -#include "arrow/util/buffer.h" -#include "arrow/util/memory-pool.h" -#include "arrow/util/status.h" - #endif // ARROW_API_H diff --git a/cpp/src/arrow/types/decimal-test.cc b/cpp/src/arrow/array-decimal-test.cc similarity index 97% rename from cpp/src/arrow/types/decimal-test.cc rename to cpp/src/arrow/array-decimal-test.cc index 7296ff8176113..9e00fd9a7dd49 100644 --- a/cpp/src/arrow/types/decimal-test.cc +++ b/cpp/src/arrow/array-decimal-test.cc @@ -17,7 +17,7 @@ #include "gtest/gtest.h" -#include "arrow/types/decimal.h" +#include "arrow/type.h" namespace arrow { diff --git a/cpp/src/arrow/types/list-test.cc b/cpp/src/arrow/array-list-test.cc similarity index 97% rename from cpp/src/arrow/types/list-test.cc rename to cpp/src/arrow/array-list-test.cc index cb9a8c12d8ab9..8baaf06a7dbcc 100644 --- a/cpp/src/arrow/types/list-test.cc +++ b/cpp/src/arrow/array-list-test.cc @@ -25,13 +25,9 @@ #include "arrow/array.h" #include "arrow/builder.h" +#include "arrow/status.h" #include "arrow/test-util.h" #include "arrow/type.h" -#include "arrow/types/construct.h" -#include "arrow/types/list.h" -#include "arrow/types/primitive.h" -#include "arrow/types/test-common.h" -#include "arrow/util/status.h" using std::shared_ptr; using std::string; diff --git a/cpp/src/arrow/types/primitive-test.cc b/cpp/src/arrow/array-primitive-test.cc similarity index 98% rename from cpp/src/arrow/types/primitive-test.cc rename to cpp/src/arrow/array-primitive-test.cc index bdc8ec00be02c..a10e2404f29c6 100644 --- a/cpp/src/arrow/types/primitive-test.cc +++ b/cpp/src/arrow/array-primitive-test.cc @@ -22,16 +22,14 @@ #include "gtest/gtest.h" +#include "arrow/array.h" +#include "arrow/buffer.h" #include "arrow/builder.h" +#include "arrow/status.h" #include "arrow/test-util.h" #include "arrow/type.h" #include "arrow/type_traits.h" -#include "arrow/types/construct.h" -#include "arrow/types/primitive.h" -#include "arrow/types/test-common.h" #include "arrow/util/bit-util.h" -#include "arrow/util/buffer.h" -#include "arrow/util/status.h" using std::string; using std::shared_ptr; diff --git a/cpp/src/arrow/types/string-test.cc b/cpp/src/arrow/array-string-test.cc similarity index 98% rename from cpp/src/arrow/types/string-test.cc rename to cpp/src/arrow/array-string-test.cc index 3c4b12b7bc772..b144c632133d6 100644 --- a/cpp/src/arrow/types/string-test.cc +++ b/cpp/src/arrow/array-string-test.cc @@ -24,11 +24,9 @@ #include "gtest/gtest.h" #include "arrow/array.h" +#include "arrow/builder.h" #include "arrow/test-util.h" #include "arrow/type.h" -#include "arrow/types/primitive.h" -#include "arrow/types/string.h" -#include "arrow/types/test-common.h" namespace arrow { diff --git a/cpp/src/arrow/types/struct-test.cc b/cpp/src/arrow/array-struct-test.cc similarity index 98% rename from cpp/src/arrow/types/struct-test.cc rename to cpp/src/arrow/array-struct-test.cc index 197d7d4ad1f5e..58386fe028fd2 100644 --- a/cpp/src/arrow/types/struct-test.cc +++ b/cpp/src/arrow/array-struct-test.cc @@ -23,14 +23,9 @@ #include "arrow/array.h" #include "arrow/builder.h" +#include "arrow/status.h" #include "arrow/test-util.h" #include "arrow/type.h" -#include "arrow/types/construct.h" -#include 
"arrow/types/list.h" -#include "arrow/types/primitive.h" -#include "arrow/types/struct.h" -#include "arrow/types/test-common.h" -#include "arrow/util/status.h" using std::shared_ptr; using std::string; diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc index 158124468992a..783104e874bb7 100644 --- a/cpp/src/arrow/array-test.cc +++ b/cpp/src/arrow/array-test.cc @@ -24,11 +24,10 @@ #include "gtest/gtest.h" #include "arrow/array.h" +#include "arrow/buffer.h" +#include "arrow/memory_pool.h" #include "arrow/test-util.h" #include "arrow/type.h" -#include "arrow/types/primitive.h" -#include "arrow/util/buffer.h" -#include "arrow/util/memory-pool.h" namespace arrow { diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index 1f0bb66e91a3e..7ab61f59f551b 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -19,10 +19,13 @@ #include #include +#include +#include "arrow/buffer.h" +#include "arrow/status.h" +#include "arrow/type_traits.h" #include "arrow/util/bit-util.h" -#include "arrow/util/buffer.h" -#include "arrow/util/status.h" +#include "arrow/util/logging.h" namespace arrow { @@ -85,4 +88,440 @@ Status NullArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } +// ---------------------------------------------------------------------- +// Primitive array base + +PrimitiveArray::PrimitiveArray(const TypePtr& type, int32_t length, + const std::shared_ptr& data, int32_t null_count, + const std::shared_ptr& null_bitmap) + : Array(type, length, null_count, null_bitmap) { + data_ = data; + raw_data_ = data == nullptr ? nullptr : data_->data(); +} + +bool PrimitiveArray::EqualsExact(const PrimitiveArray& other) const { + if (this == &other) { return true; } + if (null_count_ != other.null_count_) { return false; } + + if (null_count_ > 0) { + bool equal_bitmap = + null_bitmap_->Equals(*other.null_bitmap_, BitUtil::CeilByte(length_) / 8); + if (!equal_bitmap) { return false; } + + const uint8_t* this_data = raw_data_; + const uint8_t* other_data = other.raw_data_; + + auto size_meta = dynamic_cast(type_.get()); + int value_byte_size = size_meta->bit_width() / 8; + DCHECK_GT(value_byte_size, 0); + + for (int i = 0; i < length_; ++i) { + if (!IsNull(i) && memcmp(this_data, other_data, value_byte_size)) { return false; } + this_data += value_byte_size; + other_data += value_byte_size; + } + return true; + } else { + if (length_ == 0 && other.length_ == 0) { return true; } + return data_->Equals(*other.data_, length_); + } +} + +bool PrimitiveArray::Equals(const std::shared_ptr& arr) const { + if (this == arr.get()) { return true; } + if (!arr) { return false; } + if (this->type_enum() != arr->type_enum()) { return false; } + return EqualsExact(*static_cast(arr.get())); +} + +template +Status NumericArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; + +// ---------------------------------------------------------------------- +// BooleanArray + +BooleanArray::BooleanArray(int32_t length, const std::shared_ptr& data, + int32_t null_count, const std::shared_ptr& null_bitmap) + : PrimitiveArray( + std::make_shared(), length, data, null_count, null_bitmap) {} + 
+BooleanArray::BooleanArray(const TypePtr& type, int32_t length, + const std::shared_ptr& data, int32_t null_count, + const std::shared_ptr& null_bitmap) + : PrimitiveArray(type, length, data, null_count, null_bitmap) {} + +bool BooleanArray::EqualsExact(const BooleanArray& other) const { + if (this == &other) return true; + if (null_count_ != other.null_count_) { return false; } + + if (null_count_ > 0) { + bool equal_bitmap = + null_bitmap_->Equals(*other.null_bitmap_, BitUtil::BytesForBits(length_)); + if (!equal_bitmap) { return false; } + + const uint8_t* this_data = raw_data_; + const uint8_t* other_data = other.raw_data_; + + for (int i = 0; i < length_; ++i) { + if (!IsNull(i) && BitUtil::GetBit(this_data, i) != BitUtil::GetBit(other_data, i)) { + return false; + } + } + return true; + } else { + return data_->Equals(*other.data_, BitUtil::BytesForBits(length_)); + } +} + +bool BooleanArray::Equals(const ArrayPtr& arr) const { + if (this == arr.get()) return true; + if (Type::BOOL != arr->type_enum()) { return false; } + return EqualsExact(*static_cast(arr.get())); +} + +bool BooleanArray::RangeEquals(int32_t start_idx, int32_t end_idx, + int32_t other_start_idx, const ArrayPtr& arr) const { + if (this == arr.get()) { return true; } + if (!arr) { return false; } + if (this->type_enum() != arr->type_enum()) { return false; } + const auto other = static_cast(arr.get()); + for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { + const bool is_null = IsNull(i); + if (is_null != arr->IsNull(o_i) || (!is_null && Value(i) != other->Value(o_i))) { + return false; + } + } + return true; +} + +Status BooleanArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + +// ---------------------------------------------------------------------- +// ListArray + +bool ListArray::EqualsExact(const ListArray& other) const { + if (this == &other) { return true; } + if (null_count_ != other.null_count_) { return false; } + + bool equal_offsets = + offset_buffer_->Equals(*other.offset_buffer_, (length_ + 1) * sizeof(int32_t)); + if (!equal_offsets) { return false; } + bool equal_null_bitmap = true; + if (null_count_ > 0) { + equal_null_bitmap = + null_bitmap_->Equals(*other.null_bitmap_, BitUtil::BytesForBits(length_)); + } + + if (!equal_null_bitmap) { return false; } + + return values()->Equals(other.values()); +} + +bool ListArray::Equals(const std::shared_ptr& arr) const { + if (this == arr.get()) { return true; } + if (this->type_enum() != arr->type_enum()) { return false; } + return EqualsExact(*static_cast(arr.get())); +} + +bool ListArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, + const std::shared_ptr& arr) const { + if (this == arr.get()) { return true; } + if (!arr) { return false; } + if (this->type_enum() != arr->type_enum()) { return false; } + const auto other = static_cast(arr.get()); + for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { + const bool is_null = IsNull(i); + if (is_null != arr->IsNull(o_i)) { return false; } + if (is_null) continue; + const int32_t begin_offset = offset(i); + const int32_t end_offset = offset(i + 1); + const int32_t other_begin_offset = other->offset(o_i); + const int32_t other_end_offset = other->offset(o_i + 1); + // Underlying can't be equal if the size isn't equal + if (end_offset - begin_offset != other_end_offset - other_begin_offset) { + return false; + } + if (!values_->RangeEquals( + begin_offset, end_offset, other_begin_offset, 
other->values())) { + return false; + } + } + return true; +} + +Status ListArray::Validate() const { + if (length_ < 0) { return Status::Invalid("Length was negative"); } + if (!offset_buffer_) { return Status::Invalid("offset_buffer_ was null"); } + if (offset_buffer_->size() / static_cast(sizeof(int32_t)) < length_) { + std::stringstream ss; + ss << "offset buffer size (bytes): " << offset_buffer_->size() + << " isn't large enough for length: " << length_; + return Status::Invalid(ss.str()); + } + const int32_t last_offset = offset(length_); + if (last_offset > 0) { + if (!values_) { + return Status::Invalid("last offset was non-zero and values was null"); + } + if (values_->length() != last_offset) { + std::stringstream ss; + ss << "Final offset invariant not equal to values length: " << last_offset + << "!=" << values_->length(); + return Status::Invalid(ss.str()); + } + + const Status child_valid = values_->Validate(); + if (!child_valid.ok()) { + std::stringstream ss; + ss << "Child array invalid: " << child_valid.ToString(); + return Status::Invalid(ss.str()); + } + } + + int32_t prev_offset = offset(0); + if (prev_offset != 0) { return Status::Invalid("The first offset wasn't zero"); } + for (int32_t i = 1; i <= length_; ++i) { + int32_t current_offset = offset(i); + if (IsNull(i - 1) && current_offset != prev_offset) { + std::stringstream ss; + ss << "Offset invariant failure at: " << i << " inconsistent offsets for null slot" + << current_offset << "!=" << prev_offset; + return Status::Invalid(ss.str()); + } + if (current_offset < prev_offset) { + std::stringstream ss; + ss << "Offset invariant failure: " << i + << " inconsistent offset for non-null slot: " << current_offset << "<" + << prev_offset; + return Status::Invalid(ss.str()); + } + prev_offset = current_offset; + } + return Status::OK(); +} + +Status ListArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + +// ---------------------------------------------------------------------- +// String and binary + +static std::shared_ptr kBinary = std::make_shared(); +static std::shared_ptr kString = std::make_shared(); + +BinaryArray::BinaryArray(int32_t length, const std::shared_ptr& offsets, + const std::shared_ptr& data, int32_t null_count, + const std::shared_ptr& null_bitmap) + : BinaryArray(kBinary, length, offsets, data, null_count, null_bitmap) {} + +BinaryArray::BinaryArray(const TypePtr& type, int32_t length, + const std::shared_ptr& offsets, const std::shared_ptr& data, + int32_t null_count, const std::shared_ptr& null_bitmap) + : Array(type, length, null_count, null_bitmap), + offset_buffer_(offsets), + offsets_(reinterpret_cast(offset_buffer_->data())), + data_buffer_(data), + data_(nullptr) { + if (data_buffer_ != nullptr) { data_ = data_buffer_->data(); } +} + +Status BinaryArray::Validate() const { + // TODO(wesm): what to do here? 
+ return Status::OK(); +} + +bool BinaryArray::EqualsExact(const BinaryArray& other) const { + if (!Array::EqualsExact(other)) { return false; } + + bool equal_offsets = + offset_buffer_->Equals(*other.offset_buffer_, (length_ + 1) * sizeof(int32_t)); + if (!equal_offsets) { return false; } + + if (!data_buffer_ && !(other.data_buffer_)) { return true; } + + return data_buffer_->Equals(*other.data_buffer_, data_buffer_->size()); +} + +bool BinaryArray::Equals(const std::shared_ptr& arr) const { + if (this == arr.get()) { return true; } + if (this->type_enum() != arr->type_enum()) { return false; } + return EqualsExact(*static_cast(arr.get())); +} + +bool BinaryArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, + const std::shared_ptr& arr) const { + if (this == arr.get()) { return true; } + if (!arr) { return false; } + if (this->type_enum() != arr->type_enum()) { return false; } + const auto other = static_cast(arr.get()); + for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { + const bool is_null = IsNull(i); + if (is_null != arr->IsNull(o_i)) { return false; } + if (is_null) continue; + const int32_t begin_offset = offset(i); + const int32_t end_offset = offset(i + 1); + const int32_t other_begin_offset = other->offset(o_i); + const int32_t other_end_offset = other->offset(o_i + 1); + // Underlying can't be equal if the size isn't equal + if (end_offset - begin_offset != other_end_offset - other_begin_offset) { + return false; + } + + if (std::memcmp(data_ + begin_offset, other->data_ + other_begin_offset, + end_offset - begin_offset)) { + return false; + } + } + return true; +} + +Status BinaryArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + +StringArray::StringArray(int32_t length, const std::shared_ptr& offsets, + const std::shared_ptr& data, int32_t null_count, + const std::shared_ptr& null_bitmap) + : BinaryArray(kString, length, offsets, data, null_count, null_bitmap) {} + +Status StringArray::Validate() const { + // TODO(emkornfield) Validate proper UTF8 code points? + return BinaryArray::Validate(); +} + +Status StringArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + +// ---------------------------------------------------------------------- +// Struct + +std::shared_ptr StructArray::field(int32_t pos) const { + DCHECK_GT(field_arrays_.size(), 0); + return field_arrays_[pos]; +} + +bool StructArray::Equals(const std::shared_ptr& arr) const { + if (this == arr.get()) { return true; } + if (!arr) { return false; } + if (this->type_enum() != arr->type_enum()) { return false; } + if (null_count_ != arr->null_count()) { return false; } + return RangeEquals(0, length_, 0, arr); +} + +bool StructArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, + const std::shared_ptr& arr) const { + if (this == arr.get()) { return true; } + if (!arr) { return false; } + if (Type::STRUCT != arr->type_enum()) { return false; } + const auto other = static_cast(arr.get()); + + bool equal_fields = true; + for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { + if (IsNull(i) != arr->IsNull(o_i)) { return false; } + if (IsNull(i)) continue; + for (size_t j = 0; j < field_arrays_.size(); ++j) { + // TODO: really we should be comparing stretches of non-null data rather + // than looking at one value at a time. 
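// --- Editor's aside (illustrative; not part of the patch) -----------------
// A recurring invariant in these comparison routines: values under null
// slots are unspecified, so equality consults the null bitmap first and only
// compares values where a slot is valid. Two arrays can therefore compare
// equal even when their raw value buffers differ at null positions, e.g.
//
//   values: [1, 9, 3]   bitmap: 1 0 1
//   values: [1, 7, 3]   bitmap: 1 0 1   -> Equals() is true
//
// which is why RangeEquals above advances to the next slot when IsNull(i).
// ---------------------------------------------------------------------------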
+ equal_fields = field(j)->RangeEquals(i, i + 1, o_i, other->field(j)); + if (!equal_fields) { return false; } + } + } + + return true; +} + +Status StructArray::Validate() const { + if (length_ < 0) { return Status::Invalid("Length was negative"); } + + if (null_count() > length_) { + return Status::Invalid("Null count exceeds the length of this struct"); + } + + if (field_arrays_.size() > 0) { + // Validate fields + int32_t array_length = field_arrays_[0]->length(); + size_t idx = 0; + for (auto it : field_arrays_) { + if (it->length() != array_length) { + std::stringstream ss; + ss << "Length is not equal from field " << it->type()->ToString() + << " at position {" << idx << "}"; + return Status::Invalid(ss.str()); + } + + const Status child_valid = it->Validate(); + if (!child_valid.ok()) { + std::stringstream ss; + ss << "Child array invalid: " << child_valid.ToString() << " at position {" << idx + << "}"; + return Status::Invalid(ss.str()); + } + ++idx; + } + + if (array_length > 0 && array_length != length_) { + return Status::Invalid("Struct's length is not equal to its child arrays"); + } + } + return Status::OK(); +} + +Status StructArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + +// ---------------------------------------------------------------------- + +#define MAKE_PRIMITIVE_ARRAY_CASE(ENUM, ArrayType) \ + case Type::ENUM: \ + out->reset(new ArrayType(type, length, data, null_count, null_bitmap)); \ + break; + +Status MakePrimitiveArray(const TypePtr& type, int32_t length, + const std::shared_ptr& data, int32_t null_count, + const std::shared_ptr& null_bitmap, ArrayPtr* out) { + switch (type->type) { + MAKE_PRIMITIVE_ARRAY_CASE(BOOL, BooleanArray); + MAKE_PRIMITIVE_ARRAY_CASE(UINT8, UInt8Array); + MAKE_PRIMITIVE_ARRAY_CASE(INT8, Int8Array); + MAKE_PRIMITIVE_ARRAY_CASE(UINT16, UInt16Array); + MAKE_PRIMITIVE_ARRAY_CASE(INT16, Int16Array); + MAKE_PRIMITIVE_ARRAY_CASE(UINT32, UInt32Array); + MAKE_PRIMITIVE_ARRAY_CASE(INT32, Int32Array); + MAKE_PRIMITIVE_ARRAY_CASE(UINT64, UInt64Array); + MAKE_PRIMITIVE_ARRAY_CASE(INT64, Int64Array); + MAKE_PRIMITIVE_ARRAY_CASE(FLOAT, FloatArray); + MAKE_PRIMITIVE_ARRAY_CASE(DOUBLE, DoubleArray); + MAKE_PRIMITIVE_ARRAY_CASE(TIME, Int64Array); + MAKE_PRIMITIVE_ARRAY_CASE(TIMESTAMP, TimestampArray); + MAKE_PRIMITIVE_ARRAY_CASE(TIMESTAMP_DOUBLE, DoubleArray); + default: + return Status::NotImplemented(type->ToString()); + } +#ifdef NDEBUG + return Status::OK(); +#else + return (*out)->Validate(); +#endif +} + } // namespace arrow diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 91fb93e625494..1a4a9237a1f79 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -18,9 +18,13 @@ #ifndef ARROW_ARRAY_H #define ARROW_ARRAY_H +#include #include #include +#include +#include +#include "arrow/buffer.h" #include "arrow/type.h" #include "arrow/util/bit-util.h" #include "arrow/util/macros.h" @@ -28,7 +32,6 @@ namespace arrow { -class Buffer; class MemoryPool; class MutableBuffer; class Status; @@ -110,6 +113,374 @@ typedef std::shared_ptr ArrayPtr; Status ARROW_EXPORT GetEmptyBitmap( MemoryPool* pool, int32_t length, std::shared_ptr* result); +// Base class for fixed-size logical types +class ARROW_EXPORT PrimitiveArray : public Array { + public: + virtual ~PrimitiveArray() {} + + std::shared_ptr data() const { return data_; } + + bool EqualsExact(const PrimitiveArray& other) const; + bool Equals(const std::shared_ptr& arr) const override; + + protected: + PrimitiveArray(const TypePtr& type, int32_t 
length, const std::shared_ptr& data, + int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); + std::shared_ptr data_; + const uint8_t* raw_data_; +}; + +template +class ARROW_EXPORT NumericArray : public PrimitiveArray { + public: + using TypeClass = TYPE; + using value_type = typename TypeClass::c_type; + NumericArray(int32_t length, const std::shared_ptr& data, + int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr) + : PrimitiveArray( + std::make_shared(), length, data, null_count, null_bitmap) {} + NumericArray(const TypePtr& type, int32_t length, const std::shared_ptr& data, + int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr) + : PrimitiveArray(type, length, data, null_count, null_bitmap) {} + + bool EqualsExact(const NumericArray& other) const { + return PrimitiveArray::EqualsExact(static_cast(other)); + } + + bool ApproxEquals(const std::shared_ptr& arr) const { return Equals(arr); } + + bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, + const ArrayPtr& arr) const override { + if (this == arr.get()) { return true; } + if (!arr) { return false; } + if (this->type_enum() != arr->type_enum()) { return false; } + const auto other = static_cast*>(arr.get()); + for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { + const bool is_null = IsNull(i); + if (is_null != arr->IsNull(o_i) || (!is_null && Value(i) != other->Value(o_i))) { + return false; + } + } + return true; + } + const value_type* raw_data() const { + return reinterpret_cast(raw_data_); + } + + Status Accept(ArrayVisitor* visitor) const override; + + value_type Value(int i) const { return raw_data()[i]; } +}; + +template <> +inline bool NumericArray::ApproxEquals( + const std::shared_ptr& arr) const { + if (this == arr.get()) { return true; } + if (!arr) { return false; } + if (this->type_enum() != arr->type_enum()) { return false; } + + const auto& other = *static_cast*>(arr.get()); + + if (this == &other) { return true; } + if (null_count_ != other.null_count_) { return false; } + + auto this_data = reinterpret_cast(raw_data_); + auto other_data = reinterpret_cast(other.raw_data_); + + static constexpr float EPSILON = 1E-5; + + if (length_ == 0 && other.length_ == 0) { return true; } + + if (null_count_ > 0) { + bool equal_bitmap = + null_bitmap_->Equals(*other.null_bitmap_, BitUtil::CeilByte(length_) / 8); + if (!equal_bitmap) { return false; } + + for (int i = 0; i < length_; ++i) { + if (IsNull(i)) continue; + if (fabs(this_data[i] - other_data[i]) > EPSILON) { return false; } + } + } else { + for (int i = 0; i < length_; ++i) { + if (fabs(this_data[i] - other_data[i]) > EPSILON) { return false; } + } + } + return true; +} + +template <> +inline bool NumericArray::ApproxEquals( + const std::shared_ptr& arr) const { + if (this == arr.get()) { return true; } + if (!arr) { return false; } + if (this->type_enum() != arr->type_enum()) { return false; } + + const auto& other = *static_cast*>(arr.get()); + + if (this == &other) { return true; } + if (null_count_ != other.null_count_) { return false; } + + auto this_data = reinterpret_cast(raw_data_); + auto other_data = reinterpret_cast(other.raw_data_); + + if (length_ == 0 && other.length_ == 0) { return true; } + + static constexpr double EPSILON = 1E-5; + + if (null_count_ > 0) { + bool equal_bitmap = + null_bitmap_->Equals(*other.null_bitmap_, BitUtil::CeilByte(length_) / 8); + if (!equal_bitmap) { return false; } + + for (int i = 0; i < length_; ++i) { + if (IsNull(i)) 
continue; + if (fabs(this_data[i] - other_data[i]) > EPSILON) { return false; } + } + } else { + for (int i = 0; i < length_; ++i) { + if (fabs(this_data[i] - other_data[i]) > EPSILON) { return false; } + } + } + return true; +} + +class ARROW_EXPORT BooleanArray : public PrimitiveArray { + public: + using TypeClass = BooleanType; + + BooleanArray(int32_t length, const std::shared_ptr& data, + int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); + BooleanArray(const TypePtr& type, int32_t length, const std::shared_ptr& data, + int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); + + bool EqualsExact(const BooleanArray& other) const; + bool Equals(const ArrayPtr& arr) const override; + bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, + const ArrayPtr& arr) const override; + + Status Accept(ArrayVisitor* visitor) const override; + + const uint8_t* raw_data() const { return reinterpret_cast(raw_data_); } + + bool Value(int i) const { return BitUtil::GetBit(raw_data(), i); } +}; + +// ---------------------------------------------------------------------- +// ListArray + +class ARROW_EXPORT ListArray : public Array { + public: + using TypeClass = ListType; + + ListArray(const TypePtr& type, int32_t length, std::shared_ptr offsets, + const ArrayPtr& values, int32_t null_count = 0, + std::shared_ptr null_bitmap = nullptr) + : Array(type, length, null_count, null_bitmap) { + offset_buffer_ = offsets; + offsets_ = offsets == nullptr ? nullptr : reinterpret_cast( + offset_buffer_->data()); + values_ = values; + } + + Status Validate() const override; + + virtual ~ListArray() = default; + + // Return a shared pointer in case the requestor desires to share ownership + // with this array. + std::shared_ptr values() const { return values_; } + std::shared_ptr offsets() const { + return std::static_pointer_cast(offset_buffer_); + } + + std::shared_ptr value_type() const { return values_->type(); } + + const int32_t* raw_offsets() const { return offsets_; } + + int32_t offset(int i) const { return offsets_[i]; } + + // Neither of these functions will perform boundschecking + int32_t value_offset(int i) const { return offsets_[i]; } + int32_t value_length(int i) const { return offsets_[i + 1] - offsets_[i]; } + + bool EqualsExact(const ListArray& other) const; + bool Equals(const std::shared_ptr& arr) const override; + + bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, + const ArrayPtr& arr) const override; + + Status Accept(ArrayVisitor* visitor) const override; + + protected: + std::shared_ptr offset_buffer_; + const int32_t* offsets_; + ArrayPtr values_; +}; + +// ---------------------------------------------------------------------- +// Binary and String + +class ARROW_EXPORT BinaryArray : public Array { + public: + using TypeClass = BinaryType; + + BinaryArray(int32_t length, const std::shared_ptr& offsets, + const std::shared_ptr& data, int32_t null_count = 0, + const std::shared_ptr& null_bitmap = nullptr); + + // Constructor that allows sub-classes/builders to propagate their logical type up the + // class hierarchy.
+ BinaryArray(const TypePtr& type, int32_t length, const std::shared_ptr& offsets, + const std::shared_ptr& data, int32_t null_count = 0, + const std::shared_ptr& null_bitmap = nullptr); + + // Return the pointer to the given element's bytes + // TODO(emkornfield) introduce a StringPiece or something similar to capture zero-copy + // pointer + offset + const uint8_t* GetValue(int i, int32_t* out_length) const { + const int32_t pos = offsets_[i]; + *out_length = offsets_[i + 1] - pos; + return data_ + pos; + } + + std::shared_ptr data() const { return data_buffer_; } + std::shared_ptr offsets() const { return offset_buffer_; } + + const int32_t* raw_offsets() const { return offsets_; } + + int32_t offset(int i) const { return offsets_[i]; } + + // Neither of these functions will perform boundschecking + int32_t value_offset(int i) const { return offsets_[i]; } + int32_t value_length(int i) const { return offsets_[i + 1] - offsets_[i]; } + + bool EqualsExact(const BinaryArray& other) const; + bool Equals(const std::shared_ptr& arr) const override; + bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, + const ArrayPtr& arr) const override; + + Status Validate() const override; + + Status Accept(ArrayVisitor* visitor) const override; + + private: + std::shared_ptr offset_buffer_; + const int32_t* offsets_; + + std::shared_ptr data_buffer_; + const uint8_t* data_; +}; + +class ARROW_EXPORT StringArray : public BinaryArray { + public: + using TypeClass = StringType; + + StringArray(int32_t length, const std::shared_ptr& offsets, + const std::shared_ptr& data, int32_t null_count = 0, + const std::shared_ptr& null_bitmap = nullptr); + + // Construct a std::string + // TODO: std::bad_alloc possibility + std::string GetString(int i) const { + int32_t nchars; + const uint8_t* str = GetValue(i, &nchars); + return std::string(reinterpret_cast(str), nchars); + } + + Status Validate() const override; + + Status Accept(ArrayVisitor* visitor) const override; +}; + +// ---------------------------------------------------------------------- +// Struct + +class ARROW_EXPORT StructArray : public Array { + public: + using TypeClass = StructType; + + StructArray(const TypePtr& type, int32_t length, std::vector& field_arrays, + int32_t null_count = 0, std::shared_ptr null_bitmap = nullptr) + : Array(type, length, null_count, null_bitmap) { + type_ = type; + field_arrays_ = field_arrays; + } + + Status Validate() const override; + + virtual ~StructArray() {} + + // Return a shared pointer in case the requestor desires to share ownership + // with this array. + std::shared_ptr field(int32_t pos) const; + + const std::vector& fields() const { return field_arrays_; } + + bool EqualsExact(const StructArray& other) const; + bool Equals(const std::shared_ptr& arr) const override; + bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, + const std::shared_ptr& arr) const override; + + Status Accept(ArrayVisitor* visitor) const override; + + protected: + // The child arrays corresponding to each field of the struct data type.
+ std::vector field_arrays_; +}; + +// ---------------------------------------------------------------------- +// Union + +class UnionArray : public Array { + protected: + // The data are types encoded as int16 + Buffer* types_; + std::vector> children_; +}; + +class DenseUnionArray : public UnionArray { + protected: + Buffer* offset_buf_; +}; + +class SparseUnionArray : public UnionArray {}; + +// ---------------------------------------------------------------------- +// extern templates and other details + +// gcc and clang disagree about how to handle template visibility when you have +// explicit specializations https://llvm.org/bugs/show_bug.cgi?id=24815 +#if defined(__GNUC__) && !defined(__clang__) +#pragma GCC diagnostic push +#pragma GCC diagnostic ignored "-Wattributes" +#endif + +// Only instantiate these templates once +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; + +#if defined(__GNUC__) && !defined(__clang__) +#pragma GCC diagnostic pop +#endif + +// ---------------------------------------------------------------------- +// Helper functions + +// Create new arrays for logical types that are backed by primitive arrays. +Status ARROW_EXPORT MakePrimitiveArray(const std::shared_ptr& type, + int32_t length, const std::shared_ptr& data, int32_t null_count, + const std::shared_ptr& null_bitmap, std::shared_ptr* out); + } // namespace arrow #endif diff --git a/cpp/src/arrow/util/buffer-test.cc b/cpp/src/arrow/buffer-test.cc similarity index 98% rename from cpp/src/arrow/util/buffer-test.cc rename to cpp/src/arrow/buffer-test.cc index 095b07b7ab309..c1d027bb653fe 100644 --- a/cpp/src/arrow/util/buffer-test.cc +++ b/cpp/src/arrow/buffer-test.cc @@ -21,9 +21,9 @@ #include "gtest/gtest.h" +#include "arrow/buffer.h" +#include "arrow/status.h" #include "arrow/test-util.h" -#include "arrow/util/buffer.h" -#include "arrow/util/status.h" using std::string; diff --git a/cpp/src/arrow/util/buffer.cc b/cpp/src/arrow/buffer.cc similarity index 96% rename from cpp/src/arrow/util/buffer.cc rename to cpp/src/arrow/buffer.cc index a230259e5930d..6ffa03a0b5663 100644 --- a/cpp/src/arrow/util/buffer.cc +++ b/cpp/src/arrow/buffer.cc @@ -15,15 +15,15 @@ // specific language governing permissions and limitations // under the License. 
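// --- Editor's aside (illustrative; not part of the patch) -----------------
// ARROW-418 flattens arrow/util into the top-level include directory, so
// downstream code updates its includes along the lines of:
//
//   #include "arrow/util/buffer.h"       ->  #include "arrow/buffer.h"
//   #include "arrow/util/memory-pool.h"  ->  #include "arrow/memory_pool.h"
//   #include "arrow/util/status.h"       ->  #include "arrow/status.h"
//
// (Mappings taken from the renames listed in this patch; note that
// memory-pool.h also changes its basename to memory_pool.h.)
// ---------------------------------------------------------------------------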
-#include "arrow/util/buffer.h" +#include "arrow/buffer.h" #include #include +#include "arrow/memory_pool.h" +#include "arrow/status.h" #include "arrow/util/bit-util.h" #include "arrow/util/logging.h" -#include "arrow/util/memory-pool.h" -#include "arrow/util/status.h" namespace arrow { diff --git a/cpp/src/arrow/util/buffer.h b/cpp/src/arrow/buffer.h similarity index 99% rename from cpp/src/arrow/util/buffer.h rename to cpp/src/arrow/buffer.h index 5c87395deebb0..27437ca0486c3 100644 --- a/cpp/src/arrow/util/buffer.h +++ b/cpp/src/arrow/buffer.h @@ -23,8 +23,8 @@ #include #include +#include "arrow/status.h" #include "arrow/util/macros.h" -#include "arrow/util/status.h" #include "arrow/util/visibility.h" namespace arrow { diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index 151b257a3d894..493b5e7ccab9e 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -17,11 +17,17 @@ #include "arrow/builder.h" +#include #include +#include +#include "arrow/array.h" +#include "arrow/buffer.h" +#include "arrow/status.h" +#include "arrow/type.h" +#include "arrow/type_traits.h" #include "arrow/util/bit-util.h" -#include "arrow/util/buffer.h" -#include "arrow/util/status.h" +#include "arrow/util/logging.h" namespace arrow { @@ -123,4 +129,323 @@ void ArrayBuilder::UnsafeSetNotNull(int32_t length) { length_ = new_length; } +template +Status PrimitiveBuilder::Init(int32_t capacity) { + RETURN_NOT_OK(ArrayBuilder::Init(capacity)); + data_ = std::make_shared(pool_); + + int64_t nbytes = TypeTraits::bytes_required(capacity); + RETURN_NOT_OK(data_->Resize(nbytes)); + // TODO(emkornfield) valgrind complains without this + memset(data_->mutable_data(), 0, nbytes); + + raw_data_ = reinterpret_cast(data_->mutable_data()); + return Status::OK(); +} + +template +Status PrimitiveBuilder::Resize(int32_t capacity) { + // XXX: Set floor size for now + if (capacity < kMinBuilderCapacity) { capacity = kMinBuilderCapacity; } + + if (capacity_ == 0) { + RETURN_NOT_OK(Init(capacity)); + } else { + RETURN_NOT_OK(ArrayBuilder::Resize(capacity)); + const int64_t old_bytes = data_->size(); + const int64_t new_bytes = TypeTraits::bytes_required(capacity); + RETURN_NOT_OK(data_->Resize(new_bytes)); + raw_data_ = reinterpret_cast(data_->mutable_data()); + memset(data_->mutable_data() + old_bytes, 0, new_bytes - old_bytes); + } + return Status::OK(); +} + +template +Status PrimitiveBuilder::Append( + const value_type* values, int32_t length, const uint8_t* valid_bytes) { + RETURN_NOT_OK(Reserve(length)); + + if (length > 0) { + memcpy(raw_data_ + length_, values, TypeTraits::bytes_required(length)); + } + + // length_ is updated by these + ArrayBuilder::UnsafeAppendToBitmap(valid_bytes, length); + + return Status::OK(); +} + +template +Status PrimitiveBuilder::Finish(std::shared_ptr* out) { + const int64_t bytes_required = TypeTraits::bytes_required(length_); + if (bytes_required > 0 && bytes_required < data_->size()) { + // Trim buffers + RETURN_NOT_OK(data_->Resize(bytes_required)); + } + *out = std::make_shared::ArrayType>( + type_, length_, data_, null_count_, null_bitmap_); + + data_ = null_bitmap_ = nullptr; + capacity_ = length_ = null_count_ = 0; + return Status::OK(); +} + +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class
PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; + +Status BooleanBuilder::Init(int32_t capacity) { + RETURN_NOT_OK(ArrayBuilder::Init(capacity)); + data_ = std::make_shared(pool_); + + int64_t nbytes = BitUtil::BytesForBits(capacity); + RETURN_NOT_OK(data_->Resize(nbytes)); + // TODO(emkornfield) valgrind complains without this + memset(data_->mutable_data(), 0, nbytes); + + raw_data_ = reinterpret_cast(data_->mutable_data()); + return Status::OK(); +} + +Status BooleanBuilder::Resize(int32_t capacity) { + // XXX: Set floor size for now + if (capacity < kMinBuilderCapacity) { capacity = kMinBuilderCapacity; } + + if (capacity_ == 0) { + RETURN_NOT_OK(Init(capacity)); + } else { + RETURN_NOT_OK(ArrayBuilder::Resize(capacity)); + const int64_t old_bytes = data_->size(); + const int64_t new_bytes = BitUtil::BytesForBits(capacity); + + RETURN_NOT_OK(data_->Resize(new_bytes)); + raw_data_ = reinterpret_cast(data_->mutable_data()); + memset(data_->mutable_data() + old_bytes, 0, new_bytes - old_bytes); + } + return Status::OK(); +} + +Status BooleanBuilder::Finish(std::shared_ptr* out) { + const int64_t bytes_required = BitUtil::BytesForBits(length_); + + if (bytes_required > 0 && bytes_required < data_->size()) { + // Trim buffers + RETURN_NOT_OK(data_->Resize(bytes_required)); + } + *out = std::make_shared(type_, length_, data_, null_count_, null_bitmap_); + + data_ = null_bitmap_ = nullptr; + capacity_ = length_ = null_count_ = 0; + return Status::OK(); +} + +Status BooleanBuilder::Append( + const uint8_t* values, int32_t length, const uint8_t* valid_bytes) { + RETURN_NOT_OK(Reserve(length)); + + for (int i = 0; i < length; ++i) { + // Skip reading from uninitialised memory + // TODO: This actually is only to keep valgrind happy but may or may not + // have a performance impact. + if ((valid_bytes != nullptr) && !valid_bytes[i]) continue; + + if (values[i] > 0) { + BitUtil::SetBit(raw_data_, length_ + i); + } else { + BitUtil::ClearBit(raw_data_, length_ + i); + } + } + + // this updates length_ + ArrayBuilder::UnsafeAppendToBitmap(valid_bytes, length); + return Status::OK(); +} + +// ---------------------------------------------------------------------- +// ListBuilder + +ListBuilder::ListBuilder( + MemoryPool* pool, std::shared_ptr value_builder, const TypePtr& type) + : ArrayBuilder( + pool, type ? type : std::static_pointer_cast( + std::make_shared(value_builder->type()))), + offset_builder_(pool), + value_builder_(value_builder) {} + +ListBuilder::ListBuilder( + MemoryPool* pool, std::shared_ptr values, const TypePtr& type) + : ArrayBuilder(pool, type ?
type : std::static_pointer_cast( + std::make_shared(values->type()))), + offset_builder_(pool), + values_(values) {} + +Status ListBuilder::Init(int32_t elements) { + DCHECK_LT(elements, std::numeric_limits::max()); + RETURN_NOT_OK(ArrayBuilder::Init(elements)); + // one more than requested for offsets + return offset_builder_.Resize((elements + 1) * sizeof(int32_t)); +} + +Status ListBuilder::Resize(int32_t capacity) { + DCHECK_LT(capacity, std::numeric_limits::max()); + // one more than requested for offsets + RETURN_NOT_OK(offset_builder_.Resize((capacity + 1) * sizeof(int32_t))); + return ArrayBuilder::Resize(capacity); +} + +Status ListBuilder::Finish(std::shared_ptr* out) { + std::shared_ptr items = values_; + if (!items) { RETURN_NOT_OK(value_builder_->Finish(&items)); } + + RETURN_NOT_OK(offset_builder_.Append(items->length())); + std::shared_ptr offsets = offset_builder_.Finish(); + + *out = std::make_shared( + type_, length_, offsets, items, null_count_, null_bitmap_); + + Reset(); + + return Status::OK(); +} + +void ListBuilder::Reset() { + capacity_ = length_ = null_count_ = 0; + null_bitmap_ = nullptr; +} + +std::shared_ptr ListBuilder::value_builder() const { + DCHECK(!values_) << "Using value builder is pointless when values_ is set"; + return value_builder_; +} + +// ---------------------------------------------------------------------- +// String and binary + +// This used to be a static member variable of BinaryBuilder, but it can cause +// valgrind to report a (spurious?) memory leak when needed in other shared +// libraries. The problem came up while adding explicit visibility to libarrow +// and libparquet_arrow +static TypePtr kBinaryValueType = TypePtr(new UInt8Type()); + +BinaryBuilder::BinaryBuilder(MemoryPool* pool, const TypePtr& type) + : ListBuilder(pool, std::make_shared(pool, kBinaryValueType), type) { + byte_builder_ = static_cast(value_builder_.get()); +} + +Status BinaryBuilder::Finish(std::shared_ptr* out) { + std::shared_ptr result; + RETURN_NOT_OK(ListBuilder::Finish(&result)); + + const auto list = std::dynamic_pointer_cast(result); + auto values = std::dynamic_pointer_cast(list->values()); + + *out = std::make_shared(list->length(), list->offsets(), values->data(), + list->null_count(), list->null_bitmap()); + return Status::OK(); +} + +Status StringBuilder::Finish(std::shared_ptr* out) { + std::shared_ptr result; + RETURN_NOT_OK(ListBuilder::Finish(&result)); + + const auto list = std::dynamic_pointer_cast(result); + auto values = std::dynamic_pointer_cast(list->values()); + + *out = std::make_shared(list->length(), list->offsets(), values->data(), + list->null_count(), list->null_bitmap()); + return Status::OK(); +} + +// ---------------------------------------------------------------------- +// Struct + +Status StructBuilder::Finish(std::shared_ptr* out) { + std::vector> fields(field_builders_.size()); + for (size_t i = 0; i < field_builders_.size(); ++i) { + RETURN_NOT_OK(field_builders_[i]->Finish(&fields[i])); + } + + *out = std::make_shared(type_, length_, fields, null_count_, null_bitmap_); + + null_bitmap_ = nullptr; + capacity_ = length_ = null_count_ = 0; + + return Status::OK(); +} + +std::shared_ptr StructBuilder::field_builder(int pos) const { + DCHECK_GT(field_builders_.size(), 0); + return field_builders_[pos]; +} + +// ---------------------------------------------------------------------- +// Helper functions + +#define BUILDER_CASE(ENUM, BuilderType) \ + case Type::ENUM: \ + out->reset(new BuilderType(pool, type)); \ + return
diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h
index 73e49c0a69674..7162d31d2464a 100644
--- a/cpp/src/arrow/builder.h
+++ b/cpp/src/arrow/builder.h
@@ -20,18 +20,20 @@
 
 #include <cstdint>
 #include <memory>
+#include <string>
 #include <vector>
 
+#include "arrow/buffer.h"
+#include "arrow/status.h"
 #include "arrow/type.h"
+#include "arrow/util/bit-util.h"
 #include "arrow/util/macros.h"
-#include "arrow/util/status.h"
 #include "arrow/util/visibility.h"
 
 namespace arrow {
 
 class Array;
 class MemoryPool;
-class PoolBuffer;
 
 static constexpr int32_t kMinBuilderCapacity = 1 << 5;
 
@@ -130,6 +132,315 @@ class ARROW_EXPORT ArrayBuilder {
   DISALLOW_COPY_AND_ASSIGN(ArrayBuilder);
 };
 
+template <typename Type>
+class ARROW_EXPORT PrimitiveBuilder : public ArrayBuilder {
+ public:
+  using value_type = typename Type::c_type;
+
+  explicit PrimitiveBuilder(MemoryPool* pool, const TypePtr& type)
+      : ArrayBuilder(pool, type), data_(nullptr) {}
+
+  virtual ~PrimitiveBuilder() {}
+
+  using ArrayBuilder::Advance;
+
+  // Write nulls as uint8_t* (0 value indicates null) into pre-allocated memory
+  Status AppendNulls(const uint8_t* valid_bytes, int32_t length) {
+    RETURN_NOT_OK(Reserve(length));
+    UnsafeAppendToBitmap(valid_bytes, length);
+    return Status::OK();
+  }
+
+  Status AppendNull() {
+    RETURN_NOT_OK(Reserve(1));
+    UnsafeAppendToBitmap(false);
+    return Status::OK();
+  }
+
+  std::shared_ptr<Buffer> data() const { return data_; }
+
+  // Vector append
+  //
+  // If passed, valid_bytes is of equal length to values, and any zero byte
+  // will be considered as a null for that slot
+  Status Append(
+      const value_type* values, int32_t length, const uint8_t* valid_bytes = nullptr);
+
+  Status Finish(std::shared_ptr<Array>* out) override;
+  Status Init(int32_t capacity) override;
+
+  // Increase the capacity of the builder to accommodate at least the indicated
+  // number of elements
+  Status Resize(int32_t capacity) override;
+
+ protected:
+  std::shared_ptr<PoolBuffer> data_;
+  value_type* raw_data_;
+};
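The vector Append declared above is the bulk entry point. A hedged usage sketch, illustrative only, using PrimitiveBuilder<DoubleType> directly rather than the DoubleBuilder alias defined just below:

    // Bulk-append four doubles; valid_bytes marks index 2 as null.
    PrimitiveBuilder<DoubleType> builder(default_memory_pool(), TypePtr(new DoubleType()));
    double values[] = {1.0, 2.0, 0.0, 4.0};
    uint8_t valid_bytes[] = {1, 1, 0, 1};
    Status s = builder.Append(values, 4, valid_bytes);
    // values[2] is copied into the data buffer but masked out via the null bitmap.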
+
+template <typename T>
+class ARROW_EXPORT NumericBuilder : public PrimitiveBuilder<T> {
+ public:
+  using typename PrimitiveBuilder<T>::value_type;
+  using PrimitiveBuilder<T>::PrimitiveBuilder;
+
+  using PrimitiveBuilder<T>::Append;
+  using PrimitiveBuilder<T>::Init;
+  using PrimitiveBuilder<T>::Resize;
+  using PrimitiveBuilder<T>::Reserve;
+
+  // Scalar append.
+  Status Append(value_type val) {
+    RETURN_NOT_OK(ArrayBuilder::Reserve(1));
+    UnsafeAppend(val);
+    return Status::OK();
+  }
+
+  // Does not capacity-check; make sure to call Reserve beforehand
+  void UnsafeAppend(value_type val) {
+    BitUtil::SetBit(null_bitmap_data_, length_);
+    raw_data_[length_++] = val;
+  }
+
+ protected:
+  using PrimitiveBuilder<T>::length_;
+  using PrimitiveBuilder<T>::null_bitmap_data_;
+  using PrimitiveBuilder<T>::raw_data_;
+};
+
+// Builders
+
+using UInt8Builder = NumericBuilder<UInt8Type>;
+using UInt16Builder = NumericBuilder<UInt16Type>;
+using UInt32Builder = NumericBuilder<UInt32Type>;
+using UInt64Builder = NumericBuilder<UInt64Type>;
+
+using Int8Builder = NumericBuilder<Int8Type>;
+using Int16Builder = NumericBuilder<Int16Type>;
+using Int32Builder = NumericBuilder<Int32Type>;
+using Int64Builder = NumericBuilder<Int64Type>;
+using TimestampBuilder = NumericBuilder<TimestampType>;
+
+using HalfFloatBuilder = NumericBuilder<HalfFloatType>;
+using FloatBuilder = NumericBuilder<FloatType>;
+using DoubleBuilder = NumericBuilder<DoubleType>;
+
+class ARROW_EXPORT BooleanBuilder : public ArrayBuilder {
+ public:
+  explicit BooleanBuilder(MemoryPool* pool, const TypePtr& type)
+      : ArrayBuilder(pool, type), data_(nullptr) {}
+
+  virtual ~BooleanBuilder() {}
+
+  using ArrayBuilder::Advance;
+
+  // Write nulls as uint8_t* (0 value indicates null) into pre-allocated memory
+  Status AppendNulls(const uint8_t* valid_bytes, int32_t length) {
+    RETURN_NOT_OK(Reserve(length));
+    UnsafeAppendToBitmap(valid_bytes, length);
+    return Status::OK();
+  }
+
+  Status AppendNull() {
+    RETURN_NOT_OK(Reserve(1));
+    UnsafeAppendToBitmap(false);
+    return Status::OK();
+  }
+
+  std::shared_ptr<Buffer> data() const { return data_; }
+
+  // Scalar append
+  Status Append(bool val) {
+    RETURN_NOT_OK(Reserve(1));
+    BitUtil::SetBit(null_bitmap_data_, length_);
+    if (val) {
+      BitUtil::SetBit(raw_data_, length_);
+    } else {
+      BitUtil::ClearBit(raw_data_, length_);
+    }
+    ++length_;
+    return Status::OK();
+  }
+
+  // Vector append
+  //
+  // If passed, valid_bytes is of equal length to values, and any zero byte
+  // will be considered as a null for that slot
+  Status Append(
+      const uint8_t* values, int32_t length, const uint8_t* valid_bytes = nullptr);
+
+  Status Finish(std::shared_ptr<Array>* out) override;
+  Status Init(int32_t capacity) override;
+
+  // Increase the capacity of the builder to accommodate at least the indicated
+  // number of elements
+  Status Resize(int32_t capacity) override;
+
+ protected:
+  std::shared_ptr<PoolBuffer> data_;
+  uint8_t* raw_data_;
+};
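The Reserve/UnsafeAppend split exists so hot loops can skip per-element capacity checks. A sketch under the same assumptions as the earlier examples (illustrative, not part of the patch):

    // Reserve once, then take the unchecked fast path inside the loop.
    Int64Builder builder(default_memory_pool(), TypePtr(new Int64Type()));
    Status s = builder.Reserve(1000);
    if (s.ok()) {
      for (int64_t i = 0; i < 1000; ++i) {
        builder.UnsafeAppend(i);  // sets the validity bit and stores the value
      }
    }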
+
+// ----------------------------------------------------------------------
+// List builder
+
+// Builder class for variable-length list array value types
+//
+// To use this class, you must append values to the child array builder and use
+// the Append function to delimit each distinct list value (once the values
+// have been appended to the child array) or use the bulk API to append
+// a sequence of offsets and null values.
+//
+// A note on types. Per arrow/type.h all types in the c++ implementation are
+// logical, so even though this class always builds a list array, it can
+// represent multiple different logical types. If no logical type is provided
+// at construction time, the class defaults to List<T>, where T is taken from
+// the value_builder/values that the object is constructed with.
+class ARROW_EXPORT ListBuilder : public ArrayBuilder {
+ public:
+  // Use this constructor to incrementally build the value array along with offsets and
+  // null bitmap.
+  ListBuilder(MemoryPool* pool, std::shared_ptr<ArrayBuilder> value_builder,
+      const TypePtr& type = nullptr);
+
+  // Use this constructor to build the list with a pre-existing values array
+  ListBuilder(
+      MemoryPool* pool, std::shared_ptr<Array> values, const TypePtr& type = nullptr);
+
+  virtual ~ListBuilder() {}
+
+  Status Init(int32_t elements) override;
+  Status Resize(int32_t capacity) override;
+  Status Finish(std::shared_ptr<Array>* out) override;
+
+  // Vector append
+  //
+  // If passed, valid_bytes is of equal length to values, and any zero byte
+  // will be considered as a null for that slot
+  Status Append(
+      const int32_t* offsets, int32_t length, const uint8_t* valid_bytes = nullptr) {
+    RETURN_NOT_OK(Reserve(length));
+    UnsafeAppendToBitmap(valid_bytes, length);
+    offset_builder_.UnsafeAppend(offsets, length);
+    return Status::OK();
+  }
+
+  // Start a new variable-length list slot
+  //
+  // This function should be called before beginning to append elements to the
+  // value builder
+  Status Append(bool is_valid = true) {
+    RETURN_NOT_OK(Reserve(1));
+    UnsafeAppendToBitmap(is_valid);
+    RETURN_NOT_OK(offset_builder_.Append(value_builder_->length()));
+    return Status::OK();
+  }
+
+  Status AppendNull() { return Append(false); }
+
+  std::shared_ptr<ArrayBuilder> value_builder() const;
+
+ protected:
+  BufferBuilder offset_builder_;
+  std::shared_ptr<ArrayBuilder> value_builder_;
+  std::shared_ptr<Array> values_;
+
+  void Reset();
+};
+
+// ----------------------------------------------------------------------
+// Binary and String
+
+// BinaryBuilder : public ListBuilder
+class ARROW_EXPORT BinaryBuilder : public ListBuilder {
+ public:
+  explicit BinaryBuilder(MemoryPool* pool, const TypePtr& type);
+  virtual ~BinaryBuilder() {}
+
+  Status Append(const uint8_t* value, int32_t length) {
+    RETURN_NOT_OK(ListBuilder::Append());
+    return byte_builder_->Append(value, length);
+  }
+
+  Status Append(const char* value, int32_t length) {
+    return Append(reinterpret_cast<const uint8_t*>(value), length);
+  }
+
+  Status Append(const std::string& value) { return Append(value.c_str(), value.size()); }
+
+  Status Finish(std::shared_ptr<Array>* out) override;
+
+ protected:
+  UInt8Builder* byte_builder_;
+};
+
+// String builder
+class ARROW_EXPORT StringBuilder : public BinaryBuilder {
+ public:
+  explicit StringBuilder(MemoryPool* pool, const TypePtr& type)
+      : BinaryBuilder(pool, type) {}
+
+  using BinaryBuilder::Append;
+
+  Status Finish(std::shared_ptr<Array>* out) override;
+
+  Status Append(const std::vector<std::string>& values, uint8_t* null_bytes);
+};
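The delimiting protocol described in the ListBuilder comment is easiest to see in sequence. A sketch, illustrative only, with error handling elided and the same assumed helpers as above:

    // Build the List<Int32> value [[1, 2], null, [3]].
    MemoryPool* pool = default_memory_pool();
    auto values = std::make_shared<Int32Builder>(pool, TypePtr(new Int32Type()));
    ListBuilder list_builder(pool, values);
    list_builder.Append();      // open slot 0, then fill its child values
    values->Append(1);
    values->Append(2);
    list_builder.AppendNull();  // slot 1 is null; no child values written
    list_builder.Append();      // open slot 2
    values->Append(3);
    std::shared_ptr<Array> out;
    list_builder.Finish(&out);  // out is a ListArray of length 3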
+
+// ----------------------------------------------------------------------
+// Struct
+
+// ---------------------------------------------------------------------------------
+// StructArray builder
+// The Append, Resize and Reserve methods act on the StructBuilder itself.
+// Make sure to call these methods consistently on all child builders as well,
+// to keep the data structures consistent.
+class ARROW_EXPORT StructBuilder : public ArrayBuilder {
+ public:
+  StructBuilder(MemoryPool* pool, const std::shared_ptr<DataType>& type,
+      const std::vector<std::shared_ptr<ArrayBuilder>>& field_builders)
+      : ArrayBuilder(pool, type) {
+    field_builders_ = field_builders;
+  }
+
+  Status Finish(std::shared_ptr<Array>* out) override;
+
+  // The null bitmap is of equal length to every child field, and any zero byte
+  // will be considered as a null for that field. Note that this does not
+  // insert child data: users must call the append or advance methods of the
+  // child builders independently.
+  Status Append(int32_t length, const uint8_t* valid_bytes) {
+    RETURN_NOT_OK(Reserve(length));
+    UnsafeAppendToBitmap(valid_bytes, length);
+    return Status::OK();
+  }
+
+  // Append an element to the Struct. All child builders' Append methods must
+  // be called independently to maintain data-structure consistency.
+  Status Append(bool is_valid = true) {
+    RETURN_NOT_OK(Reserve(1));
+    UnsafeAppendToBitmap(is_valid);
+    return Status::OK();
+  }
+
+  Status AppendNull() { return Append(false); }
+
+  std::shared_ptr<ArrayBuilder> field_builder(int pos) const;
+
+  const std::vector<std::shared_ptr<ArrayBuilder>>& field_builders() const {
+    return field_builders_;
+  }
+
+ protected:
+  std::vector<std::shared_ptr<ArrayBuilder>> field_builders_;
+};
+
+// ----------------------------------------------------------------------
+// Helper functions
+
+Status ARROW_EXPORT MakeBuilder(MemoryPool* pool, const std::shared_ptr<DataType>& type,
+    std::shared_ptr<ArrayBuilder>* out);
+
 }  // namespace arrow
 
 #endif  // ARROW_BUILDER_H_
diff --git a/cpp/src/arrow/column-benchmark.cc b/cpp/src/arrow/column-benchmark.cc
index f429a813c6f20..650ec90fc0728 100644
--- a/cpp/src/arrow/column-benchmark.cc
+++ b/cpp/src/arrow/column-benchmark.cc
@@ -17,9 +17,9 @@
 
 #include "benchmark/benchmark.h"
 
+#include "arrow/array.h"
+#include "arrow/memory_pool.h"
 #include "arrow/test-util.h"
-#include "arrow/types/primitive.h"
-#include "arrow/util/memory-pool.h"
 
 namespace arrow {
 namespace {
diff --git a/cpp/src/arrow/column-test.cc b/cpp/src/arrow/column-test.cc
index ac3636d1b6dab..9005245b20419 100644
--- a/cpp/src/arrow/column-test.cc
+++ b/cpp/src/arrow/column-test.cc
@@ -27,7 +27,6 @@
 #include "arrow/schema.h"
 #include "arrow/test-util.h"
 #include "arrow/type.h"
-#include "arrow/types/primitive.h"
 
 using std::shared_ptr;
 using std::vector;
diff --git a/cpp/src/arrow/column.cc b/cpp/src/arrow/column.cc
index eca5f4d30a698..1d136e7d95a55 100644
--- a/cpp/src/arrow/column.cc
+++ b/cpp/src/arrow/column.cc
@@ -21,8 +21,8 @@
 #include
 
 #include "arrow/array.h"
+#include "arrow/status.h"
 #include "arrow/type.h"
-#include "arrow/util/status.h"
 
 namespace arrow {
diff --git a/cpp/src/arrow/io/file.cc b/cpp/src/arrow/io/file.cc
index 05fa6663e335d..c50a9bba28e8e 100644
--- a/cpp/src/arrow/io/file.cc
+++ b/cpp/src/arrow/io/file.cc
@@ -107,9 +107,9 @@
 
 #include "arrow/io/interfaces.h"
 
-#include "arrow/util/buffer.h"
-#include "arrow/util/memory-pool.h"
-#include "arrow/util/status.h"
+#include "arrow/buffer.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
 
 namespace arrow {
 namespace io {
diff --git a/cpp/src/arrow/io/hdfs.cc b/cpp/src/arrow/io/hdfs.cc
index 8c6d49f92e606..b8e212026b11c 100644
--- a/cpp/src/arrow/io/hdfs.cc
+++ b/cpp/src/arrow/io/hdfs.cc
@@ -22,10 +22,10 @@
 #include
 #include
 
+#include "arrow/buffer.h"
 #include "arrow/io/hdfs.h"
-#include "arrow/util/buffer.h"
-#include "arrow/util/memory-pool.h"
-#include "arrow/util/status.h"
+#include "arrow/memory_pool.h"
+#include "arrow/status.h"
 
 namespace arrow {
 namespace io {
diff --git a/cpp/src/arrow/io/interfaces.cc 
b/cpp/src/arrow/io/interfaces.cc index 44986cee1afc9..68c1ac30f8250 100644 --- a/cpp/src/arrow/io/interfaces.cc +++ b/cpp/src/arrow/io/interfaces.cc @@ -20,8 +20,8 @@ #include #include -#include "arrow/util/buffer.h" -#include "arrow/util/status.h" +#include "arrow/buffer.h" +#include "arrow/status.h" namespace arrow { namespace io { diff --git a/cpp/src/arrow/io/io-file-test.cc b/cpp/src/arrow/io/io-file-test.cc index fad49cef89908..f0ea7ec5e4dea 100644 --- a/cpp/src/arrow/io/io-file-test.cc +++ b/cpp/src/arrow/io/io-file-test.cc @@ -30,7 +30,7 @@ #include "arrow/io/file.h" #include "arrow/io/test-common.h" -#include "arrow/util/memory-pool.h" +#include "arrow/memory_pool.h" namespace arrow { namespace io { diff --git a/cpp/src/arrow/io/io-hdfs-test.cc b/cpp/src/arrow/io/io-hdfs-test.cc index 8338de6d96a55..e07eaa3d1b487 100644 --- a/cpp/src/arrow/io/io-hdfs-test.cc +++ b/cpp/src/arrow/io/io-hdfs-test.cc @@ -25,8 +25,8 @@ #include // NOLINT #include "arrow/io/hdfs.h" +#include "arrow/status.h" #include "arrow/test-util.h" -#include "arrow/util/status.h" namespace arrow { namespace io { diff --git a/cpp/src/arrow/io/libhdfs_shim.cc b/cpp/src/arrow/io/libhdfs_shim.cc index 36b8a4ec980a9..3715376ebb95b 100644 --- a/cpp/src/arrow/io/libhdfs_shim.cc +++ b/cpp/src/arrow/io/libhdfs_shim.cc @@ -53,7 +53,7 @@ extern "C" { #include // NOLINT -#include "arrow/util/status.h" +#include "arrow/status.h" #include "arrow/util/visibility.h" namespace fs = boost::filesystem; diff --git a/cpp/src/arrow/io/memory.cc b/cpp/src/arrow/io/memory.cc index af495e27e5642..b5cf4b77a980d 100644 --- a/cpp/src/arrow/io/memory.cc +++ b/cpp/src/arrow/io/memory.cc @@ -38,10 +38,9 @@ #include #include +#include "arrow/buffer.h" #include "arrow/io/interfaces.h" - -#include "arrow/util/buffer.h" -#include "arrow/util/status.h" +#include "arrow/status.h" namespace arrow { namespace io { diff --git a/cpp/src/arrow/io/test-common.h b/cpp/src/arrow/io/test-common.h index f8fed883cf583..146808371d307 100644 --- a/cpp/src/arrow/io/test-common.h +++ b/cpp/src/arrow/io/test-common.h @@ -32,10 +32,10 @@ // nothing #endif +#include "arrow/buffer.h" #include "arrow/io/memory.h" +#include "arrow/memory_pool.h" #include "arrow/test-util.h" -#include "arrow/util/buffer.h" -#include "arrow/util/memory-pool.h" namespace arrow { namespace io { diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index edf716f662753..f813c1dbbc3b0 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -23,6 +23,7 @@ #include #include "arrow/array.h" +#include "arrow/buffer.h" #include "arrow/io/interfaces.h" #include "arrow/io/memory.h" #include "arrow/ipc/Message_generated.h" @@ -30,17 +31,11 @@ #include "arrow/ipc/metadata.h" #include "arrow/ipc/util.h" #include "arrow/schema.h" +#include "arrow/status.h" #include "arrow/table.h" #include "arrow/type.h" -#include "arrow/types/construct.h" -#include "arrow/types/list.h" -#include "arrow/types/primitive.h" -#include "arrow/types/string.h" -#include "arrow/types/struct.h" #include "arrow/util/bit-util.h" -#include "arrow/util/buffer.h" #include "arrow/util/logging.h" -#include "arrow/util/status.h" namespace arrow { diff --git a/cpp/src/arrow/ipc/file.cc b/cpp/src/arrow/ipc/file.cc index fa50058ea4200..d7d2e613f87db 100644 --- a/cpp/src/arrow/ipc/file.cc +++ b/cpp/src/arrow/ipc/file.cc @@ -22,14 +22,14 @@ #include #include +#include "arrow/buffer.h" #include "arrow/io/interfaces.h" #include "arrow/io/memory.h" #include "arrow/ipc/adapter.h" #include 
"arrow/ipc/metadata.h" #include "arrow/ipc/util.h" -#include "arrow/util/buffer.h" +#include "arrow/status.h" #include "arrow/util/logging.h" -#include "arrow/util/status.h" namespace arrow { namespace ipc { diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc index 1accfde7c4842..f309b8562f76a 100644 --- a/cpp/src/arrow/ipc/ipc-adapter-test.cc +++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc @@ -30,15 +30,11 @@ #include "arrow/ipc/test-common.h" #include "arrow/ipc/util.h" +#include "arrow/buffer.h" +#include "arrow/memory_pool.h" +#include "arrow/status.h" #include "arrow/test-util.h" -#include "arrow/types/list.h" -#include "arrow/types/primitive.h" -#include "arrow/types/string.h" -#include "arrow/types/struct.h" #include "arrow/util/bit-util.h" -#include "arrow/util/buffer.h" -#include "arrow/util/memory-pool.h" -#include "arrow/util/status.h" namespace arrow { namespace ipc { diff --git a/cpp/src/arrow/ipc/ipc-file-test.cc b/cpp/src/arrow/ipc/ipc-file-test.cc index a1feac401f24e..0a9f677966389 100644 --- a/cpp/src/arrow/ipc/ipc-file-test.cc +++ b/cpp/src/arrow/ipc/ipc-file-test.cc @@ -24,6 +24,7 @@ #include "gtest/gtest.h" +#include "arrow/array.h" #include "arrow/io/memory.h" #include "arrow/io/test-common.h" #include "arrow/ipc/adapter.h" @@ -31,15 +32,11 @@ #include "arrow/ipc/test-common.h" #include "arrow/ipc/util.h" +#include "arrow/buffer.h" +#include "arrow/memory_pool.h" +#include "arrow/status.h" #include "arrow/test-util.h" -#include "arrow/types/list.h" -#include "arrow/types/primitive.h" -#include "arrow/types/string.h" -#include "arrow/types/struct.h" #include "arrow/util/bit-util.h" -#include "arrow/util/buffer.h" -#include "arrow/util/memory-pool.h" -#include "arrow/util/status.h" namespace arrow { namespace ipc { diff --git a/cpp/src/arrow/ipc/ipc-json-test.cc b/cpp/src/arrow/ipc/ipc-json-test.cc index ba4d9ca982850..f793a2659579c 100644 --- a/cpp/src/arrow/ipc/ipc-json-test.cc +++ b/cpp/src/arrow/ipc/ipc-json-test.cc @@ -26,17 +26,15 @@ #include "gtest/gtest.h" #include "arrow/array.h" +#include "arrow/builder.h" #include "arrow/ipc/json-internal.h" #include "arrow/ipc/json.h" +#include "arrow/memory_pool.h" +#include "arrow/status.h" #include "arrow/table.h" #include "arrow/test-util.h" #include "arrow/type.h" #include "arrow/type_traits.h" -#include "arrow/types/primitive.h" -#include "arrow/types/string.h" -#include "arrow/types/struct.h" -#include "arrow/util/memory-pool.h" -#include "arrow/util/status.h" namespace arrow { namespace ipc { @@ -147,7 +145,7 @@ TEST(TestJsonArrayWriter, NestedTypes) { std::vector values = {0, 1, 2, 3, 4, 5, 6}; std::shared_ptr values_array; - MakeArray(int32(), values_is_valid, values, &values_array); + ArrayFromVector(int32(), values_is_valid, values, &values_array); // List std::vector list_is_valid = {true, false, true, true, true}; @@ -188,10 +186,10 @@ void MakeBatchArrays(const std::shared_ptr& schema, const int num_rows, test::randint(num_rows, 0, 100, &v2_values); std::shared_ptr v1; - MakeArray(schema->field(0)->type, is_valid, v1_values, &v1); + ArrayFromVector(schema->field(0)->type, is_valid, v1_values, &v1); std::shared_ptr v2; - MakeArray(schema->field(1)->type, is_valid, v2_values, &v2); + ArrayFromVector(schema->field(1)->type, is_valid, v2_values, &v2); static const int kBufferSize = 10; static uint8_t buffer[kBufferSize]; @@ -323,13 +321,13 @@ TEST(TestJsonFileReadWrite, MinimalFormatExample) { std::vector foo_valid = {true, false, true, true, true}; std::vector foo_values = {1, 2, 
3, 4, 5}; std::shared_ptr foo; - MakeArray(int32(), foo_valid, foo_values, &foo); + ArrayFromVector(int32(), foo_valid, foo_values, &foo); ASSERT_TRUE(batch->column(0)->Equals(foo)); std::vector bar_valid = {true, false, false, true, true}; std::vector bar_values = {1, 2, 3, 4, 5}; std::shared_ptr bar; - MakeArray(float64(), bar_valid, bar_values, &bar); + ArrayFromVector(float64(), bar_valid, bar_values, &bar); ASSERT_TRUE(batch->column(1)->Equals(bar)); } diff --git a/cpp/src/arrow/ipc/ipc-metadata-test.cc b/cpp/src/arrow/ipc/ipc-metadata-test.cc index de08e6dab73c6..7c5744a241068 100644 --- a/cpp/src/arrow/ipc/ipc-metadata-test.cc +++ b/cpp/src/arrow/ipc/ipc-metadata-test.cc @@ -24,9 +24,9 @@ #include "arrow/io/memory.h" #include "arrow/ipc/metadata.h" #include "arrow/schema.h" +#include "arrow/status.h" #include "arrow/test-util.h" #include "arrow/type.h" -#include "arrow/util/status.h" namespace arrow { diff --git a/cpp/src/arrow/ipc/json-integration-test.cc b/cpp/src/arrow/ipc/json-integration-test.cc index 291a719d4e58c..5e593560f8cfa 100644 --- a/cpp/src/arrow/ipc/json-integration-test.cc +++ b/cpp/src/arrow/ipc/json-integration-test.cc @@ -33,9 +33,9 @@ #include "arrow/ipc/json.h" #include "arrow/pretty_print.h" #include "arrow/schema.h" +#include "arrow/status.h" #include "arrow/table.h" #include "arrow/test-util.h" -#include "arrow/util/status.h" DEFINE_string(arrow, "", "Arrow file name"); DEFINE_string(json, "", "JSON file name"); diff --git a/cpp/src/arrow/ipc/json-internal.cc b/cpp/src/arrow/ipc/json-internal.cc index ff9f59800be38..db11b7d0372f7 100644 --- a/cpp/src/arrow/ipc/json-internal.cc +++ b/cpp/src/arrow/ipc/json-internal.cc @@ -28,16 +28,14 @@ #include "rapidjson/writer.h" #include "arrow/array.h" +#include "arrow/builder.h" +#include "arrow/memory_pool.h" #include "arrow/schema.h" +#include "arrow/status.h" #include "arrow/type.h" #include "arrow/type_traits.h" -#include "arrow/types/list.h" -#include "arrow/types/primitive.h" -#include "arrow/types/string.h" -#include "arrow/types/struct.h" #include "arrow/util/bit-util.h" -#include "arrow/util/memory-pool.h" -#include "arrow/util/status.h" +#include "arrow/util/logging.h" namespace arrow { namespace ipc { diff --git a/cpp/src/arrow/ipc/json.cc b/cpp/src/arrow/ipc/json.cc index 2281611f8b879..6e3a9939730f4 100644 --- a/cpp/src/arrow/ipc/json.cc +++ b/cpp/src/arrow/ipc/json.cc @@ -23,14 +23,14 @@ #include #include "arrow/array.h" +#include "arrow/buffer.h" #include "arrow/ipc/json-internal.h" +#include "arrow/memory_pool.h" #include "arrow/schema.h" +#include "arrow/status.h" #include "arrow/table.h" #include "arrow/type.h" -#include "arrow/util/buffer.h" #include "arrow/util/logging.h" -#include "arrow/util/memory-pool.h" -#include "arrow/util/status.h" namespace arrow { namespace ipc { diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index 5a2758912b759..16069a8f9dcf0 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -25,11 +25,11 @@ #include "flatbuffers/flatbuffers.h" +#include "arrow/buffer.h" #include "arrow/ipc/Message_generated.h" #include "arrow/schema.h" +#include "arrow/status.h" #include "arrow/type.h" -#include "arrow/util/buffer.h" -#include "arrow/util/status.h" namespace arrow { diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index 44d3939c04f1d..f0674ff8d5aeb 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -28,9 +28,9 @@ #include 
"arrow/ipc/Message_generated.h" #include "arrow/ipc/metadata-internal.h" +#include "arrow/buffer.h" #include "arrow/schema.h" -#include "arrow/util/buffer.h" -#include "arrow/util/status.h" +#include "arrow/status.h" namespace arrow { diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index 65b378215222d..8416f0df57364 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -25,16 +25,13 @@ #include #include "arrow/array.h" +#include "arrow/buffer.h" +#include "arrow/builder.h" +#include "arrow/memory_pool.h" #include "arrow/table.h" #include "arrow/test-util.h" #include "arrow/type.h" -#include "arrow/types/list.h" -#include "arrow/types/primitive.h" -#include "arrow/types/string.h" -#include "arrow/types/struct.h" #include "arrow/util/bit-util.h" -#include "arrow/util/buffer.h" -#include "arrow/util/memory-pool.h" namespace arrow { namespace ipc { diff --git a/cpp/src/arrow/ipc/util.h b/cpp/src/arrow/ipc/util.h index 242d6624f1e7f..2000c61e7ed57 100644 --- a/cpp/src/arrow/ipc/util.h +++ b/cpp/src/arrow/ipc/util.h @@ -22,7 +22,7 @@ #include "arrow/array.h" #include "arrow/io/interfaces.h" -#include "arrow/util/status.h" +#include "arrow/status.h" namespace arrow { namespace ipc { diff --git a/cpp/src/arrow/util/memory-pool-test.cc b/cpp/src/arrow/memory_pool-test.cc similarity index 96% rename from cpp/src/arrow/util/memory-pool-test.cc rename to cpp/src/arrow/memory_pool-test.cc index 5d60376f794ff..d6f323d276305 100644 --- a/cpp/src/arrow/util/memory-pool-test.cc +++ b/cpp/src/arrow/memory_pool-test.cc @@ -20,9 +20,9 @@ #include "gtest/gtest.h" +#include "arrow/memory_pool.h" +#include "arrow/status.h" #include "arrow/test-util.h" -#include "arrow/util/memory-pool.h" -#include "arrow/util/status.h" namespace arrow { diff --git a/cpp/src/arrow/util/memory-pool.cc b/cpp/src/arrow/memory_pool.cc similarity index 97% rename from cpp/src/arrow/util/memory-pool.cc rename to cpp/src/arrow/memory_pool.cc index 9aa706693e9f7..f55b1ac668c7c 100644 --- a/cpp/src/arrow/util/memory-pool.cc +++ b/cpp/src/arrow/memory_pool.cc @@ -15,15 +15,15 @@ // specific language governing permissions and limitations // under the License. 
-#include "arrow/util/memory-pool.h" +#include "arrow/memory_pool.h" #include #include #include #include +#include "arrow/status.h" #include "arrow/util/logging.h" -#include "arrow/util/status.h" namespace arrow { diff --git a/cpp/src/arrow/util/memory-pool.h b/cpp/src/arrow/memory_pool.h similarity index 100% rename from cpp/src/arrow/util/memory-pool.h rename to cpp/src/arrow/memory_pool.h diff --git a/cpp/src/arrow/pretty_print-test.cc b/cpp/src/arrow/pretty_print-test.cc index b1e6a11cedd9b..c22d3aa632b9d 100644 --- a/cpp/src/arrow/pretty_print-test.cc +++ b/cpp/src/arrow/pretty_print-test.cc @@ -26,14 +26,11 @@ #include "gtest/gtest.h" #include "arrow/array.h" +#include "arrow/builder.h" #include "arrow/pretty_print.h" #include "arrow/test-util.h" #include "arrow/type.h" #include "arrow/type_traits.h" -#include "arrow/types/list.h" -#include "arrow/types/primitive.h" -#include "arrow/types/string.h" -#include "arrow/types/struct.h" namespace arrow { diff --git a/cpp/src/arrow/pretty_print.cc b/cpp/src/arrow/pretty_print.cc index c63a9e93e6a63..9c439c47eb82c 100644 --- a/cpp/src/arrow/pretty_print.cc +++ b/cpp/src/arrow/pretty_print.cc @@ -22,13 +22,10 @@ #include "arrow/array.h" #include "arrow/pretty_print.h" +#include "arrow/status.h" #include "arrow/table.h" #include "arrow/type.h" #include "arrow/type_traits.h" -#include "arrow/types/list.h" -#include "arrow/types/string.h" -#include "arrow/types/struct.h" -#include "arrow/util/status.h" namespace arrow { diff --git a/cpp/src/arrow/util/status-test.cc b/cpp/src/arrow/status-test.cc similarity index 97% rename from cpp/src/arrow/util/status-test.cc rename to cpp/src/arrow/status-test.cc index e0ff20fea1233..969ba970c154f 100644 --- a/cpp/src/arrow/util/status-test.cc +++ b/cpp/src/arrow/status-test.cc @@ -17,8 +17,8 @@ #include "gtest/gtest.h" +#include "arrow/status.h" #include "arrow/test-util.h" -#include "arrow/util/status.h" namespace arrow { diff --git a/cpp/src/arrow/util/status.cc b/cpp/src/arrow/status.cc similarity index 98% rename from cpp/src/arrow/util/status.cc rename to cpp/src/arrow/status.cc index 08e9ae3946e51..e1a242721eccc 100644 --- a/cpp/src/arrow/util/status.cc +++ b/cpp/src/arrow/status.cc @@ -10,7 +10,7 @@ // non-const method, all threads accessing the same Status must use // external synchronization. 
-#include "arrow/util/status.h" +#include "arrow/status.h" #include diff --git a/cpp/src/arrow/util/status.h b/cpp/src/arrow/status.h similarity index 100% rename from cpp/src/arrow/util/status.h rename to cpp/src/arrow/status.h diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc index 743fb669700ea..f62336d07f09a 100644 --- a/cpp/src/arrow/table-test.cc +++ b/cpp/src/arrow/table-test.cc @@ -21,13 +21,13 @@ #include "gtest/gtest.h" +#include "arrow/array.h" #include "arrow/column.h" #include "arrow/schema.h" +#include "arrow/status.h" #include "arrow/table.h" #include "arrow/test-util.h" #include "arrow/type.h" -#include "arrow/types/primitive.h" -#include "arrow/util/status.h" using std::shared_ptr; using std::vector; diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc index eb1258a73038a..855d4ec04085d 100644 --- a/cpp/src/arrow/table.cc +++ b/cpp/src/arrow/table.cc @@ -24,7 +24,7 @@ #include "arrow/array.h" #include "arrow/column.h" #include "arrow/schema.h" -#include "arrow/util/status.h" +#include "arrow/status.h" namespace arrow { diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index b86a1809cd0e9..aa310b1a49ebe 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -28,17 +28,18 @@ #include "gtest/gtest.h" #include "arrow/array.h" +#include "arrow/buffer.h" +#include "arrow/builder.h" #include "arrow/column.h" +#include "arrow/memory_pool.h" #include "arrow/schema.h" +#include "arrow/status.h" #include "arrow/table.h" #include "arrow/type.h" #include "arrow/type_traits.h" #include "arrow/util/bit-util.h" -#include "arrow/util/buffer.h" #include "arrow/util/logging.h" -#include "arrow/util/memory-pool.h" #include "arrow/util/random.h" -#include "arrow/util/status.h" #define ASSERT_RAISES(ENUM, expr) \ do { \ @@ -253,8 +254,9 @@ Status MakeRandomBytePoolBuffer(int32_t length, MemoryPool* pool, } // namespace test template -void MakeArray(const std::shared_ptr& type, const std::vector& is_valid, - const std::vector& values, std::shared_ptr* out) { +void ArrayFromVector(const std::shared_ptr& type, + const std::vector& is_valid, const std::vector& values, + std::shared_ptr* out) { std::shared_ptr values_buffer; std::shared_ptr values_bitmap; @@ -272,6 +274,37 @@ void MakeArray(const std::shared_ptr& type, const std::vector& i values_buffer, null_count, values_bitmap); } +class TestBuilder : public ::testing::Test { + public: + void SetUp() { + pool_ = default_memory_pool(); + type_ = TypePtr(new UInt8Type()); + builder_.reset(new UInt8Builder(pool_, type_)); + builder_nn_.reset(new UInt8Builder(pool_, type_)); + } + + protected: + MemoryPool* pool_; + + TypePtr type_; + std::unique_ptr builder_; + std::unique_ptr builder_nn_; +}; + +template +Status MakeArray(const std::vector& valid_bytes, const std::vector& values, + int size, Builder* builder, ArrayPtr* out) { + // Append the first 1000 + for (int i = 0; i < size; ++i) { + if (valid_bytes[i] > 0) { + RETURN_NOT_OK(builder->Append(values[i])); + } else { + RETURN_NOT_OK(builder->AppendNull()); + } + } + return builder->Finish(out); +} + } // namespace arrow #endif // ARROW_TEST_UTIL_H_ diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 75f5086f37de0..5b172e41f6809 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -20,7 +20,7 @@ #include #include -#include "arrow/util/status.h" +#include "arrow/status.h" namespace arrow { @@ -220,6 +220,12 @@ std::vector UnionType::GetBufferLayout() const { } } +std::string DecimalType::ToString() const { + 
std::stringstream s; + s << "decimal(" << precision << ", " << scale << ")"; + return s.str(); +} + std::vector DecimalType::GetBufferLayout() const { // TODO(wesm) return {}; diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 966706cb520b2..8637081acd9b7 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -23,9 +23,9 @@ #include #include +#include "arrow/status.h" #include "arrow/type_fwd.h" #include "arrow/util/macros.h" -#include "arrow/util/status.h" #include "arrow/util/visibility.h" namespace arrow { diff --git a/cpp/src/arrow/types/CMakeLists.txt b/cpp/src/arrow/types/CMakeLists.txt deleted file mode 100644 index 6d59acfdf2eec..0000000000000 --- a/cpp/src/arrow/types/CMakeLists.txt +++ /dev/null @@ -1,39 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - -####################################### -# arrow_types -####################################### - -# Headers: top level -install(FILES - construct.h - datetime.h - decimal.h - list.h - primitive.h - string.h - struct.h - union.h - DESTINATION include/arrow/types) - - -ADD_ARROW_TEST(decimal-test) -ADD_ARROW_TEST(list-test) -ADD_ARROW_TEST(primitive-test) -ADD_ARROW_TEST(string-test) -ADD_ARROW_TEST(struct-test) diff --git a/cpp/src/arrow/types/construct.cc b/cpp/src/arrow/types/construct.cc deleted file mode 100644 index ab9c59fd4639d..0000000000000 --- a/cpp/src/arrow/types/construct.cc +++ /dev/null @@ -1,124 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
- -#include "arrow/types/construct.h" - -#include - -#include "arrow/type.h" -#include "arrow/types/list.h" -#include "arrow/types/primitive.h" -#include "arrow/types/string.h" -#include "arrow/types/struct.h" -#include "arrow/util/buffer.h" -#include "arrow/util/status.h" - -namespace arrow { - -class ArrayBuilder; - -#define BUILDER_CASE(ENUM, BuilderType) \ - case Type::ENUM: \ - out->reset(new BuilderType(pool, type)); \ - return Status::OK(); - -// Initially looked at doing this with vtables, but shared pointers makes it -// difficult -// -// TODO(wesm): come up with a less monolithic strategy -Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, - std::shared_ptr* out) { - switch (type->type) { - BUILDER_CASE(UINT8, UInt8Builder); - BUILDER_CASE(INT8, Int8Builder); - BUILDER_CASE(UINT16, UInt16Builder); - BUILDER_CASE(INT16, Int16Builder); - BUILDER_CASE(UINT32, UInt32Builder); - BUILDER_CASE(INT32, Int32Builder); - BUILDER_CASE(UINT64, UInt64Builder); - BUILDER_CASE(INT64, Int64Builder); - BUILDER_CASE(TIMESTAMP, TimestampBuilder); - - BUILDER_CASE(BOOL, BooleanBuilder); - - BUILDER_CASE(FLOAT, FloatBuilder); - BUILDER_CASE(DOUBLE, DoubleBuilder); - - BUILDER_CASE(STRING, StringBuilder); - BUILDER_CASE(BINARY, BinaryBuilder); - - case Type::LIST: { - std::shared_ptr value_builder; - std::shared_ptr value_type = - static_cast(type.get())->value_type(); - RETURN_NOT_OK(MakeBuilder(pool, value_type, &value_builder)); - out->reset(new ListBuilder(pool, value_builder)); - return Status::OK(); - } - - case Type::STRUCT: { - std::vector& fields = type->children_; - std::vector> values_builder; - - for (auto it : fields) { - std::shared_ptr builder; - RETURN_NOT_OK(MakeBuilder(pool, it->type, &builder)); - values_builder.push_back(builder); - } - out->reset(new StructBuilder(pool, type, values_builder)); - return Status::OK(); - } - - default: - return Status::NotImplemented(type->ToString()); - } -} - -#define MAKE_PRIMITIVE_ARRAY_CASE(ENUM, ArrayType) \ - case Type::ENUM: \ - out->reset(new ArrayType(type, length, data, null_count, null_bitmap)); \ - break; - -Status MakePrimitiveArray(const TypePtr& type, int32_t length, - const std::shared_ptr& data, int32_t null_count, - const std::shared_ptr& null_bitmap, ArrayPtr* out) { - switch (type->type) { - MAKE_PRIMITIVE_ARRAY_CASE(BOOL, BooleanArray); - MAKE_PRIMITIVE_ARRAY_CASE(UINT8, UInt8Array); - MAKE_PRIMITIVE_ARRAY_CASE(INT8, Int8Array); - MAKE_PRIMITIVE_ARRAY_CASE(UINT16, UInt16Array); - MAKE_PRIMITIVE_ARRAY_CASE(INT16, Int16Array); - MAKE_PRIMITIVE_ARRAY_CASE(UINT32, UInt32Array); - MAKE_PRIMITIVE_ARRAY_CASE(INT32, Int32Array); - MAKE_PRIMITIVE_ARRAY_CASE(UINT64, UInt64Array); - MAKE_PRIMITIVE_ARRAY_CASE(INT64, Int64Array); - MAKE_PRIMITIVE_ARRAY_CASE(FLOAT, FloatArray); - MAKE_PRIMITIVE_ARRAY_CASE(DOUBLE, DoubleArray); - MAKE_PRIMITIVE_ARRAY_CASE(TIME, Int64Array); - MAKE_PRIMITIVE_ARRAY_CASE(TIMESTAMP, TimestampArray); - MAKE_PRIMITIVE_ARRAY_CASE(TIMESTAMP_DOUBLE, DoubleArray); - default: - return Status::NotImplemented(type->ToString()); - } -#ifdef NDEBUG - return Status::OK(); -#else - return (*out)->Validate(); -#endif -} - -} // namespace arrow diff --git a/cpp/src/arrow/types/construct.h b/cpp/src/arrow/types/construct.h deleted file mode 100644 index e18e946d1a64c..0000000000000 --- a/cpp/src/arrow/types/construct.h +++ /dev/null @@ -1,47 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. 
See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#ifndef ARROW_TYPES_CONSTRUCT_H -#define ARROW_TYPES_CONSTRUCT_H - -#include -#include -#include - -#include "arrow/util/visibility.h" - -namespace arrow { - -class Array; -class ArrayBuilder; -class Buffer; -struct DataType; -struct Field; -class MemoryPool; -class Status; - -Status ARROW_EXPORT MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, - std::shared_ptr* out); - -// Create new arrays for logical types that are backed by primitive arrays. -Status ARROW_EXPORT MakePrimitiveArray(const std::shared_ptr& type, - int32_t length, const std::shared_ptr& data, int32_t null_count, - const std::shared_ptr& null_bitmap, std::shared_ptr* out); - -} // namespace arrow - -#endif // ARROW_BUILDER_H_ diff --git a/cpp/src/arrow/types/datetime.h b/cpp/src/arrow/types/datetime.h deleted file mode 100644 index a8f863923129a..0000000000000 --- a/cpp/src/arrow/types/datetime.h +++ /dev/null @@ -1,27 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#ifndef ARROW_TYPES_DATETIME_H -#define ARROW_TYPES_DATETIME_H - -#include - -#include "arrow/type.h" - -namespace arrow {} // namespace arrow - -#endif // ARROW_TYPES_DATETIME_H diff --git a/cpp/src/arrow/types/decimal.cc b/cpp/src/arrow/types/decimal.cc deleted file mode 100644 index 1d9a5e50e460b..0000000000000 --- a/cpp/src/arrow/types/decimal.cc +++ /dev/null @@ -1,31 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. 
See the License for the -// specific language governing permissions and limitations -// under the License. - -#include "arrow/types/decimal.h" - -#include -#include - -namespace arrow { - -std::string DecimalType::ToString() const { - std::stringstream s; - s << "decimal(" << precision << ", " << scale << ")"; - return s.str(); -} - -} // namespace arrow diff --git a/cpp/src/arrow/types/decimal.h b/cpp/src/arrow/types/decimal.h deleted file mode 100644 index b3ea3a56d8008..0000000000000 --- a/cpp/src/arrow/types/decimal.h +++ /dev/null @@ -1,28 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#ifndef ARROW_TYPES_DECIMAL_H -#define ARROW_TYPES_DECIMAL_H - -#include - -#include "arrow/type.h" -#include "arrow/util/visibility.h" - -namespace arrow {} // namespace arrow - -#endif // ARROW_TYPES_DECIMAL_H diff --git a/cpp/src/arrow/types/list.cc b/cpp/src/arrow/types/list.cc deleted file mode 100644 index d86563253bd5a..0000000000000 --- a/cpp/src/arrow/types/list.cc +++ /dev/null @@ -1,162 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
-#include "arrow/types/list.h" - -#include - -namespace arrow { - -bool ListArray::EqualsExact(const ListArray& other) const { - if (this == &other) { return true; } - if (null_count_ != other.null_count_) { return false; } - - bool equal_offsets = - offset_buffer_->Equals(*other.offset_buffer_, (length_ + 1) * sizeof(int32_t)); - if (!equal_offsets) { return false; } - bool equal_null_bitmap = true; - if (null_count_ > 0) { - equal_null_bitmap = - null_bitmap_->Equals(*other.null_bitmap_, BitUtil::BytesForBits(length_)); - } - - if (!equal_null_bitmap) { return false; } - - return values()->Equals(other.values()); -} - -bool ListArray::Equals(const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (this->type_enum() != arr->type_enum()) { return false; } - return EqualsExact(*static_cast(arr.get())); -} - -bool ListArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (!arr) { return false; } - if (this->type_enum() != arr->type_enum()) { return false; } - const auto other = static_cast(arr.get()); - for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { - const bool is_null = IsNull(i); - if (is_null != arr->IsNull(o_i)) { return false; } - if (is_null) continue; - const int32_t begin_offset = offset(i); - const int32_t end_offset = offset(i + 1); - const int32_t other_begin_offset = other->offset(o_i); - const int32_t other_end_offset = other->offset(o_i + 1); - // Underlying can't be equal if the size isn't equal - if (end_offset - begin_offset != other_end_offset - other_begin_offset) { - return false; - } - if (!values_->RangeEquals( - begin_offset, end_offset, other_begin_offset, other->values())) { - return false; - } - } - return true; -} - -Status ListArray::Validate() const { - if (length_ < 0) { return Status::Invalid("Length was negative"); } - if (!offset_buffer_) { return Status::Invalid("offset_buffer_ was null"); } - if (offset_buffer_->size() / static_cast(sizeof(int32_t)) < length_) { - std::stringstream ss; - ss << "offset buffer size (bytes): " << offset_buffer_->size() - << " isn't large enough for length: " << length_; - return Status::Invalid(ss.str()); - } - const int32_t last_offset = offset(length_); - if (last_offset > 0) { - if (!values_) { - return Status::Invalid("last offset was non-zero and values was null"); - } - if (values_->length() != last_offset) { - std::stringstream ss; - ss << "Final offset invariant not equal to values length: " << last_offset - << "!=" << values_->length(); - return Status::Invalid(ss.str()); - } - - const Status child_valid = values_->Validate(); - if (!child_valid.ok()) { - std::stringstream ss; - ss << "Child array invalid: " << child_valid.ToString(); - return Status::Invalid(ss.str()); - } - } - - int32_t prev_offset = offset(0); - if (prev_offset != 0) { return Status::Invalid("The first offset wasn't zero"); } - for (int32_t i = 1; i <= length_; ++i) { - int32_t current_offset = offset(i); - if (IsNull(i - 1) && current_offset != prev_offset) { - std::stringstream ss; - ss << "Offset invariant failure at: " << i << " inconsistent offsets for null slot" - << current_offset << "!=" << prev_offset; - return Status::Invalid(ss.str()); - } - if (current_offset < prev_offset) { - std::stringstream ss; - ss << "Offset invariant failure: " << i - << " inconsistent offset for non-null slot: " << current_offset << "<" - << prev_offset; - return Status::Invalid(ss.str()); - } 
- prev_offset = current_offset; - } - return Status::OK(); -} - -Status ListBuilder::Init(int32_t elements) { - DCHECK_LT(elements, std::numeric_limits::max()); - RETURN_NOT_OK(ArrayBuilder::Init(elements)); - // one more then requested for offsets - return offset_builder_.Resize((elements + 1) * sizeof(int32_t)); -} - -Status ListBuilder::Resize(int32_t capacity) { - DCHECK_LT(capacity, std::numeric_limits::max()); - // one more then requested for offsets - RETURN_NOT_OK(offset_builder_.Resize((capacity + 1) * sizeof(int32_t))); - return ArrayBuilder::Resize(capacity); -} - -Status ListBuilder::Finish(std::shared_ptr* out) { - std::shared_ptr items = values_; - if (!items) { RETURN_NOT_OK(value_builder_->Finish(&items)); } - - RETURN_NOT_OK(offset_builder_.Append(items->length())); - std::shared_ptr offsets = offset_builder_.Finish(); - - *out = std::make_shared( - type_, length_, offsets, items, null_count_, null_bitmap_); - - Reset(); - - return Status::OK(); -} - -void ListBuilder::Reset() { - capacity_ = length_ = null_count_ = 0; - null_bitmap_ = nullptr; -} - -Status ListArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} - -} // namespace arrow diff --git a/cpp/src/arrow/types/list.h b/cpp/src/arrow/types/list.h deleted file mode 100644 index ec09a78afa66c..0000000000000 --- a/cpp/src/arrow/types/list.h +++ /dev/null @@ -1,170 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#ifndef ARROW_TYPES_LIST_H -#define ARROW_TYPES_LIST_H - -#include -#include -#include -#include - -#include "arrow/array.h" -#include "arrow/builder.h" -#include "arrow/type.h" -#include "arrow/types/primitive.h" -#include "arrow/util/bit-util.h" -#include "arrow/util/buffer.h" -#include "arrow/util/logging.h" -#include "arrow/util/status.h" -#include "arrow/util/visibility.h" - -namespace arrow { - -class MemoryPool; - -class ARROW_EXPORT ListArray : public Array { - public: - using TypeClass = ListType; - - ListArray(const TypePtr& type, int32_t length, std::shared_ptr offsets, - const ArrayPtr& values, int32_t null_count = 0, - std::shared_ptr null_bitmap = nullptr) - : Array(type, length, null_count, null_bitmap) { - offset_buffer_ = offsets; - offsets_ = offsets == nullptr ? nullptr : reinterpret_cast( - offset_buffer_->data()); - values_ = values; - } - - Status Validate() const override; - - virtual ~ListArray() = default; - - // Return a shared pointer in case the requestor desires to share ownership - // with this array. 
- std::shared_ptr values() const { return values_; } - std::shared_ptr offsets() const { - return std::static_pointer_cast(offset_buffer_); - } - - std::shared_ptr value_type() const { return values_->type(); } - - const int32_t* raw_offsets() const { return offsets_; } - - int32_t offset(int i) const { return offsets_[i]; } - - // Neither of these functions will perform boundschecking - int32_t value_offset(int i) const { return offsets_[i]; } - int32_t value_length(int i) const { return offsets_[i + 1] - offsets_[i]; } - - bool EqualsExact(const ListArray& other) const; - bool Equals(const std::shared_ptr& arr) const override; - - bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const ArrayPtr& arr) const override; - - Status Accept(ArrayVisitor* visitor) const override; - - protected: - std::shared_ptr offset_buffer_; - const int32_t* offsets_; - ArrayPtr values_; -}; - -// ---------------------------------------------------------------------- -// Array builder - -// Builder class for variable-length list array value types -// -// To use this class, you must append values to the child array builder and use -// the Append function to delimit each distinct list value (once the values -// have been appended to the child array) or use the bulk API to append -// a sequence of offests and null values. -// -// A note on types. Per arrow/type.h all types in the c++ implementation are -// logical so even though this class always builds list array, this can -// represent multiple different logical types. If no logical type is provided -// at construction time, the class defaults to List where t is taken from the -// value_builder/values that the object is constructed with. -class ARROW_EXPORT ListBuilder : public ArrayBuilder { - public: - // Use this constructor to incrementally build the value array along with offsets and - // null bitmap. - ListBuilder(MemoryPool* pool, std::shared_ptr value_builder, - const TypePtr& type = nullptr) - : ArrayBuilder( - pool, type ? type : std::static_pointer_cast( - std::make_shared(value_builder->type()))), - offset_builder_(pool), - value_builder_(value_builder) {} - - // Use this constructor to build the list with a pre-existing values array - ListBuilder( - MemoryPool* pool, std::shared_ptr values, const TypePtr& type = nullptr) - : ArrayBuilder(pool, type ? 
type : std::static_pointer_cast( - std::make_shared(values->type()))), - offset_builder_(pool), - values_(values) {} - - virtual ~ListBuilder() {} - - Status Init(int32_t elements) override; - Status Resize(int32_t capacity) override; - Status Finish(std::shared_ptr* out) override; - - // Vector append - // - // If passed, valid_bytes is of equal length to values, and any zero byte - // will be considered as a null for that slot - Status Append( - const int32_t* offsets, int32_t length, const uint8_t* valid_bytes = nullptr) { - RETURN_NOT_OK(Reserve(length)); - UnsafeAppendToBitmap(valid_bytes, length); - offset_builder_.UnsafeAppend(offsets, length); - return Status::OK(); - } - - // Start a new variable-length list slot - // - // This function should be called before beginning to append elements to the - // value builder - Status Append(bool is_valid = true) { - RETURN_NOT_OK(Reserve(1)); - UnsafeAppendToBitmap(is_valid); - RETURN_NOT_OK(offset_builder_.Append(value_builder_->length())); - return Status::OK(); - } - - Status AppendNull() { return Append(false); } - - std::shared_ptr value_builder() const { - DCHECK(!values_) << "Using value builder is pointless when values_ is set"; - return value_builder_; - } - - protected: - BufferBuilder offset_builder_; - std::shared_ptr value_builder_; - std::shared_ptr values_; - - void Reset(); -}; - -} // namespace arrow - -#endif // ARROW_TYPES_LIST_H diff --git a/cpp/src/arrow/types/primitive.cc b/cpp/src/arrow/types/primitive.cc deleted file mode 100644 index 75e5a9ff40e16..0000000000000 --- a/cpp/src/arrow/types/primitive.cc +++ /dev/null @@ -1,294 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#include "arrow/types/primitive.h" - -#include -#include - -#include "arrow/type_traits.h" -#include "arrow/util/bit-util.h" -#include "arrow/util/buffer.h" -#include "arrow/util/logging.h" - -namespace arrow { - -// ---------------------------------------------------------------------- -// Primitive array base - -PrimitiveArray::PrimitiveArray(const TypePtr& type, int32_t length, - const std::shared_ptr& data, int32_t null_count, - const std::shared_ptr& null_bitmap) - : Array(type, length, null_count, null_bitmap) { - data_ = data; - raw_data_ = data == nullptr ? 
-
-template <typename T>
-Status PrimitiveBuilder<T>::Init(int32_t capacity) {
-  RETURN_NOT_OK(ArrayBuilder::Init(capacity));
-  data_ = std::make_shared<PoolBuffer>(pool_);
-
-  int64_t nbytes = TypeTraits<T>::bytes_required(capacity);
-  RETURN_NOT_OK(data_->Resize(nbytes));
-  // TODO(emkornfield) valgrind complains without this
-  memset(data_->mutable_data(), 0, nbytes);
-
-  raw_data_ = reinterpret_cast<value_type*>(data_->mutable_data());
-  return Status::OK();
-}
-
-template <typename T>
-Status PrimitiveBuilder<T>::Resize(int32_t capacity) {
-  // XXX: Set floor size for now
-  if (capacity < kMinBuilderCapacity) { capacity = kMinBuilderCapacity; }
-
-  if (capacity_ == 0) {
-    RETURN_NOT_OK(Init(capacity));
-  } else {
-    RETURN_NOT_OK(ArrayBuilder::Resize(capacity));
-    const int64_t old_bytes = data_->size();
-    const int64_t new_bytes = TypeTraits<T>::bytes_required(capacity);
-    RETURN_NOT_OK(data_->Resize(new_bytes));
-    raw_data_ = reinterpret_cast<value_type*>(data_->mutable_data());
-    memset(data_->mutable_data() + old_bytes, 0, new_bytes - old_bytes);
-  }
-  return Status::OK();
-}
-
-template <typename T>
-Status PrimitiveBuilder<T>::Append(
-    const value_type* values, int32_t length, const uint8_t* valid_bytes) {
-  RETURN_NOT_OK(Reserve(length));
-
-  if (length > 0) {
-    memcpy(raw_data_ + length_, values, TypeTraits<T>::bytes_required(length));
-  }
-
-  // length_ is updated by this
-  ArrayBuilder::UnsafeAppendToBitmap(valid_bytes, length);
-
-  return Status::OK();
-}
-
-template <typename T>
-Status PrimitiveBuilder<T>::Finish(std::shared_ptr<Array>* out) {
-  const int64_t bytes_required = TypeTraits<T>::bytes_required(length_);
-  if (bytes_required > 0 && bytes_required < data_->size()) {
-    // Trim buffers
-    RETURN_NOT_OK(data_->Resize(bytes_required));
-  }
-  *out = std::make_shared<typename TypeTraits<T>::ArrayType>(
-      type_, length_, data_, null_count_, null_bitmap_);
-
-  data_ = null_bitmap_ = nullptr;
-  capacity_ = length_ = null_count_ = 0;
-  return Status::OK();
-}
-
-template class PrimitiveBuilder<UInt8Type>;
-template class PrimitiveBuilder<UInt16Type>;
-template class PrimitiveBuilder<UInt32Type>;
-template class PrimitiveBuilder<UInt64Type>;
-template class PrimitiveBuilder<Int8Type>;
-template class PrimitiveBuilder<Int16Type>;
-template class PrimitiveBuilder<Int32Type>;
-template class PrimitiveBuilder<Int64Type>;
-template class PrimitiveBuilder<TimestampType>;
-template class PrimitiveBuilder<HalfFloatType>;
-template class PrimitiveBuilder<FloatType>;
-template class PrimitiveBuilder<DoubleType>;
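
To make the valid_bytes contract above concrete, a minimal sketch (mine, under the same assumptions as the earlier one: a MemoryPool* pool and the Int32Builder alias from primitive.h):

    Status BuildWithNulls(MemoryPool* pool, std::shared_ptr<Array>* out) {
      Int32Builder builder(pool, std::make_shared<Int32Type>());

      // Three slots: 7, null, 9. A zero byte in valid_bytes marks its slot
      // null; the matching entry in values is copied but carries no meaning.
      const int32_t values[] = {7, 0, 9};
      const uint8_t valid_bytes[] = {1, 0, 1};
      RETURN_NOT_OK(builder.Append(values, 3, valid_bytes));

      return builder.Finish(out);  // resulting array has null_count() == 1
    }
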
-
-Status BooleanBuilder::Init(int32_t capacity) {
-  RETURN_NOT_OK(ArrayBuilder::Init(capacity));
-  data_ = std::make_shared<PoolBuffer>(pool_);
-
-  int64_t nbytes = BitUtil::BytesForBits(capacity);
-  RETURN_NOT_OK(data_->Resize(nbytes));
-  // TODO(emkornfield) valgrind complains without this
-  memset(data_->mutable_data(), 0, nbytes);
-
-  raw_data_ = reinterpret_cast<uint8_t*>(data_->mutable_data());
-  return Status::OK();
-}
-
-Status BooleanBuilder::Resize(int32_t capacity) {
-  // XXX: Set floor size for now
-  if (capacity < kMinBuilderCapacity) { capacity = kMinBuilderCapacity; }
-
-  if (capacity_ == 0) {
-    RETURN_NOT_OK(Init(capacity));
-  } else {
-    RETURN_NOT_OK(ArrayBuilder::Resize(capacity));
-    const int64_t old_bytes = data_->size();
-    const int64_t new_bytes = BitUtil::BytesForBits(capacity);
-
-    RETURN_NOT_OK(data_->Resize(new_bytes));
-    raw_data_ = reinterpret_cast<uint8_t*>(data_->mutable_data());
-    memset(data_->mutable_data() + old_bytes, 0, new_bytes - old_bytes);
-  }
-  return Status::OK();
-}
-
-Status BooleanBuilder::Finish(std::shared_ptr<Array>* out) {
-  const int64_t bytes_required = BitUtil::BytesForBits(length_);
-
-  if (bytes_required > 0 && bytes_required < data_->size()) {
-    // Trim buffers
-    RETURN_NOT_OK(data_->Resize(bytes_required));
-  }
-  *out = std::make_shared<BooleanArray>(type_, length_, data_, null_count_, null_bitmap_);
-
-  data_ = null_bitmap_ = nullptr;
-  capacity_ = length_ = null_count_ = 0;
-  return Status::OK();
-}
-
-Status BooleanBuilder::Append(
-    const uint8_t* values, int32_t length, const uint8_t* valid_bytes) {
-  RETURN_NOT_OK(Reserve(length));
-
-  for (int i = 0; i < length; ++i) {
-    // Skip reading from uninitialized memory
-    // TODO: This actually is only to keep valgrind happy but may or may not
-    // have a performance impact.
-    if ((valid_bytes != nullptr) && !valid_bytes[i]) continue;
-
-    if (values[i] > 0) {
-      BitUtil::SetBit(raw_data_, length_ + i);
-    } else {
-      BitUtil::ClearBit(raw_data_, length_ + i);
-    }
-  }
-
-  // this updates length_
-  ArrayBuilder::UnsafeAppendToBitmap(valid_bytes, length);
-  return Status::OK();
-}
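
BooleanBuilder stores one value per bit rather than per byte, which is why its sizing goes through BitUtil::BytesForBits instead of TypeTraits. A short sketch of the effect (again my illustration, assuming a MemoryPool* pool):

    Status BuildBooleans(MemoryPool* pool, std::shared_ptr<Array>* out) {
      BooleanBuilder builder(pool, std::make_shared<BooleanType>());

      const uint8_t values[] = {1, 0, 1, 1};     // any nonzero byte reads as true
      RETURN_NOT_OK(builder.Append(values, 4));  // packed into a single data byte

      // Finish trims the data buffer to BitUtil::BytesForBits(4) == 1 byte.
      return builder.Finish(out);
    }
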
-
-BooleanArray::BooleanArray(int32_t length, const std::shared_ptr<Buffer>& data,
-    int32_t null_count, const std::shared_ptr<Buffer>& null_bitmap)
-    : PrimitiveArray(
-          std::make_shared<BooleanType>(), length, data, null_count, null_bitmap) {}
-
-BooleanArray::BooleanArray(const TypePtr& type, int32_t length,
-    const std::shared_ptr<Buffer>& data, int32_t null_count,
-    const std::shared_ptr<Buffer>& null_bitmap)
-    : PrimitiveArray(type, length, data, null_count, null_bitmap) {}
-
-bool BooleanArray::EqualsExact(const BooleanArray& other) const {
-  if (this == &other) return true;
-  if (null_count_ != other.null_count_) { return false; }
-
-  if (null_count_ > 0) {
-    bool equal_bitmap =
-        null_bitmap_->Equals(*other.null_bitmap_, BitUtil::BytesForBits(length_));
-    if (!equal_bitmap) { return false; }
-
-    const uint8_t* this_data = raw_data_;
-    const uint8_t* other_data = other.raw_data_;
-
-    for (int i = 0; i < length_; ++i) {
-      if (!IsNull(i) && BitUtil::GetBit(this_data, i) != BitUtil::GetBit(other_data, i)) {
-        return false;
-      }
-    }
-    return true;
-  } else {
-    return data_->Equals(*other.data_, BitUtil::BytesForBits(length_));
-  }
-}
-
-bool BooleanArray::Equals(const ArrayPtr& arr) const {
-  if (this == arr.get()) return true;
-  if (Type::BOOL != arr->type_enum()) { return false; }
-  return EqualsExact(*static_cast<const BooleanArray*>(arr.get()));
-}
-
-bool BooleanArray::RangeEquals(int32_t start_idx, int32_t end_idx,
-    int32_t other_start_idx, const ArrayPtr& arr) const {
-  if (this == arr.get()) { return true; }
-  if (!arr) { return false; }
-  if (this->type_enum() != arr->type_enum()) { return false; }
-  const auto other = static_cast<const BooleanArray*>(arr.get());
-  for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) {
-    const bool is_null = IsNull(i);
-    if (is_null != arr->IsNull(o_i) || (!is_null && Value(i) != other->Value(o_i))) {
-      return false;
-    }
-  }
-  return true;
-}
-
-Status BooleanArray::Accept(ArrayVisitor* visitor) const {
-  return visitor->Visit(*this);
-}
-
-}  // namespace arrow
diff --git a/cpp/src/arrow/types/primitive.h b/cpp/src/arrow/types/primitive.h
deleted file mode 100644
index ec578e1e0aee7..0000000000000
--- a/cpp/src/arrow/types/primitive.h
+++ /dev/null
@@ -1,371 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements. See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership. The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License. You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied. See the License for the
-// specific language governing permissions and limitations
-// under the License.
- -#ifndef ARROW_TYPES_PRIMITIVE_H -#define ARROW_TYPES_PRIMITIVE_H - -#include -#include -#include -#include -#include -#include - -#include "arrow/array.h" -#include "arrow/builder.h" -#include "arrow/type.h" -#include "arrow/type_fwd.h" -#include "arrow/types/datetime.h" -#include "arrow/util/bit-util.h" -#include "arrow/util/buffer.h" -#include "arrow/util/status.h" -#include "arrow/util/visibility.h" - -namespace arrow { - -class MemoryPool; - -// Base class for fixed-size logical types. See MakePrimitiveArray -// (types/construct.h) for constructing a specific subclass. -class ARROW_EXPORT PrimitiveArray : public Array { - public: - virtual ~PrimitiveArray() {} - - std::shared_ptr data() const { return data_; } - - bool EqualsExact(const PrimitiveArray& other) const; - bool Equals(const std::shared_ptr& arr) const override; - - protected: - PrimitiveArray(const TypePtr& type, int32_t length, const std::shared_ptr& data, - int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); - std::shared_ptr data_; - const uint8_t* raw_data_; -}; - -template -class ARROW_EXPORT NumericArray : public PrimitiveArray { - public: - using TypeClass = TYPE; - using value_type = typename TypeClass::c_type; - NumericArray(int32_t length, const std::shared_ptr& data, - int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr) - : PrimitiveArray( - std::make_shared(), length, data, null_count, null_bitmap) {} - NumericArray(const TypePtr& type, int32_t length, const std::shared_ptr& data, - int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr) - : PrimitiveArray(type, length, data, null_count, null_bitmap) {} - - bool EqualsExact(const NumericArray& other) const { - return PrimitiveArray::EqualsExact(static_cast(other)); - } - - bool ApproxEquals(const std::shared_ptr& arr) const { return Equals(arr); } - - bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const ArrayPtr& arr) const override { - if (this == arr.get()) { return true; } - if (!arr) { return false; } - if (this->type_enum() != arr->type_enum()) { return false; } - const auto other = static_cast*>(arr.get()); - for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { - const bool is_null = IsNull(i); - if (is_null != arr->IsNull(o_i) || (!is_null && Value(i) != other->Value(o_i))) { - return false; - } - } - return true; - } - const value_type* raw_data() const { - return reinterpret_cast(raw_data_); - } - - Status Accept(ArrayVisitor* visitor) const override; - - value_type Value(int i) const { return raw_data()[i]; } -}; - -template <> -inline bool NumericArray::ApproxEquals( - const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (!arr) { return false; } - if (this->type_enum() != arr->type_enum()) { return false; } - - const auto& other = *static_cast*>(arr.get()); - - if (this == &other) { return true; } - if (null_count_ != other.null_count_) { return false; } - - auto this_data = reinterpret_cast(raw_data_); - auto other_data = reinterpret_cast(other.raw_data_); - - static constexpr float EPSILON = 1E-5; - - if (length_ == 0 && other.length_ == 0) { return true; } - - if (null_count_ > 0) { - bool equal_bitmap = - null_bitmap_->Equals(*other.null_bitmap_, BitUtil::CeilByte(length_) / 8); - if (!equal_bitmap) { return false; } - - for (int i = 0; i < length_; ++i) { - if (IsNull(i)) continue; - if (fabs(this_data[i] - other_data[i]) > EPSILON) { return false; } - } - } else { - for (int i = 0; i < length_; ++i) { - if 
(fabs(this_data[i] - other_data[i]) > EPSILON) { return false; } - } - } - return true; -} - -template <> -inline bool NumericArray::ApproxEquals( - const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (!arr) { return false; } - if (this->type_enum() != arr->type_enum()) { return false; } - - const auto& other = *static_cast*>(arr.get()); - - if (this == &other) { return true; } - if (null_count_ != other.null_count_) { return false; } - - auto this_data = reinterpret_cast(raw_data_); - auto other_data = reinterpret_cast(other.raw_data_); - - if (length_ == 0 && other.length_ == 0) { return true; } - - static constexpr double EPSILON = 1E-5; - - if (null_count_ > 0) { - bool equal_bitmap = - null_bitmap_->Equals(*other.null_bitmap_, BitUtil::CeilByte(length_) / 8); - if (!equal_bitmap) { return false; } - - for (int i = 0; i < length_; ++i) { - if (IsNull(i)) continue; - if (fabs(this_data[i] - other_data[i]) > EPSILON) { return false; } - } - } else { - for (int i = 0; i < length_; ++i) { - if (fabs(this_data[i] - other_data[i]) > EPSILON) { return false; } - } - } - return true; -} - -template -class ARROW_EXPORT PrimitiveBuilder : public ArrayBuilder { - public: - using value_type = typename Type::c_type; - - explicit PrimitiveBuilder(MemoryPool* pool, const TypePtr& type) - : ArrayBuilder(pool, type), data_(nullptr) {} - - virtual ~PrimitiveBuilder() {} - - using ArrayBuilder::Advance; - - // Write nulls as uint8_t* (0 value indicates null) into pre-allocated memory - Status AppendNulls(const uint8_t* valid_bytes, int32_t length) { - RETURN_NOT_OK(Reserve(length)); - UnsafeAppendToBitmap(valid_bytes, length); - return Status::OK(); - } - - Status AppendNull() { - RETURN_NOT_OK(Reserve(1)); - UnsafeAppendToBitmap(false); - return Status::OK(); - } - - std::shared_ptr data() const { return data_; } - - // Vector append - // - // If passed, valid_bytes is of equal length to values, and any zero byte - // will be considered as a null for that slot - Status Append( - const value_type* values, int32_t length, const uint8_t* valid_bytes = nullptr); - - Status Finish(std::shared_ptr* out) override; - Status Init(int32_t capacity) override; - - // Increase the capacity of the builder to accommodate at least the indicated - // number of elements - Status Resize(int32_t capacity) override; - - protected: - std::shared_ptr data_; - value_type* raw_data_; -}; - -template -class ARROW_EXPORT NumericBuilder : public PrimitiveBuilder { - public: - using typename PrimitiveBuilder::value_type; - using PrimitiveBuilder::PrimitiveBuilder; - - using PrimitiveBuilder::Append; - using PrimitiveBuilder::Init; - using PrimitiveBuilder::Resize; - using PrimitiveBuilder::Reserve; - - // Scalar append. 
- Status Append(value_type val) { - RETURN_NOT_OK(ArrayBuilder::Reserve(1)); - UnsafeAppend(val); - return Status::OK(); - } - - // Does not capacity-check; make sure to call Reserve beforehand - void UnsafeAppend(value_type val) { - BitUtil::SetBit(null_bitmap_data_, length_); - raw_data_[length_++] = val; - } - - protected: - using PrimitiveBuilder::length_; - using PrimitiveBuilder::null_bitmap_data_; - using PrimitiveBuilder::raw_data_; -}; - -// Builders - -using UInt8Builder = NumericBuilder; -using UInt16Builder = NumericBuilder; -using UInt32Builder = NumericBuilder; -using UInt64Builder = NumericBuilder; - -using Int8Builder = NumericBuilder; -using Int16Builder = NumericBuilder; -using Int32Builder = NumericBuilder; -using Int64Builder = NumericBuilder; -using TimestampBuilder = NumericBuilder; - -using HalfFloatBuilder = NumericBuilder; -using FloatBuilder = NumericBuilder; -using DoubleBuilder = NumericBuilder; - -class ARROW_EXPORT BooleanArray : public PrimitiveArray { - public: - using TypeClass = BooleanType; - - BooleanArray(int32_t length, const std::shared_ptr& data, - int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); - BooleanArray(const TypePtr& type, int32_t length, const std::shared_ptr& data, - int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); - - bool EqualsExact(const BooleanArray& other) const; - bool Equals(const ArrayPtr& arr) const override; - bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const ArrayPtr& arr) const override; - - Status Accept(ArrayVisitor* visitor) const override; - - const uint8_t* raw_data() const { return reinterpret_cast(raw_data_); } - - bool Value(int i) const { return BitUtil::GetBit(raw_data(), i); } -}; - -class ARROW_EXPORT BooleanBuilder : public ArrayBuilder { - public: - explicit BooleanBuilder(MemoryPool* pool, const TypePtr& type) - : ArrayBuilder(pool, type), data_(nullptr) {} - - virtual ~BooleanBuilder() {} - - using ArrayBuilder::Advance; - - // Write nulls as uint8_t* (0 value indicates null) into pre-allocated memory - Status AppendNulls(const uint8_t* valid_bytes, int32_t length) { - RETURN_NOT_OK(Reserve(length)); - UnsafeAppendToBitmap(valid_bytes, length); - return Status::OK(); - } - - Status AppendNull() { - RETURN_NOT_OK(Reserve(1)); - UnsafeAppendToBitmap(false); - return Status::OK(); - } - - std::shared_ptr data() const { return data_; } - - // Scalar append - Status Append(bool val) { - Reserve(1); - BitUtil::SetBit(null_bitmap_data_, length_); - if (val) { - BitUtil::SetBit(raw_data_, length_); - } else { - BitUtil::ClearBit(raw_data_, length_); - } - ++length_; - return Status::OK(); - } - - // Vector append - // - // If passed, valid_bytes is of equal length to values, and any zero byte - // will be considered as a null for that slot - Status Append( - const uint8_t* values, int32_t length, const uint8_t* valid_bytes = nullptr); - - Status Finish(std::shared_ptr* out) override; - Status Init(int32_t capacity) override; - - // Increase the capacity of the builder to accommodate at least the indicated - // number of elements - Status Resize(int32_t capacity) override; - - protected: - std::shared_ptr data_; - uint8_t* raw_data_; -}; - -// gcc and clang disagree about how to handle template visibility when you have -// explicit specializations https://llvm.org/bugs/show_bug.cgi?id=24815 -#if defined(__GNUC__) && !defined(__clang__) -#pragma GCC diagnostic push -#pragma GCC diagnostic ignored "-Wattributes" -#endif - -// Only instantiate 
these templates once -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; - -#if defined(__GNUC__) && !defined(__clang__) -#pragma GCC diagnostic pop -#endif - -} // namespace arrow - -#endif // ARROW_TYPES_PRIMITIVE_H diff --git a/cpp/src/arrow/types/string.cc b/cpp/src/arrow/types/string.cc deleted file mode 100644 index db963dfa0de5f..0000000000000 --- a/cpp/src/arrow/types/string.cc +++ /dev/null @@ -1,150 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#include "arrow/types/string.h" - -#include -#include -#include - -#include "arrow/type.h" - -namespace arrow { - -static std::shared_ptr kBinary = std::make_shared(); -static std::shared_ptr kString = std::make_shared(); - -BinaryArray::BinaryArray(int32_t length, const std::shared_ptr& offsets, - const std::shared_ptr& data, int32_t null_count, - const std::shared_ptr& null_bitmap) - : BinaryArray(kBinary, length, offsets, data, null_count, null_bitmap) {} - -BinaryArray::BinaryArray(const TypePtr& type, int32_t length, - const std::shared_ptr& offsets, const std::shared_ptr& data, - int32_t null_count, const std::shared_ptr& null_bitmap) - : Array(type, length, null_count, null_bitmap), - offset_buffer_(offsets), - offsets_(reinterpret_cast(offset_buffer_->data())), - data_buffer_(data), - data_(nullptr) { - if (data_buffer_ != nullptr) { data_ = data_buffer_->data(); } -} - -Status BinaryArray::Validate() const { - // TODO(wesm): what to do here? 
- return Status::OK(); -} - -bool BinaryArray::EqualsExact(const BinaryArray& other) const { - if (!Array::EqualsExact(other)) { return false; } - - bool equal_offsets = - offset_buffer_->Equals(*other.offset_buffer_, (length_ + 1) * sizeof(int32_t)); - if (!equal_offsets) { return false; } - - if (!data_buffer_ && !(other.data_buffer_)) { return true; } - - return data_buffer_->Equals(*other.data_buffer_, data_buffer_->size()); -} - -bool BinaryArray::Equals(const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (this->type_enum() != arr->type_enum()) { return false; } - return EqualsExact(*static_cast(arr.get())); -} - -bool BinaryArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (!arr) { return false; } - if (this->type_enum() != arr->type_enum()) { return false; } - const auto other = static_cast(arr.get()); - for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { - const bool is_null = IsNull(i); - if (is_null != arr->IsNull(o_i)) { return false; } - if (is_null) continue; - const int32_t begin_offset = offset(i); - const int32_t end_offset = offset(i + 1); - const int32_t other_begin_offset = other->offset(o_i); - const int32_t other_end_offset = other->offset(o_i + 1); - // Underlying can't be equal if the size isn't equal - if (end_offset - begin_offset != other_end_offset - other_begin_offset) { - return false; - } - - if (std::memcmp(data_ + begin_offset, other->data_ + other_begin_offset, - end_offset - begin_offset)) { - return false; - } - } - return true; -} - -Status BinaryArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} - -StringArray::StringArray(int32_t length, const std::shared_ptr& offsets, - const std::shared_ptr& data, int32_t null_count, - const std::shared_ptr& null_bitmap) - : BinaryArray(kString, length, offsets, data, null_count, null_bitmap) {} - -Status StringArray::Validate() const { - // TODO(emkornfield) Validate proper UTF8 code points? - return BinaryArray::Validate(); -} - -Status StringArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} - -// This used to be a static member variable of BinaryBuilder, but it can cause -// valgrind to report a (spurious?) memory leak when needed in other shared -// libraries. 
The problem came up while adding explicit visibility to libarrow -// and libparquet_arrow -static TypePtr kBinaryValueType = TypePtr(new UInt8Type()); - -BinaryBuilder::BinaryBuilder(MemoryPool* pool, const TypePtr& type) - : ListBuilder(pool, std::make_shared(pool, kBinaryValueType), type) { - byte_builder_ = static_cast(value_builder_.get()); -} - -Status BinaryBuilder::Finish(std::shared_ptr* out) { - std::shared_ptr result; - RETURN_NOT_OK(ListBuilder::Finish(&result)); - - const auto list = std::dynamic_pointer_cast(result); - auto values = std::dynamic_pointer_cast(list->values()); - - *out = std::make_shared(list->length(), list->offsets(), values->data(), - list->null_count(), list->null_bitmap()); - return Status::OK(); -} - -Status StringBuilder::Finish(std::shared_ptr* out) { - std::shared_ptr result; - RETURN_NOT_OK(ListBuilder::Finish(&result)); - - const auto list = std::dynamic_pointer_cast(result); - auto values = std::dynamic_pointer_cast(list->values()); - - *out = std::make_shared(list->length(), list->offsets(), values->data(), - list->null_count(), list->null_bitmap()); - return Status::OK(); -} - -} // namespace arrow diff --git a/cpp/src/arrow/types/string.h b/cpp/src/arrow/types/string.h deleted file mode 100644 index c8752439f168c..0000000000000 --- a/cpp/src/arrow/types/string.h +++ /dev/null @@ -1,149 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#ifndef ARROW_TYPES_STRING_H -#define ARROW_TYPES_STRING_H - -#include -#include -#include -#include - -#include "arrow/array.h" -#include "arrow/type.h" -#include "arrow/types/list.h" -#include "arrow/types/primitive.h" -#include "arrow/util/status.h" -#include "arrow/util/visibility.h" - -namespace arrow { - -class Buffer; -class MemoryPool; - -class ARROW_EXPORT BinaryArray : public Array { - public: - using TypeClass = BinaryType; - - BinaryArray(int32_t length, const std::shared_ptr& offsets, - const std::shared_ptr& data, int32_t null_count = 0, - const std::shared_ptr& null_bitmap = nullptr); - - // Constructor that allows sub-classes/builders to propagate there logical type up the - // class hierarchy. 
- BinaryArray(const TypePtr& type, int32_t length, const std::shared_ptr& offsets, - const std::shared_ptr& data, int32_t null_count = 0, - const std::shared_ptr& null_bitmap = nullptr); - - // Return the pointer to the given elements bytes - // TODO(emkornfield) introduce a StringPiece or something similar to capture zero-copy - // pointer + offset - const uint8_t* GetValue(int i, int32_t* out_length) const { - DCHECK(out_length); - const int32_t pos = offsets_[i]; - *out_length = offsets_[i + 1] - pos; - return data_ + pos; - } - - std::shared_ptr data() const { return data_buffer_; } - std::shared_ptr offsets() const { return offset_buffer_; } - - const int32_t* raw_offsets() const { return offsets_; } - - int32_t offset(int i) const { return offsets_[i]; } - - // Neither of these functions will perform boundschecking - int32_t value_offset(int i) const { return offsets_[i]; } - int32_t value_length(int i) const { return offsets_[i + 1] - offsets_[i]; } - - bool EqualsExact(const BinaryArray& other) const; - bool Equals(const std::shared_ptr& arr) const override; - bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const ArrayPtr& arr) const override; - - Status Validate() const override; - - Status Accept(ArrayVisitor* visitor) const override; - - private: - std::shared_ptr offset_buffer_; - const int32_t* offsets_; - - std::shared_ptr data_buffer_; - const uint8_t* data_; -}; - -class ARROW_EXPORT StringArray : public BinaryArray { - public: - using TypeClass = StringType; - - StringArray(int32_t length, const std::shared_ptr& offsets, - const std::shared_ptr& data, int32_t null_count = 0, - const std::shared_ptr& null_bitmap = nullptr); - - // Construct a std::string - // TODO: std::bad_alloc possibility - std::string GetString(int i) const { - int32_t nchars; - const uint8_t* str = GetValue(i, &nchars); - return std::string(reinterpret_cast(str), nchars); - } - - Status Validate() const override; - - Status Accept(ArrayVisitor* visitor) const override; -}; - -// BinaryBuilder : public ListBuilder -class ARROW_EXPORT BinaryBuilder : public ListBuilder { - public: - explicit BinaryBuilder(MemoryPool* pool, const TypePtr& type); - virtual ~BinaryBuilder() {} - - Status Append(const uint8_t* value, int32_t length) { - RETURN_NOT_OK(ListBuilder::Append()); - return byte_builder_->Append(value, length); - } - - Status Append(const char* value, int32_t length) { - return Append(reinterpret_cast(value), length); - } - - Status Append(const std::string& value) { return Append(value.c_str(), value.size()); } - - Status Finish(std::shared_ptr* out) override; - - protected: - UInt8Builder* byte_builder_; -}; - -// String builder -class ARROW_EXPORT StringBuilder : public BinaryBuilder { - public: - explicit StringBuilder(MemoryPool* pool, const TypePtr& type) - : BinaryBuilder(pool, type) {} - - using BinaryBuilder::Append; - - Status Finish(std::shared_ptr* out) override; - - Status Append(const std::vector& values, uint8_t* null_bytes); -}; - -} // namespace arrow - -#endif // ARROW_TYPES_STRING_H diff --git a/cpp/src/arrow/types/struct.cc b/cpp/src/arrow/types/struct.cc deleted file mode 100644 index 0e0db23544bf7..0000000000000 --- a/cpp/src/arrow/types/struct.cc +++ /dev/null @@ -1,108 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. 
The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#include "arrow/types/struct.h" - -#include - -namespace arrow { - -bool StructArray::Equals(const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (!arr) { return false; } - if (this->type_enum() != arr->type_enum()) { return false; } - if (null_count_ != arr->null_count()) { return false; } - return RangeEquals(0, length_, 0, arr); -} - -bool StructArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (!arr) { return false; } - if (Type::STRUCT != arr->type_enum()) { return false; } - const auto other = static_cast(arr.get()); - - bool equal_fields = true; - for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { - if (IsNull(i) != arr->IsNull(o_i)) { return false; } - if (IsNull(i)) continue; - for (size_t j = 0; j < field_arrays_.size(); ++j) { - // TODO: really we should be comparing stretches of non-null data rather - // than looking at one value at a time. - equal_fields = field(j)->RangeEquals(i, i + 1, o_i, other->field(j)); - if (!equal_fields) { return false; } - } - } - - return true; -} - -Status StructArray::Validate() const { - if (length_ < 0) { return Status::Invalid("Length was negative"); } - - if (null_count() > length_) { - return Status::Invalid("Null count exceeds the length of this struct"); - } - - if (field_arrays_.size() > 0) { - // Validate fields - int32_t array_length = field_arrays_[0]->length(); - size_t idx = 0; - for (auto it : field_arrays_) { - if (it->length() != array_length) { - std::stringstream ss; - ss << "Length is not equal from field " << it->type()->ToString() - << " at position {" << idx << "}"; - return Status::Invalid(ss.str()); - } - - const Status child_valid = it->Validate(); - if (!child_valid.ok()) { - std::stringstream ss; - ss << "Child array invalid: " << child_valid.ToString() << " at position {" << idx - << "}"; - return Status::Invalid(ss.str()); - } - ++idx; - } - - if (array_length > 0 && array_length != length_) { - return Status::Invalid("Struct's length is not equal to its child arrays"); - } - } - return Status::OK(); -} - -Status StructArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} - -Status StructBuilder::Finish(std::shared_ptr* out) { - std::vector> fields(field_builders_.size()); - for (size_t i = 0; i < field_builders_.size(); ++i) { - RETURN_NOT_OK(field_builders_[i]->Finish(&fields[i])); - } - - *out = std::make_shared(type_, length_, fields, null_count_, null_bitmap_); - - null_bitmap_ = nullptr; - capacity_ = length_ = null_count_ = 0; - - return Status::OK(); -} - -} // namespace arrow diff --git a/cpp/src/arrow/types/struct.h b/cpp/src/arrow/types/struct.h deleted file mode 100644 index 1e2bf2d9a1223..0000000000000 --- a/cpp/src/arrow/types/struct.h +++ /dev/null @@ -1,116 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under 
one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#ifndef ARROW_TYPES_STRUCT_H -#define ARROW_TYPES_STRUCT_H - -#include -#include -#include - -#include "arrow/type.h" -#include "arrow/types/list.h" -#include "arrow/types/primitive.h" -#include "arrow/util/visibility.h" - -namespace arrow { - -class ARROW_EXPORT StructArray : public Array { - public: - using TypeClass = StructType; - - StructArray(const TypePtr& type, int32_t length, std::vector& field_arrays, - int32_t null_count = 0, std::shared_ptr null_bitmap = nullptr) - : Array(type, length, null_count, null_bitmap) { - type_ = type; - field_arrays_ = field_arrays; - } - - Status Validate() const override; - - virtual ~StructArray() {} - - // Return a shared pointer in case the requestor desires to share ownership - // with this array. - std::shared_ptr field(int32_t pos) const { - DCHECK_GT(field_arrays_.size(), 0); - return field_arrays_[pos]; - } - const std::vector& fields() const { return field_arrays_; } - - bool EqualsExact(const StructArray& other) const; - bool Equals(const std::shared_ptr& arr) const override; - bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const std::shared_ptr& arr) const override; - - Status Accept(ArrayVisitor* visitor) const override; - - protected: - // The child arrays corresponding to each field of the struct data type. - std::vector field_arrays_; -}; - -// --------------------------------------------------------------------------------- -// StructArray builder -// Append, Resize and Reserve methods are acting on StructBuilder. -// Please make sure all these methods of all child-builders' are consistently -// called to maintain data-structure consistency. -class ARROW_EXPORT StructBuilder : public ArrayBuilder { - public: - StructBuilder(MemoryPool* pool, const std::shared_ptr& type, - const std::vector>& field_builders) - : ArrayBuilder(pool, type) { - field_builders_ = field_builders; - } - - Status Finish(std::shared_ptr* out) override; - - // Null bitmap is of equal length to every child field, and any zero byte - // will be considered as a null for that field, but users must using app- - // end methods or advance methods of the child builders' independently to - // insert data. - Status Append(int32_t length, const uint8_t* valid_bytes) { - RETURN_NOT_OK(Reserve(length)); - UnsafeAppendToBitmap(valid_bytes, length); - return Status::OK(); - } - - // Append an element to the Struct. All child-builders' Append method must - // be called independently to maintain data-structure consistency. 
- Status Append(bool is_valid = true) { - RETURN_NOT_OK(Reserve(1)); - UnsafeAppendToBitmap(is_valid); - return Status::OK(); - } - - Status AppendNull() { return Append(false); } - - std::shared_ptr field_builder(int pos) const { - DCHECK_GT(field_builders_.size(), 0); - return field_builders_[pos]; - } - const std::vector>& field_builders() const { - return field_builders_; - } - - protected: - std::vector> field_builders_; -}; - -} // namespace arrow - -#endif // ARROW_TYPES_STRUCT_H diff --git a/cpp/src/arrow/types/test-common.h b/cpp/src/arrow/types/test-common.h deleted file mode 100644 index 6e6ab85ad4eb7..0000000000000 --- a/cpp/src/arrow/types/test-common.h +++ /dev/null @@ -1,70 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#ifndef ARROW_TYPES_TEST_COMMON_H -#define ARROW_TYPES_TEST_COMMON_H - -#include -#include -#include - -#include "gtest/gtest.h" - -#include "arrow/array.h" -#include "arrow/builder.h" -#include "arrow/test-util.h" -#include "arrow/type.h" -#include "arrow/util/memory-pool.h" - -namespace arrow { - -using std::unique_ptr; - -class TestBuilder : public ::testing::Test { - public: - void SetUp() { - pool_ = default_memory_pool(); - type_ = TypePtr(new UInt8Type()); - builder_.reset(new UInt8Builder(pool_, type_)); - builder_nn_.reset(new UInt8Builder(pool_, type_)); - } - - protected: - MemoryPool* pool_; - - TypePtr type_; - unique_ptr builder_; - unique_ptr builder_nn_; -}; - -template -Status MakeArray(const std::vector& valid_bytes, const std::vector& values, - int size, Builder* builder, ArrayPtr* out) { - // Append the first 1000 - for (int i = 0; i < size; ++i) { - if (valid_bytes[i] > 0) { - RETURN_NOT_OK(builder->Append(values[i])); - } else { - RETURN_NOT_OK(builder->AppendNull()); - } - } - return builder->Finish(out); -} - -} // namespace arrow - -#endif // ARROW_TYPES_TEST_COMMON_H diff --git a/cpp/src/arrow/types/union.cc b/cpp/src/arrow/types/union.cc deleted file mode 100644 index cc2934b2e4adb..0000000000000 --- a/cpp/src/arrow/types/union.cc +++ /dev/null @@ -1,27 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. 
See the License for the -// specific language governing permissions and limitations -// under the License. - -#include "arrow/types/union.h" - -#include -#include -#include -#include - -#include "arrow/type.h" - -namespace arrow {} // namespace arrow diff --git a/cpp/src/arrow/types/union.h b/cpp/src/arrow/types/union.h deleted file mode 100644 index 44f39cc69942b..0000000000000 --- a/cpp/src/arrow/types/union.h +++ /dev/null @@ -1,48 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#ifndef ARROW_TYPES_UNION_H -#define ARROW_TYPES_UNION_H - -#include -#include -#include - -#include "arrow/array.h" -#include "arrow/type.h" - -namespace arrow { - -class Buffer; - -class UnionArray : public Array { - protected: - // The data are types encoded as int16 - Buffer* types_; - std::vector> children_; -}; - -class DenseUnionArray : public UnionArray { - protected: - Buffer* offset_buf_; -}; - -class SparseUnionArray : public UnionArray {}; - -} // namespace arrow - -#endif // ARROW_TYPES_UNION_H diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt index 6e19730219553..8d9afccf867df 100644 --- a/cpp/src/arrow/util/CMakeLists.txt +++ b/cpp/src/arrow/util/CMakeLists.txt @@ -22,12 +22,9 @@ # Headers: top level install(FILES bit-util.h - buffer.h logging.h macros.h - memory-pool.h random.h - status.h visibility.h DESTINATION include/arrow/util) @@ -72,6 +69,3 @@ if (ARROW_BUILD_BENCHMARKS) endif() ADD_ARROW_TEST(bit-util-test) -ADD_ARROW_TEST(buffer-test) -ADD_ARROW_TEST(memory-pool-test) -ADD_ARROW_TEST(status-test) diff --git a/cpp/src/arrow/util/bit-util.cc b/cpp/src/arrow/util/bit-util.cc index 7e1cb1867171e..9c82407ecc092 100644 --- a/cpp/src/arrow/util/bit-util.cc +++ b/cpp/src/arrow/util/bit-util.cc @@ -18,9 +18,9 @@ #include #include +#include "arrow/buffer.h" +#include "arrow/status.h" #include "arrow/util/bit-util.h" -#include "arrow/util/buffer.h" -#include "arrow/util/status.h" namespace arrow { diff --git a/python/src/pyarrow/adapters/builtin.cc b/python/src/pyarrow/adapters/builtin.cc index c034fbd977747..ac2f533c408c7 100644 --- a/python/src/pyarrow/adapters/builtin.cc +++ b/python/src/pyarrow/adapters/builtin.cc @@ -21,7 +21,7 @@ #include "pyarrow/adapters/builtin.h" #include "arrow/api.h" -#include "arrow/util/status.h" +#include "arrow/status.h" #include "pyarrow/helpers.h" diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index adb27e83ef120..64b708695194a 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -31,7 +31,7 @@ #include "arrow/api.h" #include "arrow/util/bit-util.h" -#include "arrow/util/status.h" +#include "arrow/status.h" #include "pyarrow/common.h" #include "pyarrow/config.h" diff --git 
a/python/src/pyarrow/common.cc b/python/src/pyarrow/common.cc index fa875f2b9aba1..fb4d3496ac79f 100644 --- a/python/src/pyarrow/common.cc +++ b/python/src/pyarrow/common.cc @@ -21,8 +21,8 @@ #include #include -#include "arrow/util/memory-pool.h" -#include "arrow/util/status.h" +#include "arrow/memory_pool.h" +#include "arrow/status.h" using arrow::Status; diff --git a/python/src/pyarrow/common.h b/python/src/pyarrow/common.h index 7f3131ef03dd8..7e3382634a781 100644 --- a/python/src/pyarrow/common.h +++ b/python/src/pyarrow/common.h @@ -19,10 +19,11 @@ #define PYARROW_COMMON_H #include "pyarrow/config.h" -#include "arrow/util/buffer.h" -#include "arrow/util/macros.h" #include "pyarrow/visibility.h" +#include "arrow/buffer.h" +#include "arrow/util/macros.h" + namespace arrow { class MemoryPool; } namespace pyarrow { diff --git a/python/src/pyarrow/io.cc b/python/src/pyarrow/io.cc index e6dbc12d429b0..12f5ba0bf2b49 100644 --- a/python/src/pyarrow/io.cc +++ b/python/src/pyarrow/io.cc @@ -21,8 +21,8 @@ #include #include "arrow/io/memory.h" -#include "arrow/util/memory-pool.h" -#include "arrow/util/status.h" +#include "arrow/memory_pool.h" +#include "arrow/status.h" #include "pyarrow/common.h" From 7e93075cd48c5f6b1b75f9adc43ba53c831046e7 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 13 Dec 2016 06:50:25 +0100 Subject: [PATCH 0229/1644] ARROW-405: Use vendored hdfs.h if not found in include/ in $HADOOP_HOME Not all Hadoop distributions have their files arranged in the same way. Author: Wes McKinney Closes #237 from wesm/ARROW-405 and squashes the following commits: 3a266d3 [Wes McKinney] Use vendored hdfs.h if not found in include/ in $HADOOP_HOME --- cpp/src/arrow/io/CMakeLists.txt | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/cpp/src/arrow/io/CMakeLists.txt b/cpp/src/arrow/io/CMakeLists.txt index a1892a9294a78..f285180c5142a 100644 --- a/cpp/src/arrow/io/CMakeLists.txt +++ b/cpp/src/arrow/io/CMakeLists.txt @@ -56,6 +56,10 @@ if(ARROW_HDFS) if (DEFINED ENV{HADOOP_HOME}) set(HADOOP_HOME $ENV{HADOOP_HOME}) + if (NOT EXISTS "${HADOOP_HOME}/include/hdfs.h") + message(STATUS "Did not find hdfs.h in expected location, using vendored one") + set(HADOOP_HOME "${THIRDPARTY_DIR}/hadoop") + endif() else() set(HADOOP_HOME "${THIRDPARTY_DIR}/hadoop") endif() From 935279091f371716adcf18f6437244f040f98da8 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Wed, 14 Dec 2016 15:41:44 -0500 Subject: [PATCH 0230/1644] ARROW-422: IPC should depend on rapidjson_ep if RapidJSON is vendored Author: Uwe L. Korn Closes #239 from xhochy/ARROW-422 and squashes the following commits: 1545012 [Uwe L. 
Korn] ARROW-422: IPC should depend on rapidjson_ep if RapidJSON is vendored --- cpp/CMakeLists.txt | 2 ++ cpp/src/arrow/ipc/CMakeLists.txt | 3 +++ 2 files changed, 5 insertions(+) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index adcca0e0b49e8..d288ffb5f7a81 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -609,8 +609,10 @@ if("$ENV{RAPIDJSON_HOME}" STREQUAL "") ExternalProject_Get_Property(rapidjson_ep SOURCE_DIR) set(RAPIDJSON_INCLUDE_DIR "${SOURCE_DIR}/include") + set(RAPIDJSON_VENDORED 1) else() set(RAPIDJSON_INCLUDE_DIR "$ENV{RAPIDJSON_HOME}/include") + set(RAPIDJSON_VENDORED 0) endif() message(STATUS "RapidJSON include dir: ${RAPIDJSON_INCLUDE_DIR}") include_directories(SYSTEM ${RAPIDJSON_INCLUDE_DIR}) diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index 6f401dba2495f..b1669c5f7c239 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -44,6 +44,9 @@ set(ARROW_IPC_SRCS add_library(arrow_ipc SHARED ${ARROW_IPC_SRCS} ) +if(RAPIDJSON_VERDORED) + add_dependencies(arrow_ipc rapidjson_ep) +endif() if(FLATBUFFERS_VENDORED) add_dependencies(arrow_ipc flatbuffers_ep) endif() From 063c190a5252d8f77a37ebf80efcc68b70ffacab Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Thu, 15 Dec 2016 13:15:08 -0500 Subject: [PATCH 0231/1644] ARROW-423: Define BUILD_BYPRODUCTS for CMake 3.2+ Author: Uwe L. Korn Closes #240 from xhochy/ARROW-423 and squashes the following commits: 4c99ba2 [Uwe L. Korn] ARROW-423: Define BUILD_BYPRODUCTS for CMake 3.2+ --- cpp/CMakeLists.txt | 91 ++++++++++++++++++++++++------------ cpp/src/arrow/CMakeLists.txt | 2 - 2 files changed, 61 insertions(+), 32 deletions(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index d288ffb5f7a81..315995ce7cb97 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -488,19 +488,32 @@ if(ARROW_BUILD_TESTS) set(GTEST_CMAKE_CXX_FLAGS "-fPIC") endif() - ExternalProject_Add(googletest_ep - URL "https://github.com/google/googletest/archive/release-${GTEST_VERSION}.tar.gz" - CMAKE_ARGS -DCMAKE_CXX_FLAGS=${GTEST_CMAKE_CXX_FLAGS} -Dgtest_force_shared_crt=ON - # googletest doesn't define install rules, so just build in the - # source dir and don't try to install. See its README for - # details. - BUILD_IN_SOURCE 1 - INSTALL_COMMAND "") - set(GTEST_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/googletest_ep-prefix/src/googletest_ep") set(GTEST_INCLUDE_DIR "${GTEST_PREFIX}/include") set(GTEST_STATIC_LIB "${GTEST_PREFIX}/${CMAKE_CFG_INTDIR}/${CMAKE_STATIC_LIBRARY_PREFIX}gtest${CMAKE_STATIC_LIBRARY_SUFFIX}") set(GTEST_VENDORED 1) + + if (CMAKE_VERSION VERSION_GREATER "3.2") + # BUILD_BYPRODUCTS is a 3.2+ feature + ExternalProject_Add(googletest_ep + URL "https://github.com/google/googletest/archive/release-${GTEST_VERSION}.tar.gz" + CMAKE_ARGS -DCMAKE_CXX_FLAGS=${GTEST_CMAKE_CXX_FLAGS} -Dgtest_force_shared_crt=ON + # googletest doesn't define install rules, so just build in the + # source dir and don't try to install. See its README for + # details. + BUILD_IN_SOURCE 1 + BUILD_BYPRODUCTS "${GTEST_STATIC_LIB}" + INSTALL_COMMAND "") + else() + ExternalProject_Add(googletest_ep + URL "https://github.com/google/googletest/archive/release-${GTEST_VERSION}.tar.gz" + CMAKE_ARGS -DCMAKE_CXX_FLAGS=${GTEST_CMAKE_CXX_FLAGS} -Dgtest_force_shared_crt=ON + # googletest doesn't define install rules, so just build in the + # source dir and don't try to install. See its README for + # details. 
+ BUILD_IN_SOURCE 1 + INSTALL_COMMAND "") + endif() else() find_package(GTest REQUIRED) set(GTEST_VENDORED 0) @@ -525,24 +538,34 @@ if(ARROW_BUILD_TESTS) endif() set(GFLAGS_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/gflags_ep-prefix/src/gflags_ep") - ExternalProject_Add(gflags_ep - GIT_REPOSITORY https://github.com/gflags/gflags.git - GIT_TAG cce68f0c9c5d054017425e6e6fd54f696d36e8ee - BUILD_IN_SOURCE 1 - CMAKE_ARGS -DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} - -DCMAKE_INSTALL_PREFIX=${GFLAGS_PREFIX} - -DBUILD_SHARED_LIBS=OFF - -DBUILD_STATIC_LIBS=ON - -DBUILD_PACKAGING=OFF - -DBUILD_TESTING=OFF - -BUILD_CONFIG_TESTS=OFF - -DINSTALL_HEADERS=ON - -DCMAKE_CXX_FLAGS=${GFLAGS_CMAKE_CXX_FLAGS}) - set(GFLAGS_HOME "${GFLAGS_PREFIX}") set(GFLAGS_INCLUDE_DIR "${GFLAGS_PREFIX}/include") set(GFLAGS_STATIC_LIB "${GFLAGS_PREFIX}/lib/libgflags.a") set(GFLAGS_VENDORED 1) + set(GFLAGS_CMAKE_ARGS -DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} + -DCMAKE_INSTALL_PREFIX=${GFLAGS_PREFIX} + -DBUILD_SHARED_LIBS=OFF + -DBUILD_STATIC_LIBS=ON + -DBUILD_PACKAGING=OFF + -DBUILD_TESTING=OFF + -BUILD_CONFIG_TESTS=OFF + -DINSTALL_HEADERS=ON + -DCMAKE_CXX_FLAGS=${GFLAGS_CMAKE_CXX_FLAGS}) + if (CMAKE_VERSION VERSION_GREATER "3.2") + # BUILD_BYPRODUCTS is a 3.2+ feature + ExternalProject_Add(gflags_ep + GIT_REPOSITORY https://github.com/gflags/gflags.git + GIT_TAG cce68f0c9c5d054017425e6e6fd54f696d36e8ee + BUILD_IN_SOURCE 1 + BUILD_BYPRODUCTS "${GFLAGS_STATIC_LIB}" + CMAKE_ARGS ${GFLAGS_CMAKE_ARGS}) + else() + ExternalProject_Add(gflags_ep + GIT_REPOSITORY https://github.com/gflags/gflags.git + GIT_TAG cce68f0c9c5d054017425e6e6fd54f696d36e8ee + BUILD_IN_SOURCE 1 + CMAKE_ARGS ${GFLAGS_CMAKE_ARGS}) + endif() else() set(GFLAGS_VENDORED 0) find_package(GFlags REQUIRED) @@ -570,16 +593,24 @@ if(ARROW_BUILD_BENCHMARKS) endif() set(GBENCHMARK_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/gbenchmark_ep/src/gbenchmark_ep-install") - ExternalProject_Add(gbenchmark_ep - URL "https://github.com/google/benchmark/archive/v${GBENCHMARK_VERSION}.tar.gz" - CMAKE_ARGS - "-DCMAKE_BUILD_TYPE=Release" - "-DCMAKE_INSTALL_PREFIX:PATH=${GBENCHMARK_PREFIX}" - "-DCMAKE_CXX_FLAGS=-fPIC ${GBENCHMARK_CMAKE_CXX_FLAGS}") - set(GBENCHMARK_INCLUDE_DIR "${GBENCHMARK_PREFIX}/include") set(GBENCHMARK_STATIC_LIB "${GBENCHMARK_PREFIX}/lib/${CMAKE_STATIC_LIBRARY_PREFIX}benchmark${CMAKE_STATIC_LIBRARY_SUFFIX}") set(GBENCHMARK_VENDORED 1) + set(GBENCHMARK_CMAKE_ARGS + "-DCMAKE_BUILD_TYPE=Release" + "-DCMAKE_INSTALL_PREFIX:PATH=${GBENCHMARK_PREFIX}" + "-DCMAKE_CXX_FLAGS=-fPIC ${GBENCHMARK_CMAKE_CXX_FLAGS}") + if (CMAKE_VERSION VERSION_GREATER "3.2") + # BUILD_BYPRODUCTS is a 3.2+ feature + ExternalProject_Add(gbenchmark_ep + URL "https://github.com/google/benchmark/archive/v${GBENCHMARK_VERSION}.tar.gz" + BUILD_BYPRODUCTS "${GBENCHMARK_STATIC_LIB}" + CMAKE_ARGS ${GBENCHMARK_CMAKE_ARGS}) + else() + ExternalProject_Add(gbenchmark_ep + URL "https://github.com/google/benchmark/archive/v${GBENCHMARK_VERSION}.tar.gz" + CMAKE_ARGS ${GBENCHMARK_CMAKE_ARGS}) + endif() else() find_package(GBenchmark REQUIRED) set(GBENCHMARK_VENDORED 0) diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index 7d7bc29f4abd8..b8500ab264f80 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -37,8 +37,6 @@ install(FILES # Unit tests ####################################### -set(ARROW_TEST_LINK_LIBS ${ARROW_MIN_TEST_LIBS}) - ADD_ARROW_TEST(array-test) ADD_ARROW_TEST(array-decimal-test) ADD_ARROW_TEST(array-list-test) From cfb544de2efb260bc0737460e056a0d2a5295e6a Mon Sep 17 
00:00:00 2001 From: "Uwe L. Korn" Date: Thu, 15 Dec 2016 13:16:01 -0500 Subject: [PATCH 0232/1644] ARROW-425: Add private API to get python Table from a C++ object Author: Uwe L. Korn Closes #241 from xhochy/pyarrow-private-api and squashes the following commits: dc9b814 [Uwe L. Korn] ARROW-425: Add private API to get python Table from a C++ object --- python/pyarrow/table.pxd | 2 ++ python/pyarrow/table.pyx | 4 ++++ python/setup.py | 11 +++++++++++ 3 files changed, 17 insertions(+) diff --git a/python/pyarrow/table.pxd b/python/pyarrow/table.pxd index 79c9ae3b0a194..df3687ddf9761 100644 --- a/python/pyarrow/table.pxd +++ b/python/pyarrow/table.pxd @@ -57,3 +57,5 @@ cdef class RecordBatch: cdef init(self, const shared_ptr[CRecordBatch]& table) cdef _check_nullptr(self) + +cdef api object table_from_ctable(const shared_ptr[CTable]& ctable) diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index 0a9805cfdf427..333686f810ea8 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -687,5 +687,9 @@ cdef class Table: return (self.num_rows, self.num_columns) +cdef api object table_from_ctable(const shared_ptr[CTable]& ctable): + cdef Table table = Table() + table.init(ctable) + return table from_pandas_dataframe = Table.from_pandas diff --git a/python/setup.py b/python/setup.py index 0f6bbda6ec3aa..5acdca34a0882 100644 --- a/python/setup.py +++ b/python/setup.py @@ -204,6 +204,10 @@ def _run_cmake(self): shutil.move(self.get_ext_built(name), ext_path) self._found_names.append(name) + if os.path.exists(self.get_ext_built_api_header(name)): + shutil.move(self.get_ext_built_api_header(name), + pjoin(os.path.dirname(ext_path), name + '_api.h')) + os.chdir(saved_cwd) def _failure_permitted(self, name): @@ -225,6 +229,13 @@ def _get_cmake_ext_path(self, name): filename = name + suffix return pjoin(package_dir, filename) + def get_ext_built_api_header(self, name): + if sys.platform == 'win32': + head, tail = os.path.split(name) + return pjoin(head, tail + "_api.h") + else: + return pjoin(name + "_api.h") + def get_ext_built(self, name): if sys.platform == 'win32': head, tail = os.path.split(name) From a2ead2f646baad78de01fcb1b90f710fa1eae70b Mon Sep 17 00:00:00 2001 From: Mohamed Zenadi Date: Sat, 17 Dec 2016 18:05:04 -0500 Subject: [PATCH 0233/1644] ARROW-380: [Java] optimize null count when serializing vectors I added `getNullCount()` to the `Accessor` interface. I don't know if this is the best way to achieve this. Hence, we'll have both ValueCount and NullCount immediately accessible from the accessor. 
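
The description above leaves the counting strategy implicit; the BitVector change below counts set bits byte by byte with Integer.bitCount plus a sign fix-up and a tail correction. As a rough C++ analogue of the same idea (my sketch, using the GCC/Clang popcount builtin, not code from this patch) for an Arrow-style validity bitmap where a set bit marks a valid slot:

    #include <cstdint>

    // Count nulls in a validity bitmap where bit i == 1 means "slot i is valid".
    // Bits past value_count in the trailing byte are padding and must be
    // ignored, mirroring the remainder handling in the BitVector change below.
    int64_t CountNulls(const uint8_t* validity, int64_t value_count) {
      int64_t set_bits = 0;
      const int64_t full_bytes = value_count / 8;
      for (int64_t i = 0; i < full_bytes; ++i) {
        set_bits += __builtin_popcount(validity[i]);
      }
      const int64_t remainder = value_count % 8;
      if (remainder != 0) {
        // Mask away the padding bits in the trailing partial byte.
        const uint8_t mask = static_cast<uint8_t>((1u << remainder) - 1);
        set_bits += __builtin_popcount(validity[full_bytes] & mask);
      }
      return value_count - set_bits;
    }
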
Author: Mohamed Zenadi Closes #207 from zeapo/ARROW-380 and squashes the following commits: 27c0342 [Mohamed Zenadi] implement missing getNullCount implementation for NullableMapVector 9ff3355 [Mohamed Zenadi] implement the base case of getNullCount() ad3f24a [Mohamed Zenadi] the used size is not the same as the allocated size e858432 [Mohamed Zenadi] use the valueCount as basis for counting nulls rather than allocated bytes 0530c85 [Mohamed Zenadi] test the null count byte by byte and the odd length case 95667d3 [Mohamed Zenadi] fix the comment b12a2a5 [Mohamed Zenadi] fix wrong value returned by the method f264250 [Mohamed Zenadi] use getNullCount() rather than isNull baca69c [Mohamed Zenadi] Add methods to count the number null values in the vector --- .../apache/arrow/vector/BaseValueVector.java | 12 +++++ .../org/apache/arrow/vector/BitVector.java | 22 ++++++++++ .../org/apache/arrow/vector/ValueVector.java | 5 +++ .../apache/arrow/vector/VectorUnloader.java | 10 +---- .../org/apache/arrow/vector/ZeroVector.java | 5 +++ .../arrow/vector/complex/ListVector.java | 5 +++ .../vector/complex/NullableMapVector.java | 5 +++ .../apache/arrow/vector/TestValueVector.java | 44 +++++++++++++++++++ 8 files changed, 99 insertions(+), 9 deletions(-) diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseValueVector.java index 884cdf0910b8e..2a61403c0dcbe 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BaseValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseValueVector.java @@ -72,6 +72,18 @@ protected BaseAccessor() { } public boolean isNull(int index) { return false; } + + @Override + // override this in case your implementation is faster, see BitVector + public int getNullCount() { + int nullCount = 0; + for (int i = 0; i < getValueCount(); i++) { + if (isNull(i)) { + nullCount ++; + } + } + return nullCount; + } } public abstract static class BaseMutator implements ValueVector.Mutator { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java index 26eeafd51d900..9beabcbe46bcc 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java @@ -379,6 +379,28 @@ public final void get(int index, NullableBitHolder holder) { holder.isSet = 1; holder.value = get(index); } + + /** + * Get the number nulls, this correspond to the number of bits set to 0 in the vector + * @return the number of bits set to 0 + */ + @Override + public final int getNullCount() { + int count = 0; + int sizeInBytes = getSizeFromCount(valueCount); + + for (int i = 0; i < sizeInBytes; ++i) { + byte byteValue = data.getByte(i); + // Java uses two's complement binary representation, hence 11111111_b which is -1 when converted to Int + // will have 32bits set to 1. Masking the MSB and then adding it back solves the issue. + count += Integer.bitCount(byteValue & 0x7F) - (byteValue >> 7); + } + int nullCount = (sizeInBytes * 8) - count; + // if the valueCount is not a multiple of 8, the bits on the right were counted as null bits + int remainder = valueCount % 8; + nullCount -= remainder == 0 ? 
0 : 8 - remainder; + return nullCount; + } } /** diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java index 5b24a41850d75..ff7b94c34d80d 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java @@ -180,6 +180,11 @@ interface Accessor { * Returns true if the value at the given index is null, false otherwise. */ boolean isNull(int index); + + /** + * Returns the number of null values + */ + int getNullCount(); } /** diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java b/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java index e2462180ffadc..92d8cb045ae31 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java @@ -60,15 +60,7 @@ public ArrowRecordBatch getRecordBatch() { private void appendNodes(FieldVector vector, List nodes, List buffers) { Accessor accessor = vector.getAccessor(); - int nullCount = 0; - // TODO: should not have to do that - // we can do that a lot more efficiently (for example with Long.bitCount(i)) - for (int i = 0; i < accessor.getValueCount(); i++) { - if (accessor.isNull(i)) { - nullCount ++; - } - } - nodes.add(new ArrowFieldNode(accessor.getValueCount(), nullCount)); + nodes.add(new ArrowFieldNode(accessor.getValueCount(), accessor.getNullCount())); List fieldBuffers = vector.getFieldBuffers(); List expectedBuffers = vector.getField().getTypeLayout().getVectorTypes(); if (fieldBuffers.size() != expectedBuffers.size()) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java b/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java index c2482adefecfb..e163b4fa9398f 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java @@ -69,6 +69,11 @@ public int getValueCount() { public boolean isNull(int index) { return true; } + + @Override + public int getNullCount() { + return 0; + } }; private final Mutator defaultMutator = new Mutator() { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java index 461bdbcda1b52..074b0aa7e58fa 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java @@ -310,6 +310,11 @@ public Object getObject(int index) { public boolean isNull(int index) { return bits.getAccessor().get(index) == 0; } + + @Override + public int getNullCount() { + return bits.getAccessor().getNullCount(); + } } public class Mutator extends BaseRepeatedMutator { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java index f0ddf2727e9ea..5fa35307ab683 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java @@ -203,6 +203,11 @@ public void get(int index, ComplexHolder holder) { super.get(index, holder); } + @Override + public int getNullCount() { + return bits.getAccessor().getNullCount(); + } + @Override public boolean isNull(int index) { return 
isSet(index) == 0; diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java index 124452e96ee42..b33919b2790fc 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java @@ -288,6 +288,7 @@ public void testBitVector() { try (final BitVector vector = new BitVector(EMPTY_SCHEMA_PATH, allocator)) { final BitVector.Mutator m = vector.getMutator(); vector.allocateNew(1024); + m.setValueCount(1024); // Put and set a few values m.set(0, 1); @@ -295,12 +296,16 @@ public void testBitVector() { m.set(100, 0); m.set(1022, 1); + m.setValueCount(1024); + final BitVector.Accessor accessor = vector.getAccessor(); assertEquals(1, accessor.get(0)); assertEquals(0, accessor.get(1)); assertEquals(0, accessor.get(100)); assertEquals(1, accessor.get(1022)); + assertEquals(1022, accessor.getNullCount()); + // test setting the same value twice m.set(0, 1); m.set(0, 1); @@ -315,8 +320,47 @@ public void testBitVector() { assertEquals(0, accessor.get(0)); assertEquals(1, accessor.get(1)); + // should not change + assertEquals(1022, accessor.getNullCount()); + // Ensure unallocated space returns 0 assertEquals(0, accessor.get(3)); + + // unset the previously set bits + m.set(1, 0); + m.set(1022, 0); + // this should set all the array to 0 + assertEquals(1024, accessor.getNullCount()); + + // set all the array to 1 + for (int i = 0; i < 1024; ++i) { + assertEquals(1024 - i, accessor.getNullCount()); + m.set(i, 1); + } + + assertEquals(0, accessor.getNullCount()); + + vector.allocateNew(1015); + m.setValueCount(1015); + + // ensure it has been zeroed + assertEquals(1015, accessor.getNullCount()); + + m.set(0, 1); + m.set(1014, 1); // ensure that the last item of the last byte is allocated + + assertEquals(1013, accessor.getNullCount()); + + vector.zeroVector(); + assertEquals(1015, accessor.getNullCount()); + + // set all the array to 1 + for (int i = 0; i < 1015; ++i) { + assertEquals(1015 - i, accessor.getNullCount()); + m.set(i, 1); + } + + assertEquals(0, accessor.getNullCount()); } } From c369709c4f8157cb5e6c8121e1e613b104305aed Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Mon, 19 Dec 2016 11:47:32 -0500 Subject: [PATCH 0234/1644] ARROW-426: Python: Conversion from pyarrow.Array to a Python list Author: Uwe L. Korn Closes #242 from xhochy/ARROW-426 and squashes the following commits: 10739ac [Uwe L. Korn] ARROW-426: Python: Conversion from pyarrow.Array to a Python list --- python/pyarrow/array.pyx | 6 ++++++ python/pyarrow/scalar.pyx | 4 +++- python/pyarrow/table.pyx | 15 +++++++++++++++ python/pyarrow/tests/test_column.py | 1 + python/pyarrow/tests/test_convert_builtin.py | 13 +++++++++++-- 5 files changed, 36 insertions(+), 3 deletions(-) diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 6c862751fc218..d44212f4aed63 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -167,6 +167,12 @@ cdef class Array: return PyObject_to_object(np_arr) + def to_pylist(self): + """ + Convert to an list of native Python objects. 
+ """ + return [x.as_py() for x in self] + cdef class NullArray(Array): pass diff --git a/python/pyarrow/scalar.pyx b/python/pyarrow/scalar.pyx index 0d391e5f26b3e..c2d20e460c37c 100644 --- a/python/pyarrow/scalar.pyx +++ b/python/pyarrow/scalar.pyx @@ -194,7 +194,9 @@ cdef object box_arrow_scalar(DataType type, const shared_ptr[CArray]& sp_array, int index): cdef ArrayValue val - if sp_array.get().IsNull(index): + if type.type.type == Type_NA: + return NA + elif sp_array.get().IsNull(index): return NA else: val = _scalar_classes[type.type.type]() diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index 333686f810ea8..2f7d4309e4518 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -108,6 +108,15 @@ cdef class ChunkedArray: for i in range(self.num_chunks): yield self.chunk(i) + def to_pylist(self): + """ + Convert to a list of native Python objects. + """ + result = [] + for i in range(self.num_chunks): + result += self.chunk(i).to_pylist() + return result + cdef class Column: """ @@ -143,6 +152,12 @@ cdef class Column: return pd.Series(PyObject_to_object(arr), name=self.name) + def to_pylist(self): + """ + Convert to a list of native Python objects. + """ + return self.data.to_pylist() + cdef _check_nullptr(self): if self.column == NULL: raise ReferenceError("Column object references a NULL pointer." diff --git a/python/pyarrow/tests/test_column.py b/python/pyarrow/tests/test_column.py index b62f58236e073..32202cb5a9ad8 100644 --- a/python/pyarrow/tests/test_column.py +++ b/python/pyarrow/tests/test_column.py @@ -35,6 +35,7 @@ def test_basics(self): assert column.length() == 5 assert len(column) == 5 assert column.shape == (5,) + assert column.to_pylist() == [-10, -5, 0, 5, 10] def test_pandas(self): data = [ diff --git a/python/pyarrow/tests/test_convert_builtin.py b/python/pyarrow/tests/test_convert_builtin.py index 8937f8db6941f..34371b0bdd7c9 100644 --- a/python/pyarrow/tests/test_convert_builtin.py +++ b/python/pyarrow/tests/test_convert_builtin.py @@ -22,28 +22,34 @@ class TestConvertList(unittest.TestCase): def test_boolean(self): - arr = pyarrow.from_pylist([True, None, False, None]) + expected = [True, None, False, None] + arr = pyarrow.from_pylist(expected) assert len(arr) == 4 assert arr.null_count == 2 assert arr.type == pyarrow.bool_() + assert arr.to_pylist() == expected def test_empty_list(self): arr = pyarrow.from_pylist([]) assert len(arr) == 0 assert arr.null_count == 0 assert arr.type == pyarrow.null() + assert arr.to_pylist() == [] def test_all_none(self): arr = pyarrow.from_pylist([None, None]) assert len(arr) == 2 assert arr.null_count == 2 assert arr.type == pyarrow.null() + assert arr.to_pylist() == [None, None] def test_integer(self): - arr = pyarrow.from_pylist([1, None, 3, None]) + expected = [1, None, 3, None] + arr = pyarrow.from_pylist(expected) assert len(arr) == 4 assert arr.null_count == 2 assert arr.type == pyarrow.int64() + assert arr.to_pylist() == expected def test_garbage_collection(self): import gc @@ -62,6 +68,7 @@ def test_double(self): assert len(arr) == 6 assert arr.null_count == 3 assert arr.type == pyarrow.double() + assert arr.to_pylist() == data def test_string(self): data = ['foo', b'bar', None, 'arrow'] @@ -69,6 +76,7 @@ def test_string(self): assert len(arr) == 4 assert arr.null_count == 1 assert arr.type == pyarrow.string() + assert arr.to_pylist() == ['foo', 'bar', None, 'arrow'] def test_mixed_nesting_levels(self): pyarrow.from_pylist([1, 2, None]) @@ -90,3 +98,4 @@ def test_list_of_int(self): assert 
len(arr) == 4 assert arr.null_count == 1 assert arr.type == pyarrow.list_(pyarrow.int64()) + assert arr.to_pylist() == data From 68e39c6868d449f10243707ca1a7513aaa29761f Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 19 Dec 2016 21:11:25 +0100 Subject: [PATCH 0235/1644] ARROW-429: Revert ARROW-379 until git-archive issues are resolved These changes are resulting in GitHub producing archive tarballs with non-deterministic contents. Author: Wes McKinney Closes #243 from wesm/ARROW-429 and squashes the following commits: 49f6edb [Wes McKinney] Revert "ARROW-379: Use setuptools_scm for Python versioning" --- dev/release/00-prepare.sh | 5 +++++ python/.git_archival.txt | 1 - python/.gitattributes | 1 - python/pyarrow/__init__.py | 10 ++-------- python/setup.cfg | 20 -------------------- python/setup.py | 23 +++++++++++++++++++++-- 6 files changed, 28 insertions(+), 32 deletions(-) delete mode 100644 python/.git_archival.txt delete mode 100644 python/.gitattributes delete mode 100644 python/setup.cfg diff --git a/dev/release/00-prepare.sh b/dev/release/00-prepare.sh index 00af5e7768161..3423a3e6c5bf9 100644 --- a/dev/release/00-prepare.sh +++ b/dev/release/00-prepare.sh @@ -43,4 +43,9 @@ mvn release:prepare -Dtag=${tag} -DreleaseVersion=${version} -DautoVersionSubmod cd - +cd "${SOURCE_DIR}/../../python" +sed -i "s/VERSION = '[^']*'/VERSION = '${version}'/g" setup.py +sed -i "s/ISRELEASED = False/ISRELEASED = True/g" setup.py +cd - + echo "Finish staging binary artifacts by running: sh dev/release/01-perform.sh" diff --git a/python/.git_archival.txt b/python/.git_archival.txt deleted file mode 100644 index 95cb3eea4e336..0000000000000 --- a/python/.git_archival.txt +++ /dev/null @@ -1 +0,0 @@ -ref-names: $Format:%D$ diff --git a/python/.gitattributes b/python/.gitattributes deleted file mode 100644 index 00a7b00c94e08..0000000000000 --- a/python/.gitattributes +++ /dev/null @@ -1 +0,0 @@ -.git_archival.txt export-subst diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 5af93fb5865de..b9d386195b436 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -17,14 +17,6 @@ # flake8: noqa -from pkg_resources import get_distribution, DistributionNotFound -try: - __version__ = get_distribution(__name__).version -except DistributionNotFound: - # package is not installed - pass - - import pyarrow.config from pyarrow.array import (Array, @@ -50,3 +42,5 @@ DataType, Field, Schema, schema) from pyarrow.table import Column, RecordBatch, Table, from_pandas_dataframe + +from pyarrow.version import version as __version__ diff --git a/python/setup.cfg b/python/setup.cfg deleted file mode 100644 index caae3e081b6ca..0000000000000 --- a/python/setup.cfg +++ /dev/null @@ -1,20 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. 
- -[build_sphinx] -source-dir = doc/ -build-dir = doc/_build diff --git a/python/setup.py b/python/setup.py index 5acdca34a0882..5f448f7d50784 100644 --- a/python/setup.py +++ b/python/setup.py @@ -42,9 +42,27 @@ if Cython.__version__ < '0.19.1': raise Exception('Please upgrade to Cython 0.19.1 or newer') +VERSION = '0.1.0' +ISRELEASED = False + +if not ISRELEASED: + VERSION += '.dev' + setup_dir = os.path.abspath(os.path.dirname(__file__)) +def write_version_py(filename=os.path.join(setup_dir, 'pyarrow/version.py')): + a = open(filename, 'w') + file_content = "\n".join(["", + "# THIS FILE IS GENERATED FROM SETUP.PY", + "version = '%(version)s'", + "isrelease = '%(isrelease)s'"]) + + a.write(file_content % {'version': VERSION, + 'isrelease': str(ISRELEASED)}) + a.close() + + class clean(_clean): def run(self): @@ -254,12 +272,15 @@ def get_outputs(self): return [self._get_cmake_ext_path(name) for name in self.get_names()] +write_version_py() + DESC = """\ Python library for Apache Arrow""" setup( name="pyarrow", packages=['pyarrow', 'pyarrow.tests'], + version=VERSION, zip_safe=False, package_data={'pyarrow': ['*.pxd', '*.pyx']}, # Dummy extension to trigger build_ext @@ -269,8 +290,6 @@ def get_outputs(self): 'clean': clean, 'build_ext': build_ext }, - use_scm_version = {"root": "..", "relative_to": __file__}, - setup_requires=['setuptools_scm', 'setuptools_scm_git_archive'], install_requires=['cython >= 0.23', 'numpy >= 1.9', 'six >= 1.0.0'], description=DESC, license='Apache License, Version 2.0', From cfde4607df453e4b97560e64caff744fb3ba3d1f Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 19 Dec 2016 18:26:17 -0500 Subject: [PATCH 0236/1644] ARROW-243: [C++] Add option to switch between libhdfs and libhdfs3 when creating HdfsClient Closes #108 Some users will not have a full Java Hadoop distribution and may wish to use the libhdfs3 package from Pivotal (https://github.com/Pivotal-Data-Attic/pivotalrd-libhdfs3), part of Apache HAWQ (incubating). In C++, you can switch by setting: ```c++ HdfsConnectionConfig conf; conf.driver = HdfsDriver::LIBHDFS3; ``` In Python, you can run: ```python con = arrow.io.HdfsClient.connect(..., driver='libhdfs3') ``` Author: Wes McKinney Closes #244 from wesm/ARROW-243 and squashes the following commits: 7ae197a [Wes McKinney] Refactor HdfsClient code to support both libhdfs and libhdfs3 at runtime. 
Add driver option to Python interface --- cpp/src/arrow/io/CMakeLists.txt | 2 +- cpp/src/arrow/io/hdfs-internal.cc | 590 ++++++++++++++++++++++++ cpp/src/arrow/io/hdfs-internal.h | 203 ++++++++ cpp/src/arrow/io/hdfs.cc | 102 ++-- cpp/src/arrow/io/hdfs.h | 6 +- cpp/src/arrow/io/io-hdfs-test.cc | 211 +++++---- cpp/src/arrow/io/libhdfs_shim.cc | 582 ----------------------- python/.gitignore | 1 + python/pyarrow/includes/libarrow_io.pxd | 8 +- python/pyarrow/io.pyx | 45 +- python/pyarrow/tests/test_hdfs.py | 161 +++---- 11 files changed, 1109 insertions(+), 802 deletions(-) create mode 100644 cpp/src/arrow/io/hdfs-internal.cc create mode 100644 cpp/src/arrow/io/hdfs-internal.h delete mode 100644 cpp/src/arrow/io/libhdfs_shim.cc diff --git a/cpp/src/arrow/io/CMakeLists.txt b/cpp/src/arrow/io/CMakeLists.txt index f285180c5142a..e2b6496cc3f87 100644 --- a/cpp/src/arrow/io/CMakeLists.txt +++ b/cpp/src/arrow/io/CMakeLists.txt @@ -75,7 +75,7 @@ if(ARROW_HDFS) set(ARROW_HDFS_SRCS hdfs.cc - libhdfs_shim.cc) + hdfs-internal.cc) set_property(SOURCE ${ARROW_HDFS_SRCS} APPEND_STRING PROPERTY diff --git a/cpp/src/arrow/io/hdfs-internal.cc b/cpp/src/arrow/io/hdfs-internal.cc new file mode 100644 index 0000000000000..7094785de02a0 --- /dev/null +++ b/cpp/src/arrow/io/hdfs-internal.cc @@ -0,0 +1,590 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +// This shim interface to libhdfs (for runtime shared library loading) has been +// adapted from the SFrame project, released under the ASF-compatible 3-clause +// BSD license +// +// Using this required having the $JAVA_HOME and $HADOOP_HOME environment +// variables set, so that libjvm and libhdfs can be located easily + +// Copyright (C) 2015 Dato, Inc. +// All rights reserved. +// +// This software may be modified and distributed under the terms +// of the BSD license. See the LICENSE file for details. 
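+
+// In outline, the loader below works in three steps: build candidate
+// paths for libjvm and for libhdfs/libhdfs3 (well-known install
+// locations plus the JAVA_HOME, HADOOP_HOME, ARROW_LIBHDFS_DIR and
+// ARROW_LIBHDFS3_DIR environment variables), dlopen the first candidate
+// that loads, and resolve entry points through the LibHdfsShim
+// function-pointer table: required symbols eagerly in
+// GetRequiredSymbols(), optional ones lazily via GET_SYMBOL.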
+ +#ifdef HAS_HADOOP + +#ifndef _WIN32 +#include +#else +#include +#include + +// TODO(wesm): address when/if we add windows support +// #include +#endif + +extern "C" { +#include +} + +#include +#include +#include +#include +#include +#include + +#include // NOLINT + +#include "arrow/io/hdfs-internal.h" +#include "arrow/status.h" +#include "arrow/util/visibility.h" + +namespace fs = boost::filesystem; + +#ifndef _WIN32 +static void* libjvm_handle = NULL; +#else +static HINSTANCE libjvm_handle = NULL; +#endif +/* + * All the shim pointers + */ + +// Helper functions for dlopens +static std::vector get_potential_libjvm_paths(); +static std::vector get_potential_libhdfs_paths(); +static std::vector get_potential_libhdfs3_paths(); +static arrow::Status try_dlopen(std::vector potential_paths, const char* name, +#ifndef _WIN32 + void*& out_handle); +#else + HINSTANCE& out_handle); +#endif + +static std::vector get_potential_libhdfs_paths() { + std::vector libhdfs_potential_paths; + std::string file_name; + +// OS-specific file name +#ifdef __WIN32 + file_name = "hdfs.dll"; +#elif __APPLE__ + file_name = "libhdfs.dylib"; +#else + file_name = "libhdfs.so"; +#endif + + // Common paths + std::vector search_paths = {fs::path(""), fs::path(".")}; + + // Path from environment variable + const char* hadoop_home = std::getenv("HADOOP_HOME"); + if (hadoop_home != nullptr) { + auto path = fs::path(hadoop_home) / "lib/native"; + search_paths.push_back(path); + } + + const char* libhdfs_dir = std::getenv("ARROW_LIBHDFS_DIR"); + if (libhdfs_dir != nullptr) { search_paths.push_back(fs::path(libhdfs_dir)); } + + // All paths with file name + for (auto& path : search_paths) { + libhdfs_potential_paths.push_back(path / file_name); + } + + return libhdfs_potential_paths; +} + +static std::vector get_potential_libhdfs3_paths() { + std::vector potential_paths; + std::string file_name; + +// OS-specific file name +#ifdef __WIN32 + file_name = "hdfs3.dll"; +#elif __APPLE__ + file_name = "libhdfs3.dylib"; +#else + file_name = "libhdfs3.so"; +#endif + + // Common paths + std::vector search_paths = {fs::path(""), fs::path(".")}; + + const char* libhdfs3_dir = std::getenv("ARROW_LIBHDFS3_DIR"); + if (libhdfs3_dir != nullptr) { search_paths.push_back(fs::path(libhdfs3_dir)); } + + // All paths with file name + for (auto& path : search_paths) { + potential_paths.push_back(path / file_name); + } + + return potential_paths; +} + +static std::vector get_potential_libjvm_paths() { + std::vector libjvm_potential_paths; + + std::vector search_prefixes; + std::vector search_suffixes; + std::string file_name; + +// From heuristics +#ifdef __WIN32 + search_prefixes = {""}; + search_suffixes = {"/jre/bin/server", "/bin/server"}; + file_name = "jvm.dll"; +#elif __APPLE__ + search_prefixes = {""}; + search_suffixes = {"", "/jre/lib/server"}; + file_name = "libjvm.dylib"; + +// SFrame uses /usr/libexec/java_home to find JAVA_HOME; for now we are +// expecting users to set an environment variable +#else + search_prefixes = { + "/usr/lib/jvm/default-java", // ubuntu / debian distros + "/usr/lib/jvm/java", // rhel6 + "/usr/lib/jvm", // centos6 + "/usr/lib64/jvm", // opensuse 13 + "/usr/local/lib/jvm/default-java", // alt ubuntu / debian distros + "/usr/local/lib/jvm/java", // alt rhel6 + "/usr/local/lib/jvm", // alt centos6 + "/usr/local/lib64/jvm", // alt opensuse 13 + "/usr/local/lib/jvm/java-7-openjdk-amd64", // alt ubuntu / debian distros + "/usr/lib/jvm/java-7-openjdk-amd64", // alt ubuntu / debian distros + 
"/usr/local/lib/jvm/java-6-openjdk-amd64", // alt ubuntu / debian distros + "/usr/lib/jvm/java-6-openjdk-amd64", // alt ubuntu / debian distros + "/usr/lib/jvm/java-7-oracle", // alt ubuntu + "/usr/lib/jvm/java-8-oracle", // alt ubuntu + "/usr/lib/jvm/java-6-oracle", // alt ubuntu + "/usr/local/lib/jvm/java-7-oracle", // alt ubuntu + "/usr/local/lib/jvm/java-8-oracle", // alt ubuntu + "/usr/local/lib/jvm/java-6-oracle", // alt ubuntu + "/usr/lib/jvm/default", // alt centos + "/usr/java/latest", // alt centos + }; + search_suffixes = {"/jre/lib/amd64/server"}; + file_name = "libjvm.so"; +#endif + // From direct environment variable + char* env_value = NULL; + if ((env_value = getenv("JAVA_HOME")) != NULL) { + search_prefixes.insert(search_prefixes.begin(), env_value); + } + + // Generate cross product between search_prefixes, search_suffixes, and file_name + for (auto& prefix : search_prefixes) { + for (auto& suffix : search_suffixes) { + auto path = (fs::path(prefix) / fs::path(suffix) / fs::path(file_name)); + libjvm_potential_paths.push_back(path); + } + } + + return libjvm_potential_paths; +} + +#ifndef _WIN32 +static arrow::Status try_dlopen( + std::vector potential_paths, const char* name, void*& out_handle) { + std::vector error_messages; + + for (auto& i : potential_paths) { + i.make_preferred(); + out_handle = dlopen(i.native().c_str(), RTLD_NOW | RTLD_LOCAL); + + if (out_handle != NULL) { + // std::cout << "Loaded " << i << std::endl; + break; + } else { + const char* err_msg = dlerror(); + if (err_msg != NULL) { + error_messages.push_back(std::string(err_msg)); + } else { + error_messages.push_back(std::string(" returned NULL")); + } + } + } + + if (out_handle == NULL) { + std::stringstream ss; + ss << "Unable to load " << name; + return arrow::Status::IOError(ss.str()); + } + + return arrow::Status::OK(); +} + +#else +static arrow::Status try_dlopen( + std::vector potential_paths, const char* name, HINSTANCE& out_handle) { + std::vector error_messages; + + for (auto& i : potential_paths) { + i.make_preferred(); + out_handle = LoadLibrary(i.string().c_str()); + + if (out_handle != NULL) { + break; + } else { + // error_messages.push_back(get_last_err_str(GetLastError())); + } + } + + if (out_handle == NULL) { + std::stringstream ss; + ss << "Unable to load " << name; + return arrow::Status::IOError(ss.str()); + } + + return arrow::Status::OK(); +} +#endif // _WIN32 + +static inline void* GetLibrarySymbol(void* handle, const char* symbol) { + if (handle == NULL) return NULL; +#ifndef _WIN32 + return dlsym(handle, symbol); +#else + + void* ret = reinterpret_cast(GetProcAddress(handle, symbol)); + if (ret == NULL) { + // logstream(LOG_INFO) << "GetProcAddress error: " + // << get_last_err_str(GetLastError()) << std::endl; + } + return ret; +#endif +} + +#define GET_SYMBOL_REQUIRED(SHIM, SYMBOL_NAME) \ + do { \ + if (!SHIM->SYMBOL_NAME) { \ + *reinterpret_cast(&SHIM->SYMBOL_NAME) = \ + GetLibrarySymbol(SHIM->handle, "" #SYMBOL_NAME); \ + } \ + if (!SHIM->SYMBOL_NAME) \ + return Status::IOError("Getting symbol " #SYMBOL_NAME "failed"); \ + } while (0) + +#define GET_SYMBOL(SHIM, SYMBOL_NAME) \ + if (!SHIM->SYMBOL_NAME) { \ + *reinterpret_cast(&SHIM->SYMBOL_NAME) = \ + GetLibrarySymbol(SHIM->handle, "" #SYMBOL_NAME); \ + } + +namespace arrow { +namespace io { + +static LibHdfsShim libhdfs_shim; +static LibHdfsShim libhdfs3_shim; + +hdfsBuilder* LibHdfsShim::NewBuilder(void) { + return this->hdfsNewBuilder(); +} + +void LibHdfsShim::BuilderSetNameNode(hdfsBuilder* bld, const char* nn) { 
+ this->hdfsBuilderSetNameNode(bld, nn); +} + +void LibHdfsShim::BuilderSetNameNodePort(hdfsBuilder* bld, tPort port) { + this->hdfsBuilderSetNameNodePort(bld, port); +} + +void LibHdfsShim::BuilderSetUserName(hdfsBuilder* bld, const char* userName) { + this->hdfsBuilderSetUserName(bld, userName); +} + +void LibHdfsShim::BuilderSetKerbTicketCachePath( + hdfsBuilder* bld, const char* kerbTicketCachePath) { + this->hdfsBuilderSetKerbTicketCachePath(bld, kerbTicketCachePath); +} + +hdfsFS LibHdfsShim::BuilderConnect(hdfsBuilder* bld) { + return this->hdfsBuilderConnect(bld); +} + +int LibHdfsShim::Disconnect(hdfsFS fs) { + return this->hdfsDisconnect(fs); +} + +hdfsFile LibHdfsShim::OpenFile(hdfsFS fs, const char* path, int flags, int bufferSize, + short replication, tSize blocksize) { // NOLINT + return this->hdfsOpenFile(fs, path, flags, bufferSize, replication, blocksize); +} + +int LibHdfsShim::CloseFile(hdfsFS fs, hdfsFile file) { + return this->hdfsCloseFile(fs, file); +} + +int LibHdfsShim::Exists(hdfsFS fs, const char* path) { + return this->hdfsExists(fs, path); +} + +int LibHdfsShim::Seek(hdfsFS fs, hdfsFile file, tOffset desiredPos) { + return this->hdfsSeek(fs, file, desiredPos); +} + +tOffset LibHdfsShim::Tell(hdfsFS fs, hdfsFile file) { + return this->hdfsTell(fs, file); +} + +tSize LibHdfsShim::Read(hdfsFS fs, hdfsFile file, void* buffer, tSize length) { + return this->hdfsRead(fs, file, buffer, length); +} + +bool LibHdfsShim::HasPread() { + GET_SYMBOL(this, hdfsPread); + return this->hdfsPread != nullptr; +} + +tSize LibHdfsShim::Pread( + hdfsFS fs, hdfsFile file, tOffset position, void* buffer, tSize length) { + GET_SYMBOL(this, hdfsPread); + return this->hdfsPread(fs, file, position, buffer, length); +} + +tSize LibHdfsShim::Write(hdfsFS fs, hdfsFile file, const void* buffer, tSize length) { + return this->hdfsWrite(fs, file, buffer, length); +} + +int LibHdfsShim::Flush(hdfsFS fs, hdfsFile file) { + return this->hdfsFlush(fs, file); +} + +int LibHdfsShim::Available(hdfsFS fs, hdfsFile file) { + GET_SYMBOL(this, hdfsAvailable); + if (this->hdfsAvailable) + return this->hdfsAvailable(fs, file); + else + return 0; +} + +int LibHdfsShim::Copy(hdfsFS srcFS, const char* src, hdfsFS dstFS, const char* dst) { + GET_SYMBOL(this, hdfsCopy); + if (this->hdfsCopy) + return this->hdfsCopy(srcFS, src, dstFS, dst); + else + return 0; +} + +int LibHdfsShim::Move(hdfsFS srcFS, const char* src, hdfsFS dstFS, const char* dst) { + GET_SYMBOL(this, hdfsMove); + if (this->hdfsMove) + return this->hdfsMove(srcFS, src, dstFS, dst); + else + return 0; +} + +int LibHdfsShim::Delete(hdfsFS fs, const char* path, int recursive) { + return this->hdfsDelete(fs, path, recursive); +} + +int LibHdfsShim::Rename(hdfsFS fs, const char* oldPath, const char* newPath) { + GET_SYMBOL(this, hdfsRename); + if (this->hdfsRename) + return this->hdfsRename(fs, oldPath, newPath); + else + return 0; +} + +char* LibHdfsShim::GetWorkingDirectory(hdfsFS fs, char* buffer, size_t bufferSize) { + GET_SYMBOL(this, hdfsGetWorkingDirectory); + if (this->hdfsGetWorkingDirectory) { + return this->hdfsGetWorkingDirectory(fs, buffer, bufferSize); + } else { + return NULL; + } +} + +int LibHdfsShim::SetWorkingDirectory(hdfsFS fs, const char* path) { + GET_SYMBOL(this, hdfsSetWorkingDirectory); + if (this->hdfsSetWorkingDirectory) { + return this->hdfsSetWorkingDirectory(fs, path); + } else { + return 0; + } +} + +int LibHdfsShim::CreateDirectory(hdfsFS fs, const char* path) { + return this->hdfsCreateDirectory(fs, path); +} + +int 
LibHdfsShim::SetReplication(hdfsFS fs, const char* path, int16_t replication) { + GET_SYMBOL(this, hdfsSetReplication); + if (this->hdfsSetReplication) { + return this->hdfsSetReplication(fs, path, replication); + } else { + return 0; + } +} + +hdfsFileInfo* LibHdfsShim::ListDirectory(hdfsFS fs, const char* path, int* numEntries) { + return this->hdfsListDirectory(fs, path, numEntries); +} + +hdfsFileInfo* LibHdfsShim::GetPathInfo(hdfsFS fs, const char* path) { + return this->hdfsGetPathInfo(fs, path); +} + +void LibHdfsShim::FreeFileInfo(hdfsFileInfo* hdfsFileInfo, int numEntries) { + this->hdfsFreeFileInfo(hdfsFileInfo, numEntries); +} + +char*** LibHdfsShim::GetHosts( + hdfsFS fs, const char* path, tOffset start, tOffset length) { + GET_SYMBOL(this, hdfsGetHosts); + if (this->hdfsGetHosts) { + return this->hdfsGetHosts(fs, path, start, length); + } else { + return NULL; + } +} + +void LibHdfsShim::FreeHosts(char*** blockHosts) { + GET_SYMBOL(this, hdfsFreeHosts); + if (this->hdfsFreeHosts) { this->hdfsFreeHosts(blockHosts); } +} + +tOffset LibHdfsShim::GetDefaultBlockSize(hdfsFS fs) { + GET_SYMBOL(this, hdfsGetDefaultBlockSize); + if (this->hdfsGetDefaultBlockSize) { + return this->hdfsGetDefaultBlockSize(fs); + } else { + return 0; + } +} + +tOffset LibHdfsShim::GetCapacity(hdfsFS fs) { + return this->hdfsGetCapacity(fs); +} + +tOffset LibHdfsShim::GetUsed(hdfsFS fs) { + return this->hdfsGetUsed(fs); +} + +int LibHdfsShim::Chown( + hdfsFS fs, const char* path, const char* owner, const char* group) { + GET_SYMBOL(this, hdfsChown); + if (this->hdfsChown) { + return this->hdfsChown(fs, path, owner, group); + } else { + return 0; + } +} + +int LibHdfsShim::Chmod(hdfsFS fs, const char* path, short mode) { // NOLINT + GET_SYMBOL(this, hdfsChmod); + if (this->hdfsChmod) { + return this->hdfsChmod(fs, path, mode); + } else { + return 0; + } +} + +int LibHdfsShim::Utime(hdfsFS fs, const char* path, tTime mtime, tTime atime) { + GET_SYMBOL(this, hdfsUtime); + if (this->hdfsUtime) { + return this->hdfsUtime(fs, path, mtime, atime); + } else { + return 0; + } +} + +Status LibHdfsShim::GetRequiredSymbols() { + GET_SYMBOL_REQUIRED(this, hdfsNewBuilder); + GET_SYMBOL_REQUIRED(this, hdfsBuilderSetNameNode); + GET_SYMBOL_REQUIRED(this, hdfsBuilderSetNameNodePort); + GET_SYMBOL_REQUIRED(this, hdfsBuilderSetUserName); + GET_SYMBOL_REQUIRED(this, hdfsBuilderSetKerbTicketCachePath); + GET_SYMBOL_REQUIRED(this, hdfsBuilderConnect); + GET_SYMBOL_REQUIRED(this, hdfsCreateDirectory); + GET_SYMBOL_REQUIRED(this, hdfsDelete); + GET_SYMBOL_REQUIRED(this, hdfsDisconnect); + GET_SYMBOL_REQUIRED(this, hdfsExists); + GET_SYMBOL_REQUIRED(this, hdfsFreeFileInfo); + GET_SYMBOL_REQUIRED(this, hdfsGetCapacity); + GET_SYMBOL_REQUIRED(this, hdfsGetUsed); + GET_SYMBOL_REQUIRED(this, hdfsGetPathInfo); + GET_SYMBOL_REQUIRED(this, hdfsListDirectory); + + // File methods + GET_SYMBOL_REQUIRED(this, hdfsCloseFile); + GET_SYMBOL_REQUIRED(this, hdfsFlush); + GET_SYMBOL_REQUIRED(this, hdfsOpenFile); + GET_SYMBOL_REQUIRED(this, hdfsRead); + GET_SYMBOL_REQUIRED(this, hdfsSeek); + GET_SYMBOL_REQUIRED(this, hdfsTell); + GET_SYMBOL_REQUIRED(this, hdfsWrite); + + return Status::OK(); +} + +Status ARROW_EXPORT ConnectLibHdfs(LibHdfsShim** driver) { + static std::mutex lock; + std::lock_guard guard(lock); + + LibHdfsShim* shim = &libhdfs_shim; + + static bool shim_attempted = false; + if (!shim_attempted) { + shim_attempted = true; + + shim->Initialize(); + + std::vector libjvm_potential_paths = get_potential_libjvm_paths(); + 
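// libhdfs drives the Java HDFS client over JNI, so libjvm must be
+    // loaded into the process before libhdfs itself is opened.
+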
RETURN_NOT_OK(try_dlopen(libjvm_potential_paths, "libjvm", libjvm_handle)); + + std::vector libhdfs_potential_paths = get_potential_libhdfs_paths(); + RETURN_NOT_OK(try_dlopen(libhdfs_potential_paths, "libhdfs", shim->handle)); + } else if (shim->handle == nullptr) { + return Status::IOError("Prior attempt to load libhdfs failed"); + } + + *driver = shim; + return shim->GetRequiredSymbols(); +} + +Status ARROW_EXPORT ConnectLibHdfs3(LibHdfsShim** driver) { + static std::mutex lock; + std::lock_guard guard(lock); + + LibHdfsShim* shim = &libhdfs3_shim; + + static bool shim_attempted = false; + if (!shim_attempted) { + shim_attempted = true; + + shim->Initialize(); + + std::vector libhdfs3_potential_paths = get_potential_libhdfs3_paths(); + RETURN_NOT_OK(try_dlopen(libhdfs3_potential_paths, "libhdfs3", shim->handle)); + } else if (shim->handle == nullptr) { + return Status::IOError("Prior attempt to load libhdfs3 failed"); + } + + *driver = shim; + return shim->GetRequiredSymbols(); +} + +} // namespace io +} // namespace arrow + +#endif // HAS_HADOOP diff --git a/cpp/src/arrow/io/hdfs-internal.h b/cpp/src/arrow/io/hdfs-internal.h new file mode 100644 index 0000000000000..0ff118a8f57e7 --- /dev/null +++ b/cpp/src/arrow/io/hdfs-internal.h @@ -0,0 +1,203 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
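+
+// LibHdfsShim is a table of function pointers into a dynamically loaded
+// hdfs library. libhdfs3 mirrors the libhdfs C API, so a single shim
+// type can front either driver; ConnectLibHdfs() and ConnectLibHdfs3()
+// each populate their own static instance on first use.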
+ +#ifndef ARROW_IO_HDFS_INTERNAL +#define ARROW_IO_HDFS_INTERNAL + +#include + +namespace arrow { + +class Status; + +namespace io { + +// NOTE(wesm): cpplint does not like use of short and other imprecise C types +struct LibHdfsShim { +#ifndef _WIN32 + void* handle; +#else + HINSTANCE handle; +#endif + + hdfsBuilder* (*hdfsNewBuilder)(void); + void (*hdfsBuilderSetNameNode)(hdfsBuilder* bld, const char* nn); + void (*hdfsBuilderSetNameNodePort)(hdfsBuilder* bld, tPort port); + void (*hdfsBuilderSetUserName)(hdfsBuilder* bld, const char* userName); + void (*hdfsBuilderSetKerbTicketCachePath)( + hdfsBuilder* bld, const char* kerbTicketCachePath); + hdfsFS (*hdfsBuilderConnect)(hdfsBuilder* bld); + + int (*hdfsDisconnect)(hdfsFS fs); + + hdfsFile (*hdfsOpenFile)(hdfsFS fs, const char* path, int flags, int bufferSize, + short replication, tSize blocksize); // NOLINT + + int (*hdfsCloseFile)(hdfsFS fs, hdfsFile file); + int (*hdfsExists)(hdfsFS fs, const char* path); + int (*hdfsSeek)(hdfsFS fs, hdfsFile file, tOffset desiredPos); + tOffset (*hdfsTell)(hdfsFS fs, hdfsFile file); + tSize (*hdfsRead)(hdfsFS fs, hdfsFile file, void* buffer, tSize length); + tSize (*hdfsPread)( + hdfsFS fs, hdfsFile file, tOffset position, void* buffer, tSize length); + tSize (*hdfsWrite)(hdfsFS fs, hdfsFile file, const void* buffer, tSize length); + int (*hdfsFlush)(hdfsFS fs, hdfsFile file); + int (*hdfsAvailable)(hdfsFS fs, hdfsFile file); + int (*hdfsCopy)(hdfsFS srcFS, const char* src, hdfsFS dstFS, const char* dst); + int (*hdfsMove)(hdfsFS srcFS, const char* src, hdfsFS dstFS, const char* dst); + int (*hdfsDelete)(hdfsFS fs, const char* path, int recursive); + int (*hdfsRename)(hdfsFS fs, const char* oldPath, const char* newPath); + char* (*hdfsGetWorkingDirectory)(hdfsFS fs, char* buffer, size_t bufferSize); + int (*hdfsSetWorkingDirectory)(hdfsFS fs, const char* path); + int (*hdfsCreateDirectory)(hdfsFS fs, const char* path); + int (*hdfsSetReplication)(hdfsFS fs, const char* path, int16_t replication); + hdfsFileInfo* (*hdfsListDirectory)(hdfsFS fs, const char* path, int* numEntries); + hdfsFileInfo* (*hdfsGetPathInfo)(hdfsFS fs, const char* path); + void (*hdfsFreeFileInfo)(hdfsFileInfo* hdfsFileInfo, int numEntries); + char*** (*hdfsGetHosts)(hdfsFS fs, const char* path, tOffset start, tOffset length); + void (*hdfsFreeHosts)(char*** blockHosts); + tOffset (*hdfsGetDefaultBlockSize)(hdfsFS fs); + tOffset (*hdfsGetCapacity)(hdfsFS fs); + tOffset (*hdfsGetUsed)(hdfsFS fs); + int (*hdfsChown)(hdfsFS fs, const char* path, const char* owner, const char* group); + int (*hdfsChmod)(hdfsFS fs, const char* path, short mode); // NOLINT + int (*hdfsUtime)(hdfsFS fs, const char* path, tTime mtime, tTime atime); + + void Initialize() { + this->handle = nullptr; + this->hdfsNewBuilder = nullptr; + this->hdfsBuilderSetNameNode = nullptr; + this->hdfsBuilderSetNameNodePort = nullptr; + this->hdfsBuilderSetUserName = nullptr; + this->hdfsBuilderSetKerbTicketCachePath = nullptr; + this->hdfsBuilderConnect = nullptr; + this->hdfsDisconnect = nullptr; + this->hdfsOpenFile = nullptr; + this->hdfsCloseFile = nullptr; + this->hdfsExists = nullptr; + this->hdfsSeek = nullptr; + this->hdfsTell = nullptr; + this->hdfsRead = nullptr; + this->hdfsPread = nullptr; + this->hdfsWrite = nullptr; + this->hdfsFlush = nullptr; + this->hdfsAvailable = nullptr; + this->hdfsCopy = nullptr; + this->hdfsMove = nullptr; + this->hdfsDelete = nullptr; + this->hdfsRename = nullptr; + this->hdfsGetWorkingDirectory = nullptr; + 
this->hdfsSetWorkingDirectory = nullptr; + this->hdfsCreateDirectory = nullptr; + this->hdfsSetReplication = nullptr; + this->hdfsListDirectory = nullptr; + this->hdfsGetPathInfo = nullptr; + this->hdfsFreeFileInfo = nullptr; + this->hdfsGetHosts = nullptr; + this->hdfsFreeHosts = nullptr; + this->hdfsGetDefaultBlockSize = nullptr; + this->hdfsGetCapacity = nullptr; + this->hdfsGetUsed = nullptr; + this->hdfsChown = nullptr; + this->hdfsChmod = nullptr; + this->hdfsUtime = nullptr; + } + + hdfsBuilder* NewBuilder(void); + + void BuilderSetNameNode(hdfsBuilder* bld, const char* nn); + + void BuilderSetNameNodePort(hdfsBuilder* bld, tPort port); + + void BuilderSetUserName(hdfsBuilder* bld, const char* userName); + + void BuilderSetKerbTicketCachePath(hdfsBuilder* bld, const char* kerbTicketCachePath); + + hdfsFS BuilderConnect(hdfsBuilder* bld); + + int Disconnect(hdfsFS fs); + + hdfsFile OpenFile(hdfsFS fs, const char* path, int flags, int bufferSize, + short replication, tSize blocksize); // NOLINT + + int CloseFile(hdfsFS fs, hdfsFile file); + + int Exists(hdfsFS fs, const char* path); + + int Seek(hdfsFS fs, hdfsFile file, tOffset desiredPos); + + tOffset Tell(hdfsFS fs, hdfsFile file); + + tSize Read(hdfsFS fs, hdfsFile file, void* buffer, tSize length); + + bool HasPread(); + + tSize Pread(hdfsFS fs, hdfsFile file, tOffset position, void* buffer, tSize length); + + tSize Write(hdfsFS fs, hdfsFile file, const void* buffer, tSize length); + + int Flush(hdfsFS fs, hdfsFile file); + + int Available(hdfsFS fs, hdfsFile file); + + int Copy(hdfsFS srcFS, const char* src, hdfsFS dstFS, const char* dst); + + int Move(hdfsFS srcFS, const char* src, hdfsFS dstFS, const char* dst); + + int Delete(hdfsFS fs, const char* path, int recursive); + + int Rename(hdfsFS fs, const char* oldPath, const char* newPath); + + char* GetWorkingDirectory(hdfsFS fs, char* buffer, size_t bufferSize); + + int SetWorkingDirectory(hdfsFS fs, const char* path); + + int CreateDirectory(hdfsFS fs, const char* path); + + int SetReplication(hdfsFS fs, const char* path, int16_t replication); + + hdfsFileInfo* ListDirectory(hdfsFS fs, const char* path, int* numEntries); + + hdfsFileInfo* GetPathInfo(hdfsFS fs, const char* path); + + void FreeFileInfo(hdfsFileInfo* hdfsFileInfo, int numEntries); + + char*** GetHosts(hdfsFS fs, const char* path, tOffset start, tOffset length); + + void FreeHosts(char*** blockHosts); + + tOffset GetDefaultBlockSize(hdfsFS fs); + tOffset GetCapacity(hdfsFS fs); + + tOffset GetUsed(hdfsFS fs); + + int Chown(hdfsFS fs, const char* path, const char* owner, const char* group); + + int Chmod(hdfsFS fs, const char* path, short mode); // NOLINT + + int Utime(hdfsFS fs, const char* path, tTime mtime, tTime atime); + + Status GetRequiredSymbols(); +}; + +Status ConnectLibHdfs(LibHdfsShim** driver); +Status ConnectLibHdfs3(LibHdfsShim** driver); + +} // namespace io +} // namespace arrow + +#endif // ARROW_IO_HDFS_INTERNAL diff --git a/cpp/src/arrow/io/hdfs.cc b/cpp/src/arrow/io/hdfs.cc index b8e212026b11c..44e503ff11302 100644 --- a/cpp/src/arrow/io/hdfs.cc +++ b/cpp/src/arrow/io/hdfs.cc @@ -23,6 +23,7 @@ #include #include "arrow/buffer.h" +#include "arrow/io/hdfs-internal.h" #include "arrow/io/hdfs.h" #include "arrow/memory_pool.h" #include "arrow/status.h" @@ -59,21 +60,23 @@ static constexpr int kDefaultHdfsBufferSize = 1 << 16; class HdfsAnyFileImpl { public: - void set_members(const std::string& path, hdfsFS fs, hdfsFile handle) { + void set_members( + const std::string& path, LibHdfsShim* driver, 
hdfsFS fs, hdfsFile handle) { path_ = path; + driver_ = driver; fs_ = fs; file_ = handle; is_open_ = true; } Status Seek(int64_t position) { - int ret = hdfsSeek(fs_, file_, position); + int ret = driver_->Seek(fs_, file_, position); CHECK_FAILURE(ret, "seek"); return Status::OK(); } Status Tell(int64_t* offset) { - int64_t ret = hdfsTell(fs_, file_); + int64_t ret = driver_->Tell(fs_, file_); CHECK_FAILURE(ret, "tell"); *offset = ret; return Status::OK(); @@ -84,6 +87,8 @@ class HdfsAnyFileImpl { protected: std::string path_; + LibHdfsShim* driver_; + // These are pointers in libhdfs, so OK to copy hdfsFS fs_; hdfsFile file_; @@ -98,7 +103,7 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { Status Close() { if (is_open_) { - int ret = hdfsCloseFile(fs_, file_); + int ret = driver_->CloseFile(fs_, file_); CHECK_FAILURE(ret, "CloseFile"); is_open_ = false; } @@ -106,8 +111,14 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { } Status ReadAt(int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) { - tSize ret = hdfsPread(fs_, file_, static_cast(position), - reinterpret_cast(buffer), nbytes); + tSize ret; + if (driver_->HasPread()) { + ret = driver_->Pread(fs_, file_, static_cast(position), + reinterpret_cast(buffer), nbytes); + } else { + RETURN_NOT_OK(Seek(position)); + return Read(nbytes, bytes_read, buffer); + } RETURN_NOT_OK(CheckReadResult(ret)); *bytes_read = ret; return Status::OK(); @@ -129,7 +140,7 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) { int64_t total_bytes = 0; while (total_bytes < nbytes) { - tSize ret = hdfsRead(fs_, file_, reinterpret_cast(buffer + total_bytes), + tSize ret = driver_->Read(fs_, file_, reinterpret_cast(buffer + total_bytes), std::min(buffer_size_, nbytes - total_bytes)); RETURN_NOT_OK(CheckReadResult(ret)); total_bytes += ret; @@ -153,11 +164,11 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { } Status GetSize(int64_t* size) { - hdfsFileInfo* entry = hdfsGetPathInfo(fs_, path_.c_str()); + hdfsFileInfo* entry = driver_->GetPathInfo(fs_, path_.c_str()); if (entry == nullptr) { return Status::IOError("HDFS: GetPathInfo failed"); } *size = entry->mSize; - hdfsFreeFileInfo(entry, 1); + driver_->FreeFileInfo(entry, 1); return Status::OK(); } @@ -227,9 +238,9 @@ class HdfsOutputStream::HdfsOutputStreamImpl : public HdfsAnyFileImpl { Status Close() { if (is_open_) { - int ret = hdfsFlush(fs_, file_); + int ret = driver_->Flush(fs_, file_); CHECK_FAILURE(ret, "Flush"); - ret = hdfsCloseFile(fs_, file_); + ret = driver_->CloseFile(fs_, file_); CHECK_FAILURE(ret, "CloseFile"); is_open_ = false; } @@ -237,7 +248,7 @@ class HdfsOutputStream::HdfsOutputStreamImpl : public HdfsAnyFileImpl { } Status Write(const uint8_t* buffer, int64_t nbytes, int64_t* bytes_written) { - tSize ret = hdfsWrite(fs_, file_, reinterpret_cast(buffer), nbytes); + tSize ret = driver_->Write(fs_, file_, reinterpret_cast(buffer), nbytes); CHECK_FAILURE(ret, "Write"); *bytes_written = ret; return Status::OK(); @@ -297,17 +308,25 @@ class HdfsClient::HdfsClientImpl { HdfsClientImpl() {} Status Connect(const HdfsConnectionConfig* config) { - RETURN_NOT_OK(ConnectLibHdfs()); + if (config->driver == HdfsDriver::LIBHDFS3) { + RETURN_NOT_OK(ConnectLibHdfs3(&driver_)); + } else { + RETURN_NOT_OK(ConnectLibHdfs(&driver_)); + } // connect to HDFS with the builder object - hdfsBuilder* builder = hdfsNewBuilder(); - if 
(!config->host.empty()) { hdfsBuilderSetNameNode(builder, config->host.c_str()); } - hdfsBuilderSetNameNodePort(builder, config->port); - if (!config->user.empty()) { hdfsBuilderSetUserName(builder, config->user.c_str()); } + hdfsBuilder* builder = driver_->NewBuilder(); + if (!config->host.empty()) { + driver_->BuilderSetNameNode(builder, config->host.c_str()); + } + driver_->BuilderSetNameNodePort(builder, config->port); + if (!config->user.empty()) { + driver_->BuilderSetUserName(builder, config->user.c_str()); + } if (!config->kerb_ticket.empty()) { - hdfsBuilderSetKerbTicketCachePath(builder, config->kerb_ticket.c_str()); + driver_->BuilderSetKerbTicketCachePath(builder, config->kerb_ticket.c_str()); } - fs_ = hdfsBuilderConnect(builder); + fs_ = driver_->BuilderConnect(builder); if (fs_ == nullptr) { return Status::IOError("HDFS connection failed"); } namenode_host_ = config->host; @@ -319,19 +338,19 @@ class HdfsClient::HdfsClientImpl { } Status CreateDirectory(const std::string& path) { - int ret = hdfsCreateDirectory(fs_, path.c_str()); + int ret = driver_->CreateDirectory(fs_, path.c_str()); CHECK_FAILURE(ret, "create directory"); return Status::OK(); } Status Delete(const std::string& path, bool recursive) { - int ret = hdfsDelete(fs_, path.c_str(), static_cast(recursive)); + int ret = driver_->Delete(fs_, path.c_str(), static_cast(recursive)); CHECK_FAILURE(ret, "delete"); return Status::OK(); } Status Disconnect() { - int ret = hdfsDisconnect(fs_); + int ret = driver_->Disconnect(fs_); CHECK_FAILURE(ret, "hdfsFS::Disconnect"); return Status::OK(); } @@ -339,38 +358,38 @@ class HdfsClient::HdfsClientImpl { bool Exists(const std::string& path) { // hdfsExists does not distinguish between RPC failure and the file not // existing - int ret = hdfsExists(fs_, path.c_str()); + int ret = driver_->Exists(fs_, path.c_str()); return ret == 0; } Status GetCapacity(int64_t* nbytes) { - tOffset ret = hdfsGetCapacity(fs_); + tOffset ret = driver_->GetCapacity(fs_); CHECK_FAILURE(ret, "GetCapacity"); *nbytes = ret; return Status::OK(); } Status GetUsed(int64_t* nbytes) { - tOffset ret = hdfsGetUsed(fs_); + tOffset ret = driver_->GetUsed(fs_); CHECK_FAILURE(ret, "GetUsed"); *nbytes = ret; return Status::OK(); } Status GetPathInfo(const std::string& path, HdfsPathInfo* info) { - hdfsFileInfo* entry = hdfsGetPathInfo(fs_, path.c_str()); + hdfsFileInfo* entry = driver_->GetPathInfo(fs_, path.c_str()); if (entry == nullptr) { return Status::IOError("HDFS: GetPathInfo failed"); } SetPathInfo(entry, info); - hdfsFreeFileInfo(entry, 1); + driver_->FreeFileInfo(entry, 1); return Status::OK(); } Status ListDirectory(const std::string& path, std::vector* listing) { int num_entries = 0; - hdfsFileInfo* entries = hdfsListDirectory(fs_, path.c_str(), &num_entries); + hdfsFileInfo* entries = driver_->ListDirectory(fs_, path.c_str(), &num_entries); if (entries == nullptr) { // If the directory is empty, entries is NULL but errno is 0. 
Non-zero @@ -391,14 +410,14 @@ class HdfsClient::HdfsClientImpl { } // Free libhdfs file info - hdfsFreeFileInfo(entries, num_entries); + driver_->FreeFileInfo(entries, num_entries); return Status::OK(); } Status OpenReadable(const std::string& path, int32_t buffer_size, std::shared_ptr* file) { - hdfsFile handle = hdfsOpenFile(fs_, path.c_str(), O_RDONLY, buffer_size, 0, 0); + hdfsFile handle = driver_->OpenFile(fs_, path.c_str(), O_RDONLY, buffer_size, 0, 0); if (handle == nullptr) { // TODO(wesm): determine cause of failure @@ -409,7 +428,7 @@ class HdfsClient::HdfsClientImpl { // std::make_shared does not work with private ctors *file = std::shared_ptr(new HdfsReadableFile()); - (*file)->impl_->set_members(path, fs_, handle); + (*file)->impl_->set_members(path, driver_, fs_, handle); (*file)->impl_->set_buffer_size(buffer_size); return Status::OK(); @@ -421,7 +440,7 @@ class HdfsClient::HdfsClientImpl { int flags = O_WRONLY; if (append) flags |= O_APPEND; - hdfsFile handle = hdfsOpenFile( + hdfsFile handle = driver_->OpenFile( fs_, path.c_str(), flags, buffer_size, replication, default_block_size); if (handle == nullptr) { @@ -433,18 +452,20 @@ class HdfsClient::HdfsClientImpl { // std::make_shared does not work with private ctors *file = std::shared_ptr(new HdfsOutputStream()); - (*file)->impl_->set_members(path, fs_, handle); + (*file)->impl_->set_members(path, driver_, fs_, handle); return Status::OK(); } Status Rename(const std::string& src, const std::string& dst) { - int ret = hdfsRename(fs_, src.c_str(), dst.c_str()); + int ret = driver_->Rename(fs_, src.c_str(), dst.c_str()); CHECK_FAILURE(ret, "Rename"); return Status::OK(); } private: + LibHdfsShim* driver_; + std::string namenode_host_; std::string user_; int port_; @@ -530,5 +551,18 @@ Status HdfsClient::Rename(const std::string& src, const std::string& dst) { return impl_->Rename(src, dst); } +// ---------------------------------------------------------------------- +// Allow public API users to check whether we are set up correctly + +Status HaveLibHdfs() { + LibHdfsShim* driver; + return ConnectLibHdfs(&driver); +} + +Status HaveLibHdfs3() { + LibHdfsShim* driver; + return ConnectLibHdfs3(&driver); +} + } // namespace io } // namespace arrow diff --git a/cpp/src/arrow/io/hdfs.h b/cpp/src/arrow/io/hdfs.h index 1c76f15c397ce..5cc783e475967 100644 --- a/cpp/src/arrow/io/hdfs.h +++ b/cpp/src/arrow/io/hdfs.h @@ -56,11 +56,14 @@ struct HdfsPathInfo { int16_t permissions; }; +enum class HdfsDriver : char { LIBHDFS, LIBHDFS3 }; + struct HdfsConnectionConfig { std::string host; int port; std::string user; std::string kerb_ticket; + HdfsDriver driver; }; class ARROW_EXPORT HdfsClient : public FileSystemClient { @@ -218,7 +221,8 @@ class ARROW_EXPORT HdfsOutputStream : public OutputStream { DISALLOW_COPY_AND_ASSIGN(HdfsOutputStream); }; -Status ARROW_EXPORT ConnectLibHdfs(); +Status ARROW_EXPORT HaveLibHdfs(); +Status ARROW_EXPORT HaveLibHdfs3(); } // namespace io } // namespace arrow diff --git a/cpp/src/arrow/io/io-hdfs-test.cc b/cpp/src/arrow/io/io-hdfs-test.cc index e07eaa3d1b487..4ef47b8babe6e 100644 --- a/cpp/src/arrow/io/io-hdfs-test.cc +++ b/cpp/src/arrow/io/io-hdfs-test.cc @@ -24,6 +24,7 @@ #include // NOLINT +#include "arrow/io/hdfs-internal.h" #include "arrow/io/hdfs.h" #include "arrow/status.h" #include "arrow/test-util.h" @@ -37,6 +38,7 @@ std::vector RandomData(int64_t size) { return buffer; } +template class TestHdfsClient : public ::testing::Test { public: Status MakeScratchDir() { @@ -71,15 +73,34 @@ class 
TestHdfsClient : public ::testing::Test { return ss.str(); } - protected: // Set up shared state between unit tests - static void SetUpTestCase() { - if (!ConnectLibHdfs().ok()) { - std::cout << "Loading libhdfs failed, skipping tests gracefully" << std::endl; - return; + void SetUp() { + LibHdfsShim* driver_shim; + + client_ = nullptr; + scratch_dir_ = + boost::filesystem::unique_path("/tmp/arrow-hdfs/scratch-%%%%").native(); + + loaded_driver_ = false; + + Status msg; + + if (DRIVER::type == HdfsDriver::LIBHDFS) { + msg = ConnectLibHdfs(&driver_shim); + if (!msg.ok()) { + std::cout << "Loading libhdfs failed, skipping tests gracefully" << std::endl; + return; + } + } else { + msg = ConnectLibHdfs3(&driver_shim); + if (!msg.ok()) { + std::cout << "Loading libhdfs3 failed, skipping tests gracefully. " + << msg.ToString() << std::endl; + return; + } } - loaded_libhdfs_ = true; + loaded_driver_ = true; const char* host = std::getenv("ARROW_HDFS_TEST_HOST"); const char* port = std::getenv("ARROW_HDFS_TEST_PORT"); @@ -94,151 +115,159 @@ class TestHdfsClient : public ::testing::Test { ASSERT_OK(HdfsClient::Connect(&conf_, &client_)); } - static void TearDownTestCase() { + void TearDown() { if (client_) { - EXPECT_OK(client_->Delete(scratch_dir_, true)); + if (client_->Exists(scratch_dir_)) { + EXPECT_OK(client_->Delete(scratch_dir_, true)); + } EXPECT_OK(client_->Disconnect()); } } - static bool loaded_libhdfs_; + HdfsConnectionConfig conf_; + bool loaded_driver_; // Resources shared amongst unit tests - static HdfsConnectionConfig conf_; - static std::string scratch_dir_; - static std::shared_ptr client_; + std::string scratch_dir_; + std::shared_ptr client_; }; -bool TestHdfsClient::loaded_libhdfs_ = false; -HdfsConnectionConfig TestHdfsClient::conf_ = HdfsConnectionConfig(); +#define SKIP_IF_NO_DRIVER() \ + if (!this->loaded_driver_) { \ + std::cout << "Driver not loaded, skipping" << std::endl; \ + return; \ + } -std::string TestHdfsClient::scratch_dir_ = - boost::filesystem::unique_path("/tmp/arrow-hdfs/scratch-%%%%").native(); +struct JNIDriver { + static HdfsDriver type; +}; -std::shared_ptr TestHdfsClient::client_ = nullptr; +struct PivotalDriver { + static HdfsDriver type; +}; -#define SKIP_IF_NO_LIBHDFS() \ - if (!loaded_libhdfs_) { \ - std::cout << "No libhdfs, skipping" << std::endl; \ - return; \ - } +HdfsDriver JNIDriver::type = HdfsDriver::LIBHDFS; +HdfsDriver PivotalDriver::type = HdfsDriver::LIBHDFS3; + +typedef ::testing::Types DriverTypes; +TYPED_TEST_CASE(TestHdfsClient, DriverTypes); -TEST_F(TestHdfsClient, ConnectsAgain) { - SKIP_IF_NO_LIBHDFS(); +TYPED_TEST(TestHdfsClient, ConnectsAgain) { + SKIP_IF_NO_DRIVER(); std::shared_ptr client; - ASSERT_OK(HdfsClient::Connect(&conf_, &client)); + ASSERT_OK(HdfsClient::Connect(&this->conf_, &client)); ASSERT_OK(client->Disconnect()); } -TEST_F(TestHdfsClient, CreateDirectory) { - SKIP_IF_NO_LIBHDFS(); +TYPED_TEST(TestHdfsClient, CreateDirectory) { + SKIP_IF_NO_DRIVER(); - std::string path = ScratchPath("create-directory"); + std::string path = this->ScratchPath("create-directory"); - if (client_->Exists(path)) { ASSERT_OK(client_->Delete(path, true)); } + if (this->client_->Exists(path)) { ASSERT_OK(this->client_->Delete(path, true)); } - ASSERT_OK(client_->CreateDirectory(path)); - ASSERT_TRUE(client_->Exists(path)); - EXPECT_OK(client_->Delete(path, true)); - ASSERT_FALSE(client_->Exists(path)); + ASSERT_OK(this->client_->CreateDirectory(path)); + ASSERT_TRUE(this->client_->Exists(path)); + EXPECT_OK(this->client_->Delete(path, 
true)); + ASSERT_FALSE(this->client_->Exists(path)); } -TEST_F(TestHdfsClient, GetCapacityUsed) { - SKIP_IF_NO_LIBHDFS(); +TYPED_TEST(TestHdfsClient, GetCapacityUsed) { + SKIP_IF_NO_DRIVER(); // Who knows what is actually in your DFS cluster, but expect it to have // positive used bytes and capacity int64_t nbytes = 0; - ASSERT_OK(client_->GetCapacity(&nbytes)); + ASSERT_OK(this->client_->GetCapacity(&nbytes)); ASSERT_LT(0, nbytes); - ASSERT_OK(client_->GetUsed(&nbytes)); + ASSERT_OK(this->client_->GetUsed(&nbytes)); ASSERT_LT(0, nbytes); } -TEST_F(TestHdfsClient, GetPathInfo) { - SKIP_IF_NO_LIBHDFS(); +TYPED_TEST(TestHdfsClient, GetPathInfo) { + SKIP_IF_NO_DRIVER(); HdfsPathInfo info; - ASSERT_OK(MakeScratchDir()); + ASSERT_OK(this->MakeScratchDir()); // Directory info - ASSERT_OK(client_->GetPathInfo(scratch_dir_, &info)); + ASSERT_OK(this->client_->GetPathInfo(this->scratch_dir_, &info)); ASSERT_EQ(ObjectType::DIRECTORY, info.kind); - ASSERT_EQ(HdfsAbsPath(scratch_dir_), info.name); - ASSERT_EQ(conf_.user, info.owner); + ASSERT_EQ(this->HdfsAbsPath(this->scratch_dir_), info.name); + ASSERT_EQ(this->conf_.user, info.owner); // TODO(wesm): test group, other attrs - auto path = ScratchPath("test-file"); + auto path = this->ScratchPath("test-file"); const int size = 100; std::vector buffer = RandomData(size); - ASSERT_OK(WriteDummyFile(path, buffer.data(), size)); - ASSERT_OK(client_->GetPathInfo(path, &info)); + ASSERT_OK(this->WriteDummyFile(path, buffer.data(), size)); + ASSERT_OK(this->client_->GetPathInfo(path, &info)); ASSERT_EQ(ObjectType::FILE, info.kind); - ASSERT_EQ(HdfsAbsPath(path), info.name); - ASSERT_EQ(conf_.user, info.owner); + ASSERT_EQ(this->HdfsAbsPath(path), info.name); + ASSERT_EQ(this->conf_.user, info.owner); ASSERT_EQ(size, info.size); } -TEST_F(TestHdfsClient, AppendToFile) { - SKIP_IF_NO_LIBHDFS(); +TYPED_TEST(TestHdfsClient, AppendToFile) { + SKIP_IF_NO_DRIVER(); - ASSERT_OK(MakeScratchDir()); + ASSERT_OK(this->MakeScratchDir()); - auto path = ScratchPath("test-file"); + auto path = this->ScratchPath("test-file"); const int size = 100; std::vector buffer = RandomData(size); - ASSERT_OK(WriteDummyFile(path, buffer.data(), size)); + ASSERT_OK(this->WriteDummyFile(path, buffer.data(), size)); // now append - ASSERT_OK(WriteDummyFile(path, buffer.data(), size, true)); + ASSERT_OK(this->WriteDummyFile(path, buffer.data(), size, true)); HdfsPathInfo info; - ASSERT_OK(client_->GetPathInfo(path, &info)); + ASSERT_OK(this->client_->GetPathInfo(path, &info)); ASSERT_EQ(size * 2, info.size); } -TEST_F(TestHdfsClient, ListDirectory) { - SKIP_IF_NO_LIBHDFS(); +TYPED_TEST(TestHdfsClient, ListDirectory) { + SKIP_IF_NO_DRIVER(); const int size = 100; std::vector data = RandomData(size); - auto p1 = ScratchPath("test-file-1"); - auto p2 = ScratchPath("test-file-2"); - auto d1 = ScratchPath("test-dir-1"); + auto p1 = this->ScratchPath("test-file-1"); + auto p2 = this->ScratchPath("test-file-2"); + auto d1 = this->ScratchPath("test-dir-1"); - ASSERT_OK(MakeScratchDir()); - ASSERT_OK(WriteDummyFile(p1, data.data(), size)); - ASSERT_OK(WriteDummyFile(p2, data.data(), size / 2)); - ASSERT_OK(client_->CreateDirectory(d1)); + ASSERT_OK(this->MakeScratchDir()); + ASSERT_OK(this->WriteDummyFile(p1, data.data(), size)); + ASSERT_OK(this->WriteDummyFile(p2, data.data(), size / 2)); + ASSERT_OK(this->client_->CreateDirectory(d1)); std::vector listing; - ASSERT_OK(client_->ListDirectory(scratch_dir_, &listing)); + ASSERT_OK(this->client_->ListDirectory(this->scratch_dir_, &listing)); // Do it 
again, appends! - ASSERT_OK(client_->ListDirectory(scratch_dir_, &listing)); + ASSERT_OK(this->client_->ListDirectory(this->scratch_dir_, &listing)); ASSERT_EQ(6, static_cast(listing.size())); // Argh, well, shouldn't expect the listing to be in any particular order for (size_t i = 0; i < listing.size(); ++i) { const HdfsPathInfo& info = listing[i]; - if (info.name == HdfsAbsPath(p1)) { + if (info.name == this->HdfsAbsPath(p1)) { ASSERT_EQ(ObjectType::FILE, info.kind); ASSERT_EQ(size, info.size); - } else if (info.name == HdfsAbsPath(p2)) { + } else if (info.name == this->HdfsAbsPath(p2)) { ASSERT_EQ(ObjectType::FILE, info.kind); ASSERT_EQ(size / 2, info.size); - } else if (info.name == HdfsAbsPath(d1)) { + } else if (info.name == this->HdfsAbsPath(d1)) { ASSERT_EQ(ObjectType::DIRECTORY, info.kind); } else { FAIL() << "Unexpected path: " << info.name; @@ -246,19 +275,19 @@ TEST_F(TestHdfsClient, ListDirectory) { } } -TEST_F(TestHdfsClient, ReadableMethods) { - SKIP_IF_NO_LIBHDFS(); +TYPED_TEST(TestHdfsClient, ReadableMethods) { + SKIP_IF_NO_DRIVER(); - ASSERT_OK(MakeScratchDir()); + ASSERT_OK(this->MakeScratchDir()); - auto path = ScratchPath("test-file"); + auto path = this->ScratchPath("test-file"); const int size = 100; std::vector data = RandomData(size); - ASSERT_OK(WriteDummyFile(path, data.data(), size)); + ASSERT_OK(this->WriteDummyFile(path, data.data(), size)); std::shared_ptr file; - ASSERT_OK(client_->OpenReadable(path, &file)); + ASSERT_OK(this->client_->OpenReadable(path, &file)); // Test GetSize -- move this into its own unit test if ever needed int64_t file_size; @@ -293,19 +322,19 @@ TEST_F(TestHdfsClient, ReadableMethods) { ASSERT_EQ(60, position); } -TEST_F(TestHdfsClient, LargeFile) { - SKIP_IF_NO_LIBHDFS(); +TYPED_TEST(TestHdfsClient, LargeFile) { + SKIP_IF_NO_DRIVER(); - ASSERT_OK(MakeScratchDir()); + ASSERT_OK(this->MakeScratchDir()); - auto path = ScratchPath("test-large-file"); + auto path = this->ScratchPath("test-large-file"); const int size = 1000000; std::vector data = RandomData(size); - ASSERT_OK(WriteDummyFile(path, data.data(), size)); + ASSERT_OK(this->WriteDummyFile(path, data.data(), size)); std::shared_ptr file; - ASSERT_OK(client_->OpenReadable(path, &file)); + ASSERT_OK(this->client_->OpenReadable(path, &file)); auto buffer = std::make_shared(); ASSERT_OK(buffer->Resize(size)); @@ -317,7 +346,7 @@ TEST_F(TestHdfsClient, LargeFile) { // explicit buffer size std::shared_ptr file2; - ASSERT_OK(client_->OpenReadable(path, 1 << 18, &file2)); + ASSERT_OK(this->client_->OpenReadable(path, 1 << 18, &file2)); auto buffer2 = std::make_shared(); ASSERT_OK(buffer2->Resize(size)); @@ -326,22 +355,22 @@ TEST_F(TestHdfsClient, LargeFile) { ASSERT_EQ(size, bytes_read); } -TEST_F(TestHdfsClient, RenameFile) { - SKIP_IF_NO_LIBHDFS(); +TYPED_TEST(TestHdfsClient, RenameFile) { + SKIP_IF_NO_DRIVER(); - ASSERT_OK(MakeScratchDir()); + ASSERT_OK(this->MakeScratchDir()); - auto src_path = ScratchPath("src-file"); - auto dst_path = ScratchPath("dst-file"); + auto src_path = this->ScratchPath("src-file"); + auto dst_path = this->ScratchPath("dst-file"); const int size = 100; std::vector data = RandomData(size); - ASSERT_OK(WriteDummyFile(src_path, data.data(), size)); + ASSERT_OK(this->WriteDummyFile(src_path, data.data(), size)); - ASSERT_OK(client_->Rename(src_path, dst_path)); + ASSERT_OK(this->client_->Rename(src_path, dst_path)); - ASSERT_FALSE(client_->Exists(src_path)); - ASSERT_TRUE(client_->Exists(dst_path)); + ASSERT_FALSE(this->client_->Exists(src_path)); + 
ASSERT_TRUE(this->client_->Exists(dst_path)); } } // namespace io diff --git a/cpp/src/arrow/io/libhdfs_shim.cc b/cpp/src/arrow/io/libhdfs_shim.cc deleted file mode 100644 index 3715376ebb95b..0000000000000 --- a/cpp/src/arrow/io/libhdfs_shim.cc +++ /dev/null @@ -1,582 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -// This shim interface to libhdfs (for runtime shared library loading) has been -// adapted from the SFrame project, released under the ASF-compatible 3-clause -// BSD license -// -// Using this required having the $JAVA_HOME and $HADOOP_HOME environment -// variables set, so that libjvm and libhdfs can be located easily - -// Copyright (C) 2015 Dato, Inc. -// All rights reserved. -// -// This software may be modified and distributed under the terms -// of the BSD license. See the LICENSE file for details. - -#ifdef HAS_HADOOP - -#ifndef _WIN32 -#include -#else -#include -#include - -// TODO(wesm): address when/if we add windows support -// #include -#endif - -extern "C" { -#include -} - -#include -#include -#include -#include -#include -#include - -#include // NOLINT - -#include "arrow/status.h" -#include "arrow/util/visibility.h" - -namespace fs = boost::filesystem; - -extern "C" { - -#ifndef _WIN32 -static void* libhdfs_handle = NULL; -static void* libjvm_handle = NULL; -#else -static HINSTANCE libhdfs_handle = NULL; -static HINSTANCE libjvm_handle = NULL; -#endif -/* - * All the shim pointers - */ - -// NOTE(wesm): cpplint does not like use of short and other imprecise C types - -static hdfsBuilder* (*ptr_hdfsNewBuilder)(void) = NULL; -static void (*ptr_hdfsBuilderSetNameNode)(hdfsBuilder* bld, const char* nn) = NULL; -static void (*ptr_hdfsBuilderSetNameNodePort)(hdfsBuilder* bld, tPort port) = NULL; -static void (*ptr_hdfsBuilderSetUserName)(hdfsBuilder* bld, const char* userName) = NULL; -static void (*ptr_hdfsBuilderSetKerbTicketCachePath)( - hdfsBuilder* bld, const char* kerbTicketCachePath) = NULL; -static hdfsFS (*ptr_hdfsBuilderConnect)(hdfsBuilder* bld) = NULL; - -static int (*ptr_hdfsDisconnect)(hdfsFS fs) = NULL; - -static hdfsFile (*ptr_hdfsOpenFile)(hdfsFS fs, const char* path, int flags, - int bufferSize, short replication, tSize blocksize) = NULL; // NOLINT - -static int (*ptr_hdfsCloseFile)(hdfsFS fs, hdfsFile file) = NULL; -static int (*ptr_hdfsExists)(hdfsFS fs, const char* path) = NULL; -static int (*ptr_hdfsSeek)(hdfsFS fs, hdfsFile file, tOffset desiredPos) = NULL; -static tOffset (*ptr_hdfsTell)(hdfsFS fs, hdfsFile file) = NULL; -static tSize (*ptr_hdfsRead)(hdfsFS fs, hdfsFile file, void* buffer, tSize length) = NULL; -static tSize (*ptr_hdfsPread)( - hdfsFS fs, hdfsFile file, tOffset position, void* buffer, tSize length) = NULL; -static tSize (*ptr_hdfsWrite)( - hdfsFS fs, hdfsFile file, 
const void* buffer, tSize length) = NULL; -static int (*ptr_hdfsFlush)(hdfsFS fs, hdfsFile file) = NULL; -static int (*ptr_hdfsAvailable)(hdfsFS fs, hdfsFile file) = NULL; -static int (*ptr_hdfsCopy)( - hdfsFS srcFS, const char* src, hdfsFS dstFS, const char* dst) = NULL; -static int (*ptr_hdfsMove)( - hdfsFS srcFS, const char* src, hdfsFS dstFS, const char* dst) = NULL; -static int (*ptr_hdfsDelete)(hdfsFS fs, const char* path, int recursive) = NULL; -static int (*ptr_hdfsRename)(hdfsFS fs, const char* oldPath, const char* newPath) = NULL; -static char* (*ptr_hdfsGetWorkingDirectory)( - hdfsFS fs, char* buffer, size_t bufferSize) = NULL; -static int (*ptr_hdfsSetWorkingDirectory)(hdfsFS fs, const char* path) = NULL; -static int (*ptr_hdfsCreateDirectory)(hdfsFS fs, const char* path) = NULL; -static int (*ptr_hdfsSetReplication)( - hdfsFS fs, const char* path, int16_t replication) = NULL; -static hdfsFileInfo* (*ptr_hdfsListDirectory)( - hdfsFS fs, const char* path, int* numEntries) = NULL; -static hdfsFileInfo* (*ptr_hdfsGetPathInfo)(hdfsFS fs, const char* path) = NULL; -static void (*ptr_hdfsFreeFileInfo)(hdfsFileInfo* hdfsFileInfo, int numEntries) = NULL; -static char*** (*ptr_hdfsGetHosts)( - hdfsFS fs, const char* path, tOffset start, tOffset length) = NULL; -static void (*ptr_hdfsFreeHosts)(char*** blockHosts) = NULL; -static tOffset (*ptr_hdfsGetDefaultBlockSize)(hdfsFS fs) = NULL; -static tOffset (*ptr_hdfsGetCapacity)(hdfsFS fs) = NULL; -static tOffset (*ptr_hdfsGetUsed)(hdfsFS fs) = NULL; -static int (*ptr_hdfsChown)( - hdfsFS fs, const char* path, const char* owner, const char* group) = NULL; -static int (*ptr_hdfsChmod)(hdfsFS fs, const char* path, short mode) = NULL; // NOLINT -static int (*ptr_hdfsUtime)(hdfsFS fs, const char* path, tTime mtime, tTime atime) = NULL; - -// Helper functions for dlopens -static std::vector get_potential_libjvm_paths(); -static std::vector get_potential_libhdfs_paths(); -static arrow::Status try_dlopen(std::vector potential_paths, const char* name, -#ifndef _WIN32 - void*& out_handle); -#else - HINSTANCE& out_handle); -#endif - -#define GET_SYMBOL(SYMBOL_NAME) \ - if (!ptr_##SYMBOL_NAME) { \ - *reinterpret_cast(&ptr_##SYMBOL_NAME) = get_symbol("" #SYMBOL_NAME); \ - } - -static void* get_symbol(const char* symbol) { - if (libhdfs_handle == NULL) return NULL; -#ifndef _WIN32 - return dlsym(libhdfs_handle, symbol); -#else - - void* ret = reinterpret_cast(GetProcAddress(libhdfs_handle, symbol)); - if (ret == NULL) { - // logstream(LOG_INFO) << "GetProcAddress error: " - // << get_last_err_str(GetLastError()) << std::endl; - } - return ret; -#endif -} - -hdfsBuilder* hdfsNewBuilder(void) { - return ptr_hdfsNewBuilder(); -} - -void hdfsBuilderSetNameNode(hdfsBuilder* bld, const char* nn) { - ptr_hdfsBuilderSetNameNode(bld, nn); -} - -void hdfsBuilderSetNameNodePort(hdfsBuilder* bld, tPort port) { - ptr_hdfsBuilderSetNameNodePort(bld, port); -} - -void hdfsBuilderSetUserName(hdfsBuilder* bld, const char* userName) { - ptr_hdfsBuilderSetUserName(bld, userName); -} - -void hdfsBuilderSetKerbTicketCachePath( - hdfsBuilder* bld, const char* kerbTicketCachePath) { - ptr_hdfsBuilderSetKerbTicketCachePath(bld, kerbTicketCachePath); -} - -hdfsFS hdfsBuilderConnect(hdfsBuilder* bld) { - return ptr_hdfsBuilderConnect(bld); -} - -int hdfsDisconnect(hdfsFS fs) { - return ptr_hdfsDisconnect(fs); -} - -hdfsFile hdfsOpenFile(hdfsFS fs, const char* path, int flags, int bufferSize, - short replication, tSize blocksize) { // NOLINT - return ptr_hdfsOpenFile(fs, path, 
flags, bufferSize, replication, blocksize); -} - -int hdfsCloseFile(hdfsFS fs, hdfsFile file) { - return ptr_hdfsCloseFile(fs, file); -} - -int hdfsExists(hdfsFS fs, const char* path) { - return ptr_hdfsExists(fs, path); -} - -int hdfsSeek(hdfsFS fs, hdfsFile file, tOffset desiredPos) { - return ptr_hdfsSeek(fs, file, desiredPos); -} - -tOffset hdfsTell(hdfsFS fs, hdfsFile file) { - return ptr_hdfsTell(fs, file); -} - -tSize hdfsRead(hdfsFS fs, hdfsFile file, void* buffer, tSize length) { - return ptr_hdfsRead(fs, file, buffer, length); -} - -tSize hdfsPread(hdfsFS fs, hdfsFile file, tOffset position, void* buffer, tSize length) { - return ptr_hdfsPread(fs, file, position, buffer, length); -} - -tSize hdfsWrite(hdfsFS fs, hdfsFile file, const void* buffer, tSize length) { - return ptr_hdfsWrite(fs, file, buffer, length); -} - -int hdfsFlush(hdfsFS fs, hdfsFile file) { - return ptr_hdfsFlush(fs, file); -} - -int hdfsAvailable(hdfsFS fs, hdfsFile file) { - GET_SYMBOL(hdfsAvailable); - if (ptr_hdfsAvailable) - return ptr_hdfsAvailable(fs, file); - else - return 0; -} - -int hdfsCopy(hdfsFS srcFS, const char* src, hdfsFS dstFS, const char* dst) { - GET_SYMBOL(hdfsCopy); - if (ptr_hdfsCopy) - return ptr_hdfsCopy(srcFS, src, dstFS, dst); - else - return 0; -} - -int hdfsMove(hdfsFS srcFS, const char* src, hdfsFS dstFS, const char* dst) { - GET_SYMBOL(hdfsMove); - if (ptr_hdfsMove) - return ptr_hdfsMove(srcFS, src, dstFS, dst); - else - return 0; -} - -int hdfsDelete(hdfsFS fs, const char* path, int recursive) { - return ptr_hdfsDelete(fs, path, recursive); -} - -int hdfsRename(hdfsFS fs, const char* oldPath, const char* newPath) { - GET_SYMBOL(hdfsRename); - if (ptr_hdfsRename) - return ptr_hdfsRename(fs, oldPath, newPath); - else - return 0; -} - -char* hdfsGetWorkingDirectory(hdfsFS fs, char* buffer, size_t bufferSize) { - GET_SYMBOL(hdfsGetWorkingDirectory); - if (ptr_hdfsGetWorkingDirectory) { - return ptr_hdfsGetWorkingDirectory(fs, buffer, bufferSize); - } else { - return NULL; - } -} - -int hdfsSetWorkingDirectory(hdfsFS fs, const char* path) { - GET_SYMBOL(hdfsSetWorkingDirectory); - if (ptr_hdfsSetWorkingDirectory) { - return ptr_hdfsSetWorkingDirectory(fs, path); - } else { - return 0; - } -} - -int hdfsCreateDirectory(hdfsFS fs, const char* path) { - return ptr_hdfsCreateDirectory(fs, path); -} - -int hdfsSetReplication(hdfsFS fs, const char* path, int16_t replication) { - GET_SYMBOL(hdfsSetReplication); - if (ptr_hdfsSetReplication) { - return ptr_hdfsSetReplication(fs, path, replication); - } else { - return 0; - } -} - -hdfsFileInfo* hdfsListDirectory(hdfsFS fs, const char* path, int* numEntries) { - return ptr_hdfsListDirectory(fs, path, numEntries); -} - -hdfsFileInfo* hdfsGetPathInfo(hdfsFS fs, const char* path) { - return ptr_hdfsGetPathInfo(fs, path); -} - -void hdfsFreeFileInfo(hdfsFileInfo* hdfsFileInfo, int numEntries) { - ptr_hdfsFreeFileInfo(hdfsFileInfo, numEntries); -} - -char*** hdfsGetHosts(hdfsFS fs, const char* path, tOffset start, tOffset length) { - GET_SYMBOL(hdfsGetHosts); - if (ptr_hdfsGetHosts) { - return ptr_hdfsGetHosts(fs, path, start, length); - } else { - return NULL; - } -} - -void hdfsFreeHosts(char*** blockHosts) { - GET_SYMBOL(hdfsFreeHosts); - if (ptr_hdfsFreeHosts) { ptr_hdfsFreeHosts(blockHosts); } -} - -tOffset hdfsGetDefaultBlockSize(hdfsFS fs) { - GET_SYMBOL(hdfsGetDefaultBlockSize); - if (ptr_hdfsGetDefaultBlockSize) { - return ptr_hdfsGetDefaultBlockSize(fs); - } else { - return 0; - } -} - -tOffset hdfsGetCapacity(hdfsFS fs) { - return 
ptr_hdfsGetCapacity(fs); -} - -tOffset hdfsGetUsed(hdfsFS fs) { - return ptr_hdfsGetUsed(fs); -} - -int hdfsChown(hdfsFS fs, const char* path, const char* owner, const char* group) { - GET_SYMBOL(hdfsChown); - if (ptr_hdfsChown) { - return ptr_hdfsChown(fs, path, owner, group); - } else { - return 0; - } -} - -int hdfsChmod(hdfsFS fs, const char* path, short mode) { // NOLINT - GET_SYMBOL(hdfsChmod); - if (ptr_hdfsChmod) { - return ptr_hdfsChmod(fs, path, mode); - } else { - return 0; - } -} - -int hdfsUtime(hdfsFS fs, const char* path, tTime mtime, tTime atime) { - GET_SYMBOL(hdfsUtime); - if (ptr_hdfsUtime) { - return ptr_hdfsUtime(fs, path, mtime, atime); - } else { - return 0; - } -} - -static std::vector get_potential_libhdfs_paths() { - std::vector libhdfs_potential_paths; - std::string file_name; - -// OS-specific file name -#ifdef __WIN32 - file_name = "hdfs.dll"; -#elif __APPLE__ - file_name = "libhdfs.dylib"; -#else - file_name = "libhdfs.so"; -#endif - - // Common paths - std::vector search_paths = {fs::path(""), fs::path(".")}; - - // Path from environment variable - const char* hadoop_home = std::getenv("HADOOP_HOME"); - if (hadoop_home != nullptr) { - auto path = fs::path(hadoop_home) / "lib/native"; - search_paths.push_back(path); - } - - const char* libhdfs_dir = std::getenv("ARROW_LIBHDFS_DIR"); - if (libhdfs_dir != nullptr) { search_paths.push_back(fs::path(libhdfs_dir)); } - - // All paths with file name - for (auto& path : search_paths) { - libhdfs_potential_paths.push_back(path / file_name); - } - - return libhdfs_potential_paths; -} - -static std::vector get_potential_libjvm_paths() { - std::vector libjvm_potential_paths; - - std::vector search_prefixes; - std::vector search_suffixes; - std::string file_name; - -// From heuristics -#ifdef __WIN32 - search_prefixes = {""}; - search_suffixes = {"/jre/bin/server", "/bin/server"}; - file_name = "jvm.dll"; -#elif __APPLE__ - search_prefixes = {""}; - search_suffixes = {"", "/jre/lib/server"}; - file_name = "libjvm.dylib"; - -// SFrame uses /usr/libexec/java_home to find JAVA_HOME; for now we are -// expecting users to set an environment variable -#else - search_prefixes = { - "/usr/lib/jvm/default-java", // ubuntu / debian distros - "/usr/lib/jvm/java", // rhel6 - "/usr/lib/jvm", // centos6 - "/usr/lib64/jvm", // opensuse 13 - "/usr/local/lib/jvm/default-java", // alt ubuntu / debian distros - "/usr/local/lib/jvm/java", // alt rhel6 - "/usr/local/lib/jvm", // alt centos6 - "/usr/local/lib64/jvm", // alt opensuse 13 - "/usr/local/lib/jvm/java-7-openjdk-amd64", // alt ubuntu / debian distros - "/usr/lib/jvm/java-7-openjdk-amd64", // alt ubuntu / debian distros - "/usr/local/lib/jvm/java-6-openjdk-amd64", // alt ubuntu / debian distros - "/usr/lib/jvm/java-6-openjdk-amd64", // alt ubuntu / debian distros - "/usr/lib/jvm/java-7-oracle", // alt ubuntu - "/usr/lib/jvm/java-8-oracle", // alt ubuntu - "/usr/lib/jvm/java-6-oracle", // alt ubuntu - "/usr/local/lib/jvm/java-7-oracle", // alt ubuntu - "/usr/local/lib/jvm/java-8-oracle", // alt ubuntu - "/usr/local/lib/jvm/java-6-oracle", // alt ubuntu - "/usr/lib/jvm/default", // alt centos - "/usr/java/latest", // alt centos - }; - search_suffixes = {"/jre/lib/amd64/server"}; - file_name = "libjvm.so"; -#endif - // From direct environment variable - char* env_value = NULL; - if ((env_value = getenv("JAVA_HOME")) != NULL) { - // logstream(LOG_INFO) << "Found environment variable " << env_name << ": " << - // env_value << std::endl; - search_prefixes.insert(search_prefixes.begin(), 
env_value); - } - - // Generate cross product between search_prefixes, search_suffixes, and file_name - for (auto& prefix : search_prefixes) { - for (auto& suffix : search_suffixes) { - auto path = (fs::path(prefix) / fs::path(suffix) / fs::path(file_name)); - libjvm_potential_paths.push_back(path); - } - } - - return libjvm_potential_paths; -} - -#ifndef _WIN32 -static arrow::Status try_dlopen( - std::vector potential_paths, const char* name, void*& out_handle) { - std::vector error_messages; - - for (auto& i : potential_paths) { - i.make_preferred(); - // logstream(LOG_INFO) << "Trying " << i.string().c_str() << std::endl; - out_handle = dlopen(i.native().c_str(), RTLD_NOW | RTLD_LOCAL); - - if (out_handle != NULL) { - // logstream(LOG_INFO) << "Success!" << std::endl; - break; - } else { - const char* err_msg = dlerror(); - if (err_msg != NULL) { - error_messages.push_back(std::string(err_msg)); - } else { - error_messages.push_back(std::string(" returned NULL")); - } - } - } - - if (out_handle == NULL) { - std::stringstream ss; - ss << "Unable to load " << name; - return arrow::Status::IOError(ss.str()); - } - - return arrow::Status::OK(); -} - -#else -static arrow::Status try_dlopen( - std::vector potential_paths, const char* name, HINSTANCE& out_handle) { - std::vector error_messages; - - for (auto& i : potential_paths) { - i.make_preferred(); - // logstream(LOG_INFO) << "Trying " << i.string().c_str() << std::endl; - - out_handle = LoadLibrary(i.string().c_str()); - - if (out_handle != NULL) { - // logstream(LOG_INFO) << "Success!" << std::endl; - break; - } else { - // error_messages.push_back(get_last_err_str(GetLastError())); - } - } - - if (out_handle == NULL) { - std::stringstream ss; - ss << "Unable to load " << name; - return arrow::Status::IOError(ss.str()); - } - - return arrow::Status::OK(); -} -#endif // _WIN32 - -} // extern "C" - -#define GET_SYMBOL_REQUIRED(SYMBOL_NAME) \ - do { \ - if (!ptr_##SYMBOL_NAME) { \ - *reinterpret_cast(&ptr_##SYMBOL_NAME) = get_symbol("" #SYMBOL_NAME); \ - } \ - if (!ptr_##SYMBOL_NAME) \ - return Status::IOError("Getting symbol " #SYMBOL_NAME "failed"); \ - } while (0) - -namespace arrow { -namespace io { - -Status ARROW_EXPORT ConnectLibHdfs() { - static std::mutex lock; - std::lock_guard guard(lock); - - static bool shim_attempted = false; - if (!shim_attempted) { - shim_attempted = true; - - std::vector libjvm_potential_paths = get_potential_libjvm_paths(); - RETURN_NOT_OK(try_dlopen(libjvm_potential_paths, "libjvm", libjvm_handle)); - - std::vector libhdfs_potential_paths = get_potential_libhdfs_paths(); - RETURN_NOT_OK(try_dlopen(libhdfs_potential_paths, "libhdfs", libhdfs_handle)); - } else if (libhdfs_handle == nullptr) { - return Status::IOError("Prior attempt to load libhdfs failed"); - } - - GET_SYMBOL_REQUIRED(hdfsNewBuilder); - GET_SYMBOL_REQUIRED(hdfsBuilderSetNameNode); - GET_SYMBOL_REQUIRED(hdfsBuilderSetNameNodePort); - GET_SYMBOL_REQUIRED(hdfsBuilderSetUserName); - GET_SYMBOL_REQUIRED(hdfsBuilderSetKerbTicketCachePath); - GET_SYMBOL_REQUIRED(hdfsBuilderConnect); - GET_SYMBOL_REQUIRED(hdfsCreateDirectory); - GET_SYMBOL_REQUIRED(hdfsDelete); - GET_SYMBOL_REQUIRED(hdfsDisconnect); - GET_SYMBOL_REQUIRED(hdfsExists); - GET_SYMBOL_REQUIRED(hdfsFreeFileInfo); - GET_SYMBOL_REQUIRED(hdfsGetCapacity); - GET_SYMBOL_REQUIRED(hdfsGetUsed); - GET_SYMBOL_REQUIRED(hdfsGetPathInfo); - GET_SYMBOL_REQUIRED(hdfsListDirectory); - - // File methods - GET_SYMBOL_REQUIRED(hdfsCloseFile); - GET_SYMBOL_REQUIRED(hdfsFlush); - 
GET_SYMBOL_REQUIRED(hdfsOpenFile); - GET_SYMBOL_REQUIRED(hdfsRead); - GET_SYMBOL_REQUIRED(hdfsPread); - GET_SYMBOL_REQUIRED(hdfsSeek); - GET_SYMBOL_REQUIRED(hdfsTell); - GET_SYMBOL_REQUIRED(hdfsWrite); - - return Status::OK(); -} - -} // namespace io -} // namespace arrow - -#endif // HAS_HADOOP diff --git a/python/.gitignore b/python/.gitignore index c37efc4b56650..4ab802006914e 100644 --- a/python/.gitignore +++ b/python/.gitignore @@ -16,6 +16,7 @@ Testing/ *.c *.cpp pyarrow/version.py +pyarrow/table_api.h # Python files # setup.py working directory diff --git a/python/pyarrow/includes/libarrow_io.pxd b/python/pyarrow/includes/libarrow_io.pxd index 77034159d2f3a..99f88adf81d2b 100644 --- a/python/pyarrow/includes/libarrow_io.pxd +++ b/python/pyarrow/includes/libarrow_io.pxd @@ -87,13 +87,19 @@ cdef extern from "arrow/io/file.h" namespace "arrow::io" nogil: cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: - CStatus ConnectLibHdfs() + CStatus HaveLibHdfs() + CStatus HaveLibHdfs3() + + enum HdfsDriver" arrow::io::HdfsDriver": + HdfsDriver_LIBHDFS" arrow::io::HdfsDriver::LIBHDFS" + HdfsDriver_LIBHDFS3" arrow::io::HdfsDriver::LIBHDFS3" cdef cppclass HdfsConnectionConfig: c_string host int port c_string user c_string kerb_ticket + HdfsDriver driver cdef cppclass HdfsPathInfo: ObjectType kind; diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index 2fa5fb6b87885..6b0e3924d207c 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -35,6 +35,7 @@ cimport cpython as cp import re import sys import threading +import time cdef class NativeFile: @@ -269,7 +270,15 @@ except ImportError: def have_libhdfs(): try: - check_status(ConnectLibHdfs()) + check_status(HaveLibHdfs()) + return True + except: + return False + + +def have_libhdfs3(): + try: + check_status(HaveLibHdfs3()) return True except: return False @@ -313,7 +322,8 @@ cdef class HdfsClient: raise IOError('HDFS client is closed') @classmethod - def connect(cls, host="default", port=0, user=None, kerb_ticket=None): + def connect(cls, host="default", port=0, user=None, kerb_ticket=None, + driver='libhdfs'): """ Connect to an HDFS cluster. All parameters are optional and should only be set if the defaults need to be overridden. @@ -328,6 +338,9 @@ cdef class HdfsClient: port : NameNode's port. Set to 0 for default or logical (HA) nodes. user : Username when connecting to HDFS; None implies login user. kerb_ticket : Path to Kerberos ticket cache. 
+ driver : {'libhdfs', 'libhdfs3'}, default 'libhdfs' + Connect using libhdfs (JNI-based) or libhdfs3 (3rd-party C++ + library from Pivotal Labs) Notes ----- @@ -350,6 +363,13 @@ cdef class HdfsClient: if kerb_ticket is not None: conf.kerb_ticket = tobytes(kerb_ticket) + if driver == 'libhdfs': + check_status(HaveLibHdfs()) + conf.driver = HdfsDriver_LIBHDFS + else: + check_status(HaveLibHdfs3()) + conf.driver = HdfsDriver_LIBHDFS3 + with nogil: check_status(CHdfsClient.Connect(&conf, &out.client)) out.is_open = True @@ -541,6 +561,12 @@ cdef class HdfsClient: if not buf: break + if writer_thread.is_alive(): + while write_queue.full(): + time.sleep(0.01) + else: + break + write_queue.put_nowait(buf) finally: done = True @@ -609,22 +635,13 @@ cdef class HdfsFile(NativeFile): cdef int64_t total_bytes = 0 - cdef int rpc_chunksize = min(self.buffer_size, nbytes) - try: with nogil: - while total_bytes < nbytes: - check_status(self.rd_file.get() - .Read(rpc_chunksize, &bytes_read, - buf + total_bytes)) - - total_bytes += bytes_read + check_status(self.rd_file.get() + .Read(nbytes, &bytes_read, buf)) - # EOF - if bytes_read == 0: - break result = cp.PyBytes_FromStringAndSize(buf, - total_bytes) + bytes_read) finally: free(buf) diff --git a/python/pyarrow/tests/test_hdfs.py b/python/pyarrow/tests/test_hdfs.py index c23543b7f0d07..73d5a66cf4765 100644 --- a/python/pyarrow/tests/test_hdfs.py +++ b/python/pyarrow/tests/test_hdfs.py @@ -19,6 +19,7 @@ from os.path import join as pjoin import os import random +import unittest import pytest @@ -28,7 +29,7 @@ # HDFS tests -def hdfs_test_client(): +def hdfs_test_client(driver='libhdfs'): host = os.environ.get('ARROW_HDFS_TEST_HOST', 'localhost') user = os.environ['ARROW_HDFS_TEST_USER'] try: @@ -37,115 +38,119 @@ def hdfs_test_client(): raise ValueError('Env variable ARROW_HDFS_TEST_PORT was not ' 'an integer') - return io.HdfsClient.connect(host, port, user) + return io.HdfsClient.connect(host, port, user, driver=driver) -libhdfs = pytest.mark.skipif(not io.have_libhdfs(), - reason='No libhdfs available on system') +class HdfsTestCases(object): + def _make_test_file(self, hdfs, test_name, test_path, test_data): + base_path = pjoin(self.tmp_path, test_name) + hdfs.mkdir(base_path) -HDFS_TMP_PATH = '/tmp/pyarrow-test-{0}'.format(random.randint(0, 1000)) + full_path = pjoin(base_path, test_path) + with hdfs.open(full_path, 'wb') as f: + f.write(test_data) -@pytest.fixture(scope='session') -def hdfs(request): - fixture = hdfs_test_client() + return full_path - def teardown(): - fixture.delete(HDFS_TMP_PATH, recursive=True) - fixture.close() - request.addfinalizer(teardown) - return fixture + @classmethod + def setUpClass(cls): + cls.check_driver() + cls.hdfs = hdfs_test_client(cls.DRIVER) + cls.tmp_path = '/tmp/pyarrow-test-{0}'.format(random.randint(0, 1000)) + cls.hdfs.mkdir(cls.tmp_path) + @classmethod + def tearDownClass(cls): + cls.hdfs.delete(cls.tmp_path, recursive=True) + cls.hdfs.close() -@libhdfs -def test_hdfs_close(): - client = hdfs_test_client() - assert client.is_open - client.close() - assert not client.is_open + def test_hdfs_close(self): + client = hdfs_test_client() + assert client.is_open + client.close() + assert not client.is_open - with pytest.raises(Exception): - client.ls('/') + with pytest.raises(Exception): + client.ls('/') + def test_hdfs_mkdir(self): + path = pjoin(self.tmp_path, 'test-dir/test-dir') + parent_path = pjoin(self.tmp_path, 'test-dir') -@libhdfs -def test_hdfs_mkdir(hdfs): - path = pjoin(HDFS_TMP_PATH, 
'test-dir/test-dir') - parent_path = pjoin(HDFS_TMP_PATH, 'test-dir') + self.hdfs.mkdir(path) + assert self.hdfs.exists(path) - hdfs.mkdir(path) - assert hdfs.exists(path) + self.hdfs.delete(parent_path, recursive=True) + assert not self.hdfs.exists(path) - hdfs.delete(parent_path, recursive=True) - assert not hdfs.exists(path) + def test_hdfs_ls(self): + base_path = pjoin(self.tmp_path, 'ls-test') + self.hdfs.mkdir(base_path) + dir_path = pjoin(base_path, 'a-dir') + f1_path = pjoin(base_path, 'a-file-1') -@libhdfs -def test_hdfs_ls(hdfs): - base_path = pjoin(HDFS_TMP_PATH, 'ls-test') - hdfs.mkdir(base_path) + self.hdfs.mkdir(dir_path) - dir_path = pjoin(base_path, 'a-dir') - f1_path = pjoin(base_path, 'a-file-1') + f = self.hdfs.open(f1_path, 'wb') + f.write('a' * 10) - hdfs.mkdir(dir_path) + contents = sorted(self.hdfs.ls(base_path, False)) + assert contents == [dir_path, f1_path] - f = hdfs.open(f1_path, 'wb') - f.write('a' * 10) + def test_hdfs_download_upload(self): + base_path = pjoin(self.tmp_path, 'upload-test') - contents = sorted(hdfs.ls(base_path, False)) - assert contents == [dir_path, f1_path] + data = b'foobarbaz' + buf = BytesIO(data) + buf.seek(0) + self.hdfs.upload(base_path, buf) -def _make_test_file(hdfs, test_name, test_path, test_data): - base_path = pjoin(HDFS_TMP_PATH, test_name) - hdfs.mkdir(base_path) + out_buf = BytesIO() + self.hdfs.download(base_path, out_buf) + out_buf.seek(0) + assert out_buf.getvalue() == data - full_path = pjoin(base_path, test_path) + def test_hdfs_file_context_manager(self): + path = pjoin(self.tmp_path, 'ctx-manager') - f = hdfs.open(full_path, 'wb') - f.write(test_data) + data = b'foo' + with self.hdfs.open(path, 'wb') as f: + f.write(data) - return full_path + with self.hdfs.open(path, 'rb') as f: + assert f.size() == 3 + result = f.read(10) + assert result == data -@libhdfs -def test_hdfs_orphaned_file(): - hdfs = hdfs_test_client() - file_path = _make_test_file(hdfs, 'orphaned_file_test', 'fname', - 'foobarbaz') +class TestLibHdfs(HdfsTestCases, unittest.TestCase): - f = hdfs.open(file_path) - hdfs = None - f = None # noqa + DRIVER = 'libhdfs' + @classmethod + def check_driver(cls): + if not io.have_libhdfs(): + pytest.skip('No libhdfs available on system') -@libhdfs -def test_hdfs_download_upload(hdfs): - base_path = pjoin(HDFS_TMP_PATH, 'upload-test') + def test_hdfs_orphaned_file(self): + hdfs = hdfs_test_client() + file_path = self._make_test_file(hdfs, 'orphaned_file_test', 'fname', + 'foobarbaz') - data = b'foobarbaz' - buf = BytesIO(data) - buf.seek(0) + f = hdfs.open(file_path) + hdfs = None + f = None # noqa - hdfs.upload(base_path, buf) - out_buf = BytesIO() - hdfs.download(base_path, out_buf) - out_buf.seek(0) - assert out_buf.getvalue() == data +class TestLibHdfs3(HdfsTestCases, unittest.TestCase): + DRIVER = 'libhdfs3' -@libhdfs -def test_hdfs_file_context_manager(hdfs): - path = pjoin(HDFS_TMP_PATH, 'ctx-manager') - - data = b'foo' - with hdfs.open(path, 'wb') as f: - f.write(data) - - with hdfs.open(path, 'rb') as f: - assert f.size() == 3 - result = f.read(10) - assert result == data + @classmethod + def check_driver(cls): + if not io.have_libhdfs3(): + pytest.skip('No libhdfs3 available on system') From d7845fcd8b8a06248e42ca083c6460c43723c154 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Mon, 19 Dec 2016 18:44:09 -0500 Subject: [PATCH 0237/1644] ARROW-420: Align DATE type with Java implementation Author: Uwe L. Korn Closes #238 from xhochy/ARROW-420 and squashes the following commits: e497d9f [Uwe L. 
Korn] Add datetime.date parsing for numpy conversion
5c21453 [Uwe L. Korn] Add support for datetime.datetime
6bf346f [Uwe L. Korn] Add datetime.date conversions
6fca4da [Uwe L. Korn] ARROW-420: Align DATE type with Java implementation
---
 cpp/src/arrow/array.cc                       |   1 +
 cpp/src/arrow/array.h                        |   1 +
 cpp/src/arrow/builder.cc                     |   2 +
 cpp/src/arrow/builder.h                      |   1 +
 cpp/src/arrow/type.cc                        |   4 +
 cpp/src/arrow/type.h                         |   4 +-
 cpp/src/arrow/type_fwd.h                     |   4 +-
 cpp/src/arrow/type_traits.h                  |   8 ++
 python/pyarrow/__init__.py                   |   1 +
 python/pyarrow/array.pyx                     |   7 +-
 python/pyarrow/includes/libarrow.pxd         |  16 +++
 python/pyarrow/scalar.pyx                    |  31 ++++++
 python/pyarrow/schema.pyx                    |   7 ++
 python/pyarrow/tests/test_convert_builtin.py |  28 +++++
 python/pyarrow/tests/test_convert_pandas.py  |  15 +++
 python/src/pyarrow/adapters/builtin.cc       |  69 +++++++++++++
 python/src/pyarrow/adapters/pandas.cc        | 103 ++++++++++++++++---
 python/src/pyarrow/helpers.cc                |   6 ++
 python/src/pyarrow/helpers.h                 |   2 +
 python/src/pyarrow/util/datetime.h           |  40 +++++++
 20 files changed, 330 insertions(+), 20 deletions(-)
 create mode 100644 python/src/pyarrow/util/datetime.h

diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc
index 7ab61f59f551b..d13fa1e108196 100644
--- a/cpp/src/arrow/array.cc
+++ b/cpp/src/arrow/array.cc
@@ -148,6 +148,7 @@ template class NumericArray;
 template class NumericArray;
 template class NumericArray;
 template class NumericArray;
+template class NumericArray;
 template class NumericArray;
 template class NumericArray;
 template class NumericArray;

diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h
index 1a4a9237a1f79..26d53f7d75896 100644
--- a/cpp/src/arrow/array.h
+++ b/cpp/src/arrow/array.h
@@ -468,6 +468,7 @@ extern template class ARROW_EXPORT NumericArray;
 extern template class ARROW_EXPORT NumericArray;
 extern template class ARROW_EXPORT NumericArray;
 extern template class ARROW_EXPORT NumericArray;
+extern template class ARROW_EXPORT NumericArray;
 
 #if defined(__GNUC__) && !defined(__clang__)
 #pragma GCC diagnostic pop

diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc
index 493b5e7ccab9e..1d94dbaa0e91d 100644
--- a/cpp/src/arrow/builder.cc
+++ b/cpp/src/arrow/builder.cc
@@ -199,6 +199,7 @@ template class PrimitiveBuilder;
 template class PrimitiveBuilder;
 template class PrimitiveBuilder;
 template class PrimitiveBuilder;
+template class PrimitiveBuilder;
 template class PrimitiveBuilder;
 template class PrimitiveBuilder;
 template class PrimitiveBuilder;
@@ -411,6 +412,7 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type,
     BUILDER_CASE(INT32, Int32Builder);
     BUILDER_CASE(UINT64, UInt64Builder);
     BUILDER_CASE(INT64, Int64Builder);
+    BUILDER_CASE(DATE, DateBuilder);
     BUILDER_CASE(TIMESTAMP, TimestampBuilder);
     BUILDER_CASE(BOOL, BooleanBuilder);

diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h
index 7162d31d2464a..205139849b44e 100644
--- a/cpp/src/arrow/builder.h
+++ b/cpp/src/arrow/builder.h
@@ -220,6 +220,7 @@ using Int16Builder = NumericBuilder;
 using Int32Builder = NumericBuilder;
 using Int64Builder = NumericBuilder;
 using TimestampBuilder = NumericBuilder;
+using DateBuilder = NumericBuilder;
 using HalfFloatBuilder = NumericBuilder;
 using FloatBuilder = NumericBuilder;

diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc
index 5b172e41f6809..4748cc3c04a20 100644
--- a/cpp/src/arrow/type.cc
+++ b/cpp/src/arrow/type.cc
@@ -88,6 +88,10 @@ std::string StructType::ToString() const {
   return s.str();
 }
 
+std::string DateType::ToString() const {
+  return std::string("date");
+}
+ std::string UnionType::ToString() const { std::stringstream s; diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 8637081acd9b7..73005707c9edc 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -413,14 +413,14 @@ struct ARROW_EXPORT UnionType : public DataType { struct ARROW_EXPORT DateType : public FixedWidthType { static constexpr Type::type type_id = Type::DATE; - using c_type = int32_t; + using c_type = int64_t; DateType() : FixedWidthType(Type::DATE) {} int bit_width() const override { return sizeof(c_type) * 8; } Status Accept(TypeVisitor* visitor) const override; - std::string ToString() const override { return name(); } + std::string ToString() const override; static std::string name() { return "date"; } }; diff --git a/cpp/src/arrow/type_fwd.h b/cpp/src/arrow/type_fwd.h index 6d660f4fdee43..a9db32df54dc3 100644 --- a/cpp/src/arrow/type_fwd.h +++ b/cpp/src/arrow/type_fwd.h @@ -87,13 +87,15 @@ _NUMERIC_TYPE_DECL(Double); #undef _NUMERIC_TYPE_DECL struct DateType; -class DateArray; +using DateArray = NumericArray; +using DateBuilder = NumericBuilder; struct TimeType; class TimeArray; struct TimestampType; using TimestampArray = NumericArray; +using TimestampBuilder = NumericBuilder; struct IntervalType; using IntervalArray = NumericArray; diff --git a/cpp/src/arrow/type_traits.h b/cpp/src/arrow/type_traits.h index 3aaec0bd5935a..5616018d93400 100644 --- a/cpp/src/arrow/type_traits.h +++ b/cpp/src/arrow/type_traits.h @@ -90,6 +90,14 @@ struct TypeTraits { static inline int bytes_required(int elements) { return elements * sizeof(int64_t); } }; +template <> +struct TypeTraits { + using ArrayType = DateArray; + // using BuilderType = DateBuilder; + + static inline int bytes_required(int elements) { return elements * sizeof(int64_t); } +}; + template <> struct TypeTraits { using ArrayType = TimestampArray; diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index b9d386195b436..a42e39cf9865c 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -37,6 +37,7 @@ from pyarrow.schema import (null, bool_, int8, int16, int32, int64, uint8, uint16, uint32, uint64, + timestamp, date, float_, double, string, list_, struct, field, DataType, Field, Schema, schema) diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index d44212f4aed63..84f17056a19f7 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -218,6 +218,10 @@ cdef class UInt64Array(NumericArray): pass +cdef class DateArray(NumericArray): + pass + + cdef class FloatArray(NumericArray): pass @@ -245,6 +249,7 @@ cdef dict _array_classes = { Type_INT16: Int16Array, Type_INT32: Int32Array, Type_INT64: Int64Array, + Type_DATE: DateArray, Type_FLOAT: FloatArray, Type_DOUBLE: DoubleArray, Type_LIST: ListArray, @@ -284,7 +289,7 @@ def from_pylist(object list_obj, DataType type=None): if type is None: check_status(pyarrow.ConvertPySequence(list_obj, &sp_array)) else: - raise NotImplementedError + raise NotImplementedError() return box_arrow_array(sp_array) diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 15781ced4433a..419dd74846c92 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -39,11 +39,18 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: Type_DOUBLE" arrow::Type::DOUBLE" Type_TIMESTAMP" arrow::Type::TIMESTAMP" + Type_DATE" arrow::Type::DATE" Type_STRING" arrow::Type::STRING" Type_LIST" arrow::Type::LIST" Type_STRUCT" arrow::Type::STRUCT" + enum 
TimeUnit" arrow::TimeUnit": + TimeUnit_SECOND" arrow::TimeUnit::SECOND" + TimeUnit_MILLI" arrow::TimeUnit::MILLI" + TimeUnit_MICRO" arrow::TimeUnit::MICRO" + TimeUnit_NANO" arrow::TimeUnit::NANO" + cdef cppclass CDataType" arrow::DataType": Type type @@ -74,6 +81,9 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass CStringType" arrow::StringType"(CDataType): pass + cdef cppclass CTimestampType" arrow::TimestampType"(CDataType): + TimeUnit unit + cdef cppclass CField" arrow::Field": c_string name shared_ptr[CDataType] type @@ -132,6 +142,12 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass CInt64Array" arrow::Int64Array"(CArray): int64_t Value(int i) + cdef cppclass CDateArray" arrow::DateArray"(CArray): + int64_t Value(int i) + + cdef cppclass CTimestampArray" arrow::TimestampArray"(CArray): + int64_t Value(int i) + cdef cppclass CFloatArray" arrow::FloatArray"(CArray): float Value(int i) diff --git a/python/pyarrow/scalar.pyx b/python/pyarrow/scalar.pyx index c2d20e460c37c..09f60e2675499 100644 --- a/python/pyarrow/scalar.pyx +++ b/python/pyarrow/scalar.pyx @@ -20,6 +20,9 @@ from pyarrow.schema cimport DataType, box_data_type from pyarrow.compat import frombytes import pyarrow.schema as schema +import datetime + + NA = None cdef class NAType(Scalar): @@ -120,6 +123,32 @@ cdef class UInt64Value(ArrayValue): return ap.Value(self.index) +cdef class DateValue(ArrayValue): + + def as_py(self): + cdef CDateArray* ap = self.sp_array.get() + return datetime.date.fromtimestamp(ap.Value(self.index) / 1000) + + +cdef class TimestampValue(ArrayValue): + + def as_py(self): + cdef: + CTimestampArray* ap = self.sp_array.get() + CTimestampType* dtype = ap.type().get() + int64_t val = ap.Value(self.index) + + if dtype.unit == TimeUnit_SECOND: + return datetime.datetime.utcfromtimestamp(val) + elif dtype.unit == TimeUnit_MILLI: + return datetime.datetime.utcfromtimestamp(float(val) / 1000) + elif dtype.unit == TimeUnit_MICRO: + return datetime.datetime.utcfromtimestamp(float(val) / 1000000) + else: + # TimeUnit_NANO + raise NotImplementedError("Cannot convert nanosecond timestamps to datetime.datetime") + + cdef class FloatValue(ArrayValue): def as_py(self): @@ -184,6 +213,8 @@ cdef dict _scalar_classes = { Type_INT16: Int16Value, Type_INT32: Int32Value, Type_INT64: Int64Value, + Type_DATE: DateValue, + Type_TIMESTAMP: TimestampValue, Type_FLOAT: FloatValue, Type_DOUBLE: DoubleValue, Type_LIST: ListValue, diff --git a/python/pyarrow/schema.pyx b/python/pyarrow/schema.pyx index e0badb9764143..d05ac9ebc015a 100644 --- a/python/pyarrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ -164,6 +164,7 @@ cdef set PRIMITIVE_TYPES = set([ Type_UINT16, Type_INT16, Type_UINT32, Type_INT32, Type_UINT64, Type_INT64, + Type_TIMESTAMP, Type_DATE, Type_FLOAT, Type_DOUBLE]) def null(): @@ -196,6 +197,12 @@ def uint64(): def int64(): return primitive_type(Type_INT64) +def timestamp(): + return primitive_type(Type_TIMESTAMP) + +def date(): + return primitive_type(Type_DATE) + def float_(): return primitive_type(Type_FLOAT) diff --git a/python/pyarrow/tests/test_convert_builtin.py b/python/pyarrow/tests/test_convert_builtin.py index 34371b0bdd7c9..7dc1c1b2a4828 100644 --- a/python/pyarrow/tests/test_convert_builtin.py +++ b/python/pyarrow/tests/test_convert_builtin.py @@ -18,6 +18,7 @@ from pyarrow.compat import unittest import pyarrow +import datetime class TestConvertList(unittest.TestCase): @@ -78,6 +79,33 @@ def test_string(self): assert arr.type == pyarrow.string() assert 
arr.to_pylist() == ['foo', 'bar', None, 'arrow'] + def test_date(self): + data = [datetime.date(2000, 1, 1), None, datetime.date(1970, 1, 1), datetime.date(2040, 2, 26)] + arr = pyarrow.from_pylist(data) + assert len(arr) == 4 + assert arr.type == pyarrow.date() + assert arr.null_count == 1 + assert arr[0].as_py() == datetime.date(2000, 1, 1) + assert arr[1].as_py() is None + assert arr[2].as_py() == datetime.date(1970, 1, 1) + assert arr[3].as_py() == datetime.date(2040, 2, 26) + + def test_timestamp(self): + data = [ + datetime.datetime(2007, 7, 13, 1, 23, 34, 123456), + None, + datetime.datetime(2006, 1, 13, 12, 34, 56, 432539), + datetime.datetime(2010, 8, 13, 5, 46, 57, 437699) + ] + arr = pyarrow.from_pylist(data) + assert len(arr) == 4 + assert arr.type == pyarrow.timestamp() + assert arr.null_count == 1 + assert arr[0].as_py() == datetime.datetime(2007, 7, 13, 1, 23, 34, 123456) + assert arr[1].as_py() is None + assert arr[2].as_py() == datetime.datetime(2006, 1, 13, 12, 34, 56, 432539) + assert arr[3].as_py() == datetime.datetime(2010, 8, 13, 5, 46, 57, 437699) + def test_mixed_nesting_levels(self): pyarrow.from_pylist([1, 2, None]) pyarrow.from_pylist([[1], [2], None]) diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index b527ca7e80816..cf50f3d1c2c7a 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -15,6 +15,7 @@ # specific language governing permissions and limitations # under the License. +import datetime import unittest import numpy as np @@ -204,6 +205,20 @@ def test_timestamps_notimezone_nulls(self): }) self._check_pandas_roundtrip(df, timestamps_to_ms=False) + def test_date(self): + df = pd.DataFrame({ + 'date': [ + datetime.date(2000, 1, 1), + None, + datetime.date(1970, 1, 1), + datetime.date(2040, 2, 26) + ]}) + table = A.from_pandas_dataframe(df) + result = table.to_pandas() + expected = df.copy() + expected['date'] = pd.to_datetime(df['date']) + tm.assert_frame_equal(result, expected) + # def test_category(self): # repeats = 1000 # values = [b'foo', None, u'bar', 'qux', np.nan] diff --git a/python/src/pyarrow/adapters/builtin.cc b/python/src/pyarrow/adapters/builtin.cc index ac2f533c408c7..e0cb7c20be3d5 100644 --- a/python/src/pyarrow/adapters/builtin.cc +++ b/python/src/pyarrow/adapters/builtin.cc @@ -16,6 +16,7 @@ // under the License. 
#include +#include #include #include "pyarrow/adapters/builtin.h" @@ -24,6 +25,7 @@ #include "arrow/status.h" #include "pyarrow/helpers.h" +#include "pyarrow/util/datetime.h" using arrow::ArrayBuilder; using arrow::DataType; @@ -55,6 +57,8 @@ class ScalarVisitor { none_count_(0), bool_count_(0), int_count_(0), + date_count_(0), + timestamp_count_(0), float_count_(0), string_count_(0) {} @@ -68,6 +72,10 @@ class ScalarVisitor { ++float_count_; } else if (IsPyInteger(obj)) { ++int_count_; + } else if (PyDate_CheckExact(obj)) { + ++date_count_; + } else if (PyDateTime_CheckExact(obj)) { + ++timestamp_count_; } else if (IsPyBaseString(obj)) { ++string_count_; } else { @@ -82,6 +90,10 @@ class ScalarVisitor { } else if (int_count_) { // TODO(wesm): tighter type later return INT64; + } else if (date_count_) { + return DATE; + } else if (timestamp_count_) { + return TIMESTAMP_US; } else if (bool_count_) { return BOOL; } else if (string_count_) { @@ -100,6 +112,8 @@ class ScalarVisitor { int64_t none_count_; int64_t bool_count_; int64_t int_count_; + int64_t date_count_; + int64_t timestamp_count_; int64_t float_count_; int64_t string_count_; @@ -297,6 +311,56 @@ class Int64Converter : public TypedConverter { } }; +class DateConverter : public TypedConverter { + public: + Status AppendData(PyObject* seq) override { + Py_ssize_t size = PySequence_Size(seq); + RETURN_NOT_OK(typed_builder_->Reserve(size)); + for (int64_t i = 0; i < size; ++i) { + OwnedRef item(PySequence_GetItem(seq, i)); + if (item.obj() == Py_None) { + typed_builder_->AppendNull(); + } else { + PyDateTime_Date* pydate = reinterpret_cast(item.obj()); + typed_builder_->Append(PyDate_to_ms(pydate)); + } + } + return Status::OK(); + } +}; + +class TimestampConverter : public TypedConverter { + public: + Status AppendData(PyObject* seq) override { + Py_ssize_t size = PySequence_Size(seq); + RETURN_NOT_OK(typed_builder_->Reserve(size)); + for (int64_t i = 0; i < size; ++i) { + OwnedRef item(PySequence_GetItem(seq, i)); + if (item.obj() == Py_None) { + typed_builder_->AppendNull(); + } else { + PyDateTime_DateTime* pydatetime = reinterpret_cast(item.obj()); + struct tm datetime = {0}; + datetime.tm_year = PyDateTime_GET_YEAR(pydatetime) - 1900; + datetime.tm_mon = PyDateTime_GET_MONTH(pydatetime) - 1; + datetime.tm_mday = PyDateTime_GET_DAY(pydatetime); + datetime.tm_hour = PyDateTime_DATE_GET_HOUR(pydatetime); + datetime.tm_min = PyDateTime_DATE_GET_MINUTE(pydatetime); + datetime.tm_sec = PyDateTime_DATE_GET_SECOND(pydatetime); + int us = PyDateTime_DATE_GET_MICROSECOND(pydatetime); + RETURN_IF_PYERROR(); + struct tm epoch = {0}; + epoch.tm_year = 70; + epoch.tm_mday = 1; + // Microseconds since the epoch + int64_t val = lrint(difftime(mktime(&datetime), mktime(&epoch))) * 1000000 + us; + typed_builder_->Append(val); + } + } + return Status::OK(); + } +}; + class DoubleConverter : public TypedConverter { public: Status AppendData(PyObject* seq) override { @@ -379,6 +443,10 @@ std::shared_ptr GetConverter(const std::shared_ptr& type return std::make_shared(); case Type::INT64: return std::make_shared(); + case Type::DATE: + return std::make_shared(); + case Type::TIMESTAMP: + return std::make_shared(); case Type::DOUBLE: return std::make_shared(); case Type::STRING: @@ -409,6 +477,7 @@ Status ListConverter::Init(const std::shared_ptr& builder) { Status ConvertPySequence(PyObject* obj, std::shared_ptr* out) { std::shared_ptr type; int64_t size; + PyDateTime_IMPORT; RETURN_NOT_OK(InferArrowType(obj, &size, &type)); // Handle NA / NullType 
case diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index 64b708695194a..f8dff6d824153 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -35,6 +35,7 @@ #include "pyarrow/common.h" #include "pyarrow/config.h" +#include "pyarrow/util/datetime.h" namespace pyarrow { @@ -167,6 +168,28 @@ class ArrowSerializer { private: Status ConvertData(); + Status ConvertDates(std::shared_ptr* out) { + PyAcquireGIL lock; + + PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + arrow::TypePtr string_type(new arrow::DateType()); + arrow::DateBuilder date_builder(pool_, string_type); + RETURN_NOT_OK(date_builder.Resize(length_)); + + Status s; + PyObject* obj; + for (int64_t i = 0; i < length_; ++i) { + obj = objects[i]; + if (PyDate_CheckExact(obj)) { + PyDateTime_Date* pydate = reinterpret_cast(obj); + date_builder.Append(PyDate_to_ms(pydate)); + } else { + date_builder.AppendNull(); + } + } + return date_builder.Finish(out); + } + Status ConvertObjectStrings(std::shared_ptr* out) { PyAcquireGIL lock; @@ -369,6 +392,10 @@ inline Status ArrowSerializer::Convert(std::shared_ptr* out) // TODO: mask not supported here const PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + { + PyAcquireGIL lock; + PyDateTime_IMPORT; + } for (int64_t i = 0; i < length_; ++i) { if (PyObject_is_null(objects[i])) { @@ -377,6 +404,8 @@ inline Status ArrowSerializer::Convert(std::shared_ptr* out) return ConvertObjectStrings(out); } else if (PyBool_Check(objects[i])) { return ConvertBooleans(out); + } else if (PyDate_CheckExact(objects[i])) { + return ConvertDates(out); } else { return Status::TypeError("unhandled python type"); } @@ -547,6 +576,17 @@ struct arrow_traits { typedef typename npy_traits::value_type T; }; +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_DATETIME; + static constexpr bool supports_nulls = true; + static constexpr int64_t na_value = std::numeric_limits::min(); + static constexpr bool is_boolean = false; + static constexpr bool is_pandas_numeric_not_nullable = false; + static constexpr bool is_pandas_numeric_nullable = true; + typedef typename npy_traits::value_type T; +}; + template <> struct arrow_traits { static constexpr int npy_type = NPY_OBJECT; @@ -567,24 +607,28 @@ static inline PyObject* make_pystring(const uint8_t* data, int32_t length) { inline void set_numpy_metadata(int type, DataType* datatype, PyArrayObject* out) { if (type == NPY_DATETIME) { - auto timestamp_type = static_cast(datatype); - // We only support ms resolution at the moment PyArray_Descr* descr = PyArray_DESCR(out); auto date_dtype = reinterpret_cast(descr->c_metadata); + if (datatype->type == arrow::Type::TIMESTAMP) { + auto timestamp_type = static_cast(datatype); - switch (timestamp_type->unit) { - case arrow::TimestampType::Unit::SECOND: - date_dtype->meta.base = NPY_FR_s; - break; - case arrow::TimestampType::Unit::MILLI: - date_dtype->meta.base = NPY_FR_ms; - break; - case arrow::TimestampType::Unit::MICRO: - date_dtype->meta.base = NPY_FR_us; - break; - case arrow::TimestampType::Unit::NANO: - date_dtype->meta.base = NPY_FR_ns; - break; + switch (timestamp_type->unit) { + case arrow::TimestampType::Unit::SECOND: + date_dtype->meta.base = NPY_FR_s; + break; + case arrow::TimestampType::Unit::MILLI: + date_dtype->meta.base = NPY_FR_ms; + break; + case arrow::TimestampType::Unit::MICRO: + date_dtype->meta.base = NPY_FR_us; + break; + case arrow::TimestampType::Unit::NANO: + date_dtype->meta.base 
= NPY_FR_ns; + break; + } + } else { + // datatype->type == arrow::Type::DATE + date_dtype->meta.base = NPY_FR_D; } } } @@ -666,7 +710,7 @@ class ArrowDeserializer { template inline typename std::enable_if< - arrow_traits::is_pandas_numeric_nullable, Status>::type + (T2 != arrow::Type::DATE) & arrow_traits::is_pandas_numeric_nullable, Status>::type ConvertValues(const std::shared_ptr& data) { typedef typename arrow_traits::T T; size_t chunk_offset = 0; @@ -697,6 +741,32 @@ class ArrowDeserializer { return Status::OK(); } + template + inline typename std::enable_if< + T2 == arrow::Type::DATE, Status>::type + ConvertValues(const std::shared_ptr& data) { + typedef typename arrow_traits::T T; + size_t chunk_offset = 0; + + RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + + for (int c = 0; c < data->num_chunks(); c++) { + const std::shared_ptr arr = data->chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + auto out_values = reinterpret_cast(PyArray_DATA(out_)) + chunk_offset; + + for (int64_t i = 0; i < arr->length(); ++i) { + // There are 1000 * 60 * 60 * 24 = 86400000ms in a day + out_values[i] = arr->IsNull(i) ? arrow_traits::na_value : in_values[i] / 86400000; + } + + chunk_offset += arr->length(); + } + + return Status::OK(); + } + // Integer specialization template inline typename std::enable_if< @@ -879,6 +949,7 @@ Status ConvertColumnToPandas(const std::shared_ptr& col, PyObject* py_re FROM_ARROW_CASE(FLOAT); FROM_ARROW_CASE(DOUBLE); FROM_ARROW_CASE(STRING); + FROM_ARROW_CASE(DATE); FROM_ARROW_CASE(TIMESTAMP); default: return Status::NotImplemented("Arrow type reading not implemented"); diff --git a/python/src/pyarrow/helpers.cc b/python/src/pyarrow/helpers.cc index 08003aabf9f22..af9274484935f 100644 --- a/python/src/pyarrow/helpers.cc +++ b/python/src/pyarrow/helpers.cc @@ -33,6 +33,8 @@ const std::shared_ptr INT8 = std::make_shared(); const std::shared_ptr INT16 = std::make_shared(); const std::shared_ptr INT32 = std::make_shared(); const std::shared_ptr INT64 = std::make_shared(); +const std::shared_ptr DATE = std::make_shared(); +const std::shared_ptr TIMESTAMP_US = std::make_shared(TimeUnit::MICRO); const std::shared_ptr FLOAT = std::make_shared(); const std::shared_ptr DOUBLE = std::make_shared(); const std::shared_ptr STRING = std::make_shared(); @@ -54,6 +56,10 @@ std::shared_ptr GetPrimitiveType(Type::type type) { GET_PRIMITIVE_TYPE(INT32, Int32Type); GET_PRIMITIVE_TYPE(UINT64, UInt64Type); GET_PRIMITIVE_TYPE(INT64, Int64Type); + GET_PRIMITIVE_TYPE(DATE, DateType); + case Type::TIMESTAMP: + return TIMESTAMP_US; + break; GET_PRIMITIVE_TYPE(BOOL, BooleanType); GET_PRIMITIVE_TYPE(FLOAT, FloatType); GET_PRIMITIVE_TYPE(DOUBLE, DoubleType); diff --git a/python/src/pyarrow/helpers.h b/python/src/pyarrow/helpers.h index fa9c713b0c22c..e714bba5db4cc 100644 --- a/python/src/pyarrow/helpers.h +++ b/python/src/pyarrow/helpers.h @@ -38,6 +38,8 @@ extern const std::shared_ptr INT8; extern const std::shared_ptr INT16; extern const std::shared_ptr INT32; extern const std::shared_ptr INT64; +extern const std::shared_ptr DATE; +extern const std::shared_ptr TIMESTAMP_US; extern const std::shared_ptr FLOAT; extern const std::shared_ptr DOUBLE; extern const std::shared_ptr STRING; diff --git a/python/src/pyarrow/util/datetime.h b/python/src/pyarrow/util/datetime.h new file mode 100644 index 0000000000000..b67accc388f59 --- /dev/null +++ b/python/src/pyarrow/util/datetime.h @@ -0,0 +1,40 @@ +// Licensed to the Apache Software 
Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+#ifndef PYARROW_UTIL_DATETIME_H
+#define PYARROW_UTIL_DATETIME_H
+
+#include
+#include
+
+namespace pyarrow {
+
+inline int64_t PyDate_to_ms(PyDateTime_Date* pydate) {
+  struct tm date = {0};
+  date.tm_year = PyDateTime_GET_YEAR(pydate) - 1900;
+  date.tm_mon = PyDateTime_GET_MONTH(pydate) - 1;
+  date.tm_mday = PyDateTime_GET_DAY(pydate);
+  struct tm epoch = {0};
+  epoch.tm_year = 70;
+  epoch.tm_mday = 1;
+  // Milliseconds since the epoch
+  return lrint(difftime(mktime(&date), mktime(&epoch)) * 1000);
+}
+
+}  // namespace pyarrow
+
+#endif  // PYARROW_UTIL_DATETIME_H

From fe53fa409b644ee7b971dc4ed8f877199e91686e Mon Sep 17 00:00:00 2001
From: "Uwe L. Korn"
Date: Tue, 20 Dec 2016 11:40:04 -0500
Subject: [PATCH 0238/1644] ARROW-435: Fix spelling of RAPIDJSON_VENDORED

Author: Uwe L. Korn

Closes #246 from xhochy/ARROW-435 and squashes the following commits:

9fdbfde [Uwe L. Korn] ARROW-435: Fix spelling of RAPIDJSON_VENDORED
---
 cpp/src/arrow/ipc/CMakeLists.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt
index b1669c5f7c239..619ca7c92cb7a 100644
--- a/cpp/src/arrow/ipc/CMakeLists.txt
+++ b/cpp/src/arrow/ipc/CMakeLists.txt
@@ -44,7 +44,7 @@ set(ARROW_IPC_SRCS
 add_library(arrow_ipc SHARED
   ${ARROW_IPC_SRCS}
 )
-if(RAPIDJSON_VERDORED)
+if(RAPIDJSON_VENDORED)
   add_dependencies(arrow_ipc rapidjson_ep)
 endif()
 if(FLATBUFFERS_VENDORED)

From 6ff5fcf1bfb67d817d6261596d47cf6a6d9c3c6c Mon Sep 17 00:00:00 2001
From: "Uwe L. Korn"
Date: Tue, 20 Dec 2016 14:08:07 -0500
Subject: [PATCH 0239/1644] ARROW-433: Correctly handle Arrow to Python date
 conversion for timezones west of London

Verified with `TZ='America/New_York' py.test pyarrow`

Author: Uwe L. Korn

Closes #245 from xhochy/ARROW-433 and squashes the following commits:

06745d8 [Uwe L. Korn] Use more pythonic approach
a55be24 [Uwe L. Korn] ARROW-433: Correctly handle Arrow to Python date conversion for timezones west of London
Korn] ARROW-433: Correctly handle Arrow to Python date conversion for timezones west of London --- python/pyarrow/scalar.pyx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python/pyarrow/scalar.pyx b/python/pyarrow/scalar.pyx index 09f60e2675499..623e3e483d4ae 100644 --- a/python/pyarrow/scalar.pyx +++ b/python/pyarrow/scalar.pyx @@ -127,7 +127,7 @@ cdef class DateValue(ArrayValue): def as_py(self): cdef CDateArray* ap = self.sp_array.get() - return datetime.date.fromtimestamp(ap.Value(self.index) / 1000) + return datetime.datetime.utcfromtimestamp(ap.Value(self.index) / 1000).date() cdef class TimestampValue(ArrayValue): From f6bf112cd22eeb03725dff79a28c205324fa4f45 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 20 Dec 2016 14:09:32 -0500 Subject: [PATCH 0240/1644] ARROW-434: [Python] Correctly handle Python file objects in Parquet read/write paths While we'd enabled Python file objects for IPC file reader/writer, they hadn't been enabled in the Parquet read/write paths. For example: ```python with open(filename, 'wb') as f: A.parquet.write_table(arrow_table, f, version="1.0") data = io.BytesIO(open(filename, 'rb').read()) table_read = pq.read_table(data) ``` There was a separate bug reported in ARROW-434, but that's a Parquet type mapping issue, will be fixed in PARQUET-812. Author: Wes McKinney Closes #247 from wesm/ARROW-434 and squashes the following commits: c704088 [Wes McKinney] Correctly handle Python file objects in Parquet read/write paths --- python/pyarrow/io.pxd | 3 ++ python/pyarrow/io.pyx | 40 +++++++++++++++++++++++ python/pyarrow/ipc.pyx | 43 +----------------------- python/pyarrow/parquet.pyx | 49 +++++++++++++++------------- python/pyarrow/tests/test_parquet.py | 36 ++++++++++++++++++-- 5 files changed, 104 insertions(+), 67 deletions(-) diff --git a/python/pyarrow/io.pxd b/python/pyarrow/io.pxd index d6966cdaadd3a..02265d0a68eb1 100644 --- a/python/pyarrow/io.pxd +++ b/python/pyarrow/io.pxd @@ -42,3 +42,6 @@ cdef class NativeFile: # suite of Arrow C++ libraries cdef read_handle(self, shared_ptr[ReadableFileInterface]* file) cdef write_handle(self, shared_ptr[OutputStream]* file) + +cdef get_reader(object source, shared_ptr[ReadableFileInterface]* reader) +cdef get_writer(object source, shared_ptr[OutputStream]* writer) diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index 6b0e3924d207c..8491aa8964fb9 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -256,6 +256,46 @@ def buffer_from_bytes(object obj): result.init(buf) return result +cdef get_reader(object source, shared_ptr[ReadableFileInterface]* reader): + cdef NativeFile nf + + if isinstance(source, bytes): + source = BytesReader(source) + elif not isinstance(source, NativeFile) and hasattr(source, 'read'): + # Optimistically hope this is file-like + source = PythonFileInterface(source, mode='r') + + if isinstance(source, NativeFile): + nf = source + + # TODO: what about read-write sources (e.g. 
memory maps)
+        if not nf.is_readonly:
+            raise IOError('Native file is not readable')
+
+        nf.read_handle(reader)
+    else:
+        raise TypeError('Unable to read from object of type: {0}'
+                        .format(type(source)))
+
+
+cdef get_writer(object source, shared_ptr[OutputStream]* writer):
+    cdef NativeFile nf
+
+    if not isinstance(source, NativeFile) and hasattr(source, 'write'):
+        # Optimistically hope this is file-like
+        source = PythonFileInterface(source, mode='w')
+
+    if isinstance(source, NativeFile):
+        nf = source
+
+        if nf.is_readonly:
+            raise IOError('Native file is not writeable')
+
+        nf.write_handle(writer)
+    else:
+        raise TypeError('Unable to write to object of type: {0}'
+                        .format(type(source)))
+

 # ----------------------------------------------------------------------
 # HDFS IO implementation
diff --git a/python/pyarrow/ipc.pyx b/python/pyarrow/ipc.pyx
index 46deb5ad0c35d..abc5e1b11ec4c 100644
--- a/python/pyarrow/ipc.pyx
+++ b/python/pyarrow/ipc.pyx
@@ -27,7 +27,7 @@ from pyarrow.includes.libarrow_ipc cimport *
 cimport pyarrow.includes.pyarrow as pyarrow
 from pyarrow.error cimport check_status
-from pyarrow.io cimport NativeFile
+from pyarrow.io cimport NativeFile, get_reader, get_writer
 from pyarrow.schema cimport Schema
 from pyarrow.table cimport RecordBatch
@@ -37,47 +37,6 @@ import pyarrow.io as io
 cimport cpython as cp

-cdef get_reader(source, shared_ptr[ReadableFileInterface]* reader):
-    cdef NativeFile nf
-
-    if isinstance(source, bytes):
-        source = io.BytesReader(source)
-    elif not isinstance(source, io.NativeFile) and hasattr(source, 'read'):
-        # Optimistically hope this is file-like
-        source = io.PythonFileInterface(source, mode='r')
-
-    if isinstance(source, NativeFile):
-        nf = source
-
-        # TODO: what about read-write sources (e.g. memory maps)
-        if not nf.is_readonly:
-            raise IOError('Native file is not readable')
-
-        nf.read_handle(reader)
-    else:
-        raise TypeError('Unable to read from object of type: {0}'
-                        .format(type(source)))
-
-
-cdef get_writer(source, shared_ptr[OutputStream]* writer):
-    cdef NativeFile nf
-
-    if not isinstance(source, io.NativeFile) and hasattr(source, 'write'):
-        # Optimistically hope this is file-like
-        source = io.PythonFileInterface(source, mode='w')
-
-    if isinstance(source, io.NativeFile):
-        nf = source
-
-        if nf.is_readonly:
-            raise IOError('Native file is not writeable')
-
-        nf.write_handle(writer)
-    else:
-        raise TypeError('Unable to read from object of type: {0}'
-                        .format(type(source)))
-
-
 cdef class ArrowFileWriter:
     cdef:
         shared_ptr[CFileWriter] writer
diff --git a/python/pyarrow/parquet.pyx b/python/pyarrow/parquet.pyx
index 83fddb287a3f1..043ccf12d9181 100644
--- a/python/pyarrow/parquet.pyx
+++ b/python/pyarrow/parquet.pyx
@@ -31,7 +31,7 @@ from pyarrow.error cimport check_status
 from pyarrow.io import NativeFile
 from pyarrow.table cimport Table
-from pyarrow.io cimport NativeFile
+from pyarrow.io cimport NativeFile, get_reader, get_writer

 import six
@@ -49,22 +49,27 @@ cdef class ParquetReader:
     def __cinit__(self):
         self.allocator.set_pool(default_memory_pool())

-    cdef open_local_file(self, file_path):
-        cdef c_string path = tobytes(file_path)
+    def open(self, source):
+        self._open(source)

-        # Must be in one expression to avoid calling std::move which is not
-        # possible in Cython (due to missing rvalue support)
+    cdef _open(self, object source):
+        cdef:
+            shared_ptr[ReadableFileInterface] rd_handle
+            c_string path

-        # TODO(wesm): ParquetFileReader::OpenFIle can throw?
- self.reader = unique_ptr[FileReader]( - new FileReader(default_memory_pool(), - ParquetFileReader.OpenFile(path))) + if isinstance(source, six.string_types): + path = tobytes(source) - cdef open_native_file(self, NativeFile file): - cdef shared_ptr[ReadableFileInterface] cpp_handle - file.read_handle(&cpp_handle) + # Must be in one expression to avoid calling std::move which is not + # possible in Cython (due to missing rvalue support) - check_status(OpenFile(cpp_handle, &self.allocator, &self.reader)) + # TODO(wesm): ParquetFileReader::OpenFile can throw? + self.reader = unique_ptr[FileReader]( + new FileReader(default_memory_pool(), + ParquetFileReader.OpenFile(path))) + else: + get_reader(source, &rd_handle) + check_status(OpenFile(rd_handle, &self.allocator, &self.reader)) def read_all(self): cdef: @@ -137,11 +142,7 @@ def read_table(source, columns=None): Content of the file as a table (of columns) """ cdef ParquetReader reader = ParquetReader() - - if isinstance(source, six.string_types): - reader.open_local_file(source) - elif isinstance(source, NativeFile): - reader.open_native_file(source) + reader._open(source) if columns is None: return reader.read_all() @@ -174,7 +175,10 @@ def write_table(table, sink, chunk_size=None, version=None, cdef Table table_ = table cdef CTable* ctable_ = table_.table cdef shared_ptr[ParquetWriteSink] sink_ + cdef shared_ptr[FileOutputStream] filesink_ + cdef shared_ptr[OutputStream] general_sink + cdef WriterProperties.Builder properties_builder cdef int64_t chunk_size_ = 0 if chunk_size is None: @@ -232,10 +236,11 @@ def write_table(table, sink, chunk_size=None, version=None, raise ArrowException("Unsupport compression codec") if isinstance(sink, six.string_types): - check_status(FileOutputStream.Open(tobytes(sink), &filesink_)) - sink_.reset(new ParquetWriteSink(filesink_)) - elif isinstance(sink, NativeFile): - sink_.reset(new ParquetWriteSink((sink).wr_file)) + check_status(FileOutputStream.Open(tobytes(sink), &filesink_)) + sink_.reset(new ParquetWriteSink(filesink_)) + else: + get_writer(sink, &general_sink) + sink_.reset(new ParquetWriteSink(general_sink)) with nogil: check_status(WriteFlatTable(ctable_, default_memory_pool(), sink_, diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index 841830f6fba3b..7c45732d34573 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -15,6 +15,7 @@ # specific language governing permissions and limitations # under the License. 
+import io
 import pytest

 import pyarrow as A
@@ -132,9 +133,8 @@ def test_pandas_column_selection(tmpdir):
     pdt.assert_frame_equal(df[['uint8']], df_read)

-@parquet
-def test_pandas_parquet_native_file_roundtrip(tmpdir):
-    size = 10000
+
+def _test_dataframe(size=10000):
     np.random.seed(0)
     df = pd.DataFrame({
         'uint8': np.arange(size, dtype=np.uint8),
@@ -149,6 +149,12 @@ def test_pandas_parquet_native_file_roundtrip(tmpdir):
         'float64': np.arange(size, dtype=np.float64),
         'bool': np.random.randn(size) > 0
     })
+    return df
+
+
+@parquet
+def test_pandas_parquet_native_file_roundtrip(tmpdir):
+    df = _test_dataframe(10000)
     arrow_table = A.from_pandas_dataframe(df)
     imos = paio.InMemoryOutputStream()
     pq.write_table(arrow_table, imos, version="2.0")
@@ -158,6 +164,30 @@ def test_pandas_parquet_native_file_roundtrip(tmpdir):
     pdt.assert_frame_equal(df, df_read)

+@parquet
+def test_pandas_parquet_pyfile_roundtrip(tmpdir):
+    filename = tmpdir.join('pandas_pyfile_roundtrip.parquet').strpath
+    size = 5
+    df = pd.DataFrame({
+        'int64': np.arange(size, dtype=np.int64),
+        'float32': np.arange(size, dtype=np.float32),
+        'float64': np.arange(size, dtype=np.float64),
+        'bool': np.random.randn(size) > 0,
+        'strings': ['foo', 'bar', None, 'baz', 'qux']
+    })
+
+    arrow_table = A.from_pandas_dataframe(df)
+
+    with open(filename, 'wb') as f:
+        A.parquet.write_table(arrow_table, f, version="1.0")
+
+    data = io.BytesIO(open(filename, 'rb').read())
+
+    table_read = pq.read_table(data)
+    df_read = table_read.to_pandas()
+    pdt.assert_frame_equal(df, df_read)
+
+
 @parquet
 def test_pandas_parquet_configuration_options(tmpdir):
     size = 10000

From 73455b56f705c3c11d3c29447082641dcab4c63a Mon Sep 17 00:00:00 2001
From: "Uwe L. Korn"
Date: Tue, 20 Dec 2016 14:10:21 -0500
Subject: [PATCH 0241/1644] ARROW-430: Improved version handling

This reintroduces `setuptools_scm` versioning for git clones and
sdists/wheels. git-archives are handled by a separate chunk of code that
infers the version from the `pom.xml`. Since the Maven world always
specifies the next to-be-released version, while Python PEP 440
development versions are based on the previous release, the most minimal
pre-release of the version specified there is used. I would suggest
keeping the conda package versioning as it is currently, i.e. manually
setting it to 0.1.0postX.

Also: I would rather not parse the Maven XML, but that is currently the
simplest way to ensure that the versioning system is in a state where we
can still make releases with the Maven release plugin.

Author: Uwe L. Korn

Closes #248 from xhochy/ARROW-430 and squashes the following commits:

39753f8 [Uwe L. Korn] Infer version from java/pom.xml
05c44ea [Uwe L. Korn] Get rid of setuptools_scm_git_archive
14b8136 [Uwe L.
Korn] Revert "ARROW-429: Revert ARROW-379 until git-archive issues are resolved" Change-Id: I4f6d291e63b2518af47c2a81049aa24a38c92821 --- dev/release/00-prepare.sh | 5 --- .../cmake_modules/FindParquet.cmake | 0 python/pyarrow/__init__.py | 10 ++++-- python/setup.cfg | 20 +++++++++++ python/setup.py | 34 +++++++------------ 5 files changed, 41 insertions(+), 28 deletions(-) rename {cpp => python}/cmake_modules/FindParquet.cmake (100%) create mode 100644 python/setup.cfg diff --git a/dev/release/00-prepare.sh b/dev/release/00-prepare.sh index 3423a3e6c5bf9..00af5e7768161 100644 --- a/dev/release/00-prepare.sh +++ b/dev/release/00-prepare.sh @@ -43,9 +43,4 @@ mvn release:prepare -Dtag=${tag} -DreleaseVersion=${version} -DautoVersionSubmod cd - -cd "${SOURCE_DIR}/../../python" -sed -i "s/VERSION = '[^']*'/VERSION = '${version}'/g" setup.py -sed -i "s/ISRELEASED = False/ISRELEASED = True/g" setup.py -cd - - echo "Finish staging binary artifacts by running: sh dev/release/01-perform.sh" diff --git a/cpp/cmake_modules/FindParquet.cmake b/python/cmake_modules/FindParquet.cmake similarity index 100% rename from cpp/cmake_modules/FindParquet.cmake rename to python/cmake_modules/FindParquet.cmake diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index a42e39cf9865c..39ba4c72e7d3c 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -17,6 +17,14 @@ # flake8: noqa +from pkg_resources import get_distribution, DistributionNotFound +try: + __version__ = get_distribution(__name__).version +except DistributionNotFound: + # package is not installed + pass + + import pyarrow.config from pyarrow.array import (Array, @@ -43,5 +51,3 @@ DataType, Field, Schema, schema) from pyarrow.table import Column, RecordBatch, Table, from_pandas_dataframe - -from pyarrow.version import version as __version__ diff --git a/python/setup.cfg b/python/setup.cfg new file mode 100644 index 0000000000000..caae3e081b6ca --- /dev/null +++ b/python/setup.cfg @@ -0,0 +1,20 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +[build_sphinx] +source-dir = doc/ +build-dir = doc/_build diff --git a/python/setup.py b/python/setup.py index 5f448f7d50784..2e595e2dc870e 100644 --- a/python/setup.py +++ b/python/setup.py @@ -42,27 +42,9 @@ if Cython.__version__ < '0.19.1': raise Exception('Please upgrade to Cython 0.19.1 or newer') -VERSION = '0.1.0' -ISRELEASED = False - -if not ISRELEASED: - VERSION += '.dev' - setup_dir = os.path.abspath(os.path.dirname(__file__)) -def write_version_py(filename=os.path.join(setup_dir, 'pyarrow/version.py')): - a = open(filename, 'w') - file_content = "\n".join(["", - "# THIS FILE IS GENERATED FROM SETUP.PY", - "version = '%(version)s'", - "isrelease = '%(isrelease)s'"]) - - a.write(file_content % {'version': VERSION, - 'isrelease': str(ISRELEASED)}) - a.close() - - class clean(_clean): def run(self): @@ -272,15 +254,23 @@ def get_outputs(self): return [self._get_cmake_ext_path(name) for name in self.get_names()] -write_version_py() - DESC = """\ Python library for Apache Arrow""" +# In the case of a git-archive, we don't have any version information +# from the SCM to infer a version. The only source is the java/pom.xml. +# +# Note that this is only the case for git-archives. sdist tarballs have +# all relevant information (but not the Java sources). +if not os.path.exists('../.git') and os.path.exists('../java/pom.xml'): + import xml.etree.ElementTree as ET + tree = ET.parse('../java/pom.xml') + version_tag = list(tree.getroot().findall('{http://maven.apache.org/POM/4.0.0}version'))[0] + os.environ["SETUPTOOLS_SCM_PRETEND_VERSION"] = version_tag.text.replace("-SNAPSHOT", "a0") + setup( name="pyarrow", packages=['pyarrow', 'pyarrow.tests'], - version=VERSION, zip_safe=False, package_data={'pyarrow': ['*.pxd', '*.pyx']}, # Dummy extension to trigger build_ext @@ -290,6 +280,8 @@ def get_outputs(self): 'clean': clean, 'build_ext': build_ext }, + use_scm_version = {"root": "..", "relative_to": __file__}, + setup_requires=['setuptools_scm'], install_requires=['cython >= 0.23', 'numpy >= 1.9', 'six >= 1.0.0'], description=DESC, license='Apache License, Version 2.0', From 268ffbeffb1cd0617e52d381d500a2d10f61124c Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 21 Dec 2016 09:31:56 +0100 Subject: [PATCH 0242/1644] ARROW-374: More precise handling of bytes vs unicode in Python API Python built-in types that are not all unicode become `arrow::BinaryArray` instead of `arrow::StringArray`, since we cannot be sure that the PyBytes objects are UTF-8-encoded strings. 
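For illustration, a minimal sketch of the inference rule this change establishes, distilled from the `test_unicode`/`test_bytes` cases added below (API names are those of this early pyarrow codebase, as in the diff):

```python
import pyarrow

# An all-unicode sequence still infers as a UTF-8 string column.
arr = pyarrow.from_pylist([u'foo', u'bar', None])
assert arr.type == pyarrow.string()

# A single PyBytes value demotes the whole column to binary, since the
# bytes may not be valid UTF-8; unicode values are UTF-8-encoded on append.
arr = pyarrow.from_pylist([b'foo', u'ma\xf1ana', None])
assert arr.type == pyarrow.binary()
assert arr.to_pylist() == [b'foo', u'ma\xf1ana'.encode('utf-8'), None]
```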
Author: Wes McKinney Closes #249 from wesm/ARROW-374 and squashes the following commits: 1371a30 [Wes McKinney] py3 fixes 8ac3a49 [Wes McKinney] Consistently convert PyBytes to BinaryArray with pandas, too 83d1c05 [Wes McKinney] Remove print statement c8df606 [Wes McKinney] Timestamp and time cannot be static 4a9aaf4 [Wes McKinney] Add Python interface to BinaryArray, convert PyBytes to binary instead of assuming utf8 unicode --- cpp/src/arrow/type.cc | 6 +- python/pyarrow/__init__.py | 5 +- python/pyarrow/array.pyx | 5 ++ python/pyarrow/includes/libarrow.pxd | 6 +- python/pyarrow/scalar.pyx | 16 +++- python/pyarrow/schema.pyx | 6 ++ python/pyarrow/tests/test_convert_builtin.py | 31 ++++++-- python/pyarrow/tests/test_convert_pandas.py | 18 ++++- python/pyarrow/tests/test_scalars.py | 22 ++++-- python/src/pyarrow/adapters/builtin.cc | 80 ++++++++++++++------ python/src/pyarrow/adapters/pandas.cc | 65 +++++++++++++++- python/src/pyarrow/helpers.cc | 50 +++++------- python/src/pyarrow/helpers.h | 16 ---- 13 files changed, 227 insertions(+), 99 deletions(-) diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 4748cc3c04a20..8ff9eea87051d 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -155,13 +155,11 @@ TYPE_FACTORY(binary, BinaryType); TYPE_FACTORY(date, DateType); std::shared_ptr timestamp(TimeUnit unit) { - static std::shared_ptr result = std::make_shared(); - return result; + return std::make_shared(unit); } std::shared_ptr time(TimeUnit unit) { - static std::shared_ptr result = std::make_shared(); - return result; + return std::make_shared(unit); } std::shared_ptr list(const std::shared_ptr& value_type) { diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 39ba4c72e7d3c..9ede9348c93de 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -40,13 +40,14 @@ BooleanValue, Int8Value, Int16Value, Int32Value, Int64Value, UInt8Value, UInt16Value, UInt32Value, UInt64Value, - FloatValue, DoubleValue, ListValue, StringValue) + FloatValue, DoubleValue, ListValue, + BinaryValue, StringValue) from pyarrow.schema import (null, bool_, int8, int16, int32, int64, uint8, uint16, uint32, uint64, timestamp, date, - float_, double, string, + float_, double, binary, string, list_, struct, field, DataType, Field, Schema, schema) diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 84f17056a19f7..c178d5ccd355b 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -238,6 +238,10 @@ cdef class StringArray(Array): pass +cdef class BinaryArray(Array): + pass + + cdef dict _array_classes = { Type_NA: NullArray, Type_BOOL: BooleanArray, @@ -253,6 +257,7 @@ cdef dict _array_classes = { Type_FLOAT: FloatArray, Type_DOUBLE: DoubleArray, Type_LIST: ListArray, + Type_BINARY: BinaryArray, Type_STRING: StringArray, Type_TIMESTAMP: Int64Array, } diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 419dd74846c92..40fb60de07ed3 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -40,6 +40,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: Type_TIMESTAMP" arrow::Type::TIMESTAMP" Type_DATE" arrow::Type::DATE" + Type_BINARY" arrow::Type::BINARY" Type_STRING" arrow::Type::STRING" Type_LIST" arrow::Type::LIST" @@ -161,7 +162,10 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: shared_ptr[CArray] values() shared_ptr[CDataType] value_type() - cdef cppclass CStringArray" arrow::StringArray"(CListArray): + cdef cppclass 
CBinaryArray" arrow::BinaryArray"(CListArray): + const uint8_t* GetValue(int i, int32_t* length) + + cdef cppclass CStringArray" arrow::StringArray"(CBinaryArray): c_string GetString(int i) cdef cppclass CChunkedArray" arrow::ChunkedArray": diff --git a/python/pyarrow/scalar.pyx b/python/pyarrow/scalar.pyx index 623e3e483d4ae..a0610a14e6bd0 100644 --- a/python/pyarrow/scalar.pyx +++ b/python/pyarrow/scalar.pyx @@ -22,6 +22,7 @@ import pyarrow.schema as schema import datetime +cimport cpython as cp NA = None @@ -170,6 +171,18 @@ cdef class StringValue(ArrayValue): return frombytes(ap.GetString(self.index)) +cdef class BinaryValue(ArrayValue): + + def as_py(self): + cdef: + const uint8_t* ptr + int32_t length + CBinaryArray* ap = self.sp_array.get() + + ptr = ap.GetValue(self.index, &length) + return cp.PyBytes_FromStringAndSize((ptr), length) + + cdef class ListValue(ArrayValue): def __len__(self): @@ -218,7 +231,8 @@ cdef dict _scalar_classes = { Type_FLOAT: FloatValue, Type_DOUBLE: DoubleValue, Type_LIST: ListValue, - Type_STRING: StringValue + Type_BINARY: BinaryValue, + Type_STRING: StringValue, } cdef object box_arrow_scalar(DataType type, diff --git a/python/pyarrow/schema.pyx b/python/pyarrow/schema.pyx index d05ac9ebc015a..7a69b0f12391a 100644 --- a/python/pyarrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ -215,6 +215,12 @@ def string(): """ return primitive_type(Type_STRING) +def binary(): + """ + Binary (PyBytes-like) type + """ + return primitive_type(Type_BINARY) + def list_(DataType value_type): cdef DataType out = DataType() cdef shared_ptr[CDataType] list_type diff --git a/python/pyarrow/tests/test_convert_builtin.py b/python/pyarrow/tests/test_convert_builtin.py index 7dc1c1b2a4828..a5f7aa51c29da 100644 --- a/python/pyarrow/tests/test_convert_builtin.py +++ b/python/pyarrow/tests/test_convert_builtin.py @@ -15,7 +15,7 @@ # specific language governing permissions and limitations # under the License. 
-from pyarrow.compat import unittest +from pyarrow.compat import unittest, u import pyarrow import datetime @@ -71,16 +71,28 @@ def test_double(self): assert arr.type == pyarrow.double() assert arr.to_pylist() == data - def test_string(self): - data = ['foo', b'bar', None, 'arrow'] + def test_unicode(self): + data = [u('foo'), u('bar'), None, u('arrow')] arr = pyarrow.from_pylist(data) assert len(arr) == 4 assert arr.null_count == 1 assert arr.type == pyarrow.string() - assert arr.to_pylist() == ['foo', 'bar', None, 'arrow'] + assert arr.to_pylist() == [u('foo'), u('bar'), None, u('arrow')] + + def test_bytes(self): + u1 = b'ma\xc3\xb1ana' + data = [b'foo', + u1.decode('utf-8'), # unicode gets encoded, + None] + arr = pyarrow.from_pylist(data) + assert len(arr) == 3 + assert arr.null_count == 1 + assert arr.type == pyarrow.binary() + assert arr.to_pylist() == [b'foo', u1, None] def test_date(self): - data = [datetime.date(2000, 1, 1), None, datetime.date(1970, 1, 1), datetime.date(2040, 2, 26)] + data = [datetime.date(2000, 1, 1), None, datetime.date(1970, 1, 1), + datetime.date(2040, 2, 26)] arr = pyarrow.from_pylist(data) assert len(arr) == 4 assert arr.type == pyarrow.date() @@ -101,10 +113,13 @@ def test_timestamp(self): assert len(arr) == 4 assert arr.type == pyarrow.timestamp() assert arr.null_count == 1 - assert arr[0].as_py() == datetime.datetime(2007, 7, 13, 1, 23, 34, 123456) + assert arr[0].as_py() == datetime.datetime(2007, 7, 13, 1, + 23, 34, 123456) assert arr[1].as_py() is None - assert arr[2].as_py() == datetime.datetime(2006, 1, 13, 12, 34, 56, 432539) - assert arr[3].as_py() == datetime.datetime(2010, 8, 13, 5, 46, 57, 437699) + assert arr[2].as_py() == datetime.datetime(2006, 1, 13, 12, + 34, 56, 432539) + assert arr[3].as_py() == datetime.datetime(2010, 8, 13, 5, + 46, 57, 437699) def test_mixed_nesting_levels(self): pyarrow.from_pylist([1, 2, None]) diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index cf50f3d1c2c7a..da34f85588130 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -23,6 +23,7 @@ import pandas as pd import pandas.util.testing as tm +from pyarrow.compat import u import pyarrow as A @@ -157,13 +158,22 @@ def test_boolean_object_nulls(self): df = pd.DataFrame({'bools': arr}) self._check_pandas_roundtrip(df) - def test_strings(self): + def test_unicode(self): repeats = 1000 - values = [b'foo', None, u'bar', 'qux', np.nan] + values = [u('foo'), None, u('bar'), u('qux'), np.nan] df = pd.DataFrame({'strings': values * repeats}) - values = ['foo', None, u'bar', 'qux', None] - expected = pd.DataFrame({'strings': values * repeats}) + self._check_pandas_roundtrip(df) + + def test_bytes_to_binary(self): + values = [u('qux'), b'foo', None, 'bar', 'qux', np.nan] + df = pd.DataFrame({'strings': values}) + + table = A.from_pandas_dataframe(df) + assert table[0].type == A.binary() + + values2 = [b'qux', b'foo', None, b'bar', b'qux', np.nan] + expected = pd.DataFrame({'strings': values2}) self._check_pandas_roundtrip(df, expected) def test_timestamps_notimezone_no_nulls(self): diff --git a/python/pyarrow/tests/test_scalars.py b/python/pyarrow/tests/test_scalars.py index 4fb850a4d47bf..19cfacbcb6b11 100644 --- a/python/pyarrow/tests/test_scalars.py +++ b/python/pyarrow/tests/test_scalars.py @@ -15,7 +15,7 @@ # specific language governing permissions and limitations # under the License. 
-from pyarrow.compat import unittest, u +from pyarrow.compat import unittest, u, unicode_type import pyarrow as A @@ -58,20 +58,32 @@ def test_double(self): v = arr[2] assert v.as_py() == 3.0 - def test_string(self): - arr = A.from_pylist(['foo', None, u('bar')]) + def test_string_unicode(self): + arr = A.from_pylist([u('foo'), None, u('bar')]) v = arr[0] assert isinstance(v, A.StringValue) - assert repr(v) == "'foo'" assert v.as_py() == 'foo' assert arr[1] is A.NA v = arr[2].as_py() - assert v == 'bar' + assert v == u('bar') assert isinstance(v, str) + def test_bytes(self): + arr = A.from_pylist([b'foo', None, u('bar')]) + + v = arr[0] + assert isinstance(v, A.BinaryValue) + assert v.as_py() == b'foo' + + assert arr[1] is A.NA + + v = arr[2].as_py() + assert v == b'bar' + assert isinstance(v, bytes) + def test_list(self): arr = A.from_pylist([['foo', None], None, ['bar'], []]) diff --git a/python/src/pyarrow/adapters/builtin.cc b/python/src/pyarrow/adapters/builtin.cc index e0cb7c20be3d5..2a13944b35c1c 100644 --- a/python/src/pyarrow/adapters/builtin.cc +++ b/python/src/pyarrow/adapters/builtin.cc @@ -42,14 +42,6 @@ static inline bool IsPyInteger(PyObject* obj) { #endif } -static inline bool IsPyBaseString(PyObject* obj) { -#if PYARROW_IS_PY2 - return PyString_Check(obj) || PyUnicode_Check(obj); -#else - return PyUnicode_Check(obj); -#endif -} - class ScalarVisitor { public: ScalarVisitor() : @@ -60,7 +52,8 @@ class ScalarVisitor { date_count_(0), timestamp_count_(0), float_count_(0), - string_count_(0) {} + binary_count_(0), + unicode_count_(0) {} void Visit(PyObject* obj) { ++total_count_; @@ -76,8 +69,10 @@ class ScalarVisitor { ++date_count_; } else if (PyDateTime_CheckExact(obj)) { ++timestamp_count_; - } else if (IsPyBaseString(obj)) { - ++string_count_; + } else if (PyBytes_Check(obj)) { + ++binary_count_; + } else if (PyUnicode_Check(obj)) { + ++unicode_count_; } else { // TODO(wesm): accumulate error information somewhere } @@ -86,20 +81,22 @@ class ScalarVisitor { std::shared_ptr GetType() { // TODO(wesm): handling mixed-type cases if (float_count_) { - return DOUBLE; + return arrow::float64(); } else if (int_count_) { // TODO(wesm): tighter type later - return INT64; + return arrow::int64(); } else if (date_count_) { - return DATE; + return arrow::date(); } else if (timestamp_count_) { - return TIMESTAMP_US; + return arrow::timestamp(arrow::TimeUnit::MICRO); } else if (bool_count_) { - return BOOL; - } else if (string_count_) { - return STRING; + return arrow::boolean(); + } else if (binary_count_) { + return arrow::binary(); + } else if (unicode_count_) { + return arrow::utf8(); } else { - return NA; + return arrow::null(); } } @@ -115,7 +112,8 @@ class ScalarVisitor { int64_t date_count_; int64_t timestamp_count_; int64_t float_count_; - int64_t string_count_; + int64_t binary_count_; + int64_t unicode_count_; // Place to accumulate errors // std::vector errors_; @@ -163,7 +161,7 @@ class SeqVisitor { std::shared_ptr GetType() { if (scalars_.total_count() == 0) { if (max_nesting_level_ == 0) { - return NA; + return arrow::null(); } else { return nullptr; } @@ -227,7 +225,7 @@ static Status InferArrowType(PyObject* obj, int64_t* size, // For 0-length sequences, refuse to guess if (*size == 0) { - *out_type = NA; + *out_type = arrow::null(); } SeqVisitor seq_visitor; @@ -381,7 +379,7 @@ class DoubleConverter : public TypedConverter { } }; -class StringConverter : public TypedConverter { +class BytesConverter : public TypedConverter { public: Status AppendData(PyObject* seq) 
override { PyObject* item; @@ -415,6 +413,38 @@ class StringConverter : public TypedConverter { } }; +class UTF8Converter : public TypedConverter { + public: + Status AppendData(PyObject* seq) override { + PyObject* item; + PyObject* bytes_obj; + OwnedRef tmp; + const char* bytes; + int32_t length; + Py_ssize_t size = PySequence_Size(seq); + for (int64_t i = 0; i < size; ++i) { + item = PySequence_GetItem(seq, i); + OwnedRef holder(item); + + if (item == Py_None) { + RETURN_NOT_OK(typed_builder_->AppendNull()); + continue; + } else if (!PyUnicode_Check(item)) { + return Status::TypeError("Non-unicode value encountered"); + } + tmp.reset(PyUnicode_AsUTF8String(item)); + RETURN_IF_PYERROR(); + bytes_obj = tmp.obj(); + + // No error checking + length = PyBytes_GET_SIZE(bytes_obj); + bytes = PyBytes_AS_STRING(bytes_obj); + RETURN_NOT_OK(typed_builder_->Append(bytes, length)); + } + return Status::OK(); + } +}; + class ListConverter : public TypedConverter { public: Status Init(const std::shared_ptr& builder) override; @@ -449,8 +479,10 @@ std::shared_ptr GetConverter(const std::shared_ptr& type return std::make_shared(); case Type::DOUBLE: return std::make_shared(); + case Type::BINARY: + return std::make_shared(); case Type::STRING: - return std::make_shared(); + return std::make_shared(); case Type::LIST: return std::make_shared(); case Type::STRUCT: diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index f8dff6d824153..38f3b6f5248ee 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -193,6 +193,9 @@ class ArrowSerializer { Status ConvertObjectStrings(std::shared_ptr* out) { PyAcquireGIL lock; + // The output type at this point is inconclusive because there may be bytes + // and unicode mixed in the object array + PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); arrow::TypePtr string_type(new arrow::StringType()); arrow::StringBuilder string_builder(pool_, string_type); @@ -200,6 +203,7 @@ class ArrowSerializer { Status s; PyObject* obj; + bool have_bytes = false; for (int64_t i = 0; i < length_; ++i) { obj = objects[i]; if (PyUnicode_Check(obj)) { @@ -215,13 +219,21 @@ class ArrowSerializer { return s; } } else if (PyBytes_Check(obj)) { + have_bytes = true; const int32_t length = PyBytes_GET_SIZE(obj); RETURN_NOT_OK(string_builder.Append(PyBytes_AS_STRING(obj), length)); } else { string_builder.AppendNull(); } } - return string_builder.Finish(out); + RETURN_NOT_OK(string_builder.Finish(out)); + + if (have_bytes) { + const auto& arr = static_cast(*out->get()); + *out = std::make_shared(arr.length(), arr.offsets(), + arr.data(), arr.null_count(), arr.null_bitmap()); + } + return Status::OK(); } Status ConvertBooleans(std::shared_ptr* out) { @@ -865,7 +877,7 @@ class ArrowDeserializer { return Status::OK(); } - // UTF8 + // UTF8 strings template inline typename std::enable_if< T2 == arrow::Type::STRING, Status>::type @@ -912,6 +924,54 @@ class ArrowDeserializer { return Status::OK(); } + template + inline typename std::enable_if< + T2 == arrow::Type::BINARY, Status>::type + ConvertValues(const std::shared_ptr& data) { + size_t chunk_offset = 0; + PyAcquireGIL lock; + + RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); + + for (int c = 0; c < data->num_chunks(); c++) { + const std::shared_ptr arr = data->chunk(c); + auto binary_arr = static_cast(arr.get()); + auto out_values = reinterpret_cast(PyArray_DATA(out_)) + chunk_offset; + + const uint8_t* data_ptr; + int32_t length; + if (data->null_count() > 0) 
{ + for (int64_t i = 0; i < arr->length(); ++i) { + if (binary_arr->IsNull(i)) { + Py_INCREF(Py_None); + out_values[i] = Py_None; + } else { + data_ptr = binary_arr->GetValue(i, &length); + + out_values[i] = PyBytes_FromStringAndSize( + reinterpret_cast(data_ptr), length); + if (out_values[i] == nullptr) { + return Status::UnknownError("String initialization failed"); + } + } + } + } else { + for (int64_t i = 0; i < arr->length(); ++i) { + data_ptr = binary_arr->GetValue(i, &length); + out_values[i] = PyBytes_FromStringAndSize( + reinterpret_cast(data_ptr), length); + if (out_values[i] == nullptr) { + return Status::UnknownError("String initialization failed"); + } + } + } + + chunk_offset += binary_arr->length(); + } + + return Status::OK(); + } + private: std::shared_ptr col_; PyObject* py_ref_; @@ -948,6 +1008,7 @@ Status ConvertColumnToPandas(const std::shared_ptr& col, PyObject* py_re FROM_ARROW_CASE(UINT64); FROM_ARROW_CASE(FLOAT); FROM_ARROW_CASE(DOUBLE); + FROM_ARROW_CASE(BINARY); FROM_ARROW_CASE(STRING); FROM_ARROW_CASE(DATE); FROM_ARROW_CASE(TIMESTAMP); diff --git a/python/src/pyarrow/helpers.cc b/python/src/pyarrow/helpers.cc index af9274484935f..b42199c8e041c 100644 --- a/python/src/pyarrow/helpers.cc +++ b/python/src/pyarrow/helpers.cc @@ -23,47 +23,33 @@ using namespace arrow; namespace pyarrow { -const std::shared_ptr NA = std::make_shared(); -const std::shared_ptr BOOL = std::make_shared(); -const std::shared_ptr UINT8 = std::make_shared(); -const std::shared_ptr UINT16 = std::make_shared(); -const std::shared_ptr UINT32 = std::make_shared(); -const std::shared_ptr UINT64 = std::make_shared(); -const std::shared_ptr INT8 = std::make_shared(); -const std::shared_ptr INT16 = std::make_shared(); -const std::shared_ptr INT32 = std::make_shared(); -const std::shared_ptr INT64 = std::make_shared(); -const std::shared_ptr DATE = std::make_shared(); -const std::shared_ptr TIMESTAMP_US = std::make_shared(TimeUnit::MICRO); -const std::shared_ptr FLOAT = std::make_shared(); -const std::shared_ptr DOUBLE = std::make_shared(); -const std::shared_ptr STRING = std::make_shared(); -#define GET_PRIMITIVE_TYPE(NAME, Class) \ +#define GET_PRIMITIVE_TYPE(NAME, FACTORY) \ case Type::NAME: \ - return NAME; \ + return FACTORY(); \ break; std::shared_ptr GetPrimitiveType(Type::type type) { switch (type) { case Type::NA: - return NA; - GET_PRIMITIVE_TYPE(UINT8, UInt8Type); - GET_PRIMITIVE_TYPE(INT8, Int8Type); - GET_PRIMITIVE_TYPE(UINT16, UInt16Type); - GET_PRIMITIVE_TYPE(INT16, Int16Type); - GET_PRIMITIVE_TYPE(UINT32, UInt32Type); - GET_PRIMITIVE_TYPE(INT32, Int32Type); - GET_PRIMITIVE_TYPE(UINT64, UInt64Type); - GET_PRIMITIVE_TYPE(INT64, Int64Type); - GET_PRIMITIVE_TYPE(DATE, DateType); + return null(); + GET_PRIMITIVE_TYPE(UINT8, uint8); + GET_PRIMITIVE_TYPE(INT8, int8); + GET_PRIMITIVE_TYPE(UINT16, uint16); + GET_PRIMITIVE_TYPE(INT16, int16); + GET_PRIMITIVE_TYPE(UINT32, uint32); + GET_PRIMITIVE_TYPE(INT32, int32); + GET_PRIMITIVE_TYPE(UINT64, uint64); + GET_PRIMITIVE_TYPE(INT64, int64); + GET_PRIMITIVE_TYPE(DATE, date); case Type::TIMESTAMP: - return TIMESTAMP_US; + return arrow::timestamp(arrow::TimeUnit::MICRO); break; - GET_PRIMITIVE_TYPE(BOOL, BooleanType); - GET_PRIMITIVE_TYPE(FLOAT, FloatType); - GET_PRIMITIVE_TYPE(DOUBLE, DoubleType); - GET_PRIMITIVE_TYPE(STRING, StringType); + GET_PRIMITIVE_TYPE(BOOL, boolean); + GET_PRIMITIVE_TYPE(FLOAT, float32); + GET_PRIMITIVE_TYPE(DOUBLE, float64); + GET_PRIMITIVE_TYPE(BINARY, binary); + GET_PRIMITIVE_TYPE(STRING, utf8); default: return nullptr; } 
diff --git a/python/src/pyarrow/helpers.h b/python/src/pyarrow/helpers.h index e714bba5db4cc..8334d974c0237 100644 --- a/python/src/pyarrow/helpers.h +++ b/python/src/pyarrow/helpers.h @@ -28,22 +28,6 @@ namespace pyarrow { using arrow::DataType; using arrow::Type; -extern const std::shared_ptr NA; -extern const std::shared_ptr BOOL; -extern const std::shared_ptr UINT8; -extern const std::shared_ptr UINT16; -extern const std::shared_ptr UINT32; -extern const std::shared_ptr UINT64; -extern const std::shared_ptr INT8; -extern const std::shared_ptr INT16; -extern const std::shared_ptr INT32; -extern const std::shared_ptr INT64; -extern const std::shared_ptr DATE; -extern const std::shared_ptr TIMESTAMP_US; -extern const std::shared_ptr FLOAT; -extern const std::shared_ptr DOUBLE; -extern const std::shared_ptr STRING; - PYARROW_EXPORT std::shared_ptr GetPrimitiveType(Type::type type); From fd4eb98af9bbf19b7a640b55e2d8ed5ad87b6af1 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Wed, 21 Dec 2016 16:50:55 +0100 Subject: [PATCH 0243/1644] ARROW-440: [C++] Support pkg-config pkg-config is a tool to get build flags. If Arrow supports pkg-config, users can set build flags easily. For example, CMake supports pkg-config. To support pkg-config, we just install .pc file that includes build flags information. Author: Kouhei Sutou Closes #250 from kou/ARROW-440-support-pkg-config and squashes the following commits: f35fc44 [Kouhei Sutou] ARROW-440: [C++] Support pkg-config --- cpp/src/arrow/CMakeLists.txt | 8 ++++++++ cpp/src/arrow/arrow.pc.in | 26 ++++++++++++++++++++++++++ cpp/src/arrow/io/CMakeLists.txt | 8 ++++++++ cpp/src/arrow/io/arrow-io.pc.in | 27 +++++++++++++++++++++++++++ cpp/src/arrow/ipc/CMakeLists.txt | 8 ++++++++ cpp/src/arrow/ipc/arrow-ipc.pc.in | 27 +++++++++++++++++++++++++++ 6 files changed, 104 insertions(+) create mode 100644 cpp/src/arrow/arrow.pc.in create mode 100644 cpp/src/arrow/io/arrow-io.pc.in create mode 100644 cpp/src/arrow/ipc/arrow-ipc.pc.in diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index b8500ab264f80..f8c50513d31a5 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -33,6 +33,14 @@ install(FILES test-util.h DESTINATION include/arrow) +# pkg-config support +configure_file(arrow.pc.in + "${CMAKE_CURRENT_BINARY_DIR}/arrow.pc" + @ONLY) +install( + FILES "${CMAKE_CURRENT_BINARY_DIR}/arrow.pc" + DESTINATION "lib/pkgconfig/") + ####################################### # Unit tests ####################################### diff --git a/cpp/src/arrow/arrow.pc.in b/cpp/src/arrow/arrow.pc.in new file mode 100644 index 0000000000000..5ad429b714893 --- /dev/null +++ b/cpp/src/arrow/arrow.pc.in @@ -0,0 +1,26 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +prefix=@CMAKE_INSTALL_PREFIX@ +libdir=${prefix}/lib +includedir=${prefix}/include + +Name: Apache Arrow +Description: Arrow is a set of technologies that enable big-data systems to process and move data fast. +Version: @ARROW_VERSION@ +Libs: -L${libdir} -larrow +Cflags: -I${includedir} diff --git a/cpp/src/arrow/io/CMakeLists.txt b/cpp/src/arrow/io/CMakeLists.txt index e2b6496cc3f87..2062cd43b7b48 100644 --- a/cpp/src/arrow/io/CMakeLists.txt +++ b/cpp/src/arrow/io/CMakeLists.txt @@ -134,3 +134,11 @@ install(FILES install(TARGETS arrow_io LIBRARY DESTINATION lib ARCHIVE DESTINATION lib) + +# pkg-config support +configure_file(arrow-io.pc.in + "${CMAKE_CURRENT_BINARY_DIR}/arrow-io.pc" + @ONLY) +install( + FILES "${CMAKE_CURRENT_BINARY_DIR}/arrow-io.pc" + DESTINATION "lib/pkgconfig/") diff --git a/cpp/src/arrow/io/arrow-io.pc.in b/cpp/src/arrow/io/arrow-io.pc.in new file mode 100644 index 0000000000000..4b4abdd62df42 --- /dev/null +++ b/cpp/src/arrow/io/arrow-io.pc.in @@ -0,0 +1,27 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +prefix=@CMAKE_INSTALL_PREFIX@ +libdir=${prefix}/lib +includedir=${prefix}/include + +Name: Apache Arrow I/O +Description: I/O interface for Arrow. +Version: @ARROW_VERSION@ +Libs: -L${libdir} -larrow_io +Cflags: -I${includedir} +Requires: arrow diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index 619ca7c92cb7a..d3e625a08fbfe 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -159,3 +159,11 @@ install(FILES install(TARGETS arrow_ipc LIBRARY DESTINATION lib ARCHIVE DESTINATION lib) + +# pkg-config support +configure_file(arrow-ipc.pc.in + "${CMAKE_CURRENT_BINARY_DIR}/arrow-ipc.pc" + @ONLY) +install( + FILES "${CMAKE_CURRENT_BINARY_DIR}/arrow-ipc.pc" + DESTINATION "lib/pkgconfig/") diff --git a/cpp/src/arrow/ipc/arrow-ipc.pc.in b/cpp/src/arrow/ipc/arrow-ipc.pc.in new file mode 100644 index 0000000000000..73b44c99f0430 --- /dev/null +++ b/cpp/src/arrow/ipc/arrow-ipc.pc.in @@ -0,0 +1,27 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. 
See the License for the +# specific language governing permissions and limitations +# under the License. + +prefix=@CMAKE_INSTALL_PREFIX@ +libdir=${prefix}/lib +includedir=${prefix}/include + +Name: Apache Arrow IPC +Description: IPC extension for Arrow. +Version: @ARROW_VERSION@ +Libs: -L${libdir} -larrow_ipc +Cflags: -I${includedir} +Requires: arrow-io From 65af9ea16a3c9241a66203b57cbfe2041a5ee52b Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 23 Dec 2016 17:29:17 -0500 Subject: [PATCH 0244/1644] ARROW-432: [Python] Construct precise pandas BlockManager structure for zero-copy DataFrame initialization This avoids "memory tripling" (because pd.DataFrame will often immediately consolidate the arrays) and also will allow the Arrow->pandas deserialization to be parallelized for further performance gains. @xhochy this also has the effect of coercing all timestamps to `datetime[ns]` -- for pandas I believe this is the proper behavior, but wanted to run it by you. In a local benchmark on roughly 1GB of data I have: setup code: ```python import numpy as np import pandas as pd import pyarrow as pa DATA_SIZE = (1 << 30) NCOLS = 100 data = np.random.randn(NCOLS, DATA_SIZE / NCOLS / 8).T data[::2] = np.nan df = pd.DataFrame(data, columns=['c' + str(i) for i in range(NCOLS)]) table = pa.Table.from_pandas(df) ``` before these changes (I added `block_based` argument to toggle this code path off): ```python In [4]: %timeit df2 = table.to_pandas(block_based=False) 1 loop, best of 3: 252 ms per loop ``` ```python In [5]: %timeit df2 = table.to_pandas() 10 loops, best of 3: 139 ms per loop ``` This takes the effective in-memory bandwidth on numerical data from 3.97 GB/s to 7.19 GB/s. I also moved the clang-format files to the top level so we can more easily run code formatting on the C++ code under python/. 
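To make the construction concrete before the diff, here is a minimal sketch of what the new `table_to_blockmanager` path hands to pandas (pandas-internal APIs as used in `table.pyx` below; the two-column block and its names are illustrative):

```python
import numpy as np
from pandas import DataFrame, RangeIndex
from pandas.core.internals import BlockManager, make_block

# One consolidated 2-D float64 block holding two columns; pandas stores
# block values with shape (ncolumns_in_block, nrows).
values = np.random.randn(2, 10)
block = make_block(values, placement=[0, 1])

# Axis 0 holds the column labels, axis 1 the row index.
axes = [['c0', 'c1'], RangeIndex(10)]
df = DataFrame(BlockManager([block], axes))
```

Since the values already sit in consolidated blocks, `DataFrame(mgr)` adopts them as-is instead of running the copy-and-consolidate pass that the dict-of-columns construction triggers.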
Author: Wes McKinney Closes #251 from wesm/ARROW-432 and squashes the following commits: f22e1b5 [Wes McKinney] Remove unneeded import ea83ded [Wes McKinney] Unit tests pass again ec239b8 [Wes McKinney] Fix DataFrame constructions, code formatting af960ee [Wes McKinney] Draft blocks -> DataFrame scaffold 110692f [Wes McKinney] First draft of scaffolding for creating precise pandas.DataFrame block structure 4928a0b [Wes McKinney] Refactor post rebase c89cfaf [Wes McKinney] Rearrange to-pandas deserialization to better permit reads into pre-allocated memory --- cpp/src/.clang-format => .clang-format | 0 cpp/src/.clang-tidy => .clang-tidy | 0 .../.clang-tidy-ignore => .clang-tidy-ignore | 0 cpp/CMakeLists.txt | 3 +- python/pyarrow/includes/pyarrow.pxd | 5 +- python/pyarrow/table.pyx | 54 +- python/src/pyarrow/adapters/builtin.cc | 66 +- python/src/pyarrow/adapters/builtin.h | 4 +- python/src/pyarrow/adapters/pandas.cc | 1184 ++++++++++++----- python/src/pyarrow/adapters/pandas.h | 33 +- python/src/pyarrow/api.h | 2 +- python/src/pyarrow/common.cc | 4 +- python/src/pyarrow/common.h | 52 +- python/src/pyarrow/config.cc | 5 +- python/src/pyarrow/config.h | 6 +- python/src/pyarrow/helpers.cc | 37 +- python/src/pyarrow/helpers.h | 4 +- python/src/pyarrow/io.cc | 14 +- python/src/pyarrow/io.h | 6 +- python/src/pyarrow/numpy_interop.h | 4 +- python/src/pyarrow/util/datetime.h | 8 +- python/src/pyarrow/util/test_main.cc | 2 +- 22 files changed, 986 insertions(+), 507 deletions(-) rename cpp/src/.clang-format => .clang-format (100%) rename cpp/src/.clang-tidy => .clang-tidy (100%) rename cpp/src/.clang-tidy-ignore => .clang-tidy-ignore (100%) diff --git a/cpp/src/.clang-format b/.clang-format similarity index 100% rename from cpp/src/.clang-format rename to .clang-format diff --git a/cpp/src/.clang-tidy b/.clang-tidy similarity index 100% rename from cpp/src/.clang-tidy rename to .clang-tidy diff --git a/cpp/src/.clang-tidy-ignore b/.clang-tidy-ignore similarity index 100% rename from cpp/src/.clang-tidy-ignore rename to .clang-tidy-ignore diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 315995ce7cb97..93e9853df8972 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -741,7 +741,8 @@ endif (UNIX) if (${CLANG_FORMAT_FOUND}) # runs clang format and updates files in place. 
add_custom_target(format ${BUILD_SUPPORT_DIR}/run-clang-format.sh ${CMAKE_CURRENT_SOURCE_DIR} ${CLANG_FORMAT_BIN} 1 - `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc -or -name \\*.h | sed -e '/_generated/g'`) + `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc -or -name \\*.h | sed -e '/_generated/g'` + `find ${CMAKE_CURRENT_SOURCE_DIR}/../python -name \\*.cc -or -name \\*.h`) # runs clang format and exits with a non-zero exit code if any files need to be reformatted add_custom_target(check-format ${BUILD_SUPPORT_DIR}/run-clang-format.sh ${CMAKE_CURRENT_SOURCE_DIR} ${CLANG_FORMAT_BIN} 0 diff --git a/python/pyarrow/includes/pyarrow.pxd b/python/pyarrow/includes/pyarrow.pxd index a5444c236bcc8..dc6ccd2025932 100644 --- a/python/pyarrow/includes/pyarrow.pxd +++ b/python/pyarrow/includes/pyarrow.pxd @@ -18,7 +18,7 @@ # distutils: language = c++ from pyarrow.includes.common cimport * -from pyarrow.includes.libarrow cimport (CArray, CBuffer, CColumn, +from pyarrow.includes.libarrow cimport (CArray, CBuffer, CColumn, CTable, CDataType, CStatus, Type, MemoryPool) cimport pyarrow.includes.libarrow_io as arrow_io @@ -39,6 +39,9 @@ cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil: CStatus ConvertColumnToPandas(const shared_ptr[CColumn]& arr, PyObject* py_ref, PyObject** out) + CStatus ConvertTableToPandas(const shared_ptr[CTable]& table, + int nthreads, PyObject** out) + MemoryPool* get_memory_pool() diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index 2f7d4309e4518..9375557888490 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -430,6 +430,32 @@ cdef class RecordBatch: return result +cdef table_to_blockmanager(const shared_ptr[CTable]& table, int nthreads): + cdef: + PyObject* result_obj + CColumn* col + int i + + from pandas.core.internals import BlockManager, make_block + from pandas import RangeIndex + + check_status(pyarrow.ConvertTableToPandas(table, nthreads, &result_obj)) + + result = PyObject_to_object(result_obj) + + blocks = [] + for block_arr, placement_arr in result: + blocks.append(make_block(block_arr, placement=placement_arr)) + + names = [] + for i in range(table.get().num_columns()): + col = table.get().column(i).get() + names.append(frombytes(col.name())) + + axes = [names, RangeIndex(table.get().num_rows())] + return BlockManager(blocks, axes) + + cdef class Table: """ A collection of top-level named, equal length Arrow arrays. 
@@ -584,7 +610,7 @@ cdef class Table: table.init(c_table) return table - def to_pandas(self): + def to_pandas(self, nthreads=1, block_based=True): """ Convert the arrow::Table to a pandas DataFrame @@ -599,17 +625,21 @@ cdef class Table: import pandas as pd - names = [] - data = [] - for i in range(self.table.num_columns()): - col = self.table.column(i) - column = self.column(i) - check_status(pyarrow.ConvertColumnToPandas( - col, column, &arr)) - names.append(frombytes(col.get().name())) - data.append(PyObject_to_object(arr)) - - return pd.DataFrame(dict(zip(names, data)), columns=names) + if block_based: + mgr = table_to_blockmanager(self.sp_table, nthreads) + return pd.DataFrame(mgr) + else: + names = [] + data = [] + for i in range(self.table.num_columns()): + col = self.table.column(i) + column = self.column(i) + check_status(pyarrow.ConvertColumnToPandas( + col, column, &arr)) + names.append(frombytes(col.get().name())) + data.append(PyObject_to_object(arr)) + + return pd.DataFrame(dict(zip(names, data)), columns=names) @property def name(self): diff --git a/python/src/pyarrow/adapters/builtin.cc b/python/src/pyarrow/adapters/builtin.cc index 2a13944b35c1c..fb7475f0c9407 100644 --- a/python/src/pyarrow/adapters/builtin.cc +++ b/python/src/pyarrow/adapters/builtin.cc @@ -44,16 +44,16 @@ static inline bool IsPyInteger(PyObject* obj) { class ScalarVisitor { public: - ScalarVisitor() : - total_count_(0), - none_count_(0), - bool_count_(0), - int_count_(0), - date_count_(0), - timestamp_count_(0), - float_count_(0), - binary_count_(0), - unicode_count_(0) {} + ScalarVisitor() + : total_count_(0), + none_count_(0), + bool_count_(0), + int_count_(0), + date_count_(0), + timestamp_count_(0), + float_count_(0), + binary_count_(0), + unicode_count_(0) {} void Visit(PyObject* obj) { ++total_count_; @@ -100,9 +100,7 @@ class ScalarVisitor { } } - int64_t total_count() const { - return total_count_; - } + int64_t total_count() const { return total_count_; } private: int64_t total_count_; @@ -123,17 +121,14 @@ static constexpr int MAX_NESTING_LEVELS = 32; class SeqVisitor { public: - SeqVisitor() : - max_nesting_level_(0) { + SeqVisitor() : max_nesting_level_(0) { memset(nesting_histogram_, 0, MAX_NESTING_LEVELS * sizeof(int)); } - Status Visit(PyObject* obj, int level=0) { + Status Visit(PyObject* obj, int level = 0) { Py_ssize_t size = PySequence_Size(obj); - if (level > max_nesting_level_) { - max_nesting_level_ = level; - } + if (level > max_nesting_level_) { max_nesting_level_ = level; } for (int64_t i = 0; i < size; ++i) { // TODO(wesm): Error checking? 
@@ -188,9 +183,7 @@ class SeqVisitor { int max_observed_level() const { int result = 0; for (int i = 0; i < MAX_NESTING_LEVELS; ++i) { - if (nesting_histogram_[i] > 0) { - result = i; - } + if (nesting_histogram_[i] > 0) { result = i; } } return result; } @@ -198,9 +191,7 @@ class SeqVisitor { int num_nesting_levels() const { int result = 0; for (int i = 0; i < MAX_NESTING_LEVELS; ++i) { - if (nesting_histogram_[i] > 0) { - ++result; - } + if (nesting_histogram_[i] > 0) { ++result; } } return result; } @@ -214,8 +205,8 @@ class SeqVisitor { }; // Non-exhaustive type inference -static Status InferArrowType(PyObject* obj, int64_t* size, - std::shared_ptr* out_type) { +static Status InferArrowType( + PyObject* obj, int64_t* size, std::shared_ptr* out_type) { *size = PySequence_Size(obj); if (PyErr_Occurred()) { // Not a sequence @@ -224,9 +215,7 @@ static Status InferArrowType(PyObject* obj, int64_t* size, } // For 0-length sequences, refuse to guess - if (*size == 0) { - *out_type = arrow::null(); - } + if (*size == 0) { *out_type = arrow::null(); } SeqVisitor seq_visitor; RETURN_NOT_OK(seq_visitor.Visit(obj)); @@ -234,9 +223,7 @@ static Status InferArrowType(PyObject* obj, int64_t* size, *out_type = seq_visitor.GetType(); - if (*out_type == nullptr) { - return Status::TypeError("Unable to determine data type"); - } + if (*out_type == nullptr) { return Status::TypeError("Unable to determine data type"); } return Status::OK(); } @@ -337,7 +324,8 @@ class TimestampConverter : public TypedConverter { if (item.obj() == Py_None) { typed_builder_->AppendNull(); } else { - PyDateTime_DateTime* pydatetime = reinterpret_cast(item.obj()); + PyDateTime_DateTime* pydatetime = + reinterpret_cast(item.obj()); struct tm datetime = {0}; datetime.tm_year = PyDateTime_GET_YEAR(pydatetime) - 1900; datetime.tm_mon = PyDateTime_GET_MONTH(pydatetime) - 1; @@ -462,6 +450,7 @@ class ListConverter : public TypedConverter { } return Status::OK(); } + protected: std::shared_ptr value_converter_; }; @@ -496,8 +485,8 @@ Status ListConverter::Init(const std::shared_ptr& builder) { builder_ = builder; typed_builder_ = static_cast(builder.get()); - value_converter_ = GetConverter(static_cast( - builder->type().get())->value_type()); + value_converter_ = + GetConverter(static_cast(builder->type().get())->value_type()); if (value_converter_ == nullptr) { return Status::NotImplemented("value type not implemented"); } @@ -521,8 +510,7 @@ Status ConvertPySequence(PyObject* obj, std::shared_ptr* out) { std::shared_ptr converter = GetConverter(type); if (converter == nullptr) { std::stringstream ss; - ss << "No type converter implemented for " - << type->ToString(); + ss << "No type converter implemented for " << type->ToString(); return Status::NotImplemented(ss.str()); } @@ -536,4 +524,4 @@ Status ConvertPySequence(PyObject* obj, std::shared_ptr* out) { return builder->Finish(out); } -} // namespace pyarrow +} // namespace pyarrow diff --git a/python/src/pyarrow/adapters/builtin.h b/python/src/pyarrow/adapters/builtin.h index 2ddfdaaf44134..1ff36945c88c7 100644 --- a/python/src/pyarrow/adapters/builtin.h +++ b/python/src/pyarrow/adapters/builtin.h @@ -40,6 +40,6 @@ namespace pyarrow { PYARROW_EXPORT arrow::Status ConvertPySequence(PyObject* obj, std::shared_ptr* out); -} // namespace pyarrow +} // namespace pyarrow -#endif // PYARROW_ADAPTERS_BUILTIN_H +#endif // PYARROW_ADAPTERS_BUILTIN_H diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index 38f3b6f5248ee..899eb5519d562 100644 --- 
a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -28,10 +28,13 @@ #include #include #include +#include #include "arrow/api.h" -#include "arrow/util/bit-util.h" #include "arrow/status.h" +#include "arrow/type_fwd.h" +#include "arrow/util/bit-util.h" +#include "arrow/util/macros.h" #include "pyarrow/common.h" #include "pyarrow/config.h" @@ -40,10 +43,13 @@ namespace pyarrow { using arrow::Array; +using arrow::ChunkedArray; using arrow::Column; using arrow::Field; using arrow::DataType; using arrow::Status; +using arrow::Table; +using arrow::Type; namespace BitUtil = arrow::BitUtil; @@ -51,8 +57,7 @@ namespace BitUtil = arrow::BitUtil; // Serialization template -struct npy_traits { -}; +struct npy_traits {}; template <> struct npy_traits { @@ -60,21 +65,17 @@ struct npy_traits { using TypeClass = arrow::BooleanType; static constexpr bool supports_nulls = false; - static inline bool isnull(uint8_t v) { - return false; - } + static inline bool isnull(uint8_t v) { return false; } }; -#define NPY_INT_DECL(TYPE, CapType, T) \ - template <> \ - struct npy_traits { \ - typedef T value_type; \ - using TypeClass = arrow::CapType##Type; \ - \ - static constexpr bool supports_nulls = false; \ - static inline bool isnull(T v) { \ - return false; \ - } \ +#define NPY_INT_DECL(TYPE, CapType, T) \ + template <> \ + struct npy_traits { \ + typedef T value_type; \ + using TypeClass = arrow::CapType##Type; \ + \ + static constexpr bool supports_nulls = false; \ + static inline bool isnull(T v) { return false; } \ }; NPY_INT_DECL(INT8, Int8, int8_t); @@ -93,9 +94,7 @@ struct npy_traits { static constexpr bool supports_nulls = true; - static inline bool isnull(float v) { - return v != v; - } + static inline bool isnull(float v) { return v != v; } }; template <> @@ -105,9 +104,7 @@ struct npy_traits { static constexpr bool supports_nulls = true; - static inline bool isnull(double v) { - return v != v; - } + static inline bool isnull(double v) { return v != v; } }; template <> @@ -135,18 +132,14 @@ struct npy_traits { template class ArrowSerializer { public: - ArrowSerializer(arrow::MemoryPool* pool, PyArrayObject* arr, PyArrayObject* mask) : - pool_(pool), - arr_(arr), - mask_(mask) { + ArrowSerializer(arrow::MemoryPool* pool, PyArrayObject* arr, PyArrayObject* mask) + : pool_(pool), arr_(arr), mask_(mask) { length_ = PyArray_SIZE(arr_); } Status Convert(std::shared_ptr* out); - int stride() const { - return PyArray_STRIDES(arr_)[0]; - } + int stride() const { return PyArray_STRIDES(arr_)[0]; } Status InitNullBitmap() { int null_bytes = BitUtil::BytesForBits(length_); @@ -215,9 +208,7 @@ class ArrowSerializer { const int32_t length = PyBytes_GET_SIZE(obj); s = string_builder.Append(PyBytes_AS_STRING(obj), length); Py_DECREF(obj); - if (!s.ok()) { - return s; - } + if (!s.ok()) { return s; } } else if (PyBytes_Check(obj)) { have_bytes = true; const int32_t length = PyBytes_GET_SIZE(obj); @@ -230,8 +221,8 @@ class ArrowSerializer { if (have_bytes) { const auto& arr = static_cast(*out->get()); - *out = std::make_shared(arr.length(), arr.offsets(), - arr.data(), arr.null_count(), arr.null_bitmap()); + *out = std::make_shared( + arr.length(), arr.offsets(), arr.data(), arr.null_count(), arr.null_bitmap()); } return Status::OK(); } @@ -259,8 +250,7 @@ class ArrowSerializer { } } - *out = std::make_shared(length_, data, null_count, - null_bitmap_); + *out = std::make_shared(length_, data, null_count, null_bitmap_); return Status::OK(); } @@ -321,26 +311,27 @@ inline Status 
ArrowSerializer::MakeDataType(std::shared_ptr* out } template <> -inline Status ArrowSerializer::MakeDataType(std::shared_ptr* out) { +inline Status ArrowSerializer::MakeDataType( + std::shared_ptr* out) { PyArray_Descr* descr = PyArray_DESCR(arr_); auto date_dtype = reinterpret_cast(descr->c_metadata); arrow::TimestampType::Unit unit; switch (date_dtype->meta.base) { - case NPY_FR_s: - unit = arrow::TimestampType::Unit::SECOND; - break; - case NPY_FR_ms: - unit = arrow::TimestampType::Unit::MILLI; - break; - case NPY_FR_us: - unit = arrow::TimestampType::Unit::MICRO; - break; - case NPY_FR_ns: - unit = arrow::TimestampType::Unit::NANO; - break; - default: - return Status::Invalid("Unknown NumPy datetime unit"); + case NPY_FR_s: + unit = arrow::TimestampType::Unit::SECOND; + break; + case NPY_FR_ms: + unit = arrow::TimestampType::Unit::MILLI; + break; + case NPY_FR_us: + unit = arrow::TimestampType::Unit::MICRO; + break; + case NPY_FR_ns: + unit = arrow::TimestampType::Unit::NANO; + break; + default: + return Status::Invalid("Unknown NumPy datetime unit"); } out->reset(new arrow::TimestampType(unit)); @@ -351,9 +342,7 @@ template inline Status ArrowSerializer::Convert(std::shared_ptr* out) { typedef npy_traits traits; - if (mask_ != nullptr || traits::supports_nulls) { - RETURN_NOT_OK(InitNullBitmap()); - } + if (mask_ != nullptr || traits::supports_nulls) { RETURN_NOT_OK(InitNullBitmap()); } int64_t null_count = 0; if (mask_ != nullptr) { @@ -429,9 +418,7 @@ inline Status ArrowSerializer::Convert(std::shared_ptr* out) template inline Status ArrowSerializer::ConvertData() { // TODO(wesm): strided arrays - if (is_strided()) { - return Status::Invalid("no support for strided data yet"); - } + if (is_strided()) { return Status::Invalid("no support for strided data yet"); } data_ = std::make_shared(arr_); return Status::OK(); @@ -439,9 +426,7 @@ inline Status ArrowSerializer::ConvertData() { template <> inline Status ArrowSerializer::ConvertData() { - if (is_strided()) { - return Status::Invalid("no support for strided data yet"); - } + if (is_strided()) { return Status::Invalid("no support for strided data yet"); } int nbytes = BitUtil::BytesForBits(length_); auto buffer = std::make_shared(pool_); @@ -453,9 +438,7 @@ inline Status ArrowSerializer::ConvertData() { memset(bitmap, 0, nbytes); for (int i = 0; i < length_; ++i) { - if (values[i] > 0) { - BitUtil::SetBit(bitmap, i); - } + if (values[i] > 0) { BitUtil::SetBit(bitmap, i); } } data_ = buffer; @@ -468,29 +451,24 @@ inline Status ArrowSerializer::ConvertData() { return Status::TypeError("NYI"); } +#define TO_ARROW_CASE(TYPE) \ + case NPY_##TYPE: { \ + ArrowSerializer converter(pool, arr, mask); \ + RETURN_NOT_OK(converter.Convert(out)); \ + } break; -#define TO_ARROW_CASE(TYPE) \ - case NPY_##TYPE: \ - { \ - ArrowSerializer converter(pool, arr, mask); \ - RETURN_NOT_OK(converter.Convert(out)); \ - } \ - break; - -Status PandasMaskedToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, - std::shared_ptr* out) { +Status PandasMaskedToArrow( + arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, std::shared_ptr* out) { PyArrayObject* arr = reinterpret_cast(ao); PyArrayObject* mask = nullptr; - if (mo != nullptr) { - mask = reinterpret_cast(mo); - } + if (mo != nullptr) { mask = reinterpret_cast(mo); } if (PyArray_NDIM(arr) != 1) { return Status::Invalid("only handle 1-dimensional arrays"); } - switch(PyArray_DESCR(arr)->type_num) { + switch (PyArray_DESCR(arr)->type_num) { TO_ARROW_CASE(BOOL); TO_ARROW_CASE(INT8); 
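      // For reference, a sketch of what one TO_ARROW_CASE stamps out, assuming
      // ArrowSerializer is templated on the NumPy type constant:
      //
      //   case NPY_INT16: {
      //     ArrowSerializer<NPY_INT16> converter(pool, arr, mask);
      //     RETURN_NOT_OK(converter.Convert(out));
      //   } break;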
TO_ARROW_CASE(INT16); @@ -506,15 +484,13 @@ Status PandasMaskedToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, TO_ARROW_CASE(OBJECT); default: std::stringstream ss; - ss << "unsupported type " << PyArray_DESCR(arr)->type_num - << std::endl; + ss << "unsupported type " << PyArray_DESCR(arr)->type_num << std::endl; return Status::NotImplemented(ss.str()); } return Status::OK(); } -Status PandasToArrow(arrow::MemoryPool* pool, PyObject* ao, - std::shared_ptr* out) { +Status PandasToArrow(arrow::MemoryPool* pool, PyObject* ao, std::shared_ptr* out) { return PandasMaskedToArrow(pool, ao, nullptr, out); } @@ -522,28 +498,27 @@ Status PandasToArrow(arrow::MemoryPool* pool, PyObject* ao, // Deserialization template -struct arrow_traits { -}; +struct arrow_traits {}; template <> struct arrow_traits { static constexpr int npy_type = NPY_BOOL; static constexpr bool supports_nulls = false; static constexpr bool is_boolean = true; - static constexpr bool is_pandas_numeric_not_nullable = false; - static constexpr bool is_pandas_numeric_nullable = false; + static constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = false; }; -#define INT_DECL(TYPE) \ - template <> \ - struct arrow_traits { \ - static constexpr int npy_type = NPY_##TYPE; \ - static constexpr bool supports_nulls = false; \ - static constexpr double na_value = NAN; \ - static constexpr bool is_boolean = false; \ - static constexpr bool is_pandas_numeric_not_nullable = true; \ - static constexpr bool is_pandas_numeric_nullable = false; \ - typedef typename npy_traits::value_type T; \ +#define INT_DECL(TYPE) \ + template <> \ + struct arrow_traits { \ + static constexpr int npy_type = NPY_##TYPE; \ + static constexpr bool supports_nulls = false; \ + static constexpr double na_value = NAN; \ + static constexpr bool is_boolean = false; \ + static constexpr bool is_numeric_not_nullable = true; \ + static constexpr bool is_numeric_nullable = false; \ + typedef typename npy_traits::value_type T; \ }; INT_DECL(INT8); @@ -561,8 +536,8 @@ struct arrow_traits { static constexpr bool supports_nulls = true; static constexpr float na_value = NAN; static constexpr bool is_boolean = false; - static constexpr bool is_pandas_numeric_not_nullable = false; - static constexpr bool is_pandas_numeric_nullable = true; + static constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = true; typedef typename npy_traits::value_type T; }; @@ -572,19 +547,21 @@ struct arrow_traits { static constexpr bool supports_nulls = true; static constexpr double na_value = NAN; static constexpr bool is_boolean = false; - static constexpr bool is_pandas_numeric_not_nullable = false; - static constexpr bool is_pandas_numeric_nullable = true; + static constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = true; typedef typename npy_traits::value_type T; }; +static constexpr int64_t kPandasTimestampNull = std::numeric_limits::min(); + template <> struct arrow_traits { static constexpr int npy_type = NPY_DATETIME; static constexpr bool supports_nulls = true; - static constexpr int64_t na_value = std::numeric_limits::min(); + static constexpr int64_t na_value = kPandasTimestampNull; static constexpr bool is_boolean = false; - static constexpr bool is_pandas_numeric_not_nullable = false; - static constexpr bool is_pandas_numeric_nullable = true; + static constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = true; typedef 
typename npy_traits::value_type T; }; @@ -592,10 +569,10 @@ template <> struct arrow_traits { static constexpr int npy_type = NPY_DATETIME; static constexpr bool supports_nulls = true; - static constexpr int64_t na_value = std::numeric_limits::min(); + static constexpr int64_t na_value = kPandasTimestampNull; static constexpr bool is_boolean = false; - static constexpr bool is_pandas_numeric_not_nullable = false; - static constexpr bool is_pandas_numeric_nullable = true; + static constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = true; typedef typename npy_traits::value_type T; }; @@ -604,18 +581,39 @@ struct arrow_traits { static constexpr int npy_type = NPY_OBJECT; static constexpr bool supports_nulls = true; static constexpr bool is_boolean = false; - static constexpr bool is_pandas_numeric_not_nullable = false; - static constexpr bool is_pandas_numeric_nullable = false; + static constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = false; +}; + +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_OBJECT; + static constexpr bool supports_nulls = true; + static constexpr bool is_boolean = false; + static constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = false; }; +template +struct WrapBytes {}; -static inline PyObject* make_pystring(const uint8_t* data, int32_t length) { +template <> +struct WrapBytes { + static inline PyObject* Wrap(const uint8_t* data, int64_t length) { #if PY_MAJOR_VERSION >= 3 - return PyUnicode_FromStringAndSize(reinterpret_cast(data), length); + return PyUnicode_FromStringAndSize(reinterpret_cast(data), length); #else - return PyString_FromStringAndSize(reinterpret_cast(data), length); + return PyString_FromStringAndSize(reinterpret_cast(data), length); #endif -} + } +}; + +template <> +struct WrapBytes { + static inline PyObject* Wrap(const uint8_t* data, int64_t length) { + return PyBytes_FromStringAndSize(reinterpret_cast(data), length); + } +}; inline void set_numpy_metadata(int type, DataType* datatype, PyArrayObject* out) { if (type == NPY_DATETIME) { @@ -645,20 +643,169 @@ inline void set_numpy_metadata(int type, DataType* datatype, PyArrayObject* out) } } -template -class ArrowDeserializer { - public: - ArrowDeserializer(const std::shared_ptr& col, PyObject* py_ref) : - col_(col), py_ref_(py_ref) {} +template +inline void ConvertIntegerWithNulls(const ChunkedArray& data, double* out_values) { + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + // Upcast to double, set NaN as appropriate - Status Convert(PyObject** out) { - const std::shared_ptr data = col_->data(); + for (int i = 0; i < arr->length(); ++i) { + *out_values++ = prim_arr->IsNull(i) ? 
NAN : in_values[i]; + } + } +} - RETURN_NOT_OK(ConvertValues(data)); - *out = reinterpret_cast(out_); +template +inline void ConvertIntegerNoNullsSameType(const ChunkedArray& data, T* out_values) { + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + memcpy(out_values, in_values, sizeof(T) * arr->length()); + out_values += arr->length(); + } +} - return Status::OK(); +template +inline void ConvertIntegerNoNullsCast(const ChunkedArray& data, OutType* out_values) { + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + for (int32_t i = 0; i < arr->length(); ++i) { + *out_values = in_values[i]; + } + } +} + +static Status ConvertBooleanWithNulls(const ChunkedArray& data, PyObject** out_values) { + PyAcquireGIL lock; + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto bool_arr = static_cast(arr.get()); + + for (int64_t i = 0; i < arr->length(); ++i) { + if (bool_arr->IsNull(i)) { + Py_INCREF(Py_None); + *out_values++ = Py_None; + } else if (bool_arr->Value(i)) { + // True + Py_INCREF(Py_True); + *out_values++ = Py_True; + } else { + // False + Py_INCREF(Py_False); + *out_values++ = Py_False; + } + } + } + return Status::OK(); +} + +static void ConvertBooleanNoNulls(const ChunkedArray& data, uint8_t* out_values) { + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto bool_arr = static_cast(arr.get()); + for (int64_t i = 0; i < arr->length(); ++i) { + *out_values++ = static_cast(bool_arr->Value(i)); + } + } +} + +template +inline Status ConvertBinaryLike(const ChunkedArray& data, PyObject** out_values) { + PyAcquireGIL lock; + for (int c = 0; c < data.num_chunks(); c++) { + auto arr = static_cast(data.chunk(c).get()); + + const uint8_t* data_ptr; + int32_t length; + const bool has_nulls = data.null_count() > 0; + for (int64_t i = 0; i < arr->length(); ++i) { + if (has_nulls && arr->IsNull(i)) { + Py_INCREF(Py_None); + *out_values = Py_None; + } else { + data_ptr = arr->GetValue(i, &length); + *out_values = WrapBytes::Wrap(data_ptr, length); + if (*out_values == nullptr) { + return Status::UnknownError("String initialization failed"); + } + } + ++out_values; + } + } + return Status::OK(); +} + +template +inline void ConvertNumericNullable(const ChunkedArray& data, T na_value, T* out_values) { + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + + const uint8_t* valid_bits = arr->null_bitmap_data(); + + if (arr->null_count() > 0) { + for (int64_t i = 0; i < arr->length(); ++i) { + *out_values++ = BitUtil::BitNotSet(valid_bits, i) ? na_value : in_values[i]; + } + } else { + memcpy(out_values, in_values, sizeof(T) * arr->length()); + out_values += arr->length(); + } } +} + +template +inline void ConvertNumericNullableCast( + const ChunkedArray& data, OutType na_value, OutType* out_values) { + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + + for (int64_t i = 0; i < arr->length(); ++i) { + *out_values++ = arr->IsNull(i) ? 
na_value : static_cast(in_values[i]); + } + } +} + +template +inline void ConvertDates(const ChunkedArray& data, T na_value, T* out_values) { + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + + for (int64_t i = 0; i < arr->length(); ++i) { + // There are 1000 * 60 * 60 * 24 = 86400000ms in a day + *out_values++ = arr->IsNull(i) ? na_value : in_values[i] / 86400000; + } + } +} + +template +inline void ConvertDatetimeNanos(const ChunkedArray& data, int64_t* out_values) { + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + + for (int64_t i = 0; i < arr->length(); ++i) { + *out_values++ = arr->IsNull(i) ? kPandasTimestampNull + : (static_cast(in_values[i]) * SHIFT); + } + } +} + +class ArrowDeserializer { + public: + ArrowDeserializer(const std::shared_ptr& col, PyObject* py_ref) + : col_(col), data_(*col->data().get()), py_ref_(py_ref) {} Status AllocateOutput(int type) { PyAcquireGIL lock; @@ -676,20 +823,29 @@ class ArrowDeserializer { return Status::OK(); } - Status OutputFromData(int type, void* data) { + template + Status ConvertValuesZeroCopy(int npy_type, std::shared_ptr arr) { + typedef typename arrow_traits::T T; + + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + + // Zero-Copy. We can pass the data pointer directly to NumPy. + void* data = const_cast(in_values); + PyAcquireGIL lock; // Zero-Copy. We can pass the data pointer directly to NumPy. npy_intp dims[1] = {col_->length()}; - out_ = reinterpret_cast(PyArray_SimpleNewFromData(1, dims, - type, data)); + out_ = reinterpret_cast( + PyArray_SimpleNewFromData(1, dims, npy_type, data)); if (out_ == NULL) { // Error occurred, trust that SimpleNew set the error state return Status::OK(); } - set_numpy_metadata(type, col_->type().get(), out_); + set_numpy_metadata(npy_type, col_->type().get(), out_); if (PyArray_SetBaseObject(out_, py_ref_) == -1) { // Error occurred, trust that SetBaseObject set the error state @@ -705,317 +861,621 @@ class ArrowDeserializer { return Status::OK(); } - template - Status ConvertValuesZeroCopy(std::shared_ptr arr) { - typedef typename arrow_traits::T T; + // ---------------------------------------------------------------------- + // Allocate new array and deserialize. Can do a zero copy conversion for some + // types - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); + Status Convert(PyObject** out) { +#define CONVERT_CASE(TYPE) \ + case arrow::Type::TYPE: { \ + RETURN_NOT_OK(ConvertValues()); \ + } break; + + switch (col_->type()->type) { + CONVERT_CASE(BOOL); + CONVERT_CASE(INT8); + CONVERT_CASE(INT16); + CONVERT_CASE(INT32); + CONVERT_CASE(INT64); + CONVERT_CASE(UINT8); + CONVERT_CASE(UINT16); + CONVERT_CASE(UINT32); + CONVERT_CASE(UINT64); + CONVERT_CASE(FLOAT); + CONVERT_CASE(DOUBLE); + CONVERT_CASE(BINARY); + CONVERT_CASE(STRING); + CONVERT_CASE(DATE); + CONVERT_CASE(TIMESTAMP); + default: + return Status::NotImplemented("Arrow type reading not implemented"); + } - // Zero-Copy. We can pass the data pointer directly to NumPy. 
- void* data = const_cast(in_values); - int type = arrow_traits::npy_type; - RETURN_NOT_OK(OutputFromData(type, data)); +#undef CONVERT_CASE + *out = reinterpret_cast(out_); return Status::OK(); } - template + template inline typename std::enable_if< - (T2 != arrow::Type::DATE) & arrow_traits::is_pandas_numeric_nullable, Status>::type - ConvertValues(const std::shared_ptr& data) { - typedef typename arrow_traits::T T; - size_t chunk_offset = 0; + (TYPE != arrow::Type::DATE) & arrow_traits::is_numeric_nullable, Status>::type + ConvertValues() { + typedef typename arrow_traits::T T; + int npy_type = arrow_traits::npy_type; - if (data->num_chunks() == 1 && data->null_count() == 0) { - return ConvertValuesZeroCopy(data->chunk(0)); + if (data_.num_chunks() == 1 && data_.null_count() == 0) { + return ConvertValuesZeroCopy(npy_type, data_.chunk(0)); } - RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + RETURN_NOT_OK(AllocateOutput(npy_type)); + auto out_values = reinterpret_cast(PyArray_DATA(out_)); + ConvertNumericNullable(data_, arrow_traits::na_value, out_values); - for (int c = 0; c < data->num_chunks(); c++) { - const std::shared_ptr arr = data->chunk(c); - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); - auto out_values = reinterpret_cast(PyArray_DATA(out_)) + chunk_offset; + return Status::OK(); + } - if (arr->null_count() > 0) { - for (int64_t i = 0; i < arr->length(); ++i) { - out_values[i] = arr->IsNull(i) ? arrow_traits::na_value : in_values[i]; - } - } else { - memcpy(out_values, in_values, sizeof(T) * arr->length()); - } + template + inline typename std::enable_if::type + ConvertValues() { + typedef typename arrow_traits::T T; + + RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + auto out_values = reinterpret_cast(PyArray_DATA(out_)); + ConvertDates(data_, arrow_traits::na_value, out_values); + return Status::OK(); + } - chunk_offset += arr->length(); + // Integer specialization + template + inline + typename std::enable_if::is_numeric_not_nullable, Status>::type + ConvertValues() { + typedef typename arrow_traits::T T; + int npy_type = arrow_traits::npy_type; + + if (data_.num_chunks() == 1 && data_.null_count() == 0) { + return ConvertValuesZeroCopy(npy_type, data_.chunk(0)); + } + + if (data_.null_count() > 0) { + RETURN_NOT_OK(AllocateOutput(NPY_FLOAT64)); + auto out_values = reinterpret_cast(PyArray_DATA(out_)); + ConvertIntegerWithNulls(data_, out_values); + } else { + RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + auto out_values = reinterpret_cast(PyArray_DATA(out_)); + ConvertIntegerNoNullsSameType(data_, out_values); } return Status::OK(); } + // Boolean specialization + template + inline typename std::enable_if::is_boolean, Status>::type + ConvertValues() { + if (data_.null_count() > 0) { + RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); + auto out_values = reinterpret_cast(PyArray_DATA(out_)); + RETURN_NOT_OK(ConvertBooleanWithNulls(data_, out_values)); + } else { + RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + auto out_values = reinterpret_cast(PyArray_DATA(out_)); + ConvertBooleanNoNulls(data_, out_values); + } + return Status::OK(); + } + + // UTF8 strings + template + inline typename std::enable_if::type + ConvertValues() { + RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); + auto out_values = reinterpret_cast(PyArray_DATA(out_)); + return ConvertBinaryLike(data_, out_values); + } + template - inline typename std::enable_if< - T2 == arrow::Type::DATE, Status>::type - ConvertValues(const 
std::shared_ptr& data) { - typedef typename arrow_traits::T T; - size_t chunk_offset = 0; + inline typename std::enable_if::type + ConvertValues() { + RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); + auto out_values = reinterpret_cast(PyArray_DATA(out_)); + return ConvertBinaryLike(data_, out_values); + } + + private: + std::shared_ptr col_; + const arrow::ChunkedArray& data_; + PyObject* py_ref_; + PyArrayObject* out_; +}; + +Status ConvertArrayToPandas( + const std::shared_ptr& arr, PyObject* py_ref, PyObject** out) { + static std::string dummy_name = "dummy"; + auto field = std::make_shared(dummy_name, arr->type()); + auto col = std::make_shared(field, arr); + return ConvertColumnToPandas(col, py_ref, out); +} - RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); +Status ConvertColumnToPandas( + const std::shared_ptr& col, PyObject* py_ref, PyObject** out) { + ArrowDeserializer converter(col, py_ref); + return converter.Convert(out); +} - for (int c = 0; c < data->num_chunks(); c++) { - const std::shared_ptr arr = data->chunk(c); - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); - auto out_values = reinterpret_cast(PyArray_DATA(out_)) + chunk_offset; +// ---------------------------------------------------------------------- +// pandas 0.x DataFrame conversion internals - for (int64_t i = 0; i < arr->length(); ++i) { - // There are 1000 * 60 * 60 * 24 = 86400000ms in a day - out_values[i] = arr->IsNull(i) ? arrow_traits::na_value : in_values[i] / 86400000; - } +class PandasBlock { + public: + enum type { + OBJECT, + UINT8, + INT8, + UINT16, + INT16, + UINT32, + INT32, + UINT64, + INT64, + FLOAT, + DOUBLE, + BOOL, + DATETIME, + CATEGORICAL + }; + + PandasBlock(int64_t num_rows, int num_columns) + : num_rows_(num_rows), num_columns_(num_columns) {} + + virtual Status Allocate() = 0; + virtual Status WriteNext(const std::shared_ptr& col, int64_t placement) = 0; - chunk_offset += arr->length(); + PyObject* block_arr() { return block_arr_.obj(); } + + PyObject* placement_arr() { return placement_arr_.obj(); } + + protected: + Status AllocateNDArray(int npy_type) { + PyAcquireGIL lock; + + npy_intp block_dims[2] = {num_columns_, num_rows_}; + PyObject* block_arr = PyArray_SimpleNew(2, block_dims, npy_type); + if (block_arr == NULL) { + // TODO(wesm): propagating Python exception + return Status::OK(); } + npy_intp placement_dims[1] = {num_columns_}; + PyObject* placement_arr = PyArray_SimpleNew(1, placement_dims, NPY_INT64); + if (placement_arr == NULL) { + // TODO(wesm): propagating Python exception + return Status::OK(); + } + + block_arr_.reset(block_arr); + placement_arr_.reset(placement_arr); + current_placement_index_ = 0; + + block_data_ = reinterpret_cast( + PyArray_DATA(reinterpret_cast(block_arr))); + + placement_data_ = reinterpret_cast( + PyArray_DATA(reinterpret_cast(placement_arr))); + return Status::OK(); } - // Integer specialization - template - inline typename std::enable_if< - arrow_traits::is_pandas_numeric_not_nullable, Status>::type - ConvertValues(const std::shared_ptr& data) { - typedef typename arrow_traits::T T; - size_t chunk_offset = 0; + int64_t num_rows_; + int num_columns_; + int current_placement_index_; - if (data->num_chunks() == 1 && data->null_count() == 0) { - return ConvertValuesZeroCopy(data->chunk(0)); - } + OwnedRef block_arr_; + uint8_t* block_data_; - if (data->null_count() > 0) { - RETURN_NOT_OK(AllocateOutput(NPY_FLOAT64)); + // ndarray + OwnedRef placement_arr_; + int64_t* placement_data_; - for 
(int c = 0; c < data->num_chunks(); c++) { - const std::shared_ptr arr = data->chunk(c); - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); - // Upcast to double, set NaN as appropriate - auto out_values = reinterpret_cast(PyArray_DATA(out_)) + chunk_offset; + DISALLOW_COPY_AND_ASSIGN(PandasBlock); +}; - for (int i = 0; i < arr->length(); ++i) { - out_values[i] = prim_arr->IsNull(i) ? NAN : in_values[i]; - } +class ObjectBlock : public PandasBlock { + public: + using PandasBlock::PandasBlock; - chunk_offset += arr->length(); - } - } else { - RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + Status Allocate() override { return AllocateNDArray(NPY_OBJECT); } - for (int c = 0; c < data->num_chunks(); c++) { - const std::shared_ptr arr = data->chunk(c); - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); - auto out_values = reinterpret_cast(PyArray_DATA(out_)) + chunk_offset; + Status WriteNext(const std::shared_ptr& col, int64_t placement) override { + Type::type type = col->type()->type; - memcpy(out_values, in_values, sizeof(T) * arr->length()); + PyObject** out_buffer = + reinterpret_cast(block_data_) + current_placement_index_ * num_rows_; - chunk_offset += arr->length(); - } + const ChunkedArray& data = *col->data().get(); + + if (type == Type::BOOL) { + RETURN_NOT_OK(ConvertBooleanWithNulls(data, out_buffer)); + } else if (type == Type::BINARY) { + RETURN_NOT_OK(ConvertBinaryLike(data, out_buffer)); + } else if (type == Type::STRING) { + RETURN_NOT_OK(ConvertBinaryLike(data, out_buffer)); + } else { + std::stringstream ss; + ss << "Unsupported type for object array output: " << col->type()->ToString(); + return Status::NotImplemented(ss.str()); } + placement_data_[current_placement_index_++] = placement; return Status::OK(); } +}; - // Boolean specialization - template - inline typename std::enable_if< - arrow_traits::is_boolean, Status>::type - ConvertValues(const std::shared_ptr& data) { - size_t chunk_offset = 0; - PyAcquireGIL lock; +template +class IntBlock : public PandasBlock { + public: + using PandasBlock::PandasBlock; - if (data->null_count() > 0) { - RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); + Status Allocate() override { + return AllocateNDArray(arrow_traits::npy_type); + } - for (int c = 0; c < data->num_chunks(); c++) { - const std::shared_ptr arr = data->chunk(c); - auto bool_arr = static_cast(arr.get()); - auto out_values = reinterpret_cast(PyArray_DATA(out_)) + chunk_offset; - - for (int64_t i = 0; i < arr->length(); ++i) { - if (bool_arr->IsNull(i)) { - Py_INCREF(Py_None); - out_values[i] = Py_None; - } else if (bool_arr->Value(i)) { - // True - Py_INCREF(Py_True); - out_values[i] = Py_True; - } else { - // False - Py_INCREF(Py_False); - out_values[i] = Py_False; - } - } + Status WriteNext(const std::shared_ptr& col, int64_t placement) override { + Type::type type = col->type()->type; - chunk_offset += bool_arr->length(); - } - } else { - RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + C_TYPE* out_buffer = + reinterpret_cast(block_data_) + current_placement_index_ * num_rows_; - for (int c = 0; c < data->num_chunks(); c++) { - const std::shared_ptr arr = data->chunk(c); - auto bool_arr = static_cast(arr.get()); - auto out_values = reinterpret_cast(PyArray_DATA(out_)) + chunk_offset; + const ChunkedArray& data = *col->data().get(); - for (int64_t i = 0; i < arr->length(); ++i) { - out_values[i] = static_cast(bool_arr->Value(i)); - } + if (type != 
ARROW_TYPE) { return Status::NotImplemented(col->type()->ToString()); } - chunk_offset += bool_arr->length(); - } + ConvertIntegerNoNullsSameType(data, out_buffer); + placement_data_[current_placement_index_++] = placement; + return Status::OK(); + } +}; + +using UInt8Block = IntBlock; +using Int8Block = IntBlock; +using UInt16Block = IntBlock; +using Int16Block = IntBlock; +using UInt32Block = IntBlock; +using Int32Block = IntBlock; +using UInt64Block = IntBlock; +using Int64Block = IntBlock; + +class Float32Block : public PandasBlock { + public: + using PandasBlock::PandasBlock; + + Status Allocate() override { return AllocateNDArray(NPY_FLOAT32); } + + Status WriteNext(const std::shared_ptr& col, int64_t placement) override { + Type::type type = col->type()->type; + + if (type != Type::FLOAT) { return Status::NotImplemented(col->type()->ToString()); } + + float* out_buffer = + reinterpret_cast(block_data_) + current_placement_index_ * num_rows_; + + ConvertNumericNullable(*col->data().get(), NAN, out_buffer); + placement_data_[current_placement_index_++] = placement; + return Status::OK(); + } +}; + +class Float64Block : public PandasBlock { + public: + using PandasBlock::PandasBlock; + + Status Allocate() override { return AllocateNDArray(NPY_FLOAT64); } + + Status WriteNext(const std::shared_ptr& col, int64_t placement) override { + Type::type type = col->type()->type; + + double* out_buffer = + reinterpret_cast(block_data_) + current_placement_index_ * num_rows_; + + const ChunkedArray& data = *col->data().get(); + +#define INTEGER_CASE(IN_TYPE) \ + ConvertIntegerWithNulls(data, out_buffer); \ + break; + + switch (type) { + case Type::UINT8: + INTEGER_CASE(uint8_t); + case Type::INT8: + INTEGER_CASE(int8_t); + case Type::UINT16: + INTEGER_CASE(uint16_t); + case Type::INT16: + INTEGER_CASE(int16_t); + case Type::UINT32: + INTEGER_CASE(uint32_t); + case Type::INT32: + INTEGER_CASE(int32_t); + case Type::UINT64: + INTEGER_CASE(uint64_t); + case Type::INT64: + INTEGER_CASE(int64_t); + case Type::FLOAT: + ConvertNumericNullableCast(data, NAN, out_buffer); + break; + case Type::DOUBLE: + ConvertNumericNullable(data, NAN, out_buffer); + break; + default: + return Status::NotImplemented(col->type()->ToString()); } +#undef INTEGER_CASE + + placement_data_[current_placement_index_++] = placement; return Status::OK(); } +}; - // UTF8 strings - template - inline typename std::enable_if< - T2 == arrow::Type::STRING, Status>::type - ConvertValues(const std::shared_ptr& data) { - size_t chunk_offset = 0; - PyAcquireGIL lock; +class BoolBlock : public PandasBlock { + public: + using PandasBlock::PandasBlock; - RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); + Status Allocate() override { return AllocateNDArray(NPY_BOOL); } - for (int c = 0; c < data->num_chunks(); c++) { - const std::shared_ptr arr = data->chunk(c); - auto string_arr = static_cast(arr.get()); - auto out_values = reinterpret_cast(PyArray_DATA(out_)) + chunk_offset; - - const uint8_t* data_ptr; - int32_t length; - if (data->null_count() > 0) { - for (int64_t i = 0; i < arr->length(); ++i) { - if (string_arr->IsNull(i)) { - Py_INCREF(Py_None); - out_values[i] = Py_None; - } else { - data_ptr = string_arr->GetValue(i, &length); - - out_values[i] = make_pystring(data_ptr, length); - if (out_values[i] == nullptr) { - return Status::UnknownError("String initialization failed"); - } - } - } + Status WriteNext(const std::shared_ptr& col, int64_t placement) override { + Type::type type = col->type()->type; + + if (type != Type::BOOL) { return 
Status::NotImplemented(col->type()->ToString()); } + + uint8_t* out_buffer = + reinterpret_cast(block_data_) + current_placement_index_ * num_rows_; + + ConvertBooleanNoNulls(*col->data().get(), out_buffer); + placement_data_[current_placement_index_++] = placement; + return Status::OK(); + } +}; + +class DatetimeBlock : public PandasBlock { + public: + using PandasBlock::PandasBlock; + + Status Allocate() override { + RETURN_NOT_OK(AllocateNDArray(NPY_DATETIME)); + + PyAcquireGIL lock; + auto date_dtype = reinterpret_cast( + PyArray_DESCR(reinterpret_cast(block_arr_.obj()))->c_metadata); + date_dtype->meta.base = NPY_FR_ns; + return Status::OK(); + } + + Status WriteNext(const std::shared_ptr& col, int64_t placement) override { + Type::type type = col->type()->type; + + int64_t* out_buffer = + reinterpret_cast(block_data_) + current_placement_index_ * num_rows_; + + const ChunkedArray& data = *col.get()->data(); + + if (type == Type::DATE) { + // DateType is millisecond timestamp stored as int64_t + // TODO(wesm): Do we want to make sure to zero out the milliseconds? + ConvertDatetimeNanos(data, out_buffer); + } else if (type == Type::TIMESTAMP) { + auto ts_type = static_cast(col->type().get()); + + if (ts_type->unit == arrow::TimeUnit::NANO) { + ConvertNumericNullable(data, kPandasTimestampNull, out_buffer); + } else if (ts_type->unit == arrow::TimeUnit::MICRO) { + ConvertDatetimeNanos(data, out_buffer); + } else if (ts_type->unit == arrow::TimeUnit::MILLI) { + ConvertDatetimeNanos(data, out_buffer); + } else if (ts_type->unit == arrow::TimeUnit::SECOND) { + ConvertDatetimeNanos(data, out_buffer); } else { - for (int64_t i = 0; i < arr->length(); ++i) { - data_ptr = string_arr->GetValue(i, &length); - out_values[i] = make_pystring(data_ptr, length); - if (out_values[i] == nullptr) { - return Status::UnknownError("String initialization failed"); - } - } + return Status::NotImplemented("Unsupported time unit"); } - - chunk_offset += string_arr->length(); + } else { + return Status::NotImplemented(col->type()->ToString()); } + placement_data_[current_placement_index_++] = placement; return Status::OK(); } +}; - template - inline typename std::enable_if< - T2 == arrow::Type::BINARY, Status>::type - ConvertValues(const std::shared_ptr& data) { - size_t chunk_offset = 0; - PyAcquireGIL lock; +// class CategoricalBlock : public PandasBlock {}; - RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); +Status MakeBlock(PandasBlock::type type, int64_t num_rows, int num_columns, + std::shared_ptr* block) { +#define BLOCK_CASE(NAME, TYPE) \ + case PandasBlock::NAME: \ + *block = std::make_shared(num_rows, num_columns); \ + break; - for (int c = 0; c < data->num_chunks(); c++) { - const std::shared_ptr arr = data->chunk(c); - auto binary_arr = static_cast(arr.get()); - auto out_values = reinterpret_cast(PyArray_DATA(out_)) + chunk_offset; - - const uint8_t* data_ptr; - int32_t length; - if (data->null_count() > 0) { - for (int64_t i = 0; i < arr->length(); ++i) { - if (binary_arr->IsNull(i)) { - Py_INCREF(Py_None); - out_values[i] = Py_None; - } else { - data_ptr = binary_arr->GetValue(i, &length); - - out_values[i] = PyBytes_FromStringAndSize( - reinterpret_cast(data_ptr), length); - if (out_values[i] == nullptr) { - return Status::UnknownError("String initialization failed"); - } - } - } + switch (type) { + BLOCK_CASE(OBJECT, ObjectBlock); + BLOCK_CASE(UINT8, UInt8Block); + BLOCK_CASE(INT8, Int8Block); + BLOCK_CASE(UINT16, UInt16Block); + BLOCK_CASE(INT16, Int16Block); + BLOCK_CASE(UINT32, UInt32Block); + 
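    // A sketch of one BLOCK_CASE expansion, assuming std::make_shared is
    // instantiated on the block class named in the second macro argument:
    //
    //   case PandasBlock::INT32:
    //     *block = std::make_shared<Int32Block>(num_rows, num_columns);
    //     break;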
BLOCK_CASE(INT32, Int32Block); + BLOCK_CASE(UINT64, UInt64Block); + BLOCK_CASE(INT64, Int64Block); + BLOCK_CASE(FLOAT, Float32Block); + BLOCK_CASE(DOUBLE, Float64Block); + BLOCK_CASE(BOOL, BoolBlock); + BLOCK_CASE(DATETIME, DatetimeBlock); + case PandasBlock::CATEGORICAL: + return Status::NotImplemented("categorical"); + } + +#undef BLOCK_CASE + + return (*block)->Allocate(); +} + +// Construct the exact pandas 0.x "BlockManager" memory layout +// +// * For each column determine the correct output pandas type +// * Allocate 2D blocks (ncols x nrows) for each distinct data type in output +// * Allocate block placement arrays +// * Write Arrow columns out into each slice of memory; populate block +// * placement arrays as we go +class DataFrameBlockCreator { + public: + DataFrameBlockCreator(const std::shared_ptr
<Table>
& table) : table_(table) {} + + Status Convert(int nthreads, PyObject** output) { + column_types_.resize(table_->num_columns()); + type_counts_.clear(); + blocks_.clear(); + + RETURN_NOT_OK(CountColumnTypes()); + RETURN_NOT_OK(CreateBlocks()); + RETURN_NOT_OK(WriteTableToBlocks(nthreads)); + + return GetResultList(output); + } + + Status CountColumnTypes() { + for (int i = 0; i < table_->num_columns(); ++i) { + std::shared_ptr col = table_->column(i); + PandasBlock::type output_type; + + switch (col->type()->type) { + case Type::BOOL: + output_type = col->null_count() > 0 ? PandasBlock::OBJECT : PandasBlock::BOOL; + break; + case Type::UINT8: + output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::UINT8; + break; + case Type::INT8: + output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::INT8; + break; + case Type::UINT16: + output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::UINT16; + break; + case Type::INT16: + output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::INT16; + break; + case Type::UINT32: + output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::UINT32; + break; + case Type::INT32: + output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::INT32; + break; + case Type::INT64: + output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::INT64; + break; + case Type::UINT64: + output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::UINT64; + break; + case Type::FLOAT: + output_type = PandasBlock::FLOAT; + break; + case Type::DOUBLE: + output_type = PandasBlock::DOUBLE; + break; + case Type::STRING: + case Type::BINARY: + output_type = PandasBlock::OBJECT; + break; + case Type::DATE: + output_type = PandasBlock::DATETIME; + break; + case Type::TIMESTAMP: + output_type = PandasBlock::DATETIME; + break; + default: + return Status::NotImplemented(col->type()->ToString()); + } + + auto it = type_counts_.find(output_type); + if (it != type_counts_.end()) { + // Increment count + it->second += 1; } else { - for (int64_t i = 0; i < arr->length(); ++i) { - data_ptr = binary_arr->GetValue(i, &length); - out_values[i] = PyBytes_FromStringAndSize( - reinterpret_cast(data_ptr), length); - if (out_values[i] == nullptr) { - return Status::UnknownError("String initialization failed"); - } - } + // Add key to map + type_counts_[output_type] = 1; } - chunk_offset += binary_arr->length(); + column_types_[i] = output_type; } + return Status::OK(); + } + Status CreateBlocks() { + for (const auto& it : type_counts_) { + PandasBlock::type type = static_cast(it.first); + std::shared_ptr block; + RETURN_NOT_OK(MakeBlock(type, table_->num_rows(), it.second, &block)); + blocks_[type] = block; + } return Status::OK(); } - private: - std::shared_ptr col_; - PyObject* py_ref_; - PyArrayObject* out_; -}; + Status WriteTableToBlocks(int nthreads) { + if (nthreads > 1) { + return Status::NotImplemented("multithreading not yet implemented"); + } -#define FROM_ARROW_CASE(TYPE) \ - case arrow::Type::TYPE: \ - { \ - ArrowDeserializer converter(col, py_ref); \ - return converter.Convert(out); \ - } \ - break; + for (int i = 0; i < table_->num_columns(); ++i) { + std::shared_ptr col = table_->column(i); + PandasBlock::type output_type = column_types_[i]; -Status ConvertArrayToPandas(const std::shared_ptr& arr, PyObject* py_ref, - PyObject** out) { - static std::string dummy_name = "dummy"; - auto field = std::make_shared(dummy_name, arr->type()); - auto col = 
std::make_shared(field, arr); - return ConvertColumnToPandas(col, py_ref, out); -} + auto it = blocks_.find(output_type); + if (it == blocks_.end()) { return Status::KeyError("No block allocated"); } + RETURN_NOT_OK(it->second->WriteNext(col, i)); + } + return Status::OK(); + } -Status ConvertColumnToPandas(const std::shared_ptr& col, PyObject* py_ref, - PyObject** out) { - switch(col->type()->type) { - FROM_ARROW_CASE(BOOL); - FROM_ARROW_CASE(INT8); - FROM_ARROW_CASE(INT16); - FROM_ARROW_CASE(INT32); - FROM_ARROW_CASE(INT64); - FROM_ARROW_CASE(UINT8); - FROM_ARROW_CASE(UINT16); - FROM_ARROW_CASE(UINT32); - FROM_ARROW_CASE(UINT64); - FROM_ARROW_CASE(FLOAT); - FROM_ARROW_CASE(DOUBLE); - FROM_ARROW_CASE(BINARY); - FROM_ARROW_CASE(STRING); - FROM_ARROW_CASE(DATE); - FROM_ARROW_CASE(TIMESTAMP); - default: - return Status::NotImplemented("Arrow type reading not implemented"); + Status GetResultList(PyObject** out) { + auto num_blocks = static_cast(blocks_.size()); + PyObject* result = PyList_New(num_blocks); + RETURN_IF_PYERROR(); + + int i = 0; + for (const auto& it : blocks_) { + const std::shared_ptr block = it.second; + + PyObject* item = PyTuple_New(2); + RETURN_IF_PYERROR(); + + PyObject* block_arr = block->block_arr(); + PyObject* placement_arr = block->placement_arr(); + Py_INCREF(block_arr); + Py_INCREF(placement_arr); + PyTuple_SET_ITEM(item, 0, block_arr); + PyTuple_SET_ITEM(item, 1, placement_arr); + + if (PyList_SET_ITEM(result, i++, item) < 0) { RETURN_IF_PYERROR(); } + } + *out = result; + return Status::OK(); } - return Status::OK(); + + private: + std::shared_ptr
table_; + std::vector column_types_; + + // block type -> type count + std::unordered_map type_counts_; + + // block type -> block + std::unordered_map> blocks_; +}; + +Status ConvertTableToPandas( + const std::shared_ptr
& table, int nthreads, PyObject** out) { + DataFrameBlockCreator helper(table); + return helper.Convert(nthreads, out); } -} // namespace pyarrow +} // namespace pyarrow diff --git a/python/src/pyarrow/adapters/pandas.h b/python/src/pyarrow/adapters/pandas.h index 532495dd792db..60dadd473ad3f 100644 --- a/python/src/pyarrow/adapters/pandas.h +++ b/python/src/pyarrow/adapters/pandas.h @@ -33,27 +33,42 @@ class Array; class Column; class MemoryPool; class Status; +class Table; -} // namespace arrow +} // namespace arrow namespace pyarrow { PYARROW_EXPORT -arrow::Status ConvertArrayToPandas(const std::shared_ptr& arr, - PyObject* py_ref, PyObject** out); +arrow::Status ConvertArrayToPandas( + const std::shared_ptr& arr, PyObject* py_ref, PyObject** out); PYARROW_EXPORT -arrow::Status ConvertColumnToPandas(const std::shared_ptr& col, - PyObject* py_ref, PyObject** out); +arrow::Status ConvertColumnToPandas( + const std::shared_ptr& col, PyObject* py_ref, PyObject** out); + +struct PandasOptions { + bool strings_to_categorical; +}; + +// Convert a whole table as efficiently as possible to a pandas.DataFrame. +// +// The returned Python object is a list of tuples consisting of the exact 2D +// BlockManager structure of the pandas.DataFrame used as of pandas 0.19.x. +// +// tuple item: (indices: ndarray[int32], block: ndarray[TYPE, ndim=2]) +PYARROW_EXPORT +arrow::Status ConvertTableToPandas( + const std::shared_ptr& table, int nthreads, PyObject** out); PYARROW_EXPORT arrow::Status PandasMaskedToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, std::shared_ptr* out); PYARROW_EXPORT -arrow::Status PandasToArrow(arrow::MemoryPool* pool, PyObject* ao, - std::shared_ptr* out); +arrow::Status PandasToArrow( + arrow::MemoryPool* pool, PyObject* ao, std::shared_ptr* out); -} // namespace pyarrow +} // namespace pyarrow -#endif // PYARROW_ADAPTERS_PANDAS_H +#endif // PYARROW_ADAPTERS_PANDAS_H diff --git a/python/src/pyarrow/api.h b/python/src/pyarrow/api.h index 6dbbc45d40ccc..f65cc097f548f 100644 --- a/python/src/pyarrow/api.h +++ b/python/src/pyarrow/api.h @@ -23,4 +23,4 @@ #include "pyarrow/adapters/builtin.h" #include "pyarrow/adapters/pandas.h" -#endif // PYARROW_API_H +#endif // PYARROW_API_H diff --git a/python/src/pyarrow/common.cc b/python/src/pyarrow/common.cc index fb4d3496ac79f..8660ac8f0cedf 100644 --- a/python/src/pyarrow/common.cc +++ b/python/src/pyarrow/common.cc @@ -73,7 +73,7 @@ arrow::MemoryPool* get_memory_pool() { PyBytesBuffer::PyBytesBuffer(PyObject* obj) : Buffer(reinterpret_cast(PyBytes_AS_STRING(obj)), - PyBytes_GET_SIZE(obj)), + PyBytes_GET_SIZE(obj)), obj_(obj) { Py_INCREF(obj_); } @@ -83,4 +83,4 @@ PyBytesBuffer::~PyBytesBuffer() { Py_DECREF(obj_); } -} // namespace pyarrow +} // namespace pyarrow diff --git a/python/src/pyarrow/common.h b/python/src/pyarrow/common.h index 7e3382634a781..639918d309fe7 100644 --- a/python/src/pyarrow/common.h +++ b/python/src/pyarrow/common.h @@ -24,7 +24,9 @@ #include "arrow/buffer.h" #include "arrow/util/macros.h" -namespace arrow { class MemoryPool; } +namespace arrow { +class MemoryPool; +} namespace pyarrow { @@ -34,27 +36,18 @@ class OwnedRef { public: OwnedRef() : obj_(nullptr) {} - OwnedRef(PyObject* obj) : - obj_(obj) {} + OwnedRef(PyObject* obj) : obj_(obj) {} - ~OwnedRef() { - Py_XDECREF(obj_); - } + ~OwnedRef() { Py_XDECREF(obj_); } void reset(PyObject* obj) { - if (obj_ != nullptr) { - Py_XDECREF(obj_); - } + if (obj_ != nullptr) { Py_XDECREF(obj_); } obj_ = obj; } - void release() { - obj_ = nullptr; - } + void 
release() { obj_ = nullptr; } - PyObject* obj() const{ - return obj_; - } + PyObject* obj() const { return obj_; } private: PyObject* obj_; @@ -78,13 +71,10 @@ struct PyObjectStringify { class PyGILGuard { public: - PyGILGuard() { - state_ = PyGILState_Ensure(); - } + PyGILGuard() { state_ = PyGILState_Ensure(); } + + ~PyGILGuard() { PyGILState_Release(state_); } - ~PyGILGuard() { - PyGILState_Release(state_); - } private: PyGILState_STATE state_; DISALLOW_COPY_AND_ASSIGN(PyGILGuard); @@ -108,8 +98,7 @@ PYARROW_EXPORT arrow::MemoryPool* get_memory_pool(); class PYARROW_EXPORT NumPyBuffer : public arrow::Buffer { public: - NumPyBuffer(PyArrayObject* arr) - : Buffer(nullptr, 0) { + NumPyBuffer(PyArrayObject* arr) : Buffer(nullptr, 0) { arr_ = arr; Py_INCREF(arr); @@ -118,9 +107,7 @@ class PYARROW_EXPORT NumPyBuffer : public arrow::Buffer { capacity_ = size_; } - virtual ~NumPyBuffer() { - Py_XDECREF(arr_); - } + virtual ~NumPyBuffer() { Py_XDECREF(arr_); } private: PyArrayObject* arr_; @@ -135,22 +122,17 @@ class PYARROW_EXPORT PyBytesBuffer : public arrow::Buffer { PyObject* obj_; }; - class PyAcquireGIL { public: - PyAcquireGIL() { - state_ = PyGILState_Ensure(); - } + PyAcquireGIL() { state_ = PyGILState_Ensure(); } - ~PyAcquireGIL() { - PyGILState_Release(state_); - } + ~PyAcquireGIL() { PyGILState_Release(state_); } private: PyGILState_STATE state_; DISALLOW_COPY_AND_ASSIGN(PyAcquireGIL); }; -} // namespace pyarrow +} // namespace pyarrow -#endif // PYARROW_COMMON_H +#endif // PYARROW_COMMON_H diff --git a/python/src/pyarrow/config.cc b/python/src/pyarrow/config.cc index 730d2db99a530..e1002bf4fd146 100644 --- a/python/src/pyarrow/config.cc +++ b/python/src/pyarrow/config.cc @@ -21,8 +21,7 @@ namespace pyarrow { -void pyarrow_init() { -} +void pyarrow_init() {} PyObject* numpy_nan = nullptr; @@ -31,4 +30,4 @@ void pyarrow_set_numpy_nan(PyObject* obj) { numpy_nan = obj; } -} // namespace pyarrow +} // namespace pyarrow diff --git a/python/src/pyarrow/config.h b/python/src/pyarrow/config.h index 82936b1a5f317..386ee4b1e2590 100644 --- a/python/src/pyarrow/config.h +++ b/python/src/pyarrow/config.h @@ -24,7 +24,7 @@ #include "pyarrow/visibility.h" #if PY_MAJOR_VERSION >= 3 - #define PyString_Check PyUnicode_Check +#define PyString_Check PyUnicode_Check #endif namespace pyarrow { @@ -38,6 +38,6 @@ void pyarrow_init(); PYARROW_EXPORT void pyarrow_set_numpy_nan(PyObject* obj); -} // namespace pyarrow +} // namespace pyarrow -#endif // PYARROW_CONFIG_H +#endif // PYARROW_CONFIG_H diff --git a/python/src/pyarrow/helpers.cc b/python/src/pyarrow/helpers.cc index b42199c8e041c..3f650326e09aa 100644 --- a/python/src/pyarrow/helpers.cc +++ b/python/src/pyarrow/helpers.cc @@ -23,36 +23,35 @@ using namespace arrow; namespace pyarrow { - -#define GET_PRIMITIVE_TYPE(NAME, FACTORY) \ - case Type::NAME: \ - return FACTORY(); \ +#define GET_PRIMITIVE_TYPE(NAME, FACTORY) \ + case Type::NAME: \ + return FACTORY(); \ break; std::shared_ptr GetPrimitiveType(Type::type type) { switch (type) { case Type::NA: return null(); - GET_PRIMITIVE_TYPE(UINT8, uint8); - GET_PRIMITIVE_TYPE(INT8, int8); - GET_PRIMITIVE_TYPE(UINT16, uint16); - GET_PRIMITIVE_TYPE(INT16, int16); - GET_PRIMITIVE_TYPE(UINT32, uint32); - GET_PRIMITIVE_TYPE(INT32, int32); - GET_PRIMITIVE_TYPE(UINT64, uint64); - GET_PRIMITIVE_TYPE(INT64, int64); - GET_PRIMITIVE_TYPE(DATE, date); + GET_PRIMITIVE_TYPE(UINT8, uint8); + GET_PRIMITIVE_TYPE(INT8, int8); + GET_PRIMITIVE_TYPE(UINT16, uint16); + GET_PRIMITIVE_TYPE(INT16, int16); + GET_PRIMITIVE_TYPE(UINT32, 
uint32); + GET_PRIMITIVE_TYPE(INT32, int32); + GET_PRIMITIVE_TYPE(UINT64, uint64); + GET_PRIMITIVE_TYPE(INT64, int64); + GET_PRIMITIVE_TYPE(DATE, date); case Type::TIMESTAMP: return arrow::timestamp(arrow::TimeUnit::MICRO); break; - GET_PRIMITIVE_TYPE(BOOL, boolean); - GET_PRIMITIVE_TYPE(FLOAT, float32); - GET_PRIMITIVE_TYPE(DOUBLE, float64); - GET_PRIMITIVE_TYPE(BINARY, binary); - GET_PRIMITIVE_TYPE(STRING, utf8); + GET_PRIMITIVE_TYPE(BOOL, boolean); + GET_PRIMITIVE_TYPE(FLOAT, float32); + GET_PRIMITIVE_TYPE(DOUBLE, float64); + GET_PRIMITIVE_TYPE(BINARY, binary); + GET_PRIMITIVE_TYPE(STRING, utf8); default: return nullptr; } } -} // namespace pyarrow +} // namespace pyarrow diff --git a/python/src/pyarrow/helpers.h b/python/src/pyarrow/helpers.h index 8334d974c0237..788c3eedddfd6 100644 --- a/python/src/pyarrow/helpers.h +++ b/python/src/pyarrow/helpers.h @@ -31,6 +31,6 @@ using arrow::Type; PYARROW_EXPORT std::shared_ptr GetPrimitiveType(Type::type type); -} // namespace pyarrow +} // namespace pyarrow -#endif // PYARROW_HELPERS_H +#endif // PYARROW_HELPERS_H diff --git a/python/src/pyarrow/io.cc b/python/src/pyarrow/io.cc index 12f5ba0bf2b49..ac1aa635b40ea 100644 --- a/python/src/pyarrow/io.cc +++ b/python/src/pyarrow/io.cc @@ -33,8 +33,7 @@ namespace pyarrow { // ---------------------------------------------------------------------- // Python file -PythonFile::PythonFile(PyObject* file) - : file_(file) { +PythonFile::PythonFile(PyObject* file) : file_(file) { Py_INCREF(file_); } @@ -81,8 +80,8 @@ Status PythonFile::Read(int64_t nbytes, PyObject** out) { } Status PythonFile::Write(const uint8_t* data, int64_t nbytes) { - PyObject* py_data = PyBytes_FromStringAndSize( - reinterpret_cast(data), nbytes); + PyObject* py_data = + PyBytes_FromStringAndSize(reinterpret_cast(data), nbytes); ARROW_RETURN_NOT_OK(CheckPyError()); PyObject* result = PyObject_CallMethod(file_, "write", "(O)", py_data); @@ -102,7 +101,7 @@ Status PythonFile::Tell(int64_t* position) { // PyLong_AsLongLong can raise OverflowError ARROW_RETURN_NOT_OK(CheckPyError()); - return Status::OK(); + return Status::OK(); } // ---------------------------------------------------------------------- @@ -156,7 +155,8 @@ Status PyReadableFile::Read(int64_t nbytes, std::shared_ptr* out) Status PyReadableFile::GetSize(int64_t* size) { PyGILGuard lock; - int64_t current_position;; + int64_t current_position; + ; ARROW_RETURN_NOT_OK(file_->Tell(¤t_position)); ARROW_RETURN_NOT_OK(file_->Seek(0, 2)); @@ -204,7 +204,7 @@ Status PyOutputStream::Write(const uint8_t* data, int64_t nbytes) { PyBytesReader::PyBytesReader(PyObject* obj) : arrow::io::BufferReader(reinterpret_cast(PyBytes_AS_STRING(obj)), - PyBytes_GET_SIZE(obj)), + PyBytes_GET_SIZE(obj)), obj_(obj) { Py_INCREF(obj_); } diff --git a/python/src/pyarrow/io.h b/python/src/pyarrow/io.h index e14aa8cfb27e3..fd3e7c0887207 100644 --- a/python/src/pyarrow/io.h +++ b/python/src/pyarrow/io.h @@ -24,7 +24,9 @@ #include "pyarrow/config.h" #include "pyarrow/visibility.h" -namespace arrow { class MemoryPool; } +namespace arrow { +class MemoryPool; +} namespace pyarrow { @@ -92,6 +94,6 @@ class PYARROW_EXPORT PyBytesReader : public arrow::io::BufferReader { // TODO(wesm): seekable output files -} // namespace pyarrow +} // namespace pyarrow #endif // PYARROW_IO_H diff --git a/python/src/pyarrow/numpy_interop.h b/python/src/pyarrow/numpy_interop.h index 882d287c7c559..6326527a67420 100644 --- a/python/src/pyarrow/numpy_interop.h +++ b/python/src/pyarrow/numpy_interop.h @@ -53,6 +53,6 @@ inline int 
import_numpy() { return 0; } -} // namespace pyarrow +} // namespace pyarrow -#endif // PYARROW_NUMPY_INTEROP_H +#endif // PYARROW_NUMPY_INTEROP_H diff --git a/python/src/pyarrow/util/datetime.h b/python/src/pyarrow/util/datetime.h index b67accc388f59..9ffa691052460 100644 --- a/python/src/pyarrow/util/datetime.h +++ b/python/src/pyarrow/util/datetime.h @@ -22,8 +22,8 @@ #include namespace pyarrow { - -inline int64_t PyDate_to_ms(PyDateTime_Date* pydate) { + +inline int64_t PyDate_to_ms(PyDateTime_Date* pydate) { struct tm date = {0}; date.tm_year = PyDateTime_GET_YEAR(pydate) - 1900; date.tm_mon = PyDateTime_GET_MONTH(pydate) - 1; @@ -35,6 +35,6 @@ inline int64_t PyDate_to_ms(PyDateTime_Date* pydate) { return lrint(difftime(mktime(&date), mktime(&epoch)) * 1000); } -} // namespace pyarrow +} // namespace pyarrow -#endif // PYARROW_UTIL_DATETIME_H +#endif // PYARROW_UTIL_DATETIME_H diff --git a/python/src/pyarrow/util/test_main.cc b/python/src/pyarrow/util/test_main.cc index 00139f36742ed..6fb7c0536eed3 100644 --- a/python/src/pyarrow/util/test_main.cc +++ b/python/src/pyarrow/util/test_main.cc @@ -17,7 +17,7 @@ #include -int main(int argc, char **argv) { +int main(int argc, char** argv) { ::testing::InitGoogleTest(&argc, argv); int ret = RUN_ALL_TESTS(); From 1079a3206c58dc053ad6d9ca4ead6446a1c9be80 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 27 Dec 2016 18:04:28 -0500 Subject: [PATCH 0245/1644] ARROW-437: [C++} Fix clang compiler warning Author: Wes McKinney Closes #254 from wesm/ARROW-437 and squashes the following commits: a18a651 [Wes McKinney] Fix compiler warning in clang --- cpp/src/arrow/array.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 26d53f7d75896..5cd56d6df5427 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -147,7 +147,9 @@ class ARROW_EXPORT NumericArray : public PrimitiveArray { return PrimitiveArray::EqualsExact(static_cast(other)); } - bool ApproxEquals(const std::shared_ptr& arr) const { return Equals(arr); } + bool ApproxEquals(const std::shared_ptr& arr) const override { + return Equals(arr); + } bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, const ArrayPtr& arr) const override { From ab5f66a2e9a2b6af312ffdfa2f95c65b1d6f5739 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 28 Dec 2016 07:49:06 -0500 Subject: [PATCH 0246/1644] ARROW-428: [Python] Multithreaded conversion from Arrow table to pandas.DataFrame This yields a substantial speedup on my laptop. On a 1GB numeric dataset, with 1 thread (the default prior to this patch): ``` >>> %timeit df2 = table.to_pandas(nthreads=1) 1 loop, best of 3: 498 ms per loop ``` With 4 threads (this is a true quad-core machine) ``` >>> %timeit df2 = table.to_pandas(nthreads=4) 1 loop, best of 3: 151 ms per loop ``` The default number of cores used is the `os.cpu_count` divided by 2 (since hyperthreading doesn't help with this largely memory-bound operation). Author: Wes McKinney Closes #252 from wesm/ARROW-428 and squashes the following commits: da929bf [Wes McKinney] Factor out common compiler flag code between Arrow C++ and Python CMake files. Add pyarrow.cpu_count/set_cpu_count functions per feedback cad89e9 [Wes McKinney] Tweak pyarrow cmake flags e70f16d [Wes McKinney] Add missing GIL acquisition. Do not spawn too many threads if few columns bc4dff7 [Wes McKinney] Return errors from threaded conversion. 
Add doc about number of cpus used 79f5fd9 [Wes McKinney] Implement multithreaded conversion from Arrow table to pandas.DataFrame. Default to multiprocessing.cpu_count for now --- cpp/CMakeLists.txt | 71 +----------- cpp/cmake_modules/SetupCxxFlags.cmake | 86 ++++++++++++++ python/CMakeLists.txt | 36 +----- python/pyarrow/__init__.py | 1 + python/pyarrow/config.pyx | 23 ++++ python/pyarrow/table.pyx | 38 +++--- python/pyarrow/tests/test_convert_pandas.py | 42 +++++-- python/src/pyarrow/adapters/pandas.cc | 121 ++++++++++++++------ 8 files changed, 250 insertions(+), 168 deletions(-) create mode 100644 cpp/cmake_modules/SetupCxxFlags.cmake diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 93e9853df8972..4507e6783e4de 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -105,76 +105,7 @@ endif() # Compiler flags ############################################################ -# Check if the target architecture and compiler supports some special -# instruction sets that would boost performance. -include(CheckCXXCompilerFlag) -# x86/amd64 compiler flags -CHECK_CXX_COMPILER_FLAG("-msse3" CXX_SUPPORTS_SSE3) -# power compiler flags -CHECK_CXX_COMPILER_FLAG("-maltivec" CXX_SUPPORTS_ALTIVEC) - -# compiler flags that are common across debug/release builds -# - Wall: Enable all warnings. -set(CXX_COMMON_FLAGS "-std=c++11 -Wall") - -# Only enable additional instruction sets if they are supported -if (CXX_SUPPORTS_SSE3 AND ARROW_SSE3) - set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -msse3") -endif() -if (CXX_SUPPORTS_ALTIVEC AND ARROW_ALTIVEC) - set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -maltivec") -endif() - -if (APPLE) - # Depending on the default OSX_DEPLOYMENT_TARGET (< 10.9), libstdc++ may be - # the default standard library which does not support C++11. libc++ is the - # default from 10.9 onward. - set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -stdlib=libc++") -endif() - -# compiler flags for different build types (run 'cmake -DCMAKE_BUILD_TYPE= .') -# For all builds: -# For CMAKE_BUILD_TYPE=Debug -# -ggdb: Enable gdb debugging -# For CMAKE_BUILD_TYPE=FastDebug -# Same as DEBUG, except with some optimizations on. -# For CMAKE_BUILD_TYPE=Release -# -O3: Enable all compiler optimizations -# -g: Enable symbols for profiler tools (TODO: remove for shipping) -if (NOT MSVC) - set(CXX_FLAGS_DEBUG "-ggdb -O0") - set(CXX_FLAGS_FASTDEBUG "-ggdb -O1") - set(CXX_FLAGS_RELEASE "-O3 -g -DNDEBUG") -endif() - -set(CXX_FLAGS_PROFILE_GEN "${CXX_FLAGS_RELEASE} -fprofile-generate") -set(CXX_FLAGS_PROFILE_BUILD "${CXX_FLAGS_RELEASE} -fprofile-use") - -# if no build build type is specified, default to debug builds -if (NOT CMAKE_BUILD_TYPE) - set(CMAKE_BUILD_TYPE Debug) -endif(NOT CMAKE_BUILD_TYPE) - -string (TOUPPER ${CMAKE_BUILD_TYPE} CMAKE_BUILD_TYPE) - - -# Set compile flags based on the build type. 
-message("Configured for ${CMAKE_BUILD_TYPE} build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...})") -if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG") - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX_FLAGS_DEBUG}") -elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "FASTDEBUG") - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX_FLAGS_FASTDEBUG}") -elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "RELEASE") - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX_FLAGS_RELEASE}") -elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "PROFILE_GEN") - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX_FLAGS_PROFILE_GEN}") -elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "PROFILE_BUILD") - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX_FLAGS_PROFILE_BUILD}") -else() - message(FATAL_ERROR "Unknown build type: ${CMAKE_BUILD_TYPE}") -endif () - -message(STATUS "Build Type: ${CMAKE_BUILD_TYPE}") +include(SetupCxxFlags) # Add common flags set(CMAKE_CXX_FLAGS "${CXX_COMMON_FLAGS} ${CMAKE_CXX_FLAGS}") diff --git a/cpp/cmake_modules/SetupCxxFlags.cmake b/cpp/cmake_modules/SetupCxxFlags.cmake new file mode 100644 index 0000000000000..ee672bd5f6a96 --- /dev/null +++ b/cpp/cmake_modules/SetupCxxFlags.cmake @@ -0,0 +1,86 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# Check if the target architecture and compiler supports some special +# instruction sets that would boost performance. +include(CheckCXXCompilerFlag) +# x86/amd64 compiler flags +CHECK_CXX_COMPILER_FLAG("-msse3" CXX_SUPPORTS_SSE3) +# power compiler flags +CHECK_CXX_COMPILER_FLAG("-maltivec" CXX_SUPPORTS_ALTIVEC) + +# compiler flags that are common across debug/release builds +# - Wall: Enable all warnings. +set(CXX_COMMON_FLAGS "-std=c++11 -Wall") + +# Only enable additional instruction sets if they are supported +if (CXX_SUPPORTS_SSE3 AND ARROW_SSE3) + set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -msse3") +endif() +if (CXX_SUPPORTS_ALTIVEC AND ARROW_ALTIVEC) + set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -maltivec") +endif() + +if (APPLE) + # Depending on the default OSX_DEPLOYMENT_TARGET (< 10.9), libstdc++ may be + # the default standard library which does not support C++11. libc++ is the + # default from 10.9 onward. + set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -stdlib=libc++") +endif() + +# compiler flags for different build types (run 'cmake -DCMAKE_BUILD_TYPE= .') +# For all builds: +# For CMAKE_BUILD_TYPE=Debug +# -ggdb: Enable gdb debugging +# For CMAKE_BUILD_TYPE=FastDebug +# Same as DEBUG, except with some optimizations on. 
+# For CMAKE_BUILD_TYPE=Release +# -O3: Enable all compiler optimizations +# -g: Enable symbols for profiler tools (TODO: remove for shipping) +if (NOT MSVC) + set(CXX_FLAGS_DEBUG "-ggdb -O0") + set(CXX_FLAGS_FASTDEBUG "-ggdb -O1") + set(CXX_FLAGS_RELEASE "-O3 -g -DNDEBUG") +endif() + +set(CXX_FLAGS_PROFILE_GEN "${CXX_FLAGS_RELEASE} -fprofile-generate") +set(CXX_FLAGS_PROFILE_BUILD "${CXX_FLAGS_RELEASE} -fprofile-use") + +# if no build build type is specified, default to debug builds +if (NOT CMAKE_BUILD_TYPE) + set(CMAKE_BUILD_TYPE Debug) +endif(NOT CMAKE_BUILD_TYPE) + +string (TOUPPER ${CMAKE_BUILD_TYPE} CMAKE_BUILD_TYPE) + +# Set compile flags based on the build type. +message("Configured for ${CMAKE_BUILD_TYPE} build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...})") +if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG") + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX_FLAGS_DEBUG}") +elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "FASTDEBUG") + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX_FLAGS_FASTDEBUG}") +elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "RELEASE") + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX_FLAGS_RELEASE}") +elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "PROFILE_GEN") + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX_FLAGS_PROFILE_GEN}") +elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "PROFILE_BUILD") + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX_FLAGS_PROFILE_BUILD}") +else() + message(FATAL_ERROR "Unknown build type: ${CMAKE_BUILD_TYPE}") +endif () + +message(STATUS "Build Type: ${CMAKE_BUILD_TYPE}") diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 6ad55f8c9a7b8..6c2477235faaa 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -65,41 +65,7 @@ endif(CCACHE_FOUND) # Compiler flags ############################################################ -# compiler flags that are common across debug/release builds -set(CXX_COMMON_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -Wall") - -# compiler flags for different build types (run 'cmake -DCMAKE_BUILD_TYPE= .') -# For all builds: -# For CMAKE_BUILD_TYPE=Debug -# -ggdb: Enable gdb debugging -# For CMAKE_BUILD_TYPE=FastDebug -# Same as DEBUG, except with some optimizations on. -# For CMAKE_BUILD_TYPE=Release -# -O3: Enable all compiler optimizations -# -g: Enable symbols for profiler tools (TODO: remove for shipping) -# -DNDEBUG: Turn off dchecks/asserts/debug only code. -set(CXX_FLAGS_DEBUG "-ggdb -O0") -set(CXX_FLAGS_FASTDEBUG "-ggdb -O1") -set(CXX_FLAGS_RELEASE "-O3 -g -DNDEBUG") - -# if no build build type is specified, default to debug builds -if (NOT CMAKE_BUILD_TYPE) - set(CMAKE_BUILD_TYPE Debug) -endif(NOT CMAKE_BUILD_TYPE) - -string (TOUPPER ${CMAKE_BUILD_TYPE} CMAKE_BUILD_TYPE) - -# Set compile flags based on the build type. 
-message("Configured for ${CMAKE_BUILD_TYPE} build (set with cmake -DCMAKE_BUILD_TYPE={release,debug,...})") -if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG") - set(CMAKE_CXX_FLAGS ${CXX_FLAGS_DEBUG}) -elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "FASTDEBUG") - set(CMAKE_CXX_FLAGS ${CXX_FLAGS_FASTDEBUG}) -elseif ("${CMAKE_BUILD_TYPE}" STREQUAL "RELEASE") - set(CMAKE_CXX_FLAGS ${CXX_FLAGS_RELEASE}) -else() - message(FATAL_ERROR "Unknown build type: ${CMAKE_BUILD_TYPE}") -endif () +include(SetupCxxFlags) # Add common flags set(CMAKE_CXX_FLAGS "${CXX_COMMON_FLAGS} ${CMAKE_CXX_FLAGS}") diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 9ede9348c93de..6f81ef470a86c 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -26,6 +26,7 @@ import pyarrow.config +from pyarrow.config import cpu_count, set_cpu_count from pyarrow.array import (Array, from_pandas_series, from_pylist, diff --git a/python/pyarrow/config.pyx b/python/pyarrow/config.pyx index 778c15a5e655b..aa30f097248cd 100644 --- a/python/pyarrow/config.pyx +++ b/python/pyarrow/config.pyx @@ -29,3 +29,26 @@ pyarrow_init() import numpy as np pyarrow_set_numpy_nan(np.nan) + +import multiprocessing +import os +cdef int CPU_COUNT = int( + os.environ.get('OMP_NUM_THREADS', + max(multiprocessing.cpu_count() // 2, 1))) + +def cpu_count(): + """ + Returns + ------- + count : Number of CPUs to use by default in parallel operations. Default is + max(1, multiprocessing.cpu_count() / 2), but can be overridden by the + OMP_NUM_THREADS environment variable. For the default, we divide the CPU + count by 2 because most modern computers have hyperthreading turned on, + so doubling the CPU count beyond the number of physical cores does not + help. + """ + return CPU_COUNT + +def set_cpu_count(count): + global CPU_COUNT + CPU_COUNT = max(int(count), 1) diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index 9375557888490..20137e3d4f8d9 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -439,7 +439,9 @@ cdef table_to_blockmanager(const shared_ptr[CTable]& table, int nthreads): from pandas.core.internals import BlockManager, make_block from pandas import RangeIndex - check_status(pyarrow.ConvertTableToPandas(table, nthreads, &result_obj)) + with nogil: + check_status(pyarrow.ConvertTableToPandas(table, nthreads, + &result_obj)) result = PyObject_to_object(result_obj) @@ -610,36 +612,28 @@ cdef class Table: table.init(c_table) return table - def to_pandas(self, nthreads=1, block_based=True): + def to_pandas(self, nthreads=None): """ Convert the arrow::Table to a pandas DataFrame + Parameters + ---------- + nthreads : int, default max(1, multiprocessing.cpu_count() / 2) + For the default, we divide the CPU count by 2 because most modern + computers have hyperthreading turned on, so doubling the CPU count + beyond the number of physical cores does not help + Returns ------- pandas.DataFrame """ - cdef: - PyObject* arr - shared_ptr[CColumn] col - Column column - import pandas as pd - if block_based: - mgr = table_to_blockmanager(self.sp_table, nthreads) - return pd.DataFrame(mgr) - else: - names = [] - data = [] - for i in range(self.table.num_columns()): - col = self.table.column(i) - column = self.column(i) - check_status(pyarrow.ConvertColumnToPandas( - col, column, &arr)) - names.append(frombytes(col.get().name())) - data.append(PyObject_to_object(arr)) - - return pd.DataFrame(dict(zip(names, data)), columns=names) + if nthreads is None: + nthreads = pyarrow.config.cpu_count() + + mgr = 
table_to_blockmanager(self.sp_table, nthreads) + return pd.DataFrame(mgr) @property def name(self): diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index da34f85588130..863aa3073fe12 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -27,6 +27,29 @@ import pyarrow as A +def _alltypes_example(size=100): + return pd.DataFrame({ + 'uint8': np.arange(size, dtype=np.uint8), + 'uint16': np.arange(size, dtype=np.uint16), + 'uint32': np.arange(size, dtype=np.uint32), + 'uint64': np.arange(size, dtype=np.uint64), + 'int8': np.arange(size, dtype=np.int16), + 'int16': np.arange(size, dtype=np.int16), + 'int32': np.arange(size, dtype=np.int32), + 'int64': np.arange(size, dtype=np.int64), + 'float32': np.arange(size, dtype=np.float32), + 'float64': np.arange(size, dtype=np.float64), + 'bool': np.random.randn(size) > 0, + # TODO(wesm): Pandas only support ns resolution, Arrow supports s, ms, + # us, ns + 'datetime': np.arange("2016-01-01T00:00:00.001", size, + dtype='datetime64[ms]'), + 'str': [str(x) for x in range(size)], + 'str_with_nulls': [None] + [str(x) for x in range(size - 2)] + [None], + 'empty_str': [''] * size + }) + + class TestPandasConversion(unittest.TestCase): def setUp(self): @@ -35,10 +58,10 @@ def setUp(self): def tearDown(self): pass - def _check_pandas_roundtrip(self, df, expected=None, + def _check_pandas_roundtrip(self, df, expected=None, nthreads=1, timestamps_to_ms=False): table = A.from_pandas_dataframe(df, timestamps_to_ms=timestamps_to_ms) - result = table.to_pandas() + result = table.to_pandas(nthreads=nthreads) if expected is None: expected = df tm.assert_frame_equal(result, expected) @@ -217,18 +240,21 @@ def test_timestamps_notimezone_nulls(self): def test_date(self): df = pd.DataFrame({ - 'date': [ - datetime.date(2000, 1, 1), - None, - datetime.date(1970, 1, 1), - datetime.date(2040, 2, 26) - ]}) + 'date': [datetime.date(2000, 1, 1), + None, + datetime.date(1970, 1, 1), + datetime.date(2040, 2, 26)]}) table = A.from_pandas_dataframe(df) result = table.to_pandas() expected = df.copy() expected['date'] = pd.to_datetime(df['date']) tm.assert_frame_equal(result, expected) + def test_threaded_conversion(self): + df = _alltypes_example() + self._check_pandas_roundtrip(df, nthreads=2, + timestamps_to_ms=False) + # def test_category(self): # repeats = 1000 # values = [b'foo', None, u'bar', 'qux', np.nan] diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index 899eb5519d562..5e5826b8236a6 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -19,15 +19,18 @@ #include -#include "pyarrow/numpy_interop.h" - #include "pyarrow/adapters/pandas.h" +#include "pyarrow/numpy_interop.h" +#include +#include #include #include #include +#include #include #include +#include #include #include "arrow/api.h" @@ -1031,7 +1034,8 @@ class PandasBlock { : num_rows_(num_rows), num_columns_(num_columns) {} virtual Status Allocate() = 0; - virtual Status WriteNext(const std::shared_ptr& col, int64_t placement) = 0; + virtual Status Write(const std::shared_ptr& col, int64_t abs_placement, + int64_t rel_placement) = 0; PyObject* block_arr() { return block_arr_.obj(); } @@ -1057,7 +1061,6 @@ class PandasBlock { block_arr_.reset(block_arr); placement_arr_.reset(placement_arr); - current_placement_index_ = 0; block_data_ = reinterpret_cast( PyArray_DATA(reinterpret_cast(block_arr))); @@ -1070,7 +1073,6 @@ class 
PandasBlock { int64_t num_rows_; int num_columns_; - int current_placement_index_; OwnedRef block_arr_; uint8_t* block_data_; @@ -1088,11 +1090,12 @@ class ObjectBlock : public PandasBlock { Status Allocate() override { return AllocateNDArray(NPY_OBJECT); } - Status WriteNext(const std::shared_ptr& col, int64_t placement) override { + Status Write(const std::shared_ptr& col, int64_t abs_placement, + int64_t rel_placement) override { Type::type type = col->type()->type; PyObject** out_buffer = - reinterpret_cast(block_data_) + current_placement_index_ * num_rows_; + reinterpret_cast(block_data_) + rel_placement * num_rows_; const ChunkedArray& data = *col->data().get(); @@ -1108,7 +1111,7 @@ class ObjectBlock : public PandasBlock { return Status::NotImplemented(ss.str()); } - placement_data_[current_placement_index_++] = placement; + placement_data_[rel_placement] = abs_placement; return Status::OK(); } }; @@ -1122,18 +1125,19 @@ class IntBlock : public PandasBlock { return AllocateNDArray(arrow_traits::npy_type); } - Status WriteNext(const std::shared_ptr& col, int64_t placement) override { + Status Write(const std::shared_ptr& col, int64_t abs_placement, + int64_t rel_placement) override { Type::type type = col->type()->type; C_TYPE* out_buffer = - reinterpret_cast(block_data_) + current_placement_index_ * num_rows_; + reinterpret_cast(block_data_) + rel_placement * num_rows_; const ChunkedArray& data = *col->data().get(); if (type != ARROW_TYPE) { return Status::NotImplemented(col->type()->ToString()); } ConvertIntegerNoNullsSameType(data, out_buffer); - placement_data_[current_placement_index_++] = placement; + placement_data_[rel_placement] = abs_placement; return Status::OK(); } }; @@ -1153,16 +1157,16 @@ class Float32Block : public PandasBlock { Status Allocate() override { return AllocateNDArray(NPY_FLOAT32); } - Status WriteNext(const std::shared_ptr& col, int64_t placement) override { + Status Write(const std::shared_ptr& col, int64_t abs_placement, + int64_t rel_placement) override { Type::type type = col->type()->type; if (type != Type::FLOAT) { return Status::NotImplemented(col->type()->ToString()); } - float* out_buffer = - reinterpret_cast(block_data_) + current_placement_index_ * num_rows_; + float* out_buffer = reinterpret_cast(block_data_) + rel_placement * num_rows_; ConvertNumericNullable(*col->data().get(), NAN, out_buffer); - placement_data_[current_placement_index_++] = placement; + placement_data_[rel_placement] = abs_placement; return Status::OK(); } }; @@ -1173,11 +1177,12 @@ class Float64Block : public PandasBlock { Status Allocate() override { return AllocateNDArray(NPY_FLOAT64); } - Status WriteNext(const std::shared_ptr& col, int64_t placement) override { + Status Write(const std::shared_ptr& col, int64_t abs_placement, + int64_t rel_placement) override { Type::type type = col->type()->type; double* out_buffer = - reinterpret_cast(block_data_) + current_placement_index_ * num_rows_; + reinterpret_cast(block_data_) + rel_placement * num_rows_; const ChunkedArray& data = *col->data().get(); @@ -1214,7 +1219,7 @@ class Float64Block : public PandasBlock { #undef INTEGER_CASE - placement_data_[current_placement_index_++] = placement; + placement_data_[rel_placement] = abs_placement; return Status::OK(); } }; @@ -1225,16 +1230,17 @@ class BoolBlock : public PandasBlock { Status Allocate() override { return AllocateNDArray(NPY_BOOL); } - Status WriteNext(const std::shared_ptr& col, int64_t placement) override { + Status Write(const std::shared_ptr& col, int64_t 
abs_placement, + int64_t rel_placement) override { Type::type type = col->type()->type; if (type != Type::BOOL) { return Status::NotImplemented(col->type()->ToString()); } uint8_t* out_buffer = - reinterpret_cast(block_data_) + current_placement_index_ * num_rows_; + reinterpret_cast(block_data_) + rel_placement * num_rows_; ConvertBooleanNoNulls(*col->data().get(), out_buffer); - placement_data_[current_placement_index_++] = placement; + placement_data_[rel_placement] = abs_placement; return Status::OK(); } }; @@ -1253,11 +1259,12 @@ class DatetimeBlock : public PandasBlock { return Status::OK(); } - Status WriteNext(const std::shared_ptr& col, int64_t placement) override { + Status Write(const std::shared_ptr& col, int64_t abs_placement, + int64_t rel_placement) override { Type::type type = col->type()->type; int64_t* out_buffer = - reinterpret_cast(block_data_) + current_placement_index_ * num_rows_; + reinterpret_cast(block_data_) + rel_placement * num_rows_; const ChunkedArray& data = *col.get()->data(); @@ -1283,7 +1290,7 @@ class DatetimeBlock : public PandasBlock { return Status::NotImplemented(col->type()->ToString()); } - placement_data_[current_placement_index_++] = placement; + placement_data_[rel_placement] = abs_placement; return Status::OK(); } }; @@ -1333,6 +1340,7 @@ class DataFrameBlockCreator { Status Convert(int nthreads, PyObject** output) { column_types_.resize(table_->num_columns()); + column_block_placement_.resize(table_->num_columns()); type_counts_.clear(); blocks_.clear(); @@ -1397,7 +1405,9 @@ class DataFrameBlockCreator { } auto it = type_counts_.find(output_type); + int block_placement = 0; if (it != type_counts_.end()) { + block_placement = it->second; // Increment count it->second += 1; } else { @@ -1406,6 +1416,7 @@ class DataFrameBlockCreator { } column_types_[i] = output_type; + column_block_placement_[i] = block_placement; } return Status::OK(); } @@ -1421,22 +1432,61 @@ class DataFrameBlockCreator { } Status WriteTableToBlocks(int nthreads) { - if (nthreads > 1) { - return Status::NotImplemented("multithreading not yet implemented"); - } + auto WriteColumn = [this](int i) { + std::shared_ptr col = this->table_->column(i); + PandasBlock::type output_type = this->column_types_[i]; - for (int i = 0; i < table_->num_columns(); ++i) { - std::shared_ptr col = table_->column(i); - PandasBlock::type output_type = column_types_[i]; + int rel_placement = this->column_block_placement_[i]; + + auto it = this->blocks_.find(output_type); + if (it == this->blocks_.end()) { return Status::KeyError("No block allocated"); } + return it->second->Write(col, i, rel_placement); + }; - auto it = blocks_.find(output_type); - if (it == blocks_.end()) { return Status::KeyError("No block allocated"); } - RETURN_NOT_OK(it->second->WriteNext(col, i)); + nthreads = std::min(nthreads, table_->num_columns()); + + if (nthreads == 1) { + for (int i = 0; i < table_->num_columns(); ++i) { + RETURN_NOT_OK(WriteColumn(i)); + } + } else { + std::vector thread_pool; + thread_pool.reserve(nthreads); + std::atomic task_counter(0); + + std::mutex error_mtx; + bool error_occurred = false; + Status error; + + for (int thread_id = 0; thread_id < nthreads; ++thread_id) { + thread_pool.emplace_back( + [this, &error, &error_occurred, &error_mtx, &task_counter, &WriteColumn]() { + int column_num; + while (!error_occurred) { + column_num = task_counter.fetch_add(1); + if (column_num >= this->table_->num_columns()) { break; } + Status s = WriteColumn(column_num); + if (!s.ok()) { + std::lock_guard 
lock(error_mtx); + error_occurred = true; + error = s; + break; + } + } + }); + } + for (auto&& thread : thread_pool) { + thread.join(); + } + + if (error_occurred) { return error; } } return Status::OK(); } Status GetResultList(PyObject** out) { + PyAcquireGIL lock; + auto num_blocks = static_cast(blocks_.size()); PyObject* result = PyList_New(num_blocks); RETURN_IF_PYERROR(); @@ -1463,8 +1513,13 @@ class DataFrameBlockCreator { private: std::shared_ptr
table_; + + // column num -> block type id std::vector column_types_; + // column num -> relative placement within internal block + std::vector column_block_placement_; + // block type -> type count std::unordered_map type_counts_; From cfbdb680063b15b5068d99175fe2f042d16abf52 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 28 Dec 2016 14:52:43 +0100 Subject: [PATCH 0247/1644] ARROW-322: [C++] Remove ARROW_HDFS option, always build the module Author: Wes McKinney Closes #253 from wesm/ARROW-322 and squashes the following commits: e793fd1 [Wes McKinney] Use string() instead of native() for file paths because windows uses utf16 native encoding d0cc376 [Wes McKinney] Add NOMINMAX windows workaround 5e53ddb [Wes McKinney] Visibility fix ea8fb9d [Wes McKinney] Various Win32 compilation fixes 82c4d2d [Wes McKinney] Remove ARROW_HDFS option, always build the module --- ci/travis_before_script_cpp.sh | 2 - cpp/CMakeLists.txt | 4 -- cpp/src/arrow/io/CMakeLists.txt | 56 ++++++++-------------- cpp/src/arrow/io/hdfs-internal.cc | 26 ++-------- cpp/src/arrow/io/hdfs-internal.h | 22 ++++++++- cpp/src/arrow/io/io-hdfs-test.cc | 2 +- cpp/src/arrow/ipc/json-integration-test.cc | 2 +- 7 files changed, 47 insertions(+), 67 deletions(-) diff --git a/ci/travis_before_script_cpp.sh b/ci/travis_before_script_cpp.sh index 20307736e672a..73bdaeb81fe78 100755 --- a/ci/travis_before_script_cpp.sh +++ b/ci/travis_before_script_cpp.sh @@ -26,8 +26,6 @@ CPP_DIR=$TRAVIS_BUILD_DIR/cpp CMAKE_COMMON_FLAGS="\ -DARROW_BUILD_BENCHMARKS=ON \ --DARROW_PARQUET=OFF \ --DARROW_HDFS=ON \ -DCMAKE_INSTALL_PREFIX=$ARROW_CPP_INSTALL" if [ $TRAVIS_OS_NAME == "linux" ]; then diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 4507e6783e4de..47b767119c95b 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -74,10 +74,6 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") "Build the Arrow IPC extensions" ON) - option(ARROW_HDFS - "Build the Arrow IO extensions for the Hadoop file system" - OFF) - option(ARROW_BOOST_USE_SHARED "Rely on boost shared libraries where relevant" ON) diff --git a/cpp/src/arrow/io/CMakeLists.txt b/cpp/src/arrow/io/CMakeLists.txt index 2062cd43b7b48..1e65a1a46abb4 100644 --- a/cpp/src/arrow/io/CMakeLists.txt +++ b/cpp/src/arrow/io/CMakeLists.txt @@ -45,50 +45,30 @@ set(ARROW_IO_TEST_LINK_LIBS set(ARROW_IO_SRCS file.cc + hdfs.cc + hdfs-internal.cc interfaces.cc memory.cc ) -if(ARROW_HDFS) - if(NOT THIRDPARTY_DIR) - message(FATAL_ERROR "THIRDPARTY_DIR not set") - endif() - - if (DEFINED ENV{HADOOP_HOME}) - set(HADOOP_HOME $ENV{HADOOP_HOME}) - if (NOT EXISTS "${HADOOP_HOME}/include/hdfs.h") - message(STATUS "Did not find hdfs.h in expected location, using vendored one") - set(HADOOP_HOME "${THIRDPARTY_DIR}/hadoop") - endif() - else() +# HDFS thirdparty setup +if (DEFINED ENV{HADOOP_HOME}) + set(HADOOP_HOME $ENV{HADOOP_HOME}) + if (NOT EXISTS "${HADOOP_HOME}/include/hdfs.h") + message(STATUS "Did not find hdfs.h in expected location, using vendored one") set(HADOOP_HOME "${THIRDPARTY_DIR}/hadoop") endif() +else() + set(HADOOP_HOME "${THIRDPARTY_DIR}/hadoop") +endif() - set(HDFS_H_PATH "${HADOOP_HOME}/include/hdfs.h") - if (NOT EXISTS ${HDFS_H_PATH}) - message(FATAL_ERROR "Did not find hdfs.h at ${HDFS_H_PATH}") - endif() - message(STATUS "Found hdfs.h at: " ${HDFS_H_PATH}) - message(STATUS "Building libhdfs shim component") - - include_directories(SYSTEM "${HADOOP_HOME}/include") - - set(ARROW_HDFS_SRCS - hdfs.cc - hdfs-internal.cc) - - set_property(SOURCE ${ARROW_HDFS_SRCS} - 
APPEND_STRING PROPERTY - COMPILE_FLAGS "-DHAS_HADOOP") - - set(ARROW_IO_SRCS - ${ARROW_HDFS_SRCS} - ${ARROW_IO_SRCS}) - - ADD_ARROW_TEST(io-hdfs-test) - ARROW_TEST_LINK_LIBRARIES(io-hdfs-test - ${ARROW_IO_TEST_LINK_LIBS}) +set(HDFS_H_PATH "${HADOOP_HOME}/include/hdfs.h") +if (NOT EXISTS ${HDFS_H_PATH}) + message(FATAL_ERROR "Did not find hdfs.h at ${HDFS_H_PATH}") endif() +message(STATUS "Found hdfs.h at: " ${HDFS_H_PATH}) + +include_directories(SYSTEM "${HADOOP_HOME}/include") add_library(arrow_io SHARED ${ARROW_IO_SRCS} @@ -119,6 +99,10 @@ ADD_ARROW_TEST(io-file-test) ARROW_TEST_LINK_LIBRARIES(io-file-test ${ARROW_IO_TEST_LINK_LIBS}) +ADD_ARROW_TEST(io-hdfs-test) +ARROW_TEST_LINK_LIBRARIES(io-hdfs-test + ${ARROW_IO_TEST_LINK_LIBS}) + ADD_ARROW_TEST(io-memory-test) ARROW_TEST_LINK_LIBRARIES(io-memory-test ${ARROW_IO_TEST_LINK_LIBS}) diff --git a/cpp/src/arrow/io/hdfs-internal.cc b/cpp/src/arrow/io/hdfs-internal.cc index 7094785de02a0..e4b2cd55978cb 100644 --- a/cpp/src/arrow/io/hdfs-internal.cc +++ b/cpp/src/arrow/io/hdfs-internal.cc @@ -28,21 +28,7 @@ // This software may be modified and distributed under the terms // of the BSD license. See the LICENSE file for details. -#ifdef HAS_HADOOP - -#ifndef _WIN32 -#include -#else -#include -#include - -// TODO(wesm): address when/if we add windows support -// #include -#endif - -extern "C" { -#include -} +#include "arrow/io/hdfs-internal.h" #include #include @@ -53,7 +39,6 @@ extern "C" { #include // NOLINT -#include "arrow/io/hdfs-internal.h" #include "arrow/status.h" #include "arrow/util/visibility.h" @@ -265,7 +250,8 @@ static inline void* GetLibrarySymbol(void* handle, const char* symbol) { return dlsym(handle, symbol); #else - void* ret = reinterpret_cast(GetProcAddress(handle, symbol)); + void* ret = reinterpret_cast( + GetProcAddress(reinterpret_cast(handle), symbol)); if (ret == NULL) { // logstream(LOG_INFO) << "GetProcAddress error: " // << get_last_err_str(GetLastError()) << std::endl; @@ -537,7 +523,7 @@ Status LibHdfsShim::GetRequiredSymbols() { return Status::OK(); } -Status ARROW_EXPORT ConnectLibHdfs(LibHdfsShim** driver) { +Status ConnectLibHdfs(LibHdfsShim** driver) { static std::mutex lock; std::lock_guard guard(lock); @@ -562,7 +548,7 @@ Status ARROW_EXPORT ConnectLibHdfs(LibHdfsShim** driver) { return shim->GetRequiredSymbols(); } -Status ARROW_EXPORT ConnectLibHdfs3(LibHdfsShim** driver) { +Status ConnectLibHdfs3(LibHdfsShim** driver) { static std::mutex lock; std::lock_guard guard(lock); @@ -586,5 +572,3 @@ Status ARROW_EXPORT ConnectLibHdfs3(LibHdfsShim** driver) { } // namespace io } // namespace arrow - -#endif // HAS_HADOOP diff --git a/cpp/src/arrow/io/hdfs-internal.h b/cpp/src/arrow/io/hdfs-internal.h index 0ff118a8f57e7..8f9a06758cbaa 100644 --- a/cpp/src/arrow/io/hdfs-internal.h +++ b/cpp/src/arrow/io/hdfs-internal.h @@ -18,8 +18,25 @@ #ifndef ARROW_IO_HDFS_INTERNAL #define ARROW_IO_HDFS_INTERNAL +#ifndef _WIN32 +#include +#else + +// Windows defines min and max macros that mess up std::min/maxa +#ifndef NOMINMAX +#define NOMINMAX +#endif +#include +#include + +// TODO(wesm): address when/if we add windows support +// #include +#endif + #include +#include "arrow/util/visibility.h" + namespace arrow { class Status; @@ -194,8 +211,9 @@ struct LibHdfsShim { Status GetRequiredSymbols(); }; -Status ConnectLibHdfs(LibHdfsShim** driver); -Status ConnectLibHdfs3(LibHdfsShim** driver); +// TODO(wesm): Remove these exports when we are linking statically +Status ARROW_EXPORT ConnectLibHdfs(LibHdfsShim** driver); +Status 
ARROW_EXPORT ConnectLibHdfs3(LibHdfsShim** driver); } // namespace io } // namespace arrow diff --git a/cpp/src/arrow/io/io-hdfs-test.cc b/cpp/src/arrow/io/io-hdfs-test.cc index 4ef47b8babe6e..72e0ba8f2987b 100644 --- a/cpp/src/arrow/io/io-hdfs-test.cc +++ b/cpp/src/arrow/io/io-hdfs-test.cc @@ -79,7 +79,7 @@ class TestHdfsClient : public ::testing::Test { client_ = nullptr; scratch_dir_ = - boost::filesystem::unique_path("/tmp/arrow-hdfs/scratch-%%%%").native(); + boost::filesystem::unique_path("/tmp/arrow-hdfs/scratch-%%%%").string(); loaded_driver_ = false; diff --git a/cpp/src/arrow/ipc/json-integration-test.cc b/cpp/src/arrow/ipc/json-integration-test.cc index 5e593560f8cfa..757e6c00ab243 100644 --- a/cpp/src/arrow/ipc/json-integration-test.cc +++ b/cpp/src/arrow/ipc/json-integration-test.cc @@ -221,7 +221,7 @@ Status RunCommand(const std::string& json_path, const std::string& arrow_path, } static std::string temp_path() { - return (fs::temp_directory_path() / fs::unique_path()).native(); + return (fs::temp_directory_path() / fs::unique_path()).string(); } class TestJSONIntegration : public ::testing::Test { From 8aab00ee16d9dfe7ed578c8dbe59761eaa68670f Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Wed, 28 Dec 2016 10:54:27 -0500 Subject: [PATCH 0248/1644] ARROW-330: CMake functions to simplify shared / static library configuration This also fixes ARROW-303 Author: Uwe L. Korn Closes #255 from xhochy/ARROW-330 and squashes the following commits: a495d16 [Uwe L. Korn] Fix linking order 17131c9 [Uwe L. Korn] ARROW-330: CMake functions to simplify shared / static library configuration --- cpp/CMakeLists.txt | 54 ++--------------- cpp/cmake_modules/BuildUtils.cmake | 77 ++++++++++++++++++++++++ cpp/src/arrow/io/CMakeLists.txt | 97 +++++++++++++++--------------- cpp/src/arrow/io/memory.h | 2 +- cpp/src/arrow/ipc/CMakeLists.txt | 67 +++++++++------------ 5 files changed, 164 insertions(+), 133 deletions(-) create mode 100644 cpp/cmake_modules/BuildUtils.cmake diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 47b767119c95b..bf30543dc4d65 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -717,61 +717,19 @@ set(ARROW_SRCS src/arrow/util/bit-util.cc ) -add_library(arrow_objlib OBJECT - ${ARROW_SRCS} -) - -# Necessary to make static linking into other shared libraries work properly -set_property(TARGET arrow_objlib PROPERTY POSITION_INDEPENDENT_CODE 1) - if(NOT APPLE) # Localize thirdparty symbols using a linker version script. This hides them # from the client application. The OS X linker does not support the # version-script option. 
- set(SHARED_LINK_FLAGS "-Wl,--version-script=${CMAKE_CURRENT_SOURCE_DIR}/src/arrow/symbols.map") -endif() - -if (ARROW_BUILD_SHARED) - add_library(arrow_shared SHARED $) - if(APPLE) - set_target_properties(arrow_shared PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") - endif() - set_target_properties(arrow_shared - PROPERTIES - LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}" - LINK_FLAGS "${SHARED_LINK_FLAGS}" - OUTPUT_NAME "arrow") - target_link_libraries(arrow_shared - LINK_PUBLIC ${ARROW_LINK_LIBS} - LINK_PRIVATE ${ARROW_PRIVATE_LINK_LIBS}) - - install(TARGETS arrow_shared - LIBRARY DESTINATION lib - ARCHIVE DESTINATION lib) + set(ARROW_SHARED_LINK_FLAGS "-Wl,--version-script=${CMAKE_CURRENT_SOURCE_DIR}/src/arrow/symbols.map") endif() -if (ARROW_BUILD_STATIC) - add_library(arrow_static STATIC $) - set_target_properties(arrow_static - PROPERTIES - LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}" - OUTPUT_NAME "arrow") +include(BuildUtils) - target_link_libraries(arrow_static - LINK_PUBLIC ${ARROW_LINK_LIBS} - LINK_PRIVATE ${ARROW_PRIVATE_LINK_LIBS}) - - install(TARGETS arrow_static - LIBRARY DESTINATION lib - ARCHIVE DESTINATION lib) -endif() - -if (APPLE) - set_target_properties(arrow_shared - PROPERTIES - BUILD_WITH_INSTALL_RPATH ON - INSTALL_NAME_DIR "@rpath") -endif() +ADD_ARROW_LIB(arrow + SOURCES ${ARROW_SRCS} + SHARED_LINK_FLAGS ${ARROW_SHARED_LINK_FLAGS} +) add_subdirectory(src/arrow) add_subdirectory(src/arrow/io) diff --git a/cpp/cmake_modules/BuildUtils.cmake b/cpp/cmake_modules/BuildUtils.cmake new file mode 100644 index 0000000000000..b620de515c126 --- /dev/null +++ b/cpp/cmake_modules/BuildUtils.cmake @@ -0,0 +1,77 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +function(ADD_ARROW_LIB LIB_NAME) + set(options) + set(one_value_args SHARED_LINK_FLAGS) + set(multi_value_args SOURCES STATIC_LINK_LIBS STATIC_PRIVATE_LINK_LIBS SHARED_LINK_LIBS SHARED_PRIVATE_LINK_LIBS) + cmake_parse_arguments(ARG "${options}" "${one_value_args}" "${multi_value_args}" ${ARGN}) + if(ARG_UNPARSED_ARGUMENTS) + message(SEND_ERROR "Error: unrecognized arguments: ${ARG_UNPARSED_ARGUMENTS}") + endif() + + add_library(${LIB_NAME}_objlib OBJECT + ${ARG_SOURCES} + ) + + # Necessary to make static linking into other shared libraries work properly + set_property(TARGET ${LIB_NAME}_objlib PROPERTY POSITION_INDEPENDENT_CODE 1) + + if (ARROW_BUILD_SHARED) + add_library(${LIB_NAME}_shared SHARED $) + if(APPLE) + set_target_properties(${LIB_NAME}_shared PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") + endif() + set_target_properties(${LIB_NAME}_shared + PROPERTIES + LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}" + LINK_FLAGS "${ARG_SHARED_LINK_FLAGS}" + OUTPUT_NAME ${LIB_NAME}) + target_link_libraries(${LIB_NAME}_shared + LINK_PUBLIC ${ARG_SHARED_LINK_LIBS} + LINK_PRIVATE ${ARG_SHARED_PRIVATE_LINK_LIBS}) + + install(TARGETS ${LIB_NAME}_shared + LIBRARY DESTINATION lib + ARCHIVE DESTINATION lib) + endif() + + if (ARROW_BUILD_STATIC) + add_library(${LIB_NAME}_static STATIC $) + set_target_properties(${LIB_NAME}_static + PROPERTIES + LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}" + OUTPUT_NAME ${LIB_NAME}) + + target_link_libraries(${LIB_NAME}_static + LINK_PUBLIC ${ARG_STATIC_LINK_LIBS} + LINK_PRIVATE ${ARG_STATIC_PRIVATE_LINK_LIBS}) + + install(TARGETS ${LIB_NAME}_static + LIBRARY DESTINATION lib + ARCHIVE DESTINATION lib) + endif() + + if (APPLE) + set_target_properties(${LIB_NAME}_shared + PROPERTIES + BUILD_WITH_INSTALL_RPATH ON + INSTALL_NAME_DIR "@rpath") + endif() + +endfunction() + diff --git a/cpp/src/arrow/io/CMakeLists.txt b/cpp/src/arrow/io/CMakeLists.txt index 1e65a1a46abb4..b8882e46b4893 100644 --- a/cpp/src/arrow/io/CMakeLists.txt +++ b/cpp/src/arrow/io/CMakeLists.txt @@ -18,30 +18,65 @@ # ---------------------------------------------------------------------- # arrow_io : Arrow IO interfaces +# HDFS thirdparty setup +if (DEFINED ENV{HADOOP_HOME}) + set(HADOOP_HOME $ENV{HADOOP_HOME}) + if (NOT EXISTS "${HADOOP_HOME}/include/hdfs.h") + message(STATUS "Did not find hdfs.h in expected location, using vendored one") + set(HADOOP_HOME "${THIRDPARTY_DIR}/hadoop") + endif() +else() + set(HADOOP_HOME "${THIRDPARTY_DIR}/hadoop") +endif() + +set(HDFS_H_PATH "${HADOOP_HOME}/include/hdfs.h") +if (NOT EXISTS ${HDFS_H_PATH}) + message(FATAL_ERROR "Did not find hdfs.h at ${HDFS_H_PATH}") +endif() +message(STATUS "Found hdfs.h at: " ${HDFS_H_PATH}) + +include_directories(SYSTEM "${HADOOP_HOME}/include") + +# arrow_io library if (MSVC) - set(ARROW_IO_LINK_LIBS + set(ARROW_IO_STATIC_LINK_LIBS + arrow_static + ) + set(ARROW_IO_SHARED_LINK_LIBS arrow_shared ) else() - set(ARROW_IO_LINK_LIBS + set(ARROW_IO_STATIC_LINK_LIBS + arrow_static + dl + ) + set(ARROW_IO_SHARED_LINK_LIBS arrow_shared dl ) endif() if (ARROW_BOOST_USE_SHARED) - set(ARROW_IO_PRIVATE_LINK_LIBS + set(ARROW_IO_SHARED_PRIVATE_LINK_LIBS boost_system_shared boost_filesystem_shared) else() - set(ARROW_IO_PRIVATE_LINK_LIBS + set(ARROW_IO_SHARED_PRIVATE_LINK_LIBS boost_system_static boost_filesystem_static) endif() -set(ARROW_IO_TEST_LINK_LIBS - arrow_io - ${ARROW_IO_PRIVATE_LINK_LIBS}) +set(ARROW_IO_STATIC_PRIVATE_LINK_LIBS + boost_system_static + boost_filesystem_static) + +if (ARROW_BUILD_STATIC) + 
set(ARROW_IO_TEST_LINK_LIBS + arrow_io_static) +else() + set(ARROW_IO_TEST_LINK_LIBS + arrow_io_shared) +endif() set(ARROW_IO_SRCS file.cc @@ -51,32 +86,6 @@ set(ARROW_IO_SRCS memory.cc ) -# HDFS thirdparty setup -if (DEFINED ENV{HADOOP_HOME}) - set(HADOOP_HOME $ENV{HADOOP_HOME}) - if (NOT EXISTS "${HADOOP_HOME}/include/hdfs.h") - message(STATUS "Did not find hdfs.h in expected location, using vendored one") - set(HADOOP_HOME "${THIRDPARTY_DIR}/hadoop") - endif() -else() - set(HADOOP_HOME "${THIRDPARTY_DIR}/hadoop") -endif() - -set(HDFS_H_PATH "${HADOOP_HOME}/include/hdfs.h") -if (NOT EXISTS ${HDFS_H_PATH}) - message(FATAL_ERROR "Did not find hdfs.h at ${HDFS_H_PATH}") -endif() -message(STATUS "Found hdfs.h at: " ${HDFS_H_PATH}) - -include_directories(SYSTEM "${HADOOP_HOME}/include") - -add_library(arrow_io SHARED - ${ARROW_IO_SRCS} -) -target_link_libraries(arrow_io - LINK_PUBLIC ${ARROW_IO_LINK_LIBS} - LINK_PRIVATE ${ARROW_IO_PRIVATE_LINK_LIBS}) - if(NOT APPLE) # Localize thirdparty symbols using a linker version script. This hides them # from the client application. The OS X linker does not support the @@ -84,16 +93,14 @@ if(NOT APPLE) set(ARROW_IO_LINK_FLAGS "-Wl,--version-script=${CMAKE_CURRENT_SOURCE_DIR}/symbols.map") endif() -SET_TARGET_PROPERTIES(arrow_io PROPERTIES - LINKER_LANGUAGE CXX - LINK_FLAGS "${ARROW_IO_LINK_FLAGS}") - -if (APPLE) - set_target_properties(arrow_io - PROPERTIES - BUILD_WITH_INSTALL_RPATH ON - INSTALL_NAME_DIR "@rpath") -endif() +ADD_ARROW_LIB(arrow_io + SOURCES ${ARROW_IO_SRCS} + SHARED_LINK_FLAGS ${ARROW_IO_LINK_FLAGS} + SHARED_LINK_LIBS ${ARROW_IO_SHARED_LINK_LIBS} + SHARED_PRIVATE_LINK_LIBS ${ARROW_IO_SHARED_PRIVATE_LINK_LIBS} + STATIC_LINK_LIBS ${ARROW_IO_STATIC_LINK_LIBS} + STATIC_PRIVATE_LINK_LIBS ${ARROW_IO_STATIC_PRIVATE_LINK_LIBS} +) ADD_ARROW_TEST(io-file-test) ARROW_TEST_LINK_LIBRARIES(io-file-test @@ -115,10 +122,6 @@ install(FILES memory.h DESTINATION include/arrow/io) -install(TARGETS arrow_io - LIBRARY DESTINATION lib - ARCHIVE DESTINATION lib) - # pkg-config support configure_file(arrow-io.pc.in "${CMAKE_CURRENT_BINARY_DIR}/arrow-io.pc" diff --git a/cpp/src/arrow/io/memory.h b/cpp/src/arrow/io/memory.h index b72f93b939148..2faf2804bcbd0 100644 --- a/cpp/src/arrow/io/memory.h +++ b/cpp/src/arrow/io/memory.h @@ -101,7 +101,7 @@ class ARROW_EXPORT BufferReader : public ReadableFileInterface { public: explicit BufferReader(const std::shared_ptr& buffer); BufferReader(const uint8_t* data, int64_t size); - ~BufferReader(); + virtual ~BufferReader(); Status Close() override; Status Tell(int64_t* position) override; diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index d3e625a08fbfe..11ca19179f3dc 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -19,17 +19,25 @@ # arrow_ipc ####################################### -set(ARROW_IPC_LINK_LIBS - arrow_io +set(ARROW_IPC_SHARED_LINK_LIBS + arrow_io_shared arrow_shared ) -set(ARROW_IPC_PRIVATE_LINK_LIBS - ) +set(ARROW_IPC_STATIC_LINK_LIBS + arrow_static + arrow_io_static +) -set(ARROW_IPC_TEST_LINK_LIBS - arrow_ipc - ${ARROW_IPC_PRIVATE_LINK_LIBS}) +if (ARROW_BUILD_STATIC) + set(ARROW_IPC_TEST_LINK_LIBS + arrow_io_static + arrow_ipc_static) +else() + set(ARROW_IPC_TEST_LINK_LIBS + arrow_io_shared + arrow_ipc_shared) +endif() set(ARROW_IPC_SRCS adapter.cc @@ -40,20 +48,6 @@ set(ARROW_IPC_SRCS metadata-internal.cc ) -# TODO(wesm): SHARED and STATIC targets -add_library(arrow_ipc SHARED - ${ARROW_IPC_SRCS} -) -if(RAPIDJSON_VENDORED) - 
add_dependencies(arrow_ipc rapidjson_ep) -endif() -if(FLATBUFFERS_VENDORED) - add_dependencies(arrow_ipc flatbuffers_ep) -endif() -target_link_libraries(arrow_ipc - LINK_PUBLIC ${ARROW_IPC_LINK_LIBS} - LINK_PRIVATE ${ARROW_IPC_PRIVATE_LINK_LIBS}) - if(NOT APPLE) # Localize thirdparty symbols using a linker version script. This hides them # from the client application. The OS X linker does not support the @@ -61,15 +55,18 @@ if(NOT APPLE) set(ARROW_IPC_LINK_FLAGS "-Wl,--version-script=${CMAKE_CURRENT_SOURCE_DIR}/symbols.map") endif() -SET_TARGET_PROPERTIES(arrow_ipc PROPERTIES - LINKER_LANGUAGE CXX - LINK_FLAGS "${ARROW_IPC_LINK_FLAGS}") +ADD_ARROW_LIB(arrow_ipc + SOURCES ${ARROW_IPC_SRCS} + SHARED_LINK_FLAGS ${ARROW_IPC_LINK_FLAGS} + SHARED_LINK_LIBS ${ARROW_IPC_SHARED_LINK_LIBS} + STATIC_LINK_LIBS ${ARROW_IO_SHARED_PRIVATE_LINK_LIBS} +) -if (APPLE) - set_target_properties(arrow_ipc - PROPERTIES - BUILD_WITH_INSTALL_RPATH ON - INSTALL_NAME_DIR "@rpath") +if(RAPIDJSON_VENDORED) + add_dependencies(arrow_ipc_objlib rapidjson_ep) +endif() +if(FLATBUFFERS_VENDORED) + add_dependencies(arrow_ipc_objlib flatbuffers_ep) endif() ADD_ARROW_TEST(ipc-adapter-test) @@ -93,9 +90,9 @@ ADD_ARROW_TEST(json-integration-test) if (ARROW_BUILD_TESTS) if (APPLE) target_link_libraries(json-integration-test + arrow_ipc_static + arrow_io_static arrow_static - arrow_io - arrow_ipc gflags gtest boost_filesystem_static @@ -105,9 +102,9 @@ if (ARROW_BUILD_TESTS) PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") else() target_link_libraries(json-integration-test + arrow_ipc_static + arrow_io_static arrow_static - arrow_io - arrow_ipc gflags gtest pthread @@ -156,10 +153,6 @@ install(FILES metadata.h DESTINATION include/arrow/ipc) -install(TARGETS arrow_ipc - LIBRARY DESTINATION lib - ARCHIVE DESTINATION lib) - # pkg-config support configure_file(arrow-ipc.pc.in "${CMAKE_CURRENT_BINARY_DIR}/arrow-ipc.pc" From 3095f2cb7bc19954d0dfba02486b7ec48d8fef0f Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 28 Dec 2016 23:05:50 +0100 Subject: [PATCH 0249/1644] ARROW-444: [Python] Native file reads into pre-allocated memory. Some IO API cleanup / niceness This yields slightly better performance and less memory use. Also deleted some duplicated code Author: Wes McKinney Closes #257 from wesm/ARROW-444 and squashes the following commits: 30e480d [Wes McKinney] Rename PyBytes_Empty to something more mundane 9db0d81 [Wes McKinney] Native file reads into pre-allocated memory. Deprecated HdfsClient.connect API. 
Promote pyarrow.io classes into pyarrow namespace --- python/pyarrow/__init__.py | 4 ++ python/pyarrow/io.pyx | 109 +++++++++++------------------- python/pyarrow/tests/test_hdfs.py | 2 +- python/pyarrow/tests/test_io.py | 4 +- 4 files changed, 49 insertions(+), 70 deletions(-) diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 6f81ef470a86c..02b2b06237de3 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -37,6 +37,10 @@ from pyarrow.error import ArrowException +from pyarrow.io import (HdfsClient, HdfsFile, NativeFile, PythonFileInterface, + BytesReader, Buffer, InMemoryOutputStream, + BufferReader) + from pyarrow.scalar import (ArrayValue, Scalar, NA, NAType, BooleanValue, Int8Value, Int16Value, Int32Value, Int64Value, diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index 8491aa8964fb9..cab6ccb90ee6b 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -37,6 +37,10 @@ import sys import threading import time +# To let us get a PyObject* and avoid Cython auto-ref-counting +cdef extern from "Python.h": + PyObject* PyBytes_FromStringAndSizeNative" PyBytes_FromStringAndSize"( + char *v, Py_ssize_t len) except NULL cdef class NativeFile: @@ -119,21 +123,24 @@ cdef class NativeFile: with nogil: check_status(self.wr_file.get().Write(buf, bufsize)) - def read(self, int nbytes): + def read(self, int64_t nbytes): cdef: int64_t bytes_read = 0 - uint8_t* buf - shared_ptr[CBuffer] out + PyObject* obj self._assert_readable() + # Allocate empty write space + obj = PyBytes_FromStringAndSizeNative(NULL, nbytes) + + cdef uint8_t* buf = cp.PyBytes_AS_STRING( obj) with nogil: - check_status(self.rd_file.get().ReadB(nbytes, &out)) + check_status(self.rd_file.get().Read(nbytes, &bytes_read, buf)) - result = cp.PyBytes_FromStringAndSize( - out.get().data(), out.get().size()) + if bytes_read < nbytes: + cp._PyBytes_Resize(&obj, bytes_read) - return result + return PyObject_to_object(obj) # ---------------------------------------------------------------------- @@ -339,31 +346,8 @@ cdef class HdfsClient: cdef readonly: bint is_open - def __cinit__(self): - self.is_open = False - - def __dealloc__(self): - if self.is_open: - self.close() - - def close(self): - """ - Disconnect from the HDFS cluster - """ - self._ensure_client() - with nogil: - check_status(self.client.get().Disconnect()) - self.is_open = False - - cdef _ensure_client(self): - if self.client.get() == NULL: - raise IOError('HDFS client improperly initialized') - elif not self.is_open: - raise IOError('HDFS client is closed') - - @classmethod - def connect(cls, host="default", port=0, user=None, kerb_ticket=None, - driver='libhdfs'): + def __cinit__(self, host="default", port=0, user=None, kerb_ticket=None, + driver='libhdfs'): """ Connect to an HDFS cluster. All parameters are optional and should only be set if the defaults need to be overridden. 
@@ -391,9 +375,7 @@ cdef class HdfsClient: ------- client : HDFSClient """ - cdef: - HdfsClient out = HdfsClient() - HdfsConnectionConfig conf + cdef HdfsConnectionConfig conf if host is not None: conf.host = tobytes(host) @@ -411,10 +393,31 @@ cdef class HdfsClient: conf.driver = HdfsDriver_LIBHDFS3 with nogil: - check_status(CHdfsClient.Connect(&conf, &out.client)) - out.is_open = True + check_status(CHdfsClient.Connect(&conf, &self.client)) + self.is_open = True - return out + @classmethod + def connect(cls, *args, **kwargs): + return cls(*args, **kwargs) + + def __dealloc__(self): + if self.is_open: + self.close() + + def close(self): + """ + Disconnect from the HDFS cluster + """ + self._ensure_client() + with nogil: + check_status(self.client.get().Disconnect()) + self.is_open = False + + cdef _ensure_client(self): + if self.client.get() == NULL: + raise IOError('HDFS client improperly initialized') + elif not self.is_open: + raise IOError('HDFS client is closed') def exists(self, path): """ @@ -657,36 +660,6 @@ cdef class HdfsFile(NativeFile): def __dealloc__(self): self.parent = None - def read(self, int nbytes): - """ - Read indicated number of bytes from the file, up to EOF - """ - cdef: - int64_t bytes_read = 0 - uint8_t* buf - - self._assert_readable() - - # This isn't ideal -- PyBytes_FromStringAndSize copies the data from - # the passed buffer, so it's hard for us to avoid doubling the memory - buf = malloc(nbytes) - if buf == NULL: - raise MemoryError("Failed to allocate {0} bytes".format(nbytes)) - - cdef int64_t total_bytes = 0 - - try: - with nogil: - check_status(self.rd_file.get() - .Read(nbytes, &bytes_read, buf)) - - result = cp.PyBytes_FromStringAndSize(buf, - bytes_read) - finally: - free(buf) - - return result - def download(self, stream_or_path): """ Read file completely to local path (rather than reading completely into diff --git a/python/pyarrow/tests/test_hdfs.py b/python/pyarrow/tests/test_hdfs.py index 73d5a66cf4765..4ff5a9d42b55e 100644 --- a/python/pyarrow/tests/test_hdfs.py +++ b/python/pyarrow/tests/test_hdfs.py @@ -38,7 +38,7 @@ def hdfs_test_client(driver='libhdfs'): raise ValueError('Env variable ARROW_HDFS_TEST_PORT was not ' 'an integer') - return io.HdfsClient.connect(host, port, user, driver=driver) + return io.HdfsClient(host, port, user, driver=driver) class HdfsTestCases(object): diff --git a/python/pyarrow/tests/test_io.py b/python/pyarrow/tests/test_io.py index 211a12bcd92fe..c10ed0394b1a8 100644 --- a/python/pyarrow/tests/test_io.py +++ b/python/pyarrow/tests/test_io.py @@ -67,7 +67,9 @@ def test_python_file_read(): f.seek(5) assert f.tell() == 5 - assert f.read(50) == b'sample data' + v = f.read(50) + assert v == b'sample data' + assert len(v) == 11 f.close() From 4733ee876e1fddb8032fce1dc9e486d68904fbea Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Wed, 28 Dec 2016 20:21:48 -0500 Subject: [PATCH 0250/1644] ARROW-445: arrow_ipc_objlib depends on Flatbuffer generated files This is needed as before the depedency was done through the arrow library on which arrow_ipc depends. But as the arrow_objlib target is not linked to anything else, it can actually be built independently. This would lead to races where the flatbuffer generated files were not existing during arrow_ipc compilation. Author: Uwe L. Korn Closes #258 from xhochy/ARROW-445 and squashes the following commits: 8bdad8e [Uwe L. 
Korn] ARROW-445: arrow_ipc_objlib depends on Flatbuffer generated files --- cpp/src/arrow/ipc/CMakeLists.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index 11ca19179f3dc..b7ac5f059749f 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -143,7 +143,7 @@ add_custom_command( ) add_custom_target(metadata_fbs DEPENDS ${FBS_OUTPUT_FILES}) -add_dependencies(arrow_objlib metadata_fbs) +add_dependencies(arrow_ipc_objlib metadata_fbs) # Headers: top level install(FILES From 23fe6ae02a6fa6ff912986c45079e25b3e5e4deb Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 29 Dec 2016 10:22:40 +0100 Subject: [PATCH 0251/1644] ARROW-338: Implement visitor pattern for IPC loading/unloading This is a first cut at getting rid of the if-then-else statements and using the visitor pattern. This also has the benefit of forcing us to provide implementations should we add new types to Arrow. Author: Wes McKinney Closes #256 from wesm/ARROW-338 and squashes the following commits: 59bac66 [Wes McKinney] Fix accidental copy 17214c4 [Wes McKinney] Fix comment 6b00da4 [Wes McKinney] Implement visitor pattern for IPC loading/unloading --- cpp/src/arrow/array.h | 1 + cpp/src/arrow/ipc/adapter.cc | 477 ++++++++++++++++++++++------------- cpp/src/arrow/type_fwd.h | 3 +- 3 files changed, 306 insertions(+), 175 deletions(-) diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 5cd56d6df5427..6239ccc576b8d 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -471,6 +471,7 @@ extern template class ARROW_EXPORT NumericArray; extern template class ARROW_EXPORT NumericArray; extern template class ARROW_EXPORT NumericArray; extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; #if defined(__GNUC__) && !defined(__clang__) #pragma GCC diagnostic pop diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index f813c1dbbc3b0..ac4054b376adc 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -34,6 +34,7 @@ #include "arrow/status.h" #include "arrow/table.h" #include "arrow/type.h" +#include "arrow/type_fwd.h" #include "arrow/util/bit-util.h" #include "arrow/util/logging.h" @@ -43,80 +44,34 @@ namespace flatbuf = org::apache::arrow::flatbuf; namespace ipc { -static bool IsPrimitive(const DataType* type) { - DCHECK(type != nullptr); - switch (type->type) { - // NA is null type or "no type", considered primitive for now - case Type::NA: - case Type::BOOL: - case Type::UINT8: - case Type::INT8: - case Type::UINT16: - case Type::INT16: - case Type::UINT32: - case Type::INT32: - case Type::UINT64: - case Type::INT64: - case Type::FLOAT: - case Type::DOUBLE: - return true; - default: - return false; - } -} - // ---------------------------------------------------------------------- // Record batch write path -Status VisitArray(const Array* arr, std::vector* field_nodes, - std::vector>* buffers, int max_recursion_depth) { - if (max_recursion_depth <= 0) { return Status::Invalid("Max recursion depth reached"); } - DCHECK(arr); - DCHECK(field_nodes); - // push back all common elements - field_nodes->push_back(flatbuf::FieldNode(arr->length(), arr->null_count())); - if (arr->null_count() > 0) { - buffers->push_back(arr->null_bitmap()); - } else { - // Push a dummy zero-length buffer, not to be copied - buffers->push_back(std::make_shared(nullptr, 0)); - } - - const DataType* arr_type = arr->type().get(); - 
if (IsPrimitive(arr_type)) { - const auto prim_arr = static_cast(arr); - buffers->push_back(prim_arr->data()); - } else if (arr->type_enum() == Type::STRING || arr->type_enum() == Type::BINARY) { - const auto binary_arr = static_cast(arr); - buffers->push_back(binary_arr->offsets()); - buffers->push_back(binary_arr->data()); - } else if (arr->type_enum() == Type::LIST) { - const auto list_arr = static_cast(arr); - buffers->push_back(list_arr->offsets()); - RETURN_NOT_OK(VisitArray( - list_arr->values().get(), field_nodes, buffers, max_recursion_depth - 1)); - } else if (arr->type_enum() == Type::STRUCT) { - const auto struct_arr = static_cast(arr); - for (auto& field : struct_arr->fields()) { - RETURN_NOT_OK( - VisitArray(field.get(), field_nodes, buffers, max_recursion_depth - 1)); - } - } else { - return Status::NotImplemented("Unrecognized type"); - } - return Status::OK(); -} - -class RecordBatchWriter { +class RecordBatchWriter : public ArrayVisitor { public: RecordBatchWriter(const std::vector>& columns, int32_t num_rows, int64_t buffer_start_offset, int max_recursion_depth) - : columns_(&columns), + : columns_(columns), num_rows_(num_rows), - buffer_start_offset_(buffer_start_offset), - max_recursion_depth_(max_recursion_depth) {} + max_recursion_depth_(max_recursion_depth), + buffer_start_offset_(buffer_start_offset) {} - Status AssemblePayload(int64_t* body_length) { + Status VisitArray(const Array& arr) { + if (max_recursion_depth_ <= 0) { + return Status::Invalid("Max recursion depth reached"); + } + // push back all common elements + field_nodes_.push_back(flatbuf::FieldNode(arr.length(), arr.null_count())); + if (arr.null_count() > 0) { + buffers_.push_back(arr.null_bitmap()); + } else { + // Push a dummy zero-length buffer, not to be copied + buffers_.push_back(std::make_shared(nullptr, 0)); + } + return arr.Accept(this); + } + + Status Assemble(int64_t* body_length) { if (field_nodes_.size() > 0) { field_nodes_.clear(); buffer_meta_.clear(); @@ -124,9 +79,8 @@ class RecordBatchWriter { } // Perform depth-first traversal of the row-batch - for (size_t i = 0; i < columns_->size(); ++i) { - const Array* arr = (*columns_)[i].get(); - RETURN_NOT_OK(VisitArray(arr, &field_nodes_, &buffers_, max_recursion_depth_)); + for (size_t i = 0; i < columns_.size(); ++i) { + RETURN_NOT_OK(VisitArray(*columns_[i].get())); } // The position for the start of a buffer relative to the passed frame of @@ -199,7 +153,7 @@ class RecordBatchWriter { } Status Write(io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length) { - RETURN_NOT_OK(AssemblePayload(body_length)); + RETURN_NOT_OK(Assemble(body_length)); #ifndef NDEBUG int64_t start_position, current_position; @@ -249,15 +203,92 @@ class RecordBatchWriter { } private: + Status Visit(const NullArray& array) override { return Status::NotImplemented("null"); } + + Status VisitPrimitive(const PrimitiveArray& array) { + buffers_.push_back(array.data()); + return Status::OK(); + } + + Status VisitBinary(const BinaryArray& array) { + buffers_.push_back(array.offsets()); + buffers_.push_back(array.data()); + return Status::OK(); + } + + Status Visit(const BooleanArray& array) override { return VisitPrimitive(array); } + + Status Visit(const Int8Array& array) override { return VisitPrimitive(array); } + + Status Visit(const Int16Array& array) override { return VisitPrimitive(array); } + + Status Visit(const Int32Array& array) override { return VisitPrimitive(array); } + + Status Visit(const Int64Array& array) override { return 
VisitPrimitive(array); } + + Status Visit(const UInt8Array& array) override { return VisitPrimitive(array); } + + Status Visit(const UInt16Array& array) override { return VisitPrimitive(array); } + + Status Visit(const UInt32Array& array) override { return VisitPrimitive(array); } + + Status Visit(const UInt64Array& array) override { return VisitPrimitive(array); } + + Status Visit(const HalfFloatArray& array) override { return VisitPrimitive(array); } + + Status Visit(const FloatArray& array) override { return VisitPrimitive(array); } + + Status Visit(const DoubleArray& array) override { return VisitPrimitive(array); } + + Status Visit(const StringArray& array) override { return VisitBinary(array); } + + Status Visit(const BinaryArray& array) override { return VisitBinary(array); } + + Status Visit(const DateArray& array) override { return VisitPrimitive(array); } + + Status Visit(const TimeArray& array) override { return VisitPrimitive(array); } + + Status Visit(const TimestampArray& array) override { return VisitPrimitive(array); } + + Status Visit(const IntervalArray& array) override { + return Status::NotImplemented("interval"); + } + + Status Visit(const DecimalArray& array) override { + return Status::NotImplemented("decimal"); + } + + Status Visit(const ListArray& array) override { + buffers_.push_back(array.offsets()); + --max_recursion_depth_; + RETURN_NOT_OK(VisitArray(*array.values().get())); + ++max_recursion_depth_; + return Status::OK(); + } + + Status Visit(const StructArray& array) override { + --max_recursion_depth_; + for (const auto& field : array.fields()) { + RETURN_NOT_OK(VisitArray(*field.get())); + } + ++max_recursion_depth_; + return Status::OK(); + } + + Status Visit(const UnionArray& array) override { + return Status::NotImplemented("union"); + } + // Do not copy this vector. 
Ownership must be retained elsewhere - const std::vector>* columns_; + const std::vector>& columns_; int32_t num_rows_; - int64_t buffer_start_offset_; std::vector field_nodes_; std::vector buffer_meta_; std::vector> buffers_; - int max_recursion_depth_; + + int64_t max_recursion_depth_; + int64_t buffer_start_offset_; }; Status WriteRecordBatch(const std::vector>& columns, @@ -279,143 +310,241 @@ Status GetRecordBatchSize(const RecordBatch* batch, int64_t* size) { // ---------------------------------------------------------------------- // Record batch read path -class RecordBatchReader { - public: - RecordBatchReader(const std::shared_ptr& metadata, - const std::shared_ptr& schema, int max_recursion_depth, - io::ReadableFileInterface* file) - : metadata_(metadata), - schema_(schema), - max_recursion_depth_(max_recursion_depth), - file_(file) { - num_buffers_ = metadata->num_buffers(); - num_flattened_fields_ = metadata->num_fields(); - } +struct RecordBatchContext { + const RecordBatchMetadata* metadata; + int buffer_index; + int field_index; + int max_recursion_depth; +}; - Status Read(std::shared_ptr* out) { - std::vector> arrays(schema_->num_fields()); +// Traverse the flattened record batch metadata and reassemble the +// corresponding array containers +class ArrayLoader : public TypeVisitor { + public: + ArrayLoader( + const Field& field, RecordBatchContext* context, io::ReadableFileInterface* file) + : field_(field), context_(context), file_(file) {} - // The field_index and buffer_index are incremented in NextArray based on - // how much of the batch is "consumed" (through nested data reconstruction, - // for example) - field_index_ = 0; - buffer_index_ = 0; - for (int i = 0; i < schema_->num_fields(); ++i) { - const Field* field = schema_->field(i).get(); - RETURN_NOT_OK(NextArray(field, max_recursion_depth_, &arrays[i])); + Status Load(std::shared_ptr* out) { + if (context_->max_recursion_depth <= 0) { + return Status::Invalid("Max recursion depth reached"); } - *out = std::make_shared(schema_, metadata_->length(), arrays); + // Load the array + RETURN_NOT_OK(field_.type->Accept(this)); + + *out = std::move(result_); return Status::OK(); } private: - // Traverse the flattened record batch metadata and reassemble the - // corresponding array containers - Status NextArray( - const Field* field, int max_recursion_depth, std::shared_ptr* out) { - const TypePtr& type = field->type; - if (max_recursion_depth <= 0) { - return Status::Invalid("Max recursion depth reached"); + const Field& field_; + RecordBatchContext* context_; + io::ReadableFileInterface* file_; + + // Used in visitor pattern + std::shared_ptr result_; + + Status LoadChild(const Field& field, std::shared_ptr* out) { + ArrayLoader loader(field, context_, file_); + --context_->max_recursion_depth; + RETURN_NOT_OK(loader.Load(out)); + ++context_->max_recursion_depth; + return Status::OK(); + } + + Status GetBuffer(int buffer_index, std::shared_ptr* out) { + BufferMetadata metadata = context_->metadata->buffer(buffer_index); + + if (metadata.length == 0) { + *out = std::make_shared(nullptr, 0); + return Status::OK(); + } else { + return file_->ReadAt(metadata.offset, metadata.length, out); } + } + Status LoadCommon(FieldMetadata* field_meta, std::shared_ptr* null_bitmap) { // pop off a field - if (field_index_ >= num_flattened_fields_) { + if (context_->field_index >= context_->metadata->num_fields()) { return Status::Invalid("Ran out of field metadata, likely malformed"); } // This only contains the length and null count, 
which we need to figure // out what to do with the buffers. For example, if null_count == 0, then // we can skip that buffer without reading from shared memory - FieldMetadata field_meta = metadata_->field(field_index_++); + *field_meta = context_->metadata->field(context_->field_index++); // extract null_bitmap which is common to all arrays + if (field_meta->null_count == 0) { + *null_bitmap = nullptr; + } else { + RETURN_NOT_OK(GetBuffer(context_->buffer_index, null_bitmap)); + } + context_->buffer_index++; + return Status::OK(); + } + + Status LoadPrimitive(const DataType& type) { + FieldMetadata field_meta; std::shared_ptr null_bitmap; - if (field_meta.null_count == 0) { - ++buffer_index_; + RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); + + std::shared_ptr data; + if (field_meta.length > 0) { + RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &data)); } else { - RETURN_NOT_OK(GetBuffer(buffer_index_++, &null_bitmap)); + context_->buffer_index++; + data.reset(new Buffer(nullptr, 0)); } + return MakePrimitiveArray(field_.type, field_meta.length, data, field_meta.null_count, + null_bitmap, &result_); + } - if (IsPrimitive(type.get())) { - std::shared_ptr data; - if (field_meta.length > 0) { - RETURN_NOT_OK(GetBuffer(buffer_index_++, &data)); - } else { - buffer_index_++; - data.reset(new Buffer(nullptr, 0)); - } - return MakePrimitiveArray( - type, field_meta.length, data, field_meta.null_count, null_bitmap, out); - } else if (type->type == Type::STRING || type->type == Type::BINARY) { - std::shared_ptr offsets; - std::shared_ptr values; - RETURN_NOT_OK(GetBuffer(buffer_index_++, &offsets)); - RETURN_NOT_OK(GetBuffer(buffer_index_++, &values)); - - if (type->type == Type::STRING) { - *out = std::make_shared( - field_meta.length, offsets, values, field_meta.null_count, null_bitmap); - } else { - *out = std::make_shared( - field_meta.length, offsets, values, field_meta.null_count, null_bitmap); - } - return Status::OK(); - } else if (type->type == Type::LIST) { - std::shared_ptr offsets; - RETURN_NOT_OK(GetBuffer(buffer_index_++, &offsets)); - const int num_children = type->num_children(); - if (num_children != 1) { - std::stringstream ss; - ss << "Field: " << field->ToString() - << " has wrong number of children:" << num_children; - return Status::Invalid(ss.str()); - } - std::shared_ptr values_array; - RETURN_NOT_OK( - NextArray(type->child(0).get(), max_recursion_depth - 1, &values_array)); - *out = std::make_shared(type, field_meta.length, offsets, values_array, - field_meta.null_count, null_bitmap); - return Status::OK(); - } else if (type->type == Type::STRUCT) { - const int num_children = type->num_children(); - std::vector fields; - fields.reserve(num_children); - for (int child_idx = 0; child_idx < num_children; ++child_idx) { - std::shared_ptr field_array; - RETURN_NOT_OK(NextArray( - type->child(child_idx).get(), max_recursion_depth - 1, &field_array)); - fields.push_back(field_array); - } - out->reset(new StructArray( - type, field_meta.length, fields, field_meta.null_count, null_bitmap)); - return Status::OK(); + template + Status LoadBinary() { + FieldMetadata field_meta; + std::shared_ptr null_bitmap; + RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); + + std::shared_ptr offsets; + std::shared_ptr values; + RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &offsets)); + RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &values)); + + result_ = std::make_shared( + field_meta.length, offsets, values, field_meta.null_count, null_bitmap); + return Status::OK(); + } 
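As a minimal sketch of the contract the read path relies on here — the write path emits exactly one FieldNode per flattened field, followed by its validity-buffer slot and then its data buffers, and LoadCommon/LoadPrimitive above consume them in the same order — consider the following standalone illustration. FakeFieldMetadata, FakeContext, and LoadPrimitiveSketch are hypothetical stand-ins invented for this note, not types from the patch:

#include <cstdint>
#include <iostream>
#include <vector>

// Stand-ins for the flattened FieldMetadata/BufferMetadata streams that
// ArrayLoader walks via context_->field_index and context_->buffer_index.
struct FakeFieldMetadata {
  int32_t length;
  int32_t null_count;
};

struct FakeContext {
  std::vector<FakeFieldMetadata> fields;
  std::vector<int64_t> buffer_sizes;  // byte length of each flattened buffer
  size_t field_index = 0;
  size_t buffer_index = 0;
};

// A primitive column consumes one field node, one validity slot (possibly a
// zero-length dummy when null_count == 0), and one data buffer -- mirroring
// the LoadCommon + LoadPrimitive pair above.
void LoadPrimitiveSketch(FakeContext* ctx) {
  const FakeFieldMetadata meta = ctx->fields[ctx->field_index++];
  ctx->buffer_index++;  // validity bitmap slot, skipped when all-valid
  const int64_t data_bytes = ctx->buffer_sizes[ctx->buffer_index++];
  std::cout << "length=" << meta.length << " null_count=" << meta.null_count
            << " data_bytes=" << data_bytes << std::endl;
}

int main() {
  // One int32 column of 5 values with 1 null: {validity bitmap, values}.
  FakeContext ctx{{{5, 1}}, {1, 20}};
  LoadPrimitiveSketch(&ctx);
  return 0;
}

Nested types simply recurse: a list consumes its node, validity slot, and offsets buffer, then hands the remaining stream to its child loader, which is why field_index and buffer_index live in the shared context rather than in any one loader.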
+ + Status Visit(const NullType& type) override { return Status::NotImplemented("null"); } + + Status Visit(const BooleanType& type) override { return LoadPrimitive(type); } + + Status Visit(const Int8Type& type) override { return LoadPrimitive(type); } + + Status Visit(const Int16Type& type) override { return LoadPrimitive(type); } + + Status Visit(const Int32Type& type) override { return LoadPrimitive(type); } + + Status Visit(const Int64Type& type) override { return LoadPrimitive(type); } + + Status Visit(const UInt8Type& type) override { return LoadPrimitive(type); } + + Status Visit(const UInt16Type& type) override { return LoadPrimitive(type); } + + Status Visit(const UInt32Type& type) override { return LoadPrimitive(type); } + + Status Visit(const UInt64Type& type) override { return LoadPrimitive(type); } + + Status Visit(const HalfFloatType& type) override { return LoadPrimitive(type); } + + Status Visit(const FloatType& type) override { return LoadPrimitive(type); } + + Status Visit(const DoubleType& type) override { return LoadPrimitive(type); } + + Status Visit(const StringType& type) override { return LoadBinary(); } + + Status Visit(const BinaryType& type) override { return LoadBinary(); } + + Status Visit(const DateType& type) override { return LoadPrimitive(type); } + + Status Visit(const TimeType& type) override { return LoadPrimitive(type); } + + Status Visit(const TimestampType& type) override { return LoadPrimitive(type); } + + Status Visit(const IntervalType& type) override { + return Status::NotImplemented(type.ToString()); + } + + Status Visit(const DecimalType& type) override { + return Status::NotImplemented(type.ToString()); + } + + Status Visit(const ListType& type) override { + FieldMetadata field_meta; + std::shared_ptr null_bitmap; + std::shared_ptr offsets; + + RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); + RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &offsets)); + + const int num_children = type.num_children(); + if (num_children != 1) { + std::stringstream ss; + ss << "Wrong number of children: " << num_children; + return Status::Invalid(ss.str()); } + std::shared_ptr values_array; - return Status::NotImplemented("Non-primitive types not complete yet"); + RETURN_NOT_OK(LoadChild(*type.child(0).get(), &values_array)); + + result_ = std::make_shared(field_.type, field_meta.length, offsets, + values_array, field_meta.null_count, null_bitmap); + return Status::OK(); } - Status GetBuffer(int buffer_index, std::shared_ptr* out) { - BufferMetadata metadata = metadata_->buffer(buffer_index); + Status Visit(const StructType& type) override { + FieldMetadata field_meta; + std::shared_ptr null_bitmap; + RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); - if (metadata.length == 0) { - *out = std::make_shared(nullptr, 0); - return Status::OK(); - } else { - return file_->ReadAt(metadata.offset, metadata.length, out); + const int num_children = type.num_children(); + std::vector fields; + fields.reserve(num_children); + + for (int child_idx = 0; child_idx < num_children; ++child_idx) { + std::shared_ptr field_array; + RETURN_NOT_OK(LoadChild(*type.child(child_idx).get(), &field_array)); + fields.emplace_back(field_array); + } + + result_ = std::make_shared( + field_.type, field_meta.length, fields, field_meta.null_count, null_bitmap); + return Status::OK(); + } + + Status Visit(const UnionType& type) override { + return Status::NotImplemented(type.ToString()); + } +}; + +class RecordBatchReader { + public: + RecordBatchReader(const std::shared_ptr& 
metadata, + const std::shared_ptr& schema, int max_recursion_depth, + io::ReadableFileInterface* file) + : metadata_(metadata), + schema_(schema), + max_recursion_depth_(max_recursion_depth), + file_(file) {} + + Status Read(std::shared_ptr* out) { + std::vector> arrays(schema_->num_fields()); + + // The field_index and buffer_index are incremented in the ArrayLoader + // based on how much of the batch is "consumed" (through nested data + // reconstruction, for example) + context_.metadata = metadata_.get(); + context_.field_index = 0; + context_.buffer_index = 0; + context_.max_recursion_depth = max_recursion_depth_; + + for (int i = 0; i < schema_->num_fields(); ++i) { + ArrayLoader loader(*schema_->field(i).get(), &context_, file_); + RETURN_NOT_OK(loader.Load(&arrays[i])); } + + *out = std::make_shared(schema_, metadata_->length(), arrays); + return Status::OK(); } private: + RecordBatchContext context_; std::shared_ptr metadata_; std::shared_ptr schema_; int max_recursion_depth_; io::ReadableFileInterface* file_; - - int field_index_; - int buffer_index_; - int num_buffers_; - int num_flattened_fields_; }; Status ReadRecordBatchMetadata(int64_t offset, int32_t metadata_length, diff --git a/cpp/src/arrow/type_fwd.h b/cpp/src/arrow/type_fwd.h index a9db32df54dc3..a14c535b9b3f1 100644 --- a/cpp/src/arrow/type_fwd.h +++ b/cpp/src/arrow/type_fwd.h @@ -91,7 +91,8 @@ using DateArray = NumericArray; using DateBuilder = NumericBuilder; struct TimeType; -class TimeArray; +using TimeArray = NumericArray; +using TimeBuilder = NumericBuilder; struct TimestampType; using TimestampArray = NumericArray; From e15c6a0b3c05b5b42c204f34369d127182450ca0 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Thu, 29 Dec 2016 07:05:55 -0500 Subject: [PATCH 0252/1644] ARROW-447: Always return unicode objects for UTF-8 strings As the u() function was not working with Unicode characters, this uses the u'' literal again which was re-introduced with Python 3.3. Thus the tests will fail with Python3 < 3.3 Author: Uwe L. Korn Closes #260 from xhochy/ARROW-447 and squashes the following commits: 84d3569 [Uwe L. Korn] ARROW-447: Always return unicode objects for UTF-8 strings --- python/pyarrow/scalar.pyx | 2 +- python/pyarrow/tests/test_convert_builtin.py | 5 +++-- python/pyarrow/tests/test_convert_pandas.py | 3 ++- python/pyarrow/tests/test_scalars.py | 7 ++++--- python/src/pyarrow/adapters/pandas.cc | 4 ---- 5 files changed, 10 insertions(+), 11 deletions(-) diff --git a/python/pyarrow/scalar.pyx b/python/pyarrow/scalar.pyx index a0610a14e6bd0..30b90408dc082 100644 --- a/python/pyarrow/scalar.pyx +++ b/python/pyarrow/scalar.pyx @@ -168,7 +168,7 @@ cdef class StringValue(ArrayValue): def as_py(self): cdef CStringArray* ap = self.sp_array.get() - return frombytes(ap.GetString(self.index)) + return ap.GetString(self.index).decode('utf-8') cdef class BinaryValue(ArrayValue): diff --git a/python/pyarrow/tests/test_convert_builtin.py b/python/pyarrow/tests/test_convert_builtin.py index a5f7aa51c29da..61167422de93c 100644 --- a/python/pyarrow/tests/test_convert_builtin.py +++ b/python/pyarrow/tests/test_convert_builtin.py @@ -1,3 +1,4 @@ +# -*- coding: utf-8 -*- # Licensed to the Apache Software Foundation (ASF) under one # or more contributor license agreements. 
See the NOTICE file # distributed with this work for additional information @@ -72,12 +73,12 @@ def test_double(self): assert arr.to_pylist() == data def test_unicode(self): - data = [u('foo'), u('bar'), None, u('arrow')] + data = [u'foo', u'bar', None, u'mañana'] arr = pyarrow.from_pylist(data) assert len(arr) == 4 assert arr.null_count == 1 assert arr.type == pyarrow.string() - assert arr.to_pylist() == [u('foo'), u('bar'), None, u('arrow')] + assert arr.to_pylist() == data def test_bytes(self): u1 = b'ma\xc3\xb1ana' diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index 863aa3073fe12..bb9f0b3eccab1 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -1,3 +1,4 @@ +# -*- coding: utf-8 -*- # Licensed to the Apache Software Foundation (ASF) under one # or more contributor license agreements. See the NOTICE file # distributed with this work for additional information @@ -183,7 +184,7 @@ def test_boolean_object_nulls(self): def test_unicode(self): repeats = 1000 - values = [u('foo'), None, u('bar'), u('qux'), np.nan] + values = [u'foo', None, u'bar', u'mañana', np.nan] df = pd.DataFrame({'strings': values * repeats}) self._check_pandas_roundtrip(df) diff --git a/python/pyarrow/tests/test_scalars.py b/python/pyarrow/tests/test_scalars.py index 19cfacbcb6b11..62e51f8dee846 100644 --- a/python/pyarrow/tests/test_scalars.py +++ b/python/pyarrow/tests/test_scalars.py @@ -1,3 +1,4 @@ +# -*- coding: utf-8 -*- # Licensed to the Apache Software Foundation (ASF) under one # or more contributor license agreements. See the NOTICE file # distributed with this work for additional information @@ -59,7 +60,7 @@ def test_double(self): assert v.as_py() == 3.0 def test_string_unicode(self): - arr = A.from_pylist([u('foo'), None, u('bar')]) + arr = A.from_pylist([u'foo', None, u'mañana']) v = arr[0] assert isinstance(v, A.StringValue) @@ -68,8 +69,8 @@ def test_string_unicode(self): assert arr[1] is A.NA v = arr[2].as_py() - assert v == u('bar') - assert isinstance(v, str) + assert v == u'mañana' + assert isinstance(v, unicode_type) def test_bytes(self): arr = A.from_pylist([b'foo', None, u('bar')]) diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index 5e5826b8236a6..ad18eca66e42b 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -603,11 +603,7 @@ struct WrapBytes {}; template <> struct WrapBytes { static inline PyObject* Wrap(const uint8_t* data, int64_t length) { -#if PY_MAJOR_VERSION >= 3 return PyUnicode_FromStringAndSize(reinterpret_cast(data), length); -#else - return PyString_FromStringAndSize(reinterpret_cast(data), length); -#endif } }; From e8b6231b29f59b2978b78a33eff73697d537c5dd Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Mon, 2 Jan 2017 02:37:37 -0500 Subject: [PATCH 0253/1644] ARROW-450: Fixes for PARQUET-818 Author: Uwe L. Korn Closes #263 from xhochy/ARROW-450 and squashes the following commits: 287015a [Uwe L. 
Korn] ARROW-450: Fixes for PARQUET-818 --- python/pyarrow/includes/parquet.pxd | 19 ++----------------- python/pyarrow/parquet.pyx | 15 ++++++--------- 2 files changed, 8 insertions(+), 26 deletions(-) diff --git a/python/pyarrow/includes/parquet.pxd b/python/pyarrow/includes/parquet.pxd index cb791e16f926d..b4d127c871e09 100644 --- a/python/pyarrow/includes/parquet.pxd +++ b/python/pyarrow/includes/parquet.pxd @@ -120,24 +120,9 @@ cdef extern from "parquet/api/writer.h" namespace "parquet" nogil: shared_ptr[WriterProperties] build() -cdef extern from "parquet/arrow/io.h" namespace "parquet::arrow" nogil: - cdef cppclass ParquetAllocator: - ParquetAllocator() - ParquetAllocator(MemoryPool* pool) - MemoryPool* pool() - void set_pool(MemoryPool* pool) - - cdef cppclass ParquetReadSource: - ParquetReadSource(ParquetAllocator* allocator) - Open(const shared_ptr[ReadableFileInterface]& file) - - cdef cppclass ParquetWriteSink: - ParquetWriteSink(const shared_ptr[OutputStream]& file) - - cdef extern from "parquet/arrow/reader.h" namespace "parquet::arrow" nogil: CStatus OpenFile(const shared_ptr[ReadableFileInterface]& file, - ParquetAllocator* allocator, + MemoryPool* allocator, unique_ptr[FileReader]* reader) cdef cppclass FileReader: @@ -157,6 +142,6 @@ cdef extern from "parquet/arrow/schema.h" namespace "parquet::arrow" nogil: cdef extern from "parquet/arrow/writer.h" namespace "parquet::arrow" nogil: cdef CStatus WriteFlatTable( const CTable* table, MemoryPool* pool, - const shared_ptr[ParquetWriteSink]& sink, + const shared_ptr[OutputStream]& sink, int64_t chunk_size, const shared_ptr[WriterProperties]& properties) diff --git a/python/pyarrow/parquet.pyx b/python/pyarrow/parquet.pyx index 043ccf12d9181..7379456feef2b 100644 --- a/python/pyarrow/parquet.pyx +++ b/python/pyarrow/parquet.pyx @@ -42,12 +42,12 @@ __all__ = [ cdef class ParquetReader: cdef: - ParquetAllocator allocator + MemoryPool* allocator unique_ptr[FileReader] reader column_idx_map def __cinit__(self): - self.allocator.set_pool(default_memory_pool()) + self.allocator = default_memory_pool() def open(self, source): self._open(source) @@ -69,7 +69,7 @@ cdef class ParquetReader: ParquetFileReader.OpenFile(path))) else: get_reader(source, &rd_handle) - check_status(OpenFile(rd_handle, &self.allocator, &self.reader)) + check_status(OpenFile(rd_handle, self.allocator, &self.reader)) def read_all(self): cdef: @@ -174,10 +174,8 @@ def write_table(table, sink, chunk_size=None, version=None, """ cdef Table table_ = table cdef CTable* ctable_ = table_.table - cdef shared_ptr[ParquetWriteSink] sink_ - cdef shared_ptr[FileOutputStream] filesink_ - cdef shared_ptr[OutputStream] general_sink + cdef shared_ptr[OutputStream] sink_ cdef WriterProperties.Builder properties_builder cdef int64_t chunk_size_ = 0 @@ -237,10 +235,9 @@ def write_table(table, sink, chunk_size=None, version=None, if isinstance(sink, six.string_types): check_status(FileOutputStream.Open(tobytes(sink), &filesink_)) - sink_.reset(new ParquetWriteSink(filesink_)) + sink_ = filesink_ else: - get_writer(sink, &general_sink) - sink_.reset(new ParquetWriteSink(general_sink)) + get_writer(sink, &sink_) with nogil: check_status(WriteFlatTable(ctable_, default_memory_pool(), sink_, From 806239fdd102649b7afa1dbe9aa1c09911f2885e Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Mon, 2 Jan 2017 08:48:20 -0500 Subject: [PATCH 0254/1644] ARROW-449: Python: Conversion from pyarrow.{Table,RecordBatch} to a Python dict Author: Uwe L. 
Korn Closes #262 from xhochy/ARROW-449 and squashes the following commits: 5f15533 [Uwe L. Korn] Fix string conversion routines 9d72c85 [Uwe L. Korn] ARROW-449: Python: Conversion from pyarrow.{Table,RecordBatch} to a Python dict --- python/pyarrow/table.pyx | 36 +++++++++++++++++++++++++++++- python/pyarrow/tests/test_table.py | 10 ++++++++- 2 files changed, 44 insertions(+), 2 deletions(-) diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index 20137e3d4f8d9..925543176c531 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -36,6 +36,9 @@ from pyarrow.compat import frombytes, tobytes cimport cpython +from collections import OrderedDict + + cdef class ChunkedArray: """ Array backed via one or more memory chunks. @@ -204,7 +207,7 @@ cdef class Column: ------- str """ - return frombytes(self.column.name()) + return bytes(self.column.name()).decode('utf8') @property def type(self): @@ -345,6 +348,22 @@ cdef class RecordBatch: return self.batch.Equals(deref(other.batch)) + def to_pydict(self): + """ + Converted the arrow::RecordBatch to an OrderedDict + + Returns + ------- + OrderedDict + """ + entries = [] + for i in range(self.batch.num_columns()): + name = bytes(self.batch.column_name(i)).decode('utf8') + column = self[i].to_pylist() + entries.append((name, column)) + return OrderedDict(entries) + + def to_pandas(self): """ Convert the arrow::RecordBatch to a pandas DataFrame @@ -635,6 +654,21 @@ cdef class Table: mgr = table_to_blockmanager(self.sp_table, nthreads) return pd.DataFrame(mgr) + def to_pydict(self): + """ + Converted the arrow::Table to an OrderedDict + + Returns + ------- + OrderedDict + """ + entries = [] + for i in range(self.table.num_columns()): + name = self.column(i).name + column = self.column(i).to_pylist() + entries.append((name, column)) + return OrderedDict(entries) + @property def name(self): """ diff --git a/python/pyarrow/tests/test_table.py b/python/pyarrow/tests/test_table.py index 25463145c00ce..9985b3e29ada1 100644 --- a/python/pyarrow/tests/test_table.py +++ b/python/pyarrow/tests/test_table.py @@ -15,8 +15,8 @@ # specific language governing permissions and limitations # under the License. +from collections import OrderedDict import numpy as np - from pandas.util.testing import assert_frame_equal import pandas as pd import pytest @@ -35,6 +35,10 @@ def test_recordbatch_basics(): assert len(batch) == 5 assert batch.num_rows == 5 assert batch.num_columns == len(data) + assert batch.to_pydict() == OrderedDict([ + ('c0', [0, 1, 2, 3, 4]), + ('c1', [-10, -5, 0, 5, 10]) + ]) def test_recordbatch_from_to_pandas(): @@ -97,6 +101,10 @@ def test_table_basics(): assert table.num_rows == 5 assert table.num_columns == 2 assert table.shape == (5, 2) + assert table.to_pydict() == OrderedDict([ + ('a', [0, 1, 2, 3, 4]), + ('b', [-10, -5, 0, 5, 10]) + ]) for col in table.itercolumns(): for chunk in col.data.iterchunks(): From 9f7d4ae6da04d9339dfa2811d750ccf616568bc8 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 3 Jan 2017 08:27:36 +0100 Subject: [PATCH 0255/1644] ARROW-108: [C++] Add Union implementation and IPC/JSON serialization tests Closes #206. Still need to add test cases for JSON read/write and dense union IPC. Integration tests can happen in a subsequent PR (but the Java library does not support dense unions yet, so sparse only -- i.e. 
no offsets vector) Author: Wes McKinney Closes #264 from wesm/ARROW-108 and squashes the following commits: 86c4191 [Wes McKinney] Fix valgrind error cdfc61d [Wes McKinney] Export UnionArray 3edca1e [Wes McKinney] Implement basic JSON roundtrip for unions 30b7188 [Wes McKinney] Add test case for dense union, implement RangeEquals for it 4887fd2 [Wes McKinney] Move Windows stuff into a compatibility header, exclude from clang-format because of include order sensitivity 5ca9c57 [Wes McKinney] Implement IPC/JSON serializationf or unions. Test UnionMode::SPARSE example in IPC --- cpp/CMakeLists.txt | 4 +- cpp/src/arrow/array-list-test.cc | 2 +- cpp/src/arrow/array-primitive-test.cc | 2 +- cpp/src/arrow/array-struct-test.cc | 5 +- cpp/src/arrow/array-test.cc | 6 +- cpp/src/arrow/array.cc | 120 ++++++++++++++++++++--- cpp/src/arrow/array.h | 90 +++++++++++------ cpp/src/arrow/builder.h | 2 +- cpp/src/arrow/io/hdfs-internal.h | 12 +-- cpp/src/arrow/io/windows_compatibility.h | 36 +++++++ cpp/src/arrow/ipc/adapter.cc | 56 ++++++++--- cpp/src/arrow/ipc/ipc-adapter-test.cc | 6 +- cpp/src/arrow/ipc/ipc-json-test.cc | 18 +++- cpp/src/arrow/ipc/json-internal.cc | 90 +++++++++++++---- cpp/src/arrow/ipc/test-common.h | 83 ++++++++++++++-- cpp/src/arrow/pretty_print.cc | 44 ++++++--- cpp/src/arrow/test-util.h | 14 ++- cpp/src/arrow/type.cc | 2 +- cpp/src/arrow/type.h | 8 +- 19 files changed, 476 insertions(+), 124 deletions(-) create mode 100644 cpp/src/arrow/io/windows_compatibility.h diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index bf30543dc4d65..13f0354a73b8b 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -668,7 +668,9 @@ endif (UNIX) if (${CLANG_FORMAT_FOUND}) # runs clang format and updates files in place. add_custom_target(format ${BUILD_SUPPORT_DIR}/run-clang-format.sh ${CMAKE_CURRENT_SOURCE_DIR} ${CLANG_FORMAT_BIN} 1 - `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc -or -name \\*.h | sed -e '/_generated/g'` + `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc -or -name \\*.h | + sed -e '/_generated/g' | + sed -e '/windows_compatibility.h/g'` `find ${CMAKE_CURRENT_SOURCE_DIR}/../python -name \\*.cc -or -name \\*.h`) # runs clang format and exits with a non-zero exit code if any files need to be reformatted diff --git a/cpp/src/arrow/array-list-test.cc b/cpp/src/arrow/array-list-test.cc index 8baaf06a7dbcc..8e4d319f5dca8 100644 --- a/cpp/src/arrow/array-list-test.cc +++ b/cpp/src/arrow/array-list-test.cc @@ -89,7 +89,7 @@ class TestListBuilder : public TestBuilder { TEST_F(TestListBuilder, Equality) { Int32Builder* vb = static_cast(builder_->value_builder().get()); - ArrayPtr array, equal_array, unequal_array; + std::shared_ptr array, equal_array, unequal_array; vector equal_offsets = {0, 1, 2, 5}; vector equal_values = {1, 2, 3, 4, 5, 2, 2, 2}; vector unequal_offsets = {0, 1, 4}; diff --git a/cpp/src/arrow/array-primitive-test.cc b/cpp/src/arrow/array-primitive-test.cc index a10e2404f29c6..443abac459dbf 100644 --- a/cpp/src/arrow/array-primitive-test.cc +++ b/cpp/src/arrow/array-primitive-test.cc @@ -318,7 +318,7 @@ TYPED_TEST(TestPrimitiveBuilder, Equality) { this->RandomData(size); vector& draws = this->draws_; vector& valid_bytes = this->valid_bytes_; - ArrayPtr array, equal_array, unequal_array; + std::shared_ptr array, equal_array, unequal_array; auto builder = this->builder_.get(); ASSERT_OK(MakeArray(valid_bytes, draws, size, builder, &array)); ASSERT_OK(MakeArray(valid_bytes, draws, size, builder, &equal_array)); diff --git a/cpp/src/arrow/array-struct-test.cc 
b/cpp/src/arrow/array-struct-test.cc index 58386fe028fd2..5827c399dda17 100644 --- a/cpp/src/arrow/array-struct-test.cc +++ b/cpp/src/arrow/array-struct-test.cc @@ -261,8 +261,9 @@ TEST_F(TestStructBuilder, BulkAppendInvalid) { } TEST_F(TestStructBuilder, TestEquality) { - ArrayPtr array, equal_array; - ArrayPtr unequal_bitmap_array, unequal_offsets_array, unequal_values_array; + std::shared_ptr array, equal_array; + std::shared_ptr unequal_bitmap_array, unequal_offsets_array, + unequal_values_array; vector int_values = {1, 2, 3, 4}; vector list_values = {'j', 'o', 'e', 'b', 'o', 'b', 'm', 'a', 'r', 'k'}; diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc index 783104e874bb7..a1d8fdfa91e85 100644 --- a/cpp/src/arrow/array-test.cc +++ b/cpp/src/arrow/array-test.cc @@ -56,7 +56,8 @@ TEST_F(TestArray, TestLength) { ASSERT_EQ(arr->length(), 100); } -ArrayPtr MakeArrayFromValidBytes(const std::vector& v, MemoryPool* pool) { +std::shared_ptr MakeArrayFromValidBytes( + const std::vector& v, MemoryPool* pool) { int32_t null_count = v.size() - std::accumulate(v.begin(), v.end(), 0); std::shared_ptr null_buf = test::bytes_to_null_buffer(v); @@ -65,7 +66,8 @@ ArrayPtr MakeArrayFromValidBytes(const std::vector& v, MemoryPool* pool value_builder.Append(0); } - ArrayPtr arr(new Int32Array(v.size(), value_builder.Finish(), null_count, null_buf)); + std::shared_ptr arr( + new Int32Array(v.size(), value_builder.Finish(), null_count, null_buf)); return arr; } diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index d13fa1e108196..3d309b8b92f48 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -189,14 +189,14 @@ bool BooleanArray::EqualsExact(const BooleanArray& other) const { } } -bool BooleanArray::Equals(const ArrayPtr& arr) const { +bool BooleanArray::Equals(const std::shared_ptr& arr) const { if (this == arr.get()) return true; if (Type::BOOL != arr->type_enum()) { return false; } return EqualsExact(*static_cast(arr.get())); } bool BooleanArray::RangeEquals(int32_t start_idx, int32_t end_idx, - int32_t other_start_idx, const ArrayPtr& arr) const { + int32_t other_start_idx, const std::shared_ptr& arr) const { if (this == arr.get()) { return true; } if (!arr) { return false; } if (this->type_enum() != arr->type_enum()) { return false; } @@ -222,7 +222,7 @@ bool ListArray::EqualsExact(const ListArray& other) const { if (null_count_ != other.null_count_) { return false; } bool equal_offsets = - offset_buffer_->Equals(*other.offset_buffer_, (length_ + 1) * sizeof(int32_t)); + offsets_buffer_->Equals(*other.offsets_buffer_, (length_ + 1) * sizeof(int32_t)); if (!equal_offsets) { return false; } bool equal_null_bitmap = true; if (null_count_ > 0) { @@ -269,10 +269,10 @@ bool ListArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_st Status ListArray::Validate() const { if (length_ < 0) { return Status::Invalid("Length was negative"); } - if (!offset_buffer_) { return Status::Invalid("offset_buffer_ was null"); } - if (offset_buffer_->size() / static_cast(sizeof(int32_t)) < length_) { + if (!offsets_buffer_) { return Status::Invalid("offsets_buffer_ was null"); } + if (offsets_buffer_->size() / static_cast(sizeof(int32_t)) < length_) { std::stringstream ss; - ss << "offset buffer size (bytes): " << offset_buffer_->size() + ss << "offset buffer size (bytes): " << offsets_buffer_->size() << " isn't large enough for length: " << length_; return Status::Invalid(ss.str()); } @@ -337,8 +337,8 @@ BinaryArray::BinaryArray(const TypePtr& type, int32_t 
length, const std::shared_ptr& offsets, const std::shared_ptr& data, int32_t null_count, const std::shared_ptr& null_bitmap) : Array(type, length, null_count, null_bitmap), - offset_buffer_(offsets), - offsets_(reinterpret_cast(offset_buffer_->data())), + offsets_buffer_(offsets), + offsets_(reinterpret_cast(offsets_buffer_->data())), data_buffer_(data), data_(nullptr) { if (data_buffer_ != nullptr) { data_ = data_buffer_->data(); } @@ -353,7 +353,7 @@ bool BinaryArray::EqualsExact(const BinaryArray& other) const { if (!Array::EqualsExact(other)) { return false; } bool equal_offsets = - offset_buffer_->Equals(*other.offset_buffer_, (length_ + 1) * sizeof(int32_t)); + offsets_buffer_->Equals(*other.offsets_buffer_, (length_ + 1) * sizeof(int32_t)); if (!equal_offsets) { return false; } if (!data_buffer_ && !(other.data_buffer_)) { return true; } @@ -433,7 +433,7 @@ bool StructArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_ if (this == arr.get()) { return true; } if (!arr) { return false; } if (Type::STRUCT != arr->type_enum()) { return false; } - const auto other = static_cast(arr.get()); + const auto& other = static_cast(*arr.get()); bool equal_fields = true; for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { @@ -442,7 +442,7 @@ bool StructArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_ for (size_t j = 0; j < field_arrays_.size(); ++j) { // TODO: really we should be comparing stretches of non-null data rather // than looking at one value at a time. - equal_fields = field(j)->RangeEquals(i, i + 1, o_i, other->field(j)); + equal_fields = field(j)->RangeEquals(i, i + 1, o_i, other.field(j)); if (!equal_fields) { return false; } } } @@ -490,6 +490,102 @@ Status StructArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } +// ---------------------------------------------------------------------- +// UnionArray + +UnionArray::UnionArray(const TypePtr& type, int32_t length, + const std::vector>& children, + const std::shared_ptr& type_ids, const std::shared_ptr& offsets, + int32_t null_count, const std::shared_ptr& null_bitmap) + : Array(type, length, null_count, null_bitmap), + children_(children), + type_ids_buffer_(type_ids), + offsets_buffer_(offsets) { + type_ids_ = reinterpret_cast(type_ids->data()); + if (offsets) { offsets_ = reinterpret_cast(offsets->data()); } +} + +std::shared_ptr UnionArray::child(int32_t pos) const { + DCHECK_GT(children_.size(), 0); + return children_[pos]; +} + +bool UnionArray::Equals(const std::shared_ptr& arr) const { + if (this == arr.get()) { return true; } + if (!arr) { return false; } + if (!this->type_->Equals(arr->type())) { return false; } + if (null_count_ != arr->null_count()) { return false; } + return RangeEquals(0, length_, 0, arr); +} + +bool UnionArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, + const std::shared_ptr& arr) const { + if (this == arr.get()) { return true; } + if (!arr) { return false; } + if (Type::UNION != arr->type_enum()) { return false; } + const auto& other = static_cast(*arr.get()); + + const UnionMode union_mode = mode(); + if (union_mode != other.mode()) { return false; } + + // Define a mapping from the type id to child number + const auto& type_codes = static_cast(*arr->type().get()).type_ids; + uint8_t max_code = 0; + for (uint8_t code : type_codes) { + if (code > max_code) { max_code = code; } + } + + // Store mapping in a vector for constant time lookups + std::vector type_id_to_child_num(max_code 
+ 1); + for (uint8_t i = 0; i < static_cast(type_codes.size()); ++i) { + type_id_to_child_num[type_codes[i]] = i; + } + + const uint8_t* this_ids = raw_type_ids(); + const uint8_t* other_ids = other.raw_type_ids(); + + uint8_t id, child_num; + for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { + if (IsNull(i) != other.IsNull(o_i)) { return false; } + if (IsNull(i)) continue; + if (this_ids[i] != other_ids[o_i]) { return false; } + + id = this_ids[i]; + child_num = type_id_to_child_num[id]; + + // TODO(wesm): really we should be comparing stretches of non-null data + // rather than looking at one value at a time. + if (union_mode == UnionMode::SPARSE) { + if (!child(child_num)->RangeEquals(i, i + 1, o_i, other.child(child_num))) { + return false; + } + } else { + const int32_t offset = offsets_[i]; + const int32_t o_offset = other.offsets_[i]; + if (!child(child_num)->RangeEquals( + offset, offset + 1, o_offset, other.child(child_num))) { + return false; + } + } + } + return true; +} + +Status UnionArray::Validate() const { + if (length_ < 0) { return Status::Invalid("Length was negative"); } + + if (null_count() > length_) { + return Status::Invalid("Null count exceeds the length of this struct"); + } + + DCHECK(false) << "Validate not yet implemented"; + return Status::OK(); +} + +Status UnionArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + // ---------------------------------------------------------------------- #define MAKE_PRIMITIVE_ARRAY_CASE(ENUM, ArrayType) \ @@ -499,7 +595,7 @@ Status StructArray::Accept(ArrayVisitor* visitor) const { Status MakePrimitiveArray(const TypePtr& type, int32_t length, const std::shared_ptr& data, int32_t null_count, - const std::shared_ptr& null_bitmap, ArrayPtr* out) { + const std::shared_ptr& null_bitmap, std::shared_ptr* out) { switch (type->type) { MAKE_PRIMITIVE_ARRAY_CASE(BOOL, BooleanArray); MAKE_PRIMITIVE_ARRAY_CASE(UINT8, UInt8Array); diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 6239ccc576b8d..cd42a28e251ca 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -108,8 +108,6 @@ class ARROW_EXPORT NullArray : public Array { Status Accept(ArrayVisitor* visitor) const override; }; -typedef std::shared_ptr ArrayPtr; - Status ARROW_EXPORT GetEmptyBitmap( MemoryPool* pool, int32_t length, std::shared_ptr* result); @@ -152,7 +150,7 @@ class ARROW_EXPORT NumericArray : public PrimitiveArray { } bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const ArrayPtr& arr) const override { + const std::shared_ptr& arr) const override { if (this == arr.get()) { return true; } if (!arr) { return false; } if (this->type_enum() != arr->type_enum()) { return false; } @@ -256,9 +254,9 @@ class ARROW_EXPORT BooleanArray : public PrimitiveArray { int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); bool EqualsExact(const BooleanArray& other) const; - bool Equals(const ArrayPtr& arr) const override; + bool Equals(const std::shared_ptr& arr) const override; bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const ArrayPtr& arr) const override; + const std::shared_ptr& arr) const override; Status Accept(ArrayVisitor* visitor) const override; @@ -274,13 +272,13 @@ class ARROW_EXPORT ListArray : public Array { public: using TypeClass = ListType; - ListArray(const TypePtr& type, int32_t length, std::shared_ptr offsets, - const ArrayPtr& values, int32_t null_count = 0, - std::shared_ptr null_bitmap = nullptr) + 
ListArray(const TypePtr& type, int32_t length, const std::shared_ptr& offsets, + const std::shared_ptr& values, int32_t null_count = 0, + const std::shared_ptr& null_bitmap = nullptr) : Array(type, length, null_count, null_bitmap) { - offset_buffer_ = offsets; + offsets_buffer_ = offsets; offsets_ = offsets == nullptr ? nullptr : reinterpret_cast( - offset_buffer_->data()); + offsets_buffer_->data()); values_ = values; } @@ -291,9 +289,7 @@ class ARROW_EXPORT ListArray : public Array { // Return a shared pointer in case the requestor desires to share ownership // with this array. std::shared_ptr values() const { return values_; } - std::shared_ptr offsets() const { - return std::static_pointer_cast(offset_buffer_); - } + std::shared_ptr offsets() const { return offsets_buffer_; } std::shared_ptr value_type() const { return values_->type(); } @@ -309,14 +305,14 @@ class ARROW_EXPORT ListArray : public Array { bool Equals(const std::shared_ptr& arr) const override; bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const ArrayPtr& arr) const override; + const std::shared_ptr& arr) const override; Status Accept(ArrayVisitor* visitor) const override; protected: - std::shared_ptr offset_buffer_; + std::shared_ptr offsets_buffer_; const int32_t* offsets_; - ArrayPtr values_; + std::shared_ptr values_; }; // ---------------------------------------------------------------------- @@ -346,7 +342,7 @@ class ARROW_EXPORT BinaryArray : public Array { } std::shared_ptr data() const { return data_buffer_; } - std::shared_ptr offsets() const { return offset_buffer_; } + std::shared_ptr offsets() const { return offsets_buffer_; } const int32_t* raw_offsets() const { return offsets_; } @@ -359,14 +355,14 @@ class ARROW_EXPORT BinaryArray : public Array { bool EqualsExact(const BinaryArray& other) const; bool Equals(const std::shared_ptr& arr) const override; bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const ArrayPtr& arr) const override; + const std::shared_ptr& arr) const override; Status Validate() const override; Status Accept(ArrayVisitor* visitor) const override; private: - std::shared_ptr offset_buffer_; + std::shared_ptr offsets_buffer_; const int32_t* offsets_; std::shared_ptr data_buffer_; @@ -401,8 +397,9 @@ class ARROW_EXPORT StructArray : public Array { public: using TypeClass = StructType; - StructArray(const TypePtr& type, int32_t length, std::vector& field_arrays, - int32_t null_count = 0, std::shared_ptr null_bitmap = nullptr) + StructArray(const TypePtr& type, int32_t length, + const std::vector>& field_arrays, int32_t null_count = 0, + std::shared_ptr null_bitmap = nullptr) : Array(type, length, null_count, null_bitmap) { type_ = type; field_arrays_ = field_arrays; @@ -416,7 +413,7 @@ class ARROW_EXPORT StructArray : public Array { // with this array. std::shared_ptr field(int32_t pos) const; - const std::vector& fields() const { return field_arrays_; } + const std::vector>& fields() const { return field_arrays_; } bool EqualsExact(const StructArray& other) const; bool Equals(const std::shared_ptr& arr) const override; @@ -427,25 +424,54 @@ class ARROW_EXPORT StructArray : public Array { protected: // The child arrays corresponding to each field of the struct data type. 
- std::vector field_arrays_; + std::vector> field_arrays_; }; // ---------------------------------------------------------------------- // Union -class UnionArray : public Array { +class ARROW_EXPORT UnionArray : public Array { + public: + using TypeClass = UnionType; + + UnionArray(const TypePtr& type, int32_t length, + const std::vector>& children, + const std::shared_ptr& type_ids, + const std::shared_ptr& offsets = nullptr, int32_t null_count = 0, + const std::shared_ptr& null_bitmap = nullptr); + + Status Validate() const override; + + virtual ~UnionArray() {} + + std::shared_ptr type_ids() const { return type_ids_buffer_; } + const uint8_t* raw_type_ids() const { return type_ids_; } + + std::shared_ptr offsets() const { return offsets_buffer_; } + const int32_t* raw_offsets() const { return offsets_; } + + UnionMode mode() const { return static_cast(*type_.get()).mode; } + + std::shared_ptr child(int32_t pos) const; + + const std::vector>& children() const { return children_; } + + bool EqualsExact(const UnionArray& other) const; + bool Equals(const std::shared_ptr& arr) const override; + bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, + const std::shared_ptr& arr) const override; + + Status Accept(ArrayVisitor* visitor) const override; + protected: - // The data are types encoded as int16 - Buffer* types_; std::vector> children_; -}; -class DenseUnionArray : public UnionArray { - protected: - Buffer* offset_buf_; -}; + std::shared_ptr type_ids_buffer_; + const uint8_t* type_ids_; -class SparseUnionArray : public UnionArray {}; + std::shared_ptr offsets_buffer_; + const int32_t* offsets_; +}; // ---------------------------------------------------------------------- // extern templates and other details diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index 205139849b44e..1837340cedc81 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -228,7 +228,7 @@ using DoubleBuilder = NumericBuilder; class ARROW_EXPORT BooleanBuilder : public ArrayBuilder { public: - explicit BooleanBuilder(MemoryPool* pool, const TypePtr& type) + explicit BooleanBuilder(MemoryPool* pool, const TypePtr& type = boolean()) : ArrayBuilder(pool, type), data_(nullptr) {} virtual ~BooleanBuilder() {} diff --git a/cpp/src/arrow/io/hdfs-internal.h b/cpp/src/arrow/io/hdfs-internal.h index 8f9a06758cbaa..01cf1499857d9 100644 --- a/cpp/src/arrow/io/hdfs-internal.h +++ b/cpp/src/arrow/io/hdfs-internal.h @@ -20,21 +20,11 @@ #ifndef _WIN32 #include -#else - -// Windows defines min and max macros that mess up std::min/maxa -#ifndef NOMINMAX -#define NOMINMAX -#endif -#include -#include - -// TODO(wesm): address when/if we add windows support -// #include #endif #include +#include "arrow/io/windows_compatibility.h" #include "arrow/util/visibility.h" namespace arrow { diff --git a/cpp/src/arrow/io/windows_compatibility.h b/cpp/src/arrow/io/windows_compatibility.h new file mode 100644 index 0000000000000..ac8f6aeeb5cac --- /dev/null +++ b/cpp/src/arrow/io/windows_compatibility.h @@ -0,0 +1,36 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_IO_WINDOWS_COMPATIBILITY +#define ARROW_IO_WINDOWS_COMPATIBILITY + +#ifdef _WIN32 + +// Windows defines min and max macros that mess up std::min/max +#ifndef NOMINMAX +#define NOMINMAX +#endif + +#include +#include + +// TODO(wesm): address when/if we add windows support +// #include + +#endif // _WIN32 + +#endif // ARROW_IO_WINDOWS_COMPATIBILITY diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index ac4054b376adc..9bfd11fd01b5a 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -276,7 +276,16 @@ class RecordBatchWriter : public ArrayVisitor { } Status Visit(const UnionArray& array) override { - return Status::NotImplemented("union"); + buffers_.push_back(array.type_ids()); + + if (array.mode() == UnionMode::DENSE) { buffers_.push_back(array.offsets()); } + + --max_recursion_depth_; + for (const auto& field : array.children()) { + RETURN_NOT_OK(VisitArray(*field.get())); + } + ++max_recursion_depth_; + return Status::OK(); } // Do not copy this vector. Ownership must be retained elsewhere @@ -464,9 +473,10 @@ class ArrayLoader : public TypeVisitor { Status Visit(const ListType& type) override { FieldMetadata field_meta; std::shared_ptr null_bitmap; - std::shared_ptr offsets; RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); + + std::shared_ptr offsets; RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &offsets)); const int num_children = type.num_children(); @@ -484,20 +494,25 @@ class ArrayLoader : public TypeVisitor { return Status::OK(); } + Status LoadChildren(std::vector> child_fields, + std::vector>* arrays) { + arrays->reserve(static_cast(child_fields.size())); + + for (const auto& child_field : child_fields) { + std::shared_ptr field_array; + RETURN_NOT_OK(LoadChild(*child_field.get(), &field_array)); + arrays->emplace_back(field_array); + } + return Status::OK(); + } + Status Visit(const StructType& type) override { FieldMetadata field_meta; std::shared_ptr null_bitmap; RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); - const int num_children = type.num_children(); - std::vector fields; - fields.reserve(num_children); - - for (int child_idx = 0; child_idx < num_children; ++child_idx) { - std::shared_ptr field_array; - RETURN_NOT_OK(LoadChild(*type.child(child_idx).get(), &field_array)); - fields.emplace_back(field_array); - } + std::vector> fields; + RETURN_NOT_OK(LoadChildren(type.children(), &fields)); result_ = std::make_shared( field_.type, field_meta.length, fields, field_meta.null_count, null_bitmap); @@ -505,7 +520,24 @@ class ArrayLoader : public TypeVisitor { } Status Visit(const UnionType& type) override { - return Status::NotImplemented(type.ToString()); + FieldMetadata field_meta; + std::shared_ptr null_bitmap; + RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); + + std::shared_ptr type_ids; + std::shared_ptr offsets = nullptr; + RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &type_ids)); + + if (type.mode == UnionMode::DENSE) { + RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &offsets)); + } + + std::vector> fields; + RETURN_NOT_OK(LoadChildren(type.children(), 
&fields)); + + result_ = std::make_shared(field_.type, field_meta.length, fields, + type_ids, offsets, field_meta.null_count, null_bitmap); + return Status::OK(); } }; diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc index f309b8562f76a..6ba0a6e16be08 100644 --- a/cpp/src/arrow/ipc/ipc-adapter-test.cc +++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc @@ -95,7 +95,7 @@ TEST_P(TestWriteRecordBatch, RoundTrip) { INSTANTIATE_TEST_CASE_P(RoundTripTests, TestWriteRecordBatch, ::testing::Values(&MakeIntRecordBatch, &MakeListRecordBatch, &MakeNonNullRecordBatch, &MakeZeroLengthRecordBatch, &MakeDeeplyNestedList, - &MakeStringTypesRecordBatch, &MakeStruct)); + &MakeStringTypesRecordBatch, &MakeStruct, &MakeUnion)); void TestGetRecordBatchSize(std::shared_ptr batch) { ipc::MockOutputStream mock; @@ -136,7 +136,7 @@ class RecursionLimits : public ::testing::Test, public io::MemoryMapFixture { int64_t* body_length, std::shared_ptr* schema) { const int batch_length = 5; TypePtr type = int32(); - ArrayPtr array; + std::shared_ptr array; const bool include_nulls = true; RETURN_NOT_OK(MakeRandomInt32Array(1000, include_nulls, pool_, &array)); for (int i = 0; i < recursion_level; ++i) { @@ -149,7 +149,7 @@ class RecursionLimits : public ::testing::Test, public io::MemoryMapFixture { *schema = std::shared_ptr(new Schema({f0})); - std::vector arrays = {array}; + std::vector> arrays = {array}; auto batch = std::make_shared(*schema, batch_length, arrays); std::string path = "test-write-past-max-recursion"; diff --git a/cpp/src/arrow/ipc/ipc-json-test.cc b/cpp/src/arrow/ipc/ipc-json-test.cc index f793a2659579c..07509890da35c 100644 --- a/cpp/src/arrow/ipc/ipc-json-test.cc +++ b/cpp/src/arrow/ipc/ipc-json-test.cc @@ -29,6 +29,7 @@ #include "arrow/builder.h" #include "arrow/ipc/json-internal.h" #include "arrow/ipc/json.h" +#include "arrow/ipc/test-common.h" #include "arrow/memory_pool.h" #include "arrow/status.h" #include "arrow/table.h" @@ -142,11 +143,16 @@ TEST(TestJsonArrayWriter, NestedTypes) { auto value_type = int32(); std::vector values_is_valid = {true, false, true, true, false, true, true}; - std::vector values = {0, 1, 2, 3, 4, 5, 6}; + std::vector values = {0, 1, 2, 3, 4, 5, 6}; std::shared_ptr values_array; ArrayFromVector(int32(), values_is_valid, values, &values_array); + std::vector i16_values = {0, 1, 2, 3, 4, 5, 6}; + std::shared_ptr i16_values_array; + ArrayFromVector( + int16(), values_is_valid, i16_values, &i16_values_array); + // List std::vector list_is_valid = {true, false, true, true, true}; std::vector offsets = {0, 0, 0, 1, 4, 7}; @@ -173,6 +179,16 @@ TEST(TestJsonArrayWriter, NestedTypes) { TestArrayRoundTrip(struct_array); } +TEST(TestJsonArrayWriter, Unions) { + std::shared_ptr batch; + ASSERT_OK(MakeUnion(&batch)); + + for (int i = 0; i < batch->num_columns(); ++i) { + std::shared_ptr col = batch->column(i); + TestArrayRoundTrip(*col.get()); + } +} + // Data generation for test case below void MakeBatchArrays(const std::shared_ptr& schema, const int num_rows, std::vector>* arrays) { diff --git a/cpp/src/arrow/ipc/json-internal.cc b/cpp/src/arrow/ipc/json-internal.cc index db11b7d0372f7..4f980d3e5d157 100644 --- a/cpp/src/arrow/ipc/json-internal.cc +++ b/cpp/src/arrow/ipc/json-internal.cc @@ -415,11 +415,11 @@ class JsonArrayWriter : public ArrayVisitor { } template - void WriteOffsetsField(const T* offsets, int32_t length) { - writer_->Key("OFFSET"); + void WriteIntegerField(const char* name, const T* values, int32_t length) { + 
writer_->Key(name); writer_->StartArray(); for (int i = 0; i < length; ++i) { - writer_->Int64(offsets[i]); + writer_->Int64(values[i]); } writer_->EndArray(); } @@ -456,7 +456,7 @@ class JsonArrayWriter : public ArrayVisitor { template Status WriteVarBytes(const T& array) { WriteValidityField(array); - WriteOffsetsField(array.raw_offsets(), array.length() + 1); + WriteIntegerField("OFFSET", array.raw_offsets(), array.length() + 1); WriteDataField(array); SetNoChildren(); return Status::OK(); @@ -524,7 +524,7 @@ class JsonArrayWriter : public ArrayVisitor { Status Visit(const ListArray& array) override { WriteValidityField(array); - WriteOffsetsField(array.raw_offsets(), array.length() + 1); + WriteIntegerField("OFFSET", array.raw_offsets(), array.length() + 1); auto type = static_cast(array.type().get()); return WriteChildren(type->children(), {array.values()}); } @@ -536,7 +536,14 @@ class JsonArrayWriter : public ArrayVisitor { } Status Visit(const UnionArray& array) override { - return Status::NotImplemented("union"); + WriteValidityField(array); + auto type = static_cast(array.type().get()); + + WriteIntegerField("TYPE_ID", array.raw_type_ids(), array.length()); + if (type->mode == UnionMode::DENSE) { + WriteIntegerField("OFFSET", array.raw_offsets(), array.length()); + } + return WriteChildren(type->children(), array.children()); } private: @@ -847,27 +854,35 @@ class JsonArrayReader { return builder.Finish(array); } + template + Status GetIntArray( + const RjArray& json_array, const int32_t length, std::shared_ptr* out) { + auto buffer = std::make_shared(pool_); + RETURN_NOT_OK(buffer->Resize(length * sizeof(T))); + T* values = reinterpret_cast(buffer->mutable_data()); + for (int i = 0; i < length; ++i) { + const rj::Value& val = json_array[i]; + DCHECK(val.IsInt()); + values[i] = static_cast(val.GetInt()); + } + + *out = buffer; + return Status::OK(); + } + template typename std::enable_if::value, Status>::type ReadArray( const RjObject& json_array, int32_t length, const std::vector& is_valid, const std::shared_ptr& type, std::shared_ptr* array) { - const auto& json_offsets = json_array.FindMember("OFFSET"); - RETURN_NOT_ARRAY("OFFSET", json_offsets, json_array); - const auto& json_offsets_arr = json_offsets->value.GetArray(); - int32_t null_count = 0; std::shared_ptr validity_buffer; RETURN_NOT_OK(GetValidityBuffer(is_valid, &null_count, &validity_buffer)); - auto offsets_buffer = std::make_shared(pool_); - RETURN_NOT_OK(offsets_buffer->Resize((length + 1) * sizeof(int32_t))); - int32_t* offsets = reinterpret_cast(offsets_buffer->mutable_data()); - - for (int i = 0; i < length + 1; ++i) { - const rj::Value& val = json_offsets_arr[i]; - DCHECK(val.IsInt()); - offsets[i] = val.GetInt(); - } + const auto& json_offsets = json_array.FindMember("OFFSET"); + RETURN_NOT_ARRAY("OFFSET", json_offsets, json_array); + std::shared_ptr offsets_buffer; + RETURN_NOT_OK(GetIntArray( + json_offsets->value.GetArray(), length + 1, &offsets_buffer)); std::vector> children; RETURN_NOT_OK(GetChildren(json_array, type, &children)); @@ -896,6 +911,41 @@ class JsonArrayReader { return Status::OK(); } + template + typename std::enable_if::value, Status>::type ReadArray( + const RjObject& json_array, int32_t length, const std::vector& is_valid, + const std::shared_ptr& type, std::shared_ptr* array) { + int32_t null_count = 0; + + const auto& union_type = static_cast(*type.get()); + + std::shared_ptr validity_buffer; + std::shared_ptr type_id_buffer; + std::shared_ptr offsets_buffer; + + 
RETURN_NOT_OK(GetValidityBuffer(is_valid, &null_count, &validity_buffer)); + + const auto& json_type_ids = json_array.FindMember("TYPE_ID"); + RETURN_NOT_ARRAY("TYPE_ID", json_type_ids, json_array); + RETURN_NOT_OK( + GetIntArray(json_type_ids->value.GetArray(), length, &type_id_buffer)); + + if (union_type.mode == UnionMode::DENSE) { + const auto& json_offsets = json_array.FindMember("OFFSET"); + RETURN_NOT_ARRAY("OFFSET", json_offsets, json_array); + RETURN_NOT_OK( + GetIntArray(json_offsets->value.GetArray(), length, &offsets_buffer)); + } + + std::vector> children; + RETURN_NOT_OK(GetChildren(json_array, type, &children)); + + *array = std::make_shared(type, length, children, type_id_buffer, + offsets_buffer, null_count, validity_buffer); + + return Status::OK(); + } + template typename std::enable_if::value, Status>::type ReadArray( const RjObject& json_array, int32_t length, const std::vector& is_valid, @@ -992,7 +1042,7 @@ class JsonArrayReader { NOT_IMPLEMENTED_CASE(INTERVAL); TYPE_CASE(ListType); TYPE_CASE(StructType); - NOT_IMPLEMENTED_CASE(UNION); + TYPE_CASE(UnionType); default: std::stringstream ss; ss << type->ToString(); diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index 8416f0df57364..3faeebf956966 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -110,7 +110,7 @@ Status MakeIntRecordBatch(std::shared_ptr* out) { template Status MakeRandomBinaryArray( - const TypePtr& type, int32_t length, MemoryPool* pool, ArrayPtr* out) { + const TypePtr& type, int32_t length, MemoryPool* pool, std::shared_ptr* out) { const std::vector values = { "", "", "abc", "123", "efg", "456!@#!@#", "12312"}; Builder builder(pool, type); @@ -225,7 +225,7 @@ Status MakeDeeplyNestedList(std::shared_ptr* out) { TypePtr type = int32(); MemoryPool* pool = default_memory_pool(); - ArrayPtr array; + std::shared_ptr array; const bool include_nulls = true; RETURN_NOT_OK(MakeRandomInt32Array(1000, include_nulls, pool, &array)); for (int i = 0; i < 63; ++i) { @@ -235,7 +235,7 @@ Status MakeDeeplyNestedList(std::shared_ptr* out) { auto f0 = std::make_shared("f0", type); std::shared_ptr schema(new Schema({f0})); - std::vector arrays = {array}; + std::vector> arrays = {array}; out->reset(new RecordBatch(schema, batch_length, arrays)); return Status::OK(); } @@ -244,7 +244,7 @@ Status MakeStruct(std::shared_ptr* out) { // reuse constructed list columns std::shared_ptr list_batch; RETURN_NOT_OK(MakeListRecordBatch(&list_batch)); - std::vector columns = { + std::vector> columns = { list_batch->column(0), list_batch->column(1), list_batch->column(2)}; auto list_schema = list_batch->schema(); @@ -256,20 +256,89 @@ Status MakeStruct(std::shared_ptr* out) { std::shared_ptr schema(new Schema({f0, f1})); // construct individual nullable/non-nullable struct arrays - ArrayPtr no_nulls(new StructArray(type, list_batch->num_rows(), columns)); + std::shared_ptr no_nulls(new StructArray(type, list_batch->num_rows(), columns)); std::vector null_bytes(list_batch->num_rows(), 1); null_bytes[0] = 0; std::shared_ptr null_bitmask; RETURN_NOT_OK(BitUtil::BytesToBits(null_bytes, &null_bitmask)); - ArrayPtr with_nulls( + std::shared_ptr with_nulls( new StructArray(type, list_batch->num_rows(), columns, 1, null_bitmask)); // construct batch - std::vector arrays = {no_nulls, with_nulls}; + std::vector> arrays = {no_nulls, with_nulls}; out->reset(new RecordBatch(schema, list_batch->num_rows(), arrays)); return Status::OK(); } +Status MakeUnion(std::shared_ptr* out) { + // 
Define schema + std::vector> union_types( + {std::make_shared("u0", int32()), std::make_shared("u1", uint8())}); + + std::vector type_codes = {5, 10}; + auto sparse_type = + std::make_shared(union_types, type_codes, UnionMode::SPARSE); + + auto dense_type = + std::make_shared(union_types, type_codes, UnionMode::DENSE); + + auto f0 = std::make_shared("sparse_nonnull", sparse_type, false); + auto f1 = std::make_shared("sparse", sparse_type); + auto f2 = std::make_shared("dense", dense_type); + + std::shared_ptr schema(new Schema({f0, f1, f2})); + + // Create data + std::vector> sparse_children(2); + std::vector> dense_children(2); + + const int32_t length = 7; + + std::shared_ptr type_ids_buffer; + std::vector type_ids = {5, 10, 5, 5, 10, 10, 5}; + RETURN_NOT_OK(test::CopyBufferFromVector(type_ids, &type_ids_buffer)); + + std::vector u0_values = {0, 1, 2, 3, 4, 5, 6}; + ArrayFromVector( + sparse_type->child(0)->type, u0_values, &sparse_children[0]); + + std::vector u1_values = {10, 11, 12, 13, 14, 15, 16}; + ArrayFromVector( + sparse_type->child(1)->type, u1_values, &sparse_children[1]); + + // dense children + u0_values = {0, 2, 3, 7}; + ArrayFromVector( + dense_type->child(0)->type, u0_values, &dense_children[0]); + + u1_values = {11, 14, 15}; + ArrayFromVector( + dense_type->child(1)->type, u1_values, &dense_children[1]); + + std::shared_ptr offsets_buffer; + std::vector offsets = {0, 0, 1, 2, 1, 2, 3}; + RETURN_NOT_OK(test::CopyBufferFromVector(offsets, &offsets_buffer)); + + std::vector null_bytes(length, 1); + null_bytes[2] = 0; + std::shared_ptr null_bitmask; + RETURN_NOT_OK(BitUtil::BytesToBits(null_bytes, &null_bitmask)); + + // construct individual nullable/non-nullable struct arrays + auto sparse_no_nulls = + std::make_shared(sparse_type, length, sparse_children, type_ids_buffer); + auto sparse = std::make_shared( + sparse_type, length, sparse_children, type_ids_buffer, nullptr, 1, null_bitmask); + + auto dense = std::make_shared(dense_type, length, dense_children, + type_ids_buffer, offsets_buffer, 1, null_bitmask); + + // construct batch + std::vector> arrays = {sparse_no_nulls, sparse, dense}; + out->reset(new RecordBatch(schema, length, arrays)); + return Status::OK(); +} + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/pretty_print.cc b/cpp/src/arrow/pretty_print.cc index 9c439c47eb82c..324f81bfbfd6b 100644 --- a/cpp/src/arrow/pretty_print.cc +++ b/cpp/src/arrow/pretty_print.cc @@ -161,44 +161,60 @@ class ArrayPrinter : public ArrayVisitor { return Status::NotImplemented("decimal"); } - Status Visit(const ListArray& array) override { + Status WriteValidityBitmap(const Array& array) { Newline(); Write("-- is_valid: "); BooleanArray is_valid(array.length(), array.null_bitmap()); - PrettyPrint(is_valid, indent_ + 2, sink_); + return PrettyPrint(is_valid, indent_ + 2, sink_); + } + + Status Visit(const ListArray& array) override { + RETURN_NOT_OK(WriteValidityBitmap(array)); Newline(); Write("-- offsets: "); Int32Array offsets(array.length() + 1, array.offsets()); - PrettyPrint(offsets, indent_ + 2, sink_); + RETURN_NOT_OK(PrettyPrint(offsets, indent_ + 2, sink_)); Newline(); Write("-- values: "); - PrettyPrint(*array.values().get(), indent_ + 2, sink_); + RETURN_NOT_OK(PrettyPrint(*array.values().get(), indent_ + 2, sink_)); return Status::OK(); } - Status Visit(const StructArray& array) override { - Newline(); - Write("-- is_valid: "); - BooleanArray is_valid(array.length(), array.null_bitmap()); - PrettyPrint(is_valid, indent_ + 2, sink_); - - const 
std::vector>& fields = array.fields(); + Status PrintChildren(const std::vector>& fields) { for (size_t i = 0; i < fields.size(); ++i) { Newline(); std::stringstream ss; ss << "-- child " << i << " type: " << fields[i]->type()->ToString() << " values: "; Write(ss.str()); - PrettyPrint(*fields[i].get(), indent_ + 2, sink_); + RETURN_NOT_OK(PrettyPrint(*fields[i].get(), indent_ + 2, sink_)); } - return Status::OK(); } + Status Visit(const StructArray& array) override { + RETURN_NOT_OK(WriteValidityBitmap(array)); + return PrintChildren(array.fields()); + } + Status Visit(const UnionArray& array) override { - return Status::NotImplemented("union"); + RETURN_NOT_OK(WriteValidityBitmap(array)); + + Newline(); + Write("-- type_ids: "); + UInt8Array type_ids(array.length(), array.type_ids()); + RETURN_NOT_OK(PrettyPrint(type_ids, indent_ + 2, sink_)); + + if (array.mode() == UnionMode::DENSE) { + Newline(); + Write("-- offsets: "); + Int32Array offsets(array.length(), array.offsets()); + RETURN_NOT_OK(PrettyPrint(offsets, indent_ + 2, sink_)); + } + + return PrintChildren(array.children()); } void Write(const char* data) { (*sink_) << data; } diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index aa310b1a49ebe..ce9327d9009e2 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -274,6 +274,18 @@ void ArrayFromVector(const std::shared_ptr& type, values_buffer, null_count, values_bitmap); } +template +void ArrayFromVector(const std::shared_ptr& type, + const std::vector& values, std::shared_ptr* out) { + std::shared_ptr values_buffer; + + ASSERT_OK(test::CopyBufferFromVector(values, &values_buffer)); + + using ArrayType = typename TypeTraits::ArrayType; + *out = std::make_shared( + type, static_cast(values.size()), values_buffer); +} + class TestBuilder : public ::testing::Test { public: void SetUp() { @@ -293,7 +305,7 @@ class TestBuilder : public ::testing::Test { template Status MakeArray(const std::vector& valid_bytes, const std::vector& values, - int size, Builder* builder, ArrayPtr* out) { + int size, Builder* builder, std::shared_ptr* out) { // Append the first 1000 for (int i = 0; i < size; ++i) { if (valid_bytes[i] > 0) { diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 8ff9eea87051d..89faab6ec6ae2 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -103,7 +103,7 @@ std::string UnionType::ToString() const { for (size_t i = 0; i < children_.size(); ++i) { if (i) { s << ", "; } - s << children_[i]->ToString(); + s << children_[i]->ToString() << "=" << static_cast(type_ids[i]); } s << ">"; return s.str(); diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 73005707c9edc..530c3235dc9ab 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -394,10 +394,10 @@ enum class UnionMode : char { SPARSE, DENSE }; struct ARROW_EXPORT UnionType : public DataType { static constexpr Type::type type_id = Type::UNION; - UnionType(const std::vector>& child_fields, + UnionType(const std::vector>& fields, const std::vector& type_ids, UnionMode mode = UnionMode::SPARSE) : DataType(Type::UNION), mode(mode), type_ids(type_ids) { - children_ = child_fields; + children_ = fields; } std::string ToString() const override; @@ -407,6 +407,10 @@ struct ARROW_EXPORT UnionType : public DataType { std::vector GetBufferLayout() const override; UnionMode mode; + + // The type id used in the data to indicate each data type in the union. For + // example, the first type in the union might be denoted by the id 5 (instead + // of 0). 
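+  // For example, a union declared over (int32, uint8) with type_ids {5, 10}
+  // writes the byte 5 into the type id buffer for every int32 slot and 10
+  // for every uint8 slot, so readers must map type ids back to child
+  // indices rather than assume the ids are 0, 1, ....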
std::vector type_ids; }; From d9df556791fc6051b2c8582668df9c256f675116 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 3 Jan 2017 08:28:46 +0100 Subject: [PATCH 0256/1644] ARROW-294: [C++] Do not use platform-dependent fopen/fclose functions for MemoryMappedFile Also adds a test case for ARROW-340. Author: Wes McKinney Closes #265 from wesm/ARROW-294 and squashes the following commits: 42a83a4 [Wes McKinney] Remove duplicated includes 3928ab0 [Wes McKinney] Base MemoryMappedFile implementation on common OSFile interface. Add test case for ARROW-340. --- cpp/src/arrow/io/file.cc | 208 +++++++++++++++++++++++++++-- cpp/src/arrow/io/file.h | 49 +++++++ cpp/src/arrow/io/io-file-test.cc | 116 +++++++++++++++- cpp/src/arrow/io/io-memory-test.cc | 91 ------------- cpp/src/arrow/io/memory.cc | 178 ------------------------ cpp/src/arrow/io/memory.h | 39 ------ cpp/src/arrow/io/test-common.h | 1 + 7 files changed, 359 insertions(+), 323 deletions(-) diff --git a/cpp/src/arrow/io/file.cc b/cpp/src/arrow/io/file.cc index c50a9bba28e8e..3182f2dd8a3b5 100644 --- a/cpp/src/arrow/io/file.cc +++ b/cpp/src/arrow/io/file.cc @@ -60,7 +60,7 @@ #endif // _MSC_VER -// defines that +// defines that don't exist in MinGW #if defined(__MINGW32__) #define ARROW_WRITE_SHMODE S_IRUSR | S_IWUSR #elif defined(_MSC_VER) // Visual Studio @@ -174,7 +174,8 @@ static inline Status FileOpenReadable(const std::string& filename, int* fd) { return CheckOpenResult(ret, errno_actual, filename.c_str(), filename.size()); } -static inline Status FileOpenWriteable(const std::string& filename, int* fd) { +static inline Status FileOpenWriteable( + const std::string& filename, bool write_only, bool truncate, int* fd) { int ret; errno_t errno_actual = 0; @@ -186,13 +187,31 @@ static inline Status FileOpenWriteable(const std::string& filename, int* fd) { memcpy(wpath.data(), filename.data(), filename.size()); memcpy(wpath.data() + nwchars, L"\0", sizeof(wchar_t)); - errno_actual = _wsopen_s(fd, wpath.data(), _O_WRONLY | _O_CREAT | _O_BINARY | _O_TRUNC, - _SH_DENYNO, _S_IWRITE); + int oflag = _O_CREAT | _O_BINARY; + + if (truncate) { oflag |= _O_TRUNC; } + + if (write_only) { + oflag |= _O_WRONLY; + } else { + oflag |= _O_RDWR; + } + + errno_actual = _wsopen_s(fd, wpath.data(), oflag, _SH_DENYNO, _S_IWRITE); ret = *fd; #else - ret = *fd = - open(filename.c_str(), O_WRONLY | O_CREAT | O_BINARY | O_TRUNC, ARROW_WRITE_SHMODE); + int oflag = O_CREAT | O_BINARY; + + if (truncate) { oflag |= O_TRUNC; } + + if (write_only) { + oflag |= O_WRONLY; + } else { + oflag |= O_RDWR; + } + + ret = *fd = open(filename.c_str(), oflag, ARROW_WRITE_SHMODE); #endif return CheckOpenResult(ret, errno_actual, filename.c_str(), filename.size()); } @@ -296,10 +315,17 @@ class OSFile { ~OSFile() {} - Status OpenWritable(const std::string& path) { - RETURN_NOT_OK(FileOpenWriteable(path, &fd_)); + Status OpenWriteable(const std::string& path, bool append, bool write_only) { + RETURN_NOT_OK(FileOpenWriteable(path, write_only, !append, &fd_)); path_ = path; is_open_ = true; + mode_ = write_only ? 
FileMode::READ : FileMode::READWRITE; + + if (append) { + RETURN_NOT_OK(FileGetSize(fd_, &size_)); + } else { + size_ = 0; + } return Status::OK(); } @@ -307,11 +333,9 @@ class OSFile { RETURN_NOT_OK(FileOpenReadable(path, &fd_)); RETURN_NOT_OK(FileGetSize(fd_, &size_)); - // The position should be 0 after GetSize - // RETURN_NOT_OK(Seek(0)); - path_ = path; is_open_ = true; + mode_ = FileMode::READ; return Status::OK(); } @@ -346,12 +370,14 @@ class OSFile { int64_t size() const { return size_; } - private: + protected: std::string path_; // File descriptor int fd_; + FileMode::type mode_; + bool is_open_; int64_t size_; }; @@ -440,7 +466,9 @@ int ReadableFile::file_descriptor() const { class FileOutputStream::FileOutputStreamImpl : public OSFile { public: - Status Open(const std::string& path) { return OpenWritable(path); } + Status Open(const std::string& path, bool append) { + return OpenWriteable(path, append, true); + } }; FileOutputStream::FileOutputStream() { @@ -453,9 +481,14 @@ FileOutputStream::~FileOutputStream() { Status FileOutputStream::Open( const std::string& path, std::shared_ptr* file) { + return Open(path, false, file); +} + +Status FileOutputStream::Open( + const std::string& path, bool append, std::shared_ptr* file) { // private ctor *file = std::shared_ptr(new FileOutputStream()); - return (*file)->impl_->Open(path); + return (*file)->impl_->Open(path, append); } Status FileOutputStream::Close() { @@ -474,5 +507,152 @@ int FileOutputStream::file_descriptor() const { return impl_->fd(); } +// ---------------------------------------------------------------------- +// Implement MemoryMappedFile + +class MemoryMappedFile::MemoryMappedFileImpl : public OSFile { + public: + MemoryMappedFileImpl() : OSFile(), data_(nullptr) {} + + ~MemoryMappedFileImpl() { + if (is_open_) { + munmap(data_, size_); + OSFile::Close(); + } + } + + Status Open(const std::string& path, FileMode::type mode) { + int prot_flags = PROT_READ; + + if (mode != FileMode::READ) { + prot_flags |= PROT_WRITE; + const bool append = true; + RETURN_NOT_OK(OSFile::OpenWriteable(path, append, mode == FileMode::WRITE)); + } else { + RETURN_NOT_OK(OSFile::OpenReadable(path)); + } + + void* result = mmap(nullptr, size_, prot_flags, MAP_SHARED, fd(), 0); + if (result == MAP_FAILED) { + std::stringstream ss; + ss << "Memory mapping file failed, errno: " << errno; + return Status::IOError(ss.str()); + } + data_ = reinterpret_cast(result); + position_ = 0; + + return Status::OK(); + } + + int64_t size() const { return size_; } + + Status Seek(int64_t position) { + if (position < 0 || position >= size_) { + return Status::Invalid("position is out of bounds"); + } + position_ = position; + return Status::OK(); + } + + int64_t position() { return position_; } + + void advance(int64_t nbytes) { position_ = std::min(size_, position_ + nbytes); } + + uint8_t* data() { return data_; } + + uint8_t* head() { return data_ + position_; } + + bool writable() { return mode_ != FileMode::READ; } + + bool opened() { return is_open_; } + + private: + int64_t position_; + + // The memory map + uint8_t* data_; +}; + +MemoryMappedFile::MemoryMappedFile(FileMode::type mode) { + ReadableFileInterface::set_mode(mode); +} + +MemoryMappedFile::~MemoryMappedFile() {} + +Status MemoryMappedFile::Open(const std::string& path, FileMode::type mode, + std::shared_ptr* out) { + std::shared_ptr result(new MemoryMappedFile(mode)); + + result->impl_.reset(new MemoryMappedFileImpl()); + RETURN_NOT_OK(result->impl_->Open(path, mode)); + + *out = result; 
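+  // A usage sketch for callers (the file name here is illustrative):
+  //
+  //   std::shared_ptr<MemoryMappedFile> mmap_file;
+  //   RETURN_NOT_OK(
+  //       MemoryMappedFile::Open("data.arrow", FileMode::READWRITE, &mmap_file));
+  //
+  // The returned object owns both the file descriptor and the mapping.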
+ return Status::OK(); +} + +Status MemoryMappedFile::GetSize(int64_t* size) { + *size = impl_->size(); + return Status::OK(); +} + +Status MemoryMappedFile::Tell(int64_t* position) { + *position = impl_->position(); + return Status::OK(); +} + +Status MemoryMappedFile::Seek(int64_t position) { + return impl_->Seek(position); +} + +Status MemoryMappedFile::Close() { + // munmap handled in pimpl dtor + return Status::OK(); +} + +Status MemoryMappedFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) { + nbytes = std::min(nbytes, impl_->size() - impl_->position()); + std::memcpy(out, impl_->head(), nbytes); + *bytes_read = nbytes; + impl_->advance(nbytes); + return Status::OK(); +} + +Status MemoryMappedFile::Read(int64_t nbytes, std::shared_ptr* out) { + nbytes = std::min(nbytes, impl_->size() - impl_->position()); + *out = std::make_shared(impl_->head(), nbytes); + impl_->advance(nbytes); + return Status::OK(); +} + +bool MemoryMappedFile::supports_zero_copy() const { + return true; +} + +Status MemoryMappedFile::WriteAt(int64_t position, const uint8_t* data, int64_t nbytes) { + if (!impl_->opened() || !impl_->writable()) { + return Status::IOError("Unable to write"); + } + + RETURN_NOT_OK(impl_->Seek(position)); + return WriteInternal(data, nbytes); +} + +Status MemoryMappedFile::Write(const uint8_t* data, int64_t nbytes) { + if (!impl_->opened() || !impl_->writable()) { + return Status::IOError("Unable to write"); + } + if (nbytes + impl_->position() > impl_->size()) { + return Status::Invalid("Cannot write past end of memory map"); + } + + return WriteInternal(data, nbytes); +} + +Status MemoryMappedFile::WriteInternal(const uint8_t* data, int64_t nbytes) { + memcpy(impl_->head(), data, nbytes); + impl_->advance(nbytes); + return Status::OK(); +} + } // namespace io } // namespace arrow diff --git a/cpp/src/arrow/io/file.h b/cpp/src/arrow/io/file.h index 10fe16e511210..9ca9c540e7c22 100644 --- a/cpp/src/arrow/io/file.h +++ b/cpp/src/arrow/io/file.h @@ -40,8 +40,13 @@ class ARROW_EXPORT FileOutputStream : public OutputStream { public: ~FileOutputStream(); + // When opening a new file, any existing file with the indicated path is + // truncated to 0 bytes, deleting any existing memory static Status Open(const std::string& path, std::shared_ptr* file); + static Status Open( + const std::string& path, bool append, std::shared_ptr* file); + // OutputStream interface Status Close() override; Status Tell(int64_t* position) override; @@ -88,6 +93,50 @@ class ARROW_EXPORT ReadableFile : public ReadableFileInterface { std::unique_ptr impl_; }; +// A file interface that uses memory-mapped files for memory interactions, +// supporting zero copy reads. The same class is used for both reading and +// writing. 
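+//
+// A zero-copy read returns a Buffer that aliases the mapping rather than
+// copying, e.g. (illustrative):
+//
+//   std::shared_ptr<Buffer> chunk;
+//   RETURN_NOT_OK(mapped_file->Read(64, &chunk));  // no memcpy into chunk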
+// +// If opening a file in a writeable mode, it is not truncated first as with +// FileOutputStream +class ARROW_EXPORT MemoryMappedFile : public ReadWriteFileInterface { + public: + ~MemoryMappedFile(); + + static Status Open(const std::string& path, FileMode::type mode, + std::shared_ptr* out); + + Status Close() override; + + Status Tell(int64_t* position) override; + + Status Seek(int64_t position) override; + + // Required by ReadableFileInterface, copies memory into out + Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) override; + + // Zero copy read + Status Read(int64_t nbytes, std::shared_ptr* out) override; + + bool supports_zero_copy() const override; + + Status Write(const uint8_t* data, int64_t nbytes) override; + + Status WriteAt(int64_t position, const uint8_t* data, int64_t nbytes) override; + + // @return: the size in bytes of the memory source + Status GetSize(int64_t* size) override; + + private: + explicit MemoryMappedFile(FileMode::type mode); + + Status WriteInternal(const uint8_t* data, int64_t nbytes); + + // Hide the internal details of this class for now + class ARROW_NO_EXPORT MemoryMappedFileImpl; + std::unique_ptr impl_; +}; + } // namespace io } // namespace arrow diff --git a/cpp/src/arrow/io/io-file-test.cc b/cpp/src/arrow/io/io-file-test.cc index f0ea7ec5e4dea..5f5d639fab0d8 100644 --- a/cpp/src/arrow/io/io-file-test.cc +++ b/cpp/src/arrow/io/io-file-test.cc @@ -71,7 +71,9 @@ class FileTestFixture : public ::testing::Test { class TestFileOutputStream : public FileTestFixture { public: - void OpenFile() { ASSERT_OK(FileOutputStream::Open(path_, &file_)); } + void OpenFile(bool append = false) { + ASSERT_OK(FileOutputStream::Open(path_, append, &file_)); + } protected: std::shared_ptr file_; @@ -131,6 +133,24 @@ TEST_F(TestFileOutputStream, Tell) { ASSERT_EQ(8, position); } +TEST_F(TestFileOutputStream, TruncatesNewFile) { + ASSERT_OK(FileOutputStream::Open(path_, &file_)); + + const char* data = "testdata"; + ASSERT_OK(file_->Write(reinterpret_cast(data), strlen(data))); + ASSERT_OK(file_->Close()); + + ASSERT_OK(FileOutputStream::Open(path_, &file_)); + ASSERT_OK(file_->Close()); + + std::shared_ptr rd_file; + ASSERT_OK(ReadableFile::Open(path_, &rd_file)); + + int64_t size; + ASSERT_OK(rd_file->GetSize(&size)); + ASSERT_EQ(0, size); +} + // ---------------------------------------------------------------------- // File input tests @@ -293,5 +313,99 @@ TEST_F(TestReadableFile, CustomMemoryPool) { ASSERT_EQ(2, pool.num_allocations()); } +// ---------------------------------------------------------------------- +// Memory map tests + +class TestMemoryMappedFile : public ::testing::Test, public MemoryMapFixture { + public: + void TearDown() { MemoryMapFixture::TearDown(); } +}; + +TEST_F(TestMemoryMappedFile, InvalidUsages) {} + +TEST_F(TestMemoryMappedFile, WriteRead) { + const int64_t buffer_size = 1024; + std::vector buffer(buffer_size); + + test::random_bytes(1024, 0, buffer.data()); + + const int reps = 5; + + std::string path = "ipc-write-read-test"; + CreateFile(path, reps * buffer_size); + + std::shared_ptr result; + ASSERT_OK(MemoryMappedFile::Open(path, FileMode::READWRITE, &result)); + + int64_t position = 0; + std::shared_ptr out_buffer; + for (int i = 0; i < reps; ++i) { + ASSERT_OK(result->Write(buffer.data(), buffer_size)); + ASSERT_OK(result->ReadAt(position, buffer_size, &out_buffer)); + + ASSERT_EQ(0, memcmp(out_buffer->data(), buffer.data(), buffer_size)); + + position += buffer_size; + } +} + +TEST_F(TestMemoryMappedFile, 
ReadOnly) { + const int64_t buffer_size = 1024; + std::vector buffer(buffer_size); + + test::random_bytes(1024, 0, buffer.data()); + + const int reps = 5; + + std::string path = "ipc-read-only-test"; + CreateFile(path, reps * buffer_size); + + std::shared_ptr rwmmap; + ASSERT_OK(MemoryMappedFile::Open(path, FileMode::READWRITE, &rwmmap)); + + int64_t position = 0; + for (int i = 0; i < reps; ++i) { + ASSERT_OK(rwmmap->Write(buffer.data(), buffer_size)); + position += buffer_size; + } + rwmmap->Close(); + + std::shared_ptr rommap; + ASSERT_OK(MemoryMappedFile::Open(path, FileMode::READ, &rommap)); + + position = 0; + std::shared_ptr out_buffer; + for (int i = 0; i < reps; ++i) { + ASSERT_OK(rommap->ReadAt(position, buffer_size, &out_buffer)); + + ASSERT_EQ(0, memcmp(out_buffer->data(), buffer.data(), buffer_size)); + position += buffer_size; + } + rommap->Close(); +} + +TEST_F(TestMemoryMappedFile, InvalidMode) { + const int64_t buffer_size = 1024; + std::vector buffer(buffer_size); + + test::random_bytes(1024, 0, buffer.data()); + + std::string path = "ipc-invalid-mode-test"; + CreateFile(path, buffer_size); + + std::shared_ptr rommap; + ASSERT_OK(MemoryMappedFile::Open(path, FileMode::READ, &rommap)); + + ASSERT_RAISES(IOError, rommap->Write(buffer.data(), buffer_size)); +} + +TEST_F(TestMemoryMappedFile, InvalidFile) { + std::string non_existent_path = "invalid-file-name-asfd"; + + std::shared_ptr result; + ASSERT_RAISES( + IOError, MemoryMappedFile::Open(non_existent_path, FileMode::READ, &result)); +} + } // namespace io } // namespace arrow diff --git a/cpp/src/arrow/io/io-memory-test.cc b/cpp/src/arrow/io/io-memory-test.cc index a49faf3bd8578..246310221e9aa 100644 --- a/cpp/src/arrow/io/io-memory-test.cc +++ b/cpp/src/arrow/io/io-memory-test.cc @@ -30,97 +30,6 @@ namespace arrow { namespace io { -class TestMemoryMappedFile : public ::testing::Test, public MemoryMapFixture { - public: - void TearDown() { MemoryMapFixture::TearDown(); } -}; - -TEST_F(TestMemoryMappedFile, InvalidUsages) {} - -TEST_F(TestMemoryMappedFile, WriteRead) { - const int64_t buffer_size = 1024; - std::vector buffer(buffer_size); - - test::random_bytes(1024, 0, buffer.data()); - - const int reps = 5; - - std::string path = "ipc-write-read-test"; - CreateFile(path, reps * buffer_size); - - std::shared_ptr result; - ASSERT_OK(MemoryMappedFile::Open(path, FileMode::READWRITE, &result)); - - int64_t position = 0; - std::shared_ptr out_buffer; - for (int i = 0; i < reps; ++i) { - ASSERT_OK(result->Write(buffer.data(), buffer_size)); - ASSERT_OK(result->ReadAt(position, buffer_size, &out_buffer)); - - ASSERT_EQ(0, memcmp(out_buffer->data(), buffer.data(), buffer_size)); - - position += buffer_size; - } -} - -TEST_F(TestMemoryMappedFile, ReadOnly) { - const int64_t buffer_size = 1024; - std::vector buffer(buffer_size); - - test::random_bytes(1024, 0, buffer.data()); - - const int reps = 5; - - std::string path = "ipc-read-only-test"; - CreateFile(path, reps * buffer_size); - - std::shared_ptr rwmmap; - ASSERT_OK(MemoryMappedFile::Open(path, FileMode::READWRITE, &rwmmap)); - - int64_t position = 0; - for (int i = 0; i < reps; ++i) { - ASSERT_OK(rwmmap->Write(buffer.data(), buffer_size)); - position += buffer_size; - } - rwmmap->Close(); - - std::shared_ptr rommap; - ASSERT_OK(MemoryMappedFile::Open(path, FileMode::READ, &rommap)); - - position = 0; - std::shared_ptr out_buffer; - for (int i = 0; i < reps; ++i) { - ASSERT_OK(rommap->ReadAt(position, buffer_size, &out_buffer)); - - ASSERT_EQ(0, memcmp(out_buffer->data(), 
buffer.data(), buffer_size)); - position += buffer_size; - } - rommap->Close(); -} - -TEST_F(TestMemoryMappedFile, InvalidMode) { - const int64_t buffer_size = 1024; - std::vector buffer(buffer_size); - - test::random_bytes(1024, 0, buffer.data()); - - std::string path = "ipc-invalid-mode-test"; - CreateFile(path, buffer_size); - - std::shared_ptr rommap; - ASSERT_OK(MemoryMappedFile::Open(path, FileMode::READ, &rommap)); - - ASSERT_RAISES(IOError, rommap->Write(buffer.data(), buffer_size)); -} - -TEST_F(TestMemoryMappedFile, InvalidFile) { - std::string non_existent_path = "invalid-file-name-asfd"; - - std::shared_ptr result; - ASSERT_RAISES( - IOError, MemoryMappedFile::Open(non_existent_path, FileMode::READ, &result)); -} - class TestBufferOutputStream : public ::testing::Test { public: void SetUp() { diff --git a/cpp/src/arrow/io/memory.cc b/cpp/src/arrow/io/memory.cc index b5cf4b77a980d..4595268372aa2 100644 --- a/cpp/src/arrow/io/memory.cc +++ b/cpp/src/arrow/io/memory.cc @@ -17,19 +17,6 @@ #include "arrow/io/memory.h" -// sys/mman.h not present in Visual Studio or Cygwin -#ifdef _WIN32 -#ifndef NOMINMAX -#define NOMINMAX -#endif -#include "arrow/io/mman.h" -#undef Realloc -#undef Free -#include -#else -#include -#endif - #include #include #include @@ -45,171 +32,6 @@ namespace arrow { namespace io { -// Implement MemoryMappedFile - -class MemoryMappedFile::MemoryMappedFileImpl { - public: - MemoryMappedFileImpl() - : file_(nullptr), is_open_(false), is_writable_(false), data_(nullptr) {} - - ~MemoryMappedFileImpl() { - if (is_open_) { - munmap(data_, size_); - fclose(file_); - } - } - - Status Open(const std::string& path, FileMode::type mode) { - if (is_open_) { return Status::IOError("A file is already open"); } - - int prot_flags = PROT_READ; - - if (mode == FileMode::READWRITE) { - file_ = fopen(path.c_str(), "r+b"); - prot_flags |= PROT_WRITE; - is_writable_ = true; - } else { - file_ = fopen(path.c_str(), "rb"); - } - if (file_ == nullptr) { - std::stringstream ss; - ss << "Unable to open file, errno: " << errno; - return Status::IOError(ss.str()); - } - - fseek(file_, 0L, SEEK_END); - if (ferror(file_)) { return Status::IOError("Unable to seek to end of file"); } - size_ = ftell(file_); - - fseek(file_, 0L, SEEK_SET); - is_open_ = true; - position_ = 0; - - void* result = mmap(nullptr, size_, prot_flags, MAP_SHARED, fileno(file_), 0); - if (result == MAP_FAILED) { - std::stringstream ss; - ss << "Memory mapping file failed, errno: " << errno; - return Status::IOError(ss.str()); - } - data_ = reinterpret_cast(result); - - return Status::OK(); - } - - int64_t size() const { return size_; } - - Status Seek(int64_t position) { - if (position < 0 || position >= size_) { - return Status::Invalid("position is out of bounds"); - } - position_ = position; - return Status::OK(); - } - - int64_t position() { return position_; } - - void advance(int64_t nbytes) { position_ = std::min(size_, position_ + nbytes); } - - uint8_t* data() { return data_; } - - uint8_t* head() { return data_ + position_; } - - bool writable() { return is_writable_; } - - bool opened() { return is_open_; } - - private: - FILE* file_; - int64_t position_; - int64_t size_; - bool is_open_; - bool is_writable_; - - // The memory map - uint8_t* data_; -}; - -MemoryMappedFile::MemoryMappedFile(FileMode::type mode) { - ReadableFileInterface::set_mode(mode); -} - -MemoryMappedFile::~MemoryMappedFile() {} - -Status MemoryMappedFile::Open(const std::string& path, FileMode::type mode, - std::shared_ptr* out) { - 
std::shared_ptr result(new MemoryMappedFile(mode)); - - result->impl_.reset(new MemoryMappedFileImpl()); - RETURN_NOT_OK(result->impl_->Open(path, mode)); - - *out = result; - return Status::OK(); -} - -Status MemoryMappedFile::GetSize(int64_t* size) { - *size = impl_->size(); - return Status::OK(); -} - -Status MemoryMappedFile::Tell(int64_t* position) { - *position = impl_->position(); - return Status::OK(); -} - -Status MemoryMappedFile::Seek(int64_t position) { - return impl_->Seek(position); -} - -Status MemoryMappedFile::Close() { - // munmap handled in pimpl dtor - return Status::OK(); -} - -Status MemoryMappedFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) { - nbytes = std::min(nbytes, impl_->size() - impl_->position()); - std::memcpy(out, impl_->head(), nbytes); - *bytes_read = nbytes; - impl_->advance(nbytes); - return Status::OK(); -} - -Status MemoryMappedFile::Read(int64_t nbytes, std::shared_ptr* out) { - nbytes = std::min(nbytes, impl_->size() - impl_->position()); - *out = std::make_shared(impl_->head(), nbytes); - impl_->advance(nbytes); - return Status::OK(); -} - -bool MemoryMappedFile::supports_zero_copy() const { - return true; -} - -Status MemoryMappedFile::WriteAt(int64_t position, const uint8_t* data, int64_t nbytes) { - if (!impl_->opened() || !impl_->writable()) { - return Status::IOError("Unable to write"); - } - - RETURN_NOT_OK(impl_->Seek(position)); - return WriteInternal(data, nbytes); -} - -Status MemoryMappedFile::Write(const uint8_t* data, int64_t nbytes) { - if (!impl_->opened() || !impl_->writable()) { - return Status::IOError("Unable to write"); - } - if (nbytes + impl_->position() > impl_->size()) { - return Status::Invalid("Cannot write past end of memory map"); - } - - return WriteInternal(data, nbytes); -} - -Status MemoryMappedFile::WriteInternal(const uint8_t* data, int64_t nbytes) { - memcpy(impl_->head(), data, nbytes); - impl_->advance(nbytes); - return Status::OK(); -} - // ---------------------------------------------------------------------- // OutputStream that writes to resizable buffer diff --git a/cpp/src/arrow/io/memory.h b/cpp/src/arrow/io/memory.h index 2faf2804bcbd0..2f1d8ec317578 100644 --- a/cpp/src/arrow/io/memory.h +++ b/cpp/src/arrow/io/memory.h @@ -58,45 +58,6 @@ class ARROW_EXPORT BufferOutputStream : public OutputStream { uint8_t* mutable_data_; }; -// A memory source that uses memory-mapped files for memory interactions -class ARROW_EXPORT MemoryMappedFile : public ReadWriteFileInterface { - public: - ~MemoryMappedFile(); - - static Status Open(const std::string& path, FileMode::type mode, - std::shared_ptr* out); - - Status Close() override; - - Status Tell(int64_t* position) override; - - Status Seek(int64_t position) override; - - // Required by ReadableFileInterface, copies memory into out - Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) override; - - // Zero copy read - Status Read(int64_t nbytes, std::shared_ptr* out) override; - - bool supports_zero_copy() const override; - - Status Write(const uint8_t* data, int64_t nbytes) override; - - Status WriteAt(int64_t position, const uint8_t* data, int64_t nbytes) override; - - // @return: the size in bytes of the memory source - Status GetSize(int64_t* size) override; - - private: - explicit MemoryMappedFile(FileMode::type mode); - - Status WriteInternal(const uint8_t* data, int64_t nbytes); - - // Hide the internal details of this class for now - class ARROW_NO_EXPORT MemoryMappedFileImpl; - std::unique_ptr impl_; -}; - class ARROW_EXPORT 
BufferReader : public ReadableFileInterface { public: explicit BufferReader(const std::shared_ptr& buffer); diff --git a/cpp/src/arrow/io/test-common.h b/cpp/src/arrow/io/test-common.h index 146808371d307..6e917135db274 100644 --- a/cpp/src/arrow/io/test-common.h +++ b/cpp/src/arrow/io/test-common.h @@ -33,6 +33,7 @@ #endif #include "arrow/buffer.h" +#include "arrow/io/file.h" #include "arrow/io/memory.h" #include "arrow/memory_pool.h" #include "arrow/test-util.h" From 26140dca893296d84cea3b76c97c62fbc4052e3f Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 3 Jan 2017 08:31:37 +0100 Subject: [PATCH 0257/1644] ARROW-387: [C++] Verify zero-copy Buffer slices from BufferReader retain reference to parent Buffer This is stacked on top of the patch for ARROW-294, will rebase. Author: Wes McKinney Closes #266 from wesm/ARROW-387 and squashes the following commits: 061ef8b [Wes McKinney] Verify BufferReader passes on ownership of parent buffer to zero-copy slices 42a83a4 [Wes McKinney] Remove duplicated includes 3928ab0 [Wes McKinney] Base MemoryMappedFile implementation on common OSFile interface. Add test case for ARROW-340. --- cpp/src/arrow/io/interfaces.cc | 5 +++++ cpp/src/arrow/io/interfaces.h | 5 ++++- cpp/src/arrow/io/io-memory-test.cc | 23 ++++++++++++++++++++++- 3 files changed, 31 insertions(+), 2 deletions(-) diff --git a/cpp/src/arrow/io/interfaces.cc b/cpp/src/arrow/io/interfaces.cc index 68c1ac30f8250..23bef2853b206 100644 --- a/cpp/src/arrow/io/interfaces.cc +++ b/cpp/src/arrow/io/interfaces.cc @@ -44,5 +44,10 @@ Status ReadableFileInterface::ReadAt( return Read(nbytes, out); } +Status Writeable::Write(const std::string& data) { + return Write(reinterpret_cast(data.c_str()), + static_cast(data.size())); +} + } // namespace io } // namespace arrow diff --git a/cpp/src/arrow/io/interfaces.h b/cpp/src/arrow/io/interfaces.h index db0c059c6e286..8fe2849287064 100644 --- a/cpp/src/arrow/io/interfaces.h +++ b/cpp/src/arrow/io/interfaces.h @@ -20,6 +20,7 @@ #include #include +#include #include "arrow/util/macros.h" #include "arrow/util/visibility.h" @@ -67,9 +68,11 @@ class Seekable { virtual Status Seek(int64_t position) = 0; }; -class Writeable { +class ARROW_EXPORT Writeable { public: virtual Status Write(const uint8_t* data, int64_t nbytes) = 0; + + Status Write(const std::string& data); }; class Readable { diff --git a/cpp/src/arrow/io/io-memory-test.cc b/cpp/src/arrow/io/io-memory-test.cc index 246310221e9aa..95d788c03c97e 100644 --- a/cpp/src/arrow/io/io-memory-test.cc +++ b/cpp/src/arrow/io/io-memory-test.cc @@ -48,12 +48,33 @@ TEST_F(TestBufferOutputStream, CloseResizes) { const int64_t nbytes = static_cast(data.size()); const int K = 100; for (int i = 0; i < K; ++i) { - EXPECT_OK(stream_->Write(reinterpret_cast(data.c_str()), nbytes)); + EXPECT_OK(stream_->Write(data)); } ASSERT_OK(stream_->Close()); ASSERT_EQ(K * nbytes, buffer_->size()); } +TEST(TestBufferReader, RetainParentReference) { + // ARROW-387 + std::string data = "data123456"; + + std::shared_ptr slice1; + std::shared_ptr slice2; + { + auto buffer = std::make_shared(); + ASSERT_OK(buffer->Resize(static_cast(data.size()))); + std::memcpy(buffer->mutable_data(), data.c_str(), data.size()); + BufferReader reader(buffer); + ASSERT_OK(reader.Read(4, &slice1)); + ASSERT_OK(reader.Read(6, &slice2)); + } + + ASSERT_TRUE(slice1->parent() != nullptr); + + ASSERT_EQ(0, std::memcmp(slice1->data(), data.c_str(), 4)); + ASSERT_EQ(0, std::memcmp(slice2->data(), data.c_str() + 4, 6)); +} + } // namespace io } // namespace 
arrow From fdbc57941fd3615c71b3a61b409b63eb6a48a817 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Tue, 3 Jan 2017 07:23:17 -0500 Subject: [PATCH 0258/1644] ARROW-417: Add Equals implementation to compare ChunkedArrays Author: Uwe L. Korn Closes #259 from xhochy/ARROW-417 and squashes the following commits: ffc076a [Uwe L. Korn] Add interface for non-shared_ptr-based Equals 3686d6c [Uwe L. Korn] ARROW-415: C++: Add Equals implementation to compare Tables 54cbf54 [Uwe L. Korn] ARROW-416: C++: Add Equals implementation to compare Columns 21e73a0 [Uwe L. Korn] Make signed comparison explicit 8563cb2 [Uwe L. Korn] ARROW-417: Add Equals implementation to compare ChunkedArrays --- cpp/src/arrow/column-test.cc | 121 +++++++++++++++++++++++++++++++++-- cpp/src/arrow/column.cc | 51 +++++++++++++++ cpp/src/arrow/column.h | 7 ++ cpp/src/arrow/table-test.cc | 44 +++++++++---- cpp/src/arrow/table.cc | 17 +++++ cpp/src/arrow/table.h | 3 + cpp/src/arrow/test-util.h | 2 +- 7 files changed, 228 insertions(+), 17 deletions(-) diff --git a/cpp/src/arrow/column-test.cc b/cpp/src/arrow/column-test.cc index 9005245b20419..1e722ed7de0d6 100644 --- a/cpp/src/arrow/column-test.cc +++ b/cpp/src/arrow/column-test.cc @@ -33,12 +33,92 @@ using std::vector; namespace arrow { -const auto INT32 = std::make_shared(); +class TestChunkedArray : public TestBase { + protected: + virtual void Construct() { + one_ = std::make_shared(arrays_one_); + another_ = std::make_shared(arrays_another_); + } + + ArrayVector arrays_one_; + ArrayVector arrays_another_; + + std::shared_ptr one_; + std::shared_ptr another_; +}; + +TEST_F(TestChunkedArray, BasicEquals) { + std::vector null_bitmap(100, true); + std::vector data(100, 1); + std::shared_ptr array; + ArrayFromVector(int32(), null_bitmap, data, &array); + arrays_one_.push_back(array); + arrays_another_.push_back(array); + + Construct(); + ASSERT_TRUE(one_->Equals(one_)); + ASSERT_FALSE(one_->Equals(nullptr)); + ASSERT_TRUE(one_->Equals(another_)); + ASSERT_TRUE(one_->Equals(*another_.get())); +} + +TEST_F(TestChunkedArray, EqualsDifferingTypes) { + std::vector null_bitmap(100, true); + std::vector data32(100, 1); + std::vector data64(100, 1); + std::shared_ptr array; + ArrayFromVector(int32(), null_bitmap, data32, &array); + arrays_one_.push_back(array); + ArrayFromVector(int64(), null_bitmap, data64, &array); + arrays_another_.push_back(array); + + Construct(); + ASSERT_FALSE(one_->Equals(another_)); + ASSERT_FALSE(one_->Equals(*another_.get())); +} + +TEST_F(TestChunkedArray, EqualsDifferingLengths) { + std::vector null_bitmap100(100, true); + std::vector null_bitmap101(101, true); + std::vector data100(100, 1); + std::vector data101(101, 1); + std::shared_ptr array; + ArrayFromVector(int32(), null_bitmap100, data100, &array); + arrays_one_.push_back(array); + ArrayFromVector(int32(), null_bitmap101, data101, &array); + arrays_another_.push_back(array); + + Construct(); + ASSERT_FALSE(one_->Equals(another_)); + ASSERT_FALSE(one_->Equals(*another_.get())); + + std::vector null_bitmap1(1, true); + std::vector data1(1, 1); + ArrayFromVector(int32(), null_bitmap1, data1, &array); + arrays_one_.push_back(array); -class TestColumn : public TestBase { + Construct(); + ASSERT_TRUE(one_->Equals(another_)); + ASSERT_TRUE(one_->Equals(*another_.get())); +} + +class TestColumn : public TestChunkedArray { protected: + void Construct() override { + TestChunkedArray::Construct(); + + one_col_ = std::make_shared(one_field_, one_); + another_col_ = std::make_shared(another_field_, another_); + 
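+    // Rebuilding the columns on each call lets the tests below adjust the
+    // fields or chunk vectors and then invoke Construct() again.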
} + std::shared_ptr data_; std::unique_ptr column_; + + std::shared_ptr one_field_; + std::shared_ptr another_field_; + + std::shared_ptr one_col_; + std::shared_ptr another_col_; }; TEST_F(TestColumn, BasicAPI) { @@ -47,11 +127,11 @@ TEST_F(TestColumn, BasicAPI) { arrays.push_back(MakePrimitive(100, 10)); arrays.push_back(MakePrimitive(100, 20)); - auto field = std::make_shared("c0", INT32); + auto field = std::make_shared("c0", int32()); column_.reset(new Column(field, arrays)); ASSERT_EQ("c0", column_->name()); - ASSERT_TRUE(column_->type()->Equals(INT32)); + ASSERT_TRUE(column_->type()->Equals(int32())); ASSERT_EQ(300, column_->length()); ASSERT_EQ(30, column_->null_count()); ASSERT_EQ(3, column_->data()->num_chunks()); @@ -62,7 +142,7 @@ TEST_F(TestColumn, ChunksInhomogeneous) { arrays.push_back(MakePrimitive(100)); arrays.push_back(MakePrimitive(100, 10)); - auto field = std::make_shared("c0", INT32); + auto field = std::make_shared("c0", int32()); column_.reset(new Column(field, arrays)); ASSERT_OK(column_->ValidateData()); @@ -72,4 +152,35 @@ TEST_F(TestColumn, ChunksInhomogeneous) { ASSERT_RAISES(Invalid, column_->ValidateData()); } +TEST_F(TestColumn, Equals) { + std::vector null_bitmap(100, true); + std::vector data(100, 1); + std::shared_ptr array; + ArrayFromVector(int32(), null_bitmap, data, &array); + arrays_one_.push_back(array); + arrays_another_.push_back(array); + + one_field_ = std::make_shared("column", int32()); + another_field_ = std::make_shared("column", int32()); + + Construct(); + ASSERT_TRUE(one_col_->Equals(one_col_)); + ASSERT_FALSE(one_col_->Equals(nullptr)); + ASSERT_TRUE(one_col_->Equals(another_col_)); + ASSERT_TRUE(one_col_->Equals(*another_col_.get())); + + // Field is different + another_field_ = std::make_shared("two", int32()); + Construct(); + ASSERT_FALSE(one_col_->Equals(another_col_)); + ASSERT_FALSE(one_col_->Equals(*another_col_.get())); + + // ChunkedArray is different + another_field_ = std::make_shared("column", int32()); + arrays_another_.push_back(array); + Construct(); + ASSERT_FALSE(one_col_->Equals(another_col_)); + ASSERT_FALSE(one_col_->Equals(*another_col_.get())); +} + } // namespace arrow diff --git a/cpp/src/arrow/column.cc b/cpp/src/arrow/column.cc index 1d136e7d95a55..3e899563e2cbe 100644 --- a/cpp/src/arrow/column.cc +++ b/cpp/src/arrow/column.cc @@ -35,6 +35,45 @@ ChunkedArray::ChunkedArray(const ArrayVector& chunks) : chunks_(chunks) { } } +bool ChunkedArray::Equals(const ChunkedArray& other) const { + if (length_ != other.length()) { return false; } + if (null_count_ != other.null_count()) { return false; } + + // Check contents of the underlying arrays. This checks for equality of + // the underlying data independently of the chunk size. + int this_chunk_idx = 0; + int32_t this_start_idx = 0; + int other_chunk_idx = 0; + int32_t other_start_idx = 0; + while (this_chunk_idx < static_cast(chunks_.size())) { + const std::shared_ptr this_array = chunks_[this_chunk_idx]; + const std::shared_ptr other_array = other.chunk(other_chunk_idx); + int32_t common_length = std::min( + this_array->length() - this_start_idx, other_array->length() - other_start_idx); + if (!this_array->RangeEquals(this_start_idx, this_start_idx + common_length, + other_start_idx, other_array)) { + return false; + } + + // If we have exhausted the current chunk, proceed to the next one individually. 
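+    // E.g. the values {1, 1, 1} chunked as [3] compare equal to the same
+    // values chunked as [2, 1]; only the common value ranges are examined.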
+ if (this_start_idx + common_length == this_array->length()) { + this_chunk_idx++; + this_start_idx = 0; + } + if (other_start_idx + common_length == other_array->length()) { + other_chunk_idx++; + other_start_idx = 0; + } + } + return true; +} + +bool ChunkedArray::Equals(const std::shared_ptr& other) const { + if (this == other.get()) { return true; } + if (!other) { return false; } + return Equals(*other.get()); +} + Column::Column(const std::shared_ptr& field, const ArrayVector& chunks) : field_(field) { data_ = std::make_shared(chunks); @@ -49,6 +88,18 @@ Column::Column( const std::shared_ptr& field, const std::shared_ptr& data) : field_(field), data_(data) {} +bool Column::Equals(const Column& other) const { + if (!field_->Equals(other.field())) { return false; } + return data_->Equals(other.data()); +} + +bool Column::Equals(const std::shared_ptr& other) const { + if (this == other.get()) { return true; } + if (!other) { return false; } + + return Equals(*other.get()); +} + Status Column::ValidateData() { for (int i = 0; i < data_->num_chunks(); ++i) { std::shared_ptr type = data_->chunk(i)->type(); diff --git a/cpp/src/arrow/column.h b/cpp/src/arrow/column.h index 1caafec9db95c..f71647381743c 100644 --- a/cpp/src/arrow/column.h +++ b/cpp/src/arrow/column.h @@ -48,6 +48,9 @@ class ARROW_EXPORT ChunkedArray { std::shared_ptr chunk(int i) const { return chunks_[i]; } + bool Equals(const ChunkedArray& other) const; + bool Equals(const std::shared_ptr& other) const; + protected: ArrayVector chunks_; int64_t length_; @@ -78,6 +81,10 @@ class ARROW_EXPORT Column { // @returns: the column's data as a chunked logical array std::shared_ptr data() const { return data_; } + + bool Equals(const Column& other) const; + bool Equals(const std::shared_ptr& other) const; + // Verify that the column's array data is consistent with the passed field's // metadata Status ValidateData(); diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc index f62336d07f09a..734b94125defc 100644 --- a/cpp/src/arrow/table-test.cc +++ b/cpp/src/arrow/table-test.cc @@ -34,16 +34,12 @@ using std::vector; namespace arrow { -const auto INT16 = std::make_shared(); -const auto UINT8 = std::make_shared(); -const auto INT32 = std::make_shared(); - class TestTable : public TestBase { public: void MakeExample1(int length) { - auto f0 = std::make_shared("f0", INT32); - auto f1 = std::make_shared("f1", UINT8); - auto f2 = std::make_shared("f2", INT16); + auto f0 = std::make_shared("f0", int32()); + auto f1 = std::make_shared("f1", uint8()); + auto f2 = std::make_shared("f2", int16()); vector> fields = {f0, f1, f2}; schema_ = std::make_shared(fields); @@ -55,7 +51,7 @@ class TestTable : public TestBase { } protected: - std::unique_ptr
<Table> table_;
+  std::shared_ptr<Table> table_;
   shared_ptr<Schema> schema_;
   vector<std::shared_ptr<Column>> columns_;
 };

@@ -123,14 +119,40 @@ TEST_F(TestTable, InvalidColumns) {
   ASSERT_RAISES(Invalid, table_->ValidateColumns());
 }

+TEST_F(TestTable, Equals) {
+  int length = 100;
+  MakeExample1(length);
+
+  std::string name = "data";
+  table_.reset(new Table(name, schema_, columns_));
+
+  ASSERT_TRUE(table_->Equals(table_));
+  ASSERT_FALSE(table_->Equals(nullptr));
+  // Differing name
+  ASSERT_FALSE(table_->Equals(std::make_shared<Table>("other_name", schema_, columns_)));
+  // Differing schema
+  auto f0 = std::make_shared<Field>("f3", int32());
+  auto f1 = std::make_shared<Field>("f4", uint8());
+  auto f2 = std::make_shared<Field>("f5", int16());
+  vector<shared_ptr<Field>> fields = {f0, f1, f2};
+  auto other_schema = std::make_shared<Schema>(fields);
+  ASSERT_FALSE(table_->Equals(std::make_shared<Table>(name, other_schema, columns_)));
+  // Differing columns
+  std::vector<std::shared_ptr<Column>> other_columns = {
+      std::make_shared<Column>(schema_->field(0), MakePrimitive<Int32Array>(length, 10)),
+      std::make_shared<Column>(schema_->field(1), MakePrimitive<UInt8Array>(length, 10)),
+      std::make_shared<Column>(schema_->field(2), MakePrimitive<Int16Array>(length, 10))};
+  ASSERT_FALSE(table_->Equals(std::make_shared<Table>(name, schema_, other_columns)));
+}
+
 class TestRecordBatch : public TestBase {};

 TEST_F(TestRecordBatch, Equals) {
   const int length = 10;

-  auto f0 = std::make_shared<Field>("f0", INT32);
-  auto f1 = std::make_shared<Field>("f1", UINT8);
-  auto f2 = std::make_shared<Field>("f2", INT16);
+  auto f0 = std::make_shared<Field>("f0", int32());
+  auto f1 = std::make_shared<Field>("f1", uint8());
+  auto f2 = std::make_shared<Field>("f2", int16());
   vector<shared_ptr<Field>> fields = {f0, f1, f2};
   auto schema = std::make_shared<Schema>(fields);

diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc
index 855d4ec04085d..45f672ec8928b 100644
--- a/cpp/src/arrow/table.cc
+++ b/cpp/src/arrow/table.cc
@@ -77,6 +77,23 @@ Table::Table(const std::string& name, const std::shared_ptr<Schema>& schema,
     const std::vector<std::shared_ptr<Column>>& columns, int64_t num_rows)
     : name_(name), schema_(schema), columns_(columns), num_rows_(num_rows) {}

+bool Table::Equals(const Table& other) const {
+  if (name_ != other.name()) { return false; }
+  if (!schema_->Equals(other.schema())) { return false; }
+  if (static_cast<int>(columns_.size()) != other.num_columns()) { return false; }
+
+  for (size_t i = 0; i < columns_.size(); i++) {
+    if (!columns_[i]->Equals(other.column(i))) { return false; }
+  }
+  return true;
+}
+
+bool Table::Equals(const std::shared_ptr<Table>& other) const {
+  if (this == other.get()) { return true; }
+  if (!other) { return false; }
+  return Equals(*other.get());
+}
+
 Status Table::ValidateColumns() const {
   if (num_columns() != schema_->num_fields()) {
     return Status::Invalid("Number of columns did not match schema");
diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h
index bf5c39f11e411..0f2418d0e7900 100644
--- a/cpp/src/arrow/table.h
+++ b/cpp/src/arrow/table.h
@@ -100,6 +100,9 @@ class ARROW_EXPORT Table {
   // @returns: the number of rows (the corresponding length of each column)
   int64_t num_rows() const { return num_rows_; }

+  bool Equals(const Table& other) const;
+  bool Equals(const std::shared_ptr<Table>
& other) const; + // After construction, perform any checks to validate the input arguments Status ValidateColumns() const; diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index ce9327d9009e2..70e933365cfdf 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -81,7 +81,7 @@ class TestBase : public ::testing::Test { auto null_bitmap = std::make_shared(pool_); EXPECT_OK(data->Resize(length * sizeof(typename ArrayType::value_type))); EXPECT_OK(null_bitmap->Resize(BitUtil::BytesForBits(length))); - return std::make_shared(length, data, 10, null_bitmap); + return std::make_shared(length, data, null_count, null_bitmap); } protected: From 9513ca7741bc036ff369cbbd3b3ee3f4bcc06722 Mon Sep 17 00:00:00 2001 From: Li Jin Date: Thu, 5 Jan 2017 11:00:32 -0500 Subject: [PATCH 0259/1644] ARROW-411: [Java] Move compactor functions in Integration to a separate Validator module Author: Li Jin Closes #267 from icexelloss/validator and squashes the following commits: b4e86c5 [Li Jin] ARROW-411: Move compator functions in Integration to a separate Validator moduleO --- .../org/apache/arrow/tools/Integration.java | 94 +------------ .../apache/arrow/tools/TestIntegration.java | 32 ----- .../apache/arrow/vector/util/Validator.java | 125 ++++++++++++++++++ .../arrow/vector/util/TestValidator.java | 57 ++++++++ 4 files changed, 185 insertions(+), 123 deletions(-) create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/util/Validator.java create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/util/TestValidator.java diff --git a/java/tools/src/main/java/org/apache/arrow/tools/Integration.java b/java/tools/src/main/java/org/apache/arrow/tools/Integration.java index fd835a63a11ac..36d4ee5485470 100644 --- a/java/tools/src/main/java/org/apache/arrow/tools/Integration.java +++ b/java/tools/src/main/java/org/apache/arrow/tools/Integration.java @@ -28,7 +28,6 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; -import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.VectorLoader; import org.apache.arrow.vector.VectorSchemaRoot; import org.apache.arrow.vector.VectorUnloader; @@ -39,10 +38,8 @@ import org.apache.arrow.vector.file.json.JsonFileReader; import org.apache.arrow.vector.file.json.JsonFileWriter; import org.apache.arrow.vector.schema.ArrowRecordBatch; -import org.apache.arrow.vector.types.pojo.ArrowType; -import org.apache.arrow.vector.types.pojo.ArrowType.FloatingPoint; -import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.types.pojo.Schema; +import org.apache.arrow.vector.util.Validator; import org.apache.commons.cli.CommandLine; import org.apache.commons.cli.CommandLineParser; import org.apache.commons.cli.Options; @@ -51,8 +48,6 @@ import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import com.google.common.base.Objects; - public class Integration { private static final Logger LOGGER = LoggerFactory.getLogger(Integration.class); @@ -143,7 +138,7 @@ public void execute(File arrowFile, File jsonFile) throws IOException { LOGGER.debug("ARROW schema: " + arrowSchema); LOGGER.debug("JSON Input file size: " + jsonFile.length()); LOGGER.debug("JSON schema: " + jsonSchema); - compareSchemas(jsonSchema, arrowSchema); + Validator.compareSchemas(jsonSchema, arrowSchema); List recordBatches = footer.getRecordBatches(); Iterator iterator = recordBatches.iterator(); @@ -154,8 +149,7 @@ public void execute(File arrowFile, File jsonFile) throws IOException { 
VectorSchemaRoot arrowRoot = new VectorSchemaRoot(arrowSchema, allocator);) { VectorLoader vectorLoader = new VectorLoader(arrowRoot); vectorLoader.load(inRecordBatch); - // TODO: compare - compare(arrowRoot, jsonRoot); + Validator.compareVectorSchemaRoot(arrowRoot, jsonRoot); } jsonRoot.close(); } @@ -227,86 +221,4 @@ private static void fatalError(String message, Throwable e) { System.exit(1); } - - private static void compare(VectorSchemaRoot arrowRoot, VectorSchemaRoot jsonRoot) { - compareSchemas(jsonRoot.getSchema(), arrowRoot.getSchema()); - if (arrowRoot.getRowCount() != jsonRoot.getRowCount()) { - throw new IllegalArgumentException("Different row count:\n" + arrowRoot.getRowCount() + "\n" + jsonRoot.getRowCount()); - } - List arrowVectors = arrowRoot.getFieldVectors(); - List jsonVectors = jsonRoot.getFieldVectors(); - if (arrowVectors.size() != jsonVectors.size()) { - throw new IllegalArgumentException("Different column count:\n" + arrowVectors.size() + "\n" + jsonVectors.size()); - } - for (int i = 0; i < arrowVectors.size(); i++) { - Field field = arrowRoot.getSchema().getFields().get(i); - FieldVector arrowVector = arrowVectors.get(i); - FieldVector jsonVector = jsonVectors.get(i); - int valueCount = arrowVector.getAccessor().getValueCount(); - if (valueCount != jsonVector.getAccessor().getValueCount()) { - throw new IllegalArgumentException("Different value count for field " + field + " : " + valueCount + " != " + jsonVector.getAccessor().getValueCount()); - } - for (int j = 0; j < valueCount; j++) { - Object arrow = arrowVector.getAccessor().getObject(j); - Object json = jsonVector.getAccessor().getObject(j); - if (!equals(field.getType(), arrow, json)) { - throw new IllegalArgumentException( - "Different values in column:\n" + field + " at index " + j + ": " + arrow + " != " + json); - } - } - } - } - - private static boolean equals(ArrowType type, final Object arrow, final Object json) { - if (type instanceof ArrowType.FloatingPoint) { - FloatingPoint fpType = (FloatingPoint) type; - switch (fpType.getPrecision()) { - case DOUBLE: - return equalEnough((Double)arrow, (Double)json); - case SINGLE: - return equalEnough((Float)arrow, (Float)json); - case HALF: - default: - throw new UnsupportedOperationException("unsupported precision: " + fpType); - } - } - return Objects.equal(arrow, json); - } - - static boolean equalEnough(Float f1, Float f2) { - if (f1 == null || f2 == null) { - return f1 == null && f2 == null; - } - if (f1.isNaN()) { - return f2.isNaN(); - } - if (f1.isInfinite()) { - return f2.isInfinite() && Math.signum(f1) == Math.signum(f2); - } - float average = Math.abs((f1 + f2) / 2); - float differenceScaled = Math.abs(f1 - f2) / (average == 0.0f ? 1f : average); - return differenceScaled < 1.0E-6f; - } - - static boolean equalEnough(Double f1, Double f2) { - if (f1 == null || f2 == null) { - return f1 == null && f2 == null; - } - if (f1.isNaN()) { - return f2.isNaN(); - } - if (f1.isInfinite()) { - return f2.isInfinite() && Math.signum(f1) == Math.signum(f2); - } - double average = Math.abs((f1 + f2) / 2); - double differenceScaled = Math.abs(f1 - f2) / (average == 0.0d ? 
1d : average); - return differenceScaled < 1.0E-12d; - } - - - private static void compareSchemas(Schema jsonSchema, Schema arrowSchema) { - if (!arrowSchema.equals(jsonSchema)) { - throw new IllegalArgumentException("Different schemas:\n" + arrowSchema + "\n" + jsonSchema); - } - } } diff --git a/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java b/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java index ee6196b74e0dc..0ae32bebe0b30 100644 --- a/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java +++ b/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java @@ -22,9 +22,7 @@ import static org.apache.arrow.tools.ArrowFileTestFixtures.write; import static org.apache.arrow.tools.ArrowFileTestFixtures.writeData; import static org.apache.arrow.tools.ArrowFileTestFixtures.writeInput; -import static org.apache.arrow.tools.Integration.equalEnough; import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertFalse; import static org.junit.Assert.assertTrue; import static org.junit.Assert.fail; @@ -238,34 +236,4 @@ static void writeInput2(File testInFile, BufferAllocator allocator) throws FileN write(parent.getChild("root"), testInFile); } } - - @Test - public void testFloatComp() { - assertTrue(equalEnough(912.4140000000002F, 912.414F)); - assertTrue(equalEnough(912.4140000000002D, 912.414D)); - assertTrue(equalEnough(912.414F, 912.4140000000002F)); - assertTrue(equalEnough(912.414D, 912.4140000000002D)); - assertFalse(equalEnough(912.414D, 912.4140001D)); - assertFalse(equalEnough(null, 912.414D)); - assertTrue(equalEnough((Float)null, null)); - assertTrue(equalEnough((Double)null, null)); - assertFalse(equalEnough(912.414D, null)); - assertFalse(equalEnough(Double.MAX_VALUE, Double.MIN_VALUE)); - assertFalse(equalEnough(Double.MIN_VALUE, Double.MAX_VALUE)); - assertTrue(equalEnough(Double.MAX_VALUE, Double.MAX_VALUE)); - assertTrue(equalEnough(Double.MIN_VALUE, Double.MIN_VALUE)); - assertTrue(equalEnough(Double.NEGATIVE_INFINITY, Double.NEGATIVE_INFINITY)); - assertFalse(equalEnough(Double.NEGATIVE_INFINITY, Double.POSITIVE_INFINITY)); - assertTrue(equalEnough(Double.NaN, Double.NaN)); - assertFalse(equalEnough(1.0, Double.NaN)); - assertFalse(equalEnough(Float.MAX_VALUE, Float.MIN_VALUE)); - assertFalse(equalEnough(Float.MIN_VALUE, Float.MAX_VALUE)); - assertTrue(equalEnough(Float.MAX_VALUE, Float.MAX_VALUE)); - assertTrue(equalEnough(Float.MIN_VALUE, Float.MIN_VALUE)); - assertTrue(equalEnough(Float.NEGATIVE_INFINITY, Float.NEGATIVE_INFINITY)); - assertFalse(equalEnough(Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY)); - assertTrue(equalEnough(Float.NaN, Float.NaN)); - assertFalse(equalEnough(1.0F, Float.NaN)); - } - } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/Validator.java b/java/vector/src/main/java/org/apache/arrow/vector/util/Validator.java new file mode 100644 index 0000000000000..a97458254151d --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/Validator.java @@ -0,0 +1,125 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.util; + +import java.util.List; + +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.Schema; + +import com.google.common.base.Objects; + +/** + * Utility class for validating Arrow data structures. + */ +public class Validator { + + /** + * Validates that two Arrow schemas are equal. + * + * @throws IllegalArgumentException if they are different. + */ + public static void compareSchemas(Schema schema1, Schema schema2) { + if (!schema2.equals(schema1)) { + throw new IllegalArgumentException("Different schemas:\n" + schema2 + "\n" + schema1); + } + } + + /** + * Validates that two Arrow VectorSchemaRoots are equal. + * + * @throws IllegalArgumentException if they are different. + */ + public static void compareVectorSchemaRoot(VectorSchemaRoot root1, VectorSchemaRoot root2) { + compareSchemas(root2.getSchema(), root1.getSchema()); + if (root1.getRowCount() != root2.getRowCount()) { + throw new IllegalArgumentException("Different row count:\n" + root1.getRowCount() + "\n" + root2.getRowCount()); + } + List<FieldVector> arrowVectors = root1.getFieldVectors(); + List<FieldVector> jsonVectors = root2.getFieldVectors(); + if (arrowVectors.size() != jsonVectors.size()) { + throw new IllegalArgumentException("Different column count:\n" + arrowVectors.size() + "\n" + jsonVectors.size()); + } + for (int i = 0; i < arrowVectors.size(); i++) { + Field field = root1.getSchema().getFields().get(i); + FieldVector arrowVector = arrowVectors.get(i); + FieldVector jsonVector = jsonVectors.get(i); + int valueCount = arrowVector.getAccessor().getValueCount(); + if (valueCount != jsonVector.getAccessor().getValueCount()) { + throw new IllegalArgumentException("Different value count for field " + field + " : " + valueCount + " != " + jsonVector.getAccessor().getValueCount()); + } + for (int j = 0; j < valueCount; j++) { + Object arrow = arrowVector.getAccessor().getObject(j); + Object json = jsonVector.getAccessor().getObject(j); + if (!equals(field.getType(), arrow, json)) { + throw new IllegalArgumentException( + "Different values in column:\n" + field + " at index " + j + ": " + arrow + " != " + json); + } + } + } + } + + static boolean equals(ArrowType type, final Object o1, final Object o2) { + if (type instanceof ArrowType.FloatingPoint) { + ArrowType.FloatingPoint fpType = (ArrowType.FloatingPoint) type; + switch (fpType.getPrecision()) { + case DOUBLE: + return equalEnough((Double)o1, (Double)o2); + case SINGLE: + return equalEnough((Float)o1, (Float)o2); + case HALF: + default: + throw new UnsupportedOperationException("unsupported precision: " + fpType); + } + } + return Objects.equal(o1, o2); + } + + static boolean equalEnough(Float f1, Float f2) { + if (f1 == null || f2 == null) { + return f1 == null && f2 == null; + } + if (f1.isNaN()) { + return f2.isNaN(); + } + if (f1.isInfinite()) { + return f2.isInfinite() && Math.signum(f1) == Math.signum(f2); + } + float average = Math.abs((f1 + f2) / 2); + float differenceScaled = Math.abs(f1 - f2) / (average == 0.0f ? 1f : average); + return differenceScaled < 1.0E-6f; + } + + static boolean equalEnough(Double f1, Double f2) { + if (f1 == null || f2 == null) { + return f1 == null && f2 == null; + } + if (f1.isNaN()) { + return f2.isNaN(); + } + if (f1.isInfinite()) { + return f2.isInfinite() && Math.signum(f1) == Math.signum(f2); + } + double average = Math.abs((f1 + f2) / 2); + double differenceScaled = Math.abs(f1 - f2) / (average == 0.0d ? 1d : average); + return differenceScaled < 1.0E-12d; + } +}
diff --git a/java/vector/src/test/java/org/apache/arrow/vector/util/TestValidator.java b/java/vector/src/test/java/org/apache/arrow/vector/util/TestValidator.java new file mode 100644 index 0000000000000..7cf638e57d849 --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/util/TestValidator.java @@ -0,0 +1,57 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ +package org.apache.arrow.vector.util; + +import static org.apache.arrow.vector.util.Validator.equalEnough; +import static org.junit.Assert.assertFalse; +import static org.junit.Assert.assertTrue; + +import org.junit.Test; + +public class TestValidator { + + @Test + public void testFloatComp() { + assertTrue(equalEnough(912.4140000000002F, 912.414F)); + assertTrue(equalEnough(912.4140000000002D, 912.414D)); + assertTrue(equalEnough(912.414F, 912.4140000000002F)); + assertTrue(equalEnough(912.414D, 912.4140000000002D)); + assertFalse(equalEnough(912.414D, 912.4140001D)); + assertFalse(equalEnough(null, 912.414D)); + assertTrue(equalEnough((Float)null, null)); + assertTrue(equalEnough((Double)null, null)); + assertFalse(equalEnough(912.414D, null)); + assertFalse(equalEnough(Double.MAX_VALUE, Double.MIN_VALUE)); + assertFalse(equalEnough(Double.MIN_VALUE, Double.MAX_VALUE)); + assertTrue(equalEnough(Double.MAX_VALUE, Double.MAX_VALUE)); + assertTrue(equalEnough(Double.MIN_VALUE, Double.MIN_VALUE)); + assertTrue(equalEnough(Double.NEGATIVE_INFINITY, Double.NEGATIVE_INFINITY)); + assertFalse(equalEnough(Double.NEGATIVE_INFINITY, Double.POSITIVE_INFINITY)); + assertTrue(equalEnough(Double.NaN, Double.NaN)); + assertFalse(equalEnough(1.0, Double.NaN)); + assertFalse(equalEnough(Float.MAX_VALUE, Float.MIN_VALUE)); + assertFalse(equalEnough(Float.MIN_VALUE, Float.MAX_VALUE)); + assertTrue(equalEnough(Float.MAX_VALUE, Float.MAX_VALUE)); + assertTrue(equalEnough(Float.MIN_VALUE, Float.MIN_VALUE)); + assertTrue(equalEnough(Float.NEGATIVE_INFINITY, Float.NEGATIVE_INFINITY)); + assertFalse(equalEnough(Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY)); + assertTrue(equalEnough(Float.NaN, Float.NaN)); + assertFalse(equalEnough(1.0F, Float.NaN)); + } +}
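The heart of the validator is `equalEnough`: after a round-trip through JSON, bit-exact floating-point equality is too strict, so values are compared by their difference scaled to the magnitude of their mean, with NaN matching only NaN and infinities having to agree in sign. Below is a minimal C++ transliteration of the same rule, with check values borrowed from `testFloatComp` (the `EqualEnough` name and standalone layout are illustrative, not Arrow code):

```cpp
#include <cassert>
#include <cmath>

// Sketch of Validator.equalEnough's tolerance rule, transliterated to C++.
// NaN only matches NaN, infinities must agree in sign, and finite values are
// compared by their difference scaled to the magnitude of their mean.
bool EqualEnough(double a, double b, double epsilon = 1.0E-12) {
  if (std::isnan(a) || std::isnan(b)) {
    return std::isnan(a) && std::isnan(b);
  }
  if (std::isinf(a) || std::isinf(b)) {
    return std::isinf(a) && std::isinf(b) && std::signbit(a) == std::signbit(b);
  }
  double average = std::fabs((a + b) / 2);
  double difference_scaled = std::fabs(a - b) / (average == 0.0 ? 1.0 : average);
  return difference_scaled < epsilon;
}

int main() {
  assert(EqualEnough(912.4140000000002, 912.414));  // same values the Java test uses
  assert(!EqualEnough(912.414, 912.4140001));
  assert(EqualEnough(NAN, NAN) && !EqualEnough(1.0, NAN));
  return 0;
}
```

One quirk worth noting: for `Double.MAX_VALUE` against itself, `f1 + f2` overflows to infinity, so the scaled difference degenerates to `0 / inf == 0` and identical extremes still compare equal, which is exactly what the test asserts.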
From 320f5875eef4010762a2146a0691148af1a3f182 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 5 Jan 2017 15:18:24 -0500 Subject: [PATCH 0260/1644] ARROW-455: [C++] Add dtor to BufferOutputStream that calls Close() Since `Close()` can technically fail, it's better to call it yourself (and it's idempotent), but this will help avoid a common class of bugs in small-scale use cases. An alternative here is that we could remove all `Close()` calls from all destructors and possibly add a `DCHECK(!is_open_)` to the base dtor to force the user to close handles. The downside of this is that it makes RAII more difficult, so I'd prefer to leave the close-in-dtor even though it can fail in unusual scenarios. Author: Wes McKinney Closes #269 from wesm/ARROW-455 and squashes the following commits: 821ee22 [Wes McKinney] Add dtor to BufferOutputStream that calls Close() --- cpp/src/arrow/io/file.cc | 1 + cpp/src/arrow/io/io-memory-test.cc | 15 +++++++++++++-- cpp/src/arrow/io/memory.cc | 5 +++++ cpp/src/arrow/io/memory.h | 2 ++ 4 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/cpp/src/arrow/io/file.cc b/cpp/src/arrow/io/file.cc index 3182f2dd8a3b5..0fb13ea22e39f 100644 --- a/cpp/src/arrow/io/file.cc +++ b/cpp/src/arrow/io/file.cc @@ -476,6 +476,7 @@ FileOutputStream::FileOutputStream() { } FileOutputStream::~FileOutputStream() { + // This can fail; it is better to call Close() explicitly impl_->Close(); }
diff --git a/cpp/src/arrow/io/io-memory-test.cc b/cpp/src/arrow/io/io-memory-test.cc index 95d788c03c97e..c0b01653cb128 100644 --- a/cpp/src/arrow/io/io-memory-test.cc +++ b/cpp/src/arrow/io/io-memory-test.cc @@ -42,17 +42,28 @@ class TestBufferOutputStream : public ::testing::Test { std::unique_ptr<BufferOutputStream> stream_; }; +TEST_F(TestBufferOutputStream, DtorCloses) { + std::string data = "data123456"; + + const int K = 100; + for (int i = 0; i < K; ++i) { + EXPECT_OK(stream_->Write(data)); + } + + stream_ = nullptr; + ASSERT_EQ(static_cast<int64_t>(K * data.size()), buffer_->size()); +} + TEST_F(TestBufferOutputStream, CloseResizes) { std::string data = "data123456"; - const int64_t nbytes = static_cast<int64_t>(data.size()); const int K = 100; for (int i = 0; i < K; ++i) { EXPECT_OK(stream_->Write(data)); } ASSERT_OK(stream_->Close()); - ASSERT_EQ(K * nbytes, buffer_->size()); + ASSERT_EQ(static_cast<int64_t>(K * data.size()), buffer_->size()); } TEST(TestBufferReader, RetainParentReference) {
diff --git a/cpp/src/arrow/io/memory.cc b/cpp/src/arrow/io/memory.cc index 4595268372aa2..0f5a0dc06979c 100644 --- a/cpp/src/arrow/io/memory.cc +++ b/cpp/src/arrow/io/memory.cc @@ -43,6 +43,11 @@ BufferOutputStream::BufferOutputStream(const std::shared_ptr<Buffer>& b position_(0), mutable_data_(buffer->mutable_data()) {} +BufferOutputStream::~BufferOutputStream() { + // This can fail; it is better to call Close() explicitly + Close(); +} + Status BufferOutputStream::Close() { if (position_ < capacity_) { return buffer_->Resize(position_);
diff --git a/cpp/src/arrow/io/memory.h b/cpp/src/arrow/io/memory.h index 2f1d8ec317578..8428a12220a69 100644 --- a/cpp/src/arrow/io/memory.h +++ b/cpp/src/arrow/io/memory.h @@ -43,6 +43,8 @@ class ARROW_EXPORT BufferOutputStream : public OutputStream { public: explicit BufferOutputStream(const std::shared_ptr<Buffer>& buffer); + ~BufferOutputStream(); + // Implement the OutputStream interface Status Close() override; Status Tell(int64_t* position) override;
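To make the trade-off described in the commit message concrete, here is a minimal sketch of the close-in-destructor pattern (illustrative only; the class and method names are invented, not Arrow's): the destructor is a safety net, and because destructors cannot report failure, callers who care about errors should still call `Close()` explicitly and check its result.

```cpp
#include <cstdio>

// Close-in-dtor RAII sketch: Close() is idempotent, so an explicit,
// checkable Close() and the destructor's safety-net call can coexist.
class ScopedStream {
 public:
  ~ScopedStream() {
    // A destructor cannot propagate an error; logging is the best it can do.
    if (!Close()) { std::fprintf(stderr, "Close() failed in destructor\n"); }
  }

  bool Close() {
    if (!open_) { return true; }  // Idempotent: a second call is a no-op.
    open_ = false;
    return Flush();               // The step that can actually fail.
  }

 private:
  bool Flush() { return true; }   // Stand-in for the real teardown work.
  bool open_ = true;
};

int main() {
  ScopedStream stream;
  // Preferred: close explicitly and check; the dtor call then becomes a no-op.
  return stream.Close() ? 0 : 1;
}
```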
Korn" Date: Fri, 6 Jan 2017 15:57:20 +0100 Subject: [PATCH 0261/1644] ARROW-456: Add jemalloc based MemoryPool Runtimes of the `builder-benchmark`: ``` BM_BuildPrimitiveArrayNoNulls/repeats:3 901 ms 889 ms 1 576.196MB/s BM_BuildPrimitiveArrayNoNulls/repeats:3 833 ms 829 ms 1 617.6MB/s BM_BuildPrimitiveArrayNoNulls/repeats:3 825 ms 821 ms 1 623.855MB/s BM_BuildPrimitiveArrayNoNulls/repeats:3_mean 853 ms 846 ms 1 605.884MB/s BM_BuildPrimitiveArrayNoNulls/repeats:3_stddev 34 ms 30 ms 0 21.147MB/s BM_BuildVectorNoNulls/repeats:3 712 ms 701 ms 1 729.866MB/s BM_BuildVectorNoNulls/repeats:3 671 ms 670 ms 1 764.464MB/s BM_BuildVectorNoNulls/repeats:3 688 ms 681 ms 1 751.285MB/s BM_BuildVectorNoNulls/repeats:3_mean 690 ms 684 ms 1 748.538MB/s BM_BuildVectorNoNulls/repeats:3_stddev 17 ms 13 ms 0 14.2578MB/s ``` With an aligned `Reallocate`, the jemalloc version is 50% faster and even outperforms `std::vector`: ``` BM_BuildPrimitiveArrayNoNulls/repeats:3 565 ms 559 ms 1 916.516MB/s BM_BuildPrimitiveArrayNoNulls/repeats:3 540 ms 537 ms 1 952.727MB/s BM_BuildPrimitiveArrayNoNulls/repeats:3 544 ms 543 ms 1 942.948MB/s BM_BuildPrimitiveArrayNoNulls/repeats:3_mean 550 ms 546 ms 1 937.397MB/s BM_BuildPrimitiveArrayNoNulls/repeats:3_stddev 11 ms 9 ms 0 15.2949MB/s ``` Author: Uwe L. Korn Closes #270 from xhochy/ARROW-456 and squashes the following commits: d3ce3bf [Uwe L. Korn] Zero arrays for now 831399d [Uwe L. Korn] cpplint #2 e6e251b [Uwe L. Korn] cpplint 52b3c76 [Uwe L. Korn] Add Reallocate implementation to PyArrowMemoryPool 113e650 [Uwe L. Korn] Add missing file d331cd9 [Uwe L. Korn] Add tests for Reallocate c2be086 [Uwe L. Korn] Add JEMALLOC_HOME to the Readme bd47f51 [Uwe L. Korn] Add missing return value 5142ac3 [Uwe L. Korn] Don't use deprecated GBenchmark interfaces b6bff98 [Uwe L. Korn] Add missing (win) include 6f08e19 [Uwe L. Korn] Don't build jemalloc on AppVeyor 834c3b2 [Uwe L. Korn] Add jemalloc to Travis builds 10c6839 [Uwe L. Korn] Implement Reallocate function a17b313 [Uwe L. 
Korn] ARROW-456: C++: Add jemalloc based MemoryPool --- .travis.yml | 1 + appveyor.yml | 2 +- ci/travis_before_script_cpp.sh | 5 ++ cpp/CMakeLists.txt | 30 ++++++- cpp/README.md | 1 + cpp/cmake_modules/Findjemalloc.cmake | 86 +++++++++++++++++++ cpp/src/arrow/CMakeLists.txt | 1 + cpp/src/arrow/buffer.cc | 6 +- cpp/src/arrow/builder-benchmark.cc | 64 ++++++++++++++ cpp/src/arrow/builder.cc | 1 + cpp/src/arrow/column-benchmark.cc | 2 +- cpp/src/arrow/io/interfaces.cc | 4 +- cpp/src/arrow/io/io-file-test.cc | 13 +++ cpp/src/arrow/jemalloc/CMakeLists.txt | 80 +++++++++++++++++ cpp/src/arrow/jemalloc/arrow-jemalloc.pc.in | 27 ++++++ .../jemalloc/jemalloc-builder-benchmark.cc | 47 ++++++++++ .../jemalloc/jemalloc-memory_pool-test.cc | 51 +++++++++++ cpp/src/arrow/jemalloc/memory_pool.cc | 74 ++++++++++++++++ cpp/src/arrow/jemalloc/memory_pool.h | 57 ++++++++++++ cpp/src/arrow/jemalloc/symbols.map | 30 +++++++ cpp/src/arrow/memory_pool-test.cc | 33 +++---- cpp/src/arrow/memory_pool-test.h | 79 +++++++++++++++++ cpp/src/arrow/memory_pool.cc | 24 ++++++ cpp/src/arrow/memory_pool.h | 1 + python/src/pyarrow/common.cc | 14 +++ 25 files changed, 704 insertions(+), 29 deletions(-) create mode 100644 cpp/cmake_modules/Findjemalloc.cmake create mode 100644 cpp/src/arrow/builder-benchmark.cc create mode 100644 cpp/src/arrow/jemalloc/CMakeLists.txt create mode 100644 cpp/src/arrow/jemalloc/arrow-jemalloc.pc.in create mode 100644 cpp/src/arrow/jemalloc/jemalloc-builder-benchmark.cc create mode 100644 cpp/src/arrow/jemalloc/jemalloc-memory_pool-test.cc create mode 100644 cpp/src/arrow/jemalloc/memory_pool.cc create mode 100644 cpp/src/arrow/jemalloc/memory_pool.h create mode 100644 cpp/src/arrow/jemalloc/symbols.map create mode 100644 cpp/src/arrow/memory_pool-test.h diff --git a/.travis.yml b/.travis.yml index 1634eba443615..e8d91045c2254 100644 --- a/.travis.yml +++ b/.travis.yml @@ -15,6 +15,7 @@ addons: - libboost-dev - libboost-filesystem-dev - libboost-system-dev + - libjemalloc-dev matrix: fast_finish: true diff --git a/appveyor.yml b/appveyor.yml index 67478487081b7..17362c993d053 100644 --- a/appveyor.yml +++ b/appveyor.yml @@ -32,7 +32,7 @@ build_script: - cd build # A lot of features are still deactivated as they do not build on Windows # * gbenchmark doesn't build with MSVC - - cmake -G "%GENERATOR%" -DARROW_BOOST_USE_SHARED=OFF -DARROW_IPC=OFF -DARROW_HDFS=OFF -DARROW_BUILD_BENCHMARKS=OFF .. + - cmake -G "%GENERATOR%" -DARROW_BOOST_USE_SHARED=OFF -DARROW_IPC=OFF -DARROW_HDFS=OFF -DARROW_BUILD_BENCHMARKS=OFF -DARROW_JEMALLOC=OFF .. - cmake --build . 
--config Debug # test_script: diff --git a/ci/travis_before_script_cpp.sh b/ci/travis_before_script_cpp.sh index 73bdaeb81fe78..94a889cff1a78 100755 --- a/ci/travis_before_script_cpp.sh +++ b/ci/travis_before_script_cpp.sh @@ -17,6 +17,11 @@ set -ex : ${CPP_BUILD_DIR=$TRAVIS_BUILD_DIR/cpp-build} +if [ $TRAVIS_OS_NAME == "osx" ]; then + brew update > /dev/null + brew install jemalloc +fi + mkdir $CPP_BUILD_DIR pushd $CPP_BUILD_DIR diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 13f0354a73b8b..419691b4b68b2 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -28,7 +28,7 @@ set(THIRDPARTY_DIR "${CMAKE_SOURCE_DIR}/thirdparty") set(GFLAGS_VERSION "2.1.2") set(GTEST_VERSION "1.7.0") -set(GBENCHMARK_VERSION "1.0.0") +set(GBENCHMARK_VERSION "1.1.0") set(FLATBUFFERS_VERSION "1.3.0") find_package(ClangTools) @@ -74,6 +74,10 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") "Build the Arrow IPC extensions" ON) + option(ARROW_JEMALLOC + "Build the Arrow jemalloc-based allocator" + ON) + option(ARROW_BOOST_USE_SHARED "Rely on boost shared libraries where relevant" ON) @@ -238,6 +242,16 @@ function(ADD_ARROW_BENCHMARK_DEPENDENCIES REL_BENCHMARK_NAME) add_dependencies(${BENCHMARK_NAME} ${ARGN}) endfunction() +# A wrapper for target_link_libraries() that is compatible with NO_BENCHMARKS. +function(ARROW_BENCHMARK_LINK_LIBRARIES REL_BENCHMARK_NAME) + if(NO_TESTS) + return() + endif() + get_filename_component(BENCHMARK_NAME ${REL_BENCHMARK_NAME} NAME_WE) + + target_link_libraries(${BENCHMARK_NAME} ${ARGN}) +endfunction() + ############################################################ # Testing @@ -526,7 +540,11 @@ if(ARROW_BUILD_BENCHMARKS) set(GBENCHMARK_CMAKE_ARGS "-DCMAKE_BUILD_TYPE=Release" "-DCMAKE_INSTALL_PREFIX:PATH=${GBENCHMARK_PREFIX}" + "-DBENCHMARK_ENABLE_TESTING=OFF" "-DCMAKE_CXX_FLAGS=-fPIC ${GBENCHMARK_CMAKE_CXX_FLAGS}") + if (APPLE) + set(GBENCHMARK_CMAKE_ARGS ${GBENCHMARK_CMAKE_ARGS} "-DBENCHMARK_USE_LIBCXX=ON") + endif() if (CMAKE_VERSION VERSION_GREATER "3.2") # BUILD_BYPRODUCTS is a 3.2+ feature ExternalProject_Add(gbenchmark_ep @@ -575,6 +593,12 @@ endif() message(STATUS "RapidJSON include dir: ${RAPIDJSON_INCLUDE_DIR}") include_directories(SYSTEM ${RAPIDJSON_INCLUDE_DIR}) +if (ARROW_JEMALLOC) + find_package(jemalloc REQUIRED) + ADD_THIRDPARTY_LIB(jemalloc + SHARED_LIB ${JEMALLOC_SHARED_LIB}) +endif() + ## Google PerfTools ## ## Disabled with TSAN/ASAN as well as with gold+dynamic linking (see comment @@ -737,6 +761,10 @@ add_subdirectory(src/arrow) add_subdirectory(src/arrow/io) add_subdirectory(src/arrow/util) +if(ARROW_JEMALLOC) + add_subdirectory(src/arrow/jemalloc) +endif() + #---------------------------------------------------------------------- # IPC library diff --git a/cpp/README.md b/cpp/README.md index 190e6f85b429d..b77ea990d0659 100644 --- a/cpp/README.md +++ b/cpp/README.md @@ -60,6 +60,7 @@ variables * Google Benchmark: `GBENCHMARK_HOME` (only required if building benchmarks) * Flatbuffers: `FLATBUFFERS_HOME` (only required for the IPC extensions) * Hadoop: `HADOOP_HOME` (only required for the HDFS I/O extensions) +* jemalloc: `JEMALLOC_HOME` (only required for the jemalloc-based memory pool) ## Continuous Integration diff --git a/cpp/cmake_modules/Findjemalloc.cmake b/cpp/cmake_modules/Findjemalloc.cmake new file mode 100644 index 0000000000000..e7fbb94a69235 --- /dev/null +++ b/cpp/cmake_modules/Findjemalloc.cmake @@ -0,0 +1,86 @@ +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except 
in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Tries to find jemalloc headers and libraries. +# +# Usage of this module as follows: +# +# find_package(jemalloc) +# +# Variables used by this module; they can change the default behaviour and need +# to be set before calling find_package: +# +# JEMALLOC_HOME - +# When set, this path is inspected instead of standard library locations as +# the root of the jemalloc installation. The environment variable +# JEMALLOC_HOME overrides this variable. +# +# This module defines +# JEMALLOC_INCLUDE_DIR, directory containing headers +# JEMALLOC_SHARED_LIB, path to libjemalloc.so/dylib +# JEMALLOC_FOUND, whether jemalloc has been found + +if( NOT "$ENV{JEMALLOC_HOME}" STREQUAL "") + file( TO_CMAKE_PATH "$ENV{JEMALLOC_HOME}" _native_path ) + list( APPEND _jemalloc_roots ${_native_path} ) +elseif ( JEMALLOC_HOME ) + list( APPEND _jemalloc_roots ${JEMALLOC_HOME} ) +endif() + +set(LIBJEMALLOC_NAMES jemalloc libjemalloc.so.1 libjemalloc.so.2 libjemalloc.dylib) + +# Try the parameterized roots, if they exist +if ( _jemalloc_roots ) + find_path( JEMALLOC_INCLUDE_DIR NAMES jemalloc/jemalloc.h + PATHS ${_jemalloc_roots} NO_DEFAULT_PATH + PATH_SUFFIXES "include" ) + find_library( JEMALLOC_SHARED_LIB NAMES ${LIBJEMALLOC_NAMES} + PATHS ${_jemalloc_roots} NO_DEFAULT_PATH + PATH_SUFFIXES "lib" ) +else () + find_path( JEMALLOC_INCLUDE_DIR NAMES jemalloc/jemalloc.h ) + message(STATUS ${JEMALLOC_INCLUDE_DIR}) + find_library( JEMALLOC_SHARED_LIB NAMES ${LIBJEMALLOC_NAMES}) + message(STATUS ${JEMALLOC_SHARED_LIB}) +endif () + +if (JEMALLOC_INCLUDE_DIR AND JEMALLOC_SHARED_LIB) + set(JEMALLOC_FOUND TRUE) +else () + set(JEMALLOC_FOUND FALSE) +endif () + +if (JEMALLOC_FOUND) + if (NOT jemalloc_FIND_QUIETLY) + message(STATUS "Found the jemalloc library: ${JEMALLOC_SHARED_LIB}") + endif () +else () + if (NOT jemalloc_FIND_QUIETLY) + set(JEMALLOC_ERR_MSG "Could not find the jemalloc library. 
Looked in ") + if ( _flatbuffers_roots ) + set(JEMALLOC_ERR_MSG "${JEMALLOC_ERR_MSG} in ${_jemalloc_roots}.") + else () + set(JEMALLOC_ERR_MSG "${JEMALLOC_ERR_MSG} system search paths.") + endif () + if (jemalloc_FIND_REQUIRED) + message(FATAL_ERROR "${JEMALLOC_ERR_MSG}") + else (jemalloc_FIND_REQUIRED) + message(STATUS "${JEMALLOC_ERR_MSG}") + endif (jemalloc_FIND_REQUIRED) + endif () +endif () + +mark_as_advanced( + JEMALLOC_INCLUDE_DIR + JEMALLOC_SHARED_LIB +) diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index f8c50513d31a5..16668db798b78 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -59,4 +59,5 @@ ADD_ARROW_TEST(schema-test) ADD_ARROW_TEST(status-test) ADD_ARROW_TEST(table-test) +ADD_ARROW_BENCHMARK(builder-benchmark) ADD_ARROW_BENCHMARK(column-benchmark) diff --git a/cpp/src/arrow/buffer.cc b/cpp/src/arrow/buffer.cc index 6ffa03a0b5663..6d55f88af1e32 100644 --- a/cpp/src/arrow/buffer.cc +++ b/cpp/src/arrow/buffer.cc @@ -80,13 +80,11 @@ Status PoolBuffer::Reserve(int64_t new_capacity) { uint8_t* new_data; new_capacity = BitUtil::RoundUpToMultipleOf64(new_capacity); if (mutable_data_) { - RETURN_NOT_OK(pool_->Allocate(new_capacity, &new_data)); - memcpy(new_data, mutable_data_, size_); - pool_->Free(mutable_data_, capacity_); + RETURN_NOT_OK(pool_->Reallocate(capacity_, new_capacity, &mutable_data_)); } else { RETURN_NOT_OK(pool_->Allocate(new_capacity, &new_data)); + mutable_data_ = new_data; } - mutable_data_ = new_data; data_ = mutable_data_; capacity_ = new_capacity; } diff --git a/cpp/src/arrow/builder-benchmark.cc b/cpp/src/arrow/builder-benchmark.cc new file mode 100644 index 0000000000000..67799a3485f23 --- /dev/null +++ b/cpp/src/arrow/builder-benchmark.cc @@ -0,0 +1,64 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include "benchmark/benchmark.h" + +#include "arrow/builder.h" +#include "arrow/memory_pool.h" +#include "arrow/test-util.h" + +namespace arrow { + +constexpr int64_t kFinalSize = 256; + +static void BM_BuildPrimitiveArrayNoNulls( + benchmark::State& state) { // NOLINT non-const reference + // 2 MiB block + std::vector data(256 * 1024, 100); + while (state.KeepRunning()) { + Int64Builder builder(default_memory_pool(), arrow::int64()); + for (int i = 0; i < kFinalSize; i++) { + // Build up an array of 512 MiB in size + builder.Append(data.data(), data.size(), nullptr); + } + std::shared_ptr out; + builder.Finish(&out); + } + state.SetBytesProcessed( + state.iterations() * data.size() * sizeof(int64_t) * kFinalSize); +} + +BENCHMARK(BM_BuildPrimitiveArrayNoNulls)->Repetitions(3)->Unit(benchmark::kMillisecond); + +static void BM_BuildVectorNoNulls( + benchmark::State& state) { // NOLINT non-const reference + // 2 MiB block + std::vector data(256 * 1024, 100); + while (state.KeepRunning()) { + std::vector builder; + for (int i = 0; i < kFinalSize; i++) { + // Build up an array of 512 MiB in size + builder.insert(builder.end(), data.cbegin(), data.cend()); + } + } + state.SetBytesProcessed( + state.iterations() * data.size() * sizeof(int64_t) * kFinalSize); +} + +BENCHMARK(BM_BuildVectorNoNulls)->Repetitions(3)->Unit(benchmark::kMillisecond); + +} // namespace arrow diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index 1d94dbaa0e91d..a308ea53c570c 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -156,6 +156,7 @@ Status PrimitiveBuilder::Resize(int32_t capacity) { const int64_t new_bytes = TypeTraits::bytes_required(capacity); RETURN_NOT_OK(data_->Resize(new_bytes)); raw_data_ = reinterpret_cast(data_->mutable_data()); + // TODO(emkornfield) valgrind complains without this memset(data_->mutable_data() + old_bytes, 0, new_bytes - old_bytes); } return Status::OK(); diff --git a/cpp/src/arrow/column-benchmark.cc b/cpp/src/arrow/column-benchmark.cc index 650ec90fc0728..8a1c775d7376d 100644 --- a/cpp/src/arrow/column-benchmark.cc +++ b/cpp/src/arrow/column-benchmark.cc @@ -37,7 +37,7 @@ std::shared_ptr MakePrimitive(int32_t length, int32_t null_count = 0) { static void BM_BuildInt32ColumnByChunk( benchmark::State& state) { // NOLINT non-const reference ArrayVector arrays; - for (int chunk_n = 0; chunk_n < state.range_x(); ++chunk_n) { + for (int chunk_n = 0; chunk_n < state.range(0); ++chunk_n) { arrays.push_back(MakePrimitive(100, 10)); } const auto INT32 = std::make_shared(); diff --git a/cpp/src/arrow/io/interfaces.cc b/cpp/src/arrow/io/interfaces.cc index 23bef2853b206..8040f93836cdc 100644 --- a/cpp/src/arrow/io/interfaces.cc +++ b/cpp/src/arrow/io/interfaces.cc @@ -45,8 +45,8 @@ Status ReadableFileInterface::ReadAt( } Status Writeable::Write(const std::string& data) { - return Write(reinterpret_cast(data.c_str()), - static_cast(data.size())); + return Write( + reinterpret_cast(data.c_str()), static_cast(data.size())); } } // namespace io diff --git a/cpp/src/arrow/io/io-file-test.cc b/cpp/src/arrow/io/io-file-test.cc index 5f5d639fab0d8..378b60e782124 100644 --- a/cpp/src/arrow/io/io-file-test.cc +++ b/cpp/src/arrow/io/io-file-test.cc @@ -292,6 +292,19 @@ class MyMemoryPool : public MemoryPool { void Free(uint8_t* buffer, int64_t size) override { std::free(buffer); } + Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) override { + *ptr = reinterpret_cast(std::realloc(*ptr, new_size)); + + if (*ptr == NULL) { + std::stringstream ss; + 
ss << "realloc of size " << new_size << " failed"; + return Status::OutOfMemory(ss.str()); + } + + + return Status::OK(); + } + int64_t bytes_allocated() const override { return -1; } int64_t num_allocations() const { return num_allocations_; } diff --git a/cpp/src/arrow/jemalloc/CMakeLists.txt b/cpp/src/arrow/jemalloc/CMakeLists.txt new file mode 100644 index 0000000000000..c6663eb8227f0 --- /dev/null +++ b/cpp/src/arrow/jemalloc/CMakeLists.txt @@ -0,0 +1,80 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# ---------------------------------------------------------------------- +# arrow_jemalloc : Arrow jemalloc-based allocator + +include_directories(SYSTEM "{JEMALLOC_INCLUDE_DIR}") + +# arrow_jemalloc library +set(ARROW_JEMALLOC_STATIC_LINK_LIBS + arrow_static + jemalloc +) +set(ARROW_JEMALLOC_SHARED_LINK_LIBS + arrow_shared + jemalloc +) + +if (ARROW_BUILD_STATIC) + set(ARROW_JEMALLOC_TEST_LINK_LIBS + arrow_jemalloc_static) +else() + set(ARROW_jemalloc_TEST_LINK_LIBS + arrow_jemalloc_shared) +endif() + +set(ARROW_JEMALLOC_SRCS + memory_pool.cc +) + +if(NOT APPLE) + # Localize thirdparty symbols using a linker version script. This hides them + # from the client application. The OS X linker does not support the + # version-script option. + set(ARROW_JEMALLOC_LINK_FLAGS "-Wl,--version-script=${CMAKE_CURRENT_SOURCE_DIR}/symbols.map") +endif() + +ADD_ARROW_LIB(arrow_jemalloc + SOURCES ${ARROW_JEMALLOC_SRCS} + SHARED_LINK_FLAGS ${ARROW_JEMALLOC_LINK_FLAGS} + SHARED_LINK_LIBS ${ARROW_JEMALLOC_SHARED_LINK_LIBS} + SHARED_PRIVATE_LINK_LIBS ${ARROW_JEMALLOC_SHARED_PRIVATE_LINK_LIBS} + STATIC_LINK_LIBS ${ARROW_JEMALLOC_STATIC_LINK_LIBS} + STATIC_PRIVATE_LINK_LIBS ${ARROW_JEMALLOC_STATIC_PRIVATE_LINK_LIBS} +) + +ADD_ARROW_TEST(jemalloc-memory_pool-test) +ARROW_TEST_LINK_LIBRARIES(jemalloc-memory_pool-test + ${ARROW_JEMALLOC_TEST_LINK_LIBS}) + +ADD_ARROW_BENCHMARK(jemalloc-builder-benchmark) +ARROW_BENCHMARK_LINK_LIBRARIES(jemalloc-builder-benchmark + ${ARROW_JEMALLOC_TEST_LINK_LIBS}) + +# Headers: top level +install(FILES + memory_pool.h + DESTINATION include/arrow/jemalloc) + +# pkg-config support +configure_file(arrow-jemalloc.pc.in + "${CMAKE_CURRENT_BINARY_DIR}/arrow-jemalloc.pc" + @ONLY) +install( + FILES "${CMAKE_CURRENT_BINARY_DIR}/arrow-jemalloc.pc" + DESTINATION "lib/pkgconfig/") diff --git a/cpp/src/arrow/jemalloc/arrow-jemalloc.pc.in b/cpp/src/arrow/jemalloc/arrow-jemalloc.pc.in new file mode 100644 index 0000000000000..0b300fec0b2bf --- /dev/null +++ b/cpp/src/arrow/jemalloc/arrow-jemalloc.pc.in @@ -0,0 +1,27 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
diff --git a/cpp/src/arrow/jemalloc/arrow-jemalloc.pc.in b/cpp/src/arrow/jemalloc/arrow-jemalloc.pc.in new file mode 100644 index 0000000000000..0b300fec0b2bf --- /dev/null +++ b/cpp/src/arrow/jemalloc/arrow-jemalloc.pc.in @@ -0,0 +1,27 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +prefix=@CMAKE_INSTALL_PREFIX@ +libdir=${prefix}/lib +includedir=${prefix}/include + +Name: Apache Arrow jemalloc-based allocator +Description: jemalloc allocator for Arrow. +Version: @ARROW_VERSION@ +Libs: -L${libdir} -larrow_jemalloc +Cflags: -I${includedir} +Requires: arrow
diff --git a/cpp/src/arrow/jemalloc/jemalloc-builder-benchmark.cc b/cpp/src/arrow/jemalloc/jemalloc-builder-benchmark.cc new file mode 100644 index 0000000000000..58dbaa33a1a0f --- /dev/null +++ b/cpp/src/arrow/jemalloc/jemalloc-builder-benchmark.cc @@ -0,0 +1,47 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "benchmark/benchmark.h" + +#include "arrow/builder.h" +#include "arrow/jemalloc/memory_pool.h" +#include "arrow/test-util.h" + +namespace arrow { + +constexpr int64_t kFinalSize = 256; + +static void BM_BuildPrimitiveArrayNoNulls( + benchmark::State& state) { // NOLINT non-const reference + // 2 MiB block + std::vector<int64_t> data(256 * 1024, 100); + while (state.KeepRunning()) { + Int64Builder builder(jemalloc::MemoryPool::default_pool(), arrow::int64()); + for (int i = 0; i < kFinalSize; i++) { + // Build up an array of 512 MiB in size + builder.Append(data.data(), data.size(), nullptr); + } + std::shared_ptr<Array> out; + builder.Finish(&out); + } + state.SetBytesProcessed( + state.iterations() * data.size() * sizeof(int64_t) * kFinalSize); +} + +BENCHMARK(BM_BuildPrimitiveArrayNoNulls)->Repetitions(3)->Unit(benchmark::kMillisecond); + +} // namespace arrow
diff --git a/cpp/src/arrow/jemalloc/jemalloc-memory_pool-test.cc b/cpp/src/arrow/jemalloc/jemalloc-memory_pool-test.cc new file mode 100644 index 0000000000000..a8448abc7d296 --- /dev/null +++ b/cpp/src/arrow/jemalloc/jemalloc-memory_pool-test.cc @@ -0,0 +1,51 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include +#include + +#include "gtest/gtest.h" + +#include "arrow/jemalloc/memory_pool.h" +#include "arrow/memory_pool-test.h" + +namespace arrow { +namespace jemalloc { +namespace test { + +class TestJemallocMemoryPool : public ::arrow::test::TestMemoryPoolBase { + public: + ::arrow::MemoryPool* memory_pool() override { + return ::arrow::jemalloc::MemoryPool::default_pool(); + } +}; + +TEST_F(TestJemallocMemoryPool, MemoryTracking) { + this->TestMemoryTracking(); +} + +TEST_F(TestJemallocMemoryPool, OOM) { + this->TestOOM(); +} + +TEST_F(TestJemallocMemoryPool, Reallocate) { + this->TestReallocate(); +} + +} // namespace test +} // namespace jemalloc +} // namespace arrow diff --git a/cpp/src/arrow/jemalloc/memory_pool.cc b/cpp/src/arrow/jemalloc/memory_pool.cc new file mode 100644 index 0000000000000..acc09c7cd7587 --- /dev/null +++ b/cpp/src/arrow/jemalloc/memory_pool.cc @@ -0,0 +1,74 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include "arrow/jemalloc/memory_pool.h" + +#include + +#include + +#include "arrow/status.h" + +constexpr size_t kAlignment = 64; + +namespace arrow { +namespace jemalloc { + +MemoryPool* MemoryPool::default_pool() { + static MemoryPool pool; + return &pool; +} + +MemoryPool::MemoryPool() : allocated_size_(0) {} + +MemoryPool::~MemoryPool() {} + +Status MemoryPool::Allocate(int64_t size, uint8_t** out) { + *out = reinterpret_cast(mallocx(size, MALLOCX_ALIGN(kAlignment))); + if (*out == NULL) { + std::stringstream ss; + ss << "malloc of size " << size << " failed"; + return Status::OutOfMemory(ss.str()); + } + allocated_size_ += size; + return Status::OK(); +} + +Status MemoryPool::Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) { + *ptr = reinterpret_cast(rallocx(*ptr, new_size, MALLOCX_ALIGN(kAlignment))); + if (*ptr == NULL) { + std::stringstream ss; + ss << "realloc of size " << new_size << " failed"; + return Status::OutOfMemory(ss.str()); + } + + allocated_size_ += new_size - old_size; + + return Status::OK(); +} + +void MemoryPool::Free(uint8_t* buffer, int64_t size) { + allocated_size_ -= size; + free(buffer); +} + +int64_t MemoryPool::bytes_allocated() const { + return allocated_size_.load(); +} + +} // namespace jemalloc +} // namespace arrow diff --git a/cpp/src/arrow/jemalloc/memory_pool.h b/cpp/src/arrow/jemalloc/memory_pool.h new file mode 100644 index 0000000000000..0d32b4658e3e8 --- /dev/null +++ b/cpp/src/arrow/jemalloc/memory_pool.h @@ -0,0 +1,57 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +// Public API for the jemalloc-based allocator + +#ifndef ARROW_JEMALLOC_MEMORY_POOL_H +#define ARROW_JEMALLOC_MEMORY_POOL_H + +#include "arrow/memory_pool.h" + +#include + +namespace arrow { + +class Status; + +namespace jemalloc { + +class ARROW_EXPORT MemoryPool : public ::arrow::MemoryPool { + public: + static MemoryPool* default_pool(); + + MemoryPool(MemoryPool const&) = delete; + MemoryPool& operator=(MemoryPool const&) = delete; + + virtual ~MemoryPool(); + + Status Allocate(int64_t size, uint8_t** out) override; + Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) override; + void Free(uint8_t* buffer, int64_t size) override; + + int64_t bytes_allocated() const override; + + private: + MemoryPool(); + + std::atomic allocated_size_; +}; + +} // namespace jemalloc +} // namespace arrow + +#endif // ARROW_JEMALLOC_MEMORY_POOL_H diff --git a/cpp/src/arrow/jemalloc/symbols.map b/cpp/src/arrow/jemalloc/symbols.map new file mode 100644 index 0000000000000..1e87caef9c8c1 --- /dev/null +++ b/cpp/src/arrow/jemalloc/symbols.map @@ -0,0 +1,30 @@ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + +{ + # Symbols marked as 'local' are not exported by the DSO and thus may not + # be used by client applications. + local: + # devtoolset / static-libstdc++ symbols + __cxa_*; + + extern "C++" { + # boost + boost::*; + + # devtoolset or -static-libstdc++ - the Red Hat devtoolset statically + # links c++11 symbols into binaries so that the result may be executed on + # a system with an older libstdc++ which doesn't include the necessary + # c++11 symbols. + std::*; + }; +}; diff --git a/cpp/src/arrow/memory_pool-test.cc b/cpp/src/arrow/memory_pool-test.cc index d6f323d276305..3daf72755cff2 100644 --- a/cpp/src/arrow/memory_pool-test.cc +++ b/cpp/src/arrow/memory_pool-test.cc @@ -15,35 +15,28 @@ // specific language governing permissions and limitations // under the License. +#include "arrow/memory_pool-test.h" + #include #include -#include "gtest/gtest.h" - -#include "arrow/memory_pool.h" -#include "arrow/status.h" -#include "arrow/test-util.h" - namespace arrow { -TEST(DefaultMemoryPool, MemoryTracking) { - MemoryPool* pool = default_memory_pool(); +class TestDefaultMemoryPool : public ::arrow::test::TestMemoryPoolBase { + public: + ::arrow::MemoryPool* memory_pool() override { return ::arrow::default_memory_pool(); } +}; - uint8_t* data; - ASSERT_OK(pool->Allocate(100, &data)); - EXPECT_EQ(static_cast(0), reinterpret_cast(data) % 64); - ASSERT_EQ(100, pool->bytes_allocated()); - - pool->Free(data, 100); - ASSERT_EQ(0, pool->bytes_allocated()); +TEST_F(TestDefaultMemoryPool, MemoryTracking) { + this->TestMemoryTracking(); } -TEST(DefaultMemoryPool, OOM) { - MemoryPool* pool = default_memory_pool(); +TEST_F(TestDefaultMemoryPool, OOM) { + this->TestOOM(); +} - uint8_t* data; - int64_t to_alloc = std::numeric_limits::max(); - ASSERT_RAISES(OutOfMemory, pool->Allocate(to_alloc, &data)); +TEST_F(TestDefaultMemoryPool, Reallocate) { + this->TestReallocate(); } // Death tests and valgrind are known to not play well 100% of the time. See diff --git a/cpp/src/arrow/memory_pool-test.h b/cpp/src/arrow/memory_pool-test.h new file mode 100644 index 0000000000000..b9f0337dfac8e --- /dev/null +++ b/cpp/src/arrow/memory_pool-test.h @@ -0,0 +1,79 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include "gtest/gtest.h" + +#include + +#include "arrow/memory_pool.h" +#include "arrow/test-util.h" + +namespace arrow { + +namespace test { + +class TestMemoryPoolBase : public ::testing::Test { + public: + virtual ::arrow::MemoryPool* memory_pool() = 0; + + void TestMemoryTracking() { + auto pool = memory_pool(); + + uint8_t* data; + ASSERT_OK(pool->Allocate(100, &data)); + EXPECT_EQ(static_cast(0), reinterpret_cast(data) % 64); + ASSERT_EQ(100, pool->bytes_allocated()); + + pool->Free(data, 100); + ASSERT_EQ(0, pool->bytes_allocated()); + } + + void TestOOM() { + auto pool = memory_pool(); + + uint8_t* data; + int64_t to_alloc = std::numeric_limits::max(); + ASSERT_RAISES(OutOfMemory, pool->Allocate(to_alloc, &data)); + } + + void TestReallocate() { + auto pool = memory_pool(); + + uint8_t* data; + ASSERT_OK(pool->Allocate(10, &data)); + ASSERT_EQ(10, pool->bytes_allocated()); + data[0] = 35; + data[9] = 12; + + // Expand + ASSERT_OK(pool->Reallocate(10, 20, &data)); + ASSERT_EQ(data[9], 12); + ASSERT_EQ(20, pool->bytes_allocated()); + + // Shrink + ASSERT_OK(pool->Reallocate(20, 5, &data)); + ASSERT_EQ(data[0], 35); + ASSERT_EQ(5, pool->bytes_allocated()); + + // Free + pool->Free(data, 5); + ASSERT_EQ(0, pool->bytes_allocated()); + } +}; + +} // namespace test +} // namespace arrow diff --git a/cpp/src/arrow/memory_pool.cc b/cpp/src/arrow/memory_pool.cc index f55b1ac668c7c..aea5e210f4980 100644 --- a/cpp/src/arrow/memory_pool.cc +++ b/cpp/src/arrow/memory_pool.cc @@ -17,6 +17,7 @@ #include "arrow/memory_pool.h" +#include #include #include #include @@ -67,6 +68,7 @@ class InternalMemoryPool : public MemoryPool { virtual ~InternalMemoryPool(); Status Allocate(int64_t size, uint8_t** out) override; + Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) override; void Free(uint8_t* buffer, int64_t size) override; @@ -85,6 +87,28 @@ Status InternalMemoryPool::Allocate(int64_t size, uint8_t** out) { return Status::OK(); } +Status InternalMemoryPool::Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) { + std::lock_guard guard(pool_lock_); + + // Note: We cannot use realloc() here as it doesn't guarantee alignment. 
+ + // Allocate new chunk + uint8_t* out; + RETURN_NOT_OK(AllocateAligned(new_size, &out)); + // Copy contents and release old memory chunk + memcpy(out, *ptr, std::min(new_size, old_size)); +#ifdef _MSC_VER + _aligned_free(*ptr); +#else + std::free(*ptr); +#endif + *ptr = out; + + bytes_allocated_ += new_size - old_size; + + return Status::OK(); +} + int64_t InternalMemoryPool::bytes_allocated() const { std::lock_guard<std::mutex> guard(pool_lock_); return bytes_allocated_;
diff --git a/cpp/src/arrow/memory_pool.h b/cpp/src/arrow/memory_pool.h index 4c1d699addd50..13a3f129c1a9e 100644 --- a/cpp/src/arrow/memory_pool.h +++ b/cpp/src/arrow/memory_pool.h @@ -31,6 +31,7 @@ class ARROW_EXPORT MemoryPool { virtual ~MemoryPool(); virtual Status Allocate(int64_t size, uint8_t** out) = 0; + virtual Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) = 0; virtual void Free(uint8_t* buffer, int64_t size) = 0; virtual int64_t bytes_allocated() const = 0;
diff --git a/python/src/pyarrow/common.cc b/python/src/pyarrow/common.cc index 8660ac8f0cedf..0bdd289953dc4 100644 --- a/python/src/pyarrow/common.cc +++ b/python/src/pyarrow/common.cc @@ -47,6 +47,20 @@ class PyArrowMemoryPool : public arrow::MemoryPool { return Status::OK(); } + Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) override { + *ptr = reinterpret_cast<uint8_t*>(std::realloc(*ptr, new_size)); + + if (*ptr == NULL) { + std::stringstream ss; + ss << "realloc of size " << new_size << " failed"; + return Status::OutOfMemory(ss.str()); + } + + bytes_allocated_ += new_size - old_size; + + return Status::OK(); + } + int64_t bytes_allocated() const override { std::lock_guard<std::mutex> guard(pool_lock_); return bytes_allocated_;
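The jemalloc pool's speedup, including the roughly 50% win the commit message benchmarks, comes from `rallocx()`: unlike the copy-based fallback above, it can grow an allocation in place, and `MALLOCX_ALIGN` makes the 64-byte alignment part of the request so it survives reallocation. A minimal standalone sketch of the `mallocx`/`rallocx`/`dallocx` flow from the patch (assumes jemalloc is installed; on systems where jemalloc also provides `malloc`, plain `free()` would work in place of `dallocx`):

```cpp
#include <jemalloc/jemalloc.h>

#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
  constexpr size_t kAlignment = 64;
  // mallocx/rallocx belong to jemalloc's non-standard *allocx API.
  uint8_t* data = static_cast<uint8_t*>(mallocx(1024, MALLOCX_ALIGN(kAlignment)));
  if (data == nullptr) { return 1; }
  uint8_t* grown = static_cast<uint8_t*>(rallocx(data, 4096, MALLOCX_ALIGN(kAlignment)));
  if (grown == nullptr) {  // on failure the original block remains valid
    dallocx(data, 0);
    return 1;
  }
  data = grown;
  std::printf("still 64-byte aligned: %d\n",
              static_cast<int>(reinterpret_cast<uintptr_t>(data) % kAlignment == 0));
  dallocx(data, 0);
  return 0;
}
```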
From 74685f386307171a90a9f97316e25b7f39cdd0a1 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 6 Jan 2017 11:11:43 -0500 Subject: [PATCH 0262/1644] ARROW-427: [C++] Implement dictionary array type I thought about this some and concluded that it makes sense to store the reference to the dictionary values themselves in the data type object, similar to `CategoricalDtype` in pandas. This will be at least adequate for the Feather file format merge. In the IPC metadata, there is no explicit dictionary type -- an array can be dictionary encoded or not. On JIRA we've discussed adding a dictionary type flag indicating whether the dictionary values/categories are ordered (also called "ordinal") or unordered (also called "nominal"). That hasn't been done yet. Author: Wes McKinney Closes #268 from wesm/ARROW-427 and squashes the following commits: 5ce3701 [Wes McKinney] cpplint a6c2896 [Wes McKinney] Revert T::Equals(const T& other) to EqualsExact to appease clang 9a4edb5 [Wes McKinney] Implement rudimentary DictionaryArray::Validate 9efe46b [Wes McKinney] Add tests, implementation for DictionaryArray::Equals and RangeEquals b06eb86 [Wes McKinney] Implement PrettyPrint for DictionaryArray 17c70de [Wes McKinney] Refactor, compose shared_ptr in DictionaryType b52b3a7 [Wes McKinney] Add rudimentary DictionaryType and DictionaryArray implementation for discussion --- cpp/src/arrow/CMakeLists.txt | 1 + cpp/src/arrow/array-dictionary-test.cc | 128 +++++++++++++++++++ cpp/src/arrow/array-string-test.cc | 4 +- cpp/src/arrow/array.cc | 94 +++++++++++--- cpp/src/arrow/array.h | 111 ++++++++++++++--- cpp/src/arrow/ipc/adapter.cc | 11 ++ cpp/src/arrow/ipc/json-internal.cc | 13 ++ cpp/src/arrow/pretty_print-test.cc | 53 ++++---- cpp/src/arrow/pretty_print.cc | 12 ++ cpp/src/arrow/test-util.h | 36 +++--- cpp/src/arrow/type.cc | 69 +++++++++-- cpp/src/arrow/type.h | 163 +++++++++++++++++++------ cpp/src/arrow/type_fwd.h | 57 +-------- format/Message.fbs | 2 +- python/pyarrow/includes/libarrow.pxd | 3 +- python/pyarrow/includes/parquet.pxd | 2 +- python/pyarrow/parquet.pyx | 4 +- python/pyarrow/schema.pyx | 4 +- 18 files changed, 583 insertions(+), 184 deletions(-) create mode 100644 cpp/src/arrow/array-dictionary-test.cc
diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index 16668db798b78..e5e36ed253cfa 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -47,6 +47,7 @@ install( ADD_ARROW_TEST(array-test) ADD_ARROW_TEST(array-decimal-test) +ADD_ARROW_TEST(array-dictionary-test) ADD_ARROW_TEST(array-list-test) ADD_ARROW_TEST(array-primitive-test) ADD_ARROW_TEST(array-string-test)
diff --git a/cpp/src/arrow/array-dictionary-test.cc b/cpp/src/arrow/array-dictionary-test.cc new file mode 100644 index 0000000000000..c290153b95053 --- /dev/null +++ b/cpp/src/arrow/array-dictionary-test.cc @@ -0,0 +1,128 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include +#include +#include +#include +#include + +#include "gtest/gtest.h" + +#include "arrow/array.h" +#include "arrow/buffer.h" +#include "arrow/memory_pool.h" +#include "arrow/test-util.h" +#include "arrow/type.h" + +namespace arrow { + +TEST(TestDictionary, Basics) { + std::vector values = {100, 1000, 10000, 100000}; + std::shared_ptr dict; + ArrayFromVector(int32(), values, &dict); + + std::shared_ptr type1 = + std::dynamic_pointer_cast(dictionary(int16(), dict)); + DictionaryType type2(int16(), dict); + + ASSERT_TRUE(int16()->Equals(type1->index_type())); + ASSERT_TRUE(type1->dictionary()->Equals(dict)); + + ASSERT_TRUE(int16()->Equals(type2.index_type())); + ASSERT_TRUE(type2.dictionary()->Equals(dict)); + + ASSERT_EQ("dictionary", type1->ToString()); +} + +TEST(TestDictionary, Equals) { + std::vector is_valid = {true, true, false, true, true, true}; + + std::shared_ptr dict; + std::vector dict_values = {"foo", "bar", "baz"}; + ArrayFromVector(utf8(), dict_values, &dict); + std::shared_ptr dict_type = dictionary(int16(), dict); + + std::shared_ptr dict2; + std::vector dict2_values = {"foo", "bar", "baz", "qux"}; + ArrayFromVector(utf8(), dict2_values, &dict2); + std::shared_ptr dict2_type = dictionary(int16(), dict2); + + std::shared_ptr indices; + std::vector indices_values = {1, 2, -1, 0, 2, 0}; + ArrayFromVector(int16(), is_valid, indices_values, &indices); + + std::shared_ptr indices2; + std::vector indices2_values = {1, 2, 0, 0, 2, 0}; + ArrayFromVector(int16(), is_valid, indices2_values, &indices2); + + std::shared_ptr indices3; + std::vector indices3_values = {1, 1, 0, 0, 2, 0}; + ArrayFromVector(int16(), is_valid, indices3_values, &indices3); + + auto arr = std::make_shared(dict_type, indices); + auto arr2 = std::make_shared(dict_type, indices2); + auto arr3 = std::make_shared(dict2_type, indices); + auto arr4 = std::make_shared(dict_type, indices3); + + ASSERT_TRUE(arr->Equals(arr)); + + // Equal, because the unequal index is masked by null + ASSERT_TRUE(arr->Equals(arr2)); + + // Unequal dictionaries + ASSERT_FALSE(arr->Equals(arr3)); + + // Unequal indices + ASSERT_FALSE(arr->Equals(arr4)); + + // RangeEquals + ASSERT_TRUE(arr->RangeEquals(3, 6, 3, arr4)); + ASSERT_FALSE(arr->RangeEquals(1, 3, 1, arr4)); +} + +TEST(TestDictionary, Validate) { + std::vector is_valid = {true, true, false, true, true, true}; + + std::shared_ptr dict; + std::vector dict_values = {"foo", "bar", "baz"}; + ArrayFromVector(utf8(), dict_values, &dict); + std::shared_ptr dict_type = dictionary(int16(), dict); + + std::shared_ptr indices; + std::vector indices_values = {1, 2, 0, 0, 2, 0}; + ArrayFromVector(uint8(), is_valid, indices_values, &indices); + + std::shared_ptr indices2; + std::vector indices2_values = {1., 2., 0., 0., 2., 0.}; + ArrayFromVector(float32(), is_valid, indices2_values, &indices2); + + std::shared_ptr indices3; + std::vector indices3_values = {1, 2, 0, 0, 2, 0}; + ArrayFromVector(int64(), is_valid, indices3_values, &indices3); + + std::shared_ptr arr = std::make_shared(dict_type, indices); + std::shared_ptr arr2 = std::make_shared(dict_type, indices2); + std::shared_ptr arr3 = std::make_shared(dict_type, indices3); + + // Only checking index type for now + ASSERT_OK(arr->Validate()); + ASSERT_RAISES(Invalid, arr2->Validate()); + ASSERT_OK(arr3->Validate()); +} + +} // namespace arrow diff --git a/cpp/src/arrow/array-string-test.cc b/cpp/src/arrow/array-string-test.cc index b144c632133d6..024bfd508957d 100644 --- a/cpp/src/arrow/array-string-test.cc +++ 
b/cpp/src/arrow/array-string-test.cc @@ -36,8 +36,8 @@ TEST(TypesTest, BinaryType) { BinaryType t1; BinaryType e1; StringType t2; - EXPECT_TRUE(t1.Equals(&e1)); - EXPECT_FALSE(t1.Equals(&t2)); + EXPECT_TRUE(t1.Equals(e1)); + EXPECT_FALSE(t1.Equals(t2)); ASSERT_EQ(t1.type, Type::BINARY); ASSERT_EQ(t1.ToString(), std::string("binary")); } diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index 3d309b8b92f48..7509520d12685 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -42,7 +42,7 @@ Status GetEmptyBitmap( // ---------------------------------------------------------------------- // Base array class -Array::Array(const TypePtr& type, int32_t length, int32_t null_count, +Array::Array(const std::shared_ptr& type, int32_t length, int32_t null_count, const std::shared_ptr& null_bitmap) { type_ = type; length_ = length; @@ -51,6 +51,12 @@ Array::Array(const TypePtr& type, int32_t length, int32_t null_count, if (null_bitmap_) { null_bitmap_data_ = null_bitmap_->data(); } } +bool Array::BaseEquals(const std::shared_ptr& other) const { + if (this == other.get()) { return true; } + if (!other) { return false; } + return EqualsExact(*other.get()); +} + bool Array::EqualsExact(const Array& other) const { if (this == &other) { return true; } if (length_ != other.length_ || null_count_ != other.null_count_ || @@ -91,7 +97,7 @@ Status NullArray::Accept(ArrayVisitor* visitor) const { // ---------------------------------------------------------------------- // Primitive array base -PrimitiveArray::PrimitiveArray(const TypePtr& type, int32_t length, +PrimitiveArray::PrimitiveArray(const std::shared_ptr& type, int32_t length, const std::shared_ptr& data, int32_t null_count, const std::shared_ptr& null_bitmap) : Array(type, length, null_count, null_bitmap) { @@ -100,14 +106,9 @@ PrimitiveArray::PrimitiveArray(const TypePtr& type, int32_t length, } bool PrimitiveArray::EqualsExact(const PrimitiveArray& other) const { - if (this == &other) { return true; } - if (null_count_ != other.null_count_) { return false; } + if (!Array::EqualsExact(other)) { return false; } if (null_count_ > 0) { - bool equal_bitmap = - null_bitmap_->Equals(*other.null_bitmap_, BitUtil::CeilByte(length_) / 8); - if (!equal_bitmap) { return false; } - const uint8_t* this_data = raw_data_; const uint8_t* other_data = other.raw_data_; @@ -131,7 +132,7 @@ bool PrimitiveArray::Equals(const std::shared_ptr& arr) const { if (this == arr.get()) { return true; } if (!arr) { return false; } if (this->type_enum() != arr->type_enum()) { return false; } - return EqualsExact(*static_cast(arr.get())); + return EqualsExact(static_cast(*arr.get())); } template @@ -161,7 +162,7 @@ BooleanArray::BooleanArray(int32_t length, const std::shared_ptr& data, : PrimitiveArray( std::make_shared(), length, data, null_count, null_bitmap) {} -BooleanArray::BooleanArray(const TypePtr& type, int32_t length, +BooleanArray::BooleanArray(const std::shared_ptr& type, int32_t length, const std::shared_ptr& data, int32_t null_count, const std::shared_ptr& null_bitmap) : PrimitiveArray(type, length, data, null_count, null_bitmap) {} @@ -192,7 +193,7 @@ bool BooleanArray::EqualsExact(const BooleanArray& other) const { bool BooleanArray::Equals(const std::shared_ptr& arr) const { if (this == arr.get()) return true; if (Type::BOOL != arr->type_enum()) { return false; } - return EqualsExact(*static_cast(arr.get())); + return EqualsExact(static_cast(*arr.get())); } bool BooleanArray::RangeEquals(int32_t start_idx, int32_t end_idx, @@ -238,7 +239,7 @@ 
 bool ListArray::Equals(const std::shared_ptr<Array>& arr) const {
   if (this == arr.get()) { return true; }
   if (this->type_enum() != arr->type_enum()) { return false; }
-  return EqualsExact(*static_cast<const ListArray*>(arr.get()));
+  return EqualsExact(static_cast<const ListArray&>(*arr.get()));
 }
 
 bool ListArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx,
@@ -333,7 +334,7 @@ BinaryArray::BinaryArray(int32_t length, const std::shared_ptr<Buffer>& offsets,
     const std::shared_ptr<Buffer>& null_bitmap)
     : BinaryArray(kBinary, length, offsets, data, null_count, null_bitmap) {}
 
-BinaryArray::BinaryArray(const TypePtr& type, int32_t length,
+BinaryArray::BinaryArray(const std::shared_ptr<DataType>& type, int32_t length,
     const std::shared_ptr<Buffer>& offsets, const std::shared_ptr<Buffer>& data,
     int32_t null_count, const std::shared_ptr<Buffer>& null_bitmap)
     : Array(type, length, null_count, null_bitmap),
@@ -364,7 +365,7 @@ bool BinaryArray::EqualsExact(const BinaryArray& other) const {
 bool BinaryArray::Equals(const std::shared_ptr<Array>& arr) const {
   if (this == arr.get()) { return true; }
   if (this->type_enum() != arr->type_enum()) { return false; }
-  return EqualsExact(*static_cast<const BinaryArray*>(arr.get()));
+  return EqualsExact(static_cast<const BinaryArray&>(*arr.get()));
 }
 
 bool BinaryArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx,
@@ -493,7 +494,7 @@ Status StructArray::Accept(ArrayVisitor* visitor) const {
 // ----------------------------------------------------------------------
 // UnionArray
 
-UnionArray::UnionArray(const TypePtr& type, int32_t length,
+UnionArray::UnionArray(const std::shared_ptr<DataType>& type, int32_t length,
     const std::vector<std::shared_ptr<Array>>& children,
    const std::shared_ptr<Buffer>& type_ids, const std::shared_ptr<Buffer>& offsets,
    int32_t null_count, const std::shared_ptr<Buffer>& null_bitmap)
@@ -586,6 +587,66 @@ Status UnionArray::Accept(ArrayVisitor* visitor) const {
   return visitor->Visit(*this);
 }
 
+// ----------------------------------------------------------------------
+// DictionaryArray
+
+Status DictionaryArray::FromBuffer(const std::shared_ptr<DataType>& type, int32_t length,
+    const std::shared_ptr<Buffer>& indices, int32_t null_count,
+    const std::shared_ptr<Buffer>& null_bitmap, std::shared_ptr<DictionaryArray>* out) {
+  DCHECK_EQ(type->type, Type::DICTIONARY);
+  const auto& dict_type = static_cast<const DictionaryType*>(type.get());
+
+  std::shared_ptr<Array> boxed_indices;
+  RETURN_NOT_OK(MakePrimitiveArray(
+      dict_type->index_type(), length, indices, null_count, null_bitmap, &boxed_indices));
+
+  *out = std::make_shared<DictionaryArray>(type, boxed_indices);
+  return Status::OK();
+}
+
+DictionaryArray::DictionaryArray(
+    const std::shared_ptr<DataType>& type, const std::shared_ptr<Array>& indices)
+    : Array(type, indices->length(), indices->null_count(), indices->null_bitmap()),
+      dict_type_(static_cast<const DictionaryType*>(type.get())),
+      indices_(indices) {
+  DCHECK_EQ(type->type, Type::DICTIONARY);
+}
+
+Status DictionaryArray::Validate() const {
+  Type::type index_type_id = indices_->type()->type;
+  if (!is_integer(index_type_id)) {
+    return Status::Invalid("Dictionary indices must be integer type");
+  }
+  return Status::OK();
+}
+
+std::shared_ptr<Array> DictionaryArray::dictionary() const {
+  return dict_type_->dictionary();
+}
+
+bool DictionaryArray::EqualsExact(const DictionaryArray& other) const {
+  if (!dictionary()->Equals(other.dictionary())) { return false; }
+  return indices_->Equals(other.indices());
+}
+
+bool DictionaryArray::Equals(const std::shared_ptr<Array>& arr) const {
+  if (this == arr.get()) { return true; }
+  if (Type::DICTIONARY != arr->type_enum()) { return false; }
+  return EqualsExact(static_cast<const DictionaryArray&>(*arr.get()));
+}
+
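+// (Editorial note, not part of the original patch.) Equality as implemented
+// above is representation-based: two dictionary arrays compare equal only if
+// both the dictionaries and the indices match, so logically equivalent
+// encodings that permute the dictionary (and remap indices accordingly)
+// compare unequal.
+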
+bool DictionaryArray::RangeEquals(int32_t start_idx, int32_t end_idx,
+    int32_t other_start_idx, const std::shared_ptr<Array>& arr) const {
+  if (Type::DICTIONARY != arr->type_enum()) { return false; }
+  const auto& dict_other = static_cast<const DictionaryArray&>(*arr.get());
+  if (!dictionary()->Equals(dict_other.dictionary())) { return false; }
+  return indices_->RangeEquals(start_idx, end_idx, other_start_idx, dict_other.indices());
+}
+
+Status DictionaryArray::Accept(ArrayVisitor* visitor) const {
+  return visitor->Visit(*this);
+}
+
 // ----------------------------------------------------------------------
 
 #define MAKE_PRIMITIVE_ARRAY_CASE(ENUM, ArrayType) \
@@ -593,7 +654,7 @@ Status UnionArray::Accept(ArrayVisitor* visitor) const {
     out->reset(new ArrayType(type, length, data, null_count, null_bitmap)); \
     break;
 
-Status MakePrimitiveArray(const TypePtr& type, int32_t length,
+Status MakePrimitiveArray(const std::shared_ptr<DataType>& type, int32_t length,
     const std::shared_ptr<Buffer>& data, int32_t null_count,
     const std::shared_ptr<Buffer>& null_bitmap, std::shared_ptr<Array>* out) {
   switch (type->type) {
@@ -610,7 +671,6 @@ Status MakePrimitiveArray(const TypePtr& type, int32_t length,
     MAKE_PRIMITIVE_ARRAY_CASE(DOUBLE, DoubleArray);
     MAKE_PRIMITIVE_ARRAY_CASE(TIME, Int64Array);
     MAKE_PRIMITIVE_ARRAY_CASE(TIMESTAMP, TimestampArray);
-    MAKE_PRIMITIVE_ARRAY_CASE(TIMESTAMP_DOUBLE, DoubleArray);
     default:
       return Status::NotImplemented(type->ToString());
   }
diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h
index cd42a28e251ca..57214c46d1cc6 100644
--- a/cpp/src/arrow/array.h
+++ b/cpp/src/arrow/array.h
@@ -26,6 +26,7 @@
 
 #include "arrow/buffer.h"
 #include "arrow/type.h"
+#include "arrow/type_fwd.h"
 #include "arrow/util/bit-util.h"
 #include "arrow/util/macros.h"
 #include "arrow/util/visibility.h"
@@ -36,6 +37,34 @@ class MemoryPool;
 class MutableBuffer;
 class Status;
 
+class ArrayVisitor {
+ public:
+  virtual Status Visit(const NullArray& array) = 0;
+  virtual Status Visit(const BooleanArray& array) = 0;
+  virtual Status Visit(const Int8Array& array) = 0;
+  virtual Status Visit(const Int16Array& array) = 0;
+  virtual Status Visit(const Int32Array& array) = 0;
+  virtual Status Visit(const Int64Array& array) = 0;
+  virtual Status Visit(const UInt8Array& array) = 0;
+  virtual Status Visit(const UInt16Array& array) = 0;
+  virtual Status Visit(const UInt32Array& array) = 0;
+  virtual Status Visit(const UInt64Array& array) = 0;
+  virtual Status Visit(const HalfFloatArray& array) = 0;
+  virtual Status Visit(const FloatArray& array) = 0;
+  virtual Status Visit(const DoubleArray& array) = 0;
+  virtual Status Visit(const StringArray& array) = 0;
+  virtual Status Visit(const BinaryArray& array) = 0;
+  virtual Status Visit(const DateArray& array) = 0;
+  virtual Status Visit(const TimeArray& array) = 0;
+  virtual Status Visit(const TimestampArray& array) = 0;
+  virtual Status Visit(const IntervalArray& array) = 0;
+  virtual Status Visit(const DecimalArray& array) = 0;
+  virtual Status Visit(const ListArray& array) = 0;
+  virtual Status Visit(const StructArray& array) = 0;
+  virtual Status Visit(const UnionArray& array) = 0;
+  virtual Status Visit(const DictionaryArray& array) = 0;
+};
+
 // Immutable data array with some logical type and some length. Any memory is
 // owned by the respective Buffer instance (or its parents).
//
@@ -63,6 +92,7 @@ class ARROW_EXPORT Array {
   const uint8_t* null_bitmap_data() const { return null_bitmap_data_; }
 
+  bool BaseEquals(const std::shared_ptr<Array>& arr) const;
   bool EqualsExact(const Array& arr) const;
   virtual bool Equals(const std::shared_ptr<Array>& arr) const = 0;
   virtual bool ApproxEquals(const std::shared_ptr<Array>& arr) const;
@@ -122,8 +152,9 @@ class ARROW_EXPORT PrimitiveArray : public Array {
   bool Equals(const std::shared_ptr<Array>& arr) const override;
 
  protected:
-  PrimitiveArray(const TypePtr& type, int32_t length, const std::shared_ptr<Buffer>& data,
-      int32_t null_count = 0, const std::shared_ptr<Buffer>& null_bitmap = nullptr);
+  PrimitiveArray(const std::shared_ptr<DataType>& type, int32_t length,
+      const std::shared_ptr<Buffer>& data, int32_t null_count = 0,
+      const std::shared_ptr<Buffer>& null_bitmap = nullptr);
 
   std::shared_ptr<Buffer> data_;
   const uint8_t* raw_data_;
 };
@@ -137,8 +168,9 @@ class ARROW_EXPORT NumericArray : public PrimitiveArray {
       int32_t null_count = 0, const std::shared_ptr<Buffer>& null_bitmap = nullptr)
       : PrimitiveArray(
             std::make_shared<TypeClass>(), length, data, null_count, null_bitmap) {}
-  NumericArray(const TypePtr& type, int32_t length, const std::shared_ptr<Buffer>& data,
-      int32_t null_count = 0, const std::shared_ptr<Buffer>& null_bitmap = nullptr)
+  NumericArray(const std::shared_ptr<DataType>& type, int32_t length,
+      const std::shared_ptr<Buffer>& data, int32_t null_count = 0,
+      const std::shared_ptr<Buffer>& null_bitmap = nullptr)
       : PrimitiveArray(type, length, data, null_count, null_bitmap) {}
 
   bool EqualsExact(const NumericArray& other) const {
@@ -146,7 +178,7 @@ class ARROW_EXPORT NumericArray : public PrimitiveArray {
   }
 
   bool ApproxEquals(const std::shared_ptr<Array>& arr) const override {
-    return Equals(arr);
+    return PrimitiveArray::Equals(arr);
   }
 
   bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx,
@@ -250,8 +282,9 @@ class ARROW_EXPORT BooleanArray : public PrimitiveArray {
   BooleanArray(int32_t length, const std::shared_ptr<Buffer>& data,
       int32_t null_count = 0, const std::shared_ptr<Buffer>& null_bitmap = nullptr);
-  BooleanArray(const TypePtr& type, int32_t length, const std::shared_ptr<Buffer>& data,
-      int32_t null_count = 0, const std::shared_ptr<Buffer>& null_bitmap = nullptr);
+  BooleanArray(const std::shared_ptr<DataType>& type, int32_t length,
+      const std::shared_ptr<Buffer>& data, int32_t null_count = 0,
+      const std::shared_ptr<Buffer>& null_bitmap = nullptr);
 
   bool EqualsExact(const BooleanArray& other) const;
   bool Equals(const std::shared_ptr<Array>& arr) const override;
@@ -272,9 +305,9 @@ class ARROW_EXPORT ListArray : public Array {
  public:
   using TypeClass = ListType;
 
-  ListArray(const TypePtr& type, int32_t length, const std::shared_ptr<Buffer>& offsets,
-      const std::shared_ptr<Array>& values, int32_t null_count = 0,
-      const std::shared_ptr<Buffer>& null_bitmap = nullptr)
+  ListArray(const std::shared_ptr<DataType>& type, int32_t length,
+      const std::shared_ptr<Buffer>& offsets, const std::shared_ptr<Array>& values,
+      int32_t null_count = 0, const std::shared_ptr<Buffer>& null_bitmap = nullptr)
       : Array(type, length, null_count, null_bitmap) {
     offsets_buffer_ = offsets;
     offsets_ = offsets == nullptr ? nullptr : reinterpret_cast<const int32_t*>(
@@ -328,9 +361,9 @@ class ARROW_EXPORT BinaryArray : public Array {
 
   // Constructor that allows sub-classes/builders to propagate their logical type up the
   // class hierarchy.
-  BinaryArray(const TypePtr& type, int32_t length, const std::shared_ptr<Buffer>& offsets,
-      const std::shared_ptr<Buffer>& data, int32_t null_count = 0,
-      const std::shared_ptr<Buffer>& null_bitmap = nullptr);
+  BinaryArray(const std::shared_ptr<DataType>& type, int32_t length,
+      const std::shared_ptr<Buffer>& offsets, const std::shared_ptr<Buffer>& data,
+      int32_t null_count = 0, const std::shared_ptr<Buffer>& null_bitmap = nullptr);
 
   // Return the pointer to the given element's bytes
   // TODO(emkornfield) introduce a StringPiece or something similar to capture zero-copy
@@ -397,7 +430,7 @@ class ARROW_EXPORT StructArray : public Array {
  public:
   using TypeClass = StructType;
 
-  StructArray(const TypePtr& type, int32_t length,
+  StructArray(const std::shared_ptr<DataType>& type, int32_t length,
       const std::vector<std::shared_ptr<Array>>& field_arrays, int32_t null_count = 0,
       std::shared_ptr<Buffer> null_bitmap = nullptr)
       : Array(type, length, null_count, null_bitmap) {
@@ -434,7 +467,7 @@ class ARROW_EXPORT UnionArray : public Array {
  public:
   using TypeClass = UnionType;
 
-  UnionArray(const TypePtr& type, int32_t length,
+  UnionArray(const std::shared_ptr<DataType>& type, int32_t length,
       const std::vector<std::shared_ptr<Array>>& children,
       const std::shared_ptr<Buffer>& type_ids,
       const std::shared_ptr<Buffer>& offsets = nullptr, int32_t null_count = 0,
@@ -473,6 +506,54 @@ class ARROW_EXPORT UnionArray : public Array {
   const int32_t* offsets_;
 };
 
+// ----------------------------------------------------------------------
+// DictionaryArray (categorical and dictionary-encoded in memory)
+
+// A dictionary array contains an array of non-negative integers (the
+// "dictionary indices") along with a data type containing a "dictionary"
+// corresponding to the distinct values represented in the data.
+//
+// For example, the array
+//
+//   ["foo", "bar", "foo", "bar", "foo", "bar"]
+//
+// with dictionary ["bar", "foo"], would have dictionary array representation
+//
+//   indices: [1, 0, 1, 0, 1, 0]
+//   dictionary: ["bar", "foo"]
+//
+// The indices in principle may have any integer type (signed or unsigned),
+// though presently data in IPC exchanges must be signed int32.
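+//
+// (Editorial sketch, not part of the original patch; it assumes the
+// dictionary() factory and the ArrayFromVector test helpers introduced
+// elsewhere in this change.) Building the example above programmatically:
+//
+//   std::shared_ptr<Array> dict;
+//   std::vector<std::string> dict_values = {"bar", "foo"};
+//   ArrayFromVector<StringType, std::string>(utf8(), dict_values, &dict);
+//
+//   std::shared_ptr<Array> indices;
+//   std::vector<int16_t> index_values = {1, 0, 1, 0, 1, 0};
+//   ArrayFromVector<Int16Type, int16_t>(int16(), index_values, &indices);
+//
+//   auto arr = std::make_shared<DictionaryArray>(dictionary(int16(), dict), indices);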
+class ARROW_EXPORT DictionaryArray : public Array {
+ public:
+  using TypeClass = DictionaryType;
+
+  DictionaryArray(
+      const std::shared_ptr<DataType>& type, const std::shared_ptr<Array>& indices);
+
+  // Alternate ctor; other attributes (like null count) are inherited from the
+  // passed indices array
+  static Status FromBuffer(const std::shared_ptr<DataType>& type, int32_t length,
+      const std::shared_ptr<Buffer>& indices, int32_t null_count,
+      const std::shared_ptr<Buffer>& null_bitmap, std::shared_ptr<DictionaryArray>* out);
+
+  Status Validate() const override;
+
+  std::shared_ptr<Array> indices() const { return indices_; }
+  std::shared_ptr<Array> dictionary() const;
+
+  bool EqualsExact(const DictionaryArray& other) const;
+  bool Equals(const std::shared_ptr<Array>& arr) const override;
+  bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx,
+      const std::shared_ptr<Array>& arr) const override;
+
+  Status Accept(ArrayVisitor* visitor) const override;
+
+ protected:
+  const DictionaryType* dict_type_;
+  std::shared_ptr<Array> indices_;
+};
+
 // ----------------------------------------------------------------------
 // extern templates and other details
 
diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc
index 9bfd11fd01b5a..2b5ef11f861af 100644
--- a/cpp/src/arrow/ipc/adapter.cc
+++ b/cpp/src/arrow/ipc/adapter.cc
@@ -288,6 +288,13 @@ class RecordBatchWriter : public ArrayVisitor {
     return Status::OK();
   }
 
+  Status Visit(const DictionaryArray& array) override {
+    // Dictionary written out separately
+    const auto& indices = static_cast<const PrimitiveArray&>(*array.indices().get());
+    buffers_.push_back(indices.data());
+    return Status::OK();
+  }
+
   // Do not copy this vector. Ownership must be retained elsewhere
   const std::vector<std::shared_ptr<Array>>& columns_;
   int32_t num_rows_;
@@ -539,6 +546,10 @@ class ArrayLoader : public TypeVisitor {
         type_ids, offsets, field_meta.null_count, null_bitmap);
     return Status::OK();
   }
+
+  Status Visit(const DictionaryType& type) override {
+    return Status::NotImplemented("dictionary");
+  };
 };
 
 class RecordBatchReader {
diff --git a/cpp/src/arrow/ipc/json-internal.cc b/cpp/src/arrow/ipc/json-internal.cc
index 4f980d3e5d157..43bd8a4a4e814 100644
--- a/cpp/src/arrow/ipc/json-internal.cc
+++ b/cpp/src/arrow/ipc/json-internal.cc
@@ -334,6 +334,14 @@ class JsonSchemaWriter : public TypeVisitor {
     return Status::OK();
   }
 
+  Status Visit(const DictionaryType& type) override {
+    // WriteName("dictionary", type);
+    // WriteChildren(type.children());
+    // WriteBufferLayout(type.GetBufferLayout());
+    // return Status::OK();
+    return Status::NotImplemented("dictionary type");
+  }
+
 private:
  const Schema& schema_;
  RjWriter* writer_;
@@ -546,6 +554,10 @@ class JsonArrayWriter : public ArrayVisitor {
     return WriteChildren(type->children(), array.children());
   }
 
+  Status Visit(const DictionaryArray& array) override {
+    return Status::NotImplemented("dictionary");
+  }
+
 private:
  const std::string& name_;
  const Array& array_;
@@ -1043,6 +1055,7 @@ class JsonArrayReader {
       TYPE_CASE(ListType);
       TYPE_CASE(StructType);
       TYPE_CASE(UnionType);
+      NOT_IMPLEMENTED_CASE(DICTIONARY);
       default:
         std::stringstream ss;
         ss << type->ToString();
diff --git a/cpp/src/arrow/pretty_print-test.cc b/cpp/src/arrow/pretty_print-test.cc
index c22d3aa632b9d..4725d5dd808ee 100644
--- a/cpp/src/arrow/pretty_print-test.cc
+++ b/cpp/src/arrow/pretty_print-test.cc
@@ -34,7 +34,7 @@
 namespace arrow {
 
-class TestArrayPrinter : public ::testing::Test {
+class TestPrettyPrint : public ::testing::Test {
  public:
   void SetUp() {}
 
@@ -44,32 +44,22 @@ class TestArrayPrinter : public ::testing::Test {
   std::ostringstream sink_;
 };
 
+void CheckArray(const Array& arr, int indent, const char* expected) {
+  std::ostringstream sink;
+  ASSERT_OK(PrettyPrint(arr, indent, &sink));
+  std::string result = sink.str();
+  ASSERT_EQ(std::string(expected, strlen(expected)), result);
+}
+
 template <typename TYPE, typename T>
 void CheckPrimitive(int indent, const std::vector<bool>& is_valid,
     const std::vector<T>& values, const char* expected) {
-  std::ostringstream sink;
-
-  MemoryPool* pool = default_memory_pool();
-  typename TypeTraits<TYPE>::BuilderType builder(pool, std::make_shared<TYPE>());
-
-  for (size_t i = 0; i < values.size(); ++i) {
-    if (is_valid[i]) {
-      ASSERT_OK(builder.Append(values[i]));
-    } else {
-      ASSERT_OK(builder.AppendNull());
-    }
-  }
   std::shared_ptr<Array> array;
-  ASSERT_OK(builder.Finish(&array));
-
-  ASSERT_OK(PrettyPrint(*array.get(), indent, &sink));
-
-  std::string result = sink.str();
-  ASSERT_EQ(std::string(expected, strlen(expected)), result);
+  ArrayFromVector<TYPE, T>(std::make_shared<TYPE>(), is_valid, values, &array);
+  CheckArray(*array.get(), indent, expected);
 }
 
-TEST_F(TestArrayPrinter, PrimitiveType) {
+TEST_F(TestPrettyPrint, PrimitiveType) {
   std::vector<bool> is_valid = {true, true, false, true, false};
 
   std::vector<int32_t> values = {0, 1, 2, 3, 4};
@@ -81,4 +71,25 @@ TEST_F(TestArrayPrinter, PrimitiveType) {
   CheckPrimitive(0, is_valid, values2, ex2);
 }
 
+TEST_F(TestPrettyPrint, DictionaryType) {
+  std::vector<bool> is_valid = {true, true, false, true, true, true};
+
+  std::shared_ptr<Array> dict;
+  std::vector<std::string> dict_values = {"foo", "bar", "baz"};
+  ArrayFromVector<StringType, std::string>(utf8(), dict_values, &dict);
+  std::shared_ptr<DataType> dict_type = dictionary(int16(), dict);
+
+  std::shared_ptr<Array> indices;
+  std::vector<int16_t> indices_values = {1, 2, -1, 0, 2, 0};
+  ArrayFromVector<Int16Type, int16_t>(int16(), is_valid, indices_values, &indices);
+  auto arr = std::make_shared<DictionaryArray>(dict_type, indices);
+
+  static const char* expected = R"expected(
+-- is_valid: [true, true, false, true, true, true]
+-- dictionary: ["foo", "bar", "baz"]
+-- indices: [1, 2, null, 0, 2, 0])expected";
+
+  CheckArray(*arr.get(), 0, expected);
+}
+
 } // namespace arrow
diff --git a/cpp/src/arrow/pretty_print.cc b/cpp/src/arrow/pretty_print.cc
index 324f81bfbfd6b..e30f4cc58d7ab 100644
--- a/cpp/src/arrow/pretty_print.cc
+++ b/cpp/src/arrow/pretty_print.cc
@@ -217,6 +217,18 @@ class ArrayPrinter : public ArrayVisitor {
     return PrintChildren(array.children());
   }
 
+  Status Visit(const DictionaryArray& array) override {
+    RETURN_NOT_OK(WriteValidityBitmap(array));
+
+    Newline();
+    Write("-- dictionary: ");
+    RETURN_NOT_OK(PrettyPrint(*array.dictionary().get(), indent_ + 2, sink_));
+
+    Newline();
+    Write("-- indices: ");
+    return PrettyPrint(*array.indices().get(), indent_ + 2, sink_);
+  }
+
   void Write(const char* data) { (*sink_) << data; }
   void Write(const std::string& data) { (*sink_) << data; }
diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h
index 70e933365cfdf..e5957490e0733 100644
--- a/cpp/src/arrow/test-util.h
+++ b/cpp/src/arrow/test-util.h
@@ -257,33 +257,27 @@
 template <typename TYPE, typename T>
 void ArrayFromVector(const std::shared_ptr<DataType>& type,
     const std::vector<bool>& is_valid, const std::vector<T>& values,
     std::shared_ptr<Array>* out) {
-  std::shared_ptr<Buffer> values_buffer;
-  std::shared_ptr<Buffer> values_bitmap;
-
-  ASSERT_OK(test::CopyBufferFromVector(values, &values_buffer));
-  ASSERT_OK(test::GetBitmapFromBoolVector(is_valid, &values_bitmap));
-
-  using ArrayType = typename TypeTraits<TYPE>::ArrayType;
-
-  int32_t null_count = 0;
-  for (bool val : is_valid) {
-    if (!val) { ++null_count; }
+  MemoryPool* pool = default_memory_pool();
+  typename TypeTraits<TYPE>::BuilderType builder(pool, std::make_shared<TYPE>());
+  for (size_t i = 0; i < values.size(); ++i) {
+    if (is_valid[i]) {
+      ASSERT_OK(builder.Append(values[i]));
+    } else {
+      ASSERT_OK(builder.AppendNull());
+    }
   }
-
-  *out = std::make_shared<ArrayType>(type, static_cast<int32_t>(values.size()),
-      values_buffer, null_count, values_bitmap);
+  ASSERT_OK(builder.Finish(out));
 }
 
 template <typename TYPE, typename T>
 void ArrayFromVector(const std::shared_ptr<DataType>& type, const std::vector<T>& values,
     std::shared_ptr<Array>* out) {
-  std::shared_ptr<Buffer> values_buffer;
-
-  ASSERT_OK(test::CopyBufferFromVector(values, &values_buffer));
-
-  using ArrayType = typename TypeTraits<TYPE>::ArrayType;
-  *out = std::make_shared<ArrayType>(
-      type, static_cast<int32_t>(values.size()), values_buffer);
+  MemoryPool* pool = default_memory_pool();
+  typename TypeTraits<TYPE>::BuilderType builder(pool, std::make_shared<TYPE>());
+  for (size_t i = 0; i < values.size(); ++i) {
+    ASSERT_OK(builder.Append(values[i]));
+  }
+  ASSERT_OK(builder.Finish(out));
 }
 
 class TestBuilder : public ::testing::Test {
diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc
index 89faab6ec6ae2..954fba7af9df9 100644
--- a/cpp/src/arrow/type.cc
+++ b/cpp/src/arrow/type.cc
@@ -20,10 +20,22 @@
 
 #include <sstream>
 #include <string>
 
+#include "arrow/array.h"
 #include "arrow/status.h"
+#include "arrow/util/logging.h"
 
 namespace arrow {
 
+bool Field::Equals(const Field& other) const {
+  return (this == &other) ||
+         (this->name == other.name && this->nullable == other.nullable &&
+             this->dictionary == other.dictionary && this->type->Equals(*other.type.get()));
+}
+
+bool Field::Equals(const std::shared_ptr<Field>& other) const {
+  return Equals(*other.get());
+}
+
 std::string Field::ToString() const {
   std::stringstream ss;
   ss << this->name << ": " << this->type->ToString();
@@ -33,14 +45,14 @@ std::string Field::ToString() const {
 
 DataType::~DataType() {}
 
-bool DataType::Equals(const DataType* other) const {
-  bool equals = other && ((this == other) ||
-                             ((this->type == other->type) &&
-                                 ((this->num_children() == other->num_children()))));
+bool DataType::Equals(const DataType& other) const {
+  bool equals =
+      ((this == &other) || ((this->type == other.type) &&
+                               ((this->num_children() == other.num_children()))));
   if (equals) {
     for (int i = 0; i < num_children(); ++i) {
       // TODO(emkornfield) limit recursion
-      if (!children_[i]->Equals(other->children_[i])) { return false; }
+      if (!children_[i]->Equals(other.children_[i])) { return false; }
     }
   }
   return equals;
@@ -109,11 +121,47 @@ std::string UnionType::ToString() const {
   return s.str();
 }
 
+// ----------------------------------------------------------------------
+// DictionaryType
+
+DictionaryType::DictionaryType(
+    const std::shared_ptr<DataType>& index_type, const std::shared_ptr<Array>& dictionary)
+    : FixedWidthType(Type::DICTIONARY),
+      index_type_(index_type),
+      dictionary_(dictionary) {}
+
+int DictionaryType::bit_width() const {
+  return static_cast<const FixedWidthType*>(index_type_.get())->bit_width();
+}
+
+std::shared_ptr<Array> DictionaryType::dictionary() const {
+  return dictionary_;
+}
+
+bool DictionaryType::Equals(const DataType& other) const {
+  if (other.type != Type::DICTIONARY) { return false; }
+  const auto& other_dict = static_cast<const DictionaryType&>(other);
+
+  return index_type_->Equals(other_dict.index_type_) &&
+         dictionary_->Equals(other_dict.dictionary_);
+}
+
+std::string DictionaryType::ToString() const {
+  std::stringstream ss;
+  ss << "dictionary<" << dictionary_->type()->ToString() << ", "
+     << index_type_->ToString() << ">";
+  return ss.str();
+}
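+// (Editorial note, not part of the original patch.) For example, a dictionary
+// of utf8 values with int16 indices renders as "dictionary<string, int16>";
+// the Basics test earlier in this patch expects "dictionary<int32, int16>"
+// for an int32-valued dictionary.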
+
+// ----------------------------------------------------------------------
+// Null type
+
 std::string NullType::ToString() const { return name(); }
 
-// Visitors and template instantiation
+// ----------------------------------------------------------------------
+// Visitors and factory functions
 
 #define ACCEPT_VISITOR(TYPE) \
   Status TYPE::Accept(TypeVisitor* visitor) const { return visitor->Visit(*this); }
@@ -130,6 +178,7 @@ ACCEPT_VISITOR(DateType);
 ACCEPT_VISITOR(TimeType);
 ACCEPT_VISITOR(TimestampType);
 ACCEPT_VISITOR(IntervalType);
+ACCEPT_VISITOR(DictionaryType);
 
 #define TYPE_FACTORY(NAME, KLASS)   \
   std::shared_ptr<DataType> NAME() { \
@@ -174,12 +223,16 @@ std::shared_ptr<DataType> struct_(const std::vector<std::shared_ptr<Field>>& fields) {
   return std::make_shared<StructType>(fields);
 }
 
-std::shared_ptr<DataType> ARROW_EXPORT union_(
-    const std::vector<std::shared_ptr<Field>>& child_fields,
+std::shared_ptr<DataType> union_(const std::vector<std::shared_ptr<Field>>& child_fields,
     const std::vector<uint8_t>& type_ids, UnionMode mode) {
   return std::make_shared<UnionType>(child_fields, type_ids, mode);
 }
 
+std::shared_ptr<DataType> dictionary(const std::shared_ptr<DataType>& index_type,
+    const std::shared_ptr<Array>& dict_values) {
+  return std::make_shared<DictionaryType>(index_type, dict_values);
+}
+
 std::shared_ptr<Field> field(
     const std::string& name, const TypePtr& type, bool nullable, int64_t dictionary) {
   return std::make_shared<Field>(name, type, nullable, dictionary);
diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h
index 530c3235dc9ab..c2a762d279364 100644
--- a/cpp/src/arrow/type.h
+++ b/cpp/src/arrow/type.h
@@ -37,67 +37,64 @@ namespace arrow {
 
 struct Type {
   enum type {
     // A degenerate NULL type represented as 0 bytes/bits
-    NA = 0,
+    NA,
 
     // A boolean value represented as 1 bit
-    BOOL = 1,
+    BOOL,
 
     // Little-endian integer types
-    UINT8 = 2,
-    INT8 = 3,
-    UINT16 = 4,
-    INT16 = 5,
-    UINT32 = 6,
-    INT32 = 7,
-    UINT64 = 8,
-    INT64 = 9,
+    UINT8,
+    INT8,
+    UINT16,
+    INT16,
+    UINT32,
+    INT32,
+    UINT64,
+    INT64,
 
     // 2-byte floating point value
-    HALF_FLOAT = 10,
+    HALF_FLOAT,
 
     // 4-byte floating point value
-    FLOAT = 11,
+    FLOAT,
 
     // 8-byte floating point value
-    DOUBLE = 12,
+    DOUBLE,
 
     // UTF8 variable-length string as List<Char>
-    STRING = 13,
+    STRING,
 
     // Variable-length bytes (no guarantee of UTF8-ness)
-    BINARY = 14,
+    BINARY,
 
     // By default, int32 days since the UNIX epoch
-    DATE = 16,
+    DATE,
 
     // Exact timestamp encoded with int64 since UNIX epoch
    // Default unit millisecond
-    TIMESTAMP = 17,
+    TIMESTAMP,
 
     // Exact time encoded with int64, default unit millisecond
-    TIME = 18,
+    TIME,
 
     // YEAR_MONTH or DAY_TIME interval in SQL style
-    INTERVAL = 19,
+    INTERVAL,
 
     // Precision- and scale-based decimal type. Storage type depends on the
     // parameters.
-    DECIMAL = 20,
+    DECIMAL,
 
     // A list of some logical data type
-    LIST = 30,
+    LIST,
 
     // Struct of logical types
-    STRUCT = 31,
+    STRUCT,
 
     // Unions of logical types
-    UNION = 32,
+    UNION,
 
-    // Timestamp as double seconds since the UNIX epoch
-    TIMESTAMP_DOUBLE = 33,
-
-    // Decimal value encoded as a text string
-    DECIMAL_TEXT = 34,
+    // Dictionary aka Category type
+    DICTIONARY
   };
 };
 
@@ -115,6 +112,34 @@ class BufferDescr {
   int bit_width_;
 };
 
+class TypeVisitor {
+ public:
+  virtual Status Visit(const NullType& type) = 0;
+  virtual Status Visit(const BooleanType& type) = 0;
+  virtual Status Visit(const Int8Type& type) = 0;
+  virtual Status Visit(const Int16Type& type) = 0;
+  virtual Status Visit(const Int32Type& type) = 0;
+  virtual Status Visit(const Int64Type& type) = 0;
+  virtual Status Visit(const UInt8Type& type) = 0;
+  virtual Status Visit(const UInt16Type& type) = 0;
+  virtual Status Visit(const UInt32Type& type) = 0;
+  virtual Status Visit(const UInt64Type& type) = 0;
+  virtual Status Visit(const HalfFloatType& type) = 0;
+  virtual Status Visit(const FloatType& type) = 0;
+  virtual Status Visit(const DoubleType& type) = 0;
+  virtual Status Visit(const StringType& type) = 0;
+  virtual Status Visit(const BinaryType& type) = 0;
+  virtual Status Visit(const DateType& type) = 0;
+  virtual Status Visit(const TimeType& type) = 0;
+  virtual Status Visit(const TimestampType& type) = 0;
+  virtual Status Visit(const IntervalType& type) = 0;
+  virtual Status Visit(const DecimalType& type) = 0;
+  virtual Status Visit(const ListType& type) = 0;
+  virtual Status Visit(const StructType& type) = 0;
+  virtual Status Visit(const UnionType& type) = 0;
+  virtual Status Visit(const DictionaryType& type) = 0;
+};
+
 struct ARROW_EXPORT DataType {
   Type::type type;
 
@@ -128,10 +153,10 @@ struct ARROW_EXPORT DataType {
   //
   // Types that are logically convertible from one to another (e.g. List<UInt8>
   // and Binary) are NOT equal.
-  virtual bool Equals(const DataType* other) const;
+  virtual bool Equals(const DataType& other) const;
 
   bool Equals(const std::shared_ptr<DataType>& other) const {
-    return Equals(other.get());
+    return Equals(*other.get());
   }
 
   std::shared_ptr<Field> child(int i) const { return children_[i]; }
@@ -189,16 +214,9 @@ struct ARROW_EXPORT Field {
       : name(name), type(type), nullable(nullable), dictionary(dictionary) {}
 
   bool operator==(const Field& other) const { return this->Equals(other); }
-
   bool operator!=(const Field& other) const { return !this->Equals(other); }
-
-  bool Equals(const Field& other) const {
-    return (this == &other) ||
-           (this->name == other.name && this->nullable == other.nullable &&
-               this->dictionary == dictionary && this->type->Equals(other.type.get()));
-  }
-
-  bool Equals(const std::shared_ptr<Field>& other) const { return Equals(*other.get()); }
+  bool Equals(const Field& other) const;
+  bool Equals(const std::shared_ptr<Field>& other) const;
 
   std::string ToString() const;
 };
@@ -414,6 +432,9 @@ struct ARROW_EXPORT UnionType : public DataType {
   std::vector<uint8_t> type_ids;
 };
 
+// ----------------------------------------------------------------------
+// Date and time types
+
 struct ARROW_EXPORT DateType : public FixedWidthType {
   static constexpr Type::type type_id = Type::DATE;
 
@@ -488,6 +509,35 @@ struct ARROW_EXPORT IntervalType : public FixedWidthType {
   static std::string name() { return "date"; }
 };
 
+// ----------------------------------------------------------------------
+// DictionaryType (for categorical or dictionary-encoded data)
+
+class ARROW_EXPORT DictionaryType : public FixedWidthType {
+ public:
+  static constexpr Type::type type_id = Type::DICTIONARY;
+
+  DictionaryType(const std::shared_ptr<DataType>& index_type,
+      const std::shared_ptr<Array>& dictionary);
+
+  int bit_width() const override;
+
+  std::shared_ptr<DataType> index_type() const { return index_type_; }
+
+  std::shared_ptr<Array> dictionary() const;
+
+  bool Equals(const DataType& other) const override;
+
+  Status Accept(TypeVisitor* visitor) const override;
+  std::string ToString() const override;
+
+ private:
+  // Must be an integer type (not currently checked)
+  std::shared_ptr<DataType> index_type_;
+
+  std::shared_ptr<Array> dictionary_;
+};
+
+// ----------------------------------------------------------------------
 // Factory functions
 
 std::shared_ptr<DataType> ARROW_EXPORT null();
@@ -520,9 +570,44 @@ std::shared_ptr<DataType> ARROW_EXPORT union_(
     const std::vector<std::shared_ptr<Field>>& child_fields,
     const std::vector<uint8_t>& type_ids, UnionMode mode = UnionMode::SPARSE);
 
+std::shared_ptr<DataType> ARROW_EXPORT dictionary(
+    const std::shared_ptr<DataType>& index_type, const std::shared_ptr<Array>& values);
+
 std::shared_ptr<Field> ARROW_EXPORT field(const std::string& name,
     const std::shared_ptr<DataType>& type, bool nullable = true, int64_t dictionary = 0);
 
+// ----------------------------------------------------------------------
+//
+
+static inline bool is_integer(Type::type type_id) {
+  switch (type_id) {
+    case Type::UINT8:
+    case Type::INT8:
+    case Type::UINT16:
+    case Type::INT16:
+    case Type::UINT32:
+    case Type::INT32:
+    case Type::UINT64:
+    case Type::INT64:
+      return true;
+    default:
+      break;
+  }
+  return false;
+}
+
+static inline bool is_floating(Type::type type_id) {
+  switch (type_id) {
+    case Type::HALF_FLOAT:
+    case Type::FLOAT:
+    case Type::DOUBLE:
+      return true;
+    default:
+      break;
+  }
+  return false;
+}
+
 } // namespace arrow
 
 #endif // ARROW_TYPE_H
diff --git a/cpp/src/arrow/type_fwd.h b/cpp/src/arrow/type_fwd.h
index a14c535b9b3f1..334abef664426 100644
--- a/cpp/src/arrow/type_fwd.h
+++ b/cpp/src/arrow/type_fwd.h
@@ -32,6 +32,9 @@ class
MemoryPool; class RecordBatch; class Schema; +class DictionaryType; +class DictionaryArray; + struct NullType; class NullArray; @@ -101,60 +104,6 @@ using TimestampBuilder = NumericBuilder; struct IntervalType; using IntervalArray = NumericArray; -class TypeVisitor { - public: - virtual Status Visit(const NullType& type) = 0; - virtual Status Visit(const BooleanType& type) = 0; - virtual Status Visit(const Int8Type& type) = 0; - virtual Status Visit(const Int16Type& type) = 0; - virtual Status Visit(const Int32Type& type) = 0; - virtual Status Visit(const Int64Type& type) = 0; - virtual Status Visit(const UInt8Type& type) = 0; - virtual Status Visit(const UInt16Type& type) = 0; - virtual Status Visit(const UInt32Type& type) = 0; - virtual Status Visit(const UInt64Type& type) = 0; - virtual Status Visit(const HalfFloatType& type) = 0; - virtual Status Visit(const FloatType& type) = 0; - virtual Status Visit(const DoubleType& type) = 0; - virtual Status Visit(const StringType& type) = 0; - virtual Status Visit(const BinaryType& type) = 0; - virtual Status Visit(const DateType& type) = 0; - virtual Status Visit(const TimeType& type) = 0; - virtual Status Visit(const TimestampType& type) = 0; - virtual Status Visit(const IntervalType& type) = 0; - virtual Status Visit(const DecimalType& type) = 0; - virtual Status Visit(const ListType& type) = 0; - virtual Status Visit(const StructType& type) = 0; - virtual Status Visit(const UnionType& type) = 0; -}; - -class ArrayVisitor { - public: - virtual Status Visit(const NullArray& array) = 0; - virtual Status Visit(const BooleanArray& array) = 0; - virtual Status Visit(const Int8Array& array) = 0; - virtual Status Visit(const Int16Array& array) = 0; - virtual Status Visit(const Int32Array& array) = 0; - virtual Status Visit(const Int64Array& array) = 0; - virtual Status Visit(const UInt8Array& array) = 0; - virtual Status Visit(const UInt16Array& array) = 0; - virtual Status Visit(const UInt32Array& array) = 0; - virtual Status Visit(const UInt64Array& array) = 0; - virtual Status Visit(const HalfFloatArray& array) = 0; - virtual Status Visit(const FloatArray& array) = 0; - virtual Status Visit(const DoubleArray& array) = 0; - virtual Status Visit(const StringArray& array) = 0; - virtual Status Visit(const BinaryArray& array) = 0; - virtual Status Visit(const DateArray& array) = 0; - virtual Status Visit(const TimeArray& array) = 0; - virtual Status Visit(const TimestampArray& array) = 0; - virtual Status Visit(const IntervalArray& array) = 0; - virtual Status Visit(const DecimalArray& array) = 0; - virtual Status Visit(const ListArray& array) = 0; - virtual Status Visit(const StructArray& array) = 0; - virtual Status Visit(const UnionArray& array) = 0; -}; - } // namespace arrow #endif // ARROW_TYPE_FWD_H diff --git a/format/Message.fbs b/format/Message.fbs index d07d0666dce87..b2c64649f2687 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -256,7 +256,7 @@ table RecordBatch { /// For sending dictionary encoding information. Any Field can be /// dictionary-encoded, but in this case none of its children may be /// dictionary-encoded. 
-/// There is one dictionary batch per dictionary +/// There is one vector / column per dictionary /// table DictionaryBatch { diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 40fb60de07ed3..3cdfe49a4e7a7 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -55,7 +55,8 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass CDataType" arrow::DataType": Type type - c_bool Equals(const CDataType* other) + c_bool Equals(const shared_ptr[CDataType]& other) + c_bool Equals(const CDataType& other) c_string ToString() diff --git a/python/pyarrow/includes/parquet.pxd b/python/pyarrow/includes/parquet.pxd index b4d127c871e09..d9e121dd8853e 100644 --- a/python/pyarrow/includes/parquet.pxd +++ b/python/pyarrow/includes/parquet.pxd @@ -98,7 +98,7 @@ cdef extern from "parquet/api/reader.h" namespace "parquet" nogil: # TODO: Some default arguments are missing @staticmethod unique_ptr[ParquetFileReader] OpenFile(const c_string& path) - const FileMetaData* metadata(); + shared_ptr[FileMetaData] metadata(); cdef extern from "parquet/api/writer.h" namespace "parquet" nogil: diff --git a/python/pyarrow/parquet.pyx b/python/pyarrow/parquet.pyx index 7379456feef2b..c0921859e440d 100644 --- a/python/pyarrow/parquet.pyx +++ b/python/pyarrow/parquet.pyx @@ -98,8 +98,8 @@ cdef class ParquetReader: Integer index of the position of the column """ cdef: - const FileMetaData* metadata = (self.reader.get() - .parquet_reader().metadata()) + const FileMetaData* metadata = (self.reader.get().parquet_reader() + .metadata().get()) int i = 0 if self.column_idx_map is None: diff --git a/python/pyarrow/schema.pyx b/python/pyarrow/schema.pyx index 7a69b0f12391a..d91ae7cb3b193 100644 --- a/python/pyarrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ -45,9 +45,9 @@ cdef class DataType: def __richcmp__(DataType self, DataType other, int op): if op == cpython.Py_EQ: - return self.type.Equals(other.type) + return self.type.Equals(other.sp_type) elif op == cpython.Py_NE: - return not self.type.Equals(other.type) + return not self.type.Equals(other.sp_type) else: raise TypeError('Invalid comparison') From 7d1f1cf91b798259de18ebd772b213e12a6dd194 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sat, 7 Jan 2017 16:16:31 -0500 Subject: [PATCH 0263/1644] ARROW-360: C++: Add method to shrink PoolBuffer using realloc Added no explicit unit test for this as I want to keep the freedom to the allocator in the future to advise the PoolBuffer on an acceptable minimal size. In some cases it might be worth it to occupy a whole page. Resizing to a smaller size is tested though, so we already have a unit test ensuring that this code runs smoothly. Author: Uwe L. Korn Closes #272 from xhochy/ARROW-360 and squashes the following commits: f4992ee [Uwe L. Korn] Adjust DCHECK for zero size arrays 040d3b4 [Uwe L. 
Korn] ARROW-360: C++: Add method to shrink PoolBuffer using realloc
---
 cpp/src/arrow/buffer.cc   | 20 +++++++++++++++++++-
 cpp/src/arrow/test-util.h |  2 +-
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/cpp/src/arrow/buffer.cc b/cpp/src/arrow/buffer.cc
index 6d55f88af1e32..2e64ffd75c263 100644
--- a/cpp/src/arrow/buffer.cc
+++ b/cpp/src/arrow/buffer.cc
@@ -92,7 +92,25 @@ Status PoolBuffer::Reserve(int64_t new_capacity) {
 }
 
 Status PoolBuffer::Resize(int64_t new_size) {
-  RETURN_NOT_OK(Reserve(new_size));
+  if (new_size > size_) {
+    RETURN_NOT_OK(Reserve(new_size));
+  } else {
+    // Buffer is not growing, so shrink to the requested size without
+    // excess space.
+    if (capacity_ != new_size) {
+      // Buffer does not yet have the requested size.
+      if (new_size == 0) {
+        pool_->Free(mutable_data_, capacity_);
+        capacity_ = 0;
+        mutable_data_ = nullptr;
+        data_ = nullptr;
+      } else {
+        RETURN_NOT_OK(pool_->Reallocate(capacity_, new_size, &mutable_data_));
+        data_ = mutable_data_;
+        capacity_ = new_size;
+      }
+    }
+  }
   size_ = new_size;
   return Status::OK();
 }
diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h
index e5957490e0733..f2da824084775 100644
--- a/cpp/src/arrow/test-util.h
+++ b/cpp/src/arrow/test-util.h
@@ -184,7 +184,7 @@ static inline void random_ascii(int n, uint32_t seed, uint8_t* out) {
 
 template <typename T>
 void rand_uniform_int(int n, uint32_t seed, T min_value, T max_value, T* out) {
-  DCHECK(out);
+  DCHECK(out || (n == 0));
   std::mt19937 gen(seed);
   std::uniform_int_distribution<T> d(min_value, max_value);
   for (int i = 0; i < n; ++i) {

From 1094d89d4094ab3209c2df15826d8e7d3758df97 Mon Sep 17 00:00:00 2001
From: "Uwe L. Korn"
Date: Sat, 7 Jan 2017 16:29:46 -0500
Subject: [PATCH 0264/1644] ARROW-463: C++: Support jemalloc 4.x

This also fixes some minor bugs in the CMakeLists for jemalloc.

Author: Uwe L. Korn

Closes #273 from xhochy/ARROW-463 and squashes the following commits:

d12a97c [Uwe L. Korn] ARROW-463: C++: Support jemalloc 4.x
---
 cpp/CMakeLists.txt                    | 4 +++-
 cpp/src/arrow/jemalloc/memory_pool.cc | 2 ++
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt
index 419691b4b68b2..3522e5c5a7640 100644
--- a/cpp/CMakeLists.txt
+++ b/cpp/CMakeLists.txt
@@ -244,7 +244,7 @@ endfunction()
 
 # A wrapper for target_link_libraries() that is compatible with NO_BENCHMARKS.
 function(ARROW_BENCHMARK_LINK_LIBRARIES REL_BENCHMARK_NAME)
-  if(NO_TESTS)
+  if(NO_BENCHMARKS)
     return()
   endif()
   get_filename_component(BENCHMARK_NAME ${REL_BENCHMARK_NAME} NAME_WE)
@@ -595,6 +595,8 @@ include_directories(SYSTEM ${RAPIDJSON_INCLUDE_DIR})
 
 if (ARROW_JEMALLOC)
   find_package(jemalloc REQUIRED)
+
+  include_directories(SYSTEM ${JEMALLOC_INCLUDE_DIR})
   ADD_THIRDPARTY_LIB(jemalloc SHARED_LIB ${JEMALLOC_SHARED_LIB})
 endif()
diff --git a/cpp/src/arrow/jemalloc/memory_pool.cc b/cpp/src/arrow/jemalloc/memory_pool.cc
index acc09c7cd7587..c568316711717 100644
--- a/cpp/src/arrow/jemalloc/memory_pool.cc
+++ b/cpp/src/arrow/jemalloc/memory_pool.cc
@@ -19,6 +19,8 @@
 
 #include <sstream>
 
+// Needed to support jemalloc 3 and 4
+#define JEMALLOC_MANGLE
 #include <jemalloc/jemalloc.h>
 
 #include "arrow/status.h"

From 3195948f64770520c7ed4c8a7888b33402ad6519 Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Sun, 8 Jan 2017 10:50:30 -0500
Subject: [PATCH 0265/1644] ARROW-438: [C++/Python] Implement zero-data-copy
 record batch and table concatenation.

This also fixes a bug in ChunkedArray::Equals. This is caught by the Python
test suite but would benefit from more C++ unit tests.
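[Editorial sketch, not part of the original commit message; batch1, batch2,
and batch3 are assumed record batches sharing one schema.] The two new C++
APIs introduced below compose roughly like this:

    std::shared_ptr<Table> t1, t2, combined;
    RETURN_NOT_OK(Table::FromRecordBatches("t1", {batch1, batch2}, &t1));
    RETURN_NOT_OK(Table::FromRecordBatches("t2", {batch3}, &t2));

    // Re-chunks the columns without copying any values; returns
    // Status::Invalid if the schemas are not equal.
    RETURN_NOT_OK(ConcatenateTables("combined", {t1, t2}, &combined));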
Author: Wes McKinney Closes #274 from wesm/ARROW-438 and squashes the following commits: 1f39568 [Wes McKinney] py3 compatibility 2e76c5e [Wes McKinney] Implement arrow::ConcatenateTables and Python wrapper. Fix bug in ChunkedArray::Equals f3cb170 [Wes McKinney] Fix Cython compilation, verify pyarrow.Table.from_batches still works af28755 [Wes McKinney] Implement Table::FromRecordBatches Change-Id: I948b61d848c178edefad63465a74d9c303ad1f18 --- cpp/CMakeLists.txt | 2 +- cpp/src/arrow/column.cc | 11 +- cpp/src/arrow/column.h | 2 + cpp/src/arrow/io/io-file-test.cc | 1 - cpp/src/arrow/table-test.cc | 88 ++++++++++-- cpp/src/arrow/table.cc | 71 ++++++++++ cpp/src/arrow/table.h | 13 +- cpp/src/arrow/test-util.h | 43 +++--- python/CMakeLists.txt | 3 + python/benchmarks/array.py | 7 +- python/doc/pandas.rst | 5 +- python/pyarrow/__init__.py | 2 +- python/pyarrow/array.pyx | 25 ++++ python/pyarrow/includes/libarrow.pxd | 16 +++ python/pyarrow/table.pyx | 147 ++++++++++++++------ python/pyarrow/tests/test_convert_pandas.py | 6 +- python/pyarrow/tests/test_parquet.py | 12 +- python/pyarrow/tests/test_table.py | 27 ++++ python/setup.py | 5 +- 19 files changed, 395 insertions(+), 91 deletions(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 3522e5c5a7640..87b7841ece52e 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -76,7 +76,7 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") option(ARROW_JEMALLOC "Build the Arrow jemalloc-based allocator" - ON) + OFF) option(ARROW_BOOST_USE_SHARED "Rely on boost shared libraries where relevant" diff --git a/cpp/src/arrow/column.cc b/cpp/src/arrow/column.cc index 3e899563e2cbe..9cc0f579dc5bd 100644 --- a/cpp/src/arrow/column.cc +++ b/cpp/src/arrow/column.cc @@ -45,7 +45,9 @@ bool ChunkedArray::Equals(const ChunkedArray& other) const { int32_t this_start_idx = 0; int other_chunk_idx = 0; int32_t other_start_idx = 0; - while (this_chunk_idx < static_cast(chunks_.size())) { + + int64_t elements_compared = 0; + while (elements_compared < length_) { const std::shared_ptr this_array = chunks_[this_chunk_idx]; const std::shared_ptr other_array = other.chunk(other_chunk_idx); int32_t common_length = std::min( @@ -55,14 +57,21 @@ bool ChunkedArray::Equals(const ChunkedArray& other) const { return false; } + elements_compared += common_length; + // If we have exhausted the current chunk, proceed to the next one individually. 
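+      // (Editorial note, not part of the original patch.) The loop keeps one
+      // cursor per side and compares only the overlapping prefix of the two
+      // current chunks. Counting elements_compared, and advancing the
+      // intra-chunk offsets in the else branches below, is the fix described
+      // in the commit message: the two arrays may be chunked at different
+      // boundaries, so chunk indices alone are not a valid progress measure.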
if (this_start_idx + common_length == this_array->length()) { this_chunk_idx++; this_start_idx = 0; + } else { + this_start_idx += common_length; } + if (other_start_idx + common_length == other_array->length()) { other_chunk_idx++; other_start_idx = 0; + } else { + other_start_idx += common_length; } } return true; diff --git a/cpp/src/arrow/column.h b/cpp/src/arrow/column.h index f71647381743c..a28b2665e9c1c 100644 --- a/cpp/src/arrow/column.h +++ b/cpp/src/arrow/column.h @@ -48,6 +48,8 @@ class ARROW_EXPORT ChunkedArray { std::shared_ptr chunk(int i) const { return chunks_[i]; } + const ArrayVector& chunks() const { return chunks_; } + bool Equals(const ChunkedArray& other) const; bool Equals(const std::shared_ptr& other) const; diff --git a/cpp/src/arrow/io/io-file-test.cc b/cpp/src/arrow/io/io-file-test.cc index 378b60e782124..821e71d0212f6 100644 --- a/cpp/src/arrow/io/io-file-test.cc +++ b/cpp/src/arrow/io/io-file-test.cc @@ -301,7 +301,6 @@ class MyMemoryPool : public MemoryPool { return Status::OutOfMemory(ss.str()); } - return Status::OK(); } diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc index 734b94125defc..67c9f6703f496 100644 --- a/cpp/src/arrow/table-test.cc +++ b/cpp/src/arrow/table-test.cc @@ -44,16 +44,20 @@ class TestTable : public TestBase { vector> fields = {f0, f1, f2}; schema_ = std::make_shared(fields); - columns_ = { - std::make_shared(schema_->field(0), MakePrimitive(length)), - std::make_shared(schema_->field(1), MakePrimitive(length)), - std::make_shared(schema_->field(2), MakePrimitive(length))}; + arrays_ = {MakePrimitive(length), MakePrimitive(length), + MakePrimitive(length)}; + + columns_ = {std::make_shared(schema_->field(0), arrays_[0]), + std::make_shared(schema_->field(1), arrays_[1]), + std::make_shared(schema_->field(2), arrays_[2])}; } protected: std::shared_ptr
table_; shared_ptr schema_; - vector> columns_; + + std::vector> arrays_; + std::vector> columns_; }; TEST_F(TestTable, EmptySchema) { @@ -65,7 +69,7 @@ TEST_F(TestTable, EmptySchema) { } TEST_F(TestTable, Ctors) { - int length = 100; + const int length = 100; MakeExample1(length); std::string name = "data"; @@ -83,7 +87,7 @@ TEST_F(TestTable, Ctors) { } TEST_F(TestTable, Metadata) { - int length = 100; + const int length = 100; MakeExample1(length); std::string name = "data"; @@ -98,7 +102,7 @@ TEST_F(TestTable, Metadata) { TEST_F(TestTable, InvalidColumns) { // Check that columns are all the same length - int length = 100; + const int length = 100; MakeExample1(length); table_.reset(new Table("data", schema_, columns_, length - 1)); @@ -120,7 +124,7 @@ TEST_F(TestTable, InvalidColumns) { } TEST_F(TestTable, Equals) { - int length = 100; + const int length = 100; MakeExample1(length); std::string name = "data"; @@ -145,6 +149,72 @@ TEST_F(TestTable, Equals) { ASSERT_FALSE(table_->Equals(std::make_shared
(name, schema_, other_columns))); } +TEST_F(TestTable, FromRecordBatches) { + const int32_t length = 10; + MakeExample1(length); + + auto batch1 = std::make_shared(schema_, length, arrays_); + + std::shared_ptr
result, expected; + ASSERT_OK(Table::FromRecordBatches("foo", {batch1}, &result)); + + expected = std::make_shared
("foo", schema_, columns_); + ASSERT_TRUE(result->Equals(expected)); + + std::vector> other_columns; + for (int i = 0; i < schema_->num_fields(); ++i) { + std::vector> col_arrays = {arrays_[i], arrays_[i]}; + other_columns.push_back(std::make_shared(schema_->field(i), col_arrays)); + } + + ASSERT_OK(Table::FromRecordBatches("foo", {batch1, batch1}, &result)); + expected = std::make_shared
("foo", schema_, other_columns); + ASSERT_TRUE(result->Equals(expected)); + + // Error states + std::vector> empty_batches; + ASSERT_RAISES(Invalid, Table::FromRecordBatches("", empty_batches, &result)); + + std::vector> fields = {schema_->field(0), schema_->field(1)}; + auto other_schema = std::make_shared(fields); + + std::vector> other_arrays = {arrays_[0], arrays_[1]}; + auto batch2 = std::make_shared(other_schema, length, other_arrays); + ASSERT_RAISES(Invalid, Table::FromRecordBatches("", {batch1, batch2}, &result)); +} + +TEST_F(TestTable, ConcatenateTables) { + const int32_t length = 10; + + MakeExample1(length); + auto batch1 = std::make_shared(schema_, length, arrays_); + + // generate different data + MakeExample1(length); + auto batch2 = std::make_shared(schema_, length, arrays_); + + std::shared_ptr
t1, t2, t3, result, expected; + ASSERT_OK(Table::FromRecordBatches("foo", {batch1}, &t1)); + ASSERT_OK(Table::FromRecordBatches("foo", {batch2}, &t2)); + + ASSERT_OK(ConcatenateTables("bar", {t1, t2}, &result)); + ASSERT_OK(Table::FromRecordBatches("bar", {batch1, batch2}, &expected)); + ASSERT_TRUE(result->Equals(expected)); + + // Error states + std::vector> empty_tables; + ASSERT_RAISES(Invalid, ConcatenateTables("", empty_tables, &result)); + + std::vector> fields = {schema_->field(0), schema_->field(1)}; + auto other_schema = std::make_shared(fields); + + std::vector> other_arrays = {arrays_[0], arrays_[1]}; + auto batch3 = std::make_shared(other_schema, length, other_arrays); + ASSERT_OK(Table::FromRecordBatches("", {batch3}, &t3)); + + ASSERT_RAISES(Invalid, ConcatenateTables("foo", {t1, t3}, &result)); +} + class TestRecordBatch : public TestBase {}; TEST_F(TestRecordBatch, Equals) { diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc index 45f672ec8928b..b3563eaae7b57 100644 --- a/cpp/src/arrow/table.cc +++ b/cpp/src/arrow/table.cc @@ -77,6 +77,77 @@ Table::Table(const std::string& name, const std::shared_ptr& schema, const std::vector>& columns, int64_t num_rows) : name_(name), schema_(schema), columns_(columns), num_rows_(num_rows) {} +Status Table::FromRecordBatches(const std::string& name, + const std::vector>& batches, + std::shared_ptr
* table) { + if (batches.size() == 0) { + return Status::Invalid("Must pass at least one record batch"); + } + + std::shared_ptr schema = batches[0]->schema(); + + const int nbatches = static_cast(batches.size()); + const int ncolumns = static_cast(schema->num_fields()); + + for (int i = 1; i < nbatches; ++i) { + if (!batches[i]->schema()->Equals(schema)) { + std::stringstream ss; + ss << "Schema at index " << static_cast(i) << " was different: \n" + << schema->ToString() << "\nvs\n" + << batches[i]->schema()->ToString(); + return Status::Invalid(ss.str()); + } + } + + std::vector> columns(ncolumns); + std::vector> column_arrays(nbatches); + + for (int i = 0; i < ncolumns; ++i) { + for (int j = 0; j < nbatches; ++j) { + column_arrays[j] = batches[j]->column(i); + } + columns[i] = std::make_shared(schema->field(i), column_arrays); + } + + *table = std::make_shared
(name, schema, columns); + return Status::OK(); +} + +Status ConcatenateTables(const std::string& output_name, + const std::vector>& tables, std::shared_ptr
* table) { + if (tables.size() == 0) { return Status::Invalid("Must pass at least one table"); } + + std::shared_ptr schema = tables[0]->schema(); + + const int ntables = static_cast(tables.size()); + const int ncolumns = static_cast(schema->num_fields()); + + for (int i = 1; i < ntables; ++i) { + if (!tables[i]->schema()->Equals(schema)) { + std::stringstream ss; + ss << "Schema at index " << static_cast(i) << " was different: \n" + << schema->ToString() << "\nvs\n" + << tables[i]->schema()->ToString(); + return Status::Invalid(ss.str()); + } + } + + std::vector> columns(ncolumns); + for (int i = 0; i < ncolumns; ++i) { + std::vector> column_arrays; + for (int j = 0; j < ntables; ++j) { + const std::vector>& chunks = + tables[j]->column(i)->data()->chunks(); + for (const auto& chunk : chunks) { + column_arrays.push_back(chunk); + } + } + columns[i] = std::make_shared(schema->field(i), column_arrays); + } + *table = std::make_shared
(output_name, schema, columns);
   return Status::OK();
 }
 
 bool Table::Equals(const Table& other) const {
   if (name_ != other.name()) { return false; }
   if (!schema_->Equals(other.schema())) { return false; }
diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h
index 0f2418d0e7900..583847cfbe3e5 100644
--- a/cpp/src/arrow/table.h
+++ b/cpp/src/arrow/table.h
@@ -82,7 +82,13 @@ class ARROW_EXPORT Table {
   // same length as num_rows -- you can validate this using
   // Table::ValidateColumns
   Table(const std::string& name, const std::shared_ptr<Schema>& schema,
-      const std::vector<std::shared_ptr<Column>>& columns, int64_t num_rows);
+      const std::vector<std::shared_ptr<Column>>& columns, int64_t num_rows);
+
+  // Construct table from RecordBatch, but only if all of the batch schemas are
+  // equal. Returns Status::Invalid if there is some problem
+  static Status FromRecordBatches(const std::string& name,
+      const std::vector<std::shared_ptr<RecordBatch>>& batches,
+      std::shared_ptr<Table>
* table); // @returns: the table's name, if any (may be length 0) const std::string& name() const { return name_; } @@ -116,6 +122,11 @@ class ARROW_EXPORT Table { int64_t num_rows_; }; +// Construct table from multiple input tables. Return Status::Invalid if +// schemas are not equal +Status ARROW_EXPORT ConcatenateTables(const std::string& output_name, + const std::vector>& tables, std::shared_ptr
* table); + } // namespace arrow #endif // ARROW_TABLE_H diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index f2da824084775..b59809d9e48e6 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -71,23 +71,6 @@ namespace arrow { -class TestBase : public ::testing::Test { - public: - void SetUp() { pool_ = default_memory_pool(); } - - template - std::shared_ptr MakePrimitive(int32_t length, int32_t null_count = 0) { - auto data = std::make_shared(pool_); - auto null_bitmap = std::make_shared(pool_); - EXPECT_OK(data->Resize(length * sizeof(typename ArrayType::value_type))); - EXPECT_OK(null_bitmap->Resize(BitUtil::BytesForBits(length))); - return std::make_shared(length, data, null_count, null_bitmap); - } - - protected: - MemoryPool* pool_; -}; - namespace test { template @@ -253,6 +236,32 @@ Status MakeRandomBytePoolBuffer(int32_t length, MemoryPool* pool, } // namespace test +class TestBase : public ::testing::Test { + public: + void SetUp() { + pool_ = default_memory_pool(); + random_seed_ = 0; + } + + template + std::shared_ptr MakePrimitive(int32_t length, int32_t null_count = 0) { + auto data = std::make_shared(pool_); + const int64_t data_nbytes = length * sizeof(typename ArrayType::value_type); + EXPECT_OK(data->Resize(data_nbytes)); + + // Fill with random data + test::random_bytes(data_nbytes, random_seed_++, data->mutable_data()); + + auto null_bitmap = std::make_shared(pool_); + EXPECT_OK(null_bitmap->Resize(BitUtil::BytesForBits(length))); + return std::make_shared(length, data, null_count, null_bitmap); + } + + protected: + uint32_t random_seed_; + MemoryPool* pool_; +}; + template void ArrayFromVector(const std::shared_ptr& type, const std::vector& is_valid, const std::vector& values, diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 6c2477235faaa..e42c45d3f5cc9 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -70,6 +70,9 @@ include(SetupCxxFlags) # Add common flags set(CMAKE_CXX_FLAGS "${CXX_COMMON_FLAGS} ${CMAKE_CXX_FLAGS}") +# Suppress Cython warnings +set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-unused-variable") + # Determine compiler version include(CompilerInfo) diff --git a/python/benchmarks/array.py b/python/benchmarks/array.py index 4268f0073f292..e22c0f7fc9e70 100644 --- a/python/benchmarks/array.py +++ b/python/benchmarks/array.py @@ -49,10 +49,10 @@ class PandasConversionsToArrow(PandasConversionsBase): params = ((1, 10 ** 5, 10 ** 6, 10 ** 7), ('int64', 'float64', 'float64_nans', 'str')) def time_from_series(self, n, dtype): - A.from_pandas_dataframe(self.data) + A.Table.from_pandas(self.data) def peakmem_from_series(self, n, dtype): - A.from_pandas_dataframe(self.data) + A.Table.from_pandas(self.data) class PandasConversionsFromArrow(PandasConversionsBase): @@ -61,7 +61,7 @@ class PandasConversionsFromArrow(PandasConversionsBase): def setup(self, n, dtype): super(PandasConversionsFromArrow, self).setup(n, dtype) - self.arrow_data = A.from_pandas_dataframe(self.data) + self.arrow_data = A.Table.from_pandas(self.data) def time_to_series(self, n, dtype): self.arrow_data.to_pandas() @@ -80,4 +80,3 @@ def setUp(self, n): def time_as_py(self, n): for i in range(n): self._array[i].as_py() - diff --git a/python/doc/pandas.rst b/python/doc/pandas.rst index 7c70074817835..c225d1362c7b6 100644 --- a/python/doc/pandas.rst +++ b/python/doc/pandas.rst @@ -31,7 +31,7 @@ represent more data than a DataFrame, so a full conversion is not always possibl Conversion from a Table to a DataFrame is done by calling 
:meth:`pyarrow.table.Table.to_pandas`. The inverse is then achieved by using
-:meth:`pyarrow.from_pandas_dataframe`. This conversion routine provides the
+:meth:`pyarrow.Table.from_pandas`. This conversion routine provides the
 convenience parameter ``timestamps_to_ms``. Although Arrow supports timestamps
 of different resolutions, Pandas only supports nanosecond timestamps and most
 other systems (e.g. Parquet) only work on millisecond timestamps. This parameter
@@ -45,7 +45,7 @@ conversion.
 
     df = pd.DataFrame({"a": [1, 2, 3]})
     # Convert from Pandas to Arrow
-    table = pa.from_pandas_dataframe(df)
+    table = pa.Table.from_pandas(df)
     # Convert back to Pandas
     df_new = table.to_pandas()
 
@@ -111,4 +111,3 @@ Arrow -> Pandas Conversion
 +-------------------------------------+--------------------------------------------------------+
 | ``TIMESTAMP(unit=*)``               | ``pd.Timestamp`` (``np.datetime64[ns]``)               |
 +-------------------------------------+--------------------------------------------------------+
-
diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py
index 02b2b06237de3..d25cdd47ea974 100644
--- a/python/pyarrow/__init__.py
+++ b/python/pyarrow/__init__.py
@@ -56,4 +56,4 @@
                            list_, struct, field,
                            DataType, Field, Schema, schema)
 
-from pyarrow.table import Column, RecordBatch, Table, from_pandas_dataframe
+from pyarrow.table import Column, RecordBatch, Table, concat_tables
diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx
index c178d5ccd355b..266768f7e0ba5 100644
--- a/python/pyarrow/array.pyx
+++ b/python/pyarrow/array.pyx
@@ -91,6 +91,29 @@ cdef class Array:
         """
         return from_pandas_series(obj, mask)
a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index 925543176c531..3a046516d961b 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -155,6 +155,31 @@ cdef class Column: return pd.Series(PyObject_to_object(arr), name=self.name) + def equals(self, Column other): + """ + Check if contents of two columns are equal + + Parameters + ---------- + other : pyarrow.Column + + Returns + ------- + are_equal : boolean + """ + cdef: + CColumn* my_col = self.column + CColumn* other_col = other.column + c_bool result + + self._check_nullptr() + other._check_nullptr() + + with nogil: + result = my_col.Equals(deref(other_col)) + + return result + def to_pylist(self): """ Convert to a list of native Python objects. @@ -343,10 +368,18 @@ cdef class RecordBatch: return arr def equals(self, RecordBatch other): + cdef: + CRecordBatch* my_batch = self.batch + CRecordBatch* other_batch = other.batch + c_bool result + self._check_nullptr() other._check_nullptr() - return self.batch.Equals(deref(other.batch)) + with nogil: + result = my_batch.Equals(deref(other_batch)) + + return result def to_pydict(self): """ @@ -424,7 +457,6 @@ cdef class RecordBatch: """ cdef: Array arr - RecordBatch result c_string c_name shared_ptr[CSchema] schema shared_ptr[CRecordBatch] batch @@ -442,11 +474,7 @@ cdef class RecordBatch: c_arrays.push_back(arr.sp_array) batch.reset(new CRecordBatch(schema, num_rows, c_arrays)) - - result = RecordBatch() - result.init(batch) - - return result + return batch_from_cbatch(batch) cdef table_to_blockmanager(const shared_ptr[CTable]& table, int nthreads): @@ -498,6 +526,31 @@ cdef class Table: raise ReferenceError("Table object references a NULL pointer." "Not initialized.") + def equals(self, Table other): + """ + Check if contents of two tables are equal + + Parameters + ---------- + other : pyarrow.Table + + Returns + ------- + are_equal : boolean + """ + cdef: + CTable* my_table = self.table + CTable* other_table = other.table + c_bool result + + self._check_nullptr() + other._check_nullptr() + + with nogil: + result = my_table.Equals(deref(other_table)) + + return result + @classmethod def from_pandas(cls, df, name=None, timestamps_to_ms=False): """ @@ -527,7 +580,7 @@ cdef class Table: ... 'int': [1, 2], ... 'str': ['a', 'b'] ... 
}) - >>> pa.table.from_pandas_dataframe(df) + >>> pa.Table.from_pandas(df) """ names, arrays = _dataframe_to_arrays(df, name=name, @@ -559,7 +612,6 @@ cdef class Table: c_string c_name vector[shared_ptr[CField]] fields vector[shared_ptr[CColumn]] columns - Table result shared_ptr[CSchema] schema shared_ptr[CTable] table @@ -577,14 +629,10 @@ cdef class Table: c_name = tobytes(name) table.reset(new CTable(c_name, schema, columns)) - - result = Table() - result.init(table) - - return result + return table_from_ctable(table) @staticmethod - def from_batches(batches): + def from_batches(batches, name=None): """ Construct a Table from a list of Arrow RecordBatches @@ -594,39 +642,21 @@ cdef class Table: batches: list of RecordBatch RecordBatch list to be converted, schemas must be equal """ - cdef: - vector[shared_ptr[CArray]] c_array_chunks - vector[shared_ptr[CColumn]] c_columns + vector[shared_ptr[CRecordBatch]] c_batches shared_ptr[CTable] c_table - Array arr - Schema schema - - import pandas as pd + RecordBatch batch + Table table + c_string c_name - schema = batches[0].schema + c_name = b'' if name is None else tobytes(name) - # check schemas are equal - for other in batches[1:]: - if not schema.equals(other.schema): - raise ArrowException("Error converting list of RecordBatches " - "to DataFrame, not all schemas are equal: {%s} != {%s}" - % (str(schema), str(other.schema))) + for batch in batches: + c_batches.push_back(batch.sp_batch) - cdef int K = batches[0].num_columns + with nogil: + check_status(CTable.FromRecordBatches(c_name, c_batches, &c_table)) - # create chunked columns from the batches - c_columns.resize(K) - for i in range(K): - for batch in batches: - arr = batch[i] - c_array_chunks.push_back(arr.sp_array) - c_columns[i].reset(new CColumn(schema.sp_schema.get().field(i), - c_array_chunks)) - c_array_chunks.clear() - - # create a Table from columns and convert to DataFrame - c_table.reset(new CTable('', schema.sp_schema, c_columns)) table = Table() table.init(c_table) return table @@ -760,9 +790,40 @@ cdef class Table: return (self.num_rows, self.num_columns) +def concat_tables(tables, output_name=None): + """ + Perform zero-copy concatenation of pyarrow.Table objects. 
Raises an exception + if the Table schemas are not all the same. + + Parameters + ---------- + tables : iterable of pyarrow.Table objects + output_name : string, default None + A name for the output table, if any + """ + cdef: + vector[shared_ptr[CTable]] c_tables + shared_ptr[CTable] c_result + Table table + c_string c_name + + c_name = b'' if output_name is None else tobytes(output_name) + + for table in tables: + c_tables.push_back(table.sp_table) + + with nogil: + check_status(ConcatenateTables(c_name, c_tables, &c_result)) + + return table_from_ctable(c_result) + + cdef api object table_from_ctable(const shared_ptr[CTable]& ctable): cdef Table table = Table() table.init(ctable) return table -from_pandas_dataframe = Table.from_pandas +cdef api object batch_from_cbatch(const shared_ptr[CRecordBatch]& cbatch): + cdef RecordBatch batch = RecordBatch() + batch.init(cbatch) + return batch diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index bb9f0b3eccab1..12e7a08d795a2 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -61,7 +61,7 @@ def tearDown(self): def _check_pandas_roundtrip(self, df, expected=None, nthreads=1, timestamps_to_ms=False): - table = A.from_pandas_dataframe(df, timestamps_to_ms=timestamps_to_ms) + table = A.Table.from_pandas(df, timestamps_to_ms=timestamps_to_ms) result = table.to_pandas(nthreads=nthreads) if expected is None: expected = df @@ -193,7 +193,7 @@ def test_bytes_to_binary(self): values = [u('qux'), b'foo', None, 'bar', 'qux', np.nan] df = pd.DataFrame({'strings': values}) - table = A.from_pandas_dataframe(df) + table = A.Table.from_pandas(df) assert table[0].type == A.binary() values2 = [b'qux', b'foo', None, b'bar', b'qux', np.nan] @@ -245,7 +245,7 @@ def test_date(self): None, datetime.date(1970, 1, 1), datetime.date(2040, 2, 26)]}) - table = A.from_pandas_dataframe(df) + table = A.Table.from_pandas(df) result = table.to_pandas() expected = df.copy() expected['date'] = pd.to_datetime(df['date']) diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index 7c45732d34573..0fb913cc792d8 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -79,7 +79,7 @@ def test_pandas_parquet_2_0_rountrip(tmpdir): 'empty_str': [''] * size }) filename = tmpdir.join('pandas_rountrip.parquet') - arrow_table = A.from_pandas_dataframe(df, timestamps_to_ms=True) + arrow_table = A.Table.from_pandas(df, timestamps_to_ms=True) A.parquet.write_table(arrow_table, filename.strpath, version="2.0") table_read = pq.read_table(filename.strpath) df_read = table_read.to_pandas() @@ -107,7 +107,7 @@ def test_pandas_parquet_1_0_rountrip(tmpdir): 'empty_str': [''] * size }) filename = tmpdir.join('pandas_rountrip.parquet') - arrow_table = A.from_pandas_dataframe(df) + arrow_table = A.Table.from_pandas(df) A.parquet.write_table(arrow_table, filename.strpath, version="1.0") table_read = pq.read_table(filename.strpath) df_read = table_read.to_pandas() @@ -126,7 +126,7 @@ def test_pandas_column_selection(tmpdir): 'uint16': np.arange(size, dtype=np.uint16) }) filename = tmpdir.join('pandas_rountrip.parquet') - arrow_table = A.from_pandas_dataframe(df) + arrow_table = A.Table.from_pandas(df) A.parquet.write_table(arrow_table, filename.strpath) table_read = pq.read_table(filename.strpath, columns=['uint8']) df_read = table_read.to_pandas() @@ -155,7 +155,7 @@ def _test_dataframe(size=10000): @parquet def
test_pandas_parquet_native_file_roundtrip(tmpdir): df = _test_dataframe(10000) - arrow_table = A.from_pandas_dataframe(df) + arrow_table = A.Table.from_pandas(df) imos = paio.InMemoryOutputStream() pq.write_table(arrow_table, imos, version="2.0") buf = imos.get_result() @@ -176,7 +176,7 @@ def test_pandas_parquet_pyfile_roundtrip(tmpdir): 'strings': ['foo', 'bar', None, 'baz', 'qux'] }) - arrow_table = A.from_pandas_dataframe(df) + arrow_table = A.Table.from_pandas(df) with open(filename, 'wb') as f: A.parquet.write_table(arrow_table, f, version="1.0") @@ -206,7 +206,7 @@ def test_pandas_parquet_configuration_options(tmpdir): 'bool': np.random.randn(size) > 0 }) filename = tmpdir.join('pandas_rountrip.parquet') - arrow_table = A.from_pandas_dataframe(df) + arrow_table = A.Table.from_pandas(df) for use_dictionary in [True, False]: A.parquet.write_table( diff --git a/python/pyarrow/tests/test_table.py b/python/pyarrow/tests/test_table.py index 9985b3e29ada1..6f00c7391f66d 100644 --- a/python/pyarrow/tests/test_table.py +++ b/python/pyarrow/tests/test_table.py @@ -111,6 +111,33 @@ def test_table_basics(): assert chunk is not None +def test_concat_tables(): + data = [ + list(range(5)), + [-10., -5., 0., 5., 10.] + ] + data2 = [ + list(range(5, 10)), + [1., 2., 3., 4., 5.] + ] + + t1 = pa.Table.from_arrays(('a', 'b'), [pa.from_pylist(x) + for x in data], 'table_name') + t2 = pa.Table.from_arrays(('a', 'b'), [pa.from_pylist(x) + for x in data2], 'table_name') + + result = pa.concat_tables([t1, t2], output_name='foo') + assert result.name == 'foo' + assert len(result) == 10 + + expected = pa.Table.from_arrays( + ('a', 'b'), [pa.from_pylist(x + y) + for x, y in zip(data, data2)], + 'foo') + + assert result.equals(expected) + + def test_table_pandas(): data = [ pa.from_pylist(range(5)), diff --git a/python/setup.py b/python/setup.py index 2e595e2dc870e..3829a7982d670 100644 --- a/python/setup.py +++ b/python/setup.py @@ -143,7 +143,10 @@ def _run_cmake(self): cmake_options + [source]) self.spawn(cmake_command) - args = ['make', 'VERBOSE=1'] + args = ['make'] + if os.environ.get('PYARROW_BUILD_VERBOSE', '0') == '1': + args.append('VERBOSE=1') + if 'PYARROW_PARALLEL' in os.environ: args.append('-j{0}'.format(os.environ['PYARROW_PARALLEL'])) self.spawn(args) From f44b6a3b91a15461804dd7877840a557caa52e4e Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 10 Jan 2017 08:44:01 +0100 Subject: [PATCH 0266/1644] ARROW-442: [Python] Inspect Parquet file metadata from Python I also made the Cython parquet extension "private" so that higher level logic (e.g. upcoming handling of multiple files) can be handled in pure Python (which doesn't need to be compiled) Requires PARQUET-828 for the test suite to pass. 
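For context, a minimal usage sketch of the metadata API this patch introduces, mirroring the accessors exercised in test_parquet_metadata_api; the file path is a placeholder and any Parquet file would do:

    import pyarrow.parquet as pq

    # Open the file and inspect its metadata without reading the data itself
    pf = pq.ParquetFile('example.parquet')

    meta = pf.metadata
    print(meta.num_rows, meta.num_columns, meta.num_row_groups)
    print(meta.format_version, meta.created_by)

    # Per-row-group metadata
    rg = meta.row_group(0)
    print(rg.num_rows, rg.total_byte_size)

    # Column-level schema information
    col = pf.schema[0]
    print(col.name, col.physical_type, col.logical_type)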
Author: Wes McKinney Closes #275 from wesm/ARROW-442 and squashes the following commits: a4255a2 [Wes McKinney] Add row group metadata accessor, add smoke tests 75a11cf [Wes McKinney] Add more metadata accessor scaffolding, to be tested e59ca40 [Wes McKinney] Move parquet Cython wrapper to a private import, add parquet.py for high level logic --- python/CMakeLists.txt | 2 +- python/pyarrow/_parquet.pxd | 217 +++++++++++ python/pyarrow/_parquet.pyx | 520 +++++++++++++++++++++++++++ python/pyarrow/includes/parquet.pxd | 147 -------- python/pyarrow/parquet.py | 116 ++++++ python/pyarrow/parquet.pyx | 244 ------------- python/pyarrow/tests/test_parquet.py | 71 +++- python/setup.py | 4 +- 8 files changed, 922 insertions(+), 399 deletions(-) create mode 100644 python/pyarrow/_parquet.pxd create mode 100644 python/pyarrow/_parquet.pyx delete mode 100644 python/pyarrow/includes/parquet.pxd create mode 100644 python/pyarrow/parquet.py delete mode 100644 python/pyarrow/parquet.pyx diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index e42c45d3f5cc9..45115d49d455e 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -428,7 +428,7 @@ if (PYARROW_BUILD_PARQUET) parquet_arrow) set(CYTHON_EXTENSIONS ${CYTHON_EXTENSIONS} - parquet) + _parquet) endif() add_library(pyarrow SHARED diff --git a/python/pyarrow/_parquet.pxd b/python/pyarrow/_parquet.pxd new file mode 100644 index 0000000000000..faca845167d31 --- /dev/null +++ b/python/pyarrow/_parquet.pxd @@ -0,0 +1,217 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
+ +# distutils: language = c++ + +from pyarrow.includes.common cimport * +from pyarrow.includes.libarrow cimport (CArray, CSchema, CStatus, + CTable, MemoryPool) +from pyarrow.includes.libarrow_io cimport ReadableFileInterface, OutputStream + + +cdef extern from "parquet/api/schema.h" namespace "parquet::schema" nogil: + cdef cppclass Node: + pass + + cdef cppclass GroupNode(Node): + pass + + cdef cppclass PrimitiveNode(Node): + pass + + cdef cppclass ColumnPath: + c_string ToDotString() + +cdef extern from "parquet/api/schema.h" namespace "parquet" nogil: + enum ParquetType" parquet::Type::type": + ParquetType_BOOLEAN" parquet::Type::BOOLEAN" + ParquetType_INT32" parquet::Type::INT32" + ParquetType_INT64" parquet::Type::INT64" + ParquetType_INT96" parquet::Type::INT96" + ParquetType_FLOAT" parquet::Type::FLOAT" + ParquetType_DOUBLE" parquet::Type::DOUBLE" + ParquetType_BYTE_ARRAY" parquet::Type::BYTE_ARRAY" + ParquetType_FIXED_LEN_BYTE_ARRAY" parquet::Type::FIXED_LEN_BYTE_ARRAY" + + enum ParquetLogicalType" parquet::LogicalType::type": + ParquetLogicalType_NONE" parquet::LogicalType::NONE" + ParquetLogicalType_UTF8" parquet::LogicalType::UTF8" + ParquetLogicalType_MAP" parquet::LogicalType::MAP" + ParquetLogicalType_MAP_KEY_VALUE" parquet::LogicalType::MAP_KEY_VALUE" + ParquetLogicalType_LIST" parquet::LogicalType::LIST" + ParquetLogicalType_ENUM" parquet::LogicalType::ENUM" + ParquetLogicalType_DECIMAL" parquet::LogicalType::DECIMAL" + ParquetLogicalType_DATE" parquet::LogicalType::DATE" + ParquetLogicalType_TIME_MILLIS" parquet::LogicalType::TIME_MILLIS" + ParquetLogicalType_TIME_MICROS" parquet::LogicalType::TIME_MICROS" + ParquetLogicalType_TIMESTAMP_MILLIS" parquet::LogicalType::TIMESTAMP_MILLIS" + ParquetLogicalType_TIMESTAMP_MICROS" parquet::LogicalType::TIMESTAMP_MICROS" + ParquetLogicalType_UINT_8" parquet::LogicalType::UINT_8" + ParquetLogicalType_UINT_16" parquet::LogicalType::UINT_16" + ParquetLogicalType_UINT_32" parquet::LogicalType::UINT_32" + ParquetLogicalType_UINT_64" parquet::LogicalType::UINT_64" + ParquetLogicalType_INT_8" parquet::LogicalType::INT_8" + ParquetLogicalType_INT_16" parquet::LogicalType::INT_16" + ParquetLogicalType_INT_32" parquet::LogicalType::INT_32" + ParquetLogicalType_INT_64" parquet::LogicalType::INT_64" + ParquetLogicalType_JSON" parquet::LogicalType::JSON" + ParquetLogicalType_BSON" parquet::LogicalType::BSON" + ParquetLogicalType_INTERVAL" parquet::LogicalType::INTERVAL" + + enum ParquetRepetition" parquet::Repetition::type": + ParquetRepetition_REQUIRED" parquet::Repetition::REQUIRED" + ParquetRepetition_OPTIONAL" parquet::Repetition::OPTIONAL" + ParquetRepetition_REPEATED" parquet::Repetition::REPEATED" + + enum ParquetEncoding" parquet::Encoding::type": + ParquetEncoding_PLAIN" parquet::Encoding::PLAIN" + ParquetEncoding_PLAIN_DICTIONARY" parquet::Encoding::PLAIN_DICTIONARY" + ParquetEncoding_RLE" parquet::Encoding::RLE" + ParquetEncoding_BIT_PACKED" parquet::Encoding::BIT_PACKED" + ParquetEncoding_DELTA_BINARY_PACKED" parquet::Encoding::DELTA_BINARY_PACKED" + ParquetEncoding_DELTA_LENGTH_BYTE_ARRAY" parquet::Encoding::DELTA_LENGTH_BYTE_ARRAY" + ParquetEncoding_DELTA_BYTE_ARRAY" parquet::Encoding::DELTA_BYTE_ARRAY" + ParquetEncoding_RLE_DICTIONARY" parquet::Encoding::RLE_DICTIONARY" + + enum ParquetCompression" parquet::Compression::type": + ParquetCompression_UNCOMPRESSED" parquet::Compression::UNCOMPRESSED" + ParquetCompression_SNAPPY" parquet::Compression::SNAPPY" + ParquetCompression_GZIP" parquet::Compression::GZIP" +
ParquetCompression_LZO" parquet::Compression::LZO" + ParquetCompression_BROTLI" parquet::Compression::BROTLI" + + enum ParquetVersion" parquet::ParquetVersion::type": + ParquetVersion_V1" parquet::ParquetVersion::PARQUET_1_0" + ParquetVersion_V2" parquet::ParquetVersion::PARQUET_2_0" + + cdef cppclass ColumnDescriptor: + shared_ptr[ColumnPath] path() + + int16_t max_definition_level() + int16_t max_repetition_level() + + ParquetType physical_type() + ParquetLogicalType logical_type() + const c_string& name() + int type_length() + int type_precision() + int type_scale() + + cdef cppclass SchemaDescriptor: + const ColumnDescriptor* Column(int i) + shared_ptr[Node] schema() + GroupNode* group() + int num_columns() + + +cdef extern from "parquet/api/reader.h" namespace "parquet" nogil: + cdef cppclass ColumnReader: + pass + + cdef cppclass BoolReader(ColumnReader): + pass + + cdef cppclass Int32Reader(ColumnReader): + pass + + cdef cppclass Int64Reader(ColumnReader): + pass + + cdef cppclass Int96Reader(ColumnReader): + pass + + cdef cppclass FloatReader(ColumnReader): + pass + + cdef cppclass DoubleReader(ColumnReader): + pass + + cdef cppclass ByteArrayReader(ColumnReader): + pass + + cdef cppclass RowGroupReader: + pass + + cdef cppclass CRowGroupMetaData" parquet::RowGroupMetaData": + int num_columns() + int64_t num_rows() + int64_t total_byte_size() + + cdef cppclass CFileMetaData" parquet::FileMetaData": + uint32_t size() + int num_columns() + int64_t num_rows() + int num_row_groups() + int32_t version() + const c_string created_by() + int num_schema_elements() + + unique_ptr[CRowGroupMetaData] RowGroup(int i) + const SchemaDescriptor* schema() + + cdef cppclass ParquetFileReader: + # TODO: Some default arguments are missing + @staticmethod + unique_ptr[ParquetFileReader] OpenFile(const c_string& path) + shared_ptr[CFileMetaData] metadata(); + + +cdef extern from "parquet/api/writer.h" namespace "parquet" nogil: + cdef cppclass ParquetOutputStream" parquet::OutputStream": + pass + + cdef cppclass LocalFileOutputStream(ParquetOutputStream): + LocalFileOutputStream(const c_string& path) + void Close() + + cdef cppclass WriterProperties: + cppclass Builder: + Builder* version(ParquetVersion version) + Builder* compression(ParquetCompression codec) + Builder* compression(const c_string& path, + ParquetCompression codec) + Builder* disable_dictionary() + Builder* enable_dictionary() + Builder* enable_dictionary(const c_string& path) + shared_ptr[WriterProperties] build() + + +cdef extern from "parquet/arrow/reader.h" namespace "parquet::arrow" nogil: + CStatus OpenFile(const shared_ptr[ReadableFileInterface]& file, + MemoryPool* allocator, + unique_ptr[FileReader]* reader) + + cdef cppclass FileReader: + FileReader(MemoryPool* pool, unique_ptr[ParquetFileReader] reader) + CStatus ReadFlatColumn(int i, shared_ptr[CArray]* out); + CStatus ReadFlatTable(shared_ptr[CTable]* out); + const ParquetFileReader* parquet_reader(); + + +cdef extern from "parquet/arrow/schema.h" namespace "parquet::arrow" nogil: + CStatus FromParquetSchema(const SchemaDescriptor* parquet_schema, + shared_ptr[CSchema]* out) + CStatus ToParquetSchema(const CSchema* arrow_schema, + shared_ptr[SchemaDescriptor]* out) + + +cdef extern from "parquet/arrow/writer.h" namespace "parquet::arrow" nogil: + cdef CStatus WriteFlatTable( + const CTable* table, MemoryPool* pool, + const shared_ptr[OutputStream]& sink, + int64_t chunk_size, + const shared_ptr[WriterProperties]& properties) diff --git a/python/pyarrow/_parquet.pyx 
b/python/pyarrow/_parquet.pyx new file mode 100644 index 0000000000000..c0dc3eb460929 --- /dev/null +++ b/python/pyarrow/_parquet.pyx @@ -0,0 +1,520 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# cython: profile=False +# distutils: language = c++ +# cython: embedsignature = True + +from pyarrow._parquet cimport * + +from pyarrow.includes.libarrow cimport * +from pyarrow.includes.libarrow_io cimport (ReadableFileInterface, OutputStream, + FileOutputStream) +cimport pyarrow.includes.pyarrow as pyarrow + +from pyarrow.array cimport Array +from pyarrow.compat import tobytes, frombytes +from pyarrow.error import ArrowException +from pyarrow.error cimport check_status +from pyarrow.io import NativeFile +from pyarrow.table cimport Table + +from pyarrow.io cimport NativeFile, get_reader, get_writer + +import six + + +cdef class RowGroupMetaData: + cdef: + unique_ptr[CRowGroupMetaData] up_metadata + CRowGroupMetaData* metadata + object parent + + def __cinit__(self): + pass + + cdef init_from_file(self, FileMetaData parent, int i): + if i < 0 or i >= parent.num_row_groups: + raise IndexError('{0} out of bounds'.format(i)) + self.up_metadata = parent.metadata.RowGroup(i) + self.metadata = self.up_metadata.get() + self.parent = parent + + def __repr__(self): + return """{0} + num_columns: {1} + num_rows: {2} + total_byte_size: {3}""".format(object.__repr__(self), + self.num_columns, + self.num_rows, + self.total_byte_size) + + property num_columns: + + def __get__(self): + return self.metadata.num_columns() + + property num_rows: + + def __get__(self): + return self.metadata.num_rows() + + property total_byte_size: + + def __get__(self): + return self.metadata.total_byte_size() + + +cdef class FileMetaData: + cdef: + shared_ptr[CFileMetaData] sp_metadata + CFileMetaData* metadata + object _schema + + def __cinit__(self): + pass + + cdef init(self, const shared_ptr[CFileMetaData]& metadata): + self.sp_metadata = metadata + self.metadata = metadata.get() + + def __repr__(self): + return """{0} + created_by: {1} + num_columns: {2} + num_rows: {3} + num_row_groups: {4} + format_version: {5} + serialized_size: {6}""".format(object.__repr__(self), + self.created_by, self.num_columns, + self.num_rows, self.num_row_groups, + self.format_version, + self.serialized_size) + + @property + def schema(self): + if self._schema is not None: + return self._schema + + cdef Schema schema = Schema() + schema.init_from_filemeta(self) + self._schema = schema + return schema + + property serialized_size: + + def __get__(self): + return self.metadata.size() + + property num_columns: + + def __get__(self): + return self.metadata.num_columns() + + property num_rows: + + def __get__(self): + return self.metadata.num_rows() + + property num_row_groups: + + def __get__(self): + return 
self.metadata.num_row_groups() + + property format_version: + + def __get__(self): + cdef int version = self.metadata.version() + if version == 2: + return '2.0' + elif version == 1: + return '1.0' + else: + print('Unrecognized file version, assuming 1.0: {0}' + .format(version)) + return '1.0' + + property created_by: + + def __get__(self): + return frombytes(self.metadata.created_by()) + + def row_group(self, int i): + """ + Return the metadata for row group i + """ + cdef RowGroupMetaData result = RowGroupMetaData() + result.init_from_file(self, i) + return result + + +cdef class Schema: + cdef: + object parent # the FileMetaData owning the SchemaDescriptor + const SchemaDescriptor* schema + + def __cinit__(self): + self.parent = None + self.schema = NULL + + def __repr__(self): + cdef const ColumnDescriptor* descr + elements = [] + for i in range(self.schema.num_columns()): + col = self.column(i) + logical_type = col.logical_type + formatted = '{0}: {1}'.format(col.path, col.physical_type) + if logical_type != 'NONE': + formatted += ' {0}'.format(logical_type) + elements.append(formatted) + + return """{0} +{1} + """.format(object.__repr__(self), '\n'.join(elements)) + + cdef init_from_filemeta(self, FileMetaData container): + self.parent = container + self.schema = container.metadata.schema() + + def __len__(self): + return self.schema.num_columns() + + def __getitem__(self, i): + return self.column(i) + + def column(self, i): + if i < 0 or i >= len(self): + raise IndexError('{0} out of bounds'.format(i)) + + cdef ColumnSchema col = ColumnSchema() + col.init_from_schema(self, i) + return col + + +cdef class ColumnSchema: + cdef: + object parent + const ColumnDescriptor* descr + + def __cinit__(self): + self.descr = NULL + + cdef init_from_schema(self, Schema schema, int i): + self.parent = schema + self.descr = schema.schema.Column(i) + + def __repr__(self): + physical_type = self.physical_type + logical_type = self.logical_type + if logical_type == 'DECIMAL': + logical_type = 'DECIMAL({0}, {1})'.format(self.precision, + self.scale) + elif physical_type == 'FIXED_LEN_BYTE_ARRAY': + logical_type = ('FIXED_LEN_BYTE_ARRAY(length={0})' + .format(self.length)) + + return """ + name: {0} + path: {1} + max_definition_level: {2} + max_repetition_level: {3} + physical_type: {4} + logical_type: {5}""".format(self.name, self.path, self.max_definition_level, + self.max_repetition_level, physical_type, + logical_type) + + property name: + + def __get__(self): + return frombytes(self.descr.name()) + + property path: + + def __get__(self): + return frombytes(self.descr.path().get().ToDotString()) + + property max_definition_level: + + def __get__(self): + return self.descr.max_definition_level() + + property max_repetition_level: + + def __get__(self): + return self.descr.max_repetition_level() + + property physical_type: + + def __get__(self): + return physical_type_name_from_enum(self.descr.physical_type()) + + property logical_type: + + def __get__(self): + return logical_type_name_from_enum(self.descr.logical_type()) + + # FIXED_LEN_BYTE_ARRAY attribute + property length: + + def __get__(self): + return self.descr.type_length() + + # Decimal attributes + property precision: + + def __get__(self): + return self.descr.type_precision() + + property scale: + + def __get__(self): + return self.descr.type_scale() + + +cdef physical_type_name_from_enum(ParquetType type_): + return { + ParquetType_BOOLEAN: 'BOOLEAN', + ParquetType_INT32: 'INT32', + ParquetType_INT64: 'INT64', + ParquetType_INT96: 'INT96', + ParquetType_FLOAT: 'FLOAT', +
ParquetType_DOUBLE: 'DOUBLE', + ParquetType_BYTE_ARRAY: 'BYTE_ARRAY', + ParquetType_FIXED_LEN_BYTE_ARRAY: 'FIXED_LEN_BYTE_ARRAY', + }.get(type_, 'UNKNOWN') + + +cdef logical_type_name_from_enum(ParquetLogicalType type_): + return { + ParquetLogicalType_NONE: 'NONE', + ParquetLogicalType_UTF8: 'UTF8', + ParquetLogicalType_MAP: 'MAP', + ParquetLogicalType_MAP_KEY_VALUE: 'MAP_KEY_VALUE', + ParquetLogicalType_LIST: 'LIST', + ParquetLogicalType_ENUM: 'ENUM', + ParquetLogicalType_DECIMAL: 'DECIMAL', + ParquetLogicalType_DATE: 'DATE', + ParquetLogicalType_TIME_MILLIS: 'TIME_MILLIS', + ParquetLogicalType_TIME_MICROS: 'TIME_MICROS', + ParquetLogicalType_TIMESTAMP_MILLIS: 'TIMESTAMP_MILLIS', + ParquetLogicalType_TIMESTAMP_MICROS: 'TIMESTAMP_MICROS', + ParquetLogicalType_UINT_8: 'UINT_8', + ParquetLogicalType_UINT_16: 'UINT_16', + ParquetLogicalType_UINT_32: 'UINT_32', + ParquetLogicalType_UINT_64: 'UINT_64', + ParquetLogicalType_INT_8: 'INT_8', + ParquetLogicalType_INT_16: 'INT_16', + ParquetLogicalType_INT_32: 'INT_32', + ParquetLogicalType_INT_64: 'INT_64', + ParquetLogicalType_JSON: 'JSON', + ParquetLogicalType_BSON: 'BSON', + ParquetLogicalType_INTERVAL: 'INTERVAL', + }.get(type_, 'UNKNOWN') + + +cdef class ParquetReader: + cdef: + MemoryPool* allocator + unique_ptr[FileReader] reader + column_idx_map + FileMetaData _metadata + + def __cinit__(self): + self.allocator = default_memory_pool() + self._metadata = None + + def open(self, object source): + cdef: + shared_ptr[ReadableFileInterface] rd_handle + c_string path + + if isinstance(source, six.string_types): + path = tobytes(source) + + # Must be in one expression to avoid calling std::move which is not + # possible in Cython (due to missing rvalue support) + + # TODO(wesm): ParquetFileReader::OpenFile can throw? + self.reader = unique_ptr[FileReader]( + new FileReader(default_memory_pool(), + ParquetFileReader.OpenFile(path))) + else: + get_reader(source, &rd_handle) + check_status(OpenFile(rd_handle, self.allocator, &self.reader)) + + @property + def metadata(self): + cdef: + shared_ptr[CFileMetaData] metadata + FileMetaData result + if self._metadata is not None: + return self._metadata + + metadata = self.reader.get().parquet_reader().metadata() + + self._metadata = result = FileMetaData() + result.init(metadata) + return result + + def read_all(self): + cdef: + Table table = Table() + shared_ptr[CTable] ctable + + with nogil: + check_status(self.reader.get() + .ReadFlatTable(&ctable)) + + table.init(ctable) + return table + + def column_name_idx(self, column_name): + """ + Find the matching index of a column in the schema. + + Parameters + ---------- + column_name: str + Name of the column, separation of nesting levels is done via ".".
+ + Returns + ------- + column_idx: int + Integer index of the position of the column + """ + cdef: + FileMetaData container = self.metadata + const CFileMetaData* metadata = container.metadata + int i = 0 + + if self.column_idx_map is None: + self.column_idx_map = {} + for i in range(0, metadata.num_columns()): + col_bytes = tobytes(metadata.schema().Column(i) + .path().get().ToDotString()) + self.column_idx_map[col_bytes] = i + + return self.column_idx_map[tobytes(column_name)] + + def read_column(self, int column_index): + cdef: + Array array = Array() + shared_ptr[CArray] carray + + with nogil: + check_status(self.reader.get() + .ReadFlatColumn(column_index, &carray)) + + array.init(carray) + return array + + +cdef check_compression_name(name): + if name.upper() not in ['NONE', 'SNAPPY', 'GZIP', 'LZO', 'BROTLI']: + raise ArrowException("Unsupported compression: " + name) + + +cdef ParquetCompression compression_from_name(object name): + name = name.upper() + if name == "SNAPPY": + return ParquetCompression_SNAPPY + elif name == "GZIP": + return ParquetCompression_GZIP + elif name == "LZO": + return ParquetCompression_LZO + elif name == "BROTLI": + return ParquetCompression_BROTLI + else: + return ParquetCompression_UNCOMPRESSED + + +cdef class ParquetWriter: + cdef: + shared_ptr[WriterProperties] properties + shared_ptr[OutputStream] sink + + cdef readonly: + object use_dictionary + object compression + object version + int row_group_size + + def __cinit__(self, where, use_dictionary=None, compression=None, + version=None): + cdef shared_ptr[FileOutputStream] filestream + + if isinstance(where, six.string_types): + check_status(FileOutputStream.Open(tobytes(where), &filestream)) + self.sink = filestream + else: + get_writer(where, &self.sink) + + self.use_dictionary = use_dictionary + self.compression = compression + self.version = version + self._setup_properties() + + cdef _setup_properties(self): + cdef WriterProperties.Builder properties_builder + self._set_version(&properties_builder) + self._set_compression_props(&properties_builder) + self._set_dictionary_props(&properties_builder) + self.properties = properties_builder.build() + + cdef _set_version(self, WriterProperties.Builder* props): + if self.version is not None: + if self.version == "1.0": + props.version(ParquetVersion_V1) + elif self.version == "2.0": + props.version(ParquetVersion_V2) + else: + raise ArrowException("Unsupported Parquet format version") + + cdef _set_compression_props(self, WriterProperties.Builder* props): + if isinstance(self.compression, basestring): + check_compression_name(self.compression) + props.compression(compression_from_name(self.compression)) + elif self.compression is not None: + # Deactivate dictionary encoding by default + props.disable_dictionary() + for column, codec in self.compression.items(): + check_compression_name(codec) + props.compression(column, compression_from_name(codec)) + + cdef _set_dictionary_props(self, WriterProperties.Builder* props): + if isinstance(self.use_dictionary, bool): + if self.use_dictionary: + props.enable_dictionary() + else: + props.disable_dictionary() + else: + # Deactivate dictionary encoding by default + props.disable_dictionary() + for column in self.use_dictionary: + props.enable_dictionary(column) + + def write_table(self, Table table, row_group_size=None): + cdef CTable* ctable = table.table + + if row_group_size is None: + row_group_size = ctable.num_rows() + + cdef int c_row_group_size = row_group_size + with nogil: +
check_status(WriteFlatTable(ctable, default_memory_pool(), + self.sink, c_row_group_size, + self.properties)) diff --git a/python/pyarrow/includes/parquet.pxd b/python/pyarrow/includes/parquet.pxd deleted file mode 100644 index d9e121dd8853e..0000000000000 --- a/python/pyarrow/includes/parquet.pxd +++ /dev/null @@ -1,147 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - -# distutils: language = c++ - -from pyarrow.includes.common cimport * -from pyarrow.includes.libarrow cimport CArray, CSchema, CStatus, CTable, MemoryPool -from pyarrow.includes.libarrow_io cimport ReadableFileInterface, OutputStream - - -cdef extern from "parquet/api/schema.h" namespace "parquet::schema" nogil: - cdef cppclass Node: - pass - - cdef cppclass GroupNode(Node): - pass - - cdef cppclass PrimitiveNode(Node): - pass - - cdef cppclass ColumnPath: - c_string ToDotString() - -cdef extern from "parquet/api/schema.h" namespace "parquet" nogil: - enum ParquetVersion" parquet::ParquetVersion::type": - PARQUET_1_0" parquet::ParquetVersion::PARQUET_1_0" - PARQUET_2_0" parquet::ParquetVersion::PARQUET_2_0" - - enum Compression" parquet::Compression::type": - UNCOMPRESSED" parquet::Compression::UNCOMPRESSED" - SNAPPY" parquet::Compression::SNAPPY" - GZIP" parquet::Compression::GZIP" - LZO" parquet::Compression::LZO" - BROTLI" parquet::Compression::BROTLI" - - cdef cppclass ColumnDescriptor: - shared_ptr[ColumnPath] path() - - cdef cppclass SchemaDescriptor: - const ColumnDescriptor* Column(int i) - shared_ptr[Node] schema() - GroupNode* group() - - -cdef extern from "parquet/api/reader.h" namespace "parquet" nogil: - cdef cppclass ColumnReader: - pass - - cdef cppclass BoolReader(ColumnReader): - pass - - cdef cppclass Int32Reader(ColumnReader): - pass - - cdef cppclass Int64Reader(ColumnReader): - pass - - cdef cppclass Int96Reader(ColumnReader): - pass - - cdef cppclass FloatReader(ColumnReader): - pass - - cdef cppclass DoubleReader(ColumnReader): - pass - - cdef cppclass ByteArrayReader(ColumnReader): - pass - - cdef cppclass RowGroupReader: - pass - - cdef cppclass FileMetaData: - uint32_t size() - int num_columns() - int64_t num_rows() - int num_row_groups() - int32_t version() - const c_string created_by() - int num_schema_elements() - const SchemaDescriptor* schema() - - cdef cppclass ParquetFileReader: - # TODO: Some default arguments are missing - @staticmethod - unique_ptr[ParquetFileReader] OpenFile(const c_string& path) - shared_ptr[FileMetaData] metadata(); - - -cdef extern from "parquet/api/writer.h" namespace "parquet" nogil: - cdef cppclass ParquetOutputStream" parquet::OutputStream": - pass - - cdef cppclass LocalFileOutputStream(ParquetOutputStream): - LocalFileOutputStream(const c_string& path) - void Close() - - cdef cppclass WriterProperties: - cppclass 
Builder: - Builder* version(ParquetVersion version) - Builder* compression(Compression codec) - Builder* compression(const c_string& path, Compression codec) - Builder* disable_dictionary() - Builder* enable_dictionary() - Builder* enable_dictionary(const c_string& path) - shared_ptr[WriterProperties] build() - - -cdef extern from "parquet/arrow/reader.h" namespace "parquet::arrow" nogil: - CStatus OpenFile(const shared_ptr[ReadableFileInterface]& file, - MemoryPool* allocator, - unique_ptr[FileReader]* reader) - - cdef cppclass FileReader: - FileReader(MemoryPool* pool, unique_ptr[ParquetFileReader] reader) - CStatus ReadFlatColumn(int i, shared_ptr[CArray]* out); - CStatus ReadFlatTable(shared_ptr[CTable]* out); - const ParquetFileReader* parquet_reader(); - - -cdef extern from "parquet/arrow/schema.h" namespace "parquet::arrow" nogil: - CStatus FromParquetSchema(const SchemaDescriptor* parquet_schema, - shared_ptr[CSchema]* out) - CStatus ToParquetSchema(const CSchema* arrow_schema, - shared_ptr[SchemaDescriptor]* out) - - -cdef extern from "parquet/arrow/writer.h" namespace "parquet::arrow" nogil: - cdef CStatus WriteFlatTable( - const CTable* table, MemoryPool* pool, - const shared_ptr[OutputStream]& sink, - int64_t chunk_size, - const shared_ptr[WriterProperties]& properties) diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py new file mode 100644 index 0000000000000..2dedb72ebfcc1 --- /dev/null +++ b/python/pyarrow/parquet.py @@ -0,0 +1,116 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +import pyarrow._parquet as _parquet +from pyarrow.table import Table + + +class ParquetFile(object): + """ + Open a Parquet binary file for reading + + Parameters + ---------- + source : str or pyarrow.io.NativeFile + Readable source. For passing Python file objects or byte buffers, + see pyarrow.io.PythonFileInterface or pyarrow.io.BytesReader. + metadata : FileMetaData, default None + Use existing metadata object, rather than reading from file. + """ + def __init__(self, source, metadata=None): + self.reader = _parquet.ParquetReader() + self.reader.open(source) + + @property + def metadata(self): + return self.reader.metadata + + @property + def schema(self): + return self.metadata.schema + + def read(self, nrows=None, columns=None): + """ + Read a Table from Parquet format + + Parameters + ---------- + columns: list + If not None, only these columns will be read from the file.
+ + Returns + ------- + pyarrow.table.Table + Content of the file as a table (of columns) + """ + if nrows is not None: + raise NotImplementedError("nrows argument") + + if columns is None: + return self.reader.read_all() + else: + column_idxs = [self.reader.column_name_idx(column) + for column in columns] + arrays = [self.reader.read_column(column_idx) + for column_idx in column_idxs] + return Table.from_arrays(columns, arrays) + + +def read_table(source, columns=None): + """ + Read a Table from Parquet format + + Parameters + ---------- + source: str or pyarrow.io.NativeFile + Readable source. For passing Python file objects or byte buffers, see + pyarrow.io.PythonFileInterface or pyarrow.io.BytesReader. + columns: list + If not None, only these columns will be read from the file. + + Returns + ------- + pyarrow.table.Table + Content of the file as a table (of columns) + """ + return ParquetFile(source).read(columns=columns) + + +def write_table(table, sink, chunk_size=None, version=None, + use_dictionary=True, compression=None): + """ + Write a Table to Parquet format + + Parameters + ---------- + table : pyarrow.Table + sink: string or pyarrow.io.NativeFile + chunk_size : int + The maximum number of rows in each Parquet RowGroup. As a default, + we will write a single RowGroup per file. + version : {"1.0", "2.0"}, default "1.0" + The Parquet format version, defaults to 1.0 + use_dictionary : bool or list + Specify if we should use dictionary encoding in general or only for + some columns. + compression : str or dict + Specify the compression codec, either on a general basis or per-column. + """ + writer = _parquet.ParquetWriter(sink, use_dictionary=use_dictionary, + compression=compression, + version=version) + writer.write_table(table, row_group_size=chunk_size) diff --git a/python/pyarrow/parquet.pyx b/python/pyarrow/parquet.pyx deleted file mode 100644 index c0921859e440d..0000000000000 --- a/python/pyarrow/parquet.pyx +++ /dev/null @@ -1,244 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. 
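To make the write_table options defined above concrete, a hedged sketch; table stands for an existing pyarrow.Table, and the output path and column names are placeholders:

    import pyarrow.parquet as pq

    # Per-column compression is passed as a dict and dictionary encoding as
    # a list of column names; both forms are accepted by write_table above.
    pq.write_table(table, 'data.parquet',
                   version='2.0',
                   chunk_size=50000,
                   use_dictionary=['str_col'],
                   compression={'str_col': 'SNAPPY', 'int_col': 'GZIP'})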
- -# cython: profile=False -# distutils: language = c++ -# cython: embedsignature = True - -from pyarrow.includes.libarrow cimport * -from pyarrow.includes.parquet cimport * -from pyarrow.includes.libarrow_io cimport ReadableFileInterface, OutputStream, FileOutputStream -cimport pyarrow.includes.pyarrow as pyarrow - -from pyarrow.array cimport Array -from pyarrow.compat import tobytes -from pyarrow.error import ArrowException -from pyarrow.error cimport check_status -from pyarrow.io import NativeFile -from pyarrow.table cimport Table - -from pyarrow.io cimport NativeFile, get_reader, get_writer - -import six - -__all__ = [ - 'read_table', - 'write_table' -] - -cdef class ParquetReader: - cdef: - MemoryPool* allocator - unique_ptr[FileReader] reader - column_idx_map - - def __cinit__(self): - self.allocator = default_memory_pool() - - def open(self, source): - self._open(source) - - cdef _open(self, object source): - cdef: - shared_ptr[ReadableFileInterface] rd_handle - c_string path - - if isinstance(source, six.string_types): - path = tobytes(source) - - # Must be in one expression to avoid calling std::move which is not - # possible in Cython (due to missing rvalue support) - - # TODO(wesm): ParquetFileReader::OpenFile can throw? - self.reader = unique_ptr[FileReader]( - new FileReader(default_memory_pool(), - ParquetFileReader.OpenFile(path))) - else: - get_reader(source, &rd_handle) - check_status(OpenFile(rd_handle, self.allocator, &self.reader)) - - def read_all(self): - cdef: - Table table = Table() - shared_ptr[CTable] ctable - - with nogil: - check_status(self.reader.get() - .ReadFlatTable(&ctable)) - - table.init(ctable) - return table - - def column_name_idx(self, column_name): - """ - Find the matching index of a column in the schema. - - Parameter - --------- - column_name: str - Name of the column, separation of nesting levels is done via ".". - - Returns - ------- - column_idx: int - Integer index of the position of the column - """ - cdef: - const FileMetaData* metadata = (self.reader.get().parquet_reader() - .metadata().get()) - int i = 0 - - if self.column_idx_map is None: - self.column_idx_map = {} - for i in range(0, metadata.num_columns()): - col_bytes = tobytes(metadata.schema().Column(i) - .path().get().ToDotString()) - self.column_idx_map[col_bytes] = i - - return self.column_idx_map[tobytes(column_name)] - - def read_column(self, int column_index): - cdef: - Array array = Array() - shared_ptr[CArray] carray - - with nogil: - check_status(self.reader.get() - .ReadFlatColumn(column_index, &carray)) - - array.init(carray) - return array - - -def read_table(source, columns=None): - """ - Read a Table from Parquet format - - Parameters - ---------- - source: str or pyarrow.io.NativeFile - Readable source. For passing Python file objects or byte buffers, see - pyarrow.io.PythonFileInterface or pyarrow.io.BytesReader. - columns: list - If not None, only these columns will be read from the file. 
- - Returns - ------- - pyarrow.table.Table - Content of the file as a table (of columns) - """ - cdef ParquetReader reader = ParquetReader() - reader._open(source) - - if columns is None: - return reader.read_all() - else: - column_idxs = [reader.column_name_idx(column) for column in columns] - arrays = [reader.read_column(column_idx) for column_idx in column_idxs] - return Table.from_arrays(columns, arrays) - - -def write_table(table, sink, chunk_size=None, version=None, - use_dictionary=True, compression=None): - """ - Write a Table to Parquet format - - Parameters - ---------- - table : pyarrow.Table - sink: string or pyarrow.io.NativeFile - chunk_size : int - The maximum number of rows in each Parquet RowGroup. As a default, - we will write a single RowGroup per file. - version : {"1.0", "2.0"}, default "1.0" - The Parquet format version, defaults to 1.0 - use_dictionary : bool or list - Specify if we should use dictionary encoding in general or only for - some columns. - compression : str or dict - Specify the compression codec, either on a general basis or per-column. - """ - cdef Table table_ = table - cdef CTable* ctable_ = table_.table - cdef shared_ptr[FileOutputStream] filesink_ - cdef shared_ptr[OutputStream] sink_ - - cdef WriterProperties.Builder properties_builder - cdef int64_t chunk_size_ = 0 - if chunk_size is None: - chunk_size_ = ctable_.num_rows() - else: - chunk_size_ = chunk_size - - if version is not None: - if version == "1.0": - properties_builder.version(PARQUET_1_0) - elif version == "2.0": - properties_builder.version(PARQUET_2_0) - else: - raise ArrowException("Unsupported Parquet format version") - - if isinstance(use_dictionary, bool): - if use_dictionary: - properties_builder.enable_dictionary() - else: - properties_builder.disable_dictionary() - else: - # Deactivate dictionary encoding by default - properties_builder.disable_dictionary() - for column in use_dictionary: - properties_builder.enable_dictionary(column) - - if isinstance(compression, basestring): - if compression == "NONE": - properties_builder.compression(UNCOMPRESSED) - elif compression == "SNAPPY": - properties_builder.compression(SNAPPY) - elif compression == "GZIP": - properties_builder.compression(GZIP) - elif compression == "LZO": - properties_builder.compression(LZO) - elif compression == "BROTLI": - properties_builder.compression(BROTLI) - else: - raise ArrowException("Unsupport compression codec") - elif compression is not None: - # Deactivate dictionary encoding by default - properties_builder.disable_dictionary() - for column, codec in compression.iteritems(): - if codec == "NONE": - properties_builder.compression(column, UNCOMPRESSED) - elif codec == "SNAPPY": - properties_builder.compression(column, SNAPPY) - elif codec == "GZIP": - properties_builder.compression(column, GZIP) - elif codec == "LZO": - properties_builder.compression(column, LZO) - elif codec == "BROTLI": - properties_builder.compression(column, BROTLI) - else: - raise ArrowException("Unsupport compression codec") - - if isinstance(sink, six.string_types): - check_status(FileOutputStream.Open(tobytes(sink), &filesink_)) - sink_ = filesink_ - else: - get_writer(sink, &sink_) - - with nogil: - check_status(WriteFlatTable(ctable_, default_memory_pool(), sink_, - chunk_size_, properties_builder.build())) diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index 0fb913cc792d8..ad4bc580e8b1c 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py 
@@ -55,10 +55,8 @@ def test_single_pylist_column_roundtrip(tmpdir): assert data_written.equals(data_read) -@parquet -def test_pandas_parquet_2_0_rountrip(tmpdir): - size = 10000 - np.random.seed(0) +def alltypes_sample(size=10000, seed=0): + np.random.seed(seed) df = pd.DataFrame({ 'uint8': np.arange(size, dtype=np.uint8), 'uint16': np.arange(size, dtype=np.uint16), @@ -71,13 +69,21 @@ def test_pandas_parquet_2_0_rountrip(tmpdir): 'float32': np.arange(size, dtype=np.float32), 'float64': np.arange(size, dtype=np.float64), 'bool': np.random.randn(size) > 0, - # Pandas only support ns resolution, Arrow at the moment only ms + # TODO(wesm): Test other timestamp resolutions now that arrow supports + # them 'datetime': np.arange("2016-01-01T00:00:00.001", size, dtype='datetime64[ms]'), 'str': [str(x) for x in range(size)], 'str_with_nulls': [None] + [str(x) for x in range(size - 2)] + [None], 'empty_str': [''] * size }) + return df + + +@parquet +def test_pandas_parquet_2_0_rountrip(tmpdir): + df = alltypes_sample(size=10000) + filename = tmpdir.join('pandas_rountrip.parquet') arrow_table = A.Table.from_pandas(df, timestamps_to_ms=True) A.parquet.write_table(arrow_table, filename.strpath, version="2.0") @@ -117,6 +123,7 @@ def test_pandas_parquet_1_0_rountrip(tmpdir): pdt.assert_frame_equal(df, df_read) + @parquet def test_pandas_column_selection(tmpdir): size = 10000 @@ -227,3 +234,56 @@ def test_pandas_parquet_configuration_options(tmpdir): table_read = pq.read_table(filename.strpath) df_read = table_read.to_pandas() pdt.assert_frame_equal(df, df_read) + + +@parquet +def test_parquet_metadata_api(): + df = alltypes_sample(size=10000) + df = df.reindex(columns=sorted(df.columns)) + + a_table = A.Table.from_pandas(df, timestamps_to_ms=True) + + buf = io.BytesIO() + pq.write_table(a_table, buf, compression='snappy', version='2.0') + + buf.seek(0) + fileh = pq.ParquetFile(buf) + + ncols = len(df.columns) + + # Series of sniff tests + meta = fileh.metadata + repr(meta) + assert meta.num_rows == len(df) + assert meta.num_columns == ncols + assert meta.num_row_groups == 1 + assert meta.format_version == '2.0' + assert 'parquet-cpp' in meta.created_by + + # Schema + schema = fileh.schema + assert meta.schema is schema + assert len(schema) == ncols + repr(schema) + + col = schema[0] + repr(col) + assert col.name == df.columns[0] + assert col.max_definition_level == 1 + assert col.max_repetition_level == 0 + + assert col.physical_type == 'BOOLEAN' + assert col.logical_type == 'NONE' + + with pytest.raises(IndexError): + schema[ncols] + + with pytest.raises(IndexError): + schema[-1] + + # Row group + rg_meta = meta.row_group(0) + repr(rg_meta) + + assert rg_meta.num_rows == len(df) + assert rg_meta.num_columns == ncols diff --git a/python/setup.py b/python/setup.py index 3829a7982d670..72ff5842a22a5 100644 --- a/python/setup.py +++ b/python/setup.py @@ -95,7 +95,7 @@ def initialize_options(self): 'error', 'io', 'ipc', - 'parquet', + '_parquet', 'scalar', 'schema', 'table'] @@ -214,7 +214,7 @@ def _run_cmake(self): os.chdir(saved_cwd) def _failure_permitted(self, name): - if name == 'parquet' and not self.with_parquet: + if name == '_parquet' and not self.with_parquet: return True return False From 8d917c1f925c76e9009f6d1b9792551293a572fe Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Tue, 10 Jan 2017 13:23:26 -0500 Subject: [PATCH 0267/1644] ARROW-466: Add ExternalProject for jemalloc Author: Uwe L.
Korn Closes #276 from xhochy/ARROW-466 and squashes the following commits: 9379c45 [Uwe L. Korn] Revert "Enable jemalloc on Windows" 6fd8da8 [Uwe L. Korn] Enable jemalloc on Windows 0e1082f [Uwe L. Korn] ARROW-466: Add ExternalProject for jemalloc --- cpp/CMakeLists.txt | 37 +++++++++++++++++++++++++-- cpp/cmake_modules/Findjemalloc.cmake | 5 ++++ cpp/src/arrow/jemalloc/CMakeLists.txt | 12 ++++++--- 3 files changed, 49 insertions(+), 5 deletions(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 87b7841ece52e..8a2cfc5d6f180 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -30,6 +30,7 @@ set(GFLAGS_VERSION "2.1.2") set(GTEST_VERSION "1.7.0") set(GBENCHMARK_VERSION "1.1.0") set(FLATBUFFERS_VERSION "1.3.0") +set(JEMALLOC_VERSION "4.4.0") find_package(ClangTools) if ("$ENV{CMAKE_EXPORT_COMPILE_COMMANDS}" STREQUAL "1" OR CLANG_TIDY_FOUND) @@ -76,7 +77,7 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") option(ARROW_JEMALLOC "Build the Arrow jemalloc-based allocator" - OFF) + ON) option(ARROW_BOOST_USE_SHARED "Rely on boost shared libraries where relevant" @@ -594,11 +595,43 @@ message(STATUS "RapidJSON include dir: ${RAPIDJSON_INCLUDE_DIR}") include_directories(SYSTEM ${RAPIDJSON_INCLUDE_DIR}) if (ARROW_JEMALLOC) - find_package(jemalloc REQUIRED) + find_package(jemalloc) + + if(NOT JEMALLOC_FOUND) + set(JEMALLOC_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/jemalloc_ep-prefix/src/jemalloc_ep") + set(JEMALLOC_HOME "${JEMALLOC_PREFIX}") + set(JEMALLOC_INCLUDE_DIR "${JEMALLOC_PREFIX}/include") + set(JEMALLOC_SHARED_LIB "${JEMALLOC_PREFIX}/lib/libjemalloc${CMAKE_SHARED_LIBRARY_SUFFIX}") + set(JEMALLOC_STATIC_LIB "${JEMALLOC_PREFIX}/lib/libjemalloc_pic${CMAKE_STATIC_LIBRARY_SUFFIX}") + set(JEMALLOC_VENDORED 1) + if (CMAKE_VERSION VERSION_GREATER "3.2") + # BUILD_BYPRODUCTS is a 3.2+ feature + ExternalProject_Add(jemalloc_ep + URL https://github.com/jemalloc/jemalloc/releases/download/${JEMALLOC_VERSION}/jemalloc-${JEMALLOC_VERSION}.tar.bz2 + CONFIGURE_COMMAND ./configure "--prefix=${JEMALLOC_PREFIX}" "--with-jemalloc-prefix=" + BUILD_IN_SOURCE 1 + BUILD_COMMAND ${MAKE} + BUILD_BYPRODUCTS "${JEMALLOC_STATIC_LIB}") + else() + ExternalProject_Add(jemalloc_ep + URL https://github.com/jemalloc/jemalloc/releases/download/${JEMALLOC_VERSION}/jemalloc-${JEMALLOC_VERSION}.tar.bz2 + CONFIGURE_COMMAND ./configure "--prefix=${JEMALLOC_PREFIX}" "--with-jemalloc-prefix=" + BUILD_IN_SOURCE 1 + BUILD_COMMAND ${MAKE}) + endif() + else() + set(JEMALLOC_VENDORED 0) + endif() include_directories(SYSTEM ${JEMALLOC_INCLUDE_DIR}) ADD_THIRDPARTY_LIB(jemalloc + STATIC_LIB ${JEMALLOC_STATIC_LIB} SHARED_LIB ${JEMALLOC_SHARED_LIB}) + + if (JEMALLOC_VENDORED) + add_dependencies(jemalloc_shared jemalloc_ep) + add_dependencies(jemalloc_static jemalloc_ep) + endif() endif() ## Google PerfTools diff --git a/cpp/cmake_modules/Findjemalloc.cmake b/cpp/cmake_modules/Findjemalloc.cmake index e7fbb94a69235..e511d4dde0f71 100644 --- a/cpp/cmake_modules/Findjemalloc.cmake +++ b/cpp/cmake_modules/Findjemalloc.cmake @@ -47,11 +47,16 @@ if ( _jemalloc_roots ) find_library( JEMALLOC_SHARED_LIB NAMES ${LIBJEMALLOC_NAMES} PATHS ${_jemalloc_roots} NO_DEFAULT_PATH PATH_SUFFIXES "lib" ) + find_library( JEMALLOC_STATIC_LIB NAMES jemalloc_pic + PATHS ${_jemalloc_roots} NO_DEFAULT_PATH + PATH_SUFFIXES "lib" ) else () find_path( JEMALLOC_INCLUDE_DIR NAMES jemalloc/jemalloc.h ) message(STATUS ${JEMALLOC_INCLUDE_DIR}) find_library( JEMALLOC_SHARED_LIB NAMES ${LIBJEMALLOC_NAMES}) message(STATUS ${JEMALLOC_SHARED_LIB}) + 
find_library( JEMALLOC_STATIC_LIB NAMES jemalloc_pic) + message(STATUS ${JEMALLOC_STATIC_LIB}) endif () if (JEMALLOC_INCLUDE_DIR AND JEMALLOC_SHARED_LIB) diff --git a/cpp/src/arrow/jemalloc/CMakeLists.txt b/cpp/src/arrow/jemalloc/CMakeLists.txt index c6663eb8227f0..c0f90eba260f6 100644 --- a/cpp/src/arrow/jemalloc/CMakeLists.txt +++ b/cpp/src/arrow/jemalloc/CMakeLists.txt @@ -23,18 +23,24 @@ include_directories(SYSTEM "{JEMALLOC_INCLUDE_DIR}") # arrow_jemalloc library set(ARROW_JEMALLOC_STATIC_LINK_LIBS arrow_static - jemalloc + jemalloc_static ) + +if (NOT APPLE) + set(ARROW_JEMALLOC_STATIC_LINK_LIBS ${ARROW_JEMALLOC_STATIC_LINK_LIBS} pthread) +endif() + set(ARROW_JEMALLOC_SHARED_LINK_LIBS arrow_shared - jemalloc + jemalloc_shared ) if (ARROW_BUILD_STATIC) set(ARROW_JEMALLOC_TEST_LINK_LIBS + ${ARROW_JEMALLOC_STATIC_LINK_LIBS} arrow_jemalloc_static) else() - set(ARROW_jemalloc_TEST_LINK_LIBS + set(ARROW_JEMALLOC_TEST_LINK_LIBS arrow_jemalloc_shared) endif() From 543e50814c15d58387683a43b5abc661c4acc484 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Tue, 10 Jan 2017 13:27:34 -0500 Subject: [PATCH 0268/1644] ARROW-469: C++: Add option so that resize doesn't decrease the capacity Author: Uwe L. Korn Closes #277 from xhochy/ARROW-469 and squashes the following commits: f59059f [Uwe L. Korn] ARROW-469: C++: Add option so that resize doesn't decrease the capacity --- cpp/src/arrow/buffer-test.cc | 11 ++++++++++- cpp/src/arrow/buffer.cc | 11 ++++++----- cpp/src/arrow/buffer.h | 13 ++++++++----- 3 files changed, 24 insertions(+), 11 deletions(-) diff --git a/cpp/src/arrow/buffer-test.cc b/cpp/src/arrow/buffer-test.cc index c1d027bb653fe..2ded1e11f96f8 100644 --- a/cpp/src/arrow/buffer-test.cc +++ b/cpp/src/arrow/buffer-test.cc @@ -53,8 +53,17 @@ TEST_F(TestBuffer, Resize) { ASSERT_EQ(200, buf.size()); // Make it smaller, too - ASSERT_OK(buf.Resize(50)); + ASSERT_OK(buf.Resize(50, true)); ASSERT_EQ(50, buf.size()); + // We have actually shrunk in size + // The spec requires that capacity is a multiple of 64 + ASSERT_EQ(64, buf.capacity()); + + // Resize to a larger capacity again to test shrink_to_fit = false + ASSERT_OK(buf.Resize(100)); + ASSERT_EQ(128, buf.capacity()); + ASSERT_OK(buf.Resize(50, false)); + ASSERT_EQ(128, buf.capacity()); } TEST_F(TestBuffer, ResizeOOM) { diff --git a/cpp/src/arrow/buffer.cc b/cpp/src/arrow/buffer.cc index 2e64ffd75c263..6cce0efa37784 100644 --- a/cpp/src/arrow/buffer.cc +++ b/cpp/src/arrow/buffer.cc @@ -91,13 +91,14 @@ Status PoolBuffer::Reserve(int64_t new_capacity) { return Status::OK(); } -Status PoolBuffer::Resize(int64_t new_size) { - if (new_size > size_) { +Status PoolBuffer::Resize(int64_t new_size, bool shrink_to_fit) { + if (!shrink_to_fit || (new_size > size_)) { RETURN_NOT_OK(Reserve(new_size)); } else { // Buffer is not growing, so shrink to the requested size without // excess space. - if (capacity_ != new_size) { + int64_t new_capacity = BitUtil::RoundUpToMultipleOf64(new_size); + if (capacity_ != new_capacity) { // Buffer does not yet have the requested size.
if (new_size == 0) { pool_->Free(mutable_data_, capacity_); @@ -105,9 +106,9 @@ Status PoolBuffer::Resize(int64_t new_size) { mutable_data_ = nullptr; data_ = nullptr; } else { - RETURN_NOT_OK(pool_->Reallocate(capacity_, new_size, &mutable_data_)); + RETURN_NOT_OK(pool_->Reallocate(capacity_, new_capacity, &mutable_data_)); data_ = mutable_data_; - capacity_ = new_size; + capacity_ = new_capacity; } } } diff --git a/cpp/src/arrow/buffer.h b/cpp/src/arrow/buffer.h index 27437ca0486c3..ac78808eaf205 100644 --- a/cpp/src/arrow/buffer.h +++ b/cpp/src/arrow/buffer.h @@ -127,10 +127,13 @@ class ARROW_EXPORT MutableBuffer : public Buffer { class ARROW_EXPORT ResizableBuffer : public MutableBuffer { public: - // Change buffer reported size to indicated size, allocating memory if - // necessary. This will ensure that the capacity of the buffer is a multiple - // of 64 bytes as defined in Layout.md. - virtual Status Resize(int64_t new_size) = 0; + /// Change buffer reported size to indicated size, allocating memory if + /// necessary. This will ensure that the capacity of the buffer is a multiple + /// of 64 bytes as defined in Layout.md. + /// + /// @param shrink_to_fit if false, the capacity of the buffer will not + /// decrease when new_size is smaller than the current size. + virtual Status Resize(int64_t new_size, bool shrink_to_fit = true) = 0; // Ensure that buffer has enough memory allocated to fit the indicated // capacity (and meets the 64 byte padding requirement in Layout.md). @@ -147,7 +150,7 @@ class ARROW_EXPORT PoolBuffer : public ResizableBuffer { explicit PoolBuffer(MemoryPool* pool = nullptr); virtual ~PoolBuffer(); - Status Resize(int64_t new_size) override; + Status Resize(int64_t new_size, bool shrink_to_fit = true) override; Status Reserve(int64_t new_capacity) override; private: From 7d3e2a3ab90324625b738e464a020758379f457a Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 11 Jan 2017 09:33:29 -0500 Subject: [PATCH 0269/1644] ARROW-421: [Python] Retain parent reference in PyBytesReader Pass Buffer to BufferReader so that zero-copy slices retain a reference to the PyBytesBuffer, which prevents the bytes object from being garbage collected prematurely. Also added some helper tools for inspecting Arrow Buffer objects in Python.
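In practice the fix gives the following behavior; this is a sketch modelled directly on the new test_io.py case below (pyarrow.io module layout as in this tree):

    import gc
    from pyarrow import io

    data = b'some sample data' * 1000
    reader = io.BytesReader(data)
    reader.seek(5)
    buf = reader.read_buffer(6)  # zero-copy slice into the original bytes

    # Drop every other reference to the data and force a collection
    del reader, data
    gc.collect()

    # The slice still holds its parent buffer, which keeps the PyBytes alive
    assert buf.parent is not None
    assert buf.to_pybytes() == b'sample'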
Close #278 Author: Wes McKinney Closes #279 from wesm/ARROW-421 and squashes the following commits: acf730e [Wes McKinney] Rename method 50c195a [Wes McKinney] Fix accidental typo ef20185 [Wes McKinney] Pass Buffer to BufferReader so that zero-copy slices retain reference to PyBytesBuffer, which prevents the bytes object from being garbage collected prematurely --- cpp/src/arrow/io/memory.h | 2 ++ python/pyarrow/_parquet.pxd | 2 +- python/pyarrow/_parquet.pyx | 8 ++--- python/pyarrow/includes/libarrow.pxd | 1 + python/pyarrow/io.pyx | 46 ++++++++++++++++++++++++++-- python/pyarrow/tests/test_io.py | 14 +++++++++ python/src/pyarrow/io.cc | 10 ++---- python/src/pyarrow/io.h | 5 ++- 8 files changed, 69 insertions(+), 19 deletions(-) diff --git a/cpp/src/arrow/io/memory.h b/cpp/src/arrow/io/memory.h index 8428a12220a69..2d3df4224e9fb 100644 --- a/cpp/src/arrow/io/memory.h +++ b/cpp/src/arrow/io/memory.h @@ -79,6 +79,8 @@ class ARROW_EXPORT BufferReader : public ReadableFileInterface { bool supports_zero_copy() const override; + std::shared_ptr buffer() const { return buffer_; } + private: std::shared_ptr buffer_; const uint8_t* data_; diff --git a/python/pyarrow/_parquet.pxd b/python/pyarrow/_parquet.pxd index faca845167d31..7e49e9e834b77 100644 --- a/python/pyarrow/_parquet.pxd +++ b/python/pyarrow/_parquet.pxd @@ -156,7 +156,7 @@ cdef extern from "parquet/api/reader.h" namespace "parquet" nogil: int num_columns() int64_t num_rows() int num_row_groups() - int32_t version() + ParquetVersion version() const c_string created_by() int num_schema_elements() diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx index c0dc3eb460929..30e3de417a827 100644 --- a/python/pyarrow/_parquet.pyx +++ b/python/pyarrow/_parquet.pyx @@ -138,11 +138,11 @@ cdef class FileMetaData: property format_version: def __get__(self): - cdef int version = self.metadata.version() - if version == 2: - return '2.0' - elif version == 1: + cdef ParquetVersion version = self.metadata.version() + if version == ParquetVersion_V1: return '1.0' + if version == ParquetVersion_V2: + return '2.0' else: print('Unrecognized file version, assuming 1.0: {0}' .format(version)) diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index b0f971d516ce5..d1970e5a2c8f1 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -66,6 +66,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass CBuffer" arrow::Buffer": uint8_t* data() int64_t size() + shared_ptr[CBuffer] parent() cdef cppclass ResizableBuffer(CBuffer): CStatus Resize(int64_t nbytes) diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index cab6ccb90ee6b..b62de6cdd462c 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -123,11 +123,17 @@ cdef class NativeFile: with nogil: check_status(self.wr_file.get().Write(buf, bufsize)) - def read(self, int64_t nbytes): + def read(self, nbytes=None): cdef: + int64_t c_nbytes int64_t bytes_read = 0 PyObject* obj + if nbytes is None: + c_nbytes = self.size() - self.tell() + else: + c_nbytes = nbytes + self._assert_readable() # Allocate empty write space @@ -135,17 +141,35 @@ cdef class NativeFile: cdef uint8_t* buf = cp.PyBytes_AS_STRING( obj) with nogil: - check_status(self.rd_file.get().Read(nbytes, &bytes_read, buf)) + check_status(self.rd_file.get().Read(c_nbytes, &bytes_read, buf)) - if bytes_read < nbytes: + if bytes_read < c_nbytes: cp._PyBytes_Resize(&obj, bytes_read) return PyObject_to_object(obj) + def 
read_buffer(self, nbytes=None): + cdef: + int64_t c_nbytes + int64_t bytes_read = 0 + shared_ptr[CBuffer] output + self._assert_readable() + + if nbytes is None: + c_nbytes = self.size() - self.tell() + else: + c_nbytes = nbytes + + with nogil: + check_status(self.rd_file.get().ReadB(c_nbytes, &output)) + + return wrap_buffer(output) + # ---------------------------------------------------------------------- # Python file-like objects + cdef class PythonFileInterface(NativeFile): cdef: object handle @@ -199,6 +223,16 @@ cdef class Buffer: def __get__(self): return self.buffer.get().size() + property parent: + + def __get__(self): + cdef shared_ptr[CBuffer] parent_buf = self.buffer.get().parent() + + if parent_buf.get() == NULL: + return None + else: + return wrap_buffer(parent_buf) + def __getitem__(self, key): # TODO(wesm): buffer slicing raise NotImplementedError @@ -209,6 +243,12 @@ cdef class Buffer: self.buffer.get().size()) +cdef wrap_buffer(const shared_ptr[CBuffer]& buffer): + cdef Buffer result = Buffer() + result.buffer = buffer + return result + + cdef shared_ptr[PoolBuffer] allocate_buffer(): cdef shared_ptr[PoolBuffer] result result.reset(new PoolBuffer(pyarrow.get_memory_pool())) diff --git a/python/pyarrow/tests/test_io.py b/python/pyarrow/tests/test_io.py index c10ed0394b1a8..3e7a43702aa05 100644 --- a/python/pyarrow/tests/test_io.py +++ b/python/pyarrow/tests/test_io.py @@ -102,6 +102,20 @@ def test_bytes_reader_non_bytes(): io.BytesReader(u('some sample data')) +def test_bytes_reader_retains_parent_reference(): + import gc + + # ARROW-421 + def get_buffer(): + data = b'some sample data' * 1000 + reader = io.BytesReader(data) + reader.seek(5) + return reader.read_buffer(6) + + buf = get_buffer() + gc.collect() + assert buf.to_pybytes() == b'sample' + assert buf.parent is not None # ---------------------------------------------------------------------- # Buffers diff --git a/python/src/pyarrow/io.cc b/python/src/pyarrow/io.cc index ac1aa635b40ea..01f851d874075 100644 --- a/python/src/pyarrow/io.cc +++ b/python/src/pyarrow/io.cc @@ -203,14 +203,8 @@ Status PyOutputStream::Write(const uint8_t* data, int64_t nbytes) { // A readable file that is backed by a PyBytes PyBytesReader::PyBytesReader(PyObject* obj) - : arrow::io::BufferReader(reinterpret_cast<const uint8_t*>(PyBytes_AS_STRING(obj)), - PyBytes_GET_SIZE(obj)), - obj_(obj) { - Py_INCREF(obj_); -} + : arrow::io::BufferReader(std::make_shared<PyBytesBuffer>(obj)) {} -PyBytesReader::~PyBytesReader() { - Py_DECREF(obj_); -} +PyBytesReader::~PyBytesReader() {} } // namespace pyarrow diff --git a/python/src/pyarrow/io.h b/python/src/pyarrow/io.h index fd3e7c0887207..4cb010f2d4e9f 100644 --- a/python/src/pyarrow/io.h +++ b/python/src/pyarrow/io.h @@ -22,6 +22,8 @@ #include "arrow/io/memory.h" #include "pyarrow/config.h" + +#include "pyarrow/common.h" #include "pyarrow/visibility.h" namespace arrow { @@ -87,9 +89,6 @@ class PYARROW_EXPORT PyBytesReader : public arrow::io::BufferReader { public: explicit PyBytesReader(PyObject* obj); virtual ~PyBytesReader(); - - private: - PyObject* obj_; }; // TODO(wesm): seekable output files From c5663c6d00dbd297dac573670156e26dc0593357 Mon Sep 17 00:00:00 2001 From: Laurent Goujon Date: Thu, 12 Jan 2017 12:21:37 -0500 Subject: [PATCH 0270/1644] ARROW-385: Refactors metric system Arrow has some support for metrics, but the metrics registry is by default not configured to export values. It also forces users to use the yammer/codahale metrics library instead of a library of their choice. To allow integration with other metrics systems, replace it with a notification mechanism that alerts users on allocation/deallocation. Author: Laurent Goujon Closes #212 from laurentgo/laurent/metrics-refactoring and squashes the following commits: e6c435b [Laurent Goujon] ARROW-385: Refactors metric system --- java/memory/pom.xml | 7 - .../java/io/netty/buffer/LargeBuffer.java | 31 +--- .../netty/buffer/PooledByteBufAllocatorL.java | 157 ++++++++---------- .../buffer/UnsafeDirectLittleEndian.java | 34 +--- .../Metrics.java => AllocationListener.java} | 40 ++--- .../arrow/memory/AllocationManager.java | 13 +- .../apache/arrow/memory/BaseAllocator.java | 30 +++- .../apache/arrow/memory/RootAllocator.java | 7 +- .../org/apache/arrow/memory/util/Pointer.java | 28 ---- 9 files changed, 138 insertions(+), 209 deletions(-) rename java/memory/src/main/java/org/apache/arrow/memory/{util/Metrics.java => AllocationListener.java} (58%) delete mode 100644 java/memory/src/main/java/org/apache/arrow/memory/util/Pointer.java diff --git a/java/memory/pom.xml b/java/memory/pom.xml index 6ed14480860f2..a4eb65228febf 100644 --- a/java/memory/pom.xml +++ b/java/memory/pom.xml @@ -20,13 +20,6 @@ Arrow Memory - <dependency> - <groupId>com.codahale.metrics</groupId> - <artifactId>metrics-core</artifactId> - <version>3.0.1</version> - </dependency> <groupId>com.google.code.findbugs</groupId> <artifactId>jsr305</artifactId> diff --git a/java/memory/src/main/java/io/netty/buffer/LargeBuffer.java b/java/memory/src/main/java/io/netty/buffer/LargeBuffer.java index 5f5e904fb0429..c026e430d77f3 100644 --- a/java/memory/src/main/java/io/netty/buffer/LargeBuffer.java +++ b/java/memory/src/main/java/io/netty/buffer/LargeBuffer.java @@ -17,43 +17,16 @@ */ package io.netty.buffer; -import java.util.concurrent.atomic.AtomicLong; - /** * A MutableWrappedByteBuf that also maintains a metric of the number of huge buffer bytes and counts.
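 * With this change the size/count bookkeeping no longer lives in LargeBuffer
 * itself; it moves to AccountedUnsafeDirectLittleEndian inside PooledByteBufAllocatorL.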
*/ public class LargeBuffer extends MutableWrappedByteBuf { - - private final AtomicLong hugeBufferSize; - private final AtomicLong hugeBufferCount; - - private final int initCap; - - public LargeBuffer(ByteBuf buffer, AtomicLong hugeBufferSize, AtomicLong hugeBufferCount) { + public LargeBuffer(ByteBuf buffer) { super(buffer); - initCap = buffer.capacity(); - this.hugeBufferCount = hugeBufferCount; - this.hugeBufferSize = hugeBufferSize; } @Override public ByteBuf copy(int index, int length) { - return new LargeBuffer(buffer.copy(index, length), hugeBufferSize, hugeBufferCount); + return new LargeBuffer(buffer.copy(index, length)); } - - @Override - public boolean release() { - return release(1); - } - - @Override - public boolean release(int decrement) { - boolean released = unwrap().release(decrement); - if (released) { - hugeBufferSize.addAndGet(-initCap); - hugeBufferCount.decrementAndGet(); - } - return released; - } - } diff --git a/java/memory/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java b/java/memory/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java index f6feb65cccd09..a843ac5586e79 100644 --- a/java/memory/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java +++ b/java/memory/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java @@ -17,7 +17,7 @@ */ package io.netty.buffer; -import io.netty.util.internal.StringUtil; +import static org.apache.arrow.memory.util.AssertionUtil.ASSERT_ENABLED; import java.lang.reflect.Field; import java.nio.ByteBuffer; @@ -25,24 +25,16 @@ import org.apache.arrow.memory.OutOfMemoryException; -import com.codahale.metrics.Gauge; -import com.codahale.metrics.Histogram; -import com.codahale.metrics.Metric; -import com.codahale.metrics.MetricFilter; -import com.codahale.metrics.MetricRegistry; +import io.netty.util.internal.StringUtil; /** * The base allocator that we use for all of Arrow's memory management. Returns UnsafeDirectLittleEndian buffers. 
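 * In place of the codahale MetricRegistry, accounting is now kept in plain
 * AtomicLong fields and exposed through the getHugeBufferSize/getHugeBufferCount
 * and getNormalBufferSize/getNormalBufferCount accessors below.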
*/ public class PooledByteBufAllocatorL { - private static final org.slf4j.Logger memoryLogger = org.slf4j.LoggerFactory.getLogger("drill.allocator"); + private static final org.slf4j.Logger memoryLogger = org.slf4j.LoggerFactory.getLogger("arrow.allocator"); private static final int MEMORY_LOGGER_FREQUENCY_SECONDS = 60; - - public static final String METRIC_PREFIX = "drill.allocator."; - - private final MetricRegistry registry; private final AtomicLong hugeBufferSize = new AtomicLong(0); private final AtomicLong hugeBufferCount = new AtomicLong(0); private final AtomicLong normalBufferSize = new AtomicLong(0); @@ -51,8 +43,7 @@ public class PooledByteBufAllocatorL { private final InnerAllocator allocator; public final UnsafeDirectLittleEndian empty; - public PooledByteBufAllocatorL(MetricRegistry registry) { - this.registry = registry; + public PooledByteBufAllocatorL() { allocator = new InnerAllocator(); empty = new UnsafeDirectLittleEndian(new DuplicatedByteBuf(Unpooled.EMPTY_BUFFER)); } @@ -70,13 +61,66 @@ public int getChunkSize() { return allocator.chunkSize; } - private class InnerAllocator extends PooledByteBufAllocator { + public long getHugeBufferSize() { + return hugeBufferSize.get(); + } + public long getHugeBufferCount() { + return hugeBufferCount.get(); + } + public long getNormalBufferSize() { + return normalBufferSize.get(); + } + + public long getNormalBufferCount() { + return normalBufferCount.get(); + } + + private static class AccountedUnsafeDirectLittleEndian extends UnsafeDirectLittleEndian { + private final long initialCapacity; + private final AtomicLong count; + private final AtomicLong size; + + private AccountedUnsafeDirectLittleEndian(LargeBuffer buf, AtomicLong count, AtomicLong size) { + super(buf); + this.initialCapacity = buf.capacity(); + this.count = count; + this.size = size; + } + + private AccountedUnsafeDirectLittleEndian(PooledUnsafeDirectByteBuf buf, AtomicLong count, AtomicLong size) { + super(buf); + this.initialCapacity = buf.capacity(); + this.count = count; + this.size = size; + } + + @Override + public ByteBuf copy() { + throw new UnsupportedOperationException("copy method is not supported"); + } + + @Override + public ByteBuf copy(int index, int length) { + throw new UnsupportedOperationException("copy method is not supported"); + } + + @Override + public boolean release(int decrement) { + boolean released = super.release(decrement); + if (released) { + count.decrementAndGet(); + size.addAndGet(-initialCapacity); + } + return released; + } + + } + + private class InnerAllocator extends PooledByteBufAllocator { private final PoolArena<ByteBuffer>[] directArenas; private final MemoryStatusThread statusThread; - private final Histogram largeBuffersHist; - private final Histogram normalBuffersHist; private final int chunkSize; public InnerAllocator() { @@ -98,50 +142,6 @@ public InnerAllocator() { } else { statusThread = null; } - removeOldMetrics(); - - registry.register(METRIC_PREFIX + "normal.size", new Gauge<Long>() { - @Override - public Long getValue() { - return normalBufferSize.get(); - } - }); - - registry.register(METRIC_PREFIX + "normal.count", new Gauge<Long>() { - @Override - public Long getValue() { - return normalBufferCount.get(); - } - }); - - registry.register(METRIC_PREFIX + "huge.size", new Gauge<Long>() { - @Override - public Long getValue() { - return hugeBufferSize.get(); - } - }); - - registry.register(METRIC_PREFIX + "huge.count", new Gauge<Long>() { - @Override - public Long getValue() { - return hugeBufferCount.get(); - } - }); - - largeBuffersHist =
registry.histogram(METRIC_PREFIX + "huge.hist"); - normalBuffersHist = registry.histogram(METRIC_PREFIX + "normal.hist"); - - } - - - private synchronized void removeOldMetrics() { - registry.removeMatching(new MetricFilter() { - @Override - public boolean matches(String name, Metric metric) { - return name.startsWith("drill.allocator."); - } - - }); } private UnsafeDirectLittleEndian newDirectBufferL(int initialCapacity, int maxCapacity) { @@ -154,12 +154,11 @@ private UnsafeDirectLittleEndian newDirectBufferL(int initialCapacity, int maxCa // This is beyond chunk size so we'll allocate separately. ByteBuf buf = UnpooledByteBufAllocator.DEFAULT.directBuffer(initialCapacity, maxCapacity); - hugeBufferCount.incrementAndGet(); hugeBufferSize.addAndGet(buf.capacity()); - largeBuffersHist.update(buf.capacity()); - // logger.debug("Allocating huge buffer of size {}", initialCapacity, new Exception()); - return new UnsafeDirectLittleEndian(new LargeBuffer(buf, hugeBufferSize, hugeBufferCount)); + hugeBufferCount.incrementAndGet(); + // logger.debug("Allocating huge buffer of size {}", initialCapacity, new Exception()); + return new AccountedUnsafeDirectLittleEndian(new LargeBuffer(buf), hugeBufferCount, hugeBufferSize); } else { // within chunk, use arena. ByteBuf buf = directArena.allocate(cache, initialCapacity, maxCapacity); @@ -167,14 +166,14 @@ private UnsafeDirectLittleEndian newDirectBufferL(int initialCapacity, int maxCa fail(); } - normalBuffersHist.update(buf.capacity()); - if (ASSERT_ENABLED) { - normalBufferSize.addAndGet(buf.capacity()); - normalBufferCount.incrementAndGet(); + if (!ASSERT_ENABLED) { + return new UnsafeDirectLittleEndian((PooledUnsafeDirectByteBuf) buf); } - return new UnsafeDirectLittleEndian((PooledUnsafeDirectByteBuf) buf, normalBufferCount, - normalBufferSize); + normalBufferSize.addAndGet(buf.capacity()); + normalBufferCount.incrementAndGet(); + + return new AccountedUnsafeDirectLittleEndian((PooledUnsafeDirectByteBuf) buf, normalBufferCount, normalBufferSize); } } else { @@ -184,9 +183,10 @@ private UnsafeDirectLittleEndian newDirectBufferL(int initialCapacity, int maxCa private UnsupportedOperationException fail() { return new UnsupportedOperationException( - "Arrow requries that the JVM used supports access sun.misc.Unsafe. This platform didn't provide that functionality."); + "Arrow requires that the JVM used supports access sun.misc.Unsafe. 
This platform didn't provide that functionality."); } + @Override public UnsafeDirectLittleEndian directBuffer(int initialCapacity, int maxCapacity) { if (initialCapacity == 0 && maxCapacity == 0) { newDirectBuffer(initialCapacity, maxCapacity); @@ -215,9 +215,8 @@ private void validate(int initialCapacity, int maxCapacity) { private class MemoryStatusThread extends Thread { public MemoryStatusThread() { - super("memory-status-logger"); + super("allocation.logger"); this.setDaemon(true); - this.setName("allocation.logger"); } @Override @@ -229,12 +228,11 @@ public void run() { } catch (InterruptedException e) { return; } - } } - } + @Override public String toString() { StringBuilder buf = new StringBuilder(); buf.append(directArenas.length); @@ -260,13 +258,4 @@ public String toString() { } - - public static final boolean ASSERT_ENABLED; - - static { - boolean isAssertEnabled = false; - assert isAssertEnabled = true; - ASSERT_ENABLED = isAssertEnabled; - } - } diff --git a/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java b/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java index 023a6a2892b80..5ea176745f25e 100644 --- a/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java +++ b/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java @@ -18,8 +18,6 @@ package io.netty.buffer; -import io.netty.util.internal.PlatformDependent; - import java.io.IOException; import java.io.InputStream; import java.io.OutputStream; @@ -32,7 +30,7 @@ * The underlying class we use for little-endian access to memory. Is used underneath ArrowBufs to abstract away the * Netty classes and underlying Netty memory management. */ -public final class UnsafeDirectLittleEndian extends WrappedByteBuf { +public class UnsafeDirectLittleEndian extends WrappedByteBuf { private static final boolean NATIVE_ORDER = ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN; private static final AtomicLong ID_GENERATOR = new AtomicLong(0); @@ -40,35 +38,25 @@ public final class UnsafeDirectLittleEndian extends WrappedByteBuf { private final AbstractByteBuf wrapped; private final long memoryAddress; - private final AtomicLong bufferCount; - private final AtomicLong bufferSize; - private final long initCap; - UnsafeDirectLittleEndian(DuplicatedByteBuf buf) { - this(buf, true, null, null); + this(buf, true); } UnsafeDirectLittleEndian(LargeBuffer buf) { - this(buf, true, null, null); + this(buf, true); } - UnsafeDirectLittleEndian(PooledUnsafeDirectByteBuf buf, AtomicLong bufferCount, AtomicLong bufferSize) { - this(buf, true, bufferCount, bufferSize); + UnsafeDirectLittleEndian(PooledUnsafeDirectByteBuf buf) { + this(buf, true); } - private UnsafeDirectLittleEndian(AbstractByteBuf buf, boolean fake, AtomicLong bufferCount, AtomicLong bufferSize) { + private UnsafeDirectLittleEndian(AbstractByteBuf buf, boolean fake) { super(buf); if (!NATIVE_ORDER || buf.order() != ByteOrder.BIG_ENDIAN) { throw new IllegalStateException("Arrow only runs on LittleEndian systems."); } - this.bufferCount = bufferCount; - this.bufferSize = bufferSize; - - // initCap is used if we're tracking memory release. If we're in non-debug mode, we'll skip this. - this.initCap = ASSERT_ENABLED ? 
buf.capacity() : -1; this.wrapped = buf; this.memoryAddress = buf.memoryAddress(); } @@ -244,16 +232,6 @@ public boolean release() { return release(1); } - @Override - public boolean release(int decrement) { - final boolean released = super.release(decrement); - if (ASSERT_ENABLED && released && bufferCount != null && bufferSize != null) { - bufferCount.decrementAndGet(); - bufferSize.addAndGet(-initCap); - } - return released; - } - @Override public int setBytes(int index, InputStream in, int length) throws IOException { wrapped.checkIndex(index, length); diff --git a/java/memory/src/main/java/org/apache/arrow/memory/util/Metrics.java b/java/memory/src/main/java/org/apache/arrow/memory/AllocationListener.java similarity index 58% rename from java/memory/src/main/java/org/apache/arrow/memory/util/Metrics.java rename to java/memory/src/main/java/org/apache/arrow/memory/AllocationListener.java index 5177a2478b53a..1b127f8181222 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/util/Metrics.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/AllocationListener.java @@ -1,4 +1,4 @@ -/** +/* * Licensed to the Apache Software Foundation (ASF) under one * or more contributor license agreements. See the NOTICE file * distributed with this work for additional information @@ -15,26 +15,26 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -package org.apache.arrow.memory.util; - -import com.codahale.metrics.MetricRegistry; - -public class Metrics { - - private Metrics() { - - } +package org.apache.arrow.memory; - private static class RegistryHolder { - public static final MetricRegistry REGISTRY; - - static { - REGISTRY = new MetricRegistry(); +/** + * An allocation listener notified on allocation/deallocation events. + * + * It is expected to be called from multiple threads and, as such, + * providers should take care of making the implementation thread-safe. + */ +public interface AllocationListener { + public static final AllocationListener NOOP = new AllocationListener() { + @Override + public void onAllocation(long size) { } + }; - } + /** + * Called each time a new buffer is allocated. + * + * @param size the buffer size being allocated + */ + void onAllocation(long size); - public static MetricRegistry getInstance() { - return RegistryHolder.REGISTRY; - } -} \ No newline at end of file +} diff --git a/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java b/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java index 43ee9c108d902..f15bb8a40fa01 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java @@ -18,9 +18,6 @@ package org.apache.arrow.memory; import static org.apache.arrow.memory.BaseAllocator.indent; -import io.netty.buffer.ArrowBuf; -import io.netty.buffer.PooledByteBufAllocatorL; -import io.netty.buffer.UnsafeDirectLittleEndian; import java.util.IdentityHashMap; import java.util.concurrent.atomic.AtomicInteger; @@ -31,10 +28,13 @@ import org.apache.arrow.memory.BaseAllocator.Verbosity; import org.apache.arrow.memory.util.AutoCloseableLock; import org.apache.arrow.memory.util.HistoricalLog; -import org.apache.arrow.memory.util.Metrics; import com.google.common.base.Preconditions; +import io.netty.buffer.ArrowBuf; +import io.netty.buffer.PooledByteBufAllocatorL; +import io.netty.buffer.UnsafeDirectLittleEndian; + /** * Manages the relationship between one or more
allocators and a particular UDLE. Ensures that one allocator owns the * memory that multiple allocators may be referencing. Manages a BufferLedger between each of its associated allocators. @@ -56,7 +56,10 @@ public class AllocationManager { private static final AtomicLong MANAGER_ID_GENERATOR = new AtomicLong(0); private static final AtomicLong LEDGER_ID_GENERATOR = new AtomicLong(0); - static final PooledByteBufAllocatorL INNER_ALLOCATOR = new PooledByteBufAllocatorL(Metrics.getInstance()); + private static final PooledByteBufAllocatorL INNER_ALLOCATOR = new PooledByteBufAllocatorL(); + + static final UnsafeDirectLittleEndian EMPTY = INNER_ALLOCATOR.empty; + static final long CHUNK_SIZE = INNER_ALLOCATOR.getChunkSize(); private final RootAllocator root; private final long allocatorManagerId = MANAGER_ID_GENERATOR.incrementAndGet(); diff --git a/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java index dbb0705045c35..9edafbce082cb 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java @@ -21,7 +21,6 @@ import java.util.IdentityHashMap; import java.util.Set; import java.util.concurrent.atomic.AtomicInteger; -import java.util.concurrent.atomic.AtomicLong; import org.apache.arrow.memory.AllocationManager.BufferLedger; import org.apache.arrow.memory.util.AssertionUtil; @@ -37,14 +36,12 @@ public abstract class BaseAllocator extends Accountant implements BufferAllocato public static final String DEBUG_ALLOCATOR = "arrow.memory.debug.allocator"; - private static final AtomicLong ID_GENERATOR = new AtomicLong(0); - private static final int CHUNK_SIZE = AllocationManager.INNER_ALLOCATOR.getChunkSize(); - public static final int DEBUG_LOG_LENGTH = 6; public static final boolean DEBUG = AssertionUtil.isAssertionsEnabled() || Boolean.parseBoolean(System.getProperty(DEBUG_ALLOCATOR, "false")); private final Object DEBUG_LOCK = DEBUG ? 
new Object() : null; + private final AllocationListener listener; private final BaseAllocator parentAllocator; private final ArrowByteBufAllocator thisAsByteBufAllocator; private final IdentityHashMap childAllocators; @@ -61,13 +58,32 @@ public abstract class BaseAllocator extends Accountant implements BufferAllocato private final IdentityHashMap reservations; private final HistoricalLog historicalLog; + protected BaseAllocator( + final AllocationListener listener, + final String name, + final long initReservation, + final long maxAllocation) throws OutOfMemoryException { + this(listener, null, name, initReservation, maxAllocation); + } + protected BaseAllocator( final BaseAllocator parentAllocator, final String name, final long initReservation, final long maxAllocation) throws OutOfMemoryException { + this(parentAllocator.listener, parentAllocator, name, initReservation, maxAllocation); + } + + private BaseAllocator( + final AllocationListener listener, + final BaseAllocator parentAllocator, + final String name, + final long initReservation, + final long maxAllocation) throws OutOfMemoryException { super(parentAllocator, initReservation, maxAllocation); + this.listener = listener; + if (parentAllocator != null) { this.root = parentAllocator.root; empty = parentAllocator.empty; @@ -192,7 +208,7 @@ public ArrowBuf buffer(final int initialRequestSize) { private ArrowBuf createEmpty(){ assertOpen(); - return new ArrowBuf(new AtomicInteger(), null, AllocationManager.INNER_ALLOCATOR.empty, null, null, 0, 0, true); + return new ArrowBuf(new AtomicInteger(), null, AllocationManager.EMPTY, null, null, 0, 0, true); } @Override @@ -206,7 +222,7 @@ public ArrowBuf buffer(final int initialRequestSize, BufferManager manager) { } // round to next largest power of two if we're within a chunk since that is how our allocator operates - final int actualRequestSize = initialRequestSize < CHUNK_SIZE ? + final int actualRequestSize = initialRequestSize < AllocationManager.CHUNK_SIZE ? nextPowerOfTwo(initialRequestSize) : initialRequestSize; AllocationOutcome outcome = this.allocateBytes(actualRequestSize); @@ -218,6 +234,7 @@ public ArrowBuf buffer(final int initialRequestSize, BufferManager manager) { try { ArrowBuf buffer = bufferWithoutReservation(actualRequestSize, manager); success = true; + listener.onAllocation(actualRequestSize); return buffer; } finally { if (!success) { @@ -405,6 +422,7 @@ private ArrowBuf allocate(int nBytes) { try { final ArrowBuf arrowBuf = BaseAllocator.this.bufferWithoutReservation(nBytes, null); + listener.onAllocation(nBytes); if (DEBUG) { historicalLog.recordEvent("allocate() => %s", String.format("ArrowBuf[%d]", arrowBuf.getId())); } diff --git a/java/memory/src/main/java/org/apache/arrow/memory/RootAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/RootAllocator.java index 571fc37577209..57a2c0cdae8d8 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/RootAllocator.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/RootAllocator.java @@ -24,9 +24,12 @@ * tree of descendant child allocators. 
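 * An AllocationListener supplied at construction is inherited by all child
 * allocators and is notified of each buffer allocation.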
*/ public class RootAllocator extends BaseAllocator { - public RootAllocator(final long limit) { - super(null, "ROOT", 0, limit); + this(AllocationListener.NOOP, limit); + } + + public RootAllocator(final AllocationListener listener, final long limit) { + super(listener, "ROOT", 0, limit); } /** diff --git a/java/memory/src/main/java/org/apache/arrow/memory/util/Pointer.java b/java/memory/src/main/java/org/apache/arrow/memory/util/Pointer.java deleted file mode 100644 index 58ab13b0a16ab..0000000000000 --- a/java/memory/src/main/java/org/apache/arrow/memory/util/Pointer.java +++ /dev/null @@ -1,28 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.arrow.memory.util; - -public class Pointer { - public T value; - - public Pointer(){} - - public Pointer(T value){ - this.value = value; - } -} From 5ffbda1b408951cb5cf49008920f1054544148d3 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Fri, 13 Jan 2017 08:46:48 -0500 Subject: [PATCH 0271/1644] ARROW-479: Python: Test for expected schema in Pandas conversion Author: Uwe L. Korn Closes #281 from xhochy/ARROW-479 and squashes the following commits: acd9abd [Uwe L. Korn] Use arrow::timestamp() 43dba37 [Uwe L. Korn] Fix tests 7a3f5b8 [Uwe L. 
Korn] ARROW-479: Python: Test for expected schema in Pandas conversion --- python/pyarrow/includes/libarrow.pxd | 2 + python/pyarrow/includes/pyarrow.pxd | 4 +- python/pyarrow/schema.pyx | 38 +++++++++- python/pyarrow/tests/test_convert_builtin.py | 2 +- python/pyarrow/tests/test_convert_pandas.py | 77 ++++++++++++++------ python/pyarrow/tests/test_parquet.py | 2 +- python/src/pyarrow/helpers.cc | 3 - 7 files changed, 97 insertions(+), 31 deletions(-) diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index d1970e5a2c8f1..8cfaaf72bf16f 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -60,6 +60,8 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: c_string ToString() + shared_ptr[CDataType] timestamp(TimeUnit unit) + cdef cppclass MemoryPool" arrow::MemoryPool": int64_t bytes_allocated() diff --git a/python/pyarrow/includes/pyarrow.pxd b/python/pyarrow/includes/pyarrow.pxd index dc6ccd2025932..901e6c9457dfa 100644 --- a/python/pyarrow/includes/pyarrow.pxd +++ b/python/pyarrow/includes/pyarrow.pxd @@ -19,13 +19,15 @@ from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport (CArray, CBuffer, CColumn, CTable, - CDataType, CStatus, Type, MemoryPool) + CDataType, CStatus, Type, MemoryPool, + TimeUnit) cimport pyarrow.includes.libarrow_io as arrow_io cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil: shared_ptr[CDataType] GetPrimitiveType(Type type) + shared_ptr[CDataType] GetTimestampType(TimeUnit unit) CStatus ConvertPySequence(object obj, shared_ptr[CArray]* out) CStatus PandasToArrow(MemoryPool* pool, object ao, diff --git a/python/pyarrow/schema.pyx b/python/pyarrow/schema.pyx index d91ae7cb3b193..f6a1a10c8dd5c 100644 --- a/python/pyarrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ -23,8 +23,20 @@ # cython: embedsignature = True from pyarrow.compat import frombytes, tobytes -from pyarrow.includes.libarrow cimport * +from pyarrow.includes.libarrow cimport (CDataType, CStructType, CListType, + Type_NA, Type_BOOL, + Type_UINT8, Type_INT8, + Type_UINT16, Type_INT16, + Type_UINT32, Type_INT32, + Type_UINT64, Type_INT64, + Type_TIMESTAMP, Type_DATE, + Type_FLOAT, Type_DOUBLE, + Type_STRING, Type_BINARY, + TimeUnit_SECOND, TimeUnit_MILLI, + TimeUnit_MICRO, TimeUnit_NANO, + Type, TimeUnit) cimport pyarrow.includes.pyarrow as pyarrow +cimport pyarrow.includes.libarrow as libarrow cimport cpython @@ -197,8 +209,28 @@ def uint64(): def int64(): return primitive_type(Type_INT64) -def timestamp(): - return primitive_type(Type_TIMESTAMP) +cdef dict _timestamp_type_cache = {} + +def timestamp(unit_str): + cdef TimeUnit unit + if unit_str == "s": + unit = TimeUnit_SECOND + elif unit_str == 'ms': + unit = TimeUnit_MILLI + elif unit_str == 'us': + unit = TimeUnit_MICRO + elif unit_str == 'ns': + unit = TimeUnit_NANO + else: + raise TypeError('Invalid TimeUnit string') + + if unit in _timestamp_type_cache: + return _timestamp_type_cache[unit] + + cdef DataType out = DataType() + out.init(libarrow.timestamp(unit)) + _timestamp_type_cache[unit] = out + return out def date(): return primitive_type(Type_DATE) diff --git a/python/pyarrow/tests/test_convert_builtin.py b/python/pyarrow/tests/test_convert_builtin.py index 61167422de93c..72e438910159f 100644 --- a/python/pyarrow/tests/test_convert_builtin.py +++ b/python/pyarrow/tests/test_convert_builtin.py @@ -112,7 +112,7 @@ def test_timestamp(self): ] arr = pyarrow.from_pylist(data) assert len(arr) == 4 - assert arr.type == pyarrow.timestamp() 
+ assert arr.type == pyarrow.timestamp('us') assert arr.null_count == 1 assert arr[0].as_py() == datetime.datetime(2007, 7, 13, 1, 23, 34, 123456) diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index 12e7a08d795a2..261eaa85657ed 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -60,65 +60,79 @@ def tearDown(self): pass def _check_pandas_roundtrip(self, df, expected=None, nthreads=1, - timestamps_to_ms=False): + timestamps_to_ms=False, expected_schema=None): table = A.Table.from_pandas(df, timestamps_to_ms=timestamps_to_ms) result = table.to_pandas(nthreads=nthreads) + if expected_schema: + assert table.schema.equals(expected_schema) if expected is None: expected = df tm.assert_frame_equal(result, expected) def test_float_no_nulls(self): data = {} - numpy_dtypes = ['f4', 'f8'] + fields = [] + dtypes = [('f4', A.float_()), ('f8', A.double())] num_values = 100 - for dtype in numpy_dtypes: + for numpy_dtype, arrow_dtype in dtypes: values = np.random.randn(num_values) - data[dtype] = values.astype(dtype) + data[numpy_dtype] = values.astype(numpy_dtype) + fields.append(A.Field.from_py(numpy_dtype, arrow_dtype)) df = pd.DataFrame(data) - self._check_pandas_roundtrip(df) + schema = A.Schema.from_fields(fields) + self._check_pandas_roundtrip(df, expected_schema=schema) def test_float_nulls(self): num_values = 100 null_mask = np.random.randint(0, 10, size=num_values) < 3 - dtypes = ['f4', 'f8'] + dtypes = [('f4', A.float_()), ('f8', A.double())] + names = ['f4', 'f8'] expected_cols = [] arrays = [] - for name in dtypes: + fields = [] + for name, arrow_dtype in dtypes: values = np.random.randn(num_values).astype(name) arr = A.from_pandas_series(values, null_mask) arrays.append(arr) - + fields.append(A.Field.from_py(name, arrow_dtype)) values[null_mask] = np.nan expected_cols.append(values) - ex_frame = pd.DataFrame(dict(zip(dtypes, expected_cols)), - columns=dtypes) + ex_frame = pd.DataFrame(dict(zip(names, expected_cols)), + columns=names) - table = A.Table.from_arrays(dtypes, arrays) + table = A.Table.from_arrays(names, arrays) + assert table.schema.equals(A.Schema.from_fields(fields)) result = table.to_pandas() tm.assert_frame_equal(result, ex_frame) def test_integer_no_nulls(self): data = {} + fields = [] - numpy_dtypes = ['i1', 'i2', 'i4', 'i8', 'u1', 'u2', 'u4', 'u8'] + numpy_dtypes = [('i1', A.int8()), ('i2', A.int16()), + ('i4', A.int32()), ('i8', A.int64()), + ('u1', A.uint8()), ('u2', A.uint16()), + ('u4', A.uint32()), ('u8', A.uint64())] num_values = 100 - for dtype in numpy_dtypes: + for dtype, arrow_dtype in numpy_dtypes: info = np.iinfo(dtype) values = np.random.randint(info.min, min(info.max, np.iinfo('i8').max), size=num_values) data[dtype] = values.astype(dtype) + fields.append(A.Field.from_py(dtype, arrow_dtype)) df = pd.DataFrame(data) - self._check_pandas_roundtrip(df) + schema = A.Schema.from_fields(fields) + self._check_pandas_roundtrip(df, expected_schema=schema) def test_integer_with_nulls(self): # pandas requires upcast to float dtype @@ -155,7 +169,9 @@ def test_boolean_no_nulls(self): np.random.seed(0) df = pd.DataFrame({'bools': np.random.randn(num_values) > 0}) - self._check_pandas_roundtrip(df) + field = A.Field.from_py('bools', A.bool_()) + schema = A.Schema.from_fields([field]) + self._check_pandas_roundtrip(df, expected_schema=schema) def test_boolean_nulls(self): # pandas requires upcast to object dtype @@ -170,9 +186,12 @@ def test_boolean_nulls(self): 
expected = values.astype(object) expected[mask] = None + field = A.Field.from_py('bools', A.bool_()) + schema = A.Schema.from_fields([field]) ex_frame = pd.DataFrame({'bools': expected}) table = A.Table.from_arrays(['bools'], [arr]) + assert table.schema.equals(schema) result = table.to_pandas() tm.assert_frame_equal(result, ex_frame) @@ -180,14 +199,18 @@ def test_boolean_nulls(self): def test_boolean_object_nulls(self): arr = np.array([False, None, True] * 100, dtype=object) df = pd.DataFrame({'bools': arr}) - self._check_pandas_roundtrip(df) + field = A.Field.from_py('bools', A.bool_()) + schema = A.Schema.from_fields([field]) + self._check_pandas_roundtrip(df, expected_schema=schema) def test_unicode(self): repeats = 1000 values = [u'foo', None, u'bar', u'mañana', np.nan] df = pd.DataFrame({'strings': values * repeats}) + field = A.Field.from_py('strings', A.string()) + schema = A.Schema.from_fields([field]) - self._check_pandas_roundtrip(df) + self._check_pandas_roundtrip(df, expected_schema=schema) def test_bytes_to_binary(self): values = [u('qux'), b'foo', None, 'bar', 'qux', np.nan] @@ -208,7 +231,9 @@ def test_timestamps_notimezone_no_nulls(self): '2010-08-13T05:46:57.437'], dtype='datetime64[ms]') }) - self._check_pandas_roundtrip(df, timestamps_to_ms=True) + field = A.Field.from_py('datetime64', A.timestamp('ms')) + schema = A.Schema.from_fields([field]) + self._check_pandas_roundtrip(df, timestamps_to_ms=True, expected_schema=schema) df = pd.DataFrame({ 'datetime64': np.array([ @@ -217,7 +242,9 @@ def test_timestamps_notimezone_no_nulls(self): '2010-08-13T05:46:57.437699912'], dtype='datetime64[ns]') }) - self._check_pandas_roundtrip(df, timestamps_to_ms=False) + field = A.Field.from_py('datetime64', A.timestamp('ns')) + schema = A.Schema.from_fields([field]) + self._check_pandas_roundtrip(df, timestamps_to_ms=False, expected_schema=schema) def test_timestamps_notimezone_nulls(self): df = pd.DataFrame({ @@ -227,8 +254,9 @@ def test_timestamps_notimezone_nulls(self): '2010-08-13T05:46:57.437'], dtype='datetime64[ms]') }) - df.info() - self._check_pandas_roundtrip(df, timestamps_to_ms=True) + field = A.Field.from_py('datetime64', A.timestamp('ms')) + schema = A.Schema.from_fields([field]) + self._check_pandas_roundtrip(df, timestamps_to_ms=True, expected_schema=schema) df = pd.DataFrame({ 'datetime64': np.array([ @@ -237,7 +265,9 @@ def test_timestamps_notimezone_nulls(self): '2010-08-13T05:46:57.437699912'], dtype='datetime64[ns]') }) - self._check_pandas_roundtrip(df, timestamps_to_ms=False) + field = A.Field.from_py('datetime64', A.timestamp('ns')) + schema = A.Schema.from_fields([field]) + self._check_pandas_roundtrip(df, timestamps_to_ms=False, expected_schema=schema) def test_date(self): df = pd.DataFrame({ @@ -246,6 +276,9 @@ def test_date(self): datetime.date(1970, 1, 1), datetime.date(2040, 2, 26)]}) table = A.Table.from_pandas(df) + field = A.Field.from_py('date', A.date()) + schema = A.Schema.from_fields([field]) + assert table.schema.equals(schema) result = table.to_pandas() expected = df.copy() expected['date'] = pd.to_datetime(df['date']) diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index ad4bc580e8b1c..e1571557d9aa7 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -244,7 +244,7 @@ def test_parquet_metadata_api(): a_table = A.Table.from_pandas(df, timestamps_to_ms=True) buf = io.BytesIO() - pq.write_table(a_table, buf, compression='snappy', version='2.0') + 
pq.write_table(a_table, buf, compression='SNAPPY', version='2.0') buf.seek(0) fileh = pq.ParquetFile(buf) diff --git a/python/src/pyarrow/helpers.cc b/python/src/pyarrow/helpers.cc index 3f650326e09aa..78fad165ac8e6 100644 --- a/python/src/pyarrow/helpers.cc +++ b/python/src/pyarrow/helpers.cc @@ -41,9 +41,6 @@ std::shared_ptr GetPrimitiveType(Type::type type) { GET_PRIMITIVE_TYPE(UINT64, uint64); GET_PRIMITIVE_TYPE(INT64, int64); GET_PRIMITIVE_TYPE(DATE, date); - case Type::TIMESTAMP: - return arrow::timestamp(arrow::TimeUnit::MICRO); - break; GET_PRIMITIVE_TYPE(BOOL, boolean); GET_PRIMITIVE_TYPE(FLOAT, float32); GET_PRIMITIVE_TYPE(DOUBLE, float64); From ad0e57d23257462b9933745949d54ca729da537e Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 13 Jan 2017 08:50:14 -0500 Subject: [PATCH 0272/1644] ARROW-472: [Python] Expose more C++ IO interfaces. Add equals methods to Parquet schemas. Pass Parquet metadata separately in reader Also includes ARROW-471, ARROW-441. Needed to compare file schemas easily. Requires PARQUET-830 Author: Wes McKinney Closes #280 from wesm/ARROW-472 and squashes the following commits: 1c5e27c [Wes McKinney] Name static const variables constexpr instead 25c5c90 [Wes McKinney] Add some tests for io.OSFile 1d22428 [Wes McKinney] Add some memory map Python unit tests 5268b6c [Wes McKinney] Add untested wrapper for operating system files fd52153 [Wes McKinney] Add unit test for passing metadata down 2316e64 [Wes McKinney] Expose MemoryMappedFile in pyarrow.io, expand parquet::arrow::OpenFile to take metadata, props parameters a2ce247 [Wes McKinney] Add equals methods to Parquet Schema and ColumnSchema objects --- cpp/src/arrow/io/file.cc | 45 +++++--- cpp/src/arrow/io/file.h | 2 + cpp/src/arrow/io/io-file-test.cc | 4 +- python/pyarrow/_parquet.pxd | 18 +++- python/pyarrow/_parquet.pyx | 37 ++++--- python/pyarrow/compat.py | 26 +++++ python/pyarrow/includes/libarrow_io.pxd | 10 ++ python/pyarrow/io.pxd | 3 +- python/pyarrow/io.pyx | 125 ++++++++++++++++++++--- python/pyarrow/parquet.py | 2 +- python/pyarrow/table.pyx | 2 +- python/pyarrow/tests/test_io.py | 130 +++++++++++++++++++++++- python/pyarrow/tests/test_parquet.py | 52 ++++++++-- 13 files changed, 396 insertions(+), 60 deletions(-) diff --git a/cpp/src/arrow/io/file.cc b/cpp/src/arrow/io/file.cc index 0fb13ea22e39f..1de6efa4d811f 100644 --- a/cpp/src/arrow/io/file.cc +++ b/cpp/src/arrow/io/file.cc @@ -188,6 +188,8 @@ static inline Status FileOpenWriteable( memcpy(wpath.data() + nwchars, L"\0", sizeof(wchar_t)); int oflag = _O_CREAT | _O_BINARY; + int sh_flag = _S_IWRITE; + if (!write_only) { sh_flag |= _S_IREAD; } if (truncate) { oflag |= _O_TRUNC; } @@ -197,7 +199,7 @@ static inline Status FileOpenWriteable( oflag |= _O_RDWR; } - errno_actual = _wsopen_s(fd, wpath.data(), oflag, _SH_DENYNO, _S_IWRITE); + errno_actual = _wsopen_s(fd, wpath.data(), oflag, _SH_DENYNO, sh_flag); ret = *fd; #else @@ -319,7 +321,7 @@ class OSFile { RETURN_NOT_OK(FileOpenWriteable(path, write_only, !append, &fd_)); path_ = path; is_open_ = true; - mode_ = write_only ? FileMode::READ : FileMode::READWRITE; + mode_ = write_only ? 
FileMode::WRITE : FileMode::READWRITE; if (append) { RETURN_NOT_OK(FileGetSize(fd_, &size_)); @@ -352,7 +354,7 @@ class OSFile { } Status Seek(int64_t pos) { - if (pos > size_) { pos = size_; } + if (pos < 0) { return Status::Invalid("Invalid position"); } return FileSeek(fd_, pos); } @@ -523,17 +525,24 @@ class MemoryMappedFile::MemoryMappedFileImpl : public OSFile { } Status Open(const std::string& path, FileMode::type mode) { - int prot_flags = PROT_READ; + int prot_flags; + int map_mode; if (mode != FileMode::READ) { - prot_flags |= PROT_WRITE; - const bool append = true; - RETURN_NOT_OK(OSFile::OpenWriteable(path, append, mode == FileMode::WRITE)); + // Memory mapping has permission failures if PROT_READ not set + prot_flags = PROT_READ | PROT_WRITE; + map_mode = MAP_SHARED; + constexpr bool append = true; + constexpr bool write_only = false; + RETURN_NOT_OK(OSFile::OpenWriteable(path, append, write_only)); + mode_ = mode; } else { + prot_flags = PROT_READ; + map_mode = MAP_PRIVATE; // Changes are not to be committed back to the file RETURN_NOT_OK(OSFile::OpenReadable(path)); } - void* result = mmap(nullptr, size_, prot_flags, MAP_SHARED, fd(), 0); + void* result = mmap(nullptr, size_, prot_flags, map_mode, fd(), 0); if (result == MAP_FAILED) { std::stringstream ss; ss << "Memory mapping file failed, errno: " << errno; @@ -548,16 +557,14 @@ class MemoryMappedFile::MemoryMappedFileImpl : public OSFile { int64_t size() const { return size_; } Status Seek(int64_t position) { - if (position < 0 || position >= size_) { - return Status::Invalid("position is out of bounds"); - } + if (position < 0) { return Status::Invalid("position is out of bounds"); } position_ = position; return Status::OK(); } int64_t position() { return position_; } - void advance(int64_t nbytes) { position_ = std::min(size_, position_ + nbytes); } + void advance(int64_t nbytes) { position_ = position_ + nbytes; } uint8_t* data() { return data_; } @@ -611,16 +618,18 @@ Status MemoryMappedFile::Close() { } Status MemoryMappedFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) { - nbytes = std::min(nbytes, impl_->size() - impl_->position()); - std::memcpy(out, impl_->head(), nbytes); + nbytes = std::max(0, std::min(nbytes, impl_->size() - impl_->position())); + if (nbytes > 0) { std::memcpy(out, impl_->head(), nbytes); } *bytes_read = nbytes; impl_->advance(nbytes); return Status::OK(); } Status MemoryMappedFile::Read(int64_t nbytes, std::shared_ptr* out) { - nbytes = std::min(nbytes, impl_->size() - impl_->position()); - *out = std::make_shared(impl_->head(), nbytes); + nbytes = std::max(0, std::min(nbytes, impl_->size() - impl_->position())); + + const uint8_t* data = nbytes > 0 ? 
impl_->head() : nullptr; + *out = std::make_shared(data, nbytes); impl_->advance(nbytes); return Status::OK(); } @@ -655,5 +664,9 @@ Status MemoryMappedFile::WriteInternal(const uint8_t* data, int64_t nbytes) { return Status::OK(); } +int MemoryMappedFile::file_descriptor() const { + return impl_->fd(); +} + } // namespace io } // namespace arrow diff --git a/cpp/src/arrow/io/file.h b/cpp/src/arrow/io/file.h index 9ca9c540e7c22..2387232b2157a 100644 --- a/cpp/src/arrow/io/file.h +++ b/cpp/src/arrow/io/file.h @@ -127,6 +127,8 @@ class ARROW_EXPORT MemoryMappedFile : public ReadWriteFileInterface { // @return: the size in bytes of the memory source Status GetSize(int64_t* size) override; + int file_descriptor() const; + private: explicit MemoryMappedFile(FileMode::type mode); diff --git a/cpp/src/arrow/io/io-file-test.cc b/cpp/src/arrow/io/io-file-test.cc index 821e71d0212f6..20cd04748f019 100644 --- a/cpp/src/arrow/io/io-file-test.cc +++ b/cpp/src/arrow/io/io-file-test.cc @@ -209,8 +209,8 @@ TEST_F(TestReadableFile, SeekTellSize) { ASSERT_OK(file_->Seek(100)); ASSERT_OK(file_->Tell(&position)); - // now at EOF - ASSERT_EQ(8, position); + // Can seek past end of file + ASSERT_EQ(100, position); int64_t size; ASSERT_OK(file_->GetSize(&size)); diff --git a/python/pyarrow/_parquet.pxd b/python/pyarrow/_parquet.pxd index 7e49e9e834b77..cf1da1c3a9e52 100644 --- a/python/pyarrow/_parquet.pxd +++ b/python/pyarrow/_parquet.pxd @@ -99,8 +99,9 @@ cdef extern from "parquet/api/schema.h" namespace "parquet" nogil: ParquetVersion_V2" parquet::ParquetVersion::PARQUET_2_0" cdef cppclass ColumnDescriptor: - shared_ptr[ColumnPath] path() + c_bool Equals(const ColumnDescriptor& other) + shared_ptr[ColumnPath] path() int16_t max_definition_level() int16_t max_repetition_level() @@ -115,6 +116,7 @@ cdef extern from "parquet/api/schema.h" namespace "parquet" nogil: const ColumnDescriptor* Column(int i) shared_ptr[Node] schema() GroupNode* group() + c_bool Equals(const SchemaDescriptor& other) int num_columns() @@ -163,8 +165,18 @@ cdef extern from "parquet/api/reader.h" namespace "parquet" nogil: unique_ptr[CRowGroupMetaData] RowGroup(int i) const SchemaDescriptor* schema() + cdef cppclass ReaderProperties: + pass + + ReaderProperties default_reader_properties() + cdef cppclass ParquetFileReader: - # TODO: Some default arguments are missing + @staticmethod + unique_ptr[ParquetFileReader] Open( + const shared_ptr[ReadableFileInterface]& file, + const ReaderProperties& props, + const shared_ptr[CFileMetaData]& metadata) + @staticmethod unique_ptr[ParquetFileReader] OpenFile(const c_string& path) shared_ptr[CFileMetaData] metadata(); @@ -193,6 +205,8 @@ cdef extern from "parquet/api/writer.h" namespace "parquet" nogil: cdef extern from "parquet/arrow/reader.h" namespace "parquet::arrow" nogil: CStatus OpenFile(const shared_ptr[ReadableFileInterface]& file, MemoryPool* allocator, + const ReaderProperties& properties, + const shared_ptr[CFileMetaData]& metadata, unique_ptr[FileReader]* reader) cdef cppclass FileReader: diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx index 30e3de417a827..867fc4cfecbd6 100644 --- a/python/pyarrow/_parquet.pyx +++ b/python/pyarrow/_parquet.pyx @@ -19,8 +19,9 @@ # distutils: language = c++ # cython: embedsignature = True -from pyarrow._parquet cimport * +from cython.operator cimport dereference as deref +from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport * from pyarrow.includes.libarrow_io cimport (ReadableFileInterface, OutputStream, 
FileOutputStream) @@ -196,6 +197,12 @@ cdef class Schema: def __getitem__(self, i): return self.column(i) + def equals(self, Schema other): + """ + Returns True if the Parquet schemas are equal + """ + return self.schema.Equals(deref(other.schema)) + def column(self, i): if i < 0 or i >= len(self): raise IndexError('{0} out of bounds'.format(i)) @@ -217,6 +224,12 @@ cdef class ColumnSchema: self.parent = schema self.descr = schema.schema.Column(i) + def equals(self, ColumnSchema other): + """ + Returns True if the column schemas are equal + """ + return self.descr.Equals(deref(other.descr)) + def __repr__(self): physical_type = self.physical_type logical_type = self.logical_type @@ -337,24 +350,20 @@ cdef class ParquetReader: self.allocator = default_memory_pool() self._metadata = None - def open(self, object source): + def open(self, object source, FileMetaData metadata=None): cdef: shared_ptr[ReadableFileInterface] rd_handle + shared_ptr[CFileMetaData] c_metadata + ReaderProperties properties = default_reader_properties() c_string path - if isinstance(source, six.string_types): - path = tobytes(source) - - # Must be in one expression to avoid calling std::move which is not - # possible in Cython (due to missing rvalue support) + if metadata is not None: + c_metadata = metadata.sp_metadata - # TODO(wesm): ParquetFileReader::OpenFile can throw? - self.reader = unique_ptr[FileReader]( - new FileReader(default_memory_pool(), - ParquetFileReader.OpenFile(path))) - else: - get_reader(source, &rd_handle) - check_status(OpenFile(rd_handle, self.allocator, &self.reader)) + get_reader(source, &rd_handle) + with nogil: + check_status(OpenFile(rd_handle, self.allocator, properties, + c_metadata, &self.reader)) @property def metadata(self): diff --git a/python/pyarrow/compat.py b/python/pyarrow/compat.py index 2dfdb5041d13e..9148be7d9f8ad 100644 --- a/python/pyarrow/compat.py +++ b/python/pyarrow/compat.py @@ -54,6 +54,10 @@ def dict_values(x): range = xrange long = long + def guid(): + from uuid import uuid4 + return uuid4().get_hex() + def u(s): return unicode(s, "unicode_escape") @@ -76,6 +80,10 @@ def dict_values(x): from decimal import Decimal range = range + def guid(): + from uuid import uuid4 + return uuid4().hex + def u(s): return s @@ -89,6 +97,24 @@ def frombytes(o): return o.decode('utf8') +def encode_file_path(path): + import os + # Windows requires utf-16le encoding for unicode file names + if isinstance(path, unicode_type): + if os.name == 'nt': + # try: + # encoded_path = path.encode('ascii') + # except: + encoded_path = path.encode('utf-16le') + else: + # POSIX systems can handle utf-8 + encoded_path = path.encode('utf-8') + else: + encoded_path = path + + return encoded_path + + integer_types = six.integer_types + (np.integer,) __all__ = [] diff --git a/python/pyarrow/includes/libarrow_io.pxd b/python/pyarrow/includes/libarrow_io.pxd index 99f88adf81d2b..6b141a3e76f09 100644 --- a/python/pyarrow/includes/libarrow_io.pxd +++ b/python/pyarrow/includes/libarrow_io.pxd @@ -69,6 +69,8 @@ cdef extern from "arrow/io/interfaces.h" namespace "arrow::io" nogil: cdef extern from "arrow/io/file.h" namespace "arrow::io" nogil: + + cdef cppclass FileOutputStream(OutputStream): @staticmethod CStatus Open(const c_string& path, shared_ptr[FileOutputStream]* file) @@ -85,6 +87,14 @@ cdef extern from "arrow/io/file.h" namespace "arrow::io" nogil: int file_descriptor() + cdef cppclass CMemoryMappedFile" arrow::io::MemoryMappedFile"\ + (ReadWriteFileInterface): + @staticmethod + CStatus Open(const 
c_string& path, FileMode mode, + shared_ptr[CMemoryMappedFile]* file) + + int file_descriptor() + cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: CStatus HaveLibHdfs() diff --git a/python/pyarrow/io.pxd b/python/pyarrow/io.pxd index 02265d0a68eb1..fffc7c596db76 100644 --- a/python/pyarrow/io.pxd +++ b/python/pyarrow/io.pxd @@ -32,7 +32,8 @@ cdef class NativeFile: cdef: shared_ptr[ReadableFileInterface] rd_file shared_ptr[OutputStream] wr_file - bint is_readonly + bint is_readable + bint is_writeable bint is_open bint own_file diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index b62de6cdd462c..2d8e4e8f34242 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -27,12 +27,13 @@ from pyarrow.includes.libarrow cimport * cimport pyarrow.includes.pyarrow as pyarrow from pyarrow.includes.libarrow_io cimport * -from pyarrow.compat import frombytes, tobytes +from pyarrow.compat import frombytes, tobytes, encode_file_path from pyarrow.error cimport check_status cimport cpython as cp import re +import six import sys import threading import time @@ -42,6 +43,7 @@ cdef extern from "Python.h": PyObject* PyBytes_FromStringAndSizeNative" PyBytes_FromStringAndSize"( char *v, Py_ssize_t len) except NULL + cdef class NativeFile: def __cinit__(self): @@ -61,7 +63,7 @@ cdef class NativeFile: def close(self): if self.is_open: with nogil: - if self.is_readonly: + if self.is_readable: check_status(self.rd_file.get().Close()) else: check_status(self.wr_file.get().Close()) @@ -76,15 +78,15 @@ cdef class NativeFile: file[0] = self.wr_file def _assert_readable(self): - if not self.is_readonly: + if not self.is_readable: raise IOError("only valid on readonly files") if not self.is_open: raise IOError("file not open") def _assert_writeable(self): - if self.is_readonly: - raise IOError("only valid on writeonly files") + if not self.is_writeable: + raise IOError("only valid on writeable files") if not self.is_open: raise IOError("file not open") @@ -99,7 +101,7 @@ cdef class NativeFile: def tell(self): cdef int64_t position with nogil: - if self.is_readonly: + if self.is_readable: check_status(self.rd_file.get().Tell(&position)) else: check_status(self.wr_file.get().Tell(&position)) @@ -137,7 +139,7 @@ cdef class NativeFile: self._assert_readable() # Allocate empty write space - obj = PyBytes_FromStringAndSizeNative(NULL, nbytes) + obj = PyBytes_FromStringAndSizeNative(NULL, c_nbytes) cdef uint8_t* buf = cp.PyBytes_AS_STRING( obj) with nogil: @@ -179,16 +181,100 @@ cdef class PythonFileInterface(NativeFile): if mode.startswith('w'): self.wr_file.reset(new pyarrow.PyOutputStream(handle)) - self.is_readonly = 0 + self.is_readable = 0 + self.is_writeable = 1 elif mode.startswith('r'): self.rd_file.reset(new pyarrow.PyReadableFile(handle)) - self.is_readonly = 1 + self.is_readable = 1 + self.is_writeable = 0 + else: + raise ValueError('Invalid file mode: {0}'.format(mode)) + + self.is_open = True + + +cdef class MemoryMappedFile(NativeFile): + """ + Supports 'r', 'r+w', 'w' modes + """ + cdef: + object path + + def __cinit__(self, path, mode='r'): + self.path = path + + cdef: + FileMode c_mode + shared_ptr[CMemoryMappedFile] handle + c_string c_path = encode_file_path(path) + + self.is_readable = self.is_writeable = 0 + + if mode in ('r', 'rb'): + c_mode = FileMode_READ + self.is_readable = 1 + elif mode in ('w', 'wb'): + c_mode = FileMode_WRITE + self.is_writeable = 1 + elif mode == 'r+w': + c_mode = FileMode_READWRITE + self.is_readable = 1 + self.is_writeable = 1 else: raise 
+
+        check_status(CMemoryMappedFile.Open(c_path, c_mode, &handle))
+
+        self.wr_file = handle
+        self.rd_file = handle
+
+        self.is_open = True
+
+
+cdef class OSFile(NativeFile):
+    """
+    Supports 'r', 'w' modes
+    """
+    cdef:
+        object path
+
+    def __cinit__(self, path, mode='r'):
+        self.path = path
+
+        cdef c_string c_path = encode_file_path(path)
+
+        self.is_readable = self.is_writeable = 0
+
+        if mode in ('r', 'rb'):
+            self._open_readable(c_path)
+        elif mode in ('w', 'wb'):
+            self._open_writeable(c_path)
+        else:
+            raise ValueError('Invalid file mode: {0}'.format(mode))
+
+        self.is_open = True
+
+    cdef _open_readable(self, c_string path):
+        cdef shared_ptr[ReadableFile] handle
+
+        with nogil:
+            check_status(ReadableFile.Open(path, pyarrow.get_memory_pool(),
+                                           &handle))
+
+        self.is_readable = 1
+        self.rd_file = handle
+
+    cdef _open_writeable(self, c_string path):
+        cdef shared_ptr[FileOutputStream] handle
+
+        with nogil:
+            check_status(FileOutputStream.Open(path, &handle))
+        self.is_writeable = 1
+        self.wr_file = handle
+
+
 cdef class BytesReader(NativeFile):
     cdef:
         object obj
@@ -198,7 +281,8 @@ cdef class BytesReader(NativeFile):
             raise ValueError('Must pass bytes object')
 
         self.obj = obj
-        self.is_readonly = 1
+        self.is_readable = 1
+        self.is_writeable = 0
         self.is_open = True
         self.rd_file.reset(new pyarrow.PyBytesReader(obj))
@@ -264,7 +348,8 @@ cdef class InMemoryOutputStream(NativeFile):
         self.buffer = allocate_buffer()
         self.wr_file.reset(new BufferOutputStream(
            self.buffer))
-        self.is_readonly = 0
+        self.is_readable = 0
+        self.is_writeable = 1
         self.is_open = True
 
     def get_result(self):
@@ -285,7 +370,8 @@ cdef class BufferReader(NativeFile):
         self.buffer = buffer
         self.rd_file.reset(new CBufferReader(buffer.buffer.get().data(),
                                              buffer.buffer.get().size()))
-        self.is_readonly = 1
+        self.is_readable = 1
+        self.is_writeable = 0
         self.is_open = True
 
 
@@ -311,12 +397,14 @@ cdef get_reader(object source, shared_ptr[ReadableFileInterface]* reader):
     elif not isinstance(source, NativeFile) and hasattr(source, 'read'):
         # Optimistically hope this is file-like
         source = PythonFileInterface(source, mode='r')
+    elif isinstance(source, six.string_types):
+        source = MemoryMappedFile(source, mode='r')
 
     if isinstance(source, NativeFile):
         nf = source
 
         # TODO: what about read-write sources (e.g. memory maps)
-        if not nf.is_readonly:
+        if not nf.is_readable:
             raise IOError('Native file is not readable')
 
         nf.read_handle(reader)
@@ -335,7 +423,7 @@ cdef get_writer(object source, shared_ptr[OutputStream]* writer):
 
     if isinstance(source, NativeFile):
         nf = source
 
-        if nf.is_readonly:
+        if nf.is_readable:
             raise IOError('Native file is not writeable')
 
         nf.write_handle(writer)
@@ -593,14 +681,16 @@ cdef class HdfsClient:
 
             out.wr_file = wr_handle
 
-            out.is_readonly = False
+            out.is_readable = False
+            out.is_writeable = True
         else:
             with nogil:
                 check_status(self.client.get()
                              .OpenReadable(c_path, &rd_handle))
 
             out.rd_file = rd_handle
-            out.is_readonly = True
+            out.is_readable = True
+            out.is_writeable = False
 
         if c_buffer_size == 0:
             c_buffer_size = 2 ** 16
diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py
index 2dedb72ebfcc1..708ae65585ae2 100644
--- a/python/pyarrow/parquet.py
+++ b/python/pyarrow/parquet.py
@@ -33,7 +33,7 @@ class ParquetFile(object):
     """
     def __init__(self, source, metadata=None):
         self.reader = _parquet.ParquetReader()
-        self.reader.open(source)
+        self.reader.open(source, metadata=metadata)
 
     @property
     def metadata(self):
diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx
index 3a046516d961b..dce125a7b3236 100644
--- a/python/pyarrow/table.pyx
+++ b/python/pyarrow/table.pyx
@@ -22,7 +22,7 @@ from cython.operator cimport dereference as deref
 
 from pyarrow.includes.libarrow cimport *
-from pyarrow.includes.common cimport PyObject_to_object
+from pyarrow.includes.common cimport *
 cimport pyarrow.includes.pyarrow as pyarrow
 
 import pyarrow.config
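As a rough usage sketch of the native file classes added above (the file name is illustrative, and this assumes a pyarrow build that includes this change):

    import pyarrow.io as io

    # Create a small file with ordinary Python I/O
    with open('io-demo.bin', 'wb') as f:
        f.write(b'some data to map')

    # OSFile reads the file through the Arrow C++ file APIs
    with io.OSFile('io-demo.bin', mode='r') as f:
        assert f.read(4) == b'some'

    # MemoryMappedFile maps the same bytes; mode='r+w' would additionally
    # permit writing back into the mapping in place
    with io.MemoryMappedFile('io-demo.bin', mode='r') as mm:
        assert mm.read() == b'some data to map'

Note that `get_reader` above now routes plain string paths through `MemoryMappedFile`, so the same machinery backs any reader that is handed a file path rather than a file object.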
diff --git a/python/pyarrow/tests/test_io.py b/python/pyarrow/tests/test_io.py
index 3e7a43702aa05..224f20dbfbb03 100644
--- a/python/pyarrow/tests/test_io.py
+++ b/python/pyarrow/tests/test_io.py
@@ -16,10 +16,14 @@
 # under the License.
 
 from io import BytesIO
+import os
 import pytest
 
-from pyarrow.compat import u
+import numpy as np
+
+from pyarrow.compat import u, guid
 import pyarrow.io as io
+import pyarrow as pa
 
 # ----------------------------------------------------------------------
 # Python file-like objects
@@ -155,3 +159,127 @@ def test_inmemory_write_after_closed():
 
     with pytest.raises(IOError):
         f.write(b'not ok')
+
+
+# ----------------------------------------------------------------------
+# OS files and memory maps
+
+@pytest.fixture(scope='session')
+def sample_disk_data(request):
+
+    SIZE = 4096
+    arr = np.random.randint(0, 256, size=SIZE).astype('u1')
+    data = arr.tobytes()[:SIZE]
+
+    path = guid()
+    with open(path, 'wb') as f:
+        f.write(data)
+
+    def teardown():
+        _try_delete(path)
+    request.addfinalizer(teardown)
+    return path, data
+
+
+def _check_native_file_reader(KLASS, sample_data):
+    path, data = sample_data
+
+    f = KLASS(path, mode='r')
+
+    assert f.read(10) == data[:10]
+    assert f.read(0) == b''
+    assert f.tell() == 10
+
+    assert f.read() == data[10:]
+
+    assert f.size() == len(data)
+
+    f.seek(0)
+    assert f.tell() == 0
+
+    # Seeking past the end of the file is allowed; reads there return no data
+    f.seek(len(data) + 1)
+    assert f.tell() == len(data) + 1
+    assert f.read(5) == b''
+
+
+def test_memory_map_reader(sample_disk_data):
+    _check_native_file_reader(io.MemoryMappedFile, sample_disk_data)
+
+
+def test_os_file_reader(sample_disk_data):
+    _check_native_file_reader(io.OSFile, sample_disk_data)
+
+
+def _try_delete(path):
+    try:
+        os.remove(path)
+    except os.error:
+        pass
+
+
+def test_memory_map_writer():
+    SIZE = 4096
+    arr = np.random.randint(0, 256, size=SIZE).astype('u1')
+    data = arr.tobytes()[:SIZE]
+
+    path = guid()
+    try:
+        with open(path, 'wb') as f:
+            f.write(data)
+
+        f = io.MemoryMappedFile(path, mode='r+w')
+
+        f.seek(10)
+        f.write(b'peekaboo')
+        assert f.tell() == 18
+
+        f.seek(10)
+        assert f.read(8) == b'peekaboo'
+
+        f2 = io.MemoryMappedFile(path, mode='r+w')
+
+        f2.seek(10)
+        f2.write(b'booapeak')
+        f2.seek(10)
+
+        f.seek(10)
+        assert f.read(8) == b'booapeak'
+
+        # Does not truncate file
+        f3 = io.MemoryMappedFile(path, mode='w')
+        f3.write(b'foo')
+
+        with io.MemoryMappedFile(path) as f4:
+            assert f4.size() == SIZE
+
+        with pytest.raises(IOError):
+            f3.read(5)
+
+        f.seek(0)
+        assert f.read(3) == b'foo'
+    finally:
+        _try_delete(path)
+
+
+def test_os_file_writer():
+    SIZE = 4096
+    arr = np.random.randint(0, 256, size=SIZE).astype('u1')
+    data = arr.tobytes()[:SIZE]
+
+    path = guid()
+    try:
+        with open(path, 'wb') as f:
+            f.write(data)
+
+        # Truncates file
+        f2 = io.OSFile(path, mode='w')
+        f2.write(b'foo')
+
+        with io.OSFile(path) as f3:
+            assert f3.size() == 3
+
+        with pytest.raises(IOError):
+            f2.read(5)
+    finally:
+        _try_delete(path)
diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py
index e1571557d9aa7..9cf860ac28a10 100644
--- a/python/pyarrow/tests/test_parquet.py
+++ b/python/pyarrow/tests/test_parquet.py
@@ -236,19 +236,22 @@ def test_pandas_parquet_configuration_options(tmpdir):
     pdt.assert_frame_equal(df, df_read)
 
 
-@parquet
-def test_parquet_metadata_api():
-    df = alltypes_sample(size=10000)
-    df = df.reindex(columns=sorted(df.columns))
-
+def make_sample_file(df):
     a_table = A.Table.from_pandas(df, timestamps_to_ms=True)
 
     buf = io.BytesIO()
     pq.write_table(a_table, buf, compression='SNAPPY', version='2.0')
 
     buf.seek(0)
 
-    fileh = pq.ParquetFile(buf)
+    return pq.ParquetFile(buf)
+
+@parquet
+def test_parquet_metadata_api():
+    df = alltypes_sample(size=10000)
+    df = df.reindex(columns=sorted(df.columns))
+
+    fileh = make_sample_file(df)
 
     ncols = len(df.columns)
 
     # Series of sniff tests
@@ -288,3 +291,40 @@ def test_parquet_metadata_api():
 
     assert rg_meta.num_rows == len(df)
     assert rg_meta.num_columns == ncols
+
+
+@parquet
+def test_compare_schemas():
+    df = alltypes_sample(size=10000)
+
+    fileh = make_sample_file(df)
+    fileh2 = make_sample_file(df)
+    fileh3 = make_sample_file(df[df.columns[::2]])
+
+    assert fileh.schema.equals(fileh.schema)
+    assert fileh.schema.equals(fileh2.schema)
+
+    assert not fileh.schema.equals(fileh3.schema)
+
+    assert fileh.schema[0].equals(fileh.schema[0])
+    assert not fileh.schema[0].equals(fileh.schema[1])
+
+
+@parquet
+def test_pass_separate_metadata():
+    # ARROW-471
+    df = alltypes_sample(size=10000)
+
+    a_table = A.Table.from_pandas(df, timestamps_to_ms=True)
+
+    buf = io.BytesIO()
+    pq.write_table(a_table, buf, compression='snappy', version='2.0')
+
+    buf.seek(0)
+    metadata = pq.ParquetFile(buf).metadata
+
+    buf.seek(0)
+
+    fileh = pq.ParquetFile(buf, metadata=metadata)
+
+    pdt.assert_frame_equal(df, fileh.read().to_pandas())

From cb83b8d30d6bc7d654736c590763145d7c7252ce Mon Sep 17 00:00:00 2001
From: "Uwe L. Korn"
Date: Fri, 13 Jan 2017 13:58:45 -0500
Subject: [PATCH 0273/1644] ARROW-96: Add C++ API documentation

This adds a basic `Doxyfile` that can be used to generate the HTML API
documentation as well as a small, initial "Getting Started". The
documentation is not yet deployed anywhere. We can either also use
`readthedocs.org` for this (via the `breathe` package) or wait for the
restructuring of the website as discussed on the ML and then add this to
the "update website" scripts. I'd personally prefer the latter.

Author: Uwe L. Korn

Closes #271 from xhochy/ARROW-96 and squashes the following commits:

45c98cb [Uwe L. Korn] Add license headers
e7c9849 [Uwe L. Korn] ARROW-96: Add C++ API documentation
---
 cpp/README.md         |    9 +
 cpp/apidoc/.gitignore |    1 +
 cpp/apidoc/Doxyfile   | 2492 +++++++++++++++++++++++++++++++++++++++++
 cpp/apidoc/index.md   |   85 ++
 cpp/src/arrow/array.h |   35 +-
 5 files changed, 2610 insertions(+), 12 deletions(-)
 create mode 100644 cpp/apidoc/.gitignore
 create mode 100644 cpp/apidoc/Doxyfile
 create mode 100644 cpp/apidoc/index.md

diff --git a/cpp/README.md b/cpp/README.md
index b77ea990d0659..542a854990250 100644
--- a/cpp/README.md
+++ b/cpp/README.md
@@ -62,6 +62,15 @@ variables
 * Hadoop: `HADOOP_HOME` (only required for the HDFS I/O extensions)
 * jemalloc: `JEMALLOC_HOME` (only required for the jemalloc-based memory pool)
 
+### API documentation
+
+To generate the HTML API documentation, run the following command in the
+apidoc directory:
+
+    doxygen Doxyfile
+
+This requires [Doxygen](http://www.doxygen.org) to be installed.
+
 ## Continuous Integration
 
 Pull requests are run through travis-ci for continuous integration.  You can avoid
diff --git a/cpp/apidoc/.gitignore b/cpp/apidoc/.gitignore
new file mode 100644
index 0000000000000..5ccff1a6bea26
--- /dev/null
+++ b/cpp/apidoc/.gitignore
@@ -0,0 +1 @@
+html/
diff --git a/cpp/apidoc/Doxyfile b/cpp/apidoc/Doxyfile
new file mode 100644
index 0000000000000..7dc55fef834fc
--- /dev/null
+++ b/cpp/apidoc/Doxyfile
@@ -0,0 +1,2492 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# Doxyfile 1.8.13 + +# This file describes the settings to be used by the documentation system +# doxygen (www.doxygen.org) for a project. +# +# All text after a double hash (##) is considered a comment and is placed in +# front of the TAG it is preceding. +# +# All text after a single hash (#) is considered a comment and will be ignored. +# The format is: +# TAG = value [value, ...] +# For lists, items can also be appended using: +# TAG += value [value, ...] +# Values that contain spaces should be placed between quotes (\" \"). + +#--------------------------------------------------------------------------- +# Project related configuration options +#--------------------------------------------------------------------------- + +# This tag specifies the encoding used for all characters in the config file +# that follow. The default is UTF-8 which is also the encoding used for all text +# before the first occurrence of this tag. Doxygen uses libiconv (or the iconv +# built into libc) for the transcoding. See http://www.gnu.org/software/libiconv +# for the list of possible encodings. +# The default value is: UTF-8. + +DOXYFILE_ENCODING = UTF-8 + +# The PROJECT_NAME tag is a single word (or a sequence of words surrounded by +# double-quotes, unless you are using Doxywizard) that should identify the +# project for which the documentation is generated. This name is used in the +# title of most generated pages and in a few other places. +# The default value is: My Project. + +PROJECT_NAME = "Apache Arrow (C++)" + +# The PROJECT_NUMBER tag can be used to enter a project or revision number. This +# could be handy for archiving the generated documentation or if some version +# control system is used. + +PROJECT_NUMBER = + +# Using the PROJECT_BRIEF tag one can provide an optional one line description +# for a project that appears at the top of each page and should give viewer a +# quick idea about the purpose of the project. Keep the description short. + +PROJECT_BRIEF = "A columnar in-memory analytics layer designed to accelerate big data." + +# With the PROJECT_LOGO tag one can specify a logo or an icon that is included +# in the documentation. The maximum height of the logo should not exceed 55 +# pixels and the maximum width should not exceed 200 pixels. Doxygen will copy +# the logo to the output directory. + +PROJECT_LOGO = + +# The OUTPUT_DIRECTORY tag is used to specify the (relative or absolute) path +# into which the generated documentation will be written. If a relative path is +# entered, it will be relative to the location where doxygen was started. If +# left blank the current directory will be used. + +OUTPUT_DIRECTORY = + +# If the CREATE_SUBDIRS tag is set to YES then doxygen will create 4096 sub- +# directories (in 2 levels) under the output directory of each output format and +# will distribute the generated files over these directories. 
Enabling this +# option can be useful when feeding doxygen a huge amount of source files, where +# putting all generated files in the same directory would otherwise causes +# performance problems for the file system. +# The default value is: NO. + +CREATE_SUBDIRS = NO + +# If the ALLOW_UNICODE_NAMES tag is set to YES, doxygen will allow non-ASCII +# characters to appear in the names of generated files. If set to NO, non-ASCII +# characters will be escaped, for example _xE3_x81_x84 will be used for Unicode +# U+3044. +# The default value is: NO. + +ALLOW_UNICODE_NAMES = NO + +# The OUTPUT_LANGUAGE tag is used to specify the language in which all +# documentation generated by doxygen is written. Doxygen will use this +# information to generate all constant output in the proper language. +# Possible values are: Afrikaans, Arabic, Armenian, Brazilian, Catalan, Chinese, +# Chinese-Traditional, Croatian, Czech, Danish, Dutch, English (United States), +# Esperanto, Farsi (Persian), Finnish, French, German, Greek, Hungarian, +# Indonesian, Italian, Japanese, Japanese-en (Japanese with English messages), +# Korean, Korean-en (Korean with English messages), Latvian, Lithuanian, +# Macedonian, Norwegian, Persian (Farsi), Polish, Portuguese, Romanian, Russian, +# Serbian, Serbian-Cyrillic, Slovak, Slovene, Spanish, Swedish, Turkish, +# Ukrainian and Vietnamese. +# The default value is: English. + +OUTPUT_LANGUAGE = English + +# If the BRIEF_MEMBER_DESC tag is set to YES, doxygen will include brief member +# descriptions after the members that are listed in the file and class +# documentation (similar to Javadoc). Set to NO to disable this. +# The default value is: YES. + +BRIEF_MEMBER_DESC = YES + +# If the REPEAT_BRIEF tag is set to YES, doxygen will prepend the brief +# description of a member or function before the detailed description +# +# Note: If both HIDE_UNDOC_MEMBERS and BRIEF_MEMBER_DESC are set to NO, the +# brief descriptions will be completely suppressed. +# The default value is: YES. + +REPEAT_BRIEF = YES + +# This tag implements a quasi-intelligent brief description abbreviator that is +# used to form the text in various listings. Each string in this list, if found +# as the leading text of the brief description, will be stripped from the text +# and the result, after processing the whole list, is used as the annotated +# text. Otherwise, the brief description is used as-is. If left blank, the +# following values are used ($name is automatically replaced with the name of +# the entity):The $name class, The $name widget, The $name file, is, provides, +# specifies, contains, represents, a, an and the. + +ABBREVIATE_BRIEF = "The $name class" \ + "The $name widget" \ + "The $name file" \ + is \ + provides \ + specifies \ + contains \ + represents \ + a \ + an \ + the + +# If the ALWAYS_DETAILED_SEC and REPEAT_BRIEF tags are both set to YES then +# doxygen will generate a detailed section even if there is only a brief +# description. +# The default value is: NO. + +ALWAYS_DETAILED_SEC = NO + +# If the INLINE_INHERITED_MEMB tag is set to YES, doxygen will show all +# inherited members of a class in the documentation of that class as if those +# members were ordinary class members. Constructors, destructors and assignment +# operators of the base classes will not be shown. +# The default value is: NO. + +INLINE_INHERITED_MEMB = NO + +# If the FULL_PATH_NAMES tag is set to YES, doxygen will prepend the full path +# before files name in the file list and in the header files. 
If set to NO the +# shortest path that makes the file name unique will be used +# The default value is: YES. + +FULL_PATH_NAMES = YES + +# The STRIP_FROM_PATH tag can be used to strip a user-defined part of the path. +# Stripping is only done if one of the specified strings matches the left-hand +# part of the path. The tag can be used to show relative paths in the file list. +# If left blank the directory from which doxygen is run is used as the path to +# strip. +# +# Note that you can specify absolute paths here, but also relative paths, which +# will be relative from the directory where doxygen is started. +# This tag requires that the tag FULL_PATH_NAMES is set to YES. + +STRIP_FROM_PATH = + +# The STRIP_FROM_INC_PATH tag can be used to strip a user-defined part of the +# path mentioned in the documentation of a class, which tells the reader which +# header file to include in order to use a class. If left blank only the name of +# the header file containing the class definition is used. Otherwise one should +# specify the list of include paths that are normally passed to the compiler +# using the -I flag. + +STRIP_FROM_INC_PATH = + +# If the SHORT_NAMES tag is set to YES, doxygen will generate much shorter (but +# less readable) file names. This can be useful is your file systems doesn't +# support long names like on DOS, Mac, or CD-ROM. +# The default value is: NO. + +SHORT_NAMES = NO + +# If the JAVADOC_AUTOBRIEF tag is set to YES then doxygen will interpret the +# first line (until the first dot) of a Javadoc-style comment as the brief +# description. If set to NO, the Javadoc-style will behave just like regular Qt- +# style comments (thus requiring an explicit @brief command for a brief +# description.) +# The default value is: NO. + +JAVADOC_AUTOBRIEF = NO + +# If the QT_AUTOBRIEF tag is set to YES then doxygen will interpret the first +# line (until the first dot) of a Qt-style comment as the brief description. If +# set to NO, the Qt-style will behave just like regular Qt-style comments (thus +# requiring an explicit \brief command for a brief description.) +# The default value is: NO. + +QT_AUTOBRIEF = NO + +# The MULTILINE_CPP_IS_BRIEF tag can be set to YES to make doxygen treat a +# multi-line C++ special comment block (i.e. a block of //! or /// comments) as +# a brief description. This used to be the default behavior. The new default is +# to treat a multi-line C++ comment block as a detailed description. Set this +# tag to YES if you prefer the old behavior instead. +# +# Note that setting this tag to YES also means that rational rose comments are +# not recognized any more. +# The default value is: NO. + +MULTILINE_CPP_IS_BRIEF = NO + +# If the INHERIT_DOCS tag is set to YES then an undocumented member inherits the +# documentation from any documented member that it re-implements. +# The default value is: YES. + +INHERIT_DOCS = YES + +# If the SEPARATE_MEMBER_PAGES tag is set to YES then doxygen will produce a new +# page for each member. If set to NO, the documentation of a member will be part +# of the file/class/namespace that contains it. +# The default value is: NO. + +SEPARATE_MEMBER_PAGES = NO + +# The TAB_SIZE tag can be used to set the number of spaces in a tab. Doxygen +# uses this value to replace tabs by spaces in code fragments. +# Minimum value: 1, maximum value: 16, default value: 4. + +TAB_SIZE = 4 + +# This tag can be used to specify a number of aliases that act as commands in +# the documentation. 
An alias has the form: +# name=value +# For example adding +# "sideeffect=@par Side Effects:\n" +# will allow you to put the command \sideeffect (or @sideeffect) in the +# documentation, which will result in a user-defined paragraph with heading +# "Side Effects:". You can put \n's in the value part of an alias to insert +# newlines. + +ALIASES = + +# This tag can be used to specify a number of word-keyword mappings (TCL only). +# A mapping has the form "name=value". For example adding "class=itcl::class" +# will allow you to use the command class in the itcl::class meaning. + +TCL_SUBST = + +# Set the OPTIMIZE_OUTPUT_FOR_C tag to YES if your project consists of C sources +# only. Doxygen will then generate output that is more tailored for C. For +# instance, some of the names that are used will be different. The list of all +# members will be omitted, etc. +# The default value is: NO. + +OPTIMIZE_OUTPUT_FOR_C = NO + +# Set the OPTIMIZE_OUTPUT_JAVA tag to YES if your project consists of Java or +# Python sources only. Doxygen will then generate output that is more tailored +# for that language. For instance, namespaces will be presented as packages, +# qualified scopes will look different, etc. +# The default value is: NO. + +OPTIMIZE_OUTPUT_JAVA = NO + +# Set the OPTIMIZE_FOR_FORTRAN tag to YES if your project consists of Fortran +# sources. Doxygen will then generate output that is tailored for Fortran. +# The default value is: NO. + +OPTIMIZE_FOR_FORTRAN = NO + +# Set the OPTIMIZE_OUTPUT_VHDL tag to YES if your project consists of VHDL +# sources. Doxygen will then generate output that is tailored for VHDL. +# The default value is: NO. + +OPTIMIZE_OUTPUT_VHDL = NO + +# Doxygen selects the parser to use depending on the extension of the files it +# parses. With this tag you can assign which parser to use for a given +# extension. Doxygen has a built-in mapping, but you can override or extend it +# using this tag. The format is ext=language, where ext is a file extension, and +# language is one of the parsers supported by doxygen: IDL, Java, Javascript, +# C#, C, C++, D, PHP, Objective-C, Python, Fortran (fixed format Fortran: +# FortranFixed, free formatted Fortran: FortranFree, unknown formatted Fortran: +# Fortran. In the later case the parser tries to guess whether the code is fixed +# or free formatted code, this is the default for Fortran type files), VHDL. For +# instance to make doxygen treat .inc files as Fortran files (default is PHP), +# and .f files as C (default is Fortran), use: inc=Fortran f=C. +# +# Note: For files without extension you can use no_extension as a placeholder. +# +# Note that for custom extensions you also need to set FILE_PATTERNS otherwise +# the files are not read by doxygen. + +EXTENSION_MAPPING = + +# If the MARKDOWN_SUPPORT tag is enabled then doxygen pre-processes all comments +# according to the Markdown format, which allows for more readable +# documentation. See http://daringfireball.net/projects/markdown/ for details. +# The output of markdown processing is further processed by doxygen, so you can +# mix doxygen, HTML, and XML commands with Markdown formatting. Disable only in +# case of backward compatibilities issues. +# The default value is: YES. + +MARKDOWN_SUPPORT = YES + +# When the TOC_INCLUDE_HEADINGS tag is set to a non-zero value, all headings up +# to that level are automatically included in the table of contents, even if +# they do not have an id attribute. +# Note: This feature currently applies only to Markdown headings. 
+# Minimum value: 0, maximum value: 99, default value: 0. +# This tag requires that the tag MARKDOWN_SUPPORT is set to YES. + +TOC_INCLUDE_HEADINGS = 0 + +# When enabled doxygen tries to link words that correspond to documented +# classes, or namespaces to their corresponding documentation. Such a link can +# be prevented in individual cases by putting a % sign in front of the word or +# globally by setting AUTOLINK_SUPPORT to NO. +# The default value is: YES. + +AUTOLINK_SUPPORT = YES + +# If you use STL classes (i.e. std::string, std::vector, etc.) but do not want +# to include (a tag file for) the STL sources as input, then you should set this +# tag to YES in order to let doxygen match functions declarations and +# definitions whose arguments contain STL classes (e.g. func(std::string); +# versus func(std::string) {}). This also make the inheritance and collaboration +# diagrams that involve STL classes more complete and accurate. +# The default value is: NO. + +BUILTIN_STL_SUPPORT = NO + +# If you use Microsoft's C++/CLI language, you should set this option to YES to +# enable parsing support. +# The default value is: NO. + +CPP_CLI_SUPPORT = NO + +# Set the SIP_SUPPORT tag to YES if your project consists of sip (see: +# http://www.riverbankcomputing.co.uk/software/sip/intro) sources only. Doxygen +# will parse them like normal C++ but will assume all classes use public instead +# of private inheritance when no explicit protection keyword is present. +# The default value is: NO. + +SIP_SUPPORT = NO + +# For Microsoft's IDL there are propget and propput attributes to indicate +# getter and setter methods for a property. Setting this option to YES will make +# doxygen to replace the get and set methods by a property in the documentation. +# This will only work if the methods are indeed getting or setting a simple +# type. If this is not the case, or you want to show the methods anyway, you +# should set this option to NO. +# The default value is: YES. + +IDL_PROPERTY_SUPPORT = YES + +# If member grouping is used in the documentation and the DISTRIBUTE_GROUP_DOC +# tag is set to YES then doxygen will reuse the documentation of the first +# member in the group (if any) for the other members of the group. By default +# all members of a group must be documented explicitly. +# The default value is: NO. + +DISTRIBUTE_GROUP_DOC = NO + +# If one adds a struct or class to a group and this option is enabled, then also +# any nested class or struct is added to the same group. By default this option +# is disabled and one has to add nested compounds explicitly via \ingroup. +# The default value is: NO. + +GROUP_NESTED_COMPOUNDS = NO + +# Set the SUBGROUPING tag to YES to allow class member groups of the same type +# (for instance a group of public functions) to be put as a subgroup of that +# type (e.g. under the Public Functions section). Set it to NO to prevent +# subgrouping. Alternatively, this can be done per class using the +# \nosubgrouping command. +# The default value is: YES. + +SUBGROUPING = YES + +# When the INLINE_GROUPED_CLASSES tag is set to YES, classes, structs and unions +# are shown inside the group in which they are included (e.g. using \ingroup) +# instead of on a separate page (for HTML and Man pages) or section (for LaTeX +# and RTF). +# +# Note that this feature does not work in combination with +# SEPARATE_MEMBER_PAGES. +# The default value is: NO. 
+ +INLINE_GROUPED_CLASSES = NO + +# When the INLINE_SIMPLE_STRUCTS tag is set to YES, structs, classes, and unions +# with only public data fields or simple typedef fields will be shown inline in +# the documentation of the scope in which they are defined (i.e. file, +# namespace, or group documentation), provided this scope is documented. If set +# to NO, structs, classes, and unions are shown on a separate page (for HTML and +# Man pages) or section (for LaTeX and RTF). +# The default value is: NO. + +INLINE_SIMPLE_STRUCTS = NO + +# When TYPEDEF_HIDES_STRUCT tag is enabled, a typedef of a struct, union, or +# enum is documented as struct, union, or enum with the name of the typedef. So +# typedef struct TypeS {} TypeT, will appear in the documentation as a struct +# with name TypeT. When disabled the typedef will appear as a member of a file, +# namespace, or class. And the struct will be named TypeS. This can typically be +# useful for C code in case the coding convention dictates that all compound +# types are typedef'ed and only the typedef is referenced, never the tag name. +# The default value is: NO. + +TYPEDEF_HIDES_STRUCT = NO + +# The size of the symbol lookup cache can be set using LOOKUP_CACHE_SIZE. This +# cache is used to resolve symbols given their name and scope. Since this can be +# an expensive process and often the same symbol appears multiple times in the +# code, doxygen keeps a cache of pre-resolved symbols. If the cache is too small +# doxygen will become slower. If the cache is too large, memory is wasted. The +# cache size is given by this formula: 2^(16+LOOKUP_CACHE_SIZE). The valid range +# is 0..9, the default is 0, corresponding to a cache size of 2^16=65536 +# symbols. At the end of a run doxygen will report the cache usage and suggest +# the optimal cache size from a speed point of view. +# Minimum value: 0, maximum value: 9, default value: 0. + +LOOKUP_CACHE_SIZE = 0 + +#--------------------------------------------------------------------------- +# Build related configuration options +#--------------------------------------------------------------------------- + +# If the EXTRACT_ALL tag is set to YES, doxygen will assume all entities in +# documentation are documented, even if no documentation was available. Private +# class members and static file members will be hidden unless the +# EXTRACT_PRIVATE respectively EXTRACT_STATIC tags are set to YES. +# Note: This will also disable the warnings about undocumented members that are +# normally produced when WARNINGS is set to YES. +# The default value is: NO. + +EXTRACT_ALL = YES + +# If the EXTRACT_PRIVATE tag is set to YES, all private members of a class will +# be included in the documentation. +# The default value is: NO. + +EXTRACT_PRIVATE = NO + +# If the EXTRACT_PACKAGE tag is set to YES, all members with package or internal +# scope will be included in the documentation. +# The default value is: NO. + +EXTRACT_PACKAGE = NO + +# If the EXTRACT_STATIC tag is set to YES, all static members of a file will be +# included in the documentation. +# The default value is: NO. + +EXTRACT_STATIC = NO + +# If the EXTRACT_LOCAL_CLASSES tag is set to YES, classes (and structs) defined +# locally in source files will be included in the documentation. If set to NO, +# only classes defined in header files are included. Does not have any effect +# for Java sources. +# The default value is: YES. + +EXTRACT_LOCAL_CLASSES = YES + +# This flag is only useful for Objective-C code. 
If set to YES, local methods, +# which are defined in the implementation section but not in the interface are +# included in the documentation. If set to NO, only methods in the interface are +# included. +# The default value is: NO. + +EXTRACT_LOCAL_METHODS = NO + +# If this flag is set to YES, the members of anonymous namespaces will be +# extracted and appear in the documentation as a namespace called +# 'anonymous_namespace{file}', where file will be replaced with the base name of +# the file that contains the anonymous namespace. By default anonymous namespace +# are hidden. +# The default value is: NO. + +EXTRACT_ANON_NSPACES = NO + +# If the HIDE_UNDOC_MEMBERS tag is set to YES, doxygen will hide all +# undocumented members inside documented classes or files. If set to NO these +# members will be included in the various overviews, but no documentation +# section is generated. This option has no effect if EXTRACT_ALL is enabled. +# The default value is: NO. + +HIDE_UNDOC_MEMBERS = NO + +# If the HIDE_UNDOC_CLASSES tag is set to YES, doxygen will hide all +# undocumented classes that are normally visible in the class hierarchy. If set +# to NO, these classes will be included in the various overviews. This option +# has no effect if EXTRACT_ALL is enabled. +# The default value is: NO. + +HIDE_UNDOC_CLASSES = NO + +# If the HIDE_FRIEND_COMPOUNDS tag is set to YES, doxygen will hide all friend +# (class|struct|union) declarations. If set to NO, these declarations will be +# included in the documentation. +# The default value is: NO. + +HIDE_FRIEND_COMPOUNDS = NO + +# If the HIDE_IN_BODY_DOCS tag is set to YES, doxygen will hide any +# documentation blocks found inside the body of a function. If set to NO, these +# blocks will be appended to the function's detailed documentation block. +# The default value is: NO. + +HIDE_IN_BODY_DOCS = NO + +# The INTERNAL_DOCS tag determines if documentation that is typed after a +# \internal command is included. If the tag is set to NO then the documentation +# will be excluded. Set it to YES to include the internal documentation. +# The default value is: NO. + +INTERNAL_DOCS = NO + +# If the CASE_SENSE_NAMES tag is set to NO then doxygen will only generate file +# names in lower-case letters. If set to YES, upper-case letters are also +# allowed. This is useful if you have classes or files whose names only differ +# in case and if your file system supports case sensitive file names. Windows +# and Mac users are advised to set this option to NO. +# The default value is: system dependent. + +CASE_SENSE_NAMES = NO + +# If the HIDE_SCOPE_NAMES tag is set to NO then doxygen will show members with +# their full class and namespace scopes in the documentation. If set to YES, the +# scope will be hidden. +# The default value is: NO. + +HIDE_SCOPE_NAMES = NO + +# If the HIDE_COMPOUND_REFERENCE tag is set to NO (default) then doxygen will +# append additional text to a page's title, such as Class Reference. If set to +# YES the compound reference will be hidden. +# The default value is: NO. + +HIDE_COMPOUND_REFERENCE= NO + +# If the SHOW_INCLUDE_FILES tag is set to YES then doxygen will put a list of +# the files that are included by a file in the documentation of that file. +# The default value is: YES. + +SHOW_INCLUDE_FILES = YES + +# If the SHOW_GROUPED_MEMB_INC tag is set to YES then Doxygen will add for each +# grouped member an include statement to the documentation, telling the reader +# which file to include in order to use the member. 
+# The default value is: NO. + +SHOW_GROUPED_MEMB_INC = NO + +# If the FORCE_LOCAL_INCLUDES tag is set to YES then doxygen will list include +# files with double quotes in the documentation rather than with sharp brackets. +# The default value is: NO. + +FORCE_LOCAL_INCLUDES = NO + +# If the INLINE_INFO tag is set to YES then a tag [inline] is inserted in the +# documentation for inline members. +# The default value is: YES. + +INLINE_INFO = YES + +# If the SORT_MEMBER_DOCS tag is set to YES then doxygen will sort the +# (detailed) documentation of file and class members alphabetically by member +# name. If set to NO, the members will appear in declaration order. +# The default value is: YES. + +SORT_MEMBER_DOCS = YES + +# If the SORT_BRIEF_DOCS tag is set to YES then doxygen will sort the brief +# descriptions of file, namespace and class members alphabetically by member +# name. If set to NO, the members will appear in declaration order. Note that +# this will also influence the order of the classes in the class list. +# The default value is: NO. + +SORT_BRIEF_DOCS = NO + +# If the SORT_MEMBERS_CTORS_1ST tag is set to YES then doxygen will sort the +# (brief and detailed) documentation of class members so that constructors and +# destructors are listed first. If set to NO the constructors will appear in the +# respective orders defined by SORT_BRIEF_DOCS and SORT_MEMBER_DOCS. +# Note: If SORT_BRIEF_DOCS is set to NO this option is ignored for sorting brief +# member documentation. +# Note: If SORT_MEMBER_DOCS is set to NO this option is ignored for sorting +# detailed member documentation. +# The default value is: NO. + +SORT_MEMBERS_CTORS_1ST = NO + +# If the SORT_GROUP_NAMES tag is set to YES then doxygen will sort the hierarchy +# of group names into alphabetical order. If set to NO the group names will +# appear in their defined order. +# The default value is: NO. + +SORT_GROUP_NAMES = NO + +# If the SORT_BY_SCOPE_NAME tag is set to YES, the class list will be sorted by +# fully-qualified names, including namespaces. If set to NO, the class list will +# be sorted only by class name, not including the namespace part. +# Note: This option is not very useful if HIDE_SCOPE_NAMES is set to YES. +# Note: This option applies only to the class list, not to the alphabetical +# list. +# The default value is: NO. + +SORT_BY_SCOPE_NAME = NO + +# If the STRICT_PROTO_MATCHING option is enabled and doxygen fails to do proper +# type resolution of all parameters of a function it will reject a match between +# the prototype and the implementation of a member function even if there is +# only one candidate or it is obvious which candidate to choose by doing a +# simple string match. By disabling STRICT_PROTO_MATCHING doxygen will still +# accept a match between prototype and implementation in such cases. +# The default value is: NO. + +STRICT_PROTO_MATCHING = NO + +# The GENERATE_TODOLIST tag can be used to enable (YES) or disable (NO) the todo +# list. This list is created by putting \todo commands in the documentation. +# The default value is: YES. + +GENERATE_TODOLIST = YES + +# The GENERATE_TESTLIST tag can be used to enable (YES) or disable (NO) the test +# list. This list is created by putting \test commands in the documentation. +# The default value is: YES. + +GENERATE_TESTLIST = YES + +# The GENERATE_BUGLIST tag can be used to enable (YES) or disable (NO) the bug +# list. This list is created by putting \bug commands in the documentation. +# The default value is: YES. 
+ +GENERATE_BUGLIST = YES + +# The GENERATE_DEPRECATEDLIST tag can be used to enable (YES) or disable (NO) +# the deprecated list. This list is created by putting \deprecated commands in +# the documentation. +# The default value is: YES. + +GENERATE_DEPRECATEDLIST= YES + +# The ENABLED_SECTIONS tag can be used to enable conditional documentation +# sections, marked by \if ... \endif and \cond +# ... \endcond blocks. + +ENABLED_SECTIONS = + +# The MAX_INITIALIZER_LINES tag determines the maximum number of lines that the +# initial value of a variable or macro / define can have for it to appear in the +# documentation. If the initializer consists of more lines than specified here +# it will be hidden. Use a value of 0 to hide initializers completely. The +# appearance of the value of individual variables and macros / defines can be +# controlled using \showinitializer or \hideinitializer command in the +# documentation regardless of this setting. +# Minimum value: 0, maximum value: 10000, default value: 30. + +MAX_INITIALIZER_LINES = 30 + +# Set the SHOW_USED_FILES tag to NO to disable the list of files generated at +# the bottom of the documentation of classes and structs. If set to YES, the +# list will mention the files that were used to generate the documentation. +# The default value is: YES. + +SHOW_USED_FILES = YES + +# Set the SHOW_FILES tag to NO to disable the generation of the Files page. This +# will remove the Files entry from the Quick Index and from the Folder Tree View +# (if specified). +# The default value is: YES. + +SHOW_FILES = YES + +# Set the SHOW_NAMESPACES tag to NO to disable the generation of the Namespaces +# page. This will remove the Namespaces entry from the Quick Index and from the +# Folder Tree View (if specified). +# The default value is: YES. + +SHOW_NAMESPACES = YES + +# The FILE_VERSION_FILTER tag can be used to specify a program or script that +# doxygen should invoke to get the current version for each file (typically from +# the version control system). Doxygen will invoke the program by executing (via +# popen()) the command command input-file, where command is the value of the +# FILE_VERSION_FILTER tag, and input-file is the name of an input file provided +# by doxygen. Whatever the program writes to standard output is used as the file +# version. For an example see the documentation. + +FILE_VERSION_FILTER = + +# The LAYOUT_FILE tag can be used to specify a layout file which will be parsed +# by doxygen. The layout file controls the global structure of the generated +# output files in an output format independent way. To create the layout file +# that represents doxygen's defaults, run doxygen with the -l option. You can +# optionally specify a file name after the option, if omitted DoxygenLayout.xml +# will be used as the name of the layout file. +# +# Note that if you run doxygen from a directory containing a file called +# DoxygenLayout.xml, doxygen will parse it automatically even if the LAYOUT_FILE +# tag is left empty. + +LAYOUT_FILE = + +# The CITE_BIB_FILES tag can be used to specify one or more bib files containing +# the reference definitions. This must be a list of .bib files. The .bib +# extension is automatically appended if omitted. This requires the bibtex tool +# to be installed. See also http://en.wikipedia.org/wiki/BibTeX for more info. +# For LaTeX the style of the bibliography can be controlled using +# LATEX_BIB_STYLE. To use this feature you need bibtex and perl available in the +# search path. 
See also \cite for info how to create references. + +CITE_BIB_FILES = + +#--------------------------------------------------------------------------- +# Configuration options related to warning and progress messages +#--------------------------------------------------------------------------- + +# The QUIET tag can be used to turn on/off the messages that are generated to +# standard output by doxygen. If QUIET is set to YES this implies that the +# messages are off. +# The default value is: NO. + +QUIET = NO + +# The WARNINGS tag can be used to turn on/off the warning messages that are +# generated to standard error (stderr) by doxygen. If WARNINGS is set to YES +# this implies that the warnings are on. +# +# Tip: Turn warnings on while writing the documentation. +# The default value is: YES. + +WARNINGS = YES + +# If the WARN_IF_UNDOCUMENTED tag is set to YES then doxygen will generate +# warnings for undocumented members. If EXTRACT_ALL is set to YES then this flag +# will automatically be disabled. +# The default value is: YES. + +WARN_IF_UNDOCUMENTED = YES + +# If the WARN_IF_DOC_ERROR tag is set to YES, doxygen will generate warnings for +# potential errors in the documentation, such as not documenting some parameters +# in a documented function, or documenting parameters that don't exist or using +# markup commands wrongly. +# The default value is: YES. + +WARN_IF_DOC_ERROR = YES + +# This WARN_NO_PARAMDOC option can be enabled to get warnings for functions that +# are documented, but have no documentation for their parameters or return +# value. If set to NO, doxygen will only warn about wrong or incomplete +# parameter documentation, but not about the absence of documentation. +# The default value is: NO. + +WARN_NO_PARAMDOC = NO + +# If the WARN_AS_ERROR tag is set to YES then doxygen will immediately stop when +# a warning is encountered. +# The default value is: NO. + +WARN_AS_ERROR = NO + +# The WARN_FORMAT tag determines the format of the warning messages that doxygen +# can produce. The string should contain the $file, $line, and $text tags, which +# will be replaced by the file and line number from which the warning originated +# and the warning text. Optionally the format may contain $version, which will +# be replaced by the version of the file (if it could be obtained via +# FILE_VERSION_FILTER) +# The default value is: $file:$line: $text. + +WARN_FORMAT = "$file:$line: $text" + +# The WARN_LOGFILE tag can be used to specify a file to which warning and error +# messages should be written. If left blank the output is written to standard +# error (stderr). + +WARN_LOGFILE = + +#--------------------------------------------------------------------------- +# Configuration options related to the input files +#--------------------------------------------------------------------------- + +# The INPUT tag is used to specify the files and/or directories that contain +# documented source files. You may enter file names like myfile.cpp or +# directories like /usr/src/myproject. Separate the files or directories with +# spaces. See also FILE_PATTERNS and EXTENSION_MAPPING +# Note: If this tag is empty the current directory is searched. + +# Note we include "." here to add the markdown text. +INPUT = ../src . + +# This tag can be used to specify the character encoding of the source files +# that doxygen parses. Internally doxygen uses the UTF-8 encoding. Doxygen uses +# libiconv (or the iconv built into libc) for the transcoding. 
See the libiconv +# documentation (see: http://www.gnu.org/software/libiconv) for the list of +# possible encodings. +# The default value is: UTF-8. + +INPUT_ENCODING = UTF-8 + +# If the value of the INPUT tag contains directories, you can use the +# FILE_PATTERNS tag to specify one or more wildcard patterns (like *.cpp and +# *.h) to filter out the source-files in the directories. +# +# Note that for custom extensions or not directly supported extensions you also +# need to set EXTENSION_MAPPING for the extension otherwise the files are not +# read by doxygen. +# +# If left blank the following patterns are tested:*.c, *.cc, *.cxx, *.cpp, +# *.c++, *.java, *.ii, *.ixx, *.ipp, *.i++, *.inl, *.idl, *.ddl, *.odl, *.h, +# *.hh, *.hxx, *.hpp, *.h++, *.cs, *.d, *.php, *.php4, *.php5, *.phtml, *.inc, +# *.m, *.markdown, *.md, *.mm, *.dox, *.py, *.pyw, *.f90, *.f95, *.f03, *.f08, +# *.f, *.for, *.tcl, *.vhd, *.vhdl, *.ucf and *.qsf. + +FILE_PATTERNS = *.c \ + *.cc \ + *.cxx \ + *.cpp \ + *.c++ \ + *.java \ + *.ii \ + *.ixx \ + *.ipp \ + *.i++ \ + *.inl \ + *.idl \ + *.ddl \ + *.odl \ + *.h \ + *.hh \ + *.hxx \ + *.hpp \ + *.h++ \ + *.cs \ + *.d \ + *.php \ + *.php4 \ + *.php5 \ + *.phtml \ + *.inc \ + *.m \ + *.markdown \ + *.md \ + *.mm \ + *.dox \ + *.py \ + *.pyw \ + *.f90 \ + *.f95 \ + *.f03 \ + *.f08 \ + *.f \ + *.for \ + *.tcl \ + *.vhd \ + *.vhdl \ + *.ucf \ + *.qsf + +# The RECURSIVE tag can be used to specify whether or not subdirectories should +# be searched for input files as well. +# The default value is: NO. + +RECURSIVE = YES + +# The EXCLUDE tag can be used to specify files and/or directories that should be +# excluded from the INPUT source files. This way you can easily exclude a +# subdirectory from a directory tree whose root is specified with the INPUT tag. +# +# Note that relative paths are relative to the directory from which doxygen is +# run. + +EXCLUDE = + +# The EXCLUDE_SYMLINKS tag can be used to select whether or not files or +# directories that are symbolic links (a Unix file system feature) are excluded +# from the input. +# The default value is: NO. + +EXCLUDE_SYMLINKS = NO + +# If the value of the INPUT tag contains directories, you can use the +# EXCLUDE_PATTERNS tag to specify one or more wildcard patterns to exclude +# certain files from those directories. +# +# Note that the wildcards are matched against the file with absolute path, so to +# exclude all test directories for example use the pattern */test/* + +EXCLUDE_PATTERNS = *-test.cc \ + *-benchmark.cc + +# The EXCLUDE_SYMBOLS tag can be used to specify one or more symbol names +# (namespaces, classes, functions, etc.) that should be excluded from the +# output. The symbol name can be a fully qualified name, a word, or if the +# wildcard * is used, a substring. Examples: ANamespace, AClass, +# AClass::ANamespace, ANamespace::*Test +# +# Note that the wildcards are matched against the file with absolute path, so to +# exclude all test directories use the pattern */test/* + +EXCLUDE_SYMBOLS = + +# The EXAMPLE_PATH tag can be used to specify one or more files or directories +# that contain example code fragments that are included (see the \include +# command). + +EXAMPLE_PATH = + +# If the value of the EXAMPLE_PATH tag contains directories, you can use the +# EXAMPLE_PATTERNS tag to specify one or more wildcard pattern (like *.cpp and +# *.h) to filter out the source-files in the directories. If left blank all +# files are included. 
+ +EXAMPLE_PATTERNS = * + +# If the EXAMPLE_RECURSIVE tag is set to YES then subdirectories will be +# searched for input files to be used with the \include or \dontinclude commands +# irrespective of the value of the RECURSIVE tag. +# The default value is: NO. + +EXAMPLE_RECURSIVE = NO + +# The IMAGE_PATH tag can be used to specify one or more files or directories +# that contain images that are to be included in the documentation (see the +# \image command). + +IMAGE_PATH = + +# The INPUT_FILTER tag can be used to specify a program that doxygen should +# invoke to filter for each input file. Doxygen will invoke the filter program +# by executing (via popen()) the command: +# +# +# +# where is the value of the INPUT_FILTER tag, and is the +# name of an input file. Doxygen will then use the output that the filter +# program writes to standard output. If FILTER_PATTERNS is specified, this tag +# will be ignored. +# +# Note that the filter must not add or remove lines; it is applied before the +# code is scanned, but not when the output code is generated. If lines are added +# or removed, the anchors will not be placed correctly. +# +# Note that for custom extensions or not directly supported extensions you also +# need to set EXTENSION_MAPPING for the extension otherwise the files are not +# properly processed by doxygen. + +INPUT_FILTER = + +# The FILTER_PATTERNS tag can be used to specify filters on a per file pattern +# basis. Doxygen will compare the file name with each pattern and apply the +# filter if there is a match. The filters are a list of the form: pattern=filter +# (like *.cpp=my_cpp_filter). See INPUT_FILTER for further information on how +# filters are used. If the FILTER_PATTERNS tag is empty or if none of the +# patterns match the file name, INPUT_FILTER is applied. +# +# Note that for custom extensions or not directly supported extensions you also +# need to set EXTENSION_MAPPING for the extension otherwise the files are not +# properly processed by doxygen. + +FILTER_PATTERNS = + +# If the FILTER_SOURCE_FILES tag is set to YES, the input filter (if set using +# INPUT_FILTER) will also be used to filter the input files that are used for +# producing the source files to browse (i.e. when SOURCE_BROWSER is set to YES). +# The default value is: NO. + +FILTER_SOURCE_FILES = NO + +# The FILTER_SOURCE_PATTERNS tag can be used to specify source filters per file +# pattern. A pattern will override the setting for FILTER_PATTERN (if any) and +# it is also possible to disable source filtering for a specific pattern using +# *.ext= (so without naming a filter). +# This tag requires that the tag FILTER_SOURCE_FILES is set to YES. + +FILTER_SOURCE_PATTERNS = + +# If the USE_MDFILE_AS_MAINPAGE tag refers to the name of a markdown file that +# is part of the input, its contents will be placed on the main page +# (index.html). This can be useful if you have a project on for instance GitHub +# and want to reuse the introduction page also for the doxygen output. + +USE_MDFILE_AS_MAINPAGE = + +#--------------------------------------------------------------------------- +# Configuration options related to source browsing +#--------------------------------------------------------------------------- + +# If the SOURCE_BROWSER tag is set to YES then a list of source files will be +# generated. Documented entities will be cross-referenced with these sources. +# +# Note: To get rid of all source code in the generated output, make sure that +# also VERBATIM_HEADERS is set to NO. 
+# The default value is: NO. + +SOURCE_BROWSER = NO + +# Setting the INLINE_SOURCES tag to YES will include the body of functions, +# classes and enums directly into the documentation. +# The default value is: NO. + +INLINE_SOURCES = NO + +# Setting the STRIP_CODE_COMMENTS tag to YES will instruct doxygen to hide any +# special comment blocks from generated source code fragments. Normal C, C++ and +# Fortran comments will always remain visible. +# The default value is: YES. + +STRIP_CODE_COMMENTS = YES + +# If the REFERENCED_BY_RELATION tag is set to YES then for each documented +# function all documented functions referencing it will be listed. +# The default value is: NO. + +REFERENCED_BY_RELATION = NO + +# If the REFERENCES_RELATION tag is set to YES then for each documented function +# all documented entities called/used by that function will be listed. +# The default value is: NO. + +REFERENCES_RELATION = NO + +# If the REFERENCES_LINK_SOURCE tag is set to YES and SOURCE_BROWSER tag is set +# to YES then the hyperlinks from functions in REFERENCES_RELATION and +# REFERENCED_BY_RELATION lists will link to the source code. Otherwise they will +# link to the documentation. +# The default value is: YES. + +REFERENCES_LINK_SOURCE = YES + +# If SOURCE_TOOLTIPS is enabled (the default) then hovering a hyperlink in the +# source code will show a tooltip with additional information such as prototype, +# brief description and links to the definition and documentation. Since this +# will make the HTML file larger and loading of large files a bit slower, you +# can opt to disable this feature. +# The default value is: YES. +# This tag requires that the tag SOURCE_BROWSER is set to YES. + +SOURCE_TOOLTIPS = YES + +# If the USE_HTAGS tag is set to YES then the references to source code will +# point to the HTML generated by the htags(1) tool instead of doxygen built-in +# source browser. The htags tool is part of GNU's global source tagging system +# (see http://www.gnu.org/software/global/global.html). You will need version +# 4.8.6 or higher. +# +# To use it do the following: +# - Install the latest version of global +# - Enable SOURCE_BROWSER and USE_HTAGS in the config file +# - Make sure the INPUT points to the root of the source tree +# - Run doxygen as normal +# +# Doxygen will invoke htags (and that will in turn invoke gtags), so these +# tools must be available from the command line (i.e. in the search path). +# +# The result: instead of the source browser generated by doxygen, the links to +# source code will now point to the output of htags. +# The default value is: NO. +# This tag requires that the tag SOURCE_BROWSER is set to YES. + +USE_HTAGS = NO + +# If the VERBATIM_HEADERS tag is set the YES then doxygen will generate a +# verbatim copy of the header file for each class for which an include is +# specified. Set to NO to disable this. +# See also: Section \class. +# The default value is: YES. + +VERBATIM_HEADERS = YES + +#--------------------------------------------------------------------------- +# Configuration options related to the alphabetical class index +#--------------------------------------------------------------------------- + +# If the ALPHABETICAL_INDEX tag is set to YES, an alphabetical index of all +# compounds will be generated. Enable this if the project contains a lot of +# classes, structs, unions or interfaces. +# The default value is: YES. 
+ +ALPHABETICAL_INDEX = YES + +# The COLS_IN_ALPHA_INDEX tag can be used to specify the number of columns in +# which the alphabetical index list will be split. +# Minimum value: 1, maximum value: 20, default value: 5. +# This tag requires that the tag ALPHABETICAL_INDEX is set to YES. + +COLS_IN_ALPHA_INDEX = 5 + +# In case all classes in a project start with a common prefix, all classes will +# be put under the same header in the alphabetical index. The IGNORE_PREFIX tag +# can be used to specify a prefix (or a list of prefixes) that should be ignored +# while generating the index headers. +# This tag requires that the tag ALPHABETICAL_INDEX is set to YES. + +IGNORE_PREFIX = + +#--------------------------------------------------------------------------- +# Configuration options related to the HTML output +#--------------------------------------------------------------------------- + +# If the GENERATE_HTML tag is set to YES, doxygen will generate HTML output +# The default value is: YES. + +GENERATE_HTML = YES + +# The HTML_OUTPUT tag is used to specify where the HTML docs will be put. If a +# relative path is entered the value of OUTPUT_DIRECTORY will be put in front of +# it. +# The default directory is: html. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_OUTPUT = html + +# The HTML_FILE_EXTENSION tag can be used to specify the file extension for each +# generated HTML page (for example: .htm, .php, .asp). +# The default value is: .html. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_FILE_EXTENSION = .html + +# The HTML_HEADER tag can be used to specify a user-defined HTML header file for +# each generated HTML page. If the tag is left blank doxygen will generate a +# standard header. +# +# To get valid HTML the header file that includes any scripts and style sheets +# that doxygen needs, which is dependent on the configuration options used (e.g. +# the setting GENERATE_TREEVIEW). It is highly recommended to start with a +# default header using +# doxygen -w html new_header.html new_footer.html new_stylesheet.css +# YourConfigFile +# and then modify the file new_header.html. See also section "Doxygen usage" +# for information on how to generate the default header that doxygen normally +# uses. +# Note: The header is subject to change so you typically have to regenerate the +# default header when upgrading to a newer version of doxygen. For a description +# of the possible markers and block names see the documentation. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_HEADER = + +# The HTML_FOOTER tag can be used to specify a user-defined HTML footer for each +# generated HTML page. If the tag is left blank doxygen will generate a standard +# footer. See HTML_HEADER for more information on how to generate a default +# footer and what special commands can be used inside the footer. See also +# section "Doxygen usage" for information on how to generate the default footer +# that doxygen normally uses. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_FOOTER = + +# The HTML_STYLESHEET tag can be used to specify a user-defined cascading style +# sheet that is used by each HTML page. It can be used to fine-tune the look of +# the HTML output. If left blank doxygen will generate a default style sheet. +# See also section "Doxygen usage" for information on how to generate the style +# sheet that doxygen normally uses. 
+# Note: It is recommended to use HTML_EXTRA_STYLESHEET instead of this tag, as +# it is more robust and this tag (HTML_STYLESHEET) will in the future become +# obsolete. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_STYLESHEET = + +# The HTML_EXTRA_STYLESHEET tag can be used to specify additional user-defined +# cascading style sheets that are included after the standard style sheets +# created by doxygen. Using this option one can overrule certain style aspects. +# This is preferred over using HTML_STYLESHEET since it does not replace the +# standard style sheet and is therefore more robust against future updates. +# Doxygen will copy the style sheet files to the output directory. +# Note: The order of the extra style sheet files is of importance (e.g. the last +# style sheet in the list overrules the setting of the previous ones in the +# list). For an example see the documentation. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_EXTRA_STYLESHEET = + +# The HTML_EXTRA_FILES tag can be used to specify one or more extra images or +# other source files which should be copied to the HTML output directory. Note +# that these files will be copied to the base HTML output directory. Use the +# $relpath^ marker in the HTML_HEADER and/or HTML_FOOTER files to load these +# files. In the HTML_STYLESHEET file, use the file name only. Also note that the +# files will be copied as-is; there are no commands or markers available. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_EXTRA_FILES = + +# The HTML_COLORSTYLE_HUE tag controls the color of the HTML output. Doxygen +# will adjust the colors in the style sheet and background images according to +# this color. Hue is specified as an angle on a colorwheel, see +# http://en.wikipedia.org/wiki/Hue for more information. For instance the value +# 0 represents red, 60 is yellow, 120 is green, 180 is cyan, 240 is blue, 300 +# purple, and 360 is red again. +# Minimum value: 0, maximum value: 359, default value: 220. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_COLORSTYLE_HUE = 220 + +# The HTML_COLORSTYLE_SAT tag controls the purity (or saturation) of the colors +# in the HTML output. For a value of 0 the output will use grayscales only. A +# value of 255 will produce the most vivid colors. +# Minimum value: 0, maximum value: 255, default value: 100. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_COLORSTYLE_SAT = 100 + +# The HTML_COLORSTYLE_GAMMA tag controls the gamma correction applied to the +# luminance component of the colors in the HTML output. Values below 100 +# gradually make the output lighter, whereas values above 100 make the output +# darker. The value divided by 100 is the actual gamma applied, so 80 represents +# a gamma of 0.8, The value 220 represents a gamma of 2.2, and 100 does not +# change the gamma. +# Minimum value: 40, maximum value: 240, default value: 80. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_COLORSTYLE_GAMMA = 80 + +# If the HTML_TIMESTAMP tag is set to YES then the footer of each generated HTML +# page will contain the date and time when the page was generated. Setting this +# to YES can help to show when doxygen was last run and thus if the +# documentation is up to date. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTML is set to YES. 
+ +HTML_TIMESTAMP = NO + +# If the HTML_DYNAMIC_SECTIONS tag is set to YES then the generated HTML +# documentation will contain sections that can be hidden and shown after the +# page has loaded. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_DYNAMIC_SECTIONS = NO + +# With HTML_INDEX_NUM_ENTRIES one can control the preferred number of entries +# shown in the various tree structured indices initially; the user can expand +# and collapse entries dynamically later on. Doxygen will expand the tree to +# such a level that at most the specified number of entries are visible (unless +# a fully collapsed tree already exceeds this amount). So setting the number of +# entries 1 will produce a full collapsed tree by default. 0 is a special value +# representing an infinite number of entries and will result in a full expanded +# tree by default. +# Minimum value: 0, maximum value: 9999, default value: 100. +# This tag requires that the tag GENERATE_HTML is set to YES. + +HTML_INDEX_NUM_ENTRIES = 100 + +# If the GENERATE_DOCSET tag is set to YES, additional index files will be +# generated that can be used as input for Apple's Xcode 3 integrated development +# environment (see: http://developer.apple.com/tools/xcode/), introduced with +# OSX 10.5 (Leopard). To create a documentation set, doxygen will generate a +# Makefile in the HTML output directory. Running make will produce the docset in +# that directory and running make install will install the docset in +# ~/Library/Developer/Shared/Documentation/DocSets so that Xcode will find it at +# startup. See http://developer.apple.com/tools/creatingdocsetswithdoxygen.html +# for more information. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTML is set to YES. + +GENERATE_DOCSET = NO + +# This tag determines the name of the docset feed. A documentation feed provides +# an umbrella under which multiple documentation sets from a single provider +# (such as a company or product suite) can be grouped. +# The default value is: Doxygen generated docs. +# This tag requires that the tag GENERATE_DOCSET is set to YES. + +DOCSET_FEEDNAME = "Doxygen generated docs" + +# This tag specifies a string that should uniquely identify the documentation +# set bundle. This should be a reverse domain-name style string, e.g. +# com.mycompany.MyDocSet. Doxygen will append .docset to the name. +# The default value is: org.doxygen.Project. +# This tag requires that the tag GENERATE_DOCSET is set to YES. + +DOCSET_BUNDLE_ID = org.doxygen.Project + +# The DOCSET_PUBLISHER_ID tag specifies a string that should uniquely identify +# the documentation publisher. This should be a reverse domain-name style +# string, e.g. com.mycompany.MyDocSet.documentation. +# The default value is: org.doxygen.Publisher. +# This tag requires that the tag GENERATE_DOCSET is set to YES. + +DOCSET_PUBLISHER_ID = org.doxygen.Publisher + +# The DOCSET_PUBLISHER_NAME tag identifies the documentation publisher. +# The default value is: Publisher. +# This tag requires that the tag GENERATE_DOCSET is set to YES. + +DOCSET_PUBLISHER_NAME = Publisher + +# If the GENERATE_HTMLHELP tag is set to YES then doxygen generates three +# additional HTML index files: index.hhp, index.hhc, and index.hhk. The +# index.hhp is a project file that can be read by Microsoft's HTML Help Workshop +# (see: http://www.microsoft.com/en-us/download/details.aspx?id=21138) on +# Windows. 
+# +# The HTML Help Workshop contains a compiler that can convert all HTML output +# generated by doxygen into a single compiled HTML file (.chm). Compiled HTML +# files are now used as the Windows 98 help format, and will replace the old +# Windows help format (.hlp) on all Windows platforms in the future. Compressed +# HTML files also contain an index, a table of contents, and you can search for +# words in the documentation. The HTML workshop also contains a viewer for +# compressed HTML files. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTML is set to YES. + +GENERATE_HTMLHELP = NO + +# The CHM_FILE tag can be used to specify the file name of the resulting .chm +# file. You can add a path in front of the file if the result should not be +# written to the html output directory. +# This tag requires that the tag GENERATE_HTMLHELP is set to YES. + +CHM_FILE = + +# The HHC_LOCATION tag can be used to specify the location (absolute path +# including file name) of the HTML help compiler (hhc.exe). If non-empty, +# doxygen will try to run the HTML help compiler on the generated index.hhp. +# The file has to be specified with full path. +# This tag requires that the tag GENERATE_HTMLHELP is set to YES. + +HHC_LOCATION = + +# The GENERATE_CHI flag controls if a separate .chi index file is generated +# (YES) or that it should be included in the master .chm file (NO). +# The default value is: NO. +# This tag requires that the tag GENERATE_HTMLHELP is set to YES. + +GENERATE_CHI = NO + +# The CHM_INDEX_ENCODING is used to encode HtmlHelp index (hhk), content (hhc) +# and project file content. +# This tag requires that the tag GENERATE_HTMLHELP is set to YES. + +CHM_INDEX_ENCODING = + +# The BINARY_TOC flag controls whether a binary table of contents is generated +# (YES) or a normal table of contents (NO) in the .chm file. Furthermore it +# enables the Previous and Next buttons. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTMLHELP is set to YES. + +BINARY_TOC = NO + +# The TOC_EXPAND flag can be set to YES to add extra items for group members to +# the table of contents of the HTML help documentation and to the tree view. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTMLHELP is set to YES. + +TOC_EXPAND = NO + +# If the GENERATE_QHP tag is set to YES and both QHP_NAMESPACE and +# QHP_VIRTUAL_FOLDER are set, an additional index file will be generated that +# can be used as input for Qt's qhelpgenerator to generate a Qt Compressed Help +# (.qch) of the generated HTML documentation. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTML is set to YES. + +GENERATE_QHP = NO + +# If the QHG_LOCATION tag is specified, the QCH_FILE tag can be used to specify +# the file name of the resulting .qch file. The path specified is relative to +# the HTML output folder. +# This tag requires that the tag GENERATE_QHP is set to YES. + +QCH_FILE = + +# The QHP_NAMESPACE tag specifies the namespace to use when generating Qt Help +# Project output. For more information please see Qt Help Project / Namespace +# (see: http://qt-project.org/doc/qt-4.8/qthelpproject.html#namespace). +# The default value is: org.doxygen.Project. +# This tag requires that the tag GENERATE_QHP is set to YES. + +QHP_NAMESPACE = org.doxygen.Project + +# The QHP_VIRTUAL_FOLDER tag specifies the namespace to use when generating Qt +# Help Project output. 
For more information please see Qt Help Project / Virtual +# Folders (see: http://qt-project.org/doc/qt-4.8/qthelpproject.html#virtual- +# folders). +# The default value is: doc. +# This tag requires that the tag GENERATE_QHP is set to YES. + +QHP_VIRTUAL_FOLDER = doc + +# If the QHP_CUST_FILTER_NAME tag is set, it specifies the name of a custom +# filter to add. For more information please see Qt Help Project / Custom +# Filters (see: http://qt-project.org/doc/qt-4.8/qthelpproject.html#custom- +# filters). +# This tag requires that the tag GENERATE_QHP is set to YES. + +QHP_CUST_FILTER_NAME = + +# The QHP_CUST_FILTER_ATTRS tag specifies the list of the attributes of the +# custom filter to add. For more information please see Qt Help Project / Custom +# Filters (see: http://qt-project.org/doc/qt-4.8/qthelpproject.html#custom- +# filters). +# This tag requires that the tag GENERATE_QHP is set to YES. + +QHP_CUST_FILTER_ATTRS = + +# The QHP_SECT_FILTER_ATTRS tag specifies the list of the attributes this +# project's filter section matches. Qt Help Project / Filter Attributes (see: +# http://qt-project.org/doc/qt-4.8/qthelpproject.html#filter-attributes). +# This tag requires that the tag GENERATE_QHP is set to YES. + +QHP_SECT_FILTER_ATTRS = + +# The QHG_LOCATION tag can be used to specify the location of Qt's +# qhelpgenerator. If non-empty doxygen will try to run qhelpgenerator on the +# generated .qhp file. +# This tag requires that the tag GENERATE_QHP is set to YES. + +QHG_LOCATION = + +# If the GENERATE_ECLIPSEHELP tag is set to YES, additional index files will be +# generated, together with the HTML files, they form an Eclipse help plugin. To +# install this plugin and make it available under the help contents menu in +# Eclipse, the contents of the directory containing the HTML and XML files needs +# to be copied into the plugins directory of eclipse. The name of the directory +# within the plugins directory should be the same as the ECLIPSE_DOC_ID value. +# After copying Eclipse needs to be restarted before the help appears. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTML is set to YES. + +GENERATE_ECLIPSEHELP = NO + +# A unique identifier for the Eclipse help plugin. When installing the plugin +# the directory name containing the HTML and XML files should also have this +# name. Each documentation set should have its own identifier. +# The default value is: org.doxygen.Project. +# This tag requires that the tag GENERATE_ECLIPSEHELP is set to YES. + +ECLIPSE_DOC_ID = org.doxygen.Project + +# If you want full control over the layout of the generated HTML pages it might +# be necessary to disable the index and replace it with your own. The +# DISABLE_INDEX tag can be used to turn on/off the condensed index (tabs) at top +# of each HTML page. A value of NO enables the index and the value YES disables +# it. Since the tabs in the index contain the same information as the navigation +# tree, you can set this option to YES if you also set GENERATE_TREEVIEW to YES. +# The default value is: NO. +# This tag requires that the tag GENERATE_HTML is set to YES. + +DISABLE_INDEX = NO + +# The GENERATE_TREEVIEW tag is used to specify whether a tree-like index +# structure should be generated to display hierarchical information. If the tag +# value is set to YES, a side panel will be generated containing a tree-like +# index structure (just like the one that is generated for HTML Help). 
For this
+# to work a browser that supports JavaScript, DHTML, CSS and frames is required
+# (i.e. any modern browser). Windows users are probably better off using the
+# HTML help feature. Via custom style sheets (see HTML_EXTRA_STYLESHEET) one can
+# further fine-tune the look of the index. As an example, the default style
+# sheet generated by doxygen has an example that shows how to put an image at
+# the root of the tree instead of the PROJECT_NAME. Since the tree basically has
+# the same information as the tab index, you could consider setting
+# DISABLE_INDEX to YES when enabling this option.
+# The default value is: NO.
+# This tag requires that the tag GENERATE_HTML is set to YES.
+
+GENERATE_TREEVIEW = NO
+
+# The ENUM_VALUES_PER_LINE tag can be used to set the number of enum values that
+# doxygen will group on one line in the generated HTML documentation.
+#
+# Note that a value of 0 will completely suppress the enum values from appearing
+# in the overview section.
+# Minimum value: 0, maximum value: 20, default value: 4.
+# This tag requires that the tag GENERATE_HTML is set to YES.
+
+ENUM_VALUES_PER_LINE = 4
+
+# If the treeview is enabled (see GENERATE_TREEVIEW) then this tag can be used
+# to set the initial width (in pixels) of the frame in which the tree is shown.
+# Minimum value: 0, maximum value: 1500, default value: 250.
+# This tag requires that the tag GENERATE_HTML is set to YES.
+
+TREEVIEW_WIDTH = 250
+
+# If the EXT_LINKS_IN_WINDOW option is set to YES, doxygen will open links to
+# external symbols imported via tag files in a separate window.
+# The default value is: NO.
+# This tag requires that the tag GENERATE_HTML is set to YES.
+
+EXT_LINKS_IN_WINDOW = NO
+
+# Use this tag to change the font size of LaTeX formulas included as images in
+# the HTML documentation. When you change the font size after a successful
+# doxygen run you need to manually remove any form_*.png images from the HTML
+# output directory to force them to be regenerated.
+# Minimum value: 8, maximum value: 50, default value: 10.
+# This tag requires that the tag GENERATE_HTML is set to YES.
+
+FORMULA_FONTSIZE = 10
+
+# Use the FORMULA_TRANSPARENT tag to determine whether or not the images
+# generated for formulas are transparent PNGs. Transparent PNGs are not
+# supported properly for IE 6.0, but are supported on all modern browsers.
+#
+# Note that when changing this option you need to delete any form_*.png files in
+# the HTML output directory before the changes have effect.
+# The default value is: YES.
+# This tag requires that the tag GENERATE_HTML is set to YES.
+
+FORMULA_TRANSPARENT = YES
+
+# Enable the USE_MATHJAX option to render LaTeX formulas using MathJax (see
+# http://www.mathjax.org) which uses client side Javascript for the rendering
+# instead of using pre-rendered bitmaps. Use this if you do not have LaTeX
+# installed or if you want the formulas to look prettier in the HTML output.
+# When enabled you may also need to install MathJax separately and configure
+# the path to it using the MATHJAX_RELPATH option.
+# The default value is: NO.
+# This tag requires that the tag GENERATE_HTML is set to YES.
+
+USE_MATHJAX = NO
+
+# When MathJax is enabled you can set the default output format to be used for
+# the MathJax output. See the MathJax site (see:
+# http://docs.mathjax.org/en/latest/output.html) for more details.
+# Possible values are: HTML-CSS (which is slower, but has the best
+# compatibility), NativeMML (i.e. MathML) and SVG.
+# The default value is: HTML-CSS.
+# This tag requires that the tag USE_MATHJAX is set to YES.
+
+MATHJAX_FORMAT = HTML-CSS
+
+# When MathJax is enabled you need to specify the location relative to the HTML
+# output directory using the MATHJAX_RELPATH option. The destination directory
+# should contain the MathJax.js script. For instance, if the mathjax directory
+# is located at the same level as the HTML output directory, then
+# MATHJAX_RELPATH should be ../mathjax. The default value points to the MathJax
+# Content Delivery Network so you can quickly see the result without installing
+# MathJax. However, it is strongly recommended to install a local copy of
+# MathJax from http://www.mathjax.org before deployment.
+# The default value is: http://cdn.mathjax.org/mathjax/latest.
+# This tag requires that the tag USE_MATHJAX is set to YES.
+
+MATHJAX_RELPATH = http://cdn.mathjax.org/mathjax/latest
+
+# The MATHJAX_EXTENSIONS tag can be used to specify one or more MathJax
+# extension names that should be enabled during MathJax rendering. For example
+# MATHJAX_EXTENSIONS = TeX/AMSmath TeX/AMSsymbols
+# This tag requires that the tag USE_MATHJAX is set to YES.
+
+MATHJAX_EXTENSIONS =
+
+# The MATHJAX_CODEFILE tag can be used to specify a file with javascript pieces
+# of code that will be used on startup of the MathJax code. See the MathJax site
+# (see: http://docs.mathjax.org/en/latest/output.html) for more details. For an
+# example see the documentation.
+# This tag requires that the tag USE_MATHJAX is set to YES.
+
+MATHJAX_CODEFILE =
+
+# When the SEARCHENGINE tag is enabled doxygen will generate a search box for
+# the HTML output. The underlying search engine uses javascript and DHTML and
+# should work on any modern browser. Note that when using HTML help
+# (GENERATE_HTMLHELP), Qt help (GENERATE_QHP), or docsets (GENERATE_DOCSET)
+# there is already a search function so this one should typically be disabled.
+# For large projects the javascript based search engine can be slow, then
+# enabling SERVER_BASED_SEARCH may provide a better solution. It is possible to
+# search using the keyboard; to jump to the search box use <access key> + S
+# (what the <access key> is depends on the OS and browser, but it is typically
+# <CTRL>, <ALT>/<cmd>
("", schema, cols); + + PyObject* out; + Py_BEGIN_ALLOW_THREADS; + ASSERT_RAISES(UnknownError, ConvertTableToPandas(table, 2, &out)); + Py_END_ALLOW_THREADS; +} + +} // namespace arrow diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index 8c2d3506c8a9d..6623e239880bc 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -49,6 +49,7 @@ namespace pyarrow { using arrow::Array; using arrow::ChunkedArray; using arrow::Column; +using arrow::DictionaryType; using arrow::Field; using arrow::DataType; using arrow::ListType; @@ -60,7 +61,7 @@ using arrow::Type; namespace BitUtil = arrow::BitUtil; // ---------------------------------------------------------------------- -// Serialization +// Utility code template struct npy_traits {}; @@ -242,1577 +243,1730 @@ Status AppendObjectStrings(arrow::StringBuilder& string_builder, PyObject** obje } template -class ArrowSerializer { - public: - ArrowSerializer(arrow::MemoryPool* pool, PyArrayObject* arr, PyArrayObject* mask) - : pool_(pool), arr_(arr), mask_(mask) { - length_ = PyArray_SIZE(arr_); - } +struct arrow_traits {}; - void IndicateType(const std::shared_ptr field) { field_indicator_ = field; } +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_BOOL; + static constexpr bool supports_nulls = false; + static constexpr bool is_boolean = true; + static constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = false; +}; - Status Convert(std::shared_ptr* out); +#define INT_DECL(TYPE) \ + template <> \ + struct arrow_traits { \ + static constexpr int npy_type = NPY_##TYPE; \ + static constexpr bool supports_nulls = false; \ + static constexpr double na_value = NAN; \ + static constexpr bool is_boolean = false; \ + static constexpr bool is_numeric_not_nullable = true; \ + static constexpr bool is_numeric_nullable = false; \ + typedef typename npy_traits::value_type T; \ + }; - int stride() const { return PyArray_STRIDES(arr_)[0]; } +INT_DECL(INT8); +INT_DECL(INT16); +INT_DECL(INT32); +INT_DECL(INT64); +INT_DECL(UINT8); +INT_DECL(UINT16); +INT_DECL(UINT32); +INT_DECL(UINT64); - Status InitNullBitmap() { - int null_bytes = BitUtil::BytesForBits(length_); +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_FLOAT32; + static constexpr bool supports_nulls = true; + static constexpr float na_value = NAN; + static constexpr bool is_boolean = false; + static constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = true; + typedef typename npy_traits::value_type T; +}; - null_bitmap_ = std::make_shared(pool_); - RETURN_NOT_OK(null_bitmap_->Resize(null_bytes)); +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_FLOAT64; + static constexpr bool supports_nulls = true; + static constexpr double na_value = NAN; + static constexpr bool is_boolean = false; + static constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = true; + typedef typename npy_traits::value_type T; +}; - null_bitmap_data_ = null_bitmap_->mutable_data(); - memset(null_bitmap_data_, 0, null_bytes); +static constexpr int64_t kPandasTimestampNull = std::numeric_limits::min(); - return Status::OK(); - } +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_DATETIME; + static constexpr bool supports_nulls = true; + static constexpr int64_t na_value = kPandasTimestampNull; + static constexpr bool is_boolean = false; + static 
constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = true; + typedef typename npy_traits::value_type T; +}; - bool is_strided() const { - npy_intp* astrides = PyArray_STRIDES(arr_); - return astrides[0] != PyArray_DESCR(arr_)->elsize; - } +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_DATETIME; + static constexpr bool supports_nulls = true; + static constexpr int64_t na_value = kPandasTimestampNull; + static constexpr bool is_boolean = false; + static constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = true; + typedef typename npy_traits::value_type T; +}; - private: - Status ConvertData(); +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_OBJECT; + static constexpr bool supports_nulls = true; + static constexpr bool is_boolean = false; + static constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = false; +}; - Status ConvertDates(std::shared_ptr* out) { - PyAcquireGIL lock; +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_OBJECT; + static constexpr bool supports_nulls = true; + static constexpr bool is_boolean = false; + static constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = false; +}; - PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); - arrow::TypePtr string_type(new arrow::DateType()); - arrow::DateBuilder date_builder(pool_, string_type); - RETURN_NOT_OK(date_builder.Resize(length_)); +template +struct WrapBytes {}; - Status s; - PyObject* obj; - for (int64_t i = 0; i < length_; ++i) { - obj = objects[i]; - if (PyDate_CheckExact(obj)) { - PyDateTime_Date* pydate = reinterpret_cast(obj); - date_builder.Append(PyDate_to_ms(pydate)); - } else { - date_builder.AppendNull(); - } - } - return date_builder.Finish(out); +template <> +struct WrapBytes { + static inline PyObject* Wrap(const uint8_t* data, int64_t length) { + return PyUnicode_FromStringAndSize(reinterpret_cast(data), length); } +}; - Status ConvertObjectStrings(std::shared_ptr* out) { - PyAcquireGIL lock; +template <> +struct WrapBytes { + static inline PyObject* Wrap(const uint8_t* data, int64_t length) { + return PyBytes_FromStringAndSize(reinterpret_cast(data), length); + } +}; - // The output type at this point is inconclusive because there may be bytes - // and unicode mixed in the object array +inline void set_numpy_metadata(int type, DataType* datatype, PyArrayObject* out) { + if (type == NPY_DATETIME) { + PyArray_Descr* descr = PyArray_DESCR(out); + auto date_dtype = reinterpret_cast(descr->c_metadata); + if (datatype->type == Type::TIMESTAMP) { + auto timestamp_type = static_cast(datatype); - PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); - arrow::TypePtr string_type(new arrow::StringType()); - arrow::StringBuilder string_builder(pool_, string_type); - RETURN_NOT_OK(string_builder.Resize(length_)); + switch (timestamp_type->unit) { + case arrow::TimestampType::Unit::SECOND: + date_dtype->meta.base = NPY_FR_s; + break; + case arrow::TimestampType::Unit::MILLI: + date_dtype->meta.base = NPY_FR_ms; + break; + case arrow::TimestampType::Unit::MICRO: + date_dtype->meta.base = NPY_FR_us; + break; + case arrow::TimestampType::Unit::NANO: + date_dtype->meta.base = NPY_FR_ns; + break; + } + } else { + // datatype->type == Type::DATE + date_dtype->meta.base = NPY_FR_D; + } + } +} - Status s; - bool have_bytes = false; - RETURN_NOT_OK(AppendObjectStrings(string_builder, 
objects, length_, &have_bytes)); - RETURN_NOT_OK(string_builder.Finish(out)); +template +inline void ConvertIntegerWithNulls(const ChunkedArray& data, double* out_values) { + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + // Upcast to double, set NaN as appropriate - if (have_bytes) { - const auto& arr = static_cast(*out->get()); - *out = std::make_shared( - arr.length(), arr.offsets(), arr.data(), arr.null_count(), arr.null_bitmap()); + for (int i = 0; i < arr->length(); ++i) { + *out_values++ = prim_arr->IsNull(i) ? NAN : in_values[i]; } - return Status::OK(); } +} - Status ConvertBooleans(std::shared_ptr* out) { - PyAcquireGIL lock; +template +inline void ConvertIntegerNoNullsSameType(const ChunkedArray& data, T* out_values) { + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + memcpy(out_values, in_values, sizeof(T) * arr->length()); + out_values += arr->length(); + } +} - PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); +template +inline void ConvertIntegerNoNullsCast(const ChunkedArray& data, OutType* out_values) { + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + for (int32_t i = 0; i < arr->length(); ++i) { + *out_values = in_values[i]; + } + } +} - int nbytes = BitUtil::BytesForBits(length_); - auto data = std::make_shared(pool_); - RETURN_NOT_OK(data->Resize(nbytes)); - uint8_t* bitmap = data->mutable_data(); - memset(bitmap, 0, nbytes); +static Status ConvertBooleanWithNulls(const ChunkedArray& data, PyObject** out_values) { + PyAcquireGIL lock; + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto bool_arr = static_cast(arr.get()); - int64_t null_count = 0; - for (int64_t i = 0; i < length_; ++i) { - if (objects[i] == Py_True) { - BitUtil::SetBit(bitmap, i); - BitUtil::SetBit(null_bitmap_data_, i); - } else if (objects[i] != Py_False) { - ++null_count; + for (int64_t i = 0; i < arr->length(); ++i) { + if (bool_arr->IsNull(i)) { + Py_INCREF(Py_None); + *out_values++ = Py_None; + } else if (bool_arr->Value(i)) { + // True + Py_INCREF(Py_True); + *out_values++ = Py_True; } else { - BitUtil::SetBit(null_bitmap_data_, i); + // False + Py_INCREF(Py_False); + *out_values++ = Py_False; } } - - *out = std::make_shared(length_, data, null_count, null_bitmap_); - - return Status::OK(); } + return Status::OK(); +} - template - Status ConvertTypedLists( - const std::shared_ptr& field, std::shared_ptr* out); - -#define LIST_CASE(TYPE, NUMPY_TYPE, ArrowType) \ - case Type::TYPE: { \ - return ConvertTypedLists(field, out); \ +static void ConvertBooleanNoNulls(const ChunkedArray& data, uint8_t* out_values) { + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto bool_arr = static_cast(arr.get()); + for (int64_t i = 0; i < arr->length(); ++i) { + *out_values++ = static_cast(bool_arr->Value(i)); + } } +} - Status ConvertLists(const std::shared_ptr& field, std::shared_ptr* out) { - switch (field->type->type) { - LIST_CASE(UINT8, NPY_UINT8, UInt8Type) - LIST_CASE(INT8, NPY_INT8, Int8Type) - LIST_CASE(UINT16, NPY_UINT16, UInt16Type) - LIST_CASE(INT16, 
NPY_INT16, Int16Type) - LIST_CASE(UINT32, NPY_UINT32, UInt32Type) - LIST_CASE(INT32, NPY_INT32, Int32Type) - LIST_CASE(UINT64, NPY_UINT64, UInt64Type) - LIST_CASE(INT64, NPY_INT64, Int64Type) - LIST_CASE(TIMESTAMP, NPY_DATETIME, TimestampType) - LIST_CASE(FLOAT, NPY_FLOAT, FloatType) - LIST_CASE(DOUBLE, NPY_DOUBLE, DoubleType) - LIST_CASE(STRING, NPY_OBJECT, StringType) - default: - return Status::TypeError("Unknown list item type"); - } - - return Status::TypeError("Unknown list type"); - } - - Status MakeDataType(std::shared_ptr* out); - - arrow::MemoryPool* pool_; - - PyArrayObject* arr_; - PyArrayObject* mask_; - - int64_t length_; - - std::shared_ptr field_indicator_; - std::shared_ptr data_; - std::shared_ptr null_bitmap_; - uint8_t* null_bitmap_data_; -}; +template +inline Status ConvertBinaryLike(const ChunkedArray& data, PyObject** out_values) { + PyAcquireGIL lock; + for (int c = 0; c < data.num_chunks(); c++) { + auto arr = static_cast(data.chunk(c).get()); -// Returns null count -static int64_t MaskToBitmap(PyArrayObject* mask, int64_t length, uint8_t* bitmap) { - int64_t null_count = 0; - const uint8_t* mask_values = static_cast(PyArray_DATA(mask)); - // TODO(wesm): strided null mask - for (int i = 0; i < length; ++i) { - if (mask_values[i]) { - ++null_count; - } else { - BitUtil::SetBit(bitmap, i); + const uint8_t* data_ptr; + int32_t length; + const bool has_nulls = data.null_count() > 0; + for (int64_t i = 0; i < arr->length(); ++i) { + if (has_nulls && arr->IsNull(i)) { + Py_INCREF(Py_None); + *out_values = Py_None; + } else { + data_ptr = arr->GetValue(i, &length); + *out_values = WrapBytes::Wrap(data_ptr, length); + if (*out_values == nullptr) { + PyErr_Clear(); + std::stringstream ss; + ss << "Wrapping " + << std::string(reinterpret_cast(data_ptr), length) << " failed"; + return Status::UnknownError(ss.str()); + } + } + ++out_values; } } - return null_count; -} - -template -inline Status ArrowSerializer::MakeDataType(std::shared_ptr* out) { - out->reset(new typename npy_traits::TypeClass()); return Status::OK(); } -template <> -inline Status ArrowSerializer::MakeDataType( - std::shared_ptr* out) { - PyArray_Descr* descr = PyArray_DESCR(arr_); - auto date_dtype = reinterpret_cast(descr->c_metadata); - arrow::TimestampType::Unit unit; +template +inline Status ConvertListsLike( + const std::shared_ptr& col, PyObject** out_values) { + const ChunkedArray& data = *col->data().get(); + auto list_type = std::static_pointer_cast(col->type()); - switch (date_dtype->meta.base) { - case NPY_FR_s: - unit = arrow::TimestampType::Unit::SECOND; - break; - case NPY_FR_ms: - unit = arrow::TimestampType::Unit::MILLI; - break; - case NPY_FR_us: - unit = arrow::TimestampType::Unit::MICRO; - break; - case NPY_FR_ns: - unit = arrow::TimestampType::Unit::NANO; - break; - default: - return Status::Invalid("Unknown NumPy datetime unit"); + // Get column of underlying value arrays + std::vector> value_arrays; + for (int c = 0; c < data.num_chunks(); c++) { + auto arr = std::static_pointer_cast(data.chunk(c)); + value_arrays.emplace_back(arr->values()); } + auto flat_column = std::make_shared(list_type->value_field(), value_arrays); + // TODO(ARROW-489): Currently we don't have a Python reference for single columns. + // Storing a reference to the whole Array would be to expensive. 
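+  // Strategy: gather the value arrays of all list chunks into one logical
+  // column, convert that column to a single NumPy array once (below), and
+  // then materialize each list cell as a slice of that array, e.g. a cell
+  // spanning value offsets [2, 5) becomes numpy_array[2:5].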
+ PyObject* numpy_array; + RETURN_NOT_OK(ConvertColumnToPandas(flat_column, nullptr, &numpy_array)); - out->reset(new arrow::TimestampType(unit)); - return Status::OK(); -} - -template -inline Status ArrowSerializer::Convert(std::shared_ptr* out) { - typedef npy_traits traits; + PyAcquireGIL lock; - if (mask_ != nullptr || traits::supports_nulls) { RETURN_NOT_OK(InitNullBitmap()); } + for (int c = 0; c < data.num_chunks(); c++) { + auto arr = std::static_pointer_cast(data.chunk(c)); - int64_t null_count = 0; - if (mask_ != nullptr) { - null_count = MaskToBitmap(mask_, length_, null_bitmap_data_); - } else if (traits::supports_nulls) { - null_count = ValuesToBitmap(PyArray_DATA(arr_), length_, null_bitmap_data_); + const uint8_t* data_ptr; + int32_t length; + const bool has_nulls = data.null_count() > 0; + for (int64_t i = 0; i < arr->length(); ++i) { + if (has_nulls && arr->IsNull(i)) { + Py_INCREF(Py_None); + *out_values = Py_None; + } else { + PyObject* start = PyLong_FromLong(arr->value_offset(i)); + PyObject* end = PyLong_FromLong(arr->value_offset(i + 1)); + PyObject* slice = PySlice_New(start, end, NULL); + *out_values = PyObject_GetItem(numpy_array, slice); + Py_DECREF(start); + Py_DECREF(end); + Py_DECREF(slice); + } + ++out_values; + } } - RETURN_NOT_OK(ConvertData()); - std::shared_ptr type; - RETURN_NOT_OK(MakeDataType(&type)); - RETURN_NOT_OK(MakePrimitiveArray(type, length_, data_, null_count, null_bitmap_, out)); + Py_XDECREF(numpy_array); return Status::OK(); } -template <> -inline Status ArrowSerializer::Convert(std::shared_ptr* out) { - // Python object arrays are annoying, since we could have one of: - // - // * Strings - // * Booleans with nulls - // * Mixed type (not supported at the moment by arrow format) - // - // Additionally, nulls may be encoded either as np.nan or None. So we have to - // do some type inference and conversion - - RETURN_NOT_OK(InitNullBitmap()); +template +inline void ConvertNumericNullable(const ChunkedArray& data, T na_value, T* out_values) { + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); - // TODO: mask not supported here - const PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); - { - PyAcquireGIL lock; - PyDateTime_IMPORT; - } + const uint8_t* valid_bits = arr->null_bitmap_data(); - if (field_indicator_) { - switch (field_indicator_->type->type) { - case Type::STRING: - return ConvertObjectStrings(out); - case Type::BOOL: - return ConvertBooleans(out); - case Type::DATE: - return ConvertDates(out); - case Type::LIST: { - auto list_field = static_cast(field_indicator_->type.get()); - return ConvertLists(list_field->value_field(), out); - } - default: - return Status::TypeError("No known conversion to Arrow type"); - } - } else { - for (int64_t i = 0; i < length_; ++i) { - if (PyObject_is_null(objects[i])) { - continue; - } else if (PyObject_is_string(objects[i])) { - return ConvertObjectStrings(out); - } else if (PyBool_Check(objects[i])) { - return ConvertBooleans(out); - } else if (PyDate_CheckExact(objects[i])) { - return ConvertDates(out); - } else { - return Status::TypeError("unhandled python type"); + if (arr->null_count() > 0) { + for (int64_t i = 0; i < arr->length(); ++i) { + *out_values++ = BitUtil::BitNotSet(valid_bits, i) ? 
na_value : in_values[i]; } + } else { + memcpy(out_values, in_values, sizeof(T) * arr->length()); + out_values += arr->length(); } } - - return Status::TypeError("Unable to infer type of object array, were all null"); } -template -inline Status ArrowSerializer::ConvertData() { - // TODO(wesm): strided arrays - if (is_strided()) { return Status::Invalid("no support for strided data yet"); } +template +inline void ConvertNumericNullableCast( + const ChunkedArray& data, OutType na_value, OutType* out_values) { + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); - data_ = std::make_shared(arr_); - return Status::OK(); + for (int64_t i = 0; i < arr->length(); ++i) { + *out_values++ = arr->IsNull(i) ? na_value : static_cast(in_values[i]); + } + } } -template <> -inline Status ArrowSerializer::ConvertData() { - if (is_strided()) { return Status::Invalid("no support for strided data yet"); } - - int nbytes = BitUtil::BytesForBits(length_); - auto buffer = std::make_shared(pool_); - RETURN_NOT_OK(buffer->Resize(nbytes)); - - const uint8_t* values = reinterpret_cast(PyArray_DATA(arr_)); - - uint8_t* bitmap = buffer->mutable_data(); +template +inline void ConvertDates(const ChunkedArray& data, T na_value, T* out_values) { + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); - memset(bitmap, 0, nbytes); - for (int i = 0; i < length_; ++i) { - if (values[i] > 0) { BitUtil::SetBit(bitmap, i); } + for (int64_t i = 0; i < arr->length(); ++i) { + // There are 1000 * 60 * 60 * 24 = 86400000ms in a day + *out_values++ = arr->IsNull(i) ? na_value : in_values[i] / 86400000; + } } +} - data_ = buffer; +template +inline void ConvertDatetimeNanos(const ChunkedArray& data, int64_t* out_values) { + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); - return Status::OK(); + for (int64_t i = 0; i < arr->length(); ++i) { + *out_values++ = arr->IsNull(i) ? 
kPandasTimestampNull + : (static_cast(in_values[i]) * SHIFT); + } + } } -template -template -inline Status ArrowSerializer::ConvertTypedLists( - const std::shared_ptr& field, std::shared_ptr* out) { - typedef npy_traits traits; - typedef typename traits::value_type T; - typedef typename traits::BuilderClass BuilderT; +// ---------------------------------------------------------------------- +// pandas 0.x DataFrame conversion internals - auto value_builder = std::make_shared(pool_, field->type); - ListBuilder list_builder(pool_, value_builder); - PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); - for (int64_t i = 0; i < length_; ++i) { - if (PyObject_is_null(objects[i])) { - RETURN_NOT_OK(list_builder.AppendNull()); - } else if (PyArray_Check(objects[i])) { - auto numpy_array = reinterpret_cast(objects[i]); - RETURN_NOT_OK(list_builder.Append(true)); +class PandasBlock { + public: + enum type { + OBJECT, + UINT8, + INT8, + UINT16, + INT16, + UINT32, + INT32, + UINT64, + INT64, + FLOAT, + DOUBLE, + BOOL, + DATETIME, + CATEGORICAL + }; - // TODO(uwe): Support more complex numpy array structures - RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, ITEM_TYPE)); + PandasBlock(int64_t num_rows, int num_columns) + : num_rows_(num_rows), num_columns_(num_columns) {} + virtual ~PandasBlock() {} - int32_t size = PyArray_DIM(numpy_array, 0); - auto data = reinterpret_cast(PyArray_DATA(numpy_array)); - if (traits::supports_nulls) { - null_bitmap_->Resize(size, false); - // TODO(uwe): A bitmap would be more space-efficient but the Builder API doesn't - // currently support this. - // ValuesToBitmap(data, size, null_bitmap_->mutable_data()); - ValuesToBytemap(data, size, null_bitmap_->mutable_data()); - RETURN_NOT_OK(value_builder->Append(data, size, null_bitmap_->data())); - } else { - RETURN_NOT_OK(value_builder->Append(data, size)); - } - } else if (PyList_Check(objects[i])) { - return Status::TypeError("Python lists are not yet supported"); - } else { - return Status::TypeError("Unsupported Python type for list items"); - } - } - return list_builder.Finish(out); -} + virtual Status Allocate() = 0; + virtual Status Write(const std::shared_ptr& col, int64_t abs_placement, + int64_t rel_placement) = 0; -template <> -template <> -inline Status -ArrowSerializer::ConvertTypedLists( - const std::shared_ptr& field, std::shared_ptr* out) { - // TODO: If there are bytes involed, convert to Binary representation - bool have_bytes = false; + PyObject* block_arr() const { return block_arr_.obj(); } - auto value_builder = std::make_shared(pool_, field->type); - ListBuilder list_builder(pool_, value_builder); - PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); - for (int64_t i = 0; i < length_; ++i) { - if (PyObject_is_null(objects[i])) { - RETURN_NOT_OK(list_builder.AppendNull()); - } else if (PyArray_Check(objects[i])) { - auto numpy_array = reinterpret_cast(objects[i]); - RETURN_NOT_OK(list_builder.Append(true)); + virtual Status GetPyResult(PyObject** output) { + PyObject* result = PyDict_New(); + RETURN_IF_PYERROR(); - // TODO(uwe): Support more complex numpy array structures - RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, NPY_OBJECT)); + PyDict_SetItemString(result, "block", block_arr_.obj()); + PyDict_SetItemString(result, "placement", placement_arr_.obj()); - int32_t size = PyArray_DIM(numpy_array, 0); - auto data = reinterpret_cast(PyArray_DATA(numpy_array)); - RETURN_NOT_OK(AppendObjectStrings(*value_builder.get(), data, size, &have_bytes)); - } else if (PyList_Check(objects[i])) { - 
return Status::TypeError("Python lists are not yet supported"); + *output = result; + + return Status::OK(); + } + + protected: + Status AllocateNDArray(int npy_type, int ndim = 2) { + PyAcquireGIL lock; + + PyObject* block_arr; + if (ndim == 2) { + npy_intp block_dims[2] = {num_columns_, num_rows_}; + block_arr = PyArray_SimpleNew(2, block_dims, npy_type); } else { - return Status::TypeError("Unsupported Python type for list items"); + npy_intp block_dims[1] = {num_rows_}; + block_arr = PyArray_SimpleNew(1, block_dims, npy_type); } - } - return list_builder.Finish(out); -} -template <> -inline Status ArrowSerializer::ConvertData() { - return Status::TypeError("NYI"); -} + if (block_arr == NULL) { + // TODO(wesm): propagating Python exception + return Status::OK(); + } -#define TO_ARROW_CASE(TYPE) \ - case NPY_##TYPE: { \ - ArrowSerializer converter(pool, arr, mask); \ - RETURN_NOT_OK(converter.Convert(out)); \ - } break; + npy_intp placement_dims[1] = {num_columns_}; + PyObject* placement_arr = PyArray_SimpleNew(1, placement_dims, NPY_INT64); + if (placement_arr == NULL) { + // TODO(wesm): propagating Python exception + return Status::OK(); + } -Status PandasMaskedToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, - const std::shared_ptr& field, std::shared_ptr* out) { - PyArrayObject* arr = reinterpret_cast(ao); - PyArrayObject* mask = nullptr; + block_arr_.reset(block_arr); + placement_arr_.reset(placement_arr); - if (mo != nullptr) { mask = reinterpret_cast(mo); } + block_data_ = reinterpret_cast( + PyArray_DATA(reinterpret_cast(block_arr))); - if (PyArray_NDIM(arr) != 1) { - return Status::Invalid("only handle 1-dimensional arrays"); + placement_data_ = reinterpret_cast( + PyArray_DATA(reinterpret_cast(placement_arr))); + + return Status::OK(); } - switch (PyArray_DESCR(arr)->type_num) { - TO_ARROW_CASE(BOOL); - TO_ARROW_CASE(INT8); - TO_ARROW_CASE(INT16); - TO_ARROW_CASE(INT32); - TO_ARROW_CASE(INT64); - TO_ARROW_CASE(UINT8); - TO_ARROW_CASE(UINT16); - TO_ARROW_CASE(UINT32); - TO_ARROW_CASE(UINT64); - TO_ARROW_CASE(FLOAT32); - TO_ARROW_CASE(FLOAT64); - TO_ARROW_CASE(DATETIME); - case NPY_OBJECT: { - ArrowSerializer converter(pool, arr, mask); - converter.IndicateType(field); - RETURN_NOT_OK(converter.Convert(out)); - } break; - default: + int64_t num_rows_; + int num_columns_; + + OwnedRef block_arr_; + uint8_t* block_data_; + + // ndarray + OwnedRef placement_arr_; + int64_t* placement_data_; + + DISALLOW_COPY_AND_ASSIGN(PandasBlock); +}; + +#define CONVERTLISTSLIKE_CASE(ArrowType, ArrowEnum) \ + case Type::ArrowEnum: \ + RETURN_NOT_OK((ConvertListsLike<::arrow::ArrowType>(col, out_buffer))); \ + break; + +class ObjectBlock : public PandasBlock { + public: + using PandasBlock::PandasBlock; + virtual ~ObjectBlock() {} + + Status Allocate() override { return AllocateNDArray(NPY_OBJECT); } + + Status Write(const std::shared_ptr& col, int64_t abs_placement, + int64_t rel_placement) override { + Type::type type = col->type()->type; + + PyObject** out_buffer = + reinterpret_cast(block_data_) + rel_placement * num_rows_; + + const ChunkedArray& data = *col->data().get(); + + if (type == Type::BOOL) { + RETURN_NOT_OK(ConvertBooleanWithNulls(data, out_buffer)); + } else if (type == Type::BINARY) { + RETURN_NOT_OK(ConvertBinaryLike(data, out_buffer)); + } else if (type == Type::STRING) { + RETURN_NOT_OK(ConvertBinaryLike(data, out_buffer)); + } else if (type == Type::LIST) { + auto list_type = std::static_pointer_cast(col->type()); + switch (list_type->value_type()->type) { + 
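+        // Dispatch on the list's value type; each case instantiates
+        // ConvertListsLike for the corresponding Arrow type via the
+        // CONVERTLISTSLIKE_CASE macro defined above.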
CONVERTLISTSLIKE_CASE(UInt8Type, UINT8) + CONVERTLISTSLIKE_CASE(Int8Type, INT8) + CONVERTLISTSLIKE_CASE(UInt16Type, UINT16) + CONVERTLISTSLIKE_CASE(Int16Type, INT16) + CONVERTLISTSLIKE_CASE(UInt32Type, UINT32) + CONVERTLISTSLIKE_CASE(Int32Type, INT32) + CONVERTLISTSLIKE_CASE(UInt64Type, UINT64) + CONVERTLISTSLIKE_CASE(Int64Type, INT64) + CONVERTLISTSLIKE_CASE(TimestampType, TIMESTAMP) + CONVERTLISTSLIKE_CASE(FloatType, FLOAT) + CONVERTLISTSLIKE_CASE(DoubleType, DOUBLE) + CONVERTLISTSLIKE_CASE(StringType, STRING) + default: { + std::stringstream ss; + ss << "Not implemented type for lists: " << list_type->value_type()->ToString(); + return Status::NotImplemented(ss.str()); + } + } + } else { std::stringstream ss; - ss << "unsupported type " << PyArray_DESCR(arr)->type_num << std::endl; + ss << "Unsupported type for object array output: " << col->type()->ToString(); return Status::NotImplemented(ss.str()); + } + + placement_data_[rel_placement] = abs_placement; + return Status::OK(); } - return Status::OK(); -} +}; -Status PandasToArrow(arrow::MemoryPool* pool, PyObject* ao, - const std::shared_ptr& field, std::shared_ptr* out) { - return PandasMaskedToArrow(pool, ao, nullptr, field, out); -} +template +class IntBlock : public PandasBlock { + public: + using PandasBlock::PandasBlock; -// ---------------------------------------------------------------------- -// Deserialization + Status Allocate() override { + return AllocateNDArray(arrow_traits::npy_type); + } -template -struct arrow_traits {}; + Status Write(const std::shared_ptr& col, int64_t abs_placement, + int64_t rel_placement) override { + Type::type type = col->type()->type; -template <> -struct arrow_traits { - static constexpr int npy_type = NPY_BOOL; - static constexpr bool supports_nulls = false; - static constexpr bool is_boolean = true; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = false; -}; + C_TYPE* out_buffer = + reinterpret_cast(block_data_) + rel_placement * num_rows_; -#define INT_DECL(TYPE) \ - template <> \ - struct arrow_traits { \ - static constexpr int npy_type = NPY_##TYPE; \ - static constexpr bool supports_nulls = false; \ - static constexpr double na_value = NAN; \ - static constexpr bool is_boolean = false; \ - static constexpr bool is_numeric_not_nullable = true; \ - static constexpr bool is_numeric_nullable = false; \ - typedef typename npy_traits::value_type T; \ - }; + const ChunkedArray& data = *col->data().get(); -INT_DECL(INT8); -INT_DECL(INT16); -INT_DECL(INT32); -INT_DECL(INT64); -INT_DECL(UINT8); -INT_DECL(UINT16); -INT_DECL(UINT32); -INT_DECL(UINT64); + if (type != ARROW_TYPE) { return Status::NotImplemented(col->type()->ToString()); } -template <> -struct arrow_traits { - static constexpr int npy_type = NPY_FLOAT32; - static constexpr bool supports_nulls = true; - static constexpr float na_value = NAN; - static constexpr bool is_boolean = false; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = true; - typedef typename npy_traits::value_type T; + ConvertIntegerNoNullsSameType(data, out_buffer); + placement_data_[rel_placement] = abs_placement; + return Status::OK(); + } }; -template <> -struct arrow_traits { - static constexpr int npy_type = NPY_FLOAT64; - static constexpr bool supports_nulls = true; - static constexpr double na_value = NAN; - static constexpr bool is_boolean = false; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = true; 
- typedef typename npy_traits::value_type T; -}; +using UInt8Block = IntBlock; +using Int8Block = IntBlock; +using UInt16Block = IntBlock; +using Int16Block = IntBlock; +using UInt32Block = IntBlock; +using Int32Block = IntBlock; +using UInt64Block = IntBlock; +using Int64Block = IntBlock; -static constexpr int64_t kPandasTimestampNull = std::numeric_limits::min(); +class Float32Block : public PandasBlock { + public: + using PandasBlock::PandasBlock; -template <> -struct arrow_traits { - static constexpr int npy_type = NPY_DATETIME; - static constexpr bool supports_nulls = true; - static constexpr int64_t na_value = kPandasTimestampNull; - static constexpr bool is_boolean = false; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = true; - typedef typename npy_traits::value_type T; -}; + Status Allocate() override { return AllocateNDArray(NPY_FLOAT32); } -template <> -struct arrow_traits { - static constexpr int npy_type = NPY_DATETIME; - static constexpr bool supports_nulls = true; - static constexpr int64_t na_value = kPandasTimestampNull; - static constexpr bool is_boolean = false; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = true; - typedef typename npy_traits::value_type T; + Status Write(const std::shared_ptr& col, int64_t abs_placement, + int64_t rel_placement) override { + Type::type type = col->type()->type; + + if (type != Type::FLOAT) { return Status::NotImplemented(col->type()->ToString()); } + + float* out_buffer = reinterpret_cast(block_data_) + rel_placement * num_rows_; + + ConvertNumericNullable(*col->data().get(), NAN, out_buffer); + placement_data_[rel_placement] = abs_placement; + return Status::OK(); + } }; -template <> -struct arrow_traits { - static constexpr int npy_type = NPY_OBJECT; - static constexpr bool supports_nulls = true; - static constexpr bool is_boolean = false; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = false; -}; +class Float64Block : public PandasBlock { + public: + using PandasBlock::PandasBlock; + + Status Allocate() override { return AllocateNDArray(NPY_FLOAT64); } + + Status Write(const std::shared_ptr& col, int64_t abs_placement, + int64_t rel_placement) override { + Type::type type = col->type()->type; + + double* out_buffer = + reinterpret_cast(block_data_) + rel_placement * num_rows_; + + const ChunkedArray& data = *col->data().get(); + +#define INTEGER_CASE(IN_TYPE) \ + ConvertIntegerWithNulls(data, out_buffer); \ + break; + + switch (type) { + case Type::UINT8: + INTEGER_CASE(uint8_t); + case Type::INT8: + INTEGER_CASE(int8_t); + case Type::UINT16: + INTEGER_CASE(uint16_t); + case Type::INT16: + INTEGER_CASE(int16_t); + case Type::UINT32: + INTEGER_CASE(uint32_t); + case Type::INT32: + INTEGER_CASE(int32_t); + case Type::UINT64: + INTEGER_CASE(uint64_t); + case Type::INT64: + INTEGER_CASE(int64_t); + case Type::FLOAT: + ConvertNumericNullableCast(data, NAN, out_buffer); + break; + case Type::DOUBLE: + ConvertNumericNullable(data, NAN, out_buffer); + break; + default: + return Status::NotImplemented(col->type()->ToString()); + } + +#undef INTEGER_CASE -template <> -struct arrow_traits { - static constexpr int npy_type = NPY_OBJECT; - static constexpr bool supports_nulls = true; - static constexpr bool is_boolean = false; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = false; + placement_data_[rel_placement] = 
abs_placement; + return Status::OK(); + } }; -template -struct WrapBytes {}; +class BoolBlock : public PandasBlock { + public: + using PandasBlock::PandasBlock; -template <> -struct WrapBytes { - static inline PyObject* Wrap(const uint8_t* data, int64_t length) { - return PyUnicode_FromStringAndSize(reinterpret_cast(data), length); - } -}; + Status Allocate() override { return AllocateNDArray(NPY_BOOL); } -template <> -struct WrapBytes { - static inline PyObject* Wrap(const uint8_t* data, int64_t length) { - return PyBytes_FromStringAndSize(reinterpret_cast(data), length); - } -}; + Status Write(const std::shared_ptr& col, int64_t abs_placement, + int64_t rel_placement) override { + Type::type type = col->type()->type; -inline void set_numpy_metadata(int type, DataType* datatype, PyArrayObject* out) { - if (type == NPY_DATETIME) { - PyArray_Descr* descr = PyArray_DESCR(out); - auto date_dtype = reinterpret_cast(descr->c_metadata); - if (datatype->type == arrow::Type::TIMESTAMP) { - auto timestamp_type = static_cast(datatype); + if (type != Type::BOOL) { return Status::NotImplemented(col->type()->ToString()); } - switch (timestamp_type->unit) { - case arrow::TimestampType::Unit::SECOND: - date_dtype->meta.base = NPY_FR_s; - break; - case arrow::TimestampType::Unit::MILLI: - date_dtype->meta.base = NPY_FR_ms; - break; - case arrow::TimestampType::Unit::MICRO: - date_dtype->meta.base = NPY_FR_us; - break; - case arrow::TimestampType::Unit::NANO: - date_dtype->meta.base = NPY_FR_ns; - break; - } - } else { - // datatype->type == arrow::Type::DATE - date_dtype->meta.base = NPY_FR_D; - } + uint8_t* out_buffer = + reinterpret_cast(block_data_) + rel_placement * num_rows_; + + ConvertBooleanNoNulls(*col->data().get(), out_buffer); + placement_data_[rel_placement] = abs_placement; + return Status::OK(); } -} +}; -template -inline void ConvertIntegerWithNulls(const ChunkedArray& data, double* out_values) { - for (int c = 0; c < data.num_chunks(); c++) { - const std::shared_ptr arr = data.chunk(c); - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); - // Upcast to double, set NaN as appropriate +class DatetimeBlock : public PandasBlock { + public: + using PandasBlock::PandasBlock; - for (int i = 0; i < arr->length(); ++i) { - *out_values++ = prim_arr->IsNull(i) ? 
NAN : in_values[i]; - } - } -} + Status Allocate() override { + RETURN_NOT_OK(AllocateNDArray(NPY_DATETIME)); -template -inline void ConvertIntegerNoNullsSameType(const ChunkedArray& data, T* out_values) { - for (int c = 0; c < data.num_chunks(); c++) { - const std::shared_ptr arr = data.chunk(c); - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); - memcpy(out_values, in_values, sizeof(T) * arr->length()); - out_values += arr->length(); + PyAcquireGIL lock; + auto date_dtype = reinterpret_cast( + PyArray_DESCR(reinterpret_cast(block_arr_.obj()))->c_metadata); + date_dtype->meta.base = NPY_FR_ns; + return Status::OK(); } -} -template -inline void ConvertIntegerNoNullsCast(const ChunkedArray& data, OutType* out_values) { - for (int c = 0; c < data.num_chunks(); c++) { - const std::shared_ptr arr = data.chunk(c); - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); - for (int32_t i = 0; i < arr->length(); ++i) { - *out_values = in_values[i]; - } - } -} + Status Write(const std::shared_ptr& col, int64_t abs_placement, + int64_t rel_placement) override { + Type::type type = col->type()->type; -static Status ConvertBooleanWithNulls(const ChunkedArray& data, PyObject** out_values) { - PyAcquireGIL lock; - for (int c = 0; c < data.num_chunks(); c++) { - const std::shared_ptr arr = data.chunk(c); - auto bool_arr = static_cast(arr.get()); + int64_t* out_buffer = + reinterpret_cast(block_data_) + rel_placement * num_rows_; - for (int64_t i = 0; i < arr->length(); ++i) { - if (bool_arr->IsNull(i)) { - Py_INCREF(Py_None); - *out_values++ = Py_None; - } else if (bool_arr->Value(i)) { - // True - Py_INCREF(Py_True); - *out_values++ = Py_True; + const ChunkedArray& data = *col.get()->data(); + + if (type == Type::DATE) { + // DateType is millisecond timestamp stored as int64_t + // TODO(wesm): Do we want to make sure to zero out the milliseconds? 
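+      // DATE values are milliseconds since the UNIX epoch, so widening them
+      // to the datetime64[ns] resolution pandas expects means scaling by a
+      // factor of 1,000,000 (ms -> ns); the TIMESTAMP branch below applies
+      // the analogous factor for each time unit.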
+ ConvertDatetimeNanos(data, out_buffer); + } else if (type == Type::TIMESTAMP) { + auto ts_type = static_cast(col->type().get()); + + if (ts_type->unit == arrow::TimeUnit::NANO) { + ConvertNumericNullable(data, kPandasTimestampNull, out_buffer); + } else if (ts_type->unit == arrow::TimeUnit::MICRO) { + ConvertDatetimeNanos(data, out_buffer); + } else if (ts_type->unit == arrow::TimeUnit::MILLI) { + ConvertDatetimeNanos(data, out_buffer); + } else if (ts_type->unit == arrow::TimeUnit::SECOND) { + ConvertDatetimeNanos(data, out_buffer); } else { - // False - Py_INCREF(Py_False); - *out_values++ = Py_False; + return Status::NotImplemented("Unsupported time unit"); } + } else { + return Status::NotImplemented(col->type()->ToString()); } + + placement_data_[rel_placement] = abs_placement; + return Status::OK(); } - return Status::OK(); -} +}; -static void ConvertBooleanNoNulls(const ChunkedArray& data, uint8_t* out_values) { - for (int c = 0; c < data.num_chunks(); c++) { - const std::shared_ptr arr = data.chunk(c); - auto bool_arr = static_cast(arr.get()); - for (int64_t i = 0; i < arr->length(); ++i) { - *out_values++ = static_cast(bool_arr->Value(i)); +template +class CategoricalBlock : public PandasBlock { + public: + CategoricalBlock(int64_t num_rows) : PandasBlock(num_rows, 1) {} + + Status Allocate() override { + constexpr int npy_type = arrow_traits::npy_type; + + if (!(npy_type == NPY_INT8 || npy_type == NPY_INT16 || npy_type == NPY_INT32 || + npy_type == NPY_INT64)) { + return Status::Invalid("Category indices must be signed integers"); } + return AllocateNDArray(npy_type, 1); } -} -template -inline Status ConvertBinaryLike(const ChunkedArray& data, PyObject** out_values) { - PyAcquireGIL lock; - for (int c = 0; c < data.num_chunks(); c++) { - auto arr = static_cast(data.chunk(c).get()); + Status Write(const std::shared_ptr& col, int64_t abs_placement, + int64_t rel_placement) override { + using T = typename arrow_traits::T; - const uint8_t* data_ptr; - int32_t length; - const bool has_nulls = data.null_count() > 0; - for (int64_t i = 0; i < arr->length(); ++i) { - if (has_nulls && arr->IsNull(i)) { - Py_INCREF(Py_None); - *out_values = Py_None; - } else { - data_ptr = arr->GetValue(i, &length); - *out_values = WrapBytes::Wrap(data_ptr, length); - if (*out_values == nullptr) { - return Status::UnknownError("String initialization failed"); - } + T* out_values = reinterpret_cast(block_data_) + rel_placement * num_rows_; + + const ChunkedArray& data = *col->data().get(); + + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + const auto& dict_arr = static_cast(*arr); + const auto& indices = + static_cast(*dict_arr.indices()); + auto in_values = reinterpret_cast(indices.data()->data()); + + // Null is -1 in CategoricalBlock + for (int i = 0; i < arr->length(); ++i) { + *out_values++ = indices.IsNull(i) ? 
-1 : in_values[i]; } - ++out_values; } - } - return Status::OK(); -} -template -inline Status ConvertListsLike( - const std::shared_ptr& col, PyObject** out_values) { - typedef arrow_traits traits; - typedef typename ::arrow::TypeTraits::ArrayType ArrayType; + placement_data_[rel_placement] = abs_placement; - const ChunkedArray& data = *col->data().get(); - auto list_type = std::static_pointer_cast(col->type()); + auto dict_type = static_cast(col->type().get()); - // Get column of underlying value arrays - std::vector> value_arrays; - for (int c = 0; c < data.num_chunks(); c++) { - auto arr = std::static_pointer_cast(data.chunk(c)); - value_arrays.emplace_back(arr->values()); + PyObject* dict; + RETURN_NOT_OK(ConvertArrayToPandas(dict_type->dictionary(), nullptr, &dict)); + dictionary_.reset(dict); + + return Status::OK(); } - auto flat_column = std::make_shared(list_type->value_field(), value_arrays); - // TODO(ARROW-489): Currently we don't have a Python reference for single columns. - // Storing a reference to the whole Array would be to expensive. - PyObject* numpy_array; - RETURN_NOT_OK(ConvertColumnToPandas(flat_column, nullptr, &numpy_array)); - PyAcquireGIL lock; + Status GetPyResult(PyObject** output) override { + PyObject* result = PyDict_New(); + RETURN_IF_PYERROR(); - for (int c = 0; c < data.num_chunks(); c++) { - auto arr = std::static_pointer_cast(data.chunk(c)); + PyDict_SetItemString(result, "block", block_arr_.obj()); + PyDict_SetItemString(result, "dictionary", dictionary_.obj()); + PyDict_SetItemString(result, "placement", placement_arr_.obj()); - const uint8_t* data_ptr; - int32_t length; - const bool has_nulls = data.null_count() > 0; - for (int64_t i = 0; i < arr->length(); ++i) { - if (has_nulls && arr->IsNull(i)) { - Py_INCREF(Py_None); - *out_values = Py_None; - } else { - PyObject* start = PyLong_FromLong(arr->value_offset(i)); - PyObject* end = PyLong_FromLong(arr->value_offset(i + 1)); - PyObject* slice = PySlice_New(start, end, NULL); - *out_values = PyObject_GetItem(numpy_array, slice); - Py_DECREF(start); - Py_DECREF(end); - Py_DECREF(slice); - } - ++out_values; - } - } + *output = result; - Py_XDECREF(numpy_array); - return Status::OK(); -} + return Status::OK(); + } -template -inline void ConvertNumericNullable(const ChunkedArray& data, T na_value, T* out_values) { - for (int c = 0; c < data.num_chunks(); c++) { - const std::shared_ptr arr = data.chunk(c); - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); + protected: + OwnedRef dictionary_; +}; - const uint8_t* valid_bits = arr->null_bitmap_data(); +Status MakeBlock(PandasBlock::type type, int64_t num_rows, int num_columns, + std::shared_ptr* block) { +#define BLOCK_CASE(NAME, TYPE) \ + case PandasBlock::NAME: \ + *block = std::make_shared(num_rows, num_columns); \ + break; - if (arr->null_count() > 0) { - for (int64_t i = 0; i < arr->length(); ++i) { - *out_values++ = BitUtil::BitNotSet(valid_bits, i) ? 
na_value : in_values[i]; - } - } else { - memcpy(out_values, in_values, sizeof(T) * arr->length()); - out_values += arr->length(); - } + switch (type) { + BLOCK_CASE(OBJECT, ObjectBlock); + BLOCK_CASE(UINT8, UInt8Block); + BLOCK_CASE(INT8, Int8Block); + BLOCK_CASE(UINT16, UInt16Block); + BLOCK_CASE(INT16, Int16Block); + BLOCK_CASE(UINT32, UInt32Block); + BLOCK_CASE(INT32, Int32Block); + BLOCK_CASE(UINT64, UInt64Block); + BLOCK_CASE(INT64, Int64Block); + BLOCK_CASE(FLOAT, Float32Block); + BLOCK_CASE(DOUBLE, Float64Block); + BLOCK_CASE(BOOL, BoolBlock); + BLOCK_CASE(DATETIME, DatetimeBlock); + default: + return Status::NotImplemented("Unsupported block type"); } -} -template -inline void ConvertNumericNullableCast( - const ChunkedArray& data, OutType na_value, OutType* out_values) { - for (int c = 0; c < data.num_chunks(); c++) { - const std::shared_ptr arr = data.chunk(c); - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); +#undef BLOCK_CASE - for (int64_t i = 0; i < arr->length(); ++i) { - *out_values++ = arr->IsNull(i) ? na_value : static_cast(in_values[i]); - } - } + return (*block)->Allocate(); } -template -inline void ConvertDates(const ChunkedArray& data, T na_value, T* out_values) { - for (int c = 0; c < data.num_chunks(); c++) { - const std::shared_ptr arr = data.chunk(c); - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); - - for (int64_t i = 0; i < arr->length(); ++i) { - // There are 1000 * 60 * 60 * 24 = 86400000ms in a day - *out_values++ = arr->IsNull(i) ? na_value : in_values[i] / 86400000; - } +static inline bool ListTypeSupported(const Type::type type_id) { + switch (type_id) { + case Type::UINT8: + case Type::INT8: + case Type::UINT16: + case Type::INT16: + case Type::UINT32: + case Type::INT32: + case Type::INT64: + case Type::UINT64: + case Type::FLOAT: + case Type::DOUBLE: + case Type::STRING: + case Type::TIMESTAMP: + // The above types are all supported. + return true; + default: + break; } + return false; } -template -inline void ConvertDatetimeNanos(const ChunkedArray& data, int64_t* out_values) { - for (int c = 0; c < data.num_chunks(); c++) { - const std::shared_ptr arr = data.chunk(c); - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); - - for (int64_t i = 0; i < arr->length(); ++i) { - *out_values++ = arr->IsNull(i) ? 
kPandasTimestampNull - : (static_cast(in_values[i]) * SHIFT); +static inline Status MakeCategoricalBlock(const std::shared_ptr& type, + int64_t num_rows, std::shared_ptr* block) { + // All categoricals become a block with a single column + auto dict_type = static_cast(type.get()); + switch (dict_type->index_type()->type) { + case Type::INT8: + *block = std::make_shared>(num_rows); + break; + case Type::INT16: + *block = std::make_shared>(num_rows); + break; + case Type::INT32: + *block = std::make_shared>(num_rows); + break; + case Type::INT64: + *block = std::make_shared>(num_rows); + break; + default: { + std::stringstream ss; + ss << "Categorical index type not implemented: " + << dict_type->index_type()->ToString(); + return Status::NotImplemented(ss.str()); } } + return (*block)->Allocate(); } -class ArrowDeserializer { +// Construct the exact pandas 0.x "BlockManager" memory layout +// +// * For each column determine the correct output pandas type +// * Allocate 2D blocks (ncols x nrows) for each distinct data type in output +// * Allocate block placement arrays +// * Write Arrow columns out into each slice of memory; populate block +// * placement arrays as we go +class DataFrameBlockCreator { public: - ArrowDeserializer(const std::shared_ptr& col, PyObject* py_ref) - : col_(col), data_(*col->data().get()), py_ref_(py_ref) {} - - Status AllocateOutput(int type) { - PyAcquireGIL lock; - - npy_intp dims[1] = {col_->length()}; - out_ = reinterpret_cast(PyArray_SimpleNew(1, dims, type)); + DataFrameBlockCreator(const std::shared_ptr
& table) : table_(table) {} - if (out_ == NULL) { - // Error occurred, trust that SimpleNew set the error state - return Status::OK(); - } + Status Convert(int nthreads, PyObject** output) { + column_types_.resize(table_->num_columns()); + column_block_placement_.resize(table_->num_columns()); + type_counts_.clear(); + blocks_.clear(); - set_numpy_metadata(type, col_->type().get(), out_); + RETURN_NOT_OK(CreateBlocks()); + RETURN_NOT_OK(WriteTableToBlocks(nthreads)); - return Status::OK(); + return GetResultList(output); } - template - Status ConvertValuesZeroCopy(int npy_type, std::shared_ptr arr) { - typedef typename arrow_traits::T T; - - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); + Status CreateBlocks() { + for (int i = 0; i < table_->num_columns(); ++i) { + std::shared_ptr col = table_->column(i); + PandasBlock::type output_type; - // Zero-Copy. We can pass the data pointer directly to NumPy. - void* data = const_cast(in_values); + Type::type column_type = col->type()->type; + switch (column_type) { + case Type::BOOL: + output_type = col->null_count() > 0 ? PandasBlock::OBJECT : PandasBlock::BOOL; + break; + case Type::UINT8: + output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::UINT8; + break; + case Type::INT8: + output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::INT8; + break; + case Type::UINT16: + output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::UINT16; + break; + case Type::INT16: + output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::INT16; + break; + case Type::UINT32: + output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::UINT32; + break; + case Type::INT32: + output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::INT32; + break; + case Type::INT64: + output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::INT64; + break; + case Type::UINT64: + output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::UINT64; + break; + case Type::FLOAT: + output_type = PandasBlock::FLOAT; + break; + case Type::DOUBLE: + output_type = PandasBlock::DOUBLE; + break; + case Type::STRING: + case Type::BINARY: + output_type = PandasBlock::OBJECT; + break; + case Type::DATE: + output_type = PandasBlock::DATETIME; + break; + case Type::TIMESTAMP: + output_type = PandasBlock::DATETIME; + break; + case Type::LIST: { + auto list_type = std::static_pointer_cast(col->type()); + if (!ListTypeSupported(list_type->value_type()->type)) { + std::stringstream ss; + ss << "Not implemented type for lists: " + << list_type->value_type()->ToString(); + return Status::NotImplemented(ss.str()); + } + output_type = PandasBlock::OBJECT; + } break; + case Type::DICTIONARY: + output_type = PandasBlock::CATEGORICAL; + break; + default: + return Status::NotImplemented(col->type()->ToString()); + } - PyAcquireGIL lock; + int block_placement = 0; + if (column_type == Type::DICTIONARY) { + std::shared_ptr block; + RETURN_NOT_OK(MakeCategoricalBlock(col->type(), table_->num_rows(), &block)); + categorical_blocks_[i] = block; + } else { + auto it = type_counts_.find(output_type); + if (it != type_counts_.end()) { + block_placement = it->second; + // Increment count + it->second += 1; + } else { + // Add key to map + type_counts_[output_type] = 1; + } + } - // Zero-Copy. We can pass the data pointer directly to NumPy. 
- npy_intp dims[1] = {col_->length()};
- out_ = reinterpret_cast(
- PyArray_SimpleNewFromData(1, dims, npy_type, data));
+ column_types_[i] = output_type;
+ column_block_placement_[i] = block_placement;
+ }
- if (out_ == NULL) {
- // Error occurred, trust that SimpleNew set the error state
- return Status::OK();
+ // Create normal non-categorical blocks
+ for (const auto& it : type_counts_) {
+ PandasBlock::type type = static_cast(it.first);
+ std::shared_ptr block;
+ RETURN_NOT_OK(MakeBlock(type, table_->num_rows(), it.second, &block));
+ blocks_[type] = block;
}
+ return Status::OK();
+ }
- set_numpy_metadata(npy_type, col_->type().get(), out_);
+ Status WriteTableToBlocks(int nthreads) {
+ auto WriteColumn = [this](int i) {
+ std::shared_ptr col = this->table_->column(i);
+ PandasBlock::type output_type = this->column_types_[i];
- if (PyArray_SetBaseObject(out_, py_ref_) == -1) {
- // Error occurred, trust that SetBaseObject set the error state
- return Status::OK();
- } else {
- // PyArray_SetBaseObject steals our reference to py_ref_
- Py_INCREF(py_ref_);
- }
+ int rel_placement = this->column_block_placement_[i];
- // Arrow data is immutable.
- PyArray_CLEARFLAGS(out_, NPY_ARRAY_WRITEABLE);
+ std::shared_ptr block;
+ if (output_type == PandasBlock::CATEGORICAL) {
+ auto it = this->categorical_blocks_.find(i);
+ if (it == this->categorical_blocks_.end()) {
+ return Status::KeyError("No categorical block allocated");
+ }
+ block = it->second;
+ } else {
+ auto it = this->blocks_.find(output_type);
+ if (it == this->blocks_.end()) { return Status::KeyError("No block allocated"); }
+ block = it->second;
+ }
+ return block->Write(col, i, rel_placement);
+ };
- return Status::OK();
- }
+ nthreads = std::min(nthreads, table_->num_columns());
- // ----------------------------------------------------------------------
- // Allocate new array and deserialize.
Can do a zero copy conversion for some - // types + if (nthreads == 1) { + for (int i = 0; i < table_->num_columns(); ++i) { + RETURN_NOT_OK(WriteColumn(i)); + } + } else { + std::vector thread_pool; + thread_pool.reserve(nthreads); + std::atomic task_counter(0); - Status Convert(PyObject** out) { -#define CONVERT_CASE(TYPE) \ - case arrow::Type::TYPE: { \ - RETURN_NOT_OK(ConvertValues()); \ - } break; + std::mutex error_mtx; + bool error_occurred = false; + Status error; - switch (col_->type()->type) { - CONVERT_CASE(BOOL); - CONVERT_CASE(INT8); - CONVERT_CASE(INT16); - CONVERT_CASE(INT32); - CONVERT_CASE(INT64); - CONVERT_CASE(UINT8); - CONVERT_CASE(UINT16); - CONVERT_CASE(UINT32); - CONVERT_CASE(UINT64); - CONVERT_CASE(FLOAT); - CONVERT_CASE(DOUBLE); - CONVERT_CASE(BINARY); - CONVERT_CASE(STRING); - CONVERT_CASE(DATE); - CONVERT_CASE(TIMESTAMP); - default: { - std::stringstream ss; - ss << "Arrow type reading not implemented for " << col_->type()->ToString(); - return Status::NotImplemented(ss.str()); + for (int thread_id = 0; thread_id < nthreads; ++thread_id) { + thread_pool.emplace_back( + [this, &error, &error_occurred, &error_mtx, &task_counter, &WriteColumn]() { + int column_num; + while (!error_occurred) { + column_num = task_counter.fetch_add(1); + if (column_num >= this->table_->num_columns()) { break; } + Status s = WriteColumn(column_num); + if (!s.ok()) { + std::lock_guard lock(error_mtx); + error_occurred = true; + error = s; + break; + } + } + }); + } + for (auto&& thread : thread_pool) { + thread.join(); } - } - -#undef CONVERT_CASE - *out = reinterpret_cast(out_); + if (error_occurred) { return error; } + } return Status::OK(); } - template - inline typename std::enable_if< - (TYPE != arrow::Type::DATE) & arrow_traits::is_numeric_nullable, Status>::type - ConvertValues() { - typedef typename arrow_traits::T T; - int npy_type = arrow_traits::npy_type; + Status GetResultList(PyObject** out) { + PyAcquireGIL lock; - if (data_.num_chunks() == 1 && data_.null_count() == 0 && py_ref_ != nullptr) { - return ConvertValuesZeroCopy(npy_type, data_.chunk(0)); + auto num_blocks = + static_cast(blocks_.size() + categorical_blocks_.size()); + PyObject* result = PyList_New(num_blocks); + RETURN_IF_PYERROR(); + + int i = 0; + for (const auto& it : blocks_) { + const std::shared_ptr block = it.second; + PyObject* item; + RETURN_NOT_OK(block->GetPyResult(&item)); + if (PyList_SET_ITEM(result, i++, item) < 0) { RETURN_IF_PYERROR(); } } - RETURN_NOT_OK(AllocateOutput(npy_type)); - auto out_values = reinterpret_cast(PyArray_DATA(out_)); - ConvertNumericNullable(data_, arrow_traits::na_value, out_values); + for (const auto& it : categorical_blocks_) { + const std::shared_ptr block = it.second; + PyObject* item; + RETURN_NOT_OK(block->GetPyResult(&item)); + if (PyList_SET_ITEM(result, i++, item) < 0) { RETURN_IF_PYERROR(); } + } + *out = result; return Status::OK(); } - template - inline typename std::enable_if::type - ConvertValues() { - typedef typename arrow_traits::T T; + private: + std::shared_ptr
table_; - RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); - auto out_values = reinterpret_cast(PyArray_DATA(out_)); - ConvertDates(data_, arrow_traits::na_value, out_values); - return Status::OK(); - } + // column num -> block type id + std::vector column_types_; - // Integer specialization - template - inline - typename std::enable_if::is_numeric_not_nullable, Status>::type - ConvertValues() { - typedef typename arrow_traits::T T; - int npy_type = arrow_traits::npy_type; + // column num -> relative placement within internal block + std::vector column_block_placement_; - if (data_.num_chunks() == 1 && data_.null_count() == 0 && py_ref_ != nullptr) { - return ConvertValuesZeroCopy(npy_type, data_.chunk(0)); - } + // block type -> type count + std::unordered_map type_counts_; - if (data_.null_count() > 0) { - RETURN_NOT_OK(AllocateOutput(NPY_FLOAT64)); - auto out_values = reinterpret_cast(PyArray_DATA(out_)); - ConvertIntegerWithNulls(data_, out_values); - } else { - RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); - auto out_values = reinterpret_cast(PyArray_DATA(out_)); - ConvertIntegerNoNullsSameType(data_, out_values); - } + // block type -> block + std::unordered_map> blocks_; - return Status::OK(); - } + // column number -> categorical block + std::unordered_map> categorical_blocks_; +}; - // Boolean specialization - template - inline typename std::enable_if::is_boolean, Status>::type - ConvertValues() { - if (data_.null_count() > 0) { - RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); - auto out_values = reinterpret_cast(PyArray_DATA(out_)); - RETURN_NOT_OK(ConvertBooleanWithNulls(data_, out_values)); - } else { - RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); - auto out_values = reinterpret_cast(PyArray_DATA(out_)); - ConvertBooleanNoNulls(data_, out_values); - } - return Status::OK(); +Status ConvertTableToPandas( + const std::shared_ptr
& table, int nthreads, PyObject** out) { + DataFrameBlockCreator helper(table); + return helper.Convert(nthreads, out); +} + +// ---------------------------------------------------------------------- +// Serialization + +template +class ArrowSerializer { + public: + ArrowSerializer(arrow::MemoryPool* pool, PyArrayObject* arr, PyArrayObject* mask) + : pool_(pool), arr_(arr), mask_(mask) { + length_ = PyArray_SIZE(arr_); } - // UTF8 strings - template - inline typename std::enable_if::type - ConvertValues() { - RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); - auto out_values = reinterpret_cast(PyArray_DATA(out_)); - return ConvertBinaryLike(data_, out_values); + void IndicateType(const std::shared_ptr field) { field_indicator_ = field; } + + Status Convert(std::shared_ptr* out); + + int stride() const { return PyArray_STRIDES(arr_)[0]; } + + Status InitNullBitmap() { + int null_bytes = BitUtil::BytesForBits(length_); + + null_bitmap_ = std::make_shared(pool_); + RETURN_NOT_OK(null_bitmap_->Resize(null_bytes)); + + null_bitmap_data_ = null_bitmap_->mutable_data(); + memset(null_bitmap_data_, 0, null_bytes); + + return Status::OK(); } - template - inline typename std::enable_if::type - ConvertValues() { - RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); - auto out_values = reinterpret_cast(PyArray_DATA(out_)); - return ConvertBinaryLike(data_, out_values); + bool is_strided() const { + npy_intp* astrides = PyArray_STRIDES(arr_); + return astrides[0] != PyArray_DESCR(arr_)->elsize; } private: - std::shared_ptr col_; - const arrow::ChunkedArray& data_; - PyObject* py_ref_; - PyArrayObject* out_; -}; + Status ConvertData(); -Status ConvertArrayToPandas( - const std::shared_ptr& arr, PyObject* py_ref, PyObject** out) { - static std::string dummy_name = "dummy"; - auto field = std::make_shared(dummy_name, arr->type()); - auto col = std::make_shared(field, arr); - return ConvertColumnToPandas(col, py_ref, out); -} + Status ConvertDates(std::shared_ptr* out) { + PyAcquireGIL lock; -Status ConvertColumnToPandas( - const std::shared_ptr& col, PyObject* py_ref, PyObject** out) { - ArrowDeserializer converter(col, py_ref); - return converter.Convert(out); -} + PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + arrow::TypePtr string_type(new arrow::DateType()); + arrow::DateBuilder date_builder(pool_, string_type); + RETURN_NOT_OK(date_builder.Resize(length_)); -// ---------------------------------------------------------------------- -// pandas 0.x DataFrame conversion internals + Status s; + PyObject* obj; + for (int64_t i = 0; i < length_; ++i) { + obj = objects[i]; + if (PyDate_CheckExact(obj)) { + PyDateTime_Date* pydate = reinterpret_cast(obj); + date_builder.Append(PyDate_to_ms(pydate)); + } else { + date_builder.AppendNull(); + } + } + return date_builder.Finish(out); + } -class PandasBlock { - public: - enum type { - OBJECT, - UINT8, - INT8, - UINT16, - INT16, - UINT32, - INT32, - UINT64, - INT64, - FLOAT, - DOUBLE, - BOOL, - DATETIME, - CATEGORICAL - }; + Status ConvertObjectStrings(std::shared_ptr* out) { + PyAcquireGIL lock; - PandasBlock(int64_t num_rows, int num_columns) - : num_rows_(num_rows), num_columns_(num_columns) {} - virtual ~PandasBlock() {} + // The output type at this point is inconclusive because there may be bytes + // and unicode mixed in the object array - virtual Status Allocate() = 0; - virtual Status Write(const std::shared_ptr& col, int64_t abs_placement, - int64_t rel_placement) = 0; + PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + arrow::TypePtr 
string_type(new arrow::StringType()); + arrow::StringBuilder string_builder(pool_, string_type); + RETURN_NOT_OK(string_builder.Resize(length_)); - PyObject* block_arr() { return block_arr_.obj(); } + Status s; + bool have_bytes = false; + RETURN_NOT_OK(AppendObjectStrings(string_builder, objects, length_, &have_bytes)); + RETURN_NOT_OK(string_builder.Finish(out)); - PyObject* placement_arr() { return placement_arr_.obj(); } + if (have_bytes) { + const auto& arr = static_cast(*out->get()); + *out = std::make_shared( + arr.length(), arr.offsets(), arr.data(), arr.null_count(), arr.null_bitmap()); + } + return Status::OK(); + } - protected: - Status AllocateNDArray(int npy_type) { + Status ConvertBooleans(std::shared_ptr* out) { PyAcquireGIL lock; - npy_intp block_dims[2] = {num_columns_, num_rows_}; - PyObject* block_arr = PyArray_SimpleNew(2, block_dims, npy_type); - if (block_arr == NULL) { - // TODO(wesm): propagating Python exception - return Status::OK(); + PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + + int nbytes = BitUtil::BytesForBits(length_); + auto data = std::make_shared(pool_); + RETURN_NOT_OK(data->Resize(nbytes)); + uint8_t* bitmap = data->mutable_data(); + memset(bitmap, 0, nbytes); + + int64_t null_count = 0; + for (int64_t i = 0; i < length_; ++i) { + if (objects[i] == Py_True) { + BitUtil::SetBit(bitmap, i); + BitUtil::SetBit(null_bitmap_data_, i); + } else if (objects[i] != Py_False) { + ++null_count; + } else { + BitUtil::SetBit(null_bitmap_data_, i); + } } - npy_intp placement_dims[1] = {num_columns_}; - PyObject* placement_arr = PyArray_SimpleNew(1, placement_dims, NPY_INT64); - if (placement_arr == NULL) { - // TODO(wesm): propagating Python exception - return Status::OK(); + *out = std::make_shared(length_, data, null_count, null_bitmap_); + + return Status::OK(); + } + + template + Status ConvertTypedLists( + const std::shared_ptr& field, std::shared_ptr* out); + +#define LIST_CASE(TYPE, NUMPY_TYPE, ArrowType) \ + case Type::TYPE: { \ + return ConvertTypedLists(field, out); \ + } + + Status ConvertLists(const std::shared_ptr& field, std::shared_ptr* out) { + switch (field->type->type) { + LIST_CASE(UINT8, NPY_UINT8, UInt8Type) + LIST_CASE(INT8, NPY_INT8, Int8Type) + LIST_CASE(UINT16, NPY_UINT16, UInt16Type) + LIST_CASE(INT16, NPY_INT16, Int16Type) + LIST_CASE(UINT32, NPY_UINT32, UInt32Type) + LIST_CASE(INT32, NPY_INT32, Int32Type) + LIST_CASE(UINT64, NPY_UINT64, UInt64Type) + LIST_CASE(INT64, NPY_INT64, Int64Type) + LIST_CASE(TIMESTAMP, NPY_DATETIME, TimestampType) + LIST_CASE(FLOAT, NPY_FLOAT, FloatType) + LIST_CASE(DOUBLE, NPY_DOUBLE, DoubleType) + LIST_CASE(STRING, NPY_OBJECT, StringType) + default: + return Status::TypeError("Unknown list item type"); } - block_arr_.reset(block_arr); - placement_arr_.reset(placement_arr); - - block_data_ = reinterpret_cast( - PyArray_DATA(reinterpret_cast(block_arr))); - - placement_data_ = reinterpret_cast( - PyArray_DATA(reinterpret_cast(placement_arr))); - - return Status::OK(); + return Status::TypeError("Unknown list type"); } - int64_t num_rows_; - int num_columns_; - - OwnedRef block_arr_; - uint8_t* block_data_; + Status MakeDataType(std::shared_ptr* out); - // ndarray - OwnedRef placement_arr_; - int64_t* placement_data_; + arrow::MemoryPool* pool_; - DISALLOW_COPY_AND_ASSIGN(PandasBlock); -}; + PyArrayObject* arr_; + PyArrayObject* mask_; -#define CONVERTLISTSLIKE_CASE(ArrowType, ArrowEnum) \ - case Type::ArrowEnum: \ - RETURN_NOT_OK((ConvertListsLike<::arrow::ArrowType>(col, out_buffer))); \ - break; 
+ int64_t length_; -class ObjectBlock : public PandasBlock { - public: - using PandasBlock::PandasBlock; - virtual ~ObjectBlock() {} + std::shared_ptr field_indicator_; + std::shared_ptr data_; + std::shared_ptr null_bitmap_; + uint8_t* null_bitmap_data_; +}; - Status Allocate() override { return AllocateNDArray(NPY_OBJECT); } +// Returns null count +static int64_t MaskToBitmap(PyArrayObject* mask, int64_t length, uint8_t* bitmap) { + int64_t null_count = 0; + const uint8_t* mask_values = static_cast(PyArray_DATA(mask)); + // TODO(wesm): strided null mask + for (int i = 0; i < length; ++i) { + if (mask_values[i]) { + ++null_count; + } else { + BitUtil::SetBit(bitmap, i); + } + } + return null_count; +} - Status Write(const std::shared_ptr& col, int64_t abs_placement, - int64_t rel_placement) override { - Type::type type = col->type()->type; +template +inline Status ArrowSerializer::MakeDataType(std::shared_ptr* out) { + out->reset(new typename npy_traits::TypeClass()); + return Status::OK(); +} - PyObject** out_buffer = - reinterpret_cast(block_data_) + rel_placement * num_rows_; +template <> +inline Status ArrowSerializer::MakeDataType( + std::shared_ptr* out) { + PyArray_Descr* descr = PyArray_DESCR(arr_); + auto date_dtype = reinterpret_cast(descr->c_metadata); + arrow::TimestampType::Unit unit; - const ChunkedArray& data = *col->data().get(); + switch (date_dtype->meta.base) { + case NPY_FR_s: + unit = arrow::TimestampType::Unit::SECOND; + break; + case NPY_FR_ms: + unit = arrow::TimestampType::Unit::MILLI; + break; + case NPY_FR_us: + unit = arrow::TimestampType::Unit::MICRO; + break; + case NPY_FR_ns: + unit = arrow::TimestampType::Unit::NANO; + break; + default: + return Status::Invalid("Unknown NumPy datetime unit"); + } - if (type == Type::BOOL) { - RETURN_NOT_OK(ConvertBooleanWithNulls(data, out_buffer)); - } else if (type == Type::BINARY) { - RETURN_NOT_OK(ConvertBinaryLike(data, out_buffer)); - } else if (type == Type::STRING) { - RETURN_NOT_OK(ConvertBinaryLike(data, out_buffer)); - } else if (type == Type::LIST) { - auto list_type = std::static_pointer_cast(col->type()); - switch (list_type->value_type()->type) { - CONVERTLISTSLIKE_CASE(UInt8Type, UINT8) - CONVERTLISTSLIKE_CASE(Int8Type, INT8) - CONVERTLISTSLIKE_CASE(UInt16Type, UINT16) - CONVERTLISTSLIKE_CASE(Int16Type, INT16) - CONVERTLISTSLIKE_CASE(UInt32Type, UINT32) - CONVERTLISTSLIKE_CASE(Int32Type, INT32) - CONVERTLISTSLIKE_CASE(UInt64Type, UINT64) - CONVERTLISTSLIKE_CASE(Int64Type, INT64) - CONVERTLISTSLIKE_CASE(TimestampType, TIMESTAMP) - CONVERTLISTSLIKE_CASE(FloatType, FLOAT) - CONVERTLISTSLIKE_CASE(DoubleType, DOUBLE) - CONVERTLISTSLIKE_CASE(StringType, STRING) - default: { - std::stringstream ss; - ss << "Not implemented type for lists: " << list_type->value_type()->ToString(); - return Status::NotImplemented(ss.str()); - } - } - } else { - std::stringstream ss; - ss << "Unsupported type for object array output: " << col->type()->ToString(); - return Status::NotImplemented(ss.str()); - } + out->reset(new arrow::TimestampType(unit)); + return Status::OK(); +} - placement_data_[rel_placement] = abs_placement; - return Status::OK(); - } -}; +template +inline Status ArrowSerializer::Convert(std::shared_ptr* out) { + typedef npy_traits traits; -template -class IntBlock : public PandasBlock { - public: - using PandasBlock::PandasBlock; + if (mask_ != nullptr || traits::supports_nulls) { RETURN_NOT_OK(InitNullBitmap()); } - Status Allocate() override { - return AllocateNDArray(arrow_traits::npy_type); + int64_t 
null_count = 0; + if (mask_ != nullptr) { + null_count = MaskToBitmap(mask_, length_, null_bitmap_data_); + } else if (traits::supports_nulls) { + null_count = ValuesToBitmap(PyArray_DATA(arr_), length_, null_bitmap_data_); } - Status Write(const std::shared_ptr& col, int64_t abs_placement, - int64_t rel_placement) override { - Type::type type = col->type()->type; - - C_TYPE* out_buffer = - reinterpret_cast(block_data_) + rel_placement * num_rows_; + RETURN_NOT_OK(ConvertData()); + std::shared_ptr type; + RETURN_NOT_OK(MakeDataType(&type)); + RETURN_NOT_OK(MakePrimitiveArray(type, length_, data_, null_count, null_bitmap_, out)); + return Status::OK(); +} - const ChunkedArray& data = *col->data().get(); +template <> +inline Status ArrowSerializer::Convert(std::shared_ptr* out) { + // Python object arrays are annoying, since we could have one of: + // + // * Strings + // * Booleans with nulls + // * Mixed type (not supported at the moment by arrow format) + // + // Additionally, nulls may be encoded either as np.nan or None. So we have to + // do some type inference and conversion - if (type != ARROW_TYPE) { return Status::NotImplemented(col->type()->ToString()); } + RETURN_NOT_OK(InitNullBitmap()); - ConvertIntegerNoNullsSameType(data, out_buffer); - placement_data_[rel_placement] = abs_placement; - return Status::OK(); + // TODO: mask not supported here + const PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + { + PyAcquireGIL lock; + PyDateTime_IMPORT; } -}; - -using UInt8Block = IntBlock; -using Int8Block = IntBlock; -using UInt16Block = IntBlock; -using Int16Block = IntBlock; -using UInt32Block = IntBlock; -using Int32Block = IntBlock; -using UInt64Block = IntBlock; -using Int64Block = IntBlock; -class Float32Block : public PandasBlock { - public: - using PandasBlock::PandasBlock; + if (field_indicator_) { + switch (field_indicator_->type->type) { + case Type::STRING: + return ConvertObjectStrings(out); + case Type::BOOL: + return ConvertBooleans(out); + case Type::DATE: + return ConvertDates(out); + case Type::LIST: { + auto list_field = static_cast(field_indicator_->type.get()); + return ConvertLists(list_field->value_field(), out); + } + default: + return Status::TypeError("No known conversion to Arrow type"); + } + } else { + for (int64_t i = 0; i < length_; ++i) { + if (PyObject_is_null(objects[i])) { + continue; + } else if (PyObject_is_string(objects[i])) { + return ConvertObjectStrings(out); + } else if (PyBool_Check(objects[i])) { + return ConvertBooleans(out); + } else if (PyDate_CheckExact(objects[i])) { + return ConvertDates(out); + } else { + return Status::TypeError("unhandled python type"); + } + } + } - Status Allocate() override { return AllocateNDArray(NPY_FLOAT32); } + return Status::TypeError("Unable to infer type of object array, were all null"); +} - Status Write(const std::shared_ptr& col, int64_t abs_placement, - int64_t rel_placement) override { - Type::type type = col->type()->type; +template +inline Status ArrowSerializer::ConvertData() { + // TODO(wesm): strided arrays + if (is_strided()) { return Status::Invalid("no support for strided data yet"); } - if (type != Type::FLOAT) { return Status::NotImplemented(col->type()->ToString()); } + data_ = std::make_shared(arr_); + return Status::OK(); +} - float* out_buffer = reinterpret_cast(block_data_) + rel_placement * num_rows_; +template <> +inline Status ArrowSerializer::ConvertData() { + if (is_strided()) { return Status::Invalid("no support for strided data yet"); } - 
ConvertNumericNullable(*col->data().get(), NAN, out_buffer); - placement_data_[rel_placement] = abs_placement; - return Status::OK(); - } -}; + int nbytes = BitUtil::BytesForBits(length_); + auto buffer = std::make_shared(pool_); + RETURN_NOT_OK(buffer->Resize(nbytes)); -class Float64Block : public PandasBlock { - public: - using PandasBlock::PandasBlock; + const uint8_t* values = reinterpret_cast(PyArray_DATA(arr_)); - Status Allocate() override { return AllocateNDArray(NPY_FLOAT64); } + uint8_t* bitmap = buffer->mutable_data(); - Status Write(const std::shared_ptr& col, int64_t abs_placement, - int64_t rel_placement) override { - Type::type type = col->type()->type; + memset(bitmap, 0, nbytes); + for (int i = 0; i < length_; ++i) { + if (values[i] > 0) { BitUtil::SetBit(bitmap, i); } + } - double* out_buffer = - reinterpret_cast(block_data_) + rel_placement * num_rows_; + data_ = buffer; - const ChunkedArray& data = *col->data().get(); + return Status::OK(); +} -#define INTEGER_CASE(IN_TYPE) \ - ConvertIntegerWithNulls(data, out_buffer); \ - break; +template +template +inline Status ArrowSerializer::ConvertTypedLists( + const std::shared_ptr& field, std::shared_ptr* out) { + typedef npy_traits traits; + typedef typename traits::value_type T; + typedef typename traits::BuilderClass BuilderT; - switch (type) { - case Type::UINT8: - INTEGER_CASE(uint8_t); - case Type::INT8: - INTEGER_CASE(int8_t); - case Type::UINT16: - INTEGER_CASE(uint16_t); - case Type::INT16: - INTEGER_CASE(int16_t); - case Type::UINT32: - INTEGER_CASE(uint32_t); - case Type::INT32: - INTEGER_CASE(int32_t); - case Type::UINT64: - INTEGER_CASE(uint64_t); - case Type::INT64: - INTEGER_CASE(int64_t); - case Type::FLOAT: - ConvertNumericNullableCast(data, NAN, out_buffer); - break; - case Type::DOUBLE: - ConvertNumericNullable(data, NAN, out_buffer); - break; - default: - return Status::NotImplemented(col->type()->ToString()); - } + auto value_builder = std::make_shared(pool_, field->type); + ListBuilder list_builder(pool_, value_builder); + PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + for (int64_t i = 0; i < length_; ++i) { + if (PyObject_is_null(objects[i])) { + RETURN_NOT_OK(list_builder.AppendNull()); + } else if (PyArray_Check(objects[i])) { + auto numpy_array = reinterpret_cast(objects[i]); + RETURN_NOT_OK(list_builder.Append(true)); -#undef INTEGER_CASE + // TODO(uwe): Support more complex numpy array structures + RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, ITEM_TYPE)); - placement_data_[rel_placement] = abs_placement; - return Status::OK(); + int32_t size = PyArray_DIM(numpy_array, 0); + auto data = reinterpret_cast(PyArray_DATA(numpy_array)); + if (traits::supports_nulls) { + null_bitmap_->Resize(size, false); + // TODO(uwe): A bitmap would be more space-efficient but the Builder API doesn't + // currently support this. 
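+ // (ValuesToBytemap presumably emits one validity byte per element --
+ // e.g. {1, 0, 1} for [1.0, NaN, 2.0] -- which is the layout the
+ // Append(data, size, valid_bytes) overload consumes, trading space for
+ // a simpler Builder API than a packed bitmap.)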
+ // ValuesToBitmap(data, size, null_bitmap_->mutable_data()); + ValuesToBytemap(data, size, null_bitmap_->mutable_data()); + RETURN_NOT_OK(value_builder->Append(data, size, null_bitmap_->data())); + } else { + RETURN_NOT_OK(value_builder->Append(data, size)); + } + } else if (PyList_Check(objects[i])) { + return Status::TypeError("Python lists are not yet supported"); + } else { + return Status::TypeError("Unsupported Python type for list items"); + } } -}; + return list_builder.Finish(out); +} -class BoolBlock : public PandasBlock { - public: - using PandasBlock::PandasBlock; +template <> +template <> +inline Status +ArrowSerializer::ConvertTypedLists( + const std::shared_ptr& field, std::shared_ptr* out) { + // TODO: If there are bytes involed, convert to Binary representation + bool have_bytes = false; - Status Allocate() override { return AllocateNDArray(NPY_BOOL); } + auto value_builder = std::make_shared(pool_, field->type); + ListBuilder list_builder(pool_, value_builder); + PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + for (int64_t i = 0; i < length_; ++i) { + if (PyObject_is_null(objects[i])) { + RETURN_NOT_OK(list_builder.AppendNull()); + } else if (PyArray_Check(objects[i])) { + auto numpy_array = reinterpret_cast(objects[i]); + RETURN_NOT_OK(list_builder.Append(true)); - Status Write(const std::shared_ptr& col, int64_t abs_placement, - int64_t rel_placement) override { - Type::type type = col->type()->type; + // TODO(uwe): Support more complex numpy array structures + RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, NPY_OBJECT)); - if (type != Type::BOOL) { return Status::NotImplemented(col->type()->ToString()); } + int32_t size = PyArray_DIM(numpy_array, 0); + auto data = reinterpret_cast(PyArray_DATA(numpy_array)); + RETURN_NOT_OK(AppendObjectStrings(*value_builder.get(), data, size, &have_bytes)); + } else if (PyList_Check(objects[i])) { + return Status::TypeError("Python lists are not yet supported"); + } else { + return Status::TypeError("Unsupported Python type for list items"); + } + } + return list_builder.Finish(out); +} - uint8_t* out_buffer = - reinterpret_cast(block_data_) + rel_placement * num_rows_; +template <> +inline Status ArrowSerializer::ConvertData() { + return Status::TypeError("NYI"); +} - ConvertBooleanNoNulls(*col->data().get(), out_buffer); - placement_data_[rel_placement] = abs_placement; - return Status::OK(); - } -}; +#define TO_ARROW_CASE(TYPE) \ + case NPY_##TYPE: { \ + ArrowSerializer converter(pool, arr, mask); \ + RETURN_NOT_OK(converter.Convert(out)); \ + } break; -class DatetimeBlock : public PandasBlock { - public: - using PandasBlock::PandasBlock; +Status PandasToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, + const std::shared_ptr& field, std::shared_ptr* out) { + PyArrayObject* arr = reinterpret_cast(ao); + PyArrayObject* mask = nullptr; - Status Allocate() override { - RETURN_NOT_OK(AllocateNDArray(NPY_DATETIME)); + if (mo != nullptr and mo != Py_None) { mask = reinterpret_cast(mo); } - PyAcquireGIL lock; - auto date_dtype = reinterpret_cast( - PyArray_DESCR(reinterpret_cast(block_arr_.obj()))->c_metadata); - date_dtype->meta.base = NPY_FR_ns; - return Status::OK(); + if (PyArray_NDIM(arr) != 1) { + return Status::Invalid("only handle 1-dimensional arrays"); } - Status Write(const std::shared_ptr& col, int64_t abs_placement, - int64_t rel_placement) override { - Type::type type = col->type()->type; + switch (PyArray_DESCR(arr)->type_num) { + TO_ARROW_CASE(BOOL); + TO_ARROW_CASE(INT8); + TO_ARROW_CASE(INT16); + 
TO_ARROW_CASE(INT32); + TO_ARROW_CASE(INT64); + TO_ARROW_CASE(UINT8); + TO_ARROW_CASE(UINT16); + TO_ARROW_CASE(UINT32); + TO_ARROW_CASE(UINT64); + TO_ARROW_CASE(FLOAT32); + TO_ARROW_CASE(FLOAT64); + TO_ARROW_CASE(DATETIME); + case NPY_OBJECT: { + ArrowSerializer converter(pool, arr, mask); + converter.IndicateType(field); + RETURN_NOT_OK(converter.Convert(out)); + } break; + default: + std::stringstream ss; + ss << "unsupported type " << PyArray_DESCR(arr)->type_num << std::endl; + return Status::NotImplemented(ss.str()); + } + return Status::OK(); +} - int64_t* out_buffer = - reinterpret_cast(block_data_) + rel_placement * num_rows_; +class ArrowDeserializer { + public: + ArrowDeserializer(const std::shared_ptr& col, PyObject* py_ref) + : col_(col), data_(*col->data().get()), py_ref_(py_ref) {} - const ChunkedArray& data = *col.get()->data(); + Status AllocateOutput(int type) { + PyAcquireGIL lock; - if (type == Type::DATE) { - // DateType is millisecond timestamp stored as int64_t - // TODO(wesm): Do we want to make sure to zero out the milliseconds? - ConvertDatetimeNanos(data, out_buffer); - } else if (type == Type::TIMESTAMP) { - auto ts_type = static_cast(col->type().get()); + npy_intp dims[1] = {col_->length()}; + result_ = PyArray_SimpleNew(1, dims, type); + arr_ = reinterpret_cast(result_); - if (ts_type->unit == arrow::TimeUnit::NANO) { - ConvertNumericNullable(data, kPandasTimestampNull, out_buffer); - } else if (ts_type->unit == arrow::TimeUnit::MICRO) { - ConvertDatetimeNanos(data, out_buffer); - } else if (ts_type->unit == arrow::TimeUnit::MILLI) { - ConvertDatetimeNanos(data, out_buffer); - } else if (ts_type->unit == arrow::TimeUnit::SECOND) { - ConvertDatetimeNanos(data, out_buffer); - } else { - return Status::NotImplemented("Unsupported time unit"); - } - } else { - return Status::NotImplemented(col->type()->ToString()); + if (arr_ == NULL) { + // Error occurred, trust that SimpleNew set the error state + return Status::OK(); } - placement_data_[rel_placement] = abs_placement; + set_numpy_metadata(type, col_->type().get(), arr_); + return Status::OK(); } -}; -// class CategoricalBlock : public PandasBlock {}; + template + Status ConvertValuesZeroCopy(int npy_type, std::shared_ptr arr) { + typedef typename arrow_traits::T T; -Status MakeBlock(PandasBlock::type type, int64_t num_rows, int num_columns, - std::shared_ptr* block) { -#define BLOCK_CASE(NAME, TYPE) \ - case PandasBlock::NAME: \ - *block = std::make_shared(num_rows, num_columns); \ - break; + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); - switch (type) { - BLOCK_CASE(OBJECT, ObjectBlock); - BLOCK_CASE(UINT8, UInt8Block); - BLOCK_CASE(INT8, Int8Block); - BLOCK_CASE(UINT16, UInt16Block); - BLOCK_CASE(INT16, Int16Block); - BLOCK_CASE(UINT32, UInt32Block); - BLOCK_CASE(INT32, Int32Block); - BLOCK_CASE(UINT64, UInt64Block); - BLOCK_CASE(INT64, Int64Block); - BLOCK_CASE(FLOAT, Float32Block); - BLOCK_CASE(DOUBLE, Float64Block); - BLOCK_CASE(BOOL, BoolBlock); - BLOCK_CASE(DATETIME, DatetimeBlock); - case PandasBlock::CATEGORICAL: - return Status::NotImplemented("categorical"); - } + // Zero-Copy. We can pass the data pointer directly to NumPy. + void* data = const_cast(in_values); -#undef BLOCK_CASE + PyAcquireGIL lock; - return (*block)->Allocate(); -} + // Zero-Copy. We can pass the data pointer directly to NumPy. 
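+ // PyArray_SimpleNewFromData wraps the Arrow buffer without copying. The
+ // PyArray_SetBaseObject call below hands NumPy a reference to py_ref_ so
+ // the buffer outlives this converter, and NPY_ARRAY_WRITEABLE is cleared
+ // because Arrow memory is immutable.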
+ npy_intp dims[1] = {col_->length()}; + result_ = PyArray_SimpleNewFromData(1, dims, npy_type, data); + arr_ = reinterpret_cast(result_); -// Construct the exact pandas 0.x "BlockManager" memory layout -// -// * For each column determine the correct output pandas type -// * Allocate 2D blocks (ncols x nrows) for each distinct data type in output -// * Allocate block placement arrays -// * Write Arrow columns out into each slice of memory; populate block -// * placement arrays as we go -class DataFrameBlockCreator { - public: - DataFrameBlockCreator(const std::shared_ptr
& table) : table_(table) {} + if (arr_ == NULL) { + // Error occurred, trust that SimpleNew set the error state + return Status::OK(); + } + + set_numpy_metadata(npy_type, col_->type().get(), arr_); - Status Convert(int nthreads, PyObject** output) { - column_types_.resize(table_->num_columns()); - column_block_placement_.resize(table_->num_columns()); - type_counts_.clear(); - blocks_.clear(); + if (PyArray_SetBaseObject(arr_, py_ref_) == -1) { + // Error occurred, trust that SetBaseObject set the error state + return Status::OK(); + } else { + // PyArray_SetBaseObject steals our reference to py_ref_ + Py_INCREF(py_ref_); + } - RETURN_NOT_OK(CountColumnTypes()); - RETURN_NOT_OK(CreateBlocks()); - RETURN_NOT_OK(WriteTableToBlocks(nthreads)); + // Arrow data is immutable. + PyArray_CLEARFLAGS(arr_, NPY_ARRAY_WRITEABLE); - return GetResultList(output); + return Status::OK(); } - Status CountColumnTypes() { - for (int i = 0; i < table_->num_columns(); ++i) { - std::shared_ptr col = table_->column(i); - PandasBlock::type output_type; + // ---------------------------------------------------------------------- + // Allocate new array and deserialize. Can do a zero copy conversion for some + // types - switch (col->type()->type) { - case Type::BOOL: - output_type = col->null_count() > 0 ? PandasBlock::OBJECT : PandasBlock::BOOL; - break; - case Type::UINT8: - output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::UINT8; - break; - case Type::INT8: - output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::INT8; - break; - case Type::UINT16: - output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::UINT16; - break; - case Type::INT16: - output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::INT16; - break; - case Type::UINT32: - output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::UINT32; - break; - case Type::INT32: - output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::INT32; - break; - case Type::INT64: - output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::INT64; - break; - case Type::UINT64: - output_type = col->null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::UINT64; - break; - case Type::FLOAT: - output_type = PandasBlock::FLOAT; - break; - case Type::DOUBLE: - output_type = PandasBlock::DOUBLE; - break; - case Type::STRING: - case Type::BINARY: - output_type = PandasBlock::OBJECT; - break; - case Type::DATE: - output_type = PandasBlock::DATETIME; - break; - case Type::TIMESTAMP: - output_type = PandasBlock::DATETIME; - break; - case Type::LIST: { - auto list_type = std::static_pointer_cast(col->type()); - switch (list_type->value_type()->type) { - case Type::UINT8: - case Type::INT8: - case Type::UINT16: - case Type::INT16: - case Type::UINT32: - case Type::INT32: - case Type::INT64: - case Type::UINT64: - case Type::FLOAT: - case Type::DOUBLE: - case Type::STRING: - case Type::TIMESTAMP: - // The above types are all supported. 
- break; - default: { - std::stringstream ss; - ss << "Not implemented type for lists: " - << list_type->value_type()->ToString(); - return Status::NotImplemented(ss.str()); - } - } - output_type = PandasBlock::OBJECT; - } break; - default: - return Status::NotImplemented(col->type()->ToString()); - } + Status Convert(PyObject** out) { +#define CONVERT_CASE(TYPE) \ + case Type::TYPE: { \ + RETURN_NOT_OK(ConvertValues()); \ + } break; - auto it = type_counts_.find(output_type); - int block_placement = 0; - if (it != type_counts_.end()) { - block_placement = it->second; - // Increment count - it->second += 1; - } else { - // Add key to map - type_counts_[output_type] = 1; + switch (col_->type()->type) { + CONVERT_CASE(BOOL); + CONVERT_CASE(INT8); + CONVERT_CASE(INT16); + CONVERT_CASE(INT32); + CONVERT_CASE(INT64); + CONVERT_CASE(UINT8); + CONVERT_CASE(UINT16); + CONVERT_CASE(UINT32); + CONVERT_CASE(UINT64); + CONVERT_CASE(FLOAT); + CONVERT_CASE(DOUBLE); + CONVERT_CASE(BINARY); + CONVERT_CASE(STRING); + CONVERT_CASE(DATE); + CONVERT_CASE(TIMESTAMP); + CONVERT_CASE(DICTIONARY); + default: { + std::stringstream ss; + ss << "Arrow type reading not implemented for " << col_->type()->ToString(); + return Status::NotImplemented(ss.str()); } - - column_types_[i] = output_type; - column_block_placement_[i] = block_placement; } + +#undef CONVERT_CASE + + *out = result_; return Status::OK(); } - Status CreateBlocks() { - for (const auto& it : type_counts_) { - PandasBlock::type type = static_cast(it.first); - std::shared_ptr block; - RETURN_NOT_OK(MakeBlock(type, table_->num_rows(), it.second, &block)); - blocks_[type] = block; + template + inline typename std::enable_if< + (TYPE != Type::DATE) & arrow_traits::is_numeric_nullable, Status>::type + ConvertValues() { + typedef typename arrow_traits::T T; + int npy_type = arrow_traits::npy_type; + + if (data_.num_chunks() == 1 && data_.null_count() == 0 && py_ref_ != nullptr) { + return ConvertValuesZeroCopy(npy_type, data_.chunk(0)); } + + RETURN_NOT_OK(AllocateOutput(npy_type)); + auto out_values = reinterpret_cast(PyArray_DATA(arr_)); + ConvertNumericNullable(data_, arrow_traits::na_value, out_values); + return Status::OK(); } - Status WriteTableToBlocks(int nthreads) { - auto WriteColumn = [this](int i) { - std::shared_ptr col = this->table_->column(i); - PandasBlock::type output_type = this->column_types_[i]; + template + inline typename std::enable_if::type + ConvertValues() { + typedef typename arrow_traits::T T; - int rel_placement = this->column_block_placement_[i]; + RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + auto out_values = reinterpret_cast(PyArray_DATA(arr_)); + ConvertDates(data_, arrow_traits::na_value, out_values); + return Status::OK(); + } - auto it = this->blocks_.find(output_type); - if (it == this->blocks_.end()) { return Status::KeyError("No block allocated"); } - return it->second->Write(col, i, rel_placement); - }; + // Integer specialization + template + inline + typename std::enable_if::is_numeric_not_nullable, Status>::type + ConvertValues() { + typedef typename arrow_traits::T T; + int npy_type = arrow_traits::npy_type; - nthreads = std::min(nthreads, table_->num_columns()); + if (data_.num_chunks() == 1 && data_.null_count() == 0 && py_ref_ != nullptr) { + return ConvertValuesZeroCopy(npy_type, data_.chunk(0)); + } - if (nthreads == 1) { - for (int i = 0; i < table_->num_columns(); ++i) { - RETURN_NOT_OK(WriteColumn(i)); - } + if (data_.null_count() > 0) { + RETURN_NOT_OK(AllocateOutput(NPY_FLOAT64)); + auto 
out_values = reinterpret_cast(PyArray_DATA(arr_)); + ConvertIntegerWithNulls(data_, out_values); } else { - std::vector thread_pool; - thread_pool.reserve(nthreads); - std::atomic task_counter(0); - - std::mutex error_mtx; - bool error_occurred = false; - Status error; + RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + auto out_values = reinterpret_cast(PyArray_DATA(arr_)); + ConvertIntegerNoNullsSameType(data_, out_values); + } - for (int thread_id = 0; thread_id < nthreads; ++thread_id) { - thread_pool.emplace_back( - [this, &error, &error_occurred, &error_mtx, &task_counter, &WriteColumn]() { - int column_num; - while (!error_occurred) { - column_num = task_counter.fetch_add(1); - if (column_num >= this->table_->num_columns()) { break; } - Status s = WriteColumn(column_num); - if (!s.ok()) { - std::lock_guard lock(error_mtx); - error_occurred = true; - error = s; - break; - } - } - }); - } - for (auto&& thread : thread_pool) { - thread.join(); - } + return Status::OK(); + } - if (error_occurred) { return error; } + // Boolean specialization + template + inline typename std::enable_if::is_boolean, Status>::type + ConvertValues() { + if (data_.null_count() > 0) { + RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); + auto out_values = reinterpret_cast(PyArray_DATA(arr_)); + RETURN_NOT_OK(ConvertBooleanWithNulls(data_, out_values)); + } else { + RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + auto out_values = reinterpret_cast(PyArray_DATA(arr_)); + ConvertBooleanNoNulls(data_, out_values); } return Status::OK(); } - Status GetResultList(PyObject** out) { - PyAcquireGIL lock; + // UTF8 strings + template + inline typename std::enable_if::type + ConvertValues() { + RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); + auto out_values = reinterpret_cast(PyArray_DATA(arr_)); + return ConvertBinaryLike(data_, out_values); + } - auto num_blocks = static_cast(blocks_.size()); - PyObject* result = PyList_New(num_blocks); - RETURN_IF_PYERROR(); + template + inline typename std::enable_if::type + ConvertValues() { + RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); + auto out_values = reinterpret_cast(PyArray_DATA(arr_)); + return ConvertBinaryLike(data_, out_values); + } - int i = 0; - for (const auto& it : blocks_) { - const std::shared_ptr block = it.second; + template + inline typename std::enable_if::type + ConvertValues() { + std::shared_ptr block; + RETURN_NOT_OK(MakeCategoricalBlock(col_->type(), col_->length(), &block)); + RETURN_NOT_OK(block->Write(col_, 0, 0)); - PyObject* item = PyTuple_New(2); - RETURN_IF_PYERROR(); + auto dict_type = static_cast(col_->type().get()); - PyObject* block_arr = block->block_arr(); - PyObject* placement_arr = block->placement_arr(); - Py_INCREF(block_arr); - Py_INCREF(placement_arr); - PyTuple_SET_ITEM(item, 0, block_arr); - PyTuple_SET_ITEM(item, 1, placement_arr); + PyAcquireGIL lock; + result_ = PyDict_New(); + RETURN_IF_PYERROR(); + + PyObject* dictionary; + RETURN_NOT_OK(ConvertArrayToPandas(dict_type->dictionary(), nullptr, &dictionary)); + + PyDict_SetItemString(result_, "indices", block->block_arr()); + PyDict_SetItemString(result_, "dictionary", dictionary); - if (PyList_SET_ITEM(result, i++, item) < 0) { RETURN_IF_PYERROR(); } - } - *out = result; return Status::OK(); } private: - std::shared_ptr
table_; - - // column num -> block type id - std::vector column_types_; - - // column num -> relative placement within internal block - std::vector column_block_placement_; - - // block type -> type count - std::unordered_map type_counts_; - - // block type -> block - std::unordered_map> blocks_; + std::shared_ptr col_; + const arrow::ChunkedArray& data_; + PyObject* py_ref_; + PyArrayObject* arr_; + PyObject* result_; }; -Status ConvertTableToPandas( - const std::shared_ptr
& table, int nthreads, PyObject** out) { - DataFrameBlockCreator helper(table); - return helper.Convert(nthreads, out); +Status ConvertArrayToPandas( + const std::shared_ptr& arr, PyObject* py_ref, PyObject** out) { + static std::string dummy_name = "dummy"; + auto field = std::make_shared(dummy_name, arr->type()); + auto col = std::make_shared(field, arr); + return ConvertColumnToPandas(col, py_ref, out); +} + +Status ConvertColumnToPandas( + const std::shared_ptr& col, PyObject* py_ref, PyObject** out) { + ArrowDeserializer converter(col, py_ref); + return converter.Convert(out); } } // namespace pyarrow diff --git a/python/src/pyarrow/adapters/pandas.h b/python/src/pyarrow/adapters/pandas.h index 664365e398384..b548f9321d75a 100644 --- a/python/src/pyarrow/adapters/pandas.h +++ b/python/src/pyarrow/adapters/pandas.h @@ -63,11 +63,7 @@ arrow::Status ConvertTableToPandas( const std::shared_ptr& table, int nthreads, PyObject** out); PYARROW_EXPORT -arrow::Status PandasMaskedToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, - const std::shared_ptr& field, std::shared_ptr* out); - -PYARROW_EXPORT -arrow::Status PandasToArrow(arrow::MemoryPool* pool, PyObject* ao, +arrow::Status PandasToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, const std::shared_ptr& field, std::shared_ptr* out); } // namespace pyarrow diff --git a/python/src/pyarrow/common.cc b/python/src/pyarrow/common.cc index 0bdd289953dc4..b8712d7d0a4fc 100644 --- a/python/src/pyarrow/common.cc +++ b/python/src/pyarrow/common.cc @@ -93,7 +93,7 @@ PyBytesBuffer::PyBytesBuffer(PyObject* obj) } PyBytesBuffer::~PyBytesBuffer() { - PyGILGuard lock; + PyAcquireGIL lock; Py_DECREF(obj_); } diff --git a/python/src/pyarrow/common.h b/python/src/pyarrow/common.h index 639918d309fe7..0733a3b7cf061 100644 --- a/python/src/pyarrow/common.h +++ b/python/src/pyarrow/common.h @@ -30,6 +30,17 @@ class MemoryPool; namespace pyarrow { +class PyAcquireGIL { + public: + PyAcquireGIL() { state_ = PyGILState_Ensure(); } + + ~PyAcquireGIL() { PyGILState_Release(state_); } + + private: + PyGILState_STATE state_; + DISALLOW_COPY_AND_ASSIGN(PyAcquireGIL); +}; + #define PYARROW_IS_PY2 PY_MAJOR_VERSION <= 2 class OwnedRef { @@ -38,7 +49,10 @@ class OwnedRef { OwnedRef(PyObject* obj) : obj_(obj) {} - ~OwnedRef() { Py_XDECREF(obj_); } + ~OwnedRef() { + PyAcquireGIL lock; + Py_XDECREF(obj_); + } void reset(PyObject* obj) { if (obj_ != nullptr) { Py_XDECREF(obj_); } @@ -69,17 +83,6 @@ struct PyObjectStringify { } }; -class PyGILGuard { - public: - PyGILGuard() { state_ = PyGILState_Ensure(); } - - ~PyGILGuard() { PyGILState_Release(state_); } - - private: - PyGILState_STATE state_; - DISALLOW_COPY_AND_ASSIGN(PyGILGuard); -}; - // TODO(wesm): We can just let errors pass through. 
To be explored later #define RETURN_IF_PYERROR() \ if (PyErr_Occurred()) { \ @@ -88,8 +91,9 @@ class PyGILGuard { PyObjectStringify stringified(exc_value); \ std::string message(stringified.bytes); \ Py_DECREF(exc_type); \ - Py_DECREF(exc_value); \ - Py_DECREF(traceback); \ + Py_XDECREF(exc_value); \ + Py_XDECREF(traceback); \ + PyErr_Clear(); \ return Status::UnknownError(message); \ } @@ -122,17 +126,6 @@ class PYARROW_EXPORT PyBytesBuffer : public arrow::Buffer { PyObject* obj_; }; -class PyAcquireGIL { - public: - PyAcquireGIL() { state_ = PyGILState_Ensure(); } - - ~PyAcquireGIL() { PyGILState_Release(state_); } - - private: - PyGILState_STATE state_; - DISALLOW_COPY_AND_ASSIGN(PyAcquireGIL); -}; - } // namespace pyarrow #endif // PYARROW_COMMON_H diff --git a/python/src/pyarrow/io.cc b/python/src/pyarrow/io.cc index 01f851d874075..92352607e62ec 100644 --- a/python/src/pyarrow/io.cc +++ b/python/src/pyarrow/io.cc @@ -114,22 +114,22 @@ PyReadableFile::PyReadableFile(PyObject* file) { PyReadableFile::~PyReadableFile() {} Status PyReadableFile::Close() { - PyGILGuard lock; + PyAcquireGIL lock; return file_->Close(); } Status PyReadableFile::Seek(int64_t position) { - PyGILGuard lock; + PyAcquireGIL lock; return file_->Seek(position, 0); } Status PyReadableFile::Tell(int64_t* position) { - PyGILGuard lock; + PyAcquireGIL lock; return file_->Tell(position); } Status PyReadableFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) { - PyGILGuard lock; + PyAcquireGIL lock; PyObject* bytes_obj; ARROW_RETURN_NOT_OK(file_->Read(nbytes, &bytes_obj)); @@ -141,7 +141,7 @@ Status PyReadableFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) { } Status PyReadableFile::Read(int64_t nbytes, std::shared_ptr* out) { - PyGILGuard lock; + PyAcquireGIL lock; PyObject* bytes_obj; ARROW_RETURN_NOT_OK(file_->Read(nbytes, &bytes_obj)); @@ -153,7 +153,7 @@ Status PyReadableFile::Read(int64_t nbytes, std::shared_ptr* out) } Status PyReadableFile::GetSize(int64_t* size) { - PyGILGuard lock; + PyAcquireGIL lock; int64_t current_position; ; @@ -185,17 +185,17 @@ PyOutputStream::PyOutputStream(PyObject* file) { PyOutputStream::~PyOutputStream() {} Status PyOutputStream::Close() { - PyGILGuard lock; + PyAcquireGIL lock; return file_->Close(); } Status PyOutputStream::Tell(int64_t* position) { - PyGILGuard lock; + PyAcquireGIL lock; return file_->Tell(position); } Status PyOutputStream::Write(const uint8_t* data, int64_t nbytes) { - PyGILGuard lock; + PyAcquireGIL lock; return file_->Write(data, nbytes); } diff --git a/python/src/pyarrow/util/CMakeLists.txt b/python/src/pyarrow/util/CMakeLists.txt index 4afb4d0f912b1..6cd49cb75a4fb 100644 --- a/python/src/pyarrow/util/CMakeLists.txt +++ b/python/src/pyarrow/util/CMakeLists.txt @@ -20,7 +20,7 @@ ####################################### if (PYARROW_BUILD_TESTS) - add_library(pyarrow_test_main + add_library(pyarrow_test_main STATIC test_main.cc) if (APPLE) diff --git a/python/src/pyarrow/util/test_main.cc b/python/src/pyarrow/util/test_main.cc index 6fb7c0536eed3..02e9a54f65914 100644 --- a/python/src/pyarrow/util/test_main.cc +++ b/python/src/pyarrow/util/test_main.cc @@ -15,12 +15,22 @@ // specific language governing permissions and limitations // under the License. 
+#include + #include +#include "pyarrow/do_import_numpy.h" +#include "pyarrow/numpy_interop.h" + int main(int argc, char** argv) { ::testing::InitGoogleTest(&argc, argv); + Py_Initialize(); + pyarrow::import_numpy(); + int ret = RUN_ALL_TESTS(); + Py_Finalize(); + return ret; } From 6811d3fcfc9da65e24b6d0f2ad5d5d348d879f11 Mon Sep 17 00:00:00 2001 From: Nong Li Date: Thu, 19 Jan 2017 14:45:24 -0500 Subject: [PATCH 0282/1644] ARROW-474: [Java] Add initial version of streaming serialized format. This patch proposes a serialized container format for streaming producer and consumers. The goal is to allow readers and writers to produce/consume arrow data without requiring intermediate buffering. This is similar to the File format but reorganizes the pieces. In particular: - No magic header. It's likely a reader connects to a 'random' stream to read it. - Move footer to header. This includes similar information, including the schema. - ArrowRecordBatches follow one by one. Each is prefixed with an i32 length. The serialization is identical as the File version. - See Stream.fbs for more details. This patch also implements the Java reader/writer. Author: Nong Li Closes #288 from nongli/streaming and squashes the following commits: 554cc18 [Nong Li] Redo serialization format. 03bee58 [Nong Li] Updates from wes' comments. 7257031 [Nong Li] ARROW-474: [Java] Add initial version of streaming serialized format. --- .../apache/arrow/vector/file/ArrowReader.java | 12 +- .../apache/arrow/vector/file/ArrowWriter.java | 102 ++------- .../apache/arrow/vector/file/ReadChannel.java | 75 ++++++ .../arrow/vector/file/WriteChannel.java | 111 +++++++++ .../arrow/vector/schema/ArrowRecordBatch.java | 23 ++ .../vector/stream/ArrowStreamReader.java | 95 ++++++++ .../vector/stream/ArrowStreamWriter.java | 71 ++++++ .../vector/stream/MessageSerializer.java | 216 ++++++++++++++++++ .../arrow/vector/types/pojo/Schema.java | 5 + .../arrow/vector/file/TestArrowFile.java | 149 +++++++++++- .../vector/stream/MessageSerializerTest.java | 115 ++++++++++ .../arrow/vector/stream/TestArrowStream.java | 96 ++++++++ .../vector/stream/TestArrowStreamPipe.java | 129 +++++++++++ 13 files changed, 1100 insertions(+), 99 deletions(-) create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/file/ReadChannel.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/file/WriteChannel.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamReader.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamWriter.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/stream/MessageSerializerTest.java create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStream.java create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStreamPipe.java diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java index cd520da54f2f5..58c51605c5600 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java @@ -31,6 +31,7 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.schema.ArrowFieldNode; import org.apache.arrow.vector.schema.ArrowRecordBatch; +import 
org.apache.arrow.vector.stream.MessageSerializer; import org.slf4j.Logger; import org.slf4j.LoggerFactory; @@ -39,7 +40,7 @@ public class ArrowReader implements AutoCloseable { private static final Logger LOGGER = LoggerFactory.getLogger(ArrowReader.class); - private static final byte[] MAGIC = "ARROW1".getBytes(); + public static final byte[] MAGIC = "ARROW1".getBytes(); private final SeekableByteChannel in; @@ -73,13 +74,6 @@ private int readFully(ByteBuffer buffer) throws IOException { return total; } - private static int bytesToInt(byte[] bytes) { - return ((int)(bytes[3] & 255) << 24) + - ((int)(bytes[2] & 255) << 16) + - ((int)(bytes[1] & 255) << 8) + - ((int)(bytes[0] & 255) << 0); - } - public ArrowFooter readFooter() throws IOException { if (footer == null) { if (in.size() <= (MAGIC.length * 2 + 4)) { @@ -93,7 +87,7 @@ public ArrowFooter readFooter() throws IOException { if (!Arrays.equals(MAGIC, Arrays.copyOfRange(array, 4, array.length))) { throw new InvalidArrowFileException("missing Magic number " + Arrays.toString(buffer.array())); } - int footerLength = bytesToInt(array); + int footerLength = MessageSerializer.bytesToInt(array); if (footerLength <= 0 || footerLength + MAGIC.length * 2 + 4 > in.size()) { throw new InvalidArrowFileException("invalid footer length: " + footerLength); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java index 1cd87ebc33594..3febd11f4c76a 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java @@ -18,7 +18,6 @@ package org.apache.arrow.vector.file; import java.io.IOException; -import java.nio.ByteBuffer; import java.nio.channels.WritableByteChannel; import java.util.ArrayList; import java.util.Collections; @@ -26,32 +25,25 @@ import org.apache.arrow.vector.schema.ArrowBuffer; import org.apache.arrow.vector.schema.ArrowRecordBatch; -import org.apache.arrow.vector.schema.FBSerializable; import org.apache.arrow.vector.types.pojo.Schema; import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import com.google.flatbuffers.FlatBufferBuilder; - import io.netty.buffer.ArrowBuf; public class ArrowWriter implements AutoCloseable { private static final Logger LOGGER = LoggerFactory.getLogger(ArrowWriter.class); - private static final byte[] MAGIC = "ARROW1".getBytes(); - - private final WritableByteChannel out; + private final WriteChannel out; private final Schema schema; private final List recordBatches = new ArrayList<>(); - private long currentPosition = 0; - private boolean started = false; public ArrowWriter(WritableByteChannel out, Schema schema) { - this.out = out; + this.out = new WriteChannel(out); this.schema = schema; } @@ -59,53 +51,19 @@ private void start() throws IOException { writeMagic(); } - private long write(byte[] buffer) throws IOException { - return write(ByteBuffer.wrap(buffer)); - } - - private long writeZeros(int zeroCount) throws IOException { - return write(new byte[zeroCount]); - } - - private long align() throws IOException { - if (currentPosition % 8 != 0) { // align on 8 byte boundaries - return writeZeros(8 - (int)(currentPosition % 8)); - } - return 0; - } - - private long write(ByteBuffer buffer) throws IOException { - long length = buffer.remaining(); - out.write(buffer); - currentPosition += length; - return length; - } - - private static byte[] intToBytes(int value) { - byte[] outBuffer = new byte[4]; - 
outBuffer[3] = (byte)(value >>> 24); - outBuffer[2] = (byte)(value >>> 16); - outBuffer[1] = (byte)(value >>> 8); - outBuffer[0] = (byte)(value >>> 0); - return outBuffer; - } - - private long writeIntLittleEndian(int v) throws IOException { - return write(intToBytes(v)); - } // TODO: write dictionaries public void writeRecordBatch(ArrowRecordBatch recordBatch) throws IOException { checkStarted(); - align(); + out.align(); // write metadata header with int32 size prefix - long offset = currentPosition; - write(recordBatch, true); - align(); + long offset = out.getCurrentPosition(); + out.write(recordBatch, true); + out.align(); // write body - long bodyOffset = currentPosition; + long bodyOffset = out.getCurrentPosition(); List buffers = recordBatch.getBuffers(); List buffersLayout = recordBatch.getBuffersLayout(); if (buffers.size() != buffersLayout.size()) { @@ -115,31 +73,25 @@ public void writeRecordBatch(ArrowRecordBatch recordBatch) throws IOException { ArrowBuf buffer = buffers.get(i); ArrowBuffer layout = buffersLayout.get(i); long startPosition = bodyOffset + layout.getOffset(); - if (startPosition != currentPosition) { - writeZeros((int)(startPosition - currentPosition)); + if (startPosition != out.getCurrentPosition()) { + out.writeZeros((int)(startPosition - out.getCurrentPosition())); } - write(buffer); - if (currentPosition != startPosition + layout.getSize()) { - throw new IllegalStateException("wrong buffer size: " + currentPosition + " != " + startPosition + layout.getSize()); + out.write(buffer); + if (out.getCurrentPosition() != startPosition + layout.getSize()) { + throw new IllegalStateException("wrong buffer size: " + out.getCurrentPosition() + " != " + startPosition + layout.getSize()); } } int metadataLength = (int)(bodyOffset - offset); if (metadataLength <= 0) { throw new InvalidArrowFileException("invalid recordBatch"); } - long bodyLength = currentPosition - bodyOffset; + long bodyLength = out.getCurrentPosition() - bodyOffset; LOGGER.debug(String.format("RecordBatch at %d, metadata: %d, body: %d", offset, metadataLength, bodyLength)); // add metadata to footer recordBatches.add(new ArrowBlock(offset, metadataLength, bodyLength)); } - private void write(ArrowBuf buffer) throws IOException { - ByteBuffer nioBuffer = buffer.nioBuffer(buffer.readerIndex(), buffer.readableBytes()); - LOGGER.debug("Writing buffer with size: " + nioBuffer.remaining()); - write(nioBuffer); - } - private void checkStarted() throws IOException { if (!started) { started = true; @@ -147,15 +99,16 @@ private void checkStarted() throws IOException { } } + @Override public void close() throws IOException { try { - long footerStart = currentPosition; + long footerStart = out.getCurrentPosition(); writeFooter(); - int footerLength = (int)(currentPosition - footerStart); + int footerLength = (int)(out.getCurrentPosition() - footerStart); if (footerLength <= 0 ) { throw new InvalidArrowFileException("invalid footer"); } - writeIntLittleEndian(footerLength); + out.writeIntLittleEndian(footerLength); LOGGER.debug(String.format("Footer starts at %d, length: %d", footerStart, footerLength)); writeMagic(); } finally { @@ -164,27 +117,12 @@ public void close() throws IOException { } private void writeMagic() throws IOException { - write(MAGIC); - LOGGER.debug(String.format("magic written, now at %d", currentPosition)); + out.write(ArrowReader.MAGIC); + LOGGER.debug(String.format("magic written, now at %d", out.getCurrentPosition())); } private void writeFooter() throws IOException { // TODO: 
dictionaries - write(new ArrowFooter(schema, Collections.emptyList(), recordBatches), false); - } - - private long write(FBSerializable writer, boolean withSizePrefix) throws IOException { - FlatBufferBuilder builder = new FlatBufferBuilder(); - int root = writer.writeTo(builder); - builder.finish(root); - - ByteBuffer buffer = builder.dataBuffer(); - - if (withSizePrefix) { - writeIntLittleEndian(buffer.remaining()); - } - - return write(buffer); + out.write(new ArrowFooter(schema, Collections.emptyList(), recordBatches), false); } - } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ReadChannel.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ReadChannel.java new file mode 100644 index 0000000000000..b062f3826eab3 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ReadChannel.java @@ -0,0 +1,75 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.file; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.nio.channels.ReadableByteChannel; + +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import io.netty.buffer.ArrowBuf; + +public class ReadChannel implements AutoCloseable { + + private static final Logger LOGGER = LoggerFactory.getLogger(ReadChannel.class); + + private ReadableByteChannel in; + private long bytesRead = 0; + + public ReadChannel(ReadableByteChannel in) { + this.in = in; + } + + public long bytesRead() { return bytesRead; } + + /** + * Reads bytes into buffer until it is full (buffer.remaining() == 0). Returns the + * number of bytes read which can be less than full if there are no more. + */ + public int readFully(ByteBuffer buffer) throws IOException { + LOGGER.debug("Reading buffer with size: " + buffer.remaining()); + int totalRead = 0; + while (buffer.remaining() != 0) { + int read = in.read(buffer); + if (read < 0) return totalRead; + totalRead += read; + if (read == 0) break; + } + this.bytesRead += totalRead; + return totalRead; + } + + /** + * Reads up to len into buffer. Returns bytes read. 
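+   *
+   * A usage sketch (illustrative only; {@code allocator} is an assumed
+   * BufferAllocator owned by the caller): read a 4 byte little endian length
+   * prefix, then read a body of that length:
+   * <pre>
+   *   ByteBuffer prefix = ByteBuffer.allocate(4);
+   *   if (in.readFully(prefix) == 4) {
+   *     int len = MessageSerializer.bytesToInt(prefix.array());
+   *     ArrowBuf body = allocator.buffer(len);
+   *     in.readFully(body, len);
+   *   }
+   * </pre>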
+   */
+  public int readFully(ArrowBuf buffer, int len) throws IOException {
+    int n = readFully(buffer.nioBuffer(buffer.writerIndex(), len));
+    buffer.writerIndex(n);
+    return n;
+  }
+
+  @Override
+  public void close() throws IOException {
+    if (this.in != null) {
+      in.close();
+      in = null;
+    }
+  }
+}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/WriteChannel.java b/java/vector/src/main/java/org/apache/arrow/vector/file/WriteChannel.java
new file mode 100644
index 0000000000000..d99c9a6c99958
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/file/WriteChannel.java
@@ -0,0 +1,111 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector.file;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.channels.WritableByteChannel;
+
+import org.apache.arrow.vector.schema.FBSerializable;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import com.google.flatbuffers.FlatBufferBuilder;
+
+import io.netty.buffer.ArrowBuf;
+
+/**
+ * Wrapper around a WritableByteChannel that maintains the position as well as adding
+ * some common serialization utilities.
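+ *
+ * A minimal sketch of how the writers use it (illustrative only; {@code metadata}
+ * is an assumed flatbuffer ByteBuffer):
+ * <pre>
+ *   WriteChannel out = new WriteChannel(Channels.newChannel(outputStream));
+ *   out.writeIntLittleEndian(metadata.remaining()); // 4 byte size prefix
+ *   out.write(metadata);                            // flatbuffer bytes
+ *   out.align();                                    // zero-pad to an 8 byte boundary
+ *   long written = out.getCurrentPosition();
+ * </pre>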
+ */ +public class WriteChannel implements AutoCloseable { + private static final Logger LOGGER = LoggerFactory.getLogger(WriteChannel.class); + + private long currentPosition = 0; + + private final WritableByteChannel out; + + public WriteChannel(WritableByteChannel out) { + this.out = out; + } + + @Override + public void close() throws IOException { + out.close(); + } + + public long getCurrentPosition() { + return currentPosition; + } + + public long write(byte[] buffer) throws IOException { + return write(ByteBuffer.wrap(buffer)); + } + + public long writeZeros(int zeroCount) throws IOException { + return write(new byte[zeroCount]); + } + + public long align() throws IOException { + if (currentPosition % 8 != 0) { // align on 8 byte boundaries + return writeZeros(8 - (int)(currentPosition % 8)); + } + return 0; + } + + public long write(ByteBuffer buffer) throws IOException { + long length = buffer.remaining(); + LOGGER.debug("Writing buffer with size: " + length); + out.write(buffer); + currentPosition += length; + return length; + } + + public static byte[] intToBytes(int value) { + byte[] outBuffer = new byte[4]; + outBuffer[3] = (byte)(value >>> 24); + outBuffer[2] = (byte)(value >>> 16); + outBuffer[1] = (byte)(value >>> 8); + outBuffer[0] = (byte)(value >>> 0); + return outBuffer; + } + + public long writeIntLittleEndian(int v) throws IOException { + return write(intToBytes(v)); + } + + public void write(ArrowBuf buffer) throws IOException { + ByteBuffer nioBuffer = buffer.nioBuffer(buffer.readerIndex(), buffer.readableBytes()); + write(nioBuffer); + } + + public long write(FBSerializable writer, boolean withSizePrefix) throws IOException { + ByteBuffer buffer = serialize(writer); + if (withSizePrefix) { + writeIntLittleEndian(buffer.remaining()); + } + return write(buffer); + } + + public static ByteBuffer serialize(FBSerializable writer) { + FlatBufferBuilder builder = new FlatBufferBuilder(); + int root = writer.writeTo(builder); + builder.finish(root); + return builder.dataBuffer(); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowRecordBatch.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowRecordBatch.java index adb99e2f3ffb7..40c2fbfd984f8 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowRecordBatch.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowRecordBatch.java @@ -19,6 +19,7 @@ import static org.apache.arrow.vector.schema.FBSerializables.writeAllStructsToVector; +import java.nio.ByteBuffer; import java.util.ArrayList; import java.util.Collections; import java.util.List; @@ -130,6 +131,28 @@ public String toString() { + buffersLayout + ", closed=" + closed + "]"; } + /** + * Computes the size of the serialized body for this recordBatch. 
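+   * The result counts each buffer's readable bytes plus any alignment padding
+   * implied by its layout offset. For example, a 2 byte validity buffer at
+   * offset 0 followed by a 16 byte value buffer at offset 8 yields 8 + 16 = 24.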
+   */
+  public int computeBodyLength() {
+    int size = 0;
+
+    List<ArrowBuf> buffers = getBuffers();
+    List<ArrowBuffer> buffersLayout = getBuffersLayout();
+    if (buffers.size() != buffersLayout.size()) {
+      throw new IllegalStateException("the layout does not match: " +
+          buffers.size() + " != " + buffersLayout.size());
+    }
+    for (int i = 0; i < buffers.size(); i++) {
+      ArrowBuf buffer = buffers.get(i);
+      ArrowBuffer layout = buffersLayout.get(i);
+      size += (layout.getOffset() - size);
+      ByteBuffer nioBuffer =
+          buffer.nioBuffer(buffer.readerIndex(), buffer.readableBytes());
+      size += nioBuffer.remaining();
+    }
+    return size;
+  }
 }
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamReader.java b/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamReader.java
new file mode 100644
index 0000000000000..f32966c5d5217
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamReader.java
@@ -0,0 +1,95 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector.stream;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.channels.Channels;
+import java.nio.channels.ReadableByteChannel;
+
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.file.ReadChannel;
+import org.apache.arrow.vector.schema.ArrowRecordBatch;
+import org.apache.arrow.vector.types.pojo.Schema;
+
+import com.google.common.base.Preconditions;
+
+/**
+ * This class reads from an input stream and produces ArrowRecordBatches.
+ */
+public class ArrowStreamReader implements AutoCloseable {
+  private ReadChannel in;
+  private final BufferAllocator allocator;
+  private Schema schema;
+
+  /**
+   * Constructs a streaming reader, reading bytes from 'in'. Non-blocking.
+   */
+  public ArrowStreamReader(ReadableByteChannel in, BufferAllocator allocator) {
+    super();
+    this.in = new ReadChannel(in);
+    this.allocator = allocator;
+  }
+
+  public ArrowStreamReader(InputStream in, BufferAllocator allocator) {
+    this(Channels.newChannel(in), allocator);
+  }
+
+  /**
+   * Initializes the reader. Must be called before the other APIs. This is blocking.
+   */
+  public void init() throws IOException {
+    Preconditions.checkState(this.schema == null, "Cannot call init() more than once.");
+    this.schema = readSchema();
+  }
+
+  /**
+   * Returns the schema for all records in this stream.
+   */
+  public Schema getSchema() {
+    Preconditions.checkState(this.schema != null, "Must call init() first.");
+    return schema;
+  }
+
+  public long bytesRead() { return in.bytesRead(); }
+
+  /**
+   * Reads and returns the next ArrowRecordBatch. Returns null if this is the end
+   * of stream.
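+   *
+   * A typical consumption loop (illustrative only):
+   * <pre>
+   *   reader.init();
+   *   ArrowRecordBatch batch;
+   *   while ((batch = reader.nextRecordBatch()) != null) {
+   *     try {
+   *       // e.g. load into vectors with VectorLoader
+   *     } finally {
+   *       batch.close();
+   *     }
+   *   }
+   * </pre>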
+ */ + public ArrowRecordBatch nextRecordBatch() throws IOException { + Preconditions.checkState(this.in != null, "Cannot call after close()"); + Preconditions.checkState(this.schema != null, "Must call init() first."); + return MessageSerializer.deserializeRecordBatch(in, allocator); + } + + @Override + public void close() throws IOException { + if (this.in != null) { + in.close(); + in = null; + } + } + + /** + * Reads the schema message from the beginning of the stream. + */ + private Schema readSchema() throws IOException { + return MessageSerializer.deserializeSchema(in); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamWriter.java new file mode 100644 index 0000000000000..06acf9f2c140e --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamWriter.java @@ -0,0 +1,71 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.stream; + +import java.io.IOException; +import java.io.OutputStream; +import java.nio.channels.Channels; +import java.nio.channels.WritableByteChannel; + +import org.apache.arrow.vector.file.WriteChannel; +import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.types.pojo.Schema; + +public class ArrowStreamWriter implements AutoCloseable { + private final WriteChannel out; + private final Schema schema; + private boolean headerSent = false; + + /** + * Creates the stream writer. non-blocking. + * totalBatches can be set if the writer knows beforehand. Can be -1 if unknown. + */ + public ArrowStreamWriter(WritableByteChannel out, Schema schema, int totalBatches) { + this.out = new WriteChannel(out); + this.schema = schema; + } + + public ArrowStreamWriter(OutputStream out, Schema schema, int totalBatches) + throws IOException { + this(Channels.newChannel(out), schema, totalBatches); + } + + public long bytesWritten() { return out.getCurrentPosition(); } + + public void writeRecordBatch(ArrowRecordBatch batch) throws IOException { + // Send the header if we have not yet. + checkAndSendHeader(); + MessageSerializer.serialize(out, batch); + } + + @Override + public void close() throws IOException { + // The header might not have been sent if this is an empty stream. Send it even in + // this case so readers see a valid empty stream. 
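+    // (An empty stream is then just the serialized schema message and nothing else.)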
+    checkAndSendHeader();
+    out.close();
+  }
+
+  private void checkAndSendHeader() throws IOException {
+    if (!headerSent) {
+      MessageSerializer.serialize(out, schema);
+      headerSent = true;
+    }
+  }
+}
+
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java b/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java
new file mode 100644
index 0000000000000..22c46e2817b1e
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java
@@ -0,0 +1,216 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector.stream;
+
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.arrow.flatbuf.Buffer;
+import org.apache.arrow.flatbuf.FieldNode;
+import org.apache.arrow.flatbuf.Message;
+import org.apache.arrow.flatbuf.MessageHeader;
+import org.apache.arrow.flatbuf.MetadataVersion;
+import org.apache.arrow.flatbuf.RecordBatch;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.file.ReadChannel;
+import org.apache.arrow.vector.file.WriteChannel;
+import org.apache.arrow.vector.schema.ArrowBuffer;
+import org.apache.arrow.vector.schema.ArrowFieldNode;
+import org.apache.arrow.vector.schema.ArrowRecordBatch;
+import org.apache.arrow.vector.types.pojo.Schema;
+
+import com.google.flatbuffers.FlatBufferBuilder;
+
+import io.netty.buffer.ArrowBuf;
+
+/**
+ * Utility class for serializing Messages. Messages are all serialized in a similar way:
+ * 1. A 4 byte little endian prefix carrying the length of the FB Message header.
+ * 2. The FB serialized Message, which includes the body length and the type of the message.
+ * 3. The serialized message body.
+ *
+ * For schema messages, the body is simply the FB serialized Schema.
+ *
+ * For RecordBatch messages the body is:
+ * 1. A 4 byte little endian batch metadata length prefix.
+ * 2. The FB serialized RecordBatch metadata.
+ * 3. The serialized RecordBatch buffers.
+ */
+public class MessageSerializer {
+
+  public static int bytesToInt(byte[] bytes) {
+    return ((bytes[3] & 255) << 24) +
+           ((bytes[2] & 255) << 16) +
+           ((bytes[1] & 255) << 8) +
+           ((bytes[0] & 255) << 0);
+  }
+
+  /**
+   * Serialize a schema object.
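+   * On the wire this writes, in order: a 4 byte little endian size prefix, the
+   * FB Message header (headerType = Schema), and the FB serialized Schema body.
+   * Returns the total number of bytes written.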
+ */ + public static long serialize(WriteChannel out, Schema schema) throws IOException { + FlatBufferBuilder builder = new FlatBufferBuilder(); + builder.finish(schema.getSchema(builder)); + ByteBuffer serializedBody = builder.dataBuffer(); + ByteBuffer serializedHeader = + serializeHeader(MessageHeader.Schema, serializedBody.remaining()); + + long size = out.writeIntLittleEndian(serializedHeader.remaining()); + size += out.write(serializedHeader); + size += out.write(serializedBody); + return size; + } + + /** + * Deserializes a schema object. Format is from serialize(). + */ + public static Schema deserializeSchema(ReadChannel in) throws IOException { + Message header = deserializeHeader(in, MessageHeader.Schema); + if (header == null) { + throw new IOException("Unexpected end of input. Missing schema."); + } + + // Now read the schema. + ByteBuffer buffer = ByteBuffer.allocate((int)header.bodyLength()); + if (in.readFully(buffer) != header.bodyLength()) { + throw new IOException("Unexpected end of input trying to read schema."); + } + buffer.rewind(); + return Schema.deserialize(buffer); + } + + /** + * Serializes an ArrowRecordBatch. + */ + public static long serialize(WriteChannel out, ArrowRecordBatch batch) + throws IOException { + long start = out.getCurrentPosition(); + int bodyLength = batch.computeBodyLength(); + + ByteBuffer metadata = WriteChannel.serialize(batch); + ByteBuffer serializedHeader = + serializeHeader(MessageHeader.RecordBatch, bodyLength + metadata.remaining() + 4); + + // Write message header. + out.writeIntLittleEndian(serializedHeader.remaining()); + out.write(serializedHeader); + + // Write the metadata, with the 4 byte little endian prefix + out.writeIntLittleEndian(metadata.remaining()); + out.write(metadata); + + // Write batch header. + long offset = out.getCurrentPosition(); + List buffers = batch.getBuffers(); + List buffersLayout = batch.getBuffersLayout(); + + for (int i = 0; i < buffers.size(); i++) { + ArrowBuf buffer = buffers.get(i); + ArrowBuffer layout = buffersLayout.get(i); + long startPosition = offset + layout.getOffset(); + if (startPosition != out.getCurrentPosition()) { + out.writeZeros((int)(startPosition - out.getCurrentPosition())); + } + out.write(buffer); + if (out.getCurrentPosition() != startPosition + layout.getSize()) { + throw new IllegalStateException("wrong buffer size: " + out.getCurrentPosition() + + " != " + startPosition + layout.getSize()); + } + } + return out.getCurrentPosition() - start; + } + + /** + * Deserializes a RecordBatch + */ + public static ArrowRecordBatch deserializeRecordBatch(ReadChannel in, + BufferAllocator alloc) throws IOException { + Message header = deserializeHeader(in, MessageHeader.RecordBatch); + if (header == null) return null; + + int messageLen = (int)header.bodyLength(); + // Now read the buffer. This has the metadata followed by the data. + ArrowBuf buffer = alloc.buffer(messageLen); + if (in.readFully(buffer, messageLen) != messageLen) { + throw new IOException("Unexpected end of input trying to read batch."); + } + + // Read the metadata. It starts with the 4 byte size of the metadata. 
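+    // Assumed layout of 'buffer', mirroring the serialize() side:
+    //   [int32 metadata size][RecordBatch flatbuffer metadata][buffer bodies at aligned offsets]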
+    int metadataSize = buffer.readInt();
+    RecordBatch recordBatchFB =
+        RecordBatch.getRootAsRecordBatch(buffer.nioBuffer().asReadOnlyBuffer());
+
+    // Now read the body
+    final ArrowBuf body = buffer.slice(4 + metadataSize, messageLen - metadataSize - 4);
+    int nodesLength = recordBatchFB.nodesLength();
+    List<ArrowFieldNode> nodes = new ArrayList<>();
+    for (int i = 0; i < nodesLength; ++i) {
+      FieldNode node = recordBatchFB.nodes(i);
+      nodes.add(new ArrowFieldNode(node.length(), node.nullCount()));
+    }
+    List<ArrowBuf> buffers = new ArrayList<>();
+    for (int i = 0; i < recordBatchFB.buffersLength(); ++i) {
+      Buffer bufferFB = recordBatchFB.buffers(i);
+      ArrowBuf vectorBuffer = body.slice((int)bufferFB.offset(), (int)bufferFB.length());
+      buffers.add(vectorBuffer);
+    }
+    ArrowRecordBatch arrowRecordBatch =
+        new ArrowRecordBatch(recordBatchFB.length(), nodes, buffers);
+    buffer.release();
+    return arrowRecordBatch;
+  }
+
+  /**
+   * Serializes a message header.
+   */
+  private static ByteBuffer serializeHeader(byte headerType, int bodyLength) {
+    FlatBufferBuilder headerBuilder = new FlatBufferBuilder();
+    Message.startMessage(headerBuilder);
+    Message.addHeaderType(headerBuilder, headerType);
+    Message.addVersion(headerBuilder, MetadataVersion.V1);
+    Message.addBodyLength(headerBuilder, bodyLength);
+    headerBuilder.finish(Message.endMessage(headerBuilder));
+    return headerBuilder.dataBuffer();
+  }
+
+  private static Message deserializeHeader(ReadChannel in, byte headerType) throws IOException {
+    // Read the header size. There is an i32 little endian prefix.
+    ByteBuffer buffer = ByteBuffer.allocate(4);
+    if (in.readFully(buffer) != 4) {
+      return null;
+    }
+
+    int headerLength = bytesToInt(buffer.array());
+    buffer = ByteBuffer.allocate(headerLength);
+    if (in.readFully(buffer) != headerLength) {
+      throw new IOException(
+          "Unexpected end of stream trying to read header.");
+    }
+    buffer.rewind();
+
+    Message header = Message.getRootAsMessage(buffer);
+    if (header.headerType() != headerType) {
+      throw new IOException("Invalid message: expecting " + headerType +
+          ".
Message contained: " + header.headerType()); + } + return header; + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java index 5ca8ade7891ee..c33bd6e6e61b0 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java @@ -22,6 +22,7 @@ import static org.apache.arrow.vector.types.pojo.Field.convertField; import java.io.IOException; +import java.nio.ByteBuffer; import java.util.ArrayList; import java.util.Collections; import java.util.List; @@ -65,6 +66,10 @@ public static Schema fromJSON(String json) throws IOException { return reader.readValue(checkNotNull(json)); } + public static Schema deserialize(ByteBuffer buffer) { + return convertSchema(org.apache.arrow.flatbuf.Schema.getRootAsSchema(buffer)); + } + public static Schema convertSchema(org.apache.arrow.flatbuf.Schema schema) { ImmutableList.Builder childrenBuilder = ImmutableList.builder(); for (int i = 0; i < schema.fieldsLength(); i++) { diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java index 5fa18b3ca5339..bf635fb39f5b8 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java @@ -18,12 +18,16 @@ package org.apache.arrow.vector.file; import static org.apache.arrow.vector.TestVectorUnloadLoad.newVectorUnloader; +import static org.junit.Assert.assertTrue; +import java.io.ByteArrayInputStream; +import java.io.ByteArrayOutputStream; import java.io.File; import java.io.FileInputStream; import java.io.FileNotFoundException; import java.io.FileOutputStream; import java.io.IOException; +import java.io.OutputStream; import java.util.List; import org.apache.arrow.memory.BufferAllocator; @@ -35,6 +39,8 @@ import org.apache.arrow.vector.complex.NullableMapVector; import org.apache.arrow.vector.schema.ArrowBuffer; import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.stream.ArrowStreamReader; +import org.apache.arrow.vector.stream.ArrowStreamWriter; import org.apache.arrow.vector.types.pojo.Schema; import org.junit.Assert; import org.junit.Test; @@ -52,7 +58,7 @@ public void testWrite() throws IOException { BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); MapVector parent = new MapVector("parent", vectorAllocator, null)) { writeData(count, parent); - write(parent.getChild("root"), file); + write(parent.getChild("root"), file, new ByteArrayOutputStream()); } } @@ -66,13 +72,14 @@ public void testWriteComplex() throws IOException { writeComplexData(count, parent); FieldVector root = parent.getChild("root"); validateComplexContent(count, new VectorSchemaRoot(root)); - write(root, file); + write(root, file, new ByteArrayOutputStream()); } } @Test public void testWriteRead() throws IOException { File file = new File("target/mytest.arrow"); + ByteArrayOutputStream stream = new ByteArrayOutputStream(); int count = COUNT; // write @@ -80,7 +87,7 @@ public void testWriteRead() throws IOException { BufferAllocator originalVectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); MapVector parent = new MapVector("parent", originalVectorAllocator, null)) { writeData(count, parent); - 
write(parent.getChild("root"), file); + write(parent.getChild("root"), file, stream); } // read @@ -116,11 +123,40 @@ public void testWriteRead() throws IOException { } } } + + // Read from stream. + try ( + BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + ByteArrayInputStream input = new ByteArrayInputStream(stream.toByteArray()); + ArrowStreamReader arrowReader = new ArrowStreamReader(input, readerAllocator); + BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", vectorAllocator, null) + ) { + arrowReader.init(); + Schema schema = arrowReader.getSchema(); + LOGGER.debug("reading schema: " + schema); + + try (VectorSchemaRoot root = new VectorSchemaRoot(schema, vectorAllocator)) { + VectorLoader vectorLoader = new VectorLoader(root); + while (true) { + try (ArrowRecordBatch recordBatch = arrowReader.nextRecordBatch()) { + if (recordBatch == null) break; + List buffersLayout = recordBatch.getBuffersLayout(); + for (ArrowBuffer arrowBuffer : buffersLayout) { + Assert.assertEquals(0, arrowBuffer.getOffset() % 8); + } + vectorLoader.load(recordBatch); + } + } + validateContent(count, root); + } + } } @Test public void testWriteReadComplex() throws IOException { File file = new File("target/mytest_complex.arrow"); + ByteArrayOutputStream stream = new ByteArrayOutputStream(); int count = COUNT; // write @@ -128,7 +164,7 @@ public void testWriteReadComplex() throws IOException { BufferAllocator originalVectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); MapVector parent = new MapVector("parent", originalVectorAllocator, null)) { writeComplexData(count, parent); - write(parent.getChild("root"), file); + write(parent.getChild("root"), file, stream); } // read @@ -156,11 +192,36 @@ public void testWriteReadComplex() throws IOException { } } } + + // Read from stream. 
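+    // The bytes in 'stream' were produced by ArrowStreamWriter inside write()
+    // above, so reading them back should yield the same content as the file.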
+ try ( + BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + ByteArrayInputStream input = new ByteArrayInputStream(stream.toByteArray()); + ArrowStreamReader arrowReader = new ArrowStreamReader(input, readerAllocator); + BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", vectorAllocator, null) + ) { + arrowReader.init(); + Schema schema = arrowReader.getSchema(); + LOGGER.debug("reading schema: " + schema); + + try (VectorSchemaRoot root = new VectorSchemaRoot(schema, vectorAllocator)) { + VectorLoader vectorLoader = new VectorLoader(root); + while (true) { + try (ArrowRecordBatch recordBatch = arrowReader.nextRecordBatch()) { + if (recordBatch == null) break; + vectorLoader.load(recordBatch); + } + } + validateComplexContent(count, root); + } + } } @Test public void testWriteReadMultipleRBs() throws IOException { File file = new File("target/mytest_multiple.arrow"); + ByteArrayOutputStream stream = new ByteArrayOutputStream(); int[] counts = { 10, 5 }; // write @@ -172,10 +233,12 @@ public void testWriteReadMultipleRBs() throws IOException { VectorUnloader vectorUnloader0 = newVectorUnloader(parent.getChild("root")); Schema schema = vectorUnloader0.getSchema(); Assert.assertEquals(2, schema.getFields().size()); - try (ArrowWriter arrowWriter = new ArrowWriter(fileOutputStream.getChannel(), schema);) { + try (ArrowWriter arrowWriter = new ArrowWriter(fileOutputStream.getChannel(), schema); + ArrowStreamWriter streamWriter = new ArrowStreamWriter(stream, schema, 2)) { try (ArrowRecordBatch recordBatch = vectorUnloader0.getRecordBatch()) { Assert.assertEquals("RB #0", counts[0], recordBatch.getLength()); arrowWriter.writeRecordBatch(recordBatch); + streamWriter.writeRecordBatch(recordBatch); } parent.allocateNew(); writeData(counts[1], parent); // if we write the same data we don't catch that the metadata is stored in the wrong order. 
@@ -183,6 +246,7 @@ public void testWriteReadMultipleRBs() throws IOException { try (ArrowRecordBatch recordBatch = vectorUnloader1.getRecordBatch()) { Assert.assertEquals("RB #1", counts[1], recordBatch.getLength()); arrowWriter.writeRecordBatch(recordBatch); + streamWriter.writeRecordBatch(recordBatch); } } } @@ -222,11 +286,42 @@ public void testWriteReadMultipleRBs() throws IOException { } } } + + // read stream + try ( + BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + ByteArrayInputStream input = new ByteArrayInputStream(stream.toByteArray()); + ArrowStreamReader arrowReader = new ArrowStreamReader(input, readerAllocator); + BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", vectorAllocator, null) + ) { + arrowReader.init(); + Schema schema = arrowReader.getSchema(); + LOGGER.debug("reading schema: " + schema); + int i = 0; + try (VectorSchemaRoot root = new VectorSchemaRoot(schema, vectorAllocator);) { + VectorLoader vectorLoader = new VectorLoader(root); + for (int n = 0; n < 2; n++) { + try (ArrowRecordBatch recordBatch = arrowReader.nextRecordBatch()) { + assertTrue(recordBatch != null); + Assert.assertEquals("RB #" + i, counts[i], recordBatch.getLength()); + List buffersLayout = recordBatch.getBuffersLayout(); + for (ArrowBuffer arrowBuffer : buffersLayout) { + Assert.assertEquals(0, arrowBuffer.getOffset() % 8); + } + vectorLoader.load(recordBatch); + validateContent(counts[i], root); + } + ++i; + } + } + } } @Test public void testWriteReadUnion() throws IOException { File file = new File("target/mytest_write_union.arrow"); + ByteArrayOutputStream stream = new ByteArrayOutputStream(); int count = COUNT; try ( BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); @@ -238,9 +333,9 @@ public void testWriteReadUnion() throws IOException { validateUnionData(count, new VectorSchemaRoot(parent.getChild("root"))); - write(parent.getChild("root"), file); + write(parent.getChild("root"), file, stream); } - // read + // read try ( BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); FileInputStream fileInputStream = new FileInputStream(file); @@ -263,9 +358,37 @@ public void testWriteReadUnion() throws IOException { } } } + + // Read from stream. + try ( + BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + ByteArrayInputStream input = new ByteArrayInputStream(stream.toByteArray()); + ArrowStreamReader arrowReader = new ArrowStreamReader(input, readerAllocator); + BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", vectorAllocator, null) + ) { + arrowReader.init(); + Schema schema = arrowReader.getSchema(); + LOGGER.debug("reading schema: " + schema); + + try (VectorSchemaRoot root = new VectorSchemaRoot(schema, vectorAllocator)) { + VectorLoader vectorLoader = new VectorLoader(root); + while (true) { + try (ArrowRecordBatch recordBatch = arrowReader.nextRecordBatch()) { + if (recordBatch == null) break; + vectorLoader.load(recordBatch); + } + } + validateUnionData(count, root); + } + } } - private void write(FieldVector parent, File file) throws FileNotFoundException, IOException { + /** + * Writes the contents of parents to file. 
If outStream is non-null, also writes it + * to outStream in the streaming serialized format. + */ + private void write(FieldVector parent, File file, OutputStream outStream) throws FileNotFoundException, IOException { VectorUnloader vectorUnloader = newVectorUnloader(parent); Schema schema = vectorUnloader.getSchema(); LOGGER.debug("writing schema: " + schema); @@ -276,5 +399,15 @@ private void write(FieldVector parent, File file) throws FileNotFoundException, ) { arrowWriter.writeRecordBatch(recordBatch); } + + // Also try serializing to the stream writer. + if (outStream != null) { + try ( + ArrowStreamWriter arrowWriter = new ArrowStreamWriter(outStream, schema, -1); + ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch(); + ) { + arrowWriter.writeRecordBatch(recordBatch); + } + } } } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/stream/MessageSerializerTest.java b/java/vector/src/test/java/org/apache/arrow/vector/stream/MessageSerializerTest.java new file mode 100644 index 0000000000000..7b4de80ee03ea --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/stream/MessageSerializerTest.java @@ -0,0 +1,115 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector.stream; + +import static java.util.Arrays.asList; +import static org.junit.Assert.assertArrayEquals; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +import java.io.ByteArrayInputStream; +import java.io.ByteArrayOutputStream; +import java.io.IOException; +import java.nio.channels.Channels; +import java.util.Collections; +import java.util.List; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.file.ReadChannel; +import org.apache.arrow.vector.file.WriteChannel; +import org.apache.arrow.vector.schema.ArrowFieldNode; +import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.Schema; +import org.junit.Test; + +import io.netty.buffer.ArrowBuf; + +public class MessageSerializerTest { + + public static ArrowBuf buf(BufferAllocator alloc, byte[] bytes) { + ArrowBuf buffer = alloc.buffer(bytes.length); + buffer.writeBytes(bytes); + return buffer; + } + + public static byte[] array(ArrowBuf buf) { + byte[] bytes = new byte[buf.readableBytes()]; + buf.readBytes(bytes); + return bytes; + } + + @Test + public void testSchemaMessageSerialization() throws IOException { + Schema schema = testSchema(); + ByteArrayOutputStream out = new ByteArrayOutputStream(); + long size = MessageSerializer.serialize( + new WriteChannel(Channels.newChannel(out)), schema); + assertEquals(size, out.toByteArray().length); + + ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray()); + Schema deserialized = MessageSerializer.deserializeSchema( + new ReadChannel(Channels.newChannel(in))); + assertEquals(schema, deserialized); + assertEquals(1, deserialized.getFields().size()); + } + + @Test + public void testSerializeRecordBatch() throws IOException { + byte[] validity = new byte[] { (byte)255, 0}; + // second half is "undefined" + byte[] values = new byte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}; + + BufferAllocator alloc = new RootAllocator(Long.MAX_VALUE); + ArrowBuf validityb = buf(alloc, validity); + ArrowBuf valuesb = buf(alloc, values); + + ArrowRecordBatch batch = new ArrowRecordBatch( + 16, asList(new ArrowFieldNode(16, 8)), asList(validityb, valuesb)); + + ByteArrayOutputStream out = new ByteArrayOutputStream(); + MessageSerializer.serialize(new WriteChannel(Channels.newChannel(out)), batch); + + ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray()); + ArrowRecordBatch deserialized = MessageSerializer.deserializeRecordBatch( + new ReadChannel(Channels.newChannel(in)), alloc); + verifyBatch(deserialized, validity, values); + } + + public static Schema testSchema() { + return new Schema(asList(new Field( + "testField", true, new ArrowType.Int(8, true), Collections.emptyList()))); + } + + // Verifies batch contents matching test schema. 
+ public static void verifyBatch(ArrowRecordBatch batch, byte[] validity, byte[] values) { + assertTrue(batch != null); + List nodes = batch.getNodes(); + assertEquals(1, nodes.size()); + ArrowFieldNode node = nodes.get(0); + assertEquals(16, node.getLength()); + assertEquals(8, node.getNullCount()); + List buffers = batch.getBuffers(); + assertEquals(2, buffers.size()); + assertArrayEquals(validity, MessageSerializerTest.array(buffers.get(0))); + assertArrayEquals(values, MessageSerializerTest.array(buffers.get(1))); + } + +} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStream.java b/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStream.java new file mode 100644 index 0000000000000..ba1cdaeeb2262 --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStream.java @@ -0,0 +1,96 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.stream; + +import static java.util.Arrays.asList; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +import java.io.ByteArrayInputStream; +import java.io.ByteArrayOutputStream; +import java.io.IOException; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.file.BaseFileTest; +import org.apache.arrow.vector.schema.ArrowFieldNode; +import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.types.pojo.Schema; +import org.junit.Test; + +import io.netty.buffer.ArrowBuf; + +public class TestArrowStream extends BaseFileTest { + @Test + public void testEmptyStream() throws IOException { + Schema schema = MessageSerializerTest.testSchema(); + + // Write the stream. + ByteArrayOutputStream out = new ByteArrayOutputStream(); + try (ArrowStreamWriter writer = new ArrowStreamWriter(out, schema, -1)) { + } + + ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray()); + try (ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) { + reader.init(); + assertEquals(schema, reader.getSchema()); + // Empty should return null. Can be called repeatedly. 
+ assertTrue(reader.nextRecordBatch() == null); + assertTrue(reader.nextRecordBatch() == null); + } + } + + @Test + public void testReadWrite() throws IOException { + Schema schema = MessageSerializerTest.testSchema(); + byte[] validity = new byte[] { (byte)255, 0}; + // second half is "undefined" + byte[] values = new byte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}; + + int numBatches = 5; + BufferAllocator alloc = new RootAllocator(Long.MAX_VALUE); + ByteArrayOutputStream out = new ByteArrayOutputStream(); + long bytesWritten = 0; + try (ArrowStreamWriter writer = new ArrowStreamWriter(out, schema, numBatches)) { + ArrowBuf validityb = MessageSerializerTest.buf(alloc, validity); + ArrowBuf valuesb = MessageSerializerTest.buf(alloc, values); + for (int i = 0; i < numBatches; i++) { + writer.writeRecordBatch(new ArrowRecordBatch( + 16, asList(new ArrowFieldNode(16, 8)), asList(validityb, valuesb))); + } + bytesWritten = writer.bytesWritten(); + } + + ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray()); + try (ArrowStreamReader reader = new ArrowStreamReader(in, alloc)) { + reader.init(); + Schema readSchema = reader.getSchema(); + for (int i = 0; i < numBatches; i++) { + assertEquals(schema, readSchema); + assertTrue( + readSchema.getFields().get(0).getTypeLayout().getVectorTypes().toString(), + readSchema.getFields().get(0).getTypeLayout().getVectors().size() > 0); + ArrowRecordBatch recordBatch = reader.nextRecordBatch(); + MessageSerializerTest.verifyBatch(recordBatch, validity, values); + assertTrue(recordBatch != null); + } + assertTrue(reader.nextRecordBatch() == null); + assertEquals(bytesWritten, reader.bytesRead()); + } + } +} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStreamPipe.java b/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStreamPipe.java new file mode 100644 index 0000000000000..e187fa535cada --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStreamPipe.java @@ -0,0 +1,129 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector.stream; + +import static java.util.Arrays.asList; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +import java.io.IOException; +import java.nio.channels.Pipe; +import java.nio.channels.ReadableByteChannel; +import java.nio.channels.WritableByteChannel; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.schema.ArrowFieldNode; +import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.types.pojo.Schema; +import org.junit.Test; + +import io.netty.buffer.ArrowBuf; + +public class TestArrowStreamPipe { + Schema schema = MessageSerializerTest.testSchema(); + // second half is "undefined" + byte[] values = new byte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}; + + private final class WriterThread extends Thread { + private final int numBatches; + private final ArrowStreamWriter writer; + + public WriterThread(int numBatches, WritableByteChannel sinkChannel) + throws IOException { + this.numBatches = numBatches; + writer = new ArrowStreamWriter(sinkChannel, schema, -1); + } + + @Override + public void run() { + BufferAllocator alloc = new RootAllocator(Long.MAX_VALUE); + try { + ArrowBuf valuesb = MessageSerializerTest.buf(alloc, values); + for (int i = 0; i < numBatches; i++) { + // Send a changing byte id first. + byte[] validity = new byte[] { (byte)i, 0}; + ArrowBuf validityb = MessageSerializerTest.buf(alloc, validity); + writer.writeRecordBatch(new ArrowRecordBatch( + 16, asList(new ArrowFieldNode(16, 8)), asList(validityb, valuesb))); + } + writer.close(); + } catch (IOException e) { + e.printStackTrace(); + assertTrue(false); + } + } + + public long bytesWritten() { return writer.bytesWritten(); } + } + + private final class ReaderThread extends Thread { + private int batchesRead = 0; + private final ArrowStreamReader reader; + private final BufferAllocator alloc = new RootAllocator(Long.MAX_VALUE); + + public ReaderThread(ReadableByteChannel sourceChannel) + throws IOException { + reader = new ArrowStreamReader(sourceChannel, alloc); + } + + @Override + public void run() { + try { + reader.init(); + assertEquals(schema, reader.getSchema()); + assertTrue( + reader.getSchema().getFields().get(0).getTypeLayout().getVectorTypes().toString(), + reader.getSchema().getFields().get(0).getTypeLayout().getVectors().size() > 0); + + // Read all the batches. Each batch contains an incrementing id and then some + // constant data. Verify both. + while (true) { + ArrowRecordBatch batch = reader.nextRecordBatch(); + if (batch == null) break; + byte[] validity = new byte[] { (byte)batchesRead, 0}; + MessageSerializerTest.verifyBatch(batch, validity, values); + batchesRead++; + } + } catch (IOException e) { + e.printStackTrace(); + assertTrue(false); + } + } + + public int getBatchesRead() { return batchesRead; } + public long bytesRead() { return reader.bytesRead(); } + } + + // Starts up a producer and consumer thread to read/write batches. 
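+  // Each batch encodes its index in the first validity byte (see WriterThread),
+  // so the reader can verify that no batch was dropped or reordered in transit.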
+ @Test + public void pipeTest() throws IOException, InterruptedException { + int NUM_BATCHES = 1000; + Pipe pipe = Pipe.open(); + WriterThread writer = new WriterThread(NUM_BATCHES, pipe.sink()); + ReaderThread reader = new ReaderThread(pipe.source()); + + writer.start(); + reader.start(); + reader.join(); + writer.join(); + + assertEquals(NUM_BATCHES, reader.getBatchesRead()); + assertEquals(writer.bytesWritten(), reader.bytesRead()); + } +} From 512bc160ebaf8d6775ea67994262709e10a72795 Mon Sep 17 00:00:00 2001 From: Jingyuan Wang Date: Fri, 20 Jan 2017 12:43:20 -0500 Subject: [PATCH 0283/1644] ARROW-386: [Java] Respect case of struct / map field names Changes include: - Remove all toLowerCase() calls on field names in MapWriters.java template file, so that the writers can respect case of the field names. - Use lower-case keys for internalMap in UnionVector instead of camel-case (e.g. bigInt -> bigint). p.s. I don't know what is the original purpose of using camel case here. It did not conflict because all field names are converted to lower cases in the past. - Add a simple test case of MapWriter with mixed-case field names. Author: Jingyuan Wang Closes #261 from alphalfalfa/arrow-386 and squashes the following commits: cd08145 [Jingyuan Wang] Remove unnecessary handleCase() call 7b28bfc [Jingyuan Wang] Pass caseSensitive Attribute down to nested MapWriters 2fe7bcf [Jingyuan Wang] Separate MapWriters with CaseSensitiveMapWriters d269e21 [Jingyuan Wang] Configure case sensitivity when constructing ComplexWriterImpl cba60d1 [Jingyuan Wang] Add option to MapWriters to configure the case sensitivity (defaulted as case-insensitive) 51da2a1 [Jingyuan Wang] Arrow-386: [Java] Respect case of struct / map field names --- .../templates/CaseSensitiveMapWriters.java | 54 +++++++++++++ .../main/codegen/templates/MapWriters.java | 35 +++++---- .../codegen/templates/UnionListWriter.java | 6 +- .../main/codegen/templates/UnionVector.java | 3 +- .../main/codegen/templates/UnionWriter.java | 12 ++- .../vector/complex/AbstractMapVector.java | 6 +- .../complex/impl/ComplexWriterImpl.java | 17 +++-- .../impl/NullableMapWriterFactory.java | 42 ++++++++++ .../vector/complex/impl/PromotableWriter.java | 31 +++++++- .../complex/writer/TestComplexWriter.java | 76 +++++++++++++++++++ 10 files changed, 253 insertions(+), 29 deletions(-) create mode 100644 java/vector/src/main/codegen/templates/CaseSensitiveMapWriters.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/impl/NullableMapWriterFactory.java diff --git a/java/vector/src/main/codegen/templates/CaseSensitiveMapWriters.java b/java/vector/src/main/codegen/templates/CaseSensitiveMapWriters.java new file mode 100644 index 0000000000000..5357f9b8a9d3a --- /dev/null +++ b/java/vector/src/main/codegen/templates/CaseSensitiveMapWriters.java @@ -0,0 +1,54 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +<@pp.dropOutputFile /> +<#list ["Nullable", "Single"] as mode> +<@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/${mode}CaseSensitiveMapWriter.java" /> +<#assign index = "idx()"> +<#if mode == "Single"> +<#assign containerClass = "MapVector" /> +<#else> +<#assign containerClass = "NullableMapVector" /> + + +<#include "/@includes/license.ftl" /> + +package org.apache.arrow.vector.complex.impl; + +<#include "/@includes/vv_imports.ftl" /> +/* + * This class is generated using FreeMarker and the ${.template_name} template. + */ +@SuppressWarnings("unused") +public class ${mode}CaseSensitiveMapWriter extends ${mode}MapWriter { + public ${mode}CaseSensitiveMapWriter(${containerClass} container) { + super(container); + } + + @Override + protected String handleCase(final String input){ + return input; + } + + @Override + protected NullableMapWriterFactory getNullableMapWriterFactory() { + return NullableMapWriterFactory.getNullableCaseSensitiveMapWriterFactoryInstance(); + } + +} + diff --git a/java/vector/src/main/codegen/templates/MapWriters.java b/java/vector/src/main/codegen/templates/MapWriters.java index f41b60072c873..4af6eee91b6de 100644 --- a/java/vector/src/main/codegen/templates/MapWriters.java +++ b/java/vector/src/main/codegen/templates/MapWriters.java @@ -48,7 +48,6 @@ public class ${mode}MapWriter extends AbstractFieldWriter { protected final ${containerClass} container; private final Map fields = Maps.newHashMap(); - public ${mode}MapWriter(${containerClass} container) { <#if mode == "Single"> if (container instanceof NullableMapVector) { @@ -65,8 +64,8 @@ public class ${mode}MapWriter extends AbstractFieldWriter { list(child.getName()); break; case UNION: - UnionWriter writer = new UnionWriter(container.addOrGet(child.getName(), MinorType.UNION, UnionVector.class)); - fields.put(child.getName().toLowerCase(), writer); + UnionWriter writer = new UnionWriter(container.addOrGet(child.getName(), MinorType.UNION, UnionVector.class), getNullableMapWriterFactory()); + fields.put(handleCase(child.getName()), writer); break; <#list vv.types as type><#list type.minor as minor> <#assign lowerName = minor.class?uncap_first /> @@ -85,6 +84,14 @@ public class ${mode}MapWriter extends AbstractFieldWriter { } } + protected String handleCase(final String input) { + return input.toLowerCase(); + } + + protected NullableMapWriterFactory getNullableMapWriterFactory() { + return NullableMapWriterFactory.getNullableMapWriterFactoryInstance(); + } + @Override public int getValueCapacity() { return container.getValueCapacity(); @@ -102,16 +109,17 @@ public Field getField() { @Override public MapWriter map(String name) { - FieldWriter writer = fields.get(name.toLowerCase()); + String finalName = handleCase(name); + FieldWriter writer = fields.get(finalName); if(writer == null){ int vectorCount=container.size(); NullableMapVector vector = container.addOrGet(name, MinorType.MAP, NullableMapVector.class); - writer = new PromotableWriter(vector, container); + writer = new PromotableWriter(vector, container, getNullableMapWriterFactory()); if(vectorCount != 
container.size()) { writer.allocate(); } writer.setPosition(idx()); - fields.put(name.toLowerCase(), writer); + fields.put(finalName, writer); } else { if (writer instanceof PromotableWriter) { // ensure writers are initialized @@ -145,15 +153,16 @@ public void clear() { @Override public ListWriter list(String name) { - FieldWriter writer = fields.get(name.toLowerCase()); + String finalName = handleCase(name); + FieldWriter writer = fields.get(finalName); int vectorCount = container.size(); if(writer == null) { - writer = new PromotableWriter(container.addOrGet(name, MinorType.LIST, ListVector.class), container); + writer = new PromotableWriter(container.addOrGet(name, MinorType.LIST, ListVector.class), container, getNullableMapWriterFactory()); if (container.size() > vectorCount) { writer.allocate(); } writer.setPosition(idx()); - fields.put(name.toLowerCase(), writer); + fields.put(finalName, writer); } else { if (writer instanceof PromotableWriter) { // ensure writers are initialized @@ -199,7 +208,7 @@ public void end() { <#if minor.class?starts_with("Decimal") > public ${minor.class}Writer ${lowerName}(String name) { // returns existing writer - final FieldWriter writer = fields.get(name.toLowerCase()); + final FieldWriter writer = fields.get(handleCase(name)); assert writer != null; return writer; } @@ -209,18 +218,18 @@ public void end() { @Override public ${minor.class}Writer ${lowerName}(String name) { - FieldWriter writer = fields.get(name.toLowerCase()); + FieldWriter writer = fields.get(handleCase(name)); if(writer == null) { ValueVector vector; ValueVector currentVector = container.getChild(name); ${vectName}Vector v = container.addOrGet(name, MinorType.${upperName}, ${vectName}Vector.class<#if minor.class == "Decimal"> , new int[] {precision, scale}); - writer = new PromotableWriter(v, container); + writer = new PromotableWriter(v, container, getNullableMapWriterFactory()); vector = v; if (currentVector == null || currentVector != vector) { vector.allocateNewSafe(); } writer.setPosition(idx()); - fields.put(name.toLowerCase(), writer); + fields.put(handleCase(name), writer); } else { if (writer instanceof PromotableWriter) { // ensure writers are initialized diff --git a/java/vector/src/main/codegen/templates/UnionListWriter.java b/java/vector/src/main/codegen/templates/UnionListWriter.java index bb39fe8d29426..d980830923b31 100644 --- a/java/vector/src/main/codegen/templates/UnionListWriter.java +++ b/java/vector/src/main/codegen/templates/UnionListWriter.java @@ -43,8 +43,12 @@ public class UnionListWriter extends AbstractFieldWriter { private int lastIndex = 0; public UnionListWriter(ListVector vector) { + this(vector, NullableMapWriterFactory.getNullableMapWriterFactoryInstance()); + } + + public UnionListWriter(ListVector vector, NullableMapWriterFactory nullableMapWriterFactory) { this.vector = vector; - this.writer = new PromotableWriter(vector.getDataVector(), vector); + this.writer = new PromotableWriter(vector.getDataVector(), vector, nullableMapWriterFactory); this.offsets = vector.getOffsetVector(); } diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java index 18acdf4a551b4..1a6908df2c40d 100644 --- a/java/vector/src/main/codegen/templates/UnionVector.java +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -136,6 +136,7 @@ public NullableMapVector getMap() { <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> <#assign fields = 
minor.fields!type.fields /> <#assign uncappedName = name?uncap_first/> + <#assign lowerCaseName = name?lower_case/> <#if !minor.class?starts_with("Decimal")> private Nullable${name}Vector ${uncappedName}Vector; @@ -143,7 +144,7 @@ public NullableMapVector getMap() { public Nullable${name}Vector get${name}Vector() { if (${uncappedName}Vector == null) { int vectorCount = internalMap.size(); - ${uncappedName}Vector = internalMap.addOrGet("${uncappedName}", MinorType.${name?upper_case}, Nullable${name}Vector.class); + ${uncappedName}Vector = internalMap.addOrGet("${lowerCaseName}", MinorType.${name?upper_case}, Nullable${name}Vector.class); if (internalMap.size() > vectorCount) { ${uncappedName}Vector.allocateNew(); if (callBack != null) { diff --git a/java/vector/src/main/codegen/templates/UnionWriter.java b/java/vector/src/main/codegen/templates/UnionWriter.java index efb66f168f5f8..880f537c0296f 100644 --- a/java/vector/src/main/codegen/templates/UnionWriter.java +++ b/java/vector/src/main/codegen/templates/UnionWriter.java @@ -16,6 +16,8 @@ * limitations under the License. */ +import org.apache.arrow.vector.complex.impl.NullableMapWriterFactory; + <@pp.dropOutputFile /> <@pp.changeOutputFile name="/org/apache/arrow/vector/complex/impl/UnionWriter.java" /> @@ -38,9 +40,15 @@ public class UnionWriter extends AbstractFieldWriter implements FieldWriter { private MapWriter mapWriter; private UnionListWriter listWriter; private List writers = Lists.newArrayList(); + private final NullableMapWriterFactory nullableMapWriterFactory; public UnionWriter(UnionVector vector) { + this(vector, NullableMapWriterFactory.getNullableMapWriterFactoryInstance()); + } + + public UnionWriter(UnionVector vector, NullableMapWriterFactory nullableMapWriterFactory) { data = vector; + this.nullableMapWriterFactory = nullableMapWriterFactory; } @Override @@ -76,7 +84,7 @@ public void endList() { private MapWriter getMapWriter() { if (mapWriter == null) { - mapWriter = new NullableMapWriter(data.getMap()); + mapWriter = nullableMapWriterFactory.build(data.getMap()); mapWriter.setPosition(idx()); writers.add(mapWriter); } @@ -90,7 +98,7 @@ public MapWriter asMap() { private ListWriter getListWriter() { if (listWriter == null) { - listWriter = new UnionListWriter(data.getList()); + listWriter = new UnionListWriter(data.getList(), nullableMapWriterFactory); listWriter.setPosition(idx()); writers.add(listWriter); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java index 23b4997f4f586..f030d166ade8d 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java @@ -155,7 +155,7 @@ public ValueVector getChildByOrdinal(int id) { */ @Override public T getChild(String name, Class clazz) { - final ValueVector v = vectors.get(name.toLowerCase()); + final ValueVector v = vectors.get(name); if (v == null) { return null; } @@ -191,7 +191,7 @@ protected void putChild(String name, FieldVector vector) { */ protected void putVector(String name, FieldVector vector) { final ValueVector old = vectors.put( - Preconditions.checkNotNull(name, "field name cannot be null").toLowerCase(), + Preconditions.checkNotNull(name, "field name cannot be null"), Preconditions.checkNotNull(vector, "vector cannot be null") ); if (old != null && old != vector) { @@ -254,7 +254,7 @@ public List getPrimitiveVectors() { 
*/ @Override public VectorWithOrdinal getChildVectorWithOrdinal(String name) { - final int ordinal = vectors.getOrdinal(name.toLowerCase()); + final int ordinal = vectors.getOrdinal(name); if (ordinal < 0) { return null; } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java index 761b1b43c08aa..dbdd2050d13ed 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java @@ -37,13 +37,20 @@ public class ComplexWriterImpl extends AbstractFieldWriter implements ComplexWri Mode mode = Mode.INIT; private final String name; private final boolean unionEnabled; + private final NullableMapWriterFactory nullableMapWriterFactory; private enum Mode { INIT, MAP, LIST }; - public ComplexWriterImpl(String name, MapVector container, boolean unionEnabled){ + public ComplexWriterImpl(String name, MapVector container, boolean unionEnabled, boolean caseSensitive){ this.name = name; this.container = container; this.unionEnabled = unionEnabled; + nullableMapWriterFactory = caseSensitive? NullableMapWriterFactory.getNullableCaseSensitiveMapWriterFactoryInstance() : + NullableMapWriterFactory.getNullableMapWriterFactoryInstance(); + } + + public ComplexWriterImpl(String name, MapVector container, boolean unionEnabled) { + this(name, container, unionEnabled, false); } public ComplexWriterImpl(String name, MapVector container){ @@ -122,8 +129,7 @@ public MapWriter directMap(){ switch(mode){ case INIT: - NullableMapVector map = (NullableMapVector) container; - mapRoot = new NullableMapWriter(map); + mapRoot = nullableMapWriterFactory.build((NullableMapVector) container); mapRoot.setPosition(idx()); mode = Mode.MAP; break; @@ -144,7 +150,7 @@ public MapWriter rootAsMap() { case INIT: NullableMapVector map = container.addOrGet(name, MinorType.MAP, NullableMapVector.class); - mapRoot = new NullableMapWriter(map); + mapRoot = nullableMapWriterFactory.build(map); mapRoot.setPosition(idx()); mode = Mode.MAP; break; @@ -159,7 +165,6 @@ public MapWriter rootAsMap() { return mapRoot; } - @Override public void allocate() { if(mapRoot != null) { @@ -179,7 +184,7 @@ public ListWriter rootAsList() { if (container.size() > vectorCount) { listVector.allocateNew(); } - listRoot = new UnionListWriter(listVector); + listRoot = new UnionListWriter(listVector, nullableMapWriterFactory); listRoot.setPosition(idx()); mode = Mode.LIST; break; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/NullableMapWriterFactory.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/NullableMapWriterFactory.java new file mode 100644 index 0000000000000..d932cfb3e1287 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/NullableMapWriterFactory.java @@ -0,0 +1,42 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + *
+ * http://www.apache.org/licenses/LICENSE-2.0 + *
+ * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.complex.impl; + +import org.apache.arrow.vector.complex.NullableMapVector; + +public class NullableMapWriterFactory { + private final boolean caseSensitive; + private static final NullableMapWriterFactory nullableMapWriterFactory = new NullableMapWriterFactory(false); + private static final NullableMapWriterFactory nullableCaseSensitiveWriterFactory = new NullableMapWriterFactory(true); + + public NullableMapWriterFactory(boolean caseSensitive) { + this.caseSensitive = caseSensitive; + } + + public NullableMapWriter build(NullableMapVector container) { + return this.caseSensitive? new NullableCaseSensitiveMapWriter(container) : new NullableMapWriter(container); + } + + public static NullableMapWriterFactory getNullableMapWriterFactoryInstance() { + return nullableMapWriterFactory; + } + + public static NullableMapWriterFactory getNullableCaseSensitiveMapWriterFactoryInstance() { + return nullableCaseSensitiveWriterFactory; + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java index 94ff82c04bd18..1f7253bca93c8 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java @@ -22,6 +22,7 @@ import org.apache.arrow.vector.ZeroVector; import org.apache.arrow.vector.complex.AbstractMapVector; import org.apache.arrow.vector.complex.ListVector; +import org.apache.arrow.vector.complex.NullableMapVector; import org.apache.arrow.vector.complex.UnionVector; import org.apache.arrow.vector.complex.writer.FieldWriter; import org.apache.arrow.vector.types.Types.MinorType; @@ -38,6 +39,7 @@ public class PromotableWriter extends AbstractPromotableFieldWriter { private final AbstractMapVector parentContainer; private final ListVector listVector; + private final NullableMapWriterFactory nullableMapWriterFactory; private int position; private enum State { @@ -51,14 +53,24 @@ private enum State { private FieldWriter writer; public PromotableWriter(ValueVector v, AbstractMapVector parentContainer) { + this(v, parentContainer, NullableMapWriterFactory.getNullableMapWriterFactoryInstance()); + } + + public PromotableWriter(ValueVector v, AbstractMapVector parentContainer, NullableMapWriterFactory nullableMapWriterFactory) { this.parentContainer = parentContainer; this.listVector = null; + this.nullableMapWriterFactory = nullableMapWriterFactory; init(v); } public PromotableWriter(ValueVector v, ListVector listVector) { + this(v, listVector, NullableMapWriterFactory.getNullableMapWriterFactoryInstance()); + } + + public PromotableWriter(ValueVector v, ListVector listVector, NullableMapWriterFactory nullableMapWriterFactory) { this.listVector = listVector; this.parentContainer = null; + this.nullableMapWriterFactory = nullableMapWriterFactory; init(v); } @@ -66,7 +78,7 @@ private void init(ValueVector v) { if (v instanceof UnionVector) { state = State.UNION; unionVector = (UnionVector) v; - writer = new UnionWriter(unionVector); + writer = new UnionWriter(unionVector, nullableMapWriterFactory); } 
else if (v instanceof ZeroVector) { state = State.UNTYPED; } else { @@ -78,7 +90,20 @@ private void setWriter(ValueVector v) { state = State.SINGLE; vector = v; type = v.getMinorType(); - writer = type.getNewFieldWriter(vector); + switch (type) { + case MAP: + writer = nullableMapWriterFactory.build((NullableMapVector) vector); + break; + case LIST: + writer = new UnionListWriter((ListVector) vector, nullableMapWriterFactory); + break; + case UNION: + writer = new UnionWriter((UnionVector) vector, nullableMapWriterFactory); + break; + default: + writer = type.getNewFieldWriter(vector); + break; + } } @Override @@ -131,7 +156,7 @@ private FieldWriter promoteToUnion() { unionVector = listVector.promoteToUnion(); } unionVector.addVector((FieldVector)tp.getTo()); - writer = new UnionWriter(unionVector); + writer = new UnionWriter(unionVector, nullableMapWriterFactory); writer.setPosition(idx()); for (int i = 0; i <= idx(); i++) { unionVector.getMutator().setType(i, vector.getMinorType()); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java index caa438aff4761..2c0c85328bdfb 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java @@ -23,7 +23,9 @@ import static org.junit.Assert.assertNull; import static org.junit.Assert.assertTrue; +import java.util.HashSet; import java.util.List; +import java.util.Set; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; @@ -485,4 +487,78 @@ public void promotableWriterSchema() { Assert.assertTrue(intType.getIsSigned()); Assert.assertEquals(ArrowTypeID.Utf8, field.getChildren().get(1).getType().getTypeID()); } + + private Set getFieldNames(List fields) { + Set fieldNames = new HashSet<>(); + for (Field field: fields) { + fieldNames.add(field.getName()); + if (!field.getChildren().isEmpty()) { + for (String name: getFieldNames(field.getChildren())) { + fieldNames.add(field.getName() + "::" + name); + } + } + } + return fieldNames; + } + + @Test + public void mapWriterMixedCaseFieldNames() { + // test case-sensitive MapWriter + MapVector parent = new MapVector("parent", allocator, null); + ComplexWriter writer = new ComplexWriterImpl("rootCaseSensitive", parent, false, true); + MapWriter rootWriterCaseSensitive = writer.rootAsMap(); + rootWriterCaseSensitive.bigInt("int_field"); + rootWriterCaseSensitive.bigInt("Int_Field"); + rootWriterCaseSensitive.float4("float_field"); + rootWriterCaseSensitive.float4("Float_Field"); + MapWriter mapFieldWriterCaseSensitive = rootWriterCaseSensitive.map("map_field"); + mapFieldWriterCaseSensitive.varChar("char_field"); + mapFieldWriterCaseSensitive.varChar("Char_Field"); + ListWriter listFieldWriterCaseSensitive = rootWriterCaseSensitive.list("list_field"); + MapWriter listMapFieldWriterCaseSensitive = listFieldWriterCaseSensitive.map(); + listMapFieldWriterCaseSensitive.bit("bit_field"); + listMapFieldWriterCaseSensitive.bit("Bit_Field"); + + List fieldsCaseSensitive = parent.getField().getChildren().get(0).getChildren(); + Set fieldNamesCaseSensitive = getFieldNames(fieldsCaseSensitive); + Assert.assertEquals(11, fieldNamesCaseSensitive.size()); + Assert.assertTrue(fieldNamesCaseSensitive.contains("int_field")); + Assert.assertTrue(fieldNamesCaseSensitive.contains("Int_Field")); + 
Assert.assertTrue(fieldNamesCaseSensitive.contains("float_field")); + Assert.assertTrue(fieldNamesCaseSensitive.contains("Float_Field")); + Assert.assertTrue(fieldNamesCaseSensitive.contains("map_field")); + Assert.assertTrue(fieldNamesCaseSensitive.contains("map_field::char_field")); + Assert.assertTrue(fieldNamesCaseSensitive.contains("map_field::Char_Field")); + Assert.assertTrue(fieldNamesCaseSensitive.contains("list_field")); + Assert.assertTrue(fieldNamesCaseSensitive.contains("list_field::$data$")); + Assert.assertTrue(fieldNamesCaseSensitive.contains("list_field::$data$::bit_field")); + Assert.assertTrue(fieldNamesCaseSensitive.contains("list_field::$data$::Bit_Field")); + + // test case-insensitive MapWriter + ComplexWriter writerCaseInsensitive = new ComplexWriterImpl("rootCaseInsensitive", parent, false, false); + MapWriter rootWriterCaseInsensitive = writerCaseInsensitive.rootAsMap(); + + rootWriterCaseInsensitive.bigInt("int_field"); + rootWriterCaseInsensitive.bigInt("Int_Field"); + rootWriterCaseInsensitive.float4("float_field"); + rootWriterCaseInsensitive.float4("Float_Field"); + MapWriter mapFieldWriterCaseInsensitive = rootWriterCaseInsensitive.map("map_field"); + mapFieldWriterCaseInsensitive.varChar("char_field"); + mapFieldWriterCaseInsensitive.varChar("Char_Field"); + ListWriter listFieldWriterCaseInsensitive = rootWriterCaseInsensitive.list("list_field"); + MapWriter listMapFieldWriterCaseInsensitive = listFieldWriterCaseInsensitive.map(); + listMapFieldWriterCaseInsensitive.bit("bit_field"); + listMapFieldWriterCaseInsensitive.bit("Bit_Field"); + + List fieldsCaseInsensitive = parent.getField().getChildren().get(1).getChildren(); + Set fieldNamesCaseInsensitive = getFieldNames(fieldsCaseInsensitive); + Assert.assertEquals(7, fieldNamesCaseInsensitive.size()); + Assert.assertTrue(fieldNamesCaseInsensitive.contains("int_field")); + Assert.assertTrue(fieldNamesCaseInsensitive.contains("float_field")); + Assert.assertTrue(fieldNamesCaseInsensitive.contains("map_field")); + Assert.assertTrue(fieldNamesCaseInsensitive.contains("map_field::char_field")); + Assert.assertTrue(fieldNamesCaseSensitive.contains("list_field")); + Assert.assertTrue(fieldNamesCaseSensitive.contains("list_field::$data$")); + Assert.assertTrue(fieldNamesCaseSensitive.contains("list_field::$data$::bit_field")); + } } \ No newline at end of file From 8ca7033fcd3fcf377cb7924eae9be45b8f6ebd5d Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 20 Jan 2017 17:56:23 -0500 Subject: [PATCH 0284/1644] ARROW-499: Update file serialization to use the streaming serialization format. Author: Wes McKinney Author: Nong Li Closes #292 from nongli/file and squashes the following commits: 18890a9 [Wes McKinney] Message fixes. Fix Java test suite. Integration tests pass f187539 [Nong Li] Merge pull request #1 from wesm/file-change-cpp-impl e3af434 [Wes McKinney] Remove unused variable 664d5be [Wes McKinney] Fixes, stream tests pass again ba8db91 [Wes McKinney] Redo MessageSerializer with unions. Still has bugs 21854cc [Wes McKinney] Restore Block.bodyLength to long 7c6f7ef [Nong Li] Update to restore Block behavior 27b3909 [Nong Li] [ARROW-499]: [Java] Update file serialization to use the streaming serialization format. 
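The framing this patch standardizes on (see the MessageSerializer changes below) is: a 4-byte little-endian metadata length, the flatbuffer-encoded Message, padding so that prefix plus metadata end on an 8-byte boundary, and then the body buffers. A minimal, self-contained sketch of the alignment arithmetic, with illustrative names that are not part of the patch:

public class MessageFramingSketch {

  // Mirrors the padding rule in MessageSerializer.serialize(): the 4-byte
  // length prefix plus the metadata must end on an 8-byte boundary so the
  // body buffers that follow start aligned.
  static int paddedMetadataLength(long startOffset, int rawMetadataLength) {
    int metadataLength = rawMetadataLength;
    long remainder = (startOffset + metadataLength + 4) % 8;
    if (remainder != 0) {
      metadataLength += (int) (8 - remainder);
    }
    return metadataLength;
  }

  public static void main(String[] args) {
    // A 10-byte flatbuffer at offset 0: prefix + metadata would end at byte
    // 14, so 2 padding bytes are added and the body starts at byte 16.
    System.out.println(paddedMetadataLength(0, 10)); // 12
    // Already aligned: a 12-byte flatbuffer at offset 0 needs no padding.
    System.out.println(paddedMetadataLength(0, 12)); // 12
  }
}

The reader side relies on the same invariant when it slices metadata and body out of one combined buffer. Note also that the metaDataLength recorded in each Block accounts for the 4-byte size prefix: serialize() returns metadataLength + 4.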
--- cpp/src/arrow/ipc/adapter.cc | 11 +- cpp/src/arrow/ipc/metadata-internal.cc | 21 +-- format/File.fbs | 5 +- integration/integration_test.py | 2 +- .../apache/arrow/vector/file/ArrowFooter.java | 5 +- .../apache/arrow/vector/file/ArrowReader.java | 64 ++----- .../apache/arrow/vector/file/ArrowWriter.java | 43 +---- .../apache/arrow/vector/file/ReadChannel.java | 11 +- .../vector/stream/MessageSerializer.java | 169 +++++++++++------- .../arrow/vector/file/TestArrowFile.java | 4 - .../arrow/vector/file/TestArrowFooter.java | 8 + .../vector/file/TestArrowReaderWriter.java | 16 ++ 12 files changed, 174 insertions(+), 185 deletions(-) diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index 2b5ef11f861af..7b4d18c267d43 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -129,13 +129,12 @@ class RecordBatchWriter : public ArrayVisitor { num_rows_, body_length, field_nodes_, buffer_meta_, &metadata_fb)); // Need to write 4 bytes (metadata size), the metadata, plus padding to - // fall on a 64-byte offset - int64_t padded_metadata_length = - BitUtil::RoundUpToMultipleOf64(metadata_fb->size() + 4); + // fall on an 8-byte offset + int64_t padded_metadata_length = BitUtil::CeilByte(metadata_fb->size() + 4); // The returned metadata size includes the length prefix, the flatbuffer, // plus padding - *metadata_length = padded_metadata_length; + *metadata_length = static_cast(padded_metadata_length); // Write the flatbuffer size prefix int32_t flatbuffer_size = metadata_fb->size(); @@ -604,7 +603,9 @@ Status ReadRecordBatchMetadata(int64_t offset, int32_t metadata_length, return Status::Invalid(ss.str()); } - *metadata = std::make_shared(buffer, sizeof(int32_t)); + std::shared_ptr message; + RETURN_NOT_OK(Message::Open(buffer, 4, &message)); + *metadata = std::make_shared(message); return Status::OK(); } diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index 16069a8f9dcf0..cc160c42ec9ef 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -320,23 +320,10 @@ Status MessageBuilder::SetRecordBatch(int32_t length, int64_t body_length, Status WriteRecordBatchMetadata(int32_t length, int64_t body_length, const std::vector& nodes, const std::vector& buffers, std::shared_ptr* out) { - flatbuffers::FlatBufferBuilder fbb; - - auto batch = flatbuf::CreateRecordBatch( - fbb, length, fbb.CreateVectorOfStructs(nodes), fbb.CreateVectorOfStructs(buffers)); - - fbb.Finish(batch); - - int32_t size = fbb.GetSize(); - - auto result = std::make_shared(); - RETURN_NOT_OK(result->Resize(size)); - - uint8_t* dst = result->mutable_data(); - memcpy(dst, fbb.GetBufferPointer(), size); - - *out = result; - return Status::OK(); + MessageBuilder builder; + RETURN_NOT_OK(builder.SetRecordBatch(length, body_length, nodes, buffers)); + RETURN_NOT_OK(builder.Finish()); + return builder.GetBuffer(out); } Status MessageBuilder::Finish() { diff --git a/format/File.fbs b/format/File.fbs index f28dc204d58d9..e8d6da4f848ff 100644 --- a/format/File.fbs +++ b/format/File.fbs @@ -35,12 +35,15 @@ table Footer { struct Block { + /// Index to the start of the RecordBlock (note this is past the Message header) offset: long; + /// Length of the metadata metaDataLength: int; + /// Length of the data (this is aligned so there can be a gap between this and + /// the metatdata). 
bodyLength: long; - } root_type Footer; diff --git a/integration/integration_test.py b/integration/integration_test.py index 417354bc83d9e..77510daecc0b4 100644 --- a/integration/integration_test.py +++ b/integration/integration_test.py @@ -648,7 +648,7 @@ def get_static_json_files(): def run_all_tests(debug=False): - testers = [JavaTester(debug=debug), CPPTester(debug=debug)] + testers = [CPPTester(debug=debug), JavaTester(debug=debug)] static_json_files = get_static_json_files() generated_json_files = get_generated_json_files() json_files = static_json_files + generated_json_files diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFooter.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFooter.java index 3be19296cb56d..38903068570c7 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFooter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFooter.java @@ -65,10 +65,11 @@ private static List recordBatches(Footer footer) { private static List dictionaries(Footer footer) { List dictionaries = new ArrayList<>(); - Block tempBLock = new Block(); + Block tempBlock = new Block(); + int dictionariesLength = footer.dictionariesLength(); for (int i = 0; i < dictionariesLength; i++) { - Block block = footer.dictionaries(tempBLock, i); + Block block = footer.dictionaries(tempBlock, i); dictionaries.add(new ArrowBlock(block.offset(), block.metaDataLength(), block.bodyLength())); } return dictionaries; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java index 58c51605c5600..8f4f4978d66cf 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java @@ -20,23 +20,15 @@ import java.io.IOException; import java.nio.ByteBuffer; import java.nio.channels.SeekableByteChannel; -import java.util.ArrayList; import java.util.Arrays; -import java.util.List; -import org.apache.arrow.flatbuf.Buffer; -import org.apache.arrow.flatbuf.FieldNode; import org.apache.arrow.flatbuf.Footer; -import org.apache.arrow.flatbuf.RecordBatch; import org.apache.arrow.memory.BufferAllocator; -import org.apache.arrow.vector.schema.ArrowFieldNode; import org.apache.arrow.vector.schema.ArrowRecordBatch; import org.apache.arrow.vector.stream.MessageSerializer; import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import io.netty.buffer.ArrowBuf; - public class ArrowReader implements AutoCloseable { private static final Logger LOGGER = LoggerFactory.getLogger(ArrowReader.class); @@ -54,15 +46,6 @@ public ArrowReader(SeekableByteChannel in, BufferAllocator allocator) { this.allocator = allocator; } - private int readFully(ArrowBuf buffer, int l) throws IOException { - int n = readFully(buffer.nioBuffer(buffer.writerIndex(), l)); - buffer.writerIndex(n); - if (n != l) { - throw new IllegalStateException(n + " != " + l); - } - return n; - } - private int readFully(ByteBuffer buffer) throws IOException { int total = 0; int n; @@ -104,46 +87,21 @@ public ArrowFooter readFooter() throws IOException { // TODO: read dictionaries - public ArrowRecordBatch readRecordBatch(ArrowBlock recordBatchBlock) throws IOException { - LOGGER.debug(String.format("RecordBatch at %d, metadata: %d, body: %d", recordBatchBlock.getOffset(), recordBatchBlock.getMetadataLength(), recordBatchBlock.getBodyLength())); - int l = (int)(recordBatchBlock.getMetadataLength() + 
recordBatchBlock.getBodyLength()); - if (l < 0) { - throw new InvalidArrowFileException("block invalid: " + recordBatchBlock); - } - final ArrowBuf buffer = allocator.buffer(l); - LOGGER.debug("allocated buffer " + buffer); - in.position(recordBatchBlock.getOffset()); - int n = readFully(buffer, l); - if (n != l) { - throw new IllegalStateException(n + " != " + l); - } - - // Record batch flatbuffer is prefixed by its size as int32le - final ArrowBuf metadata = buffer.slice(4, recordBatchBlock.getMetadataLength() - 4); - RecordBatch recordBatchFB = RecordBatch.getRootAsRecordBatch(metadata.nioBuffer().asReadOnlyBuffer()); - - int nodesLength = recordBatchFB.nodesLength(); - final ArrowBuf body = buffer.slice(recordBatchBlock.getMetadataLength(), (int)recordBatchBlock.getBodyLength()); - List nodes = new ArrayList<>(); - for (int i = 0; i < nodesLength; ++i) { - FieldNode node = recordBatchFB.nodes(i); - nodes.add(new ArrowFieldNode(node.length(), node.nullCount())); + public ArrowRecordBatch readRecordBatch(ArrowBlock block) throws IOException { + LOGGER.debug(String.format("RecordBatch at %d, metadata: %d, body: %d", + block.getOffset(), block.getMetadataLength(), + block.getBodyLength())); + in.position(block.getOffset()); + ArrowRecordBatch batch = MessageSerializer.deserializeRecordBatch( + new ReadChannel(in, block.getOffset()), block, allocator); + if (batch == null) { + throw new IOException("Invalid file. No batch at offset: " + block.getOffset()); } - List buffers = new ArrayList<>(); - for (int i = 0; i < recordBatchFB.buffersLength(); ++i) { - Buffer bufferFB = recordBatchFB.buffers(i); - LOGGER.debug(String.format("Buffer in RecordBatch at %d, length: %d", bufferFB.offset(), bufferFB.length())); - ArrowBuf vectorBuffer = body.slice((int)bufferFB.offset(), (int)bufferFB.length()); - buffers.add(vectorBuffer); - } - ArrowRecordBatch arrowRecordBatch = new ArrowRecordBatch(recordBatchFB.length(), nodes, buffers); - LOGGER.debug("released buffer " + buffer); - buffer.release(); - return arrowRecordBatch; + return batch; } + @Override public void close() throws IOException { in.close(); } - } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java index 3febd11f4c76a..24c667e67d98d 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java @@ -23,14 +23,12 @@ import java.util.Collections; import java.util.List; -import org.apache.arrow.vector.schema.ArrowBuffer; import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.stream.MessageSerializer; import org.apache.arrow.vector.types.pojo.Schema; import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import io.netty.buffer.ArrowBuf; - public class ArrowWriter implements AutoCloseable { private static final Logger LOGGER = LoggerFactory.getLogger(ArrowWriter.class); @@ -39,7 +37,6 @@ public class ArrowWriter implements AutoCloseable { private final Schema schema; private final List recordBatches = new ArrayList<>(); - private boolean started = false; public ArrowWriter(WritableByteChannel out, Schema schema) { @@ -49,47 +46,19 @@ public ArrowWriter(WritableByteChannel out, Schema schema) { private void start() throws IOException { writeMagic(); + MessageSerializer.serialize(out, schema); } - // TODO: write dictionaries public void writeRecordBatch(ArrowRecordBatch recordBatch) throws IOException { 
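+    // Framing is delegated to MessageSerializer so the file writer and the
+    // stream writer share a single serialization path.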
checkStarted(); - out.align(); + ArrowBlock batchDesc = MessageSerializer.serialize(out, recordBatch); + LOGGER.debug(String.format("RecordBatch at %d, metadata: %d, body: %d", + batchDesc.getOffset(), batchDesc.getMetadataLength(), batchDesc.getBodyLength())); - // write metadata header with int32 size prefix - long offset = out.getCurrentPosition(); - out.write(recordBatch, true); - out.align(); - // write body - long bodyOffset = out.getCurrentPosition(); - List buffers = recordBatch.getBuffers(); - List buffersLayout = recordBatch.getBuffersLayout(); - if (buffers.size() != buffersLayout.size()) { - throw new IllegalStateException("the layout does not match: " + buffers.size() + " != " + buffersLayout.size()); - } - for (int i = 0; i < buffers.size(); i++) { - ArrowBuf buffer = buffers.get(i); - ArrowBuffer layout = buffersLayout.get(i); - long startPosition = bodyOffset + layout.getOffset(); - if (startPosition != out.getCurrentPosition()) { - out.writeZeros((int)(startPosition - out.getCurrentPosition())); - } - - out.write(buffer); - if (out.getCurrentPosition() != startPosition + layout.getSize()) { - throw new IllegalStateException("wrong buffer size: " + out.getCurrentPosition() + " != " + startPosition + layout.getSize()); - } - } - int metadataLength = (int)(bodyOffset - offset); - if (metadataLength <= 0) { - throw new InvalidArrowFileException("invalid recordBatch"); - } - long bodyLength = out.getCurrentPosition() - bodyOffset; - LOGGER.debug(String.format("RecordBatch at %d, metadata: %d, body: %d", offset, metadataLength, bodyLength)); // add metadata to footer - recordBatches.add(new ArrowBlock(offset, metadataLength, bodyLength)); + recordBatches.add(batchDesc); } private void checkStarted() throws IOException { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ReadChannel.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ReadChannel.java index b062f3826eab3..a9dc1293b8193 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/ReadChannel.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ReadChannel.java @@ -32,9 +32,16 @@ public class ReadChannel implements AutoCloseable { private ReadableByteChannel in; private long bytesRead = 0; + // The starting byte offset into 'in'. 
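+  // Keeps position reporting absolute when the channel is handed a section of
+  // a file that does not start at byte zero.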
+ private final long startByteOffset; - public ReadChannel(ReadableByteChannel in) { + public ReadChannel(ReadableByteChannel in, long startByteOffset) { this.in = in; + this.startByteOffset = startByteOffset; + } + + public ReadChannel(ReadableByteChannel in) { + this(in, 0); } public long bytesRead() { return bytesRead; } @@ -65,6 +72,8 @@ public int readFully(ArrowBuf buffer, int l) throws IOException { return n; } + public long getCurrentPositiion() { return startByteOffset + bytesRead; } + @Override public void close() throws IOException { if (this.in != null) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java b/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java index 22c46e2817b1e..6e22dbd164d6e 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java @@ -29,6 +29,7 @@ import org.apache.arrow.flatbuf.MetadataVersion; import org.apache.arrow.flatbuf.RecordBatch; import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.file.ArrowBlock; import org.apache.arrow.vector.file.ReadChannel; import org.apache.arrow.vector.file.WriteChannel; import org.apache.arrow.vector.schema.ArrowBuffer; @@ -52,7 +53,8 @@ * For RecordBatch messages the serialization is: * 1. 4 byte little endian batch metadata header * 2. FB serialized RowBatch - * 3. serialized RowBatch buffers. + * 3. Padding to align to 8 byte boundary. + * 4. serialized RowBatch buffers. */ public class MessageSerializer { @@ -68,14 +70,10 @@ public static int bytesToInt(byte[] bytes) { */ public static long serialize(WriteChannel out, Schema schema) throws IOException { FlatBufferBuilder builder = new FlatBufferBuilder(); - builder.finish(schema.getSchema(builder)); - ByteBuffer serializedBody = builder.dataBuffer(); - ByteBuffer serializedHeader = - serializeHeader(MessageHeader.Schema, serializedBody.remaining()); - - long size = out.writeIntLittleEndian(serializedHeader.remaining()); - size += out.write(serializedHeader); - size += out.write(serializedBody); + int schemaOffset = schema.getSchema(builder); + ByteBuffer serializedMessage = serializeMessage(builder, MessageHeader.Schema, schemaOffset, 0); + long size = out.writeIntLittleEndian(serializedMessage.remaining()); + size += out.write(serializedMessage); return size; } @@ -83,49 +81,51 @@ public static long serialize(WriteChannel out, Schema schema) throws IOException * Deserializes a schema object. Format is from serialize(). */ public static Schema deserializeSchema(ReadChannel in) throws IOException { - Message header = deserializeHeader(in, MessageHeader.Schema); - if (header == null) { + Message message = deserializeMessage(in, MessageHeader.Schema); + if (message == null) { throw new IOException("Unexpected end of input. Missing schema."); } - // Now read the schema. - ByteBuffer buffer = ByteBuffer.allocate((int)header.bodyLength()); - if (in.readFully(buffer) != header.bodyLength()) { - throw new IOException("Unexpected end of input trying to read schema."); - } - buffer.rewind(); - return Schema.deserialize(buffer); + return Schema.convertSchema((org.apache.arrow.flatbuf.Schema) + message.header(new org.apache.arrow.flatbuf.Schema())); } /** - * Serializes an ArrowRecordBatch. + * Serializes an ArrowRecordBatch. Returns the offset and length of the written batch. 
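+   * Layout on the wire: a 4-byte little-endian metadata length, the Message
+   * flatbuffer, padding to an 8-byte boundary, then the body buffers.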
*/ - public static long serialize(WriteChannel out, ArrowRecordBatch batch) + public static ArrowBlock serialize(WriteChannel out, ArrowRecordBatch batch) throws IOException { long start = out.getCurrentPosition(); int bodyLength = batch.computeBodyLength(); - ByteBuffer metadata = WriteChannel.serialize(batch); - ByteBuffer serializedHeader = - serializeHeader(MessageHeader.RecordBatch, bodyLength + metadata.remaining() + 4); + FlatBufferBuilder builder = new FlatBufferBuilder(); + int batchOffset = batch.writeTo(builder); + + ByteBuffer serializedMessage = serializeMessage(builder, MessageHeader.RecordBatch, + batchOffset, bodyLength); + + int metadataLength = serializedMessage.remaining(); + + // Add extra padding bytes so that length prefix + metadata is a multiple + // of 8 after alignment + if ((start + metadataLength + 4) % 8 != 0) { + metadataLength += 8 - (start + metadataLength + 4) % 8; + } - // Write message header. - out.writeIntLittleEndian(serializedHeader.remaining()); - out.write(serializedHeader); + out.writeIntLittleEndian(metadataLength); + out.write(serializedMessage); - // Write the metadata, with the 4 byte little endian prefix - out.writeIntLittleEndian(metadata.remaining()); - out.write(metadata); + // Align the output to 8 byte boundary. + out.align(); - // Write batch header. - long offset = out.getCurrentPosition(); + long bufferStart = out.getCurrentPosition(); List buffers = batch.getBuffers(); List buffersLayout = batch.getBuffersLayout(); for (int i = 0; i < buffers.size(); i++) { ArrowBuf buffer = buffers.get(i); ArrowBuffer layout = buffersLayout.get(i); - long startPosition = offset + layout.getOffset(); + long startPosition = bufferStart + layout.getOffset(); if (startPosition != out.getCurrentPosition()) { out.writeZeros((int)(startPosition - out.getCurrentPosition())); } @@ -135,7 +135,8 @@ public static long serialize(WriteChannel out, ArrowRecordBatch batch) " != " + startPosition + layout.getSize()); } } - return out.getCurrentPosition() - start; + // Metadata size in the Block account for the size prefix + return new ArrowBlock(start, metadataLength + 4, out.getCurrentPosition() - bufferStart); } /** @@ -143,23 +144,62 @@ public static long serialize(WriteChannel out, ArrowRecordBatch batch) */ public static ArrowRecordBatch deserializeRecordBatch(ReadChannel in, BufferAllocator alloc) throws IOException { - Message header = deserializeHeader(in, MessageHeader.RecordBatch); - if (header == null) return null; + Message message = deserializeMessage(in, MessageHeader.RecordBatch); + if (message == null) return null; + + if (message.bodyLength() > Integer.MAX_VALUE) { + throw new IOException("Cannot currently deserialize record batches over 2GB"); + } + + RecordBatch recordBatchFB = (RecordBatch) message.header(new RecordBatch()); + + int bodyLength = (int) message.bodyLength(); + + // Now read the record batch body + ArrowBuf buffer = alloc.buffer(bodyLength); + if (in.readFully(buffer, bodyLength) != bodyLength) { + throw new IOException("Unexpected end of input trying to read batch."); + } + return deserializeRecordBatch(recordBatchFB, buffer); + } + + /** + * Deserializes a RecordBatch knowing the size of the entire message up front. This + * minimizes the number of reads to the underlying stream. 
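+   * The block's metadata and body lengths come from the file footer, so both
+   * can be fetched with a single allocation and a single read.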
+ */ + public static ArrowRecordBatch deserializeRecordBatch(ReadChannel in, ArrowBlock block, + BufferAllocator alloc) throws IOException { + // Metadata length contains integer prefix plus byte padding + long totalLen = block.getMetadataLength() + block.getBodyLength(); - int messageLen = (int)header.bodyLength(); - // Now read the buffer. This has the metadata followed by the data. - ArrowBuf buffer = alloc.buffer(messageLen); - if (in.readFully(buffer, messageLen) != messageLen) { + if (totalLen > Integer.MAX_VALUE) { + throw new IOException("Cannot currently deserialize record batches over 2GB"); + } + + ArrowBuf buffer = alloc.buffer((int) totalLen); + if (in.readFully(buffer, (int) totalLen) != totalLen) { throw new IOException("Unexpected end of input trying to read batch."); } - // Read the metadata. It starts with the 4 byte size of the metadata. - int metadataSize = buffer.readInt(); - RecordBatch recordBatchFB = - RecordBatch.getRootAsRecordBatch( buffer.nioBuffer().asReadOnlyBuffer()); + ArrowBuf metadataBuffer = buffer.slice(4, block.getMetadataLength() - 4); + + Message messageFB = + Message.getRootAsMessage(metadataBuffer.nioBuffer().asReadOnlyBuffer()); + + RecordBatch recordBatchFB = (RecordBatch) messageFB.header(new RecordBatch()); + + // Now read the body + final ArrowBuf body = buffer.slice(block.getMetadataLength(), + (int) totalLen - block.getMetadataLength()); + ArrowRecordBatch result = deserializeRecordBatch(recordBatchFB, body); + + return result; + } - // No read the body - final ArrowBuf body = buffer.slice(4 + metadataSize, messageLen - metadataSize - 4); + // Deserializes a record batch given the Flatbuffer metadata and in-memory body + private static ArrowRecordBatch deserializeRecordBatch(RecordBatch recordBatchFB, + ArrowBuf body) { + // Now read the body int nodesLength = recordBatchFB.nodesLength(); List nodes = new ArrayList<>(); for (int i = 0; i < nodesLength; ++i) { @@ -174,43 +214,44 @@ public static ArrowRecordBatch deserializeRecordBatch(ReadChannel in, } ArrowRecordBatch arrowRecordBatch = new ArrowRecordBatch(recordBatchFB.length(), nodes, buffers); - buffer.release(); + body.release(); return arrowRecordBatch; } /** * Serializes a message header. */ - private static ByteBuffer serializeHeader(byte headerType, int bodyLength) { - FlatBufferBuilder headerBuilder = new FlatBufferBuilder(); - Message.startMessage(headerBuilder); - Message.addHeaderType(headerBuilder, headerType); - Message.addVersion(headerBuilder, MetadataVersion.V1); - Message.addBodyLength(headerBuilder, bodyLength); - headerBuilder.finish(Message.endMessage(headerBuilder)); - return headerBuilder.dataBuffer(); + private static ByteBuffer serializeMessage(FlatBufferBuilder builder, byte headerType, + int headerOffset, int bodyLength) { + Message.startMessage(builder); + Message.addHeaderType(builder, headerType); + Message.addHeader(builder, headerOffset); + Message.addVersion(builder, MetadataVersion.V1); + Message.addBodyLength(builder, bodyLength); + builder.finish(Message.endMessage(builder)); + return builder.dataBuffer(); } - private static Message deserializeHeader(ReadChannel in, byte headerType) throws IOException { - // Read the header size. There is an i32 little endian prefix. + private static Message deserializeMessage(ReadChannel in, byte headerType) throws IOException { + // Read the message size. There is an i32 little endian prefix. 
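+    // A short read here (fewer than four prefix bytes) means no further
+    // message; return null so callers can treat it as end-of-stream.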
ByteBuffer buffer = ByteBuffer.allocate(4); if (in.readFully(buffer) != 4) { return null; } - int headerLength = bytesToInt(buffer.array()); - buffer = ByteBuffer.allocate(headerLength); - if (in.readFully(buffer) != headerLength) { + int messageLength = bytesToInt(buffer.array()); + buffer = ByteBuffer.allocate(messageLength); + if (in.readFully(buffer) != messageLength) { throw new IOException( - "Unexpected end of stream trying to read header."); + "Unexpected end of stream trying to read message."); } buffer.rewind(); - Message header = Message.getRootAsMessage(buffer); - if (header.headerType() != headerType) { + Message message = Message.getRootAsMessage(buffer); + if (message.headerType() != headerType) { throw new IOException("Invalid message: expecting " + headerType + - ". Message contained: " + header.headerType()); + ". Message contained: " + message.headerType()); } - return header; + return message; } } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java index bf635fb39f5b8..9b9914480bad0 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java @@ -109,8 +109,6 @@ public void testWriteRead() throws IOException { List recordBatches = footer.getRecordBatches(); for (ArrowBlock rbBlock : recordBatches) { - Assert.assertEquals(0, rbBlock.getOffset() % 8); - Assert.assertEquals(0, rbBlock.getMetadataLength() % 8); try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { List buffersLayout = recordBatch.getBuffersLayout(); for (ArrowBuffer arrowBuffer : buffersLayout) { @@ -271,8 +269,6 @@ public void testWriteReadMultipleRBs() throws IOException { for (ArrowBlock rbBlock : recordBatches) { Assert.assertTrue(rbBlock.getOffset() + " > " + previousOffset, rbBlock.getOffset() > previousOffset); previousOffset = rbBlock.getOffset(); - Assert.assertEquals(0, rbBlock.getOffset() % 8); - Assert.assertEquals(0, rbBlock.getMetadataLength() % 8); try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { Assert.assertEquals("RB #" + i, counts[i], recordBatch.getLength()); List buffersLayout = recordBatch.getBuffersLayout(); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFooter.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFooter.java index 707dba2af9898..1e514585e502f 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFooter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFooter.java @@ -21,7 +21,9 @@ import static org.junit.Assert.assertEquals; import java.nio.ByteBuffer; +import java.util.ArrayList; import java.util.Collections; +import java.util.List; import org.apache.arrow.flatbuf.Footer; import org.apache.arrow.vector.types.pojo.ArrowType; @@ -41,6 +43,12 @@ public void test() { ArrowFooter footer = new ArrowFooter(schema, Collections.emptyList(), Collections.emptyList()); ArrowFooter newFooter = roundTrip(footer); assertEquals(footer, newFooter); + + List ids = new ArrayList<>(); + ids.add(new ArrowBlock(0, 1, 2)); + ids.add(new ArrowBlock(4, 5, 6)); + footer = new ArrowFooter(schema, ids, ids); + assertEquals(footer, roundTrip(footer)); } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java index 
8ed89fa347b3b..96bcbb1dae71c 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java @@ -24,10 +24,14 @@ import java.io.ByteArrayOutputStream; import java.io.IOException; +import java.nio.ByteBuffer; import java.nio.channels.Channels; import java.util.Collections; import java.util.List; +import org.apache.arrow.flatbuf.FieldNode; +import org.apache.arrow.flatbuf.Message; +import org.apache.arrow.flatbuf.RecordBatch; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; import org.apache.arrow.vector.schema.ArrowFieldNode; @@ -96,6 +100,18 @@ public void test() throws IOException { assertArrayEquals(validity, array(buffers.get(0))); assertArrayEquals(values, array(buffers.get(1))); + // Read just the header. This demonstrates being able to read without needing to + // deserialize the buffer. + ByteBuffer headerBuffer = ByteBuffer.allocate(recordBatches.get(0).getMetadataLength()); + headerBuffer.put(byteArray, (int)recordBatches.get(0).getOffset(), headerBuffer.capacity()); + headerBuffer.position(4); + Message messageFB = Message.getRootAsMessage(headerBuffer); + RecordBatch recordBatchFB = (RecordBatch) messageFB.header(new RecordBatch()); + assertEquals(2, recordBatchFB.buffersLength()); + assertEquals(1, recordBatchFB.nodesLength()); + FieldNode nodeFB = recordBatchFB.nodes(0); + assertEquals(16, nodeFB.length()); + assertEquals(8, nodeFB.nullCount()); } } From 5888e10cffac222e359d1b440b4684d16c061085 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sat, 21 Jan 2017 11:11:06 -0500 Subject: [PATCH 0285/1644] ARROW-495: [C++] Implement streaming binary format, refactoring cc @nongli Author: Wes McKinney Closes #293 from wesm/ARROW-495 and squashes the following commits: 279583b [Wes McKinney] FileBlock is a struct c88e61a [Wes McKinney] Fix Python bindings after API changes 645a329 [Wes McKinney] Install stream.h 21378b4 [Wes McKinney] Collapse BaseStreamWriter and StreamWriter b6c4578 [Wes McKinney] clang-format 12eb2cb [Wes McKinney] Add unit tests for streaming format, fix EOS, metadata length padding issues 3200b17 [Wes McKinney] Implement StreamReader 69fe82e [Wes McKinney] Implement rough draft of StreamWriter, share code with FileWriter --- cpp/CMakeLists.txt | 1 - cpp/src/arrow/io/memory.cc | 4 +- cpp/src/arrow/ipc/CMakeLists.txt | 2 + cpp/src/arrow/ipc/adapter.cc | 44 ++--- cpp/src/arrow/ipc/adapter.h | 11 +- cpp/src/arrow/ipc/file.cc | 167 ++++++++++++----- cpp/src/arrow/ipc/file.h | 54 +++--- cpp/src/arrow/ipc/ipc-adapter-test.cc | 16 +- cpp/src/arrow/ipc/ipc-file-test.cc | 188 ++++++++++++++++--- cpp/src/arrow/ipc/ipc-json-test.cc | 5 +- cpp/src/arrow/ipc/ipc-metadata-test.cc | 83 +--------- cpp/src/arrow/ipc/json-integration-test.cc | 4 +- cpp/src/arrow/ipc/json.cc | 19 +- cpp/src/arrow/ipc/json.h | 3 +- cpp/src/arrow/ipc/metadata-internal.cc | 8 +- cpp/src/arrow/ipc/metadata-internal.h | 4 +- cpp/src/arrow/ipc/metadata.cc | 121 +----------- cpp/src/arrow/ipc/metadata.h | 32 +--- cpp/src/arrow/ipc/stream.cc | 206 +++++++++++++++++++++ cpp/src/arrow/ipc/stream.h | 112 +++++++++++ cpp/src/arrow/ipc/test-common.h | 9 + python/pyarrow/includes/libarrow_ipc.pxd | 3 +- python/pyarrow/ipc.pyx | 5 +- python/src/pyarrow/adapters/pandas.cc | 34 ++-- 24 files changed, 718 insertions(+), 417 deletions(-) create mode 100644 cpp/src/arrow/ipc/stream.cc create mode 100644 cpp/src/arrow/ipc/stream.h diff --git a/cpp/CMakeLists.txt
b/cpp/CMakeLists.txt index 885ab19256065..9039ffb571b9e 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -90,7 +90,6 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") option(ARROW_ALTIVEC "Build Arrow with Altivec" ON) - endif() if(NOT ARROW_BUILD_TESTS) diff --git a/cpp/src/arrow/io/memory.cc b/cpp/src/arrow/io/memory.cc index 0f5a0dc06979c..1339a99aa787e 100644 --- a/cpp/src/arrow/io/memory.cc +++ b/cpp/src/arrow/io/memory.cc @@ -116,13 +116,13 @@ Status BufferReader::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) Status BufferReader::Read(int64_t nbytes, std::shared_ptr* out) { int64_t size = std::min(nbytes, size_ - position_); - if (buffer_ != nullptr) { + if (size > 0 && buffer_ != nullptr) { *out = SliceBuffer(buffer_, position_, size); } else { *out = std::make_shared(data_ + position_, size); } - position_ += nbytes; + position_ += size; return Status::OK(); } diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index b7ac5f059749f..c047f53d6bf06 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -46,6 +46,7 @@ set(ARROW_IPC_SRCS json-internal.cc metadata.cc metadata-internal.cc + stream.cc ) if(NOT APPLE) @@ -151,6 +152,7 @@ install(FILES file.h json.h metadata.h + stream.h DESTINATION include/arrow/ipc) # pkg-config support diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index 7b4d18c267d43..9da7b3912d4bc 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -49,10 +49,9 @@ namespace ipc { class RecordBatchWriter : public ArrayVisitor { public: - RecordBatchWriter(const std::vector>& columns, int32_t num_rows, - int64_t buffer_start_offset, int max_recursion_depth) - : columns_(columns), - num_rows_(num_rows), + RecordBatchWriter( + const RecordBatch& batch, int64_t buffer_start_offset, int max_recursion_depth) + : batch_(batch), max_recursion_depth_(max_recursion_depth), buffer_start_offset_(buffer_start_offset) {} @@ -79,8 +78,8 @@ class RecordBatchWriter : public ArrayVisitor { } // Perform depth-first traversal of the row-batch - for (size_t i = 0; i < columns_.size(); ++i) { - RETURN_NOT_OK(VisitArray(*columns_[i].get())); + for (int i = 0; i < batch_.num_columns(); ++i) { + RETURN_NOT_OK(VisitArray(*batch_.column(i))); } // The position for the start of a buffer relative to the passed frame of @@ -126,18 +125,23 @@ class RecordBatchWriter : public ArrayVisitor { // itself as an int32_t. 
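+ // Illustrative layout, a sketch rather than anything normative: what is written below is + // [int32 prefix][flatbuffer][padding to 8 bytes] followed by the body buffers, and the prefix + // value includes the padding. For example, a 150-byte flatbuffer at stream offset 0 is padded + // from 154 to 160 bytes, so the prefix holds 156 and the reported metadata length is 160.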
std::shared_ptr metadata_fb; RETURN_NOT_OK(WriteRecordBatchMetadata( - num_rows_, body_length, field_nodes_, buffer_meta_, &metadata_fb)); + batch_.num_rows(), body_length, field_nodes_, buffer_meta_, &metadata_fb)); // Need to write 4 bytes (metadata size), the metadata, plus padding to - // fall on an 8-byte offset - int64_t padded_metadata_length = BitUtil::CeilByte(metadata_fb->size() + 4); + // end on an 8-byte offset + int64_t start_offset; + RETURN_NOT_OK(dst->Tell(&start_offset)); + + int64_t padded_metadata_length = metadata_fb->size() + 4; + const int remainder = (padded_metadata_length + start_offset) % 8; + if (remainder != 0) { padded_metadata_length += 8 - remainder; } // The returned metadata size includes the length prefix, the flatbuffer, // plus padding *metadata_length = static_cast(padded_metadata_length); - // Write the flatbuffer size prefix - int32_t flatbuffer_size = metadata_fb->size(); + // Write the flatbuffer size prefix including padding + int32_t flatbuffer_size = padded_metadata_length - 4; RETURN_NOT_OK( dst->Write(reinterpret_cast(&flatbuffer_size), sizeof(int32_t))); @@ -294,9 +298,7 @@ class RecordBatchWriter : public ArrayVisitor { return Status::OK(); } - // Do not copy this vector. Ownership must be retained elsewhere - const std::vector>& columns_; - int32_t num_rows_; + const RecordBatch& batch_; std::vector field_nodes_; std::vector buffer_meta_; @@ -306,18 +308,16 @@ class RecordBatchWriter : public ArrayVisitor { int64_t buffer_start_offset_; }; -Status WriteRecordBatch(const std::vector>& columns, - int32_t num_rows, int64_t buffer_start_offset, io::OutputStream* dst, - int32_t* metadata_length, int64_t* body_length, int max_recursion_depth) { +Status WriteRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, + io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, + int max_recursion_depth) { DCHECK_GT(max_recursion_depth, 0); - RecordBatchWriter serializer( - columns, num_rows, buffer_start_offset, max_recursion_depth); + RecordBatchWriter serializer(batch, buffer_start_offset, max_recursion_depth); return serializer.Write(dst, metadata_length, body_length); } -Status GetRecordBatchSize(const RecordBatch* batch, int64_t* size) { - RecordBatchWriter serializer( - batch->columns(), batch->num_rows(), 0, kMaxIpcRecursionDepth); +Status GetRecordBatchSize(const RecordBatch& batch, int64_t* size) { + RecordBatchWriter serializer(batch, 0, kMaxIpcRecursionDepth); RETURN_NOT_OK(serializer.GetTotalSize(size)); return Status::OK(); } diff --git a/cpp/src/arrow/ipc/adapter.h b/cpp/src/arrow/ipc/adapter.h index 963b9ee368537..f9ef7d9fe1202 100644 --- a/cpp/src/arrow/ipc/adapter.h +++ b/cpp/src/arrow/ipc/adapter.h @@ -71,17 +71,14 @@ constexpr int kMaxIpcRecursionDepth = 64; // // @param(out) body_length: the size of the contiguous buffer block plus // padding bytes -ARROW_EXPORT Status WriteRecordBatch(const std::vector>& columns, - int32_t num_rows, int64_t buffer_start_offset, io::OutputStream* dst, - int32_t* metadata_length, int64_t* body_length, - int max_recursion_depth = kMaxIpcRecursionDepth); - -// int64_t GetRecordBatchMetadata(const RecordBatch* batch); +ARROW_EXPORT Status WriteRecordBatch(const RecordBatch& batch, + int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, + int64_t* body_length, int max_recursion_depth = kMaxIpcRecursionDepth); // Compute the precise number of bytes needed in a contiguous memory segment to // write the record batch. 
This involves generating the complete serialized // Flatbuffers metadata. -ARROW_EXPORT Status GetRecordBatchSize(const RecordBatch* batch, int64_t* size); +ARROW_EXPORT Status GetRecordBatchSize(const RecordBatch& batch, int64_t* size); // ---------------------------------------------------------------------- // "Read" path; does not copy data if the input supports zero copy reads diff --git a/cpp/src/arrow/ipc/file.cc b/cpp/src/arrow/ipc/file.cc index d7d2e613f87db..bc086e31519a5 100644 --- a/cpp/src/arrow/ipc/file.cc +++ b/cpp/src/arrow/ipc/file.cc @@ -26,6 +26,7 @@ #include "arrow/io/interfaces.h" #include "arrow/io/memory.h" #include "arrow/ipc/adapter.h" +#include "arrow/ipc/metadata-internal.h" #include "arrow/ipc/metadata.h" #include "arrow/ipc/util.h" #include "arrow/status.h" @@ -35,82 +36,154 @@ namespace arrow { namespace ipc { static constexpr const char* kArrowMagicBytes = "ARROW1"; - // ---------------------------------------------------------------------- -// Writer implementation +// File footer + +static flatbuffers::Offset> +FileBlocksToFlatbuffer(FBB& fbb, const std::vector& blocks) { + std::vector fb_blocks; -FileWriter::FileWriter(io::OutputStream* sink, const std::shared_ptr& schema) - : sink_(sink), schema_(schema), position_(-1), started_(false) {} + for (const FileBlock& block : blocks) { + fb_blocks.emplace_back(block.offset, block.metadata_length, block.body_length); + } -Status FileWriter::UpdatePosition() { - return sink_->Tell(&position_); + return fbb.CreateVectorOfStructs(fb_blocks); } -Status FileWriter::Open(io::OutputStream* sink, const std::shared_ptr& schema, - std::shared_ptr* out) { - *out = std::shared_ptr(new FileWriter(sink, schema)); // ctor is private - RETURN_NOT_OK((*out)->UpdatePosition()); - return Status::OK(); +Status WriteFileFooter(const Schema& schema, const std::vector& dictionaries, + const std::vector& record_batches, io::OutputStream* out) { + FBB fbb; + + flatbuffers::Offset fb_schema; + RETURN_NOT_OK(SchemaToFlatbuffer(fbb, schema, &fb_schema)); + + auto fb_dictionaries = FileBlocksToFlatbuffer(fbb, dictionaries); + auto fb_record_batches = FileBlocksToFlatbuffer(fbb, record_batches); + + auto footer = flatbuf::CreateFooter( + fbb, kMetadataVersion, fb_schema, fb_dictionaries, fb_record_batches); + + fbb.Finish(footer); + + int32_t size = fbb.GetSize(); + + return out->Write(fbb.GetBufferPointer(), size); } -Status FileWriter::Write(const uint8_t* data, int64_t nbytes) { - RETURN_NOT_OK(sink_->Write(data, nbytes)); - position_ += nbytes; - return Status::OK(); +static inline FileBlock FileBlockFromFlatbuffer(const flatbuf::Block* block) { + return FileBlock(block->offset(), block->metaDataLength(), block->bodyLength()); } -Status FileWriter::Align() { - int64_t remainder = PaddedLength(position_) - position_; - if (remainder > 0) { return Write(kPaddingBytes, remainder); } +class FileFooter::FileFooterImpl { + public: + FileFooterImpl(const std::shared_ptr& buffer, const flatbuf::Footer* footer) + : buffer_(buffer), footer_(footer) {} + + int num_dictionaries() const { return footer_->dictionaries()->size(); } + + int num_record_batches() const { return footer_->recordBatches()->size(); } + + MetadataVersion::type version() const { + switch (footer_->version()) { + case flatbuf::MetadataVersion_V1: + return MetadataVersion::V1; + case flatbuf::MetadataVersion_V2: + return MetadataVersion::V2; + // Add cases as other versions become available + default: + return MetadataVersion::V2; + } + } + + FileBlock record_batch(int i) const { 
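+ // Blocks are stored in the footer as plain flatbuffer structs and converted to FileBlock + // on access; only buffer_ keeps the backing memory alive.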
+ return FileBlockFromFlatbuffer(footer_->recordBatches()->Get(i)); + } + + FileBlock dictionary(int i) const { + return FileBlockFromFlatbuffer(footer_->dictionaries()->Get(i)); + } + + Status GetSchema(std::shared_ptr* out) const { + auto schema_msg = std::make_shared(nullptr, footer_->schema()); + return schema_msg->GetSchema(out); + } + + private: + // Retain reference to memory + std::shared_ptr buffer_; + + const flatbuf::Footer* footer_; +}; + +FileFooter::FileFooter() {} + +FileFooter::~FileFooter() {} + +Status FileFooter::Open( + const std::shared_ptr& buffer, std::unique_ptr* out) { + const flatbuf::Footer* footer = flatbuf::GetFooter(buffer->data()); + + *out = std::unique_ptr(new FileFooter()); + + // TODO(wesm): Verify the footer + (*out)->impl_.reset(new FileFooterImpl(buffer, footer)); + return Status::OK(); } -Status FileWriter::WriteAligned(const uint8_t* data, int64_t nbytes) { - RETURN_NOT_OK(Write(data, nbytes)); - return Align(); +int FileFooter::num_dictionaries() const { + return impl_->num_dictionaries(); } -Status FileWriter::Start() { - RETURN_NOT_OK(WriteAligned( - reinterpret_cast(kArrowMagicBytes), strlen(kArrowMagicBytes))); - started_ = true; - return Status::OK(); +int FileFooter::num_record_batches() const { + return impl_->num_record_batches(); } -Status FileWriter::CheckStarted() { - if (!started_) { return Start(); } - return Status::OK(); +MetadataVersion::type FileFooter::version() const { + return impl_->version(); } -Status FileWriter::WriteRecordBatch( - const std::vector>& columns, int32_t num_rows) { - RETURN_NOT_OK(CheckStarted()); - - int64_t offset = position_; +FileBlock FileFooter::record_batch(int i) const { + return impl_->record_batch(i); +} - // There may be padding ever the end of the metadata, so we cannot rely on - // position_ - int32_t metadata_length; - int64_t body_length; +FileBlock FileFooter::dictionary(int i) const { + return impl_->dictionary(i); +} - // Frame of reference in file format is 0, see ARROW-384 - const int64_t buffer_start_offset = 0; - RETURN_NOT_OK(arrow::ipc::WriteRecordBatch( - columns, num_rows, buffer_start_offset, sink_, &metadata_length, &body_length)); - RETURN_NOT_OK(UpdatePosition()); +Status FileFooter::GetSchema(std::shared_ptr* out) const { + return impl_->GetSchema(out); +} - DCHECK(position_ % 8 == 0) << "ipc::WriteRecordBatch did not perform aligned writes"; +// ---------------------------------------------------------------------- +// File writer implementation - // Append metadata, to be written in the footer later - record_batches_.emplace_back(offset, metadata_length, body_length); +Status FileWriter::Open(io::OutputStream* sink, const std::shared_ptr& schema, + std::shared_ptr* out) { + *out = std::shared_ptr(new FileWriter(sink, schema)); // ctor is private + RETURN_NOT_OK((*out)->UpdatePosition()); + return Status::OK(); +} +Status FileWriter::Start() { + RETURN_NOT_OK(WriteAligned( + reinterpret_cast(kArrowMagicBytes), strlen(kArrowMagicBytes))); + started_ = true; return Status::OK(); } +Status FileWriter::WriteRecordBatch(const RecordBatch& batch) { + // Push an empty FileBlock + // Append metadata, to be written in the footer later + record_batches_.emplace_back(0, 0, 0); + return StreamWriter::WriteRecordBatch( + batch, &record_batches_[record_batches_.size() - 1]); +} + Status FileWriter::Close() { // Write metadata int64_t initial_position = position_; - RETURN_NOT_OK(WriteFileFooter(schema_.get(), dictionaries_, record_batches_, sink_)); + RETURN_NOT_OK(WriteFileFooter(*schema_, 
dictionaries_, record_batches_, sink_)); RETURN_NOT_OK(UpdatePosition()); // Write footer length diff --git a/cpp/src/arrow/ipc/file.h b/cpp/src/arrow/ipc/file.h index 4f35c37b03235..7696954c188e3 100644 --- a/cpp/src/arrow/ipc/file.h +++ b/cpp/src/arrow/ipc/file.h @@ -25,13 +25,12 @@ #include #include "arrow/ipc/metadata.h" +#include "arrow/ipc/stream.h" #include "arrow/util/visibility.h" namespace arrow { -class Array; class Buffer; -struct Field; class RecordBatch; class Schema; class Status; @@ -45,40 +44,43 @@ class ReadableFileInterface; namespace ipc { -class ARROW_EXPORT FileWriter { - public: - static Status Open(io::OutputStream* sink, const std::shared_ptr& schema, - std::shared_ptr* out); +Status WriteFileFooter(const Schema& schema, const std::vector& dictionaries, + const std::vector& record_batches, io::OutputStream* out); - // TODO(wesm): Write dictionaries +class ARROW_EXPORT FileFooter { + public: + ~FileFooter(); - Status WriteRecordBatch( - const std::vector>& columns, int32_t num_rows); + static Status Open( + const std::shared_ptr& buffer, std::unique_ptr* out); - Status Close(); + int num_dictionaries() const; + int num_record_batches() const; + MetadataVersion::type version() const; - private: - FileWriter(io::OutputStream* sink, const std::shared_ptr& schema); + FileBlock record_batch(int i) const; + FileBlock dictionary(int i) const; - Status CheckStarted(); - Status Start(); + Status GetSchema(std::shared_ptr* out) const; - Status UpdatePosition(); + private: + FileFooter(); + class FileFooterImpl; + std::unique_ptr impl_; +}; - // Adds padding bytes if necessary to ensure all memory blocks are written on - // 8-byte boundaries. - Status Align(); +class ARROW_EXPORT FileWriter : public StreamWriter { + public: + static Status Open(io::OutputStream* sink, const std::shared_ptr& schema, + std::shared_ptr* out); - // Write data and update position - Status Write(const uint8_t* data, int64_t nbytes); + Status WriteRecordBatch(const RecordBatch& batch) override; + Status Close() override; - // Write and align - Status WriteAligned(const uint8_t* data, int64_t nbytes); + private: + using StreamWriter::StreamWriter; - io::OutputStream* sink_; - std::shared_ptr schema_; - int64_t position_; - bool started_; + Status Start() override; std::vector dictionaries_; std::vector record_batches_; diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc index 6ba0a6e16be08..17868f8f1029e 100644 --- a/cpp/src/arrow/ipc/ipc-adapter-test.cc +++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc @@ -55,8 +55,8 @@ class TestWriteRecordBatch : public ::testing::TestWithParam, const int64_t buffer_offset = 0; - RETURN_NOT_OK(WriteRecordBatch(batch.columns(), batch.num_rows(), buffer_offset, - mmap_.get(), &metadata_length, &body_length)); + RETURN_NOT_OK(WriteRecordBatch( + batch, buffer_offset, mmap_.get(), &metadata_length, &body_length)); std::shared_ptr metadata; RETURN_NOT_OK(ReadRecordBatchMetadata(0, metadata_length, mmap_.get(), &metadata)); @@ -102,9 +102,8 @@ void TestGetRecordBatchSize(std::shared_ptr batch) { int32_t mock_metadata_length = -1; int64_t mock_body_length = -1; int64_t size = -1; - ASSERT_OK(WriteRecordBatch(batch->columns(), batch->num_rows(), 0, &mock, - &mock_metadata_length, &mock_body_length)); - ASSERT_OK(GetRecordBatchSize(batch.get(), &size)); + ASSERT_OK(WriteRecordBatch(*batch, 0, &mock, &mock_metadata_length, &mock_body_length)); + ASSERT_OK(GetRecordBatchSize(*batch, &size)); ASSERT_EQ(mock.GetExtentBytesWritten(), size); } @@ 
-157,11 +156,10 @@ class RecursionLimits : public ::testing::Test, public io::MemoryMapFixture { io::MemoryMapFixture::InitMemoryMap(memory_map_size, path, &mmap_); if (override_level) { - return WriteRecordBatch(batch->columns(), batch->num_rows(), 0, mmap_.get(), - metadata_length, body_length, recursion_level + 1); + return WriteRecordBatch( + *batch, 0, mmap_.get(), metadata_length, body_length, recursion_level + 1); } else { - return WriteRecordBatch(batch->columns(), batch->num_rows(), 0, mmap_.get(), - metadata_length, body_length); + return WriteRecordBatch(*batch, 0, mmap_.get(), metadata_length, body_length); } } diff --git a/cpp/src/arrow/ipc/ipc-file-test.cc b/cpp/src/arrow/ipc/ipc-file-test.cc index 0a9f677966389..15ceb80493632 100644 --- a/cpp/src/arrow/ipc/ipc-file-test.cc +++ b/cpp/src/arrow/ipc/ipc-file-test.cc @@ -29,6 +29,7 @@ #include "arrow/io/test-common.h" #include "arrow/ipc/adapter.h" #include "arrow/ipc/file.h" +#include "arrow/ipc/stream.h" #include "arrow/ipc/test-common.h" #include "arrow/ipc/util.h" @@ -41,6 +42,19 @@ namespace arrow { namespace ipc { +void CompareBatch(const RecordBatch& left, const RecordBatch& right) { + ASSERT_TRUE(left.schema()->Equals(right.schema())); + ASSERT_EQ(left.num_columns(), right.num_columns()) + << left.schema()->ToString() << " result: " << right.schema()->ToString(); + EXPECT_EQ(left.num_rows(), right.num_rows()); + for (int i = 0; i < left.num_columns(); ++i) { + EXPECT_TRUE(left.column(i)->Equals(right.column(i))) + << "Idx: " << i << " Name: " << left.column_name(i); + } +} + +using BatchVector = std::vector>; + class TestFileFormat : public ::testing::TestWithParam { public: void SetUp() { @@ -50,43 +64,94 @@ class TestFileFormat : public ::testing::TestWithParam { } void TearDown() {} - Status RoundTripHelper( - const RecordBatch& batch, std::vector>* out_batches) { + Status RoundTripHelper(const BatchVector& in_batches, BatchVector* out_batches) { // Write the file - RETURN_NOT_OK(FileWriter::Open(sink_.get(), batch.schema(), &file_writer_)); - int num_batches = 3; - for (int i = 0; i < num_batches; ++i) { - RETURN_NOT_OK(file_writer_->WriteRecordBatch(batch.columns(), batch.num_rows())); + std::shared_ptr writer; + RETURN_NOT_OK(FileWriter::Open(sink_.get(), in_batches[0]->schema(), &writer)); + + const int num_batches = static_cast(in_batches.size()); + + for (const auto& batch : in_batches) { + RETURN_NOT_OK(writer->WriteRecordBatch(*batch)); } - RETURN_NOT_OK(file_writer_->Close()); + RETURN_NOT_OK(writer->Close()); // Current offset into stream is the end of the file int64_t footer_offset; RETURN_NOT_OK(sink_->Tell(&footer_offset)); // Open the file - auto reader = std::make_shared(buffer_); - RETURN_NOT_OK(FileReader::Open(reader, footer_offset, &file_reader_)); + auto buf_reader = std::make_shared(buffer_); + std::shared_ptr reader; + RETURN_NOT_OK(FileReader::Open(buf_reader, footer_offset, &reader)); - EXPECT_EQ(num_batches, file_reader_->num_record_batches()); - - out_batches->resize(num_batches); + EXPECT_EQ(num_batches, reader->num_record_batches()); for (int i = 0; i < num_batches; ++i) { - RETURN_NOT_OK(file_reader_->GetRecordBatch(i, &(*out_batches)[i])); + std::shared_ptr chunk; + RETURN_NOT_OK(reader->GetRecordBatch(i, &chunk)); + out_batches->emplace_back(chunk); } return Status::OK(); } - void CompareBatch(const RecordBatch* left, const RecordBatch* right) { - ASSERT_TRUE(left->schema()->Equals(right->schema())); - ASSERT_EQ(left->num_columns(), right->num_columns()) - << left->schema()->ToString() 
<< " result: " << right->schema()->ToString(); - EXPECT_EQ(left->num_rows(), right->num_rows()); - for (int i = 0; i < left->num_columns(); ++i) { - EXPECT_TRUE(left->column(i)->Equals(right->column(i))) - << "Idx: " << i << " Name: " << left->column_name(i); + protected: + MemoryPool* pool_; + + std::unique_ptr sink_; + std::shared_ptr buffer_; +}; + +TEST_P(TestFileFormat, RoundTrip) { + std::shared_ptr batch1; + std::shared_ptr batch2; + ASSERT_OK((*GetParam())(&batch1)); // NOLINT clang-tidy gtest issue + ASSERT_OK((*GetParam())(&batch2)); // NOLINT clang-tidy gtest issue + + std::vector> in_batches = {batch1, batch2}; + std::vector> out_batches; + + ASSERT_OK(RoundTripHelper(in_batches, &out_batches)); + + // Compare batches + for (size_t i = 0; i < in_batches.size(); ++i) { + CompareBatch(*in_batches[i], *out_batches[i]); + } +} + +class TestStreamFormat : public ::testing::TestWithParam { + public: + void SetUp() { + pool_ = default_memory_pool(); + buffer_ = std::make_shared(pool_); + sink_.reset(new io::BufferOutputStream(buffer_)); + } + void TearDown() {} + + Status RoundTripHelper( + const RecordBatch& batch, std::vector>* out_batches) { + // Write the file + std::shared_ptr writer; + RETURN_NOT_OK(StreamWriter::Open(sink_.get(), batch.schema(), &writer)); + int num_batches = 5; + for (int i = 0; i < num_batches; ++i) { + RETURN_NOT_OK(writer->WriteRecordBatch(batch)); + } + RETURN_NOT_OK(writer->Close()); + + // Open the file + auto buf_reader = std::make_shared(buffer_); + + std::shared_ptr reader; + RETURN_NOT_OK(StreamReader::Open(buf_reader, &reader)); + + std::shared_ptr chunk; + while (true) { + RETURN_NOT_OK(reader->GetNextRecordBatch(&chunk)); + if (chunk == nullptr) { break; } + out_batches->emplace_back(chunk); } + return Status::OK(); } protected: @@ -94,12 +159,9 @@ class TestFileFormat : public ::testing::TestWithParam { std::unique_ptr sink_; std::shared_ptr buffer_; - - std::shared_ptr file_writer_; - std::shared_ptr file_reader_; }; -TEST_P(TestFileFormat, RoundTrip) { +TEST_P(TestStreamFormat, RoundTrip) { std::shared_ptr batch; ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue @@ -109,14 +171,80 @@ TEST_P(TestFileFormat, RoundTrip) { // Compare batches. 
Same for (size_t i = 0; i < out_batches.size(); ++i) { - CompareBatch(batch.get(), out_batches[i].get()); + CompareBatch(*batch, *out_batches[i]); } } -INSTANTIATE_TEST_CASE_P(RoundTripTests, TestFileFormat, - ::testing::Values(&MakeIntRecordBatch, &MakeListRecordBatch, &MakeNonNullRecordBatch, - &MakeZeroLengthRecordBatch, &MakeDeeplyNestedList, - &MakeStringTypesRecordBatch, &MakeStruct)); +#define BATCH_CASES() \ + ::testing::Values(&MakeIntRecordBatch, &MakeListRecordBatch, &MakeNonNullRecordBatch, \ + &MakeZeroLengthRecordBatch, &MakeDeeplyNestedList, &MakeStringTypesRecordBatch, \ + &MakeStruct); + +INSTANTIATE_TEST_CASE_P(FileRoundTripTests, TestFileFormat, BATCH_CASES()); +INSTANTIATE_TEST_CASE_P(StreamRoundTripTests, TestStreamFormat, BATCH_CASES()); + +class TestFileFooter : public ::testing::Test { + public: + void SetUp() {} + + void CheckRoundtrip(const Schema& schema, const std::vector& dictionaries, + const std::vector& record_batches) { + auto buffer = std::make_shared(); + io::BufferOutputStream stream(buffer); + + ASSERT_OK(WriteFileFooter(schema, dictionaries, record_batches, &stream)); + + std::unique_ptr footer; + ASSERT_OK(FileFooter::Open(buffer, &footer)); + + ASSERT_EQ(MetadataVersion::V2, footer->version()); + + // Check schema + std::shared_ptr schema2; + ASSERT_OK(footer->GetSchema(&schema2)); + AssertSchemaEqual(schema, *schema2); + + // Check blocks + ASSERT_EQ(dictionaries.size(), footer->num_dictionaries()); + ASSERT_EQ(record_batches.size(), footer->num_record_batches()); + + for (int i = 0; i < footer->num_dictionaries(); ++i) { + CheckBlocks(dictionaries[i], footer->dictionary(i)); + } + + for (int i = 0; i < footer->num_record_batches(); ++i) { + CheckBlocks(record_batches[i], footer->record_batch(i)); + } + } + + void CheckBlocks(const FileBlock& left, const FileBlock& right) { + ASSERT_EQ(left.offset, right.offset); + ASSERT_EQ(left.metadata_length, right.metadata_length); + ASSERT_EQ(left.body_length, right.body_length); + } + + private: + std::shared_ptr example_schema_; +}; + +TEST_F(TestFileFooter, Basics) { + auto f0 = std::make_shared("f0", std::make_shared()); + auto f1 = std::make_shared("f1", std::make_shared()); + Schema schema({f0, f1}); + + std::vector dictionaries; + dictionaries.emplace_back(8, 92, 900); + dictionaries.emplace_back(1000, 100, 1900); + dictionaries.emplace_back(3000, 100, 2900); + + std::vector record_batches; + record_batches.emplace_back(6000, 100, 900); + record_batches.emplace_back(7000, 100, 1900); + record_batches.emplace_back(9000, 100, 2900); + record_batches.emplace_back(12000, 100, 3900); + + CheckRoundtrip(schema, dictionaries, record_batches); +} } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/ipc-json-test.cc b/cpp/src/arrow/ipc/ipc-json-test.cc index 07509890da35c..30f968c2bfd8b 100644 --- a/cpp/src/arrow/ipc/ipc-json-test.cc +++ b/cpp/src/arrow/ipc/ipc-json-test.cc @@ -245,8 +245,9 @@ TEST(TestJsonFileReadWrite, BasicRoundTrip) { std::vector> arrays; MakeBatchArrays(schema, num_rows, &arrays); - batches.emplace_back(std::make_shared(schema, num_rows, arrays)); - ASSERT_OK(writer->WriteRecordBatch(arrays, num_rows)); + auto batch = std::make_shared(schema, num_rows, arrays); + batches.push_back(batch); + ASSERT_OK(writer->WriteRecordBatch(*batch)); } std::string result; diff --git a/cpp/src/arrow/ipc/ipc-metadata-test.cc b/cpp/src/arrow/ipc/ipc-metadata-test.cc index 7c5744a241068..098f996d292a2 100644 --- a/cpp/src/arrow/ipc/ipc-metadata-test.cc +++ 
b/cpp/src/arrow/ipc/ipc-metadata-test.cc @@ -23,6 +23,7 @@ #include "arrow/io/memory.h" #include "arrow/ipc/metadata.h" +#include "arrow/ipc/test-common.h" #include "arrow/schema.h" #include "arrow/status.h" #include "arrow/test-util.h" @@ -34,20 +35,11 @@ class Buffer; namespace ipc { -static inline void assert_schema_equal(const Schema* lhs, const Schema* rhs) { - if (!lhs->Equals(*rhs)) { - std::stringstream ss; - ss << "left schema: " << lhs->ToString() << std::endl - << "right schema: " << rhs->ToString() << std::endl; - FAIL() << ss.str(); - } -} - class TestSchemaMetadata : public ::testing::Test { public: void SetUp() {} - void CheckRoundtrip(const Schema* schema) { + void CheckRoundtrip(const Schema& schema) { std::shared_ptr buffer; ASSERT_OK(WriteSchema(schema, &buffer)); @@ -57,12 +49,12 @@ class TestSchemaMetadata : public ::testing::Test { ASSERT_EQ(Message::SCHEMA, message->type()); auto schema_msg = std::make_shared(message); - ASSERT_EQ(schema->num_fields(), schema_msg->num_fields()); + ASSERT_EQ(schema.num_fields(), schema_msg->num_fields()); std::shared_ptr schema2; ASSERT_OK(schema_msg->GetSchema(&schema2)); - assert_schema_equal(schema, schema2.get()); + AssertSchemaEqual(schema, *schema2); } }; @@ -82,7 +74,7 @@ TEST_F(TestSchemaMetadata, PrimitiveFields) { auto f10 = std::make_shared("f10", std::make_shared()); Schema schema({f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10}); - CheckRoundtrip(&schema); + CheckRoundtrip(schema); } TEST_F(TestSchemaMetadata, NestedFields) { @@ -94,70 +86,7 @@ TEST_F(TestSchemaMetadata, NestedFields) { auto f1 = std::make_shared("f1", type2); Schema schema({f0, f1}); - CheckRoundtrip(&schema); -} - -class TestFileFooter : public ::testing::Test { - public: - void SetUp() {} - - void CheckRoundtrip(const Schema* schema, const std::vector& dictionaries, - const std::vector& record_batches) { - auto buffer = std::make_shared(); - io::BufferOutputStream stream(buffer); - - ASSERT_OK(WriteFileFooter(schema, dictionaries, record_batches, &stream)); - - std::unique_ptr footer; - ASSERT_OK(FileFooter::Open(buffer, &footer)); - - ASSERT_EQ(MetadataVersion::V2, footer->version()); - - // Check schema - std::shared_ptr schema2; - ASSERT_OK(footer->GetSchema(&schema2)); - assert_schema_equal(schema, schema2.get()); - - // Check blocks - ASSERT_EQ(dictionaries.size(), footer->num_dictionaries()); - ASSERT_EQ(record_batches.size(), footer->num_record_batches()); - - for (int i = 0; i < footer->num_dictionaries(); ++i) { - CheckBlocks(dictionaries[i], footer->dictionary(i)); - } - - for (int i = 0; i < footer->num_record_batches(); ++i) { - CheckBlocks(record_batches[i], footer->record_batch(i)); - } - } - - void CheckBlocks(const FileBlock& left, const FileBlock& right) { - ASSERT_EQ(left.offset, right.offset); - ASSERT_EQ(left.metadata_length, right.metadata_length); - ASSERT_EQ(left.body_length, right.body_length); - } - - private: - std::shared_ptr example_schema_; -}; - -TEST_F(TestFileFooter, Basics) { - auto f0 = std::make_shared("f0", std::make_shared()); - auto f1 = std::make_shared("f1", std::make_shared()); - Schema schema({f0, f1}); - - std::vector dictionaries; - dictionaries.emplace_back(8, 92, 900); - dictionaries.emplace_back(1000, 100, 1900); - dictionaries.emplace_back(3000, 100, 2900); - - std::vector record_batches; - record_batches.emplace_back(6000, 100, 900); - record_batches.emplace_back(7000, 100, 1900); - record_batches.emplace_back(9000, 100, 2900); - record_batches.emplace_back(12000, 100, 3900); - - CheckRoundtrip(&schema, 
dictionaries, record_batches); + CheckRoundtrip(schema); } } // namespace ipc diff --git a/cpp/src/arrow/ipc/json-integration-test.cc b/cpp/src/arrow/ipc/json-integration-test.cc index 757e6c00ab243..95bc742054fab 100644 --- a/cpp/src/arrow/ipc/json-integration-test.cc +++ b/cpp/src/arrow/ipc/json-integration-test.cc @@ -81,7 +81,7 @@ static Status ConvertJsonToArrow( for (int i = 0; i < reader->num_record_batches(); ++i) { std::shared_ptr batch; RETURN_NOT_OK(reader->GetRecordBatch(i, &batch)); - RETURN_NOT_OK(writer->WriteRecordBatch(batch->columns(), batch->num_rows())); + RETURN_NOT_OK(writer->WriteRecordBatch(*batch)); } return writer->Close(); } @@ -108,7 +108,7 @@ static Status ConvertArrowToJson( for (int i = 0; i < reader->num_record_batches(); ++i) { std::shared_ptr batch; RETURN_NOT_OK(reader->GetRecordBatch(i, &batch)); - RETURN_NOT_OK(writer->WriteRecordBatch(batch->columns(), batch->num_rows())); + RETURN_NOT_OK(writer->WriteRecordBatch(*batch)); } std::string result; diff --git a/cpp/src/arrow/ipc/json.cc b/cpp/src/arrow/ipc/json.cc index 6e3a9939730f4..773fb74a1767a 100644 --- a/cpp/src/arrow/ipc/json.cc +++ b/cpp/src/arrow/ipc/json.cc @@ -64,25 +64,23 @@ class JsonWriter::JsonWriterImpl { return Status::OK(); } - Status WriteRecordBatch( - const std::vector>& columns, int32_t num_rows) { - DCHECK_EQ(static_cast(columns.size()), schema_->num_fields()); + Status WriteRecordBatch(const RecordBatch& batch) { + DCHECK_EQ(batch.num_columns(), schema_->num_fields()); writer_->StartObject(); writer_->Key("count"); - writer_->Int(num_rows); + writer_->Int(batch.num_rows()); writer_->Key("columns"); writer_->StartArray(); for (int i = 0; i < schema_->num_fields(); ++i) { - const std::shared_ptr& column = columns[i]; + const std::shared_ptr& column = batch.column(i); - DCHECK_EQ(num_rows, column->length()) + DCHECK_EQ(batch.num_rows(), column->length()) << "Array length did not match record batch length"; - RETURN_NOT_OK( - WriteJsonArray(schema_->field(i)->name, *column.get(), writer_.get())); + RETURN_NOT_OK(WriteJsonArray(schema_->field(i)->name, *column, writer_.get())); } writer_->EndArray(); @@ -113,9 +111,8 @@ Status JsonWriter::Finish(std::string* result) { return impl_->Finish(result); } -Status JsonWriter::WriteRecordBatch( - const std::vector>& columns, int32_t num_rows) { - return impl_->WriteRecordBatch(columns, num_rows); +Status JsonWriter::WriteRecordBatch(const RecordBatch& batch) { + return impl_->WriteRecordBatch(batch); } // ---------------------------------------------------------------------- diff --git a/cpp/src/arrow/ipc/json.h b/cpp/src/arrow/ipc/json.h index 7395be43b967d..88afdfaa5ff3b 100644 --- a/cpp/src/arrow/ipc/json.h +++ b/cpp/src/arrow/ipc/json.h @@ -46,8 +46,7 @@ class ARROW_EXPORT JsonWriter { // TODO(wesm): Write dictionaries - Status WriteRecordBatch( - const std::vector>& columns, int32_t num_rows); + Status WriteRecordBatch(const RecordBatch& batch); Status Finish(std::string* result); diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index cc160c42ec9ef..cd7722056a3c7 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -282,10 +282,10 @@ flatbuf::Endianness endianness() { } Status SchemaToFlatbuffer( - FBB& fbb, const Schema* schema, flatbuffers::Offset* out) { + FBB& fbb, const Schema& schema, flatbuffers::Offset* out) { std::vector field_offsets; - for (int i = 0; i < schema->num_fields(); ++i) { - std::shared_ptr field = schema->field(i); + for (int i 
= 0; i < schema.num_fields(); ++i) { + std::shared_ptr field = schema.field(i); FieldOffset offset; RETURN_NOT_OK(FieldToFlatbuffer(fbb, field, &offset)); field_offsets.push_back(offset); @@ -295,7 +295,7 @@ Status SchemaToFlatbuffer( return Status::OK(); } -Status MessageBuilder::SetSchema(const Schema* schema) { +Status MessageBuilder::SetSchema(const Schema& schema) { flatbuffers::Offset fb_schema; RETURN_NOT_OK(SchemaToFlatbuffer(fbb_, schema, &fb_schema)); diff --git a/cpp/src/arrow/ipc/metadata-internal.h b/cpp/src/arrow/ipc/metadata-internal.h index 4826ebe22899d..d94a8abc99ab0 100644 --- a/cpp/src/arrow/ipc/metadata-internal.h +++ b/cpp/src/arrow/ipc/metadata-internal.h @@ -49,11 +49,11 @@ static constexpr flatbuf::MetadataVersion kMetadataVersion = flatbuf::MetadataVe Status FieldFromFlatbuffer(const flatbuf::Field* field, std::shared_ptr* out); Status SchemaToFlatbuffer( - FBB& fbb, const Schema* schema, flatbuffers::Offset* out); + FBB& fbb, const Schema& schema, flatbuffers::Offset* out); class MessageBuilder { public: - Status SetSchema(const Schema* schema); + Status SetSchema(const Schema& schema); Status SetRecordBatch(int32_t length, int64_t body_length, const std::vector& nodes, diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index f0674ff8d5aeb..a97965c40d608 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -38,7 +38,7 @@ namespace flatbuf = org::apache::arrow::flatbuf; namespace ipc { -Status WriteSchema(const Schema* schema, std::shared_ptr* out) { +Status WriteSchema(const Schema& schema, std::shared_ptr* out) { MessageBuilder message; RETURN_NOT_OK(message.SetSchema(schema)); RETURN_NOT_OK(message.Finish()); @@ -232,124 +232,5 @@ int RecordBatchMetadata::num_fields() const { return impl_->num_fields(); } -// ---------------------------------------------------------------------- -// File footer - -static flatbuffers::Offset> -FileBlocksToFlatbuffer(FBB& fbb, const std::vector& blocks) { - std::vector fb_blocks; - - for (const FileBlock& block : blocks) { - fb_blocks.emplace_back(block.offset, block.metadata_length, block.body_length); - } - - return fbb.CreateVectorOfStructs(fb_blocks); -} - -Status WriteFileFooter(const Schema* schema, const std::vector& dictionaries, - const std::vector& record_batches, io::OutputStream* out) { - FBB fbb; - - flatbuffers::Offset fb_schema; - RETURN_NOT_OK(SchemaToFlatbuffer(fbb, schema, &fb_schema)); - - auto fb_dictionaries = FileBlocksToFlatbuffer(fbb, dictionaries); - auto fb_record_batches = FileBlocksToFlatbuffer(fbb, record_batches); - - auto footer = flatbuf::CreateFooter( - fbb, kMetadataVersion, fb_schema, fb_dictionaries, fb_record_batches); - - fbb.Finish(footer); - - int32_t size = fbb.GetSize(); - - return out->Write(fbb.GetBufferPointer(), size); -} - -static inline FileBlock FileBlockFromFlatbuffer(const flatbuf::Block* block) { - return FileBlock(block->offset(), block->metaDataLength(), block->bodyLength()); -} - -class FileFooter::FileFooterImpl { - public: - FileFooterImpl(const std::shared_ptr& buffer, const flatbuf::Footer* footer) - : buffer_(buffer), footer_(footer) {} - - int num_dictionaries() const { return footer_->dictionaries()->size(); } - - int num_record_batches() const { return footer_->recordBatches()->size(); } - - MetadataVersion::type version() const { - switch (footer_->version()) { - case flatbuf::MetadataVersion_V1: - return MetadataVersion::V1; - case flatbuf::MetadataVersion_V2: - return MetadataVersion::V2; - // Add cases as other 
versions become available - default: - return MetadataVersion::V2; - } - } - - FileBlock record_batch(int i) const { - return FileBlockFromFlatbuffer(footer_->recordBatches()->Get(i)); - } - - FileBlock dictionary(int i) const { - return FileBlockFromFlatbuffer(footer_->dictionaries()->Get(i)); - } - - Status GetSchema(std::shared_ptr* out) const { - auto schema_msg = std::make_shared(nullptr, footer_->schema()); - return schema_msg->GetSchema(out); - } - - private: - // Retain reference to memory - std::shared_ptr buffer_; - - const flatbuf::Footer* footer_; -}; - -FileFooter::FileFooter() {} - -FileFooter::~FileFooter() {} - -Status FileFooter::Open( - const std::shared_ptr& buffer, std::unique_ptr* out) { - const flatbuf::Footer* footer = flatbuf::GetFooter(buffer->data()); - - *out = std::unique_ptr(new FileFooter()); - - // TODO(wesm): Verify the footer - (*out)->impl_.reset(new FileFooterImpl(buffer, footer)); - - return Status::OK(); -} - -int FileFooter::num_dictionaries() const { - return impl_->num_dictionaries(); -} - -int FileFooter::num_record_batches() const { - return impl_->num_record_batches(); -} - -MetadataVersion::type FileFooter::version() const { - return impl_->version(); -} - -FileBlock FileFooter::record_batch(int i) const { - return impl_->record_batch(i); -} - -FileBlock FileFooter::dictionary(int i) const { - return impl_->dictionary(i); -} - -Status FileFooter::GetSchema(std::shared_ptr* out) const { - return impl_->GetSchema(out); -} - } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h index 1c4ef64d62fad..6e15ef353d853 100644 --- a/cpp/src/arrow/ipc/metadata.h +++ b/cpp/src/arrow/ipc/metadata.h @@ -49,7 +49,7 @@ struct MetadataVersion { // Serialize arrow::Schema as a Flatbuffer ARROW_EXPORT -Status WriteSchema(const Schema* schema, std::shared_ptr* out); +Status WriteSchema(const Schema& schema, std::shared_ptr* out); // Read interface classes. We do not fully deserialize the flatbuffers so that // individual fields metadata can be retrieved from very large schema without @@ -149,10 +149,8 @@ class ARROW_EXPORT Message { std::unique_ptr impl_; }; -// ---------------------------------------------------------------------- -// File footer for file-like representation - struct FileBlock { + FileBlock() {} FileBlock(int64_t offset, int32_t metadata_length, int64_t body_length) : offset(offset), metadata_length(metadata_length), body_length(body_length) {} @@ -161,32 +159,6 @@ struct FileBlock { int64_t body_length; }; -ARROW_EXPORT -Status WriteFileFooter(const Schema* schema, const std::vector& dictionaries, - const std::vector& record_batches, io::OutputStream* out); - -class ARROW_EXPORT FileFooter { - public: - ~FileFooter(); - - static Status Open( - const std::shared_ptr& buffer, std::unique_ptr* out); - - int num_dictionaries() const; - int num_record_batches() const; - MetadataVersion::type version() const; - - FileBlock record_batch(int i) const; - FileBlock dictionary(int i) const; - - Status GetSchema(std::shared_ptr* out) const; - - private: - FileFooter(); - class FileFooterImpl; - std::unique_ptr impl_; -}; - } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/stream.cc b/cpp/src/arrow/ipc/stream.cc new file mode 100644 index 0000000000000..a2ca672fbe0aa --- /dev/null +++ b/cpp/src/arrow/ipc/stream.cc @@ -0,0 +1,206 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. 
See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/ipc/stream.h" + +#include +#include +#include +#include + +#include "arrow/buffer.h" +#include "arrow/io/interfaces.h" +#include "arrow/io/memory.h" +#include "arrow/ipc/adapter.h" +#include "arrow/ipc/metadata.h" +#include "arrow/ipc/util.h" +#include "arrow/schema.h" +#include "arrow/status.h" +#include "arrow/util/logging.h" + +namespace arrow { +namespace ipc { + +// ---------------------------------------------------------------------- +// Stream writer implementation + +StreamWriter::~StreamWriter() {} + +StreamWriter::StreamWriter(io::OutputStream* sink, const std::shared_ptr& schema) + : sink_(sink), schema_(schema), position_(-1), started_(false) {} + +Status StreamWriter::UpdatePosition() { + return sink_->Tell(&position_); +} + +Status StreamWriter::Write(const uint8_t* data, int64_t nbytes) { + RETURN_NOT_OK(sink_->Write(data, nbytes)); + position_ += nbytes; + return Status::OK(); +} + +Status StreamWriter::Align() { + int64_t remainder = PaddedLength(position_) - position_; + if (remainder > 0) { return Write(kPaddingBytes, remainder); } + return Status::OK(); +} + +Status StreamWriter::WriteAligned(const uint8_t* data, int64_t nbytes) { + RETURN_NOT_OK(Write(data, nbytes)); + return Align(); +} + +Status StreamWriter::CheckStarted() { + if (!started_) { return Start(); } + return Status::OK(); +} + +Status StreamWriter::WriteRecordBatch(const RecordBatch& batch, FileBlock* block) { + RETURN_NOT_OK(CheckStarted()); + + block->offset = position_; + + // Frame of reference in file format is 0, see ARROW-384 + const int64_t buffer_start_offset = 0; + RETURN_NOT_OK(arrow::ipc::WriteRecordBatch( + batch, buffer_start_offset, sink_, &block->metadata_length, &block->body_length)); + RETURN_NOT_OK(UpdatePosition()); + + DCHECK(position_ % 8 == 0) << "WriteRecordBatch did not perform aligned writes"; + + return Status::OK(); +} + +// ---------------------------------------------------------------------- +// StreamWriter implementation + +Status StreamWriter::Open(io::OutputStream* sink, const std::shared_ptr& schema, + std::shared_ptr* out) { + // ctor is private + *out = std::shared_ptr(new StreamWriter(sink, schema)); + RETURN_NOT_OK((*out)->UpdatePosition()); + return Status::OK(); +} + +Status StreamWriter::Start() { + std::shared_ptr schema_fb; + RETURN_NOT_OK(WriteSchema(*schema_, &schema_fb)); + + int32_t flatbuffer_size = schema_fb->size(); + RETURN_NOT_OK( + Write(reinterpret_cast(&flatbuffer_size), sizeof(int32_t))); + + // Write the flatbuffer + RETURN_NOT_OK(Write(schema_fb->data(), flatbuffer_size)); + started_ = true; + return Status::OK(); +} + +Status StreamWriter::WriteRecordBatch(const RecordBatch& batch) { + // Pass FileBlock, but results not used + FileBlock dummy_block; + return WriteRecordBatch(batch, &dummy_block); +} + +Status StreamWriter::Close() { + 
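+ // CheckStarted() ensures the schema message has been written even when no record batches + // were appended, so an empty stream is still well formed.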
// Close the stream + RETURN_NOT_OK(CheckStarted()); + return sink_->Close(); +} + +// ---------------------------------------------------------------------- +// StreamReader implementation + +StreamReader::StreamReader(const std::shared_ptr& stream) + : stream_(stream), schema_(nullptr) {} + +StreamReader::~StreamReader() {} + +Status StreamReader::Open(const std::shared_ptr& stream, + std::shared_ptr* reader) { + // Private ctor + *reader = std::shared_ptr(new StreamReader(stream)); + return (*reader)->ReadSchema(); +} + +Status StreamReader::ReadSchema() { + std::shared_ptr message; + RETURN_NOT_OK(ReadNextMessage(&message)); + + if (message->type() != Message::SCHEMA) { + return Status::IOError("First message was not schema type"); + } + + SchemaMetadata schema_meta(message); + + // TODO(wesm): If the schema contains dictionaries, we must read all the + // dictionaries from the stream before constructing the final Schema + return schema_meta.GetSchema(&schema_); +} + +Status StreamReader::ReadNextMessage(std::shared_ptr* message) { + std::shared_ptr buffer; + RETURN_NOT_OK(stream_->Read(sizeof(int32_t), &buffer)); + + if (buffer->size() != sizeof(int32_t)) { + *message = nullptr; + return Status::OK(); + } + + int32_t message_length = *reinterpret_cast(buffer->data()); + + RETURN_NOT_OK(stream_->Read(message_length, &buffer)); + if (buffer->size() != message_length) { + return Status::IOError("Unexpected end of stream trying to read message"); + } + return Message::Open(buffer, 0, message); +} + +std::shared_ptr StreamReader::schema() const { + return schema_; +} + +Status StreamReader::GetNextRecordBatch(std::shared_ptr* batch) { + std::shared_ptr message; + RETURN_NOT_OK(ReadNextMessage(&message)); + + if (message == nullptr) { + // End of stream + *batch = nullptr; + return Status::OK(); + } + + if (message->type() != Message::RECORD_BATCH) { + return Status::IOError("Metadata not record batch"); + } + + auto batch_metadata = std::make_shared(message); + + std::shared_ptr batch_body; + RETURN_NOT_OK(stream_->Read(message->body_length(), &batch_body)); + + if (batch_body->size() < message->body_length()) { + return Status::IOError("Unexpected EOS when reading message body"); + } + + io::BufferReader reader(batch_body); + + return ReadRecordBatch(batch_metadata, schema_, &reader, batch); +} + +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/stream.h b/cpp/src/arrow/ipc/stream.h new file mode 100644 index 0000000000000..0b0e62f13fc5f --- /dev/null +++ b/cpp/src/arrow/ipc/stream.h @@ -0,0 +1,112 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
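+ +// Sketch of the wire format handled by the classes declared here: a length-prefixed Schema +// message, then zero or more length-prefixed RecordBatch messages, each followed by its +// 8-byte-aligned body; end of stream is reached when no further length prefix can be read.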
+ +// Implement Arrow streaming binary format + +#ifndef ARROW_IPC_STREAM_H +#define ARROW_IPC_STREAM_H + +#include +#include + +#include "arrow/util/visibility.h" + +namespace arrow { + +class Array; +class Buffer; +struct Field; +class RecordBatch; +class Schema; +class Status; + +namespace io { + +class InputStream; +class OutputStream; + +} // namespace io + +namespace ipc { + +struct FileBlock; +class Message; + +class ARROW_EXPORT StreamWriter { + public: + virtual ~StreamWriter(); + + static Status Open(io::OutputStream* sink, const std::shared_ptr& schema, + std::shared_ptr* out); + + virtual Status WriteRecordBatch(const RecordBatch& batch); + virtual Status Close(); + + protected: + StreamWriter(io::OutputStream* sink, const std::shared_ptr& schema); + + virtual Status Start(); + + Status CheckStarted(); + Status UpdatePosition(); + + Status WriteRecordBatch(const RecordBatch& batch, FileBlock* block); + + // Adds padding bytes if necessary to ensure all memory blocks are written on + // 8-byte boundaries. + Status Align(); + + // Write data and update position + Status Write(const uint8_t* data, int64_t nbytes); + + // Write and align + Status WriteAligned(const uint8_t* data, int64_t nbytes); + + io::OutputStream* sink_; + std::shared_ptr schema_; + int64_t position_; + bool started_; +}; + +class ARROW_EXPORT StreamReader { + public: + ~StreamReader(); + + // Open a stream. + static Status Open(const std::shared_ptr& stream, + std::shared_ptr* reader); + + std::shared_ptr schema() const; + + // Returned batch is nullptr when end of stream reached + Status GetNextRecordBatch(std::shared_ptr* batch); + + private: + explicit StreamReader(const std::shared_ptr& stream); + + Status ReadSchema(); + + Status ReadNextMessage(std::shared_ptr* message); + + std::shared_ptr stream_; + std::shared_ptr schema_; +}; + +} // namespace ipc +} // namespace arrow + +#endif // ARROW_IPC_STREAM_H diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index 3faeebf956966..ca790ded92191 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -36,6 +36,15 @@ namespace arrow { namespace ipc { +static inline void AssertSchemaEqual(const Schema& lhs, const Schema& rhs) { + if (!lhs.Equals(rhs)) { + std::stringstream ss; + ss << "left schema: " << lhs.ToString() << std::endl + << "right schema: " << rhs.ToString() << std::endl; + FAIL() << ss.str(); + } +} + const auto kListInt32 = list(int32()); const auto kListListInt32 = list(kListInt32); diff --git a/python/pyarrow/includes/libarrow_ipc.pxd b/python/pyarrow/includes/libarrow_ipc.pxd index b3185b1c1671c..82957600d1eb6 100644 --- a/python/pyarrow/includes/libarrow_ipc.pxd +++ b/python/pyarrow/includes/libarrow_ipc.pxd @@ -29,8 +29,7 @@ cdef extern from "arrow/ipc/file.h" namespace "arrow::ipc" nogil: CStatus Open(OutputStream* sink, const shared_ptr[CSchema]& schema, shared_ptr[CFileWriter]* out) - CStatus WriteRecordBatch(const vector[shared_ptr[CArray]]& columns, - int32_t num_rows) + CStatus WriteRecordBatch(const CRecordBatch& batch) CStatus Close() diff --git a/python/pyarrow/ipc.pyx b/python/pyarrow/ipc.pyx index abc5e1b11ec4c..22069a7290ead 100644 --- a/python/pyarrow/ipc.pyx +++ b/python/pyarrow/ipc.pyx @@ -21,6 +21,8 @@ # distutils: language = c++ # cython: embedsignature = True +from cython.operator cimport dereference as deref + from pyarrow.includes.libarrow cimport * from pyarrow.includes.libarrow_io cimport * from pyarrow.includes.libarrow_ipc cimport * @@ -58,10 +60,9 @@ cdef class
ArrowFileWriter: self.close() def write_record_batch(self, RecordBatch batch): - cdef CRecordBatch* bptr = batch.batch with nogil: check_status(self.writer.get() - .WriteRecordBatch(bptr.columns(), bptr.num_rows())) + .WriteRecordBatch(deref(batch.batch))) def close(self): with nogil: diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index 6623e239880bc..feafa3dfc3875 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -254,16 +254,16 @@ struct arrow_traits { static constexpr bool is_numeric_nullable = false; }; -#define INT_DECL(TYPE) \ - template <> \ - struct arrow_traits { \ - static constexpr int npy_type = NPY_##TYPE; \ - static constexpr bool supports_nulls = false; \ - static constexpr double na_value = NAN; \ - static constexpr bool is_boolean = false; \ - static constexpr bool is_numeric_not_nullable = true; \ - static constexpr bool is_numeric_nullable = false; \ - typedef typename npy_traits::value_type T; \ +#define INT_DECL(TYPE) \ + template <> \ + struct arrow_traits { \ + static constexpr int npy_type = NPY_##TYPE; \ + static constexpr bool supports_nulls = false; \ + static constexpr double na_value = NAN; \ + static constexpr bool is_boolean = false; \ + static constexpr bool is_numeric_not_nullable = true; \ + static constexpr bool is_numeric_nullable = false; \ + typedef typename npy_traits::value_type T; \ }; INT_DECL(INT8); @@ -1803,7 +1803,7 @@ class ArrowDeserializer { // types Status Convert(PyObject** out) { -#define CONVERT_CASE(TYPE) \ +#define CONVERT_CASE(TYPE) \ case Type::TYPE: { \ RETURN_NOT_OK(ConvertValues()); \ } break; @@ -1857,8 +1857,7 @@ class ArrowDeserializer { } template - inline typename std::enable_if::type - ConvertValues() { + inline typename std::enable_if::type ConvertValues() { typedef typename arrow_traits::T T; RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); @@ -1910,24 +1909,21 @@ class ArrowDeserializer { // UTF8 strings template - inline typename std::enable_if::type - ConvertValues() { + inline typename std::enable_if::type ConvertValues() { RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); auto out_values = reinterpret_cast(PyArray_DATA(arr_)); return ConvertBinaryLike(data_, out_values); } template - inline typename std::enable_if::type - ConvertValues() { + inline typename std::enable_if::type ConvertValues() { RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); auto out_values = reinterpret_cast(PyArray_DATA(arr_)); return ConvertBinaryLike(data_, out_values); } template - inline typename std::enable_if::type - ConvertValues() { + inline typename std::enable_if::type ConvertValues() { std::shared_ptr block; RETURN_NOT_OK(MakeCategoricalBlock(col_->type(), col_->length(), &block)); RETURN_NOT_OK(block->Write(col_, 0, 0)); From 5a161ebc1961b4f784d51322b12fe09e8c8aa08d Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 22 Jan 2017 11:13:59 -0500 Subject: [PATCH 0286/1644] ARROW-505: [C++] Fix compiler warning in gcc in release mode Author: Wes McKinney Closes #294 from wesm/fix-release-compile-warning and squashes the following commits: 4189892 [Wes McKinney] Suppress another Cython compiler warning when compiling with clang 9a8ad54 [Wes McKinney] Fix compiler warning in gcc in release mode --- cpp/src/arrow/ipc/adapter.cc | 4 ++-- python/CMakeLists.txt | 3 +++ 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index 9da7b3912d4bc..c8e631c564b22 100644 --- 
a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -197,8 +197,8 @@ class RecordBatchWriter : public ArrayVisitor { Status GetTotalSize(int64_t* size) { // emulates the behavior of Write without actually writing - int32_t metadata_length; - int64_t body_length; + int32_t metadata_length = 0; + int64_t body_length = 0; MockOutputStream dst; RETURN_NOT_OK(Write(&dst, &metadata_length, &body_length)); *size = dst.GetExtentBytesWritten(); diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 0a2d4e9facba2..b3735b1d58653 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -87,6 +87,9 @@ if ("${COMPILER_FAMILY}" STREQUAL "clang") # http://petereisentraut.blogspot.com/2011/05/ccache-and-clang.html # http://petereisentraut.blogspot.com/2011/09/ccache-and-clang-part-2.html set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Qunused-arguments") + + # Cython warnings in clang + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-parentheses-equality") endif() set(PYARROW_LINK "a") From 53a478dfb278dcae5ca7f300b70857662553d118 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 23 Jan 2017 06:41:35 -0500 Subject: [PATCH 0287/1644] ARROW-475: [Python] Add support for reading multiple Parquet files as a single pyarrow.Table Also fixes a serious bug in which the data source passed to the ParquetReader gets garbage collected prematurely Also implements ARROW-470 Author: Wes McKinney Closes #296 from wesm/ARROW-475 and squashes the following commits: 894d2a2 [Wes McKinney] Implement Filesystem abstraction, add Filesystem.read_parquet. Implement rudimentary shim on local filesystem 3927c2c [Wes McKinney] Test read multiple Parquet from HDFS, fix premature garbage collection error 4904b3b [Wes McKinney] Implement read_multiple_files function for multiple Parquet files as a single Arrow table --- python/pyarrow/__init__.py | 6 +- python/pyarrow/_parquet.pyx | 3 + python/pyarrow/filesystem.py | 186 +++++++++++++++++++ python/pyarrow/includes/libarrow_io.pxd | 2 + python/pyarrow/io.pyx | 62 +++---- python/pyarrow/parquet.py | 88 +++++++-- python/pyarrow/table.pyx | 60 ++++-- python/pyarrow/tests/test_column.py | 49 ----- python/pyarrow/tests/test_convert_builtin.py | 3 +- python/pyarrow/tests/test_convert_pandas.py | 8 +- python/pyarrow/tests/test_hdfs.py | 46 ++++- python/pyarrow/tests/test_parquet.py | 155 ++++++++++++---- python/pyarrow/tests/test_scalars.py | 2 +- python/pyarrow/tests/test_schema.py | 1 - python/pyarrow/tests/test_table.py | 50 +++-- python/pyarrow/util.py | 25 +++ 16 files changed, 568 insertions(+), 178 deletions(-) create mode 100644 python/pyarrow/filesystem.py delete mode 100644 python/pyarrow/tests/test_column.py create mode 100644 python/pyarrow/util.py diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index efffbf2a4588d..d563c7aa4055d 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -42,7 +42,8 @@ from pyarrow.error import ArrowException -from pyarrow.io import (HdfsClient, HdfsFile, NativeFile, PythonFileInterface, +from pyarrow.filesystem import Filesystem, HdfsClient, LocalFilesystem +from pyarrow.io import (HdfsFile, NativeFile, PythonFileInterface, Buffer, InMemoryOutputStream, BufferReader) from pyarrow.scalar import (ArrayValue, Scalar, NA, NAType, @@ -61,3 +62,6 @@ DataType, Field, Schema, schema) from pyarrow.table import Column, RecordBatch, Table, concat_tables + + +localfs = LocalFilesystem() diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx index 867fc4cfecbd6..b11cee3a201fb 
100644 --- a/python/pyarrow/_parquet.pyx +++ b/python/pyarrow/_parquet.pyx @@ -341,6 +341,7 @@ cdef logical_type_name_from_enum(ParquetLogicalType type_): cdef class ParquetReader: cdef: + object source MemoryPool* allocator unique_ptr[FileReader] reader column_idx_map @@ -360,6 +361,8 @@ cdef class ParquetReader: if metadata is not None: c_metadata = metadata.sp_metadata + self.source = source + get_reader(source, &rd_handle) with nogil: check_status(OpenFile(rd_handle, self.allocator, properties, diff --git a/python/pyarrow/filesystem.py b/python/pyarrow/filesystem.py new file mode 100644 index 0000000000000..82409b7666ab1 --- /dev/null +++ b/python/pyarrow/filesystem.py @@ -0,0 +1,186 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +from os.path import join as pjoin +import os + +from pyarrow.util import implements +import pyarrow.io as io + + +class Filesystem(object): + """ + Abstract filesystem interface + """ + def ls(self, path): + """ + Return list of file paths + """ + raise NotImplementedError + + def delete(self, path, recursive=False): + """ + Delete the indicated file or directory + + Parameters + ---------- + path : string + recursive : boolean, default False + If True, also delete child paths for directories + """ + raise NotImplementedError + + def mkdir(self, path, create_parents=True): + raise NotImplementedError + + def exists(self, path): + raise NotImplementedError + + def isdir(self, path): + """ + Return True if path is a directory + """ + raise NotImplementedError + + def isfile(self, path): + """ + Return True if path is a file + """ + raise NotImplementedError + + def read_parquet(self, path, columns=None, schema=None): + """ + Read Parquet data from path in file system. 
Can read from a single file + or a directory of files + + Parameters + ---------- + path : str + Single file path or directory + columns : List[str], optional + Subset of columns to read + schema : pyarrow.parquet.Schema + Known schema to validate files against + + Returns + ------- + table : pyarrow.Table + """ + from pyarrow.parquet import read_multiple_files + + if self.isdir(path): + paths_to_read = [] + for path in self.ls(path): + if path == '_metadata' or path == '_common_metadata': + raise ValueError('No support yet for common metadata file') + paths_to_read.append(path) + else: + paths_to_read = [path] + + return read_multiple_files(paths_to_read, columns=columns, + filesystem=self, schema=schema) + + +class LocalFilesystem(Filesystem): + + @implements(Filesystem.ls) + def ls(self, path): + return sorted(pjoin(path, x) for x in os.listdir(path)) + + @implements(Filesystem.isdir) + def isdir(self, path): + return os.path.isdir(path) + + @implements(Filesystem.isfile) + def isfile(self, path): + return os.path.isfile(path) + + @implements(Filesystem.exists) + def exists(self, path): + return os.path.exists(path) + + def open(self, path, mode='rb'): + """ + Open file for reading or writing + """ + return open(path, mode=mode) + + +class HdfsClient(io._HdfsClient, Filesystem): + """ + Connect to an HDFS cluster. All parameters are optional and should + only be set if the defaults need to be overridden. + + Authentication should be automatic if the HDFS cluster uses Kerberos. + However, if a username is specified, then the ticket cache will likely + be required. + + Parameters + ---------- + host : NameNode. Set to "default" for fs.defaultFS from core-site.xml. + port : NameNode's port. Set to 0 for default or logical (HA) nodes. + user : Username when connecting to HDFS; None implies login user. + kerb_ticket : Path to Kerberos ticket cache. + driver : {'libhdfs', 'libhdfs3'}, default 'libhdfs' + Connect using libhdfs (JNI-based) or libhdfs3 (3rd-party C++ + library from Pivotal Labs) + + Notes + ----- + The first time you call this method, it will take longer than usual due + to JNI spin-up time. + + Returns + ------- + client : HDFSClient + """ + + def __init__(self, host="default", port=0, user=None, kerb_ticket=None, + driver='libhdfs'): + self._connect(host, port, user, kerb_ticket, driver) + + @implements(Filesystem.isdir) + def isdir(self, path): + return io._HdfsClient.isdir(self, path) + + @implements(Filesystem.isfile) + def isfile(self, path): + return io._HdfsClient.isfile(self, path) + + @implements(Filesystem.delete) + def delete(self, path, recursive=False): + return io._HdfsClient.delete(self, path, recursive) + + @implements(Filesystem.mkdir) + def mkdir(self, path, create_parents=True): + return io._HdfsClient.mkdir(self, path) + + def ls(self, path, full_info=False): + """ + Retrieve directory contents and metadata, if requested. 
+ + Parameters + ---------- + path : HDFS path + full_info : boolean, default False + If False, only return list of paths + + Returns + ------- + result : list of dicts (full_info=True) or strings (full_info=False) + """ + return io._HdfsClient.ls(self, path, full_info) diff --git a/python/pyarrow/includes/libarrow_io.pxd b/python/pyarrow/includes/libarrow_io.pxd index 417af7d67d1ab..31379386187ee 100644 --- a/python/pyarrow/includes/libarrow_io.pxd +++ b/python/pyarrow/includes/libarrow_io.pxd @@ -148,6 +148,8 @@ cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: CStatus ListDirectory(const c_string& path, vector[HdfsPathInfo]* listing) + CStatus GetPathInfo(const c_string& path, HdfsPathInfo* info) + CStatus Rename(const c_string& src, const c_string& dst) CStatus OpenReadable(const c_string& path, diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index 0f626f178abde..26215122b7a23 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -463,42 +463,17 @@ def strip_hdfs_abspath(path): return path -cdef class HdfsClient: +cdef class _HdfsClient: cdef: shared_ptr[CHdfsClient] client cdef readonly: bint is_open - def __cinit__(self, host="default", port=0, user=None, kerb_ticket=None, - driver='libhdfs'): - """ - Connect to an HDFS cluster. All parameters are optional and should - only be set if the defaults need to be overridden. - - Authentication should be automatic if the HDFS cluster uses Kerberos. - However, if a username is specified, then the ticket cache will likely - be required. + def __cinit__(self): + pass - Parameters - ---------- - host : NameNode. Set to "default" for fs.defaultFS from core-site.xml. - port : NameNode's port. Set to 0 for default or logical (HA) nodes. - user : Username when connecting to HDFS; None implies login user. - kerb_ticket : Path to Kerberos ticket cache. - driver : {'libhdfs', 'libhdfs3'}, default 'libhdfs' - Connect using libhdfs (JNI-based) or libhdfs3 (3rd-party C++ - library from Pivotal Labs) - - Notes - ----- - The first time you call this method, it will take longer than usual due - to JNI spin-up time. - - Returns - ------- - client : HDFSClient - """ + def _connect(self, host, port, user, kerb_ticket, driver): cdef HdfsConnectionConfig conf if host is not None: @@ -556,20 +531,25 @@ cdef class HdfsClient: result = self.client.get().Exists(c_path) return result - def ls(self, path, bint full_info=True): - """ - Retrieve directory contents and metadata, if requested. + def isdir(self, path): + cdef HdfsPathInfo info + self._path_info(path, &info) + return info.kind == ObjectType_DIRECTORY - Parameters - ---------- - path : HDFS path - full_info : boolean, default True - If False, only return list of paths + def isfile(self, path): + cdef HdfsPathInfo info + self._path_info(path, &info) + return info.kind == ObjectType_FILE - Returns - ------- - result : list of dicts (full_info=True) or strings (full_info=False) - """ + cdef _path_info(self, path, HdfsPathInfo* info): + cdef c_string c_path = tobytes(path) + + with nogil: + check_status(self.client.get() + .GetPathInfo(c_path, info)) + + + def ls(self, path, bint full_info): cdef: c_string c_path = tobytes(path) vector[HdfsPathInfo] listing diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py index 2a1ac9d2db7ed..cbe1c6e5d79d9 100644 --- a/python/pyarrow/parquet.py +++ b/python/pyarrow/parquet.py @@ -15,8 +15,10 @@ # specific language governing permissions and limitations # under the License. 
-import pyarrow._parquet as _parquet -from pyarrow.table import Table +from pyarrow._parquet import (ParquetReader, FileMetaData, # noqa + RowGroupMetaData, Schema, ParquetWriter) +import pyarrow._parquet as _parquet # noqa +from pyarrow.table import Table, concat_tables class ParquetFile(object): @@ -32,7 +34,7 @@ class ParquetFile(object): Use existing metadata object, rather than reading from file. """ def __init__(self, source, metadata=None): - self.reader = _parquet.ParquetReader() + self.reader = ParquetReader() self.reader.open(source, metadata=metadata) @property @@ -67,10 +69,10 @@ def read(self, nrows=None, columns=None): for column in columns] arrays = [self.reader.read_column(column_idx) for column_idx in column_idxs] - return Table.from_arrays(columns, arrays) + return Table.from_arrays(arrays, names=columns) -def read_table(source, columns=None): +def read_table(source, columns=None, metadata=None): """ Read a Table from Parquet format @@ -81,17 +83,79 @@ def read_table(source, columns=None): pyarrow.io.PythonFileInterface or pyarrow.io.BufferReader. columns: list If not None, only these columns will be read from the file. + metadata : FileMetaData + If separately computed Returns ------- - pyarrow.table.Table + pyarrow.Table Content of the file as a table (of columns) """ - return ParquetFile(source).read(columns=columns) + return ParquetFile(source, metadata=metadata).read(columns=columns) -def write_table(table, sink, chunk_size=None, version=None, - use_dictionary=True, compression=None): +def read_multiple_files(paths, columns=None, filesystem=None, metadata=None, + schema=None): + """ + Read multiple Parquet files as a single pyarrow.Table + + Parameters + ---------- + paths : List[str] + List of file paths + columns : List[str] + Names of columns to read from the file + filesystem : Filesystem, default None + If nothing passed, paths assumed to be found in the local on-disk + filesystem + metadata : pyarrow.parquet.FileMetaData + Use metadata obtained elsewhere to validate file schemas + schema : pyarrow.parquet.Schema + Use schema obtained elsewhere to validate file schemas. Alternative to + metadata parameter + + Returns + ------- + pyarrow.Table + Content of the file as a table (of columns) + """ + if filesystem is None: + def open_file(path, meta=None): + return ParquetFile(path, metadata=meta) + else: + def open_file(path, meta=None): + return ParquetFile(filesystem.open(path, mode='rb'), metadata=meta) + + if len(paths) == 0: + raise ValueError('Must pass at least one file path') + + if metadata is None and schema is None: + schema = open_file(paths[0]).schema + elif schema is None: + schema = metadata.schema + + # Verify schemas are all equal + all_file_metadata = [] + for path in paths: + file_metadata = open_file(path).metadata + if not schema.equals(file_metadata.schema): + raise ValueError('Schema in {0} was different. 
{1!s} vs {2!s}' + .format(path, file_metadata.schema, schema)) + all_file_metadata.append(file_metadata) + + # Read the tables + tables = [] + for path, path_metadata in zip(paths, all_file_metadata): + reader = open_file(path, meta=path_metadata) + table = reader.read(columns=columns) + tables.append(table) + + all_data = concat_tables(tables) + return all_data + + +def write_table(table, sink, chunk_size=None, version='1.0', + use_dictionary=True, compression='snappy'): """ Write a Table to Parquet format @@ -110,7 +174,7 @@ def write_table(table, sink, chunk_size=None, version=None, compression : str or dict Specify the compression codec, either on a general basis or per-column. """ - writer = _parquet.ParquetWriter(sink, use_dictionary=use_dictionary, - compression=compression, - version=version) + writer = ParquetWriter(sink, use_dictionary=use_dictionary, + compression=compression, + version=version) writer.write_table(table, row_group_size=chunk_size) diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index 0e3b2bd90dc64..924233066055e 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -265,16 +265,35 @@ cdef class Column: cdef _schema_from_arrays(arrays, names, shared_ptr[CSchema]* schema): cdef: Array arr + Column col c_string c_name vector[shared_ptr[CField]] fields + cdef shared_ptr[CDataType] type_ cdef int K = len(arrays) fields.resize(K) - for i in range(K): - arr = arrays[i] - c_name = tobytes(names[i]) - fields[i].reset(new CField(c_name, arr.type.sp_type, True)) + + if len(arrays) == 0: + raise ValueError('Must pass at least one array') + + if isinstance(arrays[0], Array): + if names is None: + raise ValueError('Must pass names when constructing ' + 'from Array objects') + for i in range(K): + arr = arrays[i] + type_ = arr.type.sp_type + c_name = tobytes(names[i]) + fields[i].reset(new CField(c_name, type_, True)) + elif isinstance(arrays[0], Column): + for i in range(K): + col = arrays[i] + type_ = col.sp_column.get().type() + c_name = tobytes(col.name) + fields[i].reset(new CField(c_name, type_, True)) + else: + raise TypeError(type(arrays[0])) schema.reset(new CSchema(fields)) @@ -429,19 +448,19 @@ cdef class RecordBatch: pyarrow.table.RecordBatch """ names, arrays = _dataframe_to_arrays(df, None, False, schema) - return cls.from_arrays(names, arrays) + return cls.from_arrays(arrays, names) @staticmethod - def from_arrays(names, arrays): + def from_arrays(arrays, names): """ Construct a RecordBatch from multiple pyarrow.Arrays Parameters ---------- - names: list of str - Labels for the columns arrays: list of pyarrow.Array column-wise data vectors + names: list of str + Labels for the columns Returns ------- @@ -594,20 +613,20 @@ cdef class Table: names, arrays = _dataframe_to_arrays(df, name=name, timestamps_to_ms=timestamps_to_ms, schema=schema) - return cls.from_arrays(names, arrays, name=name) + return cls.from_arrays(arrays, names=names, name=name) @staticmethod - def from_arrays(names, arrays, name=None): + def from_arrays(arrays, names=None, name=None): """ - Construct a Table from Arrow Arrays + Construct a Table from Arrow arrays or columns Parameters ---------- - - names: list of str - Names for the table columns - arrays: list of pyarrow.array.Array + arrays: list of pyarrow.Array or pyarrow.Column Equal-length arrays that should form the table. + names: list of str, optional + Names for the table columns. If Columns passed, will be + inferred. 
If Arrays passed, this argument is required name: str, optional name for the Table @@ -617,7 +636,6 @@ cdef class Table: """ cdef: - Array arr c_string c_name vector[shared_ptr[CField]] fields vector[shared_ptr[CColumn]] columns @@ -628,9 +646,15 @@ cdef class Table: cdef int K = len(arrays) columns.resize(K) + for i in range(K): - arr = arrays[i] - columns[i].reset(new CColumn(schema.get().field(i), arr.sp_array)) + if isinstance(arrays[i], Array): + columns[i].reset(new CColumn(schema.get().field(i), + ( arrays[i]).sp_array)) + elif isinstance(arrays[i], Column): + columns[i] = ( arrays[i]).sp_column + else: + raise ValueError(type(arrays[i])) if name is None: c_name = '' diff --git a/python/pyarrow/tests/test_column.py b/python/pyarrow/tests/test_column.py deleted file mode 100644 index 1a507c81030f8..0000000000000 --- a/python/pyarrow/tests/test_column.py +++ /dev/null @@ -1,49 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - -from pyarrow.compat import unittest -import pyarrow as arrow - -A = arrow - -import pandas as pd - - -class TestColumn(unittest.TestCase): - - def test_basics(self): - data = [ - A.from_pylist([-10, -5, 0, 5, 10]) - ] - table = A.Table.from_arrays(('a'), data, 'table_name') - column = table.column(0) - assert column.name == 'a' - assert column.length() == 5 - assert len(column) == 5 - assert column.shape == (5,) - assert column.to_pylist() == [-10, -5, 0, 5, 10] - - def test_pandas(self): - data = [ - A.from_pylist([-10, -5, 0, 5, 10]) - ] - table = A.Table.from_arrays(('a'), data, 'table_name') - column = table.column(0) - series = column.to_pandas() - assert series.name == 'a' - assert series.shape == (5,) - assert series.iloc[0] == -10 diff --git a/python/pyarrow/tests/test_convert_builtin.py b/python/pyarrow/tests/test_convert_builtin.py index 72e438910159f..c06d18d19c049 100644 --- a/python/pyarrow/tests/test_convert_builtin.py +++ b/python/pyarrow/tests/test_convert_builtin.py @@ -16,11 +16,12 @@ # specific language governing permissions and limitations # under the License. 
-from pyarrow.compat import unittest, u +from pyarrow.compat import unittest, u # noqa import pyarrow import datetime + class TestConvertList(unittest.TestCase): def test_boolean(self): diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index a2f50620d8925..30705c4ca2a20 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -74,7 +74,7 @@ def _check_pandas_roundtrip(self, df, expected=None, nthreads=1, tm.assert_frame_equal(result, expected) def _check_array_roundtrip(self, values, expected=None, - timestamps_to_ms=False, field=None): + timestamps_to_ms=False, field=None): arr = A.Array.from_pandas(values, timestamps_to_ms=timestamps_to_ms, field=field) result = arr.to_pandas() @@ -118,7 +118,7 @@ def test_float_nulls(self): ex_frame = pd.DataFrame(dict(zip(names, expected_cols)), columns=names) - table = A.Table.from_arrays(names, arrays) + table = A.Table.from_arrays(arrays, names) assert table.schema.equals(A.Schema.from_fields(fields)) result = table.to_pandas() tm.assert_frame_equal(result, ex_frame) @@ -169,7 +169,7 @@ def test_integer_with_nulls(self): ex_frame = pd.DataFrame(dict(zip(int_dtypes, expected_cols)), columns=int_dtypes) - table = A.Table.from_arrays(int_dtypes, arrays) + table = A.Table.from_arrays(arrays, int_dtypes) result = table.to_pandas() tm.assert_frame_equal(result, ex_frame) @@ -201,7 +201,7 @@ def test_boolean_nulls(self): schema = A.Schema.from_fields([field]) ex_frame = pd.DataFrame({'bools': expected}) - table = A.Table.from_arrays(['bools'], [arr]) + table = A.Table.from_arrays([arr], ['bools']) assert table.schema.equals(schema) result = table.to_pandas() diff --git a/python/pyarrow/tests/test_hdfs.py b/python/pyarrow/tests/test_hdfs.py index 2056f7ab589da..cb24adb73adc9 100644 --- a/python/pyarrow/tests/test_hdfs.py +++ b/python/pyarrow/tests/test_hdfs.py @@ -21,9 +21,16 @@ import random import unittest +import numpy as np +import pandas.util.testing as pdt import pytest +from pyarrow.compat import guid +from pyarrow.filesystem import HdfsClient import pyarrow.io as io +import pyarrow as pa + +import pyarrow.tests.test_parquet as test_parquet # ---------------------------------------------------------------------- # HDFS tests @@ -38,7 +45,7 @@ def hdfs_test_client(driver='libhdfs'): raise ValueError('Env variable ARROW_HDFS_TEST_PORT was not ' 'an integer') - return io.HdfsClient(host, port, user, driver=driver) + return HdfsClient(host, port, user, driver=driver) class HdfsTestCases(object): @@ -138,6 +145,43 @@ def test_hdfs_read_whole_file(self): assert result == data + @test_parquet.parquet + def test_hdfs_read_multiple_parquet_files(self): + import pyarrow.parquet as pq + + nfiles = 10 + size = 5 + + tmpdir = pjoin(self.tmp_path, 'multi-parquet-' + guid()) + + self.hdfs.mkdir(tmpdir) + + test_data = [] + paths = [] + for i in range(nfiles): + df = test_parquet._test_dataframe(size, seed=i) + + df['index'] = np.arange(i * size, (i + 1) * size) + + # Hack so that we don't have a dtype cast in v1 files + df['uint32'] = df['uint32'].astype(np.int64) + + path = pjoin(tmpdir, '{0}.parquet'.format(i)) + + table = pa.Table.from_pandas(df) + with self.hdfs.open(path, 'wb') as f: + pq.write_table(table, f) + + test_data.append(table) + paths.append(path) + + result = self.hdfs.read_parquet(tmpdir) + expected = pa.concat_tables(test_data) + + pdt.assert_frame_equal(result.to_pandas() + .sort_values(by='index').reset_index(drop=True), + 
expected.to_pandas()) + class TestLibHdfs(HdfsTestCases, unittest.TestCase): diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index 9cf860ac28a10..a94fe456d3b2b 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -15,10 +15,13 @@ # specific language governing permissions and limitations # under the License. +from os.path import join as pjoin import io +import os import pytest -import pyarrow as A +from pyarrow.compat import guid +import pyarrow as pa import pyarrow.io as paio import numpy as np @@ -42,9 +45,9 @@ def test_single_pylist_column_roundtrip(tmpdir): for dtype in [int, float]: filename = tmpdir.join('single_{}_column.parquet' .format(dtype.__name__)) - data = [A.from_pylist(list(map(dtype, range(5))))] - table = A.Table.from_arrays(('a', 'b'), data, 'table_name') - A.parquet.write_table(table, filename.strpath) + data = [pa.from_pylist(list(map(dtype, range(5))))] + table = pa.Table.from_arrays(data, names=('a', 'b'), name='table_name') + pq.write_table(table, filename.strpath) table_read = pq.read_table(filename.strpath) for col_written, col_read in zip(table.itercolumns(), table_read.itercolumns()): @@ -85,8 +88,8 @@ def test_pandas_parquet_2_0_rountrip(tmpdir): df = alltypes_sample(size=10000) filename = tmpdir.join('pandas_rountrip.parquet') - arrow_table = A.Table.from_pandas(df, timestamps_to_ms=True) - A.parquet.write_table(arrow_table, filename.strpath, version="2.0") + arrow_table = pa.Table.from_pandas(df, timestamps_to_ms=True) + pq.write_table(arrow_table, filename.strpath, version="2.0") table_read = pq.read_table(filename.strpath) df_read = table_read.to_pandas() pdt.assert_frame_equal(df, df_read) @@ -113,8 +116,8 @@ def test_pandas_parquet_1_0_rountrip(tmpdir): 'empty_str': [''] * size }) filename = tmpdir.join('pandas_rountrip.parquet') - arrow_table = A.Table.from_pandas(df) - A.parquet.write_table(arrow_table, filename.strpath, version="1.0") + arrow_table = pa.Table.from_pandas(df) + pq.write_table(arrow_table, filename.strpath, version="1.0") table_read = pq.read_table(filename.strpath) df_read = table_read.to_pandas() @@ -133,28 +136,39 @@ def test_pandas_column_selection(tmpdir): 'uint16': np.arange(size, dtype=np.uint16) }) filename = tmpdir.join('pandas_rountrip.parquet') - arrow_table = A.Table.from_pandas(df) - A.parquet.write_table(arrow_table, filename.strpath) + arrow_table = pa.Table.from_pandas(df) + pq.write_table(arrow_table, filename.strpath) table_read = pq.read_table(filename.strpath, columns=['uint8']) df_read = table_read.to_pandas() pdt.assert_frame_equal(df[['uint8']], df_read) -def _test_dataframe(size=10000): - np.random.seed(0) +def _random_integers(size, dtype): + # We do not generate integers outside the int64 range + i64_info = np.iinfo('int64') + iinfo = np.iinfo(dtype) + return np.random.randint(max(iinfo.min, i64_info.min), + min(iinfo.max, i64_info.max), + size=size).astype(dtype) + + +def _test_dataframe(size=10000, seed=0): + np.random.seed(seed) df = pd.DataFrame({ - 'uint8': np.arange(size, dtype=np.uint8), - 'uint16': np.arange(size, dtype=np.uint16), - 'uint32': np.arange(size, dtype=np.uint32), - 'uint64': np.arange(size, dtype=np.uint64), - 'int8': np.arange(size, dtype=np.int16), - 'int16': np.arange(size, dtype=np.int16), - 'int32': np.arange(size, dtype=np.int32), - 'int64': np.arange(size, dtype=np.int64), - 'float32': np.arange(size, dtype=np.float32), + 'uint8': _random_integers(size, np.uint8), + 'uint16': _random_integers(size, 
np.uint16), + 'uint32': _random_integers(size, np.uint32), + 'uint64': _random_integers(size, np.uint64), + 'int8': _random_integers(size, np.int8), + 'int16': _random_integers(size, np.int16), + 'int32': _random_integers(size, np.int32), + 'int64': _random_integers(size, np.int64), + 'float32': np.random.randn(size).astype(np.float32), + 'float64': np.random.randn(size), 'float64': np.arange(size, dtype=np.float64), - 'bool': np.random.randn(size) > 0 + 'bool': np.random.randn(size) > 0, + 'strings': [pdt.rands(10) for i in range(size)] }) return df @@ -162,7 +176,7 @@ def _test_dataframe(size=10000): @parquet def test_pandas_parquet_native_file_roundtrip(tmpdir): df = _test_dataframe(10000) - arrow_table = A.Table.from_pandas(df) + arrow_table = pa.Table.from_pandas(df) imos = paio.InMemoryOutputStream() pq.write_table(arrow_table, imos, version="2.0") buf = imos.get_result() @@ -183,10 +197,10 @@ def test_pandas_parquet_pyfile_roundtrip(tmpdir): 'strings': ['foo', 'bar', None, 'baz', 'qux'] }) - arrow_table = A.Table.from_pandas(df) + arrow_table = pa.Table.from_pandas(df) with open(filename, 'wb') as f: - A.parquet.write_table(arrow_table, f, version="1.0") + pq.write_table(arrow_table, f, version="1.0") data = io.BytesIO(open(filename, 'rb').read()) @@ -213,31 +227,27 @@ def test_pandas_parquet_configuration_options(tmpdir): 'bool': np.random.randn(size) > 0 }) filename = tmpdir.join('pandas_rountrip.parquet') - arrow_table = A.Table.from_pandas(df) + arrow_table = pa.Table.from_pandas(df) for use_dictionary in [True, False]: - A.parquet.write_table( - arrow_table, - filename.strpath, - version="2.0", - use_dictionary=use_dictionary) + pq.write_table(arrow_table, filename.strpath, + version="2.0", + use_dictionary=use_dictionary) table_read = pq.read_table(filename.strpath) df_read = table_read.to_pandas() pdt.assert_frame_equal(df, df_read) for compression in ['NONE', 'SNAPPY', 'GZIP']: - A.parquet.write_table( - arrow_table, - filename.strpath, - version="2.0", - compression=compression) + pq.write_table(arrow_table, filename.strpath, + version="2.0", + compression=compression) table_read = pq.read_table(filename.strpath) df_read = table_read.to_pandas() pdt.assert_frame_equal(df, df_read) def make_sample_file(df): - a_table = A.Table.from_pandas(df, timestamps_to_ms=True) + a_table = pa.Table.from_pandas(df, timestamps_to_ms=True) buf = io.BytesIO() pq.write_table(a_table, buf, compression='SNAPPY', version='2.0') @@ -315,7 +325,7 @@ def test_pass_separate_metadata(): # ARROW-471 df = alltypes_sample(size=10000) - a_table = A.Table.from_pandas(df, timestamps_to_ms=True) + a_table = pa.Table.from_pandas(df, timestamps_to_ms=True) buf = io.BytesIO() pq.write_table(a_table, buf, compression='snappy', version='2.0') @@ -328,3 +338,72 @@ def test_pass_separate_metadata(): fileh = pq.ParquetFile(buf, metadata=metadata) pdt.assert_frame_equal(df, fileh.read().to_pandas()) + + +@parquet +def test_read_multiple_files(tmpdir): + nfiles = 10 + size = 5 + + dirpath = tmpdir.join(guid()).strpath + os.mkdir(dirpath) + + test_data = [] + paths = [] + for i in range(nfiles): + df = _test_dataframe(size, seed=i) + + # Hack so that we don't have a dtype cast in v1 files + df['uint32'] = df['uint32'].astype(np.int64) + + path = pjoin(dirpath, '{0}.parquet'.format(i)) + + table = pa.Table.from_pandas(df) + pq.write_table(table, path) + + test_data.append(table) + paths.append(path) + + result = pq.read_multiple_files(paths) + expected = pa.concat_tables(test_data) + + assert result.equals(expected) + 
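+    # What follows exercises the remaining read_multiple_files code paths:
+    # reading with externally supplied metadata, reading through the
+    # pa.localfs shim, selecting a column subset, and the failure modes
+    # for files whose schemas do not match.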
+ # Read with provided metadata + metadata = pq.ParquetFile(paths[0]).metadata + + result2 = pq.read_multiple_files(paths, metadata=metadata) + assert result2.equals(expected) + + result3 = pa.localfs.read_parquet(dirpath, schema=metadata.schema) + assert result3.equals(expected) + + # Read column subset + to_read = [result[0], result[3], result[6]] + result = pa.localfs.read_parquet( + dirpath, columns=[c.name for c in to_read]) + expected = pa.Table.from_arrays(to_read) + assert result.equals(expected) + + # Test failure modes with non-uniform metadata + bad_apple = _test_dataframe(size, seed=i).iloc[:, :4] + bad_apple_path = tmpdir.join('{0}.parquet'.format(guid())).strpath + + t = pa.Table.from_pandas(bad_apple) + pq.write_table(t, bad_apple_path) + + bad_meta = pq.ParquetFile(bad_apple_path).metadata + + with pytest.raises(ValueError): + pq.read_multiple_files(paths + [bad_apple_path]) + + with pytest.raises(ValueError): + pq.read_multiple_files(paths, metadata=bad_meta) + + mixed_paths = [bad_apple_path, paths[0]] + + with pytest.raises(ValueError): + pq.read_multiple_files(mixed_paths, schema=bad_meta.schema) + + with pytest.raises(ValueError): + pq.read_multiple_files(mixed_paths) diff --git a/python/pyarrow/tests/test_scalars.py b/python/pyarrow/tests/test_scalars.py index 62e51f8dee846..ef600a06296cb 100644 --- a/python/pyarrow/tests/test_scalars.py +++ b/python/pyarrow/tests/test_scalars.py @@ -32,7 +32,7 @@ def test_bool(self): v = arr[0] assert isinstance(v, A.BooleanValue) assert repr(v) == "True" - assert v.as_py() == True + assert v.as_py() is True assert arr[1] is A.NA diff --git a/python/pyarrow/tests/test_schema.py b/python/pyarrow/tests/test_schema.py index 4aa8112a91769..507ebb878d87b 100644 --- a/python/pyarrow/tests/test_schema.py +++ b/python/pyarrow/tests/test_schema.py @@ -85,4 +85,3 @@ def test_schema_equals(self): del fields[-1] sch3 = A.schema(fields) assert not sch1.equals(sch3) - diff --git a/python/pyarrow/tests/test_table.py b/python/pyarrow/tests/test_table.py index 6f00c7391f66d..d49b33c9f42d6 100644 --- a/python/pyarrow/tests/test_table.py +++ b/python/pyarrow/tests/test_table.py @@ -21,16 +21,43 @@ import pandas as pd import pytest +from pyarrow.compat import unittest import pyarrow as pa +class TestColumn(unittest.TestCase): + + def test_basics(self): + data = [ + pa.from_pylist([-10, -5, 0, 5, 10]) + ] + table = pa.Table.from_arrays(data, names=['a'], name='table_name') + column = table.column(0) + assert column.name == 'a' + assert column.length() == 5 + assert len(column) == 5 + assert column.shape == (5,) + assert column.to_pylist() == [-10, -5, 0, 5, 10] + + def test_pandas(self): + data = [ + pa.from_pylist([-10, -5, 0, 5, 10]) + ] + table = pa.Table.from_arrays(data, names=['a'], name='table_name') + column = table.column(0) + series = column.to_pandas() + assert series.name == 'a' + assert series.shape == (5,) + assert series.iloc[0] == -10 + + def test_recordbatch_basics(): data = [ pa.from_pylist(range(5)), pa.from_pylist([-10, -5, 0, 5, 10]) ] - batch = pa.RecordBatch.from_arrays(['c0', 'c1'], data) + batch = pa.RecordBatch.from_arrays(data, ['c0', 'c1']) assert len(batch) == 5 assert batch.num_rows == 5 @@ -95,7 +122,7 @@ def test_table_basics(): pa.from_pylist(range(5)), pa.from_pylist([-10, -5, 0, 5, 10]) ] - table = pa.Table.from_arrays(('a', 'b'), data, 'table_name') + table = pa.Table.from_arrays(data, names=('a', 'b'), name='table_name') assert table.name == 'table_name' assert len(table) == 5 assert table.num_rows == 5 @@ -121,19 
+148,19 @@ def test_concat_tables(): [1., 2., 3., 4., 5.] ] - t1 = pa.Table.from_arrays(('a', 'b'), [pa.from_pylist(x) - for x in data], 'table_name') - t2 = pa.Table.from_arrays(('a', 'b'), [pa.from_pylist(x) - for x in data2], 'table_name') + t1 = pa.Table.from_arrays([pa.from_pylist(x) for x in data], + names=('a', 'b'), name='table_name') + t2 = pa.Table.from_arrays([pa.from_pylist(x) for x in data2], + names=('a', 'b'), name='table_name') result = pa.concat_tables([t1, t2], output_name='foo') assert result.name == 'foo' assert len(result) == 10 - expected = pa.Table.from_arrays( - ('a', 'b'), [pa.from_pylist(x + y) - for x, y in zip(data, data2)], - 'foo') + expected = pa.Table.from_arrays([pa.from_pylist(x + y) + for x, y in zip(data, data2)], + names=('a', 'b'), + name='foo') assert result.equals(expected) @@ -143,7 +170,8 @@ def test_table_pandas(): pa.from_pylist(range(5)), pa.from_pylist([-10, -5, 0, 5, 10]) ] - table = pa.Table.from_arrays(('a', 'b'), data, 'table_name') + table = pa.Table.from_arrays(data, names=('a', 'b'), + name='table_name') # TODO: Use this part once from_pandas is implemented # data = {'a': range(5), 'b': [-10, -5, 0, 5, 10]} diff --git a/python/pyarrow/util.py b/python/pyarrow/util.py new file mode 100644 index 0000000000000..4b6a8356330d5 --- /dev/null +++ b/python/pyarrow/util.py @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# Miscellaneous utility code + + +def implements(f): + def decorator(g): + g.__doc__ = f.__doc__ + return g + return decorator From 69cdbd8ce665138ce35bb34d0bbe8087c0e9513e Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 23 Jan 2017 06:43:05 -0500 Subject: [PATCH 0288/1644] ARROW-494: [C++] Extend lifetime of memory mapped data if any buffers reference it If you read memory in an IPC scenario and then the `arrow::io::MemoryMappedFile` goes out of scope, before this patch the memory was being unmapped even if there are `arrow::Buffer` object referencing it. 
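From the Python side, the guarantee looks like the following minimal sketch. It mirrors the new test added to test_io.py below; the MemoryMappedFile constructor taking a path and mode, read_buffer, and to_pybytes are the pyarrow.io APIs exercised there, and the path is only a placeholder for any existing file of at least 100 bytes:

    import gc
    import pyarrow.io as io

    with io.MemoryMappedFile('/tmp/data.bin', 'rb') as f:
        # Each returned Buffer now keeps a reference to the memory map
        buf = f.read_buffer(100)

    # The MemoryMappedFile is closed and may be garbage collected ...
    gc.collect()

    # ... but the buffer still reads valid, mapped memory
    assert len(buf.to_pybytes()) == 100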
Author: Wes McKinney Closes #298 from wesm/ARROW-494 and squashes the following commits: 60222e3 [Wes McKinney] clang-format 2960d17 [Wes McKinney] Add C++ unit test d7d776a [Wes McKinney] Add Python unit test where memory mapped file is garbage collected edf1295 [Wes McKinney] Reimplement memory map owner as Buffer subclass so that MemoryMappedFile can be safely destructed without invalidating Buffers referencing the mapped data --- cpp/src/arrow/io/file.cc | 94 ++++++++++++++++++-------------- cpp/src/arrow/io/file.h | 7 +-- cpp/src/arrow/io/io-file-test.cc | 31 +++++++++++ python/pyarrow/tests/test_io.py | 20 ++++++- 4 files changed, 104 insertions(+), 48 deletions(-) diff --git a/cpp/src/arrow/io/file.cc b/cpp/src/arrow/io/file.cc index 1de6efa4d811f..3bf8dfa08f2ff 100644 --- a/cpp/src/arrow/io/file.cc +++ b/cpp/src/arrow/io/file.cc @@ -372,6 +372,8 @@ class OSFile { int64_t size() const { return size_; } + FileMode::type mode() const { return mode_; } + protected: std::string path_; @@ -513,14 +515,14 @@ int FileOutputStream::file_descriptor() const { // ---------------------------------------------------------------------- // Implement MemoryMappedFile -class MemoryMappedFile::MemoryMappedFileImpl : public OSFile { +class MemoryMappedFile::MemoryMap : public MutableBuffer { public: - MemoryMappedFileImpl() : OSFile(), data_(nullptr) {} + MemoryMap() : MutableBuffer(nullptr, 0) {} - ~MemoryMappedFileImpl() { - if (is_open_) { - munmap(data_, size_); - OSFile::Close(); + ~MemoryMap() { + if (file_->is_open()) { + munmap(mutable_data_, size_); + file_->Close(); } } @@ -528,27 +530,35 @@ class MemoryMappedFile::MemoryMappedFileImpl : public OSFile { int prot_flags; int map_mode; + file_.reset(new OSFile()); + if (mode != FileMode::READ) { // Memory mapping has permission failures if PROT_READ not set prot_flags = PROT_READ | PROT_WRITE; map_mode = MAP_SHARED; constexpr bool append = true; constexpr bool write_only = false; - RETURN_NOT_OK(OSFile::OpenWriteable(path, append, write_only)); - mode_ = mode; + RETURN_NOT_OK(file_->OpenWriteable(path, append, write_only)); + + is_mutable_ = true; } else { prot_flags = PROT_READ; map_mode = MAP_PRIVATE; // Changes are not to be committed back to the file - RETURN_NOT_OK(OSFile::OpenReadable(path)); + RETURN_NOT_OK(file_->OpenReadable(path)); + + is_mutable_ = false; } - void* result = mmap(nullptr, size_, prot_flags, map_mode, fd(), 0); + void* result = mmap(nullptr, file_->size(), prot_flags, map_mode, file_->fd(), 0); if (result == MAP_FAILED) { std::stringstream ss; ss << "Memory mapping file failed, errno: " << errno; return Status::IOError(ss.str()); } - data_ = reinterpret_cast(result); + + data_ = mutable_data_ = reinterpret_cast(result); + size_ = file_->size(); + position_ = 0; return Status::OK(); @@ -566,50 +576,45 @@ class MemoryMappedFile::MemoryMappedFileImpl : public OSFile { void advance(int64_t nbytes) { position_ = position_ + nbytes; } - uint8_t* data() { return data_; } + uint8_t* head() { return mutable_data_ + position_; } - uint8_t* head() { return data_ + position_; } + bool writable() { return file_->mode() != FileMode::READ; } - bool writable() { return mode_ != FileMode::READ; } + bool opened() { return file_->is_open(); } - bool opened() { return is_open_; } + int fd() const { return file_->fd(); } private: + std::unique_ptr file_; int64_t position_; - - // The memory map - uint8_t* data_; }; -MemoryMappedFile::MemoryMappedFile(FileMode::type mode) { - ReadableFileInterface::set_mode(mode); -} - 
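+// The open mode now lives with the OSFile owned by MemoryMap (set in
+// MemoryMap::Open), so the constructor no longer takes a mode argument.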
+MemoryMappedFile::MemoryMappedFile() {} MemoryMappedFile::~MemoryMappedFile() {} Status MemoryMappedFile::Open(const std::string& path, FileMode::type mode, std::shared_ptr* out) { - std::shared_ptr result(new MemoryMappedFile(mode)); + std::shared_ptr result(new MemoryMappedFile()); - result->impl_.reset(new MemoryMappedFileImpl()); - RETURN_NOT_OK(result->impl_->Open(path, mode)); + result->memory_map_.reset(new MemoryMap()); + RETURN_NOT_OK(result->memory_map_->Open(path, mode)); *out = result; return Status::OK(); } Status MemoryMappedFile::GetSize(int64_t* size) { - *size = impl_->size(); + *size = memory_map_->size(); return Status::OK(); } Status MemoryMappedFile::Tell(int64_t* position) { - *position = impl_->position(); + *position = memory_map_->position(); return Status::OK(); } Status MemoryMappedFile::Seek(int64_t position) { - return impl_->Seek(position); + return memory_map_->Seek(position); } Status MemoryMappedFile::Close() { @@ -618,19 +623,24 @@ Status MemoryMappedFile::Close() { } Status MemoryMappedFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) { - nbytes = std::max(0, std::min(nbytes, impl_->size() - impl_->position())); - if (nbytes > 0) { std::memcpy(out, impl_->head(), nbytes); } + nbytes = std::max( + 0, std::min(nbytes, memory_map_->size() - memory_map_->position())); + if (nbytes > 0) { std::memcpy(out, memory_map_->head(), nbytes); } *bytes_read = nbytes; - impl_->advance(nbytes); + memory_map_->advance(nbytes); return Status::OK(); } Status MemoryMappedFile::Read(int64_t nbytes, std::shared_ptr* out) { - nbytes = std::max(0, std::min(nbytes, impl_->size() - impl_->position())); + nbytes = std::max( + 0, std::min(nbytes, memory_map_->size() - memory_map_->position())); - const uint8_t* data = nbytes > 0 ? 
impl_->head() : nullptr; - *out = std::make_shared(data, nbytes); - impl_->advance(nbytes); + if (nbytes > 0) { + *out = SliceBuffer(memory_map_, memory_map_->position(), nbytes); + } else { + *out = std::make_shared(nullptr, 0); + } + memory_map_->advance(nbytes); return Status::OK(); } @@ -639,19 +649,19 @@ bool MemoryMappedFile::supports_zero_copy() const { } Status MemoryMappedFile::WriteAt(int64_t position, const uint8_t* data, int64_t nbytes) { - if (!impl_->opened() || !impl_->writable()) { + if (!memory_map_->opened() || !memory_map_->writable()) { return Status::IOError("Unable to write"); } - RETURN_NOT_OK(impl_->Seek(position)); + RETURN_NOT_OK(memory_map_->Seek(position)); return WriteInternal(data, nbytes); } Status MemoryMappedFile::Write(const uint8_t* data, int64_t nbytes) { - if (!impl_->opened() || !impl_->writable()) { + if (!memory_map_->opened() || !memory_map_->writable()) { return Status::IOError("Unable to write"); } - if (nbytes + impl_->position() > impl_->size()) { + if (nbytes + memory_map_->position() > memory_map_->size()) { return Status::Invalid("Cannot write past end of memory map"); } @@ -659,13 +669,13 @@ Status MemoryMappedFile::Write(const uint8_t* data, int64_t nbytes) { } Status MemoryMappedFile::WriteInternal(const uint8_t* data, int64_t nbytes) { - memcpy(impl_->head(), data, nbytes); - impl_->advance(nbytes); + memcpy(memory_map_->head(), data, nbytes); + memory_map_->advance(nbytes); return Status::OK(); } int MemoryMappedFile::file_descriptor() const { - return impl_->fd(); + return memory_map_->fd(); } } // namespace io diff --git a/cpp/src/arrow/io/file.h b/cpp/src/arrow/io/file.h index 2387232b2157a..930346b0518b3 100644 --- a/cpp/src/arrow/io/file.h +++ b/cpp/src/arrow/io/file.h @@ -130,13 +130,12 @@ class ARROW_EXPORT MemoryMappedFile : public ReadWriteFileInterface { int file_descriptor() const; private: - explicit MemoryMappedFile(FileMode::type mode); + MemoryMappedFile(); Status WriteInternal(const uint8_t* data, int64_t nbytes); - // Hide the internal details of this class for now - class ARROW_NO_EXPORT MemoryMappedFileImpl; - std::unique_ptr impl_; + class ARROW_NO_EXPORT MemoryMap; + std::shared_ptr memory_map_; }; } // namespace io diff --git a/cpp/src/arrow/io/io-file-test.cc b/cpp/src/arrow/io/io-file-test.cc index f18f7b649eb9b..999b296465544 100644 --- a/cpp/src/arrow/io/io-file-test.cc +++ b/cpp/src/arrow/io/io-file-test.cc @@ -396,6 +396,37 @@ TEST_F(TestMemoryMappedFile, ReadOnly) { rommap->Close(); } +TEST_F(TestMemoryMappedFile, RetainMemoryMapReference) { + // ARROW-494 + + const int64_t buffer_size = 1024; + std::vector buffer(buffer_size); + + test::random_bytes(1024, 0, buffer.data()); + + std::string path = "ipc-read-only-test"; + CreateFile(path, buffer_size); + + { + std::shared_ptr rwmmap; + ASSERT_OK(MemoryMappedFile::Open(path, FileMode::READWRITE, &rwmmap)); + ASSERT_OK(rwmmap->Write(buffer.data(), buffer_size)); + ASSERT_OK(rwmmap->Close()); + } + + std::shared_ptr out_buffer; + + { + std::shared_ptr rommap; + ASSERT_OK(MemoryMappedFile::Open(path, FileMode::READ, &rommap)); + ASSERT_OK(rommap->Read(buffer_size, &out_buffer)); + ASSERT_OK(rommap->Close()); + } + + // valgrind will catch if memory is unmapped + ASSERT_EQ(0, memcmp(out_buffer->data(), buffer.data(), buffer_size)); +} + TEST_F(TestMemoryMappedFile, InvalidMode) { const int64_t buffer_size = 1024; std::vector buffer(buffer_size); diff --git a/python/pyarrow/tests/test_io.py b/python/pyarrow/tests/test_io.py index f28d44a746c45..dfa84a27e6be9 100644 
--- a/python/pyarrow/tests/test_io.py +++ b/python/pyarrow/tests/test_io.py @@ -16,6 +16,7 @@ # under the License. from io import BytesIO +import gc import os import pytest @@ -163,9 +164,8 @@ def test_inmemory_write_after_closed(): # ---------------------------------------------------------------------- # OS files and memory maps -@pytest.fixture(scope='session') +@pytest.fixture def sample_disk_data(request): - SIZE = 4096 arr = np.random.randint(0, 256, size=SIZE).astype('u1') data = arr.tobytes()[:SIZE] @@ -206,6 +206,22 @@ def test_memory_map_reader(sample_disk_data): _check_native_file_reader(io.MemoryMappedFile, sample_disk_data) +def test_memory_map_retain_buffer_reference(sample_disk_data): + path, data = sample_disk_data + + cases = [] + with io.MemoryMappedFile(path, 'rb') as f: + cases.append((f.read_buffer(100), data[:100])) + cases.append((f.read_buffer(100), data[100:200])) + cases.append((f.read_buffer(100), data[200:300])) + + # Call gc.collect() for good measure + gc.collect() + + for buf, expected in cases: + assert buf.to_pybytes() == expected + + def test_os_file_reader(sample_disk_data): _check_native_file_reader(io.OSFile, sample_disk_data) From c327b5fd2c35788c90b3fef2bc7b5faf89c07e49 Mon Sep 17 00:00:00 2001 From: Nong Li Date: Mon, 23 Jan 2017 06:43:59 -0500 Subject: [PATCH 0289/1644] ARROW-506: Java: Implement echo server for integration testing. While implementing this, it became clear it made sense for the stream writer to have an API to indicate EOS without closing the stream. The current message the reader is expecting is a 4 byte size for the next batch. This patch proposes we allow 0 as the size to indicate EOS. Author: Nong Li Closes #295 from nongli/echo_server and squashes the following commits: c115b02 [Nong Li] Add license header. a3a50ca [Nong Li] ARROW-506: Java: Implement echo server for integration testing. --- .../org/apache/arrow/tools/EchoServer.java | 130 ++++++++++++++++++ .../apache/arrow/tools/EchoServerTest.java | 129 +++++++++++++++++ .../vector/stream/ArrowStreamWriter.java | 14 +- .../vector/stream/MessageSerializer.java | 7 +- .../arrow/vector/file/TestArrowFile.java | 4 +- .../arrow/vector/stream/TestArrowStream.java | 4 +- .../vector/stream/TestArrowStreamPipe.java | 2 +- 7 files changed, 278 insertions(+), 12 deletions(-) create mode 100644 java/tools/src/main/java/org/apache/arrow/tools/EchoServer.java create mode 100644 java/tools/src/test/java/org/apache/arrow/tools/EchoServerTest.java diff --git a/java/tools/src/main/java/org/apache/arrow/tools/EchoServer.java b/java/tools/src/main/java/org/apache/arrow/tools/EchoServer.java new file mode 100644 index 0000000000000..c00620e44b064 --- /dev/null +++ b/java/tools/src/main/java/org/apache/arrow/tools/EchoServer.java @@ -0,0 +1,130 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.tools; + +import java.io.IOException; +import java.io.InputStream; +import java.io.OutputStream; +import java.net.ServerSocket; +import java.net.Socket; +import java.util.ArrayList; +import java.util.List; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.stream.ArrowStreamReader; +import org.apache.arrow.vector.stream.ArrowStreamWriter; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import com.google.common.base.Preconditions; + +public class EchoServer { + private static final Logger LOGGER = LoggerFactory.getLogger(EchoServer.class); + + private boolean closed = false; + private final ServerSocket serverSocket; + + public EchoServer(int port) throws IOException { + LOGGER.info("Starting echo server."); + serverSocket = new ServerSocket(port); + LOGGER.info("Running echo server on port: " + port()); + } + + public int port() { return serverSocket.getLocalPort(); } + + public static class ClientConnection implements AutoCloseable { + public final Socket socket; + public ClientConnection(Socket socket) { + this.socket = socket; + } + + public void run() throws IOException { + BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE); + List batches = new ArrayList(); + try ( + InputStream in = socket.getInputStream(); + OutputStream out = socket.getOutputStream(); + ArrowStreamReader reader = new ArrowStreamReader(in, allocator); + ) { + // Read the entire input stream. + reader.init(); + while (true) { + ArrowRecordBatch batch = reader.nextRecordBatch(); + if (batch == null) break; + batches.add(batch); + } + LOGGER.info(String.format("Received %d batches", batches.size())); + + // Write it back + try (ArrowStreamWriter writer = new ArrowStreamWriter(out, reader.getSchema())) { + for (ArrowRecordBatch batch: batches) { + writer.writeRecordBatch(batch); + } + writer.end(); + Preconditions.checkState(reader.bytesRead() == writer.bytesWritten()); + } + LOGGER.info("Done writing stream back."); + } + } + + @Override + public void close() throws IOException { + socket.close(); + } + } + + public void run() throws IOException { + try { + while (!closed) { + LOGGER.info("Waiting to accept new client connection."); + Socket clientSocket = serverSocket.accept(); + LOGGER.info("Accepted new client connection."); + try (ClientConnection client = new ClientConnection(clientSocket)) { + try { + client.run(); + } catch (IOException e) { + LOGGER.warn("Error handling client connection.", e); + } + } + LOGGER.info("Closed connection with client"); + } + } catch (java.net.SocketException ex) { + if (!closed) throw ex; + } finally { + serverSocket.close(); + LOGGER.info("Server closed."); + } + } + + public void close() throws IOException { + closed = true; + serverSocket.close(); + } + + public static void main(String[] args) throws Exception { + int port; + if (args.length > 0) { + port = Integer.parseInt(args[0]); + } else { + port = 8080; + } + new EchoServer(port).run(); + } +} diff --git a/java/tools/src/test/java/org/apache/arrow/tools/EchoServerTest.java b/java/tools/src/test/java/org/apache/arrow/tools/EchoServerTest.java new file mode 100644 index 0000000000000..48d6162f423a3 --- /dev/null +++ b/java/tools/src/test/java/org/apache/arrow/tools/EchoServerTest.java @@ -0,0 +1,129 @@ +/** + * Licensed to the 
Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.tools; + +import static java.util.Arrays.asList; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +import java.io.IOException; +import java.net.Socket; +import java.net.UnknownHostException; +import java.util.ArrayList; +import java.util.Collections; +import java.util.List; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.schema.ArrowFieldNode; +import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.stream.ArrowStreamReader; +import org.apache.arrow.vector.stream.ArrowStreamWriter; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.Schema; +import org.junit.Test; + +import io.netty.buffer.ArrowBuf; + +public class EchoServerTest { + public static ArrowBuf buf(BufferAllocator alloc, byte[] bytes) { + ArrowBuf buffer = alloc.buffer(bytes.length); + buffer.writeBytes(bytes); + return buffer; + } + + public static byte[] array(ArrowBuf buf) { + byte[] bytes = new byte[buf.readableBytes()]; + buf.readBytes(bytes); + return bytes; + } + + private void testEchoServer(int serverPort, Schema schema, List batches) + throws UnknownHostException, IOException { + BufferAllocator alloc = new RootAllocator(Long.MAX_VALUE); + try (Socket socket = new Socket("localhost", serverPort); + ArrowStreamWriter writer = new ArrowStreamWriter(socket.getOutputStream(), schema); + ArrowStreamReader reader = new ArrowStreamReader(socket.getInputStream(), alloc)) { + for (ArrowRecordBatch batch: batches) { + writer.writeRecordBatch(batch); + } + writer.end(); + + reader.init(); + assertEquals(schema, reader.getSchema()); + for (int i = 0; i < batches.size(); i++) { + ArrowRecordBatch result = reader.nextRecordBatch(); + ArrowRecordBatch expected = batches.get(i); + assertTrue(result != null); + assertEquals(expected.getBuffers().size(), result.getBuffers().size()); + for (int j = 0; j < expected.getBuffers().size(); j++) { + assertTrue(expected.getBuffers().get(j).compareTo(result.getBuffers().get(j)) == 0); + } + } + ArrowRecordBatch result = reader.nextRecordBatch(); + assertTrue(result == null); + assertEquals(reader.bytesRead(), writer.bytesWritten()); + } + } + + @Test + public void basicTest() throws InterruptedException, IOException { + final EchoServer server = new EchoServer(0); + int serverPort = server.port(); + Thread serverThread = new Thread() { + @Override + public void run() { + try { + server.run(); + } catch (IOException e) { + e.printStackTrace(); + } + } + }; + serverThread.start(); + + BufferAllocator alloc = new RootAllocator(Long.MAX_VALUE); + 
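+    // Hand-built record batch of 16 int8 values: the first buffer is the
+    // validity bitmap (0xff, 0x00 marks slots 0-7 valid and 8-15 null,
+    // matching ArrowFieldNode(16, 8)), the second buffer holds the values.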
byte[] validity = new byte[] { (byte)255, 0}; + byte[] values = new byte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}; + ArrowBuf validityb = buf(alloc, validity); + ArrowBuf valuesb = buf(alloc, values); + ArrowRecordBatch batch = new ArrowRecordBatch( + 16, asList(new ArrowFieldNode(16, 8)), asList(validityb, valuesb)); + + Schema schema = new Schema(asList(new Field( + "testField", true, new ArrowType.Int(8, true), Collections.emptyList()))); + + // Try an empty stream, just the header. + testEchoServer(serverPort, schema, new ArrayList()); + + // Try with one batch. + List batches = new ArrayList<>(); + batches.add(batch); + testEchoServer(serverPort, schema, batches); + + // Try with a few + for (int i = 0; i < 10; i++) { + batches.add(batch); + } + testEchoServer(serverPort, schema, batches); + + server.close(); + serverThread.join(); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamWriter.java index 06acf9f2c140e..60dc5861c9242 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamWriter.java @@ -35,14 +35,14 @@ public class ArrowStreamWriter implements AutoCloseable { * Creates the stream writer. non-blocking. * totalBatches can be set if the writer knows beforehand. Can be -1 if unknown. */ - public ArrowStreamWriter(WritableByteChannel out, Schema schema, int totalBatches) { + public ArrowStreamWriter(WritableByteChannel out, Schema schema) { this.out = new WriteChannel(out); this.schema = schema; } - public ArrowStreamWriter(OutputStream out, Schema schema, int totalBatches) + public ArrowStreamWriter(OutputStream out, Schema schema) throws IOException { - this(Channels.newChannel(out), schema, totalBatches); + this(Channels.newChannel(out), schema); } public long bytesWritten() { return out.getCurrentPosition(); } @@ -53,6 +53,14 @@ public void writeRecordBatch(ArrowRecordBatch batch) throws IOException { MessageSerializer.serialize(out, batch); } + /** + * End the stream. This is not required and this object can simply be closed. + */ + public void end() throws IOException { + checkAndSendHeader(); + out.writeIntLittleEndian(0); + } + @Override public void close() throws IOException { // The header might not have been sent if this is an empty stream. Send it even in diff --git a/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java b/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java index 6e22dbd164d6e..7ab740c70782e 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java @@ -235,11 +235,10 @@ private static ByteBuffer serializeMessage(FlatBufferBuilder builder, byte heade private static Message deserializeMessage(ReadChannel in, byte headerType) throws IOException { // Read the message size. There is an i32 little endian prefix. 
ByteBuffer buffer = ByteBuffer.allocate(4); - if (in.readFully(buffer) != 4) { - return null; - } - + if (in.readFully(buffer) != 4) return null; int messageLength = bytesToInt(buffer.array()); + if (messageLength == 0) return null; + buffer = ByteBuffer.allocate(messageLength); if (in.readFully(buffer) != messageLength) { throw new IOException( diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java index 9b9914480bad0..a83a2833c88bf 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java @@ -232,7 +232,7 @@ public void testWriteReadMultipleRBs() throws IOException { Schema schema = vectorUnloader0.getSchema(); Assert.assertEquals(2, schema.getFields().size()); try (ArrowWriter arrowWriter = new ArrowWriter(fileOutputStream.getChannel(), schema); - ArrowStreamWriter streamWriter = new ArrowStreamWriter(stream, schema, 2)) { + ArrowStreamWriter streamWriter = new ArrowStreamWriter(stream, schema)) { try (ArrowRecordBatch recordBatch = vectorUnloader0.getRecordBatch()) { Assert.assertEquals("RB #0", counts[0], recordBatch.getLength()); arrowWriter.writeRecordBatch(recordBatch); @@ -399,7 +399,7 @@ private void write(FieldVector parent, File file, OutputStream outStream) throws // Also try serializing to the stream writer. if (outStream != null) { try ( - ArrowStreamWriter arrowWriter = new ArrowStreamWriter(outStream, schema, -1); + ArrowStreamWriter arrowWriter = new ArrowStreamWriter(outStream, schema); ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch(); ) { arrowWriter.writeRecordBatch(recordBatch); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStream.java b/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStream.java index ba1cdaeeb2262..725272a0f072e 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStream.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStream.java @@ -42,7 +42,7 @@ public void testEmptyStream() throws IOException { // Write the stream. 
ByteArrayOutputStream out = new ByteArrayOutputStream(); - try (ArrowStreamWriter writer = new ArrowStreamWriter(out, schema, -1)) { + try (ArrowStreamWriter writer = new ArrowStreamWriter(out, schema)) { } ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray()); @@ -66,7 +66,7 @@ public void testReadWrite() throws IOException { BufferAllocator alloc = new RootAllocator(Long.MAX_VALUE); ByteArrayOutputStream out = new ByteArrayOutputStream(); long bytesWritten = 0; - try (ArrowStreamWriter writer = new ArrowStreamWriter(out, schema, numBatches)) { + try (ArrowStreamWriter writer = new ArrowStreamWriter(out, schema)) { ArrowBuf validityb = MessageSerializerTest.buf(alloc, validity); ArrowBuf valuesb = MessageSerializerTest.buf(alloc, values); for (int i = 0; i < numBatches; i++) { diff --git a/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStreamPipe.java b/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStreamPipe.java index e187fa535cada..a0a7ffa279308 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStreamPipe.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStreamPipe.java @@ -47,7 +47,7 @@ private final class WriterThread extends Thread { public WriterThread(int numBatches, WritableByteChannel sinkChannel) throws IOException { this.numBatches = numBatches; - writer = new ArrowStreamWriter(sinkChannel, schema, -1); + writer = new ArrowStreamWriter(sinkChannel, schema); } @Override From 1f81adcc88b138c6ae5f5ffb3250f87239c89dc1 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 23 Jan 2017 09:10:18 -0500 Subject: [PATCH 0290/1644] ARROW-503: [Python] Implement Python interface to streaming file format See the new `StreamWriter` and `StreamReader` classes. This patch is stacked on top of the patch for ARROW-475. Will rebase when that is merged. 
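For reference, the new Python classes are thin wrappers around `arrow::ipc::StreamWriter` and `arrow::ipc::StreamReader`. A minimal C++-level round trip through the streaming format looks roughly like this (a sketch only: error paths are elided, and `sink` and `source` stand in for whatever output and input streams you already have):

```
#include <memory>

#include "arrow/api.h"
#include "arrow/io/interfaces.h"
#include "arrow/ipc/stream.h"
#include "arrow/status.h"

// Sketch: write one record batch in the stream format, then read it back.
arrow::Status RoundTrip(arrow::io::OutputStream* sink,
                        const std::shared_ptr<arrow::io::InputStream>& source,
                        const std::shared_ptr<arrow::Schema>& schema,
                        const arrow::RecordBatch& batch) {
  std::shared_ptr<arrow::ipc::StreamWriter> writer;
  RETURN_NOT_OK(arrow::ipc::StreamWriter::Open(sink, schema, &writer));
  RETURN_NOT_OK(writer->WriteRecordBatch(batch));
  // Close() finishes the stream but no longer closes the sink; the caller
  // is responsible for closing the OutputStream itself
  RETURN_NOT_OK(writer->Close());

  std::shared_ptr<arrow::ipc::StreamReader> reader;
  RETURN_NOT_OK(arrow::ipc::StreamReader::Open(source, &reader));
  std::shared_ptr<arrow::RecordBatch> chunk;
  while (true) {
    RETURN_NOT_OK(reader->GetNextRecordBatch(&chunk));
    if (chunk == nullptr) break;  // a null batch signals end of stream
    // ... consume chunk here
  }
  return arrow::Status::OK();
}
```

The Python `StreamWriter` / `StreamReader` below expose the same flow as `write_batch`, `get_next_batch`, and iteration.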
Author: Wes McKinney Closes #299 from wesm/ARROW-503 and squashes the following commits: e9d918e [Wes McKinney] Close BufferOutputStream after completing file or stream writes 31e519f [Wes McKinney] Add function alias to preserve backwards compatibility faac28c [Wes McKinney] Fix small bug in BinaryArray::Equals, add rudimentary StreamReader/Writer interface and tests d9fb3dc [Wes McKinney] Refactoring, consolidate IPC code into io.pyx --- cpp/src/arrow/array.cc | 2 +- cpp/src/arrow/ipc/ipc-file-test.cc | 2 + cpp/src/arrow/ipc/stream.cc | 6 +- cpp/src/arrow/ipc/stream.h | 3 + python/CMakeLists.txt | 1 - python/pyarrow/__init__.py | 2 + python/pyarrow/includes/libarrow_ipc.pxd | 29 +- python/pyarrow/io.pyx | 367 ++++++++++++++++------- python/pyarrow/ipc.py | 83 +++++ python/pyarrow/ipc.pyx | 115 ------- python/pyarrow/schema.pyx | 1 + python/pyarrow/table.pxd | 1 + python/pyarrow/tests/test_ipc.py | 120 ++++---- python/setup.py | 1 - 14 files changed, 438 insertions(+), 295 deletions(-) create mode 100644 python/pyarrow/ipc.py delete mode 100644 python/pyarrow/ipc.pyx diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index 7509520d12685..aa4a692e85cb9 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -359,7 +359,7 @@ bool BinaryArray::EqualsExact(const BinaryArray& other) const { if (!data_buffer_ && !(other.data_buffer_)) { return true; } - return data_buffer_->Equals(*other.data_buffer_, data_buffer_->size()); + return data_buffer_->Equals(*other.data_buffer_, raw_offsets()[length_]); } bool BinaryArray::Equals(const std::shared_ptr& arr) const { diff --git a/cpp/src/arrow/ipc/ipc-file-test.cc b/cpp/src/arrow/ipc/ipc-file-test.cc index 15ceb80493632..7cd8054679e44 100644 --- a/cpp/src/arrow/ipc/ipc-file-test.cc +++ b/cpp/src/arrow/ipc/ipc-file-test.cc @@ -75,6 +75,7 @@ class TestFileFormat : public ::testing::TestWithParam { RETURN_NOT_OK(writer->WriteRecordBatch(*batch)); } RETURN_NOT_OK(writer->Close()); + RETURN_NOT_OK(sink_->Close()); // Current offset into stream is the end of the file int64_t footer_offset; @@ -138,6 +139,7 @@ class TestStreamFormat : public ::testing::TestWithParam { RETURN_NOT_OK(writer->WriteRecordBatch(batch)); } RETURN_NOT_OK(writer->Close()); + RETURN_NOT_OK(sink_->Close()); // Open the file auto buf_reader = std::make_shared(buffer_); diff --git a/cpp/src/arrow/ipc/stream.cc b/cpp/src/arrow/ipc/stream.cc index a2ca672fbe0aa..c9057e860b1e8 100644 --- a/cpp/src/arrow/ipc/stream.cc +++ b/cpp/src/arrow/ipc/stream.cc @@ -117,9 +117,9 @@ Status StreamWriter::WriteRecordBatch(const RecordBatch& batch) { } Status StreamWriter::Close() { - // Close the stream - RETURN_NOT_OK(CheckStarted()); - return sink_->Close(); + // Write the schema if not already written + // User is responsible for closing the OutputStream + return CheckStarted(); } // ---------------------------------------------------------------------- diff --git a/cpp/src/arrow/ipc/stream.h b/cpp/src/arrow/ipc/stream.h index 0b0e62f13fc5f..53f51dc73675f 100644 --- a/cpp/src/arrow/ipc/stream.h +++ b/cpp/src/arrow/ipc/stream.h @@ -54,6 +54,9 @@ class ARROW_EXPORT StreamWriter { std::shared_ptr* out); virtual Status WriteRecordBatch(const RecordBatch& batch); + + /// Perform any logic necessary to finish the stream. 
User is responsible for + /// closing the actual OutputStream virtual Status Close(); protected: diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index b3735b1d58653..d63fff48a011f 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -409,7 +409,6 @@ set(CYTHON_EXTENSIONS config error io - ipc scalar schema table diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index d563c7aa4055d..7c521db6280be 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -46,6 +46,8 @@ from pyarrow.io import (HdfsFile, NativeFile, PythonFileInterface, Buffer, InMemoryOutputStream, BufferReader) +from pyarrow.ipc import FileReader, FileWriter, StreamReader, StreamWriter + from pyarrow.scalar import (ArrayValue, Scalar, NA, NAType, BooleanValue, Int8Value, Int16Value, Int32Value, Int64Value, diff --git a/python/pyarrow/includes/libarrow_ipc.pxd b/python/pyarrow/includes/libarrow_ipc.pxd index 82957600d1eb6..bfece14fe6e03 100644 --- a/python/pyarrow/includes/libarrow_ipc.pxd +++ b/python/pyarrow/includes/libarrow_ipc.pxd @@ -20,18 +20,37 @@ from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport (MemoryPool, CArray, CSchema, CRecordBatch) -from pyarrow.includes.libarrow_io cimport (OutputStream, ReadableFileInterface) +from pyarrow.includes.libarrow_io cimport (InputStream, OutputStream, + ReadableFileInterface) -cdef extern from "arrow/ipc/file.h" namespace "arrow::ipc" nogil: - cdef cppclass CFileWriter " arrow::ipc::FileWriter": +cdef extern from "arrow/ipc/stream.h" namespace "arrow::ipc" nogil: + + cdef cppclass CStreamWriter " arrow::ipc::StreamWriter": @staticmethod CStatus Open(OutputStream* sink, const shared_ptr[CSchema]& schema, - shared_ptr[CFileWriter]* out) + shared_ptr[CStreamWriter]* out) + CStatus Close() CStatus WriteRecordBatch(const CRecordBatch& batch) - CStatus Close() + cdef cppclass CStreamReader " arrow::ipc::StreamReader": + + @staticmethod + CStatus Open(const shared_ptr[InputStream]& stream, + shared_ptr[CStreamReader]* out) + + shared_ptr[CSchema] schema() + + CStatus GetNextRecordBatch(shared_ptr[CRecordBatch]* batch) + + +cdef extern from "arrow/ipc/file.h" namespace "arrow::ipc" nogil: + + cdef cppclass CFileWriter " arrow::ipc::FileWriter"(CStreamWriter): + @staticmethod + CStatus Open(OutputStream* sink, const shared_ptr[CSchema]& schema, + shared_ptr[CFileWriter]* out) cdef cppclass CFileReader " arrow::ipc::FileReader": diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index 26215122b7a23..0755ed8bb4d4f 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -15,20 +15,26 @@ # specific language governing permissions and limitations # under the License. 
-# Cython wrappers for IO interfaces defined in arrow/io +# Cython wrappers for IO interfaces defined in arrow::io and messaging in +# arrow::ipc # cython: profile=False # distutils: language = c++ # cython: embedsignature = True +from cython.operator cimport dereference as deref + from libc.stdlib cimport malloc, free from pyarrow.includes.libarrow cimport * -cimport pyarrow.includes.pyarrow as pyarrow from pyarrow.includes.libarrow_io cimport * +from pyarrow.includes.libarrow_ipc cimport * +cimport pyarrow.includes.pyarrow as pyarrow from pyarrow.compat import frombytes, tobytes, encode_file_path from pyarrow.error cimport check_status +from pyarrow.schema cimport Schema +from pyarrow.table cimport RecordBatch, batch_from_cbatch cimport cpython as cp @@ -38,6 +44,11 @@ import sys import threading import time + +# 64K +DEFAULT_BUFFER_SIZE = 2 ** 16 + + # To let us get a PyObject* and avoid Cython auto-ref-counting cdef extern from "Python.h": PyObject* PyBytes_FromStringAndSizeNative" PyBytes_FromStringAndSize"( @@ -167,6 +178,129 @@ cdef class NativeFile: return wrap_buffer(output) + def download(self, stream_or_path, buffer_size=None): + """ + Read file completely to local path (rather than reading completely into + memory). First seeks to the beginning of the file. + """ + cdef: + int64_t bytes_read = 0 + uint8_t* buf + self._assert_readable() + + buffer_size = buffer_size or DEFAULT_BUFFER_SIZE + + write_queue = Queue(50) + + if not hasattr(stream_or_path, 'read'): + stream = open(stream_or_path, 'wb') + cleanup = lambda: stream.close() + else: + stream = stream_or_path + cleanup = lambda: None + + done = False + exc_info = None + def bg_write(): + try: + while not done or write_queue.qsize() > 0: + try: + buf = write_queue.get(timeout=0.01) + except QueueEmpty: + continue + stream.write(buf) + except Exception as e: + exc_info = sys.exc_info() + finally: + cleanup() + + self.seek(0) + + writer_thread = threading.Thread(target=bg_write) + + # This isn't ideal -- PyBytes_FromStringAndSize copies the data from + # the passed buffer, so it's hard for us to avoid doubling the memory + buf = malloc(buffer_size) + if buf == NULL: + raise MemoryError("Failed to allocate {0} bytes" + .format(buffer_size)) + + writer_thread.start() + + cdef int64_t total_bytes = 0 + cdef int32_t c_buffer_size = buffer_size + + try: + while True: + with nogil: + check_status(self.rd_file.get() + .Read(c_buffer_size, &bytes_read, buf)) + + total_bytes += bytes_read + + # EOF + if bytes_read == 0: + break + + pybuf = cp.PyBytes_FromStringAndSize(buf, + bytes_read) + + write_queue.put_nowait(pybuf) + finally: + free(buf) + done = True + + writer_thread.join() + if exc_info is not None: + raise exc_info[0], exc_info[1], exc_info[2] + + def upload(self, stream, buffer_size=None): + """ + Pipe file-like object to file + """ + write_queue = Queue(50) + self._assert_writeable() + + buffer_size = buffer_size or DEFAULT_BUFFER_SIZE + + done = False + exc_info = None + def bg_write(): + try: + while not done or write_queue.qsize() > 0: + try: + buf = write_queue.get(timeout=0.01) + except QueueEmpty: + continue + + self.write(buf) + + except Exception as e: + exc_info = sys.exc_info() + + writer_thread = threading.Thread(target=bg_write) + writer_thread.start() + + try: + while True: + buf = stream.read(buffer_size) + if not buf: + break + + if writer_thread.is_alive(): + while write_queue.full(): + time.sleep(0.01) + else: + break + + write_queue.put_nowait(buf) + finally: + done = True + + writer_thread.join() + if 
exc_info is not None: + raise exc_info[0], exc_info[1], exc_info[2] + # ---------------------------------------------------------------------- # Python file-like objects @@ -679,58 +813,17 @@ cdef class _HdfsClient: return out - def upload(self, path, stream, buffer_size=2**16): + def download(self, path, stream, buffer_size=None): + with self.open(path, 'rb') as f: + f.download(stream, buffer_size=buffer_size) + + def upload(self, path, stream, buffer_size=None): """ Upload file-like object to HDFS path """ - write_queue = Queue(50) - with self.open(path, 'wb') as f: - done = False - exc_info = None - def bg_write(): - try: - while not done or write_queue.qsize() > 0: - try: - buf = write_queue.get(timeout=0.01) - except QueueEmpty: - continue - - f.write(buf) - - except Exception as e: - exc_info = sys.exc_info() - - writer_thread = threading.Thread(target=bg_write) - writer_thread.start() + f.upload(stream, buffer_size=buffer_size) - try: - while True: - buf = stream.read(buffer_size) - if not buf: - break - - if writer_thread.is_alive(): - while write_queue.full(): - time.sleep(0.01) - else: - break - - write_queue.put_nowait(buf) - finally: - done = True - - writer_thread.join() - if exc_info is not None: - raise exc_info[0], exc_info[1], exc_info[2] - - def download(self, path, stream, buffer_size=None): - with self.open(path, 'rb', buffer_size=buffer_size) as f: - f.download(stream) - - -# ---------------------------------------------------------------------- -# Specialization for HDFS # ARROW-404: Helper class to ensure that files are closed before the # client. During deallocation of the extension class, the attributes are @@ -766,75 +859,139 @@ cdef class HdfsFile(NativeFile): def __dealloc__(self): self.parent = None - def download(self, stream_or_path): +# ---------------------------------------------------------------------- +# File and stream readers and writers + +cdef class _StreamWriter: + cdef: + shared_ptr[CStreamWriter] writer + shared_ptr[OutputStream] sink + bint closed + + def __cinit__(self): + self.closed = True + + def __dealloc__(self): + if not self.closed: + self.close() + + def _open(self, sink, Schema schema): + get_writer(sink, &self.sink) + + with nogil: + check_status(CStreamWriter.Open(self.sink.get(), schema.sp_schema, + &self.writer)) + + self.closed = False + + def write_batch(self, RecordBatch batch): + with nogil: + check_status(self.writer.get() + .WriteRecordBatch(deref(batch.batch))) + + def close(self): + with nogil: + check_status(self.writer.get().Close()) + self.closed = True + + +cdef class _StreamReader: + cdef: + shared_ptr[CStreamReader] reader + + cdef readonly: + Schema schema + + def __cinit__(self): + pass + + def _open(self, source): + cdef: + shared_ptr[ReadableFileInterface] reader + shared_ptr[InputStream] in_stream + + get_reader(source, &reader) + in_stream = reader + + with nogil: + check_status(CStreamReader.Open(in_stream, &self.reader)) + + schema = Schema() + schema.init_schema(self.reader.get().schema()) + + def get_next_batch(self): """ - Read file completely to local path (rather than reading completely into - memory). First seeks to the beginning of the file. + Read next RecordBatch from the stream. 
Raises StopIteration at end of + stream """ - cdef: - int64_t bytes_read = 0 - uint8_t* buf - self._assert_readable() + cdef shared_ptr[CRecordBatch] batch - write_queue = Queue(50) + with nogil: + check_status(self.reader.get().GetNextRecordBatch(&batch)) - if not hasattr(stream_or_path, 'read'): - stream = open(stream_or_path, 'wb') - cleanup = lambda: stream.close() - else: - stream = stream_or_path - cleanup = lambda: None + if batch.get() == NULL: + raise StopIteration - done = False - exc_info = None - def bg_write(): - try: - while not done or write_queue.qsize() > 0: - try: - buf = write_queue.get(timeout=0.01) - except QueueEmpty: - continue - stream.write(buf) - except Exception as e: - exc_info = sys.exc_info() - finally: - cleanup() + return batch_from_cbatch(batch) - self.seek(0) - writer_thread = threading.Thread(target=bg_write) +cdef class _FileWriter(_StreamWriter): - # This isn't ideal -- PyBytes_FromStringAndSize copies the data from - # the passed buffer, so it's hard for us to avoid doubling the memory - buf = malloc(self.buffer_size) - if buf == NULL: - raise MemoryError("Failed to allocate {0} bytes" - .format(self.buffer_size)) + def _open(self, sink, Schema schema): + cdef shared_ptr[CFileWriter] writer + get_writer(sink, &self.sink) - writer_thread.start() + with nogil: + check_status(CFileWriter.Open(self.sink.get(), schema.sp_schema, + &writer)) - cdef int64_t total_bytes = 0 + # Cast to base class, because has same interface + self.writer = writer + self.closed = False - try: - while True: - with nogil: - check_status(self.rd_file.get() - .Read(self.buffer_size, &bytes_read, buf)) - total_bytes += bytes_read +cdef class _FileReader: + cdef: + shared_ptr[CFileReader] reader - # EOF - if bytes_read == 0: - break + def __cinit__(self): + pass - pybuf = cp.PyBytes_FromStringAndSize(buf, - bytes_read) + def _open(self, source, footer_offset=None): + cdef shared_ptr[ReadableFileInterface] reader + get_reader(source, &reader) - write_queue.put_nowait(pybuf) - finally: - free(buf) - done = True + cdef int64_t offset = 0 + if footer_offset is not None: + offset = footer_offset - writer_thread.join() - if exc_info is not None: - raise exc_info[0], exc_info[1], exc_info[2] + with nogil: + if offset != 0: + check_status(CFileReader.Open2(reader, offset, &self.reader)) + else: + check_status(CFileReader.Open(reader, &self.reader)) + + property num_dictionaries: + + def __get__(self): + return self.reader.get().num_dictionaries() + + property num_record_batches: + + def __get__(self): + return self.reader.get().num_record_batches() + + def get_batch(self, int i): + cdef shared_ptr[CRecordBatch] batch + + if i < 0 or i >= self.num_record_batches: + raise ValueError('Batch number {0} out of range'.format(i)) + + with nogil: + check_status(self.reader.get().GetRecordBatch(i, &batch)) + + return batch_from_cbatch(batch) + + # TODO(wesm): ARROW-503: Function was renamed. Remove after a period of + # time has passed + get_record_batch = get_batch diff --git a/python/pyarrow/ipc.py b/python/pyarrow/ipc.py new file mode 100644 index 0000000000000..5a5616564324c --- /dev/null +++ b/python/pyarrow/ipc.py @@ -0,0 +1,83 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# Arrow file and stream reader/writer classes, and other messaging tools + +import pyarrow.io as io + + +class StreamReader(io._StreamReader): + """ + Reader for the Arrow streaming binary format + + Parameters + ---------- + source : str, pyarrow.NativeFile, or file-like Python object + Either a file path, or a readable file object + """ + def __init__(self, source): + self._open(source) + + def __iter__(self): + while True: + yield self.get_next_batch() + + +class StreamWriter(io._StreamWriter): + """ + Writer for the Arrow streaming binary format + + Parameters + ---------- + sink : str, pyarrow.NativeFile, or file-like Python object + Either a file path, or a writeable file object + schema : pyarrow.Schema + The Arrow schema for data to be written to the file + """ + def __init__(self, sink, schema): + self._open(sink, schema) + + +class FileReader(io._FileReader): + """ + Class for reading Arrow record batch data from the Arrow binary file format + + Parameters + ---------- + source : str, pyarrow.NativeFile, or file-like Python object + Either a file path, or a readable file object + footer_offset : int, default None + If the file is embedded in some larger file, this is the byte offset to + the very end of the file data + """ + def __init__(self, source, footer_offset=None): + self._open(source, footer_offset=footer_offset) + + +class FileWriter(io._FileWriter): + """ + Writer to create the Arrow binary file format + + Parameters + ---------- + sink : str, pyarrow.NativeFile, or file-like Python object + Either a file path, or a writeable file object + schema : pyarrow.Schema + The Arrow schema for data to be written to the file + """ + def __init__(self, sink, schema): + self._open(sink, schema) diff --git a/python/pyarrow/ipc.pyx b/python/pyarrow/ipc.pyx deleted file mode 100644 index 22069a7290ead..0000000000000 --- a/python/pyarrow/ipc.pyx +++ /dev/null @@ -1,115 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. 
- -# Cython wrappers for arrow::ipc - -# cython: profile=False -# distutils: language = c++ -# cython: embedsignature = True - -from cython.operator cimport dereference as deref - -from pyarrow.includes.libarrow cimport * -from pyarrow.includes.libarrow_io cimport * -from pyarrow.includes.libarrow_ipc cimport * -cimport pyarrow.includes.pyarrow as pyarrow - -from pyarrow.error cimport check_status -from pyarrow.io cimport NativeFile, get_reader, get_writer -from pyarrow.schema cimport Schema -from pyarrow.table cimport RecordBatch - -from pyarrow.compat import frombytes, tobytes -import pyarrow.io as io - -cimport cpython as cp - - -cdef class ArrowFileWriter: - cdef: - shared_ptr[CFileWriter] writer - shared_ptr[OutputStream] sink - bint closed - - def __cinit__(self, sink, Schema schema): - self.closed = True - get_writer(sink, &self.sink) - - with nogil: - check_status(CFileWriter.Open(self.sink.get(), schema.sp_schema, - &self.writer)) - - self.closed = False - - def __dealloc__(self): - if not self.closed: - self.close() - - def write_record_batch(self, RecordBatch batch): - with nogil: - check_status(self.writer.get() - .WriteRecordBatch(deref(batch.batch))) - - def close(self): - with nogil: - check_status(self.writer.get().Close()) - self.closed = True - - -cdef class ArrowFileReader: - cdef: - shared_ptr[CFileReader] reader - - def __cinit__(self, source, footer_offset=None): - cdef shared_ptr[ReadableFileInterface] reader - get_reader(source, &reader) - - cdef int64_t offset = 0 - if footer_offset is not None: - offset = footer_offset - - with nogil: - if offset != 0: - check_status(CFileReader.Open2(reader, offset, &self.reader)) - else: - check_status(CFileReader.Open(reader, &self.reader)) - - property num_dictionaries: - - def __get__(self): - return self.reader.get().num_dictionaries() - - property num_record_batches: - - def __get__(self): - return self.reader.get().num_record_batches() - - def get_record_batch(self, int i): - cdef: - shared_ptr[CRecordBatch] batch - RecordBatch result - - if i < 0 or i >= self.num_record_batches: - raise ValueError('Batch number {0} out of range'.format(i)) - - with nogil: - check_status(self.reader.get().GetRecordBatch(i, &batch)) - - result = RecordBatch() - result.init(batch) - - return result diff --git a/python/pyarrow/schema.pyx b/python/pyarrow/schema.pyx index 2bcfec1bcf3e2..52eeeaf717622 100644 --- a/python/pyarrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ -112,6 +112,7 @@ cdef class Field: def __get__(self): return frombytes(self.field.name) + cdef class Schema: def __cinit__(self): diff --git a/python/pyarrow/table.pxd b/python/pyarrow/table.pxd index df3687ddf9761..389727b4dc1d7 100644 --- a/python/pyarrow/table.pxd +++ b/python/pyarrow/table.pxd @@ -59,3 +59,4 @@ cdef class RecordBatch: cdef _check_nullptr(self) cdef api object table_from_ctable(const shared_ptr[CTable]& ctable) +cdef api object batch_from_cbatch(const shared_ptr[CRecordBatch]& cbatch) diff --git a/python/pyarrow/tests/test_ipc.py b/python/pyarrow/tests/test_ipc.py index bbd6c6a56705c..819d1b71b8546 100644 --- a/python/pyarrow/tests/test_ipc.py +++ b/python/pyarrow/tests/test_ipc.py @@ -16,21 +16,20 @@ # under the License. 
import io +import pytest import numpy as np from pandas.util.testing import assert_frame_equal import pandas as pd -import pyarrow as A -import pyarrow.io as aio -import pyarrow.ipc as ipc +from pyarrow.compat import unittest +import pyarrow as pa -class RoundtripTest(object): - # Also tests writing zero-copy NumPy array with additional padding +class MessagingTest(object): - def __init__(self): + def setUp(self): self.sink = self._get_sink() def _get_sink(self): @@ -39,14 +38,15 @@ def _get_sink(self): def _get_source(self): return self.sink.getvalue() - def run(self): + def write_batches(self): nrows = 5 df = pd.DataFrame({ 'one': np.random.randn(nrows), 'two': ['foo', np.nan, 'bar', 'bazbaz', 'qux']}) - batch = A.RecordBatch.from_pandas(df) - writer = ipc.ArrowFileWriter(self.sink, batch.schema) + batch = pa.RecordBatch.from_pandas(df) + + writer = self._get_writer(self.sink, batch.schema) num_batches = 5 frames = [] @@ -55,46 +55,73 @@ def run(self): unique_df = df.copy() unique_df['one'] = np.random.randn(nrows) - batch = A.RecordBatch.from_pandas(unique_df) - writer.write_record_batch(batch) + batch = pa.RecordBatch.from_pandas(unique_df) + writer.write_batch(batch) frames.append(unique_df) batches.append(batch) writer.close() + return batches + + +class TestFile(MessagingTest, unittest.TestCase): + # Also tests writing zero-copy NumPy array with additional padding + + def _get_writer(self, sink, schema): + return pa.FileWriter(sink, schema) + def test_simple_roundtrip(self): + batches = self.write_batches() file_contents = self._get_source() - reader = ipc.ArrowFileReader(aio.BufferReader(file_contents)) - assert reader.num_record_batches == num_batches + reader = pa.FileReader(pa.BufferReader(file_contents)) - for i in range(num_batches): + assert reader.num_record_batches == len(batches) + + for i, batch in enumerate(batches): # it works. 
Must convert back to DataFrame - batch = reader.get_record_batch(i) + batch = reader.get_batch(i) assert batches[i].equals(batch) -class InMemoryStreamTest(RoundtripTest): +class TestStream(MessagingTest, unittest.TestCase): + + def _get_writer(self, sink, schema): + return pa.StreamWriter(sink, schema) + + def test_simple_roundtrip(self): + batches = self.write_batches() + file_contents = self._get_source() + reader = pa.StreamReader(pa.BufferReader(file_contents)) + + total = 0 + for i, next_batch in enumerate(reader): + assert next_batch.equals(batches[i]) + total += 1 + + assert total == len(batches) + + with pytest.raises(StopIteration): + reader.get_next_batch() + + +class TestInMemoryFile(TestFile): def _get_sink(self): - return aio.InMemoryOutputStream() + return pa.InMemoryOutputStream() def _get_source(self): return self.sink.get_result() -def test_ipc_file_simple_roundtrip(): - helper = RoundtripTest() - helper.run() - - def test_ipc_zero_copy_numpy(): df = pd.DataFrame({'foo': [1.5]}) - batch = A.RecordBatch.from_pandas(df) - sink = aio.InMemoryOutputStream() + batch = pa.RecordBatch.from_pandas(df) + sink = pa.InMemoryOutputStream() write_file(batch, sink) buffer = sink.get_result() - reader = aio.BufferReader(buffer) + reader = pa.BufferReader(buffer) batches = read_file(reader) @@ -103,48 +130,13 @@ def test_ipc_zero_copy_numpy(): assert_frame_equal(df, rdf) -# XXX: For benchmarking - -def big_batch(): - K = 2**4 - N = 2**20 - df = pd.DataFrame( - np.random.randn(K, N).T, - columns=[str(i) for i in range(K)] - ) - - df = pd.concat([df] * 2 ** 3, ignore_index=True) - return df - - -def write_to_memory2(batch): - sink = aio.InMemoryOutputStream() - write_file(batch, sink) - return sink.get_result() - - -def write_to_memory(batch): - sink = io.BytesIO() - write_file(batch, sink) - return sink.getvalue() - - def write_file(batch, sink): - writer = ipc.ArrowFileWriter(sink, batch.schema) - writer.write_record_batch(batch) + writer = pa.FileWriter(sink, batch.schema) + writer.write_batch(batch) writer.close() def read_file(source): - reader = ipc.ArrowFileReader(source) - return [reader.get_record_batch(i) + reader = pa.FileReader(source) + return [reader.get_batch(i) for i in range(reader.num_record_batches)] - -# df = big_batch() -# batch = A.RecordBatch.from_pandas(df) -# mem = write_to_memory(batch) -# batches = read_file(mem) -# data = batches[0].to_pandas() -# rdf = pd.DataFrame(data) - -# [x.to_pandas() for x in batches] diff --git a/python/setup.py b/python/setup.py index de59a92905895..9c63e93df3352 100644 --- a/python/setup.py +++ b/python/setup.py @@ -94,7 +94,6 @@ def initialize_options(self): 'config', 'error', 'io', - 'ipc', '_parquet', 'scalar', 'schema', From 2821030124eb3e884b0e48f09c38b54f00430b13 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 23 Jan 2017 09:11:26 -0500 Subject: [PATCH 0291/1644] ARROW-508: [C++] Add basic threadsafety to normal files and memory maps This patch is stacked on ARROW-494, so will need to be rebased. * Since the naive `ReadAt` implementation involves a Seek and a Read, this locks until the read is completed. * Normal file reads block until completion * File writes block until completion This covers the threadsafety requirements for parquet-cpp at least. 
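As a sketch of what this enables (mirroring the new `ThreadSafety` unit tests; the function and names here are illustrative), two threads can now issue overlapping positional reads against a single open file:

```
#include <memory>
#include <thread>

#include "arrow/buffer.h"
#include "arrow/io/interfaces.h"
#include "arrow/status.h"

// Illustrative only: the default ReadAt takes the interface's mutex around
// the Seek + Read pair, so both threads see consistent bytes.
void ConcurrentReads(
    const std::shared_ptr<arrow::io::ReadableFileInterface>& file) {
  auto read_prefix = [file]() {
    std::shared_ptr<arrow::Buffer> buffer;
    for (int i = 0; i < 10000; ++i) {
      if (!file->ReadAt(0, 3, &buffer).ok()) { return; }  // read bytes [0, 3)
    }
  };
  std::thread t1(read_prefix);
  std::thread t2(read_prefix);
  t1.join();
  t2.join();
}
```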
For on-disk files, the following methods are now threadsafe: * `ArrowInputFile::Read` and `ArrowInputFile::ReadAt` * `ArrowOutputStream::Write` parquet-cpp calls `Seek` in a couple places: https://github.com/apache/parquet-cpp/blob/master/src/parquet/file/reader-internal.cc#L257 Strictly speaking, if two threads are trying to read the same file from the same input source, this could have a race condition in esoteric circumstances. I'm going to report a bug to change these to `ReadAt` which can be more easily made threadsafe Author: Wes McKinney Closes #300 from wesm/ARROW-508 and squashes the following commits: e57156c [Wes McKinney] Make base ReadableFileInterface::ReadAt and some file functions threadsafe --- cpp/src/arrow/io/file.cc | 10 ++++- cpp/src/arrow/io/file.h | 9 ++++- cpp/src/arrow/io/interfaces.cc | 3 ++ cpp/src/arrow/io/interfaces.h | 12 +++++- cpp/src/arrow/io/io-file-test.cc | 69 ++++++++++++++++++++++++++++++++ 5 files changed, 98 insertions(+), 5 deletions(-) diff --git a/cpp/src/arrow/io/file.cc b/cpp/src/arrow/io/file.cc index 3bf8dfa08f2ff..ff58e539b9353 100644 --- a/cpp/src/arrow/io/file.cc +++ b/cpp/src/arrow/io/file.cc @@ -76,6 +76,7 @@ #include #include #include +#include #include #include @@ -350,6 +351,7 @@ class OSFile { } Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) { + std::lock_guard guard(lock_); return FileRead(fd_, out, nbytes, bytes_read); } @@ -361,6 +363,7 @@ class OSFile { Status Tell(int64_t* pos) const { return FileTell(fd_, pos); } Status Write(const uint8_t* data, int64_t length) { + std::lock_guard guard(lock_); if (length < 0) { return Status::IOError("Length must be non-negative"); } return FileWrite(fd_, data, length); } @@ -377,6 +380,8 @@ class OSFile { protected: std::string path_; + std::mutex lock_; + // File descriptor int fd_; @@ -649,6 +654,8 @@ bool MemoryMappedFile::supports_zero_copy() const { } Status MemoryMappedFile::WriteAt(int64_t position, const uint8_t* data, int64_t nbytes) { + std::lock_guard guard(lock_); + if (!memory_map_->opened() || !memory_map_->writable()) { return Status::IOError("Unable to write"); } @@ -658,13 +665,14 @@ Status MemoryMappedFile::WriteAt(int64_t position, const uint8_t* data, int64_t } Status MemoryMappedFile::Write(const uint8_t* data, int64_t nbytes) { + std::lock_guard guard(lock_); + if (!memory_map_->opened() || !memory_map_->writable()) { return Status::IOError("Unable to write"); } if (nbytes + memory_map_->position() > memory_map_->size()) { return Status::Invalid("Cannot write past end of memory map"); } - return WriteInternal(data, nbytes); } diff --git a/cpp/src/arrow/io/file.h b/cpp/src/arrow/io/file.h index 930346b0518b3..fe55e968e05d7 100644 --- a/cpp/src/arrow/io/file.h +++ b/cpp/src/arrow/io/file.h @@ -50,6 +50,8 @@ class ARROW_EXPORT FileOutputStream : public OutputStream { // OutputStream interface Status Close() override; Status Tell(int64_t* position) override; + + // Write bytes to the stream. Thread-safe Status Write(const uint8_t* data, int64_t nbytes) override; int file_descriptor() const; @@ -76,6 +78,7 @@ class ARROW_EXPORT ReadableFile : public ReadableFileInterface { Status Close() override; Status Tell(int64_t* position) override; + // Read bytes from the file. 
Thread-safe Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* buffer) override; Status Read(int64_t nbytes, std::shared_ptr* out) override; @@ -112,16 +115,18 @@ class ARROW_EXPORT MemoryMappedFile : public ReadWriteFileInterface { Status Seek(int64_t position) override; - // Required by ReadableFileInterface, copies memory into out + // Required by ReadableFileInterface, copies memory into out. Not thread-safe Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) override; - // Zero copy read + // Zero copy read. Not thread-safe Status Read(int64_t nbytes, std::shared_ptr* out) override; bool supports_zero_copy() const override; + /// Write data at the current position in the file. Thread-safe Status Write(const uint8_t* data, int64_t nbytes) override; + /// Write data at a particular position in the file. Thread-safe Status WriteAt(int64_t position, const uint8_t* data, int64_t nbytes) override; // @return: the size in bytes of the memory source diff --git a/cpp/src/arrow/io/interfaces.cc b/cpp/src/arrow/io/interfaces.cc index 8040f93836cdc..7e78caa04e711 100644 --- a/cpp/src/arrow/io/interfaces.cc +++ b/cpp/src/arrow/io/interfaces.cc @@ -19,6 +19,7 @@ #include #include +#include #include "arrow/buffer.h" #include "arrow/status.h" @@ -34,12 +35,14 @@ ReadableFileInterface::ReadableFileInterface() { Status ReadableFileInterface::ReadAt( int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* out) { + std::lock_guard guard(lock_); RETURN_NOT_OK(Seek(position)); return Read(nbytes, bytes_read, out); } Status ReadableFileInterface::ReadAt( int64_t position, int64_t nbytes, std::shared_ptr* out) { + std::lock_guard guard(lock_); RETURN_NOT_OK(Seek(position)); return Read(nbytes, out); } diff --git a/cpp/src/arrow/io/interfaces.h b/cpp/src/arrow/io/interfaces.h index fdb3788188915..78680903d230a 100644 --- a/cpp/src/arrow/io/interfaces.h +++ b/cpp/src/arrow/io/interfaces.h @@ -20,6 +20,7 @@ #include #include +#include #include #include "arrow/util/macros.h" @@ -99,14 +100,21 @@ class ARROW_EXPORT ReadableFileInterface : public InputStream, public Seekable { virtual bool supports_zero_copy() const = 0; - // Read at position, provide default implementations using Read(...), but can - // be overridden + /// Read at position, provide default implementations using Read(...), but can + /// be overridden + /// + /// Default implementation is thread-safe virtual Status ReadAt( int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* out); + /// Default implementation is thread-safe virtual Status ReadAt(int64_t position, int64_t nbytes, std::shared_ptr* out); + std::mutex& lock() { return lock_; } + protected: + std::mutex lock_; + ReadableFileInterface(); }; diff --git a/cpp/src/arrow/io/io-file-test.cc b/cpp/src/arrow/io/io-file-test.cc index 999b296465544..86a3287b84fbc 100644 --- a/cpp/src/arrow/io/io-file-test.cc +++ b/cpp/src/arrow/io/io-file-test.cc @@ -15,6 +15,7 @@ // specific language governing permissions and limitations // under the License. 
+#include #include #include #include @@ -25,6 +26,7 @@ #include #include #include +#include #include "gtest/gtest.h" @@ -325,6 +327,40 @@ TEST_F(TestReadableFile, CustomMemoryPool) { ASSERT_EQ(2, pool.num_allocations()); } +TEST_F(TestReadableFile, ThreadSafety) { + std::string data = "foobar"; + { + std::ofstream stream; + stream.open(path_.c_str()); + stream << data; + } + + MyMemoryPool pool; + ASSERT_OK(ReadableFile::Open(path_, &pool, &file_)); + + std::atomic correct_count(0); + const int niter = 10000; + + auto ReadData = [&correct_count, &data, niter, this] () { + std::shared_ptr buffer; + + for (int i = 0; i < niter; ++i) { + ASSERT_OK(file_->ReadAt(0, 3, &buffer)); + if (0 == memcmp(data.c_str(), buffer->data(), 3)) { + correct_count += 1; + } + } + }; + + std::thread thread1(ReadData); + std::thread thread2(ReadData); + + thread1.join(); + thread2.join(); + + ASSERT_EQ(niter * 2, correct_count); +} + // ---------------------------------------------------------------------- // Memory map tests @@ -455,5 +491,38 @@ TEST_F(TestMemoryMappedFile, CastableToFileInterface) { std::shared_ptr file = memory_mapped_file; } +TEST_F(TestMemoryMappedFile, ThreadSafety) { + std::string data = "foobar"; + std::string path = "ipc-multithreading-test"; + CreateFile(path, static_cast(data.size())); + + std::shared_ptr file; + ASSERT_OK(MemoryMappedFile::Open(path, FileMode::READWRITE, &file)); + ASSERT_OK(file->Write(reinterpret_cast(data.c_str()), + static_cast(data.size()))); + + std::atomic correct_count(0); + const int niter = 10000; + + auto ReadData = [&correct_count, &data, niter, &file] () { + std::shared_ptr buffer; + + for (int i = 0; i < niter; ++i) { + ASSERT_OK(file->ReadAt(0, 3, &buffer)); + if (0 == memcmp(data.c_str(), buffer->data(), 3)) { + correct_count += 1; + } + } + }; + + std::thread thread1(ReadData); + std::thread thread2(ReadData); + + thread1.join(); + thread2.join(); + + ASSERT_EQ(niter * 2, correct_count); +} + } // namespace io } // namespace arrow From 085c8754b0ab2da7fcd245fc88bc4de9a6806a4c Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 23 Jan 2017 09:13:39 -0500 Subject: [PATCH 0292/1644] ARROW-81: [Format] Augment dictionary encoding metadata to accommodate additional use cases cc @julienledem @nongli @jacques-n. I am hoping to close the loop on our discussion in https://issues.apache.org/jira/browse/ARROW-81. In my applications, I need the flexibility to transmit: * Dictionaries encoded in signed integers smaller than int32. For example, with 10 dictionary values, we may send int8 indices * Indicator that the dictionary is ordered These features are needed for Python and R support, and in general for statistical computing applications. Author: Wes McKinney Closes #297 from wesm/ARROW-81 and squashes the following commits: c960bac [Wes McKinney] Augment dictionary encoding metadata to accommodate additional use cases --- format/Message.fbs | 27 ++++++++++++++++++++++++--- 1 file changed, 24 insertions(+), 3 deletions(-) diff --git a/format/Message.fbs b/format/Message.fbs index b2c64649f2687..028c56ad51618 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -150,6 +150,26 @@ table KeyValue { value: [ubyte]; } +/// ---------------------------------------------------------------------- +/// Dictionary encoding metadata + +table DictionaryEncoding { + /// The known dictionary id in the application where this data is used. 
In + /// the file or streaming formats, the dictionary ids are found in the + /// DictionaryBatch messages + id: long; + + /// The dictionary indices are constrained to be positive integers. If this + /// field is null, the indices must be signed int32 + indexType: Int; + + /// By default, dictionaries are not ordered, or the order does not have + /// semantic meaning. In some statistical, applications, dictionary-encoding + /// is used to represent ordered categorical data, and we provide a way to + /// preserve that metadata here + isOrdered: bool; +} + /// ---------------------------------------------------------------------- /// A field represents a named column in a record / row batch or child of a /// nested type. @@ -163,9 +183,10 @@ table Field { name: string; nullable: bool; type: Type; - // present only if the field is dictionary encoded - // will point to a dictionary provided by a DictionaryBatch message - dictionary: long; + + // Present only if the field is dictionary encoded + dictionary: DictionaryEncoding; + // children apply only to Nested data types like Struct, List and Union children: [Field]; /// layout of buffers produced for this type (as derived from the Type) From c90ca60c1859b2b70c4f2dd3fb8c41b0f75f02d0 Mon Sep 17 00:00:00 2001 From: ahnj Date: Mon, 23 Jan 2017 23:44:22 -0500 Subject: [PATCH 0293/1644] ARROW-378: Python: Respect timezone on conversion of Pandas datetime columns arrow is now pandas datetime timezone aware Author: ahnj Closes #287 from ahnj/timestamp-aware and squashes the following commits: 0221ed0 [ahnj] ARROW-378: Python: Respect timezone on conversion of Pandas datetime columns --- python/pyarrow/array.pyx | 6 ++++- python/pyarrow/tests/test_convert_pandas.py | 29 +++++++++++++++++++-- 2 files changed, 32 insertions(+), 3 deletions(-) diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 92206f2451ffb..c3a5a045b7dd5 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -71,9 +71,13 @@ cdef class Array: timestamps_to_ms : bool, optional Convert datetime columns to ms resolution. This is needed for - compability with other functionality like Parquet I/O which + compatibility with other functionality like Parquet I/O which only supports milliseconds. + Notes + ----- + Localized timestamps will currently be returned as UTC (pandas's native representation). + Timezone-naive data will be implicitly interpreted as UTC. 
Examples -------- diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index 30705c4ca2a20..674a4361d3395 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -63,7 +63,7 @@ def tearDown(self): def _check_pandas_roundtrip(self, df, expected=None, nthreads=1, timestamps_to_ms=False, expected_schema=None, - schema=None): + check_dtype=True, schema=None): table = A.Table.from_pandas(df, timestamps_to_ms=timestamps_to_ms, schema=schema) result = table.to_pandas(nthreads=nthreads) @@ -71,7 +71,7 @@ def _check_pandas_roundtrip(self, df, expected=None, nthreads=1, assert table.schema.equals(expected_schema) if expected is None: expected = df - tm.assert_frame_equal(result, expected) + tm.assert_frame_equal(result, expected, check_dtype=check_dtype) def _check_array_roundtrip(self, values, expected=None, timestamps_to_ms=False, field=None): @@ -284,6 +284,31 @@ def test_timestamps_notimezone_nulls(self): self._check_pandas_roundtrip(df, timestamps_to_ms=False, expected_schema=schema) + def test_timestamps_with_timezone(self): + df = pd.DataFrame({ + 'datetime64': np.array([ + '2007-07-13T01:23:34.123', + '2006-01-13T12:34:56.432', + '2010-08-13T05:46:57.437'], + dtype='datetime64[ms]') + }) + df_est = df['datetime64'].dt.tz_localize('US/Eastern').to_frame() + df_utc = df_est['datetime64'].dt.tz_convert('UTC').to_frame() + self._check_pandas_roundtrip(df_est, expected=df_utc, timestamps_to_ms=True, check_dtype=False) + + # drop-in a null and ns instead of ms + df = pd.DataFrame({ + 'datetime64': np.array([ + '2007-07-13T01:23:34.123456789', + None, + '2006-01-13T12:34:56.432539784', + '2010-08-13T05:46:57.437699912'], + dtype='datetime64[ns]') + }) + df_est = df['datetime64'].dt.tz_localize('US/Eastern').to_frame() + df_utc = df_est['datetime64'].dt.tz_convert('UTC').to_frame() + self._check_pandas_roundtrip(df_est, expected=df_utc, timestamps_to_ms=False, check_dtype=False) + def test_date(self): df = pd.DataFrame({ 'date': [datetime.date(2000, 1, 1), From 61a54f8a619efc4fd256c446be29905d6484c5e9 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 24 Jan 2017 08:30:37 -0500 Subject: [PATCH 0294/1644] ARROW-509: [Python] Add support for multithreaded Parquet reads I'm getting very nice speedups on a Parquet file storing a ~4.5 GB dataset: ``` In [1]: import pyarrow.parquet as pq In [2]: %time table = pq.read_table('/home/wesm/data/airlines_parquet/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq') CPU times: user 8.21 s, sys: 468 ms, total: 8.68 s Wall time: 8.68 s In [3]: %time table = pq.read_table('/home/wesm/data/airlines_parquet/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq', nthreads=4) CPU times: user 8.84 s, sys: 4.28 s, total: 13.1 s Wall time: 3.91 s In [4]: %time table = pq.read_table('/home/wesm/data/airlines_parquet/4345e5eef217aa1b-c8f16177f35fd983_1150363067_data.0.parq', nthreads=8) CPU times: user 13.3 s, sys: 1.15 s, total: 14.4 s Wall time: 2.86 s ``` This requires a bugfix in parquet-cpp that will come soon in a patch for PARQUET-836 Author: Wes McKinney Closes #301 from wesm/ARROW-509 and squashes the following commits: 9816689 [Wes McKinney] Update docs slightly, flake8 warning 239b086 [Wes McKinney] Add support for nthreads option in parquet::arrow, unit tests --- python/pyarrow/_parquet.pxd | 4 ++++ python/pyarrow/_parquet.pyx | 21 ++++++++++++---- python/pyarrow/parquet.py | 36 ++++++++++++++++++---------- python/pyarrow/tests/test_parquet.py | 18 
++++++++++++++ 4 files changed, 62 insertions(+), 17 deletions(-) diff --git a/python/pyarrow/_parquet.pxd b/python/pyarrow/_parquet.pxd index cf1da1c3a9e52..fabee5d5761d7 100644 --- a/python/pyarrow/_parquet.pxd +++ b/python/pyarrow/_parquet.pxd @@ -213,8 +213,12 @@ cdef extern from "parquet/arrow/reader.h" namespace "parquet::arrow" nogil: FileReader(MemoryPool* pool, unique_ptr[ParquetFileReader] reader) CStatus ReadFlatColumn(int i, shared_ptr[CArray]* out); CStatus ReadFlatTable(shared_ptr[CTable]* out); + CStatus ReadFlatTable(const vector[int]& column_indices, + shared_ptr[CTable]* out); const ParquetFileReader* parquet_reader(); + void set_num_threads(int num_threads) + cdef extern from "parquet/arrow/schema.h" namespace "parquet::arrow" nogil: CStatus FromParquetSchema(const SchemaDescriptor* parquet_schema, diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx index b11cee3a201fb..3f847e9808230 100644 --- a/python/pyarrow/_parquet.pyx +++ b/python/pyarrow/_parquet.pyx @@ -382,14 +382,27 @@ cdef class ParquetReader: result.init(metadata) return result - def read_all(self): + def read(self, column_indices=None, nthreads=1): cdef: Table table = Table() shared_ptr[CTable] ctable + vector[int] c_column_indices - with nogil: - check_status(self.reader.get() - .ReadFlatTable(&ctable)) + self.reader.get().set_num_threads(nthreads) + + if column_indices is not None: + # Read only desired column indices + for index in column_indices: + c_column_indices.push_back(index) + + with nogil: + check_status(self.reader.get() + .ReadFlatTable(c_column_indices, &ctable)) + else: + # Read all columns + with nogil: + check_status(self.reader.get() + .ReadFlatTable(&ctable)) table.init(ctable) return table diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py index cbe1c6e5d79d9..6654b770ba33e 100644 --- a/python/pyarrow/parquet.py +++ b/python/pyarrow/parquet.py @@ -18,7 +18,7 @@ from pyarrow._parquet import (ParquetReader, FileMetaData, # noqa RowGroupMetaData, Schema, ParquetWriter) import pyarrow._parquet as _parquet # noqa -from pyarrow.table import Table, concat_tables +from pyarrow.table import concat_tables class ParquetFile(object): @@ -45,7 +45,7 @@ def metadata(self): def schema(self): return self.metadata.schema - def read(self, nrows=None, columns=None): + def read(self, nrows=None, columns=None, nthreads=1): """ Read a Table from Parquet format @@ -53,6 +53,9 @@ def read(self, nrows=None, columns=None): ---------- columns: list If not None, only these columns will be read from the file. + nthreads : int, default 1 + Number of columns to read in parallel. Requires that the underlying + file source is threadsafe Returns ------- @@ -63,16 +66,16 @@ def read(self, nrows=None, columns=None): raise NotImplementedError("nrows argument") if columns is None: - return self.reader.read_all() + column_indices = None else: - column_idxs = [self.reader.column_name_idx(column) - for column in columns] - arrays = [self.reader.read_column(column_idx) - for column_idx in column_idxs] - return Table.from_arrays(arrays, names=columns) + column_indices = [self.reader.column_name_idx(column) + for column in columns] + return self.reader.read(column_indices=column_indices, + nthreads=nthreads) -def read_table(source, columns=None, metadata=None): + +def read_table(source, columns=None, nthreads=1, metadata=None): """ Read a Table from Parquet format @@ -83,6 +86,9 @@ def read_table(source, columns=None, metadata=None): pyarrow.io.PythonFileInterface or pyarrow.io.BufferReader. 
columns: list If not None, only these columns will be read from the file. + nthreads : int, default 1 + Number of columns to read in parallel. Requires that the underlying + file source is threadsafe metadata : FileMetaData If separately computed @@ -91,11 +97,12 @@ def read_table(source, columns=None, metadata=None): pyarrow.Table Content of the file as a table (of columns) """ - return ParquetFile(source, metadata=metadata).read(columns=columns) + pf = ParquetFile(source, metadata=metadata) + return pf.read(columns=columns, nthreads=nthreads) -def read_multiple_files(paths, columns=None, filesystem=None, metadata=None, - schema=None): +def read_multiple_files(paths, columns=None, filesystem=None, nthreads=1, + metadata=None, schema=None): """ Read multiple Parquet files as a single pyarrow.Table @@ -108,6 +115,9 @@ def read_multiple_files(paths, columns=None, filesystem=None, metadata=None, filesystem : Filesystem, default None If nothing passed, paths assumed to be found in the local on-disk filesystem + nthreads : int, default 1 + Number of columns to read in parallel. Requires that the underlying + file source is threadsafe metadata : pyarrow.parquet.FileMetaData Use metadata obtained elsewhere to validate file schemas schema : pyarrow.parquet.Schema @@ -147,7 +157,7 @@ def open_file(path, meta=None): tables = [] for path, path_metadata in zip(paths, all_file_metadata): reader = open_file(path, meta=path_metadata) - table = reader.read(columns=columns) + table = reader.read(columns=columns, nthreads=nthreads) tables.append(table) all_data = concat_tables(tables) diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index a94fe456d3b2b..d85f0e513702f 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -320,6 +320,24 @@ def test_compare_schemas(): assert not fileh.schema[0].equals(fileh.schema[1]) +@parquet +def test_multithreaded_read(): + df = alltypes_sample(size=10000) + + table = pa.Table.from_pandas(df, timestamps_to_ms=True) + + buf = io.BytesIO() + pq.write_table(table, buf, compression='SNAPPY', version='2.0') + + buf.seek(0) + table1 = pq.read_table(buf, nthreads=4) + + buf.seek(0) + table2 = pq.read_table(buf, nthreads=1) + + assert table1.equals(table2) + + @parquet def test_pass_separate_metadata(): # ARROW-471 From a68af9d168e381d1730ae0cb4dc653bef42562d3 Mon Sep 17 00:00:00 2001 From: Nong Li Date: Wed, 25 Jan 2017 14:49:00 -0500 Subject: [PATCH 0295/1644] ARROW-498 [C++] Add command line utilities that convert between stream and file. These are in the style of unix utilities using stdin/stdout for argument passing. This makes it easy to chain them together and I think are using for getting started or testing. As an example, this command line tests a round trip: $ build/debug/file-to-stream /tmp/arrow-file | build/debug/stream-to-file > /tmp/copy $ diff /tmp/arrow-file /tmp/copy If we had the same in java, this would make it pretty convenient for integration testing. Author: Nong Li Closes #302 from nongli/utils and squashes the following commits: b970c75 [Nong Li] fix long -> int64_t a01ef4d [Nong Li] Fix style issues. da3d98d [Nong Li] ARROW-498 [C++] Add commandline utilities that convert between stream and file. 
--- cpp/CMakeLists.txt | 4 ++ cpp/src/arrow/util/CMakeLists.txt | 26 ++++++++ cpp/src/arrow/util/file-to-stream.cc | 60 ++++++++++++++++++ cpp/src/arrow/util/io-util.h | 93 ++++++++++++++++++++++++++++ cpp/src/arrow/util/stream-to-file.cc | 58 +++++++++++++++++ 5 files changed, 241 insertions(+) create mode 100644 cpp/src/arrow/util/file-to-stream.cc create mode 100644 cpp/src/arrow/util/io-util.h create mode 100644 cpp/src/arrow/util/stream-to-file.cc diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 9039ffb571b9e..a0f89f314f683 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -90,6 +90,10 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") option(ARROW_ALTIVEC "Build Arrow with Altivec" ON) + + option(ARROW_BUILD_UTILITIES + "Build Arrow commandline utilities" + ON) endif() if(NOT ARROW_BUILD_TESTS) diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt index 8d9afccf867df..0830ee2ed2928 100644 --- a/cpp/src/arrow/util/CMakeLists.txt +++ b/cpp/src/arrow/util/CMakeLists.txt @@ -68,4 +68,30 @@ if (ARROW_BUILD_BENCHMARKS) endif() endif() +if (ARROW_BUILD_UTILITIES) + if (APPLE) + set(UTIL_LINK_LIBS + arrow_ipc_static + arrow_io_static + arrow_static + boost_filesystem_static + boost_system_static + dl) + else() + set(UTIL_LINK_LIBS + arrow_ipc_static + arrow_io_static + arrow_static + pthread + boost_filesystem_static + boost_system_static + dl) + endif() + + add_executable(file-to-stream file-to-stream.cc) + target_link_libraries(file-to-stream ${UTIL_LINK_LIBS}) + add_executable(stream-to-file stream-to-file.cc) + target_link_libraries(stream-to-file ${UTIL_LINK_LIBS}) +endif() + ADD_ARROW_TEST(bit-util-test) diff --git a/cpp/src/arrow/util/file-to-stream.cc b/cpp/src/arrow/util/file-to-stream.cc new file mode 100644 index 0000000000000..42c1d55afd322 --- /dev/null +++ b/cpp/src/arrow/util/file-to-stream.cc @@ -0,0 +1,60 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include <iostream> +#include "arrow/io/file.h" +#include "arrow/ipc/file.h" +#include "arrow/ipc/stream.h" +#include "arrow/status.h" + +#include "arrow/util/io-util.h" + +namespace arrow { + +// Reads a file on the file system and prints to stdout the stream version of it.
+Status ConvertToStream(const char* path) { + std::shared_ptr<io::ReadableFile> in_file; + std::shared_ptr<ipc::FileReader> reader; + + RETURN_NOT_OK(io::ReadableFile::Open(path, &in_file)); + RETURN_NOT_OK(ipc::FileReader::Open(in_file, &reader)); + + io::StdoutStream sink; + std::shared_ptr<ipc::StreamWriter> writer; + RETURN_NOT_OK(ipc::StreamWriter::Open(&sink, reader->schema(), &writer)); + for (int i = 0; i < reader->num_record_batches(); ++i) { + std::shared_ptr<RecordBatch> chunk; + RETURN_NOT_OK(reader->GetRecordBatch(i, &chunk)); + RETURN_NOT_OK(writer->WriteRecordBatch(*chunk)); + } + return writer->Close(); +} + +} // namespace arrow + +int main(int argc, char** argv) { + if (argc != 2) { + std::cerr << "Usage: file-to-stream <input arrow file>" << std::endl; + return 1; + } + arrow::Status status = arrow::ConvertToStream(argv[1]); + if (!status.ok()) { + std::cerr << "Could not convert to stream: " << status.ToString() << std::endl; + return 1; + } + return 0; +} diff --git a/cpp/src/arrow/util/io-util.h b/cpp/src/arrow/util/io-util.h new file mode 100644 index 0000000000000..3e5054d8fa83d --- /dev/null +++ b/cpp/src/arrow/util/io-util.h @@ -0,0 +1,93 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_UTIL_IO_UTIL_H +#define ARROW_UTIL_IO_UTIL_H + +#include <iostream> +#include "arrow/buffer.h" + +namespace arrow { namespace io { + +// Output stream that just writes to stdout. +class StdoutStream : public OutputStream { + public: + StdoutStream() : pos_(0) { + set_mode(FileMode::WRITE); + } + virtual ~StdoutStream() {} + + Status Close() { return Status::OK(); } + Status Tell(int64_t* position) { + *position = pos_; + return Status::OK(); + } + + Status Write(const uint8_t* data, int64_t nbytes) { + pos_ += nbytes; + std::cout.write(reinterpret_cast<const char*>(data), nbytes); + return Status::OK(); + } + private: + int64_t pos_; +}; + +// Input stream that just reads from stdin.
+class StdinStream : public InputStream { + public: + StdinStream() : pos_(0) { + set_mode(FileMode::READ); + } + virtual ~StdinStream() {} + + Status Close() { return Status::OK(); } + Status Tell(int64_t* position) { + *position = pos_; + return Status::OK(); + } + + virtual Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) { + std::cin.read(reinterpret_cast<char*>(out), nbytes); + if (std::cin) { + *bytes_read = nbytes; + pos_ += nbytes; + } else { + *bytes_read = 0; + } + return Status::OK(); + } + + virtual Status Read(int64_t nbytes, std::shared_ptr<Buffer>* out) { + auto buffer = std::make_shared<PoolBuffer>(nullptr); + RETURN_NOT_OK(buffer->Resize(nbytes)); + int64_t bytes_read; + RETURN_NOT_OK(Read(nbytes, &bytes_read, buffer->mutable_data())); + RETURN_NOT_OK(buffer->Resize(bytes_read, false)); + *out = buffer; + return Status::OK(); + } + + private: + int64_t pos_; +}; + +} // namespace io +} // namespace arrow + +#endif // ARROW_UTIL_IO_UTIL_H + diff --git a/cpp/src/arrow/util/stream-to-file.cc b/cpp/src/arrow/util/stream-to-file.cc new file mode 100644 index 0000000000000..7a8ec0bfd952b --- /dev/null +++ b/cpp/src/arrow/util/stream-to-file.cc @@ -0,0 +1,58 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include <iostream> +#include "arrow/io/file.h" +#include "arrow/ipc/file.h" +#include "arrow/ipc/stream.h" +#include "arrow/status.h" + +#include "arrow/util/io-util.h" + +namespace arrow { + +// Converts a stream from stdin to a file written to standard out. +// A typical usage would be: +// $ <stream-producing process> | stream-to-file > file.arrow +Status ConvertToFile() { + std::shared_ptr<io::InputStream> input(new io::StdinStream); + std::shared_ptr<ipc::StreamReader> reader; + RETURN_NOT_OK(ipc::StreamReader::Open(input, &reader)); + + io::StdoutStream sink; + std::shared_ptr<ipc::FileWriter> writer; + RETURN_NOT_OK(ipc::FileWriter::Open(&sink, reader->schema(), &writer)); + + std::shared_ptr<RecordBatch> batch; + while (true) { + RETURN_NOT_OK(reader->GetNextRecordBatch(&batch)); + if (batch == nullptr) break; + RETURN_NOT_OK(writer->WriteRecordBatch(*batch)); + } + return writer->Close(); } + +} // namespace arrow + +int main(int argc, char** argv) { + arrow::Status status = arrow::ConvertToFile(); + if (!status.ok()) { + std::cerr << "Could not convert to file: " << status.ToString() << std::endl; + return 1; + } + return 0; +} From a90b5f3634bdbd6af01967f288457d07d5f2e2eb Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Thu, 26 Jan 2017 13:20:17 -0500 Subject: [PATCH 0296/1644] ARROW-512: C++: Add method to check for primitive types Also includes some documentation updates. Author: Uwe L. Korn Closes #304 from xhochy/ARROW-512 and squashes the following commits: cfe9205 [Uwe L.
Korn] ARROW-512: C++: Add method to check for primitive types --- cpp/apidoc/Doxyfile | 2 +- cpp/apidoc/index.md | 4 +- cpp/src/arrow/buffer.h | 54 ++++++------- cpp/src/arrow/builder.h | 150 +++++++++++++++++++----------------- cpp/src/arrow/memory_pool.h | 21 +++++ cpp/src/arrow/type.h | 37 +++++++++ 6 files changed, 167 insertions(+), 101 deletions(-) diff --git a/cpp/apidoc/Doxyfile b/cpp/apidoc/Doxyfile index 7dc55fef834fc..51f5543b2de1b 100644 --- a/cpp/apidoc/Doxyfile +++ b/cpp/apidoc/Doxyfile @@ -204,7 +204,7 @@ SHORT_NAMES = NO # description.) # The default value is: NO. -JAVADOC_AUTOBRIEF = NO +JAVADOC_AUTOBRIEF = YES # If the QT_AUTOBRIEF tag is set to YES then doxygen will interpret the first # line (until the first dot) of a Qt-style comment as the brief description. If diff --git a/cpp/apidoc/index.md b/cpp/apidoc/index.md index 080f848bb184f..fdac4969beb9b 100644 --- a/cpp/apidoc/index.md +++ b/cpp/apidoc/index.md @@ -38,8 +38,8 @@ this bitmap. As Arrow objects are immutable, there are classes provided that should help you build these objects. To build an array of `int64_t` elements, we can use the -`Int64Builder`. In the following example, we build an array of the range 1 to 8 -where the element that should hold the number 4 is nulled. +`arrow::Int64Builder`. In the following example, we build an array of the range +1 to 8 where the element that should hold the number 4 is nulled. Int64Builder builder(arrow::default_memory_pool(), arrow::int64()); builder.Append(1); diff --git a/cpp/src/arrow/buffer.h b/cpp/src/arrow/buffer.h index ac78808eaf205..d43ab0375b725 100644 --- a/cpp/src/arrow/buffer.h +++ b/cpp/src/arrow/buffer.h @@ -35,33 +35,35 @@ class Status; // ---------------------------------------------------------------------- // Buffer classes -// Immutable API for a chunk of bytes which may or may not be owned by the -// class instance. Buffers have two related notions of length: size and -// capacity. Size is the number of bytes that might have valid data. -// Capacity is the number of bytes that where allocated for the buffer in -// total. -// The following invariant is always true: Size < Capacity +/// Immutable API for a chunk of bytes which may or may not be owned by the +/// class instance. +/// +/// Buffers have two related notions of length: size and capacity. Size is +/// the number of bytes that might have valid data. Capacity is the number +/// of bytes that were allocated for the buffer in total. +/// +/// The following invariant is always true: Size < Capacity class ARROW_EXPORT Buffer : public std::enable_shared_from_this<Buffer> { public: Buffer(const uint8_t* data, int64_t size) : is_mutable_(false), data_(data), size_(size), capacity_(size) {} virtual ~Buffer(); - // An offset into data that is owned by another buffer, but we want to be - // able to retain a valid pointer to it even after other shared_ptr's to the - // parent buffer have been destroyed - // - // This method makes no assertions about alignment or padding of the buffer but - // in general we expected buffers to be aligned and padded to 64 bytes. In the future - // we might add utility methods to help determine if a buffer satisfies this contract.
+ /// An offset into data that is owned by another buffer, but we want to be + /// able to retain a valid pointer to it even after other shared_ptr's to the + /// parent buffer have been destroyed + /// + /// This method makes no assertions about alignment or padding of the buffer but + /// in general we expect buffers to be aligned and padded to 64 bytes. In the future + /// we might add utility methods to help determine if a buffer satisfies this contract. Buffer(const std::shared_ptr<Buffer>& parent, int64_t offset, int64_t size); std::shared_ptr<Buffer> get_shared_ptr() { return shared_from_this(); } bool is_mutable() const { return is_mutable_; } - // Return true if both buffers are the same size and contain the same bytes - // up to the number of compared bytes + /// Return true if both buffers are the same size and contain the same bytes + /// up to the number of compared bytes bool Equals(const Buffer& other, int64_t nbytes) const { return this == &other || (size_ >= nbytes && other.size_ >= nbytes && @@ -74,11 +76,11 @@ class ARROW_EXPORT Buffer : public std::enable_shared_from_this<Buffer> { (data_ == other.data_ || !memcmp(data_, other.data_, size_))); } - // Copy section of buffer into a new Buffer + /// Copy a section of the buffer into a new Buffer. Status Copy(int64_t start, int64_t nbytes, MemoryPool* pool, std::shared_ptr<Buffer>* out) const; - // Default memory pool + /// Copy a section of the buffer using the default memory pool into a new Buffer. Status Copy(int64_t start, int64_t nbytes, std::shared_ptr<Buffer>* out) const; int64_t capacity() const { return capacity_; } @@ -101,12 +103,12 @@ class ARROW_EXPORT Buffer : public std::enable_shared_from_this<Buffer> { DISALLOW_COPY_AND_ASSIGN(Buffer); }; -// Construct a view on passed buffer at the indicated offset and length. This -// function cannot fail and does not error checking (except in debug builds) +/// Construct a view on passed buffer at the indicated offset and length. This +/// function cannot fail and does no error checking (except in debug builds) ARROW_EXPORT std::shared_ptr<Buffer> SliceBuffer( const std::shared_ptr<Buffer>& buffer, int64_t offset, int64_t length); -// A Buffer whose contents can be mutated. May or may not own its data. +/// A Buffer whose contents can be mutated. May or may not own its data. class ARROW_EXPORT MutableBuffer : public Buffer { public: MutableBuffer(uint8_t* data, int64_t size) : Buffer(data, size) { @@ -116,7 +118,7 @@ class ARROW_EXPORT MutableBuffer : public Buffer { uint8_t* mutable_data() { return mutable_data_; } - // Get a read-only view of this buffer + /// Get a read-only view of this buffer std::shared_ptr<Buffer> GetImmutableView(); protected: @@ -135,16 +137,16 @@ class ARROW_EXPORT ResizableBuffer : public MutableBuffer { /// decrease. virtual Status Resize(int64_t new_size, bool shrink_to_fit = true) = 0; - // Ensure that buffer has enough memory allocated to fit the indicated - // capacity (and meets the 64 byte padding requirement in Layout.md). - // It does not change buffer's reported size. + /// Ensure that buffer has enough memory allocated to fit the indicated + /// capacity (and meets the 64 byte padding requirement in Layout.md). + /// It does not change buffer's reported size.
virtual Status Reserve(int64_t new_capacity) = 0; protected: ResizableBuffer(uint8_t* data, int64_t size) : MutableBuffer(data, size) {} }; -// A Buffer whose lifetime is tied to a particular MemoryPool +/// A Buffer whose lifetime is tied to a particular MemoryPool class ARROW_EXPORT PoolBuffer : public ResizableBuffer { public: explicit PoolBuffer(MemoryPool* pool = nullptr); @@ -162,7 +164,7 @@ class ARROW_EXPORT BufferBuilder { explicit BufferBuilder(MemoryPool* pool) : pool_(pool), data_(nullptr), capacity_(0), size_(0) {} - // Resizes the buffer to the nearest multiple of 64 bytes per Layout.md + /// Resizes the buffer to the nearest multiple of 64 bytes per Layout.md Status Resize(int32_t elements) { if (capacity_ == 0) { buffer_ = std::make_shared<PoolBuffer>(pool_); } RETURN_NOT_OK(buffer_->Resize(elements)); diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index 735bca1b1bcb3..747da7ca2d9dd 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -37,10 +37,11 @@ class Array; static constexpr int32_t kMinBuilderCapacity = 1 << 5; -// Base class for all data array builders. -// This class provides a facilities for incrementally building the null bitmap -// (see Append methods) and as a side effect the current number of slots and -// the null count. +/// Base class for all data array builders. +// +/// This class provides facilities for incrementally building the null bitmap +/// (see Append methods) and as a side effect the current number of slots and +/// the null count. class ARROW_EXPORT ArrayBuilder { public: explicit ArrayBuilder(MemoryPool* pool, const TypePtr& type) @@ -54,8 +55,8 @@ class ARROW_EXPORT ArrayBuilder { virtual ~ArrayBuilder() = default; - // For nested types. Since the objects are owned by this class instance, we - // skip shared pointers and just return a raw pointer + /// For nested types. Since the objects are owned by this class instance, we + /// skip shared pointers and just return a raw pointer ArrayBuilder* child(int i) { return children_[i].get(); } int num_children() const { return children_.size(); } @@ -64,37 +65,37 @@ class ARROW_EXPORT ArrayBuilder { int32_t null_count() const { return null_count_; } int32_t capacity() const { return capacity_; } - // Append to null bitmap + /// Append to null bitmap Status AppendToBitmap(bool is_valid); - // Vector append. Treat each zero byte as a null. If valid_bytes is null - // assume all of length bits are valid. + /// Vector append. Treat each zero byte as a null. If valid_bytes is null + /// assume all of length bits are valid. Status AppendToBitmap(const uint8_t* valid_bytes, int32_t length); - // Set the next length bits to not null (i.e. valid). + /// Set the next length bits to not null (i.e. valid). Status SetNotNull(int32_t length); - // Allocates initial capacity requirements for the builder. In most - // cases subclasses should override and call there parent classes - // method as well. + /// Allocates initial capacity requirements for the builder. In most + /// cases subclasses should override and call their parent class's + /// method as well. virtual Status Init(int32_t capacity); - // Resizes the null_bitmap array. In most - // cases subclasses should override and call there parent classes - // method as well. + /// Resizes the null_bitmap array. In most + /// cases subclasses should override and call their parent class's + /// method as well.
virtual Status Resize(int32_t new_bits); - // Ensures there is enough space for adding the number of elements by checking - // capacity and calling Resize if necessary. + /// Ensures there is enough space for adding the number of elements by checking + /// capacity and calling Resize if necessary. Status Reserve(int32_t elements); - // For cases where raw data was memcpy'd into the internal buffers, allows us - // to advance the length of the builder. It is your responsibility to use - // this function responsibly. + /// For cases where raw data was memcpy'd into the internal buffers, allows us + /// to advance the length of the builder. It is your responsibility to use + /// this function responsibly. Status Advance(int32_t elements); std::shared_ptr<Buffer> null_bitmap() const { return null_bitmap_; } - // Creates new array object to hold the contents of the builder and transfers - // ownership of the data. This resets all variables on the builder. + /// Creates new Array object to hold the contents of the builder and transfers + /// ownership of the data. This resets all variables on the builder. virtual Status Finish(std::shared_ptr<Array>* out) = 0; std::shared_ptr<DataType> type() const { return type_; } @@ -144,7 +145,7 @@ class ARROW_EXPORT PrimitiveBuilder : public ArrayBuilder { using ArrayBuilder::Advance; - // Write nulls as uint8_t* (0 value indicates null) into pre-allocated memory + /// Write nulls as uint8_t* (0 value indicates null) into pre-allocated memory Status AppendNulls(const uint8_t* valid_bytes, int32_t length) { RETURN_NOT_OK(Reserve(length)); UnsafeAppendToBitmap(valid_bytes, length); @@ -159,18 +160,18 @@ std::shared_ptr<PoolBuffer> data() const { return data_; } - // Vector append - // - // If passed, valid_bytes is of equal length to values, and any zero byte - // will be considered as a null for that slot + /// Vector append + /// + /// If passed, valid_bytes is of equal length to values, and any zero byte + /// will be considered as a null for that slot Status Append( const value_type* values, int32_t length, const uint8_t* valid_bytes = nullptr); Status Finish(std::shared_ptr<Array>* out) override; Status Init(int32_t capacity) override; - // Increase the capacity of the builder to accommodate at least the indicated - // number of elements + /// Increase the capacity of the builder to accommodate at least the indicated + /// number of elements Status Resize(int32_t capacity) override; protected: @@ -178,6 +179,7 @@ value_type* raw_data_; }; +/// Base class for all Builders that emit an Array of a scalar numerical type. template <typename T> class ARROW_EXPORT NumericBuilder : public PrimitiveBuilder<T> { public: @@ -189,14 +191,18 @@ class ARROW_EXPORT NumericBuilder : public PrimitiveBuilder<T> { using PrimitiveBuilder<T>::Resize; using PrimitiveBuilder<T>::Reserve; - // Scalar append. + /// Append a single scalar and increase the size if necessary. Status Append(value_type val) { RETURN_NOT_OK(ArrayBuilder::Reserve(1)); UnsafeAppend(val); return Status::OK(); } - // Does not capacity-check; make sure to call Reserve beforehand + /// Append a single scalar under the assumption that the underlying Buffer is + /// large enough. + /// + /// This method does not capacity-check; make sure to call Reserve + /// beforehand.
void UnsafeAppend(value_type val) { BitUtil::SetBit(null_bitmap_data_, length_); raw_data_[length_++] = val; @@ -235,7 +241,7 @@ class ARROW_EXPORT BooleanBuilder : public ArrayBuilder { using ArrayBuilder::Advance; - // Write nulls as uint8_t* (0 value indicates null) into pre-allocated memory + /// Write nulls as uint8_t* (0 value indicates null) into pre-allocated memory Status AppendNulls(const uint8_t* valid_bytes, int32_t length) { RETURN_NOT_OK(Reserve(length)); UnsafeAppendToBitmap(valid_bytes, length); @@ -250,7 +256,7 @@ std::shared_ptr<PoolBuffer> data() const { return data_; } - // Scalar append + /// Scalar append Status Append(bool val) { Reserve(1); BitUtil::SetBit(null_bitmap_data_, length_); @@ -263,18 +269,18 @@ return Status::OK(); } - // Vector append - // - // If passed, valid_bytes is of equal length to values, and any zero byte - // will be considered as a null for that slot + /// Vector append + /// + /// If passed, valid_bytes is of equal length to values, and any zero byte + /// will be considered as a null for that slot Status Append( const uint8_t* values, int32_t length, const uint8_t* valid_bytes = nullptr); Status Finish(std::shared_ptr<Array>* out) override; Status Init(int32_t capacity) override; - // Increase the capacity of the builder to accommodate at least the indicated - // number of elements + /// Increase the capacity of the builder to accommodate at least the indicated + /// number of elements Status Resize(int32_t capacity) override; protected: @@ -285,26 +291,26 @@ // ---------------------------------------------------------------------- // List builder -// Builder class for variable-length list array value types -// -// To use this class, you must append values to the child array builder and use -// the Append function to delimit each distinct list value (once the values -// have been appended to the child array) or use the bulk API to append -// a sequence of offests and null values. -// -// A note on types. Per arrow/type.h all types in the c++ implementation are -// logical so even though this class always builds list array, this can -// represent multiple different logical types. If no logical type is provided -// at construction time, the class defaults to List where t is taken from the -// value_builder/values that the object is constructed with. +/// Builder class for variable-length list array value types +/// +/// To use this class, you must append values to the child array builder and use +/// the Append function to delimit each distinct list value (once the values +/// have been appended to the child array) or use the bulk API to append +/// a sequence of offsets and null values. +/// +/// A note on types. Per arrow/type.h all types in the c++ implementation are +/// logical so even though this class always builds list array, this can +/// represent multiple different logical types. If no logical type is provided +/// at construction time, the class defaults to List<T> where T is taken from the +/// value_builder/values that the object is constructed with. class ARROW_EXPORT ListBuilder : public ArrayBuilder { public: - // Use this constructor to incrementally build the value array along with offsets and - // null bitmap. + /// Use this constructor to incrementally build the value array along with offsets and + /// null bitmap.
ListBuilder(MemoryPool* pool, std::shared_ptr<ArrayBuilder> value_builder, const TypePtr& type = nullptr); - // Use this constructor to build the list with a pre-existing values array + /// Use this constructor to build the list with a pre-existing values array ListBuilder( MemoryPool* pool, std::shared_ptr<Array> values, const TypePtr& type = nullptr); @@ -314,10 +320,10 @@ class ARROW_EXPORT ListBuilder : public ArrayBuilder { Status Resize(int32_t capacity) override; Status Finish(std::shared_ptr<Array>* out) override; - // Vector append - // - // If passed, valid_bytes is of equal length to values, and any zero byte - // will be considered as a null for that slot + /// Vector append + /// + /// If passed, valid_bytes is of equal length to values, and any zero byte + /// will be considered as a null for that slot Status Append( const int32_t* offsets, int32_t length, const uint8_t* valid_bytes = nullptr) { RETURN_NOT_OK(Reserve(length)); @@ -326,10 +332,10 @@ return Status::OK(); } - // Start a new variable-length list slot - // - // This function should be called before beginning to append elements to the - // value builder + /// Start a new variable-length list slot + /// + /// This function should be called before beginning to append elements to the + /// value builder Status Append(bool is_valid = true) { RETURN_NOT_OK(Reserve(1)); UnsafeAppendToBitmap(is_valid); @@ -396,9 +402,9 @@ class ARROW_EXPORT StringBuilder : public BinaryBuilder { // --------------------------------------------------------------------------------- // StructArray builder -// Append, Resize and Reserve methods are acting on StructBuilder. -// Please make sure all these methods of all child-builders' are consistently -// called to maintain data-structure consistency. +/// Append, Resize and Reserve methods act on StructBuilder. +/// Please make sure these methods are called consistently on all child +/// builders to maintain data-structure consistency. class ARROW_EXPORT StructBuilder : public ArrayBuilder { public: StructBuilder(MemoryPool* pool, const std::shared_ptr<DataType>& type, @@ -409,18 +415,18 @@ Status Finish(std::shared_ptr<Array>* out) override; - // Null bitmap is of equal length to every child field, and any zero byte - // will be considered as a null for that field, but users must using app- - // end methods or advance methods of the child builders' independently to - // insert data. + /// Null bitmap is of equal length to every child field, and any zero byte + /// will be considered as a null for that field, but users must use the + /// append or advance methods of the child builders independently to + /// insert data. Status Append(int32_t length, const uint8_t* valid_bytes) { RETURN_NOT_OK(Reserve(length)); UnsafeAppendToBitmap(valid_bytes, length); return Status::OK(); } - // Append an element to the Struct. All child-builders' Append method must - // be called independently to maintain data-structure consistency. + /// Append an element to the Struct. All child builders' Append methods must + /// be called independently to maintain data-structure consistency.
Status Append(bool is_valid = true) { RETURN_NOT_OK(Reserve(1)); UnsafeAppendToBitmap(is_valid); diff --git a/cpp/src/arrow/memory_pool.h b/cpp/src/arrow/memory_pool.h index 13a3f129c1a9e..89477b6ddeab0 100644 --- a/cpp/src/arrow/memory_pool.h +++ b/cpp/src/arrow/memory_pool.h @@ -26,14 +26,35 @@ namespace arrow { class Status; +/// Base class for memory allocation. +/// +/// Besides tracking the number of allocated bytes, the allocator should also +/// take care of the required 64-byte alignment. class ARROW_EXPORT MemoryPool { public: virtual ~MemoryPool(); + /// Allocate a new memory region of at least size bytes. + /// + /// The allocated region shall be 64-byte aligned. virtual Status Allocate(int64_t size, uint8_t** out) = 0; + + /// Resize an already allocated memory section. + /// + /// Since most default allocators on a platform don't support aligned + /// reallocation, this function can involve a copy of the underlying data. virtual Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) = 0; + + /// Free an allocated region. + /// + /// @param buffer Pointer to the start of the allocated memory region + /// @param size Allocated size located at buffer. An allocator implementation + /// may use this for tracking the amount of allocated bytes as well as for + /// faster deallocation if supported by its backend. virtual void Free(uint8_t* buffer, int64_t size) = 0; + /// The number of bytes that were allocated and not yet free'd through + /// this allocator. virtual int64_t bytes_allocated() const = 0; }; diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index c2a762d279364..77a70d1d2ddd3 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -608,6 +608,43 @@ static inline bool is_floating(Type::type type_id) { return false; } +static inline bool is_primitive(Type::type type_id) { + switch (type_id) { + case Type::NA: + case Type::BOOL: + case Type::UINT8: + case Type::INT8: + case Type::UINT16: + case Type::INT16: + case Type::UINT32: + case Type::INT32: + case Type::UINT64: + case Type::INT64: + case Type::HALF_FLOAT: + case Type::FLOAT: + case Type::DOUBLE: + case Type::DATE: + case Type::TIMESTAMP: + case Type::TIME: + case Type::INTERVAL: + return true; + default: + break; + } + return false; +} + +static inline bool is_binary_like(Type::type type_id) { + switch (type_id) { + case Type::BINARY: + case Type::STRING: + return true; + default: + break; + } + return false; +} + } // namespace arrow #endif // ARROW_TYPE_H From aac2e70c1639cb45c5300b18dd94b000ba4b79db Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 26 Jan 2017 18:38:19 -0500 Subject: [PATCH 0297/1644] ARROW-513: [C++] Fixing Appveyor / MSVC build Visual Studio 2015 (MSVC 19.0) seems to have a compiler bug with inheriting private ctors. It didn't like the private `using StreamWriter::StreamWriter` in the `FileWriter` implementation. This is not consistent with Microsoft's Modern C++ support matrix https://msdn.microsoft.com/en-us/library/hh567368.aspx, so perhaps they now support inheriting *public* constructors.
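A distilled sketch of the offending pattern, using hypothetical stand-in classes rather than the real StreamWriter/FileWriter: the commented-out line is what MSVC 19.0 rejected, and the explicit forwarding constructor is the portable replacement applied in the diff below.

    // Base/Derived are hypothetical stand-ins, not Arrow classes.
    class Base {
     public:
      explicit Base(int sink) : sink_(sink) {}

     protected:
      int sink_;
    };

    class Derived : public Base {
     public:
      static Derived Make(int sink) { return Derived(sink); }

     private:
      // using Base::Base;                        // private inherited ctor: trips MSVC 19.0
      explicit Derived(int sink) : Base(sink) {}  // explicit forwarding ctor: portable
    };

    int main() {
      Derived d = Derived::Make(4);
      return 0;
    }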
Author: Wes McKinney Closes #305 from wesm/ARROW-513 and squashes the following commits: 9362674 [Wes McKinney] Visual Studio 2015 has limited support for inheriting constructors 93119e5 [Wes McKinney] Export some more classes to appease MSVC 9d4887c [Wes McKinney] Disable MSVC 4251 warning, add some ARROW_EXPORT visibility macros --- cpp/src/arrow/io/interfaces.h | 16 ++++++++-------- cpp/src/arrow/io/io-file-test.cc | 16 ++++++---------- cpp/src/arrow/ipc/file.cc | 3 +++ cpp/src/arrow/ipc/file.h | 2 +- cpp/src/arrow/ipc/metadata.h | 6 +++--- cpp/src/arrow/util/CMakeLists.txt | 28 ++++++++++++---------------- cpp/src/arrow/util/file-to-stream.cc | 4 ++-- cpp/src/arrow/util/io-util.h | 22 +++++++++------------- cpp/src/arrow/util/stream-to-file.cc | 6 +++--- cpp/src/arrow/util/visibility.h | 5 +++++ 10 files changed, 52 insertions(+), 56 deletions(-) diff --git a/cpp/src/arrow/io/interfaces.h b/cpp/src/arrow/io/interfaces.h index 78680903d230a..e9f07f03a1419 100644 --- a/cpp/src/arrow/io/interfaces.h +++ b/cpp/src/arrow/io/interfaces.h @@ -42,7 +42,7 @@ struct ObjectType { enum type { FILE, DIRECTORY }; }; -class FileSystemClient { +class ARROW_EXPORT FileSystemClient { public: virtual ~FileSystemClient() {} }; @@ -64,7 +64,7 @@ class ARROW_EXPORT FileInterface { DISALLOW_COPY_AND_ASSIGN(FileInterface); }; -class Seekable { +class ARROW_EXPORT Seekable { public: virtual Status Seek(int64_t position) = 0; }; @@ -76,7 +76,7 @@ class ARROW_EXPORT Writeable { Status Write(const std::string& data); }; -class Readable { +class ARROW_EXPORT Readable { public: virtual Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) = 0; @@ -84,12 +84,12 @@ virtual Status Read(int64_t nbytes, std::shared_ptr<Buffer>* out) = 0; }; -class OutputStream : virtual public FileInterface, public Writeable { +class ARROW_EXPORT OutputStream : virtual public FileInterface, public Writeable { protected: OutputStream() {} }; -class InputStream : virtual public FileInterface, public Readable { +class ARROW_EXPORT InputStream : virtual public FileInterface, public Readable { protected: InputStream() {} }; @@ -118,7 +118,7 @@ class ARROW_EXPORT ReadableFileInterface : public InputStream, public Seekable { ReadableFileInterface(); }; -class WriteableFileInterface : public OutputStream, public Seekable { +class ARROW_EXPORT WriteableFileInterface : public OutputStream, public Seekable { public: virtual Status WriteAt(int64_t position, const uint8_t* data, int64_t nbytes) = 0; @@ -126,8 +126,8 @@ WriteableFileInterface() { set_mode(FileMode::READ); } }; -class ReadWriteFileInterface : public ReadableFileInterface, - public WriteableFileInterface { +class ARROW_EXPORT ReadWriteFileInterface : public ReadableFileInterface, + public WriteableFileInterface { protected: ReadWriteFileInterface() { ReadableFileInterface::set_mode(FileMode::READWRITE); } }; diff --git a/cpp/src/arrow/io/io-file-test.cc b/cpp/src/arrow/io/io-file-test.cc index 86a3287b84fbc..5810c820f6dd7 100644 --- a/cpp/src/arrow/io/io-file-test.cc +++ b/cpp/src/arrow/io/io-file-test.cc @@ -341,14 +341,12 @@ TEST_F(TestReadableFile, ThreadSafety) { std::atomic<int> correct_count(0); const int niter = 10000; - auto ReadData = [&correct_count, &data, niter, this] () { + auto ReadData = [&correct_count, &data, niter, this]() { std::shared_ptr<Buffer> buffer; for (int i = 0; i < niter; ++i) { ASSERT_OK(file_->ReadAt(0, 3, &buffer)); - if (0 == memcmp(data.c_str(), buffer->data(), 3)) { -
correct_count += 1; - } + if (0 == memcmp(data.c_str(), buffer->data(), 3)) { correct_count += 1; } } }; @@ -498,20 +496,18 @@ TEST_F(TestMemoryMappedFile, ThreadSafety) { std::shared_ptr<MemoryMappedFile> file; ASSERT_OK(MemoryMappedFile::Open(path, FileMode::READWRITE, &file)); - ASSERT_OK(file->Write(reinterpret_cast<const uint8_t*>(data.c_str()), - static_cast<int64_t>(data.size()))); + ASSERT_OK(file->Write( + reinterpret_cast<const uint8_t*>(data.c_str()), static_cast<int64_t>(data.size()))); std::atomic<int> correct_count(0); const int niter = 10000; - auto ReadData = [&correct_count, &data, niter, &file] () { + auto ReadData = [&correct_count, &data, niter, &file]() { std::shared_ptr<Buffer> buffer; for (int i = 0; i < niter; ++i) { ASSERT_OK(file->ReadAt(0, 3, &buffer)); - if (0 == memcmp(data.c_str(), buffer->data(), 3)) { - correct_count += 1; - } + if (0 == memcmp(data.c_str(), buffer->data(), 3)) { correct_count += 1; } } }; diff --git a/cpp/src/arrow/ipc/file.cc b/cpp/src/arrow/ipc/file.cc index bc086e31519a5..3b1832611024f 100644 --- a/cpp/src/arrow/ipc/file.cc +++ b/cpp/src/arrow/ipc/file.cc @@ -158,6 +158,9 @@ Status FileFooter::GetSchema(std::shared_ptr<Schema>* out) const { // ---------------------------------------------------------------------- // File writer implementation +FileWriter::FileWriter(io::OutputStream* sink, const std::shared_ptr<Schema>& schema) + : StreamWriter(sink, schema) {} + Status FileWriter::Open(io::OutputStream* sink, const std::shared_ptr<Schema>& schema, std::shared_ptr<FileWriter>* out) { *out = std::shared_ptr<FileWriter>(new FileWriter(sink, schema)); // ctor is private diff --git a/cpp/src/arrow/ipc/file.h b/cpp/src/arrow/ipc/file.h index 7696954c188e3..cf0baab820eef 100644 --- a/cpp/src/arrow/ipc/file.h +++ b/cpp/src/arrow/ipc/file.h @@ -78,7 +78,7 @@ class ARROW_EXPORT FileWriter : public StreamWriter { Status Close() override; private: - using StreamWriter::StreamWriter; + FileWriter(io::OutputStream* sink, const std::shared_ptr<Schema>& schema); Status Start() override; diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h index 6e15ef353d853..81e3dbdf6c4c0 100644 --- a/cpp/src/arrow/ipc/metadata.h +++ b/cpp/src/arrow/ipc/metadata.h @@ -85,12 +85,12 @@ class ARROW_EXPORT SchemaMetadata { }; // Field metadata -struct FieldMetadata { +struct ARROW_EXPORT FieldMetadata { int32_t length; int32_t null_count; }; -struct BufferMetadata { +struct ARROW_EXPORT BufferMetadata { int32_t page; int64_t offset; int64_t length; @@ -149,7 +149,7 @@ class ARROW_EXPORT Message { std::unique_ptr impl_; }; -struct FileBlock { +struct ARROW_EXPORT FileBlock { FileBlock() {} FileBlock(int64_t offset, int32_t metadata_length, int64_t body_length) : offset(offset), metadata_length(metadata_length), body_length(body_length) {} diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt index 0830ee2ed2928..19b1e193d4228 100644 --- a/cpp/src/arrow/util/CMakeLists.txt +++ b/cpp/src/arrow/util/CMakeLists.txt @@ -68,24 +68,20 @@ if (ARROW_BUILD_BENCHMARKS) endif() endif() -if (ARROW_BUILD_UTILITIES) - if (APPLE) - set(UTIL_LINK_LIBS - arrow_ipc_static - arrow_io_static - arrow_static - boost_filesystem_static - boost_system_static - dl) - else() +if (ARROW_IPC AND ARROW_BUILD_UTILITIES) + set(UTIL_LINK_LIBS + arrow_ipc_static + arrow_io_static + arrow_static + boost_filesystem_static + boost_system_static + dl) + + if (NOT APPLE) set(UTIL_LINK_LIBS - arrow_ipc_static - arrow_io_static - arrow_static - pthread + ${UTIL_LINK_LIBS} boost_filesystem_static - boost_system_static - dl) + boost_system_static) endif() add_executable(file-to-stream
file-to-stream.cc) diff --git a/cpp/src/arrow/util/file-to-stream.cc b/cpp/src/arrow/util/file-to-stream.cc index 42c1d55afd322..7daf26366721d 100644 --- a/cpp/src/arrow/util/file-to-stream.cc +++ b/cpp/src/arrow/util/file-to-stream.cc @@ -15,11 +15,11 @@ // specific language governing permissions and limitations // under the License. -#include <iostream> #include "arrow/io/file.h" #include "arrow/ipc/file.h" #include "arrow/ipc/stream.h" #include "arrow/status.h" +#include <iostream> #include "arrow/util/io-util.h" @@ -44,7 +44,7 @@ Status ConvertToStream(const char* path) { return writer->Close(); } -} // namespace arrow +} // namespace arrow int main(int argc, char** argv) { if (argc != 2) { diff --git a/cpp/src/arrow/util/io-util.h b/cpp/src/arrow/util/io-util.h index 3e5054d8fa83d..9f2645699004c 100644 --- a/cpp/src/arrow/util/io-util.h +++ b/cpp/src/arrow/util/io-util.h @@ -18,8 +18,8 @@ #ifndef ARROW_UTIL_IO_UTIL_H #define ARROW_UTIL_IO_UTIL_H -#include <iostream> #include "arrow/buffer.h" +#include <iostream> namespace arrow { namespace io { @@ -27,13 +27,11 @@ // Output stream that just writes to stdout. class StdoutStream : public OutputStream { public: - StdoutStream() : pos_(0) { - set_mode(FileMode::WRITE); - } + StdoutStream() : pos_(0) { set_mode(FileMode::WRITE); } virtual ~StdoutStream() {} Status Close() { return Status::OK(); } - Status Tell(int64_t* position) { + Status Tell(int64_t* position) { *position = pos_; return Status::OK(); } @@ -43,6 +41,7 @@ std::cout.write(reinterpret_cast<const char*>(data), nbytes); return Status::OK(); } + private: int64_t pos_; }; @@ -50,13 +49,11 @@ class StdinStream : public InputStream { public: - StdinStream() : pos_(0) { - set_mode(FileMode::READ); - } + StdinStream() : pos_(0) { set_mode(FileMode::READ); } virtual ~StdinStream() {} Status Close() { return Status::OK(); } - Status Tell(int64_t* position) { + Status Tell(int64_t* position) { *position = pos_; return Status::OK(); } @@ -86,8 +83,7 @@ int64_t pos_; }; -} // namespace io -} // namespace arrow - -#endif // ARROW_UTIL_IO_UTIL_H +} // namespace io +} // namespace arrow +#endif // ARROW_UTIL_IO_UTIL_H diff --git a/cpp/src/arrow/util/stream-to-file.cc b/cpp/src/arrow/util/stream-to-file.cc index 7a8ec0bfd952b..393b07d8d355f 100644 --- a/cpp/src/arrow/util/stream-to-file.cc +++ b/cpp/src/arrow/util/stream-to-file.cc @@ -15,11 +15,11 @@ // specific language governing permissions and limitations // under the License.
-#include <iostream> +#include "arrow/ipc/stream.h" #include "arrow/io/file.h" #include "arrow/ipc/file.h" -#include "arrow/ipc/stream.h" #include "arrow/status.h" +#include <iostream> #include "arrow/util/io-util.h" @@ -46,7 +46,7 @@ Status ConvertToFile() { return writer->Close(); } -} // namespace arrow +} // namespace arrow int main(int argc, char** argv) { arrow::Status status = arrow::ConvertToFile(); diff --git a/cpp/src/arrow/util/visibility.h b/cpp/src/arrow/util/visibility.h index 9321cc550ec1f..4819a0061e75f 100644 --- a/cpp/src/arrow/util/visibility.h +++ b/cpp/src/arrow/util/visibility.h @@ -19,6 +19,11 @@ #define ARROW_UTIL_VISIBILITY_H #if defined(_WIN32) || defined(__CYGWIN__) +#if defined(_MSC_VER) +#pragma warning(disable : 4251) +#else +#pragma GCC diagnostic ignored "-Wattributes" +#endif #define ARROW_EXPORT __declspec(dllexport) #define ARROW_NO_EXPORT #else // Not Windows From 30bb0d97d584b65ad6ed8ab225c5c4008eafb88c Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 26 Jan 2017 22:12:19 -0500 Subject: [PATCH 0298/1644] ARROW-514: [Python] Automatically wrap pyarrow.io.Buffer in BufferReader Author: Wes McKinney Closes #306 from wesm/ARROW-514 and squashes the following commits: d5e3235 [Wes McKinney] Automatically wrap pyarrow.io.Buffer when passing in to FileReader or StreamReader --- python/pyarrow/io.pyx | 2 ++ python/pyarrow/tests/test_ipc.py | 6 +++--- 2 files changed, 5 insertions(+), 3 deletions(-) diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index 0755ed8bb4d4f..e5f8b7a0abff4 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -526,6 +526,8 @@ cdef get_reader(object source, shared_ptr[ReadableFileInterface]* reader): if isinstance(source, six.string_types): source = MemoryMappedFile(source, mode='r') + elif isinstance(source, Buffer): + source = BufferReader(source) elif not isinstance(source, NativeFile) and hasattr(source, 'read'): # Optimistically hope this is file-like source = PythonFileInterface(source, mode='r') diff --git a/python/pyarrow/tests/test_ipc.py b/python/pyarrow/tests/test_ipc.py index 819d1b71b8546..8ca464f034d14 100644 --- a/python/pyarrow/tests/test_ipc.py +++ b/python/pyarrow/tests/test_ipc.py @@ -36,7 +36,7 @@ def _get_sink(self): return io.BytesIO() def _get_source(self): - return self.sink.getvalue() + return pa.BufferReader(self.sink.getvalue()) def write_batches(self): nrows = 5 @@ -74,7 +74,7 @@ def test_simple_roundtrip(self): batches = self.write_batches() file_contents = self._get_source() - reader = pa.FileReader(pa.BufferReader(file_contents)) + reader = pa.FileReader(file_contents) assert reader.num_record_batches == len(batches) @@ -92,7 +92,7 @@ def _get_writer(self, sink, schema): def test_simple_roundtrip(self): batches = self.write_batches() file_contents = self._get_source() - reader = pa.StreamReader(pa.BufferReader(file_contents)) + reader = pa.StreamReader(file_contents) total = 0 for i, next_batch in enumerate(reader): From 4226adfbc6b3dff10b3fe7a6691b30bcc94140bd Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 27 Jan 2017 10:46:34 +0100 Subject: [PATCH 0299/1644] ARROW-515: [Python] Add read_all methods to FileReader, StreamReader Stacked on top of ARROW-514 Author: Wes McKinney Closes #307 from wesm/ARROW-515 and squashes the following commits: 6f2185c [Wes McKinney] Add read_all method to StreamReader, FileReader --- python/pyarrow/io.pyx | 44 +++++++++++++++++++++++++++++++- python/pyarrow/table.pyx | 4 +-- python/pyarrow/tests/test_ipc.py | 19 ++++++++++++++ 3 files changed, 63 insertions(+),
4 deletions(-) diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index e5f8b7a0abff4..8b5650879f8f1 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -34,7 +34,8 @@ cimport pyarrow.includes.pyarrow as pyarrow from pyarrow.compat import frombytes, tobytes, encode_file_path from pyarrow.error cimport check_status from pyarrow.schema cimport Schema -from pyarrow.table cimport RecordBatch, batch_from_cbatch +from pyarrow.table cimport (RecordBatch, batch_from_cbatch, + table_from_ctable) cimport cpython as cp @@ -936,6 +937,27 @@ cdef class _StreamReader: return batch_from_cbatch(batch) + def read_all(self): + """ + Read all record batches as a pyarrow.Table + """ + cdef: + vector[shared_ptr[CRecordBatch]] batches + shared_ptr[CRecordBatch] batch + shared_ptr[CTable] table + c_string name = b'' + + with nogil: + while True: + check_status(self.reader.get().GetNextRecordBatch(&batch)) + if batch.get() == NULL: + break + batches.push_back(batch) + + check_status(CTable.FromRecordBatches(name, batches, &table)) + + return table_from_ctable(table) + cdef class _FileWriter(_StreamWriter): @@ -997,3 +1019,23 @@ cdef class _FileReader: # TODO(wesm): ARROW-503: Function was renamed. Remove after a period of # time has passed get_record_batch = get_batch + + def read_all(self): + """ + Read all record batches as a pyarrow.Table + """ + cdef: + vector[shared_ptr[CRecordBatch]] batches + shared_ptr[CTable] table + c_string name = b'' + int i, nbatches + + nbatches = self.num_record_batches + + batches.resize(nbatches) + with nogil: + for i in range(nbatches): + check_status(self.reader.get().GetRecordBatch(i, &batches[i])) + check_status(CTable.FromRecordBatches(name, batches, &table)) + + return table_from_ctable(table) diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index 924233066055e..17072108f301f 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -690,9 +690,7 @@ cdef class Table: with nogil: check_status(CTable.FromRecordBatches(c_name, c_batches, &c_table)) - table = Table() - table.init(c_table) - return table + return table_from_ctable(c_table) def to_pandas(self, nthreads=None): """ diff --git a/python/pyarrow/tests/test_ipc.py b/python/pyarrow/tests/test_ipc.py index 8ca464f034d14..665a63b6d5a38 100644 --- a/python/pyarrow/tests/test_ipc.py +++ b/python/pyarrow/tests/test_ipc.py @@ -83,6 +83,16 @@ def test_simple_roundtrip(self): batch = reader.get_batch(i) assert batches[i].equals(batch) + def test_read_all(self): + batches = self.write_batches() + file_contents = self._get_source() + + reader = pa.FileReader(file_contents) + + result = reader.read_all() + expected = pa.Table.from_batches(batches) + assert result.equals(expected) + class TestStream(MessagingTest, unittest.TestCase): @@ -104,6 +114,15 @@ def test_simple_roundtrip(self): with pytest.raises(StopIteration): reader.get_next_batch() + def test_read_all(self): + batches = self.write_batches() + file_contents = self._get_source() + reader = pa.StreamReader(file_contents) + + result = reader.read_all() + expected = pa.Table.from_batches(batches) + assert result.equals(expected) + class TestInMemoryFile(TestFile): From 7ac320bde52ae47007dadac7398e22a203c6a48d Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 29 Jan 2017 21:27:17 -0500 Subject: [PATCH 0300/1644] ARROW-519: [C++] Refactor array comparison code into a compare.h / compare.cc in part to resolve Xcode 6.1 linker issue This should also pave the way for more user-friendly reporting of "why are the arrays not 
equal" per ARROW-517 Author: Wes McKinney Closes #308 from wesm/ARROW-519 and squashes the following commits: 85b0bf8 [Wes McKinney] Fix invalid memory access when doing RangeEquals on BinaryArray with all empty strings f5f4593 [Wes McKinney] Remove unused function in pandas.cc. Fix Binary RangeEquals for arrays of length-0 strings 2118ef4 [Wes McKinney] cpplint, compiler warnings ad54cc6 [Wes McKinney] Remove unneeded ARROW_EXPORT 342a8e6 [Wes McKinney] Refactor array comparison code into a compare.h header and compilation unit. Use visitor pattern. Also may resolve Xcode bug reported in ARROW-519 --- cpp/CMakeLists.txt | 1 + cpp/src/arrow/CMakeLists.txt | 1 + cpp/src/arrow/array-primitive-test.cc | 4 +- cpp/src/arrow/array-string-test.cc | 48 ++- cpp/src/arrow/array.cc | 334 ++--------------- cpp/src/arrow/array.h | 145 +------- cpp/src/arrow/compare.cc | 516 ++++++++++++++++++++++++++ cpp/src/arrow/compare.h | 46 +++ cpp/src/arrow/util/macros.h | 2 + python/CMakeLists.txt | 3 + python/src/pyarrow/adapters/pandas.cc | 8 - 11 files changed, 641 insertions(+), 467 deletions(-) create mode 100644 cpp/src/arrow/compare.cc create mode 100644 cpp/src/arrow/compare.h diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index a0f89f314f683..ff2c1a61b95a6 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -771,6 +771,7 @@ set(ARROW_SRCS src/arrow/buffer.cc src/arrow/builder.cc src/arrow/column.cc + src/arrow/compare.cc src/arrow/memory_pool.cc src/arrow/pretty_print.cc src/arrow/schema.cc diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index e5e36ed253cfa..b002bb75ca934 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -20,6 +20,7 @@ install(FILES api.h array.h column.h + compare.h buffer.h builder.h memory_pool.h diff --git a/cpp/src/arrow/array-primitive-test.cc b/cpp/src/arrow/array-primitive-test.cc index 443abac459dbf..c839fb9b19234 100644 --- a/cpp/src/arrow/array-primitive-test.cc +++ b/cpp/src/arrow/array-primitive-test.cc @@ -135,7 +135,7 @@ class TestPrimitiveBuilder : public TestBuilder { ASSERT_EQ(nullptr, builder->data()); ASSERT_EQ(ex_null_count, result->null_count()); - ASSERT_TRUE(result->EqualsExact(*expected.get())); + ASSERT_TRUE(result->Equals(*expected)); } protected: @@ -238,7 +238,7 @@ void TestPrimitiveBuilder::Check( bool actual = BitUtil::GetBit(result->raw_data(), i); ASSERT_EQ(static_cast(draws_[i]), actual) << i; } - ASSERT_TRUE(result->EqualsExact(*expected.get())); + ASSERT_TRUE(result->Equals(*expected)); } typedef ::testing::Types strings_; }; -TEST_F(TestStringContainer, TestArrayBasics) { +TEST_F(TestStringArray, TestArrayBasics) { ASSERT_EQ(length_, strings_->length()); ASSERT_EQ(1, strings_->null_count()); ASSERT_OK(strings_->Validate()); } -TEST_F(TestStringContainer, TestType) { +TEST_F(TestStringArray, TestType) { TypePtr type = strings_->type(); ASSERT_EQ(Type::STRING, type->type); ASSERT_EQ(Type::STRING, strings_->type_enum()); } -TEST_F(TestStringContainer, TestListFunctions) { +TEST_F(TestStringArray, TestListFunctions) { int pos = 0; for (size_t i = 0; i < expected_.size(); ++i) { ASSERT_EQ(pos, strings_->value_offset(i)); @@ -112,12 +112,12 @@ TEST_F(TestStringContainer, TestListFunctions) { } } -TEST_F(TestStringContainer, TestDestructor) { +TEST_F(TestStringArray, TestDestructor) { auto arr = std::make_shared( length_, offsets_buf_, value_buf_, null_count_, null_bitmap_); } -TEST_F(TestStringContainer, TestGetString) { +TEST_F(TestStringArray, TestGetString) { for (size_t i = 0; i < 
expected_.size(); ++i) { if (valid_bytes_[i] == 0) { ASSERT_TRUE(strings_->IsNull(i)); @@ -127,7 +127,7 @@ TEST_F(TestStringContainer, TestGetString) { } } -TEST_F(TestStringContainer, TestEmptyStringComparison) { +TEST_F(TestStringArray, TestEmptyStringComparison) { offsets_ = {0, 0, 0, 0, 0, 0}; offsets_buf_ = test::GetBufferFromVector(offsets_); length_ = offsets_.size() - 1; @@ -212,7 +212,7 @@ TEST_F(TestStringBuilder, TestZeroLength) { // Binary container type // TODO(emkornfield) there should be some way to refactor these to avoid code duplicating // with String -class TestBinaryContainer : public ::testing::Test { +class TestBinaryArray : public ::testing::Test { public: void SetUp() { chars_ = {'a', 'b', 'b', 'c', 'c', 'c'}; @@ -252,20 +252,20 @@ class TestBinaryArray : public ::testing::Test { std::shared_ptr<BinaryArray> strings_; }; -TEST_F(TestBinaryContainer, TestArrayBasics) { +TEST_F(TestBinaryArray, TestArrayBasics) { ASSERT_EQ(length_, strings_->length()); ASSERT_EQ(1, strings_->null_count()); ASSERT_OK(strings_->Validate()); } -TEST_F(TestBinaryContainer, TestType) { +TEST_F(TestBinaryArray, TestType) { TypePtr type = strings_->type(); ASSERT_EQ(Type::BINARY, type->type); ASSERT_EQ(Type::BINARY, strings_->type_enum()); } -TEST_F(TestBinaryContainer, TestListFunctions) { +TEST_F(TestBinaryArray, TestListFunctions) { int pos = 0; for (size_t i = 0; i < expected_.size(); ++i) { ASSERT_EQ(pos, strings_->value_offset(i)); @@ -274,12 +274,12 @@ TEST_F(TestBinaryContainer, TestListFunctions) { } } -TEST_F(TestBinaryContainer, TestDestructor) { +TEST_F(TestBinaryArray, TestDestructor) { auto arr = std::make_shared<BinaryArray>( length_, offsets_buf_, value_buf_, null_count_, null_bitmap_); } -TEST_F(TestBinaryContainer, TestGetValue) { +TEST_F(TestBinaryArray, TestGetValue) { for (size_t i = 0; i < expected_.size(); ++i) { if (valid_bytes_[i] == 0) { ASSERT_TRUE(strings_->IsNull(i)); @@ -291,6 +291,28 @@ TEST_F(TestBinaryArray, TestGetValue) { } } +TEST_F(TestBinaryArray, TestEqualsEmptyStrings) { + BinaryBuilder builder(default_memory_pool(), arrow::binary()); + + std::string empty_string(""); + + builder.Append(empty_string); + builder.Append(empty_string); + builder.Append(empty_string); + builder.Append(empty_string); + builder.Append(empty_string); + + std::shared_ptr<Array> left_arr; + ASSERT_OK(builder.Finish(&left_arr)); + + const BinaryArray& left = static_cast<const BinaryArray&>(*left_arr); + std::shared_ptr<BinaryArray> right = std::make_shared<BinaryArray>( + left.length(), left.offsets(), nullptr, left.null_count(), left.null_bitmap()); + + ASSERT_TRUE(left.Equals(right)); + ASSERT_TRUE(left.RangeEquals(0, left.length(), 0, right)); +} + class TestBinaryBuilder : public TestBuilder { public: void SetUp() { diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index aa4a692e85cb9..6fc7fb60bf364 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -22,6 +22,7 @@ #include #include "arrow/buffer.h" +#include "arrow/compare.h" #include "arrow/status.h" #include "arrow/type_traits.h" #include "arrow/util/bit-util.h" @@ -51,43 +52,42 @@ Array::Array(const std::shared_ptr<DataType>& type, int32_t length, int32_t null if (null_bitmap_) { null_bitmap_data_ = null_bitmap_->data(); } } -bool Array::BaseEquals(const std::shared_ptr<Array>& other) const { - if (this == other.get()) { return true; } - if (!other) { return false; } - return EqualsExact(*other.get()); +bool Array::Equals(const Array& arr) const { + bool are_equal = false; + Status error = ArrayEquals(*this, arr, &are_equal); + if (!error.ok()) { DCHECK(false) << "Arrays not
comparable: " << error.ToString(); } + return are_equal; } -bool Array::EqualsExact(const Array& other) const { - if (this == &other) { return true; } - if (length_ != other.length_ || null_count_ != other.null_count_ || - type_enum() != other.type_enum()) { - return false; - } - if (null_count_ > 0) { - return null_bitmap_->Equals(*other.null_bitmap_, BitUtil::BytesForBits(length_)); - } - return true; +bool Array::Equals(const std::shared_ptr<Array>& arr) const { + if (!arr) { return false; } + return Equals(*arr); } -bool Array::ApproxEquals(const std::shared_ptr<Array>& arr) const { - return Equals(arr); +bool Array::ApproxEquals(const Array& arr) const { + bool are_equal = false; + Status error = ArrayApproxEquals(*this, arr, &are_equal); + if (!error.ok()) { DCHECK(false) << "Arrays not comparable: " << error.ToString(); } + return are_equal; } -Status Array::Validate() const { - return Status::OK(); -} - -bool NullArray::Equals(const std::shared_ptr<Array>& arr) const { - if (this == arr.get()) { return true; } - if (Type::NA != arr->type_enum()) { return false; } - return arr->length() == length_; +bool Array::ApproxEquals(const std::shared_ptr<Array>& arr) const { + if (!arr) { return false; } + return ApproxEquals(*arr); } -bool NullArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_index, +bool Array::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, const std::shared_ptr<Array>& arr) const { if (!arr) { return false; } - if (Type::NA != arr->type_enum()) { return false; } - return true; + bool are_equal = false; + Status error = + ArrayRangeEquals(*this, *arr, start_idx, end_idx, other_start_idx, &are_equal); + if (!error.ok()) { DCHECK(false) << "Arrays not comparable: " << error.ToString(); } + return are_equal; +} + +Status Array::Validate() const { + return Status::OK(); } Status NullArray::Accept(ArrayVisitor* visitor) const { @@ -105,36 +105,6 @@ PrimitiveArray::PrimitiveArray(const std::shared_ptr<DataType>& type, int32_t le raw_data_ = data == nullptr ?
nullptr : data_->data(); } -bool PrimitiveArray::EqualsExact(const PrimitiveArray& other) const { - if (!Array::EqualsExact(other)) { return false; } - - if (null_count_ > 0) { - const uint8_t* this_data = raw_data_; - const uint8_t* other_data = other.raw_data_; - - auto size_meta = dynamic_cast(type_.get()); - int value_byte_size = size_meta->bit_width() / 8; - DCHECK_GT(value_byte_size, 0); - - for (int i = 0; i < length_; ++i) { - if (!IsNull(i) && memcmp(this_data, other_data, value_byte_size)) { return false; } - this_data += value_byte_size; - other_data += value_byte_size; - } - return true; - } else { - if (length_ == 0 && other.length_ == 0) { return true; } - return data_->Equals(*other.data_, length_); - } -} - -bool PrimitiveArray::Equals(const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (!arr) { return false; } - if (this->type_enum() != arr->type_enum()) { return false; } - return EqualsExact(static_cast(*arr.get())); -} - template Status NumericArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); @@ -150,6 +120,7 @@ template class NumericArray; template class NumericArray; template class NumericArray; template class NumericArray; +template class NumericArray; template class NumericArray; template class NumericArray; template class NumericArray; @@ -167,50 +138,6 @@ BooleanArray::BooleanArray(const std::shared_ptr& type, int32_t length const std::shared_ptr& null_bitmap) : PrimitiveArray(type, length, data, null_count, null_bitmap) {} -bool BooleanArray::EqualsExact(const BooleanArray& other) const { - if (this == &other) return true; - if (null_count_ != other.null_count_) { return false; } - - if (null_count_ > 0) { - bool equal_bitmap = - null_bitmap_->Equals(*other.null_bitmap_, BitUtil::BytesForBits(length_)); - if (!equal_bitmap) { return false; } - - const uint8_t* this_data = raw_data_; - const uint8_t* other_data = other.raw_data_; - - for (int i = 0; i < length_; ++i) { - if (!IsNull(i) && BitUtil::GetBit(this_data, i) != BitUtil::GetBit(other_data, i)) { - return false; - } - } - return true; - } else { - return data_->Equals(*other.data_, BitUtil::BytesForBits(length_)); - } -} - -bool BooleanArray::Equals(const std::shared_ptr& arr) const { - if (this == arr.get()) return true; - if (Type::BOOL != arr->type_enum()) { return false; } - return EqualsExact(static_cast(*arr.get())); -} - -bool BooleanArray::RangeEquals(int32_t start_idx, int32_t end_idx, - int32_t other_start_idx, const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (!arr) { return false; } - if (this->type_enum() != arr->type_enum()) { return false; } - const auto other = static_cast(arr.get()); - for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { - const bool is_null = IsNull(i); - if (is_null != arr->IsNull(o_i) || (!is_null && Value(i) != other->Value(o_i))) { - return false; - } - } - return true; -} - Status BooleanArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } @@ -218,56 +145,6 @@ Status BooleanArray::Accept(ArrayVisitor* visitor) const { // ---------------------------------------------------------------------- // ListArray -bool ListArray::EqualsExact(const ListArray& other) const { - if (this == &other) { return true; } - if (null_count_ != other.null_count_) { return false; } - - bool equal_offsets = - offsets_buffer_->Equals(*other.offsets_buffer_, (length_ + 1) * sizeof(int32_t)); - if (!equal_offsets) { return false; } - bool equal_null_bitmap = 
true; - if (null_count_ > 0) { - equal_null_bitmap = - null_bitmap_->Equals(*other.null_bitmap_, BitUtil::BytesForBits(length_)); - } - - if (!equal_null_bitmap) { return false; } - - return values()->Equals(other.values()); -} - -bool ListArray::Equals(const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (this->type_enum() != arr->type_enum()) { return false; } - return EqualsExact(static_cast(*arr.get())); -} - -bool ListArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (!arr) { return false; } - if (this->type_enum() != arr->type_enum()) { return false; } - const auto other = static_cast(arr.get()); - for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { - const bool is_null = IsNull(i); - if (is_null != arr->IsNull(o_i)) { return false; } - if (is_null) continue; - const int32_t begin_offset = offset(i); - const int32_t end_offset = offset(i + 1); - const int32_t other_begin_offset = other->offset(o_i); - const int32_t other_end_offset = other->offset(o_i + 1); - // Underlying can't be equal if the size isn't equal - if (end_offset - begin_offset != other_end_offset - other_begin_offset) { - return false; - } - if (!values_->RangeEquals( - begin_offset, end_offset, other_begin_offset, other->values())) { - return false; - } - } - return true; -} - Status ListArray::Validate() const { if (length_ < 0) { return Status::Invalid("Length was negative"); } if (!offsets_buffer_) { return Status::Invalid("offsets_buffer_ was null"); } @@ -350,51 +227,6 @@ Status BinaryArray::Validate() const { return Status::OK(); } -bool BinaryArray::EqualsExact(const BinaryArray& other) const { - if (!Array::EqualsExact(other)) { return false; } - - bool equal_offsets = - offsets_buffer_->Equals(*other.offsets_buffer_, (length_ + 1) * sizeof(int32_t)); - if (!equal_offsets) { return false; } - - if (!data_buffer_ && !(other.data_buffer_)) { return true; } - - return data_buffer_->Equals(*other.data_buffer_, raw_offsets()[length_]); -} - -bool BinaryArray::Equals(const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (this->type_enum() != arr->type_enum()) { return false; } - return EqualsExact(static_cast(*arr.get())); -} - -bool BinaryArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (!arr) { return false; } - if (this->type_enum() != arr->type_enum()) { return false; } - const auto other = static_cast(arr.get()); - for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { - const bool is_null = IsNull(i); - if (is_null != arr->IsNull(o_i)) { return false; } - if (is_null) continue; - const int32_t begin_offset = offset(i); - const int32_t end_offset = offset(i + 1); - const int32_t other_begin_offset = other->offset(o_i); - const int32_t other_end_offset = other->offset(o_i + 1); - // Underlying can't be equal if the size isn't equal - if (end_offset - begin_offset != other_end_offset - other_begin_offset) { - return false; - } - - if (std::memcmp(data_ + begin_offset, other->data_ + other_begin_offset, - end_offset - begin_offset)) { - return false; - } - } - return true; -} - Status BinaryArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } @@ -421,36 +253,6 @@ std::shared_ptr StructArray::field(int32_t pos) const { return field_arrays_[pos]; } -bool 
StructArray::Equals(const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (!arr) { return false; } - if (this->type_enum() != arr->type_enum()) { return false; } - if (null_count_ != arr->null_count()) { return false; } - return RangeEquals(0, length_, 0, arr); -} - -bool StructArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (!arr) { return false; } - if (Type::STRUCT != arr->type_enum()) { return false; } - const auto& other = static_cast(*arr.get()); - - bool equal_fields = true; - for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { - if (IsNull(i) != arr->IsNull(o_i)) { return false; } - if (IsNull(i)) continue; - for (size_t j = 0; j < field_arrays_.size(); ++j) { - // TODO: really we should be comparing stretches of non-null data rather - // than looking at one value at a time. - equal_fields = field(j)->RangeEquals(i, i + 1, o_i, other.field(j)); - if (!equal_fields) { return false; } - } - } - - return true; -} - Status StructArray::Validate() const { if (length_ < 0) { return Status::Invalid("Length was negative"); } @@ -511,67 +313,6 @@ std::shared_ptr UnionArray::child(int32_t pos) const { return children_[pos]; } -bool UnionArray::Equals(const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (!arr) { return false; } - if (!this->type_->Equals(arr->type())) { return false; } - if (null_count_ != arr->null_count()) { return false; } - return RangeEquals(0, length_, 0, arr); -} - -bool UnionArray::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (!arr) { return false; } - if (Type::UNION != arr->type_enum()) { return false; } - const auto& other = static_cast(*arr.get()); - - const UnionMode union_mode = mode(); - if (union_mode != other.mode()) { return false; } - - // Define a mapping from the type id to child number - const auto& type_codes = static_cast(*arr->type().get()).type_ids; - uint8_t max_code = 0; - for (uint8_t code : type_codes) { - if (code > max_code) { max_code = code; } - } - - // Store mapping in a vector for constant time lookups - std::vector type_id_to_child_num(max_code + 1); - for (uint8_t i = 0; i < static_cast(type_codes.size()); ++i) { - type_id_to_child_num[type_codes[i]] = i; - } - - const uint8_t* this_ids = raw_type_ids(); - const uint8_t* other_ids = other.raw_type_ids(); - - uint8_t id, child_num; - for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { - if (IsNull(i) != other.IsNull(o_i)) { return false; } - if (IsNull(i)) continue; - if (this_ids[i] != other_ids[o_i]) { return false; } - - id = this_ids[i]; - child_num = type_id_to_child_num[id]; - - // TODO(wesm): really we should be comparing stretches of non-null data - // rather than looking at one value at a time. 
- if (union_mode == UnionMode::SPARSE) { - if (!child(child_num)->RangeEquals(i, i + 1, o_i, other.child(child_num))) { - return false; - } - } else { - const int32_t offset = offsets_[i]; - const int32_t o_offset = other.offsets_[i]; - if (!child(child_num)->RangeEquals( - offset, offset + 1, o_offset, other.child(child_num))) { - return false; - } - } - } - return true; -} - Status UnionArray::Validate() const { if (length_ < 0) { return Status::Invalid("Length was negative"); } @@ -624,25 +365,6 @@ std::shared_ptr DictionaryArray::dictionary() const { return dict_type_->dictionary(); } -bool DictionaryArray::EqualsExact(const DictionaryArray& other) const { - if (!dictionary()->Equals(other.dictionary())) { return false; } - return indices_->Equals(other.indices()); -} - -bool DictionaryArray::Equals(const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (Type::DICTIONARY != arr->type_enum()) { return false; } - return EqualsExact(static_cast(*arr.get())); -} - -bool DictionaryArray::RangeEquals(int32_t start_idx, int32_t end_idx, - int32_t other_start_idx, const std::shared_ptr& arr) const { - if (Type::DICTIONARY != arr->type_enum()) { return false; } - const auto& dict_other = static_cast(*arr.get()); - if (!dictionary()->Equals(dict_other.dictionary())) { return false; } - return indices_->RangeEquals(start_idx, end_idx, other_start_idx, dict_other.indices()); -} - Status DictionaryArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 4f4b727f39ae5..3b6e93f9cb34d 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -102,15 +102,16 @@ class ARROW_EXPORT Array { /// Note that for `null_count == 0`, this can be a `nullptr`. const uint8_t* null_bitmap_data() const { return null_bitmap_data_; } - bool BaseEquals(const std::shared_ptr& arr) const; - bool EqualsExact(const Array& arr) const; - virtual bool Equals(const std::shared_ptr& arr) const = 0; - virtual bool ApproxEquals(const std::shared_ptr& arr) const; + bool Equals(const Array& arr) const; + bool Equals(const std::shared_ptr& arr) const; + + bool ApproxEquals(const std::shared_ptr& arr) const; + bool ApproxEquals(const Array& arr) const; /// Compare if the range of slots specified are equal for the given array and /// this array. end_idx exclusive. This methods does not bounds check. - virtual bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const std::shared_ptr& arr) const = 0; + bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, + const std::shared_ptr& arr) const; /// Determines if the array is internally consistent. 
/// @@ -142,10 +143,6 @@ class ARROW_EXPORT NullArray : public Array { explicit NullArray(int32_t length) : NullArray(std::make_shared(), length) {} - bool Equals(const std::shared_ptr& arr) const override; - bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_index, - const std::shared_ptr& arr) const override; - Status Accept(ArrayVisitor* visitor) const override; }; @@ -159,9 +156,6 @@ class ARROW_EXPORT PrimitiveArray : public Array { std::shared_ptr data() const { return data_; } - bool EqualsExact(const PrimitiveArray& other) const; - bool Equals(const std::shared_ptr& arr) const override; - protected: PrimitiveArray(const std::shared_ptr& type, int32_t length, const std::shared_ptr& data, int32_t null_count = 0, @@ -184,28 +178,6 @@ class ARROW_EXPORT NumericArray : public PrimitiveArray { const std::shared_ptr& null_bitmap = nullptr) : PrimitiveArray(type, length, data, null_count, null_bitmap) {} - bool EqualsExact(const NumericArray& other) const { - return PrimitiveArray::EqualsExact(static_cast(other)); - } - - bool ApproxEquals(const std::shared_ptr& arr) const override { - return PrimitiveArray::Equals(arr); - } - - bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const std::shared_ptr& arr) const override { - if (this == arr.get()) { return true; } - if (!arr) { return false; } - if (this->type_enum() != arr->type_enum()) { return false; } - const auto other = static_cast*>(arr.get()); - for (int32_t i = start_idx, o_i = other_start_idx; i < end_idx; ++i, ++o_i) { - const bool is_null = IsNull(i); - if (is_null != arr->IsNull(o_i) || (!is_null && Value(i) != other->Value(o_i))) { - return false; - } - } - return true; - } const value_type* raw_data() const { return reinterpret_cast(raw_data_); } @@ -215,78 +187,6 @@ class ARROW_EXPORT NumericArray : public PrimitiveArray { value_type Value(int i) const { return raw_data()[i]; } }; -template <> -inline bool NumericArray::ApproxEquals( - const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (!arr) { return false; } - if (this->type_enum() != arr->type_enum()) { return false; } - - const auto& other = *static_cast*>(arr.get()); - - if (this == &other) { return true; } - if (null_count_ != other.null_count_) { return false; } - - auto this_data = reinterpret_cast(raw_data_); - auto other_data = reinterpret_cast(other.raw_data_); - - static constexpr float EPSILON = 1E-5; - - if (length_ == 0 && other.length_ == 0) { return true; } - - if (null_count_ > 0) { - bool equal_bitmap = - null_bitmap_->Equals(*other.null_bitmap_, BitUtil::CeilByte(length_) / 8); - if (!equal_bitmap) { return false; } - - for (int i = 0; i < length_; ++i) { - if (IsNull(i)) continue; - if (fabs(this_data[i] - other_data[i]) > EPSILON) { return false; } - } - } else { - for (int i = 0; i < length_; ++i) { - if (fabs(this_data[i] - other_data[i]) > EPSILON) { return false; } - } - } - return true; -} - -template <> -inline bool NumericArray::ApproxEquals( - const std::shared_ptr& arr) const { - if (this == arr.get()) { return true; } - if (!arr) { return false; } - if (this->type_enum() != arr->type_enum()) { return false; } - - const auto& other = *static_cast*>(arr.get()); - - if (this == &other) { return true; } - if (null_count_ != other.null_count_) { return false; } - - auto this_data = reinterpret_cast(raw_data_); - auto other_data = reinterpret_cast(other.raw_data_); - - if (length_ == 0 && other.length_ == 0) { return true; } - - static constexpr double EPSILON = 
1E-5; - - if (null_count_ > 0) { - bool equal_bitmap = - null_bitmap_->Equals(*other.null_bitmap_, BitUtil::CeilByte(length_) / 8); - if (!equal_bitmap) { return false; } - - for (int i = 0; i < length_; ++i) { - if (IsNull(i)) continue; - if (fabs(this_data[i] - other_data[i]) > EPSILON) { return false; } - } - } else { - for (int i = 0; i < length_; ++i) { - if (fabs(this_data[i] - other_data[i]) > EPSILON) { return false; } - } - } - return true; -} - class ARROW_EXPORT BooleanArray : public PrimitiveArray { public: using TypeClass = BooleanType; @@ -297,11 +197,6 @@ class ARROW_EXPORT BooleanArray : public PrimitiveArray { const std::shared_ptr& data, int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); - bool EqualsExact(const BooleanArray& other) const; - bool Equals(const std::shared_ptr& arr) const override; - bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const std::shared_ptr& arr) const override; - Status Accept(ArrayVisitor* visitor) const override; const uint8_t* raw_data() const { return reinterpret_cast(raw_data_); } @@ -345,12 +240,6 @@ class ARROW_EXPORT ListArray : public Array { int32_t value_offset(int i) const { return offsets_[i]; } int32_t value_length(int i) const { return offsets_[i + 1] - offsets_[i]; } - bool EqualsExact(const ListArray& other) const; - bool Equals(const std::shared_ptr& arr) const override; - - bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const std::shared_ptr& arr) const override; - Status Accept(ArrayVisitor* visitor) const override; protected: @@ -396,11 +285,6 @@ class ARROW_EXPORT BinaryArray : public Array { int32_t value_offset(int i) const { return offsets_[i]; } int32_t value_length(int i) const { return offsets_[i + 1] - offsets_[i]; } - bool EqualsExact(const BinaryArray& other) const; - bool Equals(const std::shared_ptr& arr) const override; - bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const std::shared_ptr& arr) const override; - Status Validate() const override; Status Accept(ArrayVisitor* visitor) const override; @@ -459,11 +343,6 @@ class ARROW_EXPORT StructArray : public Array { const std::vector>& fields() const { return field_arrays_; } - bool EqualsExact(const StructArray& other) const; - bool Equals(const std::shared_ptr& arr) const override; - bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const std::shared_ptr& arr) const override; - Status Accept(ArrayVisitor* visitor) const override; protected: @@ -500,11 +379,6 @@ class ARROW_EXPORT UnionArray : public Array { const std::vector>& children() const { return children_; } - bool EqualsExact(const UnionArray& other) const; - bool Equals(const std::shared_ptr& arr) const override; - bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const std::shared_ptr& arr) const override; - Status Accept(ArrayVisitor* visitor) const override; protected: @@ -555,11 +429,6 @@ class ARROW_EXPORT DictionaryArray : public Array { const DictionaryType* dict_type() { return dict_type_; } - bool EqualsExact(const DictionaryArray& other) const; - bool Equals(const std::shared_ptr& arr) const override; - bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const std::shared_ptr& arr) const override; - Status Accept(ArrayVisitor* visitor) const override; protected: diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc new file mode 100644 index 0000000000000..d039bba08827c --- 
/dev/null
+++ b/cpp/src/arrow/compare.cc
@@ -0,0 +1,516 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+// Functions for comparing Arrow data structures
+
+#include "arrow/compare.h"
+
+#include <cmath>
+#include <cstring>
+#include <vector>
+
+#include "arrow/array.h"
+#include "arrow/status.h"
+#include "arrow/type.h"
+#include "arrow/type_traits.h"
+#include "arrow/util/bit-util.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+
+// ----------------------------------------------------------------------
+// Public method implementations
+
+class RangeEqualsVisitor : public ArrayVisitor {
+ public:
+  RangeEqualsVisitor(const Array& right, int32_t left_start_idx, int32_t left_end_idx,
+      int32_t right_start_idx)
+      : right_(right),
+        left_start_idx_(left_start_idx),
+        left_end_idx_(left_end_idx),
+        right_start_idx_(right_start_idx),
+        result_(false) {}
+
+  Status Visit(const NullArray& left) override {
+    UNUSED(left);
+    result_ = true;
+    return Status::OK();
+  }
+
+  template <typename ArrayType>
+  inline Status CompareValues(const ArrayType& left) {
+    const auto& right = static_cast<const ArrayType&>(right_);
+
+    for (int32_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_;
+         ++i, ++o_i) {
+      const bool is_null = left.IsNull(i);
+      if (is_null != right.IsNull(o_i) ||
+          (!is_null && left.Value(i) != right.Value(o_i))) {
+        result_ = false;
+        return Status::OK();
+      }
+    }
+    result_ = true;
+    return Status::OK();
+  }
+
+  bool CompareBinaryRange(const BinaryArray& left) const {
+    const auto& right = static_cast<const BinaryArray&>(right_);
+
+    for (int32_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_;
+         ++i, ++o_i) {
+      const bool is_null = left.IsNull(i);
+      if (is_null != right.IsNull(o_i)) { return false; }
+      if (is_null) continue;
+      const int32_t begin_offset = left.offset(i);
+      const int32_t end_offset = left.offset(i + 1);
+      const int32_t right_begin_offset = right.offset(o_i);
+      const int32_t right_end_offset = right.offset(o_i + 1);
+      // Underlying can't be equal if the size isn't equal
+      if (end_offset - begin_offset != right_end_offset - right_begin_offset) {
+        return false;
+      }
+
+      if (end_offset - begin_offset > 0 &&
+          std::memcmp(left.data()->data() + begin_offset,
+              right.data()->data() + right_begin_offset, end_offset - begin_offset)) {
+        return false;
+      }
+    }
+    return true;
+  }
+
+  Status Visit(const BooleanArray& left) override {
+    return CompareValues<BooleanArray>(left);
+  }
+
+  Status Visit(const Int8Array& left) override { return CompareValues<Int8Array>(left); }
+
+  Status Visit(const Int16Array& left) override {
+    return CompareValues<Int16Array>(left);
+  }
+  Status Visit(const Int32Array& left) override {
+    return CompareValues<Int32Array>(left);
+  }
+  Status Visit(const Int64Array& left) override {
+    return CompareValues<Int64Array>(left);
+  }
+  Status Visit(const UInt8Array& left) override {
+    return CompareValues<UInt8Array>(left);
+  }
+  Status Visit(const UInt16Array& left) override {
+    return CompareValues<UInt16Array>(left);
+  }
+  Status Visit(const UInt32Array& left) override {
+    return CompareValues<UInt32Array>(left);
+  }
+  Status Visit(const UInt64Array& left) override {
+    return CompareValues<UInt64Array>(left);
+  }
+  Status Visit(const FloatArray& left) override {
+    return CompareValues<FloatArray>(left);
+  }
+  Status Visit(const DoubleArray& left) override {
+    return CompareValues<DoubleArray>(left);
+  }
+
+  Status Visit(const HalfFloatArray& left) override {
+    return Status::NotImplemented("Half float type");
+  }
+
+  Status Visit(const StringArray& left) override {
+    result_ = CompareBinaryRange(left);
+    return Status::OK();
+  }
+
+  Status Visit(const BinaryArray& left) override {
+    result_ = CompareBinaryRange(left);
+    return Status::OK();
+  }
+
+  Status Visit(const DateArray& left) override { return CompareValues<DateArray>(left); }
+
+  Status Visit(const TimeArray& left) override { return CompareValues<TimeArray>(left); }
+
+  Status Visit(const TimestampArray& left) override {
+    return CompareValues<TimestampArray>(left);
+  }
+
+  Status Visit(const IntervalArray& left) override {
+    return CompareValues<IntervalArray>(left);
+  }
+
+  Status Visit(const DecimalArray& left) override {
+    return Status::NotImplemented("Decimal type");
+  }
+
+  bool CompareLists(const ListArray& left) {
+    const auto& right = static_cast<const ListArray&>(right_);
+
+    const std::shared_ptr<Array>& left_values = left.values();
+    const std::shared_ptr<Array>& right_values = right.values();
+
+    for (int32_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_;
+         ++i, ++o_i) {
+      const bool is_null = left.IsNull(i);
+      if (is_null != right.IsNull(o_i)) { return false; }
+      if (is_null) continue;
+      const int32_t begin_offset = left.offset(i);
+      const int32_t end_offset = left.offset(i + 1);
+      const int32_t right_begin_offset = right.offset(o_i);
+      const int32_t right_end_offset = right.offset(o_i + 1);
+      // Underlying can't be equal if the size isn't equal
+      if (end_offset - begin_offset != right_end_offset - right_begin_offset) {
+        return false;
+      }
+      if (!left_values->RangeEquals(
+              begin_offset, end_offset, right_begin_offset, right_values)) {
+        return false;
+      }
+    }
+    return true;
+  }
+
+  Status Visit(const ListArray& left) override {
+    result_ = CompareLists(left);
+    return Status::OK();
+  }
+
+  bool CompareStructs(const StructArray& left) {
+    const auto& right = static_cast<const StructArray&>(right_);
+    bool equal_fields = true;
+    for (int32_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_;
+         ++i, ++o_i) {
+      if (left.IsNull(i) != right.IsNull(o_i)) { return false; }
+      if (left.IsNull(i)) continue;
+      for (size_t j = 0; j < left.fields().size(); ++j) {
+        // TODO: really we should be comparing stretches of non-null data rather
+        // than looking at one value at a time.
+        equal_fields = left.field(j)->RangeEquals(i, i + 1, o_i, right.field(j));
+        if (!equal_fields) { return false; }
+      }
+    }
+    return true;
+  }
+
+  Status Visit(const StructArray& left) override {
+    result_ = CompareStructs(left);
+    return Status::OK();
+  }
+
+  bool CompareUnions(const UnionArray& left) const {
+    const auto& right = static_cast<const UnionArray&>(right_);
+
+    const UnionMode union_mode = left.mode();
+    if (union_mode != right.mode()) { return false; }
+
+    const auto& left_type = static_cast<const UnionType&>(*left.type());
+
+    // Define a mapping from the type id to child number
+    uint8_t max_code = 0;
+
+    const std::vector<uint8_t> type_codes = left_type.type_ids;
+    for (size_t i = 0; i < type_codes.size(); ++i) {
+      const uint8_t code = type_codes[i];
+      if (code > max_code) { max_code = code; }
+    }
+
+    // Store mapping in a vector for constant time lookups
+    std::vector<uint8_t> type_id_to_child_num(max_code + 1);
+    for (uint8_t i = 0; i < static_cast<int>(type_codes.size()); ++i) {
+      type_id_to_child_num[type_codes[i]] = i;
+    }
+
+    const uint8_t* left_ids = left.raw_type_ids();
+    const uint8_t* right_ids = right.raw_type_ids();
+
+    uint8_t id, child_num;
+    for (int32_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_;
+         ++i, ++o_i) {
+      if (left.IsNull(i) != right.IsNull(o_i)) { return false; }
+      if (left.IsNull(i)) continue;
+      if (left_ids[i] != right_ids[o_i]) { return false; }
+
+      id = left_ids[i];
+      child_num = type_id_to_child_num[id];
+
+      // TODO(wesm): really we should be comparing stretches of non-null data
+      // rather than looking at one value at a time.
+      if (union_mode == UnionMode::SPARSE) {
+        if (!left.child(child_num)->RangeEquals(i, i + 1, o_i, right.child(child_num))) {
+          return false;
+        }
+      } else {
+        const int32_t offset = left.raw_offsets()[i];
+        const int32_t o_offset = right.raw_offsets()[i];
+        if (!left.child(child_num)->RangeEquals(
+                offset, offset + 1, o_offset, right.child(child_num))) {
+          return false;
+        }
+      }
+    }
+    return true;
+  }
+
+  Status Visit(const UnionArray& left) override {
+    result_ = CompareUnions(left);
+    return Status::OK();
+  }
+
+  Status Visit(const DictionaryArray& left) override {
+    const auto& right = static_cast<const DictionaryArray&>(right_);
+    if (!left.dictionary()->Equals(right.dictionary())) {
+      result_ = false;
+      return Status::OK();
+    }
+    result_ = left.indices()->RangeEquals(
+        left_start_idx_, left_end_idx_, right_start_idx_, right.indices());
+    return Status::OK();
+  }
+
+  bool result() const { return result_; }
+
+ protected:
+  const Array& right_;
+  int32_t left_start_idx_;
+  int32_t left_end_idx_;
+  int32_t right_start_idx_;
+
+  bool result_;
+};
+
+class EqualsVisitor : public RangeEqualsVisitor {
+ public:
+  explicit EqualsVisitor(const Array& right)
+      : RangeEqualsVisitor(right, 0, right.length(), 0) {}
+
+  Status Visit(const NullArray& left) override { return Status::OK(); }
+
+  Status Visit(const BooleanArray& left) override {
+    const auto& right = static_cast<const BooleanArray&>(right_);
+    if (left.null_count() > 0) {
+      const uint8_t* left_data = left.data()->data();
+      const uint8_t* right_data = right.data()->data();
+
+      for (int i = 0; i < left.length(); ++i) {
+        if (!left.IsNull(i) &&
+            BitUtil::GetBit(left_data, i) != BitUtil::GetBit(right_data, i)) {
+          result_ = false;
+          return Status::OK();
+        }
+      }
+      result_ = true;
+    } else {
+      result_ = left.data()->Equals(*right.data(), BitUtil::BytesForBits(left.length()));
+    }
+    return Status::OK();
+  }
+
+  bool IsEqualPrimitive(const PrimitiveArray& left) {
+    const auto& right = static_cast<const PrimitiveArray&>(right_);
+    if (left.null_count() > 0) {
+      const uint8_t* left_data = left.data()->data();
+      const uint8_t* right_data = right.data()->data();
+      const auto& size_meta = dynamic_cast<const FixedWidthType&>(*left.type());
+      const int value_byte_size = size_meta.bit_width() / 8;
+      DCHECK_GT(value_byte_size, 0);
+
+      for (int i = 0; i < left.length(); ++i) {
+        if (!left.IsNull(i) && memcmp(left_data, right_data, value_byte_size)) {
+          return false;
+        }
+        left_data += value_byte_size;
+        right_data += value_byte_size;
+      }
+      return true;
+    } else {
+      if (left.length() == 0) { return true; }
+      return left.data()->Equals(*right.data(), left.length());
+    }
+  }
+
+  Status ComparePrimitive(const PrimitiveArray& left) {
+    result_ = IsEqualPrimitive(left);
+    return Status::OK();
+  }
+
+  Status Visit(const Int8Array& left) override { return ComparePrimitive(left); }
+
+  Status Visit(const Int16Array& left) override { return ComparePrimitive(left); }
+
+  Status Visit(const Int32Array& left) override { return ComparePrimitive(left); }
+
+  Status Visit(const Int64Array& left) override { return ComparePrimitive(left); }
+
+  Status Visit(const UInt8Array& left) override { return ComparePrimitive(left); }
+
+  Status Visit(const UInt16Array& left) override { return ComparePrimitive(left); }
+
+  Status Visit(const UInt32Array& left) override { return ComparePrimitive(left); }
+
+  Status Visit(const UInt64Array& left) override { return ComparePrimitive(left); }
+
+  Status Visit(const FloatArray& left) override { return ComparePrimitive(left); }
+
+  Status Visit(const DoubleArray& left) override { return ComparePrimitive(left); }
+
+  Status Visit(const DateArray& left) override { return ComparePrimitive(left); }
+
+  Status Visit(const TimeArray& left) override { return ComparePrimitive(left); }
+
+  Status Visit(const TimestampArray& left) override { return ComparePrimitive(left); }
+
+  Status Visit(const IntervalArray& left) override { return ComparePrimitive(left); }
+
+  bool CompareBinary(const BinaryArray& left) {
+    const auto& right = static_cast<const BinaryArray&>(right_);
+    bool equal_offsets =
+        left.offsets()->Equals(*right.offsets(), (left.length() + 1) * sizeof(int32_t));
+    if (!equal_offsets) { return false; }
+    if (!left.data() && !(right.data())) { return true; }
+    return left.data()->Equals(*right.data(), left.raw_offsets()[left.length()]);
+  }
+
+  Status Visit(const StringArray& left) override {
+    result_ = CompareBinary(left);
+    return Status::OK();
+  }
+
+  Status Visit(const BinaryArray& left) override {
+    result_ = CompareBinary(left);
+    return Status::OK();
+  }
+
+  Status Visit(const ListArray& left) override {
+    const auto& right = static_cast<const ListArray&>(right_);
+    if (!left.offsets()->Equals(
+            *right.offsets(), (left.length() + 1) * sizeof(int32_t))) {
+      result_ = false;
+    } else {
+      result_ = left.values()->Equals(right.values());
+    }
+    return Status::OK();
+  }
+
+  Status Visit(const DictionaryArray& left) override {
+    const auto& right = static_cast<const DictionaryArray&>(right_);
+    if (!left.dictionary()->Equals(right.dictionary())) {
+      result_ = false;
+    } else {
+      result_ = left.indices()->Equals(right.indices());
+    }
+    return Status::OK();
+  }
+};
+
+template <typename TYPE>
+inline bool FloatingApproxEquals(
+    const NumericArray<TYPE>& left, const NumericArray<TYPE>& right) {
+  using T = typename TYPE::c_type;
+
+  auto left_data = reinterpret_cast<const T*>(left.data()->data());
+  auto right_data = reinterpret_cast<const T*>(right.data()->data());
+
+  static constexpr T EPSILON = 1E-5;
+
+  if (left.length() == 0 && right.length() == 0) { return true; }
+
+  if (left.null_count() > 0) {
+    for (int32_t i = 0; i < left.length(); ++i) {
+      if (left.IsNull(i)) continue;
+      if (fabs(left_data[i] - right_data[i]) > EPSILON) { return false; }
+    }
+  } else {
+    for (int32_t i = 0; i < left.length(); ++i) {
+      if (fabs(left_data[i] - right_data[i]) > EPSILON) { return false; }
+    }
+  }
+  return true;
+}
+
+class ApproxEqualsVisitor : public EqualsVisitor {
+ public:
+  using EqualsVisitor::EqualsVisitor;
+
+  Status Visit(const FloatArray& left) override {
+    result_ =
+        FloatingApproxEquals<FloatType>(left, static_cast<const FloatArray&>(right_));
+    return Status::OK();
+  }
+
+  Status Visit(const DoubleArray& left) override {
+    result_ =
+        FloatingApproxEquals<DoubleType>(left, static_cast<const DoubleArray&>(right_));
+    return Status::OK();
+  }
+};
+
+static bool BaseDataEquals(const Array& left, const Array& right) {
+  if (left.length() != right.length() || left.null_count() != right.null_count() ||
+      left.type_enum() != right.type_enum()) {
+    return false;
+  }
+  if (left.null_count() > 0) {
+    return left.null_bitmap()->Equals(
+        *right.null_bitmap(), BitUtil::BytesForBits(left.length()));
+  }
+  return true;
+}
+
+Status ArrayEquals(const Array& left, const Array& right, bool* are_equal) {
+  // The arrays are the same object
+  if (&left == &right) {
+    *are_equal = true;
+  } else if (!BaseDataEquals(left, right)) {
+    *are_equal = false;
+  } else {
+    EqualsVisitor visitor(right);
+    RETURN_NOT_OK(left.Accept(&visitor));
+    *are_equal = visitor.result();
+  }
+  return Status::OK();
+}
+
+Status ArrayRangeEquals(const Array& left, const Array& right, int32_t left_start_idx,
+    int32_t left_end_idx, int32_t right_start_idx, bool* are_equal) {
+  if (&left == &right) {
+    *are_equal = true;
+  } else if (left.type_enum() != right.type_enum()) {
+    *are_equal = false;
+  } else {
+    RangeEqualsVisitor visitor(right, left_start_idx, left_end_idx, right_start_idx);
+    RETURN_NOT_OK(left.Accept(&visitor));
+    *are_equal = visitor.result();
+  }
+  return Status::OK();
+}
+
+Status ArrayApproxEquals(const Array& left, const Array& right, bool* are_equal) {
+  // The arrays are the same object
+  if (&left == &right) {
+    *are_equal = true;
+  } else if (!BaseDataEquals(left, right)) {
+    *are_equal = false;
+  } else {
+    ApproxEqualsVisitor visitor(right);
+    RETURN_NOT_OK(left.Accept(&visitor));
+    *are_equal = visitor.result();
+  }
+  return Status::OK();
+}
+
+} // namespace arrow
diff --git a/cpp/src/arrow/compare.h b/cpp/src/arrow/compare.h
new file mode 100644
index 0000000000000..2093b65a51a13
--- /dev/null
+++ b/cpp/src/arrow/compare.h
@@ -0,0 +1,46 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
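The compare.h header that follows declares the three entry points implemented above. Each returns a Status and reports the comparison outcome through a bool out-parameter, so a call site checks both. A minimal usage sketch, not part of the patch, assuming two arrays built elsewhere (e.g. with a builder):

// Sketch only: demonstrates the Status + out-parameter calling convention of
// the new comparison entry points. RETURN_NOT_OK comes from arrow/status.h.
#include "arrow/array.h"
#include "arrow/compare.h"
#include "arrow/status.h"

arrow::Status CheckArrays(const arrow::Array& left, const arrow::Array& right) {
  bool are_equal = false;
  // Exact, type-aware equality (dispatches through EqualsVisitor)
  RETURN_NOT_OK(arrow::ArrayEquals(left, right, &are_equal));
  if (!are_equal) {
    // Epsilon-based; identical to ArrayEquals for non-floating-point types
    RETURN_NOT_OK(arrow::ArrayApproxEquals(left, right, &are_equal));
  }
  // Compare slots [0, 10) of left against right starting at right's slot 0
  RETURN_NOT_OK(arrow::ArrayRangeEquals(left, right, 0, 10, 0, &are_equal));
  return arrow::Status::OK();
}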
+ +// Functions for comparing Arrow data structures + +#ifndef ARROW_COMPARE_H +#define ARROW_COMPARE_H + +#include + +#include "arrow/util/visibility.h" + +namespace arrow { + +class Array; +class Status; + +/// Returns true if the arrays are exactly equal +Status ARROW_EXPORT ArrayEquals(const Array& left, const Array& right, bool* are_equal); + +/// Returns true if the arrays are approximately equal. For non-floating point +/// types, this is equivalent to ArrayEquals(left, right) +Status ARROW_EXPORT ArrayApproxEquals( + const Array& left, const Array& right, bool* are_equal); + +/// Returns true if indicated equal-length segment of arrays is exactly equal +Status ARROW_EXPORT ArrayRangeEquals(const Array& left, const Array& right, + int32_t start_idx, int32_t end_idx, int32_t other_start_idx, bool* are_equal); + +} // namespace arrow + +#endif // ARROW_COMPARE_H diff --git a/cpp/src/arrow/util/macros.h b/cpp/src/arrow/util/macros.h index e2bb355115b42..c4a62a475b92f 100644 --- a/cpp/src/arrow/util/macros.h +++ b/cpp/src/arrow/util/macros.h @@ -25,4 +25,6 @@ TypeName& operator=(const TypeName&) = delete #endif +#define UNUSED(x) (void)x + #endif // ARROW_UTIL_MACROS_H diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index d63fff48a011f..942e74b73aaee 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -74,6 +74,9 @@ include(SetupCxxFlags) # Add common flags set(CMAKE_CXX_FLAGS "${CXX_COMMON_FLAGS} ${CMAKE_CXX_FLAGS}") +# Enable perf and other tools to work properly +set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-omit-frame-pointer") + # Suppress Cython warnings set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-unused-variable") diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index feafa3dfc3875..920779fe86174 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -153,14 +153,6 @@ static inline bool PyObject_is_string(const PyObject* obj) { #endif } -static inline bool PyObject_is_bool(const PyObject* obj) { -#if PY_MAJOR_VERSION >= 3 - return PyString_Check(obj) || PyBytes_Check(obj); -#else - return PyString_Check(obj) || PyUnicode_Check(obj); -#endif -} - template static int64_t ValuesToBitmap(const void* data, int64_t length, uint8_t* bitmap) { typedef npy_traits traits; From be5d73f2cbcbd4c3f4e0a8ba41f69222ecedfc05 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 31 Jan 2017 08:00:49 -0500 Subject: [PATCH 0301/1644] ARROW-410: [C++] Add virtual Writeable::Flush Author: Wes McKinney Closes #310 from wesm/ARROW-410 and squashes the following commits: 7352f0a [Wes McKinney] Add virtual Writeable::Flush, and move HDFS flush there --- cpp/src/arrow/io/hdfs.cc | 15 ++++++++++++--- cpp/src/arrow/io/hdfs.h | 2 ++ cpp/src/arrow/io/interfaces.cc | 4 ++++ cpp/src/arrow/io/interfaces.h | 3 +++ 4 files changed, 21 insertions(+), 3 deletions(-) diff --git a/cpp/src/arrow/io/hdfs.cc b/cpp/src/arrow/io/hdfs.cc index 44e503ff11302..2845b0d8c448c 100644 --- a/cpp/src/arrow/io/hdfs.cc +++ b/cpp/src/arrow/io/hdfs.cc @@ -238,15 +238,20 @@ class HdfsOutputStream::HdfsOutputStreamImpl : public HdfsAnyFileImpl { Status Close() { if (is_open_) { - int ret = driver_->Flush(fs_, file_); - CHECK_FAILURE(ret, "Flush"); - ret = driver_->CloseFile(fs_, file_); + RETURN_NOT_OK(Flush()); + int ret = driver_->CloseFile(fs_, file_); CHECK_FAILURE(ret, "CloseFile"); is_open_ = false; } return Status::OK(); } + Status Flush() { + int ret = driver_->Flush(fs_, file_); + CHECK_FAILURE(ret, "Flush"); + return 
Status::OK(); + } + Status Write(const uint8_t* buffer, int64_t nbytes, int64_t* bytes_written) { tSize ret = driver_->Write(fs_, file_, reinterpret_cast(buffer), nbytes); CHECK_FAILURE(ret, "Write"); @@ -277,6 +282,10 @@ Status HdfsOutputStream::Write(const uint8_t* buffer, int64_t nbytes) { return Write(buffer, nbytes, &bytes_written_dummy); } +Status HdfsOutputStream::Flush() { + return impl_->Flush(); +} + Status HdfsOutputStream::Tell(int64_t* position) { return impl_->Tell(position); } diff --git a/cpp/src/arrow/io/hdfs.h b/cpp/src/arrow/io/hdfs.h index 5cc783e475967..fbf1d758afb99 100644 --- a/cpp/src/arrow/io/hdfs.h +++ b/cpp/src/arrow/io/hdfs.h @@ -208,6 +208,8 @@ class ARROW_EXPORT HdfsOutputStream : public OutputStream { Status Write(const uint8_t* buffer, int64_t nbytes, int64_t* bytes_written); + Status Flush() override; + Status Tell(int64_t* position) override; private: diff --git a/cpp/src/arrow/io/interfaces.cc b/cpp/src/arrow/io/interfaces.cc index 7e78caa04e711..51ed0693e2c57 100644 --- a/cpp/src/arrow/io/interfaces.cc +++ b/cpp/src/arrow/io/interfaces.cc @@ -52,5 +52,9 @@ Status Writeable::Write(const std::string& data) { reinterpret_cast(data.c_str()), static_cast(data.size())); } +Status Writeable::Flush() { + return Status::OK(); +} + } // namespace io } // namespace arrow diff --git a/cpp/src/arrow/io/interfaces.h b/cpp/src/arrow/io/interfaces.h index e9f07f03a1419..9862a67aed0cd 100644 --- a/cpp/src/arrow/io/interfaces.h +++ b/cpp/src/arrow/io/interfaces.h @@ -73,6 +73,9 @@ class ARROW_EXPORT Writeable { public: virtual Status Write(const uint8_t* data, int64_t nbytes) = 0; + // Default implementation is a no-op + virtual Status Flush(); + Status Write(const std::string& data); }; From 0ae4d86e5ef8ee53a8810f4324dce80ec6a9d422 Mon Sep 17 00:00:00 2001 From: Nong Li Date: Thu, 2 Feb 2017 14:36:23 +0100 Subject: [PATCH 0302/1644] ARROW-497: Integration harness for streaming file format These tests pass locally for me. Thanks @nongli for this! Author: Nong Li Author: Wes McKinney Closes #312 from wesm/streaming-integration and squashes the following commits: 8b9ad76 [Wes McKinney] Hook stream<->file tools together and get integration tests working. Quiet test output in TestArrowStreamPipe c7f0483 [Nong Li] ARROW-XXX: [Java] Add file <=> stream utility tools. 
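The changes below wire the new file <=> stream utilities into the integration harness: for each JSON test case the producer writes a random-access file, the consumer validates it against the JSON, the file is then converted to a stream and back, and the reconstructed file is validated again. A condensed C++ sketch of that control flow — the Step callbacks are hypothetical stand-ins for the per-implementation tools (json-integration-test, FileToStream, StreamToFile); the real driver is the Python run() loop in integration_test.py below:

// Round-trip sketch. Each Step maps a source path to a destination path by
// invoking one producer or consumer tool; the harness performs this sequence
// for every (producer, consumer) pair and every JSON case.
#include <functional>
#include <string>

using Step = std::function<void(const std::string&, const std::string&)>;

void RoundTrip(const std::string& json_path, const std::string& tmp_dir,
    Step json_to_file, Step validate, Step file_to_stream, Step stream_to_file) {
  const std::string producer_file = tmp_dir + "/producer.arrow";
  json_to_file(json_path, producer_file);          // producer: JSON -> file
  validate(json_path, producer_file);              // consumer: check the file

  const std::string producer_stream = tmp_dir + "/producer.stream";
  file_to_stream(producer_file, producer_stream);  // producer: file -> stream

  const std::string consumer_file = tmp_dir + "/consumer.arrow";
  stream_to_file(producer_stream, consumer_file);  // consumer: stream -> file
  validate(json_path, consumer_file);              // consumer: check round-trip
}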
--- ci/travis_script_integration.sh | 3 + integration/integration_test.py | 76 ++++++++++++++++--- .../org/apache/arrow/tools/FileToStream.java | 68 +++++++++++++++++ .../org/apache/arrow/tools/StreamToFile.java | 61 +++++++++++++++ .../vector/stream/MessageSerializer.java | 2 +- .../vector/stream/TestArrowStreamPipe.java | 2 +- 6 files changed, 198 insertions(+), 14 deletions(-) create mode 100644 java/tools/src/main/java/org/apache/arrow/tools/FileToStream.java create mode 100644 java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java diff --git a/ci/travis_script_integration.sh b/ci/travis_script_integration.sh index d93411b907d47..c019a4b7ab7ff 100755 --- a/ci/travis_script_integration.sh +++ b/ci/travis_script_integration.sh @@ -28,7 +28,10 @@ pushd $TRAVIS_BUILD_DIR/integration VERSION=0.1.1-SNAPSHOT export ARROW_JAVA_INTEGRATION_JAR=$JAVA_DIR/tools/target/arrow-tools-$VERSION-jar-with-dependencies.jar + export ARROW_CPP_TESTER=$CPP_BUILD_DIR/debug/json-integration-test +export ARROW_CPP_STREAM_TO_FILE=$CPP_BUILD_DIR/debug/stream-to-file +export ARROW_CPP_FILE_TO_STREAM=$CPP_BUILD_DIR/debug/file-to-stream source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh export MINICONDA=$HOME/miniconda diff --git a/integration/integration_test.py b/integration/integration_test.py index 77510daecc0b4..a622bf228a651 100644 --- a/integration/integration_test.py +++ b/integration/integration_test.py @@ -556,12 +556,25 @@ def run(self): consumer.name)) for json_path in self.json_files: - print('Testing with {0}'.format(json_path)) + print('Testing file {0}'.format(json_path)) - arrow_path = os.path.join(self.temp_dir, guid()) + # Make the random access file + print('-- Creating binary inputs') + producer_file_path = os.path.join(self.temp_dir, guid()) + producer.json_to_file(json_path, producer_file_path) - producer.json_to_arrow(json_path, arrow_path) - consumer.validate(json_path, arrow_path) + # Validate the file + print('-- Validating file') + consumer.validate(json_path, producer_file_path) + + print('-- Validating stream') + producer_stream_path = os.path.join(self.temp_dir, guid()) + consumer_file_path = os.path.join(self.temp_dir, guid()) + producer.file_to_stream(producer_file_path, + producer_stream_path) + consumer.stream_to_file(producer_stream_path, + consumer_file_path) + consumer.validate(json_path, consumer_file_path) class Tester(object): @@ -569,7 +582,13 @@ class Tester(object): def __init__(self, debug=False): self.debug = debug - def json_to_arrow(self, json_path, arrow_path): + def json_to_file(self, json_path, arrow_path): + raise NotImplementedError + + def stream_to_file(self, stream_path, file_path): + raise NotImplementedError + + def file_to_stream(self, file_path, stream_path): raise NotImplementedError def validate(self, json_path, arrow_path): @@ -601,21 +620,40 @@ def _run(self, arrow_path=None, json_path=None, command='VALIDATE'): if self.debug: print(' '.join(cmd)) - return run_cmd(cmd) + run_cmd(cmd) def validate(self, json_path, arrow_path): return self._run(arrow_path, json_path, 'VALIDATE') - def json_to_arrow(self, json_path, arrow_path): + def json_to_file(self, json_path, arrow_path): return self._run(arrow_path, json_path, 'JSON_TO_ARROW') + def stream_to_file(self, stream_path, file_path): + cmd = ['java', '-cp', self.ARROW_TOOLS_JAR, + 'org.apache.arrow.tools.StreamToFile', + stream_path, file_path] + run_cmd(cmd) + + def file_to_stream(self, file_path, stream_path): + cmd = ['java', '-cp', self.ARROW_TOOLS_JAR, + 
'org.apache.arrow.tools.FileToStream', + file_path, stream_path] + run_cmd(cmd) + class CPPTester(Tester): + BUILD_PATH = os.path.join(ARROW_HOME, 'cpp/test-build/debug') CPP_INTEGRATION_EXE = os.environ.get( - 'ARROW_CPP_TESTER', - os.path.join(ARROW_HOME, - 'cpp/test-build/debug/json-integration-test')) + 'ARROW_CPP_TESTER', os.path.join(BUILD_PATH, 'json-integration-test')) + + STREAM_TO_FILE = os.environ.get( + 'ARROW_CPP_STREAM_TO_FILE', + os.path.join(BUILD_PATH, 'stream-to-file')) + + FILE_TO_STREAM = os.environ.get( + 'ARROW_CPP_FILE_TO_STREAM', + os.path.join(BUILD_PATH, 'file-to-stream')) name = 'C++' @@ -633,14 +671,28 @@ def _run(self, arrow_path=None, json_path=None, command='VALIDATE'): if self.debug: print(' '.join(cmd)) - return run_cmd(cmd) + run_cmd(cmd) def validate(self, json_path, arrow_path): return self._run(arrow_path, json_path, 'VALIDATE') - def json_to_arrow(self, json_path, arrow_path): + def json_to_file(self, json_path, arrow_path): return self._run(arrow_path, json_path, 'JSON_TO_ARROW') + def stream_to_file(self, stream_path, file_path): + cmd = ['cat', stream_path, '|', self.STREAM_TO_FILE, '>', file_path] + cmd = ' '.join(cmd) + if self.debug: + print(cmd) + os.system(cmd) + + def file_to_stream(self, file_path, stream_path): + cmd = [self.FILE_TO_STREAM, file_path, '>', stream_path] + cmd = ' '.join(cmd) + if self.debug: + print(cmd) + os.system(cmd) + def get_static_json_files(): glob_pattern = os.path.join(ARROW_HOME, 'integration', 'data', '*.json') diff --git a/java/tools/src/main/java/org/apache/arrow/tools/FileToStream.java b/java/tools/src/main/java/org/apache/arrow/tools/FileToStream.java new file mode 100644 index 0000000000000..ba6505cb48d08 --- /dev/null +++ b/java/tools/src/main/java/org/apache/arrow/tools/FileToStream.java @@ -0,0 +1,68 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.tools; + +import java.io.File; +import java.io.FileInputStream; +import java.io.FileOutputStream; +import java.io.IOException; +import java.io.OutputStream; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.file.ArrowBlock; +import org.apache.arrow.vector.file.ArrowFooter; +import org.apache.arrow.vector.file.ArrowReader; +import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.stream.ArrowStreamWriter; + +/** + * Converts an Arrow file to an Arrow stream. The file should be specified as the + * first argument and the output is written to standard out. 
+ */ +public class FileToStream { + public static void convert(FileInputStream in, OutputStream out) throws IOException { + BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); + try( + ArrowReader reader = new ArrowReader(in.getChannel(), allocator);) { + ArrowFooter footer = reader.readFooter(); + try ( + ArrowStreamWriter writer = new ArrowStreamWriter(out, footer.getSchema()); + ) { + for (ArrowBlock block: footer.getRecordBatches()) { + try (ArrowRecordBatch batch = reader.readRecordBatch(block)) { + writer.writeRecordBatch(batch); + } + } + } + } + } + + public static void main(String[] args) throws IOException { + if (args.length != 1 && args.length != 2) { + System.err.println("Usage: FileToStream [output file]"); + System.exit(1); + } + + FileInputStream in = new FileInputStream(new File(args[0])); + OutputStream out = args.length == 1 ? + System.out : new FileOutputStream(new File(args[1])); + + convert(in, out); + } +} diff --git a/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java b/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java new file mode 100644 index 0000000000000..c8a5c8914afcc --- /dev/null +++ b/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java @@ -0,0 +1,61 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.tools; + +import java.io.File; +import java.io.FileInputStream; +import java.io.FileOutputStream; +import java.io.IOException; +import java.io.InputStream; +import java.io.OutputStream; +import java.nio.channels.Channels; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.file.ArrowWriter; +import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.stream.ArrowStreamReader; + +/** + * Converts an Arrow stream to an Arrow file. 
+ */ +public class StreamToFile { + public static void convert(InputStream in, OutputStream out) throws IOException { + BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); + try (ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) { + reader.init(); + try (ArrowWriter writer = new ArrowWriter(Channels.newChannel(out), reader.getSchema());) { + while (true) { + ArrowRecordBatch batch = reader.nextRecordBatch(); + if (batch == null) break; + writer.writeRecordBatch(batch); + } + } + } + } + + public static void main(String[] args) throws IOException { + InputStream in = System.in; + OutputStream out = System.out; + if (args.length == 2) { + in = new FileInputStream(new File(args[0])); + out = new FileOutputStream(new File(args[1])); + } + convert(in, out); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java b/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java index 7ab740c70782e..92df2504bcb23 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java @@ -226,7 +226,7 @@ private static ByteBuffer serializeMessage(FlatBufferBuilder builder, byte heade Message.startMessage(builder); Message.addHeaderType(builder, headerType); Message.addHeader(builder, headerOffset); - Message.addVersion(builder, MetadataVersion.V1); + Message.addVersion(builder, MetadataVersion.V2); Message.addBodyLength(builder, bodyLength); builder.finish(Message.endMessage(builder)); return builder.dataBuffer(); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStreamPipe.java b/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStreamPipe.java index a0a7ffa279308..aa0b77e46a392 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStreamPipe.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStreamPipe.java @@ -113,7 +113,7 @@ public void run() { // Starts up a producer and consumer thread to read/write batches. @Test public void pipeTest() throws IOException, InterruptedException { - int NUM_BATCHES = 1000; + int NUM_BATCHES = 10; Pipe pipe = Pipe.open(); WriterThread writer = new WriterThread(NUM_BATCHES, pipe.sink()); ReaderThread reader = new ReaderThread(pipe.source()); From c05292faf74111e826dbbafe9a1e108346eb10dc Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Thu, 2 Feb 2017 16:13:40 -0500 Subject: [PATCH 0303/1644] ARROW-523: Python: Account for changes in PARQUET-834 Author: Uwe L. Korn Closes #313 from xhochy/ARROW-523 and squashes the following commits: ff699ea [Uwe L. Korn] Use relative import e36dcc8 [Uwe L. 
Korn] ARROW-523: Python: Account for changes in PARQUET-834 --- python/pyarrow/_parquet.pxd | 8 +-- python/pyarrow/_parquet.pyx | 8 +-- python/pyarrow/tests/pandas_examples.py | 78 +++++++++++++++++++++ python/pyarrow/tests/test_convert_pandas.py | 47 +------------ python/pyarrow/tests/test_parquet.py | 11 +++ 5 files changed, 100 insertions(+), 52 deletions(-) create mode 100644 python/pyarrow/tests/pandas_examples.py diff --git a/python/pyarrow/_parquet.pxd b/python/pyarrow/_parquet.pxd index fabee5d5761d7..6b9350ad6782a 100644 --- a/python/pyarrow/_parquet.pxd +++ b/python/pyarrow/_parquet.pxd @@ -211,9 +211,9 @@ cdef extern from "parquet/arrow/reader.h" namespace "parquet::arrow" nogil: cdef cppclass FileReader: FileReader(MemoryPool* pool, unique_ptr[ParquetFileReader] reader) - CStatus ReadFlatColumn(int i, shared_ptr[CArray]* out); - CStatus ReadFlatTable(shared_ptr[CTable]* out); - CStatus ReadFlatTable(const vector[int]& column_indices, + CStatus ReadColumn(int i, shared_ptr[CArray]* out); + CStatus ReadTable(shared_ptr[CTable]* out); + CStatus ReadTable(const vector[int]& column_indices, shared_ptr[CTable]* out); const ParquetFileReader* parquet_reader(); @@ -228,7 +228,7 @@ cdef extern from "parquet/arrow/schema.h" namespace "parquet::arrow" nogil: cdef extern from "parquet/arrow/writer.h" namespace "parquet::arrow" nogil: - cdef CStatus WriteFlatTable( + cdef CStatus WriteTable( const CTable* table, MemoryPool* pool, const shared_ptr[OutputStream]& sink, int64_t chunk_size, diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx index 3f847e9808230..fd4670a00c837 100644 --- a/python/pyarrow/_parquet.pyx +++ b/python/pyarrow/_parquet.pyx @@ -397,12 +397,12 @@ cdef class ParquetReader: with nogil: check_status(self.reader.get() - .ReadFlatTable(c_column_indices, &ctable)) + .ReadTable(c_column_indices, &ctable)) else: # Read all columns with nogil: check_status(self.reader.get() - .ReadFlatTable(&ctable)) + .ReadTable(&ctable)) table.init(ctable) return table @@ -442,7 +442,7 @@ cdef class ParquetReader: with nogil: check_status(self.reader.get() - .ReadFlatColumn(column_index, &carray)) + .ReadColumn(column_index, &carray)) array.init(carray) return array @@ -540,6 +540,6 @@ cdef class ParquetWriter: cdef int c_row_group_size = row_group_size with nogil: - check_status(WriteFlatTable(ctable, default_memory_pool(), + check_status(WriteTable(ctable, default_memory_pool(), self.sink, c_row_group_size, self.properties)) diff --git a/python/pyarrow/tests/pandas_examples.py b/python/pyarrow/tests/pandas_examples.py new file mode 100644 index 0000000000000..63af42348026c --- /dev/null +++ b/python/pyarrow/tests/pandas_examples.py @@ -0,0 +1,78 @@ +# -*- coding: utf-8 -*- +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
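The _parquet.pxd changes above track the C++ signatures renamed in PARQUET-834: ReadFlatTable, ReadFlatColumn and WriteFlatTable became ReadTable, ReadColumn and WriteTable. A hedged sketch of the corresponding C++ calls, based only on the signatures declared in the pxd — reader construction and error handling are elided, and the names are illustrative of the parquet::arrow interface at this point, not authoritative:

// Reads a whole Parquet file, then a single column, through the renamed
// parquet::arrow entry points mirrored by the Cython declarations above.
// Assumes a FileReader was opened elsewhere.
#include <memory>

#include "arrow/array.h"
#include "arrow/status.h"
#include "arrow/table.h"
#include "parquet/arrow/reader.h"

arrow::Status ReadAll(parquet::arrow::FileReader* reader,
    std::shared_ptr<arrow::Table>* table, std::shared_ptr<arrow::Array>* first_column) {
  RETURN_NOT_OK(reader->ReadTable(table));     // formerly ReadFlatTable
  return reader->ReadColumn(0, first_column);  // formerly ReadFlatColumn
}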
+
+from collections import OrderedDict
+
+import numpy as np
+import pandas as pd
+import pyarrow as pa
+
+
+def dataframe_with_arrays():
+    """
+    DataFrame with numpy array columns of every possible primitive type.
+
+    Returns
+    -------
+    df: pandas.DataFrame
+    schema: pyarrow.Schema
+        Arrow schema definition that matches the constructed df.
+    """
+    dtypes = [('i1', pa.int8()), ('i2', pa.int16()),
+              ('i4', pa.int32()), ('i8', pa.int64()),
+              ('u1', pa.uint8()), ('u2', pa.uint16()),
+              ('u4', pa.uint32()), ('u8', pa.uint64()),
+              ('f4', pa.float_()), ('f8', pa.double())]
+
+    arrays = OrderedDict()
+    fields = []
+    for dtype, arrow_dtype in dtypes:
+        fields.append(pa.field(dtype, pa.list_(arrow_dtype)))
+        arrays[dtype] = [
+            np.arange(10, dtype=dtype),
+            np.arange(5, dtype=dtype),
+            None,
+            np.arange(1, dtype=dtype)
+        ]
+
+    fields.append(pa.field('str', pa.list_(pa.string())))
+    arrays['str'] = [
+        np.array([u"1", u"ä"], dtype="object"),
+        None,
+        np.array([u"1"], dtype="object"),
+        np.array([u"1", u"2", u"3"], dtype="object")
+    ]
+
+    fields.append(pa.field('datetime64', pa.list_(pa.timestamp('ms'))))
+    arrays['datetime64'] = [
+        np.array(['2007-07-13T01:23:34.123456789',
+                  None,
+                  '2010-08-13T05:46:57.437699912'],
+                 dtype='datetime64[ms]'),
+        None,
+        None,
+        np.array(['2007-07-13T02',
+                  None,
+                  '2010-08-13T05:46:57.437699912'],
+                 dtype='datetime64[ms]'),
+    ]
+
+    df = pd.DataFrame(arrays)
+    schema = pa.Schema.from_fields(fields)
+
+    return df, schema
diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py
index 674a4361d3395..ddbb02a770c35 100644
--- a/python/pyarrow/tests/test_convert_pandas.py
+++ b/python/pyarrow/tests/test_convert_pandas.py
@@ -29,6 +29,8 @@
 from pyarrow.compat import u
 import pyarrow as A
 
+from .pandas_examples import dataframe_with_arrays
+
 
 def _alltypes_example(size=100):
     return pd.DataFrame({
@@ -325,54 +327,11 @@ def test_date(self):
         tm.assert_frame_equal(result, expected)
 
     def test_column_of_lists(self):
-        dtypes = [('i1', A.int8()), ('i2', A.int16()),
-                  ('i4', A.int32()), ('i8', A.int64()),
-                  ('u1', A.uint8()), ('u2', A.uint16()),
-                  ('u4', A.uint32()), ('u8', A.uint64()),
-                  ('f4', A.float_()), ('f8', A.double())]
-
-        arrays = OrderedDict()
-        fields = []
-        for dtype, arrow_dtype in dtypes:
-            fields.append(A.field(dtype, A.list_(arrow_dtype)))
-            arrays[dtype] = [
-                np.arange(10, dtype=dtype),
-                np.arange(5, dtype=dtype),
-                None,
-                np.arange(1, dtype=dtype)
-            ]
-
-        fields.append(A.field('str', A.list_(A.string())))
-        arrays['str'] = [
-            np.array([u"1", u"ä"], dtype="object"),
-            None,
-            np.array([u"1"], dtype="object"),
-            np.array([u"1", u"2", u"3"], dtype="object")
-        ]
-
-        fields.append(A.field('datetime64', A.list_(A.timestamp('ns'))))
-        arrays['datetime64'] = [
-            np.array(['2007-07-13T01:23:34.123456789',
-                      None,
-                      '2010-08-13T05:46:57.437699912'],
-                     dtype='datetime64[ns]'),
-            None,
-            None,
-            np.array(['2007-07-13T02',
-                      None,
-                      '2010-08-13T05:46:57.437699912'],
-                     dtype='datetime64[ns]'),
-        ]
-
-        df = pd.DataFrame(arrays)
-        schema = A.Schema.from_fields(fields)
+        df, schema = dataframe_with_arrays()
         self._check_pandas_roundtrip(df, schema=schema,
                                      expected_schema=schema)
 
         table = A.Table.from_pandas(df, schema=schema)
         assert table.schema.equals(schema)
 
-        # it works!
-        table.to_pandas(nthreads=1)
-
     def test_threaded_conversion(self):
         df = _alltypes_example()
         self._check_pandas_roundtrip(df, nthreads=2,
diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py
index d85f0e513702f..80a995fbb6662 100644
--- a/python/pyarrow/tests/test_parquet.py
+++ b/python/pyarrow/tests/test_parquet.py
@@ -23,6 +23,7 @@
 from pyarrow.compat import guid
 import pyarrow as pa
 import pyarrow.io as paio
+from .pandas_examples import dataframe_with_arrays
 
 import numpy as np
 import pandas as pd
@@ -319,6 +320,16 @@ def test_compare_schemas():
     assert fileh.schema[0].equals(fileh.schema[0])
     assert not fileh.schema[0].equals(fileh.schema[1])
 
+@parquet
+def test_column_of_lists(tmpdir):
+    df, schema = dataframe_with_arrays()
+
+    filename = tmpdir.join('pandas_roundtrip.parquet')
+    arrow_table = pa.Table.from_pandas(df, timestamps_to_ms=True, schema=schema)
+    pq.write_table(arrow_table, filename.strpath, version="2.0")
+    table_read = pq.read_table(filename.strpath)
+    df_read = table_read.to_pandas()
+    pdt.assert_frame_equal(df, df_read)
 
 @parquet
 def test_multithreaded_read():
From 720d422fa761e2beab1b412b1b42c041ac2db1a4 Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Fri, 3 Feb 2017 09:08:14 +0100
Subject: [PATCH 0304/1644] ARROW-467: [Python] Run Python parquet-cpp unit tests in Travis CI

This means we'll have to tolerate broken builds whenever APIs change (a good incentive to avoid changing them as much as possible)

Author: Wes McKinney

Closes #311 from wesm/ARROW-467 and squashes the following commits:

a9c285d [Wes McKinney] parquet-cpp build tweaks
661671c [Wes McKinney] Build parquet-cpp from source and run PyArrow Parquet unit tests in Travis CI
---
 ci/travis_script_python.sh | 50 +++++++++++++++++++++++++++++++++++---
 1 file changed, 46 insertions(+), 4 deletions(-)

diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh
index 179567b595416..c186fd4639fca 100755
--- a/ci/travis_script_python.sh
+++ b/ci/travis_script_python.sh
@@ -26,12 +26,52 @@
 export ARROW_HOME=$ARROW_CPP_INSTALL
 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ARROW_CPP_INSTALL/lib
 
 pushd $PYTHON_DIR
 
+export PARQUET_HOME=$TRAVIS_BUILD_DIR/parquet-env
+
+build_parquet_cpp() {
+  conda create -y -q -p $PARQUET_HOME thrift-cpp snappy zlib brotli boost
+  source activate $PARQUET_HOME
+
+  export BOOST_ROOT=$PARQUET_HOME
+  export SNAPPY_HOME=$PARQUET_HOME
+  export THRIFT_HOME=$PARQUET_HOME
+  export ZLIB_HOME=$PARQUET_HOME
+  export BROTLI_HOME=$PARQUET_HOME
+
+  PARQUET_DIR=$TRAVIS_BUILD_DIR/parquet
+  mkdir -p $PARQUET_DIR
+
+  git clone https://github.com/apache/parquet-cpp.git $PARQUET_DIR
+
+  pushd $PARQUET_DIR
+  mkdir build-dir
+  cd build-dir
+
+  cmake \
+      -DCMAKE_BUILD_TYPE=debug \
+      -DCMAKE_INSTALL_PREFIX=$PARQUET_HOME \
+      -DPARQUET_ARROW=on \
+      -DPARQUET_BUILD_BENCHMARKS=off \
+      -DPARQUET_BUILD_EXECUTABLES=off \
+      -DPARQUET_ZLIB_VENDORED=off \
+      -DPARQUET_BUILD_TESTS=off \
+      ..
+ + make -j${CPU_COUNT} + make install + + popd +} + +build_parquet_cpp + +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PARQUET_HOME/lib python_version_tests() { PYTHON_VERSION=$1 - CONDA_ENV_NAME="pyarrow-test-${PYTHON_VERSION}" - conda create -y -q -n $CONDA_ENV_NAME python=$PYTHON_VERSION - source activate $CONDA_ENV_NAME + CONDA_ENV_DIR=$TRAVIS_BUILD_DIR/pyarrow-test-$PYTHON_VERSION + conda create -y -q -p $CONDA_ENV_DIR python=$PYTHON_VERSION + source activate $CONDA_ENV_DIR python --version which python @@ -45,7 +85,9 @@ python_version_tests() { # Other stuff pip install pip install -r requirements.txt - python setup.py build_ext --inplace + python setup.py build_ext --inplace --with-parquet + + python -c "import pyarrow.parquet" python -m pytest -vv -r sxX pyarrow From 08f38d97904e8d265dea09cdc67946119998e039 Mon Sep 17 00:00:00 2001 From: Jingyuan Wang Date: Fri, 3 Feb 2017 23:00:25 -0500 Subject: [PATCH 0305/1644] ARROW-477: [Java] Add support for second/microsecond/nanosecond timestamps in-memory and in IPC/JSON layer Changes include: - add support for TimeStamp data type with second/microsecond/nanosecond time units - add an additional readLong() method to timestamp readers to support reading raw long values - add a simple test case for timestamp readers and writers Author: Jingyuan Wang Closes #303 from alphalfalfa/arrow-477 and squashes the following commits: 0199574 [Jingyuan Wang] rename TimeStamp to TimeStampMilli 068e47f [Jingyuan Wang] use a test value that exhibits micro/nanosecond truncation when converting timestamps to JODA DateTime bef2330 [Jingyuan Wang] fix a typo 9b4d7b4 [Jingyuan Wang] add support for timestamp data type with second/microsecond/nanosecond time units --- .../main/codegen/data/ValueVectorTypes.tdd | 5 +- .../codegen/templates/ComplexReaders.java | 10 +++ .../codegen/templates/FixedValueVectors.java | 31 ++++++- .../templates/NullableValueVectors.java | 8 +- .../vector/file/json/JsonFileReader.java | 18 +++- .../vector/file/json/JsonFileWriter.java | 18 +++- .../org/apache/arrow/vector/types/Types.java | 88 ++++++++++++++++--- .../complex/writer/TestComplexWriter.java | 85 ++++++++++++++++++ .../arrow/vector/file/BaseFileTest.java | 10 +-- .../apache/arrow/vector/pojo/TestConvert.java | 2 +- 10 files changed, 250 insertions(+), 25 deletions(-) diff --git a/java/vector/src/main/codegen/data/ValueVectorTypes.tdd b/java/vector/src/main/codegen/data/ValueVectorTypes.tdd index f7790bb3d6ddf..2181cfdc335b4 100644 --- a/java/vector/src/main/codegen/data/ValueVectorTypes.tdd +++ b/java/vector/src/main/codegen/data/ValueVectorTypes.tdd @@ -71,7 +71,10 @@ { class: "UInt8" }, { class: "Float8", javaType: "double" , boxedType: "Double", fields: [{name: "value", type: "double"}], }, { class: "Date", javaType: "long", friendlyType: "DateTime" }, - { class: "TimeStamp", javaType: "long", friendlyType: "DateTime" } + { class: "TimeStampSec", javaType: "long", boxedType: "Long", friendlyType: "DateTime" } + { class: "TimeStampMilli", javaType: "long", boxedType: "Long", friendlyType: "DateTime" } + { class: "TimeStampMicro", javaType: "long", boxedType: "Long", friendlyType: "DateTime" } + { class: "TimeStampNano", javaType: "long", boxedType: "Long", friendlyType: "DateTime" } ] }, { diff --git a/java/vector/src/main/codegen/templates/ComplexReaders.java b/java/vector/src/main/codegen/templates/ComplexReaders.java index 74a19a605e21e..d53744539aae8 100644 --- a/java/vector/src/main/codegen/templates/ComplexReaders.java +++ 
b/java/vector/src/main/codegen/templates/ComplexReaders.java
@@ -96,6 +96,16 @@ public void read(Nullable${minor.class?cap_first}Holder h){
   public ${friendlyType} read${safeType}(){
     return vector.getAccessor().getObject(idx());
   }
+
+  <#if minor.class == "TimeStampSec" ||
+       minor.class == "TimeStampMilli" ||
+       minor.class == "TimeStampMicro" ||
+       minor.class == "TimeStampNano">
+  @Override
+  public ${minor.boxedType} read${minor.boxedType}(){
+    return vector.getAccessor().get(idx());
+  }
+  </#if>
 
   public void copyValue(FieldWriter w){
diff --git a/java/vector/src/main/codegen/templates/FixedValueVectors.java b/java/vector/src/main/codegen/templates/FixedValueVectors.java
index be385d146dbac..d5265f1140ee0 100644
--- a/java/vector/src/main/codegen/templates/FixedValueVectors.java
+++ b/java/vector/src/main/codegen/templates/FixedValueVectors.java
@@ -490,7 +490,16 @@ public long getTwoAsLong(int index) {
       return date;
     }
 
-    <#elseif minor.class == "TimeStamp">
+    <#elseif minor.class == "TimeStampSec">
+    @Override
+    public ${friendlyType} getObject(int index) {
+      long millis = java.util.concurrent.TimeUnit.SECONDS.toMillis(get(index));
+      org.joda.time.DateTime date = new org.joda.time.DateTime(millis, org.joda.time.DateTimeZone.UTC);
+      date = date.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault());
+      return date;
+    }
+
+    <#elseif minor.class == "TimeStampMilli">
     @Override
     public ${friendlyType} getObject(int index) {
       org.joda.time.DateTime date = new org.joda.time.DateTime(get(index), org.joda.time.DateTimeZone.UTC);
@@ -498,6 +507,26 @@ public long getTwoAsLong(int index) {
       return date;
     }
 
+    <#elseif minor.class == "TimeStampMicro">
+    @Override
+    public ${friendlyType} getObject(int index) {
+      // value is truncated when converting microseconds to milliseconds in order to use DateTime type
+      long millis = java.util.concurrent.TimeUnit.MICROSECONDS.toMillis(get(index));
+      org.joda.time.DateTime date = new org.joda.time.DateTime(millis, org.joda.time.DateTimeZone.UTC);
+      date = date.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault());
+      return date;
+    }
+
+    <#elseif minor.class == "TimeStampNano">
+    @Override
+    public ${friendlyType} getObject(int index) {
+      // value is truncated when converting nanoseconds to milliseconds in order to use DateTime type
+      long millis = java.util.concurrent.TimeUnit.NANOSECONDS.toMillis(get(index));
+      org.joda.time.DateTime date = new org.joda.time.DateTime(millis, org.joda.time.DateTimeZone.UTC);
+      date = date.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault());
+      return date;
+    }
+
     <#elseif minor.class == "IntervalYear">
     @Override
     public ${friendlyType} getObject(int index) {
diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java
index 6a9ce65392f59..ce637100cd8bf 100644
--- a/java/vector/src/main/codegen/templates/NullableValueVectors.java
+++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java
@@ -102,8 +102,14 @@ public final class ${className} extends BaseDataValueVector implements <#if type
       field = new Field(name, true, new FloatingPoint(org.apache.arrow.vector.types.FloatingPointPrecision.SINGLE), null);
     <#elseif minor.class == "Float8">
       field = new Field(name, true, new FloatingPoint(org.apache.arrow.vector.types.FloatingPointPrecision.DOUBLE), null);
-    <#elseif minor.class == "TimeStamp">
+    <#elseif minor.class == "TimeStampSec">
+      field = new Field(name, true, new
org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(org.apache.arrow.vector.types.TimeUnit.SECOND), null); + <#elseif minor.class == "TimeStampMilli"> field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(org.apache.arrow.vector.types.TimeUnit.MILLISECOND), null); + <#elseif minor.class == "TimeStampMicro"> + field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(org.apache.arrow.vector.types.TimeUnit.MICROSECOND), null); + <#elseif minor.class == "TimeStampNano"> + field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(org.apache.arrow.vector.types.TimeUnit.NANOSECOND), null); <#elseif minor.class == "IntervalDay"> field = new Field(name, true, new Interval(org.apache.arrow.vector.types.IntervalUnit.DAY_TIME), null); <#elseif minor.class == "IntervalYear"> diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java index 152867c1a11d7..71fe88e156a5d 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java @@ -37,7 +37,10 @@ import org.apache.arrow.vector.Float8Vector; import org.apache.arrow.vector.IntVector; import org.apache.arrow.vector.SmallIntVector; -import org.apache.arrow.vector.TimeStampVector; +import org.apache.arrow.vector.TimeStampSecVector; +import org.apache.arrow.vector.TimeStampMilliVector; +import org.apache.arrow.vector.TimeStampMicroVector; +import org.apache.arrow.vector.TimeStampNanoVector; import org.apache.arrow.vector.TinyIntVector; import org.apache.arrow.vector.UInt1Vector; import org.apache.arrow.vector.UInt2Vector; @@ -199,9 +202,18 @@ private void setValueFromParser(ValueVector valueVector, int i) throws IOExcepti case VARCHAR: ((VarCharVector)valueVector).getMutator().setSafe(i, parser.readValueAs(String.class).getBytes(UTF_8)); break; - case TIMESTAMP: - ((TimeStampVector)valueVector).getMutator().set(i, parser.readValueAs(Long.class)); + case TIMESTAMPSEC: + ((TimeStampSecVector)valueVector).getMutator().set(i, parser.readValueAs(Long.class)); break; + case TIMESTAMPMILLI: + ((TimeStampMilliVector)valueVector).getMutator().set(i, parser.readValueAs(Long.class)); + break; + case TIMESTAMPMICRO: + ((TimeStampMicroVector)valueVector).getMutator().set(i, parser.readValueAs(Long.class)); + break; + case TIMESTAMPNANO: + ((TimeStampNanoVector)valueVector).getMutator().set(i, parser.readValueAs(Long.class)); + break; default: throw new UnsupportedOperationException("minor type: " + valueVector.getMinorType()); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java index 6ff357774486d..ddc80433cb6db 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java @@ -24,7 +24,10 @@ import org.apache.arrow.vector.BitVector; import org.apache.arrow.vector.BufferBacked; import org.apache.arrow.vector.FieldVector; -import org.apache.arrow.vector.TimeStampVector; +import org.apache.arrow.vector.TimeStampSecVector; +import org.apache.arrow.vector.TimeStampMilliVector; +import org.apache.arrow.vector.TimeStampMicroVector; +import org.apache.arrow.vector.TimeStampNanoVector; import 
org.apache.arrow.vector.ValueVector; import org.apache.arrow.vector.ValueVector.Accessor; import org.apache.arrow.vector.VectorSchemaRoot; @@ -139,8 +142,17 @@ private void writeVector(Field field, FieldVector vector) throws IOException { private void writeValueToGenerator(ValueVector valueVector, int i) throws IOException { switch (valueVector.getMinorType()) { - case TIMESTAMP: - generator.writeNumber(((TimeStampVector)valueVector).getAccessor().get(i)); + case TIMESTAMPSEC: + generator.writeNumber(((TimeStampSecVector)valueVector).getAccessor().get(i)); + break; + case TIMESTAMPMILLI: + generator.writeNumber(((TimeStampMilliVector)valueVector).getAccessor().get(i)); + break; + case TIMESTAMPMICRO: + generator.writeNumber(((TimeStampMicroVector)valueVector).getAccessor().get(i)); + break; + case TIMESTAMPNANO: + generator.writeNumber(((TimeStampNanoVector)valueVector).getAccessor().get(i)); break; case BIT: generator.writeNumber(((BitVector)valueVector).getAccessor().get(i)); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java index 2a2fb74bee85c..ab539d5dc3b6e 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -33,7 +33,10 @@ import org.apache.arrow.vector.NullableIntervalDayVector; import org.apache.arrow.vector.NullableIntervalYearVector; import org.apache.arrow.vector.NullableSmallIntVector; -import org.apache.arrow.vector.NullableTimeStampVector; +import org.apache.arrow.vector.NullableTimeStampSecVector; +import org.apache.arrow.vector.NullableTimeStampMilliVector; +import org.apache.arrow.vector.NullableTimeStampMicroVector; +import org.apache.arrow.vector.NullableTimeStampNanoVector; import org.apache.arrow.vector.NullableTimeVector; import org.apache.arrow.vector.NullableTinyIntVector; import org.apache.arrow.vector.NullableUInt1Vector; @@ -58,7 +61,10 @@ import org.apache.arrow.vector.complex.impl.IntervalYearWriterImpl; import org.apache.arrow.vector.complex.impl.NullableMapWriter; import org.apache.arrow.vector.complex.impl.SmallIntWriterImpl; -import org.apache.arrow.vector.complex.impl.TimeStampWriterImpl; +import org.apache.arrow.vector.complex.impl.TimeStampSecWriterImpl; +import org.apache.arrow.vector.complex.impl.TimeStampMilliWriterImpl; +import org.apache.arrow.vector.complex.impl.TimeStampMicroWriterImpl; +import org.apache.arrow.vector.complex.impl.TimeStampNanoWriterImpl; import org.apache.arrow.vector.complex.impl.TimeWriterImpl; import org.apache.arrow.vector.complex.impl.TinyIntWriterImpl; import org.apache.arrow.vector.complex.impl.UInt1WriterImpl; @@ -102,7 +108,10 @@ public class Types { private static final Field UINT8_FIELD = new Field("", true, new Int(64, false), null); private static final Field DATE_FIELD = new Field("", true, Date.INSTANCE, null); private static final Field TIME_FIELD = new Field("", true, Time.INSTANCE, null); - private static final Field TIMESTAMP_FIELD = new Field("", true, new Timestamp(TimeUnit.MILLISECOND), null); + private static final Field TIMESTAMPSEC_FIELD = new Field("", true, new Timestamp(TimeUnit.SECOND), null); + private static final Field TIMESTAMPMILLI_FIELD = new Field("", true, new Timestamp(TimeUnit.MILLISECOND), null); + private static final Field TIMESTAMPMICRO_FIELD = new Field("", true, new Timestamp(TimeUnit.MICROSECOND), null); + private static final Field TIMESTAMPNANO_FIELD = new Field("", true, new 
Timestamp(TimeUnit.NANOSECOND), null);
   private static final Field INTERVALDAY_FIELD = new Field("", true, new Interval(IntervalUnit.DAY_TIME), null);
   private static final Field INTERVALYEAR_FIELD = new Field("", true, new Interval(IntervalUnit.YEAR_MONTH), null);
   private static final Field FLOAT4_FIELD = new Field("", true, new FloatingPoint(FloatingPointPrecision.SINGLE), null);
@@ -241,21 +250,72 @@ public FieldWriter getNewFieldWriter(ValueVector vector) {
       return new TimeWriterImpl((NullableTimeVector) vector);
     }
   },
+  // time in seconds from the Unix epoch, 00:00:00 on 1 January 1970, UTC.
+  TIMESTAMPSEC(new Timestamp(org.apache.arrow.vector.types.TimeUnit.SECOND)) {
+    @Override
+    public Field getField() {
+      return TIMESTAMPSEC_FIELD;
+    }
+
+    @Override
+    public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) {
+      return new NullableTimeStampSecVector(name, allocator);
+    }
+
+    @Override
+    public FieldWriter getNewFieldWriter(ValueVector vector) {
+      return new TimeStampSecWriterImpl((NullableTimeStampSecVector) vector);
+    }
+  },
   // time in millis from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC.
-  TIMESTAMP(new Timestamp(org.apache.arrow.vector.types.TimeUnit.MILLISECOND)) {
+  TIMESTAMPMILLI(new Timestamp(org.apache.arrow.vector.types.TimeUnit.MILLISECOND)) {
     @Override
     public Field getField() {
-      return TIMESTAMP_FIELD;
+      return TIMESTAMPMILLI_FIELD;
     }
 
     @Override
     public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) {
-      return new NullableTimeStampVector(name, allocator);
+      return new NullableTimeStampMilliVector(name, allocator);
     }
 
     @Override
     public FieldWriter getNewFieldWriter(ValueVector vector) {
-      return new TimeStampWriterImpl((NullableTimeStampVector) vector);
+      return new TimeStampMilliWriterImpl((NullableTimeStampMilliVector) vector);
+    }
+  },
+  // time in microseconds from the Unix epoch, 00:00:00.000000 on 1 January 1970, UTC.
+  TIMESTAMPMICRO(new Timestamp(org.apache.arrow.vector.types.TimeUnit.MICROSECOND)) {
+    @Override
+    public Field getField() {
+      return TIMESTAMPMICRO_FIELD;
+    }
+
+    @Override
+    public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) {
+      return new NullableTimeStampMicroVector(name, allocator);
+    }
+
+    @Override
+    public FieldWriter getNewFieldWriter(ValueVector vector) {
+      return new TimeStampMicroWriterImpl((NullableTimeStampMicroVector) vector);
+    }
+  },
+  // time in nanoseconds from the Unix epoch, 00:00:00.000000000 on 1 January 1970, UTC.
+  TIMESTAMPNANO(new Timestamp(org.apache.arrow.vector.types.TimeUnit.NANOSECOND)) {
+    @Override
+    public Field getField() {
+      return TIMESTAMPNANO_FIELD;
+    }
+
+    @Override
+    public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int...
precisionScale) { + return new NullableTimeStampNanoVector(name, allocator); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new TimeStampNanoWriterImpl((NullableTimeStampNanoVector) vector); } }, INTERVALDAY(new Interval(IntervalUnit.DAY_TIME)) { @@ -579,10 +639,18 @@ public MinorType visit(FloatingPoint type) { } @Override public MinorType visit(Timestamp type) { - if (type.getUnit() != TimeUnit.MILLISECOND) { - throw new UnsupportedOperationException("Only milliseconds supported: " + type); + switch (type.getUnit()) { + case SECOND: + return MinorType.TIMESTAMPSEC; + case MILLISECOND: + return MinorType.TIMESTAMPMILLI; + case MICROSECOND: + return MinorType.TIMESTAMPMICRO; + case NANOSECOND: + return MinorType.TIMESTAMPNANO; + default: + throw new IllegalArgumentException("unknown unit: " + type); } - return MinorType.TIMESTAMP; } @Override diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java index 2c0c85328bdfb..7a2d416241b78 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java @@ -43,12 +43,15 @@ import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter; import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter; import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; +import org.apache.arrow.vector.types.pojo.ArrowType; import org.apache.arrow.vector.types.pojo.ArrowType.ArrowTypeID; import org.apache.arrow.vector.types.pojo.ArrowType.Int; import org.apache.arrow.vector.types.pojo.ArrowType.Union; import org.apache.arrow.vector.types.pojo.ArrowType.Utf8; import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.util.Text; +import org.joda.time.DateTime; +import org.joda.time.DateTimeZone; import org.junit.Assert; import org.junit.Test; @@ -561,4 +564,86 @@ public void mapWriterMixedCaseFieldNames() { Assert.assertTrue(fieldNamesCaseSensitive.contains("list_field::$data$")); Assert.assertTrue(fieldNamesCaseSensitive.contains("list_field::$data$::bit_field")); } + + @Test + public void timeStampWriters() throws Exception { + // test values + final long expectedNanos = 981173106123456789L; + final long expectedMicros = 981173106123456L; + final long expectedMillis = 981173106123L; + final long expectedSecs = 981173106L; + final DateTime expectedSecDateTime = new DateTime(2001, 2, 3, 4, 5, 6, 0).withZoneRetainFields(DateTimeZone.getDefault()); + final DateTime expectedMilliDateTime = new DateTime(2001, 2, 3, 4, 5, 6, 123).withZoneRetainFields(DateTimeZone.getDefault()); + final DateTime expectedMicroDateTime = expectedMilliDateTime; + final DateTime expectedNanoDateTime = expectedMilliDateTime; + + // write + MapVector parent = new MapVector("parent", allocator, null); + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + + TimeStampSecWriter timeStampSecWriter = rootWriter.timeStampSec("sec"); + timeStampSecWriter.setPosition(0); + timeStampSecWriter.writeTimeStampSec(expectedSecs); + + TimeStampMilliWriter timeStampWriter = rootWriter.timeStampMilli("milli"); + timeStampWriter.setPosition(1); + timeStampWriter.writeTimeStampMilli(expectedMillis); + + TimeStampMicroWriter timeStampMicroWriter = rootWriter.timeStampMicro("micro"); + 
timeStampMicroWriter.setPosition(2); + timeStampMicroWriter.writeTimeStampMicro(expectedMicros); + + TimeStampNanoWriter timeStampNanoWriter = rootWriter.timeStampNano("nano"); + timeStampNanoWriter.setPosition(3); + timeStampNanoWriter.writeTimeStampNano(expectedNanos); + + // schema + Field secField = parent.getField().getChildren().get(0).getChildren().get(0); + Assert.assertEquals("sec", secField.getName()); + Assert.assertEquals(ArrowType.Timestamp.TYPE_TYPE, secField.getType().getTypeID()); + + Field milliField = parent.getField().getChildren().get(0).getChildren().get(1); + Assert.assertEquals("milli", milliField.getName()); + Assert.assertEquals(ArrowType.Timestamp.TYPE_TYPE, milliField.getType().getTypeID()); + + Field microField = parent.getField().getChildren().get(0).getChildren().get(2); + Assert.assertEquals("micro", microField.getName()); + Assert.assertEquals(ArrowType.Timestamp.TYPE_TYPE, microField.getType().getTypeID()); + + Field nanoField = parent.getField().getChildren().get(0).getChildren().get(3); + Assert.assertEquals("nano", nanoField.getName()); + Assert.assertEquals(ArrowType.Timestamp.TYPE_TYPE, nanoField.getType().getTypeID()); + + // read + MapReader rootReader = new SingleMapReaderImpl(parent).reader("root"); + + FieldReader secReader = rootReader.reader("sec"); + secReader.setPosition(0); + DateTime secDateTime = secReader.readDateTime(); + Assert.assertEquals(expectedSecDateTime, secDateTime); + long secLong = secReader.readLong(); + Assert.assertEquals(expectedSecs, secLong); + + FieldReader milliReader = rootReader.reader("milli"); + milliReader.setPosition(1); + DateTime milliDateTime = milliReader.readDateTime(); + Assert.assertEquals(expectedMilliDateTime, milliDateTime); + long milliLong = milliReader.readLong(); + Assert.assertEquals(expectedMillis, milliLong); + + FieldReader microReader = rootReader.reader("micro"); + microReader.setPosition(2); + DateTime microDateTime = microReader.readDateTime(); + Assert.assertEquals(expectedMicroDateTime, microDateTime); + long microLong = microReader.readLong(); + Assert.assertEquals(expectedMicros, microLong); + + FieldReader nanoReader = rootReader.reader("nano"); + nanoReader.setPosition(3); + DateTime nanoDateTime = nanoReader.readDateTime(); + Assert.assertEquals(expectedNanoDateTime, nanoDateTime); + long nanoLong = nanoReader.readLong(); + Assert.assertEquals(expectedNanos, nanoLong); + } } \ No newline at end of file diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java b/java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java index 6e577b500a6bd..774bead3207a7 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java @@ -33,7 +33,7 @@ import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; import org.apache.arrow.vector.complex.writer.BigIntWriter; import org.apache.arrow.vector.complex.writer.IntWriter; -import org.apache.arrow.vector.holders.NullableTimeStampHolder; +import org.apache.arrow.vector.holders.NullableTimeStampMilliHolder; import org.joda.time.DateTimeZone; import org.junit.After; import org.junit.Assert; @@ -100,7 +100,7 @@ protected void writeComplexData(int count, MapVector parent) { listWriter.endList(); mapWriter.setPosition(i); mapWriter.start(); - mapWriter.timeStamp("timestamp").writeTimeStamp(i); + mapWriter.timeStampMilli("timestamp").writeTimeStampMilli(i); mapWriter.end(); } writer.setValueCount(count); @@ 
-130,7 +130,7 @@ protected void validateComplexContent(int count, VectorSchemaRoot root) { } Assert.assertEquals(Long.valueOf(i), root.getVector("bigInt").getAccessor().getObject(i)); Assert.assertEquals(i % 3, ((List)root.getVector("list").getAccessor().getObject(i)).size()); - NullableTimeStampHolder h = new NullableTimeStampHolder(); + NullableTimeStampMilliHolder h = new NullableTimeStampMilliHolder(); FieldReader mapReader = root.getVector("map").getReader(); mapReader.setPosition(i); mapReader.reader("timestamp").read(h); @@ -167,7 +167,7 @@ public void validateUnionData(int count, VectorSchemaRoot root) { Assert.assertEquals(i % 3, unionReader.size()); break; case 3: - NullableTimeStampHolder h = new NullableTimeStampHolder(); + NullableTimeStampMilliHolder h = new NullableTimeStampMilliHolder(); unionReader.reader("timestamp").read(h); Assert.assertEquals(i, h.value); break; @@ -209,7 +209,7 @@ public void writeUnionData(int count, NullableMapVector parent) { case 3: mapWriter.setPosition(i); mapWriter.start(); - mapWriter.timeStamp("timestamp").writeTimeStamp(i); + mapWriter.timeStampMilli("timestamp").writeTimeStampMilli(i); mapWriter.end(); break; } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java index 5a238bcc0d0c3..65823e2a821a1 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java @@ -80,7 +80,7 @@ public void nestedSchema() { childrenBuilder.add(new Field("child4", true, new List(), ImmutableList.of( new Field("child4.1", true, Utf8.INSTANCE, null) ))); - childrenBuilder.add(new Field("child5", true, new Union(UnionMode.Sparse, new int[] { MinorType.TIMESTAMP.ordinal(), MinorType.FLOAT8.ordinal() } ), ImmutableList.of( + childrenBuilder.add(new Field("child5", true, new Union(UnionMode.Sparse, new int[] { MinorType.TIMESTAMPMILLI.ordinal(), MinorType.FLOAT8.ordinal() } ), ImmutableList.of( new Field("child5.1", true, new Timestamp(TimeUnit.MILLISECOND), null), new Field("child5.2", true, new FloatingPoint(DOUBLE), ImmutableList.of()) ))); From e881f1155c7c628f79008988fff8ae81d3750984 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sat, 4 Feb 2017 16:20:54 -0500 Subject: [PATCH 0306/1644] ARROW-525: Python: Add more documentation to the package Author: Uwe L. Korn Closes #317 from xhochy/ARROW-525 and squashes the following commits: d213e63 [Uwe L. Korn] ARROW-525: Python: Add more documentation to the package --- python/setup.py | 22 +++++++++++++++++----- 1 file changed, 17 insertions(+), 5 deletions(-) diff --git a/python/setup.py b/python/setup.py index 9c63e93df3352..a771d23877013 100644 --- a/python/setup.py +++ b/python/setup.py @@ -257,9 +257,6 @@ def get_outputs(self): return [self._get_cmake_ext_path(name) for name in self.get_names()] -DESC = """\ -Python library for Apache Arrow""" - # In the case of a git-archive, we don't have any version information # from the SCM to infer a version. The only source is the java/pom.xml. # @@ -271,6 +268,12 @@ def get_outputs(self): version_tag = list(tree.getroot().findall('{http://maven.apache.org/POM/4.0.0}version'))[0] os.environ["SETUPTOOLS_SCM_PRETEND_VERSION"] = version_tag.text.replace("-SNAPSHOT", "a0") +long_description = """Apache Arrow is a columnar in-memory analytics layer +designed to accelerate big data. 
It houses a set of canonical in-memory +representations of flat and hierarchical data along with multiple +language-bindings for structure manipulation. It also provides IPC +and common algorithm implementations.""" + setup( name="pyarrow", packages=['pyarrow', 'pyarrow.tests'], @@ -286,9 +289,18 @@ def get_outputs(self): use_scm_version = {"root": "..", "relative_to": __file__}, setup_requires=['setuptools_scm'], install_requires=['cython >= 0.23', 'numpy >= 1.9', 'six >= 1.0.0'], - description=DESC, + description="Python library for Apache Arrow", + long_description=long_description, + classifiers=[ + 'License :: OSI Approved :: Apache Software License', + 'Programming Language :: Python :: 2.7', + 'Programming Language :: Python :: 3.4', + 'Programming Language :: Python :: 3.5', + 'Programming Language :: Python :: 3.6' + ], license='Apache License, Version 2.0', maintainer="Apache Arrow Developers", maintainer_email="dev@arrow.apache.org", - test_suite="pyarrow.tests" + test_suite="pyarrow.tests", + url="https://arrow.apache.org/" ) From 5b35d6bda94e901d25aaf3d622dbe47214f75488 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sat, 4 Feb 2017 16:23:46 -0500 Subject: [PATCH 0307/1644] ARROW-457: Python: Better control over memory pool Author: Uwe L. Korn Closes #315 from xhochy/ARROW-457 and squashes the following commits: dc5abdb [Uwe L. Korn] Use aligned deallocator 20c8505 [Uwe L. Korn] ARROW-458: Python: Expose jemalloc MemoryPool 2962bd8 [Uwe L. Korn] ARROW-457: Python: Better control over memory pool --- ci/travis_script_python.sh | 3 +- cpp/src/arrow/jemalloc/memory_pool.cc | 2 +- python/CMakeLists.txt | 15 +++++ python/cmake_modules/FindArrow.cmake | 14 +++++ python/pyarrow/__init__.py | 3 +- python/pyarrow/_parquet.pxd | 8 +-- python/pyarrow/_parquet.pyx | 13 ++-- python/pyarrow/array.pyx | 32 +++++----- python/pyarrow/includes/libarrow.pxd | 6 +- python/pyarrow/includes/libarrow_io.pxd | 2 +- python/pyarrow/includes/libarrow_ipc.pxd | 3 +- python/pyarrow/includes/libarrow_jemalloc.pxd | 27 ++++++++ python/pyarrow/includes/pyarrow.pxd | 9 +-- python/pyarrow/io.pyx | 18 +++--- python/pyarrow/jemalloc.pyx | 28 +++++++++ python/pyarrow/memory.pxd | 27 ++++++++ python/pyarrow/memory.pyx | 49 +++++++++++++++ python/pyarrow/tests/test_jemalloc.py | 56 +++++++++++++++++ python/setup.py | 11 +++- python/src/pyarrow/adapters/builtin.cc | 6 +- python/src/pyarrow/adapters/builtin.h | 3 +- python/src/pyarrow/common.cc | 61 ++++--------------- python/src/pyarrow/common.h | 1 + 23 files changed, 298 insertions(+), 99 deletions(-) create mode 100644 python/pyarrow/includes/libarrow_jemalloc.pxd create mode 100644 python/pyarrow/jemalloc.pyx create mode 100644 python/pyarrow/memory.pxd create mode 100644 python/pyarrow/memory.pyx create mode 100644 python/pyarrow/tests/test_jemalloc.py diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index c186fd4639fca..11d8d89ca7b6f 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -85,9 +85,10 @@ python_version_tests() { # Other stuff pip install pip install -r requirements.txt - python setup.py build_ext --inplace --with-parquet + python setup.py build_ext --inplace --with-parquet --with-jemalloc python -c "import pyarrow.parquet" + python -c "import pyarrow.jemalloc" python -m pytest -vv -r sxX pyarrow diff --git a/cpp/src/arrow/jemalloc/memory_pool.cc b/cpp/src/arrow/jemalloc/memory_pool.cc index c568316711717..f7a1446a0d27c 100644 --- a/cpp/src/arrow/jemalloc/memory_pool.cc +++ 
b/cpp/src/arrow/jemalloc/memory_pool.cc @@ -65,7 +65,7 @@ Status MemoryPool::Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) void MemoryPool::Free(uint8_t* buffer, int64_t size) { allocated_size_ -= size; - free(buffer); + dallocx(buffer, MALLOCX_ALIGN(kAlignment)); } int64_t MemoryPool::bytes_allocated() const { diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 942e74b73aaee..898c48ee0e48d 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -53,6 +53,9 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") option(PYARROW_BUILD_PARQUET "Build the PyArrow Parquet integration" OFF) + option(PYARROW_BUILD_JEMALLOC + "Build the PyArrow jemalloc integration" + OFF) endif() if(NOT PYARROW_BUILD_TESTS) @@ -412,6 +415,7 @@ set(CYTHON_EXTENSIONS config error io + memory scalar schema table @@ -446,6 +450,17 @@ if (PYARROW_BUILD_PARQUET) _parquet) endif() +if (PYARROW_BUILD_JEMALLOC) + ADD_THIRDPARTY_LIB(arrow_jemalloc + SHARED_LIB ${ARROW_JEMALLOC_SHARED_LIB}) + set(LINK_LIBS + ${LINK_LIBS} + arrow_jemalloc) + set(CYTHON_EXTENSIONS + ${CYTHON_EXTENSIONS} + jemalloc) +endif() + add_library(pyarrow SHARED ${PYARROW_SRCS}) target_link_libraries(pyarrow ${LINK_LIBS}) diff --git a/python/cmake_modules/FindArrow.cmake b/python/cmake_modules/FindArrow.cmake index 3c359aac55309..5d0207d7c7769 100644 --- a/python/cmake_modules/FindArrow.cmake +++ b/python/cmake_modules/FindArrow.cmake @@ -52,11 +52,17 @@ find_library(ARROW_IPC_LIB_PATH NAMES arrow_ipc ${ARROW_SEARCH_LIB_PATH} NO_DEFAULT_PATH) +find_library(ARROW_JEMALLOC_LIB_PATH NAMES arrow_jemalloc + PATHS + ${ARROW_SEARCH_LIB_PATH} + NO_DEFAULT_PATH) + if (ARROW_INCLUDE_DIR AND ARROW_LIB_PATH) set(ARROW_FOUND TRUE) set(ARROW_LIB_NAME libarrow) set(ARROW_IO_LIB_NAME libarrow_io) set(ARROW_IPC_LIB_NAME libarrow_ipc) + set(ARROW_JEMALLOC_LIB_NAME libarrow_jemalloc) set(ARROW_LIBS ${ARROW_SEARCH_LIB_PATH}) set(ARROW_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_LIB_NAME}.a) @@ -68,10 +74,14 @@ if (ARROW_INCLUDE_DIR AND ARROW_LIB_PATH) set(ARROW_IPC_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_IPC_LIB_NAME}.a) set(ARROW_IPC_SHARED_LIB ${ARROW_LIBS}/${ARROW_IPC_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(ARROW_JEMALLOC_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_JEMALLOC_LIB_NAME}.a) + set(ARROW_JEMALLOC_SHARED_LIB ${ARROW_LIBS}/${ARROW_JEMALLOC_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) + if (NOT Arrow_FIND_QUIETLY) message(STATUS "Found the Arrow core library: ${ARROW_LIB_PATH}") message(STATUS "Found the Arrow IO library: ${ARROW_IO_LIB_PATH}") message(STATUS "Found the Arrow IPC library: ${ARROW_IPC_LIB_PATH}") + message(STATUS "Found the Arrow jemalloc library: ${ARROW_JEMALLOC_LIB_PATH}") endif () else () if (NOT Arrow_FIND_QUIETLY) @@ -94,4 +104,8 @@ mark_as_advanced( ARROW_SHARED_LIB ARROW_IO_STATIC_LIB ARROW_IO_SHARED_LIB + ARROW_IPC_STATIC_LIB + ARROW_IPC_SHARED_LIB + ARROW_JEMALLOC_STATIC_LIB + ARROW_JEMALLOC_SHARED_LIB ) diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 7c521db6280be..ea4710d4137de 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -30,7 +30,6 @@ from pyarrow.array import (Array, from_pandas_series, from_pylist, - total_allocated_bytes, NumericArray, IntegerArray, FloatingPointArray, BooleanArray, Int8Array, UInt8Array, @@ -48,6 +47,8 @@ from pyarrow.ipc import FileReader, FileWriter, StreamReader, StreamWriter +from pyarrow.memory import MemoryPool, total_allocated_bytes + from pyarrow.scalar import (ArrayValue, Scalar, NA, 
NAType, BooleanValue, Int8Value, Int16Value, Int32Value, Int64Value, diff --git a/python/pyarrow/_parquet.pxd b/python/pyarrow/_parquet.pxd index 6b9350ad6782a..005be91bdb97f 100644 --- a/python/pyarrow/_parquet.pxd +++ b/python/pyarrow/_parquet.pxd @@ -19,7 +19,7 @@ from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport (CArray, CSchema, CStatus, - CTable, MemoryPool) + CTable, CMemoryPool) from pyarrow.includes.libarrow_io cimport ReadableFileInterface, OutputStream @@ -204,13 +204,13 @@ cdef extern from "parquet/api/writer.h" namespace "parquet" nogil: cdef extern from "parquet/arrow/reader.h" namespace "parquet::arrow" nogil: CStatus OpenFile(const shared_ptr[ReadableFileInterface]& file, - MemoryPool* allocator, + CMemoryPool* allocator, const ReaderProperties& properties, const shared_ptr[CFileMetaData]& metadata, unique_ptr[FileReader]* reader) cdef cppclass FileReader: - FileReader(MemoryPool* pool, unique_ptr[ParquetFileReader] reader) + FileReader(CMemoryPool* pool, unique_ptr[ParquetFileReader] reader) CStatus ReadColumn(int i, shared_ptr[CArray]* out); CStatus ReadTable(shared_ptr[CTable]* out); CStatus ReadTable(const vector[int]& column_indices, @@ -229,7 +229,7 @@ cdef extern from "parquet/arrow/schema.h" namespace "parquet::arrow" nogil: cdef extern from "parquet/arrow/writer.h" namespace "parquet::arrow" nogil: cdef CStatus WriteTable( - const CTable* table, MemoryPool* pool, + const CTable* table, CMemoryPool* pool, const shared_ptr[OutputStream]& sink, int64_t chunk_size, const shared_ptr[WriterProperties]& properties) diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx index fd4670a00c837..796c436ec46f4 100644 --- a/python/pyarrow/_parquet.pyx +++ b/python/pyarrow/_parquet.pyx @@ -32,6 +32,7 @@ from pyarrow.compat import tobytes, frombytes from pyarrow.error import ArrowException from pyarrow.error cimport check_status from pyarrow.io import NativeFile +from pyarrow.memory cimport MemoryPool, maybe_unbox_memory_pool from pyarrow.table cimport Table from pyarrow.io cimport NativeFile, get_reader, get_writer @@ -342,13 +343,13 @@ cdef logical_type_name_from_enum(ParquetLogicalType type_): cdef class ParquetReader: cdef: object source - MemoryPool* allocator + CMemoryPool* allocator unique_ptr[FileReader] reader column_idx_map FileMetaData _metadata - def __cinit__(self): - self.allocator = default_memory_pool() + def __cinit__(self, MemoryPool memory_pool=None): + self.allocator = maybe_unbox_memory_pool(memory_pool) self._metadata = None def open(self, object source, FileMetaData metadata=None): @@ -471,6 +472,7 @@ cdef class ParquetWriter: cdef: shared_ptr[WriterProperties] properties shared_ptr[OutputStream] sink + CMemoryPool* allocator cdef readonly: object use_dictionary @@ -479,7 +481,7 @@ cdef class ParquetWriter: int row_group_size def __cinit__(self, where, use_dictionary=None, compression=None, - version=None): + version=None, MemoryPool memory_pool=None): cdef shared_ptr[FileOutputStream] filestream if isinstance(where, six.string_types): @@ -487,6 +489,7 @@ cdef class ParquetWriter: self.sink = filestream else: get_writer(where, &self.sink) + self.allocator = maybe_unbox_memory_pool(memory_pool) self.use_dictionary = use_dictionary self.compression = compression @@ -540,6 +543,6 @@ cdef class ParquetWriter: cdef int c_row_group_size = row_group_size with nogil: - check_status(WriteTable(ctable, default_memory_pool(), + check_status(WriteTable(ctable, self.allocator, self.sink, c_row_group_size, self.properties)) diff 
--git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index c3a5a045b7dd5..9b34f5607b31d 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -29,6 +29,7 @@ import pyarrow.config from pyarrow.compat import frombytes, tobytes from pyarrow.error cimport check_status +from pyarrow.memory cimport MemoryPool, maybe_unbox_memory_pool cimport pyarrow.scalar as scalar from pyarrow.scalar import NA @@ -44,11 +45,6 @@ cdef _pandas(): return pd -def total_allocated_bytes(): - cdef MemoryPool* pool = pyarrow.get_memory_pool() - return pool.bytes_allocated() - - cdef class Array: cdef init(self, const shared_ptr[CArray]& sp_array): @@ -58,7 +54,7 @@ cdef class Array: self.type.init(self.sp_array.get().type()) @staticmethod - def from_pandas(obj, mask=None, timestamps_to_ms=False, Field field=None): + def from_pandas(obj, mask=None, timestamps_to_ms=False, Field field=None, MemoryPool memory_pool=None): """ Convert pandas.Series to an Arrow Array. @@ -74,6 +70,9 @@ cdef class Array: compatibility with other functionality like Parquet I/O which only supports milliseconds. + memory_pool: MemoryPool, optional + Specific memory pool to use to allocate the resulting Arrow array. + Notes ----- Localized timestamps will currently be returned as UTC (pandas's native representation). @@ -107,6 +106,7 @@ cdef class Array: cdef: shared_ptr[CArray] out shared_ptr[CField] c_field + CMemoryPool* pool pd = _pandas() @@ -121,20 +121,20 @@ cdef class Array: if isinstance(series_values, pd.Categorical): return DictionaryArray.from_arrays(series_values.codes, series_values.categories.values, - mask=mask) + mask=mask, memory_pool=memory_pool) else: if series_values.dtype.type == np.datetime64 and timestamps_to_ms: series_values = series_values.astype('datetime64[ms]') + pool = maybe_unbox_memory_pool(memory_pool) with nogil: check_status(pyarrow.PandasToArrow( - pyarrow.get_memory_pool(), series_values, mask, - c_field, &out)) + pool, series_values, mask, c_field, &out)) return box_arrow_array(out) @staticmethod - def from_list(object list_obj, DataType type=None): + def from_list(object list_obj, DataType type=None, MemoryPool memory_pool=None): """ Convert Python list to Arrow array @@ -147,10 +147,12 @@ cdef class Array: pyarrow.array.Array """ cdef: - shared_ptr[CArray] sp_array + shared_ptr[CArray] sp_array + CMemoryPool* pool + pool = maybe_unbox_memory_pool(memory_pool) if type is None: - check_status(pyarrow.ConvertPySequence(list_obj, &sp_array)) + check_status(pyarrow.ConvertPySequence(list_obj, pool, &sp_array)) else: raise NotImplementedError() @@ -330,7 +332,7 @@ cdef class BinaryArray(Array): cdef class DictionaryArray(Array): @staticmethod - def from_arrays(indices, dictionary, mask=None): + def from_arrays(indices, dictionary, mask=None, MemoryPool memory_pool=None): """ Construct Arrow DictionaryArray from array of indices (must be non-negative integers) and corresponding array of dictionary values @@ -352,8 +354,8 @@ cdef class DictionaryArray(Array): shared_ptr[CDataType] c_type shared_ptr[CArray] c_result - arrow_indices = Array.from_pandas(indices, mask=mask) - arrow_dictionary = Array.from_pandas(dictionary) + arrow_indices = Array.from_pandas(indices, mask=mask, memory_pool=memory_pool) + arrow_dictionary = Array.from_pandas(dictionary, memory_pool=memory_pool) if not isinstance(arrow_indices, IntegerArray): raise ValueError('Indices must be integer type') diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 
6284ad3c88a7a..38883e811e1cc 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -90,7 +90,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: shared_ptr[CDataType] timestamp(TimeUnit unit) - cdef cppclass MemoryPool" arrow::MemoryPool": + cdef cppclass CMemoryPool" arrow::MemoryPool": int64_t bytes_allocated() cdef cppclass CBuffer" arrow::Buffer": @@ -104,9 +104,9 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass PoolBuffer(ResizableBuffer): PoolBuffer() - PoolBuffer(MemoryPool*) + PoolBuffer(CMemoryPool*) - cdef MemoryPool* default_memory_pool() + cdef CMemoryPool* default_memory_pool() cdef cppclass CListType" arrow::ListType"(CDataType): CListType(const shared_ptr[CDataType]& value_type) diff --git a/python/pyarrow/includes/libarrow_io.pxd b/python/pyarrow/includes/libarrow_io.pxd index 31379386187ee..8d0d5248b4db0 100644 --- a/python/pyarrow/includes/libarrow_io.pxd +++ b/python/pyarrow/includes/libarrow_io.pxd @@ -82,7 +82,7 @@ cdef extern from "arrow/io/file.h" namespace "arrow::io" nogil: CStatus Open(const c_string& path, shared_ptr[ReadableFile]* file) @staticmethod - CStatus Open(const c_string& path, MemoryPool* memory_pool, + CStatus Open(const c_string& path, CMemoryPool* memory_pool, shared_ptr[ReadableFile]* file) int file_descriptor() diff --git a/python/pyarrow/includes/libarrow_ipc.pxd b/python/pyarrow/includes/libarrow_ipc.pxd index bfece14fe6e03..5ab98152add49 100644 --- a/python/pyarrow/includes/libarrow_ipc.pxd +++ b/python/pyarrow/includes/libarrow_ipc.pxd @@ -18,8 +18,7 @@ # distutils: language = c++ from pyarrow.includes.common cimport * -from pyarrow.includes.libarrow cimport (MemoryPool, CArray, CSchema, - CRecordBatch) +from pyarrow.includes.libarrow cimport (CArray, CSchema, CRecordBatch) from pyarrow.includes.libarrow_io cimport (InputStream, OutputStream, ReadableFileInterface) diff --git a/python/pyarrow/includes/libarrow_jemalloc.pxd b/python/pyarrow/includes/libarrow_jemalloc.pxd new file mode 100644 index 0000000000000..0609d1907589a --- /dev/null +++ b/python/pyarrow/includes/libarrow_jemalloc.pxd @@ -0,0 +1,27 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
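The new libarrow_jemalloc.pxd below declares the arrow::jemalloc::MemoryPool wrapper consumed by jemalloc.pyx. At the Python level the pool is opt-in per allocation; a minimal usage sketch, assuming pyarrow was built with --with-jemalloc (this is the pattern the new test_jemalloc.py exercises):

    import pyarrow
    import pyarrow.jemalloc

    # Allocate against the jemalloc pool instead of the process-wide default
    pool = pyarrow.jemalloc.default_pool()
    arr = pyarrow.from_pylist([1, None, 3, None], memory_pool=pool)
    assert pool.bytes_allocated() > 0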
+ +# distutils: language = c++ + +from pyarrow.includes.common cimport * +from pyarrow.includes.libarrow cimport * + +cdef extern from "arrow/jemalloc/memory_pool.h" namespace "arrow::jemalloc" nogil: + cdef cppclass CJemallocMemoryPool" arrow::jemalloc::MemoryPool": + int64_t bytes_allocated() + @staticmethod + CMemoryPool* default_pool() diff --git a/python/pyarrow/includes/pyarrow.pxd b/python/pyarrow/includes/pyarrow.pxd index 04ad4f32272e6..f1d45e0d50f36 100644 --- a/python/pyarrow/includes/pyarrow.pxd +++ b/python/pyarrow/includes/pyarrow.pxd @@ -20,7 +20,7 @@ from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport (CArray, CBuffer, CColumn, CField, CTable, CDataType, CStatus, Type, - MemoryPool, TimeUnit) + CMemoryPool, TimeUnit) cimport pyarrow.includes.libarrow_io as arrow_io @@ -28,9 +28,9 @@ cimport pyarrow.includes.libarrow_io as arrow_io cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil: shared_ptr[CDataType] GetPrimitiveType(Type type) shared_ptr[CDataType] GetTimestampType(TimeUnit unit) - CStatus ConvertPySequence(object obj, shared_ptr[CArray]* out) + CStatus ConvertPySequence(object obj, CMemoryPool* pool, shared_ptr[CArray]* out) - CStatus PandasToArrow(MemoryPool* pool, object ao, object mo, + CStatus PandasToArrow(CMemoryPool* pool, object ao, object mo, shared_ptr[CField] field, shared_ptr[CArray]* out) @@ -43,7 +43,8 @@ cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil: CStatus ConvertTableToPandas(const shared_ptr[CTable]& table, int nthreads, PyObject** out) - MemoryPool* get_memory_pool() + void set_default_memory_pool(CMemoryPool* pool) + CMemoryPool* get_memory_pool() cdef extern from "pyarrow/common.h" namespace "pyarrow" nogil: diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index 8b5650879f8f1..89ce6e785c02b 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -33,6 +33,7 @@ cimport pyarrow.includes.pyarrow as pyarrow from pyarrow.compat import frombytes, tobytes, encode_file_path from pyarrow.error cimport check_status +from pyarrow.memory cimport MemoryPool, maybe_unbox_memory_pool from pyarrow.schema cimport Schema from pyarrow.table cimport (RecordBatch, batch_from_cbatch, table_from_ctable) @@ -372,7 +373,7 @@ cdef class OSFile(NativeFile): cdef: object path - def __cinit__(self, path, mode='r'): + def __cinit__(self, path, mode='r', MemoryPool memory_pool=None): self.path = path cdef: @@ -383,7 +384,7 @@ cdef class OSFile(NativeFile): self.is_readable = self.is_writeable = 0 if mode in ('r', 'rb'): - self._open_readable(c_path) + self._open_readable(c_path, maybe_unbox_memory_pool(memory_pool)) elif mode in ('w', 'wb'): self._open_writeable(c_path) else: @@ -391,12 +392,11 @@ cdef class OSFile(NativeFile): self.is_open = True - cdef _open_readable(self, c_string path): + cdef _open_readable(self, c_string path, CMemoryPool* pool): cdef shared_ptr[ReadableFile] handle with nogil: - check_status(ReadableFile.Open(path, pyarrow.get_memory_pool(), - &handle)) + check_status(ReadableFile.Open(path, pool, &handle)) self.is_readable = 1 self.rd_file = handle @@ -450,9 +450,9 @@ cdef class Buffer: self.buffer.get().size()) -cdef shared_ptr[PoolBuffer] allocate_buffer(): +cdef shared_ptr[PoolBuffer] allocate_buffer(CMemoryPool* pool): cdef shared_ptr[PoolBuffer] result - result.reset(new PoolBuffer(pyarrow.get_memory_pool())) + result.reset(new PoolBuffer(pool)) return result @@ -461,8 +461,8 @@ cdef class InMemoryOutputStream(NativeFile): cdef: shared_ptr[PoolBuffer] buffer - def __cinit__(self): - 
self.buffer = allocate_buffer() + def __cinit__(self, MemoryPool memory_pool=None): + self.buffer = allocate_buffer(maybe_unbox_memory_pool(memory_pool)) self.wr_file.reset(new BufferOutputStream( self.buffer)) self.is_readable = 0 diff --git a/python/pyarrow/jemalloc.pyx b/python/pyarrow/jemalloc.pyx new file mode 100644 index 0000000000000..97583f4b0da95 --- /dev/null +++ b/python/pyarrow/jemalloc.pyx @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# cython: profile=False +# distutils: language = c++ +# cython: embedsignature = True + +from pyarrow.includes.libarrow_jemalloc cimport CJemallocMemoryPool +from pyarrow.memory cimport MemoryPool + +def default_pool(): + cdef MemoryPool pool = MemoryPool() + pool.init(CJemallocMemoryPool.default_pool()) + return pool diff --git a/python/pyarrow/memory.pxd b/python/pyarrow/memory.pxd new file mode 100644 index 0000000000000..3079ccb807b0d --- /dev/null +++ b/python/pyarrow/memory.pxd @@ -0,0 +1,27 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +from pyarrow.includes.libarrow cimport CMemoryPool + + +cdef class MemoryPool: + cdef: + CMemoryPool* pool + + cdef init(self, CMemoryPool* pool) + +cdef CMemoryPool* maybe_unbox_memory_pool(MemoryPool memory_pool) diff --git a/python/pyarrow/memory.pyx b/python/pyarrow/memory.pyx new file mode 100644 index 0000000000000..18a6de4f15392 --- /dev/null +++ b/python/pyarrow/memory.pyx @@ -0,0 +1,49 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# cython: profile=False +# distutils: language = c++ +# cython: embedsignature = True + +from pyarrow.includes.libarrow cimport CMemoryPool +from pyarrow.includes.pyarrow cimport set_default_memory_pool, get_memory_pool + +cdef class MemoryPool: + cdef init(self, CMemoryPool* pool): + self.pool = pool + + def bytes_allocated(self): + return self.pool.bytes_allocated() + +cdef CMemoryPool* maybe_unbox_memory_pool(MemoryPool memory_pool): + if memory_pool is None: + return get_memory_pool() + else: + return memory_pool.pool + +def default_pool(): + cdef: + MemoryPool pool = MemoryPool() + pool.init(get_memory_pool()) + return pool + +def set_default_pool(MemoryPool pool): + set_default_memory_pool(pool.pool) + +def total_allocated_bytes(): + cdef CMemoryPool* pool = get_memory_pool() + return pool.bytes_allocated() diff --git a/python/pyarrow/tests/test_jemalloc.py b/python/pyarrow/tests/test_jemalloc.py new file mode 100644 index 0000000000000..8efd514dd0cae --- /dev/null +++ b/python/pyarrow/tests/test_jemalloc.py @@ -0,0 +1,56 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
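The test file that follows exercises both halves of this machinery: passing an explicit pool to a single call, and swapping the process-wide default. As a minimal C++ sketch of the surface the Cython layer binds against (assuming a build configured with -DARROW_JEMALLOC=ON; the header and class names are taken from the libarrow_jemalloc declarations above, so treat this as illustrative rather than definitive):

#include <iostream>

#include "arrow/jemalloc/memory_pool.h"
#include "arrow/memory_pool.h"

int main() {
  // Process-wide fallback pool, used when no override has been installed.
  arrow::MemoryPool* system_pool = arrow::default_memory_pool();

  // jemalloc-backed singleton; pyarrow.jemalloc.default_pool() wraps this.
  arrow::MemoryPool* je_pool = arrow::jemalloc::MemoryPool::default_pool();

  // Both expose the same accounting interface, which is what the
  // Python-level total_allocated_bytes() ultimately reads for its pool.
  std::cout << system_pool->bytes_allocated() << " "
            << je_pool->bytes_allocated() << std::endl;
  return 0;
}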
+ +import gc +import pytest + +try: + import pyarrow.jemalloc + HAVE_JEMALLOC = True +except ImportError: + HAVE_JEMALLOC = False + +jemalloc = pytest.mark.skipif(not HAVE_JEMALLOC, + reason='jemalloc support not built') + + +@jemalloc +def test_different_memory_pool(): + gc.collect() + bytes_before_default = pyarrow.total_allocated_bytes() + bytes_before_jemalloc = pyarrow.jemalloc.default_pool().bytes_allocated() + array = pyarrow.from_pylist([1, None, 3, None], + memory_pool=pyarrow.jemalloc.default_pool()) + gc.collect() + assert pyarrow.total_allocated_bytes() == bytes_before_default + assert pyarrow.jemalloc.default_pool().bytes_allocated() > bytes_before_jemalloc + +@jemalloc +def test_default_memory_pool(): + gc.collect() + bytes_before_default = pyarrow.total_allocated_bytes() + bytes_before_jemalloc = pyarrow.jemalloc.default_pool().bytes_allocated() + + old_memory_pool = pyarrow.memory.default_pool() + pyarrow.memory.set_default_pool(pyarrow.jemalloc.default_pool()) + array = pyarrow.from_pylist([1, None, 3, None]) + pyarrow.memory.set_default_pool(old_memory_pool) + gc.collect() + + assert pyarrow.total_allocated_bytes() == bytes_before_default + assert pyarrow.jemalloc.default_pool().bytes_allocated() > bytes_before_jemalloc + diff --git a/python/setup.py b/python/setup.py index a771d23877013..5f5e5f3795541 100644 --- a/python/setup.py +++ b/python/setup.py @@ -80,7 +80,8 @@ def run(self): description = "Build the C-extensions for arrow" user_options = ([('extra-cmake-args=', None, 'extra arguments for CMake'), ('build-type=', None, 'build type (debug or release)'), - ('with-parquet', None, 'build the Parquet extension')] + + ('with-parquet', None, 'build the Parquet extension'), + ('with-jemalloc', None, 'build the jemalloc extension')] + _build_ext.user_options) def initialize_options(self): @@ -88,12 +89,15 @@ def initialize_options(self): self.extra_cmake_args = os.environ.get('PYARROW_CMAKE_OPTIONS', '') self.build_type = os.environ.get('PYARROW_BUILD_TYPE', 'debug').lower() self.with_parquet = False + self.with_jemalloc = False CYTHON_MODULE_NAMES = [ 'array', 'config', 'error', 'io', + 'jemalloc', + 'memory', '_parquet', 'scalar', 'schema', @@ -135,6 +139,9 @@ def _run_cmake(self): if self.with_parquet: cmake_options.append('-DPYARROW_BUILD_PARQUET=on') + if self.with_jemalloc: + cmake_options.append('-DPYARROW_BUILD_JEMALLOC=on') + if sys.platform != 'win32': cmake_options.append('-DCMAKE_BUILD_TYPE={0}' .format(self.build_type)) @@ -216,6 +223,8 @@ def _run_cmake(self): def _failure_permitted(self, name): if name == '_parquet' and not self.with_parquet: return True + if name == 'jemalloc' and not self.with_jemalloc: + return True return False def _get_inplace_dir(self): diff --git a/python/src/pyarrow/adapters/builtin.cc b/python/src/pyarrow/adapters/builtin.cc index fb7475f0c9407..1abfb4091189e 100644 --- a/python/src/pyarrow/adapters/builtin.cc +++ b/python/src/pyarrow/adapters/builtin.cc @@ -29,6 +29,7 @@ using arrow::ArrayBuilder; using arrow::DataType; +using arrow::MemoryPool; using arrow::Status; using arrow::Type; @@ -495,7 +496,8 @@ Status ListConverter::Init(const std::shared_ptr& builder) { return Status::OK(); } -Status ConvertPySequence(PyObject* obj, std::shared_ptr* out) { +Status ConvertPySequence( + PyObject* obj, MemoryPool* pool, std::shared_ptr* out) { std::shared_ptr type; int64_t size; PyDateTime_IMPORT; @@ -516,7 +518,7 @@ Status ConvertPySequence(PyObject* obj, std::shared_ptr* out) { // Give the sequence converter an array builder 
   std::shared_ptr<ArrayBuilder> builder;
-  RETURN_NOT_OK(arrow::MakeBuilder(get_memory_pool(), type, &builder));
+  RETURN_NOT_OK(arrow::MakeBuilder(pool, type, &builder));
   converter->Init(builder);
 
   RETURN_NOT_OK(converter->AppendData(obj));
diff --git a/python/src/pyarrow/adapters/builtin.h b/python/src/pyarrow/adapters/builtin.h
index 1ff36945c88c7..667298e3c5c5f 100644
--- a/python/src/pyarrow/adapters/builtin.h
+++ b/python/src/pyarrow/adapters/builtin.h
@@ -38,7 +38,8 @@ class Status;
 namespace pyarrow {
 
 PYARROW_EXPORT
-arrow::Status ConvertPySequence(PyObject* obj, std::shared_ptr<arrow::Array>* out);
+arrow::Status ConvertPySequence(
+    PyObject* obj, arrow::MemoryPool* pool, std::shared_ptr<arrow::Array>* out);
 
 } // namespace pyarrow
 
diff --git a/python/src/pyarrow/common.cc b/python/src/pyarrow/common.cc
index b8712d7d0a4fc..d2f5291ea8301 100644
--- a/python/src/pyarrow/common.cc
+++ b/python/src/pyarrow/common.cc
@@ -28,58 +28,21 @@ using arrow::Status;
 
 namespace pyarrow {
 
-class PyArrowMemoryPool : public arrow::MemoryPool {
- public:
-  PyArrowMemoryPool() : bytes_allocated_(0) {}
-  virtual ~PyArrowMemoryPool() {}
+static std::mutex memory_pool_mutex;
+static arrow::MemoryPool* default_pyarrow_pool = nullptr;
 
-  Status Allocate(int64_t size, uint8_t** out) override {
-    std::lock_guard<std::mutex> guard(pool_lock_);
-    *out = static_cast<uint8_t*>(std::malloc(size));
-    if (*out == nullptr) {
-      std::stringstream ss;
-      ss << "malloc of size " << size << " failed";
-      return Status::OutOfMemory(ss.str());
-    }
-
-    bytes_allocated_ += size;
-
-    return Status::OK();
-  }
-
-  Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) override {
-    *ptr = reinterpret_cast<uint8_t*>(std::realloc(*ptr, new_size));
-
-    if (*ptr == NULL) {
-      std::stringstream ss;
-      ss << "realloc of size " << new_size << " failed";
-      return Status::OutOfMemory(ss.str());
-    }
-
-    bytes_allocated_ += new_size - old_size;
-
-    return Status::OK();
-  }
-
-  int64_t bytes_allocated() const override {
-    std::lock_guard<std::mutex> guard(pool_lock_);
-    return bytes_allocated_;
-  }
-
-  void Free(uint8_t* buffer, int64_t size) override {
-    std::lock_guard<std::mutex> guard(pool_lock_);
-    std::free(buffer);
-    bytes_allocated_ -= size;
-  }
-
- private:
-  mutable std::mutex pool_lock_;
-  int64_t bytes_allocated_;
-};
+void set_default_memory_pool(arrow::MemoryPool* pool) {
+  std::lock_guard<std::mutex> guard(memory_pool_mutex);
+  default_pyarrow_pool = pool;
+}
 
 arrow::MemoryPool* get_memory_pool() {
-  static PyArrowMemoryPool memory_pool;
-  return &memory_pool;
+  std::lock_guard<std::mutex> guard(memory_pool_mutex);
+  if (default_pyarrow_pool) {
+    return default_pyarrow_pool;
+  } else {
+    return arrow::default_memory_pool();
+  }
 }
 
 // ----------------------------------------------------------------------
diff --git a/python/src/pyarrow/common.h b/python/src/pyarrow/common.h
index 0733a3b7cf061..ad65ec75eec9e 100644
--- a/python/src/pyarrow/common.h
+++ b/python/src/pyarrow/common.h
@@ -98,6 +98,7 @@ struct PyObjectStringify {
 }
 
 // Return the common PyArrow memory pool
+PYARROW_EXPORT void set_default_memory_pool(arrow::MemoryPool* pool);
 PYARROW_EXPORT arrow::MemoryPool* get_memory_pool();
 
 class PYARROW_EXPORT NumPyBuffer : public arrow::Buffer {

From 84f16624bb390aebf16318b62ff2ac8238fc4b7c Mon Sep 17 00:00:00 2001
From: "Uwe L. Korn"
Date: Sat, 4 Feb 2017 18:33:39 -0500
Subject: [PATCH 0308/1644] ARROW-381: [C++] Simplify primitive array type builders to use a default type singleton

Author: Uwe L. Korn

Closes #316 from xhochy/ARROW-381 and squashes the following commits:

7061d9a [Uwe L. Korn] Use TypeTraits
79e07f1 [Uwe L. Korn] ARROW-381: [C++] Simplify primitive array type builders to use a default type singleton
---
 cpp/src/arrow/builder-benchmark.cc |  2 +-
 cpp/src/arrow/builder.h            |  7 +++++++
 cpp/src/arrow/test-util.h          |  4 ++--
 cpp/src/arrow/type.h               | 17 ----------------
 cpp/src/arrow/type_fwd.h           | 22 +++++++++++++++++++++
 cpp/src/arrow/type_traits.h        | 31 ++++++++++++++++++++++++++++++
 6 files changed, 63 insertions(+), 20 deletions(-)

diff --git a/cpp/src/arrow/builder-benchmark.cc b/cpp/src/arrow/builder-benchmark.cc
index 67799a3485f23..b0c3cd19064de 100644
--- a/cpp/src/arrow/builder-benchmark.cc
+++ b/cpp/src/arrow/builder-benchmark.cc
@@ -30,7 +30,7 @@ static void BM_BuildPrimitiveArrayNoNulls(
   // 2 MiB block
   std::vector<int64_t> data(256 * 1024, 100);
   while (state.KeepRunning()) {
-    Int64Builder builder(default_memory_pool(), arrow::int64());
+    Int64Builder builder(default_memory_pool());
     for (int i = 0; i < kFinalSize; i++) {
       // Build up an array of 512 MiB in size
       builder.Append(data.data(), data.size(), nullptr);
diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h
index 747da7ca2d9dd..672d2d8f23e8f 100644
--- a/cpp/src/arrow/builder.h
+++ b/cpp/src/arrow/builder.h
@@ -19,6 +19,7 @@
 #define ARROW_BUILDER_H
 
 #include
+#include
 #include
 #include
 #include
@@ -27,6 +28,7 @@
 #include "arrow/memory_pool.h"
 #include "arrow/status.h"
 #include "arrow/type.h"
+#include "arrow/type_traits.h"
 #include "arrow/util/bit-util.h"
 #include "arrow/util/macros.h"
 #include "arrow/util/visibility.h"
@@ -186,6 +188,11 @@ class ARROW_EXPORT NumericBuilder : public PrimitiveBuilder<T> {
   using typename PrimitiveBuilder<T>::value_type;
   using PrimitiveBuilder<T>::PrimitiveBuilder;
 
+  template <typename T1 = T>
+  explicit NumericBuilder(
+      typename std::enable_if<TypeTraits<T1>::is_parameter_free, MemoryPool*>::type pool)
+      : PrimitiveBuilder<T1>(pool, TypeTraits<T1>::type_singleton()) {}
+
   using PrimitiveBuilder<T>::Append;
   using PrimitiveBuilder<T>::Init;
   using PrimitiveBuilder<T>::Resize;
diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h
index b59809d9e48e6..4e525804b47cc 100644
--- a/cpp/src/arrow/test-util.h
+++ b/cpp/src/arrow/test-util.h
@@ -294,8 +294,8 @@ class TestBuilder : public ::testing::Test {
   void SetUp() {
     pool_ = default_memory_pool();
     type_ = TypePtr(new UInt8Type());
-    builder_.reset(new UInt8Builder(pool_, type_));
-    builder_nn_.reset(new UInt8Builder(pool_, type_));
+    builder_.reset(new UInt8Builder(pool_));
+    builder_nn_.reset(new UInt8Builder(pool_));
   }
 
  protected:
diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h
index 77a70d1d2ddd3..8638a3f4b6e90 100644
--- a/cpp/src/arrow/type.h
+++ b/cpp/src/arrow/type.h
@@ -540,26 +540,9 @@ class ARROW_EXPORT DictionaryType : public FixedWidthType {
 // ----------------------------------------------------------------------
 // Factory functions
 
-std::shared_ptr<DataType> ARROW_EXPORT null();
-std::shared_ptr<DataType> ARROW_EXPORT boolean();
-std::shared_ptr<DataType> ARROW_EXPORT int8();
-std::shared_ptr<DataType> ARROW_EXPORT int16();
-std::shared_ptr<DataType> ARROW_EXPORT int32();
-std::shared_ptr<DataType> ARROW_EXPORT int64();
-std::shared_ptr<DataType> ARROW_EXPORT uint8();
-std::shared_ptr<DataType> ARROW_EXPORT uint16();
-std::shared_ptr<DataType> ARROW_EXPORT uint32();
-std::shared_ptr<DataType> ARROW_EXPORT uint64();
-std::shared_ptr<DataType> ARROW_EXPORT float16();
-std::shared_ptr<DataType> ARROW_EXPORT float32();
-std::shared_ptr<DataType> ARROW_EXPORT float64();
-std::shared_ptr<DataType> ARROW_EXPORT utf8();
-std::shared_ptr<DataType> ARROW_EXPORT binary();
-
 std::shared_ptr<DataType> ARROW_EXPORT list(const std::shared_ptr<DataType>& value_type);
 std::shared_ptr<DataType> ARROW_EXPORT list(const std::shared_ptr<Field>& value_type);
 
-std::shared_ptr<DataType> ARROW_EXPORT date();
 std::shared_ptr<DataType> ARROW_EXPORT timestamp(TimeUnit unit);
 std::shared_ptr<DataType> ARROW_EXPORT time(TimeUnit unit);
 
diff --git a/cpp/src/arrow/type_fwd.h b/cpp/src/arrow/type_fwd.h
index 334abef664426..fc4ad3d87d8ac 100644
--- a/cpp/src/arrow/type_fwd.h
+++ b/cpp/src/arrow/type_fwd.h
@@ -18,6 +18,8 @@
 #ifndef ARROW_TYPE_FWD_H
 #define ARROW_TYPE_FWD_H
 
+#include "arrow/util/visibility.h"
+
 namespace arrow {
 
 class Status;
@@ -104,6 +106,26 @@ using TimestampBuilder = NumericBuilder<TimestampType>;
 struct IntervalType;
 using IntervalArray = NumericArray<IntervalType>;
 
+// ----------------------------------------------------------------------
+// (parameter-free) Factory functions
+
+std::shared_ptr<DataType> ARROW_EXPORT null();
+std::shared_ptr<DataType> ARROW_EXPORT boolean();
+std::shared_ptr<DataType> ARROW_EXPORT int8();
+std::shared_ptr<DataType> ARROW_EXPORT int16();
+std::shared_ptr<DataType> ARROW_EXPORT int32();
+std::shared_ptr<DataType> ARROW_EXPORT int64();
+std::shared_ptr<DataType> ARROW_EXPORT uint8();
+std::shared_ptr<DataType> ARROW_EXPORT uint16();
+std::shared_ptr<DataType> ARROW_EXPORT uint32();
+std::shared_ptr<DataType> ARROW_EXPORT uint64();
+std::shared_ptr<DataType> ARROW_EXPORT float16();
+std::shared_ptr<DataType> ARROW_EXPORT float32();
+std::shared_ptr<DataType> ARROW_EXPORT float64();
+std::shared_ptr<DataType> ARROW_EXPORT utf8();
+std::shared_ptr<DataType> ARROW_EXPORT binary();
+std::shared_ptr<DataType> ARROW_EXPORT date();
+
 } // namespace arrow
 
 #endif // ARROW_TYPE_FWD_H
diff --git a/cpp/src/arrow/type_traits.h b/cpp/src/arrow/type_traits.h
index 5616018d93400..5cd5f45466bf7 100644
--- a/cpp/src/arrow/type_traits.h
+++ b/cpp/src/arrow/type_traits.h
@@ -33,6 +33,8 @@ struct TypeTraits<UInt8Type> {
   using ArrayType = UInt8Array;
   using BuilderType = UInt8Builder;
   static inline int bytes_required(int elements) { return elements; }
+  constexpr static bool is_parameter_free = true;
+  static inline std::shared_ptr<DataType> type_singleton() { return uint8(); }
 };
 
 template <>
@@ -40,6 +42,8 @@ struct TypeTraits<Int8Type> {
   using ArrayType = Int8Array;
   using BuilderType = Int8Builder;
   static inline int bytes_required(int elements) { return elements; }
+  constexpr static bool is_parameter_free = true;
+  static inline std::shared_ptr<DataType> type_singleton() { return int8(); }
 };
 
 template <>
@@ -48,6 +52,8 @@ struct TypeTraits<UInt16Type> {
   using ArrayType = UInt16Array;
   using BuilderType = UInt16Builder;
   static inline int bytes_required(int elements) { return elements * sizeof(uint16_t); }
+  constexpr static bool is_parameter_free = true;
+  static inline std::shared_ptr<DataType> type_singleton() { return uint16(); }
 };
 
 template <>
@@ -56,6 +62,8 @@ struct TypeTraits<Int16Type> {
   using ArrayType = Int16Array;
   using BuilderType = Int16Builder;
   static inline int bytes_required(int elements) { return elements * sizeof(int16_t); }
+  constexpr static bool is_parameter_free = true;
+  static inline std::shared_ptr<DataType> type_singleton() { return int16(); }
 };
 
 template <>
@@ -64,6 +72,8 @@ struct TypeTraits<UInt32Type> {
   using ArrayType = UInt32Array;
   using BuilderType = UInt32Builder;
   static inline int bytes_required(int elements) { return elements * sizeof(uint32_t); }
+  constexpr static bool is_parameter_free = true;
+  static inline std::shared_ptr<DataType> type_singleton() { return uint32(); }
 };
 
 template <>
@@ -72,6 +82,8 @@ struct TypeTraits<Int32Type> {
   using ArrayType = Int32Array;
   using BuilderType = Int32Builder;
   static inline int bytes_required(int elements) { return elements * sizeof(int32_t); }
+  constexpr static bool is_parameter_free = true;
+  static inline std::shared_ptr<DataType> type_singleton() { return int32(); }
 };
 
 template <>
@@ -80,6 +92,8 @@ struct TypeTraits<UInt64Type> {
   using ArrayType = UInt64Array;
   using BuilderType = UInt64Builder;
   static inline int bytes_required(int elements) { return elements * sizeof(uint64_t); }
+  constexpr static bool is_parameter_free = true;
+  static inline std::shared_ptr<DataType> type_singleton() { return uint64(); }
 };
 
 template <>
@@ -88,6 +102,8 @@ struct TypeTraits<Int64Type> {
   using ArrayType = Int64Array;
   using BuilderType = Int64Builder;
   static inline int bytes_required(int elements) { return elements * sizeof(int64_t); }
+  constexpr static bool is_parameter_free = true;
+  static inline std::shared_ptr<DataType> type_singleton() { return int64(); }
 };
 
 template <>
@@ -96,6 +112,8 @@ struct TypeTraits<DateType> {
   // using BuilderType = DateBuilder;
   static inline int bytes_required(int elements) { return elements * sizeof(int64_t); }
+  constexpr static bool is_parameter_free = true;
+  static inline std::shared_ptr<DataType> type_singleton() { return date(); }
 };
 
 template <>
@@ -104,6 +122,7 @@ struct TypeTraits<TimestampType> {
   // using BuilderType = TimestampBuilder;
   static inline int bytes_required(int elements) { return elements * sizeof(int64_t); }
+  constexpr static bool is_parameter_free = false;
 };
 
 template <>
@@ -112,6 +131,8 @@ struct TypeTraits<HalfFloatType> {
   using BuilderType = HalfFloatBuilder;
   static inline int bytes_required(int elements) { return elements * sizeof(uint16_t); }
+  constexpr static bool is_parameter_free = true;
+  static inline std::shared_ptr<DataType> type_singleton() { return float16(); }
 };
 
 template <>
@@ -120,6 +141,8 @@ struct TypeTraits<FloatType> {
   using BuilderType = FloatBuilder;
   static inline int bytes_required(int elements) { return elements * sizeof(float); }
+  constexpr static bool is_parameter_free = true;
+  static inline std::shared_ptr<DataType> type_singleton() { return float32(); }
 };
 
 template <>
@@ -128,6 +151,8 @@ struct TypeTraits<DoubleType> {
   using BuilderType = DoubleBuilder;
   static inline int bytes_required(int elements) { return elements * sizeof(double); }
+  constexpr static bool is_parameter_free = true;
+  static inline std::shared_ptr<DataType> type_singleton() { return float64(); }
 };
 
 template <>
@@ -138,18 +163,24 @@ struct TypeTraits<BooleanType> {
   static inline int bytes_required(int elements) {
     return BitUtil::BytesForBits(elements);
   }
+  constexpr static bool is_parameter_free = true;
+  static inline std::shared_ptr<DataType> type_singleton() { return boolean(); }
 };
 
 template <>
 struct TypeTraits<StringType> {
   using ArrayType = StringArray;
   using BuilderType = StringBuilder;
+  constexpr static bool is_parameter_free = true;
+  static inline std::shared_ptr<DataType> type_singleton() { return utf8(); }
 };
 
 template <>
 struct TypeTraits<BinaryType> {
   using ArrayType = BinaryArray;
   using BuilderType = BinaryBuilder;
+  constexpr static bool is_parameter_free = true;
+  static inline std::shared_ptr<DataType> type_singleton() { return binary(); }
 };
 
 // Not all type classes have a c_type

From c45c3b3e11e328a6fdd50d7e1577eb3ba6ab9f93 Mon Sep 17 00:00:00 2001
From: Laurent Goujon
Date: Sat, 4 Feb 2017 18:34:40 -0500
Subject: [PATCH 0309/1644] ARROW-527: Remove drill-module.conf file

Remove drill-module.conf file as it is not used by the project.

Author: Laurent Goujon

Closes #318 from laurentgo/laurent/ARROW-527 and squashes the following commits:

7cd384d [Laurent Goujon] ARROW-527: Remove drill-module.conf file
---
 .../src/main/resources/drill-module.conf | 25 -------------------
 1 file changed, 25 deletions(-)
 delete mode 100644 java/memory/src/main/resources/drill-module.conf

diff --git a/java/memory/src/main/resources/drill-module.conf b/java/memory/src/main/resources/drill-module.conf
deleted file mode 100644
index 593ef8e41e76b..0000000000000
--- a/java/memory/src/main/resources/drill-module.conf
+++ /dev/null
@@ -1,25 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one or more
-// contributor license agreements.
See the NOTICE file distributed with -// this work for additional information regarding copyright ownership. -// The ASF licenses this file to You under the Apache License, Version 2.0 -// (the "License"); you may not use this file except in compliance with -// the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, software -// distributed under the License is distributed on an "AS IS" BASIS, -// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -// See the License for the specific language governing permissions and -// limitations under the License. - -// This file tells Drill to consider this module when class path scanning. -// This file can also include any supplementary configuration information. -// This file is in HOCON format, see https://github.com/typesafehub/config/blob/master/HOCON.md for more information. -drill: { - memory: { - debug.error_on_leak: true, - top.max: 1000000000000 - } - -} From 70c05be2130bdbb650a83bc46f7c4f8fc8a231df Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Sun, 5 Feb 2017 14:06:37 +0100 Subject: [PATCH 0310/1644] ARROW-524: provide apis to access nested vectors and buffers Author: Julien Le Dem Closes #314 from julienledem/setRangeToOne and squashes the following commits: 0d526bd [Julien Le Dem] ARROW-524: provide apis to access nested vectors and buffers --- .../templates/NullableValueVectors.java | 21 +++-- .../org/apache/arrow/vector/BitVector.java | 88 ++++++++++++++++++- .../apache/arrow/vector/NullableVector.java | 2 + .../apache/arrow/vector/TestValueVector.java | 36 ++++++++ 4 files changed, 137 insertions(+), 10 deletions(-) diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index ce637100cd8bf..6b25fb36b40c0 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -131,6 +131,11 @@ public final class ${className} extends BaseDataValueVector implements <#if type } + @Override + public BitVector getValidityVector() { + return bits; + } + @Override public List getFieldInnerVectors() { return innerVectors; @@ -426,7 +431,7 @@ public void copyFromSafe(int fromIndex, int thisIndex, ${valuesName} from){ mutator.fillEmpties(thisIndex); values.copyFromSafe(fromIndex, thisIndex, from); - bits.getMutator().setSafe(thisIndex, 1); + bits.getMutator().setSafeToOne(thisIndex); } public void copyFromSafe(int fromIndex, int thisIndex, ${className} from){ @@ -525,7 +530,7 @@ private Mutator(){ @Override public void setIndexDefined(int index){ - bits.getMutator().set(index, 1); + bits.getMutator().setToOne(index); } /** @@ -543,7 +548,7 @@ public void set(int index, <#if type.major == "VarLen">byte[]<#elseif (type.widt valuesMutator.set(i, emptyByteArray); } - bitsMutator.set(index, 1); + bitsMutator.setToOne(index); valuesMutator.set(index, value); <#if type.major == "VarLen">lastSet = index; } @@ -574,7 +579,7 @@ public void setSafe(int index, byte[] value, int start, int length) { <#else> fillEmpties(index); - bits.getMutator().setSafe(index, 1); + bits.getMutator().setSafeToOne(index); values.getMutator().setSafe(index, value, start, length); setCount++; <#if type.major == "VarLen">lastSet = index; @@ -587,7 +592,7 @@ public void setSafe(int index, ByteBuffer value, int start, int length) { <#else> fillEmpties(index); - 
bits.getMutator().setSafe(index, 1); + bits.getMutator().setSafeToOne(index); values.getMutator().setSafe(index, value, start, length); setCount++; <#if type.major == "VarLen">lastSet = index; @@ -626,7 +631,7 @@ public void set(int index, ${minor.class}Holder holder){ valuesMutator.set(i, emptyByteArray); } - bits.getMutator().set(index, 1); + bits.getMutator().setToOne(index); valuesMutator.set(index, holder); <#if type.major == "VarLen">lastSet = index; } @@ -676,7 +681,7 @@ public void setSafe(int index, ${minor.class}Holder value) { <#if type.major == "VarLen"> fillEmpties(index); - bits.getMutator().setSafe(index, 1); + bits.getMutator().setSafeToOne(index); values.getMutator().setSafe(index, value); setCount++; <#if type.major == "VarLen">lastSet = index; @@ -687,7 +692,7 @@ public void setSafe(int index, ${minor.javaType!type.javaType} value) { <#if type.major == "VarLen"> fillEmpties(index); - bits.getMutator().setSafe(index, 1); + bits.getMutator().setSafeToOne(index); values.getMutator().setSafe(index, value); setCount++; } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java index 9beabcbe46bcc..d1e9abe5dd111 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java @@ -423,8 +423,8 @@ private Mutator() { * value to set (either 1 or 0) */ public final void set(int index, int value) { - int byteIndex = index >> 3; - int bitIndex = index & 7; + int byteIndex = byteIndex(index); + int bitIndex = bitIndex(index); byte currentByte = data.getByte(byteIndex); byte bitMask = (byte) (1L << bitIndex); if (value != 0) { @@ -432,10 +432,87 @@ public final void set(int index, int value) { } else { currentByte -= (bitMask & currentByte); } + data.setByte(byteIndex, currentByte); + } + /** + * Set the bit at the given index to 1. 
+   *
+   * @param index position of the bit to set
+   */
+  public final void setToOne(int index) {
+    int byteIndex = byteIndex(index);
+    int bitIndex = bitIndex(index);
+    byte currentByte = data.getByte(byteIndex);
+    byte bitMask = (byte) (1L << bitIndex);
+    currentByte |= bitMask;
     data.setByte(byteIndex, currentByte);
   }
 
+  /**
+   * Set count bits to 1 in the data buffer, starting at firstBitIndex.
+   *
+   * @param firstBitIndex the index of the first bit to set
+   * @param count the number of bits to set
+   */
+  public void setRangeToOne(int firstBitIndex, int count) {
+    int startByteIndex = byteIndex(firstBitIndex);
+    final int lastBitIndex = firstBitIndex + count;
+    final int endByteIndex = byteIndex(lastBitIndex);
+    final int startByteBitIndex = bitIndex(firstBitIndex);
+    final int endByteBitIndex = bitIndex(lastBitIndex);
+    if (count < 8 && startByteIndex == endByteIndex) {
+      // handles the case where the whole range fits inside a single byte
+      byte bitMask = 0;
+      for (int i = startByteBitIndex; i < endByteBitIndex; ++i) {
+        bitMask |= (byte) (1L << i);
+      }
+      byte currentByte = data.getByte(startByteIndex);
+      currentByte |= bitMask;
+      data.setByte(startByteIndex, currentByte);
+    } else {
+      // fill in the first byte (if it's not full)
+      if (startByteBitIndex != 0) {
+        byte currentByte = data.getByte(startByteIndex);
+        final byte bitMask = (byte) (0xFFL << startByteBitIndex);
+        currentByte |= bitMask;
+        data.setByte(startByteIndex, currentByte);
+        ++startByteIndex;
+      }
+
+      // fill in one full byte at a time
+      for (int i = startByteIndex; i < endByteIndex; i++) {
+        data.setByte(i, 0xFF);
+      }
+
+      // fill in the last byte (if it's not full)
+      if (endByteBitIndex != 0) {
+        final int byteIndex = byteIndex(lastBitIndex - endByteBitIndex);
+        byte currentByte = data.getByte(byteIndex);
+        final byte bitMask = (byte) (0xFFL >>> ((8 - endByteBitIndex) & 7));
+        currentByte |= bitMask;
+        data.setByte(byteIndex, currentByte);
+      }
+
+    }
+  }
+
+  /**
+   * @param absoluteBitIndex the index of the bit in the buffer
+   * @return the index of the byte containing that bit
+   */
+  private int byteIndex(int absoluteBitIndex) {
+    return absoluteBitIndex >> 3;
+  }
+
+  /**
+   * @param absoluteBitIndex the index of the bit in the buffer
+   * @return the index of the bit inside the byte
+   */
+  private int bitIndex(int absoluteBitIndex) {
+    return absoluteBitIndex & 7;
+  }
+
   public final void set(int index, BitHolder holder) {
     set(index, holder.value);
   }
@@ -451,6 +528,13 @@ public void setSafe(int index, int value) {
     set(index, value);
   }
 
+  public void setSafeToOne(int index) {
+    while(index >= getValueCapacity()) {
+      reAlloc();
+    }
+    setToOne(index);
+  }
+
   public void setSafe(int index, BitHolder holder) {
     while(index >= getValueCapacity()) {
       reAlloc();
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/NullableVector.java b/java/vector/src/main/java/org/apache/arrow/vector/NullableVector.java
index 0212b3c0d7b95..b49e9167c2589 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/NullableVector.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/NullableVector.java
@@ -19,5 +19,7 @@
 public interface NullableVector extends ValueVector {
 
+  BitVector getValidityVector();
+
   ValueVector getValuesVector();
 }
diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java
index b33919b2790fc..774b59e3683e3 100644
--- a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java
+++
b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java @@ -30,6 +30,7 @@ import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.Field; import org.junit.After; +import org.junit.Assert; import org.junit.Before; import org.junit.Test; @@ -364,6 +365,41 @@ public void testBitVector() { } } + @Test + public void testBitVectorRangeSetAllOnes() { + validateRange(1000, 0, 1000); + validateRange(1000, 0, 1); + validateRange(1000, 1, 2); + validateRange(1000, 5, 6); + validateRange(1000, 5, 10); + validateRange(1000, 5, 150); + validateRange(1000, 5, 27); + for (int i = 0; i < 8; i++) { + for (int j = 0; j < 8; j++) { + validateRange(1000, 10 + i, 27 + j); + validateRange(1000, i, j); + } + } + } + + private void validateRange(int length, int start, int count) { + String desc = "[" + start + ", " + (start + count) + ") "; + try (BitVector bitVector = new BitVector("bits", allocator)) { + bitVector.reset(); + bitVector.allocateNew(length); + bitVector.getMutator().setRangeToOne(start, count); + for (int i = 0; i < start; i++) { + Assert.assertEquals(desc + i, 0, bitVector.getAccessor().get(i)); + } + for (int i = start; i < start + count; i++) { + Assert.assertEquals(desc + i, 1, bitVector.getAccessor().get(i)); + } + for (int i = start + count; i < length; i++) { + Assert.assertEquals(desc + i, 0, bitVector.getAccessor().get(i)); + } + } + } + @Test public void testReAllocNullableFixedWidthVector() { // Create a new value vector for 1024 integers From 5bee596caf6f26b0f10a2c384f025bbaab43e27e Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sun, 5 Feb 2017 15:27:39 -0500 Subject: [PATCH 0311/1644] ARROW-529: Python: Add jemalloc and Python 3.6 to manylinux1 build Author: Uwe L. Korn Closes #319 from xhochy/ARROW-529 and squashes the following commits: 48893a2 [Uwe L. Korn] ARROW-529: Python: Add jemalloc and Python 3.6 to manylinux1 build --- python/CMakeLists.txt | 2 +- .../manylinux1/Dockerfile-parquet_arrow-base-x86_64 | 2 +- python/manylinux1/Dockerfile-x86_64 | 11 ++++++++++- python/manylinux1/build_arrow.sh | 12 +++++++++--- 4 files changed, 21 insertions(+), 6 deletions(-) diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 898c48ee0e48d..842a2196dab62 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -352,7 +352,7 @@ set(PYARROW_MIN_TEST_LIBS pyarrow ${PYARROW_BASE_LIBS}) -if(NOT APPLE) +if(NOT APPLE AND PYARROW_BUILD_TESTS) ADD_THIRDPARTY_LIB(python SHARED_LIB "${PYTHON_LIBRARIES}") list(APPEND PYARROW_MIN_TEST_LIBS python) diff --git a/python/manylinux1/Dockerfile-parquet_arrow-base-x86_64 b/python/manylinux1/Dockerfile-parquet_arrow-base-x86_64 index 94f5bc0f3b66e..dcc9321c322b2 100644 --- a/python/manylinux1/Dockerfile-parquet_arrow-base-x86_64 +++ b/python/manylinux1/Dockerfile-parquet_arrow-base-x86_64 @@ -15,5 +15,5 @@ FROM arrow-base-x86_64 WORKDIR / RUN git clone https://github.com/apache/parquet-cpp.git WORKDIR /parquet-cpp -RUN ARROW_HOME=/usr /opt/python/cp35-cp35m/bin/cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr -DPARQUET_BUILD_TESTS=OFF -DPARQUET_ARROW=ON . +RUN ARROW_HOME=/usr cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr -DPARQUET_BUILD_TESTS=OFF -DPARQUET_ARROW=ON . 
RUN make -j5 install diff --git a/python/manylinux1/Dockerfile-x86_64 b/python/manylinux1/Dockerfile-x86_64 index 29e00b0ccbe49..059158856f1f2 100644 --- a/python/manylinux1/Dockerfile-x86_64 +++ b/python/manylinux1/Dockerfile-x86_64 @@ -22,9 +22,18 @@ WORKDIR /boost_1_60_0 RUN ./bootstrap.sh RUN ./bjam cxxflags=-fPIC cflags=-fPIC --prefix=/usr --with-filesystem --with-date_time --with-system install +WORKDIR / +RUN wget https://github.com/jemalloc/jemalloc/releases/download/4.4.0/jemalloc-4.4.0.tar.bz2 -O jemalloc-4.4.0.tar.bz2 +RUN tar xf jemalloc-4.4.0.tar.bz2 +WORKDIR /jemalloc-4.4.0 +RUN ./configure +RUN make -j5 +RUN make install + WORKDIR / # Install cmake manylinux1 package RUN /opt/python/cp35-cp35m/bin/pip install cmake +RUN ln -s /opt/python/cp35-cp35m/bin/cmake /usr/bin/cmake WORKDIR / RUN git clone https://github.com/matthew-brett/multibuild.git @@ -34,5 +43,5 @@ RUN git checkout ffe59955ad8690c2f8bb74766cb7e9b0d0ee3963 ADD arrow /arrow WORKDIR /arrow/cpp -RUN /opt/python/cp35-cp35m/bin/cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr -DARROW_HDFS=ON -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=ON -DARROW_BOOST_USE_SHARED=OFF -DARROW_JEMALLOC=ON . +RUN cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr -DARROW_HDFS=ON -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=ON -DARROW_BOOST_USE_SHARED=OFF -DARROW_JEMALLOC=ON . RUN make -j5 install diff --git a/python/manylinux1/build_arrow.sh b/python/manylinux1/build_arrow.sh index 7e2ad58617793..cce5cd2b4d412 100755 --- a/python/manylinux1/build_arrow.sh +++ b/python/manylinux1/build_arrow.sh @@ -20,7 +20,7 @@ # Build upon the scripts in https://github.com/matthew-brett/manylinux-builds # * Copyright (c) 2013-2016, Matt Terry and Matthew Brett (BSD 2-clause) -PYTHON_VERSIONS="${PYTHON_VERSIONS:-2.7 3.4 3.5}" +PYTHON_VERSIONS="${PYTHON_VERSIONS:-2.7 3.4 3.5 3.6}" # Package index with only manylinux1 builds MANYLINUX_URL=https://nipy.bic.berkeley.edu/manylinux @@ -29,9 +29,10 @@ source /multibuild/manylinux_utils.sh cd /arrow/python +export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/lib" # PyArrow build configuration export PYARROW_BUILD_TYPE='release' -export PYARROW_CMAKE_OPTIONS='-DPYARROW_BUILD_PARQUET=ON' +export PYARROW_CMAKE_OPTIONS='-DPYARROW_BUILD_TESTS=ON' # Need as otherwise arrow_io is sometimes not linked export LDFLAGS="-Wl,--no-as-needed" export ARROW_HOME="/usr" @@ -69,10 +70,15 @@ for PYTHON in ${PYTHON_VERSIONS}; do $PIPI_IO "numpy==1.9.0" $PIPI_IO "cython==0.24" - $PIPI_IO "cmake" + PATH="$PATH:$(cpython_path $PYTHON)/bin" $PYTHON_INTERPRETER setup.py build_ext --inplace --with-parquet --with-jemalloc PATH="$PATH:$(cpython_path $PYTHON)/bin" $PYTHON_INTERPRETER setup.py bdist_wheel + # Test for optional modules + $PIPI_IO -r requirements.txt + PATH="$PATH:$(cpython_path $PYTHON)/bin" $PYTHON_INTERPRETER -c "import pyarrow.parquet" + PATH="$PATH:$(cpython_path $PYTHON)/bin" $PYTHON_INTERPRETER -c "import pyarrow.jemalloc" + repair_wheelhouse dist /io/dist done From 74bc4dd480d6153cf1fb5d6fb7cdbb22d1e6e5d9 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sun, 5 Feb 2017 15:29:11 -0500 Subject: [PATCH 0312/1644] ARROW-511: Python: Implement List conversions for single arrays Author: Uwe L. Korn Closes #320 from xhochy/ARROW-511 and squashes the following commits: 2ff63f9 [Uwe L. Korn] Use _check_pandas_roundtrip 6c8fa6d [Uwe L. 
Korn] Python: Implement List conversions for single arrays --- python/pyarrow/tests/test_convert_pandas.py | 7 ++++- python/src/pyarrow/adapters/pandas.cc | 31 +++++++++++++++++++++ 2 files changed, 37 insertions(+), 1 deletion(-) diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index ddbb02a770c35..f04fbe5b139e4 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -22,6 +22,7 @@ import unittest import numpy as np +import numpy.testing as npt import pandas as pd import pandas.util.testing as tm @@ -80,7 +81,7 @@ def _check_array_roundtrip(self, values, expected=None, arr = A.Array.from_pandas(values, timestamps_to_ms=timestamps_to_ms, field=field) result = arr.to_pandas() - tm.assert_series_equal(pd.Series(result), pd.Series(values)) + tm.assert_series_equal(pd.Series(result), pd.Series(values), check_names=False) def test_float_no_nulls(self): data = {} @@ -332,6 +333,10 @@ def test_column_of_lists(self): table = A.Table.from_pandas(df, schema=schema) assert table.schema.equals(schema) + for column in df.columns: + field = schema.field_by_name(column) + self._check_array_roundtrip(df[column], field=field) + def test_threaded_conversion(self): df = _alltypes_example() self._check_pandas_roundtrip(df, nthreads=2, diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index 920779fe86174..8d05821c2fd08 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -1817,6 +1817,7 @@ class ArrowDeserializer { CONVERT_CASE(DATE); CONVERT_CASE(TIMESTAMP); CONVERT_CASE(DICTIONARY); + CONVERT_CASE(LIST); default: { std::stringstream ss; ss << "Arrow type reading not implemented for " << col_->type()->ToString(); @@ -1914,6 +1915,36 @@ class ArrowDeserializer { return ConvertBinaryLike(data_, out_values); } +#define CONVERTVALUES_LISTSLIKE_CASE(ArrowType, ArrowEnum) \ + case Type::ArrowEnum: \ + return ConvertListsLike<::arrow::ArrowType>(col_, out_values); + + template + inline typename std::enable_if::type ConvertValues() { + RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); + auto out_values = reinterpret_cast(PyArray_DATA(arr_)); + auto list_type = std::static_pointer_cast(col_->type()); + switch (list_type->value_type()->type) { + CONVERTVALUES_LISTSLIKE_CASE(UInt8Type, UINT8) + CONVERTVALUES_LISTSLIKE_CASE(Int8Type, INT8) + CONVERTVALUES_LISTSLIKE_CASE(UInt16Type, UINT16) + CONVERTVALUES_LISTSLIKE_CASE(Int16Type, INT16) + CONVERTVALUES_LISTSLIKE_CASE(UInt32Type, UINT32) + CONVERTVALUES_LISTSLIKE_CASE(Int32Type, INT32) + CONVERTVALUES_LISTSLIKE_CASE(UInt64Type, UINT64) + CONVERTVALUES_LISTSLIKE_CASE(Int64Type, INT64) + CONVERTVALUES_LISTSLIKE_CASE(TimestampType, TIMESTAMP) + CONVERTVALUES_LISTSLIKE_CASE(FloatType, FLOAT) + CONVERTVALUES_LISTSLIKE_CASE(DoubleType, DOUBLE) + CONVERTVALUES_LISTSLIKE_CASE(StringType, STRING) + default: { + std::stringstream ss; + ss << "Not implemented type for lists: " << list_type->value_type()->ToString(); + return Status::NotImplemented(ss.str()); + } + } + } + template inline typename std::enable_if::type ConvertValues() { std::shared_ptr block; From 5439b71586f4b0f9a36544b9e2417ee6ad7b48e8 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 6 Feb 2017 11:25:18 -0500 Subject: [PATCH 0313/1644] ARROW-33: [C++] Implement zero-copy array slicing, integrate with IPC code paths This turned into a bit of a refactoring bloodbath. 
I have sorted through most of the issues that this turned up, so I should have this all completely working within a day or so. There will be some follow up work to do to polish things up Closes #56. Author: Wes McKinney Closes #322 from wesm/ARROW-33 and squashes the following commits: 61afe42 [Wes McKinney] Some API cleaning in builder.h 86511a3 [Wes McKinney] Python fixes, clang warning fixes 9a00870 [Wes McKinney] Make ApproxEquals for floating point arrays work on slices 2a13929 [Wes McKinney] Implement slicing IPC logic for dense array 4f08628 [Wes McKinney] Add missing include 1a6fcb4 [Wes McKinney] Make some more progress. dense union needs more work c6d814d [Wes McKinney] Work on adding sliced array support to IPC code path, with pretty printer and comparison fixed for sliced bitmaps, etc. Not all working yet b6c511e [Wes McKinney] Add RecordBatch::Slice convenience method 8900d58 [Wes McKinney] Add Slice tests for DictionaryArray. Test recomputing the null count 55454d7 [Wes McKinney] Add slice tests for struct, union, string, list a72653d [Wes McKinney] Rename offsets to value_offsets in list/binary/string/union for better clarity. Test Slice for primitive arrays 0355f71 [Wes McKinney] Implement CopyBitmap function a228b50 [Wes McKinney] Implement Slice methods on Array classes e502901 [Wes McKinney] Move null_count and offset as last two parameters of all array ctors. Implement/test bitmap set bit count with offset bae6922 [Wes McKinney] Temporary work on adding offset parameter to Array classes for slicing --- cpp/src/arrow/CMakeLists.txt | 1 + cpp/src/arrow/array-dictionary-test.cc | 62 +++-- cpp/src/arrow/array-list-test.cc | 36 ++- cpp/src/arrow/array-primitive-test.cc | 78 +++++- cpp/src/arrow/array-string-test.cc | 90 +++++-- cpp/src/arrow/array-struct-test.cc | 19 +- cpp/src/arrow/array-test.cc | 32 ++- cpp/src/arrow/array-union-test.cc | 67 ++++++ cpp/src/arrow/array.cc | 233 ++++++++++++------ cpp/src/arrow/array.h | 265 +++++++++++++-------- cpp/src/arrow/buffer.cc | 16 ++ cpp/src/arrow/buffer.h | 21 +- cpp/src/arrow/builder.cc | 64 ++--- cpp/src/arrow/builder.h | 21 +- cpp/src/arrow/column-test.cc | 14 +- cpp/src/arrow/compare.cc | 122 +++++++--- cpp/src/arrow/io/file.cc | 4 +- cpp/src/arrow/io/hdfs.cc | 8 +- cpp/src/arrow/io/io-hdfs-test.cc | 10 +- cpp/src/arrow/io/io-memory-test.cc | 4 +- cpp/src/arrow/ipc/adapter.cc | 260 ++++++++++++++++---- cpp/src/arrow/ipc/adapter.h | 8 +- cpp/src/arrow/ipc/ipc-adapter-test.cc | 52 +++- cpp/src/arrow/ipc/ipc-json-test.cc | 21 +- cpp/src/arrow/ipc/json-integration-test.cc | 6 +- cpp/src/arrow/ipc/json-internal.cc | 37 +-- cpp/src/arrow/ipc/stream.cc | 15 +- cpp/src/arrow/ipc/stream.h | 8 + cpp/src/arrow/ipc/test-common.h | 79 +++--- cpp/src/arrow/pretty_print-test.cc | 6 +- cpp/src/arrow/pretty_print.cc | 53 +++-- cpp/src/arrow/table-test.cc | 26 ++ cpp/src/arrow/table.cc | 19 +- cpp/src/arrow/table.h | 4 + cpp/src/arrow/test-util.h | 43 +--- cpp/src/arrow/type.cc | 6 +- cpp/src/arrow/type.h | 8 +- cpp/src/arrow/type_traits.h | 9 + cpp/src/arrow/util/bit-util-test.cc | 62 ++++- cpp/src/arrow/util/bit-util.cc | 83 ++++++- cpp/src/arrow/util/bit-util.h | 45 ++++ cpp/src/arrow/util/logging.h | 4 +- cpp/src/arrow/util/macros.h | 2 +- python/CMakeLists.txt | 2 +- python/pyarrow/includes/libarrow.pxd | 4 +- python/pyarrow/scalar.pyx | 2 +- python/src/pyarrow/adapters/builtin.cc | 2 +- python/src/pyarrow/adapters/pandas.cc | 20 +- python/src/pyarrow/io.cc | 21 +- 49 files changed, 1524 insertions(+), 550 deletions(-) create mode 100644 
cpp/src/arrow/array-union-test.cc diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index b002bb75ca934..824ced1a51eb9 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -53,6 +53,7 @@ ADD_ARROW_TEST(array-list-test) ADD_ARROW_TEST(array-primitive-test) ADD_ARROW_TEST(array-string-test) ADD_ARROW_TEST(array-struct-test) +ADD_ARROW_TEST(array-union-test) ADD_ARROW_TEST(buffer-test) ADD_ARROW_TEST(column-test) ADD_ARROW_TEST(memory_pool-test) diff --git a/cpp/src/arrow/array-dictionary-test.cc b/cpp/src/arrow/array-dictionary-test.cc index 1a0d49a118f50..61381b7671180 100644 --- a/cpp/src/arrow/array-dictionary-test.cc +++ b/cpp/src/arrow/array-dictionary-test.cc @@ -34,7 +34,7 @@ namespace arrow { TEST(TestDictionary, Basics) { std::vector values = {100, 1000, 10000, 100000}; std::shared_ptr dict; - ArrayFromVector(int32(), values, &dict); + ArrayFromVector(values, &dict); std::shared_ptr type1 = std::dynamic_pointer_cast(dictionary(int16(), dict)); @@ -54,45 +54,67 @@ TEST(TestDictionary, Equals) { std::shared_ptr dict; std::vector dict_values = {"foo", "bar", "baz"}; - ArrayFromVector(utf8(), dict_values, &dict); + ArrayFromVector(dict_values, &dict); std::shared_ptr dict_type = dictionary(int16(), dict); std::shared_ptr dict2; std::vector dict2_values = {"foo", "bar", "baz", "qux"}; - ArrayFromVector(utf8(), dict2_values, &dict2); + ArrayFromVector(dict2_values, &dict2); std::shared_ptr dict2_type = dictionary(int16(), dict2); std::shared_ptr indices; std::vector indices_values = {1, 2, -1, 0, 2, 0}; - ArrayFromVector(int16(), is_valid, indices_values, &indices); + ArrayFromVector(is_valid, indices_values, &indices); std::shared_ptr indices2; std::vector indices2_values = {1, 2, 0, 0, 2, 0}; - ArrayFromVector(int16(), is_valid, indices2_values, &indices2); + ArrayFromVector(is_valid, indices2_values, &indices2); std::shared_ptr indices3; std::vector indices3_values = {1, 1, 0, 0, 2, 0}; - ArrayFromVector(int16(), is_valid, indices3_values, &indices3); + ArrayFromVector(is_valid, indices3_values, &indices3); - auto arr = std::make_shared(dict_type, indices); - auto arr2 = std::make_shared(dict_type, indices2); - auto arr3 = std::make_shared(dict2_type, indices); - auto arr4 = std::make_shared(dict_type, indices3); + auto array = std::make_shared(dict_type, indices); + auto array2 = std::make_shared(dict_type, indices2); + auto array3 = std::make_shared(dict2_type, indices); + auto array4 = std::make_shared(dict_type, indices3); - ASSERT_TRUE(arr->Equals(arr)); + ASSERT_TRUE(array->Equals(array)); // Equal, because the unequal index is masked by null - ASSERT_TRUE(arr->Equals(arr2)); + ASSERT_TRUE(array->Equals(array2)); // Unequal dictionaries - ASSERT_FALSE(arr->Equals(arr3)); + ASSERT_FALSE(array->Equals(array3)); // Unequal indices - ASSERT_FALSE(arr->Equals(arr4)); + ASSERT_FALSE(array->Equals(array4)); // RangeEquals - ASSERT_TRUE(arr->RangeEquals(3, 6, 3, arr4)); - ASSERT_FALSE(arr->RangeEquals(1, 3, 1, arr4)); + ASSERT_TRUE(array->RangeEquals(3, 6, 3, array4)); + ASSERT_FALSE(array->RangeEquals(1, 3, 1, array4)); + + // ARROW-33 Test slices + const int size = array->length(); + + std::shared_ptr slice, slice2; + slice = array->Array::Slice(2); + slice2 = array->Array::Slice(2); + ASSERT_EQ(size - 2, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(2, array->length(), 0, slice)); + + // Chained slices + slice2 = array->Array::Slice(1)->Array::Slice(1); + ASSERT_TRUE(slice->Equals(slice2)); 
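// Slice(offset) yields a view from `offset` to the end of the array, while
// Slice(offset, length) yields a window of `length` elements; both are
// zero-copy views that share the parent array's buffers rather than copying.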
+ + slice = array->Slice(1, 3); + slice2 = array->Slice(1, 3); + ASSERT_EQ(3, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(1, 4, 0, slice)); } TEST(TestDictionary, Validate) { @@ -100,20 +122,20 @@ TEST(TestDictionary, Validate) { std::shared_ptr dict; std::vector dict_values = {"foo", "bar", "baz"}; - ArrayFromVector(utf8(), dict_values, &dict); + ArrayFromVector(dict_values, &dict); std::shared_ptr dict_type = dictionary(int16(), dict); std::shared_ptr indices; std::vector indices_values = {1, 2, 0, 0, 2, 0}; - ArrayFromVector(uint8(), is_valid, indices_values, &indices); + ArrayFromVector(is_valid, indices_values, &indices); std::shared_ptr indices2; std::vector indices2_values = {1., 2., 0., 0., 2., 0.}; - ArrayFromVector(float32(), is_valid, indices2_values, &indices2); + ArrayFromVector(is_valid, indices2_values, &indices2); std::shared_ptr indices3; std::vector indices3_values = {1, 2, 0, 0, 2, 0}; - ArrayFromVector(int64(), is_valid, indices3_values, &indices3); + ArrayFromVector(is_valid, indices3_values, &indices3); std::shared_ptr arr = std::make_shared(dict_type, indices); std::shared_ptr arr2 = std::make_shared(dict_type, indices2); diff --git a/cpp/src/arrow/array-list-test.cc b/cpp/src/arrow/array-list-test.cc index 8e4d319f5dca8..a144fd937d7a0 100644 --- a/cpp/src/arrow/array-list-test.cc +++ b/cpp/src/arrow/array-list-test.cc @@ -90,9 +90,9 @@ TEST_F(TestListBuilder, Equality) { Int32Builder* vb = static_cast(builder_->value_builder().get()); std::shared_ptr array, equal_array, unequal_array; - vector equal_offsets = {0, 1, 2, 5}; - vector equal_values = {1, 2, 3, 4, 5, 2, 2, 2}; - vector unequal_offsets = {0, 1, 4}; + vector equal_offsets = {0, 1, 2, 5, 6, 7, 8, 10}; + vector equal_values = {1, 2, 3, 4, 5, 2, 2, 2, 5, 6}; + vector unequal_offsets = {0, 1, 4, 7}; vector unequal_values = {1, 2, 2, 2, 3, 4, 5}; // setup two equal arrays @@ -122,7 +122,27 @@ TEST_F(TestListBuilder, Equality) { EXPECT_FALSE(array->RangeEquals(0, 2, 0, unequal_array)); EXPECT_FALSE(array->RangeEquals(1, 2, 1, unequal_array)); EXPECT_TRUE(array->RangeEquals(2, 3, 2, unequal_array)); - EXPECT_TRUE(array->RangeEquals(3, 4, 1, unequal_array)); + + // Check with slices, ARROW-33 + std::shared_ptr slice, slice2; + + slice = array->Slice(2); + slice2 = array->Slice(2); + ASSERT_EQ(array->length() - 2, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(2, slice->length(), 0, slice)); + + // Chained slices + slice2 = array->Slice(1)->Slice(1); + ASSERT_TRUE(slice->Equals(slice2)); + + slice = array->Slice(1, 4); + slice2 = array->Slice(1, 4); + ASSERT_EQ(4, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(1, 5, 0, slice)); } TEST_F(TestListBuilder, TestResize) {} @@ -137,9 +157,9 @@ TEST_F(TestListBuilder, TestAppendNull) { ASSERT_TRUE(result_->IsNull(0)); ASSERT_TRUE(result_->IsNull(1)); - ASSERT_EQ(0, result_->raw_offsets()[0]); - ASSERT_EQ(0, result_->offset(1)); - ASSERT_EQ(0, result_->offset(2)); + ASSERT_EQ(0, result_->raw_value_offsets()[0]); + ASSERT_EQ(0, result_->value_offset(1)); + ASSERT_EQ(0, result_->value_offset(2)); Int32Array* values = static_cast(result_->values().get()); ASSERT_EQ(0, values->length()); @@ -154,7 +174,7 @@ void ValidateBasicListArray(const ListArray* result, const vector& valu ASSERT_EQ(3, result->length()); vector ex_offsets = {0, 3, 3, 7}; for (size_t i = 0; i < ex_offsets.size(); ++i) { - ASSERT_EQ(ex_offsets[i], result->offset(i)); + 
ASSERT_EQ(ex_offsets[i], result->value_offset(i)); } for (int i = 0; i < result->length(); ++i) { diff --git a/cpp/src/arrow/array-primitive-test.cc b/cpp/src/arrow/array-primitive-test.cc index c839fb9b19234..a20fdbf8b9166 100644 --- a/cpp/src/arrow/array-primitive-test.cc +++ b/cpp/src/arrow/array-primitive-test.cc @@ -121,7 +121,7 @@ class TestPrimitiveBuilder : public TestBuilder { } auto expected = - std::make_shared(size, ex_data, ex_null_count, ex_null_bitmap); + std::make_shared(size, ex_data, ex_null_bitmap, ex_null_count); std::shared_ptr out; ASSERT_OK(builder->Finish(&out)); @@ -217,7 +217,7 @@ void TestPrimitiveBuilder::Check( } auto expected = - std::make_shared(size, ex_data, ex_null_count, ex_null_bitmap); + std::make_shared(size, ex_data, ex_null_bitmap, ex_null_count); std::shared_ptr out; ASSERT_OK(builder->Finish(&out)); @@ -235,15 +235,14 @@ void TestPrimitiveBuilder::Check( for (int i = 0; i < result->length(); ++i) { if (nullable) { ASSERT_EQ(valid_bytes_[i] == 0, result->IsNull(i)) << i; } - bool actual = BitUtil::GetBit(result->raw_data(), i); + bool actual = BitUtil::GetBit(result->data()->data(), i); ASSERT_EQ(static_cast(draws_[i]), actual) << i; } ASSERT_TRUE(result->Equals(*expected)); } typedef ::testing::Types - Primitives; + PInt32, PInt64, PFloat, PDouble> Primitives; TYPED_TEST_CASE(TestPrimitiveBuilder, Primitives); @@ -347,6 +346,39 @@ TYPED_TEST(TestPrimitiveBuilder, Equality) { array->RangeEquals(first_valid_idx + 1, size, first_valid_idx + 1, unequal_array)); } +TYPED_TEST(TestPrimitiveBuilder, SliceEquality) { + DECL_T(); + + const int size = 1000; + this->RandomData(size); + vector& draws = this->draws_; + vector& valid_bytes = this->valid_bytes_; + auto builder = this->builder_.get(); + + std::shared_ptr array; + ASSERT_OK(MakeArray(valid_bytes, draws, size, builder, &array)); + + std::shared_ptr slice, slice2; + + slice = array->Slice(5); + slice2 = array->Slice(5); + ASSERT_EQ(size - 5, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(5, array->length(), 0, slice)); + + // Chained slices + slice2 = array->Slice(2)->Slice(3); + ASSERT_TRUE(slice->Equals(slice2)); + + slice = array->Slice(5, 10); + slice2 = array->Slice(5, 10); + ASSERT_EQ(10, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(5, 15, 0, slice)); +} + TYPED_TEST(TestPrimitiveBuilder, TestAppendScalar) { DECL_T(); @@ -473,4 +505,40 @@ TYPED_TEST(TestPrimitiveBuilder, TestReserve) { ASSERT_EQ(BitUtil::NextPower2(kMinBuilderCapacity + 100), this->builder_->capacity()); } +template +void CheckSliceApproxEquals() { + using T = typename TYPE::c_type; + + const int kSize = 50; + std::vector draws1; + std::vector draws2; + + const uint32_t kSeed = 0; + test::random_real(kSize, kSeed, 0, 100, &draws1); + test::random_real(kSize, kSeed + 1, 0, 100, &draws2); + + // Make the draws equal in the sliced segment, but unequal elsewhere (to + // catch not using the slice offset) + for (int i = 10; i < 30; ++i) { + draws2[i] = draws1[i]; + } + + std::vector is_valid; + test::random_is_valid(kSize, 0.1, &is_valid); + + std::shared_ptr array1, array2; + ArrayFromVector(is_valid, draws1, &array1); + ArrayFromVector(is_valid, draws2, &array2); + + std::shared_ptr slice1 = array1->Slice(10, 20); + std::shared_ptr slice2 = array2->Slice(10, 20); + + ASSERT_TRUE(slice1->ApproxEquals(slice2)); +} + +TEST(TestPrimitiveAdHoc, FloatingSliceApproxEquals) { + CheckSliceApproxEquals(); + CheckSliceApproxEquals(); +} + } // namespace 
arrow diff --git a/cpp/src/arrow/array-string-test.cc b/cpp/src/arrow/array-string-test.cc index 5ea384acb1c57..8b7eb41d4c3b9 100644 --- a/cpp/src/arrow/array-string-test.cc +++ b/cpp/src/arrow/array-string-test.cc @@ -27,6 +27,7 @@ #include "arrow/builder.h" #include "arrow/test-util.h" #include "arrow/type.h" +#include "arrow/type_traits.h" namespace arrow { @@ -70,7 +71,7 @@ class TestStringArray : public ::testing::Test { null_count_ = test::null_count(valid_bytes_); strings_ = std::make_shared( - length_, offsets_buf_, value_buf_, null_count_, null_bitmap_); + length_, offsets_buf_, value_buf_, null_bitmap_, null_count_); } protected: @@ -114,7 +115,7 @@ TEST_F(TestStringArray, TestListFunctions) { TEST_F(TestStringArray, TestDestructor) { auto arr = std::make_shared( - length_, offsets_buf_, value_buf_, null_count_, null_bitmap_); + length_, offsets_buf_, value_buf_, null_bitmap_, null_count_); } TEST_F(TestStringArray, TestGetString) { @@ -133,9 +134,9 @@ TEST_F(TestStringArray, TestEmptyStringComparison) { length_ = offsets_.size() - 1; auto strings_a = std::make_shared( - length_, offsets_buf_, nullptr, null_count_, null_bitmap_); + length_, offsets_buf_, nullptr, null_bitmap_, null_count_); auto strings_b = std::make_shared( - length_, offsets_buf_, nullptr, null_count_, null_bitmap_); + length_, offsets_buf_, nullptr, null_bitmap_, null_count_); ASSERT_TRUE(strings_a->Equals(strings_b)); } @@ -146,8 +147,7 @@ class TestStringBuilder : public TestBuilder { public: void SetUp() { TestBuilder::SetUp(); - type_ = TypePtr(new StringType()); - builder_.reset(new StringBuilder(pool_, type_)); + builder_.reset(new StringBuilder(pool_)); } void Done() { @@ -159,8 +159,6 @@ class TestStringBuilder : public TestBuilder { } protected: - TypePtr type_; - std::unique_ptr builder_; std::shared_ptr result_; }; @@ -195,7 +193,7 @@ TEST_F(TestStringBuilder, TestScalarAppend) { } else { ASSERT_FALSE(result_->IsNull(i)); result_->GetValue(i, &length); - ASSERT_EQ(pos, result_->offset(i)); + ASSERT_EQ(pos, result_->value_offset(i)); ASSERT_EQ(static_cast(strings[i % N].size()), length); ASSERT_EQ(strings[i % N], result_->GetString(i)); @@ -232,7 +230,7 @@ class TestBinaryArray : public ::testing::Test { null_count_ = test::null_count(valid_bytes_); strings_ = std::make_shared( - length_, offsets_buf_, value_buf_, null_count_, null_bitmap_); + length_, offsets_buf_, value_buf_, null_bitmap_, null_count_); } protected: @@ -276,7 +274,7 @@ TEST_F(TestBinaryArray, TestListFunctions) { TEST_F(TestBinaryArray, TestDestructor) { auto arr = std::make_shared( - length_, offsets_buf_, value_buf_, null_count_, null_bitmap_); + length_, offsets_buf_, value_buf_, null_bitmap_, null_count_); } TEST_F(TestBinaryArray, TestGetValue) { @@ -306,8 +304,8 @@ TEST_F(TestBinaryArray, TestEqualsEmptyStrings) { ASSERT_OK(builder.Finish(&left_arr)); const BinaryArray& left = static_cast(*left_arr); - std::shared_ptr right = std::make_shared( - left.length(), left.offsets(), nullptr, left.null_count(), left.null_bitmap()); + std::shared_ptr right = std::make_shared(left.length(), + left.value_offsets(), nullptr, left.null_bitmap(), left.null_count()); ASSERT_TRUE(left.Equals(right)); ASSERT_TRUE(left.RangeEquals(0, left.length(), 0, right)); @@ -317,8 +315,7 @@ class TestBinaryBuilder : public TestBuilder { public: void SetUp() { TestBuilder::SetUp(); - type_ = TypePtr(new BinaryType()); - builder_.reset(new BinaryBuilder(pool_, type_)); + builder_.reset(new BinaryBuilder(pool_)); } void Done() { @@ -330,8 +327,6 @@ class 
TestBinaryBuilder : public TestBuilder { } protected: - TypePtr type_; - std::unique_ptr builder_; std::shared_ptr result_; }; @@ -348,8 +343,7 @@ TEST_F(TestBinaryBuilder, TestScalarAppend) { if (is_null[i]) { builder_->AppendNull(); } else { - builder_->Append( - reinterpret_cast(strings[i].data()), strings[i].size()); + builder_->Append(strings[i]); } } } @@ -377,4 +371,62 @@ TEST_F(TestBinaryBuilder, TestZeroLength) { Done(); } +// ---------------------------------------------------------------------- +// Slice tests + +template +void CheckSliceEquality() { + using Traits = TypeTraits; + using BuilderType = typename Traits::BuilderType; + + BuilderType builder(default_memory_pool()); + + std::vector strings = {"foo", "", "bar", "baz", "qux", ""}; + std::vector is_null = {0, 1, 0, 1, 0, 0}; + + int N = strings.size(); + int reps = 10; + + for (int j = 0; j < reps; ++j) { + for (int i = 0; i < N; ++i) { + if (is_null[i]) { + builder.AppendNull(); + } else { + builder.Append(strings[i]); + } + } + } + + std::shared_ptr array; + ASSERT_OK(builder.Finish(&array)); + + std::shared_ptr slice, slice2; + + slice = array->Slice(5); + slice2 = array->Slice(5); + ASSERT_EQ(N * reps - 5, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(5, slice->length(), 0, slice)); + + // Chained slices + slice2 = array->Slice(2)->Slice(3); + ASSERT_TRUE(slice->Equals(slice2)); + + slice = array->Slice(5, 20); + slice2 = array->Slice(5, 20); + ASSERT_EQ(20, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(5, 25, 0, slice)); +} + +TEST_F(TestBinaryArray, TestSliceEquality) { + CheckSliceEquality(); +} + +TEST_F(TestStringArray, TestSliceEquality) { + CheckSliceEquality(); +} + } // namespace arrow diff --git a/cpp/src/arrow/array-struct-test.cc b/cpp/src/arrow/array-struct-test.cc index 5827c399dda17..f4e7409a6232a 100644 --- a/cpp/src/arrow/array-struct-test.cc +++ b/cpp/src/arrow/array-struct-test.cc @@ -75,7 +75,7 @@ void ValidateBasicStructArray(const StructArray* result, ASSERT_EQ(4, list_char_arr->length()); ASSERT_EQ(10, list_char_arr->values()->length()); for (size_t i = 0; i < list_offsets.size(); ++i) { - ASSERT_EQ(list_offsets[i], list_char_arr->raw_offsets()[i]); + ASSERT_EQ(list_offsets[i], list_char_arr->raw_value_offsets()[i]); } for (size_t i = 0; i < list_values.size(); ++i) { ASSERT_EQ(list_values[i], char_arr->Value(i)); @@ -381,6 +381,23 @@ TEST_F(TestStructBuilder, TestEquality) { EXPECT_FALSE(array->RangeEquals(0, 1, 0, unequal_values_array)); EXPECT_TRUE(array->RangeEquals(1, 3, 1, unequal_values_array)); EXPECT_FALSE(array->RangeEquals(3, 4, 3, unequal_values_array)); + + // ARROW-33 Slice / equality + std::shared_ptr slice, slice2; + + slice = array->Slice(2); + slice2 = array->Slice(2); + ASSERT_EQ(array->length() - 2, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(2, slice->length(), 0, slice)); + + slice = array->Slice(1, 2); + slice2 = array->Slice(1, 2); + ASSERT_EQ(2, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(1, 3, 0, slice)); } TEST_F(TestStructBuilder, TestZeroLength) { diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc index a1d8fdfa91e85..45130d8f64004 100644 --- a/cpp/src/arrow/array-test.cc +++ b/cpp/src/arrow/array-test.cc @@ -43,7 +43,7 @@ TEST_F(TestArray, TestNullCount) { auto data = std::make_shared(pool_); auto null_bitmap = std::make_shared(pool_); - std::unique_ptr arr(new 
Int32Array(100, data, 10, null_bitmap)); + std::unique_ptr arr(new Int32Array(100, data, null_bitmap, 10)); ASSERT_EQ(10, arr->null_count()); std::unique_ptr arr_no_nulls(new Int32Array(100, data)); @@ -67,7 +67,7 @@ std::shared_ptr MakeArrayFromValidBytes( } std::shared_ptr arr( - new Int32Array(v.size(), value_builder.Finish(), null_count, null_buf)); + new Int32Array(v.size(), value_builder.Finish(), null_buf, null_count)); return arr; } @@ -87,6 +87,32 @@ TEST_F(TestArray, TestEquality) { EXPECT_FALSE(array->RangeEquals(1, 2, 1, unequal_array)); } +TEST_F(TestArray, SliceRecomputeNullCount) { + std::vector valid_bytes = {1, 0, 1, 1, 0, 1, 0, 0}; + + auto array = MakeArrayFromValidBytes(valid_bytes, pool_); + + ASSERT_EQ(4, array->null_count()); + + auto slice = array->Slice(1, 4); + ASSERT_EQ(2, slice->null_count()); + + slice = array->Slice(4); + ASSERT_EQ(1, slice->null_count()); + + slice = array->Slice(0); + ASSERT_EQ(4, slice->null_count()); + + // No bitmap, compute 0 + std::shared_ptr data; + const int kBufferSize = 64; + ASSERT_OK(AllocateBuffer(pool_, kBufferSize, &data)); + memset(data->mutable_data(), 0, kBufferSize); + + auto arr = std::make_shared(16, data, nullptr, -1); + ASSERT_EQ(0, arr->null_count()); +} + TEST_F(TestArray, TestIsNull) { // clang-format off std::vector null_bitmap = {1, 0, 1, 1, 0, 1, 0, 0, @@ -102,7 +128,7 @@ TEST_F(TestArray, TestIsNull) { std::shared_ptr null_buf = test::bytes_to_null_buffer(null_bitmap); std::unique_ptr arr; - arr.reset(new Int32Array(null_bitmap.size(), nullptr, null_count, null_buf)); + arr.reset(new Int32Array(null_bitmap.size(), nullptr, null_buf, null_count)); ASSERT_EQ(null_count, arr->null_count()); ASSERT_EQ(5, null_buf->size()); diff --git a/cpp/src/arrow/array-union-test.cc b/cpp/src/arrow/array-union-test.cc new file mode 100644 index 0000000000000..eb9bd7da31b13 --- /dev/null +++ b/cpp/src/arrow/array-union-test.cc @@ -0,0 +1,67 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
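
The SliceRecomputeNullCount test above depends on the deferred null-count logic this patch introduces: a slice is constructed with a null count of -1, and the count is only computed, from the validity bitmap of the sliced range, on the first call to null_count(). A minimal sketch of that computation follows; the naive bit loop is a stand-in, not Arrow's optimized CountSetBits:

#include <cstdint>

// Count 1-bits in bitmap positions [offset, offset + length).
// Bits are least-significant-bit ordered, matching Arrow's validity bitmaps.
static int32_t NaiveCountSetBits(const uint8_t* bitmap, int32_t offset, int32_t length) {
  int32_t count = 0;
  for (int32_t i = offset; i < offset + length; ++i) {
    if (bitmap[i / 8] & (1 << (i % 8))) { ++count; }
  }
  return count;
}

// The null count of a slice is its length minus the number of valid (set) bits.
static int32_t SliceNullCount(
    const uint8_t* valid_bitmap, int32_t offset, int32_t length) {
  return length - NaiveCountSetBits(valid_bitmap, offset, length);
}
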
+ +// Tests for UnionArray + +#include +#include +#include + +#include "gtest/gtest.h" + +#include "arrow/array.h" +#include "arrow/builder.h" +#include "arrow/ipc/test-common.h" +#include "arrow/status.h" +#include "arrow/table.h" +#include "arrow/test-util.h" +#include "arrow/type.h" + +namespace arrow { + +TEST(TestUnionArrayAdHoc, TestSliceEquals) { + std::shared_ptr<RecordBatch> batch; + ASSERT_OK(ipc::MakeUnion(&batch)); + + const int size = batch->num_rows(); + + auto CheckUnion = [&size](std::shared_ptr<Array> array) { + std::shared_ptr<Array> slice, slice2; + slice = array->Slice(2); + slice2 = array->Slice(2); + ASSERT_EQ(size - 2, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(2, array->length(), 0, slice)); + + // Chained slices + slice2 = array->Slice(1)->Slice(1); + ASSERT_TRUE(slice->Equals(slice2)); + + slice = array->Slice(1, 5); + slice2 = array->Slice(1, 5); + ASSERT_EQ(5, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(1, 6, 0, slice)); + }; + + CheckUnion(batch->column(1)); + CheckUnion(batch->column(2)); } + +} // namespace arrow diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index 6fc7fb60bf364..f84023e6c7d31 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -17,6 +17,7 @@ #include "arrow/array.h" +#include #include #include #include @@ -30,28 +31,37 @@ namespace arrow { -Status GetEmptyBitmap( - MemoryPool* pool, int32_t length, std::shared_ptr<MutableBuffer>* result) { - auto buffer = std::make_shared<PoolBuffer>(pool); - RETURN_NOT_OK(buffer->Resize(BitUtil::BytesForBits(length))); - memset(buffer->mutable_data(), 0, buffer->size()); - - *result = buffer; - return Status::OK(); -} +// When slicing, we do not know the null count of the sliced range without +// doing some computation. To avoid doing this eagerly, we set the null count +// to -1 (any negative number will do). When Array::null_count is called the +// first time, the null count will be computed. 
See ARROW-33 +constexpr int32_t kUnknownNullCount = -1; // ---------------------------------------------------------------------- // Base array class -Array::Array(const std::shared_ptr<DataType>& type, int32_t length, int32_t null_count, - const std::shared_ptr<Buffer>& null_bitmap) { - type_ = type; - length_ = length; - null_count_ = null_count; - null_bitmap_ = null_bitmap; +Array::Array(const std::shared_ptr<DataType>& type, int32_t length, + const std::shared_ptr<Buffer>& null_bitmap, int32_t null_count, int32_t offset) + : type_(type), + length_(length), + offset_(offset), + null_count_(null_count), + null_bitmap_(null_bitmap), + null_bitmap_data_(nullptr) { if (null_bitmap_) { null_bitmap_data_ = null_bitmap_->data(); } } +int32_t Array::null_count() const { + if (null_count_ < 0) { + if (null_bitmap_) { + null_count_ = CountSetBits(null_bitmap_data_, offset_, length_); + } else { + null_count_ = 0; + } + } + return null_count_; +} + bool Array::Equals(const Array& arr) const { bool are_equal = false; Status error = ArrayEquals(*this, arr, &are_equal); @@ -86,10 +96,32 @@ bool Array::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_ return are_equal; } +// Last two parameters are in-out parameters +static inline void ConformSliceParams( + int32_t array_offset, int32_t array_length, int32_t* offset, int32_t* length) { + DCHECK_LE(*offset, array_length); + DCHECK_GE(*offset, 0); + *length = std::min(array_length - *offset, *length); + *offset = array_offset + *offset; +} + +std::shared_ptr<Array> Array::Slice(int32_t offset) const { + int32_t slice_length = length_ - offset; + return Slice(offset, slice_length); +} + Status Array::Validate() const { return Status::OK(); } +NullArray::NullArray(int32_t length) : Array(null(), length, nullptr, length) {} + +std::shared_ptr<Array> NullArray::Slice(int32_t offset, int32_t length) const { + DCHECK_LE(offset, length_); + length = std::min(length_ - offset, length); + return std::make_shared<NullArray>(length); +} + Status NullArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } @@ -98,9 +130,9 @@ Status NullArray::Accept(ArrayVisitor* visitor) const { // Primitive array base PrimitiveArray::PrimitiveArray(const std::shared_ptr<DataType>& type, int32_t length, - const std::shared_ptr<Buffer>& data, int32_t null_count, - const std::shared_ptr<Buffer>& null_bitmap) - : Array(type, length, null_count, null_bitmap) { + const std::shared_ptr<Buffer>& data, const std::shared_ptr<Buffer>& null_bitmap, + int32_t null_count, int32_t offset) + : Array(type, length, null_bitmap, null_count, offset) { data_ = data; raw_data_ = data == nullptr ? 
nullptr : data_->data(); } @@ -110,6 +142,13 @@ Status NumericArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } +template +std::shared_ptr NumericArray::Slice(int32_t offset, int32_t length) const { + ConformSliceParams(offset_, length_, &offset, &length); + return std::make_shared>( + type_, length, data_, null_bitmap_, kUnknownNullCount, offset); +} + template class NumericArray; template class NumericArray; template class NumericArray; @@ -129,32 +168,33 @@ template class NumericArray; // BooleanArray BooleanArray::BooleanArray(int32_t length, const std::shared_ptr& data, - int32_t null_count, const std::shared_ptr& null_bitmap) - : PrimitiveArray( - std::make_shared(), length, data, null_count, null_bitmap) {} - -BooleanArray::BooleanArray(const std::shared_ptr& type, int32_t length, - const std::shared_ptr& data, int32_t null_count, - const std::shared_ptr& null_bitmap) - : PrimitiveArray(type, length, data, null_count, null_bitmap) {} + const std::shared_ptr& null_bitmap, int32_t null_count, int32_t offset) + : PrimitiveArray(std::make_shared(), length, data, null_bitmap, + null_count, offset) {} Status BooleanArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } +std::shared_ptr BooleanArray::Slice(int32_t offset, int32_t length) const { + ConformSliceParams(offset_, length_, &offset, &length); + return std::make_shared( + length, data_, null_bitmap_, kUnknownNullCount, offset); +} + // ---------------------------------------------------------------------- // ListArray Status ListArray::Validate() const { if (length_ < 0) { return Status::Invalid("Length was negative"); } - if (!offsets_buffer_) { return Status::Invalid("offsets_buffer_ was null"); } - if (offsets_buffer_->size() / static_cast(sizeof(int32_t)) < length_) { + if (!value_offsets_) { return Status::Invalid("value_offsets_ was null"); } + if (value_offsets_->size() / static_cast(sizeof(int32_t)) < length_) { std::stringstream ss; - ss << "offset buffer size (bytes): " << offsets_buffer_->size() + ss << "offset buffer size (bytes): " << value_offsets_->size() << " isn't large enough for length: " << length_; return Status::Invalid(ss.str()); } - const int32_t last_offset = offset(length_); + const int32_t last_offset = this->value_offset(length_); if (last_offset > 0) { if (!values_) { return Status::Invalid("last offset was non-zero and values was null"); @@ -174,14 +214,15 @@ Status ListArray::Validate() const { } } - int32_t prev_offset = offset(0); + int32_t prev_offset = this->value_offset(0); if (prev_offset != 0) { return Status::Invalid("The first offset wasn't zero"); } for (int32_t i = 1; i <= length_; ++i) { - int32_t current_offset = offset(i); + int32_t current_offset = this->value_offset(i); if (IsNull(i - 1) && current_offset != prev_offset) { std::stringstream ss; - ss << "Offset invariant failure at: " << i << " inconsistent offsets for null slot" - << current_offset << "!=" << prev_offset; + ss << "Offset invariant failure at: " << i + << " inconsistent value_offsets for null slot" << current_offset + << "!=" << prev_offset; return Status::Invalid(ss.str()); } if (current_offset < prev_offset) { @@ -200,26 +241,33 @@ Status ListArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } +std::shared_ptr ListArray::Slice(int32_t offset, int32_t length) const { + ConformSliceParams(offset_, length_, &offset, &length); + return std::make_shared( + type_, length, value_offsets_, values_, null_bitmap_, kUnknownNullCount, offset); +} + // 
---------------------------------------------------------------------- // String and binary static std::shared_ptr kBinary = std::make_shared(); static std::shared_ptr kString = std::make_shared(); -BinaryArray::BinaryArray(int32_t length, const std::shared_ptr& offsets, - const std::shared_ptr& data, int32_t null_count, - const std::shared_ptr& null_bitmap) - : BinaryArray(kBinary, length, offsets, data, null_count, null_bitmap) {} +BinaryArray::BinaryArray(int32_t length, const std::shared_ptr& value_offsets, + const std::shared_ptr& data, const std::shared_ptr& null_bitmap, + int32_t null_count, int32_t offset) + : BinaryArray(kBinary, length, value_offsets, data, null_bitmap, null_count, offset) { +} BinaryArray::BinaryArray(const std::shared_ptr& type, int32_t length, - const std::shared_ptr& offsets, const std::shared_ptr& data, - int32_t null_count, const std::shared_ptr& null_bitmap) - : Array(type, length, null_count, null_bitmap), - offsets_buffer_(offsets), - offsets_(reinterpret_cast(offsets_buffer_->data())), - data_buffer_(data), - data_(nullptr) { - if (data_buffer_ != nullptr) { data_ = data_buffer_->data(); } + const std::shared_ptr& value_offsets, const std::shared_ptr& data, + const std::shared_ptr& null_bitmap, int32_t null_count, int32_t offset) + : Array(type, length, null_bitmap, null_count, offset), + value_offsets_(value_offsets), + raw_value_offsets_(reinterpret_cast(value_offsets_->data())), + data_(data), + raw_data_(nullptr) { + if (data_ != nullptr) { raw_data_ = data_->data(); } } Status BinaryArray::Validate() const { @@ -231,10 +279,17 @@ Status BinaryArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } -StringArray::StringArray(int32_t length, const std::shared_ptr& offsets, - const std::shared_ptr& data, int32_t null_count, - const std::shared_ptr& null_bitmap) - : BinaryArray(kString, length, offsets, data, null_count, null_bitmap) {} +std::shared_ptr BinaryArray::Slice(int32_t offset, int32_t length) const { + ConformSliceParams(offset_, length_, &offset, &length); + return std::make_shared( + length, value_offsets_, data_, null_bitmap_, kUnknownNullCount, offset); +} + +StringArray::StringArray(int32_t length, const std::shared_ptr& value_offsets, + const std::shared_ptr& data, const std::shared_ptr& null_bitmap, + int32_t null_count, int32_t offset) + : BinaryArray(kString, length, value_offsets, data, null_bitmap, null_count, offset) { +} Status StringArray::Validate() const { // TODO(emkornfield) Validate proper UTF8 code points? 
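
Each Slice override above follows the same pattern: normalize the requested window with ConformSliceParams, then construct a new array that shares the existing buffers, passing kUnknownNullCount so the null count is recomputed lazily. A usage sketch (the string values here are hypothetical):

// No buffer data is copied; only the (offset, length) window changes.
StringBuilder builder(default_memory_pool());
builder.Append("foo");
builder.AppendNull();
builder.Append("bar");
builder.Append("baz");

std::shared_ptr<Array> array;
builder.Finish(&array);

std::shared_ptr<Array> slice = array->Slice(1, 2);  // elements 1 and 2
// slice->length() == 2; slice->offset() == 1;
// slice->null_count() is computed from the bitmap on first call
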
@@ -245,12 +300,26 @@ Status StringArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } +std::shared_ptr StringArray::Slice(int32_t offset, int32_t length) const { + ConformSliceParams(offset_, length_, &offset, &length); + return std::make_shared( + length, value_offsets_, data_, null_bitmap_, kUnknownNullCount, offset); +} + // ---------------------------------------------------------------------- // Struct +StructArray::StructArray(const std::shared_ptr& type, int32_t length, + const std::vector>& children, + std::shared_ptr null_bitmap, int32_t null_count, int32_t offset) + : Array(type, length, null_bitmap, null_count, offset) { + type_ = type; + children_ = children; +} + std::shared_ptr StructArray::field(int32_t pos) const { - DCHECK_GT(field_arrays_.size(), 0); - return field_arrays_[pos]; + DCHECK_GT(children_.size(), 0); + return children_[pos]; } Status StructArray::Validate() const { @@ -260,11 +329,11 @@ Status StructArray::Validate() const { return Status::Invalid("Null count exceeds the length of this struct"); } - if (field_arrays_.size() > 0) { + if (children_.size() > 0) { // Validate fields - int32_t array_length = field_arrays_[0]->length(); + int32_t array_length = children_[0]->length(); size_t idx = 0; - for (auto it : field_arrays_) { + for (auto it : children_) { if (it->length() != array_length) { std::stringstream ss; ss << "Length is not equal from field " << it->type()->ToString() @@ -293,19 +362,27 @@ Status StructArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } +std::shared_ptr StructArray::Slice(int32_t offset, int32_t length) const { + ConformSliceParams(offset_, length_, &offset, &length); + return std::make_shared( + type_, length, children_, null_bitmap_, kUnknownNullCount, offset); +} + // ---------------------------------------------------------------------- // UnionArray UnionArray::UnionArray(const std::shared_ptr& type, int32_t length, const std::vector>& children, - const std::shared_ptr& type_ids, const std::shared_ptr& offsets, - int32_t null_count, const std::shared_ptr& null_bitmap) - : Array(type, length, null_count, null_bitmap), + const std::shared_ptr& type_ids, const std::shared_ptr& value_offsets, + const std::shared_ptr& null_bitmap, int32_t null_count, int32_t offset) + : Array(type, length, null_bitmap, null_count, offset), children_(children), - type_ids_buffer_(type_ids), - offsets_buffer_(offsets) { - type_ids_ = reinterpret_cast(type_ids->data()); - if (offsets) { offsets_ = reinterpret_cast(offsets->data()); } + type_ids_(type_ids), + value_offsets_(value_offsets) { + raw_type_ids_ = reinterpret_cast(type_ids->data()); + if (value_offsets) { + raw_value_offsets_ = reinterpret_cast(value_offsets->data()); + } } std::shared_ptr UnionArray::child(int32_t pos) const { @@ -328,18 +405,24 @@ Status UnionArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } +std::shared_ptr UnionArray::Slice(int32_t offset, int32_t length) const { + ConformSliceParams(offset_, length_, &offset, &length); + return std::make_shared(type_, length, children_, type_ids_, value_offsets_, + null_bitmap_, kUnknownNullCount, offset); +} + // ---------------------------------------------------------------------- // DictionaryArray Status DictionaryArray::FromBuffer(const std::shared_ptr& type, int32_t length, - const std::shared_ptr& indices, int32_t null_count, - const std::shared_ptr& null_bitmap, std::shared_ptr* out) { + const std::shared_ptr& indices, const std::shared_ptr& 
null_bitmap, + int32_t null_count, int32_t offset, std::shared_ptr* out) { DCHECK_EQ(type->type, Type::DICTIONARY); const auto& dict_type = static_cast(type.get()); std::shared_ptr boxed_indices; - RETURN_NOT_OK(MakePrimitiveArray( - dict_type->index_type(), length, indices, null_count, null_bitmap, &boxed_indices)); + RETURN_NOT_OK(MakePrimitiveArray(dict_type->index_type(), length, indices, null_bitmap, + null_count, offset, &boxed_indices)); *out = std::make_shared(type, boxed_indices); return Status::OK(); @@ -347,7 +430,8 @@ Status DictionaryArray::FromBuffer(const std::shared_ptr& type, int32_ DictionaryArray::DictionaryArray( const std::shared_ptr& type, const std::shared_ptr& indices) - : Array(type, indices->length(), indices->null_count(), indices->null_bitmap()), + : Array(type, indices->length(), indices->null_bitmap(), indices->null_count(), + indices->offset()), dict_type_(static_cast(type.get())), indices_(indices) { DCHECK_EQ(type->type, Type::DICTIONARY); @@ -369,16 +453,21 @@ Status DictionaryArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } +std::shared_ptr DictionaryArray::Slice(int32_t offset, int32_t length) const { + std::shared_ptr sliced_indices = indices_->Slice(offset, length); + return std::make_shared(type_, sliced_indices); +} + // ---------------------------------------------------------------------- -#define MAKE_PRIMITIVE_ARRAY_CASE(ENUM, ArrayType) \ - case Type::ENUM: \ - out->reset(new ArrayType(type, length, data, null_count, null_bitmap)); \ +#define MAKE_PRIMITIVE_ARRAY_CASE(ENUM, ArrayType) \ + case Type::ENUM: \ + out->reset(new ArrayType(type, length, data, null_bitmap, null_count, offset)); \ break; Status MakePrimitiveArray(const std::shared_ptr& type, int32_t length, - const std::shared_ptr& data, int32_t null_count, - const std::shared_ptr& null_bitmap, std::shared_ptr* out) { + const std::shared_ptr& data, const std::shared_ptr& null_bitmap, + int32_t null_count, int32_t offset, std::shared_ptr* out) { switch (type->type) { MAKE_PRIMITIVE_ARRAY_CASE(BOOL, BooleanArray); MAKE_PRIMITIVE_ARRAY_CASE(UINT8, UInt8Array); diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 3b6e93f9cb34d..f3e8f9a4982f7 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -27,6 +27,7 @@ #include "arrow/buffer.h" #include "arrow/type.h" #include "arrow/type_fwd.h" +#include "arrow/type_traits.h" #include "arrow/util/bit-util.h" #include "arrow/util/macros.h" #include "arrow/util/visibility.h" @@ -71,23 +72,36 @@ class ArrayVisitor { /// /// The base class is only required to have a null bitmap buffer if the null /// count is greater than 0 +/// +/// If known, the null count can be provided in the base Array constructor. If +/// the null count is not known, pass -1 to indicate that the null count is to +/// be computed on the first call to null_count() class ARROW_EXPORT Array { public: - Array(const std::shared_ptr& type, int32_t length, int32_t null_count = 0, - const std::shared_ptr& null_bitmap = nullptr); + Array(const std::shared_ptr& type, int32_t length, + const std::shared_ptr& null_bitmap = nullptr, int32_t null_count = 0, + int32_t offset = 0); virtual ~Array() = default; /// Determine if a slot is null. For inner loops. Does *not* boundscheck bool IsNull(int i) const { - return null_count_ > 0 && BitUtil::BitNotSet(null_bitmap_data_, i); + return null_bitmap_data_ != nullptr && + BitUtil::BitNotSet(null_bitmap_data_, i + offset_); } /// Size in the number of elements this array contains. 
int32_t length() const { return length_; } - /// The number of null entries in the array. - int32_t null_count() const { return null_count_; } + /// A relative position into another array's data, to enable zero-copy + /// slicing. This value defaults to zero + int32_t offset() const { return offset_; } + + /// The number of null entries in the array. If the null count was not known + /// at time of construction (and set to a negative value), then the null + /// count will be computed and cached on the first invocation of this + /// function + int32_t null_count() const; std::shared_ptr type() const { return type_; } Type::type type_enum() const { return type_->type; } @@ -95,11 +109,13 @@ class ARROW_EXPORT Array { /// Buffer for the null bitmap. /// /// Note that for `null_count == 0`, this can be a `nullptr`. + /// This buffer does not account for any slice offset std::shared_ptr null_bitmap() const { return null_bitmap_; } /// Raw pointer to the null bitmap. /// /// Note that for `null_count == 0`, this can be a `nullptr`. + /// This buffer does not account for any slice offset const uint8_t* null_bitmap_data() const { return null_bitmap_data_; } bool Equals(const Array& arr) const; @@ -120,10 +136,29 @@ class ARROW_EXPORT Array { virtual Status Accept(ArrayVisitor* visitor) const = 0; + /// Construct a zero-copy slice of the array with the indicated offset and + /// length + /// + /// \param[in] offset the position of the first element in the constructed slice + /// \param[in] length the length of the slice. If there are not enough elements in the + /// array, + /// the length will be adjusted accordingly + /// + /// \return a new object wrapped in std::shared_ptr + virtual std::shared_ptr Slice(int32_t offset, int32_t length) const = 0; + + /// Slice from offset until end of the array + std::shared_ptr Slice(int32_t offset) const; + protected: std::shared_ptr type_; - int32_t null_count_; int32_t length_; + int32_t offset_; + + // This member is marked mutable so that it can be modified when null_count() + // is called from a const context and the null count has to be computed (if + // it is not already known) + mutable int32_t null_count_; std::shared_ptr null_bitmap_; const uint8_t* null_bitmap_data_; @@ -138,28 +173,26 @@ class ARROW_EXPORT NullArray : public Array { public: using TypeClass = NullType; - NullArray(const std::shared_ptr& type, int32_t length) - : Array(type, length, length, nullptr) {} - - explicit NullArray(int32_t length) : NullArray(std::make_shared(), length) {} + explicit NullArray(int32_t length); Status Accept(ArrayVisitor* visitor) const override; -}; -Status ARROW_EXPORT GetEmptyBitmap( - MemoryPool* pool, int32_t length, std::shared_ptr* result); + std::shared_ptr Slice(int32_t offset, int32_t length) const override; +}; /// Base class for fixed-size logical types class ARROW_EXPORT PrimitiveArray : public Array { public: - virtual ~PrimitiveArray() {} + PrimitiveArray(const std::shared_ptr& type, int32_t length, + const std::shared_ptr& data, + const std::shared_ptr& null_bitmap = nullptr, int32_t null_count = 0, + int32_t offset = 0); + /// The memory containing this array's data + /// This buffer does not account for any slice offset std::shared_ptr data() const { return data_; } protected: - PrimitiveArray(const std::shared_ptr& type, int32_t length, - const std::shared_ptr& data, int32_t null_count = 0, - const std::shared_ptr& null_bitmap = nullptr); std::shared_ptr data_; const uint8_t* raw_data_; }; @@ -169,21 +202,28 @@ class ARROW_EXPORT 
NumericArray : public PrimitiveArray { public: using TypeClass = TYPE; using value_type = typename TypeClass::c_type; - NumericArray(int32_t length, const std::shared_ptr<Buffer>& data, - int32_t null_count = 0, const std::shared_ptr<Buffer>& null_bitmap = nullptr) - : PrimitiveArray( - std::make_shared<TypeClass>(), length, data, null_count, null_bitmap) {} - NumericArray(const std::shared_ptr<DataType>& type, int32_t length, - const std::shared_ptr<Buffer>& data, int32_t null_count = 0, - const std::shared_ptr<Buffer>& null_bitmap = nullptr) - : PrimitiveArray(type, length, data, null_count, null_bitmap) {} + + using PrimitiveArray::PrimitiveArray; + + // Only enable this constructor without a type argument for types without additional + // metadata + template <typename T1 = TYPE> + NumericArray( + typename std::enable_if<TypeTraits<T1>::is_parameter_free, int32_t>::type length, + const std::shared_ptr<Buffer>& data, + const std::shared_ptr<Buffer>& null_bitmap = nullptr, int32_t null_count = 0, + int32_t offset = 0) + : PrimitiveArray(TypeTraits<T1>::type_singleton(), length, data, null_bitmap, + null_count, offset) {} const value_type* raw_data() const { return reinterpret_cast<const value_type*>(raw_data_) + offset_; } Status Accept(ArrayVisitor* visitor) const override; + std::shared_ptr<Array> Slice(int32_t offset, int32_t length) const override; + value_type Value(int i) const { return raw_data()[i]; } }; @@ -191,17 +231,19 @@ class ARROW_EXPORT BooleanArray : public PrimitiveArray { public: using TypeClass = BooleanType; + using PrimitiveArray::PrimitiveArray; + BooleanArray(int32_t length, const std::shared_ptr<Buffer>& data, - int32_t null_count = 0, const std::shared_ptr<Buffer>& null_bitmap = nullptr); - BooleanArray(const std::shared_ptr<DataType>& type, int32_t length, - const std::shared_ptr<Buffer>& data, int32_t null_count = 0, - const std::shared_ptr<Buffer>& null_bitmap = nullptr); + const std::shared_ptr<Buffer>& null_bitmap = nullptr, int32_t null_count = 0, + int32_t offset = 0); Status Accept(ArrayVisitor* visitor) const override; - const uint8_t* raw_data() const { return reinterpret_cast<const uint8_t*>(raw_data_); } + std::shared_ptr<Array> Slice(int32_t offset, int32_t length) const override; - bool Value(int i) const { return BitUtil::GetBit(raw_data(), i); } + bool Value(int i) const { + return BitUtil::GetBit(reinterpret_cast<const uint8_t*>(raw_data_), i + offset_); + } }; // ---------------------------------------------------------------------- @@ -212,39 +254,45 @@ class ARROW_EXPORT ListArray : public Array { using TypeClass = ListType; ListArray(const std::shared_ptr<DataType>& type, int32_t length, - const std::shared_ptr<Buffer>& offsets, const std::shared_ptr<Array>& values, - int32_t null_count = 0, const std::shared_ptr<Buffer>& null_bitmap = nullptr) - : Array(type, length, null_count, null_bitmap) { - offsets_buffer_ = offsets; - offsets_ = offsets == nullptr ? nullptr : reinterpret_cast<const int32_t*>( - offsets_buffer_->data()); + const std::shared_ptr<Buffer>& value_offsets, const std::shared_ptr<Array>& values, + const std::shared_ptr<Buffer>& null_bitmap = nullptr, int32_t null_count = 0, + int32_t offset = 0) + : Array(type, length, null_bitmap, null_count, offset) { + value_offsets_ = value_offsets; + raw_value_offsets_ = value_offsets == nullptr + ? nullptr + : reinterpret_cast<const int32_t*>(value_offsets_->data()); values_ = values; } Status Validate() const override; - virtual ~ListArray() = default; - // Return a shared pointer in case the requestor desires to share ownership // with this array. 
std::shared_ptr values() const { return values_; } - std::shared_ptr offsets() const { return offsets_buffer_; } - std::shared_ptr value_type() const { return values_->type(); } + /// Note that this buffer does not account for any slice offset + std::shared_ptr value_offsets() const { return value_offsets_; } - const int32_t* raw_offsets() const { return offsets_; } + std::shared_ptr value_type() const { return values_->type(); } - int32_t offset(int i) const { return offsets_[i]; } + /// Return pointer to raw value offsets accounting for any slice offset + const int32_t* raw_value_offsets() const { return raw_value_offsets_ + offset_; } // Neither of these functions will perform boundschecking - int32_t value_offset(int i) const { return offsets_[i]; } - int32_t value_length(int i) const { return offsets_[i + 1] - offsets_[i]; } + int32_t value_offset(int i) const { return raw_value_offsets_[i + offset_]; } + int32_t value_length(int i) const { + i += offset_; + return raw_value_offsets_[i + 1] - raw_value_offsets_[i]; + } Status Accept(ArrayVisitor* visitor) const override; + std::shared_ptr Slice(int32_t offset, int32_t length) const override; + protected: - std::shared_ptr offsets_buffer_; - const int32_t* offsets_; + std::shared_ptr value_offsets_; + const int32_t* raw_value_offsets_; std::shared_ptr values_; }; @@ -255,55 +303,67 @@ class ARROW_EXPORT BinaryArray : public Array { public: using TypeClass = BinaryType; - BinaryArray(int32_t length, const std::shared_ptr& offsets, - const std::shared_ptr& data, int32_t null_count = 0, - const std::shared_ptr& null_bitmap = nullptr); - - // Constructor that allows sub-classes/builders to propagate there logical type up the - // class hierarchy. - BinaryArray(const std::shared_ptr& type, int32_t length, - const std::shared_ptr& offsets, const std::shared_ptr& data, - int32_t null_count = 0, const std::shared_ptr& null_bitmap = nullptr); + BinaryArray(int32_t length, const std::shared_ptr& value_offsets, + const std::shared_ptr& data, + const std::shared_ptr& null_bitmap = nullptr, int32_t null_count = 0, + int32_t offset = 0); // Return the pointer to the given elements bytes // TODO(emkornfield) introduce a StringPiece or something similar to capture zero-copy // pointer + offset const uint8_t* GetValue(int i, int32_t* out_length) const { - const int32_t pos = offsets_[i]; - *out_length = offsets_[i + 1] - pos; - return data_ + pos; + // Account for base offset + i += offset_; + + const int32_t pos = raw_value_offsets_[i]; + *out_length = raw_value_offsets_[i + 1] - pos; + return raw_data_ + pos; } - std::shared_ptr data() const { return data_buffer_; } - std::shared_ptr offsets() const { return offsets_buffer_; } + /// Note that this buffer does not account for any slice offset + std::shared_ptr data() const { return data_; } - const int32_t* raw_offsets() const { return offsets_; } + /// Note that this buffer does not account for any slice offset + std::shared_ptr value_offsets() const { return value_offsets_; } - int32_t offset(int i) const { return offsets_[i]; } + const int32_t* raw_value_offsets() const { return raw_value_offsets_ + offset_; } // Neither of these functions will perform boundschecking - int32_t value_offset(int i) const { return offsets_[i]; } - int32_t value_length(int i) const { return offsets_[i + 1] - offsets_[i]; } + int32_t value_offset(int i) const { return raw_value_offsets_[i + offset_]; } + int32_t value_length(int i) const { + i += offset_; + return raw_value_offsets_[i + 1] - raw_value_offsets_[i]; + } 
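
Since the accessors above add offset_ internally, callers index sliced arrays with logical positions. A worked example with hypothetical values:

// Suppose the parent's offsets buffer holds [0, 3, 3, 7, 9] and this array
// was sliced with offset_ = 1. Then, for logical element 0:
//   value_offset(0) == raw_value_offsets_[0 + 1] == 3
//   value_length(0) == raw_value_offsets_[2] - raw_value_offsets_[1] == 0
// The returned offsets stay relative to the unsliced data buffer; they are
// not rebased to zero (the IPC writer handles that separately).
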
Status Validate() const override; Status Accept(ArrayVisitor* visitor) const override; - private: - std::shared_ptr<Buffer> offsets_buffer_; - const int32_t* offsets_; + std::shared_ptr<Array> Slice(int32_t offset, int32_t length) const override; + + protected: + // Constructor that allows sub-classes/builders to propagate their logical type up the + // class hierarchy. + BinaryArray(const std::shared_ptr<DataType>& type, int32_t length, + const std::shared_ptr<Buffer>& value_offsets, const std::shared_ptr<Buffer>& data, + const std::shared_ptr<Buffer>& null_bitmap = nullptr, int32_t null_count = 0, + int32_t offset = 0); - std::shared_ptr<Buffer> data_buffer_; - const uint8_t* data_; + std::shared_ptr<Buffer> value_offsets_; + const int32_t* raw_value_offsets_; + + std::shared_ptr<Buffer> data_; + const uint8_t* raw_data_; }; class ARROW_EXPORT StringArray : public BinaryArray { public: using TypeClass = StringType; - StringArray(int32_t length, const std::shared_ptr<Buffer>& offsets, - const std::shared_ptr<Buffer>& data, int32_t null_count = 0, - const std::shared_ptr<Buffer>& null_bitmap = nullptr); + StringArray(int32_t length, const std::shared_ptr<Buffer>& value_offsets, + const std::shared_ptr<Buffer>& data, + const std::shared_ptr<Buffer>& null_bitmap = nullptr, int32_t null_count = 0, + int32_t offset = 0); // Construct a std::string // TODO: std::bad_alloc possibility @@ -316,6 +376,8 @@ class ARROW_EXPORT StringArray : public BinaryArray { Status Validate() const override; Status Accept(ArrayVisitor* visitor) const override; + + std::shared_ptr<Array> Slice(int32_t offset, int32_t length) const override; }; // ---------------------------------------------------------------------- @@ -326,28 +388,25 @@ class ARROW_EXPORT StructArray : public Array { using TypeClass = StructType; StructArray(const std::shared_ptr<DataType>& type, int32_t length, - const std::vector<std::shared_ptr<Array>>& field_arrays, int32_t null_count = 0, - std::shared_ptr<Buffer> null_bitmap = nullptr) - : Array(type, length, null_count, null_bitmap) { - type_ = type; - field_arrays_ = field_arrays; - } + const std::vector<std::shared_ptr<Array>>& children, + std::shared_ptr<Buffer> null_bitmap = nullptr, int32_t null_count = 0, + int32_t offset = 0); Status Validate() const override; - virtual ~StructArray() {} - // Return a shared pointer in case the requestor desires to share ownership // with this array. std::shared_ptr<Array> field(int32_t pos) const; - const std::vector<std::shared_ptr<Array>>& fields() const { return field_arrays_; } + const std::vector<std::shared_ptr<Array>>& fields() const { return children_; } Status Accept(ArrayVisitor* visitor) const override; + std::shared_ptr<Array> Slice(int32_t offset, int32_t length) const override; + protected: // The child arrays corresponding to each field of the struct data type. 
- std::vector> field_arrays_; + std::vector> children_; }; // ---------------------------------------------------------------------- @@ -356,22 +415,25 @@ class ARROW_EXPORT StructArray : public Array { class ARROW_EXPORT UnionArray : public Array { public: using TypeClass = UnionType; + using type_id_t = uint8_t; UnionArray(const std::shared_ptr& type, int32_t length, const std::vector>& children, const std::shared_ptr& type_ids, - const std::shared_ptr& offsets = nullptr, int32_t null_count = 0, - const std::shared_ptr& null_bitmap = nullptr); + const std::shared_ptr& value_offsets = nullptr, + const std::shared_ptr& null_bitmap = nullptr, int32_t null_count = 0, + int32_t offset = 0); Status Validate() const override; - virtual ~UnionArray() {} + /// Note that this buffer does not account for any slice offset + std::shared_ptr type_ids() const { return type_ids_; } - std::shared_ptr type_ids() const { return type_ids_buffer_; } - const uint8_t* raw_type_ids() const { return type_ids_; } + /// Note that this buffer does not account for any slice offset + std::shared_ptr value_offsets() const { return value_offsets_; } - std::shared_ptr offsets() const { return offsets_buffer_; } - const int32_t* raw_offsets() const { return offsets_; } + const type_id_t* raw_type_ids() const { return raw_type_ids_ + offset_; } + const int32_t* raw_value_offsets() const { return raw_value_offsets_ + offset_; } UnionMode mode() const { return static_cast(*type_.get()).mode; } @@ -381,14 +443,16 @@ class ARROW_EXPORT UnionArray : public Array { Status Accept(ArrayVisitor* visitor) const override; + std::shared_ptr Slice(int32_t offset, int32_t length) const override; + protected: std::vector> children_; - std::shared_ptr type_ids_buffer_; - const uint8_t* type_ids_; + std::shared_ptr type_ids_; + const type_id_t* raw_type_ids_; - std::shared_ptr offsets_buffer_; - const int32_t* offsets_; + std::shared_ptr value_offsets_; + const int32_t* raw_value_offsets_; }; // ---------------------------------------------------------------------- @@ -419,8 +483,8 @@ class ARROW_EXPORT DictionaryArray : public Array { // Alternate ctor; other attributes (like null count) are inherited from the // passed indices array static Status FromBuffer(const std::shared_ptr& type, int32_t length, - const std::shared_ptr& indices, int32_t null_count, - const std::shared_ptr& null_bitmap, std::shared_ptr* out); + const std::shared_ptr& indices, const std::shared_ptr& null_bitmap, + int32_t null_count, int32_t offset, std::shared_ptr* out); Status Validate() const override; @@ -431,6 +495,8 @@ class ARROW_EXPORT DictionaryArray : public Array { Status Accept(ArrayVisitor* visitor) const override; + std::shared_ptr Slice(int32_t offset, int32_t length) const override; + protected: const DictionaryType* dict_type_; std::shared_ptr indices_; @@ -471,8 +537,9 @@ extern template class ARROW_EXPORT NumericArray; // Create new arrays for logical types that are backed by primitive arrays. 
Status ARROW_EXPORT MakePrimitiveArray(const std::shared_ptr<DataType>& type, - int32_t length, const std::shared_ptr<Buffer>& data, int32_t null_count, - const std::shared_ptr<Buffer>& null_bitmap, std::shared_ptr<Array>* out); + int32_t length, const std::shared_ptr<Buffer>& data, + const std::shared_ptr<Buffer>& null_bitmap, int32_t null_count, int32_t offset, + std::shared_ptr<Array>* out); } // namespace arrow diff --git a/cpp/src/arrow/buffer.cc b/cpp/src/arrow/buffer.cc index 6cce0efa37784..fb5a010efa225 100644 --- a/cpp/src/arrow/buffer.cc +++ b/cpp/src/arrow/buffer.cc @@ -116,4 +116,20 @@ Status PoolBuffer::Resize(int64_t new_size, bool shrink_to_fit) { return Status::OK(); } +Status AllocateBuffer( + MemoryPool* pool, int64_t size, std::shared_ptr<MutableBuffer>* out) { + auto buffer = std::make_shared<PoolBuffer>(pool); + RETURN_NOT_OK(buffer->Resize(size)); + *out = buffer; + return Status::OK(); +} + +Status AllocateResizableBuffer( + MemoryPool* pool, int64_t size, std::shared_ptr<ResizableBuffer>* out) { + auto buffer = std::make_shared<PoolBuffer>(pool); + RETURN_NOT_OK(buffer->Resize(size)); + *out = buffer; + return Status::OK(); +} + } // namespace arrow diff --git a/cpp/src/arrow/buffer.h b/cpp/src/arrow/buffer.h index d43ab0375b725..9c400b1290394 100644 --- a/cpp/src/arrow/buffer.h +++ b/cpp/src/arrow/buffer.h @@ -15,8 +15,8 @@ // specific language governing permissions and limitations // under the License. -#ifndef ARROW_UTIL_BUFFER_H -#define ARROW_UTIL_BUFFER_H +#ifndef ARROW_BUFFER_H +#define ARROW_BUFFER_H #include #include @@ -105,7 +105,7 @@ class ARROW_EXPORT Buffer : public std::enable_shared_from_this<Buffer> { /// Construct a view on passed buffer at the indicated offset and length. This /// function cannot fail and does no error checking (except in debug builds) -ARROW_EXPORT std::shared_ptr<Buffer> SliceBuffer( +std::shared_ptr<Buffer> ARROW_EXPORT SliceBuffer( const std::shared_ptr<Buffer>& buffer, int64_t offset, int64_t length); /// A Buffer whose contents can be mutated. May or may not own its data. 
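
A short usage sketch for the two helpers defined above (the size and contents here are arbitrary):

#include <cstring>

std::shared_ptr<MutableBuffer> buffer;
Status st = AllocateBuffer(default_memory_pool(), 1024, &buffer);
// On success, buffer owns 1024 writable bytes from the pool
if (st.ok()) {
  std::memset(buffer->mutable_data(), 0, 1024);
  // Zero-copy view of bytes [16, 48); the slice holds a reference to the
  // parent, so the underlying memory stays alive as long as the slice does
  std::shared_ptr<Buffer> window = SliceBuffer(buffer, 16, 32);
}
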
@@ -232,6 +232,19 @@ class ARROW_EXPORT BufferBuilder { int64_t size_; }; +/// Allocate a new mutable buffer from a memory pool +/// +/// \param[in] pool a memory pool +/// \param[in] size size of buffer to allocate +/// \param[out] out the allocated buffer with padding +/// +/// \return Status message +Status ARROW_EXPORT AllocateBuffer( + MemoryPool* pool, int64_t size, std::shared_ptr* out); + +Status ARROW_EXPORT AllocateResizableBuffer( + MemoryPool* pool, int64_t size, std::shared_ptr* out); + } // namespace arrow -#endif // ARROW_UTIL_BUFFER_H +#endif // ARROW_BUFFER_H diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index b0dc41baf4202..dddadeee0dacf 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -185,7 +185,7 @@ Status PrimitiveBuilder::Finish(std::shared_ptr* out) { RETURN_NOT_OK(data_->Resize(bytes_required)); } *out = std::make_shared::ArrayType>( - type_, length_, data_, null_count_, null_bitmap_); + type_, length_, data_, null_bitmap_, null_count_); data_ = null_bitmap_ = nullptr; capacity_ = length_ = null_count_ = 0; @@ -202,10 +202,19 @@ template class PrimitiveBuilder; template class PrimitiveBuilder; template class PrimitiveBuilder; template class PrimitiveBuilder; +template class PrimitiveBuilder; template class PrimitiveBuilder; template class PrimitiveBuilder; template class PrimitiveBuilder; +BooleanBuilder::BooleanBuilder(MemoryPool* pool) + : ArrayBuilder(pool, boolean()), data_(nullptr), raw_data_(nullptr) {} + +BooleanBuilder::BooleanBuilder(MemoryPool* pool, const std::shared_ptr& type) + : BooleanBuilder(pool) { + DCHECK_EQ(Type::BOOL, type->type); +} + Status BooleanBuilder::Init(int32_t capacity) { RETURN_NOT_OK(ArrayBuilder::Init(capacity)); data_ = std::make_shared(pool_); @@ -244,7 +253,7 @@ Status BooleanBuilder::Finish(std::shared_ptr* out) { // Trim buffers RETURN_NOT_OK(data_->Resize(bytes_required)); } - *out = std::make_shared(type_, length_, data_, null_count_, null_bitmap_); + *out = std::make_shared(type_, length_, data_, null_bitmap_, null_count_); data_ = null_bitmap_ = nullptr; capacity_ = length_ = null_count_ = 0; @@ -313,7 +322,7 @@ Status ListBuilder::Finish(std::shared_ptr* out) { std::shared_ptr offsets = offset_builder_.Finish(); *out = std::make_shared( - type_, length_, offsets, items, null_count_, null_bitmap_); + type_, length_, offsets, items, null_bitmap_, null_count_); Reset(); @@ -333,14 +342,13 @@ std::shared_ptr ListBuilder::value_builder() const { // ---------------------------------------------------------------------- // String and binary -// This used to be a static member variable of BinaryBuilder, but it can cause -// valgrind to report a (spurious?) memory leak when needed in other shared -// libraries. 
The problem came up while adding explicit visibility to libarrow -// and libparquet_arrow -static TypePtr kBinaryValueType = TypePtr(new UInt8Type()); +BinaryBuilder::BinaryBuilder(MemoryPool* pool) + : ListBuilder(pool, std::make_shared(pool, uint8()), binary()) { + byte_builder_ = static_cast(value_builder_.get()); +} BinaryBuilder::BinaryBuilder(MemoryPool* pool, const TypePtr& type) - : ListBuilder(pool, std::make_shared(pool, kBinaryValueType), type) { + : ListBuilder(pool, std::make_shared(pool, uint8()), type) { byte_builder_ = static_cast(value_builder_.get()); } @@ -351,11 +359,13 @@ Status BinaryBuilder::Finish(std::shared_ptr* out) { const auto list = std::dynamic_pointer_cast(result); auto values = std::dynamic_pointer_cast(list->values()); - *out = std::make_shared(list->length(), list->offsets(), values->data(), - list->null_count(), list->null_bitmap()); + *out = std::make_shared(list->length(), list->value_offsets(), + values->data(), list->null_bitmap(), list->null_count()); return Status::OK(); } +StringBuilder::StringBuilder(MemoryPool* pool) : BinaryBuilder(pool, utf8()) {} + Status StringBuilder::Finish(std::shared_ptr* out) { std::shared_ptr result; RETURN_NOT_OK(ListBuilder::Finish(&result)); @@ -363,8 +373,8 @@ Status StringBuilder::Finish(std::shared_ptr* out) { const auto list = std::dynamic_pointer_cast(result); auto values = std::dynamic_pointer_cast(list->values()); - *out = std::make_shared(list->length(), list->offsets(), values->data(), - list->null_count(), list->null_bitmap()); + *out = std::make_shared(list->length(), list->value_offsets(), + values->data(), list->null_bitmap(), list->null_count()); return Status::OK(); } @@ -377,7 +387,7 @@ Status StructBuilder::Finish(std::shared_ptr* out) { RETURN_NOT_OK(field_builders_[i]->Finish(&fields[i])); } - *out = std::make_shared(type_, length_, fields, null_count_, null_bitmap_); + *out = std::make_shared(type_, length_, fields, null_bitmap_, null_count_); null_bitmap_ = nullptr; capacity_ = length_ = null_count_ = 0; @@ -393,9 +403,9 @@ std::shared_ptr StructBuilder::field_builder(int pos) const { // ---------------------------------------------------------------------- // Helper functions -#define BUILDER_CASE(ENUM, BuilderType) \ - case Type::ENUM: \ - out->reset(new BuilderType(pool, type)); \ +#define BUILDER_CASE(ENUM, BuilderType) \ + case Type::ENUM: \ + out->reset(new BuilderType(pool)); \ return Status::OK(); // Initially looked at doing this with vtables, but shared pointers makes it @@ -414,19 +424,17 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, BUILDER_CASE(UINT64, UInt64Builder); BUILDER_CASE(INT64, Int64Builder); BUILDER_CASE(DATE, DateBuilder); - BUILDER_CASE(TIMESTAMP, TimestampBuilder); - - BUILDER_CASE(BOOL, BooleanBuilder); - - BUILDER_CASE(FLOAT, FloatBuilder); - BUILDER_CASE(DOUBLE, DoubleBuilder); - - case Type::STRING: - out->reset(new StringBuilder(pool)); + case Type::TIMESTAMP: + out->reset(new TimestampBuilder(pool, type)); return Status::OK(); - case Type::BINARY: - out->reset(new BinaryBuilder(pool, type)); + case Type::TIME: + out->reset(new TimeBuilder(pool, type)); return Status::OK(); + BUILDER_CASE(BOOL, BooleanBuilder); + BUILDER_CASE(FLOAT, FloatBuilder); + BUILDER_CASE(DOUBLE, DoubleBuilder); + BUILDER_CASE(STRING, StringBuilder); + BUILDER_CASE(BINARY, BinaryBuilder); case Type::LIST: { std::shared_ptr value_builder; std::shared_ptr value_type = diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index 672d2d8f23e8f..0b83b9f3c6862 
100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -141,9 +141,7 @@ class ARROW_EXPORT PrimitiveBuilder : public ArrayBuilder { using value_type = typename Type::c_type; explicit PrimitiveBuilder(MemoryPool* pool, const TypePtr& type) - : ArrayBuilder(pool, type), data_(nullptr) {} - - virtual ~PrimitiveBuilder() {} + : ArrayBuilder(pool, type), data_(nullptr), raw_data_(nullptr) {} using ArrayBuilder::Advance; @@ -233,6 +231,7 @@ using Int16Builder = NumericBuilder; using Int32Builder = NumericBuilder; using Int64Builder = NumericBuilder; using TimestampBuilder = NumericBuilder; +using TimeBuilder = NumericBuilder; using DateBuilder = NumericBuilder; using HalfFloatBuilder = NumericBuilder; @@ -241,10 +240,8 @@ using DoubleBuilder = NumericBuilder; class ARROW_EXPORT BooleanBuilder : public ArrayBuilder { public: - explicit BooleanBuilder(MemoryPool* pool, const TypePtr& type = boolean()) - : ArrayBuilder(pool, type), data_(nullptr) {} - - virtual ~BooleanBuilder() {} + explicit BooleanBuilder(MemoryPool* pool); + explicit BooleanBuilder(MemoryPool* pool, const std::shared_ptr& type); using ArrayBuilder::Advance; @@ -321,8 +318,6 @@ class ARROW_EXPORT ListBuilder : public ArrayBuilder { ListBuilder( MemoryPool* pool, std::shared_ptr values, const TypePtr& type = nullptr); - virtual ~ListBuilder() {} - Status Init(int32_t elements) override; Status Resize(int32_t capacity) override; Status Finish(std::shared_ptr* out) override; @@ -368,8 +363,8 @@ class ARROW_EXPORT ListBuilder : public ArrayBuilder { // BinaryBuilder : public ListBuilder class ARROW_EXPORT BinaryBuilder : public ListBuilder { public: + explicit BinaryBuilder(MemoryPool* pool); explicit BinaryBuilder(MemoryPool* pool, const TypePtr& type); - virtual ~BinaryBuilder() {} Status Append(const uint8_t* value, int32_t length) { RETURN_NOT_OK(ListBuilder::Append()); @@ -391,11 +386,7 @@ class ARROW_EXPORT BinaryBuilder : public ListBuilder { // String builder class ARROW_EXPORT StringBuilder : public BinaryBuilder { public: - explicit StringBuilder(MemoryPool* pool = default_memory_pool()) - : BinaryBuilder(pool, utf8()) {} - - explicit StringBuilder(MemoryPool* pool, const std::shared_ptr& type) - : BinaryBuilder(pool, type) {} + explicit StringBuilder(MemoryPool* pool); using BinaryBuilder::Append; diff --git a/cpp/src/arrow/column-test.cc b/cpp/src/arrow/column-test.cc index 1e722ed7de0d6..0bbfc831f5cb9 100644 --- a/cpp/src/arrow/column-test.cc +++ b/cpp/src/arrow/column-test.cc @@ -51,7 +51,7 @@ TEST_F(TestChunkedArray, BasicEquals) { std::vector null_bitmap(100, true); std::vector data(100, 1); std::shared_ptr array; - ArrayFromVector(int32(), null_bitmap, data, &array); + ArrayFromVector(null_bitmap, data, &array); arrays_one_.push_back(array); arrays_another_.push_back(array); @@ -67,9 +67,9 @@ TEST_F(TestChunkedArray, EqualsDifferingTypes) { std::vector data32(100, 1); std::vector data64(100, 1); std::shared_ptr array; - ArrayFromVector(int32(), null_bitmap, data32, &array); + ArrayFromVector(null_bitmap, data32, &array); arrays_one_.push_back(array); - ArrayFromVector(int64(), null_bitmap, data64, &array); + ArrayFromVector(null_bitmap, data64, &array); arrays_another_.push_back(array); Construct(); @@ -83,9 +83,9 @@ TEST_F(TestChunkedArray, EqualsDifferingLengths) { std::vector data100(100, 1); std::vector data101(101, 1); std::shared_ptr array; - ArrayFromVector(int32(), null_bitmap100, data100, &array); + ArrayFromVector(null_bitmap100, data100, &array); arrays_one_.push_back(array); - 
ArrayFromVector(int32(), null_bitmap101, data101, &array); + ArrayFromVector(null_bitmap101, data101, &array); arrays_another_.push_back(array); Construct(); @@ -94,7 +94,7 @@ TEST_F(TestChunkedArray, EqualsDifferingLengths) { std::vector null_bitmap1(1, true); std::vector data1(1, 1); - ArrayFromVector(int32(), null_bitmap1, data1, &array); + ArrayFromVector(null_bitmap1, data1, &array); arrays_one_.push_back(array); Construct(); @@ -156,7 +156,7 @@ TEST_F(TestColumn, Equals) { std::vector null_bitmap(100, true); std::vector data(100, 1); std::shared_ptr array; - ArrayFromVector(int32(), null_bitmap, data, &array); + ArrayFromVector(null_bitmap, data, &array); arrays_one_.push_back(array); arrays_another_.push_back(array); diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc index d039bba08827c..27fad7135721c 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -76,10 +76,10 @@ class RangeEqualsVisitor : public ArrayVisitor { const bool is_null = left.IsNull(i); if (is_null != right.IsNull(o_i)) { return false; } if (is_null) continue; - const int32_t begin_offset = left.offset(i); - const int32_t end_offset = left.offset(i + 1); - const int32_t right_begin_offset = right.offset(o_i); - const int32_t right_end_offset = right.offset(o_i + 1); + const int32_t begin_offset = left.value_offset(i); + const int32_t end_offset = left.value_offset(i + 1); + const int32_t right_begin_offset = right.value_offset(o_i); + const int32_t right_end_offset = right.value_offset(o_i + 1); // Underlying can't be equal if the size isn't equal if (end_offset - begin_offset != right_end_offset - right_begin_offset) { return false; @@ -169,10 +169,10 @@ class RangeEqualsVisitor : public ArrayVisitor { const bool is_null = left.IsNull(i); if (is_null != right.IsNull(o_i)) { return false; } if (is_null) continue; - const int32_t begin_offset = left.offset(i); - const int32_t end_offset = left.offset(i + 1); - const int32_t right_begin_offset = right.offset(o_i); - const int32_t right_end_offset = right.offset(o_i + 1); + const int32_t begin_offset = left.value_offset(i); + const int32_t end_offset = left.value_offset(i + 1); + const int32_t right_begin_offset = right.value_offset(o_i); + const int32_t right_end_offset = right.value_offset(o_i + 1); // Underlying can't be equal if the size isn't equal if (end_offset - begin_offset != right_end_offset - right_begin_offset) { return false; @@ -200,7 +200,11 @@ class RangeEqualsVisitor : public ArrayVisitor { for (size_t j = 0; j < left.fields().size(); ++j) { // TODO: really we should be comparing stretches of non-null data rather // than looking at one value at a time. 
- equal_fields = left.field(j)->RangeEquals(i, i + 1, o_i, right.field(j)); + const int left_abs_index = i + left.offset(); + const int right_abs_index = o_i + right.offset(); + + equal_fields = left.field(j)->RangeEquals( + left_abs_index, left_abs_index + 1, right_abs_index, right.field(j)); if (!equal_fields) { return false; } } } @@ -223,7 +227,7 @@ class RangeEqualsVisitor : public ArrayVisitor { // Define a mapping from the type id to child number uint8_t max_code = 0; - const std::vector type_codes = left_type.type_ids; + const std::vector type_codes = left_type.type_codes; for (size_t i = 0; i < type_codes.size(); ++i) { const uint8_t code = type_codes[i]; if (code > max_code) { max_code = code; } @@ -248,15 +252,19 @@ class RangeEqualsVisitor : public ArrayVisitor { id = left_ids[i]; child_num = type_id_to_child_num[id]; + const int left_abs_index = i + left.offset(); + const int right_abs_index = o_i + right.offset(); + // TODO(wesm): really we should be comparing stretches of non-null data // rather than looking at one value at a time. if (union_mode == UnionMode::SPARSE) { - if (!left.child(child_num)->RangeEquals(i, i + 1, o_i, right.child(child_num))) { + if (!left.child(child_num)->RangeEquals(left_abs_index, left_abs_index + 1, + right_abs_index, right.child(child_num))) { return false; } } else { - const int32_t offset = left.raw_offsets()[i]; - const int32_t o_offset = right.raw_offsets()[i]; + const int32_t offset = left.raw_value_offsets()[i]; + const int32_t o_offset = right.raw_value_offsets()[i]; if (!left.child(child_num)->RangeEquals( offset, offset + 1, o_offset, right.child(child_num))) { return false; @@ -315,20 +323,29 @@ class EqualsVisitor : public RangeEqualsVisitor { } result_ = true; } else { - result_ = left.data()->Equals(*right.data(), BitUtil::BytesForBits(left.length())); + result_ = BitmapEquals(left.data()->data(), left.offset(), right.data()->data(), + right.offset(), left.length()); } return Status::OK(); } bool IsEqualPrimitive(const PrimitiveArray& left) { const auto& right = static_cast(right_); - if (left.null_count() > 0) { - const uint8_t* left_data = left.data()->data(); - const uint8_t* right_data = right.data()->data(); - const auto& size_meta = dynamic_cast(*left.type()); - const int value_byte_size = size_meta.bit_width() / 8; - DCHECK_GT(value_byte_size, 0); + const auto& size_meta = dynamic_cast(*left.type()); + const int value_byte_size = size_meta.bit_width() / 8; + DCHECK_GT(value_byte_size, 0); + + const uint8_t* left_data = nullptr; + if (left.length() > 0) { + left_data = left.data()->data() + left.offset() * value_byte_size; + } + + const uint8_t* right_data = nullptr; + if (right.length() > 0) { + right_data = right.data()->data() + right.offset() * value_byte_size; + } + if (left.null_count() > 0) { for (int i = 0; i < left.length(); ++i) { if (!left.IsNull(i) && memcmp(left_data, right_data, value_byte_size)) { return false; @@ -339,7 +356,7 @@ class EqualsVisitor : public RangeEqualsVisitor { return true; } else { if (left.length() == 0) { return true; } - return left.data()->Equals(*right.data(), left.length()); + return memcmp(left_data, right_data, value_byte_size * left.length()) == 0; } } @@ -376,13 +393,46 @@ class EqualsVisitor : public RangeEqualsVisitor { Status Visit(const IntervalArray& left) override { return ComparePrimitive(left); } + template + bool ValueOffsetsEqual(const ArrayType& left) { + const auto& right = static_cast(right_); + + if (left.offset() == 0 && right.offset() == 0) { + return 
left.value_offsets()->Equals( + *right.value_offsets(), (left.length() + 1) * sizeof(int32_t)); + } else { + // One of the arrays is sliced; logic is more complicated because the + // value offsets are not both 0-based + auto left_offsets = + reinterpret_cast(left.value_offsets()->data()) + left.offset(); + auto right_offsets = + reinterpret_cast(right.value_offsets()->data()) + + right.offset(); + + for (int32_t i = 0; i < left.length() + 1; ++i) { + if (left_offsets[i] - left_offsets[0] != right_offsets[i] - right_offsets[0]) { + return false; + } + } + return true; + } + } + bool CompareBinary(const BinaryArray& left) { const auto& right = static_cast(right_); - bool equal_offsets = - left.offsets()->Equals(*right.offsets(), (left.length() + 1) * sizeof(int32_t)); + + bool equal_offsets = ValueOffsetsEqual(left); if (!equal_offsets) { return false; } - if (!left.data() && !(right.data())) { return true; } - return left.data()->Equals(*right.data(), left.raw_offsets()[left.length()]); + + if (left.offset() == 0 && right.offset() == 0) { + if (!left.data() && !(right.data())) { return true; } + return left.data()->Equals(*right.data(), left.raw_value_offsets()[left.length()]); + } else { + // Compare the corresponding data range + const int64_t total_bytes = left.value_offset(left.length()) - left.value_offset(0); + return std::memcmp(left.data()->data() + left.value_offset(0), + right.data()->data() + right.value_offset(0), total_bytes) == 0; + } } Status Visit(const StringArray& left) override { @@ -397,12 +447,20 @@ class EqualsVisitor : public RangeEqualsVisitor { Status Visit(const ListArray& left) override { const auto& right = static_cast(right_); - if (!left.offsets()->Equals( - *right.offsets(), (left.length() + 1) * sizeof(int32_t))) { + bool equal_offsets = ValueOffsetsEqual(left); + if (!equal_offsets) { result_ = false; - } else { + return Status::OK(); + } + + if (left.offset() == 0 && right.offset() == 0) { result_ = left.values()->Equals(right.values()); + } else { + // One of the arrays is sliced + result_ = left.values()->RangeEquals(left.value_offset(0), + left.value_offset(left.length()), right.value_offset(0), right.values()); } + return Status::OK(); } @@ -422,8 +480,8 @@ inline bool FloatingApproxEquals( const NumericArray& left, const NumericArray& right) { using T = typename TYPE::c_type; - auto left_data = reinterpret_cast(left.data()->data()); - auto right_data = reinterpret_cast(right.data()->data()); + const T* left_data = left.raw_data(); + const T* right_data = right.raw_data(); static constexpr T EPSILON = 1E-5; @@ -465,8 +523,8 @@ static bool BaseDataEquals(const Array& left, const Array& right) { return false; } if (left.null_count() > 0) { - return left.null_bitmap()->Equals( - *right.null_bitmap(), BitUtil::BytesForBits(left.length())); + return BitmapEquals(left.null_bitmap()->data(), left.offset(), + right.null_bitmap()->data(), right.offset(), left.length()); } return true; } diff --git a/cpp/src/arrow/io/file.cc b/cpp/src/arrow/io/file.cc index ff58e539b9353..c1f0ea48a9dc9 100644 --- a/cpp/src/arrow/io/file.cc +++ b/cpp/src/arrow/io/file.cc @@ -401,8 +401,8 @@ class ReadableFile::ReadableFileImpl : public OSFile { Status Open(const std::string& path) { return OpenReadable(path); } Status ReadBuffer(int64_t nbytes, std::shared_ptr* out) { - auto buffer = std::make_shared(pool_); - RETURN_NOT_OK(buffer->Resize(nbytes)); + std::shared_ptr buffer; + RETURN_NOT_OK(AllocateResizableBuffer(pool_, nbytes, &buffer)); int64_t bytes_read = 0; 
RETURN_NOT_OK(Read(nbytes, &bytes_read, buffer->mutable_data())); diff --git a/cpp/src/arrow/io/hdfs.cc b/cpp/src/arrow/io/hdfs.cc index 2845b0d8c448c..5682f44b94a46 100644 --- a/cpp/src/arrow/io/hdfs.cc +++ b/cpp/src/arrow/io/hdfs.cc @@ -125,8 +125,8 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { } Status ReadAt(int64_t position, int64_t nbytes, std::shared_ptr* out) { - auto buffer = std::make_shared(pool_); - RETURN_NOT_OK(buffer->Resize(nbytes)); + std::shared_ptr buffer; + RETURN_NOT_OK(AllocateResizableBuffer(pool_, nbytes, &buffer)); int64_t bytes_read = 0; RETURN_NOT_OK(ReadAt(position, nbytes, &bytes_read, buffer->mutable_data())); @@ -152,8 +152,8 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { } Status Read(int64_t nbytes, std::shared_ptr* out) { - auto buffer = std::make_shared(pool_); - RETURN_NOT_OK(buffer->Resize(nbytes)); + std::shared_ptr buffer; + RETURN_NOT_OK(AllocateResizableBuffer(pool_, nbytes, &buffer)); int64_t bytes_read = 0; RETURN_NOT_OK(Read(nbytes, &bytes_read, buffer->mutable_data())); diff --git a/cpp/src/arrow/io/io-hdfs-test.cc b/cpp/src/arrow/io/io-hdfs-test.cc index 72e0ba8f2987b..f0e5a280d3116 100644 --- a/cpp/src/arrow/io/io-hdfs-test.cc +++ b/cpp/src/arrow/io/io-hdfs-test.cc @@ -336,8 +336,9 @@ TYPED_TEST(TestHdfsClient, LargeFile) { std::shared_ptr file; ASSERT_OK(this->client_->OpenReadable(path, &file)); - auto buffer = std::make_shared(); - ASSERT_OK(buffer->Resize(size)); + std::shared_ptr buffer; + ASSERT_OK(AllocateBuffer(nullptr, size, &buffer)); + int64_t bytes_read = 0; ASSERT_OK(file->Read(size, &bytes_read, buffer->mutable_data())); @@ -348,8 +349,9 @@ TYPED_TEST(TestHdfsClient, LargeFile) { std::shared_ptr file2; ASSERT_OK(this->client_->OpenReadable(path, 1 << 18, &file2)); - auto buffer2 = std::make_shared(); - ASSERT_OK(buffer2->Resize(size)); + std::shared_ptr buffer2; + ASSERT_OK(AllocateBuffer(nullptr, size, &buffer2)); + ASSERT_OK(file2->Read(size, &bytes_read, buffer2->mutable_data())); ASSERT_EQ(0, std::memcmp(buffer2->data(), data.data(), size)); ASSERT_EQ(size, bytes_read); diff --git a/cpp/src/arrow/io/io-memory-test.cc b/cpp/src/arrow/io/io-memory-test.cc index c0b01653cb128..442cd0c4bbccd 100644 --- a/cpp/src/arrow/io/io-memory-test.cc +++ b/cpp/src/arrow/io/io-memory-test.cc @@ -73,8 +73,8 @@ TEST(TestBufferReader, RetainParentReference) { std::shared_ptr slice1; std::shared_ptr slice2; { - auto buffer = std::make_shared(); - ASSERT_OK(buffer->Resize(static_cast(data.size()))); + std::shared_ptr buffer; + ASSERT_OK(AllocateBuffer(nullptr, static_cast(data.size()), &buffer)); std::memcpy(buffer->mutable_data(), data.c_str(), data.size()); BufferReader reader(buffer); ASSERT_OK(reader.Read(4, &slice1)); diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index c8e631c564b22..3613ccbadbbab 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -17,6 +17,7 @@ #include "arrow/ipc/adapter.h" +#include #include #include #include @@ -30,6 +31,7 @@ #include "arrow/ipc/metadata-internal.h" #include "arrow/ipc/metadata.h" #include "arrow/ipc/util.h" +#include "arrow/memory_pool.h" #include "arrow/schema.h" #include "arrow/status.h" #include "arrow/table.h" @@ -49,9 +51,10 @@ namespace ipc { class RecordBatchWriter : public ArrayVisitor { public: - RecordBatchWriter( - const RecordBatch& batch, int64_t buffer_start_offset, int max_recursion_depth) - : batch_(batch), + RecordBatchWriter(MemoryPool* pool, const RecordBatch& batch, + 
int64_t buffer_start_offset, int max_recursion_depth) + : pool_(pool), + batch_(batch), max_recursion_depth_(max_recursion_depth), buffer_start_offset_(buffer_start_offset) {} @@ -62,7 +65,15 @@ class RecordBatchWriter : public ArrayVisitor { // push back all common elements field_nodes_.push_back(flatbuf::FieldNode(arr.length(), arr.null_count())); if (arr.null_count() > 0) { - buffers_.push_back(arr.null_bitmap()); + std::shared_ptr bitmap = arr.null_bitmap(); + + if (arr.offset() != 0) { + // With a sliced array / non-zero offset, we must copy the bitmap + RETURN_NOT_OK( + CopyBitmap(pool_, bitmap->data(), arr.offset(), arr.length(), &bitmap)); + } + + buffers_.push_back(bitmap); } else { // Push a dummy zero-length buffer, not to be copied buffers_.push_back(std::make_shared(nullptr, 0)); @@ -208,50 +219,136 @@ class RecordBatchWriter : public ArrayVisitor { private: Status Visit(const NullArray& array) override { return Status::NotImplemented("null"); } - Status VisitPrimitive(const PrimitiveArray& array) { - buffers_.push_back(array.data()); + template + Status VisitFixedWidth(const ArrayType& array) { + std::shared_ptr data_buffer = array.data(); + + if (array.offset() != 0) { + // Non-zero offset, slice the buffer + const auto& fw_type = static_cast(*array.type()); + const int type_width = fw_type.bit_width() / 8; + const int64_t byte_offset = array.offset() * type_width; + + // Send padding if it's available + const int64_t buffer_length = + std::min(BitUtil::RoundUpToMultipleOf64(array.length() * type_width), + data_buffer->size() - byte_offset); + data_buffer = SliceBuffer(data_buffer, byte_offset, buffer_length); + } + buffers_.push_back(data_buffer); + return Status::OK(); + } + + template + Status GetZeroBasedValueOffsets( + const ArrayType& array, std::shared_ptr* value_offsets) { + // Share slicing logic between ListArray and BinaryArray + + auto offsets = array.value_offsets(); + + if (array.offset() != 0) { + // If we have a non-zero offset, then the value offsets do not start at + // zero. 
We must a) create a new offsets array with shifted offsets and + // b) slice the values array accordingly + + std::shared_ptr shifted_offsets; + RETURN_NOT_OK(AllocateBuffer( + pool_, sizeof(int32_t) * (array.length() + 1), &shifted_offsets)); + + int32_t* dest_offsets = reinterpret_cast(shifted_offsets->mutable_data()); + const int32_t start_offset = array.value_offset(0); + + for (int i = 0; i < array.length(); ++i) { + dest_offsets[i] = array.value_offset(i) - start_offset; + } + // Final offset + dest_offsets[array.length()] = array.value_offset(array.length()) - start_offset; + offsets = shifted_offsets; + } + + *value_offsets = offsets; return Status::OK(); } Status VisitBinary(const BinaryArray& array) { - buffers_.push_back(array.offsets()); - buffers_.push_back(array.data()); + std::shared_ptr value_offsets; + RETURN_NOT_OK(GetZeroBasedValueOffsets(array, &value_offsets)); + auto data = array.data(); + + if (array.offset() != 0) { + // Slice the data buffer to include only the range we need now + data = SliceBuffer(data, array.value_offset(0), array.value_offset(array.length())); + } + + buffers_.push_back(value_offsets); + buffers_.push_back(data); return Status::OK(); } - Status Visit(const BooleanArray& array) override { return VisitPrimitive(array); } + Status Visit(const BooleanArray& array) override { + buffers_.push_back(array.data()); + return Status::OK(); + } - Status Visit(const Int8Array& array) override { return VisitPrimitive(array); } + Status Visit(const Int8Array& array) override { + return VisitFixedWidth(array); + } - Status Visit(const Int16Array& array) override { return VisitPrimitive(array); } + Status Visit(const Int16Array& array) override { + return VisitFixedWidth(array); + } - Status Visit(const Int32Array& array) override { return VisitPrimitive(array); } + Status Visit(const Int32Array& array) override { + return VisitFixedWidth(array); + } - Status Visit(const Int64Array& array) override { return VisitPrimitive(array); } + Status Visit(const Int64Array& array) override { + return VisitFixedWidth(array); + } - Status Visit(const UInt8Array& array) override { return VisitPrimitive(array); } + Status Visit(const UInt8Array& array) override { + return VisitFixedWidth(array); + } - Status Visit(const UInt16Array& array) override { return VisitPrimitive(array); } + Status Visit(const UInt16Array& array) override { + return VisitFixedWidth(array); + } - Status Visit(const UInt32Array& array) override { return VisitPrimitive(array); } + Status Visit(const UInt32Array& array) override { + return VisitFixedWidth(array); + } - Status Visit(const UInt64Array& array) override { return VisitPrimitive(array); } + Status Visit(const UInt64Array& array) override { + return VisitFixedWidth(array); + } - Status Visit(const HalfFloatArray& array) override { return VisitPrimitive(array); } + Status Visit(const HalfFloatArray& array) override { + return VisitFixedWidth(array); + } - Status Visit(const FloatArray& array) override { return VisitPrimitive(array); } + Status Visit(const FloatArray& array) override { + return VisitFixedWidth(array); + } - Status Visit(const DoubleArray& array) override { return VisitPrimitive(array); } + Status Visit(const DoubleArray& array) override { + return VisitFixedWidth(array); + } Status Visit(const StringArray& array) override { return VisitBinary(array); } Status Visit(const BinaryArray& array) override { return VisitBinary(array); } - Status Visit(const DateArray& array) override { return VisitPrimitive(array); } + Status 
Visit(const DateArray& array) override { + return VisitFixedWidth(array); + } - Status Visit(const TimeArray& array) override { return VisitPrimitive(array); } + Status Visit(const TimeArray& array) override { + return VisitFixedWidth(array); + } - Status Visit(const TimestampArray& array) override { return VisitPrimitive(array); } + Status Visit(const TimestampArray& array) override { + return VisitFixedWidth(array); + } Status Visit(const IntervalArray& array) override { return Status::NotImplemented("interval"); @@ -262,30 +359,109 @@ class RecordBatchWriter : public ArrayVisitor { } Status Visit(const ListArray& array) override { - buffers_.push_back(array.offsets()); + std::shared_ptr value_offsets; + RETURN_NOT_OK(GetZeroBasedValueOffsets(array, &value_offsets)); + buffers_.push_back(value_offsets); + --max_recursion_depth_; - RETURN_NOT_OK(VisitArray(*array.values().get())); + std::shared_ptr values = array.values(); + + if (array.offset() != 0) { + // For non-zero offset, we slice the values array accordingly + const int32_t offset = array.value_offset(0); + const int32_t length = array.value_offset(array.length()) - offset; + values = values->Slice(offset, length); + } + RETURN_NOT_OK(VisitArray(*values)); ++max_recursion_depth_; return Status::OK(); } Status Visit(const StructArray& array) override { --max_recursion_depth_; - for (const auto& field : array.fields()) { - RETURN_NOT_OK(VisitArray(*field.get())); + for (std::shared_ptr field : array.fields()) { + if (array.offset() != 0) { + // If offset is non-zero, slice the child array + field = field->Slice(array.offset(), array.length()); + } + RETURN_NOT_OK(VisitArray(*field)); } ++max_recursion_depth_; return Status::OK(); } Status Visit(const UnionArray& array) override { - buffers_.push_back(array.type_ids()); + auto type_ids = array.type_ids(); + if (array.offset() != 0) { + type_ids = SliceBuffer(type_ids, array.offset() * sizeof(UnionArray::type_id_t), + array.length() * sizeof(UnionArray::type_id_t)); + } - if (array.mode() == UnionMode::DENSE) { buffers_.push_back(array.offsets()); } + buffers_.push_back(type_ids); --max_recursion_depth_; - for (const auto& field : array.children()) { - RETURN_NOT_OK(VisitArray(*field.get())); + if (array.mode() == UnionMode::DENSE) { + const auto& type = static_cast(*array.type()); + auto value_offsets = array.value_offsets(); + + // The Union type codes are not necessary 0-indexed + uint8_t max_code = 0; + for (uint8_t code : type.type_codes) { + if (code > max_code) { max_code = code; } + } + + // Allocate an array of child offsets. Set all to -1 to indicate that we + // haven't observed a first occurrence of a particular child yet + std::vector child_offsets(max_code + 1); + std::vector child_lengths(max_code + 1, 0); + + if (array.offset() != 0) { + // This is an unpleasant case. 
Because the offsets are different for + // each child array, when we have a sliced array, we need to "rebase" + // the value_offsets for each array + + const int32_t* unshifted_offsets = array.raw_value_offsets(); + const uint8_t* type_ids = array.raw_type_ids(); + + // Allocate the shifted offsets + std::shared_ptr shifted_offsets_buffer; + RETURN_NOT_OK(AllocateBuffer( + pool_, array.length() * sizeof(int32_t), &shifted_offsets_buffer)); + int32_t* shifted_offsets = + reinterpret_cast(shifted_offsets_buffer->mutable_data()); + + for (int32_t i = 0; i < array.length(); ++i) { + const uint8_t code = type_ids[i]; + int32_t shift = child_offsets[code]; + if (shift == -1) { child_offsets[code] = shift = unshifted_offsets[i]; } + shifted_offsets[i] = unshifted_offsets[i] - shift; + + // Update the child length to account for observed value + ++child_lengths[code]; + } + + value_offsets = shifted_offsets_buffer; + } + buffers_.push_back(value_offsets); + + // Visit children and slice accordingly + for (int i = 0; i < type.num_children(); ++i) { + std::shared_ptr child = array.child(i); + if (array.offset() != 0) { + const uint8_t code = type.type_codes[i]; + child = child->Slice(child_offsets[code], child_lengths[code]); + } + RETURN_NOT_OK(VisitArray(*child)); + } + } else { + for (std::shared_ptr child : array.children()) { + // Sparse union, slicing is simpler + if (array.offset() != 0) { + // If offset is non-zero, slice the child array + child = child->Slice(array.offset(), array.length()); + } + RETURN_NOT_OK(VisitArray(*child)); + } } ++max_recursion_depth_; return Status::OK(); @@ -298,6 +474,8 @@ class RecordBatchWriter : public ArrayVisitor { return Status::OK(); } + // In some cases, intermediate buffers may need to be allocated (with sliced arrays) + MemoryPool* pool_; const RecordBatch& batch_; std::vector field_nodes_; @@ -310,14 +488,14 @@ class RecordBatchWriter : public ArrayVisitor { Status WriteRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, - int max_recursion_depth) { + MemoryPool* pool, int max_recursion_depth) { DCHECK_GT(max_recursion_depth, 0); - RecordBatchWriter serializer(batch, buffer_start_offset, max_recursion_depth); + RecordBatchWriter serializer(pool, batch, buffer_start_offset, max_recursion_depth); return serializer.Write(dst, metadata_length, body_length); } Status GetRecordBatchSize(const RecordBatch& batch, int64_t* size) { - RecordBatchWriter serializer(batch, 0, kMaxIpcRecursionDepth); + RecordBatchWriter serializer(default_memory_pool(), batch, 0, kMaxIpcRecursionDepth); RETURN_NOT_OK(serializer.GetTotalSize(size)); return Status::OK(); } @@ -372,7 +550,7 @@ class ArrayLoader : public TypeVisitor { BufferMetadata metadata = context_->metadata->buffer(buffer_index); if (metadata.length == 0) { - *out = std::make_shared(nullptr, 0); + *out = nullptr; return Status::OK(); } else { return file_->ReadAt(metadata.offset, metadata.length, out); @@ -412,8 +590,8 @@ class ArrayLoader : public TypeVisitor { context_->buffer_index++; data.reset(new Buffer(nullptr, 0)); } - return MakePrimitiveArray(field_.type, field_meta.length, data, field_meta.null_count, - null_bitmap, &result_); + return MakePrimitiveArray(field_.type, field_meta.length, data, null_bitmap, + field_meta.null_count, 0, &result_); } template @@ -428,7 +606,7 @@ class ArrayLoader : public TypeVisitor { RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &values)); result_ = std::make_shared( - 
field_meta.length, offsets, values, field_meta.null_count, null_bitmap); + field_meta.length, offsets, values, null_bitmap, field_meta.null_count); return Status::OK(); } @@ -496,7 +674,7 @@ class ArrayLoader : public TypeVisitor { RETURN_NOT_OK(LoadChild(*type.child(0).get(), &values_array)); result_ = std::make_shared(field_.type, field_meta.length, offsets, - values_array, field_meta.null_count, null_bitmap); + values_array, null_bitmap, field_meta.null_count); return Status::OK(); } @@ -521,7 +699,7 @@ class ArrayLoader : public TypeVisitor { RETURN_NOT_OK(LoadChildren(type.children(), &fields)); result_ = std::make_shared( - field_.type, field_meta.length, fields, field_meta.null_count, null_bitmap); + field_.type, field_meta.length, fields, null_bitmap, field_meta.null_count); return Status::OK(); } @@ -542,7 +720,7 @@ class ArrayLoader : public TypeVisitor { RETURN_NOT_OK(LoadChildren(type.children(), &fields)); result_ = std::make_shared(field_.type, field_meta.length, fields, - type_ids, offsets, field_meta.null_count, null_bitmap); + type_ids, offsets, null_bitmap, field_meta.null_count); return Status::OK(); } diff --git a/cpp/src/arrow/ipc/adapter.h b/cpp/src/arrow/ipc/adapter.h index f9ef7d9fe1202..83542d0b066d4 100644 --- a/cpp/src/arrow/ipc/adapter.h +++ b/cpp/src/arrow/ipc/adapter.h @@ -30,6 +30,7 @@ namespace arrow { class Array; +class MemoryPool; class RecordBatch; class Schema; class Status; @@ -71,14 +72,15 @@ constexpr int kMaxIpcRecursionDepth = 64; // // @param(out) body_length: the size of the contiguous buffer block plus // padding bytes -ARROW_EXPORT Status WriteRecordBatch(const RecordBatch& batch, +Status ARROW_EXPORT WriteRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, - int64_t* body_length, int max_recursion_depth = kMaxIpcRecursionDepth); + int64_t* body_length, MemoryPool* pool, + int max_recursion_depth = kMaxIpcRecursionDepth); // Compute the precise number of bytes needed in a contiguous memory segment to // write the record batch. This involves generating the complete serialized // Flatbuffers metadata. 
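// Editor's note: WriteRecordBatch now threads a MemoryPool through the writer
// because sliced arrays may require intermediate allocations (copied null
// bitmaps, rebased value offsets). A hedged sketch of a call site under the
// new signature; `stream` and `batch` are assumed, not defined in this patch:
//
//   int32_t metadata_length = 0;
//   int64_t body_length = 0;
//   RETURN_NOT_OK(WriteRecordBatch(*batch, /*buffer_start_offset=*/0, stream,
//       &metadata_length, &body_length, default_memory_pool()));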
-ARROW_EXPORT Status GetRecordBatchSize(const RecordBatch& batch, int64_t* size); +Status ARROW_EXPORT GetRecordBatchSize(const RecordBatch& batch, int64_t* size); // ---------------------------------------------------------------------- // "Read" path; does not copy data if the input supports zero copy reads diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc index 17868f8f1029e..bae6578f110f2 100644 --- a/cpp/src/arrow/ipc/ipc-adapter-test.cc +++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc @@ -32,6 +32,7 @@ #include "arrow/buffer.h" #include "arrow/memory_pool.h" +#include "arrow/pretty_print.h" #include "arrow/status.h" #include "arrow/test-util.h" #include "arrow/util/bit-util.h" @@ -56,7 +57,7 @@ class TestWriteRecordBatch : public ::testing::TestWithParam, const int64_t buffer_offset = 0; RETURN_NOT_OK(WriteRecordBatch( - batch, buffer_offset, mmap_.get(), &metadata_length, &body_length)); + batch, buffer_offset, mmap_.get(), &metadata_length, &body_length, pool_)); std::shared_ptr metadata; RETURN_NOT_OK(ReadRecordBatchMetadata(0, metadata_length, mmap_.get(), &metadata)); @@ -92,17 +93,49 @@ TEST_P(TestWriteRecordBatch, RoundTrip) { } } -INSTANTIATE_TEST_CASE_P(RoundTripTests, TestWriteRecordBatch, - ::testing::Values(&MakeIntRecordBatch, &MakeListRecordBatch, &MakeNonNullRecordBatch, - &MakeZeroLengthRecordBatch, &MakeDeeplyNestedList, - &MakeStringTypesRecordBatch, &MakeStruct, &MakeUnion)); +TEST_P(TestWriteRecordBatch, SliceRoundTrip) { + std::shared_ptr batch; + ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue + std::shared_ptr batch_result; + + // Skip the zero-length case + if (batch->num_rows() < 2) { return; } + + auto sliced_batch = batch->Slice(2, 10); + + ASSERT_OK(RoundTripHelper(*sliced_batch, 1 << 16, &batch_result)); + + EXPECT_EQ(sliced_batch->num_rows(), batch_result->num_rows()); + + for (int i = 0; i < sliced_batch->num_columns(); ++i) { + const auto& left = *sliced_batch->column(i); + const auto& right = *batch_result->column(i); + if (!left.Equals(right)) { + std::stringstream pp_result; + std::stringstream pp_expected; + + ASSERT_OK(PrettyPrint(left, 0, &pp_expected)); + ASSERT_OK(PrettyPrint(right, 0, &pp_result)); + + FAIL() << "Index: " << i << " Expected: " << pp_expected.str() + << "\nGot: " << pp_result.str(); + } + } +} + +INSTANTIATE_TEST_CASE_P( + RoundTripTests, TestWriteRecordBatch, + ::testing::Values(&MakeIntRecordBatch, &MakeStringTypesRecordBatch, + &MakeNonNullRecordBatch, &MakeZeroLengthRecordBatch, &MakeListRecordBatch, + &MakeDeeplyNestedList, &MakeStruct, &MakeUnion)); void TestGetRecordBatchSize(std::shared_ptr batch) { ipc::MockOutputStream mock; int32_t mock_metadata_length = -1; int64_t mock_body_length = -1; int64_t size = -1; - ASSERT_OK(WriteRecordBatch(*batch, 0, &mock, &mock_metadata_length, &mock_body_length)); + ASSERT_OK(WriteRecordBatch( + *batch, 0, &mock, &mock_metadata_length, &mock_body_length, default_memory_pool())); ASSERT_OK(GetRecordBatchSize(*batch, &size)); ASSERT_EQ(mock.GetExtentBytesWritten(), size); } @@ -156,10 +189,11 @@ class RecursionLimits : public ::testing::Test, public io::MemoryMapFixture { io::MemoryMapFixture::InitMemoryMap(memory_map_size, path, &mmap_); if (override_level) { - return WriteRecordBatch( - *batch, 0, mmap_.get(), metadata_length, body_length, recursion_level + 1); + return WriteRecordBatch(*batch, 0, mmap_.get(), metadata_length, body_length, pool_, + recursion_level + 1); } else { - return WriteRecordBatch(*batch, 0, mmap_.get(), 
metadata_length, body_length); + return WriteRecordBatch( + *batch, 0, mmap_.get(), metadata_length, body_length, pool_); } } diff --git a/cpp/src/arrow/ipc/ipc-json-test.cc b/cpp/src/arrow/ipc/ipc-json-test.cc index 30f968c2bfd8b..3e759cccbbccc 100644 --- a/cpp/src/arrow/ipc/ipc-json-test.cc +++ b/cpp/src/arrow/ipc/ipc-json-test.cc @@ -80,7 +80,7 @@ template void CheckPrimitive(const std::shared_ptr& type, const std::vector& is_valid, const std::vector& values) { MemoryPool* pool = default_memory_pool(); - typename TypeTraits::BuilderType builder(pool, type); + typename TypeTraits::BuilderType builder(pool); for (size_t i = 0; i < values.size(); ++i) { if (is_valid[i]) { @@ -146,12 +146,11 @@ TEST(TestJsonArrayWriter, NestedTypes) { std::vector values = {0, 1, 2, 3, 4, 5, 6}; std::shared_ptr values_array; - ArrayFromVector(int32(), values_is_valid, values, &values_array); + ArrayFromVector(values_is_valid, values, &values_array); std::vector i16_values = {0, 1, 2, 3, 4, 5, 6}; std::shared_ptr i16_values_array; - ArrayFromVector( - int16(), values_is_valid, i16_values, &i16_values_array); + ArrayFromVector(values_is_valid, i16_values, &i16_values_array); // List std::vector list_is_valid = {true, false, true, true, true}; @@ -161,7 +160,7 @@ TEST(TestJsonArrayWriter, NestedTypes) { ASSERT_OK(test::GetBitmapFromBoolVector(list_is_valid, &list_bitmap)); std::shared_ptr offsets_buffer = test::GetBufferFromVector(offsets); - ListArray list_array(list(value_type), 5, offsets_buffer, values_array, 1, list_bitmap); + ListArray list_array(list(value_type), 5, offsets_buffer, values_array, list_bitmap, 1); TestArrayRoundTrip(list_array); @@ -175,7 +174,7 @@ TEST(TestJsonArrayWriter, NestedTypes) { std::vector> fields = {values_array, values_array, values_array}; StructArray struct_array( - struct_type, static_cast(struct_is_valid.size()), fields, 2, struct_bitmap); + struct_type, static_cast(struct_is_valid.size()), fields, struct_bitmap, 2); TestArrayRoundTrip(struct_array); } @@ -202,15 +201,15 @@ void MakeBatchArrays(const std::shared_ptr& schema, const int num_rows, test::randint(num_rows, 0, 100, &v2_values); std::shared_ptr v1; - ArrayFromVector(schema->field(0)->type, is_valid, v1_values, &v1); + ArrayFromVector(is_valid, v1_values, &v1); std::shared_ptr v2; - ArrayFromVector(schema->field(1)->type, is_valid, v2_values, &v2); + ArrayFromVector(is_valid, v2_values, &v2); static const int kBufferSize = 10; static uint8_t buffer[kBufferSize]; static uint32_t seed = 0; - StringBuilder string_builder(default_memory_pool(), utf8()); + StringBuilder string_builder(default_memory_pool()); for (int i = 0; i < num_rows; ++i) { if (!is_valid[i]) { string_builder.AppendNull(); @@ -338,13 +337,13 @@ TEST(TestJsonFileReadWrite, MinimalFormatExample) { std::vector foo_valid = {true, false, true, true, true}; std::vector foo_values = {1, 2, 3, 4, 5}; std::shared_ptr foo; - ArrayFromVector(int32(), foo_valid, foo_values, &foo); + ArrayFromVector(foo_valid, foo_values, &foo); ASSERT_TRUE(batch->column(0)->Equals(foo)); std::vector bar_valid = {true, false, false, true, true}; std::vector bar_values = {1, 2, 3, 4, 5}; std::shared_ptr bar; - ArrayFromVector(float64(), bar_valid, bar_values, &bar); + ArrayFromVector(bar_valid, bar_values, &bar); ASSERT_TRUE(batch->column(1)->Equals(bar)); } diff --git a/cpp/src/arrow/ipc/json-integration-test.cc b/cpp/src/arrow/ipc/json-integration-test.cc index 95bc742054fab..17ccc4ac1d0da 100644 --- a/cpp/src/arrow/ipc/json-integration-test.cc +++ 
b/cpp/src/arrow/ipc/json-integration-test.cc @@ -144,10 +144,8 @@ static Status ValidateArrowVsJson( if (!json_schema->Equals(arrow_schema)) { std::stringstream ss; - ss << "JSON schema: \n" - << json_schema->ToString() << "\n" - << "Arrow schema: \n" - << arrow_schema->ToString(); + ss << "JSON schema: \n" << json_schema->ToString() << "\n" + << "Arrow schema: \n" << arrow_schema->ToString(); if (FLAGS_verbose) { std::cout << ss.str() << std::endl; } return Status::Invalid("Schemas did not match"); diff --git a/cpp/src/arrow/ipc/json-internal.cc b/cpp/src/arrow/ipc/json-internal.cc index 43bd8a4a4e814..1a95b2ce470b2 100644 --- a/cpp/src/arrow/ipc/json-internal.cc +++ b/cpp/src/arrow/ipc/json-internal.cc @@ -199,8 +199,8 @@ class JsonSchemaWriter : public TypeVisitor { // Write type ids writer_->Key("typeIds"); writer_->StartArray(); - for (size_t i = 0; i < type.type_ids.size(); ++i) { - writer_->Uint(type.type_ids[i]); + for (size_t i = 0; i < type.type_codes.size(); ++i) { + writer_->Uint(type.type_codes[i]); } writer_->EndArray(); } @@ -464,7 +464,7 @@ class JsonArrayWriter : public ArrayVisitor { template Status WriteVarBytes(const T& array) { WriteValidityField(array); - WriteIntegerField("OFFSET", array.raw_offsets(), array.length() + 1); + WriteIntegerField("OFFSET", array.raw_value_offsets(), array.length() + 1); WriteDataField(array); SetNoChildren(); return Status::OK(); @@ -532,7 +532,7 @@ class JsonArrayWriter : public ArrayVisitor { Status Visit(const ListArray& array) override { WriteValidityField(array); - WriteIntegerField("OFFSET", array.raw_offsets(), array.length() + 1); + WriteIntegerField("OFFSET", array.raw_value_offsets(), array.length() + 1); auto type = static_cast(array.type().get()); return WriteChildren(type->children(), {array.values()}); } @@ -549,7 +549,7 @@ class JsonArrayWriter : public ArrayVisitor { WriteIntegerField("TYPE_ID", array.raw_type_ids(), array.length()); if (type->mode == UnionMode::DENSE) { - WriteIntegerField("OFFSET", array.raw_offsets(), array.length()); + WriteIntegerField("OFFSET", array.raw_value_offsets(), array.length()); } return WriteChildren(type->children(), array.children()); } @@ -718,17 +718,17 @@ class JsonSchemaReader { return Status::Invalid(ss.str()); } - const auto& json_type_ids = json_type.FindMember("typeIds"); - RETURN_NOT_ARRAY("typeIds", json_type_ids, json_type); + const auto& json_type_codes = json_type.FindMember("typeIds"); + RETURN_NOT_ARRAY("typeIds", json_type_codes, json_type); - std::vector type_ids; - const auto& id_array = json_type_ids->value.GetArray(); + std::vector type_codes; + const auto& id_array = json_type_codes->value.GetArray(); for (const rj::Value& val : id_array) { DCHECK(val.IsUint()); - type_ids.push_back(val.GetUint()); + type_codes.push_back(val.GetUint()); } - *type = union_(children, type_ids, mode); + *type = union_(children, type_codes, mode); return Status::OK(); } @@ -844,7 +844,7 @@ class JsonArrayReader { typename std::enable_if::value, Status>::type ReadArray( const RjObject& json_array, int32_t length, const std::vector& is_valid, const std::shared_ptr& type, std::shared_ptr* array) { - typename TypeTraits::BuilderType builder(pool_, type); + typename TypeTraits::BuilderType builder(pool_); const auto& json_data = json_array.FindMember("DATA"); RETURN_NOT_ARRAY("DATA", json_data, json_array); @@ -869,8 +869,9 @@ class JsonArrayReader { template Status GetIntArray( const RjArray& json_array, const int32_t length, std::shared_ptr* out) { - auto buffer = std::make_shared(pool_); - 
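  // Editor's note: alongside the typeIds -> typeCodes rename, this patch
  // consistently reorders array constructor arguments from
  // (..., null_count, null_bitmap) to (..., null_bitmap, null_count), as in
  // the ListArray and StructArray hunks just below. Illustrative sketch of
  // the new calling convention (variable names assumed):
  //
  //   auto list_array = std::make_shared<ListArray>(
  //       type, length, offsets_buffer, values_array, null_bitmap, null_count);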
RETURN_NOT_OK(buffer->Resize(length * sizeof(T))); + std::shared_ptr buffer; + RETURN_NOT_OK(AllocateBuffer(pool_, length * sizeof(T), &buffer)); + T* values = reinterpret_cast(buffer->mutable_data()); for (int i = 0; i < length; ++i) { const rj::Value& val = json_array[i]; @@ -901,7 +902,7 @@ class JsonArrayReader { DCHECK_EQ(children.size(), 1); *array = std::make_shared( - type, length, offsets_buffer, children[0], null_count, validity_buffer); + type, length, offsets_buffer, children[0], validity_buffer, null_count); return Status::OK(); } @@ -918,7 +919,7 @@ class JsonArrayReader { RETURN_NOT_OK(GetChildren(json_array, type, &fields)); *array = - std::make_shared(type, length, fields, null_count, validity_buffer); + std::make_shared(type, length, fields, validity_buffer, null_count); return Status::OK(); } @@ -953,7 +954,7 @@ class JsonArrayReader { RETURN_NOT_OK(GetChildren(json_array, type, &children)); *array = std::make_shared(type, length, children, type_id_buffer, - offsets_buffer, null_count, validity_buffer); + offsets_buffer, validity_buffer, null_count); return Status::OK(); } @@ -962,7 +963,7 @@ class JsonArrayReader { typename std::enable_if::value, Status>::type ReadArray( const RjObject& json_array, int32_t length, const std::vector& is_valid, const std::shared_ptr& type, std::shared_ptr* array) { - *array = std::make_shared(type, length); + *array = std::make_shared(length); return Status::OK(); } diff --git a/cpp/src/arrow/ipc/stream.cc b/cpp/src/arrow/ipc/stream.cc index c9057e860b1e8..72eb13465afcc 100644 --- a/cpp/src/arrow/ipc/stream.cc +++ b/cpp/src/arrow/ipc/stream.cc @@ -28,6 +28,7 @@ #include "arrow/ipc/adapter.h" #include "arrow/ipc/metadata.h" #include "arrow/ipc/util.h" +#include "arrow/memory_pool.h" #include "arrow/schema.h" #include "arrow/status.h" #include "arrow/util/logging.h" @@ -41,7 +42,11 @@ namespace ipc { StreamWriter::~StreamWriter() {} StreamWriter::StreamWriter(io::OutputStream* sink, const std::shared_ptr& schema) - : sink_(sink), schema_(schema), position_(-1), started_(false) {} + : sink_(sink), + schema_(schema), + pool_(default_memory_pool()), + position_(-1), + started_(false) {} Status StreamWriter::UpdatePosition() { return sink_->Tell(&position_); @@ -76,8 +81,8 @@ Status StreamWriter::WriteRecordBatch(const RecordBatch& batch, FileBlock* block // Frame of reference in file format is 0, see ARROW-384 const int64_t buffer_start_offset = 0; - RETURN_NOT_OK(arrow::ipc::WriteRecordBatch( - batch, buffer_start_offset, sink_, &block->metadata_length, &block->body_length)); + RETURN_NOT_OK(arrow::ipc::WriteRecordBatch(batch, buffer_start_offset, sink_, + &block->metadata_length, &block->body_length, pool_)); RETURN_NOT_OK(UpdatePosition()); DCHECK(position_ % 8 == 0) << "WriteRecordBatch did not perform aligned writes"; @@ -85,6 +90,10 @@ Status StreamWriter::WriteRecordBatch(const RecordBatch& batch, FileBlock* block return Status::OK(); } +void StreamWriter::set_memory_pool(MemoryPool* pool) { + pool_ = pool; +} + // ---------------------------------------------------------------------- // StreamWriter implementation diff --git a/cpp/src/arrow/ipc/stream.h b/cpp/src/arrow/ipc/stream.h index 53f51dc73675f..12414fa2ca0c7 100644 --- a/cpp/src/arrow/ipc/stream.h +++ b/cpp/src/arrow/ipc/stream.h @@ -30,6 +30,7 @@ namespace arrow { class Array; class Buffer; struct Field; +class MemoryPool; class RecordBatch; class Schema; class Status; @@ -59,6 +60,10 @@ class ARROW_EXPORT StreamWriter { /// closing the actual OutputStream virtual Status 
Close(); + // In some cases, writing may require memory allocation. We use the default + // memory pool, but provide the option to override + void set_memory_pool(MemoryPool* pool); + protected: StreamWriter(io::OutputStream* sink, const std::shared_ptr& schema); @@ -81,6 +86,9 @@ class ARROW_EXPORT StreamWriter { io::OutputStream* sink_; std::shared_ptr schema_; + + MemoryPool* pool_; + int64_t position_; bool started_; }; diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index ca790ded92191..b4930c4555d44 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -21,6 +21,7 @@ #include #include #include +#include #include #include @@ -28,6 +29,7 @@ #include "arrow/buffer.h" #include "arrow/builder.h" #include "arrow/memory_pool.h" +#include "arrow/status.h" #include "arrow/table.h" #include "arrow/test-util.h" #include "arrow/type.h" @@ -104,8 +106,8 @@ Status MakeIntRecordBatch(std::shared_ptr* out) { const int length = 1000; // Make the schema - auto f0 = std::make_shared("f0", int32()); - auto f1 = std::make_shared("f1", int32()); + auto f0 = field("f0", int32()); + auto f1 = field("f1", int32()); std::shared_ptr schema(new Schema({f0, f1})); // Example data @@ -119,10 +121,10 @@ Status MakeIntRecordBatch(std::shared_ptr* out) { template Status MakeRandomBinaryArray( - const TypePtr& type, int32_t length, MemoryPool* pool, std::shared_ptr* out) { + int32_t length, MemoryPool* pool, std::shared_ptr* out) { const std::vector values = { "", "", "abc", "123", "efg", "456!@#!@#", "12312"}; - Builder builder(pool, type); + Builder builder(pool); const auto values_len = values.size(); for (int32_t i = 0; i < length; ++i) { int values_index = i % values_len; @@ -141,22 +143,22 @@ Status MakeStringTypesRecordBatch(std::shared_ptr* out) { const int32_t length = 500; auto string_type = utf8(); auto binary_type = binary(); - auto f0 = std::make_shared("f0", string_type); - auto f1 = std::make_shared("f1", binary_type); + auto f0 = field("f0", string_type); + auto f1 = field("f1", binary_type); std::shared_ptr schema(new Schema({f0, f1})); std::shared_ptr a0, a1; MemoryPool* pool = default_memory_pool(); + // Quirk with RETURN_NOT_OK macro and templated functions { - auto status = - MakeRandomBinaryArray(string_type, length, pool, &a0); - RETURN_NOT_OK(status); + auto s = MakeRandomBinaryArray(length, pool, &a0); + RETURN_NOT_OK(s); } + { - auto status = - MakeRandomBinaryArray(binary_type, length, pool, &a1); - RETURN_NOT_OK(status); + auto s = MakeRandomBinaryArray(length, pool, &a1); + RETURN_NOT_OK(s); } out->reset(new RecordBatch(schema, length, {a0, a1})); return Status::OK(); @@ -164,9 +166,9 @@ Status MakeStringTypesRecordBatch(std::shared_ptr* out) { Status MakeListRecordBatch(std::shared_ptr* out) { // Make the schema - auto f0 = std::make_shared("f0", kListInt32); - auto f1 = std::make_shared("f1", kListListInt32); - auto f2 = std::make_shared("f2", int32()); + auto f0 = field("f0", kListInt32); + auto f1 = field("f1", kListListInt32); + auto f2 = field("f2", int32()); std::shared_ptr schema(new Schema({f0, f1, f2})); // Example data @@ -187,14 +189,13 @@ Status MakeListRecordBatch(std::shared_ptr* out) { Status MakeZeroLengthRecordBatch(std::shared_ptr* out) { // Make the schema - auto f0 = std::make_shared("f0", kListInt32); - auto f1 = std::make_shared("f1", kListListInt32); - auto f2 = std::make_shared("f2", int32()); + auto f0 = field("f0", kListInt32); + auto f1 = field("f1", kListListInt32); + auto f2 = field("f2", 
int32()); std::shared_ptr schema(new Schema({f0, f1, f2})); // Example data MemoryPool* pool = default_memory_pool(); - const int length = 200; const bool include_nulls = true; std::shared_ptr leaf_values, list_array, list_list_array, flat_array; RETURN_NOT_OK(MakeRandomInt32Array(0, include_nulls, pool, &leaf_values)); @@ -202,15 +203,15 @@ Status MakeZeroLengthRecordBatch(std::shared_ptr* out) { RETURN_NOT_OK( MakeRandomListArray(list_array, 0, include_nulls, pool, &list_list_array)); RETURN_NOT_OK(MakeRandomInt32Array(0, include_nulls, pool, &flat_array)); - out->reset(new RecordBatch(schema, length, {list_array, list_list_array, flat_array})); + out->reset(new RecordBatch(schema, 0, {list_array, list_list_array, flat_array})); return Status::OK(); } Status MakeNonNullRecordBatch(std::shared_ptr* out) { // Make the schema - auto f0 = std::make_shared("f0", kListInt32); - auto f1 = std::make_shared("f1", kListListInt32); - auto f2 = std::make_shared("f2", int32()); + auto f0 = field("f0", kListInt32); + auto f1 = field("f1", kListListInt32); + auto f2 = field("f2", int32()); std::shared_ptr schema(new Schema({f0, f1, f2})); // Example data @@ -242,7 +243,7 @@ Status MakeDeeplyNestedList(std::shared_ptr* out) { RETURN_NOT_OK(MakeRandomListArray(array, batch_length, include_nulls, pool, &array)); } - auto f0 = std::make_shared("f0", type); + auto f0 = field("f0", type); std::shared_ptr schema(new Schema({f0})); std::vector> arrays = {array}; out->reset(new RecordBatch(schema, batch_length, arrays)); @@ -260,8 +261,8 @@ Status MakeStruct(std::shared_ptr* out) { // Define schema std::shared_ptr type(new StructType( {list_schema->field(0), list_schema->field(1), list_schema->field(2)})); - auto f0 = std::make_shared("non_null_struct", type); - auto f1 = std::make_shared("null_struct", type); + auto f0 = field("non_null_struct", type); + auto f1 = field("null_struct", type); std::shared_ptr schema(new Schema({f0, f1})); // construct individual nullable/non-nullable struct arrays @@ -271,7 +272,7 @@ Status MakeStruct(std::shared_ptr* out) { std::shared_ptr null_bitmask; RETURN_NOT_OK(BitUtil::BytesToBits(null_bytes, &null_bitmask)); std::shared_ptr with_nulls( - new StructArray(type, list_batch->num_rows(), columns, 1, null_bitmask)); + new StructArray(type, list_batch->num_rows(), columns, null_bitmask, 1)); // construct batch std::vector> arrays = {no_nulls, with_nulls}; @@ -282,7 +283,7 @@ Status MakeStruct(std::shared_ptr* out) { Status MakeUnion(std::shared_ptr* out) { // Define schema std::vector> union_types( - {std::make_shared("u0", int32()), std::make_shared("u1", uint8())}); + {field("u0", int32()), field("u1", uint8())}); std::vector type_codes = {5, 10}; auto sparse_type = @@ -291,9 +292,9 @@ Status MakeUnion(std::shared_ptr* out) { auto dense_type = std::make_shared(union_types, type_codes, UnionMode::DENSE); - auto f0 = std::make_shared("sparse_nonnull", sparse_type, false); - auto f1 = std::make_shared("sparse", sparse_type); - auto f2 = std::make_shared("dense", dense_type); + auto f0 = field("sparse_nonnull", sparse_type, false); + auto f1 = field("sparse", sparse_type); + auto f2 = field("dense", dense_type); std::shared_ptr schema(new Schema({f0, f1, f2})); @@ -308,21 +309,17 @@ Status MakeUnion(std::shared_ptr* out) { RETURN_NOT_OK(test::CopyBufferFromVector(type_ids, &type_ids_buffer)); std::vector u0_values = {0, 1, 2, 3, 4, 5, 6}; - ArrayFromVector( - sparse_type->child(0)->type, u0_values, &sparse_children[0]); + ArrayFromVector(u0_values, &sparse_children[0]); 
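  // Editor's note: ArrayFromVector no longer takes a DataType argument; the
  // builder type is derived from the template parameter now that primitive
  // builders are constructed from a MemoryPool alone (see the test-util.h
  // hunk below). Sketch of the updated helper usage, with assumed values:
  //
  //   std::vector<int32_t> values = {0, 1, 2, 3};
  //   std::shared_ptr<Array> out;
  //   ArrayFromVector<Int32Type, int32_t>(values, &out);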
std::vector u1_values = {10, 11, 12, 13, 14, 15, 16}; - ArrayFromVector( - sparse_type->child(1)->type, u1_values, &sparse_children[1]); + ArrayFromVector(u1_values, &sparse_children[1]); // dense children u0_values = {0, 2, 3, 7}; - ArrayFromVector( - dense_type->child(0)->type, u0_values, &dense_children[0]); + ArrayFromVector(u0_values, &dense_children[0]); u1_values = {11, 14, 15}; - ArrayFromVector( - dense_type->child(1)->type, u1_values, &dense_children[1]); + ArrayFromVector(u1_values, &dense_children[1]); std::shared_ptr offsets_buffer; std::vector offsets = {0, 0, 1, 2, 1, 2, 3}; @@ -337,10 +334,10 @@ Status MakeUnion(std::shared_ptr* out) { auto sparse_no_nulls = std::make_shared(sparse_type, length, sparse_children, type_ids_buffer); auto sparse = std::make_shared( - sparse_type, length, sparse_children, type_ids_buffer, nullptr, 1, null_bitmask); + sparse_type, length, sparse_children, type_ids_buffer, nullptr, null_bitmask, 1); auto dense = std::make_shared(dense_type, length, dense_children, - type_ids_buffer, offsets_buffer, 1, null_bitmask); + type_ids_buffer, offsets_buffer, null_bitmask, 1); // construct batch std::vector> arrays = {sparse_no_nulls, sparse, dense}; diff --git a/cpp/src/arrow/pretty_print-test.cc b/cpp/src/arrow/pretty_print-test.cc index 4725d5dd808ee..aca650f0a927b 100644 --- a/cpp/src/arrow/pretty_print-test.cc +++ b/cpp/src/arrow/pretty_print-test.cc @@ -55,7 +55,7 @@ template void CheckPrimitive(int indent, const std::vector& is_valid, const std::vector& values, const char* expected) { std::shared_ptr array; - ArrayFromVector(std::make_shared(), is_valid, values, &array); + ArrayFromVector(is_valid, values, &array); CheckArray(*array.get(), indent, expected); } @@ -76,12 +76,12 @@ TEST_F(TestPrettyPrint, DictionaryType) { std::shared_ptr dict; std::vector dict_values = {"foo", "bar", "baz"}; - ArrayFromVector(utf8(), dict_values, &dict); + ArrayFromVector(dict_values, &dict); std::shared_ptr dict_type = dictionary(int16(), dict); std::shared_ptr indices; std::vector indices_values = {1, 2, -1, 0, 2, 0}; - ArrayFromVector(int16(), is_valid, indices_values, &indices); + ArrayFromVector(is_valid, indices_values, &indices); auto arr = std::make_shared(dict_type, indices); static const char* expected = R"expected( diff --git a/cpp/src/arrow/pretty_print.cc b/cpp/src/arrow/pretty_print.cc index e30f4cc58d7ab..23c05807c16ee 100644 --- a/cpp/src/arrow/pretty_print.cc +++ b/cpp/src/arrow/pretty_print.cc @@ -164,39 +164,56 @@ class ArrayPrinter : public ArrayVisitor { Status WriteValidityBitmap(const Array& array) { Newline(); Write("-- is_valid: "); - BooleanArray is_valid(array.length(), array.null_bitmap()); - return PrettyPrint(is_valid, indent_ + 2, sink_); + + if (array.null_count() > 0) { + BooleanArray is_valid( + array.length(), array.null_bitmap(), nullptr, 0, array.offset()); + return PrettyPrint(is_valid, indent_ + 2, sink_); + } else { + Write("all not null"); + return Status::OK(); + } } Status Visit(const ListArray& array) override { RETURN_NOT_OK(WriteValidityBitmap(array)); Newline(); - Write("-- offsets: "); - Int32Array offsets(array.length() + 1, array.offsets()); - RETURN_NOT_OK(PrettyPrint(offsets, indent_ + 2, sink_)); + Write("-- value_offsets: "); + Int32Array value_offsets( + array.length() + 1, array.value_offsets(), nullptr, 0, array.offset()); + RETURN_NOT_OK(PrettyPrint(value_offsets, indent_ + 2, sink_)); Newline(); Write("-- values: "); - RETURN_NOT_OK(PrettyPrint(*array.values().get(), indent_ + 2, sink_)); + auto values = 
array.values(); + if (array.offset() != 0) { + values = values->Slice(array.value_offset(0), array.value_offset(array.length())); + } + RETURN_NOT_OK(PrettyPrint(*values, indent_ + 2, sink_)); return Status::OK(); } - Status PrintChildren(const std::vector>& fields) { + Status PrintChildren( + const std::vector>& fields, int32_t offset, int32_t length) { for (size_t i = 0; i < fields.size(); ++i) { Newline(); std::stringstream ss; ss << "-- child " << i << " type: " << fields[i]->type()->ToString() << " values: "; Write(ss.str()); - RETURN_NOT_OK(PrettyPrint(*fields[i].get(), indent_ + 2, sink_)); + + std::shared_ptr field = fields[i]; + if (offset != 0) { field = field->Slice(offset, length); } + + RETURN_NOT_OK(PrettyPrint(*field, indent_ + 2, sink_)); } return Status::OK(); } Status Visit(const StructArray& array) override { RETURN_NOT_OK(WriteValidityBitmap(array)); - return PrintChildren(array.fields()); + return PrintChildren(array.fields(), array.offset(), array.length()); } Status Visit(const UnionArray& array) override { @@ -204,17 +221,19 @@ class ArrayPrinter : public ArrayVisitor { Newline(); Write("-- type_ids: "); - UInt8Array type_ids(array.length(), array.type_ids()); + UInt8Array type_ids(array.length(), array.type_ids(), nullptr, 0, array.offset()); RETURN_NOT_OK(PrettyPrint(type_ids, indent_ + 2, sink_)); if (array.mode() == UnionMode::DENSE) { Newline(); - Write("-- offsets: "); - Int32Array offsets(array.length(), array.offsets()); - RETURN_NOT_OK(PrettyPrint(offsets, indent_ + 2, sink_)); + Write("-- value_offsets: "); + Int32Array value_offsets( + array.length(), array.value_offsets(), nullptr, 0, array.offset()); + RETURN_NOT_OK(PrettyPrint(value_offsets, indent_ + 2, sink_)); } - return PrintChildren(array.children()); + // Print the children without any offset, because the type ids are absolute + return PrintChildren(array.children(), 0, array.length() + array.offset()); } Status Visit(const DictionaryArray& array) override { @@ -222,11 +241,11 @@ class ArrayPrinter : public ArrayVisitor { Newline(); Write("-- dictionary: "); - RETURN_NOT_OK(PrettyPrint(*array.dictionary().get(), indent_ + 2, sink_)); + RETURN_NOT_OK(PrettyPrint(*array.dictionary(), indent_ + 2, sink_)); Newline(); Write("-- indices: "); - return PrettyPrint(*array.indices().get(), indent_ + 2, sink_); + return PrettyPrint(*array.indices(), indent_ + 2, sink_); } void Write(const char* data) { (*sink_) << data; } @@ -260,7 +279,7 @@ Status PrettyPrint(const RecordBatch& batch, int indent, std::ostream* sink) { for (int i = 0; i < batch.num_columns(); ++i) { const std::string& name = batch.column_name(i); (*sink) << name << ": "; - RETURN_NOT_OK(PrettyPrint(*batch.column(i).get(), indent + 2, sink)); + RETURN_NOT_OK(PrettyPrint(*batch.column(i), indent + 2, sink)); (*sink) << "\n"; } return Status::OK(); diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc index 67c9f6703f496..e7c5d667903e8 100644 --- a/cpp/src/arrow/table-test.cc +++ b/cpp/src/arrow/table-test.cc @@ -242,4 +242,30 @@ TEST_F(TestRecordBatch, Equals) { ASSERT_FALSE(b1.Equals(b4)); } +TEST_F(TestRecordBatch, Slice) { + const int length = 10; + + auto f0 = std::make_shared("f0", int32()); + auto f1 = std::make_shared("f1", uint8()); + + vector> fields = {f0, f1}; + auto schema = std::make_shared(fields); + + auto a0 = MakePrimitive(length); + auto a1 = MakePrimitive(length); + + RecordBatch batch(schema, length, {a0, a1}); + + auto batch_slice = batch.Slice(2); + auto batch_slice2 = batch.Slice(1, 5); + + for (int i 
= 0; i < batch.num_columns(); ++i) { + ASSERT_EQ(2, batch_slice->column(i)->offset()); + ASSERT_EQ(length - 2, batch_slice->column(i)->length()); + + ASSERT_EQ(1, batch_slice2->column(i)->offset()); + ASSERT_EQ(5, batch_slice2->column(i)->length()); + } +} + } // namespace arrow diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc index b3563eaae7b57..9e31ba5af0ce3 100644 --- a/cpp/src/arrow/table.cc +++ b/cpp/src/arrow/table.cc @@ -60,6 +60,19 @@ bool RecordBatch::ApproxEquals(const RecordBatch& other) const { return true; } +std::shared_ptr RecordBatch::Slice(int32_t offset) { + return Slice(offset, this->num_rows() - offset); +} + +std::shared_ptr RecordBatch::Slice(int32_t offset, int32_t length) { + std::vector> arrays; + arrays.reserve(num_columns()); + for (const auto& field : columns_) { + arrays.emplace_back(field->Slice(offset, length)); + } + return std::make_shared(schema_, num_rows_, arrays); +} + // ---------------------------------------------------------------------- // Table methods @@ -93,8 +106,7 @@ Status Table::FromRecordBatches(const std::string& name, if (!batches[i]->schema()->Equals(schema)) { std::stringstream ss; ss << "Schema at index " << static_cast(i) << " was different: \n" - << schema->ToString() << "\nvs\n" - << batches[i]->schema()->ToString(); + << schema->ToString() << "\nvs\n" << batches[i]->schema()->ToString(); return Status::Invalid(ss.str()); } } @@ -126,8 +138,7 @@ Status ConcatenateTables(const std::string& output_name, if (!tables[i]->schema()->Equals(schema)) { std::stringstream ss; ss << "Schema at index " << static_cast(i) << " was different: \n" - << schema->ToString() << "\nvs\n" - << tables[i]->schema()->ToString(); + << schema->ToString() << "\nvs\n" << tables[i]->schema()->ToString(); return Status::Invalid(ss.str()); } } diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h index 583847cfbe3e5..fa56824a5a1bc 100644 --- a/cpp/src/arrow/table.h +++ b/cpp/src/arrow/table.h @@ -64,6 +64,10 @@ class ARROW_EXPORT RecordBatch { // @returns: the number of rows (the corresponding length of each column) int32_t num_rows() const { return num_rows_; } + /// Slice each of the arrays in the record batch and construct a new RecordBatch object + std::shared_ptr Slice(int32_t offset); + std::shared_ptr Slice(int32_t offset, int32_t length); + private: std::shared_ptr schema_; int32_t num_rows_; diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index 4e525804b47cc..ffc78067d1b97 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -61,14 +61,6 @@ EXPECT_TRUE(s.ok()); \ } while (0) -// Alias MSVC popcount to GCC name -#ifdef _MSC_VER -#include -#define __builtin_popcount __popcnt -#include -#define __builtin_popcountll _mm_popcnt_u64 -#endif - namespace arrow { namespace test { @@ -175,29 +167,6 @@ void rand_uniform_int(int n, uint32_t seed, T min_value, T max_value, T* out) { } } -static inline int bitmap_popcount(const uint8_t* data, int length) { - // book keeping - constexpr int pop_len = sizeof(uint64_t); - const uint64_t* i64_data = reinterpret_cast(data); - const int fast_counts = length / pop_len; - const uint64_t* end = i64_data + fast_counts; - - int count = 0; - // popcount as much as possible with the widest possible count - for (auto iter = i64_data; iter < end; ++iter) { - count += __builtin_popcountll(*iter); - } - - // Account for left over bytes (in theory we could fall back to smaller - // versions of popcount but the code complexity is likely not worth it) - const int loop_tail_index = 
fast_counts * pop_len; - for (int i = loop_tail_index; i < length; ++i) { - if (BitUtil::GetBit(data, i)) { ++count; } - } - - return count; -} - static inline int null_count(const std::vector& valid_bytes) { int result = 0; for (size_t i = 0; i < valid_bytes.size(); ++i) { @@ -254,7 +223,7 @@ class TestBase : public ::testing::Test { auto null_bitmap = std::make_shared(pool_); EXPECT_OK(null_bitmap->Resize(BitUtil::BytesForBits(length))); - return std::make_shared(length, data, null_count, null_bitmap); + return std::make_shared(length, data, null_bitmap, null_count); } protected: @@ -263,11 +232,10 @@ class TestBase : public ::testing::Test { }; template -void ArrayFromVector(const std::shared_ptr& type, - const std::vector& is_valid, const std::vector& values, +void ArrayFromVector(const std::vector& is_valid, const std::vector& values, std::shared_ptr* out) { MemoryPool* pool = default_memory_pool(); - typename TypeTraits::BuilderType builder(pool, std::make_shared()); + typename TypeTraits::BuilderType builder(pool); for (size_t i = 0; i < values.size(); ++i) { if (is_valid[i]) { ASSERT_OK(builder.Append(values[i])); @@ -279,10 +247,9 @@ void ArrayFromVector(const std::shared_ptr& type, } template -void ArrayFromVector(const std::shared_ptr& type, - const std::vector& values, std::shared_ptr* out) { +void ArrayFromVector(const std::vector& values, std::shared_ptr* out) { MemoryPool* pool = default_memory_pool(); - typename TypeTraits::BuilderType builder(pool, std::make_shared()); + typename TypeTraits::BuilderType builder(pool); for (size_t i = 0; i < values.size(); ++i) { ASSERT_OK(builder.Append(values[i])); } diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index ba775845fcaf2..a1c2b79950d59 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -115,7 +115,7 @@ std::string UnionType::ToString() const { for (size_t i = 0; i < children_.size(); ++i) { if (i) { s << ", "; } - s << children_[i]->ToString() << "=" << static_cast(type_ids[i]); + s << children_[i]->ToString() << "=" << static_cast(type_codes[i]); } s << ">"; return s.str(); @@ -224,8 +224,8 @@ std::shared_ptr struct_(const std::vector>& fie } std::shared_ptr union_(const std::vector>& child_fields, - const std::vector& type_ids, UnionMode mode) { - return std::make_shared(child_fields, type_ids, mode); + const std::vector& type_codes, UnionMode mode) { + return std::make_shared(child_fields, type_codes, mode); } std::shared_ptr dictionary(const std::shared_ptr& index_type, diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 8638a3f4b6e90..927b8a44fe12f 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -413,8 +413,8 @@ struct ARROW_EXPORT UnionType : public DataType { static constexpr Type::type type_id = Type::UNION; UnionType(const std::vector>& fields, - const std::vector& type_ids, UnionMode mode = UnionMode::SPARSE) - : DataType(Type::UNION), mode(mode), type_ids(type_ids) { + const std::vector& type_codes, UnionMode mode = UnionMode::SPARSE) + : DataType(Type::UNION), mode(mode), type_codes(type_codes) { children_ = fields; } @@ -429,7 +429,7 @@ struct ARROW_EXPORT UnionType : public DataType { // The type id used in the data to indicate each data type in the union. For // example, the first type in the union might be denoted by the id 5 (instead // of 0). 
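  // Editor's note: renaming type_ids to type_codes here disambiguates the
  // per-type codes on UnionType from the per-slot type ids buffer carried by
  // UnionArray. A hedged usage sketch with the renamed member and the updated
  // union_ factory (values are illustrative):
  //
  //   auto type = union_({field("u0", int32()), field("u1", uint8())},
  //       std::vector<uint8_t>{5, 10}, UnionMode::DENSE);
  //   const auto& utype = static_cast<const UnionType&>(*type);
  //   // utype.type_codes == {5, 10}; codes need not start at zero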
- std::vector type_ids; + std::vector type_codes; }; // ---------------------------------------------------------------------- @@ -551,7 +551,7 @@ std::shared_ptr ARROW_EXPORT struct_( std::shared_ptr ARROW_EXPORT union_( const std::vector>& child_fields, - const std::vector& type_ids, UnionMode mode = UnionMode::SPARSE); + const std::vector& type_codes, UnionMode mode = UnionMode::SPARSE); std::shared_ptr ARROW_EXPORT dictionary( const std::shared_ptr& index_type, const std::shared_ptr& values); diff --git a/cpp/src/arrow/type_traits.h b/cpp/src/arrow/type_traits.h index 5cd5f45466bf7..c4898b1ac8ce2 100644 --- a/cpp/src/arrow/type_traits.h +++ b/cpp/src/arrow/type_traits.h @@ -125,6 +125,15 @@ struct TypeTraits { constexpr static bool is_parameter_free = false; }; +template <> +struct TypeTraits { + using ArrayType = TimeArray; + // using BuilderType = TimestampBuilder; + + static inline int bytes_required(int elements) { return elements * sizeof(int64_t); } + constexpr static bool is_parameter_free = false; +}; + template <> struct TypeTraits { using ArrayType = HalfFloatArray; diff --git a/cpp/src/arrow/util/bit-util-test.cc b/cpp/src/arrow/util/bit-util-test.cc index cfdee04f6e2ea..cb2fd1ab276ad 100644 --- a/cpp/src/arrow/util/bit-util-test.cc +++ b/cpp/src/arrow/util/bit-util-test.cc @@ -17,11 +17,17 @@ #include "arrow/util/bit-util.h" +#include +#include + #include "gtest/gtest.h" +#include "arrow/buffer.h" +#include "arrow/test-util.h" + namespace arrow { -TEST(UtilTests, TestIsMultipleOf64) { +TEST(BitUtilTests, TestIsMultipleOf64) { using BitUtil::IsMultipleOf64; EXPECT_TRUE(IsMultipleOf64(64)); EXPECT_TRUE(IsMultipleOf64(0)); @@ -31,7 +37,7 @@ TEST(UtilTests, TestIsMultipleOf64) { EXPECT_FALSE(IsMultipleOf64(32)); } -TEST(UtilTests, TestNextPower2) { +TEST(BitUtilTests, TestNextPower2) { using BitUtil::NextPower2; ASSERT_EQ(8, NextPower2(6)); @@ -51,4 +57,56 @@ TEST(UtilTests, TestNextPower2) { ASSERT_EQ(1LL << 62, NextPower2((1LL << 62) - 1)); } +static inline int64_t SlowCountBits( + const uint8_t* data, int64_t bit_offset, int64_t length) { + int64_t count = 0; + for (int64_t i = bit_offset; i < bit_offset + length; ++i) { + if (BitUtil::GetBit(data, i)) { ++count; } + } + return count; +} + +TEST(BitUtilTests, TestCountSetBits) { + const int kBufferSize = 1000; + uint8_t buffer[kBufferSize] = {0}; + + test::random_bytes(kBufferSize, 0, buffer); + + const int num_bits = kBufferSize * 8; + + std::vector offsets = { + 0, 12, 16, 32, 37, 63, 64, 128, num_bits - 30, num_bits - 64}; + for (int64_t offset : offsets) { + int64_t result = CountSetBits(buffer, offset, num_bits - offset); + int64_t expected = SlowCountBits(buffer, offset, num_bits - offset); + + ASSERT_EQ(expected, result); + } +} + +TEST(BitUtilTests, TestCopyBitmap) { + const int kBufferSize = 1000; + + std::shared_ptr buffer; + ASSERT_OK(AllocateBuffer(default_memory_pool(), kBufferSize, &buffer)); + memset(buffer->mutable_data(), 0, kBufferSize); + test::random_bytes(kBufferSize, 0, buffer->mutable_data()); + + const int num_bits = kBufferSize * 8; + + const uint8_t* src = buffer->data(); + + std::vector offsets = {0, 12, 16, 32, 37, 63, 64, 128}; + for (int64_t offset : offsets) { + const int64_t copy_length = num_bits - offset; + + std::shared_ptr copy; + ASSERT_OK(CopyBitmap(default_memory_pool(), src, offset, copy_length, ©)); + + for (int64_t i = 0; i < copy_length; ++i) { + ASSERT_EQ(BitUtil::GetBit(src, i + offset), BitUtil::GetBit(copy->data(), i)); + } + } +} + } // namespace arrow diff --git 
a/cpp/src/arrow/util/bit-util.cc b/cpp/src/arrow/util/bit-util.cc index 9c82407ecc092..f3fbb41fa54a7 100644 --- a/cpp/src/arrow/util/bit-util.cc +++ b/cpp/src/arrow/util/bit-util.cc @@ -15,10 +15,20 @@ // specific language governing permissions and limitations // under the License. +// Alias MSVC popcount to GCC name +#ifdef _MSC_VER +#include +#define __builtin_popcount __popcnt +#include +#define __builtin_popcountll _mm_popcnt_u64 +#endif + +#include #include #include #include "arrow/buffer.h" +#include "arrow/memory_pool.h" #include "arrow/status.h" #include "arrow/util/bit-util.h" @@ -34,8 +44,9 @@ Status BitUtil::BytesToBits( const std::vector& bytes, std::shared_ptr* out) { int bit_length = BitUtil::BytesForBits(bytes.size()); - auto buffer = std::make_shared(); - RETURN_NOT_OK(buffer->Resize(bit_length)); + std::shared_ptr buffer; + RETURN_NOT_OK(AllocateBuffer(default_memory_pool(), bit_length, &buffer)); + memset(buffer->mutable_data(), 0, bit_length); BytesToBits(bytes, buffer->mutable_data()); @@ -43,4 +54,72 @@ Status BitUtil::BytesToBits( return Status::OK(); } +int64_t CountSetBits(const uint8_t* data, int64_t bit_offset, int64_t length) { + constexpr int64_t pop_len = sizeof(uint64_t) * 8; + + int64_t count = 0; + + // The first bit offset where we can use a 64-bit wide hardware popcount + const int64_t fast_count_start = BitUtil::RoundUp(bit_offset, pop_len); + + // The number of bits until fast_count_start + const int64_t initial_bits = std::min(length, fast_count_start - bit_offset); + for (int64_t i = bit_offset; i < bit_offset + initial_bits; ++i) { + if (BitUtil::GetBit(data, i)) { ++count; } + } + + const int64_t fast_counts = (length - initial_bits) / pop_len; + + // Advance until the first aligned 8-byte word after the initial bits + const uint64_t* u64_data = + reinterpret_cast(data) + fast_count_start / pop_len; + + const uint64_t* end = u64_data + fast_counts; + + // popcount as much as possible with the widest possible count + for (auto iter = u64_data; iter < end; ++iter) { + count += __builtin_popcountll(*iter); + } + + // Account for left over bit (in theory we could fall back to smaller + // versions of popcount but the code complexity is likely not worth it) + const int64_t tail_index = bit_offset + initial_bits + fast_counts * pop_len; + for (int64_t i = tail_index; i < bit_offset + length; ++i) { + if (BitUtil::GetBit(data, i)) { ++count; } + } + + return count; +} + +Status GetEmptyBitmap( + MemoryPool* pool, int64_t length, std::shared_ptr* result) { + RETURN_NOT_OK(AllocateBuffer(pool, BitUtil::BytesForBits(length), result)); + memset((*result)->mutable_data(), 0, (*result)->size()); + return Status::OK(); +} + +Status CopyBitmap(MemoryPool* pool, const uint8_t* data, int32_t offset, int32_t length, + std::shared_ptr* out) { + std::shared_ptr buffer; + RETURN_NOT_OK(GetEmptyBitmap(pool, length, &buffer)); + uint8_t* dest = buffer->mutable_data(); + for (int64_t i = 0; i < length; ++i) { + BitUtil::SetBitTo(dest, i, BitUtil::GetBit(data, i + offset)); + } + *out = buffer; + return Status::OK(); +} + +bool BitmapEquals(const uint8_t* left, int64_t left_offset, const uint8_t* right, + int64_t right_offset, int64_t bit_length) { + // TODO(wesm): Make this faster using word-wise comparisons + for (int64_t i = 0; i < bit_length; ++i) { + if (BitUtil::GetBit(left, left_offset + i) != + BitUtil::GetBit(right, right_offset + i)) { + return false; + } + } + return true; +} + } // namespace arrow diff --git a/cpp/src/arrow/util/bit-util.h 
b/cpp/src/arrow/util/bit-util.h index 5c8055f9c6171..a0fbdd2f92ca1 100644 --- a/cpp/src/arrow/util/bit-util.h +++ b/cpp/src/arrow/util/bit-util.h @@ -28,6 +28,8 @@ namespace arrow { class Buffer; +class MemoryPool; +class MutableBuffer; class Status; namespace BitUtil { @@ -62,6 +64,12 @@ static inline void SetBit(uint8_t* bits, int i) { bits[i / 8] |= kBitmask[i % 8]; } +static inline void SetBitTo(uint8_t* bits, int i, bool bit_is_set) { + // See https://graphics.stanford.edu/~seander/bithacks.html + // "Conditionally set or clear bits without branching" + bits[i / 8] ^= (-bit_is_set ^ bits[i / 8]) & kBitmask[i % 8]; +} + static inline int64_t NextPower2(int64_t n) { n--; n |= n >> 1; @@ -82,6 +90,11 @@ static inline bool IsMultipleOf8(int64_t n) { return (n & 7) == 0; } +/// Returns 'value' rounded up to the nearest multiple of 'factor' +inline int64_t RoundUp(int64_t value, int64_t factor) { + return (value + (factor - 1)) / factor * factor; +} + inline int64_t RoundUpToMultipleOf64(int64_t num) { // TODO(wesm): is this definitely needed? // DCHECK_GE(num, 0); @@ -98,6 +111,38 @@ void BytesToBits(const std::vector& bytes, uint8_t* bits); ARROW_EXPORT Status BytesToBits(const std::vector&, std::shared_ptr*); } // namespace BitUtil + +// ---------------------------------------------------------------------- +// Bitmap utilities + +Status ARROW_EXPORT GetEmptyBitmap( + MemoryPool* pool, int64_t length, std::shared_ptr* result); + +/// Copy a bit range of an existing bitmap +/// +/// \param[in] pool memory pool to allocate memory from +/// \param[in] bitmap source data +/// \param[in] offset bit offset into the source data +/// \param[in] length number of bits to copy +/// \param[out] out the resulting copy +/// +/// \return Status message +Status ARROW_EXPORT CopyBitmap(MemoryPool* pool, const uint8_t* bitmap, int32_t offset, + int32_t length, std::shared_ptr* out); + +/// Compute the number of 1's in the given data array +/// +/// \param[in] data a packed LSB-ordered bitmap as a byte array +/// \param[in] bit_offset a bitwise offset into the bitmap +/// \param[in] length the number of bits to inspect in the bitmap relative to the offset +/// +/// \return The number of set (1) bits in the range +int64_t ARROW_EXPORT CountSetBits( + const uint8_t* data, int64_t bit_offset, int64_t length); + +bool ARROW_EXPORT BitmapEquals(const uint8_t* left, int64_t left_offset, + const uint8_t* right, int64_t right_offset, int64_t bit_length); + } // namespace arrow #endif // ARROW_UTIL_BIT_UTIL_H diff --git a/cpp/src/arrow/util/logging.h b/cpp/src/arrow/util/logging.h index b22f07dd6345f..06ee8411e283c 100644 --- a/cpp/src/arrow/util/logging.h +++ b/cpp/src/arrow/util/logging.h @@ -118,9 +118,9 @@ class CerrLog { class FatalLog : public CerrLog { public: explicit FatalLog(int /* severity */) // NOLINT - : CerrLog(ARROW_FATAL){} // NOLINT + : CerrLog(ARROW_FATAL) {} // NOLINT - [[noreturn]] ~FatalLog() { + [[noreturn]] ~FatalLog() { if (has_logged_) { std::cerr << std::endl; } std::exit(1); } diff --git a/cpp/src/arrow/util/macros.h b/cpp/src/arrow/util/macros.h index c4a62a475b92f..81a9b0cff5687 100644 --- a/cpp/src/arrow/util/macros.h +++ b/cpp/src/arrow/util/macros.h @@ -25,6 +25,6 @@ TypeName& operator=(const TypeName&) = delete #endif -#define UNUSED(x) (void)x +#define UNUSED(x) (void) x #endif // ARROW_UTIL_MACROS_H diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 842a2196dab62..ba26692b32b88 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -95,7 +95,7 @@ if 
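SetBitTo, added to bit-util.h above, uses the masked-XOR idiom from the cited bit hacks page: when bit_is_set is true, -bit_is_set is all ones and the masked XOR forces the selected bit to one; when it is false, the mask term reduces to the bit's current value and the XOR clears it. A standalone check of the idiom, independent of the Arrow headers:

    #include <cassert>
    #include <cstdint>

    int main() {
      const uint8_t byte = 0xA5;  // bits 0, 2, 5, 7 set
      // Exercise a currently-clear bit (3) and a currently-set bit (2).
      for (int pos : {3, 2}) {
        const uint8_t mask = static_cast<uint8_t>(1 << pos);
        for (bool b : {false, true}) {
          // Branchless form of: b ? (byte | mask) : (byte & ~mask)
          const uint8_t r = byte ^ ((-static_cast<uint8_t>(b) ^ byte) & mask);
          assert(((r & mask) != 0) == b);          // target bit now equals b
          assert((r & ~mask) == (byte & ~mask));   // all other bits untouched
        }
      }
      return 0;
    }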
("${COMPILER_FAMILY}" STREQUAL "clang") set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Qunused-arguments") # Cython warnings in clang - set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-parentheses-equality") + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-parentheses-equality -Wno-constant-logical-operand") endif() set(PYARROW_LINK "a") diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 38883e811e1cc..ebfdc410fa004 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -179,8 +179,8 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: double Value(int i) cdef cppclass CListArray" arrow::ListArray"(CArray): - const int32_t* offsets() - int32_t offset(int i) + const int32_t* raw_value_offsets() + int32_t value_offset(int i) int32_t value_length(int i) shared_ptr[CArray] values() shared_ptr[CDataType] value_type() diff --git a/python/pyarrow/scalar.pyx b/python/pyarrow/scalar.pyx index 30b90408dc082..9d2b2b11a80d6 100644 --- a/python/pyarrow/scalar.pyx +++ b/python/pyarrow/scalar.pyx @@ -202,7 +202,7 @@ cdef class ListValue(ArrayValue): self.value_type = box_data_type(self.ap.value_type()) cdef getitem(self, int i): - cdef int j = self.ap.offset(self.index) + i + cdef int j = self.ap.value_offset(self.index) + i return box_arrow_scalar(self.value_type, self.ap.values(), j) def as_py(self): diff --git a/python/src/pyarrow/adapters/builtin.cc b/python/src/pyarrow/adapters/builtin.cc index 1abfb4091189e..5fd8eef23fec5 100644 --- a/python/src/pyarrow/adapters/builtin.cc +++ b/python/src/pyarrow/adapters/builtin.cc @@ -505,7 +505,7 @@ Status ConvertPySequence( // Handle NA / NullType case if (type->type == Type::NA) { - out->reset(new arrow::NullArray(type, size)); + out->reset(new arrow::NullArray(size)); return Status::OK(); } diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index 8d05821c2fd08..345dc9070e5b3 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -1338,8 +1338,7 @@ class ArrowSerializer { PyAcquireGIL lock; PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); - arrow::TypePtr string_type(new arrow::DateType()); - arrow::DateBuilder date_builder(pool_, string_type); + arrow::DateBuilder date_builder(pool_); RETURN_NOT_OK(date_builder.Resize(length_)); Status s; @@ -1363,8 +1362,7 @@ class ArrowSerializer { // and unicode mixed in the object array PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); - arrow::TypePtr string_type(new arrow::StringType()); - arrow::StringBuilder string_builder(pool_, string_type); + arrow::StringBuilder string_builder(pool_); RETURN_NOT_OK(string_builder.Resize(length_)); Status s; @@ -1374,8 +1372,8 @@ class ArrowSerializer { if (have_bytes) { const auto& arr = static_cast(*out->get()); - *out = std::make_shared( - arr.length(), arr.offsets(), arr.data(), arr.null_count(), arr.null_bitmap()); + *out = std::make_shared(arr.length(), arr.value_offsets(), + arr.data(), arr.null_bitmap(), arr.null_count()); } return Status::OK(); } @@ -1403,7 +1401,7 @@ class ArrowSerializer { } } - *out = std::make_shared(length_, data, null_count, null_bitmap_); + *out = std::make_shared(length_, data, null_bitmap_, null_count); return Status::OK(); } @@ -1515,10 +1513,14 @@ inline Status ArrowSerializer::Convert(std::shared_ptr* out) { null_count = ValuesToBitmap(PyArray_DATA(arr_), length_, null_bitmap_data_); } + // For readability + constexpr int32_t kOffset = 0; + 
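A note on the ListArray accessors renamed in the Cython declarations above: in either spelling, list slot i covers positions [value_offset(i), value_offset(i + 1)) of the child values array, with value_length(i) as the difference. A minimal sketch of that offsets encoding using plain containers (not the Arrow classes):

    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    int main() {
      // Offsets encoding: list i covers values[offsets[i] .. offsets[i + 1]).
      // Three lists: {1, 2}, {} (empty), {3, 4, 5}.
      const std::vector<int32_t> offsets = {0, 2, 2, 5};
      const std::vector<int64_t> values = {1, 2, 3, 4, 5};

      for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
        const int32_t start = offsets[i];            // value_offset(i)
        const int32_t len = offsets[i + 1] - start;  // value_length(i)
        std::cout << "list " << i << ": ";
        for (int32_t j = 0; j < len; ++j) { std::cout << values[start + j] << ' '; }
        std::cout << '\n';
      }
      return 0;
    }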
RETURN_NOT_OK(ConvertData()); std::shared_ptr type; RETURN_NOT_OK(MakeDataType(&type)); - RETURN_NOT_OK(MakePrimitiveArray(type, length_, data_, null_count, null_bitmap_, out)); + RETURN_NOT_OK( + MakePrimitiveArray(type, length_, data_, null_bitmap_, null_count, kOffset, out)); return Status::OK(); } @@ -1657,7 +1659,7 @@ ArrowSerializer::ConvertTypedLists( // TODO: If there are bytes involved, convert to Binary representation bool have_bytes = false; - auto value_builder = std::make_shared(pool_, field->type); + auto value_builder = std::make_shared(pool_); ListBuilder list_builder(pool_, value_builder); PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); for (int64_t i = 0; i < length_; ++i) { diff --git a/python/src/pyarrow/io.cc b/python/src/pyarrow/io.cc index 92352607e62ec..aa4cb7b052c27 100644 --- a/python/src/pyarrow/io.cc +++ b/python/src/pyarrow/io.cc @@ -56,9 +56,20 @@ static Status CheckPyError() { return Status::OK(); } +// This is annoying: because C++11 does not allow implicit conversion of string +// literals to non-const char*, we need to go through some gymnastics to use +// PyObject_CallMethod without a lot of pain (its arguments are non-const +// char*) +template +static inline PyObject* cpp_PyObject_CallMethod( + PyObject* obj, const char* method_name, const char* argspec, ArgTypes... args) { + return PyObject_CallMethod( + obj, const_cast(method_name), const_cast(argspec), args...); +} + Status PythonFile::Close() { // whence: 0 for relative to start of file, 2 for end of file - PyObject* result = PyObject_CallMethod(file_, "close", "()"); + PyObject* result = cpp_PyObject_CallMethod(file_, "close", "()"); Py_XDECREF(result); ARROW_RETURN_NOT_OK(CheckPyError()); return Status::OK(); @@ -66,14 +77,14 @@ Status PythonFile::Close() { Status PythonFile::Seek(int64_t position, int whence) { // whence: 0 for relative to start of file, 2 for end of file - PyObject* result = PyObject_CallMethod(file_, "seek", "(ii)", position, whence); + PyObject* result = cpp_PyObject_CallMethod(file_, "seek", "(ii)", position, whence); Py_XDECREF(result); ARROW_RETURN_NOT_OK(CheckPyError()); return Status::OK(); } Status PythonFile::Read(int64_t nbytes, PyObject** out) { - PyObject* result = PyObject_CallMethod(file_, "read", "(i)", nbytes); + PyObject* result = cpp_PyObject_CallMethod(file_, "read", "(i)", nbytes); ARROW_RETURN_NOT_OK(CheckPyError()); *out = result; return Status::OK(); } @@ -84,7 +95,7 @@ Status PythonFile::Write(const uint8_t* data, int64_t nbytes) { PyBytes_FromStringAndSize(reinterpret_cast(data), nbytes); ARROW_RETURN_NOT_OK(CheckPyError()); - PyObject* result = PyObject_CallMethod(file_, "write", "(O)", py_data); + PyObject* result = cpp_PyObject_CallMethod(file_, "write", "(O)", py_data); Py_XDECREF(py_data); Py_XDECREF(result); ARROW_RETURN_NOT_OK(CheckPyError()); @@ -92,7 +103,7 @@ Status PythonFile::Write(const uint8_t* data, int64_t nbytes) { } Status PythonFile::Tell(int64_t* position) { - PyObject* result = PyObject_CallMethod(file_, "tell", "()"); + PyObject* result = cpp_PyObject_CallMethod(file_, "tell", "()"); ARROW_RETURN_NOT_OK(CheckPyError()); *position = PyLong_AsLongLong(result); From f268e927ada5cb637404769a136506c600582061 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 7 Feb 2017 15:37:33 +0100 Subject: [PATCH 0314/1644] ARROW-540: [C++] Build fixes after ARROW-33, PARQUET-866 Author: Wes McKinney Author: Wes McKinney Closes #325 from wesm/ARROW-540 and squashes the following commits: 9070baf [Wes McKinney] Change DCHECK_LT to
DCHECK_LE. Not sure why it fixes bug on OS X eb80701 [Wes McKinney] Fix API change --- cpp/src/arrow/buffer.cc | 2 +- cpp/src/arrow/column-benchmark.cc | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/cpp/src/arrow/buffer.cc b/cpp/src/arrow/buffer.cc index fb5a010efa225..18e9ed2015227 100644 --- a/cpp/src/arrow/buffer.cc +++ b/cpp/src/arrow/buffer.cc @@ -57,7 +57,7 @@ Status Buffer::Copy(int64_t start, int64_t nbytes, std::shared_ptr* out) std::shared_ptr SliceBuffer( const std::shared_ptr& buffer, int64_t offset, int64_t length) { - DCHECK_LT(offset, buffer->size()); + DCHECK_LE(offset, buffer->size()); DCHECK_LE(length, buffer->size() - offset); return std::make_shared(buffer, offset, length); } diff --git a/cpp/src/arrow/column-benchmark.cc b/cpp/src/arrow/column-benchmark.cc index 8a1c775d7376d..1bab5a8de0ca4 100644 --- a/cpp/src/arrow/column-benchmark.cc +++ b/cpp/src/arrow/column-benchmark.cc @@ -30,7 +30,7 @@ std::shared_ptr MakePrimitive(int32_t length, int32_t null_count = 0) { auto null_bitmap = std::make_shared(pool); data->Resize(length * sizeof(typename ArrayType::value_type)); null_bitmap->Resize(BitUtil::BytesForBits(length)); - return std::make_shared(length, data, 10, null_bitmap); + return std::make_shared(length, data, null_bitmap, 10); } } // anonymous namespace From 4c3481ea5438d52878f390b0f562f6113f111a8f Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Tue, 7 Feb 2017 11:13:00 -0500 Subject: [PATCH 0315/1644] ARROW-535: [Python] Add type mapping for NPY_LONGLONG Based on https://github.com/wesm/feather/pull/107 Author: Uwe L. Korn Closes #323 from xhochy/ARROW-535 and squashes the following commits: 72221fa [Uwe L. Korn] Address review comments 5d3c046 [Uwe L. Korn] ARROW-535: [Python] Add type mapping for NPY_LONGLONG --- python/pyarrow/tests/test_convert_pandas.py | 6 +++-- python/src/pyarrow/adapters/pandas.cc | 29 +++++++++++++++++++-- 2 files changed, 31 insertions(+), 4 deletions(-) diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index f04fbe5b139e4..960653dca279e 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -127,13 +127,14 @@ def test_float_nulls(self): tm.assert_frame_equal(result, ex_frame) def test_integer_no_nulls(self): - data = {} + data = OrderedDict() fields = [] numpy_dtypes = [('i1', A.int8()), ('i2', A.int16()), ('i4', A.int32()), ('i8', A.int64()), ('u1', A.uint8()), ('u2', A.uint16()), - ('u4', A.uint32()), ('u8', A.uint64())] + ('u4', A.uint32()), ('u8', A.uint64()), + ('longlong', A.int64()), ('ulonglong', A.uint64())] num_values = 100 for dtype, arrow_dtype in numpy_dtypes: @@ -148,6 +149,7 @@ def test_integer_no_nulls(self): schema = A.Schema.from_fields(fields) self._check_pandas_roundtrip(df, expected_schema=schema) + def test_integer_with_nulls(self): # pandas requires upcast to float dtype diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index 345dc9070e5b3..b4e0d2f9c138e 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -91,11 +91,17 @@ NPY_INT_DECL(INT8, Int8, int8_t); NPY_INT_DECL(INT16, Int16, int16_t); NPY_INT_DECL(INT32, Int32, int32_t); NPY_INT_DECL(INT64, Int64, int64_t); + NPY_INT_DECL(UINT8, UInt8, uint8_t); NPY_INT_DECL(UINT16, UInt16, uint16_t); NPY_INT_DECL(UINT32, UInt32, uint32_t); NPY_INT_DECL(UINT64, UInt64, uint64_t); +#if NPY_INT64 != NPY_LONGLONG +NPY_INT_DECL(LONGLONG, Int64, int64_t); 
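The DCHECK_LT to DCHECK_LE relaxation in SliceBuffer above makes a zero-length slice starting at offset == size() legal, which arises naturally when an array's offset points one past the last value. A sketch of the bounds check it implies (a hypothetical helper, not the Arrow API):

    #include <cassert>
    #include <cstdint>

    // Bounds predicate mirroring SliceBuffer's relaxed DCHECKs:
    // the offset may equal the size (an empty slice at the end is valid).
    static bool SliceInBounds(int64_t size, int64_t offset, int64_t length) {
      return offset <= size && length <= size - offset;
    }

    int main() {
      assert(SliceInBounds(8, 8, 0));   // empty slice at the very end: now allowed
      assert(SliceInBounds(8, 4, 4));   // suffix slice
      assert(!SliceInBounds(8, 9, 0));  // start past the end: still rejected
      assert(!SliceInBounds(8, 4, 5));  // would overrun the buffer
      return 0;
    }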
+NPY_INT_DECL(ULONGLONG, UInt64, uint64_t); +#endif + template <> struct npy_traits { typedef float value_type; @@ -1706,16 +1712,35 @@ Status PandasToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, return Status::Invalid("only handle 1-dimensional arrays"); } - switch (PyArray_DESCR(arr)->type_num) { + int type_num = PyArray_DESCR(arr)->type_num; + +#if (NPY_INT64 == NPY_LONGLONG) && (NPY_SIZEOF_LONGLONG == 8) + // Both LONGLONG and INT64 can be observed in the wild, which is buggy. We set + // U/LONGLONG to U/INT64 so things work properly. + if (type_num == NPY_LONGLONG) { + type_num = NPY_INT64; + } + if (type_num == NPY_ULONGLONG) { + type_num = NPY_UINT64; + } +#endif + + switch (type_num) { TO_ARROW_CASE(BOOL); TO_ARROW_CASE(INT8); TO_ARROW_CASE(INT16); TO_ARROW_CASE(INT32); TO_ARROW_CASE(INT64); +#if (NPY_INT64 != NPY_LONGLONG) + TO_ARROW_CASE(LONGLONG); +#endif TO_ARROW_CASE(UINT8); TO_ARROW_CASE(UINT16); TO_ARROW_CASE(UINT32); TO_ARROW_CASE(UINT64); +#if (NPY_UINT64 != NPY_ULONGLONG) + TO_ARROW_CASE(ULONGLONG); +#endif TO_ARROW_CASE(FLOAT32); TO_ARROW_CASE(FLOAT64); TO_ARROW_CASE(DATETIME); @@ -1726,7 +1751,7 @@ Status PandasToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, } break; default: std::stringstream ss; - ss << "unsupported type " << PyArray_DESCR(arr)->type_num << std::endl; + ss << "Unsupported numpy type " << PyArray_DESCR(arr)->type_num << std::endl; return Status::NotImplemented(ss.str()); } return Status::OK(); From e97fbe6407e8b15c6d3ef745f7a728e01d499a23 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Tue, 7 Feb 2017 11:17:28 -0500 Subject: [PATCH 0316/1644] ARROW-531: Python: Document jemalloc, extend Pandas section, add Getting Involved Author: Uwe L. Korn Closes #321 from xhochy/ARROW-531 and squashes the following commits: 55da9dc [Uwe L. Korn] ARROW-531: Python: Document jemalloc, extend Pandas section, add Getting Involved --- python/doc/getting_involved.rst | 37 +++++++++++++++++++++++ python/doc/index.rst | 2 ++ python/doc/install.rst | 5 ++-- python/doc/jemalloc.rst | 52 +++++++++++++++++++++++++++++++++ python/doc/pandas.rst | 8 ++++- python/doc/parquet.rst | 47 ++++++++++++++++++++++------- 6 files changed, 137 insertions(+), 14 deletions(-) create mode 100644 python/doc/getting_involved.rst create mode 100644 python/doc/jemalloc.rst diff --git a/python/doc/getting_involved.rst b/python/doc/getting_involved.rst new file mode 100644 index 0000000000000..90fa3e49aa191 --- /dev/null +++ b/python/doc/getting_involved.rst @@ -0,0 +1,37 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. 
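Because numpy can tag the same 8-byte integer data with either the INT64 or the LONGLONG type code depending on platform and construction path, the dispatch above canonicalizes the code before switching. A reduced model of that normalization, with hypothetical enum values standing in for numpy's real ones:

    #include <cassert>

    // Hypothetical stand-ins for numpy type codes; the real values differ.
    enum NpyTypeNum { kInt64, kUInt64, kLongLong, kULongLong, kFloat64 };

    // Canonicalize aliased 8-byte integer codes before dispatching, in the
    // spirit of the switch in PandasToArrow.
    static NpyTypeNum Canonicalize(NpyTypeNum t) {
      if (t == kLongLong) return kInt64;
      if (t == kULongLong) return kUInt64;
      return t;
    }

    int main() {
      assert(Canonicalize(kLongLong) == kInt64);
      assert(Canonicalize(kULongLong) == kUInt64);
      assert(Canonicalize(kFloat64) == kFloat64);
      return 0;
    }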
+ +Getting Involved +================ + +Right now the primary audience for Apache Arrow is the developers of data +systems; most people will use Apache Arrow indirectly through systems that use +it for internal data handling and interoperating with other Arrow-enabled +systems. + +Even if you do not plan to contribute to Apache Arrow itself or Arrow +integrations in other projects, we'd be happy to have you involved: + + * Join the mailing list: send an email to + `dev-subscribe@arrow.apache.org `_. + Share your ideas and use cases for the project or read through the + `Archive `_. + * Follow our activity on `JIRA `_ + * Learn the `Format / Specification + `_ + * Chat with us on `Slack `_ + diff --git a/python/doc/index.rst b/python/doc/index.rst index 6725ae707d90b..d64354be05520 100644 --- a/python/doc/index.rst +++ b/python/doc/index.rst @@ -37,10 +37,12 @@ structures. Installing pyarrow Pandas Module Reference + Getting Involved .. toctree:: :maxdepth: 2 :caption: Additional Features Parquet format + jemalloc MemoryPool diff --git a/python/doc/install.rst b/python/doc/install.rst index 1bab017301633..4d99fa0caf1de 100644 --- a/python/doc/install.rst +++ b/python/doc/install.rst @@ -120,10 +120,11 @@ Install `pyarrow` cd arrow/python - # --with-parquet enable the Apache Parquet support in PyArrow + # --with-parquet enables the Apache Parquet support in PyArrow + # --with-jemalloc enables the jemalloc allocator support in PyArrow # --build-type=release disables debugging information and turns on # compiler optimizations for native code - python setup.py build_ext --with-parquet --build-type=release install + python setup.py build_ext --with-parquet --with-jemalloc --build-type=release install python setup.py install .. warning:: diff --git a/python/doc/jemalloc.rst b/python/doc/jemalloc.rst new file mode 100644 index 0000000000000..33fe61729c1e9 --- /dev/null +++ b/python/doc/jemalloc.rst @@ -0,0 +1,52 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +jemalloc MemoryPool +=================== + +Arrow's default :class:`~pyarrow.memory.MemoryPool` uses the system's allocator +through the POSIX APIs. Although this already provides aligned allocation, the +POSIX interface doesn't support aligned reallocation. The default reallocation +strategy is to allocate a new region, copy over the old data and free the +previous region. Using `jemalloc `_ we can simply extend +the existing memory allocation to the requested size. While this may still be +linear in the size of allocated memory, it is orders of magnitude faster as only the page +mapping in the kernel is touched, not the actual data. + +The :mod:`~pyarrow.jemalloc` allocator is not enabled by default to allow the +use of the system allocator and/or other allocators like ``tcmalloc``.
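The speed difference the jemalloc document describes comes from avoiding the allocate-copy-free cycle: without in-place extension, every growth of an n-byte buffer costs an O(n) copy. A sketch of that copy-based fallback path (POSIX-style, 64-byte alignment assumed here for illustration; not Arrow's actual pool):

    #include <cstddef>
    #include <cstdint>
    #include <cstdlib>
    #include <cstring>

    // The allocate-copy-free growth path that in-place extension avoids.
    static uint8_t* GrowAligned(uint8_t* old_data, std::size_t old_size,
                                std::size_t new_size) {
      void* fresh = nullptr;
      if (posix_memalign(&fresh, 64, new_size) != 0) { return nullptr; }
      if (old_data != nullptr) {
        std::memcpy(fresh, old_data, old_size);  // O(old_size) copy on every growth
        std::free(old_data);
      }
      return static_cast<uint8_t*>(fresh);
    }

    int main() {
      uint8_t* buf = GrowAligned(nullptr, 0, 64);
      std::memset(buf, 1, 64);
      buf = GrowAligned(buf, 64, 128);  // the 64 live bytes are copied
      std::free(buf);
      return 0;
    }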
You can +either explicitly make it the default allocator or pass it only to single +operations. + +.. code:: python + + import pyarrow as pa + import pyarrow.jemalloc + import pyarrow.memory + + jemalloc_pool = pyarrow.jemalloc.default_pool() + + # Explicitly use jemalloc for allocating memory for an Arrow Table object + array = pa.Array.from_pylist([1, 2, 3], memory_pool=jemalloc_pool) + + # Set the global pool + pyarrow.memory.set_default_pool(jemalloc_pool) + # This operation has no explicit MemoryPool specified and will thus + # also use jemalloc for its allocations. + array = pa.Array.from_pylist([1, 2, 3]) + + diff --git a/python/doc/pandas.rst b/python/doc/pandas.rst index c225d1362c7b6..34445aed517d3 100644 --- a/python/doc/pandas.rst +++ b/python/doc/pandas.rst @@ -84,9 +84,11 @@ Pandas -> Arrow Conversion +------------------------+--------------------------+ | ``str`` / ``unicode`` | ``STRING`` | +------------------------+--------------------------+ +| ``pd.Categorical`` | ``DICTIONARY`` | ++------------------------+--------------------------+ | ``pd.Timestamp`` | ``TIMESTAMP(unit=ns)`` | +------------------------+--------------------------+ -| ``pd.Categorical`` | *not supported* | +| ``datetime.date`` | ``DATE`` | +------------------------+--------------------------+ Arrow -> Pandas Conversion @@ -109,5 +111,9 @@ Arrow -> Pandas Conversion +-------------------------------------+--------------------------------------------------------+ | ``STRING`` | ``str`` | +-------------------------------------+--------------------------------------------------------+ +| ``DICTIONARY`` | ``pd.Categorical`` | ++-------------------------------------+--------------------------------------------------------+ | ``TIMESTAMP(unit=*)`` | ``pd.Timestamp`` (``np.datetime64[ns]``) | +-------------------------------------+--------------------------------------------------------+ +| ``DATE`` | ``pd.Timestamp`` (``np.datetime64[ns]``) | ++-------------------------------------+--------------------------------------------------------+ diff --git a/python/doc/parquet.rst b/python/doc/parquet.rst index 674ed80f27ce3..8e011e4f19857 100644 --- a/python/doc/parquet.rst +++ b/python/doc/parquet.rst @@ -29,16 +29,30 @@ Reading Parquet To read a Parquet file into Arrow memory, you can use the following code snippet. It will read the whole Parquet file into memory as an -:class:`pyarrow.table.Table`. +:class:`~pyarrow.table.Table`. .. code-block:: python - import pyarrow - import pyarrow.parquet + import pyarrow.parquet as pq - A = pyarrow + table = pq.read_table('') +As DataFrames stored as Parquet are often split across multiple files, a +convenience method :meth:`~pyarrow.parquet.read_multiple_files` is provided. + +If you already have the Parquet data available in memory or get it from a non-file +source, you can utilize :class:`pyarrow.io.BufferReader` to read it from +memory. As input to the :class:`~pyarrow.io.BufferReader` you can either supply +a Python ``bytes`` object or a :class:`pyarrow.io.Buffer`. + +.. code:: python + + import pyarrow.io as paio + import pyarrow.parquet as pq + + buf = ... # either bytes or paio.Buffer + reader = paio.BufferReader(buf) + table = pq.read_table(reader) Writing Parquet --------------- @@ -49,13 +63,11 @@ method. .. code-block:: python - import pyarrow - import pyarrow.parquet - - A = pyarrow + import pyarrow as pa + import pyarrow.parquet as pq - table = A.Table(..) - A.parquet.write_table(table, '') + table = pa.Table(..)
+ pq.write_table(table, '') By default this will write the Table as a single RowGroup using ``DICTIONARY`` encoding. To increase the potential parallelism with which a query engine can process a Parquet file, set the ``chunk_size`` to a fraction of the total number of rows. If you also want to compress the columns, you can select a compression method using the ``compression`` argument. Typically, ``GZIP`` is the choice if you want to minimize size, and ``SNAPPY`` if you want performance. + +Instead of writing to a file, you can also write to Python ``bytes`` by +utilizing an :class:`pyarrow.io.InMemoryOutputStream()`: + +.. code:: python + + import pyarrow.io as paio + import pyarrow.parquet as pq + + table = ... + output = paio.InMemoryOutputStream() + pq.write_table(table, output) + pybytes = output.get_result().to_pybytes() From c322cbf225b5da5e17ceec0e9e7373852bcba85c Mon Sep 17 00:00:00 2001 From: Emilio Lahr-Vivaz Date: Tue, 7 Feb 2017 16:44:35 -0500 Subject: [PATCH 0317/1644] ARROW-366 Java Dictionary Vector I've added a dictionary type, and a partial implementation of a dictionary vector that just wraps an index vector and has a reference to a lookup vector. The spec seems to indicate that any array can be dictionary encoded, but the C++ implementation created a new type, so I went that way. Feedback would be appreciated - I want to make sure I'm on the right path. Author: Emilio Lahr-Vivaz Closes #309 from elahrvivaz/ARROW-366 and squashes the following commits: 60836ea [Emilio Lahr-Vivaz] removing dictionary ID from encoded vector 0871e13 [Emilio Lahr-Vivaz] ARROW-366 Adding Java dictionary vector --- .../vector/complex/DictionaryVector.java | 229 ++++++++++++++++++ .../apache/arrow/vector/types/Dictionary.java | 40 +++ .../apache/arrow/vector/types/pojo/Field.java | 35 ++- .../org/apache/arrow/vector/util/Text.java | 31 ++- .../arrow/vector/TestDictionaryVector.java | 154 ++++++++++++ 5 files changed, 482 insertions(+), 7 deletions(-) create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/DictionaryVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/types/Dictionary.java create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/TestDictionaryVector.java diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/DictionaryVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/DictionaryVector.java new file mode 100644 index 0000000000000..84760eadf2253 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/DictionaryVector.java @@ -0,0 +1,229 @@ +/******************************************************************************* + + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ ******************************************************************************/ +package org.apache.arrow.vector.complex; + +import io.netty.buffer.ArrowBuf; +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.OutOfMemoryException; +import org.apache.arrow.vector.NullableIntVector; +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.types.Dictionary; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.util.TransferPair; + +import java.util.HashMap; +import java.util.Iterator; +import java.util.Map; + +public class DictionaryVector implements ValueVector { + + private ValueVector indices; + private Dictionary dictionary; + + public DictionaryVector(ValueVector indices, Dictionary dictionary) { + this.indices = indices; + this.dictionary = dictionary; + } + + /** + * Dictionary encodes a vector. The dictionary will be built using the values from the vector. + * + * @param vector vector to encode + * @return dictionary encoded vector + */ + public static DictionaryVector encode(ValueVector vector) { + validateType(vector.getMinorType()); + Map lookUps = new HashMap<>(); + Map transfers = new HashMap<>(); + + ValueVector.Accessor accessor = vector.getAccessor(); + int count = accessor.getValueCount(); + + NullableIntVector indices = new NullableIntVector(vector.getField().getName(), vector.getAllocator()); + indices.allocateNew(count); + NullableIntVector.Mutator mutator = indices.getMutator(); + + int nextIndex = 0; + for (int i = 0; i < count; i++) { + Object value = accessor.getObject(i); + if (value != null) { // if it's null leave it null + Integer index = lookUps.get(value); + if (index == null) { + index = nextIndex++; + lookUps.put(value, index); + transfers.put(i, index); + } + mutator.set(i, index); + } + } + mutator.setValueCount(count); + + // copy the dictionary values into the dictionary vector + TransferPair dictionaryTransfer = vector.getTransferPair(vector.getAllocator()); + ValueVector dictionaryVector = dictionaryTransfer.getTo(); + dictionaryVector.allocateNewSafe(); + for (Map.Entry entry: transfers.entrySet()) { + dictionaryTransfer.copyValueSafe(entry.getKey(), entry.getValue()); + } + dictionaryVector.getMutator().setValueCount(transfers.size()); + Dictionary dictionary = new Dictionary(dictionaryVector, false); + + return new DictionaryVector(indices, dictionary); + } + + /** + * Dictionary encodes a vector with a provided dictionary. The dictionary must contain all values in the vector. 
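The first encode overload above builds its dictionary on the fly: a value's first occurrence claims the next index, repeats reuse it, and null slots stay null. The same idea in a compact C++ sketch (hypothetical and simplified to string values, not the Java API above):

    #include <cassert>
    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Returns the index column and the dictionary values.
    static std::pair<std::vector<int32_t>, std::vector<std::string>> Encode(
        const std::vector<std::string>& values) {
      std::unordered_map<std::string, int32_t> lookups;
      std::vector<int32_t> indices;
      std::vector<std::string> dictionary;
      for (const auto& v : values) {
        auto it = lookups.find(v);
        if (it == lookups.end()) {
          // First occurrence defines the next dictionary index.
          it = lookups.emplace(v, static_cast<int32_t>(dictionary.size())).first;
          dictionary.push_back(v);
        }
        indices.push_back(it->second);
      }
      return {indices, dictionary};
    }

    int main() {
      const auto result = Encode({"foo", "bar", "bar", "baz", "foo"});
      assert((result.first == std::vector<int32_t>{0, 1, 1, 2, 0}));
      assert((result.second == std::vector<std::string>{"foo", "bar", "baz"}));
      // Decoding is a plain gather: dictionary[index] restores each value.
      assert(result.second[result.first[4]] == "foo");
      return 0;
    }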
+ * + * @param vector vector to encode + * @param dictionary dictionary used for encoding + * @return dictionary encoded vector + */ + public static DictionaryVector encode(ValueVector vector, Dictionary dictionary) { + validateType(vector.getMinorType()); + // load dictionary values into a hashmap for lookup + ValueVector.Accessor dictionaryAccessor = dictionary.getDictionary().getAccessor(); + Map lookUps = new HashMap<>(dictionaryAccessor.getValueCount()); + for (int i = 0; i < dictionaryAccessor.getValueCount(); i++) { + // for primitive array types we need a wrapper that implements equals and hashcode appropriately + lookUps.put(dictionaryAccessor.getObject(i), i); + } + + // vector to hold our indices (dictionary encoded values) + NullableIntVector indices = new NullableIntVector(vector.getField().getName(), vector.getAllocator()); + NullableIntVector.Mutator mutator = indices.getMutator(); + + ValueVector.Accessor accessor = vector.getAccessor(); + int count = accessor.getValueCount(); + + indices.allocateNew(count); + + for (int i = 0; i < count; i++) { + Object value = accessor.getObject(i); + if (value != null) { // if it's null leave it null + // note: this may fail if value was not included in the dictionary + mutator.set(i, lookUps.get(value)); + } + } + mutator.setValueCount(count); + + return new DictionaryVector(indices, dictionary); + } + + /** + * Decodes a dictionary encoded array using the provided dictionary. + * + * @param indices dictionary encoded values, must be int type + * @param dictionary dictionary used to decode the values + * @return vector with values restored from dictionary + */ + public static ValueVector decode(ValueVector indices, Dictionary dictionary) { + ValueVector.Accessor accessor = indices.getAccessor(); + int count = accessor.getValueCount(); + ValueVector dictionaryVector = dictionary.getDictionary(); + // copy the dictionary values into the decoded vector + TransferPair transfer = dictionaryVector.getTransferPair(indices.getAllocator()); + transfer.getTo().allocateNewSafe(); + for (int i = 0; i < count; i++) { + Object index = accessor.getObject(i); + if (index != null) { + transfer.copyValueSafe(((Number) index).intValue(), i); + } + } + + ValueVector decoded = transfer.getTo(); + decoded.getMutator().setValueCount(count); + return decoded; + } + + private static void validateType(MinorType type) { + // byte arrays don't work as keys in our dictionary map - we could wrap them with something to + // implement equals and hashcode if we want that functionality + if (type == MinorType.VARBINARY || type == MinorType.LIST || type == MinorType.MAP || type == MinorType.UNION) { + throw new IllegalArgumentException("Dictionary encoding for complex types not implemented"); + } + } + + public ValueVector getIndexVector() { return indices; } + + public ValueVector getDictionaryVector() { return dictionary.getDictionary(); } + + public Dictionary getDictionary() { return dictionary; } + + @Override + public MinorType getMinorType() { return indices.getMinorType(); } + + @Override + public Field getField() { return indices.getField(); } + + // note: dictionary vector is not closed, as it may be shared + @Override + public void close() { indices.close(); } + + @Override + public void allocateNew() throws OutOfMemoryException { indices.allocateNew(); } + + @Override + public boolean allocateNewSafe() { return indices.allocateNewSafe(); } + + @Override + public BufferAllocator getAllocator() { return indices.getAllocator(); } + + @Override + public void 
setInitialCapacity(int numRecords) { indices.setInitialCapacity(numRecords); } + + @Override + public int getValueCapacity() { return indices.getValueCapacity(); } + + @Override + public int getBufferSize() { return indices.getBufferSize(); } + + @Override + public int getBufferSizeFor(int valueCount) { return indices.getBufferSizeFor(valueCount); } + + @Override + public Iterator iterator() { + return indices.iterator(); + } + + @Override + public void clear() { indices.clear(); } + + @Override + public TransferPair getTransferPair(BufferAllocator allocator) { return indices.getTransferPair(allocator); } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator) { return indices.getTransferPair(ref, allocator); } + + @Override + public TransferPair makeTransferPair(ValueVector target) { return indices.makeTransferPair(target); } + + @Override + public Accessor getAccessor() { return indices.getAccessor(); } + + @Override + public Mutator getMutator() { return indices.getMutator(); } + + @Override + public FieldReader getReader() { return indices.getReader(); } + + @Override + public ArrowBuf[] getBuffers(boolean clear) { return indices.getBuffers(clear); } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Dictionary.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Dictionary.java new file mode 100644 index 0000000000000..fbe1345f96aa3 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Dictionary.java @@ -0,0 +1,40 @@ +/******************************************************************************* + + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ ******************************************************************************/ +package org.apache.arrow.vector.types; + +import org.apache.arrow.vector.ValueVector; + +public class Dictionary { + + private ValueVector dictionary; + private boolean ordered; + + public Dictionary(ValueVector dictionary, boolean ordered) { + this.dictionary = dictionary; + this.ordered = ordered; + } + + public ValueVector getDictionary() { + return dictionary; + } + + public boolean isOrdered() { + return ordered; + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java index 412fc54b538da..2d528e4141907 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java @@ -24,6 +24,9 @@ import java.util.List; import java.util.Objects; +import com.fasterxml.jackson.annotation.JsonInclude; +import com.fasterxml.jackson.annotation.JsonInclude.Include; +import org.apache.arrow.flatbuf.DictionaryEncoding; import org.apache.arrow.vector.schema.TypeLayout; import org.apache.arrow.vector.schema.VectorLayout; @@ -37,6 +40,7 @@ public class Field { private final String name; private final boolean nullable; private final ArrowType type; + private final Long dictionary; private final List children; private final TypeLayout typeLayout; @@ -45,11 +49,13 @@ private Field( @JsonProperty("name") String name, @JsonProperty("nullable") boolean nullable, @JsonProperty("type") ArrowType type, + @JsonProperty("dictionary") Long dictionary, @JsonProperty("children") List children, @JsonProperty("typeLayout") TypeLayout typeLayout) { this.name = name; this.nullable = nullable; this.type = checkNotNull(type); + this.dictionary = dictionary; if (children == null) { this.children = ImmutableList.of(); } else { @@ -59,13 +65,22 @@ private Field( } public Field(String name, boolean nullable, ArrowType type, List children) { - this(name, nullable, type, children, TypeLayout.getTypeLayout(checkNotNull(type))); + this(name, nullable, type, null, children, TypeLayout.getTypeLayout(checkNotNull(type))); + } + + public Field(String name, boolean nullable, ArrowType type, Long dictionary, List children) { + this(name, nullable, type, dictionary, children, TypeLayout.getTypeLayout(checkNotNull(type))); } public static Field convertField(org.apache.arrow.flatbuf.Field field) { String name = field.name(); boolean nullable = field.nullable(); ArrowType type = getTypeForField(field); + DictionaryEncoding dictionaryEncoding = field.dictionary(); + Long dictionary = null; + if (dictionaryEncoding != null) { + dictionary = dictionaryEncoding.id(); + } ImmutableList.Builder layout = ImmutableList.builder(); for (int i = 0; i < field.layoutLength(); ++i) { layout.add(new org.apache.arrow.vector.schema.VectorLayout(field.layout(i))); @@ -75,8 +90,7 @@ public static Field convertField(org.apache.arrow.flatbuf.Field field) { childrenBuilder.add(convertField(field.children(i))); } List children = childrenBuilder.build(); - Field result = new Field(name, nullable, type, children, new TypeLayout(layout.build())); - return result; + return new Field(name, nullable, type, dictionary, children, new TypeLayout(layout.build())); } public void validate() { @@ -89,6 +103,11 @@ public void validate() { public int getField(FlatBufferBuilder builder) { int nameOffset = name == null ? 
-1 : builder.createString(name); int typeOffset = type.getType(builder); + int dictionaryOffset = -1; + if (dictionary != null) { + builder.addLong(dictionary); + dictionaryOffset = builder.offset(); + } int[] childrenData = new int[children.size()]; for (int i = 0; i < children.size(); i++) { childrenData[i] = children.get(i).getField(builder); @@ -107,6 +126,9 @@ public int getField(FlatBufferBuilder builder) { org.apache.arrow.flatbuf.Field.addNullable(builder, nullable); org.apache.arrow.flatbuf.Field.addTypeType(builder, type.getTypeID().getFlatbufID()); org.apache.arrow.flatbuf.Field.addType(builder, typeOffset); + if (dictionary != null) { + org.apache.arrow.flatbuf.Field.addDictionary(builder, dictionaryOffset); + } org.apache.arrow.flatbuf.Field.addChildren(builder, childrenOffset); org.apache.arrow.flatbuf.Field.addLayout(builder, layoutOffset); return org.apache.arrow.flatbuf.Field.endField(builder); @@ -124,6 +146,9 @@ public ArrowType getType() { return type; } + @JsonInclude(Include.NON_NULL) + public Long getDictionary() { return dictionary; } + public List getChildren() { return children; } @@ -141,6 +166,7 @@ public boolean equals(Object obj) { return Objects.equals(this.name, that.name) && Objects.equals(this.nullable, that.nullable) && Objects.equals(this.type, that.type) && + Objects.equals(this.dictionary, that.dictionary) && (Objects.equals(this.children, that.children) || (this.children == null && that.children.size() == 0) || (this.children.size() == 0 && that.children == null)); @@ -153,6 +179,9 @@ public String toString() { sb.append(name).append(": "); } sb.append(type); + if (dictionary != null) { + sb.append("[dictionary: ").append(dictionary).append("]"); + } if (!children.isEmpty()) { sb.append("<").append(Joiner.on(", ").join(children)).append(">"); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/Text.java b/java/vector/src/main/java/org/apache/arrow/vector/util/Text.java index 3919f0606cb20..3db4358ea9155 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/util/Text.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/Text.java @@ -299,6 +299,11 @@ public void readWithKnownLength(DataInput in, int len) throws IOException { /** Returns true iff o is a Text with the same contents. 
*/ @Override public boolean equals(Object o) { + if (o == this) { + return true; + } else if (o == null) { + return false; + } if (!(o instanceof Text)) { return false; } @@ -308,15 +313,33 @@ public boolean equals(Object o) { return false; } - byte[] thisBytes = Arrays.copyOf(this.getBytes(), getLength()); - byte[] thatBytes = Arrays.copyOf(that.getBytes(), getLength()); - return Arrays.equals(thisBytes, thatBytes); + // copied from Arrays.equals so we don't have to copy the byte arrays + for (int i = 0; i < length; i++) { + if (bytes[i] != that.bytes[i]) { + return false; + } + } + return true; } + /** + * Copied from Arrays.hashCode so we don't have to copy the byte array + * + * @return + */ @Override public int hashCode() { - return super.hashCode(); + if (bytes == null) { + return 0; + } + + int result = 1; + for (int i = 0; i < length; i++) { + result = 31 * result + bytes[i]; + } + + return result; } // / STATIC UTILITIES FROM HERE DOWN diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestDictionaryVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestDictionaryVector.java new file mode 100644 index 0000000000000..962950abec87a --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestDictionaryVector.java @@ -0,0 +1,154 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License.
+ */ +package org.apache.arrow.vector; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.complex.DictionaryVector; +import org.apache.arrow.vector.types.Dictionary; +import org.apache.arrow.vector.types.Types.MinorType; +import org.junit.After; +import org.junit.Before; +import org.junit.Test; + +import java.nio.charset.StandardCharsets; + +import static org.junit.Assert.assertArrayEquals; +import static org.junit.Assert.assertEquals; + +public class TestDictionaryVector { + + private BufferAllocator allocator; + + byte[] zero = "foo".getBytes(StandardCharsets.UTF_8); + byte[] one = "bar".getBytes(StandardCharsets.UTF_8); + byte[] two = "baz".getBytes(StandardCharsets.UTF_8); + + @Before + public void init() { + allocator = new DirtyRootAllocator(Long.MAX_VALUE, (byte) 100); + } + + @After + public void terminate() throws Exception { + allocator.close(); + } + + @Test + public void testEncodeStringsWithGeneratedDictionary() { + // Create a new value vector + try (final NullableVarCharVector vector = (NullableVarCharVector) MinorType.VARCHAR.getNewVector("foo", allocator, null)) { + final NullableVarCharVector.Mutator m = vector.getMutator(); + vector.allocateNew(512, 5); + + // set some values + m.setSafe(0, zero, 0, zero.length); + m.setSafe(1, one, 0, one.length); + m.setSafe(2, one, 0, one.length); + m.setSafe(3, two, 0, two.length); + m.setSafe(4, zero, 0, zero.length); + m.setValueCount(5); + + DictionaryVector encoded = DictionaryVector.encode(vector); + + try { + // verify values in the dictionary + ValueVector dictionary = encoded.getDictionaryVector(); + assertEquals(vector.getClass(), dictionary.getClass()); + + NullableVarCharVector.Accessor dictionaryAccessor = ((NullableVarCharVector) dictionary).getAccessor(); + assertEquals(3, dictionaryAccessor.getValueCount()); + assertArrayEquals(zero, dictionaryAccessor.get(0)); + assertArrayEquals(one, dictionaryAccessor.get(1)); + assertArrayEquals(two, dictionaryAccessor.get(2)); + + // verify indices + ValueVector indices = encoded.getIndexVector(); + assertEquals(NullableIntVector.class, indices.getClass()); + + NullableIntVector.Accessor indexAccessor = ((NullableIntVector) indices).getAccessor(); + assertEquals(5, indexAccessor.getValueCount()); + assertEquals(0, indexAccessor.get(0)); + assertEquals(1, indexAccessor.get(1)); + assertEquals(1, indexAccessor.get(2)); + assertEquals(2, indexAccessor.get(3)); + assertEquals(0, indexAccessor.get(4)); + + // now run through the decoder and verify we get the original back + try (ValueVector decoded = DictionaryVector.decode(indices, encoded.getDictionary())) { + assertEquals(vector.getClass(), decoded.getClass()); + assertEquals(vector.getAccessor().getValueCount(), decoded.getAccessor().getValueCount()); + for (int i = 0; i < 5; i++) { + assertEquals(vector.getAccessor().getObject(i), decoded.getAccessor().getObject(i)); + } + } + } finally { + encoded.getDictionaryVector().close(); + encoded.getIndexVector().close(); + } + } + } + + @Test + public void testEncodeStringsWithProvidedDictionary() { + // Create a new value vector + try (final NullableVarCharVector vector = (NullableVarCharVector) MinorType.VARCHAR.getNewVector("foo", allocator, null); + final NullableVarCharVector dictionary = (NullableVarCharVector) MinorType.VARCHAR.getNewVector("dict", allocator, null)) { + final NullableVarCharVector.Mutator m = vector.getMutator(); + vector.allocateNew(512, 5); + + // set some values + m.setSafe(0, zero, 0, zero.length); + m.setSafe(1, one, 0, 
one.length); + m.setSafe(2, one, 0, one.length); + m.setSafe(3, two, 0, two.length); + m.setSafe(4, zero, 0, zero.length); + m.setValueCount(5); + + // set some dictionary values + final NullableVarCharVector.Mutator m2 = dictionary.getMutator(); + dictionary.allocateNew(512, 3); + m2.setSafe(0, zero, 0, zero.length); + m2.setSafe(1, one, 0, one.length); + m2.setSafe(2, two, 0, two.length); + m2.setValueCount(3); + + try(final DictionaryVector encoded = DictionaryVector.encode(vector, new Dictionary(dictionary, false))) { + // verify indices + ValueVector indices = encoded.getIndexVector(); + assertEquals(NullableIntVector.class, indices.getClass()); + + NullableIntVector.Accessor indexAccessor = ((NullableIntVector) indices).getAccessor(); + assertEquals(5, indexAccessor.getValueCount()); + assertEquals(0, indexAccessor.get(0)); + assertEquals(1, indexAccessor.get(1)); + assertEquals(1, indexAccessor.get(2)); + assertEquals(2, indexAccessor.get(3)); + assertEquals(0, indexAccessor.get(4)); + + // now run through the decoder and verify we get the original back + try (ValueVector decoded = DictionaryVector.decode(indices, encoded.getDictionary())) { + assertEquals(vector.getClass(), decoded.getClass()); + assertEquals(vector.getAccessor().getValueCount(), decoded.getAccessor().getValueCount()); + for (int i = 0; i < 5; i++) { + assertEquals(vector.getAccessor().getObject(i), decoded.getAccessor().getObject(i)); + } + } + } + } + } +} From 1407abfc90c03e133f198b59fed48469d171c0a9 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 8 Feb 2017 09:16:57 +0100 Subject: [PATCH 0318/1644] ARROW-537: [C++] Do not compare String/Binary data in null slots when comparing arrays Author: Wes McKinney Closes #327 from wesm/ARROW-537 and squashes the following commits: 66b1961 [Wes McKinney] Do not compare String/Binary data in null slots when comparing arrays --- cpp/src/arrow/array-string-test.cc | 41 ++++++++++++++++++++ cpp/src/arrow/array.cc | 11 ++++-- cpp/src/arrow/array.h | 9 +++-- cpp/src/arrow/compare.cc | 55 ++++++++++++++++++--------- python/src/pyarrow/adapters/pandas.cc | 12 ++---- 5 files changed, 95 insertions(+), 33 deletions(-) diff --git a/cpp/src/arrow/array-string-test.cc b/cpp/src/arrow/array-string-test.cc index 8b7eb41d4c3b9..c4d9bf40f57f9 100644 --- a/cpp/src/arrow/array-string-test.cc +++ b/cpp/src/arrow/array-string-test.cc @@ -140,6 +140,47 @@ TEST_F(TestStringArray, TestEmptyStringComparison) { ASSERT_TRUE(strings_a->Equals(strings_b)); } +TEST_F(TestStringArray, CompareNullByteSlots) { + StringBuilder builder(default_memory_pool()); + StringBuilder builder2(default_memory_pool()); + StringBuilder builder3(default_memory_pool()); + + builder.Append("foo"); + builder2.Append("foo"); + builder3.Append("foo"); + + builder.Append("bar"); + builder2.AppendNull(); + + // same length, but different + builder3.Append("xyz"); + + builder.Append("baz"); + builder2.Append("baz"); + builder3.Append("baz"); + + std::shared_ptr array, array2, array3; + ASSERT_OK(builder.Finish(&array)); + ASSERT_OK(builder2.Finish(&array2)); + ASSERT_OK(builder3.Finish(&array3)); + + const auto& a1 = static_cast(*array); + const auto& a2 = static_cast(*array2); + const auto& a3 = static_cast(*array3); + + // The validity bitmaps are the same, the data is different, but the unequal + // portion is masked out + StringArray equal_array(3, a1.value_offsets(), a1.data(), a2.null_bitmap(), 1); + StringArray equal_array2(3, a3.value_offsets(), a3.data(), a2.null_bitmap(), 1); + + 
ASSERT_TRUE(equal_array.Equals(equal_array2)); + ASSERT_TRUE(a2.RangeEquals(equal_array2, 0, 3, 0)); + + ASSERT_TRUE(equal_array.Array::Slice(1)->Equals(equal_array2.Array::Slice(1))); + ASSERT_TRUE( + equal_array.Array::Slice(1)->RangeEquals(0, 2, 0, equal_array2.Array::Slice(1))); +} + // ---------------------------------------------------------------------- // String builder tests diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index f84023e6c7d31..39459a031f4b0 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -87,11 +87,16 @@ bool Array::ApproxEquals(const std::shared_ptr& arr) const { } bool Array::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const std::shared_ptr& arr) const { - if (!arr) { return false; } + const std::shared_ptr& other) const { + if (!other) { return false; } + return RangeEquals(*other, start_idx, end_idx, other_start_idx); +} + +bool Array::RangeEquals(const Array& other, int32_t start_idx, int32_t end_idx, + int32_t other_start_idx) const { bool are_equal = false; Status error = - ArrayRangeEquals(*this, *arr, start_idx, end_idx, other_start_idx, &are_equal); + ArrayRangeEquals(*this, other, start_idx, end_idx, other_start_idx, &are_equal); if (!error.ok()) { DCHECK(false) << "Arrays not comparable: " << error.ToString(); } return are_equal; } diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index f3e8f9a4982f7..32d156b8cd0f6 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -127,7 +127,10 @@ class ARROW_EXPORT Array { /// Compare if the range of slots specified are equal for the given array and /// this array. end_idx exclusive. This methods does not bounds check. bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, - const std::shared_ptr& arr) const; + const std::shared_ptr& other) const; + + bool RangeEquals(const Array& other, int32_t start_idx, int32_t end_idx, + int32_t other_start_idx) const; /// Determines if the array is internally consistent. 
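CompareNullByteSlots above constructs arrays whose validity bitmaps agree while the bytes behind a null slot differ, and expects equality: comparison must be masked by validity. A reduced model of that masked comparison (plain C++, not the Arrow visitor machinery):

    #include <cassert>
    #include <cstddef>
    #include <string>
    #include <vector>

    struct Column {
      std::vector<bool> valid;          // validity bitmap, one flag per slot
      std::vector<std::string> values;  // bytes behind null slots are arbitrary
    };

    // Equality masked by validity: bitmaps must match, and values are
    // compared only where the slot is valid.
    static bool MaskedEquals(const Column& a, const Column& b) {
      if (a.valid != b.valid) { return false; }
      for (std::size_t i = 0; i < a.valid.size(); ++i) {
        if (a.valid[i] && a.values[i] != b.values[i]) { return false; }
      }
      return true;
    }

    int main() {
      const Column a{{true, false, true}, {"foo", "bar", "baz"}};
      const Column b{{true, false, true}, {"foo", "xyz", "baz"}};  // differs only in the null slot
      assert(MaskedEquals(a, b));
      return 0;
    }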
/// @@ -315,8 +318,8 @@ class ARROW_EXPORT BinaryArray : public Array { // Account for base offset i += offset_; - const int32_t pos = raw_value_offsets_[i]; - *out_length = raw_value_offsets_[i + 1] - pos; + const int32_t pos = raw_value_offsets_[i + offset_]; + *out_length = raw_value_offsets_[i + offset_ + 1] - pos; return raw_data_ + pos; } diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc index 27fad7135721c..21fdb6633a9ee 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -335,15 +335,8 @@ class EqualsVisitor : public RangeEqualsVisitor { const int value_byte_size = size_meta.bit_width() / 8; DCHECK_GT(value_byte_size, 0); - const uint8_t* left_data = nullptr; - if (left.length() > 0) { - left_data = left.data()->data() + left.offset() * value_byte_size; - } - - const uint8_t* right_data = nullptr; - if (right.length() > 0) { - right_data = right.data()->data() + right.offset() * value_byte_size; - } + const uint8_t* left_data = left.data()->data() + left.offset() * value_byte_size; + const uint8_t* right_data = right.data()->data() + right.offset() * value_byte_size; if (left.null_count() > 0) { for (int i = 0; i < left.length(); ++i) { @@ -355,7 +348,6 @@ class EqualsVisitor : public RangeEqualsVisitor { } return true; } else { - if (left.length() == 0) { return true; } return memcmp(left_data, right_data, value_byte_size * left.length()) == 0; } } @@ -424,14 +416,35 @@ class EqualsVisitor : public RangeEqualsVisitor { bool equal_offsets = ValueOffsetsEqual(left); if (!equal_offsets) { return false; } - if (left.offset() == 0 && right.offset() == 0) { - if (!left.data() && !(right.data())) { return true; } - return left.data()->Equals(*right.data(), left.raw_value_offsets()[left.length()]); + if (!left.data() && !(right.data())) { return true; } + if (left.value_offset(left.length()) == 0) { return true; } + + const uint8_t* left_data = left.data()->data(); + const uint8_t* right_data = right.data()->data(); + + if (left.null_count() == 0) { + // Fast path for null count 0, single memcmp + if (left.offset() == 0 && right.offset() == 0) { + return std::memcmp( + left_data, right_data, left.raw_value_offsets()[left.length()]) == 0; + } else { + const int64_t total_bytes = + left.value_offset(left.length()) - left.value_offset(0); + return std::memcmp(left_data + left.value_offset(0), + right_data + right.value_offset(0), total_bytes) == 0; + } } else { - // Compare the corresponding data range - const int64_t total_bytes = left.value_offset(left.length()) - left.value_offset(0); - return std::memcmp(left.data()->data() + left.value_offset(0), - right.data()->data() + right.value_offset(0), total_bytes) == 0; + // ARROW-537: Only compare data in non-null slots + const int32_t* left_offsets = left.raw_value_offsets(); + const int32_t* right_offsets = right.raw_value_offsets(); + for (int32_t i = 0; i < left.length(); ++i) { + if (left.IsNull(i)) { continue; } + if (std::memcmp(left_data + left_offsets[i], right_data + right_offsets[i], + left.value_length(i))) { + return false; + } + } + return true; } } @@ -485,8 +498,6 @@ inline bool FloatingApproxEquals( static constexpr T EPSILON = 1E-5; - if (left.length() == 0 && right.length() == 0) { return true; } - if (left.null_count() > 0) { for (int32_t i = 0; i < left.length(); ++i) { if (left.IsNull(i)) continue; @@ -535,6 +546,8 @@ Status ArrayEquals(const Array& left, const Array& right, bool* are_equal) { *are_equal = true; } else if (!BaseDataEquals(left, right)) { *are_equal = false; + } else if 
(left.length() == 0) { + *are_equal = true; } else { EqualsVisitor visitor(right); RETURN_NOT_OK(left.Accept(&visitor)); @@ -549,6 +562,8 @@ Status ArrayRangeEquals(const Array& left, const Array& right, int32_t left_star *are_equal = true; } else if (left.type_enum() != right.type_enum()) { *are_equal = false; + } else if (left.length() == 0) { + *are_equal = true; } else { RangeEqualsVisitor visitor(right, left_start_idx, left_end_idx, right_start_idx); RETURN_NOT_OK(left.Accept(&visitor)); @@ -563,6 +578,8 @@ Status ArrayApproxEquals(const Array& left, const Array& right, bool* are_equal) *are_equal = true; } else if (!BaseDataEquals(left, right)) { *are_equal = false; + } else if (left.length() == 0) { + *are_equal = true; } else { ApproxEqualsVisitor visitor(right); RETURN_NOT_OK(left.Accept(&visitor)); diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index b4e0d2f9c138e..bdc2cb7d0025f 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -1717,12 +1717,8 @@ Status PandasToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, #if (NPY_INT64 == NPY_LONGLONG) && (NPY_SIZEOF_LONGLONG == 8) // Both LONGLONG and INT64 can be observed in the wild, which is buggy. We set // U/LONGLONG to U/INT64 so things work properly. - if (type_num == NPY_LONGLONG) { - type_num = NPY_INT64; - } - if (type_num == NPY_ULONGLONG) { - type_num = NPY_UINT64; - } + if (type_num == NPY_LONGLONG) { type_num = NPY_INT64; } + if (type_num == NPY_ULONGLONG) { type_num = NPY_UINT64; } #endif switch (type_num) { @@ -1732,14 +1728,14 @@ Status PandasToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, TO_ARROW_CASE(INT32); TO_ARROW_CASE(INT64); #if (NPY_INT64 != NPY_LONGLONG) - TO_ARROW_CASE(LONGLONG); + TO_ARROW_CASE(LONGLONG); #endif TO_ARROW_CASE(UINT8); TO_ARROW_CASE(UINT16); TO_ARROW_CASE(UINT32); TO_ARROW_CASE(UINT64); #if (NPY_UINT64 != NPY_ULONGLONG) - TO_ARROW_CASE(ULONGLONG); + TO_ARROW_CASE(ULONGLONG); #endif TO_ARROW_CASE(FLOAT32); TO_ARROW_CASE(FLOAT64); From b99d049c3d1894908b7e52774eb657675dc1f439 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Wed, 8 Feb 2017 11:20:48 -0500 Subject: [PATCH 0319/1644] ARROW-351: Time type has no unit Author: Julien Le Dem Closes #328 from julienledem/arrow_351 and squashes the following commits: 2497ee3 [Julien Le Dem] ARROW-351: Time type has no unit --- format/Message.fbs | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/format/Message.fbs b/format/Message.fbs index 028c56ad51618..86dfa87b04807 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -79,11 +79,12 @@ table Decimal { table Date { } +enum TimeUnit: short { SECOND, MILLISECOND, MICROSECOND, NANOSECOND } + table Time { + unit: TimeUnit; } -enum TimeUnit: short { SECOND, MILLISECOND, MICROSECOND, NANOSECOND } - /// time from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC. table Timestamp { unit: TimeUnit; From 4440e4011d88967a53054486f9eb0a0363a1c217 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Wed, 8 Feb 2017 15:05:29 -0500 Subject: [PATCH 0320/1644] ARROW-543: C++: Lazily computed null_counts counts number of non-null entries Author: Uwe L. Korn Closes #329 from xhochy/ARROW-543 and squashes the following commits: 191792b [Uwe L. 
Korn] ARROW-543: C++: Lazily computed null_counts counts number of non-null entries --- cpp/src/arrow/array-test.cc | 8 ++++---- cpp/src/arrow/array.cc | 3 ++- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc index 45130d8f64004..45ab2740b4c16 100644 --- a/cpp/src/arrow/array-test.cc +++ b/cpp/src/arrow/array-test.cc @@ -88,20 +88,20 @@ TEST_F(TestArray, TestEquality) { } TEST_F(TestArray, SliceRecomputeNullCount) { - std::vector<uint8_t> valid_bytes = {1, 0, 1, 1, 0, 1, 0, 0}; + std::vector<uint8_t> valid_bytes = {1, 0, 1, 1, 0, 1, 0, 0, 0}; auto array = MakeArrayFromValidBytes(valid_bytes, pool_); - ASSERT_EQ(4, array->null_count()); + ASSERT_EQ(5, array->null_count()); auto slice = array->Slice(1, 4); ASSERT_EQ(2, slice->null_count()); slice = array->Slice(4); - ASSERT_EQ(1, slice->null_count()); + ASSERT_EQ(4, slice->null_count()); slice = array->Slice(0); - ASSERT_EQ(4, slice->null_count()); + ASSERT_EQ(5, slice->null_count()); // No bitmap, compute 0 std::shared_ptr<Buffer> data; diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index 39459a031f4b0..bf368d91226be 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -48,13 +48,14 @@ Array::Array(const std::shared_ptr<DataType>& type, int32_t length, null_count_(null_count), null_bitmap_(null_bitmap), null_bitmap_data_(nullptr) { + if (null_count_ == 0) { null_bitmap_ = nullptr; } if (null_bitmap_) { null_bitmap_data_ = null_bitmap_->data(); } } int32_t Array::null_count() const { if (null_count_ < 0) { if (null_bitmap_) { - null_count_ = CountSetBits(null_bitmap_data_, offset_, length_); + null_count_ = length_ - CountSetBits(null_bitmap_data_, offset_, length_); } else { null_count_ = 0; } From 0bdfd5efb2d7360f8ec8f6a65401d4c76a8df597 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Wed, 8 Feb 2017 15:07:39 -0500 Subject: [PATCH 0321/1644] ARROW-538: [C++] Set up AddressSanitizer (ASAN) builds Most of the infrastructure was already in place, only needed to fix the gtest build. We will now build with AddressSanitizer activated on OSX. Author: Uwe L. Korn Closes #324 from xhochy/ARROW-538 and squashes the following commits: c2f8dda [Uwe L. Korn] Don't run AddressSanitizer on Travis f6b65e5 [Uwe L. Korn] Explicitly detected 3.6 8a20d91 [Uwe L. Korn] Log detected COMPILER_VERSION in error message acf3f69 [Uwe L.
Korn] ARROW-538: [C++] Set up AddressSanitizer (ASAN) builds --- cpp/CMakeLists.txt | 4 +++- cpp/build-support/run-test.sh | 4 ++-- cpp/build-support/sanitize-blacklist.txt | 22 ++++++++++++++++++++++ cpp/cmake_modules/CompilerInfo.cmake | 6 ++++++ cpp/cmake_modules/san-config.cmake | 5 +++-- cpp/src/arrow/buffer-test.cc | 3 +++ cpp/src/arrow/memory_pool-test.cc | 4 +++- python/manylinux1/Dockerfile-x86_64 | 2 +- 8 files changed, 43 insertions(+), 7 deletions(-) create mode 100644 cpp/build-support/sanitize-blacklist.txt diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index ff2c1a61b95a6..035cd8f9b90c7 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -428,10 +428,12 @@ if(ARROW_BUILD_TESTS) if("$ENV{GTEST_HOME}" STREQUAL "") if(APPLE) - set(GTEST_CMAKE_CXX_FLAGS "-fPIC -std=c++11 -stdlib=libc++ -DGTEST_USE_OWN_TR1_TUPLE=1 -Wno-unused-value -Wno-ignored-attributes") + set(GTEST_CMAKE_CXX_FLAGS "-fPIC -DGTEST_USE_OWN_TR1_TUPLE=1 -Wno-unused-value -Wno-ignored-attributes") else() set(GTEST_CMAKE_CXX_FLAGS "-fPIC") endif() + string(TOUPPER ${CMAKE_BUILD_TYPE} UPPERCASE_BUILD_TYPE) + set(GTEST_CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CMAKE_CXX_FLAGS_${UPPERCASE_BUILD_TYPE}} ${GTEST_CMAKE_CXX_FLAGS}") set(GTEST_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/googletest_ep-prefix/src/googletest_ep") set(GTEST_INCLUDE_DIR "${GTEST_PREFIX}/include") diff --git a/cpp/build-support/run-test.sh b/cpp/build-support/run-test.sh index f563da53679be..b4da4f3f02ee4 100755 --- a/cpp/build-support/run-test.sh +++ b/cpp/build-support/run-test.sh @@ -82,8 +82,8 @@ function setup_sanitizers() { # Enable leak detection even under LLVM 3.4, where it was disabled by default. # This flag only takes effect when running an ASAN build. - ASAN_OPTIONS="$ASAN_OPTIONS detect_leaks=1" - export ASAN_OPTIONS + # ASAN_OPTIONS="$ASAN_OPTIONS detect_leaks=1" + # export ASAN_OPTIONS # Set up suppressions for LeakSanitizer LSAN_OPTIONS="$LSAN_OPTIONS suppressions=$ROOT/build-support/lsan-suppressions.txt" diff --git a/cpp/build-support/sanitize-blacklist.txt b/cpp/build-support/sanitize-blacklist.txt new file mode 100644 index 0000000000000..f6900c643db90 --- /dev/null +++ b/cpp/build-support/sanitize-blacklist.txt @@ -0,0 +1,22 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# Workaround for a problem with gmock where a runtime error is caused by a call on a null pointer, +# on a mocked object. 
+# Seen error: +# thirdparty/gmock-1.7.0/include/gmock/gmock-spec-builders.h:1529:12: runtime error: member call on null pointer of type 'testing::internal::ActionResultHolder' +fun:*testing*internal*InvokeWith* diff --git a/cpp/cmake_modules/CompilerInfo.cmake b/cpp/cmake_modules/CompilerInfo.cmake index 187698f54507b..fe200be65d502 100644 --- a/cpp/cmake_modules/CompilerInfo.cmake +++ b/cpp/cmake_modules/CompilerInfo.cmake @@ -30,6 +30,12 @@ elseif("${COMPILER_VERSION_FULL}" MATCHES ".*clang version.*") set(COMPILER_FAMILY "clang") string(REGEX REPLACE ".*clang version ([0-9]+\\.[0-9]+).*" "\\1" COMPILER_VERSION "${COMPILER_VERSION_FULL}") + +# LLVM 3.6 on Mac OS X 10.9 and later +elseif("${COMPILER_VERSION_FULL}" MATCHES ".*based on LLVM 3\\.6\\..*") + set(COMPILER_FAMILY "clang") + set(COMPILER_VERSION "3.6.0svn") + # clang on Mac OS X 10.9 and later elseif("${COMPILER_VERSION_FULL}" MATCHES ".*based on LLVM.*") set(COMPILER_FAMILY "clang") diff --git a/cpp/cmake_modules/san-config.cmake b/cpp/cmake_modules/san-config.cmake index fe52fef12ea5d..1917eabe8b4b2 100644 --- a/cpp/cmake_modules/san-config.cmake +++ b/cpp/cmake_modules/san-config.cmake @@ -94,8 +94,9 @@ if ("${ARROW_USE_UBSAN}" OR "${ARROW_USE_ASAN}" OR "${ARROW_USE_TSAN}") # Require clang 3.4 or newer; clang 3.3 has issues with TSAN and pthread # symbol interception. if("${COMPILER_VERSION}" VERSION_LESS "3.4") - message(SEND_ERROR "Must use clang 3.4 or newer to run a sanitizer build." - " Try using clang from $NATIVE_TOOLCHAIN/") + message(SEND_ERROR "Must use clang 3.4 or newer to run a sanitizer build." + " Detected unsupported version ${COMPILER_VERSION}." + " Try using clang from $NATIVE_TOOLCHAIN/.") endif() add_definitions("-fsanitize-blacklist=${BUILD_SUPPORT_DIR}/sanitize-blacklist.txt") else() diff --git a/cpp/src/arrow/buffer-test.cc b/cpp/src/arrow/buffer-test.cc index 2ded1e11f96f8..d76e991b54378 100644 --- a/cpp/src/arrow/buffer-test.cc +++ b/cpp/src/arrow/buffer-test.cc @@ -67,11 +67,14 @@ TEST_F(TestBuffer, Resize) { } TEST_F(TestBuffer, ResizeOOM) { + // This test doesn't play nice with AddressSanitizer +#ifndef ADDRESS_SANITIZER // realloc fails, even though there may be no explicit limit PoolBuffer buf; ASSERT_OK(buf.Resize(100)); int64_t to_alloc = std::numeric_limits<int64_t>::max(); ASSERT_RAISES(OutOfMemory, buf.Resize(to_alloc)); +#endif } TEST_F(TestBuffer, EqualsWithSameContent) { diff --git a/cpp/src/arrow/memory_pool-test.cc b/cpp/src/arrow/memory_pool-test.cc index 3daf72755cff2..56bb32f0b5b27 100644 --- a/cpp/src/arrow/memory_pool-test.cc +++ b/cpp/src/arrow/memory_pool-test.cc @@ -32,7 +32,9 @@ TEST_F(TestDefaultMemoryPool, MemoryTracking) { } TEST_F(TestDefaultMemoryPool, OOM) { +#ifndef ADDRESS_SANITIZER this->TestOOM(); +#endif } TEST_F(TestDefaultMemoryPool, Reallocate) { @@ -41,7 +43,7 @@ TEST_F(TestDefaultMemoryPool, Reallocate) { // Death tests and valgrind are known to not play well 100% of the time.
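For context on the ADDRESS_SANITIZER guards used in buffer-test.cc and memory_pool-test.cc above: the patch itself does not show where that macro comes from. The conventional way to define it under clang is via the feature-test macro, roughly as in this sketch (an assumption about the build setup, not taken from this commit):

    // Define ADDRESS_SANITIZER when compiling under ASAN so that tests that
    // deliberately exhaust or misuse memory can be compiled out.
    #if defined(__has_feature)
    #if __has_feature(address_sanitizer)
    #define ADDRESS_SANITIZER
    #endif
    #endif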
See // googletest documentation -#ifndef ARROW_VALGRIND +#if !(defined(ARROW_VALGRIND) || defined(ADDRESS_SANITIZER)) TEST(DefaultMemoryPoolDeathTest, FreeLargeMemory) { MemoryPool* pool = default_memory_pool(); diff --git a/python/manylinux1/Dockerfile-x86_64 b/python/manylinux1/Dockerfile-x86_64 index 059158856f1f2..ac47108c84ae7 100644 --- a/python/manylinux1/Dockerfile-x86_64 +++ b/python/manylinux1/Dockerfile-x86_64 @@ -16,7 +16,7 @@ FROM quay.io/pypa/manylinux1_x86_64:latest RUN yum install -y flex openssl-devel WORKDIR / -RUN wget http://downloads.sourceforge.net/project/boost/boost/1.60.0/boost_1_60_0.tar.gz -O /boost_1_60_0.tar.gz +RUN wget --no-check-certificate http://downloads.sourceforge.net/project/boost/boost/1.60.0/boost_1_60_0.tar.gz -O /boost_1_60_0.tar.gz RUN tar xf boost_1_60_0.tar.gz WORKDIR /boost_1_60_0 RUN ./bootstrap.sh From 31f145dc5296d27cc8010a4cd17ca5b4ae461dff Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 9 Feb 2017 13:47:09 +0100 Subject: [PATCH 0322/1644] ARROW-545: [Python] Ignore non .parq/.parquet files when reading directories as Parquet datasets Author: Wes McKinney Closes #331 from wesm/ARROW-545 and squashes the following commits: 5494167 [Wes McKinney] Docstring typo 92b274c [Wes McKinney] Ignore non .parq/.parquet files when reading directories-as-Parquet-datasets --- python/pyarrow/__init__.py | 2 +- python/pyarrow/filesystem.py | 23 +++++++++++++++++------ python/pyarrow/parquet.py | 18 ++++++++++++++++-- python/pyarrow/tests/test_parquet.py | 4 ++++ 4 files changed, 38 insertions(+), 9 deletions(-) diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index ea4710d4137de..6724b52e6004e 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -67,4 +67,4 @@ from pyarrow.table import Column, RecordBatch, Table, concat_tables -localfs = LocalFilesystem() +localfs = LocalFilesystem.get_instance() diff --git a/python/pyarrow/filesystem.py b/python/pyarrow/filesystem.py index 82409b7666ab1..55bcad044305e 100644 --- a/python/pyarrow/filesystem.py +++ b/python/pyarrow/filesystem.py @@ -62,7 +62,7 @@ def isfile(self, path): """ raise NotImplementedError - def read_parquet(self, path, columns=None, schema=None): + def read_parquet(self, path, columns=None, metadata=None, schema=None): """ Read Parquet data from path in file system. Can read from a single file or a directory of files @@ -73,8 +73,11 @@ def read_parquet(self, path, columns=None, schema=None): Single file path or directory columns : List[str], optional Subset of columns to read + metadata : pyarrow.parquet.FileMetaData + Known metadata to validate files against schema : pyarrow.parquet.Schema - Known schema to validate files against + Known schema to validate files against. 
Alternative to metadata + argument Returns ------- @@ -85,18 +88,26 @@ def read_parquet(self, path, columns=None, metadata=None, schema=None): if self.isdir(path): paths_to_read = [] for path in self.ls(path): - if path == '_metadata' or path == '_common_metadata': - raise ValueError('No support yet for common metadata file') - paths_to_read.append(path) + if path.endswith('parq') or path.endswith('parquet'): + paths_to_read.append(path) else: paths_to_read = [path] return read_multiple_files(paths_to_read, columns=columns, - filesystem=self, schema=schema) + filesystem=self, schema=schema, + metadata=metadata) class LocalFilesystem(Filesystem): + _instance = None + + @classmethod + def get_instance(cls): + if cls._instance is None: + cls._instance = LocalFilesystem() + return cls._instance + @implements(Filesystem.ls) def ls(self, path): return sorted(pjoin(path, x) for x in os.listdir(path)) diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py index 6654b770ba33e..9766ff6dfa8e8 100644 --- a/python/pyarrow/parquet.py +++ b/python/pyarrow/parquet.py @@ -15,12 +15,17 @@ # specific language governing permissions and limitations # under the License. +import six + from pyarrow._parquet import (ParquetReader, FileMetaData, # noqa RowGroupMetaData, Schema, ParquetWriter) import pyarrow._parquet as _parquet # noqa from pyarrow.table import concat_tables +EXCLUDED_PARQUET_PATHS = {'_metadata', '_common_metadata', '_SUCCESS'} + + class ParquetFile(object): """ Open a Parquet binary file for reading @@ -82,8 +87,9 @@ def read_table(source, columns=None, nthreads=1, metadata=None): Parameters ---------- source: str or pyarrow.io.NativeFile - Readable source. For passing Python file objects or byte buffers, see - pyarrow.io.PythonFileInterface or pyarrow.io.BufferReader. + Location of Parquet dataset. If a string is passed, it can be a single + file name or directory name. For passing Python file objects or byte + buffers, see pyarrow.io.PythonFileInterface or pyarrow.io.BufferReader. columns: list If not None, only these columns will be read from the file.
nthreads : int, default 1 @@ -97,6 +103,14 @@ def read_table(source, columns=None, nthreads=1, metadata=None): pyarrow.Table Content of the file as a table (of columns) """ + from pyarrow.filesystem import LocalFilesystem + + if isinstance(source, six.string_types): + fs = LocalFilesystem.get_instance() + if fs.isdir(source): + return fs.read_parquet(source, columns=columns, + metadata=metadata) + pf = ParquetFile(source, metadata=metadata) return pf.read(columns=columns, nthreads=nthreads) diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index 80a995fbb6662..969f68b47b497 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -393,6 +393,10 @@ def test_read_multiple_files(tmpdir): test_data.append(table) paths.append(path) + # Write a _SUCCESS.crc file + with open(pjoin(dirpath, '_SUCCESS.crc'), 'wb') as f: + f.write(b'0') + result = pq.read_multiple_files(paths) expected = pa.concat_tables(test_data) From dc6cefde46c65ce4753bec3fbafc44a20944f9c9 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 9 Feb 2017 10:43:53 -0500 Subject: [PATCH 0323/1644] ARROW-521: [C++] Track peak allocations in default memory pool This should enable us to remove the `parquet::MemoryAllocator` implementation in parquet-cpp Author: Wes McKinney Closes #330 from wesm/ARROW-521 and squashes the following commits: 10531c4 [Wes McKinney] Move max_memory_ member to DefaultMemoryPool, add default virtual max_memory() to MemoryPool a0d134d [Wes McKinney] Add max_memory() method to MemoryPool, leave implementation to subclasses --- cpp/src/arrow/array-primitive-test.cc | 3 +- cpp/src/arrow/buffer-test.cc | 2 +- cpp/src/arrow/ipc/json-integration-test.cc | 6 ++- cpp/src/arrow/memory_pool-test.cc | 17 +++++++ cpp/src/arrow/memory_pool.cc | 54 +++++++++++----------- cpp/src/arrow/memory_pool.h | 31 +++++++++++++ cpp/src/arrow/table.cc | 6 ++- cpp/src/arrow/util/logging.h | 4 +- cpp/src/arrow/util/macros.h | 2 +- 9 files changed, 89 insertions(+), 36 deletions(-) diff --git a/cpp/src/arrow/array-primitive-test.cc b/cpp/src/arrow/array-primitive-test.cc index a20fdbf8b9166..f8bbd774d483c 100644 --- a/cpp/src/arrow/array-primitive-test.cc +++ b/cpp/src/arrow/array-primitive-test.cc @@ -242,7 +242,8 @@ void TestPrimitiveBuilder::Check( } typedef ::testing::Types<PBoolean, PUInt8, PUInt16, PUInt32, PUInt64, PInt8, PInt16, - PInt32, PInt64, PFloat, PDouble> Primitives; + PInt32, PInt64, PFloat, PDouble> + Primitives; TYPED_TEST_CASE(TestPrimitiveBuilder, Primitives); diff --git a/cpp/src/arrow/buffer-test.cc b/cpp/src/arrow/buffer-test.cc index d76e991b54378..934fcfef14856 100644 --- a/cpp/src/arrow/buffer-test.cc +++ b/cpp/src/arrow/buffer-test.cc @@ -67,7 +67,7 @@ TEST_F(TestBuffer, Resize) { } TEST_F(TestBuffer, ResizeOOM) { - // This test doesn't play nice with AddressSanitizer +// This test doesn't play nice with AddressSanitizer #ifndef ADDRESS_SANITIZER // realloc fails, even though there may be no explicit limit PoolBuffer buf; diff --git a/cpp/src/arrow/ipc/json-integration-test.cc b/cpp/src/arrow/ipc/json-integration-test.cc index 17ccc4ac1d0da..95bc742054fab 100644 --- a/cpp/src/arrow/ipc/json-integration-test.cc +++ b/cpp/src/arrow/ipc/json-integration-test.cc @@ -144,8 +144,10 @@ static Status ValidateArrowVsJson( if (!json_schema->Equals(arrow_schema)) { std::stringstream ss; - ss << "JSON schema: \n" << json_schema->ToString() << "\n" - << "Arrow schema: \n" << arrow_schema->ToString(); + ss << "JSON schema: \n" + << json_schema->ToString() << "\n" + << "Arrow schema: \n" + << arrow_schema->ToString(); if (FLAGS_verbose) {
std::cout << ss.str() << std::endl; } return Status::Invalid("Schemas did not match"); diff --git a/cpp/src/arrow/memory_pool-test.cc b/cpp/src/arrow/memory_pool-test.cc index 56bb32f0b5b27..6ab73fb103f50 100644 --- a/cpp/src/arrow/memory_pool-test.cc +++ b/cpp/src/arrow/memory_pool-test.cc @@ -59,6 +59,23 @@ TEST(DefaultMemoryPoolDeathTest, FreeLargeMemory) { pool->Free(data, 100); } +TEST(DefaultMemoryPoolDeathTest, MaxMemory) { + DefaultMemoryPool pool; + + ASSERT_EQ(0, pool.max_memory()); + + uint8_t* data; + ASSERT_OK(pool.Allocate(100, &data)); + + uint8_t* data2; + ASSERT_OK(pool.Allocate(100, &data2)); + + pool.Free(data, 100); + pool.Free(data2, 100); + + ASSERT_EQ(200, pool.max_memory()); +} + #endif // ARROW_VALGRIND } // namespace arrow diff --git a/cpp/src/arrow/memory_pool.cc b/cpp/src/arrow/memory_pool.cc index aea5e210f4980..8d85a089a65c9 100644 --- a/cpp/src/arrow/memory_pool.cc +++ b/cpp/src/arrow/memory_pool.cc @@ -60,36 +60,30 @@ Status AllocateAligned(int64_t size, uint8_t** out) { } } // namespace -MemoryPool::~MemoryPool() {} - -class InternalMemoryPool : public MemoryPool { - public: - InternalMemoryPool() : bytes_allocated_(0) {} - virtual ~InternalMemoryPool(); - - Status Allocate(int64_t size, uint8_t** out) override; - Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) override; +MemoryPool::MemoryPool() {} - void Free(uint8_t* buffer, int64_t size) override; +MemoryPool::~MemoryPool() {} - int64_t bytes_allocated() const override; +int64_t MemoryPool::max_memory() const { + return -1; +} - private: - mutable std::mutex pool_lock_; - int64_t bytes_allocated_; -}; +DefaultMemoryPool::DefaultMemoryPool() : bytes_allocated_(0) { + max_memory_ = 0; +} -Status InternalMemoryPool::Allocate(int64_t size, uint8_t** out) { - std::lock_guard<std::mutex> guard(pool_lock_); +Status DefaultMemoryPool::Allocate(int64_t size, uint8_t** out) { RETURN_NOT_OK(AllocateAligned(size, out)); bytes_allocated_ += size; + { + std::lock_guard<std::mutex> guard(lock_); + if (bytes_allocated_ > max_memory_) { max_memory_ = bytes_allocated_.load(); } + } return Status::OK(); } -Status InternalMemoryPool::Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) { - std::lock_guard<std::mutex> guard(pool_lock_); - +Status DefaultMemoryPool::Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) { // Note: We cannot use realloc() here as it doesn't guarantee alignment.
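// A compact sketch of the peak-tracking scheme used in Allocate() above;
// the names here are illustrative and only <atomic> and <mutex> are assumed.
// The running total stays a lock-free atomic, and the mutex is taken only
// to fold the new total into the high-water mark:
//
//   std::atomic<int64_t> total{0};
//   std::mutex mutex;
//   int64_t peak = 0;  // guarded by mutex
//
//   void OnAllocate(int64_t size) {
//     int64_t now = total += size;  // atomic fetch-add, returns new total
//     std::lock_guard<std::mutex> guard(mutex);
//     if (now > peak) { peak = now; }
//   }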
// Allocate new chunk @@ -105,17 +99,19 @@ Status InternalMemoryPool::Reallocate(int64_t old_size, int64_t new_size, uint8_ *ptr = out; bytes_allocated_ += new_size - old_size; + { + std::lock_guard<std::mutex> guard(lock_); + if (bytes_allocated_ > max_memory_) { max_memory_ = bytes_allocated_.load(); } + } return Status::OK(); } -int64_t InternalMemoryPool::bytes_allocated() const { - std::lock_guard<std::mutex> guard(pool_lock_); - return bytes_allocated_; +int64_t DefaultMemoryPool::bytes_allocated() const { + return bytes_allocated_.load(); } -void InternalMemoryPool::Free(uint8_t* buffer, int64_t size) { - std::lock_guard<std::mutex> guard(pool_lock_); +void DefaultMemoryPool::Free(uint8_t* buffer, int64_t size) { DCHECK_GE(bytes_allocated_, size); #ifdef _MSC_VER _aligned_free(buffer); @@ -125,10 +121,14 @@ void InternalMemoryPool::Free(uint8_t* buffer, int64_t size) { bytes_allocated_ -= size; } -InternalMemoryPool::~InternalMemoryPool() {} +int64_t DefaultMemoryPool::max_memory() const { + return max_memory_.load(); +} + +DefaultMemoryPool::~DefaultMemoryPool() {} MemoryPool* default_memory_pool() { - static InternalMemoryPool default_memory_pool_; + static DefaultMemoryPool default_memory_pool_; return &default_memory_pool_; } diff --git a/cpp/src/arrow/memory_pool.h b/cpp/src/arrow/memory_pool.h index 89477b6ddeab0..33d4c3e9aad52 100644 --- a/cpp/src/arrow/memory_pool.h +++ b/cpp/src/arrow/memory_pool.h @@ -18,7 +18,9 @@ #ifndef ARROW_UTIL_MEMORY_POOL_H #define ARROW_UTIL_MEMORY_POOL_H +#include <atomic> #include <cstdint> +#include <mutex> #include "arrow/util/visibility.h" @@ -56,6 +58,35 @@ class ARROW_EXPORT MemoryPool { /// The number of bytes that were allocated and not yet free'd through /// this allocator. virtual int64_t bytes_allocated() const = 0; + + /// Return peak memory allocation in this memory pool + /// + /// \return Maximum bytes allocated.
If not known (or not implemented), + /// returns -1 + virtual int64_t max_memory() const; + + protected: + MemoryPool(); +}; + +class ARROW_EXPORT DefaultMemoryPool : public MemoryPool { + public: + DefaultMemoryPool(); + virtual ~DefaultMemoryPool(); + + Status Allocate(int64_t size, uint8_t** out) override; + Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) override; + + void Free(uint8_t* buffer, int64_t size) override; + + int64_t bytes_allocated() const override; + + int64_t max_memory() const override; + + private: + mutable std::mutex lock_; + std::atomic<int64_t> bytes_allocated_; + std::atomic<int64_t> max_memory_; }; ARROW_EXPORT MemoryPool* default_memory_pool(); diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc index 9e31ba5af0ce3..a9e0909b8b741 100644 --- a/cpp/src/arrow/table.cc +++ b/cpp/src/arrow/table.cc @@ -106,7 +106,8 @@ Status Table::FromRecordBatches(const std::string& name, if (!batches[i]->schema()->Equals(schema)) { std::stringstream ss; ss << "Schema at index " << static_cast<int>(i) << " was different: \n" - << schema->ToString() << "\nvs\n" << batches[i]->schema()->ToString(); + << schema->ToString() << "\nvs\n" + << batches[i]->schema()->ToString(); return Status::Invalid(ss.str()); } } @@ -138,7 +139,8 @@ Status ConcatenateTables(const std::string& output_name, if (!tables[i]->schema()->Equals(schema)) { std::stringstream ss; ss << "Schema at index " << static_cast<int>(i) << " was different: \n" - << schema->ToString() << "\nvs\n" << tables[i]->schema()->ToString(); + << schema->ToString() << "\nvs\n" + << tables[i]->schema()->ToString(); return Status::Invalid(ss.str()); } } diff --git a/cpp/src/arrow/util/logging.h b/cpp/src/arrow/util/logging.h index 06ee8411e283c..b22f07dd6345f 100644 --- a/cpp/src/arrow/util/logging.h +++ b/cpp/src/arrow/util/logging.h @@ -118,9 +118,9 @@ class CerrLog { class FatalLog : public CerrLog { public: explicit FatalLog(int /* severity */) // NOLINT - : CerrLog(ARROW_FATAL) {} // NOLINT + : CerrLog(ARROW_FATAL){} // NOLINT - [[noreturn]] ~FatalLog() { + [[noreturn]] ~FatalLog() { if (has_logged_) { std::cerr << std::endl; } std::exit(1); } diff --git a/cpp/src/arrow/util/macros.h b/cpp/src/arrow/util/macros.h index 81a9b0cff5687..c4a62a475b92f 100644 --- a/cpp/src/arrow/util/macros.h +++ b/cpp/src/arrow/util/macros.h @@ -25,6 +25,6 @@ TypeName& operator=(const TypeName&) = delete #endif -#define UNUSED(x) (void) x +#define UNUSED(x) (void)x #endif // ARROW_UTIL_MACROS_H From 3add9181f98810bcfeae558bf44093d9ab89bc3f Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 9 Feb 2017 10:45:35 -0500 Subject: [PATCH 0324/1644] ARROW-476: Add binary integration test fixture, add Java support @julienledem could you review my Java changes?
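The core of this change is transporting binary values as base16 text. A minimal sketch of the encoding step, assuming nothing beyond the 16-character digit table that the C++ diff below names kAsciiTable (the function and constant names here are illustrative):

    #include <cstdint>
    #include <string>

    static const char* kHexDigits = "0123456789ABCDEF";

    std::string HexEncode(const uint8_t* data, int32_t length) {
      std::string out;
      out.reserve(length * 2);
      for (int32_t i = 0; i < length; ++i) {
        out.push_back(kHexDigits[data[i] >> 4]);   // high nibble
        out.push_back(kHexDigits[data[i] & 15]);   // low nibble
      }
      return out;
    }

Decoding reverses this two characters at a time, which is what ParseHexValue below implements on the C++ side and commons-codec's Hex.decodeHex covers on the Java side.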
Thanks Author: Wes McKinney Closes #326 from wesm/ARROW-476 and squashes the following commits: a75228d [Wes McKinney] Use PoolBuffer instead of std::vector e5a96a0 [Wes McKinney] Chain exceptions b23b852 [Wes McKinney] Use hexadecimal for transporting binary data in JSON 1d4e850 [Wes McKinney] Compare byte[] with Arrays.equals e5f13d5 [Wes McKinney] Add binary integration test fixture, add to JsonFileReader.java, but fails --- cpp/src/arrow/ipc/json-internal.cc | 55 ++++++++++++++++++- integration/integration_test.py | 55 ++++++++++++++++--- java/vector/pom.xml | 5 ++ .../vector/file/json/JsonFileReader.java | 14 +++++ .../vector/file/json/JsonFileWriter.java | 6 ++ .../apache/arrow/vector/util/Validator.java | 4 ++ 6 files changed, 129 insertions(+), 10 deletions(-) diff --git a/cpp/src/arrow/ipc/json-internal.cc b/cpp/src/arrow/ipc/json-internal.cc index 1a95b2ce470b2..b9f97dd2bbd15 100644 --- a/cpp/src/arrow/ipc/json-internal.cc +++ b/cpp/src/arrow/ipc/json-internal.cc @@ -17,7 +17,10 @@ #include "arrow/ipc/json-internal.h" +#include #include +#include +#include #include #include #include @@ -40,6 +43,8 @@ namespace arrow { namespace ipc { +static const char* kAsciiTable = "0123456789ABCDEF"; + using RjArray = rj::Value::ConstArray; using RjObject = rj::Value::ConstObject; @@ -395,14 +400,26 @@ class JsonArrayWriter : public ArrayVisitor { } } - // String (Utf8), Binary + // Binary, encode to hexadecimal. UTF8 string write as is template <typename T> typename std::enable_if<std::is_base_of<BinaryArray, T>::value, void>::type WriteDataValues(const T& arr) { for (int i = 0; i < arr.length(); ++i) { int32_t length; const char* buf = reinterpret_cast<const char*>(arr.GetValue(i, &length)); - writer_->String(buf, length); + + if (std::is_base_of<StringArray, T>::value) { + writer_->String(buf, length); + } else { + std::string hex_string; + hex_string.reserve(length * 2); + for (int32_t j = 0; j < length; ++j) { + // Convert to 2 base16 digits + hex_string.push_back(kAsciiTable[buf[j] >> 4]); + hex_string.push_back(kAsciiTable[buf[j] & 15]); + } + writer_->String(hex_string); + } } } @@ -773,6 +790,20 @@ class JsonSchemaReader { const rj::Value& json_schema_; }; +static inline Status ParseHexValue(const char* data, uint8_t* out) { + char c1 = data[0]; + char c2 = data[1]; + + const char* pos1 = std::lower_bound(kAsciiTable, kAsciiTable + 16, c1); + const char* pos2 = std::lower_bound(kAsciiTable, kAsciiTable + 16, c2); + + // Error checking + if (*pos1 != c1 || *pos2 != c2) { return Status::Invalid("Encountered non-hex digit"); } + + *out = (pos1 - kAsciiTable) << 4 | (pos2 - kAsciiTable); + return Status::OK(); +} + class JsonArrayReader { public: explicit JsonArrayReader(MemoryPool* pool) : pool_(pool) {} @@ -852,6 +883,8 @@ class JsonArrayReader { const auto& json_data_arr = json_data->value.GetArray(); DCHECK_EQ(static_cast<int32_t>(json_data_arr.Size()), length); + + auto byte_buffer = std::make_shared<PoolBuffer>(pool_); for (int i = 0; i < length; ++i) { if (!is_valid[i]) { builder.AppendNull(); @@ const rj::Value& val = json_data_arr[i]; DCHECK(val.IsString()); - builder.Append(val.GetString()); + if (std::is_base_of<StringBuilder, BuilderType>::value) { + builder.Append(val.GetString()); + } else { + std::string hex_string = val.GetString(); + + DCHECK(hex_string.size() % 2 == 0) << "Expected base16 hex string"; + int64_t length = static_cast<int64_t>(hex_string.size()) / 2; + + if (byte_buffer->size() < length) { RETURN_NOT_OK(byte_buffer->Resize(length)); } + + const char* hex_data = hex_string.c_str(); + uint8_t* byte_buffer_data = byte_buffer->mutable_data(); + for
(int64_t j = 0; j < length; ++j) { + RETURN_NOT_OK(ParseHexValue(hex_data + j * 2, &byte_buffer_data[j])); + } + RETURN_NOT_OK(builder.Append(byte_buffer_data, length)); + } } return builder.Finish(array); diff --git a/integration/integration_test.py b/integration/integration_test.py index a622bf228a651..1d8dc29a9f529 100644 --- a/integration/integration_test.py +++ b/integration/integration_test.py @@ -241,14 +241,18 @@ def generate_column(self, size): return PrimitiveColumn(self.name, size, is_valid, values) -class StringType(PrimitiveType): +class BinaryType(PrimitiveType): @property def numpy_type(self): return object + @property + def column_class(self): + return BinaryColumn + def _get_type(self): - return OrderedDict([('name', 'utf8')]) + return OrderedDict([('name', 'binary')]) def _get_type_layout(self): return OrderedDict([ @@ -260,6 +264,32 @@ def _get_type_layout(self): OrderedDict([('type', 'DATA'), ('typeBitWidth', 8)])])]) + def generate_column(self, size): + K = 7 + is_valid = self._make_is_valid(size) + values = [] + + for i in range(size): + if is_valid[i]: + draw = (np.random.randint(0, 255, size=K) + .astype(np.uint8) + .tostring()) + values.append(draw) + else: + values.append("") + + return self.column_class(self.name, size, is_valid, values) + + +class StringType(BinaryType): + + @property + def column_class(self): + return StringColumn + + def _get_type(self): + return OrderedDict([('name', 'utf8')]) + def generate_column(self, size): K = 7 is_valid = self._make_is_valid(size) @@ -271,7 +301,7 @@ def generate_column(self, size): else: values.append("") - return StringColumn(self.name, size, is_valid, values) + return self.column_class(self.name, size, is_valid, values) class JSONSchema(object): @@ -285,7 +315,10 @@ def get_json(self): ]) -class StringColumn(PrimitiveColumn): +class BinaryColumn(PrimitiveColumn): + + def _encode_value(self, x): + return ''.join('{:02x}'.format(c).upper() for c in x) def _get_buffers(self): offset = 0 @@ -299,7 +332,7 @@ def _get_buffers(self): v = "" offsets.append(offset) - data.append(v) + data.append(self._encode_value(v)) return [ ('VALIDITY', [int(x) for x in self.is_valid]), @@ -308,6 +341,12 @@ def _get_buffers(self): ] +class StringColumn(BinaryColumn): + + def _encode_value(self, x): + return x + + class ListType(DataType): def __init__(self, name, value_type, nullable=True): @@ -443,7 +482,9 @@ def write(self, path): def get_field(name, type_, nullable=True): - if type_ == 'utf8': + if type_ == 'binary': + return BinaryType(name, nullable=nullable) + elif type_ == 'utf8': return StringType(name, nullable=nullable) dtype = np.dtype(type_) @@ -463,7 +504,7 @@ def get_field(name, type_, nullable=True): def generate_primitive_case(): types = ['bool', 'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64', - 'float32', 'float64', 'utf8'] + 'float32', 'float64', 'binary', 'utf8'] fields = [] diff --git a/java/vector/pom.xml b/java/vector/pom.xml index 64b68bf8a1588..8517d4ced80f1 100644 --- a/java/vector/pom.xml +++ b/java/vector/pom.xml @@ -56,6 +56,11 @@ commons-lang3 3.4 + + commons-codec + commons-codec + 1.10 + diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java index 71fe88e156a5d..24fdc184523b3 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java @@ -48,6 +48,7 
@@ import org.apache.arrow.vector.UInt8Vector; import org.apache.arrow.vector.ValueVector; import org.apache.arrow.vector.ValueVector.Mutator; +import org.apache.arrow.vector.VarBinaryVector; import org.apache.arrow.vector.VarCharVector; import org.apache.arrow.vector.VectorSchemaRoot; import org.apache.arrow.vector.complex.NullableMapVector; @@ -60,6 +61,8 @@ import com.fasterxml.jackson.core.JsonToken; import com.fasterxml.jackson.databind.MappingJsonFactory; import com.google.common.base.Objects; +import org.apache.commons.codec.DecoderException; +import org.apache.commons.codec.binary.Hex; public class JsonFileReader implements AutoCloseable { private final File inputFile; @@ -164,6 +167,14 @@ private void readVector(Field field, FieldVector vector) throws JsonParseExcepti readToken(END_OBJECT); } + private byte[] decodeHexSafe(String hexString) throws IOException { + try { + return Hex.decodeHex(hexString.toCharArray()); + } catch (DecoderException e) { + throw new IOException("Unable to decode hex string: " + hexString, e); + } + } + private void setValueFromParser(ValueVector valueVector, int i) throws IOException { switch (valueVector.getMinorType()) { case BIT: @@ -199,6 +210,9 @@ private void setValueFromParser(ValueVector valueVector, int i) throws IOExcepti case FLOAT8: ((Float8Vector)valueVector).getMutator().set(i, parser.readValueAs(Double.class)); break; + case VARBINARY: + ((VarBinaryVector)valueVector).getMutator().setSafe(i, decodeHexSafe(parser.readValueAs(String.class))); + break; case VARCHAR: ((VarCharVector)valueVector).getMutator().setSafe(i, parser.readValueAs(String.class).getBytes(UTF_8)); break; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java index ddc80433cb6db..99040b67e1cd3 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java @@ -30,6 +30,7 @@ import org.apache.arrow.vector.TimeStampNanoVector; import org.apache.arrow.vector.ValueVector; import org.apache.arrow.vector.ValueVector.Accessor; +import org.apache.arrow.vector.VarBinaryVector; import org.apache.arrow.vector.VectorSchemaRoot; import org.apache.arrow.vector.schema.ArrowVectorType; import org.apache.arrow.vector.types.pojo.Field; @@ -40,6 +41,7 @@ import com.fasterxml.jackson.core.util.DefaultPrettyPrinter; import com.fasterxml.jackson.core.util.DefaultPrettyPrinter.NopIndenter; import com.fasterxml.jackson.databind.MappingJsonFactory; +import org.apache.commons.codec.binary.Hex; public class JsonFileWriter implements AutoCloseable { @@ -157,6 +159,10 @@ private void writeValueToGenerator(ValueVector valueVector, int i) throws IOExce case BIT: generator.writeNumber(((BitVector)valueVector).getAccessor().get(i)); break; + case VARBINARY: + String hexString = Hex.encodeHexString(((VarBinaryVector) valueVector).getAccessor().get(i)); + generator.writeObject(hexString); + break; default: // TODO: each type Accessor accessor = valueVector.getAccessor(); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/Validator.java b/java/vector/src/main/java/org/apache/arrow/vector/util/Validator.java index a97458254151d..f294e20b029c5 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/util/Validator.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/Validator.java @@ -17,6 +17,7 @@ */ package org.apache.arrow.vector.util; 
+import java.util.Arrays; import java.util.List; import org.apache.arrow.vector.FieldVector; @@ -89,7 +90,10 @@ static boolean equals(ArrowType type, final Object o1, final Object o2) { default: throw new UnsupportedOperationException("unsupported precision: " + fpType); } + } else if (type instanceof ArrowType.Binary) { + return Arrays.equals((byte[]) o1, (byte[]) o2); } + return Objects.equal(o1, o2); } From 0ab4252453be025f13df2a825e67fbfbbb608ab9 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Thu, 9 Feb 2017 16:19:38 -0500 Subject: [PATCH 0325/1644] ARROW-546: Python: Account for changes in PARQUET-867 Author: Uwe L. Korn Closes #332 from xhochy/ARROW-546 and squashes the following commits: ca019c5 [Uwe L. Korn] ARROW-546: Python: Account for changes in PARQUET-867 --- python/pyarrow/_parquet.pxd | 2 +- python/pyarrow/_parquet.pyx | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/python/pyarrow/_parquet.pxd b/python/pyarrow/_parquet.pxd index 005be91bdb97f..e106252189f42 100644 --- a/python/pyarrow/_parquet.pxd +++ b/python/pyarrow/_parquet.pxd @@ -229,7 +229,7 @@ cdef extern from "parquet/arrow/schema.h" namespace "parquet::arrow" nogil: cdef extern from "parquet/arrow/writer.h" namespace "parquet::arrow" nogil: cdef CStatus WriteTable( - const CTable* table, CMemoryPool* pool, + const CTable& table, CMemoryPool* pool, const shared_ptr[OutputStream]& sink, int64_t chunk_size, const shared_ptr[WriterProperties]& properties) diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx index 796c436ec46f4..08c7bb5d8b1bc 100644 --- a/python/pyarrow/_parquet.pyx +++ b/python/pyarrow/_parquet.pyx @@ -543,6 +543,6 @@ cdef class ParquetWriter: cdef int c_row_group_size = row_group_size with nogil: - check_status(WriteTable(ctable, self.allocator, - self.sink, c_row_group_size, - self.properties)) + check_status(WriteTable(deref(ctable), self.allocator, + self.sink, c_row_group_size, + self.properties)) From 42b55d98c151d87e5a7c6442800f3e5b9316499b Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 10 Feb 2017 13:15:39 -0500 Subject: [PATCH 0326/1644] ARROW-544: [C++] Test writing zero-length record batches, zero-length BinaryArray fixes I believe this should fix the failure reported in the Spark integration work. We'll need to upgrade the conda test packages to verify. 
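Before the diffs, a note on what the zero-length fixes mean concretely: a length-0 BinaryArray may legitimately carry null buffers, so every constructor and loader path has to guard its raw-pointer setup. A minimal sketch of the invariant this commit establishes (editorial; it assumes only the constructor call exercised by the new LengthZeroCtor test below):

    #include "arrow/array.h"

    void ZeroLengthBinaryIsValid() {
      // Both the value_offsets and data buffers are null here; with the
      // guards added in array.cc below this constructs safely instead of
      // dereferencing a null Buffer.
      arrow::BinaryArray empty(0, nullptr, nullptr);
    }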
cc @BryanCutler Author: Wes McKinney Closes #333 from wesm/ARROW-544 and squashes the following commits: f80d58f [Wes McKinney] Protect zero-length record batches from incomplete buffer metadata f876dce [Wes McKinney] Test with null value_offsets too 1dc7733 [Wes McKinney] Test writing zero-length record batches, misc zero-length fixes --- cpp/src/arrow/array-string-test.cc | 4 ++ cpp/src/arrow/array.cc | 11 +++- cpp/src/arrow/ipc/adapter.cc | 27 +++++--- cpp/src/arrow/ipc/ipc-adapter-test.cc | 89 +++++++++++++++++++-------- 4 files changed, 94 insertions(+), 37 deletions(-) diff --git a/cpp/src/arrow/array-string-test.cc b/cpp/src/arrow/array-string-test.cc index c4d9bf40f57f9..d8a35851c1238 100644 --- a/cpp/src/arrow/array-string-test.cc +++ b/cpp/src/arrow/array-string-test.cc @@ -470,4 +470,8 @@ TEST_F(TestStringArray, TestSliceEquality) { CheckSliceEquality(); } +TEST_F(TestBinaryArray, LengthZeroCtor) { + BinaryArray array(0, nullptr, nullptr); +} + } // namespace arrow diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index bf368d91226be..81678e354a608 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -270,9 +270,12 @@ BinaryArray::BinaryArray(const std::shared_ptr<DataType>& type, int32_t length, const std::shared_ptr<Buffer>& null_bitmap, int32_t null_count, int32_t offset) : Array(type, length, null_bitmap, null_count, offset), value_offsets_(value_offsets), - raw_value_offsets_(reinterpret_cast<const int32_t*>(value_offsets_->data())), + raw_value_offsets_(nullptr), data_(data), raw_data_(nullptr) { + if (value_offsets_ != nullptr) { + raw_value_offsets_ = reinterpret_cast<const int32_t*>(value_offsets_->data()); + } if (data_ != nullptr) { raw_data_ = data_->data(); } } @@ -384,8 +387,10 @@ UnionArray::UnionArray(const std::shared_ptr<DataType>& type, int32_t length, : Array(type, length, null_bitmap, null_count, offset), children_(children), type_ids_(type_ids), - value_offsets_(value_offsets) { - raw_type_ids_ = reinterpret_cast<const uint8_t*>(type_ids->data()); + raw_type_ids_(nullptr), + value_offsets_(value_offsets), + raw_value_offsets_(nullptr) { + if (type_ids) { raw_type_ids_ = reinterpret_cast<const uint8_t*>(type_ids->data()); } if (value_offsets) { raw_value_offsets_ = reinterpret_cast<const int32_t*>(value_offsets->data()); } diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index 3613ccbadbbab..f36ff37db15a3 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -602,8 +602,13 @@ class ArrayLoader : public TypeVisitor { std::shared_ptr<Buffer> offsets; std::shared_ptr<Buffer> values; - RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &offsets)); - RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &values)); + if (field_meta.length > 0) { + RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &offsets)); + RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &values)); + } else { + context_->buffer_index += 2; + offsets = values = nullptr; + } result_ = std::make_shared( field_meta.length, offsets, values, null_bitmap, field_meta.null_count); @@ -661,7 +666,12 @@ class ArrayLoader : public TypeVisitor { RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); std::shared_ptr<Buffer> offsets; - RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &offsets)); + if (field_meta.length > 0) { + RETURN_NOT_OK(GetBuffer(context_->buffer_index, &offsets)); + } else { + offsets = nullptr; + } + ++context_->buffer_index; const int num_children = type.num_children(); if (num_children != 1) { @@ -708,13 +718,16 @@ class ArrayLoader : public TypeVisitor { std::shared_ptr<Buffer> null_bitmap; RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); -
std::shared_ptr<Buffer> type_ids; + std::shared_ptr<Buffer> type_ids = nullptr; std::shared_ptr<Buffer> offsets = nullptr; - RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &type_ids)); - if (type.mode == UnionMode::DENSE) { - RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &offsets)); + if (field_meta.length > 0) { + RETURN_NOT_OK(GetBuffer(context_->buffer_index, &type_ids)); + if (type.mode == UnionMode::DENSE) { + RETURN_NOT_OK(GetBuffer(context_->buffer_index + 1, &offsets)); + } } + context_->buffer_index += type.mode == UnionMode::DENSE? 2 : 1; std::vector<std::shared_ptr<Array>> fields; RETURN_NOT_OK(LoadChildren(type.children(), &fields)); diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc index bae6578f110f2..d11b95b167d21 100644 --- a/cpp/src/arrow/ipc/ipc-adapter-test.cc +++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc @@ -71,6 +71,42 @@ class TestWriteRecordBatch : public ::testing::TestWithParam<MakeRecordBatch*>, return ReadRecordBatch(metadata, batch.schema(), &buffer_reader, batch_result); } + void CheckRoundtrip(const RecordBatch& batch, int64_t buffer_size) { + std::shared_ptr<RecordBatch> batch_result; + + ASSERT_OK(RoundTripHelper(batch, 1 << 16, &batch_result)); + EXPECT_EQ(batch.num_rows(), batch_result->num_rows()); + + ASSERT_TRUE(batch.schema()->Equals(batch_result->schema())); + ASSERT_EQ(batch.num_columns(), batch_result->num_columns()) + << batch.schema()->ToString() + << " result: " << batch_result->schema()->ToString(); + + for (int i = 0; i < batch.num_columns(); ++i) { + const auto& left = *batch.column(i); + const auto& right = *batch_result->column(i); + if (!left.Equals(right)) { + std::stringstream pp_result; + std::stringstream pp_expected; + + ASSERT_OK(PrettyPrint(left, 0, &pp_expected)); + ASSERT_OK(PrettyPrint(right, 0, &pp_result)); + + FAIL() << "Index: " << i << " Expected: " << pp_expected.str() + << "\nGot: " << pp_result.str(); + } + } + } + + void CheckRoundtrip(const std::shared_ptr<Array>& array, int64_t buffer_size) { + auto f0 = arrow::field("f0", array->type()); + std::vector<std::shared_ptr<Field>> fields = {f0}; + auto schema = std::make_shared<Schema>(fields); + + RecordBatch batch(schema, 0, {array}); + CheckRoundtrip(batch, buffer_size); + } + protected: std::shared_ptr<io::MemoryMappedFile> mmap_; MemoryPool* pool_; @@ -79,48 +115,47 @@ class TestWriteRecordBatch : public ::testing::TestWithParam<MakeRecordBatch*>, TEST_P(TestWriteRecordBatch, RoundTrip) { std::shared_ptr<RecordBatch> batch; ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue - std::shared_ptr<RecordBatch> batch_result; - ASSERT_OK(RoundTripHelper(*batch, 1 << 16, &batch_result)); - - // do checks - ASSERT_TRUE(batch->schema()->Equals(batch_result->schema())); - ASSERT_EQ(batch->num_columns(), batch_result->num_columns()) - << batch->schema()->ToString() << " result: " << batch_result->schema()->ToString(); - EXPECT_EQ(batch->num_rows(), batch_result->num_rows()); - for (int i = 0; i < batch->num_columns(); ++i) { - EXPECT_TRUE(batch->column(i)->Equals(batch_result->column(i))) - << "Idx: " << i << " Name: " << batch->column_name(i); - } + + CheckRoundtrip(*batch, 1 << 20); } TEST_P(TestWriteRecordBatch, SliceRoundTrip) { std::shared_ptr<RecordBatch> batch; ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue - std::shared_ptr<RecordBatch> batch_result; // Skip the zero-length case if (batch->num_rows() < 2) { return; } auto sliced_batch = batch->Slice(2, 10); + CheckRoundtrip(*sliced_batch, 1 << 20); +} + +TEST_P(TestWriteRecordBatch, ZeroLengthArrays) { + std::shared_ptr<RecordBatch> batch; + ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue - ASSERT_OK(RoundTripHelper(*sliced_batch, 1 << 16,
&batch_result)); + std::shared_ptr<RecordBatch> zero_length_batch; + if (batch->num_rows() > 2) { + zero_length_batch = batch->Slice(2, 0); + } else { + zero_length_batch = batch->Slice(0, 0); + } - EXPECT_EQ(sliced_batch->num_rows(), batch_result->num_rows()); + CheckRoundtrip(*zero_length_batch, 1 << 20); - for (int i = 0; i < sliced_batch->num_columns(); ++i) { - const auto& left = *sliced_batch->column(i); - const auto& right = *batch_result->column(i); - if (!left.Equals(right)) { - std::stringstream pp_result; - std::stringstream pp_expected; + // ARROW-544: check binary array + std::shared_ptr<Buffer> value_offsets; + ASSERT_OK(AllocateBuffer(pool_, sizeof(int32_t), &value_offsets)); + *reinterpret_cast<int32_t*>(value_offsets->mutable_data()) = 0; - ASSERT_OK(PrettyPrint(left, 0, &pp_expected)); - ASSERT_OK(PrettyPrint(right, 0, &pp_result)); + std::shared_ptr<Array> bin_array = std::make_shared<BinaryArray>(0, value_offsets, + std::make_shared<Buffer>(nullptr, 0), std::make_shared<Buffer>(nullptr, 0)); - FAIL() << "Index: " << i << " Expected: " << pp_expected.str() - << "\nGot: " << pp_result.str(); - } - } + // null value_offsets + std::shared_ptr<Array> bin_array2 = std::make_shared<BinaryArray>(0, nullptr, nullptr); + + CheckRoundtrip(bin_array, 1 << 20); + CheckRoundtrip(bin_array2, 1 << 20); } INSTANTIATE_TEST_CASE_P( From e4845c447cbef12fa7543f372fbc744fa833fee1 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sun, 12 Feb 2017 07:53:40 -0500 Subject: [PATCH 0327/1644] ARROW-551: C++: Construction of Column with nullptr Array segfaults Author: Uwe L. Korn Closes #335 from xhochy/ARROW-551 and squashes the following commits: 440d4a9 [Uwe L. Korn] ARROW-551: C++: Construction of Column with nullptr Array segfaults --- cpp/src/arrow/column-test.cc | 4 ++++ cpp/src/arrow/column.cc | 6 +++++- 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/cpp/src/arrow/column-test.cc b/cpp/src/arrow/column-test.cc index 0bbfc831f5cb9..24d58c80b9fae 100644 --- a/cpp/src/arrow/column-test.cc +++ b/cpp/src/arrow/column-test.cc @@ -135,6 +135,10 @@ TEST_F(TestColumn, BasicAPI) { ASSERT_EQ(300, column_->length()); ASSERT_EQ(30, column_->null_count()); ASSERT_EQ(3, column_->data()->num_chunks()); + + // nullptr array should not break + column_.reset(new Column(field, std::shared_ptr<Array>(nullptr))); + ASSERT_NE(column_.get(), nullptr); } TEST_F(TestColumn, ChunksInhomogeneous) { diff --git a/cpp/src/arrow/column.cc b/cpp/src/arrow/column.cc index 9cc0f579dc5bd..1376f6534ece1 100644 --- a/cpp/src/arrow/column.cc +++ b/cpp/src/arrow/column.cc @@ -90,7 +90,11 @@ Column::Column(const std::shared_ptr<Field>& field, const ArrayVector& chunks) Column::Column(const std::shared_ptr<Field>& field, const std::shared_ptr<Array>& data) : field_(field) { - data_ = std::make_shared<ChunkedArray>(ArrayVector({data})); + if (data) { + data_ = std::make_shared<ChunkedArray>(ArrayVector({data})); + } else { + data_ = std::make_shared<ChunkedArray>(ArrayVector({})); + } } Column::Column( From 1f26040f55eb54e00dc5e67ce0c1df64e51a1567 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 13 Feb 2017 09:52:59 +0100 Subject: [PATCH 0328/1644] ARROW-548: [Python] Add nthreads to Filesystem.read_parquet and pass through Author: Wes McKinney Closes #337 from wesm/ARROW-548 and squashes the following commits: b9aeaeb [Wes McKinney] Add nthreads to Filesystem.read_parquet and pass through --- python/pyarrow/filesystem.py | 9 +++++++-- python/pyarrow/parquet.py | 4 ++-- python/pyarrow/tests/test_parquet.py | 8 +++++++- 3 files changed, 16 insertions(+), 5 deletions(-) diff --git a/python/pyarrow/filesystem.py b/python/pyarrow/filesystem.py index
55bcad044305e..e820806ab4e68 100644 --- a/python/pyarrow/filesystem.py +++ b/python/pyarrow/filesystem.py @@ -62,7 +62,8 @@ def isfile(self, path): """ raise NotImplementedError - def read_parquet(self, path, columns=None, metadata=None, schema=None): + def read_parquet(self, path, columns=None, metadata=None, schema=None, + nthreads=1): """ Read Parquet data from path in file system. Can read from a single file or a directory of files @@ -78,6 +79,9 @@ def read_parquet(self, path, columns=None, metadata=None, schema=None): schema : pyarrow.parquet.Schema Known schema to validate files against. Alternative to metadata argument + nthreads : int, default 1 + Number of columns to read in parallel. If > 1, requires that the + underlying file source is threadsafe Returns ------- @@ -95,7 +99,8 @@ def read_parquet(self, path, columns=None, metadata=None, schema=None): return read_multiple_files(paths_to_read, columns=columns, filesystem=self, schema=schema, - metadata=metadata) + metadata=metadata, + nthreads=nthreads) class LocalFilesystem(Filesystem): diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py index 9766ff6dfa8e8..fa96f95698013 100644 --- a/python/pyarrow/parquet.py +++ b/python/pyarrow/parquet.py @@ -59,8 +59,8 @@ def read(self, nrows=None, columns=None, nthreads=1): columns: list If not None, only these columns will be read from the file. nthreads : int, default 1 - Number of columns to read in parallel. Requires that the underlying - file source is threadsafe + Number of columns to read in parallel. If > 1, requires that the + underlying file source is threadsafe Returns ------- diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index 969f68b47b497..96f2d15e312f2 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -320,17 +320,20 @@ def test_compare_schemas(): assert fileh.schema[0].equals(fileh.schema[0]) assert not fileh.schema[0].equals(fileh.schema[1]) + @parquet def test_column_of_lists(tmpdir): df, schema = dataframe_with_arrays() filename = tmpdir.join('pandas_rountrip.parquet') - arrow_table = pa.Table.from_pandas(df, timestamps_to_ms=True, schema=schema) + arrow_table = pa.Table.from_pandas(df, timestamps_to_ms=True, + schema=schema) pq.write_table(arrow_table, filename.strpath, version="2.0") table_read = pq.read_table(filename.strpath) df_read = table_read.to_pandas() pdt.assert_frame_equal(df, df_read) + @parquet def test_multithreaded_read(): df = alltypes_sample(size=10000) @@ -418,6 +421,9 @@ def test_read_multiple_files(tmpdir): expected = pa.Table.from_arrays(to_read) assert result.equals(expected) + # Read with multiple threads + pa.localfs.read_parquet(dirpath, nthreads=2) + # Test failure modes with non-uniform metadata bad_apple = _test_dataframe(size, seed=i).iloc[:, :4] bad_apple_path = tmpdir.join('{0}.parquet'.format(guid())).strpath From ad0157547a4f5e6e51fa2f712c2ed9477489a20c Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Mon, 13 Feb 2017 09:03:46 -0500 Subject: [PATCH 0329/1644] ARROW-553: C++: Faster valid bitmap building Author: Uwe L. Korn Closes #338 from xhochy/ARROW-553 and squashes the following commits: 1c1ee3d [Uwe L. 
Korn] ARROW-553: C++: Faster valid bitmap building --- cpp/src/arrow/builder.cc | 41 +++++++++++++++++-- .../jemalloc/jemalloc-builder-benchmark.cc | 2 +- 2 files changed, 38 insertions(+), 5 deletions(-) diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index dddadeee0dacf..f5c13f9e77ef1 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -114,18 +114,51 @@ void ArrayBuilder::UnsafeAppendToBitmap(const uint8_t* valid_bytes, int32_t leng UnsafeSetNotNull(length); return; } + + int byte_offset = length_ / 8; + int bit_offset = length_ % 8; + uint8_t bitset = null_bitmap_data_[byte_offset]; + for (int32_t i = 0; i < length; ++i) { - // TODO(emkornfield) Optimize for large values of length? - UnsafeAppendToBitmap(valid_bytes[i] > 0); + if (valid_bytes[i]) { + bitset |= (1 << bit_offset); + } else { + bitset &= ~(1 << bit_offset); + ++null_count_; + } + + bit_offset++; + if (bit_offset == 8) { + bit_offset = 0; + null_bitmap_data_[byte_offset] = bitset; + byte_offset++; + // TODO: Except for the last byte, this shouldn't be needed + bitset = null_bitmap_data_[byte_offset]; + } } + if (bit_offset != 0) { null_bitmap_data_[byte_offset] = bitset; } + length_ += length; } void ArrayBuilder::UnsafeSetNotNull(int32_t length) { const int32_t new_length = length + length_; - // TODO(emkornfield) Optimize for large values of length? - for (int32_t i = length_; i < new_length; ++i) { + + // Fill up the bytes until we have a byte alignment + int32_t pad_to_byte = 8 - (length_ % 8); + if (pad_to_byte == 8) { pad_to_byte = 0; } + for (int32_t i = 0; i < pad_to_byte; ++i) { + BitUtil::SetBit(null_bitmap_data_, length_ + i); + } + + // Fast bitsetting + int32_t fast_length = (length - pad_to_byte) / 8; + memset(null_bitmap_data_ + ((length_ + pad_to_byte) / 8), 255, fast_length); + + // Trailing bytes + for (int32_t i = length_ + pad_to_byte + (fast_length * 8); i < new_length; ++i) { BitUtil::SetBit(null_bitmap_data_, i); } + length_ = new_length; } diff --git a/cpp/src/arrow/jemalloc/jemalloc-builder-benchmark.cc b/cpp/src/arrow/jemalloc/jemalloc-builder-benchmark.cc index 58dbaa33a1a0f..d69c3047587bf 100644 --- a/cpp/src/arrow/jemalloc/jemalloc-builder-benchmark.cc +++ b/cpp/src/arrow/jemalloc/jemalloc-builder-benchmark.cc @@ -42,6 +42,6 @@ static void BM_BuildPrimitiveArrayNoNulls( state.iterations() * data.size() * sizeof(int64_t) * kFinalSize); } -BENCHMARK(BM_BuildPrimitiveArrayNoNulls)->Repetitions(3)->Unit(benchmark::kMillisecond); +BENCHMARK(BM_BuildPrimitiveArrayNoNulls)->Repetitions(5)->Unit(benchmark::kMillisecond); } // namespace arrow From 66f650cd359e13f3d5c3d4ef78d89f389d6bcecc Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 13 Feb 2017 09:04:37 -0500 Subject: [PATCH 0330/1644] ARROW-547: [Python] Add zero-copy slice methods to Array, RecordBatch Author: Wes McKinney Closes #336 from wesm/ARROW-547 and squashes the following commits: 42037c2 [Wes McKinney] cpplint 2b91b5b [Wes McKinney] Tweak docstring 5f80d80 [Wes McKinney] Add slice methods to pyarrow.Array and RecordBatch.
Fix bug in RecordBatch::Slice 20dc23f [Wes McKinney] Draft Array.slice implementation --- cpp/src/arrow/ipc/adapter.cc | 2 +- cpp/src/arrow/table-test.cc | 2 ++ cpp/src/arrow/table.cc | 5 +++- python/pyarrow/array.pxd | 2 +- python/pyarrow/array.pyx | 42 ++++++++++++++++++++++------ python/pyarrow/includes/libarrow.pxd | 6 ++++ python/pyarrow/scalar.pxd | 6 ++-- python/pyarrow/scalar.pyx | 7 ++--- python/pyarrow/table.pyx | 37 ++++++++++++++++++++---- python/pyarrow/tests/test_array.py | 36 ++++++++++++++++++++++++ python/pyarrow/tests/test_table.py | 32 +++++++++++++++++++++ 11 files changed, 153 insertions(+), 24 deletions(-) diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index f36ff37db15a3..a24c007a4056e 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -727,7 +727,7 @@ class ArrayLoader : public TypeVisitor { RETURN_NOT_OK(GetBuffer(context_->buffer_index + 1, &offsets)); } } - context_->buffer_index += type.mode == UnionMode::DENSE? 2 : 1; + context_->buffer_index += type.mode == UnionMode::DENSE ? 2 : 1; std::vector> fields; RETURN_NOT_OK(LoadChildren(type.children(), &fields)); diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc index e7c5d667903e8..25f12c4b4300d 100644 --- a/cpp/src/arrow/table-test.cc +++ b/cpp/src/arrow/table-test.cc @@ -259,6 +259,8 @@ TEST_F(TestRecordBatch, Slice) { auto batch_slice = batch.Slice(2); auto batch_slice2 = batch.Slice(1, 5); + ASSERT_EQ(batch_slice->num_rows(), batch.num_rows() - 2); + for (int i = 0; i < batch.num_columns(); ++i) { ASSERT_EQ(2, batch_slice->column(i)->offset()); ASSERT_EQ(length - 2, batch_slice->column(i)->length()); diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc index a9e0909b8b741..8ac06b8cb7811 100644 --- a/cpp/src/arrow/table.cc +++ b/cpp/src/arrow/table.cc @@ -17,6 +17,7 @@ #include "arrow/table.h" +#include #include #include #include @@ -70,7 +71,9 @@ std::shared_ptr RecordBatch::Slice(int32_t offset, int32_t length) for (const auto& field : columns_) { arrays.emplace_back(field->Slice(offset, length)); } - return std::make_shared(schema_, num_rows_, arrays); + + int32_t num_rows = std::min(num_rows_ - offset, length); + return std::make_shared(schema_, num_rows, arrays); } // ---------------------------------------------------------------------- diff --git a/python/pyarrow/array.pxd b/python/pyarrow/array.pxd index af105354ac2f3..9e4d469bcfa5f 100644 --- a/python/pyarrow/array.pxd +++ b/python/pyarrow/array.pxd @@ -38,7 +38,7 @@ cdef class Array: cdef init(self, const shared_ptr[CArray]& sp_array) cdef getitem(self, int i) -cdef object box_arrow_array(const shared_ptr[CArray]& sp_array) +cdef object box_array(const shared_ptr[CArray]& sp_array) cdef class BooleanArray(Array): diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 9b34f5607b31d..11abf03e35f1d 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -131,7 +131,7 @@ cdef class Array: check_status(pyarrow.PandasToArrow( pool, series_values, mask, c_field, &out)) - return box_arrow_array(out) + return box_array(out) @staticmethod def from_list(object list_obj, DataType type=None, MemoryPool memory_pool=None): @@ -156,7 +156,7 @@ cdef class Array: else: raise NotImplementedError() - return box_arrow_array(sp_array) + return box_array(sp_array) property null_count: @@ -201,9 +201,9 @@ cdef class Array: step = key.step or 1 if step != 1: - raise NotImplementedError + raise IndexError('only slices with step 1 supported') else: - return 
self.slice(start, stop) + return self.slice(start, stop - start) while key < 0: key += len(self) @@ -211,10 +211,36 @@ cdef class Array: return self.getitem(key) cdef getitem(self, int i): - return scalar.box_arrow_scalar(self.type, self.sp_array, i) + return scalar.box_scalar(self.type, self.sp_array, i) - def slice(self, start, end): - pass + def slice(self, offset=0, length=None): + """ + Compute zero-copy slice of this array + + Parameters + ---------- + offset : int, default 0 + Offset from start of array to slice + length : int, default None + Length of slice (default is until end of Array starting from + offset) + + Returns + ------- + sliced : RecordBatch + """ + cdef: + shared_ptr[CArray] result + + if offset < 0: + raise IndexError('Offset must be non-negative') + + if length is None: + result = self.ap.Slice(offset) + else: + result = self.ap.Slice(offset, length) + + return box_array(result) def to_pandas(self): """ @@ -390,7 +416,7 @@ cdef dict _array_classes = { Type_DICTIONARY: DictionaryArray } -cdef object box_arrow_array(const shared_ptr[CArray]& sp_array): +cdef object box_array(const shared_ptr[CArray]& sp_array): if sp_array.get() == NULL: raise ValueError('Array was NULL') diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index ebfdc410fa004..702acfbc12e17 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -71,6 +71,9 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: c_bool Equals(const shared_ptr[CArray]& arr) c_bool IsNull(int i) + shared_ptr[CArray] Slice(int32_t offset) + shared_ptr[CArray] Slice(int32_t offset, int32_t length) + cdef cppclass CFixedWidthType" arrow::FixedWidthType"(CDataType): int bit_width() @@ -228,6 +231,9 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: int num_columns() int32_t num_rows() + shared_ptr[CRecordBatch] Slice(int32_t offset) + shared_ptr[CRecordBatch] Slice(int32_t offset, int32_t length) + cdef cppclass CTable" arrow::Table": CTable(const c_string& name, const shared_ptr[CSchema]& schema, const vector[shared_ptr[CColumn]]& columns) diff --git a/python/pyarrow/scalar.pxd b/python/pyarrow/scalar.pxd index b06845718649b..2d55757726183 100644 --- a/python/pyarrow/scalar.pxd +++ b/python/pyarrow/scalar.pxd @@ -61,6 +61,6 @@ cdef class ListValue(ArrayValue): cdef class StringValue(ArrayValue): pass -cdef object box_arrow_scalar(DataType type, - const shared_ptr[CArray]& sp_array, - int index) +cdef object box_scalar(DataType type, + const shared_ptr[CArray]& sp_array, + int index) diff --git a/python/pyarrow/scalar.pyx b/python/pyarrow/scalar.pyx index 9d2b2b11a80d6..57a15ad78344c 100644 --- a/python/pyarrow/scalar.pyx +++ b/python/pyarrow/scalar.pyx @@ -203,7 +203,7 @@ cdef class ListValue(ArrayValue): cdef getitem(self, int i): cdef int j = self.ap.value_offset(self.index) + i - return box_arrow_scalar(self.value_type, self.ap.values(), j) + return box_scalar(self.value_type, self.ap.values(), j) def as_py(self): cdef: @@ -235,9 +235,8 @@ cdef dict _scalar_classes = { Type_STRING: StringValue, } -cdef object box_arrow_scalar(DataType type, - const shared_ptr[CArray]& sp_array, - int index): +cdef object box_scalar(DataType type, const shared_ptr[CArray]& sp_array, + int index): cdef ArrayValue val if type.type.type == Type_NA: return NA diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index 17072108f301f..7d7336246ee79 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -27,7 +27,7 @@ cimport 
pyarrow.includes.pyarrow as pyarrow import pyarrow.config -from pyarrow.array cimport Array, box_arrow_array, wrap_array_output +from pyarrow.array cimport Array, box_array, wrap_array_output from pyarrow.error import ArrowException from pyarrow.error cimport check_status from pyarrow.schema cimport box_data_type, box_schema, Field @@ -109,8 +109,7 @@ cdef class ChunkedArray: pyarrow.array.Array """ self._check_nullptr() - return box_arrow_array(self.chunked_array.chunk(i)) - + return box_array(self.chunked_array.chunk(i)) def iterchunks(self): for i in range(self.num_chunks): @@ -387,9 +386,35 @@ cdef class RecordBatch: return self._schema def __getitem__(self, i): - cdef Array arr = Array() - arr.init(self.batch.column(i)) - return arr + return box_array(self.batch.column(i)) + + def slice(self, offset=0, length=None): + """ + Compute zero-copy slice of this RecordBatch + + Parameters + ---------- + offset : int, default 0 + Offset from start of array to slice + length : int, default None + Length of slice (default is until end of batch starting from + offset) + + Returns + ------- + sliced : RecordBatch + """ + cdef shared_ptr[CRecordBatch] result + + if offset < 0: + raise IndexError('Offset must be non-negative') + + if length is None: + result = self.batch.Slice(offset) + else: + result = self.batch.Slice(offset, length) + + return batch_from_cbatch(result) def equals(self, RecordBatch other): cdef: diff --git a/python/pyarrow/tests/test_array.py b/python/pyarrow/tests/test_array.py index ead17dbec4e35..d8b2e2f5d80d6 100644 --- a/python/pyarrow/tests/test_array.py +++ b/python/pyarrow/tests/test_array.py @@ -17,6 +17,8 @@ import sys +import pytest + import pyarrow import pyarrow.formatting as fmt @@ -100,3 +102,37 @@ def test_to_pandas_zero_copy(): base_refcount = sys.getrefcount(np_arr.base) assert base_refcount == 2 np_arr.sum() + + +def test_array_slice(): + arr = pyarrow.from_pylist(range(10)) + + sliced = arr.slice(2) + expected = pyarrow.from_pylist(range(2, 10)) + assert sliced.equals(expected) + + sliced2 = arr.slice(2, 4) + expected2 = pyarrow.from_pylist(range(2, 6)) + assert sliced2.equals(expected2) + + # 0 offset + assert arr.slice(0).equals(arr) + + # Slice past end of array + assert len(arr.slice(len(arr))) == 0 + + with pytest.raises(IndexError): + arr.slice(-1) + + # Test slice notation + assert arr[2:].equals(arr.slice(2)) + + assert arr[2:5].equals(arr.slice(2, 3)) + + assert arr[-5:].equals(arr.slice(len(arr) - 5)) + + with pytest.raises(IndexError): + arr[::-1] + + with pytest.raises(IndexError): + arr[::2] diff --git a/python/pyarrow/tests/test_table.py b/python/pyarrow/tests/test_table.py index d49b33c9f42d6..67f1892a9987b 100644 --- a/python/pyarrow/tests/test_table.py +++ b/python/pyarrow/tests/test_table.py @@ -68,6 +68,38 @@ def test_recordbatch_basics(): ]) +def test_recordbatch_slice(): + data = [ + pa.from_pylist(range(5)), + pa.from_pylist([-10, -5, 0, 5, 10]) + ] + names = ['c0', 'c1'] + + batch = pa.RecordBatch.from_arrays(data, names) + + sliced = batch.slice(2) + + assert sliced.num_rows == 3 + + expected = pa.RecordBatch.from_arrays( + [x.slice(2) for x in data], names) + assert sliced.equals(expected) + + sliced2 = batch.slice(2, 2) + expected2 = pa.RecordBatch.from_arrays( + [x.slice(2, 2) for x in data], names) + assert sliced2.equals(expected2) + + # 0 offset + assert batch.slice(0).equals(batch) + + # Slice past end of array + assert len(batch.slice(len(batch))) == 0 + + with pytest.raises(IndexError): + batch.slice(-1) + + def 
test_recordbatch_from_to_pandas(): data = pd.DataFrame({ 'c1': np.array([1, 2, 3, 4, 5], dtype='int64'), From 69cf69238492e3872e729775bd833aa23e36bdc8 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 13 Feb 2017 21:27:24 -0500 Subject: [PATCH 0331/1644] ARROW-556: [Integration] Configure C++ integration test executable with a single environment variable. Update README Author: Wes McKinney Closes #340 from wesm/ARROW-556 and squashes the following commits: 521af12 [Wes McKinney] Configure C++ integration test executable with a single environment variable. Update README.md --- ci/travis_script_integration.sh | 4 +--- integration/README.md | 4 ++-- integration/integration_test.py | 16 ++++++---------- 3 files changed, 9 insertions(+), 15 deletions(-) diff --git a/ci/travis_script_integration.sh b/ci/travis_script_integration.sh index c019a4b7ab7ff..7bb1dc0a6015c 100755 --- a/ci/travis_script_integration.sh +++ b/ci/travis_script_integration.sh @@ -29,9 +29,7 @@ pushd $TRAVIS_BUILD_DIR/integration VERSION=0.1.1-SNAPSHOT export ARROW_JAVA_INTEGRATION_JAR=$JAVA_DIR/tools/target/arrow-tools-$VERSION-jar-with-dependencies.jar -export ARROW_CPP_TESTER=$CPP_BUILD_DIR/debug/json-integration-test -export ARROW_CPP_STREAM_TO_FILE=$CPP_BUILD_DIR/debug/stream-to-file -export ARROW_CPP_FILE_TO_STREAM=$CPP_BUILD_DIR/debug/file-to-stream +export ARROW_CPP_EXE_PATH=$CPP_BUILD_DIR/debug source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh export MINICONDA=$HOME/miniconda diff --git a/integration/README.md b/integration/README.md index b1e4e3a82a734..6005b62c41cd5 100644 --- a/integration/README.md +++ b/integration/README.md @@ -34,7 +34,7 @@ mvn package ``` Now, the integration tests rely on two environment variables which point to the -Java `arrow-tool` JAR and the C++ `json-integration-test` executable: +Java `arrow-tool` JAR and the build path for the C++ executables: ```bash JAVA_DIR=$ARROW_HOME/java @@ -42,7 +42,7 @@ CPP_BUILD_DIR=$ARROW_HOME/cpp/test-build VERSION=0.1.1-SNAPSHOT export ARROW_JAVA_INTEGRATION_JAR=$JAVA_DIR/tools/target/arrow-tools-$VERSION-jar-with-dependencies.jar -export ARROW_CPP_TESTER=$CPP_BUILD_DIR/debug/json-integration-test +export ARROW_CPP_EXE_PATH=$CPP_BUILD_DIR/debug ``` Here `$ARROW_HOME` is the location of your Arrow git clone. The diff --git a/integration/integration_test.py b/integration/integration_test.py index 1d8dc29a9f529..d5a066be5f246 100644 --- a/integration/integration_test.py +++ b/integration/integration_test.py @@ -684,17 +684,13 @@ def file_to_stream(self, file_path, stream_path): class CPPTester(Tester): - BUILD_PATH = os.path.join(ARROW_HOME, 'cpp/test-build/debug') - CPP_INTEGRATION_EXE = os.environ.get( - 'ARROW_CPP_TESTER', os.path.join(BUILD_PATH, 'json-integration-test')) + EXE_PATH = os.environ.get( + 'ARROW_CPP_EXE_PATH', + os.path.join(ARROW_HOME, 'cpp/test-build/debug')) - STREAM_TO_FILE = os.environ.get( - 'ARROW_CPP_STREAM_TO_FILE', - os.path.join(BUILD_PATH, 'stream-to-file')) - - FILE_TO_STREAM = os.environ.get( - 'ARROW_CPP_FILE_TO_STREAM', - os.path.join(BUILD_PATH, 'file-to-stream')) + CPP_INTEGRATION_EXE = os.path.join(EXE_PATH, 'json-integration-test') + STREAM_TO_FILE = os.path.join(EXE_PATH, 'stream-to-file') + FILE_TO_STREAM = os.path.join(EXE_PATH, 'file-to-stream') name = 'C++' From d50f1525a999e1de837334adb7a3d7d0da3f0c33 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Tue, 14 Feb 2017 08:25:58 -0500 Subject: [PATCH 0332/1644] ARROW-558: Add KEYS files Author: Uwe L. 
Korn Closes #341 from xhochy/ARROW-558 and squashes the following commits: ea5327b [Uwe L. Korn] ARROW-558: Add KEYS files --- KEYS | 180 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 180 insertions(+) create mode 100644 KEYS diff --git a/KEYS b/KEYS new file mode 100644 index 0000000000000..ad0f5cf6f4340 --- /dev/null +++ b/KEYS @@ -0,0 +1,180 @@ +This file contains the PGP keys of various developers. + +Users: pgp < KEYS + gpg --import KEYS +Developers: + pgp -kxa and append it to this file. + (pgpk -ll && pgpk -xa ) >> this file. + (gpg --list-sigs + && gpg --armor --export ) >> this file. + +pub 2048R/7AE7E47B 2013-04-10 [expires: 2017-04-10] +uid Julien Le Dem +sig 3 7AE7E47B 2013-04-10 Julien Le Dem +sig D3924CCD 2014-09-08 Ryan Blue (CODE SIGNING KEY) +sig 71F0F13B 2014-09-08 Tianshuo Deng +sub 2048R/03C4E111 2013-04-10 [expires: 2017-04-10] +sig 7AE7E47B 2013-04-10 Julien Le Dem + +pub 4096R/1679D194 2016-09-19 [expires: 2020-09-19] +uid Julien Le Dem +sig 3 1679D194 2016-09-19 Julien Le Dem +sub 4096R/61C65CFD 2016-09-19 [expires: 2020-09-19] +sig 1679D194 2016-09-19 Julien Le Dem + +-----BEGIN PGP PUBLIC KEY BLOCK----- + +mQENBFFll5kBCACk/tTfHSxUT2W9phkLQzJs6AV4GElqcFo7ZNE1DwAB/gk8uJwR +Po7WYaO2/91hNu4y1SooDRGnqz0FvZzOA8sW/KujK13MMqmGYb1jJdwPjNq6KOK/ +3EygCxq9DxSS+TILvq3NsFgYGdopdJxRl9zh15Po/3c/jNMPtnGZzP39EsfMhgIS +YwwiEHPVPB00Q0IGRQMhtJqh1AQ5KrxqK4+uEwwu3Sb52DpBjfgffl8GMGKfH/tk +VvJ6L+7rPXtNqho5b7i8379//Bn9xwgO2YCtjPoZMVg37M6f6hVWMr3fFmX/OXgU +UWwLGOTAeuLKWkikFJr5y0rzDaF2qcD9t7wfABEBAAG0IEp1bGllbiBMZSBEZW0g +PGp1bGllbkBsZWRlbS5uZXQ+iQE9BBMBCgAnBQJRZZeZAhsvBQkHhh+ABQsJCAcD +BRUKCQgLBRYCAwEAAh4BAheAAAoJEJfX6GR65+R7au4IAIfZVA9eWBZn9NuaWX7L +Xi+xDtzrfUrsWZxMIP6zkQsIspiX9AThGv3zDn+Tpfw7svV1QfUQX0LHbwMMYqq+ +mRJB/kqYutpLxw7h63zrWR2k2Sdzvole2c3Rfk1vblIdWZk7ArLSivqTk/oGwr7d +MejvOMmKSzqW0vQF6dNbYerLOiqPr4mKqONWm4nOLZEBzjE3IfbK3gNBSFq+92jV +iWY6ozqAxydYafNUSZRrcniYskxd9JCSSLZiIZW3X9lToA/74LjpPbmzvQtkH68D +0EnC1mkPTKCA4r+CLb3a9GJ9Surg2T0OptyPHsXipgViVryXgopD2odA3fh9SY5l +Ee+JAhwEEAECAAYFAlQN+kQACgkQ/LPL2dOSTM3+OA//dYj9kiZhZNVb6hMfrubn +OjTmY8Hcax8G+aJWxRrGE8HrCUjEJ4NThK523+fmol1PxNWsguljlsZvJ189YPOh +weDJzNmKwhLntq/uBgtJyWBN1v9bUzkR9Ud+UdD1tPbNj7sNiIQE1ZqWMxra3sq/ +gcodVgqSADGgjKO9tenQhWvQXxBR55MOqZbxnyazRPEYS0mkN0A0DwtG82tHNRL7 +Z3vs/kG5hoW3kYifCZn5pW3wKtfIY5JH7usYOzA86p7GH4hOfO+dzhDANH+C+u9O +ZRbCdUE8oEp3fAWY9+3VzlO5ixpFOeHGfbSJp44Jv6wUOxNwRmD/gk+DxVrsS/Yn +rLFCZgDHgkFHGJ1D7PnxTy4qtwGasYxWYJOUiaAJbOvRa8nbhan2/wsrgnJTbXAH ++7v5tFfCV77Po//V0fojYZNvbkEO8/yRpQL+uKiVRaRD5dMfHRb31OR0A59ssYX9 +63QpBEof/OeELC0VowG+KCc+4CfSMmAGnQMdEhMAUPz+79nJw7ijeF5C82Z5mQof +v+nf+kdqr80UbG+RoODKtlHFETxJ5STQe6uiPOfvb+EADPA0cZ34u5tD3Z+SMV1k +Gf7Jxi45jmkn9Z9AkVj6KgdDeSjV7EkRiY0pm43Vvd6WvV5t54cgJcwXrjG+h03f +65w7F+KBrh7YAcUvrf4JeXKJARwEEAECAAYFAlQN/XwACgkQfNgniXHw8TtU9Af/ +b9CYFtsG9q1ZbnV9SChxjLLUipGsmKTUjCnz7oiZvJJ04e+0np1NQJKJbthGfEDM +eLt1WiYpTDu66zAuLDA7ACcbv3UUXXsUTEfN76J+9DJHrtK1soHGLkKLW2hZeWKp +PKya/HRF4Rv3/aAwWtRjEuQr9pLt/wAOedV6mrpyTngOKQn97tzo/yUeDNG7be8A +xtUStQY/2zJmHkaLeULKOspgUchBQ1S+M4q46dE+tyel47BLyHIECqk/geLOlZmh +lo6TtVgnBSXC5SqMwh5pz/P5ntQ8FVLedGQI9dwVhxbjoo5DNB/6ntfbwkheiak1 +CFBm0ZVPJjX7F2XFcq7VCrkBDQRRZZeZAQgA4eixR7xHvnTyF12CYLsnFE8x1tI+ +78FCjKm0n1YPCzEYa70bnnZmpW4KCwO0flN4RhhP+g2KRCCov2ZH7bxvhTxe4n/j +T6I/+61Fpba4I7qExYqX+tylyjUKhynLcWCbvRQnyjOMTaLbMVrftV+ATVmj7fi0 +PdzRW/7QvCSrDsMFtTSaNBdeMbzptpoXAxTgVZOIoHbWOIfovN1uPnFItrmNnKXX +KGyDPX2s2KCz10G1lrw0l9tqDg+BtqE9/xCtqWoZJMnT8jAJZeJ0V37R1jDBDEHK +AfPOUKNYf5GWxJeCWYzL77ve8VdItKwPhtjW7zFKuyrqiBHE40fgTLKvNQARAQAB 
+iQJEBBgBCgAPBQJRZZeZAhsuBQkHhh+AASkJEJfX6GR65+R7wF0gBBkBCgAGBQJR +ZZeZAAoJECrRWHEDxOERzmEIAOCrfYGPdLyzBn/xAdymx2FaNTS48ybMIGjcu6Od +nKzvgBJObLPQf0+WKhkbQf2HEHYinBVpX8K4dNY9RhzIRbQNhCWY5E5/leI/nQ9O +ZBUMpT8Gw5saj0YtF3By4E9ywxNWiAyX2SAHjPv/lub0PEaUiWWe6s9MaX5fp71C +TupkdElpxucEpVefUaUOSMQ2ecOniCh/9ltPLYcjwnC1ti+Et8/cAK2N554GNE+x +fO3qtGXGUleWhpt3fblTcCyO+odAPKxm70jnABLk8m+KpffcdBYSJ5ai5hPkrnyq +3NBRDPGlLdtDkzn0/xKYnVbLW1d+d2NFwJzEKncQphHoo0T19wf8DSfym7dIsstj +jwFI8+N/1yCdMD886x8bgmsSsNiD9tro+1083yr+IL5+gUs8Q4ETpsow+IS6sfp2 +fzA0TaLBLEOFYy/XFxnzO+YtVNIDAnrDEgTOMahFUrJ/HVZF9xT+kKwhyHaRNIQL +CYc4VoSWldqoDVOGI30NjtVo5EGzf3qVWkTm4yplBhJvJanxrMHuJAWRgFX8D48B +cs/senr8s+O0oXQQYIjz/FkZh/mQFtrgsvnzyUR52SnwEzNMmXjZNkydPZwcY6mu +cqCIvQIvmBpPdlyaoglwJ8wWb76uIE6VFcN71FF3EfV51/yUeQGJaoExWLY6IH8x +Xtn3IWkBWJkCDQRX4AxuARAAzzTxE1FGdmJYPZyTys51oDi8+CJ8VXF6wlTkjuOW +abkGUu0FjnN++D7G9LRDvN7QnVUHU+h6QWPZ0LanmjYh0ABO5SeWCjOX6ajcACkz +pEzMv2DbOPfJuPJmtuFfiLOQAUVBB1dSSPFMPPaGTco2iE7uLr8edtQBvgpx/PGd +52lma3qAAZFzonKWyTRonUjV4SU3C96Xhbs+DExTL8H6MX8NzZCz4UZj5u4NsEH+ +oQD0U4tSOe3xgroJpOR6ZPvlhgbWVqlYvkEWt0AaPJsXJwnWe17GgDmxME2cwsuI +fgv/9shB7VYmLglY6dV/6HYoHh+2qKXndTMjlqXXvUHW0J3uRryoCR+C2gin38/f +sPFICpt5yJVnR517O/jsviDz4TwjhqFsUUM7Ud8IydriJX02Oj5UitUF7l5MSqkS +/Z7jwPEErCRWwmfj4ZjjWWV60I9SYgPZhBp0//s2o/gbIBBtIdHI2+xaMt0lWOsA +Xi7dsY1NLGoSGUlhdSiP032tVHpGiOV3AWwf399Qus4iuwf6N8KEVSTRdaA7b1Um +b7PepfEHIrOS5oUYjgZJK+JFGey5SOsPvG3Yv9cbEKWqmoEzEDb6y3HI/iRbk/qC +SWGKvEiqYSo6wlvZFDv1qoApylfBaI8Lf32vawlMCSI37KCWfua1RqbCYMi/4wux +bfsAEQEAAbQgSnVsaWVuIExlIERlbSA8anVsaWVuQGxlZGVtLm5ldD6JAj8EEwEI +ACkFAlfgDG4CGwMFCQeGH4AHCwkIBwMCAQYVCAIJCgsEFgIDAQIeAQIXgAAKCRAC +2r/fFnnRlHqND/oCaPPGn8u5oyVml9J3+lpYWwT69qHwYV5IX+72zqq02uvYEqlY +CseEwOvkfDClh81KWO4A9kzVwWcu611d/UFsA94EnZuJ46m8DflPeidhJoTRnNpr +IRzH2lL72QyQFeT9viWHdxu3cKlJkChQuD3zR9yyVH/QVFOlBvdx/ZT0dOFpbgJd +2n4fy8ExGSXLP6wGf5RQRKEYiZX4VB4Bkz6sK0Con6GPsqqtaUgj9fCxA1ebhGA2 +k1m79mR9gh4oJWeefSuXyf3x8oBoQ46Lury8HiuxLh6cy9SqHZ8uXu3hfQEZ6rhd +C9yBjK2+8z6GLhjgSasMkJK0OAR8yLgARZJwt9+wV/Ww15Sq76B6IrKJnSR6P4FM +jA+ItCDtiooGz6rJGFidsH99fU2IydcsSqbTN3h3/2cBXFgxspesHWsEeTvtXSgR +I82kUyA0g1v9ESY0leiLVzKyL0zmCjaPg0nHoH8tFqFkqaSXSZu75TefnsokkoXN +ewkDf+yD2J/BMtUHAgFOlvYkywGzS9cbAxxzc607Jvww5LjtPI0wYRIwzOlvZfws +xoYPrqJe1R7hRy0QS6pnSL7TgdbvwbGtiUAZ9w/Y5FEugV8bgyZMvF7Z9gt3ThMg +XOSWlZrsDym7jg/yd/h/4aPZuPC73oNvgV4g6OT510fkkMNWbZR8C2uHX7kCDQRX +4AxuARAAphEmWY5Z3Q+gQ1X9+b55VE17ORMKjXtE2gQnYk9Fxpt31F0kZcoK/25Y +BItkjcmIaC4LTLjbdwe6IW4zlOjULxaTstTJsfCcrJONlSmEJ0OWaXx9i/tAXt8d +0IZn2hkQ9aevJuoWqta+wFNhpLdPuPQq6vO6hIl7j0w1tAGFHV+IQ7Q7VFuUVo/h +gZbtJOjufZWqulz6pMVu4p3TW9OM98CWioO1eidcwKYEsgk/fJ1uKc599SSCz2Cg ++lEho6rHtvojk34TLD9QQHaEcCZToq7WSqwqLi4OCuhcfAVuwydj0RMByE1TOpsg +RwOh2egBLNplK+0k2jVaQPX38laolOAMNLg+VVRy73T1MpyelY/m+cRr8292X/f4 +GgHNHbmQU+LDzsezC+ryPXdP3FjVo66xNlYHzw1x0hRdnwExkqOYmdTz1YN83Z+6 +p0d2RkTZTpLnE449KiNsgTPttplBGE8QKNqYxoKIk40DlDuya7q/acgTcqe8vW0v +34E6RRIX8dbCJeTBB2vUDp6bD3ICXGI09EuUAh9yy/FlNv1OdggDfTnF/NztWmmT +CpNwmdx+GTT2Sv0i6H9RelOl0uGj351+7PSFSFrHV9T3TUaMB0QkkZDxklIvPVZv +dhx7UGXFJPDjQyJxcN7UW6Pc7m+m2W3/u2MZaL7xPbWRPVkqs48AEQEAAYkCJQQY +AQgADwUCV+AMbgIbDAUJB4YfgAAKCRAC2r/fFnnRlGwCEACXcfMAz79G2sLs6z1N +6tMbO0qGGQJ9vAXRKjb7JN/yd+z+zaejs/+cmRM+wHKZtANYtnSzGiWJO4TIP5A+ +DeRE93GJaVr0ly0C+du/uSm6wVg+w1wgy6JE6q/IsMW5qHd0qWi/npq4esfH1Uho +T7Kl/AxUyT0N23n5oK0GrVUFhFcU9dUx6auhHxEOM5tIgNqV6lAn72lykPYUvV5f +aAiz2OAlVYxgBb6wxjXTUVlrhaxbgNQ7PPzkjzMVZaE/TZrcyl4Ck3grYDBFZEGi +jhjsl/HX+/lhJvr5gcFkisG5A2pnrkAe1wnXm4HoKGN2xUWCCipN5oPc3Lw6ge76 
+YDX1t5CXqd94cDBlwFDtd4kykI3rJDvTI3P/fevMNqVS3tzW9AwkHkPil1DE+4rI +/qCib+G6BAgloUGYLuNxSa1ySOd0yckFTrNBB5yk+yWvrLpKGFVdQS7BwUcgdeCJ +3XU3fyhfXcIn3tMHabZ6laB3Xzi3Gi8iL6SJywSXIqTGw3MmLJlxr1IKWTMNxjjs +d/XBF7ZpCCisH7s9hyMCAet72YFAxVcB3bwbd3mzcGfTg/Y+sSum82vaSvAJ0QBc +pp4X8HzEsSsJ88N6ON7IU92r+1mxWhglKZx2NORHIvNFwIrvAzKWhqGdHd5/xq3f +EwCykGi6RtdCStNFh6h16kCkgA== +=YkSF +-----END PGP PUBLIC KEY BLOCK----- +pub 4096R/8CAAD602 2016-11-06 +uid Uwe L. Korn +sig 3 8CAAD602 2016-11-06 Uwe L. Korn +sub 4096R/7BD1BC86 2016-11-06 +sig 8CAAD602 2016-11-06 Uwe L. Korn + +-----BEGIN PGP PUBLIC KEY BLOCK----- +Version: GnuPG v1 + +mQINBFgfd4wBEACylQqqVH/aK00fgU/v1ZggNwtgJhzH7yswAzQz9eUU5t4Q9kzI +zdkR1yJvaEDHtZy2D0mCM1CuGVPXzf+0kSFDaRPcm6LNAD15KC7eUzyad1Y4MwNn +UYE3pZlnvSwUBAigQSN1quw+u1eHc+IJc32iCRcK8DihQgrDivg8yZckoGGZj/6w +Epfp8SLrI+OmqBgwYYjRqy9uC0aWypKb9waZmc2NIZZu1y3bL6hx54+Dk+4hF01E +OtT79HQV1e4MyqiuGUKa34QAHb1CGrju+1Z9sDNdI7hBDqfQKjisR2WaJM4kXHjj +m7Tv3M1LUB4eh1+Yd514d/wpSChkLvMCJ9tYGSpQ8c+qrLAFvgRD7YCYp4ypslcx +Sg30gU0bcTu8aiIm7qfl9CUjtBYwirUGC/t2SUxnhOpxWuzZdAiUJHi0QFa+LnZa +ecA5fIoMfqTWAqfQr3noxB6qLLNCgZd7IIH5KXIIhJZHpO3eMCCTJuDXiMS1Z/uo +D1FvUL8c19nmMjPJSfQo95Uynw6gZKFy0d3xg7NKUvnJBsVI24/PTVabzRrDh/qb +RCHvQOFjXOSYsPm2sz1BPs+ucV4AoxPZFgsCfUN2t4FRbcb39vr6oYFb+Nd3sIKX +7wknSwAid6pATvfZuLC9NI8ykjcEDGeLL0sET3kdUeuGYjpj2kuhnrV4cwARAQAB +tBxVd2UgTC4gS29ybiA8dXdlQGFwYWNoZS5vcmc+iQI4BBMBAgAiBQJYH3eMAhsD +BgsJCAcDAgYVCAIJCgsEFgIDAQIeAQIXgAAKCRAp2U4ijKrWAos1D/98UBoLbt6L +c7mnXTww069nkt0vOOHSz/QWJxo5rQsqFSKcSRuBhwLuaVMGTjBqCOLdEmA+XKJ8 +O+OgCZz0QZXuwL3PklX3DFvsYO0wIEIssovEJMu5e3XxDcCf5ZZtfszW5dnbWTjc +JXP0TlEbjOR5Z0/O/24iysGtoEMiktRTLOz9R5oRXFQLN4jQSykvMfKhanCVFljX +qEdMszjtvZhLwOiCaWkIOEo3jCrCDhdThI5nTiu/pH3vi7mkFYTNKpiva3XYKH7V +ITEdn5WO/QNFu/VBRjtOxT+F068vuuNpvAddn5rOtZOyGMCHnEqnlRnqIIZGtJeo +EJ87N2ytn8CtKpQKhyJIJhQIfW5jS3YW8qj1HeKN2s5wqQKnBYYsJOh+/QC9g3oE +nllgoSHAKSzys2Y1VoOQbRxYipCqRx7uS2aAqFr6r3hQpzySWeKQuxVZSZD7ar/0 +AFB0Hg4EgUGDl6Lw5icJ7scXTgoQKZWH1UmNc/FwFbG/F1GVbU88R6DlF84D1X/P +ArtP20eT+B3u5nfO2pCaBVi6GYyMsL2WKHO7AQAgURMgEPk0AQZZpv/OSJFa/TzI +UQ8xTLgmwZRL/XjjNFYWs+eYecGQsHKLbKNm1BpZMEfbVSFw54PiyJgoOhdMKdyA +Cmb+aUBkbPXf5S0ScZOoq8e8k1dYseDGOLkCDQRYH3eMARAAx/joL6ScsKMmPGRn +n79gQ3zbcKxWSfEDMYeeFfSssRgRd2iIrgvjzr9phka2yknzPnQPi7C8GLkUTj5e +V1dBxIGkGmP28n0DoowMqGb1xqn0WeoxDL0VQycGjkv5SOkxcbCCKS/MHOn6zenh +patSJsEHkCqk3f4GtPngYN5oMRTXUfUj1s7AooNti1ONSQSvZNbOMKAg8MgAjAHm +z3A+INLVTa59vqUNr5ptG/n+cB65ggeNhJf3gMaDyUy7oRZtOhrmA4D9CLpy2OBA +gezgOCZk/mPNP5jW0sbRiL6nYqC9VTp0E+f3hYSdgXNTWGIcxOwK7xe09SRqUQ7u +WnoKBTjkkYdCaCN4rv8IhJdrufgYdfqMGuldQZ9R/gcN3Iel7JMdon2onk94KZPs +W58/1DCD2eRuz8CsIgleUHVXJ+mCpkdtAt45ZGyv5pFC/+6s8mS/pBQEdVl7wjEX +kf2lrtFZCfK1uUiUTDnJJdtXNhwdtvnxJYeRg51jlD9Qg/mPV6m8KFyINtLKedLv +hChFkAIfFsdC/r1Xt4fMiCv2eZ8Dop2dM6xV/6Ueicti0lywoTpVtugSUWPO1j8a +N48jUfkZUV0jdELNHAloZaIDeLc7mU0uZJ3JykC4laD+YDwHT8tYUvamtU2uNgh1 +V7I3jrEu8YO4T2fiXe+0EzBwzjEAEQEAAYkCHwQYAQIACQUCWB93jAIbDAAKCRAp +2U4ijKrWAs3bD/wOE8NLnzKqebz0v+lxQf7fRL+RMaJ8mFda/t7UFtxj6XdePGZy +HWdqlvBFSDo/K6aEiicmpEIPbMi+V7d1Dg3tGhwtkHzgbpxNVoolR+2cF4jtrkoV +NC7uAMaDPt0X+wqinGg4E7IFuJoT4WiS+i4lzCUbD8n7lxe6Kj9bDt8tb6gOCgld +oweGN2k3bc4hIzeRt0jqGu1xm91Zbf8YbI3vyi8WQqmxX3zugY46NWwj8a+4Mhxz +Ysd7SI1pPs5k7vdHif3MD3Wwx68CCuZSm2KzNsm0iGxrCXSA6dXVflK9rlq6O1Us +UTxfX60o6S8PdFr4oOPFHYXmvDU5PY575xscWB2VVAyuSCyZWtq8d1BBU9JxcozS +6PTefVUqgr0XXRwVldAIabSA5q13j+b5+vU6LnAuoeMlFFprRlcJN03XTWKXF/gP +SpCDscCEMbz7aHpox8wmFckeiT+TgwDLMKO5PKRSMEBErUk+SsOyBnFpuGaPsCem +Pi6NwQyPCt3eep4Ti0dPo3u/dCUEtdKWMpOhsPIoCvGpgqS7o5PuBC2MDHQCc7q8 
+wfxeCKBeSpMuy3pvOnNy8uNYjNqizVlpNBx01I2R1MD8P14Pxteg6APi0jcusXrD +s8g7c7dzdXM0lxreeXge8JSmxuwcCqVUswac6zbX4li03m/lov2YYxCwuw== +=ESbR +-----END PGP PUBLIC KEY BLOCK----- From fa8d27f314b7c21c611d1c5caaa9b32ae0cb2b06 Mon Sep 17 00:00:00 2001 From: Holden Karau Date: Wed, 15 Feb 2017 08:55:24 -0500 Subject: [PATCH 0333/1644] ARROW-561:[JAVA][PYTHON] Update java & python dependencies to improve downstream packaging experience The current build for arrow uses a interesting work around for hamcrest conflict between JUNIT and mockito which results in mockito being in the compile scope. This is not suitable for some downstream users. Python setup file also leaves out test dependency (not overly important but useful for developers) & we can clarify parquet-cpp as an "extra" dependency for people requiring parquet support (already mentioned in the README file but good to have clarity in setup.py as well). Author: Holden Karau Closes #342 from holdenk/ARROW-561-improve-deps and squashes the following commits: 5919b32 [Holden Karau] Drop extras_requires 938ed97 [Holden Karau] Mention test requires pytest and add an extra requires for parquet 283d6cd [Holden Karau] Remove mockito from compile scope and fix hamcrest issue with exclusion rule instead --- java/pom.xml | 10 ++++++++-- python/setup.py | 1 + 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/java/pom.xml b/java/pom.xml index a147d66c98318..e467cc185be06 100644 --- a/java/pom.xml +++ b/java/pom.xml @@ -442,11 +442,17 @@ test - org.mockito mockito-core 1.9.5 + test + + + + org.hamcrest + hamcrest-core + + ch.qos.logback diff --git a/python/setup.py b/python/setup.py index 5f5e5f3795541..54d1cd3af48bc 100644 --- a/python/setup.py +++ b/python/setup.py @@ -298,6 +298,7 @@ def get_outputs(self): use_scm_version = {"root": "..", "relative_to": __file__}, setup_requires=['setuptools_scm'], install_requires=['cython >= 0.23', 'numpy >= 1.9', 'six >= 1.0.0'], + test_requires=['pytest'], description="Python library for Apache Arrow", long_description=long_description, classifiers=[ From f6924ad83bc95741f003830892ad4815ca3b70fd Mon Sep 17 00:00:00 2001 From: "Uwe L. 
Korn" Date: Wed, 15 Feb 2017 15:59:36 +0100 Subject: [PATCH 0334/1644] [maven-release-plugin] prepare release apache-arrow-0.2.0 Change-Id: I71a840dd1891d1b738d6a43748642390d7541f42 --- java/format/pom.xml | 2 +- java/memory/pom.xml | 2 +- java/pom.xml | 4 ++-- java/tools/pom.xml | 2 +- java/vector/pom.xml | 2 +- 5 files changed, 6 insertions(+), 6 deletions(-) diff --git a/java/format/pom.xml b/java/format/pom.xml index eb045d655e982..055df5b2b0622 100644 --- a/java/format/pom.xml +++ b/java/format/pom.xml @@ -15,7 +15,7 @@ arrow-java-root org.apache.arrow - 0.1.1-SNAPSHOT + 0.2.0 arrow-format diff --git a/java/memory/pom.xml b/java/memory/pom.xml index a4eb65228febf..a3085aa506f65 100644 --- a/java/memory/pom.xml +++ b/java/memory/pom.xml @@ -14,7 +14,7 @@ org.apache.arrow arrow-java-root - 0.1.1-SNAPSHOT + 0.2.0 arrow-memory Arrow Memory diff --git a/java/pom.xml b/java/pom.xml index e467cc185be06..ea0d0297ac3b4 100644 --- a/java/pom.xml +++ b/java/pom.xml @@ -20,7 +20,7 @@ org.apache.arrow arrow-java-root - 0.1.1-SNAPSHOT + 0.2.0 pom Apache Arrow Java Root POM @@ -41,7 +41,7 @@ scm:git:https://git-wip-us.apache.org/repos/asf/arrow.git scm:git:https://git-wip-us.apache.org/repos/asf/arrow.git https://github.com/apache/arrow - HEAD + apache-arrow-0.2.0 diff --git a/java/tools/pom.xml b/java/tools/pom.xml index ef96328f7668a..7271778aea9ad 100644 --- a/java/tools/pom.xml +++ b/java/tools/pom.xml @@ -14,7 +14,7 @@ org.apache.arrow arrow-java-root - 0.1.1-SNAPSHOT + 0.2.0 arrow-tools Arrow Tools diff --git a/java/vector/pom.xml b/java/vector/pom.xml index 8517d4ced80f1..8ac42531e7f68 100644 --- a/java/vector/pom.xml +++ b/java/vector/pom.xml @@ -14,7 +14,7 @@ org.apache.arrow arrow-java-root - 0.1.1-SNAPSHOT + 0.2.0 arrow-vector Arrow Vectors From ab15e01c70d12ea163dd9b0109fa9332884e3e7c Mon Sep 17 00:00:00 2001 From: "Uwe L. 
Korn" Date: Wed, 15 Feb 2017 15:59:46 +0100 Subject: [PATCH 0335/1644] [maven-release-plugin] prepare for next development iteration Change-Id: I1a9e3a6d0dc29c1a7933d373a7224a7bbd60e7e9 --- java/format/pom.xml | 2 +- java/memory/pom.xml | 2 +- java/pom.xml | 4 ++-- java/tools/pom.xml | 2 +- java/vector/pom.xml | 2 +- 5 files changed, 6 insertions(+), 6 deletions(-) diff --git a/java/format/pom.xml b/java/format/pom.xml index 055df5b2b0622..c65a7bc3de197 100644 --- a/java/format/pom.xml +++ b/java/format/pom.xml @@ -15,7 +15,7 @@ arrow-java-root org.apache.arrow - 0.2.0 + 0.2.1-SNAPSHOT arrow-format diff --git a/java/memory/pom.xml b/java/memory/pom.xml index a3085aa506f65..f20228b1bee62 100644 --- a/java/memory/pom.xml +++ b/java/memory/pom.xml @@ -14,7 +14,7 @@ org.apache.arrow arrow-java-root - 0.2.0 + 0.2.1-SNAPSHOT arrow-memory Arrow Memory diff --git a/java/pom.xml b/java/pom.xml index ea0d0297ac3b4..fa03783396ffb 100644 --- a/java/pom.xml +++ b/java/pom.xml @@ -20,7 +20,7 @@ org.apache.arrow arrow-java-root - 0.2.0 + 0.2.1-SNAPSHOT pom Apache Arrow Java Root POM @@ -41,7 +41,7 @@ scm:git:https://git-wip-us.apache.org/repos/asf/arrow.git scm:git:https://git-wip-us.apache.org/repos/asf/arrow.git https://github.com/apache/arrow - apache-arrow-0.2.0 + HEAD diff --git a/java/tools/pom.xml b/java/tools/pom.xml index 7271778aea9ad..35e5599b3b64c 100644 --- a/java/tools/pom.xml +++ b/java/tools/pom.xml @@ -14,7 +14,7 @@ org.apache.arrow arrow-java-root - 0.2.0 + 0.2.1-SNAPSHOT arrow-tools Arrow Tools diff --git a/java/vector/pom.xml b/java/vector/pom.xml index 8ac42531e7f68..fc3ce66ac1f80 100644 --- a/java/vector/pom.xml +++ b/java/vector/pom.xml @@ -14,7 +14,7 @@ org.apache.arrow arrow-java-root - 0.2.0 + 0.2.1-SNAPSHOT arrow-vector Arrow Vectors From ef6b4655798e9c31631377bd6412c36405001f7f Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Mon, 20 Feb 2017 17:23:40 -0500 Subject: [PATCH 0336/1644] ARROW-563: Support non-standard gcc version strings Author: Uwe L. Korn Closes #343 from xhochy/ARROW-563 and squashes the following commits: 64d1c93 [Uwe L. Korn] ARROW-563: Support non-standard gcc version strings --- cpp/cmake_modules/CompilerInfo.cmake | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/cpp/cmake_modules/CompilerInfo.cmake b/cpp/cmake_modules/CompilerInfo.cmake index fe200be65d502..079d9d1f3270d 100644 --- a/cpp/cmake_modules/CompilerInfo.cmake +++ b/cpp/cmake_modules/CompilerInfo.cmake @@ -21,6 +21,7 @@ execute_process(COMMAND "${CMAKE_CXX_COMPILER}" -v ERROR_VARIABLE COMPILER_VERSION_FULL) message(INFO " ${COMPILER_VERSION_FULL}") message(INFO " ${CMAKE_CXX_COMPILER_ID}") +string(TOLOWER "${COMPILER_VERSION_FULL}" COMPILER_VERSION_FULL_LOWER) if(MSVC) set(COMPILER_FAMILY "msvc") @@ -53,10 +54,10 @@ elseif("${COMPILER_VERSION_FULL}" MATCHES ".*clang-8") set(COMPILER_VERSION "3.8.0svn") # gcc -elseif("${COMPILER_VERSION_FULL}" MATCHES ".*gcc version.*") +elseif("${COMPILER_VERSION_FULL_LOWER}" MATCHES ".*gcc[ -]version.*") set(COMPILER_FAMILY "gcc") - string(REGEX REPLACE ".*gcc version ([0-9\\.]+).*" "\\1" - COMPILER_VERSION "${COMPILER_VERSION_FULL}") + string(REGEX REPLACE ".*gcc[ -]version ([0-9\\.]+).*" "\\1" + COMPILER_VERSION "${COMPILER_VERSION_FULL_LOWER}") else() message(FATAL_ERROR "Unknown compiler. Version info:\n${COMPILER_VERSION_FULL}") endif() From 4598c1a36c20de1f4d12dee62c79a67197e8a603 Mon Sep 17 00:00:00 2001 From: "Uwe L. 
Korn" Date: Tue, 21 Feb 2017 14:41:54 +0100 Subject: [PATCH 0337/1644] ARROW-570: Determine Java tools JAR location from project metadata Author: Uwe L. Korn Closes #346 from xhochy/ARROW-570 and squashes the following commits: 32ece28 [Uwe L. Korn] Add missing ) f1071db [Uwe L. Korn] ARROW-570: Determine Java tools JAR location from project metadata --- ci/travis_script_integration.sh | 3 --- integration/integration_test.py | 11 +++++++++-- 2 files changed, 9 insertions(+), 5 deletions(-) diff --git a/ci/travis_script_integration.sh b/ci/travis_script_integration.sh index 7bb1dc0a6015c..8ddd89b1639b0 100755 --- a/ci/travis_script_integration.sh +++ b/ci/travis_script_integration.sh @@ -26,9 +26,6 @@ popd pushd $TRAVIS_BUILD_DIR/integration -VERSION=0.1.1-SNAPSHOT -export ARROW_JAVA_INTEGRATION_JAR=$JAVA_DIR/tools/target/arrow-tools-$VERSION-jar-with-dependencies.jar - export ARROW_CPP_EXE_PATH=$CPP_BUILD_DIR/debug source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh diff --git a/integration/integration_test.py b/integration/integration_test.py index d5a066be5f246..049436a751f38 100644 --- a/integration/integration_test.py +++ b/integration/integration_test.py @@ -34,6 +34,12 @@ # Control for flakiness np.random.seed(12345) +def load_version_from_pom(): + import xml.etree.ElementTree as ET + tree = ET.parse(os.path.join(ARROW_HOME, 'java', 'pom.xml')) + version_tag = list(tree.getroot().findall('{http://maven.apache.org/POM/4.0.0}version'))[0] + return version_tag.text + def guid(): return uuid.uuid4().hex @@ -638,11 +644,12 @@ def validate(self, json_path, arrow_path): class JavaTester(Tester): + _arrow_version = load_version_from_pom() ARROW_TOOLS_JAR = os.environ.get( 'ARROW_JAVA_INTEGRATION_JAR', os.path.join(ARROW_HOME, - 'java/tools/target/arrow-tools-0.1.1-' - 'SNAPSHOT-jar-with-dependencies.jar')) + 'java/tools/target/arrow-tools-{}-' + 'jar-with-dependencies.jar'.format(_arrow_version))) name = 'Java' From 5e279f0a73842518caf34c2cda7c941548d55dbf Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Tue, 21 Feb 2017 14:44:01 +0100 Subject: [PATCH 0338/1644] ARROW-569: [C++] Set version for *.pc *.pc.in such as cpp/build/arrow.pc.in refers ARROW_VERSION but it isn't defined. 
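
Both this patch and ARROW-570 above treat `java/pom.xml` as the single source of truth for the project version. For illustration only, the lookup that ARROW-570 performs can be written as a small self-contained Python helper; the function name and error handling here are illustrative additions, not part of either patch:

```python
# Sketch of reading the project version from a Maven POM, mirroring the
# approach of ARROW-570's load_version_from_pom(). Illustrative only.
import xml.etree.ElementTree as ET

POM_NS = '{http://maven.apache.org/POM/4.0.0}'  # Maven POM XML namespace

def read_arrow_version(pom_path):
    """Return the text of the root <project><version> element."""
    root = ET.parse(pom_path).getroot()
    version = root.find(POM_NS + 'version')  # direct child of <project>
    if version is None:
        raise ValueError('no <version> element in ' + pom_path)
    return version.text

# e.g. read_arrow_version('java/pom.xml') -> '0.2.1-SNAPSHOT'
```

The CMake fix below does the same job with a regular expression over the raw XML, which keeps the C++ build free of any XML-parsing dependency.
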
Author: Kouhei Sutou Closes #344 from kou/arrow-569-c++-set-version-for-pc and squashes the following commits: 48b366b [Kouhei Sutou] ARROW-569: [C++] Set version for *.pc --- cpp/CMakeLists.txt | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 035cd8f9b90c7..0888a8b97faa1 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -18,6 +18,12 @@ cmake_minimum_required(VERSION 2.7) project(arrow) +file(READ "${CMAKE_CURRENT_SOURCE_DIR}/../java/pom.xml" POM_XML) +string(REGEX MATCHALL + "\n [^<]+" ARROW_VERSION_TAG "${POM_XML}") +string(REGEX REPLACE + "(\n |)" "" ARROW_VERSION "${ARROW_VERSION_TAG}") + set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_SOURCE_DIR}/cmake_modules") include(CMakeParseArguments) From d28f1c1e0f21bc578b84ab4bed4cf259c333fbc9 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 24 Feb 2017 09:16:32 -0500 Subject: [PATCH 0339/1644] ARROW-459: [C++] Dictionary IPC support in file and stream formats Also fixes ARROW-565 Author: Wes McKinney Closes #347 from wesm/ARROW-459 and squashes the following commits: 6a987b7 [Wes McKinney] Fix clang warning with forward declaration 8e0e6fb [Wes McKinney] Fix bug causing valgrind failure dee044e [Wes McKinney] Review comments 7ac756e [Wes McKinney] Fix Python build e5cec27 [Wes McKinney] Add some less trivial dictionary-encoded arrays to test case acfa994 [Wes McKinney] cpplint ef9dea8 [Wes McKinney] More dictionary support in FileReader. Simple test passes cb04a41 [Wes McKinney] Refactoring. Remove FileFooter class in favor of private impl in FileReader 1cee0ff [Wes McKinney] More progress toward file/stream roundtrips with dictionaries ae389fa [Wes McKinney] WIP progress toward stream/file dictionary roundtrip 6858e12 [Wes McKinney] Add union and dictionary to file/stream tests d537004 [Wes McKinney] Add support for deconstructing and reconstructing DictionaryArray with known schema --- cpp/CMakeLists.txt | 4 +- cpp/src/arrow/array.h | 2 + cpp/src/arrow/io/CMakeLists.txt | 9 +- cpp/src/arrow/ipc/CMakeLists.txt | 15 +- cpp/src/arrow/ipc/adapter.cc | 189 +++++++++----- cpp/src/arrow/ipc/adapter.h | 30 +-- cpp/src/arrow/ipc/file.cc | 306 ++++++++++++----------- cpp/src/arrow/ipc/file.h | 51 +--- cpp/src/arrow/ipc/ipc-adapter-test.cc | 46 ++-- cpp/src/arrow/ipc/ipc-file-test.cc | 78 ++---- cpp/src/arrow/ipc/ipc-metadata-test.cc | 17 +- cpp/src/arrow/ipc/metadata-internal.cc | 239 ++++++++++++++---- cpp/src/arrow/ipc/metadata-internal.h | 45 ++-- cpp/src/arrow/ipc/metadata.cc | 185 ++++++++++++-- cpp/src/arrow/ipc/metadata.h | 97 +++++-- cpp/src/arrow/ipc/stream.cc | 207 ++++++++++----- cpp/src/arrow/ipc/stream.h | 34 ++- cpp/src/arrow/ipc/test-common.h | 80 ++++++ cpp/src/arrow/type.cc | 6 +- cpp/src/arrow/type.h | 14 +- python/pyarrow/includes/libarrow_ipc.pxd | 1 - python/pyarrow/io.pyx | 5 - 22 files changed, 1093 insertions(+), 567 deletions(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 0888a8b97faa1..b77f8c79fa024 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -102,7 +102,9 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") ON) endif() -if(NOT ARROW_BUILD_TESTS) +if(ARROW_BUILD_TESTS) + set(ARROW_BUILD_STATIC ON) +else() set(NO_TESTS 1) endif() diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 32d156b8cd0f6..9bb06afc9bf6c 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -40,6 +40,8 @@ class Status; class ArrayVisitor { public: + virtual ~ArrayVisitor() = default; + virtual Status Visit(const 
NullArray& array) = 0; virtual Status Visit(const BooleanArray& array) = 0; virtual Status Visit(const Int8Array& array) = 0; diff --git a/cpp/src/arrow/io/CMakeLists.txt b/cpp/src/arrow/io/CMakeLists.txt index b8882e46b4893..ceb7b7379322a 100644 --- a/cpp/src/arrow/io/CMakeLists.txt +++ b/cpp/src/arrow/io/CMakeLists.txt @@ -70,13 +70,8 @@ set(ARROW_IO_STATIC_PRIVATE_LINK_LIBS boost_system_static boost_filesystem_static) -if (ARROW_BUILD_STATIC) - set(ARROW_IO_TEST_LINK_LIBS - arrow_io_static) -else() - set(ARROW_IO_TEST_LINK_LIBS - arrow_io_shared) -endif() +set(ARROW_IO_TEST_LINK_LIBS + arrow_io_static) set(ARROW_IO_SRCS file.cc diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index c047f53d6bf06..e7a3fdb1dd862 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -24,20 +24,9 @@ set(ARROW_IPC_SHARED_LINK_LIBS arrow_shared ) -set(ARROW_IPC_STATIC_LINK_LIBS - arrow_static +set(ARROW_IPC_TEST_LINK_LIBS arrow_io_static -) - -if (ARROW_BUILD_STATIC) - set(ARROW_IPC_TEST_LINK_LIBS - arrow_io_static - arrow_ipc_static) -else() - set(ARROW_IPC_TEST_LINK_LIBS - arrow_io_shared - arrow_ipc_shared) -endif() + arrow_ipc_static) set(ARROW_IPC_SRCS adapter.cc diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index a24c007a4056e..08ac9832982c1 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -51,12 +51,15 @@ namespace ipc { class RecordBatchWriter : public ArrayVisitor { public: - RecordBatchWriter(MemoryPool* pool, const RecordBatch& batch, - int64_t buffer_start_offset, int max_recursion_depth) + RecordBatchWriter( + MemoryPool* pool, int64_t buffer_start_offset, int max_recursion_depth) : pool_(pool), - batch_(batch), max_recursion_depth_(max_recursion_depth), - buffer_start_offset_(buffer_start_offset) {} + buffer_start_offset_(buffer_start_offset) { + DCHECK_GT(max_recursion_depth, 0); + } + + virtual ~RecordBatchWriter() = default; Status VisitArray(const Array& arr) { if (max_recursion_depth_ <= 0) { @@ -81,7 +84,7 @@ class RecordBatchWriter : public ArrayVisitor { return arr.Accept(this); } - Status Assemble(int64_t* body_length) { + Status Assemble(const RecordBatch& batch, int64_t* body_length) { if (field_nodes_.size() > 0) { field_nodes_.clear(); buffer_meta_.clear(); @@ -89,8 +92,8 @@ class RecordBatchWriter : public ArrayVisitor { } // Perform depth-first traversal of the row-batch - for (int i = 0; i < batch_.num_columns(); ++i) { - RETURN_NOT_OK(VisitArray(*batch_.column(i))); + for (int i = 0; i < batch.num_columns(); ++i) { + RETURN_NOT_OK(VisitArray(*batch.column(i))); } // The position for the start of a buffer relative to the passed frame of @@ -127,16 +130,22 @@ class RecordBatchWriter : public ArrayVisitor { return Status::OK(); } - Status WriteMetadata( - int64_t body_length, io::OutputStream* dst, int32_t* metadata_length) { + // Override this for writing dictionary metadata + virtual Status WriteMetadataMessage( + int32_t num_rows, int64_t body_length, std::shared_ptr* out) { + return WriteRecordBatchMessage( + num_rows, body_length, field_nodes_, buffer_meta_, out); + } + + Status WriteMetadata(int32_t num_rows, int64_t body_length, io::OutputStream* dst, + int32_t* metadata_length) { // Now that we have computed the locations of all of the buffers in shared // memory, the data header can be converted to a flatbuffer and written out // // Note: The memory written here is prefixed by the size of the flatbuffer // itself as an int32_t. 
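    // Spelled out, the framing produced here is a 4-byte size prefix,
    // followed by the metadata flatbuffer, followed by padding so that the
    // whole message ends on an 8-byte boundary. Readers can therefore fetch
    // the int32_t prefix first and then read the metadata in one step.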
std::shared_ptr metadata_fb; - RETURN_NOT_OK(WriteRecordBatchMetadata( - batch_.num_rows(), body_length, field_nodes_, buffer_meta_, &metadata_fb)); + RETURN_NOT_OK(WriteMetadataMessage(num_rows, body_length, &metadata_fb)); // Need to write 4 bytes (metadata size), the metadata, plus padding to // end on an 8-byte offset @@ -166,15 +175,16 @@ class RecordBatchWriter : public ArrayVisitor { return Status::OK(); } - Status Write(io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length) { - RETURN_NOT_OK(Assemble(body_length)); + Status Write(const RecordBatch& batch, io::OutputStream* dst, int32_t* metadata_length, + int64_t* body_length) { + RETURN_NOT_OK(Assemble(batch, body_length)); #ifndef NDEBUG int64_t start_position, current_position; RETURN_NOT_OK(dst->Tell(&start_position)); #endif - RETURN_NOT_OK(WriteMetadata(*body_length, dst, metadata_length)); + RETURN_NOT_OK(WriteMetadata(batch.num_rows(), *body_length, dst, metadata_length)); #ifndef NDEBUG RETURN_NOT_OK(dst->Tell(¤t_position)); @@ -206,17 +216,17 @@ class RecordBatchWriter : public ArrayVisitor { return Status::OK(); } - Status GetTotalSize(int64_t* size) { + Status GetTotalSize(const RecordBatch& batch, int64_t* size) { // emulates the behavior of Write without actually writing int32_t metadata_length = 0; int64_t body_length = 0; MockOutputStream dst; - RETURN_NOT_OK(Write(&dst, &metadata_length, &body_length)); + RETURN_NOT_OK(Write(batch, &dst, &metadata_length, &body_length)); *size = dst.GetExtentBytesWritten(); return Status::OK(); } - private: + protected: Status Visit(const NullArray& array) override { return Status::NotImplemented("null"); } template @@ -468,15 +478,12 @@ class RecordBatchWriter : public ArrayVisitor { } Status Visit(const DictionaryArray& array) override { - // Dictionary written out separately - const auto& indices = static_cast(*array.indices().get()); - buffers_.push_back(indices.data()); - return Status::OK(); + // Dictionary written out separately. Slice offset contained in the indices + return array.indices()->Accept(this); } // In some cases, intermediate buffers may need to be allocated (with sliced arrays) MemoryPool* pool_; - const RecordBatch& batch_; std::vector field_nodes_; std::vector buffer_meta_; @@ -486,17 +493,51 @@ class RecordBatchWriter : public ArrayVisitor { int64_t buffer_start_offset_; }; +class DictionaryWriter : public RecordBatchWriter { + public: + using RecordBatchWriter::RecordBatchWriter; + + Status WriteMetadataMessage( + int32_t num_rows, int64_t body_length, std::shared_ptr* out) override { + return WriteDictionaryMessage( + dictionary_id_, num_rows, body_length, field_nodes_, buffer_meta_, out); + } + + Status Write(int64_t dictionary_id, const std::shared_ptr& dictionary, + io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length) { + dictionary_id_ = dictionary_id; + + // Make a dummy record batch. 
A bit tedious as we have to make a schema + std::vector> fields = { + arrow::field("dictionary", dictionary->type())}; + auto schema = std::make_shared(fields); + RecordBatch batch(schema, dictionary->length(), {dictionary}); + + return RecordBatchWriter::Write(batch, dst, metadata_length, body_length); + } + + private: + // TODO(wesm): Setting this in Write is a bit unclean, but it works + int64_t dictionary_id_; +}; + Status WriteRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, MemoryPool* pool, int max_recursion_depth) { - DCHECK_GT(max_recursion_depth, 0); - RecordBatchWriter serializer(pool, batch, buffer_start_offset, max_recursion_depth); - return serializer.Write(dst, metadata_length, body_length); + RecordBatchWriter writer(pool, buffer_start_offset, max_recursion_depth); + return writer.Write(batch, dst, metadata_length, body_length); +} + +Status WriteDictionary(int64_t dictionary_id, const std::shared_ptr& dictionary, + int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, + int64_t* body_length, MemoryPool* pool) { + DictionaryWriter writer(pool, buffer_start_offset, kMaxIpcRecursionDepth); + return writer.Write(dictionary_id, dictionary, dst, metadata_length, body_length); } Status GetRecordBatchSize(const RecordBatch& batch, int64_t* size) { - RecordBatchWriter serializer(default_memory_pool(), batch, 0, kMaxIpcRecursionDepth); - RETURN_NOT_OK(serializer.GetTotalSize(size)); + RecordBatchWriter writer(default_memory_pool(), 0, kMaxIpcRecursionDepth); + RETURN_NOT_OK(writer.GetTotalSize(batch, size)); return Status::OK(); } @@ -580,10 +621,9 @@ class ArrayLoader : public TypeVisitor { Status LoadPrimitive(const DataType& type) { FieldMetadata field_meta; - std::shared_ptr null_bitmap; - RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); + std::shared_ptr null_bitmap, data; - std::shared_ptr data; + RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); if (field_meta.length > 0) { RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &data)); } else { @@ -597,11 +637,9 @@ class ArrayLoader : public TypeVisitor { template Status LoadBinary() { FieldMetadata field_meta; - std::shared_ptr null_bitmap; - RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); + std::shared_ptr null_bitmap, offsets, values; - std::shared_ptr offsets; - std::shared_ptr values; + RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); if (field_meta.length > 0) { RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &offsets)); RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &values)); @@ -661,11 +699,9 @@ class ArrayLoader : public TypeVisitor { Status Visit(const ListType& type) override { FieldMetadata field_meta; - std::shared_ptr null_bitmap; + std::shared_ptr null_bitmap, offsets; RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); - - std::shared_ptr offsets; if (field_meta.length > 0) { RETURN_NOT_OK(GetBuffer(context_->buffer_index, &offsets)); } else { @@ -715,12 +751,9 @@ class ArrayLoader : public TypeVisitor { Status Visit(const UnionType& type) override { FieldMetadata field_meta; - std::shared_ptr null_bitmap; - RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); - - std::shared_ptr type_ids = nullptr; - std::shared_ptr offsets = nullptr; + std::shared_ptr null_bitmap, type_ids, offsets; + RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); if (field_meta.length > 0) { RETURN_NOT_OK(GetBuffer(context_->buffer_index, &type_ids)); if (type.mode == UnionMode::DENSE) { @@ -738,13 
+771,23 @@ class ArrayLoader : public TypeVisitor { } Status Visit(const DictionaryType& type) override { - return Status::NotImplemented("dictionary"); + FieldMetadata field_meta; + std::shared_ptr null_bitmap, indices_data; + RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); + RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &indices_data)); + + std::shared_ptr indices; + RETURN_NOT_OK(MakePrimitiveArray(type.index_type(), field_meta.length, indices_data, + null_bitmap, field_meta.null_count, 0, &indices)); + + result_ = std::make_shared(field_.type, indices); + return Status::OK(); }; }; class RecordBatchReader { public: - RecordBatchReader(const std::shared_ptr& metadata, + RecordBatchReader(const RecordBatchMetadata& metadata, const std::shared_ptr& schema, int max_recursion_depth, io::ReadableFileInterface* file) : metadata_(metadata), @@ -758,7 +801,7 @@ class RecordBatchReader { // The field_index and buffer_index are incremented in the ArrayLoader // based on how much of the batch is "consumed" (through nested data // reconstruction, for example) - context_.metadata = metadata_.get(); + context_.metadata = &metadata_; context_.field_index = 0; context_.buffer_index = 0; context_.max_recursion_depth = max_recursion_depth_; @@ -768,50 +811,58 @@ class RecordBatchReader { RETURN_NOT_OK(loader.Load(&arrays[i])); } - *out = std::make_shared(schema_, metadata_->length(), arrays); + *out = std::make_shared(schema_, metadata_.length(), arrays); return Status::OK(); } private: RecordBatchContext context_; - std::shared_ptr metadata_; + const RecordBatchMetadata& metadata_; std::shared_ptr schema_; int max_recursion_depth_; io::ReadableFileInterface* file_; }; -Status ReadRecordBatchMetadata(int64_t offset, int32_t metadata_length, - io::ReadableFileInterface* file, std::shared_ptr* metadata) { - std::shared_ptr buffer; - RETURN_NOT_OK(file->ReadAt(offset, metadata_length, &buffer)); - - int32_t flatbuffer_size = *reinterpret_cast(buffer->data()); - - if (flatbuffer_size + static_cast(sizeof(int32_t)) > metadata_length) { - std::stringstream ss; - ss << "flatbuffer size " << metadata_length << " invalid. 
File offset: " << offset - << ", metadata length: " << metadata_length; - return Status::Invalid(ss.str()); - } - - std::shared_ptr message; - RETURN_NOT_OK(Message::Open(buffer, 4, &message)); - *metadata = std::make_shared(message); - return Status::OK(); -} - -Status ReadRecordBatch(const std::shared_ptr& metadata, +Status ReadRecordBatch(const RecordBatchMetadata& metadata, const std::shared_ptr& schema, io::ReadableFileInterface* file, std::shared_ptr* out) { return ReadRecordBatch(metadata, schema, kMaxIpcRecursionDepth, file, out); } -Status ReadRecordBatch(const std::shared_ptr& metadata, +Status ReadRecordBatch(const RecordBatchMetadata& metadata, const std::shared_ptr& schema, int max_recursion_depth, io::ReadableFileInterface* file, std::shared_ptr* out) { RecordBatchReader reader(metadata, schema, max_recursion_depth, file); return reader.Read(out); } +Status ReadDictionary(const DictionaryBatchMetadata& metadata, + const DictionaryTypeMap& dictionary_types, io::ReadableFileInterface* file, + std::shared_ptr* out) { + int64_t id = metadata.id(); + auto it = dictionary_types.find(id); + if (it == dictionary_types.end()) { + std::stringstream ss; + ss << "Do not have type metadata for dictionary with id: " << id; + return Status::KeyError(ss.str()); + } + + std::vector> fields = {it->second}; + + // We need a schema for the record batch + auto dummy_schema = std::make_shared(fields); + + // The dictionary is embedded in a record batch with a single column + std::shared_ptr batch; + RETURN_NOT_OK(ReadRecordBatch(metadata.record_batch(), dummy_schema, file, &batch)); + + if (batch->num_columns() != 1) { + return Status::Invalid("Dictionary record batch must only contain one field"); + } + + *out = batch->column(0); + return Status::OK(); +} + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/adapter.h b/cpp/src/arrow/ipc/adapter.h index 83542d0b066d4..b7d8fa93d3651 100644 --- a/cpp/src/arrow/ipc/adapter.h +++ b/cpp/src/arrow/ipc/adapter.h @@ -25,6 +25,7 @@ #include #include +#include "arrow/ipc/metadata.h" #include "arrow/util/visibility.h" namespace arrow { @@ -44,8 +45,6 @@ class OutputStream; namespace ipc { -class RecordBatchMetadata; - // ---------------------------------------------------------------------- // Write path // We have trouble decoding flatbuffers if the size i > 70, so 64 is a nice round number @@ -72,34 +71,35 @@ constexpr int kMaxIpcRecursionDepth = 64; // // @param(out) body_length: the size of the contiguous buffer block plus // padding bytes -Status ARROW_EXPORT WriteRecordBatch(const RecordBatch& batch, +Status WriteRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, + io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, + MemoryPool* pool, int max_recursion_depth = kMaxIpcRecursionDepth); + +// Write Array as a DictionaryBatch message +Status WriteDictionary(int64_t dictionary_id, const std::shared_ptr& dictionary, int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, - int64_t* body_length, MemoryPool* pool, - int max_recursion_depth = kMaxIpcRecursionDepth); + int64_t* body_length, MemoryPool* pool); // Compute the precise number of bytes needed in a contiguous memory segment to // write the record batch. This involves generating the complete serialized // Flatbuffers metadata. 
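// Internally this works by running the writer against a MockOutputStream
// that only counts bytes (see GetTotalSize above), so the reported size
// matches the writer's real layout byte for byte -- for example, when
// preallocating a fixed-size output region before calling WriteRecordBatch.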
-Status ARROW_EXPORT GetRecordBatchSize(const RecordBatch& batch, int64_t* size); +Status GetRecordBatchSize(const RecordBatch& batch, int64_t* size); // ---------------------------------------------------------------------- // "Read" path; does not copy data if the input supports zero copy reads -// Read the record batch flatbuffer metadata starting at the indicated file offset -// -// The flatbuffer is expected to be length-prefixed, so the metadata_length -// includes at least the length prefix and the flatbuffer -Status ARROW_EXPORT ReadRecordBatchMetadata(int64_t offset, int32_t metadata_length, - io::ReadableFileInterface* file, std::shared_ptr* metadata); - -Status ARROW_EXPORT ReadRecordBatch(const std::shared_ptr& metadata, +Status ReadRecordBatch(const RecordBatchMetadata& metadata, const std::shared_ptr& schema, io::ReadableFileInterface* file, std::shared_ptr* out); -Status ARROW_EXPORT ReadRecordBatch(const std::shared_ptr& metadata, +Status ReadRecordBatch(const RecordBatchMetadata& metadata, const std::shared_ptr& schema, int max_recursion_depth, io::ReadableFileInterface* file, std::shared_ptr* out); +Status ReadDictionary(const DictionaryBatchMetadata& metadata, + const DictionaryTypeMap& dictionary_types, io::ReadableFileInterface* file, + std::shared_ptr* out); + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/file.cc b/cpp/src/arrow/ipc/file.cc index 3b1832611024f..c1d483f1fbba6 100644 --- a/cpp/src/arrow/ipc/file.cc +++ b/cpp/src/arrow/ipc/file.cc @@ -36,8 +36,6 @@ namespace arrow { namespace ipc { static constexpr const char* kArrowMagicBytes = "ARROW1"; -// ---------------------------------------------------------------------- -// File footer static flatbuffers::Offset> FileBlocksToFlatbuffer(FBB& fbb, const std::vector& blocks) { @@ -51,11 +49,12 @@ FileBlocksToFlatbuffer(FBB& fbb, const std::vector& blocks) { } Status WriteFileFooter(const Schema& schema, const std::vector& dictionaries, - const std::vector& record_batches, io::OutputStream* out) { + const std::vector& record_batches, DictionaryMemo* dictionary_memo, + io::OutputStream* out) { FBB fbb; flatbuffers::Offset fb_schema; - RETURN_NOT_OK(SchemaToFlatbuffer(fbb, schema, &fb_schema)); + RETURN_NOT_OK(SchemaToFlatbuffer(fbb, schema, dictionary_memo, &fb_schema)); auto fb_dictionaries = FileBlocksToFlatbuffer(fbb, dictionaries); auto fb_record_batches = FileBlocksToFlatbuffer(fbb, record_batches); @@ -74,87 +73,6 @@ static inline FileBlock FileBlockFromFlatbuffer(const flatbuf::Block* block) { return FileBlock(block->offset(), block->metaDataLength(), block->bodyLength()); } -class FileFooter::FileFooterImpl { - public: - FileFooterImpl(const std::shared_ptr& buffer, const flatbuf::Footer* footer) - : buffer_(buffer), footer_(footer) {} - - int num_dictionaries() const { return footer_->dictionaries()->size(); } - - int num_record_batches() const { return footer_->recordBatches()->size(); } - - MetadataVersion::type version() const { - switch (footer_->version()) { - case flatbuf::MetadataVersion_V1: - return MetadataVersion::V1; - case flatbuf::MetadataVersion_V2: - return MetadataVersion::V2; - // Add cases as other versions become available - default: - return MetadataVersion::V2; - } - } - - FileBlock record_batch(int i) const { - return FileBlockFromFlatbuffer(footer_->recordBatches()->Get(i)); - } - - FileBlock dictionary(int i) const { - return FileBlockFromFlatbuffer(footer_->dictionaries()->Get(i)); - } - - Status GetSchema(std::shared_ptr* out) const { - auto schema_msg = 
std::make_shared(nullptr, footer_->schema()); - return schema_msg->GetSchema(out); - } - - private: - // Retain reference to memory - std::shared_ptr buffer_; - - const flatbuf::Footer* footer_; -}; - -FileFooter::FileFooter() {} - -FileFooter::~FileFooter() {} - -Status FileFooter::Open( - const std::shared_ptr& buffer, std::unique_ptr* out) { - const flatbuf::Footer* footer = flatbuf::GetFooter(buffer->data()); - - *out = std::unique_ptr(new FileFooter()); - - // TODO(wesm): Verify the footer - (*out)->impl_.reset(new FileFooterImpl(buffer, footer)); - - return Status::OK(); -} - -int FileFooter::num_dictionaries() const { - return impl_->num_dictionaries(); -} - -int FileFooter::num_record_batches() const { - return impl_->num_record_batches(); -} - -MetadataVersion::type FileFooter::version() const { - return impl_->version(); -} - -FileBlock FileFooter::record_batch(int i) const { - return impl_->record_batch(i); -} - -FileBlock FileFooter::dictionary(int i) const { - return impl_->dictionary(i); -} - -Status FileFooter::GetSchema(std::shared_ptr* out) const { - return impl_->GetSchema(out); -} - // ---------------------------------------------------------------------- // File writer implementation @@ -171,22 +89,17 @@ Status FileWriter::Open(io::OutputStream* sink, const std::shared_ptr& s Status FileWriter::Start() { RETURN_NOT_OK(WriteAligned( reinterpret_cast(kArrowMagicBytes), strlen(kArrowMagicBytes))); - started_ = true; - return Status::OK(); -} -Status FileWriter::WriteRecordBatch(const RecordBatch& batch) { - // Push an empty FileBlock - // Append metadata, to be written in the footer later - record_batches_.emplace_back(0, 0, 0); - return StreamWriter::WriteRecordBatch( - batch, &record_batches_[record_batches_.size() - 1]); + // We write the schema at the start of the file (and the end). 
This also
+  // writes all the dictionaries at the beginning of the file
+  return StreamWriter::Start();
 }
 
 Status FileWriter::Close() {
   // Write metadata
   int64_t initial_position = position_;
-  RETURN_NOT_OK(WriteFileFooter(*schema_, dictionaries_, record_batches_, sink_));
+  RETURN_NOT_OK(WriteFileFooter(
+      *schema_, dictionaries_, record_batches_, dictionary_memo_.get(), sink_));
   RETURN_NOT_OK(UpdatePosition());
 
   // Write footer length
@@ -204,89 +117,180 @@ Status FileWriter::Close() {
 // ----------------------------------------------------------------------
 // Reader implementation
 
-FileReader::FileReader(
-    const std::shared_ptr<io::ReadableFileInterface>& file, int64_t footer_offset)
-    : file_(file), footer_offset_(footer_offset) {}
+class FileReader::FileReaderImpl {
+ public:
+  FileReaderImpl() { dictionary_memo_ = std::make_shared<DictionaryMemo>(); }
 
-FileReader::~FileReader() {}
+  Status ReadFooter() {
+    int magic_size = static_cast<int>(strlen(kArrowMagicBytes));
 
-Status FileReader::Open(const std::shared_ptr<io::ReadableFileInterface>& file,
-    std::shared_ptr<FileReader>* reader) {
-  int64_t footer_offset;
-  RETURN_NOT_OK(file->GetSize(&footer_offset));
-  return Open(file, footer_offset, reader);
-}
+    if (footer_offset_ <= magic_size * 2 + 4) {
+      std::stringstream ss;
+      ss << "File is too small: " << footer_offset_;
+      return Status::Invalid(ss.str());
+    }
 
-Status FileReader::Open(const std::shared_ptr<io::ReadableFileInterface>& file,
-    int64_t footer_offset, std::shared_ptr<FileReader>* reader) {
-  *reader = std::shared_ptr<FileReader>(new FileReader(file, footer_offset));
-  return (*reader)->ReadFooter();
-}
+    std::shared_ptr<Buffer> buffer;
+    int file_end_size = magic_size + sizeof(int32_t);
+    RETURN_NOT_OK(file_->ReadAt(footer_offset_ - file_end_size, file_end_size, &buffer));
+
+    if (memcmp(buffer->data() + sizeof(int32_t), kArrowMagicBytes, magic_size)) {
+      return Status::Invalid("Not an Arrow file");
+    }
+
+    int32_t footer_length = *reinterpret_cast<const int32_t*>(buffer->data());
+
+    if (footer_length <= 0 || footer_length + magic_size * 2 + 4 > footer_offset_) {
+      return Status::Invalid("File is smaller than indicated metadata size");
+    }
 
-Status FileReader::ReadFooter() {
-  int magic_size = static_cast<int>(strlen(kArrowMagicBytes));
+    // Now read the footer
+    RETURN_NOT_OK(file_->ReadAt(
+        footer_offset_ - footer_length - file_end_size, footer_length, &footer_buffer_));
 
-  if (footer_offset_ <= magic_size * 2 + 4) {
-    std::stringstream ss;
-    ss << "File is too small: " << footer_offset_;
-    return Status::Invalid(ss.str());
+    // TODO(wesm): Verify the footer
+    footer_ = flatbuf::GetFooter(footer_buffer_->data());
+    schema_metadata_.reset(new SchemaMetadata(nullptr, footer_->schema()));
+
+    return Status::OK();
+  }
+
+  int num_dictionaries() const { return footer_->dictionaries()->size(); }
+
+  int num_record_batches() const { return footer_->recordBatches()->size(); }
+
+  MetadataVersion::type version() const {
+    switch (footer_->version()) {
+      case flatbuf::MetadataVersion_V1:
+        return MetadataVersion::V1;
+      case flatbuf::MetadataVersion_V2:
+        return MetadataVersion::V2;
+      // Add cases as other versions become available
+      default:
+        return MetadataVersion::V2;
+    }
+  }
+
+  FileBlock record_batch(int i) const {
+    return FileBlockFromFlatbuffer(footer_->recordBatches()->Get(i));
+  }
+
+  FileBlock dictionary(int i) const {
+    return FileBlockFromFlatbuffer(footer_->dictionaries()->Get(i));
   }
 
-  std::shared_ptr<Buffer> buffer;
-  int file_end_size = magic_size + sizeof(int32_t);
-  RETURN_NOT_OK(file_->ReadAt(footer_offset_ - file_end_size, file_end_size, &buffer));
+  const SchemaMetadata& schema_metadata() const { return *schema_metadata_; }
+
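+  // Illustrative sketch of how a caller reaches GetRecordBatch below (the
+  // snippet is not part of this patch; `file` is an assumed, already-opened
+  // io::ReadableFileInterface):
+  //
+  //   std::shared_ptr<FileReader> reader;
+  //   RETURN_NOT_OK(FileReader::Open(file, &reader));
+  //   std::shared_ptr<RecordBatch> batch;
+  //   RETURN_NOT_OK(reader->GetRecordBatch(0, &batch));
+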
Status GetRecordBatch(int i, std::shared_ptr* batch) { + DCHECK_GE(i, 0); + DCHECK_LT(i, num_record_batches()); + FileBlock block = record_batch(i); + + std::shared_ptr message; + RETURN_NOT_OK( + ReadMessage(block.offset, block.metadata_length, file_.get(), &message)); + auto metadata = std::make_shared(message); - if (memcmp(buffer->data() + sizeof(int32_t), kArrowMagicBytes, magic_size)) { - return Status::Invalid("Not an Arrow file"); + // TODO(wesm): ARROW-388 -- the buffer frame of reference is 0 (see + // ARROW-384). + std::shared_ptr buffer_block; + RETURN_NOT_OK(file_->Read(block.body_length, &buffer_block)); + io::BufferReader reader(buffer_block); + + return ReadRecordBatch(*metadata, schema_, &reader, batch); } - int32_t footer_length = *reinterpret_cast(buffer->data()); + Status ReadSchema() { + RETURN_NOT_OK(schema_metadata_->GetDictionaryTypes(&dictionary_fields_)); + + // Read all the dictionaries + for (int i = 0; i < num_dictionaries(); ++i) { + FileBlock block = dictionary(i); + std::shared_ptr message; + RETURN_NOT_OK( + ReadMessage(block.offset, block.metadata_length, file_.get(), &message)); + + // TODO(wesm): ARROW-577: This code is duplicated, can be fixed with a more + // invasive refactor + DictionaryBatchMetadata metadata(message); + + // TODO(wesm): ARROW-388 -- the buffer frame of reference is 0 (see + // ARROW-384). + std::shared_ptr buffer_block; + RETURN_NOT_OK(file_->Read(block.body_length, &buffer_block)); + io::BufferReader reader(buffer_block); + + std::shared_ptr dictionary; + RETURN_NOT_OK(ReadDictionary(metadata, dictionary_fields_, &reader, &dictionary)); + RETURN_NOT_OK(dictionary_memo_->AddDictionary(metadata.id(), dictionary)); + } - if (footer_length <= 0 || footer_length + magic_size * 2 + 4 > footer_offset_) { - return Status::Invalid("File is smaller than indicated metadata size"); + // Get the schema + return schema_metadata_->GetSchema(*dictionary_memo_, &schema_); } - // Now read the footer - RETURN_NOT_OK(file_->ReadAt( - footer_offset_ - footer_length - file_end_size, footer_length, &buffer)); - RETURN_NOT_OK(FileFooter::Open(buffer, &footer_)); + Status Open( + const std::shared_ptr& file, int64_t footer_offset) { + file_ = file; + footer_offset_ = footer_offset; + RETURN_NOT_OK(ReadFooter()); + return ReadSchema(); + } + + std::shared_ptr schema() const { return schema_; } + + private: + std::shared_ptr file_; - // Get the schema - return footer_->GetSchema(&schema_); + // The location where the Arrow file layout ends. May be the end of the file + // or some other location if embedded in a larger file. 
+ int64_t footer_offset_; + + // Footer metadata + std::shared_ptr footer_buffer_; + const flatbuf::Footer* footer_; + std::unique_ptr schema_metadata_; + + DictionaryTypeMap dictionary_fields_; + std::shared_ptr dictionary_memo_; + + // Reconstructed schema, including any read dictionaries + std::shared_ptr schema_; +}; + +FileReader::FileReader() { + impl_.reset(new FileReaderImpl()); } -std::shared_ptr FileReader::schema() const { - return schema_; +FileReader::~FileReader() {} + +Status FileReader::Open(const std::shared_ptr& file, + std::shared_ptr* reader) { + int64_t footer_offset; + RETURN_NOT_OK(file->GetSize(&footer_offset)); + return Open(file, footer_offset, reader); +} + +Status FileReader::Open(const std::shared_ptr& file, + int64_t footer_offset, std::shared_ptr* reader) { + *reader = std::shared_ptr(new FileReader()); + return (*reader)->impl_->Open(file, footer_offset); } -int FileReader::num_dictionaries() const { - return footer_->num_dictionaries(); +std::shared_ptr FileReader::schema() const { + return impl_->schema(); } int FileReader::num_record_batches() const { - return footer_->num_record_batches(); + return impl_->num_record_batches(); } MetadataVersion::type FileReader::version() const { - return footer_->version(); + return impl_->version(); } Status FileReader::GetRecordBatch(int i, std::shared_ptr* batch) { - DCHECK_GE(i, 0); - DCHECK_LT(i, num_record_batches()); - FileBlock block = footer_->record_batch(i); - - std::shared_ptr metadata; - RETURN_NOT_OK(ReadRecordBatchMetadata( - block.offset, block.metadata_length, file_.get(), &metadata)); - - // TODO(wesm): ARROW-388 -- the buffer frame of reference is 0 (see - // ARROW-384). - std::shared_ptr buffer_block; - RETURN_NOT_OK(file_->Read(block.body_length, &buffer_block)); - io::BufferReader reader(buffer_block); - - return ReadRecordBatch(metadata, schema_, &reader, batch); + return impl_->GetRecordBatch(i, batch); } } // namespace ipc diff --git a/cpp/src/arrow/ipc/file.h b/cpp/src/arrow/ipc/file.h index cf0baab820eef..524766ccb3336 100644 --- a/cpp/src/arrow/ipc/file.h +++ b/cpp/src/arrow/ipc/file.h @@ -45,45 +45,21 @@ class ReadableFileInterface; namespace ipc { Status WriteFileFooter(const Schema& schema, const std::vector& dictionaries, - const std::vector& record_batches, io::OutputStream* out); - -class ARROW_EXPORT FileFooter { - public: - ~FileFooter(); - - static Status Open( - const std::shared_ptr& buffer, std::unique_ptr* out); - - int num_dictionaries() const; - int num_record_batches() const; - MetadataVersion::type version() const; - - FileBlock record_batch(int i) const; - FileBlock dictionary(int i) const; - - Status GetSchema(std::shared_ptr* out) const; - - private: - FileFooter(); - class FileFooterImpl; - std::unique_ptr impl_; -}; + const std::vector& record_batches, DictionaryMemo* dictionary_memo, + io::OutputStream* out); class ARROW_EXPORT FileWriter : public StreamWriter { public: static Status Open(io::OutputStream* sink, const std::shared_ptr& schema, std::shared_ptr* out); - Status WriteRecordBatch(const RecordBatch& batch) override; + using StreamWriter::WriteRecordBatch; Status Close() override; private: FileWriter(io::OutputStream* sink, const std::shared_ptr& schema); Status Start() override; - - std::vector dictionaries_; - std::vector record_batches_; }; class ARROW_EXPORT FileReader { @@ -108,13 +84,9 @@ class ARROW_EXPORT FileReader { static Status Open(const std::shared_ptr& file, int64_t footer_offset, std::shared_ptr* reader); + /// The schema includes any 
dictionaries std::shared_ptr schema() const; - // Shared dictionaries for dictionary-encoding cross record batches - // TODO(wesm): Implement dictionary reading when we also have dictionary - // encoding - int num_dictionaries() const; - int num_record_batches() const; MetadataVersion::type version() const; @@ -127,19 +99,10 @@ class ARROW_EXPORT FileReader { Status GetRecordBatch(int i, std::shared_ptr* batch); private: - FileReader( - const std::shared_ptr& file, int64_t footer_offset); - - Status ReadFooter(); - - std::shared_ptr file_; - - // The location where the Arrow file layout ends. May be the end of the file - // or some other location if embedded in a larger file. - int64_t footer_offset_; + FileReader(); - std::unique_ptr footer_; - std::shared_ptr schema_; + class ARROW_NO_EXPORT FileReaderImpl; + std::unique_ptr impl_; }; } // namespace ipc diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc index d11b95b167d21..8999363893289 100644 --- a/cpp/src/arrow/ipc/ipc-adapter-test.cc +++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc @@ -27,6 +27,7 @@ #include "arrow/io/memory.h" #include "arrow/io/test-common.h" #include "arrow/ipc/adapter.h" +#include "arrow/ipc/metadata.h" #include "arrow/ipc/test-common.h" #include "arrow/ipc/util.h" @@ -40,12 +41,8 @@ namespace arrow { namespace ipc { -class TestWriteRecordBatch : public ::testing::TestWithParam, - public io::MemoryMapFixture { +class IpcTestFixture : public io::MemoryMapFixture { public: - void SetUp() { pool_ = default_memory_pool(); } - void TearDown() { io::MemoryMapFixture::TearDown(); } - Status RoundTripHelper(const RecordBatch& batch, int memory_map_size, std::shared_ptr* batch_result) { std::string path = "test-write-row-batch"; @@ -59,8 +56,9 @@ class TestWriteRecordBatch : public ::testing::TestWithParam, RETURN_NOT_OK(WriteRecordBatch( batch, buffer_offset, mmap_.get(), &metadata_length, &body_length, pool_)); - std::shared_ptr metadata; - RETURN_NOT_OK(ReadRecordBatchMetadata(0, metadata_length, mmap_.get(), &metadata)); + std::shared_ptr message; + RETURN_NOT_OK(ReadMessage(0, metadata_length, mmap_.get(), &message)); + auto metadata = std::make_shared(message); // The buffer offsets start at 0, so we must construct a // ReadableFileInterface according to that frame of reference @@ -68,7 +66,7 @@ class TestWriteRecordBatch : public ::testing::TestWithParam, RETURN_NOT_OK(mmap_->ReadAt(metadata_length, body_length, &buffer_payload)); io::BufferReader buffer_reader(buffer_payload); - return ReadRecordBatch(metadata, batch.schema(), &buffer_reader, batch_result); + return ReadRecordBatch(*metadata, batch.schema(), &buffer_reader, batch_result); } void CheckRoundtrip(const RecordBatch& batch, int64_t buffer_size) { @@ -112,14 +110,29 @@ class TestWriteRecordBatch : public ::testing::TestWithParam, MemoryPool* pool_; }; -TEST_P(TestWriteRecordBatch, RoundTrip) { +class TestWriteRecordBatch : public ::testing::Test, public IpcTestFixture { + public: + void SetUp() { pool_ = default_memory_pool(); } + void TearDown() { io::MemoryMapFixture::TearDown(); } +}; + +class TestRecordBatchParam : public ::testing::TestWithParam, + public IpcTestFixture { + public: + void SetUp() { pool_ = default_memory_pool(); } + void TearDown() { io::MemoryMapFixture::TearDown(); } + using IpcTestFixture::RoundTripHelper; + using IpcTestFixture::CheckRoundtrip; +}; + +TEST_P(TestRecordBatchParam, RoundTrip) { std::shared_ptr batch; ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue 
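  // Round-trip through a 1 MiB memory-mapped file; CheckRoundtrip writes the
  // batch out and reads it back for comparison with the input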
CheckRoundtrip(*batch, 1 << 20); } -TEST_P(TestWriteRecordBatch, SliceRoundTrip) { +TEST_P(TestRecordBatchParam, SliceRoundTrip) { std::shared_ptr batch; ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue @@ -130,7 +143,7 @@ TEST_P(TestWriteRecordBatch, SliceRoundTrip) { CheckRoundtrip(*sliced_batch, 1 << 20); } -TEST_P(TestWriteRecordBatch, ZeroLengthArrays) { +TEST_P(TestRecordBatchParam, ZeroLengthArrays) { std::shared_ptr batch; ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue @@ -159,10 +172,10 @@ TEST_P(TestWriteRecordBatch, ZeroLengthArrays) { } INSTANTIATE_TEST_CASE_P( - RoundTripTests, TestWriteRecordBatch, + RoundTripTests, TestRecordBatchParam, ::testing::Values(&MakeIntRecordBatch, &MakeStringTypesRecordBatch, &MakeNonNullRecordBatch, &MakeZeroLengthRecordBatch, &MakeListRecordBatch, - &MakeDeeplyNestedList, &MakeStruct, &MakeUnion)); + &MakeDeeplyNestedList, &MakeStruct, &MakeUnion, &MakeDictionary)); void TestGetRecordBatchSize(std::shared_ptr batch) { ipc::MockOutputStream mock; @@ -251,8 +264,9 @@ TEST_F(RecursionLimits, ReadLimit) { std::shared_ptr schema; ASSERT_OK(WriteToMmap(64, true, &metadata_length, &body_length, &schema)); - std::shared_ptr metadata; - ASSERT_OK(ReadRecordBatchMetadata(0, metadata_length, mmap_.get(), &metadata)); + std::shared_ptr message; + ASSERT_OK(ReadMessage(0, metadata_length, mmap_.get(), &message)); + auto metadata = std::make_shared(message); std::shared_ptr payload; ASSERT_OK(mmap_->ReadAt(metadata_length, body_length, &payload)); @@ -260,7 +274,7 @@ TEST_F(RecursionLimits, ReadLimit) { io::BufferReader reader(payload); std::shared_ptr batch; - ASSERT_RAISES(Invalid, ReadRecordBatch(metadata, schema, &reader, &batch)); + ASSERT_RAISES(Invalid, ReadRecordBatch(*metadata, schema, &reader, &batch)); } } // namespace ipc diff --git a/cpp/src/arrow/ipc/ipc-file-test.cc b/cpp/src/arrow/ipc/ipc-file-test.cc index 7cd8054679e44..4b82aab0e3978 100644 --- a/cpp/src/arrow/ipc/ipc-file-test.cc +++ b/cpp/src/arrow/ipc/ipc-file-test.cc @@ -180,72 +180,44 @@ TEST_P(TestStreamFormat, RoundTrip) { #define BATCH_CASES() \ ::testing::Values(&MakeIntRecordBatch, &MakeListRecordBatch, &MakeNonNullRecordBatch, \ &MakeZeroLengthRecordBatch, &MakeDeeplyNestedList, &MakeStringTypesRecordBatch, \ - &MakeStruct); + &MakeStruct, &MakeDictionary); INSTANTIATE_TEST_CASE_P(FileRoundTripTests, TestFileFormat, BATCH_CASES()); INSTANTIATE_TEST_CASE_P(StreamRoundTripTests, TestStreamFormat, BATCH_CASES()); -class TestFileFooter : public ::testing::Test { - public: - void SetUp() {} - - void CheckRoundtrip(const Schema& schema, const std::vector& dictionaries, - const std::vector& record_batches) { - auto buffer = std::make_shared(); - io::BufferOutputStream stream(buffer); - - ASSERT_OK(WriteFileFooter(schema, dictionaries, record_batches, &stream)); - - std::unique_ptr footer; - ASSERT_OK(FileFooter::Open(buffer, &footer)); - - ASSERT_EQ(MetadataVersion::V2, footer->version()); +void CheckBatchDictionaries(const RecordBatch& batch) { + // Check that dictionaries that should be the same are the same + auto schema = batch.schema(); - // Check schema - std::shared_ptr schema2; - ASSERT_OK(footer->GetSchema(&schema2)); - AssertSchemaEqual(schema, *schema2); + const auto& t0 = static_cast(*schema->field(0)->type); + const auto& t1 = static_cast(*schema->field(1)->type); - // Check blocks - ASSERT_EQ(dictionaries.size(), footer->num_dictionaries()); - ASSERT_EQ(record_batches.size(), footer->num_record_batches()); + 
ASSERT_EQ(t0.dictionary().get(), t1.dictionary().get()); - for (int i = 0; i < footer->num_dictionaries(); ++i) { - CheckBlocks(dictionaries[i], footer->dictionary(i)); - } - - for (int i = 0; i < footer->num_record_batches(); ++i) { - CheckBlocks(record_batches[i], footer->record_batch(i)); - } - } + // Same dictionary used for list values + const auto& t3 = static_cast(*schema->field(3)->type); + const auto& t3_value = static_cast(*t3.value_type()); + ASSERT_EQ(t0.dictionary().get(), t3_value.dictionary().get()); +} - void CheckBlocks(const FileBlock& left, const FileBlock& right) { - ASSERT_EQ(left.offset, right.offset); - ASSERT_EQ(left.metadata_length, right.metadata_length); - ASSERT_EQ(left.body_length, right.body_length); - } +TEST_F(TestStreamFormat, DictionaryRoundTrip) { + std::shared_ptr batch; + ASSERT_OK(MakeDictionary(&batch)); - private: - std::shared_ptr example_schema_; -}; + std::vector> out_batches; + ASSERT_OK(RoundTripHelper(*batch, &out_batches)); -TEST_F(TestFileFooter, Basics) { - auto f0 = std::make_shared("f0", std::make_shared()); - auto f1 = std::make_shared("f1", std::make_shared()); - Schema schema({f0, f1}); + CheckBatchDictionaries(*out_batches[0]); +} - std::vector dictionaries; - dictionaries.emplace_back(8, 92, 900); - dictionaries.emplace_back(1000, 100, 1900); - dictionaries.emplace_back(3000, 100, 2900); +TEST_F(TestFileFormat, DictionaryRoundTrip) { + std::shared_ptr batch; + ASSERT_OK(MakeDictionary(&batch)); - std::vector record_batches; - record_batches.emplace_back(6000, 100, 900); - record_batches.emplace_back(7000, 100, 1900); - record_batches.emplace_back(9000, 100, 2900); - record_batches.emplace_back(12000, 100, 3900); + std::vector> out_batches; + ASSERT_OK(RoundTripHelper({batch}, &out_batches)); - CheckRoundtrip(schema, dictionaries, record_batches); + CheckBatchDictionaries(*out_batches[0]); } } // namespace ipc diff --git a/cpp/src/arrow/ipc/ipc-metadata-test.cc b/cpp/src/arrow/ipc/ipc-metadata-test.cc index 098f996d292a2..4fb3204a5b6d2 100644 --- a/cpp/src/arrow/ipc/ipc-metadata-test.cc +++ b/cpp/src/arrow/ipc/ipc-metadata-test.cc @@ -22,6 +22,7 @@ #include "gtest/gtest.h" #include "arrow/io/memory.h" +#include "arrow/ipc/metadata-internal.h" #include "arrow/ipc/metadata.h" #include "arrow/ipc/test-common.h" #include "arrow/schema.h" @@ -39,9 +40,9 @@ class TestSchemaMetadata : public ::testing::Test { public: void SetUp() {} - void CheckRoundtrip(const Schema& schema) { + void CheckRoundtrip(const Schema& schema, DictionaryMemo* memo) { std::shared_ptr buffer; - ASSERT_OK(WriteSchema(schema, &buffer)); + ASSERT_OK(WriteSchemaMessage(schema, memo, &buffer)); std::shared_ptr message; ASSERT_OK(Message::Open(buffer, 0, &message)); @@ -51,8 +52,10 @@ class TestSchemaMetadata : public ::testing::Test { auto schema_msg = std::make_shared(message); ASSERT_EQ(schema.num_fields(), schema_msg->num_fields()); + DictionaryMemo empty_memo; + std::shared_ptr schema2; - ASSERT_OK(schema_msg->GetSchema(&schema2)); + ASSERT_OK(schema_msg->GetSchema(empty_memo, &schema2)); AssertSchemaEqual(schema, *schema2); } @@ -74,7 +77,9 @@ TEST_F(TestSchemaMetadata, PrimitiveFields) { auto f10 = std::make_shared("f10", std::make_shared()); Schema schema({f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10}); - CheckRoundtrip(schema); + DictionaryMemo memo; + + CheckRoundtrip(schema, &memo); } TEST_F(TestSchemaMetadata, NestedFields) { @@ -86,7 +91,9 @@ TEST_F(TestSchemaMetadata, NestedFields) { auto f1 = std::make_shared("f1", type2); Schema schema({f0, f1}); - 
CheckRoundtrip(schema); + DictionaryMemo memo; + + CheckRoundtrip(schema, &memo); } } // namespace ipc diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index cd7722056a3c7..7c8ddb93c09d1 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -25,6 +25,7 @@ #include "flatbuffers/flatbuffers.h" +#include "arrow/array.h" #include "arrow/buffer.h" #include "arrow/ipc/Message_generated.h" #include "arrow/schema.h" @@ -115,8 +116,8 @@ static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, } // Forward declaration -static Status FieldToFlatbuffer( - FBB& fbb, const std::shared_ptr& field, FieldOffset* offset); +static Status FieldToFlatbuffer(FBB& fbb, const std::shared_ptr& field, + DictionaryMemo* dictionary_memo, FieldOffset* offset); static Offset IntToFlatbuffer(FBB& fbb, int bitWidth, bool is_signed) { return flatbuf::CreateInt(fbb, bitWidth, is_signed).Union(); @@ -126,34 +127,73 @@ static Offset FloatToFlatbuffer(FBB& fbb, flatbuf::Precision precision) { return flatbuf::CreateFloatingPoint(fbb, precision).Union(); } -static Status ListToFlatbuffer(FBB& fbb, const std::shared_ptr& type, - std::vector* out_children, Offset* offset) { +static Status AppendChildFields(FBB& fbb, const std::shared_ptr& type, + std::vector* out_children, DictionaryMemo* dictionary_memo) { FieldOffset field; - RETURN_NOT_OK(FieldToFlatbuffer(fbb, type->child(0), &field)); - out_children->push_back(field); + for (int i = 0; i < type->num_children(); ++i) { + RETURN_NOT_OK(FieldToFlatbuffer(fbb, type->child(i), dictionary_memo, &field)); + out_children->push_back(field); + } + return Status::OK(); +} + +static Status ListToFlatbuffer(FBB& fbb, const std::shared_ptr& type, + std::vector* out_children, DictionaryMemo* dictionary_memo, + Offset* offset) { + RETURN_NOT_OK(AppendChildFields(fbb, type, out_children, dictionary_memo)); *offset = flatbuf::CreateList(fbb).Union(); return Status::OK(); } static Status StructToFlatbuffer(FBB& fbb, const std::shared_ptr& type, - std::vector* out_children, Offset* offset) { - FieldOffset field; - for (int i = 0; i < type->num_children(); ++i) { - RETURN_NOT_OK(FieldToFlatbuffer(fbb, type->child(i), &field)); - out_children->push_back(field); - } + std::vector* out_children, DictionaryMemo* dictionary_memo, + Offset* offset) { + RETURN_NOT_OK(AppendChildFields(fbb, type, out_children, dictionary_memo)); *offset = flatbuf::CreateStruct_(fbb).Union(); return Status::OK(); } +static Status UnionToFlatBuffer(FBB& fbb, const std::shared_ptr& type, + std::vector* out_children, DictionaryMemo* dictionary_memo, + Offset* offset) { + RETURN_NOT_OK(AppendChildFields(fbb, type, out_children, dictionary_memo)); + + const auto& union_type = static_cast(*type); + + flatbuf::UnionMode mode = union_type.mode == UnionMode::SPARSE + ? 
flatbuf::UnionMode_Sparse + : flatbuf::UnionMode_Dense; + + std::vector type_ids; + type_ids.reserve(union_type.type_codes.size()); + for (uint8_t code : union_type.type_codes) { + type_ids.push_back(code); + } + + auto fb_type_ids = fbb.CreateVector(type_ids); + + *offset = flatbuf::CreateUnion(fbb, mode, fb_type_ids).Union(); + return Status::OK(); +} + #define INT_TO_FB_CASE(BIT_WIDTH, IS_SIGNED) \ *out_type = flatbuf::Type_Int; \ *offset = IntToFlatbuffer(fbb, BIT_WIDTH, IS_SIGNED); \ break; +// TODO(wesm): Convert this to visitor pattern static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, std::vector* children, std::vector* layout, - flatbuf::Type* out_type, Offset* offset) { + flatbuf::Type* out_type, DictionaryMemo* dictionary_memo, Offset* offset) { + if (type->type == Type::DICTIONARY) { + // In this library, the dictionary "type" is a logical construct. Here we + // pass through to the value type, as we've already captured the index + // type in the DictionaryEncoding metadata in the parent field + const auto& dict_type = static_cast(*type); + return TypeToFlatbuffer(fbb, dict_type.dictionary()->type(), children, layout, + out_type, dictionary_memo, offset); + } + std::vector buffer_layout = type->GetBufferLayout(); for (const BufferDescr& descr : buffer_layout) { flatbuf::VectorType vector_type; @@ -217,10 +257,13 @@ static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, break; case Type::LIST: *out_type = flatbuf::Type_List; - return ListToFlatbuffer(fbb, type, children, offset); + return ListToFlatbuffer(fbb, type, children, dictionary_memo, offset); case Type::STRUCT: *out_type = flatbuf::Type_Struct_; - return StructToFlatbuffer(fbb, type, children, offset); + return StructToFlatbuffer(fbb, type, children, dictionary_memo, offset); + case Type::UNION: + *out_type = flatbuf::Type_Union; + return UnionToFlatBuffer(fbb, type, children, dictionary_memo, offset); default: *out_type = flatbuf::Type_NONE; // Make clang-tidy happy std::stringstream ss; @@ -230,35 +273,63 @@ static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, return Status::OK(); } -static Status FieldToFlatbuffer( - FBB& fbb, const std::shared_ptr& field, FieldOffset* offset) { +using DictionaryOffset = flatbuffers::Offset; + +static DictionaryOffset GetDictionaryEncoding( + FBB& fbb, const DictionaryType& type, DictionaryMemo* memo) { + int64_t dictionary_id = memo->GetId(type.dictionary()); + + // We assume that the dictionary index type (as an integer) has already been + // validated elsewhere, and can safely assume we are dealing with signed + // integers + const auto& fw_index_type = static_cast(*type.index_type()); + + auto index_type_offset = flatbuf::CreateInt(fbb, fw_index_type.bit_width(), true); + + // TODO(wesm): ordered dictionaries + return flatbuf::CreateDictionaryEncoding(fbb, dictionary_id, index_type_offset); +} + +static Status FieldToFlatbuffer(FBB& fbb, const std::shared_ptr& field, + DictionaryMemo* dictionary_memo, FieldOffset* offset) { auto fb_name = fbb.CreateString(field->name); flatbuf::Type type_enum; - Offset type_data; + Offset type_offset; Offset type_layout; std::vector children; std::vector layout; - RETURN_NOT_OK( - TypeToFlatbuffer(fbb, field->type, &children, &layout, &type_enum, &type_data)); + RETURN_NOT_OK(TypeToFlatbuffer( + fbb, field->type, &children, &layout, &type_enum, dictionary_memo, &type_offset)); auto fb_children = fbb.CreateVector(children); auto fb_layout = fbb.CreateVector(layout); + DictionaryOffset dictionary = 
0; + if (field->type->type == Type::DICTIONARY) { + dictionary = GetDictionaryEncoding( + fbb, static_cast(*field->type), dictionary_memo); + } + // TODO: produce the list of VectorTypes - *offset = flatbuf::CreateField(fbb, fb_name, field->nullable, type_enum, type_data, - field->dictionary, fb_children, fb_layout); + *offset = flatbuf::CreateField(fbb, fb_name, field->nullable, type_enum, type_offset, + dictionary, fb_children, fb_layout); return Status::OK(); } -Status FieldFromFlatbuffer(const flatbuf::Field* field, std::shared_ptr* out) { - std::shared_ptr type; +Status FieldFromFlatbufferDictionary( + const flatbuf::Field* field, std::shared_ptr* out) { + // Need an empty memo to pass down for constructing children + DictionaryMemo dummy_memo; + + // Any DictionaryEncoding set is ignored here + std::shared_ptr type; auto children = field->children(); std::vector> child_fields(children->size()); for (size_t i = 0; i < children->size(); ++i) { - RETURN_NOT_OK(FieldFromFlatbuffer(children->Get(i), &child_fields[i])); + RETURN_NOT_OK(FieldFromFlatbuffer(children->Get(i), dummy_memo, &child_fields[i])); } RETURN_NOT_OK( @@ -268,6 +339,39 @@ Status FieldFromFlatbuffer(const flatbuf::Field* field, std::shared_ptr* return Status::OK(); } +Status FieldFromFlatbuffer(const flatbuf::Field* field, + const DictionaryMemo& dictionary_memo, std::shared_ptr* out) { + std::shared_ptr type; + + const flatbuf::DictionaryEncoding* encoding = field->dictionary(); + + if (encoding == nullptr) { + // The field is not dictionary encoded. We must potentially visit its + // children to fully reconstruct the data type + auto children = field->children(); + std::vector> child_fields(children->size()); + for (size_t i = 0; i < children->size(); ++i) { + RETURN_NOT_OK( + FieldFromFlatbuffer(children->Get(i), dictionary_memo, &child_fields[i])); + } + RETURN_NOT_OK( + TypeFromFlatbuffer(field->type_type(), field->type(), child_fields, &type)); + } else { + // The field is dictionary encoded. The type of the dictionary values has + // been determined elsewhere, and is stored in the DictionaryMemo. Here we + // construct the logical DictionaryType object + + std::shared_ptr dictionary; + RETURN_NOT_OK(dictionary_memo.GetDictionary(encoding->id(), &dictionary)); + + std::shared_ptr index_type; + RETURN_NOT_OK(IntFromFlatbuffer(encoding->indexType(), &index_type)); + type = std::make_shared(index_type, dictionary); + } + *out = std::make_shared(field->name()->str(), type, field->nullable()); + return Status::OK(); +} + // Implement MessageBuilder // will return the endianness of the system we are running on @@ -281,13 +385,13 @@ flatbuf::Endianness endianness() { return bint.c[0] == 1 ? 
flatbuf::Endianness_Big : flatbuf::Endianness_Little; } -Status SchemaToFlatbuffer( - FBB& fbb, const Schema& schema, flatbuffers::Offset* out) { +Status SchemaToFlatbuffer(FBB& fbb, const Schema& schema, DictionaryMemo* dictionary_memo, + flatbuffers::Offset* out) { std::vector field_offsets; for (int i = 0; i < schema.num_fields(); ++i) { std::shared_ptr field = schema.field(i); FieldOffset offset; - RETURN_NOT_OK(FieldToFlatbuffer(fbb, field, &offset)); + RETURN_NOT_OK(FieldToFlatbuffer(fbb, field, dictionary_memo, &offset)); field_offsets.push_back(offset); } @@ -295,29 +399,63 @@ Status SchemaToFlatbuffer( return Status::OK(); } -Status MessageBuilder::SetSchema(const Schema& schema) { - flatbuffers::Offset fb_schema; - RETURN_NOT_OK(SchemaToFlatbuffer(fbb_, schema, &fb_schema)); +class MessageBuilder { + public: + Status SetSchema(const Schema& schema, DictionaryMemo* dictionary_memo) { + flatbuffers::Offset fb_schema; + RETURN_NOT_OK(SchemaToFlatbuffer(fbb_, schema, dictionary_memo, &fb_schema)); - header_type_ = flatbuf::MessageHeader_Schema; - header_ = fb_schema.Union(); - body_length_ = 0; - return Status::OK(); -} + header_type_ = flatbuf::MessageHeader_Schema; + header_ = fb_schema.Union(); + body_length_ = 0; + return Status::OK(); + } -Status MessageBuilder::SetRecordBatch(int32_t length, int64_t body_length, - const std::vector& nodes, - const std::vector& buffers) { - header_type_ = flatbuf::MessageHeader_RecordBatch; - header_ = flatbuf::CreateRecordBatch(fbb_, length, fbb_.CreateVectorOfStructs(nodes), - fbb_.CreateVectorOfStructs(buffers)) - .Union(); - body_length_ = body_length; + Status SetRecordBatch(int32_t length, int64_t body_length, + const std::vector& nodes, + const std::vector& buffers) { + header_type_ = flatbuf::MessageHeader_RecordBatch; + header_ = flatbuf::CreateRecordBatch(fbb_, length, fbb_.CreateVectorOfStructs(nodes), + fbb_.CreateVectorOfStructs(buffers)) + .Union(); + body_length_ = body_length; - return Status::OK(); + return Status::OK(); + } + + Status SetDictionary(int64_t id, int32_t length, int64_t body_length, + const std::vector& nodes, + const std::vector& buffers) { + header_type_ = flatbuf::MessageHeader_DictionaryBatch; + + auto record_batch = flatbuf::CreateRecordBatch(fbb_, length, + fbb_.CreateVectorOfStructs(nodes), fbb_.CreateVectorOfStructs(buffers)); + + header_ = flatbuf::CreateDictionaryBatch(fbb_, id, record_batch).Union(); + body_length_ = body_length; + return Status::OK(); + } + + Status Finish(); + + Status GetBuffer(std::shared_ptr* out); + + private: + flatbuf::MessageHeader header_type_; + flatbuffers::Offset header_; + int64_t body_length_; + flatbuffers::FlatBufferBuilder fbb_; +}; + +Status WriteSchemaMessage( + const Schema& schema, DictionaryMemo* dictionary_memo, std::shared_ptr* out) { + MessageBuilder message; + RETURN_NOT_OK(message.SetSchema(schema, dictionary_memo)); + RETURN_NOT_OK(message.Finish()); + return message.GetBuffer(out); } -Status WriteRecordBatchMetadata(int32_t length, int64_t body_length, +Status WriteRecordBatchMessage(int32_t length, int64_t body_length, const std::vector& nodes, const std::vector& buffers, std::shared_ptr* out) { MessageBuilder builder; @@ -326,6 +464,15 @@ Status WriteRecordBatchMetadata(int32_t length, int64_t body_length, return builder.GetBuffer(out); } +Status WriteDictionaryMessage(int64_t id, int32_t length, int64_t body_length, + const std::vector& nodes, + const std::vector& buffers, std::shared_ptr* out) { + MessageBuilder builder; + 
RETURN_NOT_OK(builder.SetDictionary(id, length, body_length, nodes, buffers)); + RETURN_NOT_OK(builder.Finish()); + return builder.GetBuffer(out); +} + Status MessageBuilder::Finish() { auto message = flatbuf::CreateMessage(fbb_, kMetadataVersion, header_type_, header_, body_length_); diff --git a/cpp/src/arrow/ipc/metadata-internal.h b/cpp/src/arrow/ipc/metadata-internal.h index d94a8abc99ab0..59afecbcbd27e 100644 --- a/cpp/src/arrow/ipc/metadata-internal.h +++ b/cpp/src/arrow/ipc/metadata-internal.h @@ -46,31 +46,34 @@ using Offset = flatbuffers::Offset; static constexpr flatbuf::MetadataVersion kMetadataVersion = flatbuf::MetadataVersion_V2; -Status FieldFromFlatbuffer(const flatbuf::Field* field, std::shared_ptr* out); +// Construct a field with type for a dictionary-encoded field. None of its +// children or children's descendents can be dictionary encoded +Status FieldFromFlatbufferDictionary( + const flatbuf::Field* field, std::shared_ptr* out); -Status SchemaToFlatbuffer( - FBB& fbb, const Schema& schema, flatbuffers::Offset* out); +// Construct a field for a non-dictionary-encoded field. Its children may be +// dictionary encoded +Status FieldFromFlatbuffer(const flatbuf::Field* field, + const DictionaryMemo& dictionary_memo, std::shared_ptr* out); -class MessageBuilder { - public: - Status SetSchema(const Schema& schema); +Status SchemaToFlatbuffer(FBB& fbb, const Schema& schema, DictionaryMemo* dictionary_memo, + flatbuffers::Offset* out); - Status SetRecordBatch(int32_t length, int64_t body_length, - const std::vector& nodes, - const std::vector& buffers); - - Status Finish(); - - Status GetBuffer(std::shared_ptr* out); - - private: - flatbuf::MessageHeader header_type_; - flatbuffers::Offset header_; - int64_t body_length_; - flatbuffers::FlatBufferBuilder fbb_; -}; +// Serialize arrow::Schema as a Flatbuffer +// +// \param[in] schema a Schema instance +// \param[inout] dictionary_memo class for tracking dictionaries and assigning +// dictionary ids +// \param[out] out the serialized arrow::Buffer +// \return Status outcome +Status WriteSchemaMessage( + const Schema& schema, DictionaryMemo* dictionary_memo, std::shared_ptr* out); + +Status WriteRecordBatchMessage(int32_t length, int64_t body_length, + const std::vector& nodes, + const std::vector& buffers, std::shared_ptr* out); -Status WriteRecordBatchMetadata(int32_t length, int64_t body_length, +Status WriteDictionaryMessage(int64_t id, int32_t length, int64_t body_length, const std::vector& nodes, const std::vector& buffers, std::shared_ptr* out); diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index a97965c40d608..2ba44ac618ce3 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -19,6 +19,7 @@ #include #include +#include #include #include "flatbuffers/flatbuffers.h" @@ -38,11 +39,60 @@ namespace flatbuf = org::apache::arrow::flatbuf; namespace ipc { -Status WriteSchema(const Schema& schema, std::shared_ptr* out) { - MessageBuilder message; - RETURN_NOT_OK(message.SetSchema(schema)); - RETURN_NOT_OK(message.Finish()); - return message.GetBuffer(out); +// ---------------------------------------------------------------------- +// Memoization data structure for handling shared dictionaries + +DictionaryMemo::DictionaryMemo() {} + +// Returns KeyError if dictionary not found +Status DictionaryMemo::GetDictionary( + int64_t id, std::shared_ptr* dictionary) const { + auto it = id_to_dictionary_.find(id); + if (it == id_to_dictionary_.end()) { + std::stringstream ss; + ss 
<< "Dictionary with id " << id << " not found"; + return Status::KeyError(ss.str()); + } + *dictionary = it->second; + return Status::OK(); +} + +int64_t DictionaryMemo::GetId(const std::shared_ptr& dictionary) { + intptr_t address = reinterpret_cast(dictionary.get()); + auto it = dictionary_to_id_.find(address); + if (it != dictionary_to_id_.end()) { + // Dictionary already observed, return the id + return it->second; + } else { + int64_t new_id = static_cast(dictionary_to_id_.size()) + 1; + dictionary_to_id_[address] = new_id; + id_to_dictionary_[new_id] = dictionary; + return new_id; + } +} + +bool DictionaryMemo::HasDictionary(const std::shared_ptr& dictionary) const { + intptr_t address = reinterpret_cast(dictionary.get()); + auto it = dictionary_to_id_.find(address); + return it != dictionary_to_id_.end(); +} + +bool DictionaryMemo::HasDictionaryId(int64_t id) const { + auto it = id_to_dictionary_.find(id); + return it != id_to_dictionary_.end(); +} + +Status DictionaryMemo::AddDictionary( + int64_t id, const std::shared_ptr& dictionary) { + if (HasDictionaryId(id)) { + std::stringstream ss; + ss << "Dictionary with id " << id << " already exists"; + return Status::KeyError(ss.str()); + } + intptr_t address = reinterpret_cast(dictionary.get()); + id_to_dictionary_[id] = dictionary; + dictionary_to_id_[address] = id; + return Status::OK(); } //---------------------------------------------------------------------- @@ -113,10 +163,35 @@ class SchemaMetadata::SchemaMetadataImpl { explicit SchemaMetadataImpl(const void* schema) : schema_(static_cast(schema)) {} - const flatbuf::Field* field(int i) const { return schema_->fields()->Get(i); } + const flatbuf::Field* get_field(int i) const { return schema_->fields()->Get(i); } int num_fields() const { return schema_->fields()->size(); } + Status VisitField(const flatbuf::Field* field, DictionaryTypeMap* id_to_field) const { + const flatbuf::DictionaryEncoding* dict_metadata = field->dictionary(); + if (dict_metadata == nullptr) { + // Field is not dictionary encoded. Visit children + auto children = field->children(); + for (flatbuffers::uoffset_t i = 0; i < children->size(); ++i) { + RETURN_NOT_OK(VisitField(children->Get(i), id_to_field)); + } + } else { + // Field is dictionary encoded. 
Construct the data type for the + // dictionary (no descendents can be dictionary encoded) + std::shared_ptr dictionary_field; + RETURN_NOT_OK(FieldFromFlatbufferDictionary(field, &dictionary_field)); + (*id_to_field)[dict_metadata->id()] = dictionary_field; + } + return Status::OK(); + } + + Status GetDictionaryTypes(DictionaryTypeMap* id_to_field) const { + for (int i = 0; i < num_fields(); ++i) { + RETURN_NOT_OK(VisitField(get_field(i), id_to_field)); + } + return Status::OK(); + } + private: const flatbuf::Schema* schema_; }; @@ -138,15 +213,16 @@ int SchemaMetadata::num_fields() const { return impl_->num_fields(); } -Status SchemaMetadata::GetField(int i, std::shared_ptr* out) const { - const flatbuf::Field* field = impl_->field(i); - return FieldFromFlatbuffer(field, out); +Status SchemaMetadata::GetDictionaryTypes(DictionaryTypeMap* id_to_field) const { + return impl_->GetDictionaryTypes(id_to_field); } -Status SchemaMetadata::GetSchema(std::shared_ptr* out) const { +Status SchemaMetadata::GetSchema( + const DictionaryMemo& dictionary_memo, std::shared_ptr* out) const { std::vector> fields(num_fields()); for (int i = 0; i < this->num_fields(); ++i) { - RETURN_NOT_OK(GetField(i, &fields[i])); + const flatbuf::Field* field = impl_->get_field(i); + RETURN_NOT_OK(FieldFromFlatbuffer(field, dictionary_memo, &fields[i])); } *out = std::make_shared(fields); return Status::OK(); @@ -173,28 +249,34 @@ class RecordBatchMetadata::RecordBatchMetadataImpl { int num_fields() const { return batch_->nodes()->size(); } + void set_message(const std::shared_ptr& message) { message_ = message; } + + void set_buffer(const std::shared_ptr& buffer) { buffer_ = buffer; } + private: const flatbuf::RecordBatch* batch_; const flatbuffers::Vector* nodes_; const flatbuffers::Vector* buffers_; + + // Possible parents, owns the flatbuffer data + std::shared_ptr message_; + std::shared_ptr buffer_; }; RecordBatchMetadata::RecordBatchMetadata(const std::shared_ptr& message) { - message_ = message; impl_.reset(new RecordBatchMetadataImpl(message->impl_->header())); + impl_->set_message(message); } -RecordBatchMetadata::RecordBatchMetadata( - const std::shared_ptr& buffer, int64_t offset) { - message_ = nullptr; - buffer_ = buffer; - - const flatbuf::RecordBatch* metadata = - flatbuffers::GetRoot(buffer->data() + offset); - - // TODO(wesm): validate table +RecordBatchMetadata::RecordBatchMetadata(const void* header) { + impl_.reset(new RecordBatchMetadataImpl(header)); +} - impl_.reset(new RecordBatchMetadataImpl(metadata)); +RecordBatchMetadata::RecordBatchMetadata( + const std::shared_ptr& buffer, int64_t offset) + : RecordBatchMetadata(buffer->data() + offset) { + // Preserve ownership + impl_->set_buffer(buffer); } RecordBatchMetadata::~RecordBatchMetadata() {} @@ -232,5 +314,64 @@ int RecordBatchMetadata::num_fields() const { return impl_->num_fields(); } +// ---------------------------------------------------------------------- +// DictionaryBatchMetadata + +class DictionaryBatchMetadata::DictionaryBatchMetadataImpl { + public: + explicit DictionaryBatchMetadataImpl(const void* dictionary) + : metadata_(static_cast(dictionary)) { + record_batch_.reset(new RecordBatchMetadata(metadata_->data())); + } + + int64_t id() const { return metadata_->id(); } + const RecordBatchMetadata& record_batch() const { return *record_batch_; } + + void set_message(const std::shared_ptr& message) { message_ = message; } + + private: + const flatbuf::DictionaryBatch* metadata_; + + std::unique_ptr record_batch_; + + // Parent, owns 
the flatbuffer data + std::shared_ptr message_; +}; + +DictionaryBatchMetadata::DictionaryBatchMetadata( + const std::shared_ptr& message) { + impl_.reset(new DictionaryBatchMetadataImpl(message->impl_->header())); + impl_->set_message(message); +} + +DictionaryBatchMetadata::~DictionaryBatchMetadata() {} + +int64_t DictionaryBatchMetadata::id() const { + return impl_->id(); +} + +const RecordBatchMetadata& DictionaryBatchMetadata::record_batch() const { + return impl_->record_batch(); +} + +// ---------------------------------------------------------------------- +// Conveniences + +Status ReadMessage(int64_t offset, int32_t metadata_length, + io::ReadableFileInterface* file, std::shared_ptr* message) { + std::shared_ptr buffer; + RETURN_NOT_OK(file->ReadAt(offset, metadata_length, &buffer)); + + int32_t flatbuffer_size = *reinterpret_cast(buffer->data()); + + if (flatbuffer_size + static_cast(sizeof(int32_t)) > metadata_length) { + std::stringstream ss; + ss << "flatbuffer size " << metadata_length << " invalid. File offset: " << offset + << ", metadata length: " << metadata_length; + return Status::Invalid(ss.str()); + } + return Message::Open(buffer, 4, message); +} + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h index 81e3dbdf6c4c0..0091067c3225a 100644 --- a/cpp/src/arrow/ipc/metadata.h +++ b/cpp/src/arrow/ipc/metadata.h @@ -22,13 +22,17 @@ #include #include +#include #include +#include "arrow/util/macros.h" #include "arrow/util/visibility.h" namespace arrow { +class Array; class Buffer; +struct DataType; struct Field; class Schema; class Status; @@ -36,6 +40,7 @@ class Status; namespace io { class OutputStream; +class ReadableFileInterface; } // namespace io @@ -47,9 +52,38 @@ struct MetadataVersion { //---------------------------------------------------------------------- -// Serialize arrow::Schema as a Flatbuffer -ARROW_EXPORT -Status WriteSchema(const Schema& schema, std::shared_ptr* out); +using DictionaryMap = std::unordered_map>; +using DictionaryTypeMap = std::unordered_map>; + +// Memoization data structure for handling shared dictionaries +class DictionaryMemo { + public: + DictionaryMemo(); + + // Returns KeyError if dictionary not found + Status GetDictionary(int64_t id, std::shared_ptr* dictionary) const; + + int64_t GetId(const std::shared_ptr& dictionary); + + bool HasDictionary(const std::shared_ptr& dictionary) const; + bool HasDictionaryId(int64_t id) const; + + // Add a dictionary to the memo with a particular id. Returns KeyError if + // that dictionary already exists + Status AddDictionary(int64_t id, const std::shared_ptr& dictionary); + + const DictionaryMap& id_to_dictionary() const { return id_to_dictionary_; } + + private: + // Dictionary memory addresses, to track whether a dictionary has been seen + // before + std::unordered_map dictionary_to_id_; + + // Map of dictionary id to dictionary array + DictionaryMap id_to_dictionary_; + + DISALLOW_COPY_AND_ASSIGN(DictionaryMemo); +}; // Read interface classes. We do not fully deserialize the flatbuffers so that // individual fields metadata can be retrieved from very large schema without @@ -69,12 +103,15 @@ class ARROW_EXPORT SchemaMetadata { int num_fields() const; - // Construct an arrow::Field for the i-th value in the metadata - Status GetField(int i, std::shared_ptr* out) const; + // Retrieve a list of all the dictionary ids and types required by the schema for + // reconstruction. 
The presumption is that these will be loaded either from + // the stream or file (or they may already be somewhere else in memory) + Status GetDictionaryTypes(DictionaryTypeMap* id_to_field) const; // Construct a complete Schema from the message. May be expensive for very // large schemas if you are only interested in a few fields - Status GetSchema(std::shared_ptr* out) const; + Status GetSchema( + const DictionaryMemo& dictionary_memo, std::shared_ptr* out) const; private: // Parent, owns the flatbuffer data @@ -82,6 +119,8 @@ class ARROW_EXPORT SchemaMetadata { class SchemaMetadataImpl; std::unique_ptr impl_; + + DISALLOW_COPY_AND_ASSIGN(SchemaMetadata); }; // Field metadata @@ -99,8 +138,10 @@ struct ARROW_EXPORT BufferMetadata { // Container for serialized record batch metadata contained in an IPC message class ARROW_EXPORT RecordBatchMetadata { public: + // Instantiate from opaque pointer. Memory ownership must be preserved + // elsewhere (e.g. in a dictionary batch) + explicit RecordBatchMetadata(const void* header); explicit RecordBatchMetadata(const std::shared_ptr& message); - RecordBatchMetadata(const std::shared_ptr& message, int64_t offset); ~RecordBatchMetadata(); @@ -113,18 +154,25 @@ class ARROW_EXPORT RecordBatchMetadata { int num_fields() const; private: - // Parent, owns the flatbuffer data - std::shared_ptr message_; - std::shared_ptr buffer_; - class RecordBatchMetadataImpl; std::unique_ptr impl_; + + DISALLOW_COPY_AND_ASSIGN(RecordBatchMetadata); }; class ARROW_EXPORT DictionaryBatchMetadata { public: + explicit DictionaryBatchMetadata(const std::shared_ptr& message); + ~DictionaryBatchMetadata(); + int64_t id() const; - std::unique_ptr data() const; + const RecordBatchMetadata& record_batch() const; + + private: + class DictionaryBatchMetadataImpl; + std::unique_ptr impl_; + + DISALLOW_COPY_AND_ASSIGN(DictionaryBatchMetadata); }; class ARROW_EXPORT Message { @@ -141,24 +189,31 @@ class ARROW_EXPORT Message { private: Message(const std::shared_ptr& buffer, int64_t offset); + friend class DictionaryBatchMetadata; friend class RecordBatchMetadata; friend class SchemaMetadata; // Hide serialization details from user API class MessageImpl; std::unique_ptr impl_; -}; -struct ARROW_EXPORT FileBlock { - FileBlock() {} - FileBlock(int64_t offset, int32_t metadata_length, int64_t body_length) - : offset(offset), metadata_length(metadata_length), body_length(body_length) {} - - int64_t offset; - int32_t metadata_length; - int64_t body_length; + DISALLOW_COPY_AND_ASSIGN(Message); }; +/// Read a length-prefixed message flatbuffer starting at the indicated file +/// offset +/// +/// The metadata_length includes at least the length prefix and the flatbuffer +/// +/// \param[in] offset the position in the file where the message starts. 
The +/// first 4 bytes after the offset are the message length +/// \param[in] metadata_length the total number of bytes to read from file +/// \param[in] file the seekable file interface to read from +/// \param[out] message the message read +/// \return Status success or failure +Status ReadMessage(int64_t offset, int32_t metadata_length, + io::ReadableFileInterface* file, std::shared_ptr* message); + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/stream.cc b/cpp/src/arrow/ipc/stream.cc index 72eb13465afcc..7f5c9932330be 100644 --- a/cpp/src/arrow/ipc/stream.cc +++ b/cpp/src/arrow/ipc/stream.cc @@ -20,17 +20,20 @@ #include #include #include +#include #include #include "arrow/buffer.h" #include "arrow/io/interfaces.h" #include "arrow/io/memory.h" #include "arrow/ipc/adapter.h" +#include "arrow/ipc/metadata-internal.h" #include "arrow/ipc/metadata.h" #include "arrow/ipc/util.h" #include "arrow/memory_pool.h" #include "arrow/schema.h" #include "arrow/status.h" +#include "arrow/table.h" #include "arrow/util/logging.h" namespace arrow { @@ -39,11 +42,10 @@ namespace ipc { // ---------------------------------------------------------------------- // Stream writer implementation -StreamWriter::~StreamWriter() {} - StreamWriter::StreamWriter(io::OutputStream* sink, const std::shared_ptr& schema) : sink_(sink), schema_(schema), + dictionary_memo_(std::make_shared()), pool_(default_memory_pool()), position_(-1), started_(false) {} @@ -107,7 +109,7 @@ Status StreamWriter::Open(io::OutputStream* sink, const std::shared_ptr& Status StreamWriter::Start() { std::shared_ptr schema_fb; - RETURN_NOT_OK(WriteSchema(*schema_, &schema_fb)); + RETURN_NOT_OK(WriteSchemaMessage(*schema_, dictionary_memo_.get(), &schema_fb)); int32_t flatbuffer_size = schema_fb->size(); RETURN_NOT_OK( @@ -115,14 +117,41 @@ Status StreamWriter::Start() { // Write the flatbuffer RETURN_NOT_OK(Write(schema_fb->data(), flatbuffer_size)); + + // If there are any dictionaries, write them as the next messages + RETURN_NOT_OK(WriteDictionaries()); + started_ = true; return Status::OK(); } Status StreamWriter::WriteRecordBatch(const RecordBatch& batch) { - // Pass FileBlock, but results not used - FileBlock dummy_block; - return WriteRecordBatch(batch, &dummy_block); + // Push an empty FileBlock. Can be written in the footer later + record_batches_.emplace_back(0, 0, 0); + return WriteRecordBatch(batch, &record_batches_[record_batches_.size() - 1]); +} + +Status StreamWriter::WriteDictionaries() { + const DictionaryMap& id_to_dictionary = dictionary_memo_->id_to_dictionary(); + + dictionaries_.resize(id_to_dictionary.size()); + + // TODO(wesm): does sorting by id yield any benefit? 
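+  // Each memoized dictionary becomes its own DictionaryBatch message, written
+  // before any record batches; the FileBlock (offset, metadata length, body
+  // length) recorded for each one is what FileWriter::Close later serializes
+  // into the file footer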
+ int dict_index = 0; + for (const auto& entry : id_to_dictionary) { + FileBlock* block = &dictionaries_[dict_index++]; + + block->offset = position_; + + // Frame of reference in file format is 0, see ARROW-384 + const int64_t buffer_start_offset = 0; + RETURN_NOT_OK(WriteDictionary(entry.first, entry.second, buffer_start_offset, sink_, + &block->metadata_length, &block->body_length, pool_)); + RETURN_NOT_OK(UpdatePosition()); + DCHECK(position_ % 8 == 0) << "WriteDictionary did not perform aligned writes"; + } + + return Status::OK(); } Status StreamWriter::Close() { @@ -134,81 +163,147 @@ Status StreamWriter::Close() { // ---------------------------------------------------------------------- // StreamReader implementation -StreamReader::StreamReader(const std::shared_ptr& stream) - : stream_(stream), schema_(nullptr) {} - -StreamReader::~StreamReader() {} - -Status StreamReader::Open(const std::shared_ptr& stream, - std::shared_ptr* reader) { - // Private ctor - *reader = std::shared_ptr(new StreamReader(stream)); - return (*reader)->ReadSchema(); +static inline std::string message_type_name(Message::Type type) { + switch (type) { + case Message::SCHEMA: + return "schema"; + case Message::RECORD_BATCH: + return "record batch"; + case Message::DICTIONARY_BATCH: + return "dictionary"; + default: + break; + } + return "unknown"; } -Status StreamReader::ReadSchema() { - std::shared_ptr message; - RETURN_NOT_OK(ReadNextMessage(&message)); +class StreamReader::StreamReaderImpl { + public: + StreamReaderImpl() {} + ~StreamReaderImpl() {} - if (message->type() != Message::SCHEMA) { - return Status::IOError("First message was not schema type"); + Status Open(const std::shared_ptr& stream) { + stream_ = stream; + return ReadSchema(); } - SchemaMetadata schema_meta(message); + Status ReadNextMessage(Message::Type expected_type, std::shared_ptr* message) { + std::shared_ptr buffer; + RETURN_NOT_OK(stream_->Read(sizeof(int32_t), &buffer)); - // TODO(wesm): If the schema contains dictionaries, we must read all the - // dictionaries from the stream before constructing the final Schema - return schema_meta.GetSchema(&schema_); -} + if (buffer->size() != sizeof(int32_t)) { + *message = nullptr; + return Status::OK(); + } + + int32_t message_length = *reinterpret_cast(buffer->data()); + + RETURN_NOT_OK(stream_->Read(message_length, &buffer)); + if (buffer->size() != message_length) { + return Status::IOError("Unexpected end of stream trying to read message"); + } -Status StreamReader::ReadNextMessage(std::shared_ptr* message) { - std::shared_ptr buffer; - RETURN_NOT_OK(stream_->Read(sizeof(int32_t), &buffer)); + RETURN_NOT_OK(Message::Open(buffer, 0, message)); - if (buffer->size() != sizeof(int32_t)) { - *message = nullptr; + if ((*message)->type() != expected_type) { + std::stringstream ss; + ss << "Message not expected type: " << message_type_name(expected_type) + << ", was: " << (*message)->type(); + return Status::IOError(ss.str()); + } return Status::OK(); } - int32_t message_length = *reinterpret_cast(buffer->data()); + Status ReadExact(int64_t size, std::shared_ptr* buffer) { + RETURN_NOT_OK(stream_->Read(size, buffer)); - RETURN_NOT_OK(stream_->Read(message_length, &buffer)); - if (buffer->size() != message_length) { - return Status::IOError("Unexpected end of stream trying to read message"); + if ((*buffer)->size() < size) { + return Status::IOError("Unexpected EOS when reading buffer"); + } + return Status::OK(); } - return Message::Open(buffer, 0, message); -} -std::shared_ptr 
StreamReader::schema() const { - return schema_; -} + Status ReadNextDictionary() { + std::shared_ptr message; + RETURN_NOT_OK(ReadNextMessage(Message::DICTIONARY_BATCH, &message)); -Status StreamReader::GetNextRecordBatch(std::shared_ptr* batch) { - std::shared_ptr message; - RETURN_NOT_OK(ReadNextMessage(&message)); + DictionaryBatchMetadata metadata(message); - if (message == nullptr) { - // End of stream - *batch = nullptr; - return Status::OK(); + std::shared_ptr batch_body; + RETURN_NOT_OK(ReadExact(message->body_length(), &batch_body)) + io::BufferReader reader(batch_body); + + std::shared_ptr dictionary; + RETURN_NOT_OK(ReadDictionary(metadata, dictionary_types_, &reader, &dictionary)); + return dictionary_memo_.AddDictionary(metadata.id(), dictionary); } - if (message->type() != Message::RECORD_BATCH) { - return Status::IOError("Metadata not record batch"); + Status ReadSchema() { + std::shared_ptr message; + RETURN_NOT_OK(ReadNextMessage(Message::SCHEMA, &message)); + + SchemaMetadata schema_meta(message); + RETURN_NOT_OK(schema_meta.GetDictionaryTypes(&dictionary_types_)); + + // TODO(wesm): In future, we may want to reconcile the ids in the stream with + // those found in the schema + int num_dictionaries = static_cast(dictionary_types_.size()); + for (int i = 0; i < num_dictionaries; ++i) { + RETURN_NOT_OK(ReadNextDictionary()); + } + + return schema_meta.GetSchema(dictionary_memo_, &schema_); } - auto batch_metadata = std::make_shared(message); + Status GetNextRecordBatch(std::shared_ptr* batch) { + std::shared_ptr message; + RETURN_NOT_OK(ReadNextMessage(Message::RECORD_BATCH, &message)); + + if (message == nullptr) { + // End of stream + *batch = nullptr; + return Status::OK(); + } - std::shared_ptr batch_body; - RETURN_NOT_OK(stream_->Read(message->body_length(), &batch_body)); + RecordBatchMetadata batch_metadata(message); - if (batch_body->size() < message->body_length()) { - return Status::IOError("Unexpected EOS when reading message body"); + std::shared_ptr batch_body; + RETURN_NOT_OK(ReadExact(message->body_length(), &batch_body)); + io::BufferReader reader(batch_body); + return ReadRecordBatch(batch_metadata, schema_, &reader, batch); } - io::BufferReader reader(batch_body); + std::shared_ptr schema() const { return schema_; } + + private: + // dictionary_id -> type + DictionaryTypeMap dictionary_types_; + + DictionaryMemo dictionary_memo_; + + std::shared_ptr stream_; + std::shared_ptr schema_; +}; + +StreamReader::StreamReader() { + impl_.reset(new StreamReaderImpl()); +} + +StreamReader::~StreamReader() {} + +Status StreamReader::Open(const std::shared_ptr& stream, + std::shared_ptr* reader) { + // Private ctor + *reader = std::shared_ptr(new StreamReader()); + return (*reader)->impl_->Open(stream); +} + +std::shared_ptr StreamReader::schema() const { + return impl_->schema(); +} - return ReadRecordBatch(batch_metadata, schema_, &reader, batch); +Status StreamReader::GetNextRecordBatch(std::shared_ptr* batch) { + return impl_->GetNextRecordBatch(batch); } } // namespace ipc diff --git a/cpp/src/arrow/ipc/stream.h b/cpp/src/arrow/ipc/stream.h index 12414fa2ca0c7..1c3f65e49af32 100644 --- a/cpp/src/arrow/ipc/stream.h +++ b/cpp/src/arrow/ipc/stream.h @@ -22,7 +22,9 @@ #include #include +#include +#include "arrow/ipc/metadata.h" #include "arrow/util/visibility.h" namespace arrow { @@ -44,12 +46,19 @@ class OutputStream; namespace ipc { -struct FileBlock; -class Message; +struct ARROW_EXPORT FileBlock { + FileBlock() {} + FileBlock(int64_t offset, int32_t 
metadata_length, int64_t body_length) + : offset(offset), metadata_length(metadata_length), body_length(body_length) {} + + int64_t offset; + int32_t metadata_length; + int64_t body_length; +}; class ARROW_EXPORT StreamWriter { public: - virtual ~StreamWriter(); + virtual ~StreamWriter() = default; static Status Open(io::OutputStream* sink, const std::shared_ptr& schema, std::shared_ptr* out); @@ -72,6 +81,8 @@ class ARROW_EXPORT StreamWriter { Status CheckStarted(); Status UpdatePosition(); + Status WriteDictionaries(); + Status WriteRecordBatch(const RecordBatch& batch, FileBlock* block); // Adds padding bytes if necessary to ensure all memory blocks are written on @@ -87,10 +98,17 @@ class ARROW_EXPORT StreamWriter { io::OutputStream* sink_; std::shared_ptr schema_; + // When writing out the schema, we keep track of all the dictionaries we + // encounter, as they must be written out first in the stream + std::shared_ptr dictionary_memo_; + MemoryPool* pool_; int64_t position_; bool started_; + + std::vector dictionaries_; + std::vector record_batches_; }; class ARROW_EXPORT StreamReader { @@ -107,14 +125,10 @@ class ARROW_EXPORT StreamReader { Status GetNextRecordBatch(std::shared_ptr* batch); private: - explicit StreamReader(const std::shared_ptr& stream); - - Status ReadSchema(); + StreamReader(); - Status ReadNextMessage(std::shared_ptr* message); - - std::shared_ptr stream_; - std::shared_ptr schema_; + class ARROW_NO_EXPORT StreamReaderImpl; + std::unique_ptr impl_; }; } // namespace ipc diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index b4930c4555d44..07f786c4d1d77 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -345,6 +345,86 @@ Status MakeUnion(std::shared_ptr* out) { return Status::OK(); } +Status MakeDictionary(std::shared_ptr* out) { + const int32_t length = 6; + + std::vector is_valid = {true, true, false, true, true, true}; + std::shared_ptr dict1, dict2; + + std::vector dict1_values = {"foo", "bar", "baz"}; + std::vector dict2_values = {"foo", "bar", "baz", "qux"}; + + ArrayFromVector(dict1_values, &dict1); + ArrayFromVector(dict2_values, &dict2); + + auto f0_type = arrow::dictionary(arrow::int32(), dict1); + auto f1_type = arrow::dictionary(arrow::int8(), dict1); + auto f2_type = arrow::dictionary(arrow::int32(), dict2); + + std::shared_ptr indices0, indices1, indices2; + std::vector indices0_values = {1, 2, -1, 0, 2, 0}; + std::vector indices1_values = {0, 0, 2, 2, 1, 1}; + std::vector indices2_values = {3, 0, 2, 1, 0, 2}; + + ArrayFromVector(is_valid, indices0_values, &indices0); + ArrayFromVector(is_valid, indices1_values, &indices1); + ArrayFromVector(is_valid, indices2_values, &indices2); + + auto a0 = std::make_shared(f0_type, indices0); + auto a1 = std::make_shared(f1_type, indices1); + auto a2 = std::make_shared(f2_type, indices2); + + // List of dictionary-encoded string + auto f3_type = list(f1_type); + + std::vector list_offsets = {0, 0, 2, 2, 5, 6, 9}; + std::shared_ptr offsets, indices3; + ArrayFromVector( + std::vector(list_offsets.size(), true), list_offsets, &offsets); + + std::vector indices3_values = {0, 1, 2, 0, 1, 2, 0, 1, 2}; + std::vector is_valid3(9, true); + ArrayFromVector(is_valid3, indices3_values, &indices3); + + std::shared_ptr null_bitmap; + RETURN_NOT_OK(test::GetBitmapFromBoolVector(is_valid, &null_bitmap)); + + std::shared_ptr a3 = std::make_shared(f3_type, length, + std::static_pointer_cast(offsets)->data(), + std::make_shared(f1_type, indices3), null_bitmap, 1); + + 
// Dictionary-encoded list of integer + auto f4_value_type = list(int8()); + + std::shared_ptr offsets4, values4, indices4; + + std::vector list_offsets4 = {0, 2, 2, 3}; + ArrayFromVector( + std::vector(4, true), list_offsets4, &offsets4); + + std::vector list_values4 = {0, 1, 2}; + ArrayFromVector(std::vector(3, true), list_values4, &values4); + + auto dict3 = std::make_shared(f4_value_type, 3, + std::static_pointer_cast(offsets4)->data(), values4); + + std::vector indices4_values = {0, 1, 2, 0, 1, 2}; + ArrayFromVector(is_valid, indices4_values, &indices4); + + auto f4_type = dictionary(int8(), dict3); + auto a4 = std::make_shared(f4_type, indices4); + + // construct batch + std::shared_ptr schema(new Schema({field("dict1", f0_type), + field("sparse", f1_type), field("dense", f2_type), + field("list of encoded string", f3_type), field("encoded list", f4_type)})); + + std::vector> arrays = {a0, a1, a2, a3, a4}; + + out->reset(new RecordBatch(schema, length, arrays)); + return Status::OK(); +} + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index a1c2b79950d59..b97b4657c361c 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -29,7 +29,7 @@ namespace arrow { bool Field::Equals(const Field& other) const { return (this == &other) || (this->name == other.name && this->nullable == other.nullable && - this->dictionary == dictionary && this->type->Equals(*other.type.get())); + this->type->Equals(*other.type.get())); } bool Field::Equals(const std::shared_ptr& other) const { @@ -234,8 +234,8 @@ std::shared_ptr dictionary(const std::shared_ptr& index_type } std::shared_ptr field( - const std::string& name, const TypePtr& type, bool nullable, int64_t dictionary) { - return std::make_shared(name, type, nullable, dictionary); + const std::string& name, const TypePtr& type, bool nullable) { + return std::make_shared(name, type, nullable); } static const BufferDescr kValidityBuffer(BufferType::VALIDITY, 1); diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 927b8a44fe12f..b15aa277af201 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -114,6 +114,8 @@ class BufferDescr { class TypeVisitor { public: + virtual ~TypeVisitor() = default; + virtual Status Visit(const NullType& type) = 0; virtual Status Visit(const BooleanType& type) = 0; virtual Status Visit(const Int8Type& type) = 0; @@ -205,13 +207,9 @@ struct ARROW_EXPORT Field { // Fields can be nullable bool nullable; - // optional dictionary id if the field is dictionary encoded - // 0 means it's not dictionary encoded - int64_t dictionary; - Field(const std::string& name, const std::shared_ptr& type, - bool nullable = true, int64_t dictionary = 0) - : name(name), type(type), nullable(nullable), dictionary(dictionary) {} + bool nullable = true) + : name(name), type(type), nullable(nullable) {} bool operator==(const Field& other) const { return this->Equals(other); } bool operator!=(const Field& other) const { return !this->Equals(other); } @@ -556,8 +554,8 @@ std::shared_ptr ARROW_EXPORT union_( std::shared_ptr ARROW_EXPORT dictionary( const std::shared_ptr& index_type, const std::shared_ptr& values); -std::shared_ptr ARROW_EXPORT field(const std::string& name, - const std::shared_ptr& type, bool nullable = true, int64_t dictionary = 0); +std::shared_ptr ARROW_EXPORT field( + const std::string& name, const std::shared_ptr& type, bool nullable = true); // ---------------------------------------------------------------------- // diff --git 
a/python/pyarrow/includes/libarrow_ipc.pxd b/python/pyarrow/includes/libarrow_ipc.pxd index 5ab98152add49..afc7dbd36e5f0 100644 --- a/python/pyarrow/includes/libarrow_ipc.pxd +++ b/python/pyarrow/includes/libarrow_ipc.pxd @@ -63,7 +63,6 @@ cdef extern from "arrow/ipc/file.h" namespace "arrow::ipc" nogil: shared_ptr[CSchema] schema() - int num_dictionaries() int num_record_batches() CStatus GetRecordBatch(int i, shared_ptr[CRecordBatch]* batch) diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index 89ce6e785c02b..4acef212b4dce 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -995,11 +995,6 @@ cdef class _FileReader: else: check_status(CFileReader.Open(reader, &self.reader)) - property num_dictionaries: - - def __get__(self): - return self.reader.get().num_dictionaries() - property num_record_batches: def __get__(self): From 89dc55789b895653ba8184f462c88588928aee15 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sat, 25 Feb 2017 15:29:30 -0500 Subject: [PATCH 0340/1644] ARROW-580: C++: Also provide jemalloc_X targets if only a static or shared version is found Author: Uwe L. Korn Closes #349 from xhochy/ARROW-580 and squashes the following commits: 6cdeef2 [Uwe L. Korn] ARROW-580: C++: Also provide jemalloc_X targets if only a static or shared version is found --- cpp/CMakeLists.txt | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index b77f8c79fa024..06a18925c0d91 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -369,11 +369,19 @@ function(ADD_THIRDPARTY_LIB LIB_NAME) add_library(${LIB_NAME} STATIC IMPORTED) set_target_properties(${LIB_NAME} PROPERTIES IMPORTED_LOCATION "${ARG_STATIC_LIB}") + SET(AUG_LIB_NAME "${LIB_NAME}_static") + add_library(${AUG_LIB_NAME} STATIC IMPORTED) + set_target_properties(${AUG_LIB_NAME} + PROPERTIES IMPORTED_LOCATION "${ARG_STATIC_LIB}") message("Added static library dependency ${LIB_NAME}: ${ARG_STATIC_LIB}") elseif(ARG_SHARED_LIB) add_library(${LIB_NAME} SHARED IMPORTED) set_target_properties(${LIB_NAME} PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}") + SET(AUG_LIB_NAME "${LIB_NAME}_shared") + add_library(${AUG_LIB_NAME} SHARED IMPORTED) + set_target_properties(${AUG_LIB_NAME} + PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}") message("Added shared library dependency ${LIB_NAME}: ${ARG_SHARED_LIB}") else() message(FATAL_ERROR "No static or shared library provided for ${LIB_NAME}") From 8afe92c6cb966d7a3fa5fe30a24bb10be49afc06 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sat, 25 Feb 2017 15:51:03 -0500 Subject: [PATCH 0341/1644] ARROW-578: [C++] Add -DARROW_CXXFLAGS=... option to make CMake more consistent I've had issues on CMake 2.8.x with `-DCMAKE_CXX_FLAGS=$MY_CXXFLAGS` not passing on the flags to the compiler. But it seems to work properly in our Travis CI setup, so go figure. Some Google searches seem to confirm this is a known issue, and having a specific "user flags" option is a way around it. We just did the same thing in parquet-cpp. Author: Wes McKinney Closes #348 from wesm/ARROW-578 and squashes the following commits: 1103bed [Wes McKinney] Use ARROW_CXXFLAGS in Travis CI 086d643 [Wes McKinney] Add -DARROW_CXXFLAGS=... 
option to make CMake behavior more consistent across versions --- ci/travis_before_script_cpp.sh | 4 ++-- cpp/CMakeLists.txt | 5 ++++- 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/ci/travis_before_script_cpp.sh b/ci/travis_before_script_cpp.sh index 94a889cff1a78..feacf8f8e361a 100755 --- a/ci/travis_before_script_cpp.sh +++ b/ci/travis_before_script_cpp.sh @@ -36,11 +36,11 @@ CMAKE_COMMON_FLAGS="\ if [ $TRAVIS_OS_NAME == "linux" ]; then cmake -DARROW_TEST_MEMCHECK=on \ $CMAKE_COMMON_FLAGS \ - -DCMAKE_CXX_FLAGS="-Werror" \ + -DARROW_CXXFLAGS=-Werror \ $CPP_DIR else cmake $CMAKE_COMMON_FLAGS \ - -DCMAKE_CXX_FLAGS="-Werror" \ + -DARROW_CXXFLAGS=-Werror \ $CPP_DIR fi diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 06a18925c0d91..be3d4b98cf77f 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -57,6 +57,9 @@ endif(CCACHE_FOUND) # Top level cmake dir if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") + set(ARROW_CXXFLAGS "" CACHE STRING + "Compiler flags to append when compiling Arrow") + option(ARROW_BUILD_STATIC "Build the libarrow static libraries" ON) @@ -120,7 +123,7 @@ endif() include(SetupCxxFlags) # Add common flags -set(CMAKE_CXX_FLAGS "${CXX_COMMON_FLAGS} ${CMAKE_CXX_FLAGS}") +set(CMAKE_CXX_FLAGS "${ARROW_CXXFLAGS} ${CXX_COMMON_FLAGS} ${CMAKE_CXX_FLAGS}") # Determine compiler version include(CompilerInfo) From ef3b6b34482c36615af5064f474363126e755a18 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 26 Feb 2017 18:25:03 -0500 Subject: [PATCH 0342/1644] ARROW-451: [C++] Implement DataType::Equals as TypeVisitor. Add default implementations for TypeVisitor, ArrayVisitor methods This patch also resolves ARROW-568. Added tests for TimeType, TimestampType, which were not having their `unit` metadata compared due to an oversight. Author: Wes McKinney Closes #350 from wesm/ARROW-451 and squashes the following commits: 97e75d8 [Wes McKinney] Export ArrayVisitor, TypeVisitor symbols a3332be [Wes McKinney] Typo 635e74d [Wes McKinney] Implement DataType::Equals as TypeVisitor, compare child metadata. 
Add default implementations for TypeVisitor, ArrayVisitor methods --- cpp/src/arrow/CMakeLists.txt | 2 +- cpp/src/arrow/array.cc | 36 ++++++ cpp/src/arrow/array.h | 50 ++++---- cpp/src/arrow/compare.cc | 108 +++++++++++++++++- cpp/src/arrow/compare.h | 5 + cpp/src/arrow/ipc/adapter.cc | 20 ---- cpp/src/arrow/ipc/json-internal.cc | 30 ----- .../arrow/{schema-test.cc => type-test.cc} | 34 +++++- cpp/src/arrow/type.cc | 69 ++++++++--- cpp/src/arrow/type.h | 64 +++++------ 10 files changed, 277 insertions(+), 141 deletions(-) rename cpp/src/arrow/{schema-test.cc => type-test.cc} (81%) diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index 824ced1a51eb9..d1efa021a496d 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -58,8 +58,8 @@ ADD_ARROW_TEST(buffer-test) ADD_ARROW_TEST(column-test) ADD_ARROW_TEST(memory_pool-test) ADD_ARROW_TEST(pretty_print-test) -ADD_ARROW_TEST(schema-test) ADD_ARROW_TEST(status-test) +ADD_ARROW_TEST(type-test) ADD_ARROW_TEST(table-test) ADD_ARROW_BENCHMARK(builder-benchmark) diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index 81678e354a608..eb4c210930fb2 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -503,4 +503,40 @@ Status MakePrimitiveArray(const std::shared_ptr& type, int32_t length, #endif } +// ---------------------------------------------------------------------- +// Default implementations of ArrayVisitor methods + +#define ARRAY_VISITOR_DEFAULT(ARRAY_CLASS) \ + Status ArrayVisitor::Visit(const ARRAY_CLASS& array) { \ + return Status::NotImplemented(array.type()->ToString()); \ + } + +ARRAY_VISITOR_DEFAULT(NullArray); +ARRAY_VISITOR_DEFAULT(BooleanArray); +ARRAY_VISITOR_DEFAULT(Int8Array); +ARRAY_VISITOR_DEFAULT(Int16Array); +ARRAY_VISITOR_DEFAULT(Int32Array); +ARRAY_VISITOR_DEFAULT(Int64Array); +ARRAY_VISITOR_DEFAULT(UInt8Array); +ARRAY_VISITOR_DEFAULT(UInt16Array); +ARRAY_VISITOR_DEFAULT(UInt32Array); +ARRAY_VISITOR_DEFAULT(UInt64Array); +ARRAY_VISITOR_DEFAULT(HalfFloatArray); +ARRAY_VISITOR_DEFAULT(FloatArray); +ARRAY_VISITOR_DEFAULT(DoubleArray); +ARRAY_VISITOR_DEFAULT(StringArray); +ARRAY_VISITOR_DEFAULT(BinaryArray); +ARRAY_VISITOR_DEFAULT(DateArray); +ARRAY_VISITOR_DEFAULT(TimeArray); +ARRAY_VISITOR_DEFAULT(TimestampArray); +ARRAY_VISITOR_DEFAULT(IntervalArray); +ARRAY_VISITOR_DEFAULT(ListArray); +ARRAY_VISITOR_DEFAULT(StructArray); +ARRAY_VISITOR_DEFAULT(UnionArray); +ARRAY_VISITOR_DEFAULT(DictionaryArray); + +Status ArrayVisitor::Visit(const DecimalArray& array) { + return Status::NotImplemented("decimal"); +} + } // namespace arrow diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 9bb06afc9bf6c..8bb914e44ad3d 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -38,34 +38,34 @@ class MemoryPool; class MutableBuffer; class Status; -class ArrayVisitor { +class ARROW_EXPORT ArrayVisitor { public: virtual ~ArrayVisitor() = default; - virtual Status Visit(const NullArray& array) = 0; - virtual Status Visit(const BooleanArray& array) = 0; - virtual Status Visit(const Int8Array& array) = 0; - virtual Status Visit(const Int16Array& array) = 0; - virtual Status Visit(const Int32Array& array) = 0; - virtual Status Visit(const Int64Array& array) = 0; - virtual Status Visit(const UInt8Array& array) = 0; - virtual Status Visit(const UInt16Array& array) = 0; - virtual Status Visit(const UInt32Array& array) = 0; - virtual Status Visit(const UInt64Array& array) = 0; - virtual Status Visit(const HalfFloatArray& array) = 0; - virtual Status Visit(const 
FloatArray& array) = 0; - virtual Status Visit(const DoubleArray& array) = 0; - virtual Status Visit(const StringArray& array) = 0; - virtual Status Visit(const BinaryArray& array) = 0; - virtual Status Visit(const DateArray& array) = 0; - virtual Status Visit(const TimeArray& array) = 0; - virtual Status Visit(const TimestampArray& array) = 0; - virtual Status Visit(const IntervalArray& array) = 0; - virtual Status Visit(const DecimalArray& array) = 0; - virtual Status Visit(const ListArray& array) = 0; - virtual Status Visit(const StructArray& array) = 0; - virtual Status Visit(const UnionArray& array) = 0; - virtual Status Visit(const DictionaryArray& type) = 0; + virtual Status Visit(const NullArray& array); + virtual Status Visit(const BooleanArray& array); + virtual Status Visit(const Int8Array& array); + virtual Status Visit(const Int16Array& array); + virtual Status Visit(const Int32Array& array); + virtual Status Visit(const Int64Array& array); + virtual Status Visit(const UInt8Array& array); + virtual Status Visit(const UInt16Array& array); + virtual Status Visit(const UInt32Array& array); + virtual Status Visit(const UInt64Array& array); + virtual Status Visit(const HalfFloatArray& array); + virtual Status Visit(const FloatArray& array); + virtual Status Visit(const DoubleArray& array); + virtual Status Visit(const StringArray& array); + virtual Status Visit(const BinaryArray& array); + virtual Status Visit(const DateArray& array); + virtual Status Visit(const TimeArray& array); + virtual Status Visit(const TimestampArray& array); + virtual Status Visit(const IntervalArray& array); + virtual Status Visit(const DecimalArray& array); + virtual Status Visit(const ListArray& array); + virtual Status Visit(const StructArray& array); + virtual Status Visit(const UnionArray& array); + virtual Status Visit(const DictionaryArray& type); }; /// Immutable data array with some logical type and some length. 
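With the pure-virtual methods above relaxed to overridable defaults, a concrete
visitor only implements the cases it actually handles; every other array type
falls through to the Status::NotImplemented defaults added in array.cc. A
minimal sketch of such a visitor follows; the class name and the summing logic
are illustrative only, not part of this patch:

#include "arrow/array.h"
#include "arrow/status.h"

// Hypothetical visitor: handles Int32 arrays only; every other array type
// returns Status::NotImplemented through the inherited default Visit methods.
class Int32SumVisitor : public arrow::ArrayVisitor {
 public:
  arrow::Status Visit(const arrow::Int32Array& array) override {
    for (int32_t i = 0; i < array.length(); ++i) {
      if (!array.IsNull(i)) { sum_ += array.Value(i); }
    }
    return arrow::Status::OK();
  }

  int64_t sum() const { return sum_; }

 private:
  int64_t sum_ = 0;
};

Dispatch is the usual double dispatch: array->Accept(&visitor) calls back into
the matching Visit overload, which is exactly what compare.cc below relies on.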
diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc index 21fdb6633a9ee..ff3c59f638bb0 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -301,9 +301,9 @@ class RangeEqualsVisitor : public ArrayVisitor { bool result_; }; -class EqualsVisitor : public RangeEqualsVisitor { +class ArrayEqualsVisitor : public RangeEqualsVisitor { public: - explicit EqualsVisitor(const Array& right) + explicit ArrayEqualsVisitor(const Array& right) : RangeEqualsVisitor(right, 0, right.length(), 0) {} Status Visit(const NullArray& left) override { return Status::OK(); } @@ -511,9 +511,9 @@ inline bool FloatingApproxEquals( return true; } -class ApproxEqualsVisitor : public EqualsVisitor { +class ApproxEqualsVisitor : public ArrayEqualsVisitor { public: - using EqualsVisitor::EqualsVisitor; + using ArrayEqualsVisitor::ArrayEqualsVisitor; Status Visit(const FloatArray& left) override { result_ = @@ -549,7 +549,7 @@ Status ArrayEquals(const Array& left, const Array& right, bool* are_equal) { } else if (left.length() == 0) { *are_equal = true; } else { - EqualsVisitor visitor(right); + ArrayEqualsVisitor visitor(right); RETURN_NOT_OK(left.Accept(&visitor)); *are_equal = visitor.result(); } @@ -588,4 +588,102 @@ Status ArrayApproxEquals(const Array& left, const Array& right, bool* are_equal) return Status::OK(); } +// ---------------------------------------------------------------------- +// Implement TypeEquals + +class TypeEqualsVisitor : public TypeVisitor { + public: + explicit TypeEqualsVisitor(const DataType& right) : right_(right), result_(false) {} + + Status VisitChildren(const DataType& left) { + if (left.num_children() != right_.num_children()) { + result_ = false; + return Status::OK(); + } + + for (int i = 0; i < left.num_children(); ++i) { + if (!left.child(i)->Equals(right_.child(i))) { + result_ = false; + break; + } + } + result_ = true; + return Status::OK(); + } + + Status Visit(const TimeType& left) override { + const auto& right = static_cast(right_); + result_ = left.unit == right.unit; + return Status::OK(); + } + + Status Visit(const TimestampType& left) override { + const auto& right = static_cast(right_); + result_ = left.unit == right.unit; + return Status::OK(); + } + + Status Visit(const ListType& left) override { return VisitChildren(left); } + + Status Visit(const StructType& left) override { return VisitChildren(left); } + + Status Visit(const UnionType& left) override { + const auto& right = static_cast(right_); + + if (left.mode != right.mode || left.type_codes.size() != right.type_codes.size()) { + result_ = false; + return Status::OK(); + } + + const std::vector left_codes = left.type_codes; + const std::vector right_codes = right.type_codes; + + for (size_t i = 0; i < left_codes.size(); ++i) { + if (left_codes[i] != right_codes[i]) { + result_ = false; + break; + } + } + result_ = true; + return Status::OK(); + } + + Status Visit(const DictionaryType& left) override { + const auto& right = static_cast(right_); + result_ = left.index_type()->Equals(right.index_type()) && + left.dictionary()->Equals(right.dictionary()); + return Status::OK(); + } + + bool result() const { return result_; } + + protected: + const DataType& right_; + bool result_; +}; + +Status TypeEquals(const DataType& left, const DataType& right, bool* are_equal) { + // The arrays are the same object + if (&left == &right) { + *are_equal = true; + } else if (left.type != right.type) { + *are_equal = false; + } else { + TypeEqualsVisitor visitor(right); + Status s = 
left.Accept(&visitor); + + // We do not implement any type visitors where there is no additional + // metadata to compare. + if (s.IsNotImplemented()) { + // Not implemented means there is no additional metadata to compare + *are_equal = true; + } else if (!s.ok()) { + return s; + } else { + *are_equal = visitor.result(); + } + } + return Status::OK(); +} + } // namespace arrow diff --git a/cpp/src/arrow/compare.h b/cpp/src/arrow/compare.h index 2093b65a51a13..6a71f9fd573ba 100644 --- a/cpp/src/arrow/compare.h +++ b/cpp/src/arrow/compare.h @@ -27,6 +27,7 @@ namespace arrow { class Array; +struct DataType; class Status; /// Returns true if the arrays are exactly equal @@ -41,6 +42,10 @@ Status ARROW_EXPORT ArrayApproxEquals( Status ARROW_EXPORT ArrayRangeEquals(const Array& left, const Array& right, int32_t start_idx, int32_t end_idx, int32_t other_start_idx, bool* are_equal); +/// Returns true if the type metadata are exactly equal +Status ARROW_EXPORT TypeEquals( + const DataType& left, const DataType& right, bool* are_equal); + } // namespace arrow #endif // ARROW_COMPARE_H diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index 08ac9832982c1..2be87a35e7fb3 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -227,8 +227,6 @@ class RecordBatchWriter : public ArrayVisitor { } protected: - Status Visit(const NullArray& array) override { return Status::NotImplemented("null"); } - template Status VisitFixedWidth(const ArrayType& array) { std::shared_ptr data_buffer = array.data(); @@ -360,14 +358,6 @@ class RecordBatchWriter : public ArrayVisitor { return VisitFixedWidth(array); } - Status Visit(const IntervalArray& array) override { - return Status::NotImplemented("interval"); - } - - Status Visit(const DecimalArray& array) override { - return Status::NotImplemented("decimal"); - } - Status Visit(const ListArray& array) override { std::shared_ptr value_offsets; RETURN_NOT_OK(GetZeroBasedValueOffsets(array, &value_offsets)); @@ -653,8 +643,6 @@ class ArrayLoader : public TypeVisitor { return Status::OK(); } - Status Visit(const NullType& type) override { return Status::NotImplemented("null"); } - Status Visit(const BooleanType& type) override { return LoadPrimitive(type); } Status Visit(const Int8Type& type) override { return LoadPrimitive(type); } @@ -689,14 +677,6 @@ class ArrayLoader : public TypeVisitor { Status Visit(const TimestampType& type) override { return LoadPrimitive(type); } - Status Visit(const IntervalType& type) override { - return Status::NotImplemented(type.ToString()); - } - - Status Visit(const DecimalType& type) override { - return Status::NotImplemented(type.ToString()); - } - Status Visit(const ListType& type) override { FieldMetadata field_meta; std::shared_ptr null_bitmap, offsets; diff --git a/cpp/src/arrow/ipc/json-internal.cc b/cpp/src/arrow/ipc/json-internal.cc index b9f97dd2bbd15..6253cd6b43605 100644 --- a/cpp/src/arrow/ipc/json-internal.cc +++ b/cpp/src/arrow/ipc/json-internal.cc @@ -316,8 +316,6 @@ class JsonSchemaWriter : public TypeVisitor { return WritePrimitive("interval", type); } - Status Visit(const DecimalType& type) override { return Status::NotImplemented("NYI"); } - Status Visit(const ListType& type) override { WriteName("list", type); RETURN_NOT_OK(WriteChildren(type.children())); @@ -339,14 +337,6 @@ class JsonSchemaWriter : public TypeVisitor { return Status::OK(); } - Status Visit(const DictionaryType& type) override { - // WriteName("dictionary", type); - // WriteChildren(type.children()); - 
// WriteBufferLayout(type.GetBufferLayout()); - // return Status::OK(); - return Status::NotImplemented("dictionary type"); - } - private: const Schema& schema_; RjWriter* writer_; @@ -531,22 +521,6 @@ class JsonArrayWriter : public ArrayVisitor { Status Visit(const BinaryArray& array) override { return WriteVarBytes(array); } - Status Visit(const DateArray& array) override { return Status::NotImplemented("date"); } - - Status Visit(const TimeArray& array) override { return Status::NotImplemented("time"); } - - Status Visit(const TimestampArray& array) override { - return Status::NotImplemented("timestamp"); - } - - Status Visit(const IntervalArray& array) override { - return Status::NotImplemented("interval"); - } - - Status Visit(const DecimalArray& array) override { - return Status::NotImplemented("decimal"); - } - Status Visit(const ListArray& array) override { WriteValidityField(array); WriteIntegerField("OFFSET", array.raw_value_offsets(), array.length() + 1); @@ -571,10 +545,6 @@ class JsonArrayWriter : public ArrayVisitor { return WriteChildren(type->children(), array.children()); } - Status Visit(const DictionaryArray& array) override { - return Status::NotImplemented("dictionary"); - } - private: const std::string& name_; const Array& array_; diff --git a/cpp/src/arrow/schema-test.cc b/cpp/src/arrow/type-test.cc similarity index 81% rename from cpp/src/arrow/schema-test.cc rename to cpp/src/arrow/type-test.cc index 4826199f73de7..fe6c62adb7fba 100644 --- a/cpp/src/arrow/schema-test.cc +++ b/cpp/src/arrow/type-test.cc @@ -15,6 +15,8 @@ // specific language governing permissions and limitations // under the License. +// Unit tests for DataType (and subclasses), Field, and Schema + #include #include #include @@ -45,8 +47,8 @@ TEST(TestField, Equals) { Field f0_nn("f0", int32(), false); Field f0_other("f0", int32()); - ASSERT_EQ(f0, f0_other); - ASSERT_NE(f0, f0_nn); + ASSERT_TRUE(f0.Equals(f0_other)); + ASSERT_FALSE(f0.Equals(f0_nn)); } class TestSchema : public ::testing::Test { @@ -65,9 +67,9 @@ TEST_F(TestSchema, Basics) { auto schema = std::make_shared(fields); ASSERT_EQ(3, schema->num_fields()); - ASSERT_EQ(f0, schema->field(0)); - ASSERT_EQ(f1, schema->field(1)); - ASSERT_EQ(f2, schema->field(2)); + ASSERT_TRUE(f0->Equals(schema->field(0))); + ASSERT_TRUE(f1->Equals(schema->field(1))); + ASSERT_TRUE(f2->Equals(schema->field(2))); auto schema2 = std::make_shared(fields); @@ -119,4 +121,26 @@ TEST_F(TestSchema, GetFieldByName) { ASSERT_TRUE(result == nullptr); } +TEST(TestTimeType, Equals) { + TimeType t1; + TimeType t2; + TimeType t3(TimeUnit::NANO); + TimeType t4(TimeUnit::NANO); + + ASSERT_TRUE(t1.Equals(t2)); + ASSERT_FALSE(t1.Equals(t3)); + ASSERT_TRUE(t3.Equals(t4)); +} + +TEST(TestTimestampType, Equals) { + TimestampType t1; + TimestampType t2; + TimestampType t3(TimeUnit::NANO); + TimestampType t4(TimeUnit::NANO); + + ASSERT_TRUE(t1.Equals(t2)); + ASSERT_FALSE(t1.Equals(t3)); + ASSERT_TRUE(t3.Equals(t4)); +} + } // namespace arrow diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index b97b4657c361c..23fa6812f53d4 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -21,6 +21,7 @@ #include #include "arrow/array.h" +#include "arrow/compare.h" #include "arrow/status.h" #include "arrow/util/logging.h" @@ -46,16 +47,14 @@ std::string Field::ToString() const { DataType::~DataType() {} bool DataType::Equals(const DataType& other) const { - bool equals = - ((this == &other) || ((this->type == other.type) && - ((this->num_children() == 
other.num_children())))); - if (equals) { - for (int i = 0; i < num_children(); ++i) { - // TODO(emkornfield) limit recursion - if (!children_[i]->Equals(other.children_[i])) { return false; } - } - } - return equals; + bool are_equal = false; + Status error = TypeEquals(*this, other, &are_equal); + if (!error.ok()) { DCHECK(false) << "Types not comparable: " << error.ToString(); } + return are_equal; +} + +bool DataType::Equals(const std::shared_ptr& other) const { + return Equals(*other.get()); } std::string BooleanType::ToString() const { @@ -104,6 +103,15 @@ std::string DateType::ToString() const { return std::string("date"); } +// ---------------------------------------------------------------------- +// Union type + +UnionType::UnionType(const std::vector>& fields, + const std::vector& type_codes, UnionMode mode) + : DataType(Type::UNION), mode(mode), type_codes(type_codes) { + children_ = fields; +} + std::string UnionType::ToString() const { std::stringstream s; @@ -138,14 +146,6 @@ std::shared_ptr DictionaryType::dictionary() const { return dictionary_; } -bool DictionaryType::Equals(const DataType& other) const { - if (other.type != Type::DICTIONARY) { return false; } - const auto& other_dict = static_cast(other); - - return index_type_->Equals(other_dict.index_type_) && - dictionary_->Equals(other_dict.dictionary_); -} - std::string DictionaryType::ToString() const { std::stringstream ss; ss << "dictionarytype()->ToString() @@ -286,4 +286,37 @@ std::vector DecimalType::GetBufferLayout() const { return {}; } +// ---------------------------------------------------------------------- +// Default implementations of TypeVisitor methods + +#define TYPE_VISITOR_DEFAULT(TYPE_CLASS) \ + Status TypeVisitor::Visit(const TYPE_CLASS& type) { \ + return Status::NotImplemented(type.ToString()); \ + } + +TYPE_VISITOR_DEFAULT(NullType); +TYPE_VISITOR_DEFAULT(BooleanType); +TYPE_VISITOR_DEFAULT(Int8Type); +TYPE_VISITOR_DEFAULT(Int16Type); +TYPE_VISITOR_DEFAULT(Int32Type); +TYPE_VISITOR_DEFAULT(Int64Type); +TYPE_VISITOR_DEFAULT(UInt8Type); +TYPE_VISITOR_DEFAULT(UInt16Type); +TYPE_VISITOR_DEFAULT(UInt32Type); +TYPE_VISITOR_DEFAULT(UInt64Type); +TYPE_VISITOR_DEFAULT(HalfFloatType); +TYPE_VISITOR_DEFAULT(FloatType); +TYPE_VISITOR_DEFAULT(DoubleType); +TYPE_VISITOR_DEFAULT(StringType); +TYPE_VISITOR_DEFAULT(BinaryType); +TYPE_VISITOR_DEFAULT(DateType); +TYPE_VISITOR_DEFAULT(TimeType); +TYPE_VISITOR_DEFAULT(TimestampType); +TYPE_VISITOR_DEFAULT(IntervalType); +TYPE_VISITOR_DEFAULT(DecimalType); +TYPE_VISITOR_DEFAULT(ListType); +TYPE_VISITOR_DEFAULT(StructType); +TYPE_VISITOR_DEFAULT(UnionType); +TYPE_VISITOR_DEFAULT(DictionaryType); + } // namespace arrow diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index b15aa277af201..9a97fc30094b9 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -112,34 +112,34 @@ class BufferDescr { int bit_width_; }; -class TypeVisitor { +class ARROW_EXPORT TypeVisitor { public: virtual ~TypeVisitor() = default; - virtual Status Visit(const NullType& type) = 0; - virtual Status Visit(const BooleanType& type) = 0; - virtual Status Visit(const Int8Type& type) = 0; - virtual Status Visit(const Int16Type& type) = 0; - virtual Status Visit(const Int32Type& type) = 0; - virtual Status Visit(const Int64Type& type) = 0; - virtual Status Visit(const UInt8Type& type) = 0; - virtual Status Visit(const UInt16Type& type) = 0; - virtual Status Visit(const UInt32Type& type) = 0; - virtual Status Visit(const UInt64Type& type) = 0; - virtual Status Visit(const 
HalfFloatType& type) = 0; - virtual Status Visit(const FloatType& type) = 0; - virtual Status Visit(const DoubleType& type) = 0; - virtual Status Visit(const StringType& type) = 0; - virtual Status Visit(const BinaryType& type) = 0; - virtual Status Visit(const DateType& type) = 0; - virtual Status Visit(const TimeType& type) = 0; - virtual Status Visit(const TimestampType& type) = 0; - virtual Status Visit(const IntervalType& type) = 0; - virtual Status Visit(const DecimalType& type) = 0; - virtual Status Visit(const ListType& type) = 0; - virtual Status Visit(const StructType& type) = 0; - virtual Status Visit(const UnionType& type) = 0; - virtual Status Visit(const DictionaryType& type) = 0; + virtual Status Visit(const NullType& type); + virtual Status Visit(const BooleanType& type); + virtual Status Visit(const Int8Type& type); + virtual Status Visit(const Int16Type& type); + virtual Status Visit(const Int32Type& type); + virtual Status Visit(const Int64Type& type); + virtual Status Visit(const UInt8Type& type); + virtual Status Visit(const UInt16Type& type); + virtual Status Visit(const UInt32Type& type); + virtual Status Visit(const UInt64Type& type); + virtual Status Visit(const HalfFloatType& type); + virtual Status Visit(const FloatType& type); + virtual Status Visit(const DoubleType& type); + virtual Status Visit(const StringType& type); + virtual Status Visit(const BinaryType& type); + virtual Status Visit(const DateType& type); + virtual Status Visit(const TimeType& type); + virtual Status Visit(const TimestampType& type); + virtual Status Visit(const IntervalType& type); + virtual Status Visit(const DecimalType& type); + virtual Status Visit(const ListType& type); + virtual Status Visit(const StructType& type); + virtual Status Visit(const UnionType& type); + virtual Status Visit(const DictionaryType& type); }; struct ARROW_EXPORT DataType { @@ -156,10 +156,7 @@ struct ARROW_EXPORT DataType { // Types that are logically convertable from one to another e.g. List // and Binary are NOT equal). 
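  // For example, under these rules timestamp(TimeUnit::MILLI) and
  // timestamp(TimeUnit::NANO) compare unequal because the TypeEqualsVisitor
  // in compare.cc checks their unit metadata, while nested types such as
  // struct compare equal only when every child field matches recursively.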
   virtual bool Equals(const DataType& other) const;
-
-  bool Equals(const std::shared_ptr<DataType>& other) const {
-    return Equals(*other.get());
-  }
+  bool Equals(const std::shared_ptr<DataType>& other) const;
 
   std::shared_ptr<Field> child(int i) const { return children_[i]; }
 
@@ -211,8 +208,6 @@ struct ARROW_EXPORT Field {
       bool nullable = true)
       : name(name), type(type), nullable(nullable) {}
 
-  bool operator==(const Field& other) const { return this->Equals(other); }
-  bool operator!=(const Field& other) const { return !this->Equals(other); }
   bool Equals(const Field& other) const;
   bool Equals(const std::shared_ptr<Field>& other) const;
 
@@ -411,10 +406,7 @@ struct ARROW_EXPORT UnionType : public DataType {
   static constexpr Type::type type_id = Type::UNION;
 
   UnionType(const std::vector<std::shared_ptr<Field>>& fields,
-      const std::vector<uint8_t>& type_codes, UnionMode mode = UnionMode::SPARSE)
-      : DataType(Type::UNION), mode(mode), type_codes(type_codes) {
-    children_ = fields;
-  }
+      const std::vector<uint8_t>& type_codes, UnionMode mode = UnionMode::SPARSE);
 
   std::string ToString() const override;
   static std::string name() { return "union"; }
@@ -523,8 +515,6 @@ class ARROW_EXPORT DictionaryType : public FixedWidthType {
 
   std::shared_ptr<Array> dictionary() const;
 
-  bool Equals(const DataType& other) const override;
-
   Status Accept(TypeVisitor* visitor) const override;
 
   std::string ToString() const override;

From 16c97592bf948c32a8dae9441ace078422d642dd Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Sun, 26 Feb 2017 19:22:15 -0500
Subject: [PATCH 0343/1644] ARROW-577: [C++] Use private implementation pattern
 in ipc::StreamWriter and ipc::FileWriter

This patch also includes some code reorganization -- I moved the reader and
writer classes to their own headers/compilation units. I also moved the
stream-to-file and file-to-stream executables to arrow/ipc.

Author: Wes McKinney

Closes #351 from wesm/ARROW-577 and squashes the following commits:

98c32d2 [Wes McKinney] Only build file/stream utils if ARROW_BUILD_UTILITIES is on
c5fa43f [Wes McKinney] Refactor to make stream and file writer implementation details private in the public ABI
---
 cpp/src/arrow/ipc/CMakeLists.txt              |  32 +-
 cpp/src/arrow/ipc/api.h                       |  27 ++
 cpp/src/arrow/{util => ipc}/file-to-stream.cc |   4 +-
 cpp/src/arrow/ipc/ipc-file-test.cc            |   4 +-
 cpp/src/arrow/ipc/json-integration-test.cc    |   3 +-
 cpp/src/arrow/ipc/metadata.h                  |  12 +
 cpp/src/arrow/ipc/{file.cc => reader.cc}      | 184 +++++++----
 cpp/src/arrow/ipc/{file.h => reader.h}        |  27 +-
 cpp/src/arrow/{util => ipc}/stream-to-file.cc |   4 +-
 cpp/src/arrow/ipc/stream.cc                   | 310 ------------------
 cpp/src/arrow/ipc/writer.cc                   | 287 ++++++++++++++++
 cpp/src/arrow/ipc/{stream.h => writer.h}      |  75 +----
 cpp/src/arrow/util/CMakeLists.txt             |  22 --
 python/pyarrow/includes/libarrow_ipc.pxd      |   5 +-
 14 files changed, 520 insertions(+), 476 deletions(-)
 create mode 100644 cpp/src/arrow/ipc/api.h
 rename cpp/src/arrow/{util => ipc}/file-to-stream.cc (97%)
 rename cpp/src/arrow/ipc/{file.cc => reader.cc} (63%)
 rename cpp/src/arrow/ipc/{file.h => reader.h} (83%)
 rename cpp/src/arrow/{util => ipc}/stream-to-file.cc (96%)
 delete mode 100644 cpp/src/arrow/ipc/stream.cc
 create mode 100644 cpp/src/arrow/ipc/writer.cc
 rename cpp/src/arrow/ipc/{stream.h => writer.h} (52%)

diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt
index e7a3fdb1dd862..08da0a109c963 100644
--- a/cpp/src/arrow/ipc/CMakeLists.txt
+++ b/cpp/src/arrow/ipc/CMakeLists.txt
@@ -30,12 +30,12 @@ set(ARROW_IPC_TEST_LINK_LIBS
 set(ARROW_IPC_SRCS
   adapter.cc
-  file.cc
   json.cc
   json-internal.cc
   metadata.cc
metadata-internal.cc - stream.cc + reader.cc + writer.cc ) if(NOT APPLE) @@ -138,10 +138,11 @@ add_dependencies(arrow_ipc_objlib metadata_fbs) # Headers: top level install(FILES adapter.h - file.h + api.h json.h metadata.h - stream.h + reader.h + writer.h DESTINATION include/arrow/ipc) # pkg-config support @@ -151,3 +152,26 @@ configure_file(arrow-ipc.pc.in install( FILES "${CMAKE_CURRENT_BINARY_DIR}/arrow-ipc.pc" DESTINATION "lib/pkgconfig/") + + +set(UTIL_LINK_LIBS + arrow_ipc_static + arrow_io_static + arrow_static + boost_filesystem_static + boost_system_static + dl) + +if (NOT APPLE) + set(UTIL_LINK_LIBS + ${UTIL_LINK_LIBS} + boost_filesystem_static + boost_system_static) +endif() + +if (ARROW_BUILD_UTILITIES) + add_executable(file-to-stream file-to-stream.cc) + target_link_libraries(file-to-stream ${UTIL_LINK_LIBS}) + add_executable(stream-to-file stream-to-file.cc) + target_link_libraries(stream-to-file ${UTIL_LINK_LIBS}) +endif() diff --git a/cpp/src/arrow/ipc/api.h b/cpp/src/arrow/ipc/api.h new file mode 100644 index 0000000000000..cb854212bbeee --- /dev/null +++ b/cpp/src/arrow/ipc/api.h @@ -0,0 +1,27 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_IPC_API_H +#define ARROW_IPC_API_H + +#include "arrow/ipc/adapter.h" +#include "arrow/ipc/json.h" +#include "arrow/ipc/metadata.h" +#include "arrow/ipc/reader.h" +#include "arrow/ipc/writer.h" + +#endif // ARROW_IPC_API_H diff --git a/cpp/src/arrow/util/file-to-stream.cc b/cpp/src/arrow/ipc/file-to-stream.cc similarity index 97% rename from cpp/src/arrow/util/file-to-stream.cc rename to cpp/src/arrow/ipc/file-to-stream.cc index 7daf26366721d..8161b191380dc 100644 --- a/cpp/src/arrow/util/file-to-stream.cc +++ b/cpp/src/arrow/ipc/file-to-stream.cc @@ -16,8 +16,8 @@ // under the License. 
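The api.h umbrella header introduced above gives downstream code a one-line
alternative to the fine-grained includes this utility keeps using. A
hypothetical consumer, not part of this patch, could write just:

#include "arrow/ipc/api.h"  // brings in adapter.h, json.h, metadata.h, reader.h, writer.h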
#include "arrow/io/file.h" -#include "arrow/ipc/file.h" -#include "arrow/ipc/stream.h" +#include "arrow/ipc/reader.h" +#include "arrow/ipc/writer.h" #include "arrow/status.h" #include diff --git a/cpp/src/arrow/ipc/ipc-file-test.cc b/cpp/src/arrow/ipc/ipc-file-test.cc index 4b82aab0e3978..e58f2cfbbe8c9 100644 --- a/cpp/src/arrow/ipc/ipc-file-test.cc +++ b/cpp/src/arrow/ipc/ipc-file-test.cc @@ -28,10 +28,10 @@ #include "arrow/io/memory.h" #include "arrow/io/test-common.h" #include "arrow/ipc/adapter.h" -#include "arrow/ipc/file.h" -#include "arrow/ipc/stream.h" +#include "arrow/ipc/reader.h" #include "arrow/ipc/test-common.h" #include "arrow/ipc/util.h" +#include "arrow/ipc/writer.h" #include "arrow/buffer.h" #include "arrow/memory_pool.h" diff --git a/cpp/src/arrow/ipc/json-integration-test.cc b/cpp/src/arrow/ipc/json-integration-test.cc index 95bc742054fab..c16074ee32dc6 100644 --- a/cpp/src/arrow/ipc/json-integration-test.cc +++ b/cpp/src/arrow/ipc/json-integration-test.cc @@ -29,8 +29,9 @@ #include // NOLINT #include "arrow/io/file.h" -#include "arrow/ipc/file.h" #include "arrow/ipc/json.h" +#include "arrow/ipc/reader.h" +#include "arrow/ipc/writer.h" #include "arrow/pretty_print.h" #include "arrow/schema.h" #include "arrow/status.h" diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h index 0091067c3225a..f12529b5c585e 100644 --- a/cpp/src/arrow/ipc/metadata.h +++ b/cpp/src/arrow/ipc/metadata.h @@ -50,6 +50,18 @@ struct MetadataVersion { enum type { V1, V2 }; }; +static constexpr const char* kArrowMagicBytes = "ARROW1"; + +struct ARROW_EXPORT FileBlock { + FileBlock() {} + FileBlock(int64_t offset, int32_t metadata_length, int64_t body_length) + : offset(offset), metadata_length(metadata_length), body_length(body_length) {} + + int64_t offset; + int32_t metadata_length; + int64_t body_length; +}; + //---------------------------------------------------------------------- using DictionaryMap = std::unordered_map>; diff --git a/cpp/src/arrow/ipc/file.cc b/cpp/src/arrow/ipc/reader.cc similarity index 63% rename from cpp/src/arrow/ipc/file.cc rename to cpp/src/arrow/ipc/reader.cc index c1d483f1fbba6..1a9af7db3dcdc 100644 --- a/cpp/src/arrow/ipc/file.cc +++ b/cpp/src/arrow/ipc/reader.cc @@ -15,11 +15,12 @@ // specific language governing permissions and limitations // under the License. 
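The FileBlock triple promoted into metadata.h above (offset, metadata_length,
body_length) is exactly what a reader needs to locate one message again later.
A hedged sketch, assuming a valid io::ReadableFileInterface* named file and a
footer entry named block, neither of which appears in this patch:

// Read the metadata and body of one recorded block back out of the file.
std::shared_ptr<Buffer> buf;
RETURN_NOT_OK(file->ReadAt(
    block.offset, block.metadata_length + block.body_length, &buf));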
-#include "arrow/ipc/file.h" +#include "arrow/ipc/reader.h" #include #include #include +#include #include #include "arrow/buffer.h" @@ -35,83 +36,154 @@ namespace arrow { namespace ipc { -static constexpr const char* kArrowMagicBytes = "ARROW1"; +// ---------------------------------------------------------------------- +// StreamReader implementation -static flatbuffers::Offset> -FileBlocksToFlatbuffer(FBB& fbb, const std::vector& blocks) { - std::vector fb_blocks; +static inline FileBlock FileBlockFromFlatbuffer(const flatbuf::Block* block) { + return FileBlock(block->offset(), block->metaDataLength(), block->bodyLength()); +} - for (const FileBlock& block : blocks) { - fb_blocks.emplace_back(block.offset, block.metadata_length, block.body_length); +static inline std::string message_type_name(Message::Type type) { + switch (type) { + case Message::SCHEMA: + return "schema"; + case Message::RECORD_BATCH: + return "record batch"; + case Message::DICTIONARY_BATCH: + return "dictionary"; + default: + break; } - - return fbb.CreateVectorOfStructs(fb_blocks); + return "unknown"; } -Status WriteFileFooter(const Schema& schema, const std::vector& dictionaries, - const std::vector& record_batches, DictionaryMemo* dictionary_memo, - io::OutputStream* out) { - FBB fbb; +class StreamReader::StreamReaderImpl { + public: + StreamReaderImpl() {} + ~StreamReaderImpl() {} - flatbuffers::Offset fb_schema; - RETURN_NOT_OK(SchemaToFlatbuffer(fbb, schema, dictionary_memo, &fb_schema)); + Status Open(const std::shared_ptr& stream) { + stream_ = stream; + return ReadSchema(); + } - auto fb_dictionaries = FileBlocksToFlatbuffer(fbb, dictionaries); - auto fb_record_batches = FileBlocksToFlatbuffer(fbb, record_batches); + Status ReadNextMessage(Message::Type expected_type, std::shared_ptr* message) { + std::shared_ptr buffer; + RETURN_NOT_OK(stream_->Read(sizeof(int32_t), &buffer)); - auto footer = flatbuf::CreateFooter( - fbb, kMetadataVersion, fb_schema, fb_dictionaries, fb_record_batches); + if (buffer->size() != sizeof(int32_t)) { + *message = nullptr; + return Status::OK(); + } - fbb.Finish(footer); + int32_t message_length = *reinterpret_cast(buffer->data()); - int32_t size = fbb.GetSize(); + RETURN_NOT_OK(stream_->Read(message_length, &buffer)); + if (buffer->size() != message_length) { + return Status::IOError("Unexpected end of stream trying to read message"); + } - return out->Write(fbb.GetBufferPointer(), size); -} + RETURN_NOT_OK(Message::Open(buffer, 0, message)); -static inline FileBlock FileBlockFromFlatbuffer(const flatbuf::Block* block) { - return FileBlock(block->offset(), block->metaDataLength(), block->bodyLength()); -} + if ((*message)->type() != expected_type) { + std::stringstream ss; + ss << "Message not expected type: " << message_type_name(expected_type) + << ", was: " << (*message)->type(); + return Status::IOError(ss.str()); + } + return Status::OK(); + } -// ---------------------------------------------------------------------- -// File writer implementation + Status ReadExact(int64_t size, std::shared_ptr* buffer) { + RETURN_NOT_OK(stream_->Read(size, buffer)); -FileWriter::FileWriter(io::OutputStream* sink, const std::shared_ptr& schema) - : StreamWriter(sink, schema) {} + if ((*buffer)->size() < size) { + return Status::IOError("Unexpected EOS when reading buffer"); + } + return Status::OK(); + } -Status FileWriter::Open(io::OutputStream* sink, const std::shared_ptr& schema, - std::shared_ptr* out) { - *out = std::shared_ptr(new FileWriter(sink, schema)); // ctor is private - 
RETURN_NOT_OK((*out)->UpdatePosition()); - return Status::OK(); -} + Status ReadNextDictionary() { + std::shared_ptr message; + RETURN_NOT_OK(ReadNextMessage(Message::DICTIONARY_BATCH, &message)); -Status FileWriter::Start() { - RETURN_NOT_OK(WriteAligned( - reinterpret_cast(kArrowMagicBytes), strlen(kArrowMagicBytes))); + DictionaryBatchMetadata metadata(message); - // We write the schema at the start of the file (and the end). This also - // writes all the dictionaries at the beginning of the file - return StreamWriter::Start(); -} + std::shared_ptr batch_body; + RETURN_NOT_OK(ReadExact(message->body_length(), &batch_body)) + io::BufferReader reader(batch_body); -Status FileWriter::Close() { - // Write metadata - int64_t initial_position = position_; - RETURN_NOT_OK(WriteFileFooter( - *schema_, dictionaries_, record_batches_, dictionary_memo_.get(), sink_)); - RETURN_NOT_OK(UpdatePosition()); + std::shared_ptr dictionary; + RETURN_NOT_OK(ReadDictionary(metadata, dictionary_types_, &reader, &dictionary)); + return dictionary_memo_.AddDictionary(metadata.id(), dictionary); + } - // Write footer length - int32_t footer_length = position_ - initial_position; + Status ReadSchema() { + std::shared_ptr message; + RETURN_NOT_OK(ReadNextMessage(Message::SCHEMA, &message)); + + SchemaMetadata schema_meta(message); + RETURN_NOT_OK(schema_meta.GetDictionaryTypes(&dictionary_types_)); + + // TODO(wesm): In future, we may want to reconcile the ids in the stream with + // those found in the schema + int num_dictionaries = static_cast(dictionary_types_.size()); + for (int i = 0; i < num_dictionaries; ++i) { + RETURN_NOT_OK(ReadNextDictionary()); + } + + return schema_meta.GetSchema(dictionary_memo_, &schema_); + } + + Status GetNextRecordBatch(std::shared_ptr* batch) { + std::shared_ptr message; + RETURN_NOT_OK(ReadNextMessage(Message::RECORD_BATCH, &message)); + + if (message == nullptr) { + // End of stream + *batch = nullptr; + return Status::OK(); + } + + RecordBatchMetadata batch_metadata(message); - if (footer_length <= 0) { return Status::Invalid("Invalid file footer"); } + std::shared_ptr batch_body; + RETURN_NOT_OK(ReadExact(message->body_length(), &batch_body)); + io::BufferReader reader(batch_body); + return ReadRecordBatch(batch_metadata, schema_, &reader, batch); + } + + std::shared_ptr schema() const { return schema_; } - RETURN_NOT_OK(Write(reinterpret_cast(&footer_length), sizeof(int32_t))); + private: + // dictionary_id -> type + DictionaryTypeMap dictionary_types_; + + DictionaryMemo dictionary_memo_; + + std::shared_ptr stream_; + std::shared_ptr schema_; +}; + +StreamReader::StreamReader() { + impl_.reset(new StreamReaderImpl()); +} + +StreamReader::~StreamReader() {} + +Status StreamReader::Open(const std::shared_ptr& stream, + std::shared_ptr* reader) { + // Private ctor + *reader = std::shared_ptr(new StreamReader()); + return (*reader)->impl_->Open(stream); +} + +std::shared_ptr StreamReader::schema() const { + return impl_->schema(); +} - // Write magic bytes to end file - return Write( - reinterpret_cast(kArrowMagicBytes), strlen(kArrowMagicBytes)); +Status StreamReader::GetNextRecordBatch(std::shared_ptr* batch) { + return impl_->GetNextRecordBatch(batch); } // ---------------------------------------------------------------------- diff --git a/cpp/src/arrow/ipc/file.h b/cpp/src/arrow/ipc/reader.h similarity index 83% rename from cpp/src/arrow/ipc/file.h rename to cpp/src/arrow/ipc/reader.h index 524766ccb3336..6f143e1a1265e 100644 --- a/cpp/src/arrow/ipc/file.h +++ 
b/cpp/src/arrow/ipc/reader.h @@ -25,7 +25,6 @@ #include #include "arrow/ipc/metadata.h" -#include "arrow/ipc/stream.h" #include "arrow/util/visibility.h" namespace arrow { @@ -37,29 +36,31 @@ class Status; namespace io { -class OutputStream; +class InputStream; class ReadableFileInterface; } // namespace io namespace ipc { -Status WriteFileFooter(const Schema& schema, const std::vector& dictionaries, - const std::vector& record_batches, DictionaryMemo* dictionary_memo, - io::OutputStream* out); - -class ARROW_EXPORT FileWriter : public StreamWriter { +class ARROW_EXPORT StreamReader { public: - static Status Open(io::OutputStream* sink, const std::shared_ptr& schema, - std::shared_ptr* out); + ~StreamReader(); + + // Open an stream. + static Status Open(const std::shared_ptr& stream, + std::shared_ptr* reader); + + std::shared_ptr schema() const; - using StreamWriter::WriteRecordBatch; - Status Close() override; + // Returned batch is nullptr when end of stream reached + Status GetNextRecordBatch(std::shared_ptr* batch); private: - FileWriter(io::OutputStream* sink, const std::shared_ptr& schema); + StreamReader(); - Status Start() override; + class ARROW_NO_EXPORT StreamReaderImpl; + std::unique_ptr impl_; }; class ARROW_EXPORT FileReader { diff --git a/cpp/src/arrow/util/stream-to-file.cc b/cpp/src/arrow/ipc/stream-to-file.cc similarity index 96% rename from cpp/src/arrow/util/stream-to-file.cc rename to cpp/src/arrow/ipc/stream-to-file.cc index 393b07d8d355f..ec0ac435a9d0d 100644 --- a/cpp/src/arrow/util/stream-to-file.cc +++ b/cpp/src/arrow/ipc/stream-to-file.cc @@ -15,9 +15,9 @@ // specific language governing permissions and limitations // under the License. -#include "arrow/ipc/stream.h" #include "arrow/io/file.h" -#include "arrow/ipc/file.h" +#include "arrow/ipc/reader.h" +#include "arrow/ipc/writer.h" #include "arrow/status.h" #include diff --git a/cpp/src/arrow/ipc/stream.cc b/cpp/src/arrow/ipc/stream.cc deleted file mode 100644 index 7f5c9932330be..0000000000000 --- a/cpp/src/arrow/ipc/stream.cc +++ /dev/null @@ -1,310 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
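The deleted stream.cc below and its reader.cc/writer.cc replacements share one
wire framing, which can be read off ReadNextMessage and Start:

<int32: metadata size> <metadata flatbuffer> <message body>

written per message with 8-byte alignment padding (see Align below), starting
with a SCHEMA message, then one DICTIONARY_BATCH per dictionary-encoded field,
then the RECORD_BATCH messages.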
-
-#include "arrow/ipc/stream.h"
-
-#include
-#include
-#include
-#include
-#include
-
-#include "arrow/buffer.h"
-#include "arrow/io/interfaces.h"
-#include "arrow/io/memory.h"
-#include "arrow/ipc/adapter.h"
-#include "arrow/ipc/metadata-internal.h"
-#include "arrow/ipc/metadata.h"
-#include "arrow/ipc/util.h"
-#include "arrow/memory_pool.h"
-#include "arrow/schema.h"
-#include "arrow/status.h"
-#include "arrow/table.h"
-#include "arrow/util/logging.h"
-
-namespace arrow {
-namespace ipc {
-
-// ----------------------------------------------------------------------
-// Stream writer implementation
-
-StreamWriter::StreamWriter(io::OutputStream* sink, const std::shared_ptr<Schema>& schema)
-    : sink_(sink),
-      schema_(schema),
-      dictionary_memo_(std::make_shared<DictionaryMemo>()),
-      pool_(default_memory_pool()),
-      position_(-1),
-      started_(false) {}
-
-Status StreamWriter::UpdatePosition() {
-  return sink_->Tell(&position_);
-}
-
-Status StreamWriter::Write(const uint8_t* data, int64_t nbytes) {
-  RETURN_NOT_OK(sink_->Write(data, nbytes));
-  position_ += nbytes;
-  return Status::OK();
-}
-
-Status StreamWriter::Align() {
-  int64_t remainder = PaddedLength(position_) - position_;
-  if (remainder > 0) { return Write(kPaddingBytes, remainder); }
-  return Status::OK();
-}
-
-Status StreamWriter::WriteAligned(const uint8_t* data, int64_t nbytes) {
-  RETURN_NOT_OK(Write(data, nbytes));
-  return Align();
-}
-
-Status StreamWriter::CheckStarted() {
-  if (!started_) { return Start(); }
-  return Status::OK();
-}
-
-Status StreamWriter::WriteRecordBatch(const RecordBatch& batch, FileBlock* block) {
-  RETURN_NOT_OK(CheckStarted());
-
-  block->offset = position_;
-
-  // Frame of reference in file format is 0, see ARROW-384
-  const int64_t buffer_start_offset = 0;
-  RETURN_NOT_OK(arrow::ipc::WriteRecordBatch(batch, buffer_start_offset, sink_,
-      &block->metadata_length, &block->body_length, pool_));
-  RETURN_NOT_OK(UpdatePosition());
-
-  DCHECK(position_ % 8 == 0) << "WriteRecordBatch did not perform aligned writes";
-
-  return Status::OK();
-}
-
-void StreamWriter::set_memory_pool(MemoryPool* pool) {
-  pool_ = pool;
-}
-
-// ----------------------------------------------------------------------
-// StreamWriter implementation
-
-Status StreamWriter::Open(io::OutputStream* sink, const std::shared_ptr<Schema>& schema,
-    std::shared_ptr<StreamWriter>* out) {
-  // ctor is private
-  *out = std::shared_ptr<StreamWriter>(new StreamWriter(sink, schema));
-  RETURN_NOT_OK((*out)->UpdatePosition());
-  return Status::OK();
-}
-
-Status StreamWriter::Start() {
-  std::shared_ptr<Buffer> schema_fb;
-  RETURN_NOT_OK(WriteSchemaMessage(*schema_, dictionary_memo_.get(), &schema_fb));
-
-  int32_t flatbuffer_size = schema_fb->size();
-  RETURN_NOT_OK(
-      Write(reinterpret_cast<const uint8_t*>(&flatbuffer_size), sizeof(int32_t)));
-
-  // Write the flatbuffer
-  RETURN_NOT_OK(Write(schema_fb->data(), flatbuffer_size));
-
-  // If there are any dictionaries, write them as the next messages
-  RETURN_NOT_OK(WriteDictionaries());
-
-  started_ = true;
-  return Status::OK();
-}
-
-Status StreamWriter::WriteRecordBatch(const RecordBatch& batch) {
-  // Push an empty FileBlock. Can be written in the footer later
-  record_batches_.emplace_back(0, 0, 0);
-  return WriteRecordBatch(batch, &record_batches_[record_batches_.size() - 1]);
-}
-
-Status StreamWriter::WriteDictionaries() {
-  const DictionaryMap& id_to_dictionary = dictionary_memo_->id_to_dictionary();
-
-  dictionaries_.resize(id_to_dictionary.size());
-
-  // TODO(wesm): does sorting by id yield any benefit?
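// The sort question above is likely moot for correctness: as
// ReadNextDictionary shows, readers register each dictionary by its id via
// DictionaryMemo::AddDictionary, so map iteration order affects only
// byte-for-byte reproducibility of the stream, not lookups.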
- int dict_index = 0; - for (const auto& entry : id_to_dictionary) { - FileBlock* block = &dictionaries_[dict_index++]; - - block->offset = position_; - - // Frame of reference in file format is 0, see ARROW-384 - const int64_t buffer_start_offset = 0; - RETURN_NOT_OK(WriteDictionary(entry.first, entry.second, buffer_start_offset, sink_, - &block->metadata_length, &block->body_length, pool_)); - RETURN_NOT_OK(UpdatePosition()); - DCHECK(position_ % 8 == 0) << "WriteDictionary did not perform aligned writes"; - } - - return Status::OK(); -} - -Status StreamWriter::Close() { - // Write the schema if not already written - // User is responsible for closing the OutputStream - return CheckStarted(); -} - -// ---------------------------------------------------------------------- -// StreamReader implementation - -static inline std::string message_type_name(Message::Type type) { - switch (type) { - case Message::SCHEMA: - return "schema"; - case Message::RECORD_BATCH: - return "record batch"; - case Message::DICTIONARY_BATCH: - return "dictionary"; - default: - break; - } - return "unknown"; -} - -class StreamReader::StreamReaderImpl { - public: - StreamReaderImpl() {} - ~StreamReaderImpl() {} - - Status Open(const std::shared_ptr& stream) { - stream_ = stream; - return ReadSchema(); - } - - Status ReadNextMessage(Message::Type expected_type, std::shared_ptr* message) { - std::shared_ptr buffer; - RETURN_NOT_OK(stream_->Read(sizeof(int32_t), &buffer)); - - if (buffer->size() != sizeof(int32_t)) { - *message = nullptr; - return Status::OK(); - } - - int32_t message_length = *reinterpret_cast(buffer->data()); - - RETURN_NOT_OK(stream_->Read(message_length, &buffer)); - if (buffer->size() != message_length) { - return Status::IOError("Unexpected end of stream trying to read message"); - } - - RETURN_NOT_OK(Message::Open(buffer, 0, message)); - - if ((*message)->type() != expected_type) { - std::stringstream ss; - ss << "Message not expected type: " << message_type_name(expected_type) - << ", was: " << (*message)->type(); - return Status::IOError(ss.str()); - } - return Status::OK(); - } - - Status ReadExact(int64_t size, std::shared_ptr* buffer) { - RETURN_NOT_OK(stream_->Read(size, buffer)); - - if ((*buffer)->size() < size) { - return Status::IOError("Unexpected EOS when reading buffer"); - } - return Status::OK(); - } - - Status ReadNextDictionary() { - std::shared_ptr message; - RETURN_NOT_OK(ReadNextMessage(Message::DICTIONARY_BATCH, &message)); - - DictionaryBatchMetadata metadata(message); - - std::shared_ptr batch_body; - RETURN_NOT_OK(ReadExact(message->body_length(), &batch_body)) - io::BufferReader reader(batch_body); - - std::shared_ptr dictionary; - RETURN_NOT_OK(ReadDictionary(metadata, dictionary_types_, &reader, &dictionary)); - return dictionary_memo_.AddDictionary(metadata.id(), dictionary); - } - - Status ReadSchema() { - std::shared_ptr message; - RETURN_NOT_OK(ReadNextMessage(Message::SCHEMA, &message)); - - SchemaMetadata schema_meta(message); - RETURN_NOT_OK(schema_meta.GetDictionaryTypes(&dictionary_types_)); - - // TODO(wesm): In future, we may want to reconcile the ids in the stream with - // those found in the schema - int num_dictionaries = static_cast(dictionary_types_.size()); - for (int i = 0; i < num_dictionaries; ++i) { - RETURN_NOT_OK(ReadNextDictionary()); - } - - return schema_meta.GetSchema(dictionary_memo_, &schema_); - } - - Status GetNextRecordBatch(std::shared_ptr* batch) { - std::shared_ptr message; - RETURN_NOT_OK(ReadNextMessage(Message::RECORD_BATCH, 
&message)); - - if (message == nullptr) { - // End of stream - *batch = nullptr; - return Status::OK(); - } - - RecordBatchMetadata batch_metadata(message); - - std::shared_ptr batch_body; - RETURN_NOT_OK(ReadExact(message->body_length(), &batch_body)); - io::BufferReader reader(batch_body); - return ReadRecordBatch(batch_metadata, schema_, &reader, batch); - } - - std::shared_ptr schema() const { return schema_; } - - private: - // dictionary_id -> type - DictionaryTypeMap dictionary_types_; - - DictionaryMemo dictionary_memo_; - - std::shared_ptr stream_; - std::shared_ptr schema_; -}; - -StreamReader::StreamReader() { - impl_.reset(new StreamReaderImpl()); -} - -StreamReader::~StreamReader() {} - -Status StreamReader::Open(const std::shared_ptr& stream, - std::shared_ptr* reader) { - // Private ctor - *reader = std::shared_ptr(new StreamReader()); - return (*reader)->impl_->Open(stream); -} - -std::shared_ptr StreamReader::schema() const { - return impl_->schema(); -} - -Status StreamReader::GetNextRecordBatch(std::shared_ptr* batch) { - return impl_->GetNextRecordBatch(batch); -} - -} // namespace ipc -} // namespace arrow diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc new file mode 100644 index 0000000000000..975b0d10cae7d --- /dev/null +++ b/cpp/src/arrow/ipc/writer.cc @@ -0,0 +1,287 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include "arrow/ipc/writer.h" + +#include +#include +#include +#include + +#include "arrow/buffer.h" +#include "arrow/io/interfaces.h" +#include "arrow/io/memory.h" +#include "arrow/ipc/adapter.h" +#include "arrow/ipc/metadata-internal.h" +#include "arrow/ipc/metadata.h" +#include "arrow/ipc/util.h" +#include "arrow/memory_pool.h" +#include "arrow/schema.h" +#include "arrow/status.h" +#include "arrow/table.h" +#include "arrow/util/logging.h" + +namespace arrow { +namespace ipc { + +// ---------------------------------------------------------------------- +// Stream writer implementation + +class StreamWriter::StreamWriterImpl { + public: + StreamWriterImpl() + : dictionary_memo_(std::make_shared()), + pool_(default_memory_pool()), + position_(-1), + started_(false) {} + + virtual ~StreamWriterImpl() = default; + + Status Open(io::OutputStream* sink, const std::shared_ptr& schema) { + sink_ = sink; + schema_ = schema; + return UpdatePosition(); + } + + virtual Status Start() { + std::shared_ptr schema_fb; + RETURN_NOT_OK(WriteSchemaMessage(*schema_, dictionary_memo_.get(), &schema_fb)); + + int32_t flatbuffer_size = schema_fb->size(); + RETURN_NOT_OK( + Write(reinterpret_cast(&flatbuffer_size), sizeof(int32_t))); + + // Write the flatbuffer + RETURN_NOT_OK(Write(schema_fb->data(), flatbuffer_size)); + + // If there are any dictionaries, write them as the next messages + RETURN_NOT_OK(WriteDictionaries()); + + started_ = true; + return Status::OK(); + } + + virtual Status Close() { + // Write the schema if not already written + // User is responsible for closing the OutputStream + return CheckStarted(); + } + + Status CheckStarted() { + if (!started_) { return Start(); } + return Status::OK(); + } + + Status UpdatePosition() { return sink_->Tell(&position_); } + + Status WriteDictionaries() { + const DictionaryMap& id_to_dictionary = dictionary_memo_->id_to_dictionary(); + + dictionaries_.resize(id_to_dictionary.size()); + + // TODO(wesm): does sorting by id yield any benefit? + int dict_index = 0; + for (const auto& entry : id_to_dictionary) { + FileBlock* block = &dictionaries_[dict_index++]; + + block->offset = position_; + + // Frame of reference in file format is 0, see ARROW-384 + const int64_t buffer_start_offset = 0; + RETURN_NOT_OK(WriteDictionary(entry.first, entry.second, buffer_start_offset, sink_, + &block->metadata_length, &block->body_length, pool_)); + RETURN_NOT_OK(UpdatePosition()); + DCHECK(position_ % 8 == 0) << "WriteDictionary did not perform aligned writes"; + } + + return Status::OK(); + } + + Status WriteRecordBatch(const RecordBatch& batch, FileBlock* block) { + RETURN_NOT_OK(CheckStarted()); + + block->offset = position_; + + // Frame of reference in file format is 0, see ARROW-384 + const int64_t buffer_start_offset = 0; + RETURN_NOT_OK(arrow::ipc::WriteRecordBatch(batch, buffer_start_offset, sink_, + &block->metadata_length, &block->body_length, pool_)); + RETURN_NOT_OK(UpdatePosition()); + + DCHECK(position_ % 8 == 0) << "WriteRecordBatch did not perform aligned writes"; + + return Status::OK(); + } + + Status WriteRecordBatch(const RecordBatch& batch) { + // Push an empty FileBlock. Can be written in the footer later + record_batches_.emplace_back(0, 0, 0); + return WriteRecordBatch(batch, &record_batches_[record_batches_.size() - 1]); + } + + // Adds padding bytes if necessary to ensure all memory blocks are written on + // 8-byte boundaries. 
+ Status Align() { + int64_t remainder = PaddedLength(position_) - position_; + if (remainder > 0) { return Write(kPaddingBytes, remainder); } + return Status::OK(); + } + + // Write data and update position + Status Write(const uint8_t* data, int64_t nbytes) { + RETURN_NOT_OK(sink_->Write(data, nbytes)); + position_ += nbytes; + return Status::OK(); + } + + // Write and align + Status WriteAligned(const uint8_t* data, int64_t nbytes) { + RETURN_NOT_OK(Write(data, nbytes)); + return Align(); + } + + void set_memory_pool(MemoryPool* pool) { pool_ = pool; } + + protected: + io::OutputStream* sink_; + std::shared_ptr schema_; + + // When writing out the schema, we keep track of all the dictionaries we + // encounter, as they must be written out first in the stream + std::shared_ptr dictionary_memo_; + + MemoryPool* pool_; + + int64_t position_; + bool started_; + + std::vector dictionaries_; + std::vector record_batches_; +}; + +StreamWriter::StreamWriter() { + impl_.reset(new StreamWriterImpl()); +} + +Status StreamWriter::WriteRecordBatch(const RecordBatch& batch) { + return impl_->WriteRecordBatch(batch); +} + +void StreamWriter::set_memory_pool(MemoryPool* pool) { + impl_->set_memory_pool(pool); +} + +Status StreamWriter::Open(io::OutputStream* sink, const std::shared_ptr& schema, + std::shared_ptr* out) { + // ctor is private + *out = std::shared_ptr(new StreamWriter()); + return (*out)->impl_->Open(sink, schema); +} + +Status StreamWriter::Close() { + return impl_->Close(); +} + +// ---------------------------------------------------------------------- +// File writer implementation + +static flatbuffers::Offset> +FileBlocksToFlatbuffer(FBB& fbb, const std::vector& blocks) { + std::vector fb_blocks; + + for (const FileBlock& block : blocks) { + fb_blocks.emplace_back(block.offset, block.metadata_length, block.body_length); + } + + return fbb.CreateVectorOfStructs(fb_blocks); +} + +Status WriteFileFooter(const Schema& schema, const std::vector& dictionaries, + const std::vector& record_batches, DictionaryMemo* dictionary_memo, + io::OutputStream* out) { + FBB fbb; + + flatbuffers::Offset fb_schema; + RETURN_NOT_OK(SchemaToFlatbuffer(fbb, schema, dictionary_memo, &fb_schema)); + + auto fb_dictionaries = FileBlocksToFlatbuffer(fbb, dictionaries); + auto fb_record_batches = FileBlocksToFlatbuffer(fbb, record_batches); + + auto footer = flatbuf::CreateFooter( + fbb, kMetadataVersion, fb_schema, fb_dictionaries, fb_record_batches); + + fbb.Finish(footer); + + int32_t size = fbb.GetSize(); + + return out->Write(fbb.GetBufferPointer(), size); +} + +class FileWriter::FileWriterImpl : public StreamWriter::StreamWriterImpl { + public: + using BASE = StreamWriter::StreamWriterImpl; + + Status Start() override { + RETURN_NOT_OK(WriteAligned( + reinterpret_cast(kArrowMagicBytes), strlen(kArrowMagicBytes))); + + // We write the schema at the start of the file (and the end). 
This also + // writes all the dictionaries at the beginning of the file + return BASE::Start(); + } + + Status Close() override { + // Write metadata + int64_t initial_position = position_; + RETURN_NOT_OK(WriteFileFooter( + *schema_, dictionaries_, record_batches_, dictionary_memo_.get(), sink_)); + RETURN_NOT_OK(UpdatePosition()); + + // Write footer length + int32_t footer_length = position_ - initial_position; + + if (footer_length <= 0) { return Status::Invalid("Invalid file footer"); } + + RETURN_NOT_OK( + Write(reinterpret_cast(&footer_length), sizeof(int32_t))); + + // Write magic bytes to end file + return Write( + reinterpret_cast(kArrowMagicBytes), strlen(kArrowMagicBytes)); + } +}; + +FileWriter::FileWriter() { + impl_.reset(new FileWriterImpl()); +} + +Status FileWriter::Open(io::OutputStream* sink, const std::shared_ptr& schema, + std::shared_ptr* out) { + *out = std::shared_ptr(new FileWriter()); // ctor is private + return (*out)->impl_->Open(sink, schema); +} + +Status FileWriter::WriteRecordBatch(const RecordBatch& batch) { + return impl_->WriteRecordBatch(batch); +} + +Status FileWriter::Close() { + return impl_->Close(); +} + +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/stream.h b/cpp/src/arrow/ipc/writer.h similarity index 52% rename from cpp/src/arrow/ipc/stream.h rename to cpp/src/arrow/ipc/writer.h index 1c3f65e49af32..7aff71e18e486 100644 --- a/cpp/src/arrow/ipc/stream.h +++ b/cpp/src/arrow/ipc/writer.h @@ -39,23 +39,12 @@ class Status; namespace io { -class InputStream; class OutputStream; } // namespace io namespace ipc { -struct ARROW_EXPORT FileBlock { - FileBlock() {} - FileBlock(int64_t offset, int32_t metadata_length, int64_t body_length) - : offset(offset), metadata_length(metadata_length), body_length(body_length) {} - - int64_t offset; - int32_t metadata_length; - int64_t body_length; -}; - class ARROW_EXPORT StreamWriter { public: virtual ~StreamWriter() = default; @@ -74,61 +63,27 @@ class ARROW_EXPORT StreamWriter { void set_memory_pool(MemoryPool* pool); protected: - StreamWriter(io::OutputStream* sink, const std::shared_ptr& schema); - - virtual Status Start(); - - Status CheckStarted(); - Status UpdatePosition(); - - Status WriteDictionaries(); - - Status WriteRecordBatch(const RecordBatch& batch, FileBlock* block); - - // Adds padding bytes if necessary to ensure all memory blocks are written on - // 8-byte boundaries. - Status Align(); - - // Write data and update position - Status Write(const uint8_t* data, int64_t nbytes); - - // Write and align - Status WriteAligned(const uint8_t* data, int64_t nbytes); - - io::OutputStream* sink_; - std::shared_ptr schema_; - - // When writing out the schema, we keep track of all the dictionaries we - // encounter, as they must be written out first in the stream - std::shared_ptr dictionary_memo_; - - MemoryPool* pool_; - - int64_t position_; - bool started_; - - std::vector dictionaries_; - std::vector record_batches_; + StreamWriter(); + class ARROW_NO_EXPORT StreamWriterImpl; + std::unique_ptr impl_; }; -class ARROW_EXPORT StreamReader { - public: - ~StreamReader(); - - // Open an stream. 
- static Status Open(const std::shared_ptr& stream, - std::shared_ptr* reader); +Status WriteFileFooter(const Schema& schema, const std::vector& dictionaries, + const std::vector& record_batches, DictionaryMemo* dictionary_memo, + io::OutputStream* out); - std::shared_ptr schema() const; +class ARROW_EXPORT FileWriter : public StreamWriter { + public: + static Status Open(io::OutputStream* sink, const std::shared_ptr& schema, + std::shared_ptr* out); - // Returned batch is nullptr when end of stream reached - Status GetNextRecordBatch(std::shared_ptr* batch); + Status WriteRecordBatch(const RecordBatch& batch) override; + Status Close() override; private: - StreamReader(); - - class ARROW_NO_EXPORT StreamReaderImpl; - std::unique_ptr impl_; + FileWriter(); + class ARROW_NO_EXPORT FileWriterImpl; + std::unique_ptr impl_; }; } // namespace ipc diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt index 19b1e193d4228..8d9afccf867df 100644 --- a/cpp/src/arrow/util/CMakeLists.txt +++ b/cpp/src/arrow/util/CMakeLists.txt @@ -68,26 +68,4 @@ if (ARROW_BUILD_BENCHMARKS) endif() endif() -if (ARROW_IPC AND ARROW_BUILD_UTILITIES) - set(UTIL_LINK_LIBS - arrow_ipc_static - arrow_io_static - arrow_static - boost_filesystem_static - boost_system_static - dl) - - if (NOT APPLE) - set(UTIL_LINK_LIBS - ${UTIL_LINK_LIBS} - boost_filesystem_static - boost_system_static) - endif() - - add_executable(file-to-stream file-to-stream.cc) - target_link_libraries(file-to-stream ${UTIL_LINK_LIBS}) - add_executable(stream-to-file stream-to-file.cc) - target_link_libraries(stream-to-file ${UTIL_LINK_LIBS}) -endif() - ADD_ARROW_TEST(bit-util-test) diff --git a/python/pyarrow/includes/libarrow_ipc.pxd b/python/pyarrow/includes/libarrow_ipc.pxd index afc7dbd36e5f0..10c70a96b0ab2 100644 --- a/python/pyarrow/includes/libarrow_ipc.pxd +++ b/python/pyarrow/includes/libarrow_ipc.pxd @@ -23,7 +23,7 @@ from pyarrow.includes.libarrow_io cimport (InputStream, OutputStream, ReadableFileInterface) -cdef extern from "arrow/ipc/stream.h" namespace "arrow::ipc" nogil: +cdef extern from "arrow/ipc/api.h" namespace "arrow::ipc" nogil: cdef cppclass CStreamWriter " arrow::ipc::StreamWriter": @staticmethod @@ -43,9 +43,6 @@ cdef extern from "arrow/ipc/stream.h" namespace "arrow::ipc" nogil: CStatus GetNextRecordBatch(shared_ptr[CRecordBatch]* batch) - -cdef extern from "arrow/ipc/file.h" namespace "arrow::ipc" nogil: - cdef cppclass CFileWriter " arrow::ipc::FileWriter"(CStreamWriter): @staticmethod CStatus Open(OutputStream* sink, const shared_ptr[CSchema]& schema, From dc103feaf0bb07b95f0c81afe0e342f239319dec Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 27 Feb 2017 08:13:29 +0100 Subject: [PATCH 0344/1644] ARROW-557: [Python] Add option to explicitly opt in to HDFS tests, do not implicitly skip I have ``` $ py.test pyarrow/tests/test_hdfs.py ================================== test session starts ================================== platform linux2 -- Python 2.7.11, pytest-2.9.0, py-1.4.31, pluggy-0.3.1 rootdir: /home/wesm/code/arrow/python, inifile: collected 15 items pyarrow/tests/test_hdfs.py sssssssssssssss ``` But ``` $ py.test pyarrow/tests/test_hdfs.py --hdfs -v ================================== test session starts ================================== platform linux2 -- Python 2.7.11, pytest-2.9.0, py-1.4.31, pluggy-0.3.1 -- /home/wesm/anaconda3/envs/py27/bin/python cachedir: .cache rootdir: /home/wesm/code/arrow/python, inifile: collected 15 items 
pyarrow/tests/test_hdfs.py::TestLibHdfs::test_hdfs_close PASSED pyarrow/tests/test_hdfs.py::TestLibHdfs::test_hdfs_download_upload PASSED pyarrow/tests/test_hdfs.py::TestLibHdfs::test_hdfs_file_context_manager PASSED pyarrow/tests/test_hdfs.py::TestLibHdfs::test_hdfs_ls PASSED pyarrow/tests/test_hdfs.py::TestLibHdfs::test_hdfs_mkdir PASSED pyarrow/tests/test_hdfs.py::TestLibHdfs::test_hdfs_orphaned_file PASSED pyarrow/tests/test_hdfs.py::TestLibHdfs::test_hdfs_read_multiple_parquet_files SKIPPED pyarrow/tests/test_hdfs.py::TestLibHdfs::test_hdfs_read_whole_file PASSED pyarrow/tests/test_hdfs.py::TestLibHdfs3::test_hdfs_close PASSED pyarrow/tests/test_hdfs.py::TestLibHdfs3::test_hdfs_download_upload PASSED pyarrow/tests/test_hdfs.py::TestLibHdfs3::test_hdfs_file_context_manager PASSED pyarrow/tests/test_hdfs.py::TestLibHdfs3::test_hdfs_ls PASSED pyarrow/tests/test_hdfs.py::TestLibHdfs3::test_hdfs_mkdir PASSED pyarrow/tests/test_hdfs.py::TestLibHdfs3::test_hdfs_read_multiple_parquet_files SKIPPED pyarrow/tests/test_hdfs.py::TestLibHdfs3::test_hdfs_read_whole_file PASSED ``` The `py.test pyarrow --only-hdfs` option will run only the HDFS tests. Author: Wes McKinney Closes #353 from wesm/ARROW-557 and squashes the following commits: 52e03db [Wes McKinney] Add conftest.py file, hdfs group to opt in to HDFS tests with --hdfs --- LICENSE.txt | 12 ------ NOTICE.txt | 4 ++ python/pyarrow/tests/conftest.py | 62 +++++++++++++++++++++++++++++++ python/pyarrow/tests/test_hdfs.py | 5 ++- 4 files changed, 69 insertions(+), 14 deletions(-) create mode 100644 python/pyarrow/tests/conftest.py diff --git a/LICENSE.txt b/LICENSE.txt index c3bec4385507e..d645695673349 100644 --- a/LICENSE.txt +++ b/LICENSE.txt @@ -200,15 +200,3 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. - --------------------------------------------------------------------------------- - -This product includes code from Apache Kudu. - - * cpp/cmake_modules/CompilerInfo.cmake is based on Kudu's cmake_modules/CompilerInfo.cmake - -Copyright: 2016 The Apache Software Foundation. -Home page: https://kudu.apache.org/ -License: http://www.apache.org/licenses/LICENSE-2.0 - --------------------------------------------------------------------------------- diff --git a/NOTICE.txt b/NOTICE.txt index 02cb4dd6ee001..e71835c233de6 100644 --- a/NOTICE.txt +++ b/NOTICE.txt @@ -42,6 +42,10 @@ This product includes software from the CMake project This product includes software from https://github.com/matthew-brett/multibuild (BSD 2-clause) * Copyright (c) 2013-2016, Matt Terry and Matthew Brett; all rights reserved. +This product includes software from the Ibis project (Apache 2.0) + * Copyright (c) 2015 Cloudera, Inc. + * https://github.com/cloudera/ibis + -------------------------------------------------------------------------------- This product includes code from Apache Kudu, which includes the following in diff --git a/python/pyarrow/tests/conftest.py b/python/pyarrow/tests/conftest.py new file mode 100644 index 0000000000000..d5b4b69d97392 --- /dev/null +++ b/python/pyarrow/tests/conftest.py @@ -0,0 +1,62 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +from pytest import skip + + +groups = ['hdfs'] + + +def pytest_configure(config): + pass + + +def pytest_addoption(parser): + for group in groups: + parser.addoption('--{0}'.format(group), action='store_true', + default=False, + help=('Enable the {0} test group'.format(group))) + + for group in groups: + parser.addoption('--only-{0}'.format(group), action='store_true', + default=False, + help=('Run only the {0} test group'.format(group))) + + +def pytest_runtest_setup(item): + only_set = False + + for group in groups: + only_flag = '--only-{0}'.format(group) + flag = '--{0}'.format(group) + + if item.config.getoption(only_flag): + only_set = True + elif getattr(item.obj, group, None): + if not item.config.getoption(flag): + skip('{0} NOT enabled'.format(flag)) + + if only_set: + skip_item = True + for group in groups: + only_flag = '--only-{0}'.format(group) + if (getattr(item.obj, group, False) and + item.config.getoption(only_flag)): + skip_item = False + + if skip_item: + skip('Only running some groups with only flags') diff --git a/python/pyarrow/tests/test_hdfs.py b/python/pyarrow/tests/test_hdfs.py index cb24adb73adc9..b8f7e25233421 100644 --- a/python/pyarrow/tests/test_hdfs.py +++ b/python/pyarrow/tests/test_hdfs.py @@ -48,6 +48,7 @@ def hdfs_test_client(driver='libhdfs'): return HdfsClient(host, port, user, driver=driver) +@pytest.mark.hdfs class HdfsTestCases(object): def _make_test_file(self, hdfs, test_name, test_path, test_data): @@ -190,7 +191,7 @@ class TestLibHdfs(HdfsTestCases, unittest.TestCase): @classmethod def check_driver(cls): if not io.have_libhdfs(): - pytest.skip('No libhdfs available on system') + pytest.fail('No libhdfs available on system') def test_hdfs_orphaned_file(self): hdfs = hdfs_test_client() @@ -209,4 +210,4 @@ class TestLibHdfs3(HdfsTestCases, unittest.TestCase): @classmethod def check_driver(cls): if not io.have_libhdfs3(): - pytest.skip('No libhdfs3 available on system') + pytest.fail('No libhdfs3 available on system') From 01a67f3ff3f43f504dc92b66e04473a8b29053f1 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 27 Feb 2017 08:14:10 +0100 Subject: [PATCH 0345/1644] ARROW-493: [C++] Permit large (length > INT32_MAX) arrays in memory This commit relaxes the INT32_MAX length requirement for in-memory data. It does not change the Arrow memory format, nor does it permit arrays over INT32_MAX elements to be included in a RecordBatch message sent in the streaming or file formats. The purpose of this change is to enable Arrow containers to do zero-copy addressing of large datasets (generally of fixed-size elements) produced by other systems. Should those systems wish to send messages to Java, they will need to break those large arrays up into smaller pieces. We can create utilities to assist in copy-free segmentation of large in-memory datasets into compatible chunksizes. If the large data is only being used in C++-land, then there are no problems. 
This is a helpful change en route to adding an `arrow::Tensor` type per ARROW-550, and probably some other things. This also includes ARROW-584, as I wanted to be sure that I caught all the places in the codebase where there were imprecise integer conversions. cc @pcmoritz @robertnishihara Author: Wes McKinney Closes #352 from wesm/ARROW-493 and squashes the following commits: 013d8cc [Wes McKinney] Fix some more compiler warnings 13c4067 [Wes McKinney] Do not pass CMAKE_CXX_FLAGS to googletest ep dc50d80 [Wes McKinney] Fix last imprecise conversions c8e90bc [Wes McKinney] Fix many imprecise integer conversions 6bacdf3 [Wes McKinney] Permit in-memory arrays with more than INT32_MAX elements in Array and Builder classes. Raise if large arrays used in IPC context --- ci/travis_before_script_cpp.sh | 2 +- cpp/CMakeLists.txt | 6 +- cpp/src/arrow/array-dictionary-test.cc | 2 +- cpp/src/arrow/array-primitive-test.cc | 69 ++++++------- cpp/src/arrow/array-string-test.cc | 24 ++--- cpp/src/arrow/array-test.cc | 17 +++- cpp/src/arrow/array-union-test.cc | 2 +- cpp/src/arrow/array.cc | 84 ++++++++-------- cpp/src/arrow/array.h | 132 ++++++++++++------------- cpp/src/arrow/buffer.h | 14 +-- cpp/src/arrow/builder.cc | 79 +++++++-------- cpp/src/arrow/builder.h | 63 ++++++------ cpp/src/arrow/column-benchmark.cc | 2 +- cpp/src/arrow/column.cc | 6 +- cpp/src/arrow/column.h | 2 +- cpp/src/arrow/compare.cc | 48 ++++----- cpp/src/arrow/compare.h | 2 +- cpp/src/arrow/io/file.cc | 8 +- cpp/src/arrow/io/hdfs.cc | 15 +-- cpp/src/arrow/io/io-hdfs-test.cc | 2 +- cpp/src/arrow/ipc/adapter.cc | 24 +++-- cpp/src/arrow/ipc/ipc-json-test.cc | 2 +- cpp/src/arrow/ipc/json-internal.cc | 61 +++++++----- cpp/src/arrow/ipc/json.cc | 4 +- cpp/src/arrow/ipc/metadata-internal.cc | 7 +- cpp/src/arrow/ipc/reader.cc | 2 +- cpp/src/arrow/ipc/test-common.h | 24 ++--- cpp/src/arrow/ipc/writer.cc | 4 +- cpp/src/arrow/pretty_print.cc | 2 +- cpp/src/arrow/schema.cc | 2 +- cpp/src/arrow/schema.h | 2 +- cpp/src/arrow/status.cc | 2 +- cpp/src/arrow/table-test.cc | 4 +- cpp/src/arrow/table.cc | 10 +- cpp/src/arrow/table.h | 14 +-- cpp/src/arrow/test-util.h | 47 ++++----- cpp/src/arrow/type.h | 12 +-- cpp/src/arrow/type_traits.h | 54 +++++++--- cpp/src/arrow/util/bit-util.cc | 4 +- cpp/src/arrow/util/bit-util.h | 25 ++--- python/pyarrow/array.pxd | 4 +- python/pyarrow/array.pyx | 2 +- python/pyarrow/includes/libarrow.pxd | 16 +-- python/pyarrow/scalar.pxd | 8 +- python/pyarrow/scalar.pyx | 10 +- python/pyarrow/table.pyx | 2 +- python/src/pyarrow/adapters/builtin.cc | 4 +- python/src/pyarrow/adapters/pandas.cc | 13 ++- 48 files changed, 508 insertions(+), 436 deletions(-) diff --git a/ci/travis_before_script_cpp.sh b/ci/travis_before_script_cpp.sh index feacf8f8e361a..f804a38e76484 100755 --- a/ci/travis_before_script_cpp.sh +++ b/ci/travis_before_script_cpp.sh @@ -36,7 +36,7 @@ CMAKE_COMMON_FLAGS="\ if [ $TRAVIS_OS_NAME == "linux" ]; then cmake -DARROW_TEST_MEMCHECK=on \ $CMAKE_COMMON_FLAGS \ - -DARROW_CXXFLAGS=-Werror \ + -DARROW_CXXFLAGS="-Wconversion -Werror" \ $CPP_DIR else cmake $CMAKE_COMMON_FLAGS \ diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index be3d4b98cf77f..f6dab788b26d5 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -123,7 +123,9 @@ endif() include(SetupCxxFlags) # Add common flags -set(CMAKE_CXX_FLAGS "${ARROW_CXXFLAGS} ${CXX_COMMON_FLAGS} ${CMAKE_CXX_FLAGS}") +set(CMAKE_CXX_FLAGS "${CXX_COMMON_FLAGS} ${CMAKE_CXX_FLAGS}") +set(EP_CXX_FLAGS "${CMAKE_CXX_FLAGS}") +set(CMAKE_CXX_FLAGS "${ARROW_CXXFLAGS} 
${CMAKE_CXX_FLAGS}") # Determine compiler version include(CompilerInfo) @@ -452,7 +454,7 @@ if(ARROW_BUILD_TESTS) set(GTEST_CMAKE_CXX_FLAGS "-fPIC") endif() string(TOUPPER ${CMAKE_BUILD_TYPE} UPPERCASE_BUILD_TYPE) - set(GTEST_CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CMAKE_CXX_FLAGS_${UPPERCASE_BUILD_TYPE}} ${GTEST_CMAKE_CXX_FLAGS}") + set(GTEST_CMAKE_CXX_FLAGS "${EP_CXX_FLAGS} ${CMAKE_CXX_FLAGS_${UPPERCASE_BUILD_TYPE}} ${GTEST_CMAKE_CXX_FLAGS}") set(GTEST_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/googletest_ep-prefix/src/googletest_ep") set(GTEST_INCLUDE_DIR "${GTEST_PREFIX}/include") diff --git a/cpp/src/arrow/array-dictionary-test.cc b/cpp/src/arrow/array-dictionary-test.cc index 61381b7671180..0c4e628111a15 100644 --- a/cpp/src/arrow/array-dictionary-test.cc +++ b/cpp/src/arrow/array-dictionary-test.cc @@ -95,7 +95,7 @@ TEST(TestDictionary, Equals) { ASSERT_FALSE(array->RangeEquals(1, 3, 1, array4)); // ARROW-33 Test slices - const int size = array->length(); + const int64_t size = array->length(); std::shared_ptr slice, slice2; slice = array->Array::Slice(2); diff --git a/cpp/src/arrow/array-primitive-test.cc b/cpp/src/arrow/array-primitive-test.cc index f8bbd774d483c..7b36275cbabfb 100644 --- a/cpp/src/arrow/array-primitive-test.cc +++ b/cpp/src/arrow/array-primitive-test.cc @@ -97,7 +97,7 @@ class TestPrimitiveBuilder : public TestBuilder { builder_nn_ = std::dynamic_pointer_cast(tmp); } - void RandomData(int N, double pct_null = 0.1) { + void RandomData(int64_t N, double pct_null = 0.1) { Attrs::draw(N, &draws_); valid_bytes_.resize(N); @@ -105,13 +105,13 @@ class TestPrimitiveBuilder : public TestBuilder { } void Check(const std::shared_ptr& builder, bool nullable) { - int size = builder->length(); + int64_t size = builder->length(); auto ex_data = std::make_shared( reinterpret_cast(draws_.data()), size * sizeof(T)); std::shared_ptr ex_null_bitmap; - int32_t ex_null_count = 0; + int64_t ex_null_count = 0; if (nullable) { ex_null_bitmap = test::bytes_to_null_buffer(valid_bytes_); @@ -157,18 +157,18 @@ class TestPrimitiveBuilder : public TestBuilder { return std::shared_ptr(new Type()); \ } -#define PINT_DECL(CapType, c_type, LOWER, UPPER) \ - struct P##CapType { \ - PTYPE_DECL(CapType, c_type); \ - static void draw(int N, vector* draws) { \ - test::randint(N, LOWER, UPPER, draws); \ - } \ +#define PINT_DECL(CapType, c_type, LOWER, UPPER) \ + struct P##CapType { \ + PTYPE_DECL(CapType, c_type); \ + static void draw(int64_t N, vector* draws) { \ + test::randint(N, LOWER, UPPER, draws); \ + } \ } #define PFLOAT_DECL(CapType, c_type, LOWER, UPPER) \ struct P##CapType { \ PTYPE_DECL(CapType, c_type); \ - static void draw(int N, vector* draws) { \ + static void draw(int64_t N, vector* draws) { \ test::random_real(N, 0, LOWER, UPPER, draws); \ } \ } @@ -191,7 +191,7 @@ struct PBoolean { }; template <> -void TestPrimitiveBuilder::RandomData(int N, double pct_null) { +void TestPrimitiveBuilder::RandomData(int64_t N, double pct_null) { draws_.resize(N); valid_bytes_.resize(N); @@ -202,12 +202,12 @@ void TestPrimitiveBuilder::RandomData(int N, double pct_null) { template <> void TestPrimitiveBuilder::Check( const std::shared_ptr& builder, bool nullable) { - int size = builder->length(); + int64_t size = builder->length(); auto ex_data = test::bytes_to_null_buffer(draws_); std::shared_ptr ex_null_bitmap; - int32_t ex_null_count = 0; + int64_t ex_null_count = 0; if (nullable) { ex_null_bitmap = test::bytes_to_null_buffer(valid_bytes_); @@ -233,7 +233,7 @@ void TestPrimitiveBuilder::Check( 
ASSERT_EQ(expected->length(), result->length()); - for (int i = 0; i < result->length(); ++i) { + for (int64_t i = 0; i < result->length(); ++i) { if (nullable) { ASSERT_EQ(valid_bytes_[i] == 0, result->IsNull(i)) << i; } bool actual = BitUtil::GetBit(result->data()->data(), i); ASSERT_EQ(static_cast(draws_[i]), actual) << i; @@ -256,7 +256,7 @@ TYPED_TEST_CASE(TestPrimitiveBuilder, Primitives); TYPED_TEST(TestPrimitiveBuilder, TestInit) { DECL_TYPE(); - int n = 1000; + int64_t n = 1000; ASSERT_OK(this->builder_->Reserve(n)); ASSERT_EQ(BitUtil::NextPower2(n), this->builder_->capacity()); ASSERT_EQ(BitUtil::NextPower2(TypeTraits::bytes_required(n)), @@ -267,15 +267,15 @@ TYPED_TEST(TestPrimitiveBuilder, TestInit) { } TYPED_TEST(TestPrimitiveBuilder, TestAppendNull) { - int size = 1000; - for (int i = 0; i < size; ++i) { + int64_t size = 1000; + for (int64_t i = 0; i < size; ++i) { ASSERT_OK(this->builder_->AppendNull()); } std::shared_ptr result; ASSERT_OK(this->builder_->Finish(&result)); - for (int i = 0; i < size; ++i) { + for (int64_t i = 0; i < size; ++i) { ASSERT_TRUE(result->IsNull(i)) << i; } } @@ -283,7 +283,7 @@ TYPED_TEST(TestPrimitiveBuilder, TestAppendNull) { TYPED_TEST(TestPrimitiveBuilder, TestArrayDtorDealloc) { DECL_T(); - int size = 1000; + int64_t size = 1000; vector& draws = this->draws_; vector& valid_bytes = this->valid_bytes_; @@ -294,7 +294,7 @@ TYPED_TEST(TestPrimitiveBuilder, TestArrayDtorDealloc) { this->builder_->Reserve(size); - int i; + int64_t i; for (i = 0; i < size; ++i) { if (valid_bytes[i] > 0) { this->builder_->Append(draws[i]); @@ -314,7 +314,7 @@ TYPED_TEST(TestPrimitiveBuilder, TestArrayDtorDealloc) { TYPED_TEST(TestPrimitiveBuilder, Equality) { DECL_T(); - const int size = 1000; + const int64_t size = 1000; this->RandomData(size); vector& draws = this->draws_; vector& valid_bytes = this->valid_bytes_; @@ -326,10 +326,11 @@ TYPED_TEST(TestPrimitiveBuilder, Equality) { // Make the not equal array by negating the first valid element with itself. 
const auto first_valid = std::find_if( valid_bytes.begin(), valid_bytes.end(), [](uint8_t valid) { return valid > 0; }); - const int first_valid_idx = std::distance(valid_bytes.begin(), first_valid); + const int64_t first_valid_idx = std::distance(valid_bytes.begin(), first_valid); // This should be true with a very high probability, but might introduce flakiness ASSERT_LT(first_valid_idx, size - 1); - draws[first_valid_idx] = ~*reinterpret_cast(&draws[first_valid_idx]); + draws[first_valid_idx] = + static_cast(~*reinterpret_cast(&draws[first_valid_idx])); ASSERT_OK(MakeArray(valid_bytes, draws, size, builder, &unequal_array)); // test normal equality @@ -350,7 +351,7 @@ TYPED_TEST(TestPrimitiveBuilder, Equality) { TYPED_TEST(TestPrimitiveBuilder, SliceEquality) { DECL_T(); - const int size = 1000; + const int64_t size = 1000; this->RandomData(size); vector& draws = this->draws_; vector& valid_bytes = this->valid_bytes_; @@ -383,7 +384,7 @@ TYPED_TEST(TestPrimitiveBuilder, SliceEquality) { TYPED_TEST(TestPrimitiveBuilder, TestAppendScalar) { DECL_T(); - const int size = 10000; + const int64_t size = 10000; vector& draws = this->draws_; vector& valid_bytes = this->valid_bytes_; @@ -393,8 +394,8 @@ TYPED_TEST(TestPrimitiveBuilder, TestAppendScalar) { this->builder_->Reserve(1000); this->builder_nn_->Reserve(1000); - int i; - int null_count = 0; + int64_t i; + int64_t null_count = 0; // Append the first 1000 for (i = 0; i < 1000; ++i) { if (valid_bytes[i] > 0) { @@ -440,14 +441,14 @@ TYPED_TEST(TestPrimitiveBuilder, TestAppendScalar) { TYPED_TEST(TestPrimitiveBuilder, TestAppendVector) { DECL_T(); - int size = 10000; + int64_t size = 10000; this->RandomData(size); vector& draws = this->draws_; vector& valid_bytes = this->valid_bytes_; // first slug - int K = 1000; + int64_t K = 1000; ASSERT_OK(this->builder_->Append(draws.data(), K, valid_bytes.data())); ASSERT_OK(this->builder_nn_->Append(draws.data(), K)); @@ -470,7 +471,7 @@ TYPED_TEST(TestPrimitiveBuilder, TestAppendVector) { } TYPED_TEST(TestPrimitiveBuilder, TestAdvance) { - int n = 1000; + int64_t n = 1000; ASSERT_OK(this->builder_->Reserve(n)); ASSERT_OK(this->builder_->Advance(100)); @@ -478,14 +479,14 @@ TYPED_TEST(TestPrimitiveBuilder, TestAdvance) { ASSERT_OK(this->builder_->Advance(900)); - int too_many = this->builder_->capacity() - 1000 + 1; + int64_t too_many = this->builder_->capacity() - 1000 + 1; ASSERT_RAISES(Invalid, this->builder_->Advance(too_many)); } TYPED_TEST(TestPrimitiveBuilder, TestResize) { DECL_TYPE(); - int cap = kMinBuilderCapacity * 2; + int64_t cap = kMinBuilderCapacity * 2; ASSERT_OK(this->builder_->Reserve(cap)); ASSERT_EQ(cap, this->builder_->capacity()); @@ -510,7 +511,7 @@ template void CheckSliceApproxEquals() { using T = typename TYPE::c_type; - const int kSize = 50; + const int64_t kSize = 50; std::vector draws1; std::vector draws2; @@ -520,7 +521,7 @@ void CheckSliceApproxEquals() { // Make the draws equal in the sliced segment, but unequal elsewhere (to // catch not using the slice offset) - for (int i = 10; i < 30; ++i) { + for (int64_t i = 10; i < 30; ++i) { draws2[i] = draws1[i]; } diff --git a/cpp/src/arrow/array-string-test.cc b/cpp/src/arrow/array-string-test.cc index d8a35851c1238..3fdeb3cefe7d2 100644 --- a/cpp/src/arrow/array-string-test.cc +++ b/cpp/src/arrow/array-string-test.cc @@ -64,7 +64,7 @@ class TestStringArray : public ::testing::Test { } void MakeArray() { - length_ = offsets_.size() - 1; + length_ = static_cast(offsets_.size()) - 1; value_buf_ = 
test::GetBufferFromVector(chars_); offsets_buf_ = test::GetBufferFromVector(offsets_); null_bitmap_ = test::bytes_to_null_buffer(valid_bytes_); @@ -85,8 +85,8 @@ class TestStringArray : public ::testing::Test { std::shared_ptr offsets_buf_; std::shared_ptr null_bitmap_; - int null_count_; - int length_; + int64_t null_count_; + int64_t length_; std::shared_ptr strings_; }; @@ -109,7 +109,7 @@ TEST_F(TestStringArray, TestListFunctions) { for (size_t i = 0; i < expected_.size(); ++i) { ASSERT_EQ(pos, strings_->value_offset(i)); ASSERT_EQ(static_cast(expected_[i].size()), strings_->value_length(i)); - pos += expected_[i].size(); + pos += static_cast(expected_[i].size()); } } @@ -131,7 +131,7 @@ TEST_F(TestStringArray, TestGetString) { TEST_F(TestStringArray, TestEmptyStringComparison) { offsets_ = {0, 0, 0, 0, 0, 0}; offsets_buf_ = test::GetBufferFromVector(offsets_); - length_ = offsets_.size() - 1; + length_ = static_cast(offsets_.size() - 1); auto strings_a = std::make_shared( length_, offsets_buf_, nullptr, null_bitmap_, null_count_); @@ -208,7 +208,7 @@ TEST_F(TestStringBuilder, TestScalarAppend) { std::vector strings = {"", "bb", "a", "", "ccc"}; std::vector is_null = {0, 0, 0, 1, 0}; - int N = strings.size(); + int N = static_cast(strings.size()); int reps = 1000; for (int j = 0; j < reps; ++j) { @@ -263,7 +263,7 @@ class TestBinaryArray : public ::testing::Test { } void MakeArray() { - length_ = offsets_.size() - 1; + length_ = static_cast(offsets_.size() - 1); value_buf_ = test::GetBufferFromVector(chars_); offsets_buf_ = test::GetBufferFromVector(offsets_); @@ -285,8 +285,8 @@ class TestBinaryArray : public ::testing::Test { std::shared_ptr offsets_buf_; std::shared_ptr null_bitmap_; - int null_count_; - int length_; + int64_t null_count_; + int64_t length_; std::shared_ptr strings_; }; @@ -305,7 +305,7 @@ TEST_F(TestBinaryArray, TestType) { } TEST_F(TestBinaryArray, TestListFunctions) { - int pos = 0; + size_t pos = 0; for (size_t i = 0; i < expected_.size(); ++i) { ASSERT_EQ(pos, strings_->value_offset(i)); ASSERT_EQ(static_cast(expected_[i].size()), strings_->value_length(i)); @@ -376,7 +376,7 @@ TEST_F(TestBinaryBuilder, TestScalarAppend) { std::vector strings = {"", "bb", "a", "", "ccc"}; std::vector is_null = {0, 0, 0, 1, 0}; - int N = strings.size(); + int N = static_cast(strings.size()); int reps = 1000; for (int j = 0; j < reps; ++j) { @@ -425,7 +425,7 @@ void CheckSliceEquality() { std::vector strings = {"foo", "", "bar", "baz", "qux", ""}; std::vector is_null = {0, 1, 0, 1, 0, 0}; - int N = strings.size(); + int N = static_cast(strings.size()); int reps = 10; for (int j = 0; j < reps; ++j) { diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc index 45ab2740b4c16..854ebb20f53ed 100644 --- a/cpp/src/arrow/array-test.cc +++ b/cpp/src/arrow/array-test.cc @@ -58,7 +58,7 @@ TEST_F(TestArray, TestLength) { std::shared_ptr MakeArrayFromValidBytes( const std::vector& v, MemoryPool* pool) { - int32_t null_count = v.size() - std::accumulate(v.begin(), v.end(), 0); + int64_t null_count = v.size() - std::accumulate(v.begin(), v.end(), 0); std::shared_ptr null_buf = test::bytes_to_null_buffer(v); BufferBuilder value_builder(pool); @@ -121,7 +121,7 @@ TEST_F(TestArray, TestIsNull) { 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1}; // clang-format on - int32_t null_count = 0; + int64_t null_count = 0; for (uint8_t x : null_bitmap) { if (x == 0) { ++null_count; } } @@ -140,6 +140,19 @@ TEST_F(TestArray, TestIsNull) { } } +TEST_F(TestArray, BuildLargeInMemoryArray) { + const int64_t 
length = static_cast(std::numeric_limits::max()) + 1; + + BooleanBuilder builder(default_memory_pool()); + ASSERT_OK(builder.Reserve(length)); + ASSERT_OK(builder.Advance(length)); + + std::shared_ptr result; + ASSERT_OK(builder.Finish(&result)); + + ASSERT_EQ(length, result->length()); +} + TEST_F(TestArray, TestCopy) {} } // namespace arrow diff --git a/cpp/src/arrow/array-union-test.cc b/cpp/src/arrow/array-union-test.cc index eb9bd7da31b13..83c3196cab74b 100644 --- a/cpp/src/arrow/array-union-test.cc +++ b/cpp/src/arrow/array-union-test.cc @@ -37,7 +37,7 @@ TEST(TestUnionArrayAdHoc, TestSliceEquals) { std::shared_ptr batch; ASSERT_OK(ipc::MakeUnion(&batch)); - const int size = batch->num_rows(); + const int64_t size = batch->num_rows(); auto CheckUnion = [&size](std::shared_ptr array) { std::shared_ptr slice, slice2; diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index eb4c210930fb2..284bb57a02b88 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -35,13 +35,13 @@ namespace arrow { // doing some computation. To avoid doing this eagerly, we set the null count // to -1 (any negative number will do). When Array::null_count is called the // first time, the null count will be computed. See ARROW-33 -constexpr int32_t kUnknownNullCount = -1; +constexpr int64_t kUnknownNullCount = -1; // ---------------------------------------------------------------------- // Base array class -Array::Array(const std::shared_ptr& type, int32_t length, - const std::shared_ptr& null_bitmap, int32_t null_count, int32_t offset) +Array::Array(const std::shared_ptr& type, int64_t length, + const std::shared_ptr& null_bitmap, int64_t null_count, int64_t offset) : type_(type), length_(length), offset_(offset), @@ -52,7 +52,7 @@ Array::Array(const std::shared_ptr& type, int32_t length, if (null_bitmap_) { null_bitmap_data_ = null_bitmap_->data(); } } -int32_t Array::null_count() const { +int64_t Array::null_count() const { if (null_count_ < 0) { if (null_bitmap_) { null_count_ = length_ - CountSetBits(null_bitmap_data_, offset_, length_); @@ -87,14 +87,14 @@ bool Array::ApproxEquals(const std::shared_ptr& arr) const { return ApproxEquals(*arr); } -bool Array::RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, +bool Array::RangeEquals(int64_t start_idx, int64_t end_idx, int64_t other_start_idx, const std::shared_ptr& other) const { if (!other) { return false; } return RangeEquals(*other, start_idx, end_idx, other_start_idx); } -bool Array::RangeEquals(const Array& other, int32_t start_idx, int32_t end_idx, - int32_t other_start_idx) const { +bool Array::RangeEquals(const Array& other, int64_t start_idx, int64_t end_idx, + int64_t other_start_idx) const { bool are_equal = false; Status error = ArrayRangeEquals(*this, other, start_idx, end_idx, other_start_idx, &are_equal); @@ -104,15 +104,15 @@ bool Array::RangeEquals(const Array& other, int32_t start_idx, int32_t end_idx, // Last two parameters are in-out parameters static inline void ConformSliceParams( - int32_t array_offset, int32_t array_length, int32_t* offset, int32_t* length) { + int64_t array_offset, int64_t array_length, int64_t* offset, int64_t* length) { DCHECK_LE(*offset, array_length); DCHECK_GE(offset, 0); *length = std::min(array_length - *offset, *length); *offset = array_offset + *offset; } -std::shared_ptr Array::Slice(int32_t offset) const { - int32_t slice_length = length_ - offset; +std::shared_ptr Array::Slice(int64_t offset) const { + int64_t slice_length = length_ - offset; return 
Slice(offset, slice_length); } @@ -120,9 +120,9 @@ Status Array::Validate() const { return Status::OK(); } -NullArray::NullArray(int32_t length) : Array(null(), length, nullptr, length) {} +NullArray::NullArray(int64_t length) : Array(null(), length, nullptr, length) {} -std::shared_ptr NullArray::Slice(int32_t offset, int32_t length) const { +std::shared_ptr NullArray::Slice(int64_t offset, int64_t length) const { DCHECK_LE(offset, length_); length = std::min(length_ - offset, length); return std::make_shared(length); @@ -135,9 +135,9 @@ Status NullArray::Accept(ArrayVisitor* visitor) const { // ---------------------------------------------------------------------- // Primitive array base -PrimitiveArray::PrimitiveArray(const std::shared_ptr& type, int32_t length, +PrimitiveArray::PrimitiveArray(const std::shared_ptr& type, int64_t length, const std::shared_ptr& data, const std::shared_ptr& null_bitmap, - int32_t null_count, int32_t offset) + int64_t null_count, int64_t offset) : Array(type, length, null_bitmap, null_count, offset) { data_ = data; raw_data_ = data == nullptr ? nullptr : data_->data(); @@ -149,7 +149,7 @@ Status NumericArray::Accept(ArrayVisitor* visitor) const { } template -std::shared_ptr NumericArray::Slice(int32_t offset, int32_t length) const { +std::shared_ptr NumericArray::Slice(int64_t offset, int64_t length) const { ConformSliceParams(offset_, length_, &offset, &length); return std::make_shared>( type_, length, data_, null_bitmap_, kUnknownNullCount, offset); @@ -173,8 +173,8 @@ template class NumericArray; // ---------------------------------------------------------------------- // BooleanArray -BooleanArray::BooleanArray(int32_t length, const std::shared_ptr& data, - const std::shared_ptr& null_bitmap, int32_t null_count, int32_t offset) +BooleanArray::BooleanArray(int64_t length, const std::shared_ptr& data, + const std::shared_ptr& null_bitmap, int64_t null_count, int64_t offset) : PrimitiveArray(std::make_shared(), length, data, null_bitmap, null_count, offset) {} @@ -182,7 +182,7 @@ Status BooleanArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } -std::shared_ptr BooleanArray::Slice(int32_t offset, int32_t length) const { +std::shared_ptr BooleanArray::Slice(int64_t offset, int64_t length) const { ConformSliceParams(offset_, length_, &offset, &length); return std::make_shared( length, data_, null_bitmap_, kUnknownNullCount, offset); @@ -222,7 +222,7 @@ Status ListArray::Validate() const { int32_t prev_offset = this->value_offset(0); if (prev_offset != 0) { return Status::Invalid("The first offset wasn't zero"); } - for (int32_t i = 1; i <= length_; ++i) { + for (int64_t i = 1; i <= length_; ++i) { int32_t current_offset = this->value_offset(i); if (IsNull(i - 1) && current_offset != prev_offset) { std::stringstream ss; @@ -247,7 +247,7 @@ Status ListArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } -std::shared_ptr ListArray::Slice(int32_t offset, int32_t length) const { +std::shared_ptr ListArray::Slice(int64_t offset, int64_t length) const { ConformSliceParams(offset_, length_, &offset, &length); return std::make_shared( type_, length, value_offsets_, values_, null_bitmap_, kUnknownNullCount, offset); @@ -259,15 +259,15 @@ std::shared_ptr ListArray::Slice(int32_t offset, int32_t length) const { static std::shared_ptr kBinary = std::make_shared(); static std::shared_ptr kString = std::make_shared(); -BinaryArray::BinaryArray(int32_t length, const std::shared_ptr& value_offsets, 
+BinaryArray::BinaryArray(int64_t length, const std::shared_ptr& value_offsets, const std::shared_ptr& data, const std::shared_ptr& null_bitmap, - int32_t null_count, int32_t offset) + int64_t null_count, int64_t offset) : BinaryArray(kBinary, length, value_offsets, data, null_bitmap, null_count, offset) { } -BinaryArray::BinaryArray(const std::shared_ptr& type, int32_t length, +BinaryArray::BinaryArray(const std::shared_ptr& type, int64_t length, const std::shared_ptr& value_offsets, const std::shared_ptr& data, - const std::shared_ptr& null_bitmap, int32_t null_count, int32_t offset) + const std::shared_ptr& null_bitmap, int64_t null_count, int64_t offset) : Array(type, length, null_bitmap, null_count, offset), value_offsets_(value_offsets), raw_value_offsets_(nullptr), @@ -288,15 +288,15 @@ Status BinaryArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } -std::shared_ptr BinaryArray::Slice(int32_t offset, int32_t length) const { +std::shared_ptr BinaryArray::Slice(int64_t offset, int64_t length) const { ConformSliceParams(offset_, length_, &offset, &length); return std::make_shared( length, value_offsets_, data_, null_bitmap_, kUnknownNullCount, offset); } -StringArray::StringArray(int32_t length, const std::shared_ptr& value_offsets, +StringArray::StringArray(int64_t length, const std::shared_ptr& value_offsets, const std::shared_ptr& data, const std::shared_ptr& null_bitmap, - int32_t null_count, int32_t offset) + int64_t null_count, int64_t offset) : BinaryArray(kString, length, value_offsets, data, null_bitmap, null_count, offset) { } @@ -309,7 +309,7 @@ Status StringArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } -std::shared_ptr StringArray::Slice(int32_t offset, int32_t length) const { +std::shared_ptr StringArray::Slice(int64_t offset, int64_t length) const { ConformSliceParams(offset_, length_, &offset, &length); return std::make_shared( length, value_offsets_, data_, null_bitmap_, kUnknownNullCount, offset); @@ -318,15 +318,15 @@ std::shared_ptr StringArray::Slice(int32_t offset, int32_t length) const // ---------------------------------------------------------------------- // Struct -StructArray::StructArray(const std::shared_ptr& type, int32_t length, +StructArray::StructArray(const std::shared_ptr& type, int64_t length, const std::vector>& children, - std::shared_ptr null_bitmap, int32_t null_count, int32_t offset) + std::shared_ptr null_bitmap, int64_t null_count, int64_t offset) : Array(type, length, null_bitmap, null_count, offset) { type_ = type; children_ = children; } -std::shared_ptr StructArray::field(int32_t pos) const { +std::shared_ptr StructArray::field(int pos) const { DCHECK_GT(children_.size(), 0); return children_[pos]; } @@ -340,7 +340,7 @@ Status StructArray::Validate() const { if (children_.size() > 0) { // Validate fields - int32_t array_length = children_[0]->length(); + int64_t array_length = children_[0]->length(); size_t idx = 0; for (auto it : children_) { if (it->length() != array_length) { @@ -371,7 +371,7 @@ Status StructArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } -std::shared_ptr StructArray::Slice(int32_t offset, int32_t length) const { +std::shared_ptr StructArray::Slice(int64_t offset, int64_t length) const { ConformSliceParams(offset_, length_, &offset, &length); return std::make_shared( type_, length, children_, null_bitmap_, kUnknownNullCount, offset); @@ -380,10 +380,10 @@ std::shared_ptr StructArray::Slice(int32_t offset, int32_t length) const // 
---------------------------------------------------------------------- // UnionArray -UnionArray::UnionArray(const std::shared_ptr& type, int32_t length, +UnionArray::UnionArray(const std::shared_ptr& type, int64_t length, const std::vector>& children, const std::shared_ptr& type_ids, const std::shared_ptr& value_offsets, - const std::shared_ptr& null_bitmap, int32_t null_count, int32_t offset) + const std::shared_ptr& null_bitmap, int64_t null_count, int64_t offset) : Array(type, length, null_bitmap, null_count, offset), children_(children), type_ids_(type_ids), @@ -396,7 +396,7 @@ UnionArray::UnionArray(const std::shared_ptr& type, int32_t length, } } -std::shared_ptr UnionArray::child(int32_t pos) const { +std::shared_ptr UnionArray::child(int pos) const { DCHECK_GT(children_.size(), 0); return children_[pos]; } @@ -416,7 +416,7 @@ Status UnionArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } -std::shared_ptr UnionArray::Slice(int32_t offset, int32_t length) const { +std::shared_ptr UnionArray::Slice(int64_t offset, int64_t length) const { ConformSliceParams(offset_, length_, &offset, &length); return std::make_shared(type_, length, children_, type_ids_, value_offsets_, null_bitmap_, kUnknownNullCount, offset); @@ -425,9 +425,9 @@ std::shared_ptr UnionArray::Slice(int32_t offset, int32_t length) const { // ---------------------------------------------------------------------- // DictionaryArray -Status DictionaryArray::FromBuffer(const std::shared_ptr& type, int32_t length, +Status DictionaryArray::FromBuffer(const std::shared_ptr& type, int64_t length, const std::shared_ptr& indices, const std::shared_ptr& null_bitmap, - int32_t null_count, int32_t offset, std::shared_ptr* out) { + int64_t null_count, int64_t offset, std::shared_ptr* out) { DCHECK_EQ(type->type, Type::DICTIONARY); const auto& dict_type = static_cast(type.get()); @@ -464,7 +464,7 @@ Status DictionaryArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } -std::shared_ptr DictionaryArray::Slice(int32_t offset, int32_t length) const { +std::shared_ptr DictionaryArray::Slice(int64_t offset, int64_t length) const { std::shared_ptr sliced_indices = indices_->Slice(offset, length); return std::make_shared(type_, sliced_indices); } @@ -476,9 +476,9 @@ std::shared_ptr DictionaryArray::Slice(int32_t offset, int32_t length) co out->reset(new ArrayType(type, length, data, null_bitmap, null_count, offset)); \ break; -Status MakePrimitiveArray(const std::shared_ptr& type, int32_t length, +Status MakePrimitiveArray(const std::shared_ptr& type, int64_t length, const std::shared_ptr& data, const std::shared_ptr& null_bitmap, - int32_t null_count, int32_t offset, std::shared_ptr* out) { + int64_t null_count, int64_t offset, std::shared_ptr* out) { switch (type->type) { MAKE_PRIMITIVE_ARRAY_CASE(BOOL, BooleanArray); MAKE_PRIMITIVE_ARRAY_CASE(UINT8, UInt8Array); diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 8bb914e44ad3d..f20f212c3a825 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -80,30 +80,30 @@ class ARROW_EXPORT ArrayVisitor { /// be computed on the first call to null_count() class ARROW_EXPORT Array { public: - Array(const std::shared_ptr& type, int32_t length, - const std::shared_ptr& null_bitmap = nullptr, int32_t null_count = 0, - int32_t offset = 0); + Array(const std::shared_ptr& type, int64_t length, + const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, + int64_t offset = 0); virtual ~Array() = default; /// Determine if a 
slot is null. For inner loops. Does *not* boundscheck - bool IsNull(int i) const { + bool IsNull(int64_t i) const { return null_bitmap_data_ != nullptr && BitUtil::BitNotSet(null_bitmap_data_, i + offset_); } /// Size in the number of elements this array contains. - int32_t length() const { return length_; } + int64_t length() const { return length_; } /// A relative position into another array's data, to enable zero-copy /// slicing. This value defaults to zero - int32_t offset() const { return offset_; } + int64_t offset() const { return offset_; } /// The number of null entries in the array. If the null count was not known /// at time of construction (and set to a negative value), then the null /// count will be computed and cached on the first invocation of this /// function - int32_t null_count() const; + int64_t null_count() const; std::shared_ptr type() const { return type_; } Type::type type_enum() const { return type_->type; } @@ -128,11 +128,11 @@ class ARROW_EXPORT Array { /// Compare if the range of slots specified are equal for the given array and /// this array. end_idx exclusive. This methods does not bounds check. - bool RangeEquals(int32_t start_idx, int32_t end_idx, int32_t other_start_idx, + bool RangeEquals(int64_t start_idx, int64_t end_idx, int64_t other_start_idx, const std::shared_ptr& other) const; - bool RangeEquals(const Array& other, int32_t start_idx, int32_t end_idx, - int32_t other_start_idx) const; + bool RangeEquals(const Array& other, int64_t start_idx, int64_t end_idx, + int64_t other_start_idx) const; /// Determines if the array is internally consistent. /// @@ -150,20 +150,20 @@ class ARROW_EXPORT Array { /// the length will be adjusted accordingly /// /// \return a new object wrapped in std::shared_ptr - virtual std::shared_ptr Slice(int32_t offset, int32_t length) const = 0; + virtual std::shared_ptr Slice(int64_t offset, int64_t length) const = 0; /// Slice from offset until end of the array - std::shared_ptr Slice(int32_t offset) const; + std::shared_ptr Slice(int64_t offset) const; protected: std::shared_ptr type_; - int32_t length_; - int32_t offset_; + int64_t length_; + int64_t offset_; // This member is marked mutable so that it can be modified when null_count() // is called from a const context and the null count has to be computed (if // it is not already known) - mutable int32_t null_count_; + mutable int64_t null_count_; std::shared_ptr null_bitmap_; const uint8_t* null_bitmap_data_; @@ -178,20 +178,20 @@ class ARROW_EXPORT NullArray : public Array { public: using TypeClass = NullType; - explicit NullArray(int32_t length); + explicit NullArray(int64_t length); Status Accept(ArrayVisitor* visitor) const override; - std::shared_ptr Slice(int32_t offset, int32_t length) const override; + std::shared_ptr Slice(int64_t offset, int64_t length) const override; }; /// Base class for fixed-size logical types class ARROW_EXPORT PrimitiveArray : public Array { public: - PrimitiveArray(const std::shared_ptr& type, int32_t length, + PrimitiveArray(const std::shared_ptr& type, int64_t length, const std::shared_ptr& data, - const std::shared_ptr& null_bitmap = nullptr, int32_t null_count = 0, - int32_t offset = 0); + const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, + int64_t offset = 0); /// The memory containing this array's data /// This buffer does not account for any slice offset @@ -214,10 +214,10 @@ class ARROW_EXPORT NumericArray : public PrimitiveArray { // metadata template NumericArray( - typename 
std::enable_if::is_parameter_free, int32_t>::type length, + typename std::enable_if::is_parameter_free, int64_t>::type length, const std::shared_ptr& data, - const std::shared_ptr& null_bitmap = nullptr, int32_t null_count = 0, - int32_t offset = 0) + const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, + int64_t offset = 0) : PrimitiveArray(TypeTraits::type_singleton(), length, data, null_bitmap, null_count, offset) {} @@ -227,9 +227,9 @@ class ARROW_EXPORT NumericArray : public PrimitiveArray { Status Accept(ArrayVisitor* visitor) const override; - std::shared_ptr Slice(int32_t offset, int32_t length) const override; + std::shared_ptr Slice(int64_t offset, int64_t length) const override; - value_type Value(int i) const { return raw_data()[i]; } + value_type Value(int64_t i) const { return raw_data()[i]; } }; class ARROW_EXPORT BooleanArray : public PrimitiveArray { @@ -238,15 +238,15 @@ class ARROW_EXPORT BooleanArray : public PrimitiveArray { using PrimitiveArray::PrimitiveArray; - BooleanArray(int32_t length, const std::shared_ptr& data, - const std::shared_ptr& null_bitmap = nullptr, int32_t null_count = 0, - int32_t offset = 0); + BooleanArray(int64_t length, const std::shared_ptr& data, + const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, + int64_t offset = 0); Status Accept(ArrayVisitor* visitor) const override; - std::shared_ptr Slice(int32_t offset, int32_t length) const override; + std::shared_ptr Slice(int64_t offset, int64_t length) const override; - bool Value(int i) const { + bool Value(int64_t i) const { return BitUtil::GetBit(reinterpret_cast(raw_data_), i + offset_); } }; @@ -258,10 +258,10 @@ class ARROW_EXPORT ListArray : public Array { public: using TypeClass = ListType; - ListArray(const std::shared_ptr& type, int32_t length, + ListArray(const std::shared_ptr& type, int64_t length, const std::shared_ptr& value_offsets, const std::shared_ptr& values, - const std::shared_ptr& null_bitmap = nullptr, int32_t null_count = 0, - int32_t offset = 0) + const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, + int64_t offset = 0) : Array(type, length, null_bitmap, null_count, offset) { value_offsets_ = value_offsets; raw_value_offsets_ = value_offsets == nullptr @@ -285,15 +285,15 @@ class ARROW_EXPORT ListArray : public Array { const int32_t* raw_value_offsets() const { return raw_value_offsets_ + offset_; } // Neither of these functions will perform boundschecking - int32_t value_offset(int i) const { return raw_value_offsets_[i + offset_]; } - int32_t value_length(int i) const { + int32_t value_offset(int64_t i) const { return raw_value_offsets_[i + offset_]; } + int32_t value_length(int64_t i) const { i += offset_; return raw_value_offsets_[i + 1] - raw_value_offsets_[i]; } Status Accept(ArrayVisitor* visitor) const override; - std::shared_ptr Slice(int32_t offset, int32_t length) const override; + std::shared_ptr Slice(int64_t offset, int64_t length) const override; protected: std::shared_ptr value_offsets_; @@ -308,15 +308,15 @@ class ARROW_EXPORT BinaryArray : public Array { public: using TypeClass = BinaryType; - BinaryArray(int32_t length, const std::shared_ptr& value_offsets, + BinaryArray(int64_t length, const std::shared_ptr& value_offsets, const std::shared_ptr& data, - const std::shared_ptr& null_bitmap = nullptr, int32_t null_count = 0, - int32_t offset = 0); + const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, + int64_t offset = 0); // Return the pointer to the given elements bytes // 
TODO(emkornfield) introduce a StringPiece or something similar to capture zero-copy // pointer + offset - const uint8_t* GetValue(int i, int32_t* out_length) const { + const uint8_t* GetValue(int64_t i, int32_t* out_length) const { // Account for base offset i += offset_; @@ -334,8 +334,8 @@ class ARROW_EXPORT BinaryArray : public Array { const int32_t* raw_value_offsets() const { return raw_value_offsets_ + offset_; } // Neither of these functions will perform boundschecking - int32_t value_offset(int i) const { return raw_value_offsets_[i + offset_]; } - int32_t value_length(int i) const { + int32_t value_offset(int64_t i) const { return raw_value_offsets_[i + offset_]; } + int32_t value_length(int64_t i) const { i += offset_; return raw_value_offsets_[i + 1] - raw_value_offsets_[i]; } @@ -344,15 +344,15 @@ class ARROW_EXPORT BinaryArray : public Array { Status Accept(ArrayVisitor* visitor) const override; - std::shared_ptr Slice(int32_t offset, int32_t length) const override; + std::shared_ptr Slice(int64_t offset, int64_t length) const override; protected: // Constructor that allows sub-classes/builders to propagate there logical type up the // class hierarchy. - BinaryArray(const std::shared_ptr& type, int32_t length, + BinaryArray(const std::shared_ptr& type, int64_t length, const std::shared_ptr& value_offsets, const std::shared_ptr& data, - const std::shared_ptr& null_bitmap = nullptr, int32_t null_count = 0, - int32_t offset = 0); + const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, + int64_t offset = 0); std::shared_ptr value_offsets_; const int32_t* raw_value_offsets_; @@ -365,14 +365,14 @@ class ARROW_EXPORT StringArray : public BinaryArray { public: using TypeClass = StringType; - StringArray(int32_t length, const std::shared_ptr& value_offsets, + StringArray(int64_t length, const std::shared_ptr& value_offsets, const std::shared_ptr& data, - const std::shared_ptr& null_bitmap = nullptr, int32_t null_count = 0, - int32_t offset = 0); + const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, + int64_t offset = 0); // Construct a std::string // TODO: std::bad_alloc possibility - std::string GetString(int i) const { + std::string GetString(int64_t i) const { int32_t nchars; const uint8_t* str = GetValue(i, &nchars); return std::string(reinterpret_cast(str), nchars); @@ -382,7 +382,7 @@ class ARROW_EXPORT StringArray : public BinaryArray { Status Accept(ArrayVisitor* visitor) const override; - std::shared_ptr Slice(int32_t offset, int32_t length) const override; + std::shared_ptr Slice(int64_t offset, int64_t length) const override; }; // ---------------------------------------------------------------------- @@ -392,22 +392,22 @@ class ARROW_EXPORT StructArray : public Array { public: using TypeClass = StructType; - StructArray(const std::shared_ptr& type, int32_t length, + StructArray(const std::shared_ptr& type, int64_t length, const std::vector>& children, - std::shared_ptr null_bitmap = nullptr, int32_t null_count = 0, - int32_t offset = 0); + std::shared_ptr null_bitmap = nullptr, int64_t null_count = 0, + int64_t offset = 0); Status Validate() const override; // Return a shared pointer in case the requestor desires to share ownership // with this array. 
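// field(pos) hands back the pos-th child array; the index stays a plain int
// because it walks the struct's field list, not the row dimension that this
// patch widens to int64_t.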
- std::shared_ptr field(int32_t pos) const; + std::shared_ptr field(int pos) const; const std::vector>& fields() const { return children_; } Status Accept(ArrayVisitor* visitor) const override; - std::shared_ptr Slice(int32_t offset, int32_t length) const override; + std::shared_ptr Slice(int64_t offset, int64_t length) const override; protected: // The child arrays corresponding to each field of the struct data type. @@ -422,12 +422,12 @@ class ARROW_EXPORT UnionArray : public Array { using TypeClass = UnionType; using type_id_t = uint8_t; - UnionArray(const std::shared_ptr& type, int32_t length, + UnionArray(const std::shared_ptr& type, int64_t length, const std::vector>& children, const std::shared_ptr& type_ids, const std::shared_ptr& value_offsets = nullptr, - const std::shared_ptr& null_bitmap = nullptr, int32_t null_count = 0, - int32_t offset = 0); + const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, + int64_t offset = 0); Status Validate() const override; @@ -442,13 +442,13 @@ class ARROW_EXPORT UnionArray : public Array { UnionMode mode() const { return static_cast(*type_.get()).mode; } - std::shared_ptr child(int32_t pos) const; + std::shared_ptr child(int pos) const; const std::vector>& children() const { return children_; } Status Accept(ArrayVisitor* visitor) const override; - std::shared_ptr Slice(int32_t offset, int32_t length) const override; + std::shared_ptr Slice(int64_t offset, int64_t length) const override; protected: std::vector> children_; @@ -487,9 +487,9 @@ class ARROW_EXPORT DictionaryArray : public Array { // Alternate ctor; other attributes (like null count) are inherited from the // passed indices array - static Status FromBuffer(const std::shared_ptr& type, int32_t length, + static Status FromBuffer(const std::shared_ptr& type, int64_t length, const std::shared_ptr& indices, const std::shared_ptr& null_bitmap, - int32_t null_count, int32_t offset, std::shared_ptr* out); + int64_t null_count, int64_t offset, std::shared_ptr* out); Status Validate() const override; @@ -500,7 +500,7 @@ class ARROW_EXPORT DictionaryArray : public Array { Status Accept(ArrayVisitor* visitor) const override; - std::shared_ptr Slice(int32_t offset, int32_t length) const override; + std::shared_ptr Slice(int64_t offset, int64_t length) const override; protected: const DictionaryType* dict_type_; @@ -542,8 +542,8 @@ extern template class ARROW_EXPORT NumericArray; // Create new arrays for logical types that are backed by primitive arrays. 
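// Given the data type, the widened int64_t length/null_count/offset, and the
// raw data and validity buffers, this resolves the matching concrete array
// class and returns it through `out`.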
Status ARROW_EXPORT MakePrimitiveArray(const std::shared_ptr& type, - int32_t length, const std::shared_ptr& data, - const std::shared_ptr& null_bitmap, int32_t null_count, int32_t offset, + int64_t length, const std::shared_ptr& data, + const std::shared_ptr& null_bitmap, int64_t null_count, int64_t offset, std::shared_ptr* out); } // namespace arrow diff --git a/cpp/src/arrow/buffer.h b/cpp/src/arrow/buffer.h index 9c400b1290394..be91af3556da4 100644 --- a/cpp/src/arrow/buffer.h +++ b/cpp/src/arrow/buffer.h @@ -165,7 +165,7 @@ class ARROW_EXPORT BufferBuilder { : pool_(pool), data_(nullptr), capacity_(0), size_(0) {} /// Resizes the buffer to the nearest multiple of 64 bytes per Layout.md - Status Resize(int32_t elements) { + Status Resize(int64_t elements) { if (capacity_ == 0) { buffer_ = std::make_shared(pool_); } RETURN_NOT_OK(buffer_->Resize(elements)); capacity_ = buffer_->capacity(); @@ -173,7 +173,7 @@ class ARROW_EXPORT BufferBuilder { return Status::OK(); } - Status Append(const uint8_t* data, int length) { + Status Append(const uint8_t* data, int64_t length) { if (capacity_ < length + size_) { RETURN_NOT_OK(Resize(length + size_)); } UnsafeAppend(data, length); return Status::OK(); @@ -187,7 +187,7 @@ class ARROW_EXPORT BufferBuilder { } template - Status Append(const T* arithmetic_values, int num_elements) { + Status Append(const T* arithmetic_values, int64_t num_elements) { static_assert(std::is_arithmetic::value, "Convenience buffer append only supports arithmetic types"); return Append( @@ -195,7 +195,7 @@ class ARROW_EXPORT BufferBuilder { } // Unsafe methods don't check existing size - void UnsafeAppend(const uint8_t* data, int length) { + void UnsafeAppend(const uint8_t* data, int64_t length) { memcpy(data_ + size_, data, length); size_ += length; } @@ -208,7 +208,7 @@ class ARROW_EXPORT BufferBuilder { } template - void UnsafeAppend(const T* arithmetic_values, int num_elements) { + void UnsafeAppend(const T* arithmetic_values, int64_t num_elements) { static_assert(std::is_arithmetic::value, "Convenience buffer append only supports arithmetic types"); UnsafeAppend( @@ -221,8 +221,8 @@ class ARROW_EXPORT BufferBuilder { capacity_ = size_ = 0; return result; } - int capacity() { return capacity_; } - int length() { return size_; } + int64_t capacity() { return capacity_; } + int64_t length() { return size_; } private: std::shared_ptr buffer_; diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index f5c13f9e77ef1..63e083e76b660 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -43,33 +43,33 @@ Status ArrayBuilder::AppendToBitmap(bool is_valid) { return Status::OK(); } -Status ArrayBuilder::AppendToBitmap(const uint8_t* valid_bytes, int32_t length) { +Status ArrayBuilder::AppendToBitmap(const uint8_t* valid_bytes, int64_t length) { RETURN_NOT_OK(Reserve(length)); UnsafeAppendToBitmap(valid_bytes, length); return Status::OK(); } -Status ArrayBuilder::Init(int32_t capacity) { - int32_t to_alloc = BitUtil::CeilByte(capacity) / 8; +Status ArrayBuilder::Init(int64_t capacity) { + int64_t to_alloc = BitUtil::CeilByte(capacity) / 8; null_bitmap_ = std::make_shared(pool_); RETURN_NOT_OK(null_bitmap_->Resize(to_alloc)); // Buffers might allocate more then necessary to satisfy padding requirements - const int byte_capacity = null_bitmap_->capacity(); + const int64_t byte_capacity = null_bitmap_->capacity(); capacity_ = capacity; null_bitmap_data_ = null_bitmap_->mutable_data(); memset(null_bitmap_data_, 0, byte_capacity); return Status::OK(); } 
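A side note on the sizing arithmetic in Init: once capacities are int64_t, the bit-to-byte rounding must stay in 64-bit arithmetic end to end. The standalone sketch below models that rounding only; CeilByte here is a local stand-in for BitUtil::CeilByte (assumed to round a bit count up to the next multiple of 8), not the library routine.

#include <cassert>
#include <cstdint>

// Model of BitUtil::CeilByte (assumption): round a bit count up to the
// next multiple of 8, staying in 64-bit arithmetic throughout.
int64_t CeilByte(int64_t bits) {
  return (bits + 7) & ~static_cast<int64_t>(7);
}

int main() {
  // Bytes needed for a validity bitmap covering `capacity` slots.
  auto bytes_for = [](int64_t capacity) { return CeilByte(capacity) / 8; };
  assert(bytes_for(1) == 1);  // a single bit still occupies a whole byte
  assert(bytes_for(8) == 1);  // exactly one byte
  assert(bytes_for(9) == 2);  // spills into a second byte
  return 0;
}

Dividing the rounded bit count by 8 gives the byte allocation that Resize requests; as the comment in Init notes, the pool may still allocate beyond that to satisfy padding requirements.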
-Status ArrayBuilder::Resize(int32_t new_bits) { +Status ArrayBuilder::Resize(int64_t new_bits) { if (!null_bitmap_) { return Init(new_bits); } - int32_t new_bytes = BitUtil::CeilByte(new_bits) / 8; - int32_t old_bytes = null_bitmap_->size(); + int64_t new_bytes = BitUtil::CeilByte(new_bits) / 8; + int64_t old_bytes = null_bitmap_->size(); RETURN_NOT_OK(null_bitmap_->Resize(new_bytes)); null_bitmap_data_ = null_bitmap_->mutable_data(); // The buffer might be overpadded to deal with padding according to the spec - const int32_t byte_capacity = null_bitmap_->capacity(); + const int64_t byte_capacity = null_bitmap_->capacity(); capacity_ = new_bits; if (old_bytes < new_bytes) { memset(null_bitmap_data_ + old_bytes, 0, byte_capacity - old_bytes); @@ -77,7 +77,7 @@ Status ArrayBuilder::Resize(int32_t new_bits) { return Status::OK(); } -Status ArrayBuilder::Advance(int32_t elements) { +Status ArrayBuilder::Advance(int64_t elements) { if (length_ + elements > capacity_) { return Status::Invalid("Builder must be expanded"); } @@ -85,16 +85,16 @@ Status ArrayBuilder::Advance(int32_t elements) { return Status::OK(); } -Status ArrayBuilder::Reserve(int32_t elements) { +Status ArrayBuilder::Reserve(int64_t elements) { if (length_ + elements > capacity_) { // TODO(emkornfield) power of 2 growth is potentially suboptimal - int32_t new_capacity = BitUtil::NextPower2(length_ + elements); + int64_t new_capacity = BitUtil::NextPower2(length_ + elements); return Resize(new_capacity); } return Status::OK(); } -Status ArrayBuilder::SetNotNull(int32_t length) { +Status ArrayBuilder::SetNotNull(int64_t length) { RETURN_NOT_OK(Reserve(length)); UnsafeSetNotNull(length); return Status::OK(); @@ -109,21 +109,21 @@ void ArrayBuilder::UnsafeAppendToBitmap(bool is_valid) { ++length_; } -void ArrayBuilder::UnsafeAppendToBitmap(const uint8_t* valid_bytes, int32_t length) { +void ArrayBuilder::UnsafeAppendToBitmap(const uint8_t* valid_bytes, int64_t length) { if (valid_bytes == nullptr) { UnsafeSetNotNull(length); return; } - int byte_offset = length_ / 8; - int bit_offset = length_ % 8; + int64_t byte_offset = length_ / 8; + int64_t bit_offset = length_ % 8; uint8_t bitset = null_bitmap_data_[byte_offset]; - for (int32_t i = 0; i < length; ++i) { + for (int64_t i = 0; i < length; ++i) { if (valid_bytes[i]) { - bitset |= (1 << bit_offset); + bitset |= BitUtil::kBitmask[bit_offset]; } else { - bitset &= ~(1 << bit_offset); + bitset &= BitUtil::kFlippedBitmask[bit_offset]; ++null_count_; } @@ -140,22 +140,22 @@ void ArrayBuilder::UnsafeAppendToBitmap(const uint8_t* valid_bytes, int32_t leng length_ += length; } -void ArrayBuilder::UnsafeSetNotNull(int32_t length) { - const int32_t new_length = length + length_; +void ArrayBuilder::UnsafeSetNotNull(int64_t length) { + const int64_t new_length = length + length_; // Fill up the bytes until we have a byte alignment - int32_t pad_to_byte = 8 - (length_ % 8); + int64_t pad_to_byte = 8 - (length_ % 8); if (pad_to_byte == 8) { pad_to_byte = 0; } - for (int32_t i = 0; i < pad_to_byte; ++i) { + for (int64_t i = 0; i < pad_to_byte; ++i) { BitUtil::SetBit(null_bitmap_data_, i); } // Fast bitsetting - int32_t fast_length = (length - pad_to_byte) / 8; + int64_t fast_length = (length - pad_to_byte) / 8; memset(null_bitmap_data_ + ((length_ + pad_to_byte) / 8), 255, fast_length); // Trailing bytes - for (int32_t i = length_ + pad_to_byte + (fast_length * 8); i < new_length; ++i) { + for (int64_t i = length_ + pad_to_byte + (fast_length * 8); i < new_length; ++i) { 
BitUtil::SetBit(null_bitmap_data_, i); } @@ -163,7 +163,7 @@ void ArrayBuilder::UnsafeSetNotNull(int32_t length) { } template -Status PrimitiveBuilder::Init(int32_t capacity) { +Status PrimitiveBuilder::Init(int64_t capacity) { RETURN_NOT_OK(ArrayBuilder::Init(capacity)); data_ = std::make_shared(pool_); @@ -177,7 +177,7 @@ Status PrimitiveBuilder::Init(int32_t capacity) { } template -Status PrimitiveBuilder::Resize(int32_t capacity) { +Status PrimitiveBuilder::Resize(int64_t capacity) { // XXX: Set floor size for now if (capacity < kMinBuilderCapacity) { capacity = kMinBuilderCapacity; } @@ -197,11 +197,12 @@ Status PrimitiveBuilder::Resize(int32_t capacity) { template Status PrimitiveBuilder::Append( - const value_type* values, int32_t length, const uint8_t* valid_bytes) { + const value_type* values, int64_t length, const uint8_t* valid_bytes) { RETURN_NOT_OK(Reserve(length)); if (length > 0) { - memcpy(raw_data_ + length_, values, TypeTraits::bytes_required(length)); + std::memcpy(raw_data_ + length_, values, + static_cast(TypeTraits::bytes_required(length))); } // length_ is update by these @@ -248,7 +249,7 @@ BooleanBuilder::BooleanBuilder(MemoryPool* pool, const std::shared_ptr DCHECK_EQ(Type::BOOL, type->type); } -Status BooleanBuilder::Init(int32_t capacity) { +Status BooleanBuilder::Init(int64_t capacity) { RETURN_NOT_OK(ArrayBuilder::Init(capacity)); data_ = std::make_shared(pool_); @@ -261,7 +262,7 @@ Status BooleanBuilder::Init(int32_t capacity) { return Status::OK(); } -Status BooleanBuilder::Resize(int32_t capacity) { +Status BooleanBuilder::Resize(int64_t capacity) { // XXX: Set floor size for now if (capacity < kMinBuilderCapacity) { capacity = kMinBuilderCapacity; } @@ -294,10 +295,10 @@ Status BooleanBuilder::Finish(std::shared_ptr* out) { } Status BooleanBuilder::Append( - const uint8_t* values, int32_t length, const uint8_t* valid_bytes) { + const uint8_t* values, int64_t length, const uint8_t* valid_bytes) { RETURN_NOT_OK(Reserve(length)); - for (int i = 0; i < length; ++i) { + for (int64_t i = 0; i < length; ++i) { // Skip reading from unitialised memory // TODO: This actually is only to keep valgrind happy but may or may not // have a performance impact. 
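Two bitmap idioms recur in the builder changes above: single-bit updates now go through the precomputed BitUtil::kBitmask and kFlippedBitmask tables rather than computing shifts inline, and bulk validity marking (UnsafeSetNotNull) only touches individual bits until it reaches a byte boundary, then switches to memset for the aligned middle. The sketch below is a minimal, self-contained model of that three-phase fill; SetBit is a hand-rolled stand-in for BitUtil::SetBit and the bit indices are assumed absolute, so it illustrates the pattern rather than reproducing the builder's code.

#include <cassert>
#include <cstdint>
#include <cstring>

inline void SetBit(uint8_t* bits, int64_t i) {
  bits[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
}

// Mark bits [start, start + length) valid in three phases: bit-by-bit up to
// the next byte boundary, memset(0xFF) across whole bytes, then bit-by-bit
// for the trailing remainder.
void SetRangeNotNull(uint8_t* bitmap, int64_t start, int64_t length) {
  const int64_t end = start + length;
  int64_t pad = (8 - (start % 8)) % 8;  // bits until byte alignment
  if (pad > length) { pad = length; }
  for (int64_t i = start; i < start + pad; ++i) {
    SetBit(bitmap, i);
  }
  const int64_t fast_bytes = (end - (start + pad)) / 8;
  std::memset(bitmap + (start + pad) / 8, 0xFF, static_cast<size_t>(fast_bytes));
  for (int64_t i = start + pad + fast_bytes * 8; i < end; ++i) {
    SetBit(bitmap, i);
  }
}

int main() {
  uint8_t bitmap[4] = {0, 0, 0, 0};
  SetRangeNotNull(bitmap, 3, 13);  // set bits 3..15
  assert(bitmap[0] == 0xF8);       // bits 3-7
  assert(bitmap[1] == 0xFF);       // bits 8-15
  assert(bitmap[2] == 0x00);       // untouched
  return 0;
}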
@@ -333,17 +334,17 @@ ListBuilder::ListBuilder( offset_builder_(pool), values_(values) {} -Status ListBuilder::Init(int32_t elements) { - DCHECK_LT(elements, std::numeric_limits::max()); +Status ListBuilder::Init(int64_t elements) { + DCHECK_LT(elements, std::numeric_limits::max()); RETURN_NOT_OK(ArrayBuilder::Init(elements)); // one more then requested for offsets - return offset_builder_.Resize((elements + 1) * sizeof(int32_t)); + return offset_builder_.Resize((elements + 1) * sizeof(int64_t)); } -Status ListBuilder::Resize(int32_t capacity) { - DCHECK_LT(capacity, std::numeric_limits::max()); +Status ListBuilder::Resize(int64_t capacity) { + DCHECK_LT(capacity, std::numeric_limits::max()); // one more then requested for offsets - RETURN_NOT_OK(offset_builder_.Resize((capacity + 1) * sizeof(int32_t))); + RETURN_NOT_OK(offset_builder_.Resize((capacity + 1) * sizeof(int64_t))); return ArrayBuilder::Resize(capacity); } @@ -351,7 +352,7 @@ Status ListBuilder::Finish(std::shared_ptr* out) { std::shared_ptr items = values_; if (!items) { RETURN_NOT_OK(value_builder_->Finish(&items)); } - RETURN_NOT_OK(offset_builder_.Append(items->length())); + RETURN_NOT_OK(offset_builder_.Append(items->length())); std::shared_ptr offsets = offset_builder_.Finish(); *out = std::make_shared( diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index 0b83b9f3c6862..e642d3c21a2fd 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -37,7 +37,7 @@ namespace arrow { class Array; -static constexpr int32_t kMinBuilderCapacity = 1 << 5; +static constexpr int64_t kMinBuilderCapacity = 1 << 5; /// Base class for all data array builders. // @@ -61,38 +61,38 @@ class ARROW_EXPORT ArrayBuilder { /// skip shared pointers and just return a raw pointer ArrayBuilder* child(int i) { return children_[i].get(); } - int num_children() const { return children_.size(); } + int num_children() const { return static_cast(children_.size()); } - int32_t length() const { return length_; } - int32_t null_count() const { return null_count_; } - int32_t capacity() const { return capacity_; } + int64_t length() const { return length_; } + int64_t null_count() const { return null_count_; } + int64_t capacity() const { return capacity_; } /// Append to null bitmap Status AppendToBitmap(bool is_valid); /// Vector append. Treat each zero byte as a null. If valid_bytes is null /// assume all of length bits are valid. - Status AppendToBitmap(const uint8_t* valid_bytes, int32_t length); + Status AppendToBitmap(const uint8_t* valid_bytes, int64_t length); /// Set the next length bits to not null (i.e. valid). - Status SetNotNull(int32_t length); + Status SetNotNull(int64_t length); /// Allocates initial capacity requirements for the builder. In most /// cases subclasses should override and call there parent classes /// method as well. - virtual Status Init(int32_t capacity); + virtual Status Init(int64_t capacity); /// Resizes the null_bitmap array. In most /// cases subclasses should override and call there parent classes /// method as well. - virtual Status Resize(int32_t new_bits); + virtual Status Resize(int64_t new_bits); /// Ensures there is enough space for adding the number of elements by checking /// capacity and calling Resize if necessary. - Status Reserve(int32_t elements); + Status Reserve(int64_t elements); /// For cases where raw data was memcpy'd into the internal buffers, allows us /// to advance the length of the builder. It is your responsibility to use /// this function responsibly. 
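/// Note that Advance only moves the logical length forward; it writes nothing
/// to the data buffer or validity bitmap, so the skipped-over slots must
/// already hold valid contents.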
- Status Advance(int32_t elements); + Status Advance(int64_t elements); std::shared_ptr null_bitmap() const { return null_bitmap_; } @@ -109,12 +109,12 @@ class ARROW_EXPORT ArrayBuilder { // When null_bitmap are first appended to the builder, the null bitmap is allocated std::shared_ptr null_bitmap_; - int32_t null_count_; + int64_t null_count_; uint8_t* null_bitmap_data_; // Array length, so far. Also, the index of the next element to be added - int32_t length_; - int32_t capacity_; + int64_t length_; + int64_t capacity_; // Child value array builders. These are owned by this class std::vector> children_; @@ -127,9 +127,9 @@ class ARROW_EXPORT ArrayBuilder { void UnsafeAppendToBitmap(bool is_valid); // Vector append. Treat each zero byte as a nullzero. If valid_bytes is null // assume all of length bits are valid. - void UnsafeAppendToBitmap(const uint8_t* valid_bytes, int32_t length); + void UnsafeAppendToBitmap(const uint8_t* valid_bytes, int64_t length); // Set the next length bits to not null (i.e. valid). - void UnsafeSetNotNull(int32_t length); + void UnsafeSetNotNull(int64_t length); private: DISALLOW_COPY_AND_ASSIGN(ArrayBuilder); @@ -146,7 +146,7 @@ class ARROW_EXPORT PrimitiveBuilder : public ArrayBuilder { using ArrayBuilder::Advance; /// Write nulls as uint8_t* (0 value indicates null) into pre-allocated memory - Status AppendNulls(const uint8_t* valid_bytes, int32_t length) { + Status AppendNulls(const uint8_t* valid_bytes, int64_t length) { RETURN_NOT_OK(Reserve(length)); UnsafeAppendToBitmap(valid_bytes, length); return Status::OK(); @@ -165,14 +165,14 @@ class ARROW_EXPORT PrimitiveBuilder : public ArrayBuilder { /// If passed, valid_bytes is of equal length to values, and any zero byte /// will be considered as a null for that slot Status Append( - const value_type* values, int32_t length, const uint8_t* valid_bytes = nullptr); + const value_type* values, int64_t length, const uint8_t* valid_bytes = nullptr); Status Finish(std::shared_ptr* out) override; - Status Init(int32_t capacity) override; + Status Init(int64_t capacity) override; /// Increase the capacity of the builder to accommodate at least the indicated /// number of elements - Status Resize(int32_t capacity) override; + Status Resize(int64_t capacity) override; protected: std::shared_ptr data_; @@ -246,7 +246,7 @@ class ARROW_EXPORT BooleanBuilder : public ArrayBuilder { using ArrayBuilder::Advance; /// Write nulls as uint8_t* (0 value indicates null) into pre-allocated memory - Status AppendNulls(const uint8_t* valid_bytes, int32_t length) { + Status AppendNulls(const uint8_t* valid_bytes, int64_t length) { RETURN_NOT_OK(Reserve(length)); UnsafeAppendToBitmap(valid_bytes, length); return Status::OK(); @@ -278,14 +278,14 @@ class ARROW_EXPORT BooleanBuilder : public ArrayBuilder { /// If passed, valid_bytes is of equal length to values, and any zero byte /// will be considered as a null for that slot Status Append( - const uint8_t* values, int32_t length, const uint8_t* valid_bytes = nullptr); + const uint8_t* values, int64_t length, const uint8_t* valid_bytes = nullptr); Status Finish(std::shared_ptr* out) override; - Status Init(int32_t capacity) override; + Status Init(int64_t capacity) override; /// Increase the capacity of the builder to accommodate at least the indicated /// number of elements - Status Resize(int32_t capacity) override; + Status Resize(int64_t capacity) override; protected: std::shared_ptr data_; @@ -318,8 +318,8 @@ class ARROW_EXPORT ListBuilder : public ArrayBuilder { ListBuilder( 
MemoryPool* pool, std::shared_ptr values, const TypePtr& type = nullptr); - Status Init(int32_t elements) override; - Status Resize(int32_t capacity) override; + Status Init(int64_t elements) override; + Status Resize(int64_t capacity) override; Status Finish(std::shared_ptr* out) override; /// Vector append @@ -327,7 +327,7 @@ class ARROW_EXPORT ListBuilder : public ArrayBuilder { /// If passed, valid_bytes is of equal length to values, and any zero byte /// will be considered as a null for that slot Status Append( - const int32_t* offsets, int32_t length, const uint8_t* valid_bytes = nullptr) { + const int32_t* offsets, int64_t length, const uint8_t* valid_bytes = nullptr) { RETURN_NOT_OK(Reserve(length)); UnsafeAppendToBitmap(valid_bytes, length); offset_builder_.UnsafeAppend(offsets, length); @@ -341,7 +341,8 @@ class ARROW_EXPORT ListBuilder : public ArrayBuilder { Status Append(bool is_valid = true) { RETURN_NOT_OK(Reserve(1)); UnsafeAppendToBitmap(is_valid); - RETURN_NOT_OK(offset_builder_.Append(value_builder_->length())); + RETURN_NOT_OK( + offset_builder_.Append(static_cast(value_builder_->length()))); return Status::OK(); } @@ -375,7 +376,9 @@ class ARROW_EXPORT BinaryBuilder : public ListBuilder { return Append(reinterpret_cast(value), length); } - Status Append(const std::string& value) { return Append(value.c_str(), value.size()); } + Status Append(const std::string& value) { + return Append(value.c_str(), static_cast(value.size())); + } Status Finish(std::shared_ptr* out) override; @@ -417,7 +420,7 @@ class ARROW_EXPORT StructBuilder : public ArrayBuilder { /// will be considered as a null for that field, but users must using app- /// end methods or advance methods of the child builders' independently to /// insert data. - Status Append(int32_t length, const uint8_t* valid_bytes) { + Status Append(int64_t length, const uint8_t* valid_bytes) { RETURN_NOT_OK(Reserve(length)); UnsafeAppendToBitmap(valid_bytes, length); return Status::OK(); diff --git a/cpp/src/arrow/column-benchmark.cc b/cpp/src/arrow/column-benchmark.cc index 1bab5a8de0ca4..13076a4788689 100644 --- a/cpp/src/arrow/column-benchmark.cc +++ b/cpp/src/arrow/column-benchmark.cc @@ -24,7 +24,7 @@ namespace arrow { namespace { template -std::shared_ptr MakePrimitive(int32_t length, int32_t null_count = 0) { +std::shared_ptr MakePrimitive(int64_t length, int64_t null_count = 0) { auto pool = default_memory_pool(); auto data = std::make_shared(pool); auto null_bitmap = std::make_shared(pool); diff --git a/cpp/src/arrow/column.cc b/cpp/src/arrow/column.cc index 1376f6534ece1..18228700472c4 100644 --- a/cpp/src/arrow/column.cc +++ b/cpp/src/arrow/column.cc @@ -42,15 +42,15 @@ bool ChunkedArray::Equals(const ChunkedArray& other) const { // Check contents of the underlying arrays. This checks for equality of // the underlying data independently of the chunk size. 
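// The loop below keeps a (chunk index, position-within-chunk) cursor for each
// side, compares the widest range that fits in both current chunks via
// RangeEquals, then advances whichever cursor exhausted its chunk, so the
// result depends only on logical values and never on chunk boundaries.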
int this_chunk_idx = 0; - int32_t this_start_idx = 0; + int64_t this_start_idx = 0; int other_chunk_idx = 0; - int32_t other_start_idx = 0; + int64_t other_start_idx = 0; int64_t elements_compared = 0; while (elements_compared < length_) { const std::shared_ptr this_array = chunks_[this_chunk_idx]; const std::shared_ptr other_array = other.chunk(other_chunk_idx); - int32_t common_length = std::min( + int64_t common_length = std::min( this_array->length() - this_start_idx, other_array->length() - other_start_idx); if (!this_array->RangeEquals(this_start_idx, this_start_idx + common_length, other_start_idx, other_array)) { diff --git a/cpp/src/arrow/column.h b/cpp/src/arrow/column.h index a28b2665e9c1c..93a34c7c95fdf 100644 --- a/cpp/src/arrow/column.h +++ b/cpp/src/arrow/column.h @@ -44,7 +44,7 @@ class ARROW_EXPORT ChunkedArray { int64_t null_count() const { return null_count_; } - int num_chunks() const { return chunks_.size(); } + int num_chunks() const { return static_cast(chunks_.size()); } std::shared_ptr chunk(int i) const { return chunks_[i]; } diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc index ff3c59f638bb0..e94fa74ea6589 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -37,8 +37,8 @@ namespace arrow { class RangeEqualsVisitor : public ArrayVisitor { public: - RangeEqualsVisitor(const Array& right, int32_t left_start_idx, int32_t left_end_idx, - int32_t right_start_idx) + RangeEqualsVisitor(const Array& right, int64_t left_start_idx, int64_t left_end_idx, + int64_t right_start_idx) : right_(right), left_start_idx_(left_start_idx), left_end_idx_(left_end_idx), @@ -55,7 +55,7 @@ class RangeEqualsVisitor : public ArrayVisitor { inline Status CompareValues(const ArrayType& left) { const auto& right = static_cast(right_); - for (int32_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_; + for (int64_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_; ++i, ++o_i) { const bool is_null = left.IsNull(i); if (is_null != right.IsNull(o_i) || @@ -71,7 +71,7 @@ class RangeEqualsVisitor : public ArrayVisitor { bool CompareBinaryRange(const BinaryArray& left) const { const auto& right = static_cast(right_); - for (int32_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_; + for (int64_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_; ++i, ++o_i) { const bool is_null = left.IsNull(i); if (is_null != right.IsNull(o_i)) { return false; } @@ -164,7 +164,7 @@ class RangeEqualsVisitor : public ArrayVisitor { const std::shared_ptr& left_values = left.values(); const std::shared_ptr& right_values = right.values(); - for (int32_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_; + for (int64_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_; ++i, ++o_i) { const bool is_null = left.IsNull(i); if (is_null != right.IsNull(o_i)) { return false; } @@ -193,15 +193,15 @@ class RangeEqualsVisitor : public ArrayVisitor { bool CompareStructs(const StructArray& left) { const auto& right = static_cast(right_); bool equal_fields = true; - for (int32_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_; + for (int64_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_; ++i, ++o_i) { if (left.IsNull(i) != right.IsNull(o_i)) { return false; } if (left.IsNull(i)) continue; - for (size_t j = 0; j < left.fields().size(); ++j) { + for (int j = 0; j < static_cast(left.fields().size()); ++j) { // TODO: really we should be comparing stretches of non-null data rather // than 
looking at one value at a time. - const int left_abs_index = i + left.offset(); - const int right_abs_index = o_i + right.offset(); + const int64_t left_abs_index = i + left.offset(); + const int64_t right_abs_index = o_i + right.offset(); equal_fields = left.field(j)->RangeEquals( left_abs_index, left_abs_index + 1, right_abs_index, right.field(j)); @@ -243,7 +243,7 @@ class RangeEqualsVisitor : public ArrayVisitor { const uint8_t* right_ids = right.raw_type_ids(); uint8_t id, child_num; - for (int32_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_; + for (int64_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_; ++i, ++o_i) { if (left.IsNull(i) != right.IsNull(o_i)) { return false; } if (left.IsNull(i)) continue; @@ -252,8 +252,8 @@ class RangeEqualsVisitor : public ArrayVisitor { id = left_ids[i]; child_num = type_id_to_child_num[id]; - const int left_abs_index = i + left.offset(); - const int right_abs_index = o_i + right.offset(); + const int64_t left_abs_index = i + left.offset(); + const int64_t right_abs_index = o_i + right.offset(); // TODO(wesm): really we should be comparing stretches of non-null data // rather than looking at one value at a time. @@ -294,9 +294,9 @@ class RangeEqualsVisitor : public ArrayVisitor { protected: const Array& right_; - int32_t left_start_idx_; - int32_t left_end_idx_; - int32_t right_start_idx_; + int64_t left_start_idx_; + int64_t left_end_idx_; + int64_t right_start_idx_; bool result_; }; @@ -314,7 +314,7 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { const uint8_t* left_data = left.data()->data(); const uint8_t* right_data = right.data()->data(); - for (int i = 0; i < left.length(); ++i) { + for (int64_t i = 0; i < left.length(); ++i) { if (!left.IsNull(i) && BitUtil::GetBit(left_data, i) != BitUtil::GetBit(right_data, i)) { result_ = false; @@ -339,7 +339,7 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { const uint8_t* right_data = right.data()->data() + right.offset() * value_byte_size; if (left.null_count() > 0) { - for (int i = 0; i < left.length(); ++i) { + for (int64_t i = 0; i < left.length(); ++i) { if (!left.IsNull(i) && memcmp(left_data, right_data, value_byte_size)) { return false; } @@ -401,7 +401,7 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { reinterpret_cast(right.value_offsets()->data()) + right.offset(); - for (int32_t i = 0; i < left.length() + 1; ++i) { + for (int64_t i = 0; i < left.length() + 1; ++i) { if (left_offsets[i] - left_offsets[0] != right_offsets[i] - right_offsets[0]) { return false; } @@ -437,7 +437,7 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { // ARROW-537: Only compare data in non-null slots const int32_t* left_offsets = left.raw_value_offsets(); const int32_t* right_offsets = right.raw_value_offsets(); - for (int32_t i = 0; i < left.length(); ++i) { + for (int64_t i = 0; i < left.length(); ++i) { if (left.IsNull(i)) { continue; } if (std::memcmp(left_data + left_offsets[i], right_data + right_offsets[i], left.value_length(i))) { @@ -496,15 +496,15 @@ inline bool FloatingApproxEquals( const T* left_data = left.raw_data(); const T* right_data = right.raw_data(); - static constexpr T EPSILON = 1E-5; + static constexpr T EPSILON = static_cast(1E-5); if (left.null_count() > 0) { - for (int32_t i = 0; i < left.length(); ++i) { + for (int64_t i = 0; i < left.length(); ++i) { if (left.IsNull(i)) continue; if (fabs(left_data[i] - right_data[i]) > EPSILON) { return false; } } } else { - for (int32_t i = 0; i < left.length(); ++i) { + for 
(int64_t i = 0; i < left.length(); ++i) { if (fabs(left_data[i] - right_data[i]) > EPSILON) { return false; } } } @@ -556,8 +556,8 @@ Status ArrayEquals(const Array& left, const Array& right, bool* are_equal) { return Status::OK(); } -Status ArrayRangeEquals(const Array& left, const Array& right, int32_t left_start_idx, - int32_t left_end_idx, int32_t right_start_idx, bool* are_equal) { +Status ArrayRangeEquals(const Array& left, const Array& right, int64_t left_start_idx, + int64_t left_end_idx, int64_t right_start_idx, bool* are_equal) { if (&left == &right) { *are_equal = true; } else if (left.type_enum() != right.type_enum()) { diff --git a/cpp/src/arrow/compare.h b/cpp/src/arrow/compare.h index 6a71f9fd573ba..1ddf0497dd3d9 100644 --- a/cpp/src/arrow/compare.h +++ b/cpp/src/arrow/compare.h @@ -40,7 +40,7 @@ Status ARROW_EXPORT ArrayApproxEquals( /// Returns true if indicated equal-length segment of arrays is exactly equal Status ARROW_EXPORT ArrayRangeEquals(const Array& left, const Array& right, - int32_t start_idx, int32_t end_idx, int32_t other_start_idx, bool* are_equal); + int64_t start_idx, int64_t end_idx, int64_t other_start_idx, bool* are_equal); /// Returns true if the type metadata are exactly equal Status ARROW_EXPORT TypeEquals( diff --git a/cpp/src/arrow/io/file.cc b/cpp/src/arrow/io/file.cc index c1f0ea48a9dc9..230c7fe0fb4a0 100644 --- a/cpp/src/arrow/io/file.cc +++ b/cpp/src/arrow/io/file.cc @@ -263,9 +263,9 @@ static inline Status FileWrite(int fd, const uint8_t* buffer, int64_t nbytes) { if (nbytes > INT32_MAX) { return Status::IOError("Unable to write > 2GB blocks to file yet"); } - ret = _write(fd, buffer, static_cast(nbytes)); + ret = static_cast(_write(fd, buffer, static_cast(nbytes))); #else - ret = write(fd, buffer, nbytes); + ret = static_cast(write(fd, buffer, nbytes)); #endif if (ret == -1) { @@ -303,9 +303,9 @@ static inline Status FileClose(int fd) { int ret; #if defined(_MSC_VER) - ret = _close(fd); + ret = static_cast(_close(fd)); #else - ret = close(fd); + ret = static_cast(close(fd)); #endif if (ret == -1) { return Status::IOError("error closing file"); } diff --git a/cpp/src/arrow/io/hdfs.cc b/cpp/src/arrow/io/hdfs.cc index 5682f44b94a46..408b85f8daeb7 100644 --- a/cpp/src/arrow/io/hdfs.cc +++ b/cpp/src/arrow/io/hdfs.cc @@ -114,7 +114,7 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { tSize ret; if (driver_->HasPread()) { ret = driver_->Pread(fs_, file_, static_cast(position), - reinterpret_cast(buffer), nbytes); + reinterpret_cast(buffer), static_cast(nbytes)); } else { RETURN_NOT_OK(Seek(position)); return Read(nbytes, bytes_read, buffer); @@ -141,7 +141,7 @@ class HdfsReadableFile::HdfsReadableFileImpl : public HdfsAnyFileImpl { int64_t total_bytes = 0; while (total_bytes < nbytes) { tSize ret = driver_->Read(fs_, file_, reinterpret_cast(buffer + total_bytes), - std::min(buffer_size_, nbytes - total_bytes)); + static_cast(std::min(buffer_size_, nbytes - total_bytes))); RETURN_NOT_OK(CheckReadResult(ret)); total_bytes += ret; if (ret == 0) { break; } @@ -253,7 +253,8 @@ class HdfsOutputStream::HdfsOutputStreamImpl : public HdfsAnyFileImpl { } Status Write(const uint8_t* buffer, int64_t nbytes, int64_t* bytes_written) { - tSize ret = driver_->Write(fs_, file_, reinterpret_cast(buffer), nbytes); + tSize ret = driver_->Write( + fs_, file_, reinterpret_cast(buffer), static_cast(nbytes)); CHECK_FAILURE(ret, "Write"); *bytes_written = ret; return Status::OK(); @@ -328,7 +329,7 @@ class HdfsClient::HdfsClientImpl { if 
(!config->host.empty()) { driver_->BuilderSetNameNode(builder, config->host.c_str()); } - driver_->BuilderSetNameNodePort(builder, config->port); + driver_->BuilderSetNameNodePort(builder, static_cast(config->port)); if (!config->user.empty()) { driver_->BuilderSetUserName(builder, config->user.c_str()); } @@ -411,7 +412,7 @@ class HdfsClient::HdfsClientImpl { // Allocate additional space for elements - int vec_offset = listing->size(); + int vec_offset = static_cast(listing->size()); listing->resize(vec_offset + num_entries); for (int i = 0; i < num_entries; ++i) { @@ -449,8 +450,8 @@ class HdfsClient::HdfsClientImpl { int flags = O_WRONLY; if (append) flags |= O_APPEND; - hdfsFile handle = driver_->OpenFile( - fs_, path.c_str(), flags, buffer_size, replication, default_block_size); + hdfsFile handle = driver_->OpenFile(fs_, path.c_str(), flags, buffer_size, + replication, static_cast(default_block_size)); if (handle == nullptr) { // TODO(wesm): determine cause of failure diff --git a/cpp/src/arrow/io/io-hdfs-test.cc b/cpp/src/arrow/io/io-hdfs-test.cc index f0e5a280d3116..648d4baac9b6f 100644 --- a/cpp/src/arrow/io/io-hdfs-test.cc +++ b/cpp/src/arrow/io/io-hdfs-test.cc @@ -49,7 +49,7 @@ class TestHdfsClient : public ::testing::Test { } Status WriteDummyFile(const std::string& path, const uint8_t* buffer, int64_t size, - bool append = false, int buffer_size = 0, int replication = 0, + bool append = false, int buffer_size = 0, int16_t replication = 0, int default_block_size = 0) { std::shared_ptr file; RETURN_NOT_OK(client_->OpenWriteable( diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index 2be87a35e7fb3..f11c88a6e1e4b 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -20,6 +20,7 @@ #include #include #include +#include #include #include @@ -65,8 +66,14 @@ class RecordBatchWriter : public ArrayVisitor { if (max_recursion_depth_ <= 0) { return Status::Invalid("Max recursion depth reached"); } + + if (arr.length() > std::numeric_limits::max()) { + return Status::Invalid("Cannot write arrays larger than 2^31 - 1 in length"); + } + // push back all common elements - field_nodes_.push_back(flatbuf::FieldNode(arr.length(), arr.null_count())); + field_nodes_.push_back(flatbuf::FieldNode( + static_cast(arr.length()), static_cast(arr.null_count()))); if (arr.null_count() > 0) { std::shared_ptr bitmap = arr.null_bitmap(); @@ -152,13 +159,14 @@ class RecordBatchWriter : public ArrayVisitor { int64_t start_offset; RETURN_NOT_OK(dst->Tell(&start_offset)); - int64_t padded_metadata_length = metadata_fb->size() + 4; - const int remainder = (padded_metadata_length + start_offset) % 8; + int32_t padded_metadata_length = static_cast(metadata_fb->size()) + 4; + const int32_t remainder = + (padded_metadata_length + static_cast(start_offset)) % 8; if (remainder != 0) { padded_metadata_length += 8 - remainder; } // The returned metadata size includes the length prefix, the flatbuffer, // plus padding - *metadata_length = static_cast(padded_metadata_length); + *metadata_length = padded_metadata_length; // Write the flatbuffer size prefix including padding int32_t flatbuffer_size = padded_metadata_length - 4; @@ -169,7 +177,8 @@ class RecordBatchWriter : public ArrayVisitor { RETURN_NOT_OK(dst->Write(metadata_fb->data(), metadata_fb->size())); // Write any padding - int64_t padding = padded_metadata_length - metadata_fb->size() - 4; + int32_t padding = + padded_metadata_length - static_cast(metadata_fb->size()) - 4; if (padding > 0) { 
RETURN_NOT_OK(dst->Write(kPaddingBytes, padding)); } return Status::OK(); @@ -184,7 +193,8 @@ class RecordBatchWriter : public ArrayVisitor { RETURN_NOT_OK(dst->Tell(&start_position)); #endif - RETURN_NOT_OK(WriteMetadata(batch.num_rows(), *body_length, dst, metadata_length)); + RETURN_NOT_OK(WriteMetadata( + static_cast(batch.num_rows()), *body_length, dst, metadata_length)); #ifndef NDEBUG RETURN_NOT_OK(dst->Tell(¤t_position)); @@ -430,7 +440,7 @@ class RecordBatchWriter : public ArrayVisitor { int32_t* shifted_offsets = reinterpret_cast(shifted_offsets_buffer->mutable_data()); - for (int32_t i = 0; i < array.length(); ++i) { + for (int64_t i = 0; i < array.length(); ++i) { const uint8_t code = type_ids[i]; int32_t shift = child_offsets[code]; if (shift == -1) { child_offsets[code] = shift = unshifted_offsets[i]; } diff --git a/cpp/src/arrow/ipc/ipc-json-test.cc b/cpp/src/arrow/ipc/ipc-json-test.cc index 3e759cccbbccc..4c18a496f4c80 100644 --- a/cpp/src/arrow/ipc/ipc-json-test.cc +++ b/cpp/src/arrow/ipc/ipc-json-test.cc @@ -240,7 +240,7 @@ TEST(TestJsonFileReadWrite, BasicRoundTrip) { const int nbatches = 3; std::vector> batches; for (int i = 0; i < nbatches; ++i) { - int32_t num_rows = 5 + i * 5; + int num_rows = 5 + i * 5; std::vector> arrays; MakeBatchArrays(schema, num_rows, &arrays); diff --git a/cpp/src/arrow/ipc/json-internal.cc b/cpp/src/arrow/ipc/json-internal.cc index 6253cd6b43605..0458b85f0078a 100644 --- a/cpp/src/arrow/ipc/json-internal.cc +++ b/cpp/src/arrow/ipc/json-internal.cc @@ -355,7 +355,7 @@ class JsonArrayWriter : public ArrayVisitor { writer_->String(name); writer_->Key("count"); - writer_->Int(arr.length()); + writer_->Int(static_cast(arr.length())); RETURN_NOT_OK(arr.Accept(this)); @@ -394,7 +394,7 @@ class JsonArrayWriter : public ArrayVisitor { template typename std::enable_if::value, void>::type WriteDataValues(const T& arr) { - for (int i = 0; i < arr.length(); ++i) { + for (int64_t i = 0; i < arr.length(); ++i) { int32_t length; const char* buf = reinterpret_cast(arr.GetValue(i, &length)); @@ -430,7 +430,7 @@ class JsonArrayWriter : public ArrayVisitor { } template - void WriteIntegerField(const char* name, const T* values, int32_t length) { + void WriteIntegerField(const char* name, const T* values, int64_t length) { writer_->Key(name); writer_->StartArray(); for (int i = 0; i < length; ++i) { @@ -573,7 +573,7 @@ class JsonSchemaReader { const auto& values = obj.GetArray(); fields->resize(values.Size()); - for (size_t i = 0; i < fields->size(); ++i) { + for (rj::SizeType i = 0; i < fields->size(); ++i) { RETURN_NOT_OK(GetField(values[i], &(*fields)[i])); } return Status::OK(); @@ -712,7 +712,7 @@ class JsonSchemaReader { const auto& id_array = json_type_codes->value.GetArray(); for (const rj::Value& val : id_array) { DCHECK(val.IsUint()); - type_codes.push_back(val.GetUint()); + type_codes.push_back(static_cast(val.GetUint())); } *type = union_(children, type_codes, mode); @@ -770,10 +770,38 @@ static inline Status ParseHexValue(const char* data, uint8_t* out) { // Error checking if (*pos1 != c1 || *pos2 != c2) { return Status::Invalid("Encountered non-hex digit"); } - *out = (pos1 - kAsciiTable) << 4 | (pos2 - kAsciiTable); + *out = static_cast((pos1 - kAsciiTable) << 4 | (pos2 - kAsciiTable)); return Status::OK(); } +template +inline typename std::enable_if::value, typename T::c_type>::type +UnboxValue(const rj::Value& val) { + DCHECK(val.IsInt()); + return static_cast(val.GetInt64()); +} + +template +inline typename std::enable_if::value, typename 
T::c_type>::type +UnboxValue(const rj::Value& val) { + DCHECK(val.IsUint()); + return static_cast(val.GetUint64()); +} + +template +inline typename std::enable_if::value, typename T::c_type>::type +UnboxValue(const rj::Value& val) { + DCHECK(val.IsFloat()); + return static_cast(val.GetDouble()); +} + +template +inline typename std::enable_if::value, bool>::type +UnboxValue(const rj::Value& val) { + DCHECK(val.IsBool()); + return val.GetBool(); +} + class JsonArrayReader { public: explicit JsonArrayReader(MemoryPool* pool) : pool_(pool) {} @@ -820,22 +848,7 @@ class JsonArrayReader { } const rj::Value& val = json_data_arr[i]; - if (IsSignedInt::value) { - DCHECK(val.IsInt()); - builder.Append(val.GetInt64()); - } else if (IsUnsignedInt::value) { - DCHECK(val.IsUint()); - builder.Append(val.GetUint64()); - } else if (IsFloatingPoint::value) { - DCHECK(val.IsFloat()); - builder.Append(val.GetDouble()); - } else if (std::is_base_of::value) { - DCHECK(val.IsBool()); - builder.Append(val.GetBool()); - } else { - // We are in the wrong function - return Status::Invalid(type->ToString()); - } + builder.Append(UnboxValue(val)); } return builder.Finish(array); @@ -869,13 +882,13 @@ class JsonArrayReader { std::string hex_string = val.GetString(); DCHECK(hex_string.size() % 2 == 0) << "Expected base16 hex string"; - int64_t length = static_cast(hex_string.size()) / 2; + int32_t length = static_cast(hex_string.size()) / 2; if (byte_buffer->size() < length) { RETURN_NOT_OK(byte_buffer->Resize(length)); } const char* hex_data = hex_string.c_str(); uint8_t* byte_buffer_data = byte_buffer->mutable_data(); - for (int64_t j = 0; j < length; ++j) { + for (int32_t j = 0; j < length; ++j) { RETURN_NOT_OK(ParseHexValue(hex_data + j * 2, &byte_buffer_data[j])); } RETURN_NOT_OK(builder.Append(byte_buffer_data, length)); diff --git a/cpp/src/arrow/ipc/json.cc b/cpp/src/arrow/ipc/json.cc index 773fb74a1767a..a01be191aa8ad 100644 --- a/cpp/src/arrow/ipc/json.cc +++ b/cpp/src/arrow/ipc/json.cc @@ -69,7 +69,7 @@ class JsonWriter::JsonWriterImpl { writer_->StartObject(); writer_->Key("count"); - writer_->Int(batch.num_rows()); + writer_->Int(static_cast(batch.num_rows())); writer_->Key("columns"); writer_->StartArray(); @@ -158,7 +158,7 @@ class JsonReader::JsonReaderImpl { const auto& json_columns = it->value.GetArray(); std::vector> columns(json_columns.Size()); - for (size_t i = 0; i < columns.size(); ++i) { + for (int i = 0; i < static_cast(columns.size()); ++i) { const std::shared_ptr& type = schema_->field(i)->type; RETURN_NOT_OK(ReadJsonArray(pool_, json_columns[i], type, &columns[i])); } diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index 7c8ddb93c09d1..1cc4a235b81bd 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -214,7 +214,8 @@ static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, vector_type = flatbuf::VectorType_DATA; break; } - auto offset = flatbuf::CreateVectorLayout(fbb, descr.bit_width(), vector_type); + auto offset = flatbuf::CreateVectorLayout( + fbb, static_cast(descr.bit_width()), vector_type); layout->push_back(offset); } @@ -328,7 +329,7 @@ Status FieldFromFlatbufferDictionary( std::shared_ptr type; auto children = field->children(); std::vector> child_fields(children->size()); - for (size_t i = 0; i < children->size(); ++i) { + for (int i = 0; i < static_cast(children->size()); ++i) { RETURN_NOT_OK(FieldFromFlatbuffer(children->Get(i), dummy_memo, &child_fields[i])); } @@ -350,7 
+351,7 @@ Status FieldFromFlatbuffer(const flatbuf::Field* field, // children to fully reconstruct the data type auto children = field->children(); std::vector> child_fields(children->size()); - for (size_t i = 0; i < children->size(); ++i) { + for (int i = 0; i < static_cast(children->size()); ++i) { RETURN_NOT_OK( FieldFromFlatbuffer(children->Get(i), dictionary_memo, &child_fields[i])); } diff --git a/cpp/src/arrow/ipc/reader.cc b/cpp/src/arrow/ipc/reader.cc index 1a9af7db3dcdc..973416670bdfa 100644 --- a/cpp/src/arrow/ipc/reader.cc +++ b/cpp/src/arrow/ipc/reader.cc @@ -203,7 +203,7 @@ class FileReader::FileReaderImpl { } std::shared_ptr buffer; - int file_end_size = magic_size + sizeof(int32_t); + int file_end_size = static_cast(magic_size + sizeof(int32_t)); RETURN_NOT_OK(file_->ReadAt(footer_offset_ - file_end_size, file_end_size, &buffer)); if (memcmp(buffer->data() + sizeof(int32_t), kArrowMagicBytes, magic_size)) { diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index 07f786c4d1d77..dc823662ee1ef 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -51,7 +51,7 @@ const auto kListInt32 = list(int32()); const auto kListListInt32 = list(kListInt32); Status MakeRandomInt32Array( - int32_t length, bool include_nulls, MemoryPool* pool, std::shared_ptr* out) { + int64_t length, bool include_nulls, MemoryPool* pool, std::shared_ptr* out) { std::shared_ptr data; test::MakeRandomInt32PoolBuffer(length, pool, &data); Int32Builder builder(pool, int32()); @@ -79,7 +79,7 @@ Status MakeRandomListArray(const std::shared_ptr& child_array, int num_li std::vector list_sizes(num_lists, 0); std::vector offsets( num_lists + 1, 0); // +1 so we can shift for nulls. See partial sum below. - const int seed = child_array->length(); + const uint32_t seed = static_cast(child_array->length()); if (num_lists > 0) { test::rand_uniform_int(num_lists, seed, 0, max_list_size, list_sizes.data()); // make sure sizes are consistent with null @@ -89,7 +89,7 @@ Status MakeRandomListArray(const std::shared_ptr& child_array, int num_li std::partial_sum(list_sizes.begin(), list_sizes.end(), ++offsets.begin()); // Force invariants - const int child_length = child_array->length(); + const int64_t child_length = child_array->length(); offsets[0] = 0; std::replace_if(offsets.begin(), offsets.end(), [child_length](int32_t offset) { return offset > child_length; }, child_length); @@ -121,26 +121,26 @@ Status MakeIntRecordBatch(std::shared_ptr* out) { template Status MakeRandomBinaryArray( - int32_t length, MemoryPool* pool, std::shared_ptr* out) { + int64_t length, MemoryPool* pool, std::shared_ptr* out) { const std::vector values = { "", "", "abc", "123", "efg", "456!@#!@#", "12312"}; Builder builder(pool); - const auto values_len = values.size(); - for (int32_t i = 0; i < length; ++i) { - int values_index = i % values_len; + const size_t values_len = values.size(); + for (int64_t i = 0; i < length; ++i) { + int64_t values_index = i % values_len; if (values_index == 0) { RETURN_NOT_OK(builder.AppendNull()); } else { const std::string& value = values[values_index]; - RETURN_NOT_OK( - builder.Append(reinterpret_cast(value.data()), value.size())); + RETURN_NOT_OK(builder.Append(reinterpret_cast(value.data()), + static_cast(value.size()))); } } return builder.Finish(out); } Status MakeStringTypesRecordBatch(std::shared_ptr* out) { - const int32_t length = 500; + const int64_t length = 500; auto string_type = utf8(); auto binary_type = binary(); auto f0 = field("f0", 
string_type); @@ -302,7 +302,7 @@ Status MakeUnion(std::shared_ptr* out) { std::vector> sparse_children(2); std::vector> dense_children(2); - const int32_t length = 7; + const int64_t length = 7; std::shared_ptr type_ids_buffer; std::vector type_ids = {5, 10, 5, 5, 10, 10, 5}; @@ -346,7 +346,7 @@ Status MakeUnion(std::shared_ptr* out) { } Status MakeDictionary(std::shared_ptr* out) { - const int32_t length = 6; + const int64_t length = 6; std::vector is_valid = {true, true, false, true, true, true}; std::shared_ptr dict1, dict2; diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc index 975b0d10cae7d..58402b588404c 100644 --- a/cpp/src/arrow/ipc/writer.cc +++ b/cpp/src/arrow/ipc/writer.cc @@ -61,7 +61,7 @@ class StreamWriter::StreamWriterImpl { std::shared_ptr schema_fb; RETURN_NOT_OK(WriteSchemaMessage(*schema_, dictionary_memo_.get(), &schema_fb)); - int32_t flatbuffer_size = schema_fb->size(); + int32_t flatbuffer_size = static_cast(schema_fb->size()); RETURN_NOT_OK( Write(reinterpret_cast(&flatbuffer_size), sizeof(int32_t))); @@ -252,7 +252,7 @@ class FileWriter::FileWriterImpl : public StreamWriter::StreamWriterImpl { RETURN_NOT_OK(UpdatePosition()); // Write footer length - int32_t footer_length = position_ - initial_position; + int32_t footer_length = static_cast(position_ - initial_position); if (footer_length <= 0) { return Status::Invalid("Invalid file footer"); } diff --git a/cpp/src/arrow/pretty_print.cc b/cpp/src/arrow/pretty_print.cc index 23c05807c16ee..7e69e42800e79 100644 --- a/cpp/src/arrow/pretty_print.cc +++ b/cpp/src/arrow/pretty_print.cc @@ -196,7 +196,7 @@ class ArrayPrinter : public ArrayVisitor { } Status PrintChildren( - const std::vector>& fields, int32_t offset, int32_t length) { + const std::vector>& fields, int64_t offset, int64_t length) { for (size_t i = 0; i < fields.size(); ++i) { Newline(); std::stringstream ss; diff --git a/cpp/src/arrow/schema.cc b/cpp/src/arrow/schema.cc index cd8256e658ec3..aa38fd3dd9260 100644 --- a/cpp/src/arrow/schema.cc +++ b/cpp/src/arrow/schema.cc @@ -45,7 +45,7 @@ bool Schema::Equals(const std::shared_ptr& other) const { std::shared_ptr Schema::GetFieldByName(const std::string& name) { if (fields_.size() > 0 && name_to_index_.size() == 0) { for (size_t i = 0; i < fields_.size(); ++i) { - name_to_index_[fields_[i]->name] = i; + name_to_index_[fields_[i]->name] = static_cast(i); } } diff --git a/cpp/src/arrow/schema.h b/cpp/src/arrow/schema.h index 0e1ab5c368e98..37cdbf7d786a4 100644 --- a/cpp/src/arrow/schema.h +++ b/cpp/src/arrow/schema.h @@ -47,7 +47,7 @@ class ARROW_EXPORT Schema { // Render a string representation of the schema suitable for debugging std::string ToString() const; - int num_fields() const { return fields_.size(); } + int num_fields() const { return static_cast(fields_.size()); } private: std::vector> fields_; diff --git a/cpp/src/arrow/status.cc b/cpp/src/arrow/status.cc index e1a242721eccc..3a39c8409a5f7 100644 --- a/cpp/src/arrow/status.cc +++ b/cpp/src/arrow/status.cc @@ -18,7 +18,7 @@ namespace arrow { Status::Status(StatusCode code, const std::string& msg, int16_t posix_code) { assert(code != StatusCode::OK); - const uint32_t size = msg.size(); + const uint32_t size = static_cast(msg.size()); char* result = new char[size + 7]; memcpy(result, &size, sizeof(size)); result[4] = static_cast(code); diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc index 25f12c4b4300d..36374731cbb49 100644 --- a/cpp/src/arrow/table-test.cc +++ b/cpp/src/arrow/table-test.cc @@ -150,7 
+150,7 @@ TEST_F(TestTable, Equals) { } TEST_F(TestTable, FromRecordBatches) { - const int32_t length = 10; + const int64_t length = 10; MakeExample1(length); auto batch1 = std::make_shared(schema_, length, arrays_); @@ -184,7 +184,7 @@ TEST_F(TestTable, FromRecordBatches) { } TEST_F(TestTable, ConcatenateTables) { - const int32_t length = 10; + const int64_t length = 10; MakeExample1(length); auto batch1 = std::make_shared(schema_, length, arrays_); diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc index 8ac06b8cb7811..6b957c081e502 100644 --- a/cpp/src/arrow/table.cc +++ b/cpp/src/arrow/table.cc @@ -29,7 +29,7 @@ namespace arrow { -RecordBatch::RecordBatch(const std::shared_ptr& schema, int num_rows, +RecordBatch::RecordBatch(const std::shared_ptr& schema, int64_t num_rows, const std::vector>& columns) : schema_(schema), num_rows_(num_rows), columns_(columns) {} @@ -61,18 +61,18 @@ bool RecordBatch::ApproxEquals(const RecordBatch& other) const { return true; } -std::shared_ptr RecordBatch::Slice(int32_t offset) { +std::shared_ptr RecordBatch::Slice(int64_t offset) { return Slice(offset, this->num_rows() - offset); } -std::shared_ptr RecordBatch::Slice(int32_t offset, int32_t length) { +std::shared_ptr RecordBatch::Slice(int64_t offset, int64_t length) { std::vector> arrays; arrays.reserve(num_columns()); for (const auto& field : columns_) { arrays.emplace_back(field->Slice(offset, length)); } - int32_t num_rows = std::min(num_rows_ - offset, length); + int64_t num_rows = std::min(num_rows_ - offset, length); return std::make_shared(schema_, num_rows, arrays); } @@ -169,7 +169,7 @@ bool Table::Equals(const Table& other) const { if (!schema_->Equals(other.schema())) { return false; } if (static_cast(columns_.size()) != other.num_columns()) { return false; } - for (size_t i = 0; i < columns_.size(); i++) { + for (int i = 0; i < static_cast(columns_.size()); i++) { if (!columns_[i]->Equals(other.column(i))) { return false; } } return true; diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h index fa56824a5a1bc..68f664b38a365 100644 --- a/cpp/src/arrow/table.h +++ b/cpp/src/arrow/table.h @@ -40,7 +40,7 @@ class ARROW_EXPORT RecordBatch { // num_rows is a parameter to allow for record batches of a particular size not // having any materialized columns. 
Each array should have the same length as // num_rows - RecordBatch(const std::shared_ptr& schema, int32_t num_rows, + RecordBatch(const std::shared_ptr& schema, int64_t num_rows, const std::vector>& columns); bool Equals(const RecordBatch& other) const; @@ -59,18 +59,18 @@ class ARROW_EXPORT RecordBatch { const std::string& column_name(int i) const; // @returns: the number of columns in the table - int num_columns() const { return columns_.size(); } + int num_columns() const { return static_cast(columns_.size()); } // @returns: the number of rows (the corresponding length of each column) - int32_t num_rows() const { return num_rows_; } + int64_t num_rows() const { return num_rows_; } /// Slice each of the arrays in the record batch and construct a new RecordBatch object - std::shared_ptr Slice(int32_t offset); - std::shared_ptr Slice(int32_t offset, int32_t length); + std::shared_ptr Slice(int64_t offset); + std::shared_ptr Slice(int64_t offset, int64_t length); private: std::shared_ptr schema_; - int32_t num_rows_; + int64_t num_rows_; std::vector> columns_; }; @@ -105,7 +105,7 @@ class ARROW_EXPORT Table { std::shared_ptr column(int i) const { return columns_[i]; } // @returns: the number of columns in the table - int num_columns() const { return columns_.size(); } + int num_columns() const { return static_cast(columns_.size()); } // @returns: the number of rows (the corresponding length of each column) int64_t num_rows() const { return num_rows_; } diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index ffc78067d1b97..5c7d04de6dfbb 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -73,16 +73,17 @@ void randint(int64_t N, T lower, T upper, std::vector* out) { T val; for (int64_t i = 0; i < N; ++i) { draw = rng.Uniform64(span); - val = lower + static_cast(draw); + val = static_cast(draw + lower); out->push_back(val); } } template -void random_real(int n, uint32_t seed, T min_value, T max_value, std::vector* out) { +void random_real( + int64_t n, uint32_t seed, T min_value, T max_value, std::vector* out) { std::mt19937 gen(seed); std::uniform_real_distribution d(min_value, max_value); - for (int i = 0; i < n; ++i) { + for (int64_t i = 0; i < n; ++i) { out->push_back(d(gen)); } } @@ -108,13 +109,13 @@ inline Status CopyBufferFromVector( static inline Status GetBitmapFromBoolVector( const std::vector& is_valid, std::shared_ptr* result) { - int length = static_cast(is_valid.size()); + int64_t length = static_cast(is_valid.size()); std::shared_ptr buffer; RETURN_NOT_OK(GetEmptyBitmap(default_memory_pool(), length, &buffer)); uint8_t* bitmap = buffer->mutable_data(); - for (int i = 0; i < length; ++i) { + for (int64_t i = 0; i < length; ++i) { if (is_valid[i]) { BitUtil::SetBit(bitmap, i); } } @@ -126,7 +127,7 @@ static inline Status GetBitmapFromBoolVector( // and the rest to non-zero (true) values. 
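// Each byte is an independent draw against pct_null, so the realized null
// fraction only approximates pct_null and tightens as n grows.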
static inline void random_null_bytes(int64_t n, double pct_null, uint8_t* null_bytes) { Random rng(random_seed()); - for (int i = 0; i < n; ++i) { + for (int64_t i = 0; i < n; ++i) { null_bytes[i] = rng.NextDoubleFraction() > pct_null; } } @@ -134,41 +135,41 @@ static inline void random_null_bytes(int64_t n, double pct_null, uint8_t* null_b static inline void random_is_valid( int64_t n, double pct_null, std::vector* is_valid) { Random rng(random_seed()); - for (int i = 0; i < n; ++i) { + for (int64_t i = 0; i < n; ++i) { is_valid->push_back(rng.NextDoubleFraction() > pct_null); } } -static inline void random_bytes(int n, uint32_t seed, uint8_t* out) { +static inline void random_bytes(int64_t n, uint32_t seed, uint8_t* out) { std::mt19937 gen(seed); std::uniform_int_distribution d(0, 255); - for (int i = 0; i < n; ++i) { - out[i] = d(gen) & 0xFF; + for (int64_t i = 0; i < n; ++i) { + out[i] = static_cast(d(gen) & 0xFF); } } -static inline void random_ascii(int n, uint32_t seed, uint8_t* out) { +static inline void random_ascii(int64_t n, uint32_t seed, uint8_t* out) { std::mt19937 gen(seed); std::uniform_int_distribution d(65, 122); - for (int i = 0; i < n; ++i) { - out[i] = d(gen) & 0xFF; + for (int64_t i = 0; i < n; ++i) { + out[i] = static_cast(d(gen) & 0xFF); } } template -void rand_uniform_int(int n, uint32_t seed, T min_value, T max_value, T* out) { +void rand_uniform_int(int64_t n, uint32_t seed, T min_value, T max_value, T* out) { DCHECK(out || (n == 0)); std::mt19937 gen(seed); std::uniform_int_distribution d(min_value, max_value); - for (int i = 0; i < n; ++i) { - out[i] = d(gen); + for (int64_t i = 0; i < n; ++i) { + out[i] = static_cast(d(gen)); } } -static inline int null_count(const std::vector& valid_bytes) { - int result = 0; +static inline int64_t null_count(const std::vector& valid_bytes) { + int64_t result = 0; for (size_t i = 0; i < valid_bytes.size(); ++i) { if (valid_bytes[i] == 0) { ++result; } } @@ -183,7 +184,7 @@ std::shared_ptr bytes_to_null_buffer(const std::vector& bytes) return out; } -Status MakeRandomInt32PoolBuffer(int32_t length, MemoryPool* pool, +Status MakeRandomInt32PoolBuffer(int64_t length, MemoryPool* pool, std::shared_ptr* pool_buffer, uint32_t seed = 0) { DCHECK(pool); auto data = std::make_shared(pool); @@ -194,7 +195,7 @@ Status MakeRandomInt32PoolBuffer(int32_t length, MemoryPool* pool, return Status::OK(); } -Status MakeRandomBytePoolBuffer(int32_t length, MemoryPool* pool, +Status MakeRandomBytePoolBuffer(int64_t length, MemoryPool* pool, std::shared_ptr* pool_buffer, uint32_t seed = 0) { auto bytes = std::make_shared(pool); RETURN_NOT_OK(bytes->Resize(length)); @@ -213,7 +214,7 @@ class TestBase : public ::testing::Test { } template - std::shared_ptr MakePrimitive(int32_t length, int32_t null_count = 0) { + std::shared_ptr MakePrimitive(int64_t length, int64_t null_count = 0) { auto data = std::make_shared(pool_); const int64_t data_nbytes = length * sizeof(typename ArrayType::value_type); EXPECT_OK(data->Resize(data_nbytes)); @@ -275,9 +276,9 @@ class TestBuilder : public ::testing::Test { template Status MakeArray(const std::vector& valid_bytes, const std::vector& values, - int size, Builder* builder, std::shared_ptr* out) { + int64_t size, Builder* builder, std::shared_ptr* out) { // Append the first 1000 - for (int i = 0; i < size; ++i) { + for (int64_t i = 0; i < size; ++i) { if (valid_bytes[i] > 0) { RETURN_NOT_OK(builder->Append(values[i])); } else { diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 
9a97fc30094b9..9b1ab3288eb8c 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -162,7 +162,7 @@ struct ARROW_EXPORT DataType { const std::vector>& children() const { return children_; } - int num_children() const { return children_.size(); } + int num_children() const { return static_cast(children_.size()); } virtual Status Accept(TypeVisitor* visitor) const = 0; @@ -226,7 +226,7 @@ struct ARROW_EXPORT CTypeImpl : public PrimitiveCType { CTypeImpl() : PrimitiveCType(TYPE_ID) {} - int bit_width() const override { return sizeof(C_TYPE) * 8; } + int bit_width() const override { return static_cast(sizeof(C_TYPE) * 8); } Status Accept(TypeVisitor* visitor) const override { return visitor->Visit(*static_cast(this)); @@ -432,7 +432,7 @@ struct ARROW_EXPORT DateType : public FixedWidthType { DateType() : FixedWidthType(Type::DATE) {} - int bit_width() const override { return sizeof(c_type) * 8; } + int bit_width() const override { return static_cast(sizeof(c_type) * 8); } Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; @@ -448,7 +448,7 @@ struct ARROW_EXPORT TimeType : public FixedWidthType { TimeUnit unit; - int bit_width() const override { return sizeof(c_type) * 8; } + int bit_width() const override { return static_cast(sizeof(c_type) * 8); } explicit TimeType(TimeUnit unit = TimeUnit::MILLI) : FixedWidthType(Type::TIME), unit(unit) {} @@ -465,7 +465,7 @@ struct ARROW_EXPORT TimestampType : public FixedWidthType { typedef int64_t c_type; static constexpr Type::type type_id = Type::TIMESTAMP; - int bit_width() const override { return sizeof(int64_t) * 8; } + int bit_width() const override { return static_cast(sizeof(int64_t) * 8); } TimeUnit unit; @@ -485,7 +485,7 @@ struct ARROW_EXPORT IntervalType : public FixedWidthType { using c_type = int64_t; static constexpr Type::type type_id = Type::INTERVAL; - int bit_width() const override { return sizeof(int64_t) * 8; } + int bit_width() const override { return static_cast(sizeof(int64_t) * 8); } Unit unit; diff --git a/cpp/src/arrow/type_traits.h b/cpp/src/arrow/type_traits.h index c4898b1ac8ce2..d6687c11bcf73 100644 --- a/cpp/src/arrow/type_traits.h +++ b/cpp/src/arrow/type_traits.h @@ -32,7 +32,7 @@ template <> struct TypeTraits { using ArrayType = UInt8Array; using BuilderType = UInt8Builder; - static inline int bytes_required(int elements) { return elements; } + static inline int64_t bytes_required(int64_t elements) { return elements; } constexpr static bool is_parameter_free = true; static inline std::shared_ptr type_singleton() { return uint8(); } }; @@ -41,7 +41,7 @@ template <> struct TypeTraits { using ArrayType = Int8Array; using BuilderType = Int8Builder; - static inline int bytes_required(int elements) { return elements; } + static inline int64_t bytes_required(int64_t elements) { return elements; } constexpr static bool is_parameter_free = true; static inline std::shared_ptr type_singleton() { return int8(); } }; @@ -51,7 +51,9 @@ struct TypeTraits { using ArrayType = UInt16Array; using BuilderType = UInt16Builder; - static inline int bytes_required(int elements) { return elements * sizeof(uint16_t); } + static inline int64_t bytes_required(int64_t elements) { + return elements * sizeof(uint16_t); + } constexpr static bool is_parameter_free = true; static inline std::shared_ptr type_singleton() { return uint16(); } }; @@ -61,7 +63,9 @@ struct TypeTraits { using ArrayType = Int16Array; using BuilderType = Int16Builder; - static inline int bytes_required(int elements) { return 
elements * sizeof(int16_t); } + static inline int64_t bytes_required(int64_t elements) { + return elements * sizeof(int16_t); + } constexpr static bool is_parameter_free = true; static inline std::shared_ptr type_singleton() { return int16(); } }; @@ -71,7 +75,9 @@ struct TypeTraits { using ArrayType = UInt32Array; using BuilderType = UInt32Builder; - static inline int bytes_required(int elements) { return elements * sizeof(uint32_t); } + static inline int64_t bytes_required(int64_t elements) { + return elements * sizeof(uint32_t); + } constexpr static bool is_parameter_free = true; static inline std::shared_ptr type_singleton() { return uint32(); } }; @@ -81,7 +87,9 @@ struct TypeTraits { using ArrayType = Int32Array; using BuilderType = Int32Builder; - static inline int bytes_required(int elements) { return elements * sizeof(int32_t); } + static inline int64_t bytes_required(int64_t elements) { + return elements * sizeof(int32_t); + } constexpr static bool is_parameter_free = true; static inline std::shared_ptr type_singleton() { return int32(); } }; @@ -91,7 +99,9 @@ struct TypeTraits { using ArrayType = UInt64Array; using BuilderType = UInt64Builder; - static inline int bytes_required(int elements) { return elements * sizeof(uint64_t); } + static inline int64_t bytes_required(int64_t elements) { + return elements * sizeof(uint64_t); + } constexpr static bool is_parameter_free = true; static inline std::shared_ptr type_singleton() { return uint64(); } }; @@ -101,7 +111,9 @@ struct TypeTraits { using ArrayType = Int64Array; using BuilderType = Int64Builder; - static inline int bytes_required(int elements) { return elements * sizeof(int64_t); } + static inline int64_t bytes_required(int64_t elements) { + return elements * sizeof(int64_t); + } constexpr static bool is_parameter_free = true; static inline std::shared_ptr type_singleton() { return int64(); } }; @@ -111,7 +123,9 @@ struct TypeTraits { using ArrayType = DateArray; // using BuilderType = DateBuilder; - static inline int bytes_required(int elements) { return elements * sizeof(int64_t); } + static inline int64_t bytes_required(int64_t elements) { + return elements * sizeof(int64_t); + } constexpr static bool is_parameter_free = true; static inline std::shared_ptr type_singleton() { return date(); } }; @@ -121,7 +135,9 @@ struct TypeTraits { using ArrayType = TimestampArray; // using BuilderType = TimestampBuilder; - static inline int bytes_required(int elements) { return elements * sizeof(int64_t); } + static inline int64_t bytes_required(int64_t elements) { + return elements * sizeof(int64_t); + } constexpr static bool is_parameter_free = false; }; @@ -130,7 +146,9 @@ struct TypeTraits { using ArrayType = TimeArray; // using BuilderType = TimestampBuilder; - static inline int bytes_required(int elements) { return elements * sizeof(int64_t); } + static inline int64_t bytes_required(int64_t elements) { + return elements * sizeof(int64_t); + } constexpr static bool is_parameter_free = false; }; @@ -139,7 +157,9 @@ struct TypeTraits { using ArrayType = HalfFloatArray; using BuilderType = HalfFloatBuilder; - static inline int bytes_required(int elements) { return elements * sizeof(uint16_t); } + static inline int64_t bytes_required(int64_t elements) { + return elements * sizeof(uint16_t); + } constexpr static bool is_parameter_free = true; static inline std::shared_ptr type_singleton() { return float16(); } }; @@ -149,7 +169,9 @@ struct TypeTraits { using ArrayType = FloatArray; using BuilderType = FloatBuilder; - static inline int 
bytes_required(int elements) { return elements * sizeof(float); } + static inline int64_t bytes_required(int64_t elements) { + return static_cast(elements * sizeof(float)); + } constexpr static bool is_parameter_free = true; static inline std::shared_ptr type_singleton() { return float32(); } }; @@ -159,7 +181,9 @@ struct TypeTraits { using ArrayType = DoubleArray; using BuilderType = DoubleBuilder; - static inline int bytes_required(int elements) { return elements * sizeof(double); } + static inline int64_t bytes_required(int64_t elements) { + return static_cast(elements * sizeof(double)); + } constexpr static bool is_parameter_free = true; static inline std::shared_ptr type_singleton() { return float64(); } }; @@ -169,7 +193,7 @@ struct TypeTraits { using ArrayType = BooleanArray; using BuilderType = BooleanBuilder; - static inline int bytes_required(int elements) { + static inline int64_t bytes_required(int64_t elements) { return BitUtil::BytesForBits(elements); } constexpr static bool is_parameter_free = true; diff --git a/cpp/src/arrow/util/bit-util.cc b/cpp/src/arrow/util/bit-util.cc index f3fbb41fa54a7..1bbd2384267c9 100644 --- a/cpp/src/arrow/util/bit-util.cc +++ b/cpp/src/arrow/util/bit-util.cc @@ -42,7 +42,7 @@ void BitUtil::BytesToBits(const std::vector& bytes, uint8_t* bits) { Status BitUtil::BytesToBits( const std::vector& bytes, std::shared_ptr* out) { - int bit_length = BitUtil::BytesForBits(bytes.size()); + int64_t bit_length = BitUtil::BytesForBits(bytes.size()); std::shared_ptr buffer; RETURN_NOT_OK(AllocateBuffer(default_memory_pool(), bit_length, &buffer)); @@ -98,7 +98,7 @@ Status GetEmptyBitmap( return Status::OK(); } -Status CopyBitmap(MemoryPool* pool, const uint8_t* data, int32_t offset, int32_t length, +Status CopyBitmap(MemoryPool* pool, const uint8_t* data, int64_t offset, int64_t length, std::shared_ptr* out) { std::shared_ptr buffer; RETURN_NOT_OK(GetEmptyBitmap(pool, length, &buffer)); diff --git a/cpp/src/arrow/util/bit-util.h b/cpp/src/arrow/util/bit-util.h index a0fbdd2f92ca1..6e3e8ae9f2160 100644 --- a/cpp/src/arrow/util/bit-util.h +++ b/cpp/src/arrow/util/bit-util.h @@ -34,6 +34,11 @@ class Status; namespace BitUtil { +static constexpr uint8_t kBitmask[] = {1, 2, 4, 8, 16, 32, 64, 128}; + +// the bitwise complement version of kBitmask +static constexpr uint8_t kFlippedBitmask[] = {254, 253, 251, 247, 239, 223, 191, 127}; + static inline int64_t CeilByte(int64_t size) { return (size + 7) & ~7; } @@ -46,28 +51,26 @@ static inline int64_t Ceil2Bytes(int64_t size) { return (size + 15) & ~15; } -static constexpr uint8_t kBitmask[] = {1, 2, 4, 8, 16, 32, 64, 128}; - -static inline bool GetBit(const uint8_t* bits, int i) { +static inline bool GetBit(const uint8_t* bits, int64_t i) { return static_cast(bits[i / 8] & kBitmask[i % 8]); } -static inline bool BitNotSet(const uint8_t* bits, int i) { +static inline bool BitNotSet(const uint8_t* bits, int64_t i) { return (bits[i / 8] & kBitmask[i % 8]) == 0; } -static inline void ClearBit(uint8_t* bits, int i) { - bits[i / 8] &= ~kBitmask[i % 8]; +static inline void ClearBit(uint8_t* bits, int64_t i) { + bits[i / 8] &= kFlippedBitmask[i % 8]; } -static inline void SetBit(uint8_t* bits, int i) { +static inline void SetBit(uint8_t* bits, int64_t i) { bits[i / 8] |= kBitmask[i % 8]; } -static inline void SetBitTo(uint8_t* bits, int i, bool bit_is_set) { +static inline void SetBitTo(uint8_t* bits, int64_t i, bool bit_is_set) { // See https://graphics.stanford.edu/~seander/bithacks.html // "Conditionally set or clear bits without
branching" - bits[i / 8] ^= (-bit_is_set ^ bits[i / 8]) & kBitmask[i % 8]; + bits[i / 8] ^= static_cast(-bit_is_set ^ bits[i / 8]) & kBitmask[i % 8]; } static inline int64_t NextPower2(int64_t n) { @@ -127,8 +130,8 @@ Status ARROW_EXPORT GetEmptyBitmap( /// \param[out] out the resulting copy /// /// \return Status message -Status ARROW_EXPORT CopyBitmap(MemoryPool* pool, const uint8_t* bitmap, int32_t offset, - int32_t length, std::shared_ptr* out); +Status ARROW_EXPORT CopyBitmap(MemoryPool* pool, const uint8_t* bitmap, int64_t offset, + int64_t length, std::shared_ptr* out); /// Compute the number of 1's in the given data array /// diff --git a/python/pyarrow/array.pxd b/python/pyarrow/array.pxd index 9e4d469bcfa5f..56bb53d5c97dc 100644 --- a/python/pyarrow/array.pxd +++ b/python/pyarrow/array.pxd @@ -15,7 +15,7 @@ # specific language governing permissions and limitations # under the License. -from pyarrow.includes.common cimport shared_ptr +from pyarrow.includes.common cimport shared_ptr, int64_t from pyarrow.includes.libarrow cimport CArray from pyarrow.scalar import NA @@ -36,7 +36,7 @@ cdef class Array: DataType type cdef init(self, const shared_ptr[CArray]& sp_array) - cdef getitem(self, int i) + cdef getitem(self, int64_t i) cdef object box_array(const shared_ptr[CArray]& sp_array) diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 11abf03e35f1d..7787e95df5e72 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -210,7 +210,7 @@ cdef class Array: return self.getitem(key) - cdef getitem(self, int i): + cdef getitem(self, int64_t i): return scalar.box_scalar(self.type, self.sp_array, i) def slice(self, offset=0, length=None): diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 702acfbc12e17..253cabbe0a581 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -64,15 +64,15 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass CArray" arrow::Array": shared_ptr[CDataType] type() - int32_t length() - int32_t null_count() + int64_t length() + int64_t null_count() Type type_enum() c_bool Equals(const shared_ptr[CArray]& arr) c_bool IsNull(int i) - shared_ptr[CArray] Slice(int32_t offset) - shared_ptr[CArray] Slice(int32_t offset, int32_t length) + shared_ptr[CArray] Slice(int64_t offset) + shared_ptr[CArray] Slice(int64_t offset, int64_t length) cdef cppclass CFixedWidthType" arrow::FixedWidthType"(CDataType): int bit_width() @@ -217,7 +217,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: shared_ptr[CChunkedArray] data() cdef cppclass CRecordBatch" arrow::RecordBatch": - CRecordBatch(const shared_ptr[CSchema]& schema, int32_t num_rows, + CRecordBatch(const shared_ptr[CSchema]& schema, int64_t num_rows, const vector[shared_ptr[CArray]]& columns) c_bool Equals(const CRecordBatch& other) @@ -229,10 +229,10 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: const vector[shared_ptr[CArray]]& columns() int num_columns() - int32_t num_rows() + int64_t num_rows() - shared_ptr[CRecordBatch] Slice(int32_t offset) - shared_ptr[CRecordBatch] Slice(int32_t offset, int32_t length) + shared_ptr[CRecordBatch] Slice(int64_t offset) + shared_ptr[CRecordBatch] Slice(int64_t offset, int64_t length) cdef cppclass CTable" arrow::Table": CTable(const c_string& name, const shared_ptr[CSchema]& schema, diff --git a/python/pyarrow/scalar.pxd b/python/pyarrow/scalar.pxd index 2d55757726183..551aeb9697bf7 100644 --- a/python/pyarrow/scalar.pxd +++ 
b/python/pyarrow/scalar.pxd @@ -32,10 +32,10 @@ cdef class NAType(Scalar): cdef class ArrayValue(Scalar): cdef: shared_ptr[CArray] sp_array - int index + int64_t index cdef void init(self, DataType type, - const shared_ptr[CArray]& sp_array, int index) + const shared_ptr[CArray]& sp_array, int64_t index) cdef void _set_array(self, const shared_ptr[CArray]& sp_array) @@ -55,7 +55,7 @@ cdef class ListValue(ArrayValue): cdef: CListArray* ap - cdef getitem(self, int i) + cdef getitem(self, int64_t i) cdef class StringValue(ArrayValue): @@ -63,4 +63,4 @@ cdef class StringValue(ArrayValue): cdef object box_scalar(DataType type, const shared_ptr[CArray]& sp_array, - int index) + int64_t index) diff --git a/python/pyarrow/scalar.pyx b/python/pyarrow/scalar.pyx index 57a15ad78344c..1337b2b2cb198 100644 --- a/python/pyarrow/scalar.pyx +++ b/python/pyarrow/scalar.pyx @@ -46,7 +46,7 @@ NA = NAType() cdef class ArrayValue(Scalar): cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array, - int index): + int64_t index): self.type = type self.index = index self._set_array(sp_array) @@ -201,13 +201,13 @@ cdef class ListValue(ArrayValue): self.ap = sp_array.get() self.value_type = box_data_type(self.ap.value_type()) - cdef getitem(self, int i): - cdef int j = self.ap.value_offset(self.index) + i + cdef getitem(self, int64_t i): + cdef int64_t j = self.ap.value_offset(self.index) + i return box_scalar(self.value_type, self.ap.values(), j) def as_py(self): cdef: - int j + int64_t j list result = [] for j in range(len(self)): @@ -236,7 +236,7 @@ cdef dict _scalar_classes = { } cdef object box_scalar(DataType type, const shared_ptr[CArray]& sp_array, - int index): + int64_t index): cdef ArrayValue val if type.type.type == Type_NA: return NA diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index 7d7336246ee79..93bc6ddcd56f6 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -497,7 +497,7 @@ cdef class RecordBatch: shared_ptr[CSchema] schema shared_ptr[CRecordBatch] batch vector[shared_ptr[CArray]] c_arrays - int32_t num_rows + int64_t num_rows if len(arrays) == 0: raise ValueError('Record batch cannot contain no arrays (for now)') diff --git a/python/src/pyarrow/adapters/builtin.cc b/python/src/pyarrow/adapters/builtin.cc index 5fd8eef23fec5..c125cc078af88 100644 --- a/python/src/pyarrow/adapters/builtin.cc +++ b/python/src/pyarrow/adapters/builtin.cc @@ -375,7 +375,7 @@ class BytesConverter : public TypedConverter { PyObject* bytes_obj; OwnedRef tmp; const char* bytes; - int32_t length; + int64_t length; Py_ssize_t size = PySequence_Size(seq); for (int64_t i = 0; i < size; ++i) { item = PySequence_GetItem(seq, i); @@ -409,7 +409,7 @@ class UTF8Converter : public TypedConverter { PyObject* bytes_obj; OwnedRef tmp; const char* bytes; - int32_t length; + int64_t length; Py_ssize_t size = PySequence_Size(seq); for (int64_t i = 0; i < size; ++i) { item = PySequence_GetItem(seq, i); diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index bdc2cb7d0025f..cadb53e0d2ab9 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -224,13 +224,13 @@ Status AppendObjectStrings(arrow::StringBuilder& string_builder, PyObject** obje PyErr_Clear(); return Status::TypeError("failed converting unicode to UTF8"); } - const int32_t length = PyBytes_GET_SIZE(obj); + const int64_t length = PyBytes_GET_SIZE(obj); Status s = string_builder.Append(PyBytes_AS_STRING(obj), length); Py_DECREF(obj); if (!s.ok()) 
{ return s; } } else if (PyBytes_Check(obj)) { *have_bytes = true; - const int32_t length = PyBytes_GET_SIZE(obj); + const int64_t length = PyBytes_GET_SIZE(obj); RETURN_NOT_OK(string_builder.Append(PyBytes_AS_STRING(obj), length)); } else { string_builder.AppendNull(); @@ -413,7 +413,7 @@ inline void ConvertIntegerNoNullsCast(const ChunkedArray& data, OutType* out_val const std::shared_ptr arr = data.chunk(c); auto prim_arr = static_cast(arr.get()); auto in_values = reinterpret_cast(prim_arr->data()->data()); - for (int32_t i = 0; i < arr->length(); ++i) { + for (int64_t i = 0; i < arr->length(); ++i) { *out_values = in_values[i]; } } @@ -507,7 +507,6 @@ inline Status ConvertListsLike( auto arr = std::static_pointer_cast(data.chunk(c)); const uint8_t* data_ptr; - int32_t length; const bool has_nulls = data.null_count() > 0; for (int64_t i = 0; i < arr->length(); ++i) { if (has_nulls && arr->IsNull(i)) { @@ -1520,7 +1519,7 @@ inline Status ArrowSerializer::Convert(std::shared_ptr* out) { } // For readability - constexpr int32_t kOffset = 0; + constexpr int64_t kOffset = 0; RETURN_NOT_OK(ConvertData()); std::shared_ptr type; @@ -1636,7 +1635,7 @@ inline Status ArrowSerializer::ConvertTypedLists( // TODO(uwe): Support more complex numpy array structures RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, ITEM_TYPE)); - int32_t size = PyArray_DIM(numpy_array, 0); + int64_t size = PyArray_DIM(numpy_array, 0); auto data = reinterpret_cast(PyArray_DATA(numpy_array)); if (traits::supports_nulls) { null_bitmap_->Resize(size, false); @@ -1678,7 +1677,7 @@ ArrowSerializer::ConvertTypedLists( // TODO(uwe): Support more complex numpy array structures RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, NPY_OBJECT)); - int32_t size = PyArray_DIM(numpy_array, 0); + int64_t size = PyArray_DIM(numpy_array, 0); auto data = reinterpret_cast(PyArray_DATA(numpy_array)); RETURN_NOT_OK(AppendObjectStrings(*value_builder.get(), data, size, &have_bytes)); } else if (PyList_Check(objects[i])) { From 2c3bd9311b370a45bac3ff90ed2f772991f211e0 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 27 Feb 2017 14:09:39 -0500 Subject: [PATCH 0346/1644] ARROW-588: [C++] Fix some 32 bit compiler warnings I also found that $CMAKE_CXX_FLAGS were not being passed to the gflags external project. 
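Most of the changes below follow one pattern: Arrow tracks sizes and lengths as int64_t, while libc routines such as memcpy and memset take size_t, which is only 32 bits wide on 32-bit targets, so the implicit conversion triggers narrowing warnings. A minimal sketch of the recurring fix (an illustrative helper, not code from this patch):

#include <cstdint>
#include <cstring>

// Copy nbytes bytes; the explicit cast silences the 32-bit narrowing warning
// and documents the assumption, checked by callers, that nbytes fits in size_t.
void CopyBytes(uint8_t* dst, const uint8_t* src, int64_t nbytes) {
  std::memcpy(dst, src, static_cast<size_t>(nbytes));
}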
Author: Wes McKinney Closes #354 from wesm/32-bit-compiler-warnings and squashes the following commits: 8829a58 [Wes McKinney] Fix cast to wrong type 5a17654 [Wes McKinney] clang format 43687c5 [Wes McKinney] Fix some more compiler warnings 843479c [Wes McKinney] Fixes 9dbd619 [Wes McKinney] 32 bit fixes --- cpp/CMakeLists.txt | 6 +----- cpp/src/arrow/array-primitive-test.cc | 11 +++++------ cpp/src/arrow/buffer.cc | 15 ++++++++++++++- cpp/src/arrow/buffer.h | 15 +++------------ cpp/src/arrow/builder.cc | 18 +++++++++++------- cpp/src/arrow/compare.cc | 9 ++++++--- cpp/src/arrow/io/file.cc | 17 +++++++++-------- cpp/src/arrow/io/test-common.h | 4 ++-- cpp/src/arrow/memory_pool.cc | 8 +++++--- cpp/src/arrow/test-util.h | 4 ++-- cpp/src/arrow/util/bit-util.cc | 4 ++-- 11 files changed, 60 insertions(+), 51 deletions(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index f6dab788b26d5..7d1f9e167d486 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -499,11 +499,7 @@ if(ARROW_BUILD_TESTS) # gflags (formerly Googleflags) command line parsing if("$ENV{GFLAGS_HOME}" STREQUAL "") - if(APPLE) - set(GFLAGS_CMAKE_CXX_FLAGS "-fPIC -std=c++11 -stdlib=libc++") - else() - set(GFLAGS_CMAKE_CXX_FLAGS "-fPIC") - endif() + set(GFLAGS_CMAKE_CXX_FLAGS ${EP_CXX_FLAGS}) set(GFLAGS_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/gflags_ep-prefix/src/gflags_ep") set(GFLAGS_HOME "${GFLAGS_PREFIX}") diff --git a/cpp/src/arrow/array-primitive-test.cc b/cpp/src/arrow/array-primitive-test.cc index 7b36275cbabfb..dfa37a8063767 100644 --- a/cpp/src/arrow/array-primitive-test.cc +++ b/cpp/src/arrow/array-primitive-test.cc @@ -100,7 +100,7 @@ class TestPrimitiveBuilder : public TestBuilder { void RandomData(int64_t N, double pct_null = 0.1) { Attrs::draw(N, &draws_); - valid_bytes_.resize(N); + valid_bytes_.resize(static_cast(N)); test::random_null_bytes(N, pct_null, valid_bytes_.data()); } @@ -192,8 +192,8 @@ struct PBoolean { template <> void TestPrimitiveBuilder::RandomData(int64_t N, double pct_null) { - draws_.resize(N); - valid_bytes_.resize(N); + draws_.resize(static_cast(N)); + valid_bytes_.resize(static_cast(N)); test::random_null_bytes(N, 0.5, draws_.data()); test::random_null_bytes(N, pct_null, valid_bytes_.data()); @@ -394,10 +394,9 @@ TYPED_TEST(TestPrimitiveBuilder, TestAppendScalar) { this->builder_->Reserve(1000); this->builder_nn_->Reserve(1000); - int64_t i; int64_t null_count = 0; // Append the first 1000 - for (i = 0; i < 1000; ++i) { + for (size_t i = 0; i < 1000; ++i) { if (valid_bytes[i] > 0) { this->builder_->Append(draws[i]); } else { @@ -419,7 +418,7 @@ TYPED_TEST(TestPrimitiveBuilder, TestAppendScalar) { this->builder_nn_->Reserve(size - 1000); // Append the next 9000 - for (i = 1000; i < size; ++i) { + for (size_t i = 1000; i < size; ++i) { if (valid_bytes[i] > 0) { this->builder_->Append(draws[i]); } else { diff --git a/cpp/src/arrow/buffer.cc b/cpp/src/arrow/buffer.cc index 18e9ed2015227..a0b78ac0b9f20 100644 --- a/cpp/src/arrow/buffer.cc +++ b/cpp/src/arrow/buffer.cc @@ -45,7 +45,7 @@ Status Buffer::Copy( auto new_buffer = std::make_shared(pool); RETURN_NOT_OK(new_buffer->Resize(nbytes)); - std::memcpy(new_buffer->mutable_data(), data() + start, nbytes); + std::memcpy(new_buffer->mutable_data(), data() + start, static_cast(nbytes)); *out = new_buffer; return Status::OK(); @@ -55,6 +55,19 @@ Status Buffer::Copy(int64_t start, int64_t nbytes, std::shared_ptr* out) return Copy(start, nbytes, default_memory_pool(), out); } +bool Buffer::Equals(const Buffer& other, int64_t nbytes) const { + 
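// Buffers compare equal over nbytes when they are the same object, or when
// both span at least nbytes and either share backing memory or match
// byte-for-byte over that prefix: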
return this == &other || + (size_ >= nbytes && other.size_ >= nbytes && + (data_ == other.data_ || + !memcmp(data_, other.data_, static_cast(nbytes)))); +} + +bool Buffer::Equals(const Buffer& other) const { + return this == &other || (size_ == other.size_ && (data_ == other.data_ || + !memcmp(data_, other.data_, + static_cast(size_)))); +} + std::shared_ptr SliceBuffer( const std::shared_ptr& buffer, int64_t offset, int64_t length) { DCHECK_LE(offset, buffer->size()); diff --git a/cpp/src/arrow/buffer.h b/cpp/src/arrow/buffer.h index be91af3556da4..0724385a4aff8 100644 --- a/cpp/src/arrow/buffer.h +++ b/cpp/src/arrow/buffer.h @@ -64,17 +64,8 @@ class ARROW_EXPORT Buffer : public std::enable_shared_from_this { /// Return true if both buffers are the same size and contain the same bytes /// up to the number of compared bytes - bool Equals(const Buffer& other, int64_t nbytes) const { - return this == &other || - (size_ >= nbytes && other.size_ >= nbytes && - (data_ == other.data_ || !memcmp(data_, other.data_, nbytes))); - } - - bool Equals(const Buffer& other) const { - return this == &other || - (size_ == other.size_ && - (data_ == other.data_ || !memcmp(data_, other.data_, size_))); - } + bool Equals(const Buffer& other, int64_t nbytes) const; + bool Equals(const Buffer& other) const; /// Copy a section of the buffer into a new Buffer. Status Copy(int64_t start, int64_t nbytes, MemoryPool* pool, @@ -196,7 +187,7 @@ class ARROW_EXPORT BufferBuilder { // Unsafe methods don't check existing size void UnsafeAppend(const uint8_t* data, int64_t length) { - memcpy(data_ + size_, data, length); + memcpy(data_ + size_, data, static_cast(length)); size_ += length; } diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index 63e083e76b660..9086598cc5ba7 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -58,7 +58,7 @@ Status ArrayBuilder::Init(int64_t capacity) { const int64_t byte_capacity = null_bitmap_->capacity(); capacity_ = capacity; null_bitmap_data_ = null_bitmap_->mutable_data(); - memset(null_bitmap_data_, 0, byte_capacity); + memset(null_bitmap_data_, 0, static_cast(byte_capacity)); return Status::OK(); } @@ -72,7 +72,8 @@ Status ArrayBuilder::Resize(int64_t new_bits) { const int64_t byte_capacity = null_bitmap_->capacity(); capacity_ = new_bits; if (old_bytes < new_bytes) { - memset(null_bitmap_data_ + old_bytes, 0, byte_capacity - old_bytes); + memset( + null_bitmap_data_ + old_bytes, 0, static_cast(byte_capacity - old_bytes)); } return Status::OK(); } @@ -152,7 +153,8 @@ void ArrayBuilder::UnsafeSetNotNull(int64_t length) { // Fast bitsetting int64_t fast_length = (length - pad_to_byte) / 8; - memset(null_bitmap_data_ + ((length_ + pad_to_byte) / 8), 255, fast_length); + memset(null_bitmap_data_ + ((length_ + pad_to_byte) / 8), 255, + static_cast(fast_length)); // Trailing bytes for (int64_t i = length_ + pad_to_byte + (fast_length * 8); i < new_length; ++i) { @@ -170,7 +172,7 @@ Status PrimitiveBuilder::Init(int64_t capacity) { int64_t nbytes = TypeTraits::bytes_required(capacity); RETURN_NOT_OK(data_->Resize(nbytes)); // TODO(emkornfield) valgrind complains without this - memset(data_->mutable_data(), 0, nbytes); + memset(data_->mutable_data(), 0, static_cast(nbytes)); raw_data_ = reinterpret_cast(data_->mutable_data()); return Status::OK(); @@ -190,7 +192,8 @@ Status PrimitiveBuilder::Resize(int64_t capacity) { RETURN_NOT_OK(data_->Resize(new_bytes)); raw_data_ = reinterpret_cast(data_->mutable_data()); // TODO(emkornfield) valgrind complains without 
this - memset(data_->mutable_data() + old_bytes, 0, new_bytes - old_bytes); + memset( + data_->mutable_data() + old_bytes, 0, static_cast(new_bytes - old_bytes)); } return Status::OK(); } @@ -256,7 +259,7 @@ Status BooleanBuilder::Init(int64_t capacity) { int64_t nbytes = BitUtil::BytesForBits(capacity); RETURN_NOT_OK(data_->Resize(nbytes)); // TODO(emkornfield) valgrind complains without this - memset(data_->mutable_data(), 0, nbytes); + memset(data_->mutable_data(), 0, static_cast(nbytes)); raw_data_ = reinterpret_cast(data_->mutable_data()); return Status::OK(); @@ -275,7 +278,8 @@ Status BooleanBuilder::Resize(int64_t capacity) { RETURN_NOT_OK(data_->Resize(new_bytes)); raw_data_ = reinterpret_cast(data_->mutable_data()); - memset(data_->mutable_data() + old_bytes, 0, new_bytes - old_bytes); + memset( + data_->mutable_data() + old_bytes, 0, static_cast(new_bytes - old_bytes)); } return Status::OK(); } diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc index e94fa74ea6589..f38f8d67aa796 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -87,7 +87,8 @@ class RangeEqualsVisitor : public ArrayVisitor { if (end_offset - begin_offset > 0 && std::memcmp(left.data()->data() + begin_offset, - right.data()->data() + right_begin_offset, end_offset - begin_offset)) { + right.data()->data() + right_begin_offset, + static_cast(end_offset - begin_offset))) { return false; } } @@ -348,7 +349,8 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { } return true; } else { - return memcmp(left_data, right_data, value_byte_size * left.length()) == 0; + return memcmp(left_data, right_data, + static_cast(value_byte_size * left.length())) == 0; } } @@ -431,7 +433,8 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { const int64_t total_bytes = left.value_offset(left.length()) - left.value_offset(0); return std::memcmp(left_data + left.value_offset(0), - right_data + right.value_offset(0), total_bytes) == 0; + right_data + right.value_offset(0), + static_cast(total_bytes)) == 0; } } else { // ARROW-537: Only compare data in non-null slots diff --git a/cpp/src/arrow/io/file.cc b/cpp/src/arrow/io/file.cc index 230c7fe0fb4a0..7c14238e8fda4 100644 --- a/cpp/src/arrow/io/file.cc +++ b/cpp/src/arrow/io/file.cc @@ -244,9 +244,9 @@ static inline Status FileRead( int fd, uint8_t* buffer, int64_t nbytes, int64_t* bytes_read) { #if defined(_MSC_VER) if (nbytes > INT32_MAX) { return Status::IOError("Unable to read > 2GB blocks yet"); } - *bytes_read = _read(fd, buffer, static_cast(nbytes)); + *bytes_read = _read(fd, buffer, static_cast(nbytes)); #else - *bytes_read = read(fd, buffer, nbytes); + *bytes_read = read(fd, buffer, static_cast(nbytes)); #endif if (*bytes_read == -1) { @@ -263,9 +263,9 @@ static inline Status FileWrite(int fd, const uint8_t* buffer, int64_t nbytes) { if (nbytes > INT32_MAX) { return Status::IOError("Unable to write > 2GB blocks to file yet"); } - ret = static_cast(_write(fd, buffer, static_cast(nbytes))); + ret = static_cast(_write(fd, buffer, static_cast(nbytes))); #else - ret = static_cast(write(fd, buffer, nbytes)); + ret = static_cast(write(fd, buffer, static_cast(nbytes))); #endif if (ret == -1) { @@ -526,7 +526,7 @@ class MemoryMappedFile::MemoryMap : public MutableBuffer { ~MemoryMap() { if (file_->is_open()) { - munmap(mutable_data_, size_); + munmap(mutable_data_, static_cast(size_)); file_->Close(); } } @@ -554,7 +554,8 @@ class MemoryMappedFile::MemoryMap : public MutableBuffer { is_mutable_ = false; } - void* result = mmap(nullptr, 
file_->size(), prot_flags, map_mode, file_->fd(), 0); + void* result = mmap(nullptr, static_cast(file_->size()), prot_flags, map_mode, + file_->fd(), 0); if (result == MAP_FAILED) { std::stringstream ss; ss << "Memory mapping file failed, errno: " << errno; @@ -630,7 +631,7 @@ Status MemoryMappedFile::Close() { Status MemoryMappedFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) { nbytes = std::max( 0, std::min(nbytes, memory_map_->size() - memory_map_->position())); - if (nbytes > 0) { std::memcpy(out, memory_map_->head(), nbytes); } + if (nbytes > 0) { std::memcpy(out, memory_map_->head(), static_cast(nbytes)); } *bytes_read = nbytes; memory_map_->advance(nbytes); return Status::OK(); @@ -677,7 +678,7 @@ Status MemoryMappedFile::Write(const uint8_t* data, int64_t nbytes) { } Status MemoryMappedFile::WriteInternal(const uint8_t* data, int64_t nbytes) { - memcpy(memory_map_->head(), data, nbytes); + memcpy(memory_map_->head(), data, static_cast(nbytes)); memory_map_->advance(nbytes); return Status::OK(); } diff --git a/cpp/src/arrow/io/test-common.h b/cpp/src/arrow/io/test-common.h index 6e917135db274..8355714540e95 100644 --- a/cpp/src/arrow/io/test-common.h +++ b/cpp/src/arrow/io/test-common.h @@ -53,9 +53,9 @@ class MemoryMapFixture { FILE* file = fopen(path.c_str(), "w"); if (file != nullptr) { tmp_files_.push_back(path); } #ifdef _MSC_VER - _chsize(fileno(file), size); + _chsize(fileno(file), static_cast(size)); #else - ftruncate(fileno(file), size); + ftruncate(fileno(file), static_cast(size)); #endif fclose(file); } diff --git a/cpp/src/arrow/memory_pool.cc b/cpp/src/arrow/memory_pool.cc index 8d85a089a65c9..5a630271a7da7 100644 --- a/cpp/src/arrow/memory_pool.cc +++ b/cpp/src/arrow/memory_pool.cc @@ -36,14 +36,16 @@ Status AllocateAligned(int64_t size, uint8_t** out) { constexpr size_t kAlignment = 64; #ifdef _MSC_VER // Special code path for MSVC - *out = reinterpret_cast(_aligned_malloc(size, kAlignment)); + *out = + reinterpret_cast(_aligned_malloc(static_cast(size), kAlignment)); if (!*out) { std::stringstream ss; ss << "malloc of size " << size << " failed"; return Status::OutOfMemory(ss.str()); } #else - const int result = posix_memalign(reinterpret_cast(out), kAlignment, size); + const int result = posix_memalign( + reinterpret_cast(out), kAlignment, static_cast(size)); if (result == ENOMEM) { std::stringstream ss; ss << "malloc of size " << size << " failed"; @@ -90,7 +92,7 @@ Status DefaultMemoryPool::Reallocate(int64_t old_size, int64_t new_size, uint8_t uint8_t* out; RETURN_NOT_OK(AllocateAligned(new_size, &out)); // Copy contents and release old memory chunk - memcpy(out, *ptr, std::min(new_size, old_size)); + memcpy(out, *ptr, static_cast(std::min(new_size, old_size))); #ifdef _MSC_VER _aligned_free(*ptr); #else diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index 5c7d04de6dfbb..11ce50a76a547 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -109,13 +109,13 @@ inline Status CopyBufferFromVector( static inline Status GetBitmapFromBoolVector( const std::vector& is_valid, std::shared_ptr* result) { - int64_t length = static_cast(is_valid.size()); + size_t length = is_valid.size(); std::shared_ptr buffer; RETURN_NOT_OK(GetEmptyBitmap(default_memory_pool(), length, &buffer)); uint8_t* bitmap = buffer->mutable_data(); - for (int64_t i = 0; i < length; ++i) { + for (size_t i = 0; i < static_cast(length); ++i) { if (is_valid[i]) { BitUtil::SetBit(bitmap, i); } } diff --git a/cpp/src/arrow/util/bit-util.cc 
b/cpp/src/arrow/util/bit-util.cc index 1bbd2384267c9..3767ba9e62f4a 100644 --- a/cpp/src/arrow/util/bit-util.cc +++ b/cpp/src/arrow/util/bit-util.cc @@ -47,7 +47,7 @@ Status BitUtil::BytesToBits( std::shared_ptr buffer; RETURN_NOT_OK(AllocateBuffer(default_memory_pool(), bit_length, &buffer)); - memset(buffer->mutable_data(), 0, bit_length); + memset(buffer->mutable_data(), 0, static_cast(bit_length)); BytesToBits(bytes, buffer->mutable_data()); *out = buffer; @@ -94,7 +94,7 @@ int64_t CountSetBits(const uint8_t* data, int64_t bit_offset, int64_t length) { Status GetEmptyBitmap( MemoryPool* pool, int64_t length, std::shared_ptr* result) { RETURN_NOT_OK(AllocateBuffer(pool, BitUtil::BytesForBits(length), result)); - memset((*result)->mutable_data(), 0, (*result)->size()); + memset((*result)->mutable_data(), 0, static_cast((*result)->size())); return Status::OK(); } From 0637e05d59f20363a9103ffad5712f981314c4df Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 2 Mar 2017 14:41:29 -0500 Subject: [PATCH 0347/1644] ARROW-576: [C++] Complete file/stream implementation for union types Author: Wes McKinney Closes #356 from wesm/ARROW-576 and squashes the following commits: e239ba1 [Wes McKinney] Fix miniconda links 12fde46 [Wes McKinney] Complete metadata roundtrip for unions --- ci/travis_install_conda.sh | 4 +- cpp/src/arrow/ipc/ipc-file-test.cc | 2 +- cpp/src/arrow/ipc/metadata-internal.cc | 101 ++++++++++++++++--------- 3 files changed, 67 insertions(+), 40 deletions(-) diff --git a/ci/travis_install_conda.sh b/ci/travis_install_conda.sh index ffa017cbaf5dd..9c13b1bc0f079 100644 --- a/ci/travis_install_conda.sh +++ b/ci/travis_install_conda.sh @@ -15,9 +15,9 @@ set -e if [ $TRAVIS_OS_NAME == "linux" ]; then - MINICONDA_URL="https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh" + MINICONDA_URL="https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh" else - MINICONDA_URL="https://repo.continuum.io/miniconda/Miniconda-latest-MacOSX-x86_64.sh" + MINICONDA_URL="https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh" fi wget -O miniconda.sh $MINICONDA_URL diff --git a/cpp/src/arrow/ipc/ipc-file-test.cc b/cpp/src/arrow/ipc/ipc-file-test.cc index e58f2cfbbe8c9..0c95c8eca65ca 100644 --- a/cpp/src/arrow/ipc/ipc-file-test.cc +++ b/cpp/src/arrow/ipc/ipc-file-test.cc @@ -180,7 +180,7 @@ TEST_P(TestStreamFormat, RoundTrip) { #define BATCH_CASES() \ ::testing::Values(&MakeIntRecordBatch, &MakeListRecordBatch, &MakeNonNullRecordBatch, \ &MakeZeroLengthRecordBatch, &MakeDeeplyNestedList, &MakeStringTypesRecordBatch, \ - &MakeStruct, &MakeDictionary); + &MakeStruct, &MakeUnion, &MakeDictionary); INSTANTIATE_TEST_CASE_P(FileRoundTripTests, TestFileFormat, BATCH_CASES()); INSTANTIATE_TEST_CASE_P(StreamRoundTripTests, TestStreamFormat, BATCH_CASES()); diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index 1cc4a235b81bd..17a3a5fafe626 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -78,43 +78,6 @@ static Status FloatFromFlatuffer( return Status::OK(); } -static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, - const std::vector>& children, std::shared_ptr* out) { - switch (type) { - case flatbuf::Type_NONE: - return Status::Invalid("Type metadata cannot be none"); - case flatbuf::Type_Int: - return IntFromFlatbuffer(static_cast(type_data), out); - case flatbuf::Type_FloatingPoint: - return FloatFromFlatuffer( - static_cast(type_data), out); - case 
flatbuf::Type_Binary: - *out = binary(); - return Status::OK(); - case flatbuf::Type_Utf8: - *out = utf8(); - return Status::OK(); - case flatbuf::Type_Bool: - *out = boolean(); - return Status::OK(); - case flatbuf::Type_Decimal: - case flatbuf::Type_Timestamp: - case flatbuf::Type_List: - if (children.size() != 1) { - return Status::Invalid("List must have exactly 1 child field"); - } - *out = std::make_shared(children[0]); - return Status::OK(); - case flatbuf::Type_Struct_: - *out = std::make_shared(children); - return Status::OK(); - case flatbuf::Type_Union: - return Status::NotImplemented("Type is not implemented"); - default: - return Status::Invalid("Unrecognized type"); - } -} - // Forward declaration static Status FieldToFlatbuffer(FBB& fbb, const std::shared_ptr& field, DictionaryMemo* dictionary_memo, FieldOffset* offset); @@ -153,6 +116,32 @@ static Status StructToFlatbuffer(FBB& fbb, const std::shared_ptr& type return Status::OK(); } +// ---------------------------------------------------------------------- +// Union implementation + +static Status UnionFromFlatbuffer(const flatbuf::Union* union_data, + const std::vector>& children, std::shared_ptr* out) { + UnionMode mode = union_data->mode() == flatbuf::UnionMode_Sparse ? UnionMode::SPARSE + : UnionMode::DENSE; + + std::vector type_codes; + + const flatbuffers::Vector* fb_type_ids = union_data->typeIds(); + if (fb_type_ids == nullptr) { + for (uint8_t i = 0; i < children.size(); ++i) { + type_codes.push_back(i); + } + } else { + for (int32_t id : (*fb_type_ids)) { + // TODO(wesm): can these values exceed 255? + type_codes.push_back(static_cast(id)); + } + } + + *out = union_(children, type_codes, mode); + return Status::OK(); +} + static Status UnionToFlatBuffer(FBB& fbb, const std::shared_ptr& type, std::vector* out_children, DictionaryMemo* dictionary_memo, Offset* offset) { @@ -181,6 +170,44 @@ static Status UnionToFlatBuffer(FBB& fbb, const std::shared_ptr& type, *offset = IntToFlatbuffer(fbb, BIT_WIDTH, IS_SIGNED); \ break; +static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, + const std::vector>& children, std::shared_ptr* out) { + switch (type) { + case flatbuf::Type_NONE: + return Status::Invalid("Type metadata cannot be none"); + case flatbuf::Type_Int: + return IntFromFlatbuffer(static_cast(type_data), out); + case flatbuf::Type_FloatingPoint: + return FloatFromFlatuffer( + static_cast(type_data), out); + case flatbuf::Type_Binary: + *out = binary(); + return Status::OK(); + case flatbuf::Type_Utf8: + *out = utf8(); + return Status::OK(); + case flatbuf::Type_Bool: + *out = boolean(); + return Status::OK(); + case flatbuf::Type_Decimal: + case flatbuf::Type_Timestamp: + case flatbuf::Type_List: + if (children.size() != 1) { + return Status::Invalid("List must have exactly 1 child field"); + } + *out = std::make_shared(children[0]); + return Status::OK(); + case flatbuf::Type_Struct_: + *out = std::make_shared(children); + return Status::OK(); + case flatbuf::Type_Union: + return UnionFromFlatbuffer( + static_cast(type_data), children, out); + default: + return Status::Invalid("Unrecognized type"); + } +} + // TODO(wesm): Convert this to visitor pattern static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, std::vector* children, std::vector* layout, From 8378c48df53bfdcf0c834aaf3b8b737f74eb212c Mon Sep 17 00:00:00 2001 From: "Uwe L. 
Korn" Date: Fri, 3 Mar 2017 09:51:57 -0500 Subject: [PATCH 0348/1644] ARROW-589: C++: Use system provided shared jemalloc if static is unavailable Author: Uwe L. Korn Closes #355 from xhochy/ARROW-589 and squashes the following commits: a9d88bc [Uwe L. Korn] ARROW-589: C++: Use system provided shared jemalloc if static is unavailable --- cpp/src/arrow/jemalloc/CMakeLists.txt | 20 +++++++++++++++----- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/cpp/src/arrow/jemalloc/CMakeLists.txt b/cpp/src/arrow/jemalloc/CMakeLists.txt index c0f90eba260f6..7caa74a3ebbda 100644 --- a/cpp/src/arrow/jemalloc/CMakeLists.txt +++ b/cpp/src/arrow/jemalloc/CMakeLists.txt @@ -20,11 +20,21 @@ include_directories(SYSTEM "{JEMALLOC_INCLUDE_DIR}") -# arrow_jemalloc library -set(ARROW_JEMALLOC_STATIC_LINK_LIBS - arrow_static - jemalloc_static -) +# In the case that jemalloc is only available as a shared library also use it to +# link it in the static requirements. In contrast to other libraries we try in +# most cases to use the system provided version of jemalloc to better align with +# other potential users of jemalloc. +if (JEMALLOC_STATIC_LIB) + set(ARROW_JEMALLOC_STATIC_LINK_LIBS + arrow_static + jemalloc_static + ) +else() + set(ARROW_JEMALLOC_STATIC_LINK_LIBS + arrow_static + jemalloc_shared + ) +endif() if (NOT APPLE) set(ARROW_JEMALLOC_STATIC_LINK_LIBS ${ARROW_JEMALLOC_STATIC_LINK_LIBS} pthread) From 9deb3251ec89f5afb14b5bc768f2c3a88cad1627 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 5 Mar 2017 08:52:20 -0500 Subject: [PATCH 0349/1644] ARROW-109: [C++] Add nesting stress tests up to 500 recursion depth There doesn't appear to be any limit to the nesting depth permitted in the flatbuffers. I think what @emkornfield was running into was the size of the IPC payload exceeding the size of the memory map that was being allocated to accommodate it. I expanded the memory map size and was able to write schemas with 1000 and 5000 levels of nesting. I left a unit test with 500 depth which doesn't take too long to run. Author: Wes McKinney Closes #357 from wesm/ARROW-109 and squashes the following commits: fa78976 [Wes McKinney] Add nesting stress tests up to 500 recursion depth, expand size of memory map --- cpp/src/arrow/ipc/adapter.h | 6 ++- cpp/src/arrow/ipc/ipc-adapter-test.cc | 60 ++++++++++++++++++++++----- 2 files changed, 53 insertions(+), 13 deletions(-) diff --git a/cpp/src/arrow/ipc/adapter.h b/cpp/src/arrow/ipc/adapter.h index b7d8fa93d3651..933d3a4639fe8 100644 --- a/cpp/src/arrow/ipc/adapter.h +++ b/cpp/src/arrow/ipc/adapter.h @@ -47,8 +47,10 @@ namespace ipc { // ---------------------------------------------------------------------- // Write path -// We have trouble decoding flatbuffers if the size i > 70, so 64 is a nice round number -// TODO(emkornfield) investigate this more +// +// ARROW-109: We set this number arbitrarily to help catch user mistakes. 
For +// deeply nested schemas, it is expected the user will indicate explicitly the +// maximum allowed recursion depth constexpr int kMaxIpcRecursionDepth = 64; // Write the RecordBatch (collection of equal-length Arrow arrays) to the diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc index 8999363893289..6678fd522a86a 100644 --- a/cpp/src/arrow/ipc/ipc-adapter-test.cc +++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc @@ -213,7 +213,8 @@ class RecursionLimits : public ::testing::Test, public io::MemoryMapFixture { void TearDown() { io::MemoryMapFixture::TearDown(); } Status WriteToMmap(int recursion_level, bool override_level, int32_t* metadata_length, - int64_t* body_length, std::shared_ptr* schema) { + int64_t* body_length, std::shared_ptr* batch, + std::shared_ptr* schema) { const int batch_length = 5; TypePtr type = int32(); std::shared_ptr array; @@ -230,18 +231,18 @@ class RecursionLimits : public ::testing::Test, public io::MemoryMapFixture { *schema = std::shared_ptr(new Schema({f0})); std::vector> arrays = {array}; - auto batch = std::make_shared(*schema, batch_length, arrays); + *batch = std::make_shared(*schema, batch_length, arrays); std::string path = "test-write-past-max-recursion"; - const int memory_map_size = 1 << 16; + const int memory_map_size = 1 << 20; io::MemoryMapFixture::InitMemoryMap(memory_map_size, path, &mmap_); if (override_level) { - return WriteRecordBatch(*batch, 0, mmap_.get(), metadata_length, body_length, pool_, - recursion_level + 1); + return WriteRecordBatch(**batch, 0, mmap_.get(), metadata_length, body_length, + pool_, recursion_level + 1); } else { return WriteRecordBatch( - *batch, 0, mmap_.get(), metadata_length, body_length, pool_); + **batch, 0, mmap_.get(), metadata_length, body_length, pool_); } } @@ -254,15 +255,21 @@ TEST_F(RecursionLimits, WriteLimit) { int32_t metadata_length = -1; int64_t body_length = -1; std::shared_ptr schema; - ASSERT_RAISES( - Invalid, WriteToMmap((1 << 8) + 1, false, &metadata_length, &body_length, &schema)); + std::shared_ptr batch; + ASSERT_RAISES(Invalid, + WriteToMmap((1 << 8) + 1, false, &metadata_length, &body_length, &batch, &schema)); } TEST_F(RecursionLimits, ReadLimit) { int32_t metadata_length = -1; int64_t body_length = -1; std::shared_ptr schema; - ASSERT_OK(WriteToMmap(64, true, &metadata_length, &body_length, &schema)); + + const int recursion_depth = 64; + + std::shared_ptr batch; + ASSERT_OK(WriteToMmap( + recursion_depth, true, &metadata_length, &body_length, &batch, &schema)); std::shared_ptr message; ASSERT_OK(ReadMessage(0, metadata_length, mmap_.get(), &message)); @@ -273,8 +280,39 @@ TEST_F(RecursionLimits, ReadLimit) { io::BufferReader reader(payload); - std::shared_ptr batch; - ASSERT_RAISES(Invalid, ReadRecordBatch(*metadata, schema, &reader, &batch)); + std::shared_ptr result; + ASSERT_RAISES(Invalid, ReadRecordBatch(*metadata, schema, &reader, &result)); +} + +TEST_F(RecursionLimits, StressLimit) { + auto CheckDepth = [this](int recursion_depth, bool* it_works) { + int32_t metadata_length = -1; + int64_t body_length = -1; + std::shared_ptr schema; + std::shared_ptr batch; + ASSERT_OK(WriteToMmap( + recursion_depth, true, &metadata_length, &body_length, &batch, &schema)); + + std::shared_ptr message; + ASSERT_OK(ReadMessage(0, metadata_length, mmap_.get(), &message)); + auto metadata = std::make_shared(message); + + std::shared_ptr payload; + ASSERT_OK(mmap_->ReadAt(metadata_length, body_length, &payload)); + + io::BufferReader reader(payload); + + 
std::shared_ptr result; + ASSERT_OK(ReadRecordBatch(*metadata, schema, recursion_depth + 1, &reader, &result)); + *it_works = result->Equals(*batch); + }; + + bool it_works = false; + CheckDepth(100, &it_works); + ASSERT_TRUE(it_works); + + CheckDepth(500, &it_works); + ASSERT_TRUE(it_works); } } // namespace ipc From fb9fbe4981420aaa0a56bfe87254d8b10bd5ba18 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Tue, 7 Mar 2017 17:13:57 +0100 Subject: [PATCH 0350/1644] ARROW-604: Python: boxed Field instances are missing the reference to their DataType Author: Uwe L. Korn Closes #362 from xhochy/ARROW-604 and squashes the following commits: 2e837c8 [Uwe L. Korn] ARROW-604: Python: boxed Field instances are missing the reference to DataType --- cpp/src/arrow/type.cc | 3 +++ python/pyarrow/schema.pyx | 5 +++++ python/pyarrow/tests/test_schema.py | 2 ++ 3 files changed, 10 insertions(+) diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 23fa6812f53d4..7e5f13af9cf9b 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -54,6 +54,9 @@ bool DataType::Equals(const DataType& other) const { } bool DataType::Equals(const std::shared_ptr& other) const { + if (!other) { + return false; + } return Equals(*other.get()); } diff --git a/python/pyarrow/schema.pyx b/python/pyarrow/schema.pyx index 52eeeaf717622..19910aba00427 100644 --- a/python/pyarrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ -88,6 +88,7 @@ cdef class Field: cdef init(self, const shared_ptr[CField]& field): self.sp_field = field self.field = field.get() + self.type = box_data_type(field.get().type) @classmethod def from_py(cls, object name, DataType type, bint nullable=True): @@ -326,11 +327,15 @@ def schema(fields): return Schema.from_fields(fields) cdef DataType box_data_type(const shared_ptr[CDataType]& type): + if type.get() == NULL: + return None cdef DataType out = DataType() out.init(type) return out cdef Field box_field(const shared_ptr[CField]& field): + if field.get() == NULL: + return None cdef Field out = Field() out.init(field) return out diff --git a/python/pyarrow/tests/test_schema.py b/python/pyarrow/tests/test_schema.py index 507ebb878d87b..f6dc33c75dfb8 100644 --- a/python/pyarrow/tests/test_schema.py +++ b/python/pyarrow/tests/test_schema.py @@ -64,6 +64,8 @@ def test_schema(self): assert len(sch) == 3 assert sch[0].name == 'foo' assert sch[0].type == fields[0].type + assert sch.field_by_name('foo').name == 'foo' + assert sch.field_by_name('foo').type == fields[0].type assert repr(sch) == """\ foo: int32 From b109a246f464eaf641dd7741d348e02069f3a0e9 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Tue, 7 Mar 2017 18:04:24 -0500 Subject: [PATCH 0351/1644] ARROW-566: Bundle Arrow libraries in Python package Depends on https://github.com/apache/parquet-cpp/pull/265 With this change we can also build self-contained OSX wheels, still we have to find a way to build them reproducibly (will take care of that soon). Author: Uwe L. Korn Closes #360 from xhochy/ARROW-566 and squashes the following commits: d6c86de [Uwe L. Korn] Use Apache git again for Parquet 21861de [Uwe L. Korn] Only link to librt if we use GCC 925fce9 [Uwe L. 
Korn] ARROW-566: Bundle Arrow libraries in Python package --- cpp/CMakeLists.txt | 8 +++ cpp/cmake_modules/BuildUtils.cmake | 14 ++++- cpp/src/arrow/jemalloc/CMakeLists.txt | 27 ++++++++-- python/CMakeLists.txt | 45 ++++++++++++++++ .../Dockerfile-parquet_arrow-base-x86_64 | 19 ------- python/manylinux1/Dockerfile-x86_64 | 21 ++++++-- python/manylinux1/README.md | 4 +- python/manylinux1/build_arrow.sh | 52 +++++++++---------- python/setup.py | 50 +++++++++++++----- 9 files changed, 170 insertions(+), 70 deletions(-) delete mode 100644 python/manylinux1/Dockerfile-parquet_arrow-base-x86_64 diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 7d1f9e167d486..22c6e9a7acbe5 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -88,6 +88,10 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") "Build the Arrow jemalloc-based allocator" ON) + option(ARROW_JEMALLOC_USE_SHARED + "Rely on jemalloc shared libraries where relevant" + ON) + option(ARROW_BOOST_USE_SHARED "Rely on boost shared libraries where relevant" ON) @@ -103,6 +107,10 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") option(ARROW_BUILD_UTILITIES "Build Arrow commandline utilities" ON) + + option(ARROW_RPATH_ORIGIN + "Build Arrow libraries with RATH set to \$ORIGIN" + OFF) endif() if(ARROW_BUILD_TESTS) diff --git a/cpp/cmake_modules/BuildUtils.cmake b/cpp/cmake_modules/BuildUtils.cmake index 9de9de516f996..2da8a05c9c42a 100644 --- a/cpp/cmake_modules/BuildUtils.cmake +++ b/cpp/cmake_modules/BuildUtils.cmake @@ -53,11 +53,21 @@ function(ADD_ARROW_LIB LIB_NAME) LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}" LINK_FLAGS "${ARG_SHARED_LINK_FLAGS}" OUTPUT_NAME ${LIB_NAME}) - target_link_libraries(${LIB_NAME}_shared + target_link_libraries(${LIB_NAME}_shared LINK_PUBLIC ${ARG_SHARED_LINK_LIBS} LINK_PRIVATE ${ARG_SHARED_PRIVATE_LINK_LIBS}) + + if (ARROW_RPATH_ORIGIN) + if (APPLE) + set(_lib_install_rpath "@loader_path") + else() + set(_lib_install_rpath "\$ORIGIN") + endif() + set_target_properties(${LIB_NAME}_shared PROPERTIES + INSTALL_RPATH ${_lib_install_rpath}) + endif() - install(TARGETS ${LIB_NAME}_shared + install(TARGETS ${LIB_NAME}_shared LIBRARY DESTINATION lib ARCHIVE DESTINATION lib) endif() diff --git a/cpp/src/arrow/jemalloc/CMakeLists.txt b/cpp/src/arrow/jemalloc/CMakeLists.txt index 7caa74a3ebbda..5d5482ab653bf 100644 --- a/cpp/src/arrow/jemalloc/CMakeLists.txt +++ b/cpp/src/arrow/jemalloc/CMakeLists.txt @@ -40,10 +40,29 @@ if (NOT APPLE) set(ARROW_JEMALLOC_STATIC_LINK_LIBS ${ARROW_JEMALLOC_STATIC_LINK_LIBS} pthread) endif() -set(ARROW_JEMALLOC_SHARED_LINK_LIBS - arrow_shared - jemalloc_shared -) +if (ARROW_JEMALLOC_USE_SHARED) + set(ARROW_JEMALLOC_SHARED_LINK_LIBS + arrow_shared + jemalloc_shared + ) +else() + if (CMAKE_COMPILER_IS_GNUCXX) + set(ARROW_JEMALLOC_SHARED_LINK_LIBS + arrow_shared + jemalloc_static + # For glibc <2.17 we need to link to librt. + # As we compile with --as-needed by default, the linker will omit this + # dependency if not required. 
+ rt + ) + else() + set(ARROW_JEMALLOC_SHARED_LINK_LIBS + arrow_shared + jemalloc_static + ) + endif() +endif() + if (ARROW_BUILD_STATIC) set(ARROW_JEMALLOC_TEST_LINK_LIBS diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index ba26692b32b88..6e6d609b00007 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -56,6 +56,9 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") option(PYARROW_BUILD_JEMALLOC "Build the PyArrow jemalloc integration" OFF) + option(PYARROW_BUNDLE_ARROW_CPP + "Bundle the Arrow C++ libraries" + OFF) endif() if(NOT PYARROW_BUILD_TESTS) @@ -332,6 +335,25 @@ endif() ## Arrow find_package(Arrow REQUIRED) include_directories(SYSTEM ${ARROW_INCLUDE_DIR}) + +if (PYARROW_BUNDLE_ARROW_CPP) + configure_file(${ARROW_SHARED_LIB} + ${BUILD_OUTPUT_ROOT_DIRECTORY}/libarrow${CMAKE_SHARED_LIBRARY_SUFFIX} + COPYONLY) + SET(ARROW_SHARED_LIB + ${BUILD_OUTPUT_ROOT_DIRECTORY}/libarrow${CMAKE_SHARED_LIBRARY_SUFFIX}) + configure_file(${ARROW_IO_SHARED_LIB} + ${BUILD_OUTPUT_ROOT_DIRECTORY}/libarrow_io${CMAKE_SHARED_LIBRARY_SUFFIX} + COPYONLY) + SET(ARROW_IO_SHARED_LIB + ${BUILD_OUTPUT_ROOT_DIRECTORY}/libarrow_io${CMAKE_SHARED_LIBRARY_SUFFIX}) + configure_file(${ARROW_IPC_SHARED_LIB} + ${BUILD_OUTPUT_ROOT_DIRECTORY}/libarrow_ipc${CMAKE_SHARED_LIBRARY_SUFFIX} + COPYONLY) + SET(ARROW_IPC_SHARED_LIB + ${BUILD_OUTPUT_ROOT_DIRECTORY}/libarrow_ipc${CMAKE_SHARED_LIBRARY_SUFFIX}) +endif() + ADD_THIRDPARTY_LIB(arrow SHARED_LIB ${ARROW_SHARED_LIB}) ADD_THIRDPARTY_LIB(arrow_io @@ -440,6 +462,18 @@ if (PYARROW_BUILD_PARQUET) if(NOT (PARQUET_FOUND AND PARQUET_ARROW_FOUND)) message(FATAL_ERROR "Unable to locate Parquet libraries") endif() + if (PYARROW_BUNDLE_ARROW_CPP) + configure_file(${PARQUET_SHARED_LIB} + ${BUILD_OUTPUT_ROOT_DIRECTORY}/libparquet${CMAKE_SHARED_LIBRARY_SUFFIX} + COPYONLY) + SET(PARQUET_SHARED_LIB + ${BUILD_OUTPUT_ROOT_DIRECTORY}/libparquet${CMAKE_SHARED_LIBRARY_SUFFIX}) + configure_file(${PARQUET_ARROW_SHARED_LIB} + ${BUILD_OUTPUT_ROOT_DIRECTORY}/libparquet_arrow${CMAKE_SHARED_LIBRARY_SUFFIX} + COPYONLY) + SET(PARQUET_ARROW_SHARED_LIB + ${BUILD_OUTPUT_ROOT_DIRECTORY}/libparquet_arrow${CMAKE_SHARED_LIBRARY_SUFFIX}) + endif() ADD_THIRDPARTY_LIB(parquet_arrow SHARED_LIB ${PARQUET_ARROW_SHARED_LIB}) set(LINK_LIBS @@ -451,6 +485,13 @@ if (PYARROW_BUILD_PARQUET) endif() if (PYARROW_BUILD_JEMALLOC) + if (PYARROW_BUNDLE_ARROW_CPP) + configure_file(${ARROW_JEMALLOC_SHARED_LIB} + ${BUILD_OUTPUT_ROOT_DIRECTORY}/libarrow_jemalloc${CMAKE_SHARED_LIBRARY_SUFFIX} + COPYONLY) + SET(ARROW_JEMALLOC_SHARED_LIB + ${BUILD_OUTPUT_ROOT_DIRECTORY}/libarrow_jemalloc${CMAKE_SHARED_LIBRARY_SUFFIX}) + endif() ADD_THIRDPARTY_LIB(arrow_jemalloc SHARED_LIB ${ARROW_JEMALLOC_SHARED_LIB}) set(LINK_LIBS @@ -463,6 +504,10 @@ endif() add_library(pyarrow SHARED ${PYARROW_SRCS}) +if (PYARROW_BUNDLE_ARROW_CPP) + set_target_properties(pyarrow PROPERTIES + INSTALL_RPATH "\$ORIGIN") +endif() target_link_libraries(pyarrow ${LINK_LIBS}) if(APPLE) diff --git a/python/manylinux1/Dockerfile-parquet_arrow-base-x86_64 b/python/manylinux1/Dockerfile-parquet_arrow-base-x86_64 deleted file mode 100644 index dcc9321c322b2..0000000000000 --- a/python/manylinux1/Dockerfile-parquet_arrow-base-x86_64 +++ /dev/null @@ -1,19 +0,0 @@ -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
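The `rt` entry added to the static jemalloc link line above is there because, on glibc releases before 2.17, `clock_gettime` is exported from librt rather than libc. A minimal illustration of the call in question; this is a hypothetical standalone program assuming a POSIX system, not code from the patch:

```cpp
// Illustrative only: jemalloc's internal timing relies on
// clock_gettime, which glibc < 2.17 provides via librt, hence the
// explicit `rt` above. With -Wl,--as-needed the linker drops the
// dependency again whenever it is not actually needed.
#include <cstdio>
#include <time.h>

int main() {
  timespec ts;
  if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) {
    std::perror("clock_gettime");
    return 1;
  }
  std::printf("monotonic: %lld.%09ld s\n",
              static_cast<long long>(ts.tv_sec), ts.tv_nsec);
  return 0;
}
```

On glibc 2.17 and later the same program links without extra flags, which is why the unconditional `rt` entry stays harmless under `--as-needed`.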
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. - -FROM arrow-base-x86_64 - -WORKDIR / -RUN git clone https://github.com/apache/parquet-cpp.git -WORKDIR /parquet-cpp -RUN ARROW_HOME=/usr cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr -DPARQUET_BUILD_TESTS=OFF -DPARQUET_ARROW=ON . -RUN make -j5 install diff --git a/python/manylinux1/Dockerfile-x86_64 b/python/manylinux1/Dockerfile-x86_64 index ac47108c84ae7..820b94e306afe 100644 --- a/python/manylinux1/Dockerfile-x86_64 +++ b/python/manylinux1/Dockerfile-x86_64 @@ -13,14 +13,23 @@ FROM quay.io/pypa/manylinux1_x86_64:latest # Install dependencies -RUN yum install -y flex openssl-devel +RUN yum install -y flex zlib-devel + +# Build a newer OpenSSL version to support Thrift 0.10.0, note that we don't trigger the SSL code in Arrow. +WORKDIR / +RUN wget --no-check-certificate https://www.openssl.org/source/openssl-1.0.2k.tar.gz -O openssl-1.0.2k.tar.gz +RUN tar xf openssl-1.0.2k.tar.gz +WORKDIR openssl-1.0.2k +RUN ./config -fpic shared --prefix=/usr +RUN make -j5 +RUN make install WORKDIR / RUN wget --no-check-certificate http://downloads.sourceforge.net/project/boost/boost/1.60.0/boost_1_60_0.tar.gz -O /boost_1_60_0.tar.gz RUN tar xf boost_1_60_0.tar.gz WORKDIR /boost_1_60_0 RUN ./bootstrap.sh -RUN ./bjam cxxflags=-fPIC cflags=-fPIC --prefix=/usr --with-filesystem --with-date_time --with-system install +RUN ./bjam cxxflags=-fPIC cflags=-fPIC --prefix=/usr --with-filesystem --with-date_time --with-system --with-regex install WORKDIR / RUN wget https://github.com/jemalloc/jemalloc/releases/download/4.4.0/jemalloc-4.4.0.tar.bz2 -O jemalloc-4.4.0.tar.bz2 @@ -43,5 +52,11 @@ RUN git checkout ffe59955ad8690c2f8bb74766cb7e9b0d0ee3963 ADD arrow /arrow WORKDIR /arrow/cpp -RUN cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr -DARROW_HDFS=ON -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=ON -DARROW_BOOST_USE_SHARED=OFF -DARROW_JEMALLOC=ON . +RUN cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/arrow-dist -DARROW_HDFS=ON -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=ON -DARROW_BOOST_USE_SHARED=OFF -DARROW_JEMALLOC=ON -DARROW_RPATH_ORIGIN=ON -DARROW_JEMALLOC_USE_SHARED=OFF . +RUN make -j5 install + +WORKDIR / +RUN git clone https://github.com/apache/parquet-cpp.git +WORKDIR /parquet-cpp +RUN ARROW_HOME=/arrow-dist cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/usr -DPARQUET_BUILD_TESTS=OFF -DPARQUET_ARROW=ON -DPARQUET_BOOST_USE_SHARED=OFF . RUN make -j5 install diff --git a/python/manylinux1/README.md b/python/manylinux1/README.md index 8cd9f6db004e5..32af6f31da287 100644 --- a/python/manylinux1/README.md +++ b/python/manylinux1/README.md @@ -31,10 +31,8 @@ for all supported Python versions and place them in the `dist` folder. git clone ../../ arrow # Build the native baseimage docker build -t arrow-base-x86_64 -f Dockerfile-x86_64 . -# (optionally) build parquet-cpp -docker build -t parquet_arrow-base-x86_64 -f Dockerfile-parquet_arrow-base-x86_64 . 
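With the new `-DARROW_RPATH_ORIGIN=ON` flag used above, the bundled shared libraries carry an `$ORIGIN` (on OS X, `@loader_path`) install rpath, so the copies placed inside the wheel find one another without any `LD_LIBRARY_PATH`. A hedged, glibc-specific way to check which file the dynamic linker actually resolved; `dlinfo` and `RTLD_NOLOAD` are GNU extensions and this snippet is illustrative, not part of the patch:

```cpp
// Glibc-only sketch: in a process that already holds libarrow.so
// (say, a Python interpreter that imported the wheel), ask the
// dynamic linker where the library came from. RTLD_NOLOAD yields a
// handle only if the library is already mapped; nothing new loads.
#include <dlfcn.h>
#include <link.h>
#include <cstdio>

int main() {
  void* handle = dlopen("libarrow.so", RTLD_LAZY | RTLD_NOLOAD);
  if (handle == nullptr) {
    std::printf("libarrow.so is not loaded in this process\n");
    return 1;
  }
  link_map* map = nullptr;
  if (dlinfo(handle, RTLD_DI_LINKMAP, &map) == 0 && map != nullptr) {
    // With $ORIGIN in effect this prints the copy shipped next to
    // the extension module, not a system-wide installation.
    std::printf("libarrow.so resolved from: %s\n", map->l_name);
  }
  dlclose(handle);
  return 0;
}
```

Older toolchains may additionally need `-ldl` at link time.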
# Build the python packages -docker run --rm -v $PWD:/io parquet_arrow-base-x86_64 /io/build_arrow.sh +docker run --rm -v $PWD:/io arrow-base-x86_64 /io/build_arrow.sh # Now the new packages are located in the dist/ folder ls -l dist/ ``` diff --git a/python/manylinux1/build_arrow.sh b/python/manylinux1/build_arrow.sh index cce5cd2b4d412..576a983b11c37 100755 --- a/python/manylinux1/build_arrow.sh +++ b/python/manylinux1/build_arrow.sh @@ -29,38 +29,19 @@ source /multibuild/manylinux_utils.sh cd /arrow/python -export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/lib" # PyArrow build configuration export PYARROW_BUILD_TYPE='release' export PYARROW_CMAKE_OPTIONS='-DPYARROW_BUILD_TESTS=ON' +export PYARROW_WITH_PARQUET=1 +export PYARROW_WITH_JEMALLOC=1 +export PYARROW_BUNDLE_ARROW_CPP=1 # Need as otherwise arrow_io is sometimes not linked export LDFLAGS="-Wl,--no-as-needed" -export ARROW_HOME="/usr" +export ARROW_HOME="/arrow-dist" export PARQUET_HOME="/usr" # Ensure the target directory exists mkdir -p /io/dist -# Temporary directory to store the wheels that should be sent through auditwheel -rm_mkdir unfixed_wheels - -PY35_BIN=/opt/python/cp35-cp35m/bin -$PY35_BIN/pip install 'pyelftools<0.24' -$PY35_BIN/pip install 'git+https://github.com/xhochy/auditwheel.git@pyarrow-fixes' - -# Override repair_wheelhouse function -function repair_wheelhouse { - local in_dir=$1 - local out_dir=$2 - for whl in $in_dir/*.whl; do - if [[ $whl == *none-any.whl ]]; then - cp $whl $out_dir - else - # Store libraries directly in . not .libs to fix problems with libpyarrow.so linkage. - $PY35_BIN/auditwheel -v repair -L . $whl -w $out_dir/ - fi - done - chmod -R a+rwX $out_dir -} for PYTHON in ${PYTHON_VERSIONS}; do PYTHON_INTERPRETER="$(cpython_path $PYTHON)/bin/python" @@ -68,17 +49,36 @@ for PYTHON in ${PYTHON_VERSIONS}; do PIPI_IO="$PIP install -f $MANYLINUX_URL" PATH="$PATH:$(cpython_path $PYTHON)" + echo "=== (${PYTHON}) Installing build dependencies ===" $PIPI_IO "numpy==1.9.0" $PIPI_IO "cython==0.24" - PATH="$PATH:$(cpython_path $PYTHON)/bin" $PYTHON_INTERPRETER setup.py build_ext --inplace --with-parquet --with-jemalloc + # Clear output directory + rm -rf dist/ + echo "=== (${PYTHON}) Building wheel ===" + PATH="$PATH:$(cpython_path $PYTHON)/bin" $PYTHON_INTERPRETER setup.py build_ext --inplace --with-parquet --with-jemalloc --bundle-arrow-cpp PATH="$PATH:$(cpython_path $PYTHON)/bin" $PYTHON_INTERPRETER setup.py bdist_wheel - # Test for optional modules + echo "=== (${PYTHON}) Test the existence of optional modules ===" $PIPI_IO -r requirements.txt PATH="$PATH:$(cpython_path $PYTHON)/bin" $PYTHON_INTERPRETER -c "import pyarrow.parquet" PATH="$PATH:$(cpython_path $PYTHON)/bin" $PYTHON_INTERPRETER -c "import pyarrow.jemalloc" - repair_wheelhouse dist /io/dist + echo "=== (${PYTHON}) Tag the wheel with manylinux1 ===" + mkdir -p repaired_wheels/ + auditwheel -v repair -L . 
dist/pyarrow-*.whl -w repaired_wheels/ + + echo "=== (${PYTHON}) Testing manylinux1 wheel ===" + # Fix version to keep build reproducible" + $PIPI_IO "virtualenv==15.1.0" + rm -rf venv + "$(cpython_path $PYTHON)/bin/virtualenv" -p ${PYTHON_INTERPRETER} --no-download venv + source ./venv/bin/activate + pip install repaired_wheels/*.whl + pip install pytest pandas + py.test venv/lib/*/site-packages/pyarrow + deactivate + + mv repaired_wheels/*.whl /io/dist done diff --git a/python/setup.py b/python/setup.py index 54d1cd3af48bc..b0f29be4c1b3b 100644 --- a/python/setup.py +++ b/python/setup.py @@ -34,6 +34,7 @@ from os.path import join as pjoin from distutils.command.clean import clean as _clean +from distutils.util import strtobool from distutils import sysconfig # Check if we're running 64-bit Python @@ -81,15 +82,17 @@ def run(self): user_options = ([('extra-cmake-args=', None, 'extra arguments for CMake'), ('build-type=', None, 'build type (debug or release)'), ('with-parquet', None, 'build the Parquet extension'), - ('with-jemalloc', None, 'build the jemalloc extension')] + + ('with-jemalloc', None, 'build the jemalloc extension'), + ('bundle-arrow-cpp', None, 'bundle the Arrow C++ libraries')] + _build_ext.user_options) def initialize_options(self): _build_ext.initialize_options(self) self.extra_cmake_args = os.environ.get('PYARROW_CMAKE_OPTIONS', '') self.build_type = os.environ.get('PYARROW_BUILD_TYPE', 'debug').lower() - self.with_parquet = False - self.with_jemalloc = False + self.with_parquet = strtobool(os.environ.get('PYARROW_WITH_PARQUET', '0')) + self.with_jemalloc = strtobool(os.environ.get('PYARROW_WITH_JEMALLOC', '0')) + self.bundle_arrow_cpp = strtobool(os.environ.get('PYARROW_BUNDLE_ARROW_CPP', '0')) CYTHON_MODULE_NAMES = [ 'array', @@ -142,6 +145,9 @@ def _run_cmake(self): if self.with_jemalloc: cmake_options.append('-DPYARROW_BUILD_JEMALLOC=on') + if self.bundle_arrow_cpp: + cmake_options.append('-DPYARROW_BUNDLE_ARROW_CPP=ON') + if sys.platform != 'win32': cmake_options.append('-DCMAKE_BUILD_TYPE={0}' .format(self.build_type)) @@ -181,17 +187,35 @@ def _run_cmake(self): # Move the built libpyarrow library to the place expected by the Python # build - if sys.platform != 'win32': - name, = glob.glob(pjoin(self.build_type, 'libpyarrow.*')) - try: - os.makedirs(pjoin(build_lib, 'pyarrow')) - except OSError: - pass - shutil.move(name, - pjoin(build_lib, 'pyarrow', os.path.split(name)[1])) + shared_library_prefix = 'lib' + if sys.platform == 'darwin': + shared_library_suffix = '.dylib' + elif sys.platform == 'win32': + shared_library_suffix = '.dll' + shared_library_prefix = '' else: - shutil.move(pjoin(self.build_type, 'pyarrow.dll'), - pjoin(build_lib, 'pyarrow', 'pyarrow.dll')) + shared_library_suffix = '.so' + + try: + os.makedirs(pjoin(build_lib, 'pyarrow')) + except OSError: + pass + + def move_lib(lib_name): + lib_filename = shared_library_prefix + lib_name + shared_library_suffix + shutil.move(pjoin(self.build_type, lib_filename), + pjoin(build_lib, 'pyarrow', lib_filename)) + + move_lib("pyarrow") + if self.bundle_arrow_cpp: + move_lib("arrow") + move_lib("arrow_io") + move_lib("arrow_ipc") + if self.with_jemalloc: + move_lib("arrow_jemalloc") + if self.with_parquet: + move_lib("parquet") + move_lib("parquet_arrow") # Move the built C-extension to the place expected by the Python build self._found_names = [] From 6b3ae2aecc8cd31425035a021fa04b9ed3385a8d Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 9 Mar 2017 14:00:48 -0500 Subject: [PATCH 0352/1644] 
ARROW-605: [C++] Refactor IPC adapter code into generic ArrayLoader class. Add Date32Type These are various changes introduced to support the Feather merge in ARROW-452 #361 Author: Wes McKinney Closes #365 from wesm/array-loader and squashes the following commits: bc22872 [Wes McKinney] Revert Array::type_id to type_enum since Parquet uses this API 344e6b1 [Wes McKinney] fix compiler warning 997b7a2 [Wes McKinney] Refactor IPC adapter code into generic ArrayLoader class. Add Date32Type --- cpp/CMakeLists.txt | 1 + cpp/src/arrow/CMakeLists.txt | 5 +- cpp/src/arrow/array.cc | 52 +--- cpp/src/arrow/array.h | 17 +- cpp/src/arrow/builder.cc | 1 + cpp/src/arrow/builder.h | 1 + cpp/src/arrow/column.cc | 3 + cpp/src/arrow/column.h | 3 + cpp/src/arrow/compare.cc | 8 +- cpp/src/arrow/io/memory.cc | 19 +- cpp/src/arrow/io/memory.h | 6 + cpp/src/arrow/ipc/adapter.cc | 252 ++--------------- cpp/src/arrow/ipc/adapter.h | 8 +- cpp/src/arrow/ipc/metadata.cc | 1 + cpp/src/arrow/ipc/metadata.h | 7 +- cpp/src/arrow/loader.cc | 285 ++++++++++++++++++++ cpp/src/arrow/loader.h | 89 ++++++ cpp/src/arrow/pretty_print.cc | 6 +- cpp/src/arrow/type.cc | 22 +- cpp/src/arrow/type.h | 38 ++- cpp/src/arrow/type_fwd.h | 5 + cpp/src/arrow/type_traits.h | 12 + python/pyarrow/array.pyx | 31 ++- python/pyarrow/table.pyx | 3 +- python/pyarrow/tests/test_convert_pandas.py | 6 +- python/src/pyarrow/adapters/pandas.cc | 17 +- 26 files changed, 558 insertions(+), 340 deletions(-) create mode 100644 cpp/src/arrow/loader.cc create mode 100644 cpp/src/arrow/loader.h diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 22c6e9a7acbe5..294c439e2b093 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -799,6 +799,7 @@ set(ARROW_SRCS src/arrow/builder.cc src/arrow/column.cc src/arrow/compare.cc + src/arrow/loader.cc src/arrow/memory_pool.cc src/arrow/pretty_print.cc src/arrow/schema.cc diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index d1efa021a496d..ddeb81cae7b5b 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -19,10 +19,11 @@ install(FILES api.h array.h - column.h - compare.h buffer.h builder.h + column.h + compare.h + loader.h memory_pool.h pretty_print.h schema.h diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index 284bb57a02b88..49da6bb3197a1 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -165,6 +165,7 @@ template class NumericArray; template class NumericArray; template class NumericArray; template class NumericArray; +template class NumericArray; template class NumericArray; template class NumericArray; template class NumericArray; @@ -193,7 +194,7 @@ std::shared_ptr BooleanArray::Slice(int64_t offset, int64_t length) const Status ListArray::Validate() const { if (length_ < 0) { return Status::Invalid("Length was negative"); } - if (!value_offsets_) { return Status::Invalid("value_offsets_ was null"); } + if (length_ && !value_offsets_) { return Status::Invalid("value_offsets_ was null"); } if (value_offsets_->size() / static_cast(sizeof(int32_t)) < length_) { std::stringstream ss; ss << "offset buffer size (bytes): " << value_offsets_->size() @@ -425,20 +426,6 @@ std::shared_ptr UnionArray::Slice(int64_t offset, int64_t length) const { // ---------------------------------------------------------------------- // DictionaryArray -Status DictionaryArray::FromBuffer(const std::shared_ptr& type, int64_t length, - const std::shared_ptr& indices, const std::shared_ptr& null_bitmap, - int64_t null_count, int64_t offset, 
std::shared_ptr* out) { - DCHECK_EQ(type->type, Type::DICTIONARY); - const auto& dict_type = static_cast(type.get()); - - std::shared_ptr boxed_indices; - RETURN_NOT_OK(MakePrimitiveArray(dict_type->index_type(), length, indices, null_bitmap, - null_count, offset, &boxed_indices)); - - *out = std::make_shared(type, boxed_indices); - return Status::OK(); -} - DictionaryArray::DictionaryArray( const std::shared_ptr& type, const std::shared_ptr& indices) : Array(type, indices->length(), indices->null_bitmap(), indices->null_count(), @@ -469,40 +456,6 @@ std::shared_ptr DictionaryArray::Slice(int64_t offset, int64_t length) co return std::make_shared(type_, sliced_indices); } -// ---------------------------------------------------------------------- - -#define MAKE_PRIMITIVE_ARRAY_CASE(ENUM, ArrayType) \ - case Type::ENUM: \ - out->reset(new ArrayType(type, length, data, null_bitmap, null_count, offset)); \ - break; - -Status MakePrimitiveArray(const std::shared_ptr& type, int64_t length, - const std::shared_ptr& data, const std::shared_ptr& null_bitmap, - int64_t null_count, int64_t offset, std::shared_ptr* out) { - switch (type->type) { - MAKE_PRIMITIVE_ARRAY_CASE(BOOL, BooleanArray); - MAKE_PRIMITIVE_ARRAY_CASE(UINT8, UInt8Array); - MAKE_PRIMITIVE_ARRAY_CASE(INT8, Int8Array); - MAKE_PRIMITIVE_ARRAY_CASE(UINT16, UInt16Array); - MAKE_PRIMITIVE_ARRAY_CASE(INT16, Int16Array); - MAKE_PRIMITIVE_ARRAY_CASE(UINT32, UInt32Array); - MAKE_PRIMITIVE_ARRAY_CASE(INT32, Int32Array); - MAKE_PRIMITIVE_ARRAY_CASE(UINT64, UInt64Array); - MAKE_PRIMITIVE_ARRAY_CASE(INT64, Int64Array); - MAKE_PRIMITIVE_ARRAY_CASE(FLOAT, FloatArray); - MAKE_PRIMITIVE_ARRAY_CASE(DOUBLE, DoubleArray); - MAKE_PRIMITIVE_ARRAY_CASE(TIME, Int64Array); - MAKE_PRIMITIVE_ARRAY_CASE(TIMESTAMP, TimestampArray); - default: - return Status::NotImplemented(type->ToString()); - } -#ifdef NDEBUG - return Status::OK(); -#else - return (*out)->Validate(); -#endif -} - // ---------------------------------------------------------------------- // Default implementations of ArrayVisitor methods @@ -527,6 +480,7 @@ ARRAY_VISITOR_DEFAULT(DoubleArray); ARRAY_VISITOR_DEFAULT(StringArray); ARRAY_VISITOR_DEFAULT(BinaryArray); ARRAY_VISITOR_DEFAULT(DateArray); +ARRAY_VISITOR_DEFAULT(Date32Array); ARRAY_VISITOR_DEFAULT(TimeArray); ARRAY_VISITOR_DEFAULT(TimestampArray); ARRAY_VISITOR_DEFAULT(IntervalArray); diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index f20f212c3a825..f111609db4317 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -58,6 +58,7 @@ class ARROW_EXPORT ArrayVisitor { virtual Status Visit(const StringArray& array); virtual Status Visit(const BinaryArray& array); virtual Status Visit(const DateArray& array); + virtual Status Visit(const Date32Array& array); virtual Status Visit(const TimeArray& array); virtual Status Visit(const TimestampArray& array); virtual Status Visit(const IntervalArray& array); @@ -485,12 +486,6 @@ class ARROW_EXPORT DictionaryArray : public Array { DictionaryArray( const std::shared_ptr& type, const std::shared_ptr& indices); - // Alternate ctor; other attributes (like null count) are inherited from the - // passed indices array - static Status FromBuffer(const std::shared_ptr& type, int64_t length, - const std::shared_ptr& indices, const std::shared_ptr& null_bitmap, - int64_t null_count, int64_t offset, std::shared_ptr* out); - Status Validate() const override; std::shared_ptr indices() const { return indices_; } @@ -531,21 +526,13 @@ extern template class ARROW_EXPORT NumericArray; extern 
template class ARROW_EXPORT NumericArray; extern template class ARROW_EXPORT NumericArray; extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; extern template class ARROW_EXPORT NumericArray; #if defined(__GNUC__) && !defined(__clang__) #pragma GCC diagnostic pop #endif -// ---------------------------------------------------------------------- -// Helper functions - -// Create new arrays for logical types that are backed by primitive arrays. -Status ARROW_EXPORT MakePrimitiveArray(const std::shared_ptr& type, - int64_t length, const std::shared_ptr& data, - const std::shared_ptr& null_bitmap, int64_t null_count, int64_t offset, - std::shared_ptr* out); - } // namespace arrow #endif diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index 9086598cc5ba7..4372925fe494b 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -238,6 +238,7 @@ template class PrimitiveBuilder; template class PrimitiveBuilder; template class PrimitiveBuilder; template class PrimitiveBuilder; +template class PrimitiveBuilder; template class PrimitiveBuilder; template class PrimitiveBuilder; template class PrimitiveBuilder; diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index e642d3c21a2fd..ebc683ab334e6 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -233,6 +233,7 @@ using Int64Builder = NumericBuilder; using TimestampBuilder = NumericBuilder; using TimeBuilder = NumericBuilder; using DateBuilder = NumericBuilder; +using Date32Builder = NumericBuilder; using HalfFloatBuilder = NumericBuilder; using FloatBuilder = NumericBuilder; diff --git a/cpp/src/arrow/column.cc b/cpp/src/arrow/column.cc index 18228700472c4..78501f9393e22 100644 --- a/cpp/src/arrow/column.cc +++ b/cpp/src/arrow/column.cc @@ -97,6 +97,9 @@ Column::Column(const std::shared_ptr& field, const std::shared_ptr } } +Column::Column(const std::string& name, const std::shared_ptr& data) + : Column(::arrow::field(name, data->type()), data) {} + Column::Column( const std::shared_ptr& field, const std::shared_ptr& data) : field_(field), data_(data) {} diff --git a/cpp/src/arrow/column.h b/cpp/src/arrow/column.h index 93a34c7c95fdf..bfcfd8ee459c0 100644 --- a/cpp/src/arrow/column.h +++ b/cpp/src/arrow/column.h @@ -69,6 +69,9 @@ class ARROW_EXPORT Column { Column(const std::shared_ptr& field, const std::shared_ptr& data); + /// Construct from name and array + Column(const std::string& name, const std::shared_ptr& data); + int64_t length() const { return data_->length(); } int64_t null_count() const { return data_->null_count(); } diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc index f38f8d67aa796..17b883302c658 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -145,6 +145,10 @@ class RangeEqualsVisitor : public ArrayVisitor { Status Visit(const DateArray& left) override { return CompareValues(left); } + Status Visit(const Date32Array& left) override { + return CompareValues(left); + } + Status Visit(const TimeArray& left) override { return CompareValues(left); } Status Visit(const TimestampArray& left) override { @@ -381,6 +385,8 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { Status Visit(const DateArray& left) override { return ComparePrimitive(left); } + Status Visit(const Date32Array& left) override { return ComparePrimitive(left); } + Status Visit(const TimeArray& left) override { return ComparePrimitive(left); } Status Visit(const TimestampArray& left) override { return ComparePrimitive(left); 
} @@ -622,7 +628,7 @@ class TypeEqualsVisitor : public TypeVisitor { Status Visit(const TimestampType& left) override { const auto& right = static_cast(right_); - result_ = left.unit == right.unit; + result_ = left.unit == right.unit && left.timezone == right.timezone; return Status::OK(); } diff --git a/cpp/src/arrow/io/memory.cc b/cpp/src/arrow/io/memory.cc index 1339a99aa787e..5b5c8649deec4 100644 --- a/cpp/src/arrow/io/memory.cc +++ b/cpp/src/arrow/io/memory.cc @@ -28,6 +28,7 @@ #include "arrow/buffer.h" #include "arrow/io/interfaces.h" #include "arrow/status.h" +#include "arrow/util/logging.h" namespace arrow { namespace io { @@ -43,9 +44,17 @@ BufferOutputStream::BufferOutputStream(const std::shared_ptr& b position_(0), mutable_data_(buffer->mutable_data()) {} +Status BufferOutputStream::Create(int64_t initial_capacity, MemoryPool* pool, + std::shared_ptr* out) { + std::shared_ptr buffer; + RETURN_NOT_OK(AllocateResizableBuffer(pool, initial_capacity, &buffer)); + *out = std::make_shared(buffer); + return Status::OK(); +} + BufferOutputStream::~BufferOutputStream() { // This can fail, better to explicitly call close - Close(); + if (buffer_) { Close(); } } Status BufferOutputStream::Close() { @@ -56,12 +65,20 @@ Status BufferOutputStream::Close() { } } +Status BufferOutputStream::Finish(std::shared_ptr* result) { + RETURN_NOT_OK(Close()); + *result = buffer_; + buffer_ = nullptr; + return Status::OK(); +} + Status BufferOutputStream::Tell(int64_t* position) { *position = position_; return Status::OK(); } Status BufferOutputStream::Write(const uint8_t* data, int64_t nbytes) { + DCHECK(buffer_); RETURN_NOT_OK(Reserve(nbytes)); std::memcpy(mutable_data_ + position_, data, nbytes); position_ += nbytes; diff --git a/cpp/src/arrow/io/memory.h b/cpp/src/arrow/io/memory.h index 2d3df4224e9fb..82807508417d7 100644 --- a/cpp/src/arrow/io/memory.h +++ b/cpp/src/arrow/io/memory.h @@ -43,6 +43,9 @@ class ARROW_EXPORT BufferOutputStream : public OutputStream { public: explicit BufferOutputStream(const std::shared_ptr& buffer); + static Status Create(int64_t initial_capacity, MemoryPool* pool, + std::shared_ptr* out); + ~BufferOutputStream(); // Implement the OutputStream interface @@ -50,6 +53,9 @@ class ARROW_EXPORT BufferOutputStream : public OutputStream { Status Tell(int64_t* position) override; Status Write(const uint8_t* data, int64_t nbytes) override; + /// Close the stream and return the buffer + Status Finish(std::shared_ptr* result); + private: // Ensures there is sufficient space available to write nbytes Status Reserve(int64_t nbytes); diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index f11c88a6e1e4b..78d58101963dc 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -32,6 +32,7 @@ #include "arrow/ipc/metadata-internal.h" #include "arrow/ipc/metadata.h" #include "arrow/ipc/util.h" +#include "arrow/loader.h" #include "arrow/memory_pool.h" #include "arrow/schema.h" #include "arrow/status.h" @@ -531,12 +532,12 @@ Status WriteRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, Status WriteDictionary(int64_t dictionary_id, const std::shared_ptr& dictionary, int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, MemoryPool* pool) { - DictionaryWriter writer(pool, buffer_start_offset, kMaxIpcRecursionDepth); + DictionaryWriter writer(pool, buffer_start_offset, kMaxNestingDepth); return writer.Write(dictionary_id, dictionary, dst, metadata_length, body_length); } Status 
GetRecordBatchSize(const RecordBatch& batch, int64_t* size) { - RecordBatchWriter writer(default_memory_pool(), 0, kMaxIpcRecursionDepth); + RecordBatchWriter writer(default_memory_pool(), 0, kMaxNestingDepth); RETURN_NOT_OK(writer.GetTotalSize(batch, size)); return Status::OK(); } @@ -544,235 +545,33 @@ Status GetRecordBatchSize(const RecordBatch& batch, int64_t* size) { // ---------------------------------------------------------------------- // Record batch read path -struct RecordBatchContext { - const RecordBatchMetadata* metadata; - int buffer_index; - int field_index; - int max_recursion_depth; -}; - -// Traverse the flattened record batch metadata and reassemble the -// corresponding array containers -class ArrayLoader : public TypeVisitor { +class IpcComponentSource : public ArrayComponentSource { public: - ArrayLoader( - const Field& field, RecordBatchContext* context, io::ReadableFileInterface* file) - : field_(field), context_(context), file_(file) {} - - Status Load(std::shared_ptr* out) { - if (context_->max_recursion_depth <= 0) { - return Status::Invalid("Max recursion depth reached"); - } - - // Load the array - RETURN_NOT_OK(field_.type->Accept(this)); + IpcComponentSource(const RecordBatchMetadata& metadata, io::ReadableFileInterface* file) + : metadata_(metadata), file_(file) {} - *out = std::move(result_); - return Status::OK(); - } - - private: - const Field& field_; - RecordBatchContext* context_; - io::ReadableFileInterface* file_; - - // Used in visitor pattern - std::shared_ptr result_; - - Status LoadChild(const Field& field, std::shared_ptr* out) { - ArrayLoader loader(field, context_, file_); - --context_->max_recursion_depth; - RETURN_NOT_OK(loader.Load(out)); - ++context_->max_recursion_depth; - return Status::OK(); - } - - Status GetBuffer(int buffer_index, std::shared_ptr* out) { - BufferMetadata metadata = context_->metadata->buffer(buffer_index); - - if (metadata.length == 0) { + Status GetBuffer(int buffer_index, std::shared_ptr* out) override { + BufferMetadata buffer_meta = metadata_.buffer(buffer_index); + if (buffer_meta.length == 0) { *out = nullptr; return Status::OK(); } else { - return file_->ReadAt(metadata.offset, metadata.length, out); + return file_->ReadAt(buffer_meta.offset, buffer_meta.length, out); } } - Status LoadCommon(FieldMetadata* field_meta, std::shared_ptr* null_bitmap) { + Status GetFieldMetadata(int field_index, FieldMetadata* metadata) override { // pop off a field - if (context_->field_index >= context_->metadata->num_fields()) { + if (field_index >= metadata_.num_fields()) { return Status::Invalid("Ran out of field metadata, likely malformed"); } - - // This only contains the length and null count, which we need to figure - // out what to do with the buffers. 
For example, if null_count == 0, then - // we can skip that buffer without reading from shared memory - *field_meta = context_->metadata->field(context_->field_index++); - - // extract null_bitmap which is common to all arrays - if (field_meta->null_count == 0) { - *null_bitmap = nullptr; - } else { - RETURN_NOT_OK(GetBuffer(context_->buffer_index, null_bitmap)); - } - context_->buffer_index++; - return Status::OK(); - } - - Status LoadPrimitive(const DataType& type) { - FieldMetadata field_meta; - std::shared_ptr null_bitmap, data; - - RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); - if (field_meta.length > 0) { - RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &data)); - } else { - context_->buffer_index++; - data.reset(new Buffer(nullptr, 0)); - } - return MakePrimitiveArray(field_.type, field_meta.length, data, null_bitmap, - field_meta.null_count, 0, &result_); - } - - template - Status LoadBinary() { - FieldMetadata field_meta; - std::shared_ptr null_bitmap, offsets, values; - - RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); - if (field_meta.length > 0) { - RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &offsets)); - RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &values)); - } else { - context_->buffer_index += 2; - offsets = values = nullptr; - } - - result_ = std::make_shared( - field_meta.length, offsets, values, null_bitmap, field_meta.null_count); - return Status::OK(); - } - - Status Visit(const BooleanType& type) override { return LoadPrimitive(type); } - - Status Visit(const Int8Type& type) override { return LoadPrimitive(type); } - - Status Visit(const Int16Type& type) override { return LoadPrimitive(type); } - - Status Visit(const Int32Type& type) override { return LoadPrimitive(type); } - - Status Visit(const Int64Type& type) override { return LoadPrimitive(type); } - - Status Visit(const UInt8Type& type) override { return LoadPrimitive(type); } - - Status Visit(const UInt16Type& type) override { return LoadPrimitive(type); } - - Status Visit(const UInt32Type& type) override { return LoadPrimitive(type); } - - Status Visit(const UInt64Type& type) override { return LoadPrimitive(type); } - - Status Visit(const HalfFloatType& type) override { return LoadPrimitive(type); } - - Status Visit(const FloatType& type) override { return LoadPrimitive(type); } - - Status Visit(const DoubleType& type) override { return LoadPrimitive(type); } - - Status Visit(const StringType& type) override { return LoadBinary(); } - - Status Visit(const BinaryType& type) override { return LoadBinary(); } - - Status Visit(const DateType& type) override { return LoadPrimitive(type); } - - Status Visit(const TimeType& type) override { return LoadPrimitive(type); } - - Status Visit(const TimestampType& type) override { return LoadPrimitive(type); } - - Status Visit(const ListType& type) override { - FieldMetadata field_meta; - std::shared_ptr null_bitmap, offsets; - - RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); - if (field_meta.length > 0) { - RETURN_NOT_OK(GetBuffer(context_->buffer_index, &offsets)); - } else { - offsets = nullptr; - } - ++context_->buffer_index; - - const int num_children = type.num_children(); - if (num_children != 1) { - std::stringstream ss; - ss << "Wrong number of children: " << num_children; - return Status::Invalid(ss.str()); - } - std::shared_ptr values_array; - - RETURN_NOT_OK(LoadChild(*type.child(0).get(), &values_array)); - - result_ = std::make_shared(field_.type, field_meta.length, offsets, - values_array, null_bitmap, 
field_meta.null_count); - return Status::OK(); - } - - Status LoadChildren(std::vector> child_fields, - std::vector>* arrays) { - arrays->reserve(static_cast(child_fields.size())); - - for (const auto& child_field : child_fields) { - std::shared_ptr field_array; - RETURN_NOT_OK(LoadChild(*child_field.get(), &field_array)); - arrays->emplace_back(field_array); - } + *metadata = metadata_.field(field_index); return Status::OK(); } - Status Visit(const StructType& type) override { - FieldMetadata field_meta; - std::shared_ptr null_bitmap; - RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); - - std::vector> fields; - RETURN_NOT_OK(LoadChildren(type.children(), &fields)); - - result_ = std::make_shared( - field_.type, field_meta.length, fields, null_bitmap, field_meta.null_count); - return Status::OK(); - } - - Status Visit(const UnionType& type) override { - FieldMetadata field_meta; - std::shared_ptr null_bitmap, type_ids, offsets; - - RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); - if (field_meta.length > 0) { - RETURN_NOT_OK(GetBuffer(context_->buffer_index, &type_ids)); - if (type.mode == UnionMode::DENSE) { - RETURN_NOT_OK(GetBuffer(context_->buffer_index + 1, &offsets)); - } - } - context_->buffer_index += type.mode == UnionMode::DENSE ? 2 : 1; - - std::vector> fields; - RETURN_NOT_OK(LoadChildren(type.children(), &fields)); - - result_ = std::make_shared(field_.type, field_meta.length, fields, - type_ids, offsets, null_bitmap, field_meta.null_count); - return Status::OK(); - } - - Status Visit(const DictionaryType& type) override { - FieldMetadata field_meta; - std::shared_ptr null_bitmap, indices_data; - RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); - RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &indices_data)); - - std::shared_ptr indices; - RETURN_NOT_OK(MakePrimitiveArray(type.index_type(), field_meta.length, indices_data, - null_bitmap, field_meta.null_count, 0, &indices)); - - result_ = std::make_shared(field_.type, indices); - return Status::OK(); - }; + private: + const RecordBatchMetadata& metadata_; + io::ReadableFileInterface* file_; }; class RecordBatchReader { @@ -788,17 +587,15 @@ class RecordBatchReader { Status Read(std::shared_ptr* out) { std::vector> arrays(schema_->num_fields()); - // The field_index and buffer_index are incremented in the ArrayLoader - // based on how much of the batch is "consumed" (through nested data - // reconstruction, for example) - context_.metadata = &metadata_; - context_.field_index = 0; - context_.buffer_index = 0; - context_.max_recursion_depth = max_recursion_depth_; + IpcComponentSource source(metadata_, file_); + ArrayLoaderContext context; + context.source = &source; + context.field_index = 0; + context.buffer_index = 0; + context.max_recursion_depth = max_recursion_depth_; for (int i = 0; i < schema_->num_fields(); ++i) { - ArrayLoader loader(*schema_->field(i).get(), &context_, file_); - RETURN_NOT_OK(loader.Load(&arrays[i])); + RETURN_NOT_OK(LoadArray(schema_->field(i)->type, &context, &arrays[i])); } *out = std::make_shared(schema_, metadata_.length(), arrays); @@ -806,7 +603,6 @@ class RecordBatchReader { } private: - RecordBatchContext context_; const RecordBatchMetadata& metadata_; std::shared_ptr schema_; int max_recursion_depth_; @@ -816,7 +612,7 @@ class RecordBatchReader { Status ReadRecordBatch(const RecordBatchMetadata& metadata, const std::shared_ptr& schema, io::ReadableFileInterface* file, std::shared_ptr* out) { - return ReadRecordBatch(metadata, schema, kMaxIpcRecursionDepth, file, out); + 
return ReadRecordBatch(metadata, schema, kMaxNestingDepth, file, out); } Status ReadRecordBatch(const RecordBatchMetadata& metadata, diff --git a/cpp/src/arrow/ipc/adapter.h b/cpp/src/arrow/ipc/adapter.h index 933d3a4639fe8..21d814db86530 100644 --- a/cpp/src/arrow/ipc/adapter.h +++ b/cpp/src/arrow/ipc/adapter.h @@ -26,6 +26,7 @@ #include #include "arrow/ipc/metadata.h" +#include "arrow/loader.h" #include "arrow/util/visibility.h" namespace arrow { @@ -47,11 +48,6 @@ namespace ipc { // ---------------------------------------------------------------------- // Write path -// -// ARROW-109: We set this number arbitrarily to help catch user mistakes. For -// deeply nested schemas, it is expected the user will indicate explicitly the -// maximum allowed recursion depth -constexpr int kMaxIpcRecursionDepth = 64; // Write the RecordBatch (collection of equal-length Arrow arrays) to the // output stream in a contiguous block. The record batch metadata is written as @@ -75,7 +71,7 @@ constexpr int kMaxIpcRecursionDepth = 64; // padding bytes Status WriteRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, - MemoryPool* pool, int max_recursion_depth = kMaxIpcRecursionDepth); + MemoryPool* pool, int max_recursion_depth = kMaxNestingDepth); // Write Array as a DictionaryBatch message Status WriteDictionary(int64_t dictionary_id, const std::shared_ptr& dictionary, diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index 2ba44ac618ce3..695e7886e3124 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -289,6 +289,7 @@ FieldMetadata RecordBatchMetadata::field(int i) const { FieldMetadata result; result.length = node->length(); result.null_count = node->null_count(); + result.offset = 0; return result; } diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h index f12529b5c585e..f6a0a3a073faa 100644 --- a/cpp/src/arrow/ipc/metadata.h +++ b/cpp/src/arrow/ipc/metadata.h @@ -25,6 +25,7 @@ #include #include +#include "arrow/loader.h" #include "arrow/util/macros.h" #include "arrow/util/visibility.h" @@ -135,12 +136,6 @@ class ARROW_EXPORT SchemaMetadata { DISALLOW_COPY_AND_ASSIGN(SchemaMetadata); }; -// Field metadata -struct ARROW_EXPORT FieldMetadata { - int32_t length; - int32_t null_count; -}; - struct ARROW_EXPORT BufferMetadata { int32_t page; int64_t offset; diff --git a/cpp/src/arrow/loader.cc b/cpp/src/arrow/loader.cc new file mode 100644 index 0000000000000..3cb51ae8fdab7 --- /dev/null +++ b/cpp/src/arrow/loader.cc @@ -0,0 +1,285 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
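The adapter change above reduces the IPC read path to an `IpcComponentSource` feeding generic `LoadArray` calls; the new `loader.cc` and `loader.h`, whose contents follow, also add a convenience overload that takes plain `FieldMetadata` plus raw buffers. A minimal sketch of that entry point, assuming only the API introduced in this patch (the helper name is hypothetical):

```cpp
// Sketch: rebuild an int32 array from raw pieces with no IPC
// machinery involved. The call shape mirrors the pandas.cc call
// site at the end of this patch.
#include <cstdint>
#include <memory>
#include <vector>

#include "arrow/buffer.h"
#include "arrow/loader.h"    // LoadArray, FieldMetadata
#include "arrow/status.h"
#include "arrow/type_fwd.h"  // arrow::int32() factory

arrow::Status LoadInt32NoNulls(const std::shared_ptr<arrow::Buffer>& values,
                               int64_t length,
                               std::shared_ptr<arrow::Array>* out) {
  std::vector<arrow::FieldMetadata> fields(1);
  fields[0].length = length;
  fields[0].null_count = 0;  // null_count == 0 lets the bitmap stay null
  fields[0].offset = 0;
  // Buffer order follows the IPC layout: validity bitmap, then data.
  return arrow::LoadArray(arrow::int32(), fields, {nullptr, values}, out);
}
```

Routing both the IPC reader and the NumPy conversion path in `pandas.cc` through this one function is what lets the patch delete `MakePrimitiveArray` and `DictionaryArray::FromBuffer` outright.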
+ +#include "arrow/loader.h" + +#include +#include +#include +#include + +#include "arrow/array.h" +#include "arrow/buffer.h" +#include "arrow/type.h" +#include "arrow/type_traits.h" +#include "arrow/util/logging.h" +#include "arrow/util/visibility.h" + +namespace arrow { + +class Array; +struct DataType; +class Status; + +class ArrayLoader : public TypeVisitor { + public: + ArrayLoader(const std::shared_ptr& type, ArrayLoaderContext* context) + : type_(type), context_(context) {} + + Status Load(std::shared_ptr* out) { + if (context_->max_recursion_depth <= 0) { + return Status::Invalid("Max recursion depth reached"); + } + + // Load the array + RETURN_NOT_OK(type_->Accept(this)); + + *out = std::move(result_); + return Status::OK(); + } + + Status GetBuffer(int buffer_index, std::shared_ptr* out) { + return context_->source->GetBuffer(buffer_index, out); + } + + Status LoadCommon(FieldMetadata* field_meta, std::shared_ptr* null_bitmap) { + // This only contains the length and null count, which we need to figure + // out what to do with the buffers. For example, if null_count == 0, then + // we can skip that buffer without reading from shared memory + RETURN_NOT_OK( + context_->source->GetFieldMetadata(context_->field_index++, field_meta)); + + // extract null_bitmap which is common to all arrays + if (field_meta->null_count == 0) { + *null_bitmap = nullptr; + } else { + RETURN_NOT_OK(GetBuffer(context_->buffer_index, null_bitmap)); + } + context_->buffer_index++; + return Status::OK(); + } + + template + Status LoadPrimitive() { + using ArrayType = typename TypeTraits::ArrayType; + + FieldMetadata field_meta; + std::shared_ptr null_bitmap, data; + + RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); + if (field_meta.length > 0) { + RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &data)); + } else { + context_->buffer_index++; + data.reset(new Buffer(nullptr, 0)); + } + result_ = std::make_shared(type_, field_meta.length, data, null_bitmap, + field_meta.null_count, field_meta.offset); + return Status::OK(); + } + + template + Status LoadBinary() { + FieldMetadata field_meta; + std::shared_ptr null_bitmap, offsets, values; + + RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); + if (field_meta.length > 0) { + RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &offsets)); + RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &values)); + } else { + context_->buffer_index += 2; + offsets = values = nullptr; + } + + result_ = std::make_shared( + field_meta.length, offsets, values, null_bitmap, field_meta.null_count); + return Status::OK(); + } + + Status LoadChild(const Field& field, std::shared_ptr* out) { + ArrayLoader loader(field.type, context_); + --context_->max_recursion_depth; + RETURN_NOT_OK(loader.Load(out)); + ++context_->max_recursion_depth; + return Status::OK(); + } + + Status LoadChildren(std::vector> child_fields, + std::vector>* arrays) { + arrays->reserve(static_cast(child_fields.size())); + + for (const auto& child_field : child_fields) { + std::shared_ptr field_array; + RETURN_NOT_OK(LoadChild(*child_field.get(), &field_array)); + arrays->emplace_back(field_array); + } + return Status::OK(); + } + +#define VISIT_PRIMITIVE(TYPE) \ + Status Visit(const TYPE& type) override { return LoadPrimitive(); } + + VISIT_PRIMITIVE(BooleanType); + VISIT_PRIMITIVE(Int8Type); + VISIT_PRIMITIVE(Int16Type); + VISIT_PRIMITIVE(Int32Type); + VISIT_PRIMITIVE(Int64Type); + VISIT_PRIMITIVE(UInt8Type); + VISIT_PRIMITIVE(UInt16Type); + VISIT_PRIMITIVE(UInt32Type); + 
VISIT_PRIMITIVE(UInt64Type); + VISIT_PRIMITIVE(HalfFloatType); + VISIT_PRIMITIVE(FloatType); + VISIT_PRIMITIVE(DoubleType); + VISIT_PRIMITIVE(DateType); + VISIT_PRIMITIVE(Date32Type); + VISIT_PRIMITIVE(TimeType); + VISIT_PRIMITIVE(TimestampType); + +#undef VISIT_PRIMITIVE + + Status Visit(const StringType& type) override { return LoadBinary(); } + + Status Visit(const BinaryType& type) override { return LoadBinary(); } + + Status Visit(const ListType& type) override { + FieldMetadata field_meta; + std::shared_ptr null_bitmap, offsets; + + RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); + if (field_meta.length > 0) { + RETURN_NOT_OK(GetBuffer(context_->buffer_index, &offsets)); + } else { + offsets = nullptr; + } + ++context_->buffer_index; + + const int num_children = type.num_children(); + if (num_children != 1) { + std::stringstream ss; + ss << "Wrong number of children: " << num_children; + return Status::Invalid(ss.str()); + } + std::shared_ptr values_array; + + RETURN_NOT_OK(LoadChild(*type.child(0).get(), &values_array)); + + result_ = std::make_shared(type_, field_meta.length, offsets, values_array, + null_bitmap, field_meta.null_count); + return Status::OK(); + } + + Status Visit(const StructType& type) override { + FieldMetadata field_meta; + std::shared_ptr null_bitmap; + RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); + + std::vector> fields; + RETURN_NOT_OK(LoadChildren(type.children(), &fields)); + + result_ = std::make_shared( + type_, field_meta.length, fields, null_bitmap, field_meta.null_count); + return Status::OK(); + } + + Status Visit(const UnionType& type) override { + FieldMetadata field_meta; + std::shared_ptr null_bitmap, type_ids, offsets; + + RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); + if (field_meta.length > 0) { + RETURN_NOT_OK(GetBuffer(context_->buffer_index, &type_ids)); + if (type.mode == UnionMode::DENSE) { + RETURN_NOT_OK(GetBuffer(context_->buffer_index + 1, &offsets)); + } + } + context_->buffer_index += type.mode == UnionMode::DENSE ? 
2 : 1; + + std::vector> fields; + RETURN_NOT_OK(LoadChildren(type.children(), &fields)); + + result_ = std::make_shared(type_, field_meta.length, fields, type_ids, + offsets, null_bitmap, field_meta.null_count); + return Status::OK(); + } + + Status Visit(const DictionaryType& type) override { + std::shared_ptr indices; + RETURN_NOT_OK(LoadArray(type.index_type(), context_, &indices)); + result_ = std::make_shared(type_, indices); + return Status::OK(); + }; + + std::shared_ptr result() const { return result_; } + + private: + const std::shared_ptr type_; + ArrayLoaderContext* context_; + + // Used in visitor pattern + std::shared_ptr result_; +}; + +Status ARROW_EXPORT LoadArray(const std::shared_ptr& type, + ArrayComponentSource* source, std::shared_ptr* out) { + ArrayLoaderContext context; + context.source = source; + context.field_index = context.buffer_index = 0; + context.max_recursion_depth = kMaxNestingDepth; + return LoadArray(type, &context, out); +} + +Status ARROW_EXPORT LoadArray(const std::shared_ptr& type, + ArrayLoaderContext* context, std::shared_ptr* out) { + ArrayLoader loader(type, context); + RETURN_NOT_OK(loader.Load(out)); + + return Status::OK(); +} + +class InMemorySource : public ArrayComponentSource { + public: + InMemorySource(const std::vector& fields, + const std::vector>& buffers) + : fields_(fields), buffers_(buffers) {} + + Status GetBuffer(int buffer_index, std::shared_ptr* out) { + DCHECK(buffer_index < static_cast(buffers_.size())); + *out = buffers_[buffer_index]; + return Status::OK(); + } + + Status GetFieldMetadata(int field_index, FieldMetadata* metadata) { + DCHECK(field_index < static_cast(fields_.size())); + *metadata = fields_[field_index]; + return Status::OK(); + } + + private: + const std::vector& fields_; + const std::vector>& buffers_; +}; + +Status ARROW_EXPORT LoadArray(const std::shared_ptr& type, + const std::vector& fields, + const std::vector>& buffers, std::shared_ptr* out) { + InMemorySource source(fields, buffers); + return LoadArray(type, &source, out); +} + +} // namespace arrow diff --git a/cpp/src/arrow/loader.h b/cpp/src/arrow/loader.h new file mode 100644 index 0000000000000..b4949f2556028 --- /dev/null +++ b/cpp/src/arrow/loader.h @@ -0,0 +1,89 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +// Function for constructing Array array objects from metadata and raw memory +// buffers + +#ifndef ARROW_LOADER_H +#define ARROW_LOADER_H + +#include +#include +#include +#include + +#include "arrow/status.h" +#include "arrow/util/visibility.h" + +namespace arrow { + +class Array; +class Buffer; +struct DataType; + +// ARROW-109: We set this number arbitrarily to help catch user mistakes. 
For +// deeply nested schemas, it is expected the user will indicate explicitly the +// maximum allowed recursion depth +constexpr int kMaxNestingDepth = 64; + +struct ARROW_EXPORT FieldMetadata { + int64_t length; + int64_t null_count; + int64_t offset; +}; + +/// Implement this to create new types of Arrow data loaders +class ARROW_EXPORT ArrayComponentSource { + public: + virtual ~ArrayComponentSource() = default; + + virtual Status GetBuffer(int buffer_index, std::shared_ptr* out) = 0; + virtual Status GetFieldMetadata(int field_index, FieldMetadata* metadata) = 0; +}; + +/// Bookkeeping struct for loading array objects from their constituent pieces of raw data +/// +/// The field_index and buffer_index are incremented in the ArrayLoader +/// based on how much of the batch is "consumed" (through nested data +/// reconstruction, for example) +struct ArrayLoaderContext { + ArrayComponentSource* source; + int buffer_index; + int field_index; + int max_recursion_depth; +}; + +/// Construct an Array container from type metadata and a collection of memory +/// buffers +/// +/// \param[in] field the data type of the array being loaded +/// \param[in] source an implementation of ArrayComponentSource +/// \param[out] out the constructed array +/// \return Status indicating success or failure +Status ARROW_EXPORT LoadArray(const std::shared_ptr& type, + ArrayComponentSource* source, std::shared_ptr* out); + +Status ARROW_EXPORT LoadArray(const std::shared_ptr& field, + ArrayLoaderContext* context, std::shared_ptr* out); + +Status ARROW_EXPORT LoadArray(const std::shared_ptr& type, + const std::vector& fields, + const std::vector>& buffers, std::shared_ptr* out); + +} // namespace arrow + +#endif // ARROW_LOADER_H diff --git a/cpp/src/arrow/pretty_print.cc b/cpp/src/arrow/pretty_print.cc index 7e69e42800e79..2508fa5bd8cde 100644 --- a/cpp/src/arrow/pretty_print.cc +++ b/cpp/src/arrow/pretty_print.cc @@ -145,9 +145,11 @@ class ArrayPrinter : public ArrayVisitor { Status Visit(const BinaryArray& array) override { return WriteVarBytes(array); } - Status Visit(const DateArray& array) override { return Status::NotImplemented("date"); } + Status Visit(const DateArray& array) override { return WritePrimitive(array); } - Status Visit(const TimeArray& array) override { return Status::NotImplemented("time"); } + Status Visit(const Date32Array& array) override { return WritePrimitive(array); } + + Status Visit(const TimeArray& array) override { return WritePrimitive(array); } Status Visit(const TimestampArray& array) override { return Status::NotImplemented("timestamp"); diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 7e5f13af9cf9b..4679a2f5b76b6 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -54,9 +54,7 @@ bool DataType::Equals(const DataType& other) const { } bool DataType::Equals(const std::shared_ptr& other) const { - if (!other) { - return false; - } + if (!other) { return false; } return Equals(*other.get()); } @@ -106,6 +104,10 @@ std::string DateType::ToString() const { return std::string("date"); } +std::string Date32Type::ToString() const { + return std::string("date32"); +} + // ---------------------------------------------------------------------- // Union type @@ -135,11 +137,12 @@ std::string UnionType::ToString() const { // ---------------------------------------------------------------------- // DictionaryType -DictionaryType::DictionaryType( - const std::shared_ptr& index_type, const std::shared_ptr& dictionary) +DictionaryType::DictionaryType(const 
std::shared_ptr& index_type, + const std::shared_ptr& dictionary, bool ordered) : FixedWidthType(Type::DICTIONARY), index_type_(index_type), - dictionary_(dictionary) {} + dictionary_(dictionary), + ordered_(ordered) {} int DictionaryType::bit_width() const { return static_cast(index_type_.get())->bit_width(); @@ -178,6 +181,7 @@ ACCEPT_VISITOR(StructType); ACCEPT_VISITOR(DecimalType); ACCEPT_VISITOR(UnionType); ACCEPT_VISITOR(DateType); +ACCEPT_VISITOR(Date32Type); ACCEPT_VISITOR(TimeType); ACCEPT_VISITOR(TimestampType); ACCEPT_VISITOR(IntervalType); @@ -205,11 +209,16 @@ TYPE_FACTORY(float64, DoubleType); TYPE_FACTORY(utf8, StringType); TYPE_FACTORY(binary, BinaryType); TYPE_FACTORY(date, DateType); +TYPE_FACTORY(date32, Date32Type); std::shared_ptr timestamp(TimeUnit unit) { return std::make_shared(unit); } +std::shared_ptr timestamp(const std::string& timezone, TimeUnit unit) { + return std::make_shared(timezone, unit); +} + std::shared_ptr time(TimeUnit unit) { return std::make_shared(unit); } @@ -313,6 +322,7 @@ TYPE_VISITOR_DEFAULT(DoubleType); TYPE_VISITOR_DEFAULT(StringType); TYPE_VISITOR_DEFAULT(BinaryType); TYPE_VISITOR_DEFAULT(DateType); +TYPE_VISITOR_DEFAULT(Date32Type); TYPE_VISITOR_DEFAULT(TimeType); TYPE_VISITOR_DEFAULT(TimestampType); TYPE_VISITOR_DEFAULT(IntervalType); diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 9b1ab3288eb8c..aa0d70e5505e6 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -67,9 +67,12 @@ struct Type { // Variable-length bytes (no guarantee of UTF8-ness) BINARY, - // By default, int32 days since the UNIX epoch + // int64_t milliseconds since the UNIX epoch DATE, + // int32_t days since the UNIX epoch + DATE32, + // Exact timestamp encoded with int64 since UNIX epoch // Default unit millisecond TIMESTAMP, @@ -132,6 +135,7 @@ class ARROW_EXPORT TypeVisitor { virtual Status Visit(const StringType& type); virtual Status Visit(const BinaryType& type); virtual Status Visit(const DateType& type); + virtual Status Visit(const Date32Type& type); virtual Status Visit(const TimeType& type); virtual Status Visit(const TimestampType& type); virtual Status Visit(const IntervalType& type); @@ -425,6 +429,7 @@ struct ARROW_EXPORT UnionType : public DataType { // ---------------------------------------------------------------------- // Date and time types +/// Date as int64_t milliseconds since UNIX epoch struct ARROW_EXPORT DateType : public FixedWidthType { static constexpr Type::type type_id = Type::DATE; @@ -439,6 +444,20 @@ struct ARROW_EXPORT DateType : public FixedWidthType { static std::string name() { return "date"; } }; +/// Date as int32_t days since UNIX epoch +struct ARROW_EXPORT Date32Type : public FixedWidthType { + static constexpr Type::type type_id = Type::DATE32; + + using c_type = int32_t; + + Date32Type() : FixedWidthType(Type::DATE32) {} + + int bit_width() const override { return static_cast(sizeof(c_type) * 8); } + + Status Accept(TypeVisitor* visitor) const override; + std::string ToString() const override; +}; + enum class TimeUnit : char { SECOND = 0, MILLI = 1, MICRO = 2, NANO = 3 }; struct ARROW_EXPORT TimeType : public FixedWidthType { @@ -467,16 +486,20 @@ struct ARROW_EXPORT TimestampType : public FixedWidthType { int bit_width() const override { return static_cast(sizeof(int64_t) * 8); } - TimeUnit unit; - explicit TimestampType(TimeUnit unit = TimeUnit::MILLI) : FixedWidthType(Type::TIMESTAMP), unit(unit) {} + explicit TimestampType(const std::string& timezone, TimeUnit unit = TimeUnit::MILLI) + : 
FixedWidthType(Type::TIMESTAMP), unit(unit), timezone(timezone) {} + TimestampType(const TimestampType& other) : TimestampType(other.unit) {} Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override { return name(); } static std::string name() { return "timestamp"; } + + TimeUnit unit; + std::string timezone; }; struct ARROW_EXPORT IntervalType : public FixedWidthType { @@ -507,7 +530,7 @@ class ARROW_EXPORT DictionaryType : public FixedWidthType { static constexpr Type::type type_id = Type::DICTIONARY; DictionaryType(const std::shared_ptr& index_type, - const std::shared_ptr& dictionary); + const std::shared_ptr& dictionary, bool ordered = false); int bit_width() const override; @@ -518,11 +541,13 @@ class ARROW_EXPORT DictionaryType : public FixedWidthType { Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; + bool ordered() const { return ordered_; } + private: // Must be an integer type (not currently checked) std::shared_ptr index_type_; - std::shared_ptr dictionary_; + bool ordered_; }; // ---------------------------------------------------------------------- @@ -532,6 +557,8 @@ std::shared_ptr ARROW_EXPORT list(const std::shared_ptr& value_ std::shared_ptr ARROW_EXPORT list(const std::shared_ptr& value_type); std::shared_ptr ARROW_EXPORT timestamp(TimeUnit unit); +std::shared_ptr ARROW_EXPORT timestamp( + const std::string& timezone, TimeUnit unit); std::shared_ptr ARROW_EXPORT time(TimeUnit unit); std::shared_ptr ARROW_EXPORT struct_( @@ -595,6 +622,7 @@ static inline bool is_primitive(Type::type type_id) { case Type::FLOAT: case Type::DOUBLE: case Type::DATE: + case Type::DATE32: case Type::TIMESTAMP: case Type::TIME: case Type::INTERVAL: diff --git a/cpp/src/arrow/type_fwd.h b/cpp/src/arrow/type_fwd.h index fc4ad3d87d8ac..e53afe1a34d36 100644 --- a/cpp/src/arrow/type_fwd.h +++ b/cpp/src/arrow/type_fwd.h @@ -95,6 +95,10 @@ struct DateType; using DateArray = NumericArray; using DateBuilder = NumericBuilder; +struct Date32Type; +using Date32Array = NumericArray; +using Date32Builder = NumericBuilder; + struct TimeType; using TimeArray = NumericArray; using TimeBuilder = NumericBuilder; @@ -125,6 +129,7 @@ std::shared_ptr ARROW_EXPORT float64(); std::shared_ptr ARROW_EXPORT utf8(); std::shared_ptr ARROW_EXPORT binary(); std::shared_ptr ARROW_EXPORT date(); +std::shared_ptr ARROW_EXPORT date32(); } // namespace arrow diff --git a/cpp/src/arrow/type_traits.h b/cpp/src/arrow/type_traits.h index d6687c11bcf73..2cd14203cdbb1 100644 --- a/cpp/src/arrow/type_traits.h +++ b/cpp/src/arrow/type_traits.h @@ -130,6 +130,18 @@ struct TypeTraits { static inline std::shared_ptr type_singleton() { return date(); } }; +template <> +struct TypeTraits { + using ArrayType = Date32Array; + using BuilderType = Date32Builder; + + static inline int64_t bytes_required(int64_t elements) { + return elements * sizeof(int32_t); + } + constexpr static bool is_parameter_free = true; + static inline std::shared_ptr type_singleton() { return date32(); } +}; + template <> struct TypeTraits { using ArrayType = TimestampArray; diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 7787e95df5e72..6a6b4ba9ad0cb 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -54,7 +54,8 @@ cdef class Array: self.type.init(self.sp_array.get().type()) @staticmethod - def from_pandas(obj, mask=None, timestamps_to_ms=False, Field field=None, MemoryPool memory_pool=None): + def from_pandas(obj, mask=None, timestamps_to_ms=False, 
Field field=None, + MemoryPool memory_pool=None): """ Convert pandas.Series to an Arrow Array. @@ -75,8 +76,9 @@ cdef class Array: Notes ----- - Localized timestamps will currently be returned as UTC (pandas's native representation). - Timezone-naive data will be implicitly interpreted as UTC. + Localized timestamps will currently be returned as UTC (pandas's native + representation). Timezone-naive data will be implicitly interpreted as + UTC. Examples -------- @@ -119,9 +121,9 @@ cdef class Array: series_values = get_series_values(obj) if isinstance(series_values, pd.Categorical): - return DictionaryArray.from_arrays(series_values.codes, - series_values.categories.values, - mask=mask, memory_pool=memory_pool) + return DictionaryArray.from_arrays( + series_values.codes, series_values.categories.values, + mask=mask, memory_pool=memory_pool) else: if series_values.dtype.type == np.datetime64 and timestamps_to_ms: series_values = series_values.astype('datetime64[ms]') @@ -134,7 +136,8 @@ cdef class Array: return box_array(out) @staticmethod - def from_list(object list_obj, DataType type=None, MemoryPool memory_pool=None): + def from_list(object list_obj, DataType type=None, + MemoryPool memory_pool=None): """ Convert Python list to Arrow array @@ -358,7 +361,8 @@ cdef class BinaryArray(Array): cdef class DictionaryArray(Array): @staticmethod - def from_arrays(indices, dictionary, mask=None, MemoryPool memory_pool=None): + def from_arrays(indices, dictionary, mask=None, + MemoryPool memory_pool=None): """ Construct Arrow DictionaryArray from array of indices (must be non-negative integers) and corresponding array of dictionary values @@ -380,8 +384,15 @@ cdef class DictionaryArray(Array): shared_ptr[CDataType] c_type shared_ptr[CArray] c_result - arrow_indices = Array.from_pandas(indices, mask=mask, memory_pool=memory_pool) - arrow_dictionary = Array.from_pandas(dictionary, memory_pool=memory_pool) + if mask is None: + mask = indices == -1 + else: + mask = mask | (indices == -1) + + arrow_indices = Array.from_pandas(indices, mask=mask, + memory_pool=memory_pool) + arrow_dictionary = Array.from_pandas(dictionary, + memory_pool=memory_pool) if not isinstance(arrow_indices, IntegerArray): raise ValueError('Indices must be integer type') diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index 93bc6ddcd56f6..ad5af1b0128ca 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -359,7 +359,8 @@ cdef class RecordBatch: """ Number of rows - Due to the definition of a RecordBatch, all columns have the same number of rows. + Due to the definition of a RecordBatch, all columns have the same + number of rows. 
Returns ------- diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index 960653dca279e..953fa2c4b9a72 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -81,7 +81,11 @@ def _check_array_roundtrip(self, values, expected=None, arr = A.Array.from_pandas(values, timestamps_to_ms=timestamps_to_ms, field=field) result = arr.to_pandas() - tm.assert_series_equal(pd.Series(result), pd.Series(values), check_names=False) + + assert arr.null_count == pd.isnull(values).sum() + + tm.assert_series_equal(pd.Series(result), pd.Series(values), + check_names=False) def test_float_no_nulls(self): data = {} diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index cadb53e0d2ab9..c707ada9dd55c 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -34,6 +34,7 @@ #include #include "arrow/api.h" +#include "arrow/loader.h" #include "arrow/status.h" #include "arrow/type_fwd.h" #include "arrow/type_traits.h" @@ -610,6 +611,7 @@ class PandasBlock { DOUBLE, BOOL, DATETIME, + DATETIME_WITH_TZ, CATEGORICAL }; @@ -1157,7 +1159,7 @@ class DataFrameBlockCreator { } int block_placement = 0; - if (column_type == Type::DICTIONARY) { + if (output_type == PandasBlock::CATEGORICAL) { std::shared_ptr<PandasBlock> block; RETURN_NOT_OK(MakeCategoricalBlock(col->type(), table_->num_rows(), &block)); categorical_blocks_[i] = block; @@ -1518,15 +1520,16 @@ inline Status ArrowSerializer<TYPE>::Convert(std::shared_ptr<Array>* out) { null_count = ValuesToBitmap(PyArray_DATA(arr_), length_, null_bitmap_data_); } - // For readability - constexpr int64_t kOffset = 0; - RETURN_NOT_OK(ConvertData()); std::shared_ptr<DataType> type; RETURN_NOT_OK(MakeDataType(&type)); - RETURN_NOT_OK( - MakePrimitiveArray(type, length_, data_, null_bitmap_, null_count, kOffset, out)); - return Status::OK(); + + std::vector<FieldMetadata> fields(1); + fields[0].length = length_; + fields[0].null_count = null_count; + fields[0].offset = 0; + + return arrow::LoadArray(type, fields, {null_bitmap_, data_}, out); } template <> From f7f915d90aee8affb40616bc14877afde9b32898 Mon Sep 17 00:00:00 2001 From: Bryan Cutler Date: Fri, 10 Mar 2017 19:39:35 -0500 Subject: [PATCH 0353/1644] ARROW-615: [Java] Moved ByteArrayReadableSeekableByteChannel to src main o.a.a.vector.util The ByteArrayReadableSeekableByteChannel is useful when reading an ArrowRecordBatch from a byte array with ArrowReader. Currently it is in the vector.file test package; this change moves the class to src/main/java/o.a.a.vector.util. Updated test usage.
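(Editorial sketch, not part of the patch: the moved class wraps an in-memory Arrow payload so it can be consumed through the channel-based reader APIs. The byte array and helper below are hypothetical.)

    import org.apache.arrow.vector.util.ByteArrayReadableSeekableByteChannel;

    // Wrap serialized Arrow data held in memory in a seekable channel,
    // so it can be handed to ArrowReader in place of a file channel.
    byte[] payload = readPayloadFromSomewhere();  // hypothetical helper
    ByteArrayReadableSeekableByteChannel channel =
        new ByteArrayReadableSeekableByteChannel(payload);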
Author: Bryan Cutler Closes #370 from BryanCutler/move-ByteArrayReadableSeekableByteChannel-ARROW-615 and squashes the following commits: 46f32a3 [Bryan Cutler] moved ByteArrayReadableSeekableByteChannel to src main o.a.a.vector.util --- .../vector/util}/ByteArrayReadableSeekableByteChannel.java | 2 +- .../org/apache/arrow/vector/file/TestArrowReaderWriter.java | 1 + 2 files changed, 2 insertions(+), 1 deletion(-) rename java/vector/src/{test/java/org/apache/arrow/vector/file => main/java/org/apache/arrow/vector/util}/ByteArrayReadableSeekableByteChannel.java (98%) diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/ByteArrayReadableSeekableByteChannel.java b/java/vector/src/main/java/org/apache/arrow/vector/util/ByteArrayReadableSeekableByteChannel.java similarity index 98% rename from java/vector/src/test/java/org/apache/arrow/vector/file/ByteArrayReadableSeekableByteChannel.java rename to java/vector/src/main/java/org/apache/arrow/vector/util/ByteArrayReadableSeekableByteChannel.java index 7c423d5881aea..69840fefa968b 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/ByteArrayReadableSeekableByteChannel.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/ByteArrayReadableSeekableByteChannel.java @@ -15,7 +15,7 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -package org.apache.arrow.vector.file; +package org.apache.arrow.vector.util; import java.io.IOException; import java.nio.ByteBuffer; diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java index 96bcbb1dae71c..13b04de68fa62 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java @@ -39,6 +39,7 @@ import org.apache.arrow.vector.types.pojo.ArrowType; import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.types.pojo.Schema; +import org.apache.arrow.vector.util.ByteArrayReadableSeekableByteChannel; import org.junit.Before; import org.junit.Test; From d99958dd3de0ac4fd6a99127d62657249c494448 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sat, 11 Mar 2017 16:58:45 -0500 Subject: [PATCH 0354/1644] ARROW-452: [C++/Python] Incorporate C++ and Python codebases for Feather file format The goal for this patch is to provide an eventual migration path for Feather (https://github.com/wesm/feather) users to use the batch and streaming Arrow file formats internally. Eventually the Feather metadata can be deprecated, but we will need to wait for the R community to build and ship Arrow bindings for R before that can happen. In the meantime, we won't need to maintain multiple Python/C++ codebases for the Python side of things. The test suite isn't yet passing because support for timestamps with time zones has not been implemented in the conversion to pandas.DataFrame, so I will do that when I can, but this can be reviewed in the meantime. I would upload a Gerrit code review, but there are some access control settings on gerrit.cloudera.org that need changing. 
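(Editorial sketch, not part of the patch: the end-to-end Python usage this commit enables, using the write_feather and read_feather functions added below; the file name and DataFrame contents are illustrative.)

    import pandas as pd
    import pyarrow.feather as feather

    df = pd.DataFrame({'a': [1, 2, 3]})
    feather.write_feather(df, 'example.feather')    # columns appended one by one
    df_roundtrip = feather.read_feather('example.feather')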
Author: Wes McKinney Closes #361 from wesm/ARROW-452 and squashes the following commits: b7bfd30 [Wes McKinney] Add missing license header 06cbdca [Wes McKinney] Fix -Wconversion error 244959c [Wes McKinney] Mark datetime+tz tests as xfail 9a95094 [Wes McKinney] Incorporate Feather C++ and Python codebases and do associated refactoring to maximize code reuse with IPC reader/writer classes. Get C++ test suite passing. --- cpp/src/arrow/ipc/CMakeLists.txt | 22 +- cpp/src/arrow/ipc/api.h | 1 + cpp/src/arrow/ipc/feather-internal.h | 232 ++++++++ cpp/src/arrow/ipc/feather-test.cc | 437 +++++++++++++++ cpp/src/arrow/ipc/feather.cc | 729 ++++++++++++++++++++++++++ cpp/src/arrow/ipc/feather.fbs | 147 ++++++ cpp/src/arrow/ipc/feather.h | 109 ++++ python/CMakeLists.txt | 1 + python/pyarrow/_feather.pyx | 158 ++++++ python/pyarrow/feather.py | 118 +++++ python/pyarrow/table.pyx | 5 + python/pyarrow/tests/test_feather.py | 379 +++++++++++++ python/setup.py | 1 + python/src/pyarrow/adapters/pandas.cc | 83 ++- 14 files changed, 2394 insertions(+), 28 deletions(-) create mode 100644 cpp/src/arrow/ipc/feather-internal.h create mode 100644 cpp/src/arrow/ipc/feather-test.cc create mode 100644 cpp/src/arrow/ipc/feather.cc create mode 100644 cpp/src/arrow/ipc/feather.fbs create mode 100644 cpp/src/arrow/ipc/feather.h create mode 100644 python/pyarrow/_feather.pyx create mode 100644 python/pyarrow/feather.py create mode 100644 python/pyarrow/tests/test_feather.py diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index 08da0a109c963..09a959ba69b51 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -25,11 +25,12 @@ set(ARROW_IPC_SHARED_LINK_LIBS ) set(ARROW_IPC_TEST_LINK_LIBS - arrow_io_static - arrow_ipc_static) + arrow_ipc_static + arrow_io_static) set(ARROW_IPC_SRCS adapter.cc + feather.cc json.cc json-internal.cc metadata.cc @@ -59,6 +60,10 @@ if(FLATBUFFERS_VENDORED) add_dependencies(arrow_ipc_objlib flatbuffers_ep) endif() +ADD_ARROW_TEST(feather-test) +ARROW_TEST_LINK_LIBRARIES(feather-test + ${ARROW_IPC_TEST_LINK_LIBS}) + ADD_ARROW_TEST(ipc-adapter-test) ARROW_TEST_LINK_LIBRARIES(ipc-adapter-test ${ARROW_IPC_TEST_LINK_LIBS}) @@ -105,14 +110,20 @@ if (ARROW_BUILD_TESTS) endif() # make clean will delete the generated file -set_source_files_properties(Metadata_generated.h PROPERTIES GENERATED TRUE) +set_source_files_properties(Message_generated.h PROPERTIES GENERATED TRUE) +set_source_files_properties(feather_generated.h PROPERTIES GENERATED TRUE) +set_source_files_properties(File_generated.h PROPERTIES GENERATED TRUE) set(OUTPUT_DIR ${CMAKE_SOURCE_DIR}/src/arrow/ipc) -set(FBS_OUTPUT_FILES "${OUTPUT_DIR}/Message_generated.h") +set(FBS_OUTPUT_FILES + "${OUTPUT_DIR}/File_generated.h" + "${OUTPUT_DIR}/Message_generated.h" + "${OUTPUT_DIR}/feather_generated.h") set(FBS_SRC ${CMAKE_SOURCE_DIR}/../format/Message.fbs - ${CMAKE_SOURCE_DIR}/../format/File.fbs) + ${CMAKE_SOURCE_DIR}/../format/File.fbs + ${CMAKE_CURRENT_SOURCE_DIR}/feather.fbs) foreach(FIL ${FBS_SRC}) get_filename_component(ABS_FIL ${FIL} ABSOLUTE) @@ -139,6 +150,7 @@ add_dependencies(arrow_ipc_objlib metadata_fbs) install(FILES adapter.h api.h + feather.h json.h metadata.h reader.h diff --git a/cpp/src/arrow/ipc/api.h b/cpp/src/arrow/ipc/api.h index cb854212bbeee..ad7cd84e9f986 100644 --- a/cpp/src/arrow/ipc/api.h +++ b/cpp/src/arrow/ipc/api.h @@ -19,6 +19,7 @@ #define ARROW_IPC_API_H #include "arrow/ipc/adapter.h" +#include "arrow/ipc/feather.h" #include "arrow/ipc/json.h" #include 
"arrow/ipc/metadata.h" #include "arrow/ipc/reader.h" diff --git a/cpp/src/arrow/ipc/feather-internal.h b/cpp/src/arrow/ipc/feather-internal.h new file mode 100644 index 0000000000000..10b0cfd5d5ea2 --- /dev/null +++ b/cpp/src/arrow/ipc/feather-internal.h @@ -0,0 +1,232 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +/// Public API for the "Feather" file format, originally created at +/// http://github.com/wesm/feather + +#ifndef ARROW_IPC_FEATHER_INTERNAL_H +#define ARROW_IPC_FEATHER_INTERNAL_H + +#include +#include +#include +#include + +#include "flatbuffers/flatbuffers.h" + +#include "arrow/buffer.h" +#include "arrow/ipc/feather.h" +#include "arrow/ipc/feather_generated.h" +#include "arrow/type.h" + +namespace arrow { +namespace ipc { +namespace feather { + +typedef std::vector> ColumnVector; +typedef flatbuffers::FlatBufferBuilder FBB; +typedef flatbuffers::Offset FBString; + +struct ColumnType { + enum type { PRIMITIVE, CATEGORY, TIMESTAMP, DATE, TIME }; +}; + +struct ArrayMetadata { + ArrayMetadata() {} + + ArrayMetadata(fbs::Type type, int64_t offset, int64_t length, int64_t null_count, + int64_t total_bytes) + : type(type), + offset(offset), + length(length), + null_count(null_count), + total_bytes(total_bytes) {} + + bool Equals(const ArrayMetadata& other) const { + return this->type == other.type && this->offset == other.offset && + this->length == other.length && this->null_count == other.null_count && + this->total_bytes == other.total_bytes; + } + + fbs::Type type; + int64_t offset; + int64_t length; + int64_t null_count; + int64_t total_bytes; +}; + +struct CategoryMetadata { + ArrayMetadata levels; + bool ordered; +}; + +struct TimestampMetadata { + TimeUnit unit; + + // A timezone name known to the Olson timezone database. 
For display purposes + // because the actual data is all UTC + std::string timezone; +}; + +struct TimeMetadata { + TimeUnit unit; +}; + +static constexpr const char* kFeatherMagicBytes = "FEA1"; +static constexpr const int kFeatherDefaultAlignment = 8; + +class ColumnBuilder; + +class TableBuilder { + public: + explicit TableBuilder(int64_t num_rows); + ~TableBuilder() = default; + + FBB& fbb(); + Status Finish(); + std::shared_ptr GetBuffer() const; + + std::unique_ptr AddColumn(const std::string& name); + void SetDescription(const std::string& description); + void SetNumRows(int64_t num_rows); + void add_column(const flatbuffers::Offset& col); + + private: + flatbuffers::FlatBufferBuilder fbb_; + ColumnVector columns_; + + friend class ColumnBuilder; + + bool finished_; + std::string description_; + int64_t num_rows_; +}; + +class TableMetadata { + public: + TableMetadata() {} + ~TableMetadata() = default; + + Status Open(const std::shared_ptr& buffer) { + metadata_buffer_ = buffer; + table_ = fbs::GetCTable(buffer->data()); + + if (table_->version() < kFeatherVersion) { + std::cout << "This Feather file is old" + << " and will not be readable beyond the 0.3.0 release" << std::endl; + } + return Status::OK(); + } + + bool HasDescription() const { return table_->description() != 0; } + + std::string GetDescription() const { + if (!HasDescription()) { return std::string(""); } + return table_->description()->str(); + } + + int version() const { return table_->version(); } + int64_t num_rows() const { return table_->num_rows(); } + int64_t num_columns() const { return table_->columns()->size(); } + + const fbs::Column* column(int i) { return table_->columns()->Get(i); } + + private: + std::shared_ptr metadata_buffer_; + const fbs::CTable* table_; +}; + +static inline flatbuffers::Offset GetPrimitiveArray( + FBB& fbb, const ArrayMetadata& array) { + return fbs::CreatePrimitiveArray(fbb, array.type, fbs::Encoding_PLAIN, array.offset, + array.length, array.null_count, array.total_bytes); +} + +static inline fbs::TimeUnit ToFlatbufferEnum(TimeUnit unit) { + return static_cast(static_cast(unit)); +} + +static inline TimeUnit FromFlatbufferEnum(fbs::TimeUnit unit) { + return static_cast(static_cast(unit)); +} + +// Convert Feather enums to Flatbuffer enums + +const fbs::TypeMetadata COLUMN_TYPE_ENUM_MAPPING[] = { + fbs::TypeMetadata_NONE, // PRIMITIVE + fbs::TypeMetadata_CategoryMetadata, // CATEGORY + fbs::TypeMetadata_TimestampMetadata, // TIMESTAMP + fbs::TypeMetadata_DateMetadata, // DATE + fbs::TypeMetadata_TimeMetadata // TIME +}; + +static inline fbs::TypeMetadata ToFlatbufferEnum(ColumnType::type column_type) { + return COLUMN_TYPE_ENUM_MAPPING[column_type]; +} + +static inline void FromFlatbuffer(const fbs::PrimitiveArray* values, ArrayMetadata* out) { + out->type = values->type(); + out->offset = values->offset(); + out->length = values->length(); + out->null_count = values->null_count(); + out->total_bytes = values->total_bytes(); +} + +class ColumnBuilder { + public: + ColumnBuilder(TableBuilder* parent, const std::string& name); + ~ColumnBuilder() = default; + + flatbuffers::Offset CreateColumnMetadata(); + + Status Finish(); + void SetValues(const ArrayMetadata& values); + void SetUserMetadata(const std::string& data); + void SetCategory(const ArrayMetadata& levels, bool ordered = false); + void SetTimestamp(TimeUnit unit); + void SetTimestamp(TimeUnit unit, const std::string& timezone); + void SetDate(); + void SetTime(TimeUnit unit); + FBB& fbb(); + + private: + TableBuilder* 
parent_; + + std::string name_; + ArrayMetadata values_; + std::string user_metadata_; + + // Column metadata + + // Is this a primitive type, or one of the types having metadata? Default is + // primitive + ColumnType::type type_; + + // Type-specific metadata union + CategoryMetadata meta_category_; + TimeMetadata meta_time_; + + TimestampMetadata meta_timestamp_; + + FBB* fbb_; +}; + +} // namespace feather +} // namespace ipc +} // namespace arrow + +#endif // ARROW_IPC_FEATHER_INTERNAL_H diff --git a/cpp/src/arrow/ipc/feather-test.cc b/cpp/src/arrow/ipc/feather-test.cc new file mode 100644 index 0000000000000..b73246b672260 --- /dev/null +++ b/cpp/src/arrow/ipc/feather-test.cc @@ -0,0 +1,437 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include +#include +#include +#include +#include +#include + +#include "gtest/gtest.h" + +#include "arrow/io/memory.h" +#include "arrow/ipc/feather-internal.h" +#include "arrow/ipc/feather.h" +#include "arrow/ipc/test-common.h" +#include "arrow/loader.h" +#include "arrow/pretty_print.h" +#include "arrow/test-util.h" + +namespace arrow { +namespace ipc { +namespace feather { + +template +inline void assert_vector_equal(const std::vector& left, const std::vector& right) { + ASSERT_EQ(left.size(), right.size()); + + for (size_t i = 0; i < left.size(); ++i) { + ASSERT_EQ(left[i], right[i]) << i; + } +} + +class TestTableBuilder : public ::testing::Test { + public: + void SetUp() { tb_.reset(new TableBuilder(1000)); } + + virtual void Finish() { + tb_->Finish(); + + table_.reset(new TableMetadata()); + ASSERT_OK(table_->Open(tb_->GetBuffer())); + } + + protected: + std::unique_ptr tb_; + std::unique_ptr table_; +}; + +TEST_F(TestTableBuilder, Version) { + Finish(); + ASSERT_EQ(kFeatherVersion, table_->version()); +} + +TEST_F(TestTableBuilder, EmptyTable) { + Finish(); + + ASSERT_FALSE(table_->HasDescription()); + ASSERT_EQ("", table_->GetDescription()); + ASSERT_EQ(1000, table_->num_rows()); + ASSERT_EQ(0, table_->num_columns()); +} + +TEST_F(TestTableBuilder, SetDescription) { + std::string desc("this is some good data"); + tb_->SetDescription(desc); + Finish(); + ASSERT_TRUE(table_->HasDescription()); + ASSERT_EQ(desc, table_->GetDescription()); +} + +void AssertArrayEquals(const ArrayMetadata& left, const ArrayMetadata& right) { + EXPECT_EQ(left.type, right.type); + EXPECT_EQ(left.offset, right.offset); + EXPECT_EQ(left.length, right.length); + EXPECT_EQ(left.null_count, right.null_count); + EXPECT_EQ(left.total_bytes, right.total_bytes); +} + +TEST_F(TestTableBuilder, AddPrimitiveColumn) { + std::unique_ptr cb = tb_->AddColumn("f0"); + + ArrayMetadata values1; + ArrayMetadata values2; + values1.type = fbs::Type_INT32; + values1.offset = 10000; + values1.length = 1000; + values1.null_count = 100; 
+ values1.total_bytes = 4000; + + cb->SetValues(values1); + + std::string user_meta = "as you wish"; + cb->SetUserMetadata(user_meta); + + cb->Finish(); + + cb = tb_->AddColumn("f1"); + + values2.type = fbs::Type_UTF8; + values2.offset = 14000; + values2.length = 1000; + values2.null_count = 100; + values2.total_bytes = 10000; + + cb->SetValues(values2); + cb->Finish(); + + Finish(); + + ASSERT_EQ(2, table_->num_columns()); + + auto col = table_->column(0); + + ASSERT_EQ("f0", col->name()->str()); + ASSERT_EQ(user_meta, col->user_metadata()->str()); + + ArrayMetadata values3; + FromFlatbuffer(col->values(), &values3); + AssertArrayEquals(values3, values1); + + col = table_->column(1); + ASSERT_EQ("f1", col->name()->str()); + + ArrayMetadata values4; + FromFlatbuffer(col->values(), &values4); + AssertArrayEquals(values4, values2); +} + +TEST_F(TestTableBuilder, AddCategoryColumn) { + ArrayMetadata values1(fbs::Type_UINT8, 10000, 1000, 100, 4000); + ArrayMetadata levels(fbs::Type_UTF8, 14000, 10, 0, 300); + + std::unique_ptr cb = tb_->AddColumn("c0"); + cb->SetValues(values1); + cb->SetCategory(levels); + cb->Finish(); + + cb = tb_->AddColumn("c1"); + cb->SetValues(values1); + cb->SetCategory(levels, true); + cb->Finish(); + + Finish(); + + auto col = table_->column(0); + ASSERT_EQ(fbs::TypeMetadata_CategoryMetadata, col->metadata_type()); + + ArrayMetadata result; + FromFlatbuffer(col->values(), &result); + AssertArrayEquals(result, values1); + + auto cat_ptr = static_cast(col->metadata()); + ASSERT_FALSE(cat_ptr->ordered()); + + FromFlatbuffer(cat_ptr->levels(), &result); + AssertArrayEquals(result, levels); + + col = table_->column(1); + cat_ptr = static_cast(col->metadata()); + ASSERT_TRUE(cat_ptr->ordered()); + FromFlatbuffer(cat_ptr->levels(), &result); + AssertArrayEquals(result, levels); +} + +TEST_F(TestTableBuilder, AddTimestampColumn) { + ArrayMetadata values1(fbs::Type_INT64, 10000, 1000, 100, 4000); + std::unique_ptr cb = tb_->AddColumn("c0"); + cb->SetValues(values1); + cb->SetTimestamp(TimeUnit::MILLI); + cb->Finish(); + + cb = tb_->AddColumn("c1"); + + std::string tz("America/Los_Angeles"); + + cb->SetValues(values1); + cb->SetTimestamp(TimeUnit::SECOND, tz); + cb->Finish(); + + Finish(); + + auto col = table_->column(0); + + ASSERT_EQ(fbs::TypeMetadata_TimestampMetadata, col->metadata_type()); + + ArrayMetadata result; + FromFlatbuffer(col->values(), &result); + AssertArrayEquals(result, values1); + + auto ts_ptr = static_cast(col->metadata()); + ASSERT_EQ(fbs::TimeUnit_MILLISECOND, ts_ptr->unit()); + + col = table_->column(1); + ts_ptr = static_cast(col->metadata()); + ASSERT_EQ(fbs::TimeUnit_SECOND, ts_ptr->unit()); + ASSERT_EQ(tz, ts_ptr->timezone()->str()); +} + +TEST_F(TestTableBuilder, AddDateColumn) { + ArrayMetadata values1(fbs::Type_INT64, 10000, 1000, 100, 4000); + std::unique_ptr cb = tb_->AddColumn("d0"); + cb->SetValues(values1); + cb->SetDate(); + cb->Finish(); + + Finish(); + + auto col = table_->column(0); + + ASSERT_EQ(fbs::TypeMetadata_DateMetadata, col->metadata_type()); + ArrayMetadata result; + FromFlatbuffer(col->values(), &result); + AssertArrayEquals(result, values1); +} + +TEST_F(TestTableBuilder, AddTimeColumn) { + ArrayMetadata values1(fbs::Type_INT64, 10000, 1000, 100, 4000); + std::unique_ptr cb = tb_->AddColumn("c0"); + cb->SetValues(values1); + cb->SetTime(TimeUnit::SECOND); + cb->Finish(); + Finish(); + + auto col = table_->column(0); + + ASSERT_EQ(fbs::TypeMetadata_TimeMetadata, col->metadata_type()); + ArrayMetadata result; + 
FromFlatbuffer(col->values(), &result); + AssertArrayEquals(result, values1); + + auto t_ptr = static_cast(col->metadata()); + ASSERT_EQ(fbs::TimeUnit_SECOND, t_ptr->unit()); +} + +void CheckArrays(const Array& expected, const Array& result) { + if (!result.Equals(expected)) { + std::stringstream pp_result; + std::stringstream pp_expected; + + EXPECT_OK(PrettyPrint(result, 0, &pp_result)); + EXPECT_OK(PrettyPrint(expected, 0, &pp_expected)); + FAIL() << "Got: " << pp_result.str() << "\nExpected: " << pp_expected.str(); + } +} + +class TestTableWriter : public ::testing::Test { + public: + void SetUp() { + ASSERT_OK(io::BufferOutputStream::Create(1024, default_memory_pool(), &stream_)); + ASSERT_OK(TableWriter::Open(stream_, &writer_)); + } + + void Finish() { + // Write table footer + ASSERT_OK(writer_->Finalize()); + + ASSERT_OK(stream_->Finish(&output_)); + + std::shared_ptr buffer(new io::BufferReader(output_)); + reader_.reset(new TableReader()); + ASSERT_OK(reader_->Open(buffer)); + } + + void CheckBatch(const RecordBatch& batch) { + for (int i = 0; i < batch.num_columns(); ++i) { + ASSERT_OK(writer_->Append(batch.column_name(i), *batch.column(i))); + } + Finish(); + + std::shared_ptr col; + for (int i = 0; i < batch.num_columns(); ++i) { + ASSERT_OK(reader_->GetColumn(i, &col)); + ASSERT_EQ(batch.column_name(i), col->name()); + CheckArrays(*batch.column(i), *col->data()->chunk(0)); + } + } + + protected: + std::shared_ptr stream_; + std::unique_ptr writer_; + std::unique_ptr reader_; + + std::shared_ptr output_; +}; + +TEST_F(TestTableWriter, EmptyTable) { + Finish(); + + ASSERT_FALSE(reader_->HasDescription()); + ASSERT_EQ("", reader_->GetDescription()); + + ASSERT_EQ(0, reader_->num_rows()); + ASSERT_EQ(0, reader_->num_columns()); +} + +TEST_F(TestTableWriter, SetNumRows) { + writer_->SetNumRows(1000); + Finish(); + ASSERT_EQ(1000, reader_->num_rows()); +} + +TEST_F(TestTableWriter, SetDescription) { + std::string desc("contents of the file"); + writer_->SetDescription(desc); + Finish(); + + ASSERT_TRUE(reader_->HasDescription()); + ASSERT_EQ(desc, reader_->GetDescription()); + + ASSERT_EQ(0, reader_->num_rows()); + ASSERT_EQ(0, reader_->num_columns()); +} + +TEST_F(TestTableWriter, PrimitiveRoundTrip) { + std::shared_ptr batch; + ASSERT_OK(MakeIntRecordBatch(&batch)); + + ASSERT_OK(writer_->Append("f0", *batch->column(0))); + ASSERT_OK(writer_->Append("f1", *batch->column(1))); + Finish(); + + std::shared_ptr col; + ASSERT_OK(reader_->GetColumn(0, &col)); + ASSERT_TRUE(col->data()->chunk(0)->Equals(batch->column(0))); + ASSERT_EQ("f0", col->name()); + + ASSERT_OK(reader_->GetColumn(1, &col)); + ASSERT_TRUE(col->data()->chunk(0)->Equals(batch->column(1))); + ASSERT_EQ("f1", col->name()); +} + +Status MakeDictionaryFlat(std::shared_ptr* out) { + const int64_t length = 6; + + std::vector is_valid = {true, true, false, true, true, true}; + std::shared_ptr dict1, dict2; + + std::vector dict1_values = {"foo", "bar", "baz"}; + std::vector dict2_values = {"foo", "bar", "baz", "qux"}; + + ArrayFromVector(dict1_values, &dict1); + ArrayFromVector(dict2_values, &dict2); + + auto f0_type = arrow::dictionary(arrow::int32(), dict1); + auto f1_type = arrow::dictionary(arrow::int8(), dict1); + auto f2_type = arrow::dictionary(arrow::int32(), dict2); + + std::shared_ptr indices0, indices1, indices2; + std::vector indices0_values = {1, 2, -1, 0, 2, 0}; + std::vector indices1_values = {0, 0, 2, 2, 1, 1}; + std::vector indices2_values = {3, 0, 2, 1, 0, 2}; + + ArrayFromVector(is_valid, 
indices0_values, &indices0); + ArrayFromVector(is_valid, indices1_values, &indices1); + ArrayFromVector(is_valid, indices2_values, &indices2); + + auto a0 = std::make_shared(f0_type, indices0); + auto a1 = std::make_shared(f1_type, indices1); + auto a2 = std::make_shared(f2_type, indices2); + + // construct batch + std::shared_ptr schema(new Schema( + {field("dict1", f0_type), field("sparse", f1_type), field("dense", f2_type)})); + + std::vector> arrays = {a0, a1, a2}; + out->reset(new RecordBatch(schema, length, arrays)); + return Status::OK(); +} + +TEST_F(TestTableWriter, CategoryRoundtrip) { + std::shared_ptr batch; + ASSERT_OK(MakeDictionaryFlat(&batch)); + CheckBatch(*batch); +} + +TEST_F(TestTableWriter, TimeTypes) { + std::vector is_valid = {true, true, true, false, true, true, true}; + auto f0 = field("f0", date32()); + auto f1 = field("f1", time(TimeUnit::MILLI)); + auto f2 = field("f2", timestamp(TimeUnit::NANO)); + auto f3 = field("f3", timestamp("US/Los_Angeles", TimeUnit::SECOND)); + std::shared_ptr schema(new Schema({f0, f1, f2, f3})); + + std::vector values_vec = {0, 1, 2, 3, 4, 5, 6}; + std::shared_ptr values; + ArrayFromVector(is_valid, values_vec, &values); + + std::vector date_values_vec = {0, 1, 2, 3, 4, 5, 6}; + std::shared_ptr date_array; + ArrayFromVector(is_valid, date_values_vec, &date_array); + + std::vector fields(1); + fields[0].length = values->length(); + fields[0].null_count = values->null_count(); + fields[0].offset = 0; + + const auto& prim_values = static_cast(*values); + std::vector> buffers = { + prim_values.null_bitmap(), prim_values.data()}; + + std::vector> arrays; + arrays.push_back(date_array); + + for (int i = 1; i < schema->num_fields(); ++i) { + std::shared_ptr arr; + LoadArray(schema->field(i)->type, fields, buffers, &arr); + arrays.push_back(arr); + } + + RecordBatch batch(schema, values->length(), arrays); + CheckBatch(batch); +} + +TEST_F(TestTableWriter, VLenPrimitiveRoundTrip) { + std::shared_ptr batch; + ASSERT_OK(MakeStringTypesRecordBatch(&batch)); + CheckBatch(*batch); +} + +} // namespace feather +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/feather.cc b/cpp/src/arrow/ipc/feather.cc new file mode 100644 index 0000000000000..13dfa5830f1bf --- /dev/null +++ b/cpp/src/arrow/ipc/feather.cc @@ -0,0 +1,729 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
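+// Editorial annotation, not part of the original patch: the writer below
+// produces the layout "FEA1" magic string, 8-byte-padded column buffers,
+// the flatbuffer metadata block, then a footer holding the uint32 metadata
+// size followed by "FEA1" again. The reader verifies both magic strings
+// before trusting the metadata length.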
+ +#include "arrow/ipc/feather.h" + +#include +#include +#include +#include +#include +#include +#include +#include + +#include "flatbuffers/flatbuffers.h" + +#include "arrow/array.h" +#include "arrow/buffer.h" +#include "arrow/column.h" +#include "arrow/io/file.h" +#include "arrow/ipc/feather-internal.h" +#include "arrow/ipc/feather_generated.h" +#include "arrow/loader.h" +#include "arrow/status.h" +#include "arrow/util/bit-util.h" + +namespace arrow { +namespace ipc { +namespace feather { + +static const uint8_t kPaddingBytes[kFeatherDefaultAlignment] = {0}; + +static inline int64_t PaddedLength(int64_t nbytes) { + static const int64_t alignment = kFeatherDefaultAlignment; + return ((nbytes + alignment - 1) / alignment) * alignment; +} + +// XXX: Hack for Feather 0.3.0 for backwards compatibility with old files +// Size in-file of written byte buffer +static int64_t GetOutputLength(int64_t nbytes) { + if (kFeatherVersion < 2) { + // Feather files < 0.3.0 + return nbytes; + } else { + return PaddedLength(nbytes); + } +} + +static Status WritePadded(io::OutputStream* stream, const uint8_t* data, int64_t length, + int64_t* bytes_written) { + RETURN_NOT_OK(stream->Write(data, length)); + + int64_t remainder = PaddedLength(length) - length; + if (remainder != 0) { RETURN_NOT_OK(stream->Write(kPaddingBytes, remainder)); } + *bytes_written = length + remainder; + return Status::OK(); +} + +// ---------------------------------------------------------------------- +// TableBuilder + +TableBuilder::TableBuilder(int64_t num_rows) : finished_(false), num_rows_(num_rows) {} + +FBB& TableBuilder::fbb() { + return fbb_; +} + +Status TableBuilder::Finish() { + if (finished_) { return Status::Invalid("can only call this once"); } + + FBString desc = 0; + if (!description_.empty()) { desc = fbb_.CreateString(description_); } + + flatbuffers::Offset metadata = 0; + + auto root = fbs::CreateCTable( + fbb_, desc, num_rows_, fbb_.CreateVector(columns_), kFeatherVersion, metadata); + fbb_.Finish(root); + finished_ = true; + + return Status::OK(); +} + +std::shared_ptr TableBuilder::GetBuffer() const { + return std::make_shared( + fbb_.GetBufferPointer(), static_cast(fbb_.GetSize())); +} + +void TableBuilder::SetDescription(const std::string& description) { + description_ = description; +} + +void TableBuilder::SetNumRows(int64_t num_rows) { + num_rows_ = num_rows; +} + +void TableBuilder::add_column(const flatbuffers::Offset& col) { + columns_.push_back(col); +} + +ColumnBuilder::ColumnBuilder(TableBuilder* parent, const std::string& name) + : parent_(parent) { + fbb_ = &parent->fbb(); + name_ = name; + type_ = ColumnType::PRIMITIVE; +} + +flatbuffers::Offset ColumnBuilder::CreateColumnMetadata() { + switch (type_) { + case ColumnType::PRIMITIVE: + // flatbuffer void + return 0; + case ColumnType::CATEGORY: { + auto cat_meta = fbs::CreateCategoryMetadata( + fbb(), GetPrimitiveArray(fbb(), meta_category_.levels), meta_category_.ordered); + return cat_meta.Union(); + } + case ColumnType::TIMESTAMP: { + // flatbuffer void + flatbuffers::Offset tz = 0; + if (!meta_timestamp_.timezone.empty()) { + tz = fbb().CreateString(meta_timestamp_.timezone); + } + + auto ts_meta = + fbs::CreateTimestampMetadata(fbb(), ToFlatbufferEnum(meta_timestamp_.unit), tz); + return ts_meta.Union(); + } + case ColumnType::DATE: { + auto date_meta = fbs::CreateDateMetadata(fbb()); + return date_meta.Union(); + } + case ColumnType::TIME: { + auto time_meta = fbs::CreateTimeMetadata(fbb(), ToFlatbufferEnum(meta_time_.unit)); + return 
time_meta.Union(); + } + default: + // null + return flatbuffers::Offset(); + } +} + +Status ColumnBuilder::Finish() { + FBB& buf = fbb(); + + // values + auto values = GetPrimitiveArray(buf, values_); + flatbuffers::Offset metadata = CreateColumnMetadata(); + + auto column = fbs::CreateColumn(buf, buf.CreateString(name_), values, + ToFlatbufferEnum(type_), // metadata_type + metadata, buf.CreateString(user_metadata_)); + + // bad coupling, but OK for now + parent_->add_column(column); + return Status::OK(); +} + +void ColumnBuilder::SetValues(const ArrayMetadata& values) { + values_ = values; +} + +void ColumnBuilder::SetUserMetadata(const std::string& data) { + user_metadata_ = data; +} + +void ColumnBuilder::SetCategory(const ArrayMetadata& levels, bool ordered) { + type_ = ColumnType::CATEGORY; + meta_category_.levels = levels; + meta_category_.ordered = ordered; +} + +void ColumnBuilder::SetTimestamp(TimeUnit unit) { + type_ = ColumnType::TIMESTAMP; + meta_timestamp_.unit = unit; +} + +void ColumnBuilder::SetTimestamp(TimeUnit unit, const std::string& timezone) { + SetTimestamp(unit); + meta_timestamp_.timezone = timezone; +} + +void ColumnBuilder::SetDate() { + type_ = ColumnType::DATE; +} + +void ColumnBuilder::SetTime(TimeUnit unit) { + type_ = ColumnType::TIME; + meta_time_.unit = unit; +} + +FBB& ColumnBuilder::fbb() { + return *fbb_; +} + +std::unique_ptr TableBuilder::AddColumn(const std::string& name) { + return std::unique_ptr(new ColumnBuilder(this, name)); +} + +// ---------------------------------------------------------------------- +// reader.cc + +class TableReader::TableReaderImpl { + public: + TableReaderImpl() {} + + Status Open(const std::shared_ptr& source) { + source_ = source; + + int magic_size = static_cast(strlen(kFeatherMagicBytes)); + int footer_size = magic_size + static_cast(sizeof(uint32_t)); + + // Pathological issue where the file is smaller than + int64_t size = 0; + RETURN_NOT_OK(source->GetSize(&size)); + if (size < magic_size + footer_size) { + return Status::Invalid("File is too small to be a well-formed file"); + } + + std::shared_ptr buffer; + RETURN_NOT_OK(source->Read(magic_size, &buffer)); + + if (memcmp(buffer->data(), kFeatherMagicBytes, magic_size)) { + return Status::Invalid("Not a feather file"); + } + + // Now get the footer and verify + RETURN_NOT_OK(source->ReadAt(size - footer_size, footer_size, &buffer)); + + if (memcmp(buffer->data() + sizeof(uint32_t), kFeatherMagicBytes, magic_size)) { + return Status::Invalid("Feather file footer incomplete"); + } + + uint32_t metadata_length = *reinterpret_cast(buffer->data()); + if (size < magic_size + footer_size + metadata_length) { + return Status::Invalid("File is smaller than indicated metadata size"); + } + RETURN_NOT_OK( + source->ReadAt(size - footer_size - metadata_length, metadata_length, &buffer)); + + metadata_.reset(new TableMetadata()); + return metadata_->Open(buffer); + } + + Status GetDataType(const fbs::PrimitiveArray* values, fbs::TypeMetadata metadata_type, + const void* metadata, std::shared_ptr* out) { +#define PRIMITIVE_CASE(CAP_TYPE, FACTORY_FUNC) \ + case fbs::Type_##CAP_TYPE: \ + *out = FACTORY_FUNC(); \ + break; + + switch (metadata_type) { + case fbs::TypeMetadata_CategoryMetadata: { + auto meta = static_cast(metadata); + + std::shared_ptr index_type; + RETURN_NOT_OK(GetDataType(values, fbs::TypeMetadata_NONE, nullptr, &index_type)); + + std::shared_ptr levels; + RETURN_NOT_OK( + LoadValues(meta->levels(), fbs::TypeMetadata_NONE, nullptr, &levels)); + + *out = 
std::make_shared(index_type, levels, meta->ordered()); + break; + } + case fbs::TypeMetadata_TimestampMetadata: { + auto meta = static_cast(metadata); + TimeUnit unit = FromFlatbufferEnum(meta->unit()); + std::string tz; + // flatbuffer non-null + if (meta->timezone() != 0) { + tz = meta->timezone()->str(); + } else { + tz = ""; + } + *out = std::make_shared(tz, unit); + } break; + case fbs::TypeMetadata_DateMetadata: + *out = date32(); + break; + case fbs::TypeMetadata_TimeMetadata: { + auto meta = static_cast(metadata); + *out = std::make_shared(FromFlatbufferEnum(meta->unit())); + } break; + default: + switch (values->type()) { + PRIMITIVE_CASE(BOOL, boolean); + PRIMITIVE_CASE(INT8, int8); + PRIMITIVE_CASE(INT16, int16); + PRIMITIVE_CASE(INT32, int32); + PRIMITIVE_CASE(INT64, int64); + PRIMITIVE_CASE(UINT8, uint8); + PRIMITIVE_CASE(UINT16, uint16); + PRIMITIVE_CASE(UINT32, uint32); + PRIMITIVE_CASE(UINT64, uint64); + PRIMITIVE_CASE(FLOAT, float32); + PRIMITIVE_CASE(DOUBLE, float64); + PRIMITIVE_CASE(UTF8, utf8); + PRIMITIVE_CASE(BINARY, binary); + default: + return Status::Invalid("Unrecognized type"); + } + break; + } + +#undef PRIMITIVE_CASE + + return Status::OK(); + } + + // Retrieve a primitive array from the data source + // + // @returns: a Buffer instance, the precise type will depend on the kind of + // input data source (which may or may not have memory-map like semantics) + Status LoadValues(const fbs::PrimitiveArray* meta, fbs::TypeMetadata metadata_type, + const void* metadata, std::shared_ptr* out) { + std::shared_ptr type; + RETURN_NOT_OK(GetDataType(meta, metadata_type, metadata, &type)); + + std::vector fields(1); + std::vector> buffers; + + // Buffer data from the source (may or may not perform a copy depending on + // input source) + std::shared_ptr buffer; + RETURN_NOT_OK(source_->ReadAt(meta->offset(), meta->total_bytes(), &buffer)); + + int64_t offset = 0; + + // If there are nulls, the null bitmask is first + if (meta->null_count() > 0) { + int64_t null_bitmap_size = GetOutputLength(BitUtil::BytesForBits(meta->length())); + buffers.push_back(SliceBuffer(buffer, offset, null_bitmap_size)); + offset += null_bitmap_size; + } else { + buffers.push_back(nullptr); + } + + if (is_binary_like(type->type)) { + int64_t offsets_size = GetOutputLength((meta->length() + 1) * sizeof(int32_t)); + buffers.push_back(SliceBuffer(buffer, offset, offsets_size)); + offset += offsets_size; + } + + buffers.push_back(SliceBuffer(buffer, offset, buffer->size() - offset)); + + fields[0].length = meta->length(); + fields[0].null_count = meta->null_count(); + fields[0].offset = 0; + + return LoadArray(type, fields, buffers, out); + } + + bool HasDescription() const { return metadata_->HasDescription(); } + + std::string GetDescription() const { return metadata_->GetDescription(); } + + int version() const { return metadata_->version(); } + int64_t num_rows() const { return metadata_->num_rows(); } + int64_t num_columns() const { return metadata_->num_columns(); } + + std::string GetColumnName(int i) const { + const fbs::Column* col_meta = metadata_->column(i); + return col_meta->name()->str(); + } + + Status GetColumn(int i, std::shared_ptr* out) { + const fbs::Column* col_meta = metadata_->column(i); + + // auto user_meta = column->user_metadata(); + // if (user_meta->size() > 0) { user_metadata_ = user_meta->str(); } + + std::shared_ptr values; + RETURN_NOT_OK(LoadValues( + col_meta->values(), col_meta->metadata_type(), col_meta->metadata(), &values)); + out->reset(new 
Column(col_meta->name()->str(), values)); + return Status::OK(); + } + + private: + std::shared_ptr source_; + std::unique_ptr metadata_; + + std::shared_ptr schema_; +}; + +// ---------------------------------------------------------------------- +// TableReader public API + +TableReader::TableReader() { + impl_.reset(new TableReaderImpl()); +} + +TableReader::~TableReader() {} + +Status TableReader::Open(const std::shared_ptr& source) { + return impl_->Open(source); +} + +Status TableReader::OpenFile( + const std::string& abspath, std::unique_ptr* out) { + std::shared_ptr file; + RETURN_NOT_OK(io::MemoryMappedFile::Open(abspath, io::FileMode::READ, &file)); + out->reset(new TableReader()); + return (*out)->Open(file); +} + +bool TableReader::HasDescription() const { + return impl_->HasDescription(); +} + +std::string TableReader::GetDescription() const { + return impl_->GetDescription(); +} + +int TableReader::version() const { + return impl_->version(); +} + +int64_t TableReader::num_rows() const { + return impl_->num_rows(); +} + +int64_t TableReader::num_columns() const { + return impl_->num_columns(); +} + +std::string TableReader::GetColumnName(int i) const { + return impl_->GetColumnName(i); +} + +Status TableReader::GetColumn(int i, std::shared_ptr* out) { + return impl_->GetColumn(i, out); +} + +// ---------------------------------------------------------------------- +// writer.cc + +fbs::Type ToFlatbufferType(Type::type type) { + switch (type) { + case Type::BOOL: + return fbs::Type_BOOL; + case Type::INT8: + return fbs::Type_INT8; + case Type::INT16: + return fbs::Type_INT16; + case Type::INT32: + return fbs::Type_INT32; + case Type::INT64: + return fbs::Type_INT64; + case Type::UINT8: + return fbs::Type_UINT8; + case Type::UINT16: + return fbs::Type_UINT16; + case Type::UINT32: + return fbs::Type_UINT32; + case Type::UINT64: + return fbs::Type_UINT64; + case Type::FLOAT: + return fbs::Type_FLOAT; + case Type::DOUBLE: + return fbs::Type_DOUBLE; + case Type::STRING: + return fbs::Type_UTF8; + case Type::BINARY: + return fbs::Type_BINARY; + case Type::DATE32: + return fbs::Type_DATE; + case Type::TIMESTAMP: + return fbs::Type_TIMESTAMP; + case Type::TIME: + return fbs::Type_TIME; + case Type::DICTIONARY: + return fbs::Type_CATEGORY; + default: + break; + } + // prevent compiler warning + return fbs::Type_MIN; +} + +class TableWriter::TableWriterImpl : public ArrayVisitor { + public: + TableWriterImpl() : initialized_stream_(false), metadata_(0) {} + + Status Open(const std::shared_ptr& stream) { + stream_ = stream; + return Status::OK(); + } + + void SetDescription(const std::string& desc) { metadata_.SetDescription(desc); } + + void SetNumRows(int64_t num_rows) { metadata_.SetNumRows(num_rows); } + + Status Finalize() { + RETURN_NOT_OK(CheckStarted()); + metadata_.Finish(); + + auto buffer = metadata_.GetBuffer(); + + // Writer metadata + int64_t bytes_written; + RETURN_NOT_OK( + WritePadded(stream_.get(), buffer->data(), buffer->size(), &bytes_written)); + uint32_t buffer_size = static_cast(bytes_written); + + // Footer: metadata length, magic bytes + RETURN_NOT_OK( + stream_->Write(reinterpret_cast(&buffer_size), sizeof(uint32_t))); + RETURN_NOT_OK(stream_->Write(reinterpret_cast(kFeatherMagicBytes), + strlen(kFeatherMagicBytes))); + return stream_->Close(); + } + + Status LoadArrayMetadata(const Array& values, ArrayMetadata* meta) { + if (!(is_primitive(values.type_enum()) || is_binary_like(values.type_enum()))) { + std::stringstream ss; + ss << "Array is not primitive type: 
" << values.type()->ToString(); + return Status::Invalid(ss.str()); + } + + meta->type = ToFlatbufferType(values.type_enum()); + + RETURN_NOT_OK(stream_->Tell(&meta->offset)); + + meta->length = values.length(); + meta->null_count = values.null_count(); + meta->total_bytes = 0; + + return Status::OK(); + } + + Status WriteArray(const Array& values, ArrayMetadata* meta) { + RETURN_NOT_OK(CheckStarted()); + RETURN_NOT_OK(LoadArrayMetadata(values, meta)); + + int64_t bytes_written; + + // Write the null bitmask + if (values.null_count() > 0) { + // We assume there is one bit for each value in values.nulls, aligned on a + // byte boundary, and we write this much data into the stream + RETURN_NOT_OK(WritePadded(stream_.get(), values.null_bitmap()->data(), + values.null_bitmap()->size(), &bytes_written)); + meta->total_bytes += bytes_written; + } + + int64_t values_bytes = 0; + + const uint8_t* values_buffer = nullptr; + + if (is_binary_like(values.type_enum())) { + const auto& bin_values = static_cast(values); + + int64_t offset_bytes = sizeof(int32_t) * (values.length() + 1); + + values_bytes = bin_values.raw_value_offsets()[values.length()]; + + // Write the variable-length offsets + RETURN_NOT_OK(WritePadded(stream_.get(), + reinterpret_cast(bin_values.raw_value_offsets()), offset_bytes, + &bytes_written)) + meta->total_bytes += bytes_written; + + if (bin_values.data()) { values_buffer = bin_values.data()->data(); } + } else { + const auto& prim_values = static_cast(values); + const auto& fw_type = static_cast(*values.type()); + + if (values.type_enum() == Type::BOOL) { + // Booleans are bit-packed + values_bytes = BitUtil::BytesForBits(values.length()); + } else { + values_bytes = values.length() * fw_type.bit_width() / 8; + } + + if (prim_values.data()) { values_buffer = prim_values.data()->data(); } + } + RETURN_NOT_OK( + WritePadded(stream_.get(), values_buffer, values_bytes, &bytes_written)); + meta->total_bytes += bytes_written; + + return Status::OK(); + } + + Status WritePrimitiveValues(const Array& values) { + // Prepare metadata payload + ArrayMetadata meta; + RETURN_NOT_OK(WriteArray(values, &meta)); + current_column_->SetValues(meta); + return Status::OK(); + } + +#define VISIT_PRIMITIVE(TYPE) \ + Status Visit(const TYPE& values) override { return WritePrimitiveValues(values); } + + VISIT_PRIMITIVE(BooleanArray); + VISIT_PRIMITIVE(Int8Array); + VISIT_PRIMITIVE(Int16Array); + VISIT_PRIMITIVE(Int32Array); + VISIT_PRIMITIVE(Int64Array); + VISIT_PRIMITIVE(UInt8Array); + VISIT_PRIMITIVE(UInt16Array); + VISIT_PRIMITIVE(UInt32Array); + VISIT_PRIMITIVE(UInt64Array); + VISIT_PRIMITIVE(FloatArray); + VISIT_PRIMITIVE(DoubleArray); + VISIT_PRIMITIVE(BinaryArray); + VISIT_PRIMITIVE(StringArray); + +#undef VISIT_PRIMITIVE + + Status Visit(const DictionaryArray& values) override { + const auto& dict_type = static_cast(*values.type()); + + if (!is_integer(values.indices()->type_enum())) { + return Status::Invalid("Category values must be integers"); + } + + RETURN_NOT_OK(WritePrimitiveValues(*values.indices())); + + ArrayMetadata levels_meta; + RETURN_NOT_OK(WriteArray(*dict_type.dictionary(), &levels_meta)); + current_column_->SetCategory(levels_meta, dict_type.ordered()); + return Status::OK(); + } + + Status Visit(const TimestampArray& values) override { + RETURN_NOT_OK(WritePrimitiveValues(values)); + const auto& ts_type = static_cast(*values.type()); + current_column_->SetTimestamp(ts_type.unit, ts_type.timezone); + return Status::OK(); + } + + Status Visit(const Date32Array& values) 
override { + RETURN_NOT_OK(WritePrimitiveValues(values)); + current_column_->SetDate(); + return Status::OK(); + } + + Status Visit(const TimeArray& values) override { + RETURN_NOT_OK(WritePrimitiveValues(values)); + auto unit = static_cast(*values.type()).unit; + current_column_->SetTime(unit); + return Status::OK(); + } + + Status Append(const std::string& name, const Array& values) { + current_column_ = metadata_.AddColumn(name); + RETURN_NOT_OK(values.Accept(this)); + current_column_->Finish(); + return Status::OK(); + } + + private: + Status CheckStarted() { + if (!initialized_stream_) { + int64_t bytes_written_unused; + RETURN_NOT_OK( + WritePadded(stream_.get(), reinterpret_cast(kFeatherMagicBytes), + strlen(kFeatherMagicBytes), &bytes_written_unused)); + initialized_stream_ = true; + } + return Status::OK(); + } + + std::shared_ptr stream_; + + bool initialized_stream_; + TableBuilder metadata_; + + std::unique_ptr current_column_; + + Status AppendPrimitive(const PrimitiveArray& values, ArrayMetadata* out); +}; + +TableWriter::TableWriter() { + impl_.reset(new TableWriterImpl()); +} + +TableWriter::~TableWriter() {} + +Status TableWriter::Open( + const std::shared_ptr& stream, std::unique_ptr* out) { + out->reset(new TableWriter()); + return (*out)->impl_->Open(stream); +} + +Status TableWriter::OpenFile( + const std::string& abspath, std::unique_ptr* out) { + std::shared_ptr file; + RETURN_NOT_OK(io::FileOutputStream::Open(abspath, &file)); + out->reset(new TableWriter()); + return (*out)->impl_->Open(file); +} + +void TableWriter::SetDescription(const std::string& desc) { + impl_->SetDescription(desc); +} + +void TableWriter::SetNumRows(int64_t num_rows) { + impl_->SetNumRows(num_rows); +} + +Status TableWriter::Append(const std::string& name, const Array& values) { + return impl_->Append(name, values); +} + +Status TableWriter::Finalize() { + return impl_->Finalize(); +} + +} // namespace feather +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/feather.fbs b/cpp/src/arrow/ipc/feather.fbs new file mode 100644 index 0000000000000..a27d39989c620 --- /dev/null +++ b/cpp/src/arrow/ipc/feather.fbs @@ -0,0 +1,147 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +namespace arrow.ipc.feather.fbs; + +/// Feather is an experimental serialization format implemented using +/// techniques from Apache Arrow. It was created as a proof-of-concept of an +/// interoperable file format for storing data frames originating in Python or +/// R. It enabled the developers to sidestep some of the open design questions +/// in Arrow from early 2016 and instead create something simple and useful for +/// the intended use cases. 
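+// Editorial annotation, not part of the original patch: each column below is
+// stored as a PrimitiveArray of physical values plus an optional TypeMetadata
+// entry; a timestamp column, for example, is INT64 data paired with a
+// TimestampMetadata that carries the unit and an optional timezone.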
+ +enum Type : byte { + BOOL = 0, + + INT8 = 1, + INT16 = 2, + INT32 = 3, + INT64 = 4, + + UINT8 = 5, + UINT16 = 6, + UINT32 = 7, + UINT64 = 8, + + FLOAT = 9, + DOUBLE = 10, + + UTF8 = 11, + + BINARY = 12, + + CATEGORY = 13, + + TIMESTAMP = 14, + DATE = 15, + TIME = 16 +} + +enum Encoding : byte { + PLAIN = 0, + + /// Data is stored dictionary-encoded + /// dictionary size: + /// dictionary data: + /// dictionary index: + /// + /// TODO: do we care about storing the index values in a smaller typeclass + DICTIONARY = 1 +} + +enum TimeUnit : byte { + SECOND = 0, + MILLISECOND = 1, + MICROSECOND = 2, + NANOSECOND = 3 +} + +table PrimitiveArray { + type: Type; + + encoding: Encoding = PLAIN; + + /// Relative memory offset of the start of the array data excluding the size + /// of the metadata + offset: long; + + /// The number of logical values in the array + length: long; + + /// The number of observed nulls + null_count: long; + + /// The total size of the actual data in the file + total_bytes: long; + + /// TODO: Compression +} + +table CategoryMetadata { + /// The category codes are presumed to be integers that are valid indexes into + /// the levels array + + levels: PrimitiveArray; + ordered: bool = false; +} + +table TimestampMetadata { + unit: TimeUnit; + + /// Timestamp data is assumed to be UTC, but the time zone is stored here for + /// presentation as localized + timezone: string; +} + +table DateMetadata { +} + +table TimeMetadata { + unit: TimeUnit; +} + +union TypeMetadata { + CategoryMetadata, + TimestampMetadata, + DateMetadata, + TimeMetadata, +} + +table Column { + name: string; + values: PrimitiveArray; + metadata: TypeMetadata; + + /// This should (probably) be JSON + user_metadata: string; +} + +table CTable { + /// Some text (or a name) metadata about what the file is, optional + description: string; + + num_rows: long; + columns: [Column]; + + /// Version number of the Feather format + version: int; + + /// Table metadata (likely JSON), not yet used + metadata: string; +} + +root_type CTable; diff --git a/cpp/src/arrow/ipc/feather.h b/cpp/src/arrow/ipc/feather.h new file mode 100644 index 0000000000000..3d370dfe02bd0 --- /dev/null +++ b/cpp/src/arrow/ipc/feather.h @@ -0,0 +1,109 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
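+// Editorial annotation, not part of the original patch: a minimal round trip
+// with the classes declared below, assuming an existing arrow::Array `values`
+// and an illustrative file name:
+//
+//   std::unique_ptr<ipc::feather::TableWriter> writer;
+//   RETURN_NOT_OK(ipc::feather::TableWriter::OpenFile("example.feather", &writer));
+//   writer->SetNumRows(values->length());
+//   RETURN_NOT_OK(writer->Append("f0", *values));
+//   RETURN_NOT_OK(writer->Finalize());
+//
+//   std::unique_ptr<ipc::feather::TableReader> reader;
+//   RETURN_NOT_OK(ipc::feather::TableReader::OpenFile("example.feather", &reader));
+//   std::shared_ptr<Column> column;
+//   RETURN_NOT_OK(reader->GetColumn(0, &column));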
+ +/// Public API for the "Feather" file format, originally created at +/// http://github.com/wesm/feather + +#ifndef ARROW_IPC_FEATHER_H +#define ARROW_IPC_FEATHER_H + +#include +#include +#include +#include + +#include "arrow/type.h" + +namespace arrow { + +class Buffer; +class Column; +class Status; + +namespace io { + +class OutputStream; +class ReadableFileInterface; + +} // namespace io + +namespace ipc { +namespace feather { + +static constexpr const int kFeatherVersion = 2; + +// ---------------------------------------------------------------------- +// Metadata accessor classes + +class ARROW_EXPORT TableReader { + public: + TableReader(); + ~TableReader(); + + Status Open(const std::shared_ptr& source); + + static Status OpenFile(const std::string& abspath, std::unique_ptr* out); + + // Optional table description + // + // This does not return a const std::string& because a string has to be + // copied from the flatbuffer to be able to return a non-flatbuffer type + std::string GetDescription() const; + bool HasDescription() const; + + int version() const; + + int64_t num_rows() const; + int64_t num_columns() const; + + std::string GetColumnName(int i) const; + + Status GetColumn(int i, std::shared_ptr* out); + + private: + class ARROW_NO_EXPORT TableReaderImpl; + std::unique_ptr impl_; +}; + +class ARROW_EXPORT TableWriter { + public: + ~TableWriter(); + + static Status Open( + const std::shared_ptr& stream, std::unique_ptr* out); + + static Status OpenFile(const std::string& abspath, std::unique_ptr* out); + + void SetDescription(const std::string& desc); + void SetNumRows(int64_t num_rows); + + Status Append(const std::string& name, const Array& values); + + // We are done, write the file metadata and footer + Status Finalize(); + + private: + TableWriter(); + class ARROW_NO_EXPORT TableWriterImpl; + std::unique_ptr impl_; +}; + +} // namespace feather +} // namespace ipc +} // namespace arrow + +#endif // ARROW_IPC_FEATHER_H diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 6e6d609b00007..ef874e3d07959 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -437,6 +437,7 @@ set(CYTHON_EXTENSIONS config error io + _feather memory scalar schema diff --git a/python/pyarrow/_feather.pyx b/python/pyarrow/_feather.pyx new file mode 100644 index 0000000000000..67f734f6ed77c --- /dev/null +++ b/python/pyarrow/_feather.pyx @@ -0,0 +1,158 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
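+# Editorial annotation, not part of the original patch: pyarrow.feather drives
+# these wrappers roughly as follows (file name illustrative):
+#
+#   writer = FeatherWriter()
+#   writer.open('example.feather')
+#   writer.write_array('f0', column_values)   # one call per column
+#   writer.close()                            # records num_rows, writes footer
+#
+# FeatherReader mirrors this with the num_rows / num_columns properties and
+# get_column(i), which returns a pyarrow Column.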
+ +# cython: profile=False +# distutils: language = c++ +# cython: embedsignature = True + +from cython.operator cimport dereference as deref + +from pyarrow.includes.common cimport * +from pyarrow.includes.libarrow cimport CArray, CColumn, CSchema, CStatus +from pyarrow.includes.libarrow_io cimport ReadableFileInterface, OutputStream + +from libcpp.string cimport string +from libcpp cimport bool as c_bool + +cimport cpython + +from pyarrow.compat import frombytes, tobytes, encode_file_path + +from pyarrow.array cimport Array +from pyarrow.error cimport check_status +from pyarrow.table cimport Column + +cdef extern from "arrow/ipc/feather.h" namespace "arrow::ipc::feather" nogil: + + cdef cppclass TableWriter: + @staticmethod + CStatus Open(const shared_ptr[OutputStream]& stream, + unique_ptr[TableWriter]* out) + + @staticmethod + CStatus OpenFile(const string& abspath, unique_ptr[TableWriter]* out) + + void SetDescription(const string& desc) + void SetNumRows(int64_t num_rows) + + CStatus Append(const string& name, const CArray& values) + CStatus Finalize() + + cdef cppclass TableReader: + TableReader(const shared_ptr[ReadableFileInterface]& source) + + @staticmethod + CStatus OpenFile(const string& abspath, unique_ptr[TableReader]* out) + + string GetDescription() + c_bool HasDescription() + + int64_t num_rows() + int64_t num_columns() + + shared_ptr[CSchema] schema() + + CStatus GetColumn(int i, shared_ptr[CColumn]* out) + c_string GetColumnName(int i) + + +class FeatherError(Exception): + pass + + +cdef class FeatherWriter: + cdef: + unique_ptr[TableWriter] writer + + cdef public: + int64_t num_rows + + def __cinit__(self): + self.num_rows = -1 + + def open(self, object dest): + cdef: + string c_name = encode_file_path(dest) + + check_status(TableWriter.OpenFile(c_name, &self.writer)) + + def close(self): + if self.num_rows < 0: + self.num_rows = 0 + self.writer.get().SetNumRows(self.num_rows) + check_status(self.writer.get().Finalize()) + + def write_array(self, object name, object col, object mask=None): + cdef Array arr + + if self.num_rows >= 0: + if len(col) != self.num_rows: + raise ValueError('prior column had a different number of rows') + else: + self.num_rows = len(col) + + if isinstance(col, Array): + arr = col + else: + arr = Array.from_pandas(col, mask=mask) + + cdef c_string c_name = tobytes(name) + + with nogil: + check_status( + self.writer.get().Append(c_name, deref(arr.sp_array))) + + +cdef class FeatherReader: + cdef: + unique_ptr[TableReader] reader + + def __cinit__(self): + pass + + def open(self, source): + cdef: + string c_name = encode_file_path(source) + + check_status(TableReader.OpenFile(c_name, &self.reader)) + + property num_rows: + + def __get__(self): + return self.reader.get().num_rows() + + property num_columns: + + def __get__(self): + return self.reader.get().num_columns() + + def get_column_name(self, int i): + cdef c_string name = self.reader.get().GetColumnName(i) + return frombytes(name) + + def get_column(self, int i): + if i < 0 or i >= self.num_columns: + raise IndexError(i) + + cdef shared_ptr[CColumn] sp_column + with nogil: + check_status(self.reader.get() + .GetColumn(i, &sp_column)) + + cdef Column col = Column() + col.init(sp_column) + return col diff --git a/python/pyarrow/feather.py b/python/pyarrow/feather.py new file mode 100644 index 0000000000000..b7dbf96563a41 --- /dev/null +++ b/python/pyarrow/feather.py @@ -0,0 +1,118 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. 
See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +import six +from distutils.version import LooseVersion +import pandas as pd + +from pyarrow._feather import FeatherError # noqa +from pyarrow.table import Table +import pyarrow._feather as ext + + +if LooseVersion(pd.__version__) < '0.17.0': + raise ImportError("feather requires pandas >= 0.17.0") + +if LooseVersion(pd.__version__) < '0.19.0': + pdapi = pd.core.common +else: + pdapi = pd.api.types + + +class FeatherReader(ext.FeatherReader): + + def __init__(self, source): + self.source = source + self.open(source) + + def read(self, columns=None): + if columns is not None: + column_set = set(columns) + else: + column_set = None + + columns = [] + names = [] + for i in range(self.num_columns): + name = self.get_column_name(i) + if column_set is None or name in column_set: + col = self.get_column(i) + columns.append(col) + names.append(name) + + table = Table.from_arrays(columns, names=names) + return table.to_pandas() + + +def write_feather(df, path): + ''' + Write a pandas.DataFrame to Feather format + ''' + writer = ext.FeatherWriter() + writer.open(path) + + if isinstance(df, pd.SparseDataFrame): + df = df.to_dense() + + if not df.columns.is_unique: + raise ValueError("cannot serialize duplicate column names") + + # TODO(wesm): pipeline conversion to Arrow memory layout + for i, name in enumerate(df.columns): + col = df.iloc[:, i] + + if pdapi.is_object_dtype(col): + inferred_type = pd.lib.infer_dtype(col) + msg = ("cannot serialize column {n} " + "named {name} with dtype {dtype}".format( + n=i, name=name, dtype=inferred_type)) + + if inferred_type in ['mixed']: + + # allow columns with nulls + an inferable type + inferred_type = pd.lib.infer_dtype(col[col.notnull()]) + if inferred_type in ['mixed']: + raise ValueError(msg) + + elif inferred_type not in ['unicode', 'string']: + raise ValueError(msg) + + if not isinstance(name, six.string_types): + name = str(name) + + writer.write_array(name, col) + + writer.close() + + +def read_feather(path, columns=None): + """ + Read a pandas.DataFrame from Feather format + + Parameters + ---------- + path : string, path to read from + columns : sequence, optional + Only read a specific set of columns. 
If not provided, all columns are + read + + Returns + ------- + df : pandas.DataFrame + """ + reader = FeatherReader(path) + return reader.read(columns=columns) diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index ad5af1b0128ca..5657b973d1306 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -540,6 +540,11 @@ cdef table_to_blockmanager(const shared_ptr[CTable]& table, int nthreads): block = _int.make_block(cat, placement=placement, klass=_int.CategoricalBlock, fastpath=True) + elif 'timezone' in item: + from pandas.types.api import DatetimeTZDtype + dtype = DatetimeTZDtype('ns', tz=item['timezone']) + block = _int.make_block(block_arr, placement=placement, + dtype=dtype, fastpath=True) else: block = _int.make_block(block_arr, placement=placement) blocks.append(block) diff --git a/python/pyarrow/tests/test_feather.py b/python/pyarrow/tests/test_feather.py new file mode 100644 index 0000000000000..451475b4c6d81 --- /dev/null +++ b/python/pyarrow/tests/test_feather.py @@ -0,0 +1,379 @@ +# Copyright 2016 Feather Developers +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +import unittest + +import pytest + +from numpy.testing import assert_array_equal +import numpy as np + +from pandas.util.testing import assert_frame_equal +import pandas as pd + +from pyarrow.compat import guid +from pyarrow.error import ArrowException +from pyarrow.feather import (read_feather, write_feather, + FeatherReader) +from pyarrow._feather import FeatherWriter + + +def random_path(): + return 'feather_{}'.format(guid()) + + +class TestFeatherReader(unittest.TestCase): + + def setUp(self): + self.test_files = [] + + def tearDown(self): + for path in self.test_files: + try: + os.remove(path) + except os.error: + pass + + def test_file_not_exist(self): + with self.assertRaises(ArrowException): + FeatherReader('test_invalid_file') + + def _get_null_counts(self, path, columns=None): + reader = FeatherReader(path) + counts = [] + for i in range(reader.num_columns): + col = reader.get_column(i) + if columns is None or col.name in columns: + counts.append(col.null_count) + + return counts + + def _check_pandas_roundtrip(self, df, expected=None, path=None, + columns=None, null_counts=None): + if path is None: + path = random_path() + + self.test_files.append(path) + write_feather(df, path) + if not os.path.exists(path): + raise Exception('file not written') + + result = read_feather(path, columns) + if expected is None: + expected = df + + assert_frame_equal(result, expected) + + if null_counts is None: + null_counts = np.zeros(len(expected.columns)) + + np.testing.assert_array_equal(self._get_null_counts(path, columns), + null_counts) + + def _assert_error_on_write(self, df, exc, path=None): + # check that we are raising the exception + # on writing + + if path is None: + path = random_path() + + self.test_files.append(path) + + def f(): + write_feather(df, path) + + self.assertRaises(exc, f) + + def test_num_rows_attr(self): + df = pd.DataFrame({'foo': [1, 2, 3, 4, 5]}) + 
path = random_path() + self.test_files.append(path) + write_feather(df, path) + + reader = FeatherReader(path) + assert reader.num_rows == len(df) + + df = pd.DataFrame({}) + path = random_path() + self.test_files.append(path) + write_feather(df, path) + + reader = FeatherReader(path) + assert reader.num_rows == 0 + + def test_float_no_nulls(self): + data = {} + numpy_dtypes = ['f4', 'f8'] + num_values = 100 + + for dtype in numpy_dtypes: + values = np.random.randn(num_values) + data[dtype] = values.astype(dtype) + + df = pd.DataFrame(data) + self._check_pandas_roundtrip(df) + + def test_float_nulls(self): + num_values = 100 + + path = random_path() + self.test_files.append(path) + writer = FeatherWriter() + writer.open(path) + + null_mask = np.random.randint(0, 10, size=num_values) < 3 + dtypes = ['f4', 'f8'] + expected_cols = [] + null_counts = [] + for name in dtypes: + values = np.random.randn(num_values).astype(name) + writer.write_array(name, values, null_mask) + + values[null_mask] = np.nan + + expected_cols.append(values) + null_counts.append(null_mask.sum()) + + writer.close() + + ex_frame = pd.DataFrame(dict(zip(dtypes, expected_cols)), + columns=dtypes) + + result = read_feather(path) + assert_frame_equal(result, ex_frame) + assert_array_equal(self._get_null_counts(path), null_counts) + + def test_integer_no_nulls(self): + data = {} + + numpy_dtypes = ['i1', 'i2', 'i4', 'i8', + 'u1', 'u2', 'u4', 'u8'] + num_values = 100 + + for dtype in numpy_dtypes: + values = np.random.randint(0, 100, size=num_values) + data[dtype] = values.astype(dtype) + + df = pd.DataFrame(data) + self._check_pandas_roundtrip(df) + + def test_platform_numpy_integers(self): + data = {} + + numpy_dtypes = ['longlong'] + num_values = 100 + + for dtype in numpy_dtypes: + values = np.random.randint(0, 100, size=num_values) + data[dtype] = values.astype(dtype) + + df = pd.DataFrame(data) + self._check_pandas_roundtrip(df) + + def test_integer_with_nulls(self): + # pandas requires upcast to float dtype + path = random_path() + self.test_files.append(path) + + int_dtypes = ['i1', 'i2', 'i4', 'i8', 'u1', 'u2', 'u4', 'u8'] + num_values = 100 + + writer = FeatherWriter() + writer.open(path) + + null_mask = np.random.randint(0, 10, size=num_values) < 3 + expected_cols = [] + for name in int_dtypes: + values = np.random.randint(0, 100, size=num_values) + writer.write_array(name, values, null_mask) + + expected = values.astype('f8') + expected[null_mask] = np.nan + + expected_cols.append(expected) + + ex_frame = pd.DataFrame(dict(zip(int_dtypes, expected_cols)), + columns=int_dtypes) + + writer.close() + + result = read_feather(path) + assert_frame_equal(result, ex_frame) + + def test_boolean_no_nulls(self): + num_values = 100 + + np.random.seed(0) + + df = pd.DataFrame({'bools': np.random.randn(num_values) > 0}) + self._check_pandas_roundtrip(df) + + def test_boolean_nulls(self): + # pandas requires upcast to object dtype + path = random_path() + self.test_files.append(path) + + num_values = 100 + np.random.seed(0) + + writer = FeatherWriter() + writer.open(path) + + mask = np.random.randint(0, 10, size=num_values) < 3 + values = np.random.randint(0, 10, size=num_values) < 5 + writer.write_array('bools', values, mask) + + expected = values.astype(object) + expected[mask] = None + + writer.close() + + ex_frame = pd.DataFrame({'bools': expected}) + + result = read_feather(path) + assert_frame_equal(result, ex_frame) + + def test_boolean_object_nulls(self): + repeats = 100 + arr = np.array([False, None, True] * repeats, 
                       dtype=object)
        df = pd.DataFrame({'bools': arr})
        self._check_pandas_roundtrip(df, null_counts=[1 * repeats])

    def test_strings(self):
        repeats = 1000

        # we have mixed bytes, unicode, strings
        values = [b'foo', None, u'bar', 'qux', np.nan]
        df = pd.DataFrame({'strings': values * repeats})
        self._assert_error_on_write(df, ValueError)

        # embedded nulls are ok
        values = ['foo', None, 'bar', 'qux', None]
        df = pd.DataFrame({'strings': values * repeats})
        expected = pd.DataFrame({'strings': values * repeats})
        self._check_pandas_roundtrip(df, expected, null_counts=[2 * repeats])

        values = ['foo', None, 'bar', 'qux', np.nan]
        df = pd.DataFrame({'strings': values * repeats})
        expected = pd.DataFrame({'strings': values * repeats})
        self._check_pandas_roundtrip(df, expected, null_counts=[2 * repeats])

    def test_empty_strings(self):
        df = pd.DataFrame({'strings': [''] * 10})
        self._check_pandas_roundtrip(df)

    def test_nan_as_null(self):
        # Create a nan that is not numpy.nan
        values = np.array(['foo', np.nan, np.nan * 2, 'bar'] * 10)
        df = pd.DataFrame({'strings': values})
        self._check_pandas_roundtrip(df)

    def test_category(self):
        repeats = 1000
        values = ['foo', None, u'bar', 'qux', np.nan]
        df = pd.DataFrame({'strings': values * repeats})
        df['strings'] = df['strings'].astype('category')

        values = ['foo', None, 'bar', 'qux', None]
        expected = pd.DataFrame({'strings': pd.Categorical(values * repeats)})
        self._check_pandas_roundtrip(df, expected,
                                     null_counts=[2 * repeats])

    @pytest.mark.xfail
    def test_timestamp(self):
        df = pd.DataFrame({'naive': pd.date_range('2016-03-28', periods=10)})
        df['with_tz'] = (df.naive.dt.tz_localize('utc')
                         .dt.tz_convert('America/Los_Angeles'))

        self._check_pandas_roundtrip(df)

    @pytest.mark.xfail
    def test_timestamp_with_nulls(self):
        df = pd.DataFrame({'test': [pd.datetime(2016, 1, 1),
                                    None,
                                    pd.datetime(2016, 1, 3)]})
        df['with_tz'] = df.test.dt.tz_localize('utc')

        self._check_pandas_roundtrip(df, null_counts=[1, 1])

    @pytest.mark.xfail
    def test_out_of_float64_timestamp_with_nulls(self):
        df = pd.DataFrame(
            {'test': pd.DatetimeIndex([1451606400000000001,
                                       None, 14516064000030405])})
        df['with_tz'] = df.test.dt.tz_localize('utc')
        self._check_pandas_roundtrip(df, null_counts=[1, 1])

    def test_non_string_columns(self):
        df = pd.DataFrame({0: [1, 2, 3, 4],
                           1: [True, False, True, False]})

        expected = df.rename(columns=str)
        self._check_pandas_roundtrip(df, expected)

    def test_unicode_filename(self):
        # GH #209
        name = (b'Besa_Kavaj\xc3\xab.feather').decode('utf-8')
        df = pd.DataFrame({'foo': [1, 2, 3, 4]})
        self._check_pandas_roundtrip(df, path=name)

    def test_read_columns(self):
        data = {'foo': [1, 2, 3, 4],
                'boo': [5, 6, 7, 8],
                'woo': [1, 3, 5, 7]}
        columns = list(data.keys())[1:3]
        df = pd.DataFrame(data)
        expected = pd.DataFrame({c: data[c] for c in columns})
        self._check_pandas_roundtrip(df, expected, columns=columns)

    def test_overwritten_file(self):
        path = random_path()

        num_values = 100
        np.random.seed(0)

        values = np.random.randint(0, 10, size=num_values)
        write_feather(pd.DataFrame({'ints': values}), path)

        df = pd.DataFrame({'ints': values[0: num_values//2]})
        self._check_pandas_roundtrip(df, path=path)

    def test_sparse_dataframe(self):
        # GH #221
        data = {'A': [0, 1, 2],
                'B': [1, 0, 1]}
        df = pd.DataFrame(data).to_sparse(fill_value=1)
        expected = df.to_dense()
        self._check_pandas_roundtrip(df, expected)
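
    # A sketch of the pattern the null-handling tests above rely on (an
    # illustrative aside, not one of the original tests): write with an
    # explicit boolean mask, then observe the dtype upcast on read:
    #
    #   writer = FeatherWriter()
    #   writer.open(path)
    #   writer.write_array('col', np.array([1, 2, 3]),
    #                      np.array([False, True, False]))
    #   writer.close()
    #   read_feather(path)  # 'col' comes back float64, NaN where masked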
    def test_duplicate_columns(self):

        # https://github.com/wesm/feather/issues/53
        # not currently able to handle duplicate columns
        df = pd.DataFrame(np.arange(12).reshape(4, 3),
                          columns=list('aaa')).copy()
        self._assert_error_on_write(df, ValueError)

    def test_unsupported(self):
        # https://github.com/wesm/feather/issues/240
        # serializing actual python objects

        # period
        df = pd.DataFrame({'a': pd.period_range('2013', freq='M', periods=3)})
        self._assert_error_on_write(df, ValueError)

        # non-strings
        df = pd.DataFrame({'a': ['a', 1, 2.0]})
        self._assert_error_on_write(df, ValueError)
diff --git a/python/setup.py b/python/setup.py
index b0f29be4c1b3b..a0573fe1fccff 100644
--- a/python/setup.py
+++ b/python/setup.py
@@ -101,6 +101,7 @@ def initialize_options(self):
        'io',
        'jemalloc',
        'memory',
+        '_feather',
        '_parquet',
        'scalar',
        'schema',
diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc
index c707ada9dd55c..eb3ab49f58892 100644
--- a/python/src/pyarrow/adapters/pandas.cc
+++ b/python/src/pyarrow/adapters/pandas.cc
@@ -925,6 +925,32 @@ class DatetimeBlock : public PandasBlock {
   }
 };

+class DatetimeTZBlock : public DatetimeBlock {
+ public:
+  DatetimeTZBlock(const std::string& timezone, int64_t num_rows)
+      : DatetimeBlock(num_rows, 1), timezone_(timezone) {}
+
+  Status GetPyResult(PyObject** output) override {
+    PyObject* result = PyDict_New();
+    RETURN_IF_PYERROR();
+
+    PyObject* py_tz = PyUnicode_FromStringAndSize(
+        timezone_.c_str(), static_cast<Py_ssize_t>(timezone_.size()));
+    RETURN_IF_PYERROR();
+
+    PyDict_SetItemString(result, "block", block_arr_.obj());
+    PyDict_SetItemString(result, "timezone", py_tz);
+    PyDict_SetItemString(result, "placement", placement_arr_.obj());
+
+    *output = result;
+
+    return Status::OK();
+  }
+
+ private:
+  std::string timezone_;
+};
+
 template <typename T>
 class CategoricalBlock : public PandasBlock {
  public:
@@ -1068,6 +1094,8 @@ static inline Status MakeCategoricalBlock(const std::shared_ptr<DataType>& type,
   return (*block)->Allocate();
 }

+using BlockMap = std::unordered_map<int, std::shared_ptr<PandasBlock>>;
+
 // Construct the exact pandas 0.x "BlockManager" memory layout
 //
 // * For each column determine the correct output pandas type
@@ -1138,9 +1166,14 @@ class DataFrameBlockCreator {
       case Type::DATE:
         output_type = PandasBlock::DATETIME;
         break;
-      case Type::TIMESTAMP:
-        output_type = PandasBlock::DATETIME;
-        break;
+      case Type::TIMESTAMP: {
+        const auto& ts_type = static_cast<const TimestampType&>(*col->type());
+        if (ts_type.timezone != "") {
+          output_type = PandasBlock::DATETIME_WITH_TZ;
+        } else {
+          output_type = PandasBlock::DATETIME;
+        }
+      } break;
       case Type::LIST: {
         auto list_type = std::static_pointer_cast<ListType>(col->type());
         if (!ListTypeSupported(list_type->value_type()->type)) {
@@ -1159,10 +1192,15 @@ class DataFrameBlockCreator {
       }

       int block_placement = 0;
+      std::shared_ptr<PandasBlock> block;
       if (output_type == PandasBlock::CATEGORICAL) {
-        std::shared_ptr<PandasBlock> block;
         RETURN_NOT_OK(MakeCategoricalBlock(col->type(), table_->num_rows(), &block));
         categorical_blocks_[i] = block;
+      } else if (output_type == PandasBlock::DATETIME_WITH_TZ) {
+        const auto& ts_type = static_cast<const TimestampType&>(*col->type());
+        block = std::make_shared<DatetimeTZBlock>(ts_type.timezone, table_->num_rows());
+        RETURN_NOT_OK(block->Allocate());
+        datetimetz_blocks_[i] = block;
       } else {
         auto it = type_counts_.find(output_type);
         if (it != type_counts_.end()) {
@@ -1252,28 +1290,24 @@ class DataFrameBlockCreator {
     return Status::OK();
   }

+  Status AppendBlocks(const BlockMap& blocks, PyObject* list) {
+    for (const auto& it : blocks) {
+      PyObject* item;
RETURN_NOT_OK(it.second->GetPyResult(&item)); + if (PyList_Append(list, item) < 0) { RETURN_IF_PYERROR(); } + } + return Status::OK(); + } + Status GetResultList(PyObject** out) { PyAcquireGIL lock; - auto num_blocks = - static_cast(blocks_.size() + categorical_blocks_.size()); - PyObject* result = PyList_New(num_blocks); + PyObject* result = PyList_New(0); RETURN_IF_PYERROR(); - int i = 0; - for (const auto& it : blocks_) { - const std::shared_ptr block = it.second; - PyObject* item; - RETURN_NOT_OK(block->GetPyResult(&item)); - if (PyList_SET_ITEM(result, i++, item) < 0) { RETURN_IF_PYERROR(); } - } - - for (const auto& it : categorical_blocks_) { - const std::shared_ptr block = it.second; - PyObject* item; - RETURN_NOT_OK(block->GetPyResult(&item)); - if (PyList_SET_ITEM(result, i++, item) < 0) { RETURN_IF_PYERROR(); } - } + RETURN_NOT_OK(AppendBlocks(blocks_, result)); + RETURN_NOT_OK(AppendBlocks(categorical_blocks_, result)); + RETURN_NOT_OK(AppendBlocks(datetimetz_blocks_, result)); *out = result; return Status::OK(); @@ -1292,10 +1326,13 @@ class DataFrameBlockCreator { std::unordered_map type_counts_; // block type -> block - std::unordered_map> blocks_; + BlockMap blocks_; // column number -> categorical block - std::unordered_map> categorical_blocks_; + BlockMap categorical_blocks_; + + // column number -> datetimetz block + BlockMap datetimetz_blocks_; }; Status ConvertTableToPandas( From fdc25b418273a9a0d9d2512f571236e96cb4e2b4 Mon Sep 17 00:00:00 2001 From: Julien Lafaye Date: Sun, 12 Mar 2017 13:28:09 +0100 Subject: [PATCH 0355/1644] ARROW-606: [C++] upgrade flatbuffers version to 1.6.0 all unittests pass benchmark (builder, column, jemalloc-builder) results suffer minor differences (<5%) wrt to flatbuffer 1.3.0 Author: Julien Lafaye Closes #373 from jlafaye/master and squashes the following commits: 3d001e5 [Julien Lafaye] ARROW-606: [C++] upgrade flatbuffers version to 1.6.0 --- cpp/CMakeLists.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 294c439e2b093..5ecc34e8a5fc6 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -35,7 +35,7 @@ set(THIRDPARTY_DIR "${CMAKE_SOURCE_DIR}/thirdparty") set(GFLAGS_VERSION "2.1.2") set(GTEST_VERSION "1.7.0") set(GBENCHMARK_VERSION "1.1.0") -set(FLATBUFFERS_VERSION "1.3.0") +set(FLATBUFFERS_VERSION "1.6.0") set(JEMALLOC_VERSION "4.4.0") find_package(ClangTools) From e5a11dac2ab856001cae4c1cb582cd376fa7f083 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 12 Mar 2017 11:53:04 -0400 Subject: [PATCH 0356/1644] ARROW-534: [C++] Add IPC tests for date/time after ARROW-452, fix bugs Closes #345. I had mostly done this in #361 so this adds tests to `ipc-adapter-test` Author: Wes McKinney Closes #371 from wesm/ARROW-534 and squashes the following commits: cab6d4f [Wes McKinney] Add functions to make record batches for date, date32, timestamp, time. 
Fix bugs --- cpp/src/arrow/ipc/adapter.cc | 75 ++++++--------------- cpp/src/arrow/ipc/feather-test.cc | 38 ----------- cpp/src/arrow/ipc/ipc-adapter-test.cc | 3 +- cpp/src/arrow/ipc/test-common.h | 97 +++++++++++++++++++++++++++ cpp/src/arrow/test-util.h | 18 +++++ cpp/src/arrow/type.h | 2 +- cpp/src/arrow/type_traits.h | 6 +- 7 files changed, 141 insertions(+), 98 deletions(-) diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index 78d58101963dc..a4eff7214aa5f 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -309,66 +309,31 @@ class RecordBatchWriter : public ArrayVisitor { return Status::OK(); } - Status Visit(const Int8Array& array) override { - return VisitFixedWidth(array); - } - - Status Visit(const Int16Array& array) override { - return VisitFixedWidth(array); - } - - Status Visit(const Int32Array& array) override { - return VisitFixedWidth(array); - } - - Status Visit(const Int64Array& array) override { - return VisitFixedWidth(array); - } - - Status Visit(const UInt8Array& array) override { - return VisitFixedWidth(array); - } - - Status Visit(const UInt16Array& array) override { - return VisitFixedWidth(array); - } - - Status Visit(const UInt32Array& array) override { - return VisitFixedWidth(array); - } - - Status Visit(const UInt64Array& array) override { - return VisitFixedWidth(array); - } - - Status Visit(const HalfFloatArray& array) override { - return VisitFixedWidth(array); - } - - Status Visit(const FloatArray& array) override { - return VisitFixedWidth(array); - } - - Status Visit(const DoubleArray& array) override { - return VisitFixedWidth(array); - } +#define VISIT_FIXED_WIDTH(TYPE) \ + Status Visit(const TYPE& array) override { return VisitFixedWidth(array); } + + VISIT_FIXED_WIDTH(Int8Array); + VISIT_FIXED_WIDTH(Int16Array); + VISIT_FIXED_WIDTH(Int32Array); + VISIT_FIXED_WIDTH(Int64Array); + VISIT_FIXED_WIDTH(UInt8Array); + VISIT_FIXED_WIDTH(UInt16Array); + VISIT_FIXED_WIDTH(UInt32Array); + VISIT_FIXED_WIDTH(UInt64Array); + VISIT_FIXED_WIDTH(HalfFloatArray); + VISIT_FIXED_WIDTH(FloatArray); + VISIT_FIXED_WIDTH(DoubleArray); + VISIT_FIXED_WIDTH(DateArray); + VISIT_FIXED_WIDTH(Date32Array); + VISIT_FIXED_WIDTH(TimeArray); + VISIT_FIXED_WIDTH(TimestampArray); + +#undef VISIT_FIXED_WIDTH Status Visit(const StringArray& array) override { return VisitBinary(array); } Status Visit(const BinaryArray& array) override { return VisitBinary(array); } - Status Visit(const DateArray& array) override { - return VisitFixedWidth(array); - } - - Status Visit(const TimeArray& array) override { - return VisitFixedWidth(array); - } - - Status Visit(const TimestampArray& array) override { - return VisitFixedWidth(array); - } - Status Visit(const ListArray& array) override { std::shared_ptr value_offsets; RETURN_NOT_OK(GetZeroBasedValueOffsets(array, &value_offsets)); diff --git a/cpp/src/arrow/ipc/feather-test.cc b/cpp/src/arrow/ipc/feather-test.cc index b73246b672260..078c3e10aff29 100644 --- a/cpp/src/arrow/ipc/feather-test.cc +++ b/cpp/src/arrow/ipc/feather-test.cc @@ -344,44 +344,6 @@ TEST_F(TestTableWriter, PrimitiveRoundTrip) { ASSERT_EQ("f1", col->name()); } -Status MakeDictionaryFlat(std::shared_ptr* out) { - const int64_t length = 6; - - std::vector is_valid = {true, true, false, true, true, true}; - std::shared_ptr dict1, dict2; - - std::vector dict1_values = {"foo", "bar", "baz"}; - std::vector dict2_values = {"foo", "bar", "baz", "qux"}; - - ArrayFromVector(dict1_values, &dict1); - ArrayFromVector(dict2_values, &dict2); 
- - auto f0_type = arrow::dictionary(arrow::int32(), dict1); - auto f1_type = arrow::dictionary(arrow::int8(), dict1); - auto f2_type = arrow::dictionary(arrow::int32(), dict2); - - std::shared_ptr indices0, indices1, indices2; - std::vector indices0_values = {1, 2, -1, 0, 2, 0}; - std::vector indices1_values = {0, 0, 2, 2, 1, 1}; - std::vector indices2_values = {3, 0, 2, 1, 0, 2}; - - ArrayFromVector(is_valid, indices0_values, &indices0); - ArrayFromVector(is_valid, indices1_values, &indices1); - ArrayFromVector(is_valid, indices2_values, &indices2); - - auto a0 = std::make_shared(f0_type, indices0); - auto a1 = std::make_shared(f1_type, indices1); - auto a2 = std::make_shared(f2_type, indices2); - - // construct batch - std::shared_ptr schema(new Schema( - {field("dict1", f0_type), field("sparse", f1_type), field("dense", f2_type)})); - - std::vector> arrays = {a0, a1, a2}; - out->reset(new RecordBatch(schema, length, arrays)); - return Status::OK(); -} - TEST_F(TestTableWriter, CategoryRoundtrip) { std::shared_ptr batch; ASSERT_OK(MakeDictionaryFlat(&batch)); diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc index 6678fd522a86a..b60b8a9ba68d2 100644 --- a/cpp/src/arrow/ipc/ipc-adapter-test.cc +++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc @@ -175,7 +175,8 @@ INSTANTIATE_TEST_CASE_P( RoundTripTests, TestRecordBatchParam, ::testing::Values(&MakeIntRecordBatch, &MakeStringTypesRecordBatch, &MakeNonNullRecordBatch, &MakeZeroLengthRecordBatch, &MakeListRecordBatch, - &MakeDeeplyNestedList, &MakeStruct, &MakeUnion, &MakeDictionary)); + &MakeDeeplyNestedList, &MakeStruct, &MakeUnion, &MakeDictionary, &MakeDates, + &MakeTimestamps, &MakeTimes)); void TestGetRecordBatchSize(std::shared_ptr batch) { ipc::MockOutputStream mock; diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index dc823662ee1ef..7f33aba812e0f 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -425,6 +425,103 @@ Status MakeDictionary(std::shared_ptr* out) { return Status::OK(); } +Status MakeDictionaryFlat(std::shared_ptr* out) { + const int64_t length = 6; + + std::vector is_valid = {true, true, false, true, true, true}; + std::shared_ptr dict1, dict2; + + std::vector dict1_values = {"foo", "bar", "baz"}; + std::vector dict2_values = {"foo", "bar", "baz", "qux"}; + + ArrayFromVector(dict1_values, &dict1); + ArrayFromVector(dict2_values, &dict2); + + auto f0_type = arrow::dictionary(arrow::int32(), dict1); + auto f1_type = arrow::dictionary(arrow::int8(), dict1); + auto f2_type = arrow::dictionary(arrow::int32(), dict2); + + std::shared_ptr indices0, indices1, indices2; + std::vector indices0_values = {1, 2, -1, 0, 2, 0}; + std::vector indices1_values = {0, 0, 2, 2, 1, 1}; + std::vector indices2_values = {3, 0, 2, 1, 0, 2}; + + ArrayFromVector(is_valid, indices0_values, &indices0); + ArrayFromVector(is_valid, indices1_values, &indices1); + ArrayFromVector(is_valid, indices2_values, &indices2); + + auto a0 = std::make_shared(f0_type, indices0); + auto a1 = std::make_shared(f1_type, indices1); + auto a2 = std::make_shared(f2_type, indices2); + + // construct batch + std::shared_ptr schema(new Schema( + {field("dict1", f0_type), field("sparse", f1_type), field("dense", f2_type)})); + + std::vector> arrays = {a0, a1, a2}; + out->reset(new RecordBatch(schema, length, arrays)); + return Status::OK(); +} + +Status MakeDates(std::shared_ptr* out) { + std::vector is_valid = {true, true, true, false, true, true, true}; + auto f0 = 
field("f0", date32()); + auto f1 = field("f1", date()); + std::shared_ptr schema(new Schema({f0, f1})); + + std::vector date_values = {1489269000000, 1489270000000, 1489271000000, + 1489272000000, 1489272000000, 1489273000000}; + std::vector date32_values = {0, 1, 2, 3, 4, 5, 6}; + + std::shared_ptr date_array, date32_array; + ArrayFromVector(is_valid, date_values, &date_array); + ArrayFromVector(is_valid, date32_values, &date32_array); + + std::vector> arrays = {date32_array, date_array}; + *out = std::make_shared(schema, date_array->length(), arrays); + return Status::OK(); +} + +Status MakeTimestamps(std::shared_ptr* out) { + std::vector is_valid = {true, true, true, false, true, true, true}; + auto f0 = field("f0", timestamp(TimeUnit::MILLI)); + auto f1 = field("f1", timestamp(TimeUnit::NANO)); + auto f2 = field("f2", timestamp("US/Los_Angeles", TimeUnit::SECOND)); + std::shared_ptr schema(new Schema({f0, f1, f2})); + + std::vector ts_values = {1489269000000, 1489270000000, 1489271000000, + 1489272000000, 1489272000000, 1489273000000}; + + std::shared_ptr a0, a1, a2; + ArrayFromVector(f0->type, is_valid, ts_values, &a0); + ArrayFromVector(f1->type, is_valid, ts_values, &a1); + ArrayFromVector(f2->type, is_valid, ts_values, &a2); + + ArrayVector arrays = {a0, a1, a2}; + *out = std::make_shared(schema, a0->length(), arrays); + return Status::OK(); +} + +Status MakeTimes(std::shared_ptr* out) { + std::vector is_valid = {true, true, true, false, true, true, true}; + auto f0 = field("f0", time(TimeUnit::MILLI)); + auto f1 = field("f1", time(TimeUnit::NANO)); + auto f2 = field("f2", time(TimeUnit::SECOND)); + std::shared_ptr schema(new Schema({f0, f1, f2})); + + std::vector ts_values = {1489269000000, 1489270000000, 1489271000000, + 1489272000000, 1489272000000, 1489273000000}; + + std::shared_ptr a0, a1, a2; + ArrayFromVector(f0->type, is_valid, ts_values, &a0); + ArrayFromVector(f1->type, is_valid, ts_values, &a1); + ArrayFromVector(f2->type, is_valid, ts_values, &a2); + + ArrayVector arrays = {a0, a1, a2}; + *out = std::make_shared(schema, a0->length(), arrays); + return Status::OK(); +} + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index 11ce50a76a547..f05a54168b631 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -63,6 +63,8 @@ namespace arrow { +using ArrayVector = std::vector>; + namespace test { template @@ -232,6 +234,22 @@ class TestBase : public ::testing::Test { MemoryPool* pool_; }; +template +void ArrayFromVector(const std::shared_ptr& type, + const std::vector& is_valid, const std::vector& values, + std::shared_ptr* out) { + MemoryPool* pool = default_memory_pool(); + typename TypeTraits::BuilderType builder(pool, type); + for (size_t i = 0; i < values.size(); ++i) { + if (is_valid[i]) { + ASSERT_OK(builder.Append(values[i])); + } else { + ASSERT_OK(builder.AppendNull()); + } + } + ASSERT_OK(builder.Finish(out)); +} + template void ArrayFromVector(const std::vector& is_valid, const std::vector& values, std::shared_ptr* out) { diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index aa0d70e5505e6..a838082d7e79a 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -452,7 +452,7 @@ struct ARROW_EXPORT Date32Type : public FixedWidthType { Date32Type() : FixedWidthType(Type::DATE32) {} - int bit_width() const override { return static_cast(sizeof(c_type) * 8); } + int bit_width() const override { return static_cast(sizeof(c_type) * 4); } Status Accept(TypeVisitor* visitor) const 
override;

  std::string ToString() const override;
diff --git a/cpp/src/arrow/type_traits.h b/cpp/src/arrow/type_traits.h
index 2cd14203cdbb1..91461da8c42a6 100644
--- a/cpp/src/arrow/type_traits.h
+++ b/cpp/src/arrow/type_traits.h
@@ -121,7 +121,7 @@ struct TypeTraits {
 template <>
 struct TypeTraits<DateType> {
   using ArrayType = DateArray;
-  // using BuilderType = DateBuilder;
+  using BuilderType = DateBuilder;

   static inline int64_t bytes_required(int64_t elements) {
     return elements * sizeof(int64_t);
@@ -145,7 +145,7 @@ struct TypeTraits {
 template <>
 struct TypeTraits<TimestampType> {
   using ArrayType = TimestampArray;
-  // using BuilderType = TimestampBuilder;
+  using BuilderType = TimestampBuilder;

   static inline int64_t bytes_required(int64_t elements) {
     return elements * sizeof(int64_t);
@@ -156,7 +156,7 @@ struct TypeTraits {
 template <>
 struct TypeTraits<TimeType> {
   using ArrayType = TimeArray;
-  // using BuilderType = TimestampBuilder;
+  using BuilderType = TimeBuilder;

   static inline int64_t bytes_required(int64_t elements) {
     return elements * sizeof(int64_t);

From 344ad1f10f3a4a86692d7a32b17e9939131321a1 Mon Sep 17 00:00:00 2001
From: rvernica 
Date: Sun, 12 Mar 2017 13:43:32 -0400
Subject: [PATCH 0357/1644] ARROW-619: Fix typos in setup.py args and
 LD_LIBRARY_PATH

Author: rvernica 

Closes #372 from rvernica/patch-1 and squashes the following commits:

b27999a [rvernica] Fix typos in setup.py args and LD_LIBRARY_PATH
---
 python/doc/install.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/python/doc/install.rst b/python/doc/install.rst
index 4d99fa0caf1de..d93a88f547576 100644
--- a/python/doc/install.rst
+++ b/python/doc/install.rst
@@ -124,7 +124,7 @@ Install `pyarrow`
   # --with-jemalloc enables the jemalloc allocator support in PyArrow
   # --build-type=release disables debugging information and turns on
   # compiler optimizations for native code
-  python setup.py build_ext --with-parquet --with--jemalloc --build-type=release install
+  python setup.py build_ext --with-parquet --with-jemalloc --build-type=release install
   python setup.py install

.. warning::
@@ -134,7 +134,7 @@ Install `pyarrow`
.. note::

   In development installations, you will also need to set a correct
   ``LD_LIBARY_PATH``. This is most probably done with
-  ``export LD_LIBARY_PATH=$ARROW_HOME/lib:$LD_LIBARY_PATH``.
+  ``export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH``.

.. code-block:: python

From d4ecb5e54eb7bc9392ad2f4e1cf9a0fe42be8cd0 Mon Sep 17 00:00:00 2001
From: Bryan Cutler 
Date: Sun, 12 Mar 2017 13:49:53 -0400
Subject: [PATCH 0358/1644] ARROW-612: [Java] Added not null to Field.toString
 output

Changed `Field.toString` method to include an additional `not null`
description only if the nullable flag is not set. Changed test to update
expected string output.
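
For illustration (a hypothetical field, not taken from the patch itself): a
schema whose only member is a non-nullable 32-bit signed integer field "a"
would now render with the trailing marker, e.g. "Schema<a: Int(32, true) not
null>", while nullable fields print exactly as before.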
Author: Bryan Cutler Closes #368 from BryanCutler/Field-toString-show-nullable-ARROW-612 and squashes the following commits: 9dc633d [Bryan Cutler] added not null to Field.toString output --- .../main/java/org/apache/arrow/vector/types/pojo/Field.java | 3 +++ .../java/org/apache/arrow/vector/types/pojo/TestSchema.java | 2 +- 2 files changed, 4 insertions(+), 1 deletion(-) diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java index 2d528e4141907..f9b79ce556338 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java @@ -185,6 +185,9 @@ public String toString() { if (!children.isEmpty()) { sb.append("<").append(Joiner.on(", ").join(children)).append(">"); } + if (!nullable) { + sb.append(" not null"); + } return sb.toString(); } } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java b/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java index d60d17ea76db8..f04c78ec45d97 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java @@ -53,7 +53,7 @@ public void testComplex() throws IOException { )); roundTrip(schema); assertEquals( - "Schema, e: List, f: FloatingPoint(SINGLE), g: Timestamp(MILLISECOND), h: Interval(DAY_TIME)>", + "Schema, e: List, f: FloatingPoint(SINGLE), g: Timestamp(MILLISECOND), h: Interval(DAY_TIME)>", schema.toString()); } From 492b3d4395ab735f2de2a797cf13554a6b15936b Mon Sep 17 00:00:00 2001 From: Brian Hulette Date: Sun, 12 Mar 2017 15:42:44 -0400 Subject: [PATCH 0359/1644] ARROW-613: WIP TypeScript Implementation This implementation only supports reading arrow files at the moment - but it works in node and in the browser. I've implemented basic numeric types as well as Utf8, Date, List, and Struct. I included a couple of example node scripts, one that just dumps an arrow file's schema, and another that converts it to a (very poorly formatted) CSV. I also included an example of reading an arrow file in the browser, currently hosted here: https://theneuralbit.github.io/arrow/ after selecting an arrow file, that page _should_ display the file contents in a table. 
So far I've only tested it with this file: https://keybase.pub/hulettbh/example-csv.arrow, generated by my colleague @elahrvivaz Author: Brian Hulette Author: Brian Hulette Closes #358 from TheNeuralBit/javascript and squashes the following commits: 74e8520 [Brian Hulette] added a few more license headers ce81034 [Brian Hulette] Cleaned up TextDecoder/Utf8Vector 3f0d9f0 [Brian Hulette] Added missing runtime dependency a5800d9 [Brian Hulette] Added docstrings for Vector functions 3839485 [Brian Hulette] Removed tabs 1f6dcf3 [Brian Hulette] Renamed _arrayToInt 3a92bdd [Brian Hulette] Added ASF Licence headers 8092810 [Brian Hulette] Moved index.html to examples/ directory 0f43270 [Brian Hulette] Replaced table style with an original, basic one c221f74 [Brian Hulette] Create README.md 71e72df [Brian Hulette] Added support for the browser via webpack 00bb974 [Brian Hulette] Initial typescript implementation --- javascript/.gitignore | 4 + javascript/README.md | 50 +++++ javascript/bin/arrow2csv.js | 47 ++++ javascript/bin/arrow_schema.js | 25 +++ javascript/examples/read_file.html | 79 +++++++ javascript/lib/Arrow_generated.d.ts | 5 + javascript/lib/arrow.ts | 201 +++++++++++++++++ javascript/lib/bitarray.ts | 55 +++++ javascript/lib/types.ts | 328 ++++++++++++++++++++++++++++ javascript/package.json | 19 ++ javascript/postinstall.sh | 18 ++ javascript/tsconfig.json | 14 ++ javascript/webpack.config.js | 21 ++ 13 files changed, 866 insertions(+) create mode 100644 javascript/.gitignore create mode 100644 javascript/README.md create mode 100755 javascript/bin/arrow2csv.js create mode 100755 javascript/bin/arrow_schema.js create mode 100644 javascript/examples/read_file.html create mode 100644 javascript/lib/Arrow_generated.d.ts create mode 100644 javascript/lib/arrow.ts create mode 100644 javascript/lib/bitarray.ts create mode 100644 javascript/lib/types.ts create mode 100644 javascript/package.json create mode 100755 javascript/postinstall.sh create mode 100644 javascript/tsconfig.json create mode 100644 javascript/webpack.config.js diff --git a/javascript/.gitignore b/javascript/.gitignore new file mode 100644 index 0000000000000..3b97e3ab95707 --- /dev/null +++ b/javascript/.gitignore @@ -0,0 +1,4 @@ +lib/*_generated.js +dist +node_modules +typings diff --git a/javascript/README.md b/javascript/README.md new file mode 100644 index 0000000000000..98ef75674ede0 --- /dev/null +++ b/javascript/README.md @@ -0,0 +1,50 @@ + + +### Installation + +From this directory, run: + +``` bash +$ npm install # pull dependencies +$ tsc # build typescript +$ webpack # bundle for the browser +``` + +### Usage +The library is designed to be used with node.js or in the browser, this repository contains examples of both. + +#### Node +Import the arrow module: + +``` js +var arrow = require("arrow.js"); +``` + +See [bin/arrow_schema.js](bin/arrow_schema.js) and [bin/arrow2csv.js](bin/arrow2csv.js) for usage examples. + +#### Browser +Include `dist/arrow-bundle.js` in a ` + + + +

+ + + + +
+ + + diff --git a/javascript/lib/Arrow_generated.d.ts b/javascript/lib/Arrow_generated.d.ts new file mode 100644 index 0000000000000..1f5b4547a478c --- /dev/null +++ b/javascript/lib/Arrow_generated.d.ts @@ -0,0 +1,5 @@ +export var org: { + apache: { + arrow: any + } +} diff --git a/javascript/lib/arrow.ts b/javascript/lib/arrow.ts new file mode 100644 index 0000000000000..0762885aef8cc --- /dev/null +++ b/javascript/lib/arrow.ts @@ -0,0 +1,201 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +import { flatbuffers } from 'flatbuffers'; +import { org } from './Arrow_generated'; +var arrow = org.apache.arrow; +import { vectorFromField, Vector } from './types'; + +export function loadVectors(buf) { + var fileLength = buf.length, bb, footerLengthOffset, footerLength, + footerOffset, footer, schema, field, type, type_str, i, + len, rb_metas, rb_meta, rtrn, recordBatchBlock, recordBatchBlocks = []; + var vectors : Vector[] = []; + + bb = new flatbuffers.ByteBuffer(buf); + + footer = _loadFooter(bb); + + schema = footer.schema(); + + for (i = 0, len = schema.fieldsLength(); i < len; i += 1|0) { + field = schema.fields(i); + vectors.push(vectorFromField(field)); + } + + for (i = 0; i < footer.recordBatchesLength(); i += 1|0) { + recordBatchBlock = footer.recordBatches(i); + recordBatchBlocks.push({ + offset: recordBatchBlock.offset(), + metaDataLength: recordBatchBlock.metaDataLength(), + bodyLength: recordBatchBlock.bodyLength(), + }) + } + + loadBuffersIntoVectors(recordBatchBlocks, bb, vectors); + var rtrn : any = {}; + for (var i : any = 0; i < vectors.length; i += 1|0) { + rtrn[vectors[i].name] = vectors[i] + } + return rtrn; +} + +export function loadSchema(buf) { + var footer = _loadFooter(new flatbuffers.ByteBuffer(buf)); + var schema = footer.schema(); + + return parseSchema(schema); +} + +function _loadFooter(bb) { + var fileLength: number = bb.bytes_.length; + + if (fileLength < MAGIC.length*2 + 4) { + console.error("file too small " + fileLength); + return; + } + + if (!_checkMagic(bb.bytes_, 0)) { + console.error("missing magic bytes at beginning of file") + return; + } + + if (!_checkMagic(bb.bytes_, fileLength - MAGIC.length)) { + console.error("missing magic bytes at end of file") + return; + } + + var footerLengthOffset: number = fileLength - MAGIC.length - 4; + bb.setPosition(footerLengthOffset); + var footerLength: number = Int64FromByteBuffer(bb, footerLengthOffset) + + if (footerLength <= 0 || footerLength + MAGIC.length*2 + 4 > fileLength) { + console.log("Invalid footer length: " + footerLength) + } + + var footerOffset: number = footerLengthOffset - footerLength; + bb.setPosition(footerOffset); + var footer = arrow.flatbuf.Footer.getRootAsFooter(bb); + + return footer; +} + +function Int64FromByteBuffer(bb, offset) { + 
return ((bb.bytes_[offset + 3] & 255) << 24) | + ((bb.bytes_[offset + 2] & 255) << 16) | + ((bb.bytes_[offset + 1] & 255) << 8) | + ((bb.bytes_[offset] & 255)); +} + + +var MAGIC_STR = "ARROW1"; +var MAGIC = new Uint8Array(MAGIC_STR.length); +for (var i = 0; i < MAGIC_STR.length; i += 1|0) { + MAGIC[i] = MAGIC_STR.charCodeAt(i); +} + +function _checkMagic(buf, index) { + for (var i = 0; i < MAGIC.length; i += 1|0) { + if (MAGIC[i] != buf[index + i]) { + return false; + } + } + return true; +} + +var TYPEMAP = {} +TYPEMAP[arrow.flatbuf.Type.NONE] = "NONE"; +TYPEMAP[arrow.flatbuf.Type.Null] = "Null"; +TYPEMAP[arrow.flatbuf.Type.Int] = "Int"; +TYPEMAP[arrow.flatbuf.Type.FloatingPoint] = "FloatingPoint"; +TYPEMAP[arrow.flatbuf.Type.Binary] = "Binary"; +TYPEMAP[arrow.flatbuf.Type.Utf8] = "Utf8"; +TYPEMAP[arrow.flatbuf.Type.Bool] = "Bool"; +TYPEMAP[arrow.flatbuf.Type.Decimal] = "Decimal"; +TYPEMAP[arrow.flatbuf.Type.Date] = "Date"; +TYPEMAP[arrow.flatbuf.Type.Time] = "Time"; +TYPEMAP[arrow.flatbuf.Type.Timestamp] = "Timestamp"; +TYPEMAP[arrow.flatbuf.Type.Interval] = "Interval"; +TYPEMAP[arrow.flatbuf.Type.List] = "List"; +TYPEMAP[arrow.flatbuf.Type.Struct_] = "Struct"; +TYPEMAP[arrow.flatbuf.Type.Union] = "Union"; + +var VECTORTYPEMAP = {}; +VECTORTYPEMAP[arrow.flatbuf.VectorType.OFFSET] = 'OFFSET'; +VECTORTYPEMAP[arrow.flatbuf.VectorType.DATA] = 'DATA'; +VECTORTYPEMAP[arrow.flatbuf.VectorType.VALIDITY] = 'VALIDITY'; +VECTORTYPEMAP[arrow.flatbuf.VectorType.TYPE] = 'TYPE'; + +function parseField(field) { + var children = []; + for (var i = 0; i < field.childrenLength(); i += 1|0) { + children.push(parseField(field.children(i))); + } + + var layouts = []; + for (var i = 0; i < field.layoutLength(); i += 1|0) { + layouts.push(VECTORTYPEMAP[field.layout(i).type()]); + + } + + return { + name: field.name(), + nullable: field.nullable(), + type: TYPEMAP[field.typeType()], + children: children, + layout: layouts + }; +} + +function parseSchema(schema) { + var result = []; + var this_result, type; + for (var i = 0, len = schema.fieldsLength(); i < len; i += 1|0) { + result.push(parseField(schema.fields(i))); + } + return result; +} + +function parseBuffer(buffer) { + return { + offset: buffer.offset(), + length: buffer.length() + }; +} + +function loadBuffersIntoVectors(recordBatchBlocks, bb, vectors : Vector[]) { + var fieldNode, recordBatchBlock, recordBatch, numBuffers, bufReader = {index: 0, node_index: 1}, field_ctr = 0; + var buffer = bb.bytes_.buffer; + var baseOffset = bb.bytes_.byteOffset; + for (var i = recordBatchBlocks.length - 1; i >= 0; i -= 1|0) { + recordBatchBlock = recordBatchBlocks[i]; + bb.setPosition(recordBatchBlock.offset.low); + recordBatch = arrow.flatbuf.RecordBatch.getRootAsRecordBatch(bb); + bufReader.index = 0; + bufReader.node_index = 0; + numBuffers = recordBatch.buffersLength(); + + //console.log('num buffers: ' + recordBatch.buffersLength()); + //console.log('num nodes: ' + recordBatch.nodesLength()); + + while (bufReader.index < numBuffers) { + //console.log('Allocating buffers starting at ' + bufReader.index + '/' + numBuffers + ' to field ' + field_ctr); + vectors[field_ctr].loadData(recordBatch, buffer, bufReader, baseOffset + recordBatchBlock.offset.low + recordBatchBlock.metaDataLength) + field_ctr += 1; + } + } +} diff --git a/javascript/lib/bitarray.ts b/javascript/lib/bitarray.ts new file mode 100644 index 0000000000000..82fff32c194fa --- /dev/null +++ b/javascript/lib/bitarray.ts @@ -0,0 +1,55 @@ +// Licensed to the Apache Software Foundation (ASF) under one 
+// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +export class BitArray { + private view: Uint8Array; + constructor(buffer: ArrayBuffer, offset: number, length: number) { + //if (ArrayBuffer.isView(buffer)) { + // var og_view = buffer; + // buffer = buffer.buffer; + // offset = og_view.offset; + // length = og_view.length/og_view.BYTES_PER_ELEMENT*8; + //} else if (buffer instanceof ArrayBuffer) { + var offset = offset || 0; + var length = length;// || buffer.length*8; + //} else if (buffer instanceof Number) { + // length = buffer; + // buffer = new ArrayBuffer(Math.ceil(length/8)); + // offset = 0; + //} + + this.view = new Uint8Array(buffer, offset, Math.ceil(length/8)); + } + + get(i) { + var index = (i >> 3) | 0; // | 0 converts to an int. Math.floor works too. + var bit = i % 8; // i % 8 is just as fast as i & 7 + return (this.view[index] & (1 << bit)) !== 0; + } + + set(i) { + var index = (i >> 3) | 0; + var bit = i % 8; + this.view[index] |= 1 << bit; + } + + unset(i) { + var index = (i >> 3) | 0; + var bit = i % 8; + this.view[index] &= ~(1 << bit); + } +} diff --git a/javascript/lib/types.ts b/javascript/lib/types.ts new file mode 100644 index 0000000000000..bbc755810056f --- /dev/null +++ b/javascript/lib/types.ts @@ -0,0 +1,328 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
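
// Usage sketch (an illustrative aside; assumes a complete Arrow file buffer
// and the bundled flatbuffers runtime, see lib/arrow.ts):
//
//   var table = loadVectors(buf);   // { columnName: Vector, ... }
//   var col = table['a_column'];
//   for (var i = 0; i < col.length; i += 1) {
//     console.log(col.get(i));      // nullable vectors yield null when unset
//   }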
+ +import { BitArray } from './bitarray'; +import { TextDecoder } from 'text-encoding'; +import { org } from './Arrow_generated'; +var arrow = org.apache.arrow; + +interface ArrayView { + slice(start: number, end: number) : ArrayView + toString() : string +} + +export abstract class Vector { + name: string; + length: number; + null_count: number; + constructor(name: string) { + this.name = name; + } + /* Access datum at index i */ + abstract get(i); + /* Return array representing data in the range [start, end) */ + abstract slice(start: number, end: number); + + /* Use recordBatch fieldNodes and Buffers to construct this Vector */ + public loadData(recordBatch: any, buffer: any, bufReader: any, baseOffset: any) { + var fieldNode = recordBatch.nodes(bufReader.node_index); + this.length = fieldNode.length(); + this.null_count = fieldNode.length(); + bufReader.node_index += 1|0; + + this.loadBuffers(recordBatch, buffer, bufReader, baseOffset); + } + + protected abstract loadBuffers(recordBatch: any, buffer: any, bufReader: any, baseOffset: any); + + /* Helper function for loading a VALIDITY buffer (for Nullable types) */ + static loadValidityBuffer(recordBatch, buffer, bufReader, baseOffset) : BitArray { + var buf_meta = recordBatch.buffers(bufReader.index); + var offset = baseOffset + buf_meta.offset().low; + var length = buf_meta.length().low; + bufReader.index += 1|0; + return new BitArray(buffer, offset, length*8); + } + + /* Helper function for loading an OFFSET buffer */ + static loadOffsetBuffer(recordBatch, buffer, bufReader, baseOffset) : Int32Array { + var buf_meta = recordBatch.buffers(bufReader.index); + var offset = baseOffset + buf_meta.offset().low; + var length = buf_meta.length().low/Int32Array.BYTES_PER_ELEMENT; + bufReader.index += 1|0; + return new Int32Array(buffer, offset, length); + } + +} + +class SimpleVector extends Vector { + protected dataView: T; + private TypedArray: {new(buffer: any, offset: number, length: number) : T, BYTES_PER_ELEMENT: number}; + + constructor (TypedArray: {new(buffer: any, offset: number, length: number): T, BYTES_PER_ELEMENT: number}, name: string) { + super(name); + this.TypedArray = TypedArray; + } + + get(i) { + return this.dataView[i]; + } + + loadBuffers(recordBatch, buffer, bufReader, baseOffset) { + this.dataView = this.loadDataBuffer(recordBatch, buffer, bufReader, baseOffset); + } + + loadDataBuffer(recordBatch, buffer, bufReader, baseOffset) : T { + var buf_meta = recordBatch.buffers(bufReader.index); + var offset = baseOffset + buf_meta.offset().low; + var length = buf_meta.length().low/this.TypedArray.BYTES_PER_ELEMENT; + bufReader.index += 1|0; + return new this.TypedArray(buffer, offset, length); + } + + getDataView() { + return this.dataView; + } + + toString() { + return this.dataView.toString(); + } + + slice(start, end) { + return this.dataView.slice(start, end); + } +} + +class NullableSimpleVector extends SimpleVector { + private validityView: BitArray; + + get(i: number) { + if (this.validityView.get(i)) return this.dataView[i]; + else return null + } + + loadBuffers(recordBatch, buffer, bufReader, baseOffset) { + this.validityView = Vector.loadValidityBuffer(recordBatch, buffer, bufReader, baseOffset); + this.dataView = this.loadDataBuffer(recordBatch, buffer, bufReader, baseOffset); + } + +} + +class Uint8Vector extends SimpleVector { constructor(name: string) { super(Uint8Array, name); }; } +class Uint16Vector extends SimpleVector { constructor(name: string) { super(Uint16Array, name); }; } +class Uint32Vector 
extends SimpleVector { constructor(name: string) { super(Uint32Array, name); }; } +class Int8Vector extends SimpleVector { constructor(name: string) { super(Uint8Array, name); }; } +class Int16Vector extends SimpleVector { constructor(name: string) { super(Uint16Array, name); }; } +class Int32Vector extends SimpleVector { constructor(name: string) { super(Uint32Array, name); }; } +class Float32Vector extends SimpleVector { constructor(name: string) { super(Float32Array, name); }; } +class Float64Vector extends SimpleVector { constructor(name: string) { super(Float64Array, name); }; } + +class NullableUint8Vector extends NullableSimpleVector { constructor(name: string) { super(Uint8Array, name); }; } +class NullableUint16Vector extends NullableSimpleVector { constructor(name: string) { super(Uint16Array, name); }; } +class NullableUint32Vector extends NullableSimpleVector { constructor(name: string) { super(Uint32Array, name); }; } +class NullableInt8Vector extends NullableSimpleVector { constructor(name: string) { super(Uint8Array, name); }; } +class NullableInt16Vector extends NullableSimpleVector { constructor(name: string) { super(Uint16Array, name); }; } +class NullableInt32Vector extends NullableSimpleVector { constructor(name: string) { super(Uint32Array, name); }; } +class NullableFloat32Vector extends NullableSimpleVector { constructor(name: string) { super(Float32Array, name); }; } +class NullableFloat64Vector extends NullableSimpleVector { constructor(name: string) { super(Float64Array, name); }; } + +class Utf8Vector extends SimpleVector { + protected offsetView: Int32Array; + static decoder: TextDecoder = new TextDecoder('utf8'); + + constructor(name: string) { + super(Uint8Array, name); + } + + loadBuffers(recordBatch, buffer, bufReader, baseOffset) { + this.offsetView = Vector.loadOffsetBuffer(recordBatch, buffer, bufReader, baseOffset); + this.dataView = this.loadDataBuffer(recordBatch, buffer, bufReader, baseOffset); + } + + get(i) { + return Utf8Vector.decoder.decode + (this.dataView.slice(this.offsetView[i], this.offsetView[i + 1])); + } + + slice(start: number, end: number) { + var rtrn: string[] = []; + for (var i: number = start; i < end; i += 1|0) { + rtrn.push(this.get(i)); + } + return rtrn; + } +} + +class NullableUtf8Vector extends Utf8Vector { + private validityView: BitArray; + + loadBuffers(recordBatch, buffer, bufReader, baseOffset) { + this.validityView = Vector.loadValidityBuffer(recordBatch, buffer, bufReader, baseOffset); + this.offsetView = Vector.loadOffsetBuffer(recordBatch, buffer, bufReader, baseOffset); + this.dataView = this.loadDataBuffer(recordBatch, buffer, bufReader, baseOffset); + } + + get(i) { + if (!this.validityView.get(i)) return null; + return super.get(i); + } +} + +// Nested Types +class ListVector extends Uint32Vector { + private dataVector: Vector; + + constructor(name, dataVector : Vector) { + super(name); + this.dataVector = dataVector; + } + + loadBuffers(recordBatch, buffer, bufReader, baseOffset) { + super.loadBuffers(recordBatch, buffer, bufReader, baseOffset); + this.dataVector.loadData(recordBatch, buffer, bufReader, baseOffset); + this.length -= 1; + } + + get(i) { + var offset = super.get(i) + if (offset === null) { + return null; + } + var next_offset = super.get(i + 1) + return this.dataVector.slice(offset, next_offset) + } + + toString() { + return "length: " + (this.length); + } + + slice(start : number, end : number) { return []; }; +} + +class NullableListVector extends ListVector { + private validityView: BitArray; + 
+ loadBuffers(recordBatch, buffer, bufReader, baseOffset) { + this.validityView = Vector.loadValidityBuffer(recordBatch, buffer, bufReader, baseOffset); + super.loadBuffers(recordBatch, buffer, bufReader, baseOffset); + } + + get(i) { + if (!this.validityView.get(i)) return null; + return super.get(i); + } +} + +class StructVector extends Vector { + private validityView: BitArray; + private vectors : Vector[]; + constructor(name: string, vectors: Vector[]) { + super(name); + this.vectors = vectors; + } + + loadBuffers(recordBatch, buffer, bufReader, baseOffset) { + this.validityView = Vector.loadValidityBuffer(recordBatch, buffer, bufReader, baseOffset); + this.vectors.forEach((v: Vector) => v.loadData(recordBatch, buffer, bufReader, baseOffset)); + } + + get(i : number) { + if (!this.validityView.get(i)) return null; + return this.vectors.map((v: Vector) => v.get(i)); + } + + slice(start : number, end : number) { + var rtrn = []; + for (var i: number = start; i < end; i += 1|0) { + rtrn.push(this.get(i)); + } + return rtrn; + } +} + +class DateVector extends SimpleVector { + constructor (name: string) { + super(Uint32Array, name); + } + + get (i) { + return new Date(super.get(2*i+1)*Math.pow(2,32) + super.get(2*i)); + } +} + +class NullableDateVector extends DateVector { + private validityView: BitArray; + + loadBuffers(recordBatch, buffer, bufReader, baseOffset) { + this.validityView = Vector.loadValidityBuffer(recordBatch, buffer, bufReader, baseOffset); + super.loadBuffers(recordBatch, buffer, bufReader, baseOffset); + } + + get (i) { + if (!this.validityView.get(i)) return null; + return super.get(i); + } +} + +var BASIC_TYPES = [arrow.flatbuf.Type.Int, arrow.flatbuf.Type.FloatingPoint, arrow.flatbuf.Type.Utf8, arrow.flatbuf.Type.Date]; + +export function vectorFromField(field) : Vector { + var typeType = field.typeType(); + if (BASIC_TYPES.indexOf(typeType) >= 0) { + var type = field.typeType(); + if (type === arrow.flatbuf.Type.Int) { + type = field.type(new arrow.flatbuf.Int()); + var VectorConstructor : {new(string) : Vector}; + if (type.isSigned()) { + if (type.bitWidth() == 32) + VectorConstructor = field.nullable() ? NullableInt32Vector : Int32Vector; + else if (type.bitWidth() == 16) + VectorConstructor = field.nullable() ? NullableInt16Vector : Int16Vector; + else if (type.bitWidth() == 8) + VectorConstructor = field.nullable() ? NullableInt8Vector : Int8Vector; + } else { + if (type.bitWidth() == 32) + VectorConstructor = field.nullable() ? NullableUint32Vector : Uint32Vector; + else if (type.bitWidth() == 16) + VectorConstructor = field.nullable() ? NullableUint16Vector : Uint16Vector; + else if (type.bitWidth() == 8) + VectorConstructor = field.nullable() ? NullableUint8Vector : Uint8Vector; + } + } else if (type === arrow.flatbuf.Type.FloatingPoint) { + type = field.type(new arrow.flatbuf.FloatingPoint()); + if (type.precision() == arrow.flatbuf.Precision.SINGLE) + VectorConstructor = field.nullable() ? NullableFloat32Vector : Float32Vector; + else if (type.precision() == arrow.flatbuf.Precision.DOUBLE) + VectorConstructor = field.nullable() ? NullableFloat64Vector : Float64Vector; + } else if (type === arrow.flatbuf.Type.Utf8) { + VectorConstructor = field.nullable() ? NullableUtf8Vector : Utf8Vector; + } else if (type === arrow.flatbuf.Type.Date) { + VectorConstructor = field.nullable() ? 
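+                // Date values span two uint32 slots per element; DateVector.get
+                // reassembles the 64-bit millisecond timestamp (hi * 2^32 + lo).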
NullableDateVector : DateVector; + } + + return new VectorConstructor(field.name()); + } else if (typeType === arrow.flatbuf.Type.List) { + var dataVector = vectorFromField(field.children(0)); + return field.nullable() ? new NullableListVector(field.name(), dataVector) : new ListVector(field.name(), dataVector); + } else if (typeType === arrow.flatbuf.Type.Struct_) { + var vectors : Vector[] = []; + for (var i : number = 0; i < field.childrenLength(); i += 1|0) { + vectors.push(vectorFromField(field.children(i))); + } + return new StructVector(field.name(), vectors); + } +} diff --git a/javascript/package.json b/javascript/package.json new file mode 100644 index 0000000000000..b1e583b7d9da6 --- /dev/null +++ b/javascript/package.json @@ -0,0 +1,19 @@ +{ + "name": "arrow", + "version": "0.0.0", + "description": "", + "main": "dist/arrow.js", + "scripts": { + "postinstall": "./postinstall.sh", + "test": "echo \"Error: no test specified\" && exit 1" + }, + "author": "", + "license": "Apache-2.0", + "devDependencies": { + "flatbuffers": "^1.5.0", + "text-encoding": "^0.6.4" + }, + "dependencies": { + "commander": "^2.9.0" + } +} diff --git a/javascript/postinstall.sh b/javascript/postinstall.sh new file mode 100755 index 0000000000000..1e6622fa4f2ee --- /dev/null +++ b/javascript/postinstall.sh @@ -0,0 +1,18 @@ +#!/bin/bash + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + +echo "Compiling flatbuffer schemas..." +#flatc -o lib --js ../format/Message.fbs ../format/File.fbs +flatc -o lib --js ../format/*.fbs +cat lib/*_generated.js > lib/Arrow_generated.js diff --git a/javascript/tsconfig.json b/javascript/tsconfig.json new file mode 100644 index 0000000000000..89c31ef85a143 --- /dev/null +++ b/javascript/tsconfig.json @@ -0,0 +1,14 @@ +{ + "compilerOptions": { + "outDir": "./dist/", + "allowJs": true, + "target": "es5", + "module": "commonjs", + "moduleResolution": "node" + }, + "include": [ + "typings/index.d.ts", + "lib/*.js", + "lib/*.ts" + ] +} diff --git a/javascript/webpack.config.js b/javascript/webpack.config.js new file mode 100644 index 0000000000000..a0ed56370f6b1 --- /dev/null +++ b/javascript/webpack.config.js @@ -0,0 +1,21 @@ +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. See accompanying LICENSE file. 
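+// Bundles the compiled TypeScript (dist/arrow.js, per tsconfig.json above)
+// into dist/arrow-bundle.js, exposed to browsers as the global `arrow`;
+// postinstall.sh has already compiled the flatbuffer schemas into lib/.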
+
+module.exports = {
+    entry: './dist/arrow.js',
+    output: {
+        path: __dirname + '/dist',
+        filename: 'arrow-bundle.js',
+        libraryTarget: 'var',
+        library: 'arrow'
+    }
+};

From 2cf36ef2d6aea6a5ddf32c900d33db40d728bcd9 Mon Sep 17 00:00:00 2001
From: "Uwe L. Korn"
Date: Sun, 12 Mar 2017 23:18:23 -0400
Subject: [PATCH 0360/1644] ARROW-574: Python: Add support for nested Python
 lists in Pandas conversion

Author: Uwe L. Korn

Closes #364 from xhochy/ARROW-574 and squashes the following commits:

3ef02ae [Uwe L. Korn] ARROW-574: Python: Add support for nested Python lists in Pandas conversion
---
 python/pyarrow/tests/pandas_examples.py     | 40 +++++++++++++++++++++
 python/pyarrow/tests/test_convert_pandas.py | 14 ++++++--
 python/pyarrow/tests/test_parquet.py        | 17 +++++++--
 python/src/pyarrow/adapters/builtin.cc      | 27 +++++++-------
 python/src/pyarrow/adapters/builtin.h       |  7 ++++
 python/src/pyarrow/adapters/pandas.cc       | 25 +++++++++++--
 6 files changed, 112 insertions(+), 18 deletions(-)

diff --git a/python/pyarrow/tests/pandas_examples.py b/python/pyarrow/tests/pandas_examples.py
index 63af42348026c..c9343fce233d2 100644
--- a/python/pyarrow/tests/pandas_examples.py
+++ b/python/pyarrow/tests/pandas_examples.py
@@ -76,3 +76,43 @@ def dataframe_with_arrays():
     schema = pa.Schema.from_fields(fields)
 
     return df, schema
+
+def dataframe_with_lists():
+    """
+    DataFrame with list columns of several primitive types (int64, double,
+    string).
+
+    Returns
+    -------
+    df: pandas.DataFrame
+    schema: pyarrow.Schema
+        Arrow schema definition that is in line with the constructed df.
+    """
+    arrays = OrderedDict()
+    fields = []
+
+    fields.append(pa.field('int64', pa.list_(pa.int64())))
+    arrays['int64'] = [
+        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
+        [0, 1, 2, 3, 4],
+        None,
+        [0]
+    ]
+    fields.append(pa.field('double', pa.list_(pa.double())))
+    arrays['double'] = [
+        [0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
+        [0., 1., 2., 3., 4.],
+        None,
+        [0.]
+ ] + fields.append(pa.field('str_list', pa.list_(pa.string()))) + arrays['str_list'] = [ + [u"1", u"ä"], + None, + [u"1"], + [u"1", u"2", u"3"] + ] + + df = pd.DataFrame(arrays) + schema = pa.Schema.from_fields(fields) + + return df, schema diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index 953fa2c4b9a72..a79bb2392ea6c 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -30,7 +30,7 @@ from pyarrow.compat import u import pyarrow as A -from .pandas_examples import dataframe_with_arrays +from .pandas_examples import dataframe_with_arrays, dataframe_with_lists def _alltypes_example(size=100): @@ -333,7 +333,7 @@ def test_date(self): expected['date'] = pd.to_datetime(df['date']) tm.assert_frame_equal(result, expected) - def test_column_of_lists(self): + def test_column_of_arrays(self): df, schema = dataframe_with_arrays() self._check_pandas_roundtrip(df, schema=schema, expected_schema=schema) table = A.Table.from_pandas(df, schema=schema) @@ -343,6 +343,16 @@ def test_column_of_lists(self): field = schema.field_by_name(column) self._check_array_roundtrip(df[column], field=field) + def test_column_of_lists(self): + df, schema = dataframe_with_lists() + self._check_pandas_roundtrip(df, schema=schema, expected_schema=schema) + table = A.Table.from_pandas(df, schema=schema) + assert table.schema.equals(schema) + + for column in df.columns: + field = schema.field_by_name(column) + self._check_array_roundtrip(df[column], field=field) + def test_threaded_conversion(self): df = _alltypes_example() self._check_pandas_roundtrip(df, nthreads=2, diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index 96f2d15e312f2..c72ff9e862b76 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -23,7 +23,7 @@ from pyarrow.compat import guid import pyarrow as pa import pyarrow.io as paio -from .pandas_examples import dataframe_with_arrays +from .pandas_examples import dataframe_with_arrays, dataframe_with_lists import numpy as np import pandas as pd @@ -322,7 +322,7 @@ def test_compare_schemas(): @parquet -def test_column_of_lists(tmpdir): +def test_column_of_arrays(tmpdir): df, schema = dataframe_with_arrays() filename = tmpdir.join('pandas_rountrip.parquet') @@ -334,6 +334,19 @@ def test_column_of_lists(tmpdir): pdt.assert_frame_equal(df, df_read) +@parquet +def test_column_of_lists(tmpdir): + df, schema = dataframe_with_lists() + + filename = tmpdir.join('pandas_rountrip.parquet') + arrow_table = pa.Table.from_pandas(df, timestamps_to_ms=True, + schema=schema) + pq.write_table(arrow_table, filename.strpath, version="2.0") + table_read = pq.read_table(filename.strpath) + df_read = table_read.to_pandas() + pdt.assert_frame_equal(df, df_read) + + @parquet def test_multithreaded_read(): df = alltypes_sample(size=10000) diff --git a/python/src/pyarrow/adapters/builtin.cc b/python/src/pyarrow/adapters/builtin.cc index c125cc078af88..4f7b2cb09e1e6 100644 --- a/python/src/pyarrow/adapters/builtin.cc +++ b/python/src/pyarrow/adapters/builtin.cc @@ -206,8 +206,7 @@ class SeqVisitor { }; // Non-exhaustive type inference -static Status InferArrowType( - PyObject* obj, int64_t* size, std::shared_ptr* out_type) { +Status InferArrowType(PyObject* obj, int64_t* size, std::shared_ptr* out_type) { *size = PySequence_Size(obj); if (PyErr_Occurred()) { // Not a sequence @@ -496,6 +495,19 @@ Status ListConverter::Init(const std::shared_ptr& 
builder) { return Status::OK(); } +Status AppendPySequence(PyObject* obj, const std::shared_ptr& type, + const std::shared_ptr& builder) { + std::shared_ptr converter = GetConverter(type); + if (converter == nullptr) { + std::stringstream ss; + ss << "No type converter implemented for " << type->ToString(); + return Status::NotImplemented(ss.str()); + } + converter->Init(builder); + + return converter->AppendData(obj); +} + Status ConvertPySequence( PyObject* obj, MemoryPool* pool, std::shared_ptr* out) { std::shared_ptr type; @@ -509,19 +521,10 @@ Status ConvertPySequence( return Status::OK(); } - std::shared_ptr converter = GetConverter(type); - if (converter == nullptr) { - std::stringstream ss; - ss << "No type converter implemented for " << type->ToString(); - return Status::NotImplemented(ss.str()); - } - // Give the sequence converter an array builder std::shared_ptr builder; RETURN_NOT_OK(arrow::MakeBuilder(pool, type, &builder)); - converter->Init(builder); - - RETURN_NOT_OK(converter->AppendData(obj)); + RETURN_NOT_OK(AppendPySequence(obj, type, builder)); return builder->Finish(out); } diff --git a/python/src/pyarrow/adapters/builtin.h b/python/src/pyarrow/adapters/builtin.h index 667298e3c5c5f..0c863a5631ada 100644 --- a/python/src/pyarrow/adapters/builtin.h +++ b/python/src/pyarrow/adapters/builtin.h @@ -37,6 +37,13 @@ class Status; namespace pyarrow { +PYARROW_EXPORT arrow::Status InferArrowType( + PyObject* obj, int64_t* size, std::shared_ptr* out_type); + +PYARROW_EXPORT arrow::Status AppendPySequence(PyObject* obj, + const std::shared_ptr& type, + const std::shared_ptr& builder); + PYARROW_EXPORT arrow::Status ConvertPySequence( PyObject* obj, arrow::MemoryPool* pool, std::shared_ptr* out); diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index eb3ab49f58892..40079b49b9638 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -19,6 +19,7 @@ #include +#include "pyarrow/adapters/builtin.h" #include "pyarrow/adapters/pandas.h" #include "pyarrow/numpy_interop.h" @@ -1661,6 +1662,7 @@ inline Status ArrowSerializer::ConvertTypedLists( typedef npy_traits traits; typedef typename traits::value_type T; typedef typename traits::BuilderClass BuilderT; + PyAcquireGIL lock; auto value_builder = std::make_shared(pool_, field->type); ListBuilder list_builder(pool_, value_builder); @@ -1688,7 +1690,16 @@ inline Status ArrowSerializer::ConvertTypedLists( RETURN_NOT_OK(value_builder->Append(data, size)); } } else if (PyList_Check(objects[i])) { - return Status::TypeError("Python lists are not yet supported"); + int64_t size; + std::shared_ptr type; + RETURN_NOT_OK(list_builder.Append(true)); + RETURN_NOT_OK(InferArrowType(objects[i], &size, &type)); + if (type->type != field->type->type) { + std::stringstream ss; + ss << type->ToString() << " cannot be converted to " << field->type->ToString(); + return Status::TypeError(ss.str()); + } + RETURN_NOT_OK(AppendPySequence(objects[i], field->type, value_builder)); } else { return Status::TypeError("Unsupported Python type for list items"); } @@ -1702,6 +1713,7 @@ inline Status ArrowSerializer::ConvertTypedLists( const std::shared_ptr& field, std::shared_ptr* out) { // TODO: If there are bytes involed, convert to Binary representation + PyAcquireGIL lock; bool have_bytes = false; auto value_builder = std::make_shared(pool_); @@ -1721,7 +1733,16 @@ ArrowSerializer::ConvertTypedLists( auto data = reinterpret_cast(PyArray_DATA(numpy_array)); 
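+        // Bulk-append the ndarray's strings; plain Python lists take the new
+        // branch below, which infers the element type and requires STRING.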
RETURN_NOT_OK(AppendObjectStrings(*value_builder.get(), data, size, &have_bytes)); } else if (PyList_Check(objects[i])) { - return Status::TypeError("Python lists are not yet supported"); + int64_t size; + std::shared_ptr type; + RETURN_NOT_OK(list_builder.Append(true)); + RETURN_NOT_OK(InferArrowType(objects[i], &size, &type)); + if (type->type != Type::STRING) { + std::stringstream ss; + ss << type->ToString() << " cannot be converted to STRING."; + return Status::TypeError(ss.str()); + } + RETURN_NOT_OK(AppendPySequence(objects[i], type, value_builder)); } else { return Status::TypeError("Unsupported Python type for list items"); } From 331be4923ac4b30dafa7e79785b71b89ddeb8f3c Mon Sep 17 00:00:00 2001 From: Miki Tebeka Date: Mon, 13 Mar 2017 13:16:04 -0400 Subject: [PATCH 0361/1644] ARROW-623: Fix segfault in __repr__ of empty field Small fix for not segfaulting when printing an empty field. Author: Miki Tebeka Closes #374 from tebeka/field-bug and squashes the following commits: abcd118 [Miki Tebeka] ARROW-623: Fix segfault in __repr__ of empty field --- python/pyarrow/schema.pyx | 3 +++ python/pyarrow/tests/test_schema.py | 7 +++++++ 2 files changed, 10 insertions(+) diff --git a/python/pyarrow/schema.pyx b/python/pyarrow/schema.pyx index 19910aba00427..d636b5a10bb58 100644 --- a/python/pyarrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ -111,6 +111,9 @@ cdef class Field: property name: def __get__(self): + if box_field(self.sp_field) is None: + raise ReferenceError( + 'Field not initialized (references NULL pointer)') return frombytes(self.field.name) diff --git a/python/pyarrow/tests/test_schema.py b/python/pyarrow/tests/test_schema.py index f6dc33c75dfb8..dd68f396a6888 100644 --- a/python/pyarrow/tests/test_schema.py +++ b/python/pyarrow/tests/test_schema.py @@ -87,3 +87,10 @@ def test_schema_equals(self): del fields[-1] sch3 = A.schema(fields) assert not sch1.equals(sch3) + + +class TestField(unittest.TestCase): + def test_empty_field(self): + f = arrow.Field() + with self.assertRaises(ReferenceError): + repr(f) From 00df40ceab48a97fb9f1404ca6a0049e88d0c461 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 13 Mar 2017 16:15:50 -0400 Subject: [PATCH 0362/1644] ARROW-618: [Python/C++] Support timestamp+timezone conversion to pandas This was a massive pain. This patch brings us up to feature parity with the stuff that was in Feather. The diff is larger than I would like mostly from moving around code in `pyarrow/adapters/pandas.cc`. I suggest we split up that file at our earliest opportunity into the "reader" and "writer" portion at least. The main work here was refactoring so that the data type for non-object arrays is computed up front (so it might be `timestamp('ns', tz='US/Eastern')`, then we use the visitor pattern to produce the right kind of array. This will also permit implicit type casts and conversions to integer from float because the type metadata is an input parameter. Things are getting to be a bit of a mess here so we should do some refactoring eventually, and probably also add some microbenchmarks since this stuff is performance sensitive. 
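In user-facing terms, the change makes roundtrips like the following preserve the timezone rather than silently coercing to UTC. This is a sketch against the pyarrow API as exercised by the updated tests in this patch, not a verbatim excerpt:

```python
# Sketch of the timezone-preserving roundtrip exercised by the updated
# test_timestamps_with_timezone (pyarrow API as of this patch).
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
    'ts': pd.to_datetime(['2007-07-13 01:23:34.123',
                          '2010-08-13 05:46:57.437'])
})
df['ts'] = df['ts'].dt.tz_localize('US/Eastern')

table = pa.Table.from_pandas(df)
# The Arrow schema now carries timestamp('ns', tz='US/Eastern'),
# so the values come back zone-aware.
result = table.to_pandas()
assert result['ts'].dt.tz is not None
```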
I also changed the C++ `pyarrow` namespace to `arrow::py` which will make it less painful to move that code tree to `cpp/src/arrow/python` at some point Author: Wes McKinney Closes #375 from wesm/ARROW-618 and squashes the following commits: 4b18bfa [Wes McKinney] Fix rebase conflict 5bc3724 [Wes McKinney] Fix rebase issues 870986f [Wes McKinney] Refactor ArrowSerializer to not be a template and use visitor pattern using passed-in data type. Fix DatetimeTZDtype pandas logic. Arrow Change pyarrow namespace to arrow::py --- cpp/src/arrow/type.cc | 25 + cpp/src/arrow/type.h | 2 +- python/pyarrow/__init__.py | 3 + python/pyarrow/array.pyx | 76 +- python/pyarrow/compat.py | 9 + python/pyarrow/config.pyx | 4 +- python/pyarrow/feather.py | 6 +- python/pyarrow/includes/libarrow.pxd | 11 +- python/pyarrow/includes/pyarrow.pxd | 19 +- python/pyarrow/schema.pxd | 10 +- python/pyarrow/schema.pyx | 160 +- python/pyarrow/table.pyx | 14 +- python/pyarrow/tests/test_convert_pandas.py | 33 +- python/pyarrow/tests/test_feather.py | 7 +- python/pyarrow/tests/test_schema.py | 142 +- python/src/pyarrow/adapters/builtin.cc | 61 +- python/src/pyarrow/adapters/builtin.h | 19 +- python/src/pyarrow/adapters/pandas-test.cc | 6 +- python/src/pyarrow/adapters/pandas.cc | 1896 +++++++++---------- python/src/pyarrow/adapters/pandas.h | 44 +- python/src/pyarrow/common.cc | 16 +- python/src/pyarrow/common.h | 17 +- python/src/pyarrow/config.cc | 6 +- python/src/pyarrow/config.h | 15 +- python/src/pyarrow/helpers.cc | 8 +- python/src/pyarrow/helpers.h | 15 +- python/src/pyarrow/io.cc | 12 +- python/src/pyarrow/io.h | 43 +- python/src/pyarrow/numpy_interop.h | 6 +- python/src/pyarrow/type_traits.h | 212 +++ python/src/pyarrow/util/datetime.h | 6 +- python/src/pyarrow/util/test_main.cc | 2 +- python/src/pyarrow/visibility.h | 32 - 33 files changed, 1595 insertions(+), 1342 deletions(-) create mode 100644 python/src/pyarrow/type_traits.h delete mode 100644 python/src/pyarrow/visibility.h diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 4679a2f5b76b6..0cafdce89e562 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -108,6 +108,31 @@ std::string Date32Type::ToString() const { return std::string("date32"); } +static inline void print_time_unit(TimeUnit unit, std::ostream* stream) { + switch (unit) { + case TimeUnit::SECOND: + (*stream) << "s"; + break; + case TimeUnit::MILLI: + (*stream) << "ms"; + break; + case TimeUnit::MICRO: + (*stream) << "us"; + break; + case TimeUnit::NANO: + (*stream) << "ns"; + break; + } +} + +std::string TimestampType::ToString() const { + std::stringstream ss; + ss << "timestamp["; + print_time_unit(this->unit, &ss); + ss << "]"; + return ss.str(); +} + // ---------------------------------------------------------------------- // Union type diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index a838082d7e79a..15b99c5ce4f89 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -495,7 +495,7 @@ struct ARROW_EXPORT TimestampType : public FixedWidthType { TimestampType(const TimestampType& other) : TimestampType(other.unit) {} Status Accept(TypeVisitor* visitor) const override; - std::string ToString() const override { return name(); } + std::string ToString() const override; static std::string name() { return "timestamp"; } TimeUnit unit; diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 6724b52e6004e..a4aac443fae82 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -56,6 +56,8 @@ FloatValue, DoubleValue, 
ListValue, BinaryValue, StringValue) +import pyarrow.schema as _schema + from pyarrow.schema import (null, bool_, int8, int16, int32, int64, uint8, uint16, uint32, uint64, @@ -64,6 +66,7 @@ list_, struct, dictionary, field, DataType, Field, Schema, schema) + from pyarrow.table import Column, RecordBatch, Table, concat_tables diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 6a6b4ba9ad0cb..11244e7836058 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -34,7 +34,8 @@ from pyarrow.memory cimport MemoryPool, maybe_unbox_memory_pool cimport pyarrow.scalar as scalar from pyarrow.scalar import NA -from pyarrow.schema cimport Field, Schema, DictionaryType +from pyarrow.schema cimport (DataType, Field, Schema, DictionaryType, + box_data_type) import pyarrow.schema as schema cimport cpython @@ -45,16 +46,40 @@ cdef _pandas(): return pd +cdef maybe_coerce_datetime64(values, dtype, DataType type, + timestamps_to_ms=False): + + from pyarrow.compat import DatetimeTZDtype + + if values.dtype.type != np.datetime64: + return values, type + + coerce_ms = timestamps_to_ms and values.dtype != 'datetime64[ms]' + + if coerce_ms: + values = values.astype('datetime64[ms]') + + if isinstance(dtype, DatetimeTZDtype): + tz = dtype.tz + unit = 'ms' if coerce_ms else dtype.unit + type = schema.timestamp(unit, tz) + else: + # Trust the NumPy dtype + type = schema.type_from_numpy_dtype(values.dtype) + + return values, type + + cdef class Array: cdef init(self, const shared_ptr[CArray]& sp_array): self.sp_array = sp_array self.ap = sp_array.get() - self.type = DataType() - self.type.init(self.sp_array.get().type()) + self.type = box_data_type(self.sp_array.get().type()) @staticmethod - def from_pandas(obj, mask=None, timestamps_to_ms=False, Field field=None, + def from_pandas(obj, mask=None, DataType type=None, + timestamps_to_ms=False, MemoryPool memory_pool=None): """ Convert pandas.Series to an Arrow Array. @@ -66,6 +91,9 @@ cdef class Array: mask : pandas.Series or numpy.ndarray, optional boolean mask if the object is valid or null + type : pyarrow.DataType + Explicit type to attempt to coerce to + timestamps_to_ms : bool, optional Convert datetime columns to ms resolution. 
This is needed for compatibility with other functionality like Parquet I/O which @@ -107,33 +135,43 @@ cdef class Array: """ cdef: shared_ptr[CArray] out - shared_ptr[CField] c_field + shared_ptr[CDataType] c_type CMemoryPool* pool pd = _pandas() - if field is not None: - c_field = field.sp_field - if mask is not None: mask = get_series_values(mask) - series_values = get_series_values(obj) + values = get_series_values(obj) + pool = maybe_unbox_memory_pool(memory_pool) - if isinstance(series_values, pd.Categorical): + if isinstance(values, pd.Categorical): return DictionaryArray.from_arrays( - series_values.codes, series_values.categories.values, + values.codes, values.categories.values, mask=mask, memory_pool=memory_pool) + elif values.dtype == object: + # Object dtype undergoes a different conversion path as more type + # inference may be needed + if type is not None: + c_type = type.sp_type + with nogil: + check_status(pyarrow.PandasObjectsToArrow( + pool, values, mask, c_type, &out)) else: - if series_values.dtype.type == np.datetime64 and timestamps_to_ms: - series_values = series_values.astype('datetime64[ms]') + values, type = maybe_coerce_datetime64( + values, obj.dtype, type, timestamps_to_ms=timestamps_to_ms) + + if type is None: + check_status(pyarrow.PandasDtypeToArrow(values.dtype, &c_type)) + else: + c_type = type.sp_type - pool = maybe_unbox_memory_pool(memory_pool) with nogil: check_status(pyarrow.PandasToArrow( - pool, series_values, mask, c_field, &out)) + pool, values, mask, c_type, &out)) - return box_array(out) + return box_array(out) @staticmethod def from_list(object list_obj, DataType type=None, @@ -338,6 +376,10 @@ cdef class DateArray(NumericArray): pass +cdef class TimestampArray(NumericArray): + pass + + cdef class FloatArray(FloatingPointArray): pass @@ -423,7 +465,7 @@ cdef dict _array_classes = { Type_LIST: ListArray, Type_BINARY: BinaryArray, Type_STRING: StringArray, - Type_TIMESTAMP: Int64Array, + Type_TIMESTAMP: TimestampArray, Type_DICTIONARY: DictionaryArray } diff --git a/python/pyarrow/compat.py b/python/pyarrow/compat.py index 9148be7d9f8ad..74d7ca2827bc9 100644 --- a/python/pyarrow/compat.py +++ b/python/pyarrow/compat.py @@ -17,9 +17,11 @@ # flake8: noqa +from distutils.version import LooseVersion import itertools import numpy as np +import pandas as pd import sys import six @@ -115,6 +117,13 @@ def encode_file_path(path): return encoded_path +if LooseVersion(pd.__version__) < '0.19.0': + pdapi = pd.core.common + from pandas.core.dtypes import DatetimeTZDtype +else: + from pandas.types.dtypes import DatetimeTZDtype + pdapi = pd.api.types + integer_types = six.integer_types + (np.integer,) __all__ = [] diff --git a/python/pyarrow/config.pyx b/python/pyarrow/config.pyx index aa30f097248cd..5ad7cf53261e3 100644 --- a/python/pyarrow/config.pyx +++ b/python/pyarrow/config.pyx @@ -17,10 +17,10 @@ cdef extern from 'pyarrow/do_import_numpy.h': pass -cdef extern from 'pyarrow/numpy_interop.h' namespace 'pyarrow': +cdef extern from 'pyarrow/numpy_interop.h' namespace 'arrow::py': int import_numpy() -cdef extern from 'pyarrow/config.h' namespace 'pyarrow': +cdef extern from 'pyarrow/config.h' namespace 'arrow::py': void pyarrow_init() void pyarrow_set_numpy_nan(object o) diff --git a/python/pyarrow/feather.py b/python/pyarrow/feather.py index b7dbf96563a41..28424afb093b5 100644 --- a/python/pyarrow/feather.py +++ b/python/pyarrow/feather.py @@ -19,6 +19,7 @@ from distutils.version import LooseVersion import pandas as pd +from pyarrow.compat import pdapi from 
pyarrow._feather import FeatherError # noqa from pyarrow.table import Table import pyarrow._feather as ext @@ -27,11 +28,6 @@ if LooseVersion(pd.__version__) < '0.17.0': raise ImportError("feather requires pandas >= 0.17.0") -if LooseVersion(pd.__version__) < '0.19.0': - pdapi = pd.core.common -else: - pdapi = pd.api.types - class FeatherReader(ext.FeatherReader): diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 253cabbe0a581..dee7fd4f8e4e5 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -84,6 +84,13 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: shared_ptr[CArray] indices() shared_ptr[CArray] dictionary() + cdef cppclass CTimestampType" arrow::TimestampType"(CFixedWidthType): + TimeUnit unit + c_string timezone + + cdef cppclass CTimeType" arrow::TimeType"(CFixedWidthType): + TimeUnit unit + cdef cppclass CDictionaryType" arrow::DictionaryType"(CFixedWidthType): CDictionaryType(const shared_ptr[CDataType]& index_type, const shared_ptr[CArray]& dictionary) @@ -92,6 +99,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: shared_ptr[CArray] dictionary() shared_ptr[CDataType] timestamp(TimeUnit unit) + shared_ptr[CDataType] timestamp(const c_string& timezone, TimeUnit unit) cdef cppclass CMemoryPool" arrow::MemoryPool": int64_t bytes_allocated() @@ -117,9 +125,6 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass CStringType" arrow::StringType"(CDataType): pass - cdef cppclass CTimestampType" arrow::TimestampType"(CDataType): - TimeUnit unit - cdef cppclass CField" arrow::Field": c_string name shared_ptr[CDataType] type diff --git a/python/pyarrow/includes/pyarrow.pxd b/python/pyarrow/includes/pyarrow.pxd index f1d45e0d50f36..9fbddba3d10c5 100644 --- a/python/pyarrow/includes/pyarrow.pxd +++ b/python/pyarrow/includes/pyarrow.pxd @@ -18,22 +18,29 @@ # distutils: language = c++ from pyarrow.includes.common cimport * -from pyarrow.includes.libarrow cimport (CArray, CBuffer, CColumn, CField, +from pyarrow.includes.libarrow cimport (CArray, CBuffer, CColumn, CTable, CDataType, CStatus, Type, CMemoryPool, TimeUnit) cimport pyarrow.includes.libarrow_io as arrow_io -cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil: +cdef extern from "pyarrow/api.h" namespace "arrow::py" nogil: shared_ptr[CDataType] GetPrimitiveType(Type type) shared_ptr[CDataType] GetTimestampType(TimeUnit unit) - CStatus ConvertPySequence(object obj, CMemoryPool* pool, shared_ptr[CArray]* out) + CStatus ConvertPySequence(object obj, CMemoryPool* pool, + shared_ptr[CArray]* out) + + CStatus PandasDtypeToArrow(object dtype, shared_ptr[CDataType]* type) CStatus PandasToArrow(CMemoryPool* pool, object ao, object mo, - shared_ptr[CField] field, + const shared_ptr[CDataType]& type, shared_ptr[CArray]* out) + CStatus PandasObjectsToArrow(CMemoryPool* pool, object ao, object mo, + const shared_ptr[CDataType]& type, + shared_ptr[CArray]* out) + CStatus ConvertArrayToPandas(const shared_ptr[CArray]& arr, PyObject* py_ref, PyObject** out) @@ -47,12 +54,12 @@ cdef extern from "pyarrow/api.h" namespace "pyarrow" nogil: CMemoryPool* get_memory_pool() -cdef extern from "pyarrow/common.h" namespace "pyarrow" nogil: +cdef extern from "pyarrow/common.h" namespace "arrow::py" nogil: cdef cppclass PyBytesBuffer(CBuffer): PyBytesBuffer(object o) -cdef extern from "pyarrow/io.h" namespace "pyarrow" nogil: +cdef extern from "pyarrow/io.h" namespace "arrow::py" nogil: cdef cppclass 
PyReadableFile(arrow_io.ReadableFileInterface): PyReadableFile(object fo) diff --git a/python/pyarrow/schema.pxd b/python/pyarrow/schema.pxd index 390954cfc6bd9..15ee5f19ee5d9 100644 --- a/python/pyarrow/schema.pxd +++ b/python/pyarrow/schema.pxd @@ -16,7 +16,9 @@ # under the License. from pyarrow.includes.common cimport * -from pyarrow.includes.libarrow cimport (CDataType, CDictionaryType, +from pyarrow.includes.libarrow cimport (CDataType, + CDictionaryType, + CTimestampType, CField, CSchema) cdef class DataType: @@ -31,6 +33,12 @@ cdef class DictionaryType(DataType): cdef: const CDictionaryType* dict_type + +cdef class TimestampType(DataType): + cdef: + const CTimestampType* ts_type + + cdef class Field: cdef: shared_ptr[CField] sp_field diff --git a/python/pyarrow/schema.pyx b/python/pyarrow/schema.pyx index d636b5a10bb58..4bc938df668f8 100644 --- a/python/pyarrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ -26,23 +26,19 @@ from cython.operator cimport dereference as deref from pyarrow.compat import frombytes, tobytes from pyarrow.array cimport Array +from pyarrow.error cimport check_status from pyarrow.includes.libarrow cimport (CDataType, CStructType, CListType, - Type_NA, Type_BOOL, - Type_UINT8, Type_INT8, - Type_UINT16, Type_INT16, - Type_UINT32, Type_INT32, - Type_UINT64, Type_INT64, - Type_TIMESTAMP, Type_DATE, - Type_FLOAT, Type_DOUBLE, - Type_STRING, Type_BINARY, TimeUnit_SECOND, TimeUnit_MILLI, TimeUnit_MICRO, TimeUnit_NANO, Type, TimeUnit) cimport pyarrow.includes.pyarrow as pyarrow -cimport pyarrow.includes.libarrow as libarrow +cimport pyarrow.includes.libarrow as la cimport cpython +import six + + cdef class DataType: def __cinit__(self): @@ -73,13 +69,33 @@ cdef class DictionaryType(DataType): DataType.init(self, type) self.dict_type = type.get() - def __str__(self): - return frombytes(self.type.ToString()) - def __repr__(self): return 'DictionaryType({0})'.format(str(self)) +cdef class TimestampType(DataType): + + cdef init(self, const shared_ptr[CDataType]& type): + DataType.init(self, type) + self.ts_type = type.get() + + property unit: + + def __get__(self): + return timeunit_to_string(self.ts_type.unit) + + property tz: + + def __get__(self): + if self.ts_type.timezone.size() > 0: + return frombytes(self.ts_type.timezone) + else: + return None + + def __repr__(self): + return 'TimestampType({0})'.format(str(self)) + + cdef class Field: def __cinit__(self): @@ -205,49 +221,76 @@ cdef DataType primitive_type(Type type): def field(name, type, bint nullable=True): return Field.from_py(name, type, nullable) + cdef set PRIMITIVE_TYPES = set([ - Type_NA, Type_BOOL, - Type_UINT8, Type_INT8, - Type_UINT16, Type_INT16, - Type_UINT32, Type_INT32, - Type_UINT64, Type_INT64, - Type_TIMESTAMP, Type_DATE, - Type_FLOAT, Type_DOUBLE]) + la.Type_NA, la.Type_BOOL, + la.Type_UINT8, la.Type_INT8, + la.Type_UINT16, la.Type_INT16, + la.Type_UINT32, la.Type_INT32, + la.Type_UINT64, la.Type_INT64, + la.Type_TIMESTAMP, la.Type_DATE, + la.Type_FLOAT, la.Type_DOUBLE]) + def null(): - return primitive_type(Type_NA) + return primitive_type(la.Type_NA) + def bool_(): - return primitive_type(Type_BOOL) + return primitive_type(la.Type_BOOL) + def uint8(): - return primitive_type(Type_UINT8) + return primitive_type(la.Type_UINT8) + def int8(): - return primitive_type(Type_INT8) + return primitive_type(la.Type_INT8) + def uint16(): - return primitive_type(Type_UINT16) + return primitive_type(la.Type_UINT16) + def int16(): - return primitive_type(Type_INT16) + return primitive_type(la.Type_INT16) + 
def uint32(): - return primitive_type(Type_UINT32) + return primitive_type(la.Type_UINT32) + def int32(): - return primitive_type(Type_INT32) + return primitive_type(la.Type_INT32) + def uint64(): - return primitive_type(Type_UINT64) + return primitive_type(la.Type_UINT64) + def int64(): - return primitive_type(Type_INT64) + return primitive_type(la.Type_INT64) + cdef dict _timestamp_type_cache = {} -def timestamp(unit_str): - cdef TimeUnit unit + +cdef timeunit_to_string(TimeUnit unit): + if unit == TimeUnit_SECOND: + return 's' + elif unit == TimeUnit_MILLI: + return 'ms' + elif unit == TimeUnit_MICRO: + return 'us' + elif unit == TimeUnit_NANO: + return 'ns' + + +def timestamp(unit_str, tz=None): + cdef: + TimeUnit unit + c_string c_timezone + if unit_str == "s": unit = TimeUnit_SECOND elif unit_str == 'ms': @@ -259,34 +302,47 @@ def timestamp(unit_str): else: raise TypeError('Invalid TimeUnit string') - if unit in _timestamp_type_cache: - return _timestamp_type_cache[unit] + cdef TimestampType out = TimestampType() + + if tz is None: + out.init(la.timestamp(unit)) + if unit in _timestamp_type_cache: + return _timestamp_type_cache[unit] + _timestamp_type_cache[unit] = out + else: + if not isinstance(tz, six.string_types): + tz = tz.zone + + c_timezone = tobytes(tz) + out.init(la.timestamp(c_timezone, unit)) - cdef DataType out = DataType() - out.init(libarrow.timestamp(unit)) - _timestamp_type_cache[unit] = out return out + def date(): - return primitive_type(Type_DATE) + return primitive_type(la.Type_DATE) + def float_(): - return primitive_type(Type_FLOAT) + return primitive_type(la.Type_FLOAT) + def double(): - return primitive_type(Type_DOUBLE) + return primitive_type(la.Type_DOUBLE) + def string(): """ UTF8 string """ - return primitive_type(Type_STRING) + return primitive_type(la.Type_STRING) + def binary(): """ Binary (PyBytes-like) type """ - return primitive_type(Type_BINARY) + return primitive_type(la.Type_BINARY) def list_(DataType value_type): @@ -326,13 +382,25 @@ def struct(fields): out.init(struct_type) return out + def schema(fields): return Schema.from_fields(fields) + cdef DataType box_data_type(const shared_ptr[CDataType]& type): + cdef: + DataType out + if type.get() == NULL: return None - cdef DataType out = DataType() + + if type.get().type == la.Type_DICTIONARY: + out = DictionaryType() + elif type.get().type == la.Type_TIMESTAMP: + out = TimestampType() + else: + out = DataType() + out.init(type) return out @@ -347,3 +415,11 @@ cdef Schema box_schema(const shared_ptr[CSchema]& type): cdef Schema out = Schema() out.init_schema(type) return out + + +def type_from_numpy_dtype(object dtype): + cdef shared_ptr[CDataType] c_type + with nogil: + check_status(pyarrow.PandasDtypeToArrow(dtype, &c_type)) + + return box_data_type(c_type) diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index 5657b973d1306..58f5d680393f7 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -30,7 +30,7 @@ import pyarrow.config from pyarrow.array cimport Array, box_array, wrap_array_output from pyarrow.error import ArrowException from pyarrow.error cimport check_status -from pyarrow.schema cimport box_data_type, box_schema, Field +from pyarrow.schema cimport box_data_type, box_schema, DataType from pyarrow.compat import frombytes, tobytes @@ -302,14 +302,15 @@ cdef _dataframe_to_arrays(df, name, timestamps_to_ms, Schema schema): cdef: list names = [] list arrays = [] - Field field = None + DataType type = None for name in df.columns: col = df[name] if schema 
is not None: - field = schema.field_by_name(name) - arr = Array.from_pandas(col, timestamps_to_ms=timestamps_to_ms, - field=field) + type = schema.field_by_name(name).type + + arr = Array.from_pandas(col, type=type, + timestamps_to_ms=timestamps_to_ms) names.append(name) arrays.append(arr) @@ -522,6 +523,7 @@ cdef table_to_blockmanager(const shared_ptr[CTable]& table, int nthreads): import pandas.core.internals as _int from pandas import RangeIndex, Categorical + from pyarrow.compat import DatetimeTZDtype with nogil: check_status(pyarrow.ConvertTableToPandas(table, nthreads, @@ -541,9 +543,9 @@ cdef table_to_blockmanager(const shared_ptr[CTable]& table, int nthreads): klass=_int.CategoricalBlock, fastpath=True) elif 'timezone' in item: - from pandas.types.api import DatetimeTZDtype dtype = DatetimeTZDtype('ns', tz=item['timezone']) block = _int.make_block(block_arr, placement=placement, + klass=_int.DatetimeTZBlock, dtype=dtype, fastpath=True) else: block = _int.make_block(block_arr, placement=placement) diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index a79bb2392ea6c..6b89444b3e824 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -77,9 +77,9 @@ def _check_pandas_roundtrip(self, df, expected=None, nthreads=1, tm.assert_frame_equal(result, expected, check_dtype=check_dtype) def _check_array_roundtrip(self, values, expected=None, - timestamps_to_ms=False, field=None): + timestamps_to_ms=False, type=None): arr = A.Array.from_pandas(values, timestamps_to_ms=timestamps_to_ms, - field=field) + type=type) result = arr.to_pandas() assert arr.null_count == pd.isnull(values).sum() @@ -134,11 +134,13 @@ def test_integer_no_nulls(self): data = OrderedDict() fields = [] - numpy_dtypes = [('i1', A.int8()), ('i2', A.int16()), - ('i4', A.int32()), ('i8', A.int64()), - ('u1', A.uint8()), ('u2', A.uint16()), - ('u4', A.uint32()), ('u8', A.uint64()), - ('longlong', A.int64()), ('ulonglong', A.uint64())] + numpy_dtypes = [ + ('i1', A.int8()), ('i2', A.int16()), + ('i4', A.int32()), ('i8', A.int64()), + ('u1', A.uint8()), ('u2', A.uint16()), + ('u4', A.uint32()), ('u8', A.uint64()), + ('longlong', A.int64()), ('ulonglong', A.uint64()) + ] num_values = 100 for dtype, arrow_dtype in numpy_dtypes: @@ -153,7 +155,6 @@ def test_integer_no_nulls(self): schema = A.Schema.from_fields(fields) self._check_pandas_roundtrip(df, expected_schema=schema) - def test_integer_with_nulls(self): # pandas requires upcast to float dtype @@ -301,9 +302,9 @@ def test_timestamps_with_timezone(self): '2010-08-13T05:46:57.437'], dtype='datetime64[ms]') }) - df_est = df['datetime64'].dt.tz_localize('US/Eastern').to_frame() - df_utc = df_est['datetime64'].dt.tz_convert('UTC').to_frame() - self._check_pandas_roundtrip(df_est, expected=df_utc, timestamps_to_ms=True, check_dtype=False) + df['datetime64'] = (df['datetime64'].dt.tz_localize('US/Eastern') + .to_frame()) + self._check_pandas_roundtrip(df, timestamps_to_ms=True) # drop-in a null and ns instead of ms df = pd.DataFrame({ @@ -314,9 +315,9 @@ def test_timestamps_with_timezone(self): '2010-08-13T05:46:57.437699912'], dtype='datetime64[ns]') }) - df_est = df['datetime64'].dt.tz_localize('US/Eastern').to_frame() - df_utc = df_est['datetime64'].dt.tz_convert('UTC').to_frame() - self._check_pandas_roundtrip(df_est, expected=df_utc, timestamps_to_ms=False, check_dtype=False) + df['datetime64'] = (df['datetime64'].dt.tz_localize('US/Eastern') + .to_frame()) + 
self._check_pandas_roundtrip(df, timestamps_to_ms=False) def test_date(self): df = pd.DataFrame({ @@ -341,7 +342,7 @@ def test_column_of_arrays(self): for column in df.columns: field = schema.field_by_name(column) - self._check_array_roundtrip(df[column], field=field) + self._check_array_roundtrip(df[column], type=field.type) def test_column_of_lists(self): df, schema = dataframe_with_lists() @@ -351,7 +352,7 @@ def test_column_of_lists(self): for column in df.columns: field = schema.field_by_name(column) - self._check_array_roundtrip(df[column], field=field) + self._check_array_roundtrip(df[column], type=field.type) def test_threaded_conversion(self): df = _alltypes_example() diff --git a/python/pyarrow/tests/test_feather.py b/python/pyarrow/tests/test_feather.py index 451475b4c6d81..e4b6273ffccf4 100644 --- a/python/pyarrow/tests/test_feather.py +++ b/python/pyarrow/tests/test_feather.py @@ -23,8 +23,8 @@ from pandas.util.testing import assert_frame_equal import pandas as pd +import pyarrow as pa from pyarrow.compat import guid -from pyarrow.error import ArrowException from pyarrow.feather import (read_feather, write_feather, FeatherReader) from pyarrow._feather import FeatherWriter @@ -47,7 +47,7 @@ def tearDown(self): pass def test_file_not_exist(self): - with self.assertRaises(ArrowException): + with self.assertRaises(pa.ArrowException): FeatherReader('test_invalid_file') def _get_null_counts(self, path, columns=None): @@ -291,7 +291,6 @@ def test_category(self): self._check_pandas_roundtrip(df, expected, null_counts=[2 * repeats]) - @pytest.mark.xfail def test_timestamp(self): df = pd.DataFrame({'naive': pd.date_range('2016-03-28', periods=10)}) df['with_tz'] = (df.naive.dt.tz_localize('utc') @@ -299,7 +298,6 @@ def test_timestamp(self): self._check_pandas_roundtrip(df) - @pytest.mark.xfail def test_timestamp_with_nulls(self): df = pd.DataFrame({'test': [pd.datetime(2016, 1, 1), None, @@ -308,7 +306,6 @@ def test_timestamp_with_nulls(self): self._check_pandas_roundtrip(df, null_counts=[1, 1]) - @pytest.mark.xfail def test_out_of_float64_timestamp_with_nulls(self): df = pd.DataFrame( {'test': pd.DatetimeIndex([1451606400000000001, diff --git a/python/pyarrow/tests/test_schema.py b/python/pyarrow/tests/test_schema.py index dd68f396a6888..5588840cceb1f 100644 --- a/python/pyarrow/tests/test_schema.py +++ b/python/pyarrow/tests/test_schema.py @@ -15,82 +15,108 @@ # specific language governing permissions and limitations # under the License. 
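The rewritten tests that follow pin down the contract of the new `timestamp()` factory from the schema.pyx hunk earlier in this patch: zone-naive types are cached per unit, tz-aware types are constructed fresh, and pytz-style objects are accepted via their `.zone` attribute. A plain-Python model of that behavior, with illustrative names only:

```python
# Plain-Python model of the timestamp() factory above; runnable on its own.
_timestamp_type_cache = {}

def timestamp(unit_str, tz=None):
    if unit_str not in ('s', 'ms', 'us', 'ns'):
        raise TypeError('Invalid TimeUnit string')
    if tz is None:
        # zone-naive types are interned per unit
        if unit_str in _timestamp_type_cache:
            return _timestamp_type_cache[unit_str]
        out = ('timestamp', unit_str, None)
        _timestamp_type_cache[unit_str] = out
        return out
    if not isinstance(tz, str):
        tz = tz.zone  # accept pytz timezone objects, as the Cython code does
    return ('timestamp', unit_str, tz)

assert timestamp('ms') is timestamp('ms')             # cached identity
assert timestamp('ns', 'UTC') == ('timestamp', 'ns', 'UTC')
```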
-from pyarrow.compat import unittest -import pyarrow as arrow +import pytest -A = arrow +import pyarrow as pa +import numpy as np -class TestTypes(unittest.TestCase): +# XXX: pyarrow.schema.schema masks the module on imports +sch = pa._schema - def test_integers(self): - dtypes = ['int8', 'int16', 'int32', 'int64', - 'uint8', 'uint16', 'uint32', 'uint64'] - for name in dtypes: - factory = getattr(arrow, name) - t = factory() - assert str(t) == name +def test_type_integers(): + dtypes = ['int8', 'int16', 'int32', 'int64', + 'uint8', 'uint16', 'uint32', 'uint64'] - def test_list(self): - value_type = arrow.int32() - list_type = arrow.list_(value_type) - assert str(list_type) == 'list' + for name in dtypes: + factory = getattr(pa, name) + t = factory() + assert str(t) == name - def test_string(self): - t = arrow.string() - assert str(t) == 'string' - def test_field(self): - t = arrow.string() - f = arrow.field('foo', t) +def test_type_list(): + value_type = pa.int32() + list_type = pa.list_(value_type) + assert str(list_type) == 'list' - assert f.name == 'foo' - assert f.nullable - assert f.type is t - assert repr(f) == "Field('foo', type=string)" - f = arrow.field('foo', t, False) - assert not f.nullable +def test_type_string(): + t = pa.string() + assert str(t) == 'string' - def test_schema(self): - fields = [ - A.field('foo', A.int32()), - A.field('bar', A.string()), - A.field('baz', A.list_(A.int8())) - ] - sch = A.schema(fields) - assert len(sch) == 3 - assert sch[0].name == 'foo' - assert sch[0].type == fields[0].type - assert sch.field_by_name('foo').name == 'foo' - assert sch.field_by_name('foo').type == fields[0].type +def test_type_timestamp_with_tz(): + tz = 'America/Los_Angeles' + t = pa.timestamp('ns', tz=tz) + assert t.unit == 'ns' + assert t.tz == tz - assert repr(sch) == """\ + +def test_type_from_numpy_dtype_timestamps(): + cases = [ + (np.dtype('datetime64[s]'), pa.timestamp('s')), + (np.dtype('datetime64[ms]'), pa.timestamp('ms')), + (np.dtype('datetime64[us]'), pa.timestamp('us')), + (np.dtype('datetime64[ns]'), pa.timestamp('ns')) + ] + + for dt, pt in cases: + result = sch.type_from_numpy_dtype(dt) + assert result == pt + + +def test_field(): + t = pa.string() + f = pa.field('foo', t) + + assert f.name == 'foo' + assert f.nullable + assert f.type is t + assert repr(f) == "Field('foo', type=string)" + + f = pa.field('foo', t, False) + assert not f.nullable + + +def test_schema(): + fields = [ + pa.field('foo', pa.int32()), + pa.field('bar', pa.string()), + pa.field('baz', pa.list_(pa.int8())) + ] + sch = pa.schema(fields) + + assert len(sch) == 3 + assert sch[0].name == 'foo' + assert sch[0].type == fields[0].type + assert sch.field_by_name('foo').name == 'foo' + assert sch.field_by_name('foo').type == fields[0].type + + assert repr(sch) == """\ foo: int32 bar: string baz: list""" - def test_schema_equals(self): - fields = [ - A.field('foo', A.int32()), - A.field('bar', A.string()), - A.field('baz', A.list_(A.int8())) - ] - sch1 = A.schema(fields) - print(dir(sch1)) - sch2 = A.schema(fields) - assert sch1.equals(sch2) +def test_field_empty(): + f = pa.Field() + with pytest.raises(ReferenceError): + repr(f) + - del fields[-1] - sch3 = A.schema(fields) - assert not sch1.equals(sch3) +def test_schema_equals(): + fields = [ + pa.field('foo', pa.int32()), + pa.field('bar', pa.string()), + pa.field('baz', pa.list_(pa.int8())) + ] + sch1 = pa.schema(fields) + print(dir(sch1)) + sch2 = pa.schema(fields) + assert sch1.equals(sch2) -class TestField(unittest.TestCase): - def 
test_empty_field(self): - f = arrow.Field() - with self.assertRaises(ReferenceError): - repr(f) + del fields[-1] + sch3 = pa.schema(fields) + assert not sch1.equals(sch3) diff --git a/python/src/pyarrow/adapters/builtin.cc b/python/src/pyarrow/adapters/builtin.cc index 4f7b2cb09e1e6..b197f5845c020 100644 --- a/python/src/pyarrow/adapters/builtin.cc +++ b/python/src/pyarrow/adapters/builtin.cc @@ -27,13 +27,8 @@ #include "pyarrow/helpers.h" #include "pyarrow/util/datetime.h" -using arrow::ArrayBuilder; -using arrow::DataType; -using arrow::MemoryPool; -using arrow::Status; -using arrow::Type; - -namespace pyarrow { +namespace arrow { +namespace py { static inline bool IsPyInteger(PyObject* obj) { #if PYARROW_IS_PY2 @@ -82,22 +77,22 @@ class ScalarVisitor { std::shared_ptr GetType() { // TODO(wesm): handling mixed-type cases if (float_count_) { - return arrow::float64(); + return float64(); } else if (int_count_) { // TODO(wesm): tighter type later - return arrow::int64(); + return int64(); } else if (date_count_) { - return arrow::date(); + return date(); } else if (timestamp_count_) { - return arrow::timestamp(arrow::TimeUnit::MICRO); + return timestamp(TimeUnit::MICRO); } else if (bool_count_) { - return arrow::boolean(); + return boolean(); } else if (binary_count_) { - return arrow::binary(); + return binary(); } else if (unicode_count_) { - return arrow::utf8(); + return utf8(); } else { - return arrow::null(); + return null(); } } @@ -157,14 +152,14 @@ class SeqVisitor { std::shared_ptr GetType() { if (scalars_.total_count() == 0) { if (max_nesting_level_ == 0) { - return arrow::null(); + return null(); } else { return nullptr; } } else { std::shared_ptr result = scalars_.GetType(); for (int i = 0; i < max_nesting_level_; ++i) { - result = std::make_shared(result); + result = std::make_shared(result); } return result; } @@ -215,7 +210,7 @@ Status InferArrowType(PyObject* obj, int64_t* size, std::shared_ptr* o } // For 0-length sequences, refuse to guess - if (*size == 0) { *out_type = arrow::null(); } + if (*size == 0) { *out_type = null(); } SeqVisitor seq_visitor; RETURN_NOT_OK(seq_visitor.Visit(obj)); @@ -255,7 +250,7 @@ class TypedConverter : public SeqConverter { BuilderType* typed_builder_; }; -class BoolConverter : public TypedConverter { +class BoolConverter : public TypedConverter { public: Status AppendData(PyObject* seq) override { Py_ssize_t size = PySequence_Size(seq); @@ -276,7 +271,7 @@ class BoolConverter : public TypedConverter { } }; -class Int64Converter : public TypedConverter { +class Int64Converter : public TypedConverter { public: Status AppendData(PyObject* seq) override { int64_t val; @@ -296,7 +291,7 @@ class Int64Converter : public TypedConverter { } }; -class DateConverter : public TypedConverter { +class DateConverter : public TypedConverter { public: Status AppendData(PyObject* seq) override { Py_ssize_t size = PySequence_Size(seq); @@ -314,7 +309,7 @@ class DateConverter : public TypedConverter { } }; -class TimestampConverter : public TypedConverter { +class TimestampConverter : public TypedConverter { public: Status AppendData(PyObject* seq) override { Py_ssize_t size = PySequence_Size(seq); @@ -347,7 +342,7 @@ class TimestampConverter : public TypedConverter { } }; -class DoubleConverter : public TypedConverter { +class DoubleConverter : public TypedConverter { public: Status AppendData(PyObject* seq) override { double val; @@ -367,7 +362,7 @@ class DoubleConverter : public TypedConverter { } }; -class BytesConverter : public TypedConverter { +class 
BytesConverter : public TypedConverter { public: Status AppendData(PyObject* seq) override { PyObject* item; @@ -401,7 +396,7 @@ class BytesConverter : public TypedConverter { } }; -class UTF8Converter : public TypedConverter { +class UTF8Converter : public TypedConverter { public: Status AppendData(PyObject* seq) override { PyObject* item; @@ -433,7 +428,7 @@ class UTF8Converter : public TypedConverter { } }; -class ListConverter : public TypedConverter { +class ListConverter : public TypedConverter { public: Status Init(const std::shared_ptr& builder) override; @@ -483,10 +478,10 @@ std::shared_ptr GetConverter(const std::shared_ptr& type Status ListConverter::Init(const std::shared_ptr& builder) { builder_ = builder; - typed_builder_ = static_cast(builder.get()); + typed_builder_ = static_cast(builder.get()); value_converter_ = - GetConverter(static_cast(builder->type().get())->value_type()); + GetConverter(static_cast(builder->type().get())->value_type()); if (value_converter_ == nullptr) { return Status::NotImplemented("value type not implemented"); } @@ -508,8 +503,7 @@ Status AppendPySequence(PyObject* obj, const std::shared_ptr& type, return converter->AppendData(obj); } -Status ConvertPySequence( - PyObject* obj, MemoryPool* pool, std::shared_ptr* out) { +Status ConvertPySequence(PyObject* obj, MemoryPool* pool, std::shared_ptr* out) { std::shared_ptr type; int64_t size; PyDateTime_IMPORT; @@ -517,16 +511,17 @@ Status ConvertPySequence( // Handle NA / NullType case if (type->type == Type::NA) { - out->reset(new arrow::NullArray(size)); + out->reset(new NullArray(size)); return Status::OK(); } // Give the sequence converter an array builder std::shared_ptr builder; - RETURN_NOT_OK(arrow::MakeBuilder(pool, type, &builder)); + RETURN_NOT_OK(MakeBuilder(pool, type, &builder)); RETURN_NOT_OK(AppendPySequence(obj, type, builder)); return builder->Finish(out); } -} // namespace pyarrow +} // namespace py +} // namespace arrow diff --git a/python/src/pyarrow/adapters/builtin.h b/python/src/pyarrow/adapters/builtin.h index 0c863a5631ada..2d45e670628b5 100644 --- a/python/src/pyarrow/adapters/builtin.h +++ b/python/src/pyarrow/adapters/builtin.h @@ -27,27 +27,28 @@ #include +#include "arrow/util/visibility.h" + #include "pyarrow/common.h" -#include "pyarrow/visibility.h" namespace arrow { + class Array; class Status; -} -namespace pyarrow { +namespace py { -PYARROW_EXPORT arrow::Status InferArrowType( +ARROW_EXPORT arrow::Status InferArrowType( PyObject* obj, int64_t* size, std::shared_ptr* out_type); -PYARROW_EXPORT arrow::Status AppendPySequence(PyObject* obj, +ARROW_EXPORT arrow::Status AppendPySequence(PyObject* obj, const std::shared_ptr& type, const std::shared_ptr& builder); -PYARROW_EXPORT -arrow::Status ConvertPySequence( - PyObject* obj, arrow::MemoryPool* pool, std::shared_ptr* out); +ARROW_EXPORT +Status ConvertPySequence(PyObject* obj, MemoryPool* pool, std::shared_ptr* out); -} // namespace pyarrow +} // namespace py +} // namespace arrow #endif // PYARROW_ADAPTERS_BUILTIN_H diff --git a/python/src/pyarrow/adapters/pandas-test.cc b/python/src/pyarrow/adapters/pandas-test.cc index e286ccc2c8dc4..e694e790a38d1 100644 --- a/python/src/pyarrow/adapters/pandas-test.cc +++ b/python/src/pyarrow/adapters/pandas-test.cc @@ -30,9 +30,8 @@ #include "arrow/type.h" #include "pyarrow/adapters/pandas.h" -using namespace arrow; - -namespace pyarrow { +namespace arrow { +namespace py { TEST(PandasConversionTest, TestObjectBlockWriteFails) { StringBuilder builder; @@ -61,4 +60,5 @@ 
TEST(PandasConversionTest, TestObjectBlockWriteFails) { Py_END_ALLOW_THREADS; } +} // namespace py } // namespace arrow diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index 40079b49b9638..863cf54c9aa1c 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -19,7 +19,6 @@ #include -#include "pyarrow/adapters/builtin.h" #include "pyarrow/adapters/pandas.h" #include "pyarrow/numpy_interop.h" @@ -34,120 +33,39 @@ #include #include -#include "arrow/api.h" +#include "arrow/array.h" +#include "arrow/column.h" #include "arrow/loader.h" #include "arrow/status.h" +#include "arrow/table.h" #include "arrow/type_fwd.h" #include "arrow/type_traits.h" #include "arrow/util/bit-util.h" #include "arrow/util/macros.h" +#include "pyarrow/adapters/builtin.h" #include "pyarrow/common.h" #include "pyarrow/config.h" +#include "pyarrow/type_traits.h" #include "pyarrow/util/datetime.h" -namespace pyarrow { - -using arrow::Array; -using arrow::ChunkedArray; -using arrow::Column; -using arrow::DictionaryType; -using arrow::Field; -using arrow::DataType; -using arrow::ListType; -using arrow::ListBuilder; -using arrow::Status; -using arrow::Table; -using arrow::Type; - -namespace BitUtil = arrow::BitUtil; +namespace arrow { +namespace py { // ---------------------------------------------------------------------- // Utility code -template -struct npy_traits {}; - -template <> -struct npy_traits { - typedef uint8_t value_type; - using TypeClass = arrow::BooleanType; - using BuilderClass = arrow::BooleanBuilder; - - static constexpr bool supports_nulls = false; - static inline bool isnull(uint8_t v) { return false; } -}; - -#define NPY_INT_DECL(TYPE, CapType, T) \ - template <> \ - struct npy_traits { \ - typedef T value_type; \ - using TypeClass = arrow::CapType##Type; \ - using BuilderClass = arrow::CapType##Builder; \ - \ - static constexpr bool supports_nulls = false; \ - static inline bool isnull(T v) { return false; } \ - }; - -NPY_INT_DECL(INT8, Int8, int8_t); -NPY_INT_DECL(INT16, Int16, int16_t); -NPY_INT_DECL(INT32, Int32, int32_t); -NPY_INT_DECL(INT64, Int64, int64_t); +int cast_npy_type_compat(int type_num) { +// Both LONGLONG and INT64 can be observed in the wild, which is buggy. We set +// U/LONGLONG to U/INT64 so things work properly. 
-NPY_INT_DECL(UINT8, UInt8, uint8_t); -NPY_INT_DECL(UINT16, UInt16, uint16_t); -NPY_INT_DECL(UINT32, UInt32, uint32_t); -NPY_INT_DECL(UINT64, UInt64, uint64_t); - -#if NPY_INT64 != NPY_LONGLONG -NPY_INT_DECL(LONGLONG, Int64, int64_t); -NPY_INT_DECL(ULONGLONG, UInt64, uint64_t); +#if (NPY_INT64 == NPY_LONGLONG) && (NPY_SIZEOF_LONGLONG == 8) + if (type_num == NPY_LONGLONG) { type_num = NPY_INT64; } + if (type_num == NPY_ULONGLONG) { type_num = NPY_UINT64; } #endif -template <> -struct npy_traits { - typedef float value_type; - using TypeClass = arrow::FloatType; - using BuilderClass = arrow::FloatBuilder; - - static constexpr bool supports_nulls = true; - - static inline bool isnull(float v) { return v != v; } -}; - -template <> -struct npy_traits { - typedef double value_type; - using TypeClass = arrow::DoubleType; - using BuilderClass = arrow::DoubleBuilder; - - static constexpr bool supports_nulls = true; - - static inline bool isnull(double v) { return v != v; } -}; - -template <> -struct npy_traits { - typedef int64_t value_type; - using TypeClass = arrow::TimestampType; - using BuilderClass = arrow::TimestampBuilder; - - static constexpr bool supports_nulls = true; - - static inline bool isnull(int64_t v) { - // NaT = -2**63 - // = -0x8000000000000000 - // = -9223372036854775808; - // = std::numeric_limits::min() - return v == std::numeric_limits::min(); - } -}; - -template <> -struct npy_traits { - typedef PyObject* value_type; - static constexpr bool supports_nulls = true; -}; + return type_num; +} static inline bool PyObject_is_null(const PyObject* obj) { return obj == Py_None || obj == numpy_nan; @@ -181,8 +99,24 @@ static int64_t ValuesToBitmap(const void* data, int64_t length, uint8_t* bitmap) return null_count; } +// Returns null count +static int64_t MaskToBitmap(PyArrayObject* mask, int64_t length, uint8_t* bitmap) { + int64_t null_count = 0; + const uint8_t* mask_values = static_cast(PyArray_DATA(mask)); + // TODO(wesm): strided null mask + for (int i = 0; i < length; ++i) { + if (mask_values[i]) { + ++null_count; + } else { + BitUtil::SetBit(bitmap, i); + } + } + return null_count; +} + template -static int64_t ValuesToBytemap(const void* data, int64_t length, uint8_t* valid_bytes) { +static int64_t ValuesToValidBytes( + const void* data, int64_t length, uint8_t* valid_bytes) { typedef npy_traits traits; typedef typename traits::value_type T; @@ -214,7 +148,7 @@ Status CheckFlatNumpyArray(PyArrayObject* numpy_array, int np_type) { return Status::OK(); } -Status AppendObjectStrings(arrow::StringBuilder& string_builder, PyObject** objects, +Status AppendObjectStrings(StringBuilder& string_builder, PyObject** objects, int64_t objects_length, bool* have_bytes) { PyObject* obj; @@ -242,360 +176,561 @@ Status AppendObjectStrings(arrow::StringBuilder& string_builder, PyObject** obje return Status::OK(); } -template -struct arrow_traits {}; +template +struct WrapBytes {}; template <> -struct arrow_traits { - static constexpr int npy_type = NPY_BOOL; - static constexpr bool supports_nulls = false; - static constexpr bool is_boolean = true; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = false; +struct WrapBytes { + static inline PyObject* Wrap(const uint8_t* data, int64_t length) { + return PyUnicode_FromStringAndSize(reinterpret_cast(data), length); + } }; -#define INT_DECL(TYPE) \ - template <> \ - struct arrow_traits { \ - static constexpr int npy_type = NPY_##TYPE; \ - static constexpr bool supports_nulls = false; \ - static 
constexpr double na_value = NAN; \ - static constexpr bool is_boolean = false; \ - static constexpr bool is_numeric_not_nullable = true; \ - static constexpr bool is_numeric_nullable = false; \ - typedef typename npy_traits::value_type T; \ - }; - -INT_DECL(INT8); -INT_DECL(INT16); -INT_DECL(INT32); -INT_DECL(INT64); -INT_DECL(UINT8); -INT_DECL(UINT16); -INT_DECL(UINT32); -INT_DECL(UINT64); - template <> -struct arrow_traits { - static constexpr int npy_type = NPY_FLOAT32; - static constexpr bool supports_nulls = true; - static constexpr float na_value = NAN; - static constexpr bool is_boolean = false; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = true; - typedef typename npy_traits::value_type T; +struct WrapBytes { + static inline PyObject* Wrap(const uint8_t* data, int64_t length) { + return PyBytes_FromStringAndSize(reinterpret_cast(data), length); + } }; -template <> -struct arrow_traits { - static constexpr int npy_type = NPY_FLOAT64; - static constexpr bool supports_nulls = true; - static constexpr double na_value = NAN; - static constexpr bool is_boolean = false; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = true; - typedef typename npy_traits::value_type T; -}; +static inline bool ListTypeSupported(const Type::type type_id) { + switch (type_id) { + case Type::UINT8: + case Type::INT8: + case Type::UINT16: + case Type::INT16: + case Type::UINT32: + case Type::INT32: + case Type::INT64: + case Type::UINT64: + case Type::FLOAT: + case Type::DOUBLE: + case Type::STRING: + case Type::TIMESTAMP: + // The above types are all supported. + return true; + default: + break; + } + return false; +} -static constexpr int64_t kPandasTimestampNull = std::numeric_limits::min(); +// ---------------------------------------------------------------------- +// Conversion from NumPy-in-Pandas to Arrow -template <> -struct arrow_traits { - static constexpr int npy_type = NPY_DATETIME; - static constexpr bool supports_nulls = true; - static constexpr int64_t na_value = kPandasTimestampNull; - static constexpr bool is_boolean = false; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = true; - typedef typename npy_traits::value_type T; -}; +class PandasConverter : public TypeVisitor { + public: + PandasConverter( + MemoryPool* pool, PyObject* ao, PyObject* mo, const std::shared_ptr& type) + : pool_(pool), + type_(type), + arr_(reinterpret_cast(ao)), + mask_(nullptr) { + if (mo != nullptr and mo != Py_None) { mask_ = reinterpret_cast(mo); } + length_ = PyArray_SIZE(arr_); + } -template <> -struct arrow_traits { - static constexpr int npy_type = NPY_DATETIME; - static constexpr bool supports_nulls = true; - static constexpr int64_t na_value = kPandasTimestampNull; - static constexpr bool is_boolean = false; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = true; - typedef typename npy_traits::value_type T; -}; + bool is_strided() const { + npy_intp* astrides = PyArray_STRIDES(arr_); + return astrides[0] != PyArray_DESCR(arr_)->elsize; + } -template <> -struct arrow_traits { - static constexpr int npy_type = NPY_OBJECT; - static constexpr bool supports_nulls = true; - static constexpr bool is_boolean = false; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = false; -}; + Status InitNullBitmap() { + int null_bytes = 
BitUtil::BytesForBits(length_); -template <> -struct arrow_traits { - static constexpr int npy_type = NPY_OBJECT; - static constexpr bool supports_nulls = true; - static constexpr bool is_boolean = false; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = false; -}; + null_bitmap_ = std::make_shared(pool_); + RETURN_NOT_OK(null_bitmap_->Resize(null_bytes)); -template -struct WrapBytes {}; + null_bitmap_data_ = null_bitmap_->mutable_data(); + memset(null_bitmap_data_, 0, null_bytes); -template <> -struct WrapBytes { - static inline PyObject* Wrap(const uint8_t* data, int64_t length) { - return PyUnicode_FromStringAndSize(reinterpret_cast(data), length); + return Status::OK(); } -}; -template <> -struct WrapBytes { - static inline PyObject* Wrap(const uint8_t* data, int64_t length) { - return PyBytes_FromStringAndSize(reinterpret_cast(data), length); - } -}; + // ---------------------------------------------------------------------- + // Traditional visitor conversion for non-object arrays -inline void set_numpy_metadata(int type, DataType* datatype, PyArrayObject* out) { - if (type == NPY_DATETIME) { - PyArray_Descr* descr = PyArray_DESCR(out); - auto date_dtype = reinterpret_cast(descr->c_metadata); - if (datatype->type == Type::TIMESTAMP) { - auto timestamp_type = static_cast(datatype); + template + Status ConvertData(std::shared_ptr* data); - switch (timestamp_type->unit) { - case arrow::TimestampType::Unit::SECOND: - date_dtype->meta.base = NPY_FR_s; - break; - case arrow::TimestampType::Unit::MILLI: - date_dtype->meta.base = NPY_FR_ms; - break; - case arrow::TimestampType::Unit::MICRO: - date_dtype->meta.base = NPY_FR_us; - break; - case arrow::TimestampType::Unit::NANO: - date_dtype->meta.base = NPY_FR_ns; - break; - } - } else { - // datatype->type == Type::DATE - date_dtype->meta.base = NPY_FR_D; + template + Status VisitNative() { + using traits = arrow_traits; + + if (mask_ != nullptr || traits::supports_nulls) { RETURN_NOT_OK(InitNullBitmap()); } + + std::shared_ptr data; + RETURN_NOT_OK(ConvertData(&data)); + + int64_t null_count = 0; + if (mask_ != nullptr) { + null_count = MaskToBitmap(mask_, length_, null_bitmap_data_); + } else if (traits::supports_nulls) { + // TODO(wesm): this presumes the NumPy C type and arrow C type are the + // same + null_count = ValuesToBitmap( + PyArray_DATA(arr_), length_, null_bitmap_data_); } + + std::vector fields(1); + fields[0].length = length_; + fields[0].null_count = null_count; + fields[0].offset = 0; + + return LoadArray(type_, fields, {null_bitmap_, data}, &out_); } -} -template -inline void ConvertIntegerWithNulls(const ChunkedArray& data, double* out_values) { - for (int c = 0; c < data.num_chunks(); c++) { - const std::shared_ptr arr = data.chunk(c); - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); - // Upcast to double, set NaN as appropriate +#define VISIT_NATIVE(TYPE) \ + Status Visit(const TYPE& type) override { return VisitNative(); } - for (int i = 0; i < arr->length(); ++i) { - *out_values++ = prim_arr->IsNull(i) ? 
NAN : in_values[i]; + VISIT_NATIVE(BooleanType); + VISIT_NATIVE(Int8Type); + VISIT_NATIVE(Int16Type); + VISIT_NATIVE(Int32Type); + VISIT_NATIVE(Int64Type); + VISIT_NATIVE(UInt8Type); + VISIT_NATIVE(UInt16Type); + VISIT_NATIVE(UInt32Type); + VISIT_NATIVE(UInt64Type); + VISIT_NATIVE(FloatType); + VISIT_NATIVE(DoubleType); + VISIT_NATIVE(TimestampType); + +#undef VISIT_NATIVE + + Status Convert(std::shared_ptr* out) { + if (PyArray_NDIM(arr_) != 1) { + return Status::Invalid("only handle 1-dimensional arrays"); } + // TODO(wesm): strided arrays + if (is_strided()) { return Status::Invalid("no support for strided data yet"); } + + if (type_ == nullptr) { return Status::Invalid("Must pass data type"); } + + // Visit the type to perform conversion + RETURN_NOT_OK(type_->Accept(this)); + + *out = out_; + return Status::OK(); } -} -template -inline void ConvertIntegerNoNullsSameType(const ChunkedArray& data, T* out_values) { - for (int c = 0; c < data.num_chunks(); c++) { - const std::shared_ptr arr = data.chunk(c); - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); - memcpy(out_values, in_values, sizeof(T) * arr->length()); - out_values += arr->length(); + // ---------------------------------------------------------------------- + // Conversion logic for various object dtype arrays + + template + Status ConvertTypedLists( + const std::shared_ptr& type, std::shared_ptr* out); + + Status ConvertObjectStrings(std::shared_ptr* out); + Status ConvertBooleans(std::shared_ptr* out); + Status ConvertDates(std::shared_ptr* out); + Status ConvertLists(const std::shared_ptr& type, std::shared_ptr* out); + Status ConvertObjects(std::shared_ptr* out); + + protected: + MemoryPool* pool_; + std::shared_ptr type_; + PyArrayObject* arr_; + PyArrayObject* mask_; + int64_t length_; + + // Used in visitor pattern + std::shared_ptr out_; + + std::shared_ptr null_bitmap_; + uint8_t* null_bitmap_data_; +}; + +template +inline Status PandasConverter::ConvertData(std::shared_ptr* data) { + using traits = arrow_traits; + + // Handle LONGLONG->INT64 and other fun things + int type_num_compat = cast_npy_type_compat(PyArray_DESCR(arr_)->type_num); + + if (traits::npy_type != type_num_compat) { + return Status::NotImplemented("NumPy type casts not yet implemented"); } + + *data = std::make_shared(arr_); + return Status::OK(); } -template -inline void ConvertIntegerNoNullsCast(const ChunkedArray& data, OutType* out_values) { - for (int c = 0; c < data.num_chunks(); c++) { - const std::shared_ptr arr = data.chunk(c); - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); - for (int64_t i = 0; i < arr->length(); ++i) { - *out_values = in_values[i]; - } +template <> +inline Status PandasConverter::ConvertData(std::shared_ptr* data) { + int nbytes = BitUtil::BytesForBits(length_); + auto buffer = std::make_shared(pool_); + RETURN_NOT_OK(buffer->Resize(nbytes)); + + const uint8_t* values = reinterpret_cast(PyArray_DATA(arr_)); + + uint8_t* bitmap = buffer->mutable_data(); + + memset(bitmap, 0, nbytes); + for (int i = 0; i < length_; ++i) { + if (values[i] > 0) { BitUtil::SetBit(bitmap, i); } } + + *data = buffer; + return Status::OK(); } -static Status ConvertBooleanWithNulls(const ChunkedArray& data, PyObject** out_values) { +Status PandasConverter::ConvertDates(std::shared_ptr* out) { PyAcquireGIL lock; - for (int c = 0; c < data.num_chunks(); c++) { - const std::shared_ptr arr = data.chunk(c); - auto bool_arr = 
static_cast(arr.get()); - for (int64_t i = 0; i < arr->length(); ++i) { - if (bool_arr->IsNull(i)) { - Py_INCREF(Py_None); - *out_values++ = Py_None; - } else if (bool_arr->Value(i)) { - // True - Py_INCREF(Py_True); - *out_values++ = Py_True; - } else { - // False - Py_INCREF(Py_False); - *out_values++ = Py_False; - } + PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + DateBuilder date_builder(pool_); + RETURN_NOT_OK(date_builder.Resize(length_)); + + Status s; + PyObject* obj; + for (int64_t i = 0; i < length_; ++i) { + obj = objects[i]; + if (PyDate_CheckExact(obj)) { + PyDateTime_Date* pydate = reinterpret_cast(obj); + date_builder.Append(PyDate_to_ms(pydate)); + } else { + date_builder.AppendNull(); } } - return Status::OK(); + return date_builder.Finish(out); } -static void ConvertBooleanNoNulls(const ChunkedArray& data, uint8_t* out_values) { - for (int c = 0; c < data.num_chunks(); c++) { - const std::shared_ptr arr = data.chunk(c); - auto bool_arr = static_cast(arr.get()); - for (int64_t i = 0; i < arr->length(); ++i) { - *out_values++ = static_cast(bool_arr->Value(i)); - } +Status PandasConverter::ConvertObjectStrings(std::shared_ptr* out) { + PyAcquireGIL lock; + + // The output type at this point is inconclusive because there may be bytes + // and unicode mixed in the object array + + PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + StringBuilder string_builder(pool_); + RETURN_NOT_OK(string_builder.Resize(length_)); + + Status s; + bool have_bytes = false; + RETURN_NOT_OK(AppendObjectStrings(string_builder, objects, length_, &have_bytes)); + RETURN_NOT_OK(string_builder.Finish(out)); + + if (have_bytes) { + const auto& arr = static_cast(*out->get()); + *out = std::make_shared(arr.length(), arr.value_offsets(), arr.data(), + arr.null_bitmap(), arr.null_count()); } + return Status::OK(); } -template -inline Status ConvertBinaryLike(const ChunkedArray& data, PyObject** out_values) { +Status PandasConverter::ConvertBooleans(std::shared_ptr* out) { PyAcquireGIL lock; - for (int c = 0; c < data.num_chunks(); c++) { - auto arr = static_cast(data.chunk(c).get()); - const uint8_t* data_ptr; - int32_t length; - const bool has_nulls = data.null_count() > 0; - for (int64_t i = 0; i < arr->length(); ++i) { - if (has_nulls && arr->IsNull(i)) { - Py_INCREF(Py_None); - *out_values = Py_None; - } else { - data_ptr = arr->GetValue(i, &length); - *out_values = WrapBytes::Wrap(data_ptr, length); - if (*out_values == nullptr) { - PyErr_Clear(); - std::stringstream ss; - ss << "Wrapping " - << std::string(reinterpret_cast(data_ptr), length) << " failed"; - return Status::UnknownError(ss.str()); - } - } - ++out_values; + PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + + int nbytes = BitUtil::BytesForBits(length_); + auto data = std::make_shared(pool_); + RETURN_NOT_OK(data->Resize(nbytes)); + uint8_t* bitmap = data->mutable_data(); + memset(bitmap, 0, nbytes); + + int64_t null_count = 0; + for (int64_t i = 0; i < length_; ++i) { + if (objects[i] == Py_True) { + BitUtil::SetBit(bitmap, i); + BitUtil::SetBit(null_bitmap_data_, i); + } else if (objects[i] != Py_False) { + ++null_count; + } else { + BitUtil::SetBit(null_bitmap_data_, i); } } + + *out = std::make_shared(length_, data, null_bitmap_, null_count); + return Status::OK(); } -template -inline Status ConvertListsLike( - const std::shared_ptr& col, PyObject** out_values) { - const ChunkedArray& data = *col->data().get(); - auto list_type = std::static_pointer_cast(col->type()); +Status 
PandasConverter::ConvertObjects(std::shared_ptr* out) { + // Python object arrays are annoying, since we could have one of: + // + // * Strings + // * Booleans with nulls + // * Mixed type (not supported at the moment by arrow format) + // + // Additionally, nulls may be encoded either as np.nan or None. So we have to + // do some type inference and conversion - // Get column of underlying value arrays - std::vector> value_arrays; - for (int c = 0; c < data.num_chunks(); c++) { - auto arr = std::static_pointer_cast(data.chunk(c)); - value_arrays.emplace_back(arr->values()); - } - auto flat_column = std::make_shared(list_type->value_field(), value_arrays); - // TODO(ARROW-489): Currently we don't have a Python reference for single columns. - // Storing a reference to the whole Array would be to expensive. - PyObject* numpy_array; - RETURN_NOT_OK(ConvertColumnToPandas(flat_column, nullptr, &numpy_array)); + RETURN_NOT_OK(InitNullBitmap()); - PyAcquireGIL lock; + // TODO: mask not supported here + if (mask_ != nullptr) { + return Status::NotImplemented("mask not supported in object conversions yet"); + } - for (int c = 0; c < data.num_chunks(); c++) { - auto arr = std::static_pointer_cast(data.chunk(c)); + const PyObject** objects; + { + PyAcquireGIL lock; + objects = reinterpret_cast(PyArray_DATA(arr_)); + PyDateTime_IMPORT; + } - const uint8_t* data_ptr; - const bool has_nulls = data.null_count() > 0; - for (int64_t i = 0; i < arr->length(); ++i) { - if (has_nulls && arr->IsNull(i)) { - Py_INCREF(Py_None); - *out_values = Py_None; + if (type_) { + switch (type_->type) { + case Type::STRING: + return ConvertObjectStrings(out); + case Type::BOOL: + return ConvertBooleans(out); + case Type::DATE: + return ConvertDates(out); + case Type::LIST: { + const auto& list_field = static_cast(*type_); + return ConvertLists(list_field.value_field()->type, out); + } + default: + return Status::TypeError("No known conversion to Arrow type"); + } + } else { + for (int64_t i = 0; i < length_; ++i) { + if (PyObject_is_null(objects[i])) { + continue; + } else if (PyObject_is_string(objects[i])) { + return ConvertObjectStrings(out); + } else if (PyBool_Check(objects[i])) { + return ConvertBooleans(out); + } else if (PyDate_CheckExact(objects[i])) { + return ConvertDates(out); } else { - PyObject* start = PyLong_FromLong(arr->value_offset(i)); - PyObject* end = PyLong_FromLong(arr->value_offset(i + 1)); - PyObject* slice = PySlice_New(start, end, NULL); - *out_values = PyObject_GetItem(numpy_array, slice); - Py_DECREF(start); - Py_DECREF(end); - Py_DECREF(slice); + return Status::TypeError("unhandled python type"); } - ++out_values; } } - Py_XDECREF(numpy_array); - return Status::OK(); + return Status::TypeError("Unable to infer type of object array, were all null"); } -template -inline void ConvertNumericNullable(const ChunkedArray& data, T na_value, T* out_values) { - for (int c = 0; c < data.num_chunks(); c++) { - const std::shared_ptr arr = data.chunk(c); - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); +template +inline Status PandasConverter::ConvertTypedLists( + const std::shared_ptr& type, std::shared_ptr* out) { + typedef npy_traits traits; + typedef typename traits::value_type T; + typedef typename traits::BuilderClass BuilderT; - const uint8_t* valid_bits = arr->null_bitmap_data(); + PyAcquireGIL lock; - if (arr->null_count() > 0) { - for (int64_t i = 0; i < arr->length(); ++i) { - *out_values++ = BitUtil::BitNotSet(valid_bits, i) ? 
na_value : in_values[i]; + auto value_builder = std::make_shared(pool_, type); + ListBuilder list_builder(pool_, value_builder); + PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + for (int64_t i = 0; i < length_; ++i) { + if (PyObject_is_null(objects[i])) { + RETURN_NOT_OK(list_builder.AppendNull()); + } else if (PyArray_Check(objects[i])) { + auto numpy_array = reinterpret_cast(objects[i]); + RETURN_NOT_OK(list_builder.Append(true)); + + // TODO(uwe): Support more complex numpy array structures + RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, ITEM_TYPE)); + + int64_t size = PyArray_DIM(numpy_array, 0); + auto data = reinterpret_cast(PyArray_DATA(numpy_array)); + if (traits::supports_nulls) { + null_bitmap_->Resize(size, false); + // TODO(uwe): A bitmap would be more space-efficient but the Builder API doesn't + // currently support this. + // ValuesToBitmap(data, size, null_bitmap_->mutable_data()); + ValuesToValidBytes(data, size, null_bitmap_->mutable_data()); + RETURN_NOT_OK(value_builder->Append(data, size, null_bitmap_->data())); + } else { + RETURN_NOT_OK(value_builder->Append(data, size)); + } + + } else if (PyList_Check(objects[i])) { + int64_t size; + std::shared_ptr inferred_type; + RETURN_NOT_OK(list_builder.Append(true)); + RETURN_NOT_OK(InferArrowType(objects[i], &size, &inferred_type)); + if (inferred_type->type != type->type) { + std::stringstream ss; + ss << inferred_type->ToString() << " cannot be converted to " << type->ToString(); + return Status::TypeError(ss.str()); } + RETURN_NOT_OK(AppendPySequence(objects[i], type, value_builder)); } else { - memcpy(out_values, in_values, sizeof(T) * arr->length()); - out_values += arr->length(); + return Status::TypeError("Unsupported Python type for list items"); } } + return list_builder.Finish(out); } -template -inline void ConvertNumericNullableCast( - const ChunkedArray& data, OutType na_value, OutType* out_values) { - for (int c = 0; c < data.num_chunks(); c++) { - const std::shared_ptr arr = data.chunk(c); - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); +template <> +inline Status PandasConverter::ConvertTypedLists( + const std::shared_ptr& type, std::shared_ptr* out) { + PyAcquireGIL lock; + // TODO: If there are bytes involed, convert to Binary representation + bool have_bytes = false; - for (int64_t i = 0; i < arr->length(); ++i) { - *out_values++ = arr->IsNull(i) ? 
na_value : static_cast(in_values[i]); + auto value_builder = std::make_shared(pool_); + ListBuilder list_builder(pool_, value_builder); + PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + for (int64_t i = 0; i < length_; ++i) { + if (PyObject_is_null(objects[i])) { + RETURN_NOT_OK(list_builder.AppendNull()); + } else if (PyArray_Check(objects[i])) { + auto numpy_array = reinterpret_cast(objects[i]); + RETURN_NOT_OK(list_builder.Append(true)); + + // TODO(uwe): Support more complex numpy array structures + RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, NPY_OBJECT)); + + int64_t size = PyArray_DIM(numpy_array, 0); + auto data = reinterpret_cast(PyArray_DATA(numpy_array)); + RETURN_NOT_OK(AppendObjectStrings(*value_builder.get(), data, size, &have_bytes)); + } else if (PyList_Check(objects[i])) { + int64_t size; + std::shared_ptr inferred_type; + RETURN_NOT_OK(list_builder.Append(true)); + RETURN_NOT_OK(InferArrowType(objects[i], &size, &inferred_type)); + if (inferred_type->type != Type::STRING) { + std::stringstream ss; + ss << inferred_type->ToString() << " cannot be converted to STRING."; + return Status::TypeError(ss.str()); + } + RETURN_NOT_OK(AppendPySequence(objects[i], inferred_type, value_builder)); + } else { + return Status::TypeError("Unsupported Python type for list items"); } } + return list_builder.Finish(out); } -template -inline void ConvertDates(const ChunkedArray& data, T na_value, T* out_values) { - for (int c = 0; c < data.num_chunks(); c++) { - const std::shared_ptr arr = data.chunk(c); - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); - - for (int64_t i = 0; i < arr->length(); ++i) { - // There are 1000 * 60 * 60 * 24 = 86400000ms in a day - *out_values++ = arr->IsNull(i) ? na_value : in_values[i] / 86400000; - } +#define LIST_CASE(TYPE, NUMPY_TYPE, ArrowType) \ + case Type::TYPE: { \ + return ConvertTypedLists(type, out); \ + } + +Status PandasConverter::ConvertLists( + const std::shared_ptr& type, std::shared_ptr* out) { + switch (type->type) { + LIST_CASE(UINT8, NPY_UINT8, UInt8Type) + LIST_CASE(INT8, NPY_INT8, Int8Type) + LIST_CASE(UINT16, NPY_UINT16, UInt16Type) + LIST_CASE(INT16, NPY_INT16, Int16Type) + LIST_CASE(UINT32, NPY_UINT32, UInt32Type) + LIST_CASE(INT32, NPY_INT32, Int32Type) + LIST_CASE(UINT64, NPY_UINT64, UInt64Type) + LIST_CASE(INT64, NPY_INT64, Int64Type) + LIST_CASE(TIMESTAMP, NPY_DATETIME, TimestampType) + LIST_CASE(FLOAT, NPY_FLOAT, FloatType) + LIST_CASE(DOUBLE, NPY_DOUBLE, DoubleType) + LIST_CASE(STRING, NPY_OBJECT, StringType) + default: + return Status::TypeError("Unknown list item type"); } + + return Status::TypeError("Unknown list type"); } -template -inline void ConvertDatetimeNanos(const ChunkedArray& data, int64_t* out_values) { - for (int c = 0; c < data.num_chunks(); c++) { - const std::shared_ptr arr = data.chunk(c); - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); +Status PandasToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo, + const std::shared_ptr& type, std::shared_ptr* out) { + PandasConverter converter(pool, ao, mo, type); + return converter.Convert(out); +} - for (int64_t i = 0; i < arr->length(); ++i) { - *out_values++ = arr->IsNull(i) ? 
kPandasTimestampNull
-                                     : (static_cast(in_values[i]) * SHIFT);
-    }
-  }
-}

+Status PandasToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo,
+    const std::shared_ptr& type, std::shared_ptr* out) {
+  PandasConverter converter(pool, ao, mo, type);
+  return converter.Convert(out);
+}
+
+Status PandasObjectsToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo,
+    const std::shared_ptr& type, std::shared_ptr* out) {
+  PandasConverter converter(pool, ao, mo, type);
+  return converter.ConvertObjects(out);
+}
+
+Status PandasDtypeToArrow(PyObject* dtype, std::shared_ptr* out) {
+  PyArray_Descr* descr = reinterpret_cast(dtype);
+
+  int type_num = cast_npy_type_compat(descr->type_num);
+
+#define TO_ARROW_TYPE_CASE(NPY_NAME, FACTORY) \
+  case NPY_##NPY_NAME:                        \
+    *out = FACTORY();                         \
+    break;
+
+  switch (type_num) {
+    TO_ARROW_TYPE_CASE(BOOL, boolean);
+    TO_ARROW_TYPE_CASE(INT8, int8);
+    TO_ARROW_TYPE_CASE(INT16, int16);
+    TO_ARROW_TYPE_CASE(INT32, int32);
+    TO_ARROW_TYPE_CASE(INT64, int64);
+#if (NPY_INT64 != NPY_LONGLONG)
+    TO_ARROW_TYPE_CASE(LONGLONG, int64);
+#endif
+    TO_ARROW_TYPE_CASE(UINT8, uint8);
+    TO_ARROW_TYPE_CASE(UINT16, uint16);
+    TO_ARROW_TYPE_CASE(UINT32, uint32);
+    TO_ARROW_TYPE_CASE(UINT64, uint64);
+#if (NPY_UINT64 != NPY_ULONGLONG)
+    TO_ARROW_TYPE_CASE(ULONGLONG, uint64);
+#endif
+    TO_ARROW_TYPE_CASE(FLOAT32, float32);
+    TO_ARROW_TYPE_CASE(FLOAT64, float64);
+    case NPY_DATETIME: {
+      auto date_dtype =
+          reinterpret_cast(descr->c_metadata);
+      TimeUnit unit;
+      switch (date_dtype->meta.base) {
+        case NPY_FR_s:
+          unit = TimeUnit::SECOND;
+          break;
+        case NPY_FR_ms:
+          unit = TimeUnit::MILLI;
+          break;
+        case NPY_FR_us:
+          unit = TimeUnit::MICRO;
+          break;
+        case NPY_FR_ns:
+          unit = TimeUnit::NANO;
+          break;
+        default:
+          return Status::NotImplemented("Unsupported datetime64 time unit");
+      }
+      *out = timestamp(unit);
+    } break;
+    default: {
+      std::stringstream ss;
+      ss << "Unsupported numpy type " << descr->type_num << std::endl;
+      return Status::NotImplemented(ss.str());
+    }
+  }
+
+#undef TO_ARROW_TYPE_CASE
+
+  return Status::OK();
+}

+// ----------------------------------------------------------------------
+// pandas 0.x DataFrame conversion internals
+
+inline void set_numpy_metadata(int type, DataType* datatype, PyArrayObject* out) {
+  if (type == NPY_DATETIME) {
+    PyArray_Descr* descr = PyArray_DESCR(out);
+    auto date_dtype = reinterpret_cast(descr->c_metadata);
+    if (datatype->type == Type::TIMESTAMP) {
+      auto timestamp_type = static_cast(datatype);
+
+      switch (timestamp_type->unit) {
+        case TimestampType::Unit::SECOND:
+          date_dtype->meta.base = NPY_FR_s;
+          break;
+        case TimestampType::Unit::MILLI:
+          date_dtype->meta.base = NPY_FR_ms;
+          break;
+        case TimestampType::Unit::MICRO:
+          date_dtype->meta.base = NPY_FR_us;
+          break;
+        case TimestampType::Unit::NANO:
+          date_dtype->meta.base = NPY_FR_ns;
+          break;
+      }
+    } else {
+      // datatype->type == Type::DATE
+      date_dtype->meta.base = NPY_FR_D;
+    }
+  }
+}
+
 class PandasBlock {
  public:
   enum type {
@@ -688,10 +823,219 @@ class PandasBlock {
   DISALLOW_COPY_AND_ASSIGN(PandasBlock);
 };

-#define CONVERTLISTSLIKE_CASE(ArrowType, ArrowEnum) \
-  case Type::ArrowEnum:                             \
-    RETURN_NOT_OK((ConvertListsLike<::arrow::ArrowType>(col, out_buffer))); \
-    break;
+template
+inline void ConvertIntegerWithNulls(const ChunkedArray& data, double* out_values) {
+  for (int c = 0; c < data.num_chunks(); c++) {
+    const std::shared_ptr arr = data.chunk(c);
+    auto prim_arr = static_cast(arr.get());
+    auto in_values = reinterpret_cast(prim_arr->data()->data());
+    // Upcast to double, set NaN as appropriate
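+    // (Pandas itself has no integer NA sentinel, so a nullable integer
+    // column is materialized as float64 with NaN standing in for nulls.)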
+
+    for (int i = 0; i < arr->length(); ++i) {
+      *out_values++ = prim_arr->IsNull(i) ? NAN : in_values[i];
+    }
+  }
+}
+
+template
+inline void ConvertIntegerNoNullsSameType(const ChunkedArray& data, T* out_values) {
+  for (int c = 0; c < data.num_chunks(); c++) {
+    const std::shared_ptr arr = data.chunk(c);
+    auto prim_arr = static_cast(arr.get());
+    auto in_values = reinterpret_cast(prim_arr->data()->data());
+    memcpy(out_values, in_values, sizeof(T) * arr->length());
+    out_values += arr->length();
+  }
+}
+
+template
+inline void ConvertIntegerNoNullsCast(const ChunkedArray& data, OutType* out_values) {
+  for (int c = 0; c < data.num_chunks(); c++) {
+    const std::shared_ptr arr = data.chunk(c);
+    auto prim_arr = static_cast(arr.get());
+    auto in_values = reinterpret_cast(prim_arr->data()->data());
+    for (int64_t i = 0; i < arr->length(); ++i) {
+      *out_values++ = in_values[i];
+    }
+  }
+}
+
+static Status ConvertBooleanWithNulls(const ChunkedArray& data, PyObject** out_values) {
+  PyAcquireGIL lock;
+  for (int c = 0; c < data.num_chunks(); c++) {
+    const std::shared_ptr arr = data.chunk(c);
+    auto bool_arr = static_cast(arr.get());
+
+    for (int64_t i = 0; i < arr->length(); ++i) {
+      if (bool_arr->IsNull(i)) {
+        Py_INCREF(Py_None);
+        *out_values++ = Py_None;
+      } else if (bool_arr->Value(i)) {
+        // True
+        Py_INCREF(Py_True);
+        *out_values++ = Py_True;
+      } else {
+        // False
+        Py_INCREF(Py_False);
+        *out_values++ = Py_False;
+      }
+    }
+  }
+  return Status::OK();
+}
+
+static void ConvertBooleanNoNulls(const ChunkedArray& data, uint8_t* out_values) {
+  for (int c = 0; c < data.num_chunks(); c++) {
+    const std::shared_ptr arr = data.chunk(c);
+    auto bool_arr = static_cast(arr.get());
+    for (int64_t i = 0; i < arr->length(); ++i) {
+      *out_values++ = static_cast(bool_arr->Value(i));
+    }
+  }
+}
+
+template
+inline Status ConvertBinaryLike(const ChunkedArray& data, PyObject** out_values) {
+  PyAcquireGIL lock;
+  for (int c = 0; c < data.num_chunks(); c++) {
+    auto arr = static_cast(data.chunk(c).get());
+
+    const uint8_t* data_ptr;
+    int32_t length;
+    const bool has_nulls = data.null_count() > 0;
+    for (int64_t i = 0; i < arr->length(); ++i) {
+      if (has_nulls && arr->IsNull(i)) {
+        Py_INCREF(Py_None);
+        *out_values = Py_None;
+      } else {
+        data_ptr = arr->GetValue(i, &length);
+        *out_values = WrapBytes::Wrap(data_ptr, length);
+        if (*out_values == nullptr) {
+          PyErr_Clear();
+          std::stringstream ss;
+          ss << "Wrapping "
+             << std::string(reinterpret_cast(data_ptr), length) << " failed";
+          return Status::UnknownError(ss.str());
+        }
+      }
+      ++out_values;
+    }
+  }
+  return Status::OK();
+}
+
+template
+inline Status ConvertListsLike(
+    const std::shared_ptr& col, PyObject** out_values) {
+  const ChunkedArray& data = *col->data().get();
+  auto list_type = std::static_pointer_cast(col->type());
+
+  // Get column of underlying value arrays
+  std::vector> value_arrays;
+  for (int c = 0; c < data.num_chunks(); c++) {
+    auto arr = std::static_pointer_cast(data.chunk(c));
+    value_arrays.emplace_back(arr->values());
+  }
+  auto flat_column = std::make_shared(list_type->value_field(), value_arrays);
+  // TODO(ARROW-489): Currently we don't have a Python reference for single columns.
+  // Storing a reference to the whole Array would be too expensive.
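+  //
+  // Illustrative example (assumed data, not taken from this change): for a
+  // list column [[1, 2], null, [3]], the child values are first flattened
+  // into a single converted NumPy array,
+  //
+  //   flat = np.array([1, 2, 3])
+  //
+  // and each output row is then emitted as a slice of that array using the
+  // list's value offsets: row 0 -> flat[0:2], row 1 -> None,
+  // row 2 -> flat[2:3]. The PySlice_New loop below implements this.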
+ PyObject* numpy_array; + RETURN_NOT_OK(ConvertColumnToPandas(flat_column, nullptr, &numpy_array)); + + PyAcquireGIL lock; + + for (int c = 0; c < data.num_chunks(); c++) { + auto arr = std::static_pointer_cast(data.chunk(c)); + + const uint8_t* data_ptr; + const bool has_nulls = data.null_count() > 0; + for (int64_t i = 0; i < arr->length(); ++i) { + if (has_nulls && arr->IsNull(i)) { + Py_INCREF(Py_None); + *out_values = Py_None; + } else { + PyObject* start = PyLong_FromLong(arr->value_offset(i)); + PyObject* end = PyLong_FromLong(arr->value_offset(i + 1)); + PyObject* slice = PySlice_New(start, end, NULL); + *out_values = PyObject_GetItem(numpy_array, slice); + Py_DECREF(start); + Py_DECREF(end); + Py_DECREF(slice); + } + ++out_values; + } + } + + Py_XDECREF(numpy_array); + return Status::OK(); +} + +template +inline void ConvertNumericNullable(const ChunkedArray& data, T na_value, T* out_values) { + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + + const uint8_t* valid_bits = arr->null_bitmap_data(); + + if (arr->null_count() > 0) { + for (int64_t i = 0; i < arr->length(); ++i) { + *out_values++ = BitUtil::BitNotSet(valid_bits, i) ? na_value : in_values[i]; + } + } else { + memcpy(out_values, in_values, sizeof(T) * arr->length()); + out_values += arr->length(); + } + } +} + +template +inline void ConvertNumericNullableCast( + const ChunkedArray& data, OutType na_value, OutType* out_values) { + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + + for (int64_t i = 0; i < arr->length(); ++i) { + *out_values++ = arr->IsNull(i) ? na_value : static_cast(in_values[i]); + } + } +} + +template +inline void ConvertDates(const ChunkedArray& data, T na_value, T* out_values) { + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + + for (int64_t i = 0; i < arr->length(); ++i) { + // There are 1000 * 60 * 60 * 24 = 86400000ms in a day + *out_values++ = arr->IsNull(i) ? na_value : in_values[i] / 86400000; + } + } +} + +template +inline void ConvertDatetimeNanos(const ChunkedArray& data, int64_t* out_values) { + for (int c = 0; c < data.num_chunks(); c++) { + const std::shared_ptr arr = data.chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + + for (int64_t i = 0; i < arr->length(); ++i) { + *out_values++ = arr->IsNull(i) ? 
kPandasTimestampNull + : (static_cast(in_values[i]) * SHIFT); + } + } +} + +#define CONVERTLISTSLIKE_CASE(ArrowType, ArrowEnum) \ + case Type::ArrowEnum: \ + RETURN_NOT_OK((ConvertListsLike(col, out_buffer))); \ + break; class ObjectBlock : public PandasBlock { public: @@ -712,9 +1056,9 @@ class ObjectBlock : public PandasBlock { if (type == Type::BOOL) { RETURN_NOT_OK(ConvertBooleanWithNulls(data, out_buffer)); } else if (type == Type::BINARY) { - RETURN_NOT_OK(ConvertBinaryLike(data, out_buffer)); + RETURN_NOT_OK(ConvertBinaryLike(data, out_buffer)); } else if (type == Type::STRING) { - RETURN_NOT_OK(ConvertBinaryLike(data, out_buffer)); + RETURN_NOT_OK(ConvertBinaryLike(data, out_buffer)); } else if (type == Type::LIST) { auto list_type = std::static_pointer_cast(col->type()); switch (list_type->value_type()->type) { @@ -880,8 +1224,8 @@ class DatetimeBlock : public PandasBlock { public: using PandasBlock::PandasBlock; - Status Allocate() override { - RETURN_NOT_OK(AllocateNDArray(NPY_DATETIME)); + Status AllocateDatetime(int ndim) { + RETURN_NOT_OK(AllocateNDArray(NPY_DATETIME, ndim)); PyAcquireGIL lock; auto date_dtype = reinterpret_cast( @@ -890,6 +1234,8 @@ class DatetimeBlock : public PandasBlock { return Status::OK(); } + Status Allocate() override { return AllocateDatetime(2); } + Status Write(const std::shared_ptr& col, int64_t abs_placement, int64_t rel_placement) override { Type::type type = col->type()->type; @@ -904,15 +1250,15 @@ class DatetimeBlock : public PandasBlock { // TODO(wesm): Do we want to make sure to zero out the milliseconds? ConvertDatetimeNanos(data, out_buffer); } else if (type == Type::TIMESTAMP) { - auto ts_type = static_cast(col->type().get()); + auto ts_type = static_cast(col->type().get()); - if (ts_type->unit == arrow::TimeUnit::NANO) { + if (ts_type->unit == TimeUnit::NANO) { ConvertNumericNullable(data, kPandasTimestampNull, out_buffer); - } else if (ts_type->unit == arrow::TimeUnit::MICRO) { + } else if (ts_type->unit == TimeUnit::MICRO) { ConvertDatetimeNanos(data, out_buffer); - } else if (ts_type->unit == arrow::TimeUnit::MILLI) { + } else if (ts_type->unit == TimeUnit::MILLI) { ConvertDatetimeNanos(data, out_buffer); - } else if (ts_type->unit == arrow::TimeUnit::SECOND) { + } else if (ts_type->unit == TimeUnit::SECOND) { ConvertDatetimeNanos(data, out_buffer); } else { return Status::NotImplemented("Unsupported time unit"); @@ -931,6 +1277,9 @@ class DatetimeTZBlock : public DatetimeBlock { DatetimeTZBlock(const std::string& timezone, int64_t num_rows) : DatetimeBlock(num_rows, 1), timezone_(timezone) {} + // Like Categorical, the internal ndarray is 1-dimensional + Status Allocate() override { return AllocateDatetime(1); } + Status GetPyResult(PyObject** output) override { PyObject* result = PyDict_New(); RETURN_IF_PYERROR(); @@ -977,9 +1326,8 @@ class CategoricalBlock : public PandasBlock { for (int c = 0; c < data.num_chunks(); c++) { const std::shared_ptr arr = data.chunk(c); - const auto& dict_arr = static_cast(*arr); - const auto& indices = - static_cast(*dict_arr.indices()); + const auto& dict_arr = static_cast(*arr); + const auto& indices = static_cast(*dict_arr.indices()); auto in_values = reinterpret_cast(indices.data()->data()); // Null is -1 in CategoricalBlock @@ -1046,28 +1394,6 @@ Status MakeBlock(PandasBlock::type type, int64_t num_rows, int num_columns, return (*block)->Allocate(); } -static inline bool ListTypeSupported(const Type::type type_id) { - switch (type_id) { - case Type::UINT8: - case Type::INT8: - case 
Type::UINT16: - case Type::INT16: - case Type::UINT32: - case Type::INT32: - case Type::INT64: - case Type::UINT64: - case Type::FLOAT: - case Type::DOUBLE: - case Type::STRING: - case Type::TIMESTAMP: - // The above types are all supported. - return true; - default: - break; - } - return false; -} - static inline Status MakeCategoricalBlock(const std::shared_ptr& type, int64_t num_rows, std::shared_ptr* block) { // All categoricals become a block with a single column @@ -1168,7 +1494,7 @@ class DataFrameBlockCreator { output_type = PandasBlock::DATETIME; break; case Type::TIMESTAMP: { - const auto& ts_type = static_cast(*col->type()); + const auto& ts_type = static_cast(*col->type()); if (ts_type.timezone != "") { output_type = PandasBlock::DATETIME_WITH_TZ; } else { @@ -1182,636 +1508,165 @@ class DataFrameBlockCreator { ss << "Not implemented type for lists: " << list_type->value_type()->ToString(); return Status::NotImplemented(ss.str()); - } - output_type = PandasBlock::OBJECT; - } break; - case Type::DICTIONARY: - output_type = PandasBlock::CATEGORICAL; - break; - default: - return Status::NotImplemented(col->type()->ToString()); - } - - int block_placement = 0; - std::shared_ptr block; - if (output_type == PandasBlock::CATEGORICAL) { - RETURN_NOT_OK(MakeCategoricalBlock(col->type(), table_->num_rows(), &block)); - categorical_blocks_[i] = block; - } else if (output_type == PandasBlock::DATETIME_WITH_TZ) { - const auto& ts_type = static_cast(*col->type()); - block = std::make_shared(ts_type.timezone, table_->num_rows()); - RETURN_NOT_OK(block->Allocate()); - datetimetz_blocks_[i] = block; - } else { - auto it = type_counts_.find(output_type); - if (it != type_counts_.end()) { - block_placement = it->second; - // Increment count - it->second += 1; - } else { - // Add key to map - type_counts_[output_type] = 1; - } - } - - column_types_[i] = output_type; - column_block_placement_[i] = block_placement; - } - - // Create normal non-categorical blocks - for (const auto& it : type_counts_) { - PandasBlock::type type = static_cast(it.first); - std::shared_ptr block; - RETURN_NOT_OK(MakeBlock(type, table_->num_rows(), it.second, &block)); - blocks_[type] = block; - } - return Status::OK(); - } - - Status WriteTableToBlocks(int nthreads) { - auto WriteColumn = [this](int i) { - std::shared_ptr col = this->table_->column(i); - PandasBlock::type output_type = this->column_types_[i]; - - int rel_placement = this->column_block_placement_[i]; - - std::shared_ptr block; - if (output_type == PandasBlock::CATEGORICAL) { - auto it = this->categorical_blocks_.find(i); - if (it == this->blocks_.end()) { - return Status::KeyError("No categorical block allocated"); - } - block = it->second; - } else { - auto it = this->blocks_.find(output_type); - if (it == this->blocks_.end()) { return Status::KeyError("No block allocated"); } - block = it->second; - } - return block->Write(col, i, rel_placement); - }; - - nthreads = std::min(nthreads, table_->num_columns()); - - if (nthreads == 1) { - for (int i = 0; i < table_->num_columns(); ++i) { - RETURN_NOT_OK(WriteColumn(i)); - } - } else { - std::vector thread_pool; - thread_pool.reserve(nthreads); - std::atomic task_counter(0); - - std::mutex error_mtx; - bool error_occurred = false; - Status error; - - for (int thread_id = 0; thread_id < nthreads; ++thread_id) { - thread_pool.emplace_back( - [this, &error, &error_occurred, &error_mtx, &task_counter, &WriteColumn]() { - int column_num; - while (!error_occurred) { - column_num = task_counter.fetch_add(1); - if 
(column_num >= this->table_->num_columns()) { break; } - Status s = WriteColumn(column_num); - if (!s.ok()) { - std::lock_guard lock(error_mtx); - error_occurred = true; - error = s; - break; - } - } - }); - } - for (auto&& thread : thread_pool) { - thread.join(); - } - - if (error_occurred) { return error; } - } - return Status::OK(); - } - - Status AppendBlocks(const BlockMap& blocks, PyObject* list) { - for (const auto& it : blocks) { - PyObject* item; - RETURN_NOT_OK(it.second->GetPyResult(&item)); - if (PyList_Append(list, item) < 0) { RETURN_IF_PYERROR(); } - } - return Status::OK(); - } - - Status GetResultList(PyObject** out) { - PyAcquireGIL lock; - - PyObject* result = PyList_New(0); - RETURN_IF_PYERROR(); - - RETURN_NOT_OK(AppendBlocks(blocks_, result)); - RETURN_NOT_OK(AppendBlocks(categorical_blocks_, result)); - RETURN_NOT_OK(AppendBlocks(datetimetz_blocks_, result)); - - *out = result; - return Status::OK(); - } - - private: - std::shared_ptr table_; - - // column num -> block type id - std::vector column_types_; - - // column num -> relative placement within internal block - std::vector column_block_placement_; - - // block type -> type count - std::unordered_map type_counts_; - - // block type -> block - BlockMap blocks_; - - // column number -> categorical block - BlockMap categorical_blocks_; - - // column number -> datetimetz block - BlockMap datetimetz_blocks_; -}; - -Status ConvertTableToPandas( - const std::shared_ptr
& table, int nthreads, PyObject** out) { - DataFrameBlockCreator helper(table); - return helper.Convert(nthreads, out); -} - -// ---------------------------------------------------------------------- -// Serialization - -template -class ArrowSerializer { - public: - ArrowSerializer(arrow::MemoryPool* pool, PyArrayObject* arr, PyArrayObject* mask) - : pool_(pool), arr_(arr), mask_(mask) { - length_ = PyArray_SIZE(arr_); - } - - void IndicateType(const std::shared_ptr field) { field_indicator_ = field; } - - Status Convert(std::shared_ptr* out); - - int stride() const { return PyArray_STRIDES(arr_)[0]; } - - Status InitNullBitmap() { - int null_bytes = BitUtil::BytesForBits(length_); - - null_bitmap_ = std::make_shared(pool_); - RETURN_NOT_OK(null_bitmap_->Resize(null_bytes)); - - null_bitmap_data_ = null_bitmap_->mutable_data(); - memset(null_bitmap_data_, 0, null_bytes); - - return Status::OK(); - } - - bool is_strided() const { - npy_intp* astrides = PyArray_STRIDES(arr_); - return astrides[0] != PyArray_DESCR(arr_)->elsize; - } - - private: - Status ConvertData(); - - Status ConvertDates(std::shared_ptr* out) { - PyAcquireGIL lock; - - PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); - arrow::DateBuilder date_builder(pool_); - RETURN_NOT_OK(date_builder.Resize(length_)); - - Status s; - PyObject* obj; - for (int64_t i = 0; i < length_; ++i) { - obj = objects[i]; - if (PyDate_CheckExact(obj)) { - PyDateTime_Date* pydate = reinterpret_cast(obj); - date_builder.Append(PyDate_to_ms(pydate)); - } else { - date_builder.AppendNull(); - } - } - return date_builder.Finish(out); - } - - Status ConvertObjectStrings(std::shared_ptr* out) { - PyAcquireGIL lock; - - // The output type at this point is inconclusive because there may be bytes - // and unicode mixed in the object array - - PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); - arrow::StringBuilder string_builder(pool_); - RETURN_NOT_OK(string_builder.Resize(length_)); - - Status s; - bool have_bytes = false; - RETURN_NOT_OK(AppendObjectStrings(string_builder, objects, length_, &have_bytes)); - RETURN_NOT_OK(string_builder.Finish(out)); - - if (have_bytes) { - const auto& arr = static_cast(*out->get()); - *out = std::make_shared(arr.length(), arr.value_offsets(), - arr.data(), arr.null_bitmap(), arr.null_count()); - } - return Status::OK(); - } - - Status ConvertBooleans(std::shared_ptr* out) { - PyAcquireGIL lock; - - PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); - - int nbytes = BitUtil::BytesForBits(length_); - auto data = std::make_shared(pool_); - RETURN_NOT_OK(data->Resize(nbytes)); - uint8_t* bitmap = data->mutable_data(); - memset(bitmap, 0, nbytes); - - int64_t null_count = 0; - for (int64_t i = 0; i < length_; ++i) { - if (objects[i] == Py_True) { - BitUtil::SetBit(bitmap, i); - BitUtil::SetBit(null_bitmap_data_, i); - } else if (objects[i] != Py_False) { - ++null_count; - } else { - BitUtil::SetBit(null_bitmap_data_, i); - } - } - - *out = std::make_shared(length_, data, null_bitmap_, null_count); - - return Status::OK(); - } - - template - Status ConvertTypedLists( - const std::shared_ptr& field, std::shared_ptr* out); - -#define LIST_CASE(TYPE, NUMPY_TYPE, ArrowType) \ - case Type::TYPE: { \ - return ConvertTypedLists(field, out); \ - } - - Status ConvertLists(const std::shared_ptr& field, std::shared_ptr* out) { - switch (field->type->type) { - LIST_CASE(UINT8, NPY_UINT8, UInt8Type) - LIST_CASE(INT8, NPY_INT8, Int8Type) - LIST_CASE(UINT16, NPY_UINT16, UInt16Type) - LIST_CASE(INT16, NPY_INT16, 
Int16Type) - LIST_CASE(UINT32, NPY_UINT32, UInt32Type) - LIST_CASE(INT32, NPY_INT32, Int32Type) - LIST_CASE(UINT64, NPY_UINT64, UInt64Type) - LIST_CASE(INT64, NPY_INT64, Int64Type) - LIST_CASE(TIMESTAMP, NPY_DATETIME, TimestampType) - LIST_CASE(FLOAT, NPY_FLOAT, FloatType) - LIST_CASE(DOUBLE, NPY_DOUBLE, DoubleType) - LIST_CASE(STRING, NPY_OBJECT, StringType) - default: - return Status::TypeError("Unknown list item type"); - } - - return Status::TypeError("Unknown list type"); - } - - Status MakeDataType(std::shared_ptr* out); - - arrow::MemoryPool* pool_; - - PyArrayObject* arr_; - PyArrayObject* mask_; - - int64_t length_; - - std::shared_ptr field_indicator_; - std::shared_ptr data_; - std::shared_ptr null_bitmap_; - uint8_t* null_bitmap_data_; -}; - -// Returns null count -static int64_t MaskToBitmap(PyArrayObject* mask, int64_t length, uint8_t* bitmap) { - int64_t null_count = 0; - const uint8_t* mask_values = static_cast(PyArray_DATA(mask)); - // TODO(wesm): strided null mask - for (int i = 0; i < length; ++i) { - if (mask_values[i]) { - ++null_count; - } else { - BitUtil::SetBit(bitmap, i); - } - } - return null_count; -} - -template -inline Status ArrowSerializer::MakeDataType(std::shared_ptr* out) { - out->reset(new typename npy_traits::TypeClass()); - return Status::OK(); -} - -template <> -inline Status ArrowSerializer::MakeDataType( - std::shared_ptr* out) { - PyArray_Descr* descr = PyArray_DESCR(arr_); - auto date_dtype = reinterpret_cast(descr->c_metadata); - arrow::TimestampType::Unit unit; - - switch (date_dtype->meta.base) { - case NPY_FR_s: - unit = arrow::TimestampType::Unit::SECOND; - break; - case NPY_FR_ms: - unit = arrow::TimestampType::Unit::MILLI; - break; - case NPY_FR_us: - unit = arrow::TimestampType::Unit::MICRO; - break; - case NPY_FR_ns: - unit = arrow::TimestampType::Unit::NANO; - break; - default: - return Status::Invalid("Unknown NumPy datetime unit"); - } - - out->reset(new arrow::TimestampType(unit)); - return Status::OK(); -} - -template -inline Status ArrowSerializer::Convert(std::shared_ptr* out) { - typedef npy_traits traits; - - if (mask_ != nullptr || traits::supports_nulls) { RETURN_NOT_OK(InitNullBitmap()); } - - int64_t null_count = 0; - if (mask_ != nullptr) { - null_count = MaskToBitmap(mask_, length_, null_bitmap_data_); - } else if (traits::supports_nulls) { - null_count = ValuesToBitmap(PyArray_DATA(arr_), length_, null_bitmap_data_); - } - - RETURN_NOT_OK(ConvertData()); - std::shared_ptr type; - RETURN_NOT_OK(MakeDataType(&type)); - - std::vector fields(1); - fields[0].length = length_; - fields[0].null_count = null_count; - fields[0].offset = 0; - - return arrow::LoadArray(type, fields, {null_bitmap_, data_}, out); -} - -template <> -inline Status ArrowSerializer::Convert(std::shared_ptr* out) { - // Python object arrays are annoying, since we could have one of: - // - // * Strings - // * Booleans with nulls - // * Mixed type (not supported at the moment by arrow format) - // - // Additionally, nulls may be encoded either as np.nan or None. 
So we have to - // do some type inference and conversion - - RETURN_NOT_OK(InitNullBitmap()); - - // TODO: mask not supported here - const PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); - { - PyAcquireGIL lock; - PyDateTime_IMPORT; - } - - if (field_indicator_) { - switch (field_indicator_->type->type) { - case Type::STRING: - return ConvertObjectStrings(out); - case Type::BOOL: - return ConvertBooleans(out); - case Type::DATE: - return ConvertDates(out); - case Type::LIST: { - auto list_field = static_cast(field_indicator_->type.get()); - return ConvertLists(list_field->value_field(), out); + } + output_type = PandasBlock::OBJECT; + } break; + case Type::DICTIONARY: + output_type = PandasBlock::CATEGORICAL; + break; + default: + return Status::NotImplemented(col->type()->ToString()); } - default: - return Status::TypeError("No known conversion to Arrow type"); - } - } else { - for (int64_t i = 0; i < length_; ++i) { - if (PyObject_is_null(objects[i])) { - continue; - } else if (PyObject_is_string(objects[i])) { - return ConvertObjectStrings(out); - } else if (PyBool_Check(objects[i])) { - return ConvertBooleans(out); - } else if (PyDate_CheckExact(objects[i])) { - return ConvertDates(out); + + int block_placement = 0; + std::shared_ptr block; + if (output_type == PandasBlock::CATEGORICAL) { + RETURN_NOT_OK(MakeCategoricalBlock(col->type(), table_->num_rows(), &block)); + categorical_blocks_[i] = block; + } else if (output_type == PandasBlock::DATETIME_WITH_TZ) { + const auto& ts_type = static_cast(*col->type()); + block = std::make_shared(ts_type.timezone, table_->num_rows()); + RETURN_NOT_OK(block->Allocate()); + datetimetz_blocks_[i] = block; } else { - return Status::TypeError("unhandled python type"); + auto it = type_counts_.find(output_type); + if (it != type_counts_.end()) { + block_placement = it->second; + // Increment count + it->second += 1; + } else { + // Add key to map + type_counts_[output_type] = 1; + } } + + column_types_[i] = output_type; + column_block_placement_[i] = block_placement; + } + + // Create normal non-categorical blocks + for (const auto& it : type_counts_) { + PandasBlock::type type = static_cast(it.first); + std::shared_ptr block; + RETURN_NOT_OK(MakeBlock(type, table_->num_rows(), it.second, &block)); + blocks_[type] = block; } + return Status::OK(); } - return Status::TypeError("Unable to infer type of object array, were all null"); -} + Status WriteTableToBlocks(int nthreads) { + auto WriteColumn = [this](int i) { + std::shared_ptr col = this->table_->column(i); + PandasBlock::type output_type = this->column_types_[i]; -template -inline Status ArrowSerializer::ConvertData() { - // TODO(wesm): strided arrays - if (is_strided()) { return Status::Invalid("no support for strided data yet"); } + int rel_placement = this->column_block_placement_[i]; - data_ = std::make_shared(arr_); - return Status::OK(); -} + std::shared_ptr block; + if (output_type == PandasBlock::CATEGORICAL) { + auto it = this->categorical_blocks_.find(i); + if (it == this->blocks_.end()) { + return Status::KeyError("No categorical block allocated"); + } + block = it->second; + } else if (output_type == PandasBlock::DATETIME_WITH_TZ) { + auto it = this->datetimetz_blocks_.find(i); + if (it == this->datetimetz_blocks_.end()) { + return Status::KeyError("No datetimetz block allocated"); + } + block = it->second; + } else { + auto it = this->blocks_.find(output_type); + if (it == this->blocks_.end()) { return Status::KeyError("No block allocated"); } + block = it->second; + } + 
return block->Write(col, i, rel_placement); + }; -template <> -inline Status ArrowSerializer::ConvertData() { - if (is_strided()) { return Status::Invalid("no support for strided data yet"); } + nthreads = std::min(nthreads, table_->num_columns()); - int nbytes = BitUtil::BytesForBits(length_); - auto buffer = std::make_shared(pool_); - RETURN_NOT_OK(buffer->Resize(nbytes)); + if (nthreads == 1) { + for (int i = 0; i < table_->num_columns(); ++i) { + RETURN_NOT_OK(WriteColumn(i)); + } + } else { + std::vector thread_pool; + thread_pool.reserve(nthreads); + std::atomic task_counter(0); - const uint8_t* values = reinterpret_cast(PyArray_DATA(arr_)); + std::mutex error_mtx; + bool error_occurred = false; + Status error; - uint8_t* bitmap = buffer->mutable_data(); + for (int thread_id = 0; thread_id < nthreads; ++thread_id) { + thread_pool.emplace_back( + [this, &error, &error_occurred, &error_mtx, &task_counter, &WriteColumn]() { + int column_num; + while (!error_occurred) { + column_num = task_counter.fetch_add(1); + if (column_num >= this->table_->num_columns()) { break; } + Status s = WriteColumn(column_num); + if (!s.ok()) { + std::lock_guard lock(error_mtx); + error_occurred = true; + error = s; + break; + } + } + }); + } + for (auto&& thread : thread_pool) { + thread.join(); + } - memset(bitmap, 0, nbytes); - for (int i = 0; i < length_; ++i) { - if (values[i] > 0) { BitUtil::SetBit(bitmap, i); } + if (error_occurred) { return error; } + } + return Status::OK(); } - data_ = buffer; - - return Status::OK(); -} - -template -template -inline Status ArrowSerializer::ConvertTypedLists( - const std::shared_ptr& field, std::shared_ptr* out) { - typedef npy_traits traits; - typedef typename traits::value_type T; - typedef typename traits::BuilderClass BuilderT; - PyAcquireGIL lock; - - auto value_builder = std::make_shared(pool_, field->type); - ListBuilder list_builder(pool_, value_builder); - PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); - for (int64_t i = 0; i < length_; ++i) { - if (PyObject_is_null(objects[i])) { - RETURN_NOT_OK(list_builder.AppendNull()); - } else if (PyArray_Check(objects[i])) { - auto numpy_array = reinterpret_cast(objects[i]); - RETURN_NOT_OK(list_builder.Append(true)); - - // TODO(uwe): Support more complex numpy array structures - RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, ITEM_TYPE)); - - int64_t size = PyArray_DIM(numpy_array, 0); - auto data = reinterpret_cast(PyArray_DATA(numpy_array)); - if (traits::supports_nulls) { - null_bitmap_->Resize(size, false); - // TODO(uwe): A bitmap would be more space-efficient but the Builder API doesn't - // currently support this. 
- // ValuesToBitmap(data, size, null_bitmap_->mutable_data()); - ValuesToBytemap(data, size, null_bitmap_->mutable_data()); - RETURN_NOT_OK(value_builder->Append(data, size, null_bitmap_->data())); - } else { - RETURN_NOT_OK(value_builder->Append(data, size)); - } - } else if (PyList_Check(objects[i])) { - int64_t size; - std::shared_ptr type; - RETURN_NOT_OK(list_builder.Append(true)); - RETURN_NOT_OK(InferArrowType(objects[i], &size, &type)); - if (type->type != field->type->type) { - std::stringstream ss; - ss << type->ToString() << " cannot be converted to " << field->type->ToString(); - return Status::TypeError(ss.str()); - } - RETURN_NOT_OK(AppendPySequence(objects[i], field->type, value_builder)); - } else { - return Status::TypeError("Unsupported Python type for list items"); + Status AppendBlocks(const BlockMap& blocks, PyObject* list) { + for (const auto& it : blocks) { + PyObject* item; + RETURN_NOT_OK(it.second->GetPyResult(&item)); + if (PyList_Append(list, item) < 0) { RETURN_IF_PYERROR(); } } + return Status::OK(); } - return list_builder.Finish(out); -} -template <> -template <> -inline Status -ArrowSerializer::ConvertTypedLists( - const std::shared_ptr& field, std::shared_ptr* out) { - // TODO: If there are bytes involed, convert to Binary representation - PyAcquireGIL lock; - bool have_bytes = false; + Status GetResultList(PyObject** out) { + PyAcquireGIL lock; - auto value_builder = std::make_shared(pool_); - ListBuilder list_builder(pool_, value_builder); - PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); - for (int64_t i = 0; i < length_; ++i) { - if (PyObject_is_null(objects[i])) { - RETURN_NOT_OK(list_builder.AppendNull()); - } else if (PyArray_Check(objects[i])) { - auto numpy_array = reinterpret_cast(objects[i]); - RETURN_NOT_OK(list_builder.Append(true)); + PyObject* result = PyList_New(0); + RETURN_IF_PYERROR(); - // TODO(uwe): Support more complex numpy array structures - RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, NPY_OBJECT)); + RETURN_NOT_OK(AppendBlocks(blocks_, result)); + RETURN_NOT_OK(AppendBlocks(categorical_blocks_, result)); + RETURN_NOT_OK(AppendBlocks(datetimetz_blocks_, result)); - int64_t size = PyArray_DIM(numpy_array, 0); - auto data = reinterpret_cast(PyArray_DATA(numpy_array)); - RETURN_NOT_OK(AppendObjectStrings(*value_builder.get(), data, size, &have_bytes)); - } else if (PyList_Check(objects[i])) { - int64_t size; - std::shared_ptr type; - RETURN_NOT_OK(list_builder.Append(true)); - RETURN_NOT_OK(InferArrowType(objects[i], &size, &type)); - if (type->type != Type::STRING) { - std::stringstream ss; - ss << type->ToString() << " cannot be converted to STRING."; - return Status::TypeError(ss.str()); - } - RETURN_NOT_OK(AppendPySequence(objects[i], type, value_builder)); - } else { - return Status::TypeError("Unsupported Python type for list items"); - } + *out = result; + return Status::OK(); } - return list_builder.Finish(out); -} -template <> -inline Status ArrowSerializer::ConvertData() { - return Status::TypeError("NYI"); -} - -#define TO_ARROW_CASE(TYPE) \ - case NPY_##TYPE: { \ - ArrowSerializer converter(pool, arr, mask); \ - RETURN_NOT_OK(converter.Convert(out)); \ - } break; + private: + std::shared_ptr
table_; -Status PandasToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, - const std::shared_ptr& field, std::shared_ptr* out) { - PyArrayObject* arr = reinterpret_cast(ao); - PyArrayObject* mask = nullptr; + // column num -> block type id + std::vector column_types_; - if (mo != nullptr and mo != Py_None) { mask = reinterpret_cast(mo); } + // column num -> relative placement within internal block + std::vector column_block_placement_; - if (PyArray_NDIM(arr) != 1) { - return Status::Invalid("only handle 1-dimensional arrays"); - } + // block type -> type count + std::unordered_map type_counts_; - int type_num = PyArray_DESCR(arr)->type_num; + // block type -> block + BlockMap blocks_; -#if (NPY_INT64 == NPY_LONGLONG) && (NPY_SIZEOF_LONGLONG == 8) - // Both LONGLONG and INT64 can be observed in the wild, which is buggy. We set - // U/LONGLONG to U/INT64 so things work properly. - if (type_num == NPY_LONGLONG) { type_num = NPY_INT64; } - if (type_num == NPY_ULONGLONG) { type_num = NPY_UINT64; } -#endif + // column number -> categorical block + BlockMap categorical_blocks_; - switch (type_num) { - TO_ARROW_CASE(BOOL); - TO_ARROW_CASE(INT8); - TO_ARROW_CASE(INT16); - TO_ARROW_CASE(INT32); - TO_ARROW_CASE(INT64); -#if (NPY_INT64 != NPY_LONGLONG) - TO_ARROW_CASE(LONGLONG); -#endif - TO_ARROW_CASE(UINT8); - TO_ARROW_CASE(UINT16); - TO_ARROW_CASE(UINT32); - TO_ARROW_CASE(UINT64); -#if (NPY_UINT64 != NPY_ULONGLONG) - TO_ARROW_CASE(ULONGLONG); -#endif - TO_ARROW_CASE(FLOAT32); - TO_ARROW_CASE(FLOAT64); - TO_ARROW_CASE(DATETIME); - case NPY_OBJECT: { - ArrowSerializer converter(pool, arr, mask); - converter.IndicateType(field); - RETURN_NOT_OK(converter.Convert(out)); - } break; - default: - std::stringstream ss; - ss << "Unsupported numpy type " << PyArray_DESCR(arr)->type_num << std::endl; - return Status::NotImplemented(ss.str()); - } - return Status::OK(); -} + // column number -> datetimetz block + BlockMap datetimetz_blocks_; +}; class ArrowDeserializer { public: @@ -1839,7 +1694,7 @@ class ArrowDeserializer { Status ConvertValuesZeroCopy(int npy_type, std::shared_ptr arr) { typedef typename arrow_traits::T T; - auto prim_arr = static_cast(arr.get()); + auto prim_arr = static_cast(arr.get()); auto in_values = reinterpret_cast(prim_arr->data()->data()); // Zero-Copy. We can pass the data pointer directly to NumPy. @@ -1988,19 +1843,19 @@ class ArrowDeserializer { inline typename std::enable_if::type ConvertValues() { RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); auto out_values = reinterpret_cast(PyArray_DATA(arr_)); - return ConvertBinaryLike(data_, out_values); + return ConvertBinaryLike(data_, out_values); } template inline typename std::enable_if::type ConvertValues() { RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); auto out_values = reinterpret_cast(PyArray_DATA(arr_)); - return ConvertBinaryLike(data_, out_values); + return ConvertBinaryLike(data_, out_values); } #define CONVERTVALUES_LISTSLIKE_CASE(ArrowType, ArrowEnum) \ case Type::ArrowEnum: \ - return ConvertListsLike<::arrow::ArrowType>(col_, out_values); + return ConvertListsLike(col_, out_values); template inline typename std::enable_if::type ConvertValues() { @@ -2051,7 +1906,7 @@ class ArrowDeserializer { private: std::shared_ptr col_; - const arrow::ChunkedArray& data_; + const ChunkedArray& data_; PyObject* py_ref_; PyArrayObject* arr_; PyObject* result_; @@ -2071,4 +1926,11 @@ Status ConvertColumnToPandas( return converter.Convert(out); } -} // namespace pyarrow +Status ConvertTableToPandas( + const std::shared_ptr
& table, int nthreads, PyObject** out) { + DataFrameBlockCreator helper(table); + return helper.Convert(nthreads, out); +} + +} // namespace py +} // namespace arrow diff --git a/python/src/pyarrow/adapters/pandas.h b/python/src/pyarrow/adapters/pandas.h index b548f9321d75a..6862339d89baf 100644 --- a/python/src/pyarrow/adapters/pandas.h +++ b/python/src/pyarrow/adapters/pandas.h @@ -25,28 +25,26 @@ #include -#include "pyarrow/visibility.h" +#include "arrow/util/visibility.h" namespace arrow { class Array; class Column; -class Field; +class DataType; class MemoryPool; class Status; class Table; -} // namespace arrow - -namespace pyarrow { +namespace py { -PYARROW_EXPORT -arrow::Status ConvertArrayToPandas( - const std::shared_ptr& arr, PyObject* py_ref, PyObject** out); +ARROW_EXPORT +Status ConvertArrayToPandas( + const std::shared_ptr& arr, PyObject* py_ref, PyObject** out); -PYARROW_EXPORT -arrow::Status ConvertColumnToPandas( - const std::shared_ptr& col, PyObject* py_ref, PyObject** out); +ARROW_EXPORT +Status ConvertColumnToPandas( + const std::shared_ptr& col, PyObject* py_ref, PyObject** out); struct PandasOptions { bool strings_to_categorical; @@ -58,14 +56,24 @@ struct PandasOptions { // BlockManager structure of the pandas.DataFrame used as of pandas 0.19.x. // // tuple item: (indices: ndarray[int32], block: ndarray[TYPE, ndim=2]) -PYARROW_EXPORT -arrow::Status ConvertTableToPandas( - const std::shared_ptr& table, int nthreads, PyObject** out); +ARROW_EXPORT +Status ConvertTableToPandas( + const std::shared_ptr
& table, int nthreads, PyObject** out); + +ARROW_EXPORT +Status PandasDtypeToArrow(PyObject* dtype, std::shared_ptr* out); -PYARROW_EXPORT -arrow::Status PandasToArrow(arrow::MemoryPool* pool, PyObject* ao, PyObject* mo, - const std::shared_ptr& field, std::shared_ptr* out); +ARROW_EXPORT +Status PandasToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo, + const std::shared_ptr& type, std::shared_ptr* out); -} // namespace pyarrow +/// Convert dtype=object arrays. If target data type is not known, pass a type +/// with nullptr +ARROW_EXPORT +Status PandasObjectsToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo, + const std::shared_ptr& type, std::shared_ptr* out); + +} // namespace py +} // namespace arrow #endif // PYARROW_ADAPTERS_PANDAS_H diff --git a/python/src/pyarrow/common.cc b/python/src/pyarrow/common.cc index d2f5291ea8301..c898f634aedbb 100644 --- a/python/src/pyarrow/common.cc +++ b/python/src/pyarrow/common.cc @@ -24,24 +24,23 @@ #include "arrow/memory_pool.h" #include "arrow/status.h" -using arrow::Status; - -namespace pyarrow { +namespace arrow { +namespace py { static std::mutex memory_pool_mutex; -static arrow::MemoryPool* default_pyarrow_pool = nullptr; +static MemoryPool* default_pyarrow_pool = nullptr; -void set_default_memory_pool(arrow::MemoryPool* pool) { +void set_default_memory_pool(MemoryPool* pool) { std::lock_guard guard(memory_pool_mutex); default_pyarrow_pool = pool; } -arrow::MemoryPool* get_memory_pool() { +MemoryPool* get_memory_pool() { std::lock_guard guard(memory_pool_mutex); if (default_pyarrow_pool) { return default_pyarrow_pool; } else { - return arrow::default_memory_pool(); + return default_memory_pool(); } } @@ -60,4 +59,5 @@ PyBytesBuffer::~PyBytesBuffer() { Py_DECREF(obj_); } -} // namespace pyarrow +} // namespace py +} // namespace arrow diff --git a/python/src/pyarrow/common.h b/python/src/pyarrow/common.h index ad65ec75eec9e..0b4c6bebcfe79 100644 --- a/python/src/pyarrow/common.h +++ b/python/src/pyarrow/common.h @@ -19,16 +19,16 @@ #define PYARROW_COMMON_H #include "pyarrow/config.h" -#include "pyarrow/visibility.h" #include "arrow/buffer.h" #include "arrow/util/macros.h" +#include "arrow/util/visibility.h" namespace arrow { + class MemoryPool; -} -namespace pyarrow { +namespace py { class PyAcquireGIL { public: @@ -98,10 +98,10 @@ struct PyObjectStringify { } // Return the common PyArrow memory pool -PYARROW_EXPORT void set_default_memory_pool(arrow::MemoryPool* pool); -PYARROW_EXPORT arrow::MemoryPool* get_memory_pool(); +ARROW_EXPORT void set_default_memory_pool(MemoryPool* pool); +ARROW_EXPORT MemoryPool* get_memory_pool(); -class PYARROW_EXPORT NumPyBuffer : public arrow::Buffer { +class ARROW_EXPORT NumPyBuffer : public Buffer { public: NumPyBuffer(PyArrayObject* arr) : Buffer(nullptr, 0) { arr_ = arr; @@ -118,7 +118,7 @@ class PYARROW_EXPORT NumPyBuffer : public arrow::Buffer { PyArrayObject* arr_; }; -class PYARROW_EXPORT PyBytesBuffer : public arrow::Buffer { +class ARROW_EXPORT PyBytesBuffer : public Buffer { public: PyBytesBuffer(PyObject* obj); ~PyBytesBuffer(); @@ -127,6 +127,7 @@ class PYARROW_EXPORT PyBytesBuffer : public arrow::Buffer { PyObject* obj_; }; -} // namespace pyarrow +} // namespace py +} // namespace arrow #endif // PYARROW_COMMON_H diff --git a/python/src/pyarrow/config.cc b/python/src/pyarrow/config.cc index e1002bf4fd146..0be6d962b55ab 100644 --- a/python/src/pyarrow/config.cc +++ b/python/src/pyarrow/config.cc @@ -19,7 +19,8 @@ #include "pyarrow/config.h" -namespace pyarrow { +namespace arrow { +namespace 
py { void pyarrow_init() {} @@ -30,4 +31,5 @@ void pyarrow_set_numpy_nan(PyObject* obj) { numpy_nan = obj; } -} // namespace pyarrow +} // namespace py +} // namespace arrow diff --git a/python/src/pyarrow/config.h b/python/src/pyarrow/config.h index 386ee4b1e2590..87fc5c2b290f6 100644 --- a/python/src/pyarrow/config.h +++ b/python/src/pyarrow/config.h @@ -20,24 +20,27 @@ #include +#include "arrow/util/visibility.h" + #include "pyarrow/numpy_interop.h" -#include "pyarrow/visibility.h" #if PY_MAJOR_VERSION >= 3 #define PyString_Check PyUnicode_Check #endif -namespace pyarrow { +namespace arrow { +namespace py { -PYARROW_EXPORT +ARROW_EXPORT extern PyObject* numpy_nan; -PYARROW_EXPORT +ARROW_EXPORT void pyarrow_init(); -PYARROW_EXPORT +ARROW_EXPORT void pyarrow_set_numpy_nan(PyObject* obj); -} // namespace pyarrow +} // namespace py +} // namespace arrow #endif // PYARROW_CONFIG_H diff --git a/python/src/pyarrow/helpers.cc b/python/src/pyarrow/helpers.cc index 78fad165ac8e6..edebea6d97c95 100644 --- a/python/src/pyarrow/helpers.cc +++ b/python/src/pyarrow/helpers.cc @@ -19,9 +19,8 @@ #include -using namespace arrow; - -namespace pyarrow { +namespace arrow { +namespace py { #define GET_PRIMITIVE_TYPE(NAME, FACTORY) \ case Type::NAME: \ @@ -51,4 +50,5 @@ std::shared_ptr GetPrimitiveType(Type::type type) { } } -} // namespace pyarrow +} // namespace py +} // namespace arrow diff --git a/python/src/pyarrow/helpers.h b/python/src/pyarrow/helpers.h index 788c3eedddfd6..611e814b7d858 100644 --- a/python/src/pyarrow/helpers.h +++ b/python/src/pyarrow/helpers.h @@ -18,19 +18,18 @@ #ifndef PYARROW_HELPERS_H #define PYARROW_HELPERS_H -#include #include -#include "pyarrow/visibility.h" +#include "arrow/type.h" +#include "arrow/util/visibility.h" -namespace pyarrow { +namespace arrow { +namespace py { -using arrow::DataType; -using arrow::Type; - -PYARROW_EXPORT +ARROW_EXPORT std::shared_ptr GetPrimitiveType(Type::type type); -} // namespace pyarrow +} // namespace py +} // namespace arrow #endif // PYARROW_HELPERS_H diff --git a/python/src/pyarrow/io.cc b/python/src/pyarrow/io.cc index aa4cb7b052c27..0aa61dc811f5c 100644 --- a/python/src/pyarrow/io.cc +++ b/python/src/pyarrow/io.cc @@ -26,9 +26,8 @@ #include "pyarrow/common.h" -using arrow::Status; - -namespace pyarrow { +namespace arrow { +namespace py { // ---------------------------------------------------------------------- // Python file @@ -151,7 +150,7 @@ Status PyReadableFile::Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) { return Status::OK(); } -Status PyReadableFile::Read(int64_t nbytes, std::shared_ptr* out) { +Status PyReadableFile::Read(int64_t nbytes, std::shared_ptr* out) { PyAcquireGIL lock; PyObject* bytes_obj; @@ -214,8 +213,9 @@ Status PyOutputStream::Write(const uint8_t* data, int64_t nbytes) { // A readable file that is backed by a PyBytes PyBytesReader::PyBytesReader(PyObject* obj) - : arrow::io::BufferReader(std::make_shared(obj)) {} + : io::BufferReader(std::make_shared(obj)) {} PyBytesReader::~PyBytesReader() {} -} // namespace pyarrow +} // namespace py +} // namespace arrow diff --git a/python/src/pyarrow/io.h b/python/src/pyarrow/io.h index 4cb010f2d4e9f..a603e81622545 100644 --- a/python/src/pyarrow/io.h +++ b/python/src/pyarrow/io.h @@ -20,17 +20,17 @@ #include "arrow/io/interfaces.h" #include "arrow/io/memory.h" +#include "arrow/util/visibility.h" #include "pyarrow/config.h" #include "pyarrow/common.h" -#include "pyarrow/visibility.h" namespace arrow { + class MemoryPool; -} -namespace pyarrow { +namespace py { 
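// Every method on the file wrappers below must hold the GIL while it touches
// the wrapped Python object; pyarrow does this with the RAII guard
// PyAcquireGIL declared in pyarrow/common.h. A minimal sketch of such a
// guard, assuming CPython's PyGILState API (illustrative, not pyarrow's exact
// implementation):
//
//   #include <Python.h>
//
//   class GILGuard {
//    public:
//     GILGuard() : state_(PyGILState_Ensure()) {}  // acquire (or re-enter) GIL
//     ~GILGuard() { PyGILState_Release(state_); }  // released on every return
//     GILGuard(const GILGuard&) = delete;
//     GILGuard& operator=(const GILGuard&) = delete;
//
//    private:
//     PyGILState_STATE state_;
//   };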
// A common interface to a Python file-like object. Must acquire GIL before // calling any methods @@ -39,31 +39,31 @@ class PythonFile { PythonFile(PyObject* file); ~PythonFile(); - arrow::Status Close(); - arrow::Status Seek(int64_t position, int whence); - arrow::Status Read(int64_t nbytes, PyObject** out); - arrow::Status Tell(int64_t* position); - arrow::Status Write(const uint8_t* data, int64_t nbytes); + Status Close(); + Status Seek(int64_t position, int whence); + Status Read(int64_t nbytes, PyObject** out); + Status Tell(int64_t* position); + Status Write(const uint8_t* data, int64_t nbytes); private: PyObject* file_; }; -class PYARROW_EXPORT PyReadableFile : public arrow::io::ReadableFileInterface { +class ARROW_EXPORT PyReadableFile : public io::ReadableFileInterface { public: explicit PyReadableFile(PyObject* file); virtual ~PyReadableFile(); - arrow::Status Close() override; + Status Close() override; - arrow::Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) override; - arrow::Status Read(int64_t nbytes, std::shared_ptr* out) override; + Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) override; + Status Read(int64_t nbytes, std::shared_ptr* out) override; - arrow::Status GetSize(int64_t* size) override; + Status GetSize(int64_t* size) override; - arrow::Status Seek(int64_t position) override; + Status Seek(int64_t position) override; - arrow::Status Tell(int64_t* position) override; + Status Tell(int64_t* position) override; bool supports_zero_copy() const override; @@ -71,21 +71,21 @@ class PYARROW_EXPORT PyReadableFile : public arrow::io::ReadableFileInterface { std::unique_ptr file_; }; -class PYARROW_EXPORT PyOutputStream : public arrow::io::OutputStream { +class ARROW_EXPORT PyOutputStream : public io::OutputStream { public: explicit PyOutputStream(PyObject* file); virtual ~PyOutputStream(); - arrow::Status Close() override; - arrow::Status Tell(int64_t* position) override; - arrow::Status Write(const uint8_t* data, int64_t nbytes) override; + Status Close() override; + Status Tell(int64_t* position) override; + Status Write(const uint8_t* data, int64_t nbytes) override; private: std::unique_ptr file_; }; // A zero-copy reader backed by a PyBytes object -class PYARROW_EXPORT PyBytesReader : public arrow::io::BufferReader { +class ARROW_EXPORT PyBytesReader : public io::BufferReader { public: explicit PyBytesReader(PyObject* obj); virtual ~PyBytesReader(); @@ -93,6 +93,7 @@ class PYARROW_EXPORT PyBytesReader : public arrow::io::BufferReader { // TODO(wesm): seekable output files -} // namespace pyarrow +} // namespace py +} // namespace arrow #endif // PYARROW_IO_H diff --git a/python/src/pyarrow/numpy_interop.h b/python/src/pyarrow/numpy_interop.h index 6326527a67420..57f3328e87078 100644 --- a/python/src/pyarrow/numpy_interop.h +++ b/python/src/pyarrow/numpy_interop.h @@ -42,7 +42,8 @@ #include #include -namespace pyarrow { +namespace arrow { +namespace py { inline int import_numpy() { #ifdef NUMPY_IMPORT_ARRAY @@ -53,6 +54,7 @@ inline int import_numpy() { return 0; } -} // namespace pyarrow +} // namespace py +} // namespace arrow #endif // PYARROW_NUMPY_INTEROP_H diff --git a/python/src/pyarrow/type_traits.h b/python/src/pyarrow/type_traits.h new file mode 100644 index 0000000000000..f4604d7a9894d --- /dev/null +++ b/python/src/pyarrow/type_traits.h @@ -0,0 +1,212 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. 
See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include + +#include + +#include "pyarrow/numpy_interop.h" + +#include "arrow/builder.h" +#include "arrow/type.h" + +namespace arrow { +namespace py { + +template +struct npy_traits {}; + +template <> +struct npy_traits { + typedef uint8_t value_type; + using TypeClass = BooleanType; + using BuilderClass = BooleanBuilder; + + static constexpr bool supports_nulls = false; + static inline bool isnull(uint8_t v) { return false; } +}; + +#define NPY_INT_DECL(TYPE, CapType, T) \ + template <> \ + struct npy_traits { \ + typedef T value_type; \ + using TypeClass = CapType##Type; \ + using BuilderClass = CapType##Builder; \ + \ + static constexpr bool supports_nulls = false; \ + static inline bool isnull(T v) { return false; } \ + }; + +NPY_INT_DECL(INT8, Int8, int8_t); +NPY_INT_DECL(INT16, Int16, int16_t); +NPY_INT_DECL(INT32, Int32, int32_t); +NPY_INT_DECL(INT64, Int64, int64_t); + +NPY_INT_DECL(UINT8, UInt8, uint8_t); +NPY_INT_DECL(UINT16, UInt16, uint16_t); +NPY_INT_DECL(UINT32, UInt32, uint32_t); +NPY_INT_DECL(UINT64, UInt64, uint64_t); + +#if NPY_INT64 != NPY_LONGLONG +NPY_INT_DECL(LONGLONG, Int64, int64_t); +NPY_INT_DECL(ULONGLONG, UInt64, uint64_t); +#endif + +template <> +struct npy_traits { + typedef float value_type; + using TypeClass = FloatType; + using BuilderClass = FloatBuilder; + + static constexpr bool supports_nulls = true; + + static inline bool isnull(float v) { return v != v; } +}; + +template <> +struct npy_traits { + typedef double value_type; + using TypeClass = DoubleType; + using BuilderClass = DoubleBuilder; + + static constexpr bool supports_nulls = true; + + static inline bool isnull(double v) { return v != v; } +}; + +template <> +struct npy_traits { + typedef int64_t value_type; + using TypeClass = TimestampType; + using BuilderClass = TimestampBuilder; + + static constexpr bool supports_nulls = true; + + static inline bool isnull(int64_t v) { + // NaT = -2**63 + // = -0x8000000000000000 + // = -9223372036854775808; + // = std::numeric_limits::min() + return v == std::numeric_limits::min(); + } +}; + +template <> +struct npy_traits { + typedef PyObject* value_type; + static constexpr bool supports_nulls = true; +}; + +template +struct arrow_traits {}; + +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_BOOL; + static constexpr bool supports_nulls = false; + static constexpr bool is_boolean = true; + static constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = false; +}; + +#define INT_DECL(TYPE) \ + template <> \ + struct arrow_traits { \ + static constexpr int npy_type = NPY_##TYPE; \ + static constexpr bool supports_nulls = false; \ + static constexpr double na_value = NAN; \ + static constexpr bool is_boolean = false; \ + static constexpr bool is_numeric_not_nullable = true; \ + static constexpr bool 
is_numeric_nullable = false; \ + typedef typename npy_traits::value_type T; \ + }; + +INT_DECL(INT8); +INT_DECL(INT16); +INT_DECL(INT32); +INT_DECL(INT64); +INT_DECL(UINT8); +INT_DECL(UINT16); +INT_DECL(UINT32); +INT_DECL(UINT64); + +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_FLOAT32; + static constexpr bool supports_nulls = true; + static constexpr float na_value = NAN; + static constexpr bool is_boolean = false; + static constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = true; + typedef typename npy_traits::value_type T; +}; + +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_FLOAT64; + static constexpr bool supports_nulls = true; + static constexpr double na_value = NAN; + static constexpr bool is_boolean = false; + static constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = true; + typedef typename npy_traits::value_type T; +}; + +static constexpr int64_t kPandasTimestampNull = std::numeric_limits::min(); + +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_DATETIME; + static constexpr bool supports_nulls = true; + static constexpr int64_t na_value = kPandasTimestampNull; + static constexpr bool is_boolean = false; + static constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = true; + typedef typename npy_traits::value_type T; +}; + +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_DATETIME; + static constexpr bool supports_nulls = true; + static constexpr int64_t na_value = kPandasTimestampNull; + static constexpr bool is_boolean = false; + static constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = true; + typedef typename npy_traits::value_type T; +}; + +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_OBJECT; + static constexpr bool supports_nulls = true; + static constexpr bool is_boolean = false; + static constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = false; +}; + +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_OBJECT; + static constexpr bool supports_nulls = true; + static constexpr bool is_boolean = false; + static constexpr bool is_numeric_not_nullable = false; + static constexpr bool is_numeric_nullable = false; +}; + +} // namespace py +} // namespace arrow diff --git a/python/src/pyarrow/util/datetime.h b/python/src/pyarrow/util/datetime.h index 9ffa691052460..f704a96d91bba 100644 --- a/python/src/pyarrow/util/datetime.h +++ b/python/src/pyarrow/util/datetime.h @@ -21,7 +21,8 @@ #include #include -namespace pyarrow { +namespace arrow { +namespace py { inline int64_t PyDate_to_ms(PyDateTime_Date* pydate) { struct tm date = {0}; @@ -35,6 +36,7 @@ inline int64_t PyDate_to_ms(PyDateTime_Date* pydate) { return lrint(difftime(mktime(&date), mktime(&epoch)) * 1000); } -} // namespace pyarrow +} // namespace py +} // namespace arrow #endif // PYARROW_UTIL_DATETIME_H diff --git a/python/src/pyarrow/util/test_main.cc b/python/src/pyarrow/util/test_main.cc index 02e9a54f65914..d8d1d030f8f97 100644 --- a/python/src/pyarrow/util/test_main.cc +++ b/python/src/pyarrow/util/test_main.cc @@ -26,7 +26,7 @@ int main(int argc, char** argv) { ::testing::InitGoogleTest(&argc, argv); Py_Initialize(); - pyarrow::import_numpy(); + arrow::py::import_numpy(); int ret = RUN_ALL_TESTS(); diff --git a/python/src/pyarrow/visibility.h 
b/python/src/pyarrow/visibility.h deleted file mode 100644 index 9f0c13b4b2083..0000000000000 --- a/python/src/pyarrow/visibility.h +++ /dev/null @@ -1,32 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#ifndef PYARROW_VISIBILITY_H -#define PYARROW_VISIBILITY_H - -#if defined(_WIN32) || defined(__CYGWIN__) -#define PYARROW_EXPORT __declspec(dllexport) -#else // Not Windows -#ifndef PYARROW_EXPORT -#define PYARROW_EXPORT __attribute__((visibility("default"))) -#endif -#ifndef PYARROW_NO_EXPORT -#define PYARROW_NO_EXPORT __attribute__((visibility("hidden"))) -#endif -#endif // Non-Windows - -#endif // PYARROW_VISIBILITY_H From 6aed18f965bea60580e80b086dd72857546abea2 Mon Sep 17 00:00:00 2001 From: Bryan Cutler Date: Mon, 13 Mar 2017 16:23:22 -0400 Subject: [PATCH 0363/1644] ARROW-619: [Python] Fixed remaining typo for LD_LIBRARY_PATH Typo with LD_LIBRARY_PATH in install documentation. Author: Bryan Cutler Closes #376 from BryanCutler/pyarrow-install-typo-ARROW-619 and squashes the following commits: e38b588 [Bryan Cutler] fixed typo for LD_LIBRARY_PATH --- python/doc/install.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python/doc/install.rst b/python/doc/install.rst index d93a88f547576..16d19ef123135 100644 --- a/python/doc/install.rst +++ b/python/doc/install.rst @@ -133,7 +133,7 @@ Install `pyarrow` .. note:: In development installations, you will also need to set a correct - ``LD_LIBARY_PATH``. This is most probably done with + ``LD_LIBRARY_PATH``. This is most probably done with ``export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH``. 
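The visibility.h deleted above is one more copy of the standard shared-library
export-macro pattern; the patch series replaces it with the single definition
in arrow/util/visibility.h. For reference, a generic sketch of the pattern
(the macro name here is illustrative):

// Export macro: dllexport on Windows, default ELF visibility elsewhere.
#if defined(_WIN32) || defined(__CYGWIN__)
#define MYLIB_EXPORT __declspec(dllexport)
#else
#define MYLIB_EXPORT __attribute__((visibility("default")))
#endif

// Applied to any class or function that library users should be able to link:
class MYLIB_EXPORT ExportedThing {};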
From f442879d3c791d86fb0fdfa098a72329843f5baf Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 14 Mar 2017 09:17:30 +0100 Subject: [PATCH 0364/1644] ARROW-624: [C++] Restore MakePrimitiveArray function, use in feather.cc I verified locally the parquet-cpp test suite passes again Author: Wes McKinney Closes #378 from wesm/ARROW-624 and squashes the following commits: 023df9b [Wes McKinney] Use passed offset in MakePrimitiveArray 30a553e [Wes McKinney] Restore MakePrimitiveArray function, use in Feather, verify fixes parquet test suite --- cpp/src/arrow/api.h | 3 +++ cpp/src/arrow/ipc/feather.cc | 8 +------- cpp/src/arrow/loader.cc | 28 +++++++++++++++++++++++----- cpp/src/arrow/loader.h | 10 ++++++++++ 4 files changed, 37 insertions(+), 12 deletions(-) diff --git a/cpp/src/arrow/api.h b/cpp/src/arrow/api.h index 51437d863b8b9..3bc86662613ed 100644 --- a/cpp/src/arrow/api.h +++ b/cpp/src/arrow/api.h @@ -24,7 +24,10 @@ #include "arrow/buffer.h" #include "arrow/builder.h" #include "arrow/column.h" +#include "arrow/compare.h" +#include "arrow/loader.h" #include "arrow/memory_pool.h" +#include "arrow/pretty_print.h" #include "arrow/schema.h" #include "arrow/status.h" #include "arrow/table.h" diff --git a/cpp/src/arrow/ipc/feather.cc b/cpp/src/arrow/ipc/feather.cc index 13dfa5830f1bf..1d165acccbd04 100644 --- a/cpp/src/arrow/ipc/feather.cc +++ b/cpp/src/arrow/ipc/feather.cc @@ -331,7 +331,6 @@ class TableReader::TableReaderImpl { std::shared_ptr type; RETURN_NOT_OK(GetDataType(meta, metadata_type, metadata, &type)); - std::vector fields(1); std::vector> buffers; // Buffer data from the source (may or may not perform a copy depending on @@ -357,12 +356,7 @@ class TableReader::TableReaderImpl { } buffers.push_back(SliceBuffer(buffer, offset, buffer->size() - offset)); - - fields[0].length = meta->length(); - fields[0].null_count = meta->null_count(); - fields[0].offset = 0; - - return LoadArray(type, fields, buffers, out); + return MakePrimitiveArray(type, buffers, meta->length(), meta->null_count(), 0, out); } bool HasDescription() const { return metadata_->HasDescription(); } diff --git a/cpp/src/arrow/loader.cc b/cpp/src/arrow/loader.cc index 3cb51ae8fdab7..0b3ee1cf0a899 100644 --- a/cpp/src/arrow/loader.cc +++ b/cpp/src/arrow/loader.cc @@ -235,8 +235,8 @@ class ArrayLoader : public TypeVisitor { std::shared_ptr result_; }; -Status ARROW_EXPORT LoadArray(const std::shared_ptr& type, - ArrayComponentSource* source, std::shared_ptr* out) { +Status LoadArray(const std::shared_ptr& type, ArrayComponentSource* source, + std::shared_ptr* out) { ArrayLoaderContext context; context.source = source; context.field_index = context.buffer_index = 0; @@ -244,8 +244,8 @@ Status ARROW_EXPORT LoadArray(const std::shared_ptr& type, return LoadArray(type, &context, out); } -Status ARROW_EXPORT LoadArray(const std::shared_ptr& type, - ArrayLoaderContext* context, std::shared_ptr* out) { +Status LoadArray(const std::shared_ptr& type, ArrayLoaderContext* context, + std::shared_ptr* out) { ArrayLoader loader(type, context); RETURN_NOT_OK(loader.Load(out)); @@ -275,11 +275,29 @@ class InMemorySource : public ArrayComponentSource { const std::vector>& buffers_; }; -Status ARROW_EXPORT LoadArray(const std::shared_ptr& type, +Status LoadArray(const std::shared_ptr& type, const std::vector& fields, const std::vector>& buffers, std::shared_ptr* out) { InMemorySource source(fields, buffers); return LoadArray(type, &source, out); } +Status MakePrimitiveArray(const std::shared_ptr& type, int64_t length, + const 
std::shared_ptr& data, const std::shared_ptr& null_bitmap, + int64_t null_count, int64_t offset, std::shared_ptr* out) { + std::vector> buffers = {null_bitmap, data}; + return MakePrimitiveArray(type, buffers, length, null_count, offset, out); +} + +Status MakePrimitiveArray(const std::shared_ptr& type, + const std::vector>& buffers, int64_t length, + int64_t null_count, int64_t offset, std::shared_ptr* out) { + std::vector fields(1); + fields[0].length = length; + fields[0].null_count = null_count; + fields[0].offset = offset; + + return LoadArray(type, fields, buffers, out); +} + } // namespace arrow diff --git a/cpp/src/arrow/loader.h b/cpp/src/arrow/loader.h index b4949f2556028..f116d64f5c0c1 100644 --- a/cpp/src/arrow/loader.h +++ b/cpp/src/arrow/loader.h @@ -84,6 +84,16 @@ Status ARROW_EXPORT LoadArray(const std::shared_ptr& type, const std::vector& fields, const std::vector>& buffers, std::shared_ptr* out); +/// Create new arrays for logical types that are backed by primitive arrays. +Status ARROW_EXPORT MakePrimitiveArray(const std::shared_ptr& type, + int64_t length, const std::shared_ptr& data, + const std::shared_ptr& null_bitmap, int64_t null_count, int64_t offset, + std::shared_ptr* out); + +Status ARROW_EXPORT MakePrimitiveArray(const std::shared_ptr& type, + const std::vector>& buffers, int64_t length, + int64_t null_count, int64_t offset, std::shared_ptr* out); + } // namespace arrow #endif // ARROW_LOADER_H From cef46152cc7489c23b67aaed70574dba742d19bb Mon Sep 17 00:00:00 2001 From: Jeff Knupp Date: Tue, 14 Mar 2017 10:58:44 -0400 Subject: [PATCH 0365/1644] ARROW-598: [Python] Add support for converting pyarrow.Buffer to a memoryview with zero copy WIP, as tests are not all done and I'm assuming we'll need to keep a reference to the underlying buffer so it doesn't get gc'ed. 
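The patch below implements the buffer protocol by filling the Py_buffer struct
field by field in Cython, with readonly=1 so the resulting memoryview is
immutable. At the C level, the same read-only one-dimensional contiguous case
can be expressed with CPython's PyBuffer_FillInfo helper, which also takes a
reference on the exporting object so the memory stays alive while a view
exists; a sketch with illustrative names:

#include <Python.h>

#include <cstdint>

// What a bf_getbuffer slot boils down to for a contiguous byte buffer.
static int FillReadOnlyView(PyObject* owner, const uint8_t* data,
                            Py_ssize_t size, Py_buffer* view, int flags) {
  // PyBuffer_FillInfo records buf/len and the remaining Py_buffer
  // bookkeeping for the 1-D case, and holds a reference to 'owner';
  // readonly=1 makes writes through the view raise TypeError.
  return PyBuffer_FillInfo(view, owner, const_cast<uint8_t*>(data), size,
                           /*readonly=*/1, flags);
}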
Author: Jeff Knupp
Author: Jeff Knupp

Closes #369 from jeffknupp/master and squashes the following commits:

c300f30 [Jeff Knupp] Initialize members in init; test for lifetime with zero references
13f5dc1 [Jeff Knupp] WIP: python 2 compatibility
170d01d [Jeff Knupp] WIP: python 2 compatibility
bfbed0f [Jeff Knupp] WIP: add test for buffer protocol reference counting
fd1cb44 [Jeff Knupp] WIP: make buffers read-only; add test for immutability
c24e83a [Jeff Knupp] WIP: make arrow.io.Buffer implement Python's buffer protocol
b2540d4 [Jeff Knupp] ARROW-598: [Python] Add support for converting pyarrow.Buffer to a memoryview with zero copy
---
 python/pyarrow/io.pxd           |  2 ++
 python/pyarrow/io.pyx           | 16 +++++++++++++-
 python/pyarrow/tests/test_io.py | 39 +++++++++++++++++++++++++++++++++
 3 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/python/pyarrow/io.pxd b/python/pyarrow/io.pxd
index fffc7c596db76..3d73e1143e15a 100644
--- a/python/pyarrow/io.pxd
+++ b/python/pyarrow/io.pxd
@@ -25,6 +25,8 @@ from pyarrow.includes.libarrow_io cimport (ReadableFileInterface,
 cdef class Buffer:
     cdef:
         shared_ptr[CBuffer] buffer
+        Py_ssize_t shape[1]
+        Py_ssize_t strides[1]

     cdef init(self, const shared_ptr[CBuffer]& buffer)

diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx
index 4acef212b4dce..240ea240c3abe 100644
--- a/python/pyarrow/io.pyx
+++ b/python/pyarrow/io.pyx
@@ -56,7 +56,6 @@ cdef extern from "Python.h":
     PyObject* PyBytes_FromStringAndSizeNative" PyBytes_FromStringAndSize"(
         char *v, Py_ssize_t len) except NULL

-
 cdef class NativeFile:

     def __cinit__(self):
@@ -421,6 +420,8 @@ cdef class Buffer:

     cdef init(self, const shared_ptr[CBuffer]& buffer):
         self.buffer = buffer
+        self.shape[0] = self.size
+        self.strides[0] = <Py_ssize_t>(1)

     def __len__(self):
         return self.size
@@ -449,6 +450,19 @@ cdef class Buffer:
                               self.buffer.get().data(),
                               self.buffer.get().size())

+    def __getbuffer__(self, cp.Py_buffer* buffer, int flags):
+
+        buffer.buf = <char *>self.buffer.get().data()
+        buffer.format = 'b'
+        buffer.internal = NULL
+        buffer.itemsize = 1
+        buffer.len = self.size
+        buffer.ndim = 1
+        buffer.obj = self
+        buffer.readonly = 1
+        buffer.shape = self.shape
+        buffer.strides = self.strides
+        buffer.suboffsets = NULL

 cdef shared_ptr[PoolBuffer] allocate_buffer(CMemoryPool* pool):
     cdef shared_ptr[PoolBuffer] result

diff --git a/python/pyarrow/tests/test_io.py b/python/pyarrow/tests/test_io.py
index dfa84a27e6be9..c6caba5ce641a 100644
--- a/python/pyarrow/tests/test_io.py
+++ b/python/pyarrow/tests/test_io.py
@@ -135,6 +135,34 @@ def test_buffer_bytes():
     assert result == val


+def test_buffer_memoryview():
+    val = b'some data'
+
+    buf = io.buffer_from_bytes(val)
+    assert isinstance(buf, io.Buffer)
+
+    result = memoryview(buf)
+
+    assert result == val
+
+
+def test_buffer_memoryview_is_immutable():
+    val = b'some data'
+
+    buf = io.buffer_from_bytes(val)
+    assert isinstance(buf, io.Buffer)
+
+    result = memoryview(buf)
+
+    with pytest.raises(TypeError) as exc:
+        result[0] = b'h'
+    assert 'cannot modify read-only' in str(exc.value)
+
+    b = bytes(buf)
+    with pytest.raises(TypeError) as exc:
+        b[0] = b'h'
+    assert 'cannot modify read-only' in str(exc.value)
+

 def test_memory_output_stream():
     # 10 bytes
@@ -160,6 +188,17 @@ def test_inmemory_write_after_closed():
     with pytest.raises(IOError):
         f.write(b'not ok')

+def test_buffer_protocol_ref_counting():
+    import gc
+
+    def make_buffer(bytes_obj):
+        return bytearray(io.buffer_from_bytes(bytes_obj))
+
+    buf = make_buffer(b'foo')
+    gc.collect()
+    assert buf == b'foo'
+
+
 #
---------------------------------------------------------------------- # OS files and memory maps From a32ae59094be82ad318a73d067f2db680d3ab295 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 14 Mar 2017 11:47:39 -0400 Subject: [PATCH 0366/1644] ARROW-625: [C++] Add TimeUnit to TimeType::ToString. Add timezone to TimestampType::ToString if present Author: Wes McKinney Closes #377 from wesm/ARROW-625 and squashes the following commits: d76a8d3 [Wes McKinney] Move PrintTimeUnit into operator<< for std::ostream 351f90e [Wes McKinney] Add TimeUnit to TimeType::ToString. Add timezone to TimestampType::ToString if it has one --- cpp/src/arrow/type-test.cc | 24 ++++++++++++++++++++++++ cpp/src/arrow/type.cc | 23 ++++++----------------- cpp/src/arrow/type.h | 22 ++++++++++++++++++++-- 3 files changed, 50 insertions(+), 19 deletions(-) diff --git a/cpp/src/arrow/type-test.cc b/cpp/src/arrow/type-test.cc index fe6c62adb7fba..3adc4d83c3a2d 100644 --- a/cpp/src/arrow/type-test.cc +++ b/cpp/src/arrow/type-test.cc @@ -132,6 +132,18 @@ TEST(TestTimeType, Equals) { ASSERT_TRUE(t3.Equals(t4)); } +TEST(TestTimeType, ToString) { + auto t1 = time(TimeUnit::MILLI); + auto t2 = time(TimeUnit::NANO); + auto t3 = time(TimeUnit::SECOND); + auto t4 = time(TimeUnit::MICRO); + + ASSERT_EQ("time[ms]", t1->ToString()); + ASSERT_EQ("time[ns]", t2->ToString()); + ASSERT_EQ("time[s]", t3->ToString()); + ASSERT_EQ("time[us]", t4->ToString()); +} + TEST(TestTimestampType, Equals) { TimestampType t1; TimestampType t2; @@ -143,4 +155,16 @@ TEST(TestTimestampType, Equals) { ASSERT_TRUE(t3.Equals(t4)); } +TEST(TestTimestampType, ToString) { + auto t1 = timestamp(TimeUnit::MILLI); + auto t2 = timestamp("US/Eastern", TimeUnit::NANO); + auto t3 = timestamp(TimeUnit::SECOND); + auto t4 = timestamp(TimeUnit::MICRO); + + ASSERT_EQ("timestamp[ms]", t1->ToString()); + ASSERT_EQ("timestamp[ns, tz=US/Eastern]", t2->ToString()); + ASSERT_EQ("timestamp[s]", t3->ToString()); + ASSERT_EQ("timestamp[us]", t4->ToString()); +} + } // namespace arrow diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 0cafdce89e562..d41b36315a86e 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -108,27 +108,16 @@ std::string Date32Type::ToString() const { return std::string("date32"); } -static inline void print_time_unit(TimeUnit unit, std::ostream* stream) { - switch (unit) { - case TimeUnit::SECOND: - (*stream) << "s"; - break; - case TimeUnit::MILLI: - (*stream) << "ms"; - break; - case TimeUnit::MICRO: - (*stream) << "us"; - break; - case TimeUnit::NANO: - (*stream) << "ns"; - break; - } +std::string TimeType::ToString() const { + std::stringstream ss; + ss << "time[" << this->unit << "]"; + return ss.str(); } std::string TimestampType::ToString() const { std::stringstream ss; - ss << "timestamp["; - print_time_unit(this->unit, &ss); + ss << "timestamp[" << this->unit; + if (this->timezone.size() > 0) { ss << ", tz=" << this->timezone; } ss << "]"; return ss.str(); } diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 15b99c5ce4f89..9f28875925a4b 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -20,6 +20,7 @@ #include #include +#include #include #include @@ -460,6 +461,24 @@ struct ARROW_EXPORT Date32Type : public FixedWidthType { enum class TimeUnit : char { SECOND = 0, MILLI = 1, MICRO = 2, NANO = 3 }; +static inline std::ostream& operator<<(std::ostream& os, TimeUnit unit) { + switch (unit) { + case TimeUnit::SECOND: + os << "s"; + break; + case TimeUnit::MILLI: + os << "ms"; + break; + case 
TimeUnit::MICRO: + os << "us"; + break; + case TimeUnit::NANO: + os << "ns"; + break; + } + return os; +} + struct ARROW_EXPORT TimeType : public FixedWidthType { static constexpr Type::type type_id = Type::TIME; using Unit = TimeUnit; @@ -474,8 +493,7 @@ struct ARROW_EXPORT TimeType : public FixedWidthType { TimeType(const TimeType& other) : TimeType(other.unit) {} Status Accept(TypeVisitor* visitor) const override; - std::string ToString() const override { return name(); } - static std::string name() { return "time"; } + std::string ToString() const override; }; struct ARROW_EXPORT TimestampType : public FixedWidthType { From dd8204ce77662110996c6b24e28577c225ef5546 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 14 Mar 2017 12:16:23 -0400 Subject: [PATCH 0367/1644] ARROW-628: [Python] Install nomkl metapackage when building parquet-cpp in Travis CI I was surprised to find conda installing the mkl conda package with what's there now, but this should fix that and make the builds slightly faster Author: Wes McKinney Closes #380 from wesm/ARROW-628 and squashes the following commits: 3598dcf [Wes McKinney] Install nomkl metapackage when building parquet-cpp in Travis CI --- ci/travis_script_python.sh | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index 11d8d89ca7b6f..6f4b8e9a090a7 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -29,9 +29,14 @@ pushd $PYTHON_DIR export PARQUET_HOME=$TRAVIS_BUILD_DIR/parquet-env build_parquet_cpp() { - conda create -y -q -p $PARQUET_HOME thrift-cpp snappy zlib brotli boost + conda create -y -q -p $PARQUET_HOME python=3.5 source activate $PARQUET_HOME + # In case some package wants to download the MKL + conda install -y -q nomkl + + conda install -y -q thrift-cpp snappy zlib brotli boost + export BOOST_ROOT=$PARQUET_HOME export SNAPPY_HOME=$PARQUET_HOME export THRIFT_HOME=$PARQUET_HOME From c8d15d467f7a1950cf08bfcc1ead2e7ab828be00 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 15 Mar 2017 11:10:36 -0400 Subject: [PATCH 0368/1644] ARROW-491: [Format / C++] Add FixedWidthBinary type to format, C++ implementation I have a bunch more work to do on the C++ implementation: - [x] Test builder class - [x] Test array API (slice, etc.) - [x] Implement/test ArrayEquals/ArrayRangeEquals - [x] Implement `PrettyPrint` (may want to encode to hexadecimal, I don't think that BinaryArray prints properly right now for non-ASCII/UTF8 data) - [x] Add IPC roundtrip tests In the meantime, @julienledem @nongli or others could you look at the changes to the format Flatbuffers and let me know if that looks right to you? Thanks Author: Wes McKinney Closes #379 from wesm/ARROW-491 and squashes the following commits: f948835 [Wes McKinney] Move hex encode/decode to a separate header since including io-util on Windows causes a compilation failure 949fbc8 [Wes McKinney] Hex encode values in binary and fixedwidthbinary. Test PrettyPrint for FW binary a97c11a [Wes McKinney] Complete IPC implementation for date/time types. Implement IPC for FixedWidthBinary b679264 [Wes McKinney] Fix bug with fast bitsetting when length is a power of 2 8e76225 [Wes McKinney] Do not needlessly create 0-length buffers 832b363 [Wes McKinney] Implement TypeEquals, ArrayRangeEquals, clang fixes bf9ecd0 [Wes McKinney] cpplint ec50654 [Wes McKinney] Add some basic tests for the fixed width binary builder caa0314 [Wes McKinney] Draft FixedWidthBinaryBuilder. 
No tests yet c183639 [Wes McKinney] Consolidate some type tests. Draft FixedWidthBinaryArray class 9143c53 [Wes McKinney] Draft FixedWidthBinaryType --- cpp/src/arrow/array-list-test.cc | 20 --- cpp/src/arrow/array-string-test.cc | 194 +++++++++++++++++++++++-- cpp/src/arrow/array.cc | 139 +++++++++++------- cpp/src/arrow/array.h | 34 +++++ cpp/src/arrow/buffer.h | 10 ++ cpp/src/arrow/builder.cc | 93 ++++++++++-- cpp/src/arrow/builder.h | 50 ++++--- cpp/src/arrow/compare.cc | 32 ++++ cpp/src/arrow/ipc/adapter.cc | 11 ++ cpp/src/arrow/ipc/ipc-adapter-test.cc | 4 +- cpp/src/arrow/ipc/ipc-file-test.cc | 8 +- cpp/src/arrow/ipc/json-internal.cc | 26 +--- cpp/src/arrow/ipc/metadata-internal.cc | 74 +++++++++- cpp/src/arrow/ipc/test-common.h | 65 ++++++++- cpp/src/arrow/loader.cc | 12 ++ cpp/src/arrow/pretty_print-test.cc | 26 +++- cpp/src/arrow/pretty_print.cc | 82 +++++++---- cpp/src/arrow/type-test.cc | 52 +++++++ cpp/src/arrow/type.cc | 20 +++ cpp/src/arrow/type.h | 30 +++- cpp/src/arrow/type_fwd.h | 4 + cpp/src/arrow/type_traits.h | 7 + cpp/src/arrow/util/io-util.h | 5 +- cpp/src/arrow/util/string.h | 57 ++++++++ format/Message.fbs | 8 +- 25 files changed, 870 insertions(+), 193 deletions(-) create mode 100644 cpp/src/arrow/util/string.h diff --git a/cpp/src/arrow/array-list-test.cc b/cpp/src/arrow/array-list-test.cc index a144fd937d7a0..87dfdaaed33a4 100644 --- a/cpp/src/arrow/array-list-test.cc +++ b/cpp/src/arrow/array-list-test.cc @@ -36,26 +36,6 @@ using std::vector; namespace arrow { -TEST(TypesTest, TestListType) { - std::shared_ptr vt = std::make_shared(); - - ListType list_type(vt); - ASSERT_EQ(list_type.type, Type::LIST); - - ASSERT_EQ(list_type.name(), string("list")); - ASSERT_EQ(list_type.ToString(), string("list")); - - ASSERT_EQ(list_type.value_type()->type, vt->type); - ASSERT_EQ(list_type.value_type()->type, vt->type); - - std::shared_ptr st = std::make_shared(); - std::shared_ptr lt = std::make_shared(st); - ASSERT_EQ(lt->ToString(), string("list")); - - ListType lt2(lt); - ASSERT_EQ(lt2.ToString(), string("list>")); -} - // ---------------------------------------------------------------------- // List tests diff --git a/cpp/src/arrow/array-string-test.cc b/cpp/src/arrow/array-string-test.cc index 3fdeb3cefe7d2..cf2ff416032c6 100644 --- a/cpp/src/arrow/array-string-test.cc +++ b/cpp/src/arrow/array-string-test.cc @@ -33,22 +33,6 @@ namespace arrow { class Buffer; -TEST(TypesTest, BinaryType) { - BinaryType t1; - BinaryType e1; - StringType t2; - EXPECT_TRUE(t1.Equals(e1)); - EXPECT_FALSE(t1.Equals(t2)); - ASSERT_EQ(t1.type, Type::BINARY); - ASSERT_EQ(t1.ToString(), std::string("binary")); -} - -TEST(TypesTest, TestStringType) { - StringType str; - ASSERT_EQ(str.type, Type::STRING); - ASSERT_EQ(str.ToString(), std::string("string")); -} - // ---------------------------------------------------------------------- // String container @@ -474,4 +458,182 @@ TEST_F(TestBinaryArray, LengthZeroCtor) { BinaryArray array(0, nullptr, nullptr); } +// ---------------------------------------------------------------------- +// FixedWidthBinary tests + +class TestFWBinaryArray : public ::testing::Test { + public: + void SetUp() {} + + void InitBuilder(int byte_width) { + auto type = fixed_width_binary(byte_width); + builder_.reset(new FixedWidthBinaryBuilder(default_memory_pool(), type)); + } + + protected: + std::unique_ptr builder_; +}; + +TEST_F(TestFWBinaryArray, Builder) { + const int32_t byte_width = 10; + int64_t length = 4096; + + int64_t nbytes = length * byte_width; + + 
std::vector data(nbytes); + test::random_bytes(nbytes, 0, data.data()); + + std::vector is_valid(length); + test::random_null_bytes(length, 0.1, is_valid.data()); + + const uint8_t* raw_data = data.data(); + + std::shared_ptr result; + + auto CheckResult = [this, &length, &is_valid, &raw_data, &byte_width]( + const Array& result) { + // Verify output + const auto& fw_result = static_cast(result); + + ASSERT_EQ(length, result.length()); + + for (int64_t i = 0; i < result.length(); ++i) { + if (is_valid[i]) { + ASSERT_EQ( + 0, memcmp(raw_data + byte_width * i, fw_result.GetValue(i), byte_width)); + } else { + ASSERT_TRUE(fw_result.IsNull(i)); + } + } + }; + + // Build using iterative API + InitBuilder(byte_width); + for (int64_t i = 0; i < length; ++i) { + if (is_valid[i]) { + builder_->Append(raw_data + byte_width * i); + } else { + builder_->AppendNull(); + } + } + + ASSERT_OK(builder_->Finish(&result)); + CheckResult(*result); + + // Build using batch API + InitBuilder(byte_width); + + const uint8_t* raw_is_valid = is_valid.data(); + + ASSERT_OK(builder_->Append(raw_data, 50, raw_is_valid)); + ASSERT_OK(builder_->Append(raw_data + 50 * byte_width, length - 50, raw_is_valid + 50)); + ASSERT_OK(builder_->Finish(&result)); + CheckResult(*result); + + // Build from std::string + InitBuilder(byte_width); + for (int64_t i = 0; i < length; ++i) { + if (is_valid[i]) { + builder_->Append(std::string( + reinterpret_cast(raw_data + byte_width * i), byte_width)); + } else { + builder_->AppendNull(); + } + } + + ASSERT_OK(builder_->Finish(&result)); + CheckResult(*result); +} + +TEST_F(TestFWBinaryArray, EqualsRangeEquals) { + // Check that we don't compare data in null slots + + auto type = fixed_width_binary(4); + FixedWidthBinaryBuilder builder1(default_memory_pool(), type); + FixedWidthBinaryBuilder builder2(default_memory_pool(), type); + + ASSERT_OK(builder1.Append("foo1")); + ASSERT_OK(builder1.AppendNull()); + + ASSERT_OK(builder2.Append("foo1")); + ASSERT_OK(builder2.Append("foo2")); + + std::shared_ptr array1, array2; + ASSERT_OK(builder1.Finish(&array1)); + ASSERT_OK(builder2.Finish(&array2)); + + const auto& a1 = static_cast(*array1); + const auto& a2 = static_cast(*array2); + + FixedWidthBinaryArray equal1(type, 2, a1.data(), a1.null_bitmap(), 1); + FixedWidthBinaryArray equal2(type, 2, a2.data(), a1.null_bitmap(), 1); + + ASSERT_TRUE(equal1.Equals(equal2)); + ASSERT_TRUE(equal1.RangeEquals(equal2, 0, 2, 0)); +} + +TEST_F(TestFWBinaryArray, ZeroSize) { + auto type = fixed_width_binary(0); + FixedWidthBinaryBuilder builder(default_memory_pool(), type); + + ASSERT_OK(builder.Append(nullptr)); + ASSERT_OK(builder.Append(nullptr)); + ASSERT_OK(builder.Append(nullptr)); + ASSERT_OK(builder.AppendNull()); + ASSERT_OK(builder.AppendNull()); + ASSERT_OK(builder.AppendNull()); + + std::shared_ptr array; + ASSERT_OK(builder.Finish(&array)); + + const auto& fw_array = static_cast(*array); + + // data is never allocated + ASSERT_TRUE(fw_array.data() == nullptr); + ASSERT_EQ(0, fw_array.byte_width()); + + ASSERT_EQ(6, array->length()); + ASSERT_EQ(3, array->null_count()); +} + +TEST_F(TestFWBinaryArray, Slice) { + auto type = fixed_width_binary(4); + FixedWidthBinaryBuilder builder(default_memory_pool(), type); + + std::vector strings = {"foo1", "foo2", "foo3", "foo4", "foo5"}; + std::vector is_null = {0, 1, 0, 0, 0}; + + for (int i = 0; i < 5; ++i) { + if (is_null[i]) { + builder.AppendNull(); + } else { + builder.Append(strings[i]); + } + } + + std::shared_ptr array; + 
ASSERT_OK(builder.Finish(&array)); + + std::shared_ptr slice, slice2; + + slice = array->Slice(1); + slice2 = array->Slice(1); + ASSERT_EQ(4, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(1, slice->length(), 0, slice)); + + // Chained slices + slice = array->Slice(2); + slice2 = array->Slice(1)->Slice(1); + ASSERT_TRUE(slice->Equals(slice2)); + + slice = array->Slice(1, 3); + ASSERT_EQ(3, slice->length()); + + slice2 = array->Slice(1, 3); + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(1, 3, 0, slice)); +} + } // namespace arrow diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index 49da6bb3197a1..36b3fccf79ed0 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -128,10 +128,6 @@ std::shared_ptr NullArray::Slice(int64_t offset, int64_t length) const { return std::make_shared(length); } -Status NullArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} - // ---------------------------------------------------------------------- // Primitive array base @@ -143,11 +139,6 @@ PrimitiveArray::PrimitiveArray(const std::shared_ptr& type, int64_t le raw_data_ = data == nullptr ? nullptr : data_->data(); } -template -Status NumericArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} - template std::shared_ptr NumericArray::Slice(int64_t offset, int64_t length) const { ConformSliceParams(offset_, length_, &offset, &length); @@ -155,22 +146,6 @@ std::shared_ptr NumericArray::Slice(int64_t offset, int64_t length) co type_, length, data_, null_bitmap_, kUnknownNullCount, offset); } -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; - // ---------------------------------------------------------------------- // BooleanArray @@ -179,10 +154,6 @@ BooleanArray::BooleanArray(int64_t length, const std::shared_ptr& data, : PrimitiveArray(std::make_shared(), length, data, null_bitmap, null_count, offset) {} -Status BooleanArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} - std::shared_ptr BooleanArray::Slice(int64_t offset, int64_t length) const { ConformSliceParams(offset_, length_, &offset, &length); return std::make_shared( @@ -244,10 +215,6 @@ Status ListArray::Validate() const { return Status::OK(); } -Status ListArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} - std::shared_ptr ListArray::Slice(int64_t offset, int64_t length) const { ConformSliceParams(offset_, length_, &offset, &length); return std::make_shared( @@ -285,10 +252,6 @@ Status BinaryArray::Validate() const { return Status::OK(); } -Status BinaryArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} - std::shared_ptr BinaryArray::Slice(int64_t offset, int64_t length) const { ConformSliceParams(offset_, length_, &offset, &length); return std::make_shared( @@ -306,16 +269,33 @@ Status StringArray::Validate() const { return BinaryArray::Validate(); } -Status StringArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} - std::shared_ptr StringArray::Slice(int64_t offset, int64_t length) const { 
ConformSliceParams(offset_, length_, &offset, &length); return std::make_shared( length, value_offsets_, data_, null_bitmap_, kUnknownNullCount, offset); } +// ---------------------------------------------------------------------- +// Fixed width binary + +FixedWidthBinaryArray::FixedWidthBinaryArray(const std::shared_ptr& type, + int64_t length, const std::shared_ptr& data, + const std::shared_ptr& null_bitmap, int64_t null_count, int64_t offset) + : Array(type, length, null_bitmap, null_count, offset), + data_(data), + raw_data_(nullptr) { + DCHECK(type->type == Type::FIXED_WIDTH_BINARY); + byte_width_ = static_cast(*type).byte_width(); + if (data) { raw_data_ = data->data(); } +} + +std::shared_ptr FixedWidthBinaryArray::Slice( + int64_t offset, int64_t length) const { + ConformSliceParams(offset_, length_, &offset, &length); + return std::make_shared( + type_, length, data_, null_bitmap_, kUnknownNullCount, offset); +} + // ---------------------------------------------------------------------- // Struct @@ -368,10 +348,6 @@ Status StructArray::Validate() const { return Status::OK(); } -Status StructArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} - std::shared_ptr StructArray::Slice(int64_t offset, int64_t length) const { ConformSliceParams(offset_, length_, &offset, &length); return std::make_shared( @@ -413,10 +389,6 @@ Status UnionArray::Validate() const { return Status::OK(); } -Status UnionArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} - std::shared_ptr UnionArray::Slice(int64_t offset, int64_t length) const { ConformSliceParams(offset_, length_, &offset, &length); return std::make_shared(type_, length, children_, type_ids_, value_offsets_, @@ -447,17 +419,54 @@ std::shared_ptr DictionaryArray::dictionary() const { return dict_type_->dictionary(); } -Status DictionaryArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} - std::shared_ptr DictionaryArray::Slice(int64_t offset, int64_t length) const { std::shared_ptr sliced_indices = indices_->Slice(offset, length); return std::make_shared(type_, sliced_indices); } // ---------------------------------------------------------------------- -// Default implementations of ArrayVisitor methods +// Implement ArrayVisitor methods + +Status NullArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + +Status BooleanArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + +template +Status NumericArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + +Status BinaryArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + +Status StringArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + +Status FixedWidthBinaryArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + +Status ListArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + +Status StructArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + +Status UnionArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} + +Status DictionaryArray::Accept(ArrayVisitor* visitor) const { + return visitor->Visit(*this); +} #define ARRAY_VISITOR_DEFAULT(ARRAY_CLASS) \ Status ArrayVisitor::Visit(const ARRAY_CLASS& array) { \ @@ -477,8 +486,9 @@ ARRAY_VISITOR_DEFAULT(UInt64Array); ARRAY_VISITOR_DEFAULT(HalfFloatArray); ARRAY_VISITOR_DEFAULT(FloatArray); 
ARRAY_VISITOR_DEFAULT(DoubleArray); -ARRAY_VISITOR_DEFAULT(StringArray); ARRAY_VISITOR_DEFAULT(BinaryArray); +ARRAY_VISITOR_DEFAULT(StringArray); +ARRAY_VISITOR_DEFAULT(FixedWidthBinaryArray); ARRAY_VISITOR_DEFAULT(DateArray); ARRAY_VISITOR_DEFAULT(Date32Array); ARRAY_VISITOR_DEFAULT(TimeArray); @@ -493,4 +503,23 @@ Status ArrayVisitor::Visit(const DecimalArray& array) { return Status::NotImplemented("decimal"); } +// ---------------------------------------------------------------------- +// Instantiate templates + +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; +template class NumericArray; + } // namespace arrow diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index f111609db4317..ecc8ce540b1dd 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -57,6 +57,7 @@ class ARROW_EXPORT ArrayVisitor { virtual Status Visit(const DoubleArray& array); virtual Status Visit(const StringArray& array); virtual Status Visit(const BinaryArray& array); + virtual Status Visit(const FixedWidthBinaryArray& array); virtual Status Visit(const DateArray& array); virtual Status Visit(const Date32Array& array); virtual Status Visit(const TimeArray& array); @@ -386,6 +387,39 @@ class ARROW_EXPORT StringArray : public BinaryArray { std::shared_ptr Slice(int64_t offset, int64_t length) const override; }; +// ---------------------------------------------------------------------- +// Fixed width binary + +class ARROW_EXPORT FixedWidthBinaryArray : public Array { + public: + using TypeClass = FixedWidthBinaryType; + + FixedWidthBinaryArray(const std::shared_ptr& type, int64_t length, + const std::shared_ptr& data, + const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, + int64_t offset = 0); + + const uint8_t* GetValue(int64_t i) const { + return raw_data_ + (i + offset_) * byte_width_; + } + + /// Note that this buffer does not account for any slice offset + std::shared_ptr data() const { return data_; } + + int32_t byte_width() const { return byte_width_; } + + const uint8_t* raw_data() const { return raw_data_; } + + Status Accept(ArrayVisitor* visitor) const override; + + std::shared_ptr Slice(int64_t offset, int64_t length) const override; + + protected: + int32_t byte_width_; + std::shared_ptr data_; + const uint8_t* raw_data_; +}; + // ---------------------------------------------------------------------- // Struct diff --git a/cpp/src/arrow/buffer.h b/cpp/src/arrow/buffer.h index 0724385a4aff8..26c8ea60214f6 100644 --- a/cpp/src/arrow/buffer.h +++ b/cpp/src/arrow/buffer.h @@ -157,6 +157,8 @@ class ARROW_EXPORT BufferBuilder { /// Resizes the buffer to the nearest multiple of 64 bytes per Layout.md Status Resize(int64_t elements) { + // Resize(0) is a no-op + if (elements == 0) { return Status::OK(); } if (capacity_ == 0) { buffer_ = std::make_shared(pool_); } RETURN_NOT_OK(buffer_->Resize(elements)); capacity_ = buffer_->capacity(); @@ -170,6 +172,14 @@ class ARROW_EXPORT BufferBuilder { return Status::OK(); } + // Advance pointer and zero out memory + Status Advance(int64_t length) { + if (capacity_ < length + size_) { RETURN_NOT_OK(Resize(length + size_)); } + memset(data_ + size_, 0, 
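// Zero out the skipped region so the advanced bytes never appear uninitialized;
// FixedWidthBinaryBuilder::AppendNull() relies on this when it advances
// byte_width_ bytes without writing a value.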
static_cast(length)); + size_ += length; + return Status::OK(); + } + template Status Append(T arithmetic_value) { static_assert(std::is_arithmetic::value, diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index 4372925fe494b..b65a4928ec999 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -121,6 +121,14 @@ void ArrayBuilder::UnsafeAppendToBitmap(const uint8_t* valid_bytes, int64_t leng uint8_t bitset = null_bitmap_data_[byte_offset]; for (int64_t i = 0; i < length; ++i) { + if (bit_offset == 8) { + bit_offset = 0; + null_bitmap_data_[byte_offset] = bitset; + byte_offset++; + // TODO: Except for the last byte, this shouldn't be needed + bitset = null_bitmap_data_[byte_offset]; + } + if (valid_bytes[i]) { bitset |= BitUtil::kBitmask[bit_offset]; } else { @@ -129,13 +137,6 @@ void ArrayBuilder::UnsafeAppendToBitmap(const uint8_t* valid_bytes, int64_t leng } bit_offset++; - if (bit_offset == 8) { - bit_offset = 0; - null_bitmap_data_[byte_offset] = bitset; - byte_offset++; - // TODO: Except for the last byte, this shouldn't be needed - bitset = null_bitmap_data_[byte_offset]; - } } if (bit_offset != 0) { null_bitmap_data_[byte_offset] = bitset; } length_ += length; @@ -324,21 +325,37 @@ Status BooleanBuilder::Append( // ---------------------------------------------------------------------- // ListBuilder -ListBuilder::ListBuilder( - MemoryPool* pool, std::shared_ptr value_builder, const TypePtr& type) +ListBuilder::ListBuilder(MemoryPool* pool, std::shared_ptr value_builder, + const std::shared_ptr& type) : ArrayBuilder( pool, type ? type : std::static_pointer_cast( std::make_shared(value_builder->type()))), offset_builder_(pool), value_builder_(value_builder) {} -ListBuilder::ListBuilder( - MemoryPool* pool, std::shared_ptr values, const TypePtr& type) +ListBuilder::ListBuilder(MemoryPool* pool, std::shared_ptr values, + const std::shared_ptr& type) : ArrayBuilder(pool, type ? 
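// When no explicit list type is supplied, synthesize one from the pre-existing
// values array, mirroring the value_builder-based constructor above.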
type : std::static_pointer_cast( std::make_shared(values->type()))), offset_builder_(pool), values_(values) {} +Status ListBuilder::Append( + const int32_t* offsets, int64_t length, const uint8_t* valid_bytes) { + RETURN_NOT_OK(Reserve(length)); + UnsafeAppendToBitmap(valid_bytes, length); + offset_builder_.UnsafeAppend(offsets, length); + return Status::OK(); +} + +Status ListBuilder::Append(bool is_valid) { + RETURN_NOT_OK(Reserve(1)); + UnsafeAppendToBitmap(is_valid); + RETURN_NOT_OK( + offset_builder_.Append(static_cast(value_builder_->length()))); + return Status::OK(); +} + Status ListBuilder::Init(int64_t elements) { DCHECK_LT(elements, std::numeric_limits::max()); RETURN_NOT_OK(ArrayBuilder::Init(elements)); @@ -386,7 +403,7 @@ BinaryBuilder::BinaryBuilder(MemoryPool* pool) byte_builder_ = static_cast(value_builder_.get()); } -BinaryBuilder::BinaryBuilder(MemoryPool* pool, const TypePtr& type) +BinaryBuilder::BinaryBuilder(MemoryPool* pool, const std::shared_ptr& type) : ListBuilder(pool, std::make_shared(pool, uint8()), type) { byte_builder_ = static_cast(value_builder_.get()); } @@ -417,6 +434,58 @@ Status StringBuilder::Finish(std::shared_ptr* out) { return Status::OK(); } +// ---------------------------------------------------------------------- +// Fixed width binary + +FixedWidthBinaryBuilder::FixedWidthBinaryBuilder( + MemoryPool* pool, const std::shared_ptr& type) + : ArrayBuilder(pool, type), byte_builder_(pool) { + DCHECK(type->type == Type::FIXED_WIDTH_BINARY); + byte_width_ = static_cast(*type).byte_width(); +} + +Status FixedWidthBinaryBuilder::Append(const uint8_t* value) { + RETURN_NOT_OK(Reserve(1)); + UnsafeAppendToBitmap(true); + return byte_builder_.Append(value, byte_width_); +} + +Status FixedWidthBinaryBuilder::Append( + const uint8_t* data, int64_t length, const uint8_t* valid_bytes) { + RETURN_NOT_OK(Reserve(length)); + UnsafeAppendToBitmap(valid_bytes, length); + return byte_builder_.Append(data, length * byte_width_); +} + +Status FixedWidthBinaryBuilder::Append(const std::string& value) { + return Append(reinterpret_cast(value.c_str())); +} + +Status FixedWidthBinaryBuilder::AppendNull() { + RETURN_NOT_OK(Reserve(1)); + UnsafeAppendToBitmap(false); + return byte_builder_.Advance(byte_width_); +} + +Status FixedWidthBinaryBuilder::Init(int64_t elements) { + DCHECK_LT(elements, std::numeric_limits::max()); + RETURN_NOT_OK(ArrayBuilder::Init(elements)); + return byte_builder_.Resize(elements * byte_width_); +} + +Status FixedWidthBinaryBuilder::Resize(int64_t capacity) { + DCHECK_LT(capacity, std::numeric_limits::max()); + RETURN_NOT_OK(byte_builder_.Resize(capacity * byte_width_)); + return ArrayBuilder::Resize(capacity); +} + +Status FixedWidthBinaryBuilder::Finish(std::shared_ptr* out) { + std::shared_ptr data = byte_builder_.Finish(); + *out = std::make_shared( + type_, length_, data, null_bitmap_, null_count_); + return Status::OK(); +} + // ---------------------------------------------------------------------- // Struct diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index ebc683ab334e6..07b7cfcb3a964 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -46,7 +46,7 @@ static constexpr int64_t kMinBuilderCapacity = 1 << 5; /// the null count. 
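// Illustrative usage sketch for the new FixedWidthBinaryBuilder, not part of this
// patch (the literal values are assumptions): each Append() writes exactly
// byte_width bytes, and AppendNull() still advances the data buffer by byte_width,
// keeping values aligned with the validity bitmap.
inline Status BuildFixedWidthBinaryExample(std::shared_ptr<Array>* out) {
  FixedWidthBinaryBuilder builder(default_memory_pool(), fixed_width_binary(4));
  RETURN_NOT_OK(builder.Append("abcd"));  // std::string overload, writes 4 bytes
  RETURN_NOT_OK(builder.AppendNull());    // advances past 4 zeroed bytes
  return builder.Finish(out);
}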
class ARROW_EXPORT ArrayBuilder { public: - explicit ArrayBuilder(MemoryPool* pool, const TypePtr& type) + explicit ArrayBuilder(MemoryPool* pool, const std::shared_ptr& type) : pool_(pool), type_(type), null_bitmap_(nullptr), @@ -140,7 +140,7 @@ class ARROW_EXPORT PrimitiveBuilder : public ArrayBuilder { public: using value_type = typename Type::c_type; - explicit PrimitiveBuilder(MemoryPool* pool, const TypePtr& type) + explicit PrimitiveBuilder(MemoryPool* pool, const std::shared_ptr& type) : ArrayBuilder(pool, type), data_(nullptr), raw_data_(nullptr) {} using ArrayBuilder::Advance; @@ -313,11 +313,11 @@ class ARROW_EXPORT ListBuilder : public ArrayBuilder { /// Use this constructor to incrementally build the value array along with offsets and /// null bitmap. ListBuilder(MemoryPool* pool, std::shared_ptr value_builder, - const TypePtr& type = nullptr); + const std::shared_ptr& type = nullptr); /// Use this constructor to build the list with a pre-existing values array - ListBuilder( - MemoryPool* pool, std::shared_ptr values, const TypePtr& type = nullptr); + ListBuilder(MemoryPool* pool, std::shared_ptr values, + const std::shared_ptr& type = nullptr); Status Init(int64_t elements) override; Status Resize(int64_t capacity) override; @@ -328,24 +328,13 @@ class ARROW_EXPORT ListBuilder : public ArrayBuilder { /// If passed, valid_bytes is of equal length to values, and any zero byte /// will be considered as a null for that slot Status Append( - const int32_t* offsets, int64_t length, const uint8_t* valid_bytes = nullptr) { - RETURN_NOT_OK(Reserve(length)); - UnsafeAppendToBitmap(valid_bytes, length); - offset_builder_.UnsafeAppend(offsets, length); - return Status::OK(); - } + const int32_t* offsets, int64_t length, const uint8_t* valid_bytes = nullptr); /// Start a new variable-length list slot /// /// This function should be called before beginning to append elements to the /// value builder - Status Append(bool is_valid = true) { - RETURN_NOT_OK(Reserve(1)); - UnsafeAppendToBitmap(is_valid); - RETURN_NOT_OK( - offset_builder_.Append(static_cast(value_builder_->length()))); - return Status::OK(); - } + Status Append(bool is_valid = true); Status AppendNull() { return Append(false); } @@ -362,11 +351,10 @@ class ARROW_EXPORT ListBuilder : public ArrayBuilder { // ---------------------------------------------------------------------- // Binary and String -// BinaryBuilder : public ListBuilder class ARROW_EXPORT BinaryBuilder : public ListBuilder { public: explicit BinaryBuilder(MemoryPool* pool); - explicit BinaryBuilder(MemoryPool* pool, const TypePtr& type); + explicit BinaryBuilder(MemoryPool* pool, const std::shared_ptr& type); Status Append(const uint8_t* value, int32_t length) { RETURN_NOT_OK(ListBuilder::Append()); @@ -399,6 +387,28 @@ class ARROW_EXPORT StringBuilder : public BinaryBuilder { Status Append(const std::vector& values, uint8_t* null_bytes); }; +// ---------------------------------------------------------------------- +// FixedWidthBinaryBuilder + +class ARROW_EXPORT FixedWidthBinaryBuilder : public ArrayBuilder { + public: + FixedWidthBinaryBuilder(MemoryPool* pool, const std::shared_ptr& type); + + Status Append(const uint8_t* value); + Status Append( + const uint8_t* data, int64_t length, const uint8_t* valid_bytes = nullptr); + Status Append(const std::string& value); + Status AppendNull(); + + Status Init(int64_t elements) override; + Status Resize(int64_t capacity) override; + Status Finish(std::shared_ptr* out) override; + + protected: + int32_t 
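// (byte_width_ below caches FixedWidthBinaryType::byte_width() at construction;
// byte_builder_ accumulates the packed values, i.e. length * byte_width bytes.)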
byte_width_; + BufferBuilder byte_builder_; +}; + // ---------------------------------------------------------------------- // Struct diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc index 17b883302c658..86ed8ccecd1ea 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -143,6 +143,32 @@ class RangeEqualsVisitor : public ArrayVisitor { return Status::OK(); } + Status Visit(const FixedWidthBinaryArray& left) override { + const auto& right = static_cast(right_); + + int32_t width = left.byte_width(); + + const uint8_t* left_data = left.raw_data() + left.offset() * width; + const uint8_t* right_data = right.raw_data() + right.offset() * width; + + for (int64_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_; + ++i, ++o_i) { + const bool is_null = left.IsNull(i); + if (is_null != right.IsNull(o_i)) { + result_ = false; + return Status::OK(); + } + if (is_null) continue; + + if (std::memcmp(left_data + width * i, right_data + width * o_i, width)) { + result_ = false; + return Status::OK(); + } + } + result_ = true; + return Status::OK(); + } + Status Visit(const DateArray& left) override { return CompareValues(left); } Status Visit(const Date32Array& left) override { @@ -632,6 +658,12 @@ class TypeEqualsVisitor : public TypeVisitor { return Status::OK(); } + Status Visit(const FixedWidthBinaryType& left) override { + const auto& right = static_cast(right_); + result_ = left.byte_width() == right.byte_width(); + return Status::OK(); + } + Status Visit(const ListType& left) override { return VisitChildren(left); } Status Visit(const StructType& left) override { return VisitChildren(left); } diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index a4eff7214aa5f..406ce249eec32 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -304,6 +304,17 @@ class RecordBatchWriter : public ArrayVisitor { return Status::OK(); } + Status Visit(const FixedWidthBinaryArray& array) override { + auto data = array.data(); + int32_t width = array.byte_width(); + + if (array.offset() != 0) { + data = SliceBuffer(data, array.offset() * width, width * array.length()); + } + buffers_.push_back(data); + return Status::OK(); + } + Status Visit(const BooleanArray& array) override { buffers_.push_back(array.data()); return Status::OK(); diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc index b60b8a9ba68d2..36a675f5f94f7 100644 --- a/cpp/src/arrow/ipc/ipc-adapter-test.cc +++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc @@ -175,8 +175,8 @@ INSTANTIATE_TEST_CASE_P( RoundTripTests, TestRecordBatchParam, ::testing::Values(&MakeIntRecordBatch, &MakeStringTypesRecordBatch, &MakeNonNullRecordBatch, &MakeZeroLengthRecordBatch, &MakeListRecordBatch, - &MakeDeeplyNestedList, &MakeStruct, &MakeUnion, &MakeDictionary, &MakeDates, - &MakeTimestamps, &MakeTimes)); + &MakeDeeplyNestedList, &MakeStruct, &MakeUnion, &MakeDictionary, &MakeDate, + &MakeTimestamps, &MakeTimes, &MakeFWBinary)); void TestGetRecordBatchSize(std::shared_ptr batch) { ipc::MockOutputStream mock; diff --git a/cpp/src/arrow/ipc/ipc-file-test.cc b/cpp/src/arrow/ipc/ipc-file-test.cc index 0c95c8eca65ca..b45782220e478 100644 --- a/cpp/src/arrow/ipc/ipc-file-test.cc +++ b/cpp/src/arrow/ipc/ipc-file-test.cc @@ -43,7 +43,10 @@ namespace arrow { namespace ipc { void CompareBatch(const RecordBatch& left, const RecordBatch& right) { - ASSERT_TRUE(left.schema()->Equals(right.schema())); + if (!left.schema()->Equals(right.schema())) { + FAIL() << 
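// Emitting both schemas here, instead of a bare assertion failure, makes
// round-trip mismatches diagnosable directly from the test log.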
"Left schema: " << left.schema()->ToString() + << "\nRight schema: " << right.schema()->ToString(); + } ASSERT_EQ(left.num_columns(), right.num_columns()) << left.schema()->ToString() << " result: " << right.schema()->ToString(); EXPECT_EQ(left.num_rows(), right.num_rows()); @@ -180,7 +183,8 @@ TEST_P(TestStreamFormat, RoundTrip) { #define BATCH_CASES() \ ::testing::Values(&MakeIntRecordBatch, &MakeListRecordBatch, &MakeNonNullRecordBatch, \ &MakeZeroLengthRecordBatch, &MakeDeeplyNestedList, &MakeStringTypesRecordBatch, \ - &MakeStruct, &MakeUnion, &MakeDictionary); + &MakeStruct, &MakeUnion, &MakeDictionary, &MakeDate, &MakeTimestamps, &MakeTimes, \ + &MakeFWBinary); INSTANTIATE_TEST_CASE_P(FileRoundTripTests, TestFileFormat, BATCH_CASES()); INSTANTIATE_TEST_CASE_P(StreamRoundTripTests, TestStreamFormat, BATCH_CASES()); diff --git a/cpp/src/arrow/ipc/json-internal.cc b/cpp/src/arrow/ipc/json-internal.cc index 0458b85f0078a..549b26bfe8201 100644 --- a/cpp/src/arrow/ipc/json-internal.cc +++ b/cpp/src/arrow/ipc/json-internal.cc @@ -39,12 +39,11 @@ #include "arrow/type_traits.h" #include "arrow/util/bit-util.h" #include "arrow/util/logging.h" +#include "arrow/util/string.h" namespace arrow { namespace ipc { -static const char* kAsciiTable = "0123456789ABCDEF"; - using RjArray = rj::Value::ConstArray; using RjObject = rj::Value::ConstObject; @@ -401,14 +400,7 @@ class JsonArrayWriter : public ArrayVisitor { if (std::is_base_of::value) { writer_->String(buf, length); } else { - std::string hex_string; - hex_string.reserve(length * 2); - for (int32_t j = 0; j < length; ++j) { - // Convert to 2 base16 digits - hex_string.push_back(kAsciiTable[buf[j] >> 4]); - hex_string.push_back(kAsciiTable[buf[j] & 15]); - } - writer_->String(hex_string); + writer_->String(HexEncode(buf, length)); } } } @@ -760,20 +752,6 @@ class JsonSchemaReader { const rj::Value& json_schema_; }; -static inline Status ParseHexValue(const char* data, uint8_t* out) { - char c1 = data[0]; - char c2 = data[1]; - - const char* pos1 = std::lower_bound(kAsciiTable, kAsciiTable + 16, c1); - const char* pos2 = std::lower_bound(kAsciiTable, kAsciiTable + 16, c2); - - // Error checking - if (*pos1 != c1 || *pos2 != c2) { return Status::Invalid("Encountered non-hex digit"); } - - *out = static_cast((pos1 - kAsciiTable) << 4 | (pos2 - kAsciiTable)); - return Status::OK(); -} - template inline typename std::enable_if::value, typename T::c_type>::type UnboxValue(const rj::Value& val) { diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc index 17a3a5fafe626..be0d282f21bbf 100644 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ b/cpp/src/arrow/ipc/metadata-internal.cc @@ -170,6 +170,39 @@ static Status UnionToFlatBuffer(FBB& fbb, const std::shared_ptr& type, *offset = IntToFlatbuffer(fbb, BIT_WIDTH, IS_SIGNED); \ break; +static inline flatbuf::TimeUnit ToFlatbufferUnit(TimeUnit unit) { + switch (unit) { + case TimeUnit::SECOND: + return flatbuf::TimeUnit_SECOND; + case TimeUnit::MILLI: + return flatbuf::TimeUnit_MILLISECOND; + case TimeUnit::MICRO: + return flatbuf::TimeUnit_MICROSECOND; + case TimeUnit::NANO: + return flatbuf::TimeUnit_NANOSECOND; + default: + break; + } + return flatbuf::TimeUnit_MIN; +} + +static inline TimeUnit FromFlatbufferUnit(flatbuf::TimeUnit unit) { + switch (unit) { + case flatbuf::TimeUnit_SECOND: + return TimeUnit::SECOND; + case flatbuf::TimeUnit_MILLISECOND: + return TimeUnit::MILLI; + case flatbuf::TimeUnit_MICROSECOND: + return TimeUnit::MICRO; + case 
flatbuf::TimeUnit_NANOSECOND: + return TimeUnit::NANO; + default: + break; + } + // cannot reach + return TimeUnit::SECOND; +} + static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, const std::vector>& children, std::shared_ptr* out) { switch (type) { @@ -183,6 +216,11 @@ static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, case flatbuf::Type_Binary: *out = binary(); return Status::OK(); + case flatbuf::Type_FixedWidthBinary: { + auto fw_binary = static_cast(type_data); + *out = fixed_width_binary(fw_binary->byteWidth()); + return Status::OK(); + } case flatbuf::Type_Utf8: *out = utf8(); return Status::OK(); @@ -190,7 +228,22 @@ static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, *out = boolean(); return Status::OK(); case flatbuf::Type_Decimal: - case flatbuf::Type_Timestamp: + return Status::NotImplemented("Decimal"); + case flatbuf::Type_Date: + *out = date(); + return Status::OK(); + case flatbuf::Type_Time: { + auto time_type = static_cast(type_data); + *out = time(FromFlatbufferUnit(time_type->unit())); + return Status::OK(); + } + case flatbuf::Type_Timestamp: { + auto ts_type = static_cast(type_data); + *out = timestamp(FromFlatbufferUnit(ts_type->unit())); + return Status::OK(); + } + case flatbuf::Type_Interval: + return Status::NotImplemented("Interval"); case flatbuf::Type_List: if (children.size() != 1) { return Status::Invalid("List must have exactly 1 child field"); @@ -275,6 +328,11 @@ static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, *out_type = flatbuf::Type_FloatingPoint; *offset = FloatToFlatbuffer(fbb, flatbuf::Precision_DOUBLE); break; + case Type::FIXED_WIDTH_BINARY: { + const auto& fw_type = static_cast(*type); + *out_type = flatbuf::Type_FixedWidthBinary; + *offset = flatbuf::CreateFixedWidthBinary(fbb, fw_type.byte_width()).Union(); + } break; case Type::BINARY: *out_type = flatbuf::Type_Binary; *offset = flatbuf::CreateBinary(fbb).Union(); @@ -283,6 +341,20 @@ static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, *out_type = flatbuf::Type_Utf8; *offset = flatbuf::CreateUtf8(fbb).Union(); break; + case Type::DATE: + *out_type = flatbuf::Type_Date; + *offset = flatbuf::CreateDate(fbb).Union(); + break; + case Type::TIME: { + const auto& time_type = static_cast(*type); + *out_type = flatbuf::Type_Time; + *offset = flatbuf::CreateTime(fbb, ToFlatbufferUnit(time_type.unit)).Union(); + } break; + case Type::TIMESTAMP: { + const auto& ts_type = static_cast(*type); + *out_type = flatbuf::Type_Timestamp; + *offset = flatbuf::CreateTimestamp(fbb, ToFlatbufferUnit(ts_type.unit)).Union(); + } break; case Type::LIST: *out_type = flatbuf::Type_List; return ListToFlatbuffer(fbb, type, children, dictionary_memo, offset); diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index 7f33aba812e0f..66a5e09362cf5 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -463,30 +463,42 @@ Status MakeDictionaryFlat(std::shared_ptr* out) { return Status::OK(); } -Status MakeDates(std::shared_ptr* out) { +Status MakeDate(std::shared_ptr* out) { std::vector is_valid = {true, true, true, false, true, true, true}; - auto f0 = field("f0", date32()); auto f1 = field("f1", date()); - std::shared_ptr schema(new Schema({f0, f1})); + std::shared_ptr schema(new Schema({f1})); std::vector date_values = {1489269000000, 1489270000000, 1489271000000, 1489272000000, 1489272000000, 1489273000000}; - std::vector date32_values = {0, 1, 2, 3, 4, 5, 
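// The split fixtures also highlight the width difference: date32 stores 32-bit
// values (the small integers below), while the date column above carries
// int64_t milliseconds since the UNIX epoch.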
6}; - std::shared_ptr date_array, date32_array; + std::shared_ptr date_array; ArrayFromVector(is_valid, date_values, &date_array); - ArrayFromVector(is_valid, date32_values, &date32_array); - std::vector> arrays = {date32_array, date_array}; + std::vector> arrays = {date_array}; *out = std::make_shared(schema, date_array->length(), arrays); return Status::OK(); } +Status MakeDate32(std::shared_ptr* out) { + std::vector is_valid = {true, true, true, false, true, true, true}; + auto f0 = field("f0", date32()); + std::shared_ptr schema(new Schema({f0})); + + std::vector date32_values = {0, 1, 2, 3, 4, 5, 6}; + + std::shared_ptr date32_array; + ArrayFromVector(is_valid, date32_values, &date32_array); + + std::vector> arrays = {date32_array}; + *out = std::make_shared(schema, date32_array->length(), arrays); + return Status::OK(); +} + Status MakeTimestamps(std::shared_ptr* out) { std::vector is_valid = {true, true, true, false, true, true, true}; auto f0 = field("f0", timestamp(TimeUnit::MILLI)); auto f1 = field("f1", timestamp(TimeUnit::NANO)); - auto f2 = field("f2", timestamp("US/Los_Angeles", TimeUnit::SECOND)); + auto f2 = field("f2", timestamp(TimeUnit::SECOND)); std::shared_ptr schema(new Schema({f0, f1, f2})); std::vector ts_values = {1489269000000, 1489270000000, 1489271000000, @@ -522,6 +534,43 @@ Status MakeTimes(std::shared_ptr* out) { return Status::OK(); } +template +void AppendValues(const std::vector& is_valid, const std::vector& values, + BuilderType* builder) { + for (size_t i = 0; i < values.size(); ++i) { + if (is_valid[i]) { + builder->Append(values[i]); + } else { + builder->AppendNull(); + } + } +} + +Status MakeFWBinary(std::shared_ptr* out) { + std::vector is_valid = {true, true, true, false}; + auto f0 = field("f0", fixed_width_binary(4)); + auto f1 = field("f1", fixed_width_binary(0)); + std::shared_ptr schema(new Schema({f0, f1})); + + std::shared_ptr a1, a2; + + FixedWidthBinaryBuilder b1(default_memory_pool(), f0->type); + FixedWidthBinaryBuilder b2(default_memory_pool(), f0->type); + + std::vector values1 = {"foo1", "foo2", "foo3", "foo4"}; + AppendValues(is_valid, values1, &b1); + + std::vector values2 = {"", "", "", ""}; + AppendValues(is_valid, values2, &b2); + + RETURN_NOT_OK(b1.Finish(&a1)); + RETURN_NOT_OK(b2.Finish(&a2)); + + ArrayVector arrays = {a1, a2}; + *out = std::make_shared(schema, a1->length(), arrays); + return Status::OK(); +} + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/loader.cc b/cpp/src/arrow/loader.cc index 0b3ee1cf0a899..fc373715105e1 100644 --- a/cpp/src/arrow/loader.cc +++ b/cpp/src/arrow/loader.cc @@ -157,6 +157,18 @@ class ArrayLoader : public TypeVisitor { Status Visit(const BinaryType& type) override { return LoadBinary(); } + Status Visit(const FixedWidthBinaryType& type) override { + FieldMetadata field_meta; + std::shared_ptr null_bitmap, data; + + RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); + RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &data)); + + result_ = std::make_shared( + type_, field_meta.length, data, null_bitmap, field_meta.null_count); + return Status::OK(); + } + Status Visit(const ListType& type) override { FieldMetadata field_meta; std::shared_ptr null_bitmap, offsets; diff --git a/cpp/src/arrow/pretty_print-test.cc b/cpp/src/arrow/pretty_print-test.cc index aca650f0a927b..f21383f0cb06f 100644 --- a/cpp/src/arrow/pretty_print-test.cc +++ b/cpp/src/arrow/pretty_print-test.cc @@ -56,7 +56,7 @@ void CheckPrimitive(int indent, const std::vector& is_valid, const std::vector& values, 
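// Helper: materializes an array from parallel validity/value vectors via
// ArrayFromVector, then asserts on the rendered output, so each pretty-print
// test stays a one-liner.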
const char* expected) { std::shared_ptr array; ArrayFromVector(is_valid, values, &array); - CheckArray(*array.get(), indent, expected); + CheckArray(*array, indent, expected); } TEST_F(TestPrettyPrint, PrimitiveType) { @@ -71,6 +71,30 @@ TEST_F(TestPrettyPrint, PrimitiveType) { CheckPrimitive(0, is_valid, values2, ex2); } +TEST_F(TestPrettyPrint, BinaryType) { + std::vector is_valid = {true, true, false, true, false}; + std::vector values = {"foo", "bar", "", "baz", ""}; + static const char* ex = R"expected([666F6F, 626172, null, 62617A, null])expected"; + CheckPrimitive(0, is_valid, values, ex); +} + +TEST_F(TestPrettyPrint, FixedWidthBinaryType) { + std::vector is_valid = {true, true, false, true, false}; + std::vector values = {"foo", "bar", "baz"}; + static const char* ex = R"expected([666F6F, 626172, 62617A])expected"; + + std::shared_ptr array; + auto type = fixed_width_binary(3); + FixedWidthBinaryBuilder builder(default_memory_pool(), type); + + builder.Append(values[0]); + builder.Append(values[1]); + builder.Append(values[2]); + builder.Finish(&array); + + CheckArray(*array, 0, ex); +} + TEST_F(TestPrettyPrint, DictionaryType) { std::vector is_valid = {true, true, false, true, true, true}; diff --git a/cpp/src/arrow/pretty_print.cc b/cpp/src/arrow/pretty_print.cc index 2508fa5bd8cde..87c1a1cf9d9c5 100644 --- a/cpp/src/arrow/pretty_print.cc +++ b/cpp/src/arrow/pretty_print.cc @@ -26,6 +26,7 @@ #include "arrow/table.h" #include "arrow/type.h" #include "arrow/type_traits.h" +#include "arrow/util/string.h" namespace arrow { @@ -66,9 +67,9 @@ class ArrayPrinter : public ArrayVisitor { } } - // String (Utf8), Binary + // String (Utf8) template - typename std::enable_if::value, void>::type + typename std::enable_if::value, void>::type WriteDataValues(const T& array) { int32_t length; for (int i = 0; i < array.length(); ++i) { @@ -82,6 +83,37 @@ class ArrayPrinter : public ArrayVisitor { } } + // Binary + template + typename std::enable_if::value, void>::type + WriteDataValues(const T& array) { + int32_t length; + for (int i = 0; i < array.length(); ++i) { + if (i > 0) { (*sink_) << ", "; } + if (array.IsNull(i)) { + Write("null"); + } else { + const char* buf = reinterpret_cast(array.GetValue(i, &length)); + (*sink_) << HexEncode(buf, length); + } + } + } + + template + typename std::enable_if::value, void>::type + WriteDataValues(const T& array) { + int32_t width = array.byte_width(); + for (int i = 0; i < array.length(); ++i) { + if (i > 0) { (*sink_) << ", "; } + if (array.IsNull(i)) { + Write("null"); + } else { + const char* buf = reinterpret_cast(array.GetValue(i)); + (*sink_) << HexEncode(buf, width); + } + } + } + template typename std::enable_if::value, void>::type WriteDataValues(const T& array) { @@ -100,15 +132,7 @@ class ArrayPrinter : public ArrayVisitor { void CloseArray() { (*sink_) << "]"; } template - Status WritePrimitive(const T& array) { - OpenArray(); - WriteDataValues(array); - CloseArray(); - return Status::OK(); - } - - template - Status WriteVarBytes(const T& array) { + Status WriteArray(const T& array) { OpenArray(); WriteDataValues(array); CloseArray(); @@ -117,39 +141,41 @@ class ArrayPrinter : public ArrayVisitor { Status Visit(const NullArray& array) override { return Status::OK(); } - Status Visit(const BooleanArray& array) override { return WritePrimitive(array); } + Status Visit(const BooleanArray& array) override { return WriteArray(array); } + + Status Visit(const Int8Array& array) override { return WriteArray(array); } - Status Visit(const Int8Array& 
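// The former WritePrimitive/WriteVarBytes split is collapsed into a single
// WriteArray(): the enable_if-dispatched WriteDataValues() overloads above pick
// the rendering (quoted text for StringArray, HexEncode for BinaryArray and
// FixedWidthBinaryArray, raw values otherwise), so every Visit() below is uniform.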
array) override { return WritePrimitive(array); } + Status Visit(const Int16Array& array) override { return WriteArray(array); } - Status Visit(const Int16Array& array) override { return WritePrimitive(array); } + Status Visit(const Int32Array& array) override { return WriteArray(array); } - Status Visit(const Int32Array& array) override { return WritePrimitive(array); } + Status Visit(const Int64Array& array) override { return WriteArray(array); } - Status Visit(const Int64Array& array) override { return WritePrimitive(array); } + Status Visit(const UInt8Array& array) override { return WriteArray(array); } - Status Visit(const UInt8Array& array) override { return WritePrimitive(array); } + Status Visit(const UInt16Array& array) override { return WriteArray(array); } - Status Visit(const UInt16Array& array) override { return WritePrimitive(array); } + Status Visit(const UInt32Array& array) override { return WriteArray(array); } - Status Visit(const UInt32Array& array) override { return WritePrimitive(array); } + Status Visit(const UInt64Array& array) override { return WriteArray(array); } - Status Visit(const UInt64Array& array) override { return WritePrimitive(array); } + Status Visit(const HalfFloatArray& array) override { return WriteArray(array); } - Status Visit(const HalfFloatArray& array) override { return WritePrimitive(array); } + Status Visit(const FloatArray& array) override { return WriteArray(array); } - Status Visit(const FloatArray& array) override { return WritePrimitive(array); } + Status Visit(const DoubleArray& array) override { return WriteArray(array); } - Status Visit(const DoubleArray& array) override { return WritePrimitive(array); } + Status Visit(const StringArray& array) override { return WriteArray(array); } - Status Visit(const StringArray& array) override { return WriteVarBytes(array); } + Status Visit(const BinaryArray& array) override { return WriteArray(array); } - Status Visit(const BinaryArray& array) override { return WriteVarBytes(array); } + Status Visit(const FixedWidthBinaryArray& array) override { return WriteArray(array); } - Status Visit(const DateArray& array) override { return WritePrimitive(array); } + Status Visit(const DateArray& array) override { return WriteArray(array); } - Status Visit(const Date32Array& array) override { return WritePrimitive(array); } + Status Visit(const Date32Array& array) override { return WriteArray(array); } - Status Visit(const TimeArray& array) override { return WritePrimitive(array); } + Status Visit(const TimeArray& array) override { return WriteArray(array); } Status Visit(const TimestampArray& array) override { return Status::NotImplemented("timestamp"); diff --git a/cpp/src/arrow/type-test.cc b/cpp/src/arrow/type-test.cc index 3adc4d83c3a2d..ddfff8745b97e 100644 --- a/cpp/src/arrow/type-test.cc +++ b/cpp/src/arrow/type-test.cc @@ -121,6 +121,58 @@ TEST_F(TestSchema, GetFieldByName) { ASSERT_TRUE(result == nullptr); } +TEST(TestBinaryType, ToString) { + BinaryType t1; + BinaryType e1; + StringType t2; + EXPECT_TRUE(t1.Equals(e1)); + EXPECT_FALSE(t1.Equals(t2)); + ASSERT_EQ(t1.type, Type::BINARY); + ASSERT_EQ(t1.ToString(), std::string("binary")); +} + +TEST(TestStringType, ToString) { + StringType str; + ASSERT_EQ(str.type, Type::STRING); + ASSERT_EQ(str.ToString(), std::string("string")); +} + +TEST(TestFixedWidthBinaryType, ToString) { + auto t = fixed_width_binary(10); + ASSERT_EQ(t->type, Type::FIXED_WIDTH_BINARY); + ASSERT_EQ("fixed_width_binary[10]", t->ToString()); +} + +TEST(TestFixedWidthBinaryType, 
Equals) { + auto t1 = fixed_width_binary(10); + auto t2 = fixed_width_binary(10); + auto t3 = fixed_width_binary(3); + + ASSERT_TRUE(t1->Equals(t1)); + ASSERT_TRUE(t1->Equals(t2)); + ASSERT_FALSE(t1->Equals(t3)); +} + +TEST(TestListType, Basics) { + std::shared_ptr vt = std::make_shared(); + + ListType list_type(vt); + ASSERT_EQ(list_type.type, Type::LIST); + + ASSERT_EQ("list", list_type.name()); + ASSERT_EQ("list", list_type.ToString()); + + ASSERT_EQ(list_type.value_type()->type, vt->type); + ASSERT_EQ(list_type.value_type()->type, vt->type); + + std::shared_ptr st = std::make_shared(); + std::shared_ptr lt = std::make_shared(st); + ASSERT_EQ("list", lt->ToString()); + + ListType lt2(lt); + ASSERT_EQ("list>", lt2.ToString()); +} + TEST(TestTimeType, Equals) { TimeType t1; TimeType t2; diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index d41b36315a86e..ee0a89ab8abea 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -88,6 +88,16 @@ std::string BinaryType::ToString() const { return std::string("binary"); } +int FixedWidthBinaryType::bit_width() const { + return 8 * byte_width(); +} + +std::string FixedWidthBinaryType::ToString() const { + std::stringstream ss; + ss << "fixed_width_binary[" << byte_width_ << "]"; + return ss.str(); +} + std::string StructType::ToString() const { std::stringstream s; s << "struct<"; @@ -189,6 +199,7 @@ std::string NullType::ToString() const { ACCEPT_VISITOR(NullType); ACCEPT_VISITOR(BooleanType); ACCEPT_VISITOR(BinaryType); +ACCEPT_VISITOR(FixedWidthBinaryType); ACCEPT_VISITOR(StringType); ACCEPT_VISITOR(ListType); ACCEPT_VISITOR(StructType); @@ -225,6 +236,10 @@ TYPE_FACTORY(binary, BinaryType); TYPE_FACTORY(date, DateType); TYPE_FACTORY(date32, Date32Type); +std::shared_ptr fixed_width_binary(int32_t byte_width) { + return std::make_shared(byte_width); +} + std::shared_ptr timestamp(TimeUnit unit) { return std::make_shared(unit); } @@ -285,6 +300,10 @@ std::vector BinaryType::GetBufferLayout() const { return {kValidityBuffer, kOffsetBuffer, kValues8}; } +std::vector FixedWidthBinaryType::GetBufferLayout() const { + return {kValidityBuffer, BufferDescr(BufferType::DATA, byte_width_ * 8)}; +} + std::vector ListType::GetBufferLayout() const { return {kValidityBuffer, kOffsetBuffer}; } @@ -335,6 +354,7 @@ TYPE_VISITOR_DEFAULT(FloatType); TYPE_VISITOR_DEFAULT(DoubleType); TYPE_VISITOR_DEFAULT(StringType); TYPE_VISITOR_DEFAULT(BinaryType); +TYPE_VISITOR_DEFAULT(FixedWidthBinaryType); TYPE_VISITOR_DEFAULT(DateType); TYPE_VISITOR_DEFAULT(Date32Type); TYPE_VISITOR_DEFAULT(TimeType); diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 9f28875925a4b..a143d79013fb1 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -68,6 +68,9 @@ struct Type { // Variable-length bytes (no guarantee of UTF8-ness) BINARY, + // Fixed-width binary. Each value occupies the same number of bytes + FIXED_WIDTH_BINARY, + // int64_t milliseconds since the UNIX epoch DATE, @@ -135,6 +138,7 @@ class ARROW_EXPORT TypeVisitor { virtual Status Visit(const DoubleType& type); virtual Status Visit(const StringType& type); virtual Status Visit(const BinaryType& type); + virtual Status Visit(const FixedWidthBinaryType& type); virtual Status Visit(const DateType& type); virtual Status Visit(const Date32Type& type); virtual Status Visit(const TimeType& type); @@ -347,7 +351,7 @@ struct ARROW_EXPORT ListType : public DataType, public NoExtraMeta { std::vector GetBufferLayout() const override; }; -// BinaryType type is reprsents lists of 1-byte values. 
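// Illustrative sketch, not part of this patch, mirroring the type tests above:
// fixed_width_binary(n) parameterizes the type, so equality is decided by byte
// width, and the data buffer's fixed bit width follows directly from it.
//
//   auto t = fixed_width_binary(10);
//   DCHECK(t->Equals(fixed_width_binary(10)));   // equal: same byte width
//   DCHECK(!t->Equals(fixed_width_binary(3)));   // unequal widths
//   // t->ToString() == "fixed_width_binary[10]"
//   // FixedWidthBinaryType::bit_width() == 8 * byte_width() == 80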
+// BinaryType type is represents lists of 1-byte values. struct ARROW_EXPORT BinaryType : public DataType, public NoExtraMeta { static constexpr Type::type type_id = Type::BINARY; @@ -364,7 +368,27 @@ struct ARROW_EXPORT BinaryType : public DataType, public NoExtraMeta { explicit BinaryType(Type::type logical_type) : DataType(logical_type) {} }; -// UTF encoded strings +// BinaryType type is represents lists of 1-byte values. +class ARROW_EXPORT FixedWidthBinaryType : public FixedWidthType { + public: + static constexpr Type::type type_id = Type::FIXED_WIDTH_BINARY; + + explicit FixedWidthBinaryType(int32_t byte_width) + : FixedWidthType(Type::FIXED_WIDTH_BINARY), byte_width_(byte_width) {} + + Status Accept(TypeVisitor* visitor) const override; + std::string ToString() const override; + + std::vector GetBufferLayout() const override; + + int32_t byte_width() const { return byte_width_; } + int bit_width() const override; + + protected: + int32_t byte_width_; +}; + +// UTF-8 encoded strings struct ARROW_EXPORT StringType : public BinaryType { static constexpr Type::type type_id = Type::STRING; @@ -571,6 +595,8 @@ class ARROW_EXPORT DictionaryType : public FixedWidthType { // ---------------------------------------------------------------------- // Factory functions +std::shared_ptr ARROW_EXPORT fixed_width_binary(int32_t byte_width); + std::shared_ptr ARROW_EXPORT list(const std::shared_ptr& value_type); std::shared_ptr ARROW_EXPORT list(const std::shared_ptr& value_type); diff --git a/cpp/src/arrow/type_fwd.h b/cpp/src/arrow/type_fwd.h index e53afe1a34d36..7fc36c4bde06b 100644 --- a/cpp/src/arrow/type_fwd.h +++ b/cpp/src/arrow/type_fwd.h @@ -48,6 +48,10 @@ struct BinaryType; class BinaryArray; class BinaryBuilder; +class FixedWidthBinaryType; +class FixedWidthBinaryArray; +class FixedWidthBinaryBuilder; + struct StringType; class StringArray; class StringBuilder; diff --git a/cpp/src/arrow/type_traits.h b/cpp/src/arrow/type_traits.h index 91461da8c42a6..242e59d10fce4 100644 --- a/cpp/src/arrow/type_traits.h +++ b/cpp/src/arrow/type_traits.h @@ -228,6 +228,13 @@ struct TypeTraits { static inline std::shared_ptr type_singleton() { return binary(); } }; +template <> +struct TypeTraits { + using ArrayType = FixedWidthBinaryArray; + using BuilderType = FixedWidthBinaryBuilder; + constexpr static bool is_parameter_free = false; +}; + // Not all type classes have a c_type template struct as_void { diff --git a/cpp/src/arrow/util/io-util.h b/cpp/src/arrow/util/io-util.h index 9f2645699004c..34bee18df5229 100644 --- a/cpp/src/arrow/util/io-util.h +++ b/cpp/src/arrow/util/io-util.h @@ -18,9 +18,12 @@ #ifndef ARROW_UTIL_IO_UTIL_H #define ARROW_UTIL_IO_UTIL_H -#include "arrow/buffer.h" #include +#include "arrow/buffer.h" +#include "arrow/io/interfaces.h" +#include "arrow/status.h" + namespace arrow { namespace io { diff --git a/cpp/src/arrow/util/string.h b/cpp/src/arrow/util/string.h new file mode 100644 index 0000000000000..5d9fdc88ced7e --- /dev/null +++ b/cpp/src/arrow/util/string.h @@ -0,0 +1,57 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_UTIL_STRING_UTIL_H +#define ARROW_UTIL_STRING_UTIL_H + +#include +#include + +#include "arrow/status.h" + +namespace arrow { + +static const char* kAsciiTable = "0123456789ABCDEF"; + +static inline std::string HexEncode(const char* data, int32_t length) { + std::string hex_string; + hex_string.reserve(length * 2); + for (int32_t j = 0; j < length; ++j) { + // Convert to 2 base16 digits + hex_string.push_back(kAsciiTable[data[j] >> 4]); + hex_string.push_back(kAsciiTable[data[j] & 15]); + } + return hex_string; +} + +static inline Status ParseHexValue(const char* data, uint8_t* out) { + char c1 = data[0]; + char c2 = data[1]; + + const char* pos1 = std::lower_bound(kAsciiTable, kAsciiTable + 16, c1); + const char* pos2 = std::lower_bound(kAsciiTable, kAsciiTable + 16, c2); + + // Error checking + if (*pos1 != c1 || *pos2 != c2) { return Status::Invalid("Encountered non-hex digit"); } + + *out = static_cast((pos1 - kAsciiTable) << 4 | (pos2 - kAsciiTable)); + return Status::OK(); +} + +} // namespace arrow + +#endif // ARROW_UTIL_STRING_UTIL_H diff --git a/format/Message.fbs b/format/Message.fbs index 86dfa87b04807..fb3478de5d2a0 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -68,6 +68,11 @@ table Utf8 { table Binary { } +table FixedWidthBinary { + /// Number of bytes per value + byteWidth: int; +} + table Bool { } @@ -113,7 +118,8 @@ union Type { Interval, List, Struct_, - Union + Union, + FixedWidthBinary } /// ---------------------------------------------------------------------- From 3b650014f6c59c6cf6f488572c5cd340bf2da453 Mon Sep 17 00:00:00 2001 From: Johan Mabille Date: Thu, 16 Mar 2017 12:01:13 -0400 Subject: [PATCH 0369/1644] ARROW-520: [C++] STL-compliant allocator Ready for review Author: Johan Mabille Closes #381 from JohanMabille/stl_allocator and squashes the following commits: 53c6821 [Johan Mabille] stl allocator --- cpp/src/arrow/CMakeLists.txt | 2 + cpp/src/arrow/allocator-test.cc | 72 ++++++++++++++++++++++++ cpp/src/arrow/allocator.h | 98 +++++++++++++++++++++++++++++++++ cpp/src/arrow/memory_pool.h | 6 +- 4 files changed, 175 insertions(+), 3 deletions(-) create mode 100644 cpp/src/arrow/allocator-test.cc create mode 100644 cpp/src/arrow/allocator.h diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index ddeb81cae7b5b..0abd4b9c34b0a 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -17,6 +17,7 @@ # Headers: top level install(FILES + allocator.h api.h array.h buffer.h @@ -47,6 +48,7 @@ install( # Unit tests ####################################### +ADD_ARROW_TEST(allocator-test) ADD_ARROW_TEST(array-test) ADD_ARROW_TEST(array-decimal-test) ADD_ARROW_TEST(array-dictionary-test) diff --git a/cpp/src/arrow/allocator-test.cc b/cpp/src/arrow/allocator-test.cc new file mode 100644 index 0000000000000..0b242674bf175 --- /dev/null +++ b/cpp/src/arrow/allocator-test.cc @@ -0,0 +1,72 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. 
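// Illustrative round-trip sketch, not part of this patch, for the helpers the
// previous patch moved into arrow/util/string.h: HexEncode() emits two uppercase
// base16 digits per byte, and ParseHexValue() inverts one such pair.
//
//   std::string hex = HexEncode("z", 1);               // 'z' == 0x7A -> "7A"
//   uint8_t byte = 0;
//   RETURN_NOT_OK(ParseHexValue(hex.data(), &byte));   // byte == 0x7A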
See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "gtest/gtest.h" +#include "arrow/allocator.h" +#include "arrow/test-util.h" + +namespace arrow { + +TEST(stl_allocator, MemoryTracking) { + auto pool = default_memory_pool(); + stl_allocator alloc; + uint64_t* data = alloc.allocate(100); + + ASSERT_EQ(100 * sizeof(uint64_t), pool->bytes_allocated()); + + alloc.deallocate(data, 100); + ASSERT_EQ(0, pool->bytes_allocated()); +} + +#if !(defined(ARROW_VALGRIND) || defined(ADDRESS_SANITIZER)) + +TEST(stl_allocator, TestOOM) { + stl_allocator alloc; + uint64_t to_alloc = std::numeric_limits::max(); + ASSERT_THROW(alloc.allocate(to_alloc), std::bad_alloc); +} + +TEST(stl_allocator, FreeLargeMemory) { + stl_allocator alloc; + + uint8_t* data = alloc.allocate(100); + +#ifndef NDEBUG + EXPECT_EXIT(alloc.deallocate(data, 120), ::testing::ExitedWithCode(1), + ".*Check failed: \\(bytes_allocated_\\) >= \\(size\\)"); +#endif + + alloc.deallocate(data, 100); +} + +TEST(stl_allocator, MaxMemory) { + DefaultMemoryPool pool; + + ASSERT_EQ(0, pool.max_memory()); + stl_allocator alloc(&pool); + uint8_t* data = alloc.allocate(100); + uint8_t* data2 = alloc.allocate(100); + + alloc.deallocate(data, 100); + alloc.deallocate(data2, 100); + + ASSERT_EQ(200, pool.max_memory()); +} + +#endif // ARROW_VALGRIND + +} // namespace arrow diff --git a/cpp/src/arrow/allocator.h b/cpp/src/arrow/allocator.h new file mode 100644 index 0000000000000..c976ba96b8d03 --- /dev/null +++ b/cpp/src/arrow/allocator.h @@ -0,0 +1,98 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
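// Illustrative usage sketch, not part of this patch: the allocator defined below
// routes STL container allocations through an Arrow MemoryPool, so a plain
// std::vector participates in the pool's byte accounting (sizes are assumptions).
//
//   MemoryPool* pool = default_memory_pool();
//   std::vector<int64_t, stl_allocator<int64_t>> vec{stl_allocator<int64_t>(pool)};
//   vec.resize(100);   // 800 bytes allocated from, and tracked by, the pool
//   // pool->bytes_allocated() now reports the vector's storage; on failure,
//   // allocate() throws std::bad_alloc rather than returning a Status.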
+ +#ifndef ARROW_ALLOCATOR_H +#define ARROW_ALLOCATOR_H + +#include +#include +#include +#include "arrow/memory_pool.h" +#include "arrow/status.h" + +namespace arrow { + +template +class stl_allocator { + public: + using value_type = T; + using pointer = T*; + using const_pointer = const T*; + using reference = T&; + using const_reference = const T&; + using size_type = std::size_t; + using difference_type = std::ptrdiff_t; + + template + struct rebind { + using other = stl_allocator; + }; + + stl_allocator() noexcept : pool_(default_memory_pool()) {} + explicit stl_allocator(MemoryPool* pool) noexcept : pool_(pool) {} + + template + stl_allocator(const stl_allocator& rhs) noexcept : pool_(rhs.pool_) {} + + ~stl_allocator() { pool_ = nullptr; } + + pointer address(reference r) const noexcept { return std::addressof(r); } + + const_pointer address(const_reference r) const noexcept { return std::addressof(r); } + + pointer allocate(size_type n, const void* /*hint*/ = nullptr) { + uint8_t* data; + Status s = pool_->Allocate(n * sizeof(T), &data); + if (!s.ok()) throw std::bad_alloc(); + return reinterpret_cast(data); + } + + void deallocate(pointer p, size_type n) { + pool_->Free(reinterpret_cast(p), n * sizeof(T)); + } + + size_type size_max() const noexcept { return size_type(-1) / sizeof(T); } + + template + void construct(U* p, Args&&... args) { + new (reinterpret_cast(p)) U(std::forward(args)...); + } + + template + void destroy(U* p) { + p->~U(); + } + + MemoryPool* pool() const noexcept { return pool_; } + + private: + MemoryPool* pool_; +}; + +template +bool operator==(const stl_allocator& lhs, const stl_allocator& rhs) noexcept { + return lhs.pool() == rhs.pool(); +} + +template +bool operator!=(const stl_allocator& lhs, const stl_allocator& rhs) noexcept { + return !(lhs == rhs); +} + +} // namespace arrow + +#endif // ARROW_ALLOCATOR_H diff --git a/cpp/src/arrow/memory_pool.h b/cpp/src/arrow/memory_pool.h index 33d4c3e9aad52..0edfda635d0e8 100644 --- a/cpp/src/arrow/memory_pool.h +++ b/cpp/src/arrow/memory_pool.h @@ -15,8 +15,8 @@ // specific language governing permissions and limitations // under the License. 
-#ifndef ARROW_UTIL_MEMORY_POOL_H -#define ARROW_UTIL_MEMORY_POOL_H +#ifndef ARROW_MEMORY_POOL_H +#define ARROW_MEMORY_POOL_H #include #include @@ -93,4 +93,4 @@ ARROW_EXPORT MemoryPool* default_memory_pool(); } // namespace arrow -#endif // ARROW_UTIL_MEMORY_POOL_H +#endif // ARROW_MEMORY_POOL_H From 49f666e740208d1e6167537f141f27b6b78b77cb Mon Sep 17 00:00:00 2001 From: Emilio Lahr-Vivaz Date: Thu, 16 Mar 2017 13:59:53 -0400 Subject: [PATCH 0370/1644] ARROW-542: Adding dictionary encoding to FileWriter WIP for comments Author: Emilio Lahr-Vivaz Author: Wes McKinney Closes #334 from elahrvivaz/ARROW-542 and squashes the following commits: 5339730 [Emilio Lahr-Vivaz] fixing bitvector load of value count, adding struct integration test 00d78d3 [Emilio Lahr-Vivaz] fixing set bit validity value in NullableMapVector load 1679934 [Emilio Lahr-Vivaz] cleaning up license 70639e0 [Emilio Lahr-Vivaz] restoring vector loader test bde4eee [Wes McKinney] Handle 0-length message indicator for EOS in C++ StreamReader a24854b [Emilio Lahr-Vivaz] fixing StreamToFile conversion 2ee7cfb [Emilio Lahr-Vivaz] fixing FileToStream conversion adec200 [Emilio Lahr-Vivaz] making arrow magic static, cleanup 8366288 [Emilio Lahr-Vivaz] making magic array private 127937f [Emilio Lahr-Vivaz] removing qualifier for magic db9a007 [Emilio Lahr-Vivaz] adding dictionary tests to echo server 95c7b2a [Emilio Lahr-Vivaz] cleanup 45caa02 [Emilio Lahr-Vivaz] reverting basewriter dictionary methods 682db6f [Emilio Lahr-Vivaz] cleanup a1508b9 [Emilio Lahr-Vivaz] removing dictionary vector method (instead use field.dictionary) 43c28af [Emilio Lahr-Vivaz] adding test for nested dictionary encoded list 92a1e6f [Emilio Lahr-Vivaz] fixing imports e567564 [Emilio Lahr-Vivaz] adding field size check in vectorschemaroot 568fda5 [Emilio Lahr-Vivaz] imports, formatting 363308e [Emilio Lahr-Vivaz] fixing tests 2f69be1 [Emilio Lahr-Vivaz] not passing around dictionary vectors with dictionary fields, adding dictionary encoding to fields, restoring vector loader/unloader e5c8e02 [Emilio Lahr-Vivaz] Merging dictionary unloader/loader with arrow writer/reader Creating base class for stream/file writer Creating base class with visitors for arrow messages Indentation fixes Other cleanup d095f3f [Emilio Lahr-Vivaz] ARROW-542: Adding dictionary encoding to file and stream writing --- cpp/src/arrow/ipc/reader.cc | 6 + integration/integration_test.py | 4 + .../org/apache/arrow/tools/EchoServer.java | 48 +- .../org/apache/arrow/tools/FileRoundtrip.java | 48 +- .../org/apache/arrow/tools/FileToStream.java | 27 +- .../org/apache/arrow/tools/Integration.java | 83 +-- .../org/apache/arrow/tools/StreamToFile.java | 19 +- .../arrow/tools/ArrowFileTestFixtures.java | 51 +- .../apache/arrow/tools/EchoServerTest.java | 280 ++++++-- .../apache/arrow/tools/TestIntegration.java | 38 +- java/tools/tmptestfilesio | Bin 0 -> 628 bytes .../main/codegen/templates/MapWriters.java | 8 +- .../templates/NullableValueVectors.java | 40 +- .../main/codegen/templates/UnionVector.java | 10 +- .../org/apache/arrow/vector/BitVector.java | 2 +- .../org/apache/arrow/vector/FieldVector.java | 4 +- .../org/apache/arrow/vector/VectorLoader.java | 13 +- .../apache/arrow/vector/VectorSchemaRoot.java | 32 +- .../apache/arrow/vector/VectorUnloader.java | 27 +- .../complex/AbstractContainerVector.java | 3 +- .../vector/complex/AbstractMapVector.java | 9 +- .../complex/BaseRepeatedValueVector.java | 5 +- .../vector/complex/DictionaryVector.java | 229 ------ 
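// Illustrative sketch, not part of this patch: the C++ reader change at the top
// of this commit treats an optional zero-length control message as end-of-stream.
// Stream messages are framed by a little-endian int32 length prefix, so a reader
// loop can detect EOS as below (the function name is an assumption; Read() is the
// io::InputStream call already used by StreamReaderImpl).
Status ReadMessageLength(io::InputStream* stream, int32_t* length, bool* eos) {
  std::shared_ptr<Buffer> buffer;
  RETURN_NOT_OK(stream->Read(static_cast<int64_t>(sizeof(int32_t)), &buffer));
  if (buffer->size() < static_cast<int64_t>(sizeof(int32_t))) {
    *eos = true;  // stream exhausted before a full length prefix
    return Status::OK();
  }
  *length = *reinterpret_cast<const int32_t*>(buffer->data());
  *eos = (*length == 0);  // optional 0 EOS control message, per the change above
  return Status::OK();
}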
.../arrow/vector/complex/ListVector.java | 26 +- .../arrow/vector/complex/MapVector.java | 5 +- .../vector/complex/NullableMapVector.java | 9 +- .../complex/impl/ComplexWriterImpl.java | 6 +- .../vector/complex/impl/PromotableWriter.java | 5 +- .../arrow/vector/dictionary/Dictionary.java | 66 ++ .../vector/dictionary/DictionaryEncoder.java | 144 ++++ .../vector/dictionary/DictionaryProvider.java | 47 ++ .../arrow/vector/file/ArrowFileReader.java | 142 ++++ .../arrow/vector/file/ArrowFileWriter.java | 59 ++ .../apache/arrow/vector/file/ArrowFooter.java | 1 - .../apache/arrow/vector/file/ArrowMagic.java | 37 + .../apache/arrow/vector/file/ArrowReader.java | 222 ++++-- .../apache/arrow/vector/file/ArrowWriter.java | 173 ++++- .../apache/arrow/vector/file/ReadChannel.java | 11 +- .../SeekableReadChannel.java} | 29 +- .../arrow/vector/file/WriteChannel.java | 7 +- .../vector/file/json/JsonFileReader.java | 26 +- .../vector/schema/ArrowDictionaryBatch.java | 60 ++ .../arrow/vector/schema/ArrowMessage.java | 30 + .../arrow/vector/schema/ArrowRecordBatch.java | 8 +- .../vector/stream/ArrowStreamReader.java | 88 +-- .../vector/stream/ArrowStreamWriter.java | 75 +- .../vector/stream/MessageSerializer.java | 164 ++++- .../org/apache/arrow/vector/types/Types.java | 114 +-- .../vector/types/pojo/DictionaryEncoding.java | 51 ++ .../apache/arrow/vector/types/pojo/Field.java | 59 +- .../arrow/vector/TestDecimalVector.java | 2 +- .../arrow/vector/TestDictionaryVector.java | 82 +-- .../apache/arrow/vector/TestListVector.java | 4 +- .../apache/arrow/vector/TestValueVector.java | 12 +- .../arrow/vector/TestVectorUnloadLoad.java | 22 +- .../complex/impl/TestPromotableWriter.java | 2 +- .../complex/writer/TestComplexWriter.java | 14 +- .../arrow/vector/file/TestArrowFile.java | 665 +++++++++++------- .../vector/file/TestArrowReaderWriter.java | 28 +- .../arrow/vector/file/TestArrowStream.java | 102 +++ .../vector/file/TestArrowStreamPipe.java | 163 +++++ .../arrow/vector/file/json/TestJSONFile.java | 4 +- .../vector/stream/MessageSerializerTest.java | 8 +- .../arrow/vector/stream/TestArrowStream.java | 96 --- .../vector/stream/TestArrowStreamPipe.java | 129 ---- 65 files changed, 2497 insertions(+), 1486 deletions(-) create mode 100644 java/tools/tmptestfilesio delete mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/DictionaryVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/dictionary/Dictionary.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryEncoder.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryProvider.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFileReader.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFileWriter.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/file/ArrowMagic.java rename java/vector/src/main/java/org/apache/arrow/vector/{types/Dictionary.java => file/SeekableReadChannel.java} (57%) create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowDictionaryBatch.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowMessage.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/types/pojo/DictionaryEncoding.java create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowStream.java create mode 100644 
java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowStreamPipe.java delete mode 100644 java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStream.java delete mode 100644 java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStreamPipe.java diff --git a/cpp/src/arrow/ipc/reader.cc b/cpp/src/arrow/ipc/reader.cc index 973416670bdfa..4cb5f6cccc4c8 100644 --- a/cpp/src/arrow/ipc/reader.cc +++ b/cpp/src/arrow/ipc/reader.cc @@ -78,6 +78,12 @@ class StreamReader::StreamReaderImpl { int32_t message_length = *reinterpret_cast(buffer->data()); + if (message_length == 0) { + // Optional 0 EOS control message + *message = nullptr; + return Status::OK(); + } + RETURN_NOT_OK(stream_->Read(message_length, &buffer)); if (buffer->size() != message_length) { return Status::IOError("Unexpected end of stream trying to read message"); diff --git a/integration/integration_test.py b/integration/integration_test.py index 049436a751f38..5cd63c502bd20 100644 --- a/integration/integration_test.py +++ b/integration/integration_test.py @@ -680,12 +680,16 @@ def stream_to_file(self, stream_path, file_path): cmd = ['java', '-cp', self.ARROW_TOOLS_JAR, 'org.apache.arrow.tools.StreamToFile', stream_path, file_path] + if self.debug: + print(' '.join(cmd)) run_cmd(cmd) def file_to_stream(self, file_path, stream_path): cmd = ['java', '-cp', self.ARROW_TOOLS_JAR, 'org.apache.arrow.tools.FileToStream', file_path, stream_path] + if self.debug: + print(' '.join(cmd)) run_cmd(cmd) diff --git a/java/tools/src/main/java/org/apache/arrow/tools/EchoServer.java b/java/tools/src/main/java/org/apache/arrow/tools/EchoServer.java index c00620e44b064..7c0cadd9d77dd 100644 --- a/java/tools/src/main/java/org/apache/arrow/tools/EchoServer.java +++ b/java/tools/src/main/java/org/apache/arrow/tools/EchoServer.java @@ -18,23 +18,19 @@ package org.apache.arrow.tools; import java.io.IOException; -import java.io.InputStream; -import java.io.OutputStream; import java.net.ServerSocket; import java.net.Socket; -import java.util.ArrayList; -import java.util.List; + +import com.google.common.base.Preconditions; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; -import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.VectorSchemaRoot; import org.apache.arrow.vector.stream.ArrowStreamReader; import org.apache.arrow.vector.stream.ArrowStreamWriter; import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import com.google.common.base.Preconditions; - public class EchoServer { private static final Logger LOGGER = LoggerFactory.getLogger(EchoServer.class); @@ -57,30 +53,28 @@ public ClientConnection(Socket socket) { public void run() throws IOException { BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE); - List batches = new ArrayList(); - try ( - InputStream in = socket.getInputStream(); - OutputStream out = socket.getOutputStream(); - ArrowStreamReader reader = new ArrowStreamReader(in, allocator); - ) { - // Read the entire input stream. 
- reader.init(); - while (true) { - ArrowRecordBatch batch = reader.nextRecordBatch(); - if (batch == null) break; - batches.add(batch); - } - LOGGER.info(String.format("Received %d batches", batches.size())); - - // Write it back - try (ArrowStreamWriter writer = new ArrowStreamWriter(out, reader.getSchema())) { - for (ArrowRecordBatch batch: batches) { - writer.writeRecordBatch(batch); + // Read the entire input stream and write it back + try (ArrowStreamReader reader = new ArrowStreamReader(socket.getInputStream(), allocator)) { + VectorSchemaRoot root = reader.getVectorSchemaRoot(); + // load the first batch before instantiating the writer so that we have any dictionaries + reader.loadNextBatch(); + try (ArrowStreamWriter writer = new ArrowStreamWriter(root, reader, socket.getOutputStream())) { + writer.start(); + int echoed = 0; + while (true) { + int rowCount = reader.getVectorSchemaRoot().getRowCount(); + if (rowCount == 0) { + break; + } else { + writer.writeBatch(); + echoed += rowCount; + reader.loadNextBatch(); + } } writer.end(); Preconditions.checkState(reader.bytesRead() == writer.bytesWritten()); + LOGGER.info(String.format("Echoed %d records", echoed)); } - LOGGER.info("Done writing stream back."); } } diff --git a/java/tools/src/main/java/org/apache/arrow/tools/FileRoundtrip.java b/java/tools/src/main/java/org/apache/arrow/tools/FileRoundtrip.java index db7a1c23f9ca6..9fa7b761a5772 100644 --- a/java/tools/src/main/java/org/apache/arrow/tools/FileRoundtrip.java +++ b/java/tools/src/main/java/org/apache/arrow/tools/FileRoundtrip.java @@ -23,18 +23,12 @@ import java.io.FileOutputStream; import java.io.IOException; import java.io.PrintStream; -import java.util.List; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; -import org.apache.arrow.vector.VectorLoader; import org.apache.arrow.vector.VectorSchemaRoot; -import org.apache.arrow.vector.VectorUnloader; -import org.apache.arrow.vector.file.ArrowBlock; -import org.apache.arrow.vector.file.ArrowFooter; -import org.apache.arrow.vector.file.ArrowReader; -import org.apache.arrow.vector.file.ArrowWriter; -import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.file.ArrowFileReader; +import org.apache.arrow.vector.file.ArrowFileWriter; import org.apache.arrow.vector.types.pojo.Schema; import org.apache.commons.cli.CommandLine; import org.apache.commons.cli.CommandLineParser; @@ -86,35 +80,27 @@ int run(String[] args) { File inFile = validateFile("input", inFileName); File outFile = validateFile("output", outFileName); BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); // TODO: close - try( - FileInputStream fileInputStream = new FileInputStream(inFile); - ArrowReader arrowReader = new ArrowReader(fileInputStream.getChannel(), allocator);) { + try (FileInputStream fileInputStream = new FileInputStream(inFile); + ArrowFileReader arrowReader = new ArrowFileReader(fileInputStream.getChannel(), allocator)) { - ArrowFooter footer = arrowReader.readFooter(); - Schema schema = footer.getSchema(); + VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); + Schema schema = root.getSchema(); LOGGER.debug("Input file size: " + inFile.length()); LOGGER.debug("Found schema: " + schema); - try ( - FileOutputStream fileOutputStream = new FileOutputStream(outFile); - ArrowWriter arrowWriter = new ArrowWriter(fileOutputStream.getChannel(), schema); - ) { - - // initialize vectors - - List recordBatches = footer.getRecordBatches(); - for (ArrowBlock rbBlock 
: recordBatches) { - try (ArrowRecordBatch inRecordBatch = arrowReader.readRecordBatch(rbBlock); - VectorSchemaRoot root = new VectorSchemaRoot(schema, allocator);) { - - VectorLoader vectorLoader = new VectorLoader(root); - vectorLoader.load(inRecordBatch); - - VectorUnloader vectorUnloader = new VectorUnloader(root); - ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch(); - arrowWriter.writeRecordBatch(recordBatch); + try (FileOutputStream fileOutputStream = new FileOutputStream(outFile); + ArrowFileWriter arrowWriter = new ArrowFileWriter(root, arrowReader, fileOutputStream.getChannel())) { + arrowWriter.start(); + while (true) { + arrowReader.loadNextBatch(); + int loaded = root.getRowCount(); + if (loaded == 0) { + break; + } else { + arrowWriter.writeBatch(); } } + arrowWriter.end(); } LOGGER.debug("Output file size: " + outFile.length()); } diff --git a/java/tools/src/main/java/org/apache/arrow/tools/FileToStream.java b/java/tools/src/main/java/org/apache/arrow/tools/FileToStream.java index ba6505cb48d08..d5345535d19dc 100644 --- a/java/tools/src/main/java/org/apache/arrow/tools/FileToStream.java +++ b/java/tools/src/main/java/org/apache/arrow/tools/FileToStream.java @@ -25,10 +25,8 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; -import org.apache.arrow.vector.file.ArrowBlock; -import org.apache.arrow.vector.file.ArrowFooter; -import org.apache.arrow.vector.file.ArrowReader; -import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.file.ArrowFileReader; import org.apache.arrow.vector.stream.ArrowStreamWriter; /** @@ -36,19 +34,20 @@ * first argument and the output is written to standard out. */ public class FileToStream { + public static void convert(FileInputStream in, OutputStream out) throws IOException { BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); - try( - ArrowReader reader = new ArrowReader(in.getChannel(), allocator);) { - ArrowFooter footer = reader.readFooter(); - try ( - ArrowStreamWriter writer = new ArrowStreamWriter(out, footer.getSchema()); - ) { - for (ArrowBlock block: footer.getRecordBatches()) { - try (ArrowRecordBatch batch = reader.readRecordBatch(block)) { - writer.writeRecordBatch(batch); - } + try (ArrowFileReader reader = new ArrowFileReader(in.getChannel(), allocator)) { + VectorSchemaRoot root = reader.getVectorSchemaRoot(); + // load the first batch before instantiating the writer so that we have any dictionaries + reader.loadNextBatch(); + try (ArrowStreamWriter writer = new ArrowStreamWriter(root, reader, out)) { + writer.start(); + while (root.getRowCount() > 0) { + writer.writeBatch(); + reader.loadNextBatch(); } + writer.end(); } } } diff --git a/java/tools/src/main/java/org/apache/arrow/tools/Integration.java b/java/tools/src/main/java/org/apache/arrow/tools/Integration.java index 36d4ee5485470..5d4849c234383 100644 --- a/java/tools/src/main/java/org/apache/arrow/tools/Integration.java +++ b/java/tools/src/main/java/org/apache/arrow/tools/Integration.java @@ -28,16 +28,12 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; -import org.apache.arrow.vector.VectorLoader; import org.apache.arrow.vector.VectorSchemaRoot; -import org.apache.arrow.vector.VectorUnloader; import org.apache.arrow.vector.file.ArrowBlock; -import org.apache.arrow.vector.file.ArrowFooter; -import org.apache.arrow.vector.file.ArrowReader; -import 
org.apache.arrow.vector.file.ArrowWriter; +import org.apache.arrow.vector.file.ArrowFileReader; +import org.apache.arrow.vector.file.ArrowFileWriter; import org.apache.arrow.vector.file.json.JsonFileReader; import org.apache.arrow.vector.file.json.JsonFileWriter; -import org.apache.arrow.vector.schema.ArrowRecordBatch; import org.apache.arrow.vector.types.pojo.Schema; import org.apache.arrow.vector.util.Validator; import org.apache.commons.cli.CommandLine; @@ -69,24 +65,18 @@ enum Command { ARROW_TO_JSON(true, false) { @Override public void execute(File arrowFile, File jsonFile) throws IOException { - try( - BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); + try(BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); FileInputStream fileInputStream = new FileInputStream(arrowFile); - ArrowReader arrowReader = new ArrowReader(fileInputStream.getChannel(), allocator);) { - ArrowFooter footer = arrowReader.readFooter(); - Schema schema = footer.getSchema(); + ArrowFileReader arrowReader = new ArrowFileReader(fileInputStream.getChannel(), allocator)) { + VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); + Schema schema = root.getSchema(); LOGGER.debug("Input file size: " + arrowFile.length()); LOGGER.debug("Found schema: " + schema); - try (JsonFileWriter writer = new JsonFileWriter(jsonFile, JsonFileWriter.config().pretty(true));) { + try (JsonFileWriter writer = new JsonFileWriter(jsonFile, JsonFileWriter.config().pretty(true))) { writer.start(schema); - List recordBatches = footer.getRecordBatches(); - for (ArrowBlock rbBlock : recordBatches) { - try (ArrowRecordBatch inRecordBatch = arrowReader.readRecordBatch(rbBlock); - VectorSchemaRoot root = new VectorSchemaRoot(schema, allocator);) { - VectorLoader vectorLoader = new VectorLoader(root); - vectorLoader.load(inRecordBatch); - writer.write(root); - } + for (ArrowBlock rbBlock : arrowReader.getRecordBlocks()) { + arrowReader.loadRecordBatch(rbBlock); + writer.write(root); } } LOGGER.debug("Output file size: " + jsonFile.length()); @@ -96,27 +86,22 @@ public void execute(File arrowFile, File jsonFile) throws IOException { JSON_TO_ARROW(false, true) { @Override public void execute(File arrowFile, File jsonFile) throws IOException { - try ( - BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); - JsonFileReader reader = new JsonFileReader(jsonFile, allocator); - ) { + try (BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); + JsonFileReader reader = new JsonFileReader(jsonFile, allocator)) { Schema schema = reader.start(); LOGGER.debug("Input file size: " + jsonFile.length()); LOGGER.debug("Found schema: " + schema); - try ( - FileOutputStream fileOutputStream = new FileOutputStream(arrowFile); - ArrowWriter arrowWriter = new ArrowWriter(fileOutputStream.getChannel(), schema); - ) { - - // initialize vectors - VectorSchemaRoot root; - while ((root = reader.read()) != null) { - VectorUnloader vectorUnloader = new VectorUnloader(root); - try (ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch();) { - arrowWriter.writeRecordBatch(recordBatch); - } - root.close(); + try (FileOutputStream fileOutputStream = new FileOutputStream(arrowFile); + VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator); + // TODO json dictionaries + ArrowFileWriter arrowWriter = new ArrowFileWriter(root, null, fileOutputStream.getChannel())) { + arrowWriter.start(); + reader.read(root); + while (root.getRowCount() != 0) { + arrowWriter.writeBatch(); + reader.read(root); } + 
arrowWriter.end(); } LOGGER.debug("Output file size: " + arrowFile.length()); } @@ -125,32 +110,26 @@ public void execute(File arrowFile, File jsonFile) throws IOException { VALIDATE(true, true) { @Override public void execute(File arrowFile, File jsonFile) throws IOException { - try ( - BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); - JsonFileReader jsonReader = new JsonFileReader(jsonFile, allocator); - FileInputStream fileInputStream = new FileInputStream(arrowFile); - ArrowReader arrowReader = new ArrowReader(fileInputStream.getChannel(), allocator); - ) { + try (BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); + JsonFileReader jsonReader = new JsonFileReader(jsonFile, allocator); + FileInputStream fileInputStream = new FileInputStream(arrowFile); + ArrowFileReader arrowReader = new ArrowFileReader(fileInputStream.getChannel(), allocator)) { Schema jsonSchema = jsonReader.start(); - ArrowFooter footer = arrowReader.readFooter(); - Schema arrowSchema = footer.getSchema(); + VectorSchemaRoot arrowRoot = arrowReader.getVectorSchemaRoot(); + Schema arrowSchema = arrowRoot.getSchema(); LOGGER.debug("Arrow Input file size: " + arrowFile.length()); LOGGER.debug("ARROW schema: " + arrowSchema); LOGGER.debug("JSON Input file size: " + jsonFile.length()); LOGGER.debug("JSON schema: " + jsonSchema); Validator.compareSchemas(jsonSchema, arrowSchema); - List recordBatches = footer.getRecordBatches(); + List recordBatches = arrowReader.getRecordBlocks(); Iterator iterator = recordBatches.iterator(); VectorSchemaRoot jsonRoot; while ((jsonRoot = jsonReader.read()) != null && iterator.hasNext()) { ArrowBlock rbBlock = iterator.next(); - try (ArrowRecordBatch inRecordBatch = arrowReader.readRecordBatch(rbBlock); - VectorSchemaRoot arrowRoot = new VectorSchemaRoot(arrowSchema, allocator);) { - VectorLoader vectorLoader = new VectorLoader(arrowRoot); - vectorLoader.load(inRecordBatch); - Validator.compareVectorSchemaRoot(arrowRoot, jsonRoot); - } + arrowReader.loadRecordBatch(rbBlock); + Validator.compareVectorSchemaRoot(arrowRoot, jsonRoot); jsonRoot.close(); } boolean hasMoreJSON = jsonRoot != null; diff --git a/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java b/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java index c8a5c8914afcc..3b79d5b05e116 100644 --- a/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java +++ b/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java @@ -27,8 +27,8 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; -import org.apache.arrow.vector.file.ArrowWriter; -import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.file.ArrowFileWriter; import org.apache.arrow.vector.stream.ArrowStreamReader; /** @@ -38,13 +38,16 @@ public class StreamToFile { public static void convert(InputStream in, OutputStream out) throws IOException { BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); try (ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) { - reader.init(); - try (ArrowWriter writer = new ArrowWriter(Channels.newChannel(out), reader.getSchema());) { - while (true) { - ArrowRecordBatch batch = reader.nextRecordBatch(); - if (batch == null) break; - writer.writeRecordBatch(batch); + VectorSchemaRoot root = reader.getVectorSchemaRoot(); + // load the first batch before instantiating the writer so that we have any dictionaries + 
reader.loadNextBatch(); + try (ArrowFileWriter writer = new ArrowFileWriter(root, reader, Channels.newChannel(out))) { + writer.start(); + while (root.getRowCount() > 0) { + writer.writeBatch(); + reader.loadNextBatch(); } + writer.end(); } } } diff --git a/java/tools/src/test/java/org/apache/arrow/tools/ArrowFileTestFixtures.java b/java/tools/src/test/java/org/apache/arrow/tools/ArrowFileTestFixtures.java index 4cfc52fe08631..f752f7eaa74b9 100644 --- a/java/tools/src/test/java/org/apache/arrow/tools/ArrowFileTestFixtures.java +++ b/java/tools/src/test/java/org/apache/arrow/tools/ArrowFileTestFixtures.java @@ -23,13 +23,10 @@ import java.io.FileNotFoundException; import java.io.FileOutputStream; import java.io.IOException; -import java.util.List; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.FieldVector; -import org.apache.arrow.vector.VectorLoader; import org.apache.arrow.vector.VectorSchemaRoot; -import org.apache.arrow.vector.VectorUnloader; import org.apache.arrow.vector.complex.MapVector; import org.apache.arrow.vector.complex.impl.ComplexWriterImpl; import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter; @@ -37,10 +34,8 @@ import org.apache.arrow.vector.complex.writer.BigIntWriter; import org.apache.arrow.vector.complex.writer.IntWriter; import org.apache.arrow.vector.file.ArrowBlock; -import org.apache.arrow.vector.file.ArrowFooter; -import org.apache.arrow.vector.file.ArrowReader; -import org.apache.arrow.vector.file.ArrowWriter; -import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.file.ArrowFileReader; +import org.apache.arrow.vector.file.ArrowFileWriter; import org.apache.arrow.vector.types.pojo.Schema; import org.junit.Assert; @@ -63,26 +58,14 @@ static void writeData(int count, MapVector parent) { static void validateOutput(File testOutFile, BufferAllocator allocator) throws Exception { // read - try ( - BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); - FileInputStream fileInputStream = new FileInputStream(testOutFile); - ArrowReader arrowReader = new ArrowReader(fileInputStream.getChannel(), readerAllocator); - BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); - ) { - ArrowFooter footer = arrowReader.readFooter(); - Schema schema = footer.getSchema(); - - // initialize vectors - try (VectorSchemaRoot root = new VectorSchemaRoot(schema, readerAllocator)) { - VectorLoader vectorLoader = new VectorLoader(root); - - List recordBatches = footer.getRecordBatches(); - for (ArrowBlock rbBlock : recordBatches) { - try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { - vectorLoader.load(recordBatch); - } - validateContent(COUNT, root); - } + try (BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + FileInputStream fileInputStream = new FileInputStream(testOutFile); + ArrowFileReader arrowReader = new ArrowFileReader(fileInputStream.getChannel(), readerAllocator)) { + VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); + Schema schema = root.getSchema(); + for (ArrowBlock rbBlock : arrowReader.getRecordBlocks()) { + arrowReader.loadRecordBatch(rbBlock); + validateContent(COUNT, root); } } } @@ -96,16 +79,10 @@ static void validateContent(int count, VectorSchemaRoot root) { } static void write(FieldVector parent, File file) throws FileNotFoundException, IOException { - Schema schema = new Schema(parent.getField().getChildren()); - int 
valueCount = parent.getAccessor().getValueCount(); - List fields = parent.getChildrenFromFields(); - VectorUnloader vectorUnloader = new VectorUnloader(schema, valueCount, fields); - try ( - FileOutputStream fileOutputStream = new FileOutputStream(file); - ArrowWriter arrowWriter = new ArrowWriter(fileOutputStream.getChannel(), schema); - ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch(); - ) { - arrowWriter.writeRecordBatch(recordBatch); + VectorSchemaRoot root = new VectorSchemaRoot(parent); + try (FileOutputStream fileOutputStream = new FileOutputStream(file); + ArrowFileWriter arrowWriter = new ArrowFileWriter(root, null, fileOutputStream.getChannel())) { + arrowWriter.writeBatch(); } } diff --git a/java/tools/src/test/java/org/apache/arrow/tools/EchoServerTest.java b/java/tools/src/test/java/org/apache/arrow/tools/EchoServerTest.java index 48d6162f423a3..706f8e2ca4d36 100644 --- a/java/tools/src/test/java/org/apache/arrow/tools/EchoServerTest.java +++ b/java/tools/src/test/java/org/apache/arrow/tools/EchoServerTest.java @@ -24,106 +24,268 @@ import java.io.IOException; import java.net.Socket; import java.net.UnknownHostException; -import java.util.ArrayList; +import java.nio.charset.StandardCharsets; +import java.util.Arrays; import java.util.Collections; import java.util.List; +import com.google.common.collect.ImmutableList; + import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; -import org.apache.arrow.vector.schema.ArrowFieldNode; -import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.NullableIntVector; +import org.apache.arrow.vector.NullableTinyIntVector; +import org.apache.arrow.vector.NullableVarCharVector; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.complex.ListVector; +import org.apache.arrow.vector.complex.impl.UnionListWriter; +import org.apache.arrow.vector.dictionary.Dictionary; +import org.apache.arrow.vector.dictionary.DictionaryProvider; +import org.apache.arrow.vector.dictionary.DictionaryProvider.MapDictionaryProvider; import org.apache.arrow.vector.stream.ArrowStreamReader; import org.apache.arrow.vector.stream.ArrowStreamWriter; +import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.ArrowType.Int; +import org.apache.arrow.vector.types.pojo.DictionaryEncoding; import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.types.pojo.Schema; +import org.apache.arrow.vector.util.Text; +import org.junit.AfterClass; +import org.junit.Assert; +import org.junit.BeforeClass; import org.junit.Test; -import io.netty.buffer.ArrowBuf; - public class EchoServerTest { - public static ArrowBuf buf(BufferAllocator alloc, byte[] bytes) { - ArrowBuf buffer = alloc.buffer(bytes.length); - buffer.writeBytes(bytes); - return buffer; + + private static EchoServer server; + private static int serverPort; + private static Thread serverThread; + + @BeforeClass + public static void startEchoServer() throws IOException { + server = new EchoServer(0); + serverPort = server.port(); + serverThread = new Thread() { + @Override + public void run() { + try { + server.run(); + } catch (IOException e) { + e.printStackTrace(); + } + } + }; + serverThread.start(); } - public static byte[] array(ArrowBuf buf) { - byte[] bytes = new byte[buf.readableBytes()]; - buf.readBytes(bytes); - return bytes; + @AfterClass + public 
static void stopEchoServer() throws IOException, InterruptedException { + server.close(); + serverThread.join(); } - private void testEchoServer(int serverPort, Schema schema, List batches) + private void testEchoServer(int serverPort, + Field field, + NullableTinyIntVector vector, + int batches) throws UnknownHostException, IOException { BufferAllocator alloc = new RootAllocator(Long.MAX_VALUE); + VectorSchemaRoot root = new VectorSchemaRoot(asList(field), asList((FieldVector) vector), 0); try (Socket socket = new Socket("localhost", serverPort); - ArrowStreamWriter writer = new ArrowStreamWriter(socket.getOutputStream(), schema); + ArrowStreamWriter writer = new ArrowStreamWriter(root, null, socket.getOutputStream()); ArrowStreamReader reader = new ArrowStreamReader(socket.getInputStream(), alloc)) { - for (ArrowRecordBatch batch: batches) { - writer.writeRecordBatch(batch); + writer.start(); + for (int i = 0; i < batches; i++) { + vector.allocateNew(16); + for (int j = 0; j < 8; j++) { + vector.getMutator().set(j, j + i); + vector.getMutator().set(j + 8, 0, (byte) (j + i)); + } + vector.getMutator().setValueCount(16); + root.setRowCount(16); + writer.writeBatch(); } writer.end(); - reader.init(); - assertEquals(schema, reader.getSchema()); - for (int i = 0; i < batches.size(); i++) { - ArrowRecordBatch result = reader.nextRecordBatch(); - ArrowRecordBatch expected = batches.get(i); - assertTrue(result != null); - assertEquals(expected.getBuffers().size(), result.getBuffers().size()); - for (int j = 0; j < expected.getBuffers().size(); j++) { - assertTrue(expected.getBuffers().get(j).compareTo(result.getBuffers().get(j)) == 0); + assertEquals(new Schema(asList(field)), reader.getVectorSchemaRoot().getSchema()); + + NullableTinyIntVector readVector = (NullableTinyIntVector) reader.getVectorSchemaRoot().getFieldVectors().get(0); + for (int i = 0; i < batches; i++) { + reader.loadNextBatch(); + assertEquals(16, reader.getVectorSchemaRoot().getRowCount()); + assertEquals(16, readVector.getAccessor().getValueCount()); + for (int j = 0; j < 8; j++) { + assertEquals(j + i, readVector.getAccessor().get(j)); + assertTrue(readVector.getAccessor().isNull(j + 8)); } } - ArrowRecordBatch result = reader.nextRecordBatch(); - assertTrue(result == null); + reader.loadNextBatch(); + assertEquals(0, reader.getVectorSchemaRoot().getRowCount()); assertEquals(reader.bytesRead(), writer.bytesWritten()); } } @Test public void basicTest() throws InterruptedException, IOException { - final EchoServer server = new EchoServer(0); - int serverPort = server.port(); - Thread serverThread = new Thread() { - @Override - public void run() { - try { - server.run(); - } catch (IOException e) { - e.printStackTrace(); - } - } - }; - serverThread.start(); - BufferAllocator alloc = new RootAllocator(Long.MAX_VALUE); - byte[] validity = new byte[] { (byte)255, 0}; - byte[] values = new byte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}; - ArrowBuf validityb = buf(alloc, validity); - ArrowBuf valuesb = buf(alloc, values); - ArrowRecordBatch batch = new ArrowRecordBatch( - 16, asList(new ArrowFieldNode(16, 8)), asList(validityb, valuesb)); - Schema schema = new Schema(asList(new Field( - "testField", true, new ArrowType.Int(8, true), Collections.emptyList()))); + Field field = new Field("testField", true, new ArrowType.Int(8, true), Collections.emptyList()); + NullableTinyIntVector vector = new NullableTinyIntVector("testField", alloc, null); + Schema schema = new Schema(asList(field)); // Try an empty stream, just 
the header. - testEchoServer(serverPort, schema, new ArrayList()); + testEchoServer(serverPort, field, vector, 0); // Try with one batch. - List batches = new ArrayList<>(); - batches.add(batch); - testEchoServer(serverPort, schema, batches); + testEchoServer(serverPort, field, vector, 1); // Try with a few - for (int i = 0; i < 10; i++) { - batches.add(batch); + testEchoServer(serverPort, field, vector, 10); + } + + @Test + public void testFlatDictionary() throws IOException { + DictionaryEncoding writeEncoding = new DictionaryEncoding(1L, false, null); + try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE); + NullableIntVector writeVector = new NullableIntVector("varchar", allocator, writeEncoding); + NullableVarCharVector writeDictionaryVector = new NullableVarCharVector("dict", allocator, null)) { + writeVector.allocateNewSafe(); + NullableIntVector.Mutator mutator = writeVector.getMutator(); + mutator.set(0, 0); + mutator.set(1, 1); + mutator.set(3, 2); + mutator.set(4, 1); + mutator.set(5, 2); + mutator.setValueCount(6); + + writeDictionaryVector.allocateNewSafe(); + NullableVarCharVector.Mutator dictionaryMutator = writeDictionaryVector.getMutator(); + dictionaryMutator.set(0, "foo".getBytes(StandardCharsets.UTF_8)); + dictionaryMutator.set(1, "bar".getBytes(StandardCharsets.UTF_8)); + dictionaryMutator.set(2, "baz".getBytes(StandardCharsets.UTF_8)); + dictionaryMutator.setValueCount(3); + + List fields = ImmutableList.of(writeVector.getField()); + List vectors = ImmutableList.of((FieldVector) writeVector); + VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, 6); + + DictionaryProvider writeProvider = new MapDictionaryProvider(new Dictionary(writeDictionaryVector, writeEncoding)); + + try (Socket socket = new Socket("localhost", serverPort); + ArrowStreamWriter writer = new ArrowStreamWriter(root, writeProvider, socket.getOutputStream()); + ArrowStreamReader reader = new ArrowStreamReader(socket.getInputStream(), allocator)) { + writer.start(); + writer.writeBatch(); + writer.end(); + + reader.loadNextBatch(); + VectorSchemaRoot readerRoot = reader.getVectorSchemaRoot(); + Assert.assertEquals(6, readerRoot.getRowCount()); + + FieldVector readVector = readerRoot.getFieldVectors().get(0); + Assert.assertNotNull(readVector); + + DictionaryEncoding readEncoding = readVector.getField().getDictionary(); + Assert.assertNotNull(readEncoding); + Assert.assertEquals(1L, readEncoding.getId()); + + FieldVector.Accessor accessor = readVector.getAccessor(); + Assert.assertEquals(6, accessor.getValueCount()); + Assert.assertEquals(0, accessor.getObject(0)); + Assert.assertEquals(1, accessor.getObject(1)); + Assert.assertEquals(null, accessor.getObject(2)); + Assert.assertEquals(2, accessor.getObject(3)); + Assert.assertEquals(1, accessor.getObject(4)); + Assert.assertEquals(2, accessor.getObject(5)); + + Dictionary dictionary = reader.lookup(1L); + Assert.assertNotNull(dictionary); + NullableVarCharVector.Accessor dictionaryAccessor = ((NullableVarCharVector) dictionary.getVector()).getAccessor(); + Assert.assertEquals(3, dictionaryAccessor.getValueCount()); + Assert.assertEquals(new Text("foo"), dictionaryAccessor.getObject(0)); + Assert.assertEquals(new Text("bar"), dictionaryAccessor.getObject(1)); + Assert.assertEquals(new Text("baz"), dictionaryAccessor.getObject(2)); + } } - testEchoServer(serverPort, schema, batches); + } - server.close(); - serverThread.join(); + @Test + public void testNestedDictionary() throws IOException { + DictionaryEncoding writeEncoding = 
new DictionaryEncoding(2L, false, null); + try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE); + NullableVarCharVector writeDictionaryVector = new NullableVarCharVector("dictionary", allocator, null); + ListVector writeVector = new ListVector("list", allocator, null, null)) { + + // data being written: + // [['foo', 'bar'], ['foo'], ['bar']] -> [[0, 1], [0], [1]] + + writeDictionaryVector.allocateNew(); + writeDictionaryVector.getMutator().set(0, "foo".getBytes(StandardCharsets.UTF_8)); + writeDictionaryVector.getMutator().set(1, "bar".getBytes(StandardCharsets.UTF_8)); + writeDictionaryVector.getMutator().setValueCount(2); + + writeVector.addOrGetVector(MinorType.INT, writeEncoding); + writeVector.allocateNew(); + UnionListWriter listWriter = new UnionListWriter(writeVector); + listWriter.startList(); + listWriter.writeInt(0); + listWriter.writeInt(1); + listWriter.endList(); + listWriter.startList(); + listWriter.writeInt(0); + listWriter.endList(); + listWriter.startList(); + listWriter.writeInt(1); + listWriter.endList(); + listWriter.setValueCount(3); + + List fields = ImmutableList.of(writeVector.getField()); + List vectors = ImmutableList.of((FieldVector) writeVector); + VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, 3); + + DictionaryProvider writeProvider = new MapDictionaryProvider(new Dictionary(writeDictionaryVector, writeEncoding)); + + try (Socket socket = new Socket("localhost", serverPort); + ArrowStreamWriter writer = new ArrowStreamWriter(root, writeProvider, socket.getOutputStream()); + ArrowStreamReader reader = new ArrowStreamReader(socket.getInputStream(), allocator)) { + writer.start(); + writer.writeBatch(); + writer.end(); + + reader.loadNextBatch(); + VectorSchemaRoot readerRoot = reader.getVectorSchemaRoot(); + Assert.assertEquals(3, readerRoot.getRowCount()); + + ListVector readVector = (ListVector) readerRoot.getFieldVectors().get(0); + Assert.assertNotNull(readVector); + + Assert.assertNull(readVector.getField().getDictionary()); + DictionaryEncoding readEncoding = readVector.getField().getChildren().get(0).getDictionary(); + Assert.assertNotNull(readEncoding); + Assert.assertEquals(2L, readEncoding.getId()); + + Field nestedField = readVector.getField().getChildren().get(0); + + DictionaryEncoding encoding = nestedField.getDictionary(); + Assert.assertNotNull(encoding); + Assert.assertEquals(2L, encoding.getId()); + Assert.assertEquals(new Int(32, true), encoding.getIndexType()); + + ListVector.Accessor accessor = readVector.getAccessor(); + Assert.assertEquals(3, accessor.getValueCount()); + Assert.assertEquals(Arrays.asList(0, 1), accessor.getObject(0)); + Assert.assertEquals(Arrays.asList(0), accessor.getObject(1)); + Assert.assertEquals(Arrays.asList(1), accessor.getObject(2)); + + Dictionary readDictionary = reader.lookup(2L); + Assert.assertNotNull(readDictionary); + NullableVarCharVector.Accessor dictionaryAccessor = ((NullableVarCharVector) readDictionary.getVector()).getAccessor(); + Assert.assertEquals(2, dictionaryAccessor.getValueCount()); + Assert.assertEquals(new Text("foo"), dictionaryAccessor.getObject(0)); + Assert.assertEquals(new Text("bar"), dictionaryAccessor.getObject(1)); + } + } } } diff --git a/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java b/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java index 0ae32bebe0b30..9d4ef5c26505b 100644 --- a/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java +++ 
b/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java @@ -33,6 +33,11 @@ import java.io.StringReader; import java.util.Map; +import com.fasterxml.jackson.core.util.DefaultPrettyPrinter; +import com.fasterxml.jackson.core.util.DefaultPrettyPrinter.NopIndenter; +import com.fasterxml.jackson.databind.ObjectMapper; +import com.fasterxml.jackson.databind.SerializationFeature; + import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; import org.apache.arrow.tools.Integration.Command; @@ -49,11 +54,6 @@ import org.junit.Test; import org.junit.rules.TemporaryFolder; -import com.fasterxml.jackson.core.util.DefaultPrettyPrinter; -import com.fasterxml.jackson.core.util.DefaultPrettyPrinter.NopIndenter; -import com.fasterxml.jackson.databind.ObjectMapper; -import com.fasterxml.jackson.databind.SerializationFeature; - public class TestIntegration { @Rule @@ -128,6 +128,34 @@ public void testJSONRoundTripWithVariableWidth() throws Exception { } } + @Test + public void testJSONRoundTripWithStruct() throws Exception { + File testJSONFile = new File("../../integration/data/struct_example.json"); + File testOutFile = testFolder.newFile("testOutStruct.arrow"); + File testRoundTripJSONFile = testFolder.newFile("testOutStruct.json"); + testOutFile.delete(); + testRoundTripJSONFile.delete(); + + Integration integration = new Integration(); + + // convert to arrow + String[] args1 = { "-arrow", testOutFile.getAbsolutePath(), "-json", testJSONFile.getAbsolutePath(), "-command", Command.JSON_TO_ARROW.name()}; + integration.run(args1); + + // convert back to json + String[] args2 = { "-arrow", testOutFile.getAbsolutePath(), "-json", testRoundTripJSONFile.getAbsolutePath(), "-command", Command.ARROW_TO_JSON.name()}; + integration.run(args2); + + BufferedReader orig = readNormalized(testJSONFile); + BufferedReader rt = readNormalized(testRoundTripJSONFile); + String i, o; + int j = 0; + while ((i = orig.readLine()) != null && (o = rt.readLine()) != null) { + assertEquals("line: " + j, i, o); + ++j; + } + } + private ObjectMapper om = new ObjectMapper(); { DefaultPrettyPrinter prettyPrinter = new DefaultPrettyPrinter(); diff --git a/java/tools/tmptestfilesio b/java/tools/tmptestfilesio new file mode 100644 index 0000000000000000000000000000000000000000..d1b6b6cdb93878637bff514fbacc2b0054dd5f4d GIT binary patch literal 628 zcmZ{hJx;?w5QU$UB`gsjgpz{BmY7e6_I-)K^zOlJyU4|WJ7$N6$-zP8=-uI=>U^QdjD zb6weIOXOi< zPk [remaining base85 binary patch data truncated; the diff header for the next file, the MapWriters.java codegen template, was lost here] <#list type.minor as minor> @@ -113,7 +113,7 @@ public MapWriter map(String name) { FieldWriter writer = fields.get(finalName); if(writer == null){ int vectorCount=container.size(); - NullableMapVector vector = container.addOrGet(name, MinorType.MAP, NullableMapVector.class); + NullableMapVector vector = container.addOrGet(name, MinorType.MAP, NullableMapVector.class, null); writer = new PromotableWriter(vector, container, getNullableMapWriterFactory()); if(vectorCount != container.size()) { writer.allocate(); @@ -157,7 +157,7 @@ public ListWriter list(String name) { FieldWriter writer = fields.get(finalName); int vectorCount = container.size(); if(writer == null) { - writer = new PromotableWriter(container.addOrGet(name, MinorType.LIST, ListVector.class), container, getNullableMapWriterFactory()); + writer = new PromotableWriter(container.addOrGet(name, MinorType.LIST, ListVector.class, null), container, getNullableMapWriterFactory()); if (container.size() > vectorCount) { writer.allocate(); } @@ -222,7 +222,7 @@ public void end() { if(writer == null) {
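// Editor's sketch (not part of the patch): the extra `null` argument threaded
// through these codegen templates is the new DictionaryEncoding parameter on
// AbstractContainerVector.addOrGet(...). A caller that wants a dictionary-encoded
// child passes a real encoding instead; `container` and the dictionary id 1L are
// assumed here for illustration:
DictionaryEncoding encoding = new DictionaryEncoding(1L, false, null);
NullableIntVector indices =
    container.addOrGet("indices", MinorType.INT, NullableIntVector.class, encoding);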
ValueVector vector; ValueVector currentVector = container.getChild(name); - ${vectName}Vector v = container.addOrGet(name, MinorType.${upperName}, ${vectName}Vector.class<#if minor.class == "Decimal"> , new int[] {precision, scale}); + ${vectName}Vector v = container.addOrGet(name, MinorType.${upperName}, ${vectName}Vector.class, null<#if minor.class == "Decimal"> , new int[] {precision, scale}); writer = new PromotableWriter(v, container, getNullableMapWriterFactory()); vector = v; if (currentVector == null || currentVector != vector) { diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index 6b25fb36b40c0..b3e10e3fa87a2 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -65,21 +65,21 @@ public final class ${className} extends BaseDataValueVector implements <#if type private final int precision; private final int scale; - public ${className}(String name, BufferAllocator allocator, int precision, int scale) { + public ${className}(String name, BufferAllocator allocator, DictionaryEncoding dictionary, int precision, int scale) { super(name, allocator); values = new ${valuesName}(valuesField, allocator, precision, scale); this.precision = precision; this.scale = scale; mutator = new Mutator(); accessor = new Accessor(); - field = new Field(name, true, new Decimal(precision, scale), null); + field = new Field(name, true, new Decimal(precision, scale), dictionary, null); innerVectors = Collections.unmodifiableList(Arrays.asList( bits, values )); } <#else> - public ${className}(String name, BufferAllocator allocator) { + public ${className}(String name, BufferAllocator allocator, DictionaryEncoding dictionary) { super(name, allocator); values = new ${valuesName}(valuesField, allocator); mutator = new Mutator(); @@ -88,38 +88,38 @@ public final class ${className} extends BaseDataValueVector implements <#if type minor.class == "SmallInt" || minor.class == "Int" || minor.class == "BigInt"> - field = new Field(name, true, new Int(${type.width} * 8, true), null); + field = new Field(name, true, new Int(${type.width} * 8, true), dictionary, null); <#elseif minor.class == "UInt1" || minor.class == "UInt2" || minor.class == "UInt4" || minor.class == "UInt8"> - field = new Field(name, true, new Int(${type.width} * 8, false), null); + field = new Field(name, true, new Int(${type.width} * 8, false), dictionary, null); <#elseif minor.class == "Date"> - field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Date(), null); + field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Date(), dictionary, null); <#elseif minor.class == "Time"> - field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Time(), null); + field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Time(), dictionary, null); <#elseif minor.class == "Float4"> - field = new Field(name, true, new FloatingPoint(org.apache.arrow.vector.types.FloatingPointPrecision.SINGLE), null); + field = new Field(name, true, new FloatingPoint(org.apache.arrow.vector.types.FloatingPointPrecision.SINGLE), dictionary, null); <#elseif minor.class == "Float8"> - field = new Field(name, true, new FloatingPoint(org.apache.arrow.vector.types.FloatingPointPrecision.DOUBLE), null); + field = new Field(name, true, new 
FloatingPoint(org.apache.arrow.vector.types.FloatingPointPrecision.DOUBLE), dictionary, null); <#elseif minor.class == "TimeStampSec"> - field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(org.apache.arrow.vector.types.TimeUnit.SECOND), null); + field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(org.apache.arrow.vector.types.TimeUnit.SECOND), dictionary, null); <#elseif minor.class == "TimeStampMilli"> - field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(org.apache.arrow.vector.types.TimeUnit.MILLISECOND), null); + field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(org.apache.arrow.vector.types.TimeUnit.MILLISECOND), dictionary, null); <#elseif minor.class == "TimeStampMicro"> - field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(org.apache.arrow.vector.types.TimeUnit.MICROSECOND), null); + field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(org.apache.arrow.vector.types.TimeUnit.MICROSECOND), dictionary, null); <#elseif minor.class == "TimeStampNano"> - field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(org.apache.arrow.vector.types.TimeUnit.NANOSECOND), null); + field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(org.apache.arrow.vector.types.TimeUnit.NANOSECOND), dictionary, null); <#elseif minor.class == "IntervalDay"> - field = new Field(name, true, new Interval(org.apache.arrow.vector.types.IntervalUnit.DAY_TIME), null); + field = new Field(name, true, new Interval(org.apache.arrow.vector.types.IntervalUnit.DAY_TIME), dictionary, null); <#elseif minor.class == "IntervalYear"> - field = new Field(name, true, new Interval(org.apache.arrow.vector.types.IntervalUnit.YEAR_MONTH), null); + field = new Field(name, true, new Interval(org.apache.arrow.vector.types.IntervalUnit.YEAR_MONTH), dictionary, null); <#elseif minor.class == "VarChar"> - field = new Field(name, true, new Utf8(), null); + field = new Field(name, true, new Utf8(), dictionary, null); <#elseif minor.class == "VarBinary"> - field = new Field(name, true, new Binary(), null); + field = new Field(name, true, new Binary(), dictionary, null); <#elseif minor.class == "Bit"> - field = new Field(name, true, new Bool(), null); + field = new Field(name, true, new Bool(), dictionary, null); innerVectors = Collections.unmodifiableList(Arrays.asList( bits, @@ -378,9 +378,9 @@ private class TransferImpl implements TransferPair { public TransferImpl(String name, BufferAllocator allocator){ <#if minor.class == "Decimal"> - to = new ${className}(name, allocator, precision, scale); + to = new ${className}(name, allocator, field.getDictionary(), precision, scale); <#else> - to = new ${className}(name, allocator); + to = new ${className}(name, allocator, field.getDictionary()); } diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java index 1a6908df2c40d..076ed93999623 100644 --- a/java/vector/src/main/codegen/templates/UnionVector.java +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -118,11 +118,11 @@ public List getFieldBuffers() { public List getFieldInnerVectors() { return this.innerVectors; } - + public NullableMapVector getMap() { if (mapVector == null) { int vectorCount = internalMap.size(); - mapVector = internalMap.addOrGet("map", MinorType.MAP, 
NullableMapVector.class); + mapVector = internalMap.addOrGet("map", MinorType.MAP, NullableMapVector.class, null); if (internalMap.size() > vectorCount) { mapVector.allocateNew(); if (callBack != null) { @@ -144,7 +144,7 @@ public NullableMapVector getMap() { public Nullable${name}Vector get${name}Vector() { if (${uncappedName}Vector == null) { int vectorCount = internalMap.size(); - ${uncappedName}Vector = internalMap.addOrGet("${lowerCaseName}", MinorType.${name?upper_case}, Nullable${name}Vector.class); + ${uncappedName}Vector = internalMap.addOrGet("${lowerCaseName}", MinorType.${name?upper_case}, Nullable${name}Vector.class, null); if (internalMap.size() > vectorCount) { ${uncappedName}Vector.allocateNew(); if (callBack != null) { @@ -162,7 +162,7 @@ public NullableMapVector getMap() { public ListVector getList() { if (listVector == null) { int vectorCount = internalMap.size(); - listVector = internalMap.addOrGet("list", MinorType.LIST, ListVector.class); + listVector = internalMap.addOrGet("list", MinorType.LIST, ListVector.class, null); if (internalMap.size() > vectorCount) { listVector.allocateNew(); if (callBack != null) { @@ -262,7 +262,7 @@ public void copyFromSafe(int inIndex, int outIndex, UnionVector from) { public FieldVector addVector(FieldVector v) { String name = v.getMinorType().name().toLowerCase(); Preconditions.checkState(internalMap.getChild(name) == null, String.format("%s vector already exists", name)); - final FieldVector newVector = internalMap.addOrGet(name, v.getMinorType(), v.getClass()); + final FieldVector newVector = internalMap.addOrGet(name, v.getMinorType(), v.getClass(), v.getField().getDictionary()); v.makeTransferPair(newVector).transfer(); internalMap.putChild(name, newVector); if (callBack != null) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java index d1e9abe5dd111..179f2ee879f43 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java @@ -81,6 +81,7 @@ public void load(ArrowFieldNode fieldNode, ArrowBuf data) { } else { super.load(fieldNode, data); } + this.valueCount = fieldNode.getLength(); } @Override @@ -451,7 +452,6 @@ public final void setToOne(int index) { /** * set count bits to 1 in data starting at firstBitIndex - * @param data the buffer to set * @param firstBitIndex the index of the first bit to set * @param count the number of bits to set */ diff --git a/java/vector/src/main/java/org/apache/arrow/vector/FieldVector.java b/java/vector/src/main/java/org/apache/arrow/vector/FieldVector.java index b28433cfd0d94..0fdbc48552aaa 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/FieldVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/FieldVector.java @@ -19,11 +19,10 @@ import java.util.List; +import io.netty.buffer.ArrowBuf; import org.apache.arrow.vector.schema.ArrowFieldNode; import org.apache.arrow.vector.types.pojo.Field; -import io.netty.buffer.ArrowBuf; - /** * A vector corresponding to a Field in the schema * It has inner vectors backed by buffers (validity, offsets, data, ...) 
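// Editor's sketch (not part of the patch) of the loader/unloader pairing whose
// diffs follow: VectorUnloader flattens the vectors behind a VectorSchemaRoot
// into an ArrowRecordBatch of field nodes plus ArrowBufs, and VectorLoader is
// the inverse. `sourceRoot` and `targetRoot` are assumed to share one schema:
try (ArrowRecordBatch batch = new VectorUnloader(sourceRoot).getRecordBatch()) {
  new VectorLoader(targetRoot).load(batch);
}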
@@ -61,5 +60,4 @@ public interface FieldVector extends ValueVector { * @return the inner vectors for this field as defined by the TypeLayout */ List getFieldInnerVectors(); - } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java b/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java index 5c1176cf95d26..76de250e0e972 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java @@ -36,15 +36,14 @@ * Loads buffers into vectors */ public class VectorLoader { + private final VectorSchemaRoot root; /** * will create children in root based on schema - * @param schema the expected schema * @param root the root to add vectors to based on schema */ public VectorLoader(VectorSchemaRoot root) { - super(); this.root = root; } @@ -57,18 +56,16 @@ public void load(ArrowRecordBatch recordBatch) { Iterator buffers = recordBatch.getBuffers().iterator(); Iterator nodes = recordBatch.getNodes().iterator(); List fields = root.getSchema().getFields(); - for (int i = 0; i < fields.size(); ++i) { - Field field = fields.get(i); + for (Field field: fields) { FieldVector fieldVector = root.getVector(field.getName()); loadBuffers(fieldVector, field, buffers, nodes); } root.setRowCount(recordBatch.getLength()); if (nodes.hasNext() || buffers.hasNext()) { - throw new IllegalArgumentException("not all nodes and buffers where consumed. nodes: " + Iterators.toString(nodes) + " buffers: " + Iterators.toString(buffers)); + throw new IllegalArgumentException("not all nodes and buffers were consumed. nodes: " + Iterators.toString(nodes) + " buffers: " + Iterators.toString(buffers)); } } - private void loadBuffers(FieldVector vector, Field field, Iterator buffers, Iterator nodes) { checkArgument(nodes.hasNext(), "no more field nodes for for field " + field + " and vector " + vector); @@ -82,7 +79,7 @@ private void loadBuffers(FieldVector vector, Field field, Iterator buf vector.loadFieldBuffers(fieldNode, ownBuffers); } catch (RuntimeException e) { throw new IllegalArgumentException("Could not load buffers for field " + - field + ". error message: " + e.getMessage(), e); + field + ". 
error message: " + e.getMessage(), e); } List children = field.getChildren(); if (children.size() > 0) { @@ -96,4 +93,4 @@ private void loadBuffers(FieldVector vector, Field field, Iterator buf } } -} +} \ No newline at end of file diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VectorSchemaRoot.java b/java/vector/src/main/java/org/apache/arrow/vector/VectorSchemaRoot.java index 1cbe18787ef45..7e626fb14305e 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/VectorSchemaRoot.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/VectorSchemaRoot.java @@ -18,7 +18,6 @@ package org.apache.arrow.vector; import java.util.ArrayList; -import java.util.Collections; import java.util.HashMap; import java.util.List; import java.util.Map; @@ -29,6 +28,9 @@ import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.types.pojo.Schema; +/** + * Holder for a set of vectors to be loaded/unloaded + */ public class VectorSchemaRoot implements AutoCloseable { private final Schema schema; @@ -37,9 +39,17 @@ public class VectorSchemaRoot implements AutoCloseable { private final Map fieldVectorsMap = new HashMap<>(); public VectorSchemaRoot(FieldVector parent) { - this.schema = new Schema(parent.getField().getChildren()); - this.rowCount = parent.getAccessor().getValueCount(); - this.fieldVectors = parent.getChildrenFromFields(); + this(parent.getField().getChildren(), parent.getChildrenFromFields(), parent.getAccessor().getValueCount()); + } + + public VectorSchemaRoot(List fields, List fieldVectors, int rowCount) { + if (fields.size() != fieldVectors.size()) { + throw new IllegalArgumentException("Fields must match field vectors. Found " + + fieldVectors.size() + " vectors and " + fields.size() + " fields"); + } + this.schema = new Schema(fields); + this.rowCount = rowCount; + this.fieldVectors = fieldVectors; for (int i = 0; i < schema.getFields().size(); ++i) { Field field = schema.getFields().get(i); FieldVector vector = fieldVectors.get(i); @@ -47,21 +57,19 @@ public VectorSchemaRoot(FieldVector parent) { } } - public VectorSchemaRoot(Schema schema, BufferAllocator allocator) { - super(); - this.schema = schema; + public static VectorSchemaRoot create(Schema schema, BufferAllocator allocator) { List fieldVectors = new ArrayList<>(); for (Field field : schema.getFields()) { MinorType minorType = Types.getMinorTypeForArrowType(field.getType()); - FieldVector vector = minorType.getNewVector(field.getName(), allocator, null); + FieldVector vector = minorType.getNewVector(field.getName(), allocator, field.getDictionary(), null); vector.initializeChildrenFromFields(field.getChildren()); fieldVectors.add(vector); - fieldVectorsMap.put(field.getName(), vector); } - this.fieldVectors = Collections.unmodifiableList(fieldVectors); - if (this.fieldVectors.size() != schema.getFields().size()) { - throw new IllegalArgumentException("The root vector did not create the right number of children. found " + fieldVectors.size() + " expected " + schema.getFields().size()); + if (fieldVectors.size() != schema.getFields().size()) { + throw new IllegalArgumentException("The root vector did not create the right number of children. 
found " + + fieldVectors.size() + " expected " + schema.getFields().size()); } + return new VectorSchemaRoot(schema.getFields(), fieldVectors, 0); } public List getFieldVectors() { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java b/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java index 92d8cb045ae31..8e9ff6d462c5c 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/VectorUnloader.java @@ -20,42 +20,27 @@ import java.util.ArrayList; import java.util.List; +import io.netty.buffer.ArrowBuf; import org.apache.arrow.vector.ValueVector.Accessor; import org.apache.arrow.vector.schema.ArrowFieldNode; import org.apache.arrow.vector.schema.ArrowRecordBatch; import org.apache.arrow.vector.schema.ArrowVectorType; -import org.apache.arrow.vector.types.pojo.Schema; - -import io.netty.buffer.ArrowBuf; public class VectorUnloader { - private final Schema schema; - private final int valueCount; - private final List vectors; - - public VectorUnloader(Schema schema, int valueCount, List vectors) { - super(); - this.schema = schema; - this.valueCount = valueCount; - this.vectors = vectors; - } + private final VectorSchemaRoot root; public VectorUnloader(VectorSchemaRoot root) { - this(root.getSchema(), root.getRowCount(), root.getFieldVectors()); - } - - public Schema getSchema() { - return schema; + this.root = root; } public ArrowRecordBatch getRecordBatch() { List nodes = new ArrayList<>(); List buffers = new ArrayList<>(); - for (FieldVector vector : vectors) { + for (FieldVector vector : root.getFieldVectors()) { appendNodes(vector, nodes, buffers); } - return new ArrowRecordBatch(valueCount, nodes, buffers); + return new ArrowRecordBatch(root.getRowCount(), nodes, buffers); } private void appendNodes(FieldVector vector, List nodes, List buffers) { @@ -74,4 +59,4 @@ private void appendNodes(FieldVector vector, List nodes, List T addOrGet(String name, MinorType minorType, Class clazz, int... precisionScale); + public abstract T addOrGet(String name, MinorType minorType, Class clazz, DictionaryEncoding dictionary, int... precisionScale); // return the child vector with the input name public abstract T getChild(String name, Class clazz); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java index f030d166ade8d..baeeb07873714 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java @@ -26,6 +26,7 @@ import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.ValueVector; import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.DictionaryEncoding; import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.util.MapWithOrdinal; @@ -110,7 +111,7 @@ public boolean allocateNewSafe() { * @return resultant {@link org.apache.arrow.vector.ValueVector} */ @Override - public T addOrGet(String name, MinorType minorType, Class clazz, int... precisionScale) { + public T addOrGet(String name, MinorType minorType, Class clazz, DictionaryEncoding dictionary, int... 
precisionScale) { final ValueVector existing = getChild(name); boolean create = false; if (existing == null) { @@ -122,7 +123,7 @@ public T addOrGet(String name, MinorType minorType, Clas create = true; } if (create) { - final T vector = clazz.cast(minorType.getNewVector(name, allocator, callBack, precisionScale)); + final T vector = clazz.cast(minorType.getNewVector(name, allocator, dictionary, callBack, precisionScale)); putChild(name, vector); if (callBack!=null) { callBack.doWork(); @@ -162,12 +163,12 @@ public T getChild(String name, Class clazz) { return typeify(v, clazz); } - protected ValueVector add(String name, MinorType minorType, int... precisionScale) { + protected ValueVector add(String name, MinorType minorType, DictionaryEncoding dictionary, int... precisionScale) { final ValueVector existing = getChild(name); if (existing != null) { throw new IllegalStateException(String.format("Vector already exists: Existing[%s], Requested[%s] ", existing.getClass().getSimpleName(), minorType)); } - FieldVector vector = minorType.getNewVector(name, allocator, callBack, precisionScale); + FieldVector vector = minorType.getNewVector(name, allocator, dictionary, callBack, precisionScale); putChild(name, vector); if (callBack!=null) { callBack.doWork(); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java index 7424df474ae89..eeb8f5830f404 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java @@ -28,6 +28,7 @@ import org.apache.arrow.vector.ValueVector; import org.apache.arrow.vector.ZeroVector; import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.DictionaryEncoding; import org.apache.arrow.vector.util.SchemaChangeRuntimeException; import com.google.common.base.Preconditions; @@ -150,10 +151,10 @@ public int size() { return vector == DEFAULT_DATA_VECTOR ? 0:1; } - public AddOrGetResult addOrGetVector(MinorType minorType) { + public AddOrGetResult addOrGetVector(MinorType minorType, DictionaryEncoding dictionary) { boolean created = false; if (vector instanceof ZeroVector) { - vector = minorType.getNewVector(DATA_VECTOR_NAME, allocator, null); + vector = minorType.getNewVector(DATA_VECTOR_NAME, allocator, dictionary, null); // returned vector must have the same field created = true; } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/DictionaryVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/DictionaryVector.java deleted file mode 100644 index 84760eadf2253..0000000000000 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/DictionaryVector.java +++ /dev/null @@ -1,229 +0,0 @@ -/******************************************************************************* - - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - ******************************************************************************/ -package org.apache.arrow.vector.complex; - -import io.netty.buffer.ArrowBuf; -import org.apache.arrow.memory.BufferAllocator; -import org.apache.arrow.memory.OutOfMemoryException; -import org.apache.arrow.vector.NullableIntVector; -import org.apache.arrow.vector.ValueVector; -import org.apache.arrow.vector.complex.reader.FieldReader; -import org.apache.arrow.vector.types.Dictionary; -import org.apache.arrow.vector.types.Types.MinorType; -import org.apache.arrow.vector.types.pojo.Field; -import org.apache.arrow.vector.util.TransferPair; - -import java.util.HashMap; -import java.util.Iterator; -import java.util.Map; - -public class DictionaryVector implements ValueVector { - - private ValueVector indices; - private Dictionary dictionary; - - public DictionaryVector(ValueVector indices, Dictionary dictionary) { - this.indices = indices; - this.dictionary = dictionary; - } - - /** - * Dictionary encodes a vector. The dictionary will be built using the values from the vector. - * - * @param vector vector to encode - * @return dictionary encoded vector - */ - public static DictionaryVector encode(ValueVector vector) { - validateType(vector.getMinorType()); - Map lookUps = new HashMap<>(); - Map transfers = new HashMap<>(); - - ValueVector.Accessor accessor = vector.getAccessor(); - int count = accessor.getValueCount(); - - NullableIntVector indices = new NullableIntVector(vector.getField().getName(), vector.getAllocator()); - indices.allocateNew(count); - NullableIntVector.Mutator mutator = indices.getMutator(); - - int nextIndex = 0; - for (int i = 0; i < count; i++) { - Object value = accessor.getObject(i); - if (value != null) { // if it's null leave it null - Integer index = lookUps.get(value); - if (index == null) { - index = nextIndex++; - lookUps.put(value, index); - transfers.put(i, index); - } - mutator.set(i, index); - } - } - mutator.setValueCount(count); - - // copy the dictionary values into the dictionary vector - TransferPair dictionaryTransfer = vector.getTransferPair(vector.getAllocator()); - ValueVector dictionaryVector = dictionaryTransfer.getTo(); - dictionaryVector.allocateNewSafe(); - for (Map.Entry entry: transfers.entrySet()) { - dictionaryTransfer.copyValueSafe(entry.getKey(), entry.getValue()); - } - dictionaryVector.getMutator().setValueCount(transfers.size()); - Dictionary dictionary = new Dictionary(dictionaryVector, false); - - return new DictionaryVector(indices, dictionary); - } - - /** - * Dictionary encodes a vector with a provided dictionary. The dictionary must contain all values in the vector. 
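(Concretely, the first-occurrence loop in encode above means that encoding the values ["foo", "bar", "foo", null] yields indices [0, 1, 0, null] and dictionary values ["foo", "bar"].)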
- * - * @param vector vector to encode - * @param dictionary dictionary used for encoding - * @return dictionary encoded vector - */ - public static DictionaryVector encode(ValueVector vector, Dictionary dictionary) { - validateType(vector.getMinorType()); - // load dictionary values into a hashmap for lookup - ValueVector.Accessor dictionaryAccessor = dictionary.getDictionary().getAccessor(); - Map lookUps = new HashMap<>(dictionaryAccessor.getValueCount()); - for (int i = 0; i < dictionaryAccessor.getValueCount(); i++) { - // for primitive array types we need a wrapper that implements equals and hashcode appropriately - lookUps.put(dictionaryAccessor.getObject(i), i); - } - - // vector to hold our indices (dictionary encoded values) - NullableIntVector indices = new NullableIntVector(vector.getField().getName(), vector.getAllocator()); - NullableIntVector.Mutator mutator = indices.getMutator(); - - ValueVector.Accessor accessor = vector.getAccessor(); - int count = accessor.getValueCount(); - - indices.allocateNew(count); - - for (int i = 0; i < count; i++) { - Object value = accessor.getObject(i); - if (value != null) { // if it's null leave it null - // note: this may fail if value was not included in the dictionary - mutator.set(i, lookUps.get(value)); - } - } - mutator.setValueCount(count); - - return new DictionaryVector(indices, dictionary); - } - - /** - * Decodes a dictionary encoded array using the provided dictionary. - * - * @param indices dictionary encoded values, must be int type - * @param dictionary dictionary used to decode the values - * @return vector with values restored from dictionary - */ - public static ValueVector decode(ValueVector indices, Dictionary dictionary) { - ValueVector.Accessor accessor = indices.getAccessor(); - int count = accessor.getValueCount(); - ValueVector dictionaryVector = dictionary.getDictionary(); - // copy the dictionary values into the decoded vector - TransferPair transfer = dictionaryVector.getTransferPair(indices.getAllocator()); - transfer.getTo().allocateNewSafe(); - for (int i = 0; i < count; i++) { - Object index = accessor.getObject(i); - if (index != null) { - transfer.copyValueSafe(((Number) index).intValue(), i); - } - } - - ValueVector decoded = transfer.getTo(); - decoded.getMutator().setValueCount(count); - return decoded; - } - - private static void validateType(MinorType type) { - // byte arrays don't work as keys in our dictionary map - we could wrap them with something to - // implement equals and hashcode if we want that functionality - if (type == MinorType.VARBINARY || type == MinorType.LIST || type == MinorType.MAP || type == MinorType.UNION) { - throw new IllegalArgumentException("Dictionary encoding for complex types not implemented"); - } - } - - public ValueVector getIndexVector() { return indices; } - - public ValueVector getDictionaryVector() { return dictionary.getDictionary(); } - - public Dictionary getDictionary() { return dictionary; } - - @Override - public MinorType getMinorType() { return indices.getMinorType(); } - - @Override - public Field getField() { return indices.getField(); } - - // note: dictionary vector is not closed, as it may be shared - @Override - public void close() { indices.close(); } - - @Override - public void allocateNew() throws OutOfMemoryException { indices.allocateNew(); } - - @Override - public boolean allocateNewSafe() { return indices.allocateNewSafe(); } - - @Override - public BufferAllocator getAllocator() { return indices.getAllocator(); } - - @Override - public void 
setInitialCapacity(int numRecords) { indices.setInitialCapacity(numRecords); } - - @Override - public int getValueCapacity() { return indices.getValueCapacity(); } - - @Override - public int getBufferSize() { return indices.getBufferSize(); } - - @Override - public int getBufferSizeFor(int valueCount) { return indices.getBufferSizeFor(valueCount); } - - @Override - public Iterator iterator() { - return indices.iterator(); - } - - @Override - public void clear() { indices.clear(); } - - @Override - public TransferPair getTransferPair(BufferAllocator allocator) { return indices.getTransferPair(allocator); } - - @Override - public TransferPair getTransferPair(String ref, BufferAllocator allocator) { return indices.getTransferPair(ref, allocator); } - - @Override - public TransferPair makeTransferPair(ValueVector target) { return indices.makeTransferPair(target); } - - @Override - public Accessor getAccessor() { return indices.getAccessor(); } - - @Override - public Mutator getMutator() { return indices.getMutator(); } - - @Override - public FieldReader getReader() { return indices.getReader(); } - - @Override - public ArrowBuf[] getBuffers(boolean clear) { return indices.getBuffers(clear); } -} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java index 074b0aa7e58fa..a12440e39e8fe 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java @@ -24,6 +24,10 @@ import java.util.Collections; import java.util.List; +import com.google.common.collect.ImmutableList; +import com.google.common.collect.ObjectArrays; + +import io.netty.buffer.ArrowBuf; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.OutOfMemoryException; import org.apache.arrow.vector.AddOrGetResult; @@ -42,16 +46,12 @@ import org.apache.arrow.vector.schema.ArrowFieldNode; import org.apache.arrow.vector.types.Types; import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.DictionaryEncoding; import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.util.JsonStringArrayList; import org.apache.arrow.vector.util.TransferPair; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.ObjectArrays; - -import io.netty.buffer.ArrowBuf; - public class ListVector extends BaseRepeatedValueVector implements FieldVector { final UInt4Vector offsets; @@ -62,14 +62,16 @@ public class ListVector extends BaseRepeatedValueVector implements FieldVector { private UnionListWriter writer; private UnionListReader reader; private CallBack callBack; + private final DictionaryEncoding dictionary; - public ListVector(String name, BufferAllocator allocator, CallBack callBack) { + public ListVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack) { super(name, allocator); this.bits = new BitVector("$bits$", allocator); this.offsets = getOffsetVector(); this.innerVectors = Collections.unmodifiableList(Arrays.asList(bits, offsets)); this.writer = new UnionListWriter(this); this.reader = new UnionListReader(this); + this.dictionary = dictionary; this.callBack = callBack; } @@ -80,7 +82,7 @@ public void initializeChildrenFromFields(List children) { } Field field = children.get(0); MinorType minorType = Types.getMinorTypeForArrowType(field.getType()); - 
AddOrGetResult addOrGetVector = addOrGetVector(minorType); + AddOrGetResult addOrGetVector = addOrGetVector(minorType, field.getDictionary()); if (!addOrGetVector.isCreated()) { throw new IllegalArgumentException("Child vector already existed: " + addOrGetVector.getVector()); } @@ -151,16 +153,16 @@ private class TransferImpl implements TransferPair { TransferPair pairs[] = new TransferPair[3]; public TransferImpl(String name, BufferAllocator allocator) { - this(new ListVector(name, allocator, null)); + this(new ListVector(name, allocator, dictionary, null)); } public TransferImpl(ListVector to) { this.to = to; - to.addOrGetVector(vector.getMinorType()); + to.addOrGetVector(vector.getMinorType(), vector.getField().getDictionary()); pairs[0] = offsets.makeTransferPair(to.offsets); pairs[1] = bits.makeTransferPair(to.bits); if (to.getDataVector() instanceof ZeroVector) { - to.addOrGetVector(vector.getMinorType()); + to.addOrGetVector(vector.getMinorType(), vector.getField().getDictionary()); } pairs[2] = getDataVector().makeTransferPair(to.getDataVector()); } @@ -232,8 +234,8 @@ public boolean allocateNewSafe() { return success; } - public AddOrGetResult addOrGetVector(MinorType minorType) { - AddOrGetResult result = super.addOrGetVector(minorType); + public AddOrGetResult addOrGetVector(MinorType minorType, DictionaryEncoding dictionary) { + AddOrGetResult result = super.addOrGetVector(minorType, dictionary); reader = new UnionListReader(this); return result; } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java index 31a1bb74b8e98..4d750cad264db 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java @@ -160,7 +160,7 @@ protected MapTransferPair(MapVector from, MapVector to, boolean allocate) { // (This is similar to what happens in ScanBatch where the children cannot be added till they are // read). To take care of this, we ensure that the hashCode of the MaterializedField does not // include the hashCode of the children but is based only on MaterializedField$key. 
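The DictionaryEncoding parameter added here is threaded straight through to vector creation; a minimal sketch of the updated addOrGet call, assuming an existing BufferAllocator, where a null encoding means the child is not dictionary-encoded:

    NullableMapVector parent = new NullableMapVector("parent", allocator, null, null);
    // add (or fetch) an int child; the fourth argument is the optional DictionaryEncoding
    NullableIntVector child = parent.addOrGet("child", MinorType.INT, NullableIntVector.class, null);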
- final FieldVector newVector = to.addOrGet(child, vector.getMinorType(), vector.getClass()); + final FieldVector newVector = to.addOrGet(child, vector.getMinorType(), vector.getClass(), vector.getField().getDictionary()); if (allocate && to.size() != preSize) { newVector.allocateNew(); } @@ -314,12 +314,11 @@ public void close() { public void initializeChildrenFromFields(List children) { for (Field field : children) { MinorType minorType = Types.getMinorTypeForArrowType(field.getType()); - FieldVector vector = (FieldVector)this.add(field.getName(), minorType); + FieldVector vector = (FieldVector)this.add(field.getName(), minorType, field.getDictionary()); vector.initializeChildrenFromFields(field.getChildren()); } } - public List getChildrenFromFields() { return getChildren(); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java index 5fa35307ab683..bb1fdf841a305 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java @@ -34,6 +34,7 @@ import org.apache.arrow.vector.complex.reader.FieldReader; import org.apache.arrow.vector.holders.ComplexHolder; import org.apache.arrow.vector.schema.ArrowFieldNode; +import org.apache.arrow.vector.types.pojo.DictionaryEncoding; import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.util.TransferPair; @@ -48,14 +49,16 @@ public class NullableMapVector extends MapVector implements FieldVector { protected final BitVector bits; private final List innerVectors; + private final DictionaryEncoding dictionary; private final Accessor accessor; private final Mutator mutator; - public NullableMapVector(String name, BufferAllocator allocator, CallBack callBack) { + public NullableMapVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack) { super(name, checkNotNull(allocator), callBack); this.bits = new BitVector("$bits$", allocator); this.innerVectors = Collections.unmodifiableList(Arrays.asList(bits)); + this.dictionary = dictionary; this.accessor = new Accessor(); this.mutator = new Mutator(); } @@ -83,7 +86,7 @@ public FieldReader getReader() { @Override public TransferPair getTransferPair(BufferAllocator allocator) { - return new NullableMapTransferPair(this, new NullableMapVector(name, allocator, callBack), false); + return new NullableMapTransferPair(this, new NullableMapVector(name, allocator, dictionary, callBack), false); } @Override @@ -93,7 +96,7 @@ public TransferPair makeTransferPair(ValueVector to) { @Override public TransferPair getTransferPair(String ref, BufferAllocator allocator) { - return new NullableMapTransferPair(this, new NullableMapVector(ref, allocator, callBack), false); + return new NullableMapTransferPair(this, new NullableMapVector(ref, allocator, dictionary, callBack), false); } protected class NullableMapTransferPair extends MapTransferPair { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java index dbdd2050d13ed..6d0531678488a 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java @@ -149,7 +149,8 @@ public MapWriter rootAsMap() { switch(mode){ case INIT: - 
NullableMapVector map = container.addOrGet(name, MinorType.MAP, NullableMapVector.class); + // TODO allow dictionaries in complex types + NullableMapVector map = container.addOrGet(name, MinorType.MAP, NullableMapVector.class, null); mapRoot = nullableMapWriterFactory.build(map); mapRoot.setPosition(idx()); mode = Mode.MAP; @@ -180,7 +181,8 @@ public ListWriter rootAsList() { case INIT: int vectorCount = container.size(); - ListVector listVector = container.addOrGet(name, MinorType.LIST, ListVector.class); + // TODO allow dictionaries in complex types + ListVector listVector = container.addOrGet(name, MinorType.LIST, ListVector.class, null); if (container.size() > vectorCount) { listVector.allocateNew(); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java index 1f7253bca93c8..e33319a2270b1 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java @@ -125,7 +125,7 @@ protected FieldWriter getWriter(MinorType type) { // ??? return null; } - ValueVector v = listVector.addOrGetVector(type).getVector(); + ValueVector v = listVector.addOrGetVector(type, null).getVector(); v.allocateNew(); setWriter(v); writer.setPosition(position); @@ -150,7 +150,8 @@ private FieldWriter promoteToUnion() { TransferPair tp = vector.getTransferPair(vector.getMinorType().name().toLowerCase(), vector.getAllocator()); tp.transfer(); if (parentContainer != null) { - unionVector = parentContainer.addOrGet(name, MinorType.UNION, UnionVector.class); + // TODO allow dictionaries in complex types + unionVector = parentContainer.addOrGet(name, MinorType.UNION, UnionVector.class, null); unionVector.allocateNew(); } else if (listVector != null) { unionVector = listVector.promoteToUnion(); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/dictionary/Dictionary.java b/java/vector/src/main/java/org/apache/arrow/vector/dictionary/Dictionary.java new file mode 100644 index 0000000000000..0c1cadfdafdbf --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/dictionary/Dictionary.java @@ -0,0 +1,66 @@ +/******************************************************************************* + + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ ******************************************************************************/ +package org.apache.arrow.vector.dictionary; + +import java.util.Objects; + +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.DictionaryEncoding; + +public class Dictionary { + + private final DictionaryEncoding encoding; + private final FieldVector dictionary; + + public Dictionary(FieldVector dictionary, DictionaryEncoding encoding) { + this.dictionary = dictionary; + this.encoding = encoding; + } + + public FieldVector getVector() { + return dictionary; + } + + public DictionaryEncoding getEncoding() { + return encoding; + } + + public ArrowType getVectorType() { + return dictionary.getField().getType(); + } + + @Override + public String toString() { + return "Dictionary " + encoding + " " + dictionary; + } + + @Override + public boolean equals(Object o) { + if (this == o) return true; + if (o == null || getClass() != o.getClass()) return false; + Dictionary that = (Dictionary) o; + return Objects.equals(encoding, that.encoding) && Objects.equals(dictionary, that.dictionary); + } + + @Override + public int hashCode() { + return Objects.hash(encoding, dictionary); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryEncoder.java b/java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryEncoder.java new file mode 100644 index 0000000000000..0666bc4137a9d --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryEncoder.java @@ -0,0 +1,144 @@ +/******************************************************************************* + + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + ******************************************************************************/ +package org.apache.arrow.vector.dictionary; + +import java.lang.reflect.InvocationTargetException; +import java.lang.reflect.Method; +import java.util.HashMap; +import java.util.Map; + +import com.google.common.collect.ImmutableList; + +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.util.TransferPair; + +public class DictionaryEncoder { + + // TODO recursively examine fields? + + /** + * Dictionary encodes a vector with a provided dictionary. The dictionary must contain all values in the vector. 
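A usage sketch of the encode/decode pair defined below; the NullableVarCharVector constructor and the DictionaryEncoding(id, ordered, indexType) constructor are assumptions here, while Dictionary, encode and decode come from this patch:

    // the distinct values live in their own vector, wrapped together with an encoding id
    NullableVarCharVector dictionaryVector = new NullableVarCharVector("values-dict", allocator);
    // ... populate dictionaryVector with the distinct values ...
    DictionaryEncoding encoding = new DictionaryEncoding(1L, false, new ArrowType.Int(32, true));
    Dictionary dictionary = new Dictionary(dictionaryVector, encoding);
    // valueVector: the column being encoded (population elided)
    ValueVector indices = DictionaryEncoder.encode(valueVector, dictionary); // int indices into the dictionary
    ValueVector decoded = DictionaryEncoder.decode(indices, dictionary);     // restores the original values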
+ * + * @param vector vector to encode + * @param dictionary dictionary used for encoding + * @return dictionary encoded vector + */ + public static ValueVector encode(ValueVector vector, Dictionary dictionary) { + validateType(vector.getMinorType()); + // load dictionary values into a hashmap for lookup + ValueVector.Accessor dictionaryAccessor = dictionary.getVector().getAccessor(); + Map lookUps = new HashMap<>(dictionaryAccessor.getValueCount()); + for (int i = 0; i < dictionaryAccessor.getValueCount(); i++) { + // for primitive array types we need a wrapper that implements equals and hashcode appropriately + lookUps.put(dictionaryAccessor.getObject(i), i); + } + + Field valueField = vector.getField(); + Field indexField = new Field(valueField.getName(), valueField.isNullable(), + dictionary.getEncoding().getIndexType(), dictionary.getEncoding(), null); + + // vector to hold our indices (dictionary encoded values) + FieldVector indices = indexField.createVector(vector.getAllocator()); + ValueVector.Mutator mutator = indices.getMutator(); + + // use reflection to pull out the set method + // TODO implement a common interface for int vectors + Method setter = null; + for (Class c: ImmutableList.of(int.class, long.class)) { + try { + setter = mutator.getClass().getMethod("set", int.class, c); + break; + } catch(NoSuchMethodException e) { + // ignore + } + } + if (setter == null) { + throw new IllegalArgumentException("Dictionary encoding does not have a valid int type:" + indices.getClass()); + } + + ValueVector.Accessor accessor = vector.getAccessor(); + int count = accessor.getValueCount(); + + indices.allocateNew(); + + try { + for (int i = 0; i < count; i++) { + Object value = accessor.getObject(i); + if (value != null) { // if it's null leave it null + // note: this may fail if value was not included in the dictionary + Object encoded = lookUps.get(value); + if (encoded == null) { + throw new IllegalArgumentException("Dictionary encoding not defined for value:" + value); + } + setter.invoke(mutator, i, encoded); + } + } + } catch (IllegalAccessException e) { + throw new RuntimeException("IllegalAccessException invoking vector mutator set():", e); + } catch (InvocationTargetException e) { + throw new RuntimeException("InvocationTargetException invoking vector mutator set():", e.getCause()); + } + + mutator.setValueCount(count); + + return indices; + } + + /** + * Decodes a dictionary encoded array using the provided dictionary. + * + * @param indices dictionary encoded values, must be int type + * @param dictionary dictionary used to decode the values + * @return vector with values restored from dictionary + */ + public static ValueVector decode(ValueVector indices, Dictionary dictionary) { + ValueVector.Accessor accessor = indices.getAccessor(); + int count = accessor.getValueCount(); + ValueVector dictionaryVector = dictionary.getVector(); + int dictionaryCount = dictionaryVector.getAccessor().getValueCount(); + // copy the dictionary values into the decoded vector + TransferPair transfer = dictionaryVector.getTransferPair(indices.getAllocator()); + transfer.getTo().allocateNewSafe(); + for (int i = 0; i < count; i++) { + Object index = accessor.getObject(i); + if (index != null) { + int indexAsInt = ((Number) index).intValue(); + if (indexAsInt > dictionaryCount) { + throw new IllegalArgumentException("Provided dictionary does not contain value for index " + indexAsInt); + } + transfer.copyValueSafe(indexAsInt, i); + } + } + // TODO do we need to worry about the field? 
+ ValueVector decoded = transfer.getTo(); + decoded.getMutator().setValueCount(count); + return decoded; + } + + private static void validateType(MinorType type) { + // byte arrays don't work as keys in our dictionary map - we could wrap them with something to + // implement equals and hashcode if we want that functionality + if (type == MinorType.VARBINARY || type == MinorType.LIST || type == MinorType.MAP || type == MinorType.UNION) { + throw new IllegalArgumentException("Dictionary encoding for complex types not implemented: type " + type); + } + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryProvider.java b/java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryProvider.java new file mode 100644 index 0000000000000..63fde2536da8b --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/dictionary/DictionaryProvider.java @@ -0,0 +1,47 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.dictionary; + +import java.util.HashMap; +import java.util.Map; + +public interface DictionaryProvider { + + public Dictionary lookup(long id); + + public static class MapDictionaryProvider implements DictionaryProvider { + + private final Map map; + + public MapDictionaryProvider(Dictionary... dictionaries) { + this.map = new HashMap<>(); + for (Dictionary dictionary: dictionaries) { + put(dictionary); + } + } + + public void put(Dictionary dictionary) { + map.put(dictionary.getEncoding().getId(), dictionary); + } + + @Override + public Dictionary lookup(long id) { + return map.get(id); + } + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFileReader.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFileReader.java new file mode 100644 index 0000000000000..28440a190ad43 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFileReader.java @@ -0,0 +1,142 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
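DictionaryProvider itself is deliberately small, an id-to-dictionary map; a sketch of registering and resolving a dictionary, where 1L stands in for whatever id its DictionaryEncoding carries:

    DictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider(dictionary);
    Dictionary resolved = provider.lookup(1L); // returns null for unknown ids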
+ */ +package org.apache.arrow.vector.file; + +import java.io.IOException; +import java.nio.ByteBuffer; +import java.nio.channels.SeekableByteChannel; +import java.util.Arrays; +import java.util.List; + +import org.apache.arrow.flatbuf.Footer; +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.schema.ArrowDictionaryBatch; +import org.apache.arrow.vector.schema.ArrowMessage; +import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.stream.MessageSerializer; +import org.apache.arrow.vector.types.pojo.Schema; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public class ArrowFileReader extends ArrowReader { + + private static final Logger LOGGER = LoggerFactory.getLogger(ArrowFileReader.class); + + private ArrowFooter footer; + private int currentDictionaryBatch = 0; + private int currentRecordBatch = 0; + + public ArrowFileReader(SeekableByteChannel in, BufferAllocator allocator) { + super(new SeekableReadChannel(in), allocator); + } + + public ArrowFileReader(SeekableReadChannel in, BufferAllocator allocator) { + super(in, allocator); + } + + @Override + protected Schema readSchema(SeekableReadChannel in) throws IOException { + if (footer == null) { + if (in.size() <= (ArrowMagic.MAGIC_LENGTH * 2 + 4)) { + throw new InvalidArrowFileException("file too small: " + in.size()); + } + ByteBuffer buffer = ByteBuffer.allocate(4 + ArrowMagic.MAGIC_LENGTH); + long footerLengthOffset = in.size() - buffer.remaining(); + in.setPosition(footerLengthOffset); + in.readFully(buffer); + buffer.flip(); + byte[] array = buffer.array(); + if (!ArrowMagic.validateMagic(Arrays.copyOfRange(array, 4, array.length))) { + throw new InvalidArrowFileException("missing Magic number " + Arrays.toString(buffer.array())); + } + int footerLength = MessageSerializer.bytesToInt(array); + if (footerLength <= 0 || footerLength + ArrowMagic.MAGIC_LENGTH * 2 + 4 > in.size()) { + throw new InvalidArrowFileException("invalid footer length: " + footerLength); + } + long footerOffset = footerLengthOffset - footerLength; + LOGGER.debug(String.format("Footer starts at %d, length: %d", footerOffset, footerLength)); + ByteBuffer footerBuffer = ByteBuffer.allocate(footerLength); + in.setPosition(footerOffset); + in.readFully(footerBuffer); + footerBuffer.flip(); + Footer footerFB = Footer.getRootAsFooter(footerBuffer); + this.footer = new ArrowFooter(footerFB); + } + return footer.getSchema(); + } + + @Override + protected ArrowMessage readMessage(SeekableReadChannel in, BufferAllocator allocator) throws IOException { + if (currentDictionaryBatch < footer.getDictionaries().size()) { + ArrowBlock block = footer.getDictionaries().get(currentDictionaryBatch++); + return readDictionaryBatch(in, block, allocator); + } else if (currentRecordBatch < footer.getRecordBatches().size()) { + ArrowBlock block = footer.getRecordBatches().get(currentRecordBatch++); + return readRecordBatch(in, block, allocator); + } else { + return null; + } + } + + public List getDictionaryBlocks() throws IOException { + ensureInitialized(); + return footer.getDictionaries(); + } + + public List getRecordBlocks() throws IOException { + ensureInitialized(); + return footer.getRecordBatches(); + } + + public void loadRecordBatch(ArrowBlock block) throws IOException { + ensureInitialized(); + int blockIndex = footer.getRecordBatches().indexOf(block); + if (blockIndex == -1) { + throw new IllegalArgumentException("Arrow block does not exist in record batches: " + block); + } + currentRecordBatch =
blockIndex; + loadNextBatch(); + } + + private ArrowDictionaryBatch readDictionaryBatch(SeekableReadChannel in, + ArrowBlock block, + BufferAllocator allocator) throws IOException { + LOGGER.debug(String.format("DictionaryRecordBatch at %d, metadata: %d, body: %d", + block.getOffset(), block.getMetadataLength(), block.getBodyLength())); + in.setPosition(block.getOffset()); + ArrowDictionaryBatch batch = MessageSerializer.deserializeDictionaryBatch(in, block, allocator); + if (batch == null) { + throw new IOException("Invalid file. No batch at offset: " + block.getOffset()); + } + return batch; + } + + private ArrowRecordBatch readRecordBatch(SeekableReadChannel in, + ArrowBlock block, + BufferAllocator allocator) throws IOException { + LOGGER.debug(String.format("RecordBatch at %d, metadata: %d, body: %d", + block.getOffset(), block.getMetadataLength(), + block.getBodyLength())); + in.setPosition(block.getOffset()); + ArrowRecordBatch batch = MessageSerializer.deserializeRecordBatch(in, block, allocator); + if (batch == null) { + throw new IOException("Invalid file. No batch at offset: " + block.getOffset()); + } + return batch; + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFileWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFileWriter.java new file mode 100644 index 0000000000000..23d210a3ee73b --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFileWriter.java @@ -0,0 +1,59 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
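Putting ArrowFileReader to work, a sketch assuming an existing BufferAllocator and file path; getRecordBlocks gives random access, and any dictionary batches are loaded before the first record batch:

    try (FileInputStream fileIn = new FileInputStream("example.arrow");
         ArrowFileReader reader = new ArrowFileReader(fileIn.getChannel(), allocator)) {
      VectorSchemaRoot root = reader.getVectorSchemaRoot(); // reads footer and schema
      for (ArrowBlock block : reader.getRecordBlocks()) {
        reader.loadRecordBatch(block);
        // ... consume root's vectors; root.getRowCount() rows are loaded ...
      }
    }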
+ */ +package org.apache.arrow.vector.file; + +import java.io.IOException; +import java.nio.channels.WritableByteChannel; +import java.util.List; + +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.dictionary.DictionaryProvider; +import org.apache.arrow.vector.types.pojo.Schema; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +public class ArrowFileWriter extends ArrowWriter { + + private static final Logger LOGGER = LoggerFactory.getLogger(ArrowFileWriter.class); + + public ArrowFileWriter(VectorSchemaRoot root, DictionaryProvider provider, WritableByteChannel out) { + super(root, provider, out); + } + + @Override + protected void startInternal(WriteChannel out) throws IOException { + ArrowMagic.writeMagic(out); + } + + @Override + protected void endInternal(WriteChannel out, + Schema schema, + List dictionaries, + List records) throws IOException { + long footerStart = out.getCurrentPosition(); + out.write(new ArrowFooter(schema, dictionaries, records), false); + int footerLength = (int)(out.getCurrentPosition() - footerStart); + if (footerLength <= 0) { + throw new InvalidArrowFileException("invalid footer"); + } + out.writeIntLittleEndian(footerLength); + LOGGER.debug(String.format("Footer starts at %d, length: %d", footerStart, footerLength)); + ArrowMagic.writeMagic(out); + LOGGER.debug(String.format("magic written, now at %d", out.getCurrentPosition())); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFooter.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFooter.java index 38903068570c7..1c0008a9184a0 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFooter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFooter.java @@ -38,7 +38,6 @@ public class ArrowFooter implements FBSerializable { private final List recordBatches; public ArrowFooter(Schema schema, List dictionaries, List recordBatches) { - super(); this.schema = schema; this.dictionaries = dictionaries; this.recordBatches = recordBatches; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowMagic.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowMagic.java new file mode 100644 index 0000000000000..99ea96b3856d5 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowMagic.java @@ -0,0 +1,37 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
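ArrowFileWriter is symmetric on the write side; a sketch assuming an existing Schema, BufferAllocator and populated vectors, with an empty MapDictionaryProvider when no field is dictionary-encoded:

    VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
    DictionaryProvider provider = new DictionaryProvider.MapDictionaryProvider();
    try (FileOutputStream fileOut = new FileOutputStream("example.arrow")) {
      ArrowFileWriter writer = new ArrowFileWriter(root, provider, fileOut.getChannel());
      writer.start();       // magic, schema, then any dictionary batches
      // ... fill root's vectors and call root.setRowCount(n) ...
      writer.writeBatch();  // unloads root into a record batch and writes it
      writer.end();         // footer, footer length, trailing magic
      writer.close();
    }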
+ */ +package org.apache.arrow.vector.file; + +import java.io.IOException; +import java.nio.charset.StandardCharsets; +import java.util.Arrays; + +public class ArrowMagic { + + private static final byte[] MAGIC = "ARROW1".getBytes(StandardCharsets.UTF_8); + + public static final int MAGIC_LENGTH = MAGIC.length; + + public static void writeMagic(WriteChannel out) throws IOException { + out.write(MAGIC); + } + + public static boolean validateMagic(byte[] array) { + return Arrays.equals(MAGIC, array); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java index 8f4f4978d66cf..1646fbe803687 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java @@ -18,90 +18,188 @@ package org.apache.arrow.vector.file; import java.io.IOException; -import java.nio.ByteBuffer; -import java.nio.channels.SeekableByteChannel; -import java.util.Arrays; +import java.util.ArrayList; +import java.util.Collections; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import com.google.common.collect.ImmutableList; -import org.apache.arrow.flatbuf.Footer; import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.VectorLoader; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.dictionary.Dictionary; +import org.apache.arrow.vector.dictionary.DictionaryProvider; +import org.apache.arrow.vector.schema.ArrowDictionaryBatch; +import org.apache.arrow.vector.schema.ArrowMessage; +import org.apache.arrow.vector.schema.ArrowMessage.ArrowMessageVisitor; import org.apache.arrow.vector.schema.ArrowRecordBatch; -import org.apache.arrow.vector.stream.MessageSerializer; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; - -public class ArrowReader implements AutoCloseable { - private static final Logger LOGGER = LoggerFactory.getLogger(ArrowReader.class); - - public static final byte[] MAGIC = "ARROW1".getBytes(); +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.ArrowType.Int; +import org.apache.arrow.vector.types.pojo.DictionaryEncoding; +import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.Schema; - private final SeekableByteChannel in; +public abstract class ArrowReader implements DictionaryProvider, AutoCloseable { + private final T in; private final BufferAllocator allocator; - private ArrowFooter footer; + private VectorLoader loader; + private VectorSchemaRoot root; + private Map dictionaries; - public ArrowReader(SeekableByteChannel in, BufferAllocator allocator) { - super(); + private boolean initialized = false; + + protected ArrowReader(T in, BufferAllocator allocator) { this.in = in; this.allocator = allocator; } - private int readFully(ByteBuffer buffer) throws IOException { - int total = 0; - int n; - do { - n = in.read(buffer); - total += n; - } while (n >= 0 && buffer.remaining() > 0); - buffer.flip(); - return total; + /** + * Returns the vector schema root. 
This will be loaded with new values on every call to loadNextBatch + * + * @return the vector schema root + * @throws IOException if reading of schema fails + */ + public VectorSchemaRoot getVectorSchemaRoot() throws IOException { + ensureInitialized(); + return root; } - public ArrowFooter readFooter() throws IOException { - if (footer == null) { - if (in.size() <= (MAGIC.length * 2 + 4)) { - throw new InvalidArrowFileException("file too small: " + in.size()); - } - ByteBuffer buffer = ByteBuffer.allocate(4 + MAGIC.length); - long footerLengthOffset = in.size() - buffer.remaining(); - in.position(footerLengthOffset); - readFully(buffer); - byte[] array = buffer.array(); - if (!Arrays.equals(MAGIC, Arrays.copyOfRange(array, 4, array.length))) { - throw new InvalidArrowFileException("missing Magic number " + Arrays.toString(buffer.array())); - } - int footerLength = MessageSerializer.bytesToInt(array); - if (footerLength <= 0 || footerLength + MAGIC.length * 2 + 4 > in.size()) { - throw new InvalidArrowFileException("invalid footer length: " + footerLength); - } - long footerOffset = footerLengthOffset - footerLength; - LOGGER.debug(String.format("Footer starts at %d, length: %d", footerOffset, footerLength)); - ByteBuffer footerBuffer = ByteBuffer.allocate(footerLength); - in.position(footerOffset); - readFully(footerBuffer); - Footer footerFB = Footer.getRootAsFooter(footerBuffer); - this.footer = new ArrowFooter(footerFB); + /** + * Returns any dictionaries + * + * @return dictionaries, if any + * @throws IOException if reading of schema fails + */ + public Map getDictionaryVectors() throws IOException { + ensureInitialized(); + return dictionaries; + } + + @Override + public Dictionary lookup(long id) { + if (initialized) { + return dictionaries.get(id); + } else { + return null; } - return footer; } - // TODO: read dictionaries - - public ArrowRecordBatch readRecordBatch(ArrowBlock block) throws IOException { - LOGGER.debug(String.format("RecordBatch at %d, metadata: %d, body: %d", - block.getOffset(), block.getMetadataLength(), - block.getBodyLength())); - in.position(block.getOffset()); - ArrowRecordBatch batch = MessageSerializer.deserializeRecordBatch( - new ReadChannel(in, block.getOffset()), block, allocator); - if (batch == null) { - throw new IOException("Invalid file. 
No batch at offset: " + block.getOffset()); + public void loadNextBatch() throws IOException { + ensureInitialized(); + // read in all dictionary batches, then stop after our first record batch + ArrowMessageVisitor visitor = new ArrowMessageVisitor() { + @Override + public Boolean visit(ArrowDictionaryBatch message) { + try { load(message); } finally { message.close(); } + return true; + } + @Override + public Boolean visit(ArrowRecordBatch message) { + try { loader.load(message); } finally { message.close(); } + return false; + } + }; + root.setRowCount(0); + ArrowMessage message = readMessage(in, allocator); + while (message != null && message.accepts(visitor)) { + message = readMessage(in, allocator); } - return batch; } + public long bytesRead() { return in.bytesRead(); } + @Override public void close() throws IOException { + if (initialized) { + root.close(); + for (Dictionary dictionary: dictionaries.values()) { + dictionary.getVector().close(); + } + } in.close(); } + + protected abstract Schema readSchema(T in) throws IOException; + + protected abstract ArrowMessage readMessage(T in, BufferAllocator allocator) throws IOException; + + protected void ensureInitialized() throws IOException { + if (!initialized) { + initialize(); + initialized = true; + } + } + + /** + * Reads the schema and initializes the vectors + */ + private void initialize() throws IOException { + Schema schema = readSchema(in); + List fields = new ArrayList<>(); + List vectors = new ArrayList<>(); + Map dictionaries = new HashMap<>(); + + for (Field field: schema.getFields()) { + Field updated = toMemoryFormat(field, dictionaries); + fields.add(updated); + vectors.add(updated.createVector(allocator)); + } + + this.root = new VectorSchemaRoot(fields, vectors, 0); + this.loader = new VectorLoader(root); + this.dictionaries = Collections.unmodifiableMap(dictionaries); + } + + // in the message format, fields have the dictionary type + // in the memory format, they have the index type + private Field toMemoryFormat(Field field, Map dictionaries) { + DictionaryEncoding encoding = field.getDictionary(); + List children = field.getChildren(); + + if (encoding == null && children.isEmpty()) { + return field; + } + + List updatedChildren = new ArrayList<>(children.size()); + for (Field child: children) { + updatedChildren.add(toMemoryFormat(child, dictionaries)); + } + + ArrowType type; + if (encoding == null) { + type = field.getType(); + } else { + // re-type the field for in-memory format + type = encoding.getIndexType(); + if (type == null) { + type = new Int(32, true); + } + // get existing or create dictionary vector + if (!dictionaries.containsKey(encoding.getId())) { + // create a new dictionary vector for the values + Field dictionaryField = new Field(field.getName(), field.isNullable(), field.getType(), null, children); + FieldVector dictionaryVector = dictionaryField.createVector(allocator); + dictionaries.put(encoding.getId(), new Dictionary(dictionaryVector, encoding)); + } + } + + return new Field(field.getName(), field.isNullable(), type, encoding, updatedChildren); + } + + private void load(ArrowDictionaryBatch dictionaryBatch) { + long id = dictionaryBatch.getDictionaryId(); + Dictionary dictionary = dictionaries.get(id); + if (dictionary == null) { + throw new IllegalArgumentException("Dictionary ID " + id + " not defined in schema"); + } + FieldVector vector = dictionary.getVector(); + VectorSchemaRoot root = new VectorSchemaRoot(ImmutableList.of(vector.getField()), ImmutableList.of(vector), 0); + 
VectorLoader loader = new VectorLoader(root); + loader.load(dictionaryBatch.getDictionary()); + } } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java index 24c667e67d98d..60a6afb565318 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java @@ -1,4 +1,4 @@ -/** +/* * Licensed to the Apache Software Foundation (ASF) under one * or more contributor license agreements. See the NOTICE file * distributed with this work for additional information @@ -21,77 +21,172 @@ import java.nio.channels.WritableByteChannel; import java.util.ArrayList; import java.util.Collections; +import java.util.HashMap; import java.util.List; +import java.util.Map; +import com.google.common.collect.ImmutableList; + +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.VectorUnloader; +import org.apache.arrow.vector.dictionary.Dictionary; +import org.apache.arrow.vector.dictionary.DictionaryProvider; +import org.apache.arrow.vector.schema.ArrowDictionaryBatch; import org.apache.arrow.vector.schema.ArrowRecordBatch; import org.apache.arrow.vector.stream.MessageSerializer; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.DictionaryEncoding; +import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.types.pojo.Schema; import org.slf4j.Logger; import org.slf4j.LoggerFactory; -public class ArrowWriter implements AutoCloseable { +public abstract class ArrowWriter implements AutoCloseable { + private static final Logger LOGGER = LoggerFactory.getLogger(ArrowWriter.class); + // schema with fields in message format, not memory format + private final Schema schema; private final WriteChannel out; - private final Schema schema; + private final VectorUnloader unloader; + private final List dictionaries; + + private final List dictionaryBlocks = new ArrayList<>(); + private final List recordBlocks = new ArrayList<>(); - private final List recordBatches = new ArrayList<>(); private boolean started = false; + private boolean ended = false; - public ArrowWriter(WritableByteChannel out, Schema schema) { + /** + * Note: fields are not closed when the writer is closed + * + * @param root + * @param provider + * @param out + */ + protected ArrowWriter(VectorSchemaRoot root, DictionaryProvider provider, WritableByteChannel out) { + this.unloader = new VectorUnloader(root); this.out = new WriteChannel(out); - this.schema = schema; + + List fields = new ArrayList<>(root.getSchema().getFields().size()); + Map dictionaryBatches = new HashMap<>(); + + for (Field field: root.getSchema().getFields()) { + fields.add(toMessageFormat(field, provider, dictionaryBatches)); + } + + this.schema = new Schema(fields); + this.dictionaries = Collections.unmodifiableList(new ArrayList<>(dictionaryBatches.values())); + } + + // in the message format, fields have the dictionary type + // in the memory format, they have the index type + private Field toMessageFormat(Field field, DictionaryProvider provider, Map batches) { + DictionaryEncoding encoding = field.getDictionary(); + List children = field.getChildren(); + + if (encoding == null && children.isEmpty()) { + return field; + } + + List updatedChildren = new ArrayList<>(children.size()); + for (Field child: children) { + 
updatedChildren.add(toMessageFormat(child, provider, batches)); + } + + ArrowType type; + if (encoding == null) { + type = field.getType(); + } else { + long id = encoding.getId(); + Dictionary dictionary = provider.lookup(id); + if (dictionary == null) { + throw new IllegalArgumentException("Could not find dictionary with ID " + id); + } + type = dictionary.getVectorType(); + + if (!batches.containsKey(id)) { + FieldVector vector = dictionary.getVector(); + int count = vector.getAccessor().getValueCount(); + VectorSchemaRoot root = new VectorSchemaRoot(ImmutableList.of(field), ImmutableList.of(vector), count); + VectorUnloader unloader = new VectorUnloader(root); + ArrowRecordBatch batch = unloader.getRecordBatch(); + batches.put(id, new ArrowDictionaryBatch(id, batch)); + } + } + + return new Field(field.getName(), field.isNullable(), type, encoding, updatedChildren); } - private void start() throws IOException { - writeMagic(); - MessageSerializer.serialize(out, schema); + public void start() throws IOException { + ensureStarted(); } - // TODO: write dictionaries + public void writeBatch() throws IOException { + ensureStarted(); + try (ArrowRecordBatch batch = unloader.getRecordBatch()) { + writeRecordBatch(batch); + } + } - public void writeRecordBatch(ArrowRecordBatch recordBatch) throws IOException { - checkStarted(); - ArrowBlock batchDesc = MessageSerializer.serialize(out, recordBatch); + protected void writeRecordBatch(ArrowRecordBatch batch) throws IOException { + ArrowBlock block = MessageSerializer.serialize(out, batch); LOGGER.debug(String.format("RecordBatch at %d, metadata: %d, body: %d", - batchDesc.getOffset(), batchDesc.getMetadataLength(), batchDesc.getBodyLength())); + block.getOffset(), block.getMetadataLength(), block.getBodyLength())); + recordBlocks.add(block); + } - // add metadata to footer - recordBatches.add(batchDesc); + public void end() throws IOException { + ensureStarted(); + ensureEnded(); } - private void checkStarted() throws IOException { + public long bytesWritten() { return out.getCurrentPosition(); } + + private void ensureStarted() throws IOException { if (!started) { started = true; - start(); + startInternal(out); + // write the schema - for file formats this is duplicated in the footer, but matches + // the streaming format + MessageSerializer.serialize(out, schema); + // write out any dictionaries + for (ArrowDictionaryBatch batch : dictionaries) { + try { + ArrowBlock block = MessageSerializer.serialize(out, batch); + LOGGER.debug(String.format("DictionaryRecordBatch at %d, metadata: %d, body: %d", + block.getOffset(), block.getMetadataLength(), block.getBodyLength())); + dictionaryBlocks.add(block); + } finally { + batch.close(); + } + } } } - @Override - public void close() throws IOException { - try { - long footerStart = out.getCurrentPosition(); - writeFooter(); - int footerLength = (int)(out.getCurrentPosition() - footerStart); - if (footerLength <= 0 ) { - throw new InvalidArrowFileException("invalid footer"); - } - out.writeIntLittleEndian(footerLength); - LOGGER.debug(String.format("Footer starts at %d, length: %d", footerStart, footerLength)); - writeMagic(); - } finally { - out.close(); + private void ensureEnded() throws IOException { + if (!ended) { + ended = true; + endInternal(out, schema, dictionaryBlocks, recordBlocks); } } - private void writeMagic() throws IOException { - out.write(ArrowReader.MAGIC); - LOGGER.debug(String.format("magic written, now at %d", out.getCurrentPosition())); - } + protected abstract void 
startInternal(WriteChannel out) throws IOException; + + protected abstract void endInternal(WriteChannel out, + Schema schema, + List dictionaries, + List records) throws IOException; - private void writeFooter() throws IOException { - // TODO: dictionaries - out.write(new ArrowFooter(schema, Collections.emptyList(), recordBatches), false); + @Override + public void close() { + try { + end(); + out.close(); + } catch (IOException e) { + throw new RuntimeException(e); + } } } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ReadChannel.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ReadChannel.java index a9dc1293b8193..b062f3826eab3 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/ReadChannel.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ReadChannel.java @@ -32,16 +32,9 @@ public class ReadChannel implements AutoCloseable { private ReadableByteChannel in; private long bytesRead = 0; - // The starting byte offset into 'in'. - private final long startByteOffset; - - public ReadChannel(ReadableByteChannel in, long startByteOffset) { - this.in = in; - this.startByteOffset = startByteOffset; - } public ReadChannel(ReadableByteChannel in) { - this(in, 0); + this.in = in; } public long bytesRead() { return bytesRead; } @@ -72,8 +65,6 @@ public int readFully(ArrowBuf buffer, int l) throws IOException { return n; } - public long getCurrentPositiion() { return startByteOffset + bytesRead; } - @Override public void close() throws IOException { if (this.in != null) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Dictionary.java b/java/vector/src/main/java/org/apache/arrow/vector/file/SeekableReadChannel.java similarity index 57% rename from java/vector/src/main/java/org/apache/arrow/vector/types/Dictionary.java rename to java/vector/src/main/java/org/apache/arrow/vector/file/SeekableReadChannel.java index fbe1345f96aa3..914c3cb4b33a9 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/Dictionary.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/SeekableReadChannel.java @@ -1,5 +1,4 @@ -/******************************************************************************* - +/** * Licensed to the Apache Software Foundation (ASF) under one * or more contributor license agreements. See the NOTICE file * distributed with this work for additional information @@ -15,26 +14,26 @@ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. 
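To make the two formats concrete: for a dictionary-encoded column the schema message carries the dictionary's value type, while the in-memory field carries the index type, both tagged with the same DictionaryEncoding. A sketch, with Utf8 as an assumed value type and encoding an existing DictionaryEncoding:

    // in-memory form: 32-bit signed indices into the dictionary
    Field memoryField = new Field("c", true, new ArrowType.Int(32, true), encoding, null);
    // message form written to the file or stream: the dictionary's value type
    Field messageField = new Field("c", true, new ArrowType.Utf8(), encoding, null);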
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ReadChannel.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ReadChannel.java
index a9dc1293b8193..b062f3826eab3 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/file/ReadChannel.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ReadChannel.java
@@ -32,16 +32,9 @@ public class ReadChannel implements AutoCloseable {

   private ReadableByteChannel in;
   private long bytesRead = 0;
-  // The starting byte offset into 'in'.
-  private final long startByteOffset;
-
-  public ReadChannel(ReadableByteChannel in, long startByteOffset) {
-    this.in = in;
-    this.startByteOffset = startByteOffset;
-  }

   public ReadChannel(ReadableByteChannel in) {
-    this(in, 0);
+    this.in = in;
   }

   public long bytesRead() { return bytesRead; }
@@ -72,8 +65,6 @@ public int readFully(ArrowBuf buffer, int l) throws IOException {
     return n;
   }

-  public long getCurrentPositiion() { return startByteOffset + bytesRead; }
-
   @Override
   public void close() throws IOException {
     if (this.in != null) {
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Dictionary.java b/java/vector/src/main/java/org/apache/arrow/vector/file/SeekableReadChannel.java
similarity index 57%
rename from java/vector/src/main/java/org/apache/arrow/vector/types/Dictionary.java
rename to java/vector/src/main/java/org/apache/arrow/vector/file/SeekableReadChannel.java
index fbe1345f96aa3..914c3cb4b33a9 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/types/Dictionary.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/file/SeekableReadChannel.java
@@ -1,5 +1,4 @@
-/*******************************************************************************
-
+/**
  * Licensed to the Apache Software Foundation (ASF) under one
  * or more contributor license agreements. See the NOTICE file
  * distributed with this work for additional information
@@ -15,26 +14,26 @@
  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  * See the License for the specific language governing permissions and
  * limitations under the License.
- ******************************************************************************/
-package org.apache.arrow.vector.types;
+ */
+package org.apache.arrow.vector.file;

-import org.apache.arrow.vector.ValueVector;
+import java.io.IOException;
+import java.nio.channels.SeekableByteChannel;

-public class Dictionary {
+public class SeekableReadChannel extends ReadChannel {

-  private ValueVector dictionary;
-  private boolean ordered;
+  private final SeekableByteChannel in;

-  public Dictionary(ValueVector dictionary, boolean ordered) {
-    this.dictionary = dictionary;
-    this.ordered = ordered;
+  public SeekableReadChannel(SeekableByteChannel in) {
+    super(in);
+    this.in = in;
   }

-  public ValueVector getDictionary() {
-    return dictionary;
+  public void setPosition(long position) throws IOException {
+    in.position(position);
   }

-  public boolean isOrdered() {
-    return ordered;
+  public long size() throws IOException {
+    return in.size();
   }
 }
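SeekableReadChannel layers random access over ReadChannel, which the file reader needs in order to jump straight to the block offsets recorded in the footer rather than scanning the stream. A usage sketch, assuming an Arrow file on disk (the path is hypothetical, and the 10-byte arithmetic assumes the file format's trailing 4-byte footer length plus 6-byte magic):

    import java.io.FileInputStream;
    import org.apache.arrow.vector.file.SeekableReadChannel;

    try (FileInputStream file = new FileInputStream("/tmp/example.arrow")) { // hypothetical path
      SeekableReadChannel channel = new SeekableReadChannel(file.getChannel());
      long footerLengthOffset = channel.size() - 10; // 4-byte footer length + 6-byte "ARROW1" suffix
      channel.setPosition(footerLengthOffset);       // read the footer, then seek to each recorded block
    }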
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/WriteChannel.java b/java/vector/src/main/java/org/apache/arrow/vector/file/WriteChannel.java
index d99c9a6c99958..42104d181a2d0 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/file/WriteChannel.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/file/WriteChannel.java
@@ -21,13 +21,12 @@
 import java.nio.ByteBuffer;
 import java.nio.channels.WritableByteChannel;

-import org.apache.arrow.vector.schema.FBSerializable;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
 import com.google.flatbuffers.FlatBufferBuilder;

 import io.netty.buffer.ArrowBuf;
+import org.apache.arrow.vector.schema.FBSerializable;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;

 /**
  * Wrapper around a WritableByteChannel that maintains the position as well adding
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java
index 24fdc184523b3..bdb63b92cb105 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java
@@ -88,10 +88,34 @@ public Schema start() throws JsonParseException, IOException {
     }
   }

+  public void read(VectorSchemaRoot root) throws IOException {
+    JsonToken t = parser.nextToken();
+    if (t == START_OBJECT) {
+      {
+        int count = readNextField("count", Integer.class);
+        root.setRowCount(count);
+        nextFieldIs("columns");
+        readToken(START_ARRAY);
+        {
+          for (Field field : schema.getFields()) {
+            FieldVector vector = root.getVector(field.getName());
+            readVector(field, vector);
+          }
+        }
+        readToken(END_ARRAY);
+      }
+      readToken(END_OBJECT);
+    } else if (t == END_ARRAY) {
+      root.setRowCount(0);
+    } else {
+      throw new IllegalArgumentException("Invalid token: " + t);
+    }
+  }
+
   public VectorSchemaRoot read() throws IOException {
     JsonToken t = parser.nextToken();
     if (t == START_OBJECT) {
-      VectorSchemaRoot recordBatch = new VectorSchemaRoot(schema, allocator);
+      VectorSchemaRoot recordBatch = VectorSchemaRoot.create(schema, allocator);
       {
         int count = readNextField("count", Integer.class);
         recordBatch.setRowCount(count);
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowDictionaryBatch.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowDictionaryBatch.java
new file mode 100644
index 0000000000000..901877b7058cd
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowDictionaryBatch.java
@@ -0,0 +1,60 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector.schema;
+
+import com.google.flatbuffers.FlatBufferBuilder;
+import org.apache.arrow.flatbuf.DictionaryBatch;
+
+public class ArrowDictionaryBatch implements ArrowMessage {
+
+  private final long dictionaryId;
+  private final ArrowRecordBatch dictionary;
+
+  public ArrowDictionaryBatch(long dictionaryId, ArrowRecordBatch dictionary) {
+    this.dictionaryId = dictionaryId;
+    this.dictionary = dictionary;
+  }
+
+  public long getDictionaryId() { return dictionaryId; }
+  public ArrowRecordBatch getDictionary() { return dictionary; }
+
+  @Override
+  public int writeTo(FlatBufferBuilder builder) {
+    int dataOffset = dictionary.writeTo(builder);
+    DictionaryBatch.startDictionaryBatch(builder);
+    DictionaryBatch.addId(builder, dictionaryId);
+    DictionaryBatch.addData(builder, dataOffset);
+    return DictionaryBatch.endDictionaryBatch(builder);
+  }
+
+  @Override
+  public int computeBodyLength() { return dictionary.computeBodyLength(); }
+
+  @Override
+  public <T> T accepts(ArrowMessageVisitor<T> visitor) { return visitor.visit(this); }
+
+  @Override
+  public String toString() {
+    return "ArrowDictionaryBatch [dictionaryId=" + dictionaryId + ", dictionary=" + dictionary + "]";
+  }
+
+  @Override
+  public void close() {
+    dictionary.close();
+  }
+}
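A dictionary batch is therefore just a record batch tagged with the id of the dictionary it populates; the toMessageFormat() change near the top of this patch builds them exactly this way. A condensed sketch of that path, assuming a Dictionary obtained from a DictionaryProvider and the matching field's DictionaryEncoding:

    // Unload the dictionary's backing vector into a record batch, then tag it with the id.
    FieldVector vector = dictionary.getVector();
    int count = vector.getAccessor().getValueCount();
    VectorSchemaRoot dictRoot = new VectorSchemaRoot(
        ImmutableList.of(vector.getField()), ImmutableList.of(vector), count);
    ArrowRecordBatch data = new VectorUnloader(dictRoot).getRecordBatch();
    ArrowDictionaryBatch batch = new ArrowDictionaryBatch(encoding.getId(), data);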
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowMessage.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowMessage.java
new file mode 100644
index 0000000000000..d307428889b0f
--- /dev/null
+++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowMessage.java
@@ -0,0 +1,30 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector.schema;
+
+public interface ArrowMessage extends FBSerializable, AutoCloseable {
+
+  public int computeBodyLength();
+
+  public <T> T accepts(ArrowMessageVisitor<T> visitor);
+
+  public static interface ArrowMessageVisitor<T> {
+    public T visit(ArrowDictionaryBatch message);
+    public T visit(ArrowRecordBatch message);
+  }
+}
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowRecordBatch.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowRecordBatch.java
index 40c2fbfd984f8..6ef514e568d2d 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowRecordBatch.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowRecordBatch.java
@@ -32,7 +32,8 @@

 import io.netty.buffer.ArrowBuf;

-public class ArrowRecordBatch implements FBSerializable, AutoCloseable {
+public class ArrowRecordBatch implements ArrowMessage {
+
   private static final Logger LOGGER = LoggerFactory.getLogger(ArrowRecordBatch.class);

   /** number of records */
@@ -113,9 +114,13 @@ public int writeTo(FlatBufferBuilder builder) {
     return RecordBatch.endRecordBatch(builder);
   }

+  @Override
+  public <T> T accepts(ArrowMessageVisitor<T> visitor) { return visitor.visit(this); }
+
   /**
    * releases the buffers
    */
+  @Override
   public void close() {
     if (!closed) {
       closed = true;
@@ -134,6 +139,7 @@ public String toString() {
   /**
    * Computes the size of the serialized body for this recordBatch.
    */
+  @Override
   public int computeBodyLength() {
     int size = 0;
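Since both batch types implement ArrowMessage, readers can dispatch on the concrete message without instanceof checks by supplying a visitor. A sketch of routing messages to dictionary- or data-handling code (the handling itself is elided; 'message' stands for any ArrowMessage off the wire):

    // Returns true when the message carried a dictionary rather than data.
    ArrowMessage.ArrowMessageVisitor<Boolean> visitor = new ArrowMessage.ArrowMessageVisitor<Boolean>() {
      @Override
      public Boolean visit(ArrowDictionaryBatch message) {
        // e.g. load into the vector registered under message.getDictionaryId()
        return true;
      }
      @Override
      public Boolean visit(ArrowRecordBatch message) {
        // e.g. hand off to a VectorLoader for the VectorSchemaRoot
        return false;
      }
    };
    boolean wasDictionary = message.accepts(visitor);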
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamReader.java b/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamReader.java
index f32966c5d5217..2deef37cd4e56 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamReader.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamReader.java
@@ -17,79 +17,43 @@
  */
 package org.apache.arrow.vector.stream;

-import java.io.IOException;
-import java.io.InputStream;
-import java.nio.channels.Channels;
-import java.nio.channels.ReadableByteChannel;
-
 import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.file.ArrowReader;
 import org.apache.arrow.vector.file.ReadChannel;
-import org.apache.arrow.vector.schema.ArrowRecordBatch;
+import org.apache.arrow.vector.schema.ArrowMessage;
 import org.apache.arrow.vector.types.pojo.Schema;

-import com.google.common.base.Preconditions;
+import java.io.IOException;
+import java.io.InputStream;
+import java.nio.channels.Channels;
+import java.nio.channels.ReadableByteChannel;

 /**
  * This classes reads from an input stream and produces ArrowRecordBatches.
  */
-public class ArrowStreamReader implements AutoCloseable {
-  private ReadChannel in;
-  private final BufferAllocator allocator;
-  private Schema schema;
-
-  /**
-   * Constructs a streaming read, reading bytes from 'in'. Non-blocking.
-   */
-  public ArrowStreamReader(ReadableByteChannel in, BufferAllocator allocator) {
-    super();
-    this.in = new ReadChannel(in);
-    this.allocator = allocator;
-  }
-
-  public ArrowStreamReader(InputStream in, BufferAllocator allocator) {
-    this(Channels.newChannel(in), allocator);
-  }
-
-  /**
-   * Initializes the reader. Must be called before the other APIs. This is blocking.
-   */
-  public void init() throws IOException {
-    Preconditions.checkState(this.schema == null, "Cannot call init() more than once.");
-    this.schema = readSchema();
-  }
+public class ArrowStreamReader extends ArrowReader<ReadChannel> {

-  /**
-   * Returns the schema for all records in this stream.
-   */
-  public Schema getSchema () {
-    Preconditions.checkState(this.schema != null, "Must call init() first.");
-    return schema;
-  }
-
-  public long bytesRead() { return in.bytesRead(); }
+  /**
+   * Constructs a streaming read, reading bytes from 'in'. Non-blocking.
+   */
+  public ArrowStreamReader(ReadableByteChannel in, BufferAllocator allocator) {
+    super(new ReadChannel(in), allocator);
+  }

-  /**
-   * Reads and returns the next ArrowRecordBatch. Returns null if this is the end
-   * of stream.
-   */
-  public ArrowRecordBatch nextRecordBatch() throws IOException {
-    Preconditions.checkState(this.in != null, "Cannot call after close()");
-    Preconditions.checkState(this.schema != null, "Must call init() first.");
-    return MessageSerializer.deserializeRecordBatch(in, allocator);
-  }
+  public ArrowStreamReader(InputStream in, BufferAllocator allocator) {
+    this(Channels.newChannel(in), allocator);
+  }

-  @Override
-  public void close() throws IOException {
-    if (this.in != null) {
-      in.close();
-      in = null;
+  /**
+   * Reads the schema message from the beginning of the stream.
+   */
+  @Override
+  protected Schema readSchema(ReadChannel in) throws IOException {
+    return MessageSerializer.deserializeSchema(in);
   }
-  }

-  /**
-   * Reads the schema message from the beginning of the stream.
-   */
-  private Schema readSchema() throws IOException {
-    return MessageSerializer.deserializeSchema(in);
-  }
+  @Override
+  protected ArrowMessage readMessage(ReadChannel in, BufferAllocator allocator) throws IOException {
+    return MessageSerializer.deserializeMessageBatch(in, allocator);
+  }
 }
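The streaming format is schema first, then any dictionary batches, then record batches, each framed as a 4-byte little-endian metadata length, 8-byte-aligned flatbuffer metadata, and the body. A sketch of draining a stream with the low-level API (inputStream, allocator and visitor are assumed from context; deserializeMessageBatch() returns null once the input ends):

    ReadChannel in = new ReadChannel(Channels.newChannel(inputStream));
    Schema schema = MessageSerializer.deserializeSchema(in);
    ArrowMessage message;
    while ((message = MessageSerializer.deserializeMessageBatch(in, allocator)) != null) {
      message.accepts(visitor); // dispatch as sketched above; close() each message to release its buffers
    }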
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamWriter.java
index 60dc5861c9242..ea29cd99804c8 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamWriter.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamWriter.java
@@ -17,63 +17,40 @@
  */
 package org.apache.arrow.vector.stream;

+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.FieldVector;
+import org.apache.arrow.vector.VectorSchemaRoot;
+import org.apache.arrow.vector.dictionary.DictionaryProvider;
+import org.apache.arrow.vector.file.ArrowBlock;
+import org.apache.arrow.vector.file.ArrowWriter;
+import org.apache.arrow.vector.file.WriteChannel;
+import org.apache.arrow.vector.types.pojo.Field;
+import org.apache.arrow.vector.types.pojo.Schema;
+
 import java.io.IOException;
 import java.io.OutputStream;
 import java.nio.channels.Channels;
 import java.nio.channels.WritableByteChannel;
+import java.util.List;

-import org.apache.arrow.vector.file.WriteChannel;
-import org.apache.arrow.vector.schema.ArrowRecordBatch;
-import org.apache.arrow.vector.types.pojo.Schema;
-
-public class ArrowStreamWriter implements AutoCloseable {
-  private final WriteChannel out;
-  private final Schema schema;
-  private boolean headerSent = false;
+public class ArrowStreamWriter extends ArrowWriter {

-  /**
-   * Creates the stream writer. non-blocking.
-   * totalBatches can be set if the writer knows beforehand. Can be -1 if unknown.
-   */
-  public ArrowStreamWriter(WritableByteChannel out, Schema schema) {
-    this.out = new WriteChannel(out);
-    this.schema = schema;
-  }
-
-  public ArrowStreamWriter(OutputStream out, Schema schema)
-      throws IOException {
-    this(Channels.newChannel(out), schema);
-  }
-
-  public long bytesWritten() { return out.getCurrentPosition(); }
-
-  public void writeRecordBatch(ArrowRecordBatch batch) throws IOException {
-    // Send the header if we have not yet.
-    checkAndSendHeader();
-    MessageSerializer.serialize(out, batch);
-  }
+  public ArrowStreamWriter(VectorSchemaRoot root, DictionaryProvider provider, OutputStream out) {
+    this(root, provider, Channels.newChannel(out));
+  }

-  /**
-   * End the stream. This is not required and this object can simply be closed.
-   */
-  public void end() throws IOException {
-    checkAndSendHeader();
-    out.writeIntLittleEndian(0);
-  }
+  public ArrowStreamWriter(VectorSchemaRoot root, DictionaryProvider provider, WritableByteChannel out) {
+    super(root, provider, out);
+  }

-  @Override
-  public void close() throws IOException {
-    // The header might not have been sent if this is an empty stream. Send it even in
-    // this case so readers see a valid empty stream.
-    checkAndSendHeader();
-    out.close();
-  }
+  @Override
+  protected void startInternal(WriteChannel out) throws IOException {}

-  private void checkAndSendHeader() throws IOException {
-    if (!headerSent) {
-      MessageSerializer.serialize(out, schema);
-      headerSent = true;
+  @Override
+  protected void endInternal(WriteChannel out,
+                             Schema schema,
+                             List<ArrowBlock> dictionaries,
+                             List<ArrowBlock> records) throws IOException {
+    out.writeIntLittleEndian(0);
   }
-  }
 }
-
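Writing is symmetrical: the base class sends the schema and dictionaries on first use, and end() emits the zero-length end-of-stream marker from endInternal() above. A usage sketch (root, provider and outputStream prepared by the caller; close() also triggers end()):

    try (ArrowStreamWriter writer = new ArrowStreamWriter(root, provider, outputStream)) {
      writer.start();      // optional; writeBatch() starts lazily as well
      writer.writeBatch(); // frames whatever the root currently holds; refill and repeat per batch
      writer.end();        // writes the 4-byte zero marker so readers see a clean end of stream
    }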
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java b/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java
index 92df2504bcb23..92a6c0c26ba6e 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java
@@ -22,7 +22,11 @@
 import java.util.ArrayList;
 import java.util.List;

+import com.google.flatbuffers.FlatBufferBuilder;
+
+import io.netty.buffer.ArrowBuf;
 import org.apache.arrow.flatbuf.Buffer;
+import org.apache.arrow.flatbuf.DictionaryBatch;
 import org.apache.arrow.flatbuf.FieldNode;
 import org.apache.arrow.flatbuf.Message;
 import org.apache.arrow.flatbuf.MessageHeader;
@@ -33,14 +37,12 @@
 import org.apache.arrow.vector.file.ReadChannel;
 import org.apache.arrow.vector.file.WriteChannel;
 import org.apache.arrow.vector.schema.ArrowBuffer;
+import org.apache.arrow.vector.schema.ArrowDictionaryBatch;
 import org.apache.arrow.vector.schema.ArrowFieldNode;
+import org.apache.arrow.vector.schema.ArrowMessage;
 import org.apache.arrow.vector.schema.ArrowRecordBatch;
 import org.apache.arrow.vector.types.pojo.Schema;

-import com.google.flatbuffers.FlatBufferBuilder;
-
-import io.netty.buffer.ArrowBuf;
-
 /**
  * Utility class for serializing Messages. Messages are all serialized a similar way.
  * 1. 4 byte little endian message header prefix
@@ -81,35 +83,39 @@ public static long serialize(WriteChannel out, Schema schema) throws IOException
   /**
    * Deserializes a schema object. Format is from serialize().
    */
   public static Schema deserializeSchema(ReadChannel in) throws IOException {
-    Message message = deserializeMessage(in, MessageHeader.Schema);
+    Message message = deserializeMessage(in);
     if (message == null) {
       throw new IOException("Unexpected end of input. Missing schema.");
     }
+    if (message.headerType() != MessageHeader.Schema) {
+      throw new IOException("Expected schema but header was " + message.headerType());
+    }
     return Schema.convertSchema((org.apache.arrow.flatbuf.Schema)
         message.header(new org.apache.arrow.flatbuf.Schema()));
   }

+
   /**
    * Serializes an ArrowRecordBatch. Returns the offset and length of the written batch.
    */
   public static ArrowBlock serialize(WriteChannel out, ArrowRecordBatch batch)
-      throws IOException {
+    throws IOException {
+
     long start = out.getCurrentPosition();
     int bodyLength = batch.computeBodyLength();

     FlatBufferBuilder builder = new FlatBufferBuilder();
     int batchOffset = batch.writeTo(builder);

-    ByteBuffer serializedMessage = serializeMessage(builder, MessageHeader.RecordBatch,
-        batchOffset, bodyLength);
+    ByteBuffer serializedMessage = serializeMessage(builder, MessageHeader.RecordBatch, batchOffset, bodyLength);

     int metadataLength = serializedMessage.remaining();

-    // Add extra padding bytes so that length prefix + metadata is a multiple
-    // of 8 after alignment
-    if ((start + metadataLength + 4) % 8 != 0) {
-      metadataLength += 8 - (start + metadataLength + 4) % 8;
+    // calculate alignment bytes so that metadata length points to the correct location after alignment
+    int padding = (int)((start + metadataLength + 4) % 8);
+    if (padding != 0) {
+      metadataLength += (8 - padding);
     }

     out.writeIntLittleEndian(metadataLength);
@@ -118,6 +124,13 @@ public static ArrowBlock serialize(WriteChannel out, ArrowRecordBatch batch)
     // Align the output to 8 byte boundary.
     out.align();

+    long bufferLength = writeBatchBuffers(out, batch);
+
+    // Metadata size in the Block account for the size prefix
+    return new ArrowBlock(start, metadataLength + 4, bufferLength);
+  }
+
+  private static long writeBatchBuffers(WriteChannel out, ArrowRecordBatch batch) throws IOException {
     long bufferStart = out.getCurrentPosition();
     List<ArrowBuf> buffers = batch.getBuffers();
     List<ArrowBuffer> buffersLayout = batch.getBuffersLayout();
@@ -135,22 +148,14 @@ public static ArrowBlock serialize(WriteChannel out, ArrowRecordBatch batch)
           " != " + startPosition + layout.getSize());
       }
     }
-    // Metadata size in the Block account for the size prefix
-    return new ArrowBlock(start, metadataLength + 4, out.getCurrentPosition() - bufferStart);
+    return out.getCurrentPosition() - bufferStart;
   }

   /**
    * Deserializes a RecordBatch
    */
-  public static ArrowRecordBatch deserializeRecordBatch(ReadChannel in,
-      BufferAllocator alloc) throws IOException {
-    Message message = deserializeMessage(in, MessageHeader.RecordBatch);
-    if (message == null) return null;
-
-    if (message.bodyLength() > Integer.MAX_VALUE) {
-      throw new IOException("Cannot currently deserialize record batches over 2GB");
-    }
-
+  private static ArrowRecordBatch deserializeRecordBatch(ReadChannel in, Message message, BufferAllocator alloc)
+      throws IOException {
     RecordBatch recordBatchFB = (RecordBatch) message.header(new RecordBatch());

     int bodyLength = (int) message.bodyLength();

@@ -191,9 +196,7 @@ public static ArrowRecordBatch deserializeRecordBatch(ReadChannel in, ArrowBlock
     // Now read the body
     final ArrowBuf body = buffer.slice(block.getMetadataLength(),
         (int) totalLen - block.getMetadataLength());
-    ArrowRecordBatch result = deserializeRecordBatch(recordBatchFB, body);
-
-    return result;
+    return deserializeRecordBatch(recordBatchFB, body);
   }

   // Deserializes a record batch given the Flatbuffer metadata and in-memory body
@@ -218,6 +221,106 @@ private static ArrowRecordBatch deserializeRecordBatch(RecordBatch
recordBatchFB return arrowRecordBatch; } + /** + * Serializes a dictionary ArrowRecordBatch. Returns the offset and length of the written batch. + */ + public static ArrowBlock serialize(WriteChannel out, ArrowDictionaryBatch batch) throws IOException { + long start = out.getCurrentPosition(); + int bodyLength = batch.computeBodyLength(); + + FlatBufferBuilder builder = new FlatBufferBuilder(); + int batchOffset = batch.writeTo(builder); + + ByteBuffer serializedMessage = serializeMessage(builder, MessageHeader.DictionaryBatch, batchOffset, bodyLength); + + int metadataLength = serializedMessage.remaining(); + + // Add extra padding bytes so that length prefix + metadata is a multiple + // of 8 after alignment + if ((start + metadataLength + 4) % 8 != 0) { + metadataLength += 8 - (start + metadataLength + 4) % 8; + } + + out.writeIntLittleEndian(metadataLength); + out.write(serializedMessage); + + // Align the output to 8 byte boundary. + out.align(); + + // write the embedded record batch + long bufferLength = writeBatchBuffers(out, batch.getDictionary()); + + // Metadata size in the Block account for the size prefix + return new ArrowBlock(start, metadataLength + 4, bufferLength + 8); + } + + /** + * Deserializes a DictionaryBatch + */ + private static ArrowDictionaryBatch deserializeDictionaryBatch(ReadChannel in, + Message message, + BufferAllocator alloc) throws IOException { + DictionaryBatch dictionaryBatchFB = (DictionaryBatch) message.header(new DictionaryBatch()); + + int bodyLength = (int) message.bodyLength(); + + // Now read the record batch body + ArrowBuf body = alloc.buffer(bodyLength); + if (in.readFully(body, bodyLength) != bodyLength) { + throw new IOException("Unexpected end of input trying to read batch."); + } + ArrowRecordBatch recordBatch = deserializeRecordBatch(dictionaryBatchFB.data(), body); + return new ArrowDictionaryBatch(dictionaryBatchFB.id(), recordBatch); + } + + /** + * Deserializes a DictionaryBatch knowing the size of the entire message up front. This + * minimizes the number of reads to the underlying stream. 
+ */ + public static ArrowDictionaryBatch deserializeDictionaryBatch(ReadChannel in, + ArrowBlock block, + BufferAllocator alloc) throws IOException { + // Metadata length contains integer prefix plus byte padding + long totalLen = block.getMetadataLength() + block.getBodyLength(); + + if (totalLen > Integer.MAX_VALUE) { + throw new IOException("Cannot currently deserialize record batches over 2GB"); + } + + ArrowBuf buffer = alloc.buffer((int) totalLen); + if (in.readFully(buffer, (int) totalLen) != totalLen) { + throw new IOException("Unexpected end of input trying to read batch."); + } + + ArrowBuf metadataBuffer = buffer.slice(4, block.getMetadataLength() - 4); + + Message messageFB = + Message.getRootAsMessage(metadataBuffer.nioBuffer().asReadOnlyBuffer()); + + DictionaryBatch dictionaryBatchFB = (DictionaryBatch) messageFB.header(new DictionaryBatch()); + + // Now read the body + final ArrowBuf body = buffer.slice(block.getMetadataLength(), + (int) totalLen - block.getMetadataLength()); + ArrowRecordBatch recordBatch = deserializeRecordBatch(dictionaryBatchFB.data(), body); + return new ArrowDictionaryBatch(dictionaryBatchFB.id(), recordBatch); + } + + public static ArrowMessage deserializeMessageBatch(ReadChannel in, BufferAllocator alloc) throws IOException { + Message message = deserializeMessage(in); + if (message == null) { + return null; + } else if (message.bodyLength() > Integer.MAX_VALUE) { + throw new IOException("Cannot currently deserialize record batches over 2GB"); + } + + switch (message.headerType()) { + case MessageHeader.RecordBatch: return deserializeRecordBatch(in, message, alloc); + case MessageHeader.DictionaryBatch: return deserializeDictionaryBatch(in, message, alloc); + default: throw new IOException("Unexpected message header type " + message.headerType()); + } + } + /** * Serializes a message header. */ @@ -232,7 +335,7 @@ private static ByteBuffer serializeMessage(FlatBufferBuilder builder, byte heade return builder.dataBuffer(); } - private static Message deserializeMessage(ReadChannel in, byte headerType) throws IOException { + private static Message deserializeMessage(ReadChannel in) throws IOException { // Read the message size. There is an i32 little endian prefix. ByteBuffer buffer = ByteBuffer.allocate(4); if (in.readFully(buffer) != 4) return null; @@ -246,11 +349,6 @@ private static Message deserializeMessage(ReadChannel in, byte headerType) throw } buffer.rewind(); - Message message = Message.getRootAsMessage(buffer); - if (message.headerType() != headerType) { - throw new IOException("Invalid message: expecting " + headerType + - ". 
Message contained: " + message.headerType()); - } - return message; + return Message.getRootAsMessage(buffer); } } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java index ab539d5dc3b6e..8f2d04224c0fd 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -33,10 +33,10 @@ import org.apache.arrow.vector.NullableIntervalDayVector; import org.apache.arrow.vector.NullableIntervalYearVector; import org.apache.arrow.vector.NullableSmallIntVector; -import org.apache.arrow.vector.NullableTimeStampSecVector; -import org.apache.arrow.vector.NullableTimeStampMilliVector; import org.apache.arrow.vector.NullableTimeStampMicroVector; +import org.apache.arrow.vector.NullableTimeStampMilliVector; import org.apache.arrow.vector.NullableTimeStampNanoVector; +import org.apache.arrow.vector.NullableTimeStampSecVector; import org.apache.arrow.vector.NullableTimeVector; import org.apache.arrow.vector.NullableTinyIntVector; import org.apache.arrow.vector.NullableUInt1Vector; @@ -61,10 +61,10 @@ import org.apache.arrow.vector.complex.impl.IntervalYearWriterImpl; import org.apache.arrow.vector.complex.impl.NullableMapWriter; import org.apache.arrow.vector.complex.impl.SmallIntWriterImpl; -import org.apache.arrow.vector.complex.impl.TimeStampSecWriterImpl; -import org.apache.arrow.vector.complex.impl.TimeStampMilliWriterImpl; import org.apache.arrow.vector.complex.impl.TimeStampMicroWriterImpl; +import org.apache.arrow.vector.complex.impl.TimeStampMilliWriterImpl; import org.apache.arrow.vector.complex.impl.TimeStampNanoWriterImpl; +import org.apache.arrow.vector.complex.impl.TimeStampSecWriterImpl; import org.apache.arrow.vector.complex.impl.TimeWriterImpl; import org.apache.arrow.vector.complex.impl.TinyIntWriterImpl; import org.apache.arrow.vector.complex.impl.UInt1WriterImpl; @@ -92,6 +92,7 @@ import org.apache.arrow.vector.types.pojo.ArrowType.Timestamp; import org.apache.arrow.vector.types.pojo.ArrowType.Union; import org.apache.arrow.vector.types.pojo.ArrowType.Utf8; +import org.apache.arrow.vector.types.pojo.DictionaryEncoding; import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.util.CallBack; @@ -129,7 +130,7 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { return ZeroVector.INSTANCE; } @@ -145,8 +146,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableMapVector(name, allocator, callBack); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableMapVector(name, allocator, dictionary, callBack); } @Override @@ -161,8 +162,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableTinyIntVector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... 
precisionScale) { + return new NullableTinyIntVector(name, allocator, dictionary); } @Override @@ -177,8 +178,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableSmallIntVector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableSmallIntVector(name, allocator, dictionary); } @Override @@ -193,8 +194,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableIntVector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableIntVector(name, allocator, dictionary); } @Override @@ -209,8 +210,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableBigIntVector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableBigIntVector(name, allocator, dictionary); } @Override @@ -225,8 +226,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableDateVector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableDateVector(name, allocator, dictionary); } @Override @@ -241,8 +242,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableTimeVector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableTimeVector(name, allocator, dictionary); } @Override @@ -258,8 +259,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableTimeStampSecVector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableTimeStampSecVector(name, allocator, dictionary); } @Override @@ -275,8 +276,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableTimeStampMilliVector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableTimeStampMilliVector(name, allocator, dictionary); } @Override @@ -292,8 +293,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... 
precisionScale) { - return new NullableTimeStampMicroVector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableTimeStampMicroVector(name, allocator, dictionary); } @Override @@ -309,8 +310,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableTimeStampNanoVector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableTimeStampNanoVector(name, allocator, dictionary); } @Override @@ -325,8 +326,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableIntervalDayVector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableIntervalDayVector(name, allocator, dictionary); } @Override @@ -341,8 +342,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableIntervalDayVector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableIntervalDayVector(name, allocator, dictionary); } @Override @@ -358,8 +359,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableFloat4Vector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableFloat4Vector(name, allocator, dictionary); } @Override @@ -375,8 +376,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableFloat8Vector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableFloat8Vector(name, allocator, dictionary); } @Override @@ -391,8 +392,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableBitVector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableBitVector(name, allocator, dictionary); } @Override @@ -407,8 +408,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableVarCharVector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... 
precisionScale) { + return new NullableVarCharVector(name, allocator, dictionary); } @Override @@ -423,8 +424,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableVarBinaryVector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableVarBinaryVector(name, allocator, dictionary); } @Override @@ -443,8 +444,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableDecimalVector(name, allocator, precisionScale[0], precisionScale[1]); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableDecimalVector(name, allocator, dictionary, precisionScale[0], precisionScale[1]); } @Override @@ -459,8 +460,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableUInt1Vector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableUInt1Vector(name, allocator, dictionary); } @Override @@ -475,8 +476,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableUInt2Vector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableUInt2Vector(name, allocator, dictionary); } @Override @@ -491,8 +492,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableUInt4Vector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableUInt4Vector(name, allocator, dictionary); } @Override @@ -507,8 +508,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new NullableUInt8Vector(name, allocator); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new NullableUInt8Vector(name, allocator, dictionary); } @Override @@ -523,8 +524,8 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale) { - return new ListVector(name, allocator, callBack); + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + return new ListVector(name, allocator, dictionary, callBack); } @Override @@ -539,7 +540,10 @@ public Field getField() { } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... 
precisionScale) { + public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + if (dictionary != null) { + throw new UnsupportedOperationException("Dictionary encoding not supported for complex types"); + } return new UnionVector(name, allocator, callBack); } @@ -561,7 +565,7 @@ public ArrowType getType() { public abstract Field getField(); - public abstract FieldVector getNewVector(String name, BufferAllocator allocator, CallBack callBack, int... precisionScale); + public abstract FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale); public abstract FieldWriter getNewFieldWriter(ValueVector vector); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/DictionaryEncoding.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/DictionaryEncoding.java new file mode 100644 index 0000000000000..6d35cdef832f9 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/DictionaryEncoding.java @@ -0,0 +1,51 @@ +/******************************************************************************* + + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + ******************************************************************************/ +package org.apache.arrow.vector.types.pojo; + +import org.apache.arrow.vector.types.pojo.ArrowType.Int; + +public class DictionaryEncoding { + + private final long id; + private final boolean ordered; + private final Int indexType; + + public DictionaryEncoding(long id, boolean ordered, Int indexType) { + this.id = id; + this.ordered = ordered; + this.indexType = indexType == null ? 
new Int(32, true) : indexType;
+  }
+
+  public long getId() {
+    return id;
+  }
+
+  public boolean isOrdered() {
+    return ordered;
+  }
+
+  public Int getIndexType() {
+    return indexType;
+  }
+
+  @Override
+  public String toString() {
+    return "DictionaryEncoding[id=" + id + ",ordered=" + ordered + ",indexType=" + indexType + "]";
+  }
+}
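DictionaryEncoding now carries everything the flatbuffer can express: the id, the ordered flag, and the index type, defaulting to a signed 32-bit int when none is given. A sketch of declaring a dictionary-encoded field with it, as consumed by the Field changes that follow (the name and id are illustrative):

    // "city" values are stored as int32 indices into the dictionary registered under id 1.
    DictionaryEncoding encoding = new DictionaryEncoding(1L, false, null); // null -> new Int(32, true)
    Field city = new Field("city", true, new ArrowType.Utf8(), encoding,
        Collections.<Field>emptyList());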
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java
index f9b79ce556338..bbbd559f10a3d 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java
@@ -24,23 +24,27 @@
 import java.util.List;
 import java.util.Objects;

+import com.fasterxml.jackson.annotation.JsonCreator;
 import com.fasterxml.jackson.annotation.JsonInclude;
 import com.fasterxml.jackson.annotation.JsonInclude.Include;
-import org.apache.arrow.flatbuf.DictionaryEncoding;
-import org.apache.arrow.vector.schema.TypeLayout;
-import org.apache.arrow.vector.schema.VectorLayout;
-
-import com.fasterxml.jackson.annotation.JsonCreator;
 import com.fasterxml.jackson.annotation.JsonProperty;
 import com.google.common.base.Joiner;
 import com.google.common.collect.ImmutableList;
 import com.google.flatbuffers.FlatBufferBuilder;
+import org.apache.arrow.memory.BufferAllocator;
+import org.apache.arrow.vector.FieldVector;
+import org.apache.arrow.vector.schema.TypeLayout;
+import org.apache.arrow.vector.schema.VectorLayout;
+import org.apache.arrow.vector.types.Types;
+import org.apache.arrow.vector.types.Types.MinorType;
+import org.apache.arrow.vector.types.pojo.ArrowType.Int;
+

 public class Field {
   private final String name;
   private final boolean nullable;
   private final ArrowType type;
-  private final Long dictionary;
+  private final DictionaryEncoding dictionary;
   private final List<Field> children;
   private final TypeLayout typeLayout;

@@ -49,7 +53,7 @@ private Field(
       @JsonProperty("name") String name,
       @JsonProperty("nullable") boolean nullable,
       @JsonProperty("type") ArrowType type,
-      @JsonProperty("dictionary") Long dictionary,
+      @JsonProperty("dictionary") DictionaryEncoding dictionary,
       @JsonProperty("children") List<Field> children,
       @JsonProperty("typeLayout") TypeLayout typeLayout) {
     this.name = name;
@@ -68,18 +72,30 @@ public Field(String name, boolean nullable, ArrowType type, List<Field> children
     this(name, nullable, type, null, children, TypeLayout.getTypeLayout(checkNotNull(type)));
   }

-  public Field(String name, boolean nullable, ArrowType type, Long dictionary, List<Field> children) {
+  public Field(String name, boolean nullable, ArrowType type, DictionaryEncoding dictionary, List<Field> children) {
     this(name, nullable, type, dictionary, children, TypeLayout.getTypeLayout(checkNotNull(type)));
   }

+  public FieldVector createVector(BufferAllocator allocator) {
+    MinorType minorType = Types.getMinorTypeForArrowType(type);
+    FieldVector vector = minorType.getNewVector(name, allocator, dictionary, null);
+    vector.initializeChildrenFromFields(children);
+    return vector;
+  }
+
   public static Field convertField(org.apache.arrow.flatbuf.Field field) {
     String name = field.name();
     boolean nullable = field.nullable();
     ArrowType type = getTypeForField(field);
-    DictionaryEncoding dictionaryEncoding = field.dictionary();
-    Long dictionary = null;
-    if (dictionaryEncoding != null) {
-      dictionary = dictionaryEncoding.id();
+    DictionaryEncoding dictionary = null;
+    org.apache.arrow.flatbuf.DictionaryEncoding dictionaryFB = field.dictionary();
+    if (dictionaryFB != null) {
+      Int indexType = null;
+      org.apache.arrow.flatbuf.Int indexTypeFB = dictionaryFB.indexType();
+      if (indexTypeFB != null) {
+        indexType = new Int(indexTypeFB.bitWidth(), indexTypeFB.isSigned());
+      }
+      dictionary = new DictionaryEncoding(dictionaryFB.id(), dictionaryFB.isOrdered(), indexType);
     }
     ImmutableList.Builder<VectorLayout> layout = ImmutableList.builder();
     for (int i = 0; i < field.layoutLength(); ++i) {
@@ -105,8 +121,11 @@ public int getField(FlatBufferBuilder builder) {
     int typeOffset = type.getType(builder);
     int dictionaryOffset = -1;
     if (dictionary != null) {
-      builder.addLong(dictionary);
-      dictionaryOffset = builder.offset();
+      // TODO encode dictionary type - currently type is only signed 32 bit int (default null)
+      org.apache.arrow.flatbuf.DictionaryEncoding.startDictionaryEncoding(builder);
+      org.apache.arrow.flatbuf.DictionaryEncoding.addId(builder, dictionary.getId());
+      org.apache.arrow.flatbuf.DictionaryEncoding.addIsOrdered(builder, dictionary.isOrdered());
+      dictionaryOffset = org.apache.arrow.flatbuf.DictionaryEncoding.endDictionaryEncoding(builder);
     }
     int[] childrenData = new int[children.size()];
     for (int i = 0; i < children.size(); i++) {
@@ -126,11 +145,11 @@ public int getField(FlatBufferBuilder builder) {
     org.apache.arrow.flatbuf.Field.addNullable(builder, nullable);
     org.apache.arrow.flatbuf.Field.addTypeType(builder, type.getTypeID().getFlatbufID());
     org.apache.arrow.flatbuf.Field.addType(builder, typeOffset);
+    org.apache.arrow.flatbuf.Field.addChildren(builder, childrenOffset);
+    org.apache.arrow.flatbuf.Field.addLayout(builder, layoutOffset);
     if (dictionary != null) {
       org.apache.arrow.flatbuf.Field.addDictionary(builder, dictionaryOffset);
     }
-    org.apache.arrow.flatbuf.Field.addChildren(builder, childrenOffset);
-    org.apache.arrow.flatbuf.Field.addLayout(builder, layoutOffset);
     return org.apache.arrow.flatbuf.Field.endField(builder);
   }

@@ -147,7 +166,7 @@ public ArrowType getType() {
   }

   @JsonInclude(Include.NON_NULL)
-  public Long getDictionary() { return dictionary; }
+  public DictionaryEncoding getDictionary() { return dictionary; }

   public List<Field> getChildren() {
     return children;
@@ -168,8 +187,8 @@ public boolean equals(Object obj) {
         Objects.equals(this.type, that.type) &&
         Objects.equals(this.dictionary, that.dictionary) &&
         (Objects.equals(this.children, that.children) ||
-            (this.children == null && that.children.size() == 0) ||
-            (this.children.size() == 0 && that.children == null));
+            (this.children == null || this.children.size() == 0) &&
+            (that.children == null || that.children.size() == 0));
   }

   @Override
@@ -180,7 +199,7 @@ public String toString() {
     }
     sb.append(type);
     if (dictionary != null) {
-      sb.append("[dictionary: ").append(dictionary).append("]");
+      sb.append("[dictionary: ").append(dictionary.getId()).append("]");
     }
     if (!children.isEmpty()) {
       sb.append("<").append(Joiner.on(", ").join(children)).append(">");
diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestDecimalVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestDecimalVector.java
index cca35e44a215d..20f4aa8cf643d 100644
--- a/java/vector/src/test/java/org/apache/arrow/vector/TestDecimalVector.java
+++ b/java/vector/src/test/java/org/apache/arrow/vector/TestDecimalVector.java
@@ -44,7 +44,7 @@ public class TestDecimalVector {
   @Test
   public void test() {
     BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE);
-    NullableDecimalVector decimalVector = new NullableDecimalVector("decimal", allocator, 10, scale);
+    NullableDecimalVector
decimalVector = new NullableDecimalVector("decimal", allocator, null, 10, scale); decimalVector.allocateNew(); BigDecimal[] values = new BigDecimal[intValues.length]; for (int i = 0; i < intValues.length; i++) { diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestDictionaryVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestDictionaryVector.java index 962950abec87a..e3087ef8c95cc 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestDictionaryVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestDictionaryVector.java @@ -18,16 +18,16 @@ package org.apache.arrow.vector; import org.apache.arrow.memory.BufferAllocator; -import org.apache.arrow.vector.complex.DictionaryVector; -import org.apache.arrow.vector.types.Dictionary; +import org.apache.arrow.vector.dictionary.DictionaryEncoder; +import org.apache.arrow.vector.dictionary.Dictionary; import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.DictionaryEncoding; import org.junit.After; import org.junit.Before; import org.junit.Test; import java.nio.charset.StandardCharsets; -import static org.junit.Assert.assertArrayEquals; import static org.junit.Assert.assertEquals; public class TestDictionaryVector { @@ -49,65 +49,10 @@ public void terminate() throws Exception { } @Test - public void testEncodeStringsWithGeneratedDictionary() { + public void testEncodeStrings() { // Create a new value vector - try (final NullableVarCharVector vector = (NullableVarCharVector) MinorType.VARCHAR.getNewVector("foo", allocator, null)) { - final NullableVarCharVector.Mutator m = vector.getMutator(); - vector.allocateNew(512, 5); - - // set some values - m.setSafe(0, zero, 0, zero.length); - m.setSafe(1, one, 0, one.length); - m.setSafe(2, one, 0, one.length); - m.setSafe(3, two, 0, two.length); - m.setSafe(4, zero, 0, zero.length); - m.setValueCount(5); - - DictionaryVector encoded = DictionaryVector.encode(vector); - - try { - // verify values in the dictionary - ValueVector dictionary = encoded.getDictionaryVector(); - assertEquals(vector.getClass(), dictionary.getClass()); - - NullableVarCharVector.Accessor dictionaryAccessor = ((NullableVarCharVector) dictionary).getAccessor(); - assertEquals(3, dictionaryAccessor.getValueCount()); - assertArrayEquals(zero, dictionaryAccessor.get(0)); - assertArrayEquals(one, dictionaryAccessor.get(1)); - assertArrayEquals(two, dictionaryAccessor.get(2)); - - // verify indices - ValueVector indices = encoded.getIndexVector(); - assertEquals(NullableIntVector.class, indices.getClass()); - - NullableIntVector.Accessor indexAccessor = ((NullableIntVector) indices).getAccessor(); - assertEquals(5, indexAccessor.getValueCount()); - assertEquals(0, indexAccessor.get(0)); - assertEquals(1, indexAccessor.get(1)); - assertEquals(1, indexAccessor.get(2)); - assertEquals(2, indexAccessor.get(3)); - assertEquals(0, indexAccessor.get(4)); - - // now run through the decoder and verify we get the original back - try (ValueVector decoded = DictionaryVector.decode(indices, encoded.getDictionary())) { - assertEquals(vector.getClass(), decoded.getClass()); - assertEquals(vector.getAccessor().getValueCount(), decoded.getAccessor().getValueCount()); - for (int i = 0; i < 5; i++) { - assertEquals(vector.getAccessor().getObject(i), decoded.getAccessor().getObject(i)); - } - } - } finally { - encoded.getDictionaryVector().close(); - encoded.getIndexVector().close(); - } - } - } - - @Test - public void testEncodeStringsWithProvidedDictionary() { 
- // Create a new value vector - try (final NullableVarCharVector vector = (NullableVarCharVector) MinorType.VARCHAR.getNewVector("foo", allocator, null); - final NullableVarCharVector dictionary = (NullableVarCharVector) MinorType.VARCHAR.getNewVector("dict", allocator, null)) { + try (final NullableVarCharVector vector = (NullableVarCharVector) MinorType.VARCHAR.getNewVector("foo", allocator, null, null); + final NullableVarCharVector dictionaryVector = (NullableVarCharVector) MinorType.VARCHAR.getNewVector("dict", allocator, null, null)) { final NullableVarCharVector.Mutator m = vector.getMutator(); vector.allocateNew(512, 5); @@ -120,19 +65,20 @@ public void testEncodeStringsWithProvidedDictionary() { m.setValueCount(5); // set some dictionary values - final NullableVarCharVector.Mutator m2 = dictionary.getMutator(); - dictionary.allocateNew(512, 3); + final NullableVarCharVector.Mutator m2 = dictionaryVector.getMutator(); + dictionaryVector.allocateNew(512, 3); m2.setSafe(0, zero, 0, zero.length); m2.setSafe(1, one, 0, one.length); m2.setSafe(2, two, 0, two.length); m2.setValueCount(3); - try(final DictionaryVector encoded = DictionaryVector.encode(vector, new Dictionary(dictionary, false))) { + Dictionary dictionary = new Dictionary(dictionaryVector, new DictionaryEncoding(1L, false, null)); + + try(final ValueVector encoded = (FieldVector) DictionaryEncoder.encode(vector, dictionary)) { // verify indices - ValueVector indices = encoded.getIndexVector(); - assertEquals(NullableIntVector.class, indices.getClass()); + assertEquals(NullableIntVector.class, encoded.getClass()); - NullableIntVector.Accessor indexAccessor = ((NullableIntVector) indices).getAccessor(); + NullableIntVector.Accessor indexAccessor = ((NullableIntVector) encoded).getAccessor(); assertEquals(5, indexAccessor.getValueCount()); assertEquals(0, indexAccessor.get(0)); assertEquals(1, indexAccessor.get(1)); @@ -141,7 +87,7 @@ public void testEncodeStringsWithProvidedDictionary() { assertEquals(0, indexAccessor.get(4)); // now run through the decoder and verify we get the original back - try (ValueVector decoded = DictionaryVector.decode(indices, encoded.getDictionary())) { + try (ValueVector decoded = DictionaryEncoder.decode(encoded, dictionary)) { assertEquals(vector.getClass(), decoded.getClass()); assertEquals(vector.getAccessor().getValueCount(), decoded.getAccessor().getValueCount()); for (int i = 0; i < 5; i++) { diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java index 1f0baaed776a1..18d93b6401e39 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestListVector.java @@ -42,8 +42,8 @@ public void terminate() throws Exception { @Test public void testCopyFrom() throws Exception { - try (ListVector inVector = new ListVector("input", allocator, null); - ListVector outVector = new ListVector("output", allocator, null)) { + try (ListVector inVector = new ListVector("input", allocator, null, null); + ListVector outVector = new ListVector("output", allocator, null, null)) { UnionListWriter writer = inVector.getWriter(); writer.allocate(); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java index 774b59e3683e3..6917638d74e4d 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java +++ 
b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java @@ -86,7 +86,7 @@ public void testFixedType() { public void testNullableVarLen2() { // Create a new value vector for 1024 integers. - try (final NullableVarCharVector vector = new NullableVarCharVector(EMPTY_SCHEMA_PATH, allocator)) { + try (final NullableVarCharVector vector = new NullableVarCharVector(EMPTY_SCHEMA_PATH, allocator, null)) { final NullableVarCharVector.Mutator m = vector.getMutator(); vector.allocateNew(1024 * 10, 1024); @@ -116,7 +116,7 @@ public void testNullableVarLen2() { public void testNullableFixedType() { // Create a new value vector for 1024 integers. - try (final NullableUInt4Vector vector = new NullableUInt4Vector(EMPTY_SCHEMA_PATH, allocator)) { + try (final NullableUInt4Vector vector = new NullableUInt4Vector(EMPTY_SCHEMA_PATH, allocator, null)) { final NullableUInt4Vector.Mutator m = vector.getMutator(); vector.allocateNew(1024); @@ -186,7 +186,7 @@ public void testNullableFixedType() { @Test public void testNullableFloat() { // Create a new value vector for 1024 integers - try (final NullableFloat4Vector vector = (NullableFloat4Vector) MinorType.FLOAT4.getNewVector(EMPTY_SCHEMA_PATH, allocator, null)) { + try (final NullableFloat4Vector vector = (NullableFloat4Vector) MinorType.FLOAT4.getNewVector(EMPTY_SCHEMA_PATH, allocator, null, null)) { final NullableFloat4Vector.Mutator m = vector.getMutator(); vector.allocateNew(1024); @@ -233,7 +233,7 @@ public void testNullableFloat() { @Test public void testNullableInt() { // Create a new value vector for 1024 integers - try (final NullableIntVector vector = (NullableIntVector) MinorType.INT.getNewVector(EMPTY_SCHEMA_PATH, allocator, null)) { + try (final NullableIntVector vector = (NullableIntVector) MinorType.INT.getNewVector(EMPTY_SCHEMA_PATH, allocator, null, null)) { final NullableIntVector.Mutator m = vector.getMutator(); vector.allocateNew(1024); @@ -403,7 +403,7 @@ private void validateRange(int length, int start, int count) { @Test public void testReAllocNullableFixedWidthVector() { // Create a new value vector for 1024 integers - try (final NullableFloat4Vector vector = (NullableFloat4Vector) MinorType.FLOAT4.getNewVector(EMPTY_SCHEMA_PATH, allocator, null)) { + try (final NullableFloat4Vector vector = (NullableFloat4Vector) MinorType.FLOAT4.getNewVector(EMPTY_SCHEMA_PATH, allocator, null, null)) { final NullableFloat4Vector.Mutator m = vector.getMutator(); vector.allocateNew(1024); @@ -436,7 +436,7 @@ public void testReAllocNullableFixedWidthVector() { @Test public void testReAllocNullableVariableWidthVector() { // Create a new value vector for 1024 integers - try (final NullableVarCharVector vector = (NullableVarCharVector) MinorType.VARCHAR.getNewVector(EMPTY_SCHEMA_PATH, allocator, null)) { + try (final NullableVarCharVector vector = (NullableVarCharVector) MinorType.VARCHAR.getNewVector(EMPTY_SCHEMA_PATH, allocator, null, null)) { final NullableVarCharVector.Mutator m = vector.getMutator(); vector.allocateNew(); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java b/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java index 79c9d5046acd6..372bcf0da6e9a 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestVectorUnloadLoad.java @@ -27,6 +27,7 @@ import java.util.Collections; import java.util.List; +import io.netty.buffer.ArrowBuf; import org.apache.arrow.memory.BufferAllocator; 
import org.apache.arrow.memory.RootAllocator; import org.apache.arrow.vector.complex.MapVector; @@ -46,8 +47,6 @@ import org.junit.Assert; import org.junit.Test; -import io.netty.buffer.ArrowBuf; - public class TestVectorUnloadLoad { static final BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); @@ -81,8 +80,8 @@ public void testUnloadLoad() throws IOException { try ( ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch(); BufferAllocator finalVectorsAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); - VectorSchemaRoot newRoot = new VectorSchemaRoot(schema, finalVectorsAllocator); - ) { + VectorSchemaRoot newRoot = VectorSchemaRoot.create(schema, finalVectorsAllocator); + ) { // load it VectorLoader vectorLoader = new VectorLoader(newRoot); @@ -131,8 +130,8 @@ public void testUnloadLoadAddPadding() throws IOException { try ( ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch(); BufferAllocator finalVectorsAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); - VectorSchemaRoot newRoot = new VectorSchemaRoot(schema, finalVectorsAllocator); - ) { + VectorSchemaRoot newRoot = VectorSchemaRoot.create(schema, finalVectorsAllocator); + ) { List oldBuffers = recordBatch.getBuffers(); List newBuffers = new ArrayList<>(); for (ArrowBuf oldBuffer : oldBuffers) { @@ -185,7 +184,7 @@ public void testLoadEmptyValidityBuffer() throws IOException { Schema schema = new Schema(asList( new Field("intDefined", true, new ArrowType.Int(32, true), Collections.emptyList()), new Field("intNull", true, new ArrowType.Int(32, true), Collections.emptyList()) - )); + )); int count = 10; ArrowBuf validity = allocator.buffer(10).slice(0, 0); ArrowBuf[] values = new ArrowBuf[2]; @@ -200,8 +199,8 @@ public void testLoadEmptyValidityBuffer() throws IOException { try ( ArrowRecordBatch recordBatch = new ArrowRecordBatch(count, asList(new ArrowFieldNode(count, 0), new ArrowFieldNode(count, count)), asList(validity, values[0], validity, values[1])); BufferAllocator finalVectorsAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); - VectorSchemaRoot newRoot = new VectorSchemaRoot(schema, finalVectorsAllocator); - ) { + VectorSchemaRoot newRoot = VectorSchemaRoot.create(schema, finalVectorsAllocator); + ) { // load it VectorLoader vectorLoader = new VectorLoader(newRoot); @@ -244,11 +243,12 @@ public static VectorUnloader newVectorUnloader(FieldVector root) { Schema schema = new Schema(root.getField().getChildren()); int valueCount = root.getAccessor().getValueCount(); List fields = root.getChildrenFromFields(); - return new VectorUnloader(schema, valueCount, fields); + VectorSchemaRoot vsr = new VectorSchemaRoot(schema.getFields(), fields, valueCount); + return new VectorUnloader(vsr); } @AfterClass public static void afterClass() { allocator.close(); } -} +} \ No newline at end of file diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java index 58312b3f9ff9c..2b49d8ed4b582 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java @@ -53,7 +53,7 @@ public void terminate() throws Exception { public void testPromoteToUnion() throws Exception { try (final MapVector container = new MapVector(EMPTY_SCHEMA_PATH, allocator, null); - final 
NullableMapVector v = container.addOrGet("test", MinorType.MAP, NullableMapVector.class); + final NullableMapVector v = container.addOrGet("test", MinorType.MAP, NullableMapVector.class, null); final PromotableWriter writer = new PromotableWriter(v, container)) { container.allocateNew(); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java index 7a2d416241b78..a8a2d512c09ec 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java @@ -181,7 +181,7 @@ public void testList() { @Test public void listScalarType() { - ListVector listVector = new ListVector("list", allocator, null); + ListVector listVector = new ListVector("list", allocator, null, null); listVector.allocateNew(); UnionListWriter listWriter = new UnionListWriter(listVector); for (int i = 0; i < COUNT; i++) { @@ -204,7 +204,7 @@ public void listScalarType() { @Test public void listScalarTypeNullable() { - ListVector listVector = new ListVector("list", allocator, null); + ListVector listVector = new ListVector("list", allocator, null, null); listVector.allocateNew(); UnionListWriter listWriter = new UnionListWriter(listVector); for (int i = 0; i < COUNT; i++) { @@ -233,7 +233,7 @@ public void listScalarTypeNullable() { @Test public void listMapType() { - ListVector listVector = new ListVector("list", allocator, null); + ListVector listVector = new ListVector("list", allocator, null, null); listVector.allocateNew(); UnionListWriter listWriter = new UnionListWriter(listVector); MapWriter mapWriter = listWriter.map(); @@ -261,7 +261,7 @@ public void listMapType() { @Test public void listListType() { - try (ListVector listVector = new ListVector("list", allocator, null)) { + try (ListVector listVector = new ListVector("list", allocator, null, null)) { listVector.allocateNew(); UnionListWriter listWriter = new UnionListWriter(listVector); for (int i = 0; i < COUNT; i++) { @@ -286,7 +286,7 @@ public void listListType() { */ @Test public void listListType2() { - try (ListVector listVector = new ListVector("list", allocator, null)) { + try (ListVector listVector = new ListVector("list", allocator, null, null)) { listVector.allocateNew(); UnionListWriter listWriter = new UnionListWriter(listVector); ListWriter innerListWriter = listWriter.list(); @@ -324,7 +324,7 @@ private void checkListOfLists(final ListVector listVector) { @Test public void unionListListType() { - try (ListVector listVector = new ListVector("list", allocator, null)) { + try (ListVector listVector = new ListVector("list", allocator, null, null)) { listVector.allocateNew(); UnionListWriter listWriter = new UnionListWriter(listVector); for (int i = 0; i < COUNT; i++) { @@ -353,7 +353,7 @@ public void unionListListType() { */ @Test public void unionListListType2() { - try (ListVector listVector = new ListVector("list", allocator, null)) { + try (ListVector listVector = new ListVector("list", allocator, null, null)) { listVector.allocateNew(); UnionListWriter listWriter = new UnionListWriter(listVector); ListWriter innerListWriter = listWriter.list(); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java index a83a2833c88bf..75e5d2d6e5c98 100644 --- 
a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java @@ -17,31 +17,44 @@ */ package org.apache.arrow.vector.file; -import static org.apache.arrow.vector.TestVectorUnloadLoad.newVectorUnloader; -import static org.junit.Assert.assertTrue; - import java.io.ByteArrayInputStream; import java.io.ByteArrayOutputStream; import java.io.File; import java.io.FileInputStream; -import java.io.FileNotFoundException; import java.io.FileOutputStream; import java.io.IOException; import java.io.OutputStream; +import java.nio.charset.StandardCharsets; +import java.util.Arrays; import java.util.List; +import com.google.common.collect.ImmutableList; + import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.FieldVector; -import org.apache.arrow.vector.VectorLoader; +import org.apache.arrow.vector.NullableTinyIntVector; +import org.apache.arrow.vector.NullableVarCharVector; import org.apache.arrow.vector.VectorSchemaRoot; -import org.apache.arrow.vector.VectorUnloader; +import org.apache.arrow.vector.complex.ListVector; import org.apache.arrow.vector.complex.MapVector; import org.apache.arrow.vector.complex.NullableMapVector; +import org.apache.arrow.vector.complex.impl.UnionListWriter; +import org.apache.arrow.vector.dictionary.Dictionary; +import org.apache.arrow.vector.dictionary.DictionaryProvider; +import org.apache.arrow.vector.dictionary.DictionaryProvider.MapDictionaryProvider; +import org.apache.arrow.vector.dictionary.DictionaryEncoder; import org.apache.arrow.vector.schema.ArrowBuffer; +import org.apache.arrow.vector.schema.ArrowMessage; import org.apache.arrow.vector.schema.ArrowRecordBatch; import org.apache.arrow.vector.stream.ArrowStreamReader; import org.apache.arrow.vector.stream.ArrowStreamWriter; +import org.apache.arrow.vector.stream.MessageSerializerTest; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.ArrowType.Int; +import org.apache.arrow.vector.types.pojo.DictionaryEncoding; +import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.types.pojo.Schema; +import org.apache.arrow.vector.util.Text; import org.junit.Assert; import org.junit.Test; import org.slf4j.Logger; @@ -68,7 +81,7 @@ public void testWriteComplex() throws IOException { int count = COUNT; try ( BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); - NullableMapVector parent = new NullableMapVector("parent", vectorAllocator, null)) { + NullableMapVector parent = new NullableMapVector("parent", vectorAllocator, null, null)) { writeComplexData(count, parent); FieldVector root = parent.getChild("root"); validateComplexContent(count, new VectorSchemaRoot(root)); @@ -83,71 +96,63 @@ public void testWriteRead() throws IOException { int count = COUNT; // write - try ( - BufferAllocator originalVectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); - MapVector parent = new MapVector("parent", originalVectorAllocator, null)) { + try (BufferAllocator originalVectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", originalVectorAllocator, null)) { writeData(count, parent); write(parent.getChild("root"), file, stream); } // read - try ( - BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); - FileInputStream fileInputStream = new 
FileInputStream(file); - ArrowReader arrowReader = new ArrowReader(fileInputStream.getChannel(), readerAllocator); - BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); - MapVector parent = new MapVector("parent", vectorAllocator, null) - ) { - ArrowFooter footer = arrowReader.readFooter(); - Schema schema = footer.getSchema(); - LOGGER.debug("reading schema: " + schema); - - // initialize vectors - - try (VectorSchemaRoot root = new VectorSchemaRoot(schema, vectorAllocator)) { - VectorLoader vectorLoader = new VectorLoader(root); - - List recordBatches = footer.getRecordBatches(); - for (ArrowBlock rbBlock : recordBatches) { - try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { - List buffersLayout = recordBatch.getBuffersLayout(); - for (ArrowBuffer arrowBuffer : buffersLayout) { - Assert.assertEquals(0, arrowBuffer.getOffset() % 8); + try (BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + FileInputStream fileInputStream = new FileInputStream(file); + ArrowFileReader arrowReader = new ArrowFileReader(fileInputStream.getChannel(), readerAllocator){ + @Override + protected ArrowMessage readMessage(SeekableReadChannel in, BufferAllocator allocator) throws IOException { + ArrowMessage message = super.readMessage(in, allocator); + if (message != null) { + ArrowRecordBatch batch = (ArrowRecordBatch) message; + List buffersLayout = batch.getBuffersLayout(); + for (ArrowBuffer arrowBuffer : buffersLayout) { + Assert.assertEquals(0, arrowBuffer.getOffset() % 8); + } + } + return message; } - vectorLoader.load(recordBatch); - } - - validateContent(count, root); - } + }) { + Schema schema = arrowReader.getVectorSchemaRoot().getSchema(); + LOGGER.debug("reading schema: " + schema); + VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); + for (ArrowBlock rbBlock : arrowReader.getRecordBlocks()) { + arrowReader.loadRecordBatch(rbBlock); + Assert.assertEquals(count, root.getRowCount()); + validateContent(count, root); } } // Read from stream. 
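The file-read path above now goes through ArrowFileReader: the reader owns a single VectorSchemaRoot, and loadRecordBatch(...) repopulates that root in place for each ArrowBlock, replacing the old readFooter/VectorLoader sequence. A minimal sketch of the new pattern, assuming the 0.x-era API shown in this patch and a hypothetical file path:

import java.io.FileInputStream;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.file.ArrowBlock;
import org.apache.arrow.vector.file.ArrowFileReader;

public class FileReadSketch {
  public static void main(String[] args) throws Exception {
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         FileInputStream in = new FileInputStream("target/example.arrow"); // hypothetical path
         ArrowFileReader reader = new ArrowFileReader(in.getChannel(), allocator)) {
      VectorSchemaRoot root = reader.getVectorSchemaRoot(); // one root, reused for every batch
      for (ArrowBlock block : reader.getRecordBlocks()) {
        reader.loadRecordBatch(block); // fills the root's vectors in place
        System.out.println("rows: " + root.getRowCount());
      }
    }
  }
}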
- try ( - BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); - ByteArrayInputStream input = new ByteArrayInputStream(stream.toByteArray()); - ArrowStreamReader arrowReader = new ArrowStreamReader(input, readerAllocator); - BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); - MapVector parent = new MapVector("parent", vectorAllocator, null) - ) { - arrowReader.init(); - Schema schema = arrowReader.getSchema(); + try (BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + ByteArrayInputStream input = new ByteArrayInputStream(stream.toByteArray()); + ArrowStreamReader arrowReader = new ArrowStreamReader(input, readerAllocator){ + @Override + protected ArrowMessage readMessage(ReadChannel in, BufferAllocator allocator) throws IOException { + ArrowMessage message = super.readMessage(in, allocator); + if (message != null) { + ArrowRecordBatch batch = (ArrowRecordBatch) message; + List buffersLayout = batch.getBuffersLayout(); + for (ArrowBuffer arrowBuffer : buffersLayout) { + Assert.assertEquals(0, arrowBuffer.getOffset() % 8); + } + } + return message; + } + }) { + + VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); + Schema schema = root.getSchema(); LOGGER.debug("reading schema: " + schema); - - try (VectorSchemaRoot root = new VectorSchemaRoot(schema, vectorAllocator)) { - VectorLoader vectorLoader = new VectorLoader(root); - while (true) { - try (ArrowRecordBatch recordBatch = arrowReader.nextRecordBatch()) { - if (recordBatch == null) break; - List buffersLayout = recordBatch.getBuffersLayout(); - for (ArrowBuffer arrowBuffer : buffersLayout) { - Assert.assertEquals(0, arrowBuffer.getOffset() % 8); - } - vectorLoader.load(recordBatch); - } - } - validateContent(count, root); - } + arrowReader.loadNextBatch(); + Assert.assertEquals(count, root.getRowCount()); + validateContent(count, root); } } @@ -158,61 +163,37 @@ public void testWriteReadComplex() throws IOException { int count = COUNT; // write - try ( - BufferAllocator originalVectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); - MapVector parent = new MapVector("parent", originalVectorAllocator, null)) { + try (BufferAllocator originalVectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", originalVectorAllocator, null)) { writeComplexData(count, parent); write(parent.getChild("root"), file, stream); } // read - try ( - BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); - FileInputStream fileInputStream = new FileInputStream(file); - ArrowReader arrowReader = new ArrowReader(fileInputStream.getChannel(), readerAllocator); - BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); - NullableMapVector parent = new NullableMapVector("parent", vectorAllocator, null) - ) { - ArrowFooter footer = arrowReader.readFooter(); - Schema schema = footer.getSchema(); + try (BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + FileInputStream fileInputStream = new FileInputStream(file); + ArrowFileReader arrowReader = new ArrowFileReader(fileInputStream.getChannel(), readerAllocator)) { + VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); + Schema schema = root.getSchema(); LOGGER.debug("reading schema: " + schema); - // initialize vectors - - try 
(VectorSchemaRoot root = new VectorSchemaRoot(schema, vectorAllocator)) { - VectorLoader vectorLoader = new VectorLoader(root); - List recordBatches = footer.getRecordBatches(); - for (ArrowBlock rbBlock : recordBatches) { - try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { - vectorLoader.load(recordBatch); - } - validateComplexContent(count, root); - } + for (ArrowBlock rbBlock : arrowReader.getRecordBlocks()) { + arrowReader.loadRecordBatch(rbBlock); + Assert.assertEquals(count, root.getRowCount()); + validateComplexContent(count, root); } } // Read from stream. - try ( - BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); - ByteArrayInputStream input = new ByteArrayInputStream(stream.toByteArray()); - ArrowStreamReader arrowReader = new ArrowStreamReader(input, readerAllocator); - BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); - MapVector parent = new MapVector("parent", vectorAllocator, null) - ) { - arrowReader.init(); - Schema schema = arrowReader.getSchema(); + try (BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + ByteArrayInputStream input = new ByteArrayInputStream(stream.toByteArray()); + ArrowStreamReader arrowReader = new ArrowStreamReader(input, readerAllocator)) { + VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); + Schema schema = root.getSchema(); LOGGER.debug("reading schema: " + schema); - - try (VectorSchemaRoot root = new VectorSchemaRoot(schema, vectorAllocator)) { - VectorLoader vectorLoader = new VectorLoader(root); - while (true) { - try (ArrowRecordBatch recordBatch = arrowReader.nextRecordBatch()) { - if (recordBatch == null) break; - vectorLoader.load(recordBatch); - } - } - validateComplexContent(count, root); - } + arrowReader.loadNextBatch(); + Assert.assertEquals(count, root.getRowCount()); + validateComplexContent(count, root); } } @@ -223,94 +204,70 @@ public void testWriteReadMultipleRBs() throws IOException { int[] counts = { 10, 5 }; // write - try ( - BufferAllocator originalVectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); - MapVector parent = new MapVector("parent", originalVectorAllocator, null); - FileOutputStream fileOutputStream = new FileOutputStream(file);) { + try (BufferAllocator originalVectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", originalVectorAllocator, null); + FileOutputStream fileOutputStream = new FileOutputStream(file)){ writeData(counts[0], parent); - VectorUnloader vectorUnloader0 = newVectorUnloader(parent.getChild("root")); - Schema schema = vectorUnloader0.getSchema(); - Assert.assertEquals(2, schema.getFields().size()); - try (ArrowWriter arrowWriter = new ArrowWriter(fileOutputStream.getChannel(), schema); - ArrowStreamWriter streamWriter = new ArrowStreamWriter(stream, schema)) { - try (ArrowRecordBatch recordBatch = vectorUnloader0.getRecordBatch()) { - Assert.assertEquals("RB #0", counts[0], recordBatch.getLength()); - arrowWriter.writeRecordBatch(recordBatch); - streamWriter.writeRecordBatch(recordBatch); - } + VectorSchemaRoot root = new VectorSchemaRoot(parent.getChild("root")); + + try(ArrowFileWriter fileWriter = new ArrowFileWriter(root, null, fileOutputStream.getChannel()); + ArrowStreamWriter streamWriter = new ArrowStreamWriter(root, null, stream)) { + fileWriter.start(); + streamWriter.start(); + + 
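// Writer lifecycle introduced by this patch: start() writes the schema header, each
// writeBatch() snapshots the root's current vectors as one record batch, and end()
// finishes the file footer or the end-of-stream marker. The same root can be refilled
// between writeBatch() calls, which is what the second writeData(...) below relies on.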
fileWriter.writeBatch(); + streamWriter.writeBatch(); + parent.allocateNew(); writeData(counts[1], parent); // if we write the same data we don't catch that the metadata is stored in the wrong order. - VectorUnloader vectorUnloader1 = newVectorUnloader(parent.getChild("root")); - try (ArrowRecordBatch recordBatch = vectorUnloader1.getRecordBatch()) { - Assert.assertEquals("RB #1", counts[1], recordBatch.getLength()); - arrowWriter.writeRecordBatch(recordBatch); - streamWriter.writeRecordBatch(recordBatch); - } + root.setRowCount(counts[1]); + + fileWriter.writeBatch(); + streamWriter.writeBatch(); + + fileWriter.end(); + streamWriter.end(); } } - // read - try ( - BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); - FileInputStream fileInputStream = new FileInputStream(file); - ArrowReader arrowReader = new ArrowReader(fileInputStream.getChannel(), readerAllocator); - BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); - MapVector parent = new MapVector("parent", vectorAllocator, null); - ) { - ArrowFooter footer = arrowReader.readFooter(); - Schema schema = footer.getSchema(); + // read file + try (BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + FileInputStream fileInputStream = new FileInputStream(file); + ArrowFileReader arrowReader = new ArrowFileReader(fileInputStream.getChannel(), readerAllocator)) { + VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); + Schema schema = root.getSchema(); LOGGER.debug("reading schema: " + schema); int i = 0; - try (VectorSchemaRoot root = new VectorSchemaRoot(schema, vectorAllocator);) { - VectorLoader vectorLoader = new VectorLoader(root); - List recordBatches = footer.getRecordBatches(); - Assert.assertEquals(2, recordBatches.size()); - long previousOffset = 0; - for (ArrowBlock rbBlock : recordBatches) { - Assert.assertTrue(rbBlock.getOffset() + " > " + previousOffset, rbBlock.getOffset() > previousOffset); - previousOffset = rbBlock.getOffset(); - try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { - Assert.assertEquals("RB #" + i, counts[i], recordBatch.getLength()); - List buffersLayout = recordBatch.getBuffersLayout(); - for (ArrowBuffer arrowBuffer : buffersLayout) { - Assert.assertEquals(0, arrowBuffer.getOffset() % 8); - } - vectorLoader.load(recordBatch); - validateContent(counts[i], root); - } - ++i; - } + List recordBatches = arrowReader.getRecordBlocks(); + Assert.assertEquals(2, recordBatches.size()); + long previousOffset = 0; + for (ArrowBlock rbBlock : recordBatches) { + Assert.assertTrue(rbBlock.getOffset() + " > " + previousOffset, rbBlock.getOffset() > previousOffset); + previousOffset = rbBlock.getOffset(); + arrowReader.loadRecordBatch(rbBlock); + Assert.assertEquals("RB #" + i, counts[i], root.getRowCount()); + validateContent(counts[i], root); + ++i; } } // read stream - try ( - BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); - ByteArrayInputStream input = new ByteArrayInputStream(stream.toByteArray()); - ArrowStreamReader arrowReader = new ArrowStreamReader(input, readerAllocator); - BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); - MapVector parent = new MapVector("parent", vectorAllocator, null) - ) { - arrowReader.init(); - Schema schema = arrowReader.getSchema(); + try (BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, 
Integer.MAX_VALUE); + ByteArrayInputStream input = new ByteArrayInputStream(stream.toByteArray()); + ArrowStreamReader arrowReader = new ArrowStreamReader(input, readerAllocator)) { + VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); + Schema schema = root.getSchema(); LOGGER.debug("reading schema: " + schema); int i = 0; - try (VectorSchemaRoot root = new VectorSchemaRoot(schema, vectorAllocator);) { - VectorLoader vectorLoader = new VectorLoader(root); - for (int n = 0; n < 2; n++) { - try (ArrowRecordBatch recordBatch = arrowReader.nextRecordBatch()) { - assertTrue(recordBatch != null); - Assert.assertEquals("RB #" + i, counts[i], recordBatch.getLength()); - List buffersLayout = recordBatch.getBuffersLayout(); - for (ArrowBuffer arrowBuffer : buffersLayout) { - Assert.assertEquals(0, arrowBuffer.getOffset() % 8); - } - vectorLoader.load(recordBatch); - validateContent(counts[i], root); - } - ++i; - } + + for (int n = 0; n < 2; n++) { + arrowReader.loadNextBatch(); + Assert.assertEquals("RB #" + i, counts[i], root.getRowCount()); + validateContent(counts[i], root); + ++i; } + arrowReader.loadNextBatch(); + Assert.assertEquals(0, root.getRowCount()); } } @@ -319,90 +276,326 @@ public void testWriteReadUnion() throws IOException { File file = new File("target/mytest_write_union.arrow"); ByteArrayOutputStream stream = new ByteArrayOutputStream(); int count = COUNT; - try ( - BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); - NullableMapVector parent = new NullableMapVector("parent", vectorAllocator, null)) { + // write + try (BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + NullableMapVector parent = new NullableMapVector("parent", vectorAllocator, null, null)) { writeUnionData(count, parent); - - printVectors(parent.getChildrenFromFields()); - validateUnionData(count, new VectorSchemaRoot(parent.getChild("root"))); - write(parent.getChild("root"), file, stream); } - // read - try ( - BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); - FileInputStream fileInputStream = new FileInputStream(file); - ArrowReader arrowReader = new ArrowReader(fileInputStream.getChannel(), readerAllocator); - BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); - ) { - ArrowFooter footer = arrowReader.readFooter(); - Schema schema = footer.getSchema(); + + // read file + try (BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + FileInputStream fileInputStream = new FileInputStream(file); + ArrowFileReader arrowReader = new ArrowFileReader(fileInputStream.getChannel(), readerAllocator)) { + VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); + Schema schema = root.getSchema(); LOGGER.debug("reading schema: " + schema); + arrowReader.loadNextBatch(); + validateUnionData(count, root); + } + + // Read from stream. 
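On the stream side, loadNextBatch() fills the reader's root with the next batch and, once the stream is exhausted, simply leaves the root at a row count of zero; there is no null sentinel as with the old nextRecordBatch(). A minimal empty round trip under that contract, reusing MessageSerializerTest.testSchema() from these tests:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.stream.ArrowStreamReader;
import org.apache.arrow.vector.stream.ArrowStreamWriter;
import org.apache.arrow.vector.stream.MessageSerializerTest;

public class StreamRoundTripSketch {
  public static void main(String[] args) throws Exception {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         VectorSchemaRoot root = VectorSchemaRoot.create(MessageSerializerTest.testSchema(), allocator);
         ArrowStreamWriter writer = new ArrowStreamWriter(root, null, out)) { // null: no dictionaries
      writer.start(); // schema message
      writer.end();   // end-of-stream marker; no batches written in this sketch
    }
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         ArrowStreamReader reader = new ArrowStreamReader(
             new ByteArrayInputStream(out.toByteArray()), allocator)) {
      VectorSchemaRoot root = reader.getVectorSchemaRoot();
      reader.loadNextBatch(); // nothing to load: row count stays 0, callable repeatedly
      System.out.println("rows: " + root.getRowCount());
    }
  }
}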
+ try (BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + ByteArrayInputStream input = new ByteArrayInputStream(stream.toByteArray()); + ArrowStreamReader arrowReader = new ArrowStreamReader(input, readerAllocator)) { + VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); + Schema schema = root.getSchema(); + LOGGER.debug("reading schema: " + schema); + arrowReader.loadNextBatch(); + validateUnionData(count, root); + } + } - // initialize vectors - try (VectorSchemaRoot root = new VectorSchemaRoot(schema, vectorAllocator);) { - VectorLoader vectorLoader = new VectorLoader(root); - List recordBatches = footer.getRecordBatches(); - for (ArrowBlock rbBlock : recordBatches) { - try (ArrowRecordBatch recordBatch = arrowReader.readRecordBatch(rbBlock)) { - vectorLoader.load(recordBatch); - } - validateUnionData(count, root); - } + @Test + public void testWriteReadTiny() throws IOException { + File file = new File("target/mytest_write_tiny.arrow"); + ByteArrayOutputStream stream = new ByteArrayOutputStream(); + + try (VectorSchemaRoot root = VectorSchemaRoot.create(MessageSerializerTest.testSchema(), allocator)) { + root.getFieldVectors().get(0).allocateNew(); + NullableTinyIntVector.Mutator mutator = (NullableTinyIntVector.Mutator) root.getFieldVectors().get(0).getMutator(); + for (int i = 0; i < 16; i++) { + mutator.set(i, i < 8 ? 1 : 0, (byte)(i + 1)); + } + mutator.setValueCount(16); + root.setRowCount(16); + + // write file + try (FileOutputStream fileOutputStream = new FileOutputStream(file); + ArrowFileWriter arrowWriter = new ArrowFileWriter(root, null, fileOutputStream.getChannel())) { + LOGGER.debug("writing schema: " + root.getSchema()); + arrowWriter.start(); + arrowWriter.writeBatch(); + arrowWriter.end(); + } + // write stream + try (ArrowStreamWriter arrowWriter = new ArrowStreamWriter(root, null, stream)) { + arrowWriter.start(); + arrowWriter.writeBatch(); + arrowWriter.end(); } } + // read file + try (BufferAllocator readerAllocator = allocator.newChildAllocator("fileReader", 0, Integer.MAX_VALUE); + FileInputStream fileInputStream = new FileInputStream(file); + ArrowFileReader arrowReader = new ArrowFileReader(fileInputStream.getChannel(), readerAllocator)) { + VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); + Schema schema = root.getSchema(); + LOGGER.debug("reading schema: " + schema); + arrowReader.loadNextBatch(); + validateTinyData(root); + } + // Read from stream. 
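testWriteReadTiny above also shows the fill pattern the new API expects: allocate the field vector, write through its Mutator (the middle argument of set(index, isSet, value) drives the validity bitmap), then give the mutator and the root the same count. The fill step in isolation, a sketch assuming the same single-field tinyint schema:

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.NullableTinyIntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.stream.MessageSerializerTest;

public class FillRootSketch {
  public static void main(String[] args) throws Exception {
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         VectorSchemaRoot root = VectorSchemaRoot.create(MessageSerializerTest.testSchema(), allocator)) {
      root.getFieldVectors().get(0).allocateNew();
      NullableTinyIntVector.Mutator mutator =
          (NullableTinyIntVector.Mutator) root.getFieldVectors().get(0).getMutator();
      for (int i = 0; i < 16; i++) {
        mutator.set(i, i < 8 ? 1 : 0, (byte) (i + 1)); // isSet == 0 leaves rows 8..15 null
      }
      mutator.setValueCount(16);
      root.setRowCount(16); // keep the root's row count in sync with its vectors
    }
  }
}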
- try ( - BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); - ByteArrayInputStream input = new ByteArrayInputStream(stream.toByteArray()); - ArrowStreamReader arrowReader = new ArrowStreamReader(input, readerAllocator); - BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); - MapVector parent = new MapVector("parent", vectorAllocator, null) - ) { - arrowReader.init(); - Schema schema = arrowReader.getSchema(); + try (BufferAllocator readerAllocator = allocator.newChildAllocator("streamReader", 0, Integer.MAX_VALUE); + ByteArrayInputStream input = new ByteArrayInputStream(stream.toByteArray()); + ArrowStreamReader arrowReader = new ArrowStreamReader(input, readerAllocator)) { + VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); + Schema schema = root.getSchema(); + LOGGER.debug("reading schema: " + schema); + arrowReader.loadNextBatch(); + validateTinyData(root); + } + } + + private void validateTinyData(VectorSchemaRoot root) { + Assert.assertEquals(16, root.getRowCount()); + NullableTinyIntVector vector = (NullableTinyIntVector) root.getFieldVectors().get(0); + for (int i = 0; i < 16; i++) { + if (i < 8) { + Assert.assertEquals((byte)(i + 1), vector.getAccessor().get(i)); + } else { + Assert.assertTrue(vector.getAccessor().isNull(i)); + } + } + } + + @Test + public void testWriteReadDictionary() throws IOException { + File file = new File("target/mytest_dict.arrow"); + ByteArrayOutputStream stream = new ByteArrayOutputStream(); + + // write + try (BufferAllocator originalVectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + NullableVarCharVector vector = new NullableVarCharVector("varchar", originalVectorAllocator, null); + NullableVarCharVector dictionaryVector = new NullableVarCharVector("dict", originalVectorAllocator, null)) { + vector.allocateNewSafe(); + NullableVarCharVector.Mutator mutator = vector.getMutator(); + mutator.set(0, "foo".getBytes(StandardCharsets.UTF_8)); + mutator.set(1, "bar".getBytes(StandardCharsets.UTF_8)); + mutator.set(3, "baz".getBytes(StandardCharsets.UTF_8)); + mutator.set(4, "bar".getBytes(StandardCharsets.UTF_8)); + mutator.set(5, "baz".getBytes(StandardCharsets.UTF_8)); + mutator.setValueCount(6); + + dictionaryVector.allocateNewSafe(); + mutator = dictionaryVector.getMutator(); + mutator.set(0, "foo".getBytes(StandardCharsets.UTF_8)); + mutator.set(1, "bar".getBytes(StandardCharsets.UTF_8)); + mutator.set(2, "baz".getBytes(StandardCharsets.UTF_8)); + mutator.setValueCount(3); + + Dictionary dictionary = new Dictionary(dictionaryVector, new DictionaryEncoding(1L, false, null)); + MapDictionaryProvider provider = new MapDictionaryProvider(); + provider.put(dictionary); + + FieldVector encodedVector = (FieldVector) DictionaryEncoder.encode(vector, dictionary); + + List fields = ImmutableList.of(encodedVector.getField()); + List vectors = ImmutableList.of(encodedVector); + VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, 6); + + try (FileOutputStream fileOutputStream = new FileOutputStream(file); + ArrowFileWriter fileWriter = new ArrowFileWriter(root, provider, fileOutputStream.getChannel()); + ArrowStreamWriter streamWriter = new ArrowStreamWriter(root, provider, stream)) { + LOGGER.debug("writing schema: " + root.getSchema()); + fileWriter.start(); + streamWriter.start(); + fileWriter.writeBatch(); + streamWriter.writeBatch(); + fileWriter.end(); + streamWriter.end(); + } + + dictionaryVector.close(); + 
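// encodedVector came from DictionaryEncoder.encode(vector, dictionary) and is not managed
// by the enclosing try-with-resources, so it is released explicitly once both writers finish.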
encodedVector.close(); + } + + // read from file + try (BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + FileInputStream fileInputStream = new FileInputStream(file); + ArrowFileReader arrowReader = new ArrowFileReader(fileInputStream.getChannel(), readerAllocator)) { + VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); + Schema schema = root.getSchema(); LOGGER.debug("reading schema: " + schema); + arrowReader.loadNextBatch(); + validateFlatDictionary(root.getFieldVectors().get(0), arrowReader); + } + + // Read from stream + try (BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + ByteArrayInputStream input = new ByteArrayInputStream(stream.toByteArray()); + ArrowStreamReader arrowReader = new ArrowStreamReader(input, readerAllocator)) { + VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); + Schema schema = root.getSchema(); + LOGGER.debug("reading schema: " + schema); + arrowReader.loadNextBatch(); + validateFlatDictionary(root.getFieldVectors().get(0), arrowReader); + } + } + + private void validateFlatDictionary(FieldVector vector, DictionaryProvider provider) { + Assert.assertNotNull(vector); + + DictionaryEncoding encoding = vector.getField().getDictionary(); + Assert.assertNotNull(encoding); + Assert.assertEquals(1L, encoding.getId()); + + FieldVector.Accessor accessor = vector.getAccessor(); + Assert.assertEquals(6, accessor.getValueCount()); + Assert.assertEquals(0, accessor.getObject(0)); + Assert.assertEquals(1, accessor.getObject(1)); + Assert.assertEquals(null, accessor.getObject(2)); + Assert.assertEquals(2, accessor.getObject(3)); + Assert.assertEquals(1, accessor.getObject(4)); + Assert.assertEquals(2, accessor.getObject(5)); + + Dictionary dictionary = provider.lookup(1L); + Assert.assertNotNull(dictionary); + NullableVarCharVector.Accessor dictionaryAccessor = ((NullableVarCharVector) dictionary.getVector()).getAccessor(); + Assert.assertEquals(3, dictionaryAccessor.getValueCount()); + Assert.assertEquals(new Text("foo"), dictionaryAccessor.getObject(0)); + Assert.assertEquals(new Text("bar"), dictionaryAccessor.getObject(1)); + Assert.assertEquals(new Text("baz"), dictionaryAccessor.getObject(2)); + } - try (VectorSchemaRoot root = new VectorSchemaRoot(schema, vectorAllocator)) { - VectorLoader vectorLoader = new VectorLoader(root); - while (true) { - try (ArrowRecordBatch recordBatch = arrowReader.nextRecordBatch()) { - if (recordBatch == null) break; - vectorLoader.load(recordBatch); - } - } - validateUnionData(count, root); + @Test + public void testWriteReadNestedDictionary() throws IOException { + File file = new File("target/mytest_dict_nested.arrow"); + ByteArrayOutputStream stream = new ByteArrayOutputStream(); + + DictionaryEncoding encoding = new DictionaryEncoding(2L, false, null); + + // data being written: + // [['foo', 'bar'], ['foo'], ['bar']] -> [[0, 1], [0], [1]] + + // write + try (NullableVarCharVector dictionaryVector = new NullableVarCharVector("dictionary", allocator, null); + ListVector listVector = new ListVector("list", allocator, null, null)) { + + Dictionary dictionary = new Dictionary(dictionaryVector, encoding); + MapDictionaryProvider provider = new MapDictionaryProvider(); + provider.put(dictionary); + + dictionaryVector.allocateNew(); + dictionaryVector.getMutator().set(0, "foo".getBytes(StandardCharsets.UTF_8)); + dictionaryVector.getMutator().set(1, "bar".getBytes(StandardCharsets.UTF_8)); + 
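// Two dictionary entries: "foo" at index 0, "bar" at index 1. The list vector written
// below stores only those int indices, matching the [[0, 1], [0], [1]] encoding noted above.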
dictionaryVector.getMutator().setValueCount(2); + + listVector.addOrGetVector(MinorType.INT, encoding); + listVector.allocateNew(); + UnionListWriter listWriter = new UnionListWriter(listVector); + listWriter.startList(); + listWriter.writeInt(0); + listWriter.writeInt(1); + listWriter.endList(); + listWriter.startList(); + listWriter.writeInt(0); + listWriter.endList(); + listWriter.startList(); + listWriter.writeInt(1); + listWriter.endList(); + listWriter.setValueCount(3); + + List fields = ImmutableList.of(listVector.getField()); + List vectors = ImmutableList.of((FieldVector) listVector); + VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, 3); + + try ( + FileOutputStream fileOutputStream = new FileOutputStream(file); + ArrowFileWriter fileWriter = new ArrowFileWriter(root, provider, fileOutputStream.getChannel()); + ArrowStreamWriter streamWriter = new ArrowStreamWriter(root, provider, stream)) { + LOGGER.debug("writing schema: " + root.getSchema()); + fileWriter.start(); + streamWriter.start(); + fileWriter.writeBatch(); + streamWriter.writeBatch(); + fileWriter.end(); + streamWriter.end(); } } + + // read from file + try (BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + FileInputStream fileInputStream = new FileInputStream(file); + ArrowFileReader arrowReader = new ArrowFileReader(fileInputStream.getChannel(), readerAllocator)) { + VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); + Schema schema = root.getSchema(); + LOGGER.debug("reading schema: " + schema); + arrowReader.loadNextBatch(); + validateNestedDictionary((ListVector) root.getFieldVectors().get(0), arrowReader); + } + + // Read from stream + try (BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + ByteArrayInputStream input = new ByteArrayInputStream(stream.toByteArray()); + ArrowStreamReader arrowReader = new ArrowStreamReader(input, readerAllocator)) { + VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); + Schema schema = root.getSchema(); + LOGGER.debug("reading schema: " + schema); + arrowReader.loadNextBatch(); + validateNestedDictionary((ListVector) root.getFieldVectors().get(0), arrowReader); + } + } + + private void validateNestedDictionary(ListVector vector, DictionaryProvider provider) { + Assert.assertNotNull(vector); + Assert.assertNull(vector.getField().getDictionary()); + Field nestedField = vector.getField().getChildren().get(0); + + DictionaryEncoding encoding = nestedField.getDictionary(); + Assert.assertNotNull(encoding); + Assert.assertEquals(2L, encoding.getId()); + Assert.assertEquals(new Int(32, true), encoding.getIndexType()); + + ListVector.Accessor accessor = vector.getAccessor(); + Assert.assertEquals(3, accessor.getValueCount()); + Assert.assertEquals(Arrays.asList(0, 1), accessor.getObject(0)); + Assert.assertEquals(Arrays.asList(0), accessor.getObject(1)); + Assert.assertEquals(Arrays.asList(1), accessor.getObject(2)); + + Dictionary dictionary = provider.lookup(2L); + Assert.assertNotNull(dictionary); + NullableVarCharVector.Accessor dictionaryAccessor = ((NullableVarCharVector) dictionary.getVector()).getAccessor(); + Assert.assertEquals(2, dictionaryAccessor.getValueCount()); + Assert.assertEquals(new Text("foo"), dictionaryAccessor.getObject(0)); + Assert.assertEquals(new Text("bar"), dictionaryAccessor.getObject(1)); } /** * Writes the contents of parents to file. If outStream is non-null, also writes it * to outStream in the streaming serialized format. 
*/ - private void write(FieldVector parent, File file, OutputStream outStream) throws FileNotFoundException, IOException { - VectorUnloader vectorUnloader = newVectorUnloader(parent); - Schema schema = vectorUnloader.getSchema(); - LOGGER.debug("writing schema: " + schema); - try ( - FileOutputStream fileOutputStream = new FileOutputStream(file); - ArrowWriter arrowWriter = new ArrowWriter(fileOutputStream.getChannel(), schema); - ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch(); - ) { - arrowWriter.writeRecordBatch(recordBatch); + private void write(FieldVector parent, File file, OutputStream outStream) throws IOException { + VectorSchemaRoot root = new VectorSchemaRoot(parent); + + try (FileOutputStream fileOutputStream = new FileOutputStream(file); + ArrowFileWriter arrowWriter = new ArrowFileWriter(root, null, fileOutputStream.getChannel());) { + LOGGER.debug("writing schema: " + root.getSchema()); + arrowWriter.start(); + arrowWriter.writeBatch(); + arrowWriter.end(); } // Also try serializing to the stream writer. if (outStream != null) { - try ( - ArrowStreamWriter arrowWriter = new ArrowStreamWriter(outStream, schema); - ArrowRecordBatch recordBatch = vectorUnloader.getRecordBatch(); - ) { - arrowWriter.writeRecordBatch(recordBatch); + try (ArrowStreamWriter arrowWriter = new ArrowStreamWriter(root, null, outStream)) { + arrowWriter.start(); + arrowWriter.writeBatch(); + arrowWriter.end(); } } } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java index 13b04de68fa62..914dfe4319db3 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java @@ -17,12 +17,15 @@ */ package org.apache.arrow.vector.file; +import static java.nio.channels.Channels.newChannel; import static java.util.Arrays.asList; import static org.junit.Assert.assertArrayEquals; import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertTrue; import java.io.ByteArrayOutputStream; +import java.io.File; +import java.io.FileOutputStream; import java.io.IOException; import java.nio.ByteBuffer; import java.nio.channels.Channels; @@ -34,8 +37,14 @@ import org.apache.arrow.flatbuf.RecordBatch; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.NullableIntVector; +import org.apache.arrow.vector.NullableTinyIntVector; +import org.apache.arrow.vector.VectorSchemaRoot; import org.apache.arrow.vector.schema.ArrowFieldNode; import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.types.Types; +import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.ArrowType; import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.types.pojo.Schema; @@ -69,12 +78,17 @@ byte[] array(ArrowBuf buf) { @Test public void test() throws IOException { Schema schema = new Schema(asList(new Field("testField", true, new ArrowType.Int(8, true), Collections.emptyList()))); - byte[] validity = new byte[] { (byte)255, 0}; + MinorType minorType = Types.getMinorTypeForArrowType(schema.getFields().get(0).getType()); + FieldVector vector = minorType.getNewVector("testField", allocator, null,null); + vector.initializeChildrenFromFields(schema.getFields().get(0).getChildren()); + 
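// Vector construction now goes through the type registry: Types.getMinorTypeForArrowType(...)
// maps the Field's ArrowType to a MinorType, getNewVector(...) instantiates the matching
// FieldVector (the trailing nulls leave the call-back and the new dictionary-encoding hooks
// unset), and initializeChildrenFromFields(...) recreates any child vectors.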
+ byte[] validity = new byte[] { (byte) 255, 0}; // second half is "undefined" byte[] values = new byte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}; ByteArrayOutputStream out = new ByteArrayOutputStream(); - try (ArrowWriter writer = new ArrowWriter(Channels.newChannel(out), schema)) { + try (VectorSchemaRoot root = new VectorSchemaRoot(schema.getFields(), asList(vector), 16); + ArrowFileWriter writer = new ArrowFileWriter(root, null, newChannel(out))) { ArrowBuf validityb = buf(validity); ArrowBuf valuesb = buf(values); writer.writeRecordBatch(new ArrowRecordBatch(16, asList(new ArrowFieldNode(16, 8)), asList(validityb, valuesb))); @@ -82,15 +96,15 @@ public void test() throws IOException { byte[] byteArray = out.toByteArray(); - try (ArrowReader reader = new ArrowReader(new ByteArrayReadableSeekableByteChannel(byteArray), allocator)) { - ArrowFooter footer = reader.readFooter(); - Schema readSchema = footer.getSchema(); + SeekableReadChannel channel = new SeekableReadChannel(new ByteArrayReadableSeekableByteChannel(byteArray)); + try (ArrowFileReader reader = new ArrowFileReader(channel, allocator)) { + Schema readSchema = reader.getVectorSchemaRoot().getSchema(); assertEquals(schema, readSchema); assertTrue(readSchema.getFields().get(0).getTypeLayout().getVectorTypes().toString(), readSchema.getFields().get(0).getTypeLayout().getVectors().size() > 0); // TODO: dictionaries - List recordBatches = footer.getRecordBatches(); + List recordBatches = reader.getRecordBlocks(); assertEquals(1, recordBatches.size()); - ArrowRecordBatch recordBatch = reader.readRecordBatch(recordBatches.get(0)); + ArrowRecordBatch recordBatch = (ArrowRecordBatch) reader.readMessage(channel, allocator); List nodes = recordBatch.getNodes(); assertEquals(1, nodes.size()); ArrowFieldNode node = nodes.get(0); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowStream.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowStream.java new file mode 100644 index 0000000000000..e7cdf3fea4b8b --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowStream.java @@ -0,0 +1,102 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector.file; + +import static java.util.Arrays.asList; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +import java.io.ByteArrayInputStream; +import java.io.ByteArrayOutputStream; +import java.io.IOException; + +import io.netty.buffer.ArrowBuf; +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.NullableTinyIntVector; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.schema.ArrowFieldNode; +import org.apache.arrow.vector.schema.ArrowMessage; +import org.apache.arrow.vector.schema.ArrowRecordBatch; +import org.apache.arrow.vector.stream.ArrowStreamReader; +import org.apache.arrow.vector.stream.ArrowStreamWriter; +import org.apache.arrow.vector.stream.MessageSerializerTest; +import org.apache.arrow.vector.types.pojo.Schema; +import org.junit.Test; + +public class TestArrowStream extends BaseFileTest { + @Test + public void testEmptyStream() throws IOException { + Schema schema = MessageSerializerTest.testSchema(); + VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator); + + // Write the stream. + ByteArrayOutputStream out = new ByteArrayOutputStream(); + try (ArrowStreamWriter writer = new ArrowStreamWriter(root, null, out)) { + } + + ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray()); + try (ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) { + assertEquals(schema, reader.getVectorSchemaRoot().getSchema()); + // Empty should return nothing. Can be called repeatedly. + reader.loadNextBatch(); + assertEquals(0, reader.getVectorSchemaRoot().getRowCount()); + reader.loadNextBatch(); + assertEquals(0, reader.getVectorSchemaRoot().getRowCount()); + } + } + + @Test + public void testReadWrite() throws IOException { + Schema schema = MessageSerializerTest.testSchema(); + try (VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) { + int numBatches = 1; + + root.getFieldVectors().get(0).allocateNew(); + NullableTinyIntVector.Mutator mutator = (NullableTinyIntVector.Mutator) root.getFieldVectors().get(0).getMutator(); + for (int i = 0; i < 16; i++) { + mutator.set(i, i < 8 ? 
1 : 0, (byte)(i + 1)); + } + mutator.setValueCount(16); + root.setRowCount(16); + + ByteArrayOutputStream out = new ByteArrayOutputStream(); + long bytesWritten = 0; + try (ArrowStreamWriter writer = new ArrowStreamWriter(root, null, out)) { + writer.start(); + for (int i = 0; i < numBatches; i++) { + writer.writeBatch(); + } + writer.end(); + bytesWritten = writer.bytesWritten(); + } + + ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray()); + try (ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) { + Schema readSchema = reader.getVectorSchemaRoot().getSchema(); + assertEquals(schema, readSchema); + for (int i = 0; i < numBatches; i++) { + reader.loadNextBatch(); + } + // TODO figure out why reader isn't getting padding bytes + assertEquals(bytesWritten, reader.bytesRead() + 4); + reader.loadNextBatch(); + assertEquals(0, reader.getVectorSchemaRoot().getRowCount()); + } + } + } +} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowStreamPipe.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowStreamPipe.java new file mode 100644 index 0000000000000..46d46794bbefa --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowStreamPipe.java @@ -0,0 +1,163 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector.file; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +import java.io.IOException; +import java.nio.channels.Pipe; +import java.nio.channels.ReadableByteChannel; +import java.nio.channels.WritableByteChannel; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.NullableTinyIntVector; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.schema.ArrowMessage; +import org.apache.arrow.vector.stream.ArrowStreamReader; +import org.apache.arrow.vector.stream.ArrowStreamWriter; +import org.apache.arrow.vector.stream.MessageSerializerTest; +import org.apache.arrow.vector.types.pojo.Schema; +import org.junit.Assert; +import org.junit.Test; + +public class TestArrowStreamPipe { + Schema schema = MessageSerializerTest.testSchema(); + BufferAllocator alloc = new RootAllocator(Long.MAX_VALUE); + + private final class WriterThread extends Thread { + + private final int numBatches; + private final ArrowStreamWriter writer; + private final VectorSchemaRoot root; + + public WriterThread(int numBatches, WritableByteChannel sinkChannel) + throws IOException { + this.numBatches = numBatches; + BufferAllocator allocator = alloc.newChildAllocator("writer thread", 0, Integer.MAX_VALUE); + root = VectorSchemaRoot.create(schema, allocator); + writer = new ArrowStreamWriter(root, null, sinkChannel); + } + + @Override + public void run() { + try { + writer.start(); + for (int j = 0; j < numBatches; j++) { + root.getFieldVectors().get(0).allocateNew(); + NullableTinyIntVector.Mutator mutator = (NullableTinyIntVector.Mutator) root.getFieldVectors().get(0).getMutator(); + // Send a changing batch id first + mutator.set(0, j); + for (int i = 1; i < 16; i++) { + mutator.set(i, i < 8 ? 1 : 0, (byte)(i + 1)); + } + mutator.setValueCount(16); + root.setRowCount(16); + + writer.writeBatch(); + } + writer.close(); + root.close(); + } catch (IOException e) { + e.printStackTrace(); + Assert.fail(e.toString()); // have to explicitly fail since we're in a separate thread + } + } + + public long bytesWritten() { return writer.bytesWritten(); } + } + + private final class ReaderThread extends Thread { + private int batchesRead = 0; + private final ArrowStreamReader reader; + private final BufferAllocator alloc = new RootAllocator(Long.MAX_VALUE); + private boolean done = false; + + public ReaderThread(ReadableByteChannel sourceChannel) + throws IOException { + reader = new ArrowStreamReader(sourceChannel, alloc) { + @Override + protected ArrowMessage readMessage(ReadChannel in, BufferAllocator allocator) throws IOException { + // Read all the batches. Each batch contains an incrementing id and then some + // constant data. Verify both. 
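// readMessage(...) is the single point every stream message passes through, so overriding
// it lets the test detect end-of-stream (a null message) and count batches without
// reimplementing the read loop; the loadNextBatch() override below then validates the root.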
+ ArrowMessage message = super.readMessage(in, allocator); + if (message == null) { + done = true; + } else { + batchesRead++; + } + return message; + } + @Override + public void loadNextBatch() throws IOException { + super.loadNextBatch(); + if (!done) { + VectorSchemaRoot root = getVectorSchemaRoot(); + Assert.assertEquals(16, root.getRowCount()); + NullableTinyIntVector vector = (NullableTinyIntVector) root.getFieldVectors().get(0); + Assert.assertEquals((byte)(batchesRead - 1), vector.getAccessor().get(0)); + for (int i = 1; i < 16; i++) { + if (i < 8) { + Assert.assertEquals((byte)(i + 1), vector.getAccessor().get(i)); + } else { + Assert.assertTrue(vector.getAccessor().isNull(i)); + } + } + } + } + }; + } + + @Override + public void run() { + try { + assertEquals(schema, reader.getVectorSchemaRoot().getSchema()); + assertTrue( + reader.getVectorSchemaRoot().getSchema().getFields().get(0).getTypeLayout().getVectorTypes().toString(), + reader.getVectorSchemaRoot().getSchema().getFields().get(0).getTypeLayout().getVectors().size() > 0); + while (!done) { + reader.loadNextBatch(); + } + } catch (IOException e) { + e.printStackTrace(); + Assert.fail(e.toString()); // have to explicitly fail since we're in a separate thread + } + } + + public int getBatchesRead() { return batchesRead; } + public long bytesRead() { return reader.bytesRead(); } + } + + // Starts up a producer and consumer thread to read/write batches. + @Test + public void pipeTest() throws IOException, InterruptedException { + int NUM_BATCHES = 10; + Pipe pipe = Pipe.open(); + WriterThread writer = new WriterThread(NUM_BATCHES, pipe.sink()); + ReaderThread reader = new ReaderThread(pipe.source()); + + writer.start(); + reader.start(); + reader.join(); + writer.join(); + + assertEquals(NUM_BATCHES, reader.getBatchesRead()); + assertEquals(writer.bytesWritten(), reader.bytesRead()); + } +} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/json/TestJSONFile.java b/java/vector/src/test/java/org/apache/arrow/vector/file/json/TestJSONFile.java index 3720a13b0fce5..c88958cbf2c9c 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/json/TestJSONFile.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/json/TestJSONFile.java @@ -70,7 +70,7 @@ public void testWriteComplexJSON() throws IOException { int count = COUNT; try ( BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); - NullableMapVector parent = new NullableMapVector("parent", vectorAllocator, null)) { + NullableMapVector parent = new NullableMapVector("parent", vectorAllocator, null, null)) { writeComplexData(count, parent); VectorSchemaRoot root = new VectorSchemaRoot(parent.getChild("root")); validateComplexContent(root.getRowCount(), root); @@ -92,7 +92,7 @@ public void testWriteReadUnionJSON() throws IOException { int count = COUNT; try ( BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); - NullableMapVector parent = new NullableMapVector("parent", vectorAllocator, null)) { + NullableMapVector parent = new NullableMapVector("parent", vectorAllocator, null, null)) { writeUnionData(count, parent); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/stream/MessageSerializerTest.java b/java/vector/src/test/java/org/apache/arrow/vector/stream/MessageSerializerTest.java index 7b4de80ee03ea..bb2ccf8cbb5f6 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/stream/MessageSerializerTest.java +++ 
b/java/vector/src/test/java/org/apache/arrow/vector/stream/MessageSerializerTest.java @@ -34,6 +34,7 @@ import org.apache.arrow.vector.file.ReadChannel; import org.apache.arrow.vector.file.WriteChannel; import org.apache.arrow.vector.schema.ArrowFieldNode; +import org.apache.arrow.vector.schema.ArrowMessage; import org.apache.arrow.vector.schema.ArrowRecordBatch; import org.apache.arrow.vector.types.pojo.ArrowType; import org.apache.arrow.vector.types.pojo.Field; @@ -88,9 +89,10 @@ public void testSerializeRecordBatch() throws IOException { MessageSerializer.serialize(new WriteChannel(Channels.newChannel(out)), batch); ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray()); - ArrowRecordBatch deserialized = MessageSerializer.deserializeRecordBatch( - new ReadChannel(Channels.newChannel(in)), alloc); - verifyBatch(deserialized, validity, values); + ReadChannel channel = new ReadChannel(Channels.newChannel(in)); + ArrowMessage deserialized = MessageSerializer.deserializeMessageBatch(channel, alloc); + assertEquals(ArrowRecordBatch.class, deserialized.getClass()); + verifyBatch((ArrowRecordBatch) deserialized, validity, values); } public static Schema testSchema() { diff --git a/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStream.java b/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStream.java deleted file mode 100644 index 725272a0f072e..0000000000000 --- a/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStream.java +++ /dev/null @@ -1,96 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. - */ -package org.apache.arrow.vector.stream; - -import static java.util.Arrays.asList; -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertTrue; - -import java.io.ByteArrayInputStream; -import java.io.ByteArrayOutputStream; -import java.io.IOException; - -import org.apache.arrow.memory.BufferAllocator; -import org.apache.arrow.memory.RootAllocator; -import org.apache.arrow.vector.file.BaseFileTest; -import org.apache.arrow.vector.schema.ArrowFieldNode; -import org.apache.arrow.vector.schema.ArrowRecordBatch; -import org.apache.arrow.vector.types.pojo.Schema; -import org.junit.Test; - -import io.netty.buffer.ArrowBuf; - -public class TestArrowStream extends BaseFileTest { - @Test - public void testEmptyStream() throws IOException { - Schema schema = MessageSerializerTest.testSchema(); - - // Write the stream. 
- ByteArrayOutputStream out = new ByteArrayOutputStream(); - try (ArrowStreamWriter writer = new ArrowStreamWriter(out, schema)) { - } - - ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray()); - try (ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) { - reader.init(); - assertEquals(schema, reader.getSchema()); - // Empty should return null. Can be called repeatedly. - assertTrue(reader.nextRecordBatch() == null); - assertTrue(reader.nextRecordBatch() == null); - } - } - - @Test - public void testReadWrite() throws IOException { - Schema schema = MessageSerializerTest.testSchema(); - byte[] validity = new byte[] { (byte)255, 0}; - // second half is "undefined" - byte[] values = new byte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}; - - int numBatches = 5; - BufferAllocator alloc = new RootAllocator(Long.MAX_VALUE); - ByteArrayOutputStream out = new ByteArrayOutputStream(); - long bytesWritten = 0; - try (ArrowStreamWriter writer = new ArrowStreamWriter(out, schema)) { - ArrowBuf validityb = MessageSerializerTest.buf(alloc, validity); - ArrowBuf valuesb = MessageSerializerTest.buf(alloc, values); - for (int i = 0; i < numBatches; i++) { - writer.writeRecordBatch(new ArrowRecordBatch( - 16, asList(new ArrowFieldNode(16, 8)), asList(validityb, valuesb))); - } - bytesWritten = writer.bytesWritten(); - } - - ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray()); - try (ArrowStreamReader reader = new ArrowStreamReader(in, alloc)) { - reader.init(); - Schema readSchema = reader.getSchema(); - for (int i = 0; i < numBatches; i++) { - assertEquals(schema, readSchema); - assertTrue( - readSchema.getFields().get(0).getTypeLayout().getVectorTypes().toString(), - readSchema.getFields().get(0).getTypeLayout().getVectors().size() > 0); - ArrowRecordBatch recordBatch = reader.nextRecordBatch(); - MessageSerializerTest.verifyBatch(recordBatch, validity, values); - assertTrue(recordBatch != null); - } - assertTrue(reader.nextRecordBatch() == null); - assertEquals(bytesWritten, reader.bytesRead()); - } - } -} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStreamPipe.java b/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStreamPipe.java deleted file mode 100644 index aa0b77e46a392..0000000000000 --- a/java/vector/src/test/java/org/apache/arrow/vector/stream/TestArrowStreamPipe.java +++ /dev/null @@ -1,129 +0,0 @@ -/** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. 
- */ -package org.apache.arrow.vector.stream; - -import static java.util.Arrays.asList; -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertTrue; - -import java.io.IOException; -import java.nio.channels.Pipe; -import java.nio.channels.ReadableByteChannel; -import java.nio.channels.WritableByteChannel; - -import org.apache.arrow.memory.BufferAllocator; -import org.apache.arrow.memory.RootAllocator; -import org.apache.arrow.vector.schema.ArrowFieldNode; -import org.apache.arrow.vector.schema.ArrowRecordBatch; -import org.apache.arrow.vector.types.pojo.Schema; -import org.junit.Test; - -import io.netty.buffer.ArrowBuf; - -public class TestArrowStreamPipe { - Schema schema = MessageSerializerTest.testSchema(); - // second half is "undefined" - byte[] values = new byte[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}; - - private final class WriterThread extends Thread { - private final int numBatches; - private final ArrowStreamWriter writer; - - public WriterThread(int numBatches, WritableByteChannel sinkChannel) - throws IOException { - this.numBatches = numBatches; - writer = new ArrowStreamWriter(sinkChannel, schema); - } - - @Override - public void run() { - BufferAllocator alloc = new RootAllocator(Long.MAX_VALUE); - try { - ArrowBuf valuesb = MessageSerializerTest.buf(alloc, values); - for (int i = 0; i < numBatches; i++) { - // Send a changing byte id first. - byte[] validity = new byte[] { (byte)i, 0}; - ArrowBuf validityb = MessageSerializerTest.buf(alloc, validity); - writer.writeRecordBatch(new ArrowRecordBatch( - 16, asList(new ArrowFieldNode(16, 8)), asList(validityb, valuesb))); - } - writer.close(); - } catch (IOException e) { - e.printStackTrace(); - assertTrue(false); - } - } - - public long bytesWritten() { return writer.bytesWritten(); } - } - - private final class ReaderThread extends Thread { - private int batchesRead = 0; - private final ArrowStreamReader reader; - private final BufferAllocator alloc = new RootAllocator(Long.MAX_VALUE); - - public ReaderThread(ReadableByteChannel sourceChannel) - throws IOException { - reader = new ArrowStreamReader(sourceChannel, alloc); - } - - @Override - public void run() { - try { - reader.init(); - assertEquals(schema, reader.getSchema()); - assertTrue( - reader.getSchema().getFields().get(0).getTypeLayout().getVectorTypes().toString(), - reader.getSchema().getFields().get(0).getTypeLayout().getVectors().size() > 0); - - // Read all the batches. Each batch contains an incrementing id and then some - // constant data. Verify both. - while (true) { - ArrowRecordBatch batch = reader.nextRecordBatch(); - if (batch == null) break; - byte[] validity = new byte[] { (byte)batchesRead, 0}; - MessageSerializerTest.verifyBatch(batch, validity, values); - batchesRead++; - } - } catch (IOException e) { - e.printStackTrace(); - assertTrue(false); - } - } - - public int getBatchesRead() { return batchesRead; } - public long bytesRead() { return reader.bytesRead(); } - } - - // Starts up a producer and consumer thread to read/write batches. 
-  @Test
-  public void pipeTest() throws IOException, InterruptedException {
-    int NUM_BATCHES = 10;
-    Pipe pipe = Pipe.open();
-    WriterThread writer = new WriterThread(NUM_BATCHES, pipe.sink());
-    ReaderThread reader = new ReaderThread(pipe.source());
-
-    writer.start();
-    reader.start();
-    reader.join();
-    writer.join();
-
-    assertEquals(NUM_BATCHES, reader.getBatchesRead());
-    assertEquals(writer.bytesWritten(), reader.bytesRead());
-  }
-}

From 1c101ffe0e7a92e1fc251f9335081e64aada8b26 Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Thu, 16 Mar 2017 14:17:50 -0400
Subject: [PATCH 0371/1644] ARROW-636: [C++] Update README about Boost system
 requirement

Author: Wes McKinney

Closes #386 from wesm/ARROW-636 and squashes the following commits:

2dd3052 [Wes McKinney] Update README about Boost system requirement

---
 cpp/README.md | 35 ++++++++++++++++++++++++++++-------
 1 file changed, 28 insertions(+), 7 deletions(-)

diff --git a/cpp/README.md b/cpp/README.md
index 542a854990250..51f1f0606fa3a 100644
--- a/cpp/README.md
+++ b/cpp/README.md
@@ -14,13 +14,33 @@

 # Arrow C++

-## Setup Build Environment
+## System setup

 Arrow uses CMake as a build configuration system. Currently, it supports
 in-source and out-of-source builds with the latter one being preferred.

-Arrow requires a C++11-enabled compiler. On Linux, gcc 4.8 and higher should be
-sufficient.
+Building Arrow requires:
+
+* A C++11-enabled compiler. On Linux, gcc 4.8 and higher should be sufficient.
+* CMake
+* Boost
+
+On Ubuntu/Debian you can install the requirements with:
+
+```shell
+sudo apt-get install cmake \
+     libboost-dev \
+     libboost-filesystem-dev \
+     libboost-system-dev
+```
+
+On OS X, you can use [Homebrew][1]:
+
+```shell
+brew install boost cmake
+```
+
+## Building Arrow

 Simple debug build:
@@ -50,7 +70,6 @@ and benchmarks or `make runbenchmark` to run only the benchmark tests.

 Benchmark logs will be placed in the build directory under
 `build/benchmark-logs`.

-
 ### Third-party environment variables

 To set up your own specific build toolchain, here are the relevant environment
@@ -86,10 +105,12 @@ build failures by running the following checks before submitting your pull request:

 Note that the clang-tidy target may take a while to run. You might consider
 running clang-tidy separately on the files you have added/changed before
 invoking the make target to reduce iteration time. Also, it might generate warnings
-that aren't valid. To avoid these you can add a line comment `// NOLINT`. If
-NOLINT doesn't suppress the warnings, you can add the file in question to
-the .clang-tidy-ignore file. This will allow `make check-clang-tidy` to pass in
+that aren't valid. To avoid these you can add a line comment `// NOLINT`. If
+NOLINT doesn't suppress the warnings, you can add the file in question to
+the .clang-tidy-ignore file. This will allow `make check-clang-tidy` to pass in
 travis-CI (but still surface the potential warnings in `make clang-tidy`). Ideally,
 both of these options would be used rarely. Current known use-cases when they
 are required:

 * Parameterized tests in google test.
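For illustration, a per-line suppression looks like the following (a minimal sketch; `LegacyCall` is a hypothetical function, not something from the Arrow sources):

```cpp
// clang-tidy skips all of its checks on a line that carries a NOLINT comment.
int result = LegacyCall();  // NOLINT
```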
+ +[1]: https://brew.sh/ \ No newline at end of file From 0cf2bbb2afe6006219904265b41123c2ce10715a Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 16 Mar 2017 21:16:44 +0100 Subject: [PATCH 0372/1644] ARROW-642: [Java] Remove temporary file in java/tools Author: Wes McKinney Closes #389 from wesm/ARROW-642 and squashes the following commits: 03771c8 [Wes McKinney] Remove temporary file from ARROW-542 --- java/tools/tmptestfilesio | Bin 628 -> 0 bytes 1 file changed, 0 insertions(+), 0 deletions(-) delete mode 100644 java/tools/tmptestfilesio diff --git a/java/tools/tmptestfilesio b/java/tools/tmptestfilesio deleted file mode 100644 index d1b6b6cdb93878637bff514fbacc2b0054dd5f4d..0000000000000000000000000000000000000000 GIT binary patch literal 0 HcmV?d00001 literal 628 zcmZ{hJx;?w5QU$UB`gsjgpz{BmY7e6_I-)K^zOlJyU4|WJ7$N6$-zP8=-uI=>U^QdjD zb6weIOXOi< zPk Date: Thu, 16 Mar 2017 18:23:16 -0400 Subject: [PATCH 0373/1644] ARROW-231 [C++]: Add typed Resize to PoolBuffer I also added a typed Reserve to be consistent. Let me know what you think. Author: Johan Mabille Closes #391 from JohanMabille/buffer_resize and squashes the following commits: 90ccbfa [Johan Mabille] typed resize --- cpp/src/arrow/buffer-test.cc | 19 +++++++++++++++++++ cpp/src/arrow/buffer.h | 10 ++++++++++ 2 files changed, 29 insertions(+) diff --git a/cpp/src/arrow/buffer-test.cc b/cpp/src/arrow/buffer-test.cc index 934fcfef14856..e0a2137b9bd78 100644 --- a/cpp/src/arrow/buffer-test.cc +++ b/cpp/src/arrow/buffer-test.cc @@ -66,6 +66,25 @@ TEST_F(TestBuffer, Resize) { ASSERT_EQ(128, buf.capacity()); } +TEST_F(TestBuffer, TypedResize) { + PoolBuffer buf; + + ASSERT_EQ(0, buf.size()); + ASSERT_OK(buf.TypedResize(100)); + ASSERT_EQ(800, buf.size()); + ASSERT_OK(buf.TypedResize(200)); + ASSERT_EQ(1600, buf.size()); + + ASSERT_OK(buf.TypedResize(50, true)); + ASSERT_EQ(400, buf.size()); + ASSERT_EQ(448, buf.capacity()); + + ASSERT_OK(buf.TypedResize(100)); + ASSERT_EQ(832, buf.capacity()); + ASSERT_OK(buf.TypedResize(50, false)); + ASSERT_EQ(832, buf.capacity()); +} + TEST_F(TestBuffer, ResizeOOM) { // This test doesn't play nice with AddressSanitizer #ifndef ADDRESS_SANITIZER diff --git a/cpp/src/arrow/buffer.h b/cpp/src/arrow/buffer.h index 26c8ea60214f6..1647e8601f481 100644 --- a/cpp/src/arrow/buffer.h +++ b/cpp/src/arrow/buffer.h @@ -133,6 +133,16 @@ class ARROW_EXPORT ResizableBuffer : public MutableBuffer { /// It does not change buffer's reported size. 
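+  // NOTE (usage sketch, illustration only): the typed helpers declared just
+  // below scale the request by sizeof(T); e.g. TypedResize<int64_t>(100)
+  // forwards to Resize(sizeof(int64_t) * 100, shrink_to_fit), giving an
+  // 800-byte buffer, as the TypedResize test in buffer-test.cc above checks.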
  virtual Status Reserve(int64_t new_capacity) = 0;

+  template <typename T>
+  Status TypedResize(int64_t new_nb_elements, bool shrink_to_fit = true) {
+    return Resize(sizeof(T) * new_nb_elements, shrink_to_fit);
+  }
+
+  template <typename T>
+  Status TypedReserve(int64_t new_nb_elements) {
+    return Reserve(sizeof(T) * new_nb_elements);
+  }
+
 protected:
  ResizableBuffer(uint8_t* data, int64_t size) : MutableBuffer(data, size) {}
 };

From 3ee3822b6dccc9c859e5a324ef01fc1e9bf75dd1 Mon Sep 17 00:00:00 2001
From: Johan Mabille
Date: Thu, 16 Mar 2017 20:47:09 -0400
Subject: [PATCH 0374/1644] ARROW-593 [C++]: Rename ReadableFileInterface to
 RandomAccessFile

Author: Johan Mabille

Closes #393 from JohanMabille/raf and squashes the following commits:

ba755a6 [Johan Mabille] type alias for backward compatibility
c9e459f [Johan Mabille] Renamed ReadableFileInterface to RandomAccessFile

---
 cpp/src/arrow/io/file.h                  |  4 ++--
 cpp/src/arrow/io/hdfs.h                  |  2 +-
 cpp/src/arrow/io/interfaces.cc           |  6 +++---
 cpp/src/arrow/io/interfaces.h            | 10 ++++++----
 cpp/src/arrow/io/memory.h                |  2 +-
 cpp/src/arrow/ipc/adapter.cc             | 14 +++++++-------
 cpp/src/arrow/ipc/adapter.h              |  8 ++++----
 cpp/src/arrow/ipc/feather.cc             |  6 +++---
 cpp/src/arrow/ipc/feather.h              |  4 ++--
 cpp/src/arrow/ipc/ipc-adapter-test.cc    |  2 +-
 cpp/src/arrow/ipc/json.h                 |  2 +-
 cpp/src/arrow/ipc/metadata.cc            |  2 +-
 cpp/src/arrow/ipc/metadata.h             |  4 ++--
 cpp/src/arrow/ipc/reader.cc              |  8 ++++----
 cpp/src/arrow/ipc/reader.h               |  6 +++---
 python/pyarrow/_feather.pyx              |  4 ++--
 python/pyarrow/_parquet.pxd              |  6 +++---
 python/pyarrow/_parquet.pyx              |  4 ++--
 python/pyarrow/includes/libarrow_io.pxd  | 10 +++++-----
 python/pyarrow/includes/libarrow_ipc.pxd |  6 +++---
 python/pyarrow/includes/pyarrow.pxd      |  2 +-
 python/pyarrow/io.pxd                    |  8 ++++----
 python/pyarrow/io.pyx                    | 16 ++++++++--------
 python/src/pyarrow/io.h                  |  2 +-
 24 files changed, 70 insertions(+), 68 deletions(-)

diff --git a/cpp/src/arrow/io/file.h b/cpp/src/arrow/io/file.h
index fe55e968e05d7..f687fadc299bd 100644
--- a/cpp/src/arrow/io/file.h
+++ b/cpp/src/arrow/io/file.h
@@ -64,7 +64,7 @@ class ARROW_EXPORT FileOutputStream : public OutputStream {
 };

 // Operating system file
-class ARROW_EXPORT ReadableFile : public ReadableFileInterface {
+class ARROW_EXPORT ReadableFile : public RandomAccessFile {
 public:
  ~ReadableFile();
@@ -115,7 +115,7 @@ class ARROW_EXPORT MemoryMappedFile : public ReadWriteFileInterface {

  Status Seek(int64_t position) override;

-  // Required by ReadableFileInterface, copies memory into out. Not thread-safe
+  // Required by RandomAccessFile, copies memory into out. Not thread-safe
  Status Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) override;

  // Zero copy read.
Not thread-safe diff --git a/cpp/src/arrow/io/hdfs.h b/cpp/src/arrow/io/hdfs.h index fbf1d758afb99..e3f5442f48ead 100644 --- a/cpp/src/arrow/io/hdfs.h +++ b/cpp/src/arrow/io/hdfs.h @@ -159,7 +159,7 @@ class ARROW_EXPORT HdfsClient : public FileSystemClient { DISALLOW_COPY_AND_ASSIGN(HdfsClient); }; -class ARROW_EXPORT HdfsReadableFile : public ReadableFileInterface { +class ARROW_EXPORT HdfsReadableFile : public RandomAccessFile { public: ~HdfsReadableFile(); diff --git a/cpp/src/arrow/io/interfaces.cc b/cpp/src/arrow/io/interfaces.cc index 51ed0693e2c57..06957d4de560d 100644 --- a/cpp/src/arrow/io/interfaces.cc +++ b/cpp/src/arrow/io/interfaces.cc @@ -29,18 +29,18 @@ namespace io { FileInterface::~FileInterface() {} -ReadableFileInterface::ReadableFileInterface() { +RandomAccessFile::RandomAccessFile() { set_mode(FileMode::READ); } -Status ReadableFileInterface::ReadAt( +Status RandomAccessFile::ReadAt( int64_t position, int64_t nbytes, int64_t* bytes_read, uint8_t* out) { std::lock_guard guard(lock_); RETURN_NOT_OK(Seek(position)); return Read(nbytes, bytes_read, out); } -Status ReadableFileInterface::ReadAt( +Status RandomAccessFile::ReadAt( int64_t position, int64_t nbytes, std::shared_ptr* out) { std::lock_guard guard(lock_); RETURN_NOT_OK(Seek(position)); diff --git a/cpp/src/arrow/io/interfaces.h b/cpp/src/arrow/io/interfaces.h index 9862a67aed0cd..258a3155743bf 100644 --- a/cpp/src/arrow/io/interfaces.h +++ b/cpp/src/arrow/io/interfaces.h @@ -97,7 +97,7 @@ class ARROW_EXPORT InputStream : virtual public FileInterface, public Readable { InputStream() {} }; -class ARROW_EXPORT ReadableFileInterface : public InputStream, public Seekable { +class ARROW_EXPORT RandomAccessFile : public InputStream, public Seekable { public: virtual Status GetSize(int64_t* size) = 0; @@ -118,7 +118,7 @@ class ARROW_EXPORT ReadableFileInterface : public InputStream, public Seekable { protected: std::mutex lock_; - ReadableFileInterface(); + RandomAccessFile(); }; class ARROW_EXPORT WriteableFileInterface : public OutputStream, public Seekable { @@ -129,12 +129,14 @@ class ARROW_EXPORT WriteableFileInterface : public OutputStream, public Seekable WriteableFileInterface() { set_mode(FileMode::READ); } }; -class ARROW_EXPORT ReadWriteFileInterface : public ReadableFileInterface, +class ARROW_EXPORT ReadWriteFileInterface : public RandomAccessFile, public WriteableFileInterface { protected: - ReadWriteFileInterface() { ReadableFileInterface::set_mode(FileMode::READWRITE); } + ReadWriteFileInterface() { RandomAccessFile::set_mode(FileMode::READWRITE); } }; +using ReadableFileInterface = RandomAccessFile; + } // namespace io } // namespace arrow diff --git a/cpp/src/arrow/io/memory.h b/cpp/src/arrow/io/memory.h index 82807508417d7..eb2a50912889e 100644 --- a/cpp/src/arrow/io/memory.h +++ b/cpp/src/arrow/io/memory.h @@ -66,7 +66,7 @@ class ARROW_EXPORT BufferOutputStream : public OutputStream { uint8_t* mutable_data_; }; -class ARROW_EXPORT BufferReader : public ReadableFileInterface { +class ARROW_EXPORT BufferReader : public RandomAccessFile { public: explicit BufferReader(const std::shared_ptr& buffer); BufferReader(const uint8_t* data, int64_t size); diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc index 406ce249eec32..db9f63ca18cbd 100644 --- a/cpp/src/arrow/ipc/adapter.cc +++ b/cpp/src/arrow/ipc/adapter.cc @@ -523,7 +523,7 @@ Status GetRecordBatchSize(const RecordBatch& batch, int64_t* size) { class IpcComponentSource : public ArrayComponentSource { public: - 
IpcComponentSource(const RecordBatchMetadata& metadata, io::ReadableFileInterface* file) + IpcComponentSource(const RecordBatchMetadata& metadata, io::RandomAccessFile* file) : metadata_(metadata), file_(file) {} Status GetBuffer(int buffer_index, std::shared_ptr* out) override { @@ -547,14 +547,14 @@ class IpcComponentSource : public ArrayComponentSource { private: const RecordBatchMetadata& metadata_; - io::ReadableFileInterface* file_; + io::RandomAccessFile* file_; }; class RecordBatchReader { public: RecordBatchReader(const RecordBatchMetadata& metadata, const std::shared_ptr& schema, int max_recursion_depth, - io::ReadableFileInterface* file) + io::RandomAccessFile* file) : metadata_(metadata), schema_(schema), max_recursion_depth_(max_recursion_depth), @@ -582,24 +582,24 @@ class RecordBatchReader { const RecordBatchMetadata& metadata_; std::shared_ptr schema_; int max_recursion_depth_; - io::ReadableFileInterface* file_; + io::RandomAccessFile* file_; }; Status ReadRecordBatch(const RecordBatchMetadata& metadata, - const std::shared_ptr& schema, io::ReadableFileInterface* file, + const std::shared_ptr& schema, io::RandomAccessFile* file, std::shared_ptr* out) { return ReadRecordBatch(metadata, schema, kMaxNestingDepth, file, out); } Status ReadRecordBatch(const RecordBatchMetadata& metadata, const std::shared_ptr& schema, int max_recursion_depth, - io::ReadableFileInterface* file, std::shared_ptr* out) { + io::RandomAccessFile* file, std::shared_ptr* out) { RecordBatchReader reader(metadata, schema, max_recursion_depth, file); return reader.Read(out); } Status ReadDictionary(const DictionaryBatchMetadata& metadata, - const DictionaryTypeMap& dictionary_types, io::ReadableFileInterface* file, + const DictionaryTypeMap& dictionary_types, io::RandomAccessFile* file, std::shared_ptr* out) { int64_t id = metadata.id(); auto it = dictionary_types.find(id); diff --git a/cpp/src/arrow/ipc/adapter.h b/cpp/src/arrow/ipc/adapter.h index 21d814db86530..cea4686077486 100644 --- a/cpp/src/arrow/ipc/adapter.h +++ b/cpp/src/arrow/ipc/adapter.h @@ -39,7 +39,7 @@ class Status; namespace io { -class ReadableFileInterface; +class RandomAccessFile; class OutputStream; } // namespace io @@ -87,15 +87,15 @@ Status GetRecordBatchSize(const RecordBatch& batch, int64_t* size); // "Read" path; does not copy data if the input supports zero copy reads Status ReadRecordBatch(const RecordBatchMetadata& metadata, - const std::shared_ptr& schema, io::ReadableFileInterface* file, + const std::shared_ptr& schema, io::RandomAccessFile* file, std::shared_ptr* out); Status ReadRecordBatch(const RecordBatchMetadata& metadata, const std::shared_ptr& schema, int max_recursion_depth, - io::ReadableFileInterface* file, std::shared_ptr* out); + io::RandomAccessFile* file, std::shared_ptr* out); Status ReadDictionary(const DictionaryBatchMetadata& metadata, - const DictionaryTypeMap& dictionary_types, io::ReadableFileInterface* file, + const DictionaryTypeMap& dictionary_types, io::RandomAccessFile* file, std::shared_ptr* out); } // namespace ipc diff --git a/cpp/src/arrow/ipc/feather.cc b/cpp/src/arrow/ipc/feather.cc index 1d165acccbd04..72bbaa4da3571 100644 --- a/cpp/src/arrow/ipc/feather.cc +++ b/cpp/src/arrow/ipc/feather.cc @@ -218,7 +218,7 @@ class TableReader::TableReaderImpl { public: TableReaderImpl() {} - Status Open(const std::shared_ptr& source) { + Status Open(const std::shared_ptr& source) { source_ = source; int magic_size = static_cast(strlen(kFeatherMagicBytes)); @@ -386,7 +386,7 @@ class 
TableReader::TableReaderImpl { } private: - std::shared_ptr source_; + std::shared_ptr source_; std::unique_ptr metadata_; std::shared_ptr schema_; @@ -401,7 +401,7 @@ TableReader::TableReader() { TableReader::~TableReader() {} -Status TableReader::Open(const std::shared_ptr& source) { +Status TableReader::Open(const std::shared_ptr& source) { return impl_->Open(source); } diff --git a/cpp/src/arrow/ipc/feather.h b/cpp/src/arrow/ipc/feather.h index 3d370dfe02bd0..1e4ba58255456 100644 --- a/cpp/src/arrow/ipc/feather.h +++ b/cpp/src/arrow/ipc/feather.h @@ -37,7 +37,7 @@ class Status; namespace io { class OutputStream; -class ReadableFileInterface; +class RandomAccessFile; } // namespace io @@ -54,7 +54,7 @@ class ARROW_EXPORT TableReader { TableReader(); ~TableReader(); - Status Open(const std::shared_ptr& source); + Status Open(const std::shared_ptr& source); static Status OpenFile(const std::string& abspath, std::unique_ptr* out); diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc index 36a675f5f94f7..638d98af8244d 100644 --- a/cpp/src/arrow/ipc/ipc-adapter-test.cc +++ b/cpp/src/arrow/ipc/ipc-adapter-test.cc @@ -61,7 +61,7 @@ class IpcTestFixture : public io::MemoryMapFixture { auto metadata = std::make_shared(message); // The buffer offsets start at 0, so we must construct a - // ReadableFileInterface according to that frame of reference + // RandomAccessFile according to that frame of reference std::shared_ptr buffer_payload; RETURN_NOT_OK(mmap_->ReadAt(metadata_length, body_length, &buffer_payload)); io::BufferReader buffer_reader(buffer_payload); diff --git a/cpp/src/arrow/ipc/json.h b/cpp/src/arrow/ipc/json.h index 88afdfaa5ff3b..0d88cef9a4d7b 100644 --- a/cpp/src/arrow/ipc/json.h +++ b/cpp/src/arrow/ipc/json.h @@ -31,7 +31,7 @@ namespace arrow { namespace io { class OutputStream; -class ReadableFileInterface; +class RandomAccessFile; } // namespace io diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index 695e7886e3124..71bc5c9eb3207 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -359,7 +359,7 @@ const RecordBatchMetadata& DictionaryBatchMetadata::record_batch() const { // Conveniences Status ReadMessage(int64_t offset, int32_t metadata_length, - io::ReadableFileInterface* file, std::shared_ptr* message) { + io::RandomAccessFile* file, std::shared_ptr* message) { std::shared_ptr buffer; RETURN_NOT_OK(file->ReadAt(offset, metadata_length, &buffer)); diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h index f6a0a3a073faa..4eb0186d3a467 100644 --- a/cpp/src/arrow/ipc/metadata.h +++ b/cpp/src/arrow/ipc/metadata.h @@ -41,7 +41,7 @@ class Status; namespace io { class OutputStream; -class ReadableFileInterface; +class RandomAccessFile; } // namespace io @@ -219,7 +219,7 @@ class ARROW_EXPORT Message { /// \param[out] message the message read /// \return Status success or failure Status ReadMessage(int64_t offset, int32_t metadata_length, - io::ReadableFileInterface* file, std::shared_ptr* message); + io::RandomAccessFile* file, std::shared_ptr* message); } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/reader.cc b/cpp/src/arrow/ipc/reader.cc index 4cb5f6cccc4c8..95753643c6513 100644 --- a/cpp/src/arrow/ipc/reader.cc +++ b/cpp/src/arrow/ipc/reader.cc @@ -308,7 +308,7 @@ class FileReader::FileReaderImpl { } Status Open( - const std::shared_ptr& file, int64_t footer_offset) { + const std::shared_ptr& file, int64_t footer_offset) { file_ = file; 
footer_offset_ = footer_offset; RETURN_NOT_OK(ReadFooter()); @@ -318,7 +318,7 @@ class FileReader::FileReaderImpl { std::shared_ptr schema() const { return schema_; } private: - std::shared_ptr file_; + std::shared_ptr file_; // The location where the Arrow file layout ends. May be the end of the file // or some other location if embedded in a larger file. @@ -342,14 +342,14 @@ FileReader::FileReader() { FileReader::~FileReader() {} -Status FileReader::Open(const std::shared_ptr& file, +Status FileReader::Open(const std::shared_ptr& file, std::shared_ptr* reader) { int64_t footer_offset; RETURN_NOT_OK(file->GetSize(&footer_offset)); return Open(file, footer_offset, reader); } -Status FileReader::Open(const std::shared_ptr& file, +Status FileReader::Open(const std::shared_ptr& file, int64_t footer_offset, std::shared_ptr* reader) { *reader = std::shared_ptr(new FileReader()); return (*reader)->impl_->Open(file, footer_offset); diff --git a/cpp/src/arrow/ipc/reader.h b/cpp/src/arrow/ipc/reader.h index 6f143e1a1265e..ca91765edbac1 100644 --- a/cpp/src/arrow/ipc/reader.h +++ b/cpp/src/arrow/ipc/reader.h @@ -37,7 +37,7 @@ class Status; namespace io { class InputStream; -class ReadableFileInterface; +class RandomAccessFile; } // namespace io @@ -72,7 +72,7 @@ class ARROW_EXPORT FileReader { // can be any amount of data preceding the Arrow-formatted data, because we // need only locate the end of the Arrow file stream to discover the metadata // and then proceed to read the data into memory. - static Status Open(const std::shared_ptr& file, + static Status Open(const std::shared_ptr& file, std::shared_ptr* reader); // If the file is embedded within some larger file or memory region, you can @@ -82,7 +82,7 @@ class ARROW_EXPORT FileReader { // // @param file: the data source // @param footer_offset: the position of the end of the Arrow "file" - static Status Open(const std::shared_ptr& file, + static Status Open(const std::shared_ptr& file, int64_t footer_offset, std::shared_ptr* reader); /// The schema includes any dictionaries diff --git a/python/pyarrow/_feather.pyx b/python/pyarrow/_feather.pyx index 67f734f6ed77c..beb4aaad44618 100644 --- a/python/pyarrow/_feather.pyx +++ b/python/pyarrow/_feather.pyx @@ -23,7 +23,7 @@ from cython.operator cimport dereference as deref from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport CArray, CColumn, CSchema, CStatus -from pyarrow.includes.libarrow_io cimport ReadableFileInterface, OutputStream +from pyarrow.includes.libarrow_io cimport RandomAccessFile, OutputStream from libcpp.string cimport string from libcpp cimport bool as c_bool @@ -53,7 +53,7 @@ cdef extern from "arrow/ipc/feather.h" namespace "arrow::ipc::feather" nogil: CStatus Finalize() cdef cppclass TableReader: - TableReader(const shared_ptr[ReadableFileInterface]& source) + TableReader(const shared_ptr[RandomAccessFile]& source) @staticmethod CStatus OpenFile(const string& abspath, unique_ptr[TableReader]* out) diff --git a/python/pyarrow/_parquet.pxd b/python/pyarrow/_parquet.pxd index e106252189f42..cf9ec8e787661 100644 --- a/python/pyarrow/_parquet.pxd +++ b/python/pyarrow/_parquet.pxd @@ -20,7 +20,7 @@ from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport (CArray, CSchema, CStatus, CTable, CMemoryPool) -from pyarrow.includes.libarrow_io cimport ReadableFileInterface, OutputStream +from pyarrow.includes.libarrow_io cimport RandomAccessFile, OutputStream cdef extern from "parquet/api/schema.h" namespace "parquet::schema" nogil: @@ -173,7 
+173,7 @@ cdef extern from "parquet/api/reader.h" namespace "parquet" nogil: cdef cppclass ParquetFileReader: @staticmethod unique_ptr[ParquetFileReader] Open( - const shared_ptr[ReadableFileInterface]& file, + const shared_ptr[RandomAccessFile]& file, const ReaderProperties& props, const shared_ptr[CFileMetaData]& metadata) @@ -203,7 +203,7 @@ cdef extern from "parquet/api/writer.h" namespace "parquet" nogil: cdef extern from "parquet/arrow/reader.h" namespace "parquet::arrow" nogil: - CStatus OpenFile(const shared_ptr[ReadableFileInterface]& file, + CStatus OpenFile(const shared_ptr[RandomAccessFile]& file, CMemoryPool* allocator, const ReaderProperties& properties, const shared_ptr[CFileMetaData]& metadata, diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx index 08c7bb5d8b1bc..8e67da9f75a6e 100644 --- a/python/pyarrow/_parquet.pyx +++ b/python/pyarrow/_parquet.pyx @@ -23,7 +23,7 @@ from cython.operator cimport dereference as deref from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport * -from pyarrow.includes.libarrow_io cimport (ReadableFileInterface, OutputStream, +from pyarrow.includes.libarrow_io cimport (RandomAccessFile, OutputStream, FileOutputStream) cimport pyarrow.includes.pyarrow as pyarrow @@ -354,7 +354,7 @@ cdef class ParquetReader: def open(self, object source, FileMetaData metadata=None): cdef: - shared_ptr[ReadableFileInterface] rd_handle + shared_ptr[RandomAccessFile] rd_handle shared_ptr[CFileMetaData] c_metadata ReaderProperties properties = default_reader_properties() c_string path diff --git a/python/pyarrow/includes/libarrow_io.pxd b/python/pyarrow/includes/libarrow_io.pxd index 8d0d5248b4db0..5992c737df512 100644 --- a/python/pyarrow/includes/libarrow_io.pxd +++ b/python/pyarrow/includes/libarrow_io.pxd @@ -51,7 +51,7 @@ cdef extern from "arrow/io/interfaces.h" namespace "arrow::io" nogil: cdef cppclass InputStream(FileInterface, Readable): pass - cdef cppclass ReadableFileInterface(InputStream, Seekable): + cdef cppclass RandomAccessFile(InputStream, Seekable): CStatus GetSize(int64_t* size) CStatus ReadAt(int64_t position, int64_t nbytes, @@ -63,7 +63,7 @@ cdef extern from "arrow/io/interfaces.h" namespace "arrow::io" nogil: CStatus WriteAt(int64_t position, const uint8_t* data, int64_t nbytes) - cdef cppclass ReadWriteFileInterface(ReadableFileInterface, + cdef cppclass ReadWriteFileInterface(RandomAccessFile, WriteableFileInterface): pass @@ -77,7 +77,7 @@ cdef extern from "arrow/io/file.h" namespace "arrow::io" nogil: int file_descriptor() - cdef cppclass ReadableFile(ReadableFileInterface): + cdef cppclass ReadableFile(RandomAccessFile): @staticmethod CStatus Open(const c_string& path, shared_ptr[ReadableFile]* file) @@ -123,7 +123,7 @@ cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: int64_t block_size int16_t permissions - cdef cppclass HdfsReadableFile(ReadableFileInterface): + cdef cppclass HdfsReadableFile(RandomAccessFile): pass cdef cppclass HdfsOutputStream(OutputStream): @@ -163,7 +163,7 @@ cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: cdef extern from "arrow/io/memory.h" namespace "arrow::io" nogil: cdef cppclass CBufferReader" arrow::io::BufferReader"\ - (ReadableFileInterface): + (RandomAccessFile): CBufferReader(const shared_ptr[CBuffer]& buffer) CBufferReader(const uint8_t* data, int64_t nbytes) diff --git a/python/pyarrow/includes/libarrow_ipc.pxd b/python/pyarrow/includes/libarrow_ipc.pxd index 10c70a96b0ab2..8b7d705afd4e7 100644 --- 
a/python/pyarrow/includes/libarrow_ipc.pxd +++ b/python/pyarrow/includes/libarrow_ipc.pxd @@ -20,7 +20,7 @@ from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport (CArray, CSchema, CRecordBatch) from pyarrow.includes.libarrow_io cimport (InputStream, OutputStream, - ReadableFileInterface) + RandomAccessFile) cdef extern from "arrow/ipc/api.h" namespace "arrow::ipc" nogil: @@ -51,11 +51,11 @@ cdef extern from "arrow/ipc/api.h" namespace "arrow::ipc" nogil: cdef cppclass CFileReader " arrow::ipc::FileReader": @staticmethod - CStatus Open(const shared_ptr[ReadableFileInterface]& file, + CStatus Open(const shared_ptr[RandomAccessFile]& file, shared_ptr[CFileReader]* out) @staticmethod - CStatus Open2" Open"(const shared_ptr[ReadableFileInterface]& file, + CStatus Open2" Open"(const shared_ptr[RandomAccessFile]& file, int64_t footer_offset, shared_ptr[CFileReader]* out) shared_ptr[CSchema] schema() diff --git a/python/pyarrow/includes/pyarrow.pxd b/python/pyarrow/includes/pyarrow.pxd index 9fbddba3d10c5..805950bd1476a 100644 --- a/python/pyarrow/includes/pyarrow.pxd +++ b/python/pyarrow/includes/pyarrow.pxd @@ -60,7 +60,7 @@ cdef extern from "pyarrow/common.h" namespace "arrow::py" nogil: cdef extern from "pyarrow/io.h" namespace "arrow::py" nogil: - cdef cppclass PyReadableFile(arrow_io.ReadableFileInterface): + cdef cppclass PyReadableFile(arrow_io.RandomAccessFile): PyReadableFile(object fo) cdef cppclass PyOutputStream(arrow_io.OutputStream): diff --git a/python/pyarrow/io.pxd b/python/pyarrow/io.pxd index 3d73e1143e15a..cffd29ab39111 100644 --- a/python/pyarrow/io.pxd +++ b/python/pyarrow/io.pxd @@ -19,7 +19,7 @@ from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport * -from pyarrow.includes.libarrow_io cimport (ReadableFileInterface, +from pyarrow.includes.libarrow_io cimport (RandomAccessFile, OutputStream) cdef class Buffer: @@ -32,7 +32,7 @@ cdef class Buffer: cdef class NativeFile: cdef: - shared_ptr[ReadableFileInterface] rd_file + shared_ptr[RandomAccessFile] rd_file shared_ptr[OutputStream] wr_file bint is_readable bint is_writeable @@ -43,8 +43,8 @@ cdef class NativeFile: # extension classes are technically virtual in the C++ sense) we can expose # the arrow::io abstract file interfaces to other components throughout the # suite of Arrow C++ libraries - cdef read_handle(self, shared_ptr[ReadableFileInterface]* file) + cdef read_handle(self, shared_ptr[RandomAccessFile]* file) cdef write_handle(self, shared_ptr[OutputStream]* file) -cdef get_reader(object source, shared_ptr[ReadableFileInterface]* reader) +cdef get_reader(object source, shared_ptr[RandomAccessFile]* reader) cdef get_writer(object source, shared_ptr[OutputStream]* writer) diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index 240ea240c3abe..17b43dedb0a5f 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -81,9 +81,9 @@ cdef class NativeFile: check_status(self.wr_file.get().Close()) self.is_open = False - cdef read_handle(self, shared_ptr[ReadableFileInterface]* file): + cdef read_handle(self, shared_ptr[RandomAccessFile]* file): self._assert_readable() - file[0] = self.rd_file + file[0] = self.rd_file cdef write_handle(self, shared_ptr[OutputStream]* file): self._assert_writeable() @@ -361,7 +361,7 @@ cdef class MemoryMappedFile(NativeFile): check_status(CMemoryMappedFile.Open(c_path, c_mode, &handle)) self.wr_file = handle - self.rd_file = handle + self.rd_file = handle self.is_open = True @@ -398,7 +398,7 @@ cdef class OSFile(NativeFile): 
check_status(ReadableFile.Open(path, pool, &handle)) self.is_readable = 1 - self.rd_file = handle + self.rd_file = handle cdef _open_writeable(self, c_string path): cdef shared_ptr[FileOutputStream] handle @@ -536,7 +536,7 @@ cdef Buffer wrap_buffer(const shared_ptr[CBuffer]& buf): return result -cdef get_reader(object source, shared_ptr[ReadableFileInterface]* reader): +cdef get_reader(object source, shared_ptr[RandomAccessFile]* reader): cdef NativeFile nf if isinstance(source, six.string_types): @@ -815,7 +815,7 @@ cdef class _HdfsClient: check_status(self.client.get() .OpenReadable(c_path, &rd_handle)) - out.rd_file = rd_handle + out.rd_file = rd_handle out.is_readable = True out.is_writeable = 0 @@ -924,7 +924,7 @@ cdef class _StreamReader: def _open(self, source): cdef: - shared_ptr[ReadableFileInterface] reader + shared_ptr[RandomAccessFile] reader shared_ptr[InputStream] in_stream get_reader(source, &reader) @@ -996,7 +996,7 @@ cdef class _FileReader: pass def _open(self, source, footer_offset=None): - cdef shared_ptr[ReadableFileInterface] reader + cdef shared_ptr[RandomAccessFile] reader get_reader(source, &reader) cdef int64_t offset = 0 diff --git a/python/src/pyarrow/io.h b/python/src/pyarrow/io.h index a603e81622545..e38cd81775255 100644 --- a/python/src/pyarrow/io.h +++ b/python/src/pyarrow/io.h @@ -49,7 +49,7 @@ class PythonFile { PyObject* file_; }; -class ARROW_EXPORT PyReadableFile : public io::ReadableFileInterface { +class ARROW_EXPORT PyReadableFile : public io::RandomAccessFile { public: explicit PyReadableFile(PyObject* file); virtual ~PyReadableFile(); From c13d671b15cc701e396b11f17e6d36a339dd5210 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Thu, 16 Mar 2017 20:50:10 -0400 Subject: [PATCH 0375/1644] ARROW-644: Python: Cython should be a setup-only requirement Author: Uwe L. Korn Closes #392 from xhochy/ARROW-644 and squashes the following commits: 5d99895 [Uwe L. Korn] ARROW-644: Python: Cython should be a setup-only requirement --- python/setup.py | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/python/setup.py b/python/setup.py index a0573fe1fccff..9abf9854af2a8 100644 --- a/python/setup.py +++ b/python/setup.py @@ -321,8 +321,8 @@ def get_outputs(self): 'build_ext': build_ext }, use_scm_version = {"root": "..", "relative_to": __file__}, - setup_requires=['setuptools_scm'], - install_requires=['cython >= 0.23', 'numpy >= 1.9', 'six >= 1.0.0'], + setup_requires=['setuptools_scm', 'cython >= 0.23'], + install_requires=['numpy >= 1.9', 'six >= 1.0.0'], test_requires=['pytest'], description="Python library for Apache Arrow", long_description=long_description, From 39c7274fc36b5f405f1dbfa48067dde52abec5ce Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Thu, 16 Mar 2017 21:09:38 -0400 Subject: [PATCH 0376/1644] ARROW-631: [GLib] Import See also https://issues.apache.org/jira/browse/ARROW-631 and `glib/README.md` in this change. 
Author: Kouhei Sutou Closes #382 from kou/glib-import and squashes the following commits: 67a5d24 [Kouhei Sutou] [GLib] Rename directory to c_glib/ from glib/ 24cd605 [Kouhei Sutou] [GLib] Import --- .travis.yml | 5 + c_glib/.gitignore | 44 ++ c_glib/Makefile.am | 26 ++ c_glib/README.md | 114 +++++ c_glib/arrow-glib/Makefile.am | 494 ++++++++++++++++++++ c_glib/arrow-glib/array-builder.cpp | 229 +++++++++ c_glib/arrow-glib/array-builder.h | 70 +++ c_glib/arrow-glib/array-builder.hpp | 26 ++ c_glib/arrow-glib/array.cpp | 268 +++++++++++ c_glib/arrow-glib/array.h | 67 +++ c_glib/arrow-glib/array.hpp | 27 ++ c_glib/arrow-glib/arrow-glib.h | 80 ++++ c_glib/arrow-glib/arrow-glib.hpp | 37 ++ c_glib/arrow-glib/arrow-glib.pc.in | 28 ++ c_glib/arrow-glib/arrow-io-glib.h | 32 ++ c_glib/arrow-glib/arrow-io-glib.hpp | 30 ++ c_glib/arrow-glib/arrow-io-glib.pc.in | 28 ++ c_glib/arrow-glib/arrow-ipc-glib.h | 27 ++ c_glib/arrow-glib/arrow-ipc-glib.hpp | 30 ++ c_glib/arrow-glib/arrow-ipc-glib.pc.in | 28 ++ c_glib/arrow-glib/binary-array-builder.cpp | 122 +++++ c_glib/arrow-glib/binary-array-builder.h | 77 +++ c_glib/arrow-glib/binary-array.cpp | 73 +++ c_glib/arrow-glib/binary-array.h | 72 +++ c_glib/arrow-glib/binary-data-type.cpp | 67 +++ c_glib/arrow-glib/binary-data-type.h | 69 +++ c_glib/arrow-glib/boolean-array-builder.cpp | 120 +++++ c_glib/arrow-glib/boolean-array-builder.h | 76 +++ c_glib/arrow-glib/boolean-array.cpp | 69 +++ c_glib/arrow-glib/boolean-array.h | 70 +++ c_glib/arrow-glib/boolean-data-type.cpp | 67 +++ c_glib/arrow-glib/boolean-data-type.h | 69 +++ c_glib/arrow-glib/chunked-array.cpp | 241 ++++++++++ c_glib/arrow-glib/chunked-array.h | 78 ++++ c_glib/arrow-glib/chunked-array.hpp | 27 ++ c_glib/arrow-glib/column.cpp | 262 +++++++++++ c_glib/arrow-glib/column.h | 82 ++++ c_glib/arrow-glib/column.hpp | 27 ++ c_glib/arrow-glib/data-type.cpp | 260 +++++++++++ c_glib/arrow-glib/data-type.h | 72 +++ c_glib/arrow-glib/data-type.hpp | 27 ++ c_glib/arrow-glib/double-array-builder.cpp | 120 +++++ c_glib/arrow-glib/double-array-builder.h | 76 +++ c_glib/arrow-glib/double-array.cpp | 69 +++ c_glib/arrow-glib/double-array.h | 71 +++ c_glib/arrow-glib/double-data-type.cpp | 68 +++ c_glib/arrow-glib/double-data-type.h | 70 +++ c_glib/arrow-glib/enums.c.template | 56 +++ c_glib/arrow-glib/enums.h.template | 41 ++ c_glib/arrow-glib/error.cpp | 81 ++++ c_glib/arrow-glib/error.h | 54 +++ c_glib/arrow-glib/error.hpp | 28 ++ c_glib/arrow-glib/field.cpp | 250 ++++++++++ c_glib/arrow-glib/field.h | 83 ++++ c_glib/arrow-glib/field.hpp | 27 ++ c_glib/arrow-glib/float-array-builder.cpp | 120 +++++ c_glib/arrow-glib/float-array-builder.h | 76 +++ c_glib/arrow-glib/float-array.cpp | 69 +++ c_glib/arrow-glib/float-array.h | 71 +++ c_glib/arrow-glib/float-data-type.cpp | 68 +++ c_glib/arrow-glib/float-data-type.h | 69 +++ c_glib/arrow-glib/int16-array-builder.cpp | 120 +++++ c_glib/arrow-glib/int16-array-builder.h | 76 +++ c_glib/arrow-glib/int16-array.cpp | 69 +++ c_glib/arrow-glib/int16-array.h | 71 +++ c_glib/arrow-glib/int16-data-type.cpp | 67 +++ c_glib/arrow-glib/int16-data-type.h | 69 +++ c_glib/arrow-glib/int32-array-builder.cpp | 120 +++++ c_glib/arrow-glib/int32-array-builder.h | 76 +++ c_glib/arrow-glib/int32-array.cpp | 69 +++ c_glib/arrow-glib/int32-array.h | 71 +++ c_glib/arrow-glib/int32-data-type.cpp | 67 +++ c_glib/arrow-glib/int32-data-type.h | 69 +++ c_glib/arrow-glib/int64-array-builder.cpp | 120 +++++ c_glib/arrow-glib/int64-array-builder.h | 76 +++ c_glib/arrow-glib/int64-array.cpp | 69 +++ 
c_glib/arrow-glib/int64-array.h | 71 +++ c_glib/arrow-glib/int64-data-type.cpp | 67 +++ c_glib/arrow-glib/int64-data-type.h | 69 +++ c_glib/arrow-glib/int8-array-builder.cpp | 120 +++++ c_glib/arrow-glib/int8-array-builder.h | 76 +++ c_glib/arrow-glib/int8-array.cpp | 69 +++ c_glib/arrow-glib/int8-array.h | 71 +++ c_glib/arrow-glib/int8-data-type.cpp | 67 +++ c_glib/arrow-glib/int8-data-type.h | 69 +++ c_glib/arrow-glib/io-enums.c.template | 56 +++ c_glib/arrow-glib/io-enums.h.template | 41 ++ c_glib/arrow-glib/io-file-mode.cpp | 63 +++ c_glib/arrow-glib/io-file-mode.h | 40 ++ c_glib/arrow-glib/io-file-mode.hpp | 27 ++ c_glib/arrow-glib/io-file-output-stream.cpp | 231 +++++++++ c_glib/arrow-glib/io-file-output-stream.h | 72 +++ c_glib/arrow-glib/io-file-output-stream.hpp | 28 ++ c_glib/arrow-glib/io-file.cpp | 116 +++++ c_glib/arrow-glib/io-file.h | 51 ++ c_glib/arrow-glib/io-file.hpp | 38 ++ c_glib/arrow-glib/io-input-stream.cpp | 56 +++ c_glib/arrow-glib/io-input-stream.h | 45 ++ c_glib/arrow-glib/io-input-stream.hpp | 38 ++ c_glib/arrow-glib/io-memory-mapped-file.cpp | 287 ++++++++++++ c_glib/arrow-glib/io-memory-mapped-file.h | 72 +++ c_glib/arrow-glib/io-memory-mapped-file.hpp | 28 ++ c_glib/arrow-glib/io-output-stream.cpp | 56 +++ c_glib/arrow-glib/io-output-stream.h | 45 ++ c_glib/arrow-glib/io-output-stream.hpp | 38 ++ c_glib/arrow-glib/io-readable-file.cpp | 127 +++++ c_glib/arrow-glib/io-readable-file.h | 55 +++ c_glib/arrow-glib/io-readable-file.hpp | 38 ++ c_glib/arrow-glib/io-readable.cpp | 84 ++++ c_glib/arrow-glib/io-readable.h | 51 ++ c_glib/arrow-glib/io-readable.hpp | 38 ++ c_glib/arrow-glib/io-writeable-file.cpp | 84 ++++ c_glib/arrow-glib/io-writeable-file.h | 51 ++ c_glib/arrow-glib/io-writeable-file.hpp | 38 ++ c_glib/arrow-glib/io-writeable.cpp | 106 +++++ c_glib/arrow-glib/io-writeable.h | 52 +++ c_glib/arrow-glib/io-writeable.hpp | 38 ++ c_glib/arrow-glib/ipc-enums.c.template | 56 +++ c_glib/arrow-glib/ipc-enums.h.template | 41 ++ c_glib/arrow-glib/ipc-file-reader.cpp | 247 ++++++++++ c_glib/arrow-glib/ipc-file-reader.h | 83 ++++ c_glib/arrow-glib/ipc-file-reader.hpp | 28 ++ c_glib/arrow-glib/ipc-file-writer.cpp | 158 +++++++ c_glib/arrow-glib/ipc-file-writer.h | 78 ++++ c_glib/arrow-glib/ipc-file-writer.hpp | 28 ++ c_glib/arrow-glib/ipc-metadata-version.cpp | 59 +++ c_glib/arrow-glib/ipc-metadata-version.h | 39 ++ c_glib/arrow-glib/ipc-metadata-version.hpp | 27 ++ c_glib/arrow-glib/ipc-stream-reader.cpp | 221 +++++++++ c_glib/arrow-glib/ipc-stream-reader.h | 80 ++++ c_glib/arrow-glib/ipc-stream-reader.hpp | 28 ++ c_glib/arrow-glib/ipc-stream-writer.cpp | 232 +++++++++ c_glib/arrow-glib/ipc-stream-writer.h | 82 ++++ c_glib/arrow-glib/ipc-stream-writer.hpp | 28 ++ c_glib/arrow-glib/list-array-builder.cpp | 173 +++++++ c_glib/arrow-glib/list-array-builder.h | 77 +++ c_glib/arrow-glib/list-array.cpp | 92 ++++ c_glib/arrow-glib/list-array.h | 73 +++ c_glib/arrow-glib/list-data-type.cpp | 91 ++++ c_glib/arrow-glib/list-data-type.h | 73 +++ c_glib/arrow-glib/null-array.cpp | 69 +++ c_glib/arrow-glib/null-array.h | 70 +++ c_glib/arrow-glib/null-data-type.cpp | 67 +++ c_glib/arrow-glib/null-data-type.h | 69 +++ c_glib/arrow-glib/record-batch.cpp | 288 ++++++++++++ c_glib/arrow-glib/record-batch.h | 85 ++++ c_glib/arrow-glib/record-batch.hpp | 27 ++ c_glib/arrow-glib/schema.cpp | 245 ++++++++++ c_glib/arrow-glib/schema.h | 80 ++++ c_glib/arrow-glib/schema.hpp | 27 ++ c_glib/arrow-glib/string-array-builder.cpp | 97 ++++ c_glib/arrow-glib/string-array-builder.h | 74 +++ 
c_glib/arrow-glib/string-array.cpp | 74 +++ c_glib/arrow-glib/string-array.h | 71 +++ c_glib/arrow-glib/string-data-type.cpp | 68 +++ c_glib/arrow-glib/string-data-type.h | 69 +++ c_glib/arrow-glib/struct-array-builder.cpp | 187 ++++++++ c_glib/arrow-glib/struct-array-builder.h | 81 ++++ c_glib/arrow-glib/struct-array.cpp | 97 ++++ c_glib/arrow-glib/struct-array.h | 73 +++ c_glib/arrow-glib/struct-data-type.cpp | 75 +++ c_glib/arrow-glib/struct-data-type.h | 71 +++ c_glib/arrow-glib/table.cpp | 240 ++++++++++ c_glib/arrow-glib/table.h | 80 ++++ c_glib/arrow-glib/table.hpp | 27 ++ c_glib/arrow-glib/type.cpp | 90 ++++ c_glib/arrow-glib/type.h | 84 ++++ c_glib/arrow-glib/type.hpp | 26 ++ c_glib/arrow-glib/uint16-array-builder.cpp | 120 +++++ c_glib/arrow-glib/uint16-array-builder.h | 76 +++ c_glib/arrow-glib/uint16-array.cpp | 69 +++ c_glib/arrow-glib/uint16-array.h | 71 +++ c_glib/arrow-glib/uint16-data-type.cpp | 67 +++ c_glib/arrow-glib/uint16-data-type.h | 69 +++ c_glib/arrow-glib/uint32-array-builder.cpp | 120 +++++ c_glib/arrow-glib/uint32-array-builder.h | 76 +++ c_glib/arrow-glib/uint32-array.cpp | 69 +++ c_glib/arrow-glib/uint32-array.h | 71 +++ c_glib/arrow-glib/uint32-data-type.cpp | 67 +++ c_glib/arrow-glib/uint32-data-type.h | 69 +++ c_glib/arrow-glib/uint64-array-builder.cpp | 120 +++++ c_glib/arrow-glib/uint64-array-builder.h | 76 +++ c_glib/arrow-glib/uint64-array.cpp | 69 +++ c_glib/arrow-glib/uint64-array.h | 71 +++ c_glib/arrow-glib/uint64-data-type.cpp | 67 +++ c_glib/arrow-glib/uint64-data-type.h | 69 +++ c_glib/arrow-glib/uint8-array-builder.cpp | 120 +++++ c_glib/arrow-glib/uint8-array-builder.h | 76 +++ c_glib/arrow-glib/uint8-array.cpp | 69 +++ c_glib/arrow-glib/uint8-array.h | 71 +++ c_glib/arrow-glib/uint8-data-type.cpp | 67 +++ c_glib/arrow-glib/uint8-data-type.h | 69 +++ c_glib/autogen.sh | 31 ++ c_glib/configure.ac | 76 +++ c_glib/doc/Makefile.am | 19 + c_glib/doc/reference/Makefile.am | 63 +++ c_glib/doc/reference/arrow-glib-docs.sgml | 171 +++++++ c_glib/example/Makefile.am | 34 ++ c_glib/example/build.c | 71 +++ c_glib/test/helper/buildable.rb | 77 +++ c_glib/test/run-test.rb | 41 ++ c_glib/test/run-test.sh | 29 ++ c_glib/test/test-array.rb | 44 ++ c_glib/test/test-binary-array.rb | 25 + c_glib/test/test-binary-data-type.rb | 28 ++ c_glib/test/test-boolean-array.rb | 25 + c_glib/test/test-boolean-data-type.rb | 28 ++ c_glib/test/test-chunked-array.rb | 67 +++ c_glib/test/test-column.rb | 86 ++++ c_glib/test/test-double-array.rb | 25 + c_glib/test/test-double-data-type.rb | 28 ++ c_glib/test/test-field.rb | 41 ++ c_glib/test/test-float-array.rb | 25 + c_glib/test/test-float-data-type.rb | 28 ++ c_glib/test/test-int16-array.rb | 25 + c_glib/test/test-int16-data-type.rb | 28 ++ c_glib/test/test-int32-array.rb | 25 + c_glib/test/test-int32-data-type.rb | 28 ++ c_glib/test/test-int64-array.rb | 25 + c_glib/test/test-int64-data-type.rb | 28 ++ c_glib/test/test-int8-array.rb | 25 + c_glib/test/test-int8-data-type.rb | 28 ++ c_glib/test/test-io-file-output-stream.rb | 38 ++ c_glib/test/test-io-memory-mapped-file.rb | 138 ++++++ c_glib/test/test-ipc-file-writer.rb | 45 ++ c_glib/test/test-ipc-stream-writer.rb | 53 +++ c_glib/test/test-list-array.rb | 43 ++ c_glib/test/test-list-data-type.rb | 36 ++ c_glib/test/test-null-array.rb | 33 ++ c_glib/test/test-null-data-type.rb | 28 ++ c_glib/test/test-record-batch.rb | 80 ++++ c_glib/test/test-schema.rb | 69 +++ c_glib/test/test-string-array.rb | 25 + c_glib/test/test-string-data-type.rb | 28 ++ 
c_glib/test/test-struct-array.rb | 58 +++ c_glib/test/test-table.rb | 72 +++ c_glib/test/test-uint16-array.rb | 25 + c_glib/test/test-uint16-data-type.rb | 28 ++ c_glib/test/test-uint32-array.rb | 25 + c_glib/test/test-uint32-data-type.rb | 28 ++ c_glib/test/test-uint64-array.rb | 25 + c_glib/test/test-uint64-data-type.rb | 28 ++ c_glib/test/test-uint8-array.rb | 25 + c_glib/test/test-uint8-data-type.rb | 28 ++ ci/travis_before_script_c_glib.sh | 40 ++ ci/travis_script_c_glib.sh | 24 + 246 files changed, 18251 insertions(+) create mode 100644 c_glib/.gitignore create mode 100644 c_glib/Makefile.am create mode 100644 c_glib/README.md create mode 100644 c_glib/arrow-glib/Makefile.am create mode 100644 c_glib/arrow-glib/array-builder.cpp create mode 100644 c_glib/arrow-glib/array-builder.h create mode 100644 c_glib/arrow-glib/array-builder.hpp create mode 100644 c_glib/arrow-glib/array.cpp create mode 100644 c_glib/arrow-glib/array.h create mode 100644 c_glib/arrow-glib/array.hpp create mode 100644 c_glib/arrow-glib/arrow-glib.h create mode 100644 c_glib/arrow-glib/arrow-glib.hpp create mode 100644 c_glib/arrow-glib/arrow-glib.pc.in create mode 100644 c_glib/arrow-glib/arrow-io-glib.h create mode 100644 c_glib/arrow-glib/arrow-io-glib.hpp create mode 100644 c_glib/arrow-glib/arrow-io-glib.pc.in create mode 100644 c_glib/arrow-glib/arrow-ipc-glib.h create mode 100644 c_glib/arrow-glib/arrow-ipc-glib.hpp create mode 100644 c_glib/arrow-glib/arrow-ipc-glib.pc.in create mode 100644 c_glib/arrow-glib/binary-array-builder.cpp create mode 100644 c_glib/arrow-glib/binary-array-builder.h create mode 100644 c_glib/arrow-glib/binary-array.cpp create mode 100644 c_glib/arrow-glib/binary-array.h create mode 100644 c_glib/arrow-glib/binary-data-type.cpp create mode 100644 c_glib/arrow-glib/binary-data-type.h create mode 100644 c_glib/arrow-glib/boolean-array-builder.cpp create mode 100644 c_glib/arrow-glib/boolean-array-builder.h create mode 100644 c_glib/arrow-glib/boolean-array.cpp create mode 100644 c_glib/arrow-glib/boolean-array.h create mode 100644 c_glib/arrow-glib/boolean-data-type.cpp create mode 100644 c_glib/arrow-glib/boolean-data-type.h create mode 100644 c_glib/arrow-glib/chunked-array.cpp create mode 100644 c_glib/arrow-glib/chunked-array.h create mode 100644 c_glib/arrow-glib/chunked-array.hpp create mode 100644 c_glib/arrow-glib/column.cpp create mode 100644 c_glib/arrow-glib/column.h create mode 100644 c_glib/arrow-glib/column.hpp create mode 100644 c_glib/arrow-glib/data-type.cpp create mode 100644 c_glib/arrow-glib/data-type.h create mode 100644 c_glib/arrow-glib/data-type.hpp create mode 100644 c_glib/arrow-glib/double-array-builder.cpp create mode 100644 c_glib/arrow-glib/double-array-builder.h create mode 100644 c_glib/arrow-glib/double-array.cpp create mode 100644 c_glib/arrow-glib/double-array.h create mode 100644 c_glib/arrow-glib/double-data-type.cpp create mode 100644 c_glib/arrow-glib/double-data-type.h create mode 100644 c_glib/arrow-glib/enums.c.template create mode 100644 c_glib/arrow-glib/enums.h.template create mode 100644 c_glib/arrow-glib/error.cpp create mode 100644 c_glib/arrow-glib/error.h create mode 100644 c_glib/arrow-glib/error.hpp create mode 100644 c_glib/arrow-glib/field.cpp create mode 100644 c_glib/arrow-glib/field.h create mode 100644 c_glib/arrow-glib/field.hpp create mode 100644 c_glib/arrow-glib/float-array-builder.cpp create mode 100644 c_glib/arrow-glib/float-array-builder.h create mode 100644 c_glib/arrow-glib/float-array.cpp create mode 100644 
c_glib/arrow-glib/float-array.h create mode 100644 c_glib/arrow-glib/float-data-type.cpp create mode 100644 c_glib/arrow-glib/float-data-type.h create mode 100644 c_glib/arrow-glib/int16-array-builder.cpp create mode 100644 c_glib/arrow-glib/int16-array-builder.h create mode 100644 c_glib/arrow-glib/int16-array.cpp create mode 100644 c_glib/arrow-glib/int16-array.h create mode 100644 c_glib/arrow-glib/int16-data-type.cpp create mode 100644 c_glib/arrow-glib/int16-data-type.h create mode 100644 c_glib/arrow-glib/int32-array-builder.cpp create mode 100644 c_glib/arrow-glib/int32-array-builder.h create mode 100644 c_glib/arrow-glib/int32-array.cpp create mode 100644 c_glib/arrow-glib/int32-array.h create mode 100644 c_glib/arrow-glib/int32-data-type.cpp create mode 100644 c_glib/arrow-glib/int32-data-type.h create mode 100644 c_glib/arrow-glib/int64-array-builder.cpp create mode 100644 c_glib/arrow-glib/int64-array-builder.h create mode 100644 c_glib/arrow-glib/int64-array.cpp create mode 100644 c_glib/arrow-glib/int64-array.h create mode 100644 c_glib/arrow-glib/int64-data-type.cpp create mode 100644 c_glib/arrow-glib/int64-data-type.h create mode 100644 c_glib/arrow-glib/int8-array-builder.cpp create mode 100644 c_glib/arrow-glib/int8-array-builder.h create mode 100644 c_glib/arrow-glib/int8-array.cpp create mode 100644 c_glib/arrow-glib/int8-array.h create mode 100644 c_glib/arrow-glib/int8-data-type.cpp create mode 100644 c_glib/arrow-glib/int8-data-type.h create mode 100644 c_glib/arrow-glib/io-enums.c.template create mode 100644 c_glib/arrow-glib/io-enums.h.template create mode 100644 c_glib/arrow-glib/io-file-mode.cpp create mode 100644 c_glib/arrow-glib/io-file-mode.h create mode 100644 c_glib/arrow-glib/io-file-mode.hpp create mode 100644 c_glib/arrow-glib/io-file-output-stream.cpp create mode 100644 c_glib/arrow-glib/io-file-output-stream.h create mode 100644 c_glib/arrow-glib/io-file-output-stream.hpp create mode 100644 c_glib/arrow-glib/io-file.cpp create mode 100644 c_glib/arrow-glib/io-file.h create mode 100644 c_glib/arrow-glib/io-file.hpp create mode 100644 c_glib/arrow-glib/io-input-stream.cpp create mode 100644 c_glib/arrow-glib/io-input-stream.h create mode 100644 c_glib/arrow-glib/io-input-stream.hpp create mode 100644 c_glib/arrow-glib/io-memory-mapped-file.cpp create mode 100644 c_glib/arrow-glib/io-memory-mapped-file.h create mode 100644 c_glib/arrow-glib/io-memory-mapped-file.hpp create mode 100644 c_glib/arrow-glib/io-output-stream.cpp create mode 100644 c_glib/arrow-glib/io-output-stream.h create mode 100644 c_glib/arrow-glib/io-output-stream.hpp create mode 100644 c_glib/arrow-glib/io-readable-file.cpp create mode 100644 c_glib/arrow-glib/io-readable-file.h create mode 100644 c_glib/arrow-glib/io-readable-file.hpp create mode 100644 c_glib/arrow-glib/io-readable.cpp create mode 100644 c_glib/arrow-glib/io-readable.h create mode 100644 c_glib/arrow-glib/io-readable.hpp create mode 100644 c_glib/arrow-glib/io-writeable-file.cpp create mode 100644 c_glib/arrow-glib/io-writeable-file.h create mode 100644 c_glib/arrow-glib/io-writeable-file.hpp create mode 100644 c_glib/arrow-glib/io-writeable.cpp create mode 100644 c_glib/arrow-glib/io-writeable.h create mode 100644 c_glib/arrow-glib/io-writeable.hpp create mode 100644 c_glib/arrow-glib/ipc-enums.c.template create mode 100644 c_glib/arrow-glib/ipc-enums.h.template create mode 100644 c_glib/arrow-glib/ipc-file-reader.cpp create mode 100644 c_glib/arrow-glib/ipc-file-reader.h create mode 100644 
c_glib/arrow-glib/ipc-file-reader.hpp create mode 100644 c_glib/arrow-glib/ipc-file-writer.cpp create mode 100644 c_glib/arrow-glib/ipc-file-writer.h create mode 100644 c_glib/arrow-glib/ipc-file-writer.hpp create mode 100644 c_glib/arrow-glib/ipc-metadata-version.cpp create mode 100644 c_glib/arrow-glib/ipc-metadata-version.h create mode 100644 c_glib/arrow-glib/ipc-metadata-version.hpp create mode 100644 c_glib/arrow-glib/ipc-stream-reader.cpp create mode 100644 c_glib/arrow-glib/ipc-stream-reader.h create mode 100644 c_glib/arrow-glib/ipc-stream-reader.hpp create mode 100644 c_glib/arrow-glib/ipc-stream-writer.cpp create mode 100644 c_glib/arrow-glib/ipc-stream-writer.h create mode 100644 c_glib/arrow-glib/ipc-stream-writer.hpp create mode 100644 c_glib/arrow-glib/list-array-builder.cpp create mode 100644 c_glib/arrow-glib/list-array-builder.h create mode 100644 c_glib/arrow-glib/list-array.cpp create mode 100644 c_glib/arrow-glib/list-array.h create mode 100644 c_glib/arrow-glib/list-data-type.cpp create mode 100644 c_glib/arrow-glib/list-data-type.h create mode 100644 c_glib/arrow-glib/null-array.cpp create mode 100644 c_glib/arrow-glib/null-array.h create mode 100644 c_glib/arrow-glib/null-data-type.cpp create mode 100644 c_glib/arrow-glib/null-data-type.h create mode 100644 c_glib/arrow-glib/record-batch.cpp create mode 100644 c_glib/arrow-glib/record-batch.h create mode 100644 c_glib/arrow-glib/record-batch.hpp create mode 100644 c_glib/arrow-glib/schema.cpp create mode 100644 c_glib/arrow-glib/schema.h create mode 100644 c_glib/arrow-glib/schema.hpp create mode 100644 c_glib/arrow-glib/string-array-builder.cpp create mode 100644 c_glib/arrow-glib/string-array-builder.h create mode 100644 c_glib/arrow-glib/string-array.cpp create mode 100644 c_glib/arrow-glib/string-array.h create mode 100644 c_glib/arrow-glib/string-data-type.cpp create mode 100644 c_glib/arrow-glib/string-data-type.h create mode 100644 c_glib/arrow-glib/struct-array-builder.cpp create mode 100644 c_glib/arrow-glib/struct-array-builder.h create mode 100644 c_glib/arrow-glib/struct-array.cpp create mode 100644 c_glib/arrow-glib/struct-array.h create mode 100644 c_glib/arrow-glib/struct-data-type.cpp create mode 100644 c_glib/arrow-glib/struct-data-type.h create mode 100644 c_glib/arrow-glib/table.cpp create mode 100644 c_glib/arrow-glib/table.h create mode 100644 c_glib/arrow-glib/table.hpp create mode 100644 c_glib/arrow-glib/type.cpp create mode 100644 c_glib/arrow-glib/type.h create mode 100644 c_glib/arrow-glib/type.hpp create mode 100644 c_glib/arrow-glib/uint16-array-builder.cpp create mode 100644 c_glib/arrow-glib/uint16-array-builder.h create mode 100644 c_glib/arrow-glib/uint16-array.cpp create mode 100644 c_glib/arrow-glib/uint16-array.h create mode 100644 c_glib/arrow-glib/uint16-data-type.cpp create mode 100644 c_glib/arrow-glib/uint16-data-type.h create mode 100644 c_glib/arrow-glib/uint32-array-builder.cpp create mode 100644 c_glib/arrow-glib/uint32-array-builder.h create mode 100644 c_glib/arrow-glib/uint32-array.cpp create mode 100644 c_glib/arrow-glib/uint32-array.h create mode 100644 c_glib/arrow-glib/uint32-data-type.cpp create mode 100644 c_glib/arrow-glib/uint32-data-type.h create mode 100644 c_glib/arrow-glib/uint64-array-builder.cpp create mode 100644 c_glib/arrow-glib/uint64-array-builder.h create mode 100644 c_glib/arrow-glib/uint64-array.cpp create mode 100644 c_glib/arrow-glib/uint64-array.h create mode 100644 c_glib/arrow-glib/uint64-data-type.cpp create mode 100644 
c_glib/arrow-glib/uint64-data-type.h create mode 100644 c_glib/arrow-glib/uint8-array-builder.cpp create mode 100644 c_glib/arrow-glib/uint8-array-builder.h create mode 100644 c_glib/arrow-glib/uint8-array.cpp create mode 100644 c_glib/arrow-glib/uint8-array.h create mode 100644 c_glib/arrow-glib/uint8-data-type.cpp create mode 100644 c_glib/arrow-glib/uint8-data-type.h create mode 100755 c_glib/autogen.sh create mode 100644 c_glib/configure.ac create mode 100644 c_glib/doc/Makefile.am create mode 100644 c_glib/doc/reference/Makefile.am create mode 100644 c_glib/doc/reference/arrow-glib-docs.sgml create mode 100644 c_glib/example/Makefile.am create mode 100644 c_glib/example/build.c create mode 100644 c_glib/test/helper/buildable.rb create mode 100755 c_glib/test/run-test.rb create mode 100755 c_glib/test/run-test.sh create mode 100644 c_glib/test/test-array.rb create mode 100644 c_glib/test/test-binary-array.rb create mode 100644 c_glib/test/test-binary-data-type.rb create mode 100644 c_glib/test/test-boolean-array.rb create mode 100644 c_glib/test/test-boolean-data-type.rb create mode 100644 c_glib/test/test-chunked-array.rb create mode 100644 c_glib/test/test-column.rb create mode 100644 c_glib/test/test-double-array.rb create mode 100644 c_glib/test/test-double-data-type.rb create mode 100644 c_glib/test/test-field.rb create mode 100644 c_glib/test/test-float-array.rb create mode 100644 c_glib/test/test-float-data-type.rb create mode 100644 c_glib/test/test-int16-array.rb create mode 100644 c_glib/test/test-int16-data-type.rb create mode 100644 c_glib/test/test-int32-array.rb create mode 100644 c_glib/test/test-int32-data-type.rb create mode 100644 c_glib/test/test-int64-array.rb create mode 100644 c_glib/test/test-int64-data-type.rb create mode 100644 c_glib/test/test-int8-array.rb create mode 100644 c_glib/test/test-int8-data-type.rb create mode 100644 c_glib/test/test-io-file-output-stream.rb create mode 100644 c_glib/test/test-io-memory-mapped-file.rb create mode 100644 c_glib/test/test-ipc-file-writer.rb create mode 100644 c_glib/test/test-ipc-stream-writer.rb create mode 100644 c_glib/test/test-list-array.rb create mode 100644 c_glib/test/test-list-data-type.rb create mode 100644 c_glib/test/test-null-array.rb create mode 100644 c_glib/test/test-null-data-type.rb create mode 100644 c_glib/test/test-record-batch.rb create mode 100644 c_glib/test/test-schema.rb create mode 100644 c_glib/test/test-string-array.rb create mode 100644 c_glib/test/test-string-data-type.rb create mode 100644 c_glib/test/test-struct-array.rb create mode 100644 c_glib/test/test-table.rb create mode 100644 c_glib/test/test-uint16-array.rb create mode 100644 c_glib/test/test-uint16-data-type.rb create mode 100644 c_glib/test/test-uint32-array.rb create mode 100644 c_glib/test/test-uint32-data-type.rb create mode 100644 c_glib/test/test-uint64-array.rb create mode 100644 c_glib/test/test-uint64-data-type.rb create mode 100644 c_glib/test/test-uint8-array.rb create mode 100644 c_glib/test/test-uint8-data-type.rb create mode 100755 ci/travis_before_script_c_glib.sh create mode 100755 ci/travis_script_c_glib.sh diff --git a/.travis.yml b/.travis.yml index e8d91045c2254..b219b03e0eb2b 100644 --- a/.travis.yml +++ b/.travis.yml @@ -16,6 +16,9 @@ addons: - libboost-filesystem-dev - libboost-system-dev - libjemalloc-dev + - gtk-doc-tools + - autoconf-archive + - libgirepository1.0-dev matrix: fast_finish: true @@ -30,9 +33,11 @@ matrix: - export CC="gcc-4.9" - export CXX="g++-4.9" - 
$TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh
+ - $TRAVIS_BUILD_DIR/ci/travis_before_script_c_glib.sh
script:
- $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh
- $TRAVIS_BUILD_DIR/ci/travis_script_python.sh
+ - $TRAVIS_BUILD_DIR/ci/travis_script_c_glib.sh
- compiler: clang
osx_image: xcode6.4
os: osx
diff --git a/c_glib/.gitignore b/c_glib/.gitignore
new file mode 100644
index 0000000000000..38e33a2cd88e7
--- /dev/null
+++ b/c_glib/.gitignore
@@ -0,0 +1,44 @@
+Makefile
+Makefile.in
+.deps/
+.libs/
+*.gir
+*.typelib
+*.o
+*.lo
+*.la
+*~
+/*.tar.gz
+/aclocal.m4
+/autom4te.cache/
+/config.h
+/config.h.in
+/config.log
+/config.status
+/config/
+/configure
+/doc/reference/*.txt
+/doc/reference/*.txt.bak
+/doc/reference/*.args
+/doc/reference/*.hierarchy
+/doc/reference/*.interfaces
+/doc/reference/*.prerequisites
+/doc/reference/*.signals
+/doc/reference/*.types
+/doc/reference/gtk-doc.make
+/doc/reference/*.stamp
+/doc/reference/html/
+/doc/reference/xml/
+/libtool
+/m4/
+/stamp-h1
+/version
+/arrow-glib/enums.c
+/arrow-glib/enums.h
+/arrow-glib/io-enums.c
+/arrow-glib/io-enums.h
+/arrow-glib/ipc-enums.c
+/arrow-glib/ipc-enums.h
+/arrow-glib/stamp-*
+/arrow-glib/*.pc
+/example/build
diff --git a/c_glib/Makefile.am b/c_glib/Makefile.am
new file mode 100644
index 0000000000000..076f9be08524b
--- /dev/null
+++ b/c_glib/Makefile.am
@@ -0,0 +1,26 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+ACLOCAL_AMFLAGS = -I m4 ${ACLOCAL_FLAGS}
+
+SUBDIRS = \
+  arrow-glib \
+  doc \
+  example
+
+EXTRA_DIST = \
+  version
diff --git a/c_glib/README.md b/c_glib/README.md
new file mode 100644
index 0000000000000..4008015a56438
--- /dev/null
+++ b/c_glib/README.md
@@ -0,0 +1,114 @@
+
+
+# Arrow GLib
+
+Arrow GLib is a wrapper library for Arrow C++. Arrow GLib provides a C
+API.
+
+Arrow GLib supports
+[GObject Introspection](https://wiki.gnome.org/action/show/Projects/GObjectIntrospection).
+This means that you can create language bindings at runtime or compile time.
+
+For example, you can use Apache Arrow from Ruby via Arrow GLib and the
+[gobject-introspection gem](https://rubygems.org/gems/gobject-introspection)
+with the following code:
+
+```ruby
+# Generate bindings at runtime
+require "gi"
+Arrow = GI.load("Arrow")
+
+# Now, you can access arrow::BooleanArray in Arrow C++ by
+# Arrow::BooleanArray
+p Arrow::BooleanArray
+```
+
+For Ruby, you should use the
+[red-arrow gem](https://rubygems.org/gems/red-arrow). It is based on the
+gobject-introspection gem and adds many convenient features on top of
+the raw gobject-introspection based bindings.
+
+## Install
+
+### Package
+
+TODO
+
+### Build
+
+You need to install Arrow C++ before you install Arrow GLib. See the
+Arrow C++ document for how to install it.
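+
+For example, building and installing Arrow C++ from a source checkout
+is typically something like the following (an illustrative sketch, not
+the authoritative steps; see the Arrow C++ document):
+
+```text
+% cd cpp
+% mkdir build
+% cd build
+% cmake .. -DCMAKE_BUILD_TYPE=Release
+% make
+% sudo make install
+```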
+
+You need [GTK-Doc](https://www.gtk.org/gtk-doc/) and
+[GObject Introspection](https://wiki.gnome.org/action/show/Projects/GObjectIntrospection)
+to build Arrow GLib. You can install them as follows:
+
+On Debian GNU/Linux or Ubuntu:
+
+```text
+% sudo apt install -y -V gtk-doc-tools libgirepository1.0-dev
+```
+
+On CentOS 7 or later:
+
+```text
+% sudo yum install -y gtk-doc gobject-introspection-devel
+```
+
+On macOS with [Homebrew](https://brew.sh/):
+
+```text
+% brew install gtk-doc gobject-introspection
+```
+
+Now, you can build Arrow GLib:
+
+```text
+% cd c_glib
+% ./configure --enable-gtk-doc
+% make
+% sudo make install
+```
+
+## Usage
+
+You can use Arrow GLib with C or other languages. If you use Arrow
+GLib with C, you use the C API. If you use Arrow GLib with other
+languages, you use GObject Introspection based bindings.
+
+### C
+
+You can find the API reference in the
+`/usr/local/share/gtk-doc/html/arrow-glib/` directory. If you pass
+`--prefix` to `configure`, the directory will be different.
+
+You can find example code in the `example/` directory.
+
+### Language bindings
+
+You can use Arrow GLib from languages other than C with GObject
+Introspection based bindings. Here are languages that support GObject
+Introspection:
+
+ * Ruby: [red-arrow gem](https://rubygems.org/gems/red-arrow) should be used.
+
+ * Python: [PyGObject](https://wiki.gnome.org/Projects/PyGObject) should be used. (Note that you should use PyArrow rather than Arrow GLib.)
+
+ * Lua: [LGI](https://github.com/pavouk/lgi) should be used.
+
+ * Go: [Go-gir-generator](https://github.com/linuxdeepin/go-gir-generator) should be used.
+
+See also
+[Projects/GObjectIntrospection/Users - GNOME Wiki!](https://wiki.gnome.org/Projects/GObjectIntrospection/Users)
+for other languages.
diff --git a/c_glib/arrow-glib/Makefile.am b/c_glib/arrow-glib/Makefile.am
new file mode 100644
index 0000000000000..61137a075f601
--- /dev/null
+++ b/c_glib/arrow-glib/Makefile.am
@@ -0,0 +1,494 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+ +CLEANFILES = + +EXTRA_DIST = + +AM_CPPFLAGS = \ + -I$(top_builddir) \ + -I$(top_srcdir) + +AM_CFLAGS = \ + $(GLIB_CFLAGS) \ + $(GARROW_CFLAGS) + +# libarrow-glib +lib_LTLIBRARIES = \ + libarrow-glib.la + +libarrow_glib_la_CXXFLAGS = \ + $(GLIB_CFLAGS) \ + $(ARROW_CFLAGS) \ + $(GARROW_CXXFLAGS) + +libarrow_glib_la_LIBADD = \ + $(GLIB_LIBS) \ + $(ARROW_LIBS) + +libarrow_glib_la_headers = \ + array.h \ + array-builder.h \ + arrow-glib.h \ + chunked-array.h \ + column.h \ + binary-array.h \ + binary-array-builder.h \ + binary-data-type.h \ + boolean-array.h \ + boolean-array-builder.h \ + boolean-data-type.h \ + data-type.h \ + double-array.h \ + double-array-builder.h \ + double-data-type.h \ + error.h \ + field.h \ + float-array.h \ + float-array-builder.h \ + float-data-type.h \ + int8-array.h \ + int8-array-builder.h \ + int8-data-type.h \ + int16-array.h \ + int16-array-builder.h \ + int16-data-type.h \ + int32-array.h \ + int32-array-builder.h \ + int32-data-type.h \ + int64-array.h \ + int64-array-builder.h \ + int64-data-type.h \ + list-array.h \ + list-array-builder.h \ + list-data-type.h \ + null-array.h \ + null-data-type.h \ + record-batch.h \ + schema.h \ + string-array.h \ + string-array-builder.h \ + string-data-type.h \ + struct-array.h \ + struct-array-builder.h \ + struct-data-type.h \ + table.h \ + type.h \ + uint8-array.h \ + uint8-array-builder.h \ + uint8-data-type.h \ + uint16-array.h \ + uint16-array-builder.h \ + uint16-data-type.h \ + uint32-array.h \ + uint32-array-builder.h \ + uint32-data-type.h \ + uint64-array.h \ + uint64-array-builder.h \ + uint64-data-type.h + +libarrow_glib_la_generated_headers = \ + enums.h + +libarrow_glib_la_generated_sources = \ + enums.c \ + $(libarrow_glib_la_generated_headers) + +libarrow_glib_la_sources = \ + array.cpp \ + array-builder.cpp \ + binary-array.cpp \ + binary-array-builder.cpp \ + binary-data-type.cpp \ + boolean-array.cpp \ + boolean-array-builder.cpp \ + boolean-data-type.cpp \ + chunked-array.cpp \ + column.cpp \ + data-type.cpp \ + double-array.cpp \ + double-array-builder.cpp \ + double-data-type.cpp \ + error.cpp \ + field.cpp \ + float-array.cpp \ + float-array-builder.cpp \ + float-data-type.cpp \ + int8-array.cpp \ + int8-array-builder.cpp \ + int8-data-type.cpp \ + int16-array.cpp \ + int16-array-builder.cpp \ + int16-data-type.cpp \ + int32-array.cpp \ + int32-array-builder.cpp \ + int32-data-type.cpp \ + int64-array.cpp \ + int64-array-builder.cpp \ + int64-data-type.cpp \ + list-array.cpp \ + list-array-builder.cpp \ + list-data-type.cpp \ + null-array.cpp \ + null-data-type.cpp \ + record-batch.cpp \ + schema.cpp \ + string-array.cpp \ + string-array-builder.cpp \ + string-data-type.cpp \ + struct-array.cpp \ + struct-array-builder.cpp \ + struct-data-type.cpp \ + table.cpp \ + type.cpp \ + uint8-array.cpp \ + uint8-array-builder.cpp \ + uint8-data-type.cpp \ + uint16-array.cpp \ + uint16-array-builder.cpp \ + uint16-data-type.cpp \ + uint32-array.cpp \ + uint32-array-builder.cpp \ + uint32-data-type.cpp \ + uint64-array.cpp \ + uint64-array-builder.cpp \ + uint64-data-type.cpp \ + $(libarrow_glib_la_headers) \ + $(libarrow_glib_la_generated_sources) + +libarrow_glib_la_cpp_headers = \ + array.hpp \ + array-builder.hpp \ + arrow-glib.hpp \ + chunked-array.hpp \ + column.hpp \ + data-type.hpp \ + error.hpp \ + field.hpp \ + record-batch.hpp \ + schema.hpp \ + table.hpp \ + type.hpp + +libarrow_glib_la_SOURCES = \ + $(libarrow_glib_la_sources) \ + $(libarrow_glib_la_cpp_headers) + +BUILT_SOURCES = \ + 
$(libarrow_glib_la_generated_sources) \
+  stamp-enums.c \
+  stamp-enums.h
+
+EXTRA_DIST += \
+  enums.c.template \
+  enums.h.template
+
+enums.h: stamp-enums.h
+  @true
+stamp-enums.h: $(libarrow_glib_la_headers) enums.h.template
+  $(AM_V_GEN) \
+    (cd $(srcdir) && \
+     $(GLIB_MKENUMS) \
+       --identifier-prefix GArrow \
+       --symbol-prefix garrow \
+       --template enums.h.template \
+       $(libarrow_glib_la_headers)) > enums.h
+  touch $@
+
+enums.c: stamp-enums.c
+  @true
+stamp-enums.c: $(libarrow_glib_la_headers) enums.c.template
+  $(AM_V_GEN) \
+    (cd $(srcdir) && \
+     $(GLIB_MKENUMS) \
+       --identifier-prefix GArrow \
+       --symbol-prefix garrow \
+       --template enums.c.template \
+       $(libarrow_glib_la_headers)) > enums.c
+  touch $@
+
+# libarrow-io-glib
+lib_LTLIBRARIES += \
+  libarrow-io-glib.la
+
+libarrow_io_glib_la_CXXFLAGS = \
+  $(GLIB_CFLAGS) \
+  $(ARROW_IO_CFLAGS) \
+  $(GARROW_CXXFLAGS)
+
+libarrow_io_glib_la_LIBADD = \
+  $(GLIB_LIBS) \
+  $(ARROW_IO_LIBS)
+
+libarrow_io_glib_la_headers = \
+  arrow-io-glib.h \
+  io-file.h \
+  io-file-mode.h \
+  io-file-output-stream.h \
+  io-input-stream.h \
+  io-memory-mapped-file.h \
+  io-output-stream.h \
+  io-readable.h \
+  io-readable-file.h \
+  io-writeable.h \
+  io-writeable-file.h
+
+libarrow_io_glib_la_generated_headers = \
+  io-enums.h
+
+libarrow_io_glib_la_generated_sources = \
+  io-enums.c \
+  $(libarrow_io_glib_la_generated_headers)
+
+libarrow_io_glib_la_sources = \
+  io-file.cpp \
+  io-file-mode.cpp \
+  io-file-output-stream.cpp \
+  io-input-stream.cpp \
+  io-memory-mapped-file.cpp \
+  io-output-stream.cpp \
+  io-readable.cpp \
+  io-readable-file.cpp \
+  io-writeable.cpp \
+  io-writeable-file.cpp \
+  $(libarrow_io_glib_la_headers) \
+  $(libarrow_io_glib_la_generated_sources)
+
+libarrow_io_glib_la_cpp_headers = \
+  arrow-io-glib.hpp \
+  io-file.hpp \
+  io-file-mode.hpp \
+  io-file-output-stream.hpp \
+  io-input-stream.hpp \
+  io-memory-mapped-file.hpp \
+  io-output-stream.hpp \
+  io-readable.hpp \
+  io-readable-file.hpp \
+  io-writeable.hpp \
+  io-writeable-file.hpp
+
+libarrow_io_glib_la_SOURCES = \
+  $(libarrow_io_glib_la_sources) \
+  $(libarrow_io_glib_la_cpp_headers)
+
+BUILT_SOURCES += \
+  $(libarrow_io_glib_la_generated_sources) \
+  stamp-io-enums.c \
+  stamp-io-enums.h
+
+EXTRA_DIST += \
+  io-enums.c.template \
+  io-enums.h.template
+
+io-enums.h: stamp-io-enums.h
+  @true
+stamp-io-enums.h: $(libarrow_io_glib_la_headers) io-enums.h.template
+  $(AM_V_GEN) \
+    (cd $(srcdir) && \
+     $(GLIB_MKENUMS) \
+       --identifier-prefix GArrowIO \
+       --symbol-prefix garrow_io \
+       --template io-enums.h.template \
+       $(libarrow_io_glib_la_headers)) > io-enums.h
+  touch $@
+
+io-enums.c: stamp-io-enums.c
+  @true
+stamp-io-enums.c: $(libarrow_io_glib_la_headers) io-enums.c.template
+  $(AM_V_GEN) \
+    (cd $(srcdir) && \
+     $(GLIB_MKENUMS) \
+       --identifier-prefix GArrowIO \
+       --symbol-prefix garrow_io \
+       --template io-enums.c.template \
+       $(libarrow_io_glib_la_headers)) > io-enums.c
+  touch $@
+
+# libarrow-ipc-glib
+lib_LTLIBRARIES += \
+  libarrow-ipc-glib.la
+
+libarrow_ipc_glib_la_CXXFLAGS = \
+  $(GLIB_CFLAGS) \
+  $(ARROW_IPC_CFLAGS) \
+  $(GARROW_CXXFLAGS)
+
+libarrow_ipc_glib_la_LIBADD = \
+  $(GLIB_LIBS) \
+  $(ARROW_IPC_LIBS)
+
+libarrow_ipc_glib_la_headers = \
+  arrow-ipc-glib.h \
+  ipc-file-reader.h \
+  ipc-file-writer.h \
+  ipc-stream-reader.h \
+  ipc-stream-writer.h \
+  ipc-metadata-version.h
+
+libarrow_ipc_glib_la_generated_headers = \
+  ipc-enums.h
+
+libarrow_ipc_glib_la_generated_sources = \
+  ipc-enums.c \
+  $(libarrow_ipc_glib_la_generated_headers)
+
+libarrow_ipc_glib_la_sources = \
+  ipc-file-reader.cpp \
+  ipc-file-writer.cpp \
+  ipc-metadata-version.cpp \
+  ipc-stream-reader.cpp \
+  ipc-stream-writer.cpp \
+  $(libarrow_ipc_glib_la_headers) \
+  $(libarrow_ipc_glib_la_generated_sources)
+
+libarrow_ipc_glib_la_cpp_headers = \
+  arrow-ipc-glib.hpp \
+  ipc-file-reader.hpp \
+  ipc-file-writer.hpp \
+  ipc-metadata-version.hpp \
+  ipc-stream-reader.hpp \
+  ipc-stream-writer.hpp
+
+libarrow_ipc_glib_la_SOURCES = \
+  $(libarrow_ipc_glib_la_sources) \
+  $(libarrow_ipc_glib_la_cpp_headers)
+
+BUILT_SOURCES += \
+  $(libarrow_ipc_glib_la_generated_sources) \
+  stamp-ipc-enums.c \
+  stamp-ipc-enums.h
+
+EXTRA_DIST += \
+  ipc-enums.c.template \
+  ipc-enums.h.template
+
+ipc-enums.h: stamp-ipc-enums.h
+  @true
+stamp-ipc-enums.h: $(libarrow_ipc_glib_la_headers) ipc-enums.h.template
+  $(AM_V_GEN) \
+    (cd $(srcdir) && \
+     $(GLIB_MKENUMS) \
+       --identifier-prefix GArrowIPC \
+       --symbol-prefix garrow_ipc \
+       --template ipc-enums.h.template \
+       $(libarrow_ipc_glib_la_headers)) > ipc-enums.h
+  touch $@
+
+ipc-enums.c: stamp-ipc-enums.c
+  @true
+stamp-ipc-enums.c: $(libarrow_ipc_glib_la_headers) ipc-enums.c.template
+  $(AM_V_GEN) \
+    (cd $(srcdir) && \
+     $(GLIB_MKENUMS) \
+       --identifier-prefix GArrowIPC \
+       --symbol-prefix garrow_ipc \
+       --template ipc-enums.c.template \
+       $(libarrow_ipc_glib_la_headers)) > ipc-enums.c
+  touch $@
+
+pkginclude_HEADERS = \
+  $(libarrow_glib_la_headers) \
+  $(libarrow_glib_la_cpp_headers) \
+  $(libarrow_glib_la_generated_headers) \
+  $(libarrow_io_glib_la_headers) \
+  $(libarrow_io_glib_la_cpp_headers) \
+  $(libarrow_io_glib_la_generated_headers) \
+  $(libarrow_ipc_glib_la_headers) \
+  $(libarrow_ipc_glib_la_cpp_headers) \
+  $(libarrow_ipc_glib_la_generated_headers)
+
+pkgconfigdir = $(libdir)/pkgconfig
+pkgconfig_DATA = \
+  arrow-glib.pc \
+  arrow-io-glib.pc \
+  arrow-ipc-glib.pc
+
+# GObject Introspection
+-include $(INTROSPECTION_MAKEFILE)
+INTROSPECTION_GIRS =
+INTROSPECTION_SCANNER_ARGS =
+INTROSPECTION_COMPILER_ARGS =
+
+if HAVE_INTROSPECTION
+Arrow-1.0.gir: libarrow-glib.la
+Arrow_1_0_gir_PACKAGES = \
+  gobject-2.0
+Arrow_1_0_gir_EXPORT_PACKAGES = arrow
+Arrow_1_0_gir_INCLUDES = GObject-2.0
+Arrow_1_0_gir_CFLAGS = \
+  $(AM_CPPFLAGS)
+Arrow_1_0_gir_LIBS = libarrow-glib.la
+Arrow_1_0_gir_FILES = $(libarrow_glib_la_sources)
+Arrow_1_0_gir_SCANNERFLAGS = \
+  --warn-all \
+  --identifier-prefix=GArrow \
+  --symbol-prefix=garrow
+INTROSPECTION_GIRS += Arrow-1.0.gir
+
+ArrowIO-1.0.gir: libarrow-io-glib.la
+ArrowIO-1.0.gir: Arrow-1.0.gir
+ArrowIO_1_0_gir_PACKAGES = \
+  gobject-2.0
+ArrowIO_1_0_gir_EXPORT_PACKAGES = arrow-io
+ArrowIO_1_0_gir_INCLUDES = \
+  GObject-2.0
+ArrowIO_1_0_gir_CFLAGS = \
+  $(AM_CPPFLAGS)
+ArrowIO_1_0_gir_LIBS = \
+  libarrow-io-glib.la \
+  libarrow-glib.la
+ArrowIO_1_0_gir_FILES = $(libarrow_io_glib_la_sources)
+ArrowIO_1_0_gir_SCANNERFLAGS = \
+  --include-uninstalled=$(builddir)/Arrow-1.0.gir \
+  --warn-all \
+  --identifier-prefix=GArrowIO \
+  --symbol-prefix=garrow_io
+INTROSPECTION_GIRS += ArrowIO-1.0.gir
+
+ArrowIPC-1.0.gir: libarrow-ipc-glib.la
+ArrowIPC-1.0.gir: Arrow-1.0.gir
+ArrowIPC-1.0.gir: ArrowIO-1.0.gir
+ArrowIPC_1_0_gir_PACKAGES = \
+  gobject-2.0
+ArrowIPC_1_0_gir_EXPORT_PACKAGES = arrow-ipc
+ArrowIPC_1_0_gir_INCLUDES = \
+  GObject-2.0
+ArrowIPC_1_0_gir_CFLAGS = \
+  $(AM_CPPFLAGS)
+ArrowIPC_1_0_gir_LIBS = \
+  libarrow-ipc-glib.la \
+  libarrow-io-glib.la \
+  libarrow-glib.la
+ArrowIPC_1_0_gir_FILES = $(libarrow_ipc_glib_la_sources)
+ArrowIPC_1_0_gir_SCANNERFLAGS = \
+
--include-uninstalled=$(builddir)/Arrow-1.0.gir \ + --include-uninstalled=$(builddir)/ArrowIO-1.0.gir \ + --warn-all \ + --identifier-prefix=GArrowIPC \ + --symbol-prefix=garrow_ipc +INTROSPECTION_GIRS += ArrowIPC-1.0.gir + +girdir = $(datadir)/gir-1.0 +gir_DATA = $(INTROSPECTION_GIRS) + +typelibdir = $(libdir)/girepository-1.0 +typelib_DATA = $(INTROSPECTION_GIRS:.gir=.typelib) + +CLEANFILES += \ + $(gir_DATA) \ + $(typelib_DATA) +endif diff --git a/c_glib/arrow-glib/array-builder.cpp b/c_glib/arrow-glib/array-builder.cpp new file mode 100644 index 0000000000000..0f038c8f66cee --- /dev/null +++ b/c_glib/arrow-glib/array-builder.cpp @@ -0,0 +1,229 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: array-builder + * @short_description: Base class for all array builder classes. + * + * #GArrowArrayBuilder is a base class for all array builder classes + * such as #GArrowBooleanArrayBuilder. + * + * You need to use array builder class to create a new array. 
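+ *
+ * A minimal sketch of creating an array through a concrete builder
+ * (error handling shortened; the boolean builder is assumed to follow
+ * the same append/append_null pattern as the other builders in this
+ * library):
+ *
+ * |[<!-- language="C" -->
+ * GArrowBooleanArrayBuilder *builder = garrow_boolean_array_builder_new();
+ * GError *error = NULL;
+ *
+ * // Each append returns FALSE and sets error on failure.
+ * garrow_boolean_array_builder_append(builder, TRUE, &error);
+ * garrow_boolean_array_builder_append(builder, FALSE, &error);
+ *
+ * // Finishing yields an immutable GArrowArray.
+ * GArrowArray *array =
+ *   garrow_array_builder_finish(GARROW_ARRAY_BUILDER(builder));
+ * g_object_unref(builder);
+ * ]|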
+ */
+
+typedef struct GArrowArrayBuilderPrivate_ {
+  std::shared_ptr<arrow::ArrayBuilder> array_builder;
+} GArrowArrayBuilderPrivate;
+
+enum {
+  PROP_0,
+  PROP_ARRAY_BUILDER
+};
+
+G_DEFINE_ABSTRACT_TYPE_WITH_PRIVATE(GArrowArrayBuilder,
+                                    garrow_array_builder,
+                                    G_TYPE_OBJECT)
+
+#define GARROW_ARRAY_BUILDER_GET_PRIVATE(obj) \
+  (G_TYPE_INSTANCE_GET_PRIVATE((obj), \
+                               GARROW_TYPE_ARRAY_BUILDER, \
+                               GArrowArrayBuilderPrivate))
+
+static void
+garrow_array_builder_finalize(GObject *object)
+{
+  GArrowArrayBuilderPrivate *priv;
+
+  priv = GARROW_ARRAY_BUILDER_GET_PRIVATE(object);
+
+  priv->array_builder = nullptr;
+
+  G_OBJECT_CLASS(garrow_array_builder_parent_class)->finalize(object);
+}
+
+static void
+garrow_array_builder_set_property(GObject *object,
+                                  guint prop_id,
+                                  const GValue *value,
+                                  GParamSpec *pspec)
+{
+  GArrowArrayBuilderPrivate *priv;
+
+  priv = GARROW_ARRAY_BUILDER_GET_PRIVATE(object);
+
+  switch (prop_id) {
+  case PROP_ARRAY_BUILDER:
+    priv->array_builder =
+      *static_cast<std::shared_ptr<arrow::ArrayBuilder> *>(g_value_get_pointer(value));
+    break;
+  default:
+    G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec);
+    break;
+  }
+}
+
+static void
+garrow_array_builder_get_property(GObject *object,
+                                  guint prop_id,
+                                  GValue *value,
+                                  GParamSpec *pspec)
+{
+  switch (prop_id) {
+  default:
+    G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec);
+    break;
+  }
+}
+
+static void
+garrow_array_builder_init(GArrowArrayBuilder *builder)
+{
+}
+
+static void
+garrow_array_builder_class_init(GArrowArrayBuilderClass *klass)
+{
+  GObjectClass *gobject_class;
+  GParamSpec *spec;
+
+  gobject_class = G_OBJECT_CLASS(klass);
+
+  gobject_class->finalize = garrow_array_builder_finalize;
+  gobject_class->set_property = garrow_array_builder_set_property;
+  gobject_class->get_property = garrow_array_builder_get_property;
+
+  spec = g_param_spec_pointer("array-builder",
+                              "Array builder",
+                              "The raw std::shared_ptr<arrow::ArrayBuilder> *",
+                              static_cast<GParamFlags>(G_PARAM_WRITABLE |
+                                                       G_PARAM_CONSTRUCT_ONLY));
+  g_object_class_install_property(gobject_class, PROP_ARRAY_BUILDER, spec);
+}
+
+/**
+ * garrow_array_builder_finish:
+ * @builder: A #GArrowArrayBuilder.
+ *
+ * Returns: (transfer full): The built #GArrowArray.
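+ *
+ * A short usage sketch (assuming a builder created elsewhere; the cast
+ * macro and g_object_unref() follow normal GObject conventions):
+ *
+ * |[<!-- language="C" -->
+ * GArrowArray *array =
+ *   garrow_array_builder_finish(GARROW_ARRAY_BUILDER(builder));
+ * // The returned array is (transfer full): release it when done.
+ * g_object_unref(array);
+ * ]|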
+ */
+GArrowArray *
+garrow_array_builder_finish(GArrowArrayBuilder *builder)
+{
+  auto arrow_builder = garrow_array_builder_get_raw(builder);
+  std::shared_ptr<arrow::Array> arrow_array;
+  arrow_builder->Finish(&arrow_array);
+  return garrow_array_new_raw(&arrow_array);
+}
+
+G_END_DECLS
+
+GArrowArrayBuilder *
+garrow_array_builder_new_raw(std::shared_ptr<arrow::ArrayBuilder> *arrow_builder)
+{
+  GType type;
+
+  switch ((*arrow_builder)->type()->type) {
+  case arrow::Type::type::BOOL:
+    type = GARROW_TYPE_BOOLEAN_ARRAY_BUILDER;
+    break;
+  case arrow::Type::type::UINT8:
+    type = GARROW_TYPE_UINT8_ARRAY_BUILDER;
+    break;
+  case arrow::Type::type::INT8:
+    type = GARROW_TYPE_INT8_ARRAY_BUILDER;
+    break;
+  case arrow::Type::type::UINT16:
+    type = GARROW_TYPE_UINT16_ARRAY_BUILDER;
+    break;
+  case arrow::Type::type::INT16:
+    type = GARROW_TYPE_INT16_ARRAY_BUILDER;
+    break;
+  case arrow::Type::type::UINT32:
+    type = GARROW_TYPE_UINT32_ARRAY_BUILDER;
+    break;
+  case arrow::Type::type::INT32:
+    type = GARROW_TYPE_INT32_ARRAY_BUILDER;
+    break;
+  case arrow::Type::type::UINT64:
+    type = GARROW_TYPE_UINT64_ARRAY_BUILDER;
+    break;
+  case arrow::Type::type::INT64:
+    type = GARROW_TYPE_INT64_ARRAY_BUILDER;
+    break;
+  case arrow::Type::type::FLOAT:
+    type = GARROW_TYPE_FLOAT_ARRAY_BUILDER;
+    break;
+  case arrow::Type::type::DOUBLE:
+    type = GARROW_TYPE_DOUBLE_ARRAY_BUILDER;
+    break;
+  case arrow::Type::type::BINARY:
+    type = GARROW_TYPE_BINARY_ARRAY_BUILDER;
+    break;
+  case arrow::Type::type::STRING:
+    type = GARROW_TYPE_STRING_ARRAY_BUILDER;
+    break;
+  case arrow::Type::type::LIST:
+    type = GARROW_TYPE_LIST_ARRAY_BUILDER;
+    break;
+  case arrow::Type::type::STRUCT:
+    type = GARROW_TYPE_STRUCT_ARRAY_BUILDER;
+    break;
+  default:
+    type = GARROW_TYPE_ARRAY_BUILDER;
+    break;
+  }
+
+  auto builder =
+    GARROW_ARRAY_BUILDER(g_object_new(type,
+                                      "array-builder", arrow_builder,
+                                      NULL));
+  return builder;
+}
+
+std::shared_ptr<arrow::ArrayBuilder>
+garrow_array_builder_get_raw(GArrowArrayBuilder *builder)
+{
+  GArrowArrayBuilderPrivate *priv;
+
+  priv = GARROW_ARRAY_BUILDER_GET_PRIVATE(builder);
+  return priv->array_builder;
+}
diff --git a/c_glib/arrow-glib/array-builder.h b/c_glib/arrow-glib/array-builder.h
new file mode 100644
index 0000000000000..3717aef04a2f4
--- /dev/null
+++ b/c_glib/arrow-glib/array-builder.h
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_ARRAY_BUILDER \ + (garrow_array_builder_get_type()) +#define GARROW_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_ARRAY_BUILDER, \ + GArrowArrayBuilder)) +#define GARROW_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_ARRAY_BUILDER, \ + GArrowArrayBuilderClass)) +#define GARROW_IS_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_ARRAY_BUILDER)) +#define GARROW_IS_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_ARRAY_BUILDER)) +#define GARROW_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_ARRAY_BUILDER, \ + GArrowArrayBuilderClass)) + +typedef struct _GArrowArrayBuilder GArrowArrayBuilder; +typedef struct _GArrowArrayBuilderClass GArrowArrayBuilderClass; + +/** + * GArrowArrayBuilder: + * + * It wraps `arrow::ArrayBuilder`. + */ +struct _GArrowArrayBuilder +{ + /*< private >*/ + GObject parent_instance; +}; + +struct _GArrowArrayBuilderClass +{ + GObjectClass parent_class; +}; + +GType garrow_array_builder_get_type (void) G_GNUC_CONST; + +GArrowArray *garrow_array_builder_finish (GArrowArrayBuilder *builder); + +G_END_DECLS diff --git a/c_glib/arrow-glib/array-builder.hpp b/c_glib/arrow-glib/array-builder.hpp new file mode 100644 index 0000000000000..becebb23f9bb0 --- /dev/null +++ b/c_glib/arrow-glib/array-builder.hpp @@ -0,0 +1,26 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include +#include + +GArrowArrayBuilder *garrow_array_builder_new_raw(std::shared_ptr *arrow_builder); +std::shared_ptr garrow_array_builder_get_raw(GArrowArrayBuilder *builder); diff --git a/c_glib/arrow-glib/array.cpp b/c_glib/arrow-glib/array.cpp new file mode 100644 index 0000000000000..5dacb07ba8710 --- /dev/null +++ b/c_glib/arrow-glib/array.cpp @@ -0,0 +1,268 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +G_BEGIN_DECLS + +/** + * SECTION: array + * @short_description: Base class for all array classes + * + * #GArrowArray is a base class for all array classes such as + * #GArrowBooleanArray. + * + * Array is immutable. You need to use array builder class such as + * #GArrowBooleanArrayBuilder to create a new array. + */ + +typedef struct GArrowArrayPrivate_ { + std::shared_ptr array; +} GArrowArrayPrivate; + +enum { + PROP_0, + PROP_ARRAY +}; + +G_DEFINE_TYPE_WITH_PRIVATE(GArrowArray, garrow_array, G_TYPE_OBJECT) + +#define GARROW_ARRAY_GET_PRIVATE(obj) \ + (G_TYPE_INSTANCE_GET_PRIVATE((obj), GARROW_TYPE_ARRAY, GArrowArrayPrivate)) + +static void +garrow_array_finalize(GObject *object) +{ + auto priv = GARROW_ARRAY_GET_PRIVATE(object); + + priv->array = nullptr; + + G_OBJECT_CLASS(garrow_array_parent_class)->finalize(object); +} + +static void +garrow_array_set_property(GObject *object, + guint prop_id, + const GValue *value, + GParamSpec *pspec) +{ + auto priv = GARROW_ARRAY_GET_PRIVATE(object); + + switch (prop_id) { + case PROP_ARRAY: + priv->array = + *static_cast *>(g_value_get_pointer(value)); + break; + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_array_get_property(GObject *object, + guint prop_id, + GValue *value, + GParamSpec *pspec) +{ + switch (prop_id) { + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_array_init(GArrowArray *object) +{ +} + +static void +garrow_array_class_init(GArrowArrayClass *klass) +{ + GParamSpec *spec; + + auto gobject_class = G_OBJECT_CLASS(klass); + + gobject_class->finalize = garrow_array_finalize; + gobject_class->set_property = garrow_array_set_property; + gobject_class->get_property = garrow_array_get_property; + + spec = g_param_spec_pointer("array", + "Array", + "The raw std::shared *", + static_cast(G_PARAM_WRITABLE | + G_PARAM_CONSTRUCT_ONLY)); + g_object_class_install_property(gobject_class, PROP_ARRAY, spec); +} + +/** + * garrow_array_get_length: + * @array: A #GArrowArray. + * + * Returns: The number of rows in the array. + */ +gint64 +garrow_array_get_length(GArrowArray *array) +{ + auto arrow_array = garrow_array_get_raw(array); + return arrow_array->length(); +} + +/** + * garrow_array_get_offset: + * @array: A #GArrowArray. + * + * Returns: The number of values in the array. + */ +gint64 +garrow_array_get_offset(GArrowArray *array) +{ + auto arrow_array = garrow_array_get_raw(array); + return arrow_array->offset(); +} + +/** + * garrow_array_get_n_nulls: + * @array: A #GArrowArray. + * + * Returns: The number of NULLs in the array. + */ +gint64 +garrow_array_get_n_nulls(GArrowArray *array) +{ + auto arrow_array = garrow_array_get_raw(array); + return arrow_array->null_count(); +} + +/** + * garrow_array_slice: + * @array: A #GArrowArray. + * @offset: The offset of sub #GArrowArray. + * @length: The length of sub #GArrowArray. + * + * Returns: (transfer full): The sub #GArrowArray. It covers only from + * `offset` to `offset + length` range. The sub #GArrowArray shares + * values with the base #GArrowArray. 
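+ *
+ * For example, the following sketch takes a three-value window out of a
+ * larger array without copying the underlying buffers:
+ *
+ * |[<!-- language="C" -->
+ * // Values at indices 2, 3 and 4 of the base array.
+ * GArrowArray *sub_array = garrow_array_slice(array, 2, 3);
+ * // ... use sub_array ...
+ * g_object_unref(sub_array);
+ * ]|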
+ */ +GArrowArray * +garrow_array_slice(GArrowArray *array, + gint64 offset, + gint64 length) +{ + const auto arrow_array = garrow_array_get_raw(array); + auto arrow_sub_array = arrow_array->Slice(offset, length); + return garrow_array_new_raw(&arrow_sub_array); +} + +G_END_DECLS + +GArrowArray * +garrow_array_new_raw(std::shared_ptr *arrow_array) +{ + GType type; + GArrowArray *array; + + switch ((*arrow_array)->type_enum()) { + case arrow::Type::type::NA: + type = GARROW_TYPE_NULL_ARRAY; + break; + case arrow::Type::type::BOOL: + type = GARROW_TYPE_BOOLEAN_ARRAY; + break; + case arrow::Type::type::UINT8: + type = GARROW_TYPE_UINT8_ARRAY; + break; + case arrow::Type::type::INT8: + type = GARROW_TYPE_INT8_ARRAY; + break; + case arrow::Type::type::UINT16: + type = GARROW_TYPE_UINT16_ARRAY; + break; + case arrow::Type::type::INT16: + type = GARROW_TYPE_INT16_ARRAY; + break; + case arrow::Type::type::UINT32: + type = GARROW_TYPE_UINT32_ARRAY; + break; + case arrow::Type::type::INT32: + type = GARROW_TYPE_INT32_ARRAY; + break; + case arrow::Type::type::UINT64: + type = GARROW_TYPE_UINT64_ARRAY; + break; + case arrow::Type::type::INT64: + type = GARROW_TYPE_INT64_ARRAY; + break; + case arrow::Type::type::FLOAT: + type = GARROW_TYPE_FLOAT_ARRAY; + break; + case arrow::Type::type::DOUBLE: + type = GARROW_TYPE_DOUBLE_ARRAY; + break; + case arrow::Type::type::BINARY: + type = GARROW_TYPE_BINARY_ARRAY; + break; + case arrow::Type::type::STRING: + type = GARROW_TYPE_STRING_ARRAY; + break; + case arrow::Type::type::LIST: + type = GARROW_TYPE_LIST_ARRAY; + break; + case arrow::Type::type::STRUCT: + type = GARROW_TYPE_STRUCT_ARRAY; + break; + default: + type = GARROW_TYPE_ARRAY; + break; + } + array = GARROW_ARRAY(g_object_new(type, + "array", arrow_array, + NULL)); + return array; +} + +std::shared_ptr +garrow_array_get_raw(GArrowArray *array) +{ + auto priv = GARROW_ARRAY_GET_PRIVATE(array); + return priv->array; +} diff --git a/c_glib/arrow-glib/array.h b/c_glib/arrow-glib/array.h new file mode 100644 index 0000000000000..9b1fa7e1e4a31 --- /dev/null +++ b/c_glib/arrow-glib/array.h @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_ARRAY \ + (garrow_array_get_type()) +#define GARROW_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), GARROW_TYPE_ARRAY, GArrowArray)) +#define GARROW_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), GARROW_TYPE_ARRAY, GArrowArrayClass)) +#define GARROW_IS_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), GARROW_TYPE_ARRAY)) +#define GARROW_IS_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), GARROW_TYPE_ARRAY)) +#define GARROW_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), GARROW_TYPE_ARRAY, GArrowArrayClass)) + +typedef struct _GArrowArray GArrowArray; +typedef struct _GArrowArrayClass GArrowArrayClass; + +/** + * GArrowArray: + * + * It wraps `arrow::Array`. + */ +struct _GArrowArray +{ + /*< private >*/ + GObject parent_instance; +}; + +struct _GArrowArrayClass +{ + GObjectClass parent_class; +}; + +GType garrow_array_get_type (void) G_GNUC_CONST; + +gint64 garrow_array_get_length (GArrowArray *array); +gint64 garrow_array_get_offset (GArrowArray *array); +gint64 garrow_array_get_n_nulls (GArrowArray *array); +GArrowArray *garrow_array_slice (GArrowArray *array, + gint64 offset, + gint64 length); + +G_END_DECLS diff --git a/c_glib/arrow-glib/array.hpp b/c_glib/arrow-glib/array.hpp new file mode 100644 index 0000000000000..d2dff22c48cf9 --- /dev/null +++ b/c_glib/arrow-glib/array.hpp @@ -0,0 +1,27 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +#include + +GArrowArray *garrow_array_new_raw(std::shared_ptr *arrow_array); +std::shared_ptr garrow_array_get_raw(GArrowArray *array); diff --git a/c_glib/arrow-glib/arrow-glib.h b/c_glib/arrow-glib/arrow-glib.h new file mode 100644 index 0000000000000..4356234a4a63d --- /dev/null +++ b/c_glib/arrow-glib/arrow-glib.h @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +#pragma once + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include diff --git a/c_glib/arrow-glib/arrow-glib.hpp b/c_glib/arrow-glib/arrow-glib.hpp new file mode 100644 index 0000000000000..70fda8da7c526 --- /dev/null +++ b/c_glib/arrow-glib/arrow-glib.hpp @@ -0,0 +1,37 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include diff --git a/c_glib/arrow-glib/arrow-glib.pc.in b/c_glib/arrow-glib/arrow-glib.pc.in new file mode 100644 index 0000000000000..f9f27b2499057 --- /dev/null +++ b/c_glib/arrow-glib/arrow-glib.pc.in @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +prefix=@prefix@ +exec_prefix=@exec_prefix@ +libdir=@libdir@ +includedir=@includedir@ + +Name: Apache Arrow GLib +Description: C API for Apache Arrow based on GLib +Version: @VERSION@ +Libs: -L${libdir} -larrow-glib +Cflags: -I${includedir} +Requires: gobject-2.0 arrow diff --git a/c_glib/arrow-glib/arrow-io-glib.h b/c_glib/arrow-glib/arrow-io-glib.h new file mode 100644 index 0000000000000..e02aa9b96982b --- /dev/null +++ b/c_glib/arrow-glib/arrow-io-glib.h @@ -0,0 +1,32 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include diff --git a/c_glib/arrow-glib/arrow-io-glib.hpp b/c_glib/arrow-glib/arrow-io-glib.hpp new file mode 100644 index 0000000000000..378f20216b6a1 --- /dev/null +++ b/c_glib/arrow-glib/arrow-io-glib.hpp @@ -0,0 +1,30 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include +#include +#include +#include +#include +#include +#include +#include +#include diff --git a/c_glib/arrow-glib/arrow-io-glib.pc.in b/c_glib/arrow-glib/arrow-io-glib.pc.in new file mode 100644 index 0000000000000..4256184cf7348 --- /dev/null +++ b/c_glib/arrow-glib/arrow-io-glib.pc.in @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +prefix=@prefix@ +exec_prefix=@exec_prefix@ +libdir=@libdir@ +includedir=@includedir@ + +Name: Apache Arrow I/O GLib +Description: C API for Apache Arrow I/O based on GLib +Version: @VERSION@ +Libs: -L${libdir} -larrow-glib-io +Cflags: -I${includedir} +Requires: arrow-glib arrow-io diff --git a/c_glib/arrow-glib/arrow-ipc-glib.h b/c_glib/arrow-glib/arrow-ipc-glib.h new file mode 100644 index 0000000000000..4954d83cd0728 --- /dev/null +++ b/c_glib/arrow-glib/arrow-ipc-glib.h @@ -0,0 +1,27 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. 
See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include +#include +#include +#include +#include +#include diff --git a/c_glib/arrow-glib/arrow-ipc-glib.hpp b/c_glib/arrow-glib/arrow-ipc-glib.hpp new file mode 100644 index 0000000000000..d32bc052b98e5 --- /dev/null +++ b/c_glib/arrow-glib/arrow-ipc-glib.hpp @@ -0,0 +1,30 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +#include + +#include +#include +#include +#include +#include diff --git a/c_glib/arrow-glib/arrow-ipc-glib.pc.in b/c_glib/arrow-glib/arrow-ipc-glib.pc.in new file mode 100644 index 0000000000000..0b04c4a808ff1 --- /dev/null +++ b/c_glib/arrow-glib/arrow-ipc-glib.pc.in @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
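+#
+# Note (illustrative only, not part of the pkg-config format): once this
+# file is installed, an application can be compiled against the library
+# with, e.g.:
+#
+#   cc sample.c $(pkg-config --cflags --libs arrow-ipc-glib)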
+
+prefix=@prefix@
+exec_prefix=@exec_prefix@
+libdir=@libdir@
+includedir=@includedir@
+
+Name: Apache Arrow IPC GLib
+Description: C API for Apache Arrow IPC based on GLib
+Version: @VERSION@
+Libs: -L${libdir} -larrow-ipc-glib
+Cflags: -I${includedir}
+Requires: arrow-io-glib arrow-ipc
diff --git a/c_glib/arrow-glib/binary-array-builder.cpp b/c_glib/arrow-glib/binary-array-builder.cpp
new file mode 100644
index 0000000000000..ab11535eb8595
--- /dev/null
+++ b/c_glib/arrow-glib/binary-array-builder.cpp
@@ -0,0 +1,122 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+#  include <config.h>
+#endif
+
+#include <arrow-glib/array-builder.hpp>
+#include <arrow-glib/binary-array-builder.h>
+#include <arrow-glib/error.hpp>
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: binary-array-builder
+ * @short_description: Binary array builder class
+ *
+ * #GArrowBinaryArrayBuilder is the class to create a new
+ * #GArrowBinaryArray.
+ */
+
+G_DEFINE_TYPE(GArrowBinaryArrayBuilder,
+              garrow_binary_array_builder,
+              GARROW_TYPE_ARRAY_BUILDER)
+
+static void
+garrow_binary_array_builder_init(GArrowBinaryArrayBuilder *builder)
+{
+}
+
+static void
+garrow_binary_array_builder_class_init(GArrowBinaryArrayBuilderClass *klass)
+{
+}
+
+/**
+ * garrow_binary_array_builder_new:
+ *
+ * Returns: A newly created #GArrowBinaryArrayBuilder.
+ */
+GArrowBinaryArrayBuilder *
+garrow_binary_array_builder_new(void)
+{
+  auto memory_pool = arrow::default_memory_pool();
+  auto arrow_builder =
+    std::make_shared<arrow::BinaryBuilder>(memory_pool, arrow::binary());
+  auto builder =
+    GARROW_BINARY_ARRAY_BUILDER(g_object_new(GARROW_TYPE_BINARY_ARRAY_BUILDER,
+                                             "array-builder", &arrow_builder,
+                                             NULL));
+  return builder;
+}
+
+/**
+ * garrow_binary_array_builder_append:
+ * @builder: A #GArrowBinaryArrayBuilder.
+ * @value: (array length=length): A binary value.
+ * @length: A value length.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ */
+gboolean
+garrow_binary_array_builder_append(GArrowBinaryArrayBuilder *builder,
+                                   const guint8 *value,
+                                   gint32 length,
+                                   GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::BinaryBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->Append(value, length);
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[binary-array-builder][append]");
+    return FALSE;
+  }
+}
+
+/**
+ * garrow_binary_array_builder_append_null:
+ * @builder: A #GArrowBinaryArrayBuilder.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
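+ *
+ * A minimal error-handling sketch (builder construction omitted):
+ *
+ * |[<!-- language="C" -->
+ * GError *error = NULL;
+ * if (!garrow_binary_array_builder_append_null(builder, &error)) {
+ *   // error is set by the builder on failure.
+ *   g_printerr("append-null failed: %s\n", error->message);
+ *   g_clear_error(&error);
+ * }
+ * ]|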
+ */ +gboolean +garrow_binary_array_builder_append_null(GArrowBinaryArrayBuilder *builder, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->AppendNull(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[binary-array-builder][append-null]"); + return FALSE; + } +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/binary-array-builder.h b/c_glib/arrow-glib/binary-array-builder.h new file mode 100644 index 0000000000000..111a83a3a09b0 --- /dev/null +++ b/c_glib/arrow-glib/binary-array-builder.h @@ -0,0 +1,77 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_BINARY_ARRAY_BUILDER \ + (garrow_binary_array_builder_get_type()) +#define GARROW_BINARY_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_BINARY_ARRAY_BUILDER, \ + GArrowBinaryArrayBuilder)) +#define GARROW_BINARY_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_BINARY_ARRAY_BUILDER, \ + GArrowBinaryArrayBuilderClass)) +#define GARROW_IS_BINARY_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_BINARY_ARRAY_BUILDER)) +#define GARROW_IS_BINARY_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_BINARY_ARRAY_BUILDER)) +#define GARROW_BINARY_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_BINARY_ARRAY_BUILDER, \ + GArrowBinaryArrayBuilderClass)) + +typedef struct _GArrowBinaryArrayBuilder GArrowBinaryArrayBuilder; +typedef struct _GArrowBinaryArrayBuilderClass GArrowBinaryArrayBuilderClass; + +/** + * GArrowBinaryArrayBuilder: + * + * It wraps `arrow::BinaryBuilder`. + */ +struct _GArrowBinaryArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowBinaryArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_binary_array_builder_get_type(void) G_GNUC_CONST; + +GArrowBinaryArrayBuilder *garrow_binary_array_builder_new(void); + +gboolean garrow_binary_array_builder_append(GArrowBinaryArrayBuilder *builder, + const guint8 *value, + gint32 length, + GError **error); +gboolean garrow_binary_array_builder_append_null(GArrowBinaryArrayBuilder *builder, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/binary-array.cpp b/c_glib/arrow-glib/binary-array.cpp new file mode 100644 index 0000000000000..c149d14025ae7 --- /dev/null +++ b/c_glib/arrow-glib/binary-array.cpp @@ -0,0 +1,73 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. 
See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: binary-array + * @short_description: Binary array class + * + * #GArrowBinaryArray is a class for binary array. It can store zero + * or more binary data. + * + * #GArrowBinaryArray is immutable. You need to use + * #GArrowBinaryArrayBuilder to create a new array. + */ + +G_DEFINE_TYPE(GArrowBinaryArray, \ + garrow_binary_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_binary_array_init(GArrowBinaryArray *object) +{ +} + +static void +garrow_binary_array_class_init(GArrowBinaryArrayClass *klass) +{ +} + +/** + * garrow_binary_array_get_value: + * @array: A #GArrowBinaryArray. + * @i: The index of the target value. + * @length: (out): The length of the value. + * + * Returns: (array length=length): The i-th value. + */ +const guint8 * +garrow_binary_array_get_value(GArrowBinaryArray *array, + gint64 i, + gint32 *length) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + auto arrow_binary_array = + static_cast(arrow_array.get()); + return arrow_binary_array->GetValue(i, length); +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/binary-array.h b/c_glib/arrow-glib/binary-array.h new file mode 100644 index 0000000000000..ab63ece9844f8 --- /dev/null +++ b/c_glib/arrow-glib/binary-array.h @@ -0,0 +1,72 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_BINARY_ARRAY \ + (garrow_binary_array_get_type()) +#define GARROW_BINARY_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_BINARY_ARRAY, \ + GArrowBinaryArray)) +#define GARROW_BINARY_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_BINARY_ARRAY, \ + GArrowBinaryArrayClass)) +#define GARROW_IS_BINARY_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_BINARY_ARRAY)) +#define GARROW_IS_BINARY_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_BINARY_ARRAY)) +#define GARROW_BINARY_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_BINARY_ARRAY, \ + GArrowBinaryArrayClass)) + +typedef struct _GArrowBinaryArray GArrowBinaryArray; +typedef struct _GArrowBinaryArrayClass GArrowBinaryArrayClass; + +/** + * GArrowBinaryArray: + * + * It wraps `arrow::BinaryArray`. + */ +struct _GArrowBinaryArray +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowBinaryArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_binary_array_get_type(void) G_GNUC_CONST; + +const guint8 *garrow_binary_array_get_value(GArrowBinaryArray *array, + gint64 i, + gint32 *length); + +G_END_DECLS diff --git a/c_glib/arrow-glib/binary-data-type.cpp b/c_glib/arrow-glib/binary-data-type.cpp new file mode 100644 index 0000000000000..e5187f7d94efe --- /dev/null +++ b/c_glib/arrow-glib/binary-data-type.cpp @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: binary-data-type + * @short_description: Binary data type + * + * #GArrowBinaryDataType is a class for binary data type. + */ + +G_DEFINE_TYPE(GArrowBinaryDataType, \ + garrow_binary_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_binary_data_type_init(GArrowBinaryDataType *object) +{ +} + +static void +garrow_binary_data_type_class_init(GArrowBinaryDataTypeClass *klass) +{ +} + +/** + * garrow_binary_data_type_new: + * + * Returns: The newly created binary data type. + */ +GArrowBinaryDataType * +garrow_binary_data_type_new(void) +{ + auto arrow_data_type = arrow::binary(); + + GArrowBinaryDataType *data_type = + GARROW_BINARY_DATA_TYPE(g_object_new(GARROW_TYPE_BINARY_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/binary-data-type.h b/c_glib/arrow-glib/binary-data-type.h new file mode 100644 index 0000000000000..9654fe216376e --- /dev/null +++ b/c_glib/arrow-glib/binary-data-type.h @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. 
See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_BINARY_DATA_TYPE \ + (garrow_binary_data_type_get_type()) +#define GARROW_BINARY_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_BINARY_DATA_TYPE, \ + GArrowBinaryDataType)) +#define GARROW_BINARY_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_BINARY_DATA_TYPE, \ + GArrowBinaryDataTypeClass)) +#define GARROW_IS_BINARY_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_BINARY_DATA_TYPE)) +#define GARROW_IS_BINARY_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_BINARY_DATA_TYPE)) +#define GARROW_BINARY_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_BINARY_DATA_TYPE, \ + GArrowBinaryDataTypeClass)) + +typedef struct _GArrowBinaryDataType GArrowBinaryDataType; +typedef struct _GArrowBinaryDataTypeClass GArrowBinaryDataTypeClass; + +/** + * GArrowBinaryDataType: + * + * It wraps `arrow::BinaryType`. + */ +struct _GArrowBinaryDataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowBinaryDataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_binary_data_type_get_type (void) G_GNUC_CONST; +GArrowBinaryDataType *garrow_binary_data_type_new (void); + +G_END_DECLS diff --git a/c_glib/arrow-glib/boolean-array-builder.cpp b/c_glib/arrow-glib/boolean-array-builder.cpp new file mode 100644 index 0000000000000..1a4c1f9fd8f7e --- /dev/null +++ b/c_glib/arrow-glib/boolean-array-builder.cpp @@ -0,0 +1,120 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: boolean-array-builder + * @short_description: Boolean array builder class + * + * #GArrowBooleanArrayBuilder is the class to create a new + * #GArrowBooleanArray. 
+ */ + +G_DEFINE_TYPE(GArrowBooleanArrayBuilder, + garrow_boolean_array_builder, + GARROW_TYPE_ARRAY_BUILDER) + +static void +garrow_boolean_array_builder_init(GArrowBooleanArrayBuilder *builder) +{ +} + +static void +garrow_boolean_array_builder_class_init(GArrowBooleanArrayBuilderClass *klass) +{ +} + +/** + * garrow_boolean_array_builder_new: + * + * Returns: A newly created #GArrowBooleanArrayBuilder. + */ +GArrowBooleanArrayBuilder * +garrow_boolean_array_builder_new(void) +{ + auto memory_pool = arrow::default_memory_pool(); + auto arrow_builder = + std::make_shared(memory_pool); + auto builder = + GARROW_BOOLEAN_ARRAY_BUILDER(g_object_new(GARROW_TYPE_BOOLEAN_ARRAY_BUILDER, + "array-builder", &arrow_builder, + NULL)); + return builder; +} + +/** + * garrow_boolean_array_builder_append: + * @builder: A #GArrowBooleanArrayBuilder. + * @value: A boolean value. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_boolean_array_builder_append(GArrowBooleanArrayBuilder *builder, + gboolean value, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->Append(value); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[boolean-array-builder][append]"); + return FALSE; + } +} + +/** + * garrow_boolean_array_builder_append_null: + * @builder: A #GArrowBooleanArrayBuilder. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_boolean_array_builder_append_null(GArrowBooleanArrayBuilder *builder, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->AppendNull(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[boolean-array-builder][append-null]"); + return FALSE; + } +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/boolean-array-builder.h b/c_glib/arrow-glib/boolean-array-builder.h new file mode 100644 index 0000000000000..ca50e9797d41c --- /dev/null +++ b/c_glib/arrow-glib/boolean-array-builder.h @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_BOOLEAN_ARRAY_BUILDER \ + (garrow_boolean_array_builder_get_type()) +#define GARROW_BOOLEAN_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_BOOLEAN_ARRAY_BUILDER, \ + GArrowBooleanArrayBuilder)) +#define GARROW_BOOLEAN_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_BOOLEAN_ARRAY_BUILDER, \ + GArrowBooleanArrayBuilderClass)) +#define GARROW_IS_BOOLEAN_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_BOOLEAN_ARRAY_BUILDER)) +#define GARROW_IS_BOOLEAN_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_BOOLEAN_ARRAY_BUILDER)) +#define GARROW_BOOLEAN_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_BOOLEAN_ARRAY_BUILDER, \ + GArrowBooleanArrayBuilderClass)) + +typedef struct _GArrowBooleanArrayBuilder GArrowBooleanArrayBuilder; +typedef struct _GArrowBooleanArrayBuilderClass GArrowBooleanArrayBuilderClass; + +/** + * GArrowBooleanArrayBuilder: + * + * It wraps `arrow::BooleanBuilder`. + */ +struct _GArrowBooleanArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowBooleanArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_boolean_array_builder_get_type(void) G_GNUC_CONST; + +GArrowBooleanArrayBuilder *garrow_boolean_array_builder_new(void); + +gboolean garrow_boolean_array_builder_append(GArrowBooleanArrayBuilder *builder, + gboolean value, + GError **error); +gboolean garrow_boolean_array_builder_append_null(GArrowBooleanArrayBuilder *builder, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/boolean-array.cpp b/c_glib/arrow-glib/boolean-array.cpp new file mode 100644 index 0000000000000..62fc40fd54112 --- /dev/null +++ b/c_glib/arrow-glib/boolean-array.cpp @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: boolean-array + * @short_description: Boolean array class + * + * #GArrowBooleanArray is a class for binary array. It can store zero + * or more boolean data. + * + * #GArrowBooleanArray is immutable. You need to use + * #GArrowBooleanArrayBuilder to create a new array. + */ + +G_DEFINE_TYPE(GArrowBooleanArray, \ + garrow_boolean_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_boolean_array_init(GArrowBooleanArray *object) +{ +} + +static void +garrow_boolean_array_class_init(GArrowBooleanArrayClass *klass) +{ +} + +/** + * garrow_boolean_array_get_value: + * @array: A #GArrowBooleanArray. + * @i: The index of the target value. + * + * Returns: The i-th value. 
+ */ +gboolean +garrow_boolean_array_get_value(GArrowBooleanArray *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + return static_cast(arrow_array.get())->Value(i); +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/boolean-array.h b/c_glib/arrow-glib/boolean-array.h new file mode 100644 index 0000000000000..9899fdf0ceca8 --- /dev/null +++ b/c_glib/arrow-glib/boolean-array.h @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_BOOLEAN_ARRAY \ + (garrow_boolean_array_get_type()) +#define GARROW_BOOLEAN_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_BOOLEAN_ARRAY, \ + GArrowBooleanArray)) +#define GARROW_BOOLEAN_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_BOOLEAN_ARRAY, \ + GArrowBooleanArrayClass)) +#define GARROW_IS_BOOLEAN_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_BOOLEAN_ARRAY)) +#define GARROW_IS_BOOLEAN_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_BOOLEAN_ARRAY)) +#define GARROW_BOOLEAN_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_BOOLEAN_ARRAY, \ + GArrowBooleanArrayClass)) + +typedef struct _GArrowBooleanArray GArrowBooleanArray; +typedef struct _GArrowBooleanArrayClass GArrowBooleanArrayClass; + +/** + * GArrowBooleanArray: + * + * It wraps `arrow::BooleanArray`. + */ +struct _GArrowBooleanArray +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowBooleanArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_boolean_array_get_type (void) G_GNUC_CONST; +gboolean garrow_boolean_array_get_value (GArrowBooleanArray *array, + gint64 i); + +G_END_DECLS diff --git a/c_glib/arrow-glib/boolean-data-type.cpp b/c_glib/arrow-glib/boolean-data-type.cpp new file mode 100644 index 0000000000000..99c73d9ff8873 --- /dev/null +++ b/c_glib/arrow-glib/boolean-data-type.cpp @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. 
See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: boolean-data-type + * @short_description: Boolean data type + * + * #GArrowBooleanDataType is a class for boolean data type. + */ + +G_DEFINE_TYPE(GArrowBooleanDataType, \ + garrow_boolean_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_boolean_data_type_init(GArrowBooleanDataType *object) +{ +} + +static void +garrow_boolean_data_type_class_init(GArrowBooleanDataTypeClass *klass) +{ +} + +/** + * garrow_boolean_data_type_new: + * + * Returns: The newly created boolean data type. + */ +GArrowBooleanDataType * +garrow_boolean_data_type_new(void) +{ + auto arrow_data_type = arrow::boolean(); + + GArrowBooleanDataType *data_type = + GARROW_BOOLEAN_DATA_TYPE(g_object_new(GARROW_TYPE_BOOLEAN_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/boolean-data-type.h b/c_glib/arrow-glib/boolean-data-type.h new file mode 100644 index 0000000000000..ad30c99960a8e --- /dev/null +++ b/c_glib/arrow-glib/boolean-data-type.h @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_BOOLEAN_DATA_TYPE \ + (garrow_boolean_data_type_get_type()) +#define GARROW_BOOLEAN_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_BOOLEAN_DATA_TYPE, \ + GArrowBooleanDataType)) +#define GARROW_BOOLEAN_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_BOOLEAN_DATA_TYPE, \ + GArrowBooleanDataTypeClass)) +#define GARROW_IS_BOOLEAN_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_BOOLEAN_DATA_TYPE)) +#define GARROW_IS_BOOLEAN_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_BOOLEAN_DATA_TYPE)) +#define GARROW_BOOLEAN_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_BOOLEAN_DATA_TYPE, \ + GArrowBooleanDataTypeClass)) + +typedef struct _GArrowBooleanDataType GArrowBooleanDataType; +typedef struct _GArrowBooleanDataTypeClass GArrowBooleanDataTypeClass; + +/** + * GArrowBooleanDataType: + * + * It wraps `arrow::BooleanType`. 
+ */ +struct _GArrowBooleanDataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowBooleanDataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_boolean_data_type_get_type (void) G_GNUC_CONST; +GArrowBooleanDataType *garrow_boolean_data_type_new (void); + +G_END_DECLS diff --git a/c_glib/arrow-glib/chunked-array.cpp b/c_glib/arrow-glib/chunked-array.cpp new file mode 100644 index 0000000000000..e732ece73c7f9 --- /dev/null +++ b/c_glib/arrow-glib/chunked-array.cpp @@ -0,0 +1,241 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: chunked-array + * @short_description: Chunked array class + * + * #GArrowChunkedArray is a class for chunked array. Chunked array + * makes a list of #GArrowArrays one logical large array. + */ + +typedef struct GArrowChunkedArrayPrivate_ { + std::shared_ptr chunked_array; +} GArrowChunkedArrayPrivate; + +enum { + PROP_0, + PROP_CHUNKED_ARRAY +}; + +G_DEFINE_TYPE_WITH_PRIVATE(GArrowChunkedArray, + garrow_chunked_array, + G_TYPE_OBJECT) + +#define GARROW_CHUNKED_ARRAY_GET_PRIVATE(obj) \ + (G_TYPE_INSTANCE_GET_PRIVATE((obj), \ + GARROW_TYPE_CHUNKED_ARRAY, \ + GArrowChunkedArrayPrivate)) + +static void +garrow_chunked_array_finalize(GObject *object) +{ + GArrowChunkedArrayPrivate *priv; + + priv = GARROW_CHUNKED_ARRAY_GET_PRIVATE(object); + + priv->chunked_array = nullptr; + + G_OBJECT_CLASS(garrow_chunked_array_parent_class)->finalize(object); +} + +static void +garrow_chunked_array_set_property(GObject *object, + guint prop_id, + const GValue *value, + GParamSpec *pspec) +{ + GArrowChunkedArrayPrivate *priv; + + priv = GARROW_CHUNKED_ARRAY_GET_PRIVATE(object); + + switch (prop_id) { + case PROP_CHUNKED_ARRAY: + priv->chunked_array = + *static_cast *>(g_value_get_pointer(value)); + break; + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_chunked_array_get_property(GObject *object, + guint prop_id, + GValue *value, + GParamSpec *pspec) +{ + switch (prop_id) { + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_chunked_array_init(GArrowChunkedArray *object) +{ +} + +static void +garrow_chunked_array_class_init(GArrowChunkedArrayClass *klass) +{ + GObjectClass *gobject_class; + GParamSpec *spec; + + gobject_class = G_OBJECT_CLASS(klass); + + gobject_class->finalize = garrow_chunked_array_finalize; + gobject_class->set_property = garrow_chunked_array_set_property; + gobject_class->get_property = garrow_chunked_array_get_property; + + spec = g_param_spec_pointer("chunked-array", + "Chunked array", + "The raw std::shared *", + 
static_cast(G_PARAM_WRITABLE | + G_PARAM_CONSTRUCT_ONLY)); + g_object_class_install_property(gobject_class, PROP_CHUNKED_ARRAY, spec); +} + +/** + * garrow_chunked_array_new: + * @chunks: (element-type GArrowArray): The array chunks. + * + * Returns: A newly created #GArrowChunkedArray. + */ +GArrowChunkedArray * +garrow_chunked_array_new(GList *chunks) +{ + std::vector> arrow_chunks; + for (GList *node = chunks; node; node = node->next) { + GArrowArray *chunk = GARROW_ARRAY(node->data); + arrow_chunks.push_back(garrow_array_get_raw(chunk)); + } + + auto arrow_chunked_array = + std::make_shared(arrow_chunks); + return garrow_chunked_array_new_raw(&arrow_chunked_array); +} + +/** + * garrow_chunked_array_get_length: + * @chunked_array: A #GArrowChunkedArray. + * + * Returns: The total number of rows in the chunked array. + */ +guint64 +garrow_chunked_array_get_length(GArrowChunkedArray *chunked_array) +{ + const auto arrow_chunked_array = garrow_chunked_array_get_raw(chunked_array); + return arrow_chunked_array->length(); +} + +/** + * garrow_chunked_array_get_n_nulls: + * @chunked_array: A #GArrowChunkedArray. + * + * Returns: The total number of NULL in the chunked array. + */ +guint64 +garrow_chunked_array_get_n_nulls(GArrowChunkedArray *chunked_array) +{ + const auto arrow_chunked_array = garrow_chunked_array_get_raw(chunked_array); + return arrow_chunked_array->null_count(); +} + +/** + * garrow_chunked_array_get_n_chunks: + * @chunked_array: A #GArrowChunkedArray. + * + * Returns: The total number of chunks in the chunked array. + */ +guint +garrow_chunked_array_get_n_chunks(GArrowChunkedArray *chunked_array) +{ + const auto arrow_chunked_array = garrow_chunked_array_get_raw(chunked_array); + return arrow_chunked_array->num_chunks(); +} + +/** + * garrow_chunked_array_get_chunk: + * @chunked_array: A #GArrowChunkedArray. + * @i: The index of the target chunk. + * + * Returns: (transfer full): The i-th chunk of the chunked array. + */ +GArrowArray * +garrow_chunked_array_get_chunk(GArrowChunkedArray *chunked_array, + guint i) +{ + const auto arrow_chunked_array = garrow_chunked_array_get_raw(chunked_array); + auto arrow_chunk = arrow_chunked_array->chunk(i); + return garrow_array_new_raw(&arrow_chunk); +} + +/** + * garrow_chunked_array_get_chunks: + * @chunked_array: A #GArrowChunkedArray. + * + * Returns: (element-type GArrowArray) (transfer full): + * The chunks in the chunked array. 
+ */ +GList * +garrow_chunked_array_get_chunks(GArrowChunkedArray *chunked_array) +{ + const auto arrow_chunked_array = garrow_chunked_array_get_raw(chunked_array); + + GList *chunks = NULL; + for (auto arrow_chunk : arrow_chunked_array->chunks()) { + GArrowArray *chunk = garrow_array_new_raw(&arrow_chunk); + chunks = g_list_prepend(chunks, chunk); + } + + return g_list_reverse(chunks); +} + +G_END_DECLS + +GArrowChunkedArray * +garrow_chunked_array_new_raw(std::shared_ptr *arrow_chunked_array) +{ + auto chunked_array = + GARROW_CHUNKED_ARRAY(g_object_new(GARROW_TYPE_CHUNKED_ARRAY, + "chunked-array", arrow_chunked_array, + NULL)); + return chunked_array; +} + +std::shared_ptr +garrow_chunked_array_get_raw(GArrowChunkedArray *chunked_array) +{ + GArrowChunkedArrayPrivate *priv; + + priv = GARROW_CHUNKED_ARRAY_GET_PRIVATE(chunked_array); + return priv->chunked_array; +} diff --git a/c_glib/arrow-glib/chunked-array.h b/c_glib/arrow-glib/chunked-array.h new file mode 100644 index 0000000000000..338930b9bd84a --- /dev/null +++ b/c_glib/arrow-glib/chunked-array.h @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_CHUNKED_ARRAY \ + (garrow_chunked_array_get_type()) +#define GARROW_CHUNKED_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_CHUNKED_ARRAY, \ + GArrowChunkedArray)) +#define GARROW_CHUNKED_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_CHUNKED_ARRAY, \ + GArrowChunkedArrayClass)) +#define GARROW_IS_CHUNKED_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_CHUNKED_ARRAY)) +#define GARROW_IS_CHUNKED_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_CHUNKED_ARRAY)) +#define GARROW_CHUNKED_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_CHUNKED_ARRAY, \ + GArrowChunkedArrayClass)) + +typedef struct _GArrowChunkedArray GArrowChunkedArray; +typedef struct _GArrowChunkedArrayClass GArrowChunkedArrayClass; + +/** + * GArrowChunkedArray: + * + * It wraps `arrow::ChunkedArray`. 
+ */ +struct _GArrowChunkedArray +{ + /*< private >*/ + GObject parent_instance; +}; + +struct _GArrowChunkedArrayClass +{ + GObjectClass parent_class; +}; + +GType garrow_chunked_array_get_type(void) G_GNUC_CONST; + +GArrowChunkedArray *garrow_chunked_array_new(GList *chunks); + +guint64 garrow_chunked_array_get_length (GArrowChunkedArray *chunked_array); +guint64 garrow_chunked_array_get_n_nulls(GArrowChunkedArray *chunked_array); +guint garrow_chunked_array_get_n_chunks (GArrowChunkedArray *chunked_array); + +GArrowArray *garrow_chunked_array_get_chunk(GArrowChunkedArray *chunked_array, + guint i); +GList *garrow_chunked_array_get_chunks(GArrowChunkedArray *chunked_array); + +G_END_DECLS diff --git a/c_glib/arrow-glib/chunked-array.hpp b/c_glib/arrow-glib/chunked-array.hpp new file mode 100644 index 0000000000000..ec5068adc0741 --- /dev/null +++ b/c_glib/arrow-glib/chunked-array.hpp @@ -0,0 +1,27 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +#include + +GArrowChunkedArray *garrow_chunked_array_new_raw(std::shared_ptr *arrow_chunked_array); +std::shared_ptr garrow_chunked_array_get_raw(GArrowChunkedArray *chunked_array); diff --git a/c_glib/arrow-glib/column.cpp b/c_glib/arrow-glib/column.cpp new file mode 100644 index 0000000000000..94df640d6b2b5 --- /dev/null +++ b/c_glib/arrow-glib/column.cpp @@ -0,0 +1,262 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: column + * @short_description: Column class + * + * #GArrowColumn is a class for column. Column has a #GArrowField and + * zero or more values. Values are #GArrowChunkedArray. 
+ */ + +typedef struct GArrowColumnPrivate_ { + std::shared_ptr column; +} GArrowColumnPrivate; + +enum { + PROP_0, + PROP_COLUMN +}; + +G_DEFINE_TYPE_WITH_PRIVATE(GArrowColumn, + garrow_column, + G_TYPE_OBJECT) + +#define GARROW_COLUMN_GET_PRIVATE(obj) \ + (G_TYPE_INSTANCE_GET_PRIVATE((obj), \ + GARROW_TYPE_COLUMN, \ + GArrowColumnPrivate)) + +static void +garrow_column_dispose(GObject *object) +{ + GArrowColumnPrivate *priv; + + priv = GARROW_COLUMN_GET_PRIVATE(object); + + priv->column = nullptr; + + G_OBJECT_CLASS(garrow_column_parent_class)->dispose(object); +} + +static void +garrow_column_set_property(GObject *object, + guint prop_id, + const GValue *value, + GParamSpec *pspec) +{ + GArrowColumnPrivate *priv; + + priv = GARROW_COLUMN_GET_PRIVATE(object); + + switch (prop_id) { + case PROP_COLUMN: + priv->column = + *static_cast *>(g_value_get_pointer(value)); + break; + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_column_get_property(GObject *object, + guint prop_id, + GValue *value, + GParamSpec *pspec) +{ + switch (prop_id) { + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_column_init(GArrowColumn *object) +{ +} + +static void +garrow_column_class_init(GArrowColumnClass *klass) +{ + GObjectClass *gobject_class; + GParamSpec *spec; + + gobject_class = G_OBJECT_CLASS(klass); + + gobject_class->dispose = garrow_column_dispose; + gobject_class->set_property = garrow_column_set_property; + gobject_class->get_property = garrow_column_get_property; + + spec = g_param_spec_pointer("column", + "Column", + "The raw std::shared *", + static_cast(G_PARAM_WRITABLE | + G_PARAM_CONSTRUCT_ONLY)); + g_object_class_install_property(gobject_class, PROP_COLUMN, spec); +} + +/** + * garrow_column_new_array: + * @field: The metadata of the column. + * @array: The data of the column. + * + * Returns: A newly created #GArrowColumn. + */ +GArrowColumn * +garrow_column_new_array(GArrowField *field, + GArrowArray *array) +{ + auto arrow_column = + std::make_shared(garrow_field_get_raw(field), + garrow_array_get_raw(array)); + return garrow_column_new_raw(&arrow_column); +} + +/** + * garrow_column_new_chunked_array: + * @field: The metadata of the column. + * @chunked_array: The data of the column. + * + * Returns: A newly created #GArrowColumn. + */ +GArrowColumn * +garrow_column_new_chunked_array(GArrowField *field, + GArrowChunkedArray *chunked_array) +{ + auto arrow_column = + std::make_shared(garrow_field_get_raw(field), + garrow_chunked_array_get_raw(chunked_array)); + return garrow_column_new_raw(&arrow_column); +} + +/** + * garrow_column_get_length: + * @column: A #GArrowColumn. + * + * Returns: The number of data of the column. + */ +guint64 +garrow_column_get_length(GArrowColumn *column) +{ + const auto arrow_column = garrow_column_get_raw(column); + return arrow_column->length(); +} + +/** + * garrow_column_get_n_nulls: + * @column: A #GArrowColumn. + * + * Returns: The number of nulls of the column. + */ +guint64 +garrow_column_get_n_nulls(GArrowColumn *column) +{ + const auto arrow_column = garrow_column_get_raw(column); + return arrow_column->null_count(); +} + +/** + * garrow_column_get_field: + * @column: A #GArrowColumn. + * + * Returns: (transfer full): The metadata of the column. 
+ */ +GArrowField * +garrow_column_get_field(GArrowColumn *column) +{ + const auto arrow_column = garrow_column_get_raw(column); + auto arrow_field = arrow_column->field(); + return garrow_field_new_raw(&arrow_field); +} + +/** + * garrow_column_get_name: + * @column: A #GArrowColumn. + * + * Returns: The name of the column. + */ +const gchar * +garrow_column_get_name(GArrowColumn *column) +{ + const auto arrow_column = garrow_column_get_raw(column); + return arrow_column->name().c_str(); +} + +/** + * garrow_column_get_data_type: + * @column: A #GArrowColumn. + * + * Returns: (transfer full): The data type of the column. + */ +GArrowDataType * +garrow_column_get_data_type(GArrowColumn *column) +{ + const auto arrow_column = garrow_column_get_raw(column); + auto arrow_data_type = arrow_column->type(); + return garrow_data_type_new_raw(&arrow_data_type); +} + +/** + * garrow_column_get_data: + * @column: A #GArrowColumn. + * + * Returns: (transfer full): The data of the column. + */ +GArrowChunkedArray * +garrow_column_get_data(GArrowColumn *column) +{ + const auto arrow_column = garrow_column_get_raw(column); + auto arrow_chunked_array = arrow_column->data(); + return garrow_chunked_array_new_raw(&arrow_chunked_array); +} + +G_END_DECLS + +GArrowColumn * +garrow_column_new_raw(std::shared_ptr *arrow_column) +{ + auto column = GARROW_COLUMN(g_object_new(GARROW_TYPE_COLUMN, + "column", arrow_column, + NULL)); + return column; +} + +std::shared_ptr +garrow_column_get_raw(GArrowColumn *column) +{ + GArrowColumnPrivate *priv; + + priv = GARROW_COLUMN_GET_PRIVATE(column); + return priv->column; +} diff --git a/c_glib/arrow-glib/column.h b/c_glib/arrow-glib/column.h new file mode 100644 index 0000000000000..fba3c26b2f08f --- /dev/null +++ b/c_glib/arrow-glib/column.h @@ -0,0 +1,82 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include +#include +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_COLUMN \ + (garrow_column_get_type()) +#define GARROW_COLUMN(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_COLUMN, \ + GArrowColumn)) +#define GARROW_COLUMN_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_COLUMN, \ + GArrowColumnClass)) +#define GARROW_IS_COLUMN(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_COLUMN)) +#define GARROW_IS_COLUMN_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_COLUMN)) +#define GARROW_COLUMN_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_COLUMN, \ + GArrowColumnClass)) + +typedef struct _GArrowColumn GArrowColumn; +typedef struct _GArrowColumnClass GArrowColumnClass; + +/** + * GArrowColumn: + * + * It wraps `arrow::Column`. 
+ */ +struct _GArrowColumn +{ + /*< private >*/ + GObject parent_instance; +}; + +struct _GArrowColumnClass +{ + GObjectClass parent_class; +}; + +GType garrow_column_get_type (void) G_GNUC_CONST; + +GArrowColumn *garrow_column_new_array(GArrowField *field, + GArrowArray *array); +GArrowColumn *garrow_column_new_chunked_array(GArrowField *field, + GArrowChunkedArray *chunked_array); + +guint64 garrow_column_get_length (GArrowColumn *column); +guint64 garrow_column_get_n_nulls (GArrowColumn *column); +GArrowField *garrow_column_get_field (GArrowColumn *column); +const gchar *garrow_column_get_name (GArrowColumn *column); +GArrowDataType *garrow_column_get_data_type (GArrowColumn *column); +GArrowChunkedArray *garrow_column_get_data (GArrowColumn *column); + +G_END_DECLS diff --git a/c_glib/arrow-glib/column.hpp b/c_glib/arrow-glib/column.hpp new file mode 100644 index 0000000000000..4ebb742bb5051 --- /dev/null +++ b/c_glib/arrow-glib/column.hpp @@ -0,0 +1,27 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +#include + +GArrowColumn *garrow_column_new_raw(std::shared_ptr *arrow_column); +std::shared_ptr garrow_column_get_raw(GArrowColumn *column); diff --git a/c_glib/arrow-glib/data-type.cpp b/c_glib/arrow-glib/data-type.cpp new file mode 100644 index 0000000000000..2df9e7a38da91 --- /dev/null +++ b/c_glib/arrow-glib/data-type.cpp @@ -0,0 +1,260 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: data-type + * @short_description: Base class for all data type classes + * + * #GArrowDataType is a base class for all data type classes such as + * #GArrowBooleanDataType. 
+ */ + +typedef struct GArrowDataTypePrivate_ { + std::shared_ptr data_type; +} GArrowDataTypePrivate; + +enum { + PROP_0, + PROP_DATA_TYPE +}; + +G_DEFINE_ABSTRACT_TYPE_WITH_PRIVATE(GArrowDataType, + garrow_data_type, + G_TYPE_OBJECT) + +#define GARROW_DATA_TYPE_GET_PRIVATE(obj) \ + (G_TYPE_INSTANCE_GET_PRIVATE((obj), \ + GARROW_TYPE_DATA_TYPE, \ + GArrowDataTypePrivate)) + +static void +garrow_data_type_finalize(GObject *object) +{ + GArrowDataTypePrivate *priv; + + priv = GARROW_DATA_TYPE_GET_PRIVATE(object); + + priv->data_type = nullptr; + + G_OBJECT_CLASS(garrow_data_type_parent_class)->finalize(object); +} + +static void +garrow_data_type_set_property(GObject *object, + guint prop_id, + const GValue *value, + GParamSpec *pspec) +{ + GArrowDataTypePrivate *priv; + + priv = GARROW_DATA_TYPE_GET_PRIVATE(object); + + switch (prop_id) { + case PROP_DATA_TYPE: + priv->data_type = + *static_cast *>(g_value_get_pointer(value)); + break; + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_data_type_get_property(GObject *object, + guint prop_id, + GValue *value, + GParamSpec *pspec) +{ + switch (prop_id) { + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_data_type_init(GArrowDataType *object) +{ +} + +static void +garrow_data_type_class_init(GArrowDataTypeClass *klass) +{ + GObjectClass *gobject_class; + GParamSpec *spec; + + gobject_class = G_OBJECT_CLASS(klass); + + gobject_class->finalize = garrow_data_type_finalize; + gobject_class->set_property = garrow_data_type_set_property; + gobject_class->get_property = garrow_data_type_get_property; + + spec = g_param_spec_pointer("data-type", + "DataType", + "The raw std::shared *", + static_cast(G_PARAM_WRITABLE | + G_PARAM_CONSTRUCT_ONLY)); + g_object_class_install_property(gobject_class, PROP_DATA_TYPE, spec); +} + +/** + * garrow_data_type_equal: + * @data_type: A #GArrowDataType. + * @other_data_type: A #GArrowDataType. + * + * Returns: Whether they are equal or not. + */ +gboolean +garrow_data_type_equal(GArrowDataType *data_type, + GArrowDataType *other_data_type) +{ + const auto arrow_data_type = garrow_data_type_get_raw(data_type); + const auto arrow_other_data_type = garrow_data_type_get_raw(other_data_type); + return arrow_data_type->Equals(arrow_other_data_type); +} + +/** + * garrow_data_type_to_string: + * @data_type: A #GArrowDataType. + * + * Returns: The string representation of the data type. The caller + * must free it by g_free() when the caller doesn't need it anymore. + */ +gchar * +garrow_data_type_to_string(GArrowDataType *data_type) +{ + const auto arrow_data_type = garrow_data_type_get_raw(data_type); + return g_strdup(arrow_data_type->ToString().c_str()); +} + +/** + * garrow_data_type_type: + * @data_type: A #GArrowDataType. + * + * Returns: The type of the data type. 
+ */ +GArrowType +garrow_data_type_type(GArrowDataType *data_type) +{ + const auto arrow_data_type = garrow_data_type_get_raw(data_type); + return garrow_type_from_raw(arrow_data_type->type); +} + +G_END_DECLS + +GArrowDataType * +garrow_data_type_new_raw(std::shared_ptr *arrow_data_type) +{ + GType type; + GArrowDataType *data_type; + + switch ((*arrow_data_type)->type) { + case arrow::Type::type::NA: + type = GARROW_TYPE_NULL_DATA_TYPE; + break; + case arrow::Type::type::BOOL: + type = GARROW_TYPE_BOOLEAN_DATA_TYPE; + break; + case arrow::Type::type::UINT8: + type = GARROW_TYPE_UINT8_DATA_TYPE; + break; + case arrow::Type::type::INT8: + type = GARROW_TYPE_INT8_DATA_TYPE; + break; + case arrow::Type::type::UINT16: + type = GARROW_TYPE_UINT16_DATA_TYPE; + break; + case arrow::Type::type::INT16: + type = GARROW_TYPE_INT16_DATA_TYPE; + break; + case arrow::Type::type::UINT32: + type = GARROW_TYPE_UINT32_DATA_TYPE; + break; + case arrow::Type::type::INT32: + type = GARROW_TYPE_INT32_DATA_TYPE; + break; + case arrow::Type::type::UINT64: + type = GARROW_TYPE_UINT64_DATA_TYPE; + break; + case arrow::Type::type::INT64: + type = GARROW_TYPE_INT64_DATA_TYPE; + break; + case arrow::Type::type::FLOAT: + type = GARROW_TYPE_FLOAT_DATA_TYPE; + break; + case arrow::Type::type::DOUBLE: + type = GARROW_TYPE_DOUBLE_DATA_TYPE; + break; + case arrow::Type::type::BINARY: + type = GARROW_TYPE_BINARY_DATA_TYPE; + break; + case arrow::Type::type::STRING: + type = GARROW_TYPE_STRING_DATA_TYPE; + break; + case arrow::Type::type::LIST: + type = GARROW_TYPE_LIST_DATA_TYPE; + break; + case arrow::Type::type::STRUCT: + type = GARROW_TYPE_STRUCT_DATA_TYPE; + break; + default: + type = GARROW_TYPE_DATA_TYPE; + break; + } + data_type = GARROW_DATA_TYPE(g_object_new(type, + "data_type", arrow_data_type, + NULL)); + return data_type; +} + +std::shared_ptr +garrow_data_type_get_raw(GArrowDataType *data_type) +{ + GArrowDataTypePrivate *priv; + + priv = GARROW_DATA_TYPE_GET_PRIVATE(data_type); + return priv->data_type; +} diff --git a/c_glib/arrow-glib/data-type.h b/c_glib/arrow-glib/data-type.h new file mode 100644 index 0000000000000..3203d09b5c651 --- /dev/null +++ b/c_glib/arrow-glib/data-type.h @@ -0,0 +1,72 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_DATA_TYPE \ + (garrow_data_type_get_type()) +#define GARROW_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_DATA_TYPE, \ + GArrowDataType)) +#define GARROW_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_DATA_TYPE, \ + GArrowDataTypeClass)) +#define GARROW_IS_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_DATA_TYPE)) +#define GARROW_IS_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_DATA_TYPE)) +#define GARROW_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_DATA_TYPE, \ + GArrowDataTypeClass)) + +typedef struct _GArrowDataType GArrowDataType; +typedef struct _GArrowDataTypeClass GArrowDataTypeClass; + +/** + * GArrowDataType: + * + * It wraps `arrow::DataType`. + */ +struct _GArrowDataType +{ + /*< private >*/ + GObject parent_instance; +}; + +struct _GArrowDataTypeClass +{ + GObjectClass parent_class; +}; + +GType garrow_data_type_get_type (void) G_GNUC_CONST; +gboolean garrow_data_type_equal (GArrowDataType *data_type, + GArrowDataType *other_data_type); +gchar *garrow_data_type_to_string (GArrowDataType *data_type); +GArrowType garrow_data_type_type (GArrowDataType *data_type); + +G_END_DECLS diff --git a/c_glib/arrow-glib/data-type.hpp b/c_glib/arrow-glib/data-type.hpp new file mode 100644 index 0000000000000..fddcb2eb1ac59 --- /dev/null +++ b/c_glib/arrow-glib/data-type.hpp @@ -0,0 +1,27 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +#include + +GArrowDataType *garrow_data_type_new_raw(std::shared_ptr *arrow_data_type); +std::shared_ptr garrow_data_type_get_raw(GArrowDataType *data_type); diff --git a/c_glib/arrow-glib/double-array-builder.cpp b/c_glib/arrow-glib/double-array-builder.cpp new file mode 100644 index 0000000000000..cc44eeabfb686 --- /dev/null +++ b/c_glib/arrow-glib/double-array-builder.cpp @@ -0,0 +1,120 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. 
See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: double-array-builder + * @short_description: 64-bit floating point array builder class + * + * #GArrowDoubleArrayBuilder is the class to create a new + * #GArrowDoubleArray. + */ + +G_DEFINE_TYPE(GArrowDoubleArrayBuilder, + garrow_double_array_builder, + GARROW_TYPE_ARRAY_BUILDER) + +static void +garrow_double_array_builder_init(GArrowDoubleArrayBuilder *builder) +{ +} + +static void +garrow_double_array_builder_class_init(GArrowDoubleArrayBuilderClass *klass) +{ +} + +/** + * garrow_double_array_builder_new: + * + * Returns: A newly created #GArrowDoubleArrayBuilder. + */ +GArrowDoubleArrayBuilder * +garrow_double_array_builder_new(void) +{ + auto memory_pool = arrow::default_memory_pool(); + auto arrow_builder = + std::make_shared(memory_pool, arrow::float64()); + auto builder = + GARROW_DOUBLE_ARRAY_BUILDER(g_object_new(GARROW_TYPE_DOUBLE_ARRAY_BUILDER, + "array-builder", &arrow_builder, + NULL)); + return builder; +} + +/** + * garrow_double_array_builder_append: + * @builder: A #GArrowDoubleArrayBuilder. + * @value: A double value. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_double_array_builder_append(GArrowDoubleArrayBuilder *builder, + gdouble value, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->Append(value); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[double-array-builder][append]"); + return FALSE; + } +} + +/** + * garrow_double_array_builder_append_null: + * @builder: A #GArrowDoubleArrayBuilder. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_double_array_builder_append_null(GArrowDoubleArrayBuilder *builder, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->AppendNull(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[double-array-builder][append-null]"); + return FALSE; + } +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/double-array-builder.h b/c_glib/arrow-glib/double-array-builder.h new file mode 100644 index 0000000000000..5d95c898bc8a7 --- /dev/null +++ b/c_glib/arrow-glib/double-array-builder.h @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */
+
+#pragma once
+
+#include <arrow-glib/array-builder.h>
+
+G_BEGIN_DECLS
+
+#define GARROW_TYPE_DOUBLE_ARRAY_BUILDER \
+  (garrow_double_array_builder_get_type())
+#define GARROW_DOUBLE_ARRAY_BUILDER(obj) \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj), \
+                              GARROW_TYPE_DOUBLE_ARRAY_BUILDER, \
+                              GArrowDoubleArrayBuilder))
+#define GARROW_DOUBLE_ARRAY_BUILDER_CLASS(klass) \
+  (G_TYPE_CHECK_CLASS_CAST((klass), \
+                           GARROW_TYPE_DOUBLE_ARRAY_BUILDER, \
+                           GArrowDoubleArrayBuilderClass))
+#define GARROW_IS_DOUBLE_ARRAY_BUILDER(obj) \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj), \
+                              GARROW_TYPE_DOUBLE_ARRAY_BUILDER))
+#define GARROW_IS_DOUBLE_ARRAY_BUILDER_CLASS(klass) \
+  (G_TYPE_CHECK_CLASS_TYPE((klass), \
+                           GARROW_TYPE_DOUBLE_ARRAY_BUILDER))
+#define GARROW_DOUBLE_ARRAY_BUILDER_GET_CLASS(obj) \
+  (G_TYPE_INSTANCE_GET_CLASS((obj), \
+                             GARROW_TYPE_DOUBLE_ARRAY_BUILDER, \
+                             GArrowDoubleArrayBuilderClass))
+
+typedef struct _GArrowDoubleArrayBuilder GArrowDoubleArrayBuilder;
+typedef struct _GArrowDoubleArrayBuilderClass GArrowDoubleArrayBuilderClass;
+
+/**
+ * GArrowDoubleArrayBuilder:
+ *
+ * It wraps `arrow::DoubleBuilder`.
+ */
+struct _GArrowDoubleArrayBuilder
+{
+  /*< private >*/
+  GArrowArrayBuilder parent_instance;
+};
+
+struct _GArrowDoubleArrayBuilderClass
+{
+  GArrowArrayBuilderClass parent_class;
+};
+
+GType garrow_double_array_builder_get_type(void) G_GNUC_CONST;
+
+GArrowDoubleArrayBuilder *garrow_double_array_builder_new(void);
+
+gboolean garrow_double_array_builder_append(GArrowDoubleArrayBuilder *builder,
+                                            gdouble value,
+                                            GError **error);
+gboolean garrow_double_array_builder_append_null(GArrowDoubleArrayBuilder *builder,
+                                                 GError **error);
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/double-array.cpp b/c_glib/arrow-glib/double-array.cpp
new file mode 100644
index 0000000000000..ecc55d7541689
--- /dev/null
+++ b/c_glib/arrow-glib/double-array.cpp
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+#  include <config.h>
+#endif
+
+#include <arrow-glib/array.hpp>
+#include <arrow-glib/double-array.h>
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: double-array
+ * @short_description: 64-bit floating point array class
+ *
+ * #GArrowDoubleArray is a class for a 64-bit floating point array. It
+ * can store zero or more 64-bit floating point values.
+ *
+ * #GArrowDoubleArray is immutable. You need to use
+ * #GArrowDoubleArrayBuilder to create a new array.
+ */
+
+G_DEFINE_TYPE(GArrowDoubleArray, \
+              garrow_double_array, \
+              GARROW_TYPE_ARRAY)
+
+static void
+garrow_double_array_init(GArrowDoubleArray *object)
+{
+}
+
+static void
+garrow_double_array_class_init(GArrowDoubleArrayClass *klass)
+{
+}
+
+/**
+ * garrow_double_array_get_value:
+ * @array: A #GArrowDoubleArray.
+ * @i: The index of the target value.
+ *
+ * Returns: The i-th value.
+ */
+gdouble
+garrow_double_array_get_value(GArrowDoubleArray *array,
+                              gint64 i)
+{
+  auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array));
+  return static_cast<arrow::DoubleArray *>(arrow_array.get())->Value(i);
+}
+
+G_END_DECLS
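Reading a finished array back is index-based. A small sketch (not patch
content), assuming an already-built array obtained elsewhere and using only
garrow_double_array_get_value() from this file:

    /* Sketch: sum the first n values of an existing array.
     * `array` is assumed valid and at least n elements long. */
    static gdouble
    sum_first_n(GArrowDoubleArray *array, gint64 n)
    {
      gdouble total = 0.0;
      for (gint64 i = 0; i < n; i++) {
        total += garrow_double_array_get_value(array, i);
      }
      return total;
    }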
diff --git a/c_glib/arrow-glib/double-array.h b/c_glib/arrow-glib/double-array.h
new file mode 100644
index 0000000000000..b9a236532e3bf
--- /dev/null
+++ b/c_glib/arrow-glib/double-array.h
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow-glib/array.h>
+
+G_BEGIN_DECLS
+
+#define GARROW_TYPE_DOUBLE_ARRAY \
+  (garrow_double_array_get_type())
+#define GARROW_DOUBLE_ARRAY(obj) \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj), \
+                              GARROW_TYPE_DOUBLE_ARRAY, \
+                              GArrowDoubleArray))
+#define GARROW_DOUBLE_ARRAY_CLASS(klass) \
+  (G_TYPE_CHECK_CLASS_CAST((klass), \
+                           GARROW_TYPE_DOUBLE_ARRAY, \
+                           GArrowDoubleArrayClass))
+#define GARROW_IS_DOUBLE_ARRAY(obj) \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj), \
+                              GARROW_TYPE_DOUBLE_ARRAY))
+#define GARROW_IS_DOUBLE_ARRAY_CLASS(klass) \
+  (G_TYPE_CHECK_CLASS_TYPE((klass), \
+                           GARROW_TYPE_DOUBLE_ARRAY))
+#define GARROW_DOUBLE_ARRAY_GET_CLASS(obj) \
+  (G_TYPE_INSTANCE_GET_CLASS((obj), \
+                             GARROW_TYPE_DOUBLE_ARRAY, \
+                             GArrowDoubleArrayClass))
+
+typedef struct _GArrowDoubleArray GArrowDoubleArray;
+typedef struct _GArrowDoubleArrayClass GArrowDoubleArrayClass;
+
+/**
+ * GArrowDoubleArray:
+ *
+ * It wraps `arrow::DoubleArray`.
+ */
+struct _GArrowDoubleArray
+{
+  /*< private >*/
+  GArrowArray parent_instance;
+};
+
+struct _GArrowDoubleArrayClass
+{
+  GArrowArrayClass parent_class;
+};
+
+GType garrow_double_array_get_type(void) G_GNUC_CONST;
+
+gdouble garrow_double_array_get_value(GArrowDoubleArray *array,
+                                      gint64 i);
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/double-data-type.cpp b/c_glib/arrow-glib/double-data-type.cpp
new file mode 100644
index 0000000000000..c132f97ebe58f
--- /dev/null
+++ b/c_glib/arrow-glib/double-data-type.cpp
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+#  include <config.h>
+#endif
+
+#include <arrow-glib/data-type.hpp>
+#include <arrow-glib/double-data-type.h>
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: double-data-type
+ * @short_description: 64-bit floating point data type
+ *
+ * #GArrowDoubleDataType is a class for 64-bit floating point data
+ * type.
+ */
+
+G_DEFINE_TYPE(GArrowDoubleDataType, \
+              garrow_double_data_type, \
+              GARROW_TYPE_DATA_TYPE)
+
+static void
+garrow_double_data_type_init(GArrowDoubleDataType *object)
+{
+}
+
+static void
+garrow_double_data_type_class_init(GArrowDoubleDataTypeClass *klass)
+{
+}
+
+/**
+ * garrow_double_data_type_new:
+ *
+ * Returns: The newly created 64-bit floating point data type.
+ */
+GArrowDoubleDataType *
+garrow_double_data_type_new(void)
+{
+  auto arrow_data_type = arrow::float64();
+
+  GArrowDoubleDataType *data_type =
+    GARROW_DOUBLE_DATA_TYPE(g_object_new(GARROW_TYPE_DOUBLE_DATA_TYPE,
+                                         "data-type", &arrow_data_type,
+                                         NULL));
+  return data_type;
+}
+
+G_END_DECLS
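Data types compare structurally rather than by object identity, so two
independently created instances are equal. A hedged sketch (not patch content)
using only functions from data-type.h and this file:

    GArrowDoubleDataType *a = garrow_double_data_type_new();
    GArrowDoubleDataType *b = garrow_double_data_type_new();
    /* TRUE: both wrap arrow::float64(). */
    g_assert(garrow_data_type_equal(GARROW_DATA_TYPE(a),
                                    GARROW_DATA_TYPE(b)));
    g_object_unref(a);
    g_object_unref(b);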
diff --git a/c_glib/arrow-glib/double-data-type.h b/c_glib/arrow-glib/double-data-type.h
new file mode 100644
index 0000000000000..ec725cbed3ba2
--- /dev/null
+++ b/c_glib/arrow-glib/double-data-type.h
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow-glib/data-type.h>
+
+G_BEGIN_DECLS
+
+#define GARROW_TYPE_DOUBLE_DATA_TYPE \
+  (garrow_double_data_type_get_type())
+#define GARROW_DOUBLE_DATA_TYPE(obj) \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj), \
+                              GARROW_TYPE_DOUBLE_DATA_TYPE, \
+                              GArrowDoubleDataType))
+#define GARROW_DOUBLE_DATA_TYPE_CLASS(klass) \
+  (G_TYPE_CHECK_CLASS_CAST((klass), \
+                           GARROW_TYPE_DOUBLE_DATA_TYPE, \
+                           GArrowDoubleDataTypeClass))
+#define GARROW_IS_DOUBLE_DATA_TYPE(obj) \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj), \
+                              GARROW_TYPE_DOUBLE_DATA_TYPE))
+#define GARROW_IS_DOUBLE_DATA_TYPE_CLASS(klass) \
+  (G_TYPE_CHECK_CLASS_TYPE((klass), \
+                           GARROW_TYPE_DOUBLE_DATA_TYPE))
+#define GARROW_DOUBLE_DATA_TYPE_GET_CLASS(obj) \
+  (G_TYPE_INSTANCE_GET_CLASS((obj), \
+                             GARROW_TYPE_DOUBLE_DATA_TYPE, \
+                             GArrowDoubleDataTypeClass))
+
+typedef struct _GArrowDoubleDataType GArrowDoubleDataType;
+typedef struct _GArrowDoubleDataTypeClass GArrowDoubleDataTypeClass;
+
+/**
+ * GArrowDoubleDataType:
+ *
+ * It wraps `arrow::DoubleType`.
+ */
+struct _GArrowDoubleDataType
+{
+  /*< private >*/
+  GArrowDataType parent_instance;
+};
+
+struct _GArrowDoubleDataTypeClass
+{
+  GArrowDataTypeClass parent_class;
+};
+
+GType garrow_double_data_type_get_type (void) G_GNUC_CONST;
+
+GArrowDoubleDataType *garrow_double_data_type_new (void);
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/enums.c.template b/c_glib/arrow-glib/enums.c.template
new file mode 100644
index 0000000000000..6becbd565d516
--- /dev/null
+++ b/c_glib/arrow-glib/enums.c.template
@@ -0,0 +1,56 @@
+/*** BEGIN file-header ***/
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+#  include <config.h>
+#endif
+
+#include <arrow-glib/arrow-glib.h>
+/*** END file-header ***/
+
+/*** BEGIN file-production ***/
+
+/* enumerations from "@filename@" */
+/*** END file-production ***/
+
+/*** BEGIN value-header ***/
+GType
+@enum_name@_get_type(void)
+{
+  static GType etype = 0;
+  if (G_UNLIKELY(etype == 0)) {
+    static const G@Type@Value values[] = {
+/*** END value-header ***/
+
+/*** BEGIN value-production ***/
+      {@VALUENAME@, "@VALUENAME@", "@valuenick@"},
+/*** END value-production ***/
+
+/*** BEGIN value-tail ***/
+      {0, NULL, NULL}
+    };
+    etype = g_@type@_register_static(g_intern_static_string("@EnumName@"), values);
+  }
+  return etype;
+}
+/*** END value-tail ***/
+
+/*** BEGIN file-tail ***/
+/*** END file-tail ***/
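glib-mkenums fills the @…@ placeholders once per enumeration. For the
GArrowError enum declared in error.h further below, the generated registration
function would look roughly like this (a sketch of the generated output derived
from the template above, not patch content):

    GType
    garrow_error_get_type(void)
    {
      static GType etype = 0;
      if (G_UNLIKELY(etype == 0)) {
        static const GEnumValue values[] = {
          {GARROW_ERROR_OUT_OF_MEMORY, "GARROW_ERROR_OUT_OF_MEMORY", "out-of-memory"},
          /* ...one entry per enum value... */
          {0, NULL, NULL}
        };
        etype = g_enum_register_static(g_intern_static_string("GArrowError"),
                                       values);
      }
      return etype;
    }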
diff --git a/c_glib/arrow-glib/enums.h.template b/c_glib/arrow-glib/enums.h.template
new file mode 100644
index 0000000000000..3509ed2e90db4
--- /dev/null
+++ b/c_glib/arrow-glib/enums.h.template
@@ -0,0 +1,41 @@
+/*** BEGIN file-header ***/
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <glib-object.h>
+
+G_BEGIN_DECLS
+/*** END file-header ***/
+
+/*** BEGIN file-production ***/
+
+/* enumerations from "@filename@" */
+/*** END file-production ***/
+
+/*** BEGIN value-header ***/
+GType @enum_name@_get_type(void) G_GNUC_CONST;
+#define @ENUMPREFIX@_TYPE_@ENUMSHORT@ (@enum_name@_get_type())
+/*** END value-header ***/
+
+/*** BEGIN file-tail ***/
+
+G_END_DECLS
+/*** END file-tail ***/
diff --git a/c_glib/arrow-glib/error.cpp b/c_glib/arrow-glib/error.cpp
new file mode 100644
index 0000000000000..efbc6ae60452a
--- /dev/null
+++ b/c_glib/arrow-glib/error.cpp
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+#  include <config.h>
+#endif
+
+#include <arrow-glib/error.hpp>
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: error
+ * @title: GArrowError
+ * @short_description: Error code mapping between Arrow and arrow-glib
+ *
+ * #GArrowError provides error codes corresponding to `arrow::Status`
+ * values.
+ */
+
+G_DEFINE_QUARK(garrow-error-quark, garrow_error)
+
+static GArrowError
+garrow_error_code(const arrow::Status &status)
+{
+  switch (status.code()) {
+  case arrow::StatusCode::OK:
+    return GARROW_ERROR_UNKNOWN;
+  case arrow::StatusCode::OutOfMemory:
+    return GARROW_ERROR_OUT_OF_MEMORY;
+  case arrow::StatusCode::KeyError:
+    return GARROW_ERROR_KEY;
+  case arrow::StatusCode::TypeError:
+    return GARROW_ERROR_TYPE;
+  case arrow::StatusCode::Invalid:
+    return GARROW_ERROR_INVALID;
+  case arrow::StatusCode::IOError:
+    return GARROW_ERROR_IO;
+  case arrow::StatusCode::UnknownError:
+    return GARROW_ERROR_UNKNOWN;
+  case arrow::StatusCode::NotImplemented:
+    return GARROW_ERROR_NOT_IMPLEMENTED;
+  default:
+    return GARROW_ERROR_UNKNOWN;
+  }
+}
+
+G_END_DECLS
+
+void
+garrow_error_set(GError **error,
+                 const arrow::Status &status,
+                 const char *context)
+{
+  if (status.ok()) {
+    return;
+  }
+
+  g_set_error(error,
+              GARROW_ERROR,
+              garrow_error_code(status),
+              "%s: %s",
+              context,
+              status.ToString().c_str());
+}
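On the caller side, every fallible arrow-glib function follows the same
pattern: it returns %FALSE and fills the out #GError, whose domain is
GARROW_ERROR and whose code is one of the GArrowError values defined in
error.h below. A hedged sketch (not patch content; the `builder` variable is
assumed to exist):

    GError *error = NULL;
    if (!garrow_double_array_builder_append(builder, 3.14, &error)) {
      if (g_error_matches(error, GARROW_ERROR, GARROW_ERROR_OUT_OF_MEMORY)) {
        /* allocation failed inside Arrow */
      }
      g_printerr("append failed: %s\n", error->message);
      g_clear_error(&error);
    }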
diff --git a/c_glib/arrow-glib/error.h b/c_glib/arrow-glib/error.h
new file mode 100644
index 0000000000000..b4a4fac39cd73
--- /dev/null
+++ b/c_glib/arrow-glib/error.h
@@ -0,0 +1,54 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <glib.h>
+
+G_BEGIN_DECLS
+
+/**
+ * GArrowError:
+ * @GARROW_ERROR_OUT_OF_MEMORY: Out of memory error.
+ * @GARROW_ERROR_KEY: Key error.
+ * @GARROW_ERROR_TYPE: Type error.
+ * @GARROW_ERROR_INVALID: Invalid value error.
+ * @GARROW_ERROR_IO: IO error.
+ * @GARROW_ERROR_UNKNOWN: Unknown error.
+ * @GARROW_ERROR_NOT_IMPLEMENTED: The feature is not implemented.
+ *
+ * The error codes are used by all arrow-glib functions.
+ *
+ * They correspond to `arrow::Status` values.
+ */
+typedef enum {
+  GARROW_ERROR_OUT_OF_MEMORY = 1,
+  GARROW_ERROR_KEY,
+  GARROW_ERROR_TYPE,
+  GARROW_ERROR_INVALID,
+  GARROW_ERROR_IO,
+  GARROW_ERROR_UNKNOWN = 9,
+  GARROW_ERROR_NOT_IMPLEMENTED = 10
+} GArrowError;
+
+#define GARROW_ERROR garrow_error_quark()
+
+GQuark garrow_error_quark(void);
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/error.hpp b/c_glib/arrow-glib/error.hpp
new file mode 100644
index 0000000000000..357d293c4f127
--- /dev/null
+++ b/c_glib/arrow-glib/error.hpp
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow/api.h>
+
+#include <arrow-glib/error.h>
+
+void garrow_error_set(GError **error,
+                      const arrow::Status &status,
+                      const char *context);
diff --git a/c_glib/arrow-glib/field.cpp b/c_glib/arrow-glib/field.cpp
new file mode 100644
index 0000000000000..0dcaf0a009a6d
--- /dev/null
+++ b/c_glib/arrow-glib/field.cpp
@@ -0,0 +1,250 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+#  include <config.h>
+#endif
+
+#include <arrow-glib/data-type.hpp>
+#include <arrow-glib/field.hpp>
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: field
+ * @short_description: Field class
+ *
+ * #GArrowField is a class for a field. A field is the metadata of a
+ * column: it holds the column's name, data type (#GArrowDataType) and
+ * nullability.
+ */
+
+typedef struct GArrowFieldPrivate_ {
+  std::shared_ptr<arrow::Field> field;
+} GArrowFieldPrivate;
+
+enum {
+  PROP_0,
+  PROP_FIELD
+};
+
+G_DEFINE_TYPE_WITH_PRIVATE(GArrowField,
+                           garrow_field,
+                           G_TYPE_OBJECT)
+
+#define GARROW_FIELD_GET_PRIVATE(obj) \
+  (G_TYPE_INSTANCE_GET_PRIVATE((obj), \
+                               GARROW_TYPE_FIELD, \
+                               GArrowFieldPrivate))
+
+static void
+garrow_field_finalize(GObject *object)
+{
+  GArrowFieldPrivate *priv;
+
+  priv = GARROW_FIELD_GET_PRIVATE(object);
+
+  priv->field = nullptr;
+
+  G_OBJECT_CLASS(garrow_field_parent_class)->finalize(object);
+}
+
+static void
+garrow_field_set_property(GObject *object,
+                          guint prop_id,
+                          const GValue *value,
+                          GParamSpec *pspec)
+{
+  GArrowFieldPrivate *priv;
+
+  priv = GARROW_FIELD_GET_PRIVATE(object);
+
+  switch (prop_id) {
+  case PROP_FIELD:
+    priv->field =
+      *static_cast<std::shared_ptr<arrow::Field> *>(g_value_get_pointer(value));
+    break;
+  default:
+    G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec);
+    break;
+  }
+}
+
+static void
+garrow_field_get_property(GObject *object,
+                          guint prop_id,
+                          GValue *value,
+                          GParamSpec *pspec)
+{
+  switch (prop_id) {
+  default:
+    G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec);
+    break;
+  }
+}
+
+static void
+garrow_field_init(GArrowField *object)
+{
+}
+
+static void
+garrow_field_class_init(GArrowFieldClass *klass)
+{
+  GObjectClass *gobject_class;
+  GParamSpec *spec;
+
+  gobject_class = G_OBJECT_CLASS(klass);
+
+  gobject_class->finalize     = garrow_field_finalize;
+  gobject_class->set_property = garrow_field_set_property;
+  gobject_class->get_property = garrow_field_get_property;
+
+  spec = g_param_spec_pointer("field",
+                              "Field",
+                              "The raw std::shared_ptr<arrow::Field> *",
+                              static_cast<GParamFlags>(G_PARAM_WRITABLE |
+                                                       G_PARAM_CONSTRUCT_ONLY));
+  g_object_class_install_property(gobject_class, PROP_FIELD, spec);
+}
+
+/**
+ * garrow_field_new:
+ * @name: The name of the field.
+ * @data_type: The data type of the field.
+ *
+ * Returns: A newly created #GArrowField.
+ */
+GArrowField *
+garrow_field_new(const gchar *name,
+                 GArrowDataType *data_type)
+{
+  auto arrow_field =
+    std::make_shared<arrow::Field>(name,
+                                   garrow_data_type_get_raw(data_type));
+  return garrow_field_new_raw(&arrow_field);
+}
+
+/**
+ * garrow_field_new_full:
+ * @name: The name of the field.
+ * @data_type: The data type of the field.
+ * @nullable: Whether null may be included or not.
+ *
+ * Returns: A newly created #GArrowField.
+ */
+GArrowField *
+garrow_field_new_full(const gchar *name,
+                      GArrowDataType *data_type,
+                      gboolean nullable)
+{
+  auto arrow_field =
+    std::make_shared<arrow::Field>(name,
+                                   garrow_data_type_get_raw(data_type),
+                                   nullable);
+  return garrow_field_new_raw(&arrow_field);
+}
+
+/**
+ * garrow_field_get_name:
+ * @field: A #GArrowField.
+ *
+ * Returns: The name of the field.
+ */
+const gchar *
+garrow_field_get_name(GArrowField *field)
+{
+  const auto arrow_field = garrow_field_get_raw(field);
+  return arrow_field->name.c_str();
+}
+
+/**
+ * garrow_field_get_data_type:
+ * @field: A #GArrowField.
+ *
+ * Returns: (transfer full): The data type of the field.
+ */
+GArrowDataType *
+garrow_field_get_data_type(GArrowField *field)
+{
+  const auto arrow_field = garrow_field_get_raw(field);
+  return garrow_data_type_new_raw(&arrow_field->type);
+}
+
+/**
+ * garrow_field_is_nullable:
+ * @field: A #GArrowField.
+ *
+ * Returns: Whether the field may include null or not.
+ */
+gboolean
+garrow_field_is_nullable(GArrowField *field)
+{
+  const auto arrow_field = garrow_field_get_raw(field);
+  return arrow_field->nullable;
+}
+
+/**
+ * garrow_field_equal:
+ * @field: A #GArrowField.
+ * @other_field: A #GArrowField.
+ *
+ * Returns: Whether they are equal or not.
+ */
+gboolean
+garrow_field_equal(GArrowField *field,
+                   GArrowField *other_field)
+{
+  const auto arrow_field = garrow_field_get_raw(field);
+  const auto arrow_other_field = garrow_field_get_raw(other_field);
+  return arrow_field->Equals(arrow_other_field);
+}
+
+/**
+ * garrow_field_to_string:
+ * @field: A #GArrowField.
+ *
+ * Returns: The string representation of the field.
+ */
+gchar *
+garrow_field_to_string(GArrowField *field)
+{
+  const auto arrow_field = garrow_field_get_raw(field);
+  return g_strdup(arrow_field->ToString().c_str());
+}
+
+G_END_DECLS
+
+GArrowField *
+garrow_field_new_raw(std::shared_ptr<arrow::Field> *arrow_field)
+{
+  auto field = GARROW_FIELD(g_object_new(GARROW_TYPE_FIELD,
+                                         "field", arrow_field,
+                                         NULL));
+  return field;
+}
+
+std::shared_ptr<arrow::Field>
+garrow_field_get_raw(GArrowField *field)
+{
+  GArrowFieldPrivate *priv;
+
+  priv = GARROW_FIELD_GET_PRIVATE(field);
+  return priv->field;
+}
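Putting the pieces together, a field describing a nullable float64 column can
be built from the public API above. A hedged sketch (not patch content; the
printed form follows arrow::Field::ToString(), so the exact text may differ):

    GArrowDataType *data_type =
      GARROW_DATA_TYPE(garrow_double_data_type_new());
    GArrowField *field = garrow_field_new("price", data_type);

    gchar *description = garrow_field_to_string(field);
    g_print("%s\n", description); /* something like "price: double" */
    g_free(description);

    g_object_unref(field);
    g_object_unref(data_type);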
diff --git a/c_glib/arrow-glib/field.h b/c_glib/arrow-glib/field.h
new file mode 100644
index 0000000000000..e724dce49da5c
--- /dev/null
+++ b/c_glib/arrow-glib/field.h
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow-glib/data-type.h>
+
+G_BEGIN_DECLS
+
+#define GARROW_TYPE_FIELD \
+  (garrow_field_get_type())
+#define GARROW_FIELD(obj) \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj), \
+                              GARROW_TYPE_FIELD, \
+                              GArrowField))
+#define GARROW_FIELD_CLASS(klass) \
+  (G_TYPE_CHECK_CLASS_CAST((klass), \
+                           GARROW_TYPE_FIELD, \
+                           GArrowFieldClass))
+#define GARROW_IS_FIELD(obj) \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj), \
+                              GARROW_TYPE_FIELD))
+#define GARROW_IS_FIELD_CLASS(klass) \
+  (G_TYPE_CHECK_CLASS_TYPE((klass), \
+                           GARROW_TYPE_FIELD))
+#define GARROW_FIELD_GET_CLASS(obj) \
+  (G_TYPE_INSTANCE_GET_CLASS((obj), \
+                             GARROW_TYPE_FIELD, \
+                             GArrowFieldClass))
+
+typedef struct _GArrowField GArrowField;
+typedef struct _GArrowFieldClass GArrowFieldClass;
+
+/**
+ * GArrowField:
+ *
+ * It wraps `arrow::Field`.
+ */
+struct _GArrowField
+{
+  /*< private >*/
+  GObject parent_instance;
+};
+
+struct _GArrowFieldClass
+{
+  GObjectClass parent_class;
+};
+
+GType garrow_field_get_type (void) G_GNUC_CONST;
+
+GArrowField *garrow_field_new (const gchar *name,
+                               GArrowDataType *data_type);
+GArrowField *garrow_field_new_full (const gchar *name,
+                                    GArrowDataType *data_type,
+                                    gboolean nullable);
+
+const gchar *garrow_field_get_name (GArrowField *field);
+GArrowDataType *garrow_field_get_data_type (GArrowField *field);
+gboolean garrow_field_is_nullable (GArrowField *field);
+
+gboolean garrow_field_equal (GArrowField *field,
+                             GArrowField *other_field);
+
+gchar *garrow_field_to_string (GArrowField *field);
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/field.hpp b/c_glib/arrow-glib/field.hpp
new file mode 100644
index 0000000000000..e130ad5992409
--- /dev/null
+++ b/c_glib/arrow-glib/field.hpp
@@ -0,0 +1,27 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow/api.h>
+
+#include <arrow-glib/field.h>
+
+GArrowField *garrow_field_new_raw(std::shared_ptr<arrow::Field> *arrow_field);
+std::shared_ptr<arrow::Field> garrow_field_get_raw(GArrowField *field);
diff --git a/c_glib/arrow-glib/float-array-builder.cpp b/c_glib/arrow-glib/float-array-builder.cpp
new file mode 100644
index 0000000000000..77a9a0bb75a05
--- /dev/null
+++ b/c_glib/arrow-glib/float-array-builder.cpp
@@ -0,0 +1,120 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+#  include <config.h>
+#endif
+
+#include <arrow-glib/array-builder.hpp>
+#include <arrow-glib/float-array-builder.h>
+#include <arrow-glib/error.hpp>
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: float-array-builder
+ * @short_description: 32-bit floating point array builder class
+ *
+ * #GArrowFloatArrayBuilder is the class to create a new
+ * #GArrowFloatArray.
+ */
+
+G_DEFINE_TYPE(GArrowFloatArrayBuilder,
+              garrow_float_array_builder,
+              GARROW_TYPE_ARRAY_BUILDER)
+
+static void
+garrow_float_array_builder_init(GArrowFloatArrayBuilder *builder)
+{
+}
+
+static void
+garrow_float_array_builder_class_init(GArrowFloatArrayBuilderClass *klass)
+{
+}
+
+/**
+ * garrow_float_array_builder_new:
+ *
+ * Returns: A newly created #GArrowFloatArrayBuilder.
+ */
+GArrowFloatArrayBuilder *
+garrow_float_array_builder_new(void)
+{
+  auto memory_pool = arrow::default_memory_pool();
+  auto arrow_builder =
+    std::make_shared<arrow::FloatBuilder>(memory_pool, arrow::float32());
+  auto builder =
+    GARROW_FLOAT_ARRAY_BUILDER(g_object_new(GARROW_TYPE_FLOAT_ARRAY_BUILDER,
+                                            "array-builder", &arrow_builder,
+                                            NULL));
+  return builder;
+}
+
+/**
+ * garrow_float_array_builder_append:
+ * @builder: A #GArrowFloatArrayBuilder.
+ * @value: A float value.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ */
+gboolean
+garrow_float_array_builder_append(GArrowFloatArrayBuilder *builder,
+                                  gfloat value,
+                                  GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::FloatBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->Append(value);
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[float-array-builder][append]");
+    return FALSE;
+  }
+}
+
+/**
+ * garrow_float_array_builder_append_null:
+ * @builder: A #GArrowFloatArrayBuilder.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ */
+gboolean
+garrow_float_array_builder_append_null(GArrowFloatArrayBuilder *builder,
+                                       GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::FloatBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->AppendNull();
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[float-array-builder][append-null]");
+    return FALSE;
+  }
+}
+
+G_END_DECLS
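The float variant is a line-for-line mirror of the double builder; only the
element type and the wrapped arrow::FloatBuilder change. A correspondingly
small hedged sketch (not patch content):

    GError *error = NULL;
    GArrowFloatArrayBuilder *builder = garrow_float_array_builder_new();
    if (!garrow_float_array_builder_append(builder, 0.5f, &error)) {
      g_printerr("[float] %s\n", error->message);
      g_clear_error(&error);
    }
    g_object_unref(builder);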
diff --git a/c_glib/arrow-glib/float-array-builder.h b/c_glib/arrow-glib/float-array-builder.h
new file mode 100644
index 0000000000000..003900313cca4
--- /dev/null
+++ b/c_glib/arrow-glib/float-array-builder.h
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow-glib/array-builder.h>
+
+G_BEGIN_DECLS
+
+#define GARROW_TYPE_FLOAT_ARRAY_BUILDER \
+  (garrow_float_array_builder_get_type())
+#define GARROW_FLOAT_ARRAY_BUILDER(obj) \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj), \
+                              GARROW_TYPE_FLOAT_ARRAY_BUILDER, \
+                              GArrowFloatArrayBuilder))
+#define GARROW_FLOAT_ARRAY_BUILDER_CLASS(klass) \
+  (G_TYPE_CHECK_CLASS_CAST((klass), \
+                           GARROW_TYPE_FLOAT_ARRAY_BUILDER, \
+                           GArrowFloatArrayBuilderClass))
+#define GARROW_IS_FLOAT_ARRAY_BUILDER(obj) \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj), \
+                              GARROW_TYPE_FLOAT_ARRAY_BUILDER))
+#define GARROW_IS_FLOAT_ARRAY_BUILDER_CLASS(klass) \
+  (G_TYPE_CHECK_CLASS_TYPE((klass), \
+                           GARROW_TYPE_FLOAT_ARRAY_BUILDER))
+#define GARROW_FLOAT_ARRAY_BUILDER_GET_CLASS(obj) \
+  (G_TYPE_INSTANCE_GET_CLASS((obj), \
+                             GARROW_TYPE_FLOAT_ARRAY_BUILDER, \
+                             GArrowFloatArrayBuilderClass))
+
+typedef struct _GArrowFloatArrayBuilder GArrowFloatArrayBuilder;
+typedef struct _GArrowFloatArrayBuilderClass GArrowFloatArrayBuilderClass;
+
+/**
+ * GArrowFloatArrayBuilder:
+ *
+ * It wraps `arrow::FloatBuilder`.
+ */
+struct _GArrowFloatArrayBuilder
+{
+  /*< private >*/
+  GArrowArrayBuilder parent_instance;
+};
+
+struct _GArrowFloatArrayBuilderClass
+{
+  GArrowArrayBuilderClass parent_class;
+};
+
+GType garrow_float_array_builder_get_type(void) G_GNUC_CONST;
+
+GArrowFloatArrayBuilder *garrow_float_array_builder_new(void);
+
+gboolean garrow_float_array_builder_append(GArrowFloatArrayBuilder *builder,
+                                           gfloat value,
+                                           GError **error);
+gboolean garrow_float_array_builder_append_null(GArrowFloatArrayBuilder *builder,
+                                                GError **error);
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/float-array.cpp b/c_glib/arrow-glib/float-array.cpp
new file mode 100644
index 0000000000000..28e8047652f7e
--- /dev/null
+++ b/c_glib/arrow-glib/float-array.cpp
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+#  include <config.h>
+#endif
+
+#include <arrow-glib/array.hpp>
+#include <arrow-glib/float-array.h>
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: float-array
+ * @short_description: 32-bit floating point array class
+ *
+ * #GArrowFloatArray is a class for a 32-bit floating point array. It
+ * can store zero or more 32-bit floating point values.
+ *
+ * #GArrowFloatArray is immutable. You need to use
+ * #GArrowFloatArrayBuilder to create a new array.
+ */
+
+G_DEFINE_TYPE(GArrowFloatArray, \
+              garrow_float_array, \
+              GARROW_TYPE_ARRAY)
+
+static void
+garrow_float_array_init(GArrowFloatArray *object)
+{
+}
+
+static void
+garrow_float_array_class_init(GArrowFloatArrayClass *klass)
+{
+}
+
+/**
+ * garrow_float_array_get_value:
+ * @array: A #GArrowFloatArray.
+ * @i: The index of the target value.
+ *
+ * Returns: The i-th value.
+ */ +gfloat +garrow_float_array_get_value(GArrowFloatArray *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + return static_cast(arrow_array.get())->Value(i); +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/float-array.h b/c_glib/arrow-glib/float-array.h new file mode 100644 index 0000000000000..d113f9757a511 --- /dev/null +++ b/c_glib/arrow-glib/float-array.h @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_FLOAT_ARRAY \ + (garrow_float_array_get_type()) +#define GARROW_FLOAT_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_FLOAT_ARRAY, \ + GArrowFloatArray)) +#define GARROW_FLOAT_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_FLOAT_ARRAY, \ + GArrowFloatArrayClass)) +#define GARROW_IS_FLOAT_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_FLOAT_ARRAY)) +#define GARROW_IS_FLOAT_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_FLOAT_ARRAY)) +#define GARROW_FLOAT_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_FLOAT_ARRAY, \ + GArrowFloatArrayClass)) + +typedef struct _GArrowFloatArray GArrowFloatArray; +typedef struct _GArrowFloatArrayClass GArrowFloatArrayClass; + +/** + * GArrowFloatArray: + * + * It wraps `arrow::FloatArray`. + */ +struct _GArrowFloatArray +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowFloatArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_float_array_get_type(void) G_GNUC_CONST; + +gfloat garrow_float_array_get_value(GArrowFloatArray *array, + gint64 i); + +G_END_DECLS diff --git a/c_glib/arrow-glib/float-data-type.cpp b/c_glib/arrow-glib/float-data-type.cpp new file mode 100644 index 0000000000000..ce7f28acfcb45 --- /dev/null +++ b/c_glib/arrow-glib/float-data-type.cpp @@ -0,0 +1,68 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: float-data-type + * @short_description: 32-bit floating point data type + * + * #GArrowFloatDataType is a class for 32-bit floating point data + * type. + */ + +G_DEFINE_TYPE(GArrowFloatDataType, \ + garrow_float_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_float_data_type_init(GArrowFloatDataType *object) +{ +} + +static void +garrow_float_data_type_class_init(GArrowFloatDataTypeClass *klass) +{ +} + +/** + * garrow_float_data_type_new: + * + * Returns: The newly created float data type. + */ +GArrowFloatDataType * +garrow_float_data_type_new(void) +{ + auto arrow_data_type = arrow::float32(); + + GArrowFloatDataType *data_type = + GARROW_FLOAT_DATA_TYPE(g_object_new(GARROW_TYPE_FLOAT_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/float-data-type.h b/c_glib/arrow-glib/float-data-type.h new file mode 100644 index 0000000000000..dcb6c2ab13d25 --- /dev/null +++ b/c_glib/arrow-glib/float-data-type.h @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_FLOAT_DATA_TYPE \ + (garrow_float_data_type_get_type()) +#define GARROW_FLOAT_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_FLOAT_DATA_TYPE, \ + GArrowFloatDataType)) +#define GARROW_FLOAT_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_FLOAT_DATA_TYPE, \ + GArrowFloatDataTypeClass)) +#define GARROW_IS_FLOAT_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_FLOAT_DATA_TYPE)) +#define GARROW_IS_FLOAT_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_FLOAT_DATA_TYPE)) +#define GARROW_FLOAT_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_FLOAT_DATA_TYPE, \ + GArrowFloatDataTypeClass)) + +typedef struct _GArrowFloatDataType GArrowFloatDataType; +typedef struct _GArrowFloatDataTypeClass GArrowFloatDataTypeClass; + +/** + * GArrowFloatDataType: + * + * It wraps `arrow::FloatType`. 
+ */ +struct _GArrowFloatDataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowFloatDataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_float_data_type_get_type (void) G_GNUC_CONST; +GArrowFloatDataType *garrow_float_data_type_new (void); + +G_END_DECLS diff --git a/c_glib/arrow-glib/int16-array-builder.cpp b/c_glib/arrow-glib/int16-array-builder.cpp new file mode 100644 index 0000000000000..fbf18ef1e6ce7 --- /dev/null +++ b/c_glib/arrow-glib/int16-array-builder.cpp @@ -0,0 +1,120 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: int16-array-builder + * @short_description: 16-bit integer array builder class + * + * #GArrowInt16ArrayBuilder is the class to create a new + * #GArrowInt16Array. + */ + +G_DEFINE_TYPE(GArrowInt16ArrayBuilder, + garrow_int16_array_builder, + GARROW_TYPE_ARRAY_BUILDER) + +static void +garrow_int16_array_builder_init(GArrowInt16ArrayBuilder *builder) +{ +} + +static void +garrow_int16_array_builder_class_init(GArrowInt16ArrayBuilderClass *klass) +{ +} + +/** + * garrow_int16_array_builder_new: + * + * Returns: A newly created #GArrowInt16ArrayBuilder. + */ +GArrowInt16ArrayBuilder * +garrow_int16_array_builder_new(void) +{ + auto memory_pool = arrow::default_memory_pool(); + auto arrow_builder = + std::make_shared(memory_pool, arrow::int16()); + auto builder = + GARROW_INT16_ARRAY_BUILDER(g_object_new(GARROW_TYPE_INT16_ARRAY_BUILDER, + "array-builder", &arrow_builder, + NULL)); + return builder; +} + +/** + * garrow_int16_array_builder_append: + * @builder: A #GArrowInt16ArrayBuilder. + * @value: A int16 value. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_int16_array_builder_append(GArrowInt16ArrayBuilder *builder, + gint16 value, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->Append(value); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[int16-array-builder][append]"); + return FALSE; + } +} + +/** + * garrow_int16_array_builder_append_null: + * @builder: A #GArrowInt16ArrayBuilder. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. 
+ */ +gboolean +garrow_int16_array_builder_append_null(GArrowInt16ArrayBuilder *builder, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->AppendNull(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[int16-array-builder][append-null]"); + return FALSE; + } +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/int16-array-builder.h b/c_glib/arrow-glib/int16-array-builder.h new file mode 100644 index 0000000000000..f222cfdccc9b7 --- /dev/null +++ b/c_glib/arrow-glib/int16-array-builder.h @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_INT16_ARRAY_BUILDER \ + (garrow_int16_array_builder_get_type()) +#define GARROW_INT16_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INT16_ARRAY_BUILDER, \ + GArrowInt16ArrayBuilder)) +#define GARROW_INT16_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INT16_ARRAY_BUILDER, \ + GArrowInt16ArrayBuilderClass)) +#define GARROW_IS_INT16_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_INT16_ARRAY_BUILDER)) +#define GARROW_IS_INT16_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INT16_ARRAY_BUILDER)) +#define GARROW_INT16_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INT16_ARRAY_BUILDER, \ + GArrowInt16ArrayBuilderClass)) + +typedef struct _GArrowInt16ArrayBuilder GArrowInt16ArrayBuilder; +typedef struct _GArrowInt16ArrayBuilderClass GArrowInt16ArrayBuilderClass; + +/** + * GArrowInt16ArrayBuilder: + * + * It wraps `arrow::Int16Builder`. + */ +struct _GArrowInt16ArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowInt16ArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_int16_array_builder_get_type(void) G_GNUC_CONST; + +GArrowInt16ArrayBuilder *garrow_int16_array_builder_new(void); + +gboolean garrow_int16_array_builder_append(GArrowInt16ArrayBuilder *builder, + gint16 value, + GError **error); +gboolean garrow_int16_array_builder_append_null(GArrowInt16ArrayBuilder *builder, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/int16-array.cpp b/c_glib/arrow-glib/int16-array.cpp new file mode 100644 index 0000000000000..456d085a3449a --- /dev/null +++ b/c_glib/arrow-glib/int16-array.cpp @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: int16-array + * @short_description: 16-bit integer array class + * + * #GArrowInt16Array is a class for 16-bit integer array. It can store + * zero or more 16-bit integer data. + * + * #GArrowInt16Array is immutable. You need to use + * #GArrowInt16ArrayBuilder to create a new array. + */ + +G_DEFINE_TYPE(GArrowInt16Array, \ + garrow_int16_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_int16_array_init(GArrowInt16Array *object) +{ +} + +static void +garrow_int16_array_class_init(GArrowInt16ArrayClass *klass) +{ +} + +/** + * garrow_int16_array_get_value: + * @array: A #GArrowInt16Array. + * @i: The index of the target value. + * + * Returns: The i-th value. + */ +gint16 +garrow_int16_array_get_value(GArrowInt16Array *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + return static_cast(arrow_array.get())->Value(i); +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/int16-array.h b/c_glib/arrow-glib/int16-array.h new file mode 100644 index 0000000000000..d37144cef51f2 --- /dev/null +++ b/c_glib/arrow-glib/int16-array.h @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_INT16_ARRAY \ + (garrow_int16_array_get_type()) +#define GARROW_INT16_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INT16_ARRAY, \ + GArrowInt16Array)) +#define GARROW_INT16_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INT16_ARRAY, \ + GArrowInt16ArrayClass)) +#define GARROW_IS_INT16_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_INT16_ARRAY)) +#define GARROW_IS_INT16_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INT16_ARRAY)) +#define GARROW_INT16_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INT16_ARRAY, \ + GArrowInt16ArrayClass)) + +typedef struct _GArrowInt16Array GArrowInt16Array; +typedef struct _GArrowInt16ArrayClass GArrowInt16ArrayClass; + +/** + * GArrowInt16Array: + * + * It wraps `arrow::Int16Array`. 
+ */ +struct _GArrowInt16Array +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowInt16ArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_int16_array_get_type(void) G_GNUC_CONST; + +gint16 garrow_int16_array_get_value(GArrowInt16Array *array, + gint64 i); + +G_END_DECLS diff --git a/c_glib/arrow-glib/int16-data-type.cpp b/c_glib/arrow-glib/int16-data-type.cpp new file mode 100644 index 0000000000000..45e109e1759dc --- /dev/null +++ b/c_glib/arrow-glib/int16-data-type.cpp @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: int16-data-type + * @short_description: 16-bit integer data type + * + * #GArrowInt16DataType is a class for 16-bit integer data type. + */ + +G_DEFINE_TYPE(GArrowInt16DataType, \ + garrow_int16_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_int16_data_type_init(GArrowInt16DataType *object) +{ +} + +static void +garrow_int16_data_type_class_init(GArrowInt16DataTypeClass *klass) +{ +} + +/** + * garrow_int16_data_type_new: + * + * Returns: The newly created 16-bit integer data type. + */ +GArrowInt16DataType * +garrow_int16_data_type_new(void) +{ + auto arrow_data_type = arrow::int16(); + + GArrowInt16DataType *data_type = + GARROW_INT16_DATA_TYPE(g_object_new(GARROW_TYPE_INT16_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/int16-data-type.h b/c_glib/arrow-glib/int16-data-type.h new file mode 100644 index 0000000000000..eaa199c4fc7f8 --- /dev/null +++ b/c_glib/arrow-glib/int16-data-type.h @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_INT16_DATA_TYPE \ + (garrow_int16_data_type_get_type()) +#define GARROW_INT16_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INT16_DATA_TYPE, \ + GArrowInt16DataType)) +#define GARROW_INT16_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INT16_DATA_TYPE, \ + GArrowInt16DataTypeClass)) +#define GARROW_IS_INT16_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_INT16_DATA_TYPE)) +#define GARROW_IS_INT16_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INT16_DATA_TYPE)) +#define GARROW_INT16_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INT16_DATA_TYPE, \ + GArrowInt16DataTypeClass)) + +typedef struct _GArrowInt16DataType GArrowInt16DataType; +typedef struct _GArrowInt16DataTypeClass GArrowInt16DataTypeClass; + +/** + * GArrowInt16DataType: + * + * It wraps `arrow::Int16Type`. + */ +struct _GArrowInt16DataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowInt16DataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_int16_data_type_get_type (void) G_GNUC_CONST; +GArrowInt16DataType *garrow_int16_data_type_new (void); + +G_END_DECLS diff --git a/c_glib/arrow-glib/int32-array-builder.cpp b/c_glib/arrow-glib/int32-array-builder.cpp new file mode 100644 index 0000000000000..30cc4702f68fb --- /dev/null +++ b/c_glib/arrow-glib/int32-array-builder.cpp @@ -0,0 +1,120 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: int32-array-builder + * @short_description: 32-bit integer array builder class + * + * #GArrowInt32ArrayBuilder is the class to create a new + * #GArrowInt32Array. + */ + +G_DEFINE_TYPE(GArrowInt32ArrayBuilder, + garrow_int32_array_builder, + GARROW_TYPE_ARRAY_BUILDER) + +static void +garrow_int32_array_builder_init(GArrowInt32ArrayBuilder *builder) +{ +} + +static void +garrow_int32_array_builder_class_init(GArrowInt32ArrayBuilderClass *klass) +{ +} + +/** + * garrow_int32_array_builder_new: + * + * Returns: A newly created #GArrowInt32ArrayBuilder. + */ +GArrowInt32ArrayBuilder * +garrow_int32_array_builder_new(void) +{ + auto memory_pool = arrow::default_memory_pool(); + auto arrow_builder = + std::make_shared(memory_pool, arrow::int32()); + auto builder = + GARROW_INT32_ARRAY_BUILDER(g_object_new(GARROW_TYPE_INT32_ARRAY_BUILDER, + "array-builder", &arrow_builder, + NULL)); + return builder; +} + +/** + * garrow_int32_array_builder_append: + * @builder: A #GArrowInt32ArrayBuilder. + * @value: A int32 value. + * @error: (nullable): Return location for a #GError or %NULL. 
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ */
+gboolean
+garrow_int32_array_builder_append(GArrowInt32ArrayBuilder *builder,
+                                  gint32 value,
+                                  GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::Int32Builder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->Append(value);
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[int32-array-builder][append]");
+    return FALSE;
+  }
+}
+
+/**
+ * garrow_int32_array_builder_append_null:
+ * @builder: A #GArrowInt32ArrayBuilder.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ */
+gboolean
+garrow_int32_array_builder_append_null(GArrowInt32ArrayBuilder *builder,
+                                       GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::Int32Builder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->AppendNull();
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[int32-array-builder][append-null]");
+    return FALSE;
+  }
+}
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/int32-array-builder.h b/c_glib/arrow-glib/int32-array-builder.h
new file mode 100644
index 0000000000000..bdb380d6070b0
--- /dev/null
+++ b/c_glib/arrow-glib/int32-array-builder.h
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow-glib/array-builder.h>
+
+G_BEGIN_DECLS
+
+#define GARROW_TYPE_INT32_ARRAY_BUILDER \
+  (garrow_int32_array_builder_get_type())
+#define GARROW_INT32_ARRAY_BUILDER(obj) \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj), \
+                              GARROW_TYPE_INT32_ARRAY_BUILDER, \
+                              GArrowInt32ArrayBuilder))
+#define GARROW_INT32_ARRAY_BUILDER_CLASS(klass) \
+  (G_TYPE_CHECK_CLASS_CAST((klass), \
+                           GARROW_TYPE_INT32_ARRAY_BUILDER, \
+                           GArrowInt32ArrayBuilderClass))
+#define GARROW_IS_INT32_ARRAY_BUILDER(obj) \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj), \
+                              GARROW_TYPE_INT32_ARRAY_BUILDER))
+#define GARROW_IS_INT32_ARRAY_BUILDER_CLASS(klass) \
+  (G_TYPE_CHECK_CLASS_TYPE((klass), \
+                           GARROW_TYPE_INT32_ARRAY_BUILDER))
+#define GARROW_INT32_ARRAY_BUILDER_GET_CLASS(obj) \
+  (G_TYPE_INSTANCE_GET_CLASS((obj), \
+                             GARROW_TYPE_INT32_ARRAY_BUILDER, \
+                             GArrowInt32ArrayBuilderClass))
+
+typedef struct _GArrowInt32ArrayBuilder GArrowInt32ArrayBuilder;
+typedef struct _GArrowInt32ArrayBuilderClass GArrowInt32ArrayBuilderClass;
+
+/**
+ * GArrowInt32ArrayBuilder:
+ *
+ * It wraps `arrow::Int32Builder`.
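+ *
+ * A sketch of the intended build-then-finish flow (assumes the
+ * garrow_array_builder_finish() API; variable names are illustrative):
+ * |[<!-- language="C" -->
+ * GArrowInt32ArrayBuilder *builder = garrow_int32_array_builder_new();
+ * garrow_int32_array_builder_append(builder, 29, NULL);
+ * garrow_int32_array_builder_append_null(builder, NULL);
+ * GArrowArray *array = garrow_array_builder_finish(GARROW_ARRAY_BUILDER(builder));
+ * g_object_unref(builder);
+ * ]|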
+ */ +struct _GArrowInt32ArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowInt32ArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_int32_array_builder_get_type(void) G_GNUC_CONST; + +GArrowInt32ArrayBuilder *garrow_int32_array_builder_new(void); + +gboolean garrow_int32_array_builder_append(GArrowInt32ArrayBuilder *builder, + gint32 value, + GError **error); +gboolean garrow_int32_array_builder_append_null(GArrowInt32ArrayBuilder *builder, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/int32-array.cpp b/c_glib/arrow-glib/int32-array.cpp new file mode 100644 index 0000000000000..8bd6f35fd6431 --- /dev/null +++ b/c_glib/arrow-glib/int32-array.cpp @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: int32-array + * @short_description: 32-bit integer array class + * + * #GArrowInt32Array is a class for 32-bit integer array. It can store + * zero or more 32-bit integer data. + * + * #GArrowInt32Array is immutable. You need to use + * #GArrowInt32ArrayBuilder to create a new array. + */ + +G_DEFINE_TYPE(GArrowInt32Array, \ + garrow_int32_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_int32_array_init(GArrowInt32Array *object) +{ +} + +static void +garrow_int32_array_class_init(GArrowInt32ArrayClass *klass) +{ +} + +/** + * garrow_int32_array_get_value: + * @array: A #GArrowInt32Array. + * @i: The index of the target value. + * + * Returns: The i-th value. + */ +gint32 +garrow_int32_array_get_value(GArrowInt32Array *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + return static_cast(arrow_array.get())->Value(i); +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/int32-array.h b/c_glib/arrow-glib/int32-array.h new file mode 100644 index 0000000000000..cce2b41aafe26 --- /dev/null +++ b/c_glib/arrow-glib/int32-array.h @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. 
See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_INT32_ARRAY \ + (garrow_int32_array_get_type()) +#define GARROW_INT32_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INT32_ARRAY, \ + GArrowInt32Array)) +#define GARROW_INT32_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INT32_ARRAY, \ + GArrowInt32ArrayClass)) +#define GARROW_IS_INT32_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_INT32_ARRAY)) +#define GARROW_IS_INT32_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INT32_ARRAY)) +#define GARROW_INT32_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INT32_ARRAY, \ + GArrowInt32ArrayClass)) + +typedef struct _GArrowInt32Array GArrowInt32Array; +typedef struct _GArrowInt32ArrayClass GArrowInt32ArrayClass; + +/** + * GArrowInt32Array: + * + * It wraps `arrow::Int32Array`. + */ +struct _GArrowInt32Array +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowInt32ArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_int32_array_get_type(void) G_GNUC_CONST; + +gint32 garrow_int32_array_get_value(GArrowInt32Array *array, + gint64 i); + +G_END_DECLS diff --git a/c_glib/arrow-glib/int32-data-type.cpp b/c_glib/arrow-glib/int32-data-type.cpp new file mode 100644 index 0000000000000..add21135364f9 --- /dev/null +++ b/c_glib/arrow-glib/int32-data-type.cpp @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: int32-data-type + * @short_description: 32-bit integer data type + * + * #GArrowInt32DataType is a class for 32-bit integer data type. + */ + +G_DEFINE_TYPE(GArrowInt32DataType, \ + garrow_int32_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_int32_data_type_init(GArrowInt32DataType *object) +{ +} + +static void +garrow_int32_data_type_class_init(GArrowInt32DataTypeClass *klass) +{ +} + +/** + * garrow_int32_data_type_new: + * + * Returns: The newly created 32-bit integer data type. 
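+ *
+ * For example (a short sketch; the data type object only describes
+ * values and holds no data itself):
+ * |[<!-- language="C" -->
+ * GArrowInt32DataType *data_type = garrow_int32_data_type_new();
+ * // ... pass it where a #GArrowDataType is expected ...
+ * g_object_unref(data_type);
+ * ]|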
+ */ +GArrowInt32DataType * +garrow_int32_data_type_new(void) +{ + auto arrow_data_type = arrow::int32(); + + GArrowInt32DataType *data_type = + GARROW_INT32_DATA_TYPE(g_object_new(GARROW_TYPE_INT32_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/int32-data-type.h b/c_glib/arrow-glib/int32-data-type.h new file mode 100644 index 0000000000000..75cccbd40560d --- /dev/null +++ b/c_glib/arrow-glib/int32-data-type.h @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_INT32_DATA_TYPE \ + (garrow_int32_data_type_get_type()) +#define GARROW_INT32_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INT32_DATA_TYPE, \ + GArrowInt32DataType)) +#define GARROW_INT32_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INT32_DATA_TYPE, \ + GArrowInt32DataTypeClass)) +#define GARROW_IS_INT32_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_INT32_DATA_TYPE)) +#define GARROW_IS_INT32_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INT32_DATA_TYPE)) +#define GARROW_INT32_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INT32_DATA_TYPE, \ + GArrowInt32DataTypeClass)) + +typedef struct _GArrowInt32DataType GArrowInt32DataType; +typedef struct _GArrowInt32DataTypeClass GArrowInt32DataTypeClass; + +/** + * GArrowInt32DataType: + * + * It wraps `arrow::Int32Type`. + */ +struct _GArrowInt32DataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowInt32DataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_int32_data_type_get_type (void) G_GNUC_CONST; +GArrowInt32DataType *garrow_int32_data_type_new (void); + +G_END_DECLS diff --git a/c_glib/arrow-glib/int64-array-builder.cpp b/c_glib/arrow-glib/int64-array-builder.cpp new file mode 100644 index 0000000000000..b5eff114f92c9 --- /dev/null +++ b/c_glib/arrow-glib/int64-array-builder.cpp @@ -0,0 +1,120 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: int64-array-builder + * @short_description: 64-bit integer array builder class + * + * #GArrowInt64ArrayBuilder is the class to create a new + * #GArrowInt64Array. + */ + +G_DEFINE_TYPE(GArrowInt64ArrayBuilder, + garrow_int64_array_builder, + GARROW_TYPE_ARRAY_BUILDER) + +static void +garrow_int64_array_builder_init(GArrowInt64ArrayBuilder *builder) +{ +} + +static void +garrow_int64_array_builder_class_init(GArrowInt64ArrayBuilderClass *klass) +{ +} + +/** + * garrow_int64_array_builder_new: + * + * Returns: A newly created #GArrowInt64ArrayBuilder. + */ +GArrowInt64ArrayBuilder * +garrow_int64_array_builder_new(void) +{ + auto memory_pool = arrow::default_memory_pool(); + auto arrow_builder = + std::make_shared(memory_pool, arrow::int64()); + auto builder = + GARROW_INT64_ARRAY_BUILDER(g_object_new(GARROW_TYPE_INT64_ARRAY_BUILDER, + "array-builder", &arrow_builder, + NULL)); + return builder; +} + +/** + * garrow_int64_array_builder_append: + * @builder: A #GArrowInt64ArrayBuilder. + * @value: A int64 value. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_int64_array_builder_append(GArrowInt64ArrayBuilder *builder, + gint64 value, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->Append(value); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[int64-array-builder][append]"); + return FALSE; + } +} + +/** + * garrow_int64_array_builder_append_null: + * @builder: A #GArrowInt64ArrayBuilder. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_int64_array_builder_append_null(GArrowInt64ArrayBuilder *builder, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->AppendNull(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[int64-array-builder][append-null]"); + return FALSE; + } +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/int64-array-builder.h b/c_glib/arrow-glib/int64-array-builder.h new file mode 100644 index 0000000000000..8f4947eb7d9b1 --- /dev/null +++ b/c_glib/arrow-glib/int64-array-builder.h @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_INT64_ARRAY_BUILDER \ + (garrow_int64_array_builder_get_type()) +#define GARROW_INT64_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INT64_ARRAY_BUILDER, \ + GArrowInt64ArrayBuilder)) +#define GARROW_INT64_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INT64_ARRAY_BUILDER, \ + GArrowInt64ArrayBuilderClass)) +#define GARROW_IS_INT64_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_INT64_ARRAY_BUILDER)) +#define GARROW_IS_INT64_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INT64_ARRAY_BUILDER)) +#define GARROW_INT64_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INT64_ARRAY_BUILDER, \ + GArrowInt64ArrayBuilderClass)) + +typedef struct _GArrowInt64ArrayBuilder GArrowInt64ArrayBuilder; +typedef struct _GArrowInt64ArrayBuilderClass GArrowInt64ArrayBuilderClass; + +/** + * GArrowInt64ArrayBuilder: + * + * It wraps `arrow::Int64Builder`. + */ +struct _GArrowInt64ArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowInt64ArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_int64_array_builder_get_type(void) G_GNUC_CONST; + +GArrowInt64ArrayBuilder *garrow_int64_array_builder_new(void); + +gboolean garrow_int64_array_builder_append(GArrowInt64ArrayBuilder *builder, + gint64 value, + GError **error); +gboolean garrow_int64_array_builder_append_null(GArrowInt64ArrayBuilder *builder, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/int64-array.cpp b/c_glib/arrow-glib/int64-array.cpp new file mode 100644 index 0000000000000..be49d5bf35251 --- /dev/null +++ b/c_glib/arrow-glib/int64-array.cpp @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: int64-array + * @short_description: 64-bit integer array class + * + * #GArrowInt64Array is a class for 64-bit integer array. It can store + * zero or more 64-bit integer data. + * + * #GArrowInt64Array is immutable. You need to use + * #GArrowInt64ArrayBuilder to create a new array. 
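+ *
+ * A reading sketch (illustrative; assumes @array came from a builder
+ * and that garrow_array_get_length() is available):
+ * |[<!-- language="C" -->
+ * gint64 n = garrow_array_get_length(GARROW_ARRAY(array));
+ * for (gint64 i = 0; i < n; i++) {
+ *   g_print("%" G_GINT64_FORMAT "\n",
+ *           garrow_int64_array_get_value(array, i));
+ * }
+ * ]|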
+ */
+
+G_DEFINE_TYPE(GArrowInt64Array, \
+              garrow_int64_array, \
+              GARROW_TYPE_ARRAY)
+
+static void
+garrow_int64_array_init(GArrowInt64Array *object)
+{
+}
+
+static void
+garrow_int64_array_class_init(GArrowInt64ArrayClass *klass)
+{
+}
+
+/**
+ * garrow_int64_array_get_value:
+ * @array: A #GArrowInt64Array.
+ * @i: The index of the target value.
+ *
+ * Returns: The i-th value.
+ */
+gint64
+garrow_int64_array_get_value(GArrowInt64Array *array,
+                             gint64 i)
+{
+  auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array));
+  return static_cast<arrow::Int64Array *>(arrow_array.get())->Value(i);
+}
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/int64-array.h b/c_glib/arrow-glib/int64-array.h
new file mode 100644
index 0000000000000..73d4c6453a6d5
--- /dev/null
+++ b/c_glib/arrow-glib/int64-array.h
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow-glib/array.h>
+
+G_BEGIN_DECLS
+
+#define GARROW_TYPE_INT64_ARRAY \
+  (garrow_int64_array_get_type())
+#define GARROW_INT64_ARRAY(obj) \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj), \
+                              GARROW_TYPE_INT64_ARRAY, \
+                              GArrowInt64Array))
+#define GARROW_INT64_ARRAY_CLASS(klass) \
+  (G_TYPE_CHECK_CLASS_CAST((klass), \
+                           GARROW_TYPE_INT64_ARRAY, \
+                           GArrowInt64ArrayClass))
+#define GARROW_IS_INT64_ARRAY(obj) \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj), \
+                              GARROW_TYPE_INT64_ARRAY))
+#define GARROW_IS_INT64_ARRAY_CLASS(klass) \
+  (G_TYPE_CHECK_CLASS_TYPE((klass), \
+                           GARROW_TYPE_INT64_ARRAY))
+#define GARROW_INT64_ARRAY_GET_CLASS(obj) \
+  (G_TYPE_INSTANCE_GET_CLASS((obj), \
+                             GARROW_TYPE_INT64_ARRAY, \
+                             GArrowInt64ArrayClass))
+
+typedef struct _GArrowInt64Array GArrowInt64Array;
+typedef struct _GArrowInt64ArrayClass GArrowInt64ArrayClass;
+
+/**
+ * GArrowInt64Array:
+ *
+ * It wraps `arrow::Int64Array`.
+ */
+struct _GArrowInt64Array
+{
+  /*< private >*/
+  GArrowArray parent_instance;
+};
+
+struct _GArrowInt64ArrayClass
+{
+  GArrowArrayClass parent_class;
+};
+
+GType garrow_int64_array_get_type(void) G_GNUC_CONST;
+
+gint64 garrow_int64_array_get_value(GArrowInt64Array *array,
+                                    gint64 i);
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/int64-data-type.cpp b/c_glib/arrow-glib/int64-data-type.cpp
new file mode 100644
index 0000000000000..8e85b9d2ab922
--- /dev/null
+++ b/c_glib/arrow-glib/int64-data-type.cpp
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: int64-data-type + * @short_description: 64-bit integer data type + * + * #GArrowInt64DataType is a class for 64-bit integer data type. + */ + +G_DEFINE_TYPE(GArrowInt64DataType, \ + garrow_int64_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_int64_data_type_init(GArrowInt64DataType *object) +{ +} + +static void +garrow_int64_data_type_class_init(GArrowInt64DataTypeClass *klass) +{ +} + +/** + * garrow_int64_data_type_new: + * + * Returns: The newly created 64-bit integer data type. + */ +GArrowInt64DataType * +garrow_int64_data_type_new(void) +{ + auto arrow_data_type = arrow::int64(); + + GArrowInt64DataType *data_type = + GARROW_INT64_DATA_TYPE(g_object_new(GARROW_TYPE_INT64_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/int64-data-type.h b/c_glib/arrow-glib/int64-data-type.h new file mode 100644 index 0000000000000..499e79f7ab7a7 --- /dev/null +++ b/c_glib/arrow-glib/int64-data-type.h @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_INT64_DATA_TYPE \ + (garrow_int64_data_type_get_type()) +#define GARROW_INT64_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INT64_DATA_TYPE, \ + GArrowInt64DataType)) +#define GARROW_INT64_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INT64_DATA_TYPE, \ + GArrowInt64DataTypeClass)) +#define GARROW_IS_INT64_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_INT64_DATA_TYPE)) +#define GARROW_IS_INT64_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INT64_DATA_TYPE)) +#define GARROW_INT64_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INT64_DATA_TYPE, \ + GArrowInt64DataTypeClass)) + +typedef struct _GArrowInt64DataType GArrowInt64DataType; +typedef struct _GArrowInt64DataTypeClass GArrowInt64DataTypeClass; + +/** + * GArrowInt64DataType: + * + * It wraps `arrow::Int64Type`. 
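+ *
+ * For example, creating and inspecting it (assumes
+ * garrow_data_type_to_string() is available):
+ * |[<!-- language="C" -->
+ * GArrowInt64DataType *data_type = garrow_int64_data_type_new();
+ * gchar *name = garrow_data_type_to_string(GARROW_DATA_TYPE(data_type));
+ * g_print("%s\n", name); // -> "int64"
+ * g_free(name);
+ * g_object_unref(data_type);
+ * ]|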
+ */ +struct _GArrowInt64DataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowInt64DataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_int64_data_type_get_type (void) G_GNUC_CONST; +GArrowInt64DataType *garrow_int64_data_type_new (void); + +G_END_DECLS diff --git a/c_glib/arrow-glib/int8-array-builder.cpp b/c_glib/arrow-glib/int8-array-builder.cpp new file mode 100644 index 0000000000000..5107a6fae1f6a --- /dev/null +++ b/c_glib/arrow-glib/int8-array-builder.cpp @@ -0,0 +1,120 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: int8-array-builder + * @short_description: 8-bit integer array builder class + * + * #GArrowInt8ArrayBuilder is the class to create a new + * #GArrowInt8Array. + */ + +G_DEFINE_TYPE(GArrowInt8ArrayBuilder, + garrow_int8_array_builder, + GARROW_TYPE_ARRAY_BUILDER) + +static void +garrow_int8_array_builder_init(GArrowInt8ArrayBuilder *builder) +{ +} + +static void +garrow_int8_array_builder_class_init(GArrowInt8ArrayBuilderClass *klass) +{ +} + +/** + * garrow_int8_array_builder_new: + * + * Returns: A newly created #GArrowInt8ArrayBuilder. + */ +GArrowInt8ArrayBuilder * +garrow_int8_array_builder_new(void) +{ + auto memory_pool = arrow::default_memory_pool(); + auto arrow_builder = + std::make_shared(memory_pool, arrow::int8()); + auto builder = + GARROW_INT8_ARRAY_BUILDER(g_object_new(GARROW_TYPE_INT8_ARRAY_BUILDER, + "array-builder", &arrow_builder, + NULL)); + return builder; +} + +/** + * garrow_int8_array_builder_append: + * @builder: A #GArrowInt8ArrayBuilder. + * @value: A int8 value. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_int8_array_builder_append(GArrowInt8ArrayBuilder *builder, + gint8 value, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->Append(value); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[int8-array-builder][append]"); + return FALSE; + } +} + +/** + * garrow_int8_array_builder_append_null: + * @builder: A #GArrowInt8ArrayBuilder. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. 
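+ *
+ * For example (a sketch; @builder comes from
+ * garrow_int8_array_builder_new() and error handling is elided; the
+ * null slot would be reported by garrow_array_get_n_nulls() on the
+ * finished array, assuming that API is available):
+ * |[<!-- language="C" -->
+ * garrow_int8_array_builder_append(builder, 1, NULL);
+ * garrow_int8_array_builder_append_null(builder, NULL);
+ * ]|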
+ */ +gboolean +garrow_int8_array_builder_append_null(GArrowInt8ArrayBuilder *builder, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->AppendNull(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[int8-array-builder][append-null]"); + return FALSE; + } +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/int8-array-builder.h b/c_glib/arrow-glib/int8-array-builder.h new file mode 100644 index 0000000000000..321e9310a6447 --- /dev/null +++ b/c_glib/arrow-glib/int8-array-builder.h @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_INT8_ARRAY_BUILDER \ + (garrow_int8_array_builder_get_type()) +#define GARROW_INT8_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INT8_ARRAY_BUILDER, \ + GArrowInt8ArrayBuilder)) +#define GARROW_INT8_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INT8_ARRAY_BUILDER, \ + GArrowInt8ArrayBuilderClass)) +#define GARROW_IS_INT8_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_INT8_ARRAY_BUILDER)) +#define GARROW_IS_INT8_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INT8_ARRAY_BUILDER)) +#define GARROW_INT8_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INT8_ARRAY_BUILDER, \ + GArrowInt8ArrayBuilderClass)) + +typedef struct _GArrowInt8ArrayBuilder GArrowInt8ArrayBuilder; +typedef struct _GArrowInt8ArrayBuilderClass GArrowInt8ArrayBuilderClass; + +/** + * GArrowInt8ArrayBuilder: + * + * It wraps `arrow::Int8Builder`. + */ +struct _GArrowInt8ArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowInt8ArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_int8_array_builder_get_type(void) G_GNUC_CONST; + +GArrowInt8ArrayBuilder *garrow_int8_array_builder_new(void); + +gboolean garrow_int8_array_builder_append(GArrowInt8ArrayBuilder *builder, + gint8 value, + GError **error); +gboolean garrow_int8_array_builder_append_null(GArrowInt8ArrayBuilder *builder, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/int8-array.cpp b/c_glib/arrow-glib/int8-array.cpp new file mode 100644 index 0000000000000..d3f12ece9bbf7 --- /dev/null +++ b/c_glib/arrow-glib/int8-array.cpp @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: int8-array + * @short_description: 8-bit integer array class + * + * #GArrowInt8Array is a class for 8-bit integer array. It can store + * zero or more 8-bit integer data. + * + * #GArrowInt8Array is immutable. You need to use + * #GArrowInt8ArrayBuilder to create a new array. + */ + +G_DEFINE_TYPE(GArrowInt8Array, \ + garrow_int8_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_int8_array_init(GArrowInt8Array *object) +{ +} + +static void +garrow_int8_array_class_init(GArrowInt8ArrayClass *klass) +{ +} + +/** + * garrow_int8_array_get_value: + * @array: A #GArrowInt8Array. + * @i: The index of the target value. + * + * Returns: The i-th value. + */ +gint8 +garrow_int8_array_get_value(GArrowInt8Array *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + return static_cast(arrow_array.get())->Value(i); +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/int8-array.h b/c_glib/arrow-glib/int8-array.h new file mode 100644 index 0000000000000..0e1e901f4fdb6 --- /dev/null +++ b/c_glib/arrow-glib/int8-array.h @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_INT8_ARRAY \ + (garrow_int8_array_get_type()) +#define GARROW_INT8_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INT8_ARRAY, \ + GArrowInt8Array)) +#define GARROW_INT8_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INT8_ARRAY, \ + GArrowInt8ArrayClass)) +#define GARROW_IS_INT8_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_INT8_ARRAY)) +#define GARROW_IS_INT8_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INT8_ARRAY)) +#define GARROW_INT8_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INT8_ARRAY, \ + GArrowInt8ArrayClass)) + +typedef struct _GArrowInt8Array GArrowInt8Array; +typedef struct _GArrowInt8ArrayClass GArrowInt8ArrayClass; + +/** + * GArrowInt8Array: + * + * It wraps `arrow::Int8Array`. 
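+ *
+ * A reading sketch (illustrative; assumes @array has at least one
+ * element and that garrow_array_is_null() is available):
+ * |[<!-- language="C" -->
+ * if (!garrow_array_is_null(GARROW_ARRAY(array), 0)) {
+ *   g_print("%d\n", garrow_int8_array_get_value(array, 0));
+ * }
+ * ]|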
+ */ +struct _GArrowInt8Array +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowInt8ArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_int8_array_get_type(void) G_GNUC_CONST; + +gint8 garrow_int8_array_get_value(GArrowInt8Array *array, + gint64 i); + +G_END_DECLS diff --git a/c_glib/arrow-glib/int8-data-type.cpp b/c_glib/arrow-glib/int8-data-type.cpp new file mode 100644 index 0000000000000..55b1ebc852d10 --- /dev/null +++ b/c_glib/arrow-glib/int8-data-type.cpp @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: int8-data-type + * @short_description: 8-bit integer data type + * + * #GArrowInt8DataType is a class for 8-bit integer data type. + */ + +G_DEFINE_TYPE(GArrowInt8DataType, \ + garrow_int8_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_int8_data_type_init(GArrowInt8DataType *object) +{ +} + +static void +garrow_int8_data_type_class_init(GArrowInt8DataTypeClass *klass) +{ +} + +/** + * garrow_int8_data_type_new: + * + * Returns: The newly created 8-bit integer data type. + */ +GArrowInt8DataType * +garrow_int8_data_type_new(void) +{ + auto arrow_data_type = arrow::int8(); + + GArrowInt8DataType *data_type = + GARROW_INT8_DATA_TYPE(g_object_new(GARROW_TYPE_INT8_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/int8-data-type.h b/c_glib/arrow-glib/int8-data-type.h new file mode 100644 index 0000000000000..4343bd17a725b --- /dev/null +++ b/c_glib/arrow-glib/int8-data-type.h @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_INT8_DATA_TYPE \ + (garrow_int8_data_type_get_type()) +#define GARROW_INT8_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INT8_DATA_TYPE, \ + GArrowInt8DataType)) +#define GARROW_INT8_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INT8_DATA_TYPE, \ + GArrowInt8DataTypeClass)) +#define GARROW_IS_INT8_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_INT8_DATA_TYPE)) +#define GARROW_IS_INT8_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INT8_DATA_TYPE)) +#define GARROW_INT8_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INT8_DATA_TYPE, \ + GArrowInt8DataTypeClass)) + +typedef struct _GArrowInt8DataType GArrowInt8DataType; +typedef struct _GArrowInt8DataTypeClass GArrowInt8DataTypeClass; + +/** + * GArrowInt8DataType: + * + * It wraps `arrow::Int8Type`. + */ +struct _GArrowInt8DataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowInt8DataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_int8_data_type_get_type (void) G_GNUC_CONST; +GArrowInt8DataType *garrow_int8_data_type_new (void); + +G_END_DECLS diff --git a/c_glib/arrow-glib/io-enums.c.template b/c_glib/arrow-glib/io-enums.c.template new file mode 100644 index 0000000000000..10ee77588d98b --- /dev/null +++ b/c_glib/arrow-glib/io-enums.c.template @@ -0,0 +1,56 @@ +/*** BEGIN file-header ***/ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +/*** END file-header ***/ + +/*** BEGIN file-production ***/ + +/* enumerations from "@filename@" */ +/*** END file-production ***/ + +/*** BEGIN value-header ***/ +GType +@enum_name@_get_type(void) +{ + static GType etype = 0; + if (G_UNLIKELY(etype == 0)) { + static const G@Type@Value values[] = { +/*** END value-header ***/ + +/*** BEGIN value-production ***/ + {@VALUENAME@, "@VALUENAME@", "@valuenick@"}, +/*** END value-production ***/ + +/*** BEGIN value-tail ***/ + {0, NULL, NULL} + }; + etype = g_@type@_register_static(g_intern_static_string("@EnumName@"), values); + } + return etype; +} +/*** END value-tail ***/ + +/*** BEGIN file-tail ***/ +/*** END file-tail ***/ diff --git a/c_glib/arrow-glib/io-enums.h.template b/c_glib/arrow-glib/io-enums.h.template new file mode 100644 index 0000000000000..429141dc76a60 --- /dev/null +++ b/c_glib/arrow-glib/io-enums.h.template @@ -0,0 +1,41 @@ +/*** BEGIN file-header ***/ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS +/*** END file-header ***/ + +/*** BEGIN file-production ***/ + +/* enumerations from "@filename@" */ +/*** END file-production ***/ + +/*** BEGIN value-header ***/ +GType @enum_name@_get_type(void) G_GNUC_CONST; +#define @ENUMPREFIX@_TYPE_@ENUMSHORT@ (@enum_name@_get_type()) +/*** END value-header ***/ + +/*** BEGIN file-tail ***/ + +G_END_DECLS +/*** END file-tail ***/ diff --git a/c_glib/arrow-glib/io-file-mode.cpp b/c_glib/arrow-glib/io-file-mode.cpp new file mode 100644 index 0000000000000..7998d3f5bb061 --- /dev/null +++ b/c_glib/arrow-glib/io-file-mode.cpp @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include + +/** + * SECTION: io-file-mode + * @title: GArrowIOFileMode + * @short_description: File mode mapping between Arrow and arrow-glib + * + * #GArrowIOFileMode provides file modes corresponding to + * `arrow::io::FileMode::type` values. + */ + +GArrowIOFileMode +garrow_io_file_mode_from_raw(arrow::io::FileMode::type mode) +{ + switch (mode) { + case arrow::io::FileMode::type::READ: + return GARROW_IO_FILE_MODE_READ; + case arrow::io::FileMode::type::WRITE: + return GARROW_IO_FILE_MODE_WRITE; + case arrow::io::FileMode::type::READWRITE: + return GARROW_IO_FILE_MODE_READWRITE; + default: + return GARROW_IO_FILE_MODE_READ; + } +} + +arrow::io::FileMode::type +garrow_io_file_mode_to_raw(GArrowIOFileMode mode) +{ + switch (mode) { + case GARROW_IO_FILE_MODE_READ: + return arrow::io::FileMode::type::READ; + case GARROW_IO_FILE_MODE_WRITE: + return arrow::io::FileMode::type::WRITE; + case GARROW_IO_FILE_MODE_READWRITE: + return arrow::io::FileMode::type::READWRITE; + default: + return arrow::io::FileMode::type::READ; + } +} diff --git a/c_glib/arrow-glib/io-file-mode.h b/c_glib/arrow-glib/io-file-mode.h new file mode 100644 index 0000000000000..03eca353bbdbb --- /dev/null +++ b/c_glib/arrow-glib/io-file-mode.h @@ -0,0 +1,40 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. 
See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +/** + * GArrowIOFileMode: + * @GARROW_IO_FILE_MODE_READ: For read. + * @GARROW_IO_FILE_MODE_WRITE: For write. + * @GARROW_IO_FILE_MODE_READWRITE: For read-write. + * + * They are corresponding to `arrow::io::FileMode::type` values. + */ +typedef enum { + GARROW_IO_FILE_MODE_READ, + GARROW_IO_FILE_MODE_WRITE, + GARROW_IO_FILE_MODE_READWRITE +} GArrowIOFileMode; + +G_END_DECLS diff --git a/c_glib/arrow-glib/io-file-mode.hpp b/c_glib/arrow-glib/io-file-mode.hpp new file mode 100644 index 0000000000000..b3d8ac6d8e053 --- /dev/null +++ b/c_glib/arrow-glib/io-file-mode.hpp @@ -0,0 +1,27 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +#include + +GArrowIOFileMode garrow_io_file_mode_from_raw(arrow::io::FileMode::type mode); +arrow::io::FileMode::type garrow_io_file_mode_to_raw(GArrowIOFileMode mode); diff --git a/c_glib/arrow-glib/io-file-output-stream.cpp b/c_glib/arrow-glib/io-file-output-stream.cpp new file mode 100644 index 0000000000000..673e8cd36a60a --- /dev/null +++ b/c_glib/arrow-glib/io-file-output-stream.cpp @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include + +#include +#include +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: io-file-output-stream + * @short_description: A file output stream. + * + * The #GArrowIOFileOutputStream is a class for file output stream. + */ + +typedef struct GArrowIOFileOutputStreamPrivate_ { + std::shared_ptr file_output_stream; +} GArrowIOFileOutputStreamPrivate; + +enum { + PROP_0, + PROP_FILE_OUTPUT_STREAM +}; + +static std::shared_ptr +garrow_io_file_output_stream_get_raw_file_interface(GArrowIOFile *file) +{ + auto file_output_stream = GARROW_IO_FILE_OUTPUT_STREAM(file); + auto arrow_file_output_stream = + garrow_io_file_output_stream_get_raw(file_output_stream); + return arrow_file_output_stream; +} + +static void +garrow_io_file_interface_init(GArrowIOFileInterface *iface) +{ + iface->get_raw = garrow_io_file_output_stream_get_raw_file_interface; +} + +static std::shared_ptr +garrow_io_file_output_stream_get_raw_writeable_interface(GArrowIOWriteable *writeable) +{ + auto file_output_stream = GARROW_IO_FILE_OUTPUT_STREAM(writeable); + auto arrow_file_output_stream = + garrow_io_file_output_stream_get_raw(file_output_stream); + return arrow_file_output_stream; +} + +static void +garrow_io_writeable_interface_init(GArrowIOWriteableInterface *iface) +{ + iface->get_raw = garrow_io_file_output_stream_get_raw_writeable_interface; +} + +static std::shared_ptr +garrow_io_file_output_stream_get_raw_output_stream_interface(GArrowIOOutputStream *output_stream) +{ + auto file_output_stream = GARROW_IO_FILE_OUTPUT_STREAM(output_stream); + auto arrow_file_output_stream = + garrow_io_file_output_stream_get_raw(file_output_stream); + return arrow_file_output_stream; +} + +static void +garrow_io_output_stream_interface_init(GArrowIOOutputStreamInterface *iface) +{ + iface->get_raw = garrow_io_file_output_stream_get_raw_output_stream_interface; +} + +G_DEFINE_TYPE_WITH_CODE(GArrowIOFileOutputStream, + garrow_io_file_output_stream, + G_TYPE_OBJECT, + G_ADD_PRIVATE(GArrowIOFileOutputStream) + G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_FILE, + garrow_io_file_interface_init) + G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_WRITEABLE, + garrow_io_writeable_interface_init) + G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_OUTPUT_STREAM, + garrow_io_output_stream_interface_init)); + +#define GARROW_IO_FILE_OUTPUT_STREAM_GET_PRIVATE(obj) \ + (G_TYPE_INSTANCE_GET_PRIVATE((obj), \ + GARROW_IO_TYPE_FILE_OUTPUT_STREAM, \ + GArrowIOFileOutputStreamPrivate)) + +static void +garrow_io_file_output_stream_finalize(GObject *object) +{ + GArrowIOFileOutputStreamPrivate *priv; + + priv = GARROW_IO_FILE_OUTPUT_STREAM_GET_PRIVATE(object); + + priv->file_output_stream = nullptr; + + G_OBJECT_CLASS(garrow_io_file_output_stream_parent_class)->finalize(object); +} + +static void +garrow_io_file_output_stream_set_property(GObject *object, + guint prop_id, + const GValue *value, + GParamSpec *pspec) +{ + GArrowIOFileOutputStreamPrivate *priv; + + priv = GARROW_IO_FILE_OUTPUT_STREAM_GET_PRIVATE(object); + + switch (prop_id) { + case PROP_FILE_OUTPUT_STREAM: + priv->file_output_stream = + *static_cast *>(g_value_get_pointer(value)); + break; + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_io_file_output_stream_get_property(GObject *object, + guint prop_id, + GValue *value, + GParamSpec *pspec) +{ + switch (prop_id) { + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void 
+garrow_io_file_output_stream_init(GArrowIOFileOutputStream *object)
+{
+}
+
+static void
+garrow_io_file_output_stream_class_init(GArrowIOFileOutputStreamClass *klass)
+{
+  GObjectClass *gobject_class;
+  GParamSpec *spec;
+
+  gobject_class = G_OBJECT_CLASS(klass);
+
+  gobject_class->finalize = garrow_io_file_output_stream_finalize;
+  gobject_class->set_property = garrow_io_file_output_stream_set_property;
+  gobject_class->get_property = garrow_io_file_output_stream_get_property;
+
+  spec = g_param_spec_pointer("file-output-stream",
+                              "io::FileOutputStream",
+                              "The raw std::shared_ptr<arrow::io::FileOutputStream> *",
+                              static_cast<GParamFlags>(G_PARAM_WRITABLE |
+                                                       G_PARAM_CONSTRUCT_ONLY));
+  g_object_class_install_property(gobject_class, PROP_FILE_OUTPUT_STREAM, spec);
+}
+
+/**
+ * garrow_io_file_output_stream_open:
+ * @path: The path of the file output stream.
+ * @append: Whether the path is opened as append mode or recreate mode.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: (nullable) (transfer full): A newly opened
+ * #GArrowIOFileOutputStream or %NULL on error.
+ */
+GArrowIOFileOutputStream *
+garrow_io_file_output_stream_open(const gchar *path,
+                                  gboolean append,
+                                  GError **error)
+{
+  std::shared_ptr<arrow::io::FileOutputStream> arrow_file_output_stream;
+  auto status =
+    arrow::io::FileOutputStream::Open(std::string(path),
+                                      append,
+                                      &arrow_file_output_stream);
+  if (status.ok()) {
+    return garrow_io_file_output_stream_new_raw(&arrow_file_output_stream);
+  } else {
+    std::string context("[io][file-output-stream][open]: <");
+    context += path;
+    context += ">";
+    garrow_error_set(error, status, context.c_str());
+    return NULL;
+  }
+}
+
+G_END_DECLS
+
+GArrowIOFileOutputStream *
+garrow_io_file_output_stream_new_raw(std::shared_ptr<arrow::io::FileOutputStream> *arrow_file_output_stream)
+{
+  auto file_output_stream =
+    GARROW_IO_FILE_OUTPUT_STREAM(g_object_new(GARROW_IO_TYPE_FILE_OUTPUT_STREAM,
+                                              "file-output-stream", arrow_file_output_stream,
+                                              NULL));
+  return file_output_stream;
+}
+
+std::shared_ptr<arrow::io::FileOutputStream>
+garrow_io_file_output_stream_get_raw(GArrowIOFileOutputStream *file_output_stream)
+{
+  GArrowIOFileOutputStreamPrivate *priv;
+
+  priv = GARROW_IO_FILE_OUTPUT_STREAM_GET_PRIVATE(file_output_stream);
+  return priv->file_output_stream;
+}
diff --git a/c_glib/arrow-glib/io-file-output-stream.h b/c_glib/arrow-glib/io-file-output-stream.h
new file mode 100644
index 0000000000000..032b125544e77
--- /dev/null
+++ b/c_glib/arrow-glib/io-file-output-stream.h
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_IO_TYPE_FILE_OUTPUT_STREAM \ + (garrow_io_file_output_stream_get_type()) +#define GARROW_IO_FILE_OUTPUT_STREAM(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_IO_TYPE_FILE_OUTPUT_STREAM, \ + GArrowIOFileOutputStream)) +#define GARROW_IO_FILE_OUTPUT_STREAM_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_IO_TYPE_FILE_OUTPUT_STREAM, \ + GArrowIOFileOutputStreamClass)) +#define GARROW_IO_IS_FILE_OUTPUT_STREAM(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_IO_TYPE_FILE_OUTPUT_STREAM)) +#define GARROW_IO_IS_FILE_OUTPUT_STREAM_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_IO_TYPE_FILE_OUTPUT_STREAM)) +#define GARROW_IO_FILE_OUTPUT_STREAM_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_IO_TYPE_FILE_OUTPUT_STREAM, \ + GArrowIOFileOutputStreamClass)) + +typedef struct _GArrowIOFileOutputStream GArrowIOFileOutputStream; +typedef struct _GArrowIOFileOutputStreamClass GArrowIOFileOutputStreamClass; + +/** + * GArrowIOFileOutputStream: + * + * It wraps `arrow::io::FileOutputStream`. + */ +struct _GArrowIOFileOutputStream +{ + /*< private >*/ + GObject parent_instance; +}; + +struct _GArrowIOFileOutputStreamClass +{ + GObjectClass parent_class; +}; + +GType garrow_io_file_output_stream_get_type(void) G_GNUC_CONST; + +GArrowIOFileOutputStream *garrow_io_file_output_stream_open(const gchar *path, + gboolean append, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/io-file-output-stream.hpp b/c_glib/arrow-glib/io-file-output-stream.hpp new file mode 100644 index 0000000000000..76b8e91f6cf43 --- /dev/null +++ b/c_glib/arrow-glib/io-file-output-stream.hpp @@ -0,0 +1,28 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include +#include + +#include + +GArrowIOFileOutputStream *garrow_io_file_output_stream_new_raw(std::shared_ptr *arrow_file_output_stream); +std::shared_ptr garrow_io_file_output_stream_get_raw(GArrowIOFileOutputStream *file_output_stream); diff --git a/c_glib/arrow-glib/io-file.cpp b/c_glib/arrow-glib/io-file.cpp new file mode 100644 index 0000000000000..536ae3e705f59 --- /dev/null +++ b/c_glib/arrow-glib/io-file.cpp @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include + +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: io-file + * @title: GArrowIOFile + * @short_description: File interface + * + * #GArrowIOFile is an interface for file. + */ + +G_DEFINE_INTERFACE(GArrowIOFile, + garrow_io_file, + G_TYPE_OBJECT) + +static void +garrow_io_file_default_init (GArrowIOFileInterface *iface) +{ +} + +/** + * garrow_io_file_close: + * @file: A #GArrowIOFile. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_io_file_close(GArrowIOFile *file, + GError **error) +{ + auto arrow_file = garrow_io_file_get_raw(file); + + auto status = arrow_file->Close(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[io][file][close]"); + return FALSE; + } +} + +/** + * garrow_io_file_tell: + * @file: A #GArrowIOFile. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: The current offset on success, -1 if there was an error. + */ +gint64 +garrow_io_file_tell(GArrowIOFile *file, + GError **error) +{ + auto arrow_file = garrow_io_file_get_raw(file); + + gint64 position; + auto status = arrow_file->Tell(&position); + if (status.ok()) { + return position; + } else { + garrow_error_set(error, status, "[io][file][tell]"); + return -1; + } +} + +/** + * garrow_io_file_get_mode: + * @file: A #GArrowIOFile. + * + * Returns: The mode of the file. + */ +GArrowIOFileMode +garrow_io_file_get_mode(GArrowIOFile *file) +{ + auto arrow_file = garrow_io_file_get_raw(file); + + auto arrow_mode = arrow_file->mode(); + return garrow_io_file_mode_from_raw(arrow_mode); +} + +G_END_DECLS + +std::shared_ptr +garrow_io_file_get_raw(GArrowIOFile *file) +{ + auto *iface = GARROW_IO_FILE_GET_IFACE(file); + return iface->get_raw(file); +} diff --git a/c_glib/arrow-glib/io-file.h b/c_glib/arrow-glib/io-file.h new file mode 100644 index 0000000000000..9fa0ec137566f --- /dev/null +++ b/c_glib/arrow-glib/io-file.h @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_IO_TYPE_FILE \ + (garrow_io_file_get_type()) +#define GARROW_IO_FILE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_IO_TYPE_FILE, \ + GArrowIOFileInterface)) +#define GARROW_IO_IS_FILE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_IO_TYPE_FILE)) +#define GARROW_IO_FILE_GET_IFACE(obj) \ + (G_TYPE_INSTANCE_GET_INTERFACE((obj), \ + GARROW_IO_TYPE_FILE, \ + GArrowIOFileInterface)) + +typedef struct _GArrowIOFile GArrowIOFile; +typedef struct _GArrowIOFileInterface GArrowIOFileInterface; + +GType garrow_io_file_get_type(void) G_GNUC_CONST; + +gboolean garrow_io_file_close(GArrowIOFile *file, + GError **error); +gint64 garrow_io_file_tell(GArrowIOFile *file, + GError **error); +GArrowIOFileMode garrow_io_file_get_mode(GArrowIOFile *file); + +G_END_DECLS diff --git a/c_glib/arrow-glib/io-file.hpp b/c_glib/arrow-glib/io-file.hpp new file mode 100644 index 0000000000000..afaca90a10fa3 --- /dev/null +++ b/c_glib/arrow-glib/io-file.hpp @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +#include + +/** + * GArrowIOFileInterface: + * + * It wraps `arrow::io::FileInterface`. + */ +struct _GArrowIOFileInterface +{ + GTypeInterface parent_iface; + + std::shared_ptr (*get_raw)(GArrowIOFile *file); +}; + +std::shared_ptr garrow_io_file_get_raw(GArrowIOFile *file); diff --git a/c_glib/arrow-glib/io-input-stream.cpp b/c_glib/arrow-glib/io-input-stream.cpp new file mode 100644 index 0000000000000..a28b9c6556ccd --- /dev/null +++ b/c_glib/arrow-glib/io-input-stream.cpp @@ -0,0 +1,56 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: io-input-stream + * @title: GArrowIOInputStream + * @short_description: Stream input interface + * + * #GArrowIOInputStream is an interface for stream input. Stream input + * is file based and readable. 
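+ *
+ * As a usage sketch (the path and the GARROW_IO_FILE_MODE_READ value
+ * of #GArrowIOFileMode are assumed here for illustration), a concrete
+ * implementation such as #GArrowIOMemoryMappedFile can be used
+ * wherever a stream input is expected once it is opened:
+ *
+ * |[<!-- language="C" -->
+ * GError *error = NULL;
+ * GArrowIOMemoryMappedFile *file;
+ *
+ * file = garrow_io_memory_mapped_file_open("/tmp/data.arrow",
+ *                                          GARROW_IO_FILE_MODE_READ,
+ *                                          &error);
+ * if (file != NULL) {
+ *   g_assert(GARROW_IO_IS_INPUT_STREAM(file));
+ *   g_object_unref(file);
+ * } else {
+ *   g_print("failed to open: %s\n", error->message);
+ *   g_clear_error(&error);
+ * }
+ * ]|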
+ */
+
+G_DEFINE_INTERFACE(GArrowIOInputStream,
+                   garrow_io_input_stream,
+                   G_TYPE_OBJECT)
+
+static void
+garrow_io_input_stream_default_init (GArrowIOInputStreamInterface *iface)
+{
+}
+
+G_END_DECLS
+
+std::shared_ptr<arrow::io::InputStream>
+garrow_io_input_stream_get_raw(GArrowIOInputStream *input_stream)
+{
+  auto *iface = GARROW_IO_INPUT_STREAM_GET_IFACE(input_stream);
+  return iface->get_raw(input_stream);
+}
diff --git a/c_glib/arrow-glib/io-input-stream.h b/c_glib/arrow-glib/io-input-stream.h
new file mode 100644
index 0000000000000..a7f06819b4f97
--- /dev/null
+++ b/c_glib/arrow-glib/io-input-stream.h
@@ -0,0 +1,45 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <glib-object.h>
+
+G_BEGIN_DECLS
+
+#define GARROW_IO_TYPE_INPUT_STREAM             \
+  (garrow_io_input_stream_get_type())
+#define GARROW_IO_INPUT_STREAM(obj)                        \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj),                       \
+                              GARROW_IO_TYPE_INPUT_STREAM, \
+                              GArrowIOInputStreamInterface))
+#define GARROW_IO_IS_INPUT_STREAM(obj)                     \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj),                       \
+                              GARROW_IO_TYPE_INPUT_STREAM))
+#define GARROW_IO_INPUT_STREAM_GET_IFACE(obj)                 \
+  (G_TYPE_INSTANCE_GET_INTERFACE((obj),                       \
+                                 GARROW_IO_TYPE_INPUT_STREAM, \
+                                 GArrowIOInputStreamInterface))
+
+typedef struct _GArrowIOInputStream GArrowIOInputStream;
+typedef struct _GArrowIOInputStreamInterface GArrowIOInputStreamInterface;
+
+GType garrow_io_input_stream_get_type(void) G_GNUC_CONST;
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/io-input-stream.hpp b/c_glib/arrow-glib/io-input-stream.hpp
new file mode 100644
index 0000000000000..3b1de5da5c226
--- /dev/null
+++ b/c_glib/arrow-glib/io-input-stream.hpp
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow/io/interfaces.h>
+
+#include <arrow-glib/io-input-stream.h>
+
+/**
+ * GArrowIOInputStreamInterface:
+ *
+ * It wraps `arrow::io::InputStream`.
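+ *
+ * Note for implementers (descriptive only): a concrete class fills in
+ * the get_raw virtual function so that
+ * garrow_io_input_stream_get_raw() can reach the underlying
+ * `arrow::io::InputStream`. For example, #GArrowIOMemoryMappedFile
+ * assigns garrow_io_memory_mapped_file_get_raw_input_stream_interface()
+ * to it in its interface init function.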
+ */ +struct _GArrowIOInputStreamInterface +{ + GTypeInterface parent_iface; + + std::shared_ptr (*get_raw)(GArrowIOInputStream *file); +}; + +std::shared_ptr garrow_io_input_stream_get_raw(GArrowIOInputStream *input_stream); diff --git a/c_glib/arrow-glib/io-memory-mapped-file.cpp b/c_glib/arrow-glib/io-memory-mapped-file.cpp new file mode 100644 index 0000000000000..aa6ae2afd6e78 --- /dev/null +++ b/c_glib/arrow-glib/io-memory-mapped-file.cpp @@ -0,0 +1,287 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: io-memory-mapped-file + * @short_description: Memory mapped file class + * + * #GArrowIOMemoryMappedFile is a class for memory mapped file. It's + * readable and writeable. It supports zero copy. + */ + +typedef struct GArrowIOMemoryMappedFilePrivate_ { + std::shared_ptr memory_mapped_file; +} GArrowIOMemoryMappedFilePrivate; + +enum { + PROP_0, + PROP_MEMORY_MAPPED_FILE +}; + +static std::shared_ptr +garrow_io_memory_mapped_file_get_raw_file_interface(GArrowIOFile *file) +{ + auto memory_mapped_file = GARROW_IO_MEMORY_MAPPED_FILE(file); + auto arrow_memory_mapped_file = + garrow_io_memory_mapped_file_get_raw(memory_mapped_file); + return arrow_memory_mapped_file; +} + +static void +garrow_io_file_interface_init(GArrowIOFileInterface *iface) +{ + iface->get_raw = garrow_io_memory_mapped_file_get_raw_file_interface; +} + +static std::shared_ptr +garrow_io_memory_mapped_file_get_raw_readable_interface(GArrowIOReadable *readable) +{ + auto memory_mapped_file = GARROW_IO_MEMORY_MAPPED_FILE(readable); + auto arrow_memory_mapped_file = + garrow_io_memory_mapped_file_get_raw(memory_mapped_file); + return arrow_memory_mapped_file; +} + +static void +garrow_io_readable_interface_init(GArrowIOReadableInterface *iface) +{ + iface->get_raw = garrow_io_memory_mapped_file_get_raw_readable_interface; +} + +static std::shared_ptr +garrow_io_memory_mapped_file_get_raw_input_stream_interface(GArrowIOInputStream *input_stream) +{ + auto memory_mapped_file = GARROW_IO_MEMORY_MAPPED_FILE(input_stream); + auto arrow_memory_mapped_file = + garrow_io_memory_mapped_file_get_raw(memory_mapped_file); + return arrow_memory_mapped_file; +} + +static void +garrow_io_input_stream_interface_init(GArrowIOInputStreamInterface *iface) +{ + iface->get_raw = garrow_io_memory_mapped_file_get_raw_input_stream_interface; +} + +static std::shared_ptr +garrow_io_memory_mapped_file_get_raw_readable_file_interface(GArrowIOReadableFile *file) +{ + auto memory_mapped_file = GARROW_IO_MEMORY_MAPPED_FILE(file); + auto arrow_memory_mapped_file = + 
garrow_io_memory_mapped_file_get_raw(memory_mapped_file); + return arrow_memory_mapped_file; +} + +static void +garrow_io_readable_file_interface_init(GArrowIOReadableFileInterface *iface) +{ + iface->get_raw = garrow_io_memory_mapped_file_get_raw_readable_file_interface; +} + +static std::shared_ptr +garrow_io_memory_mapped_file_get_raw_writeable_interface(GArrowIOWriteable *writeable) +{ + auto memory_mapped_file = GARROW_IO_MEMORY_MAPPED_FILE(writeable); + auto arrow_memory_mapped_file = + garrow_io_memory_mapped_file_get_raw(memory_mapped_file); + return arrow_memory_mapped_file; +} + +static void +garrow_io_writeable_interface_init(GArrowIOWriteableInterface *iface) +{ + iface->get_raw = garrow_io_memory_mapped_file_get_raw_writeable_interface; +} + +static std::shared_ptr +garrow_io_memory_mapped_file_get_raw_writeable_file_interface(GArrowIOWriteableFile *file) +{ + auto memory_mapped_file = GARROW_IO_MEMORY_MAPPED_FILE(file); + auto arrow_memory_mapped_file = + garrow_io_memory_mapped_file_get_raw(memory_mapped_file); + return arrow_memory_mapped_file; +} + +static void +garrow_io_writeable_file_interface_init(GArrowIOWriteableFileInterface *iface) +{ + iface->get_raw = garrow_io_memory_mapped_file_get_raw_writeable_file_interface; +} + +G_DEFINE_TYPE_WITH_CODE(GArrowIOMemoryMappedFile, + garrow_io_memory_mapped_file, + G_TYPE_OBJECT, + G_ADD_PRIVATE(GArrowIOMemoryMappedFile) + G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_FILE, + garrow_io_file_interface_init) + G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_READABLE, + garrow_io_readable_interface_init) + G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_INPUT_STREAM, + garrow_io_input_stream_interface_init) + G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_READABLE_FILE, + garrow_io_readable_file_interface_init) + G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_WRITEABLE, + garrow_io_writeable_interface_init) + G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_WRITEABLE_FILE, + garrow_io_writeable_file_interface_init)); + +#define GARROW_IO_MEMORY_MAPPED_FILE_GET_PRIVATE(obj) \ + (G_TYPE_INSTANCE_GET_PRIVATE((obj), \ + GARROW_IO_TYPE_MEMORY_MAPPED_FILE, \ + GArrowIOMemoryMappedFilePrivate)) + +static void +garrow_io_memory_mapped_file_finalize(GObject *object) +{ + GArrowIOMemoryMappedFilePrivate *priv; + + priv = GARROW_IO_MEMORY_MAPPED_FILE_GET_PRIVATE(object); + + priv->memory_mapped_file = nullptr; + + G_OBJECT_CLASS(garrow_io_memory_mapped_file_parent_class)->finalize(object); +} + +static void +garrow_io_memory_mapped_file_set_property(GObject *object, + guint prop_id, + const GValue *value, + GParamSpec *pspec) +{ + GArrowIOMemoryMappedFilePrivate *priv; + + priv = GARROW_IO_MEMORY_MAPPED_FILE_GET_PRIVATE(object); + + switch (prop_id) { + case PROP_MEMORY_MAPPED_FILE: + priv->memory_mapped_file = + *static_cast *>(g_value_get_pointer(value)); + break; + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_io_memory_mapped_file_get_property(GObject *object, + guint prop_id, + GValue *value, + GParamSpec *pspec) +{ + switch (prop_id) { + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_io_memory_mapped_file_init(GArrowIOMemoryMappedFile *object) +{ +} + +static void +garrow_io_memory_mapped_file_class_init(GArrowIOMemoryMappedFileClass *klass) +{ + GObjectClass *gobject_class; + GParamSpec *spec; + + gobject_class = G_OBJECT_CLASS(klass); + + gobject_class->finalize = garrow_io_memory_mapped_file_finalize; + gobject_class->set_property = 
garrow_io_memory_mapped_file_set_property; + gobject_class->get_property = garrow_io_memory_mapped_file_get_property; + + spec = g_param_spec_pointer("memory-mapped-file", + "io::MemoryMappedFile", + "The raw std::shared *", + static_cast(G_PARAM_WRITABLE | + G_PARAM_CONSTRUCT_ONLY)); + g_object_class_install_property(gobject_class, PROP_MEMORY_MAPPED_FILE, spec); +} + +/** + * garrow_io_memory_mapped_file_open: + * @path: The path of the memory mapped file. + * @mode: The mode of the memory mapped file. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: (nullable) (transfer full): A newly opened + * #GArrowIOMemoryMappedFile or %NULL on error. + */ +GArrowIOMemoryMappedFile * +garrow_io_memory_mapped_file_open(const gchar *path, + GArrowIOFileMode mode, + GError **error) +{ + std::shared_ptr arrow_memory_mapped_file; + auto status = + arrow::io::MemoryMappedFile::Open(std::string(path), + garrow_io_file_mode_to_raw(mode), + &arrow_memory_mapped_file); + if (status.ok()) { + return garrow_io_memory_mapped_file_new_raw(&arrow_memory_mapped_file); + } else { + std::string context("[io][memory-mapped-file][open]: <"); + context += path; + context += ">"; + garrow_error_set(error, status, context.c_str()); + return NULL; + } +} + +G_END_DECLS + +GArrowIOMemoryMappedFile * +garrow_io_memory_mapped_file_new_raw(std::shared_ptr *arrow_memory_mapped_file) +{ + auto memory_mapped_file = + GARROW_IO_MEMORY_MAPPED_FILE(g_object_new(GARROW_IO_TYPE_MEMORY_MAPPED_FILE, + "memory-mapped-file", arrow_memory_mapped_file, + NULL)); + return memory_mapped_file; +} + +std::shared_ptr +garrow_io_memory_mapped_file_get_raw(GArrowIOMemoryMappedFile *memory_mapped_file) +{ + GArrowIOMemoryMappedFilePrivate *priv; + + priv = GARROW_IO_MEMORY_MAPPED_FILE_GET_PRIVATE(memory_mapped_file); + return priv->memory_mapped_file; +} diff --git a/c_glib/arrow-glib/io-memory-mapped-file.h b/c_glib/arrow-glib/io-memory-mapped-file.h new file mode 100644 index 0000000000000..0d2d6c2f835de --- /dev/null +++ b/c_glib/arrow-glib/io-memory-mapped-file.h @@ -0,0 +1,72 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_IO_TYPE_MEMORY_MAPPED_FILE \ + (garrow_io_memory_mapped_file_get_type()) +#define GARROW_IO_MEMORY_MAPPED_FILE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_IO_TYPE_MEMORY_MAPPED_FILE, \ + GArrowIOMemoryMappedFile)) +#define GARROW_IO_MEMORY_MAPPED_FILE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_IO_TYPE_MEMORY_MAPPED_FILE, \ + GArrowIOMemoryMappedFileClass)) +#define GARROW_IO_IS_MEMORY_MAPPED_FILE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_IO_TYPE_MEMORY_MAPPED_FILE)) +#define GARROW_IO_IS_MEMORY_MAPPED_FILE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_IO_TYPE_MEMORY_MAPPED_FILE)) +#define GARROW_IO_MEMORY_MAPPED_FILE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_IO_TYPE_MEMORY_MAPPED_FILE, \ + GArrowIOMemoryMappedFileClass)) + +typedef struct _GArrowIOMemoryMappedFile GArrowIOMemoryMappedFile; +typedef struct _GArrowIOMemoryMappedFileClass GArrowIOMemoryMappedFileClass; + +/** + * GArrowIOMemoryMappedFile: + * + * It wraps `arrow::io::MemoryMappedFile`. + */ +struct _GArrowIOMemoryMappedFile +{ + /*< private >*/ + GObject parent_instance; +}; + +struct _GArrowIOMemoryMappedFileClass +{ + GObjectClass parent_class; +}; + +GType garrow_io_memory_mapped_file_get_type(void) G_GNUC_CONST; + +GArrowIOMemoryMappedFile *garrow_io_memory_mapped_file_open(const gchar *path, + GArrowIOFileMode mode, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/io-memory-mapped-file.hpp b/c_glib/arrow-glib/io-memory-mapped-file.hpp new file mode 100644 index 0000000000000..b48e05f2f9e7b --- /dev/null +++ b/c_glib/arrow-glib/io-memory-mapped-file.hpp @@ -0,0 +1,28 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include +#include + +#include + +GArrowIOMemoryMappedFile *garrow_io_memory_mapped_file_new_raw(std::shared_ptr *arrow_memory_mapped_file); +std::shared_ptr garrow_io_memory_mapped_file_get_raw(GArrowIOMemoryMappedFile *memory_mapped_file); diff --git a/c_glib/arrow-glib/io-output-stream.cpp b/c_glib/arrow-glib/io-output-stream.cpp new file mode 100644 index 0000000000000..bdf5587ba1c07 --- /dev/null +++ b/c_glib/arrow-glib/io-output-stream.cpp @@ -0,0 +1,56 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: io-output-stream + * @title: GArrowIOOutputStream + * @short_description: Stream output interface + * + * #GArrowIOOutputStream is an interface for stream output. Stream + * output is file based and writeable + */ + +G_DEFINE_INTERFACE(GArrowIOOutputStream, + garrow_io_output_stream, + G_TYPE_OBJECT) + +static void +garrow_io_output_stream_default_init (GArrowIOOutputStreamInterface *iface) +{ +} + +G_END_DECLS + +std::shared_ptr +garrow_io_output_stream_get_raw(GArrowIOOutputStream *output_stream) +{ + auto *iface = GARROW_IO_OUTPUT_STREAM_GET_IFACE(output_stream); + return iface->get_raw(output_stream); +} diff --git a/c_glib/arrow-glib/io-output-stream.h b/c_glib/arrow-glib/io-output-stream.h new file mode 100644 index 0000000000000..c4079d50233cd --- /dev/null +++ b/c_glib/arrow-glib/io-output-stream.h @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_IO_TYPE_OUTPUT_STREAM \ + (garrow_io_output_stream_get_type()) +#define GARROW_IO_OUTPUT_STREAM(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_IO_TYPE_OUTPUT_STREAM, \ + GArrowIOOutputStreamInterface)) +#define GARROW_IO_IS_OUTPUT_STREAM(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_IO_TYPE_OUTPUT_STREAM)) +#define GARROW_IO_OUTPUT_STREAM_GET_IFACE(obj) \ + (G_TYPE_INSTANCE_GET_INTERFACE((obj), \ + GARROW_IO_TYPE_OUTPUT_STREAM, \ + GArrowIOOutputStreamInterface)) + +typedef struct _GArrowIOOutputStream GArrowIOOutputStream; +typedef struct _GArrowIOOutputStreamInterface GArrowIOOutputStreamInterface; + +GType garrow_io_output_stream_get_type(void) G_GNUC_CONST; + +G_END_DECLS diff --git a/c_glib/arrow-glib/io-output-stream.hpp b/c_glib/arrow-glib/io-output-stream.hpp new file mode 100644 index 0000000000000..f144130b1420e --- /dev/null +++ b/c_glib/arrow-glib/io-output-stream.hpp @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +#include + +/** + * GArrowIOOutputStreamInterface: + * + * It wraps `arrow::io::OutputStream`. + */ +struct _GArrowIOOutputStreamInterface +{ + GTypeInterface parent_iface; + + std::shared_ptr (*get_raw)(GArrowIOOutputStream *file); +}; + +std::shared_ptr garrow_io_output_stream_get_raw(GArrowIOOutputStream *output_stream); diff --git a/c_glib/arrow-glib/io-readable-file.cpp b/c_glib/arrow-glib/io-readable-file.cpp new file mode 100644 index 0000000000000..014fd7a1c7d32 --- /dev/null +++ b/c_glib/arrow-glib/io-readable-file.cpp @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: io-readable-file + * @title: GArrowIOReadableFile + * @short_description: File input interface + * + * #GArrowIOReadableFile is an interface for file input. + */ + +G_DEFINE_INTERFACE(GArrowIOReadableFile, + garrow_io_readable_file, + G_TYPE_OBJECT) + +static void +garrow_io_readable_file_default_init (GArrowIOReadableFileInterface *iface) +{ +} + +/** + * garrow_io_readable_file_get_size: + * @file: A #GArrowIOReadableFile. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: The size of the file. + */ +guint64 +garrow_io_readable_file_get_size(GArrowIOReadableFile *file, + GError **error) +{ + auto *iface = GARROW_IO_READABLE_FILE_GET_IFACE(file); + auto arrow_readable_file = iface->get_raw(file); + int64_t size; + + auto status = arrow_readable_file->GetSize(&size); + if (status.ok()) { + return size; + } else { + garrow_error_set(error, status, "[io][readable-file][get-size]"); + return 0; + } +} + +/** + * garrow_io_readable_file_get_support_zero_copy: + * @file: A #GArrowIOReadableFile. + * + * Returns: Whether zero copy read is supported or not. + */ +gboolean +garrow_io_readable_file_get_support_zero_copy(GArrowIOReadableFile *file) +{ + auto *iface = GARROW_IO_READABLE_FILE_GET_IFACE(file); + auto arrow_readable_file = iface->get_raw(file); + + return arrow_readable_file->supports_zero_copy(); +} + +/** + * garrow_io_readable_file_read_at: + * @file: A #GArrowIOReadableFile. 
+ * @position: The read start position. + * @n_bytes: The number of bytes to be read. + * @n_read_bytes: (out): The read number of bytes. + * @buffer: (array length=n_bytes): The buffer to be read data. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_io_readable_file_read_at(GArrowIOReadableFile *file, + gint64 position, + gint64 n_bytes, + gint64 *n_read_bytes, + guint8 *buffer, + GError **error) +{ + const auto arrow_readable_file = garrow_io_readable_file_get_raw(file); + + auto status = arrow_readable_file->ReadAt(position, + n_bytes, + n_read_bytes, + buffer); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[io][readable-file][read-at]"); + return FALSE; + } +} + +G_END_DECLS + +std::shared_ptr +garrow_io_readable_file_get_raw(GArrowIOReadableFile *readable_file) +{ + auto *iface = GARROW_IO_READABLE_FILE_GET_IFACE(readable_file); + return iface->get_raw(readable_file); +} diff --git a/c_glib/arrow-glib/io-readable-file.h b/c_glib/arrow-glib/io-readable-file.h new file mode 100644 index 0000000000000..1dcb13e04969c --- /dev/null +++ b/c_glib/arrow-glib/io-readable-file.h @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_IO_TYPE_READABLE_FILE \ + (garrow_io_readable_file_get_type()) +#define GARROW_IO_READABLE_FILE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_IO_TYPE_READABLE_FILE, \ + GArrowIOReadableFileInterface)) +#define GARROW_IO_IS_READABLE_FILE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_IO_TYPE_READABLE_FILE)) +#define GARROW_IO_READABLE_FILE_GET_IFACE(obj) \ + (G_TYPE_INSTANCE_GET_INTERFACE((obj), \ + GARROW_IO_TYPE_READABLE_FILE, \ + GArrowIOReadableFileInterface)) + +typedef struct _GArrowIOReadableFile GArrowIOReadableFile; +typedef struct _GArrowIOReadableFileInterface GArrowIOReadableFileInterface; + +GType garrow_io_readable_file_get_type(void) G_GNUC_CONST; + +guint64 garrow_io_readable_file_get_size(GArrowIOReadableFile *file, + GError **error); +gboolean garrow_io_readable_file_get_support_zero_copy(GArrowIOReadableFile *file); +gboolean garrow_io_readable_file_read_at(GArrowIOReadableFile *file, + gint64 position, + gint64 n_bytes, + gint64 *n_read_bytes, + guint8 *buffer, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/io-readable-file.hpp b/c_glib/arrow-glib/io-readable-file.hpp new file mode 100644 index 0000000000000..83d8628f48b62 --- /dev/null +++ b/c_glib/arrow-glib/io-readable-file.hpp @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. 
 See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow/io/interfaces.h>
+
+#include <arrow-glib/io-readable-file.h>
+
+/**
+ * GArrowIOReadableFileInterface:
+ *
+ * It wraps `arrow::io::ReadableFileInterface`.
+ */
+struct _GArrowIOReadableFileInterface
+{
+  GTypeInterface parent_iface;
+
+  std::shared_ptr<arrow::io::ReadableFileInterface> (*get_raw)(GArrowIOReadableFile *file);
+};
+
+std::shared_ptr<arrow::io::ReadableFileInterface> garrow_io_readable_file_get_raw(GArrowIOReadableFile *readable_file);
diff --git a/c_glib/arrow-glib/io-readable.cpp b/c_glib/arrow-glib/io-readable.cpp
new file mode 100644
index 0000000000000..b372a66090ceb
--- /dev/null
+++ b/c_glib/arrow-glib/io-readable.cpp
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+# include <config.h>
+#endif
+
+#include <arrow/io/interfaces.h>
+
+#include <arrow-glib/error.hpp>
+#include <arrow-glib/io-readable.hpp>
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: io-readable
+ * @title: GArrowIOReadable
+ * @short_description: Input interface
+ *
+ * #GArrowIOReadable is an interface for input. Input must be
+ * readable.
+ */
+
+G_DEFINE_INTERFACE(GArrowIOReadable,
+                   garrow_io_readable,
+                   G_TYPE_OBJECT)
+
+static void
+garrow_io_readable_default_init (GArrowIOReadableInterface *iface)
+{
+}
+
+/**
+ * garrow_io_readable_read:
+ * @readable: A #GArrowIOReadable.
+ * @n_bytes: The number of bytes to be read.
+ * @n_read_bytes: (out): The number of bytes actually read.
+ * @buffer: (array length=n_bytes): The buffer to read data into.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
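+ *
+ * For example (a minimal sketch; it assumes @readable is already a
+ * valid #GArrowIOReadable such as an opened
+ * #GArrowIOMemoryMappedFile):
+ *
+ * |[<!-- language="C" -->
+ * guint8 buffer[4096];
+ * gint64 n_read_bytes = 0;
+ * GError *error = NULL;
+ *
+ * if (garrow_io_readable_read(readable, sizeof(buffer),
+ *                             &n_read_bytes, buffer, &error)) {
+ *   g_print("read %" G_GINT64_FORMAT " bytes\n", n_read_bytes);
+ * } else {
+ *   g_print("failed to read: %s\n", error->message);
+ *   g_clear_error(&error);
+ * }
+ * ]|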
+ */ +gboolean +garrow_io_readable_read(GArrowIOReadable *readable, + gint64 n_bytes, + gint64 *n_read_bytes, + guint8 *buffer, + GError **error) +{ + const auto arrow_readable = garrow_io_readable_get_raw(readable); + + auto status = arrow_readable->Read(n_bytes, n_read_bytes, buffer); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[io][readable][read]"); + return FALSE; + } +} + +G_END_DECLS + +std::shared_ptr +garrow_io_readable_get_raw(GArrowIOReadable *readable) +{ + auto *iface = GARROW_IO_READABLE_GET_IFACE(readable); + return iface->get_raw(readable); +} diff --git a/c_glib/arrow-glib/io-readable.h b/c_glib/arrow-glib/io-readable.h new file mode 100644 index 0000000000000..d24b46c50df4c --- /dev/null +++ b/c_glib/arrow-glib/io-readable.h @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_IO_TYPE_READABLE \ + (garrow_io_readable_get_type()) +#define GARROW_IO_READABLE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_IO_TYPE_READABLE, \ + GArrowIOReadableInterface)) +#define GARROW_IO_IS_READABLE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_IO_TYPE_READABLE)) +#define GARROW_IO_READABLE_GET_IFACE(obj) \ + (G_TYPE_INSTANCE_GET_INTERFACE((obj), \ + GARROW_IO_TYPE_READABLE, \ + GArrowIOReadableInterface)) + +typedef struct _GArrowIOReadable GArrowIOReadable; +typedef struct _GArrowIOReadableInterface GArrowIOReadableInterface; + +GType garrow_io_readable_get_type(void) G_GNUC_CONST; + +gboolean garrow_io_readable_read(GArrowIOReadable *readable, + gint64 n_bytes, + gint64 *n_read_bytes, + guint8 *buffer, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/io-readable.hpp b/c_glib/arrow-glib/io-readable.hpp new file mode 100644 index 0000000000000..3d27b3f92ba78 --- /dev/null +++ b/c_glib/arrow-glib/io-readable.hpp @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +#pragma once + +#include + +#include + +/** + * GArrowIOReadableInterface: + * + * It wraps `arrow::io::Readable`. + */ +struct _GArrowIOReadableInterface +{ + GTypeInterface parent_iface; + + std::shared_ptr (*get_raw)(GArrowIOReadable *file); +}; + +std::shared_ptr garrow_io_readable_get_raw(GArrowIOReadable *readable); diff --git a/c_glib/arrow-glib/io-writeable-file.cpp b/c_glib/arrow-glib/io-writeable-file.cpp new file mode 100644 index 0000000000000..3de42dd60a971 --- /dev/null +++ b/c_glib/arrow-glib/io-writeable-file.cpp @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: io-writeable-file + * @title: GArrowIOWriteableFile + * @short_description: File output interface + * + * #GArrowIOWriteableFile is an interface for file output. + */ + +G_DEFINE_INTERFACE(GArrowIOWriteableFile, + garrow_io_writeable_file, + G_TYPE_OBJECT) + +static void +garrow_io_writeable_file_default_init (GArrowIOWriteableFileInterface *iface) +{ +} + +/** + * garrow_io_writeable_file_write_at: + * @writeable_file: A #GArrowIOWriteableFile. + * @position: The write start position. + * @data: (array length=n_bytes): The data to be written. + * @n_bytes: The number of bytes to be written. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_io_writeable_file_write_at(GArrowIOWriteableFile *writeable_file, + gint64 position, + const guint8 *data, + gint64 n_bytes, + GError **error) +{ + const auto arrow_writeable_file = + garrow_io_writeable_file_get_raw(writeable_file); + + auto status = arrow_writeable_file->WriteAt(position, data, n_bytes); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[io][writeable-file][write-at]"); + return FALSE; + } +} + +G_END_DECLS + +std::shared_ptr +garrow_io_writeable_file_get_raw(GArrowIOWriteableFile *writeable_file) +{ + auto *iface = GARROW_IO_WRITEABLE_FILE_GET_IFACE(writeable_file); + return iface->get_raw(writeable_file); +} diff --git a/c_glib/arrow-glib/io-writeable-file.h b/c_glib/arrow-glib/io-writeable-file.h new file mode 100644 index 0000000000000..4a4dee5111f5f --- /dev/null +++ b/c_glib/arrow-glib/io-writeable-file.h @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_IO_TYPE_WRITEABLE_FILE \ + (garrow_io_writeable_file_get_type()) +#define GARROW_IO_WRITEABLE_FILE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_IO_TYPE_WRITEABLE_FILE, \ + GArrowIOWriteableFileInterface)) +#define GARROW_IO_IS_WRITEABLE_FILE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_IO_TYPE_WRITEABLE_FILE)) +#define GARROW_IO_WRITEABLE_FILE_GET_IFACE(obj) \ + (G_TYPE_INSTANCE_GET_INTERFACE((obj), \ + GARROW_IO_TYPE_WRITEABLE_FILE, \ + GArrowIOWriteableFileInterface)) + +typedef struct _GArrowIOWriteableFile GArrowIOWriteableFile; +typedef struct _GArrowIOWriteableFileInterface GArrowIOWriteableFileInterface; + +GType garrow_io_writeable_file_get_type(void) G_GNUC_CONST; + +gboolean garrow_io_writeable_file_write_at(GArrowIOWriteableFile *writeable_file, + gint64 position, + const guint8 *data, + gint64 n_bytes, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/io-writeable-file.hpp b/c_glib/arrow-glib/io-writeable-file.hpp new file mode 100644 index 0000000000000..2043007ad58e3 --- /dev/null +++ b/c_glib/arrow-glib/io-writeable-file.hpp @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +#include + +/** + * GArrowIOWriteableFileInterface: + * + * It wraps `arrow::io::WriteableFileInterface`. + */ +struct _GArrowIOWriteableFileInterface +{ + GTypeInterface parent_iface; + + std::shared_ptr (*get_raw)(GArrowIOWriteableFile *file); +}; + +std::shared_ptr garrow_io_writeable_file_get_raw(GArrowIOWriteableFile *writeable_file); diff --git a/c_glib/arrow-glib/io-writeable.cpp b/c_glib/arrow-glib/io-writeable.cpp new file mode 100644 index 0000000000000..9ea69e3adccde --- /dev/null +++ b/c_glib/arrow-glib/io-writeable.cpp @@ -0,0 +1,106 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: io-writeable + * @title: GArrowIOWriteable + * @short_description: Output interface + * + * #GArrowIOWriteable is an interface for output. Output must be + * writeable. + */ + +G_DEFINE_INTERFACE(GArrowIOWriteable, + garrow_io_writeable, + G_TYPE_OBJECT) + +static void +garrow_io_writeable_default_init (GArrowIOWriteableInterface *iface) +{ +} + +/** + * garrow_io_writeable_write: + * @writeable: A #GArrowIOWriteable. + * @data: (array length=n_bytes): The data to be written. + * @n_bytes: The number of bytes to be written. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_io_writeable_write(GArrowIOWriteable *writeable, + const guint8 *data, + gint64 n_bytes, + GError **error) +{ + const auto arrow_writeable = garrow_io_writeable_get_raw(writeable); + + auto status = arrow_writeable->Write(data, n_bytes); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[io][writeable][write]"); + return FALSE; + } +} + +/** + * garrow_io_writeable_flush: + * @writeable: A #GArrowIOWriteable. + * @error: (nullable): Return location for a #GError or %NULL. + * + * It ensures writing all data on memory to storage. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_io_writeable_flush(GArrowIOWriteable *writeable, + GError **error) +{ + const auto arrow_writeable = garrow_io_writeable_get_raw(writeable); + + auto status = arrow_writeable->Flush(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[io][writeable][flush]"); + return FALSE; + } +} + +G_END_DECLS + +std::shared_ptr +garrow_io_writeable_get_raw(GArrowIOWriteable *writeable) +{ + auto *iface = GARROW_IO_WRITEABLE_GET_IFACE(writeable); + return iface->get_raw(writeable); +} diff --git a/c_glib/arrow-glib/io-writeable.h b/c_glib/arrow-glib/io-writeable.h new file mode 100644 index 0000000000000..f5c5e9129f8be --- /dev/null +++ b/c_glib/arrow-glib/io-writeable.h @@ -0,0 +1,52 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_IO_TYPE_WRITEABLE \ + (garrow_io_writeable_get_type()) +#define GARROW_IO_WRITEABLE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_IO_TYPE_WRITEABLE, \ + GArrowIOWriteableInterface)) +#define GARROW_IO_IS_WRITEABLE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_IO_TYPE_WRITEABLE)) +#define GARROW_IO_WRITEABLE_GET_IFACE(obj) \ + (G_TYPE_INSTANCE_GET_INTERFACE((obj), \ + GARROW_IO_TYPE_WRITEABLE, \ + GArrowIOWriteableInterface)) + +typedef struct _GArrowIOWriteable GArrowIOWriteable; +typedef struct _GArrowIOWriteableInterface GArrowIOWriteableInterface; + +GType garrow_io_writeable_get_type(void) G_GNUC_CONST; + +gboolean garrow_io_writeable_write(GArrowIOWriteable *writeable, + const guint8 *data, + gint64 n_bytes, + GError **error); +gboolean garrow_io_writeable_flush(GArrowIOWriteable *writeable, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/io-writeable.hpp b/c_glib/arrow-glib/io-writeable.hpp new file mode 100644 index 0000000000000..f833924a61ae8 --- /dev/null +++ b/c_glib/arrow-glib/io-writeable.hpp @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +#include + +/** + * GArrowIOWriteableInterface: + * + * It wraps `arrow::io::Writeable`. + */ +struct _GArrowIOWriteableInterface +{ + GTypeInterface parent_iface; + + std::shared_ptr (*get_raw)(GArrowIOWriteable *file); +}; + +std::shared_ptr garrow_io_writeable_get_raw(GArrowIOWriteable *writeable); diff --git a/c_glib/arrow-glib/ipc-enums.c.template b/c_glib/arrow-glib/ipc-enums.c.template new file mode 100644 index 0000000000000..c938f77477172 --- /dev/null +++ b/c_glib/arrow-glib/ipc-enums.c.template @@ -0,0 +1,56 @@ +/*** BEGIN file-header ***/ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +/*** END file-header ***/ + +/*** BEGIN file-production ***/ + +/* enumerations from "@filename@" */ +/*** END file-production ***/ + +/*** BEGIN value-header ***/ +GType +@enum_name@_get_type(void) +{ + static GType etype = 0; + if (G_UNLIKELY(etype == 0)) { + static const G@Type@Value values[] = { +/*** END value-header ***/ + +/*** BEGIN value-production ***/ + {@VALUENAME@, "@VALUENAME@", "@valuenick@"}, +/*** END value-production ***/ + +/*** BEGIN value-tail ***/ + {0, NULL, NULL} + }; + etype = g_@type@_register_static(g_intern_static_string("@EnumName@"), values); + } + return etype; +} +/*** END value-tail ***/ + +/*** BEGIN file-tail ***/ +/*** END file-tail ***/ diff --git a/c_glib/arrow-glib/ipc-enums.h.template b/c_glib/arrow-glib/ipc-enums.h.template new file mode 100644 index 0000000000000..e103c5bfeb985 --- /dev/null +++ b/c_glib/arrow-glib/ipc-enums.h.template @@ -0,0 +1,41 @@ +/*** BEGIN file-header ***/ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS +/*** END file-header ***/ + +/*** BEGIN file-production ***/ + +/* enumerations from "@filename@" */ +/*** END file-production ***/ + +/*** BEGIN value-header ***/ +GType @enum_name@_get_type(void) G_GNUC_CONST; +#define @ENUMPREFIX@_TYPE_@ENUMSHORT@ (@enum_name@_get_type()) +/*** END value-header ***/ + +/*** BEGIN file-tail ***/ + +G_END_DECLS +/*** END file-tail ***/ diff --git a/c_glib/arrow-glib/ipc-file-reader.cpp b/c_glib/arrow-glib/ipc-file-reader.cpp new file mode 100644 index 0000000000000..b9e408c4e9464 --- /dev/null +++ b/c_glib/arrow-glib/ipc-file-reader.cpp @@ -0,0 +1,247 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
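+ */
+
+/* The two templates above are input for glib-mkenums. For the
+ * #GArrowIPCMetadataVersion enum introduced later in this patch they
+ * would expand to roughly the following (a sketch, not the generated
+ * output):
+ *
+ * |[
+ * GType
+ * garrow_ipc_metadata_version_get_type(void)
+ * {
+ *   static GType etype = 0;
+ *   if (G_UNLIKELY(etype == 0)) {
+ *     static const GEnumValue values[] = {
+ *       {GARROW_IPC_METADATA_VERSION_V1, "GARROW_IPC_METADATA_VERSION_V1", "v1"},
+ *       {GARROW_IPC_METADATA_VERSION_V2, "GARROW_IPC_METADATA_VERSION_V2", "v2"},
+ *       {0, NULL, NULL}
+ *     };
+ *     etype = g_enum_register_static(g_intern_static_string("GArrowIPCMetadataVersion"),
+ *                                    values);
+ *   }
+ *   return etype;
+ * }
+ * ]|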
+ */
+
+#ifdef HAVE_CONFIG_H
+#  include <config.h>
+#endif
+
+#include
+
+#include <arrow-glib/error.hpp>
+#include <arrow-glib/record-batch.hpp>
+#include <arrow-glib/schema.hpp>
+
+#include <arrow-glib/io-readable-file.hpp>
+
+#include <arrow-glib/ipc-file-reader.hpp>
+#include <arrow-glib/ipc-metadata-version.hpp>
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: ipc-file-reader
+ * @short_description: File reader class
+ *
+ * #GArrowIPCFileReader is a class for receiving data by file-based
+ * IPC.
+ */
+
+typedef struct GArrowIPCFileReaderPrivate_ {
+  std::shared_ptr<arrow::ipc::FileReader> file_reader;
+} GArrowIPCFileReaderPrivate;
+
+enum {
+  PROP_0,
+  PROP_FILE_READER
+};
+
+G_DEFINE_TYPE_WITH_PRIVATE(GArrowIPCFileReader,
+                           garrow_ipc_file_reader,
+                           G_TYPE_OBJECT);
+
+#define GARROW_IPC_FILE_READER_GET_PRIVATE(obj)                 \
+  (G_TYPE_INSTANCE_GET_PRIVATE((obj),                           \
+                               GARROW_IPC_TYPE_FILE_READER,     \
+                               GArrowIPCFileReaderPrivate))
+
+static void
+garrow_ipc_file_reader_finalize(GObject *object)
+{
+  GArrowIPCFileReaderPrivate *priv;
+
+  priv = GARROW_IPC_FILE_READER_GET_PRIVATE(object);
+
+  priv->file_reader = nullptr;
+
+  G_OBJECT_CLASS(garrow_ipc_file_reader_parent_class)->finalize(object);
+}
+
+static void
+garrow_ipc_file_reader_set_property(GObject *object,
+                                    guint prop_id,
+                                    const GValue *value,
+                                    GParamSpec *pspec)
+{
+  GArrowIPCFileReaderPrivate *priv;
+
+  priv = GARROW_IPC_FILE_READER_GET_PRIVATE(object);
+
+  switch (prop_id) {
+  case PROP_FILE_READER:
+    priv->file_reader =
+      *static_cast<std::shared_ptr<arrow::ipc::FileReader> *>(g_value_get_pointer(value));
+    break;
+  default:
+    G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec);
+    break;
+  }
+}
+
+static void
+garrow_ipc_file_reader_get_property(GObject *object,
+                                    guint prop_id,
+                                    GValue *value,
+                                    GParamSpec *pspec)
+{
+  switch (prop_id) {
+  default:
+    G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec);
+    break;
+  }
+}
+
+static void
+garrow_ipc_file_reader_init(GArrowIPCFileReader *object)
+{
+}
+
+static void
+garrow_ipc_file_reader_class_init(GArrowIPCFileReaderClass *klass)
+{
+  GObjectClass *gobject_class;
+  GParamSpec *spec;
+
+  gobject_class = G_OBJECT_CLASS(klass);
+
+  gobject_class->finalize = garrow_ipc_file_reader_finalize;
+  gobject_class->set_property = garrow_ipc_file_reader_set_property;
+  gobject_class->get_property = garrow_ipc_file_reader_get_property;
+
+  spec = g_param_spec_pointer("file-reader",
+                              "ipc::FileReader",
+                              "The raw std::shared_ptr<arrow::ipc::FileReader> *",
+                              static_cast<GParamFlags>(G_PARAM_WRITABLE |
+                                                       G_PARAM_CONSTRUCT_ONLY));
+  g_object_class_install_property(gobject_class, PROP_FILE_READER, spec);
+}
+
+/**
+ * garrow_ipc_file_reader_open:
+ * @file: The file to be read.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: (nullable) (transfer full): A newly opened
+ *   #GArrowIPCFileReader or %NULL on error.
+ */
+GArrowIPCFileReader *
+garrow_ipc_file_reader_open(GArrowIOReadableFile *file,
+                            GError **error)
+{
+  std::shared_ptr<arrow::ipc::FileReader> arrow_file_reader;
+  auto status =
+    arrow::ipc::FileReader::Open(garrow_io_readable_file_get_raw(file),
+                                 &arrow_file_reader);
+  if (status.ok()) {
+    return garrow_ipc_file_reader_new_raw(&arrow_file_reader);
+  } else {
+    garrow_error_set(error, status, "[ipc][file-reader][open]");
+    return NULL;
+  }
+}
+
+/**
+ * garrow_ipc_file_reader_get_schema:
+ * @file_reader: A #GArrowIPCFileReader.
+ *
+ * Returns: (transfer full): The schema in the file.
+ */
+GArrowSchema *
+garrow_ipc_file_reader_get_schema(GArrowIPCFileReader *file_reader)
+{
+  auto arrow_file_reader =
+    garrow_ipc_file_reader_get_raw(file_reader);
+  auto arrow_schema = arrow_file_reader->schema();
+  return garrow_schema_new_raw(&arrow_schema);
+}
+
+/**
+ * garrow_ipc_file_reader_get_n_record_batches:
+ * @file_reader: A #GArrowIPCFileReader.
+ *
+ * Returns: The number of record batches in the file.
+ */
+guint
+garrow_ipc_file_reader_get_n_record_batches(GArrowIPCFileReader *file_reader)
+{
+  auto arrow_file_reader =
+    garrow_ipc_file_reader_get_raw(file_reader);
+  return arrow_file_reader->num_record_batches();
+}
+
+/**
+ * garrow_ipc_file_reader_get_version:
+ * @file_reader: A #GArrowIPCFileReader.
+ *
+ * Returns: The format version in the file.
+ */
+GArrowIPCMetadataVersion
+garrow_ipc_file_reader_get_version(GArrowIPCFileReader *file_reader)
+{
+  auto arrow_file_reader =
+    garrow_ipc_file_reader_get_raw(file_reader);
+  auto arrow_version = arrow_file_reader->version();
+  return garrow_ipc_metadata_version_from_raw(arrow_version);
+}
+
+/**
+ * garrow_ipc_file_reader_get_record_batch:
+ * @file_reader: A #GArrowIPCFileReader.
+ * @i: The index of the target record batch.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: (nullable) (transfer full):
+ *   The i-th record batch in the file or %NULL on error.
+ */
+GArrowRecordBatch *
+garrow_ipc_file_reader_get_record_batch(GArrowIPCFileReader *file_reader,
+                                        guint i,
+                                        GError **error)
+{
+  auto arrow_file_reader =
+    garrow_ipc_file_reader_get_raw(file_reader);
+  std::shared_ptr<arrow::RecordBatch> arrow_record_batch;
+  auto status = arrow_file_reader->GetRecordBatch(i, &arrow_record_batch);
+
+  if (status.ok()) {
+    return garrow_record_batch_new_raw(&arrow_record_batch);
+  } else {
+    garrow_error_set(error, status, "[ipc][file-reader][get-record-batch]");
+    return NULL;
+  }
+}
+
+G_END_DECLS
+
+GArrowIPCFileReader *
+garrow_ipc_file_reader_new_raw(std::shared_ptr<arrow::ipc::FileReader> *arrow_file_reader)
+{
+  auto file_reader =
+    GARROW_IPC_FILE_READER(g_object_new(GARROW_IPC_TYPE_FILE_READER,
+                                        "file-reader", arrow_file_reader,
+                                        NULL));
+  return file_reader;
+}
+
+std::shared_ptr<arrow::ipc::FileReader>
+garrow_ipc_file_reader_get_raw(GArrowIPCFileReader *file_reader)
+{
+  GArrowIPCFileReaderPrivate *priv;
+
+  priv = GARROW_IPC_FILE_READER_GET_PRIVATE(file_reader);
+  return priv->file_reader;
+}
diff --git a/c_glib/arrow-glib/ipc-file-reader.h b/c_glib/arrow-glib/ipc-file-reader.h
new file mode 100644
index 0000000000000..22915f8ae6e68
--- /dev/null
+++ b/c_glib/arrow-glib/ipc-file-reader.h
@@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
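+ */
+
+/* A minimal sketch of reading with the file reader implemented above,
+ * assuming `input` is an already-opened #GArrowIOReadableFile:
+ *
+ * |[
+ * GError *error = NULL;
+ * GArrowIPCFileReader *reader;
+ *
+ * reader = garrow_ipc_file_reader_open(input, &error);
+ * if (reader) {
+ *   guint i, n = garrow_ipc_file_reader_get_n_record_batches(reader);
+ *   for (i = 0; i < n; i++) {
+ *     GArrowRecordBatch *record_batch =
+ *       garrow_ipc_file_reader_get_record_batch(reader, i, &error);
+ *     // Use record_batch...
+ *     g_object_unref(record_batch);
+ *   }
+ *   g_object_unref(reader);
+ * } else {
+ *   g_print("open failed: %s\n", error->message);
+ *   g_error_free(error);
+ * }
+ * ]|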
+ */ + +#pragma once + +#include +#include + +#include + +#include + +G_BEGIN_DECLS + +#define GARROW_IPC_TYPE_FILE_READER \ + (garrow_ipc_file_reader_get_type()) +#define GARROW_IPC_FILE_READER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_IPC_TYPE_FILE_READER, \ + GArrowIPCFileReader)) +#define GARROW_IPC_FILE_READER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_IPC_TYPE_FILE_READER, \ + GArrowIPCFileReaderClass)) +#define GARROW_IPC_IS_FILE_READER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_IPC_TYPE_FILE_READER)) +#define GARROW_IPC_IS_FILE_READER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_IPC_TYPE_FILE_READER)) +#define GARROW_IPC_FILE_READER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_IPC_TYPE_FILE_READER, \ + GArrowIPCFileReaderClass)) + +typedef struct _GArrowIPCFileReader GArrowIPCFileReader; +typedef struct _GArrowIPCFileReaderClass GArrowIPCFileReaderClass; + +/** + * GArrowIPCFileReader: + * + * It wraps `arrow::ipc::FileReader`. + */ +struct _GArrowIPCFileReader +{ + /*< private >*/ + GObject parent_instance; +}; + +struct _GArrowIPCFileReaderClass +{ + GObjectClass parent_class; +}; + +GType garrow_ipc_file_reader_get_type(void) G_GNUC_CONST; + +GArrowIPCFileReader *garrow_ipc_file_reader_open(GArrowIOReadableFile *file, + GError **error); + +GArrowSchema *garrow_ipc_file_reader_get_schema(GArrowIPCFileReader *file_reader); +guint garrow_ipc_file_reader_get_n_record_batches(GArrowIPCFileReader *file_reader); +GArrowIPCMetadataVersion garrow_ipc_file_reader_get_version(GArrowIPCFileReader *file_reader); +GArrowRecordBatch *garrow_ipc_file_reader_get_record_batch(GArrowIPCFileReader *file_reader, + guint i, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/ipc-file-reader.hpp b/c_glib/arrow-glib/ipc-file-reader.hpp new file mode 100644 index 0000000000000..66cd45d51ddf5 --- /dev/null +++ b/c_glib/arrow-glib/ipc-file-reader.hpp @@ -0,0 +1,28 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include +#include + +#include + +GArrowIPCFileReader *garrow_ipc_file_reader_new_raw(std::shared_ptr *arrow_file_reader); +std::shared_ptr garrow_ipc_file_reader_get_raw(GArrowIPCFileReader *file_reader); diff --git a/c_glib/arrow-glib/ipc-file-writer.cpp b/c_glib/arrow-glib/ipc-file-writer.cpp new file mode 100644 index 0000000000000..d8b3c2e72fa31 --- /dev/null +++ b/c_glib/arrow-glib/ipc-file-writer.cpp @@ -0,0 +1,158 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+#  include <config.h>
+#endif
+
+#include
+
+#include
+#include
+#include
+#include
+
+#include
+
+#include
+#include
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: ipc-file-writer
+ * @short_description: File writer class
+ *
+ * #GArrowIPCFileWriter is a class for sending data by file-based
+ * IPC.
+ */
+
+G_DEFINE_TYPE(GArrowIPCFileWriter,
+              garrow_ipc_file_writer,
+              GARROW_IPC_TYPE_STREAM_WRITER);
+
+static void
+garrow_ipc_file_writer_init(GArrowIPCFileWriter *object)
+{
+}
+
+static void
+garrow_ipc_file_writer_class_init(GArrowIPCFileWriterClass *klass)
+{
+}
+
+/**
+ * garrow_ipc_file_writer_open:
+ * @sink: The output of the writer.
+ * @schema: The schema of the writer.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: (nullable) (transfer full): A newly opened
+ *   #GArrowIPCFileWriter or %NULL on error.
+ */
+GArrowIPCFileWriter *
+garrow_ipc_file_writer_open(GArrowIOOutputStream *sink,
+                            GArrowSchema *schema,
+                            GError **error)
+{
+  std::shared_ptr<arrow::ipc::FileWriter> arrow_file_writer;
+  auto status =
+    arrow::ipc::FileWriter::Open(garrow_io_output_stream_get_raw(sink).get(),
+                                 garrow_schema_get_raw(schema),
+                                 &arrow_file_writer);
+  if (status.ok()) {
+    return garrow_ipc_file_writer_new_raw(&arrow_file_writer);
+  } else {
+    garrow_error_set(error, status, "[ipc][file-writer][open]");
+    return NULL;
+  }
+}
+
+/**
+ * garrow_ipc_file_writer_write_record_batch:
+ * @file_writer: A #GArrowIPCFileWriter.
+ * @record_batch: The record batch to be written.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ */
+gboolean
+garrow_ipc_file_writer_write_record_batch(GArrowIPCFileWriter *file_writer,
+                                          GArrowRecordBatch *record_batch,
+                                          GError **error)
+{
+  auto arrow_file_writer =
+    garrow_ipc_file_writer_get_raw(file_writer);
+  auto arrow_record_batch =
+    garrow_record_batch_get_raw(record_batch);
+  auto arrow_record_batch_raw =
+    arrow_record_batch.get();
+
+  auto status = arrow_file_writer->WriteRecordBatch(*arrow_record_batch_raw);
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[ipc][file-writer][write-record-batch]");
+    return FALSE;
+  }
+}
+
+/**
+ * garrow_ipc_file_writer_close:
+ * @file_writer: A #GArrowIPCFileWriter.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
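+ *
+ * A typical write sequence with the functions above, assuming `output`
+ * is an already-opened #GArrowIOOutputStream and that `schema` and
+ * `record_batch` already exist:
+ *
+ * |[
+ * GError *error = NULL;
+ * GArrowIPCFileWriter *writer;
+ *
+ * writer = garrow_ipc_file_writer_open(output, schema, &error);
+ * if (writer) {
+ *   garrow_ipc_file_writer_write_record_batch(writer, record_batch, &error);
+ *   garrow_ipc_file_writer_close(writer, &error);
+ *   g_object_unref(writer);
+ * }
+ * ]|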
+ */ +gboolean +garrow_ipc_file_writer_close(GArrowIPCFileWriter *file_writer, + GError **error) +{ + auto arrow_file_writer = + garrow_ipc_file_writer_get_raw(file_writer); + + auto status = arrow_file_writer->Close(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[ipc][file-writer][close]"); + return FALSE; + } +} + +G_END_DECLS + +GArrowIPCFileWriter * +garrow_ipc_file_writer_new_raw(std::shared_ptr *arrow_file_writer) +{ + auto file_writer = + GARROW_IPC_FILE_WRITER(g_object_new(GARROW_IPC_TYPE_FILE_WRITER, + "stream-writer", arrow_file_writer, + NULL)); + return file_writer; +} + +arrow::ipc::FileWriter * +garrow_ipc_file_writer_get_raw(GArrowIPCFileWriter *file_writer) +{ + auto arrow_stream_writer = + garrow_ipc_stream_writer_get_raw(GARROW_IPC_STREAM_WRITER(file_writer)); + auto arrow_file_writer_raw = + dynamic_cast(arrow_stream_writer.get()); + return arrow_file_writer_raw; +} diff --git a/c_glib/arrow-glib/ipc-file-writer.h b/c_glib/arrow-glib/ipc-file-writer.h new file mode 100644 index 0000000000000..732d9426aec8e --- /dev/null +++ b/c_glib/arrow-glib/ipc-file-writer.h @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_IPC_TYPE_FILE_WRITER \ + (garrow_ipc_file_writer_get_type()) +#define GARROW_IPC_FILE_WRITER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_IPC_TYPE_FILE_WRITER, \ + GArrowIPCFileWriter)) +#define GARROW_IPC_FILE_WRITER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_IPC_TYPE_FILE_WRITER, \ + GArrowIPCFileWriterClass)) +#define GARROW_IPC_IS_FILE_WRITER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_IPC_TYPE_FILE_WRITER)) +#define GARROW_IPC_IS_FILE_WRITER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_IPC_TYPE_FILE_WRITER)) +#define GARROW_IPC_FILE_WRITER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_IPC_TYPE_FILE_WRITER, \ + GArrowIPCFileWriterClass)) + +typedef struct _GArrowIPCFileWriter GArrowIPCFileWriter; +typedef struct _GArrowIPCFileWriterClass GArrowIPCFileWriterClass; + +/** + * GArrowIPCFileWriter: + * + * It wraps `arrow::ipc::FileWriter`. 
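+ *
+ * A file writer is-a #GArrowIPCStreamWriter, so the usual GObject cast
+ * applies (a sketch, assuming `writer` is the file writer opened in the
+ * example above):
+ *
+ * |[
+ * GArrowIPCStreamWriter *stream_writer = GARROW_IPC_STREAM_WRITER(writer);
+ * ]|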
+ */
+struct _GArrowIPCFileWriter
+{
+  /*< private >*/
+  GArrowIPCStreamWriter parent_instance;
+};
+
+struct _GArrowIPCFileWriterClass
+{
+  GArrowIPCStreamWriterClass parent_class;
+};
+
+GType garrow_ipc_file_writer_get_type(void) G_GNUC_CONST;
+
+GArrowIPCFileWriter *garrow_ipc_file_writer_open(GArrowIOOutputStream *sink,
+                                                 GArrowSchema *schema,
+                                                 GError **error);
+
+gboolean garrow_ipc_file_writer_write_record_batch(GArrowIPCFileWriter *file_writer,
+                                                   GArrowRecordBatch *record_batch,
+                                                   GError **error);
+gboolean garrow_ipc_file_writer_close(GArrowIPCFileWriter *file_writer,
+                                      GError **error);
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/ipc-file-writer.hpp b/c_glib/arrow-glib/ipc-file-writer.hpp
new file mode 100644
index 0000000000000..b8ae1137a99ad
--- /dev/null
+++ b/c_glib/arrow-glib/ipc-file-writer.hpp
@@ -0,0 +1,28 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include
+#include
+
+#include <arrow-glib/ipc-file-writer.h>
+
+GArrowIPCFileWriter *garrow_ipc_file_writer_new_raw(std::shared_ptr<arrow::ipc::FileWriter> *arrow_file_writer);
+arrow::ipc::FileWriter *garrow_ipc_file_writer_get_raw(GArrowIPCFileWriter *file_writer);
diff --git a/c_glib/arrow-glib/ipc-metadata-version.cpp b/c_glib/arrow-glib/ipc-metadata-version.cpp
new file mode 100644
index 0000000000000..c5cc8d379843c
--- /dev/null
+++ b/c_glib/arrow-glib/ipc-metadata-version.cpp
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+#  include <config.h>
+#endif
+
+#include <arrow-glib/ipc-metadata-version.hpp>
+
+/**
+ * SECTION: ipc-metadata-version
+ * @title: GArrowIPCMetadataVersion
+ * @short_description: Metadata version mapping between Arrow and arrow-glib
+ *
+ * #GArrowIPCMetadataVersion provides metadata versions corresponding
+ * to `arrow::ipc::MetadataVersion::type` values.
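+ *
+ * The mapping is total in both directions and unknown inputs fall back
+ * to V2, so the round trip below always holds (a sketch):
+ *
+ * |[
+ * GArrowIPCMetadataVersion version = GARROW_IPC_METADATA_VERSION_V1;
+ * g_assert(garrow_ipc_metadata_version_from_raw(
+ *            garrow_ipc_metadata_version_to_raw(version)) == version);
+ * ]|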
+ */ + +GArrowIPCMetadataVersion +garrow_ipc_metadata_version_from_raw(arrow::ipc::MetadataVersion::type version) +{ + switch (version) { + case arrow::ipc::MetadataVersion::type::V1: + return GARROW_IPC_METADATA_VERSION_V1; + case arrow::ipc::MetadataVersion::type::V2: + return GARROW_IPC_METADATA_VERSION_V2; + default: + return GARROW_IPC_METADATA_VERSION_V2; + } +} + +arrow::ipc::MetadataVersion::type +garrow_ipc_metadata_version_to_raw(GArrowIPCMetadataVersion version) +{ + switch (version) { + case GARROW_IPC_METADATA_VERSION_V1: + return arrow::ipc::MetadataVersion::type::V1; + case GARROW_IPC_METADATA_VERSION_V2: + return arrow::ipc::MetadataVersion::type::V2; + default: + return arrow::ipc::MetadataVersion::type::V2; + } +} diff --git a/c_glib/arrow-glib/ipc-metadata-version.h b/c_glib/arrow-glib/ipc-metadata-version.h new file mode 100644 index 0000000000000..ccfd52a81639f --- /dev/null +++ b/c_glib/arrow-glib/ipc-metadata-version.h @@ -0,0 +1,39 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +/** + * GArrowIPCMetadataVersion: + * @GARROW_IPC_METADATA_VERSION_V1: Version 1. + * @GARROW_IPC_METADATA_VERSION_V2: Version 2. + * + * They are corresponding to `arrow::ipc::MetadataVersion::type` + * values. + */ +typedef enum { + GARROW_IPC_METADATA_VERSION_V1, + GARROW_IPC_METADATA_VERSION_V2 +} GArrowIPCMetadataVersion; + +G_END_DECLS diff --git a/c_glib/arrow-glib/ipc-metadata-version.hpp b/c_glib/arrow-glib/ipc-metadata-version.hpp new file mode 100644 index 0000000000000..2a7e8cffa8917 --- /dev/null +++ b/c_glib/arrow-glib/ipc-metadata-version.hpp @@ -0,0 +1,27 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
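+ */
+
+/* A sketch of checking the format version through the functions
+ * declared above, assuming `reader` is an opened #GArrowIPCFileReader:
+ *
+ * |[
+ * switch (garrow_ipc_file_reader_get_version(reader)) {
+ * case GARROW_IPC_METADATA_VERSION_V1:
+ *   g_print("metadata version: V1\n");
+ *   break;
+ * case GARROW_IPC_METADATA_VERSION_V2:
+ *   g_print("metadata version: V2\n");
+ *   break;
+ * }
+ * ]|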
+ */ + +#pragma once + +#include + +#include + +GArrowIPCMetadataVersion garrow_ipc_metadata_version_from_raw(arrow::ipc::MetadataVersion::type version); +arrow::ipc::MetadataVersion::type garrow_ipc_metadata_version_to_raw(GArrowIPCMetadataVersion version); diff --git a/c_glib/arrow-glib/ipc-stream-reader.cpp b/c_glib/arrow-glib/ipc-stream-reader.cpp new file mode 100644 index 0000000000000..48047842aaac6 --- /dev/null +++ b/c_glib/arrow-glib/ipc-stream-reader.cpp @@ -0,0 +1,221 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include + +#include +#include +#include + +#include + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: ipc-stream-reader + * @short_description: Stream reader class + * + * #GArrowIPCStreamReader is a class for receiving data by stream + * based IPC. + */ + +typedef struct GArrowIPCStreamReaderPrivate_ { + std::shared_ptr stream_reader; +} GArrowIPCStreamReaderPrivate; + +enum { + PROP_0, + PROP_STREAM_READER +}; + +G_DEFINE_TYPE_WITH_PRIVATE(GArrowIPCStreamReader, + garrow_ipc_stream_reader, + G_TYPE_OBJECT); + +#define GARROW_IPC_STREAM_READER_GET_PRIVATE(obj) \ + (G_TYPE_INSTANCE_GET_PRIVATE((obj), \ + GARROW_IPC_TYPE_STREAM_READER, \ + GArrowIPCStreamReaderPrivate)) + +static void +garrow_ipc_stream_reader_finalize(GObject *object) +{ + GArrowIPCStreamReaderPrivate *priv; + + priv = GARROW_IPC_STREAM_READER_GET_PRIVATE(object); + + priv->stream_reader = nullptr; + + G_OBJECT_CLASS(garrow_ipc_stream_reader_parent_class)->finalize(object); +} + +static void +garrow_ipc_stream_reader_set_property(GObject *object, + guint prop_id, + const GValue *value, + GParamSpec *pspec) +{ + GArrowIPCStreamReaderPrivate *priv; + + priv = GARROW_IPC_STREAM_READER_GET_PRIVATE(object); + + switch (prop_id) { + case PROP_STREAM_READER: + priv->stream_reader = + *static_cast *>(g_value_get_pointer(value)); + break; + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_ipc_stream_reader_get_property(GObject *object, + guint prop_id, + GValue *value, + GParamSpec *pspec) +{ + switch (prop_id) { + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_ipc_stream_reader_init(GArrowIPCStreamReader *object) +{ +} + +static void +garrow_ipc_stream_reader_class_init(GArrowIPCStreamReaderClass *klass) +{ + GObjectClass *gobject_class; + GParamSpec *spec; + + gobject_class = G_OBJECT_CLASS(klass); + + gobject_class->finalize = garrow_ipc_stream_reader_finalize; + gobject_class->set_property = garrow_ipc_stream_reader_set_property; + gobject_class->get_property = garrow_ipc_stream_reader_get_property; + + spec = g_param_spec_pointer("stream-reader", + "ipc::StreamReader", 
+ "The raw std::shared *", + static_cast(G_PARAM_WRITABLE | + G_PARAM_CONSTRUCT_ONLY)); + g_object_class_install_property(gobject_class, PROP_STREAM_READER, spec); +} + +/** + * garrow_ipc_stream_reader_open: + * @stream: The stream to be read. + * @error: (nullable): Return locatipcn for a #GError or %NULL. + * + * Returns: (nullable) (transfer full): A newly opened + * #GArrowIPCStreamReader or %NULL on error. + */ +GArrowIPCStreamReader * +garrow_ipc_stream_reader_open(GArrowIOInputStream *stream, + GError **error) +{ + std::shared_ptr arrow_stream_reader; + auto status = + arrow::ipc::StreamReader::Open(garrow_io_input_stream_get_raw(stream), + &arrow_stream_reader); + if (status.ok()) { + return garrow_ipc_stream_reader_new_raw(&arrow_stream_reader); + } else { + garrow_error_set(error, status, "[ipc][stream-reader][open]"); + return NULL; + } +} + +/** + * garrow_ipc_stream_reader_get_schema: + * @stream_reader: A #GArrowIPCStreamReader. + * + * Returns: (transfer full): The schema in the stream. + */ +GArrowSchema * +garrow_ipc_stream_reader_get_schema(GArrowIPCStreamReader *stream_reader) +{ + auto arrow_stream_reader = + garrow_ipc_stream_reader_get_raw(stream_reader); + auto arrow_schema = arrow_stream_reader->schema(); + return garrow_schema_new_raw(&arrow_schema); +} + +/** + * garrow_ipc_stream_reader_get_next_record_batch: + * @stream_reader: A #GArrowIPCStreamReader. + * @error: (nullable): Return locatipcn for a #GError or %NULL. + * + * Returns: (nullable) (transfer full): + * The next record batch in the stream or %NULL on end of stream. + */ +GArrowRecordBatch * +garrow_ipc_stream_reader_get_next_record_batch(GArrowIPCStreamReader *stream_reader, + GError **error) +{ + auto arrow_stream_reader = + garrow_ipc_stream_reader_get_raw(stream_reader); + std::shared_ptr arrow_record_batch; + auto status = arrow_stream_reader->GetNextRecordBatch(&arrow_record_batch); + + if (status.ok()) { + if (arrow_record_batch == nullptr) { + return NULL; + } else { + return garrow_record_batch_new_raw(&arrow_record_batch); + } + } else { + garrow_error_set(error, status, "[ipc][stream-reader][get-next-record-batch]"); + return NULL; + } +} + +G_END_DECLS + +GArrowIPCStreamReader * +garrow_ipc_stream_reader_new_raw(std::shared_ptr *arrow_stream_reader) +{ + auto stream_reader = + GARROW_IPC_STREAM_READER(g_object_new(GARROW_IPC_TYPE_STREAM_READER, + "stream-reader", arrow_stream_reader, + NULL)); + return stream_reader; +} + +std::shared_ptr +garrow_ipc_stream_reader_get_raw(GArrowIPCStreamReader *stream_reader) +{ + GArrowIPCStreamReaderPrivate *priv; + + priv = GARROW_IPC_STREAM_READER_GET_PRIVATE(stream_reader); + return priv->stream_reader; +} diff --git a/c_glib/arrow-glib/ipc-stream-reader.h b/c_glib/arrow-glib/ipc-stream-reader.h new file mode 100644 index 0000000000000..993cd85003bb9 --- /dev/null +++ b/c_glib/arrow-glib/ipc-stream-reader.h @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include +#include + +#include + +#include + +G_BEGIN_DECLS + +#define GARROW_IPC_TYPE_STREAM_READER \ + (garrow_ipc_stream_reader_get_type()) +#define GARROW_IPC_STREAM_READER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_IPC_TYPE_STREAM_READER, \ + GArrowIPCStreamReader)) +#define GARROW_IPC_STREAM_READER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_IPC_TYPE_STREAM_READER, \ + GArrowIPCStreamReaderClass)) +#define GARROW_IPC_IS_STREAM_READER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_IPC_TYPE_STREAM_READER)) +#define GARROW_IPC_IS_STREAM_READER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_IPC_TYPE_STREAM_READER)) +#define GARROW_IPC_STREAM_READER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_IPC_TYPE_STREAM_READER, \ + GArrowIPCStreamReaderClass)) + +typedef struct _GArrowIPCStreamReader GArrowIPCStreamReader; +typedef struct _GArrowIPCStreamReaderClass GArrowIPCStreamReaderClass; + +/** + * GArrowIPCStreamReader: + * + * It wraps `arrow::ipc::StreamReader`. + */ +struct _GArrowIPCStreamReader +{ + /*< private >*/ + GObject parent_instance; +}; + +struct _GArrowIPCStreamReaderClass +{ + GObjectClass parent_class; +}; + +GType garrow_ipc_stream_reader_get_type(void) G_GNUC_CONST; + +GArrowIPCStreamReader *garrow_ipc_stream_reader_open(GArrowIOInputStream *stream, + GError **error); + +GArrowSchema *garrow_ipc_stream_reader_get_schema(GArrowIPCStreamReader *stream_reader); +GArrowRecordBatch *garrow_ipc_stream_reader_get_next_record_batch(GArrowIPCStreamReader *stream_reader, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/ipc-stream-reader.hpp b/c_glib/arrow-glib/ipc-stream-reader.hpp new file mode 100644 index 0000000000000..a35bdab7e69d4 --- /dev/null +++ b/c_glib/arrow-glib/ipc-stream-reader.hpp @@ -0,0 +1,28 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
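+ */
+
+/* A minimal sketch of draining a stream with the reader declared above,
+ * assuming `input` is an already-opened #GArrowIOInputStream. Note that
+ * garrow_ipc_stream_reader_get_next_record_batch() returns %NULL both
+ * on end of stream and on error, so `error` should be checked
+ * afterwards:
+ *
+ * |[
+ * GError *error = NULL;
+ * GArrowIPCStreamReader *reader;
+ * GArrowRecordBatch *record_batch;
+ *
+ * reader = garrow_ipc_stream_reader_open(input, &error);
+ * if (reader) {
+ *   while ((record_batch =
+ *             garrow_ipc_stream_reader_get_next_record_batch(reader, &error))) {
+ *     // Use record_batch...
+ *     g_object_unref(record_batch);
+ *   }
+ *   g_object_unref(reader);
+ * }
+ * ]|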
+ */ + +#pragma once + +#include +#include + +#include + +GArrowIPCStreamReader *garrow_ipc_stream_reader_new_raw(std::shared_ptr *arrow_stream_reader); +std::shared_ptr garrow_ipc_stream_reader_get_raw(GArrowIPCStreamReader *stream_reader); diff --git a/c_glib/arrow-glib/ipc-stream-writer.cpp b/c_glib/arrow-glib/ipc-stream-writer.cpp new file mode 100644 index 0000000000000..e2455a4a9c61c --- /dev/null +++ b/c_glib/arrow-glib/ipc-stream-writer.cpp @@ -0,0 +1,232 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include + +#include +#include +#include +#include + +#include + +#include + +G_BEGIN_DECLS + +/** + * SECTION: ipc-stream-writer + * @short_description: Stream writer class + * + * #GArrowIPCStreamWriter is a class for sending data by stream based + * IPC. + */ + +typedef struct GArrowIPCStreamWriterPrivate_ { + std::shared_ptr stream_writer; +} GArrowIPCStreamWriterPrivate; + +enum { + PROP_0, + PROP_STREAM_WRITER +}; + +G_DEFINE_TYPE_WITH_PRIVATE(GArrowIPCStreamWriter, + garrow_ipc_stream_writer, + G_TYPE_OBJECT); + +#define GARROW_IPC_STREAM_WRITER_GET_PRIVATE(obj) \ + (G_TYPE_INSTANCE_GET_PRIVATE((obj), \ + GARROW_IPC_TYPE_STREAM_WRITER, \ + GArrowIPCStreamWriterPrivate)) + +static void +garrow_ipc_stream_writer_finalize(GObject *object) +{ + GArrowIPCStreamWriterPrivate *priv; + + priv = GARROW_IPC_STREAM_WRITER_GET_PRIVATE(object); + + priv->stream_writer = nullptr; + + G_OBJECT_CLASS(garrow_ipc_stream_writer_parent_class)->finalize(object); +} + +static void +garrow_ipc_stream_writer_set_property(GObject *object, + guint prop_id, + const GValue *value, + GParamSpec *pspec) +{ + GArrowIPCStreamWriterPrivate *priv; + + priv = GARROW_IPC_STREAM_WRITER_GET_PRIVATE(object); + + switch (prop_id) { + case PROP_STREAM_WRITER: + priv->stream_writer = + *static_cast *>(g_value_get_pointer(value)); + break; + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_ipc_stream_writer_get_property(GObject *object, + guint prop_id, + GValue *value, + GParamSpec *pspec) +{ + switch (prop_id) { + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_ipc_stream_writer_init(GArrowIPCStreamWriter *object) +{ +} + +static void +garrow_ipc_stream_writer_class_init(GArrowIPCStreamWriterClass *klass) +{ + GObjectClass *gobject_class; + GParamSpec *spec; + + gobject_class = G_OBJECT_CLASS(klass); + + gobject_class->finalize = garrow_ipc_stream_writer_finalize; + gobject_class->set_property = garrow_ipc_stream_writer_set_property; + gobject_class->get_property = garrow_ipc_stream_writer_get_property; + + spec = g_param_spec_pointer("stream-writer", + "ipc::StreamWriter", + "The raw 
std::shared_ptr<arrow::ipc::StreamWriter> *",
+                              static_cast<GParamFlags>(G_PARAM_WRITABLE |
+                                                       G_PARAM_CONSTRUCT_ONLY));
+  g_object_class_install_property(gobject_class, PROP_STREAM_WRITER, spec);
+}
+
+/**
+ * garrow_ipc_stream_writer_open:
+ * @sink: The output of the writer.
+ * @schema: The schema of the writer.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: (nullable) (transfer full): A newly opened
+ *   #GArrowIPCStreamWriter or %NULL on error.
+ */
+GArrowIPCStreamWriter *
+garrow_ipc_stream_writer_open(GArrowIOOutputStream *sink,
+                              GArrowSchema *schema,
+                              GError **error)
+{
+  std::shared_ptr<arrow::ipc::StreamWriter> arrow_stream_writer;
+  auto status =
+    arrow::ipc::StreamWriter::Open(garrow_io_output_stream_get_raw(sink).get(),
+                                   garrow_schema_get_raw(schema),
+                                   &arrow_stream_writer);
+  if (status.ok()) {
+    return garrow_ipc_stream_writer_new_raw(&arrow_stream_writer);
+  } else {
+    garrow_error_set(error, status, "[ipc][stream-writer][open]");
+    return NULL;
+  }
+}
+
+/**
+ * garrow_ipc_stream_writer_write_record_batch:
+ * @stream_writer: A #GArrowIPCStreamWriter.
+ * @record_batch: The record batch to be written.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ */
+gboolean
+garrow_ipc_stream_writer_write_record_batch(GArrowIPCStreamWriter *stream_writer,
+                                            GArrowRecordBatch *record_batch,
+                                            GError **error)
+{
+  auto arrow_stream_writer =
+    garrow_ipc_stream_writer_get_raw(stream_writer);
+  auto arrow_record_batch =
+    garrow_record_batch_get_raw(record_batch);
+  auto arrow_record_batch_raw =
+    arrow_record_batch.get();
+
+  auto status = arrow_stream_writer->WriteRecordBatch(*arrow_record_batch_raw);
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[ipc][stream-writer][write-record-batch]");
+    return FALSE;
+  }
+}
+
+/**
+ * garrow_ipc_stream_writer_close:
+ * @stream_writer: A #GArrowIPCStreamWriter.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ */
+gboolean
+garrow_ipc_stream_writer_close(GArrowIPCStreamWriter *stream_writer,
+                               GError **error)
+{
+  auto arrow_stream_writer =
+    garrow_ipc_stream_writer_get_raw(stream_writer);
+
+  auto status = arrow_stream_writer->Close();
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[ipc][stream-writer][close]");
+    return FALSE;
+  }
+}
+
+G_END_DECLS
+
+GArrowIPCStreamWriter *
+garrow_ipc_stream_writer_new_raw(std::shared_ptr<arrow::ipc::StreamWriter> *arrow_stream_writer)
+{
+  auto stream_writer =
+    GARROW_IPC_STREAM_WRITER(g_object_new(GARROW_IPC_TYPE_STREAM_WRITER,
+                                          "stream-writer", arrow_stream_writer,
+                                          NULL));
+  return stream_writer;
+}
+
+std::shared_ptr<arrow::ipc::StreamWriter>
+garrow_ipc_stream_writer_get_raw(GArrowIPCStreamWriter *stream_writer)
+{
+  GArrowIPCStreamWriterPrivate *priv;
+
+  priv = GARROW_IPC_STREAM_WRITER_GET_PRIVATE(stream_writer);
+  return priv->stream_writer;
+}
diff --git a/c_glib/arrow-glib/ipc-stream-writer.h b/c_glib/arrow-glib/ipc-stream-writer.h
new file mode 100644
index 0000000000000..4488204736d51
--- /dev/null
+++ b/c_glib/arrow-glib/ipc-stream-writer.h
@@ -0,0 +1,82 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include +#include +#include + +#include + +G_BEGIN_DECLS + +#define GARROW_IPC_TYPE_STREAM_WRITER \ + (garrow_ipc_stream_writer_get_type()) +#define GARROW_IPC_STREAM_WRITER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_IPC_TYPE_STREAM_WRITER, \ + GArrowIPCStreamWriter)) +#define GARROW_IPC_STREAM_WRITER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_IPC_TYPE_STREAM_WRITER, \ + GArrowIPCStreamWriterClass)) +#define GARROW_IPC_IS_STREAM_WRITER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_IPC_TYPE_STREAM_WRITER)) +#define GARROW_IPC_IS_STREAM_WRITER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_IPC_TYPE_STREAM_WRITER)) +#define GARROW_IPC_STREAM_WRITER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_IPC_TYPE_STREAM_WRITER, \ + GArrowIPCStreamWriterClass)) + +typedef struct _GArrowIPCStreamWriter GArrowIPCStreamWriter; +typedef struct _GArrowIPCStreamWriterClass GArrowIPCStreamWriterClass; + +/** + * GArrowIPCStreamWriter: + * + * It wraps `arrow::ipc::StreamWriter`. + */ +struct _GArrowIPCStreamWriter +{ + /*< private >*/ + GObject parent_instance; +}; + +struct _GArrowIPCStreamWriterClass +{ + GObjectClass parent_class; +}; + +GType garrow_ipc_stream_writer_get_type(void) G_GNUC_CONST; + +GArrowIPCStreamWriter *garrow_ipc_stream_writer_open(GArrowIOOutputStream *sink, + GArrowSchema *schema, + GError **error); + +gboolean garrow_ipc_stream_writer_write_record_batch(GArrowIPCStreamWriter *stream_writer, + GArrowRecordBatch *record_batch, + GError **error); +gboolean garrow_ipc_stream_writer_close(GArrowIPCStreamWriter *stream_writer, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/ipc-stream-writer.hpp b/c_glib/arrow-glib/ipc-stream-writer.hpp new file mode 100644 index 0000000000000..9d097404582a9 --- /dev/null +++ b/c_glib/arrow-glib/ipc-stream-writer.hpp @@ -0,0 +1,28 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
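+ */
+
+/* A sketch of the stream-writer counterpart of the file-writer example
+ * earlier in this patch, assuming `output`, `schema`, and
+ * `record_batch` already exist:
+ *
+ * |[
+ * GError *error = NULL;
+ * GArrowIPCStreamWriter *writer;
+ *
+ * writer = garrow_ipc_stream_writer_open(output, schema, &error);
+ * if (writer) {
+ *   garrow_ipc_stream_writer_write_record_batch(writer, record_batch, &error);
+ *   garrow_ipc_stream_writer_close(writer, &error);
+ *   g_object_unref(writer);
+ * }
+ * ]|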
+ */ + +#pragma once + +#include +#include + +#include + +GArrowIPCStreamWriter *garrow_ipc_stream_writer_new_raw(std::shared_ptr *arrow_stream_writer); +std::shared_ptr garrow_ipc_stream_writer_get_raw(GArrowIPCStreamWriter *stream_writer); diff --git a/c_glib/arrow-glib/list-array-builder.cpp b/c_glib/arrow-glib/list-array-builder.cpp new file mode 100644 index 0000000000000..6c8f53da1fc98 --- /dev/null +++ b/c_glib/arrow-glib/list-array-builder.cpp @@ -0,0 +1,173 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: list-array-builder + * @short_description: List array builder class + * @include: arrow-glib/arrow-glib.h + * + * #GArrowListArrayBuilder is the class to create a new + * #GArrowListArray. + */ + +G_DEFINE_TYPE(GArrowListArrayBuilder, + garrow_list_array_builder, + GARROW_TYPE_ARRAY_BUILDER) + +static void +garrow_list_array_builder_init(GArrowListArrayBuilder *builder) +{ +} + +static void +garrow_list_array_builder_class_init(GArrowListArrayBuilderClass *klass) +{ +} + +/** + * garrow_list_array_builder_new: + * @value_builder: A #GArrowArrayBuilder for value array. + * + * Returns: A newly created #GArrowListArrayBuilder. + */ +GArrowListArrayBuilder * +garrow_list_array_builder_new(GArrowArrayBuilder *value_builder) +{ + auto memory_pool = arrow::default_memory_pool(); + auto arrow_value_builder = garrow_array_builder_get_raw(value_builder); + auto arrow_list_builder = + std::make_shared(memory_pool, arrow_value_builder); + std::shared_ptr arrow_builder = arrow_list_builder; + auto builder = garrow_array_builder_new_raw(&arrow_builder); + return GARROW_LIST_ARRAY_BUILDER(builder); +} + +/** + * garrow_list_array_builder_append: + * @builder: A #GArrowListArrayBuilder. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + * + * It appends a new list element. To append a new list element, you + * need to call this function then append list element values to + * `value_builder`. `value_builder` is the #GArrowArrayBuilder + * specified to constructor. You can get `value_builder` by + * garrow_list_array_builder_get_value_builder(). 
+ * + * |[ + * GArrowInt8ArrayBuilder *value_builder; + * GArrowListArrayBuilder *builder; + * + * value_builder = garrow_int8_array_builder_new(); + * builder = garrow_list_array_builder_new(value_builder, NULL); + * + * // Start 0th list element: [1, 0, -1] + * garrow_list_array_builder_append(builder, NULL); + * garrow_int8_array_builder_append(value_builder, 1); + * garrow_int8_array_builder_append(value_builder, 0); + * garrow_int8_array_builder_append(value_builder, -1); + * + * // Start 1st list element: [-29, 29] + * garrow_list_array_builder_append(builder, NULL); + * garrow_int8_array_builder_append(value_builder, -29); + * garrow_int8_array_builder_append(value_builder, 29); + * + * { + * // [[1, 0, -1], [-29, 29]] + * GArrowArray *array = garrow_array_builder_finish(builder); + * // Now, builder is needless. + * g_object_unref(builder); + * g_object_unref(value_builder); + * + * // Use array... + * g_object_unref(array); + * } + * ]| + */ +gboolean +garrow_list_array_builder_append(GArrowListArrayBuilder *builder, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->Append(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[list-array-builder][append]"); + return FALSE; + } +} + +/** + * garrow_list_array_builder_append_null: + * @builder: A #GArrowListArrayBuilder. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + * + * It appends a new NULL element. + */ +gboolean +garrow_list_array_builder_append_null(GArrowListArrayBuilder *builder, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->AppendNull(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[list-array-builder][append-null]"); + return FALSE; + } +} + +/** + * garrow_list_array_builder_get_value_builder: + * @builder: A #GArrowListArrayBuilder. + * + * Returns: (transfer full): The #GArrowArrayBuilder for values. + */ +GArrowArrayBuilder * +garrow_list_array_builder_get_value_builder(GArrowListArrayBuilder *builder) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + auto arrow_value_builder = arrow_builder->value_builder(); + return garrow_array_builder_new_raw(&arrow_value_builder); +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/list-array-builder.h b/c_glib/arrow-glib/list-array-builder.h new file mode 100644 index 0000000000000..2c2e58e54309b --- /dev/null +++ b/c_glib/arrow-glib/list-array-builder.h @@ -0,0 +1,77 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. 
See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_LIST_ARRAY_BUILDER \ + (garrow_list_array_builder_get_type()) +#define GARROW_LIST_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_LIST_ARRAY_BUILDER, \ + GArrowListArrayBuilder)) +#define GARROW_LIST_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_LIST_ARRAY_BUILDER, \ + GArrowListArrayBuilderClass)) +#define GARROW_IS_LIST_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_LIST_ARRAY_BUILDER)) +#define GARROW_IS_LIST_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_LIST_ARRAY_BUILDER)) +#define GARROW_LIST_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_LIST_ARRAY_BUILDER, \ + GArrowListArrayBuilderClass)) + +typedef struct _GArrowListArrayBuilder GArrowListArrayBuilder; +typedef struct _GArrowListArrayBuilderClass GArrowListArrayBuilderClass; + +/** + * GArrowListArrayBuilder: + * + * It wraps `arrow::ListBuilder`. + */ +struct _GArrowListArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowListArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_list_array_builder_get_type(void) G_GNUC_CONST; + +GArrowListArrayBuilder *garrow_list_array_builder_new(GArrowArrayBuilder *value_builder); + +gboolean garrow_list_array_builder_append(GArrowListArrayBuilder *builder, + GError **error); +gboolean garrow_list_array_builder_append_null(GArrowListArrayBuilder *builder, + GError **error); + +GArrowArrayBuilder *garrow_list_array_builder_get_value_builder(GArrowListArrayBuilder *builder); + +G_END_DECLS diff --git a/c_glib/arrow-glib/list-array.cpp b/c_glib/arrow-glib/list-array.cpp new file mode 100644 index 0000000000000..2b3fb311280d0 --- /dev/null +++ b/c_glib/arrow-glib/list-array.cpp @@ -0,0 +1,92 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: list-array + * @short_description: List array class + * @include: arrow-glib/arrow-glib.h + * + * #GArrowListArray is a class for list array. It can store zero + * or more list data. + * + * #GArrowListArray is immutable. You need to use + * #GArrowListArrayBuilder to create a new array. + */ + +G_DEFINE_TYPE(GArrowListArray, \ + garrow_list_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_list_array_init(GArrowListArray *object) +{ +} + +static void +garrow_list_array_class_init(GArrowListArrayClass *klass) +{ +} + +/** + * garrow_list_array_get_value_type: + * @array: A #GArrowListArray. 
+ * + * Returns: (transfer full): The data type of value in each list. + */ +GArrowDataType * +garrow_list_array_get_value_type(GArrowListArray *array) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + auto arrow_list_array = + static_cast(arrow_array.get()); + auto arrow_value_type = arrow_list_array->value_type(); + return garrow_data_type_new_raw(&arrow_value_type); +} + +/** + * garrow_list_array_get_value: + * @array: A #GArrowListArray. + * @i: The index of the target value. + * + * Returns: (transfer full): The i-th list. + */ +GArrowArray * +garrow_list_array_get_value(GArrowListArray *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + auto arrow_list_array = + static_cast(arrow_array.get()); + auto arrow_list = + arrow_list_array->values()->Slice(arrow_list_array->value_offset(i), + arrow_list_array->value_length(i)); + return garrow_array_new_raw(&arrow_list); +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/list-array.h b/c_glib/arrow-glib/list-array.h new file mode 100644 index 0000000000000..c49aed1b9599e --- /dev/null +++ b/c_glib/arrow-glib/list-array.h @@ -0,0 +1,73 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_LIST_ARRAY \ + (garrow_list_array_get_type()) +#define GARROW_LIST_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_LIST_ARRAY, \ + GArrowListArray)) +#define GARROW_LIST_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_LIST_ARRAY, \ + GArrowListArrayClass)) +#define GARROW_IS_LIST_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_LIST_ARRAY)) +#define GARROW_IS_LIST_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_LIST_ARRAY)) +#define GARROW_LIST_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_LIST_ARRAY, \ + GArrowListArrayClass)) + +typedef struct _GArrowListArray GArrowListArray; +typedef struct _GArrowListArrayClass GArrowListArrayClass; + +/** + * GArrowListArray: + * + * It wraps `arrow::ListArray`. 
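+ *
+ * A sketch of reading a list array back, assuming `array` is a
+ * #GArrowListArray with `n` lists (for building one, see the
+ * #GArrowListArrayBuilder example above):
+ *
+ * |[
+ * gint64 i;
+ *
+ * for (i = 0; i < n; i++) {
+ *   GArrowArray *value = garrow_list_array_get_value(array, i);
+ *   // Use the i-th list...
+ *   g_object_unref(value);
+ * }
+ * ]|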
+ */
+struct _GArrowListArray
+{
+  /*< private >*/
+  GArrowArray parent_instance;
+};
+
+struct _GArrowListArrayClass
+{
+  GArrowArrayClass parent_class;
+};
+
+GType garrow_list_array_get_type(void) G_GNUC_CONST;
+
+GArrowDataType *garrow_list_array_get_value_type(GArrowListArray *array);
+GArrowArray *garrow_list_array_get_value(GArrowListArray *array,
+                                         gint64 i);
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/list-data-type.cpp b/c_glib/arrow-glib/list-data-type.cpp
new file mode 100644
index 0000000000000..e82e6fdee48ba
--- /dev/null
+++ b/c_glib/arrow-glib/list-data-type.cpp
@@ -0,0 +1,91 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+# include <config.h>
+#endif
+
+#include <arrow-glib/data-type.hpp>
+#include <arrow-glib/field.hpp>
+#include <arrow-glib/list-data-type.h>
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: list-data-type
+ * @short_description: List data type
+ *
+ * #GArrowListDataType is a class for list data type.
+ */
+
+G_DEFINE_TYPE(GArrowListDataType,                \
+              garrow_list_data_type,             \
+              GARROW_TYPE_DATA_TYPE)
+
+static void
+garrow_list_data_type_init(GArrowListDataType *object)
+{
+}
+
+static void
+garrow_list_data_type_class_init(GArrowListDataTypeClass *klass)
+{
+}
+
+/**
+ * garrow_list_data_type_new:
+ * @field: The field of elements.
+ *
+ * Returns: The newly created list data type.
+ */
+GArrowListDataType *
+garrow_list_data_type_new(GArrowField *field)
+{
+  auto arrow_field = garrow_field_get_raw(field);
+  auto arrow_data_type =
+    std::make_shared<arrow::ListType>(arrow_field);
+
+  GArrowListDataType *data_type =
+    GARROW_LIST_DATA_TYPE(g_object_new(GARROW_TYPE_LIST_DATA_TYPE,
+                                       "data-type", &arrow_data_type,
+                                       NULL));
+  return data_type;
+}
+
+/**
+ * garrow_list_data_type_get_value_field:
+ * @list_data_type: A #GArrowListDataType.
+ *
+ * Returns: (transfer full): The field of the value.
+ */
+GArrowField *
+garrow_list_data_type_get_value_field(GArrowListDataType *list_data_type)
+{
+  auto arrow_data_type =
+    garrow_data_type_get_raw(GARROW_DATA_TYPE(list_data_type));
+  auto arrow_list_data_type =
+    static_cast<arrow::ListType *>(arrow_data_type.get());
+
+  auto arrow_field = arrow_list_data_type->value_field();
+  auto field = garrow_field_new_raw(&arrow_field);
+
+  return field;
+}
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/list-data-type.h b/c_glib/arrow-glib/list-data-type.h
new file mode 100644
index 0000000000000..bb406e2c62074
--- /dev/null
+++ b/c_glib/arrow-glib/list-data-type.h
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow-glib/data-type.h>
+#include <arrow-glib/field.h>
+
+G_BEGIN_DECLS
+
+#define GARROW_TYPE_LIST_DATA_TYPE              \
+  (garrow_list_data_type_get_type())
+#define GARROW_LIST_DATA_TYPE(obj)                              \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj),                            \
+                              GARROW_TYPE_LIST_DATA_TYPE,       \
+                              GArrowListDataType))
+#define GARROW_LIST_DATA_TYPE_CLASS(klass)                      \
+  (G_TYPE_CHECK_CLASS_CAST((klass),                             \
+                           GARROW_TYPE_LIST_DATA_TYPE,          \
+                           GArrowListDataTypeClass))
+#define GARROW_IS_LIST_DATA_TYPE(obj)                           \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj),                            \
+                              GARROW_TYPE_LIST_DATA_TYPE))
+#define GARROW_IS_LIST_DATA_TYPE_CLASS(klass)                   \
+  (G_TYPE_CHECK_CLASS_TYPE((klass),                             \
+                           GARROW_TYPE_LIST_DATA_TYPE))
+#define GARROW_LIST_DATA_TYPE_GET_CLASS(obj)                    \
+  (G_TYPE_INSTANCE_GET_CLASS((obj),                             \
+                             GARROW_TYPE_LIST_DATA_TYPE,        \
+                             GArrowListDataTypeClass))
+
+typedef struct _GArrowListDataType      GArrowListDataType;
+typedef struct _GArrowListDataTypeClass GArrowListDataTypeClass;
+
+/**
+ * GArrowListDataType:
+ *
+ * It wraps `arrow::ListType`.
+ */
+struct _GArrowListDataType
+{
+  /*< private >*/
+  GArrowDataType parent_instance;
+};
+
+struct _GArrowListDataTypeClass
+{
+  GArrowDataTypeClass parent_class;
+};
+
+GType garrow_list_data_type_get_type (void) G_GNUC_CONST;
+
+GArrowListDataType *garrow_list_data_type_new (GArrowField *field);
+
+GArrowField *garrow_list_data_type_get_value_field (GArrowListDataType *list_data_type);
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/null-array.cpp b/c_glib/arrow-glib/null-array.cpp
new file mode 100644
index 0000000000000..0e0ea51e24c04
--- /dev/null
+++ b/c_glib/arrow-glib/null-array.cpp
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+# include <config.h>
+#endif
+
+#include <arrow-glib/array.hpp>
+#include <arrow-glib/null-array.h>
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: null-array
+ * @short_description: Null array class
+ *
+ * #GArrowNullArray is a class for null array. It can store zero
+ * or more null values.
+ *
+ * #GArrowNullArray is immutable. You need to specify an array length
+ * to create a new array.
+ */
+
+G_DEFINE_TYPE(GArrowNullArray,               \
+              garrow_null_array,             \
+              GARROW_TYPE_ARRAY)
+
+static void
+garrow_null_array_init(GArrowNullArray *object)
+{
+}
+
+static void
+garrow_null_array_class_init(GArrowNullArrayClass *klass)
+{
+}
+
+/**
+ * garrow_null_array_new:
+ * @length: An array length.
+ *
+ * Returns: A newly created #GArrowNullArray.
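+ *
+ * A minimal usage sketch (for illustration only):
+ * |[
+ * GArrowNullArray *null_array = garrow_null_array_new(10);
+ * // An array of 10 elements, all of which are NULL.
+ * g_object_unref(null_array);
+ * ]|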
+ */
+GArrowNullArray *
+garrow_null_array_new(gint64 length)
+{
+  auto arrow_null_array = std::make_shared<arrow::NullArray>(length);
+  std::shared_ptr<arrow::Array> arrow_array = arrow_null_array;
+  auto array = garrow_array_new_raw(&arrow_array);
+  return GARROW_NULL_ARRAY(array);
+}
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/null-array.h b/c_glib/arrow-glib/null-array.h
new file mode 100644
index 0000000000000..e25f3054843e4
--- /dev/null
+++ b/c_glib/arrow-glib/null-array.h
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow-glib/array.h>
+
+G_BEGIN_DECLS
+
+#define GARROW_TYPE_NULL_ARRAY                  \
+  (garrow_null_array_get_type())
+#define GARROW_NULL_ARRAY(obj)                          \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj),                    \
+                              GARROW_TYPE_NULL_ARRAY,   \
+                              GArrowNullArray))
+#define GARROW_NULL_ARRAY_CLASS(klass)                  \
+  (G_TYPE_CHECK_CLASS_CAST((klass),                     \
+                           GARROW_TYPE_NULL_ARRAY,      \
+                           GArrowNullArrayClass))
+#define GARROW_IS_NULL_ARRAY(obj)                       \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj),                    \
+                              GARROW_TYPE_NULL_ARRAY))
+#define GARROW_IS_NULL_ARRAY_CLASS(klass)               \
+  (G_TYPE_CHECK_CLASS_TYPE((klass),                     \
+                           GARROW_TYPE_NULL_ARRAY))
+#define GARROW_NULL_ARRAY_GET_CLASS(obj)                \
+  (G_TYPE_INSTANCE_GET_CLASS((obj),                     \
+                             GARROW_TYPE_NULL_ARRAY,    \
+                             GArrowNullArrayClass))
+
+typedef struct _GArrowNullArray      GArrowNullArray;
+typedef struct _GArrowNullArrayClass GArrowNullArrayClass;
+
+/**
+ * GArrowNullArray:
+ *
+ * It wraps `arrow::NullArray`.
+ */
+struct _GArrowNullArray
+{
+  /*< private >*/
+  GArrowArray parent_instance;
+};
+
+struct _GArrowNullArrayClass
+{
+  GArrowArrayClass parent_class;
+};
+
+GType garrow_null_array_get_type(void) G_GNUC_CONST;
+
+GArrowNullArray *garrow_null_array_new(gint64 length);
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/null-data-type.cpp b/c_glib/arrow-glib/null-data-type.cpp
new file mode 100644
index 0000000000000..1f75d3bb88c37
--- /dev/null
+++ b/c_glib/arrow-glib/null-data-type.cpp
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+# include <config.h>
+#endif
+
+#include <arrow-glib/data-type.hpp>
+#include <arrow-glib/null-data-type.h>
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: null-data-type
+ * @short_description: Null data type
+ *
+ * #GArrowNullDataType is a class for null data type.
+ */
+
+G_DEFINE_TYPE(GArrowNullDataType,                \
+              garrow_null_data_type,             \
+              GARROW_TYPE_DATA_TYPE)
+
+static void
+garrow_null_data_type_init(GArrowNullDataType *object)
+{
+}
+
+static void
+garrow_null_data_type_class_init(GArrowNullDataTypeClass *klass)
+{
+}
+
+/**
+ * garrow_null_data_type_new:
+ *
+ * Returns: The newly created null data type.
+ */
+GArrowNullDataType *
+garrow_null_data_type_new(void)
+{
+  auto arrow_data_type = arrow::null();
+
+  GArrowNullDataType *data_type =
+    GARROW_NULL_DATA_TYPE(g_object_new(GARROW_TYPE_NULL_DATA_TYPE,
+                                       "data-type", &arrow_data_type,
+                                       NULL));
+  return data_type;
+}
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/null-data-type.h b/c_glib/arrow-glib/null-data-type.h
new file mode 100644
index 0000000000000..006b76c961f3b
--- /dev/null
+++ b/c_glib/arrow-glib/null-data-type.h
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow-glib/data-type.h>
+
+G_BEGIN_DECLS
+
+#define GARROW_TYPE_NULL_DATA_TYPE              \
+  (garrow_null_data_type_get_type())
+#define GARROW_NULL_DATA_TYPE(obj)                              \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj),                            \
+                              GARROW_TYPE_NULL_DATA_TYPE,       \
+                              GArrowNullDataType))
+#define GARROW_NULL_DATA_TYPE_CLASS(klass)                      \
+  (G_TYPE_CHECK_CLASS_CAST((klass),                             \
+                           GARROW_TYPE_NULL_DATA_TYPE,          \
+                           GArrowNullDataTypeClass))
+#define GARROW_IS_NULL_DATA_TYPE(obj)                           \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj),                            \
+                              GARROW_TYPE_NULL_DATA_TYPE))
+#define GARROW_IS_NULL_DATA_TYPE_CLASS(klass)                   \
+  (G_TYPE_CHECK_CLASS_TYPE((klass),                             \
+                           GARROW_TYPE_NULL_DATA_TYPE))
+#define GARROW_NULL_DATA_TYPE_GET_CLASS(obj)                    \
+  (G_TYPE_INSTANCE_GET_CLASS((obj),                             \
+                             GARROW_TYPE_NULL_DATA_TYPE,        \
+                             GArrowNullDataTypeClass))
+
+typedef struct _GArrowNullDataType      GArrowNullDataType;
+typedef struct _GArrowNullDataTypeClass GArrowNullDataTypeClass;
+
+/**
+ * GArrowNullDataType:
+ *
+ * It wraps `arrow::NullType`.
+ */
+struct _GArrowNullDataType
+{
+  /*< private >*/
+  GArrowDataType parent_instance;
+};
+
+struct _GArrowNullDataTypeClass
+{
+  GArrowDataTypeClass parent_class;
+};
+
+GType garrow_null_data_type_get_type (void) G_GNUC_CONST;
+GArrowNullDataType *garrow_null_data_type_new (void);
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/record-batch.cpp b/c_glib/arrow-glib/record-batch.cpp
new file mode 100644
index 0000000000000..8ac1791feef8c
--- /dev/null
+++ b/c_glib/arrow-glib/record-batch.cpp
@@ -0,0 +1,288 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+# include <config.h>
+#endif
+
+#include <arrow-glib/array.hpp>
+#include <arrow-glib/record-batch.hpp>
+#include <arrow-glib/schema.hpp>
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: record-batch
+ * @short_description: Record batch class
+ *
+ * #GArrowRecordBatch is a class for record batch. Record batch is
+ * similar to #GArrowTable. A record batch also has zero or more
+ * columns and zero or more records.
+ *
+ * Record batch is used for shared memory IPC.
+ */
+
+typedef struct GArrowRecordBatchPrivate_ {
+  std::shared_ptr<arrow::RecordBatch> record_batch;
+} GArrowRecordBatchPrivate;
+
+enum {
+  PROP_0,
+  PROP_RECORD_BATCH
+};
+
+G_DEFINE_TYPE_WITH_PRIVATE(GArrowRecordBatch,
+                           garrow_record_batch,
+                           G_TYPE_OBJECT)
+
+#define GARROW_RECORD_BATCH_GET_PRIVATE(obj)                    \
+  (G_TYPE_INSTANCE_GET_PRIVATE((obj),                           \
+                               GARROW_TYPE_RECORD_BATCH,        \
+                               GArrowRecordBatchPrivate))
+
+static void
+garrow_record_batch_finalize(GObject *object)
+{
+  GArrowRecordBatchPrivate *priv;
+
+  priv = GARROW_RECORD_BATCH_GET_PRIVATE(object);
+
+  priv->record_batch = nullptr;
+
+  G_OBJECT_CLASS(garrow_record_batch_parent_class)->finalize(object);
+}
+
+static void
+garrow_record_batch_set_property(GObject *object,
+                                 guint prop_id,
+                                 const GValue *value,
+                                 GParamSpec *pspec)
+{
+  GArrowRecordBatchPrivate *priv;
+
+  priv = GARROW_RECORD_BATCH_GET_PRIVATE(object);
+
+  switch (prop_id) {
+  case PROP_RECORD_BATCH:
+    priv->record_batch =
+      *static_cast<std::shared_ptr<arrow::RecordBatch> *>(g_value_get_pointer(value));
+    break;
+  default:
+    G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec);
+    break;
+  }
+}
+
+static void
+garrow_record_batch_get_property(GObject *object,
+                                 guint prop_id,
+                                 GValue *value,
+                                 GParamSpec *pspec)
+{
+  switch (prop_id) {
+  default:
+    G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec);
+    break;
+  }
+}
+
+static void
+garrow_record_batch_init(GArrowRecordBatch *object)
+{
+}
+
+static void
+garrow_record_batch_class_init(GArrowRecordBatchClass *klass)
+{
+  GObjectClass *gobject_class;
+  GParamSpec *spec;
+
+  gobject_class = G_OBJECT_CLASS(klass);
+
+  gobject_class->finalize = garrow_record_batch_finalize;
+  gobject_class->set_property = garrow_record_batch_set_property;
+  gobject_class->get_property = garrow_record_batch_get_property;
+
+  spec = g_param_spec_pointer("record-batch",
+                              "RecordBatch",
+                              "The raw std::shared_ptr<arrow::RecordBatch> *",
+                              static_cast<GParamFlags>(G_PARAM_WRITABLE |
+                                                       G_PARAM_CONSTRUCT_ONLY));
+  g_object_class_install_property(gobject_class, PROP_RECORD_BATCH, spec);
+}
+
+/**
+ * garrow_record_batch_new:
+ * @schema: The schema of the record batch.
+ * @n_rows: The number of rows in the record batch.
+ * @columns: (element-type GArrowArray): The columns in the record batch.
+ *
+ * Returns: A newly created #GArrowRecordBatch.
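+ *
+ * A construction sketch for a one-row, one-column batch (the
+ * `schema` and `array` variables are assumptions; `array` must have
+ * one element here):
+ * |[
+ * GList *columns = NULL;
+ * columns = g_list_append(columns, array);
+ * GArrowRecordBatch *record_batch =
+ *   garrow_record_batch_new(schema, 1, columns);
+ * g_list_free(columns);
+ * ]|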
+ */
+GArrowRecordBatch *
+garrow_record_batch_new(GArrowSchema *schema,
+                        guint32 n_rows,
+                        GList *columns)
+{
+  std::vector<std::shared_ptr<arrow::Array>> arrow_columns;
+  for (GList *node = columns; node; node = node->next) {
+    GArrowArray *column = GARROW_ARRAY(node->data);
+    arrow_columns.push_back(garrow_array_get_raw(column));
+  }
+
+  auto arrow_record_batch =
+    std::make_shared<arrow::RecordBatch>(garrow_schema_get_raw(schema),
+                                         n_rows,
+                                         arrow_columns);
+  return garrow_record_batch_new_raw(&arrow_record_batch);
+}
+
+/**
+ * garrow_record_batch_get_schema:
+ * @record_batch: A #GArrowRecordBatch.
+ *
+ * Returns: (transfer full): The schema of the record batch.
+ */
+GArrowSchema *
+garrow_record_batch_get_schema(GArrowRecordBatch *record_batch)
+{
+  const auto arrow_record_batch = garrow_record_batch_get_raw(record_batch);
+  auto arrow_schema = arrow_record_batch->schema();
+  return garrow_schema_new_raw(&arrow_schema);
+}
+
+/**
+ * garrow_record_batch_get_column:
+ * @record_batch: A #GArrowRecordBatch.
+ * @i: The index of the target column.
+ *
+ * Returns: (transfer full): The i-th column in the record batch.
+ */
+GArrowArray *
+garrow_record_batch_get_column(GArrowRecordBatch *record_batch,
+                               guint i)
+{
+  const auto arrow_record_batch = garrow_record_batch_get_raw(record_batch);
+  auto arrow_column = arrow_record_batch->column(i);
+  return garrow_array_new_raw(&arrow_column);
+}
+
+/**
+ * garrow_record_batch_get_columns:
+ * @record_batch: A #GArrowRecordBatch.
+ *
+ * Returns: (element-type GArrowArray) (transfer full):
+ *   The columns in the record batch.
+ */
+GList *
+garrow_record_batch_get_columns(GArrowRecordBatch *record_batch)
+{
+  const auto arrow_record_batch = garrow_record_batch_get_raw(record_batch);
+
+  GList *columns = NULL;
+  for (auto arrow_column : arrow_record_batch->columns()) {
+    GArrowArray *column = garrow_array_new_raw(&arrow_column);
+    columns = g_list_prepend(columns, column);
+  }
+
+  return g_list_reverse(columns);
+}
+
+/**
+ * garrow_record_batch_get_column_name:
+ * @record_batch: A #GArrowRecordBatch.
+ * @i: The index of the target column.
+ *
+ * Returns: The name of the i-th column in the record batch.
+ */
+const gchar *
+garrow_record_batch_get_column_name(GArrowRecordBatch *record_batch,
+                                    guint i)
+{
+  const auto arrow_record_batch = garrow_record_batch_get_raw(record_batch);
+  return arrow_record_batch->column_name(i).c_str();
+}
+
+/**
+ * garrow_record_batch_get_n_columns:
+ * @record_batch: A #GArrowRecordBatch.
+ *
+ * Returns: The number of columns in the record batch.
+ */
+guint
+garrow_record_batch_get_n_columns(GArrowRecordBatch *record_batch)
+{
+  const auto arrow_record_batch = garrow_record_batch_get_raw(record_batch);
+  return arrow_record_batch->num_columns();
+}
+
+/**
+ * garrow_record_batch_get_n_rows:
+ * @record_batch: A #GArrowRecordBatch.
+ *
+ * Returns: The number of rows in the record batch.
+ */
+gint64
+garrow_record_batch_get_n_rows(GArrowRecordBatch *record_batch)
+{
+  const auto arrow_record_batch = garrow_record_batch_get_raw(record_batch);
+  return arrow_record_batch->num_rows();
+}
+
+/**
+ * garrow_record_batch_slice:
+ * @record_batch: A #GArrowRecordBatch.
+ * @offset: The offset of sub #GArrowRecordBatch.
+ * @length: The length of sub #GArrowRecordBatch.
+ *
+ * Returns: (transfer full): The sub #GArrowRecordBatch. It covers
+ *   only the `offset` to `offset + length` range. The sub
+ *   #GArrowRecordBatch shares values with the base
+ *   #GArrowRecordBatch.
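+ *
+ * For example, the following sketch takes the three rows starting
+ * at row 2 (the `record_batch` variable is assumed):
+ * |[
+ * GArrowRecordBatch *sub_record_batch =
+ *   garrow_record_batch_slice(record_batch, 2, 3);
+ * // ... use the sub record batch ...
+ * g_object_unref(sub_record_batch);
+ * ]|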
+ */
+GArrowRecordBatch *
+garrow_record_batch_slice(GArrowRecordBatch *record_batch,
+                          gint64 offset,
+                          gint64 length)
+{
+  const auto arrow_record_batch = garrow_record_batch_get_raw(record_batch);
+  auto arrow_sub_record_batch = arrow_record_batch->Slice(offset, length);
+  return garrow_record_batch_new_raw(&arrow_sub_record_batch);
+}
+
+G_END_DECLS
+
+GArrowRecordBatch *
+garrow_record_batch_new_raw(std::shared_ptr<arrow::RecordBatch> *arrow_record_batch)
+{
+  auto record_batch =
+    GARROW_RECORD_BATCH(g_object_new(GARROW_TYPE_RECORD_BATCH,
+                                     "record-batch", arrow_record_batch,
+                                     NULL));
+  return record_batch;
+}
+
+std::shared_ptr<arrow::RecordBatch>
+garrow_record_batch_get_raw(GArrowRecordBatch *record_batch)
+{
+  GArrowRecordBatchPrivate *priv;
+
+  priv = GARROW_RECORD_BATCH_GET_PRIVATE(record_batch);
+  return priv->record_batch;
+}
diff --git a/c_glib/arrow-glib/record-batch.h b/c_glib/arrow-glib/record-batch.h
new file mode 100644
index 0000000000000..92eee4d9af973
--- /dev/null
+++ b/c_glib/arrow-glib/record-batch.h
@@ -0,0 +1,85 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow-glib/array.h>
+#include <arrow-glib/schema.h>
+
+G_BEGIN_DECLS
+
+#define GARROW_TYPE_RECORD_BATCH                \
+  (garrow_record_batch_get_type())
+#define GARROW_RECORD_BATCH(obj)                        \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj),                    \
+                              GARROW_TYPE_RECORD_BATCH, \
+                              GArrowRecordBatch))
+#define GARROW_RECORD_BATCH_CLASS(klass)                \
+  (G_TYPE_CHECK_CLASS_CAST((klass),                     \
+                           GARROW_TYPE_RECORD_BATCH,    \
+                           GArrowRecordBatchClass))
+#define GARROW_IS_RECORD_BATCH(obj)                     \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj),                    \
+                              GARROW_TYPE_RECORD_BATCH))
+#define GARROW_IS_RECORD_BATCH_CLASS(klass)             \
+  (G_TYPE_CHECK_CLASS_TYPE((klass),                     \
+                           GARROW_TYPE_RECORD_BATCH))
+#define GARROW_RECORD_BATCH_GET_CLASS(obj)              \
+  (G_TYPE_INSTANCE_GET_CLASS((obj),                     \
+                             GARROW_TYPE_RECORD_BATCH,  \
+                             GArrowRecordBatchClass))
+
+typedef struct _GArrowRecordBatch      GArrowRecordBatch;
+typedef struct _GArrowRecordBatchClass GArrowRecordBatchClass;
+
+/**
+ * GArrowRecordBatch:
+ *
+ * It wraps `arrow::RecordBatch`.
+ */
+struct _GArrowRecordBatch
+{
+  /*< private >*/
+  GObject parent_instance;
+};
+
+struct _GArrowRecordBatchClass
+{
+  GObjectClass parent_class;
+};
+
+GType garrow_record_batch_get_type(void) G_GNUC_CONST;
+
+GArrowRecordBatch *garrow_record_batch_new(GArrowSchema *schema,
+                                           guint32 n_rows,
+                                           GList *columns);
+
+GArrowSchema *garrow_record_batch_get_schema     (GArrowRecordBatch *record_batch);
+GArrowArray *garrow_record_batch_get_column      (GArrowRecordBatch *record_batch,
+                                                  guint i);
+GList *garrow_record_batch_get_columns           (GArrowRecordBatch *record_batch);
+const gchar *garrow_record_batch_get_column_name (GArrowRecordBatch *record_batch,
+                                                  guint i);
+guint garrow_record_batch_get_n_columns          (GArrowRecordBatch *record_batch);
+gint64 garrow_record_batch_get_n_rows            (GArrowRecordBatch *record_batch);
+GArrowRecordBatch *garrow_record_batch_slice     (GArrowRecordBatch *record_batch,
+                                                  gint64 offset,
+                                                  gint64 length);
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/record-batch.hpp b/c_glib/arrow-glib/record-batch.hpp
new file mode 100644
index 0000000000000..2e4fe039b4fc5
--- /dev/null
+++ b/c_glib/arrow-glib/record-batch.hpp
@@ -0,0 +1,27 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow/api.h>
+
+#include <arrow-glib/record-batch.h>
+
+GArrowRecordBatch *garrow_record_batch_new_raw(std::shared_ptr<arrow::RecordBatch> *arrow_record_batch);
+std::shared_ptr<arrow::RecordBatch> garrow_record_batch_get_raw(GArrowRecordBatch *record_batch);
diff --git a/c_glib/arrow-glib/schema.cpp b/c_glib/arrow-glib/schema.cpp
new file mode 100644
index 0000000000000..4d5ae5af4fb4a
--- /dev/null
+++ b/c_glib/arrow-glib/schema.cpp
@@ -0,0 +1,245 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+# include <config.h>
+#endif
+
+#include <arrow-glib/field.hpp>
+#include <arrow-glib/schema.hpp>
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: schema
+ * @short_description: Schema class
+ *
+ * #GArrowSchema is a class for schema. A schema is the metadata of a
+ * table. It has zero or more #GArrowFields.
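+ *
+ * A construction sketch (garrow_field_new() and the `data_type`
+ * variable are assumed to be provided by the field support in this
+ * library; they are not shown here):
+ * |[
+ * GList *fields = NULL;
+ * fields = g_list_append(fields, garrow_field_new("count", data_type));
+ * GArrowSchema *schema = garrow_schema_new(fields);
+ * ]|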
+ */
+
+typedef struct GArrowSchemaPrivate_ {
+  std::shared_ptr<arrow::Schema> schema;
+} GArrowSchemaPrivate;
+
+enum {
+  PROP_0,
+  PROP_SCHEMA
+};
+
+G_DEFINE_TYPE_WITH_PRIVATE(GArrowSchema,
+                           garrow_schema,
+                           G_TYPE_OBJECT)
+
+#define GARROW_SCHEMA_GET_PRIVATE(obj)                  \
+  (G_TYPE_INSTANCE_GET_PRIVATE((obj),                   \
+                               GARROW_TYPE_SCHEMA,      \
+                               GArrowSchemaPrivate))
+
+static void
+garrow_schema_finalize(GObject *object)
+{
+  GArrowSchemaPrivate *priv;
+
+  priv = GARROW_SCHEMA_GET_PRIVATE(object);
+
+  priv->schema = nullptr;
+
+  G_OBJECT_CLASS(garrow_schema_parent_class)->finalize(object);
+}
+
+static void
+garrow_schema_set_property(GObject *object,
+                           guint prop_id,
+                           const GValue *value,
+                           GParamSpec *pspec)
+{
+  GArrowSchemaPrivate *priv;
+
+  priv = GARROW_SCHEMA_GET_PRIVATE(object);
+
+  switch (prop_id) {
+  case PROP_SCHEMA:
+    priv->schema =
+      *static_cast<std::shared_ptr<arrow::Schema> *>(g_value_get_pointer(value));
+    break;
+  default:
+    G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec);
+    break;
+  }
+}
+
+static void
+garrow_schema_get_property(GObject *object,
+                           guint prop_id,
+                           GValue *value,
+                           GParamSpec *pspec)
+{
+  switch (prop_id) {
+  default:
+    G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec);
+    break;
+  }
+}
+
+static void
+garrow_schema_init(GArrowSchema *object)
+{
+}
+
+static void
+garrow_schema_class_init(GArrowSchemaClass *klass)
+{
+  GObjectClass *gobject_class;
+  GParamSpec *spec;
+
+  gobject_class = G_OBJECT_CLASS(klass);
+
+  gobject_class->finalize = garrow_schema_finalize;
+  gobject_class->set_property = garrow_schema_set_property;
+  gobject_class->get_property = garrow_schema_get_property;
+
+  spec = g_param_spec_pointer("schema",
+                              "Schema",
+                              "The raw std::shared_ptr<arrow::Schema> *",
+                              static_cast<GParamFlags>(G_PARAM_WRITABLE |
+                                                       G_PARAM_CONSTRUCT_ONLY));
+  g_object_class_install_property(gobject_class, PROP_SCHEMA, spec);
+}
+
+/**
+ * garrow_schema_new:
+ * @fields: (element-type GArrowField): The fields of the schema.
+ *
+ * Returns: A newly created #GArrowSchema.
+ */
+GArrowSchema *
+garrow_schema_new(GList *fields)
+{
+  std::vector<std::shared_ptr<arrow::Field>> arrow_fields;
+  for (GList *node = fields; node; node = node->next) {
+    GArrowField *field = GARROW_FIELD(node->data);
+    arrow_fields.push_back(garrow_field_get_raw(field));
+  }
+
+  auto arrow_schema = std::make_shared<arrow::Schema>(arrow_fields);
+  return garrow_schema_new_raw(&arrow_schema);
+}
+
+/**
+ * garrow_schema_get_field:
+ * @schema: A #GArrowSchema.
+ * @i: The index of the target field.
+ *
+ * Returns: (transfer full): The i-th field of the schema.
+ */
+GArrowField *
+garrow_schema_get_field(GArrowSchema *schema, guint i)
+{
+  const auto arrow_schema = garrow_schema_get_raw(schema);
+  auto arrow_field = arrow_schema->field(i);
+  return garrow_field_new_raw(&arrow_field);
+}
+
+/**
+ * garrow_schema_get_field_by_name:
+ * @schema: A #GArrowSchema.
+ * @name: The name of the field to be found.
+ *
+ * Returns: (transfer full): The found field or %NULL.
+ */
+GArrowField *
+garrow_schema_get_field_by_name(GArrowSchema *schema,
+                                const gchar *name)
+{
+  const auto arrow_schema = garrow_schema_get_raw(schema);
+  auto arrow_field = arrow_schema->GetFieldByName(std::string(name));
+  if (arrow_field == nullptr) {
+    return NULL;
+  } else {
+    return garrow_field_new_raw(&arrow_field);
+  }
+}
+
+/**
+ * garrow_schema_n_fields:
+ * @schema: A #GArrowSchema.
+ *
+ * Returns: The number of fields of the schema.
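+ *
+ * For example, iterating over all fields by index (a sketch; the
+ * `schema` variable is assumed):
+ * |[
+ * guint n_fields = garrow_schema_n_fields(schema);
+ * for (guint i = 0; i < n_fields; i++) {
+ *   GArrowField *field = garrow_schema_get_field(schema, i);
+ *   // ... use the field ...
+ *   g_object_unref(field);
+ * }
+ * ]|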
+ */
+guint
+garrow_schema_n_fields(GArrowSchema *schema)
+{
+  const auto arrow_schema = garrow_schema_get_raw(schema);
+  return arrow_schema->num_fields();
+}
+
+/**
+ * garrow_schema_get_fields:
+ * @schema: A #GArrowSchema.
+ *
+ * Returns: (element-type GArrowField) (transfer full):
+ *   The fields of the schema.
+ */
+GList *
+garrow_schema_get_fields(GArrowSchema *schema)
+{
+  const auto arrow_schema = garrow_schema_get_raw(schema);
+
+  GList *fields = NULL;
+  for (auto arrow_field : arrow_schema->fields()) {
+    GArrowField *field = garrow_field_new_raw(&arrow_field);
+    fields = g_list_prepend(fields, field);
+  }
+
+  return g_list_reverse(fields);
+}
+
+/**
+ * garrow_schema_to_string:
+ * @schema: A #GArrowSchema.
+ *
+ * Returns: The string representation of the schema.
+ */
+gchar *
+garrow_schema_to_string(GArrowSchema *schema)
+{
+  const auto arrow_schema = garrow_schema_get_raw(schema);
+  return g_strdup(arrow_schema->ToString().c_str());
+}
+
+G_END_DECLS
+
+GArrowSchema *
+garrow_schema_new_raw(std::shared_ptr<arrow::Schema> *arrow_schema)
+{
+  auto schema = GARROW_SCHEMA(g_object_new(GARROW_TYPE_SCHEMA,
+                                           "schema", arrow_schema,
+                                           NULL));
+  return schema;
+}
+
+std::shared_ptr<arrow::Schema>
+garrow_schema_get_raw(GArrowSchema *schema)
+{
+  GArrowSchemaPrivate *priv;
+
+  priv = GARROW_SCHEMA_GET_PRIVATE(schema);
+  return priv->schema;
+}
diff --git a/c_glib/arrow-glib/schema.h b/c_glib/arrow-glib/schema.h
new file mode 100644
index 0000000000000..7615634021bc3
--- /dev/null
+++ b/c_glib/arrow-glib/schema.h
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow-glib/field.h>
+
+G_BEGIN_DECLS
+
+#define GARROW_TYPE_SCHEMA                      \
+  (garrow_schema_get_type())
+#define GARROW_SCHEMA(obj)                              \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj),                    \
+                              GARROW_TYPE_SCHEMA,       \
+                              GArrowSchema))
+#define GARROW_SCHEMA_CLASS(klass)                      \
+  (G_TYPE_CHECK_CLASS_CAST((klass),                     \
+                           GARROW_TYPE_SCHEMA,          \
+                           GArrowSchemaClass))
+#define GARROW_IS_SCHEMA(obj)                           \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj),                    \
+                              GARROW_TYPE_SCHEMA))
+#define GARROW_IS_SCHEMA_CLASS(klass)                   \
+  (G_TYPE_CHECK_CLASS_TYPE((klass),                     \
+                           GARROW_TYPE_SCHEMA))
+#define GARROW_SCHEMA_GET_CLASS(obj)                    \
+  (G_TYPE_INSTANCE_GET_CLASS((obj),                     \
+                             GARROW_TYPE_SCHEMA,        \
+                             GArrowSchemaClass))
+
+typedef struct _GArrowSchema      GArrowSchema;
+typedef struct _GArrowSchemaClass GArrowSchemaClass;
+
+/**
+ * GArrowSchema:
+ *
+ * It wraps `arrow::Schema`.
+ */
+struct _GArrowSchema
+{
+  /*< private >*/
+  GObject parent_instance;
+};
+
+struct _GArrowSchemaClass
+{
+  GObjectClass parent_class;
+};
+
+GType garrow_schema_get_type (void) G_GNUC_CONST;
+
+GArrowSchema *garrow_schema_new (GList *fields);
+
+GArrowField *garrow_schema_get_field        (GArrowSchema *schema,
+                                             guint i);
+GArrowField *garrow_schema_get_field_by_name(GArrowSchema *schema,
+                                             const gchar *name);
+
+guint garrow_schema_n_fields   (GArrowSchema *schema);
+GList *garrow_schema_get_fields (GArrowSchema *schema);
+
+gchar *garrow_schema_to_string (GArrowSchema *schema);
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/schema.hpp b/c_glib/arrow-glib/schema.hpp
new file mode 100644
index 0000000000000..0d025340844d3
--- /dev/null
+++ b/c_glib/arrow-glib/schema.hpp
@@ -0,0 +1,27 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow/api.h>
+
+#include <arrow-glib/schema.h>
+
+GArrowSchema *garrow_schema_new_raw(std::shared_ptr<arrow::Schema> *arrow_schema);
+std::shared_ptr<arrow::Schema> garrow_schema_get_raw(GArrowSchema *schema);
diff --git a/c_glib/arrow-glib/string-array-builder.cpp b/c_glib/arrow-glib/string-array-builder.cpp
new file mode 100644
index 0000000000000..ebad53a18704a
--- /dev/null
+++ b/c_glib/arrow-glib/string-array-builder.cpp
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+# include <config.h>
+#endif
+
+#include <arrow-glib/array-builder.hpp>
+#include <arrow-glib/error.hpp>
+#include <arrow-glib/string-array-builder.h>
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: string-array-builder
+ * @short_description: UTF-8 encoded string array builder class
+ *
+ * #GArrowStringArrayBuilder is the class to create a new
+ * #GArrowStringArray.
+ */
+
+G_DEFINE_TYPE(GArrowStringArrayBuilder,
+              garrow_string_array_builder,
+              GARROW_TYPE_BINARY_ARRAY_BUILDER)
+
+static void
+garrow_string_array_builder_init(GArrowStringArrayBuilder *builder)
+{
+}
+
+static void
+garrow_string_array_builder_class_init(GArrowStringArrayBuilderClass *klass)
+{
+}
+
+/**
+ * garrow_string_array_builder_new:
+ *
+ * Returns: A newly created #GArrowStringArrayBuilder.
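+ *
+ * A build sketch (error handling is shortened for illustration):
+ * |[
+ * GError *error = NULL;
+ * GArrowStringArrayBuilder *builder = garrow_string_array_builder_new();
+ * if (!garrow_string_array_builder_append(builder, "hello", &error)) {
+ *   // report the error in `error` and bail out
+ * }
+ * ]|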
+ */
+GArrowStringArrayBuilder *
+garrow_string_array_builder_new(void)
+{
+  auto memory_pool = arrow::default_memory_pool();
+  auto arrow_builder =
+    std::make_shared<arrow::StringBuilder>(memory_pool);
+  auto builder =
+    GARROW_STRING_ARRAY_BUILDER(g_object_new(GARROW_TYPE_STRING_ARRAY_BUILDER,
+                                             "array-builder", &arrow_builder,
+                                             NULL));
+  return builder;
+}
+
+/**
+ * garrow_string_array_builder_append:
+ * @builder: A #GArrowStringArrayBuilder.
+ * @value: A string value.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ */
+gboolean
+garrow_string_array_builder_append(GArrowStringArrayBuilder *builder,
+                                   const gchar *value,
+                                   GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::StringBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->Append(value,
+                                      static_cast<gint32>(strlen(value)));
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[string-array-builder][append]");
+    return FALSE;
+  }
+}
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/string-array-builder.h b/c_glib/arrow-glib/string-array-builder.h
new file mode 100644
index 0000000000000..f370ed9edec9d
--- /dev/null
+++ b/c_glib/arrow-glib/string-array-builder.h
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow-glib/binary-array-builder.h>
+
+G_BEGIN_DECLS
+
+#define GARROW_TYPE_STRING_ARRAY_BUILDER        \
+  (garrow_string_array_builder_get_type())
+#define GARROW_STRING_ARRAY_BUILDER(obj)                        \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj),                            \
+                              GARROW_TYPE_STRING_ARRAY_BUILDER, \
+                              GArrowStringArrayBuilder))
+#define GARROW_STRING_ARRAY_BUILDER_CLASS(klass)                \
+  (G_TYPE_CHECK_CLASS_CAST((klass),                             \
+                           GARROW_TYPE_STRING_ARRAY_BUILDER,    \
+                           GArrowStringArrayBuilderClass))
+#define GARROW_IS_STRING_ARRAY_BUILDER(obj)                     \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj),                            \
+                              GARROW_TYPE_STRING_ARRAY_BUILDER))
+#define GARROW_IS_STRING_ARRAY_BUILDER_CLASS(klass)             \
+  (G_TYPE_CHECK_CLASS_TYPE((klass),                             \
+                           GARROW_TYPE_STRING_ARRAY_BUILDER))
+#define GARROW_STRING_ARRAY_BUILDER_GET_CLASS(obj)              \
+  (G_TYPE_INSTANCE_GET_CLASS((obj),                             \
+                             GARROW_TYPE_STRING_ARRAY_BUILDER,  \
+                             GArrowStringArrayBuilderClass))
+
+typedef struct _GArrowStringArrayBuilder      GArrowStringArrayBuilder;
+typedef struct _GArrowStringArrayBuilderClass GArrowStringArrayBuilderClass;
+
+/**
+ * GArrowStringArrayBuilder:
+ *
+ * It wraps `arrow::StringBuilder`.
+ */
+struct _GArrowStringArrayBuilder
+{
+  /*< private >*/
+  GArrowBinaryArrayBuilder parent_instance;
+};
+
+struct _GArrowStringArrayBuilderClass
+{
+  GArrowBinaryArrayBuilderClass parent_class;
+};
+
+GType garrow_string_array_builder_get_type(void) G_GNUC_CONST;
+
+GArrowStringArrayBuilder *garrow_string_array_builder_new(void);
+
+gboolean garrow_string_array_builder_append(GArrowStringArrayBuilder *builder,
+                                            const gchar *value,
+                                            GError **error);
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/string-array.cpp b/c_glib/arrow-glib/string-array.cpp
new file mode 100644
index 0000000000000..329c742ccafe1
--- /dev/null
+++ b/c_glib/arrow-glib/string-array.cpp
@@ -0,0 +1,74 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+# include <config.h>
+#endif
+
+#include <arrow-glib/array.hpp>
+#include <arrow-glib/string-array.h>
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: string-array
+ * @short_description: UTF-8 encoded string array class
+ *
+ * #GArrowStringArray is a class for UTF-8 encoded string array. It
+ * can store zero or more UTF-8 encoded strings.
+ *
+ * #GArrowStringArray is immutable. You need to use
+ * #GArrowStringArrayBuilder to create a new array.
+ */
+
+G_DEFINE_TYPE(GArrowStringArray, \
+              garrow_string_array, \
+              GARROW_TYPE_BINARY_ARRAY)
+
+static void
+garrow_string_array_init(GArrowStringArray *object)
+{
+}
+
+static void
+garrow_string_array_class_init(GArrowStringArrayClass *klass)
+{
+}
+
+/**
+ * garrow_string_array_get_string:
+ * @array: A #GArrowStringArray.
+ * @i: The index of the target value.
+ *
+ * Returns: The i-th UTF-8 encoded string.
+ */
+gchar *
+garrow_string_array_get_string(GArrowStringArray *array,
+                               gint64 i)
+{
+  auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array));
+  auto arrow_string_array =
+    static_cast<arrow::StringArray *>(arrow_array.get());
+  gint32 length;
+  auto value =
+    reinterpret_cast<const gchar *>(arrow_string_array->GetValue(i, &length));
+  return g_strndup(value, length);
+}
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/string-array.h b/c_glib/arrow-glib/string-array.h
new file mode 100644
index 0000000000000..41a53cd5f1d4a
--- /dev/null
+++ b/c_glib/arrow-glib/string-array.h
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow-glib/binary-array.h>
+
+G_BEGIN_DECLS
+
+#define GARROW_TYPE_STRING_ARRAY                \
+  (garrow_string_array_get_type())
+#define GARROW_STRING_ARRAY(obj)                        \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj),                    \
+                              GARROW_TYPE_STRING_ARRAY, \
+                              GArrowStringArray))
+#define GARROW_STRING_ARRAY_CLASS(klass)                \
+  (G_TYPE_CHECK_CLASS_CAST((klass),                     \
+                           GARROW_TYPE_STRING_ARRAY,    \
+                           GArrowStringArrayClass))
+#define GARROW_IS_STRING_ARRAY(obj)                     \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj),                    \
+                              GARROW_TYPE_STRING_ARRAY))
+#define GARROW_IS_STRING_ARRAY_CLASS(klass)             \
+  (G_TYPE_CHECK_CLASS_TYPE((klass),                     \
+                           GARROW_TYPE_STRING_ARRAY))
+#define GARROW_STRING_ARRAY_GET_CLASS(obj)              \
+  (G_TYPE_INSTANCE_GET_CLASS((obj),                     \
+                             GARROW_TYPE_STRING_ARRAY,  \
+                             GArrowStringArrayClass))
+
+typedef struct _GArrowStringArray      GArrowStringArray;
+typedef struct _GArrowStringArrayClass GArrowStringArrayClass;
+
+/**
+ * GArrowStringArray:
+ *
+ * It wraps `arrow::StringArray`.
+ */
+struct _GArrowStringArray
+{
+  /*< private >*/
+  GArrowBinaryArray parent_instance;
+};
+
+struct _GArrowStringArrayClass
+{
+  GArrowBinaryArrayClass parent_class;
+};
+
+GType garrow_string_array_get_type(void) G_GNUC_CONST;
+
+gchar *garrow_string_array_get_string(GArrowStringArray *array,
+                                      gint64 i);
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/string-data-type.cpp b/c_glib/arrow-glib/string-data-type.cpp
new file mode 100644
index 0000000000000..96a31bf2f906a
--- /dev/null
+++ b/c_glib/arrow-glib/string-data-type.cpp
@@ -0,0 +1,68 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+# include <config.h>
+#endif
+
+#include <arrow-glib/data-type.hpp>
+#include <arrow-glib/string-data-type.h>
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: string-data-type
+ * @short_description: UTF-8 encoded string data type
+ *
+ * #GArrowStringDataType is a class for UTF-8 encoded string data
+ * type.
+ */
+
+G_DEFINE_TYPE(GArrowStringDataType, \
+              garrow_string_data_type, \
+              GARROW_TYPE_DATA_TYPE)
+
+static void
+garrow_string_data_type_init(GArrowStringDataType *object)
+{
+}
+
+static void
+garrow_string_data_type_class_init(GArrowStringDataTypeClass *klass)
+{
+}
+
+/**
+ * garrow_string_data_type_new:
+ *
+ * Returns: The newly created UTF-8 encoded string data type.
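+ *
+ * A minimal sketch (for illustration only):
+ * |[
+ * GArrowStringDataType *data_type = garrow_string_data_type_new();
+ * // Use it, e.g., as the data type of a field, then:
+ * g_object_unref(data_type);
+ * ]|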
+ */
+GArrowStringDataType *
+garrow_string_data_type_new(void)
+{
+  auto arrow_data_type = arrow::utf8();
+
+  GArrowStringDataType *data_type =
+    GARROW_STRING_DATA_TYPE(g_object_new(GARROW_TYPE_STRING_DATA_TYPE,
+                                         "data-type", &arrow_data_type,
+                                         NULL));
+  return data_type;
+}
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/string-data-type.h b/c_glib/arrow-glib/string-data-type.h
new file mode 100644
index 0000000000000..d10a325e1bb6c
--- /dev/null
+++ b/c_glib/arrow-glib/string-data-type.h
@@ -0,0 +1,69 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow-glib/data-type.h>
+
+G_BEGIN_DECLS
+
+#define GARROW_TYPE_STRING_DATA_TYPE            \
+  (garrow_string_data_type_get_type())
+#define GARROW_STRING_DATA_TYPE(obj)                            \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj),                            \
+                              GARROW_TYPE_STRING_DATA_TYPE,     \
+                              GArrowStringDataType))
+#define GARROW_STRING_DATA_TYPE_CLASS(klass)                    \
+  (G_TYPE_CHECK_CLASS_CAST((klass),                             \
+                           GARROW_TYPE_STRING_DATA_TYPE,        \
+                           GArrowStringDataTypeClass))
+#define GARROW_IS_STRING_DATA_TYPE(obj)                         \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj),                            \
+                              GARROW_TYPE_STRING_DATA_TYPE))
+#define GARROW_IS_STRING_DATA_TYPE_CLASS(klass)                 \
+  (G_TYPE_CHECK_CLASS_TYPE((klass),                             \
+                           GARROW_TYPE_STRING_DATA_TYPE))
+#define GARROW_STRING_DATA_TYPE_GET_CLASS(obj)                  \
+  (G_TYPE_INSTANCE_GET_CLASS((obj),                             \
+                             GARROW_TYPE_STRING_DATA_TYPE,      \
+                             GArrowStringDataTypeClass))
+
+typedef struct _GArrowStringDataType      GArrowStringDataType;
+typedef struct _GArrowStringDataTypeClass GArrowStringDataTypeClass;
+
+/**
+ * GArrowStringDataType:
+ *
+ * It wraps `arrow::StringType`.
+ */
+struct _GArrowStringDataType
+{
+  /*< private >*/
+  GArrowDataType parent_instance;
+};
+
+struct _GArrowStringDataTypeClass
+{
+  GArrowDataTypeClass parent_class;
+};
+
+GType garrow_string_data_type_get_type (void) G_GNUC_CONST;
+GArrowStringDataType *garrow_string_data_type_new (void);
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/struct-array-builder.cpp b/c_glib/arrow-glib/struct-array-builder.cpp
new file mode 100644
index 0000000000000..2453a5baf2ec8
--- /dev/null
+++ b/c_glib/arrow-glib/struct-array-builder.cpp
@@ -0,0 +1,187 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+# include <config.h>
+#endif
+
+#include <arrow-glib/array-builder.hpp>
+#include <arrow-glib/data-type.hpp>
+#include <arrow-glib/error.hpp>
+#include <arrow-glib/struct-array-builder.h>
+#include <arrow-glib/struct-data-type.h>
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: struct-array-builder
+ * @short_description: Struct array builder class
+ * @include: arrow-glib/arrow-glib.h
+ *
+ * #GArrowStructArrayBuilder is the class to create a new
+ * #GArrowStructArray.
+ */
+
+G_DEFINE_TYPE(GArrowStructArrayBuilder,
+              garrow_struct_array_builder,
+              GARROW_TYPE_ARRAY_BUILDER)
+
+static void
+garrow_struct_array_builder_init(GArrowStructArrayBuilder *builder)
+{
+}
+
+static void
+garrow_struct_array_builder_class_init(GArrowStructArrayBuilderClass *klass)
+{
+}
+
+/**
+ * garrow_struct_array_builder_new:
+ * @data_type: #GArrowStructDataType for the struct.
+ * @field_builders: (element-type GArrowArrayBuilder): #GArrowArrayBuilders
+ *   for fields.
+ *
+ * Returns: A newly created #GArrowStructArrayBuilder.
+ */
+GArrowStructArrayBuilder *
+garrow_struct_array_builder_new(GArrowStructDataType *data_type,
+                                GList *field_builders)
+{
+  auto memory_pool = arrow::default_memory_pool();
+  auto arrow_data_type = garrow_data_type_get_raw(GARROW_DATA_TYPE(data_type));
+  std::vector<std::shared_ptr<arrow::ArrayBuilder>> arrow_field_builders;
+  for (GList *node = field_builders; node; node = g_list_next(node)) {
+    auto field_builder = static_cast<GArrowArrayBuilder *>(node->data);
+    auto arrow_field_builder = garrow_array_builder_get_raw(field_builder);
+    arrow_field_builders.push_back(arrow_field_builder);
+  }
+
+  auto arrow_struct_builder =
+    std::make_shared<arrow::StructBuilder>(memory_pool,
+                                           arrow_data_type,
+                                           arrow_field_builders);
+  std::shared_ptr<arrow::ArrayBuilder> arrow_builder = arrow_struct_builder;
+  auto builder = garrow_array_builder_new_raw(&arrow_builder);
+  return GARROW_STRUCT_ARRAY_BUILDER(builder);
+}
+
+/**
+ * garrow_struct_array_builder_append:
+ * @builder: A #GArrowStructArrayBuilder.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ *
+ * It appends a new struct element. To append a new struct element,
+ * call this function, then append a value for each field to its
+ * corresponding `field_builder`. The `field_builder`s are the
+ * #GArrowArrayBuilders passed to the constructor. You can get a
+ * `field_builder` by garrow_struct_array_builder_get_field_builder()
+ * or garrow_struct_array_builder_get_field_builders().
+ *
+ * |[
+ * // TODO
+ * ]|
+ */
+gboolean
+garrow_struct_array_builder_append(GArrowStructArrayBuilder *builder,
+                                   GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::StructBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->Append();
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[struct-array-builder][append]");
+    return FALSE;
+  }
+}
+
+/**
+ * garrow_struct_array_builder_append_null:
+ * @builder: A #GArrowStructArrayBuilder.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ *
+ * It appends a new NULL element.
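+ *
+ * A sketch (the `builder` variable is assumed):
+ * |[
+ * GError *error = NULL;
+ * if (!garrow_struct_array_builder_append_null(builder, &error)) {
+ *   // report the error in `error` and bail out
+ * }
+ * ]|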
+ */
+gboolean
+garrow_struct_array_builder_append_null(GArrowStructArrayBuilder *builder,
+                                        GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::StructBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->AppendNull();
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[struct-array-builder][append-null]");
+    return FALSE;
+  }
+}
+
+/**
+ * garrow_struct_array_builder_get_field_builder:
+ * @builder: A #GArrowStructArrayBuilder.
+ * @i: The index of the field in the struct.
+ *
+ * Returns: (transfer full): The #GArrowArrayBuilder for the i-th field.
+ */
+GArrowArrayBuilder *
+garrow_struct_array_builder_get_field_builder(GArrowStructArrayBuilder *builder,
+                                              gint i)
+{
+  auto arrow_builder =
+    static_cast<arrow::StructBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+  auto arrow_field_builder = arrow_builder->field_builder(i);
+  return garrow_array_builder_new_raw(&arrow_field_builder);
+}
+
+/**
+ * garrow_struct_array_builder_get_field_builders:
+ * @builder: A #GArrowStructArrayBuilder.
+ *
+ * Returns: (element-type GArrowArrayBuilder) (transfer full):
+ *   The #GArrowArrayBuilders for all fields.
+ */
+GList *
+garrow_struct_array_builder_get_field_builders(GArrowStructArrayBuilder *builder)
+{
+  auto arrow_struct_builder =
+    static_cast<arrow::StructBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  GList *field_builders = NULL;
+  for (auto arrow_field_builder : arrow_struct_builder->field_builders()) {
+    auto field_builder = garrow_array_builder_new_raw(&arrow_field_builder);
+    field_builders = g_list_prepend(field_builders, field_builder);
+  }
+
+  return g_list_reverse(field_builders);
+}
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/struct-array-builder.h b/c_glib/arrow-glib/struct-array-builder.h
new file mode 100644
index 0000000000000..7dd86625616e3
--- /dev/null
+++ b/c_glib/arrow-glib/struct-array-builder.h
@@ -0,0 +1,81 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow-glib/array-builder.h>
+#include <arrow-glib/struct-data-type.h>
+
+G_BEGIN_DECLS
+
+#define GARROW_TYPE_STRUCT_ARRAY_BUILDER        \
+  (garrow_struct_array_builder_get_type())
+#define GARROW_STRUCT_ARRAY_BUILDER(obj)                        \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj),                            \
+                              GARROW_TYPE_STRUCT_ARRAY_BUILDER, \
+                              GArrowStructArrayBuilder))
+#define GARROW_STRUCT_ARRAY_BUILDER_CLASS(klass)                \
+  (G_TYPE_CHECK_CLASS_CAST((klass),                             \
+                           GARROW_TYPE_STRUCT_ARRAY_BUILDER,    \
+                           GArrowStructArrayBuilderClass))
+#define GARROW_IS_STRUCT_ARRAY_BUILDER(obj)                     \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj),                            \
+                              GARROW_TYPE_STRUCT_ARRAY_BUILDER))
+#define GARROW_IS_STRUCT_ARRAY_BUILDER_CLASS(klass)             \
+  (G_TYPE_CHECK_CLASS_TYPE((klass),                             \
+                           GARROW_TYPE_STRUCT_ARRAY_BUILDER))
+#define GARROW_STRUCT_ARRAY_BUILDER_GET_CLASS(obj)              \
+  (G_TYPE_INSTANCE_GET_CLASS((obj),                             \
+                             GARROW_TYPE_STRUCT_ARRAY_BUILDER,  \
+                             GArrowStructArrayBuilderClass))
+
+typedef struct _GArrowStructArrayBuilder      GArrowStructArrayBuilder;
+typedef struct _GArrowStructArrayBuilderClass GArrowStructArrayBuilderClass;
+
+/**
+ * GArrowStructArrayBuilder:
+ *
+ * It wraps `arrow::StructBuilder`.
+ */
+struct _GArrowStructArrayBuilder
+{
+  /*< private >*/
+  GArrowArrayBuilder parent_instance;
+};
+
+struct _GArrowStructArrayBuilderClass
+{
+  GArrowArrayBuilderClass parent_class;
+};
+
+GType garrow_struct_array_builder_get_type(void) G_GNUC_CONST;
+
+GArrowStructArrayBuilder *garrow_struct_array_builder_new(GArrowStructDataType *data_type,
+                                                          GList *field_builders);
+
+gboolean garrow_struct_array_builder_append(GArrowStructArrayBuilder *builder,
+                                            GError **error);
+gboolean garrow_struct_array_builder_append_null(GArrowStructArrayBuilder *builder,
+                                                 GError **error);
+
+GArrowArrayBuilder *garrow_struct_array_builder_get_field_builder(GArrowStructArrayBuilder *builder,
+                                                                  gint i);
+GList *garrow_struct_array_builder_get_field_builders(GArrowStructArrayBuilder *builder);
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/struct-array.cpp b/c_glib/arrow-glib/struct-array.cpp
new file mode 100644
index 0000000000000..14c2d17cdd737
--- /dev/null
+++ b/c_glib/arrow-glib/struct-array.cpp
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#ifdef HAVE_CONFIG_H
+# include <config.h>
+#endif
+
+#include <arrow-glib/array.hpp>
+#include <arrow-glib/data-type.hpp>
+#include <arrow-glib/struct-array.h>
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: struct-array
+ * @short_description: Struct array class
+ * @include: arrow-glib/arrow-glib.h
+ *
+ * #GArrowStructArray is a class for struct array. It can store zero
+ * or more structs. Each struct has zero or more fields.
+ *
+ * #GArrowStructArray is immutable. You need to use
+ * #GArrowStructArrayBuilder to create a new array.
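+ *
+ * A read sketch (the `struct_array` variable is assumed):
+ * |[
+ * GList *fields = garrow_struct_array_get_fields(struct_array);
+ * for (GList *node = fields; node; node = node->next) {
+ *   GArrowArray *field = GARROW_ARRAY(node->data);
+ *   // ... use each field array ...
+ * }
+ * g_list_free_full(fields, g_object_unref);
+ * ]|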
+ */
+
+G_DEFINE_TYPE(GArrowStructArray,            \
+              garrow_struct_array,          \
+              GARROW_TYPE_ARRAY)
+
+static void
+garrow_struct_array_init(GArrowStructArray *object)
+{
+}
+
+static void
+garrow_struct_array_class_init(GArrowStructArrayClass *klass)
+{
+}
+
+/**
+ * garrow_struct_array_get_field:
+ * @array: A #GArrowStructArray.
+ * @i: The index of the field in the struct.
+ *
+ * Returns: (transfer full): The i-th field.
+ */
+GArrowArray *
+garrow_struct_array_get_field(GArrowStructArray *array,
+                              gint i)
+{
+  auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array));
+  auto arrow_struct_array =
+    static_cast<arrow::StructArray *>(arrow_array.get());
+  auto arrow_field = arrow_struct_array->field(i);
+  return garrow_array_new_raw(&arrow_field);
+}
+
+/**
+ * garrow_struct_array_get_fields:
+ * @array: A #GArrowStructArray.
+ *
+ * Returns: (element-type GArrowArray) (transfer full):
+ *   The fields in the struct.
+ */
+GList *
+garrow_struct_array_get_fields(GArrowStructArray *array)
+{
+  const auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array));
+  const auto arrow_struct_array =
+    static_cast<arrow::StructArray *>(arrow_array.get());
+
+  GList *fields = NULL;
+  for (auto arrow_field : arrow_struct_array->fields()) {
+    GArrowArray *field = garrow_array_new_raw(&arrow_field);
+    fields = g_list_prepend(fields, field);
+  }
+
+  return g_list_reverse(fields);
+}
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/struct-array.h b/c_glib/arrow-glib/struct-array.h
new file mode 100644
index 0000000000000..f96e9d468f350
--- /dev/null
+++ b/c_glib/arrow-glib/struct-array.h
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow-glib/array.h>
+#include
+
+G_BEGIN_DECLS
+
+#define GARROW_TYPE_STRUCT_ARRAY                \
+  (garrow_struct_array_get_type())
+#define GARROW_STRUCT_ARRAY(obj)                                \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj),                            \
+                              GARROW_TYPE_STRUCT_ARRAY,         \
+                              GArrowStructArray))
+#define GARROW_STRUCT_ARRAY_CLASS(klass)                        \
+  (G_TYPE_CHECK_CLASS_CAST((klass),                             \
+                           GARROW_TYPE_STRUCT_ARRAY,            \
+                           GArrowStructArrayClass))
+#define GARROW_IS_STRUCT_ARRAY(obj)                             \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj),                            \
+                              GARROW_TYPE_STRUCT_ARRAY))
+#define GARROW_IS_STRUCT_ARRAY_CLASS(klass)                     \
+  (G_TYPE_CHECK_CLASS_TYPE((klass),                             \
+                           GARROW_TYPE_STRUCT_ARRAY))
+#define GARROW_STRUCT_ARRAY_GET_CLASS(obj)                      \
+  (G_TYPE_INSTANCE_GET_CLASS((obj),                             \
+                             GARROW_TYPE_STRUCT_ARRAY,          \
+                             GArrowStructArrayClass))
+
+typedef struct _GArrowStructArray GArrowStructArray;
+typedef struct _GArrowStructArrayClass GArrowStructArrayClass;
+
+/**
+ * GArrowStructArray:
+ *
+ * It wraps `arrow::StructArray`.
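+ *
+ * For illustration, a hypothetical read-side sketch (the struct_array
+ * itself comes from a #GArrowStructArrayBuilder):
+ *
+ * |[<!-- language="C" -->
+ * GArrowArray *field;
+ * GList *fields;
+ *
+ * field = garrow_struct_array_get_field(struct_array, 0);
+ * fields = garrow_struct_array_get_fields(struct_array);
+ * g_list_free_full(fields, g_object_unref);
+ * g_object_unref(field);
+ * ]|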
+ */ +struct _GArrowStructArray +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowStructArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_struct_array_get_type(void) G_GNUC_CONST; + +GArrowArray *garrow_struct_array_get_field(GArrowStructArray *array, + gint i); +GList *garrow_struct_array_get_fields(GArrowStructArray *array); + +G_END_DECLS diff --git a/c_glib/arrow-glib/struct-data-type.cpp b/c_glib/arrow-glib/struct-data-type.cpp new file mode 100644 index 0000000000000..9a4f2a2deead0 --- /dev/null +++ b/c_glib/arrow-glib/struct-data-type.cpp @@ -0,0 +1,75 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: struct-data-type + * @short_description: Struct data type + * + * #GArrowStructDataType is a class for struct data type. + */ + +G_DEFINE_TYPE(GArrowStructDataType, \ + garrow_struct_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_struct_data_type_init(GArrowStructDataType *object) +{ +} + +static void +garrow_struct_data_type_class_init(GArrowStructDataTypeClass *klass) +{ +} + +/** + * garrow_struct_data_type_new: + * @fields: (element-type GArrowField): The fields of the struct. + * + * Returns: The newly created struct data type. + */ +GArrowStructDataType * +garrow_struct_data_type_new(GList *fields) +{ + std::vector> arrow_fields; + for (GList *node = fields; node; node = g_list_next(node)) { + auto field = GARROW_FIELD(node->data); + auto arrow_field = garrow_field_get_raw(field); + arrow_fields.push_back(arrow_field); + } + + auto arrow_data_type = std::make_shared(arrow_fields); + GArrowStructDataType *data_type = + GARROW_STRUCT_DATA_TYPE(g_object_new(GARROW_TYPE_STRUCT_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/struct-data-type.h b/c_glib/arrow-glib/struct-data-type.h new file mode 100644 index 0000000000000..0a2c743e280b7 --- /dev/null +++ b/c_glib/arrow-glib/struct-data-type.h @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_STRUCT_DATA_TYPE \ + (garrow_struct_data_type_get_type()) +#define GARROW_STRUCT_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_STRUCT_DATA_TYPE, \ + GArrowStructDataType)) +#define GARROW_STRUCT_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_STRUCT_DATA_TYPE, \ + GArrowStructDataTypeClass)) +#define GARROW_IS_STRUCT_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_STRUCT_DATA_TYPE)) +#define GARROW_IS_STRUCT_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_STRUCT_DATA_TYPE)) +#define GARROW_STRUCT_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_STRUCT_DATA_TYPE, \ + GArrowStructDataTypeClass)) + +typedef struct _GArrowStructDataType GArrowStructDataType; +typedef struct _GArrowStructDataTypeClass GArrowStructDataTypeClass; + +/** + * GArrowStructDataType: + * + * It wraps `arrow::StructType`. + */ +struct _GArrowStructDataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowStructDataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_struct_data_type_get_type (void) G_GNUC_CONST; + +GArrowStructDataType *garrow_struct_data_type_new(GList *fields); + +G_END_DECLS diff --git a/c_glib/arrow-glib/table.cpp b/c_glib/arrow-glib/table.cpp new file mode 100644 index 0000000000000..2410e76c921fb --- /dev/null +++ b/c_glib/arrow-glib/table.cpp @@ -0,0 +1,240 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: table + * @short_description: Table class + * + * #GArrowTable is a class for table. Table has zero or more + * #GArrowColumns and zero or more records. 
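+ *
+ * For illustration, a construction sketch; the schema and the column
+ * are assumed to be built with the schema and column APIs added
+ * elsewhere in this patch:
+ *
+ * |[<!-- language="C" -->
+ * GList *columns = NULL;
+ * GArrowTable *table;
+ *
+ * columns = g_list_append(columns, column);
+ * table = garrow_table_new("memos", schema, columns);
+ * g_list_free(columns);
+ * ]|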
+ */
+
+typedef struct GArrowTablePrivate_ {
+  std::shared_ptr<arrow::Table> table;
+} GArrowTablePrivate;
+
+enum {
+  PROP_0,
+  PROP_TABLE
+};
+
+G_DEFINE_TYPE_WITH_PRIVATE(GArrowTable,
+                           garrow_table,
+                           G_TYPE_OBJECT)
+
+#define GARROW_TABLE_GET_PRIVATE(obj)                   \
+  (G_TYPE_INSTANCE_GET_PRIVATE((obj),                   \
+                               GARROW_TYPE_TABLE,       \
+                               GArrowTablePrivate))
+
+static void
+garrow_table_dispose(GObject *object)
+{
+  GArrowTablePrivate *priv;
+
+  priv = GARROW_TABLE_GET_PRIVATE(object);
+
+  priv->table = nullptr;
+
+  G_OBJECT_CLASS(garrow_table_parent_class)->dispose(object);
+}
+
+static void
+garrow_table_set_property(GObject *object,
+                          guint prop_id,
+                          const GValue *value,
+                          GParamSpec *pspec)
+{
+  GArrowTablePrivate *priv;
+
+  priv = GARROW_TABLE_GET_PRIVATE(object);
+
+  switch (prop_id) {
+  case PROP_TABLE:
+    priv->table =
+      *static_cast<std::shared_ptr<arrow::Table> *>(g_value_get_pointer(value));
+    break;
+  default:
+    G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec);
+    break;
+  }
+}
+
+static void
+garrow_table_get_property(GObject *object,
+                          guint prop_id,
+                          GValue *value,
+                          GParamSpec *pspec)
+{
+  switch (prop_id) {
+  default:
+    G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec);
+    break;
+  }
+}
+
+static void
+garrow_table_init(GArrowTable *object)
+{
+}
+
+static void
+garrow_table_class_init(GArrowTableClass *klass)
+{
+  GObjectClass *gobject_class;
+  GParamSpec *spec;
+
+  gobject_class = G_OBJECT_CLASS(klass);
+
+  gobject_class->dispose = garrow_table_dispose;
+  gobject_class->set_property = garrow_table_set_property;
+  gobject_class->get_property = garrow_table_get_property;
+
+  spec = g_param_spec_pointer("table",
+                              "Table",
+                              "The raw std::shared_ptr<arrow::Table> *",
+                              static_cast<GParamFlags>(G_PARAM_WRITABLE |
+                                                       G_PARAM_CONSTRUCT_ONLY));
+  g_object_class_install_property(gobject_class, PROP_TABLE, spec);
+}
+
+/**
+ * garrow_table_new:
+ * @name: The name of the table.
+ * @schema: The schema of the table.
+ * @columns: (element-type GArrowColumn): The columns of the table.
+ *
+ * Returns: A newly created #GArrowTable.
+ */
+GArrowTable *
+garrow_table_new(const gchar *name,
+                 GArrowSchema *schema,
+                 GList *columns)
+{
+  std::vector<std::shared_ptr<arrow::Column>> arrow_columns;
+  for (GList *node = columns; node; node = node->next) {
+    GArrowColumn *column = GARROW_COLUMN(node->data);
+    arrow_columns.push_back(garrow_column_get_raw(column));
+  }
+
+  auto arrow_table =
+    std::make_shared<arrow::Table>(name,
+                                   garrow_schema_get_raw(schema),
+                                   arrow_columns);
+  return garrow_table_new_raw(&arrow_table);
+}
+
+/**
+ * garrow_table_get_name:
+ * @table: A #GArrowTable.
+ *
+ * Returns: The name of the table.
+ */
+const gchar *
+garrow_table_get_name(GArrowTable *table)
+{
+  const auto arrow_table = garrow_table_get_raw(table);
+  return arrow_table->name().c_str();
+}
+
+/**
+ * garrow_table_get_schema:
+ * @table: A #GArrowTable.
+ *
+ * Returns: (transfer full): The schema of the table.
+ */
+GArrowSchema *
+garrow_table_get_schema(GArrowTable *table)
+{
+  const auto arrow_table = garrow_table_get_raw(table);
+  auto arrow_schema = arrow_table->schema();
+  return garrow_schema_new_raw(&arrow_schema);
+}
+
+/**
+ * garrow_table_get_column:
+ * @table: A #GArrowTable.
+ * @i: The index of the target column.
+ *
+ * Returns: (transfer full): The i-th column in the table.
+ */
+GArrowColumn *
+garrow_table_get_column(GArrowTable *table,
+                        guint i)
+{
+  const auto arrow_table = garrow_table_get_raw(table);
+  auto arrow_column = arrow_table->column(i);
+  return garrow_column_new_raw(&arrow_column);
+}
+
+/**
+ * garrow_table_get_n_columns:
+ * @table: A #GArrowTable.
+ *
+ * Returns: The number of columns in the table.
+ */
+guint
+garrow_table_get_n_columns(GArrowTable *table)
+{
+  const auto arrow_table = garrow_table_get_raw(table);
+  return arrow_table->num_columns();
+}
+
+/**
+ * garrow_table_get_n_rows:
+ * @table: A #GArrowTable.
+ *
+ * Returns: The number of rows in the table.
+ */
+guint64
+garrow_table_get_n_rows(GArrowTable *table)
+{
+  const auto arrow_table = garrow_table_get_raw(table);
+  return arrow_table->num_rows();
+}
+
+G_END_DECLS
+
+GArrowTable *
+garrow_table_new_raw(std::shared_ptr<arrow::Table> *arrow_table)
+{
+  auto table = GARROW_TABLE(g_object_new(GARROW_TYPE_TABLE,
+                                         "table", arrow_table,
+                                         NULL));
+  return table;
+}
+
+std::shared_ptr<arrow::Table>
+garrow_table_get_raw(GArrowTable *table)
+{
+  GArrowTablePrivate *priv;
+
+  priv = GARROW_TABLE_GET_PRIVATE(table);
+  return priv->table;
+}
diff --git a/c_glib/arrow-glib/table.h b/c_glib/arrow-glib/table.h
new file mode 100644
index 0000000000000..34a89a78abcbb
--- /dev/null
+++ b/c_glib/arrow-glib/table.h
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow-glib/column.h>
+#include <arrow-glib/schema.h>
+
+G_BEGIN_DECLS
+
+#define GARROW_TYPE_TABLE                       \
+  (garrow_table_get_type())
+#define GARROW_TABLE(obj)                               \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj),                    \
+                              GARROW_TYPE_TABLE,        \
+                              GArrowTable))
+#define GARROW_TABLE_CLASS(klass)                       \
+  (G_TYPE_CHECK_CLASS_CAST((klass),                     \
+                           GARROW_TYPE_TABLE,           \
+                           GArrowTableClass))
+#define GARROW_IS_TABLE(obj)                            \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj),                    \
+                              GARROW_TYPE_TABLE))
+#define GARROW_IS_TABLE_CLASS(klass)                    \
+  (G_TYPE_CHECK_CLASS_TYPE((klass),                     \
+                           GARROW_TYPE_TABLE))
+#define GARROW_TABLE_GET_CLASS(obj)                     \
+  (G_TYPE_INSTANCE_GET_CLASS((obj),                     \
+                             GARROW_TYPE_TABLE,         \
+                             GArrowTableClass))
+
+typedef struct _GArrowTable GArrowTable;
+typedef struct _GArrowTableClass GArrowTableClass;
+
+/**
+ * GArrowTable:
+ *
+ * It wraps `arrow::Table`.
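+ *
+ * For illustration (the table comes from garrow_table_new() or a
+ * reader):
+ *
+ * |[<!-- language="C" -->
+ * guint n_columns = garrow_table_get_n_columns(table);
+ * guint64 n_rows = garrow_table_get_n_rows(table);
+ * GArrowSchema *schema = garrow_table_get_schema(table);
+ * g_object_unref(schema);
+ * ]|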
+ */ +struct _GArrowTable +{ + /*< private >*/ + GObject parent_instance; +}; + +struct _GArrowTableClass +{ + GObjectClass parent_class; +}; + +GType garrow_table_get_type (void) G_GNUC_CONST; + +GArrowTable *garrow_table_new (const gchar *name, + GArrowSchema *schema, + GList *columns); + +const gchar *garrow_table_get_name (GArrowTable *table); +GArrowSchema *garrow_table_get_schema (GArrowTable *table); +GArrowColumn *garrow_table_get_column (GArrowTable *table, + guint i); +guint garrow_table_get_n_columns (GArrowTable *table); +guint64 garrow_table_get_n_rows (GArrowTable *table); + +G_END_DECLS diff --git a/c_glib/arrow-glib/table.hpp b/c_glib/arrow-glib/table.hpp new file mode 100644 index 0000000000000..22b0fad502456 --- /dev/null +++ b/c_glib/arrow-glib/table.hpp @@ -0,0 +1,27 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +#include + +GArrowTable *garrow_table_new_raw(std::shared_ptr *arrow_table); +std::shared_ptr garrow_table_get_raw(GArrowTable *table); diff --git a/c_glib/arrow-glib/type.cpp b/c_glib/arrow-glib/type.cpp new file mode 100644 index 0000000000000..56cbc212211eb --- /dev/null +++ b/c_glib/arrow-glib/type.cpp @@ -0,0 +1,90 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include + +/** + * SECTION: type + * @title: GArrowType + * @short_description: Type mapping between Arrow and arrow-glib + * + * #GArrowType provides types corresponding to `arrow::Type::type` + * values. 
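+ *
+ * For illustration, a C-side sketch of dispatching on the public enum;
+ * the garrow_array_get_value_type() accessor is hypothetical, standing
+ * in for whatever array.h exposes:
+ *
+ * |[<!-- language="C" -->
+ * switch (garrow_array_get_value_type(array)) { // hypothetical accessor
+ * case GARROW_TYPE_UINT16:
+ *   g_print("16-bit unsigned integer\n");
+ *   break;
+ * default:
+ *   break;
+ * }
+ * ]|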
+ */ + +GArrowType +garrow_type_from_raw(arrow::Type::type type) +{ + switch (type) { + case arrow::Type::type::NA: + return GARROW_TYPE_NA; + case arrow::Type::type::BOOL: + return GARROW_TYPE_BOOL; + case arrow::Type::type::UINT8: + return GARROW_TYPE_UINT8; + case arrow::Type::type::INT8: + return GARROW_TYPE_INT8; + case arrow::Type::type::UINT16: + return GARROW_TYPE_UINT16; + case arrow::Type::type::INT16: + return GARROW_TYPE_INT16; + case arrow::Type::type::UINT32: + return GARROW_TYPE_UINT32; + case arrow::Type::type::INT32: + return GARROW_TYPE_INT32; + case arrow::Type::type::UINT64: + return GARROW_TYPE_UINT64; + case arrow::Type::type::INT64: + return GARROW_TYPE_INT64; + case arrow::Type::type::HALF_FLOAT: + return GARROW_TYPE_HALF_FLOAT; + case arrow::Type::type::FLOAT: + return GARROW_TYPE_FLOAT; + case arrow::Type::type::DOUBLE: + return GARROW_TYPE_DOUBLE; + case arrow::Type::type::STRING: + return GARROW_TYPE_STRING; + case arrow::Type::type::BINARY: + return GARROW_TYPE_BINARY; + case arrow::Type::type::DATE: + return GARROW_TYPE_DATE; + case arrow::Type::type::TIMESTAMP: + return GARROW_TYPE_TIMESTAMP; + case arrow::Type::type::TIME: + return GARROW_TYPE_TIME; + case arrow::Type::type::INTERVAL: + return GARROW_TYPE_INTERVAL; + case arrow::Type::type::DECIMAL: + return GARROW_TYPE_DECIMAL; + case arrow::Type::type::LIST: + return GARROW_TYPE_LIST; + case arrow::Type::type::STRUCT: + return GARROW_TYPE_STRUCT; + case arrow::Type::type::UNION: + return GARROW_TYPE_UNION; + case arrow::Type::type::DICTIONARY: + return GARROW_TYPE_DICTIONARY; + default: + return GARROW_TYPE_NA; + } +} diff --git a/c_glib/arrow-glib/type.h b/c_glib/arrow-glib/type.h new file mode 100644 index 0000000000000..48d2801dad42c --- /dev/null +++ b/c_glib/arrow-glib/type.h @@ -0,0 +1,84 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +/** + * GArrowType: + * @GARROW_TYPE_NA: A degenerate NULL type represented as 0 bytes/bits. + * @GARROW_TYPE_BOOL: A boolean value represented as 1 bit. + * @GARROW_TYPE_UINT8: Little-endian 8bit unsigned integer. + * @GARROW_TYPE_INT8: Little-endian 8bit signed integer. + * @GARROW_TYPE_UINT16: Little-endian 16bit unsigned integer. + * @GARROW_TYPE_INT16: Little-endian 16bit signed integer. + * @GARROW_TYPE_UINT32: Little-endian 32bit unsigned integer. + * @GARROW_TYPE_INT32: Little-endian 32bit signed integer. + * @GARROW_TYPE_UINT64: Little-endian 64bit unsigned integer. + * @GARROW_TYPE_INT64: Little-endian 64bit signed integer. + * @GARROW_TYPE_HALF_FLOAT: 2-byte floating point value. + * @GARROW_TYPE_FLOAT: 4-byte floating point value. + * @GARROW_TYPE_DOUBLE: 8-byte floating point value. + * @GARROW_TYPE_STRING: UTF-8 variable-length string. 
+ * @GARROW_TYPE_BINARY: Variable-length bytes (no guarantee of UTF-8-ness). + * @GARROW_TYPE_DATE: By default, int32 days since the UNIX epoch. + * @GARROW_TYPE_TIMESTAMP: Exact timestamp encoded with int64 since UNIX epoch. + * Default unit millisecond. + * @GARROW_TYPE_TIME: Exact time encoded with int64, default unit millisecond. + * @GARROW_TYPE_INTERVAL: YEAR_MONTH or DAY_TIME interval in SQL style. + * @GARROW_TYPE_DECIMAL: Precision- and scale-based decimal + * type. Storage type depends on the parameters. + * @GARROW_TYPE_LIST: A list of some logical data type. + * @GARROW_TYPE_STRUCT: Struct of logical types. + * @GARROW_TYPE_UNION: Unions of logical types. + * @GARROW_TYPE_DICTIONARY: Dictionary aka Category type. + * + * They are corresponding to `arrow::Type::type` values. + */ +typedef enum { + GARROW_TYPE_NA, + GARROW_TYPE_BOOL, + GARROW_TYPE_UINT8, + GARROW_TYPE_INT8, + GARROW_TYPE_UINT16, + GARROW_TYPE_INT16, + GARROW_TYPE_UINT32, + GARROW_TYPE_INT32, + GARROW_TYPE_UINT64, + GARROW_TYPE_INT64, + GARROW_TYPE_HALF_FLOAT, + GARROW_TYPE_FLOAT, + GARROW_TYPE_DOUBLE, + GARROW_TYPE_STRING, + GARROW_TYPE_BINARY, + GARROW_TYPE_DATE, + GARROW_TYPE_TIMESTAMP, + GARROW_TYPE_TIME, + GARROW_TYPE_INTERVAL, + GARROW_TYPE_DECIMAL, + GARROW_TYPE_LIST, + GARROW_TYPE_STRUCT, + GARROW_TYPE_UNION, + GARROW_TYPE_DICTIONARY +} GArrowType; + +G_END_DECLS diff --git a/c_glib/arrow-glib/type.hpp b/c_glib/arrow-glib/type.hpp new file mode 100644 index 0000000000000..2a452be6dd854 --- /dev/null +++ b/c_glib/arrow-glib/type.hpp @@ -0,0 +1,26 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +#include + +GArrowType garrow_type_from_raw(arrow::Type::type type); diff --git a/c_glib/arrow-glib/uint16-array-builder.cpp b/c_glib/arrow-glib/uint16-array-builder.cpp new file mode 100644 index 0000000000000..bfade2de7a84d --- /dev/null +++ b/c_glib/arrow-glib/uint16-array-builder.cpp @@ -0,0 +1,120 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */
+
+#ifdef HAVE_CONFIG_H
+# include <config.h>
+#endif
+
+#include <arrow-glib/array-builder.hpp>
+#include <arrow-glib/error.hpp>
+#include <arrow-glib/uint16-array-builder.h>
+
+G_BEGIN_DECLS
+
+/**
+ * SECTION: uint16-array-builder
+ * @short_description: 16-bit unsigned integer array builder class
+ *
+ * #GArrowUInt16ArrayBuilder is the class to create a new
+ * #GArrowUInt16Array.
+ */
+
+G_DEFINE_TYPE(GArrowUInt16ArrayBuilder,
+              garrow_uint16_array_builder,
+              GARROW_TYPE_ARRAY_BUILDER)
+
+static void
+garrow_uint16_array_builder_init(GArrowUInt16ArrayBuilder *builder)
+{
+}
+
+static void
+garrow_uint16_array_builder_class_init(GArrowUInt16ArrayBuilderClass *klass)
+{
+}
+
+/**
+ * garrow_uint16_array_builder_new:
+ *
+ * Returns: A newly created #GArrowUInt16ArrayBuilder.
+ */
+GArrowUInt16ArrayBuilder *
+garrow_uint16_array_builder_new(void)
+{
+  auto memory_pool = arrow::default_memory_pool();
+  auto arrow_builder =
+    std::make_shared<arrow::UInt16Builder>(memory_pool, arrow::uint16());
+  auto builder =
+    GARROW_UINT16_ARRAY_BUILDER(g_object_new(GARROW_TYPE_UINT16_ARRAY_BUILDER,
+                                             "array-builder", &arrow_builder,
+                                             NULL));
+  return builder;
+}
+
+/**
+ * garrow_uint16_array_builder_append:
+ * @builder: A #GArrowUInt16ArrayBuilder.
+ * @value: A uint16 value.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ */
+gboolean
+garrow_uint16_array_builder_append(GArrowUInt16ArrayBuilder *builder,
+                                   guint16 value,
+                                   GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::UInt16Builder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->Append(value);
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[uint16-array-builder][append]");
+    return FALSE;
+  }
+}
+
+/**
+ * garrow_uint16_array_builder_append_null:
+ * @builder: A #GArrowUInt16ArrayBuilder.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ */
+gboolean
+garrow_uint16_array_builder_append_null(GArrowUInt16ArrayBuilder *builder,
+                                        GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::UInt16Builder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->AppendNull();
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[uint16-array-builder][append-null]");
+    return FALSE;
+  }
+}
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/uint16-array-builder.h b/c_glib/arrow-glib/uint16-array-builder.h
new file mode 100644
index 0000000000000..c08966ecc1d91
--- /dev/null
+++ b/c_glib/arrow-glib/uint16-array-builder.h
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_UINT16_ARRAY_BUILDER \ + (garrow_uint16_array_builder_get_type()) +#define GARROW_UINT16_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT16_ARRAY_BUILDER, \ + GArrowUInt16ArrayBuilder)) +#define GARROW_UINT16_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT16_ARRAY_BUILDER, \ + GArrowUInt16ArrayBuilderClass)) +#define GARROW_IS_UINT16_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT16_ARRAY_BUILDER)) +#define GARROW_IS_UINT16_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT16_ARRAY_BUILDER)) +#define GARROW_UINT16_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT16_ARRAY_BUILDER, \ + GArrowUInt16ArrayBuilderClass)) + +typedef struct _GArrowUInt16ArrayBuilder GArrowUInt16ArrayBuilder; +typedef struct _GArrowUInt16ArrayBuilderClass GArrowUInt16ArrayBuilderClass; + +/** + * GArrowUInt16ArrayBuilder: + * + * It wraps `arrow::UInt16Builder`. + */ +struct _GArrowUInt16ArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowUInt16ArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_uint16_array_builder_get_type(void) G_GNUC_CONST; + +GArrowUInt16ArrayBuilder *garrow_uint16_array_builder_new(void); + +gboolean garrow_uint16_array_builder_append(GArrowUInt16ArrayBuilder *builder, + guint16 value, + GError **error); +gboolean garrow_uint16_array_builder_append_null(GArrowUInt16ArrayBuilder *builder, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint16-array.cpp b/c_glib/arrow-glib/uint16-array.cpp new file mode 100644 index 0000000000000..6c416c6592935 --- /dev/null +++ b/c_glib/arrow-glib/uint16-array.cpp @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: uint16-array + * @short_description: 16-bit unsigned integer array class + * + * #GArrowUInt16Array is a class for 16-bit unsigned integer array. It + * can store zero or more 16-bit unsigned integer data. + * + * #GArrowUInt16Array is immutable. You need to use + * #GArrowUInt16ArrayBuilder to create a new array. + */ + +G_DEFINE_TYPE(GArrowUInt16Array, \ + garrow_uint16_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_uint16_array_init(GArrowUInt16Array *object) +{ +} + +static void +garrow_uint16_array_class_init(GArrowUInt16ArrayClass *klass) +{ +} + +/** + * garrow_uint16_array_get_value: + * @array: A #GArrowUInt16Array. + * @i: The index of the target value. + * + * Returns: The i-th value. 
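+ *
+ * For example (illustrative; garrow_array_get_length() and its return
+ * type are assumed from array.h):
+ *
+ * |[<!-- language="C" -->
+ * gint64 i;
+ * gint64 n = garrow_array_get_length(GARROW_ARRAY(array)); // assumed
+ * for (i = 0; i < n; i++) {
+ *   g_print("%u\n", (guint)garrow_uint16_array_get_value(array, i));
+ * }
+ * ]|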
+ */
+guint16
+garrow_uint16_array_get_value(GArrowUInt16Array *array,
+                              gint64 i)
+{
+  auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array));
+  return static_cast<arrow::UInt16Array *>(arrow_array.get())->Value(i);
+}
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/uint16-array.h b/c_glib/arrow-glib/uint16-array.h
new file mode 100644
index 0000000000000..44725510062c8
--- /dev/null
+++ b/c_glib/arrow-glib/uint16-array.h
@@ -0,0 +1,71 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+#pragma once
+
+#include <arrow-glib/array.h>
+
+G_BEGIN_DECLS
+
+#define GARROW_TYPE_UINT16_ARRAY                \
+  (garrow_uint16_array_get_type())
+#define GARROW_UINT16_ARRAY(obj)                                \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj),                            \
+                              GARROW_TYPE_UINT16_ARRAY,         \
+                              GArrowUInt16Array))
+#define GARROW_UINT16_ARRAY_CLASS(klass)                        \
+  (G_TYPE_CHECK_CLASS_CAST((klass),                             \
+                           GARROW_TYPE_UINT16_ARRAY,            \
+                           GArrowUInt16ArrayClass))
+#define GARROW_IS_UINT16_ARRAY(obj)                             \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj),                            \
+                              GARROW_TYPE_UINT16_ARRAY))
+#define GARROW_IS_UINT16_ARRAY_CLASS(klass)                     \
+  (G_TYPE_CHECK_CLASS_TYPE((klass),                             \
+                           GARROW_TYPE_UINT16_ARRAY))
+#define GARROW_UINT16_ARRAY_GET_CLASS(obj)                      \
+  (G_TYPE_INSTANCE_GET_CLASS((obj),                             \
+                             GARROW_TYPE_UINT16_ARRAY,          \
+                             GArrowUInt16ArrayClass))
+
+typedef struct _GArrowUInt16Array GArrowUInt16Array;
+typedef struct _GArrowUInt16ArrayClass GArrowUInt16ArrayClass;
+
+/**
+ * GArrowUInt16Array:
+ *
+ * It wraps `arrow::UInt16Array`.
+ */
+struct _GArrowUInt16Array
+{
+  /*< private >*/
+  GArrowArray parent_instance;
+};
+
+struct _GArrowUInt16ArrayClass
+{
+  GArrowArrayClass parent_class;
+};
+
+GType garrow_uint16_array_get_type(void) G_GNUC_CONST;
+
+guint16 garrow_uint16_array_get_value(GArrowUInt16Array *array,
+                                      gint64 i);
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/uint16-data-type.cpp b/c_glib/arrow-glib/uint16-data-type.cpp
new file mode 100644
index 0000000000000..918b75d61c3eb
--- /dev/null
+++ b/c_glib/arrow-glib/uint16-data-type.cpp
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: uint16-data-type + * @short_description: 16-bit unsigned integer data type + * + * #GArrowUInt16DataType is a class for 16-bit unsigned integer data type. + */ + +G_DEFINE_TYPE(GArrowUInt16DataType, \ + garrow_uint16_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_uint16_data_type_init(GArrowUInt16DataType *object) +{ +} + +static void +garrow_uint16_data_type_class_init(GArrowUInt16DataTypeClass *klass) +{ +} + +/** + * garrow_uint16_data_type_new: + * + * Returns: The newly created 16-bit unsigned integer data type. + */ +GArrowUInt16DataType * +garrow_uint16_data_type_new(void) +{ + auto arrow_data_type = arrow::uint16(); + + GArrowUInt16DataType *data_type = + GARROW_UINT16_DATA_TYPE(g_object_new(GARROW_TYPE_UINT16_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint16-data-type.h b/c_glib/arrow-glib/uint16-data-type.h new file mode 100644 index 0000000000000..b65189d888fcd --- /dev/null +++ b/c_glib/arrow-glib/uint16-data-type.h @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_UINT16_DATA_TYPE \ + (garrow_uint16_data_type_get_type()) +#define GARROW_UINT16_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT16_DATA_TYPE, \ + GArrowUInt16DataType)) +#define GARROW_UINT16_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT16_DATA_TYPE, \ + GArrowUInt16DataTypeClass)) +#define GARROW_IS_UINT16_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT16_DATA_TYPE)) +#define GARROW_IS_UINT16_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT16_DATA_TYPE)) +#define GARROW_UINT16_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT16_DATA_TYPE, \ + GArrowUInt16DataTypeClass)) + +typedef struct _GArrowUInt16DataType GArrowUInt16DataType; +typedef struct _GArrowUInt16DataTypeClass GArrowUInt16DataTypeClass; + +/** + * GArrowUInt16DataType: + * + * It wraps `arrow::UInt16Type`. 
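+ *
+ * For illustration (garrow_field_new() and its argument order are
+ * assumed from field.h):
+ *
+ * |[<!-- language="C" -->
+ * GArrowUInt16DataType *data_type = garrow_uint16_data_type_new();
+ * GArrowField *field =
+ *   garrow_field_new("count", GARROW_DATA_TYPE(data_type)); // assumed
+ * ]|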
+ */ +struct _GArrowUInt16DataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowUInt16DataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_uint16_data_type_get_type (void) G_GNUC_CONST; +GArrowUInt16DataType *garrow_uint16_data_type_new (void); + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint32-array-builder.cpp b/c_glib/arrow-glib/uint32-array-builder.cpp new file mode 100644 index 0000000000000..35b1893619fa5 --- /dev/null +++ b/c_glib/arrow-glib/uint32-array-builder.cpp @@ -0,0 +1,120 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: uint32-array-builder + * @short_description: 32-bit unsigned integer array builder class + * + * #GArrowUInt32ArrayBuilder is the class to create a new + * #GArrowUInt32Array. + */ + +G_DEFINE_TYPE(GArrowUInt32ArrayBuilder, + garrow_uint32_array_builder, + GARROW_TYPE_ARRAY_BUILDER) + +static void +garrow_uint32_array_builder_init(GArrowUInt32ArrayBuilder *builder) +{ +} + +static void +garrow_uint32_array_builder_class_init(GArrowUInt32ArrayBuilderClass *klass) +{ +} + +/** + * garrow_uint32_array_builder_new: + * + * Returns: A newly created #GArrowUInt32ArrayBuilder. + */ +GArrowUInt32ArrayBuilder * +garrow_uint32_array_builder_new(void) +{ + auto memory_pool = arrow::default_memory_pool(); + auto arrow_builder = + std::make_shared(memory_pool, arrow::uint32()); + auto builder = + GARROW_UINT32_ARRAY_BUILDER(g_object_new(GARROW_TYPE_UINT32_ARRAY_BUILDER, + "array-builder", &arrow_builder, + NULL)); + return builder; +} + +/** + * garrow_uint32_array_builder_append: + * @builder: A #GArrowUInt32ArrayBuilder. + * @value: An uint32 value. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_uint32_array_builder_append(GArrowUInt32ArrayBuilder *builder, + guint32 value, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->Append(value); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[uint32-array-builder][append]"); + return FALSE; + } +} + +/** + * garrow_uint32_array_builder_append_null: + * @builder: A #GArrowUInt32ArrayBuilder. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. 
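+ *
+ * The error follows the usual GLib convention, for example:
+ *
+ * |[<!-- language="C" -->
+ * GError *error = NULL;
+ * if (!garrow_uint32_array_builder_append_null(builder, &error)) {
+ *   g_printerr("append failed: %s\n", error->message);
+ *   g_error_free(error);
+ * }
+ * ]|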
+ */ +gboolean +garrow_uint32_array_builder_append_null(GArrowUInt32ArrayBuilder *builder, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->AppendNull(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[uint32-array-builder][append-null]"); + return FALSE; + } +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint32-array-builder.h b/c_glib/arrow-glib/uint32-array-builder.h new file mode 100644 index 0000000000000..4881d3b17ff0d --- /dev/null +++ b/c_glib/arrow-glib/uint32-array-builder.h @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_UINT32_ARRAY_BUILDER \ + (garrow_uint32_array_builder_get_type()) +#define GARROW_UINT32_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT32_ARRAY_BUILDER, \ + GArrowUInt32ArrayBuilder)) +#define GARROW_UINT32_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT32_ARRAY_BUILDER, \ + GArrowUInt32ArrayBuilderClass)) +#define GARROW_IS_UINT32_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT32_ARRAY_BUILDER)) +#define GARROW_IS_UINT32_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT32_ARRAY_BUILDER)) +#define GARROW_UINT32_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT32_ARRAY_BUILDER, \ + GArrowUInt32ArrayBuilderClass)) + +typedef struct _GArrowUInt32ArrayBuilder GArrowUInt32ArrayBuilder; +typedef struct _GArrowUInt32ArrayBuilderClass GArrowUInt32ArrayBuilderClass; + +/** + * GArrowUInt32ArrayBuilder: + * + * It wraps `arrow::UInt32Builder`. + */ +struct _GArrowUInt32ArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowUInt32ArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_uint32_array_builder_get_type(void) G_GNUC_CONST; + +GArrowUInt32ArrayBuilder *garrow_uint32_array_builder_new(void); + +gboolean garrow_uint32_array_builder_append(GArrowUInt32ArrayBuilder *builder, + guint32 value, + GError **error); +gboolean garrow_uint32_array_builder_append_null(GArrowUInt32ArrayBuilder *builder, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint32-array.cpp b/c_glib/arrow-glib/uint32-array.cpp new file mode 100644 index 0000000000000..d10f10005f9be --- /dev/null +++ b/c_glib/arrow-glib/uint32-array.cpp @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. 
See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: uint32-array + * @short_description: 32-bit unsigned integer array class + * + * #GArrowUInt32Array is a class for 32-bit unsigned integer array. It + * can store zero or more 32-bit unsigned integer data. + * + * #GArrowUInt32Array is immutable. You need to use + * #GArrowUInt32ArrayBuilder to create a new array. + */ + +G_DEFINE_TYPE(GArrowUInt32Array, \ + garrow_uint32_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_uint32_array_init(GArrowUInt32Array *object) +{ +} + +static void +garrow_uint32_array_class_init(GArrowUInt32ArrayClass *klass) +{ +} + +/** + * garrow_uint32_array_get_value: + * @array: A #GArrowUInt32Array. + * @i: The index of the target value. + * + * Returns: The i-th value. + */ +guint32 +garrow_uint32_array_get_value(GArrowUInt32Array *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + return static_cast(arrow_array.get())->Value(i); +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint32-array.h b/c_glib/arrow-glib/uint32-array.h new file mode 100644 index 0000000000000..57d4beaee6186 --- /dev/null +++ b/c_glib/arrow-glib/uint32-array.h @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_UINT32_ARRAY \ + (garrow_uint32_array_get_type()) +#define GARROW_UINT32_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT32_ARRAY, \ + GArrowUInt32Array)) +#define GARROW_UINT32_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT32_ARRAY, \ + GArrowUInt32ArrayClass)) +#define GARROW_IS_UINT32_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT32_ARRAY)) +#define GARROW_IS_UINT32_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT32_ARRAY)) +#define GARROW_UINT32_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT32_ARRAY, \ + GArrowUInt32ArrayClass)) + +typedef struct _GArrowUInt32Array GArrowUInt32Array; +typedef struct _GArrowUInt32ArrayClass GArrowUInt32ArrayClass; + +/** + * GArrowUInt32Array: + * + * It wraps `arrow::UInt32Array`. + */ +struct _GArrowUInt32Array +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowUInt32ArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_uint32_array_get_type(void) G_GNUC_CONST; + +guint32 garrow_uint32_array_get_value(GArrowUInt32Array *array, + gint64 i); + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint32-data-type.cpp b/c_glib/arrow-glib/uint32-data-type.cpp new file mode 100644 index 0000000000000..fde14f3274174 --- /dev/null +++ b/c_glib/arrow-glib/uint32-data-type.cpp @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: uint32-data-type + * @short_description: 32-bit unsigned integer data type + * + * #GArrowUInt32DataType is a class for 32-bit unsigned integer data type. + */ + +G_DEFINE_TYPE(GArrowUInt32DataType, \ + garrow_uint32_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_uint32_data_type_init(GArrowUInt32DataType *object) +{ +} + +static void +garrow_uint32_data_type_class_init(GArrowUInt32DataTypeClass *klass) +{ +} + +/** + * garrow_uint32_data_type_new: + * + * Returns: The newly created 32-bit unsigned integer data type. 
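+ *
+ * For example:
+ *
+ * |[<!-- language="C" -->
+ * GArrowUInt32DataType *data_type = garrow_uint32_data_type_new();
+ * g_object_unref(data_type);
+ * ]|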
+ */ +GArrowUInt32DataType * +garrow_uint32_data_type_new(void) +{ + auto arrow_data_type = arrow::uint32(); + + GArrowUInt32DataType *data_type = + GARROW_UINT32_DATA_TYPE(g_object_new(GARROW_TYPE_UINT32_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint32-data-type.h b/c_glib/arrow-glib/uint32-data-type.h new file mode 100644 index 0000000000000..4fe60cd850ba8 --- /dev/null +++ b/c_glib/arrow-glib/uint32-data-type.h @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_UINT32_DATA_TYPE \ + (garrow_uint32_data_type_get_type()) +#define GARROW_UINT32_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT32_DATA_TYPE, \ + GArrowUInt32DataType)) +#define GARROW_UINT32_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT32_DATA_TYPE, \ + GArrowUInt32DataTypeClass)) +#define GARROW_IS_UINT32_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT32_DATA_TYPE)) +#define GARROW_IS_UINT32_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT32_DATA_TYPE)) +#define GARROW_UINT32_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT32_DATA_TYPE, \ + GArrowUInt32DataTypeClass)) + +typedef struct _GArrowUInt32DataType GArrowUInt32DataType; +typedef struct _GArrowUInt32DataTypeClass GArrowUInt32DataTypeClass; + +/** + * GArrowUInt32DataType: + * + * It wraps `arrow::UInt32Type`. + */ +struct _GArrowUInt32DataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowUInt32DataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_uint32_data_type_get_type (void) G_GNUC_CONST; +GArrowUInt32DataType *garrow_uint32_data_type_new (void); + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint64-array-builder.cpp b/c_glib/arrow-glib/uint64-array-builder.cpp new file mode 100644 index 0000000000000..85d24ca54ab8b --- /dev/null +++ b/c_glib/arrow-glib/uint64-array-builder.cpp @@ -0,0 +1,120 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: uint64-array-builder + * @short_description: 64-bit unsigned integer array builder class + * + * #GArrowUInt64ArrayBuilder is the class to create a new + * #GArrowUInt64Array. + */ + +G_DEFINE_TYPE(GArrowUInt64ArrayBuilder, + garrow_uint64_array_builder, + GARROW_TYPE_ARRAY_BUILDER) + +static void +garrow_uint64_array_builder_init(GArrowUInt64ArrayBuilder *builder) +{ +} + +static void +garrow_uint64_array_builder_class_init(GArrowUInt64ArrayBuilderClass *klass) +{ +} + +/** + * garrow_uint64_array_builder_new: + * + * Returns: A newly created #GArrowUInt64ArrayBuilder. + */ +GArrowUInt64ArrayBuilder * +garrow_uint64_array_builder_new(void) +{ + auto memory_pool = arrow::default_memory_pool(); + auto arrow_builder = + std::make_shared(memory_pool, arrow::uint64()); + auto builder = + GARROW_UINT64_ARRAY_BUILDER(g_object_new(GARROW_TYPE_UINT64_ARRAY_BUILDER, + "array-builder", &arrow_builder, + NULL)); + return builder; +} + +/** + * garrow_uint64_array_builder_append: + * @builder: A #GArrowUInt64ArrayBuilder. + * @value: An uint64 value. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_uint64_array_builder_append(GArrowUInt64ArrayBuilder *builder, + guint64 value, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->Append(value); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[uint64-array-builder][append]"); + return FALSE; + } +} + +/** + * garrow_uint64_array_builder_append_null: + * @builder: A #GArrowUInt64ArrayBuilder. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_uint64_array_builder_append_null(GArrowUInt64ArrayBuilder *builder, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->AppendNull(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[uint64-array-builder][append-null]"); + return FALSE; + } +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint64-array-builder.h b/c_glib/arrow-glib/uint64-array-builder.h new file mode 100644 index 0000000000000..c51d1e2485d6f --- /dev/null +++ b/c_glib/arrow-glib/uint64-array-builder.h @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_UINT64_ARRAY_BUILDER \ + (garrow_uint64_array_builder_get_type()) +#define GARROW_UINT64_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT64_ARRAY_BUILDER, \ + GArrowUInt64ArrayBuilder)) +#define GARROW_UINT64_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT64_ARRAY_BUILDER, \ + GArrowUInt64ArrayBuilderClass)) +#define GARROW_IS_UINT64_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT64_ARRAY_BUILDER)) +#define GARROW_IS_UINT64_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT64_ARRAY_BUILDER)) +#define GARROW_UINT64_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT64_ARRAY_BUILDER, \ + GArrowUInt64ArrayBuilderClass)) + +typedef struct _GArrowUInt64ArrayBuilder GArrowUInt64ArrayBuilder; +typedef struct _GArrowUInt64ArrayBuilderClass GArrowUInt64ArrayBuilderClass; + +/** + * GArrowUInt64ArrayBuilder: + * + * It wraps `arrow::UInt64Builder`. + */ +struct _GArrowUInt64ArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowUInt64ArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_uint64_array_builder_get_type(void) G_GNUC_CONST; + +GArrowUInt64ArrayBuilder *garrow_uint64_array_builder_new(void); + +gboolean garrow_uint64_array_builder_append(GArrowUInt64ArrayBuilder *builder, + guint64 value, + GError **error); +gboolean garrow_uint64_array_builder_append_null(GArrowUInt64ArrayBuilder *builder, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint64-array.cpp b/c_glib/arrow-glib/uint64-array.cpp new file mode 100644 index 0000000000000..1f900842674b8 --- /dev/null +++ b/c_glib/arrow-glib/uint64-array.cpp @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: uint64-array + * @short_description: 64-bit unsigned integer array class + * + * #GArrowUInt64Array is a class for 64-bit unsigned integer array. It + * can store zero or more 64-bit unsigned integer data. + * + * #GArrowUInt64Array is immutable. You need to use + * #GArrowUInt64ArrayBuilder to create a new array. 
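+ *
+ * A minimal usage sketch (an illustration only; error handling is
+ * omitted for brevity):
+ *
+ * |[
+ *   GArrowUInt64ArrayBuilder *builder = garrow_uint64_array_builder_new();
+ *   garrow_uint64_array_builder_append(builder, 29, NULL);
+ *   garrow_uint64_array_builder_append_null(builder, NULL);
+ *   GArrowArray *array =
+ *     garrow_array_builder_finish(GARROW_ARRAY_BUILDER(builder));
+ *   guint64 value =
+ *     garrow_uint64_array_get_value(GARROW_UINT64_ARRAY(array), 0);
+ *   g_object_unref(array);
+ *   g_object_unref(builder);
+ * ]|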
+ */ + +G_DEFINE_TYPE(GArrowUInt64Array, \ + garrow_uint64_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_uint64_array_init(GArrowUInt64Array *object) +{ +} + +static void +garrow_uint64_array_class_init(GArrowUInt64ArrayClass *klass) +{ +} + +/** + * garrow_uint64_array_get_value: + * @array: A #GArrowUInt64Array. + * @i: The index of the target value. + * + * Returns: The i-th value. + */ +guint64 +garrow_uint64_array_get_value(GArrowUInt64Array *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + return static_cast(arrow_array.get())->Value(i); +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint64-array.h b/c_glib/arrow-glib/uint64-array.h new file mode 100644 index 0000000000000..b5abde52bd263 --- /dev/null +++ b/c_glib/arrow-glib/uint64-array.h @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_UINT64_ARRAY \ + (garrow_uint64_array_get_type()) +#define GARROW_UINT64_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT64_ARRAY, \ + GArrowUInt64Array)) +#define GARROW_UINT64_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT64_ARRAY, \ + GArrowUInt64ArrayClass)) +#define GARROW_IS_UINT64_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT64_ARRAY)) +#define GARROW_IS_UINT64_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT64_ARRAY)) +#define GARROW_UINT64_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT64_ARRAY, \ + GArrowUInt64ArrayClass)) + +typedef struct _GArrowUInt64Array GArrowUInt64Array; +typedef struct _GArrowUInt64ArrayClass GArrowUInt64ArrayClass; + +/** + * GArrowUInt64Array: + * + * It wraps `arrow::UInt64Array`. + */ +struct _GArrowUInt64Array +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowUInt64ArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_uint64_array_get_type(void) G_GNUC_CONST; + +guint64 garrow_uint64_array_get_value(GArrowUInt64Array *array, + gint64 i); + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint64-data-type.cpp b/c_glib/arrow-glib/uint64-data-type.cpp new file mode 100644 index 0000000000000..7c18b36a01b3b --- /dev/null +++ b/c_glib/arrow-glib/uint64-data-type.cpp @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: uint64-data-type + * @short_description: 64-bit unsigned integer data type + * + * #GArrowUInt64DataType is a class for 64-bit unsigned integer data type. + */ + +G_DEFINE_TYPE(GArrowUInt64DataType, \ + garrow_uint64_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_uint64_data_type_init(GArrowUInt64DataType *object) +{ +} + +static void +garrow_uint64_data_type_class_init(GArrowUInt64DataTypeClass *klass) +{ +} + +/** + * garrow_uint64_data_type_new: + * + * Returns: The newly created 64-bit unsigned integer data type. + */ +GArrowUInt64DataType * +garrow_uint64_data_type_new(void) +{ + auto arrow_data_type = arrow::uint64(); + + GArrowUInt64DataType *data_type = + GARROW_UINT64_DATA_TYPE(g_object_new(GARROW_TYPE_UINT64_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint64-data-type.h b/c_glib/arrow-glib/uint64-data-type.h new file mode 100644 index 0000000000000..221023c863818 --- /dev/null +++ b/c_glib/arrow-glib/uint64-data-type.h @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_UINT64_DATA_TYPE \ + (garrow_uint64_data_type_get_type()) +#define GARROW_UINT64_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT64_DATA_TYPE, \ + GArrowUInt64DataType)) +#define GARROW_UINT64_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT64_DATA_TYPE, \ + GArrowUInt64DataTypeClass)) +#define GARROW_IS_UINT64_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT64_DATA_TYPE)) +#define GARROW_IS_UINT64_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT64_DATA_TYPE)) +#define GARROW_UINT64_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT64_DATA_TYPE, \ + GArrowUInt64DataTypeClass)) + +typedef struct _GArrowUInt64DataType GArrowUInt64DataType; +typedef struct _GArrowUInt64DataTypeClass GArrowUInt64DataTypeClass; + +/** + * GArrowUInt64DataType: + * + * It wraps `arrow::UInt64Type`. 
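+ *
+ * A minimal construction sketch:
+ *
+ * |[
+ *   GArrowUInt64DataType *data_type = garrow_uint64_data_type_new();
+ *   g_object_unref(data_type);
+ * ]|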
+ */ +struct _GArrowUInt64DataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowUInt64DataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_uint64_data_type_get_type (void) G_GNUC_CONST; +GArrowUInt64DataType *garrow_uint64_data_type_new (void); + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint8-array-builder.cpp b/c_glib/arrow-glib/uint8-array-builder.cpp new file mode 100644 index 0000000000000..2f49693236b24 --- /dev/null +++ b/c_glib/arrow-glib/uint8-array-builder.cpp @@ -0,0 +1,120 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: uint8-array-builder + * @short_description: 8-bit unsigned integer array builder class + * + * #GArrowUInt8ArrayBuilder is the class to create a new + * #GArrowUInt8Array. + */ + +G_DEFINE_TYPE(GArrowUInt8ArrayBuilder, + garrow_uint8_array_builder, + GARROW_TYPE_ARRAY_BUILDER) + +static void +garrow_uint8_array_builder_init(GArrowUInt8ArrayBuilder *builder) +{ +} + +static void +garrow_uint8_array_builder_class_init(GArrowUInt8ArrayBuilderClass *klass) +{ +} + +/** + * garrow_uint8_array_builder_new: + * + * Returns: A newly created #GArrowUInt8ArrayBuilder. + */ +GArrowUInt8ArrayBuilder * +garrow_uint8_array_builder_new(void) +{ + auto memory_pool = arrow::default_memory_pool(); + auto arrow_builder = + std::make_shared(memory_pool, arrow::uint8()); + auto builder = + GARROW_UINT8_ARRAY_BUILDER(g_object_new(GARROW_TYPE_UINT8_ARRAY_BUILDER, + "array-builder", &arrow_builder, + NULL)); + return builder; +} + +/** + * garrow_uint8_array_builder_append: + * @builder: A #GArrowUInt8ArrayBuilder. + * @value: An uint8 value. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_uint8_array_builder_append(GArrowUInt8ArrayBuilder *builder, + guint8 value, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->Append(value); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[uint8-array-builder][append]"); + return FALSE; + } +} + +/** + * garrow_uint8_array_builder_append_null: + * @builder: A #GArrowUInt8ArrayBuilder. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. 
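+ *
+ * A minimal usage sketch (an illustration only; passing %NULL as
+ * @error ignores failures):
+ *
+ * |[
+ *   GArrowUInt8ArrayBuilder *builder = garrow_uint8_array_builder_new();
+ *   garrow_uint8_array_builder_append(builder, 1, NULL);
+ *   garrow_uint8_array_builder_append_null(builder, NULL);
+ *   g_object_unref(builder);
+ * ]|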
+ */ +gboolean +garrow_uint8_array_builder_append_null(GArrowUInt8ArrayBuilder *builder, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->AppendNull(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[uint8-array-builder][append-null]"); + return FALSE; + } +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint8-array-builder.h b/c_glib/arrow-glib/uint8-array-builder.h new file mode 100644 index 0000000000000..e7216931a511c --- /dev/null +++ b/c_glib/arrow-glib/uint8-array-builder.h @@ -0,0 +1,76 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_UINT8_ARRAY_BUILDER \ + (garrow_uint8_array_builder_get_type()) +#define GARROW_UINT8_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT8_ARRAY_BUILDER, \ + GArrowUInt8ArrayBuilder)) +#define GARROW_UINT8_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT8_ARRAY_BUILDER, \ + GArrowUInt8ArrayBuilderClass)) +#define GARROW_IS_UINT8_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT8_ARRAY_BUILDER)) +#define GARROW_IS_UINT8_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT8_ARRAY_BUILDER)) +#define GARROW_UINT8_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT8_ARRAY_BUILDER, \ + GArrowUInt8ArrayBuilderClass)) + +typedef struct _GArrowUInt8ArrayBuilder GArrowUInt8ArrayBuilder; +typedef struct _GArrowUInt8ArrayBuilderClass GArrowUInt8ArrayBuilderClass; + +/** + * GArrowUInt8ArrayBuilder: + * + * It wraps `arrow::UInt8Builder`. + */ +struct _GArrowUInt8ArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowUInt8ArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_uint8_array_builder_get_type(void) G_GNUC_CONST; + +GArrowUInt8ArrayBuilder *garrow_uint8_array_builder_new(void); + +gboolean garrow_uint8_array_builder_append(GArrowUInt8ArrayBuilder *builder, + guint8 value, + GError **error); +gboolean garrow_uint8_array_builder_append_null(GArrowUInt8ArrayBuilder *builder, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint8-array.cpp b/c_glib/arrow-glib/uint8-array.cpp new file mode 100644 index 0000000000000..b5a2595b1ef09 --- /dev/null +++ b/c_glib/arrow-glib/uint8-array.cpp @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: uint8-array + * @short_description: 8-bit unsigned integer array class + * + * #GArrowUInt8Array is a class for 8-bit unsigned integer array. It + * can store zero or more 8-bit unsigned integer data. + * + * #GArrowUInt8Array is immutable. You need to use + * #GArrowUInt8ArrayBuilder to create a new array. + */ + +G_DEFINE_TYPE(GArrowUInt8Array, \ + garrow_uint8_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_uint8_array_init(GArrowUInt8Array *object) +{ +} + +static void +garrow_uint8_array_class_init(GArrowUInt8ArrayClass *klass) +{ +} + +/** + * garrow_uint8_array_get_value: + * @array: A #GArrowUInt8Array. + * @i: The index of the target value. + * + * Returns: The i-th value. + */ +guint8 +garrow_uint8_array_get_value(GArrowUInt8Array *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + return static_cast(arrow_array.get())->Value(i); +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint8-array.h b/c_glib/arrow-glib/uint8-array.h new file mode 100644 index 0000000000000..a572bc549670e --- /dev/null +++ b/c_glib/arrow-glib/uint8-array.h @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_UINT8_ARRAY \ + (garrow_uint8_array_get_type()) +#define GARROW_UINT8_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT8_ARRAY, \ + GArrowUInt8Array)) +#define GARROW_UINT8_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT8_ARRAY, \ + GArrowUInt8ArrayClass)) +#define GARROW_IS_UINT8_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT8_ARRAY)) +#define GARROW_IS_UINT8_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT8_ARRAY)) +#define GARROW_UINT8_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT8_ARRAY, \ + GArrowUInt8ArrayClass)) + +typedef struct _GArrowUInt8Array GArrowUInt8Array; +typedef struct _GArrowUInt8ArrayClass GArrowUInt8ArrayClass; + +/** + * GArrowUInt8Array: + * + * It wraps `arrow::UInt8Array`. 
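+ *
+ * A minimal reading sketch, assuming `array` is a built
+ * #GArrowUInt8Array:
+ *
+ * |[
+ *   guint8 value = garrow_uint8_array_get_value(array, 0);
+ * ]|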
+ */ +struct _GArrowUInt8Array +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowUInt8ArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_uint8_array_get_type(void) G_GNUC_CONST; + +guint8 garrow_uint8_array_get_value(GArrowUInt8Array *array, + gint64 i); + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint8-data-type.cpp b/c_glib/arrow-glib/uint8-data-type.cpp new file mode 100644 index 0000000000000..7c93e455a4e96 --- /dev/null +++ b/c_glib/arrow-glib/uint8-data-type.cpp @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: uint8-data-type + * @short_description: 8-bit unsigned integer data type + * + * #GArrowUInt8DataType is a class for 8-bit unsigned integer data type. + */ + +G_DEFINE_TYPE(GArrowUInt8DataType, \ + garrow_uint8_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_uint8_data_type_init(GArrowUInt8DataType *object) +{ +} + +static void +garrow_uint8_data_type_class_init(GArrowUInt8DataTypeClass *klass) +{ +} + +/** + * garrow_uint8_data_type_new: + * + * Returns: The newly created 8-bit unsigned integer data type. + */ +GArrowUInt8DataType * +garrow_uint8_data_type_new(void) +{ + auto arrow_data_type = arrow::uint8(); + + GArrowUInt8DataType *data_type = + GARROW_UINT8_DATA_TYPE(g_object_new(GARROW_TYPE_UINT8_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint8-data-type.h b/c_glib/arrow-glib/uint8-data-type.h new file mode 100644 index 0000000000000..6e058524f4b10 --- /dev/null +++ b/c_glib/arrow-glib/uint8-data-type.h @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_UINT8_DATA_TYPE \ + (garrow_uint8_data_type_get_type()) +#define GARROW_UINT8_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT8_DATA_TYPE, \ + GArrowUInt8DataType)) +#define GARROW_UINT8_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT8_DATA_TYPE, \ + GArrowUInt8DataTypeClass)) +#define GARROW_IS_UINT8_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT8_DATA_TYPE)) +#define GARROW_IS_UINT8_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT8_DATA_TYPE)) +#define GARROW_UINT8_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT8_DATA_TYPE, \ + GArrowUInt8DataTypeClass)) + +typedef struct _GArrowUInt8DataType GArrowUInt8DataType; +typedef struct _GArrowUInt8DataTypeClass GArrowUInt8DataTypeClass; + +/** + * GArrowUInt8DataType: + * + * It wraps `arrow::UInt8Type`. + */ +struct _GArrowUInt8DataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowUInt8DataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_uint8_data_type_get_type (void) G_GNUC_CONST; +GArrowUInt8DataType *garrow_uint8_data_type_new (void); + +G_END_DECLS diff --git a/c_glib/autogen.sh b/c_glib/autogen.sh new file mode 100755 index 0000000000000..08e33e6ca07c0 --- /dev/null +++ b/c_glib/autogen.sh @@ -0,0 +1,31 @@ +#!/bin/sh +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +set -u +set -e + +ruby \ + -e 'print ARGF.read.scan(/^ (.+?)<\/version>/)[0][0]' \ + ../java/pom.xml > \ + version + +mkdir -p m4 + +gtkdocize --copy --docdir doc/reference +autoreconf --install diff --git a/c_glib/configure.ac b/c_glib/configure.ac new file mode 100644 index 0000000000000..85f7eec3cb557 --- /dev/null +++ b/c_glib/configure.ac @@ -0,0 +1,76 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
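+
+# A typical build sketch (an illustration; it assumes the autotools,
+# gtk-doc, and GObject Introspection prerequisites checked below are
+# installed):
+#
+#   $ ./autogen.sh
+#   $ ./configure --enable-debug
+#   $ make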
+ +AC_PREREQ(2.65) + +m4_define([arrow_glib_version], m4_include(version)) +AC_INIT([arrow-glib], arrow_glib_version, [kou@clear-code.com]) +AC_CONFIG_AUX_DIR([config]) +AC_CONFIG_MACRO_DIR([m4]) + +AC_CONFIG_SRCDIR([arrow-glib/arrow-glib.h]) +AC_CONFIG_HEADERS([config.h]) + +AM_INIT_AUTOMAKE([1.13 foreign]) +AM_SILENT_RULES([yes]) + +AC_PROG_CC +AC_PROG_CXX +AX_CXX_COMPILE_STDCXX_11([ext], [mandatory]) +LT_INIT + +GARROW_CFLAGS="-Wall -Wconversion" +GARROW_CXXFLAGS="-Wall -Wconversion" +AC_ARG_ENABLE(debug, + [AS_HELP_STRING([--enable-debug], + [Use debug flags (default=no)])], + [GARROW_DEBUG="$enableval"], + [GARROW_DEBUG="no"]) +if test "x$GARROW_DEBUG" != "xno"; then + GARROW_DEBUG="yes" + if test "$CLANG" = "yes"; then + CFLAGS="$CFLAGS -O0 -g" + CXXFLAGS="$CXXFLAGS -O0 -g" + elif test "$GCC" = "yes"; then + CFLAGS="$CFLAGS -O0 -g3" + CXXFLAGS="$CXXFLAGS -O0 -g3" + fi +fi +AC_SUBST(GARROW_CFLAGS) +AC_SUBST(GARROW_CXXFLAGS) + +AM_PATH_GLIB_2_0([2.32.4], [], [], [gobject]) + +GOBJECT_INTROSPECTION_REQUIRE([1.32.1]) +GTK_DOC_CHECK([1.18-2]) + +PKG_CHECK_MODULES([ARROW], [arrow]) +PKG_CHECK_MODULES([ARROW_IO], [arrow-io]) +PKG_CHECK_MODULES([ARROW_IPC], [arrow-ipc]) + +AC_CONFIG_FILES([ + Makefile + arrow-glib/Makefile + arrow-glib/arrow-glib.pc + arrow-glib/arrow-io-glib.pc + arrow-glib/arrow-ipc-glib.pc + doc/Makefile + doc/reference/Makefile + example/Makefile +]) + +AC_OUTPUT diff --git a/c_glib/doc/Makefile.am b/c_glib/doc/Makefile.am new file mode 100644 index 0000000000000..85c1d5126097c --- /dev/null +++ b/c_glib/doc/Makefile.am @@ -0,0 +1,19 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +SUBDIRS = \ + reference diff --git a/c_glib/doc/reference/Makefile.am b/c_glib/doc/reference/Makefile.am new file mode 100644 index 0000000000000..d1c8e01c299a0 --- /dev/null +++ b/c_glib/doc/reference/Makefile.am @@ -0,0 +1,63 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
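+
+# gtk-doc driven reference documentation. A sketch of regenerating it
+# after the library is built (assuming gtk-doc is installed):
+#
+#   $ make -C doc/reference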
+ +DOC_MODULE = arrow-glib + +DOC_MAIN_SGML_FILE = $(DOC_MODULE)-docs.sgml + +DOC_SOURCE_DIR = \ + $(top_srcdir)/arrow-glib + +SCAN_OPTIONS = \ + --deprecated-guards="GARROW_DISABLE_DEPRECATED" + +MKDB_OPTIONS = \ + --name-space=arrow \ + --source-suffixes="c,cpp,h" + +HFILE_GLOB = \ + $(top_srcdir)/arrow-glib/*.h + +IGNORE_HFILES = \ + enums.h \ + io-enums.h \ + ipc-enums.h + +CFILE_GLOB = \ + $(top_srcdir)/arrow-glib/*.cpp + +AM_CPPFLAGS = \ + -I$(top_builddir) \ + -I$(top_srcdir) + +AM_CFLAGS = \ + $(GLIB_CFLAGS) \ + $(ARROW_CFLAGS) + +GTKDOC_LIBS = \ + $(top_builddir)/arrow-glib/libarrow-glib.la \ + $(top_builddir)/arrow-glib/libarrow-io-glib.la \ + $(top_builddir)/arrow-glib/libarrow-ipc-glib.la + +include $(srcdir)/gtk-doc.make + +CLEANFILES += \ + $(DOC_MODULE)-decl-list.txt \ + $(DOC_MODULE)-decl.txt \ + $(DOC_MODULE)-overrides.txt \ + $(DOC_MODULE)-sections.txt \ + $(DOC_MODULE).types diff --git a/c_glib/doc/reference/arrow-glib-docs.sgml b/c_glib/doc/reference/arrow-glib-docs.sgml new file mode 100644 index 0000000000000..9f504bec7ad53 --- /dev/null +++ b/c_glib/doc/reference/arrow-glib-docs.sgml @@ -0,0 +1,171 @@ + + + + + %gtkdocentities; +]> + + + &package_name; Reference Manual + + for &package_string;. + + + + + + GArrow + + Array + + + + + + + + + + + + + + + + + + + + Array builder + + + + + + + + + + + + + + + + + + + Type + + + + + + + + + + + + + + + + + + + + + Schema + + + + + Table + + + + + + + Error + + + + + + GArrowIO + + Enums + + + + Input + + + + + + Output + + + + + + + Input and output + + + + + + + GArrowIPC + + Enums + + + + Reader + + + + + Input + + + + + + + Object Hierarchy + + + + API Index + + + + Index of deprecated API + + + + diff --git a/c_glib/example/Makefile.am b/c_glib/example/Makefile.am new file mode 100644 index 0000000000000..3d456d7844231 --- /dev/null +++ b/c_glib/example/Makefile.am @@ -0,0 +1,34 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +AM_CPPFLAGS = \ + -I$(top_builddir) \ + -I$(top_srcdir) + +AM_CFLAGS = \ + $(GLIB_CFLAGS) \ + $(GARROW_CFLAGS) + +AM_LDFLAGS = \ + $(GLIB_LIBS) \ + $(builddir)/../arrow-glib/libarrow-glib.la + +noinst_PROGRAMS = \ + build + +build_SOURCES = \ + build.c diff --git a/c_glib/example/build.c b/c_glib/example/build.c new file mode 100644 index 0000000000000..2722458acd5c4 --- /dev/null +++ b/c_glib/example/build.c @@ -0,0 +1,71 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#include + +#include + +int +main(int argc, char **argv) +{ + GArrowArray *array; + + { + GArrowInt32ArrayBuilder *builder; + gboolean success = TRUE; + GError *error = NULL; + + builder = garrow_int32_array_builder_new(); + if (success) { + success = garrow_int32_array_builder_append(builder, 29, &error); + } + if (success) { + success = garrow_int32_array_builder_append(builder, 2929, &error); + } + if (success) { + success = garrow_int32_array_builder_append(builder, 292929, &error); + } + if (!success) { + g_print("failed to append: %s\n", error->message); + g_error_free(error); + g_object_unref(builder); + return EXIT_FAILURE; + } + array = garrow_array_builder_finish(GARROW_ARRAY_BUILDER(builder)); + g_object_unref(builder); + } + + { + gint64 i, n; + + n = garrow_array_get_length(array); + g_print("length: %" G_GINT64_FORMAT "\n", n); + for (i = 0; i < n; i++) { + gint32 value; + + value = garrow_int32_array_get_value(GARROW_INT32_ARRAY(array), i); + g_print("array[%" G_GINT64_FORMAT "] = %d\n", + i, value); + } + } + + g_object_unref(array); + + return EXIT_SUCCESS; +} diff --git a/c_glib/test/helper/buildable.rb b/c_glib/test/helper/buildable.rb new file mode 100644 index 0000000000000..900e180675b45 --- /dev/null +++ b/c_glib/test/helper/buildable.rb @@ -0,0 +1,77 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
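+
+# A usage sketch for the helper below (test-only code; nil entries
+# become nulls in the built array):
+#
+#   include Helper::Buildable
+#   array = build_uint8_array([1, nil, 255])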
+ +module Helper + module Buildable + def build_boolean_array(values) + build_array(Arrow::BooleanArrayBuilder, values) + end + + def build_int8_array(values) + build_array(Arrow::Int8ArrayBuilder, values) + end + + def build_uint8_array(values) + build_array(Arrow::UInt8ArrayBuilder, values) + end + + def build_int16_array(values) + build_array(Arrow::Int16ArrayBuilder, values) + end + + def build_uint16_array(values) + build_array(Arrow::UInt16ArrayBuilder, values) + end + + def build_int32_array(values) + build_array(Arrow::Int32ArrayBuilder, values) + end + + def build_uint32_array(values) + build_array(Arrow::UInt32ArrayBuilder, values) + end + + def build_int64_array(values) + build_array(Arrow::Int64ArrayBuilder, values) + end + + def build_uint64_array(values) + build_array(Arrow::UInt64ArrayBuilder, values) + end + + def build_float_array(values) + build_array(Arrow::FloatArrayBuilder, values) + end + + def build_double_array(values) + build_array(Arrow::DoubleArrayBuilder, values) + end + + private + def build_array(builder_class, values) + builder = builder_class.new + values.each do |value| + if value.nil? + builder.append_null + else + builder.append(value) + end + end + builder.finish + end + end +end diff --git a/c_glib/test/run-test.rb b/c_glib/test/run-test.rb new file mode 100755 index 0000000000000..32ceb4ad61d2e --- /dev/null +++ b/c_glib/test/run-test.rb @@ -0,0 +1,41 @@ +#!/usr/bin/env ruby +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +require "pathname" +require "test-unit" + +base_dir = Pathname(__dir__).parent +typelib_dir = base_dir + "arrow-glib" +test_dir = base_dir + "test" + +ENV["GI_TYPELIB_PATH"] = [ + typelib_dir.to_s, + ENV["GI_TYPELIB_PATH"], +].compact.join(File::PATH_SEPARATOR) + +require "gi" + +Arrow = GI.load("Arrow") +ArrowIO = GI.load("ArrowIO") +ArrowIPC = GI.load("ArrowIPC") + +require "tempfile" +require_relative "helper/buildable" + +exit(Test::Unit::AutoRunner.run(true, test_dir.to_s)) diff --git a/c_glib/test/run-test.sh b/c_glib/test/run-test.sh new file mode 100755 index 0000000000000..9b0ec8e45f52f --- /dev/null +++ b/c_glib/test/run-test.sh @@ -0,0 +1,29 @@ +#!/bin/sh +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +base_dir="$(cd .; pwd)" +lib_dir="${base_dir}/arrow-glib/.libs" + +LD_LIBRARY_PATH="${lib_dir}:${LD_LIBRARY_PATH}" + +if [ "${NO_MAKE}" != "yes" ]; then + make -j8 > /dev/null || exit $? +fi + +${GDB} ruby ${base_dir}/test/run-test.rb "$@" diff --git a/c_glib/test/test-array.rb b/c_glib/test/test-array.rb new file mode 100644 index 0000000000000..d68827cb85b1d --- /dev/null +++ b/c_glib/test/test-array.rb @@ -0,0 +1,44 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestArray < Test::Unit::TestCase + def test_length + builder = Arrow::BooleanArrayBuilder.new + builder.append(true) + array = builder.finish + assert_equal(1, array.length) + end + + def test_n_nulls + builder = Arrow::BooleanArrayBuilder.new + builder.append_null + builder.append_null + array = builder.finish + assert_equal(2, array.n_nulls) + end + + def test_slice + builder = Arrow::BooleanArrayBuilder.new + builder.append(true) + builder.append(false) + builder.append(true) + array = builder.finish + sub_array = array.slice(1, 2) + assert_equal([false, true], + sub_array.length.times.collect {|i| sub_array.get_value(i)}) + end +end diff --git a/c_glib/test/test-binary-array.rb b/c_glib/test/test-binary-array.rb new file mode 100644 index 0000000000000..82a537ef29e9e --- /dev/null +++ b/c_glib/test/test-binary-array.rb @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
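+
+# These tests are driven by test/run-test.sh; a typical invocation from
+# the c_glib directory, once the library is built, is:
+#
+#   $ NO_MAKE=yes test/run-test.sh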
+ +class TestBinaryArray < Test::Unit::TestCase + def test_value + builder = Arrow::BinaryArrayBuilder.new + builder.append("\x00\x01\x02") + array = builder.finish + assert_equal([0, 1, 2], array.get_value(0)) + end +end diff --git a/c_glib/test/test-binary-data-type.rb b/c_glib/test/test-binary-data-type.rb new file mode 100644 index 0000000000000..3d4095c1b0648 --- /dev/null +++ b/c_glib/test/test-binary-data-type.rb @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestBinaryDataType < Test::Unit::TestCase + def test_type + data_type = Arrow::BinaryDataType.new + assert_equal(Arrow::Type::BINARY, data_type.type) + end + + def test_to_s + data_type = Arrow::BinaryDataType.new + assert_equal("binary", data_type.to_s) + end +end diff --git a/c_glib/test/test-boolean-array.rb b/c_glib/test/test-boolean-array.rb new file mode 100644 index 0000000000000..9cc3c94d554bf --- /dev/null +++ b/c_glib/test/test-boolean-array.rb @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestBooleanArray < Test::Unit::TestCase + def test_value + builder = Arrow::BooleanArrayBuilder.new + builder.append(true) + array = builder.finish + assert_equal(true, array.get_value(0)) + end +end diff --git a/c_glib/test/test-boolean-data-type.rb b/c_glib/test/test-boolean-data-type.rb new file mode 100644 index 0000000000000..ac5667140fb8e --- /dev/null +++ b/c_glib/test/test-boolean-data-type.rb @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestBooleanDataType < Test::Unit::TestCase + def test_type + data_type = Arrow::BooleanDataType.new + assert_equal(Arrow::Type::BOOL, data_type.type) + end + + def test_to_s + data_type = Arrow::BooleanDataType.new + assert_equal("bool", data_type.to_s) + end +end diff --git a/c_glib/test/test-chunked-array.rb b/c_glib/test/test-chunked-array.rb new file mode 100644 index 0000000000000..167d5d1033e42 --- /dev/null +++ b/c_glib/test/test-chunked-array.rb @@ -0,0 +1,67 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestChunkedArray < Test::Unit::TestCase + include Helper::Buildable + + def test_length + chunks = [ + build_boolean_array([true, false]), + build_boolean_array([true]), + ] + chunked_array = Arrow::ChunkedArray.new(chunks) + assert_equal(3, chunked_array.length) + end + + def test_n_nulls + chunks = [ + build_boolean_array([true, nil, false]), + build_boolean_array([nil, nil, true]), + ] + chunked_array = Arrow::ChunkedArray.new(chunks) + assert_equal(3, chunked_array.n_nulls) + end + + + def test_n_chunks + chunks = [ + build_boolean_array([true]), + build_boolean_array([false]), + ] + chunked_array = Arrow::ChunkedArray.new(chunks) + assert_equal(2, chunked_array.n_chunks) + end + + def test_chunk + chunks = [ + build_boolean_array([true, false]), + build_boolean_array([false]), + ] + chunked_array = Arrow::ChunkedArray.new(chunks) + assert_equal(2, chunked_array.get_chunk(0).length) + end + + def test_chunks + chunks = [ + build_boolean_array([true, false]), + build_boolean_array([false]), + ] + chunked_array = Arrow::ChunkedArray.new(chunks) + assert_equal([2, 1], + chunked_array.chunks.collect(&:length)) + end +end diff --git a/c_glib/test/test-column.rb b/c_glib/test/test-column.rb new file mode 100644 index 0000000000000..ec75194edb830 --- /dev/null +++ b/c_glib/test/test-column.rb @@ -0,0 +1,86 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestColumn < Test::Unit::TestCase + include Helper::Buildable + + sub_test_case(".new") do + def test_array + field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + array = build_boolean_array([true]) + column = Arrow::Column.new(field, array) + assert_equal(1, column.length) + end + + def test_chunked_array + field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + chunks = [ + build_boolean_array([true]), + build_boolean_array([false, true]), + ] + chunked_array = Arrow::ChunkedArray.new(chunks) + column = Arrow::Column.new(field, chunked_array) + assert_equal(3, column.length) + end + end + + def test_length + field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + array = build_boolean_array([true, false]) + column = Arrow::Column.new(field, array) + assert_equal(2, column.length) + end + + def test_n_nulls + field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + array = build_boolean_array([true, nil, nil]) + column = Arrow::Column.new(field, array) + assert_equal(2, column.n_nulls) + end + + def test_field + field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + array = build_boolean_array([true]) + column = Arrow::Column.new(field, array) + assert_equal("enabled", column.field.name) + end + + def test_name + field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + array = build_boolean_array([true]) + column = Arrow::Column.new(field, array) + assert_equal("enabled", column.name) + end + + def test_data_type + field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + array = build_boolean_array([true]) + column = Arrow::Column.new(field, array) + assert_equal("bool", column.data_type.to_s) + end + + def test_data + field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + chunks = [ + build_boolean_array([true]), + build_boolean_array([false, true]), + ] + chunked_array = Arrow::ChunkedArray.new(chunks) + column = Arrow::Column.new(field, chunked_array) + assert_equal(3, column.data.length) + end +end diff --git a/c_glib/test/test-double-array.rb b/c_glib/test/test-double-array.rb new file mode 100644 index 0000000000000..f9c000d23f173 --- /dev/null +++ b/c_glib/test/test-double-array.rb @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
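+
+# Floating point results are checked with assert_in_delta rather than
+# assert_equal to tolerate representation error, e.g.:
+#
+#   assert_in_delta(1.5, array.get_value(0))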
+ +class TestDoubleArray < Test::Unit::TestCase + def test_value + builder = Arrow::DoubleArrayBuilder.new + builder.append(1.5) + array = builder.finish + assert_in_delta(1.5, array.get_value(0)) + end +end diff --git a/c_glib/test/test-double-data-type.rb b/c_glib/test/test-double-data-type.rb new file mode 100644 index 0000000000000..18c870cb9e62b --- /dev/null +++ b/c_glib/test/test-double-data-type.rb @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestDoubleDataType < Test::Unit::TestCase + def test_type + data_type = Arrow::DoubleDataType.new + assert_equal(Arrow::Type::DOUBLE, data_type.type) + end + + def test_to_s + data_type = Arrow::DoubleDataType.new + assert_equal("double", data_type.to_s) + end +end diff --git a/c_glib/test/test-field.rb b/c_glib/test/test-field.rb new file mode 100644 index 0000000000000..a20802c2ac653 --- /dev/null +++ b/c_glib/test/test-field.rb @@ -0,0 +1,41 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestField < Test::Unit::TestCase + def test_name + field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + assert_equal("enabled", field.name) + end + + def test_data_type + data_type = Arrow::BooleanDataType.new + field = Arrow::Field.new("enabled", data_type) + assert_equal(data_type.to_s, field.data_type.to_s) + end + + def test_nullable? + field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + assert do + field.nullable? + end + end + + def test_to_s + field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + assert_equal("enabled: bool", field.to_s) + end +end diff --git a/c_glib/test/test-float-array.rb b/c_glib/test/test-float-array.rb new file mode 100644 index 0000000000000..020c705aad241 --- /dev/null +++ b/c_glib/test/test-float-array.rb @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestFloatArray < Test::Unit::TestCase + def test_value + builder = Arrow::FloatArrayBuilder.new + builder.append(1.5) + array = builder.finish + assert_in_delta(1.5, array.get_value(0)) + end +end diff --git a/c_glib/test/test-float-data-type.rb b/c_glib/test/test-float-data-type.rb new file mode 100644 index 0000000000000..ab315fd336b84 --- /dev/null +++ b/c_glib/test/test-float-data-type.rb @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestFloatDataType < Test::Unit::TestCase + def test_type + data_type = Arrow::FloatDataType.new + assert_equal(Arrow::Type::FLOAT, data_type.type) + end + + def test_to_s + data_type = Arrow::FloatDataType.new + assert_equal("float", data_type.to_s) + end +end diff --git a/c_glib/test/test-int16-array.rb b/c_glib/test/test-int16-array.rb new file mode 100644 index 0000000000000..2aa5b0c054563 --- /dev/null +++ b/c_glib/test/test-int16-array.rb @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
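+# Same builder/finish flow as the floating point tests above, but for a
+# signed 16-bit column; appending -1 checks that the sign survives the
+# round trip through Arrow's int16 storage.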
+ +class TestInt16Array < Test::Unit::TestCase + def test_value + builder = Arrow::Int16ArrayBuilder.new + builder.append(-1) + array = builder.finish + assert_equal(-1, array.get_value(0)) + end +end diff --git a/c_glib/test/test-int16-data-type.rb b/c_glib/test/test-int16-data-type.rb new file mode 100644 index 0000000000000..273ec809c198e --- /dev/null +++ b/c_glib/test/test-int16-data-type.rb @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestInt16DataType < Test::Unit::TestCase + def test_type + data_type = Arrow::Int16DataType.new + assert_equal(Arrow::Type::INT16, data_type.type) + end + + def test_to_s + data_type = Arrow::Int16DataType.new + assert_equal("int16", data_type.to_s) + end +end diff --git a/c_glib/test/test-int32-array.rb b/c_glib/test/test-int32-array.rb new file mode 100644 index 0000000000000..9dd6b3afc8676 --- /dev/null +++ b/c_glib/test/test-int32-array.rb @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestInt32Array < Test::Unit::TestCase + def test_value + builder = Arrow::Int32ArrayBuilder.new + builder.append(-1) + array = builder.finish + assert_equal(-1, array.get_value(0)) + end +end diff --git a/c_glib/test/test-int32-data-type.rb b/c_glib/test/test-int32-data-type.rb new file mode 100644 index 0000000000000..f6b9b34e1d827 --- /dev/null +++ b/c_glib/test/test-int32-data-type.rb @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestInt32DataType < Test::Unit::TestCase + def test_type + data_type = Arrow::Int32DataType.new + assert_equal(Arrow::Type::INT32, data_type.type) + end + + def test_to_s + data_type = Arrow::Int32DataType.new + assert_equal("int32", data_type.to_s) + end +end diff --git a/c_glib/test/test-int64-array.rb b/c_glib/test/test-int64-array.rb new file mode 100644 index 0000000000000..612a8b4f69276 --- /dev/null +++ b/c_glib/test/test-int64-array.rb @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestInt64Array < Test::Unit::TestCase + def test_value + builder = Arrow::Int64ArrayBuilder.new + builder.append(-1) + array = builder.finish + assert_equal(-1, array.get_value(0)) + end +end diff --git a/c_glib/test/test-int64-data-type.rb b/c_glib/test/test-int64-data-type.rb new file mode 100644 index 0000000000000..032b24dac3ecc --- /dev/null +++ b/c_glib/test/test-int64-data-type.rb @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestInt64DataType < Test::Unit::TestCase + def test_type + data_type = Arrow::Int64DataType.new + assert_equal(Arrow::Type::INT64, data_type.type) + end + + def test_to_s + data_type = Arrow::Int64DataType.new + assert_equal("int64", data_type.to_s) + end +end diff --git a/c_glib/test/test-int8-array.rb b/c_glib/test/test-int8-array.rb new file mode 100644 index 0000000000000..ab009964ab16f --- /dev/null +++ b/c_glib/test/test-int8-array.rb @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. 
See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestInt8Array < Test::Unit::TestCase + def test_value + builder = Arrow::Int8ArrayBuilder.new + builder.append(-1) + array = builder.finish + assert_equal(-1, array.get_value(0)) + end +end diff --git a/c_glib/test/test-int8-data-type.rb b/c_glib/test/test-int8-data-type.rb new file mode 100644 index 0000000000000..d33945614db8e --- /dev/null +++ b/c_glib/test/test-int8-data-type.rb @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestInt8DataType < Test::Unit::TestCase + def test_type + data_type = Arrow::Int8DataType.new + assert_equal(Arrow::Type::INT8, data_type.type) + end + + def test_to_s + data_type = Arrow::Int8DataType.new + assert_equal("int8", data_type.to_s) + end +end diff --git a/c_glib/test/test-io-file-output-stream.rb b/c_glib/test/test-io-file-output-stream.rb new file mode 100644 index 0000000000000..1f2ae5fa10fd1 --- /dev/null +++ b/c_glib/test/test-io-file-output-stream.rb @@ -0,0 +1,38 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
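+# ArrowIO::FileOutputStream.open takes a path plus an append flag: false
+# truncates any existing content, true preserves it. The two tests below
+# rely on exactly that contrast against a pre-written "Hello" file. A
+# minimal sketch of the truncating case (path stands for any writable
+# file):
+#
+#   stream = ArrowIO::FileOutputStream.open(path, false)
+#   stream.close
+#   File.read(path)   # => ""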
+ +class TestIOFileOutputStream < Test::Unit::TestCase + sub_test_case(".open") do + def test_create + tempfile = Tempfile.open("arrow-io-file-output-stream") + tempfile.write("Hello") + tempfile.close + file = ArrowIO::FileOutputStream.open(tempfile.path, false) + file.close + assert_equal("", File.read(tempfile.path)) + end + + def test_append + tempfile = Tempfile.open("arrow-io-file-output-stream") + tempfile.write("Hello") + tempfile.close + file = ArrowIO::FileOutputStream.open(tempfile.path, true) + file.close + assert_equal("Hello", File.read(tempfile.path)) + end + end +end diff --git a/c_glib/test/test-io-memory-mapped-file.rb b/c_glib/test/test-io-memory-mapped-file.rb new file mode 100644 index 0000000000000..609819833614f --- /dev/null +++ b/c_glib/test/test-io-memory-mapped-file.rb @@ -0,0 +1,138 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestIOMemoryMappedFile < Test::Unit::TestCase + def test_open + tempfile = Tempfile.open("arrow-io-memory-mapped-file") + tempfile.write("Hello") + tempfile.close + file = ArrowIO::MemoryMappedFile.open(tempfile.path, :read) + begin + buffer = " " * 5 + file.read(buffer) + assert_equal("Hello", buffer) + ensure + file.close + end + end + + def test_size + tempfile = Tempfile.open("arrow-io-memory-mapped-file") + tempfile.write("Hello") + tempfile.close + file = ArrowIO::MemoryMappedFile.open(tempfile.path, :read) + begin + assert_equal(5, file.size) + ensure + file.close + end + end + + def test_read + tempfile = Tempfile.open("arrow-io-memory-mapped-file") + tempfile.write("Hello World") + tempfile.close + file = ArrowIO::MemoryMappedFile.open(tempfile.path, :read) + begin + buffer = " " * 5 + _success, n_read_bytes = file.read(buffer) + assert_equal("Hello", buffer.byteslice(0, n_read_bytes)) + ensure + file.close + end + end + + def test_read_at + tempfile = Tempfile.open("arrow-io-memory-mapped-file") + tempfile.write("Hello World") + tempfile.close + file = ArrowIO::MemoryMappedFile.open(tempfile.path, :read) + begin + buffer = " " * 5 + _success, n_read_bytes = file.read_at(6, buffer) + assert_equal("World", buffer.byteslice(0, n_read_bytes)) + ensure + file.close + end + end + + def test_write + tempfile = Tempfile.open("arrow-io-memory-mapped-file") + tempfile.write("Hello") + tempfile.close + file = ArrowIO::MemoryMappedFile.open(tempfile.path, :readwrite) + begin + file.write("World") + ensure + file.close + end + assert_equal("World", File.read(tempfile.path)) + end + + def test_write_at + tempfile = Tempfile.open("arrow-io-memory-mapped-file") + tempfile.write("Hello") + tempfile.close + file = ArrowIO::MemoryMappedFile.open(tempfile.path, :readwrite) + begin + file.write_at(2, "rld") + ensure + file.close + end + assert_equal("Herld", File.read(tempfile.path)) + end + + def 
test_flush + tempfile = Tempfile.open("arrow-io-memory-mapped-file") + tempfile.write("Hello") + tempfile.close + file = ArrowIO::MemoryMappedFile.open(tempfile.path, :readwrite) + begin + file.write("World") + file.flush + assert_equal("World", File.read(tempfile.path)) + ensure + file.close + end + end + + def test_tell + tempfile = Tempfile.open("arrow-io-memory-mapped-file") + tempfile.write("Hello World") + tempfile.close + file = ArrowIO::MemoryMappedFile.open(tempfile.path, :read) + begin + buffer = " " * 5 + file.read(buffer) + assert_equal(5, file.tell) + ensure + file.close + end + end + + def test_mode + tempfile = Tempfile.open("arrow-io-memory-mapped-file") + tempfile.write("Hello World") + tempfile.close + file = ArrowIO::MemoryMappedFile.open(tempfile.path, :readwrite) + begin + assert_equal(ArrowIO::FileMode::READWRITE, file.mode) + ensure + file.close + end + end +end diff --git a/c_glib/test/test-ipc-file-writer.rb b/c_glib/test/test-ipc-file-writer.rb new file mode 100644 index 0000000000000..369bff324e6d9 --- /dev/null +++ b/c_glib/test/test-ipc-file-writer.rb @@ -0,0 +1,45 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestIPCFileWriter < Test::Unit::TestCase + def test_write_record_batch + tempfile = Tempfile.open("arrow-ipc-file-writer") + output = ArrowIO::FileOutputStream.open(tempfile.path, false) + begin + field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + schema = Arrow::Schema.new([field]) + file_writer = ArrowIPC::FileWriter.open(output, schema) + begin + record_batch = Arrow::RecordBatch.new(schema, 0, []) + file_writer.write_record_batch(record_batch) + ensure + file_writer.close + end + ensure + output.close + end + + input = ArrowIO::MemoryMappedFile.open(tempfile.path, :read) + begin + file_reader = ArrowIPC::FileReader.open(input) + assert_equal(["enabled"], + file_reader.schema.fields.collect(&:name)) + ensure + input.close + end + end +end diff --git a/c_glib/test/test-ipc-stream-writer.rb b/c_glib/test/test-ipc-stream-writer.rb new file mode 100644 index 0000000000000..62ac45dce2c79 --- /dev/null +++ b/c_glib/test/test-ipc-stream-writer.rb @@ -0,0 +1,53 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestIPCStreamWriter < Test::Unit::TestCase + include Helper::Buildable + + def test_write_record_batch + tempfile = Tempfile.open("arrow-ipc-stream-writer") + output = ArrowIO::FileOutputStream.open(tempfile.path, false) + begin + field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + schema = Arrow::Schema.new([field]) + stream_writer = ArrowIPC::StreamWriter.open(output, schema) + begin + columns = [ + build_boolean_array([true]), + ] + record_batch = Arrow::RecordBatch.new(schema, 1, columns) + stream_writer.write_record_batch(record_batch) + ensure + stream_writer.close + end + ensure + output.close + end + + input = ArrowIO::MemoryMappedFile.open(tempfile.path, :read) + begin + stream_reader = ArrowIPC::StreamReader.open(input) + assert_equal(["enabled"], + stream_reader.schema.fields.collect(&:name)) + assert_equal(true, + stream_reader.next_record_batch.get_column(0).get_value(0)) + assert_nil(stream_reader.next_record_batch) + ensure + input.close + end + end +end diff --git a/c_glib/test/test-list-array.rb b/c_glib/test/test-list-array.rb new file mode 100644 index 0000000000000..34177de9dcdeb --- /dev/null +++ b/c_glib/test/test-list-array.rb @@ -0,0 +1,43 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestListArray < Test::Unit::TestCase + def test_value + builder = Arrow::ListArrayBuilder.new(Arrow::Int8ArrayBuilder.new) + value_builder = builder.value_builder + + builder.append + value_builder.append(-29) + value_builder.append(29) + + builder.append + value_builder.append(-1) + value_builder.append(0) + value_builder.append(1) + + array = builder.finish + value = array.get_value(1) + assert_equal([-1, 0, 1], + value.length.times.collect {|i| value.get_value(i)}) + end + + def test_value_type + builder = Arrow::ListArrayBuilder.new(Arrow::Int8ArrayBuilder.new) + array = builder.finish + assert_equal(Arrow::Int8DataType.new, array.value_type) + end +end diff --git a/c_glib/test/test-list-data-type.rb b/c_glib/test/test-list-data-type.rb new file mode 100644 index 0000000000000..6fde203517684 --- /dev/null +++ b/c_glib/test/test-list-data-type.rb @@ -0,0 +1,36 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestListDataType < Test::Unit::TestCase + def test_type + field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + data_type = Arrow::ListDataType.new(field) + assert_equal(Arrow::Type::LIST, data_type.type) + end + + def test_to_s + field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + data_type = Arrow::ListDataType.new(field) + assert_equal("list", data_type.to_s) + end + + def test_value_field + field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + data_type = Arrow::ListDataType.new(field) + assert_equal(field, data_type.value_field) + end +end diff --git a/c_glib/test/test-null-array.rb b/c_glib/test/test-null-array.rb new file mode 100644 index 0000000000000..6aa8c037c17ee --- /dev/null +++ b/c_glib/test/test-null-array.rb @@ -0,0 +1,33 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestNullArray < Test::Unit::TestCase + def test_length + array = Arrow::NullArray.new(3) + assert_equal(3, array.length) + end + + def test_n_nulls + array = Arrow::NullArray.new(3) + assert_equal(3, array.n_nulls) + end + + def test_slice + array = Arrow::NullArray.new(3) + assert_equal(2, array.slice(1, 2).length) + end +end diff --git a/c_glib/test/test-null-data-type.rb b/c_glib/test/test-null-data-type.rb new file mode 100644 index 0000000000000..95e54833b0896 --- /dev/null +++ b/c_glib/test/test-null-data-type.rb @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
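+# The null data type is the degenerate case: it carries no value buffers
+# and every slot counts as null. Note that the GLib enum value is NA (not
+# NULL), mirroring arrow::Type::NA in the C++ sources, while the string
+# form is "null".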
+
+class TestNullDataType < Test::Unit::TestCase
+  def test_type
+    data_type = Arrow::NullDataType.new
+    assert_equal(Arrow::Type::NA, data_type.type)
+  end
+
+  def test_to_s
+    data_type = Arrow::NullDataType.new
+    assert_equal("null", data_type.to_s)
+  end
+end
diff --git a/c_glib/test/test-record-batch.rb b/c_glib/test/test-record-batch.rb
new file mode 100644
index 0000000000000..941ff35060154
--- /dev/null
+++ b/c_glib/test/test-record-batch.rb
@@ -0,0 +1,80 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+class TestRecordBatch < Test::Unit::TestCase
+  include Helper::Buildable
+
+  def test_new
+    fields = [
+      Arrow::Field.new("visible", Arrow::BooleanDataType.new),
+      Arrow::Field.new("valid", Arrow::BooleanDataType.new),
+    ]
+    schema = Arrow::Schema.new(fields)
+    columns = [
+      build_boolean_array([true]),
+      build_boolean_array([false]),
+    ]
+    record_batch = Arrow::RecordBatch.new(schema, 1, columns)
+    assert_equal(1, record_batch.n_rows)
+  end
+
+  sub_test_case("instance methods") do
+    def setup
+      fields = [
+        Arrow::Field.new("visible", Arrow::BooleanDataType.new),
+        Arrow::Field.new("valid", Arrow::BooleanDataType.new),
+      ]
+      schema = Arrow::Schema.new(fields)
+      columns = [
+        build_boolean_array([true, false, true, false, true, false]),
+        build_boolean_array([false, true, false, true, false]),
+      ]
+      @record_batch = Arrow::RecordBatch.new(schema, 5, columns)
+    end
+
+    def test_schema
+      assert_equal(["visible", "valid"],
+                   @record_batch.schema.fields.collect(&:name))
+    end
+
+    def test_column
+      assert_equal(5, @record_batch.get_column(1).length)
+    end
+
+    def test_columns
+      assert_equal([6, 5],
+                   @record_batch.columns.collect(&:length))
+    end
+
+    def test_n_columns
+      assert_equal(2, @record_batch.n_columns)
+    end
+
+    def test_n_rows
+      assert_equal(5, @record_batch.n_rows)
+    end
+
+    def test_slice
+      sub_record_batch = @record_batch.slice(3, 2)
+      sub_visible_values = sub_record_batch.n_rows.times.collect do |i|
+        sub_record_batch.get_column(0).get_value(i)
+      end
+      assert_equal([false, true],
+                   sub_visible_values)
+    end
+  end
+end
diff --git a/c_glib/test/test-schema.rb b/c_glib/test/test-schema.rb
new file mode 100644
index 0000000000000..c9cbb756944bb
--- /dev/null
+++ b/c_glib/test/test-schema.rb
@@ -0,0 +1,69 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestSchema < Test::Unit::TestCase + def test_field + field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + schema = Arrow::Schema.new([field]) + assert_equal("enabled", schema.get_field(0).name) + end + + sub_test_case("#get_field_by_name") do + def test_found + field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + schema = Arrow::Schema.new([field]) + assert_equal("enabled", schema.get_field_by_name("enabled").name) + end + + def test_not_found + field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + schema = Arrow::Schema.new([field]) + assert_nil(schema.get_field_by_name("nonexistent")) + end + end + + def test_n_fields + fields = [ + Arrow::Field.new("enabled", Arrow::BooleanDataType.new), + Arrow::Field.new("required", Arrow::BooleanDataType.new), + ] + schema = Arrow::Schema.new(fields) + assert_equal(2, schema.n_fields) + end + + def test_fields + fields = [ + Arrow::Field.new("enabled", Arrow::BooleanDataType.new), + Arrow::Field.new("required", Arrow::BooleanDataType.new), + ] + schema = Arrow::Schema.new(fields) + assert_equal(["enabled", "required"], + schema.fields.collect(&:name)) + end + + def test_to_s + fields = [ + Arrow::Field.new("enabled", Arrow::BooleanDataType.new), + Arrow::Field.new("required", Arrow::BooleanDataType.new), + ] + schema = Arrow::Schema.new(fields) + assert_equal(<<-SCHEMA.chomp, schema.to_s) +enabled: bool +required: bool + SCHEMA + end +end diff --git a/c_glib/test/test-string-array.rb b/c_glib/test/test-string-array.rb new file mode 100644 index 0000000000000..a0f5a7b6b0fda --- /dev/null +++ b/c_glib/test/test-string-array.rb @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestStringArray < Test::Unit::TestCase + def test_value + builder = Arrow::StringArrayBuilder.new + builder.append("Hello") + array = builder.finish + assert_equal("Hello", array.get_string(0)) + end +end diff --git a/c_glib/test/test-string-data-type.rb b/c_glib/test/test-string-data-type.rb new file mode 100644 index 0000000000000..daba7fd9ec768 --- /dev/null +++ b/c_glib/test/test-string-data-type.rb @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestStringDataType < Test::Unit::TestCase + def test_type + data_type = Arrow::StringDataType.new + assert_equal(Arrow::Type::STRING, data_type.type) + end + + def test_to_s + data_type = Arrow::StringDataType.new + assert_equal("string", data_type.to_s) + end +end diff --git a/c_glib/test/test-struct-array.rb b/c_glib/test/test-struct-array.rb new file mode 100644 index 0000000000000..cf450f52d299a --- /dev/null +++ b/c_glib/test/test-struct-array.rb @@ -0,0 +1,58 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestStructArray < Test::Unit::TestCase + def test_fields + fields = [ + Arrow::Field.new("score", Arrow::Int8DataType.new), + Arrow::Field.new("enabled", Arrow::BooleanDataType.new), + ] + data_type = Arrow::StructDataType.new(fields) + field_builders = [ + Arrow::Int8ArrayBuilder.new, + Arrow::BooleanArrayBuilder.new, + ] + builder = Arrow::StructArrayBuilder.new(data_type, field_builders) + + builder.append + builder.get_field_builder(0).append(-29) + builder.get_field_builder(1).append(true) + + builder.append + builder.field_builders[0].append(2) + builder.field_builders[1].append(false) + + array = builder.finish + values = array.length.times.collect do |i| + if i.zero? + [ + array.get_field(0).get_value(i), + array.get_field(1).get_value(i), + ] + else + array.fields.collect do |field| + field.get_value(i) + end + end + end + assert_equal([ + [-29, true], + [2, false], + ], + values) + end +end diff --git a/c_glib/test/test-table.rb b/c_glib/test/test-table.rb new file mode 100644 index 0000000000000..1687d2f6e3ff6 --- /dev/null +++ b/c_glib/test/test-table.rb @@ -0,0 +1,72 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestTable < Test::Unit::TestCase + include Helper::Buildable + + sub_test_case(".new") do + def test_columns + fields = [ + Arrow::Field.new("visible", Arrow::BooleanDataType.new), + Arrow::Field.new("valid", Arrow::BooleanDataType.new), + ] + schema = Arrow::Schema.new(fields) + columns = [ + Arrow::Column.new(fields[0], build_boolean_array([true])), + Arrow::Column.new(fields[1], build_boolean_array([false])), + ] + table = Arrow::Table.new("memos", schema, columns) + assert_equal("memos", table.name) + end + end + + sub_test_case("instance methods") do + def setup + fields = [ + Arrow::Field.new("visible", Arrow::BooleanDataType.new), + Arrow::Field.new("valid", Arrow::BooleanDataType.new), + ] + schema = Arrow::Schema.new(fields) + columns = [ + Arrow::Column.new(fields[0], build_boolean_array([true])), + Arrow::Column.new(fields[1], build_boolean_array([false])), + ] + @table = Arrow::Table.new("memos", schema, columns) + end + + def test_name + assert_equal("memos", @table.name) + end + + def test_schema + assert_equal(["visible", "valid"], + @table.schema.fields.collect(&:name)) + end + + def test_column + assert_equal("valid", @table.get_column(1).name) + end + + def test_n_columns + assert_equal(2, @table.n_columns) + end + + def test_n_rows + assert_equal(1, @table.n_rows) + end + end +end diff --git a/c_glib/test/test-uint16-array.rb b/c_glib/test/test-uint16-array.rb new file mode 100644 index 0000000000000..ad85f09326bd3 --- /dev/null +++ b/c_glib/test/test-uint16-array.rb @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestUInt16Array < Test::Unit::TestCase + def test_value + builder = Arrow::UInt16ArrayBuilder.new + builder.append(1) + array = builder.finish + assert_equal(1, array.get_value(0)) + end +end diff --git a/c_glib/test/test-uint16-data-type.rb b/c_glib/test/test-uint16-data-type.rb new file mode 100644 index 0000000000000..f5a6cc0be28bb --- /dev/null +++ b/c_glib/test/test-uint16-data-type.rb @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestUInt16DataType < Test::Unit::TestCase + def test_type + data_type = Arrow::UInt16DataType.new + assert_equal(Arrow::Type::UINT16, data_type.type) + end + + def test_to_s + data_type = Arrow::UInt16DataType.new + assert_equal("uint16", data_type.to_s) + end +end diff --git a/c_glib/test/test-uint32-array.rb b/c_glib/test/test-uint32-array.rb new file mode 100644 index 0000000000000..59e19f3ed796f --- /dev/null +++ b/c_glib/test/test-uint32-array.rb @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestUInt32Array < Test::Unit::TestCase + def test_value + builder = Arrow::UInt32ArrayBuilder.new + builder.append(1) + array = builder.finish + assert_equal(1, array.get_value(0)) + end +end diff --git a/c_glib/test/test-uint32-data-type.rb b/c_glib/test/test-uint32-data-type.rb new file mode 100644 index 0000000000000..7a50257d6d3b9 --- /dev/null +++ b/c_glib/test/test-uint32-data-type.rb @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestUInt32DataType < Test::Unit::TestCase + def test_type + data_type = Arrow::UInt32DataType.new + assert_equal(Arrow::Type::UINT32, data_type.type) + end + + def test_to_s + data_type = Arrow::UInt32DataType.new + assert_equal("uint32", data_type.to_s) + end +end diff --git a/c_glib/test/test-uint64-array.rb b/c_glib/test/test-uint64-array.rb new file mode 100644 index 0000000000000..e0195c1d49817 --- /dev/null +++ b/c_glib/test/test-uint64-array.rb @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. 
See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestUInt64Array < Test::Unit::TestCase + def test_value + builder = Arrow::UInt64ArrayBuilder.new + builder.append(1) + array = builder.finish + assert_equal(1, array.get_value(0)) + end +end diff --git a/c_glib/test/test-uint64-data-type.rb b/c_glib/test/test-uint64-data-type.rb new file mode 100644 index 0000000000000..403fc9acdfcfa --- /dev/null +++ b/c_glib/test/test-uint64-data-type.rb @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestUInt64DataType < Test::Unit::TestCase + def test_type + data_type = Arrow::UInt64DataType.new + assert_equal(Arrow::Type::UINT64, data_type.type) + end + + def test_to_s + data_type = Arrow::UInt64DataType.new + assert_equal("uint64", data_type.to_s) + end +end diff --git a/c_glib/test/test-uint8-array.rb b/c_glib/test/test-uint8-array.rb new file mode 100644 index 0000000000000..02f3470774c10 --- /dev/null +++ b/c_glib/test/test-uint8-array.rb @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
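+# Unsigned variant of the integer builder tests above; uint8 holds values
+# in 0..255, so the appended 1 must round trip unchanged.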
+ +class TestUInt8Array < Test::Unit::TestCase + def test_value + builder = Arrow::UInt8ArrayBuilder.new + builder.append(1) + array = builder.finish + assert_equal(1, array.get_value(0)) + end +end diff --git a/c_glib/test/test-uint8-data-type.rb b/c_glib/test/test-uint8-data-type.rb new file mode 100644 index 0000000000000..eb91da2761efe --- /dev/null +++ b/c_glib/test/test-uint8-data-type.rb @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestUInt8DataType < Test::Unit::TestCase + def test_type + data_type = Arrow::UInt8DataType.new + assert_equal(Arrow::Type::UINT8, data_type.type) + end + + def test_to_s + data_type = Arrow::UInt8DataType.new + assert_equal("uint8", data_type.to_s) + end +end diff --git a/ci/travis_before_script_c_glib.sh b/ci/travis_before_script_c_glib.sh new file mode 100755 index 0000000000000..1a828e7659bd9 --- /dev/null +++ b/ci/travis_before_script_c_glib.sh @@ -0,0 +1,40 @@ +#!/usr/bin/env bash + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + + +set -ex + +if [ $TRAVIS_OS_NAME == "osx" ]; then + brew install gtk-doc autoconf-archive gobject-introspection +fi + +gem install gobject-introspection + +ARROW_C_GLIB_DIR=$TRAVIS_BUILD_DIR/c_glib + +pushd $ARROW_C_GLIB_DIR + +: ${ARROW_C_GLIB_INSTALL=$TRAVIS_BUILD_DIR/c-glib-install} + +./autogen.sh + +export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$ARROW_CPP_INSTALL/lib/pkgconfig +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ARROW_CPP_INSTALL/lib + +./configure --prefix=${ARROW_C_GLIB_INSTALL} --enable-gtk-doc + +make -j4 +make install + +popd diff --git a/ci/travis_script_c_glib.sh b/ci/travis_script_c_glib.sh new file mode 100755 index 0000000000000..1492354405810 --- /dev/null +++ b/ci/travis_script_c_glib.sh @@ -0,0 +1,24 @@ +#!/usr/bin/env bash + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and
+# limitations under the License. See accompanying LICENSE file.
+
+set -e
+
+ARROW_C_GLIB_DIR=$TRAVIS_BUILD_DIR/c_glib
+
+pushd $ARROW_C_GLIB_DIR
+
+export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ARROW_CPP_INSTALL/lib
+NO_MAKE=yes test/run-test.sh
+
+popd

From 57b537a3ce54698c60addb9d193ecfc3b88397aa Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Fri, 17 Mar 2017 17:20:48 -0700
Subject: [PATCH 0377/1644] ARROW-637: [Format] Add timezone to Timestamp metadata, comments describing the semantics

cc @julianhyde @julienledem for comment. This makes Arrow semantically equivalent to datetime objects in Python and, I believe, to the SQL standard.

Author: Wes McKinney

Closes #388 from wesm/ARROW-637 and squashes the following commits:

e4661a4 [Wes McKinney] Allow for absolute time zone offsets
3fc10d6 [Wes McKinney] typo
a25967a [Wes McKinney] Add timezone to Timestamp metadata, comments describing the semantics
---
 format/Message.fbs | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/format/Message.fbs b/format/Message.fbs
index fb3478de5d2a0..f2d5eba75e65b 100644
--- a/format/Message.fbs
+++ b/format/Message.fbs
@@ -93,6 +93,28 @@ table Time {
 /// time from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC.
 table Timestamp {
   unit: TimeUnit;
+
+  /// The time zone is a string indicating the name of a time zone, one of:
+  ///
+  /// * As used in the Olson time zone database (the "tz database" or
+  ///   "tzdata"), such as "America/New_York"
+  /// * An absolute time zone offset of the form +XX:XX or -XX:XX, such as +07:30
+  ///
+  /// Whether a timezone string is present indicates different semantics about
+  /// the data:
+  ///
+  /// * If the time zone is null or equal to an empty string, the data is "time
+  ///   zone naive" and shall be displayed *as is* to the user, not localized
+  ///   to the locale of the user. This data can be thought of as UTC but
+  ///   without having "UTC" as the time zone, it is not considered to be
+  ///   localized to any time zone
+  ///
+  /// * If the time zone is set to a valid value, values can be displayed as
+  ///   "localized" to that time zone, even though the underlying 64-bit
+  ///   integers are identical to the same data stored in UTC. Converting
+  ///   between time zones is a metadata-only operation and does not change
+  ///   the underlying values (for example, with seconds unit, the stored
+  ///   value 0 displays as 1970-01-01 00:00:00 when naive, but as
+  ///   1969-12-31 19:00:00 when the time zone is "America/New_York")
+  timezone: string;
 }

 enum IntervalUnit: short { YEAR_MONTH, DAY_TIME}

From 16dd87164d7ab756dc6c5eaabd22ef767edca037 Mon Sep 17 00:00:00 2001
From: Kouhei Sutou
Date: Sat, 18 Mar 2017 18:14:49 +0100
Subject: [PATCH 0378/1644] ARROW-650: [GLib] Follow ReadableFileInterface -> RandomAccessFile change

Author: Kouhei Sutou

Closes #399 from kou/glib-follow-random-access-change and squashes the following commits:

d46a1cb [Kouhei Sutou] [GLib] Follow ReadableFileInterface -> RandomAccessFile change
---
 c_glib/arrow-glib/Makefile.am                 |   6 +-
 c_glib/arrow-glib/arrow-io-glib.h             |   2 +-
 c_glib/arrow-glib/arrow-io-glib.hpp           |   2 +-
 c_glib/arrow-glib/io-memory-mapped-file.cpp   |  14 +-
 c_glib/arrow-glib/io-random-access-file.cpp   | 128 ++++++++++++++++++
 c_glib/arrow-glib/io-random-access-file.h     |  55 ++++++++
 ...ble-file.hpp => io-random-access-file.hpp} |  12 +-
 c_glib/arrow-glib/io-readable-file.cpp        | 127 -----------------
 c_glib/arrow-glib/io-readable-file.h          |  55 --------
 c_glib/arrow-glib/ipc-file-reader.cpp         |   6 +-
 c_glib/arrow-glib/ipc-file-reader.h           |   4 +-
 c_glib/doc/reference/arrow-glib-docs.sgml     |   2 +-
 12 files changed, 207 insertions(+), 206 deletions(-)
 create mode 100644 c_glib/arrow-glib/io-random-access-file.cpp
 create mode 100644 c_glib/arrow-glib/io-random-access-file.h
 rename c_glib/arrow-glib/{io-readable-file.hpp => io-random-access-file.hpp} (69%)
 delete mode 100644 c_glib/arrow-glib/io-readable-file.cpp
 delete mode 100644 c_glib/arrow-glib/io-readable-file.h

diff --git a/c_glib/arrow-glib/Makefile.am b/c_glib/arrow-glib/Makefile.am
index 61137a075f601..7699594b7ade7 100644
--- a/c_glib/arrow-glib/Makefile.am
+++ b/c_glib/arrow-glib/Makefile.am
@@ -242,8 +242,8 @@ libarrow_io_glib_la_headers = \
 	io-input-stream.h \
 	io-memory-mapped-file.h \
 	io-output-stream.h \
+	io-random-access-file.h \
 	io-readable.h \
-	io-readable-file.h \
 	io-writeable.h \
 	io-writeable-file.h
@@ -261,8 +261,8 @@ libarrow_io_glib_la_sources = \
 	io-input-stream.cpp \
 	io-memory-mapped-file.cpp \
 	io-output-stream.cpp \
+	io-random-access-file.cpp \
 	io-readable.cpp \
-	io-readable-file.cpp \
 	io-writeable.cpp \
 	io-writeable-file.cpp \
 	$(libarrow_io_glib_la_headers) \
@@ -276,8 +276,8 @@ libarrow_io_glib_la_cpp_headers = \
 	io-file.hpp \
 	io-input-stream.hpp \
 	io-memory-mapped-file.hpp \
 	io-output-stream.hpp \
+	io-random-access-file.hpp \
 	io-readable.hpp \
-	io-readable-file.hpp \
 	io-writeable.hpp \
 	io-writeable-file.hpp

diff --git a/c_glib/arrow-glib/arrow-io-glib.h b/c_glib/arrow-glib/arrow-io-glib.h
index e02aa9b96982b..4d49a9859d82a 100644
--- a/c_glib/arrow-glib/arrow-io-glib.h
+++ b/c_glib/arrow-glib/arrow-io-glib.h
@@ -26,7 +26,7 @@
 #include
 #include
 #include
+#include
 #include
-#include
 #include
 #include
diff --git a/c_glib/arrow-glib/arrow-io-glib.hpp b/c_glib/arrow-glib/arrow-io-glib.hpp
index 378f20216b6a1..3e7636cc7ef99 100644
--- a/c_glib/arrow-glib/arrow-io-glib.hpp
+++ b/c_glib/arrow-glib/arrow-io-glib.hpp
@@ -25,6 +25,6 @@
 #include
 #include
 #include
+#include
 #include
-#include
 #include
diff --git a/c_glib/arrow-glib/io-memory-mapped-file.cpp b/c_glib/arrow-glib/io-memory-mapped-file.cpp
index aa6ae2afd6e78..12c9a6c95ac12 100644
--- a/c_glib/arrow-glib/io-memory-mapped-file.cpp
+++ b/c_glib/arrow-glib/io-memory-mapped-file.cpp
@@ -29,7 +29,7 @@
 #include
 #include
 #include
-#include
+#include
 #include
 #include
@@ -97,8 +97,8 @@ garrow_io_input_stream_interface_init(GArrowIOInputStreamInterface *iface)
   iface->get_raw =
garrow_io_memory_mapped_file_get_raw_input_stream_interface; } -static std::shared_ptr -garrow_io_memory_mapped_file_get_raw_readable_file_interface(GArrowIOReadableFile *file) +static std::shared_ptr +garrow_io_memory_mapped_file_get_raw_random_access_file_interface(GArrowIORandomAccessFile *file) { auto memory_mapped_file = GARROW_IO_MEMORY_MAPPED_FILE(file); auto arrow_memory_mapped_file = @@ -107,9 +107,9 @@ garrow_io_memory_mapped_file_get_raw_readable_file_interface(GArrowIOReadableFil } static void -garrow_io_readable_file_interface_init(GArrowIOReadableFileInterface *iface) +garrow_io_random_access_file_interface_init(GArrowIORandomAccessFileInterface *iface) { - iface->get_raw = garrow_io_memory_mapped_file_get_raw_readable_file_interface; + iface->get_raw = garrow_io_memory_mapped_file_get_raw_random_access_file_interface; } static std::shared_ptr @@ -152,8 +152,8 @@ G_DEFINE_TYPE_WITH_CODE(GArrowIOMemoryMappedFile, garrow_io_readable_interface_init) G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_INPUT_STREAM, garrow_io_input_stream_interface_init) - G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_READABLE_FILE, - garrow_io_readable_file_interface_init) + G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_RANDOM_ACCESS_FILE, + garrow_io_random_access_file_interface_init) G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_WRITEABLE, garrow_io_writeable_interface_init) G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_WRITEABLE_FILE, diff --git a/c_glib/arrow-glib/io-random-access-file.cpp b/c_glib/arrow-glib/io-random-access-file.cpp new file mode 100644 index 0000000000000..552b879c19794 --- /dev/null +++ b/c_glib/arrow-glib/io-random-access-file.cpp @@ -0,0 +1,128 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: io-random-access-file + * @title: GArrowIORandomAccessFile + * @short_description: File input interface + * + * #GArrowIORandomAccessFile is an interface for file input. + */ + +G_DEFINE_INTERFACE(GArrowIORandomAccessFile, + garrow_io_random_access_file, + G_TYPE_OBJECT) + +static void +garrow_io_random_access_file_default_init (GArrowIORandomAccessFileInterface *iface) +{ +} + +/** + * garrow_io_random_access_file_get_size: + * @file: A #GArrowIORandomAccessFile. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: The size of the file. 
+ */ +guint64 +garrow_io_random_access_file_get_size(GArrowIORandomAccessFile *file, + GError **error) +{ + auto *iface = GARROW_IO_RANDOM_ACCESS_FILE_GET_IFACE(file); + auto arrow_random_access_file = iface->get_raw(file); + int64_t size; + + auto status = arrow_random_access_file->GetSize(&size); + if (status.ok()) { + return size; + } else { + garrow_error_set(error, status, "[io][random-access-file][get-size]"); + return 0; + } +} + +/** + * garrow_io_random_access_file_get_support_zero_copy: + * @file: A #GArrowIORandomAccessFile. + * + * Returns: Whether zero copy read is supported or not. + */ +gboolean +garrow_io_random_access_file_get_support_zero_copy(GArrowIORandomAccessFile *file) +{ + auto *iface = GARROW_IO_RANDOM_ACCESS_FILE_GET_IFACE(file); + auto arrow_random_access_file = iface->get_raw(file); + + return arrow_random_access_file->supports_zero_copy(); +} + +/** + * garrow_io_random_access_file_read_at: + * @file: A #GArrowIORandomAccessFile. + * @position: The read start position. + * @n_bytes: The number of bytes to be read. + * @n_read_bytes: (out): The number of bytes actually read. + * @buffer: (array length=n_bytes): The buffer into which data is read. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_io_random_access_file_read_at(GArrowIORandomAccessFile *file, + gint64 position, + gint64 n_bytes, + gint64 *n_read_bytes, + guint8 *buffer, + GError **error) +{ + const auto arrow_random_access_file = + garrow_io_random_access_file_get_raw(file); + + auto status = arrow_random_access_file->ReadAt(position, + n_bytes, + n_read_bytes, + buffer); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[io][random-access-file][read-at]"); + return FALSE; + } +} + +G_END_DECLS + +std::shared_ptr +garrow_io_random_access_file_get_raw(GArrowIORandomAccessFile *random_access_file) +{ + auto *iface = GARROW_IO_RANDOM_ACCESS_FILE_GET_IFACE(random_access_file); + return iface->get_raw(random_access_file); +} diff --git a/c_glib/arrow-glib/io-random-access-file.h b/c_glib/arrow-glib/io-random-access-file.h new file mode 100644 index 0000000000000..e980ab2e3c8e5 --- /dev/null +++ b/c_glib/arrow-glib/io-random-access-file.h @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License.
+ */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_IO_TYPE_RANDOM_ACCESS_FILE \ + (garrow_io_random_access_file_get_type()) +#define GARROW_IO_RANDOM_ACCESS_FILE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_IO_TYPE_RANDOM_ACCESS_FILE, \ + GArrowIORandomAccessFileInterface)) +#define GARROW_IO_IS_RANDOM_ACCESS_FILE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_IO_TYPE_RANDOM_ACCESS_FILE)) +#define GARROW_IO_RANDOM_ACCESS_FILE_GET_IFACE(obj) \ + (G_TYPE_INSTANCE_GET_INTERFACE((obj), \ + GARROW_IO_TYPE_RANDOM_ACCESS_FILE, \ + GArrowIORandomAccessFileInterface)) + +typedef struct _GArrowIORandomAccessFile GArrowIORandomAccessFile; +typedef struct _GArrowIORandomAccessFileInterface GArrowIORandomAccessFileInterface; + +GType garrow_io_random_access_file_get_type(void) G_GNUC_CONST; + +guint64 garrow_io_random_access_file_get_size(GArrowIORandomAccessFile *file, + GError **error); +gboolean garrow_io_random_access_file_get_support_zero_copy(GArrowIORandomAccessFile *file); +gboolean garrow_io_random_access_file_read_at(GArrowIORandomAccessFile *file, + gint64 position, + gint64 n_bytes, + gint64 *n_read_bytes, + guint8 *buffer, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/io-readable-file.hpp b/c_glib/arrow-glib/io-random-access-file.hpp similarity index 69% rename from c_glib/arrow-glib/io-readable-file.hpp rename to c_glib/arrow-glib/io-random-access-file.hpp index 83d8628f48b62..7c97c9ecedb5b 100644 --- a/c_glib/arrow-glib/io-readable-file.hpp +++ b/c_glib/arrow-glib/io-random-access-file.hpp @@ -21,18 +21,18 @@ #include -#include +#include /** - * GArrowIOReadableFileInterface: + * GArrowIORandomAccessFileInterface: * - * It wraps `arrow::io::ReadableFileInterface`. + * It wraps `arrow::io::RandomAccessFile`. */ -struct _GArrowIOReadableFileInterface +struct _GArrowIORandomAccessFileInterface { GTypeInterface parent_iface; - std::shared_ptr (*get_raw)(GArrowIOReadableFile *file); + std::shared_ptr (*get_raw)(GArrowIORandomAccessFile *file); }; -std::shared_ptr garrow_io_readable_file_get_raw(GArrowIOReadableFile *readable_file); +std::shared_ptr garrow_io_random_access_file_get_raw(GArrowIORandomAccessFile *random_access_file); diff --git a/c_glib/arrow-glib/io-readable-file.cpp b/c_glib/arrow-glib/io-readable-file.cpp deleted file mode 100644 index 014fd7a1c7d32..0000000000000 --- a/c_glib/arrow-glib/io-readable-file.cpp +++ /dev/null @@ -1,127 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: io-readable-file - * @title: GArrowIOReadableFile - * @short_description: File input interface - * - * #GArrowIOReadableFile is an interface for file input. 
- */ - -G_DEFINE_INTERFACE(GArrowIOReadableFile, - garrow_io_readable_file, - G_TYPE_OBJECT) - -static void -garrow_io_readable_file_default_init (GArrowIOReadableFileInterface *iface) -{ -} - -/** - * garrow_io_readable_file_get_size: - * @file: A #GArrowIOReadableFile. - * @error: (nullable): Return location for a #GError or %NULL. - * - * Returns: The size of the file. - */ -guint64 -garrow_io_readable_file_get_size(GArrowIOReadableFile *file, - GError **error) -{ - auto *iface = GARROW_IO_READABLE_FILE_GET_IFACE(file); - auto arrow_readable_file = iface->get_raw(file); - int64_t size; - - auto status = arrow_readable_file->GetSize(&size); - if (status.ok()) { - return size; - } else { - garrow_error_set(error, status, "[io][readable-file][get-size]"); - return 0; - } -} - -/** - * garrow_io_readable_file_get_support_zero_copy: - * @file: A #GArrowIOReadableFile. - * - * Returns: Whether zero copy read is supported or not. - */ -gboolean -garrow_io_readable_file_get_support_zero_copy(GArrowIOReadableFile *file) -{ - auto *iface = GARROW_IO_READABLE_FILE_GET_IFACE(file); - auto arrow_readable_file = iface->get_raw(file); - - return arrow_readable_file->supports_zero_copy(); -} - -/** - * garrow_io_readable_file_read_at: - * @file: A #GArrowIOReadableFile. - * @position: The read start position. - * @n_bytes: The number of bytes to be read. - * @n_read_bytes: (out): The read number of bytes. - * @buffer: (array length=n_bytes): The buffer to be read data. - * @error: (nullable): Return location for a #GError or %NULL. - * - * Returns: %TRUE on success, %FALSE if there was an error. - */ -gboolean -garrow_io_readable_file_read_at(GArrowIOReadableFile *file, - gint64 position, - gint64 n_bytes, - gint64 *n_read_bytes, - guint8 *buffer, - GError **error) -{ - const auto arrow_readable_file = garrow_io_readable_file_get_raw(file); - - auto status = arrow_readable_file->ReadAt(position, - n_bytes, - n_read_bytes, - buffer); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[io][readable-file][read-at]"); - return FALSE; - } -} - -G_END_DECLS - -std::shared_ptr -garrow_io_readable_file_get_raw(GArrowIOReadableFile *readable_file) -{ - auto *iface = GARROW_IO_READABLE_FILE_GET_IFACE(readable_file); - return iface->get_raw(readable_file); -} diff --git a/c_glib/arrow-glib/io-readable-file.h b/c_glib/arrow-glib/io-readable-file.h deleted file mode 100644 index 1dcb13e04969c..0000000000000 --- a/c_glib/arrow-glib/io-readable-file.h +++ /dev/null @@ -1,55 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_IO_TYPE_READABLE_FILE \ - (garrow_io_readable_file_get_type()) -#define GARROW_IO_READABLE_FILE(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_IO_TYPE_READABLE_FILE, \ - GArrowIOReadableFileInterface)) -#define GARROW_IO_IS_READABLE_FILE(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_IO_TYPE_READABLE_FILE)) -#define GARROW_IO_READABLE_FILE_GET_IFACE(obj) \ - (G_TYPE_INSTANCE_GET_INTERFACE((obj), \ - GARROW_IO_TYPE_READABLE_FILE, \ - GArrowIOReadableFileInterface)) - -typedef struct _GArrowIOReadableFile GArrowIOReadableFile; -typedef struct _GArrowIOReadableFileInterface GArrowIOReadableFileInterface; - -GType garrow_io_readable_file_get_type(void) G_GNUC_CONST; - -guint64 garrow_io_readable_file_get_size(GArrowIOReadableFile *file, - GError **error); -gboolean garrow_io_readable_file_get_support_zero_copy(GArrowIOReadableFile *file); -gboolean garrow_io_readable_file_read_at(GArrowIOReadableFile *file, - gint64 position, - gint64 n_bytes, - gint64 *n_read_bytes, - guint8 *buffer, - GError **error); - -G_END_DECLS diff --git a/c_glib/arrow-glib/ipc-file-reader.cpp b/c_glib/arrow-glib/ipc-file-reader.cpp index b9e408c4e9464..223be857d9beb 100644 --- a/c_glib/arrow-glib/ipc-file-reader.cpp +++ b/c_glib/arrow-glib/ipc-file-reader.cpp @@ -27,7 +27,7 @@ #include #include -#include +#include #include #include @@ -139,12 +139,12 @@ garrow_ipc_file_reader_class_init(GArrowIPCFileReaderClass *klass) * #GArrowIPCFileReader or %NULL on error. */ GArrowIPCFileReader * -garrow_ipc_file_reader_open(GArrowIOReadableFile *file, +garrow_ipc_file_reader_open(GArrowIORandomAccessFile *file, GError **error) { std::shared_ptr arrow_file_reader; auto status = - arrow::ipc::FileReader::Open(garrow_io_readable_file_get_raw(file), + arrow::ipc::FileReader::Open(garrow_io_random_access_file_get_raw(file), &arrow_file_reader); if (status.ok()) { return garrow_ipc_file_reader_new_raw(&arrow_file_reader); diff --git a/c_glib/arrow-glib/ipc-file-reader.h b/c_glib/arrow-glib/ipc-file-reader.h index 22915f8ae6e68..15eba8e35a273 100644 --- a/c_glib/arrow-glib/ipc-file-reader.h +++ b/c_glib/arrow-glib/ipc-file-reader.h @@ -22,7 +22,7 @@ #include #include -#include +#include #include @@ -70,7 +70,7 @@ struct _GArrowIPCFileReaderClass GType garrow_ipc_file_reader_get_type(void) G_GNUC_CONST; -GArrowIPCFileReader *garrow_ipc_file_reader_open(GArrowIOReadableFile *file, +GArrowIPCFileReader *garrow_ipc_file_reader_open(GArrowIORandomAccessFile *file, GError **error); GArrowSchema *garrow_ipc_file_reader_get_schema(GArrowIPCFileReader *file_reader); diff --git a/c_glib/doc/reference/arrow-glib-docs.sgml b/c_glib/doc/reference/arrow-glib-docs.sgml index 9f504bec7ad53..a732e09df1269 100644 --- a/c_glib/doc/reference/arrow-glib-docs.sgml +++ b/c_glib/doc/reference/arrow-glib-docs.sgml @@ -121,7 +121,7 @@ Input - + Output From f5157a0af7046a618f159a5d0693a664f45658d7 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Sat, 18 Mar 2017 18:29:12 +0100 Subject: [PATCH 0379/1644] ARROW-648: [C++] Support multiarch on Debian On multiarch enabled Debian, we need to install libraries into ${CMAKE_INSTALL_PREFIX}/lib/${ARCH}/ instead of ${CMAKE_INSTALL_PREFIX}/lib/. 
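In outline, the change leans on CMake's GNUInstallDirs module: its CMAKE_INSTALL_LIBDIR can resolve to a multiarch-aware directory such as lib/x86_64-linux-gnu instead of a hard-coded lib. A minimal sketch of the pattern applied throughout this patch (the target name mylib is hypothetical, used only for illustration):

```cmake
# GNUInstallDirs computes CMAKE_INSTALL_LIBDIR; on a multiarch Debian
# system installing under /usr it can resolve to e.g. "lib/x86_64-linux-gnu".
include(GNUInstallDirs)

# Hypothetical library target, not part of this patch.
add_library(mylib SHARED mylib.cc)

# Install into the computed libdir instead of a hard-coded "lib".
install(TARGETS mylib
        LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR}
        ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR})
```
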
See also: https://wiki.debian.org/Multiarch/HOWTO Author: Kouhei Sutou Closes #398 from kou/debian-support-multiarch and squashes the following commits: f5c8495 [Kouhei Sutou] [C++] Fix missing "${prefix}/" in .pc.in 8da48f6 [Kouhei Sutou] [C++] Support multiarch on Debian --- cpp/CMakeLists.txt | 1 + cpp/cmake_modules/BuildUtils.cmake | 8 ++++---- cpp/src/arrow/CMakeLists.txt | 4 ++-- cpp/src/arrow/arrow.pc.in | 2 +- cpp/src/arrow/io/CMakeLists.txt | 2 +- cpp/src/arrow/io/arrow-io.pc.in | 2 +- cpp/src/arrow/ipc/CMakeLists.txt | 2 +- cpp/src/arrow/ipc/arrow-ipc.pc.in | 2 +- cpp/src/arrow/jemalloc/CMakeLists.txt | 2 +- cpp/src/arrow/jemalloc/arrow-jemalloc.pc.in | 2 +- 10 files changed, 14 insertions(+), 13 deletions(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 5ecc34e8a5fc6..b39646ed45b0a 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -28,6 +28,7 @@ set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_SOURCE_DIR}/cmake_modules") include(CMakeParseArguments) include(ExternalProject) +include(GNUInstallDirs) set(BUILD_SUPPORT_DIR "${CMAKE_SOURCE_DIR}/build-support") set(THIRDPARTY_DIR "${CMAKE_SOURCE_DIR}/thirdparty") diff --git a/cpp/cmake_modules/BuildUtils.cmake b/cpp/cmake_modules/BuildUtils.cmake index 2da8a05c9c42a..9e14838cef2e1 100644 --- a/cpp/cmake_modules/BuildUtils.cmake +++ b/cpp/cmake_modules/BuildUtils.cmake @@ -68,8 +68,8 @@ function(ADD_ARROW_LIB LIB_NAME) endif() install(TARGETS ${LIB_NAME}_shared - LIBRARY DESTINATION lib - ARCHIVE DESTINATION lib) + LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR} + ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR}) endif() if (ARROW_BUILD_STATIC) @@ -84,8 +84,8 @@ function(ADD_ARROW_LIB LIB_NAME) LINK_PRIVATE ${ARG_STATIC_PRIVATE_LINK_LIBS}) install(TARGETS ${LIB_NAME}_static - LIBRARY DESTINATION lib - ARCHIVE DESTINATION lib) + LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR} + ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR}) endif() if (APPLE) diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index 0abd4b9c34b0a..24a95475b14e0 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -34,7 +34,7 @@ install(FILES type_fwd.h type_traits.h test-util.h - DESTINATION include/arrow) + DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}/arrow") # pkg-config support configure_file(arrow.pc.in @@ -42,7 +42,7 @@ configure_file(arrow.pc.in @ONLY) install( FILES "${CMAKE_CURRENT_BINARY_DIR}/arrow.pc" - DESTINATION "lib/pkgconfig/") + DESTINATION "${CMAKE_INSTALL_LIBDIR}/pkgconfig/") ####################################### # Unit tests diff --git a/cpp/src/arrow/arrow.pc.in b/cpp/src/arrow/arrow.pc.in index 5ad429b714893..1c3f65d661101 100644 --- a/cpp/src/arrow/arrow.pc.in +++ b/cpp/src/arrow/arrow.pc.in @@ -16,7 +16,7 @@ # under the License. prefix=@CMAKE_INSTALL_PREFIX@ -libdir=${prefix}/lib +libdir=${prefix}/@CMAKE_INSTALL_LIBDIR@ includedir=${prefix}/include Name: Apache Arrow diff --git a/cpp/src/arrow/io/CMakeLists.txt b/cpp/src/arrow/io/CMakeLists.txt index ceb7b7379322a..69621d36f9ee1 100644 --- a/cpp/src/arrow/io/CMakeLists.txt +++ b/cpp/src/arrow/io/CMakeLists.txt @@ -123,4 +123,4 @@ configure_file(arrow-io.pc.in @ONLY) install( FILES "${CMAKE_CURRENT_BINARY_DIR}/arrow-io.pc" - DESTINATION "lib/pkgconfig/") + DESTINATION "${CMAKE_INSTALL_LIBDIR}/pkgconfig/") diff --git a/cpp/src/arrow/io/arrow-io.pc.in b/cpp/src/arrow/io/arrow-io.pc.in index 4b4abdd62df42..af28aae6736fe 100644 --- a/cpp/src/arrow/io/arrow-io.pc.in +++ b/cpp/src/arrow/io/arrow-io.pc.in @@ -16,7 +16,7 @@ # under the License. 
prefix=@CMAKE_INSTALL_PREFIX@ -libdir=${prefix}/lib +libdir=${prefix}/@CMAKE_INSTALL_LIBDIR@ includedir=${prefix}/include Name: Apache Arrow I/O diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index 09a959ba69b51..4a5e319edf8ec 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -163,7 +163,7 @@ configure_file(arrow-ipc.pc.in @ONLY) install( FILES "${CMAKE_CURRENT_BINARY_DIR}/arrow-ipc.pc" - DESTINATION "lib/pkgconfig/") + DESTINATION "${CMAKE_INSTALL_LIBDIR}/pkgconfig/") set(UTIL_LINK_LIBS diff --git a/cpp/src/arrow/ipc/arrow-ipc.pc.in b/cpp/src/arrow/ipc/arrow-ipc.pc.in index 73b44c99f0430..cbc226abf1ff5 100644 --- a/cpp/src/arrow/ipc/arrow-ipc.pc.in +++ b/cpp/src/arrow/ipc/arrow-ipc.pc.in @@ -16,7 +16,7 @@ # under the License. prefix=@CMAKE_INSTALL_PREFIX@ -libdir=${prefix}/lib +libdir=${prefix}/@CMAKE_INSTALL_LIBDIR@ includedir=${prefix}/include Name: Apache Arrow IPC diff --git a/cpp/src/arrow/jemalloc/CMakeLists.txt b/cpp/src/arrow/jemalloc/CMakeLists.txt index 5d5482ab653bf..c7e6c6af97cff 100644 --- a/cpp/src/arrow/jemalloc/CMakeLists.txt +++ b/cpp/src/arrow/jemalloc/CMakeLists.txt @@ -112,4 +112,4 @@ configure_file(arrow-jemalloc.pc.in @ONLY) install( FILES "${CMAKE_CURRENT_BINARY_DIR}/arrow-jemalloc.pc" - DESTINATION "lib/pkgconfig/") + DESTINATION "${CMAKE_INSTALL_LIBDIR}/pkgconfig/") diff --git a/cpp/src/arrow/jemalloc/arrow-jemalloc.pc.in b/cpp/src/arrow/jemalloc/arrow-jemalloc.pc.in index 0b300fec0b2bf..18085aaf715d4 100644 --- a/cpp/src/arrow/jemalloc/arrow-jemalloc.pc.in +++ b/cpp/src/arrow/jemalloc/arrow-jemalloc.pc.in @@ -16,7 +16,7 @@ # under the License. prefix=@CMAKE_INSTALL_PREFIX@ -libdir=${prefix}/lib +libdir=${prefix}/@CMAKE_INSTALL_LIBDIR@ includedir=${prefix}/include Name: Apache Arrow jemalloc-based allocator From 5ef684003cfa268d5532564e43f2589d9fe3ca43 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sat, 18 Mar 2017 14:54:38 -0400 Subject: [PATCH 0380/1644] ARROW-652: Remove trailing f in merge script output Author: Uwe L. Korn Closes #402 from xhochy/ARROW-652 and squashes the following commits: 005488c [Uwe L. Korn] ARROW-652: Remove trailing f in merge script output --- dev/merge_arrow_pr.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/dev/merge_arrow_pr.py b/dev/merge_arrow_pr.py index f7e7a37c36e5c..39db254a9f25d 100755 --- a/dev/merge_arrow_pr.py +++ b/dev/merge_arrow_pr.py @@ -253,7 +253,7 @@ def resolve_jira(title, merge_branches, comment): if cur_status == "Resolved" or cur_status == "Closed": fail("JIRA issue %s already has status '%s'" % (jira_id, cur_status)) print("=== JIRA %s ===" % jira_id) - print("summary\t\t%s\nassignee\t%s\nstatus\t\t%s\nurl\t\t%s/%sf\n" + print("summary\t\t%s\nassignee\t%s\nstatus\t\t%s\nurl\t\t%s/%s\n" % (cur_summary, cur_assignee, cur_status, JIRA_BASE, jira_id)) resolve = [x for x in asf_jira.transitions(jira_id) From 019f90d75a21f6ff7b00d657310ad1c61e5ace01 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Sat, 18 Mar 2017 16:40:37 -0400 Subject: [PATCH 0381/1644] ARROW-647: [C++] Use Boost shared libraries for tests and utilities Boost shared libraries are used when ARROW_BOOST_USE_SHARED is true. Without this change, tests and utilities use Boost static libraries even when ARROW_BOOST_USE_SHARED is enabled. CentOS 7 provides boost-system and boost-filesystem packages but they include only shared libraries. They don't include static libraries. Apache Arrow C++ requires Boost static libraries only for tests and utilities.
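The selection itself boils down to setting FindBoost's Boost_USE_STATIC_LIBS before calling find_package, as in this condensed sketch (an outline of the idea, not the exact code in the diff below):

```cmake
# Condensed sketch: a single switch decides whether find_package(Boost)
# resolves shared or static Boost libraries; targets downstream link
# whichever variant was found.
if(ARROW_BOOST_USE_SHARED)
  set(Boost_USE_STATIC_LIBS OFF)
else()
  set(Boost_USE_STATIC_LIBS ON)
endif()

find_package(Boost COMPONENTS system filesystem REQUIRED)
```
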
We can support CentOS 7 by making Boost static libraries optional. Author: Kouhei Sutou Closes #400 from kou/use-boost-shared-libraries-for-tests-and-utilities and squashes the following commits: 094b69f [Kouhei Sutou] [C++] Use Boost shared libraries for tests and utilities --- cpp/CMakeLists.txt | 46 ++++++++++++++++++-------------- cpp/src/arrow/io/CMakeLists.txt | 16 ++++------- cpp/src/arrow/ipc/CMakeLists.txt | 19 +++++-------- 3 files changed, 37 insertions(+), 44 deletions(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index b39646ed45b0a..197aa9c7cb636 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -410,29 +410,35 @@ endfunction() # ---------------------------------------------------------------------- # Add Boost dependencies (code adapted from Apache Kudu (incubating)) -# Find static boost headers and libs -# TODO Differentiate here between release and debug builds set(Boost_DEBUG TRUE) set(Boost_USE_MULTITHREADED ON) -set(Boost_USE_STATIC_LIBS ON) -find_package(Boost COMPONENTS system filesystem REQUIRED) -if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG") - set(BOOST_STATIC_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_DEBUG}) - set(BOOST_STATIC_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_DEBUG}) -else() - set(BOOST_STATIC_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_RELEASE}) - set(BOOST_STATIC_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_RELEASE}) -endif() - -# Find shared Boost libraries. -set(Boost_USE_STATIC_LIBS OFF) -find_package(Boost COMPONENTS system filesystem REQUIRED) -if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG") - set(BOOST_SHARED_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_DEBUG}) - set(BOOST_SHARED_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_DEBUG}) +if (ARROW_BOOST_USE_SHARED) + # Find shared Boost libraries. + set(Boost_USE_STATIC_LIBS OFF) + find_package(Boost COMPONENTS system filesystem REQUIRED) + if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG") + set(BOOST_SHARED_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_DEBUG}) + set(BOOST_SHARED_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_DEBUG}) + else() + set(BOOST_SHARED_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_RELEASE}) + set(BOOST_SHARED_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_RELEASE}) + endif() + set(BOOST_SYSTEM_LIBRARY boost_system_shared) + set(BOOST_FILESYSTEM_LIBRARY boost_filesystem_shared) else() - set(BOOST_SHARED_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_RELEASE}) - set(BOOST_SHARED_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_RELEASE}) + # Find static boost headers and libs + # TODO Differentiate here between release and debug builds + set(Boost_USE_STATIC_LIBS ON) + find_package(Boost COMPONENTS system filesystem REQUIRED) + if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG") + set(BOOST_STATIC_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_DEBUG}) + set(BOOST_STATIC_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_DEBUG}) + else() + set(BOOST_STATIC_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_RELEASE}) + set(BOOST_STATIC_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_RELEASE}) + endif() + set(BOOST_SYSTEM_LIBRARY boost_system_static) + set(BOOST_FILESYSTEM_LIBRARY boost_filesystem_static) endif() message(STATUS "Boost include dir: " ${Boost_INCLUDE_DIRS}) diff --git a/cpp/src/arrow/io/CMakeLists.txt b/cpp/src/arrow/io/CMakeLists.txt index 69621d36f9ee1..af3acbf06d1ef 100644 --- a/cpp/src/arrow/io/CMakeLists.txt +++ b/cpp/src/arrow/io/CMakeLists.txt @@ -56,19 +56,13 @@ else() ) endif() -if (ARROW_BOOST_USE_SHARED) - set(ARROW_IO_SHARED_PRIVATE_LINK_LIBS - boost_system_shared - boost_filesystem_shared) -else() - 
set(ARROW_IO_SHARED_PRIVATE_LINK_LIBS - boost_system_static - boost_filesystem_static) -endif() +set(ARROW_IO_SHARED_PRIVATE_LINK_LIBS + ${BOOST_SYSTEM_LIBRARY} + ${BOOST_FILESYSTEM_LIBRARY}) set(ARROW_IO_STATIC_PRIVATE_LINK_LIBS - boost_system_static - boost_filesystem_static) + ${BOOST_SYSTEM_LIBRARY} + ${BOOST_FILESYSTEM_LIBRARY}) set(ARROW_IO_TEST_LINK_LIBS arrow_io_static) diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index 4a5e319edf8ec..c73af63285bcd 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -90,8 +90,8 @@ if (ARROW_BUILD_TESTS) arrow_static gflags gtest - boost_filesystem_static - boost_system_static + ${BOOST_FILESYSTEM_LIBRARY} + ${BOOST_SYSTEM_LIBRARY} dl) set_target_properties(json-integration-test PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") @@ -103,8 +103,8 @@ if (ARROW_BUILD_TESTS) gflags gtest pthread - boost_filesystem_static - boost_system_static + ${BOOST_FILESYSTEM_LIBRARY} + ${BOOST_SYSTEM_LIBRARY} dl) endif() endif() @@ -170,17 +170,10 @@ set(UTIL_LINK_LIBS arrow_ipc_static arrow_io_static arrow_static - boost_filesystem_static - boost_system_static + ${BOOST_FILESYSTEM_LIBRARY} + ${BOOST_SYSTEM_LIBRARY} dl) -if (NOT APPLE) - set(UTIL_LINK_LIBS - ${UTIL_LINK_LIBS} - boost_filesystem_static - boost_system_static) -endif() - if (ARROW_BUILD_UTILITIES) add_executable(file-to-stream file-to-stream.cc) target_link_libraries(file-to-stream ${UTIL_LINK_LIBS}) From 98c9490180aed2b24be395e80f50e7f606fadcd5 Mon Sep 17 00:00:00 2001 From: Miki Tebeka Date: Sat, 18 Mar 2017 16:47:13 -0400 Subject: [PATCH 0382/1644] ARROW-639: [C++] Invalid offset in slices Fix incrementing offset_ twice in Slice Author: Miki Tebeka Closes #387 from tebeka/ARROW-639 and squashes the following commits: 6520f4c [Miki Tebeka] fix lint error 95fca13 [Miki Tebeka] [ARROW-639] [C++] Invalid offset in slices --- cpp/src/arrow/array-string-test.cc | 14 ++++++++++++++ cpp/src/arrow/array.h | 4 ++-- 2 files changed, 16 insertions(+), 2 deletions(-) diff --git a/cpp/src/arrow/array-string-test.cc b/cpp/src/arrow/array-string-test.cc index cf2ff416032c6..ed38acd010329 100644 --- a/cpp/src/arrow/array-string-test.cc +++ b/cpp/src/arrow/array-string-test.cc @@ -165,6 +165,20 @@ TEST_F(TestStringArray, CompareNullByteSlots) { equal_array.Array::Slice(1)->RangeEquals(0, 2, 0, equal_array2.Array::Slice(1))); } +TEST_F(TestStringArray, TestSliceGetString) { + StringBuilder builder(default_memory_pool()); + + builder.Append("a"); + builder.Append("b"); + builder.Append("c"); + + std::shared_ptr array; + ASSERT_OK(builder.Finish(&array)); + auto s = array->Slice(1, 10); + auto arr = std::dynamic_pointer_cast(s); + ASSERT_EQ(arr->GetString(0), "b"); +} + // ---------------------------------------------------------------------- // String builder tests diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index ecc8ce540b1dd..50faf0892e8c0 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -322,8 +322,8 @@ class ARROW_EXPORT BinaryArray : public Array { // Account for base offset i += offset_; - const int32_t pos = raw_value_offsets_[i + offset_]; - *out_length = raw_value_offsets_[i + offset_ + 1] - pos; + const int32_t pos = raw_value_offsets_[i]; + *out_length = raw_value_offsets_[i + 1] - pos; return raw_data_ + pos; } From a9f0c63ad9b8942f3287da2b7109d486d92731b0 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Sat, 18 Mar 2017 16:52:45 -0400 Subject: [PATCH 0383/1644] ARROW-651: [C++] Set version to shared
library .deb package builder assumes that shared libraries have a version. See also: https://www.debian.org/doc/debian-policy/ch-sharedlibs.html Author: Kouhei Sutou Closes #401 from kou/debian-set-version-to-shared-library and squashes the following commits: 2da1442 [Kouhei Sutou] [C++] Set version to shared library --- cpp/CMakeLists.txt | 3 +++ cpp/cmake_modules/BuildUtils.cmake | 4 +++- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 197aa9c7cb636..956658a82524c 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -30,6 +30,9 @@ include(CMakeParseArguments) include(ExternalProject) include(GNUInstallDirs) +set(ARROW_SO_VERSION "0") +set(ARROW_ABI_VERSION "${ARROW_SO_VERSION}.0.0") + set(BUILD_SUPPORT_DIR "${CMAKE_SOURCE_DIR}/build-support") set(THIRDPARTY_DIR "${CMAKE_SOURCE_DIR}/thirdparty") diff --git a/cpp/cmake_modules/BuildUtils.cmake b/cpp/cmake_modules/BuildUtils.cmake index 9e14838cef2e1..78b514c2295ae 100644 --- a/cpp/cmake_modules/BuildUtils.cmake +++ b/cpp/cmake_modules/BuildUtils.cmake @@ -52,7 +52,9 @@ function(ADD_ARROW_LIB LIB_NAME) PROPERTIES LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}" LINK_FLAGS "${ARG_SHARED_LINK_FLAGS}" - OUTPUT_NAME ${LIB_NAME}) + OUTPUT_NAME ${LIB_NAME} + VERSION "${ARROW_ABI_VERSION}" + SOVERSION "${ARROW_SO_VERSION}") target_link_libraries(${LIB_NAME}_shared LINK_PUBLIC ${ARG_SHARED_LINK_LIBS} LINK_PRIVATE ${ARG_SHARED_PRIVATE_LINK_LIBS}) From 4c5f79c39c2b22afe68becf1d1e7a93ca781f88d Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 19 Mar 2017 01:03:47 -0400 Subject: [PATCH 0384/1644] ARROW-617: [Format] Add additional Time metadata and comments based on discussion in ARROW-617 Author: Wes McKinney Closes #385 from wesm/ARROW-617 and squashes the following commits: f9f0571 [Wes McKinney] Add metadata and comments based on discussion in ARROW-617 --- format/Message.fbs | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/format/Message.fbs b/format/Message.fbs index f2d5eba75e65b..8fdcc804d4706 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -81,16 +81,21 @@ table Decimal { scale: int; } +/// Date is a 64-bit type representing milliseconds since the UNIX epoch table Date { } enum TimeUnit: short { SECOND, MILLISECOND, MICROSECOND, NANOSECOND } +/// Time type. The physical storage type depends on the unit +/// - SECOND and MILLISECOND: 32 bits +/// - MICROSECOND and NANOSECOND: 64 bits table Time { unit: TimeUnit; + bitWidth: int; } -/// time from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC. +/// Time elapsed from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC. table Timestamp { unit: TimeUnit; From df2220f350282925a454ed911eed6618e4d53969 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 20 Mar 2017 10:48:34 +0100 Subject: [PATCH 0385/1644] ARROW-661: [C++] Add LargeRecordBatch metadata type, IPC support, associated refactoring This patch enables the following code for writing record batches exceeding 2^31 - 1 in length ```c++ RETURN_NOT_OK(WriteLargeRecordBatch( batch, buffer_offset, mmap_.get(), &metadata_length, &body_length, pool_)); return ReadLargeRecordBatch(batch.schema(), 0, mmap_.get(), result); ``` This also does a fair amount of refactoring and code consolidation related to ongoing code cleanup in arrow_ipc. These APIs are marked experimental. This does add the `LargeRecordBatch` flatbuffer type to the Message union, but I've indicated that Arrow implementations (e.g. Java) are not required to implement this type.
It's strictly to enable C++ users to write very large datasets that have been embedded for convenience in Arrow's structured data model. cc @pcmoritz @robertnishihara Author: Wes McKinney Closes #404 from wesm/ARROW-661 and squashes the following commits: 9c18a95 [Wes McKinney] Fix import ordering d7811f2 [Wes McKinney] cpplint 179a1e3 [Wes McKinney] Add unit test for large record batches. Use bytewise comparisons with aligned bitmaps 36c3862 [Wes McKinney] Get LargeRecordBatch round trip working. Add to Message union for now 4c1d08c [Wes McKinney] Refactoring, failing test fixture for large record batch f4c8830 [Wes McKinney] Consolidate ipc-metadata-test and ipc-read-write-test and draft large record batch read/write path 85d1a1c [Wes McKinney] Add (untested) metadata writer for LargeRecordBatch 0f2722c [Wes McKinney] Consolidate metadata-internal.h into metadata.h. Use own Arrow structs for IPC metadata and convert to flatbuffers later e8f8973 [Wes McKinney] Split adapter.h/cc into reader.h/writer.h. Draft LargeRecordBatch type --- cpp/src/arrow/allocator-test.cc | 1 + cpp/src/arrow/allocator.h | 1 + cpp/src/arrow/io/test-common.h | 18 + cpp/src/arrow/ipc/CMakeLists.txt | 15 +- cpp/src/arrow/ipc/adapter.cc | 630 --------------------- cpp/src/arrow/ipc/adapter.h | 104 ---- cpp/src/arrow/ipc/api.h | 1 - cpp/src/arrow/ipc/ipc-adapter-test.cc | 320 ----------- cpp/src/arrow/ipc/ipc-file-test.cc | 228 -------- cpp/src/arrow/ipc/ipc-metadata-test.cc | 100 ---- cpp/src/arrow/ipc/ipc-read-write-test.cc | 608 ++++++++++++++++++++ cpp/src/arrow/ipc/metadata-internal.cc | 597 ------------------- cpp/src/arrow/ipc/metadata-internal.h | 83 --- cpp/src/arrow/ipc/metadata.cc | 692 ++++++++++++++++++++++- cpp/src/arrow/ipc/metadata.h | 40 +- cpp/src/arrow/ipc/reader.cc | 171 +++++- cpp/src/arrow/ipc/reader.h | 22 + cpp/src/arrow/ipc/test-common.h | 2 +- cpp/src/arrow/ipc/writer.cc | 544 ++++++++++++++++-- cpp/src/arrow/ipc/writer.h | 46 +- cpp/src/arrow/loader.h | 25 + cpp/src/arrow/type.h | 1 + cpp/src/arrow/util/bit-util.cc | 16 +- format/Message.fbs | 22 +- 24 files changed, 2131 insertions(+), 2156 deletions(-) delete mode 100644 cpp/src/arrow/ipc/adapter.cc delete mode 100644 cpp/src/arrow/ipc/adapter.h delete mode 100644 cpp/src/arrow/ipc/ipc-adapter-test.cc delete mode 100644 cpp/src/arrow/ipc/ipc-file-test.cc delete mode 100644 cpp/src/arrow/ipc/ipc-metadata-test.cc create mode 100644 cpp/src/arrow/ipc/ipc-read-write-test.cc delete mode 100644 cpp/src/arrow/ipc/metadata-internal.cc delete mode 100644 cpp/src/arrow/ipc/metadata-internal.h diff --git a/cpp/src/arrow/allocator-test.cc b/cpp/src/arrow/allocator-test.cc index 0b242674bf175..811ef5a79c2dd 100644 --- a/cpp/src/arrow/allocator-test.cc +++ b/cpp/src/arrow/allocator-test.cc @@ -16,6 +16,7 @@ // under the License. 
#include "gtest/gtest.h" + #include "arrow/allocator.h" #include "arrow/test-util.h" diff --git a/cpp/src/arrow/allocator.h b/cpp/src/arrow/allocator.h index c976ba96b8d03..e00023dc460fb 100644 --- a/cpp/src/arrow/allocator.h +++ b/cpp/src/arrow/allocator.h @@ -21,6 +21,7 @@ #include #include #include + #include "arrow/memory_pool.h" #include "arrow/status.h" diff --git a/cpp/src/arrow/io/test-common.h b/cpp/src/arrow/io/test-common.h index 8355714540e95..4c114760e9a4b 100644 --- a/cpp/src/arrow/io/test-common.h +++ b/cpp/src/arrow/io/test-common.h @@ -41,6 +41,24 @@ namespace arrow { namespace io { +static inline Status ZeroMemoryMap(MemoryMappedFile* file) { + constexpr int64_t kBufferSize = 512; + static constexpr uint8_t kZeroBytes[kBufferSize] = {0}; + + RETURN_NOT_OK(file->Seek(0)); + int64_t position = 0; + int64_t file_size; + RETURN_NOT_OK(file->GetSize(&file_size)); + + int64_t chunksize; + while (position < file_size) { + chunksize = std::min(kBufferSize, file_size - position); + RETURN_NOT_OK(file->Write(kZeroBytes, chunksize)); + position += chunksize; + } + return Status::OK(); +} + class MemoryMapFixture { public: void TearDown() { diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index c73af63285bcd..5d470df0309b3 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -29,12 +29,10 @@ set(ARROW_IPC_TEST_LINK_LIBS arrow_io_static) set(ARROW_IPC_SRCS - adapter.cc feather.cc json.cc json-internal.cc metadata.cc - metadata-internal.cc reader.cc writer.cc ) @@ -64,16 +62,8 @@ ADD_ARROW_TEST(feather-test) ARROW_TEST_LINK_LIBRARIES(feather-test ${ARROW_IPC_TEST_LINK_LIBS}) -ADD_ARROW_TEST(ipc-adapter-test) -ARROW_TEST_LINK_LIBRARIES(ipc-adapter-test - ${ARROW_IPC_TEST_LINK_LIBS}) - -ADD_ARROW_TEST(ipc-file-test) -ARROW_TEST_LINK_LIBRARIES(ipc-file-test - ${ARROW_IPC_TEST_LINK_LIBS}) - -ADD_ARROW_TEST(ipc-metadata-test) -ARROW_TEST_LINK_LIBRARIES(ipc-metadata-test +ADD_ARROW_TEST(ipc-read-write-test) +ARROW_TEST_LINK_LIBRARIES(ipc-read-write-test ${ARROW_IPC_TEST_LINK_LIBS}) ADD_ARROW_TEST(ipc-json-test) @@ -148,7 +138,6 @@ add_dependencies(arrow_ipc_objlib metadata_fbs) # Headers: top level install(FILES - adapter.h api.h feather.h json.h diff --git a/cpp/src/arrow/ipc/adapter.cc b/cpp/src/arrow/ipc/adapter.cc deleted file mode 100644 index db9f63ca18cbd..0000000000000 --- a/cpp/src/arrow/ipc/adapter.cc +++ /dev/null @@ -1,630 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
- -#include "arrow/ipc/adapter.h" - -#include -#include -#include -#include -#include -#include - -#include "arrow/array.h" -#include "arrow/buffer.h" -#include "arrow/io/interfaces.h" -#include "arrow/io/memory.h" -#include "arrow/ipc/Message_generated.h" -#include "arrow/ipc/metadata-internal.h" -#include "arrow/ipc/metadata.h" -#include "arrow/ipc/util.h" -#include "arrow/loader.h" -#include "arrow/memory_pool.h" -#include "arrow/schema.h" -#include "arrow/status.h" -#include "arrow/table.h" -#include "arrow/type.h" -#include "arrow/type_fwd.h" -#include "arrow/util/bit-util.h" -#include "arrow/util/logging.h" - -namespace arrow { - -namespace flatbuf = org::apache::arrow::flatbuf; - -namespace ipc { - -// ---------------------------------------------------------------------- -// Record batch write path - -class RecordBatchWriter : public ArrayVisitor { - public: - RecordBatchWriter( - MemoryPool* pool, int64_t buffer_start_offset, int max_recursion_depth) - : pool_(pool), - max_recursion_depth_(max_recursion_depth), - buffer_start_offset_(buffer_start_offset) { - DCHECK_GT(max_recursion_depth, 0); - } - - virtual ~RecordBatchWriter() = default; - - Status VisitArray(const Array& arr) { - if (max_recursion_depth_ <= 0) { - return Status::Invalid("Max recursion depth reached"); - } - - if (arr.length() > std::numeric_limits::max()) { - return Status::Invalid("Cannot write arrays larger than 2^31 - 1 in length"); - } - - // push back all common elements - field_nodes_.push_back(flatbuf::FieldNode( - static_cast(arr.length()), static_cast(arr.null_count()))); - if (arr.null_count() > 0) { - std::shared_ptr bitmap = arr.null_bitmap(); - - if (arr.offset() != 0) { - // With a sliced array / non-zero offset, we must copy the bitmap - RETURN_NOT_OK( - CopyBitmap(pool_, bitmap->data(), arr.offset(), arr.length(), &bitmap)); - } - - buffers_.push_back(bitmap); - } else { - // Push a dummy zero-length buffer, not to be copied - buffers_.push_back(std::make_shared(nullptr, 0)); - } - return arr.Accept(this); - } - - Status Assemble(const RecordBatch& batch, int64_t* body_length) { - if (field_nodes_.size() > 0) { - field_nodes_.clear(); - buffer_meta_.clear(); - buffers_.clear(); - } - - // Perform depth-first traversal of the row-batch - for (int i = 0; i < batch.num_columns(); ++i) { - RETURN_NOT_OK(VisitArray(*batch.column(i))); - } - - // The position for the start of a buffer relative to the passed frame of - // reference. May be 0 or some other position in an address space - int64_t offset = buffer_start_offset_; - - // Construct the buffer metadata for the record batch header - for (size_t i = 0; i < buffers_.size(); ++i) { - const Buffer* buffer = buffers_[i].get(); - int64_t size = 0; - int64_t padding = 0; - - // The buffer might be null if we are handling zero row lengths. - if (buffer) { - size = buffer->size(); - padding = BitUtil::RoundUpToMultipleOf64(size) - size; - } - - // TODO(wesm): We currently have no notion of shared memory page id's, - // but we've included it in the metadata IDL for when we have it in the - // future. Use page = -1 for now - // - // Note that page ids are a bespoke notion for Arrow and not a feature we - // are using from any OS-level shared memory. 
The thought is that systems - // may (in the future) associate integer page id's with physical memory - // pages (according to whatever is the desired shared memory mechanism) - buffer_meta_.push_back(flatbuf::Buffer(-1, offset, size + padding)); - offset += size + padding; - } - - *body_length = offset - buffer_start_offset_; - DCHECK(BitUtil::IsMultipleOf64(*body_length)); - - return Status::OK(); - } - - // Override this for writing dictionary metadata - virtual Status WriteMetadataMessage( - int32_t num_rows, int64_t body_length, std::shared_ptr* out) { - return WriteRecordBatchMessage( - num_rows, body_length, field_nodes_, buffer_meta_, out); - } - - Status WriteMetadata(int32_t num_rows, int64_t body_length, io::OutputStream* dst, - int32_t* metadata_length) { - // Now that we have computed the locations of all of the buffers in shared - // memory, the data header can be converted to a flatbuffer and written out - // - // Note: The memory written here is prefixed by the size of the flatbuffer - // itself as an int32_t. - std::shared_ptr metadata_fb; - RETURN_NOT_OK(WriteMetadataMessage(num_rows, body_length, &metadata_fb)); - - // Need to write 4 bytes (metadata size), the metadata, plus padding to - // end on an 8-byte offset - int64_t start_offset; - RETURN_NOT_OK(dst->Tell(&start_offset)); - - int32_t padded_metadata_length = static_cast(metadata_fb->size()) + 4; - const int32_t remainder = - (padded_metadata_length + static_cast(start_offset)) % 8; - if (remainder != 0) { padded_metadata_length += 8 - remainder; } - - // The returned metadata size includes the length prefix, the flatbuffer, - // plus padding - *metadata_length = padded_metadata_length; - - // Write the flatbuffer size prefix including padding - int32_t flatbuffer_size = padded_metadata_length - 4; - RETURN_NOT_OK( - dst->Write(reinterpret_cast(&flatbuffer_size), sizeof(int32_t))); - - // Write the flatbuffer - RETURN_NOT_OK(dst->Write(metadata_fb->data(), metadata_fb->size())); - - // Write any padding - int32_t padding = - padded_metadata_length - static_cast(metadata_fb->size()) - 4; - if (padding > 0) { RETURN_NOT_OK(dst->Write(kPaddingBytes, padding)); } - - return Status::OK(); - } - - Status Write(const RecordBatch& batch, io::OutputStream* dst, int32_t* metadata_length, - int64_t* body_length) { - RETURN_NOT_OK(Assemble(batch, body_length)); - -#ifndef NDEBUG - int64_t start_position, current_position; - RETURN_NOT_OK(dst->Tell(&start_position)); -#endif - - RETURN_NOT_OK(WriteMetadata( - static_cast(batch.num_rows()), *body_length, dst, metadata_length)); - -#ifndef NDEBUG - RETURN_NOT_OK(dst->Tell(¤t_position)); - DCHECK(BitUtil::IsMultipleOf8(current_position)); -#endif - - // Now write the buffers - for (size_t i = 0; i < buffers_.size(); ++i) { - const Buffer* buffer = buffers_[i].get(); - int64_t size = 0; - int64_t padding = 0; - - // The buffer might be null if we are handling zero row lengths. 
- if (buffer) { - size = buffer->size(); - padding = BitUtil::RoundUpToMultipleOf64(size) - size; - } - - if (size > 0) { RETURN_NOT_OK(dst->Write(buffer->data(), size)); } - - if (padding > 0) { RETURN_NOT_OK(dst->Write(kPaddingBytes, padding)); } - } - -#ifndef NDEBUG - RETURN_NOT_OK(dst->Tell(¤t_position)); - DCHECK(BitUtil::IsMultipleOf8(current_position)); -#endif - - return Status::OK(); - } - - Status GetTotalSize(const RecordBatch& batch, int64_t* size) { - // emulates the behavior of Write without actually writing - int32_t metadata_length = 0; - int64_t body_length = 0; - MockOutputStream dst; - RETURN_NOT_OK(Write(batch, &dst, &metadata_length, &body_length)); - *size = dst.GetExtentBytesWritten(); - return Status::OK(); - } - - protected: - template - Status VisitFixedWidth(const ArrayType& array) { - std::shared_ptr data_buffer = array.data(); - - if (array.offset() != 0) { - // Non-zero offset, slice the buffer - const auto& fw_type = static_cast(*array.type()); - const int type_width = fw_type.bit_width() / 8; - const int64_t byte_offset = array.offset() * type_width; - - // Send padding if it's available - const int64_t buffer_length = - std::min(BitUtil::RoundUpToMultipleOf64(array.length() * type_width), - data_buffer->size() - byte_offset); - data_buffer = SliceBuffer(data_buffer, byte_offset, buffer_length); - } - buffers_.push_back(data_buffer); - return Status::OK(); - } - - template - Status GetZeroBasedValueOffsets( - const ArrayType& array, std::shared_ptr* value_offsets) { - // Share slicing logic between ListArray and BinaryArray - - auto offsets = array.value_offsets(); - - if (array.offset() != 0) { - // If we have a non-zero offset, then the value offsets do not start at - // zero. We must a) create a new offsets array with shifted offsets and - // b) slice the values array accordingly - - std::shared_ptr shifted_offsets; - RETURN_NOT_OK(AllocateBuffer( - pool_, sizeof(int32_t) * (array.length() + 1), &shifted_offsets)); - - int32_t* dest_offsets = reinterpret_cast(shifted_offsets->mutable_data()); - const int32_t start_offset = array.value_offset(0); - - for (int i = 0; i < array.length(); ++i) { - dest_offsets[i] = array.value_offset(i) - start_offset; - } - // Final offset - dest_offsets[array.length()] = array.value_offset(array.length()) - start_offset; - offsets = shifted_offsets; - } - - *value_offsets = offsets; - return Status::OK(); - } - - Status VisitBinary(const BinaryArray& array) { - std::shared_ptr value_offsets; - RETURN_NOT_OK(GetZeroBasedValueOffsets(array, &value_offsets)); - auto data = array.data(); - - if (array.offset() != 0) { - // Slice the data buffer to include only the range we need now - data = SliceBuffer(data, array.value_offset(0), array.value_offset(array.length())); - } - - buffers_.push_back(value_offsets); - buffers_.push_back(data); - return Status::OK(); - } - - Status Visit(const FixedWidthBinaryArray& array) override { - auto data = array.data(); - int32_t width = array.byte_width(); - - if (array.offset() != 0) { - data = SliceBuffer(data, array.offset() * width, width * array.length()); - } - buffers_.push_back(data); - return Status::OK(); - } - - Status Visit(const BooleanArray& array) override { - buffers_.push_back(array.data()); - return Status::OK(); - } - -#define VISIT_FIXED_WIDTH(TYPE) \ - Status Visit(const TYPE& array) override { return VisitFixedWidth(array); } - - VISIT_FIXED_WIDTH(Int8Array); - VISIT_FIXED_WIDTH(Int16Array); - VISIT_FIXED_WIDTH(Int32Array); - VISIT_FIXED_WIDTH(Int64Array); - 
VISIT_FIXED_WIDTH(UInt8Array); - VISIT_FIXED_WIDTH(UInt16Array); - VISIT_FIXED_WIDTH(UInt32Array); - VISIT_FIXED_WIDTH(UInt64Array); - VISIT_FIXED_WIDTH(HalfFloatArray); - VISIT_FIXED_WIDTH(FloatArray); - VISIT_FIXED_WIDTH(DoubleArray); - VISIT_FIXED_WIDTH(DateArray); - VISIT_FIXED_WIDTH(Date32Array); - VISIT_FIXED_WIDTH(TimeArray); - VISIT_FIXED_WIDTH(TimestampArray); - -#undef VISIT_FIXED_WIDTH - - Status Visit(const StringArray& array) override { return VisitBinary(array); } - - Status Visit(const BinaryArray& array) override { return VisitBinary(array); } - - Status Visit(const ListArray& array) override { - std::shared_ptr value_offsets; - RETURN_NOT_OK(GetZeroBasedValueOffsets(array, &value_offsets)); - buffers_.push_back(value_offsets); - - --max_recursion_depth_; - std::shared_ptr values = array.values(); - - if (array.offset() != 0) { - // For non-zero offset, we slice the values array accordingly - const int32_t offset = array.value_offset(0); - const int32_t length = array.value_offset(array.length()) - offset; - values = values->Slice(offset, length); - } - RETURN_NOT_OK(VisitArray(*values)); - ++max_recursion_depth_; - return Status::OK(); - } - - Status Visit(const StructArray& array) override { - --max_recursion_depth_; - for (std::shared_ptr field : array.fields()) { - if (array.offset() != 0) { - // If offset is non-zero, slice the child array - field = field->Slice(array.offset(), array.length()); - } - RETURN_NOT_OK(VisitArray(*field)); - } - ++max_recursion_depth_; - return Status::OK(); - } - - Status Visit(const UnionArray& array) override { - auto type_ids = array.type_ids(); - if (array.offset() != 0) { - type_ids = SliceBuffer(type_ids, array.offset() * sizeof(UnionArray::type_id_t), - array.length() * sizeof(UnionArray::type_id_t)); - } - - buffers_.push_back(type_ids); - - --max_recursion_depth_; - if (array.mode() == UnionMode::DENSE) { - const auto& type = static_cast(*array.type()); - auto value_offsets = array.value_offsets(); - - // The Union type codes are not necessary 0-indexed - uint8_t max_code = 0; - for (uint8_t code : type.type_codes) { - if (code > max_code) { max_code = code; } - } - - // Allocate an array of child offsets. Set all to -1 to indicate that we - // haven't observed a first occurrence of a particular child yet - std::vector child_offsets(max_code + 1); - std::vector child_lengths(max_code + 1, 0); - - if (array.offset() != 0) { - // This is an unpleasant case. 
Because the offsets are different for - // each child array, when we have a sliced array, we need to "rebase" - // the value_offsets for each array - - const int32_t* unshifted_offsets = array.raw_value_offsets(); - const uint8_t* type_ids = array.raw_type_ids(); - - // Allocate the shifted offsets - std::shared_ptr shifted_offsets_buffer; - RETURN_NOT_OK(AllocateBuffer( - pool_, array.length() * sizeof(int32_t), &shifted_offsets_buffer)); - int32_t* shifted_offsets = - reinterpret_cast(shifted_offsets_buffer->mutable_data()); - - for (int64_t i = 0; i < array.length(); ++i) { - const uint8_t code = type_ids[i]; - int32_t shift = child_offsets[code]; - if (shift == -1) { child_offsets[code] = shift = unshifted_offsets[i]; } - shifted_offsets[i] = unshifted_offsets[i] - shift; - - // Update the child length to account for observed value - ++child_lengths[code]; - } - - value_offsets = shifted_offsets_buffer; - } - buffers_.push_back(value_offsets); - - // Visit children and slice accordingly - for (int i = 0; i < type.num_children(); ++i) { - std::shared_ptr child = array.child(i); - if (array.offset() != 0) { - const uint8_t code = type.type_codes[i]; - child = child->Slice(child_offsets[code], child_lengths[code]); - } - RETURN_NOT_OK(VisitArray(*child)); - } - } else { - for (std::shared_ptr child : array.children()) { - // Sparse union, slicing is simpler - if (array.offset() != 0) { - // If offset is non-zero, slice the child array - child = child->Slice(array.offset(), array.length()); - } - RETURN_NOT_OK(VisitArray(*child)); - } - } - ++max_recursion_depth_; - return Status::OK(); - } - - Status Visit(const DictionaryArray& array) override { - // Dictionary written out separately. Slice offset contained in the indices - return array.indices()->Accept(this); - } - - // In some cases, intermediate buffers may need to be allocated (with sliced arrays) - MemoryPool* pool_; - - std::vector field_nodes_; - std::vector buffer_meta_; - std::vector> buffers_; - - int64_t max_recursion_depth_; - int64_t buffer_start_offset_; -}; - -class DictionaryWriter : public RecordBatchWriter { - public: - using RecordBatchWriter::RecordBatchWriter; - - Status WriteMetadataMessage( - int32_t num_rows, int64_t body_length, std::shared_ptr* out) override { - return WriteDictionaryMessage( - dictionary_id_, num_rows, body_length, field_nodes_, buffer_meta_, out); - } - - Status Write(int64_t dictionary_id, const std::shared_ptr& dictionary, - io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length) { - dictionary_id_ = dictionary_id; - - // Make a dummy record batch. 
A bit tedious as we have to make a schema
-    std::vector<std::shared_ptr<Field>> fields = {
-        arrow::field("dictionary", dictionary->type())};
-    auto schema = std::make_shared<Schema>(fields);
-    RecordBatch batch(schema, dictionary->length(), {dictionary});
-
-    return RecordBatchWriter::Write(batch, dst, metadata_length, body_length);
-  }
-
- private:
-  // TODO(wesm): Setting this in Write is a bit unclean, but it works
-  int64_t dictionary_id_;
-};
-
-Status WriteRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset,
-    io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length,
-    MemoryPool* pool, int max_recursion_depth) {
-  RecordBatchWriter writer(pool, buffer_start_offset, max_recursion_depth);
-  return writer.Write(batch, dst, metadata_length, body_length);
-}
-
-Status WriteDictionary(int64_t dictionary_id, const std::shared_ptr<Array>& dictionary,
-    int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length,
-    int64_t* body_length, MemoryPool* pool) {
-  DictionaryWriter writer(pool, buffer_start_offset, kMaxNestingDepth);
-  return writer.Write(dictionary_id, dictionary, dst, metadata_length, body_length);
-}
-
-Status GetRecordBatchSize(const RecordBatch& batch, int64_t* size) {
-  RecordBatchWriter writer(default_memory_pool(), 0, kMaxNestingDepth);
-  RETURN_NOT_OK(writer.GetTotalSize(batch, size));
-  return Status::OK();
-}
-
-// ----------------------------------------------------------------------
-// Record batch read path
-
-class IpcComponentSource : public ArrayComponentSource {
- public:
-  IpcComponentSource(const RecordBatchMetadata& metadata, io::RandomAccessFile* file)
-      : metadata_(metadata), file_(file) {}
-
-  Status GetBuffer(int buffer_index, std::shared_ptr<Buffer>* out) override {
-    BufferMetadata buffer_meta = metadata_.buffer(buffer_index);
-    if (buffer_meta.length == 0) {
-      *out = nullptr;
-      return Status::OK();
-    } else {
-      return file_->ReadAt(buffer_meta.offset, buffer_meta.length, out);
-    }
-  }
-
-  Status GetFieldMetadata(int field_index, FieldMetadata* metadata) override {
-    // pop off a field
-    if (field_index >= metadata_.num_fields()) {
-      return Status::Invalid("Ran out of field metadata, likely malformed");
-    }
-    *metadata = metadata_.field(field_index);
-    return Status::OK();
-  }
-
- private:
-  const RecordBatchMetadata& metadata_;
-  io::RandomAccessFile* file_;
-};
-
-class RecordBatchReader {
- public:
-  RecordBatchReader(const RecordBatchMetadata& metadata,
-      const std::shared_ptr<Schema>& schema, int max_recursion_depth,
-      io::RandomAccessFile* file)
-      : metadata_(metadata),
-        schema_(schema),
-        max_recursion_depth_(max_recursion_depth),
-        file_(file) {}
-
-  Status Read(std::shared_ptr<RecordBatch>* out) {
-    std::vector<std::shared_ptr<Array>> arrays(schema_->num_fields());
-
-    IpcComponentSource source(metadata_, file_);
-    ArrayLoaderContext context;
-    context.source = &source;
-    context.field_index = 0;
-    context.buffer_index = 0;
-    context.max_recursion_depth = max_recursion_depth_;
-
-    for (int i = 0; i < schema_->num_fields(); ++i) {
-      RETURN_NOT_OK(LoadArray(schema_->field(i)->type, &context, &arrays[i]));
-    }
-
-    *out = std::make_shared<RecordBatch>(schema_, metadata_.length(), arrays);
-    return Status::OK();
-  }
-
- private:
-  const RecordBatchMetadata& metadata_;
-  std::shared_ptr<Schema> schema_;
-  int max_recursion_depth_;
-  io::RandomAccessFile* file_;
-};
-
-Status ReadRecordBatch(const RecordBatchMetadata& metadata,
-    const std::shared_ptr<Schema>& schema, io::RandomAccessFile* file,
-    std::shared_ptr<RecordBatch>* out) {
-  return ReadRecordBatch(metadata, schema, kMaxNestingDepth, file, out);
-}
-
-Status ReadRecordBatch(const RecordBatchMetadata& metadata,
-    const std::shared_ptr<Schema>& schema, int max_recursion_depth,
-    io::RandomAccessFile* file, std::shared_ptr<RecordBatch>* out) {
-  RecordBatchReader reader(metadata, schema, max_recursion_depth, file);
-  return reader.Read(out);
-}
-
-Status ReadDictionary(const DictionaryBatchMetadata& metadata,
-    const DictionaryTypeMap& dictionary_types, io::RandomAccessFile* file,
-    std::shared_ptr<Array>* out) {
-  int64_t id = metadata.id();
-  auto it = dictionary_types.find(id);
-  if (it == dictionary_types.end()) {
-    std::stringstream ss;
-    ss << "Do not have type metadata for dictionary with id: " << id;
-    return Status::KeyError(ss.str());
-  }
-
-  std::vector<std::shared_ptr<Field>> fields = {it->second};
-
-  // We need a schema for the record batch
-  auto dummy_schema = std::make_shared<Schema>(fields);
-
-  // The dictionary is embedded in a record batch with a single column
-  std::shared_ptr<RecordBatch> batch;
-  RETURN_NOT_OK(ReadRecordBatch(metadata.record_batch(), dummy_schema, file, &batch));
-
-  if (batch->num_columns() != 1) {
-    return Status::Invalid("Dictionary record batch must only contain one field");
-  }
-
-  *out = batch->column(0);
-  return Status::OK();
-}
-
-}  // namespace ipc
-}  // namespace arrow
diff --git a/cpp/src/arrow/ipc/adapter.h b/cpp/src/arrow/ipc/adapter.h
deleted file mode 100644
index cea4686077486..0000000000000
--- a/cpp/src/arrow/ipc/adapter.h
+++ /dev/null
@@ -1,104 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements. See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership. The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License. You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied. See the License for the
-// specific language governing permissions and limitations
-// under the License.
-
-// Public API for writing and accessing (with zero copy, if possible) Arrow
-// IPC binary formatted data (e.g. in shared memory, or from some other IO source)
-
-#ifndef ARROW_IPC_ADAPTER_H
-#define ARROW_IPC_ADAPTER_H
-
-#include
-#include
-#include
-
-#include "arrow/ipc/metadata.h"
-#include "arrow/loader.h"
-#include "arrow/util/visibility.h"
-
-namespace arrow {
-
-class Array;
-class MemoryPool;
-class RecordBatch;
-class Schema;
-class Status;
-
-namespace io {
-
-class RandomAccessFile;
-class OutputStream;
-
-}  // namespace io
-
-namespace ipc {
-
-// ----------------------------------------------------------------------
-// Write path
-
-// Write the RecordBatch (collection of equal-length Arrow arrays) to the
-// output stream in a contiguous block. The record batch metadata is written as
-// a flatbuffer (see format/Message.fbs -- the RecordBatch message type)
-// prefixed by its size, followed by each of the memory buffers in the batch
-// written end to end (with appropriate alignment and padding):
-//
-//
-// Finally, the absolute offsets (relative to the start of the output stream)
-// to the end of the body and end of the metadata / data header (suffixed by
-// the header size) is returned in out-variables
-//
-// @param(in) buffer_start_offset: the start offset to use in the buffer metadata,
-// default should be 0
-//
-// @param(out) metadata_length: the size of the length-prefixed flatbuffer
-// including padding to a 64-byte boundary
-//
-// @param(out) body_length: the size of the contiguous buffer block plus
-// padding bytes
-Status WriteRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset,
-    io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length,
-    MemoryPool* pool, int max_recursion_depth = kMaxNestingDepth);
-
-// Write Array as a DictionaryBatch message
-Status WriteDictionary(int64_t dictionary_id, const std::shared_ptr<Array>& dictionary,
-    int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length,
-    int64_t* body_length, MemoryPool* pool);
-
-// Compute the precise number of bytes needed in a contiguous memory segment to
-// write the record batch. This involves generating the complete serialized
-// Flatbuffers metadata.
-Status GetRecordBatchSize(const RecordBatch& batch, int64_t* size);
-
-// ----------------------------------------------------------------------
-// "Read" path; does not copy data if the input supports zero copy reads
-
-Status ReadRecordBatch(const RecordBatchMetadata& metadata,
-    const std::shared_ptr<Schema>& schema, io::RandomAccessFile* file,
-    std::shared_ptr<RecordBatch>* out);
-
-Status ReadRecordBatch(const RecordBatchMetadata& metadata,
-    const std::shared_ptr<Schema>& schema, int max_recursion_depth,
-    io::RandomAccessFile* file, std::shared_ptr<RecordBatch>* out);
-
-Status ReadDictionary(const DictionaryBatchMetadata& metadata,
-    const DictionaryTypeMap& dictionary_types, io::RandomAccessFile* file,
-    std::shared_ptr<Array>* out);
-
-}  // namespace ipc
-}  // namespace arrow
-
-#endif  // ARROW_IPC_MEMORY_H
diff --git a/cpp/src/arrow/ipc/api.h b/cpp/src/arrow/ipc/api.h
index ad7cd84e9f986..3f05e69d5843d 100644
--- a/cpp/src/arrow/ipc/api.h
+++ b/cpp/src/arrow/ipc/api.h
@@ -18,7 +18,6 @@
 #ifndef ARROW_IPC_API_H
 #define ARROW_IPC_API_H
 
-#include "arrow/ipc/adapter.h"
 #include "arrow/ipc/feather.h"
 #include "arrow/ipc/json.h"
 #include "arrow/ipc/metadata.h"
diff --git a/cpp/src/arrow/ipc/ipc-adapter-test.cc b/cpp/src/arrow/ipc/ipc-adapter-test.cc
deleted file mode 100644
index 638d98af8244d..0000000000000
--- a/cpp/src/arrow/ipc/ipc-adapter-test.cc
+++ /dev/null
@@ -1,320 +0,0 @@
-// Licensed to the Apache Software Foundation (ASF) under one
-// or more contributor license agreements. See the NOTICE file
-// distributed with this work for additional information
-// regarding copyright ownership. The ASF licenses this file
-// to you under the Apache License, Version 2.0 (the
-// "License"); you may not use this file except in compliance
-// with the License. You may obtain a copy of the License at
-//
-// http://www.apache.org/licenses/LICENSE-2.0
-//
-// Unless required by applicable law or agreed to in writing,
-// software distributed under the License is distributed on an
-// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-// KIND, either express or implied.
See the License for the
-// specific language governing permissions and limitations
-// under the License.
-
-#include
-#include
-#include
-#include
-#include
-#include
-
-#include "gtest/gtest.h"
-
-#include "arrow/io/memory.h"
-#include "arrow/io/test-common.h"
-#include "arrow/ipc/adapter.h"
-#include "arrow/ipc/metadata.h"
-#include "arrow/ipc/test-common.h"
-#include "arrow/ipc/util.h"
-
-#include "arrow/buffer.h"
-#include "arrow/memory_pool.h"
-#include "arrow/pretty_print.h"
-#include "arrow/status.h"
-#include "arrow/test-util.h"
-#include "arrow/util/bit-util.h"
-
-namespace arrow {
-namespace ipc {
-
-class IpcTestFixture : public io::MemoryMapFixture {
- public:
-  Status RoundTripHelper(const RecordBatch& batch, int memory_map_size,
-      std::shared_ptr<RecordBatch>* batch_result) {
-    std::string path = "test-write-row-batch";
-    io::MemoryMapFixture::InitMemoryMap(memory_map_size, path, &mmap_);
-
-    int32_t metadata_length;
-    int64_t body_length;
-
-    const int64_t buffer_offset = 0;
-
-    RETURN_NOT_OK(WriteRecordBatch(
-        batch, buffer_offset, mmap_.get(), &metadata_length, &body_length, pool_));
-
-    std::shared_ptr<Message> message;
-    RETURN_NOT_OK(ReadMessage(0, metadata_length, mmap_.get(), &message));
-    auto metadata = std::make_shared<RecordBatchMetadata>(message);
-
-    // The buffer offsets start at 0, so we must construct a
-    // RandomAccessFile according to that frame of reference
-    std::shared_ptr<Buffer> buffer_payload;
-    RETURN_NOT_OK(mmap_->ReadAt(metadata_length, body_length, &buffer_payload));
-    io::BufferReader buffer_reader(buffer_payload);
-
-    return ReadRecordBatch(*metadata, batch.schema(), &buffer_reader, batch_result);
-  }
-
-  void CheckRoundtrip(const RecordBatch& batch, int64_t buffer_size) {
-    std::shared_ptr<RecordBatch> batch_result;
-
-    ASSERT_OK(RoundTripHelper(batch, 1 << 16, &batch_result));
-    EXPECT_EQ(batch.num_rows(), batch_result->num_rows());
-
-    ASSERT_TRUE(batch.schema()->Equals(batch_result->schema()));
-    ASSERT_EQ(batch.num_columns(), batch_result->num_columns())
-        << batch.schema()->ToString()
-        << " result: " << batch_result->schema()->ToString();
-
-    for (int i = 0; i < batch.num_columns(); ++i) {
-      const auto& left = *batch.column(i);
-      const auto& right = *batch_result->column(i);
-      if (!left.Equals(right)) {
-        std::stringstream pp_result;
-        std::stringstream pp_expected;
-
-        ASSERT_OK(PrettyPrint(left, 0, &pp_expected));
-        ASSERT_OK(PrettyPrint(right, 0, &pp_result));
-
-        FAIL() << "Index: " << i << " Expected: " << pp_expected.str()
-               << "\nGot: " << pp_result.str();
-      }
-    }
-  }
-
-  void CheckRoundtrip(const std::shared_ptr<Array>& array, int64_t buffer_size) {
-    auto f0 = arrow::field("f0", array->type());
-    std::vector<std::shared_ptr<Field>> fields = {f0};
-    auto schema = std::make_shared<Schema>(fields);
-
-    RecordBatch batch(schema, 0, {array});
-    CheckRoundtrip(batch, buffer_size);
-  }
-
- protected:
-  std::shared_ptr<io::MemoryMappedFile> mmap_;
-  MemoryPool* pool_;
-};
-
-class TestWriteRecordBatch : public ::testing::Test, public IpcTestFixture {
- public:
-  void SetUp() { pool_ = default_memory_pool(); }
-  void TearDown() { io::MemoryMapFixture::TearDown(); }
-};
-
-class TestRecordBatchParam : public ::testing::TestWithParam,
-                             public IpcTestFixture {
- public:
-  void SetUp() { pool_ = default_memory_pool(); }
-  void TearDown() { io::MemoryMapFixture::TearDown(); }
-  using IpcTestFixture::RoundTripHelper;
-  using IpcTestFixture::CheckRoundtrip;
-};
-
-TEST_P(TestRecordBatchParam, RoundTrip) {
-  std::shared_ptr<RecordBatch> batch;
-  ASSERT_OK((*GetParam())(&batch));  // NOLINT clang-tidy gtest issue
-
-  CheckRoundtrip(*batch, 1 << 20);
-}
-
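The RoundTripHelper above is the heart of these adapter tests. Distilled, the write-then-read sequence looks like the following sketch; it reuses only the signatures shown in this patch, while `mmap`, `pool`, and `batch` are assumed to be in scope and the fragment would live in a function returning Status:

    // 1. Serialize: length-prefixed flatbuffer metadata, then the body buffers,
    //    with buffer offsets recorded relative to buffer_start_offset = 0.
    int32_t metadata_length;
    int64_t body_length;
    RETURN_NOT_OK(WriteRecordBatch(
        batch, 0, mmap.get(), &metadata_length, &body_length, pool));

    // 2. Recover the metadata message from the start of the stream.
    std::shared_ptr<Message> message;
    RETURN_NOT_OK(ReadMessage(0, metadata_length, mmap.get(), &message));
    auto metadata = std::make_shared<RecordBatchMetadata>(message);

    // 3. Wrap the body so buffer offsets resolve against 0, then rebuild arrays.
    std::shared_ptr<Buffer> body;
    RETURN_NOT_OK(mmap->ReadAt(metadata_length, body_length, &body));
    io::BufferReader body_reader(body);
    std::shared_ptr<RecordBatch> result;
    RETURN_NOT_OK(ReadRecordBatch(*metadata, batch.schema(), &body_reader, &result));
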
-TEST_P(TestRecordBatchParam, SliceRoundTrip) { - std::shared_ptr batch; - ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue - - // Skip the zero-length case - if (batch->num_rows() < 2) { return; } - - auto sliced_batch = batch->Slice(2, 10); - CheckRoundtrip(*sliced_batch, 1 << 20); -} - -TEST_P(TestRecordBatchParam, ZeroLengthArrays) { - std::shared_ptr batch; - ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue - - std::shared_ptr zero_length_batch; - if (batch->num_rows() > 2) { - zero_length_batch = batch->Slice(2, 0); - } else { - zero_length_batch = batch->Slice(0, 0); - } - - CheckRoundtrip(*zero_length_batch, 1 << 20); - - // ARROW-544: check binary array - std::shared_ptr value_offsets; - ASSERT_OK(AllocateBuffer(pool_, sizeof(int32_t), &value_offsets)); - *reinterpret_cast(value_offsets->mutable_data()) = 0; - - std::shared_ptr bin_array = std::make_shared(0, value_offsets, - std::make_shared(nullptr, 0), std::make_shared(nullptr, 0)); - - // null value_offsets - std::shared_ptr bin_array2 = std::make_shared(0, nullptr, nullptr); - - CheckRoundtrip(bin_array, 1 << 20); - CheckRoundtrip(bin_array2, 1 << 20); -} - -INSTANTIATE_TEST_CASE_P( - RoundTripTests, TestRecordBatchParam, - ::testing::Values(&MakeIntRecordBatch, &MakeStringTypesRecordBatch, - &MakeNonNullRecordBatch, &MakeZeroLengthRecordBatch, &MakeListRecordBatch, - &MakeDeeplyNestedList, &MakeStruct, &MakeUnion, &MakeDictionary, &MakeDate, - &MakeTimestamps, &MakeTimes, &MakeFWBinary)); - -void TestGetRecordBatchSize(std::shared_ptr batch) { - ipc::MockOutputStream mock; - int32_t mock_metadata_length = -1; - int64_t mock_body_length = -1; - int64_t size = -1; - ASSERT_OK(WriteRecordBatch( - *batch, 0, &mock, &mock_metadata_length, &mock_body_length, default_memory_pool())); - ASSERT_OK(GetRecordBatchSize(*batch, &size)); - ASSERT_EQ(mock.GetExtentBytesWritten(), size); -} - -TEST_F(TestWriteRecordBatch, IntegerGetRecordBatchSize) { - std::shared_ptr batch; - - ASSERT_OK(MakeIntRecordBatch(&batch)); - TestGetRecordBatchSize(batch); - - ASSERT_OK(MakeListRecordBatch(&batch)); - TestGetRecordBatchSize(batch); - - ASSERT_OK(MakeZeroLengthRecordBatch(&batch)); - TestGetRecordBatchSize(batch); - - ASSERT_OK(MakeNonNullRecordBatch(&batch)); - TestGetRecordBatchSize(batch); - - ASSERT_OK(MakeDeeplyNestedList(&batch)); - TestGetRecordBatchSize(batch); -} - -class RecursionLimits : public ::testing::Test, public io::MemoryMapFixture { - public: - void SetUp() { pool_ = default_memory_pool(); } - void TearDown() { io::MemoryMapFixture::TearDown(); } - - Status WriteToMmap(int recursion_level, bool override_level, int32_t* metadata_length, - int64_t* body_length, std::shared_ptr* batch, - std::shared_ptr* schema) { - const int batch_length = 5; - TypePtr type = int32(); - std::shared_ptr array; - const bool include_nulls = true; - RETURN_NOT_OK(MakeRandomInt32Array(1000, include_nulls, pool_, &array)); - for (int i = 0; i < recursion_level; ++i) { - type = list(type); - RETURN_NOT_OK( - MakeRandomListArray(array, batch_length, include_nulls, pool_, &array)); - } - - auto f0 = field("f0", type); - - *schema = std::shared_ptr(new Schema({f0})); - - std::vector> arrays = {array}; - *batch = std::make_shared(*schema, batch_length, arrays); - - std::string path = "test-write-past-max-recursion"; - const int memory_map_size = 1 << 20; - io::MemoryMapFixture::InitMemoryMap(memory_map_size, path, &mmap_); - - if (override_level) { - return WriteRecordBatch(**batch, 0, mmap_.get(), metadata_length, 
body_length, - pool_, recursion_level + 1); - } else { - return WriteRecordBatch( - **batch, 0, mmap_.get(), metadata_length, body_length, pool_); - } - } - - protected: - std::shared_ptr mmap_; - MemoryPool* pool_; -}; - -TEST_F(RecursionLimits, WriteLimit) { - int32_t metadata_length = -1; - int64_t body_length = -1; - std::shared_ptr schema; - std::shared_ptr batch; - ASSERT_RAISES(Invalid, - WriteToMmap((1 << 8) + 1, false, &metadata_length, &body_length, &batch, &schema)); -} - -TEST_F(RecursionLimits, ReadLimit) { - int32_t metadata_length = -1; - int64_t body_length = -1; - std::shared_ptr schema; - - const int recursion_depth = 64; - - std::shared_ptr batch; - ASSERT_OK(WriteToMmap( - recursion_depth, true, &metadata_length, &body_length, &batch, &schema)); - - std::shared_ptr message; - ASSERT_OK(ReadMessage(0, metadata_length, mmap_.get(), &message)); - auto metadata = std::make_shared(message); - - std::shared_ptr payload; - ASSERT_OK(mmap_->ReadAt(metadata_length, body_length, &payload)); - - io::BufferReader reader(payload); - - std::shared_ptr result; - ASSERT_RAISES(Invalid, ReadRecordBatch(*metadata, schema, &reader, &result)); -} - -TEST_F(RecursionLimits, StressLimit) { - auto CheckDepth = [this](int recursion_depth, bool* it_works) { - int32_t metadata_length = -1; - int64_t body_length = -1; - std::shared_ptr schema; - std::shared_ptr batch; - ASSERT_OK(WriteToMmap( - recursion_depth, true, &metadata_length, &body_length, &batch, &schema)); - - std::shared_ptr message; - ASSERT_OK(ReadMessage(0, metadata_length, mmap_.get(), &message)); - auto metadata = std::make_shared(message); - - std::shared_ptr payload; - ASSERT_OK(mmap_->ReadAt(metadata_length, body_length, &payload)); - - io::BufferReader reader(payload); - - std::shared_ptr result; - ASSERT_OK(ReadRecordBatch(*metadata, schema, recursion_depth + 1, &reader, &result)); - *it_works = result->Equals(*batch); - }; - - bool it_works = false; - CheckDepth(100, &it_works); - ASSERT_TRUE(it_works); - - CheckDepth(500, &it_works); - ASSERT_TRUE(it_works); -} - -} // namespace ipc -} // namespace arrow diff --git a/cpp/src/arrow/ipc/ipc-file-test.cc b/cpp/src/arrow/ipc/ipc-file-test.cc deleted file mode 100644 index b45782220e478..0000000000000 --- a/cpp/src/arrow/ipc/ipc-file-test.cc +++ /dev/null @@ -1,228 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
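The deleted ipc-file-test.cc that begins here exercises the random-access file format. Its round-trip helper (shown below) reduces to this sketch, assuming an open output stream `sink`, a readable `buf_reader` over the same bytes, and a populated `batch`:

    // Write batches through a FileWriter; the footer lands at the end of file.
    std::shared_ptr<FileWriter> writer;
    RETURN_NOT_OK(FileWriter::Open(sink.get(), batch->schema(), &writer));
    RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
    RETURN_NOT_OK(writer->Close());
    RETURN_NOT_OK(sink->Close());

    // The current stream position is the footer offset the reader needs.
    int64_t footer_offset;
    RETURN_NOT_OK(sink->Tell(&footer_offset));

    std::shared_ptr<FileReader> reader;
    RETURN_NOT_OK(FileReader::Open(buf_reader, footer_offset, &reader));
    std::shared_ptr<RecordBatch> chunk;
    RETURN_NOT_OK(reader->GetRecordBatch(0, &chunk));
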
- -#include -#include -#include -#include -#include -#include - -#include "gtest/gtest.h" - -#include "arrow/array.h" -#include "arrow/io/memory.h" -#include "arrow/io/test-common.h" -#include "arrow/ipc/adapter.h" -#include "arrow/ipc/reader.h" -#include "arrow/ipc/test-common.h" -#include "arrow/ipc/util.h" -#include "arrow/ipc/writer.h" - -#include "arrow/buffer.h" -#include "arrow/memory_pool.h" -#include "arrow/status.h" -#include "arrow/test-util.h" -#include "arrow/util/bit-util.h" - -namespace arrow { -namespace ipc { - -void CompareBatch(const RecordBatch& left, const RecordBatch& right) { - if (!left.schema()->Equals(right.schema())) { - FAIL() << "Left schema: " << left.schema()->ToString() - << "\nRight schema: " << right.schema()->ToString(); - } - ASSERT_EQ(left.num_columns(), right.num_columns()) - << left.schema()->ToString() << " result: " << right.schema()->ToString(); - EXPECT_EQ(left.num_rows(), right.num_rows()); - for (int i = 0; i < left.num_columns(); ++i) { - EXPECT_TRUE(left.column(i)->Equals(right.column(i))) - << "Idx: " << i << " Name: " << left.column_name(i); - } -} - -using BatchVector = std::vector>; - -class TestFileFormat : public ::testing::TestWithParam { - public: - void SetUp() { - pool_ = default_memory_pool(); - buffer_ = std::make_shared(pool_); - sink_.reset(new io::BufferOutputStream(buffer_)); - } - void TearDown() {} - - Status RoundTripHelper(const BatchVector& in_batches, BatchVector* out_batches) { - // Write the file - std::shared_ptr writer; - RETURN_NOT_OK(FileWriter::Open(sink_.get(), in_batches[0]->schema(), &writer)); - - const int num_batches = static_cast(in_batches.size()); - - for (const auto& batch : in_batches) { - RETURN_NOT_OK(writer->WriteRecordBatch(*batch)); - } - RETURN_NOT_OK(writer->Close()); - RETURN_NOT_OK(sink_->Close()); - - // Current offset into stream is the end of the file - int64_t footer_offset; - RETURN_NOT_OK(sink_->Tell(&footer_offset)); - - // Open the file - auto buf_reader = std::make_shared(buffer_); - std::shared_ptr reader; - RETURN_NOT_OK(FileReader::Open(buf_reader, footer_offset, &reader)); - - EXPECT_EQ(num_batches, reader->num_record_batches()); - for (int i = 0; i < num_batches; ++i) { - std::shared_ptr chunk; - RETURN_NOT_OK(reader->GetRecordBatch(i, &chunk)); - out_batches->emplace_back(chunk); - } - - return Status::OK(); - } - - protected: - MemoryPool* pool_; - - std::unique_ptr sink_; - std::shared_ptr buffer_; -}; - -TEST_P(TestFileFormat, RoundTrip) { - std::shared_ptr batch1; - std::shared_ptr batch2; - ASSERT_OK((*GetParam())(&batch1)); // NOLINT clang-tidy gtest issue - ASSERT_OK((*GetParam())(&batch2)); // NOLINT clang-tidy gtest issue - - std::vector> in_batches = {batch1, batch2}; - std::vector> out_batches; - - ASSERT_OK(RoundTripHelper(in_batches, &out_batches)); - - // Compare batches - for (size_t i = 0; i < in_batches.size(); ++i) { - CompareBatch(*in_batches[i], *out_batches[i]); - } -} - -class TestStreamFormat : public ::testing::TestWithParam { - public: - void SetUp() { - pool_ = default_memory_pool(); - buffer_ = std::make_shared(pool_); - sink_.reset(new io::BufferOutputStream(buffer_)); - } - void TearDown() {} - - Status RoundTripHelper( - const RecordBatch& batch, std::vector>* out_batches) { - // Write the file - std::shared_ptr writer; - RETURN_NOT_OK(StreamWriter::Open(sink_.get(), batch.schema(), &writer)); - int num_batches = 5; - for (int i = 0; i < num_batches; ++i) { - RETURN_NOT_OK(writer->WriteRecordBatch(batch)); - } - RETURN_NOT_OK(writer->Close()); - 
RETURN_NOT_OK(sink_->Close()); - - // Open the file - auto buf_reader = std::make_shared(buffer_); - - std::shared_ptr reader; - RETURN_NOT_OK(StreamReader::Open(buf_reader, &reader)); - - std::shared_ptr chunk; - while (true) { - RETURN_NOT_OK(reader->GetNextRecordBatch(&chunk)); - if (chunk == nullptr) { break; } - out_batches->emplace_back(chunk); - } - return Status::OK(); - } - - protected: - MemoryPool* pool_; - - std::unique_ptr sink_; - std::shared_ptr buffer_; -}; - -TEST_P(TestStreamFormat, RoundTrip) { - std::shared_ptr batch; - ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue - - std::vector> out_batches; - - ASSERT_OK(RoundTripHelper(*batch, &out_batches)); - - // Compare batches. Same - for (size_t i = 0; i < out_batches.size(); ++i) { - CompareBatch(*batch, *out_batches[i]); - } -} - -#define BATCH_CASES() \ - ::testing::Values(&MakeIntRecordBatch, &MakeListRecordBatch, &MakeNonNullRecordBatch, \ - &MakeZeroLengthRecordBatch, &MakeDeeplyNestedList, &MakeStringTypesRecordBatch, \ - &MakeStruct, &MakeUnion, &MakeDictionary, &MakeDate, &MakeTimestamps, &MakeTimes, \ - &MakeFWBinary); - -INSTANTIATE_TEST_CASE_P(FileRoundTripTests, TestFileFormat, BATCH_CASES()); -INSTANTIATE_TEST_CASE_P(StreamRoundTripTests, TestStreamFormat, BATCH_CASES()); - -void CheckBatchDictionaries(const RecordBatch& batch) { - // Check that dictionaries that should be the same are the same - auto schema = batch.schema(); - - const auto& t0 = static_cast(*schema->field(0)->type); - const auto& t1 = static_cast(*schema->field(1)->type); - - ASSERT_EQ(t0.dictionary().get(), t1.dictionary().get()); - - // Same dictionary used for list values - const auto& t3 = static_cast(*schema->field(3)->type); - const auto& t3_value = static_cast(*t3.value_type()); - ASSERT_EQ(t0.dictionary().get(), t3_value.dictionary().get()); -} - -TEST_F(TestStreamFormat, DictionaryRoundTrip) { - std::shared_ptr batch; - ASSERT_OK(MakeDictionary(&batch)); - - std::vector> out_batches; - ASSERT_OK(RoundTripHelper(*batch, &out_batches)); - - CheckBatchDictionaries(*out_batches[0]); -} - -TEST_F(TestFileFormat, DictionaryRoundTrip) { - std::shared_ptr batch; - ASSERT_OK(MakeDictionary(&batch)); - - std::vector> out_batches; - ASSERT_OK(RoundTripHelper({batch}, &out_batches)); - - CheckBatchDictionaries(*out_batches[0]); -} - -} // namespace ipc -} // namespace arrow diff --git a/cpp/src/arrow/ipc/ipc-metadata-test.cc b/cpp/src/arrow/ipc/ipc-metadata-test.cc deleted file mode 100644 index 4fb3204a5b6d2..0000000000000 --- a/cpp/src/arrow/ipc/ipc-metadata-test.cc +++ /dev/null @@ -1,100 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
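For comparison, the streaming round trip from TestStreamFormat above reduces to this sketch (same assumptions about in-scope names as before):

    std::shared_ptr<StreamWriter> writer;
    RETURN_NOT_OK(StreamWriter::Open(sink.get(), batch->schema(), &writer));
    RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
    RETURN_NOT_OK(writer->Close());
    RETURN_NOT_OK(sink->Close());

    // Unlike the file format there is no footer; batches are pulled until
    // the reader signals end-of-stream with a null batch.
    std::shared_ptr<StreamReader> reader;
    RETURN_NOT_OK(StreamReader::Open(buf_reader, &reader));
    std::shared_ptr<RecordBatch> chunk;
    while (true) {
      RETURN_NOT_OK(reader->GetNextRecordBatch(&chunk));
      if (chunk == nullptr) { break; }
      // ... consume chunk ...
    }
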
- -#include -#include -#include - -#include "gtest/gtest.h" - -#include "arrow/io/memory.h" -#include "arrow/ipc/metadata-internal.h" -#include "arrow/ipc/metadata.h" -#include "arrow/ipc/test-common.h" -#include "arrow/schema.h" -#include "arrow/status.h" -#include "arrow/test-util.h" -#include "arrow/type.h" - -namespace arrow { - -class Buffer; - -namespace ipc { - -class TestSchemaMetadata : public ::testing::Test { - public: - void SetUp() {} - - void CheckRoundtrip(const Schema& schema, DictionaryMemo* memo) { - std::shared_ptr buffer; - ASSERT_OK(WriteSchemaMessage(schema, memo, &buffer)); - - std::shared_ptr message; - ASSERT_OK(Message::Open(buffer, 0, &message)); - - ASSERT_EQ(Message::SCHEMA, message->type()); - - auto schema_msg = std::make_shared(message); - ASSERT_EQ(schema.num_fields(), schema_msg->num_fields()); - - DictionaryMemo empty_memo; - - std::shared_ptr schema2; - ASSERT_OK(schema_msg->GetSchema(empty_memo, &schema2)); - - AssertSchemaEqual(schema, *schema2); - } -}; - -const std::shared_ptr INT32 = std::make_shared(); - -TEST_F(TestSchemaMetadata, PrimitiveFields) { - auto f0 = std::make_shared("f0", std::make_shared()); - auto f1 = std::make_shared("f1", std::make_shared(), false); - auto f2 = std::make_shared("f2", std::make_shared()); - auto f3 = std::make_shared("f3", std::make_shared()); - auto f4 = std::make_shared("f4", std::make_shared()); - auto f5 = std::make_shared("f5", std::make_shared()); - auto f6 = std::make_shared("f6", std::make_shared()); - auto f7 = std::make_shared("f7", std::make_shared()); - auto f8 = std::make_shared("f8", std::make_shared()); - auto f9 = std::make_shared("f9", std::make_shared(), false); - auto f10 = std::make_shared("f10", std::make_shared()); - - Schema schema({f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10}); - DictionaryMemo memo; - - CheckRoundtrip(schema, &memo); -} - -TEST_F(TestSchemaMetadata, NestedFields) { - auto type = std::make_shared(std::make_shared()); - auto f0 = std::make_shared("f0", type); - - std::shared_ptr type2(new StructType({std::make_shared("k1", INT32), - std::make_shared("k2", INT32), std::make_shared("k3", INT32)})); - auto f1 = std::make_shared("f1", type2); - - Schema schema({f0, f1}); - DictionaryMemo memo; - - CheckRoundtrip(schema, &memo); -} - -} // namespace ipc -} // namespace arrow diff --git a/cpp/src/arrow/ipc/ipc-read-write-test.cc b/cpp/src/arrow/ipc/ipc-read-write-test.cc new file mode 100644 index 0000000000000..261ca1d0e52d8 --- /dev/null +++ b/cpp/src/arrow/ipc/ipc-read-write-test.cc @@ -0,0 +1,608 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
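The schema round trip exercised by the deleted ipc-metadata-test.cc above (and carried into the new ipc-read-write-test.cc below) distills to the following sketch. The template argument of make_shared is dropped in the patch text, so SchemaMetadata is an assumption based on this era's metadata.h:

    // Serialize a Schema into a flatbuffer message, then recover it.
    std::shared_ptr<Buffer> buffer;
    ASSERT_OK(WriteSchemaMessage(schema, &memo, &buffer));

    std::shared_ptr<Message> message;
    ASSERT_OK(Message::Open(buffer, 0, &message));
    ASSERT_EQ(Message::SCHEMA, message->type());

    auto schema_msg = std::make_shared<SchemaMetadata>(message);  // type assumed
    DictionaryMemo empty_memo;
    std::shared_ptr<Schema> schema2;
    ASSERT_OK(schema_msg->GetSchema(empty_memo, &schema2));
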
+ +#include +#include +#include +#include +#include +#include + +#include "gtest/gtest.h" + +#include "arrow/array.h" +#include "arrow/io/memory.h" +#include "arrow/io/test-common.h" +#include "arrow/ipc/api.h" +#include "arrow/ipc/test-common.h" +#include "arrow/ipc/util.h" + +#include "arrow/buffer.h" +#include "arrow/memory_pool.h" +#include "arrow/pretty_print.h" +#include "arrow/status.h" +#include "arrow/test-util.h" +#include "arrow/util/bit-util.h" + +namespace arrow { +namespace ipc { + +void CompareBatch(const RecordBatch& left, const RecordBatch& right) { + if (!left.schema()->Equals(right.schema())) { + FAIL() << "Left schema: " << left.schema()->ToString() + << "\nRight schema: " << right.schema()->ToString(); + } + ASSERT_EQ(left.num_columns(), right.num_columns()) + << left.schema()->ToString() << " result: " << right.schema()->ToString(); + EXPECT_EQ(left.num_rows(), right.num_rows()); + for (int i = 0; i < left.num_columns(); ++i) { + EXPECT_TRUE(left.column(i)->Equals(right.column(i))) + << "Idx: " << i << " Name: " << left.column_name(i); + } +} + +using BatchVector = std::vector>; + +class TestSchemaMetadata : public ::testing::Test { + public: + void SetUp() {} + + void CheckRoundtrip(const Schema& schema, DictionaryMemo* memo) { + std::shared_ptr buffer; + ASSERT_OK(WriteSchemaMessage(schema, memo, &buffer)); + + std::shared_ptr message; + ASSERT_OK(Message::Open(buffer, 0, &message)); + + ASSERT_EQ(Message::SCHEMA, message->type()); + + auto schema_msg = std::make_shared(message); + ASSERT_EQ(schema.num_fields(), schema_msg->num_fields()); + + DictionaryMemo empty_memo; + + std::shared_ptr schema2; + ASSERT_OK(schema_msg->GetSchema(empty_memo, &schema2)); + + AssertSchemaEqual(schema, *schema2); + } +}; + +const std::shared_ptr INT32 = std::make_shared(); + +TEST_F(TestSchemaMetadata, PrimitiveFields) { + auto f0 = std::make_shared("f0", std::make_shared()); + auto f1 = std::make_shared("f1", std::make_shared(), false); + auto f2 = std::make_shared("f2", std::make_shared()); + auto f3 = std::make_shared("f3", std::make_shared()); + auto f4 = std::make_shared("f4", std::make_shared()); + auto f5 = std::make_shared("f5", std::make_shared()); + auto f6 = std::make_shared("f6", std::make_shared()); + auto f7 = std::make_shared("f7", std::make_shared()); + auto f8 = std::make_shared("f8", std::make_shared()); + auto f9 = std::make_shared("f9", std::make_shared(), false); + auto f10 = std::make_shared("f10", std::make_shared()); + + Schema schema({f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10}); + DictionaryMemo memo; + + CheckRoundtrip(schema, &memo); +} + +TEST_F(TestSchemaMetadata, NestedFields) { + auto type = std::make_shared(std::make_shared()); + auto f0 = std::make_shared("f0", type); + + std::shared_ptr type2(new StructType({std::make_shared("k1", INT32), + std::make_shared("k2", INT32), std::make_shared("k3", INT32)})); + auto f1 = std::make_shared("f1", type2); + + Schema schema({f0, f1}); + DictionaryMemo memo; + + CheckRoundtrip(schema, &memo); +} + +#define BATCH_CASES() \ + ::testing::Values(&MakeIntRecordBatch, &MakeListRecordBatch, &MakeNonNullRecordBatch, \ + &MakeZeroLengthRecordBatch, &MakeDeeplyNestedList, &MakeStringTypesRecordBatch, \ + &MakeStruct, &MakeUnion, &MakeDictionary, &MakeDate, &MakeTimestamps, &MakeTimes, \ + &MakeFWBinary); + +class IpcTestFixture : public io::MemoryMapFixture { + public: + Status DoStandardRoundTrip(const RecordBatch& batch, bool zero_data, + std::shared_ptr* batch_result) { + int32_t metadata_length; + int64_t 
body_length; + + const int64_t buffer_offset = 0; + + if (zero_data) { RETURN_NOT_OK(ZeroMemoryMap(mmap_.get())); } + RETURN_NOT_OK(mmap_->Seek(0)); + + RETURN_NOT_OK(WriteRecordBatch( + batch, buffer_offset, mmap_.get(), &metadata_length, &body_length, pool_)); + + std::shared_ptr message; + RETURN_NOT_OK(ReadMessage(0, metadata_length, mmap_.get(), &message)); + auto metadata = std::make_shared(message); + + // The buffer offsets start at 0, so we must construct a + // RandomAccessFile according to that frame of reference + std::shared_ptr buffer_payload; + RETURN_NOT_OK(mmap_->ReadAt(metadata_length, body_length, &buffer_payload)); + io::BufferReader buffer_reader(buffer_payload); + + return ReadRecordBatch(*metadata, batch.schema(), &buffer_reader, batch_result); + } + + Status DoLargeRoundTrip( + const RecordBatch& batch, bool zero_data, std::shared_ptr* result) { + int32_t metadata_length; + int64_t body_length; + + const int64_t buffer_offset = 0; + + if (zero_data) { RETURN_NOT_OK(ZeroMemoryMap(mmap_.get())); } + RETURN_NOT_OK(mmap_->Seek(0)); + + RETURN_NOT_OK(WriteLargeRecordBatch( + batch, buffer_offset, mmap_.get(), &metadata_length, &body_length, pool_)); + return ReadLargeRecordBatch(batch.schema(), 0, mmap_.get(), result); + } + + void CheckReadResult(const RecordBatch& result, const RecordBatch& expected) { + EXPECT_EQ(expected.num_rows(), result.num_rows()); + + ASSERT_TRUE(expected.schema()->Equals(result.schema())); + ASSERT_EQ(expected.num_columns(), result.num_columns()) + << expected.schema()->ToString() << " result: " << result.schema()->ToString(); + + for (int i = 0; i < expected.num_columns(); ++i) { + const auto& left = *expected.column(i); + const auto& right = *result.column(i); + if (!left.Equals(right)) { + std::stringstream pp_result; + std::stringstream pp_expected; + + ASSERT_OK(PrettyPrint(left, 0, &pp_expected)); + ASSERT_OK(PrettyPrint(right, 0, &pp_result)); + + FAIL() << "Index: " << i << " Expected: " << pp_expected.str() + << "\nGot: " << pp_result.str(); + } + } + } + + void CheckRoundtrip(const RecordBatch& batch, int64_t buffer_size) { + std::string path = "test-write-row-batch"; + ASSERT_OK(io::MemoryMapFixture::InitMemoryMap(buffer_size, path, &mmap_)); + + std::shared_ptr result; + ASSERT_OK(DoStandardRoundTrip(batch, true, &result)); + CheckReadResult(*result, batch); + + ASSERT_OK(DoLargeRoundTrip(batch, true, &result)); + CheckReadResult(*result, batch); + } + + void CheckRoundtrip(const std::shared_ptr& array, int64_t buffer_size) { + auto f0 = arrow::field("f0", array->type()); + std::vector> fields = {f0}; + auto schema = std::make_shared(fields); + + RecordBatch batch(schema, 0, {array}); + CheckRoundtrip(batch, buffer_size); + } + + protected: + std::shared_ptr mmap_; + MemoryPool* pool_; +}; + +class TestWriteRecordBatch : public ::testing::Test, public IpcTestFixture { + public: + void SetUp() { pool_ = default_memory_pool(); } + void TearDown() { io::MemoryMapFixture::TearDown(); } +}; + +class TestIpcRoundTrip : public ::testing::TestWithParam, + public IpcTestFixture { + public: + void SetUp() { pool_ = default_memory_pool(); } + void TearDown() { io::MemoryMapFixture::TearDown(); } +}; + +TEST_P(TestIpcRoundTrip, RoundTrip) { + std::shared_ptr batch; + ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue + + CheckRoundtrip(*batch, 1 << 20); +} + +TEST_P(TestIpcRoundTrip, SliceRoundTrip) { + std::shared_ptr batch; + ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue + + // Skip the zero-length case + 
if (batch->num_rows() < 2) { return; } + + auto sliced_batch = batch->Slice(2, 10); + CheckRoundtrip(*sliced_batch, 1 << 20); +} + +TEST_P(TestIpcRoundTrip, ZeroLengthArrays) { + std::shared_ptr batch; + ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue + + std::shared_ptr zero_length_batch; + if (batch->num_rows() > 2) { + zero_length_batch = batch->Slice(2, 0); + } else { + zero_length_batch = batch->Slice(0, 0); + } + + CheckRoundtrip(*zero_length_batch, 1 << 20); + + // ARROW-544: check binary array + std::shared_ptr value_offsets; + ASSERT_OK(AllocateBuffer(pool_, sizeof(int32_t), &value_offsets)); + *reinterpret_cast(value_offsets->mutable_data()) = 0; + + std::shared_ptr bin_array = std::make_shared(0, value_offsets, + std::make_shared(nullptr, 0), std::make_shared(nullptr, 0)); + + // null value_offsets + std::shared_ptr bin_array2 = std::make_shared(0, nullptr, nullptr); + + CheckRoundtrip(bin_array, 1 << 20); + CheckRoundtrip(bin_array2, 1 << 20); +} + +void TestGetRecordBatchSize(std::shared_ptr batch) { + ipc::MockOutputStream mock; + int32_t mock_metadata_length = -1; + int64_t mock_body_length = -1; + int64_t size = -1; + ASSERT_OK(WriteRecordBatch( + *batch, 0, &mock, &mock_metadata_length, &mock_body_length, default_memory_pool())); + ASSERT_OK(GetRecordBatchSize(*batch, &size)); + ASSERT_EQ(mock.GetExtentBytesWritten(), size); +} + +TEST_F(TestWriteRecordBatch, IntegerGetRecordBatchSize) { + std::shared_ptr batch; + + ASSERT_OK(MakeIntRecordBatch(&batch)); + TestGetRecordBatchSize(batch); + + ASSERT_OK(MakeListRecordBatch(&batch)); + TestGetRecordBatchSize(batch); + + ASSERT_OK(MakeZeroLengthRecordBatch(&batch)); + TestGetRecordBatchSize(batch); + + ASSERT_OK(MakeNonNullRecordBatch(&batch)); + TestGetRecordBatchSize(batch); + + ASSERT_OK(MakeDeeplyNestedList(&batch)); + TestGetRecordBatchSize(batch); +} + +class RecursionLimits : public ::testing::Test, public io::MemoryMapFixture { + public: + void SetUp() { pool_ = default_memory_pool(); } + void TearDown() { io::MemoryMapFixture::TearDown(); } + + Status WriteToMmap(int recursion_level, bool override_level, int32_t* metadata_length, + int64_t* body_length, std::shared_ptr* batch, + std::shared_ptr* schema) { + const int batch_length = 5; + TypePtr type = int32(); + std::shared_ptr array; + const bool include_nulls = true; + RETURN_NOT_OK(MakeRandomInt32Array(1000, include_nulls, pool_, &array)); + for (int i = 0; i < recursion_level; ++i) { + type = list(type); + RETURN_NOT_OK( + MakeRandomListArray(array, batch_length, include_nulls, pool_, &array)); + } + + auto f0 = field("f0", type); + + *schema = std::shared_ptr(new Schema({f0})); + + std::vector> arrays = {array}; + *batch = std::make_shared(*schema, batch_length, arrays); + + std::string path = "test-write-past-max-recursion"; + const int memory_map_size = 1 << 20; + io::MemoryMapFixture::InitMemoryMap(memory_map_size, path, &mmap_); + + if (override_level) { + return WriteRecordBatch(**batch, 0, mmap_.get(), metadata_length, body_length, + pool_, recursion_level + 1); + } else { + return WriteRecordBatch( + **batch, 0, mmap_.get(), metadata_length, body_length, pool_); + } + } + + protected: + std::shared_ptr mmap_; + MemoryPool* pool_; +}; + +TEST_F(RecursionLimits, WriteLimit) { + int32_t metadata_length = -1; + int64_t body_length = -1; + std::shared_ptr schema; + std::shared_ptr batch; + ASSERT_RAISES(Invalid, + WriteToMmap((1 << 8) + 1, false, &metadata_length, &body_length, &batch, &schema)); +} + +TEST_F(RecursionLimits, ReadLimit) { + 
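+  // Editor's note (assumption): WriteToMmap is allowed to exceed the default
+  // limit via override_level, producing a 64-level nested list. The reader is
+  // then expected to reject it: ReadRecordBatch without an explicit
+  // max_recursion_depth must fail with Status::Invalid rather than recurse
+  // without bound on malformed or hostile metadata.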
int32_t metadata_length = -1; + int64_t body_length = -1; + std::shared_ptr schema; + + const int recursion_depth = 64; + + std::shared_ptr batch; + ASSERT_OK(WriteToMmap( + recursion_depth, true, &metadata_length, &body_length, &batch, &schema)); + + std::shared_ptr message; + ASSERT_OK(ReadMessage(0, metadata_length, mmap_.get(), &message)); + auto metadata = std::make_shared(message); + + std::shared_ptr payload; + ASSERT_OK(mmap_->ReadAt(metadata_length, body_length, &payload)); + + io::BufferReader reader(payload); + + std::shared_ptr result; + ASSERT_RAISES(Invalid, ReadRecordBatch(*metadata, schema, &reader, &result)); +} + +TEST_F(RecursionLimits, StressLimit) { + auto CheckDepth = [this](int recursion_depth, bool* it_works) { + int32_t metadata_length = -1; + int64_t body_length = -1; + std::shared_ptr schema; + std::shared_ptr batch; + ASSERT_OK(WriteToMmap( + recursion_depth, true, &metadata_length, &body_length, &batch, &schema)); + + std::shared_ptr message; + ASSERT_OK(ReadMessage(0, metadata_length, mmap_.get(), &message)); + auto metadata = std::make_shared(message); + + std::shared_ptr payload; + ASSERT_OK(mmap_->ReadAt(metadata_length, body_length, &payload)); + + io::BufferReader reader(payload); + + std::shared_ptr result; + ASSERT_OK(ReadRecordBatch(*metadata, schema, recursion_depth + 1, &reader, &result)); + *it_works = result->Equals(*batch); + }; + + bool it_works = false; + CheckDepth(100, &it_works); + ASSERT_TRUE(it_works); + + CheckDepth(500, &it_works); + ASSERT_TRUE(it_works); +} + +class TestFileFormat : public ::testing::TestWithParam { + public: + void SetUp() { + pool_ = default_memory_pool(); + buffer_ = std::make_shared(pool_); + sink_.reset(new io::BufferOutputStream(buffer_)); + } + void TearDown() {} + + Status RoundTripHelper(const BatchVector& in_batches, BatchVector* out_batches) { + // Write the file + std::shared_ptr writer; + RETURN_NOT_OK(FileWriter::Open(sink_.get(), in_batches[0]->schema(), &writer)); + + const int num_batches = static_cast(in_batches.size()); + + for (const auto& batch : in_batches) { + RETURN_NOT_OK(writer->WriteRecordBatch(*batch)); + } + RETURN_NOT_OK(writer->Close()); + RETURN_NOT_OK(sink_->Close()); + + // Current offset into stream is the end of the file + int64_t footer_offset; + RETURN_NOT_OK(sink_->Tell(&footer_offset)); + + // Open the file + auto buf_reader = std::make_shared(buffer_); + std::shared_ptr reader; + RETURN_NOT_OK(FileReader::Open(buf_reader, footer_offset, &reader)); + + EXPECT_EQ(num_batches, reader->num_record_batches()); + for (int i = 0; i < num_batches; ++i) { + std::shared_ptr chunk; + RETURN_NOT_OK(reader->GetRecordBatch(i, &chunk)); + out_batches->emplace_back(chunk); + } + + return Status::OK(); + } + + protected: + MemoryPool* pool_; + + std::unique_ptr sink_; + std::shared_ptr buffer_; +}; + +TEST_P(TestFileFormat, RoundTrip) { + std::shared_ptr batch1; + std::shared_ptr batch2; + ASSERT_OK((*GetParam())(&batch1)); // NOLINT clang-tidy gtest issue + ASSERT_OK((*GetParam())(&batch2)); // NOLINT clang-tidy gtest issue + + std::vector> in_batches = {batch1, batch2}; + std::vector> out_batches; + + ASSERT_OK(RoundTripHelper(in_batches, &out_batches)); + + // Compare batches + for (size_t i = 0; i < in_batches.size(); ++i) { + CompareBatch(*in_batches[i], *out_batches[i]); + } +} + +class TestStreamFormat : public ::testing::TestWithParam { + public: + void SetUp() { + pool_ = default_memory_pool(); + buffer_ = std::make_shared(pool_); + sink_.reset(new io::BufferOutputStream(buffer_)); + } + 
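+  // TearDown is a no-op below: all state lives in the in-memory buffer
+  // allocated in SetUp, so there is nothing on disk to clean up.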
void TearDown() {} + + Status RoundTripHelper( + const RecordBatch& batch, std::vector>* out_batches) { + // Write the file + std::shared_ptr writer; + RETURN_NOT_OK(StreamWriter::Open(sink_.get(), batch.schema(), &writer)); + int num_batches = 5; + for (int i = 0; i < num_batches; ++i) { + RETURN_NOT_OK(writer->WriteRecordBatch(batch)); + } + RETURN_NOT_OK(writer->Close()); + RETURN_NOT_OK(sink_->Close()); + + // Open the file + auto buf_reader = std::make_shared(buffer_); + + std::shared_ptr reader; + RETURN_NOT_OK(StreamReader::Open(buf_reader, &reader)); + + std::shared_ptr chunk; + while (true) { + RETURN_NOT_OK(reader->GetNextRecordBatch(&chunk)); + if (chunk == nullptr) { break; } + out_batches->emplace_back(chunk); + } + return Status::OK(); + } + + protected: + MemoryPool* pool_; + + std::unique_ptr sink_; + std::shared_ptr buffer_; +}; + +TEST_P(TestStreamFormat, RoundTrip) { + std::shared_ptr batch; + ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue + + std::vector> out_batches; + + ASSERT_OK(RoundTripHelper(*batch, &out_batches)); + + // Compare batches. Same + for (size_t i = 0; i < out_batches.size(); ++i) { + CompareBatch(*batch, *out_batches[i]); + } +} + +INSTANTIATE_TEST_CASE_P(GenericIpcRoundTripTests, TestIpcRoundTrip, BATCH_CASES()); +INSTANTIATE_TEST_CASE_P(FileRoundTripTests, TestFileFormat, BATCH_CASES()); +INSTANTIATE_TEST_CASE_P(StreamRoundTripTests, TestStreamFormat, BATCH_CASES()); + +TEST_F(TestIpcRoundTrip, LargeRecordBatch) { + const int64_t length = static_cast(std::numeric_limits::max()) + 1; + + BooleanBuilder builder(default_memory_pool()); + ASSERT_OK(builder.Reserve(length)); + ASSERT_OK(builder.Advance(length)); + + std::shared_ptr array; + ASSERT_OK(builder.Finish(&array)); + + auto f0 = arrow::field("f0", array->type()); + std::vector> fields = {f0}; + auto schema = std::make_shared(fields); + + RecordBatch batch(schema, 0, {array}); + + std::string path = "test-write-large-record_batch"; + + // 512 MB + constexpr int64_t kBufferSize = 1 << 29; + + ASSERT_OK(io::MemoryMapFixture::InitMemoryMap(kBufferSize, path, &mmap_)); + + std::shared_ptr result; + ASSERT_OK(DoLargeRoundTrip(batch, false, &result)); + CheckReadResult(*result, batch); + + // Fails if we try to write this with the normal code path + ASSERT_RAISES(Invalid, DoStandardRoundTrip(batch, false, &result)); +} + +void CheckBatchDictionaries(const RecordBatch& batch) { + // Check that dictionaries that should be the same are the same + auto schema = batch.schema(); + + const auto& t0 = static_cast(*schema->field(0)->type); + const auto& t1 = static_cast(*schema->field(1)->type); + + ASSERT_EQ(t0.dictionary().get(), t1.dictionary().get()); + + // Same dictionary used for list values + const auto& t3 = static_cast(*schema->field(3)->type); + const auto& t3_value = static_cast(*t3.value_type()); + ASSERT_EQ(t0.dictionary().get(), t3_value.dictionary().get()); +} + +TEST_F(TestStreamFormat, DictionaryRoundTrip) { + std::shared_ptr batch; + ASSERT_OK(MakeDictionary(&batch)); + + std::vector> out_batches; + ASSERT_OK(RoundTripHelper(*batch, &out_batches)); + + CheckBatchDictionaries(*out_batches[0]); +} + +TEST_F(TestFileFormat, DictionaryRoundTrip) { + std::shared_ptr batch; + ASSERT_OK(MakeDictionary(&batch)); + + std::vector> out_batches; + ASSERT_OK(RoundTripHelper({batch}, &out_batches)); + + CheckBatchDictionaries(*out_batches[0]); +} + +} // namespace ipc +} // namespace arrow diff --git a/cpp/src/arrow/ipc/metadata-internal.cc b/cpp/src/arrow/ipc/metadata-internal.cc 
deleted file mode 100644 index be0d282f21bbf..0000000000000 --- a/cpp/src/arrow/ipc/metadata-internal.cc +++ /dev/null @@ -1,597 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#include "arrow/ipc/metadata-internal.h" - -#include -#include -#include -#include -#include - -#include "flatbuffers/flatbuffers.h" - -#include "arrow/array.h" -#include "arrow/buffer.h" -#include "arrow/ipc/Message_generated.h" -#include "arrow/schema.h" -#include "arrow/status.h" -#include "arrow/type.h" - -namespace arrow { - -namespace flatbuf = org::apache::arrow::flatbuf; - -namespace ipc { - -static Status IntFromFlatbuffer( - const flatbuf::Int* int_data, std::shared_ptr* out) { - if (int_data->bitWidth() > 64) { - return Status::NotImplemented("Integers with more than 64 bits not implemented"); - } - if (int_data->bitWidth() < 8) { - return Status::NotImplemented("Integers with less than 8 bits not implemented"); - } - - switch (int_data->bitWidth()) { - case 8: - *out = int_data->is_signed() ? int8() : uint8(); - break; - case 16: - *out = int_data->is_signed() ? int16() : uint16(); - break; - case 32: - *out = int_data->is_signed() ? int32() : uint32(); - break; - case 64: - *out = int_data->is_signed() ? 
int64() : uint64(); - break; - default: - return Status::NotImplemented("Integers not in cstdint are not implemented"); - } - return Status::OK(); -} - -static Status FloatFromFlatuffer( - const flatbuf::FloatingPoint* float_data, std::shared_ptr* out) { - if (float_data->precision() == flatbuf::Precision_HALF) { - *out = float16(); - } else if (float_data->precision() == flatbuf::Precision_SINGLE) { - *out = float32(); - } else { - *out = float64(); - } - return Status::OK(); -} - -// Forward declaration -static Status FieldToFlatbuffer(FBB& fbb, const std::shared_ptr& field, - DictionaryMemo* dictionary_memo, FieldOffset* offset); - -static Offset IntToFlatbuffer(FBB& fbb, int bitWidth, bool is_signed) { - return flatbuf::CreateInt(fbb, bitWidth, is_signed).Union(); -} - -static Offset FloatToFlatbuffer(FBB& fbb, flatbuf::Precision precision) { - return flatbuf::CreateFloatingPoint(fbb, precision).Union(); -} - -static Status AppendChildFields(FBB& fbb, const std::shared_ptr& type, - std::vector* out_children, DictionaryMemo* dictionary_memo) { - FieldOffset field; - for (int i = 0; i < type->num_children(); ++i) { - RETURN_NOT_OK(FieldToFlatbuffer(fbb, type->child(i), dictionary_memo, &field)); - out_children->push_back(field); - } - return Status::OK(); -} - -static Status ListToFlatbuffer(FBB& fbb, const std::shared_ptr& type, - std::vector* out_children, DictionaryMemo* dictionary_memo, - Offset* offset) { - RETURN_NOT_OK(AppendChildFields(fbb, type, out_children, dictionary_memo)); - *offset = flatbuf::CreateList(fbb).Union(); - return Status::OK(); -} - -static Status StructToFlatbuffer(FBB& fbb, const std::shared_ptr& type, - std::vector* out_children, DictionaryMemo* dictionary_memo, - Offset* offset) { - RETURN_NOT_OK(AppendChildFields(fbb, type, out_children, dictionary_memo)); - *offset = flatbuf::CreateStruct_(fbb).Union(); - return Status::OK(); -} - -// ---------------------------------------------------------------------- -// Union implementation - -static Status UnionFromFlatbuffer(const flatbuf::Union* union_data, - const std::vector>& children, std::shared_ptr* out) { - UnionMode mode = union_data->mode() == flatbuf::UnionMode_Sparse ? UnionMode::SPARSE - : UnionMode::DENSE; - - std::vector type_codes; - - const flatbuffers::Vector* fb_type_ids = union_data->typeIds(); - if (fb_type_ids == nullptr) { - for (uint8_t i = 0; i < children.size(); ++i) { - type_codes.push_back(i); - } - } else { - for (int32_t id : (*fb_type_ids)) { - // TODO(wesm): can these values exceed 255? - type_codes.push_back(static_cast(id)); - } - } - - *out = union_(children, type_codes, mode); - return Status::OK(); -} - -static Status UnionToFlatBuffer(FBB& fbb, const std::shared_ptr& type, - std::vector* out_children, DictionaryMemo* dictionary_memo, - Offset* offset) { - RETURN_NOT_OK(AppendChildFields(fbb, type, out_children, dictionary_memo)); - - const auto& union_type = static_cast(*type); - - flatbuf::UnionMode mode = union_type.mode == UnionMode::SPARSE - ? 
flatbuf::UnionMode_Sparse - : flatbuf::UnionMode_Dense; - - std::vector type_ids; - type_ids.reserve(union_type.type_codes.size()); - for (uint8_t code : union_type.type_codes) { - type_ids.push_back(code); - } - - auto fb_type_ids = fbb.CreateVector(type_ids); - - *offset = flatbuf::CreateUnion(fbb, mode, fb_type_ids).Union(); - return Status::OK(); -} - -#define INT_TO_FB_CASE(BIT_WIDTH, IS_SIGNED) \ - *out_type = flatbuf::Type_Int; \ - *offset = IntToFlatbuffer(fbb, BIT_WIDTH, IS_SIGNED); \ - break; - -static inline flatbuf::TimeUnit ToFlatbufferUnit(TimeUnit unit) { - switch (unit) { - case TimeUnit::SECOND: - return flatbuf::TimeUnit_SECOND; - case TimeUnit::MILLI: - return flatbuf::TimeUnit_MILLISECOND; - case TimeUnit::MICRO: - return flatbuf::TimeUnit_MICROSECOND; - case TimeUnit::NANO: - return flatbuf::TimeUnit_NANOSECOND; - default: - break; - } - return flatbuf::TimeUnit_MIN; -} - -static inline TimeUnit FromFlatbufferUnit(flatbuf::TimeUnit unit) { - switch (unit) { - case flatbuf::TimeUnit_SECOND: - return TimeUnit::SECOND; - case flatbuf::TimeUnit_MILLISECOND: - return TimeUnit::MILLI; - case flatbuf::TimeUnit_MICROSECOND: - return TimeUnit::MICRO; - case flatbuf::TimeUnit_NANOSECOND: - return TimeUnit::NANO; - default: - break; - } - // cannot reach - return TimeUnit::SECOND; -} - -static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, - const std::vector>& children, std::shared_ptr* out) { - switch (type) { - case flatbuf::Type_NONE: - return Status::Invalid("Type metadata cannot be none"); - case flatbuf::Type_Int: - return IntFromFlatbuffer(static_cast(type_data), out); - case flatbuf::Type_FloatingPoint: - return FloatFromFlatuffer( - static_cast(type_data), out); - case flatbuf::Type_Binary: - *out = binary(); - return Status::OK(); - case flatbuf::Type_FixedWidthBinary: { - auto fw_binary = static_cast(type_data); - *out = fixed_width_binary(fw_binary->byteWidth()); - return Status::OK(); - } - case flatbuf::Type_Utf8: - *out = utf8(); - return Status::OK(); - case flatbuf::Type_Bool: - *out = boolean(); - return Status::OK(); - case flatbuf::Type_Decimal: - return Status::NotImplemented("Decimal"); - case flatbuf::Type_Date: - *out = date(); - return Status::OK(); - case flatbuf::Type_Time: { - auto time_type = static_cast(type_data); - *out = time(FromFlatbufferUnit(time_type->unit())); - return Status::OK(); - } - case flatbuf::Type_Timestamp: { - auto ts_type = static_cast(type_data); - *out = timestamp(FromFlatbufferUnit(ts_type->unit())); - return Status::OK(); - } - case flatbuf::Type_Interval: - return Status::NotImplemented("Interval"); - case flatbuf::Type_List: - if (children.size() != 1) { - return Status::Invalid("List must have exactly 1 child field"); - } - *out = std::make_shared(children[0]); - return Status::OK(); - case flatbuf::Type_Struct_: - *out = std::make_shared(children); - return Status::OK(); - case flatbuf::Type_Union: - return UnionFromFlatbuffer( - static_cast(type_data), children, out); - default: - return Status::Invalid("Unrecognized type"); - } -} - -// TODO(wesm): Convert this to visitor pattern -static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, - std::vector* children, std::vector* layout, - flatbuf::Type* out_type, DictionaryMemo* dictionary_memo, Offset* offset) { - if (type->type == Type::DICTIONARY) { - // In this library, the dictionary "type" is a logical construct. 
Here we - // pass through to the value type, as we've already captured the index - // type in the DictionaryEncoding metadata in the parent field - const auto& dict_type = static_cast(*type); - return TypeToFlatbuffer(fbb, dict_type.dictionary()->type(), children, layout, - out_type, dictionary_memo, offset); - } - - std::vector buffer_layout = type->GetBufferLayout(); - for (const BufferDescr& descr : buffer_layout) { - flatbuf::VectorType vector_type; - switch (descr.type()) { - case BufferType::OFFSET: - vector_type = flatbuf::VectorType_OFFSET; - break; - case BufferType::DATA: - vector_type = flatbuf::VectorType_DATA; - break; - case BufferType::VALIDITY: - vector_type = flatbuf::VectorType_VALIDITY; - break; - case BufferType::TYPE: - vector_type = flatbuf::VectorType_TYPE; - break; - default: - vector_type = flatbuf::VectorType_DATA; - break; - } - auto offset = flatbuf::CreateVectorLayout( - fbb, static_cast(descr.bit_width()), vector_type); - layout->push_back(offset); - } - - switch (type->type) { - case Type::BOOL: - *out_type = flatbuf::Type_Bool; - *offset = flatbuf::CreateBool(fbb).Union(); - break; - case Type::UINT8: - INT_TO_FB_CASE(8, false); - case Type::INT8: - INT_TO_FB_CASE(8, true); - case Type::UINT16: - INT_TO_FB_CASE(16, false); - case Type::INT16: - INT_TO_FB_CASE(16, true); - case Type::UINT32: - INT_TO_FB_CASE(32, false); - case Type::INT32: - INT_TO_FB_CASE(32, true); - case Type::UINT64: - INT_TO_FB_CASE(64, false); - case Type::INT64: - INT_TO_FB_CASE(64, true); - case Type::FLOAT: - *out_type = flatbuf::Type_FloatingPoint; - *offset = FloatToFlatbuffer(fbb, flatbuf::Precision_SINGLE); - break; - case Type::DOUBLE: - *out_type = flatbuf::Type_FloatingPoint; - *offset = FloatToFlatbuffer(fbb, flatbuf::Precision_DOUBLE); - break; - case Type::FIXED_WIDTH_BINARY: { - const auto& fw_type = static_cast(*type); - *out_type = flatbuf::Type_FixedWidthBinary; - *offset = flatbuf::CreateFixedWidthBinary(fbb, fw_type.byte_width()).Union(); - } break; - case Type::BINARY: - *out_type = flatbuf::Type_Binary; - *offset = flatbuf::CreateBinary(fbb).Union(); - break; - case Type::STRING: - *out_type = flatbuf::Type_Utf8; - *offset = flatbuf::CreateUtf8(fbb).Union(); - break; - case Type::DATE: - *out_type = flatbuf::Type_Date; - *offset = flatbuf::CreateDate(fbb).Union(); - break; - case Type::TIME: { - const auto& time_type = static_cast(*type); - *out_type = flatbuf::Type_Time; - *offset = flatbuf::CreateTime(fbb, ToFlatbufferUnit(time_type.unit)).Union(); - } break; - case Type::TIMESTAMP: { - const auto& ts_type = static_cast(*type); - *out_type = flatbuf::Type_Timestamp; - *offset = flatbuf::CreateTimestamp(fbb, ToFlatbufferUnit(ts_type.unit)).Union(); - } break; - case Type::LIST: - *out_type = flatbuf::Type_List; - return ListToFlatbuffer(fbb, type, children, dictionary_memo, offset); - case Type::STRUCT: - *out_type = flatbuf::Type_Struct_; - return StructToFlatbuffer(fbb, type, children, dictionary_memo, offset); - case Type::UNION: - *out_type = flatbuf::Type_Union; - return UnionToFlatBuffer(fbb, type, children, dictionary_memo, offset); - default: - *out_type = flatbuf::Type_NONE; // Make clang-tidy happy - std::stringstream ss; - ss << "Unable to convert type: " << type->ToString() << std::endl; - return Status::NotImplemented(ss.str()); - } - return Status::OK(); -} - -using DictionaryOffset = flatbuffers::Offset; - -static DictionaryOffset GetDictionaryEncoding( - FBB& fbb, const DictionaryType& type, DictionaryMemo* memo) { - int64_t dictionary_id = 
memo->GetId(type.dictionary()); - - // We assume that the dictionary index type (as an integer) has already been - // validated elsewhere, and can safely assume we are dealing with signed - // integers - const auto& fw_index_type = static_cast(*type.index_type()); - - auto index_type_offset = flatbuf::CreateInt(fbb, fw_index_type.bit_width(), true); - - // TODO(wesm): ordered dictionaries - return flatbuf::CreateDictionaryEncoding(fbb, dictionary_id, index_type_offset); -} - -static Status FieldToFlatbuffer(FBB& fbb, const std::shared_ptr& field, - DictionaryMemo* dictionary_memo, FieldOffset* offset) { - auto fb_name = fbb.CreateString(field->name); - - flatbuf::Type type_enum; - Offset type_offset; - Offset type_layout; - std::vector children; - std::vector layout; - - RETURN_NOT_OK(TypeToFlatbuffer( - fbb, field->type, &children, &layout, &type_enum, dictionary_memo, &type_offset)); - auto fb_children = fbb.CreateVector(children); - auto fb_layout = fbb.CreateVector(layout); - - DictionaryOffset dictionary = 0; - if (field->type->type == Type::DICTIONARY) { - dictionary = GetDictionaryEncoding( - fbb, static_cast(*field->type), dictionary_memo); - } - - // TODO: produce the list of VectorTypes - *offset = flatbuf::CreateField(fbb, fb_name, field->nullable, type_enum, type_offset, - dictionary, fb_children, fb_layout); - - return Status::OK(); -} - -Status FieldFromFlatbufferDictionary( - const flatbuf::Field* field, std::shared_ptr* out) { - // Need an empty memo to pass down for constructing children - DictionaryMemo dummy_memo; - - // Any DictionaryEncoding set is ignored here - - std::shared_ptr type; - auto children = field->children(); - std::vector> child_fields(children->size()); - for (int i = 0; i < static_cast(children->size()); ++i) { - RETURN_NOT_OK(FieldFromFlatbuffer(children->Get(i), dummy_memo, &child_fields[i])); - } - - RETURN_NOT_OK( - TypeFromFlatbuffer(field->type_type(), field->type(), child_fields, &type)); - - *out = std::make_shared(field->name()->str(), type, field->nullable()); - return Status::OK(); -} - -Status FieldFromFlatbuffer(const flatbuf::Field* field, - const DictionaryMemo& dictionary_memo, std::shared_ptr* out) { - std::shared_ptr type; - - const flatbuf::DictionaryEncoding* encoding = field->dictionary(); - - if (encoding == nullptr) { - // The field is not dictionary encoded. We must potentially visit its - // children to fully reconstruct the data type - auto children = field->children(); - std::vector> child_fields(children->size()); - for (int i = 0; i < static_cast(children->size()); ++i) { - RETURN_NOT_OK( - FieldFromFlatbuffer(children->Get(i), dictionary_memo, &child_fields[i])); - } - RETURN_NOT_OK( - TypeFromFlatbuffer(field->type_type(), field->type(), child_fields, &type)); - } else { - // The field is dictionary encoded. The type of the dictionary values has - // been determined elsewhere, and is stored in the DictionaryMemo. Here we - // construct the logical DictionaryType object - - std::shared_ptr dictionary; - RETURN_NOT_OK(dictionary_memo.GetDictionary(encoding->id(), &dictionary)); - - std::shared_ptr index_type; - RETURN_NOT_OK(IntFromFlatbuffer(encoding->indexType(), &index_type)); - type = std::make_shared(index_type, dictionary); - } - *out = std::make_shared(field->name()->str(), type, field->nullable()); - return Status::OK(); -} - -// Implement MessageBuilder - -// will return the endianness of the system we are running on -// based the NUMPY_API function. 
See NOTICE.txt -flatbuf::Endianness endianness() { - union { - uint32_t i; - char c[4]; - } bint = {0x01020304}; - - return bint.c[0] == 1 ? flatbuf::Endianness_Big : flatbuf::Endianness_Little; -} - -Status SchemaToFlatbuffer(FBB& fbb, const Schema& schema, DictionaryMemo* dictionary_memo, - flatbuffers::Offset* out) { - std::vector field_offsets; - for (int i = 0; i < schema.num_fields(); ++i) { - std::shared_ptr field = schema.field(i); - FieldOffset offset; - RETURN_NOT_OK(FieldToFlatbuffer(fbb, field, dictionary_memo, &offset)); - field_offsets.push_back(offset); - } - - *out = flatbuf::CreateSchema(fbb, endianness(), fbb.CreateVector(field_offsets)); - return Status::OK(); -} - -class MessageBuilder { - public: - Status SetSchema(const Schema& schema, DictionaryMemo* dictionary_memo) { - flatbuffers::Offset fb_schema; - RETURN_NOT_OK(SchemaToFlatbuffer(fbb_, schema, dictionary_memo, &fb_schema)); - - header_type_ = flatbuf::MessageHeader_Schema; - header_ = fb_schema.Union(); - body_length_ = 0; - return Status::OK(); - } - - Status SetRecordBatch(int32_t length, int64_t body_length, - const std::vector& nodes, - const std::vector& buffers) { - header_type_ = flatbuf::MessageHeader_RecordBatch; - header_ = flatbuf::CreateRecordBatch(fbb_, length, fbb_.CreateVectorOfStructs(nodes), - fbb_.CreateVectorOfStructs(buffers)) - .Union(); - body_length_ = body_length; - - return Status::OK(); - } - - Status SetDictionary(int64_t id, int32_t length, int64_t body_length, - const std::vector& nodes, - const std::vector& buffers) { - header_type_ = flatbuf::MessageHeader_DictionaryBatch; - - auto record_batch = flatbuf::CreateRecordBatch(fbb_, length, - fbb_.CreateVectorOfStructs(nodes), fbb_.CreateVectorOfStructs(buffers)); - - header_ = flatbuf::CreateDictionaryBatch(fbb_, id, record_batch).Union(); - body_length_ = body_length; - return Status::OK(); - } - - Status Finish(); - - Status GetBuffer(std::shared_ptr* out); - - private: - flatbuf::MessageHeader header_type_; - flatbuffers::Offset header_; - int64_t body_length_; - flatbuffers::FlatBufferBuilder fbb_; -}; - -Status WriteSchemaMessage( - const Schema& schema, DictionaryMemo* dictionary_memo, std::shared_ptr* out) { - MessageBuilder message; - RETURN_NOT_OK(message.SetSchema(schema, dictionary_memo)); - RETURN_NOT_OK(message.Finish()); - return message.GetBuffer(out); -} - -Status WriteRecordBatchMessage(int32_t length, int64_t body_length, - const std::vector& nodes, - const std::vector& buffers, std::shared_ptr* out) { - MessageBuilder builder; - RETURN_NOT_OK(builder.SetRecordBatch(length, body_length, nodes, buffers)); - RETURN_NOT_OK(builder.Finish()); - return builder.GetBuffer(out); -} - -Status WriteDictionaryMessage(int64_t id, int32_t length, int64_t body_length, - const std::vector& nodes, - const std::vector& buffers, std::shared_ptr* out) { - MessageBuilder builder; - RETURN_NOT_OK(builder.SetDictionary(id, length, body_length, nodes, buffers)); - RETURN_NOT_OK(builder.Finish()); - return builder.GetBuffer(out); -} - -Status MessageBuilder::Finish() { - auto message = - flatbuf::CreateMessage(fbb_, kMetadataVersion, header_type_, header_, body_length_); - fbb_.Finish(message); - return Status::OK(); -} - -Status MessageBuilder::GetBuffer(std::shared_ptr* out) { - int32_t size = fbb_.GetSize(); - - auto result = std::make_shared(); - RETURN_NOT_OK(result->Resize(size)); - - uint8_t* dst = result->mutable_data(); - memcpy(dst, fbb_.GetBufferPointer(), size); - - *out = result; - return Status::OK(); -} - -} // namespace ipc 
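[Annotation] The endianness() helper deleted above (and re-added later in this patch) uses the classic union trick: store a known 32-bit pattern and inspect the lowest-addressed byte. A self-contained sketch of the same idea; note that C++20's std::endian would be the modern alternative:

#include <cstdint>
#include <cstdio>

// Detect byte order the same way as the endianness() helper:
// write 0x01020304 into a 4-byte union and look at the first byte.
static bool IsBigEndian() {
  union {
    uint32_t i;
    char c[4];
  } bint = {0x01020304};
  // On big-endian hardware the most significant byte (0x01) comes first.
  return bint.c[0] == 0x01;
}

int main() {
  std::printf("big-endian: %s\n", IsBigEndian() ? "yes" : "no");
  return 0;
}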
-} // namespace arrow diff --git a/cpp/src/arrow/ipc/metadata-internal.h b/cpp/src/arrow/ipc/metadata-internal.h deleted file mode 100644 index 59afecbcbd27e..0000000000000 --- a/cpp/src/arrow/ipc/metadata-internal.h +++ /dev/null @@ -1,83 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#ifndef ARROW_IPC_METADATA_INTERNAL_H -#define ARROW_IPC_METADATA_INTERNAL_H - -#include -#include -#include - -#include "flatbuffers/flatbuffers.h" - -#include "arrow/ipc/File_generated.h" -#include "arrow/ipc/Message_generated.h" -#include "arrow/ipc/metadata.h" - -namespace arrow { - -namespace flatbuf = org::apache::arrow::flatbuf; - -class Buffer; -struct Field; -class Schema; -class Status; - -namespace ipc { - -using FBB = flatbuffers::FlatBufferBuilder; -using FieldOffset = flatbuffers::Offset; -using VectorLayoutOffset = flatbuffers::Offset; -using Offset = flatbuffers::Offset; - -static constexpr flatbuf::MetadataVersion kMetadataVersion = flatbuf::MetadataVersion_V2; - -// Construct a field with type for a dictionary-encoded field. None of its -// children or children's descendents can be dictionary encoded -Status FieldFromFlatbufferDictionary( - const flatbuf::Field* field, std::shared_ptr* out); - -// Construct a field for a non-dictionary-encoded field. 
Its children may be -// dictionary encoded -Status FieldFromFlatbuffer(const flatbuf::Field* field, - const DictionaryMemo& dictionary_memo, std::shared_ptr* out); - -Status SchemaToFlatbuffer(FBB& fbb, const Schema& schema, DictionaryMemo* dictionary_memo, - flatbuffers::Offset* out); - -// Serialize arrow::Schema as a Flatbuffer -// -// \param[in] schema a Schema instance -// \param[inout] dictionary_memo class for tracking dictionaries and assigning -// dictionary ids -// \param[out] out the serialized arrow::Buffer -// \return Status outcome -Status WriteSchemaMessage( - const Schema& schema, DictionaryMemo* dictionary_memo, std::shared_ptr* out); - -Status WriteRecordBatchMessage(int32_t length, int64_t body_length, - const std::vector& nodes, - const std::vector& buffers, std::shared_ptr* out); - -Status WriteDictionaryMessage(int64_t id, int32_t length, int64_t body_length, - const std::vector& nodes, - const std::vector& buffers, std::shared_ptr* out); - -} // namespace ipc -} // namespace arrow - -#endif // ARROW_IPC_METADATA_INTERNAL_H diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index 71bc5c9eb3207..a418d4893dd40 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -24,14 +24,14 @@ #include "flatbuffers/flatbuffers.h" +#include "arrow/array.h" +#include "arrow/buffer.h" #include "arrow/io/interfaces.h" #include "arrow/ipc/File_generated.h" #include "arrow/ipc/Message_generated.h" -#include "arrow/ipc/metadata-internal.h" - -#include "arrow/buffer.h" #include "arrow/schema.h" #include "arrow/status.h" +#include "arrow/type.h" namespace arrow { @@ -39,6 +39,643 @@ namespace flatbuf = org::apache::arrow::flatbuf; namespace ipc { +using FBB = flatbuffers::FlatBufferBuilder; +using DictionaryOffset = flatbuffers::Offset; +using FieldOffset = flatbuffers::Offset; +using LargeRecordBatchOffset = flatbuffers::Offset; +using RecordBatchOffset = flatbuffers::Offset; +using VectorLayoutOffset = flatbuffers::Offset; +using Offset = flatbuffers::Offset; + +static constexpr flatbuf::MetadataVersion kMetadataVersion = flatbuf::MetadataVersion_V2; + +static Status IntFromFlatbuffer( + const flatbuf::Int* int_data, std::shared_ptr* out) { + if (int_data->bitWidth() > 64) { + return Status::NotImplemented("Integers with more than 64 bits not implemented"); + } + if (int_data->bitWidth() < 8) { + return Status::NotImplemented("Integers with less than 8 bits not implemented"); + } + + switch (int_data->bitWidth()) { + case 8: + *out = int_data->is_signed() ? int8() : uint8(); + break; + case 16: + *out = int_data->is_signed() ? int16() : uint16(); + break; + case 32: + *out = int_data->is_signed() ? int32() : uint32(); + break; + case 64: + *out = int_data->is_signed() ? 
int64() : uint64(); + break; + default: + return Status::NotImplemented("Integers not in cstdint are not implemented"); + } + return Status::OK(); +} + +static Status FloatFromFlatuffer( + const flatbuf::FloatingPoint* float_data, std::shared_ptr* out) { + if (float_data->precision() == flatbuf::Precision_HALF) { + *out = float16(); + } else if (float_data->precision() == flatbuf::Precision_SINGLE) { + *out = float32(); + } else { + *out = float64(); + } + return Status::OK(); +} + +// Forward declaration +static Status FieldToFlatbuffer(FBB& fbb, const std::shared_ptr& field, + DictionaryMemo* dictionary_memo, FieldOffset* offset); + +static Offset IntToFlatbuffer(FBB& fbb, int bitWidth, bool is_signed) { + return flatbuf::CreateInt(fbb, bitWidth, is_signed).Union(); +} + +static Offset FloatToFlatbuffer(FBB& fbb, flatbuf::Precision precision) { + return flatbuf::CreateFloatingPoint(fbb, precision).Union(); +} + +static Status AppendChildFields(FBB& fbb, const std::shared_ptr& type, + std::vector* out_children, DictionaryMemo* dictionary_memo) { + FieldOffset field; + for (int i = 0; i < type->num_children(); ++i) { + RETURN_NOT_OK(FieldToFlatbuffer(fbb, type->child(i), dictionary_memo, &field)); + out_children->push_back(field); + } + return Status::OK(); +} + +static Status ListToFlatbuffer(FBB& fbb, const std::shared_ptr& type, + std::vector* out_children, DictionaryMemo* dictionary_memo, + Offset* offset) { + RETURN_NOT_OK(AppendChildFields(fbb, type, out_children, dictionary_memo)); + *offset = flatbuf::CreateList(fbb).Union(); + return Status::OK(); +} + +static Status StructToFlatbuffer(FBB& fbb, const std::shared_ptr& type, + std::vector* out_children, DictionaryMemo* dictionary_memo, + Offset* offset) { + RETURN_NOT_OK(AppendChildFields(fbb, type, out_children, dictionary_memo)); + *offset = flatbuf::CreateStruct_(fbb).Union(); + return Status::OK(); +} + +// ---------------------------------------------------------------------- +// Union implementation + +static Status UnionFromFlatbuffer(const flatbuf::Union* union_data, + const std::vector>& children, std::shared_ptr* out) { + UnionMode mode = union_data->mode() == flatbuf::UnionMode_Sparse ? UnionMode::SPARSE + : UnionMode::DENSE; + + std::vector type_codes; + + const flatbuffers::Vector* fb_type_ids = union_data->typeIds(); + if (fb_type_ids == nullptr) { + for (uint8_t i = 0; i < children.size(); ++i) { + type_codes.push_back(i); + } + } else { + for (int32_t id : (*fb_type_ids)) { + // TODO(wesm): can these values exceed 255? + type_codes.push_back(static_cast(id)); + } + } + + *out = union_(children, type_codes, mode); + return Status::OK(); +} + +static Status UnionToFlatBuffer(FBB& fbb, const std::shared_ptr& type, + std::vector* out_children, DictionaryMemo* dictionary_memo, + Offset* offset) { + RETURN_NOT_OK(AppendChildFields(fbb, type, out_children, dictionary_memo)); + + const auto& union_type = static_cast(*type); + + flatbuf::UnionMode mode = union_type.mode == UnionMode::SPARSE + ? 
flatbuf::UnionMode_Sparse + : flatbuf::UnionMode_Dense; + + std::vector type_ids; + type_ids.reserve(union_type.type_codes.size()); + for (uint8_t code : union_type.type_codes) { + type_ids.push_back(code); + } + + auto fb_type_ids = fbb.CreateVector(type_ids); + + *offset = flatbuf::CreateUnion(fbb, mode, fb_type_ids).Union(); + return Status::OK(); +} + +#define INT_TO_FB_CASE(BIT_WIDTH, IS_SIGNED) \ + *out_type = flatbuf::Type_Int; \ + *offset = IntToFlatbuffer(fbb, BIT_WIDTH, IS_SIGNED); \ + break; + +static inline flatbuf::TimeUnit ToFlatbufferUnit(TimeUnit unit) { + switch (unit) { + case TimeUnit::SECOND: + return flatbuf::TimeUnit_SECOND; + case TimeUnit::MILLI: + return flatbuf::TimeUnit_MILLISECOND; + case TimeUnit::MICRO: + return flatbuf::TimeUnit_MICROSECOND; + case TimeUnit::NANO: + return flatbuf::TimeUnit_NANOSECOND; + default: + break; + } + return flatbuf::TimeUnit_MIN; +} + +static inline TimeUnit FromFlatbufferUnit(flatbuf::TimeUnit unit) { + switch (unit) { + case flatbuf::TimeUnit_SECOND: + return TimeUnit::SECOND; + case flatbuf::TimeUnit_MILLISECOND: + return TimeUnit::MILLI; + case flatbuf::TimeUnit_MICROSECOND: + return TimeUnit::MICRO; + case flatbuf::TimeUnit_NANOSECOND: + return TimeUnit::NANO; + default: + break; + } + // cannot reach + return TimeUnit::SECOND; +} + +static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, + const std::vector>& children, std::shared_ptr* out) { + switch (type) { + case flatbuf::Type_NONE: + return Status::Invalid("Type metadata cannot be none"); + case flatbuf::Type_Int: + return IntFromFlatbuffer(static_cast(type_data), out); + case flatbuf::Type_FloatingPoint: + return FloatFromFlatuffer( + static_cast(type_data), out); + case flatbuf::Type_Binary: + *out = binary(); + return Status::OK(); + case flatbuf::Type_FixedWidthBinary: { + auto fw_binary = static_cast(type_data); + *out = fixed_width_binary(fw_binary->byteWidth()); + return Status::OK(); + } + case flatbuf::Type_Utf8: + *out = utf8(); + return Status::OK(); + case flatbuf::Type_Bool: + *out = boolean(); + return Status::OK(); + case flatbuf::Type_Decimal: + return Status::NotImplemented("Decimal"); + case flatbuf::Type_Date: + *out = date(); + return Status::OK(); + case flatbuf::Type_Time: { + auto time_type = static_cast(type_data); + *out = time(FromFlatbufferUnit(time_type->unit())); + return Status::OK(); + } + case flatbuf::Type_Timestamp: { + auto ts_type = static_cast(type_data); + *out = timestamp(FromFlatbufferUnit(ts_type->unit())); + return Status::OK(); + } + case flatbuf::Type_Interval: + return Status::NotImplemented("Interval"); + case flatbuf::Type_List: + if (children.size() != 1) { + return Status::Invalid("List must have exactly 1 child field"); + } + *out = std::make_shared(children[0]); + return Status::OK(); + case flatbuf::Type_Struct_: + *out = std::make_shared(children); + return Status::OK(); + case flatbuf::Type_Union: + return UnionFromFlatbuffer( + static_cast(type_data), children, out); + default: + return Status::Invalid("Unrecognized type"); + } +} + +// TODO(wesm): Convert this to visitor pattern +static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, + std::vector* children, std::vector* layout, + flatbuf::Type* out_type, DictionaryMemo* dictionary_memo, Offset* offset) { + if (type->type == Type::DICTIONARY) { + // In this library, the dictionary "type" is a logical construct. 
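[Annotation] ToFlatbufferUnit and FromFlatbufferUnit above are inverse enum maps, and a cheap way to keep such pairs in sync is a round-trip assertion over every enumerator. A minimal sketch with stand-in enums (Unit and FbUnit are hypothetical names, not the real arrow/flatbuf types):

#include <cassert>

enum class Unit { SECOND, MILLI, MICRO, NANO };
enum class FbUnit { SECOND, MILLISECOND, MICROSECOND, NANOSECOND };

static FbUnit ToFb(Unit u) {
  switch (u) {
    case Unit::SECOND: return FbUnit::SECOND;
    case Unit::MILLI:  return FbUnit::MILLISECOND;
    case Unit::MICRO:  return FbUnit::MICROSECOND;
    case Unit::NANO:   return FbUnit::NANOSECOND;
  }
  return FbUnit::SECOND;  // unreachable
}

static Unit FromFb(FbUnit u) {
  switch (u) {
    case FbUnit::SECOND:      return Unit::SECOND;
    case FbUnit::MILLISECOND: return Unit::MILLI;
    case FbUnit::MICROSECOND: return Unit::MICRO;
    case FbUnit::NANOSECOND:  return Unit::NANO;
  }
  return Unit::SECOND;  // unreachable
}

int main() {
  // Round-trip every enumerator; a mismatch means the two maps diverged.
  for (Unit u : {Unit::SECOND, Unit::MILLI, Unit::MICRO, Unit::NANO}) {
    assert(FromFb(ToFb(u)) == u);
  }
  return 0;
}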
Here we + // pass through to the value type, as we've already captured the index + // type in the DictionaryEncoding metadata in the parent field + const auto& dict_type = static_cast(*type); + return TypeToFlatbuffer(fbb, dict_type.dictionary()->type(), children, layout, + out_type, dictionary_memo, offset); + } + + std::vector buffer_layout = type->GetBufferLayout(); + for (const BufferDescr& descr : buffer_layout) { + flatbuf::VectorType vector_type; + switch (descr.type()) { + case BufferType::OFFSET: + vector_type = flatbuf::VectorType_OFFSET; + break; + case BufferType::DATA: + vector_type = flatbuf::VectorType_DATA; + break; + case BufferType::VALIDITY: + vector_type = flatbuf::VectorType_VALIDITY; + break; + case BufferType::TYPE: + vector_type = flatbuf::VectorType_TYPE; + break; + default: + vector_type = flatbuf::VectorType_DATA; + break; + } + auto offset = flatbuf::CreateVectorLayout( + fbb, static_cast(descr.bit_width()), vector_type); + layout->push_back(offset); + } + + switch (type->type) { + case Type::BOOL: + *out_type = flatbuf::Type_Bool; + *offset = flatbuf::CreateBool(fbb).Union(); + break; + case Type::UINT8: + INT_TO_FB_CASE(8, false); + case Type::INT8: + INT_TO_FB_CASE(8, true); + case Type::UINT16: + INT_TO_FB_CASE(16, false); + case Type::INT16: + INT_TO_FB_CASE(16, true); + case Type::UINT32: + INT_TO_FB_CASE(32, false); + case Type::INT32: + INT_TO_FB_CASE(32, true); + case Type::UINT64: + INT_TO_FB_CASE(64, false); + case Type::INT64: + INT_TO_FB_CASE(64, true); + case Type::FLOAT: + *out_type = flatbuf::Type_FloatingPoint; + *offset = FloatToFlatbuffer(fbb, flatbuf::Precision_SINGLE); + break; + case Type::DOUBLE: + *out_type = flatbuf::Type_FloatingPoint; + *offset = FloatToFlatbuffer(fbb, flatbuf::Precision_DOUBLE); + break; + case Type::FIXED_WIDTH_BINARY: { + const auto& fw_type = static_cast(*type); + *out_type = flatbuf::Type_FixedWidthBinary; + *offset = flatbuf::CreateFixedWidthBinary(fbb, fw_type.byte_width()).Union(); + } break; + case Type::BINARY: + *out_type = flatbuf::Type_Binary; + *offset = flatbuf::CreateBinary(fbb).Union(); + break; + case Type::STRING: + *out_type = flatbuf::Type_Utf8; + *offset = flatbuf::CreateUtf8(fbb).Union(); + break; + case Type::DATE: + *out_type = flatbuf::Type_Date; + *offset = flatbuf::CreateDate(fbb).Union(); + break; + case Type::TIME: { + const auto& time_type = static_cast(*type); + *out_type = flatbuf::Type_Time; + *offset = flatbuf::CreateTime(fbb, ToFlatbufferUnit(time_type.unit)).Union(); + } break; + case Type::TIMESTAMP: { + const auto& ts_type = static_cast(*type); + *out_type = flatbuf::Type_Timestamp; + *offset = flatbuf::CreateTimestamp(fbb, ToFlatbufferUnit(ts_type.unit)).Union(); + } break; + case Type::LIST: + *out_type = flatbuf::Type_List; + return ListToFlatbuffer(fbb, type, children, dictionary_memo, offset); + case Type::STRUCT: + *out_type = flatbuf::Type_Struct_; + return StructToFlatbuffer(fbb, type, children, dictionary_memo, offset); + case Type::UNION: + *out_type = flatbuf::Type_Union; + return UnionToFlatBuffer(fbb, type, children, dictionary_memo, offset); + default: + *out_type = flatbuf::Type_NONE; // Make clang-tidy happy + std::stringstream ss; + ss << "Unable to convert type: " << type->ToString() << std::endl; + return Status::NotImplemented(ss.str()); + } + return Status::OK(); +} + +static DictionaryOffset GetDictionaryEncoding( + FBB& fbb, const DictionaryType& type, DictionaryMemo* memo) { + int64_t dictionary_id = memo->GetId(type.dictionary()); + + // We assume that the 
dictionary index type (as an integer) has already been + // validated elsewhere, and can safely assume we are dealing with signed + // integers + const auto& fw_index_type = static_cast(*type.index_type()); + + auto index_type_offset = flatbuf::CreateInt(fbb, fw_index_type.bit_width(), true); + + // TODO(wesm): ordered dictionaries + return flatbuf::CreateDictionaryEncoding(fbb, dictionary_id, index_type_offset); +} + +static Status FieldToFlatbuffer(FBB& fbb, const std::shared_ptr& field, + DictionaryMemo* dictionary_memo, FieldOffset* offset) { + auto fb_name = fbb.CreateString(field->name); + + flatbuf::Type type_enum; + Offset type_offset; + Offset type_layout; + std::vector children; + std::vector layout; + + RETURN_NOT_OK(TypeToFlatbuffer( + fbb, field->type, &children, &layout, &type_enum, dictionary_memo, &type_offset)); + auto fb_children = fbb.CreateVector(children); + auto fb_layout = fbb.CreateVector(layout); + + DictionaryOffset dictionary = 0; + if (field->type->type == Type::DICTIONARY) { + dictionary = GetDictionaryEncoding( + fbb, static_cast(*field->type), dictionary_memo); + } + + // TODO: produce the list of VectorTypes + *offset = flatbuf::CreateField(fbb, fb_name, field->nullable, type_enum, type_offset, + dictionary, fb_children, fb_layout); + + return Status::OK(); +} + +static Status FieldFromFlatbuffer(const flatbuf::Field* field, + const DictionaryMemo& dictionary_memo, std::shared_ptr* out) { + std::shared_ptr type; + + const flatbuf::DictionaryEncoding* encoding = field->dictionary(); + + if (encoding == nullptr) { + // The field is not dictionary encoded. We must potentially visit its + // children to fully reconstruct the data type + auto children = field->children(); + std::vector> child_fields(children->size()); + for (int i = 0; i < static_cast(children->size()); ++i) { + RETURN_NOT_OK( + FieldFromFlatbuffer(children->Get(i), dictionary_memo, &child_fields[i])); + } + RETURN_NOT_OK( + TypeFromFlatbuffer(field->type_type(), field->type(), child_fields, &type)); + } else { + // The field is dictionary encoded. The type of the dictionary values has + // been determined elsewhere, and is stored in the DictionaryMemo. Here we + // construct the logical DictionaryType object + + std::shared_ptr dictionary; + RETURN_NOT_OK(dictionary_memo.GetDictionary(encoding->id(), &dictionary)); + + std::shared_ptr index_type; + RETURN_NOT_OK(IntFromFlatbuffer(encoding->indexType(), &index_type)); + type = std::make_shared(index_type, dictionary); + } + *out = std::make_shared(field->name()->str(), type, field->nullable()); + return Status::OK(); +} + +static Status FieldFromFlatbufferDictionary( + const flatbuf::Field* field, std::shared_ptr* out) { + // Need an empty memo to pass down for constructing children + DictionaryMemo dummy_memo; + + // Any DictionaryEncoding set is ignored here + + std::shared_ptr type; + auto children = field->children(); + std::vector> child_fields(children->size()); + for (int i = 0; i < static_cast(children->size()); ++i) { + RETURN_NOT_OK(FieldFromFlatbuffer(children->Get(i), dummy_memo, &child_fields[i])); + } + + RETURN_NOT_OK( + TypeFromFlatbuffer(field->type_type(), field->type(), child_fields, &type)); + + *out = std::make_shared(field->name()->str(), type, field->nullable()); + return Status::OK(); +} + +// will return the endianness of the system we are running on +// based the NUMPY_API function. See NOTICE.txt +flatbuf::Endianness endianness() { + union { + uint32_t i; + char c[4]; + } bint = {0x01020304}; + + return bint.c[0] == 1 ? 
flatbuf::Endianness_Big : flatbuf::Endianness_Little; +} + +static Status SchemaToFlatbuffer(FBB& fbb, const Schema& schema, + DictionaryMemo* dictionary_memo, flatbuffers::Offset* out) { + std::vector field_offsets; + for (int i = 0; i < schema.num_fields(); ++i) { + std::shared_ptr field = schema.field(i); + FieldOffset offset; + RETURN_NOT_OK(FieldToFlatbuffer(fbb, field, dictionary_memo, &offset)); + field_offsets.push_back(offset); + } + + *out = flatbuf::CreateSchema(fbb, endianness(), fbb.CreateVector(field_offsets)); + return Status::OK(); +} + +static Status WriteFlatbufferBuilder(FBB& fbb, std::shared_ptr* out) { + int32_t size = fbb.GetSize(); + + auto result = std::make_shared(); + RETURN_NOT_OK(result->Resize(size)); + + uint8_t* dst = result->mutable_data(); + memcpy(dst, fbb.GetBufferPointer(), size); + *out = result; + return Status::OK(); +} + +static Status WriteMessage(FBB& fbb, flatbuf::MessageHeader header_type, + flatbuffers::Offset header, int64_t body_length, std::shared_ptr* out) { + auto message = + flatbuf::CreateMessage(fbb, kMetadataVersion, header_type, header, body_length); + fbb.Finish(message); + return WriteFlatbufferBuilder(fbb, out); +} + +Status WriteSchemaMessage( + const Schema& schema, DictionaryMemo* dictionary_memo, std::shared_ptr* out) { + FBB fbb; + flatbuffers::Offset fb_schema; + RETURN_NOT_OK(SchemaToFlatbuffer(fbb, schema, dictionary_memo, &fb_schema)); + return WriteMessage(fbb, flatbuf::MessageHeader_Schema, fb_schema.Union(), 0, out); +} + +using FieldNodeVector = + flatbuffers::Offset>; +using LargeFieldNodeVector = + flatbuffers::Offset>; +using BufferVector = flatbuffers::Offset>; + +static Status WriteFieldNodes( + FBB& fbb, const std::vector& nodes, FieldNodeVector* out) { + std::vector fb_nodes; + fb_nodes.reserve(nodes.size()); + + for (size_t i = 0; i < nodes.size(); ++i) { + const FieldMetadata& node = nodes[i]; + if (node.offset != 0) { + return Status::Invalid("Field metadata for IPC must have offset 0"); + } + fb_nodes.emplace_back( + static_cast(node.length), static_cast(node.null_count)); + } + *out = fbb.CreateVectorOfStructs(fb_nodes); + return Status::OK(); +} + +static Status WriteLargeFieldNodes( + FBB& fbb, const std::vector& nodes, LargeFieldNodeVector* out) { + std::vector fb_nodes; + fb_nodes.reserve(nodes.size()); + + for (size_t i = 0; i < nodes.size(); ++i) { + const FieldMetadata& node = nodes[i]; + if (node.offset != 0) { + return Status::Invalid("Field metadata for IPC must have offset 0"); + } + fb_nodes.emplace_back(node.length, node.null_count); + } + *out = fbb.CreateVectorOfStructs(fb_nodes); + return Status::OK(); +} + +static Status WriteBuffers( + FBB& fbb, const std::vector& buffers, BufferVector* out) { + std::vector fb_buffers; + fb_buffers.reserve(buffers.size()); + + for (size_t i = 0; i < buffers.size(); ++i) { + const BufferMetadata& buffer = buffers[i]; + fb_buffers.emplace_back(buffer.page, buffer.offset, buffer.length); + } + *out = fbb.CreateVectorOfStructs(fb_buffers); + return Status::OK(); +} + +static Status MakeRecordBatch(FBB& fbb, int32_t length, int64_t body_length, + const std::vector& nodes, const std::vector& buffers, + RecordBatchOffset* offset) { + FieldNodeVector fb_nodes; + BufferVector fb_buffers; + + RETURN_NOT_OK(WriteFieldNodes(fbb, nodes, &fb_nodes)); + RETURN_NOT_OK(WriteBuffers(fbb, buffers, &fb_buffers)); + + *offset = flatbuf::CreateRecordBatch(fbb, length, fb_nodes, fb_buffers); + return Status::OK(); +} + +static Status MakeLargeRecordBatch(FBB& fbb, int64_t length, 
int64_t body_length, + const std::vector& nodes, const std::vector& buffers, + LargeRecordBatchOffset* offset) { + LargeFieldNodeVector fb_nodes; + BufferVector fb_buffers; + + RETURN_NOT_OK(WriteLargeFieldNodes(fbb, nodes, &fb_nodes)); + RETURN_NOT_OK(WriteBuffers(fbb, buffers, &fb_buffers)); + + *offset = flatbuf::CreateLargeRecordBatch(fbb, length, fb_nodes, fb_buffers); + return Status::OK(); +} + +Status WriteRecordBatchMessage(int32_t length, int64_t body_length, + const std::vector& nodes, const std::vector& buffers, + std::shared_ptr* out) { + FBB fbb; + RecordBatchOffset record_batch; + RETURN_NOT_OK(MakeRecordBatch(fbb, length, body_length, nodes, buffers, &record_batch)); + return WriteMessage( + fbb, flatbuf::MessageHeader_RecordBatch, record_batch.Union(), body_length, out); +} + +Status WriteLargeRecordBatchMessage(int64_t length, int64_t body_length, + const std::vector& nodes, const std::vector& buffers, + std::shared_ptr* out) { + FBB fbb; + LargeRecordBatchOffset large_batch; + RETURN_NOT_OK( + MakeLargeRecordBatch(fbb, length, body_length, nodes, buffers, &large_batch)); + return WriteMessage(fbb, flatbuf::MessageHeader_LargeRecordBatch, large_batch.Union(), + body_length, out); +} + +Status WriteDictionaryMessage(int64_t id, int32_t length, int64_t body_length, + const std::vector& nodes, const std::vector& buffers, + std::shared_ptr* out) { + FBB fbb; + RecordBatchOffset record_batch; + RETURN_NOT_OK(MakeRecordBatch(fbb, length, body_length, nodes, buffers, &record_batch)); + auto dictionary_batch = flatbuf::CreateDictionaryBatch(fbb, id, record_batch).Union(); + return WriteMessage( + fbb, flatbuf::MessageHeader_DictionaryBatch, dictionary_batch, body_length, out); +} + +static flatbuffers::Offset> +FileBlocksToFlatbuffer(FBB& fbb, const std::vector& blocks) { + std::vector fb_blocks; + + for (const FileBlock& block : blocks) { + fb_blocks.emplace_back(block.offset, block.metadata_length, block.body_length); + } + + return fbb.CreateVectorOfStructs(fb_blocks); +} + +Status WriteFileFooter(const Schema& schema, const std::vector& dictionaries, + const std::vector& record_batches, DictionaryMemo* dictionary_memo, + io::OutputStream* out) { + FBB fbb; + + flatbuffers::Offset fb_schema; + RETURN_NOT_OK(SchemaToFlatbuffer(fbb, schema, dictionary_memo, &fb_schema)); + + auto fb_dictionaries = FileBlocksToFlatbuffer(fbb, dictionaries); + auto fb_record_batches = FileBlocksToFlatbuffer(fbb, record_batches); + + auto footer = flatbuf::CreateFooter( + fbb, kMetadataVersion, fb_schema, fb_dictionaries, fb_record_batches); + + fbb.Finish(footer); + + int32_t size = fbb.GetSize(); + + return out->Write(fbb.GetBufferPointer(), size); +} + // ---------------------------------------------------------------------- // Memoization data structure for handling shared dictionaries @@ -158,7 +795,18 @@ int64_t Message::body_length() const { // ---------------------------------------------------------------------- // SchemaMetadata -class SchemaMetadata::SchemaMetadataImpl { +class MessageHolder { + public: + void set_message(const std::shared_ptr& message) { message_ = message; } + void set_buffer(const std::shared_ptr& buffer) { buffer_ = buffer; } + + protected: + // Possible parents, owns the flatbuffer data + std::shared_ptr message_; + std::shared_ptr buffer_; +}; + +class SchemaMetadata::SchemaMetadataImpl : public MessageHolder { public: explicit SchemaMetadataImpl(const void* schema) : schema_(static_cast(schema)) {} @@ -196,15 +844,19 @@ class SchemaMetadata::SchemaMetadataImpl { 
const flatbuf::Schema* schema_; }; -SchemaMetadata::SchemaMetadata( - const std::shared_ptr& message, const void* flatbuf) { - message_ = message; - impl_.reset(new SchemaMetadataImpl(flatbuf)); +SchemaMetadata::SchemaMetadata(const std::shared_ptr& message) + : SchemaMetadata(message->impl_->header()) { + impl_->set_message(message); } -SchemaMetadata::SchemaMetadata(const std::shared_ptr& message) { - message_ = message; - impl_.reset(new SchemaMetadataImpl(message->impl_->header())); +SchemaMetadata::SchemaMetadata(const void* header) { + impl_.reset(new SchemaMetadataImpl(header)); +} + +SchemaMetadata::SchemaMetadata(const std::shared_ptr& buffer, int64_t offset) + : SchemaMetadata(buffer->data() + offset) { + // Preserve ownership + impl_->set_buffer(buffer); } SchemaMetadata::~SchemaMetadata() {} @@ -231,7 +883,7 @@ Status SchemaMetadata::GetSchema( // ---------------------------------------------------------------------- // RecordBatchMetadata -class RecordBatchMetadata::RecordBatchMetadataImpl { +class RecordBatchMetadata::RecordBatchMetadataImpl : public MessageHolder { public: explicit RecordBatchMetadataImpl(const void* batch) : batch_(static_cast(batch)) { @@ -249,22 +901,14 @@ class RecordBatchMetadata::RecordBatchMetadataImpl { int num_fields() const { return batch_->nodes()->size(); } - void set_message(const std::shared_ptr& message) { message_ = message; } - - void set_buffer(const std::shared_ptr& buffer) { buffer_ = buffer; } - private: const flatbuf::RecordBatch* batch_; const flatbuffers::Vector* nodes_; const flatbuffers::Vector* buffers_; - - // Possible parents, owns the flatbuffer data - std::shared_ptr message_; - std::shared_ptr buffer_; }; -RecordBatchMetadata::RecordBatchMetadata(const std::shared_ptr& message) { - impl_.reset(new RecordBatchMetadataImpl(message->impl_->header())); +RecordBatchMetadata::RecordBatchMetadata(const std::shared_ptr& message) + : RecordBatchMetadata(message->impl_->header()) { impl_->set_message(message); } @@ -358,8 +1002,8 @@ const RecordBatchMetadata& DictionaryBatchMetadata::record_batch() const { // ---------------------------------------------------------------------- // Conveniences -Status ReadMessage(int64_t offset, int32_t metadata_length, - io::RandomAccessFile* file, std::shared_ptr* message) { +Status ReadMessage(int64_t offset, int32_t metadata_length, io::RandomAccessFile* file, + std::shared_ptr* message) { std::shared_ptr buffer; RETURN_NOT_OK(file->ReadAt(offset, metadata_length, &buffer)); diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h index 4eb0186d3a467..41e6c5e9f19ea 100644 --- a/cpp/src/arrow/ipc/metadata.h +++ b/cpp/src/arrow/ipc/metadata.h @@ -107,10 +107,9 @@ class Message; // Container for serialized Schema metadata contained in an IPC message class ARROW_EXPORT SchemaMetadata { public: + explicit SchemaMetadata(const void* header); explicit SchemaMetadata(const std::shared_ptr& message); - - // Accepts an opaque flatbuffer pointer - SchemaMetadata(const std::shared_ptr& message, const void* schema); + SchemaMetadata(const std::shared_ptr& message, int64_t offset); ~SchemaMetadata(); @@ -127,9 +126,6 @@ class ARROW_EXPORT SchemaMetadata { const DictionaryMemo& dictionary_memo, std::shared_ptr* out) const; private: - // Parent, owns the flatbuffer data - std::shared_ptr message_; - class SchemaMetadataImpl; std::unique_ptr impl_; @@ -145,8 +141,6 @@ struct ARROW_EXPORT BufferMetadata { // Container for serialized record batch metadata contained in an IPC message class ARROW_EXPORT 
RecordBatchMetadata { public: - // Instantiate from opaque pointer. Memory ownership must be preserved - // elsewhere (e.g. in a dictionary batch) explicit RecordBatchMetadata(const void* header); explicit RecordBatchMetadata(const std::shared_ptr& message); RecordBatchMetadata(const std::shared_ptr& message, int64_t offset); @@ -218,8 +212,34 @@ class ARROW_EXPORT Message { /// \param[in] file the seekable file interface to read from /// \param[out] message the message read /// \return Status success or failure -Status ReadMessage(int64_t offset, int32_t metadata_length, - io::RandomAccessFile* file, std::shared_ptr* message); +Status ReadMessage(int64_t offset, int32_t metadata_length, io::RandomAccessFile* file, + std::shared_ptr* message); + +// Serialize arrow::Schema as a Flatbuffer +// +// \param[in] schema a Schema instance +// \param[inout] dictionary_memo class for tracking dictionaries and assigning +// dictionary ids +// \param[out] out the serialized arrow::Buffer +// \return Status outcome +Status WriteSchemaMessage( + const Schema& schema, DictionaryMemo* dictionary_memo, std::shared_ptr* out); + +Status WriteRecordBatchMessage(int32_t length, int64_t body_length, + const std::vector& nodes, const std::vector& buffers, + std::shared_ptr* out); + +Status WriteLargeRecordBatchMessage(int64_t length, int64_t body_length, + const std::vector& nodes, const std::vector& buffers, + std::shared_ptr* out); + +Status WriteDictionaryMessage(int64_t id, int32_t length, int64_t body_length, + const std::vector& nodes, const std::vector& buffers, + std::shared_ptr* out); + +Status WriteFileFooter(const Schema& schema, const std::vector& dictionaries, + const std::vector& record_batches, DictionaryMemo* dictionary_memo, + io::OutputStream* out); } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/reader.cc b/cpp/src/arrow/ipc/reader.cc index 95753643c6513..a2b20a901a69e 100644 --- a/cpp/src/arrow/ipc/reader.cc +++ b/cpp/src/arrow/ipc/reader.cc @@ -26,16 +26,114 @@ #include "arrow/buffer.h" #include "arrow/io/interfaces.h" #include "arrow/io/memory.h" -#include "arrow/ipc/adapter.h" -#include "arrow/ipc/metadata-internal.h" +#include "arrow/ipc/File_generated.h" +#include "arrow/ipc/Message_generated.h" #include "arrow/ipc/metadata.h" #include "arrow/ipc/util.h" +#include "arrow/schema.h" #include "arrow/status.h" +#include "arrow/table.h" #include "arrow/util/logging.h" namespace arrow { + +namespace flatbuf = org::apache::arrow::flatbuf; + namespace ipc { +// ---------------------------------------------------------------------- +// Record batch read path + +class IpcComponentSource : public ArrayComponentSource { + public: + IpcComponentSource(const RecordBatchMetadata& metadata, io::RandomAccessFile* file) + : metadata_(metadata), file_(file) {} + + Status GetBuffer(int buffer_index, std::shared_ptr* out) override { + BufferMetadata buffer_meta = metadata_.buffer(buffer_index); + if (buffer_meta.length == 0) { + *out = nullptr; + return Status::OK(); + } else { + return file_->ReadAt(buffer_meta.offset, buffer_meta.length, out); + } + } + + Status GetFieldMetadata(int field_index, FieldMetadata* metadata) override { + // pop off a field + if (field_index >= metadata_.num_fields()) { + return Status::Invalid("Ran out of field metadata, likely malformed"); + } + *metadata = metadata_.field(field_index); + return Status::OK(); + } + + private: + const RecordBatchMetadata& metadata_; + io::RandomAccessFile* file_; +}; + +Status ReadRecordBatch(const RecordBatchMetadata& 
metadata, + const std::shared_ptr& schema, io::RandomAccessFile* file, + std::shared_ptr* out) { + return ReadRecordBatch(metadata, schema, kMaxNestingDepth, file, out); +} + +static Status LoadRecordBatchFromSource(const std::shared_ptr& schema, + int64_t num_rows, int max_recursion_depth, ArrayComponentSource* source, + std::shared_ptr* out) { + std::vector> arrays(schema->num_fields()); + + ArrayLoaderContext context; + context.source = source; + context.field_index = 0; + context.buffer_index = 0; + context.max_recursion_depth = max_recursion_depth; + + for (int i = 0; i < schema->num_fields(); ++i) { + RETURN_NOT_OK(LoadArray(schema->field(i)->type, &context, &arrays[i])); + } + + *out = std::make_shared(schema, num_rows, arrays); + return Status::OK(); +} + +Status ReadRecordBatch(const RecordBatchMetadata& metadata, + const std::shared_ptr& schema, int max_recursion_depth, + io::RandomAccessFile* file, std::shared_ptr* out) { + IpcComponentSource source(metadata, file); + return LoadRecordBatchFromSource( + schema, metadata.length(), max_recursion_depth, &source, out); +} + +Status ReadDictionary(const DictionaryBatchMetadata& metadata, + const DictionaryTypeMap& dictionary_types, io::RandomAccessFile* file, + std::shared_ptr* out) { + int64_t id = metadata.id(); + auto it = dictionary_types.find(id); + if (it == dictionary_types.end()) { + std::stringstream ss; + ss << "Do not have type metadata for dictionary with id: " << id; + return Status::KeyError(ss.str()); + } + + std::vector> fields = {it->second}; + + // We need a schema for the record batch + auto dummy_schema = std::make_shared(fields); + + // The dictionary is embedded in a record batch with a single column + std::shared_ptr batch; + RETURN_NOT_OK(ReadRecordBatch(metadata.record_batch(), dummy_schema, file, &batch)); + + if (batch->num_columns() != 1) { + return Status::Invalid("Dictionary record batch must only contain one field"); + } + + *out = batch->column(0); + return Status::OK(); +} + // ---------------------------------------------------------------------- // StreamReader implementation @@ -228,7 +326,7 @@ class FileReader::FileReaderImpl { // TODO(wesm): Verify the footer footer_ = flatbuf::GetFooter(footer_buffer_->data()); - schema_metadata_.reset(new SchemaMetadata(nullptr, footer_->schema())); + schema_metadata_.reset(new SchemaMetadata(footer_->schema())); return Status::OK(); } @@ -307,8 +405,7 @@ class FileReader::FileReaderImpl { return schema_metadata_->GetSchema(*dictionary_memo_, &schema_); } - Status Open( - const std::shared_ptr& file, int64_t footer_offset) { + Status Open(const std::shared_ptr& file, int64_t footer_offset) { file_ = file; footer_offset_ = footer_offset; RETURN_NOT_OK(ReadFooter()); @@ -371,5 +468,69 @@ Status FileReader::GetRecordBatch(int i, std::shared_ptr* batch) { return impl_->GetRecordBatch(i, batch); } +// ---------------------------------------------------------------------- +// Read LargeRecordBatch + +class LargeRecordBatchSource : public ArrayComponentSource { + public: + LargeRecordBatchSource( + const flatbuf::LargeRecordBatch* metadata, io::RandomAccessFile* file) + : metadata_(metadata), file_(file) {} + + Status GetBuffer(int buffer_index, std::shared_ptr* out) override { + if (buffer_index >= static_cast(metadata_->buffers()->size())) { + return Status::Invalid("Ran out of buffer metadata, likely malformed"); + } + const flatbuf::Buffer* buffer = metadata_->buffers()->Get(buffer_index); + + if (buffer->length() == 0) { + *out = nullptr; + return Status::OK(); + 
} else { + return file_->ReadAt(buffer->offset(), buffer->length(), out); + } + } + + Status GetFieldMetadata(int field_index, FieldMetadata* metadata) override { + // pop off a field + if (field_index >= static_cast(metadata_->nodes()->size())) { + return Status::Invalid("Ran out of field metadata, likely malformed"); + } + const flatbuf::LargeFieldNode* node = metadata_->nodes()->Get(field_index); + + metadata->length = node->length(); + metadata->null_count = node->null_count(); + metadata->offset = 0; + return Status::OK(); + } + + private: + const flatbuf::LargeRecordBatch* metadata_; + io::RandomAccessFile* file_; +}; + +Status ReadLargeRecordBatch(const std::shared_ptr& schema, int64_t offset, + io::RandomAccessFile* file, std::shared_ptr* out) { + std::shared_ptr buffer; + RETURN_NOT_OK(file->Seek(offset)); + + RETURN_NOT_OK(file->Read(sizeof(int32_t), &buffer)); + int32_t flatbuffer_size = *reinterpret_cast(buffer->data()); + + RETURN_NOT_OK(file->Read(flatbuffer_size, &buffer)); + auto message = flatbuf::GetMessage(buffer->data()); + auto batch = reinterpret_cast(message->header()); + + // TODO(ARROW-388): The buffer offsets start at 0, so we must construct a + // RandomAccessFile according to that frame of reference + std::shared_ptr buffer_payload; + RETURN_NOT_OK(file->Read(message->bodyLength(), &buffer_payload)); + io::BufferReader buffer_reader(buffer_payload); + + LargeRecordBatchSource source(batch, &buffer_reader); + return LoadRecordBatchFromSource( + schema, batch->length(), kMaxNestingDepth, &source, out); +} + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/reader.h b/cpp/src/arrow/ipc/reader.h index ca91765edbac1..1c1314a040bef 100644 --- a/cpp/src/arrow/ipc/reader.h +++ b/cpp/src/arrow/ipc/reader.h @@ -43,6 +43,20 @@ class RandomAccessFile; namespace ipc { +// Generic read functionsh; does not copy data if the input supports zero copy reads + +Status ReadRecordBatch(const RecordBatchMetadata& metadata, + const std::shared_ptr& schema, io::RandomAccessFile* file, + std::shared_ptr* out); + +Status ReadRecordBatch(const RecordBatchMetadata& metadata, + const std::shared_ptr& schema, int max_recursion_depth, + io::RandomAccessFile* file, std::shared_ptr* out); + +Status ReadDictionary(const DictionaryBatchMetadata& metadata, + const DictionaryTypeMap& dictionary_types, io::RandomAccessFile* file, + std::shared_ptr* out); + class ARROW_EXPORT StreamReader { public: ~StreamReader(); @@ -106,6 +120,14 @@ class ARROW_EXPORT FileReader { std::unique_ptr impl_; }; +// ---------------------------------------------------------------------- +// + +/// EXPERIMENTAL: Read length-prefixed LargeRecordBatch metadata (64-bit array +/// lengths) at offset and reconstruct RecordBatch +Status ARROW_EXPORT ReadLargeRecordBatch(const std::shared_ptr& schema, + int64_t offset, io::RandomAccessFile* file, std::shared_ptr* out); + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index 66a5e09362cf5..ba203b090b3b7 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -103,7 +103,7 @@ Status MakeRandomListArray(const std::shared_ptr& child_array, int num_li typedef Status MakeRecordBatch(std::shared_ptr* out); Status MakeIntRecordBatch(std::shared_ptr* out) { - const int length = 1000; + const int length = 10; // Make the schema auto f0 = field("f0", int32()); diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc index 58402b588404c..82c119ef53e9a 
100644 --- a/cpp/src/arrow/ipc/writer.cc +++ b/cpp/src/arrow/ipc/writer.cc @@ -17,27 +17,509 @@ #include "arrow/ipc/writer.h" +#include #include #include +#include #include #include +#include "arrow/array.h" #include "arrow/buffer.h" #include "arrow/io/interfaces.h" #include "arrow/io/memory.h" -#include "arrow/ipc/adapter.h" -#include "arrow/ipc/metadata-internal.h" #include "arrow/ipc/metadata.h" #include "arrow/ipc/util.h" +#include "arrow/loader.h" #include "arrow/memory_pool.h" #include "arrow/schema.h" #include "arrow/status.h" #include "arrow/table.h" +#include "arrow/type.h" +#include "arrow/util/bit-util.h" #include "arrow/util/logging.h" namespace arrow { namespace ipc { +// ---------------------------------------------------------------------- +// Record batch write path + +class RecordBatchWriter : public ArrayVisitor { + public: + RecordBatchWriter( + MemoryPool* pool, int64_t buffer_start_offset, int max_recursion_depth) + : pool_(pool), + max_recursion_depth_(max_recursion_depth), + buffer_start_offset_(buffer_start_offset) { + DCHECK_GT(max_recursion_depth, 0); + } + + virtual ~RecordBatchWriter() = default; + + virtual Status CheckArrayMetadata(const Array& arr) { + if (arr.length() > std::numeric_limits::max()) { + return Status::Invalid("Cannot write arrays larger than 2^31 - 1 in length"); + } + return Status::OK(); + } + + Status VisitArray(const Array& arr) { + if (max_recursion_depth_ <= 0) { + return Status::Invalid("Max recursion depth reached"); + } + + RETURN_NOT_OK(CheckArrayMetadata(arr)); + + // push back all common elements + field_nodes_.emplace_back(arr.length(), arr.null_count(), 0); + + if (arr.null_count() > 0) { + std::shared_ptr bitmap = arr.null_bitmap(); + + if (arr.offset() != 0) { + // With a sliced array / non-zero offset, we must copy the bitmap + RETURN_NOT_OK( + CopyBitmap(pool_, bitmap->data(), arr.offset(), arr.length(), &bitmap)); + } + + buffers_.push_back(bitmap); + } else { + // Push a dummy zero-length buffer, not to be copied + buffers_.push_back(std::make_shared(nullptr, 0)); + } + return arr.Accept(this); + } + + Status Assemble(const RecordBatch& batch, int64_t* body_length) { + if (field_nodes_.size() > 0) { + field_nodes_.clear(); + buffer_meta_.clear(); + buffers_.clear(); + } + + // Perform depth-first traversal of the row-batch + for (int i = 0; i < batch.num_columns(); ++i) { + RETURN_NOT_OK(VisitArray(*batch.column(i))); + } + + // The position for the start of a buffer relative to the passed frame of + // reference. May be 0 or some other position in an address space + int64_t offset = buffer_start_offset_; + + buffer_meta_.reserve(buffers_.size()); + + const int32_t kNoPageId = -1; + + // Construct the buffer metadata for the record batch header + for (size_t i = 0; i < buffers_.size(); ++i) { + const Buffer* buffer = buffers_[i].get(); + int64_t size = 0; + int64_t padding = 0; + + // The buffer might be null if we are handling zero row lengths. + if (buffer) { + size = buffer->size(); + padding = BitUtil::RoundUpToMultipleOf64(size) - size; + } + + // TODO(wesm): We currently have no notion of shared memory page id's, + // but we've included it in the metadata IDL for when we have it in the + // future. Use page = -1 for now + // + // Note that page ids are a bespoke notion for Arrow and not a feature we + // are using from any OS-level shared memory. 
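[Annotation] In RecordBatchWriter::Assemble above, each buffer's recorded length is its size rounded up to a multiple of 64 bytes, and buffer offsets accumulate those padded sizes, so body_length is always 64-byte aligned. A standalone sketch of that layout arithmetic; RoundUpToMultipleOf64 is reimplemented here for illustration and mirrors the BitUtil helper the patch uses:

#include <cassert>
#include <cstdint>
#include <vector>

// Round up to the next multiple of 64, as the writer does for every buffer.
static int64_t RoundUpToMultipleOf64(int64_t n) {
  return (n + 63) & ~int64_t(63);
}

int main() {
  // Raw buffer sizes of a hypothetical batch: validity bitmap, offsets, data.
  std::vector<int64_t> sizes = {13, 404, 1000};

  int64_t offset = 0;  // corresponds to buffer_start_offset_ == 0
  for (int64_t size : sizes) {
    int64_t padded = RoundUpToMultipleOf64(size);
    // The metadata records {page = -1, offset, size + padding} per buffer.
    offset += padded;
  }

  // body_length is the sum of the padded sizes: 64 + 448 + 1024.
  assert(offset == 1536);
  return 0;
}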
The thought is that systems + // may (in the future) associate integer page id's with physical memory + // pages (according to whatever is the desired shared memory mechanism) + buffer_meta_.push_back({kNoPageId, offset, size + padding}); + offset += size + padding; + } + + *body_length = offset - buffer_start_offset_; + DCHECK(BitUtil::IsMultipleOf64(*body_length)); + + return Status::OK(); + } + + // Override this for writing dictionary metadata + virtual Status WriteMetadataMessage( + int64_t num_rows, int64_t body_length, std::shared_ptr* out) { + return WriteRecordBatchMessage( + static_cast(num_rows), body_length, field_nodes_, buffer_meta_, out); + } + + Status WriteMetadata(int64_t num_rows, int64_t body_length, io::OutputStream* dst, + int32_t* metadata_length) { + // Now that we have computed the locations of all of the buffers in shared + // memory, the data header can be converted to a flatbuffer and written out + // + // Note: The memory written here is prefixed by the size of the flatbuffer + // itself as an int32_t. + std::shared_ptr metadata_fb; + RETURN_NOT_OK(WriteMetadataMessage(num_rows, body_length, &metadata_fb)); + + // Need to write 4 bytes (metadata size), the metadata, plus padding to + // end on an 8-byte offset + int64_t start_offset; + RETURN_NOT_OK(dst->Tell(&start_offset)); + + int32_t padded_metadata_length = static_cast(metadata_fb->size()) + 4; + const int32_t remainder = + (padded_metadata_length + static_cast(start_offset)) % 8; + if (remainder != 0) { padded_metadata_length += 8 - remainder; } + + // The returned metadata size includes the length prefix, the flatbuffer, + // plus padding + *metadata_length = padded_metadata_length; + + // Write the flatbuffer size prefix including padding + int32_t flatbuffer_size = padded_metadata_length - 4; + RETURN_NOT_OK( + dst->Write(reinterpret_cast(&flatbuffer_size), sizeof(int32_t))); + + // Write the flatbuffer + RETURN_NOT_OK(dst->Write(metadata_fb->data(), metadata_fb->size())); + + // Write any padding + int32_t padding = + padded_metadata_length - static_cast(metadata_fb->size()) - 4; + if (padding > 0) { RETURN_NOT_OK(dst->Write(kPaddingBytes, padding)); } + + return Status::OK(); + } + + Status Write(const RecordBatch& batch, io::OutputStream* dst, int32_t* metadata_length, + int64_t* body_length) { + RETURN_NOT_OK(Assemble(batch, body_length)); + +#ifndef NDEBUG + int64_t start_position, current_position; + RETURN_NOT_OK(dst->Tell(&start_position)); +#endif + + RETURN_NOT_OK(WriteMetadata(batch.num_rows(), *body_length, dst, metadata_length)); + +#ifndef NDEBUG + RETURN_NOT_OK(dst->Tell(¤t_position)); + DCHECK(BitUtil::IsMultipleOf8(current_position)); +#endif + + // Now write the buffers + for (size_t i = 0; i < buffers_.size(); ++i) { + const Buffer* buffer = buffers_[i].get(); + int64_t size = 0; + int64_t padding = 0; + + // The buffer might be null if we are handling zero row lengths. 
+ if (buffer) { + size = buffer->size(); + padding = BitUtil::RoundUpToMultipleOf64(size) - size; + } + + if (size > 0) { RETURN_NOT_OK(dst->Write(buffer->data(), size)); } + + if (padding > 0) { RETURN_NOT_OK(dst->Write(kPaddingBytes, padding)); } + } + +#ifndef NDEBUG + RETURN_NOT_OK(dst->Tell(¤t_position)); + DCHECK(BitUtil::IsMultipleOf8(current_position)); +#endif + + return Status::OK(); + } + + Status GetTotalSize(const RecordBatch& batch, int64_t* size) { + // emulates the behavior of Write without actually writing + int32_t metadata_length = 0; + int64_t body_length = 0; + MockOutputStream dst; + RETURN_NOT_OK(Write(batch, &dst, &metadata_length, &body_length)); + *size = dst.GetExtentBytesWritten(); + return Status::OK(); + } + + protected: + template + Status VisitFixedWidth(const ArrayType& array) { + std::shared_ptr data_buffer = array.data(); + + if (array.offset() != 0) { + // Non-zero offset, slice the buffer + const auto& fw_type = static_cast(*array.type()); + const int type_width = fw_type.bit_width() / 8; + const int64_t byte_offset = array.offset() * type_width; + + // Send padding if it's available + const int64_t buffer_length = + std::min(BitUtil::RoundUpToMultipleOf64(array.length() * type_width), + data_buffer->size() - byte_offset); + data_buffer = SliceBuffer(data_buffer, byte_offset, buffer_length); + } + buffers_.push_back(data_buffer); + return Status::OK(); + } + + template + Status GetZeroBasedValueOffsets( + const ArrayType& array, std::shared_ptr* value_offsets) { + // Share slicing logic between ListArray and BinaryArray + + auto offsets = array.value_offsets(); + + if (array.offset() != 0) { + // If we have a non-zero offset, then the value offsets do not start at + // zero. We must a) create a new offsets array with shifted offsets and + // b) slice the values array accordingly + + std::shared_ptr shifted_offsets; + RETURN_NOT_OK(AllocateBuffer( + pool_, sizeof(int32_t) * (array.length() + 1), &shifted_offsets)); + + int32_t* dest_offsets = reinterpret_cast(shifted_offsets->mutable_data()); + const int32_t start_offset = array.value_offset(0); + + for (int i = 0; i < array.length(); ++i) { + dest_offsets[i] = array.value_offset(i) - start_offset; + } + // Final offset + dest_offsets[array.length()] = array.value_offset(array.length()) - start_offset; + offsets = shifted_offsets; + } + + *value_offsets = offsets; + return Status::OK(); + } + + Status VisitBinary(const BinaryArray& array) { + std::shared_ptr value_offsets; + RETURN_NOT_OK(GetZeroBasedValueOffsets(array, &value_offsets)); + auto data = array.data(); + + if (array.offset() != 0) { + // Slice the data buffer to include only the range we need now + data = SliceBuffer(data, array.value_offset(0), array.value_offset(array.length())); + } + + buffers_.push_back(value_offsets); + buffers_.push_back(data); + return Status::OK(); + } + + Status Visit(const FixedWidthBinaryArray& array) override { + auto data = array.data(); + int32_t width = array.byte_width(); + + if (array.offset() != 0) { + data = SliceBuffer(data, array.offset() * width, width * array.length()); + } + buffers_.push_back(data); + return Status::OK(); + } + + Status Visit(const BooleanArray& array) override { + buffers_.push_back(array.data()); + return Status::OK(); + } + +#define VISIT_FIXED_WIDTH(TYPE) \ + Status Visit(const TYPE& array) override { return VisitFixedWidth(array); } + + VISIT_FIXED_WIDTH(Int8Array); + VISIT_FIXED_WIDTH(Int16Array); + VISIT_FIXED_WIDTH(Int32Array); + VISIT_FIXED_WIDTH(Int64Array); + 
VISIT_FIXED_WIDTH(UInt8Array); + VISIT_FIXED_WIDTH(UInt16Array); + VISIT_FIXED_WIDTH(UInt32Array); + VISIT_FIXED_WIDTH(UInt64Array); + VISIT_FIXED_WIDTH(HalfFloatArray); + VISIT_FIXED_WIDTH(FloatArray); + VISIT_FIXED_WIDTH(DoubleArray); + VISIT_FIXED_WIDTH(DateArray); + VISIT_FIXED_WIDTH(Date32Array); + VISIT_FIXED_WIDTH(TimeArray); + VISIT_FIXED_WIDTH(TimestampArray); + +#undef VISIT_FIXED_WIDTH + + Status Visit(const StringArray& array) override { return VisitBinary(array); } + + Status Visit(const BinaryArray& array) override { return VisitBinary(array); } + + Status Visit(const ListArray& array) override { + std::shared_ptr value_offsets; + RETURN_NOT_OK(GetZeroBasedValueOffsets(array, &value_offsets)); + buffers_.push_back(value_offsets); + + --max_recursion_depth_; + std::shared_ptr values = array.values(); + + if (array.offset() != 0) { + // For non-zero offset, we slice the values array accordingly + const int32_t offset = array.value_offset(0); + const int32_t length = array.value_offset(array.length()) - offset; + values = values->Slice(offset, length); + } + RETURN_NOT_OK(VisitArray(*values)); + ++max_recursion_depth_; + return Status::OK(); + } + + Status Visit(const StructArray& array) override { + --max_recursion_depth_; + for (std::shared_ptr field : array.fields()) { + if (array.offset() != 0) { + // If offset is non-zero, slice the child array + field = field->Slice(array.offset(), array.length()); + } + RETURN_NOT_OK(VisitArray(*field)); + } + ++max_recursion_depth_; + return Status::OK(); + } + + Status Visit(const UnionArray& array) override { + auto type_ids = array.type_ids(); + if (array.offset() != 0) { + type_ids = SliceBuffer(type_ids, array.offset() * sizeof(UnionArray::type_id_t), + array.length() * sizeof(UnionArray::type_id_t)); + } + + buffers_.push_back(type_ids); + + --max_recursion_depth_; + if (array.mode() == UnionMode::DENSE) { + const auto& type = static_cast(*array.type()); + auto value_offsets = array.value_offsets(); + + // The Union type codes are not necessary 0-indexed + uint8_t max_code = 0; + for (uint8_t code : type.type_codes) { + if (code > max_code) { max_code = code; } + } + + // Allocate an array of child offsets. Set all to -1 to indicate that we + // haven't observed a first occurrence of a particular child yet + std::vector child_offsets(max_code + 1); + std::vector child_lengths(max_code + 1, 0); + + if (array.offset() != 0) { + // This is an unpleasant case. 
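[Annotation] GetZeroBasedValueOffsets above and the dense-union rebase that follows apply the same idea: subtract a base offset so the serialized offsets start at zero for the slice being written, and slice the values buffer to match. A standalone sketch for the list/binary case; ZeroBaseOffsets is a hypothetical name:

#include <cassert>
#include <cstdint>
#include <vector>

// Shift the value offsets of a sliced list/binary array so they start at
// zero, mirroring GetZeroBasedValueOffsets: dest[i] = src[i] - src[0].
static std::vector<int32_t> ZeroBaseOffsets(const std::vector<int32_t>& src) {
  std::vector<int32_t> dest(src.size());
  const int32_t start = src.empty() ? 0 : src[0];
  for (size_t i = 0; i < src.size(); ++i) {
    dest[i] = src[i] - start;
  }
  return dest;
}

int main() {
  // Offsets of a slice that begins at value offset 100 in the parent array;
  // the values buffer would be sliced at [100, 112) to match.
  std::vector<int32_t> sliced = {100, 103, 107, 112};
  std::vector<int32_t> rebased = ZeroBaseOffsets(sliced);
  assert((rebased == std::vector<int32_t>{0, 3, 7, 12}));
  return 0;
}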
Because the offsets are different for + // each child array, when we have a sliced array, we need to "rebase" + // the value_offsets for each array + + const int32_t* unshifted_offsets = array.raw_value_offsets(); + const uint8_t* type_ids = array.raw_type_ids(); + + // Allocate the shifted offsets + std::shared_ptr shifted_offsets_buffer; + RETURN_NOT_OK(AllocateBuffer( + pool_, array.length() * sizeof(int32_t), &shifted_offsets_buffer)); + int32_t* shifted_offsets = + reinterpret_cast(shifted_offsets_buffer->mutable_data()); + + for (int64_t i = 0; i < array.length(); ++i) { + const uint8_t code = type_ids[i]; + int32_t shift = child_offsets[code]; + if (shift == -1) { child_offsets[code] = shift = unshifted_offsets[i]; } + shifted_offsets[i] = unshifted_offsets[i] - shift; + + // Update the child length to account for observed value + ++child_lengths[code]; + } + + value_offsets = shifted_offsets_buffer; + } + buffers_.push_back(value_offsets); + + // Visit children and slice accordingly + for (int i = 0; i < type.num_children(); ++i) { + std::shared_ptr child = array.child(i); + if (array.offset() != 0) { + const uint8_t code = type.type_codes[i]; + child = child->Slice(child_offsets[code], child_lengths[code]); + } + RETURN_NOT_OK(VisitArray(*child)); + } + } else { + for (std::shared_ptr child : array.children()) { + // Sparse union, slicing is simpler + if (array.offset() != 0) { + // If offset is non-zero, slice the child array + child = child->Slice(array.offset(), array.length()); + } + RETURN_NOT_OK(VisitArray(*child)); + } + } + ++max_recursion_depth_; + return Status::OK(); + } + + Status Visit(const DictionaryArray& array) override { + // Dictionary written out separately. Slice offset contained in the indices + return array.indices()->Accept(this); + } + + // In some cases, intermediate buffers may need to be allocated (with sliced arrays) + MemoryPool* pool_; + + std::vector field_nodes_; + std::vector buffer_meta_; + std::vector> buffers_; + + int64_t max_recursion_depth_; + int64_t buffer_start_offset_; +}; + +class DictionaryWriter : public RecordBatchWriter { + public: + using RecordBatchWriter::RecordBatchWriter; + + Status WriteMetadataMessage( + int64_t num_rows, int64_t body_length, std::shared_ptr* out) override { + return WriteDictionaryMessage(dictionary_id_, static_cast(num_rows), + body_length, field_nodes_, buffer_meta_, out); + } + + Status Write(int64_t dictionary_id, const std::shared_ptr& dictionary, + io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length) { + dictionary_id_ = dictionary_id; + + // Make a dummy record batch. 
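[Annotation] The dense-union loop above tracks, per type code, the first offset seen (the shift) and a running count of values used, so that each child array can later be sliced to exactly the range the slice references. The core of that bookkeeping in isolation, with hypothetical example data:

#include <cassert>
#include <cstdint>
#include <vector>

int main() {
  // A sliced dense union with two type codes (0 and 1); offsets point into
  // each child array and need not start at zero for a slice.
  std::vector<uint8_t> type_ids = {0, 1, 0, 1, 1};
  std::vector<int32_t> offsets  = {5, 9, 6, 10, 11};

  const uint8_t max_code = 1;
  std::vector<int32_t> child_offsets(max_code + 1, -1);  // first offset per code
  std::vector<int32_t> child_lengths(max_code + 1, 0);   // values used per code
  std::vector<int32_t> shifted(offsets.size());

  for (size_t i = 0; i < offsets.size(); ++i) {
    const uint8_t code = type_ids[i];
    if (child_offsets[code] == -1) child_offsets[code] = offsets[i];
    shifted[i] = offsets[i] - child_offsets[code];
    ++child_lengths[code];
  }

  // Child 0 would later be sliced at (offset=5, length=2), child 1 at (9, 3).
  assert((shifted == std::vector<int32_t>{0, 0, 1, 1, 2}));
  assert(child_offsets[0] == 5 && child_lengths[0] == 2);
  assert(child_offsets[1] == 9 && child_lengths[1] == 3);
  return 0;
}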
A bit tedious as we have to make a schema + std::vector> fields = { + arrow::field("dictionary", dictionary->type())}; + auto schema = std::make_shared(fields); + RecordBatch batch(schema, dictionary->length(), {dictionary}); + + return RecordBatchWriter::Write(batch, dst, metadata_length, body_length); + } + + private: + // TODO(wesm): Setting this in Write is a bit unclean, but it works + int64_t dictionary_id_; +}; + +Status WriteRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, + io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, + MemoryPool* pool, int max_recursion_depth) { + RecordBatchWriter writer(pool, buffer_start_offset, max_recursion_depth); + return writer.Write(batch, dst, metadata_length, body_length); +} + +Status WriteDictionary(int64_t dictionary_id, const std::shared_ptr& dictionary, + int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, + int64_t* body_length, MemoryPool* pool) { + DictionaryWriter writer(pool, buffer_start_offset, kMaxNestingDepth); + return writer.Write(dictionary_id, dictionary, dst, metadata_length, body_length); +} + +Status GetRecordBatchSize(const RecordBatch& batch, int64_t* size) { + RecordBatchWriter writer(default_memory_pool(), 0, kMaxNestingDepth); + RETURN_NOT_OK(writer.GetTotalSize(batch, size)); + return Status::OK(); +} + // ---------------------------------------------------------------------- // Stream writer implementation @@ -199,38 +681,6 @@ Status StreamWriter::Close() { // ---------------------------------------------------------------------- // File writer implementation -static flatbuffers::Offset> -FileBlocksToFlatbuffer(FBB& fbb, const std::vector& blocks) { - std::vector fb_blocks; - - for (const FileBlock& block : blocks) { - fb_blocks.emplace_back(block.offset, block.metadata_length, block.body_length); - } - - return fbb.CreateVectorOfStructs(fb_blocks); -} - -Status WriteFileFooter(const Schema& schema, const std::vector& dictionaries, - const std::vector& record_batches, DictionaryMemo* dictionary_memo, - io::OutputStream* out) { - FBB fbb; - - flatbuffers::Offset fb_schema; - RETURN_NOT_OK(SchemaToFlatbuffer(fbb, schema, dictionary_memo, &fb_schema)); - - auto fb_dictionaries = FileBlocksToFlatbuffer(fbb, dictionaries); - auto fb_record_batches = FileBlocksToFlatbuffer(fbb, record_batches); - - auto footer = flatbuf::CreateFooter( - fbb, kMetadataVersion, fb_schema, fb_dictionaries, fb_record_batches); - - fbb.Finish(footer); - - int32_t size = fbb.GetSize(); - - return out->Write(fbb.GetBufferPointer(), size); -} - class FileWriter::FileWriterImpl : public StreamWriter::StreamWriterImpl { public: using BASE = StreamWriter::StreamWriterImpl; @@ -283,5 +733,31 @@ Status FileWriter::Close() { return impl_->Close(); } +// ---------------------------------------------------------------------- +// Write record batches with 64-bit size metadata + +class LargeRecordBatchWriter : public RecordBatchWriter { + public: + using RecordBatchWriter::RecordBatchWriter; + + Status CheckArrayMetadata(const Array& arr) override { + // No < INT32_MAX length check + return Status::OK(); + } + + Status WriteMetadataMessage( + int64_t num_rows, int64_t body_length, std::shared_ptr* out) override { + return WriteLargeRecordBatchMessage( + num_rows, body_length, field_nodes_, buffer_meta_, out); + } +}; + +Status WriteLargeRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, + io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, + MemoryPool* pool, 
int max_recursion_depth) { + LargeRecordBatchWriter writer(pool, buffer_start_offset, max_recursion_depth); + return writer.Write(batch, dst, metadata_length, body_length); +} + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/writer.h b/cpp/src/arrow/ipc/writer.h index 7aff71e18e486..1271652a35c78 100644 --- a/cpp/src/arrow/ipc/writer.h +++ b/cpp/src/arrow/ipc/writer.h @@ -45,6 +45,40 @@ class OutputStream; namespace ipc { +// Write the RecordBatch (collection of equal-length Arrow arrays) to the +// output stream in a contiguous block. The record batch metadata is written as +// a flatbuffer (see format/Message.fbs -- the RecordBatch message type) +// prefixed by its size, followed by each of the memory buffers in the batch +// written end to end (with appropriate alignment and padding): +// +// +// +// Finally, the absolute offsets (relative to the start of the output stream) +// to the end of the body and end of the metadata / data header (suffixed by +// the header size) is returned in out-variables +// +// @param(in) buffer_start_offset: the start offset to use in the buffer metadata, +// default should be 0 +// +// @param(out) metadata_length: the size of the length-prefixed flatbuffer +// including padding to a 64-byte boundary +// +// @param(out) body_length: the size of the contiguous buffer block plus +// padding bytes +Status WriteRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, + io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, + MemoryPool* pool, int max_recursion_depth = kMaxNestingDepth); + +// Write Array as a DictionaryBatch message +Status WriteDictionary(int64_t dictionary_id, const std::shared_ptr& dictionary, + int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, + int64_t* body_length, MemoryPool* pool); + +// Compute the precise number of bytes needed in a contiguous memory segment to +// write the record batch. This involves generating the complete serialized +// Flatbuffers metadata. +Status GetRecordBatchSize(const RecordBatch& batch, int64_t* size); + class ARROW_EXPORT StreamWriter { public: virtual ~StreamWriter() = default; @@ -68,10 +102,6 @@ class ARROW_EXPORT StreamWriter { std::unique_ptr impl_; }; -Status WriteFileFooter(const Schema& schema, const std::vector& dictionaries, - const std::vector& record_batches, DictionaryMemo* dictionary_memo, - io::OutputStream* out); - class ARROW_EXPORT FileWriter : public StreamWriter { public: static Status Open(io::OutputStream* sink, const std::shared_ptr& schema, @@ -86,6 +116,14 @@ class ARROW_EXPORT FileWriter : public StreamWriter { std::unique_ptr impl_; }; +// ---------------------------------------------------------------------- + +/// EXPERIMENTAL: Write record batch using LargeRecordBatch IPC metadata. 
This +/// data may not be readable by all Arrow implementations +Status WriteLargeRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, + io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, + MemoryPool* pool, int max_recursion_depth = kMaxNestingDepth); + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/loader.h b/cpp/src/arrow/loader.h index f116d64f5c0c1..9b650e2da7426 100644 --- a/cpp/src/arrow/loader.h +++ b/cpp/src/arrow/loader.h @@ -41,11 +41,36 @@ struct DataType; constexpr int kMaxNestingDepth = 64; struct ARROW_EXPORT FieldMetadata { + FieldMetadata() {} + FieldMetadata(int64_t length, int64_t null_count, int64_t offset) + : length(length), null_count(null_count), offset(offset) {} + + FieldMetadata(const FieldMetadata& other) { + this->length = other.length; + this->null_count = other.null_count; + this->offset = other.offset; + } + int64_t length; int64_t null_count; int64_t offset; }; +struct ARROW_EXPORT BufferMetadata { + BufferMetadata() {} + BufferMetadata(int32_t page, int64_t offset, int64_t length) + : page(page), offset(offset), length(length) {} + + /// The shared memory page id where to find this. Set to -1 if unused + int32_t page; + + /// The relative offset into the memory page to the starting byte of the buffer + int64_t offset; + + /// Absolute length in bytes of the buffer + int64_t length; +}; + /// Implement this to create new types of Arrow data loaders class ARROW_EXPORT ArrayComponentSource { public: diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index a143d79013fb1..adc3161e9551a 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -222,6 +222,7 @@ struct ARROW_EXPORT Field { std::string ToString() const; }; + typedef std::shared_ptr FieldPtr; struct ARROW_EXPORT PrimitiveCType : public FixedWidthType { diff --git a/cpp/src/arrow/util/bit-util.cc b/cpp/src/arrow/util/bit-util.cc index 3767ba9e62f4a..ba0bfd7a9e387 100644 --- a/cpp/src/arrow/util/bit-util.cc +++ b/cpp/src/arrow/util/bit-util.cc @@ -112,7 +112,21 @@ Status CopyBitmap(MemoryPool* pool, const uint8_t* data, int64_t offset, int64_t bool BitmapEquals(const uint8_t* left, int64_t left_offset, const uint8_t* right, int64_t right_offset, int64_t bit_length) { - // TODO(wesm): Make this faster using word-wise comparisons + if (left_offset % 8 == 0 && right_offset % 8 == 0) { + // byte aligned, can use memcmp + bool bytes_equal = std::memcmp(left + left_offset / 8, right + right_offset / 8, + bit_length / 8) == 0; + if (!bytes_equal) { return false; } + for (int64_t i = (bit_length / 8) * 8; i < bit_length; ++i) { + if (BitUtil::GetBit(left, left_offset + i) != + BitUtil::GetBit(right, right_offset + i)) { + return false; + } + } + return true; + } + + // Unaligned slow case for (int64_t i = 0; i < bit_length; ++i) { if (BitUtil::GetBit(left, left_offset + i) != BitUtil::GetBit(right, right_offset + i)) { diff --git a/format/Message.fbs b/format/Message.fbs index 8fdcc804d4706..2af26d4dc54f9 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -307,6 +307,22 @@ table RecordBatch { buffers: [Buffer]; } +/// ---------------------------------------------------------------------- +/// EXPERIMENTAL: A RecordBatch type that supports data with more than 2^31 - 1 +/// elements. 
Arrow implementations do not need to implement this type to be +/// compliant + +struct LargeFieldNode { + length: long; + null_count: long; +} + +table LargeRecordBatch { + length: long; + nodes: [LargeFieldNode]; + buffers: [Buffer]; +} + /// ---------------------------------------------------------------------- /// For sending dictionary encoding information. Any Field can be /// dictionary-encoded, but in this case none of its children may be @@ -324,8 +340,12 @@ table DictionaryBatch { /// This union enables us to easily send different message types without /// redundant storage, and in the future we can easily add new message types. +/// +/// Arrow implementations do not need to implement all of the message types, +/// which may include experimental metadata types. For maximum compatibility, +/// it is best to send data using RecordBatch union MessageHeader { - Schema, DictionaryBatch, RecordBatch + Schema, DictionaryBatch, RecordBatch, LargeRecordBatch } table Message { From cd4544df89b60641f49bbb3104043c0ae07ef8a9 Mon Sep 17 00:00:00 2001 From: Philipp Moritz Date: Mon, 20 Mar 2017 10:54:57 +0100 Subject: [PATCH 0386/1644] ARROW-664: [C++] Make C++ Arrow serialization deterministic Author: Philipp Moritz Closes #405 from pcmoritz/init-buffer-builder and squashes the following commits: 10a897f [Philipp Moritz] Initialize memory obtained by BufferBuilder to zero --- cpp/src/arrow/buffer.h | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/cpp/src/arrow/buffer.h b/cpp/src/arrow/buffer.h index 1647e8601f481..70c16a2dafc86 100644 --- a/cpp/src/arrow/buffer.h +++ b/cpp/src/arrow/buffer.h @@ -170,9 +170,13 @@ class ARROW_EXPORT BufferBuilder { // Resize(0) is a no-op if (elements == 0) { return Status::OK(); } if (capacity_ == 0) { buffer_ = std::make_shared(pool_); } + int64_t old_capacity = capacity_; RETURN_NOT_OK(buffer_->Resize(elements)); capacity_ = buffer_->capacity(); data_ = buffer_->mutable_data(); + if (capacity_ > old_capacity) { + memset(data_ + old_capacity, 0, capacity_ - old_capacity); + } return Status::OK(); } From 02bdbf48a483b224ebfd61cf9be69cb0807e6e50 Mon Sep 17 00:00:00 2001 From: Johan Mabille Date: Mon, 20 Mar 2017 10:57:57 +0100 Subject: [PATCH 0387/1644] ARROW-502 [C++/Python]: Logging memory pool This is a simple decorator on MemoryPool that logs it call to ``std::cout``. I can improve it later if you need to log to other supports. Are you ok with the current logging format ? Also, I'm not a cython expert so I hope the implementation of ``CLoggingMemoryPool`` is correct. 
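For reviewers who want to exercise the decorator, here is a minimal usage sketch based only on the interface added in this diff (the driver code itself is hypothetical, and the logged text follows the ``std::cout`` statements in memory_pool.cc):

    arrow::DefaultMemoryPool base;
    arrow::LoggingMemoryPool pool(&base);

    uint8_t* data;
    arrow::Status s = pool.Allocate(100, &data);
    // stdout: "Allocate: size = 100 - out = ..."
    if (s.ok()) { pool.Free(data, 100); }
    // stdout: "Free: buffer = ... - size = 100"

Each call forwards to the wrapped pool first and logs afterwards, so the log line reflects the actual outcome. One caveat worth noting: streaming ``*out`` (a ``uint8_t*``) through ``std::cout`` selects the C-string overload of ``operator<<``, so it prints the bytes behind the pointer rather than the pointer value; casting to ``void*`` would log the address instead.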
Author: Johan Mabille Closes #395 from JohanMabille/memory_pool and squashes the following commits: aa8ad5f [Johan Mabille] cython fix f70e78a [Johan Mabille] python logging memory pool 9d1d144 [Johan Mabille] formatting 8f9164c [Johan Mabille] Logging memory pool --- cpp/src/arrow/memory_pool-test.cc | 17 +++++++++++++++ cpp/src/arrow/memory_pool.cc | 32 ++++++++++++++++++++++++++++ cpp/src/arrow/memory_pool.h | 18 ++++++++++++++++ python/pyarrow/includes/libarrow.pxd | 3 +++ python/pyarrow/memory.pxd | 5 ++++- python/pyarrow/memory.pyx | 5 ++++- 6 files changed, 78 insertions(+), 2 deletions(-) diff --git a/cpp/src/arrow/memory_pool-test.cc b/cpp/src/arrow/memory_pool-test.cc index 6ab73fb103f50..8a185abca71cc 100644 --- a/cpp/src/arrow/memory_pool-test.cc +++ b/cpp/src/arrow/memory_pool-test.cc @@ -78,4 +78,21 @@ TEST(DefaultMemoryPoolDeathTest, MaxMemory) { #endif // ARROW_VALGRIND +TEST(LoggingMemoryPool, Logging) { + DefaultMemoryPool pool; + LoggingMemoryPool lp(&pool); + + ASSERT_EQ(0, lp.max_memory()); + + uint8_t* data; + ASSERT_OK(pool.Allocate(100, &data)); + + uint8_t* data2; + ASSERT_OK(pool.Allocate(100, &data2)); + + pool.Free(data, 100); + pool.Free(data2, 100); + + ASSERT_EQ(200, pool.max_memory()); +} } // namespace arrow diff --git a/cpp/src/arrow/memory_pool.cc b/cpp/src/arrow/memory_pool.cc index 5a630271a7da7..cf01a02938385 100644 --- a/cpp/src/arrow/memory_pool.cc +++ b/cpp/src/arrow/memory_pool.cc @@ -22,6 +22,7 @@ #include #include #include +#include #include "arrow/status.h" #include "arrow/util/logging.h" @@ -134,4 +135,35 @@ MemoryPool* default_memory_pool() { return &default_memory_pool_; } +LoggingMemoryPool::LoggingMemoryPool(MemoryPool* pool) : pool_(pool) {} + +Status LoggingMemoryPool::Allocate(int64_t size, uint8_t** out) { + Status s = pool_->Allocate(size, out); + std::cout << "Allocate: size = " << size << " - out = " << *out << std::endl; + return s; +} + +Status LoggingMemoryPool::Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) { + Status s = pool_->Reallocate(old_size, new_size, ptr); + std::cout << "Reallocate: old_size = " << old_size << " - new_size = " << new_size + << " - ptr = " << *ptr << std::endl; + return s; +} + +void LoggingMemoryPool::Free(uint8_t* buffer, int64_t size) { + pool_->Free(buffer, size); + std::cout << "Free: buffer = " << buffer << " - size = " << size << std::endl; +} + +int64_t LoggingMemoryPool::bytes_allocated() const { + int64_t nb_bytes = pool_->bytes_allocated(); + std::cout << "bytes_allocated: " << nb_bytes << std::endl; + return nb_bytes; +} + +int64_t LoggingMemoryPool::max_memory() const { + int64_t mem = pool_->max_memory(); + std::cout << "max_memory: " << mem << std::endl; + return mem; +} } // namespace arrow diff --git a/cpp/src/arrow/memory_pool.h b/cpp/src/arrow/memory_pool.h index 0edfda635d0e8..90bc593ab71fe 100644 --- a/cpp/src/arrow/memory_pool.h +++ b/cpp/src/arrow/memory_pool.h @@ -89,6 +89,24 @@ class ARROW_EXPORT DefaultMemoryPool : public MemoryPool { std::atomic max_memory_; }; +class ARROW_EXPORT LoggingMemoryPool : public MemoryPool { + public: + explicit LoggingMemoryPool(MemoryPool* pool); + virtual ~LoggingMemoryPool() = default; + + Status Allocate(int64_t size, uint8_t** out) override; + Status Reallocate(int64_t old_size, int64_t new_size, uint8_t** ptr) override; + + void Free(uint8_t* buffer, int64_t size) override; + + int64_t bytes_allocated() const override; + + int64_t max_memory() const override; + + private: + MemoryPool* pool_; +}; + ARROW_EXPORT MemoryPool* 
default_memory_pool(); } // namespace arrow diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index dee7fd4f8e4e5..705fe6b4a55ca 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -104,6 +104,9 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass CMemoryPool" arrow::MemoryPool": int64_t bytes_allocated() + cdef cppclass CLoggingMemoryPool" arrow::LoggingMemoryPool"(CMemoryPool): + CLoggingMemoryPool(CMemoryPool*) + cdef cppclass CBuffer" arrow::Buffer": uint8_t* data() int64_t size() diff --git a/python/pyarrow/memory.pxd b/python/pyarrow/memory.pxd index 3079ccb807b0d..bb1af85c8ea65 100644 --- a/python/pyarrow/memory.pxd +++ b/python/pyarrow/memory.pxd @@ -15,7 +15,7 @@ # specific language governing permissions and limitations # under the License. -from pyarrow.includes.libarrow cimport CMemoryPool +from pyarrow.includes.libarrow cimport CMemoryPool, CLoggingMemoryPool cdef class MemoryPool: @@ -24,4 +24,7 @@ cdef class MemoryPool: cdef init(self, CMemoryPool* pool) +cdef class LoggingMemoryPool(MemoryPool): + pass + cdef CMemoryPool* maybe_unbox_memory_pool(MemoryPool memory_pool) diff --git a/python/pyarrow/memory.pyx b/python/pyarrow/memory.pyx index 18a6de4f15392..98dbf66c8e0af 100644 --- a/python/pyarrow/memory.pyx +++ b/python/pyarrow/memory.pyx @@ -19,7 +19,7 @@ # distutils: language = c++ # cython: embedsignature = True -from pyarrow.includes.libarrow cimport CMemoryPool +from pyarrow.includes.libarrow cimport CMemoryPool, CLoggingMemoryPool from pyarrow.includes.pyarrow cimport set_default_memory_pool, get_memory_pool cdef class MemoryPool: @@ -35,6 +35,9 @@ cdef CMemoryPool* maybe_unbox_memory_pool(MemoryPool memory_pool): else: return memory_pool.pool +cdef class LoggingMemoryPool(MemoryPool): + pass + def default_pool(): cdef: MemoryPool pool = MemoryPool() From 6cd82c2a294562d1d16a4767b32f072056f396a3 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Mon, 20 Mar 2017 17:55:56 +0100 Subject: [PATCH 0388/1644] ARROW-671: [GLib] Install missing license file Author: Kouhei Sutou Closes #406 from kou/glib-install-missing-license-file and squashes the following commits: 8e452d4 [Kouhei Sutou] [GLib] Install missing license file --- c_glib/.gitignore | 1 + c_glib/Makefile.am | 6 ++++++ c_glib/autogen.sh | 2 ++ 3 files changed, 9 insertions(+) diff --git a/c_glib/.gitignore b/c_glib/.gitignore index 38e33a2cd88e7..e57a0594c1af3 100644 --- a/c_glib/.gitignore +++ b/c_glib/.gitignore @@ -8,6 +8,7 @@ Makefile.in *.lo *.la *~ +/LICENSE.txt /*.tar.gz /aclocal.m4 /autom4te.cache/ diff --git a/c_glib/Makefile.am b/c_glib/Makefile.am index 076f9be08524b..c078b0889d4ff 100644 --- a/c_glib/Makefile.am +++ b/c_glib/Makefile.am @@ -23,4 +23,10 @@ SUBDIRS = \ example EXTRA_DIST = \ + README.md \ + LICENSE.txt \ version + +doc_DATA = \ + README.md \ + LICENSE.txt diff --git a/c_glib/autogen.sh b/c_glib/autogen.sh index 08e33e6ca07c0..6e2036da6406b 100755 --- a/c_glib/autogen.sh +++ b/c_glib/autogen.sh @@ -25,6 +25,8 @@ ruby \ ../java/pom.xml > \ version +cp ../LICENSE.txt ./ + mkdir -p m4 gtkdocize --copy --docdir doc/reference From 98a52b4823f3cd0880eaef066dc932f533170292 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 20 Mar 2017 10:44:15 -0700 Subject: [PATCH 0389/1644] ARROW-316: [Format] Changes to Date metadata format per discussion in ARROW-316 Author: Wes McKinney Closes #390 from wesm/ARROW-316 and squashes the following commits: 6828e05 [Wes McKinney] Format changes for Date per 
discussion in ARROW-316 --- format/Message.fbs | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/format/Message.fbs b/format/Message.fbs index 2af26d4dc54f9..e56366d436eb9 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -81,8 +81,19 @@ table Decimal { scale: int; } -/// Date is a 64-bit type representing milliseconds since the UNIX epoch +enum DateUnit: short { + DAY, + MILLISECOND +} + +/// Date is either a 32-bit or 64-bit type representing elapsed time since UNIX +/// epoch (1970-01-01), stored in either of two units: +/// +/// * Milliseconds (64 bits) indicating UNIX time elapsed since the epoch (no +/// leap seconds), where the values are evenly divisible by 86400000 +/// * Days (32 bits) since the UNIX epoch table Date { + unit: DateUnit; } enum TimeUnit: short { SECOND, MILLISECOND, MICROSECOND, NANOSECOND } From a8bf0fbc832fef3e2a6a9ec075db03007a26442a Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Tue, 21 Mar 2017 15:21:44 -0700 Subject: [PATCH 0390/1644] ARROW-673: [Java] Support additional Time metadata Author: Julien Le Dem Closes #407 from julienledem/time_md and squashes the following commits: 3f721e2 [Julien Le Dem] ARROW-673: [Java] Support additional Time metadata --- .../src/main/codegen/data/ArrowTypes.tdd | 2 +- .../templates/NullableValueVectors.java | 39 +------------------ .../arrow/vector/schema/TypeLayout.java | 2 +- .../org/apache/arrow/vector/types/Types.java | 7 +++- .../arrow/vector/types/pojo/TestSchema.java | 2 +- 5 files changed, 10 insertions(+), 42 deletions(-) diff --git a/java/vector/src/main/codegen/data/ArrowTypes.tdd b/java/vector/src/main/codegen/data/ArrowTypes.tdd index 01465e585dad2..8f997524fccfc 100644 --- a/java/vector/src/main/codegen/data/ArrowTypes.tdd +++ b/java/vector/src/main/codegen/data/ArrowTypes.tdd @@ -58,7 +58,7 @@ }, { name: "Time", - fields: [] + fields: [{name: "unit", type: short, valueType: TimeUnit}, {name: "bitWidth", type: int}] }, { name: "Timestamp", diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index b3e10e3fa87a2..ec2ce7930cf5d 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -84,43 +84,8 @@ public final class ${className} extends BaseDataValueVector implements <#if type values = new ${valuesName}(valuesField, allocator); mutator = new Mutator(); accessor = new Accessor(); - <#if minor.class == "TinyInt" || - minor.class == "SmallInt" || - minor.class == "Int" || - minor.class == "BigInt"> - field = new Field(name, true, new Int(${type.width} * 8, true), dictionary, null); - <#elseif minor.class == "UInt1" || - minor.class == "UInt2" || - minor.class == "UInt4" || - minor.class == "UInt8"> - field = new Field(name, true, new Int(${type.width} * 8, false), dictionary, null); - <#elseif minor.class == "Date"> - field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Date(), dictionary, null); - <#elseif minor.class == "Time"> - field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Time(), dictionary, null); - <#elseif minor.class == "Float4"> - field = new Field(name, true, new FloatingPoint(org.apache.arrow.vector.types.FloatingPointPrecision.SINGLE), dictionary, null); - <#elseif minor.class == "Float8"> - field = new Field(name, true, new FloatingPoint(org.apache.arrow.vector.types.FloatingPointPrecision.DOUBLE), dictionary, 
null); - <#elseif minor.class == "TimeStampSec"> - field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(org.apache.arrow.vector.types.TimeUnit.SECOND), dictionary, null); - <#elseif minor.class == "TimeStampMilli"> - field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(org.apache.arrow.vector.types.TimeUnit.MILLISECOND), dictionary, null); - <#elseif minor.class == "TimeStampMicro"> - field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(org.apache.arrow.vector.types.TimeUnit.MICROSECOND), dictionary, null); - <#elseif minor.class == "TimeStampNano"> - field = new Field(name, true, new org.apache.arrow.vector.types.pojo.ArrowType.Timestamp(org.apache.arrow.vector.types.TimeUnit.NANOSECOND), dictionary, null); - <#elseif minor.class == "IntervalDay"> - field = new Field(name, true, new Interval(org.apache.arrow.vector.types.IntervalUnit.DAY_TIME), dictionary, null); - <#elseif minor.class == "IntervalYear"> - field = new Field(name, true, new Interval(org.apache.arrow.vector.types.IntervalUnit.YEAR_MONTH), dictionary, null); - <#elseif minor.class == "VarChar"> - field = new Field(name, true, new Utf8(), dictionary, null); - <#elseif minor.class == "VarBinary"> - field = new Field(name, true, new Binary(), dictionary, null); - <#elseif minor.class == "Bit"> - field = new Field(name, true, new Bool(), dictionary, null); - + ArrowType type = Types.MinorType.${minor.class?upper_case}.getType(); + field = new Field(name, true, type, dictionary, null); innerVectors = Collections.unmodifiableList(Arrays.asList( bits, <#if type.major = "VarLen"> diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java index 0b586914bdf85..69d550fc9f799 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java @@ -164,7 +164,7 @@ public TypeLayout visit(Date type) { @Override public TypeLayout visit(Time type) { - return newFixedWidthTypeLayout(dataVector(64)); + return newFixedWidthTypeLayout(dataVector(type.getBitWidth())); } @Override diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java index 8f2d04224c0fd..7cbf3c5bb5e36 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -108,7 +108,7 @@ public class Types { private static final Field UINT4_FIELD = new Field("", true, new Int(32, false), null); private static final Field UINT8_FIELD = new Field("", true, new Int(64, false), null); private static final Field DATE_FIELD = new Field("", true, Date.INSTANCE, null); - private static final Field TIME_FIELD = new Field("", true, Time.INSTANCE, null); + private static final Field TIME_FIELD = new Field("", true, new Time(TimeUnit.MILLISECOND, 32), null); private static final Field TIMESTAMPSEC_FIELD = new Field("", true, new Timestamp(TimeUnit.SECOND), null); private static final Field TIMESTAMPMILLI_FIELD = new Field("", true, new Timestamp(TimeUnit.MILLISECOND), null); private static final Field TIMESTAMPMICRO_FIELD = new Field("", true, new Timestamp(TimeUnit.MICROSECOND), null); @@ -235,7 +235,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { return new DateWriterImpl((NullableDateVector) 
vector); } }, - TIME(Time.INSTANCE) { + TIME(new Time(TimeUnit.MILLISECOND, 32)) { @Override public Field getField() { return TIME_FIELD; @@ -639,6 +639,9 @@ public MinorType visit(FloatingPoint type) { } @Override public MinorType visit(Time type) { + if (type.getUnit() != TimeUnit.MILLISECOND || type.getBitWidth() != 32) { + throw new IllegalArgumentException("Only milliseconds on 32 bits supported for now: " + type); + } return MinorType.TIME; } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java b/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java index f04c78ec45d97..5b74c54c9159f 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java @@ -71,7 +71,7 @@ public void testAll() throws IOException { field("i", new ArrowType.Bool()), field("j", new ArrowType.Decimal(5, 5)), field("k", new ArrowType.Date()), - field("l", new ArrowType.Time()), + field("l", new ArrowType.Time(TimeUnit.MILLISECOND, 32)), field("m", new ArrowType.Timestamp(TimeUnit.MILLISECOND)), field("n", new ArrowType.Interval(IntervalUnit.DAY_TIME)) )); From a9a570139966593ed84ddd842da73b60ace89e1e Mon Sep 17 00:00:00 2001 From: Tsuyoshi Ozawa Date: Tue, 21 Mar 2017 15:24:19 -0700 Subject: [PATCH 0391/1644] ARROW-208: Add checkstyle policy to java project Author: Tsuyoshi Ozawa Closes #96 from oza/ARROW-208 and squashes the following commits: 809e729 [Tsuyoshi Ozawa] reformatted code in memory and tools dir with IDE 40ee6a3 [Tsuyoshi Ozawa] ARROW-208: Add checkstyle policy to java project --- .../main/java/io/netty/buffer/ArrowBuf.java | 219 +++---- .../io/netty/buffer/ExpandableByteBuf.java | 8 +- .../java/io/netty/buffer/LargeBuffer.java | 9 +- .../netty/buffer/MutableWrappedByteBuf.java | 18 +- .../netty/buffer/PooledByteBufAllocatorL.java | 84 +-- .../buffer/UnsafeDirectLittleEndian.java | 52 +- .../org/apache/arrow/memory/Accountant.java | 102 ++-- .../arrow/memory/AllocationListener.java | 4 +- .../arrow/memory/AllocationManager.java | 177 +++--- .../arrow/memory/AllocationReservation.java | 20 +- .../memory/AllocatorClosedException.java | 6 +- .../arrow/memory/ArrowByteBufAllocator.java | 14 +- .../apache/arrow/memory/BaseAllocator.java | 539 +++++++++--------- .../apache/arrow/memory/BoundsChecking.java | 7 +- .../apache/arrow/memory/BufferAllocator.java | 80 +-- .../apache/arrow/memory/BufferManager.java | 15 +- .../apache/arrow/memory/ChildAllocator.java | 18 +- .../arrow/memory/OutOfMemoryException.java | 13 +- .../apache/arrow/memory/RootAllocator.java | 6 +- .../org/apache/arrow/memory/package-info.java | 49 +- .../arrow/memory/util/AssertionUtil.java | 15 +- .../arrow/memory/util/AutoCloseableLock.java | 5 +- .../arrow/memory/util/HistoricalLog.java | 85 +-- .../apache/arrow/memory/util/StackTrace.java | 15 +- java/pom.xml | 55 ++ .../org/apache/arrow/tools/EchoServer.java | 102 ++-- .../org/apache/arrow/tools/FileRoundtrip.java | 29 +- .../org/apache/arrow/tools/FileToStream.java | 17 +- .../org/apache/arrow/tools/Integration.java | 133 +++-- .../org/apache/arrow/tools/StreamToFile.java | 17 +- .../arrow/tools/ArrowFileTestFixtures.java | 28 +- .../apache/arrow/tools/EchoServerTest.java | 66 ++- .../apache/arrow/tools/TestFileRoundtrip.java | 15 +- .../apache/arrow/tools/TestIntegration.java | 159 +++--- 34 files changed, 1218 insertions(+), 963 deletions(-) diff --git a/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java 
b/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java index 95d2be5a43a36..e777b5a6a5d58 100644 --- a/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java +++ b/java/memory/src/main/java/io/netty/buffer/ArrowBuf.java @@ -6,27 +6,21 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + * <p>
* http://www.apache.org/licenses/LICENSE-2.0 - * + * <p>
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package io.netty.buffer; -import java.io.IOException; -import java.io.InputStream; -import java.io.OutputStream; -import java.nio.ByteBuffer; -import java.nio.ByteOrder; -import java.nio.channels.GatheringByteChannel; -import java.nio.channels.ScatteringByteChannel; -import java.nio.charset.Charset; -import java.util.concurrent.atomic.AtomicInteger; -import java.util.concurrent.atomic.AtomicLong; +import com.google.common.base.Preconditions; + +import io.netty.util.internal.PlatformDependent; import org.apache.arrow.memory.AllocationManager.BufferLedger; import org.apache.arrow.memory.ArrowByteBufAllocator; @@ -37,15 +31,23 @@ import org.apache.arrow.memory.BufferManager; import org.apache.arrow.memory.util.HistoricalLog; -import com.google.common.base.Preconditions; - -import io.netty.util.internal.PlatformDependent; +import java.io.IOException; +import java.io.InputStream; +import java.io.OutputStream; +import java.nio.ByteBuffer; +import java.nio.ByteOrder; +import java.nio.channels.GatheringByteChannel; +import java.nio.channels.ScatteringByteChannel; +import java.nio.charset.Charset; +import java.util.concurrent.atomic.AtomicInteger; +import java.util.concurrent.atomic.AtomicLong; public final class ArrowBuf extends AbstractByteBuf implements AutoCloseable { + private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(ArrowBuf.class); private static final AtomicLong idGenerator = new AtomicLong(0); - + private static final int LOG_BYTES_PER_ROW = 10; private final long id = idGenerator.incrementAndGet(); private final AtomicInteger refCnt; private final UnsafeDirectLittleEndian udle; @@ -55,9 +57,9 @@ public final class ArrowBuf extends AbstractByteBuf implements AutoCloseable { private final BufferManager bufManager; private final ArrowByteBufAllocator alloc; private final boolean isEmpty; - private volatile int length; private final HistoricalLog historicalLog = BaseAllocator.DEBUG ? new HistoricalLog(BaseAllocator.DEBUG_LOG_LENGTH, "ArrowBuf[%d]", id) : null; + private volatile int length; public ArrowBuf( final AtomicInteger refCnt, @@ -85,6 +87,17 @@ public ArrowBuf( } + public static String bufferState(final ByteBuf buf) { + final int cap = buf.capacity(); + final int mcap = buf.maxCapacity(); + final int ri = buf.readerIndex(); + final int rb = buf.readableBytes(); + final int wi = buf.writerIndex(); + final int wb = buf.writableBytes(); + return String.format("cap/max: %d/%d, ri: %d, rb: %d, wi: %d, wb: %d", + cap, mcap, ri, rb, wi, wb); + } + public ArrowBuf reallocIfNeeded(final int size) { Preconditions.checkArgument(size >= 0, "reallocation size must be non-negative"); @@ -95,7 +108,8 @@ public ArrowBuf reallocIfNeeded(final int size) { if (bufManager != null) { return bufManager.replace(this, size); } else { - throw new UnsupportedOperationException("Realloc is only available in the context of an operator's UDFs"); + throw new UnsupportedOperationException("Realloc is only available in the context of an " + + "operator's UDFs"); } } @@ -128,14 +142,13 @@ private final void checkIndexD(int index, int fieldLength) { /** * Allows a function to determine whether not reading a particular string of bytes is valid. 
- * - * Will throw an exception if the memory is not readable for some reason. Only doesn't something in the case that + * <p>
+ * Will throw an exception if the memory is not readable for some reason. Only doesn't + * something in the case that * AssertionUtil.BOUNDS_CHECKING_ENABLED is true. * - * @param start - * The starting position of the bytes to be read. - * @param end - * The exclusive endpoint of the bytes to be read. + * @param start The starting position of the bytes to be read. + * @param end The exclusive endpoint of the bytes to be read. */ public void checkBytes(int start, int end) { if (BoundsChecking.BOUNDS_CHECKING_ENABLED) { @@ -156,17 +169,21 @@ private void ensure(int width) { } /** - * Create a new ArrowBuf that is associated with an alternative allocator for the purposes of memory ownership and - * accounting. This has no impact on the reference counting for the current ArrowBuf except in the situation where the + * Create a new ArrowBuf that is associated with an alternative allocator for the purposes of + * memory ownership and + * accounting. This has no impact on the reference counting for the current ArrowBuf except in + * the situation where the * passed in Allocator is the same as the current buffer. - * - * This operation has no impact on the reference count of this ArrowBuf. The newly created ArrowBuf with either have a - * reference count of 1 (in the case that this is the first time this memory is being associated with the new - * allocator) or the current value of the reference count + 1 for the other AllocationManager/BufferLedger combination + *
<p>
+ * This operation has no impact on the reference count of this ArrowBuf. The newly created + * ArrowBuf with either have a + * reference count of 1 (in the case that this is the first time this memory is being + * associated with the new + * allocator) or the current value of the reference count + 1 for the other + * AllocationManager/BufferLedger combination * in the case that the provided allocator already had an association to this underlying memory. * - * @param target - * The target allocator to create an association with. + * @param target The target allocator to create an association with. * @return A new ArrowBuf which shares the same underlying memory as this ArrowBuf. */ public ArrowBuf retain(BufferAllocator target) { @@ -186,28 +203,39 @@ public ArrowBuf retain(BufferAllocator target) { } /** - * Transfer the memory accounting ownership of this ArrowBuf to another allocator. This will generate a new ArrowBuf - * that carries an association with the underlying memory of this ArrowBuf. If this ArrowBuf is connected to the - * owning BufferLedger of this memory, that memory ownership/accounting will be transferred to the taret allocator. If - * this ArrowBuf does not currently own the memory underlying it (and is only associated with it), this does not + * Transfer the memory accounting ownership of this ArrowBuf to another allocator. This will + * generate a new ArrowBuf + * that carries an association with the underlying memory of this ArrowBuf. If this ArrowBuf is + * connected to the + * owning BufferLedger of this memory, that memory ownership/accounting will be transferred to + * the taret allocator. If + * this ArrowBuf does not currently own the memory underlying it (and is only associated with + * it), this does not * transfer any ownership to the newly created ArrowBuf. - * - * This operation has no impact on the reference count of this ArrowBuf. The newly created ArrowBuf with either have a - * reference count of 1 (in the case that this is the first time this memory is being associated with the new - * allocator) or the current value of the reference count for the other AllocationManager/BufferLedger combination in + *
<p>
+ * This operation has no impact on the reference count of this ArrowBuf. The newly created + * ArrowBuf with either have a + * reference count of 1 (in the case that this is the first time this memory is being + * associated with the new + * allocator) or the current value of the reference count for the other + * AllocationManager/BufferLedger combination in * the case that the provided allocator already had an association to this underlying memory. - * - * Transfers will always succeed, even if that puts the other allocator into an overlimit situation. This is possible - * due to the fact that the original owning allocator may have allocated this memory out of a local reservation - * whereas the target allocator may need to allocate new memory from a parent or RootAllocator. This operation is done - * in a mostly-lockless but consistent manner. As such, the overlimit==true situation could occur slightly prematurely - * to an actual overlimit==true condition. This is simply conservative behavior which means we may return overlimit + *
<p>
+ * Transfers will always succeed, even if that puts the other allocator into an overlimit + * situation. This is possible + * due to the fact that the original owning allocator may have allocated this memory out of a + * local reservation + * whereas the target allocator may need to allocate new memory from a parent or RootAllocator. + * This operation is done + * in a mostly-lockless but consistent manner. As such, the overlimit==true situation could + * occur slightly prematurely + * to an actual overlimit==true condition. This is simply conservative behavior which means we + * may return overlimit * slightly sooner than is necessary. * - * @param target - * The allocator to transfer ownership to. - * @return A new transfer result with the impact of the transfer (whether it was overlimit) as well as the newly - * created ArrowBuf. + * @param target The allocator to transfer ownership to. + * @return A new transfer result with the impact of the transfer (whether it was overlimit) as + * well as the newly created ArrowBuf. */ public TransferResult transferOwnership(BufferAllocator target) { @@ -223,28 +251,6 @@ public TransferResult transferOwnership(BufferAllocator target) { return new TransferResult(allocationFit, newBuf); } - /** - * The outcome of a Transfer. - */ - public class TransferResult { - - /** - * Whether this transfer fit within the target allocator's capacity. - */ - public final boolean allocationFit; - - /** - * The newly created buffer associated with the target allocator. - */ - public final ArrowBuf buffer; - - private TransferResult(boolean allocationFit, ArrowBuf buffer) { - this.allocationFit = allocationFit; - this.buffer = buffer; - } - - } - @Override public boolean release() { return release(1); @@ -261,7 +267,8 @@ public boolean release(int decrement) { } if (decrement < 1) { - throw new IllegalStateException(String.format("release(%d) argument is not positive. Buffer Info: %s", + throw new IllegalStateException(String.format("release(%d) argument is not positive. Buffer" + + " Info: %s", decrement, toVerboseString())); } @@ -273,7 +280,8 @@ public boolean release(int decrement) { if (refCnt < 0) { throw new IllegalStateException( - String.format("ArrowBuf[%d] refCnt has gone negative. Buffer Info: %s", id, toVerboseString())); + String.format("ArrowBuf[%d] refCnt has gone negative. 
Buffer Info: %s", id, + toVerboseString())); } return refCnt == 0; @@ -299,7 +307,8 @@ public synchronized ArrowBuf capacity(int newCapacity) { return this; } - throw new UnsupportedOperationException("Buffers don't support resizing that increases the size."); + throw new UnsupportedOperationException("Buffers don't support resizing that increases the " + + "size."); } @Override @@ -354,17 +363,6 @@ public ArrowBuf slice() { return slice(readerIndex(), readableBytes()); } - public static String bufferState(final ByteBuf buf) { - final int cap = buf.capacity(); - final int mcap = buf.maxCapacity(); - final int ri = buf.readerIndex(); - final int rb = buf.readableBytes(); - final int wi = buf.writerIndex(); - final int wb = buf.writableBytes(); - return String.format("cap/max: %d/%d, ri: %d, rb: %d, wi: %d, wb: %d", - cap, mcap, ri, rb, wi, wb); - } - @Override public ArrowBuf slice(int index, int length) { @@ -373,7 +371,8 @@ public ArrowBuf slice(int index, int length) { } /* - * Re the behavior of reference counting, see http://netty.io/wiki/reference-counted-objects.html#wiki-h3-5, which + * Re the behavior of reference counting, see http://netty.io/wiki/reference-counted-objects + * .html#wiki-h3-5, which * explains that derived buffers share their reference count with their parent */ final ArrowBuf newBuf = ledger.newArrowBuf(offset + index, length); @@ -408,12 +407,12 @@ public ByteBuffer internalNioBuffer(int index, int length) { @Override public ByteBuffer[] nioBuffers() { - return new ByteBuffer[] { nioBuffer() }; + return new ByteBuffer[]{nioBuffer()}; } @Override public ByteBuffer[] nioBuffers(int index, int length) { - return new ByteBuffer[] { nioBuffer(index, length) }; + return new ByteBuffer[]{nioBuffer(index, length)}; } @Override @@ -443,7 +442,8 @@ public long memoryAddress() { @Override public String toString() { - return String.format("ArrowBuf[%d], udle: [%d %d..%d]", id, udle.id, offset, offset + capacity()); + return String.format("ArrowBuf[%d], udle: [%d %d..%d]", id, udle.id, offset, offset + + capacity()); } @Override @@ -738,7 +738,8 @@ public ArrowBuf setBytes(int index, ByteBuf src, int srcIndex, int length) { public ArrowBuf setBytes(int index, ByteBuffer src, int srcIndex, int length) { if (src.isDirect()) { checkIndex(index, length); - PlatformDependent.copyMemory(PlatformDependent.directBufferAddress(src) + srcIndex, this.memoryAddress() + index, + PlatformDependent.copyMemory(PlatformDependent.directBufferAddress(src) + srcIndex, this + .memoryAddress() + index, length); } else { if (srcIndex == 0 && src.capacity() == length) { @@ -788,7 +789,8 @@ public void close() { } /** - * Returns the possible memory consumed by this ArrowBuf in the worse case scenario. (not shared, connected to larger + * Returns the possible memory consumed by this ArrowBuf in the worse case scenario. (not + * shared, connected to larger * underlying buffer of allocated memory) * * @return Size in bytes. @@ -798,7 +800,8 @@ public int getPossibleMemoryConsumed() { } /** - * Return that is Accounted for by this buffer (and its potentially shared siblings within the context of the + * Return that is Accounted for by this buffer (and its potentially shared siblings within the + * context of the * associated allocator). * * @return Size in bytes. @@ -807,15 +810,11 @@ public int getActualMemoryConsumed() { return ledger.getAccountedSize(); } - private final static int LOG_BYTES_PER_ROW = 10; - /** * Return the buffer's byte contents in the form of a hex dump. 
* - * @param start - * the starting byte index - * @param length - * how many bytes to log + * @param start the starting byte index + * @param length how many bytes to log * @return A hex dump in a String. */ public String toHexString(final int start, final int length) { @@ -878,5 +877,27 @@ public ArrowBuf writerIndex(int writerIndex) { return this; } + /** + * The outcome of a Transfer. + */ + public class TransferResult { + + /** + * Whether this transfer fit within the target allocator's capacity. + */ + public final boolean allocationFit; + + /** + * The newly created buffer associated with the target allocator. + */ + public final ArrowBuf buffer; + + private TransferResult(boolean allocationFit, ArrowBuf buffer) { + this.allocationFit = allocationFit; + this.buffer = buffer; + } + + } + } diff --git a/java/memory/src/main/java/io/netty/buffer/ExpandableByteBuf.java b/java/memory/src/main/java/io/netty/buffer/ExpandableByteBuf.java index 7fb884daa3952..9f8af93109739 100644 --- a/java/memory/src/main/java/io/netty/buffer/ExpandableByteBuf.java +++ b/java/memory/src/main/java/io/netty/buffer/ExpandableByteBuf.java @@ -6,21 +6,23 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
<p>
* http://www.apache.org/licenses/LICENSE-2.0 - * + * <p>
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package io.netty.buffer; import org.apache.arrow.memory.BufferAllocator; /** - * Allows us to decorate ArrowBuf to make it expandable so that we can use them in the context of the Netty framework + * Allows us to decorate ArrowBuf to make it expandable so that we can use them in the context of + * the Netty framework * (thus supporting RPC level memory accounting). */ public class ExpandableByteBuf extends MutableWrappedByteBuf { diff --git a/java/memory/src/main/java/io/netty/buffer/LargeBuffer.java b/java/memory/src/main/java/io/netty/buffer/LargeBuffer.java index c026e430d77f3..9a6e402dad53e 100644 --- a/java/memory/src/main/java/io/netty/buffer/LargeBuffer.java +++ b/java/memory/src/main/java/io/netty/buffer/LargeBuffer.java @@ -6,21 +6,24 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
<p>
* http://www.apache.org/licenses/LICENSE-2.0 - * + * <p>
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package io.netty.buffer; /** - * A MutableWrappedByteBuf that also maintains a metric of the number of huge buffer bytes and counts. + * A MutableWrappedByteBuf that also maintains a metric of the number of huge buffer bytes and + * counts. */ public class LargeBuffer extends MutableWrappedByteBuf { + public LargeBuffer(ByteBuf buffer) { super(buffer); } diff --git a/java/memory/src/main/java/io/netty/buffer/MutableWrappedByteBuf.java b/java/memory/src/main/java/io/netty/buffer/MutableWrappedByteBuf.java index 5709473135e4b..a5683adccbc32 100644 --- a/java/memory/src/main/java/io/netty/buffer/MutableWrappedByteBuf.java +++ b/java/memory/src/main/java/io/netty/buffer/MutableWrappedByteBuf.java @@ -6,15 +6,16 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
<p>
* http://www.apache.org/licenses/LICENSE-2.0 - * + * <p>
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package io.netty.buffer; import java.io.IOException; @@ -26,16 +27,12 @@ import java.nio.channels.ScatteringByteChannel; /** - * This is basically a complete copy of DuplicatedByteBuf. We copy because we want to override some behaviors and make + * This is basically a complete copy of DuplicatedByteBuf. We copy because we want to override + * some behaviors and make * buffer mutable. */ abstract class MutableWrappedByteBuf extends AbstractByteBuf { - @Override - public ByteBuffer nioBuffer(int index, int length) { - return unwrap().nioBuffer(index, length); - } - ByteBuf buffer; public MutableWrappedByteBuf(ByteBuf buffer) { @@ -50,6 +47,11 @@ public MutableWrappedByteBuf(ByteBuf buffer) { setIndex(buffer.readerIndex(), buffer.writerIndex()); } + @Override + public ByteBuffer nioBuffer(int index, int length) { + return unwrap().nioBuffer(index, length); + } + @Override public ByteBuf unwrap() { return buffer; diff --git a/java/memory/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java b/java/memory/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java index a843ac5586e79..b6de2e3aa2acb 100644 --- a/java/memory/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java +++ b/java/memory/src/main/java/io/netty/buffer/PooledByteBufAllocatorL.java @@ -6,42 +6,44 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
<p>
* http://www.apache.org/licenses/LICENSE-2.0 - * + * <p>
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package io.netty.buffer; -import static org.apache.arrow.memory.util.AssertionUtil.ASSERT_ENABLED; +import io.netty.util.internal.StringUtil; + +import org.apache.arrow.memory.OutOfMemoryException; import java.lang.reflect.Field; import java.nio.ByteBuffer; import java.util.concurrent.atomic.AtomicLong; -import org.apache.arrow.memory.OutOfMemoryException; - -import io.netty.util.internal.StringUtil; +import static org.apache.arrow.memory.util.AssertionUtil.ASSERT_ENABLED; /** - * The base allocator that we use for all of Arrow's memory management. Returns UnsafeDirectLittleEndian buffers. + * The base allocator that we use for all of Arrow's memory management. Returns + * UnsafeDirectLittleEndian buffers. */ public class PooledByteBufAllocatorL { - private static final org.slf4j.Logger memoryLogger = org.slf4j.LoggerFactory.getLogger("arrow.allocator"); - private static final int MEMORY_LOGGER_FREQUENCY_SECONDS = 60; + private static final org.slf4j.Logger memoryLogger = org.slf4j.LoggerFactory.getLogger("arrow" + + ".allocator"); + private static final int MEMORY_LOGGER_FREQUENCY_SECONDS = 60; + public final UnsafeDirectLittleEndian empty; private final AtomicLong hugeBufferSize = new AtomicLong(0); private final AtomicLong hugeBufferCount = new AtomicLong(0); private final AtomicLong normalBufferSize = new AtomicLong(0); private final AtomicLong normalBufferCount = new AtomicLong(0); - private final InnerAllocator allocator; - public final UnsafeDirectLittleEndian empty; public PooledByteBufAllocatorL() { allocator = new InnerAllocator(); @@ -78,6 +80,7 @@ public long getNormalBufferCount() { } private static class AccountedUnsafeDirectLittleEndian extends UnsafeDirectLittleEndian { + private final long initialCapacity; private final AtomicLong count; private final AtomicLong size; @@ -89,7 +92,8 @@ private AccountedUnsafeDirectLittleEndian(LargeBuffer buf, AtomicLong count, Ato this.size = size; } - private AccountedUnsafeDirectLittleEndian(PooledUnsafeDirectByteBuf buf, AtomicLong count, AtomicLong size) { + private AccountedUnsafeDirectLittleEndian(PooledUnsafeDirectByteBuf buf, AtomicLong count, + AtomicLong size) { super(buf); this.initialCapacity = buf.capacity(); this.count = count; @@ -119,6 +123,7 @@ public boolean release(int decrement) { } private class InnerAllocator extends PooledByteBufAllocator { + private final PoolArena[] directArenas; private final MemoryStatusThread statusThread; private final int chunkSize; @@ -131,7 +136,8 @@ public InnerAllocator() { f.setAccessible(true); this.directArenas = (PoolArena[]) f.get(this); } catch (Exception e) { - throw new RuntimeException("Failure while initializing allocator. Unable to retrieve direct arenas field.", e); + throw new RuntimeException("Failure while initializing allocator. 
Unable to retrieve " + + "direct arenas field.", e); } this.chunkSize = directArenas[0].chunkSize; @@ -158,7 +164,8 @@ private UnsafeDirectLittleEndian newDirectBufferL(int initialCapacity, int maxCa hugeBufferCount.incrementAndGet(); // logger.debug("Allocating huge buffer of size {}", initialCapacity, new Exception()); - return new AccountedUnsafeDirectLittleEndian(new LargeBuffer(buf), hugeBufferCount, hugeBufferSize); + return new AccountedUnsafeDirectLittleEndian(new LargeBuffer(buf), hugeBufferCount, + hugeBufferSize); } else { // within chunk, use arena. ByteBuf buf = directArena.allocate(cache, initialCapacity, maxCapacity); @@ -173,7 +180,8 @@ private UnsafeDirectLittleEndian newDirectBufferL(int initialCapacity, int maxCa normalBufferSize.addAndGet(buf.capacity()); normalBufferCount.incrementAndGet(); - return new AccountedUnsafeDirectLittleEndian((PooledUnsafeDirectByteBuf) buf, normalBufferCount, normalBufferSize); + return new AccountedUnsafeDirectLittleEndian((PooledUnsafeDirectByteBuf) buf, + normalBufferCount, normalBufferSize); } } else { @@ -183,7 +191,8 @@ private UnsafeDirectLittleEndian newDirectBufferL(int initialCapacity, int maxCa private UnsupportedOperationException fail() { return new UnsupportedOperationException( - "Arrow requires that the JVM used supports access sun.misc.Unsafe. This platform didn't provide that functionality."); + "Arrow requires that the JVM used supports access sun.misc.Unsafe. This platform " + + "didn't provide that functionality."); } @Override @@ -203,7 +212,8 @@ public ByteBuf heapBuffer(int initialCapacity, int maxCapacity) { private void validate(int initialCapacity, int maxCapacity) { if (initialCapacity < 0) { - throw new IllegalArgumentException("initialCapacity: " + initialCapacity + " (expectd: 0+)"); + throw new IllegalArgumentException("initialCapacity: " + initialCapacity + " (expectd: " + + "0+)"); } if (initialCapacity > maxCapacity) { throw new IllegalArgumentException(String.format( @@ -212,26 +222,6 @@ private void validate(int initialCapacity, int maxCapacity) { } } - private class MemoryStatusThread extends Thread { - - public MemoryStatusThread() { - super("allocation.logger"); - this.setDaemon(true); - } - - @Override - public void run() { - while (true) { - memoryLogger.trace("Memory Usage: \n{}", PooledByteBufAllocatorL.this.toString()); - try { - Thread.sleep(MEMORY_LOGGER_FREQUENCY_SECONDS * 1000); - } catch (InterruptedException e) { - return; - } - } - } - } - @Override public String toString() { StringBuilder buf = new StringBuilder(); @@ -256,6 +246,26 @@ public String toString() { return buf.toString(); } + private class MemoryStatusThread extends Thread { + + public MemoryStatusThread() { + super("allocation.logger"); + this.setDaemon(true); + } + + @Override + public void run() { + while (true) { + memoryLogger.trace("Memory Usage: \n{}", PooledByteBufAllocatorL.this.toString()); + try { + Thread.sleep(MEMORY_LOGGER_FREQUENCY_SECONDS * 1000); + } catch (InterruptedException e) { + return; + } + } + } + } + } } diff --git a/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java b/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java index 5ea176745f25e..87d822f58a315 100644 --- a/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java +++ b/java/memory/src/main/java/io/netty/buffer/UnsafeDirectLittleEndian.java @@ -6,9 +6,9 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the 
License. You may obtain a copy of the License at - * + *
* http://www.apache.org/licenses/LICENSE-2.0 - * + *
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. @@ -18,22 +18,31 @@ package io.netty.buffer; +import io.netty.util.internal.PlatformDependent; + import java.io.IOException; import java.io.InputStream; import java.io.OutputStream; import java.nio.ByteOrder; import java.util.concurrent.atomic.AtomicLong; -import io.netty.util.internal.PlatformDependent; - /** - * The underlying class we use for little-endian access to memory. Is used underneath ArrowBufs to abstract away the + * The underlying class we use for little-endian access to memory. Is used underneath ArrowBufs + * to abstract away the * Netty classes and underlying Netty memory management. */ public class UnsafeDirectLittleEndian extends WrappedByteBuf { + + public static final boolean ASSERT_ENABLED; private static final boolean NATIVE_ORDER = ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN; private static final AtomicLong ID_GENERATOR = new AtomicLong(0); + static { + boolean isAssertEnabled = false; + assert isAssertEnabled = true; + ASSERT_ENABLED = isAssertEnabled; + } + public final long id = ID_GENERATOR.incrementAndGet(); private final AbstractByteBuf wrapped; private final long memoryAddress; @@ -60,21 +69,22 @@ private UnsafeDirectLittleEndian(AbstractByteBuf buf, boolean fake) { this.wrapped = buf; this.memoryAddress = buf.memoryAddress(); } - private long addr(int index) { - return memoryAddress + index; - } - @Override - public long getLong(int index) { + private long addr(int index) { + return memoryAddress + index; + } + + @Override + public long getLong(int index) { // wrapped.checkIndex(index, 8); - long v = PlatformDependent.getLong(addr(index)); - return v; - } + long v = PlatformDependent.getLong(addr(index)); + return v; + } - @Override - public float getFloat(int index) { - return Float.intBitsToFloat(getInt(index)); - } + @Override + public float getFloat(int index) { + return Float.intBitsToFloat(getInt(index)); + } @Override public ByteBuf slice() { @@ -259,12 +269,4 @@ public int hashCode() { return System.identityHashCode(this); } - public static final boolean ASSERT_ENABLED; - - static { - boolean isAssertEnabled = false; - assert isAssertEnabled = true; - ASSERT_ENABLED = isAssertEnabled; - } - } diff --git a/java/memory/src/main/java/org/apache/arrow/memory/Accountant.java b/java/memory/src/main/java/org/apache/arrow/memory/Accountant.java index 37c598ad89ece..6ddc8f784bc4a 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/Accountant.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/Accountant.java @@ -6,30 +6,33 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
* http://www.apache.org/licenses/LICENSE-2.0 - * + *
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package org.apache.arrow.memory; +import com.google.common.base.Preconditions; + import java.util.concurrent.atomic.AtomicLong; import javax.annotation.concurrent.ThreadSafe; -import com.google.common.base.Preconditions; - /** - * Provides a concurrent way to manage account for memory usage without locking. Used as basis for Allocators. All + * Provides a concurrent way to manage account for memory usage without locking. Used as basis + * for Allocators. All * operations are threadsafe (except for close). */ @ThreadSafe class Accountant implements AutoCloseable { - // private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(Accountant.class); + // private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(Accountant + // .class); /** * The parent allocator @@ -37,7 +40,8 @@ class Accountant implements AutoCloseable { protected final Accountant parent; /** - * The amount of memory reserved for this allocator. Releases below this amount of memory will not be returned to the + * The amount of memory reserved for this allocator. Releases below this amount of memory will + * not be returned to the * parent Accountant until this Accountant is closed. */ protected final long reservation; @@ -45,7 +49,8 @@ class Accountant implements AutoCloseable { private final AtomicLong peakAllocation = new AtomicLong(); /** - * Maximum local memory that can be held. This can be externally updated. Changing it won't cause past memory to + * Maximum local memory that can be held. This can be externally updated. Changing it won't + * cause past memory to * change but will change responses to future allocation efforts */ private final AtomicLong allocationLimit = new AtomicLong(); @@ -56,11 +61,14 @@ class Accountant implements AutoCloseable { private final AtomicLong locallyHeldMemory = new AtomicLong(); public Accountant(Accountant parent, long reservation, long maxAllocation) { - Preconditions.checkArgument(reservation >= 0, "The initial reservation size must be non-negative."); - Preconditions.checkArgument(maxAllocation >= 0, "The maximum allocation limit must be non-negative."); + Preconditions.checkArgument(reservation >= 0, "The initial reservation size must be " + + "non-negative."); + Preconditions.checkArgument(maxAllocation >= 0, "The maximum allocation limit must be " + + "non-negative."); Preconditions.checkArgument(reservation <= maxAllocation, "The initial reservation size must be <= the maximum allocation."); - Preconditions.checkArgument(reservation == 0 || parent != null, "The root accountant can't reserve memory."); + Preconditions.checkArgument(reservation == 0 || parent != null, "The root accountant can't " + + "reserve memory."); this.parent = parent; this.reservation = reservation; @@ -72,19 +80,20 @@ public Accountant(Accountant parent, long reservation, long maxAllocation) { if (!outcome.isOk()) { throw new OutOfMemoryException(String.format( "Failure trying to allocate initial reservation for Allocator. 
" - + "Attempted to allocate %d bytes and received an outcome of %s.", reservation, outcome.name())); + + "Attempted to allocate %d bytes and received an outcome of %s.", reservation, + outcome.name())); } } } /** - * Attempt to allocate the requested amount of memory. Either completely succeeds or completely fails. Constructs a a + * Attempt to allocate the requested amount of memory. Either completely succeeds or completely + * fails. Constructs a a * log of delta - * + *
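To make the all-or-nothing contract concrete, here is a minimal usage sketch through the public allocator API; the RootAllocator constructor and the byte sizes are illustrative assumptions, not part of this patch.

    import io.netty.buffer.ArrowBuf;
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.OutOfMemoryException;
    import org.apache.arrow.memory.RootAllocator;

    public class AccountingSketch {
      public static void main(String[] args) {
        // Sketch only: a root accountant capped at 1024 bytes.
        try (BufferAllocator root = new RootAllocator(1024)) {
          ArrowBuf first = root.buffer(512);   // succeeds; 512 bytes accounted
          try {
            root.buffer(1024);                 // 512 + 1024 exceeds the limit
          } catch (OutOfMemoryException e) {
            // the failed attempt made no lasting change to the accounting
          }
          first.release();                     // hands the 512 bytes back
        }
      }
    }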
* If it fails, no changes are made to accounting. * - * @param size - * The amount of memory to reserve in bytes. + * @param size The amount of memory to reserve in bytes. * @return True if the allocation was successful, false if the allocation failed. */ AllocationOutcome allocateBytes(long size) { @@ -116,8 +125,7 @@ private void updatePeak() { /** * Increase the accounting. Returns whether the allocation fit within limits. * - * @param size - * to increase + * @param size to increase * @return Whether the allocation fit within limits. */ boolean forceAllocate(long size) { @@ -126,24 +134,29 @@ boolean forceAllocate(long size) { } /** - * Internal method for allocation. This takes a forced approach to allocation to ensure that we manage reservation - * boundary issues consistently. Allocation is always done through the entire tree. The two options that we influence - * are whether the allocation should be forced and whether or not the peak memory allocation should be updated. If at - * some point during allocation escalation we determine that the allocation is no longer possible, we will continue to - * do a complete and consistent allocation but we will stop updating the peak allocation. We do this because we know - * that we will be directly unwinding this allocation (and thus never actually making the allocation). If force - * allocation is passed, then we continue to update the peak limits since we now know that this allocation will occur + * Internal method for allocation. This takes a forced approach to allocation to ensure that we + * manage reservation + * boundary issues consistently. Allocation is always done through the entire tree. The two + * options that we influence + * are whether the allocation should be forced and whether or not the peak memory allocation + * should be updated. If at + * some point during allocation escalation we determine that the allocation is no longer + * possible, we will continue to + * do a complete and consistent allocation but we will stop updating the peak allocation. We do + * this because we know + * that we will be directly unwinding this allocation (and thus never actually making the + * allocation). If force + * allocation is passed, then we continue to update the peak limits since we now know that this + * allocation will occur * despite our moving past one or more limits. * - * @param size - * The size of the allocation. - * @param incomingUpdatePeak - * Whether we should update the local peak for this allocation. - * @param forceAllocation - * Whether we should force the allocation. + * @param size The size of the allocation. + * @param incomingUpdatePeak Whether we should update the local peak for this allocation. + * @param forceAllocation Whether we should force the allocation. * @return The outcome of the allocation. 
*/ - private AllocationOutcome allocate(final long size, final boolean incomingUpdatePeak, final boolean forceAllocation) { + private AllocationOutcome allocate(final long size, final boolean incomingUpdatePeak, final + boolean forceAllocation) { final long newLocal = locallyHeldMemory.addAndGet(size); final long beyondReservation = newLocal - reservation; final boolean beyondLimit = newLocal > allocationLimit.get(); @@ -173,7 +186,7 @@ public void releaseBytes(long size) { Preconditions.checkArgument(newSize >= 0, "Accounted size went negative."); final long originalSize = newSize + size; - if(originalSize > reservation && parent != null){ + if (originalSize > reservation && parent != null) { // we deallocated memory that we should release to our parent. final long possibleAmountToReleaseToParent = originalSize - reservation; final long actualToReleaseToParent = Math.min(size, possibleAmountToReleaseToParent); @@ -182,16 +195,6 @@ public void releaseBytes(long size) { } - /** - * Set the maximum amount of memory that can be allocated in the this Accountant before failing an allocation. - * - * @param newLimit - * The limit in bytes. - */ - public void setLimit(long newLimit) { - allocationLimit.set(newLimit); - } - public boolean isOverLimit() { return getAllocatedMemory() > getLimit() || (parent != null && parent.isOverLimit()); } @@ -216,7 +219,18 @@ public long getLimit() { } /** - * Return the current amount of allocated memory that this Accountant is managing accounting for. Note this does not + * Set the maximum amount of memory that can be allocated in the this Accountant before failing + * an allocation. + * + * @param newLimit The limit in bytes. + */ + public void setLimit(long newLimit) { + allocationLimit.set(newLimit); + } + + /** + * Return the current amount of allocated memory that this Accountant is managing accounting + * for. Note this does not * include reservation memory that hasn't been allocated. * * @return Currently allocate memory in bytes. diff --git a/java/memory/src/main/java/org/apache/arrow/memory/AllocationListener.java b/java/memory/src/main/java/org/apache/arrow/memory/AllocationListener.java index 1b127f8181222..d36cb37fc2e24 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/AllocationListener.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/AllocationListener.java @@ -15,15 +15,17 @@ * See the License for the specific language governing permissions and * limitations under the License. */ + package org.apache.arrow.memory; /** * An allocation listener being notified for allocation/deallocation - * + *
* It is expected to be called from multiple threads and as such, * provider should take care of making the implementation thread-safe */ public interface AllocationListener { + public static final AllocationListener NOOP = new AllocationListener() { @Override public void onAllocation(long size) { diff --git a/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java b/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java index f15bb8a40fa01..683752e6a4980 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java @@ -6,53 +6,62 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
* http://www.apache.org/licenses/LICENSE-2.0 - * + *
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package org.apache.arrow.memory; -import static org.apache.arrow.memory.BaseAllocator.indent; +import com.google.common.base.Preconditions; -import java.util.IdentityHashMap; -import java.util.concurrent.atomic.AtomicInteger; -import java.util.concurrent.atomic.AtomicLong; -import java.util.concurrent.locks.ReadWriteLock; -import java.util.concurrent.locks.ReentrantReadWriteLock; +import io.netty.buffer.ArrowBuf; +import io.netty.buffer.PooledByteBufAllocatorL; +import io.netty.buffer.UnsafeDirectLittleEndian; import org.apache.arrow.memory.BaseAllocator.Verbosity; import org.apache.arrow.memory.util.AutoCloseableLock; import org.apache.arrow.memory.util.HistoricalLog; -import com.google.common.base.Preconditions; +import java.util.IdentityHashMap; +import java.util.concurrent.atomic.AtomicInteger; +import java.util.concurrent.atomic.AtomicLong; +import java.util.concurrent.locks.ReadWriteLock; +import java.util.concurrent.locks.ReentrantReadWriteLock; -import io.netty.buffer.ArrowBuf; -import io.netty.buffer.PooledByteBufAllocatorL; -import io.netty.buffer.UnsafeDirectLittleEndian; +import static org.apache.arrow.memory.BaseAllocator.indent; /** - * Manages the relationship between one or more allocators and a particular UDLE. Ensures that one allocator owns the - * memory that multiple allocators may be referencing. Manages a BufferLedger between each of its associated allocators. - * This class is also responsible for managing when memory is allocated and returned to the Netty-based + * Manages the relationship between one or more allocators and a particular UDLE. Ensures that + * one allocator owns the + * memory that multiple allocators may be referencing. Manages a BufferLedger between each of its + * associated allocators. + * This class is also responsible for managing when memory is allocated and returned to the + * Netty-based * PooledByteBufAllocatorL. - * - * The only reason that this isn't package private is we're forced to put ArrowBuf in Netty's package which need access + *
+ * The only reason that this isn't package private is we're forced to put ArrowBuf in Netty's + * package which needs access + * to these objects or methods. - * + *
+ * Threading: AllocationManager manages thread-safety internally. Operations within the context + * of a single BufferLedger + * are lockless in nature and can be leveraged by multiple threads. Operations that cross the + * context of two ledgers + * will acquire a lock on the AllocationManager instance. Important note, there is one + * AllocationManager per + * UnsafeDirectLittleEndian buffer allocation. As such, there will be thousands of these in a + * typical query. The * contention of acquiring a lock on AllocationManager should be very low. - * */ public class AllocationManager { - // private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(AllocationManager.class); + // private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger + // (AllocationManager.class); private static final AtomicLong MANAGER_ID_GENERATOR = new AtomicLong(0); private static final AtomicLong LEDGER_ID_GENERATOR = new AtomicLong(0); @@ -81,17 +90,19 @@ public class AllocationManager { this.root = accountingAllocator.root; this.underlying = INNER_ALLOCATOR.allocate(size); - // we do a no retain association since our creator will want to retrieve the newly created ledger and will create a + // we do a no retain association since our creator will want to retrieve the newly created + // ledger and will create a // reference count at that point this.owningLedger = associate(accountingAllocator, false); this.size = underlying.capacity(); } /** - * Associate the existing underlying buffer with a new allocator. This will increase the reference count to the + * Associate the existing underlying buffer with a new allocator. This will increase the + * reference count to the * provided ledger by 1. - * @param allocator - * The target allocator to associate this buffer with. + * + * @param allocator The target allocator to associate this buffer with. * @return The Ledger (new or existing) that associates the underlying buffer to this new ledger. */ BufferLedger associate(final BaseAllocator allocator) { @@ -118,7 +129,8 @@ private BufferLedger associate(final BaseAllocator allocator, final boolean reta } try (AutoCloseableLock write = writeLock.open()) { - // we have to recheck existing ledger since a second reader => writer could be competing with us. + // we have to recheck existing ledger since a second reader => writer could be competing + // with us. final BufferLedger existingLedger = map.get(allocator); if (existingLedger != null) { @@ -141,7 +153,8 @@ private BufferLedger associate(final BaseAllocator allocator, final boolean reta /** - * The way that a particular BufferLedger communicates back to the AllocationManager that it now longer needs to hold + * The way that a particular BufferLedger communicates back to the AllocationManager that it + * now longer needs to hold * a reference to particular piece of memory. */ private class ReleaseListener { @@ -169,16 +182,19 @@ public void release() { amDestructionTime = System.nanoTime(); owningLedger = null; } else { - // we need to change the owning allocator. we've been removed so we'll get whatever is top of list + // we need to change the owning allocator. we've been removed so we'll get whatever is + // top of list BufferLedger newLedger = map.values().iterator().next(); - // we'll forcefully transfer the ownership and not worry about whether we exceeded the limit + // we'll forcefully transfer the ownership and not worry about whether we exceeded the + // limit // since this consumer can't do anything with this. 
oldLedger.transferBalance(newLedger); } } else { if (map.isEmpty()) { - throw new IllegalStateException("The final removal of a ledger should be connected to the owning ledger."); + throw new IllegalStateException("The final removal of a ledger should be connected to " + + "the owning ledger."); } } @@ -187,25 +203,30 @@ public void release() { } /** - * The reference manager that binds an allocator manager to a particular BaseAllocator. Also responsible for creating + * The reference manager that binds an allocator manager to a particular BaseAllocator. Also + * responsible for creating * a set of ArrowBufs that share a common fate and set of reference counts. - * As with AllocationManager, the only reason this is public is due to ArrowBuf being in io.netty.buffer package. + * As with AllocationManager, the only reason this is public is due to ArrowBuf being in io + * .netty.buffer package. */ public class BufferLedger { private final IdentityHashMap buffers = BaseAllocator.DEBUG ? new IdentityHashMap() : null; - private final long ledgerId = LEDGER_ID_GENERATOR.incrementAndGet(); // unique ID assigned to each ledger - private final AtomicInteger bufRefCnt = new AtomicInteger(0); // start at zero so we can manage request for retain - // correctly + private final long ledgerId = LEDGER_ID_GENERATOR.incrementAndGet(); // unique ID assigned to + // each ledger + private final AtomicInteger bufRefCnt = new AtomicInteger(0); // start at zero so we can + // manage request for retain + // correctly private final long lCreationTime = System.nanoTime(); - private volatile long lDestructionTime = 0; private final BaseAllocator allocator; private final ReleaseListener listener; - private final HistoricalLog historicalLog = BaseAllocator.DEBUG ? new HistoricalLog(BaseAllocator.DEBUG_LOG_LENGTH, - "BufferLedger[%d]", 1) + private final HistoricalLog historicalLog = BaseAllocator.DEBUG ? new HistoricalLog + (BaseAllocator.DEBUG_LOG_LENGTH, + "BufferLedger[%d]", 1) : null; + private volatile long lDestructionTime = 0; private BufferLedger(BaseAllocator allocator, ReleaseListener listener) { this.allocator = allocator; @@ -213,10 +234,11 @@ private BufferLedger(BaseAllocator allocator, ReleaseListener listener) { } /** - * Transfer any balance the current ledger has to the target ledger. In the case that the current ledger holds no + * Transfer any balance the current ledger has to the target ledger. In the case that the + * current ledger holds no * memory, no transfer is made to the new ledger. - * @param target - * The ledger to transfer ownership account to. + * + * @param target The ledger to transfer ownership account to. * @return Whether transfer fit within target ledgers limits. */ public boolean transferBalance(final BufferLedger target) { @@ -231,7 +253,8 @@ public boolean transferBalance(final BufferLedger target) { return true; } - // since two balance transfers out from the allocator manager could cause incorrect accounting, we need to ensure + // since two balance transfers out from the allocator manager could cause incorrect + // accounting, we need to ensure // that this won't happen by synchronizing on the allocator manager instance. try (AutoCloseableLock write = writeLock.open()) { if (owningLedger != this) { @@ -253,12 +276,10 @@ public boolean transferBalance(final BufferLedger target) { /** * Print the current ledger state to a the provided StringBuilder. - * @param sb - * The StringBuilder to populate. - * @param indent - * The level of indentation to position the data. 
- * @param verbosity - * The level of verbosity to print. + * + * @param sb The StringBuilder to populate. + * @param indent The level of indentation to position the data. + * @param verbosity The level of verbosity to print. */ public void print(StringBuilder sb, int indent, Verbosity verbosity) { indent(sb, indent) @@ -304,7 +325,8 @@ private void inc() { } /** - * Decrement the ledger's reference count. If the ledger is decremented to zero, this ledger should release its + * Decrement the ledger's reference count. If the ledger is decremented to zero, this ledger + * should release its * ownership back to the AllocationManager */ public int decrement(int decrement) { @@ -323,15 +345,19 @@ public int decrement(int decrement) { } /** - * Returns the ledger associated with a particular BufferAllocator. If the BufferAllocator doesn't currently have a - * ledger associated with this AllocationManager, a new one is created. This is placed on BufferLedger rather than - * AllocationManager directly because ArrowBufs don't have access to AllocationManager and they are the ones - * responsible for exposing the ability to associate multiple allocators with a particular piece of underlying - * memory. Note that this will increment the reference count of this ledger by one to ensure the ledger isn't + * Returns the ledger associated with a particular BufferAllocator. If the BufferAllocator + * doesn't currently have a + * ledger associated with this AllocationManager, a new one is created. This is placed on + * BufferLedger rather than + * AllocationManager directly because ArrowBufs don't have access to AllocationManager and + * they are the ones + * responsible for exposing the ability to associate multiple allocators with a particular + * piece of underlying + * memory. Note that this will increment the reference count of this ledger by one to ensure + * the ledger isn't * destroyed before use. * - * @param allocator - * A BufferAllocator. + * @param allocator A BufferAllocator. * @return The ledger associated with the BufferAllocator. */ public BufferLedger getLedgerForAllocator(BufferAllocator allocator) { @@ -339,13 +365,14 @@ public BufferLedger getLedgerForAllocator(BufferAllocator allocator) { } /** - * Create a new ArrowBuf associated with this AllocationManager and memory. Does not impact reference count. + * Create a new ArrowBuf associated with this AllocationManager and memory. Does not impact + * reference count. * Typically used for slicing. - * @param offset - * The offset in bytes to start this new ArrowBuf. - * @param length - * The length in bytes that this ArrowBuf will provide access to. - * @return A new ArrowBuf that shares references with all ArrowBufs associated with this BufferLedger + * + * @param offset The offset in bytes to start this new ArrowBuf. + * @param length The length in bytes that this ArrowBuf will provide access to. + * @return A new ArrowBuf that shares references with all ArrowBufs associated with this + * BufferLedger */ public ArrowBuf newArrowBuf(int offset, int length) { allocator.assertOpen(); @@ -354,13 +381,13 @@ public ArrowBuf newArrowBuf(int offset, int length) { /** * Create a new ArrowBuf associated with this AllocationManager and memory. - * @param offset - * The offset in bytes to start this new ArrowBuf. - * @param length - * The length in bytes that this ArrowBuf will provide access to. 
- * @param manager - * An optional BufferManager argument that can be used to manage expansion of this ArrowBuf - * @return A new ArrowBuf that shares references with all ArrowBufs associated with this BufferLedger + * + * @param offset The offset in bytes to start this new ArrowBuf. + * @param length The length in bytes that this ArrowBuf will provide access to. + * @param manager An optional BufferManager argument that can be used to manage expansion of + * this ArrowBuf + * @return A new ArrowBuf that shares references with all ArrowBufs associated with this + * BufferLedger */ public ArrowBuf newArrowBuf(int offset, int length, BufferManager manager) { allocator.assertOpen(); @@ -377,7 +404,8 @@ public ArrowBuf newArrowBuf(int offset, int length, BufferManager manager) { if (BaseAllocator.DEBUG) { historicalLog.recordEvent( - "ArrowBuf(BufferLedger, BufferAllocator[%s], UnsafeDirectLittleEndian[identityHashCode == " + "ArrowBuf(BufferLedger, BufferAllocator[%s], " + + "UnsafeDirectLittleEndian[identityHashCode == " + "%d](%s)) => ledger hc == %d", allocator.name, System.identityHashCode(buf), buf.toString(), System.identityHashCode(this)); @@ -401,7 +429,8 @@ public int getSize() { } /** - * How much memory is accounted for by this ledger. This is either getSize() if this is the owning ledger for the + * How much memory is accounted for by this ledger. This is either getSize() if this is the + * owning ledger for the * memory or zero in the case that this is not the owning ledger associated with this memory. * * @return Amount of accounted(owned) memory associated with this ledger. diff --git a/java/memory/src/main/java/org/apache/arrow/memory/AllocationReservation.java b/java/memory/src/main/java/org/apache/arrow/memory/AllocationReservation.java index 68d1244d1e328..7f5aa313779a7 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/AllocationReservation.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/AllocationReservation.java @@ -6,32 +6,36 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
* http://www.apache.org/licenses/LICENSE-2.0 - * + *
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package org.apache.arrow.memory; import io.netty.buffer.ArrowBuf; /** - * Supports cumulative allocation reservation. Clients may increase the size of the reservation repeatedly until they - * call for an allocation of the current total size. The reservation can only be used once, and will throw an exception + * Supports cumulative allocation reservation. Clients may increase the size of the reservation + * repeatedly until they + * call for an allocation of the current total size. The reservation can only be used once, and + * will throw an exception * if it is used more than once. *
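A short sketch of the add-then-allocate lifecycle described here, assuming the BufferAllocator, AllocationReservation, and ArrowBuf types from this module; the sizes are arbitrary:

    // Sketch only: grow the reservation, spend it at most once, always close it.
    void reservationSketch(BufferAllocator allocator) throws Exception {
      try (AllocationReservation reservation = allocator.newReservation()) {
        if (reservation.add(300) && reservation.add(200)) { // rounded up to 512 and 256
          ArrowBuf buf = reservation.allocateBuffer();      // one shot: 768 bytes total
          buf.release();
        }
      } // closing an unused reservation returns the reserved bytes
    }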
- * For the purposes of airtight memory accounting, the reservation must be close()d whether it is used or not. + * For the purposes of airtight memory accounting, the reservation must be close()d whether it is + * used or not. * This is not threadsafe. */ public interface AllocationReservation extends AutoCloseable { /** * Add to the current reservation. - * + *
*
Adding may fail if the allocator is not allowed to consume any more space. * * @param nBytes the number of bytes to add @@ -42,7 +46,7 @@ public interface AllocationReservation extends AutoCloseable { /** * Requests a reservation of additional space. - * + *
*
The implementation of the allocator's inner class provides this. * * @param nBytes the amount to reserve @@ -52,7 +56,7 @@ public interface AllocationReservation extends AutoCloseable { /** * Allocate a buffer whose size is the total of all the add()s made. - * + *
*
The allocation request can still fail, even if the amount of space * requested is available, if the allocation cannot be made contiguously. * diff --git a/java/memory/src/main/java/org/apache/arrow/memory/AllocatorClosedException.java b/java/memory/src/main/java/org/apache/arrow/memory/AllocatorClosedException.java index 3274642dedd59..d5b638e1ed298 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/AllocatorClosedException.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/AllocatorClosedException.java @@ -6,15 +6,16 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
* http://www.apache.org/licenses/LICENSE-2.0 - * + *
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package org.apache.arrow.memory; /** @@ -23,6 +24,7 @@ */ @SuppressWarnings("serial") public class AllocatorClosedException extends RuntimeException { + /** * @param message string associated with the cause */ diff --git a/java/memory/src/main/java/org/apache/arrow/memory/ArrowByteBufAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/ArrowByteBufAllocator.java index 5dc5ac397bd93..b8b5283423c82 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/ArrowByteBufAllocator.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/ArrowByteBufAllocator.java @@ -6,15 +6,16 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
* http://www.apache.org/licenses/LICENSE-2.0 - * + *
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package org.apache.arrow.memory; import io.netty.buffer.ByteBuf; @@ -23,9 +24,12 @@ import io.netty.buffer.ExpandableByteBuf; /** - * An implementation of ByteBufAllocator that wraps a Arrow BufferAllocator. This allows the RPC layer to be accounted - * and managed using Arrow's BufferAllocator infrastructure. The only thin different from a typical BufferAllocator is - * the signature and the fact that this Allocator returns ExpandableByteBufs which enable otherwise non-expandable + * An implementation of ByteBufAllocator that wraps a Arrow BufferAllocator. This allows the RPC + * layer to be accounted + * and managed using Arrow's BufferAllocator infrastructure. The only thin different from a + * typical BufferAllocator is + * the signature and the fact that this Allocator returns ExpandableByteBufs which enable + * otherwise non-expandable * ArrowBufs to be expandable. */ public class ArrowByteBufAllocator implements ByteBufAllocator { diff --git a/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java index 9edafbce082cb..aaa7ce804c3e5 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/BaseAllocator.java @@ -6,57 +6,54 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
* http://www.apache.org/licenses/LICENSE-2.0 - * + *
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package org.apache.arrow.memory; -import java.util.Arrays; -import java.util.IdentityHashMap; -import java.util.Set; -import java.util.concurrent.atomic.AtomicInteger; +import com.google.common.base.Preconditions; + +import io.netty.buffer.ArrowBuf; +import io.netty.buffer.UnsafeDirectLittleEndian; import org.apache.arrow.memory.AllocationManager.BufferLedger; import org.apache.arrow.memory.util.AssertionUtil; import org.apache.arrow.memory.util.HistoricalLog; -import com.google.common.base.Preconditions; - -import io.netty.buffer.ArrowBuf; -import io.netty.buffer.UnsafeDirectLittleEndian; +import java.util.Arrays; +import java.util.IdentityHashMap; +import java.util.Set; +import java.util.concurrent.atomic.AtomicInteger; public abstract class BaseAllocator extends Accountant implements BufferAllocator { - private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(BaseAllocator.class); public static final String DEBUG_ALLOCATOR = "arrow.memory.debug.allocator"; - public static final int DEBUG_LOG_LENGTH = 6; public static final boolean DEBUG = AssertionUtil.isAssertionsEnabled() || Boolean.parseBoolean(System.getProperty(DEBUG_ALLOCATOR, "false")); + private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(BaseAllocator + .class); + // Package exposed for sharing between AllocatorManger and BaseAllocator objects + final String name; + final RootAllocator root; private final Object DEBUG_LOCK = DEBUG ? new Object() : null; - private final AllocationListener listener; private final BaseAllocator parentAllocator; private final ArrowByteBufAllocator thisAsByteBufAllocator; private final IdentityHashMap childAllocators; private final ArrowBuf empty; - - private volatile boolean isClosed = false; // the allocator has been closed - - // Package exposed for sharing between AllocatorManger and BaseAllocator objects - final String name; - final RootAllocator root; - // members used purely for debugging private final IdentityHashMap childLedgers; private final IdentityHashMap reservations; private final HistoricalLog historicalLog; + private volatile boolean isClosed = false; // the allocator has been closed protected BaseAllocator( final AllocationListener listener, @@ -91,7 +88,8 @@ private BaseAllocator( this.root = (RootAllocator) this; empty = createEmpty(); } else { - throw new IllegalStateException("An parent allocator must either carry a root or be the root."); + throw new IllegalStateException("An parent allocator must either carry a root or be the " + + "root."); } this.parentAllocator = parentAllocator; @@ -114,11 +112,52 @@ private BaseAllocator( } + private static String createErrorMsg(final BufferAllocator allocator, final int rounded, final + int requested) { + if (rounded != requested) { + return String.format( + "Unable to allocate buffer of size %d (rounded from %d) due to memory limit. Current " + + "allocation: %d", + rounded, requested, allocator.getAllocatedMemory()); + } else { + return String.format("Unable to allocate buffer of size %d due to memory limit. Current " + + "allocation: %d", + rounded, allocator.getAllocatedMemory()); + } + } + + /** + * Rounds up the provided value to the nearest power of two. 
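Tracing the arithmetic of the method body that follows, with a sample value:

    int val = 1000;
    int highestBit = Integer.highestOneBit(val);            // 512
    int next = (highestBit == val) ? val : highestBit << 1; // 1024
    // exact powers of two pass through unchanged: nextPowerOfTwo(1024) == 1024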
+ * + * @param val An integer value. + * @return The closest power of two of that value. + */ + static int nextPowerOfTwo(int val) { + int highestBit = Integer.highestOneBit(val); + if (highestBit == val) { + return val; + } else { + return highestBit << 1; + } + } + + public static StringBuilder indent(StringBuilder sb, int indent) { + final char[] indentation = new char[indent * 2]; + Arrays.fill(indentation, ' '); + sb.append(indentation); + return sb; + } + + public static boolean isDebug() { + return DEBUG; + } + @Override public void assertOpen() { if (AssertionUtil.ASSERT_ENABLED) { if (isClosed) { - throw new IllegalStateException("Attempting operation on allocator when allocator is closed.\n" + throw new IllegalStateException("Attempting operation on allocator when allocator is " + + "closed.\n" + toVerboseString()); } } @@ -136,7 +175,8 @@ public ArrowBuf getEmpty() { } /** - * For debug/verification purposes only. Allows an AllocationManager to tell the allocator that we have a new ledger + * For debug/verification purposes only. Allows an AllocationManager to tell the allocator that + * we have a new ledger * associated with this allocator. */ void associateLedger(BufferLedger ledger) { @@ -149,7 +189,8 @@ void associateLedger(BufferLedger ledger) { } /** - * For debug/verification purposes only. Allows an AllocationManager to tell the allocator that we are removing a + * For debug/verification purposes only. Allows an AllocationManager to tell the allocator that + * we are removing a * ledger associated with this allocator */ void dissociateLedger(BufferLedger ledger) { @@ -167,8 +208,7 @@ void dissociateLedger(BufferLedger ledger) { /** * Track when a ChildAllocator of this BaseAllocator is closed. Used for debugging purposes. * - * @param childAllocator - * The child allocator that has been closed. + * @param childAllocator The child allocator that has been closed. */ private void childClosed(final BaseAllocator childAllocator) { assertOpen(); @@ -187,17 +227,6 @@ private void childClosed(final BaseAllocator childAllocator) { } } - private static String createErrorMsg(final BufferAllocator allocator, final int rounded, final int requested) { - if (rounded != requested) { - return String.format( - "Unable to allocate buffer of size %d (rounded from %d) due to memory limit. Current allocation: %d", - rounded, requested, allocator.getAllocatedMemory()); - } else { - return String.format("Unable to allocate buffer of size %d due to memory limit. Current allocation: %d", - rounded, allocator.getAllocatedMemory()); - } - } - @Override public ArrowBuf buffer(final int initialRequestSize) { assertOpen(); @@ -205,7 +234,7 @@ public ArrowBuf buffer(final int initialRequestSize) { return buffer(initialRequestSize, null); } - private ArrowBuf createEmpty(){ + private ArrowBuf createEmpty() { assertOpen(); return new ArrowBuf(new AtomicInteger(), null, AllocationManager.EMPTY, null, null, 0, 0, true); @@ -221,7 +250,8 @@ public ArrowBuf buffer(final int initialRequestSize, BufferManager manager) { return empty; } - // round to next largest power of two if we're within a chunk since that is how our allocator operates + // round to next largest power of two if we're within a chunk since that is how our allocator + // operates final int actualRequestSize = initialRequestSize < AllocationManager.CHUNK_SIZE ? 
nextPowerOfTwo(initialRequestSize) : initialRequestSize; @@ -245,10 +275,12 @@ public ArrowBuf buffer(final int initialRequestSize, BufferManager manager) { } /** - * Used by usual allocation as well as for allocating a pre-reserved buffer. Skips the typical accounting associated + * Used by usual allocation as well as for allocating a pre-reserved buffer. Skips the typical + * accounting associated * with creating a new buffer. */ - private ArrowBuf bufferWithoutReservation(final int size, BufferManager bufferManager) throws OutOfMemoryException { + private ArrowBuf bufferWithoutReservation(final int size, BufferManager bufferManager) throws + OutOfMemoryException { assertOpen(); final AllocationManager manager = new AllocationManager(this, size); @@ -274,185 +306,20 @@ public BufferAllocator newChildAllocator( final long maxAllocation) { assertOpen(); - final ChildAllocator childAllocator = new ChildAllocator(this, name, initReservation, maxAllocation); + final ChildAllocator childAllocator = new ChildAllocator(this, name, initReservation, + maxAllocation); if (DEBUG) { synchronized (DEBUG_LOCK) { childAllocators.put(childAllocator, childAllocator); - historicalLog.recordEvent("allocator[%s] created new child allocator[%s]", name, childAllocator.name); + historicalLog.recordEvent("allocator[%s] created new child allocator[%s]", name, + childAllocator.name); } } return childAllocator; } - public class Reservation implements AllocationReservation { - private int nBytes = 0; - private boolean used = false; - private boolean closed = false; - private final HistoricalLog historicalLog; - - public Reservation() { - if (DEBUG) { - historicalLog = new HistoricalLog("Reservation[allocator[%s], %d]", name, System.identityHashCode(this)); - historicalLog.recordEvent("created"); - synchronized (DEBUG_LOCK) { - reservations.put(this, this); - } - } else { - historicalLog = null; - } - } - - @Override - public boolean add(final int nBytes) { - assertOpen(); - - Preconditions.checkArgument(nBytes >= 0, "nBytes(%d) < 0", nBytes); - Preconditions.checkState(!closed, "Attempt to increase reservation after reservation has been closed"); - Preconditions.checkState(!used, "Attempt to increase reservation after reservation has been used"); - - // we round up to next power of two since all reservations are done in powers of two. This may overestimate the - // preallocation since someone may perceive additions to be power of two. If this becomes a problem, we can look - // at - // modifying this behavior so that we maintain what we reserve and what the user asked for and make sure to only - // round to power of two as necessary. 
- final int nBytesTwo = BaseAllocator.nextPowerOfTwo(nBytes); - if (!reserve(nBytesTwo)) { - return false; - } - - this.nBytes += nBytesTwo; - return true; - } - - @Override - public ArrowBuf allocateBuffer() { - assertOpen(); - - Preconditions.checkState(!closed, "Attempt to allocate after closed"); - Preconditions.checkState(!used, "Attempt to allocate more than once"); - - final ArrowBuf arrowBuf = allocate(nBytes); - used = true; - return arrowBuf; - } - - @Override - public int getSize() { - return nBytes; - } - - @Override - public boolean isUsed() { - return used; - } - - @Override - public boolean isClosed() { - return closed; - } - - @Override - public void close() { - assertOpen(); - - if (closed) { - return; - } - - if (DEBUG) { - if (!isClosed()) { - final Object object; - synchronized (DEBUG_LOCK) { - object = reservations.remove(this); - } - if (object == null) { - final StringBuilder sb = new StringBuilder(); - print(sb, 0, Verbosity.LOG_WITH_STACKTRACE); - logger.debug(sb.toString()); - throw new IllegalStateException( - String.format("Didn't find closing reservation[%d]", System.identityHashCode(this))); - } - - historicalLog.recordEvent("closed"); - } - } - - if (!used) { - releaseReservation(nBytes); - } - - closed = true; - } - - @Override - public boolean reserve(int nBytes) { - assertOpen(); - - final AllocationOutcome outcome = BaseAllocator.this.allocateBytes(nBytes); - - if (DEBUG) { - historicalLog.recordEvent("reserve(%d) => %s", nBytes, Boolean.toString(outcome.isOk())); - } - - return outcome.isOk(); - } - - /** - * Allocate the a buffer of the requested size. - * - *
- * The implementation of the allocator's inner class provides this. - * - * @param nBytes - * the size of the buffer requested - * @return the buffer, or null, if the request cannot be satisfied - */ - private ArrowBuf allocate(int nBytes) { - assertOpen(); - - boolean success = false; - - /* - * The reservation already added the requested bytes to the allocators owned and allocated bytes via reserve(). - * This ensures that they can't go away. But when we ask for the buffer here, that will add to the allocated bytes - * as well, so we need to return the same number back to avoid double-counting them. - */ - try { - final ArrowBuf arrowBuf = BaseAllocator.this.bufferWithoutReservation(nBytes, null); - - listener.onAllocation(nBytes); - if (DEBUG) { - historicalLog.recordEvent("allocate() => %s", String.format("ArrowBuf[%d]", arrowBuf.getId())); - } - success = true; - return arrowBuf; - } finally { - if (!success) { - releaseBytes(nBytes); - } - } - } - - /** - * Return the reservation back to the allocator without having used it. - * - * @param nBytes - * the size of the reservation - */ - private void releaseReservation(int nBytes) { - assertOpen(); - - releaseBytes(nBytes); - - if (DEBUG) { - historicalLog.recordEvent("releaseReservation(%d)", nBytes); - } - } - - } - @Override public AllocationReservation newReservation() { assertOpen(); @@ -460,7 +327,6 @@ public AllocationReservation newReservation() { return new Reservation(); } - @Override public synchronized void close() { /* @@ -474,7 +340,7 @@ public synchronized void close() { isClosed = true; if (DEBUG) { - synchronized(DEBUG_LOCK) { + synchronized (DEBUG_LOCK) { verifyAllocator(); // are there outstanding child allocators? @@ -488,7 +354,8 @@ public synchronized void close() { } throw new IllegalStateException( - String.format("Allocator[%s] closed with outstanding child allocators.\n%s", name, toString())); + String.format("Allocator[%s] closed with outstanding child allocators.\n%s", name, + toString())); } // are there outstanding buffers? @@ -501,7 +368,8 @@ public synchronized void close() { if (reservations.size() != 0) { throw new IllegalStateException( - String.format("Allocator[%s] closed with outstanding reservations (%d).\n%s", name, reservations.size(), + String.format("Allocator[%s] closed with outstanding reservations (%d).\n%s", name, + reservations.size(), toString())); } @@ -512,7 +380,8 @@ public synchronized void close() { final long allocated = getAllocatedMemory(); if (allocated > 0) { throw new IllegalStateException( - String.format("Memory was leaked by query. Memory leaked: (%d)\n%s", allocated, toString())); + String.format("Memory was leaked by query. Memory leaked: (%d)\n%s", allocated, + toString())); } // we need to release our memory to our parent before we tell it we've closed. @@ -543,7 +412,8 @@ public String toString() { } /** - * Provide a verbose string of the current allocator state. Includes the state of all child allocators, along with + * Provide a verbose string of the current allocator state. Includes the state of all child + * allocators, along with * historical logs of each object and including stacktraces. * * @return A Verbose string of current allocator state. @@ -559,48 +429,32 @@ private void hist(String noteFormat, Object... args) { historicalLog.recordEvent(noteFormat, args); } - /** - * Rounds up the provided value to the nearest power of two. - * - * @param val - * An integer value. - * @return The closest power of two of that value. 
- */ - static int nextPowerOfTwo(int val) { - int highestBit = Integer.highestOneBit(val); - if (highestBit == val) { - return val; - } else { - return highestBit << 1; - } - } - - /** * Verifies the accounting state of the allocator. Only works for DEBUG. * - * @throws IllegalStateException - * when any problems are found + * @throws IllegalStateException when any problems are found */ void verifyAllocator() { - final IdentityHashMap buffersSeen = new IdentityHashMap<>(); + final IdentityHashMap buffersSeen = new + IdentityHashMap<>(); verifyAllocator(buffersSeen); } /** * Verifies the accounting state of the allocator. Only works for DEBUG. - * *
- * This overload is used for recursive calls, allowing for checking that ArrowBufs are unique across all allocators + *
+ * This overload is used for recursive calls, allowing for checking that ArrowBufs are unique + * across all allocators * that are checked. *
* - * @param buffersSeen - * a map of buffers that have already been seen when walking a tree of allocators - * @throws IllegalStateException - * when any problems are found + * @param buffersSeen a map of buffers that have already been seen when walking a tree of + * allocators + * @throws IllegalStateException when any problems are found */ - private void verifyAllocator(final IdentityHashMap buffersSeen) { + private void verifyAllocator(final IdentityHashMap + buffersSeen) { // The remaining tests can only be performed if we're in debug mode. if (!DEBUG) { return; @@ -618,7 +472,8 @@ private void verifyAllocator(final IdentityHashMap ledgerS } } - - public static StringBuilder indent(StringBuilder sb, int indent) { - final char[] indentation = new char[indent * 2]; - Arrays.fill(indentation, ' '); - sb.append(indentation); - return sb; - } - public static enum Verbosity { BASIC(false, false), // only include basic information LOG(true, false), // include basic @@ -800,7 +651,179 @@ public static enum Verbosity { } } - public static boolean isDebug() { - return DEBUG; + public class Reservation implements AllocationReservation { + + private final HistoricalLog historicalLog; + private int nBytes = 0; + private boolean used = false; + private boolean closed = false; + + public Reservation() { + if (DEBUG) { + historicalLog = new HistoricalLog("Reservation[allocator[%s], %d]", name, System + .identityHashCode(this)); + historicalLog.recordEvent("created"); + synchronized (DEBUG_LOCK) { + reservations.put(this, this); + } + } else { + historicalLog = null; + } + } + + @Override + public boolean add(final int nBytes) { + assertOpen(); + + Preconditions.checkArgument(nBytes >= 0, "nBytes(%d) < 0", nBytes); + Preconditions.checkState(!closed, "Attempt to increase reservation after reservation has " + + "been closed"); + Preconditions.checkState(!used, "Attempt to increase reservation after reservation has been" + + " used"); + + // we round up to next power of two since all reservations are done in powers of two. This + // may overestimate the + // preallocation since someone may perceive additions to be power of two. If this becomes a + // problem, we can look + // at + // modifying this behavior so that we maintain what we reserve and what the user asked for + // and make sure to only + // round to power of two as necessary. 
+ final int nBytesTwo = BaseAllocator.nextPowerOfTwo(nBytes); + if (!reserve(nBytesTwo)) { + return false; + } + + this.nBytes += nBytesTwo; + return true; + } + + @Override + public ArrowBuf allocateBuffer() { + assertOpen(); + + Preconditions.checkState(!closed, "Attempt to allocate after closed"); + Preconditions.checkState(!used, "Attempt to allocate more than once"); + + final ArrowBuf arrowBuf = allocate(nBytes); + used = true; + return arrowBuf; + } + + @Override + public int getSize() { + return nBytes; + } + + @Override + public boolean isUsed() { + return used; + } + + @Override + public boolean isClosed() { + return closed; + } + + @Override + public void close() { + assertOpen(); + + if (closed) { + return; + } + + if (DEBUG) { + if (!isClosed()) { + final Object object; + synchronized (DEBUG_LOCK) { + object = reservations.remove(this); + } + if (object == null) { + final StringBuilder sb = new StringBuilder(); + print(sb, 0, Verbosity.LOG_WITH_STACKTRACE); + logger.debug(sb.toString()); + throw new IllegalStateException( + String.format("Didn't find closing reservation[%d]", System.identityHashCode + (this))); + } + + historicalLog.recordEvent("closed"); + } + } + + if (!used) { + releaseReservation(nBytes); + } + + closed = true; + } + + @Override + public boolean reserve(int nBytes) { + assertOpen(); + + final AllocationOutcome outcome = BaseAllocator.this.allocateBytes(nBytes); + + if (DEBUG) { + historicalLog.recordEvent("reserve(%d) => %s", nBytes, Boolean.toString(outcome.isOk())); + } + + return outcome.isOk(); + } + + /** + * Allocate the a buffer of the requested size. + *
+ *
+ * The implementation of the allocator's inner class provides this. + * + * @param nBytes the size of the buffer requested + * @return the buffer, or null, if the request cannot be satisfied + */ + private ArrowBuf allocate(int nBytes) { + assertOpen(); + + boolean success = false; + + /* + * The reservation already added the requested bytes to the allocators owned and allocated + * bytes via reserve(). + * This ensures that they can't go away. But when we ask for the buffer here, that will add + * to the allocated bytes + * as well, so we need to return the same number back to avoid double-counting them. + */ + try { + final ArrowBuf arrowBuf = BaseAllocator.this.bufferWithoutReservation(nBytes, null); + + listener.onAllocation(nBytes); + if (DEBUG) { + historicalLog.recordEvent("allocate() => %s", String.format("ArrowBuf[%d]", arrowBuf + .getId())); + } + success = true; + return arrowBuf; + } finally { + if (!success) { + releaseBytes(nBytes); + } + } + } + + /** + * Return the reservation back to the allocator without having used it. + * + * @param nBytes the size of the reservation + */ + private void releaseReservation(int nBytes) { + assertOpen(); + + releaseBytes(nBytes); + + if (DEBUG) { + historicalLog.recordEvent("releaseReservation(%d)", nBytes); + } + } + } } diff --git a/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java b/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java index 4e88c734ab4be..b0e9cd8c1a0e9 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/BoundsChecking.java @@ -6,21 +6,22 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
* http://www.apache.org/licenses/LICENSE-2.0 - * + *
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package org.apache.arrow.memory; public class BoundsChecking { - static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(BoundsChecking.class); public static final boolean BOUNDS_CHECKING_ENABLED; + static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(BoundsChecking.class); static { boolean isAssertEnabled = false; diff --git a/java/memory/src/main/java/org/apache/arrow/memory/BufferAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/BufferAllocator.java index 356a3416cbf85..81ffb1bec780e 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/BufferAllocator.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/BufferAllocator.java @@ -6,47 +6,48 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
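The BoundsChecking static block above turns on its flag with an assert that carries a side effect; unpacked as plain Java (semantics only, not code added by this patch):

    boolean isAssertEnabled = false;
    // an assignment expression inside assert: it only executes under -ea,
    // and it evaluates to true, so the assert itself never fires
    assert isAssertEnabled = true;
    // isAssertEnabled is now true iff the JVM was started with assertions enabled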
* http://www.apache.org/licenses/LICENSE-2.0 - * + *
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package org.apache.arrow.memory; -import io.netty.buffer.ByteBufAllocator; import io.netty.buffer.ArrowBuf; +import io.netty.buffer.ByteBufAllocator; /** * Wrapper class to deal with byte buffer allocation. Ensures users only use designated methods. */ public interface BufferAllocator extends AutoCloseable { + /** - * Allocate a new or reused buffer of the provided size. Note that the buffer may technically be larger than the - * requested size for rounding purposes. However, the buffer's capacity will be set to the configured size. + * Allocate a new or reused buffer of the provided size. Note that the buffer may technically + * be larger than the + * requested size for rounding purposes. However, the buffer's capacity will be set to the + * configured size. * - * @param size - * The size in bytes. + * @param size The size in bytes. * @return a new ArrowBuf, or null if the request can't be satisfied - * @throws OutOfMemoryException - * if buffer cannot be allocated + * @throws OutOfMemoryException if buffer cannot be allocated */ public ArrowBuf buffer(int size); /** - * Allocate a new or reused buffer of the provided size. Note that the buffer may technically be larger than the - * requested size for rounding purposes. However, the buffer's capacity will be set to the configured size. + * Allocate a new or reused buffer of the provided size. Note that the buffer may technically + * be larger than the + * requested size for rounding purposes. However, the buffer's capacity will be set to the + * configured size. * - * @param size - * The size in bytes. - * @param manager - * A buffer manager to manage reallocation. + * @param size The size in bytes. + * @param manager A buffer manager to manage reallocation. * @return a new ArrowBuf, or null if the request can't be satisfied - * @throws OutOfMemoryException - * if buffer cannot be allocated + * @throws OutOfMemoryException if buffer cannot be allocated */ public ArrowBuf buffer(int size, BufferManager manager); @@ -60,19 +61,16 @@ public interface BufferAllocator extends AutoCloseable { /** * Create a new child allocator. * - * @param name - * the name of the allocator. - * @param initReservation - * the initial space reservation (obtained from this allocator) - * @param maxAllocation - * maximum amount of space the new allocator can allocate + * @param name the name of the allocator. + * @param initReservation the initial space reservation (obtained from this allocator) + * @param maxAllocation maximum amount of space the new allocator can allocate * @return the new allocator, or null if it can't be created */ public BufferAllocator newChildAllocator(String name, long initReservation, long maxAllocation); /** * Close and release all buffers generated from this buffer pool. - * + *
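 * Since the allocator is AutoCloseable, try-with-resources is the usual
 * pattern (a sketch; the limit value is arbitrary):
 * <pre>
 *   try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE)) {
 *     ArrowBuf buf = allocator.buffer(1024);
 *     buf.release();  // release buffers before the allocator closes
 *   }
 * </pre>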
 *
When assertions are on, complains if there are any outstanding buffers; to avoid * that, release all buffers before the allocator is closed. */ @@ -87,19 +85,18 @@ public interface BufferAllocator extends AutoCloseable { public long getAllocatedMemory(); /** - * Set the maximum amount of memory this allocator is allowed to allocate. + * Return the current maximum limit this allocator imposes. * - * @param newLimit - * The new Limit to apply to allocations + * @return Limit in number of bytes. */ - public void setLimit(long newLimit); + public long getLimit(); /** - * Return the current maximum limit this allocator imposes. + * Set the maximum amount of memory this allocator is allowed to allocate. * - * @return Limit in number of bytes. + * @param newLimit The new Limit to apply to allocations */ - public long getLimit(); + public void setLimit(long newLimit); /** * Returns the peak amount of memory allocated from this allocator. @@ -118,25 +115,31 @@ public interface BufferAllocator extends AutoCloseable { public AllocationReservation newReservation(); /** - * Get a reference to the empty buffer associated with this allocator. Empty buffers are special because we don't - * worry about them leaking or managing reference counts on them since they don't actually point to any memory. + * Get a reference to the empty buffer associated with this allocator. Empty buffers are + * special because we don't + * worry about them leaking or managing reference counts on them since they don't actually + * point to any memory. */ public ArrowBuf getEmpty(); /** - * Return the name of this allocator. This is a human readable name that can help debugging. Typically provides + * Return the name of this allocator. This is a human readable name that can help debugging. + * Typically provides * coordinates about where this allocator was created */ public String getName(); /** - * Return whether or not this allocator (or one if its parents) is over its limits. In the case that an allocator is - * over its limit, all consumers of that allocator should aggressively try to addrss the overlimit situation. + * Return whether or not this allocator (or one if its parents) is over its limits. In the case + * that an allocator is + * over its limit, all consumers of that allocator should aggressively try to addrss the + * overlimit situation. */ public boolean isOverLimit(); /** - * Return a verbose string describing this allocator. If in DEBUG mode, this will also include relevant stacktraces + * Return a verbose string describing this allocator. If in DEBUG mode, this will also include + * relevant stacktraces * and historical logs for underlying objects * * @return A very verbose description of the allocator hierarchy. @@ -144,7 +147,8 @@ public interface BufferAllocator extends AutoCloseable { public String toVerboseString(); /** - * Asserts (using java assertions) that the provided allocator is currently open. If assertions are disabled, this is + * Asserts (using java assertions) that the provided allocator is currently open. If assertions + * are disabled, this is * a no-op. 
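 *
 * <p>A minimal sketch, assuming a BufferAllocator named allocator:
 * <pre>
 *   allocator.assertOpen();  // no-op unless the JVM runs with assertions (-ea) enabled
 * </pre>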
*/ public void assertOpen(); diff --git a/java/memory/src/main/java/org/apache/arrow/memory/BufferManager.java b/java/memory/src/main/java/org/apache/arrow/memory/BufferManager.java index 8969434791012..2fe763e10aff9 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/BufferManager.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/BufferManager.java @@ -15,6 +15,7 @@ * See the License for the specific language governing permissions and * limitations under the License. ******************************************************************************/ + package org.apache.arrow.memory; import io.netty.buffer.ArrowBuf; @@ -24,7 +25,7 @@ * re-allocation the old buffer will be freed. Managing a list of these buffers * prevents some parts of the system from needing to define a correct location * to place the final call to free them. - * + *
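 * A sketch of the replace pattern described above (the manager variable and
 * sizes are hypothetical):
 * <pre>
 *   ArrowBuf buf = bufferManager.getManagedBuffer(256);
 *   buf = bufferManager.replace(buf, 512);  // the old buffer is freed; data is not copied
 * </pre>
 *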
* The current uses of these types of buffers are within the pluggable components of Drill. * In UDFs, memory management should not be a concern. We provide access to re-allocatable * ArrowBufs to give UDF writers general purpose buffers we can account for. To prevent the need @@ -38,12 +39,9 @@ public interface BufferManager extends AutoCloseable { /** * Replace an old buffer with a new version at least of the provided size. Does not copy data. * - * @param old - * Old Buffer that the user is no longer going to use. - * @param newSize - * Size of new replacement buffer. - * @return - * A new version of the buffer. + * @param old Old Buffer that the user is no longer going to use. + * @param newSize Size of new replacement buffer. + * @return A new version of the buffer. */ public ArrowBuf replace(ArrowBuf old, int newSize); @@ -57,8 +55,7 @@ public interface BufferManager extends AutoCloseable { /** * Get a managed buffer of at least a certain size. * - * @param size - * The desired size + * @param size The desired size * @return A buffer */ public ArrowBuf getManagedBuffer(int size); diff --git a/java/memory/src/main/java/org/apache/arrow/memory/ChildAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/ChildAllocator.java index 11c9063fc9c69..f9a6dc72ece8c 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/ChildAllocator.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/ChildAllocator.java @@ -6,15 +6,16 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
* http://www.apache.org/licenses/LICENSE-2.0 - * + *
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package org.apache.arrow.memory; @@ -22,21 +23,22 @@ * Child allocator class. Only slightly different from the {@see RootAllocator}, * in that these can't be created directly, but must be obtained from * {@see BufferAllocator#newChildAllocator(AllocatorOwner, long, long, int)}. - + *
 *
Child allocators can only be created by the root or by other children, so
 * this class is package private.
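 *
 * <p>A sketch of how a child is obtained in practice (the limits below are
 * illustrative):
 * <pre>
 *   BufferAllocator root = new RootAllocator(1024 * 1024);
 *   BufferAllocator child = root.newChildAllocator("example-child", 0, 512 * 1024);
 *   child.close();
 *   root.close();
 * </pre>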
*/ class ChildAllocator extends BaseAllocator { + /** * Constructor. * * @param parentAllocator parent allocator -- the one creating this child - * @param name the name of this child allocator + * @param name the name of this child allocator * @param initReservation initial amount of space to reserve (obtained from the parent) - * @param maxAllocation maximum amount of space that can be obtained from this allocator; - * note this includes direct allocations (via {@see BufferAllocator#buffer(int, int)} - * et al) and requests from descendant allocators. Depending on the allocation policy in - * force, even less memory may be available + * @param maxAllocation maximum amount of space that can be obtained from this allocator; note + * this includes direct allocations (via {@see BufferAllocator#buffer(int, + *int)} et al) and requests from descendant allocators. Depending on the + * allocation policy in force, even less memory may be available */ ChildAllocator( BaseAllocator parentAllocator, diff --git a/java/memory/src/main/java/org/apache/arrow/memory/OutOfMemoryException.java b/java/memory/src/main/java/org/apache/arrow/memory/OutOfMemoryException.java index 6ba0284d8d449..c36584c9538b0 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/OutOfMemoryException.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/OutOfMemoryException.java @@ -6,28 +6,31 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
* http://www.apache.org/licenses/LICENSE-2.0 - * + *
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package org.apache.arrow.memory; public class OutOfMemoryException extends RuntimeException { - private static final long serialVersionUID = -6858052345185793382L; - static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(OutOfMemoryException.class); + static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(OutOfMemoryException + .class); + private static final long serialVersionUID = -6858052345185793382L; public OutOfMemoryException() { super(); } - public OutOfMemoryException(String message, Throwable cause, boolean enableSuppression, boolean writableStackTrace) { + public OutOfMemoryException(String message, Throwable cause, boolean enableSuppression, boolean + writableStackTrace) { super(message, cause, enableSuppression, writableStackTrace); } diff --git a/java/memory/src/main/java/org/apache/arrow/memory/RootAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/RootAllocator.java index 57a2c0cdae8d8..1dc6bf0c92fa0 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/RootAllocator.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/RootAllocator.java @@ -6,15 +6,16 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
* http://www.apache.org/licenses/LICENSE-2.0 - * + *
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package org.apache.arrow.memory; import com.google.common.annotations.VisibleForTesting; @@ -24,6 +25,7 @@ * tree of descendant child allocators. */ public class RootAllocator extends BaseAllocator { + public RootAllocator(final long limit) { this(AllocationListener.NOOP, limit); } diff --git a/java/memory/src/main/java/org/apache/arrow/memory/package-info.java b/java/memory/src/main/java/org/apache/arrow/memory/package-info.java index 40d25cada4519..cef382d1e044e 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/package-info.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/package-info.java @@ -1,24 +1,43 @@ /** - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * + * Licensed to the Apache Software Foundation (ASF) under one or more contributor license + * agreements. See the NOTICE file distributed with this work for additional information regarding + * copyright ownership. The ASF licenses this file to you under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with the License. You may obtain + * a copy of the License at + *
* http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, software - * distributed under the License is distributed on an "AS IS" BASIS, - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - * See the License for the specific language governing permissions and - * limitations under the License. + *
+ * Unless required by applicable law or agreed to in writing, software distributed under the License + * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express + * or implied. See the License for the specific language governing permissions and limitations under + * the License. + *
+ * Memory Allocation, Account and Management
+ *
+ * See the README.md file in this directory for detailed information about Arrow's memory + * allocation subsystem. */ /** * Memory Allocation, Account and Management * - * See the README.md file in this directory for detailed information about Arrow's memory allocation subsystem. + * See the README.md file in this directory for detailed information about Arrow's memory + * allocation subsystem. * */ + package org.apache.arrow.memory; diff --git a/java/memory/src/main/java/org/apache/arrow/memory/util/AssertionUtil.java b/java/memory/src/main/java/org/apache/arrow/memory/util/AssertionUtil.java index 28d078528974e..710f572e06027 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/util/AssertionUtil.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/util/AssertionUtil.java @@ -6,32 +6,33 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
* http://www.apache.org/licenses/LICENSE-2.0 - * + *
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package org.apache.arrow.memory.util; public class AssertionUtil { - static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(AssertionUtil.class); public static final boolean ASSERT_ENABLED; + static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(AssertionUtil.class); - static{ + static { boolean isAssertEnabled = false; assert isAssertEnabled = true; ASSERT_ENABLED = isAssertEnabled; } - public static boolean isAssertionsEnabled(){ - return ASSERT_ENABLED; + private AssertionUtil() { } - private AssertionUtil() { + public static boolean isAssertionsEnabled() { + return ASSERT_ENABLED; } } diff --git a/java/memory/src/main/java/org/apache/arrow/memory/util/AutoCloseableLock.java b/java/memory/src/main/java/org/apache/arrow/memory/util/AutoCloseableLock.java index 94e5cc5fded4f..8d9008c894ac8 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/util/AutoCloseableLock.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/util/AutoCloseableLock.java @@ -6,15 +6,16 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
* http://www.apache.org/licenses/LICENSE-2.0 - * + *
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package org.apache.arrow.memory.util; import java.util.concurrent.locks.Lock; diff --git a/java/memory/src/main/java/org/apache/arrow/memory/util/HistoricalLog.java b/java/memory/src/main/java/org/apache/arrow/memory/util/HistoricalLog.java index c9b5c5385c596..c464598bfb856 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/util/HistoricalLog.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/util/HistoricalLog.java @@ -6,53 +6,43 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
* http://www.apache.org/licenses/LICENSE-2.0 - * + *
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package org.apache.arrow.memory.util; +import org.slf4j.Logger; + import java.util.Arrays; import java.util.LinkedList; -import org.slf4j.Logger; - /** * Utility class that can be used to log activity within a class * for later logging and debugging. Supports recording events and * recording the stack at the time they occur. */ public class HistoricalLog { - private static class Event { - private final String note; // the event text - private final StackTrace stackTrace; // where the event occurred - private final long time; - - public Event(final String note) { - this.note = note; - this.time = System.nanoTime(); - stackTrace = new StackTrace(); - } - } private final LinkedList history = new LinkedList<>(); private final String idString; // the formatted id string - private Event firstEvent; // the first stack trace recorded private final int limit; // the limit on the number of events kept + private Event firstEvent; // the first stack trace recorded /** * Constructor. The format string will be formatted and have its arguments * substituted at the time this is called. * - * @param idStringFormat {@link String#format} format string that can be used - * to identify this object in a log. Including some kind of unique identifier - * that can be associated with the object instance is best. - * @param args for the format string, or nothing if none are required + * @param idStringFormat {@link String#format} format string that can be used to identify this + * object in a log. Including some kind of unique identifier that can be + * associated with the object instance is best. + * @param args for the format string, or nothing if none are required */ public HistoricalLog(final String idStringFormat, Object... args) { this(Integer.MAX_VALUE, idStringFormat, args); @@ -61,7 +51,7 @@ public HistoricalLog(final String idStringFormat, Object... args) { /** * Constructor. The format string will be formatted and have its arguments * substituted at the time this is called. - * + *
 *
This form supports the specification of a limit that will limit the * number of historical entries kept (which keeps down the amount of memory * used). With the limit, the first entry made is always kept (under the @@ -70,12 +60,12 @@ public HistoricalLog(final String idStringFormat, Object... args) { * Each time a new entry is made, the oldest that is not the first is dropped. *
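 * For example, a log that keeps its creation event plus the most recent
 * entries up to the limit (the values below are illustrative):
 * <pre>
 *   HistoricalLog log = new HistoricalLog(16, "ArrowBuf[%d]", 42);
 *   log.recordEvent("created");
 * </pre>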
* - * @param limit the maximum number of historical entries that will be kept, - * not including the first entry made - * @param idStringFormat {@link String#format} format string that can be used - * to identify this object in a log. Including some kind of unique identifier - * that can be associated with the object instance is best. - * @param args for the format string, or nothing if none are required + * @param limit the maximum number of historical entries that will be kept, not including + * the first entry made + * @param idStringFormat {@link String#format} format string that can be used to identify this + * object in a log. Including some kind of unique identifier that can be + * associated with the object instance is best. + * @param args for the format string, or nothing if none are required */ public HistoricalLog(final int limit, final String idStringFormat, Object... args) { this.limit = limit; @@ -88,7 +78,7 @@ public HistoricalLog(final int limit, final String idStringFormat, Object... arg * at the time this is called. * * @param noteFormat {@link String#format} format string that describes the event - * @param args for the format string, or nothing if none are required + * @param args for the format string, or nothing if none are required */ public synchronized void recordEvent(final String noteFormat, Object... args) { final String note = String.format(noteFormat, args); @@ -113,23 +103,14 @@ public void buildHistory(final StringBuilder sb, boolean includeStackTrace) { buildHistory(sb, 0, includeStackTrace); } - /** - * Write the history of this object to the given {@link StringBuilder}. The history - * includes the identifying string provided at construction time, and all the recorded - * events with their stack traces. - * - * @param sb {@link StringBuilder} to write to - * @param additional an extra string that will be written between the identifying - * information and the history; often used for a current piece of state - */ - /** * * @param sb * @param indent * @param includeStackTrace */ - public synchronized void buildHistory(final StringBuilder sb, int indent, boolean includeStackTrace) { + public synchronized void buildHistory(final StringBuilder sb, int indent, boolean + includeStackTrace) { final char[] indentation = new char[indent]; final char[] innerIndentation = new char[indent + 2]; Arrays.fill(indentation, ' '); @@ -140,7 +121,6 @@ public synchronized void buildHistory(final StringBuilder sb, int indent, boolea .append(idString) .append('\n'); - if (firstEvent != null) { sb.append(innerIndentation) .append(firstEvent.time) @@ -151,7 +131,7 @@ public synchronized void buildHistory(final StringBuilder sb, int indent, boolea firstEvent.stackTrace.writeToBuilder(sb, indent + 2); } - for(final Event event : history) { + for (final Event event : history) { if (event == firstEvent) { continue; } @@ -170,6 +150,16 @@ public synchronized void buildHistory(final StringBuilder sb, int indent, boolea } } + /** + * Write the history of this object to the given {@link StringBuilder}. The history + * includes the identifying string provided at construction time, and all the recorded + * events with their stack traces. + * + * @param sb {@link StringBuilder} to write to + * @param additional an extra string that will be written between the identifying + * information and the history; often used for a current piece of state + */ + /** * Write the history of this object to the given {@link Logger}. 
The history * includes the identifying string provided at construction time, and all the recorded @@ -182,4 +172,17 @@ public void logHistory(final Logger logger) { buildHistory(sb, 0, true); logger.debug(sb.toString()); } + + private static class Event { + + private final String note; // the event text + private final StackTrace stackTrace; // where the event occurred + private final long time; + + public Event(final String note) { + this.note = note; + this.time = System.nanoTime(); + stackTrace = new StackTrace(); + } + } } diff --git a/java/memory/src/main/java/org/apache/arrow/memory/util/StackTrace.java b/java/memory/src/main/java/org/apache/arrow/memory/util/StackTrace.java index 638c2fb9a959e..bb4ea6c46179e 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/util/StackTrace.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/util/StackTrace.java @@ -6,15 +6,16 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
* http://www.apache.org/licenses/LICENSE-2.0 - * + *
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package org.apache.arrow.memory.util; import java.util.Arrays; @@ -23,6 +24,7 @@ * Convenient way of obtaining and manipulating stack traces for debugging. */ public class StackTrace { + private final StackTraceElement[] stackTraceElements; /** @@ -36,10 +38,9 @@ public StackTrace() { /** * Write the stack trace to a StringBuilder. - * @param sb - * where to write it - * @param indent - * how many double spaces to indent each line + * + * @param sb where to write it + * @param indent how many double spaces to indent each line */ public void writeToBuilder(final StringBuilder sb, final int indent) { // create the indentation string @@ -47,7 +48,7 @@ public void writeToBuilder(final StringBuilder sb, final int indent) { Arrays.fill(indentation, ' '); // write the stack trace in standard Java format - for(StackTraceElement ste : stackTraceElements) { + for (StackTraceElement ste : stackTraceElements) { sb.append(indentation) .append("at ") .append(ste.getClassName()) diff --git a/java/pom.xml b/java/pom.xml index fa03783396ffb..774761f0c1e66 100644 --- a/java/pom.xml +++ b/java/pom.xml @@ -35,6 +35,7 @@ 2 2.7.1 2.7.1 + false @@ -269,6 +270,47 @@ + + + org.apache.maven.plugins + maven-checkstyle-plugin + 2.17 + + + com.puppycrawl.tools + checkstyle + 6.15 + + + com.google.guava + guava + ${dep.guava.version} + + + + + validate + validate + + check + + + + + google_checks.xml + UTF-8 + true + ${checkstyle.failOnViolation} + ${checkstyle.failOnViolation} + warning + xml + html + ${project.build.directory}/test/checkstyle-errors.xml + false + + + + @@ -382,6 +424,19 @@ + + + org.apache.maven.plugins + maven-checkstyle-plugin + [0,) + + check + + + + + + diff --git a/java/tools/src/main/java/org/apache/arrow/tools/EchoServer.java b/java/tools/src/main/java/org/apache/arrow/tools/EchoServer.java index 7c0cadd9d77dd..24079b62da919 100644 --- a/java/tools/src/main/java/org/apache/arrow/tools/EchoServer.java +++ b/java/tools/src/main/java/org/apache/arrow/tools/EchoServer.java @@ -6,20 +6,17 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
* http://www.apache.org/licenses/LICENSE-2.0 - * + *
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ -package org.apache.arrow.tools; -import java.io.IOException; -import java.net.ServerSocket; -import java.net.Socket; +package org.apache.arrow.tools; import com.google.common.base.Preconditions; @@ -31,11 +28,14 @@ import org.slf4j.Logger; import org.slf4j.LoggerFactory; +import java.io.IOException; +import java.net.ServerSocket; +import java.net.Socket; + public class EchoServer { private static final Logger LOGGER = LoggerFactory.getLogger(EchoServer.class); - - private boolean closed = false; private final ServerSocket serverSocket; + private boolean closed = false; public EchoServer(int port) throws IOException { LOGGER.info("Starting echo server."); @@ -43,22 +43,64 @@ public EchoServer(int port) throws IOException { LOGGER.info("Running echo server on port: " + port()); } - public int port() { return serverSocket.getLocalPort(); } + public static void main(String[] args) throws Exception { + int port; + if (args.length > 0) { + port = Integer.parseInt(args[0]); + } else { + port = 8080; + } + new EchoServer(port).run(); + } + + public int port() { + return serverSocket.getLocalPort(); + } + + public void run() throws IOException { + try { + while (!closed) { + LOGGER.info("Waiting to accept new client connection."); + Socket clientSocket = serverSocket.accept(); + LOGGER.info("Accepted new client connection."); + try (ClientConnection client = new ClientConnection(clientSocket)) { + try { + client.run(); + } catch (IOException e) { + LOGGER.warn("Error handling client connection.", e); + } + } + LOGGER.info("Closed connection with client"); + } + } catch (java.net.SocketException ex) { + if (!closed) throw ex; + } finally { + serverSocket.close(); + LOGGER.info("Server closed."); + } + } + + public void close() throws IOException { + closed = true; + serverSocket.close(); + } public static class ClientConnection implements AutoCloseable { public final Socket socket; + public ClientConnection(Socket socket) { this.socket = socket; } public void run() throws IOException { - BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE); + BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE); // Read the entire input stream and write it back try (ArrowStreamReader reader = new ArrowStreamReader(socket.getInputStream(), allocator)) { VectorSchemaRoot root = reader.getVectorSchemaRoot(); // load the first batch before instantiating the writer so that we have any dictionaries reader.loadNextBatch(); - try (ArrowStreamWriter writer = new ArrowStreamWriter(root, reader, socket.getOutputStream())) { + try (ArrowStreamWriter writer = new ArrowStreamWriter(root, reader, socket + .getOutputStream())) { writer.start(); int echoed = 0; while (true) { @@ -83,42 +125,4 @@ public void close() throws IOException { socket.close(); } } - - public void run() throws IOException { - try { - while (!closed) { - LOGGER.info("Waiting to accept new client connection."); - Socket clientSocket = serverSocket.accept(); - LOGGER.info("Accepted new client connection."); - try (ClientConnection client = new ClientConnection(clientSocket)) { - try { - client.run(); - } catch (IOException e) { - LOGGER.warn("Error handling client connection.", e); - } - } - LOGGER.info("Closed connection with 
client"); - } - } catch (java.net.SocketException ex) { - if (!closed) throw ex; - } finally { - serverSocket.close(); - LOGGER.info("Server closed."); - } - } - - public void close() throws IOException { - closed = true; - serverSocket.close(); - } - - public static void main(String[] args) throws Exception { - int port; - if (args.length > 0) { - port = Integer.parseInt(args[0]); - } else { - port = 8080; - } - new EchoServer(port).run(); - } } diff --git a/java/tools/src/main/java/org/apache/arrow/tools/FileRoundtrip.java b/java/tools/src/main/java/org/apache/arrow/tools/FileRoundtrip.java index 9fa7b761a5772..b8621920d3348 100644 --- a/java/tools/src/main/java/org/apache/arrow/tools/FileRoundtrip.java +++ b/java/tools/src/main/java/org/apache/arrow/tools/FileRoundtrip.java @@ -16,13 +16,8 @@ * specific language governing permissions and limitations * under the License. */ -package org.apache.arrow.tools; -import java.io.File; -import java.io.FileInputStream; -import java.io.FileOutputStream; -import java.io.IOException; -import java.io.PrintStream; +package org.apache.arrow.tools; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; @@ -38,17 +33,17 @@ import org.slf4j.Logger; import org.slf4j.LoggerFactory; +import java.io.File; +import java.io.FileInputStream; +import java.io.FileOutputStream; +import java.io.IOException; +import java.io.PrintStream; + public class FileRoundtrip { private static final Logger LOGGER = LoggerFactory.getLogger(FileRoundtrip.class); - - public static void main(String[] args) { - System.exit(new FileRoundtrip(System.out, System.err).run(args)); - } - private final Options options; private final PrintStream out; private final PrintStream err; - FileRoundtrip(PrintStream out, PrintStream err) { this.out = out; this.err = err; @@ -58,6 +53,10 @@ public static void main(String[] args) { } + public static void main(String[] args) { + System.exit(new FileRoundtrip(System.out, System.err).run(args)); + } + private File validateFile(String type, String fileName) { if (fileName == null) { throw new IllegalArgumentException("missing " + type + " file parameter"); @@ -81,7 +80,8 @@ int run(String[] args) { File outFile = validateFile("output", outFileName); BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); // TODO: close try (FileInputStream fileInputStream = new FileInputStream(inFile); - ArrowFileReader arrowReader = new ArrowFileReader(fileInputStream.getChannel(), allocator)) { + ArrowFileReader arrowReader = new ArrowFileReader(fileInputStream.getChannel(), + allocator)) { VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); Schema schema = root.getSchema(); @@ -89,7 +89,8 @@ int run(String[] args) { LOGGER.debug("Found schema: " + schema); try (FileOutputStream fileOutputStream = new FileOutputStream(outFile); - ArrowFileWriter arrowWriter = new ArrowFileWriter(root, arrowReader, fileOutputStream.getChannel())) { + ArrowFileWriter arrowWriter = new ArrowFileWriter(root, arrowReader, + fileOutputStream.getChannel())) { arrowWriter.start(); while (true) { arrowReader.loadNextBatch(); diff --git a/java/tools/src/main/java/org/apache/arrow/tools/FileToStream.java b/java/tools/src/main/java/org/apache/arrow/tools/FileToStream.java index d5345535d19dc..be404fd4c5950 100644 --- a/java/tools/src/main/java/org/apache/arrow/tools/FileToStream.java +++ b/java/tools/src/main/java/org/apache/arrow/tools/FileToStream.java @@ -6,22 +6,17 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may 
not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
* http://www.apache.org/licenses/LICENSE-2.0 - * + *
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ -package org.apache.arrow.tools; -import java.io.File; -import java.io.FileInputStream; -import java.io.FileOutputStream; -import java.io.IOException; -import java.io.OutputStream; +package org.apache.arrow.tools; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; @@ -29,6 +24,12 @@ import org.apache.arrow.vector.file.ArrowFileReader; import org.apache.arrow.vector.stream.ArrowStreamWriter; +import java.io.File; +import java.io.FileInputStream; +import java.io.FileOutputStream; +import java.io.IOException; +import java.io.OutputStream; + /** * Converts an Arrow file to an Arrow stream. The file should be specified as the * first argument and the output is written to standard out. diff --git a/java/tools/src/main/java/org/apache/arrow/tools/Integration.java b/java/tools/src/main/java/org/apache/arrow/tools/Integration.java index 5d4849c234383..453693d7fa489 100644 --- a/java/tools/src/main/java/org/apache/arrow/tools/Integration.java +++ b/java/tools/src/main/java/org/apache/arrow/tools/Integration.java @@ -16,15 +16,8 @@ * specific language governing permissions and limitations * under the License. */ -package org.apache.arrow.tools; -import java.io.File; -import java.io.FileInputStream; -import java.io.FileOutputStream; -import java.io.IOException; -import java.util.Arrays; -import java.util.Iterator; -import java.util.List; +package org.apache.arrow.tools; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; @@ -44,8 +37,25 @@ import org.slf4j.Logger; import org.slf4j.LoggerFactory; +import java.io.File; +import java.io.FileInputStream; +import java.io.FileOutputStream; +import java.io.IOException; +import java.util.Arrays; +import java.util.Iterator; +import java.util.List; + public class Integration { private static final Logger LOGGER = LoggerFactory.getLogger(Integration.class); + private final Options options; + + Integration() { + this.options = new Options(); + this.options.addOption("a", "arrow", true, "arrow file"); + this.options.addOption("j", "json", true, "json file"); + this.options.addOption("c", "command", true, "command to execute: " + Arrays.toString(Command + .values())); + } public static void main(String[] args) { try { @@ -59,20 +69,61 @@ public static void main(String[] args) { } } - private final Options options; + private static void fatalError(String message, Throwable e) { + System.err.println(message); + System.err.println(e.getMessage()); + LOGGER.error(message, e); + System.exit(1); + } + + private File validateFile(String type, String fileName, boolean shouldExist) { + if (fileName == null) { + throw new IllegalArgumentException("missing " + type + " file parameter"); + } + File f = new File(fileName); + if (shouldExist && (!f.exists() || f.isDirectory())) { + throw new IllegalArgumentException(type + " file not found: " + f.getAbsolutePath()); + } + if (!shouldExist && f.exists()) { + throw new IllegalArgumentException(type + " file already exists: " + f.getAbsolutePath()); + } + return f; + } + + void run(String[] args) throws ParseException, IOException { + CommandLineParser parser = new PosixParser(); + CommandLine cmd = parser.parse(options, args, false); + 
+ + Command command = toCommand(cmd.getOptionValue("command")); + File arrowFile = validateFile("arrow", cmd.getOptionValue("arrow"), command.arrowExists); + File jsonFile = validateFile("json", cmd.getOptionValue("json"), command.jsonExists); + command.execute(arrowFile, jsonFile); + } + + private Command toCommand(String commandName) { + try { + return Command.valueOf(commandName); + } catch (IllegalArgumentException e) { + throw new IllegalArgumentException("Unknown command: " + commandName + " expected one of " + + Arrays.toString(Command.values())); + } + } enum Command { ARROW_TO_JSON(true, false) { @Override public void execute(File arrowFile, File jsonFile) throws IOException { - try(BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); - FileInputStream fileInputStream = new FileInputStream(arrowFile); - ArrowFileReader arrowReader = new ArrowFileReader(fileInputStream.getChannel(), allocator)) { + try (BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); + FileInputStream fileInputStream = new FileInputStream(arrowFile); + ArrowFileReader arrowReader = new ArrowFileReader(fileInputStream.getChannel(), + allocator)) { VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); Schema schema = root.getSchema(); LOGGER.debug("Input file size: " + arrowFile.length()); LOGGER.debug("Found schema: " + schema); - try (JsonFileWriter writer = new JsonFileWriter(jsonFile, JsonFileWriter.config().pretty(true))) { + try (JsonFileWriter writer = new JsonFileWriter(jsonFile, JsonFileWriter.config() + .pretty(true))) { writer.start(schema); for (ArrowBlock rbBlock : arrowReader.getRecordBlocks()) { arrowReader.loadRecordBatch(rbBlock); @@ -94,7 +145,8 @@ public void execute(File arrowFile, File jsonFile) throws IOException { try (FileOutputStream fileOutputStream = new FileOutputStream(arrowFile); VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator); // TODO json dictionaries - ArrowFileWriter arrowWriter = new ArrowFileWriter(root, null, fileOutputStream.getChannel())) { + ArrowFileWriter arrowWriter = new ArrowFileWriter(root, null, fileOutputStream + .getChannel())) { arrowWriter.start(); reader.read(root); while (root.getRowCount() != 0) { @@ -113,7 +165,8 @@ public void execute(File arrowFile, File jsonFile) throws IOException { try (BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); JsonFileReader jsonReader = new JsonFileReader(jsonFile, allocator); FileInputStream fileInputStream = new FileInputStream(arrowFile); - ArrowFileReader arrowReader = new ArrowFileReader(fileInputStream.getChannel(), allocator)) { + ArrowFileReader arrowReader = new ArrowFileReader(fileInputStream.getChannel(), + allocator)) { Schema jsonSchema = jsonReader.start(); VectorSchemaRoot arrowRoot = arrowReader.getVectorSchemaRoot(); Schema arrowSchema = arrowRoot.getSchema(); @@ -135,7 +188,8 @@ public void execute(File arrowFile, File jsonFile) throws IOException { boolean hasMoreJSON = jsonRoot != null; boolean hasMoreArrow = iterator.hasNext(); if (hasMoreJSON || hasMoreArrow) { - throw new IllegalArgumentException("Unexpected RecordBatches. J:" + hasMoreJSON + " A:" + hasMoreArrow); + throw new IllegalArgumentException("Unexpected RecordBatches. 
J:" + hasMoreJSON + " " + + "A:" + hasMoreArrow); } } } @@ -153,51 +207,4 @@ public void execute(File arrowFile, File jsonFile) throws IOException { } - Integration() { - this.options = new Options(); - this.options.addOption("a", "arrow", true, "arrow file"); - this.options.addOption("j", "json", true, "json file"); - this.options.addOption("c", "command", true, "command to execute: " + Arrays.toString(Command.values())); - } - - private File validateFile(String type, String fileName, boolean shouldExist) { - if (fileName == null) { - throw new IllegalArgumentException("missing " + type + " file parameter"); - } - File f = new File(fileName); - if (shouldExist && (!f.exists() || f.isDirectory())) { - throw new IllegalArgumentException(type + " file not found: " + f.getAbsolutePath()); - } - if (!shouldExist && f.exists()) { - throw new IllegalArgumentException(type + " file already exists: " + f.getAbsolutePath()); - } - return f; - } - - void run(String[] args) throws ParseException, IOException { - CommandLineParser parser = new PosixParser(); - CommandLine cmd = parser.parse(options, args, false); - - - Command command = toCommand(cmd.getOptionValue("command")); - File arrowFile = validateFile("arrow", cmd.getOptionValue("arrow"), command.arrowExists); - File jsonFile = validateFile("json", cmd.getOptionValue("json"), command.jsonExists); - command.execute(arrowFile, jsonFile); - } - - private Command toCommand(String commandName) { - try { - return Command.valueOf(commandName); - } catch (IllegalArgumentException e) { - throw new IllegalArgumentException("Unknown command: " + commandName + " expected one of " + Arrays.toString(Command.values())); - } - } - - private static void fatalError(String message, Throwable e) { - System.err.println(message); - System.err.println(e.getMessage()); - LOGGER.error(message, e); - System.exit(1); - } - } diff --git a/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java b/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java index 3b79d5b05e116..41dfd347be579 100644 --- a/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java +++ b/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java @@ -6,17 +6,24 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
* http://www.apache.org/licenses/LICENSE-2.0 - * + *
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ + package org.apache.arrow.tools; +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.file.ArrowFileWriter; +import org.apache.arrow.vector.stream.ArrowStreamReader; + import java.io.File; import java.io.FileInputStream; import java.io.FileOutputStream; @@ -25,12 +32,6 @@ import java.io.OutputStream; import java.nio.channels.Channels; -import org.apache.arrow.memory.BufferAllocator; -import org.apache.arrow.memory.RootAllocator; -import org.apache.arrow.vector.VectorSchemaRoot; -import org.apache.arrow.vector.file.ArrowFileWriter; -import org.apache.arrow.vector.stream.ArrowStreamReader; - /** * Converts an Arrow stream to an Arrow file. */ diff --git a/java/tools/src/test/java/org/apache/arrow/tools/ArrowFileTestFixtures.java b/java/tools/src/test/java/org/apache/arrow/tools/ArrowFileTestFixtures.java index f752f7eaa74b9..1a389098b4f47 100644 --- a/java/tools/src/test/java/org/apache/arrow/tools/ArrowFileTestFixtures.java +++ b/java/tools/src/test/java/org/apache/arrow/tools/ArrowFileTestFixtures.java @@ -16,13 +16,8 @@ * specific language governing permissions and limitations * under the License. */ -package org.apache.arrow.tools; -import java.io.File; -import java.io.FileInputStream; -import java.io.FileNotFoundException; -import java.io.FileOutputStream; -import java.io.IOException; +package org.apache.arrow.tools; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.FieldVector; @@ -39,6 +34,12 @@ import org.apache.arrow.vector.types.pojo.Schema; import org.junit.Assert; +import java.io.File; +import java.io.FileInputStream; +import java.io.FileNotFoundException; +import java.io.FileOutputStream; +import java.io.IOException; + public class ArrowFileTestFixtures { static final int COUNT = 10; @@ -58,9 +59,11 @@ static void writeData(int count, MapVector parent) { static void validateOutput(File testOutFile, BufferAllocator allocator) throws Exception { // read - try (BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + try (BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer + .MAX_VALUE); FileInputStream fileInputStream = new FileInputStream(testOutFile); - ArrowFileReader arrowReader = new ArrowFileReader(fileInputStream.getChannel(), readerAllocator)) { + ArrowFileReader arrowReader = new ArrowFileReader(fileInputStream.getChannel(), + readerAllocator)) { VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); Schema schema = root.getSchema(); for (ArrowBlock rbBlock : arrowReader.getRecordBlocks()) { @@ -81,16 +84,19 @@ static void validateContent(int count, VectorSchemaRoot root) { static void write(FieldVector parent, File file) throws FileNotFoundException, IOException { VectorSchemaRoot root = new VectorSchemaRoot(parent); try (FileOutputStream fileOutputStream = new FileOutputStream(file); - ArrowFileWriter arrowWriter = new ArrowFileWriter(root, null, fileOutputStream.getChannel())) { + ArrowFileWriter arrowWriter = new ArrowFileWriter(root, null, fileOutputStream + .getChannel())) { arrowWriter.writeBatch(); } } - static void 
writeInput(File testInFile, BufferAllocator allocator) throws FileNotFoundException, IOException { + static void writeInput(File testInFile, BufferAllocator allocator) throws + FileNotFoundException, IOException { int count = ArrowFileTestFixtures.COUNT; try ( - BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, + Integer.MAX_VALUE); MapVector parent = new MapVector("parent", vectorAllocator, null)) { writeData(count, parent); write(parent.getChild("root"), testInFile); diff --git a/java/tools/src/test/java/org/apache/arrow/tools/EchoServerTest.java b/java/tools/src/test/java/org/apache/arrow/tools/EchoServerTest.java index 706f8e2ca4d36..5970c57f46583 100644 --- a/java/tools/src/test/java/org/apache/arrow/tools/EchoServerTest.java +++ b/java/tools/src/test/java/org/apache/arrow/tools/EchoServerTest.java @@ -6,28 +6,17 @@ * to you under the Apache License, Version 2.0 (the * "License"); you may not use this file except in compliance * with the License. You may obtain a copy of the License at - * + *
* http://www.apache.org/licenses/LICENSE-2.0 - * + *
* Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ -package org.apache.arrow.tools; -import static java.util.Arrays.asList; -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertTrue; - -import java.io.IOException; -import java.net.Socket; -import java.net.UnknownHostException; -import java.nio.charset.StandardCharsets; -import java.util.Arrays; -import java.util.Collections; -import java.util.List; +package org.apache.arrow.tools; import com.google.common.collect.ImmutableList; @@ -57,6 +46,18 @@ import org.junit.BeforeClass; import org.junit.Test; +import java.io.IOException; +import java.net.Socket; +import java.net.UnknownHostException; +import java.nio.charset.StandardCharsets; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; + +import static java.util.Arrays.asList; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + public class EchoServerTest { private static EchoServer server; @@ -94,8 +95,8 @@ private void testEchoServer(int serverPort, BufferAllocator alloc = new RootAllocator(Long.MAX_VALUE); VectorSchemaRoot root = new VectorSchemaRoot(asList(field), asList((FieldVector) vector), 0); try (Socket socket = new Socket("localhost", serverPort); - ArrowStreamWriter writer = new ArrowStreamWriter(root, null, socket.getOutputStream()); - ArrowStreamReader reader = new ArrowStreamReader(socket.getInputStream(), alloc)) { + ArrowStreamWriter writer = new ArrowStreamWriter(root, null, socket.getOutputStream()); + ArrowStreamReader reader = new ArrowStreamReader(socket.getInputStream(), alloc)) { writer.start(); for (int i = 0; i < batches; i++) { vector.allocateNew(16); @@ -111,7 +112,8 @@ private void testEchoServer(int serverPort, assertEquals(new Schema(asList(field)), reader.getVectorSchemaRoot().getSchema()); - NullableTinyIntVector readVector = (NullableTinyIntVector) reader.getVectorSchemaRoot().getFieldVectors().get(0); + NullableTinyIntVector readVector = (NullableTinyIntVector) reader.getVectorSchemaRoot() + .getFieldVectors().get(0); for (int i = 0; i < batches; i++) { reader.loadNextBatch(); assertEquals(16, reader.getVectorSchemaRoot().getRowCount()); @@ -131,7 +133,8 @@ private void testEchoServer(int serverPort, public void basicTest() throws InterruptedException, IOException { BufferAllocator alloc = new RootAllocator(Long.MAX_VALUE); - Field field = new Field("testField", true, new ArrowType.Int(8, true), Collections.emptyList()); + Field field = new Field("testField", true, new ArrowType.Int(8, true), Collections + .emptyList()); NullableTinyIntVector vector = new NullableTinyIntVector("testField", alloc, null); Schema schema = new Schema(asList(field)); @@ -150,7 +153,8 @@ public void testFlatDictionary() throws IOException { DictionaryEncoding writeEncoding = new DictionaryEncoding(1L, false, null); try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE); NullableIntVector writeVector = new NullableIntVector("varchar", allocator, writeEncoding); - NullableVarCharVector writeDictionaryVector = new NullableVarCharVector("dict", allocator, null)) { + NullableVarCharVector writeDictionaryVector = new NullableVarCharVector("dict", + allocator, null)) { writeVector.allocateNewSafe(); 
NullableIntVector.Mutator mutator = writeVector.getMutator(); mutator.set(0, 0); @@ -171,10 +175,12 @@ public void testFlatDictionary() throws IOException { List vectors = ImmutableList.of((FieldVector) writeVector); VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, 6); - DictionaryProvider writeProvider = new MapDictionaryProvider(new Dictionary(writeDictionaryVector, writeEncoding)); + DictionaryProvider writeProvider = new MapDictionaryProvider(new Dictionary + (writeDictionaryVector, writeEncoding)); try (Socket socket = new Socket("localhost", serverPort); - ArrowStreamWriter writer = new ArrowStreamWriter(root, writeProvider, socket.getOutputStream()); + ArrowStreamWriter writer = new ArrowStreamWriter(root, writeProvider, socket + .getOutputStream()); ArrowStreamReader reader = new ArrowStreamReader(socket.getInputStream(), allocator)) { writer.start(); writer.writeBatch(); @@ -202,7 +208,8 @@ public void testFlatDictionary() throws IOException { Dictionary dictionary = reader.lookup(1L); Assert.assertNotNull(dictionary); - NullableVarCharVector.Accessor dictionaryAccessor = ((NullableVarCharVector) dictionary.getVector()).getAccessor(); + NullableVarCharVector.Accessor dictionaryAccessor = ((NullableVarCharVector) dictionary + .getVector()).getAccessor(); Assert.assertEquals(3, dictionaryAccessor.getValueCount()); Assert.assertEquals(new Text("foo"), dictionaryAccessor.getObject(0)); Assert.assertEquals(new Text("bar"), dictionaryAccessor.getObject(1)); @@ -215,7 +222,8 @@ public void testFlatDictionary() throws IOException { public void testNestedDictionary() throws IOException { DictionaryEncoding writeEncoding = new DictionaryEncoding(2L, false, null); try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE); - NullableVarCharVector writeDictionaryVector = new NullableVarCharVector("dictionary", allocator, null); + NullableVarCharVector writeDictionaryVector = new NullableVarCharVector("dictionary", + allocator, null); ListVector writeVector = new ListVector("list", allocator, null, null)) { // data being written: @@ -245,10 +253,12 @@ public void testNestedDictionary() throws IOException { List vectors = ImmutableList.of((FieldVector) writeVector); VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, 3); - DictionaryProvider writeProvider = new MapDictionaryProvider(new Dictionary(writeDictionaryVector, writeEncoding)); + DictionaryProvider writeProvider = new MapDictionaryProvider(new Dictionary + (writeDictionaryVector, writeEncoding)); try (Socket socket = new Socket("localhost", serverPort); - ArrowStreamWriter writer = new ArrowStreamWriter(root, writeProvider, socket.getOutputStream()); + ArrowStreamWriter writer = new ArrowStreamWriter(root, writeProvider, socket + .getOutputStream()); ArrowStreamReader reader = new ArrowStreamReader(socket.getInputStream(), allocator)) { writer.start(); writer.writeBatch(); @@ -262,7 +272,8 @@ public void testNestedDictionary() throws IOException { Assert.assertNotNull(readVector); Assert.assertNull(readVector.getField().getDictionary()); - DictionaryEncoding readEncoding = readVector.getField().getChildren().get(0).getDictionary(); + DictionaryEncoding readEncoding = readVector.getField().getChildren().get(0) + .getDictionary(); Assert.assertNotNull(readEncoding); Assert.assertEquals(2L, readEncoding.getId()); @@ -281,7 +292,8 @@ public void testNestedDictionary() throws IOException { Dictionary readDictionary = reader.lookup(2L); Assert.assertNotNull(readDictionary); - NullableVarCharVector.Accessor 
dictionaryAccessor = ((NullableVarCharVector) readDictionary.getVector()).getAccessor(); + NullableVarCharVector.Accessor dictionaryAccessor = ((NullableVarCharVector) + readDictionary.getVector()).getAccessor(); Assert.assertEquals(2, dictionaryAccessor.getValueCount()); Assert.assertEquals(new Text("foo"), dictionaryAccessor.getObject(0)); Assert.assertEquals(new Text("bar"), dictionaryAccessor.getObject(1)); diff --git a/java/tools/src/test/java/org/apache/arrow/tools/TestFileRoundtrip.java b/java/tools/src/test/java/org/apache/arrow/tools/TestFileRoundtrip.java index ee39f5e92c7b0..78021f8ad076c 100644 --- a/java/tools/src/test/java/org/apache/arrow/tools/TestFileRoundtrip.java +++ b/java/tools/src/test/java/org/apache/arrow/tools/TestFileRoundtrip.java @@ -16,13 +16,8 @@ * specific language governing permissions and limitations * under the License. */ -package org.apache.arrow.tools; -import static org.apache.arrow.tools.ArrowFileTestFixtures.validateOutput; -import static org.apache.arrow.tools.ArrowFileTestFixtures.writeInput; -import static org.junit.Assert.assertEquals; - -import java.io.File; +package org.apache.arrow.tools; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; @@ -32,6 +27,12 @@ import org.junit.Test; import org.junit.rules.TemporaryFolder; +import java.io.File; + +import static org.apache.arrow.tools.ArrowFileTestFixtures.validateOutput; +import static org.apache.arrow.tools.ArrowFileTestFixtures.writeInput; +import static org.junit.Assert.assertEquals; + public class TestFileRoundtrip { @Rule @@ -56,7 +57,7 @@ public void test() throws Exception { writeInput(testInFile, allocator); - String[] args = { "-i", testInFile.getAbsolutePath(), "-o", testOutFile.getAbsolutePath()}; + String[] args = {"-i", testInFile.getAbsolutePath(), "-o", testOutFile.getAbsolutePath()}; int result = new FileRoundtrip(System.out, System.err).run(args); assertEquals(0, result); diff --git a/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java b/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java index 9d4ef5c26505b..7d9a41985bbe3 100644 --- a/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java +++ b/java/tools/src/test/java/org/apache/arrow/tools/TestIntegration.java @@ -16,22 +16,8 @@ * specific language governing permissions and limitations * under the License. 
*/ -package org.apache.arrow.tools; - -import static org.apache.arrow.tools.ArrowFileTestFixtures.validateOutput; -import static org.apache.arrow.tools.ArrowFileTestFixtures.write; -import static org.apache.arrow.tools.ArrowFileTestFixtures.writeData; -import static org.apache.arrow.tools.ArrowFileTestFixtures.writeInput; -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertTrue; -import static org.junit.Assert.fail; -import java.io.BufferedReader; -import java.io.File; -import java.io.FileNotFoundException; -import java.io.IOException; -import java.io.StringReader; -import java.util.Map; +package org.apache.arrow.tools; import com.fasterxml.jackson.core.util.DefaultPrettyPrinter; import com.fasterxml.jackson.core.util.DefaultPrettyPrinter.NopIndenter; @@ -54,12 +40,75 @@ import org.junit.Test; import org.junit.rules.TemporaryFolder; +import java.io.BufferedReader; +import java.io.File; +import java.io.FileNotFoundException; +import java.io.IOException; +import java.io.StringReader; +import java.util.Map; + +import static org.apache.arrow.tools.ArrowFileTestFixtures.validateOutput; +import static org.apache.arrow.tools.ArrowFileTestFixtures.write; +import static org.apache.arrow.tools.ArrowFileTestFixtures.writeData; +import static org.apache.arrow.tools.ArrowFileTestFixtures.writeInput; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; +import static org.junit.Assert.fail; + public class TestIntegration { @Rule public TemporaryFolder testFolder = new TemporaryFolder(); private BufferAllocator allocator; + private ObjectMapper om = new ObjectMapper(); + + { + DefaultPrettyPrinter prettyPrinter = new DefaultPrettyPrinter(); + prettyPrinter.indentArraysWith(NopIndenter.instance); + om.setDefaultPrettyPrinter(prettyPrinter); + om.enable(SerializationFeature.INDENT_OUTPUT); + om.enable(SerializationFeature.ORDER_MAP_ENTRIES_BY_KEYS); + } + + static void writeInputFloat(File testInFile, BufferAllocator allocator, double... 
f) throws + FileNotFoundException, IOException { + try ( + BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, + Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", vectorAllocator, null)) { + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + Float8Writer floatWriter = rootWriter.float8("float"); + for (int i = 0; i < f.length; i++) { + floatWriter.setPosition(i); + floatWriter.writeFloat8(f[i]); + } + writer.setValueCount(f.length); + write(parent.getChild("root"), testInFile); + } + } + + static void writeInput2(File testInFile, BufferAllocator allocator) throws + FileNotFoundException, IOException { + int count = ArrowFileTestFixtures.COUNT; + try ( + BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, + Integer.MAX_VALUE); + MapVector parent = new MapVector("parent", vectorAllocator, null)) { + writeData(count, parent); + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + IntWriter intWriter = rootWriter.integer("int"); + BigIntWriter bigIntWriter = rootWriter.bigInt("bigInt"); + intWriter.setPosition(5); + intWriter.writeInt(999); + bigIntWriter.setPosition(4); + bigIntWriter.writeBigInt(777L); + writer.setValueCount(count); + write(parent.getChild("root"), testInFile); + } + } @Before public void init() { @@ -85,18 +134,21 @@ public void testValid() throws Exception { Integration integration = new Integration(); // convert it to json - String[] args1 = { "-arrow", testInFile.getAbsolutePath(), "-json", testJSONFile.getAbsolutePath(), "-command", Command.ARROW_TO_JSON.name()}; + String[] args1 = {"-arrow", testInFile.getAbsolutePath(), "-json", testJSONFile + .getAbsolutePath(), "-command", Command.ARROW_TO_JSON.name()}; integration.run(args1); // convert back to arrow - String[] args2 = { "-arrow", testOutFile.getAbsolutePath(), "-json", testJSONFile.getAbsolutePath(), "-command", Command.JSON_TO_ARROW.name()}; + String[] args2 = {"-arrow", testOutFile.getAbsolutePath(), "-json", testJSONFile + .getAbsolutePath(), "-command", Command.JSON_TO_ARROW.name()}; integration.run(args2); // check it is the same validateOutput(testOutFile, allocator); // validate arrow against json - String[] args3 = { "-arrow", testInFile.getAbsolutePath(), "-json", testJSONFile.getAbsolutePath(), "-command", Command.VALIDATE.name()}; + String[] args3 = {"-arrow", testInFile.getAbsolutePath(), "-json", testJSONFile + .getAbsolutePath(), "-command", Command.VALIDATE.name()}; integration.run(args3); } @@ -111,11 +163,13 @@ public void testJSONRoundTripWithVariableWidth() throws Exception { Integration integration = new Integration(); // convert to arrow - String[] args1 = { "-arrow", testOutFile.getAbsolutePath(), "-json", testJSONFile.getAbsolutePath(), "-command", Command.JSON_TO_ARROW.name()}; + String[] args1 = {"-arrow", testOutFile.getAbsolutePath(), "-json", testJSONFile + .getAbsolutePath(), "-command", Command.JSON_TO_ARROW.name()}; integration.run(args1); // convert back to json - String[] args2 = { "-arrow", testOutFile.getAbsolutePath(), "-json", testRoundTripJSONFile.getAbsolutePath(), "-command", Command.ARROW_TO_JSON.name()}; + String[] args2 = {"-arrow", testOutFile.getAbsolutePath(), "-json", testRoundTripJSONFile + .getAbsolutePath(), "-command", Command.ARROW_TO_JSON.name()}; integration.run(args2); BufferedReader orig = readNormalized(testJSONFile); @@ -139,11 +193,13 @@ public void 
testJSONRoundTripWithStruct() throws Exception { Integration integration = new Integration(); // convert to arrow - String[] args1 = { "-arrow", testOutFile.getAbsolutePath(), "-json", testJSONFile.getAbsolutePath(), "-command", Command.JSON_TO_ARROW.name()}; + String[] args1 = {"-arrow", testOutFile.getAbsolutePath(), "-json", testJSONFile + .getAbsolutePath(), "-command", Command.JSON_TO_ARROW.name()}; integration.run(args1); // convert back to json - String[] args2 = { "-arrow", testOutFile.getAbsolutePath(), "-json", testRoundTripJSONFile.getAbsolutePath(), "-command", Command.ARROW_TO_JSON.name()}; + String[] args2 = {"-arrow", testOutFile.getAbsolutePath(), "-json", testRoundTripJSONFile + .getAbsolutePath(), "-command", Command.ARROW_TO_JSON.name()}; integration.run(args2); BufferedReader orig = readNormalized(testJSONFile); @@ -156,22 +212,12 @@ public void testJSONRoundTripWithStruct() throws Exception { } } - private ObjectMapper om = new ObjectMapper(); - { - DefaultPrettyPrinter prettyPrinter = new DefaultPrettyPrinter(); - prettyPrinter.indentArraysWith(NopIndenter.instance); - om.setDefaultPrettyPrinter(prettyPrinter); - om.enable(SerializationFeature.INDENT_OUTPUT); - om.enable(SerializationFeature.ORDER_MAP_ENTRIES_BY_KEYS); - } - private BufferedReader readNormalized(File f) throws IOException { - Map tree = om.readValue(f, Map.class); + Map tree = om.readValue(f, Map.class); String normalized = om.writeValueAsString(tree); return new BufferedReader(new StringReader(normalized)); } - /** * the test should not be sensitive to small variations in float representation */ @@ -190,11 +236,13 @@ public void testFloat() throws Exception { Integration integration = new Integration(); // convert the "valid" file to json - String[] args1 = { "-arrow", testValidInFile.getAbsolutePath(), "-json", testJSONFile.getAbsolutePath(), "-command", Command.ARROW_TO_JSON.name()}; + String[] args1 = {"-arrow", testValidInFile.getAbsolutePath(), "-json", testJSONFile + .getAbsolutePath(), "-command", Command.ARROW_TO_JSON.name()}; integration.run(args1); // compare the "invalid" file to the "valid" json - String[] args3 = { "-arrow", testInvalidInFile.getAbsolutePath(), "-json", testJSONFile.getAbsolutePath(), "-command", Command.VALIDATE.name()}; + String[] args3 = {"-arrow", testInvalidInFile.getAbsolutePath(), "-json", testJSONFile + .getAbsolutePath(), "-command", Command.VALIDATE.name()}; // this should fail integration.run(args3); } @@ -214,11 +262,13 @@ public void testInvalid() throws Exception { Integration integration = new Integration(); // convert the "valid" file to json - String[] args1 = { "-arrow", testValidInFile.getAbsolutePath(), "-json", testJSONFile.getAbsolutePath(), "-command", Command.ARROW_TO_JSON.name()}; + String[] args1 = {"-arrow", testValidInFile.getAbsolutePath(), "-json", testJSONFile + .getAbsolutePath(), "-command", Command.ARROW_TO_JSON.name()}; integration.run(args1); // compare the "invalid" file to the "valid" json - String[] args3 = { "-arrow", testInvalidInFile.getAbsolutePath(), "-json", testJSONFile.getAbsolutePath(), "-command", Command.VALIDATE.name()}; + String[] args3 = {"-arrow", testInvalidInFile.getAbsolutePath(), "-json", testJSONFile + .getAbsolutePath(), "-command", Command.VALIDATE.name()}; // this should fail try { integration.run(args3); @@ -229,39 +279,4 @@ public void testInvalid() throws Exception { } } - - static void writeInputFloat(File testInFile, BufferAllocator allocator, double... 
f) throws FileNotFoundException, IOException { - try ( - BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); - MapVector parent = new MapVector("parent", vectorAllocator, null)) { - ComplexWriter writer = new ComplexWriterImpl("root", parent); - MapWriter rootWriter = writer.rootAsMap(); - Float8Writer floatWriter = rootWriter.float8("float"); - for (int i = 0; i < f.length; i++) { - floatWriter.setPosition(i); - floatWriter.writeFloat8(f[i]); - } - writer.setValueCount(f.length); - write(parent.getChild("root"), testInFile); - } - } - - static void writeInput2(File testInFile, BufferAllocator allocator) throws FileNotFoundException, IOException { - int count = ArrowFileTestFixtures.COUNT; - try ( - BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); - MapVector parent = new MapVector("parent", vectorAllocator, null)) { - writeData(count, parent); - ComplexWriter writer = new ComplexWriterImpl("root", parent); - MapWriter rootWriter = writer.rootAsMap(); - IntWriter intWriter = rootWriter.integer("int"); - BigIntWriter bigIntWriter = rootWriter.bigInt("bigInt"); - intWriter.setPosition(5); - intWriter.writeInt(999); - bigIntWriter.setPosition(4); - bigIntWriter.writeBigInt(777L); - writer.setValueCount(count); - write(parent.getChild("root"), testInFile); - } - } }
From 55d8f99c351c22c2357924b4e70fcef7c8fd119a Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Tue, 21 Mar 2017 18:15:52 -0700 Subject: [PATCH 0392/1644] ARROW-677: [java] Fix checkstyle jcl-over-slf4j conflict issue Author: Julien Le Dem Closes #412 from julienledem/checkstyle_slf4j and squashes the following commits: 2fda6b8 [Julien Le Dem] ARROW-677: [java] Fix checkstyle jcl-over-slf4j conflict issue --- java/pom.xml | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/java/pom.xml b/java/pom.xml index 774761f0c1e66..5edd605e8eedb 100644 --- a/java/pom.xml +++ b/java/pom.xml @@ -286,6 +286,11 @@ <artifactId>guava</artifactId> <version>${dep.guava.version}</version> + <dependency> + <groupId>org.slf4j</groupId> + <artifactId>jcl-over-slf4j</artifactId> + <version>1.7.5</version> + </dependency>
From 82b15a4c38d5bc3bf0e2e1ff27a0dfc7c8929551 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Wed, 22 Mar 2017 00:15:19 -0400 Subject: [PATCH 0393/1644] ARROW-678: [GLib] Fix dependencies libarrow-io-glib.so should link to libarrow-glib.so. libarrow-ipc-glib.so should link to libarrow-glib.so and libarrow-io-glib.so.
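As a rough sketch of the underlinking problem this patch fixes (all names below are invented for illustration and are not arrow-glib API), consider an intermediate shared library that calls into a lower-level one without recording the dependency:

    // low.cc -> liblow.so (hypothetical stand-in for libarrow-glib.so)
    extern "C" int low_answer() { return 42; }

    // high.cc -> libhigh.so (hypothetical stand-in for libarrow-io-glib.so)
    extern "C" int low_answer();
    extern "C" int high_answer() { return low_answer(); }

    // app.cc -> linked with only -lhigh, the way an application might link only
    // against libarrow-io-glib.so and expect libarrow-glib.so to come along
    extern "C" int high_answer();
    int main() { return high_answer(); }

If libhigh.so is linked without naming liblow.so, it carries an undefined reference to low_answer and no DT_NEEDED entry for liblow.so, so linking app.cc fails unless the application adds -llow itself. Listing the dependency in LIBADD, as the hunks below do, records it in the intermediate library so the dynamic linker can resolve it automatically.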
Author: Kouhei Sutou Closes #413 from kou/glib-fix-dependencies and squashes the following commits: f67c04e [Kouhei Sutou] [GLib] Fix dependencies --- c_glib/arrow-glib/Makefile.am | 16 +++++++--------- 1 file changed, 7 insertions(+), 9 deletions(-) diff --git a/c_glib/arrow-glib/Makefile.am b/c_glib/arrow-glib/Makefile.am index 7699594b7ade7..a948007741cbd 100644 --- a/c_glib/arrow-glib/Makefile.am +++ b/c_glib/arrow-glib/Makefile.am @@ -232,7 +232,8 @@ libarrow_io_glib_la_CXXFLAGS = \ libarrow_io_glib_la_LIBADD = \ $(GLIB_LIBS) \ - $(ARROW_IO_LIBS) + $(ARROW_IO_LIBS) \ + libarrow-glib.la libarrow_io_glib_la_headers = \ arrow-io-glib.h \ @@ -329,7 +330,9 @@ libarrow_ipc_glib_la_CXXFLAGS = \ libarrow_ipc_glib_la_LIBADD = \ $(GLIB_LIBS) \ - $(ARROW_IPC_LIBS) + $(ARROW_IPC_LIBS) \ + libarrow-glib.la \ + libarrow-io-glib.la libarrow_ipc_glib_la_headers = \ arrow-ipc-glib.h \ @@ -448,9 +451,7 @@ ArrowIO_1_0_gir_INCLUDES = \ GObject-2.0 ArrowIO_1_0_gir_CFLAGS = \ $(AM_CPPFLAGS) -ArrowIO_1_0_gir_LIBS = \ - libarrow-io-glib.la \ - libarrow-glib.la +ArrowIO_1_0_gir_LIBS = libarrow-io-glib.la ArrowIO_1_0_gir_FILES = $(libarrow_io_glib_la_sources) ArrowIO_1_0_gir_SCANNERFLAGS = \ --include-uninstalled=$(builddir)/Arrow-1.0.gir \ @@ -469,10 +470,7 @@ ArrowIPC_1_0_gir_INCLUDES = \ GObject-2.0 ArrowIPC_1_0_gir_CFLAGS = \ $(AM_CPPFLAGS) -ArrowIPC_1_0_gir_LIBS = \ - libarrow-ipc-glib.la \ - libarrow-io-glib.la \ - libarrow-glib.la +ArrowIPC_1_0_gir_LIBS = libarrow-ipc-glib.la ArrowIPC_1_0_gir_FILES = $(libarrow_ipc_glib_la_sources) ArrowIPC_1_0_gir_SCANNERFLAGS = \ --include-uninstalled=$(builddir)/Arrow-1.0.gir \ From d25286718c283bec0b1fd4cbe47ddb3f159c29b5 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Wed, 22 Mar 2017 00:17:01 -0400 Subject: [PATCH 0394/1644] ARROW-675: [GLib] Update package metadata Author: Kouhei Sutou Closes #411 from kou/glib-update-metadata and squashes the following commits: da7bba6 [Kouhei Sutou] [GLib] Update package metadata --- c_glib/configure.ac | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/c_glib/configure.ac b/c_glib/configure.ac index 85f7eec3cb557..c6913437d93f8 100644 --- a/c_glib/configure.ac +++ b/c_glib/configure.ac @@ -18,7 +18,10 @@ AC_PREREQ(2.65) m4_define([arrow_glib_version], m4_include(version)) -AC_INIT([arrow-glib], arrow_glib_version, [kou@clear-code.com]) +AC_INIT([arrow-glib], + arrow_glib_version, + [https://issues.apache.org/jira/browse/ARROW], + [apache-arrow-glib]) AC_CONFIG_AUX_DIR([config]) AC_CONFIG_MACRO_DIR([m4]) From 96734efb73852f2d8372f72d7c56e8fb3ab4e516 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 22 Mar 2017 09:26:09 -0400 Subject: [PATCH 0395/1644] ARROW-654: [C++] Serialize timezone in IPC metadata Author: Wes McKinney Closes #416 from wesm/ARROW-654 and squashes the following commits: 001708e [Wes McKinney] Fix API change in Python bindings 3729cf9 [Wes McKinney] Serialize timezone in IPC metadata --- cpp/src/arrow/ipc/feather-test.cc | 2 +- cpp/src/arrow/ipc/feather.cc | 2 +- cpp/src/arrow/ipc/metadata.cc | 16 ++++++++++++++-- cpp/src/arrow/ipc/test-common.h | 2 +- cpp/src/arrow/memory_pool.cc | 2 +- cpp/src/arrow/type-test.cc | 2 +- cpp/src/arrow/type.cc | 4 ++-- cpp/src/arrow/type.h | 4 ++-- python/pyarrow/includes/libarrow.pxd | 4 ++-- python/pyarrow/schema.pyx | 2 +- 10 files changed, 26 insertions(+), 14 deletions(-) diff --git a/cpp/src/arrow/ipc/feather-test.cc b/cpp/src/arrow/ipc/feather-test.cc index 078c3e10aff29..2513887f75903 100644 --- a/cpp/src/arrow/ipc/feather-test.cc +++ 
b/cpp/src/arrow/ipc/feather-test.cc @@ -355,7 +355,7 @@ TEST_F(TestTableWriter, TimeTypes) { auto f0 = field("f0", date32()); auto f1 = field("f1", time(TimeUnit::MILLI)); auto f2 = field("f2", timestamp(TimeUnit::NANO)); - auto f3 = field("f3", timestamp("US/Los_Angeles", TimeUnit::SECOND)); + auto f3 = field("f3", timestamp(TimeUnit::SECOND, "US/Los_Angeles")); std::shared_ptr schema(new Schema({f0, f1, f2, f3})); std::vector values_vec = {0, 1, 2, 3, 4, 5, 6}; diff --git a/cpp/src/arrow/ipc/feather.cc b/cpp/src/arrow/ipc/feather.cc index 72bbaa4da3571..0dd9a8183fdc2 100644 --- a/cpp/src/arrow/ipc/feather.cc +++ b/cpp/src/arrow/ipc/feather.cc @@ -287,7 +287,7 @@ class TableReader::TableReaderImpl { } else { tz = ""; } - *out = std::make_shared(tz, unit); + *out = timestamp(unit, tz); } break; case fbs::TypeMetadata_DateMetadata: *out = date32(); diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index a418d4893dd40..4dfda543ebf6b 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -46,6 +46,7 @@ using LargeRecordBatchOffset = flatbuffers::Offset; using RecordBatchOffset = flatbuffers::Offset; using VectorLayoutOffset = flatbuffers::Offset; using Offset = flatbuffers::Offset; +using FBString = flatbuffers::Offset; static constexpr flatbuf::MetadataVersion kMetadataVersion = flatbuf::MetadataVersion_V2; @@ -250,7 +251,12 @@ static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, } case flatbuf::Type_Timestamp: { auto ts_type = static_cast(type_data); - *out = timestamp(FromFlatbufferUnit(ts_type->unit())); + TimeUnit unit = FromFlatbufferUnit(ts_type->unit()); + if (ts_type->timezone() != 0 && ts_type->timezone()->Length() > 0) { + *out = timestamp(unit, ts_type->timezone()->str()); + } else { + *out = timestamp(unit); + } return Status::OK(); } case flatbuf::Type_Interval: @@ -364,7 +370,13 @@ static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, case Type::TIMESTAMP: { const auto& ts_type = static_cast(*type); *out_type = flatbuf::Type_Timestamp; - *offset = flatbuf::CreateTimestamp(fbb, ToFlatbufferUnit(ts_type.unit)).Union(); + + flatbuf::TimeUnit fb_unit = ToFlatbufferUnit(ts_type.unit); + FBString fb_timezone = 0; + if (ts_type.timezone.size() > 0) { + fb_timezone = fbb.CreateString(ts_type.timezone); + } + *offset = flatbuf::CreateTimestamp(fbb, fb_unit, fb_timezone).Union(); } break; case Type::LIST: *out_type = flatbuf::Type_List; diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index ba203b090b3b7..330af0c6ced20 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -497,7 +497,7 @@ Status MakeDate32(std::shared_ptr* out) { Status MakeTimestamps(std::shared_ptr* out) { std::vector is_valid = {true, true, true, false, true, true, true}; auto f0 = field("f0", timestamp(TimeUnit::MILLI)); - auto f1 = field("f1", timestamp(TimeUnit::NANO)); + auto f1 = field("f1", timestamp(TimeUnit::NANO, "America/New_York")); auto f2 = field("f2", timestamp(TimeUnit::SECOND)); std::shared_ptr schema(new Schema({f0, f1, f2})); diff --git a/cpp/src/arrow/memory_pool.cc b/cpp/src/arrow/memory_pool.cc index cf01a02938385..7992f229862bf 100644 --- a/cpp/src/arrow/memory_pool.cc +++ b/cpp/src/arrow/memory_pool.cc @@ -19,10 +19,10 @@ #include #include +#include #include #include #include -#include #include "arrow/status.h" #include "arrow/util/logging.h" diff --git a/cpp/src/arrow/type-test.cc b/cpp/src/arrow/type-test.cc index ddfff8745b97e..22aa7eba8a3e8 
100644 --- a/cpp/src/arrow/type-test.cc +++ b/cpp/src/arrow/type-test.cc @@ -209,7 +209,7 @@ TEST(TestTimestampType, Equals) { TEST(TestTimestampType, ToString) { auto t1 = timestamp(TimeUnit::MILLI); - auto t2 = timestamp("US/Eastern", TimeUnit::NANO); + auto t2 = timestamp(TimeUnit::NANO, "US/Eastern"); auto t3 = timestamp(TimeUnit::SECOND); auto t4 = timestamp(TimeUnit::MICRO); diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index ee0a89ab8abea..64070cb13abd0 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -244,8 +244,8 @@ std::shared_ptr timestamp(TimeUnit unit) { return std::make_shared(unit); } -std::shared_ptr timestamp(const std::string& timezone, TimeUnit unit) { - return std::make_shared(timezone, unit); +std::shared_ptr timestamp(TimeUnit unit, const std::string& timezone) { + return std::make_shared(unit, timezone); } std::shared_ptr time(TimeUnit unit) { diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index adc3161e9551a..27b28d2f42bc0 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -532,7 +532,7 @@ struct ARROW_EXPORT TimestampType : public FixedWidthType { explicit TimestampType(TimeUnit unit = TimeUnit::MILLI) : FixedWidthType(Type::TIMESTAMP), unit(unit) {} - explicit TimestampType(const std::string& timezone, TimeUnit unit = TimeUnit::MILLI) + explicit TimestampType(TimeUnit unit, const std::string& timezone) : FixedWidthType(Type::TIMESTAMP), unit(unit), timezone(timezone) {} TimestampType(const TimestampType& other) : TimestampType(other.unit) {} @@ -603,7 +603,7 @@ std::shared_ptr ARROW_EXPORT list(const std::shared_ptr& val std::shared_ptr ARROW_EXPORT timestamp(TimeUnit unit); std::shared_ptr ARROW_EXPORT timestamp( - const std::string& timezone, TimeUnit unit); + TimeUnit unit, const std::string& timezone); std::shared_ptr ARROW_EXPORT time(TimeUnit unit); std::shared_ptr ARROW_EXPORT struct_( diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 705fe6b4a55ca..2d698d35b1b84 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -99,14 +99,14 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: shared_ptr[CArray] dictionary() shared_ptr[CDataType] timestamp(TimeUnit unit) - shared_ptr[CDataType] timestamp(const c_string& timezone, TimeUnit unit) + shared_ptr[CDataType] timestamp(TimeUnit unit, const c_string& timezone) cdef cppclass CMemoryPool" arrow::MemoryPool": int64_t bytes_allocated() cdef cppclass CLoggingMemoryPool" arrow::LoggingMemoryPool"(CMemoryPool): CLoggingMemoryPool(CMemoryPool*) - + cdef cppclass CBuffer" arrow::Buffer": uint8_t* data() int64_t size() diff --git a/python/pyarrow/schema.pyx b/python/pyarrow/schema.pyx index 4bc938df668f8..ee38144e6e3db 100644 --- a/python/pyarrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ -314,7 +314,7 @@ def timestamp(unit_str, tz=None): tz = tz.zone c_timezone = tobytes(tz) - out.init(la.timestamp(c_timezone, unit)) + out.init(la.timestamp(unit, c_timezone)) return out From b179ad2d80c3f3c1ab81bfa9ff0c343fb47b148a Mon Sep 17 00:00:00 2001 From: Max Risuhin Date: Wed, 22 Mar 2017 10:05:19 -0400 Subject: [PATCH 0396/1644] =?UTF-8?q?ARROW-681:=20[C++]=20Disable=20boost'?= =?UTF-8?q?s=20autolinking=20if=20shared=20boost=20is=20used=20=E2=80=A6?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit …on Windows; Correct linking with IMPORTED_IMPLIB of 3rd party shared libs on WIndows. 
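For context, a minimal sketch (assuming Boost.Filesystem is installed; this is not part of the patch) of what the two add_definitions() calls in the hunk below mean for every translation unit compiled with MSVC:

    // Boost headers normally auto-link on MSVC: <boost/config/auto_link.hpp>
    // injects a directive such as
    //   #pragma comment(lib, "boost_filesystem-vc140-mt-1_60.lib")
    // into any translation unit that includes a Boost header.
    #define BOOST_ALL_NO_LIB    // suppress the #pragma comment(lib, ...) injection
    #define BOOST_ALL_DYN_LINK  // declare Boost symbols __declspec(dllimport) for DLL use
    #include <boost/filesystem.hpp>

    int main() {
      // The linker now only sees the import libraries CMake passes explicitly
      // (BOOST_SHARED_SYSTEM_LIBRARY and friends), not whatever library name
      // the auto-link pragma would have guessed.
      return boost::filesystem::exists("cpp/CMakeLists.txt") ? 0 : 1;
    }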
Author: Max Risuhin Closes #415 from MaxRis/master and squashes the following commits: 9fd851d [Max Risuhin] ARROW-681: [C++] Disable boost's autolinking if shared boost is used on Windows; Correct linking with IMPORTED_IMPLIB of 3rd party shared libs on Windows. --- cpp/CMakeLists.txt | 31 +++++++++++++++++++++++++++---- 1 file changed, 27 insertions(+), 4 deletions(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 956658a82524c..84158cc008132 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -379,8 +379,15 @@ function(ADD_THIRDPARTY_LIB LIB_NAME) SET(AUG_LIB_NAME "${LIB_NAME}_shared") add_library(${AUG_LIB_NAME} SHARED IMPORTED) - set_target_properties(${AUG_LIB_NAME} - PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}") + + if(MSVC) + # Mark the ".lib" location as part of a Windows DLL + set_target_properties(${AUG_LIB_NAME} + PROPERTIES IMPORTED_IMPLIB "${ARG_SHARED_LIB}") + else() + set_target_properties(${AUG_LIB_NAME} + PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}") + endif() message("Added shared library dependency ${LIB_NAME}: ${ARG_SHARED_LIB}") elseif(ARG_STATIC_LIB) add_library(${LIB_NAME} STATIC IMPORTED) @@ -397,8 +404,15 @@ function(ADD_THIRDPARTY_LIB LIB_NAME) PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}") SET(AUG_LIB_NAME "${LIB_NAME}_shared") add_library(${AUG_LIB_NAME} SHARED IMPORTED) - set_target_properties(${AUG_LIB_NAME} - PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}") + + if(MSVC) + # Mark the ".lib" location as part of a Windows DLL + set_target_properties(${AUG_LIB_NAME} + PROPERTIES IMPORTED_IMPLIB "${ARG_SHARED_LIB}") + else() + set_target_properties(${AUG_LIB_NAME} + PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}") + endif() message("Added shared library dependency ${LIB_NAME}: ${ARG_SHARED_LIB}") else() message(FATAL_ERROR "No static or shared library provided for ${LIB_NAME}") @@ -418,6 +432,15 @@ set(Boost_USE_MULTITHREADED ON) if (ARROW_BOOST_USE_SHARED) # Find shared Boost libraries. set(Boost_USE_STATIC_LIBS OFF) + + if(MSVC) + # disable autolinking in boost + add_definitions(-DBOOST_ALL_NO_LIB) + + # force all boost libraries to dynamic link + add_definitions(-DBOOST_ALL_DYN_LINK) + endif() + find_package(Boost COMPONENTS system filesystem REQUIRED) if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG") set(BOOST_SHARED_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_DEBUG})
From 5fda24776d82cf120525d298ba261ddd02e5fcc8 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Wed, 22 Mar 2017 11:04:42 -0400 Subject: [PATCH 0397/1644] ARROW-680: [C++] Support CMake 2 or older again GNUInstallDirs in CMake 2 always uses the multiarch library directory.
See also: https://github.com/Kitware/CMake/commit/620939e4e6f5a61cd5c0fac2704de4bfda0eb7ef Author: Kouhei Sutou Closes #419 from kou/cpp-support-cmake-2 and squashes the following commits: 684cb2b [Kouhei Sutou] [C++] Support CMake 2 or older again --- cpp/CMakeLists.txt | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 84158cc008132..61e645da20e75 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -28,7 +28,13 @@ set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_SOURCE_DIR}/cmake_modules") include(CMakeParseArguments) include(ExternalProject) -include(GNUInstallDirs) + +if(CMAKE_MAJOR_VERSION LESS 3) + set(CMAKE_INSTALL_INCLUDEDIR "include") + set(CMAKE_INSTALL_LIBDIR "lib") +else() + include(GNUInstallDirs) +endif() set(ARROW_SO_VERSION "0") set(ARROW_ABI_VERSION "${ARROW_SO_VERSION}.0.0")
From 1b957dcf1a025a45a858091e51138b4a75c3826a Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Wed, 22 Mar 2017 11:48:41 -0400 Subject: [PATCH 0398/1644] ARROW-688: [C++] Use CMAKE_INSTALL_INCLUDEDIR for consistency Using CMAKE_INSTALL_INCLUDEDIR isn't required; it's just for consistency. We already used CMAKE_INSTALL_INCLUDEDIR in cpp/src/arrow/CMakeLists.txt. Alternatively, we could revert the CMAKE_INSTALL_INCLUDEDIR change in cpp/src/arrow/CMakeLists.txt. Author: Kouhei Sutou Closes #420 from kou/cpp-use-cmake-install-includedir-for-consistency and squashes the following commits: ba7da0d [Kouhei Sutou] [C++] Use CMAKE_INSTALL_INCLUDEDIR for consistency --- cpp/src/arrow/io/CMakeLists.txt | 2 +- cpp/src/arrow/ipc/CMakeLists.txt | 2 +- cpp/src/arrow/jemalloc/CMakeLists.txt | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) diff --git a/cpp/src/arrow/io/CMakeLists.txt b/cpp/src/arrow/io/CMakeLists.txt index af3acbf06d1ef..8aabf6496f8f7 100644 --- a/cpp/src/arrow/io/CMakeLists.txt +++ b/cpp/src/arrow/io/CMakeLists.txt @@ -109,7 +109,7 @@ install(FILES hdfs.h interfaces.h memory.h - DESTINATION include/arrow/io) + DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}/arrow/io") # pkg-config support configure_file(arrow-io.pc.in diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index 5d470df0309b3..3a98a380e7019 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -144,7 +144,7 @@ install(FILES metadata.h reader.h writer.h - DESTINATION include/arrow/ipc) + DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}/arrow/ipc") # pkg-config support configure_file(arrow-ipc.pc.in diff --git a/cpp/src/arrow/jemalloc/CMakeLists.txt b/cpp/src/arrow/jemalloc/CMakeLists.txt index c7e6c6af97cff..b8e6e231a3dca 100644 --- a/cpp/src/arrow/jemalloc/CMakeLists.txt +++ b/cpp/src/arrow/jemalloc/CMakeLists.txt @@ -104,7 +104,7 @@ ARROW_BENCHMARK_LINK_LIBRARIES(jemalloc-builder-benchmark # Headers: top level install(FILES memory_pool.h - DESTINATION include/arrow/jemalloc) + DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}/arrow/jemalloc") # pkg-config support configure_file(arrow-jemalloc.pc.in
From 36103143b5975138522f4e54f8b21565a34f6504 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Wed, 22 Mar 2017 18:34:52 +0100 Subject: [PATCH 0399/1644] ARROW-689: [GLib] Fix install directories Header files should be installed into `${PREFIX}/include/arrow-glib/` instead of `${PREFIX}/include/apache-arrow-glib/`. Documents should be installed into `${PREFIX}/share/doc/arrow-glib/` instead of `${PREFIX}/share/doc/apache-arrow-glib/`.
We needed to change install directories when we changed `AC_INIT()`'s 3rd argument to apache-arrow-glib... Author: Kouhei Sutou Closes #421 from kou/glib-fix-install-directory and squashes the following commits: 65e5cee [Kouhei Sutou] [GLib] Fix install directories --- c_glib/Makefile.am | 3 ++- c_glib/arrow-glib/Makefile.am | 3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/c_glib/Makefile.am b/c_glib/Makefile.am index c078b0889d4ff..40e8395a56824 100644 --- a/c_glib/Makefile.am +++ b/c_glib/Makefile.am @@ -27,6 +27,7 @@ EXTRA_DIST = \ LICENSE.txt \ version -doc_DATA = \ +arrow_glib_docdir = ${datarootdir}/doc/arrow-glib +arrow_glib_doc_DATA = \ README.md \ LICENSE.txt diff --git a/c_glib/arrow-glib/Makefile.am b/c_glib/arrow-glib/Makefile.am index a948007741cbd..a72d1e874402a 100644 --- a/c_glib/arrow-glib/Makefile.am +++ b/c_glib/arrow-glib/Makefile.am @@ -403,7 +403,8 @@ stamp-ipc-enums.c: $(libarrow_ipc_glib_la_headers) ipc-enums.c.template $(libarrow_ipc_glib_la_headers)) > ipc-enums.c touch $@ -pkginclude_HEADERS = \ +arrow_glib_includedir = $(includedir)/arrow-glib +arrow_glib_include_HEADERS = \ $(libarrow_glib_la_headers) \ $(libarrow_glib_la_cpp_headers) \ $(libarrow_glib_la_generated_headers) \ From 71424c20d31addb37cf7db56561790ca69db0430 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 22 Mar 2017 14:03:40 -0400 Subject: [PATCH 0400/1644] ARROW-683: [C++/Python] Refactor to make Date32 and Date64 types for new metadata. Test IPC roundtrip Maintains existing Python behavior (datetime.date getting converted to milliseconds) Author: Wes McKinney Closes #418 from wesm/ARROW-683 and squashes the following commits: 69f156b [Wes McKinney] Add autoconf-archive to README for system requirements 10988ad [Wes McKinney] Remove hacks for ax_cxx_compile_stdcxx_11 334558d [Wes McKinney] Fix glib for date32/date64. Add ax_cxx_compile_stdcxx_11.m4 macro for older autoconf 93cf8d6 [Wes McKinney] Refactor to make Date32 and Date64 types for new metadata. 
Test IPC roundtrips, maintain existing Python behavior --- c_glib/README.md | 5 ++- c_glib/arrow-glib/type.cpp | 6 ++- c_glib/arrow-glib/type.h | 6 ++- cpp/src/arrow/array.cc | 4 +- cpp/src/arrow/array.h | 4 +- cpp/src/arrow/builder.cc | 5 ++- cpp/src/arrow/builder.h | 2 +- cpp/src/arrow/compare.cc | 10 +++-- cpp/src/arrow/ipc/ipc-json-test.cc | 8 ++-- cpp/src/arrow/ipc/ipc-read-write-test.cc | 8 ++-- cpp/src/arrow/ipc/json-internal.cc | 16 ++++---- cpp/src/arrow/ipc/metadata.cc | 18 +++++++-- cpp/src/arrow/ipc/test-common.h | 29 +++++--------- cpp/src/arrow/ipc/writer.cc | 2 +- cpp/src/arrow/loader.cc | 2 +- cpp/src/arrow/pretty_print.cc | 4 +- cpp/src/arrow/type-test.cc | 8 ++++ cpp/src/arrow/type.cc | 12 +++--- cpp/src/arrow/type.h | 40 ++++++++++---------- cpp/src/arrow/type_fwd.h | 8 ++-- cpp/src/arrow/type_traits.h | 8 ++-- python/pyarrow/__init__.py | 2 +- python/pyarrow/array.pyx | 9 ++++- python/pyarrow/includes/libarrow.pxd | 8 +++- python/pyarrow/scalar.pyx | 19 +++++++--- python/pyarrow/schema.pyx | 13 +++++-- python/pyarrow/tests/test_convert_builtin.py | 2 +- python/pyarrow/tests/test_convert_pandas.py | 2 +- python/src/pyarrow/adapters/builtin.cc | 6 +-- python/src/pyarrow/adapters/pandas.cc | 18 ++++----- python/src/pyarrow/helpers.cc | 3 +- python/src/pyarrow/type_traits.h | 2 +- 32 files changed, 166 insertions(+), 123 deletions(-) diff --git a/c_glib/README.md b/c_glib/README.md index 4008015a56438..84027bf2cb3db 100644 --- a/c_glib/README.md +++ b/c_glib/README.md @@ -58,7 +58,7 @@ to build Arrow GLib. You can install them by the followings: On Debian GNU/Linux or Ubuntu: ```text -% sudo apt install -y -V gtk-doc-tools libgirepository1.0-dev +% sudo apt install -y -V gtk-doc-tools autoconf-archive libgirepository1.0-dev ``` On CentOS 7 or later: @@ -76,7 +76,8 @@ On macOS with [Homebrew](https://brew.sh/): Now, you can build Arrow GLib: ```text -% cd glib +% cd c_glib +% ./autogen.sh % ./configure --enable-gtk-doc % make % sudo make install diff --git a/c_glib/arrow-glib/type.cpp b/c_glib/arrow-glib/type.cpp index 56cbc212211eb..2e59647884551 100644 --- a/c_glib/arrow-glib/type.cpp +++ b/c_glib/arrow-glib/type.cpp @@ -66,8 +66,10 @@ garrow_type_from_raw(arrow::Type::type type) return GARROW_TYPE_STRING; case arrow::Type::type::BINARY: return GARROW_TYPE_BINARY; - case arrow::Type::type::DATE: - return GARROW_TYPE_DATE; + case arrow::Type::type::DATE32: + return GARROW_TYPE_DATE32; + case arrow::Type::type::DATE64: + return GARROW_TYPE_DATE64; case arrow::Type::type::TIMESTAMP: return GARROW_TYPE_TIMESTAMP; case arrow::Type::type::TIME: diff --git a/c_glib/arrow-glib/type.h b/c_glib/arrow-glib/type.h index 48d2801dad42c..cd6137cb5ba5f 100644 --- a/c_glib/arrow-glib/type.h +++ b/c_glib/arrow-glib/type.h @@ -40,7 +40,8 @@ G_BEGIN_DECLS * @GARROW_TYPE_DOUBLE: 8-byte floating point value. * @GARROW_TYPE_STRING: UTF-8 variable-length string. * @GARROW_TYPE_BINARY: Variable-length bytes (no guarantee of UTF-8-ness). - * @GARROW_TYPE_DATE: By default, int32 days since the UNIX epoch. + * @GARROW_TYPE_DATE32: int32 days since the UNIX epoch. + * @GARROW_TYPE_DATE64: int64 milliseconds since the UNIX epoch. * @GARROW_TYPE_TIMESTAMP: Exact timestamp encoded with int64 since UNIX epoch. * Default unit millisecond. * @GARROW_TYPE_TIME: Exact time encoded with int64, default unit millisecond. 
@@ -70,7 +71,8 @@ typedef enum { GARROW_TYPE_DOUBLE, GARROW_TYPE_STRING, GARROW_TYPE_BINARY, - GARROW_TYPE_DATE, + GARROW_TYPE_DATE32, + GARROW_TYPE_DATE64, GARROW_TYPE_TIMESTAMP, GARROW_TYPE_TIME, GARROW_TYPE_INTERVAL, diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index 36b3fccf79ed0..4fa2b2b521f59 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -489,8 +489,8 @@ ARRAY_VISITOR_DEFAULT(DoubleArray); ARRAY_VISITOR_DEFAULT(BinaryArray); ARRAY_VISITOR_DEFAULT(StringArray); ARRAY_VISITOR_DEFAULT(FixedWidthBinaryArray); -ARRAY_VISITOR_DEFAULT(DateArray); ARRAY_VISITOR_DEFAULT(Date32Array); +ARRAY_VISITOR_DEFAULT(Date64Array); ARRAY_VISITOR_DEFAULT(TimeArray); ARRAY_VISITOR_DEFAULT(TimestampArray); ARRAY_VISITOR_DEFAULT(IntervalArray); @@ -515,8 +515,8 @@ template class NumericArray; template class NumericArray; template class NumericArray; template class NumericArray; -template class NumericArray; template class NumericArray; +template class NumericArray; template class NumericArray; template class NumericArray; template class NumericArray; diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 50faf0892e8c0..e66ac505d5dbf 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -58,8 +58,8 @@ class ARROW_EXPORT ArrayVisitor { virtual Status Visit(const StringArray& array); virtual Status Visit(const BinaryArray& array); virtual Status Visit(const FixedWidthBinaryArray& array); - virtual Status Visit(const DateArray& array); virtual Status Visit(const Date32Array& array); + virtual Status Visit(const Date64Array& array); virtual Status Visit(const TimeArray& array); virtual Status Visit(const TimestampArray& array); virtual Status Visit(const IntervalArray& array); @@ -559,8 +559,8 @@ extern template class ARROW_EXPORT NumericArray; extern template class ARROW_EXPORT NumericArray; extern template class ARROW_EXPORT NumericArray; extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; extern template class ARROW_EXPORT NumericArray; #if defined(__GNUC__) && !defined(__clang__) diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index b65a4928ec999..483d6f0a425ea 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -238,8 +238,8 @@ template class PrimitiveBuilder; template class PrimitiveBuilder; template class PrimitiveBuilder; template class PrimitiveBuilder; -template class PrimitiveBuilder; template class PrimitiveBuilder; +template class PrimitiveBuilder; template class PrimitiveBuilder; template class PrimitiveBuilder; template class PrimitiveBuilder; @@ -531,7 +531,8 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, BUILDER_CASE(INT32, Int32Builder); BUILDER_CASE(UINT64, UInt64Builder); BUILDER_CASE(INT64, Int64Builder); - BUILDER_CASE(DATE, DateBuilder); + BUILDER_CASE(DATE32, Date32Builder); + BUILDER_CASE(DATE64, Date64Builder); case Type::TIMESTAMP: out->reset(new TimestampBuilder(pool, type)); return Status::OK(); diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index 07b7cfcb3a964..7cefa649cbf71 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -232,8 +232,8 @@ using Int32Builder = NumericBuilder; using Int64Builder = NumericBuilder; using TimestampBuilder = NumericBuilder; using TimeBuilder = NumericBuilder; -using DateBuilder = NumericBuilder; using Date32Builder = NumericBuilder; +using Date64Builder = 
NumericBuilder; using HalfFloatBuilder = NumericBuilder; using FloatBuilder = NumericBuilder; diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc index 86ed8ccecd1ea..3e6ecefc5ca5b 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -169,12 +169,14 @@ class RangeEqualsVisitor : public ArrayVisitor { return Status::OK(); } - Status Visit(const DateArray& left) override { return CompareValues(left); } - Status Visit(const Date32Array& left) override { return CompareValues(left); } + Status Visit(const Date64Array& left) override { + return CompareValues(left); + } + Status Visit(const TimeArray& left) override { return CompareValues(left); } Status Visit(const TimestampArray& left) override { @@ -409,10 +411,10 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { Status Visit(const DoubleArray& left) override { return ComparePrimitive(left); } - Status Visit(const DateArray& left) override { return ComparePrimitive(left); } - Status Visit(const Date32Array& left) override { return ComparePrimitive(left); } + Status Visit(const Date64Array& left) override { return ComparePrimitive(left); } + Status Visit(const TimeArray& left) override { return ComparePrimitive(left); } Status Visit(const TimestampArray& left) override { return ComparePrimitive(left); } diff --git a/cpp/src/arrow/ipc/ipc-json-test.cc b/cpp/src/arrow/ipc/ipc-json-test.cc index 4c18a496f4c80..fd35182751948 100644 --- a/cpp/src/arrow/ipc/ipc-json-test.cc +++ b/cpp/src/arrow/ipc/ipc-json-test.cc @@ -96,15 +96,17 @@ void CheckPrimitive(const std::shared_ptr& type, } TEST(TestJsonSchemaWriter, FlatTypes) { + // TODO + // field("f14", date32()) std::vector> fields = {field("f0", int8()), field("f1", int16(), false), field("f2", int32()), field("f3", int64(), false), field("f4", uint8()), field("f5", uint16()), field("f6", uint32()), field("f7", uint64()), field("f8", float32()), field("f9", float64()), field("f10", utf8()), field("f11", binary()), field("f12", list(int32())), field("f13", struct_({field("s1", int32()), field("s2", utf8())})), - field("f14", date()), field("f15", timestamp(TimeUnit::NANO)), - field("f16", time(TimeUnit::MICRO)), - field("f17", union_({field("u1", int8()), field("u2", time(TimeUnit::MILLI))}, + field("f15", date64()), field("f16", timestamp(TimeUnit::NANO)), + field("f17", time(TimeUnit::MICRO)), + field("f18", union_({field("u1", int8()), field("u2", time(TimeUnit::MILLI))}, {0, 1}, UnionMode::DENSE))}; Schema schema(fields); diff --git a/cpp/src/arrow/ipc/ipc-read-write-test.cc b/cpp/src/arrow/ipc/ipc-read-write-test.cc index 261ca1d0e52d8..00118448ff044 100644 --- a/cpp/src/arrow/ipc/ipc-read-write-test.cc +++ b/cpp/src/arrow/ipc/ipc-read-write-test.cc @@ -117,10 +117,10 @@ TEST_F(TestSchemaMetadata, NestedFields) { CheckRoundtrip(schema, &memo); } -#define BATCH_CASES() \ - ::testing::Values(&MakeIntRecordBatch, &MakeListRecordBatch, &MakeNonNullRecordBatch, \ - &MakeZeroLengthRecordBatch, &MakeDeeplyNestedList, &MakeStringTypesRecordBatch, \ - &MakeStruct, &MakeUnion, &MakeDictionary, &MakeDate, &MakeTimestamps, &MakeTimes, \ +#define BATCH_CASES() \ + ::testing::Values(&MakeIntRecordBatch, &MakeListRecordBatch, &MakeNonNullRecordBatch, \ + &MakeZeroLengthRecordBatch, &MakeDeeplyNestedList, &MakeStringTypesRecordBatch, \ + &MakeStruct, &MakeUnion, &MakeDictionary, &MakeDates, &MakeTimestamps, &MakeTimes, \ &MakeFWBinary); class IpcTestFixture : public io::MemoryMapFixture { diff --git a/cpp/src/arrow/ipc/json-internal.cc b/cpp/src/arrow/ipc/json-internal.cc 
index 549b26bfe8201..08f0bdc3a023e 100644 --- a/cpp/src/arrow/ipc/json-internal.cc +++ b/cpp/src/arrow/ipc/json-internal.cc @@ -133,10 +133,7 @@ class JsonSchemaWriter : public TypeVisitor { } template - typename std::enable_if< - std::is_base_of::value || std::is_base_of::value || - std::is_base_of::value || std::is_base_of::value, - void>::type + typename std::enable_if::value, void>::type WriteTypeMetadata(const T& type) {} template @@ -303,7 +300,10 @@ class JsonSchemaWriter : public TypeVisitor { Status Visit(const BinaryType& type) override { return WriteVarBytes("binary", type); } - Status Visit(const DateType& type) override { return WritePrimitive("date", type); } + // TODO + Status Visit(const Date32Type& type) override { return WritePrimitive("date", type); } + + Status Visit(const Date64Type& type) override { return WritePrimitive("date", type); } Status Visit(const TimeType& type) override { return WritePrimitive("time", type); } @@ -733,7 +733,8 @@ class JsonSchemaReader { } else if (type_name == "null") { *type = null(); } else if (type_name == "date") { - *type = date(); + // TODO + *type = date64(); } else if (type_name == "time") { return GetTimeLike(json_type, type); } else if (type_name == "timestamp") { @@ -1059,7 +1060,8 @@ class JsonArrayReader { TYPE_CASE(DoubleType); TYPE_CASE(StringType); TYPE_CASE(BinaryType); - NOT_IMPLEMENTED_CASE(DATE); + NOT_IMPLEMENTED_CASE(DATE32); + NOT_IMPLEMENTED_CASE(DATE64); NOT_IMPLEMENTED_CASE(TIMESTAMP); NOT_IMPLEMENTED_CASE(TIME); NOT_IMPLEMENTED_CASE(INTERVAL); diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index 4dfda543ebf6b..c091bac5a09e2 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -241,9 +241,15 @@ static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, return Status::OK(); case flatbuf::Type_Decimal: return Status::NotImplemented("Decimal"); - case flatbuf::Type_Date: - *out = date(); + case flatbuf::Type_Date: { + auto date_type = static_cast(type_data); + if (date_type->unit() == flatbuf::DateUnit_DAY) { + *out = date32(); + } else { + *out = date64(); + } return Status::OK(); + } case flatbuf::Type_Time: { auto time_type = static_cast(type_data); *out = time(FromFlatbufferUnit(time_type->unit())); @@ -358,9 +364,13 @@ static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, *out_type = flatbuf::Type_Utf8; *offset = flatbuf::CreateUtf8(fbb).Union(); break; - case Type::DATE: + case Type::DATE32: + *out_type = flatbuf::Type_Date; + *offset = flatbuf::CreateDate(fbb, flatbuf::DateUnit_DAY).Union(); + break; + case Type::DATE64: *out_type = flatbuf::Type_Date; - *offset = flatbuf::CreateDate(fbb).Union(); + *offset = flatbuf::CreateDate(fbb, flatbuf::DateUnit_MILLISECOND).Union(); break; case Type::TIME: { const auto& time_type = static_cast(*type); diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index 330af0c6ced20..4085ecf9e3da9 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -463,33 +463,22 @@ Status MakeDictionaryFlat(std::shared_ptr* out) { return Status::OK(); } -Status MakeDate(std::shared_ptr* out) { - std::vector is_valid = {true, true, true, false, true, true, true}; - auto f1 = field("f1", date()); - std::shared_ptr schema(new Schema({f1})); - - std::vector date_values = {1489269000000, 1489270000000, 1489271000000, - 1489272000000, 1489272000000, 1489273000000}; - - std::shared_ptr date_array; - ArrayFromVector(is_valid, date_values, &date_array); - 
- std::vector> arrays = {date_array}; - *out = std::make_shared(schema, date_array->length(), arrays); - return Status::OK(); -} - -Status MakeDate32(std::shared_ptr* out) { +Status MakeDates(std::shared_ptr* out) { std::vector is_valid = {true, true, true, false, true, true, true}; auto f0 = field("f0", date32()); - std::shared_ptr schema(new Schema({f0})); + auto f1 = field("f1", date64()); + std::shared_ptr schema(new Schema({f0, f1})); std::vector date32_values = {0, 1, 2, 3, 4, 5, 6}; - std::shared_ptr date32_array; ArrayFromVector(is_valid, date32_values, &date32_array); - std::vector> arrays = {date32_array}; + std::vector date64_values = {1489269000000, 1489270000000, 1489271000000, + 1489272000000, 1489272000000, 1489273000000}; + std::shared_ptr date64_array; + ArrayFromVector(is_valid, date64_values, &date64_array); + + std::vector> arrays = {date32_array, date64_array}; *out = std::make_shared(schema, date32_array->length(), arrays); return Status::OK(); } diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc index 82c119ef53e9a..ef59471e3c7c9 100644 --- a/cpp/src/arrow/ipc/writer.cc +++ b/cpp/src/arrow/ipc/writer.cc @@ -336,8 +336,8 @@ class RecordBatchWriter : public ArrayVisitor { VISIT_FIXED_WIDTH(HalfFloatArray); VISIT_FIXED_WIDTH(FloatArray); VISIT_FIXED_WIDTH(DoubleArray); - VISIT_FIXED_WIDTH(DateArray); VISIT_FIXED_WIDTH(Date32Array); + VISIT_FIXED_WIDTH(Date64Array); VISIT_FIXED_WIDTH(TimeArray); VISIT_FIXED_WIDTH(TimestampArray); diff --git a/cpp/src/arrow/loader.cc b/cpp/src/arrow/loader.cc index fc373715105e1..bc506be572625 100644 --- a/cpp/src/arrow/loader.cc +++ b/cpp/src/arrow/loader.cc @@ -146,8 +146,8 @@ class ArrayLoader : public TypeVisitor { VISIT_PRIMITIVE(HalfFloatType); VISIT_PRIMITIVE(FloatType); VISIT_PRIMITIVE(DoubleType); - VISIT_PRIMITIVE(DateType); VISIT_PRIMITIVE(Date32Type); + VISIT_PRIMITIVE(Date64Type); VISIT_PRIMITIVE(TimeType); VISIT_PRIMITIVE(TimestampType); diff --git a/cpp/src/arrow/pretty_print.cc b/cpp/src/arrow/pretty_print.cc index 87c1a1cf9d9c5..fc5eed18d8776 100644 --- a/cpp/src/arrow/pretty_print.cc +++ b/cpp/src/arrow/pretty_print.cc @@ -171,10 +171,10 @@ class ArrayPrinter : public ArrayVisitor { Status Visit(const FixedWidthBinaryArray& array) override { return WriteArray(array); } - Status Visit(const DateArray& array) override { return WriteArray(array); } - Status Visit(const Date32Array& array) override { return WriteArray(array); } + Status Visit(const Date64Array& array) override { return WriteArray(array); } + Status Visit(const TimeArray& array) override { return WriteArray(array); } Status Visit(const TimestampArray& array) override { diff --git a/cpp/src/arrow/type-test.cc b/cpp/src/arrow/type-test.cc index 22aa7eba8a3e8..c2d115ccbfe6f 100644 --- a/cpp/src/arrow/type-test.cc +++ b/cpp/src/arrow/type-test.cc @@ -173,6 +173,14 @@ TEST(TestListType, Basics) { ASSERT_EQ("list>", lt2.ToString()); } +TEST(TestDateTypes, ToString) { + auto t1 = date32(); + auto t2 = date64(); + + ASSERT_EQ("date32[day]", t1->ToString()); + ASSERT_EQ("date64[ms]", t2->ToString()); +} + TEST(TestTimeType, Equals) { TimeType t1; TimeType t2; diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 64070cb13abd0..937cbc5a7669d 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -110,12 +110,12 @@ std::string StructType::ToString() const { return s.str(); } -std::string DateType::ToString() const { - return std::string("date"); +std::string Date64Type::ToString() const { + return std::string("date64[ms]"); } 
std::string Date32Type::ToString() const { - return std::string("date32"); + return std::string("date32[day]"); } std::string TimeType::ToString() const { @@ -205,7 +205,7 @@ ACCEPT_VISITOR(ListType); ACCEPT_VISITOR(StructType); ACCEPT_VISITOR(DecimalType); ACCEPT_VISITOR(UnionType); -ACCEPT_VISITOR(DateType); +ACCEPT_VISITOR(Date64Type); ACCEPT_VISITOR(Date32Type); ACCEPT_VISITOR(TimeType); ACCEPT_VISITOR(TimestampType); @@ -233,7 +233,7 @@ TYPE_FACTORY(float32, FloatType); TYPE_FACTORY(float64, DoubleType); TYPE_FACTORY(utf8, StringType); TYPE_FACTORY(binary, BinaryType); -TYPE_FACTORY(date, DateType); +TYPE_FACTORY(date64, Date64Type); TYPE_FACTORY(date32, Date32Type); std::shared_ptr fixed_width_binary(int32_t byte_width) { @@ -355,7 +355,7 @@ TYPE_VISITOR_DEFAULT(DoubleType); TYPE_VISITOR_DEFAULT(StringType); TYPE_VISITOR_DEFAULT(BinaryType); TYPE_VISITOR_DEFAULT(FixedWidthBinaryType); -TYPE_VISITOR_DEFAULT(DateType); +TYPE_VISITOR_DEFAULT(Date64Type); TYPE_VISITOR_DEFAULT(Date32Type); TYPE_VISITOR_DEFAULT(TimeType); TYPE_VISITOR_DEFAULT(TimestampType); diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 27b28d2f42bc0..c179bf336987b 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -71,12 +71,12 @@ struct Type { // Fixed-width binary. Each value occupies the same number of bytes FIXED_WIDTH_BINARY, - // int64_t milliseconds since the UNIX epoch - DATE, - // int32_t days since the UNIX epoch DATE32, + // int64_t milliseconds since the UNIX epoch + DATE64, + // Exact timestamp encoded with int64 since UNIX epoch // Default unit millisecond TIMESTAMP, @@ -139,7 +139,7 @@ class ARROW_EXPORT TypeVisitor { virtual Status Visit(const StringType& type); virtual Status Visit(const BinaryType& type); virtual Status Visit(const FixedWidthBinaryType& type); - virtual Status Visit(const DateType& type); + virtual Status Visit(const Date64Type& type); virtual Status Visit(const Date32Type& type); virtual Status Visit(const TimeType& type); virtual Status Visit(const TimestampType& type); @@ -245,7 +245,7 @@ struct ARROW_EXPORT CTypeImpl : public PrimitiveCType { std::string ToString() const override { return std::string(DERIVED::name()); } }; -struct ARROW_EXPORT NullType : public DataType { +struct ARROW_EXPORT NullType : public DataType, public NoExtraMeta { static constexpr Type::type type_id = Type::NA; NullType() : DataType(Type::NA) {} @@ -263,7 +263,7 @@ struct IntegerTypeImpl : public CTypeImpl, public Inte bool is_signed() const override { return std::is_signed::value; } }; -struct ARROW_EXPORT BooleanType : public FixedWidthType { +struct ARROW_EXPORT BooleanType : public FixedWidthType, public NoExtraMeta { static constexpr Type::type type_id = Type::BOOL; BooleanType() : FixedWidthType(Type::BOOL) {} @@ -455,33 +455,33 @@ struct ARROW_EXPORT UnionType : public DataType { // ---------------------------------------------------------------------- // Date and time types -/// Date as int64_t milliseconds since UNIX epoch -struct ARROW_EXPORT DateType : public FixedWidthType { - static constexpr Type::type type_id = Type::DATE; +/// Date as int32_t days since UNIX epoch +struct ARROW_EXPORT Date32Type : public FixedWidthType, public NoExtraMeta { + static constexpr Type::type type_id = Type::DATE32; - using c_type = int64_t; + using c_type = int32_t; - DateType() : FixedWidthType(Type::DATE) {} + Date32Type() : FixedWidthType(Type::DATE32) {} - int bit_width() const override { return static_cast(sizeof(c_type) * 8); } + int bit_width() const override { return 
static_cast(sizeof(c_type) * 4); } Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; - static std::string name() { return "date"; } }; -/// Date as int32_t days since UNIX epoch -struct ARROW_EXPORT Date32Type : public FixedWidthType { - static constexpr Type::type type_id = Type::DATE32; +/// Date as int64_t milliseconds since UNIX epoch +struct ARROW_EXPORT Date64Type : public FixedWidthType, public NoExtraMeta { + static constexpr Type::type type_id = Type::DATE64; - using c_type = int32_t; + using c_type = int64_t; - Date32Type() : FixedWidthType(Type::DATE32) {} + Date64Type() : FixedWidthType(Type::DATE64) {} - int bit_width() const override { return static_cast(sizeof(c_type) * 4); } + int bit_width() const override { return static_cast(sizeof(c_type) * 8); } Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; + static std::string name() { return "date"; } }; enum class TimeUnit : char { SECOND = 0, MILLI = 1, MICRO = 2, NANO = 3 }; @@ -666,8 +666,8 @@ static inline bool is_primitive(Type::type type_id) { case Type::HALF_FLOAT: case Type::FLOAT: case Type::DOUBLE: - case Type::DATE: case Type::DATE32: + case Type::DATE64: case Type::TIMESTAMP: case Type::TIME: case Type::INTERVAL: diff --git a/cpp/src/arrow/type_fwd.h b/cpp/src/arrow/type_fwd.h index 7fc36c4bde06b..ae85593cf4546 100644 --- a/cpp/src/arrow/type_fwd.h +++ b/cpp/src/arrow/type_fwd.h @@ -95,9 +95,9 @@ _NUMERIC_TYPE_DECL(Double); #undef _NUMERIC_TYPE_DECL -struct DateType; -using DateArray = NumericArray; -using DateBuilder = NumericBuilder; +struct Date64Type; +using Date64Array = NumericArray; +using Date64Builder = NumericBuilder; struct Date32Type; using Date32Array = NumericArray; @@ -132,8 +132,8 @@ std::shared_ptr ARROW_EXPORT float32(); std::shared_ptr ARROW_EXPORT float64(); std::shared_ptr ARROW_EXPORT utf8(); std::shared_ptr ARROW_EXPORT binary(); -std::shared_ptr ARROW_EXPORT date(); std::shared_ptr ARROW_EXPORT date32(); +std::shared_ptr ARROW_EXPORT date64(); } // namespace arrow diff --git a/cpp/src/arrow/type_traits.h b/cpp/src/arrow/type_traits.h index 242e59d10fce4..e731913bbd226 100644 --- a/cpp/src/arrow/type_traits.h +++ b/cpp/src/arrow/type_traits.h @@ -119,15 +119,15 @@ struct TypeTraits { }; template <> -struct TypeTraits { - using ArrayType = DateArray; - using BuilderType = DateBuilder; +struct TypeTraits { + using ArrayType = Date64Array; + using BuilderType = Date64Builder; static inline int64_t bytes_required(int64_t elements) { return elements * sizeof(int64_t); } constexpr static bool is_parameter_free = true; - static inline std::shared_ptr type_singleton() { return date(); } + static inline std::shared_ptr type_singleton() { return date64(); } }; template <> diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index a4aac443fae82..c6f0be04e8d0d 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -61,7 +61,7 @@ from pyarrow.schema import (null, bool_, int8, int16, int32, int64, uint8, uint16, uint32, uint64, - timestamp, date, + timestamp, date32, date64, float_, double, binary, string, list_, struct, dictionary, field, DataType, Field, Schema, schema) diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 11244e7836058..6afeaa0a7332b 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -372,7 +372,11 @@ cdef class UInt64Array(IntegerArray): pass -cdef class DateArray(NumericArray): +cdef class Date32Array(NumericArray): + pass + + +cdef 
class Date64Array(NumericArray): pass @@ -459,7 +463,8 @@ cdef dict _array_classes = { Type_INT16: Int16Array, Type_INT32: Int32Array, Type_INT64: Int64Array, - Type_DATE: DateArray, + Type_DATE32: Date32Array, + Type_DATE64: Date64Array, Type_FLOAT: FloatArray, Type_DOUBLE: DoubleArray, Type_LIST: ListArray, diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 2d698d35b1b84..1d9c38e48cfe9 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -39,7 +39,8 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: Type_DOUBLE" arrow::Type::DOUBLE" Type_TIMESTAMP" arrow::Type::TIMESTAMP" - Type_DATE" arrow::Type::DATE" + Type_DATE32" arrow::Type::DATE32" + Type_DATE64" arrow::Type::DATE64" Type_BINARY" arrow::Type::BINARY" Type_STRING" arrow::Type::STRING" @@ -177,7 +178,10 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass CInt64Array" arrow::Int64Array"(CArray): int64_t Value(int i) - cdef cppclass CDateArray" arrow::DateArray"(CArray): + cdef cppclass CDate32Array" arrow::Date32Array"(CArray): + int32_t Value(int i) + + cdef cppclass CDate64Array" arrow::Date64Array"(CArray): int64_t Value(int i) cdef cppclass CTimestampArray" arrow::TimestampArray"(CArray): diff --git a/python/pyarrow/scalar.pyx b/python/pyarrow/scalar.pyx index 1337b2b2cb198..8c88f90422fac 100644 --- a/python/pyarrow/scalar.pyx +++ b/python/pyarrow/scalar.pyx @@ -124,11 +124,18 @@ cdef class UInt64Value(ArrayValue): return ap.Value(self.index) -cdef class DateValue(ArrayValue): +cdef class Date32Value(ArrayValue): def as_py(self): - cdef CDateArray* ap = self.sp_array.get() - return datetime.datetime.utcfromtimestamp(ap.Value(self.index) / 1000).date() + raise NotImplementedError + + +cdef class Date64Value(ArrayValue): + + def as_py(self): + cdef CDate64Array* ap = self.sp_array.get() + return datetime.datetime.utcfromtimestamp( + ap.Value(self.index) / 1000).date() cdef class TimestampValue(ArrayValue): @@ -147,7 +154,8 @@ cdef class TimestampValue(ArrayValue): return datetime.datetime.utcfromtimestamp(float(val) / 1000000) else: # TimeUnit_NANO - raise NotImplementedError("Cannot convert nanosecond timestamps to datetime.datetime") + raise NotImplementedError("Cannot convert nanosecond timestamps " + "to datetime.datetime") cdef class FloatValue(ArrayValue): @@ -226,7 +234,8 @@ cdef dict _scalar_classes = { Type_INT16: Int16Value, Type_INT32: Int32Value, Type_INT64: Int64Value, - Type_DATE: DateValue, + Type_DATE32: Date32Value, + Type_DATE64: Date64Value, Type_TIMESTAMP: TimestampValue, Type_FLOAT: FloatValue, Type_DOUBLE: DoubleValue, diff --git a/python/pyarrow/schema.pyx b/python/pyarrow/schema.pyx index ee38144e6e3db..ab5ae5fa2a3f2 100644 --- a/python/pyarrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ -228,8 +228,9 @@ cdef set PRIMITIVE_TYPES = set([ la.Type_UINT16, la.Type_INT16, la.Type_UINT32, la.Type_INT32, la.Type_UINT64, la.Type_INT64, - la.Type_TIMESTAMP, la.Type_DATE, - la.Type_FLOAT, la.Type_DOUBLE]) + la.Type_TIMESTAMP, la.Type_DATE32, + la.Type_DATE64, la.Type_FLOAT, + la.Type_DOUBLE]) def null(): @@ -319,8 +320,12 @@ def timestamp(unit_str, tz=None): return out -def date(): - return primitive_type(la.Type_DATE) +def date32(): + return primitive_type(la.Type_DATE32) + + +def date64(): + return primitive_type(la.Type_DATE64) def float_(): diff --git a/python/pyarrow/tests/test_convert_builtin.py b/python/pyarrow/tests/test_convert_builtin.py index c06d18d19c049..7915f9766bf67 100644 --- 
a/python/pyarrow/tests/test_convert_builtin.py +++ b/python/pyarrow/tests/test_convert_builtin.py @@ -97,7 +97,7 @@ def test_date(self): datetime.date(2040, 2, 26)] arr = pyarrow.from_pylist(data) assert len(arr) == 4 - assert arr.type == pyarrow.date() + assert arr.type == pyarrow.date64() assert arr.null_count == 1 assert arr[0].as_py() == datetime.date(2000, 1, 1) assert arr[1].as_py() is None diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index 6b89444b3e824..ea7a892a6f2a4 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -326,7 +326,7 @@ def test_date(self): datetime.date(1970, 1, 1), datetime.date(2040, 2, 26)]}) table = A.Table.from_pandas(df) - field = A.Field.from_py('date', A.date()) + field = A.Field.from_py('date', A.date64()) schema = A.Schema.from_fields([field]) assert table.schema.equals(schema) result = table.to_pandas() diff --git a/python/src/pyarrow/adapters/builtin.cc b/python/src/pyarrow/adapters/builtin.cc index b197f5845c020..06e098a80369e 100644 --- a/python/src/pyarrow/adapters/builtin.cc +++ b/python/src/pyarrow/adapters/builtin.cc @@ -82,7 +82,7 @@ class ScalarVisitor { // TODO(wesm): tighter type later return int64(); } else if (date_count_) { - return date(); + return date64(); } else if (timestamp_count_) { return timestamp(TimeUnit::MICRO); } else if (bool_count_) { @@ -291,7 +291,7 @@ class Int64Converter : public TypedConverter { } }; -class DateConverter : public TypedConverter { +class DateConverter : public TypedConverter { public: Status AppendData(PyObject* seq) override { Py_ssize_t size = PySequence_Size(seq); @@ -457,7 +457,7 @@ std::shared_ptr GetConverter(const std::shared_ptr& type return std::make_shared(); case Type::INT64: return std::make_shared(); - case Type::DATE: + case Type::DATE64: return std::make_shared(); case Type::TIMESTAMP: return std::make_shared(); diff --git a/python/src/pyarrow/adapters/pandas.cc b/python/src/pyarrow/adapters/pandas.cc index 863cf54c9aa1c..a7386cefcdbbf 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/python/src/pyarrow/adapters/pandas.cc @@ -379,7 +379,7 @@ Status PandasConverter::ConvertDates(std::shared_ptr* out) { PyAcquireGIL lock; PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); - DateBuilder date_builder(pool_); + Date64Builder date_builder(pool_); RETURN_NOT_OK(date_builder.Resize(length_)); Status s; @@ -477,7 +477,7 @@ Status PandasConverter::ConvertObjects(std::shared_ptr* out) { return ConvertObjectStrings(out); case Type::BOOL: return ConvertBooleans(out); - case Type::DATE: + case Type::DATE64: return ConvertDates(out); case Type::LIST: { const auto& list_field = static_cast(*type_); @@ -725,7 +725,7 @@ inline void set_numpy_metadata(int type, DataType* datatype, PyArrayObject* out) break; } } else { - // datatype->type == Type::DATE + // datatype->type == Type::DATE64 date_dtype->meta.base = NPY_FR_D; } } @@ -1245,8 +1245,8 @@ class DatetimeBlock : public PandasBlock { const ChunkedArray& data = *col.get()->data(); - if (type == Type::DATE) { - // DateType is millisecond timestamp stored as int64_t + if (type == Type::DATE64) { + // Date64Type is millisecond timestamp stored as int64_t // TODO(wesm): Do we want to make sure to zero out the milliseconds? 
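Taken together, the user-visible effect of the date split is easiest to see from Python. A hypothetical session, assuming the pyarrow API as of this patch (from_pylist and date64 are the names its tests use):

import datetime
import pyarrow

arr = pyarrow.from_pylist([datetime.date(2000, 1, 1), None])
# Lists of Python dates now infer date64: int64 milliseconds since the
# UNIX epoch. date32 (int32 days since the epoch) must be requested
# explicitly, and per the scalar.pyx change above its as_py() still
# raises NotImplementedError at this point.
assert arr.type == pyarrow.date64()
assert arr[0].as_py() == datetime.date(2000, 1, 1)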
ConvertDatetimeNanos(data, out_buffer); } else if (type == Type::TIMESTAMP) { @@ -1490,7 +1490,7 @@ class DataFrameBlockCreator { case Type::BINARY: output_type = PandasBlock::OBJECT; break; - case Type::DATE: + case Type::DATE64: output_type = PandasBlock::DATETIME; break; case Type::TIMESTAMP: { @@ -1752,7 +1752,7 @@ class ArrowDeserializer { CONVERT_CASE(DOUBLE); CONVERT_CASE(BINARY); CONVERT_CASE(STRING); - CONVERT_CASE(DATE); + CONVERT_CASE(DATE64); CONVERT_CASE(TIMESTAMP); CONVERT_CASE(DICTIONARY); CONVERT_CASE(LIST); @@ -1771,7 +1771,7 @@ class ArrowDeserializer { template inline typename std::enable_if< - (TYPE != Type::DATE) & arrow_traits::is_numeric_nullable, Status>::type + (TYPE != Type::DATE64) & arrow_traits::is_numeric_nullable, Status>::type ConvertValues() { typedef typename arrow_traits::T T; int npy_type = arrow_traits::npy_type; @@ -1788,7 +1788,7 @@ class ArrowDeserializer { } template - inline typename std::enable_if::type ConvertValues() { + inline typename std::enable_if::type ConvertValues() { typedef typename arrow_traits::T T; RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); diff --git a/python/src/pyarrow/helpers.cc b/python/src/pyarrow/helpers.cc index edebea6d97c95..43edf8af17fa2 100644 --- a/python/src/pyarrow/helpers.cc +++ b/python/src/pyarrow/helpers.cc @@ -39,7 +39,8 @@ std::shared_ptr GetPrimitiveType(Type::type type) { GET_PRIMITIVE_TYPE(INT32, int32); GET_PRIMITIVE_TYPE(UINT64, uint64); GET_PRIMITIVE_TYPE(INT64, int64); - GET_PRIMITIVE_TYPE(DATE, date); + GET_PRIMITIVE_TYPE(DATE32, date32); + GET_PRIMITIVE_TYPE(DATE64, date64); GET_PRIMITIVE_TYPE(BOOL, boolean); GET_PRIMITIVE_TYPE(FLOAT, float32); GET_PRIMITIVE_TYPE(DOUBLE, float64); diff --git a/python/src/pyarrow/type_traits.h b/python/src/pyarrow/type_traits.h index f4604d7a9894d..cc65d5ceed9c1 100644 --- a/python/src/pyarrow/type_traits.h +++ b/python/src/pyarrow/type_traits.h @@ -180,7 +180,7 @@ struct arrow_traits { }; template <> -struct arrow_traits { +struct arrow_traits { static constexpr int npy_type = NPY_DATETIME; static constexpr bool supports_nulls = true; static constexpr int64_t na_value = kPandasTimestampNull; From ced9d766d70e84c4d0542c6f5d9bd57faf10781d Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 22 Mar 2017 14:05:33 -0400 Subject: [PATCH 0401/1644] ARROW-679: [Format] Change FieldNode, RecordBatch lengths to long, remove LargeRecordBatch. Refactoring This enables me to delete a bunch of code without losing functionality. C++ users must explicitly opt-in to writing size over INT32_MAX. cc @julienledem. I have not added checks in Java about sizes over INT32_MAX, wasn't sure where you might want to do that. Author: Wes McKinney Closes #417 from wesm/ARROW-679 and squashes the following commits: ea237b1 [Wes McKinney] Document allow_64bit for WriteRecordBatch e237d4a [Wes McKinney] Change FieldNode, RecordBatch lengths to long, remove LargeRecordBatch. 
Refactoring --- cpp/src/arrow/ipc/ipc-read-write-test.cc | 2 +- cpp/src/arrow/ipc/metadata.cc | 48 +--------------- cpp/src/arrow/ipc/metadata.h | 6 +- cpp/src/arrow/ipc/reader.cc | 55 +++---------------- cpp/src/arrow/ipc/reader.h | 7 +-- cpp/src/arrow/ipc/writer.cc | 52 ++++++------------ cpp/src/arrow/ipc/writer.h | 51 +++++++++-------- format/Message.fbs | 24 ++------ .../arrow/vector/schema/ArrowFieldNode.java | 2 +- .../vector/stream/MessageSerializer.java | 4 +- 10 files changed, 61 insertions(+), 190 deletions(-) diff --git a/cpp/src/arrow/ipc/ipc-read-write-test.cc b/cpp/src/arrow/ipc/ipc-read-write-test.cc index 00118448ff044..6919aebbe8d6d 100644 --- a/cpp/src/arrow/ipc/ipc-read-write-test.cc +++ b/cpp/src/arrow/ipc/ipc-read-write-test.cc @@ -163,7 +163,7 @@ class IpcTestFixture : public io::MemoryMapFixture { RETURN_NOT_OK(WriteLargeRecordBatch( batch, buffer_offset, mmap_.get(), &metadata_length, &body_length, pool_)); - return ReadLargeRecordBatch(batch.schema(), 0, mmap_.get(), result); + return ReadRecordBatch(batch.schema(), 0, mmap_.get(), result); } void CheckReadResult(const RecordBatch& result, const RecordBatch& expected) { diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index c091bac5a09e2..b10ccec9e7c4e 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -42,7 +42,6 @@ namespace ipc { using FBB = flatbuffers::FlatBufferBuilder; using DictionaryOffset = flatbuffers::Offset; using FieldOffset = flatbuffers::Offset; -using LargeRecordBatchOffset = flatbuffers::Offset; using RecordBatchOffset = flatbuffers::Offset; using VectorLayoutOffset = flatbuffers::Offset; using Offset = flatbuffers::Offset; @@ -558,8 +557,6 @@ Status WriteSchemaMessage( using FieldNodeVector = flatbuffers::Offset>; -using LargeFieldNodeVector = - flatbuffers::Offset>; using BufferVector = flatbuffers::Offset>; static Status WriteFieldNodes( @@ -567,23 +564,6 @@ static Status WriteFieldNodes( std::vector fb_nodes; fb_nodes.reserve(nodes.size()); - for (size_t i = 0; i < nodes.size(); ++i) { - const FieldMetadata& node = nodes[i]; - if (node.offset != 0) { - return Status::Invalid("Field metadata for IPC must have offset 0"); - } - fb_nodes.emplace_back( - static_cast(node.length), static_cast(node.null_count)); - } - *out = fbb.CreateVectorOfStructs(fb_nodes); - return Status::OK(); -} - -static Status WriteLargeFieldNodes( - FBB& fbb, const std::vector& nodes, LargeFieldNodeVector* out) { - std::vector fb_nodes; - fb_nodes.reserve(nodes.size()); - for (size_t i = 0; i < nodes.size(); ++i) { const FieldMetadata& node = nodes[i]; if (node.offset != 0) { @@ -621,19 +601,6 @@ static Status MakeRecordBatch(FBB& fbb, int32_t length, int64_t body_length, return Status::OK(); } -static Status MakeLargeRecordBatch(FBB& fbb, int64_t length, int64_t body_length, - const std::vector& nodes, const std::vector& buffers, - LargeRecordBatchOffset* offset) { - LargeFieldNodeVector fb_nodes; - BufferVector fb_buffers; - - RETURN_NOT_OK(WriteLargeFieldNodes(fbb, nodes, &fb_nodes)); - RETURN_NOT_OK(WriteBuffers(fbb, buffers, &fb_buffers)); - - *offset = flatbuf::CreateLargeRecordBatch(fbb, length, fb_nodes, fb_buffers); - return Status::OK(); -} - Status WriteRecordBatchMessage(int32_t length, int64_t body_length, const std::vector& nodes, const std::vector& buffers, std::shared_ptr* out) { @@ -644,17 +611,6 @@ Status WriteRecordBatchMessage(int32_t length, int64_t body_length, fbb, flatbuf::MessageHeader_RecordBatch, record_batch.Union(), body_length, out); } 
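With LargeRecordBatch gone, writing arrays longer than INT32_MAX goes through the ordinary writer with the new allow_64bit flag. A minimal sketch against the post-patch declarations in arrow/ipc/writer.h; the stream, batch, and pool are assumed to come from the caller, and remaining includes are elided:

#include "arrow/ipc/writer.h"

// Opt in to 64-bit field lengths. With the default allow_64bit = false,
// the writer rejects any array over 2^31 - 1 elements in length.
arrow::Status WriteBigBatch(const arrow::RecordBatch& batch,
                            arrow::io::OutputStream* stream,
                            arrow::MemoryPool* pool) {
  int32_t metadata_length = 0;
  int64_t body_length = 0;
  return arrow::ipc::WriteRecordBatch(batch, /*buffer_start_offset=*/0,
                                      stream, &metadata_length, &body_length,
                                      pool, arrow::ipc::kMaxNestingDepth,
                                      /*allow_64bit=*/true);
}

WriteLargeRecordBatch survives below as exactly this call, so either entry point works; the flag, rather than a separate metadata type, is now what distinguishes the two.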
-Status WriteLargeRecordBatchMessage(int64_t length, int64_t body_length, - const std::vector& nodes, const std::vector& buffers, - std::shared_ptr* out) { - FBB fbb; - LargeRecordBatchOffset large_batch; - RETURN_NOT_OK( - MakeLargeRecordBatch(fbb, length, body_length, nodes, buffers, &large_batch)); - return WriteMessage(fbb, flatbuf::MessageHeader_LargeRecordBatch, large_batch.Union(), - body_length, out); -} - Status WriteDictionaryMessage(int64_t id, int32_t length, int64_t body_length, const std::vector& nodes, const std::vector& buffers, std::shared_ptr* out) { @@ -917,7 +873,7 @@ class RecordBatchMetadata::RecordBatchMetadataImpl : public MessageHolder { const flatbuf::Buffer* buffer(int i) const { return buffers_->Get(i); } - int32_t length() const { return batch_->length(); } + int64_t length() const { return batch_->length(); } int num_buffers() const { return batch_->buffers()->size(); } @@ -969,7 +925,7 @@ BufferMetadata RecordBatchMetadata::buffer(int i) const { return result; } -int32_t RecordBatchMetadata::length() const { +int64_t RecordBatchMetadata::length() const { return impl_->length(); } diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h index 41e6c5e9f19ea..dc07c7a1bd9b7 100644 --- a/cpp/src/arrow/ipc/metadata.h +++ b/cpp/src/arrow/ipc/metadata.h @@ -150,7 +150,7 @@ class ARROW_EXPORT RecordBatchMetadata { FieldMetadata field(int i) const; BufferMetadata buffer(int i) const; - int32_t length() const; + int64_t length() const; int num_buffers() const; int num_fields() const; @@ -229,10 +229,6 @@ Status WriteRecordBatchMessage(int32_t length, int64_t body_length, const std::vector& nodes, const std::vector& buffers, std::shared_ptr* out); -Status WriteLargeRecordBatchMessage(int64_t length, int64_t body_length, - const std::vector& nodes, const std::vector& buffers, - std::shared_ptr* out); - Status WriteDictionaryMessage(int64_t id, int32_t length, int64_t body_length, const std::vector& nodes, const std::vector& buffers, std::shared_ptr* out); diff --git a/cpp/src/arrow/ipc/reader.cc b/cpp/src/arrow/ipc/reader.cc index a2b20a901a69e..71ba951111999 100644 --- a/cpp/src/arrow/ipc/reader.cc +++ b/cpp/src/arrow/ipc/reader.cc @@ -468,48 +468,7 @@ Status FileReader::GetRecordBatch(int i, std::shared_ptr* batch) { return impl_->GetRecordBatch(i, batch); } -// ---------------------------------------------------------------------- -// Read LargeRecordBatch - -class LargeRecordBatchSource : public ArrayComponentSource { - public: - LargeRecordBatchSource( - const flatbuf::LargeRecordBatch* metadata, io::RandomAccessFile* file) - : metadata_(metadata), file_(file) {} - - Status GetBuffer(int buffer_index, std::shared_ptr* out) override { - if (buffer_index >= static_cast(metadata_->buffers()->size())) { - return Status::Invalid("Ran out of buffer metadata, likely malformed"); - } - const flatbuf::Buffer* buffer = metadata_->buffers()->Get(buffer_index); - - if (buffer->length() == 0) { - *out = nullptr; - return Status::OK(); - } else { - return file_->ReadAt(buffer->offset(), buffer->length(), out); - } - } - - Status GetFieldMetadata(int field_index, FieldMetadata* metadata) override { - // pop off a field - if (field_index >= static_cast(metadata_->nodes()->size())) { - return Status::Invalid("Ran out of field metadata, likely malformed"); - } - const flatbuf::LargeFieldNode* node = metadata_->nodes()->Get(field_index); - - metadata->length = node->length(); - metadata->null_count = node->null_count(); - metadata->offset = 0; - return Status::OK(); - } 
- - private: - const flatbuf::LargeRecordBatch* metadata_; - io::RandomAccessFile* file_; -}; - -Status ReadLargeRecordBatch(const std::shared_ptr& schema, int64_t offset, +Status ReadRecordBatch(const std::shared_ptr& schema, int64_t offset, io::RandomAccessFile* file, std::shared_ptr* out) { std::shared_ptr buffer; RETURN_NOT_OK(file->Seek(offset)); @@ -517,19 +476,19 @@ Status ReadLargeRecordBatch(const std::shared_ptr& schema, int64_t offse RETURN_NOT_OK(file->Read(sizeof(int32_t), &buffer)); int32_t flatbuffer_size = *reinterpret_cast(buffer->data()); + std::shared_ptr message; RETURN_NOT_OK(file->Read(flatbuffer_size, &buffer)); - auto message = flatbuf::GetMessage(buffer->data()); - auto batch = reinterpret_cast(message->header()); + RETURN_NOT_OK(Message::Open(buffer, 0, &message)); + + RecordBatchMetadata metadata(message); // TODO(ARROW-388): The buffer offsets start at 0, so we must construct a // RandomAccessFile according to that frame of reference std::shared_ptr buffer_payload; - RETURN_NOT_OK(file->Read(message->bodyLength(), &buffer_payload)); + RETURN_NOT_OK(file->Read(message->body_length(), &buffer_payload)); io::BufferReader buffer_reader(buffer_payload); - LargeRecordBatchSource source(batch, &buffer_reader); - return LoadRecordBatchFromSource( - schema, batch->length(), kMaxNestingDepth, &source, out); + return ReadRecordBatch(metadata, schema, kMaxNestingDepth, &buffer_reader, out); } } // namespace ipc diff --git a/cpp/src/arrow/ipc/reader.h b/cpp/src/arrow/ipc/reader.h index 1c1314a040bef..1e8636c1efcce 100644 --- a/cpp/src/arrow/ipc/reader.h +++ b/cpp/src/arrow/ipc/reader.h @@ -120,12 +120,9 @@ class ARROW_EXPORT FileReader { std::unique_ptr impl_; }; -// ---------------------------------------------------------------------- -// -/// EXPERIMENTAL: Read length-prefixed LargeRecordBatch metadata (64-bit array -/// lengths) at offset and reconstruct RecordBatch -Status ARROW_EXPORT ReadLargeRecordBatch(const std::shared_ptr& schema, +/// Read encapsulated message and RecordBatch +Status ARROW_EXPORT ReadRecordBatch(const std::shared_ptr& schema, int64_t offset, io::RandomAccessFile* file, std::shared_ptr* out); } // namespace ipc diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc index ef59471e3c7c9..0f55f8e33e71d 100644 --- a/cpp/src/arrow/ipc/writer.cc +++ b/cpp/src/arrow/ipc/writer.cc @@ -48,28 +48,25 @@ namespace ipc { class RecordBatchWriter : public ArrayVisitor { public: RecordBatchWriter( - MemoryPool* pool, int64_t buffer_start_offset, int max_recursion_depth) + MemoryPool* pool, int64_t buffer_start_offset, int max_recursion_depth, + bool allow_64bit) : pool_(pool), max_recursion_depth_(max_recursion_depth), - buffer_start_offset_(buffer_start_offset) { + buffer_start_offset_(buffer_start_offset), + allow_64bit_(allow_64bit) { DCHECK_GT(max_recursion_depth, 0); } virtual ~RecordBatchWriter() = default; - virtual Status CheckArrayMetadata(const Array& arr) { - if (arr.length() > std::numeric_limits::max()) { - return Status::Invalid("Cannot write arrays larger than 2^31 - 1 in length"); - } - return Status::OK(); - } - Status VisitArray(const Array& arr) { if (max_recursion_depth_ <= 0) { return Status::Invalid("Max recursion depth reached"); } - RETURN_NOT_OK(CheckArrayMetadata(arr)); + if (!allow_64bit_ && arr.length() > std::numeric_limits::max()) { + return Status::Invalid("Cannot write arrays larger than 2^31 - 1 in length"); + } // push back all common elements field_nodes_.emplace_back(arr.length(), arr.null_count(), 0); @@ -470,6 
+467,7 @@ class RecordBatchWriter : public ArrayVisitor { int64_t max_recursion_depth_; int64_t buffer_start_offset_; + bool allow_64bit_; }; class DictionaryWriter : public RecordBatchWriter { @@ -502,20 +500,21 @@ class DictionaryWriter : public RecordBatchWriter { Status WriteRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, - MemoryPool* pool, int max_recursion_depth) { - RecordBatchWriter writer(pool, buffer_start_offset, max_recursion_depth); + MemoryPool* pool, int max_recursion_depth, bool allow_64bit) { + RecordBatchWriter writer(pool, buffer_start_offset, max_recursion_depth, + allow_64bit); return writer.Write(batch, dst, metadata_length, body_length); } Status WriteDictionary(int64_t dictionary_id, const std::shared_ptr& dictionary, int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, MemoryPool* pool) { - DictionaryWriter writer(pool, buffer_start_offset, kMaxNestingDepth); + DictionaryWriter writer(pool, buffer_start_offset, kMaxNestingDepth, false); return writer.Write(dictionary_id, dictionary, dst, metadata_length, body_length); } Status GetRecordBatchSize(const RecordBatch& batch, int64_t* size) { - RecordBatchWriter writer(default_memory_pool(), 0, kMaxNestingDepth); + RecordBatchWriter writer(default_memory_pool(), 0, kMaxNestingDepth, true); RETURN_NOT_OK(writer.GetTotalSize(batch, size)); return Status::OK(); } @@ -733,30 +732,11 @@ Status FileWriter::Close() { return impl_->Close(); } -// ---------------------------------------------------------------------- -// Write record batches with 64-bit size metadata - -class LargeRecordBatchWriter : public RecordBatchWriter { - public: - using RecordBatchWriter::RecordBatchWriter; - - Status CheckArrayMetadata(const Array& arr) override { - // No < INT32_MAX length check - return Status::OK(); - } - - Status WriteMetadataMessage( - int64_t num_rows, int64_t body_length, std::shared_ptr* out) override { - return WriteLargeRecordBatchMessage( - num_rows, body_length, field_nodes_, buffer_meta_, out); - } -}; - Status WriteLargeRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, - MemoryPool* pool, int max_recursion_depth) { - LargeRecordBatchWriter writer(pool, buffer_start_offset, max_recursion_depth); - return writer.Write(batch, dst, metadata_length, body_length); + MemoryPool* pool) { + return WriteRecordBatch(batch, buffer_start_offset, dst, metadata_length, body_length, + pool, kMaxNestingDepth, true); } } // namespace ipc diff --git a/cpp/src/arrow/ipc/writer.h b/cpp/src/arrow/ipc/writer.h index 1271652a35c78..3b7e710c124cb 100644 --- a/cpp/src/arrow/ipc/writer.h +++ b/cpp/src/arrow/ipc/writer.h @@ -45,29 +45,30 @@ class OutputStream; namespace ipc { -// Write the RecordBatch (collection of equal-length Arrow arrays) to the -// output stream in a contiguous block. 
The record batch metadata is written as -// a flatbuffer (see format/Message.fbs -- the RecordBatch message type) -// prefixed by its size, followed by each of the memory buffers in the batch -// written end to end (with appropriate alignment and padding): -// -// -// -// Finally, the absolute offsets (relative to the start of the output stream) -// to the end of the body and end of the metadata / data header (suffixed by -// the header size) is returned in out-variables -// -// @param(in) buffer_start_offset: the start offset to use in the buffer metadata, -// default should be 0 -// -// @param(out) metadata_length: the size of the length-prefixed flatbuffer -// including padding to a 64-byte boundary -// -// @param(out) body_length: the size of the contiguous buffer block plus -// padding bytes +/// Write the RecordBatch (collection of equal-length Arrow arrays) to the +/// output stream in a contiguous block. The record batch metadata is written as +/// a flatbuffer (see format/Message.fbs -- the RecordBatch message type) +/// prefixed by its size, followed by each of the memory buffers in the batch +/// written end to end (with appropriate alignment and padding): +/// +/// +/// +/// Finally, the absolute offsets (relative to the start of the output stream) +/// to the end of the body and end of the metadata / data header (suffixed by +/// the header size) is returned in out-variables +/// +/// @param(in) buffer_start_offset the start offset to use in the buffer metadata, +/// default should be 0 +/// @param(in) allow_64bit permit field lengths exceeding INT32_MAX. May not be +/// readable by other Arrow implementations +/// @param(out) metadata_length: the size of the length-prefixed flatbuffer +/// including padding to a 64-byte boundary +/// @param(out) body_length: the size of the contiguous buffer block plus +/// padding bytes Status WriteRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, - MemoryPool* pool, int max_recursion_depth = kMaxNestingDepth); + MemoryPool* pool, int max_recursion_depth = kMaxNestingDepth, + bool allow_64bit = false); // Write Array as a DictionaryBatch message Status WriteDictionary(int64_t dictionary_id, const std::shared_ptr& dictionary, @@ -116,13 +117,11 @@ class ARROW_EXPORT FileWriter : public StreamWriter { std::unique_ptr impl_; }; -// ---------------------------------------------------------------------- - -/// EXPERIMENTAL: Write record batch using LargeRecordBatch IPC metadata. This -/// data may not be readable by all Arrow implementations +/// EXPERIMENTAL: Write RecordBatch allowing lengths over INT32_MAX. This data +/// may not be readable by all Arrow implementations Status WriteLargeRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, - MemoryPool* pool, int max_recursion_depth = kMaxNestingDepth); + MemoryPool* pool); } // namespace ipc } // namespace arrow diff --git a/format/Message.fbs b/format/Message.fbs index e56366d436eb9..ff30aceeda4f3 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -290,12 +290,12 @@ struct Buffer { struct FieldNode { /// The number of value slots in the Arrow array at this level of a nested /// tree - length: int; + length: long; /// The number of observed nulls. 
Fields with null_count == 0 may choose not /// to write their physical validity bitmap out as a materialized buffer, /// instead setting the length of the bitmap buffer to 0. - null_count: int; + null_count: long; } /// A data header describing the shared memory layout of a "record" or "row" @@ -304,7 +304,7 @@ struct FieldNode { table RecordBatch { /// number of records / rows. The arrays in the batch should all have this /// length - length: int; + length: long; /// Nodes correspond to the pre-ordered flattened logical schema nodes: [FieldNode]; @@ -318,22 +318,6 @@ table RecordBatch { buffers: [Buffer]; } -/// ---------------------------------------------------------------------- -/// EXPERIMENTAL: A RecordBatch type that supports data with more than 2^31 - 1 -/// elements. Arrow implementations do not need to implement this type to be -/// compliant - -struct LargeFieldNode { - length: long; - null_count: long; -} - -table LargeRecordBatch { - length: long; - nodes: [LargeFieldNode]; - buffers: [Buffer]; -} - /// ---------------------------------------------------------------------- /// For sending dictionary encoding information. Any Field can be /// dictionary-encoded, but in this case none of its children may be @@ -356,7 +340,7 @@ table DictionaryBatch { /// which may include experimental metadata types. For maximum compatibility, /// it is best to send data using RecordBatch union MessageHeader { - Schema, DictionaryBatch, RecordBatch, LargeRecordBatch + Schema, DictionaryBatch, RecordBatch } table Message { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowFieldNode.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowFieldNode.java index 71dd0abc6bcef..72ce982f2e7ee 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowFieldNode.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/ArrowFieldNode.java @@ -34,7 +34,7 @@ public ArrowFieldNode(int length, int nullCount) { @Override public int writeTo(FlatBufferBuilder builder) { - return FieldNode.createFieldNode(builder, length, nullCount); + return FieldNode.createFieldNode(builder, (long)length, (long)nullCount); } public int getNullCount() { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java b/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java index 92a6c0c26ba6e..f85fb51710bde 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java @@ -207,7 +207,7 @@ private static ArrowRecordBatch deserializeRecordBatch(RecordBatch recordBatchFB List nodes = new ArrayList<>(); for (int i = 0; i < nodesLength; ++i) { FieldNode node = recordBatchFB.nodes(i); - nodes.add(new ArrowFieldNode(node.length(), node.nullCount())); + nodes.add(new ArrowFieldNode((int)node.length(), (int)node.nullCount())); } List buffers = new ArrayList<>(); for (int i = 0; i < recordBatchFB.buffersLength(); ++i) { @@ -216,7 +216,7 @@ private static ArrowRecordBatch deserializeRecordBatch(RecordBatch recordBatchFB buffers.add(vectorBuffer); } ArrowRecordBatch arrowRecordBatch = - new ArrowRecordBatch(recordBatchFB.length(), nodes, buffers); + new ArrowRecordBatch((int)recordBatchFB.length(), nodes, buffers); body.release(); return arrowRecordBatch; } From 2406d4eed9af41b1ef60c53834aced036a933327 Mon Sep 17 00:00:00 2001 From: Miki Tebeka Date: Wed, 22 Mar 2017 14:06:42 -0400 Subject: [PATCH 0402/1644] 
ARROW-552: [Python] Implement getitem for DictionaryArray by returning a value from the dictionary Author: Miki Tebeka Author: Wes McKinney Closes #414 from wesm/ARROW-552 and squashes the following commits: 8a039b5 [Wes McKinney] Implement DictionaryArray.getitem by indexing into the dictionary. Add indices and dictionary properties e700b45 [Miki Tebeka] ARROW-552: [Python] Add scalar value support for Dictionary type (WIP) --- python/pyarrow/array.pxd | 4 +++- python/pyarrow/array.pyx | 25 +++++++++++++++++++++++++ python/pyarrow/scalar.pyx | 2 +- python/pyarrow/tests/test_scalars.py | 13 +++++++++++++ 4 files changed, 42 insertions(+), 2 deletions(-) diff --git a/python/pyarrow/array.pxd b/python/pyarrow/array.pxd index 56bb53d5c97dc..c3e7997aa823c 100644 --- a/python/pyarrow/array.pxd +++ b/python/pyarrow/array.pxd @@ -109,7 +109,9 @@ cdef class BinaryArray(Array): cdef class DictionaryArray(Array): - pass + cdef: + object _indices, _dictionary + cdef wrap_array_output(PyObject* output) diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 6afeaa0a7332b..795076cfccb7e 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -406,6 +406,31 @@ cdef class BinaryArray(Array): cdef class DictionaryArray(Array): + cdef getitem(self, int64_t i): + cdef Array dictionary = self.dictionary + cdef int64_t index = self.indices[i].as_py() + return scalar.box_scalar(dictionary.type, dictionary.sp_array, index) + + property dictionary: + + def __get__(self): + cdef CDictionaryArray* darr = (self.ap) + + if self._dictionary is None: + self._dictionary = box_array(darr.dictionary()) + + return self._dictionary + + property indices: + + def __get__(self): + cdef CDictionaryArray* darr = (self.ap) + + if self._indices is None: + self._indices = box_array(darr.indices()) + + return self._indices + @staticmethod def from_arrays(indices, dictionary, mask=None, MemoryPool memory_pool=None): diff --git a/python/pyarrow/scalar.pyx b/python/pyarrow/scalar.pyx index 8c88f90422fac..1b7e67b356a2f 100644 --- a/python/pyarrow/scalar.pyx +++ b/python/pyarrow/scalar.pyx @@ -241,7 +241,7 @@ cdef dict _scalar_classes = { Type_DOUBLE: DoubleValue, Type_LIST: ListValue, Type_BINARY: BinaryValue, - Type_STRING: StringValue, + Type_STRING: StringValue } cdef object box_scalar(DataType type, const shared_ptr[CArray]& sp_array, diff --git a/python/pyarrow/tests/test_scalars.py b/python/pyarrow/tests/test_scalars.py index ef600a06296cb..d56481c06d0f8 100644 --- a/python/pyarrow/tests/test_scalars.py +++ b/python/pyarrow/tests/test_scalars.py @@ -16,6 +16,8 @@ # specific language governing permissions and limitations # under the License. 
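Condensed, the behavior the test below exercises looks like this; a hypothetical session assuming the pyarrow build from this patch (the indices and dictionary properties and the dictionary-resolving getitem come from the array.pyx change above):

import pandas as pd
import pyarrow as A

cat = pd.Categorical(['red', 'green', 'red'],
                     categories=['red', 'green', 'blue'])
darr = A.DictionaryArray.from_arrays(cat.codes, cat.categories)

darr.indices      # the integer codes, boxed lazily as an Arrow array
darr.dictionary   # the dictionary values: 'red', 'green', 'blue'
assert darr[0].as_py() == 'red'   # getitem resolves through the dictionary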
+import pandas as pd + from pyarrow.compat import unittest, u, unicode_type import pyarrow as A @@ -100,3 +102,14 @@ def test_list(self): v = arr[3] assert len(v) == 0 + + def test_dictionary(self): + colors = ['red', 'green', 'blue'] + values = pd.Series(colors * 4) + + categorical = pd.Categorical(values, categories=colors) + + v = A.DictionaryArray.from_arrays(categorical.codes, + categorical.categories) + for i, c in enumerate(values): + assert v[i].as_py() == c From bf2acf6cb22b8d2bf6d0fb98a6117e78e92b81fe Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Wed, 22 Mar 2017 23:07:05 -0400 Subject: [PATCH 0403/1644] ARROW-454: pojo.Field doesn't implement hashCode() Author: Julien Le Dem Closes #423 from julienledem/field_hashcode and squashes the following commits: 192a689 [Julien Le Dem] ARROW-454: pojo.Field doesn't implement hashCode() --- .../src/main/codegen/templates/ArrowType.java | 2 +- .../apache/arrow/vector/types/pojo/Field.java | 29 ++++++++++--------- .../arrow/vector/types/pojo/TestSchema.java | 22 ++++++++++++++ 3 files changed, 39 insertions(+), 14 deletions(-) diff --git a/java/vector/src/main/codegen/templates/ArrowType.java b/java/vector/src/main/codegen/templates/ArrowType.java index 85ea3898e09c6..91cbe98196b81 100644 --- a/java/vector/src/main/codegen/templates/ArrowType.java +++ b/java/vector/src/main/codegen/templates/ArrowType.java @@ -164,7 +164,7 @@ public String toString() { @Override public int hashCode() { - return Objects.hash(<#list type.fields as field>${field.name}<#if field_has_next>, ); + return java.util.Arrays.deepHashCode(new Object[] {<#list type.fields as field>${field.name}<#if field_has_next>, }); } @Override diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java index bbbd559f10a3d..c310b9082f78f 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java @@ -24,14 +24,6 @@ import java.util.List; import java.util.Objects; -import com.fasterxml.jackson.annotation.JsonCreator; -import com.fasterxml.jackson.annotation.JsonInclude; -import com.fasterxml.jackson.annotation.JsonInclude.Include; -import com.fasterxml.jackson.annotation.JsonProperty; -import com.google.common.base.Joiner; -import com.google.common.collect.ImmutableList; -import com.google.flatbuffers.FlatBufferBuilder; - import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.schema.TypeLayout; @@ -40,6 +32,14 @@ import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.ArrowType.Int; +import com.fasterxml.jackson.annotation.JsonCreator; +import com.fasterxml.jackson.annotation.JsonInclude; +import com.fasterxml.jackson.annotation.JsonInclude.Include; +import com.fasterxml.jackson.annotation.JsonProperty; +import com.google.common.base.Joiner; +import com.google.common.collect.ImmutableList; +import com.google.flatbuffers.FlatBufferBuilder; + public class Field { private final String name; private final boolean nullable; @@ -176,6 +176,11 @@ public TypeLayout getTypeLayout() { return typeLayout; } + @Override + public int hashCode() { + return Objects.hash(name, nullable, type, dictionary, children); + } + @Override public boolean equals(Object obj) { if (!(obj instanceof Field)) { @@ -183,12 +188,10 @@ public boolean equals(Object obj) { } Field that = (Field) obj; return 
Objects.equals(this.name, that.name) && - Objects.equals(this.nullable, that.nullable) && - Objects.equals(this.type, that.type) && + Objects.equals(this.nullable, that.nullable) && + Objects.equals(this.type, that.type) && Objects.equals(this.dictionary, that.dictionary) && - (Objects.equals(this.children, that.children) || - (this.children == null || this.children.size() == 0) && - (that.children == null || that.children.size() == 0)); + Objects.equals(this.children, that.children); } @Override diff --git a/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java b/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java index 5b74c54c9159f..a7d1cce917747 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java @@ -22,6 +22,7 @@ import static org.junit.Assert.assertTrue; import java.io.IOException; +import java.util.List; import org.apache.arrow.vector.types.FloatingPointPrecision; import org.apache.arrow.vector.types.IntervalUnit; @@ -125,6 +126,27 @@ private void roundTrip(Schema schema) throws IOException { Schema actual = Schema.fromJSON(json); assertEquals(schema.toJson(), actual.toJson()); assertEquals(schema, actual); + validateFieldsHashcode(schema.getFields(), actual.getFields()); + assertEquals(schema.hashCode(), actual.hashCode()); + } + + private void validateFieldsHashcode(List schemaFields, List actualFields) { + assertEquals(schemaFields.size(), actualFields.size()); + if (schemaFields.size() == 0) { + return; + } + for (int i = 0; i < schemaFields.size(); i++) { + Field schemaField = schemaFields.get(i); + Field actualField = actualFields.get(i); + validateFieldsHashcode(schemaField.getChildren(), actualField.getChildren()); + validateHashCode(schemaField.getType(), actualField.getType()); + validateHashCode(schemaField, actualField); + } + } + + private void validateHashCode(Object o1, Object o2) { + assertEquals(o1, o2); + assertEquals(o1 + " == " + o2, o1.hashCode(), o2.hashCode()); } private void contains(Schema schema, String... 
s) throws IOException { From 990e2bde758ac8bc6e4497ae1bc37f89b71bb5cf Mon Sep 17 00:00:00 2001 From: Emilio Lahr-Vivaz Date: Wed, 22 Mar 2017 23:08:01 -0400 Subject: [PATCH 0404/1644] ARROW-691: [Java] Encode dictionary type in message format Author: Emilio Lahr-Vivaz Closes #422 from elahrvivaz/ARROW-691 and squashes the following commits: c1adad1 [Emilio Lahr-Vivaz] ARROW-691 Encode dictionary type in message format --- .../vector/types/pojo/DictionaryEncoding.java | 18 ++++++++++++++++++ .../apache/arrow/vector/types/pojo/Field.java | 3 ++- .../vector/stream/MessageSerializerTest.java | 15 +++++++++++++++ 3 files changed, 35 insertions(+), 1 deletion(-) diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/DictionaryEncoding.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/DictionaryEncoding.java index 6d35cdef832f9..32568d34ba495 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/DictionaryEncoding.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/DictionaryEncoding.java @@ -18,6 +18,8 @@ ******************************************************************************/ package org.apache.arrow.vector.types.pojo; +import java.util.Objects; + import org.apache.arrow.vector.types.pojo.ArrowType.Int; public class DictionaryEncoding { @@ -48,4 +50,20 @@ public Int getIndexType() { public String toString() { return "DictionaryEncoding[id=" + id + ",ordered=" + ordered + ",indexType=" + indexType + "]"; } + + @Override + public boolean equals(Object o) { + if (this == o) { + return true; + } else if (o == null || getClass() != o.getClass()) { + return false; + } + DictionaryEncoding that = (DictionaryEncoding) o; + return id == that.id && ordered == that.ordered && Objects.equals(indexType, that.indexType); + } + + @Override + public int hashCode() { + return Objects.hash(id, ordered, indexType); + } } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java index c310b9082f78f..011f0e6e446a8 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java @@ -121,10 +121,11 @@ public int getField(FlatBufferBuilder builder) { int typeOffset = type.getType(builder); int dictionaryOffset = -1; if (dictionary != null) { - // TODO encode dictionary type - currently type is only signed 32 bit int (default null) + int dictionaryType = dictionary.getIndexType().getType(builder); org.apache.arrow.flatbuf.DictionaryEncoding.startDictionaryEncoding(builder); org.apache.arrow.flatbuf.DictionaryEncoding.addId(builder, dictionary.getId()); org.apache.arrow.flatbuf.DictionaryEncoding.addIsOrdered(builder, dictionary.isOrdered()); + org.apache.arrow.flatbuf.DictionaryEncoding.addIndexType(builder, dictionaryType); dictionaryOffset = org.apache.arrow.flatbuf.DictionaryEncoding.endDictionaryEncoding(builder); } int[] childrenData = new int[children.size()]; diff --git a/java/vector/src/test/java/org/apache/arrow/vector/stream/MessageSerializerTest.java b/java/vector/src/test/java/org/apache/arrow/vector/stream/MessageSerializerTest.java index bb2ccf8cbb5f6..d3d49d5fb8096 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/stream/MessageSerializerTest.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/stream/MessageSerializerTest.java @@ -37,6 +37,7 @@ import org.apache.arrow.vector.schema.ArrowMessage; import 
org.apache.arrow.vector.schema.ArrowRecordBatch; import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.DictionaryEncoding; import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.types.pojo.Schema; import org.junit.Test; @@ -72,6 +73,20 @@ public void testSchemaMessageSerialization() throws IOException { assertEquals(1, deserialized.getFields().size()); } + @Test + public void testSchemaDictionaryMessageSerialization() throws IOException { + DictionaryEncoding dictionary = new DictionaryEncoding(9L, false, new ArrowType.Int(8, true)); + Field field = new Field("test", true, ArrowType.Utf8.INSTANCE, dictionary, null); + Schema schema = new Schema(Collections.singletonList(field)); + ByteArrayOutputStream out = new ByteArrayOutputStream(); + long size = MessageSerializer.serialize(new WriteChannel(Channels.newChannel(out)), schema); + assertEquals(size, out.toByteArray().length); + + ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray()); + Schema deserialized = MessageSerializer.deserializeSchema(new ReadChannel(Channels.newChannel(in))); + assertEquals(schema, deserialized); + } + @Test public void testSerializeRecordBatch() throws IOException { byte[] validity = new byte[] { (byte)255, 0}; From 2926183276e69390bd84569c364b4e8fb316db53 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Wed, 22 Mar 2017 23:09:10 -0400 Subject: [PATCH 0405/1644] ARROW-347: Add method to pass CallBack when creating a transfer pair supersedes and closes #182 Author: Julien Le Dem Closes #425 from julienledem/arrow_347 and squashes the following commits: 3c47b82 [Julien Le Dem] ARROW-347: Add method to pass CallBack when creating a transfer pair --- .../main/codegen/templates/UnionVector.java | 13 ++- .../arrow/vector/BaseDataValueVector.java | 7 ++ .../org/apache/arrow/vector/ValueVector.java | 3 + .../org/apache/arrow/vector/ZeroVector.java | 6 ++ .../complex/BaseRepeatedValueVector.java | 14 ++- .../arrow/vector/complex/ListVector.java | 34 +++++--- .../arrow/vector/complex/MapVector.java | 5 ++ .../vector/complex/NullableMapVector.java | 7 +- .../vector/complex/impl/PromotableWriter.java | 2 +- .../complex/writer/TestComplexWriter.java | 86 +++++++++++++++++-- 10 files changed, 146 insertions(+), 31 deletions(-) diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java index 076ed93999623..d17935b08eefc 100644 --- a/java/vector/src/main/codegen/templates/UnionVector.java +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -236,12 +236,17 @@ public Field getField() { @Override public TransferPair getTransferPair(BufferAllocator allocator) { - return new TransferImpl(name, allocator); + return getTransferPair(name, allocator); } @Override public TransferPair getTransferPair(String ref, BufferAllocator allocator) { - return new TransferImpl(ref, allocator); + return getTransferPair(ref, allocator, null); + } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator, CallBack callBack) { + return new org.apache.arrow.vector.complex.UnionVector.TransferImpl(ref, allocator, callBack); } @Override @@ -276,8 +281,8 @@ private class TransferImpl implements TransferPair { private final TransferPair typeVectorTransferPair; private final UnionVector to; - public TransferImpl(String name, BufferAllocator allocator) { - to = new UnionVector(name, allocator, null); + public TransferImpl(String name, BufferAllocator allocator, 
CallBack callBack) { + to = new UnionVector(name, allocator, callBack); internalMapVectorTransferPair = internalMap.makeTransferPair(to.internalMap); typeVectorTransferPair = typeVector.makeTransferPair(to.typeVector); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java index 7fe1615da5a27..6d7d3f04a6d04 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BaseDataValueVector.java @@ -24,6 +24,8 @@ import org.apache.arrow.vector.schema.ArrowFieldNode; import io.netty.buffer.ArrowBuf; +import org.apache.arrow.vector.util.CallBack; +import org.apache.arrow.vector.util.TransferPair; public abstract class BaseDataValueVector extends BaseValueVector implements BufferBacked { @@ -87,6 +89,11 @@ public void close() { super.close(); } + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator, CallBack callBack) { + return getTransferPair(ref, allocator); + } + @Override public ArrowBuf[] getBuffers(boolean clear) { ArrowBuf[] out; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java index ff7b94c34d80d..8e35398b9394b 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java @@ -24,6 +24,7 @@ import org.apache.arrow.vector.complex.reader.FieldReader; import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.util.TransferPair; import io.netty.buffer.ArrowBuf; @@ -106,6 +107,8 @@ public interface ValueVector extends Closeable, Iterable { TransferPair getTransferPair(String ref, BufferAllocator allocator); + TransferPair getTransferPair(String ref, BufferAllocator allocator, CallBack callBack); + /** * Returns a new {@link org.apache.arrow.vector.util.TransferPair transfer pair} that is used to transfer underlying * buffers into the target vector. 
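The shape of the new overload in caller code, as a sketch; it assumes the post-patch Java API, with SchemaChangeCallBack (used by the test in this patch) standing in for any CallBack implementation:

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.SchemaChangeCallBack;
import org.apache.arrow.vector.complex.ListVector;
import org.apache.arrow.vector.util.TransferPair;

public class TransferPairCallBackSketch {
  public static void main(String[] args) throws Exception {
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         ListVector source = new ListVector("source", allocator, null,
                                            new SchemaChangeCallBack())) {
      SchemaChangeCallBack targetCallBack = new SchemaChangeCallBack();
      // The pair's target vector now carries its own callback, so child
      // type promotions on the target side are visible to the caller
      // rather than silently dropped.
      TransferPair pair = source.getTransferPair("target", allocator,
                                                 targetCallBack);
      pair.transfer();
      pair.getTo().close();
    }
  }
}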
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java b/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java index e163b4fa9398f..73f858e4d35a0 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java @@ -29,6 +29,7 @@ import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.ArrowType.Null; import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.util.TransferPair; import io.netty.buffer.ArrowBuf; @@ -159,6 +160,11 @@ public TransferPair getTransferPair(String ref, BufferAllocator allocator) { return defaultPair; } + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator, CallBack callBack) { + return defaultPair; + } + @Override public TransferPair makeTransferPair(ValueVector target) { return defaultPair; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java index eeb8f5830f404..eda1f3bc80a96 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java @@ -29,6 +29,7 @@ import org.apache.arrow.vector.ZeroVector; import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.DictionaryEncoding; +import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.util.SchemaChangeRuntimeException; import com.google.common.base.Preconditions; @@ -44,15 +45,17 @@ public abstract class BaseRepeatedValueVector extends BaseValueVector implements protected final UInt4Vector offsets; protected FieldVector vector; + protected final CallBack callBack; - protected BaseRepeatedValueVector(String name, BufferAllocator allocator) { - this(name, allocator, DEFAULT_DATA_VECTOR); + protected BaseRepeatedValueVector(String name, BufferAllocator allocator, CallBack callBack) { + this(name, allocator, DEFAULT_DATA_VECTOR, callBack); } - protected BaseRepeatedValueVector(String name, BufferAllocator allocator, FieldVector vector) { + protected BaseRepeatedValueVector(String name, BufferAllocator allocator, FieldVector vector, CallBack callBack) { super(name, allocator); this.offsets = new UInt4Vector(OFFSETS_VECTOR_NAME, allocator); this.vector = Preconditions.checkNotNull(vector, "data vector cannot be null"); + this.callBack = callBack; } @Override @@ -154,9 +157,12 @@ public int size() { public AddOrGetResult addOrGetVector(MinorType minorType, DictionaryEncoding dictionary) { boolean created = false; if (vector instanceof ZeroVector) { - vector = minorType.getNewVector(DATA_VECTOR_NAME, allocator, dictionary, null); + vector = minorType.getNewVector(DATA_VECTOR_NAME, allocator, dictionary, callBack); // returned vector must have the same field created = true; + if (callBack != null) { + callBack.doWork(); + } } if (vector.getField().getType().getTypeID() != minorType.getType().getTypeID()) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java index a12440e39e8fe..54b051b9781e5 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java +++ 
b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java @@ -24,10 +24,6 @@ import java.util.Collections; import java.util.List; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.ObjectArrays; - -import io.netty.buffer.ArrowBuf; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.OutOfMemoryException; import org.apache.arrow.vector.AddOrGetResult; @@ -52,6 +48,11 @@ import org.apache.arrow.vector.util.JsonStringArrayList; import org.apache.arrow.vector.util.TransferPair; +import com.google.common.collect.ImmutableList; +import com.google.common.collect.ObjectArrays; + +import io.netty.buffer.ArrowBuf; + public class ListVector extends BaseRepeatedValueVector implements FieldVector { final UInt4Vector offsets; @@ -59,17 +60,15 @@ public class ListVector extends BaseRepeatedValueVector implements FieldVector { private final List innerVectors; private Mutator mutator = new Mutator(); private Accessor accessor = new Accessor(); - private UnionListWriter writer; private UnionListReader reader; private CallBack callBack; private final DictionaryEncoding dictionary; public ListVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack) { - super(name, allocator); + super(name, allocator, callBack); this.bits = new BitVector("$bits$", allocator); this.offsets = getOffsetVector(); this.innerVectors = Collections.unmodifiableList(Arrays.asList(bits, offsets)); - this.writer = new UnionListWriter(this); this.reader = new UnionListReader(this); this.dictionary = dictionary; this.callBack = callBack; @@ -86,6 +85,8 @@ public void initializeChildrenFromFields(List children) { if (!addOrGetVector.isCreated()) { throw new IllegalArgumentException("Child vector already existed: " + addOrGetVector.getVector()); } + + addOrGetVector.getVector().initializeChildrenFromFields(field.getChildren()); } @Override @@ -111,7 +112,7 @@ public List getFieldInnerVectors() { } public UnionListWriter getWriter() { - return writer; + return new UnionListWriter(this); } @Override @@ -139,7 +140,12 @@ public FieldVector getDataVector() { @Override public TransferPair getTransferPair(String ref, BufferAllocator allocator) { - return new TransferImpl(ref, allocator); + return getTransferPair(ref, allocator, null); + } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator, CallBack callBack) { + return new TransferImpl(ref, allocator, callBack); } @Override @@ -152,8 +158,8 @@ private class TransferImpl implements TransferPair { ListVector to; TransferPair pairs[] = new TransferPair[3]; - public TransferImpl(String name, BufferAllocator allocator) { - this(new ListVector(name, allocator, dictionary, null)); + public TransferImpl(String name, BufferAllocator allocator, CallBack callBack) { + this(new ListVector(name, allocator, dictionary, callBack)); } public TransferImpl(ListVector to) { @@ -172,6 +178,7 @@ public void transfer() { for (TransferPair pair : pairs) { pair.transfer(); } + to.lastSet = lastSet; } @Override @@ -282,9 +289,12 @@ public ArrowBuf[] getBuffers(boolean clear) { } public UnionVector promoteToUnion() { - UnionVector vector = new UnionVector(name, allocator, null); + UnionVector vector = new UnionVector(name, allocator, callBack); replaceDataVector(vector); reader = new UnionListReader(this); + if (callBack != null) { + callBack.doWork(); + } return vector; } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java 
b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java index 4d750cad264db..cb67537c446c6 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java @@ -115,6 +115,11 @@ public int getBufferSizeFor(final int valueCount) { @Override public TransferPair getTransferPair(BufferAllocator allocator) { + return getTransferPair(name, allocator, null); + } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator, CallBack callBack) { return new MapTransferPair(this, new MapVector(name, allocator, callBack), false); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java index bb1fdf841a305..de1d1857370b0 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java @@ -86,7 +86,7 @@ public FieldReader getReader() { @Override public TransferPair getTransferPair(BufferAllocator allocator) { - return new NullableMapTransferPair(this, new NullableMapVector(name, allocator, dictionary, callBack), false); + return new NullableMapTransferPair(this, new NullableMapVector(name, allocator, dictionary, null), false); } @Override @@ -96,6 +96,11 @@ public TransferPair makeTransferPair(ValueVector to) { @Override public TransferPair getTransferPair(String ref, BufferAllocator allocator) { + return new NullableMapTransferPair(this, new NullableMapVector(ref, allocator, dictionary, null), false); + } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator, CallBack callBack) { return new NullableMapTransferPair(this, new NullableMapVector(ref, allocator, dictionary, callBack), false); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java index e33319a2270b1..1880c9b490c27 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java @@ -142,7 +142,7 @@ public boolean isEmptyMap() { } protected FieldWriter getWriter() { - return getWriter(type); + return writer; } private FieldWriter promoteToUnion() { diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java index a8a2d512c09ec..99ba19bec80e7 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java @@ -29,8 +29,10 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.SchemaChangeCallBack; import org.apache.arrow.vector.complex.ListVector; import org.apache.arrow.vector.complex.MapVector; +import org.apache.arrow.vector.complex.NullableMapVector; import org.apache.arrow.vector.complex.UnionVector; import org.apache.arrow.vector.complex.impl.ComplexWriterImpl; import org.apache.arrow.vector.complex.impl.SingleMapReaderImpl; @@ -49,7 +51,11 @@ import org.apache.arrow.vector.types.pojo.ArrowType.Union; import 
org.apache.arrow.vector.types.pojo.ArrowType.Utf8; import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.util.CallBack; +import org.apache.arrow.vector.util.JsonStringArrayList; +import org.apache.arrow.vector.util.JsonStringHashMap; import org.apache.arrow.vector.util.Text; +import org.apache.arrow.vector.util.TransferPair; import org.joda.time.DateTime; import org.joda.time.DateTimeZone; import org.junit.Assert; @@ -65,7 +71,38 @@ public class TestComplexWriter { @Test public void simpleNestedTypes() { - MapVector parent = new MapVector("parent", allocator, null); + MapVector parent = populateMapVector(null); + MapReader rootReader = new SingleMapReaderImpl(parent).reader("root"); + for (int i = 0; i < COUNT; i++) { + rootReader.setPosition(i); + Assert.assertEquals(i, rootReader.reader("int").readInteger().intValue()); + Assert.assertEquals(i, rootReader.reader("bigInt").readLong().longValue()); + } + + parent.close(); + } + + @Test + public void transferPairSchemaChange() { + SchemaChangeCallBack callBack1 = new SchemaChangeCallBack(); + SchemaChangeCallBack callBack2 = new SchemaChangeCallBack(); + MapVector parent = populateMapVector(callBack1); + + TransferPair tp = parent.getTransferPair("newVector", allocator, callBack2); + + ComplexWriter writer = new ComplexWriterImpl("newWriter", parent); + MapWriter rootWriter = writer.rootAsMap(); + IntWriter intWriter = rootWriter.integer("newInt"); + intWriter.writeInt(1); + writer.setValueCount(1); + + assertTrue(callBack1.getSchemaChangedAndReset()); + // The second vector should not have registered a schema change + assertFalse(callBack1.getSchemaChangedAndReset()); + } + + private MapVector populateMapVector(CallBack callBack) { + MapVector parent = new MapVector("parent", allocator, callBack); ComplexWriter writer = new ComplexWriterImpl("root", parent); MapWriter rootWriter = writer.rootAsMap(); IntWriter intWriter = rootWriter.integer("int"); @@ -77,14 +114,7 @@ public void simpleNestedTypes() { rootWriter.end(); } writer.setValueCount(COUNT); - MapReader rootReader = new SingleMapReaderImpl(parent).reader("root"); - for (int i = 0; i < COUNT; i++) { - rootReader.setPosition(i); - Assert.assertEquals(i, rootReader.reader("int").readInteger().intValue()); - Assert.assertEquals(i, rootReader.reader("bigInt").readLong().longValue()); - } - - parent.close(); + return parent; } @Test @@ -646,4 +676,42 @@ public void timeStampWriters() throws Exception { long nanoLong = nanoReader.readLong(); Assert.assertEquals(expectedNanos, nanoLong); } + + @Test + public void complexCopierWithList() { + MapVector parent = new MapVector("parent", allocator, null); + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + ListWriter listWriter = rootWriter.list("list"); + MapWriter innerMapWriter = listWriter.map(); + IntWriter outerIntWriter = listWriter.integer(); + rootWriter.start(); + listWriter.startList(); + outerIntWriter.writeInt(1); + outerIntWriter.writeInt(2); + innerMapWriter.start(); + IntWriter intWriter = innerMapWriter.integer("a"); + intWriter.writeInt(1); + innerMapWriter.end(); + innerMapWriter.start(); + intWriter = innerMapWriter.integer("a"); + intWriter.writeInt(2); + innerMapWriter.end(); + listWriter.endList(); + rootWriter.end(); + writer.setValueCount(1); + + NullableMapVector mapVector = (NullableMapVector) parent.getChild("root"); + TransferPair tp = mapVector.getTransferPair(allocator); + tp.splitAndTransfer(0, 1); + MapVector toMapVector = 
(MapVector) tp.getTo(); + JsonStringHashMap toMapValue = (JsonStringHashMap) toMapVector.getAccessor().getObject(0); + JsonStringArrayList object = (JsonStringArrayList) toMapValue.get("list"); + assertEquals(1, object.get(0)); + assertEquals(2, object.get(1)); + JsonStringHashMap innerMap = (JsonStringHashMap) object.get(2); + assertEquals(1, innerMap.get("a")); + innerMap = (JsonStringHashMap) object.get(3); + assertEquals(2, innerMap.get("a")); + } } \ No newline at end of file From f67974b190349c781509d2b1657331935f458f9b Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Wed, 22 Mar 2017 23:10:01 -0400 Subject: [PATCH 0406/1644] ARROW-700: Add headroom interface for allocator Author: Julien Le Dem Closes #424 from julienledem/headroom and squashes the following commits: 2aab160 [Julien Le Dem] ARROW-700: Add headroom interface for allocator --- .../java/org/apache/arrow/memory/Accountant.java | 14 ++++++++++++-- .../org/apache/arrow/memory/BufferAllocator.java | 8 ++++++++ .../org/apache/arrow/memory/TestAccountant.java | 13 +++++++++++-- 3 files changed, 31 insertions(+), 4 deletions(-) diff --git a/java/memory/src/main/java/org/apache/arrow/memory/Accountant.java b/java/memory/src/main/java/org/apache/arrow/memory/Accountant.java index 6ddc8f784bc4a..89329b2766357 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/Accountant.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/Accountant.java @@ -18,12 +18,12 @@ package org.apache.arrow.memory; -import com.google.common.base.Preconditions; - import java.util.concurrent.atomic.AtomicLong; import javax.annotation.concurrent.ThreadSafe; +import com.google.common.base.Preconditions; + /** * Provides a concurrent way to manage account for memory usage without locking. Used as basis * for Allocators. All @@ -202,6 +202,7 @@ public boolean isOverLimit() { /** * Close this Accountant. This will release any reservation bytes back to a parent Accountant. */ + @Override public void close() { // return memory reservation to parent allocator. if (parent != null) { @@ -248,6 +249,15 @@ public long getPeakMemoryAllocation() { return peakAllocation.get(); } + public long getHeadroom(){ + long localHeadroom = allocationLimit.get() - locallyHeldMemory.get(); + if(parent == null){ + return localHeadroom; + } + + return Math.min(localHeadroom, parent.getHeadroom()); + } + /** * Describes the type of outcome that occurred when trying to account for allocation of memory. */ diff --git a/java/memory/src/main/java/org/apache/arrow/memory/BufferAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/BufferAllocator.java index 81ffb1bec780e..c05e9acb0aa96 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/BufferAllocator.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/BufferAllocator.java @@ -105,6 +105,14 @@ public interface BufferAllocator extends AutoCloseable { */ public long getPeakMemoryAllocation(); + /** + * Returns the amount of memory that can probably be allocated at this moment + * without exceeding this or any parents allocation maximum. + * + * @return Headroom in bytes + */ + public long getHeadroom(); + /** * Create an allocation reservation. A reservation is a way of building up * a request for a buffer whose size is not known in advance. 
See diff --git a/java/memory/src/test/java/org/apache/arrow/memory/TestAccountant.java b/java/memory/src/test/java/org/apache/arrow/memory/TestAccountant.java index 86bccf5064a60..2624a4a047e7e 100644 --- a/java/memory/src/test/java/org/apache/arrow/memory/TestAccountant.java +++ b/java/memory/src/test/java/org/apache/arrow/memory/TestAccountant.java @@ -19,7 +19,6 @@ import static org.junit.Assert.assertEquals; -import org.apache.arrow.memory.Accountant; import org.apache.arrow.memory.Accountant.AllocationOutcome; import org.junit.Assert; import org.junit.Test; @@ -36,6 +35,7 @@ public void nested() { final Accountant parent = new Accountant(null, 0, Long.MAX_VALUE); ensureAccurateReservations(parent); assertEquals(0, parent.getAllocatedMemory()); + assertEquals(parent.getLimit() - parent.getAllocatedMemory(), parent.getHeadroom()); } @Test @@ -71,6 +71,7 @@ public void run() { } assertEquals(0, parent.getAllocatedMemory()); + assertEquals(parent.getLimit() - parent.getAllocatedMemory(), parent.getHeadroom()); } private void ensureAccurateReservations(Accountant outsideParent) { @@ -121,6 +122,9 @@ private void ensureAccurateReservations(Accountant outsideParent) { // went beyond reservation, now in parent accountant assertEquals(3, parent.getAllocatedMemory()); + assertEquals(7, child.getHeadroom()); + assertEquals(7, parent.getHeadroom()); + { AllocationOutcome first = child.allocateBytes(7); assertEquals(AllocationOutcome.SUCCESS, first); @@ -135,9 +139,11 @@ private void ensureAccurateReservations(Accountant outsideParent) { child.releaseBytes(9); assertEquals(1, child.getAllocatedMemory()); + assertEquals(8, child.getHeadroom()); // back to reservation size assertEquals(2, parent.getAllocatedMemory()); + assertEquals(8, parent.getHeadroom()); AllocationOutcome first = child.allocateBytes(10); assertEquals(AllocationOutcome.FAILED_PARENT, first); @@ -152,11 +158,14 @@ private void ensureAccurateReservations(Accountant outsideParent) { // at new limit assertEquals(child.getAllocatedMemory(), 11); assertEquals(parent.getAllocatedMemory(), 11); - + assertEquals(-1, child.getHeadroom()); + assertEquals(-1, parent.getHeadroom()); child.releaseBytes(11); assertEquals(child.getAllocatedMemory(), 0); assertEquals(parent.getAllocatedMemory(), 2); + assertEquals(8, child.getHeadroom()); + assertEquals(8, parent.getHeadroom()); child.close(); parent.close(); From e8f6a492d30d32cd67fe3a537b3aec4cbae566c9 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Wed, 22 Mar 2017 20:15:55 -0700 Subject: [PATCH 0407/1644] ARROW-674: [Java] Support additional Timestamp timezone metadata Author: Julien Le Dem Closes #408 from julienledem/timestamp_md and squashes the following commits: e394526 [Julien Le Dem] ARROW-674: [Java] Support additional Timestamp timezone metadata --- .../src/main/codegen/data/ArrowTypes.tdd | 2 +- .../org/apache/arrow/vector/types/Types.java | 16 ++-- .../apache/arrow/vector/pojo/TestConvert.java | 2 +- .../arrow/vector/types/pojo/TestSchema.java | 90 ++++++++++++------- 4 files changed, 66 insertions(+), 44 deletions(-) diff --git a/java/vector/src/main/codegen/data/ArrowTypes.tdd b/java/vector/src/main/codegen/data/ArrowTypes.tdd index 8f997524fccfc..94fe31e8dc0d8 100644 --- a/java/vector/src/main/codegen/data/ArrowTypes.tdd +++ b/java/vector/src/main/codegen/data/ArrowTypes.tdd @@ -62,7 +62,7 @@ }, { name: "Timestamp", - fields: [{name: "unit", type: short, valueType: TimeUnit}] + fields: [{name: "unit", type: short, valueType: TimeUnit}, {name: "timezone", type: String}] }, { 
name: "Interval", diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java index 7cbf3c5bb5e36..81743b51917a1 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -109,10 +109,10 @@ public class Types { private static final Field UINT8_FIELD = new Field("", true, new Int(64, false), null); private static final Field DATE_FIELD = new Field("", true, Date.INSTANCE, null); private static final Field TIME_FIELD = new Field("", true, new Time(TimeUnit.MILLISECOND, 32), null); - private static final Field TIMESTAMPSEC_FIELD = new Field("", true, new Timestamp(TimeUnit.SECOND), null); - private static final Field TIMESTAMPMILLI_FIELD = new Field("", true, new Timestamp(TimeUnit.MILLISECOND), null); - private static final Field TIMESTAMPMICRO_FIELD = new Field("", true, new Timestamp(TimeUnit.MICROSECOND), null); - private static final Field TIMESTAMPNANO_FIELD = new Field("", true, new Timestamp(TimeUnit.NANOSECOND), null); + private static final Field TIMESTAMPSEC_FIELD = new Field("", true, new Timestamp(TimeUnit.SECOND, "UTC"), null); + private static final Field TIMESTAMPMILLI_FIELD = new Field("", true, new Timestamp(TimeUnit.MILLISECOND, "UTC"), null); + private static final Field TIMESTAMPMICRO_FIELD = new Field("", true, new Timestamp(TimeUnit.MICROSECOND, "UTC"), null); + private static final Field TIMESTAMPNANO_FIELD = new Field("", true, new Timestamp(TimeUnit.NANOSECOND, "UTC"), null); private static final Field INTERVALDAY_FIELD = new Field("", true, new Interval(IntervalUnit.DAY_TIME), null); private static final Field INTERVALYEAR_FIELD = new Field("", true, new Interval(IntervalUnit.YEAR_MONTH), null); private static final Field FLOAT4_FIELD = new Field("", true, new FloatingPoint(FloatingPointPrecision.SINGLE), null); @@ -252,7 +252,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { } }, // time in second from the Unix epoch, 00:00:00.000000 on 1 January 1970, UTC. - TIMESTAMPSEC(new Timestamp(org.apache.arrow.vector.types.TimeUnit.SECOND)) { + TIMESTAMPSEC(new Timestamp(org.apache.arrow.vector.types.TimeUnit.SECOND, "UTC")) { @Override public Field getField() { return TIMESTAMPSEC_FIELD; @@ -269,7 +269,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { } }, // time in millis from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC. - TIMESTAMPMILLI(new Timestamp(org.apache.arrow.vector.types.TimeUnit.MILLISECOND)) { + TIMESTAMPMILLI(new Timestamp(org.apache.arrow.vector.types.TimeUnit.MILLISECOND, "UTC")) { @Override public Field getField() { return TIMESTAMPMILLI_FIELD; @@ -286,7 +286,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { } }, // time in microsecond from the Unix epoch, 00:00:00.000000 on 1 January 1970, UTC. - TIMESTAMPMICRO(new Timestamp(org.apache.arrow.vector.types.TimeUnit.MICROSECOND)) { + TIMESTAMPMICRO(new Timestamp(org.apache.arrow.vector.types.TimeUnit.MICROSECOND, "UTC")) { @Override public Field getField() { return TIMESTAMPMICRO_FIELD; @@ -303,7 +303,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { } }, // time in nanosecond from the Unix epoch, 00:00:00.000000000 on 1 January 1970, UTC. 
- TIMESTAMPNANO(new Timestamp(org.apache.arrow.vector.types.TimeUnit.NANOSECOND)) { + TIMESTAMPNANO(new Timestamp(org.apache.arrow.vector.types.TimeUnit.NANOSECOND, "UTC")) { @Override public Field getField() { return TIMESTAMPNANO_FIELD; diff --git a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java index 65823e2a821a1..824c62aa5fbf3 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java @@ -81,7 +81,7 @@ public void nestedSchema() { new Field("child4.1", true, Utf8.INSTANCE, null) ))); childrenBuilder.add(new Field("child5", true, new Union(UnionMode.Sparse, new int[] { MinorType.TIMESTAMPMILLI.ordinal(), MinorType.FLOAT8.ordinal() } ), ImmutableList.of( - new Field("child5.1", true, new Timestamp(TimeUnit.MILLISECOND), null), + new Field("child5.1", true, new Timestamp(TimeUnit.MILLISECOND, "UTC"), null), new Field("child5.2", true, new FloatingPoint(DOUBLE), ImmutableList.of()) ))); Schema initialSchema = new Schema(childrenBuilder.build()); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java b/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java index a7d1cce917747..9f1b2e014b860 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java @@ -28,6 +28,20 @@ import org.apache.arrow.vector.types.IntervalUnit; import org.apache.arrow.vector.types.TimeUnit; import org.apache.arrow.vector.types.UnionMode; +import org.apache.arrow.vector.types.pojo.ArrowType.Binary; +import org.apache.arrow.vector.types.pojo.ArrowType.Bool; +import org.apache.arrow.vector.types.pojo.ArrowType.Date; +import org.apache.arrow.vector.types.pojo.ArrowType.Decimal; +import org.apache.arrow.vector.types.pojo.ArrowType.FloatingPoint; +import org.apache.arrow.vector.types.pojo.ArrowType.Int; +import org.apache.arrow.vector.types.pojo.ArrowType.Interval; +import org.apache.arrow.vector.types.pojo.ArrowType.List; +import org.apache.arrow.vector.types.pojo.ArrowType.Null; +import org.apache.arrow.vector.types.pojo.ArrowType.Struct; +import org.apache.arrow.vector.types.pojo.ArrowType.Time; +import org.apache.arrow.vector.types.pojo.ArrowType.Timestamp; +import org.apache.arrow.vector.types.pojo.ArrowType.Union; +import org.apache.arrow.vector.types.pojo.ArrowType.Utf8; import org.junit.Test; public class TestSchema { @@ -43,38 +57,40 @@ private static Field field(String name, ArrowType type, Field... 
children) { @Test public void testComplex() throws IOException { Schema schema = new Schema(asList( - field("a", false, new ArrowType.Int(8, true)), - field("b", new ArrowType.Struct(), - field("c", new ArrowType.Int(16, true)), - field("d", new ArrowType.Utf8())), - field("e", new ArrowType.List(), field(null, new ArrowType.Date())), - field("f", new ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE)), - field("g", new ArrowType.Timestamp(TimeUnit.MILLISECOND)), - field("h", new ArrowType.Interval(IntervalUnit.DAY_TIME)) + field("a", false, new Int(8, true)), + field("b", new Struct(), + field("c", new Int(16, true)), + field("d", new Utf8())), + field("e", new List(), field(null, new Date())), + field("f", new FloatingPoint(FloatingPointPrecision.SINGLE)), + field("g", new Timestamp(TimeUnit.MILLISECOND, "UTC")), + field("h", new Timestamp(TimeUnit.MICROSECOND, null)), + field("i", new Interval(IntervalUnit.DAY_TIME)) )); roundTrip(schema); assertEquals( - "Schema, e: List, f: FloatingPoint(SINGLE), g: Timestamp(MILLISECOND), h: Interval(DAY_TIME)>", + "Schema, e: List, f: FloatingPoint(SINGLE), g: Timestamp(MILLISECOND, UTC), h: Timestamp(MICROSECOND, null), i: Interval(DAY_TIME)>", schema.toString()); } @Test public void testAll() throws IOException { Schema schema = new Schema(asList( - field("a", false, new ArrowType.Null()), - field("b", new ArrowType.Struct(), field("ba", new ArrowType.Null())), - field("c", new ArrowType.List(), field("ca", new ArrowType.Null())), - field("d", new ArrowType.Union(UnionMode.Sparse, new int[] {1, 2, 3}), field("da", new ArrowType.Null())), - field("e", new ArrowType.Int(8, true)), - field("f", new ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE)), - field("g", new ArrowType.Utf8()), - field("h", new ArrowType.Binary()), - field("i", new ArrowType.Bool()), - field("j", new ArrowType.Decimal(5, 5)), - field("k", new ArrowType.Date()), - field("l", new ArrowType.Time(TimeUnit.MILLISECOND, 32)), - field("m", new ArrowType.Timestamp(TimeUnit.MILLISECOND)), - field("n", new ArrowType.Interval(IntervalUnit.DAY_TIME)) + field("a", false, new Null()), + field("b", new Struct(), field("ba", new Null())), + field("c", new List(), field("ca", new Null())), + field("d", new Union(UnionMode.Sparse, new int[] {1, 2, 3}), field("da", new Null())), + field("e", new Int(8, true)), + field("f", new FloatingPoint(FloatingPointPrecision.SINGLE)), + field("g", new Utf8()), + field("h", new Binary()), + field("i", new Bool()), + field("j", new Decimal(5, 5)), + field("k", new Date()), + field("l", new Time(TimeUnit.MILLISECOND, 32)), + field("m", new Timestamp(TimeUnit.MILLISECOND, "UTC")), + field("n", new Timestamp(TimeUnit.MICROSECOND, null)), + field("o", new Interval(IntervalUnit.DAY_TIME)) )); roundTrip(schema); } @@ -82,7 +98,7 @@ public void testAll() throws IOException { @Test public void testUnion() throws IOException { Schema schema = new Schema(asList( - field("d", new ArrowType.Union(UnionMode.Sparse, new int[] {1, 2, 3}), field("da", new ArrowType.Null())) + field("d", new Union(UnionMode.Sparse, new int[] {1, 2, 3}), field("da", new Null())) )); roundTrip(schema); contains(schema, "Sparse"); @@ -91,20 +107,26 @@ public void testUnion() throws IOException { @Test public void testTS() throws IOException { Schema schema = new Schema(asList( - field("a", new ArrowType.Timestamp(TimeUnit.SECOND)), - field("b", new ArrowType.Timestamp(TimeUnit.MILLISECOND)), - field("c", new ArrowType.Timestamp(TimeUnit.MICROSECOND)), - field("d", new 
ArrowType.Timestamp(TimeUnit.NANOSECOND)) + field("a", new Timestamp(TimeUnit.SECOND, "UTC")), + field("b", new Timestamp(TimeUnit.MILLISECOND, "UTC")), + field("c", new Timestamp(TimeUnit.MICROSECOND, "UTC")), + field("d", new Timestamp(TimeUnit.NANOSECOND, "UTC")), + field("e", new Timestamp(TimeUnit.SECOND, null)), + field("f", new Timestamp(TimeUnit.MILLISECOND, null)), + field("g", new Timestamp(TimeUnit.MICROSECOND, null)), + field("h", new Timestamp(TimeUnit.NANOSECOND, null)) )); roundTrip(schema); - contains(schema, "SECOND", "MILLISECOND", "MICROSECOND", "NANOSECOND"); + assertEquals( + "Schema", + schema.toString()); } @Test public void testInterval() throws IOException { Schema schema = new Schema(asList( - field("a", new ArrowType.Interval(IntervalUnit.YEAR_MONTH)), - field("b", new ArrowType.Interval(IntervalUnit.DAY_TIME)) + field("a", new Interval(IntervalUnit.YEAR_MONTH)), + field("b", new Interval(IntervalUnit.DAY_TIME)) )); roundTrip(schema); contains(schema, "YEAR_MONTH", "DAY_TIME"); @@ -113,9 +135,9 @@ public void testInterval() throws IOException { @Test public void testFP() throws IOException { Schema schema = new Schema(asList( - field("a", new ArrowType.FloatingPoint(FloatingPointPrecision.HALF)), - field("b", new ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE)), - field("c", new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE)) + field("a", new FloatingPoint(FloatingPointPrecision.HALF)), + field("b", new FloatingPoint(FloatingPointPrecision.SINGLE)), + field("c", new FloatingPoint(FloatingPointPrecision.DOUBLE)) )); roundTrip(schema); contains(schema, "HALF", "SINGLE", "DOUBLE"); From 7594492d5105e86d3388c8bac94dab8dbfa5226a Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Thu, 23 Mar 2017 17:18:35 +0100 Subject: [PATCH 0408/1644] ARROW-704: Fix bad import caused by conflicting changes Author: Julien Le Dem Closes #430 from julienledem/ARROW-704 and squashes the following commits: 2e42330 [Julien Le Dem] ARROW-704: Fix bad import caused by conflicting changes --- .../java/org/apache/arrow/vector/types/pojo/TestSchema.java | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java b/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java index 9f1b2e014b860..57af9528c5933 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java @@ -22,7 +22,6 @@ import static org.junit.Assert.assertTrue; import java.io.IOException; -import java.util.List; import org.apache.arrow.vector.types.FloatingPointPrecision; import org.apache.arrow.vector.types.IntervalUnit; @@ -152,7 +151,7 @@ private void roundTrip(Schema schema) throws IOException { assertEquals(schema.hashCode(), actual.hashCode()); } - private void validateFieldsHashcode(List schemaFields, List actualFields) { + private void validateFieldsHashcode(java.util.List schemaFields, java.util.List actualFields) { assertEquals(schemaFields.size(), actualFields.size()); if (schemaFields.size() == 0) { return; From 2a568f093670daba7b5dab8c096669bcfdd09a5f Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 23 Mar 2017 12:30:44 -0400 Subject: [PATCH 0409/1644] ARROW-662: [Format] Move Schema flatbuffers into their own file that can be included @julienledem for some reason the Java build is failing for me locally (also on master): ``` [ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-compiler-plugin:3.2:testCompile (default-testCompile) on project arrow-vector: Compilation failure: Compilation failure: [ERROR] /home/wesm/code/arrow/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java:[38] error: a type with the same simple name is already defined by the single-type-import of List [ERROR] /home/wesm/code/arrow/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java:[64,19] error: List is abstract; cannot be instantiated [ERROR] /home/wesm/code/arrow/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java:[81,19] error: List is abstract; cannot be instantiated [ERROR] -> [Help 1] ``` Author: Wes McKinney Closes #429 from wesm/ARROW-662 and squashes the following commits: b588f81 [Wes McKinney] Move Schema flatbuffers into their own file that can be included --- cpp/src/arrow/ipc/CMakeLists.txt | 1 + format/File.fbs | 2 +- format/Message.fbs | 264 +---------------------------- format/Schema.fbs | 280 +++++++++++++++++++++++++++++++ java/format/pom.xml | 20 +-- 5 files changed, 295 insertions(+), 272 deletions(-) create mode 100644 format/Schema.fbs diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index 3a98a380e7019..629cc5bbed055 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -113,6 +113,7 @@ set(FBS_OUTPUT_FILES set(FBS_SRC ${CMAKE_SOURCE_DIR}/../format/Message.fbs ${CMAKE_SOURCE_DIR}/../format/File.fbs + ${CMAKE_SOURCE_DIR}/../format/Schema.fbs ${CMAKE_CURRENT_SOURCE_DIR}/feather.fbs) foreach(FIL ${FBS_SRC}) diff --git a/format/File.fbs b/format/File.fbs index e8d6da4f848ff..3a27ca67caf5f 100644 --- a/format/File.fbs +++ b/format/File.fbs @@ -15,7 +15,7 @@ // specific language governing permissions and limitations // under the License. -include "Message.fbs"; +include "Schema.fbs"; namespace org.apache.arrow.flatbuf; diff --git a/format/Message.fbs b/format/Message.fbs index ff30aceeda4f3..2cb60953c6a79 100644 --- a/format/Message.fbs +++ b/format/Message.fbs @@ -15,272 +15,14 @@ // specific language governing permissions and limitations // under the License. -namespace org.apache.arrow.flatbuf; - -enum MetadataVersion:short { - V1, - V2 -} - -/// ---------------------------------------------------------------------- -/// Logical types and their metadata (if any) -/// -/// These are stored in the flatbuffer in the Type union below - -table Null { -} - -/// A Struct_ in the flatbuffer metadata is the same as an Arrow Struct -/// (according to the physical memory layout). We used Struct_ here as -/// Struct is a reserved word in Flatbuffers -table Struct_ { -} - -table List { -} - -enum UnionMode:short { Sparse, Dense } - -/// A union is a complex type with children in Field -/// By default ids in the type vector refer to the offsets in the children -/// optionally typeIds provides an indirection between the child offset and the type id -/// for each child typeIds[offset] is the id used in the type vector -table Union { - mode: UnionMode; - typeIds: [ int ]; // optional, describes typeid of each child. 
-} - -table Int { - bitWidth: int; // restricted to 8, 16, 32, and 64 in v1 - is_signed: bool; -} - -enum Precision:short {HALF, SINGLE, DOUBLE} - -table FloatingPoint { - precision: Precision; -} - -/// Unicode with UTF-8 encoding -table Utf8 { -} - -table Binary { -} - -table FixedWidthBinary { - /// Number of bytes per value - byteWidth: int; -} - -table Bool { -} - -table Decimal { - precision: int; - scale: int; -} - -enum DateUnit: short { - DAY, - MILLISECOND -} - -/// Date is either a 32-bit or 64-bit type representing elapsed time since UNIX -/// epoch (1970-01-01), stored in either of two units: -/// -/// * Milliseconds (64 bits) indicating UNIX time elapsed since the epoch (no -/// leap seconds), where the values are evenly divisible by 86400000 -/// * Days (32 bits) since the UNIX epoch -table Date { - unit: DateUnit; -} - -enum TimeUnit: short { SECOND, MILLISECOND, MICROSECOND, NANOSECOND } - -/// Time type. The physical storage type depends on the unit -/// - SECOND and MILLISECOND: 32 bits -/// - MICROSECOND and NANOSECOND: 64 bits -table Time { - unit: TimeUnit; - bitWidth: int; -} - -/// Time elapsed from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC. -table Timestamp { - unit: TimeUnit; - - /// The time zone is a string indicating the name of a time zone, one of: - /// - /// * As used in the Olson time zone database (the "tz database" or - /// "tzdata"), such as "America/New_York" - /// * An absolute time zone offset of the form +XX:XX or -XX:XX, such as +07:30 - /// - /// Whether a timezone string is present indicates different semantics about - /// the data: - /// - /// * If the time zone is null or equal to an empty string, the data is "time - /// zone naive" and shall be displayed *as is* to the user, not localized - /// to the locale of the user. This data can be though of as UTC but - /// without having "UTC" as the time zone, it is not considered to be - /// localized to any time zone - /// - /// * If the time zone is set to a valid value, values can be displayed as - /// "localized" to that time zone, even though the underlying 64-bit - /// integers are identical to the same data stored in UTC. Converting - /// between time zones is a metadata-only operation and does not change the - /// underlying values - timezone: string; -} - -enum IntervalUnit: short { YEAR_MONTH, DAY_TIME} -table Interval { - unit: IntervalUnit; -} - -/// ---------------------------------------------------------------------- -/// Top-level Type value, enabling extensible type-specific metadata. 
We can -/// add new logical types to Type without breaking backwards compatibility - -union Type { - Null, - Int, - FloatingPoint, - Binary, - Utf8, - Bool, - Decimal, - Date, - Time, - Timestamp, - Interval, - List, - Struct_, - Union, - FixedWidthBinary -} - -/// ---------------------------------------------------------------------- -/// The possible types of a vector - -enum VectorType: short { - /// used in List type, Dense Union and variable length primitive types (String, Binary) - OFFSET, - /// actual data, either wixed width primitive types in slots or variable width delimited by an OFFSET vector - DATA, - /// Bit vector indicating if each value is null - VALIDITY, - /// Type vector used in Union type - TYPE -} - -/// ---------------------------------------------------------------------- -/// represents the physical layout of a buffer -/// buffers have fixed width slots of a given type - -table VectorLayout { - /// the width of a slot in the buffer (typically 1, 8, 16, 32 or 64) - bit_width: short; - /// the purpose of the vector - type: VectorType; -} - - -/// ---------------------------------------------------------------------- -/// user defined key value pairs to add custom metadata to arrow -/// key namespacing is the responsibility of the user - -table KeyValue { - key: string; - value: [ubyte]; -} - -/// ---------------------------------------------------------------------- -/// Dictionary encoding metadata - -table DictionaryEncoding { - /// The known dictionary id in the application where this data is used. In - /// the file or streaming formats, the dictionary ids are found in the - /// DictionaryBatch messages - id: long; - - /// The dictionary indices are constrained to be positive integers. If this - /// field is null, the indices must be signed int32 - indexType: Int; +include "Schema.fbs"; - /// By default, dictionaries are not ordered, or the order does not have - /// semantic meaning. In some statistical, applications, dictionary-encoding - /// is used to represent ordered categorical data, and we provide a way to - /// preserve that metadata here - isOrdered: bool; -} - -/// ---------------------------------------------------------------------- -/// A field represents a named column in a record / row batch or child of a -/// nested type. -/// -/// - children is only for nested Arrow arrays -/// - For primitive types, children will have length 0 -/// - nullable should default to true in general - -table Field { - // Name is not required, in i.e. a List - name: string; - nullable: bool; - type: Type; - - // Present only if the field is dictionary encoded - dictionary: DictionaryEncoding; - - // children apply only to Nested data types like Struct, List and Union - children: [Field]; - /// layout of buffers produced for this type (as derived from the Type) - /// does not include children - /// each recordbatch will return instances of those Buffers. 
- layout: [ VectorLayout ]; - // User-defined metadata - custom_metadata: [ KeyValue ]; -} - -/// ---------------------------------------------------------------------- -/// Endianness of the platform that produces the RecordBatch - -enum Endianness:short { Little, Big } - -/// ---------------------------------------------------------------------- -/// A Schema describes the columns in a row batch - -table Schema { - - /// endianness of the buffer - /// it is Little Endian by default - /// if endianness doesn't match the underlying system then the vectors need to be converted - endianness: Endianness=Little; - - fields: [Field]; - // User-defined metadata - custom_metadata: [ KeyValue ]; -} +namespace org.apache.arrow.flatbuf; /// ---------------------------------------------------------------------- /// Data structures for describing a table row batch (a collection of /// equal-length Arrow arrays) -/// A Buffer represents a single contiguous memory segment -struct Buffer { - /// The shared memory page id where this buffer is located. Currently this is - /// not used - page: int; - - /// The relative offset into the shared memory page where the bytes for this - /// buffer starts - offset: long; - - /// The absolute length (in bytes) of the memory buffer. The memory is found - /// from offset (inclusive) to offset + length (non-inclusive). - length: long; -} - /// Metadata about a field at some level of a nested type tree (but not /// its children). /// @@ -349,4 +91,4 @@ table Message { bodyLength: long; } -root_type Message; +root_type Message; \ No newline at end of file diff --git a/format/Schema.fbs b/format/Schema.fbs new file mode 100644 index 0000000000000..5268bf95cfdc8 --- /dev/null +++ b/format/Schema.fbs @@ -0,0 +1,280 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +/// Logical types, vector layouts, and schemas + +namespace org.apache.arrow.flatbuf; + +enum MetadataVersion:short { + V1, + V2 +} + +/// These are stored in the flatbuffer in the Type union below + +table Null { +} + +/// A Struct_ in the flatbuffer metadata is the same as an Arrow Struct +/// (according to the physical memory layout). We used Struct_ here as +/// Struct is a reserved word in Flatbuffers +table Struct_ { +} + +table List { +} + +enum UnionMode:short { Sparse, Dense } + +/// A union is a complex type with children in Field +/// By default ids in the type vector refer to the offsets in the children +/// optionally typeIds provides an indirection between the child offset and the type id +/// for each child typeIds[offset] is the id used in the type vector +table Union { + mode: UnionMode; + typeIds: [ int ]; // optional, describes typeid of each child. 
+} + +table Int { + bitWidth: int; // restricted to 8, 16, 32, and 64 in v1 + is_signed: bool; +} + +enum Precision:short {HALF, SINGLE, DOUBLE} + +table FloatingPoint { + precision: Precision; +} + +/// Unicode with UTF-8 encoding +table Utf8 { +} + +table Binary { +} + +table FixedWidthBinary { + /// Number of bytes per value + byteWidth: int; +} + +table Bool { +} + +table Decimal { + precision: int; + scale: int; +} + +enum DateUnit: short { + DAY, + MILLISECOND +} + +/// Date is either a 32-bit or 64-bit type representing elapsed time since UNIX +/// epoch (1970-01-01), stored in either of two units: +/// +/// * Milliseconds (64 bits) indicating UNIX time elapsed since the epoch (no +/// leap seconds), where the values are evenly divisible by 86400000 +/// * Days (32 bits) since the UNIX epoch +table Date { + unit: DateUnit; +} + +enum TimeUnit: short { SECOND, MILLISECOND, MICROSECOND, NANOSECOND } + +/// Time type. The physical storage type depends on the unit +/// - SECOND and MILLISECOND: 32 bits +/// - MICROSECOND and NANOSECOND: 64 bits +table Time { + unit: TimeUnit; + bitWidth: int; +} + +/// Time elapsed from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC. +table Timestamp { + unit: TimeUnit; + + /// The time zone is a string indicating the name of a time zone, one of: + /// + /// * As used in the Olson time zone database (the "tz database" or + /// "tzdata"), such as "America/New_York" + /// * An absolute time zone offset of the form +XX:XX or -XX:XX, such as +07:30 + /// + /// Whether a timezone string is present indicates different semantics about + /// the data: + /// + /// * If the time zone is null or equal to an empty string, the data is "time + /// zone naive" and shall be displayed *as is* to the user, not localized + /// to the locale of the user. This data can be though of as UTC but + /// without having "UTC" as the time zone, it is not considered to be + /// localized to any time zone + /// + /// * If the time zone is set to a valid value, values can be displayed as + /// "localized" to that time zone, even though the underlying 64-bit + /// integers are identical to the same data stored in UTC. Converting + /// between time zones is a metadata-only operation and does not change the + /// underlying values + timezone: string; +} + +enum IntervalUnit: short { YEAR_MONTH, DAY_TIME} +table Interval { + unit: IntervalUnit; +} + +/// ---------------------------------------------------------------------- +/// Top-level Type value, enabling extensible type-specific metadata. 
We can +/// add new logical types to Type without breaking backwards compatibility + +union Type { + Null, + Int, + FloatingPoint, + Binary, + Utf8, + Bool, + Decimal, + Date, + Time, + Timestamp, + Interval, + List, + Struct_, + Union, + FixedWidthBinary +} + +/// ---------------------------------------------------------------------- +/// The possible types of a vector + +enum VectorType: short { + /// used in List type, Dense Union and variable length primitive types (String, Binary) + OFFSET, + /// actual data, either wixed width primitive types in slots or variable width delimited by an OFFSET vector + DATA, + /// Bit vector indicating if each value is null + VALIDITY, + /// Type vector used in Union type + TYPE +} + +/// ---------------------------------------------------------------------- +/// represents the physical layout of a buffer +/// buffers have fixed width slots of a given type + +table VectorLayout { + /// the width of a slot in the buffer (typically 1, 8, 16, 32 or 64) + bit_width: short; + /// the purpose of the vector + type: VectorType; +} + + +/// ---------------------------------------------------------------------- +/// user defined key value pairs to add custom metadata to arrow +/// key namespacing is the responsibility of the user + +table KeyValue { + key: string; + value: [ubyte]; +} + +/// ---------------------------------------------------------------------- +/// Dictionary encoding metadata + +table DictionaryEncoding { + /// The known dictionary id in the application where this data is used. In + /// the file or streaming formats, the dictionary ids are found in the + /// DictionaryBatch messages + id: long; + + /// The dictionary indices are constrained to be positive integers. If this + /// field is null, the indices must be signed int32 + indexType: Int; + + /// By default, dictionaries are not ordered, or the order does not have + /// semantic meaning. In some statistical, applications, dictionary-encoding + /// is used to represent ordered categorical data, and we provide a way to + /// preserve that metadata here + isOrdered: bool; +} + +/// ---------------------------------------------------------------------- +/// A field represents a named column in a record / row batch or child of a +/// nested type. +/// +/// - children is only for nested Arrow arrays +/// - For primitive types, children will have length 0 +/// - nullable should default to true in general + +table Field { + // Name is not required, in i.e. a List + name: string; + nullable: bool; + type: Type; + + // Present only if the field is dictionary encoded + dictionary: DictionaryEncoding; + + // children apply only to Nested data types like Struct, List and Union + children: [Field]; + /// layout of buffers produced for this type (as derived from the Type) + /// does not include children + /// each recordbatch will return instances of those Buffers. + layout: [ VectorLayout ]; + // User-defined metadata + custom_metadata: [ KeyValue ]; +} + +/// ---------------------------------------------------------------------- +/// Endianness of the platform producing the data + +enum Endianness:short { Little, Big } + +/// ---------------------------------------------------------------------- +/// A Buffer represents a single contiguous memory segment +struct Buffer { + /// The shared memory page id where this buffer is located. 
Currently this is + /// not used + page: int; + + /// The relative offset into the shared memory page where the bytes for this + /// buffer starts + offset: long; + + /// The absolute length (in bytes) of the memory buffer. The memory is found + /// from offset (inclusive) to offset + length (non-inclusive). + length: long; +} + +/// ---------------------------------------------------------------------- +/// A Schema describes the columns in a row batch + +table Schema { + + /// endianness of the buffer + /// it is Little Endian by default + /// if endianness doesn't match the underlying system then the vectors need to be converted + endianness: Endianness=Little; + + fields: [Field]; + // User-defined metadata + custom_metadata: [ KeyValue ]; +} + +root_type Schema; diff --git a/java/format/pom.xml b/java/format/pom.xml index c65a7bc3de197..e7a58a4172fe2 100644 --- a/java/format/pom.xml +++ b/java/format/pom.xml @@ -1,13 +1,13 @@ - 4.0.0 @@ -109,6 +109,7 @@ -j -o ${flatc.generated.files} + ../../format/Schema.fbs ../../format/Message.fbs ../../format/File.fbs @@ -165,4 +166,3 @@ - From e968ca6e30209abeb90e099eb95de59655be73a8 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 23 Mar 2017 12:38:40 -0400 Subject: [PATCH 0410/1644] ARROW-621: [C++] Start IPC benchmark suite for record batches, implement "inline" visitor. Code reorg From the benchmarks, the difference between virtual functions and an "inline" switch statement is very small, but it serves to reduce some boilerplate when many of the visit functions are the same Author: Wes McKinney Closes #427 from wesm/ARROW-621 and squashes the following commits: b975053 [Wes McKinney] cpplint 782636a [Wes McKinney] Mark template inline 3ae494e [Wes McKinney] Inline visitor, remove code duplication in loader.cc in favor of templates / std::enable_if 1b2d253 [Wes McKinney] Tweak benchmark params b126ca8 [Wes McKinney] Draft IPC roundtrip benchmark for wide record batches. 
Some test code refactoring --- cpp/CMakeLists.txt | 1 + cpp/src/arrow/CMakeLists.txt | 1 + cpp/src/arrow/api.h | 1 + cpp/src/arrow/array-list-test.cc | 1 + cpp/src/arrow/array-primitive-test.cc | 1 + cpp/src/arrow/array-string-test.cc | 1 + cpp/src/arrow/array-struct-test.cc | 1 + cpp/src/arrow/array.cc | 36 +---- cpp/src/arrow/array.h | 33 +---- cpp/src/arrow/column-test.cc | 1 + cpp/src/arrow/ipc/CMakeLists.txt | 4 + cpp/src/arrow/ipc/ipc-read-write-benchmark.cc | 134 ++++++++++++++++++ cpp/src/arrow/ipc/reader.h | 5 +- cpp/src/arrow/ipc/writer.cc | 8 +- cpp/src/arrow/loader.cc | 65 ++++----- cpp/src/arrow/table-test.cc | 1 + cpp/src/arrow/test-common.h | 84 +++++++++++ cpp/src/arrow/test-util.h | 45 +----- cpp/src/arrow/type.cc | 36 +---- cpp/src/arrow/type.h | 33 +---- cpp/src/arrow/type_fwd.h | 2 + cpp/src/arrow/visitor.cc | 96 +++++++++++++ cpp/src/arrow/visitor.h | 93 ++++++++++++ cpp/src/arrow/visitor_inline.h | 67 +++++++++ 24 files changed, 528 insertions(+), 222 deletions(-) create mode 100644 cpp/src/arrow/ipc/ipc-read-write-benchmark.cc create mode 100644 cpp/src/arrow/test-common.h create mode 100644 cpp/src/arrow/visitor.cc create mode 100644 cpp/src/arrow/visitor.h create mode 100644 cpp/src/arrow/visitor_inline.h diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 61e645da20e75..c04afe47030a5 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -845,6 +845,7 @@ set(ARROW_SRCS src/arrow/status.cc src/arrow/table.cc src/arrow/type.cc + src/arrow/visitor.cc src/arrow/util/bit-util.cc ) diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index 24a95475b14e0..0e83aacaadab5 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -34,6 +34,7 @@ install(FILES type_fwd.h type_traits.h test-util.h + visitor.h DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}/arrow") # pkg-config support diff --git a/cpp/src/arrow/api.h b/cpp/src/arrow/api.h index 3bc86662613ed..ea818b62931d6 100644 --- a/cpp/src/arrow/api.h +++ b/cpp/src/arrow/api.h @@ -32,5 +32,6 @@ #include "arrow/status.h" #include "arrow/table.h" #include "arrow/type.h" +#include "arrow/visitor.h" #endif // ARROW_API_H diff --git a/cpp/src/arrow/array-list-test.cc b/cpp/src/arrow/array-list-test.cc index 87dfdaaed33a4..1cfa77f684868 100644 --- a/cpp/src/arrow/array-list-test.cc +++ b/cpp/src/arrow/array-list-test.cc @@ -26,6 +26,7 @@ #include "arrow/array.h" #include "arrow/builder.h" #include "arrow/status.h" +#include "arrow/test-common.h" #include "arrow/test-util.h" #include "arrow/type.h" diff --git a/cpp/src/arrow/array-primitive-test.cc b/cpp/src/arrow/array-primitive-test.cc index dfa37a8063767..6863e58df05d2 100644 --- a/cpp/src/arrow/array-primitive-test.cc +++ b/cpp/src/arrow/array-primitive-test.cc @@ -26,6 +26,7 @@ #include "arrow/buffer.h" #include "arrow/builder.h" #include "arrow/status.h" +#include "arrow/test-common.h" #include "arrow/test-util.h" #include "arrow/type.h" #include "arrow/type_traits.h" diff --git a/cpp/src/arrow/array-string-test.cc b/cpp/src/arrow/array-string-test.cc index ed38acd010329..6c2c1516c8f3c 100644 --- a/cpp/src/arrow/array-string-test.cc +++ b/cpp/src/arrow/array-string-test.cc @@ -25,6 +25,7 @@ #include "arrow/array.h" #include "arrow/builder.h" +#include "arrow/test-common.h" #include "arrow/test-util.h" #include "arrow/type.h" #include "arrow/type_traits.h" diff --git a/cpp/src/arrow/array-struct-test.cc b/cpp/src/arrow/array-struct-test.cc index f4e7409a6232a..4eb1eab13fbc6 100644 --- a/cpp/src/arrow/array-struct-test.cc +++ 
b/cpp/src/arrow/array-struct-test.cc @@ -24,6 +24,7 @@ #include "arrow/array.h" #include "arrow/builder.h" #include "arrow/status.h" +#include "arrow/test-common.h" #include "arrow/test-util.h" #include "arrow/type.h" diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index 4fa2b2b521f59..20b732ab114da 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -28,6 +28,7 @@ #include "arrow/type_traits.h" #include "arrow/util/bit-util.h" #include "arrow/util/logging.h" +#include "arrow/visitor.h" namespace arrow { @@ -468,41 +469,6 @@ Status DictionaryArray::Accept(ArrayVisitor* visitor) const { return visitor->Visit(*this); } -#define ARRAY_VISITOR_DEFAULT(ARRAY_CLASS) \ - Status ArrayVisitor::Visit(const ARRAY_CLASS& array) { \ - return Status::NotImplemented(array.type()->ToString()); \ - } - -ARRAY_VISITOR_DEFAULT(NullArray); -ARRAY_VISITOR_DEFAULT(BooleanArray); -ARRAY_VISITOR_DEFAULT(Int8Array); -ARRAY_VISITOR_DEFAULT(Int16Array); -ARRAY_VISITOR_DEFAULT(Int32Array); -ARRAY_VISITOR_DEFAULT(Int64Array); -ARRAY_VISITOR_DEFAULT(UInt8Array); -ARRAY_VISITOR_DEFAULT(UInt16Array); -ARRAY_VISITOR_DEFAULT(UInt32Array); -ARRAY_VISITOR_DEFAULT(UInt64Array); -ARRAY_VISITOR_DEFAULT(HalfFloatArray); -ARRAY_VISITOR_DEFAULT(FloatArray); -ARRAY_VISITOR_DEFAULT(DoubleArray); -ARRAY_VISITOR_DEFAULT(BinaryArray); -ARRAY_VISITOR_DEFAULT(StringArray); -ARRAY_VISITOR_DEFAULT(FixedWidthBinaryArray); -ARRAY_VISITOR_DEFAULT(Date32Array); -ARRAY_VISITOR_DEFAULT(Date64Array); -ARRAY_VISITOR_DEFAULT(TimeArray); -ARRAY_VISITOR_DEFAULT(TimestampArray); -ARRAY_VISITOR_DEFAULT(IntervalArray); -ARRAY_VISITOR_DEFAULT(ListArray); -ARRAY_VISITOR_DEFAULT(StructArray); -ARRAY_VISITOR_DEFAULT(UnionArray); -ARRAY_VISITOR_DEFAULT(DictionaryArray); - -Status ArrayVisitor::Visit(const DecimalArray& array) { - return Status::NotImplemented("decimal"); -} - // ---------------------------------------------------------------------- // Instantiate templates diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index e66ac505d5dbf..2a072dbf25ec0 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -31,6 +31,7 @@ #include "arrow/util/bit-util.h" #include "arrow/util/macros.h" #include "arrow/util/visibility.h" +#include "arrow/visitor.h" namespace arrow { @@ -38,38 +39,6 @@ class MemoryPool; class MutableBuffer; class Status; -class ARROW_EXPORT ArrayVisitor { - public: - virtual ~ArrayVisitor() = default; - - virtual Status Visit(const NullArray& array); - virtual Status Visit(const BooleanArray& array); - virtual Status Visit(const Int8Array& array); - virtual Status Visit(const Int16Array& array); - virtual Status Visit(const Int32Array& array); - virtual Status Visit(const Int64Array& array); - virtual Status Visit(const UInt8Array& array); - virtual Status Visit(const UInt16Array& array); - virtual Status Visit(const UInt32Array& array); - virtual Status Visit(const UInt64Array& array); - virtual Status Visit(const HalfFloatArray& array); - virtual Status Visit(const FloatArray& array); - virtual Status Visit(const DoubleArray& array); - virtual Status Visit(const StringArray& array); - virtual Status Visit(const BinaryArray& array); - virtual Status Visit(const FixedWidthBinaryArray& array); - virtual Status Visit(const Date32Array& array); - virtual Status Visit(const Date64Array& array); - virtual Status Visit(const TimeArray& array); - virtual Status Visit(const TimestampArray& array); - virtual Status Visit(const IntervalArray& array); - virtual Status Visit(const DecimalArray& array); - 
virtual Status Visit(const ListArray& array); - virtual Status Visit(const StructArray& array); - virtual Status Visit(const UnionArray& array); - virtual Status Visit(const DictionaryArray& type); -}; - /// Immutable data array with some logical type and some length. /// /// Any memory is owned by the respective Buffer instance (or its parents). diff --git a/cpp/src/arrow/column-test.cc b/cpp/src/arrow/column-test.cc index 24d58c80b9fae..872fcb95c08e1 100644 --- a/cpp/src/arrow/column-test.cc +++ b/cpp/src/arrow/column-test.cc @@ -25,6 +25,7 @@ #include "arrow/array.h" #include "arrow/column.h" #include "arrow/schema.h" +#include "arrow/test-common.h" #include "arrow/test-util.h" #include "arrow/type.h" diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index 629cc5bbed055..056e7dba53830 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -170,3 +170,7 @@ if (ARROW_BUILD_UTILITIES) add_executable(stream-to-file stream-to-file.cc) target_link_libraries(stream-to-file ${UTIL_LINK_LIBS}) endif() + +ADD_ARROW_BENCHMARK(ipc-read-write-benchmark) +ARROW_TEST_LINK_LIBRARIES(ipc-read-write-benchmark + ${ARROW_IPC_TEST_LINK_LIBS}) diff --git a/cpp/src/arrow/ipc/ipc-read-write-benchmark.cc b/cpp/src/arrow/ipc/ipc-read-write-benchmark.cc new file mode 100644 index 0000000000000..e27e5136a0d5a --- /dev/null +++ b/cpp/src/arrow/ipc/ipc-read-write-benchmark.cc @@ -0,0 +1,134 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include "benchmark/benchmark.h" + +#include +#include +#include + +#include "arrow/api.h" +#include "arrow/io/memory.h" +#include "arrow/ipc/api.h" +#include "arrow/test-util.h" + +namespace arrow { + +template +std::shared_ptr MakeRecordBatch(int64_t total_size, int64_t num_fields) { + using T = typename TYPE::c_type; + size_t itemsize = sizeof(T); + int64_t length = total_size / num_fields / itemsize; + + auto type = TypeTraits::type_singleton(); + + std::vector is_valid; + test::random_is_valid(length, 0.1, &is_valid); + + std::vector values; + test::randint(length, 0, 100, &values); + + MemoryPool* pool = default_memory_pool(); + typename TypeTraits::BuilderType builder(pool, type); + for (size_t i = 0; i < values.size(); ++i) { + if (is_valid[i]) { + builder.Append(values[i]); + } else { + builder.AppendNull(); + } + } + std::shared_ptr array; + builder.Finish(&array); + + ArrayVector arrays; + std::vector> fields; + for (int64_t i = 0; i < num_fields; ++i) { + std::stringstream ss; + ss << "f" << i; + fields.push_back(field(ss.str(), type)); + arrays.push_back(array); + } + + auto schema = std::make_shared(fields); + return std::make_shared(schema, length, arrays); +} + +static void BM_WriteRecordBatch(benchmark::State& state) { // NOLINT non-const reference + // 1MB + constexpr int64_t kTotalSize = 1 << 20; + + auto buffer = std::make_shared(default_memory_pool()); + buffer->Resize(kTotalSize & 2); + auto record_batch = MakeRecordBatch(kTotalSize, state.range(0)); + + while (state.KeepRunning()) { + io::BufferOutputStream stream(buffer); + int32_t metadata_length; + int64_t body_length; + if (!ipc::WriteRecordBatch(*record_batch, 0, &stream, &metadata_length, &body_length, + default_memory_pool()) + .ok()) { + state.SkipWithError("Failed to write!"); + } + } + state.SetBytesProcessed(int64_t(state.iterations()) * kTotalSize); +} + +static void BM_ReadRecordBatch(benchmark::State& state) { // NOLINT non-const reference + // 1MB + constexpr int64_t kTotalSize = 1 << 20; + + auto buffer = std::make_shared(default_memory_pool()); + buffer->Resize(kTotalSize & 2); + auto record_batch = MakeRecordBatch(kTotalSize, state.range(0)); + + io::BufferOutputStream stream(buffer); + + int32_t metadata_length; + int64_t body_length; + if (!ipc::WriteRecordBatch(*record_batch, 0, &stream, &metadata_length, &body_length, + default_memory_pool()) + .ok()) { + state.SkipWithError("Failed to write!"); + } + + while (state.KeepRunning()) { + std::shared_ptr result; + io::BufferReader reader(buffer); + + if (!ipc::ReadRecordBatch(record_batch->schema(), 0, &reader, &result).ok()) { + state.SkipWithError("Failed to read!"); + } + } + state.SetBytesProcessed(int64_t(state.iterations()) * kTotalSize); +} + +BENCHMARK(BM_WriteRecordBatch) + ->RangeMultiplier(4) + ->Range(1, 1 << 13) + ->MinTime(1.0) + ->Unit(benchmark::kMicrosecond) + ->UseRealTime(); + +BENCHMARK(BM_ReadRecordBatch) + ->RangeMultiplier(4) + ->Range(1, 1 << 13) + ->MinTime(1.0) + ->Unit(benchmark::kMicrosecond) + ->UseRealTime(); + +} // namespace arrow diff --git a/cpp/src/arrow/ipc/reader.h b/cpp/src/arrow/ipc/reader.h index 1e8636c1efcce..ffd0a111d604b 100644 --- a/cpp/src/arrow/ipc/reader.h +++ b/cpp/src/arrow/ipc/reader.h @@ -120,10 +120,9 @@ class ARROW_EXPORT FileReader { std::unique_ptr impl_; }; - /// Read encapsulated message and RecordBatch -Status ARROW_EXPORT ReadRecordBatch(const std::shared_ptr& schema, - int64_t offset, io::RandomAccessFile* file, std::shared_ptr* out); +Status ARROW_EXPORT ReadRecordBatch(const 
std::shared_ptr& schema, int64_t offset, + io::RandomAccessFile* file, std::shared_ptr* out); } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc index 0f55f8e33e71d..dc991aba79795 100644 --- a/cpp/src/arrow/ipc/writer.cc +++ b/cpp/src/arrow/ipc/writer.cc @@ -47,9 +47,8 @@ namespace ipc { class RecordBatchWriter : public ArrayVisitor { public: - RecordBatchWriter( - MemoryPool* pool, int64_t buffer_start_offset, int max_recursion_depth, - bool allow_64bit) + RecordBatchWriter(MemoryPool* pool, int64_t buffer_start_offset, + int max_recursion_depth, bool allow_64bit) : pool_(pool), max_recursion_depth_(max_recursion_depth), buffer_start_offset_(buffer_start_offset), @@ -501,8 +500,7 @@ class DictionaryWriter : public RecordBatchWriter { Status WriteRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, MemoryPool* pool, int max_recursion_depth, bool allow_64bit) { - RecordBatchWriter writer(pool, buffer_start_offset, max_recursion_depth, - allow_64bit); + RecordBatchWriter writer(pool, buffer_start_offset, max_recursion_depth, allow_64bit); return writer.Write(batch, dst, metadata_length, body_length); } diff --git a/cpp/src/arrow/loader.cc b/cpp/src/arrow/loader.cc index bc506be572625..a67a3e94bd5f2 100644 --- a/cpp/src/arrow/loader.cc +++ b/cpp/src/arrow/loader.cc @@ -28,6 +28,7 @@ #include "arrow/type_traits.h" #include "arrow/util/logging.h" #include "arrow/util/visibility.h" +#include "arrow/visitor_inline.h" namespace arrow { @@ -35,7 +36,7 @@ class Array; struct DataType; class Status; -class ArrayLoader : public TypeVisitor { +class ArrayLoader { public: ArrayLoader(const std::shared_ptr& type, ArrayLoaderContext* context) : type_(type), context_(context) {} @@ -45,8 +46,7 @@ class ArrayLoader : public TypeVisitor { return Status::Invalid("Max recursion depth reached"); } - // Load the array - RETURN_NOT_OK(type_->Accept(this)); + RETURN_NOT_OK(VisitTypeInline(*type_, this)); *out = std::move(result_); return Status::OK(); @@ -92,8 +92,10 @@ class ArrayLoader : public TypeVisitor { return Status::OK(); } - template + template Status LoadBinary() { + using CONTAINER = typename TypeTraits::ArrayType; + FieldMetadata field_meta; std::shared_ptr null_bitmap, offsets, values; @@ -131,33 +133,24 @@ class ArrayLoader : public TypeVisitor { return Status::OK(); } -#define VISIT_PRIMITIVE(TYPE) \ - Status Visit(const TYPE& type) override { return LoadPrimitive(); } - - VISIT_PRIMITIVE(BooleanType); - VISIT_PRIMITIVE(Int8Type); - VISIT_PRIMITIVE(Int16Type); - VISIT_PRIMITIVE(Int32Type); - VISIT_PRIMITIVE(Int64Type); - VISIT_PRIMITIVE(UInt8Type); - VISIT_PRIMITIVE(UInt16Type); - VISIT_PRIMITIVE(UInt32Type); - VISIT_PRIMITIVE(UInt64Type); - VISIT_PRIMITIVE(HalfFloatType); - VISIT_PRIMITIVE(FloatType); - VISIT_PRIMITIVE(DoubleType); - VISIT_PRIMITIVE(Date32Type); - VISIT_PRIMITIVE(Date64Type); - VISIT_PRIMITIVE(TimeType); - VISIT_PRIMITIVE(TimestampType); - -#undef VISIT_PRIMITIVE - - Status Visit(const StringType& type) override { return LoadBinary(); } - - Status Visit(const BinaryType& type) override { return LoadBinary(); } - - Status Visit(const FixedWidthBinaryType& type) override { + Status Visit(const NullType& type) { return Status::NotImplemented("null"); } + + template + typename std::enable_if::value && + !std::is_base_of::value && + !std::is_base_of::value, + Status>::type + Visit(const T& type) { + return LoadPrimitive(); + } + + template + 
typename std::enable_if::value, Status>::type Visit( + const T& type) { + return LoadBinary(); + } + + Status Visit(const FixedWidthBinaryType& type) { FieldMetadata field_meta; std::shared_ptr null_bitmap, data; @@ -169,7 +162,7 @@ class ArrayLoader : public TypeVisitor { return Status::OK(); } - Status Visit(const ListType& type) override { + Status Visit(const ListType& type) { FieldMetadata field_meta; std::shared_ptr null_bitmap, offsets; @@ -196,7 +189,7 @@ class ArrayLoader : public TypeVisitor { return Status::OK(); } - Status Visit(const StructType& type) override { + Status Visit(const StructType& type) { FieldMetadata field_meta; std::shared_ptr null_bitmap; RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); @@ -209,7 +202,7 @@ class ArrayLoader : public TypeVisitor { return Status::OK(); } - Status Visit(const UnionType& type) override { + Status Visit(const UnionType& type) { FieldMetadata field_meta; std::shared_ptr null_bitmap, type_ids, offsets; @@ -230,12 +223,12 @@ class ArrayLoader : public TypeVisitor { return Status::OK(); } - Status Visit(const DictionaryType& type) override { + Status Visit(const DictionaryType& type) { std::shared_ptr indices; RETURN_NOT_OK(LoadArray(type.index_type(), context_, &indices)); result_ = std::make_shared(type_, indices); return Status::OK(); - }; + } std::shared_ptr result() const { return result_; } diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc index 36374731cbb49..6bb31638ecbbf 100644 --- a/cpp/src/arrow/table-test.cc +++ b/cpp/src/arrow/table-test.cc @@ -26,6 +26,7 @@ #include "arrow/schema.h" #include "arrow/status.h" #include "arrow/table.h" +#include "arrow/test-common.h" #include "arrow/test-util.h" #include "arrow/type.h" diff --git a/cpp/src/arrow/test-common.h b/cpp/src/arrow/test-common.h new file mode 100644 index 0000000000000..f704b6b545b7d --- /dev/null +++ b/cpp/src/arrow/test-common.h @@ -0,0 +1,84 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#ifndef ARROW_TEST_COMMON_H +#define ARROW_TEST_COMMON_H + +#include +#include +#include +#include +#include + +#include "gtest/gtest.h" + +#include "arrow/array.h" +#include "arrow/buffer.h" +#include "arrow/builder.h" +#include "arrow/column.h" +#include "arrow/memory_pool.h" +#include "arrow/table.h" +#include "arrow/test-util.h" + +namespace arrow { + +class TestBase : public ::testing::Test { + public: + void SetUp() { + pool_ = default_memory_pool(); + random_seed_ = 0; + } + + template + std::shared_ptr MakePrimitive(int64_t length, int64_t null_count = 0) { + auto data = std::make_shared(pool_); + const int64_t data_nbytes = length * sizeof(typename ArrayType::value_type); + EXPECT_OK(data->Resize(data_nbytes)); + + // Fill with random data + test::random_bytes(data_nbytes, random_seed_++, data->mutable_data()); + + auto null_bitmap = std::make_shared(pool_); + EXPECT_OK(null_bitmap->Resize(BitUtil::BytesForBits(length))); + return std::make_shared(length, data, null_bitmap, null_count); + } + + protected: + uint32_t random_seed_; + MemoryPool* pool_; +}; + +class TestBuilder : public ::testing::Test { + public: + void SetUp() { + pool_ = default_memory_pool(); + type_ = TypePtr(new UInt8Type()); + builder_.reset(new UInt8Builder(pool_)); + builder_nn_.reset(new UInt8Builder(pool_)); + } + + protected: + MemoryPool* pool_; + + TypePtr type_; + std::unique_ptr builder_; + std::unique_ptr builder_nn_; +}; + +} // namespace arrow + +#endif // ARROW_TEST_COMMON_H_ diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index f05a54168b631..bed555984fb68 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -25,7 +25,7 @@ #include #include -#include "gtest/gtest.h" +#include #include "arrow/array.h" #include "arrow/buffer.h" @@ -208,32 +208,6 @@ Status MakeRandomBytePoolBuffer(int64_t length, MemoryPool* pool, } // namespace test -class TestBase : public ::testing::Test { - public: - void SetUp() { - pool_ = default_memory_pool(); - random_seed_ = 0; - } - - template - std::shared_ptr MakePrimitive(int64_t length, int64_t null_count = 0) { - auto data = std::make_shared(pool_); - const int64_t data_nbytes = length * sizeof(typename ArrayType::value_type); - EXPECT_OK(data->Resize(data_nbytes)); - - // Fill with random data - test::random_bytes(data_nbytes, random_seed_++, data->mutable_data()); - - auto null_bitmap = std::make_shared(pool_); - EXPECT_OK(null_bitmap->Resize(BitUtil::BytesForBits(length))); - return std::make_shared(length, data, null_bitmap, null_count); - } - - protected: - uint32_t random_seed_; - MemoryPool* pool_; -}; - template void ArrayFromVector(const std::shared_ptr& type, const std::vector& is_valid, const std::vector& values, @@ -275,23 +249,6 @@ void ArrayFromVector(const std::vector& values, std::shared_ptr* ASSERT_OK(builder.Finish(out)); } -class TestBuilder : public ::testing::Test { - public: - void SetUp() { - pool_ = default_memory_pool(); - type_ = TypePtr(new UInt8Type()); - builder_.reset(new UInt8Builder(pool_)); - builder_nn_.reset(new UInt8Builder(pool_)); - } - - protected: - MemoryPool* pool_; - - TypePtr type_; - std::unique_ptr builder_; - std::unique_ptr builder_nn_; -}; - template Status MakeArray(const std::vector& valid_bytes, const std::vector& values, int64_t size, Builder* builder, std::shared_ptr* out) { diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 937cbc5a7669d..1c61eb61abea0 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -24,6 +24,7 @@ #include 
"arrow/compare.h" #include "arrow/status.h" #include "arrow/util/logging.h" +#include "arrow/visitor.h" namespace arrow { @@ -331,39 +332,4 @@ std::vector DecimalType::GetBufferLayout() const { return {}; } -// ---------------------------------------------------------------------- -// Default implementations of TypeVisitor methods - -#define TYPE_VISITOR_DEFAULT(TYPE_CLASS) \ - Status TypeVisitor::Visit(const TYPE_CLASS& type) { \ - return Status::NotImplemented(type.ToString()); \ - } - -TYPE_VISITOR_DEFAULT(NullType); -TYPE_VISITOR_DEFAULT(BooleanType); -TYPE_VISITOR_DEFAULT(Int8Type); -TYPE_VISITOR_DEFAULT(Int16Type); -TYPE_VISITOR_DEFAULT(Int32Type); -TYPE_VISITOR_DEFAULT(Int64Type); -TYPE_VISITOR_DEFAULT(UInt8Type); -TYPE_VISITOR_DEFAULT(UInt16Type); -TYPE_VISITOR_DEFAULT(UInt32Type); -TYPE_VISITOR_DEFAULT(UInt64Type); -TYPE_VISITOR_DEFAULT(HalfFloatType); -TYPE_VISITOR_DEFAULT(FloatType); -TYPE_VISITOR_DEFAULT(DoubleType); -TYPE_VISITOR_DEFAULT(StringType); -TYPE_VISITOR_DEFAULT(BinaryType); -TYPE_VISITOR_DEFAULT(FixedWidthBinaryType); -TYPE_VISITOR_DEFAULT(Date64Type); -TYPE_VISITOR_DEFAULT(Date32Type); -TYPE_VISITOR_DEFAULT(TimeType); -TYPE_VISITOR_DEFAULT(TimestampType); -TYPE_VISITOR_DEFAULT(IntervalType); -TYPE_VISITOR_DEFAULT(DecimalType); -TYPE_VISITOR_DEFAULT(ListType); -TYPE_VISITOR_DEFAULT(StructType); -TYPE_VISITOR_DEFAULT(UnionType); -TYPE_VISITOR_DEFAULT(DictionaryType); - } // namespace arrow diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index c179bf336987b..40c00a4bac1b1 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -28,6 +28,7 @@ #include "arrow/type_fwd.h" #include "arrow/util/macros.h" #include "arrow/util/visibility.h" +#include "arrow/visitor.h" namespace arrow { @@ -119,38 +120,6 @@ class BufferDescr { int bit_width_; }; -class ARROW_EXPORT TypeVisitor { - public: - virtual ~TypeVisitor() = default; - - virtual Status Visit(const NullType& type); - virtual Status Visit(const BooleanType& type); - virtual Status Visit(const Int8Type& type); - virtual Status Visit(const Int16Type& type); - virtual Status Visit(const Int32Type& type); - virtual Status Visit(const Int64Type& type); - virtual Status Visit(const UInt8Type& type); - virtual Status Visit(const UInt16Type& type); - virtual Status Visit(const UInt32Type& type); - virtual Status Visit(const UInt64Type& type); - virtual Status Visit(const HalfFloatType& type); - virtual Status Visit(const FloatType& type); - virtual Status Visit(const DoubleType& type); - virtual Status Visit(const StringType& type); - virtual Status Visit(const BinaryType& type); - virtual Status Visit(const FixedWidthBinaryType& type); - virtual Status Visit(const Date64Type& type); - virtual Status Visit(const Date32Type& type); - virtual Status Visit(const TimeType& type); - virtual Status Visit(const TimestampType& type); - virtual Status Visit(const IntervalType& type); - virtual Status Visit(const DecimalType& type); - virtual Status Visit(const ListType& type); - virtual Status Visit(const StructType& type); - virtual Status Visit(const UnionType& type); - virtual Status Visit(const DictionaryType& type); -}; - struct ARROW_EXPORT DataType { Type::type type; diff --git a/cpp/src/arrow/type_fwd.h b/cpp/src/arrow/type_fwd.h index ae85593cf4546..f62c0314a4620 100644 --- a/cpp/src/arrow/type_fwd.h +++ b/cpp/src/arrow/type_fwd.h @@ -18,6 +18,8 @@ #ifndef ARROW_TYPE_FWD_H #define ARROW_TYPE_FWD_H +#include + #include "arrow/util/visibility.h" namespace arrow { diff --git a/cpp/src/arrow/visitor.cc 
b/cpp/src/arrow/visitor.cc new file mode 100644 index 0000000000000..181e932eeebf6 --- /dev/null +++ b/cpp/src/arrow/visitor.cc @@ -0,0 +1,96 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/visitor.h" + +#include "arrow/array.h" +#include "arrow/status.h" +#include "arrow/type.h" + +namespace arrow { + +#define ARRAY_VISITOR_DEFAULT(ARRAY_CLASS) \ + Status ArrayVisitor::Visit(const ARRAY_CLASS& array) { \ + return Status::NotImplemented(array.type()->ToString()); \ + } + +ARRAY_VISITOR_DEFAULT(NullArray); +ARRAY_VISITOR_DEFAULT(BooleanArray); +ARRAY_VISITOR_DEFAULT(Int8Array); +ARRAY_VISITOR_DEFAULT(Int16Array); +ARRAY_VISITOR_DEFAULT(Int32Array); +ARRAY_VISITOR_DEFAULT(Int64Array); +ARRAY_VISITOR_DEFAULT(UInt8Array); +ARRAY_VISITOR_DEFAULT(UInt16Array); +ARRAY_VISITOR_DEFAULT(UInt32Array); +ARRAY_VISITOR_DEFAULT(UInt64Array); +ARRAY_VISITOR_DEFAULT(HalfFloatArray); +ARRAY_VISITOR_DEFAULT(FloatArray); +ARRAY_VISITOR_DEFAULT(DoubleArray); +ARRAY_VISITOR_DEFAULT(BinaryArray); +ARRAY_VISITOR_DEFAULT(StringArray); +ARRAY_VISITOR_DEFAULT(FixedWidthBinaryArray); +ARRAY_VISITOR_DEFAULT(Date32Array); +ARRAY_VISITOR_DEFAULT(Date64Array); +ARRAY_VISITOR_DEFAULT(TimeArray); +ARRAY_VISITOR_DEFAULT(TimestampArray); +ARRAY_VISITOR_DEFAULT(IntervalArray); +ARRAY_VISITOR_DEFAULT(ListArray); +ARRAY_VISITOR_DEFAULT(StructArray); +ARRAY_VISITOR_DEFAULT(UnionArray); +ARRAY_VISITOR_DEFAULT(DictionaryArray); + +Status ArrayVisitor::Visit(const DecimalArray& array) { + return Status::NotImplemented("decimal"); +} + +// ---------------------------------------------------------------------- +// Default implementations of TypeVisitor methods + +#define TYPE_VISITOR_DEFAULT(TYPE_CLASS) \ + Status TypeVisitor::Visit(const TYPE_CLASS& type) { \ + return Status::NotImplemented(type.ToString()); \ + } + +TYPE_VISITOR_DEFAULT(NullType); +TYPE_VISITOR_DEFAULT(BooleanType); +TYPE_VISITOR_DEFAULT(Int8Type); +TYPE_VISITOR_DEFAULT(Int16Type); +TYPE_VISITOR_DEFAULT(Int32Type); +TYPE_VISITOR_DEFAULT(Int64Type); +TYPE_VISITOR_DEFAULT(UInt8Type); +TYPE_VISITOR_DEFAULT(UInt16Type); +TYPE_VISITOR_DEFAULT(UInt32Type); +TYPE_VISITOR_DEFAULT(UInt64Type); +TYPE_VISITOR_DEFAULT(HalfFloatType); +TYPE_VISITOR_DEFAULT(FloatType); +TYPE_VISITOR_DEFAULT(DoubleType); +TYPE_VISITOR_DEFAULT(StringType); +TYPE_VISITOR_DEFAULT(BinaryType); +TYPE_VISITOR_DEFAULT(FixedWidthBinaryType); +TYPE_VISITOR_DEFAULT(Date64Type); +TYPE_VISITOR_DEFAULT(Date32Type); +TYPE_VISITOR_DEFAULT(TimeType); +TYPE_VISITOR_DEFAULT(TimestampType); +TYPE_VISITOR_DEFAULT(IntervalType); +TYPE_VISITOR_DEFAULT(DecimalType); +TYPE_VISITOR_DEFAULT(ListType); +TYPE_VISITOR_DEFAULT(StructType); +TYPE_VISITOR_DEFAULT(UnionType); +TYPE_VISITOR_DEFAULT(DictionaryType); + +} // namespace arrow diff --git 
a/cpp/src/arrow/visitor.h b/cpp/src/arrow/visitor.h new file mode 100644 index 0000000000000..a9c59c8f762fe --- /dev/null +++ b/cpp/src/arrow/visitor.h @@ -0,0 +1,93 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_VISITOR_H +#define ARROW_VISITOR_H + +#include "arrow/status.h" +#include "arrow/type_fwd.h" +#include "arrow/util/visibility.h" + +namespace arrow { + +class ARROW_EXPORT ArrayVisitor { + public: + virtual ~ArrayVisitor() = default; + + virtual Status Visit(const NullArray& array); + virtual Status Visit(const BooleanArray& array); + virtual Status Visit(const Int8Array& array); + virtual Status Visit(const Int16Array& array); + virtual Status Visit(const Int32Array& array); + virtual Status Visit(const Int64Array& array); + virtual Status Visit(const UInt8Array& array); + virtual Status Visit(const UInt16Array& array); + virtual Status Visit(const UInt32Array& array); + virtual Status Visit(const UInt64Array& array); + virtual Status Visit(const HalfFloatArray& array); + virtual Status Visit(const FloatArray& array); + virtual Status Visit(const DoubleArray& array); + virtual Status Visit(const StringArray& array); + virtual Status Visit(const BinaryArray& array); + virtual Status Visit(const FixedWidthBinaryArray& array); + virtual Status Visit(const Date32Array& array); + virtual Status Visit(const Date64Array& array); + virtual Status Visit(const TimeArray& array); + virtual Status Visit(const TimestampArray& array); + virtual Status Visit(const IntervalArray& array); + virtual Status Visit(const DecimalArray& array); + virtual Status Visit(const ListArray& array); + virtual Status Visit(const StructArray& array); + virtual Status Visit(const UnionArray& array); + virtual Status Visit(const DictionaryArray& type); +}; + +class ARROW_EXPORT TypeVisitor { + public: + virtual ~TypeVisitor() = default; + + virtual Status Visit(const NullType& type); + virtual Status Visit(const BooleanType& type); + virtual Status Visit(const Int8Type& type); + virtual Status Visit(const Int16Type& type); + virtual Status Visit(const Int32Type& type); + virtual Status Visit(const Int64Type& type); + virtual Status Visit(const UInt8Type& type); + virtual Status Visit(const UInt16Type& type); + virtual Status Visit(const UInt32Type& type); + virtual Status Visit(const UInt64Type& type); + virtual Status Visit(const HalfFloatType& type); + virtual Status Visit(const FloatType& type); + virtual Status Visit(const DoubleType& type); + virtual Status Visit(const StringType& type); + virtual Status Visit(const BinaryType& type); + virtual Status Visit(const FixedWidthBinaryType& type); + virtual Status Visit(const Date64Type& type); + virtual Status Visit(const Date32Type& type); + virtual Status Visit(const TimeType& type); + virtual 
Status Visit(const TimestampType& type); + virtual Status Visit(const IntervalType& type); + virtual Status Visit(const DecimalType& type); + virtual Status Visit(const ListType& type); + virtual Status Visit(const StructType& type); + virtual Status Visit(const UnionType& type); + virtual Status Visit(const DictionaryType& type); +}; + +} // namespace arrow + +#endif // ARROW_VISITOR_H diff --git a/cpp/src/arrow/visitor_inline.h b/cpp/src/arrow/visitor_inline.h new file mode 100644 index 0000000000000..b69468d17eebe --- /dev/null +++ b/cpp/src/arrow/visitor_inline.h @@ -0,0 +1,67 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +// Private header, not to be exported + +#ifndef ARROW_VISITOR_INLINE_H +#define ARROW_VISITOR_INLINE_H + +#include "arrow/array.h" +#include "arrow/status.h" +#include "arrow/type.h" + +namespace arrow { + +#define TYPE_VISIT_INLINE(TYPE_CLASS) \ + case TYPE_CLASS::type_id: \ + return visitor->Visit(static_cast(type)); + +template +inline Status VisitTypeInline(const DataType& type, VISITOR* visitor) { + switch (type.type) { + TYPE_VISIT_INLINE(NullType); + TYPE_VISIT_INLINE(BooleanType); + TYPE_VISIT_INLINE(Int8Type); + TYPE_VISIT_INLINE(UInt8Type); + TYPE_VISIT_INLINE(Int16Type); + TYPE_VISIT_INLINE(UInt16Type); + TYPE_VISIT_INLINE(Int32Type); + TYPE_VISIT_INLINE(UInt32Type); + TYPE_VISIT_INLINE(Int64Type); + TYPE_VISIT_INLINE(UInt64Type); + TYPE_VISIT_INLINE(FloatType); + TYPE_VISIT_INLINE(DoubleType); + TYPE_VISIT_INLINE(StringType); + TYPE_VISIT_INLINE(BinaryType); + TYPE_VISIT_INLINE(FixedWidthBinaryType); + TYPE_VISIT_INLINE(Date32Type); + TYPE_VISIT_INLINE(Date64Type); + TYPE_VISIT_INLINE(TimestampType); + TYPE_VISIT_INLINE(TimeType); + TYPE_VISIT_INLINE(ListType); + TYPE_VISIT_INLINE(StructType); + TYPE_VISIT_INLINE(UnionType); + TYPE_VISIT_INLINE(DictionaryType); + default: + break; + } + return Status::NotImplemented("Type not implemented"); +} + +} // namespace arrow + +#endif // ARROW_VISITOR_INLINE_H From dcaa8e5d7ef1353c657e016bf271495042825a91 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Thu, 23 Mar 2017 12:44:27 -0700 Subject: [PATCH 0411/1644] ARROW-702: fix BitVector.copyFromSafe to reAllocate instead of returning false Author: Julien Le Dem Closes #426 from julienledem/arrow_702 and squashes the following commits: 4c77b95 [Julien Le Dem] add license 7ab84aa [Julien Le Dem] Thanks Hakim for the test case ba8aa8e [Julien Le Dem] ARROW-702: fix BitVector.copyFromSafe to reAllocate instead of returning false --- .../arrow/memory/TestBaseAllocator.java | 2 +- .../org/apache/arrow/vector/BitVector.java | 8 +-- .../apache/arrow/vector/TestBitVector.java | 66 +++++++++++++++++++ 3 files changed, 70 insertions(+), 6 deletions(-) create mode 100644 
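
Editor's sketch: the inline visitor above is what lets `ArrayLoader` drop its virtual `TypeVisitor` base earlier in this patch. A minimal, hypothetical caller looks roughly like this — the `ByteWidthVisitor` class and `GetByteWidth` helper are invented for illustration, assuming the `FixedWidthType` hierarchy and its `bit_width()` method as of this revision:

```cpp
#include <type_traits>

#include "arrow/status.h"
#include "arrow/type.h"
#include "arrow/visitor_inline.h"

namespace arrow {

// Hypothetical visitor (not part of the patch). VisitTypeInline dispatches
// through the switch in visitor_inline.h, so these Visit() overloads are
// resolved at compile time; no virtual calls are involved.
class ByteWidthVisitor {
 public:
  // Selected for any fixed-width type (integers, floats, dates, times, ...)
  template <typename T>
  typename std::enable_if<std::is_base_of<FixedWidthType, T>::value, Status>::type
  Visit(const T& type) {
    byte_width_ = type.bit_width() / 8;
    return Status::OK();
  }

  // Fallback for variable-width and nested types
  Status Visit(const DataType& type) { return Status::NotImplemented(type.ToString()); }

  int byte_width() const { return byte_width_; }

 private:
  int byte_width_ = 0;
};

Status GetByteWidth(const DataType& type, int* out) {
  ByteWidthVisitor visitor;
  Status st = VisitTypeInline(type, &visitor);
  if (!st.ok()) { return st; }
  *out = visitor.byte_width();
  return Status::OK();
}

}  // namespace arrow
```
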
From dcaa8e5d7ef1353c657e016bf271495042825a91 Mon Sep 17 00:00:00 2001
From: Julien Le Dem
Date: Thu, 23 Mar 2017 12:44:27 -0700
Subject: [PATCH 0411/1644] ARROW-702: fix BitVector.copyFromSafe to reAllocate
 instead of returning false

Author: Julien Le Dem

Closes #426 from julienledem/arrow_702 and squashes the following commits:

4c77b95 [Julien Le Dem] add license
7ab84aa [Julien Le Dem] Thanks Hakim for the test case
ba8aa8e [Julien Le Dem] ARROW-702: fix BitVector.copyFromSafe to reAllocate instead of returning false
---
 .../arrow/memory/TestBaseAllocator.java       |  2 +-
 .../org/apache/arrow/vector/BitVector.java    |  8 +--
 .../apache/arrow/vector/TestBitVector.java    | 66 +++++++++++++++++++
 3 files changed, 70 insertions(+), 6 deletions(-)
 create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/TestBitVector.java

diff --git a/java/memory/src/test/java/org/apache/arrow/memory/TestBaseAllocator.java b/java/memory/src/test/java/org/apache/arrow/memory/TestBaseAllocator.java
index 3c96d57f4e64d..59b7be87e17be 100644
--- a/java/memory/src/test/java/org/apache/arrow/memory/TestBaseAllocator.java
+++ b/java/memory/src/test/java/org/apache/arrow/memory/TestBaseAllocator.java
@@ -381,7 +381,7 @@ public void testAllocator_sliceRanges() throws Exception {
       assertEquals((byte) i, slice1.getByte(i));
     }
 
-    final ArrowBuf slice2 = (ArrowBuf) arrowBuf.slice(25, 25);
+    final ArrowBuf slice2 = arrowBuf.slice(25, 25);
     assertEquals(0, slice2.readerIndex());
     assertEquals(25, slice2.readableBytes());
     for(int i = 25; i < 50; ++i) {
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java
index 179f2ee879f43..ed574333beacd 100644
--- a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java
+++ b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java
@@ -216,13 +216,11 @@ public void copyFrom(int inIndex, int outIndex, BitVector from) {
     this.mutator.set(outIndex, from.accessor.get(inIndex));
   }
 
-  public boolean copyFromSafe(int inIndex, int outIndex, BitVector from) {
+  public void copyFromSafe(int inIndex, int outIndex, BitVector from) {
     if (outIndex >= this.getValueCapacity()) {
-      decrementAllocationMonitor();
-      return false;
+      reAlloc();
     }
     copyFrom(inIndex, outIndex, from);
-    return true;
   }
 
   @Override
@@ -273,7 +271,7 @@ public void splitAndTransferTo(int startIndex, int length, BitVector target) {
       if (target.data != null) {
        target.data.release();
       }
-      target.data = (ArrowBuf) data.slice(firstByte, byteSize);
+      target.data = data.slice(firstByte, byteSize);
       target.data.retain(1);
     } else {
       // Copy data
diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestBitVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestBitVector.java
new file mode 100644
index 0000000000000..f2343c88e70a5
--- /dev/null
+++ b/java/vector/src/test/java/org/apache/arrow/vector/TestBitVector.java
@@ -0,0 +1,66 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.arrow.vector;
+
+import static org.junit.Assert.assertEquals;
+
+import org.apache.arrow.memory.BufferAllocator;
+import org.junit.After;
+import org.junit.Before;
+import org.junit.Test;
+
+public class TestBitVector {
+  private final static String EMPTY_SCHEMA_PATH = "";
+
+  private BufferAllocator allocator;
+
+  @Before
+  public void init() {
+    allocator = new DirtyRootAllocator(Long.MAX_VALUE, (byte) 100);
+  }
+
+  @After
+  public void terminate() throws Exception {
+    allocator.close();
+  }
+
+  @Test
+  public void testBitVectorCopyFromSafe() {
+    final int size = 20;
+    try (final BitVector src = new BitVector(EMPTY_SCHEMA_PATH, allocator);
+         final BitVector dst = new BitVector(EMPTY_SCHEMA_PATH, allocator)) {
+      src.allocateNew(size);
+      dst.allocateNew(10);
+
+      for (int i = 0; i < size; i++) {
+        src.getMutator().set(i, i % 2);
+      }
+      src.getMutator().setValueCount(size);
+
+      for (int i = 0; i < size; i++) {
+        dst.copyFromSafe(i, i, src);
+      }
+      dst.getMutator().setValueCount(size);
+
+      for (int i = 0; i < size; i++) {
+        assertEquals(src.getAccessor().getObject(i), dst.getAccessor().getObject(i));
+      }
+    }
+  }
+
+}

From 13c12c6ea5e23928268b5c2c7b962d223cca7bd4 Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Fri, 24 Mar 2017 11:54:18 +0100
Subject: [PATCH 0412/1644] ARROW-682: [Integration] Check implementations
 against themselves

This adds an additional layer of internal consistency checks

Author: Wes McKinney

Closes #433 from wesm/ARROW-682 and squashes the following commits:

b33ac7a [Wes McKinney] Run integration tests with same implementation producing and consuming to validate internal consistency
---
 integration/integration_test.py | 56 +++++++++++++++++----------------
 1 file changed, 29 insertions(+), 27 deletions(-)

diff --git a/integration/integration_test.py b/integration/integration_test.py
index 5cd63c502bd20..ec2a38d840d0b 100644
--- a/integration/integration_test.py
+++ b/integration/integration_test.py
@@ -34,10 +34,12 @@
 # Control for flakiness
 np.random.seed(12345)
 
+
 def load_version_from_pom():
     import xml.etree.ElementTree as ET
     tree = ET.parse(os.path.join(ARROW_HOME, 'java', 'pom.xml'))
-    version_tag = list(tree.getroot().findall('{http://maven.apache.org/POM/4.0.0}version'))[0]
+    tag_pattern = '{http://maven.apache.org/POM/4.0.0}version'
+    version_tag = list(tree.getroot().findall(tag_pattern))[0]
     return version_tag.text
 
 
@@ -596,32 +598,32 @@ def __init__(self, json_files, testers, debug=False):
 
     def run(self):
         for producer, consumer in itertools.product(self.testers,
                                                     self.testers):
-            if producer is consumer:
-                continue
-
-            print('-- {0} producing, {1} consuming'.format(producer.name,
-                                                           consumer.name))
-
-            for json_path in self.json_files:
-                print('Testing file {0}'.format(json_path))
-
-                # Make the random access file
-                print('-- Creating binary inputs')
-                producer_file_path = os.path.join(self.temp_dir, guid())
-                producer.json_to_file(json_path, producer_file_path)
-
-                # Validate the file
-                print('-- Validating file')
-                consumer.validate(json_path, producer_file_path)
-
-                print('-- Validating stream')
-                producer_stream_path = os.path.join(self.temp_dir, guid())
-                consumer_file_path = os.path.join(self.temp_dir, guid())
-                producer.file_to_stream(producer_file_path,
-                                        producer_stream_path)
-                consumer.stream_to_file(producer_stream_path,
-                                        consumer_file_path)
-                consumer.validate(json_path, consumer_file_path)
+            self._compare_implementations(producer, consumer)
+
+    def _compare_implementations(self, producer, consumer):
+        print('-- {0} producing, {1} consuming'.format(producer.name,
+                                                       consumer.name))
+
+        for json_path in self.json_files:
+            print('Testing file {0}'.format(json_path))
+
+            # Make the random access file
+            print('-- Creating binary inputs')
+            producer_file_path = os.path.join(self.temp_dir, guid())
+            producer.json_to_file(json_path, producer_file_path)
+
+            # Validate the file
+            print('-- Validating file')
+            consumer.validate(json_path, producer_file_path)
+
+            print('-- Validating stream')
+            producer_stream_path = os.path.join(self.temp_dir, guid())
+            consumer_file_path = os.path.join(self.temp_dir, guid())
+            producer.file_to_stream(producer_file_path,
+                                    producer_stream_path)
+            consumer.stream_to_file(producer_stream_path,
+                                    consumer_file_path)
+            consumer.validate(json_path, consumer_file_path)
 
 
 class Tester(object):

From bc185a41a239181d255e72bf255a354da4f5dae6 Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Fri, 24 Mar 2017 11:58:30 +0100
Subject: [PATCH 0413/1644] ARROW-595: [Python] Set schema attribute on
 StreamReader

Author: Wes McKinney

Closes #434 from wesm/ARROW-595 and squashes the following commits:

484cc7b [Wes McKinney] Set schema attribute on StreamReader
---
 python/pyarrow/io.pyx            | 4 ++--
 python/pyarrow/tests/test_ipc.py | 2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx
index 17b43dedb0a5f..72e0e0ff01512 100644
--- a/python/pyarrow/io.pyx
+++ b/python/pyarrow/io.pyx
@@ -933,8 +933,8 @@ cdef class _StreamReader:
         with nogil:
             check_status(CStreamReader.Open(in_stream, &self.reader))
 
-        schema = Schema()
-        schema.init_schema(self.reader.get().schema())
+        self.schema = Schema()
+        self.schema.init_schema(self.reader.get().schema())
 
     def get_next_batch(self):
         """
diff --git a/python/pyarrow/tests/test_ipc.py b/python/pyarrow/tests/test_ipc.py
index 665a63b6d5a38..4c9dad1b840a8 100644
--- a/python/pyarrow/tests/test_ipc.py
+++ b/python/pyarrow/tests/test_ipc.py
@@ -104,6 +104,8 @@ def test_simple_roundtrip(self):
         file_contents = self._get_source()
         reader = pa.StreamReader(file_contents)
 
+        assert reader.schema.equals(batches[0].schema)
+
         total = 0
         for i, next_batch in enumerate(reader):
             assert next_batch.equals(batches[i])

From 016a209815465c3161ac357a316efa55061da983 Mon Sep 17 00:00:00 2001
From: Kouhei Sutou
Date: Fri, 24 Mar 2017 11:30:27 -0400
Subject: [PATCH 0414/1644] ARROW-706: [GLib] Add package install document

Author: Kouhei Sutou

Closes #436 from kou/glib-add-package-install and squashes the following commits:

fa8dc04 [Kouhei Sutou] [GLib] Add a note about "unofficial"
d23c34d [Kouhei Sutou] [GLib] Add package install document
---
 c_glib/README.md | 70 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 69 insertions(+), 1 deletion(-)

diff --git a/c_glib/README.md b/c_glib/README.md
index 84027bf2cb3db..95cc9a65c5bd8 100644
--- a/c_glib/README.md
+++ b/c_glib/README.md
@@ -42,9 +42,77 @@ gobject-introspection gem based bindings.
 
 ## Install
 
+You can install Arrow GLib from packages or build it yourself. It's
+recommended that you use packages.
+
+Note that the packages are "unofficial". "Official" packages will be
+released in the future.
+
 ### Package
 
-TODO
+The following platforms are supported:
+
+  * Debian GNU/Linux Jessie
+  * Ubuntu 16.04 LTS
+  * Ubuntu 16.10
+  * CentOS 7
+
+You can send feedback about the packages to
+https://github.com/kou/arrow-packages.
+
+#### Debian GNU/Linux Jessie
+
+You need to add the following apt-lines to
+`/etc/apt/sources.list.d/groonga.list`:
+
+```text
+deb http://packages.groonga.org/debian/ jessie main
+deb-src http://packages.groonga.org/debian/ jessie main
+```
+
+Then you need to run the following command lines:
+
+```text
+% sudo apt update
+% sudo apt install -y --allow-unauthenticated groonga-keyring
+% sudo apt update
+```
+
+Now you can install Arrow GLib packages:
+
+```text
+% sudo apt install -y libarrow-glib-dev
+```
+
+#### Ubuntu 16.04 LTS and Ubuntu 16.10
+
+You need to add an APT repository:
+
+```text
+% sudo apt install -y software-properties-common
+% sudo add-apt-repository -y ppa:groonga/ppa
+% sudo apt update
+```
+
+Now you can install Arrow GLib packages:
+
+```text
+% sudo apt install -y libarrow-glib-dev
+```
+
+#### CentOS 7
+
+You need to add a Yum repository:
+
+```text
+% sudo yum install -y http://packages.groonga.org/centos/groonga-release-1.2.0-1.noarch.rpm
+```
+
+Now you can install Arrow GLib packages:
+
+```text
+% sudo yum install -y --enablerepo=epel arrow-glib-devel
+```
 
 ### Build

From dcaa8e5d7ef1353c657e016bf271495042825a91 Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Fri, 24 Mar 2017 11:53:42 -0400
Subject: [PATCH 0415/1644] ARROW-550: [Format] Draft experimental Tensor
 flatbuffer message type

Tensor-like data occurs very frequently in scientific computing and machine learning applications that are mostly implemented in C and C++. Arrow's C++ memory management and shared memory utilities can help serve these use cases for zero-copy data transfer to other tensor-like data structures (like NumPy ndarrays, or the tensor objects used in machine learning libraries like TensorFlow or Torch). The Tensor data structure is loosely modeled after NumPy's ndarray object and TensorFlow's tensor protocol buffers type (https://github.com/tensorflow/tensorflow/blob/754048a0453a04a761e112ae5d99c149eb9910dd/tensorflow/core/framework/tensor.proto).

cc @pcmoritz @robertnishihara @sylvaincorlay @JohanMabille

Author: Wes McKinney

Closes #435 from wesm/ARROW-550 and squashes the following commits:

afac56e [Wes McKinney] Change TensorOrder enum to byte
249a9d5 [Wes McKinney] Replace strides with TensorOrder enum for row major / column major
d7d6407 [Wes McKinney] Draft Tensor flatbuffer type
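
Editor's sketch: for orientation, this is roughly what producing one of these messages could look like with the C++ classes `flatc` generates from Tensor.fbs. The `Create*` helpers, `*_generated.h` header, and enum spellings follow flatbuffers' standard codegen conventions rather than anything shown in this patch, and the data-buffer location is a made-up example:

```cpp
#include <cstdint>
#include <vector>

#include "flatbuffers/flatbuffers.h"
#include "Message_generated.h"  // pulls in Schema_generated.h and Tensor_generated.h

namespace flatbuf = org::apache::arrow::flatbuf;

// Serialize metadata for a 2x3 row-major float64 tensor (sketch only).
std::vector<uint8_t> SerializeTensorMetadata() {
  flatbuffers::FlatBufferBuilder fbb;

  // Named dimensions: shape [2, 3]
  std::vector<flatbuffers::Offset<flatbuf::TensorDim>> dims = {
      flatbuf::CreateTensorDim(fbb, 2, fbb.CreateString("row")),
      flatbuf::CreateTensorDim(fbb, 3, fbb.CreateString("col"))};

  // Value type: float64. `type` is the Type union from Schema.fbs, so both
  // the union tag and the table offset are supplied.
  auto value_type = flatbuf::CreateFloatingPoint(fbb, flatbuf::Precision_DOUBLE);

  // The tensor body lives outside this metadata; Buffer points at it
  // (offset 0 in some shared memory region, for this example).
  flatbuf::Buffer data(/*page=*/0, /*offset=*/0, /*length=*/2 * 3 * sizeof(double));

  auto tensor = flatbuf::CreateTensor(fbb, flatbuf::Type_FloatingPoint,
      value_type.Union(), fbb.CreateVector(dims), flatbuf::TensorOrder_ROW_MAJOR,
      &data);
  fbb.Finish(tensor);

  return {fbb.GetBufferPointer(), fbb.GetBufferPointer() + fbb.GetSize()};
}
```
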
include "Schema.fbs"; +include "Tensor.fbs"; namespace org.apache.arrow.flatbuf; @@ -82,7 +83,7 @@ table DictionaryBatch { /// which may include experimental metadata types. For maximum compatibility, /// it is best to send data using RecordBatch union MessageHeader { - Schema, DictionaryBatch, RecordBatch + Schema, DictionaryBatch, RecordBatch, Tensor } table Message { diff --git a/format/Tensor.fbs b/format/Tensor.fbs new file mode 100644 index 0000000000000..bc5b6d1289b2f --- /dev/null +++ b/format/Tensor.fbs @@ -0,0 +1,60 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +/// EXPERIMENTAL: Metadata for n-dimensional arrays, aka "tensors" or +/// "ndarrays". Arrow implementations in general are not required to implement +/// this type + +include "Schema.fbs"; + +namespace org.apache.arrow.flatbuf; + +/// Shape data for a single axis in a tensor +table TensorDim { + /// Length of dimension + size: long; + + /// Name of the dimension, optional + name: string; +} + +enum TensorOrder : byte { + /// Higher dimensions vary first when traversing data in byte-contiguous + /// order, aka "C order" + ROW_MAJOR, + + /// Lower dimensions vary first when traversing data in byte-contiguous + /// order, aka "Fortran order" + COLUMN_MAJOR +} + +table Tensor { + /// The type of data contained in a value cell. Currently only fixed-width + /// value types are supported, no strings or nested types + type: Type; + + /// The dimensions of the tensor, optionally named + shape: [TensorDim]; + + /// The memory order of the tensor's data + order: TensorOrder; + + /// The location and size of the tensor's data + data: Buffer; +} + +root_type Tensor; diff --git a/java/format/pom.xml b/java/format/pom.xml index e7a58a4172fe2..98a113a30cf78 100644 --- a/java/format/pom.xml +++ b/java/format/pom.xml @@ -110,8 +110,9 @@ -o ${flatc.generated.files} ../../format/Schema.fbs - ../../format/Message.fbs + ../../format/Tensor.fbs ../../format/File.fbs + ../../format/Message.fbs From 5ad498833fe6cd5519b8d652d4bf620add5a7eed Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 24 Mar 2017 14:10:42 -0400 Subject: [PATCH 0416/1644] ARROW-708: [C++] Simplify metadata APIs to all use the Message class, perf analysis This doesn't produce a meaningful perf improvement, but it does remove a fair amount of code which is nice. Here is an interactive FlameGraph SVG: https://www.dropbox.com/s/kp8i5r3j7i0em02/ipc-perf-20170324.svg?dl=0 ![screenshot from 2017-03-24 12 52 54](https://cloud.githubusercontent.com/assets/329591/24304760/f283960a-1090-11e7-9bc5-4cb26f7ca0ae.png) So it appears that out of the few hundred nanoseconds spent constructing each Array object, the time is mostly spent in object constructors. 
One thing that shows up is the RecordBatch constructor which is spending a bunch of time copying the `vector>` passed, so I added a move constructor. Author: Wes McKinney Closes #437 from wesm/record-batch-read-perf and squashes the following commits: 95fdbc7 [Wes McKinney] Add RecordBatch constructor with rvalue-reference for the columns 793b3be [Wes McKinney] Inline SliceBuffer 212f17f [Wes McKinney] Benchmark in nanoseconds a295aae [Wes McKinney] Remove record batch / dictionary PIMPL interfaces, handle flatbuffer details internally --- cpp/src/arrow/buffer.cc | 7 - cpp/src/arrow/buffer.h | 6 +- cpp/src/arrow/ipc/ipc-read-write-benchmark.cc | 2 - cpp/src/arrow/ipc/ipc-read-write-test.cc | 9 +- cpp/src/arrow/ipc/metadata.cc | 123 +----------------- cpp/src/arrow/ipc/metadata.h | 41 +----- cpp/src/arrow/ipc/reader.cc | 85 +++++++----- cpp/src/arrow/ipc/reader.h | 16 +-- cpp/src/arrow/table.cc | 4 + cpp/src/arrow/table.h | 3 + 10 files changed, 76 insertions(+), 220 deletions(-) diff --git a/cpp/src/arrow/buffer.cc b/cpp/src/arrow/buffer.cc index a0b78ac0b9f20..28edf5e824c1f 100644 --- a/cpp/src/arrow/buffer.cc +++ b/cpp/src/arrow/buffer.cc @@ -68,13 +68,6 @@ bool Buffer::Equals(const Buffer& other) const { static_cast(size_)))); } -std::shared_ptr SliceBuffer( - const std::shared_ptr& buffer, int64_t offset, int64_t length) { - DCHECK_LE(offset, buffer->size()); - DCHECK_LE(length, buffer->size() - offset); - return std::make_shared(buffer, offset, length); -} - std::shared_ptr MutableBuffer::GetImmutableView() { return std::make_shared(this->get_shared_ptr(), 0, size()); } diff --git a/cpp/src/arrow/buffer.h b/cpp/src/arrow/buffer.h index 70c16a2dafc86..449bb537d9caa 100644 --- a/cpp/src/arrow/buffer.h +++ b/cpp/src/arrow/buffer.h @@ -96,8 +96,10 @@ class ARROW_EXPORT Buffer : public std::enable_shared_from_this { /// Construct a view on passed buffer at the indicated offset and length. This /// function cannot fail and does not error checking (except in debug builds) -std::shared_ptr ARROW_EXPORT SliceBuffer( - const std::shared_ptr& buffer, int64_t offset, int64_t length); +static inline std::shared_ptr SliceBuffer( + const std::shared_ptr& buffer, int64_t offset, int64_t length) { + return std::make_shared(buffer, offset, length); +} /// A Buffer whose contents can be mutated. May or may not own its data. 
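
Editor's sketch: the copy-versus-move point generalizes beyond Arrow; here is a minimal standalone illustration (`RecordBatchLike` is a stand-in class invented for the example, not Arrow's `RecordBatch`):

```cpp
#include <memory>
#include <utility>
#include <vector>

struct Array {};  // stand-in for arrow::Array

class RecordBatchLike {
 public:
  // Copying a vector<shared_ptr<T>> bumps (and later drops) every element's
  // atomic refcount: O(n) refcount traffic plus a heap allocation.
  explicit RecordBatchLike(const std::vector<std::shared_ptr<Array>>& columns)
      : columns_(columns) {}

  // Moving steals the vector's heap allocation: O(1), no refcount traffic,
  // and leaves the caller's vector empty.
  explicit RecordBatchLike(std::vector<std::shared_ptr<Array>>&& columns)
      : columns_(std::move(columns)) {}

 private:
  std::vector<std::shared_ptr<Array>> columns_;
};

int main() {
  std::vector<std::shared_ptr<Array>> cols(64, std::make_shared<Array>());
  RecordBatchLike batch(std::move(cols));  // selects the && overload
  return 0;
}
```
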
---
 cpp/src/arrow/buffer.cc                       |   7 --
 cpp/src/arrow/buffer.h                        |   6 +-
 cpp/src/arrow/ipc/ipc-read-write-benchmark.cc |   2 -
 cpp/src/arrow/ipc/ipc-read-write-test.cc      |   9 +-
 cpp/src/arrow/ipc/metadata.cc                 | 123 +-----------------
 cpp/src/arrow/ipc/metadata.h                  |  41 +-----
 cpp/src/arrow/ipc/reader.cc                   |  85 +++++++-----
 cpp/src/arrow/ipc/reader.h                    |  16 +--
 cpp/src/arrow/table.cc                        |   4 +
 cpp/src/arrow/table.h                         |   3 +
 10 files changed, 76 insertions(+), 220 deletions(-)

diff --git a/cpp/src/arrow/buffer.cc b/cpp/src/arrow/buffer.cc
index a0b78ac0b9f20..28edf5e824c1f 100644
--- a/cpp/src/arrow/buffer.cc
+++ b/cpp/src/arrow/buffer.cc
@@ -68,13 +68,6 @@ bool Buffer::Equals(const Buffer& other) const {
                        static_cast<size_t>(size_))));
 }
 
-std::shared_ptr<Buffer> SliceBuffer(
-    const std::shared_ptr<Buffer>& buffer, int64_t offset, int64_t length) {
-  DCHECK_LE(offset, buffer->size());
-  DCHECK_LE(length, buffer->size() - offset);
-  return std::make_shared<Buffer>(buffer, offset, length);
-}
-
 std::shared_ptr<Buffer> MutableBuffer::GetImmutableView() {
   return std::make_shared<Buffer>(this->get_shared_ptr(), 0, size());
 }
diff --git a/cpp/src/arrow/buffer.h b/cpp/src/arrow/buffer.h
index 70c16a2dafc86..449bb537d9caa 100644
--- a/cpp/src/arrow/buffer.h
+++ b/cpp/src/arrow/buffer.h
@@ -96,8 +96,10 @@ class ARROW_EXPORT Buffer : public std::enable_shared_from_this<Buffer> {
 
 /// Construct a view on passed buffer at the indicated offset and length. This
 /// function cannot fail and does no error checking (except in debug builds)
-std::shared_ptr<Buffer> ARROW_EXPORT SliceBuffer(
-    const std::shared_ptr<Buffer>& buffer, int64_t offset, int64_t length);
+static inline std::shared_ptr<Buffer> SliceBuffer(
+    const std::shared_ptr<Buffer>& buffer, int64_t offset, int64_t length) {
+  return std::make_shared<Buffer>(buffer, offset, length);
+}
 
 /// A Buffer whose contents can be mutated. May or may not own its data.
 class ARROW_EXPORT MutableBuffer : public Buffer {
diff --git a/cpp/src/arrow/ipc/ipc-read-write-benchmark.cc b/cpp/src/arrow/ipc/ipc-read-write-benchmark.cc
index e27e5136a0d5a..1aecdbc633190 100644
--- a/cpp/src/arrow/ipc/ipc-read-write-benchmark.cc
+++ b/cpp/src/arrow/ipc/ipc-read-write-benchmark.cc
@@ -121,14 +121,12 @@ BENCHMARK(BM_WriteRecordBatch)
     ->RangeMultiplier(4)
     ->Range(1, 1 << 13)
     ->MinTime(1.0)
-    ->Unit(benchmark::kMicrosecond)
     ->UseRealTime();
 
 BENCHMARK(BM_ReadRecordBatch)
     ->RangeMultiplier(4)
     ->Range(1, 1 << 13)
     ->MinTime(1.0)
-    ->Unit(benchmark::kMicrosecond)
     ->UseRealTime();
 
 }  // namespace arrow
diff --git a/cpp/src/arrow/ipc/ipc-read-write-test.cc b/cpp/src/arrow/ipc/ipc-read-write-test.cc
index 6919aebbe8d6d..086cc68176783 100644
--- a/cpp/src/arrow/ipc/ipc-read-write-test.cc
+++ b/cpp/src/arrow/ipc/ipc-read-write-test.cc
@@ -140,7 +140,6 @@ class IpcTestFixture : public io::MemoryMapFixture {
 
     std::shared_ptr<Message> message;
     RETURN_NOT_OK(ReadMessage(0, metadata_length, mmap_.get(), &message));
-    auto metadata = std::make_shared<RecordBatchMetadata>(message);
 
     // The buffer offsets start at 0, so we must construct a
     // RandomAccessFile according to that frame of reference
@@ -148,7 +147,7 @@ class IpcTestFixture : public io::MemoryMapFixture {
     RETURN_NOT_OK(mmap_->ReadAt(metadata_length, body_length, &buffer_payload));
 
     io::BufferReader buffer_reader(buffer_payload);
-    return ReadRecordBatch(*metadata, batch.schema(), &buffer_reader, batch_result);
+    return ReadRecordBatch(*message, batch.schema(), &buffer_reader, batch_result);
   }
 
   Status DoLargeRoundTrip(
@@ -370,7 +369,6 @@ TEST_F(RecursionLimits, ReadLimit) {
 
   std::shared_ptr<Message> message;
  ASSERT_OK(ReadMessage(0, metadata_length, mmap_.get(), &message));
-  auto metadata = std::make_shared<RecordBatchMetadata>(message);
 
   std::shared_ptr<Buffer> payload;
   ASSERT_OK(mmap_->ReadAt(metadata_length, body_length, &payload));
@@ -378,7 +376,7 @@ TEST_F(RecursionLimits, ReadLimit) {
   io::BufferReader reader(payload);
 
   std::shared_ptr<RecordBatch> result;
-  ASSERT_RAISES(Invalid, ReadRecordBatch(*metadata, schema, &reader, &result));
+  ASSERT_RAISES(Invalid, ReadRecordBatch(*message, schema, &reader, &result));
 }
 
 TEST_F(RecursionLimits, StressLimit) {
@@ -392,7 +390,6 @@ TEST_F(RecursionLimits, StressLimit) {
 
     std::shared_ptr<Message> message;
     ASSERT_OK(ReadMessage(0, metadata_length, mmap_.get(), &message));
-    auto metadata = std::make_shared<RecordBatchMetadata>(message);
 
     std::shared_ptr<Buffer> payload;
     ASSERT_OK(mmap_->ReadAt(metadata_length, body_length, &payload));
@@ -400,7 +397,7 @@ TEST_F(RecursionLimits, StressLimit) {
     io::BufferReader reader(payload);
 
     std::shared_ptr<RecordBatch> result;
-    ASSERT_OK(ReadRecordBatch(*metadata, schema, recursion_depth + 1, &reader, &result));
+    ASSERT_OK(ReadRecordBatch(*message, schema, recursion_depth + 1, &reader, &result));
     *it_works = result->Equals(*batch);
   };
 
diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc
index b10ccec9e7c4e..14cb627982343 100644
--- a/cpp/src/arrow/ipc/metadata.cc
+++ b/cpp/src/arrow/ipc/metadata.cc
@@ -770,6 +770,10 @@ int64_t Message::body_length() const {
   return impl_->body_length();
 }
 
+const void* Message::header() const {
+  return impl_->header();
+}
+
 // ----------------------------------------------------------------------
 // SchemaMetadata
 
@@ -858,125 +862,6 @@ Status SchemaMetadata::GetSchema(
   return Status::OK();
 }
 
-// ----------------------------------------------------------------------
-// RecordBatchMetadata
-
-class RecordBatchMetadata::RecordBatchMetadataImpl : public MessageHolder {
- public:
-  explicit RecordBatchMetadataImpl(const void* batch)
-      : batch_(static_cast<const flatbuf::RecordBatch*>(batch)) {
-    nodes_ = batch_->nodes();
-    buffers_ = batch_->buffers();
-  }
-
-  const flatbuf::FieldNode* field(int i) const { return nodes_->Get(i); }
-
-  const flatbuf::Buffer* buffer(int i) const { return buffers_->Get(i); }
-
-  int64_t length() const { return batch_->length(); }
-
-  int num_buffers() const { return batch_->buffers()->size(); }
-
-  int num_fields() const { return batch_->nodes()->size(); }
-
- private:
-  const flatbuf::RecordBatch* batch_;
-  const flatbuffers::Vector<const flatbuf::FieldNode*>* nodes_;
-  const flatbuffers::Vector<const flatbuf::Buffer*>* buffers_;
-};
-
-RecordBatchMetadata::RecordBatchMetadata(const std::shared_ptr<Message>& message)
-    : RecordBatchMetadata(message->impl_->header()) {
-  impl_->set_message(message);
-}
-
-RecordBatchMetadata::RecordBatchMetadata(const void* header) {
-  impl_.reset(new RecordBatchMetadataImpl(header));
-}
-
-RecordBatchMetadata::RecordBatchMetadata(
-    const std::shared_ptr<Buffer>& buffer, int64_t offset)
-    : RecordBatchMetadata(buffer->data() + offset) {
-  // Preserve ownership
-  impl_->set_buffer(buffer);
-}
-
-RecordBatchMetadata::~RecordBatchMetadata() {}
-
-// TODO(wesm): Copying the flatbuffer data isn't great, but this will do for
-// now
-FieldMetadata RecordBatchMetadata::field(int i) const {
-  const flatbuf::FieldNode* node = impl_->field(i);
-
-  FieldMetadata result;
-  result.length = node->length();
-  result.null_count = node->null_count();
-  result.offset = 0;
-  return result;
-}
-
-BufferMetadata RecordBatchMetadata::buffer(int i) const {
-  const flatbuf::Buffer* buffer = impl_->buffer(i);
-
-  BufferMetadata result;
-  result.page = buffer->page();
-  result.offset = buffer->offset();
-  result.length = buffer->length();
-  return result;
-}
-
-int64_t RecordBatchMetadata::length() const {
-  return impl_->length();
-}
-
-int RecordBatchMetadata::num_buffers() const {
-  return impl_->num_buffers();
-}
-
-int RecordBatchMetadata::num_fields() const {
-  return impl_->num_fields();
-}
-
-// ----------------------------------------------------------------------
-// DictionaryBatchMetadata
-
-class DictionaryBatchMetadata::DictionaryBatchMetadataImpl {
- public:
-  explicit DictionaryBatchMetadataImpl(const void* dictionary)
-      : metadata_(static_cast<const flatbuf::DictionaryBatch*>(dictionary)) {
-    record_batch_.reset(new RecordBatchMetadata(metadata_->data()));
-  }
-
-  int64_t id() const { return metadata_->id(); }
-  const RecordBatchMetadata& record_batch() const { return *record_batch_; }
-
-  void set_message(const std::shared_ptr<Message>& message) { message_ = message; }
-
- private:
-  const flatbuf::DictionaryBatch* metadata_;
-
-  std::unique_ptr<RecordBatchMetadata> record_batch_;
-
-  // Parent, owns the flatbuffer data
-  std::shared_ptr<Message> message_;
-};
-
-DictionaryBatchMetadata::DictionaryBatchMetadata(
-    const std::shared_ptr<Message>& message) {
-  impl_.reset(new DictionaryBatchMetadataImpl(message->impl_->header()));
-  impl_->set_message(message);
-}
-
-DictionaryBatchMetadata::~DictionaryBatchMetadata() {}
-
-int64_t DictionaryBatchMetadata::id() const {
-  return impl_->id();
-}
-
-const RecordBatchMetadata& DictionaryBatchMetadata::record_batch() const {
-  return impl_->record_batch();
-}
-
 // ----------------------------------------------------------------------
 // Conveniences
 
diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h
index dc07c7a1bd9b7..6e903c0a18ef6 100644
--- a/cpp/src/arrow/ipc/metadata.h
+++ b/cpp/src/arrow/ipc/metadata.h
@@ -138,44 +138,6 @@ struct ARROW_EXPORT BufferMetadata {
   int64_t length;
 };
 
-// Container for serialized record batch metadata contained in an IPC message
-class ARROW_EXPORT RecordBatchMetadata {
- public:
-  explicit RecordBatchMetadata(const void* header);
-  explicit RecordBatchMetadata(const std::shared_ptr<Message>& message);
-  RecordBatchMetadata(const std::shared_ptr<Buffer>& buffer, int64_t offset);
-
-  ~RecordBatchMetadata();
-
-  FieldMetadata field(int i) const;
-  BufferMetadata buffer(int i) const;
-
-  int64_t length() const;
-  int num_buffers() const;
-  int num_fields() const;
-
- private:
-  class RecordBatchMetadataImpl;
-  std::unique_ptr<RecordBatchMetadataImpl> impl_;
-
-  DISALLOW_COPY_AND_ASSIGN(RecordBatchMetadata);
-};
-
-class ARROW_EXPORT DictionaryBatchMetadata {
- public:
-  explicit DictionaryBatchMetadata(const std::shared_ptr<Message>& message);
-  ~DictionaryBatchMetadata();
-
-  int64_t id() const;
-  const RecordBatchMetadata& record_batch() const;
-
- private:
-  class DictionaryBatchMetadataImpl;
-  std::unique_ptr<DictionaryBatchMetadataImpl> impl_;
-
-  DISALLOW_COPY_AND_ASSIGN(DictionaryBatchMetadata);
-};
-
 class ARROW_EXPORT Message {
  public:
   enum Type { NONE, SCHEMA, DICTIONARY_BATCH, RECORD_BATCH };
@@ -187,11 +149,12 @@ class ARROW_EXPORT Message {
 
   Type type() const;
 
+  const void* header() const;
+
  private:
   Message(const std::shared_ptr<Buffer>& buffer, int64_t offset);
 
   friend class DictionaryBatchMetadata;
-  friend class RecordBatchMetadata;
   friend class SchemaMetadata;
 
   // Hide serialization details from user API
diff --git a/cpp/src/arrow/ipc/reader.cc b/cpp/src/arrow/ipc/reader.cc
index 71ba951111999..83e03aa0b36b4 100644
--- a/cpp/src/arrow/ipc/reader.cc
+++ b/cpp/src/arrow/ipc/reader.cc
@@ -46,36 +46,41 @@ namespace ipc {
 
 class IpcComponentSource : public ArrayComponentSource {
  public:
-  IpcComponentSource(const RecordBatchMetadata& metadata, io::RandomAccessFile* file)
+  IpcComponentSource(const flatbuf::RecordBatch* metadata, io::RandomAccessFile* file)
      : metadata_(metadata), file_(file) {}
 
   Status GetBuffer(int buffer_index, std::shared_ptr<Buffer>* out) override {
-    BufferMetadata buffer_meta = metadata_.buffer(buffer_index);
-    if (buffer_meta.length == 0) {
+    const flatbuf::Buffer* buffer = metadata_->buffers()->Get(buffer_index);
+
+    if (buffer->length() == 0) {
       *out = nullptr;
       return Status::OK();
     } else {
-      return file_->ReadAt(buffer_meta.offset, buffer_meta.length, out);
+      return file_->ReadAt(buffer->offset(), buffer->length(), out);
     }
   }
 
-  Status GetFieldMetadata(int field_index, FieldMetadata* metadata) override {
+  Status GetFieldMetadata(int field_index, FieldMetadata* field) override {
+    auto nodes = metadata_->nodes();
     // pop off a field
-    if (field_index >= metadata_.num_fields()) {
+    if (field_index >= static_cast<int>(nodes->size())) {
       return Status::Invalid("Ran out of field metadata, likely malformed");
     }
-    *metadata = metadata_.field(field_index);
+    const flatbuf::FieldNode* node = nodes->Get(field_index);
+
+    field->length = node->length();
+    field->null_count = node->null_count();
+    field->offset = 0;
     return Status::OK();
   }
 
 private:
-  const RecordBatchMetadata& metadata_;
+  const flatbuf::RecordBatch* metadata_;
   io::RandomAccessFile* file_;
 };
 
-Status ReadRecordBatch(const RecordBatchMetadata& metadata,
-    const std::shared_ptr<Schema>& schema, io::RandomAccessFile* file,
-    std::shared_ptr<RecordBatch>* out) {
+Status ReadRecordBatch(const Message& metadata, const std::shared_ptr<Schema>& schema,
+    io::RandomAccessFile* file, std::shared_ptr<RecordBatch>* out) {
   return ReadRecordBatch(metadata, schema, kMaxNestingDepth, file, out);
 }
 
@@ -94,22 +99,32 @@ static Status LoadRecordBatchFromSource(const std::shared_ptr<Schema>& schema,
     RETURN_NOT_OK(LoadArray(schema->field(i)->type, &context, &arrays[i]));
   }
 
-  *out = std::make_shared<RecordBatch>(schema, num_rows, arrays);
+  *out = std::make_shared<RecordBatch>(schema, num_rows, std::move(arrays));
   return Status::OK();
 }
 
-Status ReadRecordBatch(const RecordBatchMetadata& metadata,
+static inline Status ReadRecordBatch(const flatbuf::RecordBatch* metadata,
     const std::shared_ptr<Schema>& schema, int max_recursion_depth,
     io::RandomAccessFile* file, std::shared_ptr<RecordBatch>* out) {
   IpcComponentSource source(metadata, file);
   return LoadRecordBatchFromSource(
-      schema, metadata.length(), max_recursion_depth, &source, out);
+      schema, metadata->length(), max_recursion_depth, &source, out);
 }
 
-Status ReadDictionary(const DictionaryBatchMetadata& metadata,
-    const DictionaryTypeMap& dictionary_types, io::RandomAccessFile* file,
-    std::shared_ptr<Array>* out) {
-  int64_t id = metadata.id();
+Status ReadRecordBatch(const Message& metadata, const std::shared_ptr<Schema>& schema,
+    int max_recursion_depth, io::RandomAccessFile* file,
+    std::shared_ptr<RecordBatch>* out) {
+  DCHECK_EQ(metadata.type(), Message::RECORD_BATCH);
+  auto batch = reinterpret_cast<const flatbuf::RecordBatch*>(metadata.header());
+  return ReadRecordBatch(batch, schema, max_recursion_depth, file, out);
+}
+
+Status ReadDictionary(const Message& metadata, const DictionaryTypeMap& dictionary_types,
+    io::RandomAccessFile* file, int64_t* dictionary_id, std::shared_ptr<Array>* out) {
+  auto dictionary_batch =
+      reinterpret_cast<const flatbuf::DictionaryBatch*>(metadata.header());
+
+  int64_t id = *dictionary_id = dictionary_batch->id();
   auto it = dictionary_types.find(id);
   if (it == dictionary_types.end()) {
     std::stringstream ss;
@@ -124,7 +139,10 @@ Status ReadDictionary(const Message& metadata,
 
   // The dictionary is embedded in a record batch with a single column
   std::shared_ptr<RecordBatch> batch;
-  RETURN_NOT_OK(ReadRecordBatch(metadata.record_batch(), dummy_schema, file, &batch));
+  auto batch_meta =
+      reinterpret_cast<const flatbuf::RecordBatch*>(dictionary_batch->data());
+  RETURN_NOT_OK(
+      ReadRecordBatch(batch_meta, dummy_schema, kMaxNestingDepth, file, &batch));
 
   if (batch->num_columns() != 1) {
     return Status::Invalid("Dictionary record batch must only contain one field");
@@ -211,15 +229,14 @@ class StreamReader::StreamReaderImpl {
     std::shared_ptr<Message> message;
     RETURN_NOT_OK(ReadNextMessage(Message::DICTIONARY_BATCH, &message));
 
-    DictionaryBatchMetadata metadata(message);
-
     std::shared_ptr<Buffer> batch_body;
     RETURN_NOT_OK(ReadExact(message->body_length(), &batch_body));
     io::BufferReader reader(batch_body);
 
     std::shared_ptr<Array> dictionary;
-    RETURN_NOT_OK(ReadDictionary(metadata, dictionary_types_, &reader, &dictionary));
-    return dictionary_memo_.AddDictionary(metadata.id(), dictionary);
+    int64_t id;
+    RETURN_NOT_OK(ReadDictionary(*message, dictionary_types_, &reader, &id, &dictionary));
+    return dictionary_memo_.AddDictionary(id, dictionary);
   }
 
   Status ReadSchema() {
@@ -249,12 +266,10 @@ class StreamReader::StreamReaderImpl {
       return Status::OK();
     }
 
-    RecordBatchMetadata batch_metadata(message);
-
     std::shared_ptr<Buffer> batch_body;
     RETURN_NOT_OK(ReadExact(message->body_length(), &batch_body));
     io::BufferReader reader(batch_body);
-    return ReadRecordBatch(batch_metadata, schema_, &reader, batch);
+    return ReadRecordBatch(*message, schema_, &reader, batch);
   }
 
   std::shared_ptr<Schema> schema() const { return schema_; }
@@ -365,7 +380,6 @@ class FileReader::FileReaderImpl {
     std::shared_ptr<Message> message;
     RETURN_NOT_OK(
         ReadMessage(block.offset, block.metadata_length, file_.get(), &message));
-    auto metadata = std::make_shared<RecordBatchMetadata>(message);
 
     // TODO(wesm): ARROW-388 -- the buffer frame of reference is 0 (see
     // ARROW-384).
@@ -373,7 +387,7 @@ class FileReader::FileReaderImpl {
     RETURN_NOT_OK(file_->Read(block.body_length, &buffer_block));
     io::BufferReader reader(buffer_block);
 
-    return ReadRecordBatch(*metadata, schema_, &reader, batch);
+    return ReadRecordBatch(*message, schema_, &reader, batch);
   }
 
   Status ReadSchema() {
@@ -386,9 +400,8 @@ class FileReader::FileReaderImpl {
       RETURN_NOT_OK(
          ReadMessage(block.offset, block.metadata_length, file_.get(), &message));
 
-      // TODO(wesm): ARROW-577: This code is duplicated, can be fixed with a more
-      // invasive refactor
-      DictionaryBatchMetadata metadata(message);
+      // TODO(wesm): ARROW-577: This code is a bit duplicated, can be fixed
+      // with a more invasive refactor
 
      // TODO(wesm): ARROW-388 -- the buffer frame of reference is 0 (see
      // ARROW-384).
@@ -397,8 +410,10 @@ class FileReader::FileReaderImpl {
       io::BufferReader reader(buffer_block);
 
       std::shared_ptr<Array> dictionary;
-      RETURN_NOT_OK(ReadDictionary(metadata, dictionary_fields_, &reader, &dictionary));
-      RETURN_NOT_OK(dictionary_memo_->AddDictionary(metadata.id(), dictionary));
+      int64_t dictionary_id;
+      RETURN_NOT_OK(ReadDictionary(
+          *message, dictionary_fields_, &reader, &dictionary_id, &dictionary));
+      RETURN_NOT_OK(dictionary_memo_->AddDictionary(dictionary_id, dictionary));
     }
 
     // Get the schema
@@ -480,15 +495,13 @@ Status ReadRecordBatch(const std::shared_ptr<Schema>& schema, int64_t offset,
   RETURN_NOT_OK(file->Read(flatbuffer_size, &buffer));
   RETURN_NOT_OK(Message::Open(buffer, 0, &message));
 
-  RecordBatchMetadata metadata(message);
-
   // TODO(ARROW-388): The buffer offsets start at 0, so we must construct a
   // RandomAccessFile according to that frame of reference
   std::shared_ptr<Buffer> buffer_payload;
   RETURN_NOT_OK(file->Read(message->body_length(), &buffer_payload));
   io::BufferReader buffer_reader(buffer_payload);
 
-  return ReadRecordBatch(metadata, schema, kMaxNestingDepth, &buffer_reader, out);
+  return ReadRecordBatch(*message, schema, kMaxNestingDepth, &buffer_reader, out);
 }
 
 }  // namespace ipc
diff --git a/cpp/src/arrow/ipc/reader.h b/cpp/src/arrow/ipc/reader.h
index ffd0a111d604b..6d9e6ca7b0ab7 100644
--- a/cpp/src/arrow/ipc/reader.h
+++ b/cpp/src/arrow/ipc/reader.h
@@ -45,17 +45,15 @@ namespace ipc {
 
 // Generic read functions; does not copy data if the input supports zero copy reads
 
-Status ReadRecordBatch(const RecordBatchMetadata& metadata,
-    const std::shared_ptr<Schema>& schema, io::RandomAccessFile* file,
-    std::shared_ptr<RecordBatch>* out);
-
-Status ReadRecordBatch(const RecordBatchMetadata& metadata,
-    const std::shared_ptr<Schema>& schema, int max_recursion_depth,
+Status ReadRecordBatch(const Message& metadata, const std::shared_ptr<Schema>& schema,
     io::RandomAccessFile* file, std::shared_ptr<RecordBatch>* out);
 
-Status ReadDictionary(const DictionaryBatchMetadata& metadata,
-    const DictionaryTypeMap& dictionary_types, io::RandomAccessFile* file,
-    std::shared_ptr<Array>* out);
+Status ReadRecordBatch(const Message& metadata, const std::shared_ptr<Schema>& schema,
+    int max_recursion_depth, io::RandomAccessFile* file,
+    std::shared_ptr<RecordBatch>* out);
+
+Status ReadDictionary(const Message& metadata, const DictionaryTypeMap& dictionary_types,
+    io::RandomAccessFile* file, std::shared_ptr<Array>* out);
 
 class ARROW_EXPORT StreamReader {
  public:
diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc
index 6b957c081e502..3f254aae6d3fa 100644
--- a/cpp/src/arrow/table.cc
+++ b/cpp/src/arrow/table.cc
@@ -33,6 +33,10 @@ RecordBatch::RecordBatch(const std::shared_ptr<Schema>& schema, int64_t num_rows,
     const std::vector<std::shared_ptr<Array>>& columns)
     : schema_(schema), num_rows_(num_rows), columns_(columns) {}
 
+RecordBatch::RecordBatch(const std::shared_ptr<Schema>& schema, int64_t num_rows,
+    std::vector<std::shared_ptr<Array>>&& columns)
+    : schema_(schema), num_rows_(num_rows), columns_(std::move(columns)) {}
+
 const std::string& RecordBatch::column_name(int i) const {
   return schema_->field(i)->name;
 }
diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h
index 68f664b38a365..bf0d99c4e9d2b 100644
--- a/cpp/src/arrow/table.h
+++ b/cpp/src/arrow/table.h
@@ -43,6 +43,9 @@ class ARROW_EXPORT RecordBatch {
   RecordBatch(const std::shared_ptr<Schema>& schema, int64_t num_rows,
       const std::vector<std::shared_ptr<Array>>& columns);
 
+  RecordBatch(const std::shared_ptr<Schema>& schema, int64_t num_rows,
+      std::vector<std::shared_ptr<Array>>&& columns);
+
   bool Equals(const RecordBatch& other) const;
 
   bool ApproxEquals(const RecordBatch& other) const;

From 5ad498833fe6cd5519b8d652d4bf620add5a7eed Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Fri, 24 Mar 2017 18:27:16 -0400
Subject: [PATCH 0417/1644] ARROW-686: [C++] Account for time metadata changes,
 add Time32 and Time64 types

This also has a little visitor refactoring

Author: Wes McKinney

Closes #432 from wesm/ARROW-686 and squashes the following commits:

300c7f2 [Wes McKinney] Fix glib for time32/64 changes
be4c976 [Wes McKinney] Remove JSON time todo
504059a [Wes McKinney] Remove copy ctors to fix MSVC linker error
ae574ce [Wes McKinney] Some cleaning
cf9783c [Wes McKinney] Add new time types to Python bindings
95f5a05 [Wes McKinney] Implement Time32 and Time64 types, IPC roundtrip
- * @GARROW_TYPE_TIME: Exact time encoded with int64, default unit millisecond. + * @GARROW_TYPE_TIME32: Exact time encoded with int32, supporting seconds or milliseconds + * @GARROW_TYPE_TIME64: Exact time encoded with int64, supporting micro- or nanoseconds * @GARROW_TYPE_INTERVAL: YEAR_MONTH or DAY_TIME interval in SQL style. * @GARROW_TYPE_DECIMAL: Precision- and scale-based decimal * type. Storage type depends on the parameters. @@ -74,7 +75,8 @@ typedef enum { GARROW_TYPE_DATE32, GARROW_TYPE_DATE64, GARROW_TYPE_TIMESTAMP, - GARROW_TYPE_TIME, + GARROW_TYPE_TIME32, + GARROW_TYPE_TIME64, GARROW_TYPE_INTERVAL, GARROW_TYPE_DECIMAL, GARROW_TYPE_LIST, diff --git a/cpp/src/arrow/array-decimal-test.cc b/cpp/src/arrow/array-decimal-test.cc index 9e00fd9a7dd49..b64023bbc6a1e 100644 --- a/cpp/src/arrow/array-decimal-test.cc +++ b/cpp/src/arrow/array-decimal-test.cc @@ -29,12 +29,6 @@ TEST(TypesTest, TestDecimalType) { ASSERT_EQ(t1.scale, 4); ASSERT_EQ(t1.ToString(), std::string("decimal(8, 4)")); - - // Test copy constructor - DecimalType t2 = t1; - ASSERT_EQ(t2.type, Type::DECIMAL); - ASSERT_EQ(t2.precision, 8); - ASSERT_EQ(t2.scale, 4); } } // namespace arrow diff --git a/cpp/src/arrow/array-primitive-test.cc b/cpp/src/arrow/array-primitive-test.cc index 6863e58df05d2..fe60170cc5cc4 100644 --- a/cpp/src/arrow/array-primitive-test.cc +++ b/cpp/src/arrow/array-primitive-test.cc @@ -47,9 +47,6 @@ class Array; \ ASSERT_EQ(tp.type, Type::ENUM); \ ASSERT_EQ(tp.ToString(), string(NAME)); \ - \ - KLASS tp_copy = tp; \ - ASSERT_EQ(tp_copy.type, Type::ENUM); \ } PRIMITIVE_TEST(Int8Type, INT8, "int8"); diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index 20b732ab114da..f1c8bd42c476d 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -483,7 +483,8 @@ template class NumericArray; template class NumericArray; template class NumericArray; template class NumericArray; -template class NumericArray; +template class NumericArray; +template class NumericArray; template class NumericArray; template class NumericArray; template class NumericArray; diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 2a072dbf25ec0..c73b7a87a4f50 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -527,10 +527,11 @@ extern template class ARROW_EXPORT NumericArray; extern template class ARROW_EXPORT NumericArray; extern template class ARROW_EXPORT NumericArray; extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; extern template class ARROW_EXPORT NumericArray; extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; +extern template class ARROW_EXPORT NumericArray; #if defined(__GNUC__) && !defined(__clang__) #pragma GCC diagnostic pop diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index 483d6f0a425ea..52a785d086117 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -240,8 +240,9 @@ template class PrimitiveBuilder; template class PrimitiveBuilder; template class PrimitiveBuilder; template class PrimitiveBuilder; +template class PrimitiveBuilder; +template class PrimitiveBuilder; template class PrimitiveBuilder; -template class PrimitiveBuilder; template class PrimitiveBuilder; template class PrimitiveBuilder; template class PrimitiveBuilder; @@ -511,9 +512,9 @@ std::shared_ptr StructBuilder::field_builder(int pos) const { // 
---------------------------------------------------------------------- // Helper functions -#define BUILDER_CASE(ENUM, BuilderType) \ - case Type::ENUM: \ - out->reset(new BuilderType(pool)); \ +#define BUILDER_CASE(ENUM, BuilderType) \ + case Type::ENUM: \ + out->reset(new BuilderType(pool, type)); \ return Status::OK(); // Initially looked at doing this with vtables, but shared pointers makes it @@ -533,17 +534,14 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, BUILDER_CASE(INT64, Int64Builder); BUILDER_CASE(DATE32, Date32Builder); BUILDER_CASE(DATE64, Date64Builder); - case Type::TIMESTAMP: - out->reset(new TimestampBuilder(pool, type)); - return Status::OK(); - case Type::TIME: - out->reset(new TimeBuilder(pool, type)); - return Status::OK(); - BUILDER_CASE(BOOL, BooleanBuilder); - BUILDER_CASE(FLOAT, FloatBuilder); - BUILDER_CASE(DOUBLE, DoubleBuilder); - BUILDER_CASE(STRING, StringBuilder); - BUILDER_CASE(BINARY, BinaryBuilder); + BUILDER_CASE(TIME32, Time32Builder); + BUILDER_CASE(TIME64, Time64Builder); + BUILDER_CASE(TIMESTAMP, TimestampBuilder); + BUILDER_CASE(BOOL, BooleanBuilder); + BUILDER_CASE(FLOAT, FloatBuilder); + BUILDER_CASE(DOUBLE, DoubleBuilder); + BUILDER_CASE(STRING, StringBuilder); + BUILDER_CASE(BINARY, BinaryBuilder); case Type::LIST: { std::shared_ptr value_builder; std::shared_ptr value_type = diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index 7cefa649cbf71..bd957b38280da 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -231,7 +231,8 @@ using Int16Builder = NumericBuilder; using Int32Builder = NumericBuilder; using Int64Builder = NumericBuilder; using TimestampBuilder = NumericBuilder; -using TimeBuilder = NumericBuilder; +using Time32Builder = NumericBuilder; +using Time64Builder = NumericBuilder; using Date32Builder = NumericBuilder; using Date64Builder = NumericBuilder; @@ -378,6 +379,7 @@ class ARROW_EXPORT BinaryBuilder : public ListBuilder { // String builder class ARROW_EXPORT StringBuilder : public BinaryBuilder { public: + using BinaryBuilder::BinaryBuilder; explicit StringBuilder(MemoryPool* pool); using BinaryBuilder::Append; diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc index 3e6ecefc5ca5b..13511cf0f11be 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -29,6 +29,7 @@ #include "arrow/type_traits.h" #include "arrow/util/bit-util.h" #include "arrow/util/logging.h" +#include "arrow/visitor_inline.h" namespace arrow { @@ -177,7 +178,13 @@ class RangeEqualsVisitor : public ArrayVisitor { return CompareValues(left); } - Status Visit(const TimeArray& left) override { return CompareValues(left); } + Status Visit(const Time32Array& left) override { + return CompareValues(left); + } + + Status Visit(const Time64Array& left) override { + return CompareValues(left); + } Status Visit(const TimestampArray& left) override { return CompareValues(left); @@ -415,7 +422,9 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { Status Visit(const Date64Array& left) override { return ComparePrimitive(left); } - Status Visit(const TimeArray& left) override { return ComparePrimitive(left); } + Status Visit(const Time32Array& left) override { return ComparePrimitive(left); } + + Status Visit(const Time64Array& left) override { return ComparePrimitive(left); } Status Visit(const TimestampArray& left) override { return ComparePrimitive(left); } @@ -628,7 +637,7 @@ Status ArrayApproxEquals(const Array& left, const Array& right, bool* are_equal) // 
---------------------------------------------------------------------- // Implement TypeEquals -class TypeEqualsVisitor : public TypeVisitor { +class TypeEqualsVisitor { public: explicit TypeEqualsVisitor(const DataType& right) : right_(right), result_(false) {} @@ -648,29 +657,44 @@ class TypeEqualsVisitor : public TypeVisitor { return Status::OK(); } - Status Visit(const TimeType& left) override { - const auto& right = static_cast(right_); + template + typename std::enable_if::value || + std::is_base_of::value, + Status>::type + Visit(const T& type) { + result_ = true; + return Status::OK(); + } + + Status Visit(const Time32Type& left) { + const auto& right = static_cast(right_); result_ = left.unit == right.unit; return Status::OK(); } - Status Visit(const TimestampType& left) override { + Status Visit(const Time64Type& left) { + const auto& right = static_cast(right_); + result_ = left.unit == right.unit; + return Status::OK(); + } + + Status Visit(const TimestampType& left) { const auto& right = static_cast(right_); result_ = left.unit == right.unit && left.timezone == right.timezone; return Status::OK(); } - Status Visit(const FixedWidthBinaryType& left) override { + Status Visit(const FixedWidthBinaryType& left) { const auto& right = static_cast(right_); result_ = left.byte_width() == right.byte_width(); return Status::OK(); } - Status Visit(const ListType& left) override { return VisitChildren(left); } + Status Visit(const ListType& left) { return VisitChildren(left); } - Status Visit(const StructType& left) override { return VisitChildren(left); } + Status Visit(const StructType& left) { return VisitChildren(left); } - Status Visit(const UnionType& left) override { + Status Visit(const UnionType& left) { const auto& right = static_cast(right_); if (left.mode != right.mode || left.type_codes.size() != right.type_codes.size()) { @@ -691,7 +715,7 @@ class TypeEqualsVisitor : public TypeVisitor { return Status::OK(); } - Status Visit(const DictionaryType& left) override { + Status Visit(const DictionaryType& left) { const auto& right = static_cast(right_); result_ = left.index_type()->Equals(right.index_type()) && left.dictionary()->Equals(right.dictionary()); @@ -713,18 +737,8 @@ Status TypeEquals(const DataType& left, const DataType& right, bool* are_equal) *are_equal = false; } else { TypeEqualsVisitor visitor(right); - Status s = left.Accept(&visitor); - - // We do not implement any type visitors where there is no additional - // metadata to compare. 
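With the switch from a virtual TypeVisitor to an inline visitor, the NotImplemented fallback deleted below becomes unnecessary: VisitTypeInline dispatches on the type id at compile time, and the templated Visit overload already answers "equal" for parameter-free types. A condensed sketch of how the rewritten TypeEquals drives the visitor, mirroring the replacement lines just below:

    // Sketch only: inline visitor dispatch replaces Accept()/NotImplemented.
    TypeEqualsVisitor visitor(right);
    RETURN_NOT_OK(VisitTypeInline(left, &visitor));
    *are_equal = visitor.result();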
- if (s.IsNotImplemented()) { - // Not implemented means there is no additional metadata to compare - *are_equal = true; - } else if (!s.ok()) { - return s; - } else { - *are_equal = visitor.result(); - } + RETURN_NOT_OK(VisitTypeInline(left, &visitor)); + *are_equal = visitor.result(); } return Status::OK(); } diff --git a/cpp/src/arrow/ipc/feather-test.cc b/cpp/src/arrow/ipc/feather-test.cc index 2513887f75903..e181f6933541b 100644 --- a/cpp/src/arrow/ipc/feather-test.cc +++ b/cpp/src/arrow/ipc/feather-test.cc @@ -353,7 +353,7 @@ TEST_F(TestTableWriter, CategoryRoundtrip) { TEST_F(TestTableWriter, TimeTypes) { std::vector is_valid = {true, true, true, false, true, true, true}; auto f0 = field("f0", date32()); - auto f1 = field("f1", time(TimeUnit::MILLI)); + auto f1 = field("f1", time32(TimeUnit::MILLI)); auto f2 = field("f2", timestamp(TimeUnit::NANO)); auto f3 = field("f3", timestamp(TimeUnit::SECOND, "US/Los_Angeles")); std::shared_ptr schema(new Schema({f0, f1, f2, f3})); diff --git a/cpp/src/arrow/ipc/feather.cc b/cpp/src/arrow/ipc/feather.cc index 0dd9a8183fdc2..000bba9cce03b 100644 --- a/cpp/src/arrow/ipc/feather.cc +++ b/cpp/src/arrow/ipc/feather.cc @@ -294,7 +294,7 @@ class TableReader::TableReaderImpl { break; case fbs::TypeMetadata_TimeMetadata: { auto meta = static_cast(metadata); - *out = std::make_shared(FromFlatbufferEnum(meta->unit())); + *out = time32(FromFlatbufferEnum(meta->unit())); } break; default: switch (values->type()) { @@ -476,7 +476,9 @@ fbs::Type ToFlatbufferType(Type::type type) { return fbs::Type_DATE; case Type::TIMESTAMP: return fbs::Type_TIMESTAMP; - case Type::TIME: + case Type::TIME32: + return fbs::Type_TIME; + case Type::TIME64: return fbs::Type_TIME; case Type::DICTIONARY: return fbs::Type_CATEGORY; @@ -646,13 +648,17 @@ class TableWriter::TableWriterImpl : public ArrayVisitor { return Status::OK(); } - Status Visit(const TimeArray& values) override { + Status Visit(const Time32Array& values) override { RETURN_NOT_OK(WritePrimitiveValues(values)); - auto unit = static_cast(*values.type()).unit; + auto unit = static_cast(*values.type()).unit; current_column_->SetTime(unit); return Status::OK(); } + Status Visit(const Time64Array& values) override { + return Status::NotImplemented("time64"); + } + Status Append(const std::string& name, const Array& values) { current_column_ = metadata_.AddColumn(name); RETURN_NOT_OK(values.Accept(this)); diff --git a/cpp/src/arrow/ipc/ipc-json-test.cc b/cpp/src/arrow/ipc/ipc-json-test.cc index fd35182751948..e943ef1558a75 100644 --- a/cpp/src/arrow/ipc/ipc-json-test.cc +++ b/cpp/src/arrow/ipc/ipc-json-test.cc @@ -52,7 +52,9 @@ void TestSchemaRoundTrip(const Schema& schema) { std::shared_ptr out; ASSERT_OK(ReadJsonSchema(d, &out)); - ASSERT_TRUE(schema.Equals(out)); + if (!schema.Equals(out)) { + FAIL() << "In schema: " << schema.ToString() << "\nOut schema: " << out->ToString(); + } } void TestArrayRoundTrip(const Array& array) { @@ -105,8 +107,8 @@ TEST(TestJsonSchemaWriter, FlatTypes) { field("f10", utf8()), field("f11", binary()), field("f12", list(int32())), field("f13", struct_({field("s1", int32()), field("s2", utf8())})), field("f15", date64()), field("f16", timestamp(TimeUnit::NANO)), - field("f17", time(TimeUnit::MICRO)), - field("f18", union_({field("u1", int8()), field("u2", time(TimeUnit::MILLI))}, + field("f17", time64(TimeUnit::MICRO)), + field("f18", union_({field("u1", int8()), field("u2", time32(TimeUnit::MILLI))}, {0, 1}, UnionMode::DENSE))}; Schema schema(fields); diff --git 
a/cpp/src/arrow/ipc/json-internal.cc b/cpp/src/arrow/ipc/json-internal.cc index 08f0bdc3a023e..348468006d0b5 100644 --- a/cpp/src/arrow/ipc/json-internal.cc +++ b/cpp/src/arrow/ipc/json-internal.cc @@ -133,7 +133,10 @@ class JsonSchemaWriter : public TypeVisitor { } template - typename std::enable_if::value, void>::type + typename std::enable_if::value || + std::is_base_of::value || + std::is_base_of::value, + void>::type WriteTypeMetadata(const T& type) {} template @@ -167,7 +170,8 @@ class JsonSchemaWriter : public TypeVisitor { } template - typename std::enable_if::value || + typename std::enable_if::value || + std::is_base_of::value || std::is_base_of::value, void>::type WriteTypeMetadata(const T& type) { @@ -305,7 +309,9 @@ class JsonSchemaWriter : public TypeVisitor { Status Visit(const Date64Type& type) override { return WritePrimitive("date", type); } - Status Visit(const TimeType& type) override { return WritePrimitive("time", type); } + Status Visit(const Time32Type& type) override { return WritePrimitive("time", type); } + + Status Visit(const Time64Type& type) override { return WritePrimitive("time", type); } Status Visit(const TimestampType& type) override { return WritePrimitive("timestamp", type); @@ -650,15 +656,35 @@ class JsonSchemaReader { return Status::OK(); } - template - Status GetTimeLike(const RjObject& json_type, std::shared_ptr* type) { + Status GetTime(const RjObject& json_type, std::shared_ptr* type) { const auto& json_unit = json_type.FindMember("unit"); RETURN_NOT_STRING("unit", json_unit, json_type); std::string unit_str = json_unit->value.GetString(); - TimeUnit unit; + if (unit_str == "SECOND") { + *type = time32(TimeUnit::SECOND); + } else if (unit_str == "MILLISECOND") { + *type = time32(TimeUnit::MILLI); + } else if (unit_str == "MICROSECOND") { + *type = time64(TimeUnit::MICRO); + } else if (unit_str == "NANOSECOND") { + *type = time64(TimeUnit::NANO); + } else { + std::stringstream ss; + ss << "Invalid time unit: " << unit_str; + return Status::Invalid(ss.str()); + } + return Status::OK(); + } + + Status GetTimestamp(const RjObject& json_type, std::shared_ptr* type) { + const auto& json_unit = json_type.FindMember("unit"); + RETURN_NOT_STRING("unit", json_unit, json_type); + std::string unit_str = json_unit->value.GetString(); + + TimeUnit unit; if (unit_str == "SECOND") { unit = TimeUnit::SECOND; } else if (unit_str == "MILLISECOND") { @@ -673,7 +699,7 @@ class JsonSchemaReader { return Status::Invalid(ss.str()); } - *type = std::make_shared(unit); + *type = timestamp(unit); return Status::OK(); } @@ -736,9 +762,9 @@ class JsonSchemaReader { // TODO *type = date64(); } else if (type_name == "time") { - return GetTimeLike(json_type, type); + return GetTime(json_type, type); } else if (type_name == "timestamp") { - return GetTimeLike(json_type, type); + return GetTimestamp(json_type, type); } else if (type_name == "list") { *type = list(children[0]); } else if (type_name == "struct") { @@ -1063,7 +1089,8 @@ class JsonArrayReader { NOT_IMPLEMENTED_CASE(DATE32); NOT_IMPLEMENTED_CASE(DATE64); NOT_IMPLEMENTED_CASE(TIMESTAMP); - NOT_IMPLEMENTED_CASE(TIME); + NOT_IMPLEMENTED_CASE(TIME32); + NOT_IMPLEMENTED_CASE(TIME64); NOT_IMPLEMENTED_CASE(INTERVAL); TYPE_CASE(ListType); TYPE_CASE(StructType); diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index 14cb627982343..17af563805792 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -251,7 +251,16 @@ static Status TypeFromFlatbuffer(flatbuf::Type type, const 
void* type_data, } case flatbuf::Type_Time: { auto time_type = static_cast(type_data); - *out = time(FromFlatbufferUnit(time_type->unit())); + TimeUnit unit = FromFlatbufferUnit(time_type->unit()); + switch (unit) { + case TimeUnit::SECOND: + case TimeUnit::MILLI: + *out = time32(unit); + break; + default: + *out = time64(unit); + break; + } return Status::OK(); } case flatbuf::Type_Timestamp: { @@ -371,8 +380,13 @@ static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, *out_type = flatbuf::Type_Date; *offset = flatbuf::CreateDate(fbb, flatbuf::DateUnit_MILLISECOND).Union(); break; - case Type::TIME: { - const auto& time_type = static_cast(*type); + case Type::TIME32: { + const auto& time_type = static_cast(*type); + *out_type = flatbuf::Type_Time; + *offset = flatbuf::CreateTime(fbb, ToFlatbufferUnit(time_type.unit)).Union(); + } break; + case Type::TIME64: { + const auto& time_type = static_cast(*type); *out_type = flatbuf::Type_Time; *offset = flatbuf::CreateTime(fbb, ToFlatbufferUnit(time_type.unit)).Union(); } break; diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index 4085ecf9e3da9..7ee57d2152c1b 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -505,20 +505,24 @@ Status MakeTimestamps(std::shared_ptr* out) { Status MakeTimes(std::shared_ptr* out) { std::vector is_valid = {true, true, true, false, true, true, true}; - auto f0 = field("f0", time(TimeUnit::MILLI)); - auto f1 = field("f1", time(TimeUnit::NANO)); - auto f2 = field("f2", time(TimeUnit::SECOND)); - std::shared_ptr schema(new Schema({f0, f1, f2})); - - std::vector ts_values = {1489269000000, 1489270000000, 1489271000000, + auto f0 = field("f0", time32(TimeUnit::MILLI)); + auto f1 = field("f1", time64(TimeUnit::NANO)); + auto f2 = field("f2", time32(TimeUnit::SECOND)); + auto f3 = field("f3", time64(TimeUnit::NANO)); + std::shared_ptr schema(new Schema({f0, f1, f2, f3})); + + std::vector t32_values = { + 1489269000, 1489270000, 1489271000, 1489272000, 1489272000, 1489273000}; + std::vector t64_values = {1489269000000, 1489270000000, 1489271000000, 1489272000000, 1489272000000, 1489273000000}; - std::shared_ptr a0, a1, a2; - ArrayFromVector(f0->type, is_valid, ts_values, &a0); - ArrayFromVector(f1->type, is_valid, ts_values, &a1); - ArrayFromVector(f2->type, is_valid, ts_values, &a2); + std::shared_ptr a0, a1, a2, a3; + ArrayFromVector(f0->type, is_valid, t32_values, &a0); + ArrayFromVector(f1->type, is_valid, t64_values, &a1); + ArrayFromVector(f2->type, is_valid, t32_values, &a2); + ArrayFromVector(f3->type, is_valid, t64_values, &a3); - ArrayVector arrays = {a0, a1, a2}; + ArrayVector arrays = {a0, a1, a2, a3}; *out = std::make_shared(schema, a0->length(), arrays); return Status::OK(); } diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc index dc991aba79795..e795ef961cb64 100644 --- a/cpp/src/arrow/ipc/writer.cc +++ b/cpp/src/arrow/ipc/writer.cc @@ -334,8 +334,9 @@ class RecordBatchWriter : public ArrayVisitor { VISIT_FIXED_WIDTH(DoubleArray); VISIT_FIXED_WIDTH(Date32Array); VISIT_FIXED_WIDTH(Date64Array); - VISIT_FIXED_WIDTH(TimeArray); VISIT_FIXED_WIDTH(TimestampArray); + VISIT_FIXED_WIDTH(Time32Array); + VISIT_FIXED_WIDTH(Time64Array); #undef VISIT_FIXED_WIDTH diff --git a/cpp/src/arrow/pretty_print.cc b/cpp/src/arrow/pretty_print.cc index fc5eed18d8776..0f67fe5bc52a7 100644 --- a/cpp/src/arrow/pretty_print.cc +++ b/cpp/src/arrow/pretty_print.cc @@ -27,20 +27,17 @@ #include "arrow/type.h" #include "arrow/type_traits.h" 
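The metadata.cc and test-common.h hunks above encode the same rule from both directions: seconds and milliseconds map to the 32-bit time type, while microseconds and nanoseconds map to the 64-bit one. A condensed sketch of that mapping, assuming only the time32/time64 factories and the FromFlatbufferUnit helper shown in this patch:

    // Sketch only: one wire-level Time type, two in-memory types keyed by unit.
    TimeUnit unit = FromFlatbufferUnit(time_type->unit());
    std::shared_ptr<DataType> out =
        (unit == TimeUnit::SECOND || unit == TimeUnit::MILLI) ? time32(unit)
                                                              : time64(unit);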
#include "arrow/util/string.h" +#include "arrow/visitor_inline.h" namespace arrow { -class ArrayPrinter : public ArrayVisitor { +class ArrayPrinter { public: ArrayPrinter(const Array& array, int indent, std::ostream* sink) : array_(array), indent_(indent), sink_(sink) {} - Status Print() { return VisitArray(array_); } - - Status VisitArray(const Array& array) { return array.Accept(this); } - template - typename std::enable_if::value, void>::type WriteDataValues( + inline typename std::enable_if::value, void>::type WriteDataValues( const T& array) { const auto data = array.raw_data(); for (int i = 0; i < array.length(); ++i) { @@ -54,7 +51,7 @@ class ArrayPrinter : public ArrayVisitor { } template - typename std::enable_if::value, void>::type WriteDataValues( + inline typename std::enable_if::value, void>::type WriteDataValues( const T& array) { const auto data = array.raw_data(); for (int i = 0; i < array.length(); ++i) { @@ -69,7 +66,7 @@ class ArrayPrinter : public ArrayVisitor { // String (Utf8) template - typename std::enable_if::value, void>::type + inline typename std::enable_if::value, void>::type WriteDataValues(const T& array) { int32_t length; for (int i = 0; i < array.length(); ++i) { @@ -85,7 +82,7 @@ class ArrayPrinter : public ArrayVisitor { // Binary template - typename std::enable_if::value, void>::type + inline typename std::enable_if::value, void>::type WriteDataValues(const T& array) { int32_t length; for (int i = 0; i < array.length(); ++i) { @@ -100,8 +97,9 @@ class ArrayPrinter : public ArrayVisitor { } template - typename std::enable_if::value, void>::type - WriteDataValues(const T& array) { + inline + typename std::enable_if::value, void>::type + WriteDataValues(const T& array) { int32_t width = array.byte_width(); for (int i = 0; i < array.length(); ++i) { if (i > 0) { (*sink_) << ", "; } @@ -115,7 +113,7 @@ class ArrayPrinter : public ArrayVisitor { } template - typename std::enable_if::value, void>::type + inline typename std::enable_if::value, void>::type WriteDataValues(const T& array) { for (int i = 0; i < array.length(); ++i) { if (i > 0) { (*sink_) << ", "; } @@ -127,83 +125,34 @@ class ArrayPrinter : public ArrayVisitor { } } - void OpenArray() { (*sink_) << "["; } + void Write(const char* data); + void Write(const std::string& data); + void Newline(); + void Indent(); + void OpenArray(); + void CloseArray(); - void CloseArray() { (*sink_) << "]"; } + Status Visit(const NullArray& array) { return Status::OK(); } template - Status WriteArray(const T& array) { + typename std::enable_if::value || + std::is_base_of::value || + std::is_base_of::value, + Status>::type + Visit(const T& array) { OpenArray(); WriteDataValues(array); CloseArray(); return Status::OK(); } - Status Visit(const NullArray& array) override { return Status::OK(); } - - Status Visit(const BooleanArray& array) override { return WriteArray(array); } - - Status Visit(const Int8Array& array) override { return WriteArray(array); } - - Status Visit(const Int16Array& array) override { return WriteArray(array); } - - Status Visit(const Int32Array& array) override { return WriteArray(array); } - - Status Visit(const Int64Array& array) override { return WriteArray(array); } - - Status Visit(const UInt8Array& array) override { return WriteArray(array); } - - Status Visit(const UInt16Array& array) override { return WriteArray(array); } - - Status Visit(const UInt32Array& array) override { return WriteArray(array); } - - Status Visit(const UInt64Array& array) override { return WriteArray(array); } - - 
Status Visit(const HalfFloatArray& array) override { return WriteArray(array); } - - Status Visit(const FloatArray& array) override { return WriteArray(array); } - - Status Visit(const DoubleArray& array) override { return WriteArray(array); } - - Status Visit(const StringArray& array) override { return WriteArray(array); } - - Status Visit(const BinaryArray& array) override { return WriteArray(array); } - - Status Visit(const FixedWidthBinaryArray& array) override { return WriteArray(array); } - - Status Visit(const Date32Array& array) override { return WriteArray(array); } - - Status Visit(const Date64Array& array) override { return WriteArray(array); } - - Status Visit(const TimeArray& array) override { return WriteArray(array); } - - Status Visit(const TimestampArray& array) override { - return Status::NotImplemented("timestamp"); - } - - Status Visit(const IntervalArray& array) override { - return Status::NotImplemented("interval"); - } + Status Visit(const IntervalArray& array) { return Status::NotImplemented("interval"); } - Status Visit(const DecimalArray& array) override { - return Status::NotImplemented("decimal"); - } + Status Visit(const DecimalArray& array) { return Status::NotImplemented("decimal"); } - Status WriteValidityBitmap(const Array& array) { - Newline(); - Write("-- is_valid: "); - - if (array.null_count() > 0) { - BooleanArray is_valid( - array.length(), array.null_bitmap(), nullptr, 0, array.offset()); - return PrettyPrint(is_valid, indent_ + 2, sink_); - } else { - Write("all not null"); - return Status::OK(); - } - } + Status WriteValidityBitmap(const Array& array); - Status Visit(const ListArray& array) override { + Status Visit(const ListArray& array) { RETURN_NOT_OK(WriteValidityBitmap(array)); Newline(); @@ -239,12 +188,12 @@ class ArrayPrinter : public ArrayVisitor { return Status::OK(); } - Status Visit(const StructArray& array) override { + Status Visit(const StructArray& array) { RETURN_NOT_OK(WriteValidityBitmap(array)); return PrintChildren(array.fields(), array.offset(), array.length()); } - Status Visit(const UnionArray& array) override { + Status Visit(const UnionArray& array) { RETURN_NOT_OK(WriteValidityBitmap(array)); Newline(); @@ -264,7 +213,7 @@ class ArrayPrinter : public ArrayVisitor { return PrintChildren(array.children(), 0, array.length() + array.offset()); } - Status Visit(const DictionaryArray& array) override { + Status Visit(const DictionaryArray& array) { RETURN_NOT_OK(WriteValidityBitmap(array)); Newline(); @@ -276,20 +225,7 @@ class ArrayPrinter : public ArrayVisitor { return PrettyPrint(*array.indices(), indent_ + 2, sink_); } - void Write(const char* data) { (*sink_) << data; } - - void Write(const std::string& data) { (*sink_) << data; } - - void Newline() { - (*sink_) << "\n"; - Indent(); - } - - void Indent() { - for (int i = 0; i < indent_; ++i) { - (*sink_) << " "; - } - } + Status Print() { return VisitArrayInline(array_, this); } private: const Array& array_; @@ -298,6 +234,46 @@ class ArrayPrinter : public ArrayVisitor { std::ostream* sink_; }; +Status ArrayPrinter::WriteValidityBitmap(const Array& array) { + Newline(); + Write("-- is_valid: "); + + if (array.null_count() > 0) { + BooleanArray is_valid( + array.length(), array.null_bitmap(), nullptr, 0, array.offset()); + return PrettyPrint(is_valid, indent_ + 2, sink_); + } else { + Write("all not null"); + return Status::OK(); + } +} + +void ArrayPrinter::OpenArray() { + (*sink_) << "["; +} +void ArrayPrinter::CloseArray() { + (*sink_) << "]"; +} + +void 
ArrayPrinter::Write(const char* data) { + (*sink_) << data; +} + +void ArrayPrinter::Write(const std::string& data) { + (*sink_) << data; +} + +void ArrayPrinter::Newline() { + (*sink_) << "\n"; + Indent(); +} + +void ArrayPrinter::Indent() { + for (int i = 0; i < indent_; ++i) { + (*sink_) << " "; + } +} + Status PrettyPrint(const Array& arr, int indent, std::ostream* sink) { ArrayPrinter printer(arr, indent, sink); return printer.Print(); diff --git a/cpp/src/arrow/type-test.cc b/cpp/src/arrow/type-test.cc index c2d115ccbfe6f..b6a84df339e6e 100644 --- a/cpp/src/arrow/type-test.cc +++ b/cpp/src/arrow/type-test.cc @@ -182,26 +182,30 @@ TEST(TestDateTypes, ToString) { } TEST(TestTimeType, Equals) { - TimeType t1; - TimeType t2; - TimeType t3(TimeUnit::NANO); - TimeType t4(TimeUnit::NANO); - - ASSERT_TRUE(t1.Equals(t2)); + Time32Type t0; + Time32Type t1(TimeUnit::SECOND); + Time32Type t2(TimeUnit::MILLI); + Time64Type t3(TimeUnit::MICRO); + Time64Type t4(TimeUnit::NANO); + Time64Type t5(TimeUnit::MICRO); + + ASSERT_TRUE(t0.Equals(t2)); + ASSERT_TRUE(t1.Equals(t1)); ASSERT_FALSE(t1.Equals(t3)); - ASSERT_TRUE(t3.Equals(t4)); + ASSERT_FALSE(t3.Equals(t4)); + ASSERT_TRUE(t3.Equals(t5)); } TEST(TestTimeType, ToString) { - auto t1 = time(TimeUnit::MILLI); - auto t2 = time(TimeUnit::NANO); - auto t3 = time(TimeUnit::SECOND); - auto t4 = time(TimeUnit::MICRO); - - ASSERT_EQ("time[ms]", t1->ToString()); - ASSERT_EQ("time[ns]", t2->ToString()); - ASSERT_EQ("time[s]", t3->ToString()); - ASSERT_EQ("time[us]", t4->ToString()); + auto t1 = time32(TimeUnit::MILLI); + auto t2 = time64(TimeUnit::NANO); + auto t3 = time32(TimeUnit::SECOND); + auto t4 = time64(TimeUnit::MICRO); + + ASSERT_EQ("time32[ms]", t1->ToString()); + ASSERT_EQ("time64[ns]", t2->ToString()); + ASSERT_EQ("time32[s]", t3->ToString()); + ASSERT_EQ("time64[us]", t4->ToString()); } TEST(TestTimestampType, Equals) { diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 1c61eb61abea0..388502214e733 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -119,12 +119,34 @@ std::string Date32Type::ToString() const { return std::string("date32[day]"); } -std::string TimeType::ToString() const { +// ---------------------------------------------------------------------- +// Time types + +Time32Type::Time32Type(TimeUnit unit) : FixedWidthType(Type::TIME32), unit(unit) { + DCHECK(unit == TimeUnit::SECOND || unit == TimeUnit::MILLI) + << "Must be seconds or milliseconds"; +} + +std::string Time32Type::ToString() const { + std::stringstream ss; + ss << "time32[" << this->unit << "]"; + return ss.str(); +} + +Time64Type::Time64Type(TimeUnit unit) : FixedWidthType(Type::TIME64), unit(unit) { + DCHECK(unit == TimeUnit::MICRO || unit == TimeUnit::NANO) + << "Must be microseconds or nanoseconds"; +} + +std::string Time64Type::ToString() const { std::stringstream ss; - ss << "time[" << this->unit << "]"; + ss << "time64[" << this->unit << "]"; return ss.str(); } +// ---------------------------------------------------------------------- +// Timestamp types + std::string TimestampType::ToString() const { std::stringstream ss; ss << "timestamp[" << this->unit; @@ -138,7 +160,7 @@ std::string TimestampType::ToString() const { UnionType::UnionType(const std::vector>& fields, const std::vector& type_codes, UnionMode mode) - : DataType(Type::UNION), mode(mode), type_codes(type_codes) { + : NestedType(Type::UNION), mode(mode), type_codes(type_codes) { children_ = fields; } @@ -206,9 +228,10 @@ ACCEPT_VISITOR(ListType); ACCEPT_VISITOR(StructType); 
ACCEPT_VISITOR(DecimalType); ACCEPT_VISITOR(UnionType); -ACCEPT_VISITOR(Date64Type); ACCEPT_VISITOR(Date32Type); -ACCEPT_VISITOR(TimeType); +ACCEPT_VISITOR(Date64Type); +ACCEPT_VISITOR(Time32Type); +ACCEPT_VISITOR(Time64Type); ACCEPT_VISITOR(TimestampType); ACCEPT_VISITOR(IntervalType); ACCEPT_VISITOR(DictionaryType); @@ -249,8 +272,12 @@ std::shared_ptr timestamp(TimeUnit unit, const std::string& timezone) return std::make_shared(unit, timezone); } -std::shared_ptr time(TimeUnit unit) { - return std::make_shared(unit); +std::shared_ptr time32(TimeUnit unit) { + return std::make_shared(unit); +} + +std::shared_ptr time64(TimeUnit unit) { + return std::make_shared(unit); } std::shared_ptr list(const std::shared_ptr& value_type) { diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 40c00a4bac1b1..7ae5ae3c4b72e 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -82,8 +82,13 @@ struct Type { // Default unit millisecond TIMESTAMP, - // Exact time encoded with int64, default unit millisecond - TIME, + // Time as signed 32-bit integer, representing either seconds or + // milliseconds since midnight + TIME32, + + // Time as signed 64-bit integer, representing either microseconds or + // nanoseconds since midnight + TIME64, // YEAR_MONTH or DAY_TIME interval in SQL style INTERVAL, @@ -147,6 +152,9 @@ struct ARROW_EXPORT DataType { virtual std::string ToString() const = 0; virtual std::vector GetBufferLayout() const = 0; + + private: + DISALLOW_COPY_AND_ASSIGN(DataType); }; typedef std::shared_ptr TypePtr; @@ -168,6 +176,10 @@ struct ARROW_EXPORT FloatingPointMeta { virtual Precision precision() const = 0; }; +struct ARROW_EXPORT NestedType : public DataType { + using DataType::DataType; +}; + struct NoExtraMeta {}; // A field is a piece of metadata that includes (for now) a name and a data @@ -298,14 +310,14 @@ struct ARROW_EXPORT DoubleType : public CTypeImpl& value_type) : ListType(std::make_shared("item", value_type)) {} - explicit ListType(const std::shared_ptr& value_field) : DataType(Type::LIST) { + explicit ListType(const std::shared_ptr& value_field) : NestedType(Type::LIST) { children_ = {value_field}; } @@ -369,11 +381,11 @@ struct ARROW_EXPORT StringType : public BinaryType { static std::string name() { return "utf8"; } }; -struct ARROW_EXPORT StructType : public DataType, public NoExtraMeta { +struct ARROW_EXPORT StructType : public NestedType { static constexpr Type::type type_id = Type::STRUCT; explicit StructType(const std::vector>& fields) - : DataType(Type::STRUCT) { + : NestedType(Type::STRUCT) { children_ = fields; } @@ -401,7 +413,7 @@ struct ARROW_EXPORT DecimalType : public DataType { enum class UnionMode : char { SPARSE, DENSE }; -struct ARROW_EXPORT UnionType : public DataType { +struct ARROW_EXPORT UnionType : public NestedType { static constexpr Type::type type_id = Type::UNION; UnionType(const std::vector>& fields, @@ -473,8 +485,23 @@ static inline std::ostream& operator<<(std::ostream& os, TimeUnit unit) { return os; } -struct ARROW_EXPORT TimeType : public FixedWidthType { - static constexpr Type::type type_id = Type::TIME; +struct ARROW_EXPORT Time32Type : public FixedWidthType { + static constexpr Type::type type_id = Type::TIME32; + using Unit = TimeUnit; + using c_type = int32_t; + + TimeUnit unit; + + int bit_width() const override { return static_cast(sizeof(c_type) * 4); } + + explicit Time32Type(TimeUnit unit = TimeUnit::MILLI); + + Status Accept(TypeVisitor* visitor) const override; + std::string ToString() const override; +}; + +struct 
ARROW_EXPORT Time64Type : public FixedWidthType { + static constexpr Type::type type_id = Type::TIME64; using Unit = TimeUnit; using c_type = int64_t; @@ -482,9 +509,7 @@ struct ARROW_EXPORT TimeType : public FixedWidthType { int bit_width() const override { return static_cast(sizeof(c_type) * 8); } - explicit TimeType(TimeUnit unit = TimeUnit::MILLI) - : FixedWidthType(Type::TIME), unit(unit) {} - TimeType(const TimeType& other) : TimeType(other.unit) {} + explicit Time64Type(TimeUnit unit = TimeUnit::MILLI); Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; @@ -504,8 +529,6 @@ struct ARROW_EXPORT TimestampType : public FixedWidthType { explicit TimestampType(TimeUnit unit, const std::string& timezone) : FixedWidthType(Type::TIMESTAMP), unit(unit), timezone(timezone) {} - TimestampType(const TimestampType& other) : TimestampType(other.unit) {} - Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; static std::string name() { return "timestamp"; } @@ -527,8 +550,6 @@ struct ARROW_EXPORT IntervalType : public FixedWidthType { explicit IntervalType(Unit unit = Unit::YEAR_MONTH) : FixedWidthType(Type::INTERVAL), unit(unit) {} - IntervalType(const IntervalType& other) : IntervalType(other.unit) {} - Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override { return name(); } static std::string name() { return "date"; } @@ -573,7 +594,12 @@ std::shared_ptr ARROW_EXPORT list(const std::shared_ptr& val std::shared_ptr ARROW_EXPORT timestamp(TimeUnit unit); std::shared_ptr ARROW_EXPORT timestamp( TimeUnit unit, const std::string& timezone); -std::shared_ptr ARROW_EXPORT time(TimeUnit unit); + +/// Unit can be either SECOND or MILLI +std::shared_ptr ARROW_EXPORT time32(TimeUnit unit); + +/// Unit can be either MICRO or NANO +std::shared_ptr ARROW_EXPORT time64(TimeUnit unit); std::shared_ptr ARROW_EXPORT struct_( const std::vector>& fields); @@ -637,8 +663,9 @@ static inline bool is_primitive(Type::type type_id) { case Type::DOUBLE: case Type::DATE32: case Type::DATE64: + case Type::TIME32: + case Type::TIME64: case Type::TIMESTAMP: - case Type::TIME: case Type::INTERVAL: return true; default: diff --git a/cpp/src/arrow/type_fwd.h b/cpp/src/arrow/type_fwd.h index f62c0314a4620..201f4e92bb00d 100644 --- a/cpp/src/arrow/type_fwd.h +++ b/cpp/src/arrow/type_fwd.h @@ -105,9 +105,13 @@ struct Date32Type; using Date32Array = NumericArray; using Date32Builder = NumericBuilder; -struct TimeType; -using TimeArray = NumericArray; -using TimeBuilder = NumericBuilder; +struct Time32Type; +using Time32Array = NumericArray; +using Time32Builder = NumericBuilder; + +struct Time64Type; +using Time64Array = NumericArray; +using Time64Builder = NumericBuilder; struct TimestampType; using TimestampArray = NumericArray; @@ -134,6 +138,7 @@ std::shared_ptr ARROW_EXPORT float32(); std::shared_ptr ARROW_EXPORT float64(); std::shared_ptr ARROW_EXPORT utf8(); std::shared_ptr ARROW_EXPORT binary(); + std::shared_ptr ARROW_EXPORT date32(); std::shared_ptr ARROW_EXPORT date64(); diff --git a/cpp/src/arrow/type_traits.h b/cpp/src/arrow/type_traits.h index e731913bbd226..f735d2706e5a9 100644 --- a/cpp/src/arrow/type_traits.h +++ b/cpp/src/arrow/type_traits.h @@ -28,6 +28,12 @@ namespace arrow { template struct TypeTraits {}; +template <> +struct TypeTraits { + using ArrayType = NullArray; + constexpr static bool is_parameter_free = false; +}; + template <> struct TypeTraits { using ArrayType = UInt8Array; @@ -154,9 
+160,20 @@ struct TypeTraits { }; template <> -struct TypeTraits { - using ArrayType = TimeArray; - using BuilderType = TimeBuilder; +struct TypeTraits { + using ArrayType = Time32Array; + using BuilderType = Time32Builder; + + static inline int64_t bytes_required(int64_t elements) { + return elements * sizeof(int32_t); + } + constexpr static bool is_parameter_free = false; +}; + +template <> +struct TypeTraits { + using ArrayType = Time64Array; + using BuilderType = Time64Builder; static inline int64_t bytes_required(int64_t elements) { return elements * sizeof(int64_t); @@ -235,6 +252,32 @@ struct TypeTraits { constexpr static bool is_parameter_free = false; }; +template <> +struct TypeTraits { + using ArrayType = ListArray; + using BuilderType = ListBuilder; + constexpr static bool is_parameter_free = false; +}; + +template <> +struct TypeTraits { + using ArrayType = StructArray; + using BuilderType = StructBuilder; + constexpr static bool is_parameter_free = false; +}; + +template <> +struct TypeTraits { + using ArrayType = UnionArray; + constexpr static bool is_parameter_free = false; +}; + +template <> +struct TypeTraits { + using ArrayType = DictionaryArray; + constexpr static bool is_parameter_free = false; +}; + // Not all type classes have a c_type template struct as_void { diff --git a/cpp/src/arrow/visitor.cc b/cpp/src/arrow/visitor.cc index 181e932eeebf6..9200e0ff228a3 100644 --- a/cpp/src/arrow/visitor.cc +++ b/cpp/src/arrow/visitor.cc @@ -46,7 +46,8 @@ ARRAY_VISITOR_DEFAULT(StringArray); ARRAY_VISITOR_DEFAULT(FixedWidthBinaryArray); ARRAY_VISITOR_DEFAULT(Date32Array); ARRAY_VISITOR_DEFAULT(Date64Array); -ARRAY_VISITOR_DEFAULT(TimeArray); +ARRAY_VISITOR_DEFAULT(Time32Array); +ARRAY_VISITOR_DEFAULT(Time64Array); ARRAY_VISITOR_DEFAULT(TimestampArray); ARRAY_VISITOR_DEFAULT(IntervalArray); ARRAY_VISITOR_DEFAULT(ListArray); @@ -84,7 +85,8 @@ TYPE_VISITOR_DEFAULT(BinaryType); TYPE_VISITOR_DEFAULT(FixedWidthBinaryType); TYPE_VISITOR_DEFAULT(Date64Type); TYPE_VISITOR_DEFAULT(Date32Type); -TYPE_VISITOR_DEFAULT(TimeType); +TYPE_VISITOR_DEFAULT(Time32Type); +TYPE_VISITOR_DEFAULT(Time64Type); TYPE_VISITOR_DEFAULT(TimestampType); TYPE_VISITOR_DEFAULT(IntervalType); TYPE_VISITOR_DEFAULT(DecimalType); diff --git a/cpp/src/arrow/visitor.h b/cpp/src/arrow/visitor.h index a9c59c8f762fe..d44dcf6b97676 100644 --- a/cpp/src/arrow/visitor.h +++ b/cpp/src/arrow/visitor.h @@ -46,7 +46,8 @@ class ARROW_EXPORT ArrayVisitor { virtual Status Visit(const FixedWidthBinaryArray& array); virtual Status Visit(const Date32Array& array); virtual Status Visit(const Date64Array& array); - virtual Status Visit(const TimeArray& array); + virtual Status Visit(const Time32Array& array); + virtual Status Visit(const Time64Array& array); virtual Status Visit(const TimestampArray& array); virtual Status Visit(const IntervalArray& array); virtual Status Visit(const DecimalArray& array); @@ -78,7 +79,8 @@ class ARROW_EXPORT TypeVisitor { virtual Status Visit(const FixedWidthBinaryType& type); virtual Status Visit(const Date64Type& type); virtual Status Visit(const Date32Type& type); - virtual Status Visit(const TimeType& type); + virtual Status Visit(const Time32Type& type); + virtual Status Visit(const Time64Type& type); virtual Status Visit(const TimestampType& type); virtual Status Visit(const IntervalType& type); virtual Status Visit(const DecimalType& type); diff --git a/cpp/src/arrow/visitor_inline.h b/cpp/src/arrow/visitor_inline.h index b69468d17eebe..0ea16bcc73366 100644 --- a/cpp/src/arrow/visitor_inline.h +++ 
b/cpp/src/arrow/visitor_inline.h @@ -51,7 +51,8 @@ inline Status VisitTypeInline(const DataType& type, VISITOR* visitor) { TYPE_VISIT_INLINE(Date32Type); TYPE_VISIT_INLINE(Date64Type); TYPE_VISIT_INLINE(TimestampType); - TYPE_VISIT_INLINE(TimeType); + TYPE_VISIT_INLINE(Time32Type); + TYPE_VISIT_INLINE(Time64Type); TYPE_VISIT_INLINE(ListType); TYPE_VISIT_INLINE(StructType); TYPE_VISIT_INLINE(UnionType); @@ -62,6 +63,44 @@ inline Status VisitTypeInline(const DataType& type, VISITOR* visitor) { return Status::NotImplemented("Type not implemented"); } +#define ARRAY_VISIT_INLINE(TYPE_CLASS) \ + case TYPE_CLASS::type_id: \ + return visitor->Visit( \ + static_cast::ArrayType&>(array)); + +template +inline Status VisitArrayInline(const Array& array, VISITOR* visitor) { + switch (array.type_enum()) { + ARRAY_VISIT_INLINE(NullType); + ARRAY_VISIT_INLINE(BooleanType); + ARRAY_VISIT_INLINE(Int8Type); + ARRAY_VISIT_INLINE(UInt8Type); + ARRAY_VISIT_INLINE(Int16Type); + ARRAY_VISIT_INLINE(UInt16Type); + ARRAY_VISIT_INLINE(Int32Type); + ARRAY_VISIT_INLINE(UInt32Type); + ARRAY_VISIT_INLINE(Int64Type); + ARRAY_VISIT_INLINE(UInt64Type); + ARRAY_VISIT_INLINE(FloatType); + ARRAY_VISIT_INLINE(DoubleType); + ARRAY_VISIT_INLINE(StringType); + ARRAY_VISIT_INLINE(BinaryType); + ARRAY_VISIT_INLINE(FixedWidthBinaryType); + ARRAY_VISIT_INLINE(Date32Type); + ARRAY_VISIT_INLINE(Date64Type); + ARRAY_VISIT_INLINE(TimestampType); + ARRAY_VISIT_INLINE(Time32Type); + ARRAY_VISIT_INLINE(Time64Type); + ARRAY_VISIT_INLINE(ListType); + ARRAY_VISIT_INLINE(StructType); + ARRAY_VISIT_INLINE(UnionType); + ARRAY_VISIT_INLINE(DictionaryType); + default: + break; + } + return Status::NotImplemented("Type not implemented"); +} + } // namespace arrow #endif // ARROW_VISITOR_INLINE_H diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 795076cfccb7e..654f5ab527043 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -383,6 +383,12 @@ cdef class Date64Array(NumericArray): cdef class TimestampArray(NumericArray): pass +cdef class Time32Array(NumericArray): + pass + + +cdef class Time64Array(NumericArray): + pass cdef class FloatArray(FloatingPointArray): pass @@ -490,12 +496,14 @@ cdef dict _array_classes = { Type_INT64: Int64Array, Type_DATE32: Date32Array, Type_DATE64: Date64Array, + Type_TIMESTAMP: TimestampArray, + Type_TIME32: Time32Array, + Type_TIME64: Time64Array, Type_FLOAT: FloatArray, Type_DOUBLE: DoubleArray, Type_LIST: ListArray, Type_BINARY: BinaryArray, Type_STRING: StringArray, - Type_TIMESTAMP: TimestampArray, Type_DICTIONARY: DictionaryArray } diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 1d9c38e48cfe9..bdbd18bce01df 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -38,9 +38,11 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: Type_FLOAT" arrow::Type::FLOAT" Type_DOUBLE" arrow::Type::DOUBLE" - Type_TIMESTAMP" arrow::Type::TIMESTAMP" Type_DATE32" arrow::Type::DATE32" Type_DATE64" arrow::Type::DATE64" + Type_TIMESTAMP" arrow::Type::TIMESTAMP" + Type_TIME32" arrow::Type::TIME32" + Type_TIME64" arrow::Type::TIME64" Type_BINARY" arrow::Type::BINARY" Type_STRING" arrow::Type::STRING" @@ -85,11 +87,20 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: shared_ptr[CArray] indices() shared_ptr[CArray] dictionary() + cdef cppclass CDate32Type" arrow::Date32Type"(CFixedWidthType): + pass + + cdef cppclass CDate64Type" arrow::Date64Type"(CFixedWidthType): + pass + cdef cppclass 
CTimestampType" arrow::TimestampType"(CFixedWidthType): TimeUnit unit c_string timezone - cdef cppclass CTimeType" arrow::TimeType"(CFixedWidthType): + cdef cppclass CTime32Type" arrow::Time32Type"(CFixedWidthType): + TimeUnit unit + + cdef cppclass CTime64Type" arrow::Time64Type"(CFixedWidthType): TimeUnit unit cdef cppclass CDictionaryType" arrow::DictionaryType"(CFixedWidthType): From c7947dc2d08a0a2295016d34db201cc38a38360c Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 24 Mar 2017 22:10:36 -0400 Subject: [PATCH 0418/1644] ARROW-709: [C++] Restore type comparator for DecimalType Needed for PARQUET-923. Author: Wes McKinney Closes #439 from wesm/ARROW-709 and squashes the following commits: 55d23a3 [Wes McKinney] Restore type comparator for DecimalType --- cpp/src/arrow/compare.cc | 6 ++++++ cpp/src/arrow/loader.cc | 2 ++ cpp/src/arrow/type_traits.h | 6 ++++++ cpp/src/arrow/visitor_inline.h | 2 ++ 4 files changed, 16 insertions(+) diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc index 13511cf0f11be..8274e0f80dc50 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -690,6 +690,12 @@ class TypeEqualsVisitor { return Status::OK(); } + Status Visit(const DecimalType& left) { + const auto& right = static_cast(right_); + result_ = left.precision == right.precision && left.scale == right.scale; + return Status::OK(); + } + Status Visit(const ListType& left) { return VisitChildren(left); } Status Visit(const StructType& left) { return VisitChildren(left); } diff --git a/cpp/src/arrow/loader.cc b/cpp/src/arrow/loader.cc index a67a3e94bd5f2..cc64c4d8264f7 100644 --- a/cpp/src/arrow/loader.cc +++ b/cpp/src/arrow/loader.cc @@ -135,6 +135,8 @@ class ArrayLoader { Status Visit(const NullType& type) { return Status::NotImplemented("null"); } + Status Visit(const DecimalType& type) { return Status::NotImplemented("decimal"); } + template typename std::enable_if::value && !std::is_base_of::value && diff --git a/cpp/src/arrow/type_traits.h b/cpp/src/arrow/type_traits.h index f735d2706e5a9..1270aee1622ea 100644 --- a/cpp/src/arrow/type_traits.h +++ b/cpp/src/arrow/type_traits.h @@ -278,6 +278,12 @@ struct TypeTraits { constexpr static bool is_parameter_free = false; }; +template <> +struct TypeTraits { + // using ArrayType = DecimalArray; + constexpr static bool is_parameter_free = false; +}; + // Not all type classes have a c_type template struct as_void { diff --git a/cpp/src/arrow/visitor_inline.h b/cpp/src/arrow/visitor_inline.h index 0ea16bcc73366..586b123e67cfb 100644 --- a/cpp/src/arrow/visitor_inline.h +++ b/cpp/src/arrow/visitor_inline.h @@ -53,6 +53,7 @@ inline Status VisitTypeInline(const DataType& type, VISITOR* visitor) { TYPE_VISIT_INLINE(TimestampType); TYPE_VISIT_INLINE(Time32Type); TYPE_VISIT_INLINE(Time64Type); + TYPE_VISIT_INLINE(DecimalType); TYPE_VISIT_INLINE(ListType); TYPE_VISIT_INLINE(StructType); TYPE_VISIT_INLINE(UnionType); @@ -91,6 +92,7 @@ inline Status VisitArrayInline(const Array& array, VISITOR* visitor) { ARRAY_VISIT_INLINE(TimestampType); ARRAY_VISIT_INLINE(Time32Type); ARRAY_VISIT_INLINE(Time64Type); + // ARRAY_VISIT_INLINE(DecimalType); ARRAY_VISIT_INLINE(ListType); ARRAY_VISIT_INLINE(StructType); ARRAY_VISIT_INLINE(UnionType); From 685ebf49001ef02134e404dddae2bfd032dc4a65 Mon Sep 17 00:00:00 2001 From: Jeff Knupp Date: Sat, 25 Mar 2017 12:15:02 -0400 Subject: [PATCH 0419/1644] ARROW-626: [Python] Replace PyBytesBuffer with zero-copy, memoryview-based PyBuffer WIP, but tests all pass Author: Jeff Knupp Closes #410 from 
jeffknupp/master and squashes the following commits: bfba71d [Jeff Knupp] Fix typo in test 0a39f24 [Jeff Knupp] Fix some logical issues with initialization; add bytearray test fb2cfa3 [Jeff Knupp] Add proper reference counting regardless of if buf is memoryview 26f8b74 [Jeff Knupp] Need to investigate why test failed travis but not locally. For now, revert f7d21ac [Jeff Knupp] ARROW-626: [Python] Replace PyBytesBuffer with zero-copy, memoryview-based PyBuffer --- python/pyarrow/includes/pyarrow.pxd | 4 ++-- python/pyarrow/io.pyx | 19 ++++++++----------- python/pyarrow/tests/test_io.py | 20 +++++++++++++++----- python/src/pyarrow/common.cc | 24 +++++++++++++++--------- python/src/pyarrow/common.h | 10 +++++++--- python/src/pyarrow/io.cc | 6 +++--- python/src/pyarrow/io.h | 2 +- 7 files changed, 51 insertions(+), 34 deletions(-) diff --git a/python/pyarrow/includes/pyarrow.pxd b/python/pyarrow/includes/pyarrow.pxd index 805950bd1476a..3fdbebc9293cd 100644 --- a/python/pyarrow/includes/pyarrow.pxd +++ b/python/pyarrow/includes/pyarrow.pxd @@ -55,8 +55,8 @@ cdef extern from "pyarrow/api.h" namespace "arrow::py" nogil: cdef extern from "pyarrow/common.h" namespace "arrow::py" nogil: - cdef cppclass PyBytesBuffer(CBuffer): - PyBytesBuffer(object o) + cdef cppclass PyBuffer(CBuffer): + PyBuffer(object o) cdef extern from "pyarrow/io.h" namespace "arrow::py" nogil: diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index 72e0e0ff01512..cb44ce8816338 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -501,16 +501,11 @@ cdef class BufferReader(NativeFile): Buffer buffer def __cinit__(self, object obj): - cdef shared_ptr[CBuffer] buf if isinstance(obj, Buffer): self.buffer = obj - elif isinstance(obj, bytes): - buf.reset(new pyarrow.PyBytesBuffer(obj)) - self.buffer = wrap_buffer(buf) else: - raise ValueError('Unable to convert value to buffer: {0}' - .format(type(obj))) + self.buffer = build_arrow_buffer(obj) self.rd_file.reset(new CBufferReader(self.buffer.buffer)) self.is_readable = 1 @@ -518,16 +513,18 @@ cdef class BufferReader(NativeFile): self.is_open = True -def buffer_from_bytes(object obj): +def build_arrow_buffer(object obj): """ Construct an Arrow buffer from a Python bytes object """ cdef shared_ptr[CBuffer] buf - if not isinstance(obj, bytes): - raise ValueError('Must pass bytes object') + try: + memoryview(obj) + buf.reset(new pyarrow.PyBuffer(obj)) + return wrap_buffer(buf) + except TypeError: + raise ValueError('Must pass object that implements buffer protocol') - buf.reset(new pyarrow.PyBytesBuffer(obj)) - return wrap_buffer(buf) cdef Buffer wrap_buffer(const shared_ptr[CBuffer]& buf): diff --git a/python/pyarrow/tests/test_io.py b/python/pyarrow/tests/test_io.py index c6caba5ce641a..9cd15c4a76cef 100644 --- a/python/pyarrow/tests/test_io.py +++ b/python/pyarrow/tests/test_io.py @@ -82,7 +82,6 @@ def test_bytes_reader(): # Like a BytesIO, but zero-copy underneath for C++ consumers data = b'some sample data' f = io.BufferReader(data) - assert f.tell() == 0 assert f.size() == len(data) @@ -128,7 +127,7 @@ def get_buffer(): def test_buffer_bytes(): val = b'some data' - buf = io.buffer_from_bytes(val) + buf = io.build_arrow_buffer(val) assert isinstance(buf, io.Buffer) result = buf.to_pybytes() @@ -138,18 +137,29 @@ def test_buffer_bytes(): def test_buffer_memoryview(): val = b'some data' - buf = io.buffer_from_bytes(val) + buf = io.build_arrow_buffer(val) assert isinstance(buf, io.Buffer) result = memoryview(buf) assert result == val +def test_buffer_bytearray(): 
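+    # Intent of this test (illustrative comment): build_arrow_buffer should
+    # accept any object implementing the buffer protocol, so a mutable
+    # bytearray must round-trip the same way bytes does in the tests above.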
+    val = bytearray(b'some data')
+
+    buf = io.build_arrow_buffer(val)
+    assert isinstance(buf, io.Buffer)
+
+    result = bytearray(buf)
+
+    assert result == val
+
 def test_buffer_memoryview_is_immutable():
     val = b'some data'
-    buf = io.buffer_from_bytes(val)
+    buf = io.build_arrow_buffer(val)
     assert isinstance(buf, io.Buffer)
 
     result = memoryview(buf)
@@ -192,7 +202,7 @@ def test_buffer_protocol_ref_counting():
     import gc
 
     def make_buffer(bytes_obj):
-        return bytearray(io.buffer_from_bytes(bytes_obj))
+        return bytearray(io.build_arrow_buffer(bytes_obj))
 
     buf = make_buffer(b'foo')
     gc.collect()
diff --git a/python/src/pyarrow/common.cc b/python/src/pyarrow/common.cc
index c898f634aedbb..792aa4775d4f0 100644
--- a/python/src/pyarrow/common.cc
+++ b/python/src/pyarrow/common.cc
@@ -45,18 +45,24 @@ MemoryPool* get_memory_pool() {
 }
 
 // ----------------------------------------------------------------------
-// PyBytesBuffer
+// PyBuffer
 
-PyBytesBuffer::PyBytesBuffer(PyObject* obj)
-    : Buffer(reinterpret_cast<const uint8_t*>(PyBytes_AS_STRING(obj)),
-          PyBytes_GET_SIZE(obj)),
-      obj_(obj) {
-  Py_INCREF(obj_);
+PyBuffer::PyBuffer(PyObject* obj)
+    : Buffer(nullptr, 0) {
+  if (PyObject_CheckBuffer(obj)) {
+    obj_ = PyMemoryView_FromObject(obj);
+    Py_buffer* buffer = PyMemoryView_GET_BUFFER(obj_);
+    data_ = reinterpret_cast<const uint8_t*>(buffer->buf);
+    size_ = buffer->len;
+    capacity_ = buffer->len;
+    is_mutable_ = false;
+    Py_INCREF(obj_);
+  }
 }
 
-PyBytesBuffer::~PyBytesBuffer() {
-  PyAcquireGIL lock;
-  Py_DECREF(obj_);
+PyBuffer::~PyBuffer() {
+  PyAcquireGIL lock;
+  Py_DECREF(obj_);
 }
 
 }  // namespace py
diff --git a/python/src/pyarrow/common.h b/python/src/pyarrow/common.h
index 0b4c6bebcfe79..b4e4ea6d2b908 100644
--- a/python/src/pyarrow/common.h
+++ b/python/src/pyarrow/common.h
@@ -118,10 +118,14 @@ class ARROW_EXPORT NumPyBuffer : public Buffer {
   PyArrayObject* arr_;
 };
 
-class ARROW_EXPORT PyBytesBuffer : public Buffer {
+class ARROW_EXPORT PyBuffer : public Buffer {
  public:
-  PyBytesBuffer(PyObject* obj);
-  ~PyBytesBuffer();
+  /// Note that the GIL must be held when calling the PyBuffer constructor.
+  ///
+  /// While memoryview objects support multi-dimensional buffers, PyBuffer only
+  /// supports one-dimensional byte buffers.
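+  ///
+  /// A minimal usage sketch (illustrative only; `obj` is assumed to be any
+  /// PyObject* exporting the buffer protocol):
+  ///
+  ///   PyAcquireGIL lock;            // the constructor requires the GIL
+  ///   auto buf = std::make_shared<PyBuffer>(obj);
+  ///   // buf->data() now aliases the Python object's memory (zero-copy);
+  ///   // the memoryview taken in the constructor keeps that memory alive.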
+ PyBuffer(PyObject* obj); + ~PyBuffer(); private: PyObject* obj_; diff --git a/python/src/pyarrow/io.cc b/python/src/pyarrow/io.cc index 0aa61dc811f5c..c66155b946a64 100644 --- a/python/src/pyarrow/io.cc +++ b/python/src/pyarrow/io.cc @@ -156,7 +156,7 @@ Status PyReadableFile::Read(int64_t nbytes, std::shared_ptr* out) { PyObject* bytes_obj; ARROW_RETURN_NOT_OK(file_->Read(nbytes, &bytes_obj)); - *out = std::make_shared(bytes_obj); + *out = std::make_shared(bytes_obj); Py_DECREF(bytes_obj); return Status::OK(); @@ -210,10 +210,10 @@ Status PyOutputStream::Write(const uint8_t* data, int64_t nbytes) { } // ---------------------------------------------------------------------- -// A readable file that is backed by a PyBytes +// A readable file that is backed by a PyBuffer PyBytesReader::PyBytesReader(PyObject* obj) - : io::BufferReader(std::make_shared(obj)) {} + : io::BufferReader(std::make_shared(obj)) {} PyBytesReader::~PyBytesReader() {} diff --git a/python/src/pyarrow/io.h b/python/src/pyarrow/io.h index e38cd81775255..89af60926ad94 100644 --- a/python/src/pyarrow/io.h +++ b/python/src/pyarrow/io.h @@ -84,7 +84,7 @@ class ARROW_EXPORT PyOutputStream : public io::OutputStream { std::unique_ptr file_; }; -// A zero-copy reader backed by a PyBytes object +// A zero-copy reader backed by a PyBuffer object class ARROW_EXPORT PyBytesReader : public io::BufferReader { public: explicit PyBytesReader(PyObject* obj); From ab848f0eab053eeea62d1cf0c0f285db6460da54 Mon Sep 17 00:00:00 2001 From: Jeff Knupp Date: Sun, 26 Mar 2017 09:19:44 +0200 Subject: [PATCH 0420/1644] ARROW-713: [C++] Fix cmake linking issue in new IPC benchmark Author: Jeff Knupp Closes #444 from jeffknupp/master and squashes the following commits: 37aa10f [Jeff Knupp] [C++] ARROW-713: Fix cmake linking issue in new IPC benchmark --- cpp/src/arrow/ipc/CMakeLists.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index d6ee9309b44d8..030cba93f5fc0 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -173,5 +173,5 @@ if (ARROW_BUILD_UTILITIES) endif() ADD_ARROW_BENCHMARK(ipc-read-write-benchmark) -ARROW_TEST_LINK_LIBRARIES(ipc-read-write-benchmark +ARROW_BENCHMARK_LINK_LIBRARIES(ipc-read-write-benchmark ${ARROW_IPC_TEST_LINK_LIBS}) From fd876697fc37a270a978117f020bf9e07a6c1bad Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 26 Mar 2017 09:21:15 +0200 Subject: [PATCH 0421/1644] ARROW-684: [Python] More helpful error message if libparquet_arrow not built Author: Wes McKinney Closes #443 from wesm/ARROW-684 and squashes the following commits: c18ca81 [Wes McKinney] More helpful error message if libparquet_arrow not built --- python/cmake_modules/FindParquet.cmake | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/python/cmake_modules/FindParquet.cmake b/python/cmake_modules/FindParquet.cmake index 7445e0919acb6..a20b651e2b3c6 100644 --- a/python/cmake_modules/FindParquet.cmake +++ b/python/cmake_modules/FindParquet.cmake @@ -68,13 +68,21 @@ else () set(PARQUET_ARROW_FOUND FALSE) endif () -if (PARQUET_FOUND) +if (PARQUET_FOUND AND PARQUET_ARROW_FOUND) if (NOT Parquet_FIND_QUIETLY) message(STATUS "Found the Parquet library: ${PARQUET_LIBRARIES}") + message(STATUS "Found the Parquet Arrow library: ${PARQUET_ARROW_LIBS}") endif () else () if (NOT Parquet_FIND_QUIETLY) - set(PARQUET_ERR_MSG "Could not find the Parquet library. 
Looked in ") + if (NOT PARQUET_FOUND) + set(PARQUET_ERR_MSG "${PARQUET_ERR_MSG} Could not find the parquet library.") + endif() + + if (NOT PARQUET_ARROW_FOUND) + set(PARQUET_ERR_MSG "${PARQUET_ERR_MSG} Could not find the parquet_arrow library. Did you build with -DPARQUET_ARROW=on?") + endif() + set(PARQUET_ERR_MSG "${PARQUET_ERR_MSG} Looked in ") if ( _parquet_roots ) set(PARQUET_ERR_MSG "${PARQUET_ERR_MSG} in ${_parquet_roots}.") else () @@ -88,12 +96,6 @@ else () endif () endif () -if (PARQUET_ARROW_FOUND) - if (NOT Parquet_FIND_QUIETLY) - message(STATUS "Found the Parquet Arrow library: ${PARQUET_ARROW_LIBS}") - endif() -endif() - mark_as_advanced( PARQUET_FOUND PARQUET_INCLUDE_DIR From 6d4e862902fd56c93ae394d6bc2938a1c4d1d949 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 26 Mar 2017 11:01:14 -0400 Subject: [PATCH 0422/1644] ARROW-712: [C++] Reimplement Array::Accept as inline visitor Author: Wes McKinney Closes #442 from wesm/more-inline-visitors and squashes the following commits: 69af01a [Wes McKinney] Remove unused member 362dd7e [Wes McKinney] cpplint 1e56f0f [Wes McKinney] Reimplement Array::Accept as inline visitor --- cpp/src/arrow/array.cc | 58 +++++++++++++----------------------------- cpp/src/arrow/array.h | 22 +--------------- 2 files changed, 18 insertions(+), 62 deletions(-) diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index f1c8bd42c476d..cff0126647647 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -29,6 +29,7 @@ #include "arrow/util/bit-util.h" #include "arrow/util/logging.h" #include "arrow/visitor.h" +#include "arrow/visitor_inline.h" namespace arrow { @@ -103,6 +104,10 @@ bool Array::RangeEquals(const Array& other, int64_t start_idx, int64_t end_idx, return are_equal; } +Status Array::Validate() const { + return Status::OK(); +} + // Last two parameters are in-out parameters static inline void ConformSliceParams( int64_t array_offset, int64_t array_length, int64_t* offset, int64_t* length) { @@ -117,10 +122,6 @@ std::shared_ptr Array::Slice(int64_t offset) const { return Slice(offset, slice_length); } -Status Array::Validate() const { - return Status::OK(); -} - NullArray::NullArray(int64_t length) : Array(null(), length, nullptr, length) {} std::shared_ptr NullArray::Slice(int64_t offset, int64_t length) const { @@ -426,47 +427,22 @@ std::shared_ptr DictionaryArray::Slice(int64_t offset, int64_t length) co } // ---------------------------------------------------------------------- -// Implement ArrayVisitor methods - -Status NullArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} +// Implement Array::Accept as inline visitor -Status BooleanArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} +struct AcceptVirtualVisitor { + explicit AcceptVirtualVisitor(ArrayVisitor* visitor) : visitor(visitor) {} -template -Status NumericArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} + ArrayVisitor* visitor; -Status BinaryArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} - -Status StringArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} - -Status FixedWidthBinaryArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} - -Status ListArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} - -Status StructArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); -} - -Status UnionArray::Accept(ArrayVisitor* visitor) const { - 
return visitor->Visit(*this); -} + template + Status Visit(const T& array) { + return visitor->Visit(array); + } +}; -Status DictionaryArray::Accept(ArrayVisitor* visitor) const { - return visitor->Visit(*this); +Status Array::Accept(ArrayVisitor* visitor) const { + AcceptVirtualVisitor inline_visitor(visitor); + return VisitArrayInline(*this, visitor); } // ---------------------------------------------------------------------- diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index c73b7a87a4f50..cc0cf98092a8c 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -110,7 +110,7 @@ class ARROW_EXPORT Array { /// Defaults to always returning Status::OK. This can be an expensive check. virtual Status Validate() const; - virtual Status Accept(ArrayVisitor* visitor) const = 0; + Status Accept(ArrayVisitor* visitor) const; /// Construct a zero-copy slice of the array with the indicated offset and /// length @@ -151,8 +151,6 @@ class ARROW_EXPORT NullArray : public Array { explicit NullArray(int64_t length); - Status Accept(ArrayVisitor* visitor) const override; - std::shared_ptr Slice(int64_t offset, int64_t length) const override; }; @@ -196,8 +194,6 @@ class ARROW_EXPORT NumericArray : public PrimitiveArray { return reinterpret_cast(raw_data_) + offset_; } - Status Accept(ArrayVisitor* visitor) const override; - std::shared_ptr Slice(int64_t offset, int64_t length) const override; value_type Value(int64_t i) const { return raw_data()[i]; } @@ -213,8 +209,6 @@ class ARROW_EXPORT BooleanArray : public PrimitiveArray { const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, int64_t offset = 0); - Status Accept(ArrayVisitor* visitor) const override; - std::shared_ptr Slice(int64_t offset, int64_t length) const override; bool Value(int64_t i) const { @@ -262,8 +256,6 @@ class ARROW_EXPORT ListArray : public Array { return raw_value_offsets_[i + 1] - raw_value_offsets_[i]; } - Status Accept(ArrayVisitor* visitor) const override; - std::shared_ptr Slice(int64_t offset, int64_t length) const override; protected: @@ -313,8 +305,6 @@ class ARROW_EXPORT BinaryArray : public Array { Status Validate() const override; - Status Accept(ArrayVisitor* visitor) const override; - std::shared_ptr Slice(int64_t offset, int64_t length) const override; protected: @@ -351,8 +341,6 @@ class ARROW_EXPORT StringArray : public BinaryArray { Status Validate() const override; - Status Accept(ArrayVisitor* visitor) const override; - std::shared_ptr Slice(int64_t offset, int64_t length) const override; }; @@ -379,8 +367,6 @@ class ARROW_EXPORT FixedWidthBinaryArray : public Array { const uint8_t* raw_data() const { return raw_data_; } - Status Accept(ArrayVisitor* visitor) const override; - std::shared_ptr Slice(int64_t offset, int64_t length) const override; protected: @@ -409,8 +395,6 @@ class ARROW_EXPORT StructArray : public Array { const std::vector>& fields() const { return children_; } - Status Accept(ArrayVisitor* visitor) const override; - std::shared_ptr Slice(int64_t offset, int64_t length) const override; protected: @@ -450,8 +434,6 @@ class ARROW_EXPORT UnionArray : public Array { const std::vector>& children() const { return children_; } - Status Accept(ArrayVisitor* visitor) const override; - std::shared_ptr Slice(int64_t offset, int64_t length) const override; protected: @@ -496,8 +478,6 @@ class ARROW_EXPORT DictionaryArray : public Array { const DictionaryType* dict_type() { return dict_type_; } - Status Accept(ArrayVisitor* visitor) const override; - std::shared_ptr 
Slice(int64_t offset, int64_t length) const override; protected: From 3aac4adef11345f211e4c66467ff758cbc397e43 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 26 Mar 2017 11:45:38 -0400 Subject: [PATCH 0423/1644] ARROW-341: [Python] Move pyarrow's C++ code to the main C++ source tree, install libarrow_python and headers This will enable third parties to link to `libarrow_python`. For now, the pyarrow build system continues to use CMake -- for the purpose of resolving the thirdparty toolchain we may or may not want to go completely to distutils, but we can sort that out later. Author: Wes McKinney Closes #440 from wesm/ARROW-341 and squashes the following commits: 193bc51 [Wes McKinney] Ensure that '-undefined dynamic_lookup' is passed when linking shared library on OS X a93496b [Wes McKinney] Add missing backslash 7620f50 [Wes McKinney] Fix cpplint issues 0617c69 [Wes McKinney] Fix LD_LIBRARY_PATH, ARROW_HOME 090c78c [Wes McKinney] Build Arrow library stack specific to active Python version 10e4626 [Wes McKinney] Get Python test suite passing again cfb7f44 [Wes McKinney] Remove print statement c1e63dc [Wes McKinney] Scrubbing python/CMakeLists.txt b80b153 [Wes McKinney] Cleanup, build pandas-test within main test suite 7ef1f81 [Wes McKinney] Start moving python/src/pyarrow tp cpp/src/arrow/python --- ci/travis_script_python.sh | 26 ++- cpp/CMakeLists.txt | 115 ++++------ cpp/cmake_modules/BuildUtils.cmake | 88 ++++++- {python => cpp}/cmake_modules/FindNumPy.cmake | 0 .../cmake_modules/FindPythonLibsNew.cmake | 0 cpp/src/arrow/python/CMakeLists.txt | 93 ++++++++ .../pyarrow => cpp/src/arrow/python}/api.h | 15 +- .../src/arrow/python/builtin_convert.cc | 6 +- .../src/arrow/python/builtin_convert.h | 8 +- .../src/arrow/python}/common.cc | 35 ++- .../pyarrow => cpp/src/arrow/python}/common.h | 18 +- .../src/arrow/python}/config.cc | 6 +- .../pyarrow => cpp/src/arrow/python}/config.h | 13 +- .../src/arrow/python}/do_import_numpy.h | 0 .../src/arrow/python}/helpers.cc | 2 +- .../src/arrow/python}/helpers.h | 0 .../pyarrow => cpp/src/arrow/python}/io.cc | 7 +- .../src/pyarrow => cpp/src/arrow/python}/io.h | 6 +- .../src/arrow/python}/numpy_interop.h | 2 +- .../src/arrow/python}/pandas-test.cc | 4 +- .../src/arrow/python/pandas_convert.cc | 26 +-- .../src/arrow/python/pandas_convert.h | 6 +- .../src/arrow/python}/type_traits.h | 3 +- .../src/arrow/python}/util/CMakeLists.txt | 10 +- .../src/arrow/python}/util/datetime.h | 0 .../src/arrow/python}/util/test_main.cc | 4 +- python/CMakeLists.txt | 215 ++---------------- python/cmake_modules/FindArrow.cmake | 9 + python/pyarrow/config.pyx | 14 +- python/pyarrow/includes/pyarrow.pxd | 6 +- python/setup.py | 11 +- python/src/pyarrow/CMakeLists.txt | 22 -- 32 files changed, 359 insertions(+), 411 deletions(-) rename {python => cpp}/cmake_modules/FindNumPy.cmake (100%) rename {python => cpp}/cmake_modules/FindPythonLibsNew.cmake (100%) create mode 100644 cpp/src/arrow/python/CMakeLists.txt rename {python/src/pyarrow => cpp/src/arrow/python}/api.h (75%) rename python/src/pyarrow/adapters/builtin.cc => cpp/src/arrow/python/builtin_convert.cc (99%) rename python/src/pyarrow/adapters/builtin.h => cpp/src/arrow/python/builtin_convert.h (90%) rename {python/src/pyarrow => cpp/src/arrow/python}/common.cc (69%) rename {python/src/pyarrow => cpp/src/arrow/python}/common.h (90%) rename {python/src/pyarrow => cpp/src/arrow/python}/config.cc (91%) rename {python/src/pyarrow => cpp/src/arrow/python}/config.h (85%) rename {python/src/pyarrow => 
cpp/src/arrow/python}/do_import_numpy.h (100%) rename {python/src/pyarrow => cpp/src/arrow/python}/helpers.cc (98%) rename {python/src/pyarrow => cpp/src/arrow/python}/helpers.h (100%) rename {python/src/pyarrow => cpp/src/arrow/python}/io.cc (98%) rename {python/src/pyarrow => cpp/src/arrow/python}/io.h (96%) rename {python/src/pyarrow => cpp/src/arrow/python}/numpy_interop.h (97%) rename {python/src/pyarrow/adapters => cpp/src/arrow/python}/pandas-test.cc (95%) rename python/src/pyarrow/adapters/pandas.cc => cpp/src/arrow/python/pandas_convert.cc (99%) rename python/src/pyarrow/adapters/pandas.h => cpp/src/arrow/python/pandas_convert.h (95%) rename {python/src/pyarrow => cpp/src/arrow/python}/type_traits.h (99%) rename {python/src/pyarrow => cpp/src/arrow/python}/util/CMakeLists.txt (83%) rename {python/src/pyarrow => cpp/src/arrow/python}/util/datetime.h (100%) rename {python/src/pyarrow => cpp/src/arrow/python}/util/test_main.cc (92%) delete mode 100644 python/src/pyarrow/CMakeLists.txt diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index 6f4b8e9a090a7..df11209e7c49b 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -23,7 +23,6 @@ export MINICONDA=$HOME/miniconda export PATH="$MINICONDA/bin:$PATH" export ARROW_HOME=$ARROW_CPP_INSTALL -export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ARROW_CPP_INSTALL/lib pushd $PYTHON_DIR export PARQUET_HOME=$TRAVIS_BUILD_DIR/parquet-env @@ -70,11 +69,31 @@ build_parquet_cpp() { build_parquet_cpp -export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PARQUET_HOME/lib +function build_arrow_libraries() { + CPP_BUILD_DIR=$1 + CPP_DIR=$TRAVIS_BUILD_DIR/cpp + + mkdir $CPP_BUILD_DIR + pushd $CPP_BUILD_DIR + + cmake -DARROW_BUILD_TESTS=off \ + -DARROW_PYTHON=on \ + -DCMAKE_INSTALL_PREFIX=$2 \ + $CPP_DIR + + make -j4 + make install + + popd +} python_version_tests() { PYTHON_VERSION=$1 CONDA_ENV_DIR=$TRAVIS_BUILD_DIR/pyarrow-test-$PYTHON_VERSION + + export ARROW_HOME=$TRAVIS_BUILD_DIR/arrow-install-$PYTHON_VERSION + export LD_LIBRARY_PATH=$ARROW_HOME/lib:$PARQUET_HOME/lib + conda create -y -q -p $CONDA_ENV_DIR python=$PYTHON_VERSION source activate $CONDA_ENV_DIR @@ -87,6 +106,9 @@ python_version_tests() { # Expensive dependencies install from Continuum package repo conda install -y pip numpy pandas cython + # Build C++ libraries + build_arrow_libraries arrow-build-$PYTHON_VERSION $ARROW_HOME + # Other stuff pip install pip install -r requirements.txt diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index c04afe47030a5..c77cf601cbd46 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -106,6 +106,10 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") "Rely on boost shared libraries where relevant" ON) + option(ARROW_PYTHON + "Build the Arrow CPython extensions" + OFF) + option(ARROW_SSE3 "Build Arrow with SSE3" ON) @@ -133,6 +137,7 @@ if(NOT ARROW_BUILD_BENCHMARKS) set(NO_BENCHMARKS 1) endif() +include(BuildUtils) ############################################################ # Compiler flags @@ -303,6 +308,14 @@ endfunction() # # Arguments after the test name will be passed to set_tests_properties(). 
function(ADD_ARROW_TEST REL_TEST_NAME) + set(options) + set(single_value_args) + set(multi_value_args STATIC_LINK_LIBS) + cmake_parse_arguments(ARG "${options}" "${one_value_args}" "${multi_value_args}" ${ARGN}) + if(ARG_UNPARSED_ARGUMENTS) + message(SEND_ERROR "Error: unrecognized arguments: ${ARG_UNPARSED_ARGUMENTS}") + endif() + if(NO_TESTS OR NOT ARROW_BUILD_STATIC) return() endif() @@ -312,7 +325,13 @@ function(ADD_ARROW_TEST REL_TEST_NAME) # This test has a corresponding .cc file, set it up as an executable. set(TEST_PATH "${EXECUTABLE_OUTPUT_PATH}/${TEST_NAME}") add_executable(${TEST_NAME} "${REL_TEST_NAME}.cc") - target_link_libraries(${TEST_NAME} ${ARROW_TEST_LINK_LIBS}) + + if (ARG_STATIC_LINK_LIBS) + # Customize link libraries + target_link_libraries(${TEST_NAME} ${ARG_STATIC_LINK_LIBS}) + else() + target_link_libraries(${TEST_NAME} ${ARROW_TEST_LINK_LIBS}) + endif() add_dependencies(unittest ${TEST_NAME}) else() # No executable, just invoke the test (probably a script) directly. @@ -332,10 +351,6 @@ function(ADD_ARROW_TEST REL_TEST_NAME) ${BUILD_SUPPORT_DIR}/run-test.sh ${CMAKE_BINARY_DIR} test ${TEST_PATH}) endif() set_tests_properties(${TEST_NAME} PROPERTIES LABELS "unittest") - - if(ARGN) - set_tests_properties(${TEST_NAME} PROPERTIES ${ARGN}) - endif() endfunction() # A wrapper for add_dependencies() that is compatible with NO_TESTS. @@ -363,72 +378,6 @@ enable_testing() ############################################################ # Dependencies ############################################################ -function(ADD_THIRDPARTY_LIB LIB_NAME) - set(options) - set(one_value_args SHARED_LIB STATIC_LIB) - set(multi_value_args DEPS) - cmake_parse_arguments(ARG "${options}" "${one_value_args}" "${multi_value_args}" ${ARGN}) - if(ARG_UNPARSED_ARGUMENTS) - message(SEND_ERROR "Error: unrecognized arguments: ${ARG_UNPARSED_ARGUMENTS}") - endif() - - if(ARG_STATIC_LIB AND ARG_SHARED_LIB) - if(NOT ARG_STATIC_LIB) - message(FATAL_ERROR "No static or shared library provided for ${LIB_NAME}") - endif() - - SET(AUG_LIB_NAME "${LIB_NAME}_static") - add_library(${AUG_LIB_NAME} STATIC IMPORTED) - set_target_properties(${AUG_LIB_NAME} - PROPERTIES IMPORTED_LOCATION "${ARG_STATIC_LIB}") - message("Added static library dependency ${LIB_NAME}: ${ARG_STATIC_LIB}") - - SET(AUG_LIB_NAME "${LIB_NAME}_shared") - add_library(${AUG_LIB_NAME} SHARED IMPORTED) - - if(MSVC) - # Mark the ”.lib” location as part of a Windows DLL - set_target_properties(${AUG_LIB_NAME} - PROPERTIES IMPORTED_IMPLIB "${ARG_SHARED_LIB}") - else() - set_target_properties(${AUG_LIB_NAME} - PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}") - endif() - message("Added shared library dependency ${LIB_NAME}: ${ARG_SHARED_LIB}") - elseif(ARG_STATIC_LIB) - add_library(${LIB_NAME} STATIC IMPORTED) - set_target_properties(${LIB_NAME} - PROPERTIES IMPORTED_LOCATION "${ARG_STATIC_LIB}") - SET(AUG_LIB_NAME "${LIB_NAME}_static") - add_library(${AUG_LIB_NAME} STATIC IMPORTED) - set_target_properties(${AUG_LIB_NAME} - PROPERTIES IMPORTED_LOCATION "${ARG_STATIC_LIB}") - message("Added static library dependency ${LIB_NAME}: ${ARG_STATIC_LIB}") - elseif(ARG_SHARED_LIB) - add_library(${LIB_NAME} SHARED IMPORTED) - set_target_properties(${LIB_NAME} - PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}") - SET(AUG_LIB_NAME "${LIB_NAME}_shared") - add_library(${AUG_LIB_NAME} SHARED IMPORTED) - - if(MSVC) - # Mark the ”.lib” location as part of a Windows DLL - set_target_properties(${AUG_LIB_NAME} - PROPERTIES IMPORTED_IMPLIB "${ARG_SHARED_LIB}") - else() 
- set_target_properties(${AUG_LIB_NAME} - PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}") - endif() - message("Added shared library dependency ${LIB_NAME}: ${ARG_SHARED_LIB}") - else() - message(FATAL_ERROR "No static or shared library provided for ${LIB_NAME}") - endif() - - if(ARG_DEPS) - set_target_properties(${LIB_NAME} - PROPERTIES IMPORTED_LINK_INTERFACE_LIBRARIES "${ARG_DEPS}") - endif() -endfunction() # ---------------------------------------------------------------------- # Add Boost dependencies (code adapted from Apache Kudu (incubating)) @@ -798,8 +747,7 @@ if (${CLANG_FORMAT_FOUND}) add_custom_target(format ${BUILD_SUPPORT_DIR}/run-clang-format.sh ${CMAKE_CURRENT_SOURCE_DIR} ${CLANG_FORMAT_BIN} 1 `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc -or -name \\*.h | sed -e '/_generated/g' | - sed -e '/windows_compatibility.h/g'` - `find ${CMAKE_CURRENT_SOURCE_DIR}/../python -name \\*.cc -or -name \\*.h`) + sed -e '/windows_compatibility.h/g'`) # runs clang format and exits with a non-zero exit code if any files need to be reformatted add_custom_target(check-format ${BUILD_SUPPORT_DIR}/run-clang-format.sh ${CMAKE_CURRENT_SOURCE_DIR} ${CLANG_FORMAT_BIN} 0 @@ -857,11 +805,9 @@ if(NOT APPLE) set(ARROW_SHARED_LINK_FLAGS "-Wl,--version-script=${CMAKE_CURRENT_SOURCE_DIR}/src/arrow/symbols.map") endif() -include(BuildUtils) - ADD_ARROW_LIB(arrow - SOURCES ${ARROW_SRCS} - SHARED_LINK_FLAGS ${ARROW_SHARED_LINK_FLAGS} + SOURCES ${ARROW_SRCS} + SHARED_LINK_FLAGS ${ARROW_SHARED_LINK_FLAGS} ) add_subdirectory(src/arrow) @@ -875,6 +821,10 @@ endif() #---------------------------------------------------------------------- # IPC library +if(ARROW_PYTHON) + set(ARROW_IPC on) +endif() + ## Flatbuffers if(ARROW_IPC) if("$ENV{FLATBUFFERS_HOME}" STREQUAL "") @@ -908,3 +858,14 @@ if(ARROW_IPC) add_subdirectory(src/arrow/ipc) endif() + +if(ARROW_PYTHON) + find_package(PythonLibsNew REQUIRED) + find_package(NumPy REQUIRED) + + include_directories(SYSTEM + ${NUMPY_INCLUDE_DIRS} + ${PYTHON_INCLUDE_DIRS}) + + add_subdirectory(src/arrow/python) +endif() diff --git a/cpp/cmake_modules/BuildUtils.cmake b/cpp/cmake_modules/BuildUtils.cmake index 78b514c2295ae..c9930418185c7 100644 --- a/cpp/cmake_modules/BuildUtils.cmake +++ b/cpp/cmake_modules/BuildUtils.cmake @@ -15,6 +15,73 @@ # specific language governing permissions and limitations # under the License. 
+function(ADD_THIRDPARTY_LIB LIB_NAME) + set(options) + set(one_value_args SHARED_LIB STATIC_LIB) + set(multi_value_args DEPS) + cmake_parse_arguments(ARG "${options}" "${one_value_args}" "${multi_value_args}" ${ARGN}) + if(ARG_UNPARSED_ARGUMENTS) + message(SEND_ERROR "Error: unrecognized arguments: ${ARG_UNPARSED_ARGUMENTS}") + endif() + + if(ARG_STATIC_LIB AND ARG_SHARED_LIB) + if(NOT ARG_STATIC_LIB) + message(FATAL_ERROR "No static or shared library provided for ${LIB_NAME}") + endif() + + SET(AUG_LIB_NAME "${LIB_NAME}_static") + add_library(${AUG_LIB_NAME} STATIC IMPORTED) + set_target_properties(${AUG_LIB_NAME} + PROPERTIES IMPORTED_LOCATION "${ARG_STATIC_LIB}") + message("Added static library dependency ${LIB_NAME}: ${ARG_STATIC_LIB}") + + SET(AUG_LIB_NAME "${LIB_NAME}_shared") + add_library(${AUG_LIB_NAME} SHARED IMPORTED) + + if(MSVC) + # Mark the ”.lib” location as part of a Windows DLL + set_target_properties(${AUG_LIB_NAME} + PROPERTIES IMPORTED_IMPLIB "${ARG_SHARED_LIB}") + else() + set_target_properties(${AUG_LIB_NAME} + PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}") + endif() + message("Added shared library dependency ${LIB_NAME}: ${ARG_SHARED_LIB}") + elseif(ARG_STATIC_LIB) + add_library(${LIB_NAME} STATIC IMPORTED) + set_target_properties(${LIB_NAME} + PROPERTIES IMPORTED_LOCATION "${ARG_STATIC_LIB}") + SET(AUG_LIB_NAME "${LIB_NAME}_static") + add_library(${AUG_LIB_NAME} STATIC IMPORTED) + set_target_properties(${AUG_LIB_NAME} + PROPERTIES IMPORTED_LOCATION "${ARG_STATIC_LIB}") + message("Added static library dependency ${LIB_NAME}: ${ARG_STATIC_LIB}") + elseif(ARG_SHARED_LIB) + add_library(${LIB_NAME} SHARED IMPORTED) + set_target_properties(${LIB_NAME} + PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}") + SET(AUG_LIB_NAME "${LIB_NAME}_shared") + add_library(${AUG_LIB_NAME} SHARED IMPORTED) + + if(MSVC) + # Mark the ”.lib” location as part of a Windows DLL + set_target_properties(${AUG_LIB_NAME} + PROPERTIES IMPORTED_IMPLIB "${ARG_SHARED_LIB}") + else() + set_target_properties(${AUG_LIB_NAME} + PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}") + endif() + message("Added shared library dependency ${LIB_NAME}: ${ARG_SHARED_LIB}") + else() + message(FATAL_ERROR "No static or shared library provided for ${LIB_NAME}") + endif() + + if(ARG_DEPS) + set_target_properties(${LIB_NAME} + PROPERTIES IMPORTED_LINK_INTERFACE_LIBRARIES "${ARG_DEPS}") + endif() +endfunction() + function(ADD_ARROW_LIB LIB_NAME) set(options) set(one_value_args SHARED_LINK_FLAGS) @@ -45,9 +112,16 @@ function(ADD_ARROW_LIB LIB_NAME) if (ARROW_BUILD_SHARED) add_library(${LIB_NAME}_shared SHARED $) + if(APPLE) - set_target_properties(${LIB_NAME}_shared PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") + # On OS X, you can avoid linking at library load time and instead + # expecting that the symbols have been loaded separately. 
This happens + # with libpython* where there can be conflicts between system Python and + # the Python from a thirdparty distribution + set(ARG_SHARED_LINK_FLAGS + "-undefined dynamic_lookup ${ARG_SHARED_LINK_FLAGS}") endif() + set_target_properties(${LIB_NAME}_shared PROPERTIES LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}" @@ -55,6 +129,7 @@ function(ADD_ARROW_LIB LIB_NAME) OUTPUT_NAME ${LIB_NAME} VERSION "${ARROW_ABI_VERSION}" SOVERSION "${ARROW_SO_VERSION}") + target_link_libraries(${LIB_NAME}_shared LINK_PUBLIC ${ARG_SHARED_LINK_LIBS} LINK_PRIVATE ${ARG_SHARED_PRIVATE_LINK_LIBS}) @@ -68,28 +143,28 @@ function(ADD_ARROW_LIB LIB_NAME) set_target_properties(${LIB_NAME}_shared PROPERTIES INSTALL_RPATH ${_lib_install_rpath}) endif() - + install(TARGETS ${LIB_NAME}_shared LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR} ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR}) endif() - + if (ARROW_BUILD_STATIC) add_library(${LIB_NAME}_static STATIC $) set_target_properties(${LIB_NAME}_static PROPERTIES LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}" OUTPUT_NAME ${LIB_NAME}) - + target_link_libraries(${LIB_NAME}_static LINK_PUBLIC ${ARG_STATIC_LINK_LIBS} LINK_PRIVATE ${ARG_STATIC_PRIVATE_LINK_LIBS}) - + install(TARGETS ${LIB_NAME}_static LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR} ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR}) endif() - + if (APPLE) set_target_properties(${LIB_NAME}_shared PROPERTIES @@ -98,4 +173,3 @@ function(ADD_ARROW_LIB LIB_NAME) endif() endfunction() - diff --git a/python/cmake_modules/FindNumPy.cmake b/cpp/cmake_modules/FindNumPy.cmake similarity index 100% rename from python/cmake_modules/FindNumPy.cmake rename to cpp/cmake_modules/FindNumPy.cmake diff --git a/python/cmake_modules/FindPythonLibsNew.cmake b/cpp/cmake_modules/FindPythonLibsNew.cmake similarity index 100% rename from python/cmake_modules/FindPythonLibsNew.cmake rename to cpp/cmake_modules/FindPythonLibsNew.cmake diff --git a/cpp/src/arrow/python/CMakeLists.txt b/cpp/src/arrow/python/CMakeLists.txt new file mode 100644 index 0000000000000..03f5afc624b34 --- /dev/null +++ b/cpp/src/arrow/python/CMakeLists.txt @@ -0,0 +1,93 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# + +####################################### +# arrow_python +####################################### + +if (ARROW_BUILD_TESTS) + add_library(arrow_python_test_main STATIC + util/test_main.cc) + + if (APPLE) + target_link_libraries(arrow_python_test_main + gtest + dl) + set_target_properties(arrow_python_test_main + PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") + else() + target_link_libraries(arrow_python_test_main + gtest + pthread + dl + ) + endif() +endif() + +set(ARROW_PYTHON_MIN_TEST_LIBS + arrow_python_test_main + arrow_python_static + arrow_ipc_static + arrow_io_static + arrow_static) + +if(NOT APPLE AND ARROW_BUILD_TESTS) + ADD_THIRDPARTY_LIB(python + SHARED_LIB "${PYTHON_LIBRARIES}") + list(APPEND ARROW_PYTHON_MIN_TEST_LIBS python) +endif() + +set(ARROW_PYTHON_TEST_LINK_LIBS ${ARROW_PYTHON_MIN_TEST_LIBS}) + +# ---------------------------------------------------------------------- + +set(ARROW_PYTHON_SRCS + builtin_convert.cc + common.cc + config.cc + helpers.cc + io.cc + pandas_convert.cc +) + +set(ARROW_PYTHON_SHARED_LINK_LIBS + arrow_io_shared + arrow_ipc_shared + arrow_shared +) + +ADD_ARROW_LIB(arrow_python + SOURCES ${ARROW_PYTHON_SRCS} + SHARED_LINK_FLAGS "" + SHARED_LINK_LIBS ${ARROW_PYTHON_SHARED_LINK_LIBS} + STATIC_LINK_LIBS ${ARROW_IO_SHARED_PRIVATE_LINK_LIBS} +) + +install(FILES + api.h + builtin_convert.h + common.h + config.h + do_import_numpy.h + helpers.h + io.h + numpy_interop.h + pandas_convert.h + type_traits.h + DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}/arrow/python") + +# set_target_properties(arrow_python_shared PROPERTIES +# INSTALL_RPATH "\$ORIGIN") + +if (ARROW_BUILD_TESTS) + ADD_ARROW_TEST(pandas-test + STATIC_LINK_LIBS "${ARROW_PYTHON_TEST_LINK_LIBS}") +endif() diff --git a/python/src/pyarrow/api.h b/cpp/src/arrow/python/api.h similarity index 75% rename from python/src/pyarrow/api.h rename to cpp/src/arrow/python/api.h index f65cc097f548f..f4f1c0cf9a5d6 100644 --- a/python/src/pyarrow/api.h +++ b/cpp/src/arrow/python/api.h @@ -15,12 +15,13 @@ // specific language governing permissions and limitations // under the License. 
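The commit message notes that installing libarrow_python enables third parties to link against it. A hypothetical consumer of the headers installed by the CMakeLists.txt above (the program and its link line, e.g. -larrow_python -larrow_io -larrow_ipc -larrow, are assumptions for illustration) would pull in the umbrella header being renamed here:

    #include <Python.h>
    #include <arrow/memory_pool.h>
    #include <arrow/python/api.h>

    int main() {
      Py_Initialize();
      // Route arrow::py allocations through Arrow's default pool, using
      // one of the helpers exported from arrow/python/common.h.
      arrow::py::set_default_memory_pool(arrow::default_memory_pool());
      Py_Finalize();
      return 0;
    }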
-#ifndef PYARROW_API_H -#define PYARROW_API_H +#ifndef ARROW_PYTHON_API_H +#define ARROW_PYTHON_API_H -#include "pyarrow/helpers.h" +#include "arrow/python/builtin_convert.h" +#include "arrow/python/common.h" +#include "arrow/python/helpers.h" +#include "arrow/python/io.h" +#include "arrow/python/pandas_convert.h" -#include "pyarrow/adapters/builtin.h" -#include "pyarrow/adapters/pandas.h" - -#endif // PYARROW_API_H +#endif // ARROW_PYTHON_API_H diff --git a/python/src/pyarrow/adapters/builtin.cc b/cpp/src/arrow/python/builtin_convert.cc similarity index 99% rename from python/src/pyarrow/adapters/builtin.cc rename to cpp/src/arrow/python/builtin_convert.cc index 06e098a80369e..9acccc149664b 100644 --- a/python/src/pyarrow/adapters/builtin.cc +++ b/cpp/src/arrow/python/builtin_convert.cc @@ -19,13 +19,13 @@ #include #include -#include "pyarrow/adapters/builtin.h" +#include "arrow/python/builtin_convert.h" #include "arrow/api.h" #include "arrow/status.h" -#include "pyarrow/helpers.h" -#include "pyarrow/util/datetime.h" +#include "arrow/python/helpers.h" +#include "arrow/python/util/datetime.h" namespace arrow { namespace py { diff --git a/python/src/pyarrow/adapters/builtin.h b/cpp/src/arrow/python/builtin_convert.h similarity index 90% rename from python/src/pyarrow/adapters/builtin.h rename to cpp/src/arrow/python/builtin_convert.h index 2d45e670628b5..7b50990dd557c 100644 --- a/python/src/pyarrow/adapters/builtin.h +++ b/cpp/src/arrow/python/builtin_convert.h @@ -18,8 +18,8 @@ // Functions for converting between CPython built-in data structures and Arrow // data structures -#ifndef PYARROW_ADAPTERS_BUILTIN_H -#define PYARROW_ADAPTERS_BUILTIN_H +#ifndef ARROW_PYTHON_ADAPTERS_BUILTIN_H +#define ARROW_PYTHON_ADAPTERS_BUILTIN_H #include @@ -29,7 +29,7 @@ #include "arrow/util/visibility.h" -#include "pyarrow/common.h" +#include "arrow/python/common.h" namespace arrow { @@ -51,4 +51,4 @@ Status ConvertPySequence(PyObject* obj, MemoryPool* pool, std::shared_ptr } // namespace py } // namespace arrow -#endif // PYARROW_ADAPTERS_BUILTIN_H +#endif // ARROW_PYTHON_ADAPTERS_BUILTIN_H diff --git a/python/src/pyarrow/common.cc b/cpp/src/arrow/python/common.cc similarity index 69% rename from python/src/pyarrow/common.cc rename to cpp/src/arrow/python/common.cc index 792aa4775d4f0..a5aea30884468 100644 --- a/python/src/pyarrow/common.cc +++ b/cpp/src/arrow/python/common.cc @@ -15,7 +15,7 @@ // specific language governing permissions and limitations // under the License. 
-#include "pyarrow/common.h" +#include "arrow/python/common.h" #include #include @@ -28,17 +28,17 @@ namespace arrow { namespace py { static std::mutex memory_pool_mutex; -static MemoryPool* default_pyarrow_pool = nullptr; +static MemoryPool* default_python_pool = nullptr; void set_default_memory_pool(MemoryPool* pool) { std::lock_guard guard(memory_pool_mutex); - default_pyarrow_pool = pool; + default_python_pool = pool; } MemoryPool* get_memory_pool() { std::lock_guard guard(memory_pool_mutex); - if (default_pyarrow_pool) { - return default_pyarrow_pool; + if (default_python_pool) { + return default_python_pool; } else { return default_memory_pool(); } @@ -47,22 +47,21 @@ MemoryPool* get_memory_pool() { // ---------------------------------------------------------------------- // PyBuffer -PyBuffer::PyBuffer(PyObject* obj) - : Buffer(nullptr, 0) { - if (PyObject_CheckBuffer(obj)) { - obj_ = PyMemoryView_FromObject(obj); - Py_buffer* buffer = PyMemoryView_GET_BUFFER(obj_); - data_ = reinterpret_cast(buffer->buf); - size_ = buffer->len; - capacity_ = buffer->len; - is_mutable_ = false; - Py_INCREF(obj_); - } +PyBuffer::PyBuffer(PyObject* obj) : Buffer(nullptr, 0) { + if (PyObject_CheckBuffer(obj)) { + obj_ = PyMemoryView_FromObject(obj); + Py_buffer* buffer = PyMemoryView_GET_BUFFER(obj_); + data_ = reinterpret_cast(buffer->buf); + size_ = buffer->len; + capacity_ = buffer->len; + is_mutable_ = false; + Py_INCREF(obj_); + } } PyBuffer::~PyBuffer() { - PyAcquireGIL lock; - Py_DECREF(obj_); + PyAcquireGIL lock; + Py_DECREF(obj_); } } // namespace py diff --git a/python/src/pyarrow/common.h b/cpp/src/arrow/python/common.h similarity index 90% rename from python/src/pyarrow/common.h rename to cpp/src/arrow/python/common.h index b4e4ea6d2b908..f1be471cd3a83 100644 --- a/python/src/pyarrow/common.h +++ b/cpp/src/arrow/python/common.h @@ -15,10 +15,12 @@ // specific language governing permissions and limitations // under the License. -#ifndef PYARROW_COMMON_H -#define PYARROW_COMMON_H +#ifndef ARROW_PYTHON_COMMON_H +#define ARROW_PYTHON_COMMON_H -#include "pyarrow/config.h" +#include + +#include "arrow/python/config.h" #include "arrow/buffer.h" #include "arrow/util/macros.h" @@ -47,7 +49,7 @@ class OwnedRef { public: OwnedRef() : obj_(nullptr) {} - OwnedRef(PyObject* obj) : obj_(obj) {} + explicit OwnedRef(PyObject* obj) : obj_(obj) {} ~OwnedRef() { PyAcquireGIL lock; @@ -71,7 +73,7 @@ struct PyObjectStringify { OwnedRef tmp_obj; const char* bytes; - PyObjectStringify(PyObject* obj) { + explicit PyObjectStringify(PyObject* obj) { PyObject* bytes_obj; if (PyUnicode_Check(obj)) { bytes_obj = PyUnicode_AsUTF8String(obj); @@ -103,7 +105,7 @@ ARROW_EXPORT MemoryPool* get_memory_pool(); class ARROW_EXPORT NumPyBuffer : public Buffer { public: - NumPyBuffer(PyArrayObject* arr) : Buffer(nullptr, 0) { + explicit NumPyBuffer(PyArrayObject* arr) : Buffer(nullptr, 0) { arr_ = arr; Py_INCREF(arr); @@ -124,7 +126,7 @@ class ARROW_EXPORT PyBuffer : public Buffer { /// /// While memoryview objects support multi-demensional buffers, PyBuffer only supports /// one-dimensional byte buffers. 
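As the destructor in the moved common.cc shows, a PyBuffer can be destroyed on a thread that does not currently hold the GIL, which is why its Py_DECREF is bracketed by a PyAcquireGIL guard. A standalone sketch of such a guard follows; the PyGILState pairing is an assumption about the mechanism, not code taken from this patch:

    #include <Python.h>

    // RAII guard equivalent in spirit to the PyAcquireGIL used above.
    class GILGuard {
     public:
      GILGuard() : state_(PyGILState_Ensure()) {}
      ~GILGuard() { PyGILState_Release(state_); }

     private:
      PyGILState_STATE state_;
    };

    // Safe to call from any thread once the interpreter is initialized.
    void SafeDecref(PyObject* obj) {
      GILGuard lock;
      Py_DECREF(obj);
    }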
- PyBuffer(PyObject* obj); + explicit PyBuffer(PyObject* obj); ~PyBuffer(); private: @@ -134,4 +136,4 @@ class ARROW_EXPORT PyBuffer : public Buffer { } // namespace py } // namespace arrow -#endif // PYARROW_COMMON_H +#endif // ARROW_PYTHON_COMMON_H diff --git a/python/src/pyarrow/config.cc b/cpp/src/arrow/python/config.cc similarity index 91% rename from python/src/pyarrow/config.cc rename to cpp/src/arrow/python/config.cc index 0be6d962b55ab..2abc4dda6ee17 100644 --- a/python/src/pyarrow/config.cc +++ b/cpp/src/arrow/python/config.cc @@ -17,16 +17,16 @@ #include -#include "pyarrow/config.h" +#include "arrow/python/config.h" namespace arrow { namespace py { -void pyarrow_init() {} +void Init() {} PyObject* numpy_nan = nullptr; -void pyarrow_set_numpy_nan(PyObject* obj) { +void set_numpy_nan(PyObject* obj) { Py_INCREF(obj); numpy_nan = obj; } diff --git a/python/src/pyarrow/config.h b/cpp/src/arrow/python/config.h similarity index 85% rename from python/src/pyarrow/config.h rename to cpp/src/arrow/python/config.h index 87fc5c2b290f6..dd554e05b9379 100644 --- a/python/src/pyarrow/config.h +++ b/cpp/src/arrow/python/config.h @@ -15,15 +15,14 @@ // specific language governing permissions and limitations // under the License. -#ifndef PYARROW_CONFIG_H -#define PYARROW_CONFIG_H +#ifndef ARROW_PYTHON_CONFIG_H +#define ARROW_PYTHON_CONFIG_H #include +#include "arrow/python/numpy_interop.h" #include "arrow/util/visibility.h" -#include "pyarrow/numpy_interop.h" - #if PY_MAJOR_VERSION >= 3 #define PyString_Check PyUnicode_Check #endif @@ -35,12 +34,12 @@ ARROW_EXPORT extern PyObject* numpy_nan; ARROW_EXPORT -void pyarrow_init(); +void Init(); ARROW_EXPORT -void pyarrow_set_numpy_nan(PyObject* obj); +void set_numpy_nan(PyObject* obj); } // namespace py } // namespace arrow -#endif // PYARROW_CONFIG_H +#endif // ARROW_PYTHON_CONFIG_H diff --git a/python/src/pyarrow/do_import_numpy.h b/cpp/src/arrow/python/do_import_numpy.h similarity index 100% rename from python/src/pyarrow/do_import_numpy.h rename to cpp/src/arrow/python/do_import_numpy.h diff --git a/python/src/pyarrow/helpers.cc b/cpp/src/arrow/python/helpers.cc similarity index 98% rename from python/src/pyarrow/helpers.cc rename to cpp/src/arrow/python/helpers.cc index 43edf8af17fa2..add2d9a222adf 100644 --- a/python/src/pyarrow/helpers.cc +++ b/cpp/src/arrow/python/helpers.cc @@ -15,7 +15,7 @@ // specific language governing permissions and limitations // under the License. -#include "pyarrow/helpers.h" +#include "arrow/python/helpers.h" #include diff --git a/python/src/pyarrow/helpers.h b/cpp/src/arrow/python/helpers.h similarity index 100% rename from python/src/pyarrow/helpers.h rename to cpp/src/arrow/python/helpers.h diff --git a/python/src/pyarrow/io.cc b/cpp/src/arrow/python/io.cc similarity index 98% rename from python/src/pyarrow/io.cc rename to cpp/src/arrow/python/io.cc index c66155b946a64..ba82a45411c4c 100644 --- a/python/src/pyarrow/io.cc +++ b/cpp/src/arrow/python/io.cc @@ -15,16 +15,17 @@ // specific language governing permissions and limitations // under the License. 
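The renamed entry points above (Init, set_numpy_nan) imply an initialization order that python/pyarrow/config.pyx, updated later in this patch, follows: import the NumPy C API once, initialize the bridge, then register NumPy's NaN sentinel for null detection. A sketch of the same sequence from C++, with error handling elided; the wrapper function is illustrative only:

    #include <arrow/python/do_import_numpy.h>  // defines NUMPY_IMPORT_ARRAY once
    #include <arrow/python/numpy_interop.h>
    #include <arrow/python/config.h>

    void InitArrowPythonBridge() {
      arrow::py::import_numpy();  // must precede any NumPy C API use
      arrow::py::Init();
      PyObject* numpy = PyImport_ImportModule("numpy");
      PyObject* nan = PyObject_GetAttrString(numpy, "nan");
      arrow::py::set_numpy_nan(nan);  // takes its own reference (Py_INCREF)
      Py_DECREF(nan);
      Py_DECREF(numpy);
    }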
-#include "pyarrow/io.h" +#include "arrow/python/io.h" #include #include +#include #include "arrow/io/memory.h" #include "arrow/memory_pool.h" #include "arrow/status.h" -#include "pyarrow/common.h" +#include "arrow/python/common.h" namespace arrow { namespace py { @@ -166,7 +167,7 @@ Status PyReadableFile::GetSize(int64_t* size) { PyAcquireGIL lock; int64_t current_position; - ; + ARROW_RETURN_NOT_OK(file_->Tell(¤t_position)); ARROW_RETURN_NOT_OK(file_->Seek(0, 2)); diff --git a/python/src/pyarrow/io.h b/cpp/src/arrow/python/io.h similarity index 96% rename from python/src/pyarrow/io.h rename to cpp/src/arrow/python/io.h index 89af60926ad94..905bd6c7a6aed 100644 --- a/python/src/pyarrow/io.h +++ b/cpp/src/arrow/python/io.h @@ -22,9 +22,9 @@ #include "arrow/io/memory.h" #include "arrow/util/visibility.h" -#include "pyarrow/config.h" +#include "arrow/python/config.h" -#include "pyarrow/common.h" +#include "arrow/python/common.h" namespace arrow { @@ -36,7 +36,7 @@ namespace py { // calling any methods class PythonFile { public: - PythonFile(PyObject* file); + explicit PythonFile(PyObject* file); ~PythonFile(); Status Close(); diff --git a/python/src/pyarrow/numpy_interop.h b/cpp/src/arrow/python/numpy_interop.h similarity index 97% rename from python/src/pyarrow/numpy_interop.h rename to cpp/src/arrow/python/numpy_interop.h index 57f3328e87078..0a4b425e734f7 100644 --- a/python/src/pyarrow/numpy_interop.h +++ b/cpp/src/arrow/python/numpy_interop.h @@ -34,7 +34,7 @@ // This is required to be able to access the NumPy C API properly in C++ files // other than this main one -#define PY_ARRAY_UNIQUE_SYMBOL pyarrow_ARRAY_API +#define PY_ARRAY_UNIQUE_SYMBOL arrow_ARRAY_API #ifndef NUMPY_IMPORT_ARRAY #define NO_IMPORT_ARRAY #endif diff --git a/python/src/pyarrow/adapters/pandas-test.cc b/cpp/src/arrow/python/pandas-test.cc similarity index 95% rename from python/src/pyarrow/adapters/pandas-test.cc rename to cpp/src/arrow/python/pandas-test.cc index e694e790a38d1..ae2527e19c00e 100644 --- a/python/src/pyarrow/adapters/pandas-test.cc +++ b/cpp/src/arrow/python/pandas-test.cc @@ -24,17 +24,17 @@ #include "arrow/array.h" #include "arrow/builder.h" +#include "arrow/python/pandas_convert.h" #include "arrow/schema.h" #include "arrow/table.h" #include "arrow/test-util.h" #include "arrow/type.h" -#include "pyarrow/adapters/pandas.h" namespace arrow { namespace py { TEST(PandasConversionTest, TestObjectBlockWriteFails) { - StringBuilder builder; + StringBuilder builder(default_memory_pool()); const char value[] = {'\xf1', '\0'}; for (int i = 0; i < 1000; ++i) { diff --git a/python/src/pyarrow/adapters/pandas.cc b/cpp/src/arrow/python/pandas_convert.cc similarity index 99% rename from python/src/pyarrow/adapters/pandas.cc rename to cpp/src/arrow/python/pandas_convert.cc index a7386cefcdbbf..f2c2415ed2793 100644 --- a/python/src/pyarrow/adapters/pandas.cc +++ b/cpp/src/arrow/python/pandas_convert.cc @@ -19,8 +19,8 @@ #include -#include "pyarrow/adapters/pandas.h" -#include "pyarrow/numpy_interop.h" +#include "arrow/python/numpy_interop.h" +#include "arrow/python/pandas_convert.h" #include #include @@ -32,10 +32,16 @@ #include #include #include +#include #include "arrow/array.h" #include "arrow/column.h" #include "arrow/loader.h" +#include "arrow/python/builtin_convert.h" +#include "arrow/python/common.h" +#include "arrow/python/config.h" +#include "arrow/python/type_traits.h" +#include "arrow/python/util/datetime.h" #include "arrow/status.h" #include "arrow/table.h" #include "arrow/type_fwd.h" @@ -43,12 +49,6 
@@ #include "arrow/util/bit-util.h" #include "arrow/util/macros.h" -#include "pyarrow/adapters/builtin.h" -#include "pyarrow/common.h" -#include "pyarrow/config.h" -#include "pyarrow/type_traits.h" -#include "pyarrow/util/datetime.h" - namespace arrow { namespace py { @@ -125,7 +125,7 @@ static int64_t ValuesToValidBytes( // TODO(wesm): striding for (int i = 0; i < length; ++i) { - valid_bytes[i] = not traits::isnull(values[i]); + valid_bytes[i] = !traits::isnull(values[i]); if (traits::isnull(values[i])) null_count++; } @@ -226,7 +226,7 @@ class PandasConverter : public TypeVisitor { type_(type), arr_(reinterpret_cast(ao)), mask_(nullptr) { - if (mo != nullptr and mo != Py_None) { mask_ = reinterpret_cast(mo); } + if (mo != nullptr && mo != Py_None) { mask_ = reinterpret_cast(mo); } length_ = PyArray_SIZE(arr_); } @@ -820,6 +820,7 @@ class PandasBlock { OwnedRef placement_arr_; int64_t* placement_data_; + private: DISALLOW_COPY_AND_ASSIGN(PandasBlock); }; @@ -947,7 +948,6 @@ inline Status ConvertListsLike( for (int c = 0; c < data.num_chunks(); c++) { auto arr = std::static_pointer_cast(data.chunk(c)); - const uint8_t* data_ptr; const bool has_nulls = data.null_count() > 0; for (int64_t i = 0; i < arr->length(); ++i) { if (has_nulls && arr->IsNull(i)) { @@ -1304,7 +1304,7 @@ class DatetimeTZBlock : public DatetimeBlock { template class CategoricalBlock : public PandasBlock { public: - CategoricalBlock(int64_t num_rows) : PandasBlock(num_rows, 1) {} + explicit CategoricalBlock(int64_t num_rows) : PandasBlock(num_rows, 1) {} Status Allocate() override { constexpr int npy_type = arrow_traits::npy_type; @@ -1432,7 +1432,7 @@ using BlockMap = std::unordered_map>; // * placement arrays as we go class DataFrameBlockCreator { public: - DataFrameBlockCreator(const std::shared_ptr
<Table>& table) : table_(table) {} + explicit DataFrameBlockCreator(const std::shared_ptr<Table>
& table) : table_(table) {} Status Convert(int nthreads, PyObject** output) { column_types_.resize(table_->num_columns()); diff --git a/python/src/pyarrow/adapters/pandas.h b/cpp/src/arrow/python/pandas_convert.h similarity index 95% rename from python/src/pyarrow/adapters/pandas.h rename to cpp/src/arrow/python/pandas_convert.h index 6862339d89baf..a33741efaa492 100644 --- a/python/src/pyarrow/adapters/pandas.h +++ b/cpp/src/arrow/python/pandas_convert.h @@ -18,8 +18,8 @@ // Functions for converting between pandas's NumPy-based data representation // and Arrow data structures -#ifndef PYARROW_ADAPTERS_PANDAS_H -#define PYARROW_ADAPTERS_PANDAS_H +#ifndef ARROW_PYTHON_ADAPTERS_PANDAS_H +#define ARROW_PYTHON_ADAPTERS_PANDAS_H #include @@ -76,4 +76,4 @@ Status PandasObjectsToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo, } // namespace py } // namespace arrow -#endif // PYARROW_ADAPTERS_PANDAS_H +#endif // ARROW_PYTHON_ADAPTERS_PANDAS_H diff --git a/python/src/pyarrow/type_traits.h b/cpp/src/arrow/python/type_traits.h similarity index 99% rename from python/src/pyarrow/type_traits.h rename to cpp/src/arrow/python/type_traits.h index cc65d5ceed9c1..f78dc360095dc 100644 --- a/python/src/pyarrow/type_traits.h +++ b/cpp/src/arrow/python/type_traits.h @@ -18,8 +18,9 @@ #include #include +#include -#include "pyarrow/numpy_interop.h" +#include "arrow/python/numpy_interop.h" #include "arrow/builder.h" #include "arrow/type.h" diff --git a/python/src/pyarrow/util/CMakeLists.txt b/cpp/src/arrow/python/util/CMakeLists.txt similarity index 83% rename from python/src/pyarrow/util/CMakeLists.txt rename to cpp/src/arrow/python/util/CMakeLists.txt index 6cd49cb75a4fb..4cc20f6f4b47e 100644 --- a/python/src/pyarrow/util/CMakeLists.txt +++ b/cpp/src/arrow/python/util/CMakeLists.txt @@ -16,21 +16,21 @@ # under the License. 
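Earlier in this commit, the ValuesToValidBytes hunk replaced the alternative operator token `not` with `!`. For context, that loop derives one validity byte per value while counting nulls; a standalone rendering follows, specialized (as an illustrative assumption, since the real code is templated on arrow_traits) to doubles with NaN encoding null:

    #include <cmath>
    #include <cstdint>

    int64_t ValuesToValidBytesSketch(const double* values, int64_t length,
                                     uint8_t* valid_bytes) {
      int64_t null_count = 0;
      for (int64_t i = 0; i < length; ++i) {
        const bool is_null = std::isnan(values[i]);
        valid_bytes[i] = !is_null;  // the `!` spelling this commit adopts
        if (is_null) { ++null_count; }
      }
      return null_count;
    }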
####################################### -# pyarrow_test_main +# arrow/python_test_main ####################################### if (PYARROW_BUILD_TESTS) - add_library(pyarrow_test_main STATIC + add_library(arrow/python_test_main STATIC test_main.cc) if (APPLE) - target_link_libraries(pyarrow_test_main + target_link_libraries(arrow/python_test_main gtest dl) - set_target_properties(pyarrow_test_main + set_target_properties(arrow/python_test_main PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") else() - target_link_libraries(pyarrow_test_main + target_link_libraries(arrow/python_test_main gtest pthread dl diff --git a/python/src/pyarrow/util/datetime.h b/cpp/src/arrow/python/util/datetime.h similarity index 100% rename from python/src/pyarrow/util/datetime.h rename to cpp/src/arrow/python/util/datetime.h diff --git a/python/src/pyarrow/util/test_main.cc b/cpp/src/arrow/python/util/test_main.cc similarity index 92% rename from python/src/pyarrow/util/test_main.cc rename to cpp/src/arrow/python/util/test_main.cc index d8d1d030f8f97..c83514d0dbd37 100644 --- a/python/src/pyarrow/util/test_main.cc +++ b/cpp/src/arrow/python/util/test_main.cc @@ -19,8 +19,8 @@ #include -#include "pyarrow/do_import_numpy.h" -#include "pyarrow/numpy_interop.h" +#include "arrow/python/do_import_numpy.h" +#include "arrow/python/numpy_interop.h" int main(int argc, char** argv) { ::testing::InitGoogleTest(&argc, argv); diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index ef874e3d07959..35a1a89ef3164 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -47,9 +47,6 @@ endif() # Top level cmake dir if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") - option(PYARROW_BUILD_TESTS - "Build the PyArrow C++ googletest unit tests" - OFF) option(PYARROW_BUILD_PARQUET "Build the PyArrow Parquet integration" OFF) @@ -57,7 +54,7 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") "Build the PyArrow jemalloc integration" OFF) option(PYARROW_BUNDLE_ARROW_CPP - "Bundle the Arrow C++ libraries" + "Bundle the Arrow C++ libraries" OFF) endif() @@ -75,6 +72,8 @@ endif(CCACHE_FOUND) # Compiler flags ############################################################ +include(BuildUtils) +include(CompilerInfo) include(SetupCxxFlags) # Add common flags @@ -86,8 +85,6 @@ set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-omit-frame-pointer") # Suppress Cython warnings set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-unused-variable") -# Determine compiler version -include(CompilerInfo) if ("${COMPILER_FAMILY}" STREQUAL "clang") # Using Clang with ccache causes a bunch of spurious warnings that are @@ -215,116 +212,9 @@ include_directories(SYSTEM ${PYTHON_INCLUDE_DIRS} src) -############################################################ -# Testing -############################################################ - -# Add a new test case, with or without an executable that should be built. -# -# REL_TEST_NAME is the name of the test. It may be a single component -# (e.g. monotime-test) or contain additional components (e.g. -# net/net_util-test). Either way, the last component must be a globally -# unique name. -# -# Arguments after the test name will be passed to set_tests_properties(). -function(ADD_PYARROW_TEST REL_TEST_NAME) - if(NO_TESTS) - return() - endif() - get_filename_component(TEST_NAME ${REL_TEST_NAME} NAME_WE) - - if(EXISTS ${CMAKE_CURRENT_SOURCE_DIR}/${REL_TEST_NAME}.cc) - # This test has a corresponding .cc file, set it up as an executable. 
- set(TEST_PATH "${EXECUTABLE_OUTPUT_PATH}/${TEST_NAME}") - add_executable(${TEST_NAME} "${REL_TEST_NAME}.cc") - target_link_libraries(${TEST_NAME} ${PYARROW_TEST_LINK_LIBS}) - else() - # No executable, just invoke the test (probably a script) directly. - set(TEST_PATH ${CMAKE_CURRENT_SOURCE_DIR}/${REL_TEST_NAME}) - endif() - - add_test(${TEST_NAME} - ${BUILD_SUPPORT_DIR}/run-test.sh ${TEST_PATH}) - if(ARGN) - set_tests_properties(${TEST_NAME} PROPERTIES ${ARGN}) - endif() -endfunction() - -# A wrapper for add_dependencies() that is compatible with NO_TESTS. -function(ADD_PYARROW_TEST_DEPENDENCIES REL_TEST_NAME) - if(NO_TESTS) - return() - endif() - get_filename_component(TEST_NAME ${REL_TEST_NAME} NAME_WE) - - add_dependencies(${TEST_NAME} ${ARGN}) -endfunction() - -enable_testing() - ############################################################ # Dependencies ############################################################ -function(ADD_THIRDPARTY_LIB LIB_NAME) - set(options) - set(one_value_args SHARED_LIB STATIC_LIB) - set(multi_value_args DEPS) - cmake_parse_arguments(ARG "${options}" "${one_value_args}" "${multi_value_args}" ${ARGN}) - if(ARG_UNPARSED_ARGUMENTS) - message(SEND_ERROR "Error: unrecognized arguments: ${ARG_UNPARSED_ARGUMENTS}") - endif() - - if(("${PYARROW_LINK}" STREQUAL "s" AND ARG_STATIC_LIB) OR (NOT ARG_SHARED_LIB)) - if(NOT ARG_STATIC_LIB) - message(FATAL_ERROR "No static or shared library provided for ${LIB_NAME}") - endif() - add_library(${LIB_NAME} STATIC IMPORTED) - set_target_properties(${LIB_NAME} - PROPERTIES IMPORTED_LOCATION "${ARG_STATIC_LIB}") - message(STATUS "Added static library dependency ${LIB_NAME}: ${ARG_STATIC_LIB}") - else() - add_library(${LIB_NAME} SHARED IMPORTED) - set_target_properties(${LIB_NAME} - PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}") - message(STATUS "Added shared library dependency ${LIB_NAME}: ${ARG_SHARED_LIB}") - endif() - - if(ARG_DEPS) - set_target_properties(${LIB_NAME} - PROPERTIES IMPORTED_LINK_INTERFACE_LIBRARIES "${ARG_DEPS}") - endif() - - # Set up an "exported variant" for this thirdparty library (see "Visibility" - # above). It's the same as the real target, just with an "_exported" suffix. - # We prefer the static archive if it exists (as it's akin to an "internal" - # library), but we'll settle for the shared object if we must. - # - # A shared object exported variant will force any "leaf" library that - # transitively depends on it to also depend on it at runtime; this is - # desirable for some libraries (e.g. cyrus_sasl). 
- set(LIB_NAME_EXPORTED ${LIB_NAME}_exported) - if(ARG_STATIC_LIB) - add_library(${LIB_NAME_EXPORTED} STATIC IMPORTED) - set_target_properties(${LIB_NAME_EXPORTED} - PROPERTIES IMPORTED_LOCATION "${ARG_STATIC_LIB}") - else() - add_library(${LIB_NAME_EXPORTED} SHARED IMPORTED) - set_target_properties(${LIB_NAME_EXPORTED} - PROPERTIES IMPORTED_LOCATION "${ARG_SHARED_LIB}") - endif() - if(ARG_DEPS) - set_target_properties(${LIB_NAME_EXPORTED} - PROPERTIES IMPORTED_LINK_INTERFACE_LIBRARIES "${ARG_DEPS}") - endif() -endfunction() - -## GMock -if (PYARROW_BUILD_TESTS) - find_package(GTest REQUIRED) - include_directories(SYSTEM ${GTEST_INCLUDE_DIR}) - ADD_THIRDPARTY_LIB(gtest - STATIC_LIB ${GTEST_STATIC_LIB}) -endif() ## Parquet find_package(Parquet) @@ -352,6 +242,8 @@ if (PYARROW_BUNDLE_ARROW_CPP) COPYONLY) SET(ARROW_IPC_SHARED_LIB ${BUILD_OUTPUT_ROOT_DIRECTORY}/libarrow_ipc${CMAKE_SHARED_LIBRARY_SUFFIX}) + SET(ARROW_PYTHON_SHARED_LIB + ${BUILD_OUTPUT_ROOT_DIRECTORY}/libarrow_python${CMAKE_SHARED_LIBRARY_SUFFIX}) endif() ADD_THIRDPARTY_LIB(arrow @@ -360,66 +252,8 @@ ADD_THIRDPARTY_LIB(arrow_io SHARED_LIB ${ARROW_IO_SHARED_LIB}) ADD_THIRDPARTY_LIB(arrow_ipc SHARED_LIB ${ARROW_IPC_SHARED_LIB}) - -############################################################ -# Linker setup -############################################################ - -set(PYARROW_MIN_TEST_LIBS - pyarrow_test_main - pyarrow) - -set(PYARROW_MIN_TEST_LIBS - pyarrow_test_main - pyarrow - ${PYARROW_BASE_LIBS}) - -if(NOT APPLE AND PYARROW_BUILD_TESTS) - ADD_THIRDPARTY_LIB(python - SHARED_LIB "${PYTHON_LIBRARIES}") - list(APPEND PYARROW_MIN_TEST_LIBS python) -endif() - -set(PYARROW_TEST_LINK_LIBS ${PYARROW_MIN_TEST_LIBS}) - -############################################################ -# "make ctags" target -############################################################ -if (UNIX) - add_custom_target(ctags ctags -R --languages=c++,c --exclude=thirdparty/installed) -endif (UNIX) - -############################################################ -# "make etags" target -############################################################ -if (UNIX) - add_custom_target(tags etags --members --declarations - `find ${CMAKE_CURRENT_SOURCE_DIR}/src - -name \\*.cc -or -name \\*.hh -or -name \\*.cpp -or -name \\*.h -or -name \\*.c -or - -name \\*.f`) - add_custom_target(etags DEPENDS tags) -endif (UNIX) - -############################################################ -# "make cscope" target -############################################################ -if (UNIX) - add_custom_target(cscope find ${CMAKE_CURRENT_SOURCE_DIR} - ( -name \\*.cc -or -name \\*.hh -or -name \\*.cpp -or - -name \\*.h -or -name \\*.c -or -name \\*.f ) - -exec echo \"{}\" \; > cscope.files && cscope -q -b VERBATIM) -endif (UNIX) - -############################################################ -# "make lint" target -############################################################ -if (UNIX) - # Full lint - add_custom_target(lint ${BUILD_SUPPORT_DIR}/cpplint.py - --verbose=2 - --filter=-whitespace/comments,-readability/todo,-build/header_guard - `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc -or -name \\*.h`) -endif (UNIX) +ADD_THIRDPARTY_LIB(arrow_python + SHARED_LIB ${ARROW_PYTHON_SHARED_LIB}) ############################################################ # Subdirectories @@ -429,9 +263,6 @@ if (UNIX) set(CMAKE_BUILD_WITH_INSTALL_RPATH TRUE) endif() -add_subdirectory(src/pyarrow) -add_subdirectory(src/pyarrow/util) - set(CYTHON_EXTENSIONS array config @@ -444,19 +275,11 @@ 
set(CYTHON_EXTENSIONS table ) -set(PYARROW_SRCS - src/pyarrow/common.cc - src/pyarrow/config.cc - src/pyarrow/helpers.cc - src/pyarrow/io.cc - src/pyarrow/adapters/builtin.cc - src/pyarrow/adapters/pandas.cc -) - set(LINK_LIBS - arrow - arrow_io - arrow_ipc + arrow_shared + arrow_io_shared + arrow_ipc_shared + arrow_python_shared ) if (PYARROW_BUILD_PARQUET) @@ -497,24 +320,12 @@ if (PYARROW_BUILD_JEMALLOC) SHARED_LIB ${ARROW_JEMALLOC_SHARED_LIB}) set(LINK_LIBS ${LINK_LIBS} - arrow_jemalloc) + arrow_jemalloc_shared) set(CYTHON_EXTENSIONS ${CYTHON_EXTENSIONS} jemalloc) endif() -add_library(pyarrow SHARED - ${PYARROW_SRCS}) -if (PYARROW_BUNDLE_ARROW_CPP) - set_target_properties(pyarrow PROPERTIES - INSTALL_RPATH "\$ORIGIN") -endif() -target_link_libraries(pyarrow ${LINK_LIBS}) - -if(APPLE) - set_target_properties(pyarrow PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") -endif() - ############################################################ # Setup and build Cython modules ############################################################ @@ -555,5 +366,5 @@ foreach(module ${CYTHON_EXTENSIONS}) set_target_properties(${module_name} PROPERTIES INSTALL_RPATH ${module_install_rpath}) - target_link_libraries(${module_name} pyarrow) + target_link_libraries(${module_name} ${LINK_LIBS}) endforeach(module) diff --git a/python/cmake_modules/FindArrow.cmake b/python/cmake_modules/FindArrow.cmake index 5d0207d7c7769..5030c9c8ce900 100644 --- a/python/cmake_modules/FindArrow.cmake +++ b/python/cmake_modules/FindArrow.cmake @@ -57,12 +57,18 @@ find_library(ARROW_JEMALLOC_LIB_PATH NAMES arrow_jemalloc ${ARROW_SEARCH_LIB_PATH} NO_DEFAULT_PATH) +find_library(ARROW_PYTHON_LIB_PATH NAMES arrow_python + PATHS + ${ARROW_SEARCH_LIB_PATH} + NO_DEFAULT_PATH) + if (ARROW_INCLUDE_DIR AND ARROW_LIB_PATH) set(ARROW_FOUND TRUE) set(ARROW_LIB_NAME libarrow) set(ARROW_IO_LIB_NAME libarrow_io) set(ARROW_IPC_LIB_NAME libarrow_ipc) set(ARROW_JEMALLOC_LIB_NAME libarrow_jemalloc) + set(ARROW_PYTHON_LIB_NAME libarrow_python) set(ARROW_LIBS ${ARROW_SEARCH_LIB_PATH}) set(ARROW_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_LIB_NAME}.a) @@ -77,6 +83,9 @@ if (ARROW_INCLUDE_DIR AND ARROW_LIB_PATH) set(ARROW_JEMALLOC_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_JEMALLOC_LIB_NAME}.a) set(ARROW_JEMALLOC_SHARED_LIB ${ARROW_LIBS}/${ARROW_JEMALLOC_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(ARROW_PYTHON_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_PYTHON_LIB_NAME}.a) + set(ARROW_PYTHON_SHARED_LIB ${ARROW_LIBS}/${ARROW_PYTHON_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) + if (NOT Arrow_FIND_QUIETLY) message(STATUS "Found the Arrow core library: ${ARROW_LIB_PATH}") message(STATUS "Found the Arrow IO library: ${ARROW_IO_LIB_PATH}") diff --git a/python/pyarrow/config.pyx b/python/pyarrow/config.pyx index 5ad7cf53261e3..536f27839ae91 100644 --- a/python/pyarrow/config.pyx +++ b/python/pyarrow/config.pyx @@ -14,21 +14,21 @@ # distutils: language = c++ # cython: embedsignature = True -cdef extern from 'pyarrow/do_import_numpy.h': +cdef extern from 'arrow/python/do_import_numpy.h': pass -cdef extern from 'pyarrow/numpy_interop.h' namespace 'arrow::py': +cdef extern from 'arrow/python/numpy_interop.h' namespace 'arrow::py': int import_numpy() -cdef extern from 'pyarrow/config.h' namespace 'arrow::py': - void pyarrow_init() - void pyarrow_set_numpy_nan(object o) +cdef extern from 'arrow/python/config.h' namespace 'arrow::py': + void Init() + void set_numpy_nan(object o) import_numpy() -pyarrow_init() +Init() import numpy as np -pyarrow_set_numpy_nan(np.nan) 
+set_numpy_nan(np.nan) import multiprocessing import os diff --git a/python/pyarrow/includes/pyarrow.pxd b/python/pyarrow/includes/pyarrow.pxd index 3fdbebc9293cd..c3fdf4b070ee0 100644 --- a/python/pyarrow/includes/pyarrow.pxd +++ b/python/pyarrow/includes/pyarrow.pxd @@ -25,7 +25,7 @@ from pyarrow.includes.libarrow cimport (CArray, CBuffer, CColumn, cimport pyarrow.includes.libarrow_io as arrow_io -cdef extern from "pyarrow/api.h" namespace "arrow::py" nogil: +cdef extern from "arrow/python/api.h" namespace "arrow::py" nogil: shared_ptr[CDataType] GetPrimitiveType(Type type) shared_ptr[CDataType] GetTimestampType(TimeUnit unit) CStatus ConvertPySequence(object obj, CMemoryPool* pool, @@ -53,13 +53,9 @@ cdef extern from "pyarrow/api.h" namespace "arrow::py" nogil: void set_default_memory_pool(CMemoryPool* pool) CMemoryPool* get_memory_pool() - -cdef extern from "pyarrow/common.h" namespace "arrow::py" nogil: cdef cppclass PyBuffer(CBuffer): PyBuffer(object o) - -cdef extern from "pyarrow/io.h" namespace "arrow::py" nogil: cdef cppclass PyReadableFile(arrow_io.RandomAccessFile): PyReadableFile(object fo) diff --git a/python/setup.py b/python/setup.py index 9abf9854af2a8..dae6cb2f078f6 100644 --- a/python/setup.py +++ b/python/setup.py @@ -186,7 +186,7 @@ def _run_cmake(self): # a bit hacky build_lib = saved_cwd - # Move the built libpyarrow library to the place expected by the Python + # Move the libraries to the place expected by the Python # build shared_library_prefix = 'lib' if sys.platform == 'darwin': @@ -203,15 +203,16 @@ def _run_cmake(self): pass def move_lib(lib_name): - lib_filename = shared_library_prefix + lib_name + shared_library_suffix + lib_filename = (shared_library_prefix + lib_name + + shared_library_suffix) shutil.move(pjoin(self.build_type, lib_filename), pjoin(build_lib, 'pyarrow', lib_filename)) - move_lib("pyarrow") if self.bundle_arrow_cpp: move_lib("arrow") move_lib("arrow_io") move_lib("arrow_ipc") + move_lib("arrow_python") if self.with_jemalloc: move_lib("arrow_jemalloc") if self.with_parquet: @@ -227,14 +228,14 @@ def move_lib(lib_name): if self._failure_permitted(name): print('Cython module {0} failure permitted'.format(name)) continue - raise RuntimeError('libpyarrow C-extension failed to build:', + raise RuntimeError('pyarrow C-extension failed to build:', os.path.abspath(built_path)) ext_path = pjoin(build_lib, self._get_cmake_ext_path(name)) if os.path.exists(ext_path): os.remove(ext_path) self.mkpath(os.path.dirname(ext_path)) - print('Moving built libpyarrow C-extension', built_path, + print('Moving built C-extension', built_path, 'to build path', ext_path) shutil.move(self.get_ext_built(name), ext_path) self._found_names.append(name) diff --git a/python/src/pyarrow/CMakeLists.txt b/python/src/pyarrow/CMakeLists.txt deleted file mode 100644 index 9e69718dfa7c7..0000000000000 --- a/python/src/pyarrow/CMakeLists.txt +++ /dev/null @@ -1,22 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. 
You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - -####################################### -# Unit tests -####################################### - -ADD_PYARROW_TEST(adapters/pandas-test) From d2d27555b4b2f3f0ba26539211bfe8b4d1b52481 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 27 Mar 2017 10:43:56 -0400 Subject: [PATCH 0424/1644] ARROW-658: [C++] Implement a prototype in-memory arrow::Tensor type I haven't implemented much beyond the data container and automatically computing row major strides. If we agree on the basics, then I will implement IPC read/writes of this data structure in a follow up patch. cc @pcmoritz @robertnishihara @JohanMabille @sylvaincorlay Author: Wes McKinney Closes #438 from wesm/ARROW-658 and squashes the following commits: 7f82028 [Wes McKinney] Include numeric STL header 8160393 [Wes McKinney] std::accumulate is in algorithm header bdd4c55 [Wes McKinney] No need to special case 0-dim 471c719 [Wes McKinney] Add test for 0-d tensor. Use std::accumulate in Tensor::size 8d4a13a [Wes McKinney] Make std::vector args const-refs 8bd9716 [Wes McKinney] Add extern templates for numeric tensors 7d805bf [Wes McKinney] cpplint 8b65aea [Wes McKinney] Implement a prototype in-memory arrow::Tensor type --- cpp/CMakeLists.txt | 1 + cpp/src/arrow/CMakeLists.txt | 1 + cpp/src/arrow/buffer.cc | 4 - cpp/src/arrow/buffer.h | 7 +- cpp/src/arrow/tensor-test.cc | 73 ++++++++++++++++ cpp/src/arrow/tensor.cc | 116 +++++++++++++++++++++++++ cpp/src/arrow/tensor.h | 158 +++++++++++++++++++++++++++++++++++ cpp/src/arrow/type_fwd.h | 13 ++- 8 files changed, 359 insertions(+), 14 deletions(-) create mode 100644 cpp/src/arrow/tensor-test.cc create mode 100644 cpp/src/arrow/tensor.cc create mode 100644 cpp/src/arrow/tensor.h diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index c77cf601cbd46..e4c18ca86e4d7 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -792,6 +792,7 @@ set(ARROW_SRCS src/arrow/schema.cc src/arrow/status.cc src/arrow/table.cc + src/arrow/tensor.cc src/arrow/type.cc src/arrow/visitor.cc diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index 0e83aacaadab5..f965f1d07feef 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -65,6 +65,7 @@ ADD_ARROW_TEST(pretty_print-test) ADD_ARROW_TEST(status-test) ADD_ARROW_TEST(type-test) ADD_ARROW_TEST(table-test) +ADD_ARROW_TEST(tensor-test) ADD_ARROW_BENCHMARK(builder-benchmark) ADD_ARROW_BENCHMARK(column-benchmark) diff --git a/cpp/src/arrow/buffer.cc b/cpp/src/arrow/buffer.cc index 28edf5e824c1f..be747e1d49504 100644 --- a/cpp/src/arrow/buffer.cc +++ b/cpp/src/arrow/buffer.cc @@ -68,10 +68,6 @@ bool Buffer::Equals(const Buffer& other) const { static_cast(size_)))); } -std::shared_ptr MutableBuffer::GetImmutableView() { - return std::make_shared(this->get_shared_ptr(), 0, size()); -} - PoolBuffer::PoolBuffer(MemoryPool* pool) : ResizableBuffer(nullptr, 0) { if (pool == nullptr) { pool = default_memory_pool(); } pool_ = pool; diff --git a/cpp/src/arrow/buffer.h b/cpp/src/arrow/buffer.h index 449bb537d9caa..713d57a1f101d 100644 --- a/cpp/src/arrow/buffer.h +++ b/cpp/src/arrow/buffer.h @@ -43,7 +43,7 @@ class 
Status; /// of bytes that where allocated for the buffer in total. /// /// The following invariant is always true: Size < Capacity -class ARROW_EXPORT Buffer : public std::enable_shared_from_this { +class ARROW_EXPORT Buffer { public: Buffer(const uint8_t* data, int64_t size) : is_mutable_(false), data_(data), size_(size), capacity_(size) {} @@ -58,8 +58,6 @@ class ARROW_EXPORT Buffer : public std::enable_shared_from_this { /// we might add utility methods to help determine if a buffer satisfies this contract. Buffer(const std::shared_ptr& parent, int64_t offset, int64_t size); - std::shared_ptr get_shared_ptr() { return shared_from_this(); } - bool is_mutable() const { return is_mutable_; } /// Return true if both buffers are the same size and contain the same bytes @@ -111,9 +109,6 @@ class ARROW_EXPORT MutableBuffer : public Buffer { uint8_t* mutable_data() { return mutable_data_; } - /// Get a read-only view of this buffer - std::shared_ptr GetImmutableView(); - protected: MutableBuffer() : Buffer(nullptr, 0), mutable_data_(nullptr) {} diff --git a/cpp/src/arrow/tensor-test.cc b/cpp/src/arrow/tensor-test.cc new file mode 100644 index 0000000000000..99a94934c7990 --- /dev/null +++ b/cpp/src/arrow/tensor-test.cc @@ -0,0 +1,73 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+
+// Unit tests for DataType (and subclasses), Field, and Schema
+
+#include
+#include
+#include
+
+#include "gtest/gtest.h"
+
+#include "arrow/buffer.h"
+#include "arrow/tensor.h"
+#include "arrow/test-util.h"
+#include "arrow/type.h"
+
+namespace arrow {
+
+TEST(TestTensor, ZeroDim) {
+  const int64_t values = 1;
+  std::vector<int64_t> shape = {};
+
+  using T = int64_t;
+
+  std::shared_ptr<MutableBuffer> buffer;
+  ASSERT_OK(AllocateBuffer(default_memory_pool(), values * sizeof(T), &buffer));
+
+  Int64Tensor t0(buffer, shape);
+
+  ASSERT_EQ(1, t0.size());
+}
+
+TEST(TestTensor, BasicCtors) {
+  const int64_t values = 24;
+  std::vector<int64_t> shape = {4, 6};
+  std::vector<int64_t> strides = {48, 8};
+  std::vector<std::string> dim_names = {"foo", "bar"};
+
+  using T = int64_t;
+
+  std::shared_ptr<MutableBuffer> buffer;
+  ASSERT_OK(AllocateBuffer(default_memory_pool(), values * sizeof(T), &buffer));
+
+  Int64Tensor t1(buffer, shape);
+  Int64Tensor t2(buffer, shape, strides);
+  Int64Tensor t3(buffer, shape, strides, dim_names);
+
+  ASSERT_EQ(24, t1.size());
+  ASSERT_TRUE(t1.is_mutable());
+  ASSERT_FALSE(t1.has_dim_names());
+
+  ASSERT_EQ(strides, t1.strides());
+  ASSERT_EQ(strides, t2.strides());
+
+  ASSERT_EQ("foo", t3.dim_name(0));
+  ASSERT_EQ("bar", t3.dim_name(1));
+}
+
+} // namespace arrow
diff --git a/cpp/src/arrow/tensor.cc b/cpp/src/arrow/tensor.cc
new file mode 100644
index 0000000000000..c0d128f563906
--- /dev/null
+++ b/cpp/src/arrow/tensor.cc
@@ -0,0 +1,116 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
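[Editor's note, not part of the patch: the `BasicCtors` test above expects strides `{48, 8}` for an int64 tensor of shape `{4, 6}`. As a reading aid, here is a minimal standalone sketch of the row-major stride rule that `ComputeRowMajorStrides` in `tensor.cc` below implements; the helper name `row_major_strides` is hypothetical and only the arithmetic is taken from the patch.]

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch of the row-major stride rule used by ComputeRowMajorStrides:
// stride[i] = byte_width * product(shape[i+1:]). Computed by starting
// from the total byte count and dividing out one dimension at a time.
std::vector<int64_t> row_major_strides(int64_t byte_width,
                                       const std::vector<int64_t>& shape) {
  int64_t remaining = byte_width;
  for (int64_t dimsize : shape) {
    remaining *= dimsize;
  }
  std::vector<int64_t> strides;
  for (int64_t dimsize : shape) {
    remaining /= dimsize;
    strides.push_back(remaining);
  }
  return strides;
}

int main() {
  // Matches the BasicCtors expectation: int64 (8 bytes), shape {4, 6}.
  // Total bytes = 8 * 4 * 6 = 192; 192 / 4 = 48; 48 / 6 = 8.
  assert((row_major_strides(8, {4, 6}) == std::vector<int64_t>{48, 8}));
  // A zero-dimensional tensor (shape {}) yields no strides and size 1,
  // consistent with the ZeroDim test above.
  assert(row_major_strides(8, {}).empty());
  return 0;
}
```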
+
+#include "arrow/tensor.h"
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "arrow/array.h"
+#include "arrow/buffer.h"
+#include "arrow/type.h"
+#include "arrow/type_traits.h"
+#include "arrow/util/logging.h"
+
+namespace arrow {
+
+void ComputeRowMajorStrides(const FixedWidthType& type,
+    const std::vector<int64_t>& shape, std::vector<int64_t>* strides) {
+  int64_t remaining = type.bit_width() / 8;
+  for (int64_t dimsize : shape) {
+    remaining *= dimsize;
+  }
+
+  for (int64_t dimsize : shape) {
+    remaining /= dimsize;
+    strides->push_back(remaining);
+  }
+}
+
+/// Constructor with strides and dimension names
+Tensor::Tensor(const std::shared_ptr<DataType>& type,
+    const std::shared_ptr<Buffer>& data, const std::vector<int64_t>& shape,
+    const std::vector<int64_t>& strides, const std::vector<std::string>& dim_names)
+    : type_(type), data_(data), shape_(shape), strides_(strides), dim_names_(dim_names) {
+  DCHECK(is_tensor_supported(type->type));
+  if (shape.size() > 0 && strides.size() == 0) {
+    ComputeRowMajorStrides(static_cast<const FixedWidthType&>(*type_), shape, &strides_);
+  }
+}
+
+Tensor::Tensor(const std::shared_ptr<DataType>& type,
+    const std::shared_ptr<Buffer>& data, const std::vector<int64_t>& shape,
+    const std::vector<int64_t>& strides)
+    : Tensor(type, data, shape, strides, {}) {}
+
+Tensor::Tensor(const std::shared_ptr<DataType>& type,
+    const std::shared_ptr<Buffer>& data, const std::vector<int64_t>& shape)
+    : Tensor(type, data, shape, {}, {}) {}
+
+const std::string& Tensor::dim_name(int i) const {
+  DCHECK_LT(i, static_cast<int>(dim_names_.size()));
+  return dim_names_[i];
+}
+
+int64_t Tensor::size() const {
+  return std::accumulate(
+      shape_.begin(), shape_.end(), 1, std::multiplies<int64_t>());
+}
+
+template <typename T>
+NumericTensor<T>::NumericTensor(const std::shared_ptr<Buffer>& data,
+    const std::vector<int64_t>& shape, const std::vector<int64_t>& strides,
+    const std::vector<std::string>& dim_names)
+    : Tensor(TypeTraits<T>::type_singleton(), data, shape, strides, dim_names),
+      raw_data_(nullptr),
+      mutable_raw_data_(nullptr) {
+  if (data_) {
+    raw_data_ = reinterpret_cast<const value_type*>(data_->data());
+    if (data_->is_mutable()) {
+      auto mut_buf = static_cast<MutableBuffer*>(data_.get());
+      mutable_raw_data_ = reinterpret_cast<value_type*>(mut_buf->mutable_data());
+    }
+  }
+}
+
+template <typename T>
+NumericTensor<T>::NumericTensor(
+    const std::shared_ptr<Buffer>& data, const std::vector<int64_t>& shape)
+    : NumericTensor(data, shape, {}, {}) {}
+
+template <typename T>
+NumericTensor<T>::NumericTensor(const std::shared_ptr<Buffer>& data,
+    const std::vector<int64_t>& shape, const std::vector<int64_t>& strides)
+    : NumericTensor(data, shape, strides, {}) {}
+
+template class NumericTensor<Int8Type>;
+template class NumericTensor<UInt8Type>;
+template class NumericTensor<Int16Type>;
+template class NumericTensor<UInt16Type>;
+template class NumericTensor<Int32Type>;
+template class NumericTensor<UInt32Type>;
+template class NumericTensor<Int64Type>;
+template class NumericTensor<UInt64Type>;
+template class NumericTensor<HalfFloatType>;
+template class NumericTensor<FloatType>;
+template class NumericTensor<DoubleType>;
+
+} // namespace arrow
diff --git a/cpp/src/arrow/tensor.h b/cpp/src/arrow/tensor.h
new file mode 100644
index 0000000000000..0059368f7b2d8
--- /dev/null
+++ b/cpp/src/arrow/tensor.h
@@ -0,0 +1,158 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_TENSOR_H +#define ARROW_TENSOR_H + +#include +#include +#include +#include + +#include "arrow/buffer.h" +#include "arrow/type.h" +#include "arrow/type_traits.h" +#include "arrow/util/macros.h" +#include "arrow/util/visibility.h" + +namespace arrow { + +class Buffer; +class MemoryPool; +class MutableBuffer; +class Status; + +static inline bool is_tensor_supported(Type::type type_id) { + switch (type_id) { + case Type::UINT8: + case Type::INT8: + case Type::UINT16: + case Type::INT16: + case Type::UINT32: + case Type::INT32: + case Type::UINT64: + case Type::INT64: + case Type::HALF_FLOAT: + case Type::FLOAT: + case Type::DOUBLE: + return true; + default: + break; + } + return false; +} + +class ARROW_EXPORT Tensor { + public: + virtual ~Tensor() = default; + + /// Constructor with no dimension names or strides, data assumed to be row-major + Tensor(const std::shared_ptr& type, const std::shared_ptr& data, + const std::vector& shape); + + /// Constructor with non-negative strides + Tensor(const std::shared_ptr& type, const std::shared_ptr& data, + const std::vector& shape, const std::vector& strides); + + /// Constructor with strides and dimension names + Tensor(const std::shared_ptr& type, const std::shared_ptr& data, + const std::vector& shape, const std::vector& strides, + const std::vector& dim_names); + + std::shared_ptr data() const { return data_; } + const std::vector& shape() const { return shape_; } + const std::vector& strides() const { return strides_; } + + const std::string& dim_name(int i) const; + bool has_dim_names() const { return shape_.size() > 0 && dim_names_.size() > 0; } + + /// Total number of value cells in the tensor + int64_t size() const; + + /// Return true if the underlying data buffer is mutable + bool is_mutable() const { return data_->is_mutable(); } + + protected: + Tensor() {} + + std::shared_ptr type_; + + std::shared_ptr data_; + + std::vector shape_; + std::vector strides_; + + /// These names are optional + std::vector dim_names_; + + private: + DISALLOW_COPY_AND_ASSIGN(Tensor); +}; + +template +class ARROW_EXPORT NumericTensor : public Tensor { + public: + using value_type = typename T::c_type; + + NumericTensor(const std::shared_ptr& data, const std::vector& shape); + + /// Constructor with non-negative strides + NumericTensor(const std::shared_ptr& data, const std::vector& shape, + const std::vector& strides); + + /// Constructor with strides and dimension names + NumericTensor(const std::shared_ptr& data, const std::vector& shape, + const std::vector& strides, const std::vector& dim_names); + + const value_type* raw_data() const { return raw_data_; } + value_type* raw_data() { return mutable_raw_data_; } + + private: + const value_type* raw_data_; + value_type* mutable_raw_data_; +}; + +// ---------------------------------------------------------------------- +// extern templates and other details + +// gcc and clang disagree about how to handle template visibility when you have +// explicit specializations https://llvm.org/bugs/show_bug.cgi?id=24815 +#if defined(__GNUC__) && !defined(__clang__) +#pragma GCC 
diagnostic push +#pragma GCC diagnostic ignored "-Wattributes" +#endif + +// Only instantiate these templates once +extern template class ARROW_EXPORT NumericTensor; +extern template class ARROW_EXPORT NumericTensor; +extern template class ARROW_EXPORT NumericTensor; +extern template class ARROW_EXPORT NumericTensor; +extern template class ARROW_EXPORT NumericTensor; +extern template class ARROW_EXPORT NumericTensor; +extern template class ARROW_EXPORT NumericTensor; +extern template class ARROW_EXPORT NumericTensor; +extern template class ARROW_EXPORT NumericTensor; +extern template class ARROW_EXPORT NumericTensor; +extern template class ARROW_EXPORT NumericTensor; + +#if defined(__GNUC__) && !defined(__clang__) +#pragma GCC diagnostic pop +#endif + +} // namespace arrow + +#endif // ARROW_TENSOR_H diff --git a/cpp/src/arrow/type_fwd.h b/cpp/src/arrow/type_fwd.h index 201f4e92bb00d..04ddf7e74dd1d 100644 --- a/cpp/src/arrow/type_fwd.h +++ b/cpp/src/arrow/type_fwd.h @@ -30,6 +30,7 @@ struct DataType; class Array; class ArrayBuilder; struct Field; +class Tensor; class Buffer; class MemoryPool; @@ -78,10 +79,14 @@ class NumericArray; template class NumericBuilder; -#define _NUMERIC_TYPE_DECL(KLASS) \ - struct KLASS##Type; \ - using KLASS##Array = NumericArray; \ - using KLASS##Builder = NumericBuilder; +template +class NumericTensor; + +#define _NUMERIC_TYPE_DECL(KLASS) \ + struct KLASS##Type; \ + using KLASS##Array = NumericArray; \ + using KLASS##Builder = NumericBuilder; \ + using KLASS##Tensor = NumericTensor; _NUMERIC_TYPE_DECL(Int8); _NUMERIC_TYPE_DECL(Int16); From e717d47865038a65a23d80d6d5d6df782d9a8e43 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 27 Mar 2017 23:13:33 -0400 Subject: [PATCH 0425/1644] ARROW-716: [Python] Update README build instructions after moving libpyarrow to C++ tree Author: Wes McKinney Closes #445 from wesm/ARROW-716 and squashes the following commits: 2608d2b [Wes McKinney] Update README after moving libpyarrow to main C++ source tree --- cpp/README.md | 10 ++++++++++ python/README.md | 33 +++++++++++++++++++-------------- 2 files changed, 29 insertions(+), 14 deletions(-) diff --git a/cpp/README.md b/cpp/README.md index 51f1f0606fa3a..b6f0fa0e3531b 100644 --- a/cpp/README.md +++ b/cpp/README.md @@ -81,6 +81,16 @@ variables * Hadoop: `HADOOP_HOME` (only required for the HDFS I/O extensions) * jemalloc: `JEMALLOC_HOME` (only required for the jemalloc-based memory pool) +### Building Python integration library + +The `arrow_python` shared library can be built by passing `-DARROW_PYTHON=on` +to CMake. This must be installed or in your library load path to be able to +build pyarrow, the Arrow Python bindings. + +The Python library must be built against the same Python version for which you +are building pyarrow, e.g. Python 2.7 or Python 3.6. NumPy must also be +installed. + ### API documentation To generate the (html) API documentation, run the following command in the apidoc diff --git a/python/README.md b/python/README.md index 88ab17e71730f..25a3a67b83b03 100644 --- a/python/README.md +++ b/python/README.md @@ -22,25 +22,30 @@ other traditional Python scientific computing packages. 
This project is layered in two pieces: -* pyarrow, a C++ library for easier interoperability between Arrow C++, NumPy, - and pandas -* Cython extensions and pure Python code under arrow/ which expose Arrow C++ +* arrow_python, a library part of the main Arrow C++ project for Python, + pandas, and NumPy interoperability +* Cython extensions and pure Python code under pyarrow/ which expose Arrow C++ and pyarrow to pure Python users #### PyArrow Dependencies: -These are the various projects that PyArrow depends on. -1. **g++ and gcc Version >= 4.8** -2. **cmake > 2.8.6** -3. **boost** -4. **Arrow-cpp and its dependencies** - -The Arrow C++ library must be built with all options enabled and installed with -``ARROW_HOME`` environment variable set to the installation location. Look at -(https://github.com/apache/arrow/blob/master/cpp/README.md) for instructions. +To build pyarrow, first build and install Arrow C++ with the Python component +enabled using `-DARROW_PYTHON=on`, see +(https://github.com/apache/arrow/blob/master/cpp/README.md) . These components +must be installed either in the default system location (e.g. `/usr/local`) or +in a custom `$ARROW_HOME` location. + +```shell +mkdir cpp/build +pushd cpp/build +cmake -DARROW_PYTHON=on -DCMAKE_INSTALL_PREFIX=$ARROW_HOME .. +make -j4 +make install +``` -Ensure PyArrow can locate the Arrow-cpp shared libraries by setting the -LD_LIBRARY_PATH environment variable. +If you build with a custom `CMAKE_INSTALL_PREFIX`, during development, you must +set `ARROW_HOME` as an environment variable and add it to your +`LD_LIBRARY_PATH` on Linux and OS X: ```bash export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ARROW_HOME/lib From 3b71d87c5e2a79cc5955e6cb73fa4a5cc906458f Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 28 Mar 2017 10:07:44 -0400 Subject: [PATCH 0426/1644] ARROW-620: [C++] Implement JSON integration test support for date, time, timestamp, fixed width binary This also contains some code scrubbing, and uses inline visitors in the JSON reader/writer path Author: Wes McKinney Closes #446 from wesm/ARROW-620 and squashes the following commits: 46978aa [Wes McKinney] Fix compiler warning cd714f8 [Wes McKinney] No underscores. Fix bug slicing null buffer d26ad14 [Wes McKinney] Implement FixedWidthBinary support in JSON reader/writer. Make FWBinaryArray subclass of PrimitiveArray bd5652e [Wes McKinney] Get date/time/timestamp JSON tests passing. 
Cleanup, fix large metadata issues fcbf64a [Wes McKinney] Refactoring, implement record batch round trip fixture for JSON, failing unit tests --- cpp/src/arrow/array.cc | 5 +- cpp/src/arrow/array.h | 7 +- cpp/src/arrow/compare.cc | 14 +- cpp/src/arrow/ipc/ipc-json-test.cc | 43 +- cpp/src/arrow/ipc/ipc-read-write-test.cc | 63 +-- cpp/src/arrow/ipc/json-internal.cc | 693 ++++++++++++----------- cpp/src/arrow/ipc/metadata.cc | 6 +- cpp/src/arrow/ipc/metadata.h | 4 +- cpp/src/arrow/ipc/reader.cc | 2 + cpp/src/arrow/ipc/test-common.h | 40 +- cpp/src/arrow/ipc/writer.cc | 8 +- cpp/src/arrow/tensor.cc | 3 +- cpp/src/arrow/type.cc | 31 +- cpp/src/arrow/type.h | 71 ++- 14 files changed, 545 insertions(+), 445 deletions(-) diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index cff0126647647..3ea033376fca3 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -283,12 +283,9 @@ std::shared_ptr StringArray::Slice(int64_t offset, int64_t length) const FixedWidthBinaryArray::FixedWidthBinaryArray(const std::shared_ptr& type, int64_t length, const std::shared_ptr& data, const std::shared_ptr& null_bitmap, int64_t null_count, int64_t offset) - : Array(type, length, null_bitmap, null_count, offset), - data_(data), - raw_data_(nullptr) { + : PrimitiveArray(type, length, data, null_bitmap, null_count, offset) { DCHECK(type->type == Type::FIXED_WIDTH_BINARY); byte_width_ = static_cast(*type).byte_width(); - if (data) { raw_data_ = data->data(); } } std::shared_ptr FixedWidthBinaryArray::Slice( diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index cc0cf98092a8c..c0ec571e45983 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -347,7 +347,7 @@ class ARROW_EXPORT StringArray : public BinaryArray { // ---------------------------------------------------------------------- // Fixed width binary -class ARROW_EXPORT FixedWidthBinaryArray : public Array { +class ARROW_EXPORT FixedWidthBinaryArray : public PrimitiveArray { public: using TypeClass = FixedWidthBinaryType; @@ -360,9 +360,6 @@ class ARROW_EXPORT FixedWidthBinaryArray : public Array { return raw_data_ + (i + offset_) * byte_width_; } - /// Note that this buffer does not account for any slice offset - std::shared_ptr data() const { return data_; } - int32_t byte_width() const { return byte_width_; } const uint8_t* raw_data() const { return raw_data_; } @@ -371,8 +368,6 @@ class ARROW_EXPORT FixedWidthBinaryArray : public Array { protected: int32_t byte_width_; - std::shared_ptr data_; - const uint8_t* raw_data_; }; // ---------------------------------------------------------------------- diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc index 8274e0f80dc50..3e282f8886623 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -666,14 +666,12 @@ class TypeEqualsVisitor { return Status::OK(); } - Status Visit(const Time32Type& left) { - const auto& right = static_cast(right_); - result_ = left.unit == right.unit; - return Status::OK(); - } - - Status Visit(const Time64Type& left) { - const auto& right = static_cast(right_); + template + typename std::enable_if::value || + std::is_base_of::value, + Status>::type + Visit(const T& left) { + const auto& right = static_cast(right_); result_ = left.unit == right.unit; return Status::OK(); } diff --git a/cpp/src/arrow/ipc/ipc-json-test.cc b/cpp/src/arrow/ipc/ipc-json-test.cc index e943ef1558a75..68261ab25a43a 100644 --- a/cpp/src/arrow/ipc/ipc-json-test.cc +++ b/cpp/src/arrow/ipc/ipc-json-test.cc @@ -75,7 +75,8 @@ void 
TestArrayRoundTrip(const Array& array) { std::shared_ptr out; ASSERT_OK(ReadJsonArray(default_memory_pool(), d, array.type(), &out)); - ASSERT_TRUE(array.Equals(out)) << array_as_json; + // std::cout << array_as_json << std::endl; + CompareArraysDetailed(0, *out, array); } template @@ -351,5 +352,45 @@ TEST(TestJsonFileReadWrite, MinimalFormatExample) { ASSERT_TRUE(batch->column(1)->Equals(bar)); } +#define BATCH_CASES() \ + ::testing::Values(&MakeIntRecordBatch, &MakeListRecordBatch, &MakeNonNullRecordBatch, \ + &MakeZeroLengthRecordBatch, &MakeDeeplyNestedList, &MakeStringTypesRecordBatch, \ + &MakeStruct, &MakeUnion, &MakeDates, &MakeTimestamps, &MakeTimes, &MakeFWBinary); + +class TestJsonRoundTrip : public ::testing::TestWithParam { + public: + void SetUp() {} + void TearDown() {} +}; + +void CheckRoundtrip(const RecordBatch& batch) { + std::unique_ptr writer; + ASSERT_OK(JsonWriter::Open(batch.schema(), &writer)); + ASSERT_OK(writer->WriteRecordBatch(batch)); + + std::string result; + ASSERT_OK(writer->Finish(&result)); + + auto buffer = std::make_shared(reinterpret_cast(result.c_str()), + static_cast(result.size())); + + std::unique_ptr reader; + ASSERT_OK(JsonReader::Open(buffer, &reader)); + + std::shared_ptr result_batch; + ASSERT_OK(reader->GetRecordBatch(0, &result_batch)); + + CompareBatch(batch, *result_batch); +} + +TEST_P(TestJsonRoundTrip, RoundTrip) { + std::shared_ptr batch; + ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue + + CheckRoundtrip(*batch); +} + +INSTANTIATE_TEST_CASE_P(TestJsonRoundTrip, TestJsonRoundTrip, BATCH_CASES()); + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/ipc-read-write-test.cc b/cpp/src/arrow/ipc/ipc-read-write-test.cc index 086cc68176783..cd3f190fe4a27 100644 --- a/cpp/src/arrow/ipc/ipc-read-write-test.cc +++ b/cpp/src/arrow/ipc/ipc-read-write-test.cc @@ -41,20 +41,6 @@ namespace arrow { namespace ipc { -void CompareBatch(const RecordBatch& left, const RecordBatch& right) { - if (!left.schema()->Equals(right.schema())) { - FAIL() << "Left schema: " << left.schema()->ToString() - << "\nRight schema: " << right.schema()->ToString(); - } - ASSERT_EQ(left.num_columns(), right.num_columns()) - << left.schema()->ToString() << " result: " << right.schema()->ToString(); - EXPECT_EQ(left.num_rows(), right.num_rows()); - for (int i = 0; i < left.num_columns(); ++i) { - EXPECT_TRUE(left.column(i)->Equals(right.column(i))) - << "Idx: " << i << " Name: " << left.column_name(i); - } -} - using BatchVector = std::vector>; class TestSchemaMetadata : public ::testing::Test { @@ -85,17 +71,17 @@ class TestSchemaMetadata : public ::testing::Test { const std::shared_ptr INT32 = std::make_shared(); TEST_F(TestSchemaMetadata, PrimitiveFields) { - auto f0 = std::make_shared("f0", std::make_shared()); - auto f1 = std::make_shared("f1", std::make_shared(), false); - auto f2 = std::make_shared("f2", std::make_shared()); - auto f3 = std::make_shared("f3", std::make_shared()); - auto f4 = std::make_shared("f4", std::make_shared()); - auto f5 = std::make_shared("f5", std::make_shared()); - auto f6 = std::make_shared("f6", std::make_shared()); - auto f7 = std::make_shared("f7", std::make_shared()); - auto f8 = std::make_shared("f8", std::make_shared()); - auto f9 = std::make_shared("f9", std::make_shared(), false); - auto f10 = std::make_shared("f10", std::make_shared()); + auto f0 = field("f0", std::make_shared()); + auto f1 = field("f1", std::make_shared(), false); + auto f2 = field("f2", std::make_shared()); + auto f3 = 
field("f3", std::make_shared()); + auto f4 = field("f4", std::make_shared()); + auto f5 = field("f5", std::make_shared()); + auto f6 = field("f6", std::make_shared()); + auto f7 = field("f7", std::make_shared()); + auto f8 = field("f8", std::make_shared()); + auto f9 = field("f9", std::make_shared(), false); + auto f10 = field("f10", std::make_shared()); Schema schema({f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10}); DictionaryMemo memo; @@ -105,11 +91,11 @@ TEST_F(TestSchemaMetadata, PrimitiveFields) { TEST_F(TestSchemaMetadata, NestedFields) { auto type = std::make_shared(std::make_shared()); - auto f0 = std::make_shared("f0", type); + auto f0 = field("f0", type); - std::shared_ptr type2(new StructType({std::make_shared("k1", INT32), - std::make_shared("k2", INT32), std::make_shared("k3", INT32)})); - auto f1 = std::make_shared("f1", type2); + std::shared_ptr type2( + new StructType({field("k1", INT32), field("k2", INT32), field("k3", INT32)})); + auto f1 = field("f1", type2); Schema schema({f0, f1}); DictionaryMemo memo; @@ -172,20 +158,7 @@ class IpcTestFixture : public io::MemoryMapFixture { ASSERT_EQ(expected.num_columns(), result.num_columns()) << expected.schema()->ToString() << " result: " << result.schema()->ToString(); - for (int i = 0; i < expected.num_columns(); ++i) { - const auto& left = *expected.column(i); - const auto& right = *result.column(i); - if (!left.Equals(right)) { - std::stringstream pp_result; - std::stringstream pp_expected; - - ASSERT_OK(PrettyPrint(left, 0, &pp_expected)); - ASSERT_OK(PrettyPrint(right, 0, &pp_result)); - - FAIL() << "Index: " << i << " Expected: " << pp_expected.str() - << "\nGot: " << pp_result.str(); - } - } + CompareBatchColumnsDetailed(result, expected); } void CheckRoundtrip(const RecordBatch& batch, int64_t buffer_size) { @@ -549,7 +522,7 @@ TEST_F(TestIpcRoundTrip, LargeRecordBatch) { std::vector> fields = {f0}; auto schema = std::make_shared(fields); - RecordBatch batch(schema, 0, {array}); + RecordBatch batch(schema, length, {array}); std::string path = "test-write-large-record_batch"; @@ -562,6 +535,8 @@ TEST_F(TestIpcRoundTrip, LargeRecordBatch) { ASSERT_OK(DoLargeRoundTrip(batch, false, &result)); CheckReadResult(*result, batch); + ASSERT_EQ(length, result->num_rows()); + // Fails if we try to write this with the normal code path ASSERT_RAISES(Invalid, DoStandardRoundTrip(batch, false, &result)); } diff --git a/cpp/src/arrow/ipc/json-internal.cc b/cpp/src/arrow/ipc/json-internal.cc index 348468006d0b5..95ab011bd087f 100644 --- a/cpp/src/arrow/ipc/json-internal.cc +++ b/cpp/src/arrow/ipc/json-internal.cc @@ -40,6 +40,7 @@ #include "arrow/util/bit-util.h" #include "arrow/util/logging.h" #include "arrow/util/string.h" +#include "arrow/visitor_inline.h" namespace arrow { namespace ipc { @@ -63,13 +64,13 @@ static std::string GetBufferTypeName(BufferType type) { return "UNKNOWN"; } -static std::string GetFloatingPrecisionName(FloatingPointMeta::Precision precision) { +static std::string GetFloatingPrecisionName(FloatingPoint::Precision precision) { switch (precision) { - case FloatingPointMeta::HALF: + case FloatingPoint::HALF: return "HALF"; - case FloatingPointMeta::SINGLE: + case FloatingPoint::SINGLE: return "SINGLE"; - case FloatingPointMeta::DOUBLE: + case FloatingPoint::DOUBLE: return "DOUBLE"; default: break; @@ -93,7 +94,7 @@ static std::string GetTimeUnitName(TimeUnit unit) { return "UNKNOWN"; } -class JsonSchemaWriter : public TypeVisitor { +class JsonSchemaWriter { public: explicit JsonSchemaWriter(const Schema& schema, 
RjWriter* writer) : schema_(schema), writer_(writer) {} @@ -120,7 +121,7 @@ class JsonSchemaWriter : public TypeVisitor { writer_->Bool(field.nullable); // Visit the type - RETURN_NOT_OK(field.type->Accept(this)); + RETURN_NOT_OK(VisitTypeInline(*field.type, this)); writer_->EndObject(); return Status::OK(); @@ -139,25 +140,19 @@ class JsonSchemaWriter : public TypeVisitor { void>::type WriteTypeMetadata(const T& type) {} - template - typename std::enable_if::value, void>::type - WriteTypeMetadata(const T& type) { + void WriteTypeMetadata(const Integer& type) { writer_->Key("bitWidth"); writer_->Int(type.bit_width()); writer_->Key("isSigned"); writer_->Bool(type.is_signed()); } - template - typename std::enable_if::value, void>::type - WriteTypeMetadata(const T& type) { + void WriteTypeMetadata(const FloatingPoint& type) { writer_->Key("precision"); writer_->String(GetFloatingPrecisionName(type.precision())); } - template - typename std::enable_if::value, void>::type - WriteTypeMetadata(const T& type) { + void WriteTypeMetadata(const IntervalType& type) { writer_->Key("unit"); switch (type.unit) { case IntervalType::Unit::YEAR_MONTH: @@ -169,28 +164,45 @@ class JsonSchemaWriter : public TypeVisitor { } } - template - typename std::enable_if::value || - std::is_base_of::value || - std::is_base_of::value, - void>::type - WriteTypeMetadata(const T& type) { + void WriteTypeMetadata(const TimestampType& type) { + writer_->Key("unit"); + writer_->String(GetTimeUnitName(type.unit)); + if (type.timezone.size() > 0) { + writer_->Key("timezone"); + writer_->String(type.timezone); + } + } + + void WriteTypeMetadata(const TimeType& type) { writer_->Key("unit"); writer_->String(GetTimeUnitName(type.unit)); } - template - typename std::enable_if::value, void>::type - WriteTypeMetadata(const T& type) { + void WriteTypeMetadata(const DateType& type) { + writer_->Key("unit"); + switch (type.unit) { + case DateUnit::DAY: + writer_->String("DAY"); + break; + case DateUnit::MILLI: + writer_->String("MILLISECOND"); + break; + } + } + + void WriteTypeMetadata(const FixedWidthBinaryType& type) { + writer_->Key("byteWidth"); + writer_->Int(type.byte_width()); + } + + void WriteTypeMetadata(const DecimalType& type) { writer_->Key("precision"); writer_->Int(type.precision); writer_->Key("scale"); writer_->Int(type.scale); } - template - typename std::enable_if::value, void>::type - WriteTypeMetadata(const T& type) { + void WriteTypeMetadata(const UnionType& type) { writer_->Key("mode"); switch (type.mode) { case UnionMode::SPARSE: @@ -268,86 +280,65 @@ class JsonSchemaWriter : public TypeVisitor { return Status::OK(); } - Status Visit(const NullType& type) override { return WritePrimitive("null", type); } - - Status Visit(const BooleanType& type) override { return WritePrimitive("bool", type); } - - Status Visit(const Int8Type& type) override { return WritePrimitive("int", type); } - - Status Visit(const Int16Type& type) override { return WritePrimitive("int", type); } + Status Visit(const NullType& type) { return WritePrimitive("null", type); } - Status Visit(const Int32Type& type) override { return WritePrimitive("int", type); } + Status Visit(const BooleanType& type) { return WritePrimitive("bool", type); } - Status Visit(const Int64Type& type) override { return WritePrimitive("int", type); } + Status Visit(const Integer& type) { return WritePrimitive("int", type); } - Status Visit(const UInt8Type& type) override { return WritePrimitive("int", type); } - - Status Visit(const UInt16Type& type) override { return 
WritePrimitive("int", type); } - - Status Visit(const UInt32Type& type) override { return WritePrimitive("int", type); } - - Status Visit(const UInt64Type& type) override { return WritePrimitive("int", type); } - - Status Visit(const HalfFloatType& type) override { - return WritePrimitive("floatingpoint", type); - } - - Status Visit(const FloatType& type) override { - return WritePrimitive("floatingpoint", type); - } - - Status Visit(const DoubleType& type) override { + Status Visit(const FloatingPoint& type) { return WritePrimitive("floatingpoint", type); } - Status Visit(const StringType& type) override { return WriteVarBytes("utf8", type); } - - Status Visit(const BinaryType& type) override { return WriteVarBytes("binary", type); } - - // TODO - Status Visit(const Date32Type& type) override { return WritePrimitive("date", type); } + Status Visit(const DateType& type) { return WritePrimitive("date", type); } - Status Visit(const Date64Type& type) override { return WritePrimitive("date", type); } + Status Visit(const TimeType& type) { return WritePrimitive("time", type); } - Status Visit(const Time32Type& type) override { return WritePrimitive("time", type); } + Status Visit(const StringType& type) { return WriteVarBytes("utf8", type); } - Status Visit(const Time64Type& type) override { return WritePrimitive("time", type); } + Status Visit(const BinaryType& type) { return WriteVarBytes("binary", type); } - Status Visit(const TimestampType& type) override { - return WritePrimitive("timestamp", type); + Status Visit(const FixedWidthBinaryType& type) { + return WritePrimitive("fixedwidthbinary", type); } - Status Visit(const IntervalType& type) override { - return WritePrimitive("interval", type); - } + Status Visit(const TimestampType& type) { return WritePrimitive("timestamp", type); } + + Status Visit(const IntervalType& type) { return WritePrimitive("interval", type); } - Status Visit(const ListType& type) override { + Status Visit(const ListType& type) { WriteName("list", type); RETURN_NOT_OK(WriteChildren(type.children())); WriteBufferLayout(type.GetBufferLayout()); return Status::OK(); } - Status Visit(const StructType& type) override { + Status Visit(const StructType& type) { WriteName("struct", type); WriteChildren(type.children()); WriteBufferLayout(type.GetBufferLayout()); return Status::OK(); } - Status Visit(const UnionType& type) override { + Status Visit(const UnionType& type) { WriteName("union", type); WriteChildren(type.children()); WriteBufferLayout(type.GetBufferLayout()); return Status::OK(); } + Status Visit(const DecimalType& type) { return Status::NotImplemented("decimal"); } + + Status Visit(const DictionaryType& type) { + return Status::NotImplemented("dictionary"); + } + private: const Schema& schema_; RjWriter* writer_; }; -class JsonArrayWriter : public ArrayVisitor { +class JsonArrayWriter { public: JsonArrayWriter(const std::string& name, const Array& array, RjWriter* writer) : name_(name), array_(array), writer_(writer) {} @@ -362,7 +353,7 @@ class JsonArrayWriter : public ArrayVisitor { writer_->Key("count"); writer_->Int(static_cast(arr.length())); - RETURN_NOT_OK(arr.Accept(this)); + RETURN_NOT_OK(VisitArrayInline(arr, this)); writer_->EndObject(); return Status::OK(); @@ -411,9 +402,15 @@ class JsonArrayWriter : public ArrayVisitor { } } - template - typename std::enable_if::value, void>::type - WriteDataValues(const T& arr) { + void WriteDataValues(const FixedWidthBinaryArray& arr) { + int32_t width = arr.byte_width(); + for (int64_t i = 0; i < 
arr.length(); ++i) { + const char* buf = reinterpret_cast(arr.GetValue(i)); + writer_->String(HexEncode(buf, width)); + } + } + + void WriteDataValues(const BooleanArray& arr) { for (int i = 0; i < arr.length(); ++i) { writer_->Bool(arr.Value(i)); } @@ -458,23 +455,6 @@ class JsonArrayWriter : public ArrayVisitor { writer_->EndArray(); } - template - Status WritePrimitive(const T& array) { - WriteValidityField(array); - WriteDataField(array); - SetNoChildren(); - return Status::OK(); - } - - template - Status WriteVarBytes(const T& array) { - WriteValidityField(array); - WriteIntegerField("OFFSET", array.raw_value_offsets(), array.length() + 1); - WriteDataField(array); - SetNoChildren(); - return Status::OK(); - } - Status WriteChildren(const std::vector>& fields, const std::vector>& arrays) { writer_->Key("children"); @@ -486,53 +466,48 @@ class JsonArrayWriter : public ArrayVisitor { return Status::OK(); } - Status Visit(const NullArray& array) override { + Status Visit(const NullArray& array) { SetNoChildren(); return Status::OK(); } - Status Visit(const BooleanArray& array) override { return WritePrimitive(array); } - - Status Visit(const Int8Array& array) override { return WritePrimitive(array); } - - Status Visit(const Int16Array& array) override { return WritePrimitive(array); } - - Status Visit(const Int32Array& array) override { return WritePrimitive(array); } - - Status Visit(const Int64Array& array) override { return WritePrimitive(array); } - - Status Visit(const UInt8Array& array) override { return WritePrimitive(array); } - - Status Visit(const UInt16Array& array) override { return WritePrimitive(array); } - - Status Visit(const UInt32Array& array) override { return WritePrimitive(array); } - - Status Visit(const UInt64Array& array) override { return WritePrimitive(array); } - - Status Visit(const HalfFloatArray& array) override { return WritePrimitive(array); } - - Status Visit(const FloatArray& array) override { return WritePrimitive(array); } + template + typename std::enable_if::value, Status>::type Visit( + const T& array) { + WriteValidityField(array); + WriteDataField(array); + SetNoChildren(); + return Status::OK(); + } - Status Visit(const DoubleArray& array) override { return WritePrimitive(array); } + template + typename std::enable_if::value, Status>::type Visit( + const T& array) { + WriteValidityField(array); + WriteIntegerField("OFFSET", array.raw_value_offsets(), array.length() + 1); + WriteDataField(array); + SetNoChildren(); + return Status::OK(); + } - Status Visit(const StringArray& array) override { return WriteVarBytes(array); } + Status Visit(const DecimalArray& array) { return Status::NotImplemented("decimal"); } - Status Visit(const BinaryArray& array) override { return WriteVarBytes(array); } + Status Visit(const DictionaryArray& array) { return Status::NotImplemented("decimal"); } - Status Visit(const ListArray& array) override { + Status Visit(const ListArray& array) { WriteValidityField(array); WriteIntegerField("OFFSET", array.raw_value_offsets(), array.length() + 1); auto type = static_cast(array.type().get()); return WriteChildren(type->children(), {array.values()}); } - Status Visit(const StructArray& array) override { + Status Visit(const StructArray& array) { WriteValidityField(array); auto type = static_cast(array.type().get()); return WriteChildren(type->children(), array.fields()); } - Status Visit(const UnionArray& array) override { + Status Visit(const UnionArray& array) { WriteValidityField(array); auto type = 
static_cast(array.type().get()); @@ -549,240 +524,256 @@ class JsonArrayWriter : public ArrayVisitor { RjWriter* writer_; }; -class JsonSchemaReader { - public: - explicit JsonSchemaReader(const rj::Value& json_schema) : json_schema_(json_schema) {} - - Status GetSchema(std::shared_ptr* schema) { - const auto& obj_schema = json_schema_.GetObject(); +static Status GetInteger( + const rj::Value::ConstObject& json_type, std::shared_ptr* type) { + const auto& json_bit_width = json_type.FindMember("bitWidth"); + RETURN_NOT_INT("bitWidth", json_bit_width, json_type); - const auto& json_fields = obj_schema.FindMember("fields"); - RETURN_NOT_ARRAY("fields", json_fields, obj_schema); + const auto& json_is_signed = json_type.FindMember("isSigned"); + RETURN_NOT_BOOL("isSigned", json_is_signed, json_type); - std::vector> fields; - RETURN_NOT_OK(GetFieldsFromArray(json_fields->value, &fields)); + bool is_signed = json_is_signed->value.GetBool(); + int bit_width = json_bit_width->value.GetInt(); - *schema = std::make_shared(fields); - return Status::OK(); + switch (bit_width) { + case 8: + *type = is_signed ? int8() : uint8(); + break; + case 16: + *type = is_signed ? int16() : uint16(); + break; + case 32: + *type = is_signed ? int32() : uint32(); + break; + case 64: + *type = is_signed ? int64() : uint64(); + break; + default: + std::stringstream ss; + ss << "Invalid bit width: " << bit_width; + return Status::Invalid(ss.str()); } + return Status::OK(); +} - Status GetFieldsFromArray( - const rj::Value& obj, std::vector>* fields) { - const auto& values = obj.GetArray(); +static Status GetFloatingPoint( + const RjObject& json_type, std::shared_ptr* type) { + const auto& json_precision = json_type.FindMember("precision"); + RETURN_NOT_STRING("precision", json_precision, json_type); - fields->resize(values.Size()); - for (rj::SizeType i = 0; i < fields->size(); ++i) { - RETURN_NOT_OK(GetField(values[i], &(*fields)[i])); - } - return Status::OK(); + std::string precision = json_precision->value.GetString(); + + if (precision == "DOUBLE") { + *type = float64(); + } else if (precision == "SINGLE") { + *type = float32(); + } else if (precision == "HALF") { + *type = float16(); + } else { + std::stringstream ss; + ss << "Invalid precision: " << precision; + return Status::Invalid(ss.str()); } + return Status::OK(); +} - Status GetField(const rj::Value& obj, std::shared_ptr* field) { - if (!obj.IsObject()) { return Status::Invalid("Field was not a JSON object"); } - const auto& json_field = obj.GetObject(); +static Status GetFixedWidthBinary( + const RjObject& json_type, std::shared_ptr* type) { + const auto& json_byte_width = json_type.FindMember("byteWidth"); + RETURN_NOT_INT("byteWidth", json_byte_width, json_type); - const auto& json_name = json_field.FindMember("name"); - RETURN_NOT_STRING("name", json_name, json_field); + int32_t byte_width = json_byte_width->value.GetInt(); + *type = fixed_width_binary(byte_width); + return Status::OK(); +} - const auto& json_nullable = json_field.FindMember("nullable"); - RETURN_NOT_BOOL("nullable", json_nullable, json_field); +static Status GetDate(const RjObject& json_type, std::shared_ptr* type) { + const auto& json_unit = json_type.FindMember("unit"); + RETURN_NOT_STRING("unit", json_unit, json_type); - const auto& json_type = json_field.FindMember("type"); - RETURN_NOT_OBJECT("type", json_type, json_field); + std::string unit_str = json_unit->value.GetString(); - const auto& json_children = json_field.FindMember("children"); - RETURN_NOT_ARRAY("children", 
json_children, json_field); + if (unit_str == "DAY") { + *type = date32(); + } else if (unit_str == "MILLISECOND") { + *type = date64(); + } else { + std::stringstream ss; + ss << "Invalid date unit: " << unit_str; + return Status::Invalid(ss.str()); + } + return Status::OK(); +} - std::vector> children; - RETURN_NOT_OK(GetFieldsFromArray(json_children->value, &children)); +static Status GetTime(const RjObject& json_type, std::shared_ptr* type) { + const auto& json_unit = json_type.FindMember("unit"); + RETURN_NOT_STRING("unit", json_unit, json_type); + + std::string unit_str = json_unit->value.GetString(); + + if (unit_str == "SECOND") { + *type = time32(TimeUnit::SECOND); + } else if (unit_str == "MILLISECOND") { + *type = time32(TimeUnit::MILLI); + } else if (unit_str == "MICROSECOND") { + *type = time64(TimeUnit::MICRO); + } else if (unit_str == "NANOSECOND") { + *type = time64(TimeUnit::NANO); + } else { + std::stringstream ss; + ss << "Invalid time unit: " << unit_str; + return Status::Invalid(ss.str()); + } + return Status::OK(); +} - std::shared_ptr type; - RETURN_NOT_OK(GetType(json_type->value.GetObject(), children, &type)); +static Status GetTimestamp(const RjObject& json_type, std::shared_ptr* type) { + const auto& json_unit = json_type.FindMember("unit"); + RETURN_NOT_STRING("unit", json_unit, json_type); + + std::string unit_str = json_unit->value.GetString(); + + TimeUnit unit; + if (unit_str == "SECOND") { + unit = TimeUnit::SECOND; + } else if (unit_str == "MILLISECOND") { + unit = TimeUnit::MILLI; + } else if (unit_str == "MICROSECOND") { + unit = TimeUnit::MICRO; + } else if (unit_str == "NANOSECOND") { + unit = TimeUnit::NANO; + } else { + std::stringstream ss; + ss << "Invalid time unit: " << unit_str; + return Status::Invalid(ss.str()); + } - *field = std::make_shared( - json_name->value.GetString(), type, json_nullable->value.GetBool()); - return Status::OK(); + const auto& json_tz = json_type.FindMember("timezone"); + if (json_tz == json_type.MemberEnd()) { + *type = timestamp(unit); + } else { + *type = timestamp(unit, json_tz->value.GetString()); } - Status GetInteger( - const rj::Value::ConstObject& json_type, std::shared_ptr* type) { - const auto& json_bit_width = json_type.FindMember("bitWidth"); - RETURN_NOT_INT("bitWidth", json_bit_width, json_type); + return Status::OK(); +} - const auto& json_is_signed = json_type.FindMember("isSigned"); - RETURN_NOT_BOOL("isSigned", json_is_signed, json_type); +static Status GetUnion(const RjObject& json_type, + const std::vector>& children, + std::shared_ptr* type) { + const auto& json_mode = json_type.FindMember("mode"); + RETURN_NOT_STRING("mode", json_mode, json_type); - bool is_signed = json_is_signed->value.GetBool(); - int bit_width = json_bit_width->value.GetInt(); + std::string mode_str = json_mode->value.GetString(); + UnionMode mode; - switch (bit_width) { - case 8: - *type = is_signed ? int8() : uint8(); - break; - case 16: - *type = is_signed ? int16() : uint16(); - break; - case 32: - *type = is_signed ? int32() : uint32(); - break; - case 64: - *type = is_signed ? 
int64() : uint64(); - break; - default: - std::stringstream ss; - ss << "Invalid bit width: " << bit_width; - return Status::Invalid(ss.str()); - } - return Status::OK(); + if (mode_str == "SPARSE") { + mode = UnionMode::SPARSE; + } else if (mode_str == "DENSE") { + mode = UnionMode::DENSE; + } else { + std::stringstream ss; + ss << "Invalid union mode: " << mode_str; + return Status::Invalid(ss.str()); } - Status GetFloatingPoint(const RjObject& json_type, std::shared_ptr* type) { - const auto& json_precision = json_type.FindMember("precision"); - RETURN_NOT_STRING("precision", json_precision, json_type); - - std::string precision = json_precision->value.GetString(); + const auto& json_type_codes = json_type.FindMember("typeIds"); + RETURN_NOT_ARRAY("typeIds", json_type_codes, json_type); - if (precision == "DOUBLE") { - *type = float64(); - } else if (precision == "SINGLE") { - *type = float32(); - } else if (precision == "HALF") { - *type = float16(); - } else { - std::stringstream ss; - ss << "Invalid precision: " << precision; - return Status::Invalid(ss.str()); - } - return Status::OK(); + std::vector type_codes; + const auto& id_array = json_type_codes->value.GetArray(); + for (const rj::Value& val : id_array) { + DCHECK(val.IsUint()); + type_codes.push_back(static_cast(val.GetUint())); } - Status GetTime(const RjObject& json_type, std::shared_ptr* type) { - const auto& json_unit = json_type.FindMember("unit"); - RETURN_NOT_STRING("unit", json_unit, json_type); - - std::string unit_str = json_unit->value.GetString(); - - if (unit_str == "SECOND") { - *type = time32(TimeUnit::SECOND); - } else if (unit_str == "MILLISECOND") { - *type = time32(TimeUnit::MILLI); - } else if (unit_str == "MICROSECOND") { - *type = time64(TimeUnit::MICRO); - } else if (unit_str == "NANOSECOND") { - *type = time64(TimeUnit::NANO); - } else { - std::stringstream ss; - ss << "Invalid time unit: " << unit_str; - return Status::Invalid(ss.str()); - } - return Status::OK(); - } + *type = union_(children, type_codes, mode); - Status GetTimestamp(const RjObject& json_type, std::shared_ptr* type) { - const auto& json_unit = json_type.FindMember("unit"); - RETURN_NOT_STRING("unit", json_unit, json_type); + return Status::OK(); +} - std::string unit_str = json_unit->value.GetString(); +static Status GetType(const RjObject& json_type, + const std::vector>& children, + std::shared_ptr* type) { + const auto& json_type_name = json_type.FindMember("name"); + RETURN_NOT_STRING("name", json_type_name, json_type); + + std::string type_name = json_type_name->value.GetString(); + + if (type_name == "int") { + return GetInteger(json_type, type); + } else if (type_name == "floatingpoint") { + return GetFloatingPoint(json_type, type); + } else if (type_name == "bool") { + *type = boolean(); + } else if (type_name == "utf8") { + *type = utf8(); + } else if (type_name == "binary") { + *type = binary(); + } else if (type_name == "fixedwidthbinary") { + return GetFixedWidthBinary(json_type, type); + } else if (type_name == "null") { + *type = null(); + } else if (type_name == "date") { + return GetDate(json_type, type); + } else if (type_name == "time") { + return GetTime(json_type, type); + } else if (type_name == "timestamp") { + return GetTimestamp(json_type, type); + } else if (type_name == "list") { + *type = list(children[0]); + } else if (type_name == "struct") { + *type = struct_(children); + } else { + return GetUnion(json_type, children, type); + } + return Status::OK(); +} - TimeUnit unit; - if (unit_str == "SECOND") { - 
unit = TimeUnit::SECOND; - } else if (unit_str == "MILLISECOND") { - unit = TimeUnit::MILLI; - } else if (unit_str == "MICROSECOND") { - unit = TimeUnit::MICRO; - } else if (unit_str == "NANOSECOND") { - unit = TimeUnit::NANO; - } else { - std::stringstream ss; - ss << "Invalid time unit: " << unit_str; - return Status::Invalid(ss.str()); - } +static Status GetField(const rj::Value& obj, std::shared_ptr* field); - *type = timestamp(unit); +static Status GetFieldsFromArray( + const rj::Value& obj, std::vector>* fields) { + const auto& values = obj.GetArray(); - return Status::OK(); + fields->resize(values.Size()); + for (rj::SizeType i = 0; i < fields->size(); ++i) { + RETURN_NOT_OK(GetField(values[i], &(*fields)[i])); } + return Status::OK(); +} - Status GetUnion(const RjObject& json_type, - const std::vector>& children, - std::shared_ptr* type) { - const auto& json_mode = json_type.FindMember("mode"); - RETURN_NOT_STRING("mode", json_mode, json_type); - - std::string mode_str = json_mode->value.GetString(); - UnionMode mode; +static Status GetField(const rj::Value& obj, std::shared_ptr* field) { + if (!obj.IsObject()) { return Status::Invalid("Field was not a JSON object"); } + const auto& json_field = obj.GetObject(); - if (mode_str == "SPARSE") { - mode = UnionMode::SPARSE; - } else if (mode_str == "DENSE") { - mode = UnionMode::DENSE; - } else { - std::stringstream ss; - ss << "Invalid union mode: " << mode_str; - return Status::Invalid(ss.str()); - } + const auto& json_name = json_field.FindMember("name"); + RETURN_NOT_STRING("name", json_name, json_field); - const auto& json_type_codes = json_type.FindMember("typeIds"); - RETURN_NOT_ARRAY("typeIds", json_type_codes, json_type); + const auto& json_nullable = json_field.FindMember("nullable"); + RETURN_NOT_BOOL("nullable", json_nullable, json_field); - std::vector type_codes; - const auto& id_array = json_type_codes->value.GetArray(); - for (const rj::Value& val : id_array) { - DCHECK(val.IsUint()); - type_codes.push_back(static_cast(val.GetUint())); - } + const auto& json_type = json_field.FindMember("type"); + RETURN_NOT_OBJECT("type", json_type, json_field); - *type = union_(children, type_codes, mode); + const auto& json_children = json_field.FindMember("children"); + RETURN_NOT_ARRAY("children", json_children, json_field); - return Status::OK(); - } + std::vector> children; + RETURN_NOT_OK(GetFieldsFromArray(json_children->value, &children)); - Status GetType(const RjObject& json_type, - const std::vector>& children, - std::shared_ptr* type) { - const auto& json_type_name = json_type.FindMember("name"); - RETURN_NOT_STRING("name", json_type_name, json_type); - - std::string type_name = json_type_name->value.GetString(); - - if (type_name == "int") { - return GetInteger(json_type, type); - } else if (type_name == "floatingpoint") { - return GetFloatingPoint(json_type, type); - } else if (type_name == "bool") { - *type = boolean(); - } else if (type_name == "utf8") { - *type = utf8(); - } else if (type_name == "binary") { - *type = binary(); - } else if (type_name == "null") { - *type = null(); - } else if (type_name == "date") { - // TODO - *type = date64(); - } else if (type_name == "time") { - return GetTime(json_type, type); - } else if (type_name == "timestamp") { - return GetTimestamp(json_type, type); - } else if (type_name == "list") { - *type = list(children[0]); - } else if (type_name == "struct") { - *type = struct_(children); - } else { - return GetUnion(json_type, children, type); - } - return Status::OK(); - } + 
+ std::shared_ptr<DataType> type; + RETURN_NOT_OK(GetType(json_type->value.GetObject(), children, &type)); - private: - const rj::Value& json_schema_; -}; + *field = std::make_shared<Field>( + json_name->value.GetString(), type, json_nullable->value.GetBool()); + return Status::OK(); +} template <typename T> inline typename std::enable_if<std::is_integral<typename T::c_type>::value, typename T::c_type>::type UnboxValue(const rj::Value& val) { - DCHECK(val.IsInt()); + DCHECK(val.IsInt64()); return static_cast<typename T::c_type>(val.GetInt64()); } @@ -833,8 +824,10 @@ class JsonArrayReader { } template <typename T> - typename std::enable_if<std::is_base_of<PrimitiveCType, T>::value || - std::is_base_of<BooleanType, T>::value, + typename std::enable_if< + std::is_base_of<PrimitiveCType, T>::value || std::is_base_of<BooleanType, T>::value || + std::is_base_of<DateType, T>::value || + std::is_base_of<TimeType, T>::value || std::is_base_of<TimestampType, T>::value, Status>::type ReadArray(const RjObject& json_array, int32_t length, const std::vector<bool>& is_valid, const std::shared_ptr<DataType>& type, std::shared_ptr<Array>* array) { @@ -903,6 +896,47 @@ class JsonArrayReader { return builder.Finish(array); } + template <typename T> + typename std::enable_if<std::is_base_of<FixedWidthBinaryType, T>::value, Status>::type + ReadArray(const RjObject& json_array, int32_t length, const std::vector<bool>& is_valid, + const std::shared_ptr<DataType>& type, std::shared_ptr<Array>* array) { + FixedWidthBinaryBuilder builder(pool_, type); + + const auto& json_data = json_array.FindMember("DATA"); + RETURN_NOT_ARRAY("DATA", json_data, json_array); + + const auto& json_data_arr = json_data->value.GetArray(); + + DCHECK_EQ(static_cast<int32_t>(json_data_arr.Size()), length); + + int32_t byte_width = static_cast<const FixedWidthBinaryType&>(*type).byte_width(); + + // Allocate space for parsed values + std::shared_ptr<MutableBuffer> byte_buffer; + RETURN_NOT_OK(AllocateBuffer(pool_, byte_width, &byte_buffer)); + uint8_t* byte_buffer_data = byte_buffer->mutable_data(); + + for (int i = 0; i < length; ++i) { + if (!is_valid[i]) { + builder.AppendNull(); + continue; + } + + const rj::Value& val = json_data_arr[i]; + DCHECK(val.IsString()); + std::string hex_string = val.GetString(); + DCHECK_EQ(static_cast<int32_t>(hex_string.size()), byte_width * 2) + << "Expected size: " << byte_width * 2 << " got: " << hex_string.size(); + const char* hex_data = hex_string.c_str(); + + for (int32_t j = 0; j < byte_width; ++j) { + RETURN_NOT_OK(ParseHexValue(hex_data + j * 2, &byte_buffer_data[j])); + } + RETURN_NOT_OK(builder.Append(byte_buffer_data)); + } + return builder.Finish(array); + } + template <typename T> Status GetIntArray( const RjArray& json_array, const int32_t length, std::shared_ptr<Buffer>* out) { @@ -1063,13 +1097,6 @@ class JsonArrayReader { case TYPE::type_id: \ return ReadArray<TYPE>(json_array, length, is_valid, type, array); -#define NOT_IMPLEMENTED_CASE(TYPE_ENUM) \ - case Type::TYPE_ENUM: { \ - std::stringstream ss; \ - ss << type->ToString(); \ - return Status::NotImplemented(ss.str()); \ - } - switch (type->type) { TYPE_CASE(NullType); TYPE_CASE(BooleanType); TYPE_CASE(Int8Type); TYPE_CASE(Int16Type); TYPE_CASE(Int32Type); TYPE_CASE(Int64Type); TYPE_CASE(UInt8Type); TYPE_CASE(UInt16Type); TYPE_CASE(UInt32Type); TYPE_CASE(UInt64Type); TYPE_CASE(HalfFloatType); TYPE_CASE(FloatType); TYPE_CASE(DoubleType); TYPE_CASE(StringType); TYPE_CASE(BinaryType); - NOT_IMPLEMENTED_CASE(DATE32); - NOT_IMPLEMENTED_CASE(DATE64); - NOT_IMPLEMENTED_CASE(TIMESTAMP); - NOT_IMPLEMENTED_CASE(TIME32); - NOT_IMPLEMENTED_CASE(TIME64); - NOT_IMPLEMENTED_CASE(INTERVAL); + TYPE_CASE(FixedWidthBinaryType); + TYPE_CASE(Date32Type); + TYPE_CASE(Date64Type); + TYPE_CASE(TimestampType); + TYPE_CASE(Time32Type); + TYPE_CASE(Time64Type); TYPE_CASE(ListType); TYPE_CASE(StructType); TYPE_CASE(UnionType); - NOT_IMPLEMENTED_CASE(DICTIONARY); default: std::stringstream ss; ss << type->ToString(); @@ -1103,7 +1129,6 @@ } #undef TYPE_CASE -#undef NOT_IMPLEMENTED_CASE return Status::OK(); } @@ -1118,8
+1143,16 @@ Status WriteJsonSchema(const Schema& schema, RjWriter* json_writer) { } Status ReadJsonSchema(const rj::Value& json_schema, std::shared_ptr<Schema>* schema) { - JsonSchemaReader converter(json_schema); - return converter.GetSchema(schema); + const auto& obj_schema = json_schema.GetObject(); + + const auto& json_fields = obj_schema.FindMember("fields"); + RETURN_NOT_ARRAY("fields", json_fields, obj_schema); + + std::vector<std::shared_ptr<Field>> fields; + RETURN_NOT_OK(GetFieldsFromArray(json_fields->value, &fields)); + + *schema = std::make_shared<Schema>(fields); + return Status::OK(); } Status WriteJsonArray( diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index 17af563805792..85dc8b321c41d 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -602,7 +602,7 @@ static Status WriteBuffers( return Status::OK(); } -static Status MakeRecordBatch(FBB& fbb, int32_t length, int64_t body_length, +static Status MakeRecordBatch(FBB& fbb, int64_t length, int64_t body_length, const std::vector<FieldMetadata>& nodes, const std::vector<BufferMetadata>& buffers, RecordBatchOffset* offset) { FieldNodeVector fb_nodes; @@ -615,7 +615,7 @@ static Status MakeRecordBatch(FBB& fbb, int32_t length, int64_t body_length, return Status::OK(); } -Status WriteRecordBatchMessage(int32_t length, int64_t body_length, +Status WriteRecordBatchMessage(int64_t length, int64_t body_length, const std::vector<FieldMetadata>& nodes, const std::vector<BufferMetadata>& buffers, std::shared_ptr<Buffer>* out) { FBB fbb; @@ -625,7 +625,7 @@ Status WriteRecordBatchMessage(int32_t length, int64_t body_length, fbb, flatbuf::MessageHeader_RecordBatch, record_batch.Union(), body_length, out); } -Status WriteDictionaryMessage(int64_t id, int32_t length, int64_t body_length, +Status WriteDictionaryMessage(int64_t id, int64_t length, int64_t body_length, const std::vector<FieldMetadata>& nodes, const std::vector<BufferMetadata>& buffers, std::shared_ptr<Buffer>* out) { FBB fbb; diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h index 6e903c0a18ef6..f60fb770c3696 100644 --- a/cpp/src/arrow/ipc/metadata.h +++ b/cpp/src/arrow/ipc/metadata.h @@ -188,11 +188,11 @@ Status ReadMessage(int64_t offset, int32_t metadata_length, io::RandomAccessFile Status WriteSchemaMessage( const Schema& schema, DictionaryMemo* dictionary_memo, std::shared_ptr<Buffer>* out); -Status WriteRecordBatchMessage(int32_t length, int64_t body_length, +Status WriteRecordBatchMessage(int64_t length, int64_t body_length, const std::vector<FieldMetadata>& nodes, const std::vector<BufferMetadata>& buffers, std::shared_ptr<Buffer>* out); -Status WriteDictionaryMessage(int64_t id, int32_t length, int64_t body_length, +Status WriteDictionaryMessage(int64_t id, int64_t length, int64_t body_length, const std::vector<FieldMetadata>& nodes, const std::vector<BufferMetadata>& buffers, std::shared_ptr<Buffer>* out); diff --git a/cpp/src/arrow/ipc/reader.cc b/cpp/src/arrow/ipc/reader.cc index 83e03aa0b36b4..03c678ab7e280 100644 --- a/cpp/src/arrow/ipc/reader.cc +++ b/cpp/src/arrow/ipc/reader.cc @@ -97,6 +97,8 @@ static Status LoadRecordBatchFromSource(const std::shared_ptr<Schema>& schema, for (int i = 0; i < schema->num_fields(); ++i) { RETURN_NOT_OK(LoadArray(schema->field(i)->type, &context, &arrays[i])); + DCHECK_EQ(num_rows, arrays[i]->length()) + << "Array length did not match record batch length"; } *out = std::make_shared<RecordBatch>(schema, num_rows, std::move(arrays)); diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index 7ee57d2152c1b..994e1283004a9 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -29,6 +29,7 @@ #include "arrow/buffer.h" #include "arrow/builder.h"
#include "arrow/memory_pool.h" +#include "arrow/pretty_print.h" #include "arrow/status.h" #include "arrow/table.h" #include "arrow/test-util.h" @@ -47,6 +48,41 @@ static inline void AssertSchemaEqual(const Schema& lhs, const Schema& rhs) { } } +static inline void CompareBatch(const RecordBatch& left, const RecordBatch& right) { + if (!left.schema()->Equals(right.schema())) { + FAIL() << "Left schema: " << left.schema()->ToString() + << "\nRight schema: " << right.schema()->ToString(); + } + ASSERT_EQ(left.num_columns(), right.num_columns()) + << left.schema()->ToString() << " result: " << right.schema()->ToString(); + EXPECT_EQ(left.num_rows(), right.num_rows()); + for (int i = 0; i < left.num_columns(); ++i) { + EXPECT_TRUE(left.column(i)->Equals(right.column(i))) + << "Idx: " << i << " Name: " << left.column_name(i); + } +} + +static inline void CompareArraysDetailed( + int index, const Array& result, const Array& expected) { + if (!expected.Equals(result)) { + std::stringstream pp_result; + std::stringstream pp_expected; + + ASSERT_OK(PrettyPrint(expected, 0, &pp_expected)); + ASSERT_OK(PrettyPrint(result, 0, &pp_result)); + + FAIL() << "Index: " << index << " Expected: " << pp_expected.str() + << "\nGot: " << pp_result.str(); + } +} + +static inline void CompareBatchColumnsDetailed( + const RecordBatch& result, const RecordBatch& expected) { + for (int i = 0; i < expected.num_columns(); ++i) { + CompareArraysDetailed(i, *result.column(i), *expected.column(i)); + } +} + const auto kListInt32 = list(int32()); const auto kListListInt32 = list(kListInt32); @@ -474,7 +510,7 @@ Status MakeDates(std::shared_ptr* out) { ArrayFromVector(is_valid, date32_values, &date32_array); std::vector date64_values = {1489269000000, 1489270000000, 1489271000000, - 1489272000000, 1489272000000, 1489273000000}; + 1489272000000, 1489272000000, 1489273000000, 1489274000000}; std::shared_ptr date64_array; ArrayFromVector(is_valid, date64_values, &date64_array); @@ -548,7 +584,7 @@ Status MakeFWBinary(std::shared_ptr* out) { std::shared_ptr a1, a2; FixedWidthBinaryBuilder b1(default_memory_pool(), f0->type); - FixedWidthBinaryBuilder b2(default_memory_pool(), f0->type); + FixedWidthBinaryBuilder b2(default_memory_pool(), f1->type); std::vector values1 = {"foo1", "foo2", "foo3", "foo4"}; AppendValues(is_valid, values1, &b1); diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc index e795ef961cb64..da360f31641b8 100644 --- a/cpp/src/arrow/ipc/writer.cc +++ b/cpp/src/arrow/ipc/writer.cc @@ -141,7 +141,7 @@ class RecordBatchWriter : public ArrayVisitor { virtual Status WriteMetadataMessage( int64_t num_rows, int64_t body_length, std::shared_ptr* out) { return WriteRecordBatchMessage( - static_cast(num_rows), body_length, field_nodes_, buffer_meta_, out); + num_rows, body_length, field_nodes_, buffer_meta_, out); } Status WriteMetadata(int64_t num_rows, int64_t body_length, io::OutputStream* dst, @@ -306,7 +306,7 @@ class RecordBatchWriter : public ArrayVisitor { auto data = array.data(); int32_t width = array.byte_width(); - if (array.offset() != 0) { + if (data && array.offset() != 0) { data = SliceBuffer(data, array.offset() * width, width * array.length()); } buffers_.push_back(data); @@ -476,8 +476,8 @@ class DictionaryWriter : public RecordBatchWriter { Status WriteMetadataMessage( int64_t num_rows, int64_t body_length, std::shared_ptr* out) override { - return WriteDictionaryMessage(dictionary_id_, static_cast(num_rows), - body_length, field_nodes_, buffer_meta_, out); + return 
WriteDictionaryMessage( + dictionary_id_, num_rows, body_length, field_nodes_, buffer_meta_, out); } Status Write(int64_t dictionary_id, const std::shared_ptr<Array>& dictionary, diff --git a/cpp/src/arrow/tensor.cc b/cpp/src/arrow/tensor.cc index c0d128f563906..6489cd01d4c80 100644 --- a/cpp/src/arrow/tensor.cc +++ b/cpp/src/arrow/tensor.cc @@ -71,8 +71,7 @@ const std::string& Tensor::dim_name(int i) const { } int64_t Tensor::size() const { - return std::accumulate( - shape_.begin(), shape_.end(), 1, std::multiplies<int64_t>()); + return std::accumulate(shape_.begin(), shape_.end(), 1, std::multiplies<int64_t>()); } template diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 388502214e733..c790f6e5a4345 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -45,8 +45,6 @@ std::string Field::ToString() const { return ss.str(); } -DataType::~DataType() {} - bool DataType::Equals(const DataType& other) const { bool are_equal = false; Status error = TypeEquals(*this, other, &are_equal); @@ -63,16 +61,16 @@ std::string BooleanType::ToString() const { return name(); } -FloatingPointMeta::Precision HalfFloatType::precision() const { - return FloatingPointMeta::HALF; +FloatingPoint::Precision HalfFloatType::precision() const { + return FloatingPoint::HALF; } -FloatingPointMeta::Precision FloatType::precision() const { - return FloatingPointMeta::SINGLE; +FloatingPoint::Precision FloatType::precision() const { + return FloatingPoint::SINGLE; } -FloatingPointMeta::Precision DoubleType::precision() const { - return FloatingPointMeta::DOUBLE; +FloatingPoint::Precision DoubleType::precision() const { + return FloatingPoint::DOUBLE; } std::string StringType::ToString() const { @@ -111,6 +109,16 @@ std::string StructType::ToString() const { return s.str(); } +// ---------------------------------------------------------------------- +// Date types + +DateType::DateType(Type::type type_id, DateUnit unit) + : FixedWidthType(type_id), unit(unit) {} + +Date32Type::Date32Type() : DateType(Type::DATE32, DateUnit::DAY) {} + +Date64Type::Date64Type() : DateType(Type::DATE64, DateUnit::MILLI) {} + std::string Date64Type::ToString() const { return std::string("date64[ms]"); } @@ -122,7 +130,10 @@ std::string Date32Type::ToString() const { // ---------------------------------------------------------------------- // Time types -Time32Type::Time32Type(TimeUnit unit) : FixedWidthType(Type::TIME32), unit(unit) { +TimeType::TimeType(Type::type type_id, TimeUnit unit) + : FixedWidthType(type_id), unit(unit) {} + +Time32Type::Time32Type(TimeUnit unit) : TimeType(Type::TIME32, unit) { DCHECK(unit == TimeUnit::SECOND || unit == TimeUnit::MILLI) << "Must be seconds or milliseconds"; } @@ -133,7 +144,7 @@ std::string Time32Type::ToString() const { return ss.str(); } -Time64Type::Time64Type(TimeUnit unit) : FixedWidthType(Type::TIME64), unit(unit) { +Time64Type::Time64Type(TimeUnit unit) : TimeType(Type::TIME64, unit) { DCHECK(unit == TimeUnit::MICRO || unit == TimeUnit::NANO) << "Must be microseconds or nanoseconds"; } diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 7ae5ae3c4b72e..dc50ecd669cae 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -132,7 +132,7 @@ struct ARROW_EXPORT DataType { explicit DataType(Type::type type) : type(type) {} - virtual ~DataType(); + virtual ~DataType() = default; // Return whether the types are equal // @@ -167,11 +167,17 @@ struct ARROW_EXPORT FixedWidthType : public DataType { std::vector<BufferDescr> GetBufferLayout() const override; }; -struct ARROW_EXPORT IntegerMeta {
+struct ARROW_EXPORT PrimitiveCType : public FixedWidthType { + using FixedWidthType::FixedWidthType; +}; + +struct ARROW_EXPORT Integer : public PrimitiveCType { + using PrimitiveCType::PrimitiveCType; virtual bool is_signed() const = 0; }; -struct ARROW_EXPORT FloatingPointMeta { +struct ARROW_EXPORT FloatingPoint : public PrimitiveCType { + using PrimitiveCType::PrimitiveCType; enum Precision { HALF, SINGLE, DOUBLE }; virtual Precision precision() const = 0; }; @@ -206,16 +212,12 @@ struct ARROW_EXPORT Field { typedef std::shared_ptr FieldPtr; -struct ARROW_EXPORT PrimitiveCType : public FixedWidthType { - using FixedWidthType::FixedWidthType; -}; - -template -struct ARROW_EXPORT CTypeImpl : public PrimitiveCType { +template +struct ARROW_EXPORT CTypeImpl : public BASE { using c_type = C_TYPE; static constexpr Type::type type_id = TYPE_ID; - CTypeImpl() : PrimitiveCType(TYPE_ID) {} + CTypeImpl() : BASE(TYPE_ID) {} int bit_width() const override { return static_cast(sizeof(C_TYPE) * 8); } @@ -240,7 +242,7 @@ struct ARROW_EXPORT NullType : public DataType, public NoExtraMeta { }; template -struct IntegerTypeImpl : public CTypeImpl, public IntegerMeta { +struct IntegerTypeImpl : public CTypeImpl { bool is_signed() const override { return std::is_signed::value; } }; @@ -292,20 +294,19 @@ struct ARROW_EXPORT Int64Type : public IntegerTypeImpl, - public FloatingPointMeta { + : public CTypeImpl { Precision precision() const override; static std::string name() { return "halffloat"; } }; -struct ARROW_EXPORT FloatType : public CTypeImpl, - public FloatingPointMeta { +struct ARROW_EXPORT FloatType + : public CTypeImpl { Precision precision() const override; static std::string name() { return "float"; } }; -struct ARROW_EXPORT DoubleType : public CTypeImpl, - public FloatingPointMeta { +struct ARROW_EXPORT DoubleType + : public CTypeImpl { Precision precision() const override; static std::string name() { return "double"; } }; @@ -436,13 +437,23 @@ struct ARROW_EXPORT UnionType : public NestedType { // ---------------------------------------------------------------------- // Date and time types +enum class DateUnit : char { DAY = 0, MILLI = 1 }; + +struct DateType : public FixedWidthType { + public: + DateUnit unit; + + protected: + DateType(Type::type type_id, DateUnit unit); +}; + /// Date as int32_t days since UNIX epoch -struct ARROW_EXPORT Date32Type : public FixedWidthType, public NoExtraMeta { +struct ARROW_EXPORT Date32Type : public DateType { static constexpr Type::type type_id = Type::DATE32; using c_type = int32_t; - Date32Type() : FixedWidthType(Type::DATE32) {} + Date32Type(); int bit_width() const override { return static_cast(sizeof(c_type) * 4); } @@ -451,12 +462,12 @@ struct ARROW_EXPORT Date32Type : public FixedWidthType, public NoExtraMeta { }; /// Date as int64_t milliseconds since UNIX epoch -struct ARROW_EXPORT Date64Type : public FixedWidthType, public NoExtraMeta { +struct ARROW_EXPORT Date64Type : public DateType { static constexpr Type::type type_id = Type::DATE64; using c_type = int64_t; - Date64Type() : FixedWidthType(Type::DATE64) {} + Date64Type(); int bit_width() const override { return static_cast(sizeof(c_type) * 8); } @@ -485,13 +496,18 @@ static inline std::ostream& operator<<(std::ostream& os, TimeUnit unit) { return os; } -struct ARROW_EXPORT Time32Type : public FixedWidthType { +struct TimeType : public FixedWidthType { + public: + TimeUnit unit; + + protected: + TimeType(Type::type type_id, TimeUnit unit); +}; + +struct ARROW_EXPORT Time32Type : public 
TimeType { static constexpr Type::type type_id = Type::TIME32; - using Unit = TimeUnit; using c_type = int32_t; - TimeUnit unit; - int bit_width() const override { return static_cast(sizeof(c_type) * 4); } explicit Time32Type(TimeUnit unit = TimeUnit::MILLI); @@ -500,13 +516,10 @@ struct ARROW_EXPORT Time32Type : public FixedWidthType { std::string ToString() const override; }; -struct ARROW_EXPORT Time64Type : public FixedWidthType { +struct ARROW_EXPORT Time64Type : public TimeType { static constexpr Type::type type_id = Type::TIME64; - using Unit = TimeUnit; using c_type = int64_t; - TimeUnit unit; - int bit_width() const override { return static_cast(sizeof(c_type) * 8); } explicit Time64Type(TimeUnit unit = TimeUnit::MILLI); From dac648db7053bc3cd71e9c64b69edc8959d8ec62 Mon Sep 17 00:00:00 2001 From: Bryan Cutler Date: Tue, 28 Mar 2017 14:21:59 -0700 Subject: [PATCH 0427/1644] ARROW-701: [Java] Support Additional Date Type Metadata The format for Date type now includes metadata for units as DAYS or MILLISECONDS. This change adds DateUnit and support for usage in metadata. Includes round-trip JSON testing. Author: Bryan Cutler Closes #431 from BryanCutler/java-date_unit-metadata-ARROW-701 and squashes the following commits: cdbcbfd [Bryan Cutler] Added support for DateUnit metadata --- .../src/main/codegen/data/ArrowTypes.tdd | 2 +- .../apache/arrow/vector/types/DateUnit.java | 44 +++++++++++++++++++ .../org/apache/arrow/vector/types/Types.java | 4 +- .../arrow/vector/types/pojo/TestSchema.java | 7 +-- 4 files changed, 51 insertions(+), 6 deletions(-) create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/types/DateUnit.java diff --git a/java/vector/src/main/codegen/data/ArrowTypes.tdd b/java/vector/src/main/codegen/data/ArrowTypes.tdd index 94fe31e8dc0d8..67785ad6b4d19 100644 --- a/java/vector/src/main/codegen/data/ArrowTypes.tdd +++ b/java/vector/src/main/codegen/data/ArrowTypes.tdd @@ -54,7 +54,7 @@ }, { name: "Date", - fields: [] + fields: [{name: "unit", type: short, valueType: DateUnit}] }, { name: "Time", diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/DateUnit.java b/java/vector/src/main/java/org/apache/arrow/vector/types/DateUnit.java new file mode 100644 index 0000000000000..e5beebffde9e4 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/DateUnit.java @@ -0,0 +1,44 @@ +/******************************************************************************* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ ******************************************************************************/ +package org.apache.arrow.vector.types; + +public enum DateUnit { + DAY(org.apache.arrow.flatbuf.DateUnit.DAY), + MILLISECOND(org.apache.arrow.flatbuf.DateUnit.MILLISECOND); + + private static final DateUnit[] valuesByFlatbufId = new DateUnit[DateUnit.values().length]; + static { + for (DateUnit v : DateUnit.values()) { + valuesByFlatbufId[v.flatbufID] = v; + } + } + + private final short flatbufID; + + DateUnit(short flatbufID) { + this.flatbufID = flatbufID; + } + + public short getFlatbufID() { + return flatbufID; + } + + public static DateUnit fromFlatbufID(short id) { + return valuesByFlatbufId[id]; + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java index 81743b51917a1..2f070237101d8 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -107,7 +107,7 @@ public class Types { private static final Field UINT2_FIELD = new Field("", true, new Int(16, false), null); private static final Field UINT4_FIELD = new Field("", true, new Int(32, false), null); private static final Field UINT8_FIELD = new Field("", true, new Int(64, false), null); - private static final Field DATE_FIELD = new Field("", true, Date.INSTANCE, null); + private static final Field DATE_FIELD = new Field("", true, new Date(DateUnit.MILLISECOND), null); private static final Field TIME_FIELD = new Field("", true, new Time(TimeUnit.MILLISECOND, 32), null); private static final Field TIMESTAMPSEC_FIELD = new Field("", true, new Timestamp(TimeUnit.SECOND, "UTC"), null); private static final Field TIMESTAMPMILLI_FIELD = new Field("", true, new Timestamp(TimeUnit.MILLISECOND, "UTC"), null); @@ -219,7 +219,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { return new BigIntWriterImpl((NullableBigIntVector) vector); } }, - DATE(Date.INSTANCE) { + DATE(new Date(DateUnit.MILLISECOND)) { @Override public Field getField() { return DATE_FIELD; diff --git a/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java b/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java index 57af9528c5933..45f3b5656d861 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java @@ -23,6 +23,7 @@ import java.io.IOException; +import org.apache.arrow.vector.types.DateUnit; import org.apache.arrow.vector.types.FloatingPointPrecision; import org.apache.arrow.vector.types.IntervalUnit; import org.apache.arrow.vector.types.TimeUnit; @@ -60,7 +61,7 @@ public void testComplex() throws IOException { field("b", new Struct(), field("c", new Int(16, true)), field("d", new Utf8())), - field("e", new List(), field(null, new Date())), + field("e", new List(), field(null, new Date(DateUnit.MILLISECOND))), field("f", new FloatingPoint(FloatingPointPrecision.SINGLE)), field("g", new Timestamp(TimeUnit.MILLISECOND, "UTC")), field("h", new Timestamp(TimeUnit.MICROSECOND, null)), )); roundTrip(schema); assertEquals( - "Schema<a: Int(8, true), b: Struct<c: Int(16, true), d: Utf8>, e: List<Date>, f: FloatingPoint(SINGLE), g: Timestamp(MILLISECOND, UTC), h: Timestamp(MICROSECOND, null), i: Interval(DAY_TIME)>", + "Schema<a: Int(8, true), b: Struct<c: Int(16, true), d: Utf8>, e: List<Date(MILLISECOND)>, f: FloatingPoint(SINGLE), g: Timestamp(MILLISECOND, UTC), h: Timestamp(MICROSECOND, null), i: Interval(DAY_TIME)>", schema.toString()); } @@ -85,7 +86,7 @@ public void testAll() throws IOException { field("h", new Binary()), field("i", new Bool()), field("j", new Decimal(5, 5)), - field("k", new Date()), + field("k", new Date(DateUnit.MILLISECOND)), field("l", new Time(TimeUnit.MILLISECOND, 32)), field("m", new Timestamp(TimeUnit.MILLISECOND, "UTC")), field("n", new Timestamp(TimeUnit.MICROSECOND, null)), From b03236360a5ba04078d5ec1129a13f9e905f0626 Mon Sep 17 00:00:00 2001 From: Itai Incze Date: Wed, 29 Mar 2017 14:55:39 -0400 Subject: [PATCH 0428/1644] ARROW-732: [C++] Schema comparison bugs in struct and union types Found two small bugs in the comparison of nested subfields in compare.cc. Fixed them and added a new test to type-test. Author: Itai Incze Closes #450 from itaiin/master and squashes the following commits: fd6e5cf [Itai Incze] Fixed schema comparison bug for union types 44a068c [Itai Incze] Fixed: nested schema comparison bug --- cpp/src/arrow/compare.cc | 12 ++++++++++-- cpp/src/arrow/type-test.cc | 37 +++++++++++++++++++++++++++++++++++++ 2 files changed, 47 insertions(+), 2 deletions(-) diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc index 3e282f8886623..f786222f7e4f2 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -650,7 +650,7 @@ class TypeEqualsVisitor { for (int i = 0; i < left.num_children(); ++i) { if (!left.child(i)->Equals(right_.child(i))) { result_ = false; - break; + return Status::OK(); } } result_ = true; @@ -712,9 +712,17 @@ class TypeEqualsVisitor { for (size_t i = 0; i < left_codes.size(); ++i) { if (left_codes[i] != right_codes[i]) { result_ = false; - break; + return Status::OK(); + } + } + + for (int i = 0; i < left.num_children(); ++i) { + if (!left.child(i)->Equals(right_.child(i))) { + result_ = false; + return Status::OK(); } } + result_ = true; return Status::OK(); } diff --git a/cpp/src/arrow/type-test.cc b/cpp/src/arrow/type-test.cc index b6a84df339e6e..7f13f8ba480b4 100644 --- a/cpp/src/arrow/type-test.cc +++ b/cpp/src/arrow/type-test.cc @@ -231,4 +231,41 @@ TEST(TestTimestampType, ToString) { ASSERT_EQ("timestamp[us]", t4->ToString()); } +TEST(TestNestedType, Equals) { + auto create_struct = + [](std::string inner_name, std::string struct_name) -> shared_ptr<Field> { + auto f_type = field(inner_name, int32()); + vector<shared_ptr<Field>> fields = {f_type}; + auto s_type = std::make_shared<StructType>(fields); + return field(struct_name, s_type); + }; + + auto create_union = + [](std::string inner_name, std::string union_name) -> shared_ptr<Field> { + auto f_type = field(inner_name, int32()); + vector<shared_ptr<Field>> fields = {f_type}; + vector<uint8_t> codes = {Type::INT32}; + auto u_type = std::make_shared<UnionType>(fields, codes, UnionMode::SPARSE); + return field(union_name, u_type); + }; + + auto s0 = create_struct("f0", "s0"); + auto s0_other = create_struct("f0", "s0"); + auto s0_bad = create_struct("f1", "s0"); + auto s1 = create_struct("f1", "s1"); + + ASSERT_TRUE(s0->Equals(s0_other)); + ASSERT_FALSE(s0->Equals(s1)); + ASSERT_FALSE(s0->Equals(s0_bad)); + + auto u0 = create_union("f0", "u0"); + auto u0_other = create_union("f0", "u0"); + auto u0_bad = create_union("f1", "u0"); + auto u1 = create_union("f1", "u1"); + + ASSERT_TRUE(u0->Equals(u0_other)); + ASSERT_FALSE(u0->Equals(u1)); + ASSERT_FALSE(u0->Equals(u0_bad)); +} + } // namespace arrow From 8f386374eca26d0eebe562beac52fc75459f352c Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Wed, 29 Mar 2017 15:03:20 -0400 Subject: [PATCH 0429/1644] ARROW-731: [C++] Add shared library related versions to .pc They can be used to
find real shared library path in parquet-cpp. See also https://github.com/apache/parquet-cpp/pull/276#issuecomment-289816148 Author: Kouhei Sutou Closes #451 from kou/cpp-add-soversion-to-pc and squashes the following commits: f657a88 [Kouhei Sutou] [C++] Add shared library related versions to .pc --- cpp/src/arrow/arrow.pc.in | 3 +++ cpp/src/arrow/io/arrow-io.pc.in | 3 +++ cpp/src/arrow/ipc/arrow-ipc.pc.in | 3 +++ cpp/src/arrow/jemalloc/arrow-jemalloc.pc.in | 3 +++ 4 files changed, 12 insertions(+) diff --git a/cpp/src/arrow/arrow.pc.in b/cpp/src/arrow/arrow.pc.in index 1c3f65d661101..0debee32a243a 100644 --- a/cpp/src/arrow/arrow.pc.in +++ b/cpp/src/arrow/arrow.pc.in @@ -19,6 +19,9 @@ prefix=@CMAKE_INSTALL_PREFIX@ libdir=${prefix}/@CMAKE_INSTALL_LIBDIR@ includedir=${prefix}/include +so_version=@ARROW_SO_VERSION@ +abi_version=@ARROW_ABI_VERSION@ + Name: Apache Arrow Description: Arrow is a set of technologies that enable big-data systems to process and move data fast. Version: @ARROW_VERSION@ diff --git a/cpp/src/arrow/io/arrow-io.pc.in b/cpp/src/arrow/io/arrow-io.pc.in index af28aae6736fe..61af3577f5a38 100644 --- a/cpp/src/arrow/io/arrow-io.pc.in +++ b/cpp/src/arrow/io/arrow-io.pc.in @@ -19,6 +19,9 @@ prefix=@CMAKE_INSTALL_PREFIX@ libdir=${prefix}/@CMAKE_INSTALL_LIBDIR@ includedir=${prefix}/include +so_version=@ARROW_SO_VERSION@ +abi_version=@ARROW_ABI_VERSION@ + Name: Apache Arrow I/O Description: I/O interface for Arrow. Version: @ARROW_VERSION@ diff --git a/cpp/src/arrow/ipc/arrow-ipc.pc.in b/cpp/src/arrow/ipc/arrow-ipc.pc.in index cbc226abf1ff5..29a942acf0331 100644 --- a/cpp/src/arrow/ipc/arrow-ipc.pc.in +++ b/cpp/src/arrow/ipc/arrow-ipc.pc.in @@ -19,6 +19,9 @@ prefix=@CMAKE_INSTALL_PREFIX@ libdir=${prefix}/@CMAKE_INSTALL_LIBDIR@ includedir=${prefix}/include +so_version=@ARROW_SO_VERSION@ +abi_version=@ARROW_ABI_VERSION@ + Name: Apache Arrow IPC Description: IPC extension for Arrow. Version: @ARROW_VERSION@ diff --git a/cpp/src/arrow/jemalloc/arrow-jemalloc.pc.in b/cpp/src/arrow/jemalloc/arrow-jemalloc.pc.in index 18085aaf715d4..8e946d17d8601 100644 --- a/cpp/src/arrow/jemalloc/arrow-jemalloc.pc.in +++ b/cpp/src/arrow/jemalloc/arrow-jemalloc.pc.in @@ -19,6 +19,9 @@ prefix=@CMAKE_INSTALL_PREFIX@ libdir=${prefix}/@CMAKE_INSTALL_LIBDIR@ includedir=${prefix}/include +so_version=@ARROW_SO_VERSION@ +abi_version=@ARROW_ABI_VERSION@ + Name: Apache Arrow jemalloc-based allocator Description: jemalloc allocator for Arrow. 
Version: @ARROW_VERSION@ From f7b287a28d62c6b246665da7eee50fe222ebaaeb Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 29 Mar 2017 19:53:08 -0400 Subject: [PATCH 0430/1644] ARROW-627: [C++] Add compatibility macros for exported extern templates This should also reduce compiler warnings on MSVC Author: Wes McKinney Closes #447 from wesm/ARROW-627 and squashes the following commits: 3f6277d [Wes McKinney] Wrong define for msvc b53a400 [Wes McKinney] MSVC needs export annotation when instantiating templates 8a9fcb4 [Wes McKinney] Add compatibility macros for exported extern templates, also to reduce compiler warnings in MSVC --- cpp/src/arrow/array.cc | 32 ++++++++++++------------ cpp/src/arrow/array.h | 43 ++++++++++++--------------------- cpp/src/arrow/tensor.cc | 22 ++++++++--------- cpp/src/arrow/tensor.h | 33 +++++++++---------------- cpp/src/arrow/type.h | 4 +-- cpp/src/arrow/util/visibility.h | 17 +++++++++++++ 6 files changed, 73 insertions(+), 78 deletions(-) diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index 3ea033376fca3..b25411a1c5938 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -445,21 +445,21 @@ Status Array::Accept(ArrayVisitor* visitor) const { // ---------------------------------------------------------------------- // Instantiate templates -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; -template class NumericArray; +template class ARROW_TEMPLATE_EXPORT NumericArray; +template class ARROW_TEMPLATE_EXPORT NumericArray; +template class ARROW_TEMPLATE_EXPORT NumericArray; +template class ARROW_TEMPLATE_EXPORT NumericArray; +template class ARROW_TEMPLATE_EXPORT NumericArray; +template class ARROW_TEMPLATE_EXPORT NumericArray; +template class ARROW_TEMPLATE_EXPORT NumericArray; +template class ARROW_TEMPLATE_EXPORT NumericArray; +template class ARROW_TEMPLATE_EXPORT NumericArray; +template class ARROW_TEMPLATE_EXPORT NumericArray; +template class ARROW_TEMPLATE_EXPORT NumericArray; +template class ARROW_TEMPLATE_EXPORT NumericArray; +template class ARROW_TEMPLATE_EXPORT NumericArray; +template class ARROW_TEMPLATE_EXPORT NumericArray; +template class ARROW_TEMPLATE_EXPORT NumericArray; +template class ARROW_TEMPLATE_EXPORT NumericArray; } // namespace arrow diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index c0ec571e45983..53b640853d5a6 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -483,34 +483,23 @@ class ARROW_EXPORT DictionaryArray : public Array { // ---------------------------------------------------------------------- // extern templates and other details -// gcc and clang disagree about how to handle template visibility when you have -// explicit specializations https://llvm.org/bugs/show_bug.cgi?id=24815 -#if defined(__GNUC__) && !defined(__clang__) -#pragma GCC diagnostic push -#pragma GCC diagnostic ignored "-Wattributes" -#endif - // Only instantiate these templates once -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class 
ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; -extern template class ARROW_EXPORT NumericArray; - -#if defined(__GNUC__) && !defined(__clang__) -#pragma GCC diagnostic pop -#endif +ARROW_EXTERN_TEMPLATE NumericArray; +ARROW_EXTERN_TEMPLATE NumericArray; +ARROW_EXTERN_TEMPLATE NumericArray; +ARROW_EXTERN_TEMPLATE NumericArray; +ARROW_EXTERN_TEMPLATE NumericArray; +ARROW_EXTERN_TEMPLATE NumericArray; +ARROW_EXTERN_TEMPLATE NumericArray; +ARROW_EXTERN_TEMPLATE NumericArray; +ARROW_EXTERN_TEMPLATE NumericArray; +ARROW_EXTERN_TEMPLATE NumericArray; +ARROW_EXTERN_TEMPLATE NumericArray; +ARROW_EXTERN_TEMPLATE NumericArray; +ARROW_EXTERN_TEMPLATE NumericArray; +ARROW_EXTERN_TEMPLATE NumericArray; +ARROW_EXTERN_TEMPLATE NumericArray; +ARROW_EXTERN_TEMPLATE NumericArray; } // namespace arrow diff --git a/cpp/src/arrow/tensor.cc b/cpp/src/arrow/tensor.cc index 6489cd01d4c80..7c4593fc40e66 100644 --- a/cpp/src/arrow/tensor.cc +++ b/cpp/src/arrow/tensor.cc @@ -100,16 +100,16 @@ NumericTensor::NumericTensor(const std::shared_ptr& data, const std::vector& shape, const std::vector& strides) : NumericTensor(data, shape, strides, {}) {} -template class NumericTensor; -template class NumericTensor; -template class NumericTensor; -template class NumericTensor; -template class NumericTensor; -template class NumericTensor; -template class NumericTensor; -template class NumericTensor; -template class NumericTensor; -template class NumericTensor; -template class NumericTensor; +template class ARROW_TEMPLATE_EXPORT NumericTensor; +template class ARROW_TEMPLATE_EXPORT NumericTensor; +template class ARROW_TEMPLATE_EXPORT NumericTensor; +template class ARROW_TEMPLATE_EXPORT NumericTensor; +template class ARROW_TEMPLATE_EXPORT NumericTensor; +template class ARROW_TEMPLATE_EXPORT NumericTensor; +template class ARROW_TEMPLATE_EXPORT NumericTensor; +template class ARROW_TEMPLATE_EXPORT NumericTensor; +template class ARROW_TEMPLATE_EXPORT NumericTensor; +template class ARROW_TEMPLATE_EXPORT NumericTensor; +template class ARROW_TEMPLATE_EXPORT NumericTensor; } // namespace arrow diff --git a/cpp/src/arrow/tensor.h b/cpp/src/arrow/tensor.h index 0059368f7b2d8..7bee867a9b33a 100644 --- a/cpp/src/arrow/tensor.h +++ b/cpp/src/arrow/tensor.h @@ -129,29 +129,18 @@ class ARROW_EXPORT NumericTensor : public Tensor { // ---------------------------------------------------------------------- // extern templates and other details -// gcc and clang disagree about how to handle template visibility when you have -// explicit specializations https://llvm.org/bugs/show_bug.cgi?id=24815 -#if defined(__GNUC__) && !defined(__clang__) -#pragma GCC diagnostic push -#pragma GCC diagnostic ignored "-Wattributes" -#endif - // Only instantiate these templates once -extern template class ARROW_EXPORT NumericTensor; -extern template class ARROW_EXPORT NumericTensor; -extern template class ARROW_EXPORT NumericTensor; -extern template class ARROW_EXPORT NumericTensor; -extern template class ARROW_EXPORT NumericTensor; -extern template class ARROW_EXPORT NumericTensor; -extern 
template class ARROW_EXPORT NumericTensor; -extern template class ARROW_EXPORT NumericTensor; -extern template class ARROW_EXPORT NumericTensor; -extern template class ARROW_EXPORT NumericTensor; -extern template class ARROW_EXPORT NumericTensor; - -#if defined(__GNUC__) && !defined(__clang__) -#pragma GCC diagnostic pop -#endif +ARROW_EXTERN_TEMPLATE NumericTensor; +ARROW_EXTERN_TEMPLATE NumericTensor; +ARROW_EXTERN_TEMPLATE NumericTensor; +ARROW_EXTERN_TEMPLATE NumericTensor; +ARROW_EXTERN_TEMPLATE NumericTensor; +ARROW_EXTERN_TEMPLATE NumericTensor; +ARROW_EXTERN_TEMPLATE NumericTensor; +ARROW_EXTERN_TEMPLATE NumericTensor; +ARROW_EXTERN_TEMPLATE NumericTensor; +ARROW_EXTERN_TEMPLATE NumericTensor; +ARROW_EXTERN_TEMPLATE NumericTensor; } // namespace arrow diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index dc50ecd669cae..2a73f6be934eb 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -439,7 +439,7 @@ struct ARROW_EXPORT UnionType : public NestedType { enum class DateUnit : char { DAY = 0, MILLI = 1 }; -struct DateType : public FixedWidthType { +struct ARROW_EXPORT DateType : public FixedWidthType { public: DateUnit unit; @@ -496,7 +496,7 @@ static inline std::ostream& operator<<(std::ostream& os, TimeUnit unit) { return os; } -struct TimeType : public FixedWidthType { +struct ARROW_EXPORT TimeType : public FixedWidthType { public: TimeUnit unit; diff --git a/cpp/src/arrow/util/visibility.h b/cpp/src/arrow/util/visibility.h index 4819a0061e75f..6382f7f63180c 100644 --- a/cpp/src/arrow/util/visibility.h +++ b/cpp/src/arrow/util/visibility.h @@ -35,4 +35,21 @@ #endif #endif // Non-Windows +// gcc and clang disagree about how to handle template visibility when you have +// explicit specializations https://llvm.org/bugs/show_bug.cgi?id=24815 + +#if defined(__clang__) + #define ARROW_EXTERN_TEMPLATE extern template class ARROW_EXPORT +#else + #define ARROW_EXTERN_TEMPLATE extern template class +#endif + +// This is a complicated topic, some reading on it: +// http://www.codesynthesis.com/~boris/blog/2010/01/18/dll-export-cxx-templates/ +#if defined(_MSC_VER) + #define ARROW_TEMPLATE_EXPORT ARROW_EXPORT +#else + #define ARROW_TEMPLATE_EXPORT +#endif + #endif // ARROW_UTIL_VISIBILITY_H From 642b753a49a3fcb5d53946c773cd70ab2a3ece88 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 30 Mar 2017 10:19:50 -0400 Subject: [PATCH 0431/1644] ARROW-698: Add flag to FileWriter::WriteRecordBatch for writing record batches with lengths over INT32_MAX cc @pcmoritz Author: Wes McKinney Closes #455 from wesm/ARROW-698 and squashes the following commits: 42c100c [Wes McKinney] Add allow_64bit option to FileWriter::WriteRecordBatch --- cpp/src/arrow/ipc/ipc-read-write-test.cc | 20 ++++++++++++-------- cpp/src/arrow/ipc/writer.cc | 18 ++++++++++-------- cpp/src/arrow/ipc/writer.h | 4 ++-- cpp/src/arrow/type-test.cc | 8 ++++---- cpp/src/arrow/util/visibility.h | 8 ++++---- 5 files changed, 32 insertions(+), 26 deletions(-) diff --git a/cpp/src/arrow/ipc/ipc-read-write-test.cc b/cpp/src/arrow/ipc/ipc-read-write-test.cc index cd3f190fe4a27..48e546eed12f5 100644 --- a/cpp/src/arrow/ipc/ipc-read-write-test.cc +++ b/cpp/src/arrow/ipc/ipc-read-write-test.cc @@ -138,17 +138,21 @@ class IpcTestFixture : public io::MemoryMapFixture { Status DoLargeRoundTrip( const RecordBatch& batch, bool zero_data, std::shared_ptr* result) { - int32_t metadata_length; - int64_t body_length; - - const int64_t buffer_offset = 0; - if (zero_data) { RETURN_NOT_OK(ZeroMemoryMap(mmap_.get())); } 
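// A minimal sketch of driving the opt-in flag this commit describes, using
// only signatures visible in this patch; `sink` (an open arrow::io::OutputStream*)
// and `big_batch` (a RecordBatch with more than INT32_MAX rows) are hypothetical:
//
//   std::shared_ptr<arrow::ipc::FileWriter> writer;
//   RETURN_NOT_OK(arrow::ipc::FileWriter::Open(sink, big_batch.schema(), &writer));
//   RETURN_NOT_OK(writer->WriteRecordBatch(big_batch, true));  // allow_64bit
//   RETURN_NOT_OK(writer->Close());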
RETURN_NOT_OK(mmap_->Seek(0)); - RETURN_NOT_OK(WriteLargeRecordBatch( - batch, buffer_offset, mmap_.get(), &metadata_length, &body_length, pool_)); - return ReadRecordBatch(batch.schema(), 0, mmap_.get(), result); + std::shared_ptr<FileWriter> file_writer; + RETURN_NOT_OK(FileWriter::Open(mmap_.get(), batch.schema(), &file_writer)); + RETURN_NOT_OK(file_writer->WriteRecordBatch(batch, true)); + RETURN_NOT_OK(file_writer->Close()); + + int64_t offset; + RETURN_NOT_OK(mmap_->Tell(&offset)); + + std::shared_ptr<FileReader> file_reader; + RETURN_NOT_OK(FileReader::Open(mmap_, offset, &file_reader)); + + return file_reader->GetRecordBatch(0, result); } void CheckReadResult(const RecordBatch& result, const RecordBatch& expected) { diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc index da360f31641b8..92e61941937a6 100644 --- a/cpp/src/arrow/ipc/writer.cc +++ b/cpp/src/arrow/ipc/writer.cc @@ -591,7 +591,7 @@ class StreamWriter::StreamWriterImpl { return Status::OK(); } - Status WriteRecordBatch(const RecordBatch& batch, FileBlock* block) { + Status WriteRecordBatch(const RecordBatch& batch, bool allow_64bit, FileBlock* block) { RETURN_NOT_OK(CheckStarted()); block->offset = position_; @@ -599,7 +599,8 @@ class StreamWriter::StreamWriterImpl { // Frame of reference in file format is 0, see ARROW-384 const int64_t buffer_start_offset = 0; RETURN_NOT_OK(arrow::ipc::WriteRecordBatch(batch, buffer_start_offset, sink_, - &block->metadata_length, &block->body_length, pool_)); + &block->metadata_length, &block->body_length, pool_, kMaxNestingDepth, + allow_64bit)); RETURN_NOT_OK(UpdatePosition()); DCHECK(position_ % 8 == 0) << "WriteRecordBatch did not perform aligned writes"; @@ -607,10 +608,11 @@ return Status::OK(); } - Status WriteRecordBatch(const RecordBatch& batch) { + Status WriteRecordBatch(const RecordBatch& batch, bool allow_64bit) { // Push an empty FileBlock.
Can be written in the footer later record_batches_.emplace_back(0, 0, 0); - return WriteRecordBatch(batch, &record_batches_[record_batches_.size() - 1]); + return WriteRecordBatch( + batch, allow_64bit, &record_batches_[record_batches_.size() - 1]); } // Adds padding bytes if necessary to ensure all memory blocks are written on @@ -657,8 +659,8 @@ StreamWriter::StreamWriter() { impl_.reset(new StreamWriterImpl()); } -Status StreamWriter::WriteRecordBatch(const RecordBatch& batch) { - return impl_->WriteRecordBatch(batch); +Status StreamWriter::WriteRecordBatch(const RecordBatch& batch, bool allow_64bit) { + return impl_->WriteRecordBatch(batch, allow_64bit); } void StreamWriter::set_memory_pool(MemoryPool* pool) { @@ -723,8 +725,8 @@ Status FileWriter::Open(io::OutputStream* sink, const std::shared_ptr<Schema>& s return (*out)->impl_->Open(sink, schema); } -Status FileWriter::WriteRecordBatch(const RecordBatch& batch) { - return impl_->WriteRecordBatch(batch); +Status FileWriter::WriteRecordBatch(const RecordBatch& batch, bool allow_64bit) { + return impl_->WriteRecordBatch(batch, allow_64bit); } Status FileWriter::Close() { diff --git a/cpp/src/arrow/ipc/writer.h b/cpp/src/arrow/ipc/writer.h index 3b7e710c124cb..25b5ad62726d9 100644 --- a/cpp/src/arrow/ipc/writer.h +++ b/cpp/src/arrow/ipc/writer.h @@ -87,7 +87,7 @@ class ARROW_EXPORT StreamWriter { static Status Open(io::OutputStream* sink, const std::shared_ptr<Schema>& schema, std::shared_ptr<StreamWriter>* out); - virtual Status WriteRecordBatch(const RecordBatch& batch); + virtual Status WriteRecordBatch(const RecordBatch& batch, bool allow_64bit = false); /// Perform any logic necessary to finish the stream. User is responsible for /// closing the actual OutputStream @@ -108,7 +108,7 @@ class ARROW_EXPORT FileWriter : public StreamWriter { static Status Open(io::OutputStream* sink, const std::shared_ptr<Schema>& schema, std::shared_ptr<FileWriter>* out); - Status WriteRecordBatch(const RecordBatch& batch) override; + Status WriteRecordBatch(const RecordBatch& batch, bool allow_64bit = false) override; Status Close() override; private: diff --git a/cpp/src/arrow/type-test.cc b/cpp/src/arrow/type-test.cc index 7f13f8ba480b4..ed8654314ee6d 100644 --- a/cpp/src/arrow/type-test.cc +++ b/cpp/src/arrow/type-test.cc @@ -232,16 +232,16 @@ TEST(TestTimestampType, ToString) { } TEST(TestNestedType, Equals) { - auto create_struct = - [](std::string inner_name, std::string struct_name) -> shared_ptr<Field> { + auto create_struct = []( + std::string inner_name, std::string struct_name) -> shared_ptr<Field> { auto f_type = field(inner_name, int32()); vector<shared_ptr<Field>> fields = {f_type}; auto s_type = std::make_shared<StructType>(fields); return field(struct_name, s_type); }; - auto create_union = - [](std::string inner_name, std::string union_name) -> shared_ptr<Field> { + auto create_union = []( + std::string inner_name, std::string union_name) -> shared_ptr<Field> { auto f_type = field(inner_name, int32()); vector<shared_ptr<Field>> fields = {f_type}; vector<uint8_t> codes = {Type::INT32}; diff --git a/cpp/src/arrow/util/visibility.h b/cpp/src/arrow/util/visibility.h index 6382f7f63180c..e84cc45aadf01 100644 --- a/cpp/src/arrow/util/visibility.h +++ b/cpp/src/arrow/util/visibility.h @@ -39,17 +39,17 @@ // explicit specializations https://llvm.org/bugs/show_bug.cgi?id=24815 #if defined(__clang__) - #define ARROW_EXTERN_TEMPLATE extern template class ARROW_EXPORT +#define ARROW_EXTERN_TEMPLATE extern template class ARROW_EXPORT #else - #define ARROW_EXTERN_TEMPLATE extern template class +#define ARROW_EXTERN_TEMPLATE extern template class #endif // This is a complicated
topic, some reading on it: // http://www.codesynthesis.com/~boris/blog/2010/01/18/dll-export-cxx-templates/ #if defined(_MSC_VER) - #define ARROW_TEMPLATE_EXPORT ARROW_EXPORT +#define ARROW_TEMPLATE_EXPORT ARROW_EXPORT #else - #define ARROW_TEMPLATE_EXPORT +#define ARROW_TEMPLATE_EXPORT #endif #endif // ARROW_UTIL_VISIBILITY_H From 47fad3f42c05bd4139796b93375dfb3cba74e87b Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 30 Mar 2017 15:04:07 -0400 Subject: [PATCH 0432/1644] ARROW-728: [C++/Python] Add Table::RemoveColumn method, remove name member, some other code cleaning * Consolidated column.h and table.h * Consolidated schema.h and type.h * Removed some `Equals(const std::shared_ptr&)` methods, better to use `const T&` methods Author: Wes McKinney Closes #457 from wesm/ARROW-728 and squashes the following commits: 961783d [Wes McKinney] Fix glib test suite 1645ea2 [Wes McKinney] Return new vector from DeleteVectorElement ea36a8c [Wes McKinney] Fix GLib bindings for removal of name Table member 77d363c [Wes McKinney] Incorporate API changes in pyarrow, add Table.remove_column function. Make nicer repr b73c4d7 [Wes McKinney] Remove Table name attribute, implement and test Table::RemoveColumn 6a7f022 [Wes McKinney] Move Schema to type.h, remove Equals with shared_ptr function 818c46f [Wes McKinney] Consolidate column.h into table.h --- c_glib/arrow-glib/table.cpp | 20 +- c_glib/arrow-glib/table.h | 4 +- c_glib/test/test-table.rb | 9 +- cpp/CMakeLists.txt | 2 - cpp/src/arrow/CMakeLists.txt | 3 - cpp/src/arrow/api.h | 2 - cpp/src/arrow/column-test.cc | 191 ---------------- cpp/src/arrow/column.cc | 132 ----------- cpp/src/arrow/column.h | 104 --------- cpp/src/arrow/ipc/feather.cc | 2 +- cpp/src/arrow/ipc/ipc-json-test.cc | 6 +- cpp/src/arrow/ipc/ipc-read-write-test.cc | 2 +- cpp/src/arrow/ipc/json-integration-test.cc | 6 +- cpp/src/arrow/ipc/json-internal.cc | 1 - cpp/src/arrow/ipc/json.cc | 1 - cpp/src/arrow/ipc/metadata.cc | 1 - cpp/src/arrow/ipc/reader.cc | 2 +- cpp/src/arrow/ipc/test-common.h | 2 +- cpp/src/arrow/ipc/writer.cc | 1 - cpp/src/arrow/python/pandas-test.cc | 3 +- cpp/src/arrow/python/pandas_convert.cc | 1 - cpp/src/arrow/schema.cc | 72 ------ cpp/src/arrow/schema.h | 59 ----- cpp/src/arrow/table-test.cc | 246 +++++++++++++++++---- cpp/src/arrow/table.cc | 149 +++++++++++-- cpp/src/arrow/table.h | 90 +++++++- cpp/src/arrow/test-common.h | 1 - cpp/src/arrow/test-util.h | 2 - cpp/src/arrow/type-test.cc | 8 +- cpp/src/arrow/type.cc | 53 +++++ cpp/src/arrow/type.h | 33 ++- cpp/src/arrow/util/stl.h | 40 ++++ python/pyarrow/array.pyx | 4 +- python/pyarrow/includes/libarrow.pxd | 17 +- python/pyarrow/io.pyx | 6 +- python/pyarrow/schema.pyx | 9 +- python/pyarrow/table.pyx | 65 +++--- python/pyarrow/tests/test_parquet.py | 2 +- python/pyarrow/tests/test_table.py | 33 ++- 39 files changed, 623 insertions(+), 761 deletions(-) delete mode 100644 cpp/src/arrow/column-test.cc delete mode 100644 cpp/src/arrow/column.cc delete mode 100644 cpp/src/arrow/column.h delete mode 100644 cpp/src/arrow/schema.cc delete mode 100644 cpp/src/arrow/schema.h create mode 100644 cpp/src/arrow/util/stl.h diff --git a/c_glib/arrow-glib/table.cpp b/c_glib/arrow-glib/table.cpp index 2410e76c921fb..2f82ffa4320e0 100644 --- a/c_glib/arrow-glib/table.cpp +++ b/c_glib/arrow-glib/table.cpp @@ -126,15 +126,13 @@ garrow_table_class_init(GArrowTableClass *klass) /** * garrow_table_new: - * @name: The name of the table. * @schema: The schema of the table. 
* @columns: (element-type GArrowColumn): The columns of the table. * * Returns: A newly created #GArrowTable. */ GArrowTable * -garrow_table_new(const gchar *name, - GArrowSchema *schema, +garrow_table_new(GArrowSchema *schema, GList *columns) { std::vector> arrow_columns; @@ -144,25 +142,11 @@ garrow_table_new(const gchar *name, } auto arrow_table = - std::make_shared(name, - garrow_schema_get_raw(schema), + std::make_shared(garrow_schema_get_raw(schema), arrow_columns); return garrow_table_new_raw(&arrow_table); } -/** - * garrow_table_get_name: - * @table: A #GArrowTable. - * - * Returns: The name of the table. - */ -const gchar * -garrow_table_get_name(GArrowTable *table) -{ - const auto arrow_table = garrow_table_get_raw(table); - return arrow_table->name().c_str(); -} - /** * garrow_table_get_schema: * @table: A #GArrowTable. diff --git a/c_glib/arrow-glib/table.h b/c_glib/arrow-glib/table.h index 34a89a78abcbb..4dbb8c587a2ec 100644 --- a/c_glib/arrow-glib/table.h +++ b/c_glib/arrow-glib/table.h @@ -66,11 +66,9 @@ struct _GArrowTableClass GType garrow_table_get_type (void) G_GNUC_CONST; -GArrowTable *garrow_table_new (const gchar *name, - GArrowSchema *schema, +GArrowTable *garrow_table_new (GArrowSchema *schema, GList *columns); -const gchar *garrow_table_get_name (GArrowTable *table); GArrowSchema *garrow_table_get_schema (GArrowTable *table); GArrowColumn *garrow_table_get_column (GArrowTable *table, guint i); diff --git a/c_glib/test/test-table.rb b/c_glib/test/test-table.rb index 1687d2f6e3ff6..0583e8139e47a 100644 --- a/c_glib/test/test-table.rb +++ b/c_glib/test/test-table.rb @@ -29,8 +29,7 @@ def test_columns Arrow::Column.new(fields[0], build_boolean_array([true])), Arrow::Column.new(fields[1], build_boolean_array([false])), ] - table = Arrow::Table.new("memos", schema, columns) - assert_equal("memos", table.name) + table = Arrow::Table.new(schema, columns) end end @@ -45,11 +44,7 @@ def setup Arrow::Column.new(fields[0], build_boolean_array([true])), Arrow::Column.new(fields[1], build_boolean_array([false])), ] - @table = Arrow::Table.new("memos", schema, columns) - end - - def test_name - assert_equal("memos", @table.name) + @table = Arrow::Table.new(schema, columns) end def test_schema diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index e4c18ca86e4d7..e11de1b4fb0da 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -784,12 +784,10 @@ set(ARROW_SRCS src/arrow/array.cc src/arrow/buffer.cc src/arrow/builder.cc - src/arrow/column.cc src/arrow/compare.cc src/arrow/loader.cc src/arrow/memory_pool.cc src/arrow/pretty_print.cc - src/arrow/schema.cc src/arrow/status.cc src/arrow/table.cc src/arrow/tensor.cc diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index f965f1d07feef..5c9aadf9ee79b 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -22,12 +22,10 @@ install(FILES array.h buffer.h builder.h - column.h compare.h loader.h memory_pool.h pretty_print.h - schema.h status.h table.h type.h @@ -59,7 +57,6 @@ ADD_ARROW_TEST(array-string-test) ADD_ARROW_TEST(array-struct-test) ADD_ARROW_TEST(array-union-test) ADD_ARROW_TEST(buffer-test) -ADD_ARROW_TEST(column-test) ADD_ARROW_TEST(memory_pool-test) ADD_ARROW_TEST(pretty_print-test) ADD_ARROW_TEST(status-test) diff --git a/cpp/src/arrow/api.h b/cpp/src/arrow/api.h index ea818b62931d6..50a09515297ff 100644 --- a/cpp/src/arrow/api.h +++ b/cpp/src/arrow/api.h @@ -23,12 +23,10 @@ #include "arrow/array.h" #include "arrow/buffer.h" #include "arrow/builder.h" -#include 
"arrow/column.h" #include "arrow/compare.h" #include "arrow/loader.h" #include "arrow/memory_pool.h" #include "arrow/pretty_print.h" -#include "arrow/schema.h" #include "arrow/status.h" #include "arrow/table.h" #include "arrow/type.h" diff --git a/cpp/src/arrow/column-test.cc b/cpp/src/arrow/column-test.cc deleted file mode 100644 index 872fcb95c08e1..0000000000000 --- a/cpp/src/arrow/column-test.cc +++ /dev/null @@ -1,191 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#include -#include -#include -#include - -#include "gtest/gtest.h" - -#include "arrow/array.h" -#include "arrow/column.h" -#include "arrow/schema.h" -#include "arrow/test-common.h" -#include "arrow/test-util.h" -#include "arrow/type.h" - -using std::shared_ptr; -using std::vector; - -namespace arrow { - -class TestChunkedArray : public TestBase { - protected: - virtual void Construct() { - one_ = std::make_shared(arrays_one_); - another_ = std::make_shared(arrays_another_); - } - - ArrayVector arrays_one_; - ArrayVector arrays_another_; - - std::shared_ptr one_; - std::shared_ptr another_; -}; - -TEST_F(TestChunkedArray, BasicEquals) { - std::vector null_bitmap(100, true); - std::vector data(100, 1); - std::shared_ptr array; - ArrayFromVector(null_bitmap, data, &array); - arrays_one_.push_back(array); - arrays_another_.push_back(array); - - Construct(); - ASSERT_TRUE(one_->Equals(one_)); - ASSERT_FALSE(one_->Equals(nullptr)); - ASSERT_TRUE(one_->Equals(another_)); - ASSERT_TRUE(one_->Equals(*another_.get())); -} - -TEST_F(TestChunkedArray, EqualsDifferingTypes) { - std::vector null_bitmap(100, true); - std::vector data32(100, 1); - std::vector data64(100, 1); - std::shared_ptr array; - ArrayFromVector(null_bitmap, data32, &array); - arrays_one_.push_back(array); - ArrayFromVector(null_bitmap, data64, &array); - arrays_another_.push_back(array); - - Construct(); - ASSERT_FALSE(one_->Equals(another_)); - ASSERT_FALSE(one_->Equals(*another_.get())); -} - -TEST_F(TestChunkedArray, EqualsDifferingLengths) { - std::vector null_bitmap100(100, true); - std::vector null_bitmap101(101, true); - std::vector data100(100, 1); - std::vector data101(101, 1); - std::shared_ptr array; - ArrayFromVector(null_bitmap100, data100, &array); - arrays_one_.push_back(array); - ArrayFromVector(null_bitmap101, data101, &array); - arrays_another_.push_back(array); - - Construct(); - ASSERT_FALSE(one_->Equals(another_)); - ASSERT_FALSE(one_->Equals(*another_.get())); - - std::vector null_bitmap1(1, true); - std::vector data1(1, 1); - ArrayFromVector(null_bitmap1, data1, &array); - arrays_one_.push_back(array); - - Construct(); - ASSERT_TRUE(one_->Equals(another_)); - ASSERT_TRUE(one_->Equals(*another_.get())); -} - -class TestColumn : public TestChunkedArray { - protected: - void Construct() 
override { - TestChunkedArray::Construct(); - - one_col_ = std::make_shared(one_field_, one_); - another_col_ = std::make_shared(another_field_, another_); - } - - std::shared_ptr data_; - std::unique_ptr column_; - - std::shared_ptr one_field_; - std::shared_ptr another_field_; - - std::shared_ptr one_col_; - std::shared_ptr another_col_; -}; - -TEST_F(TestColumn, BasicAPI) { - ArrayVector arrays; - arrays.push_back(MakePrimitive(100)); - arrays.push_back(MakePrimitive(100, 10)); - arrays.push_back(MakePrimitive(100, 20)); - - auto field = std::make_shared("c0", int32()); - column_.reset(new Column(field, arrays)); - - ASSERT_EQ("c0", column_->name()); - ASSERT_TRUE(column_->type()->Equals(int32())); - ASSERT_EQ(300, column_->length()); - ASSERT_EQ(30, column_->null_count()); - ASSERT_EQ(3, column_->data()->num_chunks()); - - // nullptr array should not break - column_.reset(new Column(field, std::shared_ptr(nullptr))); - ASSERT_NE(column_.get(), nullptr); -} - -TEST_F(TestColumn, ChunksInhomogeneous) { - ArrayVector arrays; - arrays.push_back(MakePrimitive(100)); - arrays.push_back(MakePrimitive(100, 10)); - - auto field = std::make_shared("c0", int32()); - column_.reset(new Column(field, arrays)); - - ASSERT_OK(column_->ValidateData()); - - arrays.push_back(MakePrimitive(100, 10)); - column_.reset(new Column(field, arrays)); - ASSERT_RAISES(Invalid, column_->ValidateData()); -} - -TEST_F(TestColumn, Equals) { - std::vector null_bitmap(100, true); - std::vector data(100, 1); - std::shared_ptr array; - ArrayFromVector(null_bitmap, data, &array); - arrays_one_.push_back(array); - arrays_another_.push_back(array); - - one_field_ = std::make_shared("column", int32()); - another_field_ = std::make_shared("column", int32()); - - Construct(); - ASSERT_TRUE(one_col_->Equals(one_col_)); - ASSERT_FALSE(one_col_->Equals(nullptr)); - ASSERT_TRUE(one_col_->Equals(another_col_)); - ASSERT_TRUE(one_col_->Equals(*another_col_.get())); - - // Field is different - another_field_ = std::make_shared("two", int32()); - Construct(); - ASSERT_FALSE(one_col_->Equals(another_col_)); - ASSERT_FALSE(one_col_->Equals(*another_col_.get())); - - // ChunkedArray is different - another_field_ = std::make_shared("column", int32()); - arrays_another_.push_back(array); - Construct(); - ASSERT_FALSE(one_col_->Equals(another_col_)); - ASSERT_FALSE(one_col_->Equals(*another_col_.get())); -} - -} // namespace arrow diff --git a/cpp/src/arrow/column.cc b/cpp/src/arrow/column.cc deleted file mode 100644 index 78501f9393e22..0000000000000 --- a/cpp/src/arrow/column.cc +++ /dev/null @@ -1,132 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
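The column.cc implementation whose license header ends here is re-added, nearly verbatim, to table.cc later in this patch; its core is ChunkedArray::Equals, which compares two chunked arrays without assuming they are chunked the same way. A standalone sketch of that two-cursor walk, with chunks reduced to plain int vectors and Array::RangeEquals stood in for by std::equal (these simplifications are ours, not Arrow API; empty chunks are assumed absent, as in the tests above):

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    using Chunk = std::vector<int>;

    static int64_t TotalLength(const std::vector<Chunk>& chunks) {
      int64_t n = 0;
      for (const Chunk& c : chunks) n += static_cast<int64_t>(c.size());
      return n;
    }

    // Two cursors walk the chunk lists; each step compares the longest range
    // that fits inside both current chunks, then advances whichever cursor(s)
    // reached the end of its chunk.
    static bool ChunkedEquals(const std::vector<Chunk>& a, const std::vector<Chunk>& b) {
      const int64_t total = TotalLength(a);
      if (total != TotalLength(b)) return false;
      size_t ai = 0, bi = 0;   // current chunk index in a and b
      int64_t ao = 0, bo = 0;  // offset inside the current chunk
      int64_t compared = 0;
      while (compared < total) {
        const Chunk& ca = a[ai];
        const Chunk& cb = b[bi];
        const int64_t common = std::min<int64_t>(ca.size() - ao, cb.size() - bo);
        // Stand-in for Array::RangeEquals on the overlapping range.
        if (!std::equal(ca.begin() + ao, ca.begin() + ao + common, cb.begin() + bo)) {
          return false;
        }
        compared += common;
        if (ao + common == static_cast<int64_t>(ca.size())) { ++ai; ao = 0; } else { ao += common; }
        if (bo + common == static_cast<int64_t>(cb.size())) { ++bi; bo = 0; } else { bo += common; }
      }
      return true;
    }

    int main() {
      // Same five values, different chunking -- still equal.
      std::cout << ChunkedEquals({{1, 2}, {3, 4, 5}}, {{1, 2, 3}, {4, 5}}) << "\n";  // 1
      std::cout << ChunkedEquals({{1, 2}, {3, 4, 5}}, {{1, 2, 3}, {4, 6}}) << "\n";  // 0
    }
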
- -#include "arrow/column.h" - -#include -#include - -#include "arrow/array.h" -#include "arrow/status.h" -#include "arrow/type.h" - -namespace arrow { - -ChunkedArray::ChunkedArray(const ArrayVector& chunks) : chunks_(chunks) { - length_ = 0; - null_count_ = 0; - for (const std::shared_ptr& chunk : chunks) { - length_ += chunk->length(); - null_count_ += chunk->null_count(); - } -} - -bool ChunkedArray::Equals(const ChunkedArray& other) const { - if (length_ != other.length()) { return false; } - if (null_count_ != other.null_count()) { return false; } - - // Check contents of the underlying arrays. This checks for equality of - // the underlying data independently of the chunk size. - int this_chunk_idx = 0; - int64_t this_start_idx = 0; - int other_chunk_idx = 0; - int64_t other_start_idx = 0; - - int64_t elements_compared = 0; - while (elements_compared < length_) { - const std::shared_ptr this_array = chunks_[this_chunk_idx]; - const std::shared_ptr other_array = other.chunk(other_chunk_idx); - int64_t common_length = std::min( - this_array->length() - this_start_idx, other_array->length() - other_start_idx); - if (!this_array->RangeEquals(this_start_idx, this_start_idx + common_length, - other_start_idx, other_array)) { - return false; - } - - elements_compared += common_length; - - // If we have exhausted the current chunk, proceed to the next one individually. - if (this_start_idx + common_length == this_array->length()) { - this_chunk_idx++; - this_start_idx = 0; - } else { - this_start_idx += common_length; - } - - if (other_start_idx + common_length == other_array->length()) { - other_chunk_idx++; - other_start_idx = 0; - } else { - other_start_idx += common_length; - } - } - return true; -} - -bool ChunkedArray::Equals(const std::shared_ptr& other) const { - if (this == other.get()) { return true; } - if (!other) { return false; } - return Equals(*other.get()); -} - -Column::Column(const std::shared_ptr& field, const ArrayVector& chunks) - : field_(field) { - data_ = std::make_shared(chunks); -} - -Column::Column(const std::shared_ptr& field, const std::shared_ptr& data) - : field_(field) { - if (data) { - data_ = std::make_shared(ArrayVector({data})); - } else { - data_ = std::make_shared(ArrayVector({})); - } -} - -Column::Column(const std::string& name, const std::shared_ptr& data) - : Column(::arrow::field(name, data->type()), data) {} - -Column::Column( - const std::shared_ptr& field, const std::shared_ptr& data) - : field_(field), data_(data) {} - -bool Column::Equals(const Column& other) const { - if (!field_->Equals(other.field())) { return false; } - return data_->Equals(other.data()); -} - -bool Column::Equals(const std::shared_ptr& other) const { - if (this == other.get()) { return true; } - if (!other) { return false; } - - return Equals(*other.get()); -} - -Status Column::ValidateData() { - for (int i = 0; i < data_->num_chunks(); ++i) { - std::shared_ptr type = data_->chunk(i)->type(); - if (!this->type()->Equals(type)) { - std::stringstream ss; - ss << "In chunk " << i << " expected type " << this->type()->ToString() - << " but saw " << type->ToString(); - return Status::Invalid(ss.str()); - } - } - return Status::OK(); -} - -} // namespace arrow diff --git a/cpp/src/arrow/column.h b/cpp/src/arrow/column.h deleted file mode 100644 index bfcfd8ee459c0..0000000000000 --- a/cpp/src/arrow/column.h +++ /dev/null @@ -1,104 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. 
See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#ifndef ARROW_COLUMN_H -#define ARROW_COLUMN_H - -#include -#include -#include -#include - -#include "arrow/type.h" -#include "arrow/util/visibility.h" - -namespace arrow { - -class Array; -class Status; - -typedef std::vector> ArrayVector; - -// A data structure managing a list of primitive Arrow arrays logically as one -// large array -class ARROW_EXPORT ChunkedArray { - public: - explicit ChunkedArray(const ArrayVector& chunks); - - // @returns: the total length of the chunked array; computed on construction - int64_t length() const { return length_; } - - int64_t null_count() const { return null_count_; } - - int num_chunks() const { return static_cast(chunks_.size()); } - - std::shared_ptr chunk(int i) const { return chunks_[i]; } - - const ArrayVector& chunks() const { return chunks_; } - - bool Equals(const ChunkedArray& other) const; - bool Equals(const std::shared_ptr& other) const; - - protected: - ArrayVector chunks_; - int64_t length_; - int64_t null_count_; -}; - -// An immutable column data structure consisting of a field (type metadata) and -// a logical chunked data array (which can be validated as all being the same -// type). 
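A small usage sketch of the Column type declared just below, as the deleted tests above exercise it; this assumes the post-patch arrow/api.h and leaves chunk construction to the caller:

    #include <memory>

    #include "arrow/api.h"

    // Wraps caller-provided chunks in a Column and checks them against the
    // field's declared type, as TestColumn::ChunksInhomogeneous does above.
    arrow::Status MakeAndValidateColumn(const arrow::ArrayVector& chunks) {
      auto field = std::make_shared<arrow::Field>("c0", arrow::int32());
      auto column = std::make_shared<arrow::Column>(field, chunks);
      // name() and type() come from the field; data() is the logical ChunkedArray.
      return column->ValidateData();  // Status::Invalid on a chunk type mismatch
    }
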
-class ARROW_EXPORT Column { - public: - Column(const std::shared_ptr& field, const ArrayVector& chunks); - Column(const std::shared_ptr& field, const std::shared_ptr& data); - - Column(const std::shared_ptr& field, const std::shared_ptr& data); - - /// Construct from name and array - Column(const std::string& name, const std::shared_ptr& data); - - int64_t length() const { return data_->length(); } - - int64_t null_count() const { return data_->null_count(); } - - std::shared_ptr field() const { return field_; } - - // @returns: the column's name in the passed metadata - const std::string& name() const { return field_->name; } - - // @returns: the column's type according to the metadata - std::shared_ptr type() const { return field_->type; } - - // @returns: the column's data as a chunked logical array - std::shared_ptr data() const { return data_; } - - bool Equals(const Column& other) const; - bool Equals(const std::shared_ptr& other) const; - - // Verify that the column's array data is consistent with the passed field's - // metadata - Status ValidateData(); - - protected: - std::shared_ptr field_; - std::shared_ptr data_; -}; - -} // namespace arrow - -#endif // ARROW_COLUMN_H diff --git a/cpp/src/arrow/ipc/feather.cc b/cpp/src/arrow/ipc/feather.cc index 000bba9cce03b..5820563b43834 100644 --- a/cpp/src/arrow/ipc/feather.cc +++ b/cpp/src/arrow/ipc/feather.cc @@ -30,12 +30,12 @@ #include "arrow/array.h" #include "arrow/buffer.h" -#include "arrow/column.h" #include "arrow/io/file.h" #include "arrow/ipc/feather-internal.h" #include "arrow/ipc/feather_generated.h" #include "arrow/loader.h" #include "arrow/status.h" +#include "arrow/table.h" #include "arrow/util/bit-util.h" namespace arrow { diff --git a/cpp/src/arrow/ipc/ipc-json-test.cc b/cpp/src/arrow/ipc/ipc-json-test.cc index 68261ab25a43a..9cf6a88a6f3f1 100644 --- a/cpp/src/arrow/ipc/ipc-json-test.cc +++ b/cpp/src/arrow/ipc/ipc-json-test.cc @@ -52,7 +52,7 @@ void TestSchemaRoundTrip(const Schema& schema) { std::shared_ptr out; ASSERT_OK(ReadJsonSchema(d, &out)); - if (!schema.Equals(out)) { + if (!schema.Equals(*out)) { FAIL() << "In schema: " << schema.ToString() << "\nOut schema: " << out->ToString(); } } @@ -263,14 +263,14 @@ TEST(TestJsonFileReadWrite, BasicRoundTrip) { reinterpret_cast(result.c_str()), static_cast(result.size())); ASSERT_OK(JsonReader::Open(buffer, &reader)); - ASSERT_TRUE(reader->schema()->Equals(*schema.get())); + ASSERT_TRUE(reader->schema()->Equals(*schema)); ASSERT_EQ(nbatches, reader->num_record_batches()); for (int i = 0; i < nbatches; ++i) { std::shared_ptr batch; ASSERT_OK(reader->GetRecordBatch(i, &batch)); - ASSERT_TRUE(batch->Equals(*batches[i].get())); + ASSERT_TRUE(batch->Equals(*batches[i])); } } diff --git a/cpp/src/arrow/ipc/ipc-read-write-test.cc b/cpp/src/arrow/ipc/ipc-read-write-test.cc index 48e546eed12f5..6ddda3f339641 100644 --- a/cpp/src/arrow/ipc/ipc-read-write-test.cc +++ b/cpp/src/arrow/ipc/ipc-read-write-test.cc @@ -158,7 +158,7 @@ class IpcTestFixture : public io::MemoryMapFixture { void CheckReadResult(const RecordBatch& result, const RecordBatch& expected) { EXPECT_EQ(expected.num_rows(), result.num_rows()); - ASSERT_TRUE(expected.schema()->Equals(result.schema())); + ASSERT_TRUE(expected.schema()->Equals(*result.schema())); ASSERT_EQ(expected.num_columns(), result.num_columns()) << expected.schema()->ToString() << " result: " << result.schema()->ToString(); diff --git a/cpp/src/arrow/ipc/json-integration-test.cc b/cpp/src/arrow/ipc/json-integration-test.cc index 
c16074ee32dc6..aa95500003ec0 100644 --- a/cpp/src/arrow/ipc/json-integration-test.cc +++ b/cpp/src/arrow/ipc/json-integration-test.cc @@ -33,10 +33,10 @@ #include "arrow/ipc/reader.h" #include "arrow/ipc/writer.h" #include "arrow/pretty_print.h" -#include "arrow/schema.h" #include "arrow/status.h" #include "arrow/table.h" #include "arrow/test-util.h" +#include "arrow/type.h" DEFINE_string(arrow, "", "Arrow file name"); DEFINE_string(json, "", "JSON file name"); @@ -143,7 +143,7 @@ static Status ValidateArrowVsJson( auto json_schema = json_reader->schema(); auto arrow_schema = arrow_reader->schema(); - if (!json_schema->Equals(arrow_schema)) { + if (!json_schema->Equals(*arrow_schema)) { std::stringstream ss; ss << "JSON schema: \n" << json_schema->ToString() << "\n" @@ -170,7 +170,7 @@ static Status ValidateArrowVsJson( RETURN_NOT_OK(json_reader->GetRecordBatch(i, &json_batch)); RETURN_NOT_OK(arrow_reader->GetRecordBatch(i, &arrow_batch)); - if (!json_batch->ApproxEquals(*arrow_batch.get())) { + if (!json_batch->ApproxEquals(*arrow_batch)) { std::stringstream ss; ss << "Record batch " << i << " did not match"; diff --git a/cpp/src/arrow/ipc/json-internal.cc b/cpp/src/arrow/ipc/json-internal.cc index 95ab011bd087f..9572a0a81898d 100644 --- a/cpp/src/arrow/ipc/json-internal.cc +++ b/cpp/src/arrow/ipc/json-internal.cc @@ -33,7 +33,6 @@ #include "arrow/array.h" #include "arrow/builder.h" #include "arrow/memory_pool.h" -#include "arrow/schema.h" #include "arrow/status.h" #include "arrow/type.h" #include "arrow/type_traits.h" diff --git a/cpp/src/arrow/ipc/json.cc b/cpp/src/arrow/ipc/json.cc index a01be191aa8ad..8056b6f3e758e 100644 --- a/cpp/src/arrow/ipc/json.cc +++ b/cpp/src/arrow/ipc/json.cc @@ -26,7 +26,6 @@ #include "arrow/buffer.h" #include "arrow/ipc/json-internal.h" #include "arrow/memory_pool.h" -#include "arrow/schema.h" #include "arrow/status.h" #include "arrow/table.h" #include "arrow/type.h" diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index 85dc8b321c41d..36ba4b26042a8 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -29,7 +29,6 @@ #include "arrow/io/interfaces.h" #include "arrow/ipc/File_generated.h" #include "arrow/ipc/Message_generated.h" -#include "arrow/schema.h" #include "arrow/status.h" #include "arrow/type.h" diff --git a/cpp/src/arrow/ipc/reader.cc b/cpp/src/arrow/ipc/reader.cc index 03c678ab7e280..28320d98df9d1 100644 --- a/cpp/src/arrow/ipc/reader.cc +++ b/cpp/src/arrow/ipc/reader.cc @@ -30,9 +30,9 @@ #include "arrow/ipc/Message_generated.h" #include "arrow/ipc/metadata.h" #include "arrow/ipc/util.h" -#include "arrow/schema.h" #include "arrow/status.h" #include "arrow/table.h" +#include "arrow/type.h" #include "arrow/util/logging.h" namespace arrow { diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index 994e1283004a9..583f909d071e6 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -49,7 +49,7 @@ static inline void AssertSchemaEqual(const Schema& lhs, const Schema& rhs) { } static inline void CompareBatch(const RecordBatch& left, const RecordBatch& right) { - if (!left.schema()->Equals(right.schema())) { + if (!left.schema()->Equals(*right.schema())) { FAIL() << "Left schema: " << left.schema()->ToString() << "\nRight schema: " << right.schema()->ToString(); } diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc index 92e61941937a6..db5f0829f92f7 100644 --- a/cpp/src/arrow/ipc/writer.cc +++ b/cpp/src/arrow/ipc/writer.cc @@ 
-32,7 +32,6 @@ #include "arrow/ipc/util.h" #include "arrow/loader.h" #include "arrow/memory_pool.h" -#include "arrow/schema.h" #include "arrow/status.h" #include "arrow/table.h" #include "arrow/type.h" diff --git a/cpp/src/arrow/python/pandas-test.cc b/cpp/src/arrow/python/pandas-test.cc index ae2527e19c00e..0d643df2e9f38 100644 --- a/cpp/src/arrow/python/pandas-test.cc +++ b/cpp/src/arrow/python/pandas-test.cc @@ -25,7 +25,6 @@ #include "arrow/array.h" #include "arrow/builder.h" #include "arrow/python/pandas_convert.h" -#include "arrow/schema.h" #include "arrow/table.h" #include "arrow/test-util.h" #include "arrow/type.h" @@ -52,7 +51,7 @@ TEST(PandasConversionTest, TestObjectBlockWriteFails) { std::make_shared(f2, arr), std::make_shared(f3, arr)}; auto schema = std::make_shared(fields); - auto table = std::make_shared
("", schema, cols); + auto table = std::make_shared
(schema, cols); PyObject* out; Py_BEGIN_ALLOW_THREADS; diff --git a/cpp/src/arrow/python/pandas_convert.cc b/cpp/src/arrow/python/pandas_convert.cc index f2c2415ed2793..685b1f421c457 100644 --- a/cpp/src/arrow/python/pandas_convert.cc +++ b/cpp/src/arrow/python/pandas_convert.cc @@ -35,7 +35,6 @@ #include #include "arrow/array.h" -#include "arrow/column.h" #include "arrow/loader.h" #include "arrow/python/builtin_convert.h" #include "arrow/python/common.h" diff --git a/cpp/src/arrow/schema.cc b/cpp/src/arrow/schema.cc deleted file mode 100644 index aa38fd3dd9260..0000000000000 --- a/cpp/src/arrow/schema.cc +++ /dev/null @@ -1,72 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#include "arrow/schema.h" - -#include -#include -#include -#include - -#include "arrow/type.h" - -namespace arrow { - -Schema::Schema(const std::vector>& fields) : fields_(fields) {} - -bool Schema::Equals(const Schema& other) const { - if (this == &other) { return true; } - - if (num_fields() != other.num_fields()) { return false; } - for (int i = 0; i < num_fields(); ++i) { - if (!field(i)->Equals(*other.field(i).get())) { return false; } - } - return true; -} - -bool Schema::Equals(const std::shared_ptr& other) const { - return Equals(*other.get()); -} - -std::shared_ptr Schema::GetFieldByName(const std::string& name) { - if (fields_.size() > 0 && name_to_index_.size() == 0) { - for (size_t i = 0; i < fields_.size(); ++i) { - name_to_index_[fields_[i]->name] = static_cast(i); - } - } - - auto it = name_to_index_.find(name); - if (it == name_to_index_.end()) { - return nullptr; - } else { - return fields_[it->second]; - } -} - -std::string Schema::ToString() const { - std::stringstream buffer; - - int i = 0; - for (auto field : fields_) { - if (i > 0) { buffer << std::endl; } - buffer << field->ToString(); - ++i; - } - return buffer.str(); -} - -} // namespace arrow diff --git a/cpp/src/arrow/schema.h b/cpp/src/arrow/schema.h deleted file mode 100644 index 37cdbf7d786a4..0000000000000 --- a/cpp/src/arrow/schema.h +++ /dev/null @@ -1,59 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. 
See the License for the -// specific language governing permissions and limitations -// under the License. - -#ifndef ARROW_SCHEMA_H -#define ARROW_SCHEMA_H - -#include -#include -#include -#include - -#include "arrow/type.h" -#include "arrow/util/visibility.h" - -namespace arrow { - -class ARROW_EXPORT Schema { - public: - explicit Schema(const std::vector>& fields); - - // Returns true if all of the schema fields are equal - bool Equals(const Schema& other) const; - bool Equals(const std::shared_ptr& other) const; - - // Return the ith schema element. Does not boundscheck - std::shared_ptr field(int i) const { return fields_[i]; } - - // Returns nullptr if name not found - std::shared_ptr GetFieldByName(const std::string& name); - - const std::vector>& fields() const { return fields_; } - - // Render a string representation of the schema suitable for debugging - std::string ToString() const; - - int num_fields() const { return static_cast(fields_.size()); } - - private: - std::vector> fields_; - std::unordered_map name_to_index_; -}; - -} // namespace arrow - -#endif // ARROW_FIELD_H diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc index 6bb31638ecbbf..38533063cbc07 100644 --- a/cpp/src/arrow/table-test.cc +++ b/cpp/src/arrow/table-test.cc @@ -22,8 +22,6 @@ #include "gtest/gtest.h" #include "arrow/array.h" -#include "arrow/column.h" -#include "arrow/schema.h" #include "arrow/status.h" #include "arrow/table.h" #include "arrow/test-common.h" @@ -35,6 +33,160 @@ using std::vector; namespace arrow { +class TestChunkedArray : public TestBase { + protected: + virtual void Construct() { + one_ = std::make_shared(arrays_one_); + another_ = std::make_shared(arrays_another_); + } + + ArrayVector arrays_one_; + ArrayVector arrays_another_; + + std::shared_ptr one_; + std::shared_ptr another_; +}; + +TEST_F(TestChunkedArray, BasicEquals) { + std::vector null_bitmap(100, true); + std::vector data(100, 1); + std::shared_ptr array; + ArrayFromVector(null_bitmap, data, &array); + arrays_one_.push_back(array); + arrays_another_.push_back(array); + + Construct(); + ASSERT_TRUE(one_->Equals(one_)); + ASSERT_FALSE(one_->Equals(nullptr)); + ASSERT_TRUE(one_->Equals(another_)); + ASSERT_TRUE(one_->Equals(*another_.get())); +} + +TEST_F(TestChunkedArray, EqualsDifferingTypes) { + std::vector null_bitmap(100, true); + std::vector data32(100, 1); + std::vector data64(100, 1); + std::shared_ptr array; + ArrayFromVector(null_bitmap, data32, &array); + arrays_one_.push_back(array); + ArrayFromVector(null_bitmap, data64, &array); + arrays_another_.push_back(array); + + Construct(); + ASSERT_FALSE(one_->Equals(another_)); + ASSERT_FALSE(one_->Equals(*another_.get())); +} + +TEST_F(TestChunkedArray, EqualsDifferingLengths) { + std::vector null_bitmap100(100, true); + std::vector null_bitmap101(101, true); + std::vector data100(100, 1); + std::vector data101(101, 1); + std::shared_ptr array; + ArrayFromVector(null_bitmap100, data100, &array); + arrays_one_.push_back(array); + ArrayFromVector(null_bitmap101, data101, &array); + arrays_another_.push_back(array); + + Construct(); + ASSERT_FALSE(one_->Equals(another_)); + ASSERT_FALSE(one_->Equals(*another_.get())); + + std::vector null_bitmap1(1, true); + std::vector data1(1, 1); + ArrayFromVector(null_bitmap1, data1, &array); + arrays_one_.push_back(array); + + Construct(); + ASSERT_TRUE(one_->Equals(another_)); + ASSERT_TRUE(one_->Equals(*another_.get())); +} + +class TestColumn : public TestChunkedArray { + protected: + void Construct() override { 
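+    // Reuse the chunked arrays built by the base fixture and wrap each one
+    // in a Column together with its field.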
+ TestChunkedArray::Construct(); + + one_col_ = std::make_shared(one_field_, one_); + another_col_ = std::make_shared(another_field_, another_); + } + + std::shared_ptr data_; + std::unique_ptr column_; + + std::shared_ptr one_field_; + std::shared_ptr another_field_; + + std::shared_ptr one_col_; + std::shared_ptr another_col_; +}; + +TEST_F(TestColumn, BasicAPI) { + ArrayVector arrays; + arrays.push_back(MakePrimitive(100)); + arrays.push_back(MakePrimitive(100, 10)); + arrays.push_back(MakePrimitive(100, 20)); + + auto field = std::make_shared("c0", int32()); + column_.reset(new Column(field, arrays)); + + ASSERT_EQ("c0", column_->name()); + ASSERT_TRUE(column_->type()->Equals(int32())); + ASSERT_EQ(300, column_->length()); + ASSERT_EQ(30, column_->null_count()); + ASSERT_EQ(3, column_->data()->num_chunks()); + + // nullptr array should not break + column_.reset(new Column(field, std::shared_ptr(nullptr))); + ASSERT_NE(column_.get(), nullptr); +} + +TEST_F(TestColumn, ChunksInhomogeneous) { + ArrayVector arrays; + arrays.push_back(MakePrimitive(100)); + arrays.push_back(MakePrimitive(100, 10)); + + auto field = std::make_shared("c0", int32()); + column_.reset(new Column(field, arrays)); + + ASSERT_OK(column_->ValidateData()); + + arrays.push_back(MakePrimitive(100, 10)); + column_.reset(new Column(field, arrays)); + ASSERT_RAISES(Invalid, column_->ValidateData()); +} + +TEST_F(TestColumn, Equals) { + std::vector null_bitmap(100, true); + std::vector data(100, 1); + std::shared_ptr array; + ArrayFromVector(null_bitmap, data, &array); + arrays_one_.push_back(array); + arrays_another_.push_back(array); + + one_field_ = std::make_shared("column", int32()); + another_field_ = std::make_shared("column", int32()); + + Construct(); + ASSERT_TRUE(one_col_->Equals(one_col_)); + ASSERT_FALSE(one_col_->Equals(nullptr)); + ASSERT_TRUE(one_col_->Equals(another_col_)); + ASSERT_TRUE(one_col_->Equals(*another_col_.get())); + + // Field is different + another_field_ = std::make_shared("two", int32()); + Construct(); + ASSERT_FALSE(one_col_->Equals(another_col_)); + ASSERT_FALSE(one_col_->Equals(*another_col_.get())); + + // ChunkedArray is different + another_field_ = std::make_shared("column", int32()); + arrays_another_.push_back(array); + Construct(); + ASSERT_FALSE(one_col_->Equals(another_col_)); + ASSERT_FALSE(one_col_->Equals(*another_col_.get())); +} + class TestTable : public TestBase { public: void MakeExample1(int length) { @@ -63,7 +215,7 @@ class TestTable : public TestBase { TEST_F(TestTable, EmptySchema) { auto empty_schema = shared_ptr(new Schema({})); - table_.reset(new Table("data", empty_schema, columns_)); + table_.reset(new Table(empty_schema, columns_)); ASSERT_OK(table_->ValidateColumns()); ASSERT_EQ(0, table_->num_rows()); ASSERT_EQ(0, table_->num_columns()); @@ -73,17 +225,13 @@ TEST_F(TestTable, Ctors) { const int length = 100; MakeExample1(length); - std::string name = "data"; - - table_.reset(new Table(name, schema_, columns_)); + table_.reset(new Table(schema_, columns_)); ASSERT_OK(table_->ValidateColumns()); - ASSERT_EQ(name, table_->name()); ASSERT_EQ(length, table_->num_rows()); ASSERT_EQ(3, table_->num_columns()); - table_.reset(new Table(name, schema_, columns_, length)); + table_.reset(new Table(schema_, columns_, length)); ASSERT_OK(table_->ValidateColumns()); - ASSERT_EQ(name, table_->name()); ASSERT_EQ(length, table_->num_rows()); } @@ -91,10 +239,9 @@ TEST_F(TestTable, Metadata) { const int length = 100; MakeExample1(length); - std::string name = "data"; - 
table_.reset(new Table(name, schema_, columns_)); + table_.reset(new Table(schema_, columns_)); - ASSERT_TRUE(table_->schema()->Equals(schema_)); + ASSERT_TRUE(table_->schema()->Equals(*schema_)); auto col = table_->column(0); ASSERT_EQ(schema_->field(0)->name, col->name()); @@ -106,13 +253,13 @@ TEST_F(TestTable, InvalidColumns) { const int length = 100; MakeExample1(length); - table_.reset(new Table("data", schema_, columns_, length - 1)); + table_.reset(new Table(schema_, columns_, length - 1)); ASSERT_RAISES(Invalid, table_->ValidateColumns()); columns_.clear(); // Wrong number of columns - table_.reset(new Table("data", schema_, columns_, length)); + table_.reset(new Table(schema_, columns_, length)); ASSERT_RAISES(Invalid, table_->ValidateColumns()); columns_ = { @@ -120,7 +267,7 @@ TEST_F(TestTable, InvalidColumns) { std::make_shared(schema_->field(1), MakePrimitive(length)), std::make_shared(schema_->field(2), MakePrimitive(length - 1))}; - table_.reset(new Table("data", schema_, columns_, length)); + table_.reset(new Table(schema_, columns_, length)); ASSERT_RAISES(Invalid, table_->ValidateColumns()); } @@ -128,26 +275,22 @@ TEST_F(TestTable, Equals) { const int length = 100; MakeExample1(length); - std::string name = "data"; - table_.reset(new Table(name, schema_, columns_)); + table_.reset(new Table(schema_, columns_)); - ASSERT_TRUE(table_->Equals(table_)); - ASSERT_FALSE(table_->Equals(nullptr)); - // Differing name - ASSERT_FALSE(table_->Equals(std::make_shared
("other_name", schema_, columns_))); + ASSERT_TRUE(table_->Equals(*table_)); // Differing schema auto f0 = std::make_shared("f3", int32()); auto f1 = std::make_shared("f4", uint8()); auto f2 = std::make_shared("f5", int16()); vector> fields = {f0, f1, f2}; auto other_schema = std::make_shared(fields); - ASSERT_FALSE(table_->Equals(std::make_shared
+  ASSERT_FALSE(table_->Equals(Table(other_schema, columns_)));
 
   // Differing columns
   std::vector<std::shared_ptr<Column>> other_columns = {
       std::make_shared<Column>(schema_->field(0), MakePrimitive<Int32Array>(length, 10)),
       std::make_shared<Column>(schema_->field(1), MakePrimitive<UInt8Array>(length, 10)),
       std::make_shared<Column>(schema_->field(2), MakePrimitive<Int16Array>(length, 10))};
-  ASSERT_FALSE(table_->Equals(std::make_shared<Table>(name, schema_, other_columns)));
+  ASSERT_FALSE(table_->Equals(Table(schema_, other_columns)));
 }
 
 TEST_F(TestTable, FromRecordBatches) {
@@ -157,10 +300,10 @@
   auto batch1 = std::make_shared<RecordBatch>(schema_, length, arrays_);
 
   std::shared_ptr<Table> result, expected;
-  ASSERT_OK(Table::FromRecordBatches("foo", {batch1}, &result));
+  ASSERT_OK(Table::FromRecordBatches({batch1}, &result));
-  expected = std::make_shared<Table>("foo", schema_, columns_);
("foo", schema_, columns_); - ASSERT_TRUE(result->Equals(expected)); + expected = std::make_shared
+  ASSERT_TRUE(result->Equals(*expected));
 
   std::vector<std::shared_ptr<Column>> other_columns;
   for (int i = 0; i < schema_->num_fields(); ++i) {
@@ -168,20 +311,20 @@
     other_columns.push_back(std::make_shared<Column>(schema_->field(i), col_arrays));
   }
 
-  ASSERT_OK(Table::FromRecordBatches("foo", {batch1, batch1}, &result));
-  expected = std::make_shared<Table>("foo", schema_, other_columns);
("foo", schema_, other_columns); - ASSERT_TRUE(result->Equals(expected)); + ASSERT_OK(Table::FromRecordBatches({batch1, batch1}, &result)); + expected = std::make_shared
+  ASSERT_TRUE(result->Equals(*expected));
 
   // Error states
   std::vector<std::shared_ptr<RecordBatch>> empty_batches;
-  ASSERT_RAISES(Invalid, Table::FromRecordBatches("", empty_batches, &result));
+  ASSERT_RAISES(Invalid, Table::FromRecordBatches(empty_batches, &result));
 
   std::vector<std::shared_ptr<Field>> fields = {schema_->field(0), schema_->field(1)};
   auto other_schema = std::make_shared<Schema>(fields);
 
   std::vector<std::shared_ptr<Array>> other_arrays = {arrays_[0], arrays_[1]};
   auto batch2 = std::make_shared<RecordBatch>(other_schema, length, other_arrays);
-  ASSERT_RAISES(Invalid, Table::FromRecordBatches("", {batch1, batch2}, &result));
+  ASSERT_RAISES(Invalid, Table::FromRecordBatches({batch1, batch2}, &result));
 }
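The factory exercised on both sides of this point now takes no table name, and mismatched schemas surface as Status::Invalid. A minimal sketch of the post-patch call, assuming the API exactly as shown in this diff (batch construction elided):

    #include <memory>
    #include <vector>

    #include "arrow/api.h"

    arrow::Status BatchesToTable(
        const std::vector<std::shared_ptr<arrow::RecordBatch>>& batches,
        std::shared_ptr<arrow::Table>* out) {
      // Invalid if `batches` is empty or the batch schemas are not all equal.
      return arrow::Table::FromRecordBatches(batches, out);
    }
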
 
 TEST_F(TestTable, ConcatenateTables) {
@@ -195,25 +338,50 @@
   auto batch2 = std::make_shared<RecordBatch>(schema_, length, arrays_);
 
   std::shared_ptr<Table> t1, t2, t3, result, expected;
-  ASSERT_OK(Table::FromRecordBatches("foo", {batch1}, &t1));
-  ASSERT_OK(Table::FromRecordBatches("foo", {batch2}, &t2));
+  ASSERT_OK(Table::FromRecordBatches({batch1}, &t1));
+  ASSERT_OK(Table::FromRecordBatches({batch2}, &t2));
 
-  ASSERT_OK(ConcatenateTables("bar", {t1, t2}, &result));
-  ASSERT_OK(Table::FromRecordBatches("bar", {batch1, batch2}, &expected));
-  ASSERT_TRUE(result->Equals(expected));
+  ASSERT_OK(ConcatenateTables({t1, t2}, &result));
+  ASSERT_OK(Table::FromRecordBatches({batch1, batch2}, &expected));
+  ASSERT_TRUE(result->Equals(*expected));
 
   // Error states
   std::vector<std::shared_ptr<Table>> empty_tables;
-  ASSERT_RAISES(Invalid, ConcatenateTables("", empty_tables, &result));
+  ASSERT_RAISES(Invalid, ConcatenateTables(empty_tables, &result));
 
   std::vector<std::shared_ptr<Field>> fields = {schema_->field(0), schema_->field(1)};
   auto other_schema = std::make_shared<Schema>(fields);
 
   std::vector<std::shared_ptr<Array>> other_arrays = {arrays_[0], arrays_[1]};
   auto batch3 = std::make_shared<RecordBatch>(other_schema, length, other_arrays);
-  ASSERT_OK(Table::FromRecordBatches("", {batch3}, &t3));
+  ASSERT_OK(Table::FromRecordBatches({batch3}, &t3));
+
+  ASSERT_RAISES(Invalid, ConcatenateTables({t1, t3}, &result));
+}
+
+TEST_F(TestTable, RemoveColumn) {
+  const int64_t length = 10;
+  MakeExample1(length);
+
+  Table table(schema_, columns_);
+
+  std::shared_ptr<Table>
result; + ASSERT_OK(table.RemoveColumn(0, &result)); + + auto ex_schema = + std::shared_ptr(new Schema({schema_->field(1), schema_->field(2)})); + std::vector> ex_columns = {table.column(1), table.column(2)}; + ASSERT_TRUE(result->Equals(Table(ex_schema, ex_columns))); + + ASSERT_OK(table.RemoveColumn(1, &result)); + ex_schema = std::shared_ptr(new Schema({schema_->field(0), schema_->field(2)})); + ex_columns = {table.column(0), table.column(2)}; + ASSERT_TRUE(result->Equals(Table(ex_schema, ex_columns))); - ASSERT_RAISES(Invalid, ConcatenateTables("foo", {t1, t3}, &result)); + ASSERT_OK(table.RemoveColumn(2, &result)); + ex_schema = std::shared_ptr(new Schema({schema_->field(0), schema_->field(1)})); + ex_columns = {table.column(0), table.column(1)}; + ASSERT_TRUE(result->Equals(Table(ex_schema, ex_columns))); } class TestRecordBatch : public TestBase {}; diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc index 3f254aae6d3fa..8e283f4da9bb7 100644 --- a/cpp/src/arrow/table.cc +++ b/cpp/src/arrow/table.cc @@ -23,12 +23,122 @@ #include #include "arrow/array.h" -#include "arrow/column.h" -#include "arrow/schema.h" #include "arrow/status.h" +#include "arrow/type.h" +#include "arrow/util/logging.h" +#include "arrow/util/stl.h" namespace arrow { +// ---------------------------------------------------------------------- +// ChunkedArray and Column methods + +ChunkedArray::ChunkedArray(const ArrayVector& chunks) : chunks_(chunks) { + length_ = 0; + null_count_ = 0; + for (const std::shared_ptr& chunk : chunks) { + length_ += chunk->length(); + null_count_ += chunk->null_count(); + } +} + +bool ChunkedArray::Equals(const ChunkedArray& other) const { + if (length_ != other.length()) { return false; } + if (null_count_ != other.null_count()) { return false; } + + // Check contents of the underlying arrays. This checks for equality of + // the underlying data independently of the chunk size. + int this_chunk_idx = 0; + int64_t this_start_idx = 0; + int other_chunk_idx = 0; + int64_t other_start_idx = 0; + + int64_t elements_compared = 0; + while (elements_compared < length_) { + const std::shared_ptr this_array = chunks_[this_chunk_idx]; + const std::shared_ptr other_array = other.chunk(other_chunk_idx); + int64_t common_length = std::min( + this_array->length() - this_start_idx, other_array->length() - other_start_idx); + if (!this_array->RangeEquals(this_start_idx, this_start_idx + common_length, + other_start_idx, other_array)) { + return false; + } + + elements_compared += common_length; + + // If we have exhausted the current chunk, proceed to the next one individually. 
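+    // At this point elements [0, elements_compared) of the two arrays are
+    // known to match; each cursor advances on its own below, which is what
+    // makes the comparison independent of how the values are chunked.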
+ if (this_start_idx + common_length == this_array->length()) { + this_chunk_idx++; + this_start_idx = 0; + } else { + this_start_idx += common_length; + } + + if (other_start_idx + common_length == other_array->length()) { + other_chunk_idx++; + other_start_idx = 0; + } else { + other_start_idx += common_length; + } + } + return true; +} + +bool ChunkedArray::Equals(const std::shared_ptr& other) const { + if (this == other.get()) { return true; } + if (!other) { return false; } + return Equals(*other.get()); +} + +Column::Column(const std::shared_ptr& field, const ArrayVector& chunks) + : field_(field) { + data_ = std::make_shared(chunks); +} + +Column::Column(const std::shared_ptr& field, const std::shared_ptr& data) + : field_(field) { + if (data) { + data_ = std::make_shared(ArrayVector({data})); + } else { + data_ = std::make_shared(ArrayVector({})); + } +} + +Column::Column(const std::string& name, const std::shared_ptr& data) + : Column(::arrow::field(name, data->type()), data) {} + +Column::Column( + const std::shared_ptr& field, const std::shared_ptr& data) + : field_(field), data_(data) {} + +bool Column::Equals(const Column& other) const { + if (!field_->Equals(other.field())) { return false; } + return data_->Equals(other.data()); +} + +bool Column::Equals(const std::shared_ptr& other) const { + if (this == other.get()) { return true; } + if (!other) { return false; } + + return Equals(*other.get()); +} + +Status Column::ValidateData() { + for (int i = 0; i < data_->num_chunks(); ++i) { + std::shared_ptr type = data_->chunk(i)->type(); + if (!this->type()->Equals(type)) { + std::stringstream ss; + ss << "In chunk " << i << " expected type " << this->type()->ToString() + << " but saw " << type->ToString(); + return Status::Invalid(ss.str()); + } + } + return Status::OK(); +} + +// ---------------------------------------------------------------------- +// RecordBatch methods + RecordBatch::RecordBatch(const std::shared_ptr& schema, int64_t num_rows, const std::vector>& columns) : schema_(schema), num_rows_(num_rows), columns_(columns) {} @@ -83,9 +193,9 @@ std::shared_ptr RecordBatch::Slice(int64_t offset, int64_t length) // ---------------------------------------------------------------------- // Table methods -Table::Table(const std::string& name, const std::shared_ptr& schema, +Table::Table(const std::shared_ptr& schema, const std::vector>& columns) - : name_(name), schema_(schema), columns_(columns) { + : schema_(schema), columns_(columns) { if (columns.size() == 0) { num_rows_ = 0; } else { @@ -93,12 +203,11 @@ Table::Table(const std::string& name, const std::shared_ptr& schema, } } -Table::Table(const std::string& name, const std::shared_ptr& schema, +Table::Table(const std::shared_ptr& schema, const std::vector>& columns, int64_t num_rows) - : name_(name), schema_(schema), columns_(columns), num_rows_(num_rows) {} + : schema_(schema), columns_(columns), num_rows_(num_rows) {} -Status Table::FromRecordBatches(const std::string& name, - const std::vector>& batches, +Status Table::FromRecordBatches(const std::vector>& batches, std::shared_ptr
* table) { if (batches.size() == 0) { return Status::Invalid("Must pass at least one record batch"); @@ -110,7 +219,7 @@ Status Table::FromRecordBatches(const std::string& name, const int ncolumns = static_cast(schema->num_fields()); for (int i = 1; i < nbatches; ++i) { - if (!batches[i]->schema()->Equals(schema)) { + if (!batches[i]->schema()->Equals(*schema)) { std::stringstream ss; ss << "Schema at index " << static_cast(i) << " was different: \n" << schema->ToString() << "\nvs\n" @@ -129,11 +238,11 @@ Status Table::FromRecordBatches(const std::string& name, columns[i] = std::make_shared(schema->field(i), column_arrays); } - *table = std::make_shared
<Table>(name, schema, columns);
+  *table = std::make_shared<Table>
(schema, columns); return Status::OK(); } -Status ConcatenateTables(const std::string& output_name, +Status ConcatenateTables( const std::vector>& tables, std::shared_ptr
* table) { if (tables.size() == 0) { return Status::Invalid("Must pass at least one table"); } @@ -143,7 +252,7 @@ Status ConcatenateTables(const std::string& output_name, const int ncolumns = static_cast(schema->num_fields()); for (int i = 1; i < ntables; ++i) { - if (!tables[i]->schema()->Equals(schema)) { + if (!tables[i]->schema()->Equals(*schema)) { std::stringstream ss; ss << "Schema at index " << static_cast(i) << " was different: \n" << schema->ToString() << "\nvs\n" @@ -164,13 +273,13 @@ Status ConcatenateTables(const std::string& output_name, } columns[i] = std::make_shared(schema->field(i), column_arrays); } - *table = std::make_shared
<Table>(output_name, schema, columns);
+  *table = std::make_shared<Table>
(schema, columns); return Status::OK(); } bool Table::Equals(const Table& other) const { - if (name_ != other.name()) { return false; } - if (!schema_->Equals(other.schema())) { return false; } + if (this == &other) { return true; } + if (!schema_->Equals(*other.schema())) { return false; } if (static_cast(columns_.size()) != other.num_columns()) { return false; } for (int i = 0; i < static_cast(columns_.size()); i++) { @@ -179,10 +288,12 @@ bool Table::Equals(const Table& other) const { return true; } -bool Table::Equals(const std::shared_ptr
& other) const { - if (this == other.get()) { return true; } - if (!other) { return false; } - return Equals(*other.get()); +Status Table::RemoveColumn(int i, std::shared_ptr
* out) const {
+  std::shared_ptr<Schema> new_schema;
+  RETURN_NOT_OK(schema_->RemoveField(i, &new_schema));
+
+  *out = std::make_shared<Table>
(new_schema, DeleteVectorElement(columns_, i)); + return Status::OK(); } Status Table::ValidateColumns() const { diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h index bf0d99c4e9d2b..7b739c9a1b314 100644 --- a/cpp/src/arrow/table.h +++ b/cpp/src/arrow/table.h @@ -23,6 +23,7 @@ #include #include +#include "arrow/type.h" #include "arrow/util/visibility.h" namespace arrow { @@ -32,6 +33,74 @@ class Column; class Schema; class Status; +using ArrayVector = std::vector>; + +// A data structure managing a list of primitive Arrow arrays logically as one +// large array +class ARROW_EXPORT ChunkedArray { + public: + explicit ChunkedArray(const ArrayVector& chunks); + + // @returns: the total length of the chunked array; computed on construction + int64_t length() const { return length_; } + + int64_t null_count() const { return null_count_; } + + int num_chunks() const { return static_cast(chunks_.size()); } + + std::shared_ptr chunk(int i) const { return chunks_[i]; } + + const ArrayVector& chunks() const { return chunks_; } + + bool Equals(const ChunkedArray& other) const; + bool Equals(const std::shared_ptr& other) const; + + protected: + ArrayVector chunks_; + int64_t length_; + int64_t null_count_; +}; + +// An immutable column data structure consisting of a field (type metadata) and +// a logical chunked data array (which can be validated as all being the same +// type). +class ARROW_EXPORT Column { + public: + Column(const std::shared_ptr& field, const ArrayVector& chunks); + Column(const std::shared_ptr& field, const std::shared_ptr& data); + + Column(const std::shared_ptr& field, const std::shared_ptr& data); + + /// Construct from name and array + Column(const std::string& name, const std::shared_ptr& data); + + int64_t length() const { return data_->length(); } + + int64_t null_count() const { return data_->null_count(); } + + std::shared_ptr field() const { return field_; } + + // @returns: the column's name in the passed metadata + const std::string& name() const { return field_->name; } + + // @returns: the column's type according to the metadata + std::shared_ptr type() const { return field_->type; } + + // @returns: the column's data as a chunked logical array + std::shared_ptr data() const { return data_; } + + bool Equals(const Column& other) const; + bool Equals(const std::shared_ptr& other) const; + + // Verify that the column's array data is consistent with the passed field's + // metadata + Status ValidateData(); + + protected: + std::shared_ptr field_; + std::shared_ptr data_; +}; + // A record batch is a simpler and more rigid table data structure intended for // use primarily in shared memory IPC. It contains a schema (metadata) and a // corresponding sequence of equal-length Arrow arrays @@ -81,25 +150,22 @@ class ARROW_EXPORT RecordBatch { class ARROW_EXPORT Table { public: // If columns is zero-length, the table's number of rows is zero - Table(const std::string& name, const std::shared_ptr& schema, + Table(const std::shared_ptr& schema, const std::vector>& columns); // num_rows is a parameter to allow for tables of a particular size not // having any materialized columns. 
Each column should therefore have the // same length as num_rows -- you can validate this using // Table::ValidateColumns - Table(const std::string& name, const std::shared_ptr& schema, - const std::vector>& columns, int64_t nubm_rows); + Table(const std::shared_ptr& schema, + const std::vector>& columns, int64_t num_rows); // Construct table from RecordBatch, but only if all of the batch schemas are // equal. Returns Status::Invalid if there is some problem - static Status FromRecordBatches(const std::string& name, + static Status FromRecordBatches( const std::vector>& batches, std::shared_ptr
* table); - // @returns: the table's name, if any (may be length 0) - const std::string& name() const { return name_; } - // @returns: the table's schema std::shared_ptr schema() const { return schema_; } @@ -107,6 +173,10 @@ class ARROW_EXPORT Table { // @returns: the i-th column std::shared_ptr column(int i) const { return columns_[i]; } + /// Remove column from the table, producing a new Table (because tables and + /// schemas are immutable) + Status RemoveColumn(int i, std::shared_ptr
<Table>* out) const;
+
   // @returns: the number of columns in the table
   int num_columns() const { return static_cast<int>(columns_.size()); }
 
@@ -114,15 +184,11 @@
   int64_t num_rows() const { return num_rows_; }
 
   bool Equals(const Table& other) const;
-  bool Equals(const std::shared_ptr<Table>
& other) const; // After construction, perform any checks to validate the input arguments Status ValidateColumns() const; private: - // The table's name, optional - std::string name_; - std::shared_ptr schema_; std::vector> columns_; @@ -131,7 +197,7 @@ class ARROW_EXPORT Table { // Construct table from multiple input tables. Return Status::Invalid if // schemas are not equal -Status ARROW_EXPORT ConcatenateTables(const std::string& output_name, +Status ARROW_EXPORT ConcatenateTables( const std::vector>& tables, std::shared_ptr
* table); } // namespace arrow diff --git a/cpp/src/arrow/test-common.h b/cpp/src/arrow/test-common.h index f704b6b545b7d..dc11e76edf465 100644 --- a/cpp/src/arrow/test-common.h +++ b/cpp/src/arrow/test-common.h @@ -29,7 +29,6 @@ #include "arrow/array.h" #include "arrow/buffer.h" #include "arrow/builder.h" -#include "arrow/column.h" #include "arrow/memory_pool.h" #include "arrow/table.h" #include "arrow/test-util.h" diff --git a/cpp/src/arrow/test-util.h b/cpp/src/arrow/test-util.h index bed555984fb68..94937b592cc33 100644 --- a/cpp/src/arrow/test-util.h +++ b/cpp/src/arrow/test-util.h @@ -30,9 +30,7 @@ #include "arrow/array.h" #include "arrow/buffer.h" #include "arrow/builder.h" -#include "arrow/column.h" #include "arrow/memory_pool.h" -#include "arrow/schema.h" #include "arrow/status.h" #include "arrow/table.h" #include "arrow/type.h" diff --git a/cpp/src/arrow/type-test.cc b/cpp/src/arrow/type-test.cc index ed8654314ee6d..70c173432a960 100644 --- a/cpp/src/arrow/type-test.cc +++ b/cpp/src/arrow/type-test.cc @@ -23,7 +23,6 @@ #include "gtest/gtest.h" -#include "arrow/schema.h" #include "arrow/type.h" using std::shared_ptr; @@ -75,11 +74,8 @@ TEST_F(TestSchema, Basics) { vector> fields3 = {f0, f1_optional, f2}; auto schema3 = std::make_shared(fields3); - ASSERT_TRUE(schema->Equals(schema2)); - ASSERT_FALSE(schema->Equals(schema3)); - - ASSERT_TRUE(schema->Equals(*schema2.get())); - ASSERT_FALSE(schema->Equals(*schema3.get())); + ASSERT_TRUE(schema->Equals(*schema2)); + ASSERT_FALSE(schema->Equals(*schema3)); } TEST_F(TestSchema, ToString) { diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index c790f6e5a4345..e6e6f5c3e8bc7 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -24,6 +24,7 @@ #include "arrow/compare.h" #include "arrow/status.h" #include "arrow/util/logging.h" +#include "arrow/util/stl.h" #include "arrow/visitor.h" namespace arrow { @@ -45,6 +46,8 @@ std::string Field::ToString() const { return ss.str(); } +DataType::~DataType() {} + bool DataType::Equals(const DataType& other) const { bool are_equal = false; Status error = TypeEquals(*this, other, &are_equal); @@ -224,6 +227,56 @@ std::string NullType::ToString() const { return name(); } +// ---------------------------------------------------------------------- +// Schema implementation + +Schema::Schema(const std::vector>& fields) : fields_(fields) {} + +bool Schema::Equals(const Schema& other) const { + if (this == &other) { return true; } + + if (num_fields() != other.num_fields()) { return false; } + for (int i = 0; i < num_fields(); ++i) { + if (!field(i)->Equals(*other.field(i).get())) { return false; } + } + return true; +} + +std::shared_ptr Schema::GetFieldByName(const std::string& name) { + if (fields_.size() > 0 && name_to_index_.size() == 0) { + for (size_t i = 0; i < fields_.size(); ++i) { + name_to_index_[fields_[i]->name] = static_cast(i); + } + } + + auto it = name_to_index_.find(name); + if (it == name_to_index_.end()) { + return nullptr; + } else { + return fields_[it->second]; + } +} + +Status Schema::RemoveField(int i, std::shared_ptr* out) const { + DCHECK_GE(i, 0); + DCHECK_LT(i, this->num_fields()); + + *out = std::make_shared(DeleteVectorElement(fields_, i)); + return Status::OK(); +} + +std::string Schema::ToString() const { + std::stringstream buffer; + + int i = 0; + for (auto field : fields_) { + if (i > 0) { buffer << std::endl; } + buffer << field->ToString(); + ++i; + } + return buffer.str(); +} + // ---------------------------------------------------------------------- 
// Visitors and factory functions diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 2a73f6be934eb..4f931907ee79f 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -22,6 +22,7 @@ #include #include #include +#include #include #include "arrow/status.h" @@ -132,7 +133,7 @@ struct ARROW_EXPORT DataType { explicit DataType(Type::type type) : type(type) {} - virtual ~DataType() = default; + virtual ~DataType(); // Return whether the types are equal // @@ -596,6 +597,36 @@ class ARROW_EXPORT DictionaryType : public FixedWidthType { bool ordered_; }; +// ---------------------------------------------------------------------- +// Schema + +class ARROW_EXPORT Schema { + public: + explicit Schema(const std::vector>& fields); + + // Returns true if all of the schema fields are equal + bool Equals(const Schema& other) const; + + // Return the ith schema element. Does not boundscheck + std::shared_ptr field(int i) const { return fields_[i]; } + + // Returns nullptr if name not found + std::shared_ptr GetFieldByName(const std::string& name); + + const std::vector>& fields() const { return fields_; } + + // Render a string representation of the schema suitable for debugging + std::string ToString() const; + + Status RemoveField(int i, std::shared_ptr* out) const; + + int num_fields() const { return static_cast(fields_.size()); } + + private: + std::vector> fields_; + std::unordered_map name_to_index_; +}; + // ---------------------------------------------------------------------- // Factory functions diff --git a/cpp/src/arrow/util/stl.h b/cpp/src/arrow/util/stl.h new file mode 100644 index 0000000000000..3ec535d62b920 --- /dev/null +++ b/cpp/src/arrow/util/stl.h @@ -0,0 +1,40 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
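The helper this new header introduces is declared as template <typename T> inline std::vector<T> DeleteVectorElement(const std::vector<T>& values, size_t index); it copies every element except the chosen one and leaves its input untouched, which is what Schema::RemoveField and Table::RemoveColumn above lean on. A brief usage sketch:

    #include <cassert>
    #include <vector>

    #include "arrow/util/stl.h"

    int main() {
      const std::vector<int> v = {10, 20, 30};
      const std::vector<int> w = arrow::DeleteVectorElement(v, 1);
      assert((w == std::vector<int>{10, 30}));  // element 1 dropped
      assert(v.size() == 3);                    // source vector untouched
    }
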
+ +#ifndef ARROW_UTIL_STL_H +#define ARROW_UTIL_STL_H + +#include + +namespace arrow { + +template +inline std::vector DeleteVectorElement(const std::vector& values, size_t index) { + std::vector out; + out.reserve(values.size() - 1); + for (size_t i = 0; i < index; ++i) { + out.push_back(values[i]); + } + for (size_t i = index + 1; i < values.size(); ++i) { + out.push_back(values[i]); + } + return out; +} + +} // namespace arrow + +#endif // ARROW_UTIL_STL_H diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 654f5ab527043..6cae1966cb16e 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -19,6 +19,8 @@ # distutils: language = c++ # cython: embedsignature = True +from cython.operator cimport dereference as deref + import numpy as np from pyarrow.includes.libarrow cimport * @@ -216,7 +218,7 @@ cdef class Array: return '{0}\n{1}'.format(type_format, values) def equals(Array self, Array other): - return self.ap.Equals(other.sp_array) + return self.ap.Equals(deref(other.ap)) def __len__(self): if self.sp_array.get(): diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index bdbd18bce01df..8e428b40b8f8b 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -59,7 +59,6 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass CDataType" arrow::DataType": Type type - c_bool Equals(const shared_ptr[CDataType]& other) c_bool Equals(const CDataType& other) c_string ToString() @@ -71,7 +70,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: int64_t null_count() Type type_enum() - c_bool Equals(const shared_ptr[CArray]& arr) + c_bool Equals(const CArray& arr) c_bool IsNull(int i) shared_ptr[CArray] Slice(int64_t offset) @@ -155,7 +154,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass CSchema" arrow::Schema": CSchema(const vector[shared_ptr[CField]]& fields) - c_bool Equals(const shared_ptr[CSchema]& other) + c_bool Equals(const CSchema& other) shared_ptr[CField] field(int i) shared_ptr[CField] GetFieldByName(c_string& name) @@ -231,7 +230,6 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: const vector[shared_ptr[CArray]]& chunks) c_bool Equals(const CColumn& other) - c_bool Equals(const shared_ptr[CColumn]& other) int64_t length() int64_t null_count() @@ -258,12 +256,11 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: shared_ptr[CRecordBatch] Slice(int64_t offset, int64_t length) cdef cppclass CTable" arrow::Table": - CTable(const c_string& name, const shared_ptr[CSchema]& schema, + CTable(const shared_ptr[CSchema]& schema, const vector[shared_ptr[CColumn]]& columns) @staticmethod CStatus FromRecordBatches( - const c_string& name, const vector[shared_ptr[CRecordBatch]]& batches, shared_ptr[CTable]* table) @@ -271,15 +268,13 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: int num_rows() c_bool Equals(const CTable& other) - c_bool Equals(const shared_ptr[CTable]& other) - - const c_string& name() shared_ptr[CSchema] schema() shared_ptr[CColumn] column(int i) - CStatus ConcatenateTables(const c_string& output_name, - const vector[shared_ptr[CTable]]& tables, + CStatus RemoveColumn(int i, shared_ptr[CTable]* out) + + CStatus ConcatenateTables(const vector[shared_ptr[CTable]]& tables, shared_ptr[CTable]* result) diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index cb44ce8816338..d528bdc495208 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -956,7 +956,6 @@ cdef class _StreamReader: 
vector[shared_ptr[CRecordBatch]] batches shared_ptr[CRecordBatch] batch shared_ptr[CTable] table - c_string name = b'' with nogil: while True: @@ -965,7 +964,7 @@ cdef class _StreamReader: break batches.push_back(batch) - check_status(CTable.FromRecordBatches(name, batches, &table)) + check_status(CTable.FromRecordBatches(batches, &table)) return table_from_ctable(table) @@ -1033,7 +1032,6 @@ cdef class _FileReader: cdef: vector[shared_ptr[CRecordBatch]] batches shared_ptr[CTable] table - c_string name = b'' int i, nbatches nbatches = self.num_record_batches @@ -1042,6 +1040,6 @@ cdef class _FileReader: with nogil: for i in range(nbatches): check_status(self.reader.get().GetRecordBatch(i, &batches[i])) - check_status(CTable.FromRecordBatches(name, batches, &table)) + check_status(CTable.FromRecordBatches(batches, &table)) return table_from_ctable(table) diff --git a/python/pyarrow/schema.pyx b/python/pyarrow/schema.pyx index ab5ae5fa2a3f2..4f02901cc9a11 100644 --- a/python/pyarrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ -166,7 +166,7 @@ cdef class Schema: cdef Schema _other _other = other - return self.sp_schema.get().Equals(_other.sp_schema) + return self.sp_schema.get().Equals(deref(_other.schema)) def field_by_name(self, name): """ @@ -200,11 +200,16 @@ cdef class Schema: return result - def __repr__(self): + def __str__(self): return frombytes(self.schema.ToString()) + def __repr__(self): + return self.__str__() + + cdef dict _type_cache = {} + cdef DataType primitive_type(Type type): if type in _type_cache: return _type_cache[type] diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index 58f5d680393f7..e6fddbd0cfbbd 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -298,7 +298,7 @@ cdef _schema_from_arrays(arrays, names, shared_ptr[CSchema]* schema): -cdef _dataframe_to_arrays(df, name, timestamps_to_ms, Schema schema): +cdef _dataframe_to_arrays(df, timestamps_to_ms, Schema schema): cdef: list names = [] list arrays = [] @@ -474,7 +474,7 @@ cdef class RecordBatch: ------- pyarrow.table.RecordBatch """ - names, arrays = _dataframe_to_arrays(df, None, False, schema) + names, arrays = _dataframe_to_arrays(df, False, schema) return cls.from_arrays(arrays, names) @staticmethod @@ -573,6 +573,9 @@ cdef class Table: def __cinit__(self): self.table = NULL + def __repr__(self): + return 'pyarrow.Table\n{0}'.format(str(self.schema)) + cdef init(self, const shared_ptr[CTable]& table): self.sp_table = table self.table = table.get() @@ -608,7 +611,7 @@ cdef class Table: return result @classmethod - def from_pandas(cls, df, name=None, timestamps_to_ms=False, schema=None): + def from_pandas(cls, df, timestamps_to_ms=False, schema=None): """ Convert pandas.DataFrame to an Arrow Table @@ -616,8 +619,6 @@ cdef class Table: ---------- df: pandas.DataFrame - name: str - timestamps_to_ms: bool Convert datetime columns to ms resolution. 
This is needed for compability with other functionality like Parquet I/O which @@ -643,13 +644,13 @@ cdef class Table: >>> pa.Table.from_pandas(df) """ - names, arrays = _dataframe_to_arrays(df, name=name, + names, arrays = _dataframe_to_arrays(df, timestamps_to_ms=timestamps_to_ms, schema=schema) - return cls.from_arrays(arrays, names=names, name=name) + return cls.from_arrays(arrays, names=names) @staticmethod - def from_arrays(arrays, names=None, name=None): + def from_arrays(arrays, names=None): """ Construct a Table from Arrow arrays or columns @@ -660,8 +661,6 @@ cdef class Table: names: list of str, optional Names for the table columns. If Columns passed, will be inferred. If Arrays passed, this argument is required - name: str, optional - name for the Table Returns ------- @@ -669,7 +668,6 @@ cdef class Table: """ cdef: - c_string c_name vector[shared_ptr[CField]] fields vector[shared_ptr[CColumn]] columns shared_ptr[CSchema] schema @@ -689,16 +687,11 @@ cdef class Table: else: raise ValueError(type(arrays[i])) - if name is None: - c_name = '' - else: - c_name = tobytes(name) - - table.reset(new CTable(c_name, schema, columns)) + table.reset(new CTable(schema, columns)) return table_from_ctable(table) @staticmethod - def from_batches(batches, name=None): + def from_batches(batches): """ Construct a Table from a list of Arrow RecordBatches @@ -712,16 +705,12 @@ cdef class Table: vector[shared_ptr[CRecordBatch]] c_batches shared_ptr[CTable] c_table RecordBatch batch - Table table - c_string c_name - - c_name = b'' if name is None else tobytes(name) for batch in batches: c_batches.push_back(batch.sp_batch) with nogil: - check_status(CTable.FromRecordBatches(c_name, c_batches, &c_table)) + check_status(CTable.FromRecordBatches(c_batches, &c_table)) return table_from_ctable(c_table) @@ -761,18 +750,6 @@ cdef class Table: entries.append((name, column)) return OrderedDict(entries) - @property - def name(self): - """ - Label of the table - - Returns - ------- - str - """ - self._check_nullptr() - return frombytes(self.table.name()) - @property def schema(self): """ @@ -851,8 +828,19 @@ cdef class Table: """ return (self.num_rows, self.num_columns) + def remove_column(self, int i): + """ + Create new Table with the indicated column removed + """ + cdef shared_ptr[CTable] c_table -def concat_tables(tables, output_name=None): + with nogil: + check_status(self.table.RemoveColumn(i, &c_table)) + + return table_from_ctable(c_table) + + +def concat_tables(tables): """ Perform zero-copy concatenation of pyarrow.Table objects. 
Raises exception if all of the Table schemas are not the same @@ -867,15 +855,12 @@ def concat_tables(tables, output_name=None): vector[shared_ptr[CTable]] c_tables shared_ptr[CTable] c_result Table table - c_string c_name - - c_name = b'' if output_name is None else tobytes(output_name) for table in tables: c_tables.push_back(table.sp_table) with nogil: - check_status(ConcatenateTables(c_name, c_tables, &c_result)) + check_status(ConcatenateTables(c_tables, &c_result)) return table_from_ctable(c_result) diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index c72ff9e862b76..fc32b9fac8b98 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -47,7 +47,7 @@ def test_single_pylist_column_roundtrip(tmpdir): filename = tmpdir.join('single_{}_column.parquet' .format(dtype.__name__)) data = [pa.from_pylist(list(map(dtype, range(5))))] - table = pa.Table.from_arrays(data, names=('a', 'b'), name='table_name') + table = pa.Table.from_arrays(data, names=('a', 'b')) pq.write_table(table, filename.strpath) table_read = pq.read_table(filename.strpath) for col_written, col_read in zip(table.itercolumns(), diff --git a/python/pyarrow/tests/test_table.py b/python/pyarrow/tests/test_table.py index 67f1892a9987b..548f4782a7030 100644 --- a/python/pyarrow/tests/test_table.py +++ b/python/pyarrow/tests/test_table.py @@ -31,7 +31,7 @@ def test_basics(self): data = [ pa.from_pylist([-10, -5, 0, 5, 10]) ] - table = pa.Table.from_arrays(data, names=['a'], name='table_name') + table = pa.Table.from_arrays(data, names=['a']) column = table.column(0) assert column.name == 'a' assert column.length() == 5 @@ -43,7 +43,7 @@ def test_pandas(self): data = [ pa.from_pylist([-10, -5, 0, 5, 10]) ] - table = pa.Table.from_arrays(data, names=['a'], name='table_name') + table = pa.Table.from_arrays(data, names=['a']) column = table.column(0) series = column.to_pandas() assert series.name == 'a' @@ -154,8 +154,7 @@ def test_table_basics(): pa.from_pylist(range(5)), pa.from_pylist([-10, -5, 0, 5, 10]) ] - table = pa.Table.from_arrays(data, names=('a', 'b'), name='table_name') - assert table.name == 'table_name' + table = pa.Table.from_arrays(data, names=('a', 'b')) assert len(table) == 5 assert table.num_rows == 5 assert table.num_columns == 2 @@ -170,6 +169,19 @@ def test_table_basics(): assert chunk is not None +def test_table_remove_column(): + data = [ + pa.from_pylist(range(5)), + pa.from_pylist([-10, -5, 0, 5, 10]), + pa.from_pylist(range(5, 10)) + ] + table = pa.Table.from_arrays(data, names=('a', 'b', 'c')) + + t2 = table.remove_column(0) + expected = pa.Table.from_arrays(data[1:], names=('b', 'c')) + assert t2.equals(expected) + + def test_concat_tables(): data = [ list(range(5)), @@ -181,18 +193,16 @@ def test_concat_tables(): ] t1 = pa.Table.from_arrays([pa.from_pylist(x) for x in data], - names=('a', 'b'), name='table_name') + names=('a', 'b')) t2 = pa.Table.from_arrays([pa.from_pylist(x) for x in data2], - names=('a', 'b'), name='table_name') + names=('a', 'b')) - result = pa.concat_tables([t1, t2], output_name='foo') - assert result.name == 'foo' + result = pa.concat_tables([t1, t2]) assert len(result) == 10 expected = pa.Table.from_arrays([pa.from_pylist(x + y) for x, y in zip(data, data2)], - names=('a', 'b'), - name='foo') + names=('a', 'b')) assert result.equals(expected) @@ -202,8 +212,7 @@ def test_table_pandas(): pa.from_pylist(range(5)), pa.from_pylist([-10, -5, 0, 5, 10]) ] - table = pa.Table.from_arrays(data, names=('a', 'b'), 
-                                  name='table_name')
+    table = pa.Table.from_arrays(data, names=('a', 'b'))
 
     # TODO: Use this part once from_pandas is implemented
     # data = {'a': range(5), 'b': [-10, -5, 0, 5, 10]}

From 15b874e47e3975c5240290ec7ed105bf8d1b56bc Mon Sep 17 00:00:00 2001
From: Max Risuhin
Date: Thu, 30 Mar 2017 15:13:39 -0400
Subject: [PATCH 0433/1644] ARROW-699: [C++] Resolve Arrow and Arrow IPC build
 issues on Windows;

Resolve Arrow and Arrow IPC build issues on Windows; run the unit tests in
Appveyor.

Changes description:
- The current file.cc implementation
  ( https://github.com/apache/arrow/compare/master...MaxRis:ARROW-699?expand=1#diff-1b2fb57add5bb8f21e28a707f24462b0L161 )
  assumes that the input file name is UTF-16 encoded inside a std::string, but
  the unit tests pass plain UTF-8 compatible C-strings. The utility method
  ConvertToUtf16
  ( https://github.com/apache/arrow/compare/master...MaxRis:ARROW-699?expand=1#diff-1b2fb57add5bb8f21e28a707f24462b0R156 )
  is introduced to convert UTF-8 to UTF-16 (std::wstring).
- io-file-test has a FileIsClosed method that calls _close(FILE_HANDLE) to test
  whether a file handle is valid; the result is interpreted as whether the
  file was closed or not. By default, the MSVC C runtime crashes the
  application when an input parameter is invalid. To override this behavior, a
  custom handler must be installed
  (https://github.com/apache/arrow/compare/master...MaxRis:ARROW-699?expand=1#diff-05724c5d85bf64720fa85ef3012e470dR61).
  More info here: https://msdn.microsoft.com/en-us/library/ksazx244.aspx
- The Message and FileWriter classes keep their internal implementation as a
  private unique_ptr member of a forward-declared type, for example:
  ```
  class MessageImpl;
  std::unique_ptr<MessageImpl> impl_;
  ```
  The MSVC compiler requires the constructor and destructor of the Message
  class to be defined. Currently they are implicitly defaulted, so the
  compiler places the auto-generated code into the header, where it is not
  visible to other libraries during linking (we got unresolved symbols at link
  time). The solution is to define the destructors explicitly. More here
  http://stackoverflow.com/a/42158611/2266412 and here
  http://stackoverflow.com/a/6089065/2266412

Author: Max Risuhin

Closes #449 from MaxRis/ARROW-699 and squashes the following commits:

2d5383f [Max Risuhin] ARROW-699: [C++] Resolve Arrow and Arrow IPC build issues on Windows; Running unit tests in Appveyor.
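To make the pimpl point above concrete, here is a minimal, self-contained
sketch of the idiom, using hypothetical Widget/WidgetImpl names rather than
the classes in this patch:

```cpp
// widget.h -- only a forward declaration of the impl type is visible here
#include <memory>

class Widget {
 public:
  Widget();
  // Declared here but defined in widget.cc. Leaving it implicit (or
  // `= default` in this header) would instantiate unique_ptr's deleter in
  // every including translation unit, where WidgetImpl is incomplete.
  ~Widget();

 private:
  class WidgetImpl;                   // forward declaration only
  std::unique_ptr<WidgetImpl> impl_;  // pimpl member
};

// widget.cc -- WidgetImpl is a complete type from here on
class Widget::WidgetImpl {
 public:
  int state = 0;
};

Widget::Widget() : impl_(new WidgetImpl()) {}

// The out-of-line definition instantiates std::unique_ptr<WidgetImpl>'s
// destructor here, where WidgetImpl is complete, avoiding the MSVC build
// errors described above.
Widget::~Widget() {}
```

This is the same reasoning behind the explicit `Message::~Message()`,
`StreamWriter::~StreamWriter()`, and `FileWriter::~FileWriter()` definitions
in the diff below.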
--- appveyor.yml | 12 +++++---- cpp/CMakeLists.txt | 11 +++++++- cpp/cmake_modules/BuildUtils.cmake | 2 ++ cpp/src/arrow/io/file.cc | 42 +++++++++++++++++------------- cpp/src/arrow/io/io-file-test.cc | 23 +++++++++++++--- cpp/src/arrow/io/io-hdfs-test.cc | 5 ++-- cpp/src/arrow/io/test-common.h | 10 ++++--- cpp/src/arrow/ipc/CMakeLists.txt | 33 +++++++++++++++++------ cpp/src/arrow/ipc/metadata.cc | 2 ++ cpp/src/arrow/ipc/metadata.h | 1 + cpp/src/arrow/ipc/writer.cc | 4 +++ cpp/src/arrow/ipc/writer.h | 4 ++- 12 files changed, 106 insertions(+), 43 deletions(-) diff --git a/appveyor.yml b/appveyor.yml index 17362c993d053..9f3594907d17e 100644 --- a/appveyor.yml +++ b/appveyor.yml @@ -23,8 +23,8 @@ environment: - GENERATOR: Visual Studio 14 2015 Win64 # - GENERATOR: Visual Studio 14 2015 MSVC_DEFAULT_OPTIONS: ON - BOOST_ROOT: C:\Libraries\boost_1_59_0 - BOOST_LIBRARYDIR: C:\Libraries\boost_1_59_0\lib64-msvc-14.0 + BOOST_ROOT: C:\Libraries\boost_1_63_0 + BOOST_LIBRARYDIR: C:\Libraries\boost_1_63_0\lib64-msvc-14.0 build_script: - cd cpp @@ -32,8 +32,10 @@ build_script: - cd build # A lot of features are still deactivated as they do not build on Windows # * gbenchmark doesn't build with MSVC - - cmake -G "%GENERATOR%" -DARROW_BOOST_USE_SHARED=OFF -DARROW_IPC=OFF -DARROW_HDFS=OFF -DARROW_BUILD_BENCHMARKS=OFF -DARROW_JEMALLOC=OFF .. - - cmake --build . --config Debug + - cmake -G "%GENERATOR%" -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_BENCHMARKS=OFF -DARROW_JEMALLOC=OFF -DCMAKE_BUILD_TYPE=Release .. + - cmake --build . --config Release # test_script: -# - ctest -VV + - ctest -VV + + diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index e11de1b4fb0da..aa8ea31b831e3 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -500,7 +500,11 @@ if(ARROW_BUILD_TESTS) set(GFLAGS_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/gflags_ep-prefix/src/gflags_ep") set(GFLAGS_HOME "${GFLAGS_PREFIX}") set(GFLAGS_INCLUDE_DIR "${GFLAGS_PREFIX}/include") - set(GFLAGS_STATIC_LIB "${GFLAGS_PREFIX}/lib/libgflags.a") + if(MSVC) + set(GFLAGS_STATIC_LIB "${GFLAGS_PREFIX}/lib/gflags_static.lib") + else() + set(GFLAGS_STATIC_LIB "${GFLAGS_PREFIX}/lib/libgflags.a") + endif() set(GFLAGS_VENDORED 1) set(GFLAGS_CMAKE_ARGS -DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} -DCMAKE_INSTALL_PREFIX=${GFLAGS_PREFIX} @@ -536,6 +540,11 @@ if(ARROW_BUILD_TESTS) include_directories(SYSTEM ${GFLAGS_INCLUDE_DIR}) ADD_THIRDPARTY_LIB(gflags STATIC_LIB ${GFLAGS_STATIC_LIB}) + if(MSVC) + set_target_properties(gflags + PROPERTIES + IMPORTED_LINK_INTERFACE_LIBRARIES "shlwapi.lib") + endif() if(GFLAGS_VENDORED) add_dependencies(gflags gflags_ep) diff --git a/cpp/cmake_modules/BuildUtils.cmake b/cpp/cmake_modules/BuildUtils.cmake index c9930418185c7..43d984045eb20 100644 --- a/cpp/cmake_modules/BuildUtils.cmake +++ b/cpp/cmake_modules/BuildUtils.cmake @@ -125,6 +125,8 @@ function(ADD_ARROW_LIB LIB_NAME) set_target_properties(${LIB_NAME}_shared PROPERTIES LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}" + RUNTIME_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}" + PDB_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}" LINK_FLAGS "${ARG_SHARED_LINK_FLAGS}" OUTPUT_NAME ${LIB_NAME} VERSION "${ARROW_ABI_VERSION}" diff --git a/cpp/src/arrow/io/file.cc b/cpp/src/arrow/io/file.cc index 7c14238e8fda4..0aa2c92a07281 100644 --- a/cpp/src/arrow/io/file.cc +++ b/cpp/src/arrow/io/file.cc @@ -152,20 +152,30 @@ static inline int64_t lseek64_compat(int fd, int64_t pos, int whence) { #endif } +#if defined(_MSC_VER) +static inline Status ConvertToUtf16(const std::string& 
input, std::wstring* result) {
+  if (result == nullptr) { return Status::Invalid("Pointer to result is not valid"); }
+
+  if (input.empty()) {
+    *result = std::wstring();
+    return Status::OK();
+  }
+
+  std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> utf16_converter;
+  *result = utf16_converter.from_bytes(input);
+  return Status::OK();
+}
+#endif
+
 static inline Status FileOpenReadable(const std::string& filename, int* fd) {
   int ret;
   errno_t errno_actual = 0;
 
 #if defined(_MSC_VER)
-  // https://msdn.microsoft.com/en-us/library/w64k0ytk.aspx
-
-  // See GH #209. Here we are assuming that the filename has been encoded in
-  // utf-16le so that unicode filenames can be supported
-  const int nwchars = static_cast<int>(filename.size()) / sizeof(wchar_t);
-  std::vector<wchar_t> wpath(nwchars + 1);
-  memcpy(wpath.data(), filename.data(), filename.size());
-  memcpy(wpath.data() + nwchars, L"\0", sizeof(wchar_t));
+  std::wstring wide_filename;
+  RETURN_NOT_OK(ConvertToUtf16(filename, &wide_filename));
 
-  errno_actual = _wsopen_s(fd, wpath.data(), _O_RDONLY | _O_BINARY, _SH_DENYNO, _S_IREAD);
+  errno_actual =
+      _wsopen_s(fd, wide_filename.c_str(), _O_RDONLY | _O_BINARY, _SH_DENYNO, _S_IREAD);
   ret = *fd;
 #else
   ret = *fd = open(filename.c_str(), O_RDONLY | O_BINARY);
@@ -181,16 +191,12 @@ static inline Status FileOpenWriteable(
   errno_t errno_actual = 0;
 
 #if defined(_MSC_VER)
-  // https://msdn.microsoft.com/en-us/library/w64k0ytk.aspx
-  // Same story with wchar_t as above
-  const int nwchars = static_cast<int>(filename.size()) / sizeof(wchar_t);
-  std::vector<wchar_t> wpath(nwchars + 1);
-  memcpy(wpath.data(), filename.data(), filename.size());
-  memcpy(wpath.data() + nwchars, L"\0", sizeof(wchar_t));
+  std::wstring wide_filename;
+  RETURN_NOT_OK(ConvertToUtf16(filename, &wide_filename));
 
   int oflag = _O_CREAT | _O_BINARY;
-  int sh_flag = _S_IWRITE;
-  if (!write_only) { sh_flag |= _S_IREAD; }
+  int pmode = _S_IWRITE;
+  if (!write_only) { pmode |= _S_IREAD; }
 
   if (truncate) { oflag |= _O_TRUNC; }
 
@@ -200,7 +206,7 @@ static inline Status FileOpenWriteable(
     oflag |= _O_RDWR;
   }
 
-  errno_actual = _wsopen_s(fd, wpath.data(), oflag, _SH_DENYNO, sh_flag);
+  errno_actual = _wsopen_s(fd, wide_filename.c_str(), oflag, _SH_DENYNO, pmode);
   ret = *fd;
 
 #else
diff --git a/cpp/src/arrow/io/io-file-test.cc b/cpp/src/arrow/io/io-file-test.cc
index 5810c820f6dd7..348be17d89341 100644
--- a/cpp/src/arrow/io/io-file-test.cc
+++ b/cpp/src/arrow/io/io-file-test.cc
@@ -41,14 +41,29 @@ static bool FileExists(const std::string& path) {
   return std::ifstream(path.c_str()).good();
 }
 
+#if defined(_MSC_VER)
+void InvalidParamHandler(const wchar_t* expr, const wchar_t* func,
+    const wchar_t* source_file, unsigned int source_line, uintptr_t reserved) {
+  wprintf(L"Invalid parameter in funcion %s.
Source: %s line %d expression %s", func, + source_file, source_line, expr); +} +#endif + static bool FileIsClosed(int fd) { -#ifdef _MSC_VER - // Close file a second time, this should set errno to EBADF - close(fd); +#if defined(_MSC_VER) + // Disables default behavior on wrong params which causes the application to crash + // https://msdn.microsoft.com/en-us/library/ksazx244.aspx + _set_invalid_parameter_handler(InvalidParamHandler); + + // Disables possible assertion alert box on invalid input arguments + _CrtSetReportMode(_CRT_ASSERT, 0); + + int ret = static_cast(_close(fd)); + return (ret == -1); #else if (-1 != fcntl(fd, F_GETFD)) { return false; } -#endif return errno == EBADF; +#endif } class FileTestFixture : public ::testing::Test { diff --git a/cpp/src/arrow/io/io-hdfs-test.cc b/cpp/src/arrow/io/io-hdfs-test.cc index 648d4baac9b6f..f3140be0b2dac 100644 --- a/cpp/src/arrow/io/io-hdfs-test.cc +++ b/cpp/src/arrow/io/io-hdfs-test.cc @@ -78,8 +78,9 @@ class TestHdfsClient : public ::testing::Test { LibHdfsShim* driver_shim; client_ = nullptr; - scratch_dir_ = - boost::filesystem::unique_path("/tmp/arrow-hdfs/scratch-%%%%").string(); + scratch_dir_ = boost::filesystem::unique_path( + boost::filesystem::temp_directory_path() / "arrow-hdfs/scratch-%%%%") + .string(); loaded_driver_ = false; diff --git a/cpp/src/arrow/io/test-common.h b/cpp/src/arrow/io/test-common.h index 4c114760e9a4b..db5bcc1b4f49b 100644 --- a/cpp/src/arrow/io/test-common.h +++ b/cpp/src/arrow/io/test-common.h @@ -69,13 +69,15 @@ class MemoryMapFixture { void CreateFile(const std::string path, int64_t size) { FILE* file = fopen(path.c_str(), "w"); - if (file != nullptr) { tmp_files_.push_back(path); } + if (file != nullptr) { + tmp_files_.push_back(path); #ifdef _MSC_VER - _chsize(fileno(file), static_cast(size)); + _chsize(fileno(file), static_cast(size)); #else - ftruncate(fileno(file), static_cast(size)); + ftruncate(fileno(file), static_cast(size)); #endif - fclose(file); + fclose(file); + } } Status InitMemoryMap( diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index 030cba93f5fc0..31a04dfc07818 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -85,6 +85,15 @@ if (ARROW_BUILD_TESTS) dl) set_target_properties(json-integration-test PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") + elseif (MSVC) + target_link_libraries(json-integration-test + arrow_ipc_static + arrow_io_static + arrow_static + gflags + gtest + ${BOOST_FILESYSTEM_LIBRARY} + ${BOOST_SYSTEM_LIBRARY}) else() target_link_libraries(json-integration-test arrow_ipc_static @@ -156,14 +165,22 @@ install( FILES "${CMAKE_CURRENT_BINARY_DIR}/arrow-ipc.pc" DESTINATION "${CMAKE_INSTALL_LIBDIR}/pkgconfig/") - -set(UTIL_LINK_LIBS - arrow_ipc_static - arrow_io_static - arrow_static - ${BOOST_FILESYSTEM_LIBRARY} - ${BOOST_SYSTEM_LIBRARY} - dl) +if(MSVC) + set(UTIL_LINK_LIBS + arrow_ipc_static + arrow_io_static + arrow_static + ${BOOST_FILESYSTEM_LIBRARY} + ${BOOST_SYSTEM_LIBRARY}) +else() + set(UTIL_LINK_LIBS + arrow_ipc_static + arrow_io_static + arrow_static + ${BOOST_FILESYSTEM_LIBRARY} + ${BOOST_SYSTEM_LIBRARY} + dl) +endif() if (ARROW_BUILD_UTILITIES) add_executable(file-to-stream file-to-stream.cc) diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index 36ba4b26042a8..6d9fabdc920f9 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -767,6 +767,8 @@ Message::Message(const std::shared_ptr& buffer, int64_t offset) { impl_.reset(new 
MessageImpl(buffer, offset));
 }
 
+Message::~Message() {}
+
 Status Message::Open(const std::shared_ptr<Buffer>& buffer, int64_t offset,
     std::shared_ptr<Message>* out) {
   // ctor is private
diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h
index f60fb770c3696..798abdcdf9db7 100644
--- a/cpp/src/arrow/ipc/metadata.h
+++ b/cpp/src/arrow/ipc/metadata.h
@@ -140,6 +140,7 @@ struct ARROW_EXPORT BufferMetadata {
 
 class ARROW_EXPORT Message {
  public:
+  ~Message();
   enum Type { NONE, SCHEMA, DICTIONARY_BATCH, RECORD_BATCH };
 
   static Status Open(const std::shared_ptr<Buffer>& buffer, int64_t offset,
diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc
index db5f0829f92f7..0a19f69d27d8c 100644
--- a/cpp/src/arrow/ipc/writer.cc
+++ b/cpp/src/arrow/ipc/writer.cc
@@ -662,6 +662,8 @@ Status StreamWriter::WriteRecordBatch(const RecordBatch& batch, bool allow_64bit
   return impl_->WriteRecordBatch(batch, allow_64bit);
 }
 
+StreamWriter::~StreamWriter() {}
+
 void StreamWriter::set_memory_pool(MemoryPool* pool) {
   impl_->set_memory_pool(pool);
 }
@@ -718,6 +720,8 @@ FileWriter::FileWriter() {
   impl_.reset(new FileWriterImpl());
 }
 
+FileWriter::~FileWriter() {}
+
 Status FileWriter::Open(io::OutputStream* sink, const std::shared_ptr<Schema>& schema,
     std::shared_ptr<FileWriter>* out) {
   *out = std::shared_ptr<FileWriter>(new FileWriter());  // ctor is private
diff --git a/cpp/src/arrow/ipc/writer.h b/cpp/src/arrow/ipc/writer.h
index 25b5ad62726d9..c572157b465a6 100644
--- a/cpp/src/arrow/ipc/writer.h
+++ b/cpp/src/arrow/ipc/writer.h
@@ -82,7 +82,7 @@ Status GetRecordBatchSize(const RecordBatch& batch, int64_t* size);
 
 class ARROW_EXPORT StreamWriter {
  public:
-  virtual ~StreamWriter() = default;
+  virtual ~StreamWriter();
 
   static Status Open(io::OutputStream* sink, const std::shared_ptr<Schema>& schema,
       std::shared_ptr<StreamWriter>* out);
@@ -105,6 +105,8 @@ class ARROW_EXPORT StreamWriter {
 
 class ARROW_EXPORT FileWriter : public StreamWriter {
  public:
+  virtual ~FileWriter();
+
   static Status Open(io::OutputStream* sink, const std::shared_ptr<Schema>& schema,
       std::shared_ptr<FileWriter>* out);
 

From 957a0e67836b66f8ff4fc3fdae343553c589b53f Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Thu, 30 Mar 2017 18:03:26 -0400
Subject: [PATCH 0434/1644] ARROW-717: [C++] Implement IPC zero-copy round trip
 for tensors

This patch provides:

```cpp
WriteTensor(tensor, file, &metadata_length, &body_length);

std::shared_ptr<Tensor> result;
ReadTensor(offset, file, &result);
```

Also implemented `Tensor::Equals` and did some refactoring / code
simplification in compare.cc

Author: Wes McKinney

Closes #454 from wesm/ARROW-717 and squashes the following commits:

6c15481 [Wes McKinney] Tensor IPC read/write, and refactoring / code scrubbing
---
 cpp/src/arrow/buffer.cc                  |   6 +-
 cpp/src/arrow/compare.cc                 | 330 ++++++++++-------------
 cpp/src/arrow/compare.h                  |   4 +
 cpp/src/arrow/ipc/ipc-read-write-test.cc |  54 +++-
 cpp/src/arrow/ipc/metadata.cc            | 266 ++++++++++++------
 cpp/src/arrow/ipc/metadata.h             |  67 +++--
 cpp/src/arrow/ipc/reader.cc              |  79 +++---
 cpp/src/arrow/ipc/reader.h               |  32 +--
 cpp/src/arrow/ipc/writer.cc              |  79 +++---
 cpp/src/arrow/ipc/writer.h               |  12 +-
 cpp/src/arrow/tensor-test.cc             |  25 +-
 cpp/src/arrow/tensor.cc                  |  67 ++++-
 cpp/src/arrow/tensor.h                   |  18 +-
 cpp/src/arrow/type_traits.h              |  11 +
 cpp/src/arrow/visitor_inline.h           |  26 ++
 format/Tensor.fbs                        |  14 +-
 16 files changed, 656 insertions(+), 434 deletions(-)

diff --git a/cpp/src/arrow/buffer.cc b/cpp/src/arrow/buffer.cc
index be747e1d49504..59623403e5c5e 100644
--- a/cpp/src/arrow/buffer.cc
+++ b/cpp/src/arrow/buffer.cc
@@ -27,11 +27,9 @@ namespace arrow {
-Buffer::Buffer(const std::shared_ptr& parent, int64_t offset, int64_t size) { - data_ = parent->data() + offset; - size_ = size; +Buffer::Buffer(const std::shared_ptr& parent, int64_t offset, int64_t size) + : Buffer(parent->data() + offset, size) { parent_ = parent; - capacity_ = size; } Buffer::~Buffer() {} diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc index f786222f7e4f2..c2580b4f54109 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -25,6 +25,7 @@ #include "arrow/array.h" #include "arrow/status.h" +#include "arrow/tensor.h" #include "arrow/type.h" #include "arrow/type_traits.h" #include "arrow/util/bit-util.h" @@ -36,7 +37,7 @@ namespace arrow { // ---------------------------------------------------------------------- // Public method implementations -class RangeEqualsVisitor : public ArrayVisitor { +class RangeEqualsVisitor { public: RangeEqualsVisitor(const Array& right, int64_t left_start_idx, int64_t left_end_idx, int64_t right_start_idx) @@ -46,12 +47,6 @@ class RangeEqualsVisitor : public ArrayVisitor { right_start_idx_(right_start_idx), result_(false) {} - Status Visit(const NullArray& left) override { - UNUSED(left); - result_ = true; - return Status::OK(); - } - template inline Status CompareValues(const ArrayType& left) { const auto& right = static_cast(right_); @@ -96,108 +91,6 @@ class RangeEqualsVisitor : public ArrayVisitor { return true; } - Status Visit(const BooleanArray& left) override { - return CompareValues(left); - } - - Status Visit(const Int8Array& left) override { return CompareValues(left); } - - Status Visit(const Int16Array& left) override { - return CompareValues(left); - } - Status Visit(const Int32Array& left) override { - return CompareValues(left); - } - Status Visit(const Int64Array& left) override { - return CompareValues(left); - } - Status Visit(const UInt8Array& left) override { - return CompareValues(left); - } - Status Visit(const UInt16Array& left) override { - return CompareValues(left); - } - Status Visit(const UInt32Array& left) override { - return CompareValues(left); - } - Status Visit(const UInt64Array& left) override { - return CompareValues(left); - } - Status Visit(const FloatArray& left) override { - return CompareValues(left); - } - Status Visit(const DoubleArray& left) override { - return CompareValues(left); - } - - Status Visit(const HalfFloatArray& left) override { - return Status::NotImplemented("Half float type"); - } - - Status Visit(const StringArray& left) override { - result_ = CompareBinaryRange(left); - return Status::OK(); - } - - Status Visit(const BinaryArray& left) override { - result_ = CompareBinaryRange(left); - return Status::OK(); - } - - Status Visit(const FixedWidthBinaryArray& left) override { - const auto& right = static_cast(right_); - - int32_t width = left.byte_width(); - - const uint8_t* left_data = left.raw_data() + left.offset() * width; - const uint8_t* right_data = right.raw_data() + right.offset() * width; - - for (int64_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_; - ++i, ++o_i) { - const bool is_null = left.IsNull(i); - if (is_null != right.IsNull(o_i)) { - result_ = false; - return Status::OK(); - } - if (is_null) continue; - - if (std::memcmp(left_data + width * i, right_data + width * o_i, width)) { - result_ = false; - return Status::OK(); - } - } - result_ = true; - return Status::OK(); - } - - Status Visit(const Date32Array& left) override { - return CompareValues(left); - } - - Status Visit(const Date64Array& left) override { - 
return CompareValues(left); - } - - Status Visit(const Time32Array& left) override { - return CompareValues(left); - } - - Status Visit(const Time64Array& left) override { - return CompareValues(left); - } - - Status Visit(const TimestampArray& left) override { - return CompareValues(left); - } - - Status Visit(const IntervalArray& left) override { - return CompareValues(left); - } - - Status Visit(const DecimalArray& left) override { - return Status::NotImplemented("Decimal type"); - } - bool CompareLists(const ListArray& left) { const auto& right = static_cast(right_); @@ -225,11 +118,6 @@ class RangeEqualsVisitor : public ArrayVisitor { return true; } - Status Visit(const ListArray& left) override { - result_ = CompareLists(left); - return Status::OK(); - } - bool CompareStructs(const StructArray& left) { const auto& right = static_cast(right_); bool equal_fields = true; @@ -251,11 +139,6 @@ class RangeEqualsVisitor : public ArrayVisitor { return true; } - Status Visit(const StructArray& left) override { - result_ = CompareStructs(left); - return Status::OK(); - } - bool CompareUnions(const UnionArray& left) const { const auto& right = static_cast(right_); @@ -314,12 +197,73 @@ class RangeEqualsVisitor : public ArrayVisitor { return true; } - Status Visit(const UnionArray& left) override { + Status Visit(const BinaryArray& left) { + result_ = CompareBinaryRange(left); + return Status::OK(); + } + + Status Visit(const FixedWidthBinaryArray& left) { + const auto& right = static_cast(right_); + + int32_t width = left.byte_width(); + + const uint8_t* left_data = nullptr; + const uint8_t* right_data = nullptr; + + if (left.data()) { left_data = left.raw_data() + left.offset() * width; } + + if (right.data()) { right_data = right.raw_data() + right.offset() * width; } + + for (int64_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_; + ++i, ++o_i) { + const bool is_null = left.IsNull(i); + if (is_null != right.IsNull(o_i)) { + result_ = false; + return Status::OK(); + } + if (is_null) continue; + + if (std::memcmp(left_data + width * i, right_data + width * o_i, width)) { + result_ = false; + return Status::OK(); + } + } + result_ = true; + return Status::OK(); + } + + Status Visit(const NullArray& left) { + UNUSED(left); + result_ = true; + return Status::OK(); + } + + template + typename std::enable_if::value, Status>::type Visit( + const T& left) { + return CompareValues(left); + } + + Status Visit(const DecimalArray& left) { + return Status::NotImplemented("Decimal type"); + } + + Status Visit(const ListArray& left) { + result_ = CompareLists(left); + return Status::OK(); + } + + Status Visit(const StructArray& left) { + result_ = CompareStructs(left); + return Status::OK(); + } + + Status Visit(const UnionArray& left) { result_ = CompareUnions(left); return Status::OK(); } - Status Visit(const DictionaryArray& left) override { + Status Visit(const DictionaryArray& left) { const auto& right = static_cast(right_); if (!left.dictionary()->Equals(right.dictionary())) { result_ = false; @@ -346,9 +290,9 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { explicit ArrayEqualsVisitor(const Array& right) : RangeEqualsVisitor(right, 0, right.length(), 0) {} - Status Visit(const NullArray& left) override { return Status::OK(); } + Status Visit(const NullArray& left) { return Status::OK(); } - Status Visit(const BooleanArray& left) override { + Status Visit(const BooleanArray& left) { const auto& right = static_cast(right_); if (left.null_count() > 0) { const uint8_t* left_data 
= left.data()->data(); @@ -372,64 +316,39 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { bool IsEqualPrimitive(const PrimitiveArray& left) { const auto& right = static_cast(right_); const auto& size_meta = dynamic_cast(*left.type()); - const int value_byte_size = size_meta.bit_width() / 8; - DCHECK_GT(value_byte_size, 0); + const int byte_width = size_meta.bit_width() / 8; + + const uint8_t* left_data = nullptr; + const uint8_t* right_data = nullptr; + + if (left.data()) { left_data = left.data()->data() + left.offset() * byte_width; } - const uint8_t* left_data = left.data()->data() + left.offset() * value_byte_size; - const uint8_t* right_data = right.data()->data() + right.offset() * value_byte_size; + if (right.data()) { right_data = right.data()->data() + right.offset() * byte_width; } if (left.null_count() > 0) { for (int64_t i = 0; i < left.length(); ++i) { - if (!left.IsNull(i) && memcmp(left_data, right_data, value_byte_size)) { + if (!left.IsNull(i) && memcmp(left_data, right_data, byte_width)) { return false; } - left_data += value_byte_size; - right_data += value_byte_size; + left_data += byte_width; + right_data += byte_width; } return true; } else { return memcmp(left_data, right_data, - static_cast(value_byte_size * left.length())) == 0; + static_cast(byte_width * left.length())) == 0; } } - Status ComparePrimitive(const PrimitiveArray& left) { + template + typename std::enable_if::value && + !std::is_base_of::value, + Status>::type + Visit(const T& left) { result_ = IsEqualPrimitive(left); return Status::OK(); } - Status Visit(const Int8Array& left) override { return ComparePrimitive(left); } - - Status Visit(const Int16Array& left) override { return ComparePrimitive(left); } - - Status Visit(const Int32Array& left) override { return ComparePrimitive(left); } - - Status Visit(const Int64Array& left) override { return ComparePrimitive(left); } - - Status Visit(const UInt8Array& left) override { return ComparePrimitive(left); } - - Status Visit(const UInt16Array& left) override { return ComparePrimitive(left); } - - Status Visit(const UInt32Array& left) override { return ComparePrimitive(left); } - - Status Visit(const UInt64Array& left) override { return ComparePrimitive(left); } - - Status Visit(const FloatArray& left) override { return ComparePrimitive(left); } - - Status Visit(const DoubleArray& left) override { return ComparePrimitive(left); } - - Status Visit(const Date32Array& left) override { return ComparePrimitive(left); } - - Status Visit(const Date64Array& left) override { return ComparePrimitive(left); } - - Status Visit(const Time32Array& left) override { return ComparePrimitive(left); } - - Status Visit(const Time64Array& left) override { return ComparePrimitive(left); } - - Status Visit(const TimestampArray& left) override { return ComparePrimitive(left); } - - Status Visit(const IntervalArray& left) override { return ComparePrimitive(left); } - template bool ValueOffsetsEqual(const ArrayType& left) { const auto& right = static_cast(right_); @@ -494,17 +413,12 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { } } - Status Visit(const StringArray& left) override { - result_ = CompareBinary(left); - return Status::OK(); - } - - Status Visit(const BinaryArray& left) override { + Status Visit(const BinaryArray& left) { result_ = CompareBinary(left); return Status::OK(); } - Status Visit(const ListArray& left) override { + Status Visit(const ListArray& left) { const auto& right = static_cast(right_); bool equal_offsets = ValueOffsetsEqual(left); 
if (!equal_offsets) { @@ -523,7 +437,7 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { return Status::OK(); } - Status Visit(const DictionaryArray& left) override { + Status Visit(const DictionaryArray& left) { const auto& right = static_cast(right_); if (!left.dictionary()->Equals(right.dictionary())) { result_ = false; @@ -532,6 +446,13 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { } return Status::OK(); } + + template + typename std::enable_if::value, + Status>::type + Visit(const T& left) { + return RangeEqualsVisitor::Visit(left); + } }; template @@ -560,14 +481,15 @@ inline bool FloatingApproxEquals( class ApproxEqualsVisitor : public ArrayEqualsVisitor { public: using ArrayEqualsVisitor::ArrayEqualsVisitor; + using ArrayEqualsVisitor::Visit; - Status Visit(const FloatArray& left) override { + Status Visit(const FloatArray& left) { result_ = FloatingApproxEquals(left, static_cast(right_)); return Status::OK(); } - Status Visit(const DoubleArray& left) override { + Status Visit(const DoubleArray& left) { result_ = FloatingApproxEquals(left, static_cast(right_)); return Status::OK(); @@ -586,7 +508,8 @@ static bool BaseDataEquals(const Array& left, const Array& right) { return true; } -Status ArrayEquals(const Array& left, const Array& right, bool* are_equal) { +template +inline Status ArrayEqualsImpl(const Array& left, const Array& right, bool* are_equal) { // The arrays are the same object if (&left == &right) { *are_equal = true; @@ -595,13 +518,21 @@ Status ArrayEquals(const Array& left, const Array& right, bool* are_equal) { } else if (left.length() == 0) { *are_equal = true; } else { - ArrayEqualsVisitor visitor(right); - RETURN_NOT_OK(left.Accept(&visitor)); + VISITOR visitor(right); + RETURN_NOT_OK(VisitArrayInline(left, &visitor)); *are_equal = visitor.result(); } return Status::OK(); } +Status ArrayEquals(const Array& left, const Array& right, bool* are_equal) { + return ArrayEqualsImpl(left, right, are_equal); +} + +Status ArrayApproxEquals(const Array& left, const Array& right, bool* are_equal) { + return ArrayEqualsImpl(left, right, are_equal); +} + Status ArrayRangeEquals(const Array& left, const Array& right, int64_t left_start_idx, int64_t left_end_idx, int64_t right_start_idx, bool* are_equal) { if (&left == &right) { @@ -612,23 +543,56 @@ Status ArrayRangeEquals(const Array& left, const Array& right, int64_t left_star *are_equal = true; } else { RangeEqualsVisitor visitor(right, left_start_idx, left_end_idx, right_start_idx); - RETURN_NOT_OK(left.Accept(&visitor)); + RETURN_NOT_OK(VisitArrayInline(left, &visitor)); *are_equal = visitor.result(); } return Status::OK(); } -Status ArrayApproxEquals(const Array& left, const Array& right, bool* are_equal) { +// ---------------------------------------------------------------------- +// Implement TensorEquals + +class TensorEqualsVisitor { + public: + explicit TensorEqualsVisitor(const Tensor& right) : right_(right) {} + + template + Status Visit(const TensorType& left) { + const auto& size_meta = dynamic_cast(*left.type()); + const int byte_width = size_meta.bit_width() / 8; + DCHECK_GT(byte_width, 0); + + const uint8_t* left_data = left.data()->data(); + const uint8_t* right_data = right_.data()->data(); + + result_ = + memcmp(left_data, right_data, static_cast(byte_width * left.size())) == 0; + return Status::OK(); + } + + bool result() const { return result_; } + + protected: + const Tensor& right_; + bool result_; +}; + +Status TensorEquals(const Tensor& left, const Tensor& right, bool* are_equal) 
{ // The arrays are the same object if (&left == &right) { *are_equal = true; - } else if (!BaseDataEquals(left, right)) { + } else if (left.type_enum() != right.type_enum()) { *are_equal = false; - } else if (left.length() == 0) { + } else if (left.size() == 0) { *are_equal = true; } else { - ApproxEqualsVisitor visitor(right); - RETURN_NOT_OK(left.Accept(&visitor)); + if (!left.is_contiguous() || !right.is_contiguous()) { + return Status::NotImplemented( + "Comparison not implemented for non-contiguous tensors"); + } + + TensorEqualsVisitor visitor(right); + RETURN_NOT_OK(VisitTensorInline(left, &visitor)); *are_equal = visitor.result(); } return Status::OK(); diff --git a/cpp/src/arrow/compare.h b/cpp/src/arrow/compare.h index 1ddf0497dd3d9..522b11dadec47 100644 --- a/cpp/src/arrow/compare.h +++ b/cpp/src/arrow/compare.h @@ -29,10 +29,14 @@ namespace arrow { class Array; struct DataType; class Status; +class Tensor; /// Returns true if the arrays are exactly equal Status ARROW_EXPORT ArrayEquals(const Array& left, const Array& right, bool* are_equal); +Status ARROW_EXPORT TensorEquals( + const Tensor& left, const Tensor& right, bool* are_equal); + /// Returns true if the arrays are approximately equal. For non-floating point /// types, this is equivalent to ArrayEquals(left, right) Status ARROW_EXPORT ArrayApproxEquals( diff --git a/cpp/src/arrow/ipc/ipc-read-write-test.cc b/cpp/src/arrow/ipc/ipc-read-write-test.cc index 6ddda3f339641..74ca017df5cf1 100644 --- a/cpp/src/arrow/ipc/ipc-read-write-test.cc +++ b/cpp/src/arrow/ipc/ipc-read-write-test.cc @@ -25,16 +25,16 @@ #include "gtest/gtest.h" #include "arrow/array.h" +#include "arrow/buffer.h" #include "arrow/io/memory.h" #include "arrow/io/test-common.h" #include "arrow/ipc/api.h" #include "arrow/ipc/test-common.h" #include "arrow/ipc/util.h" - -#include "arrow/buffer.h" #include "arrow/memory_pool.h" #include "arrow/pretty_print.h" #include "arrow/status.h" +#include "arrow/tensor.h" #include "arrow/test-util.h" #include "arrow/util/bit-util.h" @@ -56,13 +56,10 @@ class TestSchemaMetadata : public ::testing::Test { ASSERT_EQ(Message::SCHEMA, message->type()); - auto schema_msg = std::make_shared(message); - ASSERT_EQ(schema.num_fields(), schema_msg->num_fields()); - DictionaryMemo empty_memo; std::shared_ptr schema2; - ASSERT_OK(schema_msg->GetSchema(empty_memo, &schema2)); + ASSERT_OK(GetSchema(message->header(), empty_memo, &schema2)); AssertSchemaEqual(schema, *schema2); } @@ -90,7 +87,7 @@ TEST_F(TestSchemaMetadata, PrimitiveFields) { } TEST_F(TestSchemaMetadata, NestedFields) { - auto type = std::make_shared(std::make_shared()); + auto type = list(int32()); auto f0 = field("f0", type); std::shared_ptr type2( @@ -532,7 +529,6 @@ TEST_F(TestIpcRoundTrip, LargeRecordBatch) { // 512 MB constexpr int64_t kBufferSize = 1 << 29; - ASSERT_OK(io::MemoryMapFixture::InitMemoryMap(kBufferSize, path, &mmap_)); std::shared_ptr result; @@ -580,5 +576,47 @@ TEST_F(TestFileFormat, DictionaryRoundTrip) { CheckBatchDictionaries(*out_batches[0]); } +class TestTensorRoundTrip : public ::testing::Test, public IpcTestFixture { + public: + void SetUp() { pool_ = default_memory_pool(); } + void TearDown() { io::MemoryMapFixture::TearDown(); } + + void CheckTensorRoundTrip(const Tensor& tensor) { + int32_t metadata_length; + int64_t body_length; + + ASSERT_OK(mmap_->Seek(0)); + + ASSERT_OK(WriteTensor(tensor, mmap_.get(), &metadata_length, &body_length)); + + std::shared_ptr result; + ASSERT_OK(ReadTensor(0, mmap_.get(), &result)); + + 
ASSERT_TRUE(tensor.Equals(*result)); + } +}; + +TEST_F(TestTensorRoundTrip, BasicRoundtrip) { + std::string path = "test-write-tensor"; + constexpr int64_t kBufferSize = 1 << 20; + ASSERT_OK(io::MemoryMapFixture::InitMemoryMap(kBufferSize, path, &mmap_)); + + std::vector shape = {4, 6}; + std::vector strides = {48, 8}; + std::vector dim_names = {"foo", "bar"}; + int64_t size = 24; + + std::vector values; + test::randint(size, 0, 100, &values); + + auto data = test::GetBufferFromVector(values); + + Int64Tensor t0(data, shape, strides, dim_names); + Int64Tensor tzero(data, {}, {}, {}); + + CheckTensorRoundTrip(t0); + CheckTensorRoundTrip(tzero); +} + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index 6d9fabdc920f9..076a6e792ba40 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -20,6 +20,7 @@ #include #include #include +#include #include #include "flatbuffers/flatbuffers.h" @@ -29,7 +30,10 @@ #include "arrow/io/interfaces.h" #include "arrow/ipc/File_generated.h" #include "arrow/ipc/Message_generated.h" +#include "arrow/ipc/Tensor_generated.h" +#include "arrow/ipc/util.h" #include "arrow/status.h" +#include "arrow/tensor.h" #include "arrow/type.h" namespace arrow { @@ -418,6 +422,46 @@ static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, return Status::OK(); } +static Status TensorTypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, + flatbuf::Type* out_type, Offset* offset) { + switch (type->type) { + case Type::UINT8: + INT_TO_FB_CASE(8, false); + case Type::INT8: + INT_TO_FB_CASE(8, true); + case Type::UINT16: + INT_TO_FB_CASE(16, false); + case Type::INT16: + INT_TO_FB_CASE(16, true); + case Type::UINT32: + INT_TO_FB_CASE(32, false); + case Type::INT32: + INT_TO_FB_CASE(32, true); + case Type::UINT64: + INT_TO_FB_CASE(64, false); + case Type::INT64: + INT_TO_FB_CASE(64, true); + case Type::HALF_FLOAT: + *out_type = flatbuf::Type_FloatingPoint; + *offset = FloatToFlatbuffer(fbb, flatbuf::Precision_HALF); + break; + case Type::FLOAT: + *out_type = flatbuf::Type_FloatingPoint; + *offset = FloatToFlatbuffer(fbb, flatbuf::Precision_SINGLE); + break; + case Type::DOUBLE: + *out_type = flatbuf::Type_FloatingPoint; + *offset = FloatToFlatbuffer(fbb, flatbuf::Precision_DOUBLE); + break; + default: + *out_type = flatbuf::Type_NONE; // Make clang-tidy happy + std::stringstream ss; + ss << "Unable to convert type: " << type->ToString() << std::endl; + return Status::NotImplemented(ss.str()); + } + return Status::OK(); +} + static DictionaryOffset GetDictionaryEncoding( FBB& fbb, const DictionaryType& type, DictionaryMemo* memo) { int64_t dictionary_id = memo->GetId(type.dictionary()); @@ -552,7 +596,7 @@ static Status WriteFlatbufferBuilder(FBB& fbb, std::shared_ptr* out) { return Status::OK(); } -static Status WriteMessage(FBB& fbb, flatbuf::MessageHeader header_type, +static Status WriteFBMessage(FBB& fbb, flatbuf::MessageHeader header_type, flatbuffers::Offset header, int64_t body_length, std::shared_ptr* out) { auto message = flatbuf::CreateMessage(fbb, kMetadataVersion, header_type, header, body_length); @@ -565,7 +609,7 @@ Status WriteSchemaMessage( FBB fbb; flatbuffers::Offset fb_schema; RETURN_NOT_OK(SchemaToFlatbuffer(fbb, schema, dictionary_memo, &fb_schema)); - return WriteMessage(fbb, flatbuf::MessageHeader_Schema, fb_schema.Union(), 0, out); + return WriteFBMessage(fbb, flatbuf::MessageHeader_Schema, fb_schema.Union(), 0, out); } using FieldNodeVector = @@ -620,10 
+664,39 @@ Status WriteRecordBatchMessage(int64_t length, int64_t body_length, FBB fbb; RecordBatchOffset record_batch; RETURN_NOT_OK(MakeRecordBatch(fbb, length, body_length, nodes, buffers, &record_batch)); - return WriteMessage( + return WriteFBMessage( fbb, flatbuf::MessageHeader_RecordBatch, record_batch.Union(), body_length, out); } +Status WriteTensorMessage( + const Tensor& tensor, int64_t buffer_start_offset, std::shared_ptr* out) { + using TensorDimOffset = flatbuffers::Offset; + using TensorOffset = flatbuffers::Offset; + + FBB fbb; + + flatbuf::Type fb_type_type; + Offset fb_type; + RETURN_NOT_OK(TensorTypeToFlatbuffer(fbb, tensor.type(), &fb_type_type, &fb_type)); + + std::vector dims; + for (int i = 0; i < tensor.ndim(); ++i) { + FBString name = fbb.CreateString(tensor.dim_name(i)); + dims.push_back(flatbuf::CreateTensorDim(fbb, tensor.shape()[i], name)); + } + + auto fb_shape = fbb.CreateVector(dims); + auto fb_strides = fbb.CreateVector(tensor.strides()); + int64_t body_length = tensor.data()->size(); + flatbuf::Buffer buffer(-1, buffer_start_offset, body_length); + + TensorOffset fb_tensor = + flatbuf::CreateTensor(fbb, fb_type_type, fb_type, fb_shape, fb_strides, &buffer); + + return WriteFBMessage( + fbb, flatbuf::MessageHeader_Tensor, fb_tensor.Union(), body_length, out); +} + Status WriteDictionaryMessage(int64_t id, int64_t length, int64_t body_length, const std::vector& nodes, const std::vector& buffers, std::shared_ptr* out) { @@ -631,7 +704,7 @@ Status WriteDictionaryMessage(int64_t id, int64_t length, int64_t body_length, RecordBatchOffset record_batch; RETURN_NOT_OK(MakeRecordBatch(fbb, length, body_length, nodes, buffers, &record_batch)); auto dictionary_batch = flatbuf::CreateDictionaryBatch(fbb, id, record_batch).Union(); - return WriteMessage( + return WriteFBMessage( fbb, flatbuf::MessageHeader_DictionaryBatch, dictionary_batch, body_length, out); } @@ -746,6 +819,8 @@ class Message::MessageImpl { return Message::DICTIONARY_BATCH; case flatbuf::MessageHeader_RecordBatch: return Message::RECORD_BATCH; + case flatbuf::MessageHeader_Tensor: + return Message::TENSOR; default: return Message::NONE; } @@ -790,95 +865,78 @@ const void* Message::header() const { } // ---------------------------------------------------------------------- -// SchemaMetadata - -class MessageHolder { - public: - void set_message(const std::shared_ptr& message) { message_ = message; } - void set_buffer(const std::shared_ptr& buffer) { buffer_ = buffer; } - - protected: - // Possible parents, owns the flatbuffer data - std::shared_ptr message_; - std::shared_ptr buffer_; -}; - -class SchemaMetadata::SchemaMetadataImpl : public MessageHolder { - public: - explicit SchemaMetadataImpl(const void* schema) - : schema_(static_cast(schema)) {} - - const flatbuf::Field* get_field(int i) const { return schema_->fields()->Get(i); } - int num_fields() const { return schema_->fields()->size(); } - - Status VisitField(const flatbuf::Field* field, DictionaryTypeMap* id_to_field) const { - const flatbuf::DictionaryEncoding* dict_metadata = field->dictionary(); - if (dict_metadata == nullptr) { - // Field is not dictionary encoded. Visit children - auto children = field->children(); - for (flatbuffers::uoffset_t i = 0; i < children->size(); ++i) { - RETURN_NOT_OK(VisitField(children->Get(i), id_to_field)); - } - } else { - // Field is dictionary encoded. 
Construct the data type for the - // dictionary (no descendents can be dictionary encoded) - std::shared_ptr dictionary_field; - RETURN_NOT_OK(FieldFromFlatbufferDictionary(field, &dictionary_field)); - (*id_to_field)[dict_metadata->id()] = dictionary_field; +static Status VisitField(const flatbuf::Field* field, DictionaryTypeMap* id_to_field) { + const flatbuf::DictionaryEncoding* dict_metadata = field->dictionary(); + if (dict_metadata == nullptr) { + // Field is not dictionary encoded. Visit children + auto children = field->children(); + for (flatbuffers::uoffset_t i = 0; i < children->size(); ++i) { + RETURN_NOT_OK(VisitField(children->Get(i), id_to_field)); } - return Status::OK(); + } else { + // Field is dictionary encoded. Construct the data type for the + // dictionary (no descendents can be dictionary encoded) + std::shared_ptr dictionary_field; + RETURN_NOT_OK(FieldFromFlatbufferDictionary(field, &dictionary_field)); + (*id_to_field)[dict_metadata->id()] = dictionary_field; } + return Status::OK(); +} - Status GetDictionaryTypes(DictionaryTypeMap* id_to_field) const { - for (int i = 0; i < num_fields(); ++i) { - RETURN_NOT_OK(VisitField(get_field(i), id_to_field)); - } - return Status::OK(); +Status GetDictionaryTypes(const void* opaque_schema, DictionaryTypeMap* id_to_field) { + auto schema = static_cast(opaque_schema); + int num_fields = static_cast(schema->fields()->size()); + for (int i = 0; i < num_fields; ++i) { + RETURN_NOT_OK(VisitField(schema->fields()->Get(i), id_to_field)); } - - private: - const flatbuf::Schema* schema_; -}; - -SchemaMetadata::SchemaMetadata(const std::shared_ptr& message) - : SchemaMetadata(message->impl_->header()) { - impl_->set_message(message); + return Status::OK(); } -SchemaMetadata::SchemaMetadata(const void* header) { - impl_.reset(new SchemaMetadataImpl(header)); -} +Status GetSchema(const void* opaque_schema, const DictionaryMemo& dictionary_memo, + std::shared_ptr* out) { + auto schema = static_cast(opaque_schema); + int num_fields = static_cast(schema->fields()->size()); -SchemaMetadata::SchemaMetadata(const std::shared_ptr& buffer, int64_t offset) - : SchemaMetadata(buffer->data() + offset) { - // Preserve ownership - impl_->set_buffer(buffer); + std::vector> fields(num_fields); + for (int i = 0; i < num_fields; ++i) { + const flatbuf::Field* field = schema->fields()->Get(i); + RETURN_NOT_OK(FieldFromFlatbuffer(field, dictionary_memo, &fields[i])); + } + *out = std::make_shared(fields); + return Status::OK(); } -SchemaMetadata::~SchemaMetadata() {} +Status GetTensorMetadata(const void* opaque_tensor, std::shared_ptr* type, + std::vector* shape, std::vector* strides, + std::vector* dim_names) { + auto tensor = static_cast(opaque_tensor); -int SchemaMetadata::num_fields() const { - return impl_->num_fields(); -} + int ndim = static_cast(tensor->shape()->size()); -Status SchemaMetadata::GetDictionaryTypes(DictionaryTypeMap* id_to_field) const { - return impl_->GetDictionaryTypes(id_to_field); -} + for (int i = 0; i < ndim; ++i) { + auto dim = tensor->shape()->Get(i); -Status SchemaMetadata::GetSchema( - const DictionaryMemo& dictionary_memo, std::shared_ptr* out) const { - std::vector> fields(num_fields()); - for (int i = 0; i < this->num_fields(); ++i) { - const flatbuf::Field* field = impl_->get_field(i); - RETURN_NOT_OK(FieldFromFlatbuffer(field, dictionary_memo, &fields[i])); + shape->push_back(dim->size()); + auto fb_name = dim->name(); + if (fb_name == 0) { + dim_names->push_back(""); + } else { + 
dim_names->push_back(fb_name->str()); + } } - *out = std::make_shared(fields); - return Status::OK(); + + if (tensor->strides()->size() > 0) { + for (int i = 0; i < ndim; ++i) { + strides->push_back(tensor->strides()->Get(i)); + } + } + + return TypeFromFlatbuffer(tensor->type_type(), tensor->type(), {}, type); } // ---------------------------------------------------------------------- -// Conveniences +// Read and write messages Status ReadMessage(int64_t offset, int32_t metadata_length, io::RandomAccessFile* file, std::shared_ptr* message) { @@ -896,5 +954,61 @@ Status ReadMessage(int64_t offset, int32_t metadata_length, io::RandomAccessFile return Message::Open(buffer, 4, message); } +Status ReadMessage(io::InputStream* file, std::shared_ptr* message) { + std::shared_ptr buffer; + RETURN_NOT_OK(file->Read(sizeof(int32_t), &buffer)); + + if (buffer->size() != sizeof(int32_t)) { + *message = nullptr; + return Status::OK(); + } + + int32_t message_length = *reinterpret_cast(buffer->data()); + + if (message_length == 0) { + // Optional 0 EOS control message + *message = nullptr; + return Status::OK(); + } + + RETURN_NOT_OK(file->Read(message_length, &buffer)); + if (buffer->size() != message_length) { + return Status::IOError("Unexpected end of stream trying to read message"); + } + + return Message::Open(buffer, 0, message); +} + +Status WriteMessage( + const Buffer& message, io::OutputStream* file, int32_t* message_length) { + // Need to write 4 bytes (message size), the message, plus padding to + // end on an 8-byte offset + int64_t start_offset; + RETURN_NOT_OK(file->Tell(&start_offset)); + + int32_t padded_message_length = static_cast(message.size()) + 4; + const int32_t remainder = + (padded_message_length + static_cast(start_offset)) % 8; + if (remainder != 0) { padded_message_length += 8 - remainder; } + + // The returned message size includes the length prefix, the flatbuffer, + // plus padding + *message_length = padded_message_length; + + // Write the flatbuffer size prefix including padding + int32_t flatbuffer_size = padded_message_length - 4; + RETURN_NOT_OK( + file->Write(reinterpret_cast(&flatbuffer_size), sizeof(int32_t))); + + // Write the flatbuffer + RETURN_NOT_OK(file->Write(message.data(), message.size())); + + // Write any padding + int32_t padding = padded_message_length - static_cast(message.size()) - 4; + if (padding > 0) { RETURN_NOT_OK(file->Write(kPaddingBytes, padding)); } + + return Status::OK(); +} + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h index 798abdcdf9db7..fac4a70aada8d 100644 --- a/cpp/src/arrow/ipc/metadata.h +++ b/cpp/src/arrow/ipc/metadata.h @@ -22,6 +22,7 @@ #include #include +#include #include #include @@ -37,9 +38,11 @@ struct DataType; struct Field; class Schema; class Status; +class Tensor; namespace io { +class InputStream; class OutputStream; class RandomAccessFile; @@ -53,7 +56,7 @@ struct MetadataVersion { static constexpr const char* kArrowMagicBytes = "ARROW1"; -struct ARROW_EXPORT FileBlock { +struct FileBlock { FileBlock() {} FileBlock(int64_t offset, int32_t metadata_length, int64_t body_length) : offset(offset), metadata_length(metadata_length), body_length(body_length) {} @@ -104,44 +107,25 @@ class DictionaryMemo { class Message; -// Container for serialized Schema metadata contained in an IPC message -class ARROW_EXPORT SchemaMetadata { - public: - explicit SchemaMetadata(const void* header); - explicit SchemaMetadata(const std::shared_ptr& message); - 
SchemaMetadata(const std::shared_ptr& message, int64_t offset); - - ~SchemaMetadata(); - - int num_fields() const; - - // Retrieve a list of all the dictionary ids and types required by the schema for - // reconstruction. The presumption is that these will be loaded either from - // the stream or file (or they may already be somewhere else in memory) - Status GetDictionaryTypes(DictionaryTypeMap* id_to_field) const; +// Retrieve a list of all the dictionary ids and types required by the schema for +// reconstruction. The presumption is that these will be loaded either from +// the stream or file (or they may already be somewhere else in memory) +Status GetDictionaryTypes(const void* opaque_schema, DictionaryTypeMap* id_to_field); - // Construct a complete Schema from the message. May be expensive for very - // large schemas if you are only interested in a few fields - Status GetSchema( - const DictionaryMemo& dictionary_memo, std::shared_ptr* out) const; - - private: - class SchemaMetadataImpl; - std::unique_ptr impl_; - - DISALLOW_COPY_AND_ASSIGN(SchemaMetadata); -}; +// Construct a complete Schema from the message. May be expensive for very +// large schemas if you are only interested in a few fields +Status GetSchema(const void* opaque_schema, const DictionaryMemo& dictionary_memo, + std::shared_ptr* out); -struct ARROW_EXPORT BufferMetadata { - int32_t page; - int64_t offset; - int64_t length; -}; +Status GetTensorMetadata(const void* opaque_tensor, std::shared_ptr* type, + std::vector* shape, std::vector* strides, + std::vector* dim_names); class ARROW_EXPORT Message { public: + enum Type { NONE, SCHEMA, DICTIONARY_BATCH, RECORD_BATCH, TENSOR }; + ~Message(); - enum Type { NONE, SCHEMA, DICTIONARY_BATCH, RECORD_BATCH }; static Status Open(const std::shared_ptr& buffer, int64_t offset, std::shared_ptr* out); @@ -155,9 +139,6 @@ class ARROW_EXPORT Message { private: Message(const std::shared_ptr& buffer, int64_t offset); - friend class DictionaryBatchMetadata; - friend class SchemaMetadata; - // Hide serialization details from user API class MessageImpl; std::unique_ptr impl_; @@ -179,6 +160,17 @@ class ARROW_EXPORT Message { Status ReadMessage(int64_t offset, int32_t metadata_length, io::RandomAccessFile* file, std::shared_ptr* message); +/// Read length-prefixed message with as-yet unknown length. Returns nullptr if +/// there are not enough bytes available or the message length is 0 (e.g. 
EOS +/// in a stream) +Status ReadMessage(io::InputStream* stream, std::shared_ptr* message); + +/// Write a serialized message with a length-prefix and padding to an 8-byte offset +/// +/// +Status WriteMessage( + const Buffer& message, io::OutputStream* file, int32_t* message_length); + // Serialize arrow::Schema as a Flatbuffer // // \param[in] schema a Schema instance @@ -193,6 +185,9 @@ Status WriteRecordBatchMessage(int64_t length, int64_t body_length, const std::vector& nodes, const std::vector& buffers, std::shared_ptr* out); +Status WriteTensorMessage( + const Tensor& tensor, int64_t buffer_start_offset, std::shared_ptr* out); + Status WriteDictionaryMessage(int64_t id, int64_t length, int64_t body_length, const std::vector& nodes, const std::vector& buffers, std::shared_ptr* out); diff --git a/cpp/src/arrow/ipc/reader.cc b/cpp/src/arrow/ipc/reader.cc index 28320d98df9d1..b47b773192774 100644 --- a/cpp/src/arrow/ipc/reader.cc +++ b/cpp/src/arrow/ipc/reader.cc @@ -33,6 +33,7 @@ #include "arrow/status.h" #include "arrow/table.h" #include "arrow/type.h" +#include "arrow/tensor.h" #include "arrow/util/logging.h" namespace arrow { @@ -186,28 +187,9 @@ class StreamReader::StreamReaderImpl { } Status ReadNextMessage(Message::Type expected_type, std::shared_ptr* message) { - std::shared_ptr buffer; - RETURN_NOT_OK(stream_->Read(sizeof(int32_t), &buffer)); - - if (buffer->size() != sizeof(int32_t)) { - *message = nullptr; - return Status::OK(); - } - - int32_t message_length = *reinterpret_cast(buffer->data()); - - if (message_length == 0) { - // Optional 0 EOS control message - *message = nullptr; - return Status::OK(); - } - - RETURN_NOT_OK(stream_->Read(message_length, &buffer)); - if (buffer->size() != message_length) { - return Status::IOError("Unexpected end of stream trying to read message"); - } + RETURN_NOT_OK(ReadMessage(stream_.get(), message)); - RETURN_NOT_OK(Message::Open(buffer, 0, message)); + if ((*message) == nullptr) { return Status::OK(); } if ((*message)->type() != expected_type) { std::stringstream ss; @@ -245,8 +227,7 @@ class StreamReader::StreamReaderImpl { std::shared_ptr message; RETURN_NOT_OK(ReadNextMessage(Message::SCHEMA, &message)); - SchemaMetadata schema_meta(message); - RETURN_NOT_OK(schema_meta.GetDictionaryTypes(&dictionary_types_)); + RETURN_NOT_OK(GetDictionaryTypes(message->header(), &dictionary_types_)); // TODO(wesm): In future, we may want to reconcile the ids in the stream with // those found in the schema @@ -255,7 +236,7 @@ class StreamReader::StreamReaderImpl { RETURN_NOT_OK(ReadNextDictionary()); } - return schema_meta.GetSchema(dictionary_memo_, &schema_); + return GetSchema(message->header(), dictionary_memo_, &schema_); } Status GetNextRecordBatch(std::shared_ptr* batch) { @@ -343,7 +324,6 @@ class FileReader::FileReaderImpl { // TODO(wesm): Verify the footer footer_ = flatbuf::GetFooter(footer_buffer_->data()); - schema_metadata_.reset(new SchemaMetadata(footer_->schema())); return Status::OK(); } @@ -372,8 +352,6 @@ class FileReader::FileReaderImpl { return FileBlockFromFlatbuffer(footer_->dictionaries()->Get(i)); } - const SchemaMetadata& schema_metadata() const { return *schema_metadata_; } - Status GetRecordBatch(int i, std::shared_ptr* batch) { DCHECK_GE(i, 0); DCHECK_LT(i, num_record_batches()); @@ -393,7 +371,7 @@ class FileReader::FileReaderImpl { } Status ReadSchema() { - RETURN_NOT_OK(schema_metadata_->GetDictionaryTypes(&dictionary_fields_)); + RETURN_NOT_OK(GetDictionaryTypes(footer_->schema(), &dictionary_fields_)); // Read 
all the dictionaries for (int i = 0; i < num_dictionaries(); ++i) { @@ -419,7 +397,7 @@ class FileReader::FileReaderImpl { } // Get the schema - return schema_metadata_->GetSchema(*dictionary_memo_, &schema_); + return GetSchema(footer_->schema(), *dictionary_memo_, &schema_); } Status Open(const std::shared_ptr& file, int64_t footer_offset) { @@ -441,7 +419,6 @@ class FileReader::FileReaderImpl { // Footer metadata std::shared_ptr footer_buffer_; const flatbuf::Footer* footer_; - std::unique_ptr schema_metadata_; DictionaryTypeMap dictionary_fields_; std::shared_ptr dictionary_memo_; @@ -485,26 +462,46 @@ Status FileReader::GetRecordBatch(int i, std::shared_ptr* batch) { return impl_->GetRecordBatch(i, batch); } -Status ReadRecordBatch(const std::shared_ptr& schema, int64_t offset, - io::RandomAccessFile* file, std::shared_ptr* out) { +static Status ReadContiguousPayload(int64_t offset, io::RandomAccessFile* file, + std::shared_ptr* message, std::shared_ptr* payload) { std::shared_ptr buffer; RETURN_NOT_OK(file->Seek(offset)); + RETURN_NOT_OK(ReadMessage(file, message)); - RETURN_NOT_OK(file->Read(sizeof(int32_t), &buffer)); - int32_t flatbuffer_size = *reinterpret_cast(buffer->data()); - - std::shared_ptr message; - RETURN_NOT_OK(file->Read(flatbuffer_size, &buffer)); - RETURN_NOT_OK(Message::Open(buffer, 0, &message)); + if (*message == nullptr) { + return Status::Invalid("Unable to read metadata at offset"); + } // TODO(ARROW-388): The buffer offsets start at 0, so we must construct a // RandomAccessFile according to that frame of reference - std::shared_ptr buffer_payload; - RETURN_NOT_OK(file->Read(message->body_length(), &buffer_payload)); - io::BufferReader buffer_reader(buffer_payload); + RETURN_NOT_OK(file->Read((*message)->body_length(), payload)); + return Status::OK(); +} +Status ReadRecordBatch(const std::shared_ptr& schema, int64_t offset, + io::RandomAccessFile* file, std::shared_ptr* out) { + std::shared_ptr payload; + std::shared_ptr message; + + RETURN_NOT_OK(ReadContiguousPayload(offset, file, &message, &payload)); + io::BufferReader buffer_reader(payload); return ReadRecordBatch(*message, schema, kMaxNestingDepth, &buffer_reader, out); } +Status ReadTensor( + int64_t offset, io::RandomAccessFile* file, std::shared_ptr* out) { + std::shared_ptr message; + std::shared_ptr data; + RETURN_NOT_OK(ReadContiguousPayload(offset, file, &message, &data)); + + std::shared_ptr type; + std::vector shape; + std::vector strides; + std::vector dim_names; + RETURN_NOT_OK( + GetTensorMetadata(message->header(), &type, &shape, &strides, &dim_names)); + return MakeTensor(type, data, shape, strides, dim_names, out); +} + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/reader.h b/cpp/src/arrow/ipc/reader.h index 6d9e6ca7b0ab7..b62f0527e0ca0 100644 --- a/cpp/src/arrow/ipc/reader.h +++ b/cpp/src/arrow/ipc/reader.h @@ -17,8 +17,8 @@ // Implement Arrow file layout for IPC/RPC purposes and short-lived storage -#ifndef ARROW_IPC_FILE_H -#define ARROW_IPC_FILE_H +#ifndef ARROW_IPC_READER_H +#define ARROW_IPC_READER_H #include #include @@ -33,6 +33,7 @@ class Buffer; class RecordBatch; class Schema; class Status; +class Tensor; namespace io { @@ -43,18 +44,6 @@ class RandomAccessFile; namespace ipc { -// Generic read functionsh; does not copy data if the input supports zero copy reads - -Status ReadRecordBatch(const Message& metadata, const std::shared_ptr& schema, - io::RandomAccessFile* file, std::shared_ptr* out); - -Status ReadRecordBatch(const Message& metadata, const 
std::shared_ptr& schema, - int max_recursion_depth, io::RandomAccessFile* file, - std::shared_ptr* out); - -Status ReadDictionary(const Message& metadata, const DictionaryTypeMap& dictionary_types, - io::RandomAccessFile* file, std::shared_ptr* out); - class ARROW_EXPORT StreamReader { public: ~StreamReader(); @@ -118,11 +107,24 @@ class ARROW_EXPORT FileReader { std::unique_ptr impl_; }; +// Generic read functionsh; does not copy data if the input supports zero copy reads +Status ARROW_EXPORT ReadRecordBatch(const Message& metadata, + const std::shared_ptr& schema, io::RandomAccessFile* file, + std::shared_ptr* out); + +Status ARROW_EXPORT ReadRecordBatch(const Message& metadata, + const std::shared_ptr& schema, int max_recursion_depth, + io::RandomAccessFile* file, std::shared_ptr* out); + /// Read encapsulated message and RecordBatch Status ARROW_EXPORT ReadRecordBatch(const std::shared_ptr& schema, int64_t offset, io::RandomAccessFile* file, std::shared_ptr* out); +/// EXPERIMENTAL: Read arrow::Tensor from a contiguous message +Status ARROW_EXPORT ReadTensor( + int64_t offset, io::RandomAccessFile* file, std::shared_ptr* out); + } // namespace ipc } // namespace arrow -#endif // ARROW_IPC_FILE_H +#endif // ARROW_IPC_READER_H diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc index 0a19f69d27d8c..249ef201c66bb 100644 --- a/cpp/src/arrow/ipc/writer.cc +++ b/cpp/src/arrow/ipc/writer.cc @@ -34,6 +34,7 @@ #include "arrow/memory_pool.h" #include "arrow/status.h" #include "arrow/table.h" +#include "arrow/tensor.h" #include "arrow/type.h" #include "arrow/util/bit-util.h" #include "arrow/util/logging.h" @@ -143,46 +144,6 @@ class RecordBatchWriter : public ArrayVisitor { num_rows, body_length, field_nodes_, buffer_meta_, out); } - Status WriteMetadata(int64_t num_rows, int64_t body_length, io::OutputStream* dst, - int32_t* metadata_length) { - // Now that we have computed the locations of all of the buffers in shared - // memory, the data header can be converted to a flatbuffer and written out - // - // Note: The memory written here is prefixed by the size of the flatbuffer - // itself as an int32_t. 
- std::shared_ptr metadata_fb; - RETURN_NOT_OK(WriteMetadataMessage(num_rows, body_length, &metadata_fb)); - - // Need to write 4 bytes (metadata size), the metadata, plus padding to - // end on an 8-byte offset - int64_t start_offset; - RETURN_NOT_OK(dst->Tell(&start_offset)); - - int32_t padded_metadata_length = static_cast(metadata_fb->size()) + 4; - const int32_t remainder = - (padded_metadata_length + static_cast(start_offset)) % 8; - if (remainder != 0) { padded_metadata_length += 8 - remainder; } - - // The returned metadata size includes the length prefix, the flatbuffer, - // plus padding - *metadata_length = padded_metadata_length; - - // Write the flatbuffer size prefix including padding - int32_t flatbuffer_size = padded_metadata_length - 4; - RETURN_NOT_OK( - dst->Write(reinterpret_cast(&flatbuffer_size), sizeof(int32_t))); - - // Write the flatbuffer - RETURN_NOT_OK(dst->Write(metadata_fb->data(), metadata_fb->size())); - - // Write any padding - int32_t padding = - padded_metadata_length - static_cast(metadata_fb->size()) - 4; - if (padding > 0) { RETURN_NOT_OK(dst->Write(kPaddingBytes, padding)); } - - return Status::OK(); - } - Status Write(const RecordBatch& batch, io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length) { RETURN_NOT_OK(Assemble(batch, body_length)); @@ -192,7 +153,14 @@ class RecordBatchWriter : public ArrayVisitor { RETURN_NOT_OK(dst->Tell(&start_position)); #endif - RETURN_NOT_OK(WriteMetadata(batch.num_rows(), *body_length, dst, metadata_length)); + // Now that we have computed the locations of all of the buffers in shared + // memory, the data header can be converted to a flatbuffer and written out + // + // Note: The memory written here is prefixed by the size of the flatbuffer + // itself as an int32_t. 
+ std::shared_ptr metadata_fb; + RETURN_NOT_OK(WriteMetadataMessage(batch.num_rows(), *body_length, &metadata_fb)); + RETURN_NOT_OK(WriteMessage(*metadata_fb, dst, metadata_length)); #ifndef NDEBUG RETURN_NOT_OK(dst->Tell(¤t_position)); @@ -504,6 +472,28 @@ Status WriteRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, return writer.Write(batch, dst, metadata_length, body_length); } +Status WriteLargeRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, + io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, + MemoryPool* pool) { + return WriteRecordBatch(batch, buffer_start_offset, dst, metadata_length, body_length, + pool, kMaxNestingDepth, true); +} + +Status WriteTensor(const Tensor& tensor, io::OutputStream* dst, int32_t* metadata_length, + int64_t* body_length) { + std::shared_ptr metadata; + RETURN_NOT_OK(WriteTensorMessage(tensor, 0, &metadata)); + RETURN_NOT_OK(WriteMessage(*metadata, dst, metadata_length)); + auto data = tensor.data(); + if (data) { + *body_length = data->size(); + return dst->Write(data->data(), *body_length); + } else { + *body_length = 0; + return Status::OK(); + } +} + Status WriteDictionary(int64_t dictionary_id, const std::shared_ptr& dictionary, int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, MemoryPool* pool) { @@ -736,12 +726,5 @@ Status FileWriter::Close() { return impl_->Close(); } -Status WriteLargeRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, - io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, - MemoryPool* pool) { - return WriteRecordBatch(batch, buffer_start_offset, dst, metadata_length, body_length, - pool, kMaxNestingDepth, true); -} - } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/writer.h b/cpp/src/arrow/ipc/writer.h index c572157b465a6..8b2dc9cd48788 100644 --- a/cpp/src/arrow/ipc/writer.h +++ b/cpp/src/arrow/ipc/writer.h @@ -17,8 +17,8 @@ // Implement Arrow streaming binary format -#ifndef ARROW_IPC_STREAM_H -#define ARROW_IPC_STREAM_H +#ifndef ARROW_IPC_WRITER_H +#define ARROW_IPC_WRITER_H #include #include @@ -36,6 +36,7 @@ class MemoryPool; class RecordBatch; class Schema; class Status; +class Tensor; namespace io { @@ -125,7 +126,12 @@ Status WriteLargeRecordBatch(const RecordBatch& batch, int64_t buffer_start_offs io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, MemoryPool* pool); +/// EXPERIMENTAL: Write arrow::Tensor as a contiguous message +/// +Status WriteTensor(const Tensor& tensor, io::OutputStream* dst, int32_t* metadata_length, + int64_t* body_length); + } // namespace ipc } // namespace arrow -#endif // ARROW_IPC_STREAM_H +#endif // ARROW_IPC_WRITER_H diff --git a/cpp/src/arrow/tensor-test.cc b/cpp/src/arrow/tensor-test.cc index 99a94934c7990..336905c21ae81 100644 --- a/cpp/src/arrow/tensor-test.cc +++ b/cpp/src/arrow/tensor-test.cc @@ -61,13 +61,36 @@ TEST(TestTensor, BasicCtors) { ASSERT_EQ(24, t1.size()); ASSERT_TRUE(t1.is_mutable()); - ASSERT_FALSE(t1.has_dim_names()); ASSERT_EQ(strides, t1.strides()); ASSERT_EQ(strides, t2.strides()); ASSERT_EQ("foo", t3.dim_name(0)); ASSERT_EQ("bar", t3.dim_name(1)); + ASSERT_EQ("", t1.dim_name(0)); + ASSERT_EQ("", t1.dim_name(1)); +} + +TEST(TestTensor, IsContiguous) { + const int64_t values = 24; + std::vector shape = {4, 6}; + std::vector strides = {48, 8}; + + using T = int64_t; + + std::shared_ptr buffer; + ASSERT_OK(AllocateBuffer(default_memory_pool(), values * sizeof(T), &buffer)); + + std::vector 
c_strides = {48, 8}; + std::vector f_strides = {8, 32}; + std::vector noncontig_strides = {8, 8}; + Int64Tensor t1(buffer, shape, c_strides); + Int64Tensor t2(buffer, shape, f_strides); + Int64Tensor t3(buffer, shape, noncontig_strides); + + ASSERT_TRUE(t1.is_contiguous()); + ASSERT_TRUE(t2.is_contiguous()); + ASSERT_FALSE(t3.is_contiguous()); } } // namespace arrow diff --git a/cpp/src/arrow/tensor.cc b/cpp/src/arrow/tensor.cc index 7c4593fc40e66..9a8de5119ea58 100644 --- a/cpp/src/arrow/tensor.cc +++ b/cpp/src/arrow/tensor.cc @@ -27,14 +27,15 @@ #include "arrow/array.h" #include "arrow/buffer.h" +#include "arrow/compare.h" #include "arrow/type.h" #include "arrow/type_traits.h" #include "arrow/util/logging.h" namespace arrow { -void ComputeRowMajorStrides(const FixedWidthType& type, const std::vector& shape, - std::vector* strides) { +static void ComputeRowMajorStrides(const FixedWidthType& type, + const std::vector& shape, std::vector* strides) { int64_t remaining = type.bit_width() / 8; for (int64_t dimsize : shape) { remaining *= dimsize; @@ -46,6 +47,15 @@ void ComputeRowMajorStrides(const FixedWidthType& type, const std::vector& shape, std::vector* strides) { + int64_t total = type.bit_width() / 8; + for (int64_t dimsize : shape) { + strides->push_back(total); + total *= dimsize; + } +} + /// Constructor with strides and dimension names Tensor::Tensor(const std::shared_ptr& type, const std::shared_ptr& data, const std::vector& shape, const std::vector& strides, @@ -66,14 +76,36 @@ Tensor::Tensor(const std::shared_ptr& type, const std::shared_ptr(dim_names_.size())); - return dim_names_[i]; + static const std::string kEmpty = ""; + if (dim_names_.size() == 0) { + return kEmpty; + } else { + DCHECK_LT(i, static_cast(dim_names_.size())); + return dim_names_[i]; + } } int64_t Tensor::size() const { return std::accumulate(shape_.begin(), shape_.end(), 1, std::multiplies()); } +bool Tensor::is_contiguous() const { + std::vector c_strides; + std::vector f_strides; + + const auto& fw_type = static_cast(*type_); + ComputeRowMajorStrides(fw_type, shape_, &c_strides); + ComputeColumnMajorStrides(fw_type, shape_, &f_strides); + return strides_ == c_strides || strides_ == f_strides; +} + +bool Tensor::Equals(const Tensor& other) const { + bool are_equal = false; + Status error = TensorEquals(*this, other, &are_equal); + if (!error.ok()) { DCHECK(false) << "Tensors not comparable: " << error.ToString(); } + return are_equal; +} + template NumericTensor::NumericTensor(const std::shared_ptr& data, const std::vector& shape, const std::vector& strides, @@ -112,4 +144,31 @@ template class ARROW_TEMPLATE_EXPORT NumericTensor; template class ARROW_TEMPLATE_EXPORT NumericTensor; template class ARROW_TEMPLATE_EXPORT NumericTensor; +#define TENSOR_CASE(TYPE, TENSOR_TYPE) \ + case Type::TYPE: \ + *tensor = std::make_shared(data, shape, strides, dim_names); \ + break; + +Status ARROW_EXPORT MakeTensor(const std::shared_ptr& type, + const std::shared_ptr& data, const std::vector& shape, + const std::vector& strides, const std::vector& dim_names, + std::shared_ptr* tensor) { + switch (type->type) { + TENSOR_CASE(INT8, Int8Tensor); + TENSOR_CASE(INT16, Int16Tensor); + TENSOR_CASE(INT32, Int32Tensor); + TENSOR_CASE(INT64, Int64Tensor); + TENSOR_CASE(UINT8, UInt8Tensor); + TENSOR_CASE(UINT16, UInt16Tensor); + TENSOR_CASE(UINT32, UInt32Tensor); + TENSOR_CASE(UINT64, UInt64Tensor); + TENSOR_CASE(HALF_FLOAT, HalfFloatTensor); + TENSOR_CASE(FLOAT, FloatTensor); + TENSOR_CASE(DOUBLE, DoubleTensor); + default: + return 
Status::NotImplemented(type->ToString()); + } + return Status::OK(); +} + } // namespace arrow diff --git a/cpp/src/arrow/tensor.h b/cpp/src/arrow/tensor.h index 7bee867a9b33a..eeb5c3e8e5536 100644 --- a/cpp/src/arrow/tensor.h +++ b/cpp/src/arrow/tensor.h @@ -73,12 +73,15 @@ class ARROW_EXPORT Tensor { const std::vector& shape, const std::vector& strides, const std::vector& dim_names); + std::shared_ptr type() const { return type_; } std::shared_ptr data() const { return data_; } + const std::vector& shape() const { return shape_; } const std::vector& strides() const { return strides_; } + int ndim() const { return static_cast(shape_.size()); } + const std::string& dim_name(int i) const; - bool has_dim_names() const { return shape_.size() > 0 && dim_names_.size() > 0; } /// Total number of value cells in the tensor int64_t size() const; @@ -86,13 +89,17 @@ class ARROW_EXPORT Tensor { /// Return true if the underlying data buffer is mutable bool is_mutable() const { return data_->is_mutable(); } + bool is_contiguous() const; + + Type::type type_enum() const { return type_->type; } + + bool Equals(const Tensor& other) const; + protected: Tensor() {} std::shared_ptr type_; - std::shared_ptr data_; - std::vector shape_; std::vector strides_; @@ -126,6 +133,11 @@ class ARROW_EXPORT NumericTensor : public Tensor { value_type* mutable_raw_data_; }; +Status ARROW_EXPORT MakeTensor(const std::shared_ptr& type, + const std::shared_ptr& data, const std::vector& shape, + const std::vector& strides, const std::vector& dim_names, + std::shared_ptr* tensor); + // ---------------------------------------------------------------------- // extern templates and other details diff --git a/cpp/src/arrow/type_traits.h b/cpp/src/arrow/type_traits.h index 1270aee1622ea..b73d5a68d257e 100644 --- a/cpp/src/arrow/type_traits.h +++ b/cpp/src/arrow/type_traits.h @@ -38,6 +38,7 @@ template <> struct TypeTraits { using ArrayType = UInt8Array; using BuilderType = UInt8Builder; + using TensorType = UInt8Tensor; static inline int64_t bytes_required(int64_t elements) { return elements; } constexpr static bool is_parameter_free = true; static inline std::shared_ptr type_singleton() { return uint8(); } @@ -47,6 +48,7 @@ template <> struct TypeTraits { using ArrayType = Int8Array; using BuilderType = Int8Builder; + using TensorType = Int8Tensor; static inline int64_t bytes_required(int64_t elements) { return elements; } constexpr static bool is_parameter_free = true; static inline std::shared_ptr type_singleton() { return int8(); } @@ -56,6 +58,7 @@ template <> struct TypeTraits { using ArrayType = UInt16Array; using BuilderType = UInt16Builder; + using TensorType = UInt16Tensor; static inline int64_t bytes_required(int64_t elements) { return elements * sizeof(uint16_t); @@ -68,6 +71,7 @@ template <> struct TypeTraits { using ArrayType = Int16Array; using BuilderType = Int16Builder; + using TensorType = Int16Tensor; static inline int64_t bytes_required(int64_t elements) { return elements * sizeof(int16_t); @@ -80,6 +84,7 @@ template <> struct TypeTraits { using ArrayType = UInt32Array; using BuilderType = UInt32Builder; + using TensorType = UInt32Tensor; static inline int64_t bytes_required(int64_t elements) { return elements * sizeof(uint32_t); @@ -92,6 +97,7 @@ template <> struct TypeTraits { using ArrayType = Int32Array; using BuilderType = Int32Builder; + using TensorType = Int32Tensor; static inline int64_t bytes_required(int64_t elements) { return elements * sizeof(int32_t); @@ -104,6 +110,7 @@ template <> struct 
TypeTraits { using ArrayType = UInt64Array; using BuilderType = UInt64Builder; + using TensorType = UInt64Tensor; static inline int64_t bytes_required(int64_t elements) { return elements * sizeof(uint64_t); @@ -116,6 +123,7 @@ template <> struct TypeTraits { using ArrayType = Int64Array; using BuilderType = Int64Builder; + using TensorType = Int64Tensor; static inline int64_t bytes_required(int64_t elements) { return elements * sizeof(int64_t); @@ -185,6 +193,7 @@ template <> struct TypeTraits { using ArrayType = HalfFloatArray; using BuilderType = HalfFloatBuilder; + using TensorType = HalfFloatTensor; static inline int64_t bytes_required(int64_t elements) { return elements * sizeof(uint16_t); @@ -197,6 +206,7 @@ template <> struct TypeTraits { using ArrayType = FloatArray; using BuilderType = FloatBuilder; + using TensorType = FloatTensor; static inline int64_t bytes_required(int64_t elements) { return static_cast(elements * sizeof(float)); @@ -209,6 +219,7 @@ template <> struct TypeTraits { using ArrayType = DoubleArray; using BuilderType = DoubleBuilder; + using TensorType = DoubleTensor; static inline int64_t bytes_required(int64_t elements) { return static_cast(elements * sizeof(double)); diff --git a/cpp/src/arrow/visitor_inline.h b/cpp/src/arrow/visitor_inline.h index 586b123e67cfb..cbc4d5acdb8cf 100644 --- a/cpp/src/arrow/visitor_inline.h +++ b/cpp/src/arrow/visitor_inline.h @@ -22,6 +22,7 @@ #include "arrow/array.h" #include "arrow/status.h" +#include "arrow/tensor.h" #include "arrow/type.h" namespace arrow { @@ -103,6 +104,31 @@ inline Status VisitArrayInline(const Array& array, VISITOR* visitor) { return Status::NotImplemented("Type not implemented"); } +#define TENSOR_VISIT_INLINE(TYPE_CLASS) \ + case TYPE_CLASS::type_id: \ + return visitor->Visit( \ + static_cast::TensorType&>(array)); + +template +inline Status VisitTensorInline(const Tensor& array, VISITOR* visitor) { + switch (array.type_enum()) { + TENSOR_VISIT_INLINE(Int8Type); + TENSOR_VISIT_INLINE(UInt8Type); + TENSOR_VISIT_INLINE(Int16Type); + TENSOR_VISIT_INLINE(UInt16Type); + TENSOR_VISIT_INLINE(Int32Type); + TENSOR_VISIT_INLINE(UInt32Type); + TENSOR_VISIT_INLINE(Int64Type); + TENSOR_VISIT_INLINE(UInt64Type); + TENSOR_VISIT_INLINE(HalfFloatType); + TENSOR_VISIT_INLINE(FloatType); + TENSOR_VISIT_INLINE(DoubleType); + default: + break; + } + return Status::NotImplemented("Type not implemented"); +} + } // namespace arrow #endif // ARROW_VISITOR_INLINE_H diff --git a/format/Tensor.fbs b/format/Tensor.fbs index bc5b6d1289b2f..18b614c3bde62 100644 --- a/format/Tensor.fbs +++ b/format/Tensor.fbs @@ -32,16 +32,6 @@ table TensorDim { name: string; } -enum TensorOrder : byte { - /// Higher dimensions vary first when traversing data in byte-contiguous - /// order, aka "C order" - ROW_MAJOR, - - /// Lower dimensions vary first when traversing data in byte-contiguous - /// order, aka "Fortran order" - COLUMN_MAJOR -} - table Tensor { /// The type of data contained in a value cell. 
Currently only fixed-width /// value types are supported, no strings or nested types @@ -50,8 +40,8 @@ table Tensor { /// The dimensions of the tensor, optionally named shape: [TensorDim]; - /// The memory order of the tensor's data - order: TensorOrder; + /// Non-negative byte offsets to advance one value cell along each dimension + strides: [long]; /// The location and size of the tensor's data data: Buffer; From 4938d8d7cea65d039650f684afaa29a74510c3e0 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 30 Mar 2017 18:22:11 -0400 Subject: [PATCH 0435/1644] ARROW-726: [C++] Fix segfault caused when passing non-buffer object to arrow::py::PyBuffer This leads to calling `Py_DECREF` on a null pointer Author: Wes McKinney Closes #459 from wesm/ARROW-726 and squashes the following commits: a764134 [Wes McKinney] Fix segfault caused when passing non-buffer object to arrow::py::PyBuffer. Fix some compiler warnings --- cpp/src/arrow/python/builtin_convert.cc | 4 ++-- cpp/src/arrow/python/common.cc | 4 ++-- cpp/src/arrow/python/pandas-test.cc | 10 ++++++++-- cpp/src/arrow/python/pandas_convert.cc | 10 +++++----- cpp/src/arrow/python/pandas_convert.h | 2 +- 5 files changed, 18 insertions(+), 12 deletions(-) diff --git a/cpp/src/arrow/python/builtin_convert.cc b/cpp/src/arrow/python/builtin_convert.cc index 9acccc149664b..6e59845dea76a 100644 --- a/cpp/src/arrow/python/builtin_convert.cc +++ b/cpp/src/arrow/python/builtin_convert.cc @@ -390,7 +390,7 @@ class BytesConverter : public TypedConverter { // No error checking length = PyBytes_GET_SIZE(bytes_obj); bytes = PyBytes_AS_STRING(bytes_obj); - RETURN_NOT_OK(typed_builder_->Append(bytes, length)); + RETURN_NOT_OK(typed_builder_->Append(bytes, static_cast(length))); } return Status::OK(); } @@ -422,7 +422,7 @@ class UTF8Converter : public TypedConverter { // No error checking length = PyBytes_GET_SIZE(bytes_obj); bytes = PyBytes_AS_STRING(bytes_obj); - RETURN_NOT_OK(typed_builder_->Append(bytes, length)); + RETURN_NOT_OK(typed_builder_->Append(bytes, static_cast(length))); } return Status::OK(); } diff --git a/cpp/src/arrow/python/common.cc b/cpp/src/arrow/python/common.cc index a5aea30884468..717cb5c5cc122 100644 --- a/cpp/src/arrow/python/common.cc +++ b/cpp/src/arrow/python/common.cc @@ -47,7 +47,7 @@ MemoryPool* get_memory_pool() { // ---------------------------------------------------------------------- // PyBuffer -PyBuffer::PyBuffer(PyObject* obj) : Buffer(nullptr, 0) { +PyBuffer::PyBuffer(PyObject* obj) : Buffer(nullptr, 0), obj_(nullptr) { if (PyObject_CheckBuffer(obj)) { obj_ = PyMemoryView_FromObject(obj); Py_buffer* buffer = PyMemoryView_GET_BUFFER(obj_); @@ -61,7 +61,7 @@ PyBuffer::PyBuffer(PyObject* obj) : Buffer(nullptr, 0) { PyBuffer::~PyBuffer() { PyAcquireGIL lock; - Py_DECREF(obj_); + Py_XDECREF(obj_); } } // namespace py diff --git a/cpp/src/arrow/python/pandas-test.cc b/cpp/src/arrow/python/pandas-test.cc index 0d643df2e9f38..a4e640b83718b 100644 --- a/cpp/src/arrow/python/pandas-test.cc +++ b/cpp/src/arrow/python/pandas-test.cc @@ -24,20 +24,26 @@ #include "arrow/array.h" #include "arrow/builder.h" -#include "arrow/python/pandas_convert.h" #include "arrow/table.h" #include "arrow/test-util.h" #include "arrow/type.h" +#include "arrow/python/common.h" +#include "arrow/python/pandas_convert.h" + namespace arrow { namespace py { +TEST(PyBuffer, InvalidInputObject) { + PyBuffer buffer(Py_None); +} + TEST(PandasConversionTest, TestObjectBlockWriteFails) { StringBuilder builder(default_memory_pool()); const char value[] = {'\xf1', 
'\0'}; for (int i = 0; i < 1000; ++i) { - builder.Append(value, strlen(value)); + builder.Append(value, static_cast(strlen(value))); } std::shared_ptr arr; diff --git a/cpp/src/arrow/python/pandas_convert.cc b/cpp/src/arrow/python/pandas_convert.cc index 685b1f421c457..db2e90eb8b0ff 100644 --- a/cpp/src/arrow/python/pandas_convert.cc +++ b/cpp/src/arrow/python/pandas_convert.cc @@ -159,13 +159,13 @@ Status AppendObjectStrings(StringBuilder& string_builder, PyObject** objects, PyErr_Clear(); return Status::TypeError("failed converting unicode to UTF8"); } - const int64_t length = PyBytes_GET_SIZE(obj); + const int32_t length = static_cast(PyBytes_GET_SIZE(obj)); Status s = string_builder.Append(PyBytes_AS_STRING(obj), length); Py_DECREF(obj); if (!s.ok()) { return s; } } else if (PyBytes_Check(obj)) { *have_bytes = true; - const int64_t length = PyBytes_GET_SIZE(obj); + const int32_t length = static_cast(PyBytes_GET_SIZE(obj)); RETURN_NOT_OK(string_builder.Append(PyBytes_AS_STRING(obj), length)); } else { string_builder.AppendNull(); @@ -235,7 +235,7 @@ class PandasConverter : public TypeVisitor { } Status InitNullBitmap() { - int null_bytes = BitUtil::BytesForBits(length_); + int64_t null_bytes = BitUtil::BytesForBits(length_); null_bitmap_ = std::make_shared(pool_); RETURN_NOT_OK(null_bitmap_->Resize(null_bytes)); @@ -357,7 +357,7 @@ inline Status PandasConverter::ConvertData(std::shared_ptr* data) { template <> inline Status PandasConverter::ConvertData(std::shared_ptr* data) { - int nbytes = BitUtil::BytesForBits(length_); + int64_t nbytes = BitUtil::BytesForBits(length_); auto buffer = std::make_shared(pool_); RETURN_NOT_OK(buffer->Resize(nbytes)); @@ -423,7 +423,7 @@ Status PandasConverter::ConvertBooleans(std::shared_ptr* out) { PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); - int nbytes = BitUtil::BytesForBits(length_); + int64_t nbytes = BitUtil::BytesForBits(length_); auto data = std::make_shared(pool_); RETURN_NOT_OK(data->Resize(nbytes)); uint8_t* bitmap = data->mutable_data(); diff --git a/cpp/src/arrow/python/pandas_convert.h b/cpp/src/arrow/python/pandas_convert.h index a33741efaa492..12644d98da156 100644 --- a/cpp/src/arrow/python/pandas_convert.h +++ b/cpp/src/arrow/python/pandas_convert.h @@ -31,7 +31,7 @@ namespace arrow { class Array; class Column; -class DataType; +struct DataType; class MemoryPool; class Status; class Table; From ae2da980b94c73719f659071537e40570981adf4 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 30 Mar 2017 18:31:23 -0400 Subject: [PATCH 0436/1644] ARROW-743: [C++] Consolidate all but decimal array tests into array-test, collect some tests in type-test.cc Author: Wes McKinney Closes #463 from wesm/ARROW-743 and squashes the following commits: 49df9f7 [Wes McKinney] Consolidate all but decimal array tests into array-test, move some type tests to type-test --- cpp/src/arrow/CMakeLists.txt | 6 - cpp/src/arrow/array-dictionary-test.cc | 150 -- cpp/src/arrow/array-list-test.cc | 238 ---- cpp/src/arrow/array-primitive-test.cc | 543 ------- cpp/src/arrow/array-string-test.cc | 654 --------- cpp/src/arrow/array-struct-test.cc | 410 ------ cpp/src/arrow/array-test.cc | 1812 +++++++++++++++++++++++- cpp/src/arrow/array-union-test.cc | 67 - cpp/src/arrow/io/io-hdfs-test.cc | 7 +- cpp/src/arrow/type-test.cc | 45 + 10 files changed, 1858 insertions(+), 2074 deletions(-) delete mode 100644 cpp/src/arrow/array-dictionary-test.cc delete mode 100644 cpp/src/arrow/array-list-test.cc delete mode 100644 cpp/src/arrow/array-primitive-test.cc delete 
mode 100644 cpp/src/arrow/array-string-test.cc delete mode 100644 cpp/src/arrow/array-struct-test.cc delete mode 100644 cpp/src/arrow/array-union-test.cc diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index 5c9aadf9ee79b..bd33bf5b8296e 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -50,12 +50,6 @@ install( ADD_ARROW_TEST(allocator-test) ADD_ARROW_TEST(array-test) ADD_ARROW_TEST(array-decimal-test) -ADD_ARROW_TEST(array-dictionary-test) -ADD_ARROW_TEST(array-list-test) -ADD_ARROW_TEST(array-primitive-test) -ADD_ARROW_TEST(array-string-test) -ADD_ARROW_TEST(array-struct-test) -ADD_ARROW_TEST(array-union-test) ADD_ARROW_TEST(buffer-test) ADD_ARROW_TEST(memory_pool-test) ADD_ARROW_TEST(pretty_print-test) diff --git a/cpp/src/arrow/array-dictionary-test.cc b/cpp/src/arrow/array-dictionary-test.cc deleted file mode 100644 index 0c4e628111a15..0000000000000 --- a/cpp/src/arrow/array-dictionary-test.cc +++ /dev/null @@ -1,150 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
- -#include -#include -#include -#include -#include - -#include "gtest/gtest.h" - -#include "arrow/array.h" -#include "arrow/buffer.h" -#include "arrow/memory_pool.h" -#include "arrow/test-util.h" -#include "arrow/type.h" - -namespace arrow { - -TEST(TestDictionary, Basics) { - std::vector values = {100, 1000, 10000, 100000}; - std::shared_ptr dict; - ArrayFromVector(values, &dict); - - std::shared_ptr type1 = - std::dynamic_pointer_cast(dictionary(int16(), dict)); - DictionaryType type2(int16(), dict); - - ASSERT_TRUE(int16()->Equals(type1->index_type())); - ASSERT_TRUE(type1->dictionary()->Equals(dict)); - - ASSERT_TRUE(int16()->Equals(type2.index_type())); - ASSERT_TRUE(type2.dictionary()->Equals(dict)); - - ASSERT_EQ("dictionary", type1->ToString()); -} - -TEST(TestDictionary, Equals) { - std::vector is_valid = {true, true, false, true, true, true}; - - std::shared_ptr dict; - std::vector dict_values = {"foo", "bar", "baz"}; - ArrayFromVector(dict_values, &dict); - std::shared_ptr dict_type = dictionary(int16(), dict); - - std::shared_ptr dict2; - std::vector dict2_values = {"foo", "bar", "baz", "qux"}; - ArrayFromVector(dict2_values, &dict2); - std::shared_ptr dict2_type = dictionary(int16(), dict2); - - std::shared_ptr indices; - std::vector indices_values = {1, 2, -1, 0, 2, 0}; - ArrayFromVector(is_valid, indices_values, &indices); - - std::shared_ptr indices2; - std::vector indices2_values = {1, 2, 0, 0, 2, 0}; - ArrayFromVector(is_valid, indices2_values, &indices2); - - std::shared_ptr indices3; - std::vector indices3_values = {1, 1, 0, 0, 2, 0}; - ArrayFromVector(is_valid, indices3_values, &indices3); - - auto array = std::make_shared(dict_type, indices); - auto array2 = std::make_shared(dict_type, indices2); - auto array3 = std::make_shared(dict2_type, indices); - auto array4 = std::make_shared(dict_type, indices3); - - ASSERT_TRUE(array->Equals(array)); - - // Equal, because the unequal index is masked by null - ASSERT_TRUE(array->Equals(array2)); - - // Unequal dictionaries - ASSERT_FALSE(array->Equals(array3)); - - // Unequal indices - ASSERT_FALSE(array->Equals(array4)); - - // RangeEquals - ASSERT_TRUE(array->RangeEquals(3, 6, 3, array4)); - ASSERT_FALSE(array->RangeEquals(1, 3, 1, array4)); - - // ARROW-33 Test slices - const int64_t size = array->length(); - - std::shared_ptr slice, slice2; - slice = array->Array::Slice(2); - slice2 = array->Array::Slice(2); - ASSERT_EQ(size - 2, slice->length()); - - ASSERT_TRUE(slice->Equals(slice2)); - ASSERT_TRUE(array->RangeEquals(2, array->length(), 0, slice)); - - // Chained slices - slice2 = array->Array::Slice(1)->Array::Slice(1); - ASSERT_TRUE(slice->Equals(slice2)); - - slice = array->Slice(1, 3); - slice2 = array->Slice(1, 3); - ASSERT_EQ(3, slice->length()); - - ASSERT_TRUE(slice->Equals(slice2)); - ASSERT_TRUE(array->RangeEquals(1, 4, 0, slice)); -} - -TEST(TestDictionary, Validate) { - std::vector is_valid = {true, true, false, true, true, true}; - - std::shared_ptr dict; - std::vector dict_values = {"foo", "bar", "baz"}; - ArrayFromVector(dict_values, &dict); - std::shared_ptr dict_type = dictionary(int16(), dict); - - std::shared_ptr indices; - std::vector indices_values = {1, 2, 0, 0, 2, 0}; - ArrayFromVector(is_valid, indices_values, &indices); - - std::shared_ptr indices2; - std::vector indices2_values = {1., 2., 0., 0., 2., 0.}; - ArrayFromVector(is_valid, indices2_values, &indices2); - - std::shared_ptr indices3; - std::vector indices3_values = {1, 2, 0, 0, 2, 0}; - ArrayFromVector(is_valid, indices3_values, 
&indices3); - - std::shared_ptr arr = std::make_shared(dict_type, indices); - std::shared_ptr arr2 = std::make_shared(dict_type, indices2); - std::shared_ptr arr3 = std::make_shared(dict_type, indices3); - - // Only checking index type for now - ASSERT_OK(arr->Validate()); - ASSERT_RAISES(Invalid, arr2->Validate()); - ASSERT_OK(arr3->Validate()); -} - -} // namespace arrow diff --git a/cpp/src/arrow/array-list-test.cc b/cpp/src/arrow/array-list-test.cc deleted file mode 100644 index 1cfa77f684868..0000000000000 --- a/cpp/src/arrow/array-list-test.cc +++ /dev/null @@ -1,238 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#include -#include -#include -#include -#include - -#include "gtest/gtest.h" - -#include "arrow/array.h" -#include "arrow/builder.h" -#include "arrow/status.h" -#include "arrow/test-common.h" -#include "arrow/test-util.h" -#include "arrow/type.h" - -using std::shared_ptr; -using std::string; -using std::unique_ptr; -using std::vector; - -namespace arrow { - -// ---------------------------------------------------------------------- -// List tests - -class TestListBuilder : public TestBuilder { - public: - void SetUp() { - TestBuilder::SetUp(); - - value_type_ = TypePtr(new Int32Type()); - type_ = TypePtr(new ListType(value_type_)); - - std::shared_ptr tmp; - ASSERT_OK(MakeBuilder(pool_, type_, &tmp)); - builder_ = std::dynamic_pointer_cast(tmp); - } - - void Done() { - std::shared_ptr out; - EXPECT_OK(builder_->Finish(&out)); - result_ = std::dynamic_pointer_cast(out); - } - - protected: - TypePtr value_type_; - TypePtr type_; - - shared_ptr builder_; - shared_ptr result_; -}; - -TEST_F(TestListBuilder, Equality) { - Int32Builder* vb = static_cast(builder_->value_builder().get()); - - std::shared_ptr array, equal_array, unequal_array; - vector equal_offsets = {0, 1, 2, 5, 6, 7, 8, 10}; - vector equal_values = {1, 2, 3, 4, 5, 2, 2, 2, 5, 6}; - vector unequal_offsets = {0, 1, 4, 7}; - vector unequal_values = {1, 2, 2, 2, 3, 4, 5}; - - // setup two equal arrays - ASSERT_OK(builder_->Append(equal_offsets.data(), equal_offsets.size())); - ASSERT_OK(vb->Append(equal_values.data(), equal_values.size())); - - ASSERT_OK(builder_->Finish(&array)); - ASSERT_OK(builder_->Append(equal_offsets.data(), equal_offsets.size())); - ASSERT_OK(vb->Append(equal_values.data(), equal_values.size())); - - ASSERT_OK(builder_->Finish(&equal_array)); - // now an unequal one - ASSERT_OK(builder_->Append(unequal_offsets.data(), unequal_offsets.size())); - ASSERT_OK(vb->Append(unequal_values.data(), unequal_values.size())); - - ASSERT_OK(builder_->Finish(&unequal_array)); - - // Test array equality - EXPECT_TRUE(array->Equals(array)); - EXPECT_TRUE(array->Equals(equal_array)); - EXPECT_TRUE(equal_array->Equals(array)); - 
EXPECT_FALSE(equal_array->Equals(unequal_array)); - EXPECT_FALSE(unequal_array->Equals(equal_array)); - - // Test range equality - EXPECT_TRUE(array->RangeEquals(0, 1, 0, unequal_array)); - EXPECT_FALSE(array->RangeEquals(0, 2, 0, unequal_array)); - EXPECT_FALSE(array->RangeEquals(1, 2, 1, unequal_array)); - EXPECT_TRUE(array->RangeEquals(2, 3, 2, unequal_array)); - - // Check with slices, ARROW-33 - std::shared_ptr slice, slice2; - - slice = array->Slice(2); - slice2 = array->Slice(2); - ASSERT_EQ(array->length() - 2, slice->length()); - - ASSERT_TRUE(slice->Equals(slice2)); - ASSERT_TRUE(array->RangeEquals(2, slice->length(), 0, slice)); - - // Chained slices - slice2 = array->Slice(1)->Slice(1); - ASSERT_TRUE(slice->Equals(slice2)); - - slice = array->Slice(1, 4); - slice2 = array->Slice(1, 4); - ASSERT_EQ(4, slice->length()); - - ASSERT_TRUE(slice->Equals(slice2)); - ASSERT_TRUE(array->RangeEquals(1, 5, 0, slice)); -} - -TEST_F(TestListBuilder, TestResize) {} - -TEST_F(TestListBuilder, TestAppendNull) { - ASSERT_OK(builder_->AppendNull()); - ASSERT_OK(builder_->AppendNull()); - - Done(); - - ASSERT_OK(result_->Validate()); - ASSERT_TRUE(result_->IsNull(0)); - ASSERT_TRUE(result_->IsNull(1)); - - ASSERT_EQ(0, result_->raw_value_offsets()[0]); - ASSERT_EQ(0, result_->value_offset(1)); - ASSERT_EQ(0, result_->value_offset(2)); - - Int32Array* values = static_cast(result_->values().get()); - ASSERT_EQ(0, values->length()); -} - -void ValidateBasicListArray(const ListArray* result, const vector& values, - const vector& is_valid) { - ASSERT_OK(result->Validate()); - ASSERT_EQ(1, result->null_count()); - ASSERT_EQ(0, result->values()->null_count()); - - ASSERT_EQ(3, result->length()); - vector ex_offsets = {0, 3, 3, 7}; - for (size_t i = 0; i < ex_offsets.size(); ++i) { - ASSERT_EQ(ex_offsets[i], result->value_offset(i)); - } - - for (int i = 0; i < result->length(); ++i) { - ASSERT_EQ(!static_cast(is_valid[i]), result->IsNull(i)); - } - - ASSERT_EQ(7, result->values()->length()); - Int32Array* varr = static_cast(result->values().get()); - - for (size_t i = 0; i < values.size(); ++i) { - ASSERT_EQ(values[i], varr->Value(i)); - } -} - -TEST_F(TestListBuilder, TestBasics) { - vector values = {0, 1, 2, 3, 4, 5, 6}; - vector lengths = {3, 0, 4}; - vector is_valid = {1, 0, 1}; - - Int32Builder* vb = static_cast(builder_->value_builder().get()); - - ASSERT_OK(builder_->Reserve(lengths.size())); - ASSERT_OK(vb->Reserve(values.size())); - - int pos = 0; - for (size_t i = 0; i < lengths.size(); ++i) { - ASSERT_OK(builder_->Append(is_valid[i] > 0)); - for (int j = 0; j < lengths[i]; ++j) { - vb->Append(values[pos++]); - } - } - - Done(); - ValidateBasicListArray(result_.get(), values, is_valid); -} - -TEST_F(TestListBuilder, BulkAppend) { - vector values = {0, 1, 2, 3, 4, 5, 6}; - vector lengths = {3, 0, 4}; - vector is_valid = {1, 0, 1}; - vector offsets = {0, 3, 3}; - - Int32Builder* vb = static_cast(builder_->value_builder().get()); - ASSERT_OK(vb->Reserve(values.size())); - - builder_->Append(offsets.data(), offsets.size(), is_valid.data()); - for (int32_t value : values) { - vb->Append(value); - } - Done(); - ValidateBasicListArray(result_.get(), values, is_valid); -} - -TEST_F(TestListBuilder, BulkAppendInvalid) { - vector values = {0, 1, 2, 3, 4, 5, 6}; - vector lengths = {3, 0, 4}; - vector is_null = {0, 1, 0}; - vector is_valid = {1, 0, 1}; - vector offsets = {0, 2, 4}; // should be 0, 3, 3 given the is_null array - - Int32Builder* vb = static_cast(builder_->value_builder().get()); - 
ASSERT_OK(vb->Reserve(values.size())); - - builder_->Append(offsets.data(), offsets.size(), is_valid.data()); - builder_->Append(offsets.data(), offsets.size(), is_valid.data()); - for (int32_t value : values) { - vb->Append(value); - } - - Done(); - ASSERT_RAISES(Invalid, result_->Validate()); -} - -TEST_F(TestListBuilder, TestZeroLength) { - // All buffers are null - Done(); - ASSERT_OK(result_->Validate()); -} - -} // namespace arrow diff --git a/cpp/src/arrow/array-primitive-test.cc b/cpp/src/arrow/array-primitive-test.cc deleted file mode 100644 index fe60170cc5cc4..0000000000000 --- a/cpp/src/arrow/array-primitive-test.cc +++ /dev/null @@ -1,543 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#include -#include -#include -#include - -#include "gtest/gtest.h" - -#include "arrow/array.h" -#include "arrow/buffer.h" -#include "arrow/builder.h" -#include "arrow/status.h" -#include "arrow/test-common.h" -#include "arrow/test-util.h" -#include "arrow/type.h" -#include "arrow/type_traits.h" -#include "arrow/util/bit-util.h" - -using std::string; -using std::shared_ptr; -using std::unique_ptr; -using std::vector; - -namespace arrow { - -class Array; - -#define PRIMITIVE_TEST(KLASS, ENUM, NAME) \ - TEST(TypesTest, TestPrimitive_##ENUM) { \ - KLASS tp; \ - \ - ASSERT_EQ(tp.type, Type::ENUM); \ - ASSERT_EQ(tp.ToString(), string(NAME)); \ - } - -PRIMITIVE_TEST(Int8Type, INT8, "int8"); -PRIMITIVE_TEST(Int16Type, INT16, "int16"); -PRIMITIVE_TEST(Int32Type, INT32, "int32"); -PRIMITIVE_TEST(Int64Type, INT64, "int64"); -PRIMITIVE_TEST(UInt8Type, UINT8, "uint8"); -PRIMITIVE_TEST(UInt16Type, UINT16, "uint16"); -PRIMITIVE_TEST(UInt32Type, UINT32, "uint32"); -PRIMITIVE_TEST(UInt64Type, UINT64, "uint64"); - -PRIMITIVE_TEST(FloatType, FLOAT, "float"); -PRIMITIVE_TEST(DoubleType, DOUBLE, "double"); - -PRIMITIVE_TEST(BooleanType, BOOL, "bool"); - -// ---------------------------------------------------------------------- -// Primitive type tests - -TEST_F(TestBuilder, TestReserve) { - builder_->Init(10); - ASSERT_EQ(2, builder_->null_bitmap()->size()); - - builder_->Reserve(30); - ASSERT_EQ(4, builder_->null_bitmap()->size()); -} - -template -class TestPrimitiveBuilder : public TestBuilder { - public: - typedef typename Attrs::ArrayType ArrayType; - typedef typename Attrs::BuilderType BuilderType; - typedef typename Attrs::T T; - typedef typename Attrs::Type Type; - - virtual void SetUp() { - TestBuilder::SetUp(); - - type_ = Attrs::type(); - - std::shared_ptr tmp; - ASSERT_OK(MakeBuilder(pool_, type_, &tmp)); - builder_ = std::dynamic_pointer_cast(tmp); - - ASSERT_OK(MakeBuilder(pool_, type_, &tmp)); - builder_nn_ = std::dynamic_pointer_cast(tmp); - } - - void RandomData(int64_t N, double pct_null = 0.1) { - Attrs::draw(N, &draws_); - - 
valid_bytes_.resize(static_cast(N)); - test::random_null_bytes(N, pct_null, valid_bytes_.data()); - } - - void Check(const std::shared_ptr& builder, bool nullable) { - int64_t size = builder->length(); - - auto ex_data = std::make_shared( - reinterpret_cast(draws_.data()), size * sizeof(T)); - - std::shared_ptr ex_null_bitmap; - int64_t ex_null_count = 0; - - if (nullable) { - ex_null_bitmap = test::bytes_to_null_buffer(valid_bytes_); - ex_null_count = test::null_count(valid_bytes_); - } else { - ex_null_bitmap = nullptr; - } - - auto expected = - std::make_shared(size, ex_data, ex_null_bitmap, ex_null_count); - - std::shared_ptr out; - ASSERT_OK(builder->Finish(&out)); - - std::shared_ptr result = std::dynamic_pointer_cast(out); - - // Builder is now reset - ASSERT_EQ(0, builder->length()); - ASSERT_EQ(0, builder->capacity()); - ASSERT_EQ(0, builder->null_count()); - ASSERT_EQ(nullptr, builder->data()); - - ASSERT_EQ(ex_null_count, result->null_count()); - ASSERT_TRUE(result->Equals(*expected)); - } - - protected: - std::shared_ptr type_; - shared_ptr builder_; - shared_ptr builder_nn_; - - vector draws_; - vector valid_bytes_; -}; - -#define PTYPE_DECL(CapType, c_type) \ - typedef CapType##Array ArrayType; \ - typedef CapType##Builder BuilderType; \ - typedef CapType##Type Type; \ - typedef c_type T; \ - \ - static std::shared_ptr type() { \ - return std::shared_ptr(new Type()); \ - } - -#define PINT_DECL(CapType, c_type, LOWER, UPPER) \ - struct P##CapType { \ - PTYPE_DECL(CapType, c_type); \ - static void draw(int64_t N, vector* draws) { \ - test::randint(N, LOWER, UPPER, draws); \ - } \ - } - -#define PFLOAT_DECL(CapType, c_type, LOWER, UPPER) \ - struct P##CapType { \ - PTYPE_DECL(CapType, c_type); \ - static void draw(int64_t N, vector* draws) { \ - test::random_real(N, 0, LOWER, UPPER, draws); \ - } \ - } - -PINT_DECL(UInt8, uint8_t, 0, UINT8_MAX); -PINT_DECL(UInt16, uint16_t, 0, UINT16_MAX); -PINT_DECL(UInt32, uint32_t, 0, UINT32_MAX); -PINT_DECL(UInt64, uint64_t, 0, UINT64_MAX); - -PINT_DECL(Int8, int8_t, INT8_MIN, INT8_MAX); -PINT_DECL(Int16, int16_t, INT16_MIN, INT16_MAX); -PINT_DECL(Int32, int32_t, INT32_MIN, INT32_MAX); -PINT_DECL(Int64, int64_t, INT64_MIN, INT64_MAX); - -PFLOAT_DECL(Float, float, -1000, 1000); -PFLOAT_DECL(Double, double, -1000, 1000); - -struct PBoolean { - PTYPE_DECL(Boolean, uint8_t); -}; - -template <> -void TestPrimitiveBuilder::RandomData(int64_t N, double pct_null) { - draws_.resize(static_cast(N)); - valid_bytes_.resize(static_cast(N)); - - test::random_null_bytes(N, 0.5, draws_.data()); - test::random_null_bytes(N, pct_null, valid_bytes_.data()); -} - -template <> -void TestPrimitiveBuilder::Check( - const std::shared_ptr& builder, bool nullable) { - int64_t size = builder->length(); - - auto ex_data = test::bytes_to_null_buffer(draws_); - - std::shared_ptr ex_null_bitmap; - int64_t ex_null_count = 0; - - if (nullable) { - ex_null_bitmap = test::bytes_to_null_buffer(valid_bytes_); - ex_null_count = test::null_count(valid_bytes_); - } else { - ex_null_bitmap = nullptr; - } - - auto expected = - std::make_shared(size, ex_data, ex_null_bitmap, ex_null_count); - - std::shared_ptr out; - ASSERT_OK(builder->Finish(&out)); - std::shared_ptr result = std::dynamic_pointer_cast(out); - - // Builder is now reset - ASSERT_EQ(0, builder->length()); - ASSERT_EQ(0, builder->capacity()); - ASSERT_EQ(0, builder->null_count()); - ASSERT_EQ(nullptr, builder->data()); - - ASSERT_EQ(ex_null_count, result->null_count()); - - ASSERT_EQ(expected->length(), 
result->length()); - - for (int64_t i = 0; i < result->length(); ++i) { - if (nullable) { ASSERT_EQ(valid_bytes_[i] == 0, result->IsNull(i)) << i; } - bool actual = BitUtil::GetBit(result->data()->data(), i); - ASSERT_EQ(static_cast(draws_[i]), actual) << i; - } - ASSERT_TRUE(result->Equals(*expected)); -} - -typedef ::testing::Types - Primitives; - -TYPED_TEST_CASE(TestPrimitiveBuilder, Primitives); - -#define DECL_T() typedef typename TestFixture::T T; - -#define DECL_TYPE() typedef typename TestFixture::Type Type; - -#define DECL_ARRAYTYPE() typedef typename TestFixture::ArrayType ArrayType; - -TYPED_TEST(TestPrimitiveBuilder, TestInit) { - DECL_TYPE(); - - int64_t n = 1000; - ASSERT_OK(this->builder_->Reserve(n)); - ASSERT_EQ(BitUtil::NextPower2(n), this->builder_->capacity()); - ASSERT_EQ(BitUtil::NextPower2(TypeTraits::bytes_required(n)), - this->builder_->data()->size()); - - // unsure if this should go in all builder classes - ASSERT_EQ(0, this->builder_->num_children()); -} - -TYPED_TEST(TestPrimitiveBuilder, TestAppendNull) { - int64_t size = 1000; - for (int64_t i = 0; i < size; ++i) { - ASSERT_OK(this->builder_->AppendNull()); - } - - std::shared_ptr result; - ASSERT_OK(this->builder_->Finish(&result)); - - for (int64_t i = 0; i < size; ++i) { - ASSERT_TRUE(result->IsNull(i)) << i; - } -} - -TYPED_TEST(TestPrimitiveBuilder, TestArrayDtorDealloc) { - DECL_T(); - - int64_t size = 1000; - - vector& draws = this->draws_; - vector& valid_bytes = this->valid_bytes_; - - int64_t memory_before = this->pool_->bytes_allocated(); - - this->RandomData(size); - - this->builder_->Reserve(size); - - int64_t i; - for (i = 0; i < size; ++i) { - if (valid_bytes[i] > 0) { - this->builder_->Append(draws[i]); - } else { - this->builder_->AppendNull(); - } - } - - do { - std::shared_ptr result; - ASSERT_OK(this->builder_->Finish(&result)); - } while (false); - - ASSERT_EQ(memory_before, this->pool_->bytes_allocated()); -} - -TYPED_TEST(TestPrimitiveBuilder, Equality) { - DECL_T(); - - const int64_t size = 1000; - this->RandomData(size); - vector& draws = this->draws_; - vector& valid_bytes = this->valid_bytes_; - std::shared_ptr array, equal_array, unequal_array; - auto builder = this->builder_.get(); - ASSERT_OK(MakeArray(valid_bytes, draws, size, builder, &array)); - ASSERT_OK(MakeArray(valid_bytes, draws, size, builder, &equal_array)); - - // Make the not equal array by negating the first valid element with itself. 
- const auto first_valid = std::find_if( - valid_bytes.begin(), valid_bytes.end(), [](uint8_t valid) { return valid > 0; }); - const int64_t first_valid_idx = std::distance(valid_bytes.begin(), first_valid); - // This should be true with a very high probability, but might introduce flakiness - ASSERT_LT(first_valid_idx, size - 1); - draws[first_valid_idx] = - static_cast(~*reinterpret_cast(&draws[first_valid_idx])); - ASSERT_OK(MakeArray(valid_bytes, draws, size, builder, &unequal_array)); - - // test normal equality - EXPECT_TRUE(array->Equals(array)); - EXPECT_TRUE(array->Equals(equal_array)); - EXPECT_TRUE(equal_array->Equals(array)); - EXPECT_FALSE(equal_array->Equals(unequal_array)); - EXPECT_FALSE(unequal_array->Equals(equal_array)); - - // Test range equality - EXPECT_FALSE(array->RangeEquals(0, first_valid_idx + 1, 0, unequal_array)); - EXPECT_FALSE(array->RangeEquals(first_valid_idx, size, first_valid_idx, unequal_array)); - EXPECT_TRUE(array->RangeEquals(0, first_valid_idx, 0, unequal_array)); - EXPECT_TRUE( - array->RangeEquals(first_valid_idx + 1, size, first_valid_idx + 1, unequal_array)); -} - -TYPED_TEST(TestPrimitiveBuilder, SliceEquality) { - DECL_T(); - - const int64_t size = 1000; - this->RandomData(size); - vector& draws = this->draws_; - vector& valid_bytes = this->valid_bytes_; - auto builder = this->builder_.get(); - - std::shared_ptr array; - ASSERT_OK(MakeArray(valid_bytes, draws, size, builder, &array)); - - std::shared_ptr slice, slice2; - - slice = array->Slice(5); - slice2 = array->Slice(5); - ASSERT_EQ(size - 5, slice->length()); - - ASSERT_TRUE(slice->Equals(slice2)); - ASSERT_TRUE(array->RangeEquals(5, array->length(), 0, slice)); - - // Chained slices - slice2 = array->Slice(2)->Slice(3); - ASSERT_TRUE(slice->Equals(slice2)); - - slice = array->Slice(5, 10); - slice2 = array->Slice(5, 10); - ASSERT_EQ(10, slice->length()); - - ASSERT_TRUE(slice->Equals(slice2)); - ASSERT_TRUE(array->RangeEquals(5, 15, 0, slice)); -} - -TYPED_TEST(TestPrimitiveBuilder, TestAppendScalar) { - DECL_T(); - - const int64_t size = 10000; - - vector& draws = this->draws_; - vector& valid_bytes = this->valid_bytes_; - - this->RandomData(size); - - this->builder_->Reserve(1000); - this->builder_nn_->Reserve(1000); - - int64_t null_count = 0; - // Append the first 1000 - for (size_t i = 0; i < 1000; ++i) { - if (valid_bytes[i] > 0) { - this->builder_->Append(draws[i]); - } else { - this->builder_->AppendNull(); - ++null_count; - } - this->builder_nn_->Append(draws[i]); - } - - ASSERT_EQ(null_count, this->builder_->null_count()); - - ASSERT_EQ(1000, this->builder_->length()); - ASSERT_EQ(1024, this->builder_->capacity()); - - ASSERT_EQ(1000, this->builder_nn_->length()); - ASSERT_EQ(1024, this->builder_nn_->capacity()); - - this->builder_->Reserve(size - 1000); - this->builder_nn_->Reserve(size - 1000); - - // Append the next 9000 - for (size_t i = 1000; i < size; ++i) { - if (valid_bytes[i] > 0) { - this->builder_->Append(draws[i]); - } else { - this->builder_->AppendNull(); - } - this->builder_nn_->Append(draws[i]); - } - - ASSERT_EQ(size, this->builder_->length()); - ASSERT_EQ(BitUtil::NextPower2(size), this->builder_->capacity()); - - ASSERT_EQ(size, this->builder_nn_->length()); - ASSERT_EQ(BitUtil::NextPower2(size), this->builder_nn_->capacity()); - - this->Check(this->builder_, true); - this->Check(this->builder_nn_, false); -} - -TYPED_TEST(TestPrimitiveBuilder, TestAppendVector) { - DECL_T(); - - int64_t size = 10000; - this->RandomData(size); - - vector& draws = this->draws_; 
- vector& valid_bytes = this->valid_bytes_; - - // first slug - int64_t K = 1000; - - ASSERT_OK(this->builder_->Append(draws.data(), K, valid_bytes.data())); - ASSERT_OK(this->builder_nn_->Append(draws.data(), K)); - - ASSERT_EQ(1000, this->builder_->length()); - ASSERT_EQ(1024, this->builder_->capacity()); - - ASSERT_EQ(1000, this->builder_nn_->length()); - ASSERT_EQ(1024, this->builder_nn_->capacity()); - - // Append the next 9000 - ASSERT_OK(this->builder_->Append(draws.data() + K, size - K, valid_bytes.data() + K)); - ASSERT_OK(this->builder_nn_->Append(draws.data() + K, size - K)); - - ASSERT_EQ(size, this->builder_->length()); - ASSERT_EQ(BitUtil::NextPower2(size), this->builder_->capacity()); - - this->Check(this->builder_, true); - this->Check(this->builder_nn_, false); -} - -TYPED_TEST(TestPrimitiveBuilder, TestAdvance) { - int64_t n = 1000; - ASSERT_OK(this->builder_->Reserve(n)); - - ASSERT_OK(this->builder_->Advance(100)); - ASSERT_EQ(100, this->builder_->length()); - - ASSERT_OK(this->builder_->Advance(900)); - - int64_t too_many = this->builder_->capacity() - 1000 + 1; - ASSERT_RAISES(Invalid, this->builder_->Advance(too_many)); -} - -TYPED_TEST(TestPrimitiveBuilder, TestResize) { - DECL_TYPE(); - - int64_t cap = kMinBuilderCapacity * 2; - - ASSERT_OK(this->builder_->Reserve(cap)); - ASSERT_EQ(cap, this->builder_->capacity()); - - ASSERT_EQ(TypeTraits::bytes_required(cap), this->builder_->data()->size()); - ASSERT_EQ(BitUtil::BytesForBits(cap), this->builder_->null_bitmap()->size()); -} - -TYPED_TEST(TestPrimitiveBuilder, TestReserve) { - ASSERT_OK(this->builder_->Reserve(10)); - ASSERT_EQ(0, this->builder_->length()); - ASSERT_EQ(kMinBuilderCapacity, this->builder_->capacity()); - - ASSERT_OK(this->builder_->Reserve(90)); - ASSERT_OK(this->builder_->Advance(100)); - ASSERT_OK(this->builder_->Reserve(kMinBuilderCapacity)); - - ASSERT_EQ(BitUtil::NextPower2(kMinBuilderCapacity + 100), this->builder_->capacity()); -} - -template -void CheckSliceApproxEquals() { - using T = typename TYPE::c_type; - - const int64_t kSize = 50; - std::vector draws1; - std::vector draws2; - - const uint32_t kSeed = 0; - test::random_real(kSize, kSeed, 0, 100, &draws1); - test::random_real(kSize, kSeed + 1, 0, 100, &draws2); - - // Make the draws equal in the sliced segment, but unequal elsewhere (to - // catch not using the slice offset) - for (int64_t i = 10; i < 30; ++i) { - draws2[i] = draws1[i]; - } - - std::vector is_valid; - test::random_is_valid(kSize, 0.1, &is_valid); - - std::shared_ptr array1, array2; - ArrayFromVector(is_valid, draws1, &array1); - ArrayFromVector(is_valid, draws2, &array2); - - std::shared_ptr slice1 = array1->Slice(10, 20); - std::shared_ptr slice2 = array2->Slice(10, 20); - - ASSERT_TRUE(slice1->ApproxEquals(slice2)); -} - -TEST(TestPrimitiveAdHoc, FloatingSliceApproxEquals) { - CheckSliceApproxEquals(); - CheckSliceApproxEquals(); -} - -} // namespace arrow diff --git a/cpp/src/arrow/array-string-test.cc b/cpp/src/arrow/array-string-test.cc deleted file mode 100644 index 6c2c1516c8f3c..0000000000000 --- a/cpp/src/arrow/array-string-test.cc +++ /dev/null @@ -1,654 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. 
You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#include -#include -#include -#include -#include - -#include "gtest/gtest.h" - -#include "arrow/array.h" -#include "arrow/builder.h" -#include "arrow/test-common.h" -#include "arrow/test-util.h" -#include "arrow/type.h" -#include "arrow/type_traits.h" - -namespace arrow { - -class Buffer; - -// ---------------------------------------------------------------------- -// String container - -class TestStringArray : public ::testing::Test { - public: - void SetUp() { - chars_ = {'a', 'b', 'b', 'c', 'c', 'c'}; - offsets_ = {0, 1, 1, 1, 3, 6}; - valid_bytes_ = {1, 1, 0, 1, 1}; - expected_ = {"a", "", "", "bb", "ccc"}; - - MakeArray(); - } - - void MakeArray() { - length_ = static_cast(offsets_.size()) - 1; - value_buf_ = test::GetBufferFromVector(chars_); - offsets_buf_ = test::GetBufferFromVector(offsets_); - null_bitmap_ = test::bytes_to_null_buffer(valid_bytes_); - null_count_ = test::null_count(valid_bytes_); - - strings_ = std::make_shared( - length_, offsets_buf_, value_buf_, null_bitmap_, null_count_); - } - - protected: - std::vector offsets_; - std::vector chars_; - std::vector valid_bytes_; - - std::vector expected_; - - std::shared_ptr value_buf_; - std::shared_ptr offsets_buf_; - std::shared_ptr null_bitmap_; - - int64_t null_count_; - int64_t length_; - - std::shared_ptr strings_; -}; - -TEST_F(TestStringArray, TestArrayBasics) { - ASSERT_EQ(length_, strings_->length()); - ASSERT_EQ(1, strings_->null_count()); - ASSERT_OK(strings_->Validate()); -} - -TEST_F(TestStringArray, TestType) { - TypePtr type = strings_->type(); - - ASSERT_EQ(Type::STRING, type->type); - ASSERT_EQ(Type::STRING, strings_->type_enum()); -} - -TEST_F(TestStringArray, TestListFunctions) { - int pos = 0; - for (size_t i = 0; i < expected_.size(); ++i) { - ASSERT_EQ(pos, strings_->value_offset(i)); - ASSERT_EQ(static_cast(expected_[i].size()), strings_->value_length(i)); - pos += static_cast(expected_[i].size()); - } -} - -TEST_F(TestStringArray, TestDestructor) { - auto arr = std::make_shared( - length_, offsets_buf_, value_buf_, null_bitmap_, null_count_); -} - -TEST_F(TestStringArray, TestGetString) { - for (size_t i = 0; i < expected_.size(); ++i) { - if (valid_bytes_[i] == 0) { - ASSERT_TRUE(strings_->IsNull(i)); - } else { - ASSERT_EQ(expected_[i], strings_->GetString(i)); - } - } -} - -TEST_F(TestStringArray, TestEmptyStringComparison) { - offsets_ = {0, 0, 0, 0, 0, 0}; - offsets_buf_ = test::GetBufferFromVector(offsets_); - length_ = static_cast(offsets_.size() - 1); - - auto strings_a = std::make_shared( - length_, offsets_buf_, nullptr, null_bitmap_, null_count_); - auto strings_b = std::make_shared( - length_, offsets_buf_, nullptr, null_bitmap_, null_count_); - ASSERT_TRUE(strings_a->Equals(strings_b)); -} - -TEST_F(TestStringArray, CompareNullByteSlots) { - StringBuilder builder(default_memory_pool()); - StringBuilder builder2(default_memory_pool()); - StringBuilder builder3(default_memory_pool()); - - builder.Append("foo"); - builder2.Append("foo"); - builder3.Append("foo"); - - builder.Append("bar"); - builder2.AppendNull(); - - // same length, but different - 
builder3.Append("xyz"); - - builder.Append("baz"); - builder2.Append("baz"); - builder3.Append("baz"); - - std::shared_ptr array, array2, array3; - ASSERT_OK(builder.Finish(&array)); - ASSERT_OK(builder2.Finish(&array2)); - ASSERT_OK(builder3.Finish(&array3)); - - const auto& a1 = static_cast(*array); - const auto& a2 = static_cast(*array2); - const auto& a3 = static_cast(*array3); - - // The validity bitmaps are the same, the data is different, but the unequal - // portion is masked out - StringArray equal_array(3, a1.value_offsets(), a1.data(), a2.null_bitmap(), 1); - StringArray equal_array2(3, a3.value_offsets(), a3.data(), a2.null_bitmap(), 1); - - ASSERT_TRUE(equal_array.Equals(equal_array2)); - ASSERT_TRUE(a2.RangeEquals(equal_array2, 0, 3, 0)); - - ASSERT_TRUE(equal_array.Array::Slice(1)->Equals(equal_array2.Array::Slice(1))); - ASSERT_TRUE( - equal_array.Array::Slice(1)->RangeEquals(0, 2, 0, equal_array2.Array::Slice(1))); -} - -TEST_F(TestStringArray, TestSliceGetString) { - StringBuilder builder(default_memory_pool()); - - builder.Append("a"); - builder.Append("b"); - builder.Append("c"); - - std::shared_ptr array; - ASSERT_OK(builder.Finish(&array)); - auto s = array->Slice(1, 10); - auto arr = std::dynamic_pointer_cast(s); - ASSERT_EQ(arr->GetString(0), "b"); -} - -// ---------------------------------------------------------------------- -// String builder tests - -class TestStringBuilder : public TestBuilder { - public: - void SetUp() { - TestBuilder::SetUp(); - builder_.reset(new StringBuilder(pool_)); - } - - void Done() { - std::shared_ptr out; - EXPECT_OK(builder_->Finish(&out)); - - result_ = std::dynamic_pointer_cast(out); - result_->Validate(); - } - - protected: - std::unique_ptr builder_; - std::shared_ptr result_; -}; - -TEST_F(TestStringBuilder, TestScalarAppend) { - std::vector strings = {"", "bb", "a", "", "ccc"}; - std::vector is_null = {0, 0, 0, 1, 0}; - - int N = static_cast(strings.size()); - int reps = 1000; - - for (int j = 0; j < reps; ++j) { - for (int i = 0; i < N; ++i) { - if (is_null[i]) { - builder_->AppendNull(); - } else { - builder_->Append(strings[i]); - } - } - } - Done(); - - ASSERT_EQ(reps * N, result_->length()); - ASSERT_EQ(reps, result_->null_count()); - ASSERT_EQ(reps * 6, result_->data()->size()); - - int32_t length; - int32_t pos = 0; - for (int i = 0; i < N * reps; ++i) { - if (is_null[i % N]) { - ASSERT_TRUE(result_->IsNull(i)); - } else { - ASSERT_FALSE(result_->IsNull(i)); - result_->GetValue(i, &length); - ASSERT_EQ(pos, result_->value_offset(i)); - ASSERT_EQ(static_cast(strings[i % N].size()), length); - ASSERT_EQ(strings[i % N], result_->GetString(i)); - - pos += length; - } - } -} - -TEST_F(TestStringBuilder, TestZeroLength) { - // All buffers are null - Done(); -} - -// Binary container type -// TODO(emkornfield) there should be some way to refactor these to avoid code duplicating -// with String -class TestBinaryArray : public ::testing::Test { - public: - void SetUp() { - chars_ = {'a', 'b', 'b', 'c', 'c', 'c'}; - offsets_ = {0, 1, 1, 1, 3, 6}; - valid_bytes_ = {1, 1, 0, 1, 1}; - expected_ = {"a", "", "", "bb", "ccc"}; - - MakeArray(); - } - - void MakeArray() { - length_ = static_cast(offsets_.size() - 1); - value_buf_ = test::GetBufferFromVector(chars_); - offsets_buf_ = test::GetBufferFromVector(offsets_); - - null_bitmap_ = test::bytes_to_null_buffer(valid_bytes_); - null_count_ = test::null_count(valid_bytes_); - - strings_ = std::make_shared( - length_, offsets_buf_, value_buf_, null_bitmap_, null_count_); - } - - 
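// Editor's sketch (not part of the patch): the fixture data above encodes the
// variable-width layout shared by StringArray and BinaryArray. Slot i owns
// value bytes [offsets_[i], offsets_[i + 1]), and valid_bytes_ marks the
// non-null slots, which is why both the null slot and the empty string have
// zero length. A small self-check over the fixture members (the helper name
// is illustrative):
void CheckExpectedOffsets() {
  for (size_t i = 0; i + 1 < offsets_.size(); ++i) {
    // Reconstruct slot i directly from the offsets and the character data
    const std::string value(chars_.begin() + offsets_[i],
                            chars_.begin() + offsets_[i + 1]);
    ASSERT_EQ(expected_[i], value);  // holds for valid, empty, and null slots
  }
}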
protected: - std::vector offsets_; - std::vector chars_; - std::vector valid_bytes_; - - std::vector expected_; - - std::shared_ptr value_buf_; - std::shared_ptr offsets_buf_; - std::shared_ptr null_bitmap_; - - int64_t null_count_; - int64_t length_; - - std::shared_ptr strings_; -}; - -TEST_F(TestBinaryArray, TestArrayBasics) { - ASSERT_EQ(length_, strings_->length()); - ASSERT_EQ(1, strings_->null_count()); - ASSERT_OK(strings_->Validate()); -} - -TEST_F(TestBinaryArray, TestType) { - TypePtr type = strings_->type(); - - ASSERT_EQ(Type::BINARY, type->type); - ASSERT_EQ(Type::BINARY, strings_->type_enum()); -} - -TEST_F(TestBinaryArray, TestListFunctions) { - size_t pos = 0; - for (size_t i = 0; i < expected_.size(); ++i) { - ASSERT_EQ(pos, strings_->value_offset(i)); - ASSERT_EQ(static_cast(expected_[i].size()), strings_->value_length(i)); - pos += expected_[i].size(); - } -} - -TEST_F(TestBinaryArray, TestDestructor) { - auto arr = std::make_shared( - length_, offsets_buf_, value_buf_, null_bitmap_, null_count_); -} - -TEST_F(TestBinaryArray, TestGetValue) { - for (size_t i = 0; i < expected_.size(); ++i) { - if (valid_bytes_[i] == 0) { - ASSERT_TRUE(strings_->IsNull(i)); - } else { - int32_t len = -1; - const uint8_t* bytes = strings_->GetValue(i, &len); - ASSERT_EQ(0, std::memcmp(expected_[i].data(), bytes, len)); - } - } -} - -TEST_F(TestBinaryArray, TestEqualsEmptyStrings) { - BinaryBuilder builder(default_memory_pool(), arrow::binary()); - - std::string empty_string(""); - - builder.Append(empty_string); - builder.Append(empty_string); - builder.Append(empty_string); - builder.Append(empty_string); - builder.Append(empty_string); - - std::shared_ptr left_arr; - ASSERT_OK(builder.Finish(&left_arr)); - - const BinaryArray& left = static_cast(*left_arr); - std::shared_ptr right = std::make_shared(left.length(), - left.value_offsets(), nullptr, left.null_bitmap(), left.null_count()); - - ASSERT_TRUE(left.Equals(right)); - ASSERT_TRUE(left.RangeEquals(0, left.length(), 0, right)); -} - -class TestBinaryBuilder : public TestBuilder { - public: - void SetUp() { - TestBuilder::SetUp(); - builder_.reset(new BinaryBuilder(pool_)); - } - - void Done() { - std::shared_ptr out; - EXPECT_OK(builder_->Finish(&out)); - - result_ = std::dynamic_pointer_cast(out); - result_->Validate(); - } - - protected: - std::unique_ptr builder_; - std::shared_ptr result_; -}; - -TEST_F(TestBinaryBuilder, TestScalarAppend) { - std::vector strings = {"", "bb", "a", "", "ccc"}; - std::vector is_null = {0, 0, 0, 1, 0}; - - int N = static_cast(strings.size()); - int reps = 1000; - - for (int j = 0; j < reps; ++j) { - for (int i = 0; i < N; ++i) { - if (is_null[i]) { - builder_->AppendNull(); - } else { - builder_->Append(strings[i]); - } - } - } - Done(); - ASSERT_OK(result_->Validate()); - ASSERT_EQ(reps * N, result_->length()); - ASSERT_EQ(reps, result_->null_count()); - ASSERT_EQ(reps * 6, result_->data()->size()); - - int32_t length; - for (int i = 0; i < N * reps; ++i) { - if (is_null[i % N]) { - ASSERT_TRUE(result_->IsNull(i)); - } else { - ASSERT_FALSE(result_->IsNull(i)); - const uint8_t* vals = result_->GetValue(i, &length); - ASSERT_EQ(static_cast(strings[i % N].size()), length); - ASSERT_EQ(0, std::memcmp(vals, strings[i % N].data(), length)); - } - } -} - -TEST_F(TestBinaryBuilder, TestZeroLength) { - // All buffers are null - Done(); -} - -// ---------------------------------------------------------------------- -// Slice tests - -template -void CheckSliceEquality() { - using Traits = TypeTraits; - 
using BuilderType = typename Traits::BuilderType; - - BuilderType builder(default_memory_pool()); - - std::vector strings = {"foo", "", "bar", "baz", "qux", ""}; - std::vector is_null = {0, 1, 0, 1, 0, 0}; - - int N = static_cast(strings.size()); - int reps = 10; - - for (int j = 0; j < reps; ++j) { - for (int i = 0; i < N; ++i) { - if (is_null[i]) { - builder.AppendNull(); - } else { - builder.Append(strings[i]); - } - } - } - - std::shared_ptr array; - ASSERT_OK(builder.Finish(&array)); - - std::shared_ptr slice, slice2; - - slice = array->Slice(5); - slice2 = array->Slice(5); - ASSERT_EQ(N * reps - 5, slice->length()); - - ASSERT_TRUE(slice->Equals(slice2)); - ASSERT_TRUE(array->RangeEquals(5, slice->length(), 0, slice)); - - // Chained slices - slice2 = array->Slice(2)->Slice(3); - ASSERT_TRUE(slice->Equals(slice2)); - - slice = array->Slice(5, 20); - slice2 = array->Slice(5, 20); - ASSERT_EQ(20, slice->length()); - - ASSERT_TRUE(slice->Equals(slice2)); - ASSERT_TRUE(array->RangeEquals(5, 25, 0, slice)); -} - -TEST_F(TestBinaryArray, TestSliceEquality) { - CheckSliceEquality(); -} - -TEST_F(TestStringArray, TestSliceEquality) { - CheckSliceEquality(); -} - -TEST_F(TestBinaryArray, LengthZeroCtor) { - BinaryArray array(0, nullptr, nullptr); -} - -// ---------------------------------------------------------------------- -// FixedWidthBinary tests - -class TestFWBinaryArray : public ::testing::Test { - public: - void SetUp() {} - - void InitBuilder(int byte_width) { - auto type = fixed_width_binary(byte_width); - builder_.reset(new FixedWidthBinaryBuilder(default_memory_pool(), type)); - } - - protected: - std::unique_ptr builder_; -}; - -TEST_F(TestFWBinaryArray, Builder) { - const int32_t byte_width = 10; - int64_t length = 4096; - - int64_t nbytes = length * byte_width; - - std::vector data(nbytes); - test::random_bytes(nbytes, 0, data.data()); - - std::vector is_valid(length); - test::random_null_bytes(length, 0.1, is_valid.data()); - - const uint8_t* raw_data = data.data(); - - std::shared_ptr result; - - auto CheckResult = [this, &length, &is_valid, &raw_data, &byte_width]( - const Array& result) { - // Verify output - const auto& fw_result = static_cast(result); - - ASSERT_EQ(length, result.length()); - - for (int64_t i = 0; i < result.length(); ++i) { - if (is_valid[i]) { - ASSERT_EQ( - 0, memcmp(raw_data + byte_width * i, fw_result.GetValue(i), byte_width)); - } else { - ASSERT_TRUE(fw_result.IsNull(i)); - } - } - }; - - // Build using iterative API - InitBuilder(byte_width); - for (int64_t i = 0; i < length; ++i) { - if (is_valid[i]) { - builder_->Append(raw_data + byte_width * i); - } else { - builder_->AppendNull(); - } - } - - ASSERT_OK(builder_->Finish(&result)); - CheckResult(*result); - - // Build using batch API - InitBuilder(byte_width); - - const uint8_t* raw_is_valid = is_valid.data(); - - ASSERT_OK(builder_->Append(raw_data, 50, raw_is_valid)); - ASSERT_OK(builder_->Append(raw_data + 50 * byte_width, length - 50, raw_is_valid + 50)); - ASSERT_OK(builder_->Finish(&result)); - CheckResult(*result); - - // Build from std::string - InitBuilder(byte_width); - for (int64_t i = 0; i < length; ++i) { - if (is_valid[i]) { - builder_->Append(std::string( - reinterpret_cast(raw_data + byte_width * i), byte_width)); - } else { - builder_->AppendNull(); - } - } - - ASSERT_OK(builder_->Finish(&result)); - CheckResult(*result); -} - -TEST_F(TestFWBinaryArray, EqualsRangeEquals) { - // Check that we don't compare data in null slots - - auto type = fixed_width_binary(4); - 
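// Editor's sketch (not part of the patch): unlike the variable-width
// string/binary arrays, fixed-width binary keeps no offsets buffer; slot i
// always occupies bytes [i * byte_width, (i + 1) * byte_width) of the data
// buffer. A minimal illustration under that assumption, reusing the 4-byte
// `type` created above (the builder/array names are illustrative):
{
  FixedWidthBinaryBuilder sketch_builder(default_memory_pool(), type);
  ASSERT_OK(sketch_builder.Append("abcd"));
  ASSERT_OK(sketch_builder.Append("efgh"));
  std::shared_ptr<Array> sketch_array;
  ASSERT_OK(sketch_builder.Finish(&sketch_array));
  const auto& fw = static_cast<const FixedWidthBinaryArray&>(*sketch_array);
  // GetValue(i) points at data() + i * byte_width; no per-slot length needed
  ASSERT_EQ(0, std::memcmp("efgh", fw.GetValue(1), 4));
}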
FixedWidthBinaryBuilder builder1(default_memory_pool(), type); - FixedWidthBinaryBuilder builder2(default_memory_pool(), type); - - ASSERT_OK(builder1.Append("foo1")); - ASSERT_OK(builder1.AppendNull()); - - ASSERT_OK(builder2.Append("foo1")); - ASSERT_OK(builder2.Append("foo2")); - - std::shared_ptr array1, array2; - ASSERT_OK(builder1.Finish(&array1)); - ASSERT_OK(builder2.Finish(&array2)); - - const auto& a1 = static_cast(*array1); - const auto& a2 = static_cast(*array2); - - FixedWidthBinaryArray equal1(type, 2, a1.data(), a1.null_bitmap(), 1); - FixedWidthBinaryArray equal2(type, 2, a2.data(), a1.null_bitmap(), 1); - - ASSERT_TRUE(equal1.Equals(equal2)); - ASSERT_TRUE(equal1.RangeEquals(equal2, 0, 2, 0)); -} - -TEST_F(TestFWBinaryArray, ZeroSize) { - auto type = fixed_width_binary(0); - FixedWidthBinaryBuilder builder(default_memory_pool(), type); - - ASSERT_OK(builder.Append(nullptr)); - ASSERT_OK(builder.Append(nullptr)); - ASSERT_OK(builder.Append(nullptr)); - ASSERT_OK(builder.AppendNull()); - ASSERT_OK(builder.AppendNull()); - ASSERT_OK(builder.AppendNull()); - - std::shared_ptr array; - ASSERT_OK(builder.Finish(&array)); - - const auto& fw_array = static_cast(*array); - - // data is never allocated - ASSERT_TRUE(fw_array.data() == nullptr); - ASSERT_EQ(0, fw_array.byte_width()); - - ASSERT_EQ(6, array->length()); - ASSERT_EQ(3, array->null_count()); -} - -TEST_F(TestFWBinaryArray, Slice) { - auto type = fixed_width_binary(4); - FixedWidthBinaryBuilder builder(default_memory_pool(), type); - - std::vector strings = {"foo1", "foo2", "foo3", "foo4", "foo5"}; - std::vector is_null = {0, 1, 0, 0, 0}; - - for (int i = 0; i < 5; ++i) { - if (is_null[i]) { - builder.AppendNull(); - } else { - builder.Append(strings[i]); - } - } - - std::shared_ptr array; - ASSERT_OK(builder.Finish(&array)); - - std::shared_ptr slice, slice2; - - slice = array->Slice(1); - slice2 = array->Slice(1); - ASSERT_EQ(4, slice->length()); - - ASSERT_TRUE(slice->Equals(slice2)); - ASSERT_TRUE(array->RangeEquals(1, slice->length(), 0, slice)); - - // Chained slices - slice = array->Slice(2); - slice2 = array->Slice(1)->Slice(1); - ASSERT_TRUE(slice->Equals(slice2)); - - slice = array->Slice(1, 3); - ASSERT_EQ(3, slice->length()); - - slice2 = array->Slice(1, 3); - ASSERT_TRUE(slice->Equals(slice2)); - ASSERT_TRUE(array->RangeEquals(1, 3, 0, slice)); -} - -} // namespace arrow diff --git a/cpp/src/arrow/array-struct-test.cc b/cpp/src/arrow/array-struct-test.cc deleted file mode 100644 index 4eb1eab13fbc6..0000000000000 --- a/cpp/src/arrow/array-struct-test.cc +++ /dev/null @@ -1,410 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. 
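// Editor's sketch (not part of the patch): the struct tests in this deleted
// file exercise StructArray, which stores only its own validity bitmap; all
// field values live in the child arrays, so the parent and its children must
// be appended in step. A compressed illustration of that builder pattern,
// assuming the includes and `namespace arrow` that follow (names are
// illustrative):
static void StructBuilderSketch() {
  auto field0 = std::make_shared<Field>("int", std::make_shared<Int32Type>());
  std::vector<std::shared_ptr<Field>> fields = {field0};
  auto type = std::make_shared<StructType>(fields);

  std::shared_ptr<ArrayBuilder> tmp;
  ASSERT_OK(MakeBuilder(default_memory_pool(), type, &tmp));
  auto sb = std::dynamic_pointer_cast<StructBuilder>(tmp);
  auto ints = static_cast<Int32Builder*>(sb->field_builder(0).get());

  ASSERT_OK(sb->Append(true));    // one valid struct slot...
  ASSERT_OK(ints->Append(42));    // ...whose value lives in the child
  ASSERT_OK(sb->AppendNull());    // a null at the struct level
  ASSERT_OK(ints->AppendNull());  // the child still advances

  std::shared_ptr<Array> out;
  ASSERT_OK(sb->Finish(&out));
  ASSERT_TRUE(out->IsNull(1));
}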
- -#include -#include -#include - -#include "gtest/gtest.h" - -#include "arrow/array.h" -#include "arrow/builder.h" -#include "arrow/status.h" -#include "arrow/test-common.h" -#include "arrow/test-util.h" -#include "arrow/type.h" - -using std::shared_ptr; -using std::string; -using std::vector; - -namespace arrow { - -TEST(TestStructType, Basics) { - TypePtr f0_type = TypePtr(new Int32Type()); - auto f0 = std::make_shared("f0", f0_type); - - TypePtr f1_type = TypePtr(new StringType()); - auto f1 = std::make_shared("f1", f1_type); - - TypePtr f2_type = TypePtr(new UInt8Type()); - auto f2 = std::make_shared("f2", f2_type); - - vector> fields = {f0, f1, f2}; - - StructType struct_type(fields); - - ASSERT_TRUE(struct_type.child(0)->Equals(f0)); - ASSERT_TRUE(struct_type.child(1)->Equals(f1)); - ASSERT_TRUE(struct_type.child(2)->Equals(f2)); - - ASSERT_EQ(struct_type.ToString(), "struct"); - - // TODO(wesm): out of bounds for field(...) -} - -void ValidateBasicStructArray(const StructArray* result, - const vector& struct_is_valid, const vector& list_values, - const vector& list_is_valid, const vector& list_lengths, - const vector& list_offsets, const vector& int_values) { - ASSERT_EQ(4, result->length()); - ASSERT_OK(result->Validate()); - - auto list_char_arr = static_cast(result->field(0).get()); - auto char_arr = static_cast(list_char_arr->values().get()); - auto int32_arr = static_cast(result->field(1).get()); - - ASSERT_EQ(0, result->null_count()); - ASSERT_EQ(1, list_char_arr->null_count()); - ASSERT_EQ(0, int32_arr->null_count()); - - // List - ASSERT_EQ(4, list_char_arr->length()); - ASSERT_EQ(10, list_char_arr->values()->length()); - for (size_t i = 0; i < list_offsets.size(); ++i) { - ASSERT_EQ(list_offsets[i], list_char_arr->raw_value_offsets()[i]); - } - for (size_t i = 0; i < list_values.size(); ++i) { - ASSERT_EQ(list_values[i], char_arr->Value(i)); - } - - // Int32 - ASSERT_EQ(4, int32_arr->length()); - for (size_t i = 0; i < int_values.size(); ++i) { - ASSERT_EQ(int_values[i], int32_arr->Value(i)); - } -} - -// ---------------------------------------------------------------------------------- -// Struct test -class TestStructBuilder : public TestBuilder { - public: - void SetUp() { - TestBuilder::SetUp(); - - auto int32_type = TypePtr(new Int32Type()); - auto char_type = TypePtr(new Int8Type()); - auto list_type = TypePtr(new ListType(char_type)); - - std::vector types = {list_type, int32_type}; - std::vector fields; - fields.push_back(FieldPtr(new Field("list", list_type))); - fields.push_back(FieldPtr(new Field("int", int32_type))); - - type_ = TypePtr(new StructType(fields)); - value_fields_ = fields; - - std::shared_ptr tmp; - ASSERT_OK(MakeBuilder(pool_, type_, &tmp)); - - builder_ = std::dynamic_pointer_cast(tmp); - ASSERT_EQ(2, static_cast(builder_->field_builders().size())); - } - - void Done() { - std::shared_ptr out; - ASSERT_OK(builder_->Finish(&out)); - result_ = std::dynamic_pointer_cast(out); - } - - protected: - std::vector value_fields_; - TypePtr type_; - - std::shared_ptr builder_; - std::shared_ptr result_; -}; - -TEST_F(TestStructBuilder, TestAppendNull) { - ASSERT_OK(builder_->AppendNull()); - ASSERT_OK(builder_->AppendNull()); - ASSERT_EQ(2, static_cast(builder_->field_builders().size())); - - ListBuilder* list_vb = static_cast(builder_->field_builder(0).get()); - ASSERT_OK(list_vb->AppendNull()); - ASSERT_OK(list_vb->AppendNull()); - ASSERT_EQ(2, list_vb->length()); - - Int32Builder* int_vb = static_cast(builder_->field_builder(1).get()); - 
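// Editor's sketch (not part of the patch): throughout these tests,
// test::bytes_to_null_buffer turns one validity byte per slot into the packed
// one-bit-per-slot bitmap that arrays actually store, least-significant bit
// first. A minimal illustration of that correspondence (variable names are
// illustrative):
{
  std::vector<uint8_t> validity_bytes = {1, 0, 1, 1};
  std::shared_ptr<Buffer> bitmap = test::bytes_to_null_buffer(validity_bytes);
  for (size_t i = 0; i < validity_bytes.size(); ++i) {
    ASSERT_EQ(validity_bytes[i] != 0, BitUtil::GetBit(bitmap->data(), i));
  }
}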
ASSERT_OK(int_vb->AppendNull()); - ASSERT_OK(int_vb->AppendNull()); - ASSERT_EQ(2, int_vb->length()); - - Done(); - - ASSERT_OK(result_->Validate()); - - ASSERT_EQ(2, static_cast(result_->fields().size())); - ASSERT_EQ(2, result_->length()); - ASSERT_EQ(2, result_->field(0)->length()); - ASSERT_EQ(2, result_->field(1)->length()); - ASSERT_TRUE(result_->IsNull(0)); - ASSERT_TRUE(result_->IsNull(1)); - ASSERT_TRUE(result_->field(0)->IsNull(0)); - ASSERT_TRUE(result_->field(0)->IsNull(1)); - ASSERT_TRUE(result_->field(1)->IsNull(0)); - ASSERT_TRUE(result_->field(1)->IsNull(1)); - - ASSERT_EQ(Type::LIST, result_->field(0)->type_enum()); - ASSERT_EQ(Type::INT32, result_->field(1)->type_enum()); -} - -TEST_F(TestStructBuilder, TestBasics) { - vector int_values = {1, 2, 3, 4}; - vector list_values = {'j', 'o', 'e', 'b', 'o', 'b', 'm', 'a', 'r', 'k'}; - vector list_lengths = {3, 0, 3, 4}; - vector list_offsets = {0, 3, 3, 6, 10}; - vector list_is_valid = {1, 0, 1, 1}; - vector struct_is_valid = {1, 1, 1, 1}; - - ListBuilder* list_vb = static_cast(builder_->field_builder(0).get()); - Int8Builder* char_vb = static_cast(list_vb->value_builder().get()); - Int32Builder* int_vb = static_cast(builder_->field_builder(1).get()); - ASSERT_EQ(2, static_cast(builder_->field_builders().size())); - - EXPECT_OK(builder_->Resize(list_lengths.size())); - EXPECT_OK(char_vb->Resize(list_values.size())); - EXPECT_OK(int_vb->Resize(int_values.size())); - - int pos = 0; - for (size_t i = 0; i < list_lengths.size(); ++i) { - ASSERT_OK(list_vb->Append(list_is_valid[i] > 0)); - int_vb->UnsafeAppend(int_values[i]); - for (int j = 0; j < list_lengths[i]; ++j) { - char_vb->UnsafeAppend(list_values[pos++]); - } - } - - for (size_t i = 0; i < struct_is_valid.size(); ++i) { - ASSERT_OK(builder_->Append(struct_is_valid[i] > 0)); - } - - Done(); - - ValidateBasicStructArray(result_.get(), struct_is_valid, list_values, list_is_valid, - list_lengths, list_offsets, int_values); -} - -TEST_F(TestStructBuilder, BulkAppend) { - vector int_values = {1, 2, 3, 4}; - vector list_values = {'j', 'o', 'e', 'b', 'o', 'b', 'm', 'a', 'r', 'k'}; - vector list_lengths = {3, 0, 3, 4}; - vector list_offsets = {0, 3, 3, 6}; - vector list_is_valid = {1, 0, 1, 1}; - vector struct_is_valid = {1, 1, 1, 1}; - - ListBuilder* list_vb = static_cast(builder_->field_builder(0).get()); - Int8Builder* char_vb = static_cast(list_vb->value_builder().get()); - Int32Builder* int_vb = static_cast(builder_->field_builder(1).get()); - - ASSERT_OK(builder_->Resize(list_lengths.size())); - ASSERT_OK(char_vb->Resize(list_values.size())); - ASSERT_OK(int_vb->Resize(int_values.size())); - - builder_->Append(struct_is_valid.size(), struct_is_valid.data()); - - list_vb->Append(list_offsets.data(), list_offsets.size(), list_is_valid.data()); - for (int8_t value : list_values) { - char_vb->UnsafeAppend(value); - } - for (int32_t value : int_values) { - int_vb->UnsafeAppend(value); - } - - Done(); - ValidateBasicStructArray(result_.get(), struct_is_valid, list_values, list_is_valid, - list_lengths, list_offsets, int_values); -} - -TEST_F(TestStructBuilder, BulkAppendInvalid) { - vector int_values = {1, 2, 3, 4}; - vector list_values = {'j', 'o', 'e', 'b', 'o', 'b', 'm', 'a', 'r', 'k'}; - vector list_lengths = {3, 0, 3, 4}; - vector list_offsets = {0, 3, 3, 6}; - vector list_is_valid = {1, 0, 1, 1}; - vector struct_is_valid = {1, 0, 1, 1}; // should be 1, 1, 1, 1 - - ListBuilder* list_vb = static_cast(builder_->field_builder(0).get()); - Int8Builder* char_vb = 
static_cast(list_vb->value_builder().get()); - Int32Builder* int_vb = static_cast(builder_->field_builder(1).get()); - - ASSERT_OK(builder_->Reserve(list_lengths.size())); - ASSERT_OK(char_vb->Reserve(list_values.size())); - ASSERT_OK(int_vb->Reserve(int_values.size())); - - builder_->Append(struct_is_valid.size(), struct_is_valid.data()); - - list_vb->Append(list_offsets.data(), list_offsets.size(), list_is_valid.data()); - for (int8_t value : list_values) { - char_vb->UnsafeAppend(value); - } - for (int32_t value : int_values) { - int_vb->UnsafeAppend(value); - } - - Done(); - // Even null bitmap of the parent Struct is not valid, Validate() will ignore it. - ASSERT_OK(result_->Validate()); -} - -TEST_F(TestStructBuilder, TestEquality) { - std::shared_ptr array, equal_array; - std::shared_ptr unequal_bitmap_array, unequal_offsets_array, - unequal_values_array; - - vector int_values = {1, 2, 3, 4}; - vector list_values = {'j', 'o', 'e', 'b', 'o', 'b', 'm', 'a', 'r', 'k'}; - vector list_lengths = {3, 0, 3, 4}; - vector list_offsets = {0, 3, 3, 6}; - vector list_is_valid = {1, 0, 1, 1}; - vector struct_is_valid = {1, 1, 1, 1}; - - vector unequal_int_values = {4, 2, 3, 1}; - vector unequal_list_values = {'j', 'o', 'e', 'b', 'o', 'b', 'l', 'u', 'c', 'y'}; - vector unequal_list_offsets = {0, 3, 4, 6}; - vector unequal_list_is_valid = {1, 1, 1, 1}; - vector unequal_struct_is_valid = {1, 0, 0, 1}; - - ListBuilder* list_vb = static_cast(builder_->field_builder(0).get()); - Int8Builder* char_vb = static_cast(list_vb->value_builder().get()); - Int32Builder* int_vb = static_cast(builder_->field_builder(1).get()); - ASSERT_OK(builder_->Reserve(list_lengths.size())); - ASSERT_OK(char_vb->Reserve(list_values.size())); - ASSERT_OK(int_vb->Reserve(int_values.size())); - - // setup two equal arrays, one of which takes an unequal bitmap - builder_->Append(struct_is_valid.size(), struct_is_valid.data()); - list_vb->Append(list_offsets.data(), list_offsets.size(), list_is_valid.data()); - for (int8_t value : list_values) { - char_vb->UnsafeAppend(value); - } - for (int32_t value : int_values) { - int_vb->UnsafeAppend(value); - } - - ASSERT_OK(builder_->Finish(&array)); - - ASSERT_OK(builder_->Resize(list_lengths.size())); - ASSERT_OK(char_vb->Resize(list_values.size())); - ASSERT_OK(int_vb->Resize(int_values.size())); - - builder_->Append(struct_is_valid.size(), struct_is_valid.data()); - list_vb->Append(list_offsets.data(), list_offsets.size(), list_is_valid.data()); - for (int8_t value : list_values) { - char_vb->UnsafeAppend(value); - } - for (int32_t value : int_values) { - int_vb->UnsafeAppend(value); - } - - ASSERT_OK(builder_->Finish(&equal_array)); - - ASSERT_OK(builder_->Resize(list_lengths.size())); - ASSERT_OK(char_vb->Resize(list_values.size())); - ASSERT_OK(int_vb->Resize(int_values.size())); - - // setup an unequal one with the unequal bitmap - builder_->Append(unequal_struct_is_valid.size(), unequal_struct_is_valid.data()); - list_vb->Append(list_offsets.data(), list_offsets.size(), list_is_valid.data()); - for (int8_t value : list_values) { - char_vb->UnsafeAppend(value); - } - for (int32_t value : int_values) { - int_vb->UnsafeAppend(value); - } - - ASSERT_OK(builder_->Finish(&unequal_bitmap_array)); - - ASSERT_OK(builder_->Resize(list_lengths.size())); - ASSERT_OK(char_vb->Resize(list_values.size())); - ASSERT_OK(int_vb->Resize(int_values.size())); - - // setup an unequal one with unequal offsets - builder_->Append(struct_is_valid.size(), struct_is_valid.data()); - 
list_vb->Append(unequal_list_offsets.data(), unequal_list_offsets.size(), - unequal_list_is_valid.data()); - for (int8_t value : list_values) { - char_vb->UnsafeAppend(value); - } - for (int32_t value : int_values) { - int_vb->UnsafeAppend(value); - } - - ASSERT_OK(builder_->Finish(&unequal_offsets_array)); - - ASSERT_OK(builder_->Resize(list_lengths.size())); - ASSERT_OK(char_vb->Resize(list_values.size())); - ASSERT_OK(int_vb->Resize(int_values.size())); - - // setup anunequal one with unequal values - builder_->Append(struct_is_valid.size(), struct_is_valid.data()); - list_vb->Append(list_offsets.data(), list_offsets.size(), list_is_valid.data()); - for (int8_t value : unequal_list_values) { - char_vb->UnsafeAppend(value); - } - for (int32_t value : unequal_int_values) { - int_vb->UnsafeAppend(value); - } - - ASSERT_OK(builder_->Finish(&unequal_values_array)); - - // Test array equality - EXPECT_TRUE(array->Equals(array)); - EXPECT_TRUE(array->Equals(equal_array)); - EXPECT_TRUE(equal_array->Equals(array)); - EXPECT_FALSE(equal_array->Equals(unequal_bitmap_array)); - EXPECT_FALSE(unequal_bitmap_array->Equals(equal_array)); - EXPECT_FALSE(unequal_bitmap_array->Equals(unequal_values_array)); - EXPECT_FALSE(unequal_values_array->Equals(unequal_bitmap_array)); - EXPECT_FALSE(unequal_bitmap_array->Equals(unequal_offsets_array)); - EXPECT_FALSE(unequal_offsets_array->Equals(unequal_bitmap_array)); - - // Test range equality - EXPECT_TRUE(array->RangeEquals(0, 4, 0, equal_array)); - EXPECT_TRUE(array->RangeEquals(3, 4, 3, unequal_bitmap_array)); - EXPECT_TRUE(array->RangeEquals(0, 1, 0, unequal_offsets_array)); - EXPECT_FALSE(array->RangeEquals(0, 2, 0, unequal_offsets_array)); - EXPECT_FALSE(array->RangeEquals(1, 2, 1, unequal_offsets_array)); - EXPECT_FALSE(array->RangeEquals(0, 1, 0, unequal_values_array)); - EXPECT_TRUE(array->RangeEquals(1, 3, 1, unequal_values_array)); - EXPECT_FALSE(array->RangeEquals(3, 4, 3, unequal_values_array)); - - // ARROW-33 Slice / equality - std::shared_ptr slice, slice2; - - slice = array->Slice(2); - slice2 = array->Slice(2); - ASSERT_EQ(array->length() - 2, slice->length()); - - ASSERT_TRUE(slice->Equals(slice2)); - ASSERT_TRUE(array->RangeEquals(2, slice->length(), 0, slice)); - - slice = array->Slice(1, 2); - slice2 = array->Slice(1, 2); - ASSERT_EQ(2, slice->length()); - - ASSERT_TRUE(slice->Equals(slice2)); - ASSERT_TRUE(array->RangeEquals(1, 3, 0, slice)); -} - -TEST_F(TestStructBuilder, TestZeroLength) { - // All buffers are null - Done(); - ASSERT_OK(result_->Validate()); -} - -} // namespace arrow diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc index 854ebb20f53ed..52f3727d46a15 100644 --- a/cpp/src/arrow/array-test.cc +++ b/cpp/src/arrow/array-test.cc @@ -25,12 +25,20 @@ #include "arrow/array.h" #include "arrow/buffer.h" +#include "arrow/builder.h" +#include "arrow/ipc/test-common.h" #include "arrow/memory_pool.h" +#include "arrow/status.h" +#include "arrow/test-common.h" #include "arrow/test-util.h" #include "arrow/type.h" +#include "arrow/type_traits.h" namespace arrow { +using std::string; +using std::vector; + class TestArray : public ::testing::Test { public: void SetUp() { pool_ = default_memory_pool(); } @@ -57,7 +65,7 @@ TEST_F(TestArray, TestLength) { } std::shared_ptr MakeArrayFromValidBytes( - const std::vector& v, MemoryPool* pool) { + const vector& v, MemoryPool* pool) { int64_t null_count = v.size() - std::accumulate(v.begin(), v.end(), 0); std::shared_ptr null_buf = test::bytes_to_null_buffer(v); @@ -88,7 +96,7 
@@ TEST_F(TestArray, TestEquality) { } TEST_F(TestArray, SliceRecomputeNullCount) { - std::vector valid_bytes = {1, 0, 1, 1, 0, 1, 0, 0, 0}; + vector valid_bytes = {1, 0, 1, 1, 0, 1, 0, 0, 0}; auto array = MakeArrayFromValidBytes(valid_bytes, pool_); @@ -115,7 +123,7 @@ TEST_F(TestArray, SliceRecomputeNullCount) { TEST_F(TestArray, TestIsNull) { // clang-format off - std::vector null_bitmap = {1, 0, 1, 1, 0, 1, 0, 0, + vector null_bitmap = {1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, @@ -155,4 +163,1802 @@ TEST_F(TestArray, BuildLargeInMemoryArray) { TEST_F(TestArray, TestCopy) {} +// ---------------------------------------------------------------------- +// Primitive type tests + +TEST_F(TestBuilder, TestReserve) { + builder_->Init(10); + ASSERT_EQ(2, builder_->null_bitmap()->size()); + + builder_->Reserve(30); + ASSERT_EQ(4, builder_->null_bitmap()->size()); +} + +template +class TestPrimitiveBuilder : public TestBuilder { + public: + typedef typename Attrs::ArrayType ArrayType; + typedef typename Attrs::BuilderType BuilderType; + typedef typename Attrs::T T; + typedef typename Attrs::Type Type; + + virtual void SetUp() { + TestBuilder::SetUp(); + + type_ = Attrs::type(); + + std::shared_ptr tmp; + ASSERT_OK(MakeBuilder(pool_, type_, &tmp)); + builder_ = std::dynamic_pointer_cast(tmp); + + ASSERT_OK(MakeBuilder(pool_, type_, &tmp)); + builder_nn_ = std::dynamic_pointer_cast(tmp); + } + + void RandomData(int64_t N, double pct_null = 0.1) { + Attrs::draw(N, &draws_); + + valid_bytes_.resize(static_cast(N)); + test::random_null_bytes(N, pct_null, valid_bytes_.data()); + } + + void Check(const std::shared_ptr& builder, bool nullable) { + int64_t size = builder->length(); + + auto ex_data = std::make_shared( + reinterpret_cast(draws_.data()), size * sizeof(T)); + + std::shared_ptr ex_null_bitmap; + int64_t ex_null_count = 0; + + if (nullable) { + ex_null_bitmap = test::bytes_to_null_buffer(valid_bytes_); + ex_null_count = test::null_count(valid_bytes_); + } else { + ex_null_bitmap = nullptr; + } + + auto expected = + std::make_shared(size, ex_data, ex_null_bitmap, ex_null_count); + + std::shared_ptr out; + ASSERT_OK(builder->Finish(&out)); + + std::shared_ptr result = std::dynamic_pointer_cast(out); + + // Builder is now reset + ASSERT_EQ(0, builder->length()); + ASSERT_EQ(0, builder->capacity()); + ASSERT_EQ(0, builder->null_count()); + ASSERT_EQ(nullptr, builder->data()); + + ASSERT_EQ(ex_null_count, result->null_count()); + ASSERT_TRUE(result->Equals(*expected)); + } + + protected: + std::shared_ptr type_; + std::shared_ptr builder_; + std::shared_ptr builder_nn_; + + vector draws_; + vector valid_bytes_; +}; + +#define PTYPE_DECL(CapType, c_type) \ + typedef CapType##Array ArrayType; \ + typedef CapType##Builder BuilderType; \ + typedef CapType##Type Type; \ + typedef c_type T; \ + \ + static std::shared_ptr type() { \ + return std::shared_ptr(new Type()); \ + } + +#define PINT_DECL(CapType, c_type, LOWER, UPPER) \ + struct P##CapType { \ + PTYPE_DECL(CapType, c_type); \ + static void draw(int64_t N, vector* draws) { \ + test::randint(N, LOWER, UPPER, draws); \ + } \ + } + +#define PFLOAT_DECL(CapType, c_type, LOWER, UPPER) \ + struct P##CapType { \ + PTYPE_DECL(CapType, c_type); \ + static void draw(int64_t N, vector* draws) { \ + test::random_real(N, 0, LOWER, UPPER, draws); \ + } \ + } + +PINT_DECL(UInt8, uint8_t, 0, UINT8_MAX); +PINT_DECL(UInt16, uint16_t, 0, UINT16_MAX); +PINT_DECL(UInt32, uint32_t, 0, UINT32_MAX); 
+PINT_DECL(UInt64, uint64_t, 0, UINT64_MAX);
+
+PINT_DECL(Int8, int8_t, INT8_MIN, INT8_MAX);
+PINT_DECL(Int16, int16_t, INT16_MIN, INT16_MAX);
+PINT_DECL(Int32, int32_t, INT32_MIN, INT32_MAX);
+PINT_DECL(Int64, int64_t, INT64_MIN, INT64_MAX);
+
+PFLOAT_DECL(Float, float, -1000, 1000);
+PFLOAT_DECL(Double, double, -1000, 1000);
+
+struct PBoolean {
+  PTYPE_DECL(Boolean, uint8_t);
+};
+
+template <>
+void TestPrimitiveBuilder<PBoolean>::RandomData(int64_t N, double pct_null) {
+  draws_.resize(static_cast<size_t>(N));
+  valid_bytes_.resize(static_cast<size_t>(N));
+
+  test::random_null_bytes(N, 0.5, draws_.data());
+  test::random_null_bytes(N, pct_null, valid_bytes_.data());
+}
+
+template <>
+void TestPrimitiveBuilder<PBoolean>::Check(
+    const std::shared_ptr<BooleanBuilder>& builder, bool nullable) {
+  int64_t size = builder->length();
+
+  auto ex_data = test::bytes_to_null_buffer(draws_);
+
+  std::shared_ptr<Buffer> ex_null_bitmap;
+  int64_t ex_null_count = 0;
+
+  if (nullable) {
+    ex_null_bitmap = test::bytes_to_null_buffer(valid_bytes_);
+    ex_null_count = test::null_count(valid_bytes_);
+  } else {
+    ex_null_bitmap = nullptr;
+  }
+
+  auto expected =
+      std::make_shared<BooleanArray>(size, ex_data, ex_null_bitmap, ex_null_count);
+
+  std::shared_ptr<Array> out;
+  ASSERT_OK(builder->Finish(&out));
+  std::shared_ptr<BooleanArray> result = std::dynamic_pointer_cast<BooleanArray>(out);
+
+  // Builder is now reset
+  ASSERT_EQ(0, builder->length());
+  ASSERT_EQ(0, builder->capacity());
+  ASSERT_EQ(0, builder->null_count());
+  ASSERT_EQ(nullptr, builder->data());
+
+  ASSERT_EQ(ex_null_count, result->null_count());
+
+  ASSERT_EQ(expected->length(), result->length());
+
+  for (int64_t i = 0; i < result->length(); ++i) {
+    if (nullable) { ASSERT_EQ(valid_bytes_[i] == 0, result->IsNull(i)) << i; }
+    bool actual = BitUtil::GetBit(result->data()->data(), i);
+    ASSERT_EQ(static_cast<bool>(draws_[i]), actual) << i;
+  }
+  ASSERT_TRUE(result->Equals(*expected));
+}
+
+typedef ::testing::Types<PBoolean, PUInt8, PUInt16, PUInt32, PUInt64, PInt8, PInt16,
+    PInt32, PInt64, PFloat, PDouble>
+    Primitives;
+
+TYPED_TEST_CASE(TestPrimitiveBuilder, Primitives);
+
+#define DECL_T() typedef typename TestFixture::T T;
+
+#define DECL_TYPE() typedef typename TestFixture::Type Type;
+
+#define DECL_ARRAYTYPE() typedef typename TestFixture::ArrayType ArrayType;
+
+TYPED_TEST(TestPrimitiveBuilder, TestInit) {
+  DECL_TYPE();
+
+  int64_t n = 1000;
+  ASSERT_OK(this->builder_->Reserve(n));
+  ASSERT_EQ(BitUtil::NextPower2(n), this->builder_->capacity());
+  ASSERT_EQ(BitUtil::NextPower2(TypeTraits<Type>::bytes_required(n)),
+      this->builder_->data()->size());
+
+  // unsure if this should go in all builder classes
+  ASSERT_EQ(0, this->builder_->num_children());
+}
+
+TYPED_TEST(TestPrimitiveBuilder, TestAppendNull) {
+  int64_t size = 1000;
+  for (int64_t i = 0; i < size; ++i) {
+    ASSERT_OK(this->builder_->AppendNull());
+  }
+
+  std::shared_ptr<Array> result;
+  ASSERT_OK(this->builder_->Finish(&result));
+
+  for (int64_t i = 0; i < size; ++i) {
+    ASSERT_TRUE(result->IsNull(i)) << i;
+  }
+}
+
+TYPED_TEST(TestPrimitiveBuilder, TestArrayDtorDealloc) {
+  DECL_T();
+
+  int64_t size = 1000;
+
+  vector<T>& draws = this->draws_;
+  vector<uint8_t>& valid_bytes = this->valid_bytes_;
+
+  int64_t memory_before = this->pool_->bytes_allocated();
+
+  this->RandomData(size);
+
+  this->builder_->Reserve(size);
+
+  int64_t i;
+  for (i = 0; i < size; ++i) {
+    if (valid_bytes[i] > 0) {
+      this->builder_->Append(draws[i]);
+    } else {
+      this->builder_->AppendNull();
+    }
+  }
+
+  do {
+    std::shared_ptr<Array> result;
+    ASSERT_OK(this->builder_->Finish(&result));
+  } while (false);
+
+  ASSERT_EQ(memory_before, this->pool_->bytes_allocated());
+}
+
+TYPED_TEST(TestPrimitiveBuilder, Equality) {
+
DECL_T(); + + const int64_t size = 1000; + this->RandomData(size); + vector& draws = this->draws_; + vector& valid_bytes = this->valid_bytes_; + std::shared_ptr array, equal_array, unequal_array; + auto builder = this->builder_.get(); + ASSERT_OK(MakeArray(valid_bytes, draws, size, builder, &array)); + ASSERT_OK(MakeArray(valid_bytes, draws, size, builder, &equal_array)); + + // Make the not equal array by negating the first valid element with itself. + const auto first_valid = std::find_if( + valid_bytes.begin(), valid_bytes.end(), [](uint8_t valid) { return valid > 0; }); + const int64_t first_valid_idx = std::distance(valid_bytes.begin(), first_valid); + // This should be true with a very high probability, but might introduce flakiness + ASSERT_LT(first_valid_idx, size - 1); + draws[first_valid_idx] = + static_cast(~*reinterpret_cast(&draws[first_valid_idx])); + ASSERT_OK(MakeArray(valid_bytes, draws, size, builder, &unequal_array)); + + // test normal equality + EXPECT_TRUE(array->Equals(array)); + EXPECT_TRUE(array->Equals(equal_array)); + EXPECT_TRUE(equal_array->Equals(array)); + EXPECT_FALSE(equal_array->Equals(unequal_array)); + EXPECT_FALSE(unequal_array->Equals(equal_array)); + + // Test range equality + EXPECT_FALSE(array->RangeEquals(0, first_valid_idx + 1, 0, unequal_array)); + EXPECT_FALSE(array->RangeEquals(first_valid_idx, size, first_valid_idx, unequal_array)); + EXPECT_TRUE(array->RangeEquals(0, first_valid_idx, 0, unequal_array)); + EXPECT_TRUE( + array->RangeEquals(first_valid_idx + 1, size, first_valid_idx + 1, unequal_array)); +} + +TYPED_TEST(TestPrimitiveBuilder, SliceEquality) { + DECL_T(); + + const int64_t size = 1000; + this->RandomData(size); + vector& draws = this->draws_; + vector& valid_bytes = this->valid_bytes_; + auto builder = this->builder_.get(); + + std::shared_ptr array; + ASSERT_OK(MakeArray(valid_bytes, draws, size, builder, &array)); + + std::shared_ptr slice, slice2; + + slice = array->Slice(5); + slice2 = array->Slice(5); + ASSERT_EQ(size - 5, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(5, array->length(), 0, slice)); + + // Chained slices + slice2 = array->Slice(2)->Slice(3); + ASSERT_TRUE(slice->Equals(slice2)); + + slice = array->Slice(5, 10); + slice2 = array->Slice(5, 10); + ASSERT_EQ(10, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(5, 15, 0, slice)); +} + +TYPED_TEST(TestPrimitiveBuilder, TestAppendScalar) { + DECL_T(); + + const int64_t size = 10000; + + vector& draws = this->draws_; + vector& valid_bytes = this->valid_bytes_; + + this->RandomData(size); + + this->builder_->Reserve(1000); + this->builder_nn_->Reserve(1000); + + int64_t null_count = 0; + // Append the first 1000 + for (size_t i = 0; i < 1000; ++i) { + if (valid_bytes[i] > 0) { + this->builder_->Append(draws[i]); + } else { + this->builder_->AppendNull(); + ++null_count; + } + this->builder_nn_->Append(draws[i]); + } + + ASSERT_EQ(null_count, this->builder_->null_count()); + + ASSERT_EQ(1000, this->builder_->length()); + ASSERT_EQ(1024, this->builder_->capacity()); + + ASSERT_EQ(1000, this->builder_nn_->length()); + ASSERT_EQ(1024, this->builder_nn_->capacity()); + + this->builder_->Reserve(size - 1000); + this->builder_nn_->Reserve(size - 1000); + + // Append the next 9000 + for (size_t i = 1000; i < size; ++i) { + if (valid_bytes[i] > 0) { + this->builder_->Append(draws[i]); + } else { + this->builder_->AppendNull(); + } + this->builder_nn_->Append(draws[i]); + } + + 
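// Editor's sketch (not part of the patch): the capacity assertions below
// depend on the growth policy visible throughout these tests: Reserve rounds
// the builder's capacity up to the next power of two (1000 -> 1024, and never
// below kMinBuilderCapacity). A concrete illustration using only MakeBuilder
// and the base ArrayBuilder interface (the variable name is illustrative):
{
  std::shared_ptr<ArrayBuilder> sketch;
  ASSERT_OK(MakeBuilder(default_memory_pool(), int32(), &sketch));
  ASSERT_OK(sketch->Reserve(1000));
  ASSERT_EQ(1024, sketch->capacity());  // BitUtil::NextPower2(1000)
}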
+  ASSERT_EQ(size, this->builder_->length());
+  ASSERT_EQ(BitUtil::NextPower2(size), this->builder_->capacity());
+
+  ASSERT_EQ(size, this->builder_nn_->length());
+  ASSERT_EQ(BitUtil::NextPower2(size), this->builder_nn_->capacity());
+
+  this->Check(this->builder_, true);
+  this->Check(this->builder_nn_, false);
+}
+
+TYPED_TEST(TestPrimitiveBuilder, TestAppendVector) {
+  DECL_T();
+
+  int64_t size = 10000;
+  this->RandomData(size);
+
+  vector<T>& draws = this->draws_;
+  vector<uint8_t>& valid_bytes = this->valid_bytes_;
+
+  // Append the first chunk of 1000
+  int64_t K = 1000;
+
+  ASSERT_OK(this->builder_->Append(draws.data(), K, valid_bytes.data()));
+  ASSERT_OK(this->builder_nn_->Append(draws.data(), K));
+
+  ASSERT_EQ(1000, this->builder_->length());
+  ASSERT_EQ(1024, this->builder_->capacity());
+
+  ASSERT_EQ(1000, this->builder_nn_->length());
+  ASSERT_EQ(1024, this->builder_nn_->capacity());
+
+  // Append the next 9000
+  ASSERT_OK(this->builder_->Append(draws.data() + K, size - K, valid_bytes.data() + K));
+  ASSERT_OK(this->builder_nn_->Append(draws.data() + K, size - K));
+
+  ASSERT_EQ(size, this->builder_->length());
+  ASSERT_EQ(BitUtil::NextPower2(size), this->builder_->capacity());
+
+  this->Check(this->builder_, true);
+  this->Check(this->builder_nn_, false);
+}
+
+TYPED_TEST(TestPrimitiveBuilder, TestAdvance) {
+  int64_t n = 1000;
+  ASSERT_OK(this->builder_->Reserve(n));
+
+  ASSERT_OK(this->builder_->Advance(100));
+  ASSERT_EQ(100, this->builder_->length());
+
+  ASSERT_OK(this->builder_->Advance(900));
+
+  int64_t too_many = this->builder_->capacity() - 1000 + 1;
+  ASSERT_RAISES(Invalid, this->builder_->Advance(too_many));
+}
+
+TYPED_TEST(TestPrimitiveBuilder, TestResize) {
+  DECL_TYPE();
+
+  int64_t cap = kMinBuilderCapacity * 2;
+
+  ASSERT_OK(this->builder_->Reserve(cap));
+  ASSERT_EQ(cap, this->builder_->capacity());
+
+  ASSERT_EQ(TypeTraits<Type>::bytes_required(cap), this->builder_->data()->size());
+  ASSERT_EQ(BitUtil::BytesForBits(cap), this->builder_->null_bitmap()->size());
+}
+
+TYPED_TEST(TestPrimitiveBuilder, TestReserve) {
+  ASSERT_OK(this->builder_->Reserve(10));
+  ASSERT_EQ(0, this->builder_->length());
+  ASSERT_EQ(kMinBuilderCapacity, this->builder_->capacity());
+
+  ASSERT_OK(this->builder_->Reserve(90));
+  ASSERT_OK(this->builder_->Advance(100));
+  ASSERT_OK(this->builder_->Reserve(kMinBuilderCapacity));
+
+  ASSERT_EQ(BitUtil::NextPower2(kMinBuilderCapacity + 100), this->builder_->capacity());
+}
+
+template <typename TYPE>
+void CheckSliceApproxEquals() {
+  using T = typename TYPE::c_type;
+
+  const int64_t kSize = 50;
+  vector<T> draws1;
+  vector<T> draws2;
+
+  const uint32_t kSeed = 0;
+  test::random_real(kSize, kSeed, 0, 100, &draws1);
+  test::random_real(kSize, kSeed + 1, 0, 100, &draws2);
+
+  // Make the draws equal in the sliced segment, but unequal elsewhere (to
+  // catch not using the slice offset)
+  for (int64_t i = 10; i < 30; ++i) {
+    draws2[i] = draws1[i];
+  }
+
+  vector<bool> is_valid;
+  test::random_is_valid(kSize, 0.1, &is_valid);
+
+  std::shared_ptr<Array> array1, array2;
+  ArrayFromVector<TYPE, T>(is_valid, draws1, &array1);
+  ArrayFromVector<TYPE, T>(is_valid, draws2, &array2);
+
+  std::shared_ptr<Array> slice1 = array1->Slice(10, 20);
+  std::shared_ptr<Array> slice2 = array2->Slice(10, 20);
+
+  ASSERT_TRUE(slice1->ApproxEquals(slice2));
+}
+
+TEST(TestPrimitiveAdHoc, FloatingSliceApproxEquals) {
+  CheckSliceApproxEquals<FloatType>();
+  CheckSliceApproxEquals<DoubleType>();
+}
+
+// ----------------------------------------------------------------------
+// String / Binary tests
+
+class TestStringArray : public ::testing::Test {
+ public:
+  void SetUp() {
+    chars_ =
{'a', 'b', 'b', 'c', 'c', 'c'}; + offsets_ = {0, 1, 1, 1, 3, 6}; + valid_bytes_ = {1, 1, 0, 1, 1}; + expected_ = {"a", "", "", "bb", "ccc"}; + + MakeArray(); + } + + void MakeArray() { + length_ = static_cast(offsets_.size()) - 1; + value_buf_ = test::GetBufferFromVector(chars_); + offsets_buf_ = test::GetBufferFromVector(offsets_); + null_bitmap_ = test::bytes_to_null_buffer(valid_bytes_); + null_count_ = test::null_count(valid_bytes_); + + strings_ = std::make_shared( + length_, offsets_buf_, value_buf_, null_bitmap_, null_count_); + } + + protected: + vector offsets_; + vector chars_; + vector valid_bytes_; + + vector expected_; + + std::shared_ptr value_buf_; + std::shared_ptr offsets_buf_; + std::shared_ptr null_bitmap_; + + int64_t null_count_; + int64_t length_; + + std::shared_ptr strings_; +}; + +TEST_F(TestStringArray, TestArrayBasics) { + ASSERT_EQ(length_, strings_->length()); + ASSERT_EQ(1, strings_->null_count()); + ASSERT_OK(strings_->Validate()); +} + +TEST_F(TestStringArray, TestType) { + std::shared_ptr type = strings_->type(); + + ASSERT_EQ(Type::STRING, type->type); + ASSERT_EQ(Type::STRING, strings_->type_enum()); +} + +TEST_F(TestStringArray, TestListFunctions) { + int pos = 0; + for (size_t i = 0; i < expected_.size(); ++i) { + ASSERT_EQ(pos, strings_->value_offset(i)); + ASSERT_EQ(static_cast(expected_[i].size()), strings_->value_length(i)); + pos += static_cast(expected_[i].size()); + } +} + +TEST_F(TestStringArray, TestDestructor) { + auto arr = std::make_shared( + length_, offsets_buf_, value_buf_, null_bitmap_, null_count_); +} + +TEST_F(TestStringArray, TestGetString) { + for (size_t i = 0; i < expected_.size(); ++i) { + if (valid_bytes_[i] == 0) { + ASSERT_TRUE(strings_->IsNull(i)); + } else { + ASSERT_EQ(expected_[i], strings_->GetString(i)); + } + } +} + +TEST_F(TestStringArray, TestEmptyStringComparison) { + offsets_ = {0, 0, 0, 0, 0, 0}; + offsets_buf_ = test::GetBufferFromVector(offsets_); + length_ = static_cast(offsets_.size() - 1); + + auto strings_a = std::make_shared( + length_, offsets_buf_, nullptr, null_bitmap_, null_count_); + auto strings_b = std::make_shared( + length_, offsets_buf_, nullptr, null_bitmap_, null_count_); + ASSERT_TRUE(strings_a->Equals(strings_b)); +} + +TEST_F(TestStringArray, CompareNullByteSlots) { + StringBuilder builder(default_memory_pool()); + StringBuilder builder2(default_memory_pool()); + StringBuilder builder3(default_memory_pool()); + + builder.Append("foo"); + builder2.Append("foo"); + builder3.Append("foo"); + + builder.Append("bar"); + builder2.AppendNull(); + + // same length, but different + builder3.Append("xyz"); + + builder.Append("baz"); + builder2.Append("baz"); + builder3.Append("baz"); + + std::shared_ptr array, array2, array3; + ASSERT_OK(builder.Finish(&array)); + ASSERT_OK(builder2.Finish(&array2)); + ASSERT_OK(builder3.Finish(&array3)); + + const auto& a1 = static_cast(*array); + const auto& a2 = static_cast(*array2); + const auto& a3 = static_cast(*array3); + + // The validity bitmaps are the same, the data is different, but the unequal + // portion is masked out + StringArray equal_array(3, a1.value_offsets(), a1.data(), a2.null_bitmap(), 1); + StringArray equal_array2(3, a3.value_offsets(), a3.data(), a2.null_bitmap(), 1); + + ASSERT_TRUE(equal_array.Equals(equal_array2)); + ASSERT_TRUE(a2.RangeEquals(equal_array2, 0, 3, 0)); + + ASSERT_TRUE(equal_array.Array::Slice(1)->Equals(equal_array2.Array::Slice(1))); + ASSERT_TRUE( + equal_array.Array::Slice(1)->RangeEquals(0, 2, 0, 
equal_array2.Array::Slice(1)));
+}
+
+TEST_F(TestStringArray, TestSliceGetString) {
+  StringBuilder builder(default_memory_pool());
+
+  builder.Append("a");
+  builder.Append("b");
+  builder.Append("c");
+
+  std::shared_ptr<Array> array;
+  ASSERT_OK(builder.Finish(&array));
+  auto s = array->Slice(1, 10);
+  auto arr = std::dynamic_pointer_cast<StringArray>(s);
+  ASSERT_EQ(arr->GetString(0), "b");
+}
+
+// ----------------------------------------------------------------------
+// String builder tests
+
+class TestStringBuilder : public TestBuilder {
+ public:
+  void SetUp() {
+    TestBuilder::SetUp();
+    builder_.reset(new StringBuilder(pool_));
+  }
+
+  void Done() {
+    std::shared_ptr<Array> out;
+    EXPECT_OK(builder_->Finish(&out));
+
+    result_ = std::dynamic_pointer_cast<StringArray>(out);
+    result_->Validate();
+  }
+
+ protected:
+  std::unique_ptr<StringBuilder> builder_;
+  std::shared_ptr<StringArray> result_;
+};
+
+TEST_F(TestStringBuilder, TestScalarAppend) {
+  vector<string> strings = {"", "bb", "a", "", "ccc"};
+  vector<uint8_t> is_null = {0, 0, 0, 1, 0};
+
+  int N = static_cast<int>(strings.size());
+  int reps = 1000;
+
+  for (int j = 0; j < reps; ++j) {
+    for (int i = 0; i < N; ++i) {
+      if (is_null[i]) {
+        builder_->AppendNull();
+      } else {
+        builder_->Append(strings[i]);
+      }
+    }
+  }
+  Done();
+
+  ASSERT_EQ(reps * N, result_->length());
+  ASSERT_EQ(reps, result_->null_count());
+  ASSERT_EQ(reps * 6, result_->data()->size());
+
+  int32_t length;
+  int32_t pos = 0;
+  for (int i = 0; i < N * reps; ++i) {
+    if (is_null[i % N]) {
+      ASSERT_TRUE(result_->IsNull(i));
+    } else {
+      ASSERT_FALSE(result_->IsNull(i));
+      result_->GetValue(i, &length);
+      ASSERT_EQ(pos, result_->value_offset(i));
+      ASSERT_EQ(static_cast<int>(strings[i % N].size()), length);
+      ASSERT_EQ(strings[i % N], result_->GetString(i));
+
+      pos += length;
+    }
+  }
+}
+
+TEST_F(TestStringBuilder, TestZeroLength) {
+  // All buffers are null
+  Done();
+}
+
+// Binary container type
+// TODO(emkornfield) there should be some way to refactor these to avoid code
+// duplication with String
+class TestBinaryArray : public ::testing::Test {
+ public:
+  void SetUp() {
+    chars_ = {'a', 'b', 'b', 'c', 'c', 'c'};
+    offsets_ = {0, 1, 1, 1, 3, 6};
+    valid_bytes_ = {1, 1, 0, 1, 1};
+    expected_ = {"a", "", "", "bb", "ccc"};
+
+    MakeArray();
+  }
+
+  void MakeArray() {
+    length_ = static_cast<int64_t>(offsets_.size() - 1);
+    value_buf_ = test::GetBufferFromVector(chars_);
+    offsets_buf_ = test::GetBufferFromVector(offsets_);
+
+    null_bitmap_ = test::bytes_to_null_buffer(valid_bytes_);
+    null_count_ = test::null_count(valid_bytes_);
+
+    strings_ = std::make_shared<BinaryArray>(
+        length_, offsets_buf_, value_buf_, null_bitmap_, null_count_);
+  }
+
+ protected:
+  vector<int32_t> offsets_;
+  vector<char> chars_;
+  vector<uint8_t> valid_bytes_;
+
+  vector<string> expected_;
+
+  std::shared_ptr<Buffer> value_buf_;
+  std::shared_ptr<Buffer> offsets_buf_;
+  std::shared_ptr<Buffer> null_bitmap_;
+
+  int64_t null_count_;
+  int64_t length_;
+
+  std::shared_ptr<BinaryArray> strings_;
+};
+
+TEST_F(TestBinaryArray, TestArrayBasics) {
+  ASSERT_EQ(length_, strings_->length());
+  ASSERT_EQ(1, strings_->null_count());
+  ASSERT_OK(strings_->Validate());
+}
+
+TEST_F(TestBinaryArray, TestType) {
+  std::shared_ptr<DataType> type = strings_->type();
+
+  ASSERT_EQ(Type::BINARY, type->type);
+  ASSERT_EQ(Type::BINARY, strings_->type_enum());
+}
+
+TEST_F(TestBinaryArray, TestListFunctions) {
+  size_t pos = 0;
+  for (size_t i = 0; i < expected_.size(); ++i) {
+    ASSERT_EQ(pos, strings_->value_offset(i));
+    ASSERT_EQ(static_cast<int32_t>(expected_[i].size()), strings_->value_length(i));
+    pos += expected_[i].size();
+  }
+}
+
+TEST_F(TestBinaryArray, TestDestructor) {
+  auto
arr = std::make_shared( + length_, offsets_buf_, value_buf_, null_bitmap_, null_count_); +} + +TEST_F(TestBinaryArray, TestGetValue) { + for (size_t i = 0; i < expected_.size(); ++i) { + if (valid_bytes_[i] == 0) { + ASSERT_TRUE(strings_->IsNull(i)); + } else { + int32_t len = -1; + const uint8_t* bytes = strings_->GetValue(i, &len); + ASSERT_EQ(0, std::memcmp(expected_[i].data(), bytes, len)); + } + } +} + +TEST_F(TestBinaryArray, TestEqualsEmptyStrings) { + BinaryBuilder builder(default_memory_pool(), arrow::binary()); + + string empty_string(""); + + builder.Append(empty_string); + builder.Append(empty_string); + builder.Append(empty_string); + builder.Append(empty_string); + builder.Append(empty_string); + + std::shared_ptr left_arr; + ASSERT_OK(builder.Finish(&left_arr)); + + const BinaryArray& left = static_cast(*left_arr); + std::shared_ptr right = std::make_shared(left.length(), + left.value_offsets(), nullptr, left.null_bitmap(), left.null_count()); + + ASSERT_TRUE(left.Equals(right)); + ASSERT_TRUE(left.RangeEquals(0, left.length(), 0, right)); +} + +class TestBinaryBuilder : public TestBuilder { + public: + void SetUp() { + TestBuilder::SetUp(); + builder_.reset(new BinaryBuilder(pool_)); + } + + void Done() { + std::shared_ptr out; + EXPECT_OK(builder_->Finish(&out)); + + result_ = std::dynamic_pointer_cast(out); + result_->Validate(); + } + + protected: + std::unique_ptr builder_; + std::shared_ptr result_; +}; + +TEST_F(TestBinaryBuilder, TestScalarAppend) { + vector strings = {"", "bb", "a", "", "ccc"}; + vector is_null = {0, 0, 0, 1, 0}; + + int N = static_cast(strings.size()); + int reps = 1000; + + for (int j = 0; j < reps; ++j) { + for (int i = 0; i < N; ++i) { + if (is_null[i]) { + builder_->AppendNull(); + } else { + builder_->Append(strings[i]); + } + } + } + Done(); + ASSERT_OK(result_->Validate()); + ASSERT_EQ(reps * N, result_->length()); + ASSERT_EQ(reps, result_->null_count()); + ASSERT_EQ(reps * 6, result_->data()->size()); + + int32_t length; + for (int i = 0; i < N * reps; ++i) { + if (is_null[i % N]) { + ASSERT_TRUE(result_->IsNull(i)); + } else { + ASSERT_FALSE(result_->IsNull(i)); + const uint8_t* vals = result_->GetValue(i, &length); + ASSERT_EQ(static_cast(strings[i % N].size()), length); + ASSERT_EQ(0, std::memcmp(vals, strings[i % N].data(), length)); + } + } +} + +TEST_F(TestBinaryBuilder, TestZeroLength) { + // All buffers are null + Done(); +} + +// ---------------------------------------------------------------------- +// Slice tests + +template +void CheckSliceEquality() { + using Traits = TypeTraits; + using BuilderType = typename Traits::BuilderType; + + BuilderType builder(default_memory_pool()); + + vector strings = {"foo", "", "bar", "baz", "qux", ""}; + vector is_null = {0, 1, 0, 1, 0, 0}; + + int N = static_cast(strings.size()); + int reps = 10; + + for (int j = 0; j < reps; ++j) { + for (int i = 0; i < N; ++i) { + if (is_null[i]) { + builder.AppendNull(); + } else { + builder.Append(strings[i]); + } + } + } + + std::shared_ptr array; + ASSERT_OK(builder.Finish(&array)); + + std::shared_ptr slice, slice2; + + slice = array->Slice(5); + slice2 = array->Slice(5); + ASSERT_EQ(N * reps - 5, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(5, slice->length(), 0, slice)); + + // Chained slices + slice2 = array->Slice(2)->Slice(3); + ASSERT_TRUE(slice->Equals(slice2)); + + slice = array->Slice(5, 20); + slice2 = array->Slice(5, 20); + ASSERT_EQ(20, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + 
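// Editor's sketch (not part of the patch): the chained-slice check above
// holds because Slice composes by adding offsets into the same backing
// buffers, so Slice(2)->Slice(3) is the same view as Slice(5). Stated
// directly as a sketch on the local `array`:
{
  std::shared_ptr<Array> chained = array->Slice(2)->Slice(3);
  ASSERT_TRUE(chained->Equals(array->Slice(5)));
}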
ASSERT_TRUE(array->RangeEquals(5, 25, 0, slice)); +} + +TEST_F(TestBinaryArray, TestSliceEquality) { + CheckSliceEquality(); +} + +TEST_F(TestStringArray, TestSliceEquality) { + CheckSliceEquality(); +} + +TEST_F(TestBinaryArray, LengthZeroCtor) { + BinaryArray array(0, nullptr, nullptr); +} + +// ---------------------------------------------------------------------- +// FixedWidthBinary tests + +class TestFWBinaryArray : public ::testing::Test { + public: + void SetUp() {} + + void InitBuilder(int byte_width) { + auto type = fixed_width_binary(byte_width); + builder_.reset(new FixedWidthBinaryBuilder(default_memory_pool(), type)); + } + + protected: + std::unique_ptr builder_; +}; + +TEST_F(TestFWBinaryArray, Builder) { + const int32_t byte_width = 10; + int64_t length = 4096; + + int64_t nbytes = length * byte_width; + + vector data(nbytes); + test::random_bytes(nbytes, 0, data.data()); + + vector is_valid(length); + test::random_null_bytes(length, 0.1, is_valid.data()); + + const uint8_t* raw_data = data.data(); + + std::shared_ptr result; + + auto CheckResult = [this, &length, &is_valid, &raw_data, &byte_width]( + const Array& result) { + // Verify output + const auto& fw_result = static_cast(result); + + ASSERT_EQ(length, result.length()); + + for (int64_t i = 0; i < result.length(); ++i) { + if (is_valid[i]) { + ASSERT_EQ( + 0, memcmp(raw_data + byte_width * i, fw_result.GetValue(i), byte_width)); + } else { + ASSERT_TRUE(fw_result.IsNull(i)); + } + } + }; + + // Build using iterative API + InitBuilder(byte_width); + for (int64_t i = 0; i < length; ++i) { + if (is_valid[i]) { + builder_->Append(raw_data + byte_width * i); + } else { + builder_->AppendNull(); + } + } + + ASSERT_OK(builder_->Finish(&result)); + CheckResult(*result); + + // Build using batch API + InitBuilder(byte_width); + + const uint8_t* raw_is_valid = is_valid.data(); + + ASSERT_OK(builder_->Append(raw_data, 50, raw_is_valid)); + ASSERT_OK(builder_->Append(raw_data + 50 * byte_width, length - 50, raw_is_valid + 50)); + ASSERT_OK(builder_->Finish(&result)); + CheckResult(*result); + + // Build from std::string + InitBuilder(byte_width); + for (int64_t i = 0; i < length; ++i) { + if (is_valid[i]) { + builder_->Append( + string(reinterpret_cast(raw_data + byte_width * i), byte_width)); + } else { + builder_->AppendNull(); + } + } + + ASSERT_OK(builder_->Finish(&result)); + CheckResult(*result); +} + +TEST_F(TestFWBinaryArray, EqualsRangeEquals) { + // Check that we don't compare data in null slots + + auto type = fixed_width_binary(4); + FixedWidthBinaryBuilder builder1(default_memory_pool(), type); + FixedWidthBinaryBuilder builder2(default_memory_pool(), type); + + ASSERT_OK(builder1.Append("foo1")); + ASSERT_OK(builder1.AppendNull()); + + ASSERT_OK(builder2.Append("foo1")); + ASSERT_OK(builder2.Append("foo2")); + + std::shared_ptr array1, array2; + ASSERT_OK(builder1.Finish(&array1)); + ASSERT_OK(builder2.Finish(&array2)); + + const auto& a1 = static_cast(*array1); + const auto& a2 = static_cast(*array2); + + FixedWidthBinaryArray equal1(type, 2, a1.data(), a1.null_bitmap(), 1); + FixedWidthBinaryArray equal2(type, 2, a2.data(), a1.null_bitmap(), 1); + + ASSERT_TRUE(equal1.Equals(equal2)); + ASSERT_TRUE(equal1.RangeEquals(equal2, 0, 2, 0)); +} + +TEST_F(TestFWBinaryArray, ZeroSize) { + auto type = fixed_width_binary(0); + FixedWidthBinaryBuilder builder(default_memory_pool(), type); + + ASSERT_OK(builder.Append(nullptr)); + ASSERT_OK(builder.Append(nullptr)); + ASSERT_OK(builder.Append(nullptr)); + 
ASSERT_OK(builder.AppendNull()); + ASSERT_OK(builder.AppendNull()); + ASSERT_OK(builder.AppendNull()); + + std::shared_ptr array; + ASSERT_OK(builder.Finish(&array)); + + const auto& fw_array = static_cast(*array); + + // data is never allocated + ASSERT_TRUE(fw_array.data() == nullptr); + ASSERT_EQ(0, fw_array.byte_width()); + + ASSERT_EQ(6, array->length()); + ASSERT_EQ(3, array->null_count()); +} + +TEST_F(TestFWBinaryArray, Slice) { + auto type = fixed_width_binary(4); + FixedWidthBinaryBuilder builder(default_memory_pool(), type); + + vector strings = {"foo1", "foo2", "foo3", "foo4", "foo5"}; + vector is_null = {0, 1, 0, 0, 0}; + + for (int i = 0; i < 5; ++i) { + if (is_null[i]) { + builder.AppendNull(); + } else { + builder.Append(strings[i]); + } + } + + std::shared_ptr array; + ASSERT_OK(builder.Finish(&array)); + + std::shared_ptr slice, slice2; + + slice = array->Slice(1); + slice2 = array->Slice(1); + ASSERT_EQ(4, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(1, slice->length(), 0, slice)); + + // Chained slices + slice = array->Slice(2); + slice2 = array->Slice(1)->Slice(1); + ASSERT_TRUE(slice->Equals(slice2)); + + slice = array->Slice(1, 3); + ASSERT_EQ(3, slice->length()); + + slice2 = array->Slice(1, 3); + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(1, 3, 0, slice)); +} + +// ---------------------------------------------------------------------- +// List tests + +class TestListBuilder : public TestBuilder { + public: + void SetUp() { + TestBuilder::SetUp(); + + value_type_ = int32(); + type_ = list(value_type_); + + std::shared_ptr tmp; + ASSERT_OK(MakeBuilder(pool_, type_, &tmp)); + builder_ = std::dynamic_pointer_cast(tmp); + } + + void Done() { + std::shared_ptr out; + EXPECT_OK(builder_->Finish(&out)); + result_ = std::dynamic_pointer_cast(out); + } + + protected: + std::shared_ptr value_type_; + std::shared_ptr type_; + + std::shared_ptr builder_; + std::shared_ptr result_; +}; + +TEST_F(TestListBuilder, Equality) { + Int32Builder* vb = static_cast(builder_->value_builder().get()); + + std::shared_ptr array, equal_array, unequal_array; + vector equal_offsets = {0, 1, 2, 5, 6, 7, 8, 10}; + vector equal_values = {1, 2, 3, 4, 5, 2, 2, 2, 5, 6}; + vector unequal_offsets = {0, 1, 4, 7}; + vector unequal_values = {1, 2, 2, 2, 3, 4, 5}; + + // setup two equal arrays + ASSERT_OK(builder_->Append(equal_offsets.data(), equal_offsets.size())); + ASSERT_OK(vb->Append(equal_values.data(), equal_values.size())); + + ASSERT_OK(builder_->Finish(&array)); + ASSERT_OK(builder_->Append(equal_offsets.data(), equal_offsets.size())); + ASSERT_OK(vb->Append(equal_values.data(), equal_values.size())); + + ASSERT_OK(builder_->Finish(&equal_array)); + // now an unequal one + ASSERT_OK(builder_->Append(unequal_offsets.data(), unequal_offsets.size())); + ASSERT_OK(vb->Append(unequal_values.data(), unequal_values.size())); + + ASSERT_OK(builder_->Finish(&unequal_array)); + + // Test array equality + EXPECT_TRUE(array->Equals(array)); + EXPECT_TRUE(array->Equals(equal_array)); + EXPECT_TRUE(equal_array->Equals(array)); + EXPECT_FALSE(equal_array->Equals(unequal_array)); + EXPECT_FALSE(unequal_array->Equals(equal_array)); + + // Test range equality + EXPECT_TRUE(array->RangeEquals(0, 1, 0, unequal_array)); + EXPECT_FALSE(array->RangeEquals(0, 2, 0, unequal_array)); + EXPECT_FALSE(array->RangeEquals(1, 2, 1, unequal_array)); + EXPECT_TRUE(array->RangeEquals(2, 3, 2, unequal_array)); + + // Check with slices, ARROW-33 + 
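+  // Added commentary (editor's note, not part of the original patch, and an
+  // assumption about internals): slicing a ListArray is zero-copy; the slice
+  // reuses the parent's offsets buffer with an element offset applied, so the
+  // equality checks below must compare logical values rather than raw buffer
+  // bytes.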
+  std::shared_ptr<Array> slice, slice2;
+
+  slice = array->Slice(2);
+  slice2 = array->Slice(2);
+  ASSERT_EQ(array->length() - 2, slice->length());
+
+  ASSERT_TRUE(slice->Equals(slice2));
+  ASSERT_TRUE(array->RangeEquals(2, slice->length(), 0, slice));
+
+  // Chained slices
+  slice2 = array->Slice(1)->Slice(1);
+  ASSERT_TRUE(slice->Equals(slice2));
+
+  slice = array->Slice(1, 4);
+  slice2 = array->Slice(1, 4);
+  ASSERT_EQ(4, slice->length());
+
+  ASSERT_TRUE(slice->Equals(slice2));
+  ASSERT_TRUE(array->RangeEquals(1, 5, 0, slice));
+}
+
+TEST_F(TestListBuilder, TestResize) {}
+
+TEST_F(TestListBuilder, TestAppendNull) {
+  ASSERT_OK(builder_->AppendNull());
+  ASSERT_OK(builder_->AppendNull());
+
+  Done();
+
+  ASSERT_OK(result_->Validate());
+  ASSERT_TRUE(result_->IsNull(0));
+  ASSERT_TRUE(result_->IsNull(1));
+
+  ASSERT_EQ(0, result_->raw_value_offsets()[0]);
+  ASSERT_EQ(0, result_->value_offset(1));
+  ASSERT_EQ(0, result_->value_offset(2));
+
+  Int32Array* values = static_cast<Int32Array*>(result_->values().get());
+  ASSERT_EQ(0, values->length());
+}
+
+void ValidateBasicListArray(const ListArray* result, const vector<int32_t>& values,
+    const vector<uint8_t>& is_valid) {
+  ASSERT_OK(result->Validate());
+  ASSERT_EQ(1, result->null_count());
+  ASSERT_EQ(0, result->values()->null_count());
+
+  ASSERT_EQ(3, result->length());
+  vector<int32_t> ex_offsets = {0, 3, 3, 7};
+  for (size_t i = 0; i < ex_offsets.size(); ++i) {
+    ASSERT_EQ(ex_offsets[i], result->value_offset(i));
+  }
+
+  for (int i = 0; i < result->length(); ++i) {
+    ASSERT_EQ(!static_cast<bool>(is_valid[i]), result->IsNull(i));
+  }
+
+  ASSERT_EQ(7, result->values()->length());
+  Int32Array* varr = static_cast<Int32Array*>(result->values().get());
+
+  for (size_t i = 0; i < values.size(); ++i) {
+    ASSERT_EQ(values[i], varr->Value(i));
+  }
+}
+
+TEST_F(TestListBuilder, TestBasics) {
+  vector<int32_t> values = {0, 1, 2, 3, 4, 5, 6};
+  vector<int> lengths = {3, 0, 4};
+  vector<uint8_t> is_valid = {1, 0, 1};
+
+  Int32Builder* vb = static_cast<Int32Builder*>(builder_->value_builder().get());
+
+  ASSERT_OK(builder_->Reserve(lengths.size()));
+  ASSERT_OK(vb->Reserve(values.size()));
+
+  int pos = 0;
+  for (size_t i = 0; i < lengths.size(); ++i) {
+    ASSERT_OK(builder_->Append(is_valid[i] > 0));
+    for (int j = 0; j < lengths[i]; ++j) {
+      vb->Append(values[pos++]);
+    }
+  }
+
+  Done();
+  ValidateBasicListArray(result_.get(), values, is_valid);
+}
+
+TEST_F(TestListBuilder, BulkAppend) {
+  vector<int32_t> values = {0, 1, 2, 3, 4, 5, 6};
+  vector<int> lengths = {3, 0, 4};
+  vector<uint8_t> is_valid = {1, 0, 1};
+  vector<int32_t> offsets = {0, 3, 3};
+
+  Int32Builder* vb = static_cast<Int32Builder*>(builder_->value_builder().get());
+  ASSERT_OK(vb->Reserve(values.size()));
+
+  builder_->Append(offsets.data(), offsets.size(), is_valid.data());
+  for (int32_t value : values) {
+    vb->Append(value);
+  }
+  Done();
+  ValidateBasicListArray(result_.get(), values, is_valid);
+}
+
+TEST_F(TestListBuilder, BulkAppendInvalid) {
+  vector<int32_t> values = {0, 1, 2, 3, 4, 5, 6};
+  vector<int> lengths = {3, 0, 4};
+  vector<uint8_t> is_null = {0, 1, 0};
+  vector<uint8_t> is_valid = {1, 0, 1};
+  vector<int32_t> offsets = {0, 2, 4};  // should be 0, 3, 3 given the is_null array
+
+  Int32Builder* vb = static_cast<Int32Builder*>(builder_->value_builder().get());
+  ASSERT_OK(vb->Reserve(values.size()));
+
+  builder_->Append(offsets.data(), offsets.size(), is_valid.data());
+  builder_->Append(offsets.data(), offsets.size(), is_valid.data());
+  for (int32_t value : values) {
+    vb->Append(value);
+  }
+
+  Done();
+  ASSERT_RAISES(Invalid, result_->Validate());
+}
+
+TEST_F(TestListBuilder, TestZeroLength) {
+  // All buffers are null
+  Done();
+  ASSERT_OK(result_->Validate());
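+  // Added commentary (editor's note, not part of the original patch): a
+  // builder that never received a value may leave its offsets and data
+  // buffers unallocated, so the Validate() call above doubles as a check
+  // that null buffers on a length-0 array are considered well-formed.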
+}
+
+// ----------------------------------------------------------------------
+// DictionaryArray tests
+
+TEST(TestDictionary, Basics) {
+  vector<int32_t> values = {100, 1000, 10000, 100000};
+  std::shared_ptr<Array> dict;
+  ArrayFromVector<Int32Type, int32_t>(values, &dict);
+
+  std::shared_ptr<DictionaryType> type1 =
+      std::dynamic_pointer_cast<DictionaryType>(dictionary(int16(), dict));
+  DictionaryType type2(int16(), dict);
+
+  ASSERT_TRUE(int16()->Equals(type1->index_type()));
+  ASSERT_TRUE(type1->dictionary()->Equals(dict));
+
+  ASSERT_TRUE(int16()->Equals(type2.index_type()));
+  ASSERT_TRUE(type2.dictionary()->Equals(dict));
+
+  ASSERT_EQ("dictionary<values=int32, indices=int16>", type1->ToString());
+}
+
+TEST(TestDictionary, Equals) {
+  vector<bool> is_valid = {true, true, false, true, true, true};
+
+  std::shared_ptr<Array> dict;
+  vector<string> dict_values = {"foo", "bar", "baz"};
+  ArrayFromVector<StringType, string>(dict_values, &dict);
+  std::shared_ptr<DataType> dict_type = dictionary(int16(), dict);
+
+  std::shared_ptr<Array> dict2;
+  vector<string> dict2_values = {"foo", "bar", "baz", "qux"};
+  ArrayFromVector<StringType, string>(dict2_values, &dict2);
+  std::shared_ptr<DataType> dict2_type = dictionary(int16(), dict2);
+
+  std::shared_ptr<Array> indices;
+  vector<int16_t> indices_values = {1, 2, -1, 0, 2, 0};
+  ArrayFromVector<Int16Type, int16_t>(is_valid, indices_values, &indices);
+
+  std::shared_ptr<Array> indices2;
+  vector<int16_t> indices2_values = {1, 2, 0, 0, 2, 0};
+  ArrayFromVector<Int16Type, int16_t>(is_valid, indices2_values, &indices2);
+
+  std::shared_ptr<Array> indices3;
+  vector<int16_t> indices3_values = {1, 1, 0, 0, 2, 0};
+  ArrayFromVector<Int16Type, int16_t>(is_valid, indices3_values, &indices3);
+
+  auto array = std::make_shared<DictionaryArray>(dict_type, indices);
+  auto array2 = std::make_shared<DictionaryArray>(dict_type, indices2);
+  auto array3 = std::make_shared<DictionaryArray>(dict2_type, indices);
+  auto array4 = std::make_shared<DictionaryArray>(dict_type, indices3);
+
+  ASSERT_TRUE(array->Equals(array));
+
+  // Equal, because the unequal index is masked by null
+  ASSERT_TRUE(array->Equals(array2));
+
+  // Unequal dictionaries
+  ASSERT_FALSE(array->Equals(array3));
+
+  // Unequal indices
+  ASSERT_FALSE(array->Equals(array4));
+
+  // RangeEquals
+  ASSERT_TRUE(array->RangeEquals(3, 6, 3, array4));
+  ASSERT_FALSE(array->RangeEquals(1, 3, 1, array4));
+
+  // ARROW-33 Test slices
+  const int64_t size = array->length();
+
+  std::shared_ptr<Array> slice, slice2;
+  slice = array->Array::Slice(2);
+  slice2 = array->Array::Slice(2);
+  ASSERT_EQ(size - 2, slice->length());
+
+  ASSERT_TRUE(slice->Equals(slice2));
+  ASSERT_TRUE(array->RangeEquals(2, array->length(), 0, slice));
+
+  // Chained slices
+  slice2 = array->Array::Slice(1)->Array::Slice(1);
+  ASSERT_TRUE(slice->Equals(slice2));
+
+  slice = array->Slice(1, 3);
+  slice2 = array->Slice(1, 3);
+  ASSERT_EQ(3, slice->length());
+
+  ASSERT_TRUE(slice->Equals(slice2));
+  ASSERT_TRUE(array->RangeEquals(1, 4, 0, slice));
+}
+
+TEST(TestDictionary, Validate) {
+  vector<bool> is_valid = {true, true, false, true, true, true};
+
+  std::shared_ptr<Array> dict;
+  vector<string> dict_values = {"foo", "bar", "baz"};
+  ArrayFromVector<StringType, string>(dict_values, &dict);
+  std::shared_ptr<DataType> dict_type = dictionary(int16(), dict);
+
+  std::shared_ptr<Array> indices;
+  vector<int16_t> indices_values = {1, 2, 0, 0, 2, 0};
+  ArrayFromVector<Int16Type, int16_t>(is_valid, indices_values, &indices);
+
+  std::shared_ptr<Array> indices2;
+  vector<float> indices2_values = {1., 2., 0., 0., 2., 0.};
+  ArrayFromVector<FloatType, float>(is_valid, indices2_values, &indices2);
+
+  std::shared_ptr<Array> indices3;
+  vector<int64_t> indices3_values = {1, 2, 0, 0, 2, 0};
+  ArrayFromVector<Int64Type, int64_t>(is_valid, indices3_values, &indices3);
+
+  std::shared_ptr<Array> arr = std::make_shared<DictionaryArray>(dict_type, indices);
+  std::shared_ptr<Array> arr2 = std::make_shared<DictionaryArray>(dict_type, indices2);
+  std::shared_ptr<Array> arr3 = std::make_shared<DictionaryArray>(dict_type, indices3);
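+  // Added commentary (editor's note, not part of the original patch): the
+  // three arrays differ only in index type. arr uses the declared int16
+  // indices; arr2 substitutes float indices, which is what the Invalid status
+  // below asserts against, since dictionary indices must be integers; arr3
+  // uses int64 indices, another acceptable integer index type.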
+ // Only checking index type for now + ASSERT_OK(arr->Validate()); + ASSERT_RAISES(Invalid, arr2->Validate()); + ASSERT_OK(arr3->Validate()); +} + +// ---------------------------------------------------------------------- +// Struct tests + +void ValidateBasicStructArray(const StructArray* result, + const vector& struct_is_valid, const vector& list_values, + const vector& list_is_valid, const vector& list_lengths, + const vector& list_offsets, const vector& int_values) { + ASSERT_EQ(4, result->length()); + ASSERT_OK(result->Validate()); + + auto list_char_arr = static_cast(result->field(0).get()); + auto char_arr = static_cast(list_char_arr->values().get()); + auto int32_arr = static_cast(result->field(1).get()); + + ASSERT_EQ(0, result->null_count()); + ASSERT_EQ(1, list_char_arr->null_count()); + ASSERT_EQ(0, int32_arr->null_count()); + + // List + ASSERT_EQ(4, list_char_arr->length()); + ASSERT_EQ(10, list_char_arr->values()->length()); + for (size_t i = 0; i < list_offsets.size(); ++i) { + ASSERT_EQ(list_offsets[i], list_char_arr->raw_value_offsets()[i]); + } + for (size_t i = 0; i < list_values.size(); ++i) { + ASSERT_EQ(list_values[i], char_arr->Value(i)); + } + + // Int32 + ASSERT_EQ(4, int32_arr->length()); + for (size_t i = 0; i < int_values.size(); ++i) { + ASSERT_EQ(int_values[i], int32_arr->Value(i)); + } +} + +// ---------------------------------------------------------------------------------- +// Struct test +class TestStructBuilder : public TestBuilder { + public: + void SetUp() { + TestBuilder::SetUp(); + + auto int32_type = int32(); + auto char_type = int8(); + auto list_type = list(char_type); + + vector> types = {list_type, int32_type}; + vector fields; + fields.push_back(FieldPtr(new Field("list", list_type))); + fields.push_back(FieldPtr(new Field("int", int32_type))); + + type_ = struct_(fields); + value_fields_ = fields; + + std::shared_ptr tmp; + ASSERT_OK(MakeBuilder(pool_, type_, &tmp)); + + builder_ = std::dynamic_pointer_cast(tmp); + ASSERT_EQ(2, static_cast(builder_->field_builders().size())); + } + + void Done() { + std::shared_ptr out; + ASSERT_OK(builder_->Finish(&out)); + result_ = std::dynamic_pointer_cast(out); + } + + protected: + vector value_fields_; + std::shared_ptr type_; + + std::shared_ptr builder_; + std::shared_ptr result_; +}; + +TEST_F(TestStructBuilder, TestAppendNull) { + ASSERT_OK(builder_->AppendNull()); + ASSERT_OK(builder_->AppendNull()); + ASSERT_EQ(2, static_cast(builder_->field_builders().size())); + + ListBuilder* list_vb = static_cast(builder_->field_builder(0).get()); + ASSERT_OK(list_vb->AppendNull()); + ASSERT_OK(list_vb->AppendNull()); + ASSERT_EQ(2, list_vb->length()); + + Int32Builder* int_vb = static_cast(builder_->field_builder(1).get()); + ASSERT_OK(int_vb->AppendNull()); + ASSERT_OK(int_vb->AppendNull()); + ASSERT_EQ(2, int_vb->length()); + + Done(); + + ASSERT_OK(result_->Validate()); + + ASSERT_EQ(2, static_cast(result_->fields().size())); + ASSERT_EQ(2, result_->length()); + ASSERT_EQ(2, result_->field(0)->length()); + ASSERT_EQ(2, result_->field(1)->length()); + ASSERT_TRUE(result_->IsNull(0)); + ASSERT_TRUE(result_->IsNull(1)); + ASSERT_TRUE(result_->field(0)->IsNull(0)); + ASSERT_TRUE(result_->field(0)->IsNull(1)); + ASSERT_TRUE(result_->field(1)->IsNull(0)); + ASSERT_TRUE(result_->field(1)->IsNull(1)); + + ASSERT_EQ(Type::LIST, result_->field(0)->type_enum()); + ASSERT_EQ(Type::INT32, result_->field(1)->type_enum()); +} + +TEST_F(TestStructBuilder, TestBasics) { + vector int_values = {1, 2, 3, 4}; + vector 
list_values = {'j', 'o', 'e', 'b', 'o', 'b', 'm', 'a', 'r', 'k'}; + vector list_lengths = {3, 0, 3, 4}; + vector list_offsets = {0, 3, 3, 6, 10}; + vector list_is_valid = {1, 0, 1, 1}; + vector struct_is_valid = {1, 1, 1, 1}; + + ListBuilder* list_vb = static_cast(builder_->field_builder(0).get()); + Int8Builder* char_vb = static_cast(list_vb->value_builder().get()); + Int32Builder* int_vb = static_cast(builder_->field_builder(1).get()); + ASSERT_EQ(2, static_cast(builder_->field_builders().size())); + + EXPECT_OK(builder_->Resize(list_lengths.size())); + EXPECT_OK(char_vb->Resize(list_values.size())); + EXPECT_OK(int_vb->Resize(int_values.size())); + + int pos = 0; + for (size_t i = 0; i < list_lengths.size(); ++i) { + ASSERT_OK(list_vb->Append(list_is_valid[i] > 0)); + int_vb->UnsafeAppend(int_values[i]); + for (int j = 0; j < list_lengths[i]; ++j) { + char_vb->UnsafeAppend(list_values[pos++]); + } + } + + for (size_t i = 0; i < struct_is_valid.size(); ++i) { + ASSERT_OK(builder_->Append(struct_is_valid[i] > 0)); + } + + Done(); + + ValidateBasicStructArray(result_.get(), struct_is_valid, list_values, list_is_valid, + list_lengths, list_offsets, int_values); +} + +TEST_F(TestStructBuilder, BulkAppend) { + vector int_values = {1, 2, 3, 4}; + vector list_values = {'j', 'o', 'e', 'b', 'o', 'b', 'm', 'a', 'r', 'k'}; + vector list_lengths = {3, 0, 3, 4}; + vector list_offsets = {0, 3, 3, 6}; + vector list_is_valid = {1, 0, 1, 1}; + vector struct_is_valid = {1, 1, 1, 1}; + + ListBuilder* list_vb = static_cast(builder_->field_builder(0).get()); + Int8Builder* char_vb = static_cast(list_vb->value_builder().get()); + Int32Builder* int_vb = static_cast(builder_->field_builder(1).get()); + + ASSERT_OK(builder_->Resize(list_lengths.size())); + ASSERT_OK(char_vb->Resize(list_values.size())); + ASSERT_OK(int_vb->Resize(int_values.size())); + + builder_->Append(struct_is_valid.size(), struct_is_valid.data()); + + list_vb->Append(list_offsets.data(), list_offsets.size(), list_is_valid.data()); + for (int8_t value : list_values) { + char_vb->UnsafeAppend(value); + } + for (int32_t value : int_values) { + int_vb->UnsafeAppend(value); + } + + Done(); + ValidateBasicStructArray(result_.get(), struct_is_valid, list_values, list_is_valid, + list_lengths, list_offsets, int_values); +} + +TEST_F(TestStructBuilder, BulkAppendInvalid) { + vector int_values = {1, 2, 3, 4}; + vector list_values = {'j', 'o', 'e', 'b', 'o', 'b', 'm', 'a', 'r', 'k'}; + vector list_lengths = {3, 0, 3, 4}; + vector list_offsets = {0, 3, 3, 6}; + vector list_is_valid = {1, 0, 1, 1}; + vector struct_is_valid = {1, 0, 1, 1}; // should be 1, 1, 1, 1 + + ListBuilder* list_vb = static_cast(builder_->field_builder(0).get()); + Int8Builder* char_vb = static_cast(list_vb->value_builder().get()); + Int32Builder* int_vb = static_cast(builder_->field_builder(1).get()); + + ASSERT_OK(builder_->Reserve(list_lengths.size())); + ASSERT_OK(char_vb->Reserve(list_values.size())); + ASSERT_OK(int_vb->Reserve(int_values.size())); + + builder_->Append(struct_is_valid.size(), struct_is_valid.data()); + + list_vb->Append(list_offsets.data(), list_offsets.size(), list_is_valid.data()); + for (int8_t value : list_values) { + char_vb->UnsafeAppend(value); + } + for (int32_t value : int_values) { + int_vb->UnsafeAppend(value); + } + + Done(); + // Even null bitmap of the parent Struct is not valid, Validate() will ignore it. 
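+  // Added commentary (editor's note, not part of the original patch): the
+  // children above were appended densely while the parent bitmap marks slot 1
+  // null, so parent and children deliberately disagree; the assertion below
+  // pins down that Validate() does not cross-check a struct's bitmap against
+  // its children.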
+  ASSERT_OK(result_->Validate());
+}
+
+TEST_F(TestStructBuilder, TestEquality) {
+  std::shared_ptr<Array> array, equal_array;
+  std::shared_ptr<Array> unequal_bitmap_array, unequal_offsets_array,
+      unequal_values_array;
+
+  vector<int32_t> int_values = {1, 2, 3, 4};
+  vector<char> list_values = {'j', 'o', 'e', 'b', 'o', 'b', 'm', 'a', 'r', 'k'};
+  vector<int> list_lengths = {3, 0, 3, 4};
+  vector<int32_t> list_offsets = {0, 3, 3, 6};
+  vector<uint8_t> list_is_valid = {1, 0, 1, 1};
+  vector<uint8_t> struct_is_valid = {1, 1, 1, 1};
+
+  vector<int32_t> unequal_int_values = {4, 2, 3, 1};
+  vector<char> unequal_list_values = {'j', 'o', 'e', 'b', 'o', 'b', 'l', 'u', 'c', 'y'};
+  vector<int32_t> unequal_list_offsets = {0, 3, 4, 6};
+  vector<uint8_t> unequal_list_is_valid = {1, 1, 1, 1};
+  vector<uint8_t> unequal_struct_is_valid = {1, 0, 0, 1};
+
+  ListBuilder* list_vb = static_cast<ListBuilder*>(builder_->field_builder(0).get());
+  Int8Builder* char_vb = static_cast<Int8Builder*>(list_vb->value_builder().get());
+  Int32Builder* int_vb = static_cast<Int32Builder*>(builder_->field_builder(1).get());
+  ASSERT_OK(builder_->Reserve(list_lengths.size()));
+  ASSERT_OK(char_vb->Reserve(list_values.size()));
+  ASSERT_OK(int_vb->Reserve(int_values.size()));
+
+  // set up two equal arrays, one of which takes an unequal bitmap
+  builder_->Append(struct_is_valid.size(), struct_is_valid.data());
+  list_vb->Append(list_offsets.data(), list_offsets.size(), list_is_valid.data());
+  for (int8_t value : list_values) {
+    char_vb->UnsafeAppend(value);
+  }
+  for (int32_t value : int_values) {
+    int_vb->UnsafeAppend(value);
+  }
+
+  ASSERT_OK(builder_->Finish(&array));
+
+  ASSERT_OK(builder_->Resize(list_lengths.size()));
+  ASSERT_OK(char_vb->Resize(list_values.size()));
+  ASSERT_OK(int_vb->Resize(int_values.size()));
+
+  builder_->Append(struct_is_valid.size(), struct_is_valid.data());
+  list_vb->Append(list_offsets.data(), list_offsets.size(), list_is_valid.data());
+  for (int8_t value : list_values) {
+    char_vb->UnsafeAppend(value);
+  }
+  for (int32_t value : int_values) {
+    int_vb->UnsafeAppend(value);
+  }
+
+  ASSERT_OK(builder_->Finish(&equal_array));
+
+  ASSERT_OK(builder_->Resize(list_lengths.size()));
+  ASSERT_OK(char_vb->Resize(list_values.size()));
+  ASSERT_OK(int_vb->Resize(int_values.size()));
+
+  // set up an unequal one with the unequal bitmap
+  builder_->Append(unequal_struct_is_valid.size(), unequal_struct_is_valid.data());
+  list_vb->Append(list_offsets.data(), list_offsets.size(), list_is_valid.data());
+  for (int8_t value : list_values) {
+    char_vb->UnsafeAppend(value);
+  }
+  for (int32_t value : int_values) {
+    int_vb->UnsafeAppend(value);
+  }
+
+  ASSERT_OK(builder_->Finish(&unequal_bitmap_array));
+
+  ASSERT_OK(builder_->Resize(list_lengths.size()));
+  ASSERT_OK(char_vb->Resize(list_values.size()));
+  ASSERT_OK(int_vb->Resize(int_values.size()));
+
+  // set up an unequal one with unequal offsets
+  builder_->Append(struct_is_valid.size(), struct_is_valid.data());
+  list_vb->Append(unequal_list_offsets.data(), unequal_list_offsets.size(),
+      unequal_list_is_valid.data());
+  for (int8_t value : list_values) {
+    char_vb->UnsafeAppend(value);
+  }
+  for (int32_t value : int_values) {
+    int_vb->UnsafeAppend(value);
+  }
+
+  ASSERT_OK(builder_->Finish(&unequal_offsets_array));
+
+  ASSERT_OK(builder_->Resize(list_lengths.size()));
+  ASSERT_OK(char_vb->Resize(list_values.size()));
+  ASSERT_OK(int_vb->Resize(int_values.size()));
+
+  // set up an unequal one with unequal values
+  builder_->Append(struct_is_valid.size(), struct_is_valid.data());
+  list_vb->Append(list_offsets.data(), list_offsets.size(), list_is_valid.data());
+  for (int8_t value :
unequal_list_values) { + char_vb->UnsafeAppend(value); + } + for (int32_t value : unequal_int_values) { + int_vb->UnsafeAppend(value); + } + + ASSERT_OK(builder_->Finish(&unequal_values_array)); + + // Test array equality + EXPECT_TRUE(array->Equals(array)); + EXPECT_TRUE(array->Equals(equal_array)); + EXPECT_TRUE(equal_array->Equals(array)); + EXPECT_FALSE(equal_array->Equals(unequal_bitmap_array)); + EXPECT_FALSE(unequal_bitmap_array->Equals(equal_array)); + EXPECT_FALSE(unequal_bitmap_array->Equals(unequal_values_array)); + EXPECT_FALSE(unequal_values_array->Equals(unequal_bitmap_array)); + EXPECT_FALSE(unequal_bitmap_array->Equals(unequal_offsets_array)); + EXPECT_FALSE(unequal_offsets_array->Equals(unequal_bitmap_array)); + + // Test range equality + EXPECT_TRUE(array->RangeEquals(0, 4, 0, equal_array)); + EXPECT_TRUE(array->RangeEquals(3, 4, 3, unequal_bitmap_array)); + EXPECT_TRUE(array->RangeEquals(0, 1, 0, unequal_offsets_array)); + EXPECT_FALSE(array->RangeEquals(0, 2, 0, unequal_offsets_array)); + EXPECT_FALSE(array->RangeEquals(1, 2, 1, unequal_offsets_array)); + EXPECT_FALSE(array->RangeEquals(0, 1, 0, unequal_values_array)); + EXPECT_TRUE(array->RangeEquals(1, 3, 1, unequal_values_array)); + EXPECT_FALSE(array->RangeEquals(3, 4, 3, unequal_values_array)); + + // ARROW-33 Slice / equality + std::shared_ptr slice, slice2; + + slice = array->Slice(2); + slice2 = array->Slice(2); + ASSERT_EQ(array->length() - 2, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(2, slice->length(), 0, slice)); + + slice = array->Slice(1, 2); + slice2 = array->Slice(1, 2); + ASSERT_EQ(2, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(1, 3, 0, slice)); +} + +TEST_F(TestStructBuilder, TestZeroLength) { + // All buffers are null + Done(); + ASSERT_OK(result_->Validate()); +} + +// ---------------------------------------------------------------------- +// Union tests + +TEST(TestUnionArrayAdHoc, TestSliceEquals) { + std::shared_ptr batch; + ASSERT_OK(ipc::MakeUnion(&batch)); + + const int64_t size = batch->num_rows(); + + auto CheckUnion = [&size](std::shared_ptr array) { + std::shared_ptr slice, slice2; + slice = array->Slice(2); + slice2 = array->Slice(2); + ASSERT_EQ(size - 2, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(2, array->length(), 0, slice)); + + // Chained slices + slice2 = array->Slice(1)->Slice(1); + ASSERT_TRUE(slice->Equals(slice2)); + + slice = array->Slice(1, 5); + slice2 = array->Slice(1, 5); + ASSERT_EQ(5, slice->length()); + + ASSERT_TRUE(slice->Equals(slice2)); + ASSERT_TRUE(array->RangeEquals(1, 6, 0, slice)); + }; + + CheckUnion(batch->column(1)); + CheckUnion(batch->column(2)); +} + } // namespace arrow diff --git a/cpp/src/arrow/array-union-test.cc b/cpp/src/arrow/array-union-test.cc deleted file mode 100644 index 83c3196cab74b..0000000000000 --- a/cpp/src/arrow/array-union-test.cc +++ /dev/null @@ -1,67 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. 
You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -// Tests for UnionArray - -#include -#include -#include - -#include "gtest/gtest.h" - -#include "arrow/array.h" -#include "arrow/builder.h" -#include "arrow/ipc/test-common.h" -#include "arrow/status.h" -#include "arrow/table.h" -#include "arrow/test-util.h" -#include "arrow/type.h" - -namespace arrow { - -TEST(TestUnionArrayAdHoc, TestSliceEquals) { - std::shared_ptr batch; - ASSERT_OK(ipc::MakeUnion(&batch)); - - const int64_t size = batch->num_rows(); - - auto CheckUnion = [&size](std::shared_ptr array) { - std::shared_ptr slice, slice2; - slice = array->Slice(2); - slice2 = array->Slice(2); - ASSERT_EQ(size - 2, slice->length()); - - ASSERT_TRUE(slice->Equals(slice2)); - ASSERT_TRUE(array->RangeEquals(2, array->length(), 0, slice)); - - // Chained slices - slice2 = array->Slice(1)->Slice(1); - ASSERT_TRUE(slice->Equals(slice2)); - - slice = array->Slice(1, 5); - slice2 = array->Slice(1, 5); - ASSERT_EQ(5, slice->length()); - - ASSERT_TRUE(slice->Equals(slice2)); - ASSERT_TRUE(array->RangeEquals(1, 6, 0, slice)); - }; - - CheckUnion(batch->column(1)); - CheckUnion(batch->column(2)); -} - -} // namespace arrow diff --git a/cpp/src/arrow/io/io-hdfs-test.cc b/cpp/src/arrow/io/io-hdfs-test.cc index f3140be0b2dac..af59e96a1448f 100644 --- a/cpp/src/arrow/io/io-hdfs-test.cc +++ b/cpp/src/arrow/io/io-hdfs-test.cc @@ -78,9 +78,10 @@ class TestHdfsClient : public ::testing::Test { LibHdfsShim* driver_shim; client_ = nullptr; - scratch_dir_ = boost::filesystem::unique_path( - boost::filesystem::temp_directory_path() / "arrow-hdfs/scratch-%%%%") - .string(); + scratch_dir_ = + boost::filesystem::unique_path( + boost::filesystem::temp_directory_path() / "arrow-hdfs/scratch-%%%%") + .string(); loaded_driver_ = false; diff --git a/cpp/src/arrow/type-test.cc b/cpp/src/arrow/type-test.cc index 70c173432a960..b221c80391cde 100644 --- a/cpp/src/arrow/type-test.cc +++ b/cpp/src/arrow/type-test.cc @@ -117,6 +117,28 @@ TEST_F(TestSchema, GetFieldByName) { ASSERT_TRUE(result == nullptr); } +#define PRIMITIVE_TEST(KLASS, ENUM, NAME) \ + TEST(TypesTest, TestPrimitive_##ENUM) { \ + KLASS tp; \ + \ + ASSERT_EQ(tp.type, Type::ENUM); \ + ASSERT_EQ(tp.ToString(), std::string(NAME)); \ + } + +PRIMITIVE_TEST(Int8Type, INT8, "int8"); +PRIMITIVE_TEST(Int16Type, INT16, "int16"); +PRIMITIVE_TEST(Int32Type, INT32, "int32"); +PRIMITIVE_TEST(Int64Type, INT64, "int64"); +PRIMITIVE_TEST(UInt8Type, UINT8, "uint8"); +PRIMITIVE_TEST(UInt16Type, UINT16, "uint16"); +PRIMITIVE_TEST(UInt32Type, UINT32, "uint32"); +PRIMITIVE_TEST(UInt64Type, UINT64, "uint64"); + +PRIMITIVE_TEST(FloatType, FLOAT, "float"); +PRIMITIVE_TEST(DoubleType, DOUBLE, "double"); + +PRIMITIVE_TEST(BooleanType, BOOL, "bool"); + TEST(TestBinaryType, ToString) { BinaryType t1; BinaryType e1; @@ -264,4 +286,27 @@ TEST(TestNestedType, Equals) { ASSERT_FALSE(u0->Equals(u0_bad)); } +TEST(TestStructType, Basics) { + auto f0_type = int32(); + auto f0 = field("f0", f0_type); + + auto f1_type = utf8(); + auto f1 = field("f1", f1_type); + + auto f2_type = uint8(); + auto f2 = field("f2", f2_type); + + vector> fields = {f0, f1, f2}; + + StructType 
struct_type(fields); + + ASSERT_TRUE(struct_type.child(0)->Equals(f0)); + ASSERT_TRUE(struct_type.child(1)->Equals(f1)); + ASSERT_TRUE(struct_type.child(2)->Equals(f2)); + + ASSERT_EQ(struct_type.ToString(), "struct"); + + // TODO(wesm): out of bounds for field(...) +} + } // namespace arrow From ba4f478e7c651f43e5e605b3fa6818f2e9f7cd3d Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 30 Mar 2017 18:41:35 -0400 Subject: [PATCH 0437/1644] ARROW-715: [Python] Make pandas not a hard requirement, flake8 fixes Author: Wes McKinney Closes #462 from wesm/ARROW-715 and squashes the following commits: 21fe8eb [Wes McKinney] Make pandas not a hard requirement, flake8 fixes --- python/pyarrow/compat.py | 19 +++++++++++-------- python/setup.py | 21 +++++++++++++-------- 2 files changed, 24 insertions(+), 16 deletions(-) diff --git a/python/pyarrow/compat.py b/python/pyarrow/compat.py index 74d7ca2827bc9..b9206aacbc9f1 100644 --- a/python/pyarrow/compat.py +++ b/python/pyarrow/compat.py @@ -21,7 +21,6 @@ import itertools import numpy as np -import pandas as pd import sys import six @@ -31,6 +30,17 @@ PY26 = sys.version_info[:2] == (2, 6) PY2 = sys.version_info[0] == 2 +try: + import pandas as pd + if LooseVersion(pd.__version__) < '0.19.0': + pdapi = pd.core.common + from pandas.core.dtypes import DatetimeTZDtype + else: + from pandas.types.dtypes import DatetimeTZDtype + pdapi = pd.api.types + HAVE_PANDAS = True +except: + HAVE_PANDAS = False if PY26: import unittest2 as unittest @@ -117,13 +127,6 @@ def encode_file_path(path): return encoded_path -if LooseVersion(pd.__version__) < '0.19.0': - pdapi = pd.core.common - from pandas.core.dtypes import DatetimeTZDtype -else: - from pandas.types.dtypes import DatetimeTZDtype - pdapi = pd.api.types - integer_types = six.integer_types + (np.integer,) __all__ = [] diff --git a/python/setup.py b/python/setup.py index dae6cb2f078f6..9ff091819c760 100644 --- a/python/setup.py +++ b/python/setup.py @@ -17,7 +17,6 @@ # specific language governing permissions and limitations # under the License. 
-import glob import os.path as osp import re import shutil @@ -83,16 +82,20 @@ def run(self): ('build-type=', None, 'build type (debug or release)'), ('with-parquet', None, 'build the Parquet extension'), ('with-jemalloc', None, 'build the jemalloc extension'), - ('bundle-arrow-cpp', None, 'bundle the Arrow C++ libraries')] + + ('bundle-arrow-cpp', None, + 'bundle the Arrow C++ libraries')] + _build_ext.user_options) def initialize_options(self): _build_ext.initialize_options(self) self.extra_cmake_args = os.environ.get('PYARROW_CMAKE_OPTIONS', '') self.build_type = os.environ.get('PYARROW_BUILD_TYPE', 'debug').lower() - self.with_parquet = strtobool(os.environ.get('PYARROW_WITH_PARQUET', '0')) - self.with_jemalloc = strtobool(os.environ.get('PYARROW_WITH_JEMALLOC', '0')) - self.bundle_arrow_cpp = strtobool(os.environ.get('PYARROW_BUNDLE_ARROW_CPP', '0')) + self.with_parquet = strtobool( + os.environ.get('PYARROW_WITH_PARQUET', '0')) + self.with_jemalloc = strtobool( + os.environ.get('PYARROW_WITH_JEMALLOC', '0')) + self.bundle_arrow_cpp = strtobool( + os.environ.get('PYARROW_BUNDLE_ARROW_CPP', '0')) CYTHON_MODULE_NAMES = [ 'array', @@ -300,8 +303,10 @@ def get_outputs(self): if not os.path.exists('../.git') and os.path.exists('../java/pom.xml'): import xml.etree.ElementTree as ET tree = ET.parse('../java/pom.xml') - version_tag = list(tree.getroot().findall('{http://maven.apache.org/POM/4.0.0}version'))[0] - os.environ["SETUPTOOLS_SCM_PRETEND_VERSION"] = version_tag.text.replace("-SNAPSHOT", "a0") + version_tag = list(tree.getroot().findall( + '{http://maven.apache.org/POM/4.0.0}version'))[0] + os.environ["SETUPTOOLS_SCM_PRETEND_VERSION"] = version_tag.text.replace( + "-SNAPSHOT", "a0") long_description = """Apache Arrow is a columnar in-memory analytics layer designed to accelerate big data. It houses a set of canonical in-memory @@ -321,7 +326,7 @@ def get_outputs(self): 'clean': clean, 'build_ext': build_ext }, - use_scm_version = {"root": "..", "relative_to": __file__}, + use_scm_version={"root": "..", "relative_to": __file__}, setup_requires=['setuptools_scm', 'cython >= 0.23'], install_requires=['numpy >= 1.9', 'six >= 1.0.0'], test_requires=['pytest'], From edd6cfcd9bfc02b2ed093f22acf830a57422f7b3 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 30 Mar 2017 18:42:52 -0400 Subject: [PATCH 0438/1644] ARROW-727: [Python] Ensure that NativeFile.write accepts any bytes, unicode, or object providing buffer protocol. Rename build_arrow_buffer to pyarrow.frombuffer Author: Wes McKinney Closes #464 from wesm/ARROW-727 and squashes the following commits: c93edb0 [Wes McKinney] Rename build_arrow_buffer to pyarrow.frombuffer. 
Ensure that NativeFile.write accepts any bytes, unicode, or object providing buffer protocol --- python/pyarrow/__init__.py | 3 ++- python/pyarrow/io.pyx | 16 ++++++++++------ python/pyarrow/tests/test_io.py | 29 +++++++++++++++++++++++------ 3 files changed, 35 insertions(+), 13 deletions(-) diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index c6f0be04e8d0d..dce438910151b 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -43,7 +43,8 @@ from pyarrow.filesystem import Filesystem, HdfsClient, LocalFilesystem from pyarrow.io import (HdfsFile, NativeFile, PythonFileInterface, - Buffer, InMemoryOutputStream, BufferReader) + Buffer, InMemoryOutputStream, BufferReader, + frombuffer) from pyarrow.ipc import FileReader, FileWriter, StreamReader, StreamWriter diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index d528bdc495208..d64427aa36ef5 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -126,14 +126,18 @@ cdef class NativeFile: def write(self, data): """ - Write bytes-like (unicode, encoded to UTF-8) to file + Write byte from any object implementing buffer protocol (bytes, + bytearray, ndarray, pyarrow.Buffer) """ self._assert_writeable() - data = tobytes(data) + if isinstance(data, six.string_types): + data = tobytes(data) - cdef const uint8_t* buf = cp.PyBytes_AS_STRING(data) - cdef int64_t bufsize = len(data) + cdef Buffer arrow_buffer = frombuffer(data) + + cdef const uint8_t* buf = arrow_buffer.buffer.get().data() + cdef int64_t bufsize = len(arrow_buffer) with nogil: check_status(self.wr_file.get().Write(buf, bufsize)) @@ -505,7 +509,7 @@ cdef class BufferReader(NativeFile): if isinstance(obj, Buffer): self.buffer = obj else: - self.buffer = build_arrow_buffer(obj) + self.buffer = frombuffer(obj) self.rd_file.reset(new CBufferReader(self.buffer.buffer)) self.is_readable = 1 @@ -513,7 +517,7 @@ cdef class BufferReader(NativeFile): self.is_open = True -def build_arrow_buffer(object obj): +def frombuffer(object obj): """ Construct an Arrow buffer from a Python bytes object """ diff --git a/python/pyarrow/tests/test_io.py b/python/pyarrow/tests/test_io.py index 9cd15c4a76cef..15c5e6b924385 100644 --- a/python/pyarrow/tests/test_io.py +++ b/python/pyarrow/tests/test_io.py @@ -23,6 +23,7 @@ import numpy as np from pyarrow.compat import u, guid +import pyarrow as pa import pyarrow.io as io # ---------------------------------------------------------------------- @@ -127,28 +128,29 @@ def get_buffer(): def test_buffer_bytes(): val = b'some data' - buf = io.build_arrow_buffer(val) + buf = pa.frombuffer(val) assert isinstance(buf, io.Buffer) result = buf.to_pybytes() assert result == val + def test_buffer_memoryview(): val = b'some data' - buf = io.build_arrow_buffer(val) + buf = pa.frombuffer(val) assert isinstance(buf, io.Buffer) result = memoryview(buf) assert result == val + def test_buffer_bytearray(): val = bytearray(b'some data') - - buf = io.build_arrow_buffer(val) + buf = pa.frombuffer(val) assert isinstance(buf, io.Buffer) result = bytearray(buf) @@ -159,7 +161,7 @@ def test_buffer_bytearray(): def test_buffer_memoryview_is_immutable(): val = b'some data' - buf = io.build_arrow_buffer(val) + buf = pa.frombuffer(val) assert isinstance(buf, io.Buffer) result = memoryview(buf) @@ -198,21 +200,36 @@ def test_inmemory_write_after_closed(): with pytest.raises(IOError): f.write(b'not ok') + def test_buffer_protocol_ref_counting(): import gc def make_buffer(bytes_obj): - return bytearray(io.build_arrow_buffer(bytes_obj)) + 
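+        # Added note (editor's commentary, not part of the original patch):
+        # pa.frombuffer wraps bytes_obj zero-copy and holds a reference to it,
+        # while bytearray() copies the data out, so the gc.collect() check
+        # below exercises releasing the temporary Buffer and its reference.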
return bytearray(pa.frombuffer(bytes_obj)) buf = make_buffer(b'foo') gc.collect() assert buf == b'foo' +def test_nativefile_write_memoryview(): + f = io.InMemoryOutputStream() + data = b'ok' + + arr = np.frombuffer(data, dtype='S1') + + f.write(arr) + f.write(bytearray(data)) + + buf = f.get_result() + + assert buf.to_pybytes() == data * 2 + # ---------------------------------------------------------------------- # OS files and memory maps + @pytest.fixture def sample_disk_data(request): SIZE = 4096 From 4915ecf1e1dba625d916604d30f2575e4ddb6439 Mon Sep 17 00:00:00 2001 From: Phillip Cloud Date: Thu, 30 Mar 2017 19:12:49 -0400 Subject: [PATCH 0439/1644] ARROW-632: [Python] Add support for FixedWidthBinary type Author: Phillip Cloud Closes #461 from cpcloud/ARROW-632 and squashes the following commits: 134644a [Phillip Cloud] ARROW-632: [Python] Add support for FixedWidthBinary type --- .gitignore | 3 + cpp/src/arrow/builder.cc | 1 + cpp/src/arrow/ipc/ipc-read-write-benchmark.cc | 4 +- cpp/src/arrow/ipc/reader.cc | 2 +- cpp/src/arrow/python/builtin_convert.cc | 101 +++++++++++--- cpp/src/arrow/python/builtin_convert.h | 17 ++- cpp/src/arrow/python/pandas_convert.cc | 131 +++++++++++++++--- cpp/src/arrow/util/logging.h | 7 +- python/pyarrow/__init__.py | 5 +- python/pyarrow/array.pxd | 8 ++ python/pyarrow/array.pyx | 16 ++- python/pyarrow/includes/libarrow.pxd | 8 ++ python/pyarrow/includes/pyarrow.pxd | 3 + python/pyarrow/scalar.pxd | 5 + python/pyarrow/scalar.pyx | 19 ++- python/pyarrow/schema.pxd | 6 + python/pyarrow/schema.pyx | 42 ++++-- python/pyarrow/tests/test_convert_builtin.py | 13 ++ python/pyarrow/tests/test_convert_pandas.py | 17 +++ python/pyarrow/tests/test_scalars.py | 14 ++ 20 files changed, 367 insertions(+), 55 deletions(-) diff --git a/.gitignore b/.gitignore index a00cbba065a03..5e28b3685e465 100644 --- a/.gitignore +++ b/.gitignore @@ -24,3 +24,6 @@ *.dylib .build_cache_dir MANIFEST + +cpp/.idea/ +python/.eggs/ \ No newline at end of file diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index 52a785d086117..82b62146b0f98 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -542,6 +542,7 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, BUILDER_CASE(DOUBLE, DoubleBuilder); BUILDER_CASE(STRING, StringBuilder); BUILDER_CASE(BINARY, BinaryBuilder); + BUILDER_CASE(FIXED_WIDTH_BINARY, FixedWidthBinaryBuilder); case Type::LIST: { std::shared_ptr value_builder; std::shared_ptr value_type = diff --git a/cpp/src/arrow/ipc/ipc-read-write-benchmark.cc b/cpp/src/arrow/ipc/ipc-read-write-benchmark.cc index 1aecdbc633190..b385929d8b10a 100644 --- a/cpp/src/arrow/ipc/ipc-read-write-benchmark.cc +++ b/cpp/src/arrow/ipc/ipc-read-write-benchmark.cc @@ -80,7 +80,7 @@ static void BM_WriteRecordBatch(benchmark::State& state) { // NOLINT non-const int32_t metadata_length; int64_t body_length; if (!ipc::WriteRecordBatch(*record_batch, 0, &stream, &metadata_length, &body_length, - default_memory_pool()) + default_memory_pool()) .ok()) { state.SkipWithError("Failed to write!"); } @@ -101,7 +101,7 @@ static void BM_ReadRecordBatch(benchmark::State& state) { // NOLINT non-const r int32_t metadata_length; int64_t body_length; if (!ipc::WriteRecordBatch(*record_batch, 0, &stream, &metadata_length, &body_length, - default_memory_pool()) + default_memory_pool()) .ok()) { state.SkipWithError("Failed to write!"); } diff --git a/cpp/src/arrow/ipc/reader.cc b/cpp/src/arrow/ipc/reader.cc index b47b773192774..00ea20cf5dfb1 100644 --- 
a/cpp/src/arrow/ipc/reader.cc +++ b/cpp/src/arrow/ipc/reader.cc @@ -32,8 +32,8 @@ #include "arrow/ipc/util.h" #include "arrow/status.h" #include "arrow/table.h" -#include "arrow/type.h" #include "arrow/tensor.h" +#include "arrow/type.h" #include "arrow/util/logging.h" namespace arrow { diff --git a/cpp/src/arrow/python/builtin_convert.cc b/cpp/src/arrow/python/builtin_convert.cc index 6e59845dea76a..72e86774fcca7 100644 --- a/cpp/src/arrow/python/builtin_convert.cc +++ b/cpp/src/arrow/python/builtin_convert.cc @@ -23,6 +23,7 @@ #include "arrow/api.h" #include "arrow/status.h" +#include "arrow/util/logging.h" #include "arrow/python/helpers.h" #include "arrow/python/util/datetime.h" @@ -200,18 +201,25 @@ class SeqVisitor { int nesting_histogram_[MAX_NESTING_LEVELS]; }; -// Non-exhaustive type inference -Status InferArrowType(PyObject* obj, int64_t* size, std::shared_ptr* out_type) { - *size = PySequence_Size(obj); +Status InferArrowSize(PyObject* obj, int64_t* size) { + *size = static_cast(PySequence_Size(obj)); if (PyErr_Occurred()) { // Not a sequence PyErr_Clear(); return Status::TypeError("Object is not a sequence"); } + return Status::OK(); +} + +// Non-exhaustive type inference +Status InferArrowTypeAndSize( + PyObject* obj, int64_t* size, std::shared_ptr* out_type) { + RETURN_NOT_OK(InferArrowSize(obj, size)); // For 0-length sequences, refuse to guess if (*size == 0) { *out_type = null(); } + PyDateTime_IMPORT; SeqVisitor seq_visitor; RETURN_NOT_OK(seq_visitor.Visit(obj)); RETURN_NOT_OK(seq_visitor.Validate()); @@ -253,7 +261,7 @@ class TypedConverter : public SeqConverter { class BoolConverter : public TypedConverter { public: Status AppendData(PyObject* seq) override { - Py_ssize_t size = PySequence_Size(seq); + int64_t size = static_cast(PySequence_Size(seq)); RETURN_NOT_OK(typed_builder_->Reserve(size)); for (int64_t i = 0; i < size; ++i) { OwnedRef item(PySequence_GetItem(seq, i)); @@ -275,14 +283,14 @@ class Int64Converter : public TypedConverter { public: Status AppendData(PyObject* seq) override { int64_t val; - Py_ssize_t size = PySequence_Size(seq); + int64_t size = static_cast(PySequence_Size(seq)); RETURN_NOT_OK(typed_builder_->Reserve(size)); for (int64_t i = 0; i < size; ++i) { OwnedRef item(PySequence_GetItem(seq, i)); if (item.obj() == Py_None) { typed_builder_->AppendNull(); } else { - val = PyLong_AsLongLong(item.obj()); + val = static_cast(PyLong_AsLongLong(item.obj())); RETURN_IF_PYERROR(); typed_builder_->Append(val); } @@ -294,7 +302,7 @@ class Int64Converter : public TypedConverter { class DateConverter : public TypedConverter { public: Status AppendData(PyObject* seq) override { - Py_ssize_t size = PySequence_Size(seq); + int64_t size = static_cast(PySequence_Size(seq)); RETURN_NOT_OK(typed_builder_->Reserve(size)); for (int64_t i = 0; i < size; ++i) { OwnedRef item(PySequence_GetItem(seq, i)); @@ -312,7 +320,7 @@ class DateConverter : public TypedConverter { class TimestampConverter : public TypedConverter { public: Status AppendData(PyObject* seq) override { - Py_ssize_t size = PySequence_Size(seq); + int64_t size = static_cast(PySequence_Size(seq)); RETURN_NOT_OK(typed_builder_->Reserve(size)); for (int64_t i = 0; i < size; ++i) { OwnedRef item(PySequence_GetItem(seq, i)); @@ -334,7 +342,8 @@ class TimestampConverter : public TypedConverter { epoch.tm_year = 70; epoch.tm_mday = 1; // Microseconds since the epoch - int64_t val = lrint(difftime(mktime(&datetime), mktime(&epoch))) * 1000000 + us; + int64_t val = static_cast( + lrint(difftime(mktime(&datetime), 
mktime(&epoch))) * 1000000 + us); typed_builder_->Append(val); } } @@ -346,7 +355,7 @@ class DoubleConverter : public TypedConverter { public: Status AppendData(PyObject* seq) override { double val; - Py_ssize_t size = PySequence_Size(seq); + int64_t size = static_cast(PySequence_Size(seq)); RETURN_NOT_OK(typed_builder_->Reserve(size)); for (int64_t i = 0; i < size; ++i) { OwnedRef item(PySequence_GetItem(seq, i)); @@ -369,7 +378,7 @@ class BytesConverter : public TypedConverter { PyObject* bytes_obj; OwnedRef tmp; const char* bytes; - int64_t length; + Py_ssize_t length; Py_ssize_t size = PySequence_Size(seq); for (int64_t i = 0; i < size; ++i) { item = PySequence_GetItem(seq, i); @@ -385,7 +394,8 @@ class BytesConverter : public TypedConverter { } else if (PyBytes_Check(item)) { bytes_obj = item; } else { - return Status::TypeError("Non-string value encountered"); + return Status::TypeError( + "Value that cannot be converted to bytes was encountered"); } // No error checking length = PyBytes_GET_SIZE(bytes_obj); @@ -396,6 +406,41 @@ class BytesConverter : public TypedConverter { } }; +class FixedWidthBytesConverter : public TypedConverter { + public: + Status AppendData(PyObject* seq) override { + PyObject* item; + PyObject* bytes_obj; + OwnedRef tmp; + Py_ssize_t expected_length = std::dynamic_pointer_cast( + typed_builder_->type())->byte_width(); + Py_ssize_t size = PySequence_Size(seq); + for (int64_t i = 0; i < size; ++i) { + item = PySequence_GetItem(seq, i); + OwnedRef holder(item); + + if (item == Py_None) { + RETURN_NOT_OK(typed_builder_->AppendNull()); + continue; + } else if (PyUnicode_Check(item)) { + tmp.reset(PyUnicode_AsUTF8String(item)); + RETURN_IF_PYERROR(); + bytes_obj = tmp.obj(); + } else if (PyBytes_Check(item)) { + bytes_obj = item; + } else { + return Status::TypeError( + "Value that cannot be converted to bytes was encountered"); + } + // No error checking + RETURN_NOT_OK(CheckPythonBytesAreFixedLength(bytes_obj, expected_length)); + RETURN_NOT_OK(typed_builder_->Append( + reinterpret_cast(PyBytes_AS_STRING(bytes_obj)))); + } + return Status::OK(); + } +}; + class UTF8Converter : public TypedConverter { public: Status AppendData(PyObject* seq) override { @@ -403,7 +448,7 @@ class UTF8Converter : public TypedConverter { PyObject* bytes_obj; OwnedRef tmp; const char* bytes; - int64_t length; + Py_ssize_t length; Py_ssize_t size = PySequence_Size(seq); for (int64_t i = 0; i < size; ++i) { item = PySequence_GetItem(seq, i); @@ -465,6 +510,8 @@ std::shared_ptr GetConverter(const std::shared_ptr& type return std::make_shared(); case Type::BINARY: return std::make_shared(); + case Type::FIXED_WIDTH_BINARY: + return std::make_shared(); case Type::STRING: return std::make_shared(); case Type::LIST: @@ -472,7 +519,6 @@ std::shared_ptr GetConverter(const std::shared_ptr& type case Type::STRUCT: default: return nullptr; - break; } } @@ -492,6 +538,7 @@ Status ListConverter::Init(const std::shared_ptr& builder) { Status AppendPySequence(PyObject* obj, const std::shared_ptr& type, const std::shared_ptr& builder) { + PyDateTime_IMPORT; std::shared_ptr converter = GetConverter(type); if (converter == nullptr) { std::stringstream ss; @@ -506,9 +553,12 @@ Status AppendPySequence(PyObject* obj, const std::shared_ptr& type, Status ConvertPySequence(PyObject* obj, MemoryPool* pool, std::shared_ptr* out) { std::shared_ptr type; int64_t size; - PyDateTime_IMPORT; - RETURN_NOT_OK(InferArrowType(obj, &size, &type)); + RETURN_NOT_OK(InferArrowTypeAndSize(obj, &size, &type)); + return 
ConvertPySequence(obj, pool, out, type, size); +} +Status ConvertPySequence(PyObject* obj, MemoryPool* pool, std::shared_ptr* out, + const std::shared_ptr& type, int64_t size) { // Handle NA / NullType case if (type->type == Type::NA) { out->reset(new NullArray(size)); @@ -519,9 +569,26 @@ Status ConvertPySequence(PyObject* obj, MemoryPool* pool, std::shared_ptr std::shared_ptr builder; RETURN_NOT_OK(MakeBuilder(pool, type, &builder)); RETURN_NOT_OK(AppendPySequence(obj, type, builder)); - return builder->Finish(out); } +Status ConvertPySequence(PyObject* obj, MemoryPool* pool, std::shared_ptr* out, + const std::shared_ptr& type) { + int64_t size; + RETURN_NOT_OK(InferArrowSize(obj, &size)); + return ConvertPySequence(obj, pool, out, type, size); +} + +Status CheckPythonBytesAreFixedLength(PyObject* obj, Py_ssize_t expected_length) { + const Py_ssize_t length = PyBytes_GET_SIZE(obj); + if (length != expected_length) { + std::stringstream ss; + ss << "Found byte string of length " << length << ", expected length is " + << expected_length; + return Status::TypeError(ss.str()); + } + return Status::OK(); +} + } // namespace py } // namespace arrow diff --git a/cpp/src/arrow/python/builtin_convert.h b/cpp/src/arrow/python/builtin_convert.h index 7b50990dd557c..00ff0fd8236fc 100644 --- a/cpp/src/arrow/python/builtin_convert.h +++ b/cpp/src/arrow/python/builtin_convert.h @@ -38,16 +38,31 @@ class Status; namespace py { -ARROW_EXPORT arrow::Status InferArrowType( +ARROW_EXPORT arrow::Status InferArrowTypeAndSize( PyObject* obj, int64_t* size, std::shared_ptr* out_type); +ARROW_EXPORT arrow::Status InferArrowSize(PyObject* obj, int64_t* size); ARROW_EXPORT arrow::Status AppendPySequence(PyObject* obj, const std::shared_ptr& type, const std::shared_ptr& builder); +// Type and size inference ARROW_EXPORT Status ConvertPySequence(PyObject* obj, MemoryPool* pool, std::shared_ptr* out); +// Size inference +ARROW_EXPORT +Status ConvertPySequence(PyObject* obj, MemoryPool* pool, std::shared_ptr* out, + const std::shared_ptr& type); + +// No inference +ARROW_EXPORT +Status ConvertPySequence(PyObject* obj, MemoryPool* pool, std::shared_ptr* out, + const std::shared_ptr& type, int64_t size); + +ARROW_EXPORT Status CheckPythonBytesAreFixedLength( + PyObject* obj, Py_ssize_t expected_length); + } // namespace py } // namespace arrow diff --git a/cpp/src/arrow/python/pandas_convert.cc b/cpp/src/arrow/python/pandas_convert.cc index db2e90eb8b0ff..68a8d7d7afcf5 100644 --- a/cpp/src/arrow/python/pandas_convert.cc +++ b/cpp/src/arrow/python/pandas_convert.cc @@ -147,8 +147,8 @@ Status CheckFlatNumpyArray(PyArrayObject* numpy_array, int np_type) { return Status::OK(); } -Status AppendObjectStrings(StringBuilder& string_builder, PyObject** objects, - int64_t objects_length, bool* have_bytes) { +Status AppendObjectStrings(int64_t objects_length, StringBuilder* builder, + PyObject** objects, bool* have_bytes) { PyObject* obj; for (int64_t i = 0; i < objects_length; ++i) { @@ -160,15 +160,45 @@ Status AppendObjectStrings(StringBuilder& string_builder, PyObject** objects, return Status::TypeError("failed converting unicode to UTF8"); } const int32_t length = static_cast(PyBytes_GET_SIZE(obj)); - Status s = string_builder.Append(PyBytes_AS_STRING(obj), length); + Status s = builder->Append(PyBytes_AS_STRING(obj), length); Py_DECREF(obj); if (!s.ok()) { return s; } } else if (PyBytes_Check(obj)) { *have_bytes = true; const int32_t length = static_cast(PyBytes_GET_SIZE(obj)); - 
RETURN_NOT_OK(string_builder.Append(PyBytes_AS_STRING(obj), length)); + RETURN_NOT_OK(builder->Append(PyBytes_AS_STRING(obj), length)); } else { - string_builder.AppendNull(); + builder->AppendNull(); + } + } + + return Status::OK(); +} + +static Status AppendObjectFixedWidthBytes(int64_t objects_length, int byte_width, + FixedWidthBinaryBuilder* builder, PyObject** objects) { + PyObject* obj; + + for (int64_t i = 0; i < objects_length; ++i) { + obj = objects[i]; + if (PyUnicode_Check(obj)) { + obj = PyUnicode_AsUTF8String(obj); + if (obj == NULL) { + PyErr_Clear(); + return Status::TypeError("failed converting unicode to UTF8"); + } + + RETURN_NOT_OK(CheckPythonBytesAreFixedLength(obj, byte_width)); + Status s = + builder->Append(reinterpret_cast(PyBytes_AS_STRING(obj))); + Py_DECREF(obj); + RETURN_NOT_OK(s); + } else if (PyBytes_Check(obj)) { + RETURN_NOT_OK(CheckPythonBytesAreFixedLength(obj, byte_width)); + RETURN_NOT_OK( + builder->Append(reinterpret_cast(PyBytes_AS_STRING(obj)))); + } else { + builder->AppendNull(); } } @@ -192,6 +222,13 @@ struct WrapBytes { } }; +template <> +struct WrapBytes { + static inline PyObject* Wrap(const uint8_t* data, int64_t length) { + return PyBytes_FromStringAndSize(reinterpret_cast(data), length); + } +}; + static inline bool ListTypeSupported(const Type::type type_id) { switch (type_id) { case Type::UINT8: @@ -226,7 +263,7 @@ class PandasConverter : public TypeVisitor { arr_(reinterpret_cast(ao)), mask_(nullptr) { if (mo != nullptr && mo != Py_None) { mask_ = reinterpret_cast(mo); } - length_ = PyArray_SIZE(arr_); + length_ = static_cast(PyArray_SIZE(arr_)); } bool is_strided() const { @@ -241,7 +278,7 @@ class PandasConverter : public TypeVisitor { RETURN_NOT_OK(null_bitmap_->Resize(null_bytes)); null_bitmap_data_ = null_bitmap_->mutable_data(); - memset(null_bitmap_data_, 0, null_bytes); + memset(null_bitmap_data_, 0, static_cast(null_bytes)); return Status::OK(); } @@ -321,6 +358,8 @@ class PandasConverter : public TypeVisitor { const std::shared_ptr& type, std::shared_ptr* out); Status ConvertObjectStrings(std::shared_ptr* out); + Status ConvertObjectFixedWidthBytes( + const std::shared_ptr& type, std::shared_ptr* out); Status ConvertBooleans(std::shared_ptr* out); Status ConvertDates(std::shared_ptr* out); Status ConvertLists(const std::shared_ptr& type, std::shared_ptr* out); @@ -402,13 +441,13 @@ Status PandasConverter::ConvertObjectStrings(std::shared_ptr* out) { // and unicode mixed in the object array PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); - StringBuilder string_builder(pool_); - RETURN_NOT_OK(string_builder.Resize(length_)); + StringBuilder builder(pool_); + RETURN_NOT_OK(builder.Resize(length_)); Status s; bool have_bytes = false; - RETURN_NOT_OK(AppendObjectStrings(string_builder, objects, length_, &have_bytes)); - RETURN_NOT_OK(string_builder.Finish(out)); + RETURN_NOT_OK(AppendObjectStrings(length_, &builder, objects, &have_bytes)); + RETURN_NOT_OK(builder.Finish(out)); if (have_bytes) { const auto& arr = static_cast(*out->get()); @@ -418,6 +457,20 @@ Status PandasConverter::ConvertObjectStrings(std::shared_ptr* out) { return Status::OK(); } +Status PandasConverter::ConvertObjectFixedWidthBytes( + const std::shared_ptr& type, std::shared_ptr* out) { + PyAcquireGIL lock; + + PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + FixedWidthBinaryBuilder builder(pool_, type); + RETURN_NOT_OK(builder.Resize(length_)); + RETURN_NOT_OK(AppendObjectFixedWidthBytes(length_, + 
std::dynamic_pointer_cast(builder.type())->byte_width(), + &builder, objects)); + RETURN_NOT_OK(builder.Finish(out)); + return Status::OK(); +} + Status PandasConverter::ConvertBooleans(std::shared_ptr* out) { PyAcquireGIL lock; @@ -474,6 +527,8 @@ Status PandasConverter::ConvertObjects(std::shared_ptr* out) { switch (type_->type) { case Type::STRING: return ConvertObjectStrings(out); + case Type::FIXED_WIDTH_BINARY: + return ConvertObjectFixedWidthBytes(type_, out); case Type::BOOL: return ConvertBooleans(out); case Type::DATE64: @@ -543,7 +598,7 @@ inline Status PandasConverter::ConvertTypedLists( int64_t size; std::shared_ptr inferred_type; RETURN_NOT_OK(list_builder.Append(true)); - RETURN_NOT_OK(InferArrowType(objects[i], &size, &inferred_type)); + RETURN_NOT_OK(InferArrowTypeAndSize(objects[i], &size, &inferred_type)); if (inferred_type->type != type->type) { std::stringstream ss; ss << inferred_type->ToString() << " cannot be converted to " << type->ToString(); @@ -577,14 +632,14 @@ inline Status PandasConverter::ConvertTypedLists( // TODO(uwe): Support more complex numpy array structures RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, NPY_OBJECT)); - int64_t size = PyArray_DIM(numpy_array, 0); + int64_t size = static_cast(PyArray_DIM(numpy_array, 0)); auto data = reinterpret_cast(PyArray_DATA(numpy_array)); - RETURN_NOT_OK(AppendObjectStrings(*value_builder.get(), data, size, &have_bytes)); + RETURN_NOT_OK(AppendObjectStrings(size, value_builder.get(), data, &have_bytes)); } else if (PyList_Check(objects[i])) { int64_t size; std::shared_ptr inferred_type; RETURN_NOT_OK(list_builder.Append(true)); - RETURN_NOT_OK(InferArrowType(objects[i], &size, &inferred_type)); + RETURN_NOT_OK(InferArrowTypeAndSize(objects[i], &size, &inferred_type)); if (inferred_type->type != Type::STRING) { std::stringstream ss; ss << inferred_type->ToString() << " cannot be converted to STRING."; @@ -832,7 +887,7 @@ inline void ConvertIntegerWithNulls(const ChunkedArray& data, double* out_values // Upcast to double, set NaN as appropriate for (int i = 0; i < arr->length(); ++i) { - *out_values++ = prim_arr->IsNull(i) ? NAN : in_values[i]; + *out_values++ = prim_arr->IsNull(i) ? 
NAN : static_cast(in_values[i]); } } } @@ -924,6 +979,36 @@ inline Status ConvertBinaryLike(const ChunkedArray& data, PyObject** out_values) return Status::OK(); } +inline Status ConvertFixedWidthBinary(const ChunkedArray& data, PyObject** out_values) { + PyAcquireGIL lock; + for (int c = 0; c < data.num_chunks(); c++) { + auto arr = static_cast(data.chunk(c).get()); + + const uint8_t* data_ptr; + int32_t length = + std::dynamic_pointer_cast(arr->type())->byte_width(); + const bool has_nulls = data.null_count() > 0; + for (int64_t i = 0; i < arr->length(); ++i) { + if (has_nulls && arr->IsNull(i)) { + Py_INCREF(Py_None); + *out_values = Py_None; + } else { + data_ptr = arr->GetValue(i); + *out_values = WrapBytes::Wrap(data_ptr, length); + if (*out_values == nullptr) { + PyErr_Clear(); + std::stringstream ss; + ss << "Wrapping " + << std::string(reinterpret_cast(data_ptr), length) << " failed"; + return Status::UnknownError(ss.str()); + } + } + ++out_values; + } + } + return Status::OK(); +} + template inline Status ConvertListsLike( const std::shared_ptr& col, PyObject** out_values) { @@ -1058,6 +1143,8 @@ class ObjectBlock : public PandasBlock { RETURN_NOT_OK(ConvertBinaryLike(data, out_buffer)); } else if (type == Type::STRING) { RETURN_NOT_OK(ConvertBinaryLike(data, out_buffer)); + } else if (type == Type::FIXED_WIDTH_BINARY) { + RETURN_NOT_OK(ConvertFixedWidthBinary(data, out_buffer)); } else if (type == Type::LIST) { auto list_type = std::static_pointer_cast(col->type()); switch (list_type->value_type()->type) { @@ -1487,6 +1574,7 @@ class DataFrameBlockCreator { break; case Type::STRING: case Type::BINARY: + case Type::FIXED_WIDTH_BINARY: output_type = PandasBlock::OBJECT; break; case Type::DATE64: @@ -1751,6 +1839,7 @@ class ArrowDeserializer { CONVERT_CASE(DOUBLE); CONVERT_CASE(BINARY); CONVERT_CASE(STRING); + CONVERT_CASE(FIXED_WIDTH_BINARY); CONVERT_CASE(DATE64); CONVERT_CASE(TIMESTAMP); CONVERT_CASE(DICTIONARY); @@ -1845,6 +1934,7 @@ class ArrowDeserializer { return ConvertBinaryLike(data_, out_values); } + // Binary strings template inline typename std::enable_if::type ConvertValues() { RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); @@ -1852,6 +1942,15 @@ class ArrowDeserializer { return ConvertBinaryLike(data_, out_values); } + // Fixed length binary strings + template + inline typename std::enable_if::type + ConvertValues() { + RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); + auto out_values = reinterpret_cast(PyArray_DATA(arr_)); + return ConvertFixedWidthBinary(data_, out_values); + } + #define CONVERTVALUES_LISTSLIKE_CASE(ArrowType, ArrowEnum) \ case Type::ArrowEnum: \ return ConvertListsLike(col_, out_values); diff --git a/cpp/src/arrow/util/logging.h b/cpp/src/arrow/util/logging.h index b22f07dd6345f..697d47c541003 100644 --- a/cpp/src/arrow/util/logging.h +++ b/cpp/src/arrow/util/logging.h @@ -38,9 +38,10 @@ namespace arrow { #define ARROW_LOG_INTERNAL(level) ::arrow::internal::CerrLog(level) #define ARROW_LOG(level) ARROW_LOG_INTERNAL(ARROW_##level) -#define ARROW_CHECK(condition) \ - (condition) ? 0 : ::arrow::internal::FatalLog(ARROW_FATAL) \ - << __FILE__ << __LINE__ << " Check failed: " #condition " " +#define ARROW_CHECK(condition) \ + (condition) ? 
0 \ + : ::arrow::internal::FatalLog(ARROW_FATAL) \ + << __FILE__ << __LINE__ << " Check failed: " #condition " " #ifdef NDEBUG #define ARROW_DFATAL ARROW_WARNING diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index dce438910151b..66b6038617944 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -55,7 +55,7 @@ Int8Value, Int16Value, Int32Value, Int64Value, UInt8Value, UInt16Value, UInt32Value, UInt64Value, FloatValue, DoubleValue, ListValue, - BinaryValue, StringValue) + BinaryValue, StringValue, FixedWidthBinaryValue) import pyarrow.schema as _schema @@ -65,7 +65,8 @@ timestamp, date32, date64, float_, double, binary, string, list_, struct, dictionary, field, - DataType, Field, Schema, schema) + DataType, FixedWidthBinaryType, + Field, Schema, schema) from pyarrow.table import Column, RecordBatch, Table, concat_tables diff --git a/python/pyarrow/array.pxd b/python/pyarrow/array.pxd index c3e7997aa823c..a7241c6a47e31 100644 --- a/python/pyarrow/array.pxd +++ b/python/pyarrow/array.pxd @@ -24,9 +24,11 @@ from pyarrow.schema cimport DataType from cpython cimport PyObject + cdef extern from "Python.h": int PySlice_Check(object) + cdef class Array: cdef: shared_ptr[CArray] sp_array @@ -38,6 +40,7 @@ cdef class Array: cdef init(self, const shared_ptr[CArray]& sp_array) cdef getitem(self, int64_t i) + cdef object box_array(const shared_ptr[CArray]& sp_array) @@ -52,6 +55,7 @@ cdef class NumericArray(Array): cdef class IntegerArray(NumericArray): pass + cdef class FloatingPointArray(NumericArray): pass @@ -96,6 +100,10 @@ cdef class DoubleArray(FloatingPointArray): pass +cdef class FixedWidthBinaryArray(Array): + pass + + cdef class ListArray(Array): pass diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 6cae1966cb16e..289baf2993081 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -37,6 +37,7 @@ cimport pyarrow.scalar as scalar from pyarrow.scalar import NA from pyarrow.schema cimport (DataType, Field, Schema, DictionaryType, + FixedWidthBinaryType, box_data_type) import pyarrow.schema as schema @@ -197,7 +198,11 @@ cdef class Array: if type is None: check_status(pyarrow.ConvertPySequence(list_obj, pool, &sp_array)) else: - raise NotImplementedError() + check_status( + pyarrow.ConvertPySequence( + list_obj, pool, &sp_array, type.sp_type + ) + ) return box_array(sp_array) @@ -385,6 +390,7 @@ cdef class Date64Array(NumericArray): cdef class TimestampArray(NumericArray): pass + cdef class Time32Array(NumericArray): pass @@ -392,6 +398,7 @@ cdef class Time32Array(NumericArray): cdef class Time64Array(NumericArray): pass + cdef class FloatArray(FloatingPointArray): pass @@ -400,6 +407,10 @@ cdef class DoubleArray(FloatingPointArray): pass +cdef class FixedWidthBinaryArray(Array): + pass + + cdef class ListArray(Array): pass @@ -506,7 +517,8 @@ cdef dict _array_classes = { Type_LIST: ListArray, Type_BINARY: BinaryArray, Type_STRING: StringArray, - Type_DICTIONARY: DictionaryArray + Type_DICTIONARY: DictionaryArray, + Type_FIXED_WIDTH_BINARY: FixedWidthBinaryArray, } cdef object box_array(const shared_ptr[CArray]& sp_array): diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 8e428b40b8f8b..b44ade5298eb3 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -45,6 +45,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: Type_TIME64" arrow::Type::TIME64" Type_BINARY" arrow::Type::BINARY" Type_STRING" arrow::Type::STRING" + 
Type_FIXED_WIDTH_BINARY" arrow::Type::FIXED_WIDTH_BINARY" Type_LIST" arrow::Type::LIST" Type_STRUCT" arrow::Type::STRUCT" @@ -139,6 +140,10 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass CStringType" arrow::StringType"(CDataType): pass + cdef cppclass CFixedWidthBinaryType" arrow::FixedWidthBinaryType"(CFixedWidthType): + CFixedWidthBinaryType(int byte_width) + int byte_width() + cdef cppclass CField" arrow::Field": c_string name shared_ptr[CDataType] type @@ -203,6 +208,9 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass CDoubleArray" arrow::DoubleArray"(CArray): double Value(int i) + cdef cppclass CFixedWidthBinaryArray" arrow::FixedWidthBinaryArray"(CArray): + const uint8_t* GetValue(int i) + cdef cppclass CListArray" arrow::ListArray"(CArray): const int32_t* raw_value_offsets() int32_t value_offset(int i) diff --git a/python/pyarrow/includes/pyarrow.pxd b/python/pyarrow/includes/pyarrow.pxd index c3fdf4b070ee0..8142c1c06ff75 100644 --- a/python/pyarrow/includes/pyarrow.pxd +++ b/python/pyarrow/includes/pyarrow.pxd @@ -30,6 +30,9 @@ cdef extern from "arrow/python/api.h" namespace "arrow::py" nogil: shared_ptr[CDataType] GetTimestampType(TimeUnit unit) CStatus ConvertPySequence(object obj, CMemoryPool* pool, shared_ptr[CArray]* out) + CStatus ConvertPySequence(object obj, CMemoryPool* pool, + shared_ptr[CArray]* out, + const shared_ptr[CDataType]& type) CStatus PandasDtypeToArrow(object dtype, shared_ptr[CDataType]* type) diff --git a/python/pyarrow/scalar.pxd b/python/pyarrow/scalar.pxd index 551aeb9697bf7..e9cc3cb487cbc 100644 --- a/python/pyarrow/scalar.pxd +++ b/python/pyarrow/scalar.pxd @@ -61,6 +61,11 @@ cdef class ListValue(ArrayValue): cdef class StringValue(ArrayValue): pass + +cdef class FixedWidthBinaryValue(ArrayValue): + pass + + cdef object box_scalar(DataType type, const shared_ptr[CArray]& sp_array, int64_t index) diff --git a/python/pyarrow/scalar.pyx b/python/pyarrow/scalar.pyx index 1b7e67b356a2f..f4a1c9e08eb64 100644 --- a/python/pyarrow/scalar.pyx +++ b/python/pyarrow/scalar.pyx @@ -224,6 +224,22 @@ cdef class ListValue(ArrayValue): return result +cdef class FixedWidthBinaryValue(ArrayValue): + + def as_py(self): + cdef: + CFixedWidthBinaryArray* ap + CFixedWidthBinaryType* ap_type + int32_t length + const char* data + ap = self.sp_array.get() + ap_type = ap.type().get() + length = ap_type.byte_width() + data = ap.GetValue(self.index) + return cp.PyBytes_FromStringAndSize(data, length) + + + cdef dict _scalar_classes = { Type_BOOL: BooleanValue, Type_UINT8: Int8Value, @@ -241,7 +257,8 @@ cdef dict _scalar_classes = { Type_DOUBLE: DoubleValue, Type_LIST: ListValue, Type_BINARY: BinaryValue, - Type_STRING: StringValue + Type_STRING: StringValue, + Type_FIXED_WIDTH_BINARY: FixedWidthBinaryValue, } cdef object box_scalar(DataType type, const shared_ptr[CArray]& sp_array, diff --git a/python/pyarrow/schema.pxd b/python/pyarrow/schema.pxd index 15ee5f19ee5d9..c0c2c709b2744 100644 --- a/python/pyarrow/schema.pxd +++ b/python/pyarrow/schema.pxd @@ -19,6 +19,7 @@ from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport (CDataType, CDictionaryType, CTimestampType, + CFixedWidthBinaryType, CField, CSchema) cdef class DataType: @@ -39,6 +40,11 @@ cdef class TimestampType(DataType): const CTimestampType* ts_type +cdef class FixedWidthBinaryType(DataType): + cdef: + const CFixedWidthBinaryType* fixed_width_binary_type + + cdef class Field: cdef: shared_ptr[CField] sp_field diff --git a/python/pyarrow/schema.pyx 
b/python/pyarrow/schema.pyx index 4f02901cc9a11..532a318840caf 100644 --- a/python/pyarrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ -28,6 +28,7 @@ from pyarrow.compat import frombytes, tobytes from pyarrow.array cimport Array from pyarrow.error cimport check_status from pyarrow.includes.libarrow cimport (CDataType, CStructType, CListType, + CFixedWidthBinaryType, TimeUnit_SECOND, TimeUnit_MILLI, TimeUnit_MICRO, TimeUnit_NANO, Type, TimeUnit) @@ -52,7 +53,7 @@ cdef class DataType: return frombytes(self.type.ToString()) def __repr__(self): - return 'DataType({0})'.format(str(self)) + return '{0.__class__.__name__}({0})'.format(self) def __richcmp__(DataType self, DataType other, int op): if op == cpython.Py_EQ: @@ -69,9 +70,6 @@ cdef class DictionaryType(DataType): DataType.init(self, type) self.dict_type = type.get() - def __repr__(self): - return 'DictionaryType({0})'.format(str(self)) - cdef class TimestampType(DataType): @@ -92,8 +90,17 @@ cdef class TimestampType(DataType): else: return None - def __repr__(self): - return 'TimestampType({0})'.format(str(self)) + +cdef class FixedWidthBinaryType(DataType): + + cdef init(self, const shared_ptr[CDataType]& type): + DataType.init(self, type) + self.fixed_width_binary_type = type.get() + + property byte_width: + + def __get__(self): + return self.fixed_width_binary_type.byte_width() cdef class Field: @@ -348,11 +355,24 @@ def string(): return primitive_type(la.Type_STRING) -def binary(): - """ - Binary (PyBytes-like) type +def binary(int length=-1): + """Binary (PyBytes-like) type + + Parameters + ---------- + length : int, optional, default -1 + If length == -1 then return a variable length binary type. If length is + greater than or equal to 0 then return a fixed width binary type of + width `length`. 
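+
+    Examples
+    --------
+    A minimal usage sketch (illustrative only; assumes the ``byte_width``
+    property added above in this module):
+
+    >>> import pyarrow
+    >>> t = pyarrow.binary(4)   # fixed width: every value is 4 bytes
+    >>> t.byte_width
+    4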
""" - return primitive_type(la.Type_BINARY) + if length == -1: + return primitive_type(la.Type_BINARY) + + cdef FixedWidthBinaryType out = FixedWidthBinaryType() + cdef shared_ptr[CDataType] fixed_width_binary_type + fixed_width_binary_type.reset(new CFixedWidthBinaryType(length)) + out.init(fixed_width_binary_type) + return out def list_(DataType value_type): @@ -408,6 +428,8 @@ cdef DataType box_data_type(const shared_ptr[CDataType]& type): out = DictionaryType() elif type.get().type == la.Type_TIMESTAMP: out = TimestampType() + elif type.get().type == la.Type_FIXED_WIDTH_BINARY: + out = FixedWidthBinaryType() else: out = DataType() diff --git a/python/pyarrow/tests/test_convert_builtin.py b/python/pyarrow/tests/test_convert_builtin.py index 7915f9766bf67..99251250499d2 100644 --- a/python/pyarrow/tests/test_convert_builtin.py +++ b/python/pyarrow/tests/test_convert_builtin.py @@ -92,6 +92,19 @@ def test_bytes(self): assert arr.type == pyarrow.binary() assert arr.to_pylist() == [b'foo', u1, None] + def test_fixed_width_bytes(self): + data = [b'foof', None, b'barb', b'2346'] + arr = pyarrow.from_pylist(data, type=pyarrow.binary(4)) + assert len(arr) == 4 + assert arr.null_count == 1 + assert arr.type == pyarrow.binary(4) + assert arr.to_pylist() == data + + def test_fixed_width_bytes_does_not_accept_varying_lengths(self): + data = [b'foo', None, b'barb', b'2346'] + with self.assertRaises(pyarrow.error.ArrowException): + pyarrow.from_pylist(data, type=pyarrow.binary(4)) + def test_date(self): data = [datetime.date(2000, 1, 1), None, datetime.date(1970, 1, 1), datetime.date(2040, 2, 26)] diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index ea7a892a6f2a4..f7cb47f685590 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -244,6 +244,23 @@ def test_bytes_to_binary(self): expected = pd.DataFrame({'strings': values2}) self._check_pandas_roundtrip(df, expected) + def test_fixed_width_bytes(self): + values = [b'foo', None, b'bar', None, None, b'hey'] + df = pd.DataFrame({'strings': values}) + schema = A.Schema.from_fields([A.field('strings', A.binary(3))]) + table = A.Table.from_pandas(df, schema=schema) + assert table.schema[0].type == schema[0].type + assert table.schema[0].name == schema[0].name + result = table.to_pandas() + tm.assert_frame_equal(result, df) + + def test_fixed_width_bytes_does_not_accept_varying_lengths(self): + values = [b'foo', None, b'ba', None, None, b'hey'] + df = pd.DataFrame({'strings': values}) + schema = A.Schema.from_fields([A.field('strings', A.binary(3))]) + with self.assertRaises(A.error.ArrowException): + A.Table.from_pandas(df, schema=schema) + def test_timestamps_notimezone_no_nulls(self): df = pd.DataFrame({ 'datetime64': np.array([ diff --git a/python/pyarrow/tests/test_scalars.py b/python/pyarrow/tests/test_scalars.py index d56481c06d0f8..265ce8d3a58a1 100644 --- a/python/pyarrow/tests/test_scalars.py +++ b/python/pyarrow/tests/test_scalars.py @@ -87,6 +87,20 @@ def test_bytes(self): assert v == b'bar' assert isinstance(v, bytes) + def test_fixed_width_bytes(self): + data = [b'foof', None, b'barb'] + arr = A.from_pylist(data, type=A.binary(4)) + + v = arr[0] + assert isinstance(v, A.FixedWidthBinaryValue) + assert v.as_py() == b'foof' + + assert arr[1] is A.NA + + v = arr[2].as_py() + assert v == b'barb' + assert isinstance(v, bytes) + def test_list(self): arr = A.from_pylist([['foo', None], None, ['bar'], []]) From 
f5967ed682e63dd752d0120573bb33f42dd56e27 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 31 Mar 2017 10:20:20 -0400 Subject: [PATCH 0440/1644] ARROW-603: [C++] Add RecordBatch::Validate method, call in RecordBatch ctor in debug builds This function will help catch malformed RecordBatch objects during development Author: Wes McKinney Closes #466 from wesm/ARROW-603 and squashes the following commits: dfdb048 [Wes McKinney] Fix incorrect clang-tidy name 5a51f69 [Wes McKinney] Add RecordBatch::Validate method, call in RecordBatch ctor in debug builds --- cpp/cmake_modules/FindClangTools.cmake | 29 ++++++---- cpp/src/arrow/io/io-hdfs-test.cc | 7 +-- cpp/src/arrow/table-test.cc | 76 ++++++++++++++++++-------- cpp/src/arrow/table.cc | 25 +++++++++ cpp/src/arrow/table.h | 4 ++ 5 files changed, 104 insertions(+), 37 deletions(-) diff --git a/cpp/cmake_modules/FindClangTools.cmake b/cpp/cmake_modules/FindClangTools.cmake index c07c7d244493e..0e9430ba29195 100644 --- a/cpp/cmake_modules/FindClangTools.cmake +++ b/cpp/cmake_modules/FindClangTools.cmake @@ -27,16 +27,21 @@ # This module defines # CLANG_TIDY_BIN, The path to the clang tidy binary # CLANG_TIDY_FOUND, Whether clang tidy was found -# CLANG_FORMAT_BIN, The path to the clang format binary +# CLANG_FORMAT_BIN, The path to the clang format binary # CLANG_TIDY_FOUND, Whether clang format was found -find_program(CLANG_TIDY_BIN - NAMES clang-tidy-3.8 clang-tidy-3.7 clang-tidy-3.6 clang-tidy - PATHS ${ClangTools_PATH} $ENV{CLANG_TOOLS_PATH} /usr/local/bin /usr/bin +find_program(CLANG_TIDY_BIN + NAMES clang-tidy-4.0 + clang-tidy-3.9 + clang-tidy-3.8 + clang-tidy-3.7 + clang-tidy-3.6 + clang-tidy + PATHS ${ClangTools_PATH} $ENV{CLANG_TOOLS_PATH} /usr/local/bin /usr/bin NO_DEFAULT_PATH ) -if ( "${CLANG_TIDY_BIN}" STREQUAL "CLANG_TIDY_BIN-NOTFOUND" ) +if ( "${CLANG_TIDY_BIN}" STREQUAL "CLANG_TIDY_BIN-NOTFOUND" ) set(CLANG_TIDY_FOUND 0) message("clang-tidy not found") else() @@ -44,17 +49,21 @@ else() message("clang-tidy found at ${CLANG_TIDY_BIN}") endif() -find_program(CLANG_FORMAT_BIN - NAMES clang-format-3.8 clang-format-3.7 clang-format-3.6 clang-format - PATHS ${ClangTools_PATH} $ENV{CLANG_TOOLS_PATH} /usr/local/bin /usr/bin +find_program(CLANG_FORMAT_BIN + NAMES clang-format-4.0 + clang-format-3.9 + clang-format-3.8 + clang-format-3.7 + clang-format-3.6 + clang-format + PATHS ${ClangTools_PATH} $ENV{CLANG_TOOLS_PATH} /usr/local/bin /usr/bin NO_DEFAULT_PATH ) -if ( "${CLANG_FORMAT_BIN}" STREQUAL "CLANG_FORMAT_BIN-NOTFOUND" ) +if ( "${CLANG_FORMAT_BIN}" STREQUAL "CLANG_FORMAT_BIN-NOTFOUND" ) set(CLANG_FORMAT_FOUND 0) message("clang-format not found") else() set(CLANG_FORMAT_FOUND 1) message("clang-format found at ${CLANG_FORMAT_BIN}") endif() - diff --git a/cpp/src/arrow/io/io-hdfs-test.cc b/cpp/src/arrow/io/io-hdfs-test.cc index af59e96a1448f..f3140be0b2dac 100644 --- a/cpp/src/arrow/io/io-hdfs-test.cc +++ b/cpp/src/arrow/io/io-hdfs-test.cc @@ -78,10 +78,9 @@ class TestHdfsClient : public ::testing::Test { LibHdfsShim* driver_shim; client_ = nullptr; - scratch_dir_ = - boost::filesystem::unique_path( - boost::filesystem::temp_directory_path() / "arrow-hdfs/scratch-%%%%") - .string(); + scratch_dir_ = boost::filesystem::unique_path( + boost::filesystem::temp_directory_path() / "arrow-hdfs/scratch-%%%%") + .string(); loaded_driver_ = false; diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc index 38533063cbc07..cd32f4a387290 100644 --- a/cpp/src/arrow/table-test.cc +++ b/cpp/src/arrow/table-test.cc @@ -127,8 +127,8 @@ 
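
A quick aside before the table tests: a minimal sketch of how the Validate() method added by this patch is intended to be called. This is illustrative only; CheckBatch is a hypothetical helper, and the input array a0 is assumed to be an Int32Array of length 10 built elsewhere (for example with the MakePrimitive test helper used below).

    #include "arrow/table.h"
    #include "arrow/type.h"

    // Sketch: Validate() reports a malformed batch up front instead of
    // letting the mismatch surface later.
    arrow::Status CheckBatch(const std::shared_ptr<arrow::Array>& a0) {
      auto f0 = arrow::field("f0", arrow::int32());
      auto schema = std::make_shared<arrow::Schema>(
          std::vector<std::shared_ptr<arrow::Field>>{f0});
      // The claimed row count (5) disagrees with the column length (10),
      // so Validate() returns Status::Invalid describing the mismatch.
      arrow::RecordBatch batch(schema, 5, {a0});
      return batch.Validate();
    }
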
TEST_F(TestColumn, BasicAPI) { arrays.push_back(MakePrimitive(100, 10)); arrays.push_back(MakePrimitive(100, 20)); - auto field = std::make_shared("c0", int32()); - column_.reset(new Column(field, arrays)); + auto f0 = field("c0", int32()); + column_.reset(new Column(f0, arrays)); ASSERT_EQ("c0", column_->name()); ASSERT_TRUE(column_->type()->Equals(int32())); @@ -137,7 +137,7 @@ TEST_F(TestColumn, BasicAPI) { ASSERT_EQ(3, column_->data()->num_chunks()); // nullptr array should not break - column_.reset(new Column(field, std::shared_ptr(nullptr))); + column_.reset(new Column(f0, std::shared_ptr(nullptr))); ASSERT_NE(column_.get(), nullptr); } @@ -146,13 +146,13 @@ TEST_F(TestColumn, ChunksInhomogeneous) { arrays.push_back(MakePrimitive(100)); arrays.push_back(MakePrimitive(100, 10)); - auto field = std::make_shared("c0", int32()); - column_.reset(new Column(field, arrays)); + auto f0 = field("c0", int32()); + column_.reset(new Column(f0, arrays)); ASSERT_OK(column_->ValidateData()); arrays.push_back(MakePrimitive(100, 10)); - column_.reset(new Column(field, arrays)); + column_.reset(new Column(f0, arrays)); ASSERT_RAISES(Invalid, column_->ValidateData()); } @@ -164,8 +164,8 @@ TEST_F(TestColumn, Equals) { arrays_one_.push_back(array); arrays_another_.push_back(array); - one_field_ = std::make_shared("column", int32()); - another_field_ = std::make_shared("column", int32()); + one_field_ = field("column", int32()); + another_field_ = field("column", int32()); Construct(); ASSERT_TRUE(one_col_->Equals(one_col_)); @@ -174,13 +174,13 @@ TEST_F(TestColumn, Equals) { ASSERT_TRUE(one_col_->Equals(*another_col_.get())); // Field is different - another_field_ = std::make_shared("two", int32()); + another_field_ = field("two", int32()); Construct(); ASSERT_FALSE(one_col_->Equals(another_col_)); ASSERT_FALSE(one_col_->Equals(*another_col_.get())); // ChunkedArray is different - another_field_ = std::make_shared("column", int32()); + another_field_ = field("column", int32()); arrays_another_.push_back(array); Construct(); ASSERT_FALSE(one_col_->Equals(another_col_)); @@ -190,9 +190,9 @@ TEST_F(TestColumn, Equals) { class TestTable : public TestBase { public: void MakeExample1(int length) { - auto f0 = std::make_shared("f0", int32()); - auto f1 = std::make_shared("f1", uint8()); - auto f2 = std::make_shared("f2", int16()); + auto f0 = field("f0", int32()); + auto f1 = field("f1", uint8()); + auto f2 = field("f2", int16()); vector> fields = {f0, f1, f2}; schema_ = std::make_shared(fields); @@ -279,9 +279,9 @@ TEST_F(TestTable, Equals) { ASSERT_TRUE(table_->Equals(*table_)); // Differing schema - auto f0 = std::make_shared("f3", int32()); - auto f1 = std::make_shared("f4", uint8()); - auto f2 = std::make_shared("f5", int16()); + auto f0 = field("f3", int32()); + auto f1 = field("f4", uint8()); + auto f2 = field("f5", int16()); vector> fields = {f0, f1, f2}; auto other_schema = std::make_shared(fields); ASSERT_FALSE(table_->Equals(Table(other_schema, columns_))); @@ -389,9 +389,9 @@ class TestRecordBatch : public TestBase {}; TEST_F(TestRecordBatch, Equals) { const int length = 10; - auto f0 = std::make_shared("f0", int32()); - auto f1 = std::make_shared("f1", uint8()); - auto f2 = std::make_shared("f2", int16()); + auto f0 = field("f0", int32()); + auto f1 = field("f1", uint8()); + auto f2 = field("f2", int16()); vector> fields = {f0, f1, f2}; auto schema = std::make_shared(fields); @@ -401,21 +401,51 @@ TEST_F(TestRecordBatch, Equals) { auto a2 = MakePrimitive(length); RecordBatch b1(schema, length, 
{a0, a1, a2}); - RecordBatch b2(schema, 5, {a0, a1, a2}); RecordBatch b3(schema, length, {a0, a1}); RecordBatch b4(schema, length, {a0, a1, a1}); ASSERT_TRUE(b1.Equals(b1)); - ASSERT_FALSE(b1.Equals(b2)); ASSERT_FALSE(b1.Equals(b3)); ASSERT_FALSE(b1.Equals(b4)); } +#ifdef NDEBUG +// In debug builds, RecordBatch ctor aborts if you construct an invalid one + +TEST_F(TestRecordBatch, Validate) { + const int length = 10; + + auto f0 = field("f0", int32()); + auto f1 = field("f1", uint8()); + auto f2 = field("f2", int16()); + + auto schema = std::shared_ptr<Schema>(new Schema({f0, f1, f2})); + + auto a0 = MakePrimitive<Int32Array>(length); + auto a1 = MakePrimitive<UInt8Array>(length); + auto a2 = MakePrimitive<Int16Array>(length); + auto a3 = MakePrimitive<Int16Array>(5); + + RecordBatch b1(schema, length, {a0, a1, a2}); + + ASSERT_OK(b1.Validate()); + + // Length mismatch + RecordBatch b2(schema, length, {a0, a1, a3}); + ASSERT_RAISES(Invalid, b2.Validate()); + + // Type mismatch + RecordBatch b3(schema, length, {a0, a1, a0}); + ASSERT_RAISES(Invalid, b3.Validate()); +} + +#endif + TEST_F(TestRecordBatch, Slice) { const int length = 10; - auto f0 = std::make_shared<Field>("f0", int32()); - auto f1 = std::make_shared<Field>("f1", uint8()); + auto f0 = field("f0", int32()); + auto f1 = field("f1", uint8()); vector<shared_ptr<Field>> fields = {f0, f1}; auto schema = std::make_shared<Schema>(fields); diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc index 8e283f4da9bb7..da61fbb9a6daf 100644 --- a/cpp/src/arrow/table.cc +++ b/cpp/src/arrow/table.cc @@ -139,6 +139,11 @@ Status Column::ValidateData() { // ---------------------------------------------------------------------- // RecordBatch methods +void AssertBatchValid(const RecordBatch& batch) { + Status s = batch.Validate(); + if (!s.ok()) { DCHECK(false) << s.ToString(); } +} + RecordBatch::RecordBatch(const std::shared_ptr<Schema>& schema, int64_t num_rows, const std::vector<std::shared_ptr<Array>>& columns) : schema_(schema), num_rows_(num_rows), columns_(columns) {} @@ -190,6 +195,26 @@ std::shared_ptr<RecordBatch> RecordBatch::Slice(int64_t offset, int64_t length) return std::make_shared<RecordBatch>(schema_, num_rows, arrays); } +Status RecordBatch::Validate() const { + for (int i = 0; i < num_columns(); ++i) { + const Array& arr = *columns_[i]; + if (arr.length() != num_rows_) { + std::stringstream ss; + ss << "Number of rows in column " << i << " did not match batch: " << arr.length() + << " vs " << num_rows_; + return Status::Invalid(ss.str()); + } + const auto& schema_type = *schema_->field(i)->type; + if (!arr.type()->Equals(schema_type)) { + std::stringstream ss; + ss << "Column " << i << " type does not match schema: " << arr.type()->ToString() + << " vs " << schema_type.ToString(); + return Status::Invalid(ss.str()); + } + } + return Status::OK(); +} + // ---------------------------------------------------------------------- // Table methods diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h index 7b739c9a1b314..0f35dd888fe2f 100644 --- a/cpp/src/arrow/table.h +++ b/cpp/src/arrow/table.h @@ -140,6 +140,10 @@ class ARROW_EXPORT RecordBatch { std::shared_ptr<RecordBatch> Slice(int64_t offset); std::shared_ptr<RecordBatch> Slice(int64_t offset, int64_t length); + /// Returns an error status if there is something wrong with the record batch + /// contents, like a schema/array mismatch or inconsistent lengths + Status Validate() const; + private: std::shared_ptr<Schema> schema_; int64_t num_rows_; From 005826f804f7db668e8e165fad45c9c1cd1de0cf Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Fri, 31 Mar 2017 10:27:20 -0400 Subject: [PATCH 0441/1644] ARROW-719: [GLib] Release source archive I don't know about 
this approach, but it should help you consider which approach is better. Author: Kouhei Sutou Closes #448 from kou/glib-release-source-archive and squashes the following commits: 76b0110 [Kouhei Sutou] [GLib] Release source archive --- dev/release/02-source.sh | 39 ++++++++++++++++++++++++++++++++++++++- dev/release/run-rat.sh | 19 +++++++++++++++++++ 2 files changed, 57 insertions(+), 1 deletion(-) mode change 100644 => 100755 dev/release/02-source.sh diff --git a/dev/release/02-source.sh b/dev/release/02-source.sh old mode 100644 new mode 100755 index bdaa5cc9340fe..924b94fd6caa0 --- a/dev/release/02-source.sh +++ b/dev/release/02-source.sh @@ -54,9 +54,46 @@ echo "Using commit $release_hash" tarball=$tag.tar.gz +extract_dir=tmp-apache-arrow +rm -rf $extract_dir # be conservative and use the release hash, even though git produces the same # archive (identical hashes) using the scm tag -git archive $release_hash --prefix $tag/ -o $tarball +git archive $release_hash --prefix $extract_dir/ | tar xf - + +# build Apache Arrow C++ before building Apache Arrow GLib because +# Apache Arrow GLib requires Apache Arrow C++. +mkdir -p $extract_dir/cpp/build +cpp_install_dir=$PWD/$extract_dir/cpp/install +cd $extract_dir/cpp/build +cmake .. \ -DCMAKE_INSTALL_PREFIX=$cpp_install_dir \ -DARROW_BUILD_TESTS=no +make -j8 +make install +cd - + +# build the source archive for Apache Arrow GLib with "make dist". +cd $extract_dir/c_glib +./autogen.sh +./configure \ PKG_CONFIG_PATH=$cpp_install_dir/lib/pkgconfig \ --enable-gtk-doc +LD_LIBRARY_PATH=$cpp_install_dir/lib make -j8 +make dist +tar xzf *.tar.gz +rm *.tar.gz +cd - +rm -rf tmp-c_glib/ +mv $extract_dir/c_glib/apache-arrow-glib-* tmp-c_glib/ +rm -rf $extract_dir + +# replace c_glib/ with the tar.gz generated by "make dist" +rm -rf $tag +git archive $release_hash --prefix $tag/ | tar xf - +rm -rf $tag/c_glib +mv tmp-c_glib $tag/c_glib +tar czf $tarball $tag +rm -rf $tag ${SOURCE_DIR}/run-rat.sh $tarball diff --git a/dev/release/run-rat.sh b/dev/release/run-rat.sh index e26dd589695b1..a3c12a0ce8a92 100755 --- a/dev/release/run-rat.sh +++ b/dev/release/run-rat.sh @@ -40,6 +40,25 @@ $RAT $1 \ -e __init__.pxd \ -e __init__.py \ -e requirements.txt \ + -e version \ + -e "*.m4" \ + -e configure \ + -e config.sub \ + -e config.h.in \ + -e compile \ + -e missing \ + -e install-sh \ + -e config.guess \ + -e depcomp \ + -e ltmain.sh \ + -e arrow-glib.types \ + -e arrow-glib-sections.txt \ + -e arrow-glib-overrides.txt \ + -e gtk-doc.make \ + -e "*.html" \ + -e "*.css" \ + -e "*.png" \ + -e "*.devhelp2" \ > rat.txt cat rat.txt UNAPPROVED=`cat rat.txt | grep "Unknown Licenses" | head -n 1 | cut -d " " -f 1` From ad8a0cfeced7f86e21fcaa63de3e55ce42b8f962 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Fri, 31 Mar 2017 10:30:08 -0400 Subject: [PATCH 0442/1644] ARROW-739: Don't install jemalloc in parallel Alternative fix proposal. I couldn't trigger the failure locally, though. Author: Uwe L. Korn Author: Robert Nishihara Closes #456 from xhochy/ARROW-739 and squashes the following commits: c1cad56 [Robert Nishihara] Replace MAKE -> CMAKE_MAKE_PROGRAM in CMakeLists.txt. f121072 [Uwe L. Korn] Add install to install command e8803b8 [Uwe L. 
Korn] ARROW-739: Don't install jemalloc in parallel --- cpp/CMakeLists.txt | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index aa8ea31b831e3..5dcf58c0f232d 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -637,14 +637,16 @@ if (ARROW_JEMALLOC) URL https://github.com/jemalloc/jemalloc/releases/download/${JEMALLOC_VERSION}/jemalloc-${JEMALLOC_VERSION}.tar.bz2 CONFIGURE_COMMAND ./configure "--prefix=${JEMALLOC_PREFIX}" "--with-jemalloc-prefix=" BUILD_IN_SOURCE 1 - BUILD_COMMAND ${MAKE} - BUILD_BYPRODUCTS "${JEMALLOC_STATIC_LIB}" "${JEMALLOC_SHARED_LIB}") + BUILD_COMMAND ${CMAKE_MAKE_PROGRAM} + BUILD_BYPRODUCTS "${JEMALLOC_STATIC_LIB}" "${JEMALLOC_SHARED_LIB}" + INSTALL_COMMAND ${CMAKE_MAKE_PROGRAM} -j1 install) else() ExternalProject_Add(jemalloc_ep URL https://github.com/jemalloc/jemalloc/releases/download/${JEMALLOC_VERSION}/jemalloc-${JEMALLOC_VERSION}.tar.bz2 CONFIGURE_COMMAND ./configure "--prefix=${JEMALLOC_PREFIX}" "--with-jemalloc-prefix=" BUILD_IN_SOURCE 1 - BUILD_COMMAND ${MAKE}) + BUILD_COMMAND ${CMAKE_MAKE_PROGRAM} + INSTALL_COMMAND ${CMAKE_MAKE_PROGRAM} -j1 install) endif() else() set(JEMALLOC_VENDORED 0) From e5b682760614a2a51e9587afbb4b9b676e59e5a9 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Fri, 31 Mar 2017 12:57:46 -0400 Subject: [PATCH 0443/1644] ARROW-744: [GLib] Re-add an assertion for garrow_table_new() test Author: Kouhei Sutou Closes #469 from kou/glib-re-add-assertion-to-garrow-table-new-test and squashes the following commits: 64c2e50 [Kouhei Sutou] [GLib] Re-add an assertion for garrow_table_new() test --- c_glib/test/test-table.rb | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/c_glib/test/test-table.rb b/c_glib/test/test-table.rb index 0583e8139e47a..e2b71b31e44c0 100644 --- a/c_glib/test/test-table.rb +++ b/c_glib/test/test-table.rb @@ -30,6 +30,25 @@ def test_columns Arrow::Column.new(fields[1], build_boolean_array([false])), ] table = Arrow::Table.new(schema, columns) + + data = table.n_columns.times.collect do |i| + column = table.get_column(i) + values = [] + column.data.chunks.each do |chunk| + chunk.length.times do |j| + values << chunk.get_value(j) + end + end + [ + column.name, + values, + ] + end + assert_equal([ + ["visible", [true]], + ["valid", [false]], + ], + data) end end From 4e77d3382f6cc6450c79b1ebefea0bbd1f2dd379 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Fri, 31 Mar 2017 12:58:55 -0400 Subject: [PATCH 0444/1644] ARROW-746: [GLib] Add garrow_array_get_data_type() Author: Kouhei Sutou Closes #470 from kou/glib-add-garrow-array-get-data-type and squashes the following commits: 3f4de67 [Kouhei Sutou] [GLib] Add garrow_array_get_data_type() --- c_glib/arrow-glib/array.cpp | 15 +++++++++++++++ c_glib/arrow-glib/array.h | 3 ++- c_glib/test/test-array.rb | 6 ++++++ 3 files changed, 23 insertions(+), 1 deletion(-) diff --git a/c_glib/arrow-glib/array.cpp b/c_glib/arrow-glib/array.cpp index 5dacb07ba8710..b084054f9af87 100644 --- a/c_glib/arrow-glib/array.cpp +++ b/c_glib/arrow-glib/array.cpp @@ -24,6 +24,7 @@ #include #include #include +#include #include #include #include @@ -173,6 +174,20 @@ garrow_array_get_n_nulls(GArrowArray *array) return arrow_array->null_count(); } +/** + * garrow_array_get_data_type: + * @array: A #GArrowArray. + * + * Returns: (transfer full): The #GArrowDataType for the array. 
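+ *
+ * A usage sketch (illustrative only; assumes @array was built
+ * elsewhere). Because the return value is (transfer full), the
+ * caller owns the returned reference and must unref it:
+ *
+ * |[
+ * GArrowDataType *data_type = garrow_array_get_data_type(array);
+ * g_object_unref(data_type);
+ * ]|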
+ */ +GArrowDataType * +garrow_array_get_data_type(GArrowArray *array) +{ + auto arrow_array = garrow_array_get_raw(array); + auto arrow_data_type = arrow_array->type(); + return garrow_data_type_new_raw(&arrow_data_type); +} + /** * garrow_array_slice: * @array: A #GArrowArray. diff --git a/c_glib/arrow-glib/array.h b/c_glib/arrow-glib/array.h index 9b1fa7e1e4a31..6467db5ff45db 100644 --- a/c_glib/arrow-glib/array.h +++ b/c_glib/arrow-glib/array.h @@ -19,7 +19,7 @@ #pragma once -#include +#include G_BEGIN_DECLS @@ -60,6 +60,7 @@ GType garrow_array_get_type (void) G_GNUC_CONST; gint64 garrow_array_get_length (GArrowArray *array); gint64 garrow_array_get_offset (GArrowArray *array); gint64 garrow_array_get_n_nulls (GArrowArray *array); +GArrowDataType *garrow_array_get_data_type(GArrowArray *array); GArrowArray *garrow_array_slice (GArrowArray *array, gint64 offset, gint64 length); diff --git a/c_glib/test/test-array.rb b/c_glib/test/test-array.rb index d68827cb85b1d..c427f0200ef02 100644 --- a/c_glib/test/test-array.rb +++ b/c_glib/test/test-array.rb @@ -31,6 +31,12 @@ def test_n_nulls assert_equal(2, array.n_nulls) end + def test_data_type + builder = Arrow::BooleanArrayBuilder.new + array = builder.finish + assert_equal(Arrow::BooleanDataType.new, array.data_type) + end + def test_slice builder = Arrow::BooleanArrayBuilder.new builder.append(true) From 067cd4ebfbd9be9b607658a2a249017cc6db84f9 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 31 Mar 2017 13:00:11 -0400 Subject: [PATCH 0445/1644] ARROW-630: [C++] Create boolean batches for IPC testing, properly account for nonzero offset This fixes a couple of bugs; boolean IPC was not being tested directly like the other types (it was covered implicitly by the integration tests, though) Author: Wes McKinney Closes #460 from wesm/ARROW-630 and squashes the following commits: f9448a7 [Wes McKinney] Create boolean batches for IPC testing, properly account for offset in unloading, comparison --- cpp/src/arrow/compare.cc | 4 +++- cpp/src/arrow/ipc/ipc-read-write-test.cc | 2 +- cpp/src/arrow/ipc/test-common.h | 22 ++++++++++++++++++++++ cpp/src/arrow/ipc/writer.cc | 7 ++++++- 4 files changed, 32 insertions(+), 3 deletions(-) diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc index c2580b4f54109..4cd617e6021df 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -294,13 +294,15 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { Status Visit(const BooleanArray& left) { const auto& right = static_cast<const BooleanArray&>(right_); + if (left.null_count() > 0) { const uint8_t* left_data = left.data()->data(); const uint8_t* right_data = right.data()->data(); for (int64_t i = 0; i < left.length(); ++i) { if (!left.IsNull(i) && - BitUtil::GetBit(left_data, i) != BitUtil::GetBit(right_data, i)) { + BitUtil::GetBit(left_data, i + left.offset()) != + BitUtil::GetBit(right_data, i + right.offset())) { result_ = false; return Status::OK(); } diff --git a/cpp/src/arrow/ipc/ipc-read-write-test.cc b/cpp/src/arrow/ipc/ipc-read-write-test.cc index 74ca017df5cf1..c900d0ba37ed2 100644 --- a/cpp/src/arrow/ipc/ipc-read-write-test.cc +++ b/cpp/src/arrow/ipc/ipc-read-write-test.cc @@ -104,7 +104,7 @@ TEST_F(TestSchemaMetadata, NestedFields) { ::testing::Values(&MakeIntRecordBatch, &MakeListRecordBatch, &MakeNonNullRecordBatch, \ &MakeZeroLengthRecordBatch, &MakeDeeplyNestedList, &MakeStringTypesRecordBatch, \ &MakeStruct, &MakeUnion, &MakeDictionary, &MakeDates, &MakeTimestamps, &MakeTimes, \ - &MakeFWBinary); + &MakeFWBinary, &MakeBooleanBatch);
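
One note bridging the comparison fix above and the writer fix below: boolean values are bit-packed, eight per byte, so a sliced boolean array starts at a bit offset inside a shared byte. Comparisons must therefore read bit (i + offset), and the IPC writer must re-pack the buffer to bit 0 before sending it. A self-contained illustration of the bit addressing, mirroring BitUtil::GetBit from the diff above (the helper name here is hypothetical):

    #include <cstdint>

    // Read bit i of an LSB-ordered bitmap, as BitUtil::GetBit does.
    inline bool GetBitAt(const uint8_t* bits, int64_t i) {
      return (bits[i / 8] >> (i % 8)) & 1;
    }

    // For a slice with offset 3, logical element 0 of the slice lives at
    // parent bit 3, i.e. GetBitAt(parent_data, 0 + 3); handing the parent
    // buffer to IPC unchanged would shift every value by three bits.
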
class IpcTestFixture : public io::MemoryMapFixture { public: diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index 583f909d071e6..134a5caee8ec4 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -138,6 +138,28 @@ Status MakeRandomListArray(const std::shared_ptr<Array>& child_array, int num_li typedef Status MakeRecordBatch(std::shared_ptr<RecordBatch>* out); +Status MakeBooleanBatch(std::shared_ptr<RecordBatch>* out) { + const int length = 1000; + + // Make the schema + auto f0 = field("f0", boolean()); + auto f1 = field("f1", boolean()); + std::shared_ptr<Schema> schema(new Schema({f0, f1})); + + std::vector<uint8_t> values(length); + std::vector<uint8_t> valid_bytes(length); + test::random_null_bytes(length, 0.5, values.data()); + test::random_null_bytes(length, 0.1, valid_bytes.data()); + + auto data = test::bytes_to_null_buffer(values); + auto null_bitmap = test::bytes_to_null_buffer(valid_bytes); + + auto a0 = std::make_shared<BooleanArray>(length, data, null_bitmap, -1); + auto a1 = std::make_shared<BooleanArray>(length, data, nullptr, 0); + out->reset(new RecordBatch(schema, length, {a0, a1})); + return Status::OK(); +} + Status MakeIntRecordBatch(std::shared_ptr<RecordBatch>* out) { const int length = 10; diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc index 249ef201c66bb..0867382e6b1b0 100644 --- a/cpp/src/arrow/ipc/writer.cc +++ b/cpp/src/arrow/ipc/writer.cc @@ -281,7 +281,12 @@ class RecordBatchWriter : public ArrayVisitor { } Status Visit(const BooleanArray& array) override { - buffers_.push_back(array.data()); + std::shared_ptr<Buffer> bits = array.data(); + if (array.offset() != 0) { + RETURN_NOT_OK( + CopyBitmap(pool_, bits->data(), array.offset(), array.length(), &bits)); + } + buffers_.push_back(bits); return Status::OK(); } From d75d7a96ca21bb2c1cfcf3bce8d09c2f24a5b8a6 Mon Sep 17 00:00:00 2001 From: Phillip Cloud Date: Sat, 1 Apr 2017 00:00:24 -0400 Subject: [PATCH 0446/1644] ARROW-736: [Python] Mixed-type object DataFrame columns should not silently coerce to an Arrow type by default Author: Phillip Cloud Closes #465 from cpcloud/ARROW-736 and squashes the following commits: fd09def [Phillip Cloud] Update cmake bcf6236 [Phillip Cloud] Rename and move 4a18014 [Phillip Cloud] Move test e80efe1 [Phillip Cloud] Use OwnedRef instead of horror b2df3e9 [Phillip Cloud] Fix python error handling and make compatible with python27 84d33b4 [Phillip Cloud] ARROW-736: Mixed-type object DataFrame columns should not silently coerce to an Arrow type by default --- cpp/src/arrow/python/CMakeLists.txt | 2 +- cpp/src/arrow/python/pandas_convert.cc | 52 ++++++++++++++++--- cpp/src/arrow/python/pandas_convert.h | 4 ++ .../python/{pandas-test.cc => python-test.cc} | 35 +++++++++++-- python/pyarrow/tests/test_convert_builtin.py | 5 ++ python/pyarrow/tests/test_convert_pandas.py | 5 ++ 6 files changed, 91 insertions(+), 12 deletions(-) rename cpp/src/arrow/python/{pandas-test.cc => python-test.cc} (70%) diff --git a/cpp/src/arrow/python/CMakeLists.txt b/cpp/src/arrow/python/CMakeLists.txt index 03f5afc624b34..faaad89656f92 100644 --- a/cpp/src/arrow/python/CMakeLists.txt +++ b/cpp/src/arrow/python/CMakeLists.txt @@ -88,6 +88,6 @@ install(FILES # INSTALL_RPATH "\$ORIGIN") if (ARROW_BUILD_TESTS) - ADD_ARROW_TEST(pandas-test + ADD_ARROW_TEST(python-test STATIC_LINK_LIBS "${ARROW_PYTHON_TEST_LINK_LIBS}") endif() diff --git a/cpp/src/arrow/python/pandas_convert.cc 
b/cpp/src/arrow/python/pandas_convert.cc index 68a8d7d7afcf5..ae9b17ca9ac86 100644 --- a/cpp/src/arrow/python/pandas_convert.cc +++ b/cpp/src/arrow/python/pandas_convert.cc @@ -46,6 +46,7 @@ #include "arrow/type_fwd.h" #include "arrow/type_traits.h" #include "arrow/util/bit-util.h" +#include "arrow/util/logging.h" #include "arrow/util/macros.h" namespace arrow { @@ -167,8 +168,10 @@ Status AppendObjectStrings(int64_t objects_length, StringBuilder* builder, *have_bytes = true; const int32_t length = static_cast(PyBytes_GET_SIZE(obj)); RETURN_NOT_OK(builder->Append(PyBytes_AS_STRING(obj), length)); + } else if (PyObject_is_null(obj)) { + RETURN_NOT_OK(builder->AppendNull()); } else { - builder->AppendNull(); + return InvalidConversion(obj, "string or bytes"); } } @@ -197,8 +200,10 @@ static Status AppendObjectFixedWidthBytes(int64_t objects_length, int byte_width RETURN_NOT_OK(CheckPythonBytesAreFixedLength(obj, byte_width)); RETURN_NOT_OK( builder->Append(reinterpret_cast(PyBytes_AS_STRING(obj)))); + } else if (PyObject_is_null(obj)) { + RETURN_NOT_OK(builder->AppendNull()); } else { - builder->AppendNull(); + return InvalidConversion(obj, "string or bytes"); } } @@ -413,6 +418,32 @@ inline Status PandasConverter::ConvertData(std::shared_ptr* return Status::OK(); } +Status InvalidConversion(PyObject* obj, const std::string& expected_type_name) { + OwnedRef type(PyObject_Type(obj)); + RETURN_IF_PYERROR(); + DCHECK_NE(type.obj(), nullptr); + + OwnedRef type_name(PyObject_GetAttrString(type.obj(), "__name__")); + RETURN_IF_PYERROR(); + DCHECK_NE(type_name.obj(), nullptr); + + OwnedRef bytes_obj(PyUnicode_AsUTF8String(type_name.obj())); + RETURN_IF_PYERROR(); + DCHECK_NE(bytes_obj.obj(), nullptr); + + Py_ssize_t size = PyBytes_GET_SIZE(bytes_obj.obj()); + const char* bytes = PyBytes_AS_STRING(bytes_obj.obj()); + + DCHECK_NE(bytes, nullptr) << "bytes from type(...).__name__ were null"; + + std::string cpp_type_name(bytes, size); + + std::stringstream ss; + ss << "Python object of type " << cpp_type_name << " is not None and is not a " + << expected_type_name << " object"; + return Status::TypeError(ss.str()); +} + Status PandasConverter::ConvertDates(std::shared_ptr* out) { PyAcquireGIL lock; @@ -427,8 +458,10 @@ Status PandasConverter::ConvertDates(std::shared_ptr* out) { if (PyDate_CheckExact(obj)) { PyDateTime_Date* pydate = reinterpret_cast(obj); date_builder.Append(PyDate_to_ms(pydate)); - } else { + } else if (PyObject_is_null(obj)) { date_builder.AppendNull(); + } else { + return InvalidConversion(obj, "date"); } } return date_builder.Finish(out); @@ -483,14 +516,18 @@ Status PandasConverter::ConvertBooleans(std::shared_ptr* out) { memset(bitmap, 0, nbytes); int64_t null_count = 0; + PyObject* obj; for (int64_t i = 0; i < length_; ++i) { - if (objects[i] == Py_True) { + obj = objects[i]; + if (obj == Py_True) { BitUtil::SetBit(bitmap, i); BitUtil::SetBit(null_bitmap_data_, i); - } else if (objects[i] != Py_False) { + } else if (obj == Py_False) { + BitUtil::SetBit(null_bitmap_data_, i); + } else if (PyObject_is_null(obj)) { ++null_count; } else { - BitUtil::SetBit(null_bitmap_data_, i); + return InvalidConversion(obj, "bool"); } } @@ -551,7 +588,8 @@ Status PandasConverter::ConvertObjects(std::shared_ptr* out) { } else if (PyDate_CheckExact(objects[i])) { return ConvertDates(out); } else { - return Status::TypeError("unhandled python type"); + return InvalidConversion( + const_cast(objects[i]), "string, bool, or date"); } } } diff --git a/cpp/src/arrow/python/pandas_convert.h 
b/cpp/src/arrow/python/pandas_convert.h index 12644d98da156..105c1598d3936 100644 --- a/cpp/src/arrow/python/pandas_convert.h +++ b/cpp/src/arrow/python/pandas_convert.h @@ -24,6 +24,7 @@ #include #include +#include #include "arrow/util/visibility.h" @@ -73,6 +74,9 @@ ARROW_EXPORT Status PandasObjectsToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo, const std::shared_ptr& type, std::shared_ptr* out); +ARROW_EXPORT +Status InvalidConversion(PyObject* obj, const std::string& expected_type_name); + } // namespace py } // namespace arrow diff --git a/cpp/src/arrow/python/pandas-test.cc b/cpp/src/arrow/python/python-test.cc similarity index 70% rename from cpp/src/arrow/python/pandas-test.cc rename to cpp/src/arrow/python/python-test.cc index a4e640b83718b..01e30f5a36ce8 100644 --- a/cpp/src/arrow/python/pandas-test.cc +++ b/cpp/src/arrow/python/python-test.cc @@ -17,19 +17,18 @@ #include "gtest/gtest.h" -#include #include -#include -#include + +#include #include "arrow/array.h" #include "arrow/builder.h" #include "arrow/table.h" #include "arrow/test-util.h" -#include "arrow/type.h" #include "arrow/python/common.h" #include "arrow/python/pandas_convert.h" +#include "arrow/python/builtin_convert.h" namespace arrow { namespace py { @@ -65,5 +64,33 @@ TEST(PandasConversionTest, TestObjectBlockWriteFails) { Py_END_ALLOW_THREADS; } +TEST(BuiltinConversionTest, TestMixedTypeFails) { + PyAcquireGIL lock; + MemoryPool* pool = default_memory_pool(); + std::shared_ptr arr; + + OwnedRef list_ref(PyList_New(3)); + PyObject* list = list_ref.obj(); + + ASSERT_NE(list, nullptr); + + PyObject* str = PyUnicode_FromString("abc"); + ASSERT_NE(str, nullptr); + + PyObject* integer = PyLong_FromLong(1234L); + ASSERT_NE(integer, nullptr); + + PyObject* doub = PyFloat_FromDouble(123.0234); + ASSERT_NE(doub, nullptr); + + // This steals a reference to each object, so we don't need to decref them later + // just the list + ASSERT_EQ(PyList_SetItem(list, 0, str), 0); + ASSERT_EQ(PyList_SetItem(list, 1, integer), 0); + ASSERT_EQ(PyList_SetItem(list, 2, doub), 0); + + ASSERT_RAISES(UnknownError, ConvertPySequence(list, pool, &arr)); +} + } // namespace py } // namespace arrow diff --git a/python/pyarrow/tests/test_convert_builtin.py b/python/pyarrow/tests/test_convert_builtin.py index 99251250499d2..3309ba018628d 100644 --- a/python/pyarrow/tests/test_convert_builtin.py +++ b/python/pyarrow/tests/test_convert_builtin.py @@ -157,3 +157,8 @@ def test_list_of_int(self): assert arr.null_count == 1 assert arr.type == pyarrow.list_(pyarrow.int64()) assert arr.to_pylist() == data + + def test_mixed_types_fails(self): + data = ['a', 1, 2.0] + with self.assertRaises(pyarrow.error.ArrowException): + pyarrow.from_pylist(data) diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index f7cb47f685590..3f19b68fe0a03 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -398,3 +398,8 @@ def test_category(self): ] for values in arrays: self._check_array_roundtrip(values) + + def test_mixed_types_fails(self): + data = pd.DataFrame({'a': ['a', 1, 2.0]}) + with self.assertRaises(A.error.ArrowException): + A.Table.from_pandas(data) From 9f5e17448f984a709be36bfd6f731852a775e1b0 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sat, 1 Apr 2017 11:18:30 -0400 Subject: [PATCH 0447/1644] ARROW-733: [C++/Python] Rename FixedWidthBinary to FixedSizeBinary for consistency with FixedSizeList As discussed on JIRA Author: Wes McKinney Closes #473 
from wesm/ARROW-733 and squashes the following commits: 0e30af3 [Wes McKinney] Rename FixedWidthBinary to FixedSizeBinary for consistency with FixedSizeList type --- cpp/src/arrow/array-test.cc | 34 ++++++++++---------- cpp/src/arrow/array.cc | 11 +++---- cpp/src/arrow/array.h | 6 ++-- cpp/src/arrow/builder.cc | 24 +++++++------- cpp/src/arrow/builder.h | 6 ++-- cpp/src/arrow/compare.cc | 8 ++--- cpp/src/arrow/ipc/json-internal.cc | 24 +++++++------- cpp/src/arrow/ipc/metadata.cc | 14 ++++---- cpp/src/arrow/ipc/test-common.h | 8 ++--- cpp/src/arrow/ipc/writer.cc | 2 +- cpp/src/arrow/loader.cc | 6 ++-- cpp/src/arrow/pretty_print-test.cc | 6 ++-- cpp/src/arrow/pretty_print.cc | 7 ++-- cpp/src/arrow/python/builtin_convert.cc | 6 ++-- cpp/src/arrow/python/pandas_convert.cc | 30 ++++++++--------- cpp/src/arrow/type-test.cc | 16 ++++----- cpp/src/arrow/type.cc | 14 ++++---- cpp/src/arrow/type.h | 14 ++++---- cpp/src/arrow/type_fwd.h | 6 ++-- cpp/src/arrow/type_traits.h | 6 ++-- cpp/src/arrow/visitor.cc | 4 +-- cpp/src/arrow/visitor.h | 4 +-- cpp/src/arrow/visitor_inline.h | 4 +-- format/Schema.fbs | 4 +-- python/pyarrow/__init__.py | 4 +-- python/pyarrow/array.pxd | 2 +- python/pyarrow/array.pyx | 6 ++-- python/pyarrow/includes/libarrow.pxd | 8 ++--- python/pyarrow/scalar.pxd | 2 +- python/pyarrow/scalar.pyx | 12 +++---- python/pyarrow/schema.pxd | 6 ++-- python/pyarrow/schema.pyx | 22 ++++++------- python/pyarrow/tests/test_convert_builtin.py | 4 +-- python/pyarrow/tests/test_convert_pandas.py | 4 +-- python/pyarrow/tests/test_scalars.py | 4 +-- 35 files changed, 168 insertions(+), 170 deletions(-) diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc index 52f3727d46a15..68b9864301d20 100644 --- a/cpp/src/arrow/array-test.cc +++ b/cpp/src/arrow/array-test.cc @@ -1080,19 +1080,19 @@ TEST_F(TestBinaryArray, LengthZeroCtor) { } // ---------------------------------------------------------------------- -// FixedWidthBinary tests +// FixedSizeBinary tests class TestFWBinaryArray : public ::testing::Test { public: void SetUp() {} void InitBuilder(int byte_width) { - auto type = fixed_width_binary(byte_width); - builder_.reset(new FixedWidthBinaryBuilder(default_memory_pool(), type)); + auto type = fixed_size_binary(byte_width); + builder_.reset(new FixedSizeBinaryBuilder(default_memory_pool(), type)); } protected: - std::unique_ptr builder_; + std::unique_ptr builder_; }; TEST_F(TestFWBinaryArray, Builder) { @@ -1114,7 +1114,7 @@ TEST_F(TestFWBinaryArray, Builder) { auto CheckResult = [this, &length, &is_valid, &raw_data, &byte_width]( const Array& result) { // Verify output - const auto& fw_result = static_cast(result); + const auto& fw_result = static_cast(result); ASSERT_EQ(length, result.length()); @@ -1169,9 +1169,9 @@ TEST_F(TestFWBinaryArray, Builder) { TEST_F(TestFWBinaryArray, EqualsRangeEquals) { // Check that we don't compare data in null slots - auto type = fixed_width_binary(4); - FixedWidthBinaryBuilder builder1(default_memory_pool(), type); - FixedWidthBinaryBuilder builder2(default_memory_pool(), type); + auto type = fixed_size_binary(4); + FixedSizeBinaryBuilder builder1(default_memory_pool(), type); + FixedSizeBinaryBuilder builder2(default_memory_pool(), type); ASSERT_OK(builder1.Append("foo1")); ASSERT_OK(builder1.AppendNull()); @@ -1183,19 +1183,19 @@ TEST_F(TestFWBinaryArray, EqualsRangeEquals) { ASSERT_OK(builder1.Finish(&array1)); ASSERT_OK(builder2.Finish(&array2)); - const auto& a1 = static_cast(*array1); - const auto& a2 = static_cast(*array2); + const 
auto& a1 = static_cast(*array1); + const auto& a2 = static_cast(*array2); - FixedWidthBinaryArray equal1(type, 2, a1.data(), a1.null_bitmap(), 1); - FixedWidthBinaryArray equal2(type, 2, a2.data(), a1.null_bitmap(), 1); + FixedSizeBinaryArray equal1(type, 2, a1.data(), a1.null_bitmap(), 1); + FixedSizeBinaryArray equal2(type, 2, a2.data(), a1.null_bitmap(), 1); ASSERT_TRUE(equal1.Equals(equal2)); ASSERT_TRUE(equal1.RangeEquals(equal2, 0, 2, 0)); } TEST_F(TestFWBinaryArray, ZeroSize) { - auto type = fixed_width_binary(0); - FixedWidthBinaryBuilder builder(default_memory_pool(), type); + auto type = fixed_size_binary(0); + FixedSizeBinaryBuilder builder(default_memory_pool(), type); ASSERT_OK(builder.Append(nullptr)); ASSERT_OK(builder.Append(nullptr)); @@ -1207,7 +1207,7 @@ TEST_F(TestFWBinaryArray, ZeroSize) { std::shared_ptr array; ASSERT_OK(builder.Finish(&array)); - const auto& fw_array = static_cast(*array); + const auto& fw_array = static_cast(*array); // data is never allocated ASSERT_TRUE(fw_array.data() == nullptr); @@ -1218,8 +1218,8 @@ TEST_F(TestFWBinaryArray, ZeroSize) { } TEST_F(TestFWBinaryArray, Slice) { - auto type = fixed_width_binary(4); - FixedWidthBinaryBuilder builder(default_memory_pool(), type); + auto type = fixed_size_binary(4); + FixedSizeBinaryBuilder builder(default_memory_pool(), type); vector strings = {"foo1", "foo2", "foo3", "foo4", "foo5"}; vector is_null = {0, 1, 0, 0, 0}; diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index b25411a1c5938..bd20654bc87d4 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -280,18 +280,17 @@ std::shared_ptr StringArray::Slice(int64_t offset, int64_t length) const // ---------------------------------------------------------------------- // Fixed width binary -FixedWidthBinaryArray::FixedWidthBinaryArray(const std::shared_ptr& type, +FixedSizeBinaryArray::FixedSizeBinaryArray(const std::shared_ptr& type, int64_t length, const std::shared_ptr& data, const std::shared_ptr& null_bitmap, int64_t null_count, int64_t offset) : PrimitiveArray(type, length, data, null_bitmap, null_count, offset) { - DCHECK(type->type == Type::FIXED_WIDTH_BINARY); - byte_width_ = static_cast(*type).byte_width(); + DCHECK(type->type == Type::FIXED_SIZE_BINARY); + byte_width_ = static_cast(*type).byte_width(); } -std::shared_ptr FixedWidthBinaryArray::Slice( - int64_t offset, int64_t length) const { +std::shared_ptr FixedSizeBinaryArray::Slice(int64_t offset, int64_t length) const { ConformSliceParams(offset_, length_, &offset, &length); - return std::make_shared( + return std::make_shared( type_, length, data_, null_bitmap_, kUnknownNullCount, offset); } diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 53b640853d5a6..9f0e73914da84 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -347,11 +347,11 @@ class ARROW_EXPORT StringArray : public BinaryArray { // ---------------------------------------------------------------------- // Fixed width binary -class ARROW_EXPORT FixedWidthBinaryArray : public PrimitiveArray { +class ARROW_EXPORT FixedSizeBinaryArray : public PrimitiveArray { public: - using TypeClass = FixedWidthBinaryType; + using TypeClass = FixedSizeBinaryType; - FixedWidthBinaryArray(const std::shared_ptr& type, int64_t length, + FixedSizeBinaryArray(const std::shared_ptr& type, int64_t length, const std::shared_ptr& data, const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, int64_t offset = 0); diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index 
82b62146b0f98..40b81cf015ab4 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -438,51 +438,51 @@ Status StringBuilder::Finish(std::shared_ptr* out) { // ---------------------------------------------------------------------- // Fixed width binary -FixedWidthBinaryBuilder::FixedWidthBinaryBuilder( +FixedSizeBinaryBuilder::FixedSizeBinaryBuilder( MemoryPool* pool, const std::shared_ptr& type) : ArrayBuilder(pool, type), byte_builder_(pool) { - DCHECK(type->type == Type::FIXED_WIDTH_BINARY); - byte_width_ = static_cast(*type).byte_width(); + DCHECK(type->type == Type::FIXED_SIZE_BINARY); + byte_width_ = static_cast(*type).byte_width(); } -Status FixedWidthBinaryBuilder::Append(const uint8_t* value) { +Status FixedSizeBinaryBuilder::Append(const uint8_t* value) { RETURN_NOT_OK(Reserve(1)); UnsafeAppendToBitmap(true); return byte_builder_.Append(value, byte_width_); } -Status FixedWidthBinaryBuilder::Append( +Status FixedSizeBinaryBuilder::Append( const uint8_t* data, int64_t length, const uint8_t* valid_bytes) { RETURN_NOT_OK(Reserve(length)); UnsafeAppendToBitmap(valid_bytes, length); return byte_builder_.Append(data, length * byte_width_); } -Status FixedWidthBinaryBuilder::Append(const std::string& value) { +Status FixedSizeBinaryBuilder::Append(const std::string& value) { return Append(reinterpret_cast(value.c_str())); } -Status FixedWidthBinaryBuilder::AppendNull() { +Status FixedSizeBinaryBuilder::AppendNull() { RETURN_NOT_OK(Reserve(1)); UnsafeAppendToBitmap(false); return byte_builder_.Advance(byte_width_); } -Status FixedWidthBinaryBuilder::Init(int64_t elements) { +Status FixedSizeBinaryBuilder::Init(int64_t elements) { DCHECK_LT(elements, std::numeric_limits::max()); RETURN_NOT_OK(ArrayBuilder::Init(elements)); return byte_builder_.Resize(elements * byte_width_); } -Status FixedWidthBinaryBuilder::Resize(int64_t capacity) { +Status FixedSizeBinaryBuilder::Resize(int64_t capacity) { DCHECK_LT(capacity, std::numeric_limits::max()); RETURN_NOT_OK(byte_builder_.Resize(capacity * byte_width_)); return ArrayBuilder::Resize(capacity); } -Status FixedWidthBinaryBuilder::Finish(std::shared_ptr* out) { +Status FixedSizeBinaryBuilder::Finish(std::shared_ptr* out) { std::shared_ptr data = byte_builder_.Finish(); - *out = std::make_shared( + *out = std::make_shared( type_, length_, data, null_bitmap_, null_count_); return Status::OK(); } @@ -542,7 +542,7 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, BUILDER_CASE(DOUBLE, DoubleBuilder); BUILDER_CASE(STRING, StringBuilder); BUILDER_CASE(BINARY, BinaryBuilder); - BUILDER_CASE(FIXED_WIDTH_BINARY, FixedWidthBinaryBuilder); + BUILDER_CASE(FIXED_SIZE_BINARY, FixedSizeBinaryBuilder); case Type::LIST: { std::shared_ptr value_builder; std::shared_ptr value_type = diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index bd957b38280da..61207a334db32 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -390,11 +390,11 @@ class ARROW_EXPORT StringBuilder : public BinaryBuilder { }; // ---------------------------------------------------------------------- -// FixedWidthBinaryBuilder +// FixedSizeBinaryBuilder -class ARROW_EXPORT FixedWidthBinaryBuilder : public ArrayBuilder { +class ARROW_EXPORT FixedSizeBinaryBuilder : public ArrayBuilder { public: - FixedWidthBinaryBuilder(MemoryPool* pool, const std::shared_ptr& type); + FixedSizeBinaryBuilder(MemoryPool* pool, const std::shared_ptr& type); Status Append(const uint8_t* value); Status Append( diff --git a/cpp/src/arrow/compare.cc 
b/cpp/src/arrow/compare.cc index 4cd617e6021df..7451439a875d6 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -202,8 +202,8 @@ class RangeEqualsVisitor { return Status::OK(); } - Status Visit(const FixedWidthBinaryArray& left) { - const auto& right = static_cast(right_); + Status Visit(const FixedSizeBinaryArray& left) { + const auto& right = static_cast(right_); int32_t width = left.byte_width(); @@ -648,8 +648,8 @@ class TypeEqualsVisitor { return Status::OK(); } - Status Visit(const FixedWidthBinaryType& left) { - const auto& right = static_cast(right_); + Status Visit(const FixedSizeBinaryType& left) { + const auto& right = static_cast(right_); result_ = left.byte_width() == right.byte_width(); return Status::OK(); } diff --git a/cpp/src/arrow/ipc/json-internal.cc b/cpp/src/arrow/ipc/json-internal.cc index 9572a0a81898d..1e2385b73f82c 100644 --- a/cpp/src/arrow/ipc/json-internal.cc +++ b/cpp/src/arrow/ipc/json-internal.cc @@ -189,7 +189,7 @@ class JsonSchemaWriter { } } - void WriteTypeMetadata(const FixedWidthBinaryType& type) { + void WriteTypeMetadata(const FixedSizeBinaryType& type) { writer_->Key("byteWidth"); writer_->Int(type.byte_width()); } @@ -297,8 +297,8 @@ class JsonSchemaWriter { Status Visit(const BinaryType& type) { return WriteVarBytes("binary", type); } - Status Visit(const FixedWidthBinaryType& type) { - return WritePrimitive("fixedwidthbinary", type); + Status Visit(const FixedSizeBinaryType& type) { + return WritePrimitive("fixedsizebinary", type); } Status Visit(const TimestampType& type) { return WritePrimitive("timestamp", type); } @@ -401,7 +401,7 @@ class JsonArrayWriter { } } - void WriteDataValues(const FixedWidthBinaryArray& arr) { + void WriteDataValues(const FixedSizeBinaryArray& arr) { int32_t width = arr.byte_width(); for (int64_t i = 0; i < arr.length(); ++i) { const char* buf = reinterpret_cast(arr.GetValue(i)); @@ -576,13 +576,13 @@ static Status GetFloatingPoint( return Status::OK(); } -static Status GetFixedWidthBinary( +static Status GetFixedSizeBinary( const RjObject& json_type, std::shared_ptr* type) { const auto& json_byte_width = json_type.FindMember("byteWidth"); RETURN_NOT_INT("byteWidth", json_byte_width, json_type); int32_t byte_width = json_byte_width->value.GetInt(); - *type = fixed_width_binary(byte_width); + *type = fixed_size_binary(byte_width); return Status::OK(); } @@ -709,8 +709,8 @@ static Status GetType(const RjObject& json_type, *type = utf8(); } else if (type_name == "binary") { *type = binary(); - } else if (type_name == "fixedwidthbinary") { - return GetFixedWidthBinary(json_type, type); + } else if (type_name == "fixedsizebinary") { + return GetFixedSizeBinary(json_type, type); } else if (type_name == "null") { *type = null(); } else if (type_name == "date") { @@ -896,10 +896,10 @@ class JsonArrayReader { } template - typename std::enable_if::value, Status>::type + typename std::enable_if::value, Status>::type ReadArray(const RjObject& json_array, int32_t length, const std::vector& is_valid, const std::shared_ptr& type, std::shared_ptr* array) { - FixedWidthBinaryBuilder builder(pool_, type); + FixedSizeBinaryBuilder builder(pool_, type); const auto& json_data = json_array.FindMember("DATA"); RETURN_NOT_ARRAY("DATA", json_data, json_array); @@ -908,7 +908,7 @@ class JsonArrayReader { DCHECK_EQ(static_cast(json_data_arr.Size()), length); - int32_t byte_width = static_cast(*type).byte_width(); + int32_t byte_width = static_cast(*type).byte_width(); // Allocate space for parsed values std::shared_ptr 
byte_buffer; @@ -1112,7 +1112,7 @@ class JsonArrayReader { TYPE_CASE(DoubleType); TYPE_CASE(StringType); TYPE_CASE(BinaryType); - TYPE_CASE(FixedWidthBinaryType); + TYPE_CASE(FixedSizeBinaryType); TYPE_CASE(Date32Type); TYPE_CASE(Date64Type); TYPE_CASE(TimestampType); diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index 076a6e792ba40..5007f1309087d 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -230,9 +230,9 @@ static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, case flatbuf::Type_Binary: *out = binary(); return Status::OK(); - case flatbuf::Type_FixedWidthBinary: { - auto fw_binary = static_cast(type_data); - *out = fixed_width_binary(fw_binary->byteWidth()); + case flatbuf::Type_FixedSizeBinary: { + auto fw_binary = static_cast(type_data); + *out = fixed_size_binary(fw_binary->byteWidth()); return Status::OK(); } case flatbuf::Type_Utf8: @@ -362,10 +362,10 @@ static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, *out_type = flatbuf::Type_FloatingPoint; *offset = FloatToFlatbuffer(fbb, flatbuf::Precision_DOUBLE); break; - case Type::FIXED_WIDTH_BINARY: { - const auto& fw_type = static_cast(*type); - *out_type = flatbuf::Type_FixedWidthBinary; - *offset = flatbuf::CreateFixedWidthBinary(fbb, fw_type.byte_width()).Union(); + case Type::FIXED_SIZE_BINARY: { + const auto& fw_type = static_cast(*type); + *out_type = flatbuf::Type_FixedSizeBinary; + *offset = flatbuf::CreateFixedSizeBinary(fbb, fw_type.byte_width()).Union(); } break; case Type::BINARY: *out_type = flatbuf::Type_Binary; diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index 134a5caee8ec4..d113531822c96 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -599,14 +599,14 @@ void AppendValues(const std::vector& is_valid, const std::vector& value Status MakeFWBinary(std::shared_ptr* out) { std::vector is_valid = {true, true, true, false}; - auto f0 = field("f0", fixed_width_binary(4)); - auto f1 = field("f1", fixed_width_binary(0)); + auto f0 = field("f0", fixed_size_binary(4)); + auto f1 = field("f1", fixed_size_binary(0)); std::shared_ptr schema(new Schema({f0, f1})); std::shared_ptr a1, a2; - FixedWidthBinaryBuilder b1(default_memory_pool(), f0->type); - FixedWidthBinaryBuilder b2(default_memory_pool(), f1->type); + FixedSizeBinaryBuilder b1(default_memory_pool(), f0->type); + FixedSizeBinaryBuilder b2(default_memory_pool(), f1->type); std::vector values1 = {"foo1", "foo2", "foo3", "foo4"}; AppendValues(is_valid, values1, &b1); diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc index 0867382e6b1b0..5330206480928 100644 --- a/cpp/src/arrow/ipc/writer.cc +++ b/cpp/src/arrow/ipc/writer.cc @@ -269,7 +269,7 @@ class RecordBatchWriter : public ArrayVisitor { return Status::OK(); } - Status Visit(const FixedWidthBinaryArray& array) override { + Status Visit(const FixedSizeBinaryArray& array) override { auto data = array.data(); int32_t width = array.byte_width(); diff --git a/cpp/src/arrow/loader.cc b/cpp/src/arrow/loader.cc index cc64c4d8264f7..f3347f92e6d87 100644 --- a/cpp/src/arrow/loader.cc +++ b/cpp/src/arrow/loader.cc @@ -139,7 +139,7 @@ class ArrayLoader { template typename std::enable_if::value && - !std::is_base_of::value && + !std::is_base_of::value && !std::is_base_of::value, Status>::type Visit(const T& type) { @@ -152,14 +152,14 @@ class ArrayLoader { return LoadBinary(); } - Status Visit(const FixedWidthBinaryType& type) { + Status 
Visit(const FixedSizeBinaryType& type) { FieldMetadata field_meta; std::shared_ptr null_bitmap, data; RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &data)); - result_ = std::make_shared( + result_ = std::make_shared( type_, field_meta.length, data, null_bitmap, field_meta.null_count); return Status::OK(); } diff --git a/cpp/src/arrow/pretty_print-test.cc b/cpp/src/arrow/pretty_print-test.cc index f21383f0cb06f..80cd9cfe6ac6d 100644 --- a/cpp/src/arrow/pretty_print-test.cc +++ b/cpp/src/arrow/pretty_print-test.cc @@ -78,14 +78,14 @@ TEST_F(TestPrettyPrint, BinaryType) { CheckPrimitive(0, is_valid, values, ex); } -TEST_F(TestPrettyPrint, FixedWidthBinaryType) { +TEST_F(TestPrettyPrint, FixedSizeBinaryType) { std::vector is_valid = {true, true, false, true, false}; std::vector values = {"foo", "bar", "baz"}; static const char* ex = R"expected([666F6F, 626172, 62617A])expected"; std::shared_ptr array; - auto type = fixed_width_binary(3); - FixedWidthBinaryBuilder builder(default_memory_pool(), type); + auto type = fixed_size_binary(3); + FixedSizeBinaryBuilder builder(default_memory_pool(), type); builder.Append(values[0]); builder.Append(values[1]); diff --git a/cpp/src/arrow/pretty_print.cc b/cpp/src/arrow/pretty_print.cc index 0f67fe5bc52a7..0f46f0306fe08 100644 --- a/cpp/src/arrow/pretty_print.cc +++ b/cpp/src/arrow/pretty_print.cc @@ -97,9 +97,8 @@ class ArrayPrinter { } template - inline - typename std::enable_if::value, void>::type - WriteDataValues(const T& array) { + inline typename std::enable_if::value, void>::type + WriteDataValues(const T& array) { int32_t width = array.byte_width(); for (int i = 0; i < array.length(); ++i) { if (i > 0) { (*sink_) << ", "; } @@ -136,7 +135,7 @@ class ArrayPrinter { template typename std::enable_if::value || - std::is_base_of::value || + std::is_base_of::value || std::is_base_of::value, Status>::type Visit(const T& array) { diff --git a/cpp/src/arrow/python/builtin_convert.cc b/cpp/src/arrow/python/builtin_convert.cc index 72e86774fcca7..6a13fdccdeaff 100644 --- a/cpp/src/arrow/python/builtin_convert.cc +++ b/cpp/src/arrow/python/builtin_convert.cc @@ -406,13 +406,13 @@ class BytesConverter : public TypedConverter { } }; -class FixedWidthBytesConverter : public TypedConverter { +class FixedWidthBytesConverter : public TypedConverter { public: Status AppendData(PyObject* seq) override { PyObject* item; PyObject* bytes_obj; OwnedRef tmp; - Py_ssize_t expected_length = std::dynamic_pointer_cast( + Py_ssize_t expected_length = std::dynamic_pointer_cast( typed_builder_->type())->byte_width(); Py_ssize_t size = PySequence_Size(seq); for (int64_t i = 0; i < size; ++i) { @@ -510,7 +510,7 @@ std::shared_ptr GetConverter(const std::shared_ptr& type return std::make_shared(); case Type::BINARY: return std::make_shared(); - case Type::FIXED_WIDTH_BINARY: + case Type::FIXED_SIZE_BINARY: return std::make_shared(); case Type::STRING: return std::make_shared(); diff --git a/cpp/src/arrow/python/pandas_convert.cc b/cpp/src/arrow/python/pandas_convert.cc index ae9b17ca9ac86..ddfec1bf45a2e 100644 --- a/cpp/src/arrow/python/pandas_convert.cc +++ b/cpp/src/arrow/python/pandas_convert.cc @@ -179,7 +179,7 @@ Status AppendObjectStrings(int64_t objects_length, StringBuilder* builder, } static Status AppendObjectFixedWidthBytes(int64_t objects_length, int byte_width, - FixedWidthBinaryBuilder* builder, PyObject** objects) { + FixedSizeBinaryBuilder* builder, PyObject** objects) { PyObject* obj; for (int64_t i = 0; i 
< objects_length; ++i) { @@ -228,7 +228,7 @@ struct WrapBytes { }; template <> -struct WrapBytes { +struct WrapBytes { static inline PyObject* Wrap(const uint8_t* data, int64_t length) { return PyBytes_FromStringAndSize(reinterpret_cast(data), length); } @@ -495,10 +495,10 @@ Status PandasConverter::ConvertObjectFixedWidthBytes( PyAcquireGIL lock; PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); - FixedWidthBinaryBuilder builder(pool_, type); + FixedSizeBinaryBuilder builder(pool_, type); RETURN_NOT_OK(builder.Resize(length_)); RETURN_NOT_OK(AppendObjectFixedWidthBytes(length_, - std::dynamic_pointer_cast(builder.type())->byte_width(), + std::dynamic_pointer_cast(builder.type())->byte_width(), &builder, objects)); RETURN_NOT_OK(builder.Finish(out)); return Status::OK(); @@ -564,7 +564,7 @@ Status PandasConverter::ConvertObjects(std::shared_ptr* out) { switch (type_->type) { case Type::STRING: return ConvertObjectStrings(out); - case Type::FIXED_WIDTH_BINARY: + case Type::FIXED_SIZE_BINARY: return ConvertObjectFixedWidthBytes(type_, out); case Type::BOOL: return ConvertBooleans(out); @@ -1017,14 +1017,14 @@ inline Status ConvertBinaryLike(const ChunkedArray& data, PyObject** out_values) return Status::OK(); } -inline Status ConvertFixedWidthBinary(const ChunkedArray& data, PyObject** out_values) { +inline Status ConvertFixedSizeBinary(const ChunkedArray& data, PyObject** out_values) { PyAcquireGIL lock; for (int c = 0; c < data.num_chunks(); c++) { - auto arr = static_cast(data.chunk(c).get()); + auto arr = static_cast(data.chunk(c).get()); const uint8_t* data_ptr; int32_t length = - std::dynamic_pointer_cast(arr->type())->byte_width(); + std::dynamic_pointer_cast(arr->type())->byte_width(); const bool has_nulls = data.null_count() > 0; for (int64_t i = 0; i < arr->length(); ++i) { if (has_nulls && arr->IsNull(i)) { @@ -1032,7 +1032,7 @@ inline Status ConvertFixedWidthBinary(const ChunkedArray& data, PyObject** out_v *out_values = Py_None; } else { data_ptr = arr->GetValue(i); - *out_values = WrapBytes::Wrap(data_ptr, length); + *out_values = WrapBytes::Wrap(data_ptr, length); if (*out_values == nullptr) { PyErr_Clear(); std::stringstream ss; @@ -1181,8 +1181,8 @@ class ObjectBlock : public PandasBlock { RETURN_NOT_OK(ConvertBinaryLike(data, out_buffer)); } else if (type == Type::STRING) { RETURN_NOT_OK(ConvertBinaryLike(data, out_buffer)); - } else if (type == Type::FIXED_WIDTH_BINARY) { - RETURN_NOT_OK(ConvertFixedWidthBinary(data, out_buffer)); + } else if (type == Type::FIXED_SIZE_BINARY) { + RETURN_NOT_OK(ConvertFixedSizeBinary(data, out_buffer)); } else if (type == Type::LIST) { auto list_type = std::static_pointer_cast(col->type()); switch (list_type->value_type()->type) { @@ -1612,7 +1612,7 @@ class DataFrameBlockCreator { break; case Type::STRING: case Type::BINARY: - case Type::FIXED_WIDTH_BINARY: + case Type::FIXED_SIZE_BINARY: output_type = PandasBlock::OBJECT; break; case Type::DATE64: @@ -1877,7 +1877,7 @@ class ArrowDeserializer { CONVERT_CASE(DOUBLE); CONVERT_CASE(BINARY); CONVERT_CASE(STRING); - CONVERT_CASE(FIXED_WIDTH_BINARY); + CONVERT_CASE(FIXED_SIZE_BINARY); CONVERT_CASE(DATE64); CONVERT_CASE(TIMESTAMP); CONVERT_CASE(DICTIONARY); @@ -1982,11 +1982,11 @@ class ArrowDeserializer { // Fixed length binary strings template - inline typename std::enable_if::type + inline typename std::enable_if::type ConvertValues() { RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); auto out_values = reinterpret_cast(PyArray_DATA(arr_)); - return ConvertFixedWidthBinary(data_, 
out_values); + return ConvertFixedSizeBinary(data_, out_values); } #define CONVERTVALUES_LISTSLIKE_CASE(ArrowType, ArrowEnum) \ diff --git a/cpp/src/arrow/type-test.cc b/cpp/src/arrow/type-test.cc index b221c80391cde..dafadc168c191 100644 --- a/cpp/src/arrow/type-test.cc +++ b/cpp/src/arrow/type-test.cc @@ -155,16 +155,16 @@ TEST(TestStringType, ToString) { ASSERT_EQ(str.ToString(), std::string("string")); } -TEST(TestFixedWidthBinaryType, ToString) { - auto t = fixed_width_binary(10); - ASSERT_EQ(t->type, Type::FIXED_WIDTH_BINARY); - ASSERT_EQ("fixed_width_binary[10]", t->ToString()); +TEST(TestFixedSizeBinaryType, ToString) { + auto t = fixed_size_binary(10); + ASSERT_EQ(t->type, Type::FIXED_SIZE_BINARY); + ASSERT_EQ("fixed_size_binary[10]", t->ToString()); } -TEST(TestFixedWidthBinaryType, Equals) { - auto t1 = fixed_width_binary(10); - auto t2 = fixed_width_binary(10); - auto t3 = fixed_width_binary(3); +TEST(TestFixedSizeBinaryType, Equals) { + auto t1 = fixed_size_binary(10); + auto t2 = fixed_size_binary(10); + auto t3 = fixed_size_binary(3); ASSERT_TRUE(t1->Equals(t1)); ASSERT_TRUE(t1->Equals(t2)); diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index e6e6f5c3e8bc7..d99551d661d69 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -90,13 +90,13 @@ std::string BinaryType::ToString() const { return std::string("binary"); } -int FixedWidthBinaryType::bit_width() const { +int FixedSizeBinaryType::bit_width() const { return 8 * byte_width(); } -std::string FixedWidthBinaryType::ToString() const { +std::string FixedSizeBinaryType::ToString() const { std::stringstream ss; - ss << "fixed_width_binary[" << byte_width_ << "]"; + ss << "fixed_size_binary[" << byte_width_ << "]"; return ss.str(); } @@ -286,7 +286,7 @@ std::string Schema::ToString() const { ACCEPT_VISITOR(NullType); ACCEPT_VISITOR(BooleanType); ACCEPT_VISITOR(BinaryType); -ACCEPT_VISITOR(FixedWidthBinaryType); +ACCEPT_VISITOR(FixedSizeBinaryType); ACCEPT_VISITOR(StringType); ACCEPT_VISITOR(ListType); ACCEPT_VISITOR(StructType); @@ -324,8 +324,8 @@ TYPE_FACTORY(binary, BinaryType); TYPE_FACTORY(date64, Date64Type); TYPE_FACTORY(date32, Date32Type); -std::shared_ptr fixed_width_binary(int32_t byte_width) { - return std::make_shared(byte_width); +std::shared_ptr fixed_size_binary(int32_t byte_width) { + return std::make_shared(byte_width); } std::shared_ptr timestamp(TimeUnit unit) { @@ -392,7 +392,7 @@ std::vector BinaryType::GetBufferLayout() const { return {kValidityBuffer, kOffsetBuffer, kValues8}; } -std::vector FixedWidthBinaryType::GetBufferLayout() const { +std::vector FixedSizeBinaryType::GetBufferLayout() const { return {kValidityBuffer, BufferDescr(BufferType::DATA, byte_width_ * 8)}; } diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 4f931907ee79f..6b936f348d4de 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -70,8 +70,8 @@ struct Type { // Variable-length bytes (no guarantee of UTF8-ness) BINARY, - // Fixed-width binary. Each value occupies the same number of bytes - FIXED_WIDTH_BINARY, + // Fixed-size binary. Each value occupies the same number of bytes + FIXED_SIZE_BINARY, // int32_t days since the UNIX epoch DATE32, @@ -353,12 +353,12 @@ struct ARROW_EXPORT BinaryType : public DataType, public NoExtraMeta { }; // BinaryType type is represents lists of 1-byte values. 
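The user-facing effect of the rename is easiest to see from Python. The sketch below mirrors the updated pyarrow tests in this patch; it assumes this era's pyarrow API, where binary(length) with a non-negative length yields the fixed-size type:

    import pyarrow

    # binary(4) now maps to FixedSizeBinaryType (formerly FixedWidthBinaryType)
    t = pyarrow.binary(4)
    assert t.byte_width == 4

    # every value must occupy exactly byte_width bytes; nulls are still allowed
    arr = pyarrow.from_pylist([b'foof', None, b'barb'], type=t)
    assert arr.to_pylist() == [b'foof', None, b'barb']

    # scalars come back boxed as the renamed FixedSizeBinaryValue
    assert isinstance(arr[0], pyarrow.FixedSizeBinaryValue)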
-class ARROW_EXPORT FixedWidthBinaryType : public FixedWidthType { +class ARROW_EXPORT FixedSizeBinaryType : public FixedWidthType { public: - static constexpr Type::type type_id = Type::FIXED_WIDTH_BINARY; + static constexpr Type::type type_id = Type::FIXED_SIZE_BINARY; - explicit FixedWidthBinaryType(int32_t byte_width) - : FixedWidthType(Type::FIXED_WIDTH_BINARY), byte_width_(byte_width) {} + explicit FixedSizeBinaryType(int32_t byte_width) + : FixedWidthType(Type::FIXED_SIZE_BINARY), byte_width_(byte_width) {} Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; @@ -630,7 +630,7 @@ class ARROW_EXPORT Schema { // ---------------------------------------------------------------------- // Factory functions -std::shared_ptr ARROW_EXPORT fixed_width_binary(int32_t byte_width); +std::shared_ptr ARROW_EXPORT fixed_size_binary(int32_t byte_width); std::shared_ptr ARROW_EXPORT list(const std::shared_ptr& value_type); std::shared_ptr ARROW_EXPORT list(const std::shared_ptr& value_type); diff --git a/cpp/src/arrow/type_fwd.h b/cpp/src/arrow/type_fwd.h index 04ddf7e74dd1d..2e27ce9858964 100644 --- a/cpp/src/arrow/type_fwd.h +++ b/cpp/src/arrow/type_fwd.h @@ -51,9 +51,9 @@ struct BinaryType; class BinaryArray; class BinaryBuilder; -class FixedWidthBinaryType; -class FixedWidthBinaryArray; -class FixedWidthBinaryBuilder; +class FixedSizeBinaryType; +class FixedSizeBinaryArray; +class FixedSizeBinaryBuilder; struct StringType; class StringArray; diff --git a/cpp/src/arrow/type_traits.h b/cpp/src/arrow/type_traits.h index b73d5a68d257e..353b638fed894 100644 --- a/cpp/src/arrow/type_traits.h +++ b/cpp/src/arrow/type_traits.h @@ -257,9 +257,9 @@ struct TypeTraits { }; template <> -struct TypeTraits { - using ArrayType = FixedWidthBinaryArray; - using BuilderType = FixedWidthBinaryBuilder; +struct TypeTraits { + using ArrayType = FixedSizeBinaryArray; + using BuilderType = FixedSizeBinaryBuilder; constexpr static bool is_parameter_free = false; }; diff --git a/cpp/src/arrow/visitor.cc b/cpp/src/arrow/visitor.cc index 9200e0ff228a3..117578965ccc4 100644 --- a/cpp/src/arrow/visitor.cc +++ b/cpp/src/arrow/visitor.cc @@ -43,7 +43,7 @@ ARRAY_VISITOR_DEFAULT(FloatArray); ARRAY_VISITOR_DEFAULT(DoubleArray); ARRAY_VISITOR_DEFAULT(BinaryArray); ARRAY_VISITOR_DEFAULT(StringArray); -ARRAY_VISITOR_DEFAULT(FixedWidthBinaryArray); +ARRAY_VISITOR_DEFAULT(FixedSizeBinaryArray); ARRAY_VISITOR_DEFAULT(Date32Array); ARRAY_VISITOR_DEFAULT(Date64Array); ARRAY_VISITOR_DEFAULT(Time32Array); @@ -82,7 +82,7 @@ TYPE_VISITOR_DEFAULT(FloatType); TYPE_VISITOR_DEFAULT(DoubleType); TYPE_VISITOR_DEFAULT(StringType); TYPE_VISITOR_DEFAULT(BinaryType); -TYPE_VISITOR_DEFAULT(FixedWidthBinaryType); +TYPE_VISITOR_DEFAULT(FixedSizeBinaryType); TYPE_VISITOR_DEFAULT(Date64Type); TYPE_VISITOR_DEFAULT(Date32Type); TYPE_VISITOR_DEFAULT(Time32Type); diff --git a/cpp/src/arrow/visitor.h b/cpp/src/arrow/visitor.h index d44dcf6b97676..6c36e465ec436 100644 --- a/cpp/src/arrow/visitor.h +++ b/cpp/src/arrow/visitor.h @@ -43,7 +43,7 @@ class ARROW_EXPORT ArrayVisitor { virtual Status Visit(const DoubleArray& array); virtual Status Visit(const StringArray& array); virtual Status Visit(const BinaryArray& array); - virtual Status Visit(const FixedWidthBinaryArray& array); + virtual Status Visit(const FixedSizeBinaryArray& array); virtual Status Visit(const Date32Array& array); virtual Status Visit(const Date64Array& array); virtual Status Visit(const Time32Array& array); @@ -76,7 +76,7 @@ class ARROW_EXPORT TypeVisitor { 
virtual Status Visit(const DoubleType& type); virtual Status Visit(const StringType& type); virtual Status Visit(const BinaryType& type); - virtual Status Visit(const FixedWidthBinaryType& type); + virtual Status Visit(const FixedSizeBinaryType& type); virtual Status Visit(const Date64Type& type); virtual Status Visit(const Date32Type& type); virtual Status Visit(const Time32Type& type); diff --git a/cpp/src/arrow/visitor_inline.h b/cpp/src/arrow/visitor_inline.h index cbc4d5acdb8cf..c61c9f59f7ab2 100644 --- a/cpp/src/arrow/visitor_inline.h +++ b/cpp/src/arrow/visitor_inline.h @@ -48,7 +48,7 @@ inline Status VisitTypeInline(const DataType& type, VISITOR* visitor) { TYPE_VISIT_INLINE(DoubleType); TYPE_VISIT_INLINE(StringType); TYPE_VISIT_INLINE(BinaryType); - TYPE_VISIT_INLINE(FixedWidthBinaryType); + TYPE_VISIT_INLINE(FixedSizeBinaryType); TYPE_VISIT_INLINE(Date32Type); TYPE_VISIT_INLINE(Date64Type); TYPE_VISIT_INLINE(TimestampType); @@ -87,7 +87,7 @@ inline Status VisitArrayInline(const Array& array, VISITOR* visitor) { ARRAY_VISIT_INLINE(DoubleType); ARRAY_VISIT_INLINE(StringType); ARRAY_VISIT_INLINE(BinaryType); - ARRAY_VISIT_INLINE(FixedWidthBinaryType); + ARRAY_VISIT_INLINE(FixedSizeBinaryType); ARRAY_VISIT_INLINE(Date32Type); ARRAY_VISIT_INLINE(Date64Type); ARRAY_VISIT_INLINE(TimestampType); diff --git a/format/Schema.fbs b/format/Schema.fbs index 5268bf95cfdc8..958f09181bfa6 100644 --- a/format/Schema.fbs +++ b/format/Schema.fbs @@ -67,7 +67,7 @@ table Utf8 { table Binary { } -table FixedWidthBinary { +table FixedSizeBinary { /// Number of bytes per value byteWidth: int; } @@ -156,7 +156,7 @@ union Type { List, Struct_, Union, - FixedWidthBinary + FixedSizeBinary } /// ---------------------------------------------------------------------- diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 66b6038617944..3df2a1d445549 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -55,7 +55,7 @@ Int8Value, Int16Value, Int32Value, Int64Value, UInt8Value, UInt16Value, UInt32Value, UInt64Value, FloatValue, DoubleValue, ListValue, - BinaryValue, StringValue, FixedWidthBinaryValue) + BinaryValue, StringValue, FixedSizeBinaryValue) import pyarrow.schema as _schema @@ -65,7 +65,7 @@ timestamp, date32, date64, float_, double, binary, string, list_, struct, dictionary, field, - DataType, FixedWidthBinaryType, + DataType, FixedSizeBinaryType, Field, Schema, schema) diff --git a/python/pyarrow/array.pxd b/python/pyarrow/array.pxd index a7241c6a47e31..0b5f33d0d2db6 100644 --- a/python/pyarrow/array.pxd +++ b/python/pyarrow/array.pxd @@ -100,7 +100,7 @@ cdef class DoubleArray(FloatingPointArray): pass -cdef class FixedWidthBinaryArray(Array): +cdef class FixedSizeBinaryArray(Array): pass diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 289baf2993081..b9799f15bf3e7 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -37,7 +37,7 @@ cimport pyarrow.scalar as scalar from pyarrow.scalar import NA from pyarrow.schema cimport (DataType, Field, Schema, DictionaryType, - FixedWidthBinaryType, + FixedSizeBinaryType, box_data_type) import pyarrow.schema as schema @@ -407,7 +407,7 @@ cdef class DoubleArray(FloatingPointArray): pass -cdef class FixedWidthBinaryArray(Array): +cdef class FixedSizeBinaryArray(Array): pass @@ -518,7 +518,7 @@ cdef dict _array_classes = { Type_BINARY: BinaryArray, Type_STRING: StringArray, Type_DICTIONARY: DictionaryArray, - Type_FIXED_WIDTH_BINARY: FixedWidthBinaryArray, + Type_FIXED_SIZE_BINARY: 
FixedSizeBinaryArray, } cdef object box_array(const shared_ptr[CArray]& sp_array): diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index b44ade5298eb3..f549884d175fa 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -45,7 +45,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: Type_TIME64" arrow::Type::TIME64" Type_BINARY" arrow::Type::BINARY" Type_STRING" arrow::Type::STRING" - Type_FIXED_WIDTH_BINARY" arrow::Type::FIXED_WIDTH_BINARY" + Type_FIXED_SIZE_BINARY" arrow::Type::FIXED_SIZE_BINARY" Type_LIST" arrow::Type::LIST" Type_STRUCT" arrow::Type::STRUCT" @@ -140,8 +140,8 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass CStringType" arrow::StringType"(CDataType): pass - cdef cppclass CFixedWidthBinaryType" arrow::FixedWidthBinaryType"(CFixedWidthType): - CFixedWidthBinaryType(int byte_width) + cdef cppclass CFixedSizeBinaryType" arrow::FixedSizeBinaryType"(CFixedWidthType): + CFixedSizeBinaryType(int byte_width) int byte_width() cdef cppclass CField" arrow::Field": @@ -208,7 +208,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass CDoubleArray" arrow::DoubleArray"(CArray): double Value(int i) - cdef cppclass CFixedWidthBinaryArray" arrow::FixedWidthBinaryArray"(CArray): + cdef cppclass CFixedSizeBinaryArray" arrow::FixedSizeBinaryArray"(CArray): const uint8_t* GetValue(int i) cdef cppclass CListArray" arrow::ListArray"(CArray): diff --git a/python/pyarrow/scalar.pxd b/python/pyarrow/scalar.pxd index e9cc3cb487cbc..d6c3b35160c12 100644 --- a/python/pyarrow/scalar.pxd +++ b/python/pyarrow/scalar.pxd @@ -62,7 +62,7 @@ cdef class StringValue(ArrayValue): pass -cdef class FixedWidthBinaryValue(ArrayValue): +cdef class FixedSizeBinaryValue(ArrayValue): pass diff --git a/python/pyarrow/scalar.pyx b/python/pyarrow/scalar.pyx index f4a1c9e08eb64..983a9a7334044 100644 --- a/python/pyarrow/scalar.pyx +++ b/python/pyarrow/scalar.pyx @@ -224,16 +224,16 @@ cdef class ListValue(ArrayValue): return result -cdef class FixedWidthBinaryValue(ArrayValue): +cdef class FixedSizeBinaryValue(ArrayValue): def as_py(self): cdef: - CFixedWidthBinaryArray* ap - CFixedWidthBinaryType* ap_type + CFixedSizeBinaryArray* ap + CFixedSizeBinaryType* ap_type int32_t length const char* data - ap = self.sp_array.get() - ap_type = ap.type().get() + ap = self.sp_array.get() + ap_type = ap.type().get() length = ap_type.byte_width() data = ap.GetValue(self.index) return cp.PyBytes_FromStringAndSize(data, length) @@ -258,7 +258,7 @@ cdef dict _scalar_classes = { Type_LIST: ListValue, Type_BINARY: BinaryValue, Type_STRING: StringValue, - Type_FIXED_WIDTH_BINARY: FixedWidthBinaryValue, + Type_FIXED_SIZE_BINARY: FixedSizeBinaryValue, } cdef object box_scalar(DataType type, const shared_ptr[CArray]& sp_array, diff --git a/python/pyarrow/schema.pxd b/python/pyarrow/schema.pxd index c0c2c709b2744..94d65bfc157a1 100644 --- a/python/pyarrow/schema.pxd +++ b/python/pyarrow/schema.pxd @@ -19,7 +19,7 @@ from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport (CDataType, CDictionaryType, CTimestampType, - CFixedWidthBinaryType, + CFixedSizeBinaryType, CField, CSchema) cdef class DataType: @@ -40,9 +40,9 @@ cdef class TimestampType(DataType): const CTimestampType* ts_type -cdef class FixedWidthBinaryType(DataType): +cdef class FixedSizeBinaryType(DataType): cdef: - const CFixedWidthBinaryType* fixed_width_binary_type + const CFixedSizeBinaryType* fixed_size_binary_type cdef class 
Field: diff --git a/python/pyarrow/schema.pyx b/python/pyarrow/schema.pyx index 532a318840caf..06df64461ae22 100644 --- a/python/pyarrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ -28,7 +28,7 @@ from pyarrow.compat import frombytes, tobytes from pyarrow.array cimport Array from pyarrow.error cimport check_status from pyarrow.includes.libarrow cimport (CDataType, CStructType, CListType, - CFixedWidthBinaryType, + CFixedSizeBinaryType, TimeUnit_SECOND, TimeUnit_MILLI, TimeUnit_MICRO, TimeUnit_NANO, Type, TimeUnit) @@ -91,16 +91,16 @@ cdef class TimestampType(DataType): return None -cdef class FixedWidthBinaryType(DataType): +cdef class FixedSizeBinaryType(DataType): cdef init(self, const shared_ptr[CDataType]& type): DataType.init(self, type) - self.fixed_width_binary_type = type.get() + self.fixed_size_binary_type = type.get() property byte_width: def __get__(self): - return self.fixed_width_binary_type.byte_width() + return self.fixed_size_binary_type.byte_width() cdef class Field: @@ -362,16 +362,16 @@ def binary(int length=-1): ---------- length : int, optional, default -1 If length == -1 then return a variable length binary type. If length is - greater than or equal to 0 then return a fixed width binary type of + greater than or equal to 0 then return a fixed size binary type of width `length`. """ if length == -1: return primitive_type(la.Type_BINARY) - cdef FixedWidthBinaryType out = FixedWidthBinaryType() - cdef shared_ptr[CDataType] fixed_width_binary_type - fixed_width_binary_type.reset(new CFixedWidthBinaryType(length)) - out.init(fixed_width_binary_type) + cdef FixedSizeBinaryType out = FixedSizeBinaryType() + cdef shared_ptr[CDataType] fixed_size_binary_type + fixed_size_binary_type.reset(new CFixedSizeBinaryType(length)) + out.init(fixed_size_binary_type) return out @@ -428,8 +428,8 @@ cdef DataType box_data_type(const shared_ptr[CDataType]& type): out = DictionaryType() elif type.get().type == la.Type_TIMESTAMP: out = TimestampType() - elif type.get().type == la.Type_FIXED_WIDTH_BINARY: - out = FixedWidthBinaryType() + elif type.get().type == la.Type_FIXED_SIZE_BINARY: + out = FixedSizeBinaryType() else: out = DataType() diff --git a/python/pyarrow/tests/test_convert_builtin.py b/python/pyarrow/tests/test_convert_builtin.py index 3309ba018628d..bb6d2d17d5f0c 100644 --- a/python/pyarrow/tests/test_convert_builtin.py +++ b/python/pyarrow/tests/test_convert_builtin.py @@ -92,7 +92,7 @@ def test_bytes(self): assert arr.type == pyarrow.binary() assert arr.to_pylist() == [b'foo', u1, None] - def test_fixed_width_bytes(self): + def test_fixed_size_bytes(self): data = [b'foof', None, b'barb', b'2346'] arr = pyarrow.from_pylist(data, type=pyarrow.binary(4)) assert len(arr) == 4 @@ -100,7 +100,7 @@ def test_fixed_width_bytes(self): assert arr.type == pyarrow.binary(4) assert arr.to_pylist() == data - def test_fixed_width_bytes_does_not_accept_varying_lengths(self): + def test_fixed_size_bytes_does_not_accept_varying_lengths(self): data = [b'foo', None, b'barb', b'2346'] with self.assertRaises(pyarrow.error.ArrowException): pyarrow.from_pylist(data, type=pyarrow.binary(4)) diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index 3f19b68fe0a03..c472ee69034c8 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -244,7 +244,7 @@ def test_bytes_to_binary(self): expected = pd.DataFrame({'strings': values2}) self._check_pandas_roundtrip(df, expected) - def 
test_fixed_width_bytes(self): + def test_fixed_size_bytes(self): values = [b'foo', None, b'bar', None, None, b'hey'] df = pd.DataFrame({'strings': values}) schema = A.Schema.from_fields([A.field('strings', A.binary(3))]) @@ -254,7 +254,7 @@ def test_fixed_width_bytes(self): result = table.to_pandas() tm.assert_frame_equal(result, df) - def test_fixed_width_bytes_does_not_accept_varying_lengths(self): + def test_fixed_size_bytes_does_not_accept_varying_lengths(self): values = [b'foo', None, b'ba', None, None, b'hey'] df = pd.DataFrame({'strings': values}) schema = A.Schema.from_fields([A.field('strings', A.binary(3))]) diff --git a/python/pyarrow/tests/test_scalars.py b/python/pyarrow/tests/test_scalars.py index 265ce8d3a58a1..a5db7e0835607 100644 --- a/python/pyarrow/tests/test_scalars.py +++ b/python/pyarrow/tests/test_scalars.py @@ -87,12 +87,12 @@ def test_bytes(self): assert v == b'bar' assert isinstance(v, bytes) - def test_fixed_width_bytes(self): + def test_fixed_size_bytes(self): data = [b'foof', None, b'barb'] arr = A.from_pylist(data, type=A.binary(4)) v = arr[0] - assert isinstance(v, A.FixedWidthBinaryValue) + assert isinstance(v, A.FixedSizeBinaryValue) assert v.as_py() == b'foof' assert arr[1] is A.NA From fd000964d218b355e725d8eced1d1301f36dc092 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sat, 1 Apr 2017 11:19:09 -0400 Subject: [PATCH 0448/1644] ARROW-723: [Python] Ensure that passing chunk_size=0 when writing Parquet file does not enter infinite loop This should also be fixed in parquet-cpp, will open a JIRA. Author: Wes McKinney Closes #468 from wesm/ARROW-723 and squashes the following commits: f938703 [Wes McKinney] Raise if row group size is 0, use default if -1 5f83850 [Wes McKinney] Ensure that passing chunk_size=0 when writing Parquet file does not enter infinite loop --- python/pyarrow/_parquet.pyx | 5 ++++- python/pyarrow/parquet.py | 2 +- python/pyarrow/tests/test_parquet.py | 17 +++++++++++++++++ 3 files changed, 22 insertions(+), 2 deletions(-) diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx index 8e67da9f75a6e..c4cbd28e85dab 100644 --- a/python/pyarrow/_parquet.pyx +++ b/python/pyarrow/_parquet.pyx @@ -538,10 +538,13 @@ cdef class ParquetWriter: def write_table(self, Table table, row_group_size=None): cdef CTable* ctable = table.table - if row_group_size is None: + if row_group_size is None or row_group_size == -1: row_group_size = ctable.num_rows() + elif row_group_size == 0: + raise ValueError('Row group size cannot be 0') cdef int c_row_group_size = row_group_size + with nogil: check_status(WriteTable(deref(ctable), self.allocator, self.sink, c_row_group_size, diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py index fa96f95698013..2985316f35f01 100644 --- a/python/pyarrow/parquet.py +++ b/python/pyarrow/parquet.py @@ -187,7 +187,7 @@ def write_table(table, sink, chunk_size=None, version='1.0', ---------- table : pyarrow.Table sink: string or pyarrow.io.NativeFile - chunk_size : int + chunk_size : int, default None The maximum number of rows in each Parquet RowGroup. As a default, we will write a single RowGroup per file. 
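The new guard is easiest to check end to end. A minimal sketch, following the regression test added further down in this patch (the usual pa/pq/io imports are assumed):

    import io
    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.Table.from_pandas(pd.DataFrame({'A': range(4)}))
    buf = io.BytesIO()

    # None and -1 both fall back to a single RowGroup holding the whole table
    pq.write_table(table, buf, chunk_size=-1)

    # chunk_size=0 previously entered an infinite loop; it now raises
    try:
        pq.write_table(table, buf, chunk_size=0)
    except ValueError:
        pass  # expected: row group size cannot be 0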
version : {"1.0", "2.0"}, default "1.0" diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index fc32b9fac8b98..b8b2800259caf 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -365,6 +365,23 @@ def test_multithreaded_read(): assert table1.equals(table2) +@parquet +def test_min_chunksize(): + data = pd.DataFrame([np.arange(4)], columns=['A', 'B', 'C', 'D']) + table = pa.Table.from_pandas(data.reset_index()) + + buf = io.BytesIO() + pq.write_table(table, buf, chunk_size=-1) + + buf.seek(0) + result = pq.read_table(buf) + + assert result.equals(table) + + with pytest.raises(ValueError): + pq.write_table(table, buf, chunk_size=0) + + @parquet def test_pass_separate_metadata(): # ARROW-471 From 31a1f53f4990d07a337ea0b000e04df2917b6d73 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sat, 1 Apr 2017 11:19:40 -0400 Subject: [PATCH 0449/1644] ARROW-710: [Python] Read/write with file-like Python objects from read_feather/write_feather cc @jreback Author: Wes McKinney Closes #474 from wesm/ARROW-710 and squashes the following commits: 61d7218 [Wes McKinney] Do not close OutputStream in Feather writer. Read and write to file-like Python objects --- cpp/src/arrow/ipc/feather-test.cc | 3 +- cpp/src/arrow/ipc/feather.cc | 25 +--- cpp/src/arrow/ipc/feather.h | 7 +- python/CMakeLists.txt | 1 - python/pyarrow/_feather.pyx | 158 ----------------------- python/pyarrow/feather.py | 14 +- python/pyarrow/includes/libarrow_ipc.pxd | 31 ++++- python/pyarrow/io.pyx | 101 ++++++++++++++- python/pyarrow/tests/test_feather.py | 17 ++- python/setup.py | 1 - 10 files changed, 160 insertions(+), 198 deletions(-) delete mode 100644 python/pyarrow/_feather.pyx diff --git a/cpp/src/arrow/ipc/feather-test.cc b/cpp/src/arrow/ipc/feather-test.cc index e181f6933541b..077a44b896fc1 100644 --- a/cpp/src/arrow/ipc/feather-test.cc +++ b/cpp/src/arrow/ipc/feather-test.cc @@ -272,8 +272,7 @@ class TestTableWriter : public ::testing::Test { ASSERT_OK(stream_->Finish(&output_)); std::shared_ptr buffer(new io::BufferReader(output_)); - reader_.reset(new TableReader()); - ASSERT_OK(reader_->Open(buffer)); + ASSERT_OK(TableReader::Open(buffer, &reader_)); } void CheckBatch(const RecordBatch& batch) { diff --git a/cpp/src/arrow/ipc/feather.cc b/cpp/src/arrow/ipc/feather.cc index 5820563b43834..e838e1fdbcd61 100644 --- a/cpp/src/arrow/ipc/feather.cc +++ b/cpp/src/arrow/ipc/feather.cc @@ -401,16 +401,10 @@ TableReader::TableReader() { TableReader::~TableReader() {} -Status TableReader::Open(const std::shared_ptr& source) { - return impl_->Open(source); -} - -Status TableReader::OpenFile( - const std::string& abspath, std::unique_ptr* out) { - std::shared_ptr file; - RETURN_NOT_OK(io::MemoryMappedFile::Open(abspath, io::FileMode::READ, &file)); +Status TableReader::Open(const std::shared_ptr& source, + std::unique_ptr* out) { out->reset(new TableReader()); - return (*out)->Open(file); + return (*out)->impl_->Open(source); } bool TableReader::HasDescription() const { @@ -517,9 +511,8 @@ class TableWriter::TableWriterImpl : public ArrayVisitor { // Footer: metadata length, magic bytes RETURN_NOT_OK( stream_->Write(reinterpret_cast(&buffer_size), sizeof(uint32_t))); - RETURN_NOT_OK(stream_->Write(reinterpret_cast(kFeatherMagicBytes), - strlen(kFeatherMagicBytes))); - return stream_->Close(); + return stream_->Write( + reinterpret_cast(kFeatherMagicBytes), strlen(kFeatherMagicBytes)); } Status LoadArrayMetadata(const Array& values, ArrayMetadata* meta) { @@ 
-700,14 +693,6 @@ Status TableWriter::Open( return (*out)->impl_->Open(stream); } -Status TableWriter::OpenFile( - const std::string& abspath, std::unique_ptr* out) { - std::shared_ptr file; - RETURN_NOT_OK(io::FileOutputStream::Open(abspath, &file)); - out->reset(new TableWriter()); - return (*out)->impl_->Open(file); -} - void TableWriter::SetDescription(const std::string& desc) { impl_->SetDescription(desc); } diff --git a/cpp/src/arrow/ipc/feather.h b/cpp/src/arrow/ipc/feather.h index 1e4ba58255456..8cc8ca092a1b2 100644 --- a/cpp/src/arrow/ipc/feather.h +++ b/cpp/src/arrow/ipc/feather.h @@ -54,9 +54,8 @@ class ARROW_EXPORT TableReader { TableReader(); ~TableReader(); - Status Open(const std::shared_ptr& source); - - static Status OpenFile(const std::string& abspath, std::unique_ptr* out); + static Status Open(const std::shared_ptr& source, + std::unique_ptr* out); // Optional table description // @@ -86,8 +85,6 @@ class ARROW_EXPORT TableWriter { static Status Open( const std::shared_ptr& stream, std::unique_ptr* out); - static Status OpenFile(const std::string& abspath, std::unique_ptr* out); - void SetDescription(const std::string& desc); void SetNumRows(int64_t num_rows); diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 35a1a89ef3164..f315d019bb4de 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -268,7 +268,6 @@ set(CYTHON_EXTENSIONS config error io - _feather memory scalar schema diff --git a/python/pyarrow/_feather.pyx b/python/pyarrow/_feather.pyx deleted file mode 100644 index beb4aaad44618..0000000000000 --- a/python/pyarrow/_feather.pyx +++ /dev/null @@ -1,158 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. 
- -# cython: profile=False -# distutils: language = c++ -# cython: embedsignature = True - -from cython.operator cimport dereference as deref - -from pyarrow.includes.common cimport * -from pyarrow.includes.libarrow cimport CArray, CColumn, CSchema, CStatus -from pyarrow.includes.libarrow_io cimport RandomAccessFile, OutputStream - -from libcpp.string cimport string -from libcpp cimport bool as c_bool - -cimport cpython - -from pyarrow.compat import frombytes, tobytes, encode_file_path - -from pyarrow.array cimport Array -from pyarrow.error cimport check_status -from pyarrow.table cimport Column - -cdef extern from "arrow/ipc/feather.h" namespace "arrow::ipc::feather" nogil: - - cdef cppclass TableWriter: - @staticmethod - CStatus Open(const shared_ptr[OutputStream]& stream, - unique_ptr[TableWriter]* out) - - @staticmethod - CStatus OpenFile(const string& abspath, unique_ptr[TableWriter]* out) - - void SetDescription(const string& desc) - void SetNumRows(int64_t num_rows) - - CStatus Append(const string& name, const CArray& values) - CStatus Finalize() - - cdef cppclass TableReader: - TableReader(const shared_ptr[RandomAccessFile]& source) - - @staticmethod - CStatus OpenFile(const string& abspath, unique_ptr[TableReader]* out) - - string GetDescription() - c_bool HasDescription() - - int64_t num_rows() - int64_t num_columns() - - shared_ptr[CSchema] schema() - - CStatus GetColumn(int i, shared_ptr[CColumn]* out) - c_string GetColumnName(int i) - - -class FeatherError(Exception): - pass - - -cdef class FeatherWriter: - cdef: - unique_ptr[TableWriter] writer - - cdef public: - int64_t num_rows - - def __cinit__(self): - self.num_rows = -1 - - def open(self, object dest): - cdef: - string c_name = encode_file_path(dest) - - check_status(TableWriter.OpenFile(c_name, &self.writer)) - - def close(self): - if self.num_rows < 0: - self.num_rows = 0 - self.writer.get().SetNumRows(self.num_rows) - check_status(self.writer.get().Finalize()) - - def write_array(self, object name, object col, object mask=None): - cdef Array arr - - if self.num_rows >= 0: - if len(col) != self.num_rows: - raise ValueError('prior column had a different number of rows') - else: - self.num_rows = len(col) - - if isinstance(col, Array): - arr = col - else: - arr = Array.from_pandas(col, mask=mask) - - cdef c_string c_name = tobytes(name) - - with nogil: - check_status( - self.writer.get().Append(c_name, deref(arr.sp_array))) - - -cdef class FeatherReader: - cdef: - unique_ptr[TableReader] reader - - def __cinit__(self): - pass - - def open(self, source): - cdef: - string c_name = encode_file_path(source) - - check_status(TableReader.OpenFile(c_name, &self.reader)) - - property num_rows: - - def __get__(self): - return self.reader.get().num_rows() - - property num_columns: - - def __get__(self): - return self.reader.get().num_columns() - - def get_column_name(self, int i): - cdef c_string name = self.reader.get().GetColumnName(i) - return frombytes(name) - - def get_column(self, int i): - if i < 0 or i >= self.num_columns: - raise IndexError(i) - - cdef shared_ptr[CColumn] sp_column - with nogil: - check_status(self.reader.get() - .GetColumn(i, &sp_column)) - - cdef Column col = Column() - col.init(sp_column) - return col diff --git a/python/pyarrow/feather.py b/python/pyarrow/feather.py index 28424afb093b5..f87c7f3a95ee4 100644 --- a/python/pyarrow/feather.py +++ b/python/pyarrow/feather.py @@ -20,9 +20,9 @@ import pandas as pd from pyarrow.compat import pdapi -from pyarrow._feather import FeatherError # noqa +from 
pyarrow.io import FeatherError # noqa from pyarrow.table import Table -import pyarrow._feather as ext +import pyarrow.io as ext if LooseVersion(pd.__version__) < '0.17.0': @@ -54,12 +54,12 @@ def read(self, columns=None): return table.to_pandas() -def write_feather(df, path): +def write_feather(df, dest): ''' Write a pandas.DataFrame to Feather format ''' writer = ext.FeatherWriter() - writer.open(path) + writer.open(dest) if isinstance(df, pd.SparseDataFrame): df = df.to_dense() @@ -95,13 +95,13 @@ def write_feather(df, path): writer.close() -def read_feather(path, columns=None): +def read_feather(source, columns=None): """ Read a pandas.DataFrame from Feather format Parameters ---------- - path : string, path to read from + source : string file path, or file-like object columns : sequence, optional Only read a specific set of columns. If not provided, all columns are read @@ -110,5 +110,5 @@ def read_feather(path, columns=None): ------- df : pandas.DataFrame """ - reader = FeatherReader(path) + reader = FeatherReader(source) return reader.read(columns=columns) diff --git a/python/pyarrow/includes/libarrow_ipc.pxd b/python/pyarrow/includes/libarrow_ipc.pxd index 8b7d705afd4e7..59fd90bdac7a8 100644 --- a/python/pyarrow/includes/libarrow_ipc.pxd +++ b/python/pyarrow/includes/libarrow_ipc.pxd @@ -18,7 +18,7 @@ # distutils: language = c++ from pyarrow.includes.common cimport * -from pyarrow.includes.libarrow cimport (CArray, CSchema, CRecordBatch) +from pyarrow.includes.libarrow cimport (CArray, CColumn, CSchema, CRecordBatch) from pyarrow.includes.libarrow_io cimport (InputStream, OutputStream, RandomAccessFile) @@ -63,3 +63,32 @@ cdef extern from "arrow/ipc/api.h" namespace "arrow::ipc" nogil: int num_record_batches() CStatus GetRecordBatch(int i, shared_ptr[CRecordBatch]* batch) + +cdef extern from "arrow/ipc/feather.h" namespace "arrow::ipc::feather" nogil: + + cdef cppclass CFeatherWriter" arrow::ipc::feather::TableWriter": + @staticmethod + CStatus Open(const shared_ptr[OutputStream]& stream, + unique_ptr[CFeatherWriter]* out) + + void SetDescription(const c_string& desc) + void SetNumRows(int64_t num_rows) + + CStatus Append(const c_string& name, const CArray& values) + CStatus Finalize() + + cdef cppclass CFeatherReader" arrow::ipc::feather::TableReader": + @staticmethod + CStatus Open(const shared_ptr[RandomAccessFile]& file, + unique_ptr[CFeatherReader]* out) + + c_string GetDescription() + c_bool HasDescription() + + int64_t num_rows() + int64_t num_columns() + + shared_ptr[CSchema] schema() + + CStatus GetColumn(int i, shared_ptr[CColumn]* out) + c_string GetColumnName(int i) diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index d64427aa36ef5..0b27379c273b0 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -32,10 +32,11 @@ from pyarrow.includes.libarrow_ipc cimport * cimport pyarrow.includes.pyarrow as pyarrow from pyarrow.compat import frombytes, tobytes, encode_file_path +from pyarrow.array cimport Array from pyarrow.error cimport check_status from pyarrow.memory cimport MemoryPool, maybe_unbox_memory_pool from pyarrow.schema cimport Schema -from pyarrow.table cimport (RecordBatch, batch_from_cbatch, +from pyarrow.table cimport (Column, RecordBatch, batch_from_cbatch, table_from_ctable) cimport cpython as cp @@ -564,7 +565,9 @@ cdef get_reader(object source, shared_ptr[RandomAccessFile]* reader): cdef get_writer(object source, shared_ptr[OutputStream]* writer): cdef NativeFile nf - if not isinstance(source, NativeFile) and hasattr(source, 'write'): + 
if isinstance(source, six.string_types): + source = OSFile(source, mode='w') + elif not isinstance(source, NativeFile) and hasattr(source, 'write'): # Optimistically hope this is file-like source = PythonFileInterface(source, mode='w') @@ -1047,3 +1050,97 @@ cdef class _FileReader: check_status(CTable.FromRecordBatches(batches, &table)) return table_from_ctable(table) + + +#---------------------------------------------------------------------- +# Implement legacy Feather file format + + +class FeatherError(Exception): + pass + + +cdef class FeatherWriter: + cdef: + unique_ptr[CFeatherWriter] writer + + cdef public: + int64_t num_rows + + def __cinit__(self): + self.num_rows = -1 + + def open(self, object dest): + cdef shared_ptr[OutputStream] sink + get_writer(dest, &sink) + + with nogil: + check_status(CFeatherWriter.Open(sink, &self.writer)) + + def close(self): + if self.num_rows < 0: + self.num_rows = 0 + self.writer.get().SetNumRows(self.num_rows) + check_status(self.writer.get().Finalize()) + + def write_array(self, object name, object col, object mask=None): + cdef Array arr + + if self.num_rows >= 0: + if len(col) != self.num_rows: + raise ValueError('prior column had a different number of rows') + else: + self.num_rows = len(col) + + if isinstance(col, Array): + arr = col + else: + arr = Array.from_pandas(col, mask=mask) + + cdef c_string c_name = tobytes(name) + + with nogil: + check_status( + self.writer.get().Append(c_name, deref(arr.sp_array))) + + +cdef class FeatherReader: + cdef: + unique_ptr[CFeatherReader] reader + + def __cinit__(self): + pass + + def open(self, source): + cdef shared_ptr[RandomAccessFile] reader + get_reader(source, &reader) + + with nogil: + check_status(CFeatherReader.Open(reader, &self.reader)) + + property num_rows: + + def __get__(self): + return self.reader.get().num_rows() + + property num_columns: + + def __get__(self): + return self.reader.get().num_columns() + + def get_column_name(self, int i): + cdef c_string name = self.reader.get().GetColumnName(i) + return frombytes(name) + + def get_column(self, int i): + if i < 0 or i >= self.num_columns: + raise IndexError(i) + + cdef shared_ptr[CColumn] sp_column + with nogil: + check_status(self.reader.get() + .GetColumn(i, &sp_column)) + + cdef Column col = Column() + col.init(sp_column) + return col diff --git a/python/pyarrow/tests/test_feather.py b/python/pyarrow/tests/test_feather.py index e4b6273ffccf4..dd6888f2d1306 100644 --- a/python/pyarrow/tests/test_feather.py +++ b/python/pyarrow/tests/test_feather.py @@ -27,7 +27,7 @@ from pyarrow.compat import guid from pyarrow.feather import (read_feather, write_feather, FeatherReader) -from pyarrow._feather import FeatherWriter +from pyarrow.io import FeatherWriter def random_path(): @@ -347,6 +347,21 @@ def test_overwritten_file(self): df = pd.DataFrame({'ints': values[0: num_values//2]}) self._check_pandas_roundtrip(df, path=path) + def test_filelike_objects(self): + from io import BytesIO + + buf = BytesIO() + + # the copy makes it non-strided + df = pd.DataFrame(np.arange(12).reshape(4, 3), + columns=['a', 'b', 'c']).copy() + write_feather(df, buf) + + buf.seek(0) + + result = read_feather(buf) + assert_frame_equal(result, df) + def test_sparse_dataframe(self): # GH #221 data = {'A': [0, 1, 2], diff --git a/python/setup.py b/python/setup.py index 9ff091819c760..12b44e1bad520 100644 --- a/python/setup.py +++ b/python/setup.py @@ -104,7 +104,6 @@ def initialize_options(self): 'io', 'jemalloc', 'memory', - '_feather', '_parquet', 'scalar', 'schema', 
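With get_reader/get_writer doing the dispatch, Feather I/O now accepts file paths and arbitrary file-like Python objects alike. A minimal sketch of the roundtrip, modeled on the test_filelike_objects test in this patch:

    import io
    import pandas as pd
    from pyarrow.feather import read_feather, write_feather

    df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

    buf = io.BytesIO()        # any object with write() works now
    write_feather(df, buf)

    buf.seek(0)               # the writer no longer closes the stream
    result = read_feather(buf)
    assert result.equals(df)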
From 651ea9247c42b889da457432d4ff13b558e8bec1 Mon Sep 17 00:00:00 2001 From: Phillip Cloud Date: Sat, 1 Apr 2017 13:18:40 -0400 Subject: [PATCH 0450/1644] ARROW-745: [C++] Allow use of system cpplint You can do `pip install cpplint` to take advantage of this. Author: Phillip Cloud Closes #476 from cpcloud/ARROW-745 and squashes the following commits: e43d20c [Phillip Cloud] ARROW-745: [C++] Allow use of system cpplint --- cpp/CMakeLists.txt | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 5dcf58c0f232d..7a5a0e68874ac 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -741,8 +741,11 @@ if (UNIX) ENDIF() ENDFOREACH(item ${LINT_FILES}) + find_program(CPPLINT_BIN NAMES cpplint cpplint.py HINTS ${BUILD_SUPPORT_DIR}) + message(STATUS "Found cpplint executable at ${CPPLINT_BIN}") + # Full lint - add_custom_target(lint ${BUILD_SUPPORT_DIR}/cpplint.py + add_custom_target(lint ${CPPLINT_BIN} --verbose=2 --linelength=90 --filter=-whitespace/comments,-readability/todo,-build/header_guard,-build/c++11,-runtime/references,-build/include_order From baf38e47a7d73d87017699304dcbe15f297c9284 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 2 Apr 2017 10:04:11 +0200 Subject: [PATCH 0451/1644] ARROW-747: [C++] Calling add_dependencies with dl causes spurious CMake warning I added an option to make the dependency targets (e.g. external projects) in libraries more explicit. Author: Wes McKinney Closes #472 from wesm/ARROW-747 and squashes the following commits: c60832f [Wes McKinney] Add DEPENDENCIES argument to ADD_ARROW_LIB to fix spurious dl dependency issue --- cpp/CMakeLists.txt | 9 +--- cpp/cmake_modules/BuildUtils.cmake | 18 ++----- cpp/src/arrow/io/CMakeLists.txt | 4 +- cpp/src/arrow/ipc/CMakeLists.txt | 74 ++++++++++++--------------- cpp/src/arrow/jemalloc/CMakeLists.txt | 5 ++ 5 files changed, 46 insertions(+), 64 deletions(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 7a5a0e68874ac..aacc7a15fffc9 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -654,13 +654,8 @@ if (ARROW_JEMALLOC) include_directories(SYSTEM ${JEMALLOC_INCLUDE_DIR}) ADD_THIRDPARTY_LIB(jemalloc - STATIC_LIB ${JEMALLOC_STATIC_LIB} - SHARED_LIB ${JEMALLOC_SHARED_LIB}) - - if (JEMALLOC_VENDORED) - add_dependencies(jemalloc_shared jemalloc_ep) - add_dependencies(jemalloc_static jemalloc_ep) - endif() + STATIC_LIB ${JEMALLOC_STATIC_LIB} + SHARED_LIB ${JEMALLOC_SHARED_LIB}) endif() ## Google PerfTools diff --git a/cpp/cmake_modules/BuildUtils.cmake b/cpp/cmake_modules/BuildUtils.cmake index 43d984045eb20..3a3b53678f6e5 100644 --- a/cpp/cmake_modules/BuildUtils.cmake +++ b/cpp/cmake_modules/BuildUtils.cmake @@ -85,26 +85,18 @@ endfunction() function(ADD_ARROW_LIB LIB_NAME) set(options) set(one_value_args SHARED_LINK_FLAGS) - set(multi_value_args SOURCES STATIC_LINK_LIBS STATIC_PRIVATE_LINK_LIBS SHARED_LINK_LIBS SHARED_PRIVATE_LINK_LIBS) + set(multi_value_args SOURCES STATIC_LINK_LIBS STATIC_PRIVATE_LINK_LIBS SHARED_LINK_LIBS SHARED_PRIVATE_LINK_LIBS DEPENDENCIES) cmake_parse_arguments(ARG "${options}" "${one_value_args}" "${multi_value_args}" ${ARGN}) if(ARG_UNPARSED_ARGUMENTS) message(SEND_ERROR "Error: unrecognized arguments: ${ARG_UNPARSED_ARGUMENTS}") endif() add_library(${LIB_NAME}_objlib OBJECT - ${ARG_SOURCES} + ${ARG_SOURCES} ) - if (ARG_STATIC_LINK_LIBS) - add_dependencies(${LIB_NAME}_objlib ${ARG_STATIC_LINK_LIBS}) - endif() - if (ARG_STATIC_PRIVATE_LINK_LIBS) - add_dependencies(${LIB_NAME}_objlib 
${ARG_STATIC_PRIVATE_LINK_LIBS}) - endif() - if (ARG_SHARED_LINK_LIBS) - add_dependencies(${LIB_NAME}_objlib ${ARG_SHARED_LINK_LIBS}) - endif() - if(ARG_SHARED_PRIVATE_LINK_LIBS) - add_dependencies(${LIB_NAME}_objlib ${ARG_SHARED_PRIVATE_LINK_LIBS}) + + if (ARG_DEPENDENCIES) + add_dependencies(${LIB_NAME}_objlib ${ARG_DEPENDENCIES}) endif() # Necessary to make static linking into other shared libraries work properly diff --git a/cpp/src/arrow/io/CMakeLists.txt b/cpp/src/arrow/io/CMakeLists.txt index 8aabf6496f8f7..3951eac322c6a 100644 --- a/cpp/src/arrow/io/CMakeLists.txt +++ b/cpp/src/arrow/io/CMakeLists.txt @@ -48,11 +48,11 @@ if (MSVC) else() set(ARROW_IO_STATIC_LINK_LIBS arrow_static - dl + ${CMAKE_DL_LIBS} ) set(ARROW_IO_SHARED_LINK_LIBS arrow_shared - dl + ${CMAKE_DL_LIBS} ) endif() diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index 31a04dfc07818..5fa7d6125ce5e 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -26,7 +26,8 @@ set(ARROW_IPC_SHARED_LINK_LIBS set(ARROW_IPC_TEST_LINK_LIBS arrow_ipc_static - arrow_io_static) + arrow_io_static + arrow_static) set(ARROW_IPC_SRCS feather.cc @@ -44,20 +45,22 @@ if(NOT APPLE) set(ARROW_IPC_LINK_FLAGS "-Wl,--version-script=${CMAKE_CURRENT_SOURCE_DIR}/symbols.map") endif() -ADD_ARROW_LIB(arrow_ipc - SOURCES ${ARROW_IPC_SRCS} - SHARED_LINK_FLAGS ${ARROW_IPC_LINK_FLAGS} - SHARED_LINK_LIBS ${ARROW_IPC_SHARED_LINK_LIBS} - STATIC_LINK_LIBS ${ARROW_IO_SHARED_PRIVATE_LINK_LIBS} -) - if(RAPIDJSON_VENDORED) - add_dependencies(arrow_ipc_objlib rapidjson_ep) + set(IPC_DEPENDENCIES ${IPC_DEPENDENCIES} rapidjson_ep) endif() + if(FLATBUFFERS_VENDORED) - add_dependencies(arrow_ipc_objlib flatbuffers_ep) + set(IPC_DEPENDENCIES ${IPC_DEPENDENCIES} flatbuffers_ep) endif() +ADD_ARROW_LIB(arrow_ipc + SOURCES ${ARROW_IPC_SRCS} + DEPENDENCIES ${IPC_DEPENDENCIES} + SHARED_LINK_FLAGS ${ARROW_IPC_LINK_FLAGS} + SHARED_LINK_LIBS ${ARROW_IPC_SHARED_LINK_LIBS} + STATIC_LINK_LIBS ${ARROW_IO_SHARED_PRIVATE_LINK_LIBS} +) + ADD_ARROW_TEST(feather-test) ARROW_TEST_LINK_LIBRARIES(feather-test ${ARROW_IPC_TEST_LINK_LIBS}) @@ -71,40 +74,27 @@ ARROW_TEST_LINK_LIBRARIES(ipc-json-test ${ARROW_IPC_TEST_LINK_LIBS}) ADD_ARROW_TEST(json-integration-test) +ARROW_TEST_LINK_LIBRARIES(json-integration-test + ${ARROW_IPC_TEST_LINK_LIBS}) if (ARROW_BUILD_TESTS) - if (APPLE) - target_link_libraries(json-integration-test - arrow_ipc_static - arrow_io_static - arrow_static - gflags - gtest - ${BOOST_FILESYSTEM_LIBRARY} - ${BOOST_SYSTEM_LIBRARY} - dl) - set_target_properties(json-integration-test - PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") - elseif (MSVC) - target_link_libraries(json-integration-test - arrow_ipc_static - arrow_io_static - arrow_static - gflags - gtest - ${BOOST_FILESYSTEM_LIBRARY} - ${BOOST_SYSTEM_LIBRARY}) - else() - target_link_libraries(json-integration-test - arrow_ipc_static - arrow_io_static - arrow_static - gflags - gtest - pthread - ${BOOST_FILESYSTEM_LIBRARY} - ${BOOST_SYSTEM_LIBRARY} - dl) + target_link_libraries(json-integration-test + gflags + gtest + ${BOOST_FILESYSTEM_LIBRARY} + ${BOOST_SYSTEM_LIBRARY}) + + if (UNIX) + if (APPLE) + target_link_libraries(json-integration-test + ${CMAKE_DL_LIBS}) + set_target_properties(json-integration-test + PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") + else() + target_link_libraries(json-integration-test + pthread + ${CMAKE_DL_LIBS}) + endif() endif() endif() diff --git a/cpp/src/arrow/jemalloc/CMakeLists.txt 
b/cpp/src/arrow/jemalloc/CMakeLists.txt index b8e6e231a3dca..7b627ac97b884 100644 --- a/cpp/src/arrow/jemalloc/CMakeLists.txt +++ b/cpp/src/arrow/jemalloc/CMakeLists.txt @@ -84,8 +84,13 @@ if(NOT APPLE) set(ARROW_JEMALLOC_LINK_FLAGS "-Wl,--version-script=${CMAKE_CURRENT_SOURCE_DIR}/symbols.map") endif() +if (JEMALLOC_VENDORED) + set(JEMALLOC_DEPENDENCIES jemalloc_ep) +endif() + ADD_ARROW_LIB(arrow_jemalloc SOURCES ${ARROW_JEMALLOC_SRCS} + DEPENDENCIES ${JEMALLOC_DEPENDENCIES} SHARED_LINK_FLAGS ${ARROW_JEMALLOC_LINK_FLAGS} SHARED_LINK_LIBS ${ARROW_JEMALLOC_SHARED_LINK_LIBS} SHARED_PRIVATE_LINK_LIBS ${ARROW_JEMALLOC_SHARED_PRIVATE_LINK_LIBS} From 7fec7d30c75dd8910522002bb6bb640330834b90 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Sun, 2 Apr 2017 11:29:14 -0400 Subject: [PATCH 0452/1644] ARROW-754: [GLib] Add garrow_array_is_null() Author: Kouhei Sutou Closes #480 from kou/glib-support-array-is-null and squashes the following commits: 5c4259b [Kouhei Sutou] [GLib] Add garrow_array_is_null() --- c_glib/arrow-glib/array.cpp | 14 ++++++++++++++ c_glib/arrow-glib/array.h | 2 ++ c_glib/test/test-array.rb | 9 +++++++++ 3 files changed, 25 insertions(+) diff --git a/c_glib/arrow-glib/array.cpp b/c_glib/arrow-glib/array.cpp index b084054f9af87..caf2eb55d6b2c 100644 --- a/c_glib/arrow-glib/array.cpp +++ b/c_glib/arrow-glib/array.cpp @@ -135,6 +135,20 @@ garrow_array_class_init(GArrowArrayClass *klass) g_object_class_install_property(gobject_class, PROP_ARRAY, spec); } +/** + * garrow_array_is_null: + * @array: A #GArrowArray. + * @i: The index of the target value. + * + * Returns: Whether the i-th value is null or not. + */ +gboolean +garrow_array_is_null(GArrowArray *array, gint64 i) +{ + auto arrow_array = garrow_array_get_raw(array); + return arrow_array->IsNull(i); +} + /** * garrow_array_get_length: * @array: A #GArrowArray. diff --git a/c_glib/arrow-glib/array.h b/c_glib/arrow-glib/array.h index 6467db5ff45db..957b4416fa581 100644 --- a/c_glib/arrow-glib/array.h +++ b/c_glib/arrow-glib/array.h @@ -57,6 +57,8 @@ struct _GArrowArrayClass GType garrow_array_get_type (void) G_GNUC_CONST; +gboolean garrow_array_is_null (GArrowArray *array, + gint64 i); gint64 garrow_array_get_length (GArrowArray *array); gint64 garrow_array_get_offset (GArrowArray *array); gint64 garrow_array_get_n_nulls (GArrowArray *array); diff --git a/c_glib/test/test-array.rb b/c_glib/test/test-array.rb index c427f0200ef02..08908b08961a7 100644 --- a/c_glib/test/test-array.rb +++ b/c_glib/test/test-array.rb @@ -16,6 +16,15 @@ # under the License. class TestArray < Test::Unit::TestCase + def test_is_null + builder = Arrow::BooleanArrayBuilder.new + builder.append_null + builder.append(true) + array = builder.finish + assert_equal([true, false], + array.length.times.collect {|i| array.null?(i)}) + end + def test_length builder = Arrow::BooleanArrayBuilder.new builder.append(true) From e333576a0d215e97cc4e2a218ddc56ee1242986d Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 2 Apr 2017 11:59:18 -0400 Subject: [PATCH 0453/1644] ARROW-718: [Python] Implement pyarrow.Tensor container, zero-copy NumPy roundtrips Author: Wes McKinney Closes #477 from wesm/ARROW-718 and squashes the following commits: 2c23427 [Wes McKinney] Restore clang-format-3.9 formatting eb21a17 [Wes McKinney] Finish basic tensor zero-copy roundtrips, simple repr. 
flake8 fixes 4cf6d2b [Wes McKinney] Draft tensor conversion to/from numpy --- cpp/src/arrow/CMakeLists.txt | 1 + cpp/src/arrow/api.h | 1 + cpp/src/arrow/python/CMakeLists.txt | 10 + cpp/src/arrow/python/api.h | 1 + cpp/src/arrow/python/common.h | 17 -- cpp/src/arrow/python/helpers.cc | 1 + cpp/src/arrow/python/numpy_convert.cc | 267 +++++++++++++++++++ cpp/src/arrow/python/numpy_convert.h | 69 +++++ cpp/src/arrow/python/pandas_convert.cc | 88 +----- cpp/src/arrow/python/pandas_convert.h | 3 - cpp/src/arrow/python/python-test.cc | 2 +- cpp/src/arrow/tensor.cc | 14 +- cpp/src/arrow/tensor.h | 7 + python/pyarrow/__init__.py | 6 +- python/pyarrow/array.pxd | 13 +- python/pyarrow/array.pyx | 97 ++++++- python/pyarrow/includes/libarrow.pxd | 17 ++ python/pyarrow/includes/pyarrow.pxd | 12 +- python/pyarrow/io.pyx | 2 +- python/pyarrow/schema.pyx | 14 +- python/pyarrow/table.pyx | 2 +- python/pyarrow/tests/pandas_examples.py | 5 +- python/pyarrow/tests/test_convert_builtin.py | 2 +- python/pyarrow/tests/test_convert_pandas.py | 15 +- python/pyarrow/tests/test_feather.py | 2 - python/pyarrow/tests/test_jemalloc.py | 17 +- 26 files changed, 541 insertions(+), 144 deletions(-) create mode 100644 cpp/src/arrow/python/numpy_convert.cc create mode 100644 cpp/src/arrow/python/numpy_convert.h diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index bd33bf5b8296e..8eaa76ae9e843 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -28,6 +28,7 @@ install(FILES pretty_print.h status.h table.h + tensor.h type.h type_fwd.h type_traits.h diff --git a/cpp/src/arrow/api.h b/cpp/src/arrow/api.h index 50a09515297ff..aa0da7580244a 100644 --- a/cpp/src/arrow/api.h +++ b/cpp/src/arrow/api.h @@ -29,6 +29,7 @@ #include "arrow/pretty_print.h" #include "arrow/status.h" #include "arrow/table.h" +#include "arrow/tensor.h" #include "arrow/type.h" #include "arrow/visitor.h" diff --git a/cpp/src/arrow/python/CMakeLists.txt b/cpp/src/arrow/python/CMakeLists.txt index faaad89656f92..a8b4cc7ff1ded 100644 --- a/cpp/src/arrow/python/CMakeLists.txt +++ b/cpp/src/arrow/python/CMakeLists.txt @@ -55,6 +55,7 @@ set(ARROW_PYTHON_SRCS config.cc helpers.cc io.cc + numpy_convert.cc pandas_convert.cc ) @@ -71,6 +72,14 @@ ADD_ARROW_LIB(arrow_python STATIC_LINK_LIBS ${ARROW_IO_SHARED_PRIVATE_LINK_LIBS} ) +if ("${COMPILER_FAMILY}" STREQUAL "clang") + # Clang, be quiet. 
Python C API has lots of macros + set_property(SOURCE ${ARROW_PYTHON_SRCS} + APPEND_STRING + PROPERTY + COMPILE_FLAGS -Wno-parentheses-equality) +endif() + install(FILES api.h builtin_convert.h @@ -79,6 +88,7 @@ install(FILES do_import_numpy.h helpers.h io.h + numpy_convert.h numpy_interop.h pandas_convert.h type_traits.h diff --git a/cpp/src/arrow/python/api.h b/cpp/src/arrow/python/api.h index f4f1c0cf9a5d6..895d1f447ff58 100644 --- a/cpp/src/arrow/python/api.h +++ b/cpp/src/arrow/python/api.h @@ -22,6 +22,7 @@ #include "arrow/python/common.h" #include "arrow/python/helpers.h" #include "arrow/python/io.h" +#include "arrow/python/numpy_convert.h" #include "arrow/python/pandas_convert.h" #endif // ARROW_PYTHON_API_H diff --git a/cpp/src/arrow/python/common.h b/cpp/src/arrow/python/common.h index f1be471cd3a83..32bfa784acbd0 100644 --- a/cpp/src/arrow/python/common.h +++ b/cpp/src/arrow/python/common.h @@ -103,23 +103,6 @@ struct PyObjectStringify { ARROW_EXPORT void set_default_memory_pool(MemoryPool* pool); ARROW_EXPORT MemoryPool* get_memory_pool(); -class ARROW_EXPORT NumPyBuffer : public Buffer { - public: - explicit NumPyBuffer(PyArrayObject* arr) : Buffer(nullptr, 0) { - arr_ = arr; - Py_INCREF(arr); - - data_ = reinterpret_cast(PyArray_DATA(arr_)); - size_ = PyArray_SIZE(arr_) * PyArray_DESCR(arr_)->elsize; - capacity_ = size_; - } - - virtual ~NumPyBuffer() { Py_XDECREF(arr_); } - - private: - PyArrayObject* arr_; -}; - class ARROW_EXPORT PyBuffer : public Buffer { public: /// Note that the GIL must be held when calling the PyBuffer constructor. diff --git a/cpp/src/arrow/python/helpers.cc b/cpp/src/arrow/python/helpers.cc index add2d9a222adf..be5f412fbea1c 100644 --- a/cpp/src/arrow/python/helpers.cc +++ b/cpp/src/arrow/python/helpers.cc @@ -42,6 +42,7 @@ std::shared_ptr GetPrimitiveType(Type::type type) { GET_PRIMITIVE_TYPE(DATE32, date32); GET_PRIMITIVE_TYPE(DATE64, date64); GET_PRIMITIVE_TYPE(BOOL, boolean); + GET_PRIMITIVE_TYPE(HALF_FLOAT, float16); GET_PRIMITIVE_TYPE(FLOAT, float32); GET_PRIMITIVE_TYPE(DOUBLE, float64); GET_PRIMITIVE_TYPE(BINARY, binary); diff --git a/cpp/src/arrow/python/numpy_convert.cc b/cpp/src/arrow/python/numpy_convert.cc new file mode 100644 index 0000000000000..3697819120dbe --- /dev/null +++ b/cpp/src/arrow/python/numpy_convert.cc @@ -0,0 +1,267 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include + +#include "arrow/python/numpy_convert.h" +#include "arrow/python/numpy_interop.h" + +#include +#include +#include +#include +#include + +#include "arrow/buffer.h" +#include "arrow/tensor.h" +#include "arrow/type.h" + +#include "arrow/python/common.h" +#include "arrow/python/type_traits.h" + +namespace arrow { +namespace py { + +bool is_contiguous(PyObject* array) { + if (PyArray_Check(array)) { + return PyArray_FLAGS(reinterpret_cast(array)) & + (NPY_ARRAY_C_CONTIGUOUS | NPY_ARRAY_F_CONTIGUOUS); + } else { + return false; + } +} + +int cast_npy_type_compat(int type_num) { +// Both LONGLONG and INT64 can be observed in the wild, which is buggy. We set +// U/LONGLONG to U/INT64 so things work properly. + +#if (NPY_INT64 == NPY_LONGLONG) && (NPY_SIZEOF_LONGLONG == 8) + if (type_num == NPY_LONGLONG) { type_num = NPY_INT64; } + if (type_num == NPY_ULONGLONG) { type_num = NPY_UINT64; } +#endif + + return type_num; +} + +NumPyBuffer::NumPyBuffer(PyObject* ao) : Buffer(nullptr, 0) { + arr_ = ao; + Py_INCREF(ao); + + if (PyArray_Check(ao)) { + PyArrayObject* ndarray = reinterpret_cast(ao); + data_ = reinterpret_cast(PyArray_DATA(ndarray)); + size_ = PyArray_SIZE(ndarray) * PyArray_DESCR(ndarray)->elsize; + capacity_ = size_; + + if (PyArray_FLAGS(ndarray) & NPY_ARRAY_WRITEABLE) { is_mutable_ = true; } + } +} + +NumPyBuffer::~NumPyBuffer() { + Py_XDECREF(arr_); +} + +#define TO_ARROW_TYPE_CASE(NPY_NAME, FACTORY) \ + case NPY_##NPY_NAME: \ + *out = FACTORY(); \ + break; + +Status GetTensorType(PyObject* dtype, std::shared_ptr* out) { + PyArray_Descr* descr = reinterpret_cast(dtype); + int type_num = cast_npy_type_compat(descr->type_num); + + switch (type_num) { + TO_ARROW_TYPE_CASE(BOOL, uint8); + TO_ARROW_TYPE_CASE(INT8, int8); + TO_ARROW_TYPE_CASE(INT16, int16); + TO_ARROW_TYPE_CASE(INT32, int32); + TO_ARROW_TYPE_CASE(INT64, int64); +#if (NPY_INT64 != NPY_LONGLONG) + TO_ARROW_TYPE_CASE(LONGLONG, int64); +#endif + TO_ARROW_TYPE_CASE(UINT8, uint8); + TO_ARROW_TYPE_CASE(UINT16, uint16); + TO_ARROW_TYPE_CASE(UINT32, uint32); + TO_ARROW_TYPE_CASE(UINT64, uint64); +#if (NPY_UINT64 != NPY_ULONGLONG) + TO_ARROW_CASE(ULONGLONG); +#endif + TO_ARROW_TYPE_CASE(FLOAT16, float16); + TO_ARROW_TYPE_CASE(FLOAT32, float32); + TO_ARROW_TYPE_CASE(FLOAT64, float64); + default: { + std::stringstream ss; + ss << "Unsupported numpy type " << descr->type_num << std::endl; + return Status::NotImplemented(ss.str()); + } + } + return Status::OK(); +} + +Status GetNumPyType(const DataType& type, int* type_num) { +#define NUMPY_TYPE_CASE(ARROW_NAME, NPY_NAME) \ + case Type::ARROW_NAME: \ + *type_num = NPY_##NPY_NAME; \ + break; + + switch (type.type) { + NUMPY_TYPE_CASE(UINT8, UINT8); + NUMPY_TYPE_CASE(INT8, INT8); + NUMPY_TYPE_CASE(UINT16, UINT16); + NUMPY_TYPE_CASE(INT16, INT16); + NUMPY_TYPE_CASE(UINT32, UINT32); + NUMPY_TYPE_CASE(INT32, INT32); + NUMPY_TYPE_CASE(UINT64, UINT64); + NUMPY_TYPE_CASE(INT64, INT64); + NUMPY_TYPE_CASE(HALF_FLOAT, FLOAT16); + NUMPY_TYPE_CASE(FLOAT, FLOAT32); + NUMPY_TYPE_CASE(DOUBLE, FLOAT64); + default: { + std::stringstream ss; + ss << "Unsupported tensor type: " << type.ToString() << std::endl; + return Status::NotImplemented(ss.str()); + } + } +#undef NUMPY_TYPE_CASE + + return Status::OK(); +} + +Status NumPyDtypeToArrow(PyObject* dtype, std::shared_ptr* out) { + PyArray_Descr* descr = reinterpret_cast(dtype); + + int type_num = cast_npy_type_compat(descr->type_num); + + switch (type_num) { + TO_ARROW_TYPE_CASE(BOOL, boolean); + TO_ARROW_TYPE_CASE(INT8, int8); + 
TO_ARROW_TYPE_CASE(INT16, int16); + TO_ARROW_TYPE_CASE(INT32, int32); + TO_ARROW_TYPE_CASE(INT64, int64); +#if (NPY_INT64 != NPY_LONGLONG) + TO_ARROW_TYPE_CASE(LONGLONG, int64); +#endif + TO_ARROW_TYPE_CASE(UINT8, uint8); + TO_ARROW_TYPE_CASE(UINT16, uint16); + TO_ARROW_TYPE_CASE(UINT32, uint32); + TO_ARROW_TYPE_CASE(UINT64, uint64); +#if (NPY_UINT64 != NPY_ULONGLONG) + TO_ARROW_CASE(ULONGLONG); +#endif + TO_ARROW_TYPE_CASE(FLOAT32, float32); + TO_ARROW_TYPE_CASE(FLOAT64, float64); + case NPY_DATETIME: { + auto date_dtype = + reinterpret_cast(descr->c_metadata); + TimeUnit unit; + switch (date_dtype->meta.base) { + case NPY_FR_s: + unit = TimeUnit::SECOND; + break; + case NPY_FR_ms: + unit = TimeUnit::MILLI; + break; + case NPY_FR_us: + unit = TimeUnit::MICRO; + break; + case NPY_FR_ns: + unit = TimeUnit::NANO; + break; + default: + return Status::NotImplemented("Unsupported datetime64 time unit"); + } + *out = timestamp(unit); + } break; + default: { + std::stringstream ss; + ss << "Unsupported numpy type " << descr->type_num << std::endl; + return Status::NotImplemented(ss.str()); + } + } + + return Status::OK(); +} + +#undef TO_ARROW_TYPE_CASE + +Status NdarrayToTensor(MemoryPool* pool, PyObject* ao, std::shared_ptr* out) { + if (!PyArray_Check(ao)) { return Status::TypeError("Did not pass ndarray object"); } + + PyArrayObject* ndarray = reinterpret_cast(ao); + + // TODO(wesm): What do we want to do with non-contiguous memory and negative strides? + + int ndim = PyArray_NDIM(ndarray); + + std::shared_ptr data = std::make_shared(ao); + std::vector shape(ndim); + std::vector strides(ndim); + + npy_intp* array_strides = PyArray_STRIDES(ndarray); + npy_intp* array_shape = PyArray_SHAPE(ndarray); + for (int i = 0; i < ndim; ++i) { + if (array_strides[i] < 0) { + return Status::Invalid("Negative ndarray strides not supported"); + } + shape[i] = array_shape[i]; + strides[i] = array_strides[i]; + } + + std::shared_ptr type; + RETURN_NOT_OK( + GetTensorType(reinterpret_cast(PyArray_DESCR(ndarray)), &type)); + return MakeTensor(type, data, shape, strides, {}, out); +} + +Status TensorToNdarray(const Tensor& tensor, PyObject* base, PyObject** out) { + int type_num; + RETURN_NOT_OK(GetNumPyType(*tensor.type(), &type_num)); + PyArray_Descr* dtype = PyArray_DescrNewFromType(type_num); + RETURN_IF_PYERROR(); + + std::vector npy_shape(tensor.ndim()); + std::vector npy_strides(tensor.ndim()); + + for (int i = 0; i < tensor.ndim(); ++i) { + npy_shape[i] = tensor.shape()[i]; + npy_strides[i] = tensor.strides()[i]; + } + + const void* immutable_data = nullptr; + if (tensor.data()) { immutable_data = tensor.data()->data(); } + + // Remove const =( + void* mutable_data = const_cast(immutable_data); + + int array_flags = 0; + if (tensor.is_row_major()) { array_flags |= NPY_ARRAY_C_CONTIGUOUS; } + if (tensor.is_column_major()) { array_flags |= NPY_ARRAY_F_CONTIGUOUS; } + if (tensor.is_mutable()) { array_flags |= NPY_ARRAY_WRITEABLE; } + + PyObject* result = PyArray_NewFromDescr(&PyArray_Type, dtype, tensor.ndim(), + npy_shape.data(), npy_strides.data(), mutable_data, array_flags, nullptr); + RETURN_IF_PYERROR() + + if (base != Py_None) { + PyArray_SetBaseObject(reinterpret_cast(result), base); + } + *out = result; + return Status::OK(); +} + +} // namespace py +} // namespace arrow diff --git a/cpp/src/arrow/python/numpy_convert.h b/cpp/src/arrow/python/numpy_convert.h new file mode 100644 index 0000000000000..685a626d4ca28 --- /dev/null +++ b/cpp/src/arrow/python/numpy_convert.h @@ -0,0 +1,69 @@ +// Licensed 
to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +// Functions for converting between pandas's NumPy-based data representation +// and Arrow data structures + +#ifndef ARROW_PYTHON_NUMPY_CONVERT_H +#define ARROW_PYTHON_NUMPY_CONVERT_H + +#include + +#include +#include + +#include "arrow/buffer.h" +#include "arrow/util/visibility.h" + +namespace arrow { + +struct DataType; +class MemoryPool; +class Status; +class Tensor; + +namespace py { + +class ARROW_EXPORT NumPyBuffer : public Buffer { + public: + explicit NumPyBuffer(PyObject* arr); + virtual ~NumPyBuffer(); + + private: + PyObject* arr_; +}; + +// Handle misbehaved types like LONGLONG and ULONGLONG +int cast_npy_type_compat(int type_num); + +bool is_contiguous(PyObject* array); + +ARROW_EXPORT +Status NumPyDtypeToArrow(PyObject* dtype, std::shared_ptr* out); + +Status GetTensorType(PyObject* dtype, std::shared_ptr* out); +Status GetNumPyType(const DataType& type, int* type_num); + +ARROW_EXPORT Status NdarrayToTensor( + MemoryPool* pool, PyObject* ao, std::shared_ptr* out); + +ARROW_EXPORT Status TensorToNdarray(const Tensor& tensor, PyObject* base, PyObject** out); + +} // namespace py +} // namespace arrow + +#endif // ARROW_PYTHON_NUMPY_CONVERT_H diff --git a/cpp/src/arrow/python/pandas_convert.cc b/cpp/src/arrow/python/pandas_convert.cc index ddfec1bf45a2e..01019e5669f2d 100644 --- a/cpp/src/arrow/python/pandas_convert.cc +++ b/cpp/src/arrow/python/pandas_convert.cc @@ -36,11 +36,6 @@ #include "arrow/array.h" #include "arrow/loader.h" -#include "arrow/python/builtin_convert.h" -#include "arrow/python/common.h" -#include "arrow/python/config.h" -#include "arrow/python/type_traits.h" -#include "arrow/python/util/datetime.h" #include "arrow/status.h" #include "arrow/table.h" #include "arrow/type_fwd.h" @@ -49,24 +44,19 @@ #include "arrow/util/logging.h" #include "arrow/util/macros.h" +#include "arrow/python/builtin_convert.h" +#include "arrow/python/common.h" +#include "arrow/python/config.h" +#include "arrow/python/numpy_convert.h" +#include "arrow/python/type_traits.h" +#include "arrow/python/util/datetime.h" + namespace arrow { namespace py { // ---------------------------------------------------------------------- // Utility code -int cast_npy_type_compat(int type_num) { -// Both LONGLONG and INT64 can be observed in the wild, which is buggy. We set -// U/LONGLONG to U/INT64 so things work properly. 
- -#if (NPY_INT64 == NPY_LONGLONG) && (NPY_SIZEOF_LONGLONG == 8) - if (type_num == NPY_LONGLONG) { type_num = NPY_INT64; } - if (type_num == NPY_ULONGLONG) { type_num = NPY_UINT64; } -#endif - - return type_num; -} - static inline bool PyObject_is_null(const PyObject* obj) { return obj == Py_None || obj == numpy_nan; } @@ -395,7 +385,7 @@ inline Status PandasConverter::ConvertData(std::shared_ptr* data) { return Status::NotImplemented("NumPy type casts not yet implemented"); } - *data = std::make_shared(arr_); + *data = std::make_shared(reinterpret_cast(arr_)); return Status::OK(); } @@ -730,68 +720,6 @@ Status PandasObjectsToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo, return converter.ConvertObjects(out); } -Status PandasDtypeToArrow(PyObject* dtype, std::shared_ptr* out) { - PyArray_Descr* descr = reinterpret_cast(dtype); - - int type_num = cast_npy_type_compat(descr->type_num); - -#define TO_ARROW_TYPE_CASE(NPY_NAME, FACTORY) \ - case NPY_##NPY_NAME: \ - *out = FACTORY(); \ - break; - - switch (type_num) { - TO_ARROW_TYPE_CASE(BOOL, boolean); - TO_ARROW_TYPE_CASE(INT8, int8); - TO_ARROW_TYPE_CASE(INT16, int16); - TO_ARROW_TYPE_CASE(INT32, int32); - TO_ARROW_TYPE_CASE(INT64, int64); -#if (NPY_INT64 != NPY_LONGLONG) - TO_ARROW_TYPE_CASE(LONGLONG, int64); -#endif - TO_ARROW_TYPE_CASE(UINT8, uint8); - TO_ARROW_TYPE_CASE(UINT16, uint16); - TO_ARROW_TYPE_CASE(UINT32, uint32); - TO_ARROW_TYPE_CASE(UINT64, uint64); -#if (NPY_UINT64 != NPY_ULONGLONG) - TO_ARROW_CASE(ULONGLONG); -#endif - TO_ARROW_TYPE_CASE(FLOAT32, float32); - TO_ARROW_TYPE_CASE(FLOAT64, float64); - case NPY_DATETIME: { - auto date_dtype = - reinterpret_cast(descr->c_metadata); - TimeUnit unit; - switch (date_dtype->meta.base) { - case NPY_FR_s: - unit = TimeUnit::SECOND; - break; - case NPY_FR_ms: - unit = TimeUnit::MILLI; - break; - case NPY_FR_us: - unit = TimeUnit::MICRO; - break; - case NPY_FR_ns: - unit = TimeUnit::NANO; - break; - default: - return Status::NotImplemented("Unsupported datetime64 time unit"); - } - *out = timestamp(unit); - } break; - default: { - std::stringstream ss; - ss << "Unsupported numpy type " << descr->type_num << std::endl; - return Status::NotImplemented(ss.str()); - } - } - -#undef TO_ARROW_TYPE_CASE - - return Status::OK(); -} - // ---------------------------------------------------------------------- // pandas 0.x DataFrame conversion internals diff --git a/cpp/src/arrow/python/pandas_convert.h b/cpp/src/arrow/python/pandas_convert.h index 105c1598d3936..8fd31076a994f 100644 --- a/cpp/src/arrow/python/pandas_convert.h +++ b/cpp/src/arrow/python/pandas_convert.h @@ -61,9 +61,6 @@ ARROW_EXPORT Status ConvertTableToPandas( const std::shared_ptr
& table, int nthreads, PyObject** out); -ARROW_EXPORT -Status PandasDtypeToArrow(PyObject* dtype, std::shared_ptr* out); - ARROW_EXPORT Status PandasToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo, const std::shared_ptr& type, std::shared_ptr* out); diff --git a/cpp/src/arrow/python/python-test.cc b/cpp/src/arrow/python/python-test.cc index 01e30f5a36ce8..f269ebfb642c7 100644 --- a/cpp/src/arrow/python/python-test.cc +++ b/cpp/src/arrow/python/python-test.cc @@ -26,9 +26,9 @@ #include "arrow/table.h" #include "arrow/test-util.h" +#include "arrow/python/builtin_convert.h" #include "arrow/python/common.h" #include "arrow/python/pandas_convert.h" -#include "arrow/python/builtin_convert.h" namespace arrow { namespace py { diff --git a/cpp/src/arrow/tensor.cc b/cpp/src/arrow/tensor.cc index 9a8de5119ea58..8bbb97b596e18 100644 --- a/cpp/src/arrow/tensor.cc +++ b/cpp/src/arrow/tensor.cc @@ -90,13 +90,21 @@ int64_t Tensor::size() const { } bool Tensor::is_contiguous() const { - std::vector c_strides; - std::vector f_strides; + return is_row_major() || is_column_major(); +} +bool Tensor::is_row_major() const { + std::vector c_strides; const auto& fw_type = static_cast(*type_); ComputeRowMajorStrides(fw_type, shape_, &c_strides); + return strides_ == c_strides; +} + +bool Tensor::is_column_major() const { + std::vector f_strides; + const auto& fw_type = static_cast(*type_); ComputeColumnMajorStrides(fw_type, shape_, &f_strides); - return strides_ == c_strides || strides_ == f_strides; + return strides_ == f_strides; } bool Tensor::Equals(const Tensor& other) const { diff --git a/cpp/src/arrow/tensor.h b/cpp/src/arrow/tensor.h index eeb5c3e8e5536..12015f14b1d3d 100644 --- a/cpp/src/arrow/tensor.h +++ b/cpp/src/arrow/tensor.h @@ -89,8 +89,15 @@ class ARROW_EXPORT Tensor { /// Return true if the underlying data buffer is mutable bool is_mutable() const { return data_->is_mutable(); } + /// Either row major or column major bool is_contiguous() const; + /// AKA "C order" + bool is_row_major() const; + + /// AKA "Fortran order" + bool is_column_major() const; + Type::type type_enum() const { return type_->type; } bool Equals(const Tensor& other) const; diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 3df2a1d445549..5215028c90f0d 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -28,8 +28,7 @@ import pyarrow.config from pyarrow.config import cpu_count, set_cpu_count -from pyarrow.array import (Array, - from_pandas_series, from_pylist, +from pyarrow.array import (Array, Tensor, from_pylist, NumericArray, IntegerArray, FloatingPointArray, BooleanArray, Int8Array, UInt8Array, @@ -63,7 +62,8 @@ int8, int16, int32, int64, uint8, uint16, uint32, uint64, timestamp, date32, date64, - float_, double, binary, string, + float16, float32, float64, + binary, string, list_, struct, dictionary, field, DataType, FixedSizeBinaryType, Field, Schema, schema) diff --git a/python/pyarrow/array.pxd b/python/pyarrow/array.pxd index 0b5f33d0d2db6..42675630fd51b 100644 --- a/python/pyarrow/array.pxd +++ b/python/pyarrow/array.pxd @@ -16,7 +16,7 @@ # under the License. 
from pyarrow.includes.common cimport shared_ptr, int64_t -from pyarrow.includes.libarrow cimport CArray +from pyarrow.includes.libarrow cimport CArray, CTensor from pyarrow.scalar import NA @@ -41,6 +41,17 @@ cdef class Array: cdef getitem(self, int64_t i) +cdef class Tensor: + cdef: + shared_ptr[CTensor] sp_tensor + CTensor* tp + + cdef readonly: + DataType type + + cdef init(self, const shared_ptr[CTensor]& sp_tensor) + + cdef object box_array(const shared_ptr[CArray]& sp_array) diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index b9799f15bf3e7..398e4cbffa94d 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -81,7 +81,7 @@ cdef class Array: self.type = box_data_type(self.sp_array.get().type()) @staticmethod - def from_pandas(obj, mask=None, DataType type=None, + def from_numpy(obj, mask=None, DataType type=None, timestamps_to_ms=False, MemoryPool memory_pool=None): """ @@ -116,7 +116,7 @@ cdef class Array: >>> import pandas as pd >>> import pyarrow as pa - >>> pa.Array.from_pandas(pd.Series([1, 2])) + >>> pa.Array.from_numpy(pd.Series([1, 2])) [ 1, @@ -124,7 +124,7 @@ cdef class Array: ] >>> import numpy as np - >>> pa.Array.from_pandas(pd.Series([1, 2]), np.array([0, 1], + >>> pa.Array.from_numpy(pd.Series([1, 2]), np.array([0, 1], ... dtype=bool)) [ @@ -166,7 +166,7 @@ cdef class Array: values, obj.dtype, type, timestamps_to_ms=timestamps_to_ms) if type is None: - check_status(pyarrow.PandasDtypeToArrow(values.dtype, &c_type)) + check_status(pyarrow.NumPyDtypeToArrow(values.dtype, &c_type)) else: c_type = type.sp_type @@ -316,6 +316,77 @@ cdef class Array: return [x.as_py() for x in self] +cdef class Tensor: + + cdef init(self, const shared_ptr[CTensor]& sp_tensor): + self.sp_tensor = sp_tensor + self.tp = sp_tensor.get() + self.type = box_data_type(self.tp.type()) + + def __repr__(self): + return """ +type: {0} +shape: {1} +strides: {2}""".format(self.type, self.shape, self.strides) + + @staticmethod + def from_numpy(obj): + cdef shared_ptr[CTensor] ctensor + check_status(pyarrow.NdarrayToTensor(default_memory_pool(), + obj, &ctensor)) + return box_tensor(ctensor) + + def to_numpy(self): + """ + Convert arrow::Tensor to numpy.ndarray with zero copy + """ + cdef: + PyObject* out + + check_status(pyarrow.TensorToNdarray(deref(self.tp), self, + &out)) + return PyObject_to_object(out) + + property is_mutable: + + def __get__(self): + return self.tp.is_mutable() + + property is_contiguous: + + def __get__(self): + return self.tp.is_contiguous() + + property ndim: + + def __get__(self): + return self.tp.ndim() + + property size: + + def __get__(self): + return self.tp.size() + + property shape: + + def __get__(self): + cdef size_t i + py_shape = [] + for i in range(self.tp.shape().size()): + py_shape.append(self.tp.shape()[i]) + return py_shape + + property strides: + + def __get__(self): + cdef size_t i + py_strides = [] + for i in range(self.tp.strides().size()): + py_strides.append(self.tp.strides()[i]) + return py_strides + + + cdef wrap_array_output(PyObject* output): cdef object obj = PyObject_to_object(output) @@ -479,10 +550,10 @@ cdef class DictionaryArray(Array): else: mask = mask | (indices == -1) - arrow_indices = Array.from_pandas(indices, mask=mask, - memory_pool=memory_pool) - arrow_dictionary = Array.from_pandas(dictionary, - memory_pool=memory_pool) + arrow_indices = Array.from_numpy(indices, mask=mask, + memory_pool=memory_pool) + arrow_dictionary = Array.from_numpy(dictionary, + memory_pool=memory_pool) if not 
isinstance(arrow_indices, IntegerArray): raise ValueError('Indices must be integer type') @@ -535,6 +606,15 @@ cdef object box_array(const shared_ptr[CArray]& sp_array): return arr +cdef object box_tensor(const shared_ptr[CTensor]& sp_tensor): + if sp_tensor.get() == NULL: + raise ValueError('Tensor was NULL') + + cdef Tensor tensor = Tensor() + tensor.init(sp_tensor) + return tensor + + cdef object get_series_values(object obj): import pandas as pd @@ -549,4 +629,3 @@ cdef object get_series_values(object obj): from_pylist = Array.from_list -from_pandas_series = Array.from_pandas diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index f549884d175fa..8da063cbdc364 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -35,6 +35,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: Type_UINT64" arrow::Type::UINT64" Type_INT64" arrow::Type::INT64" + Type_HALF_FLOAT" arrow::Type::HALF_FLOAT" Type_FLOAT" arrow::Type::FLOAT" Type_DOUBLE" arrow::Type::DOUBLE" @@ -282,6 +283,22 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: CStatus RemoveColumn(int i, shared_ptr[CTable]* out) + cdef cppclass CTensor" arrow::Tensor": + shared_ptr[CDataType] type() + shared_ptr[CBuffer] data() + + const vector[int64_t]& shape() + const vector[int64_t]& strides() + int64_t size() + + int ndim() + const c_string& dim_name(int i) + + c_bool is_mutable() + c_bool is_contiguous() + Type type_enum() + c_bool Equals(const CTensor& other) + CStatus ConcatenateTables(const vector[shared_ptr[CTable]]& tables, shared_ptr[CTable]* result) diff --git a/python/pyarrow/includes/pyarrow.pxd b/python/pyarrow/includes/pyarrow.pxd index 8142c1c06ff75..9b64435e48d7f 100644 --- a/python/pyarrow/includes/pyarrow.pxd +++ b/python/pyarrow/includes/pyarrow.pxd @@ -18,8 +18,8 @@ # distutils: language = c++ from pyarrow.includes.common cimport * -from pyarrow.includes.libarrow cimport (CArray, CBuffer, CColumn, - CTable, CDataType, CStatus, Type, +from pyarrow.includes.libarrow cimport (CArray, CBuffer, CColumn, CDataType, + CTable, CTensor, CStatus, Type, CMemoryPool, TimeUnit) cimport pyarrow.includes.libarrow_io as arrow_io @@ -34,7 +34,7 @@ cdef extern from "arrow/python/api.h" namespace "arrow::py" nogil: shared_ptr[CArray]* out, const shared_ptr[CDataType]& type) - CStatus PandasDtypeToArrow(object dtype, shared_ptr[CDataType]* type) + CStatus NumPyDtypeToArrow(object dtype, shared_ptr[CDataType]* type) CStatus PandasToArrow(CMemoryPool* pool, object ao, object mo, const shared_ptr[CDataType]& type, @@ -44,6 +44,12 @@ cdef extern from "arrow/python/api.h" namespace "arrow::py" nogil: const shared_ptr[CDataType]& type, shared_ptr[CArray]* out) + CStatus NdarrayToTensor(CMemoryPool* pool, object ao, + shared_ptr[CTensor]* out); + + CStatus TensorToNdarray(const CTensor& tensor, PyObject* base, + PyObject** out) + CStatus ConvertArrayToPandas(const shared_ptr[CArray]& arr, PyObject* py_ref, PyObject** out) diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index 0b27379c273b0..608b20d896ae3 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -1095,7 +1095,7 @@ cdef class FeatherWriter: if isinstance(col, Array): arr = col else: - arr = Array.from_pandas(col, mask=mask) + arr = Array.from_numpy(col, mask=mask) cdef c_string c_name = tobytes(name) diff --git a/python/pyarrow/schema.pyx b/python/pyarrow/schema.pyx index 06df64461ae22..253be4590b518 100644 --- a/python/pyarrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ 
-241,7 +241,9 @@ cdef set PRIMITIVE_TYPES = set([ la.Type_UINT32, la.Type_INT32, la.Type_UINT64, la.Type_INT64, la.Type_TIMESTAMP, la.Type_DATE32, - la.Type_DATE64, la.Type_FLOAT, + la.Type_DATE64, + la.Type_HALF_FLOAT, + la.Type_FLOAT, la.Type_DOUBLE]) @@ -340,11 +342,15 @@ def date64(): return primitive_type(la.Type_DATE64) -def float_(): +def float16(): + return primitive_type(la.Type_HALF_FLOAT) + + +def float32(): return primitive_type(la.Type_FLOAT) -def double(): +def float64(): return primitive_type(la.Type_DOUBLE) @@ -452,6 +458,6 @@ cdef Schema box_schema(const shared_ptr[CSchema]& type): def type_from_numpy_dtype(object dtype): cdef shared_ptr[CDataType] c_type with nogil: - check_status(pyarrow.PandasDtypeToArrow(dtype, &c_type)) + check_status(pyarrow.NumPyDtypeToArrow(dtype, &c_type)) return box_data_type(c_type) diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index e6fddbd0cfbbd..94389a73cc974 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -309,7 +309,7 @@ cdef _dataframe_to_arrays(df, timestamps_to_ms, Schema schema): if schema is not None: type = schema.field_by_name(name).type - arr = Array.from_pandas(col, type=type, + arr = Array.from_numpy(col, type=type, timestamps_to_ms=timestamps_to_ms) names.append(name) arrays.append(arr) diff --git a/python/pyarrow/tests/pandas_examples.py b/python/pyarrow/tests/pandas_examples.py index c9343fce233d2..e081c38713057 100644 --- a/python/pyarrow/tests/pandas_examples.py +++ b/python/pyarrow/tests/pandas_examples.py @@ -37,7 +37,7 @@ def dataframe_with_arrays(): ('i4', pa.int32()), ('i8', pa.int64()), ('u1', pa.uint8()), ('u2', pa.uint16()), ('u4', pa.uint32()), ('u8', pa.uint64()), - ('f4', pa.float_()), ('f8', pa.double())] + ('f4', pa.float32()), ('f8', pa.float64())] arrays = OrderedDict() fields = [] @@ -77,6 +77,7 @@ def dataframe_with_arrays(): return df, schema + def dataframe_with_lists(): """ Dataframe with list columns of every possible primtive type. 
@@ -97,7 +98,7 @@ def dataframe_with_lists(): None, [0] ] - fields.append(pa.field('double', pa.list_(pa.double()))) + fields.append(pa.field('double', pa.list_(pa.float64()))) arrays['double'] = [ [0., 1., 2., 3., 4., 5., 6., 7., 8., 9.], [0., 1., 2., 3., 4.], diff --git a/python/pyarrow/tests/test_convert_builtin.py b/python/pyarrow/tests/test_convert_builtin.py index bb6d2d17d5f0c..15fca560c6513 100644 --- a/python/pyarrow/tests/test_convert_builtin.py +++ b/python/pyarrow/tests/test_convert_builtin.py @@ -70,7 +70,7 @@ def test_double(self): arr = pyarrow.from_pylist(data) assert len(arr) == 6 assert arr.null_count == 3 - assert arr.type == pyarrow.double() + assert arr.type == pyarrow.float64() assert arr.to_pylist() == data def test_unicode(self): diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index c472ee69034c8..0b3c02e9945eb 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -22,7 +22,6 @@ import unittest import numpy as np -import numpy.testing as npt import pandas as pd import pandas.util.testing as tm @@ -78,8 +77,8 @@ def _check_pandas_roundtrip(self, df, expected=None, nthreads=1, def _check_array_roundtrip(self, values, expected=None, timestamps_to_ms=False, type=None): - arr = A.Array.from_pandas(values, timestamps_to_ms=timestamps_to_ms, - type=type) + arr = A.Array.from_numpy(values, timestamps_to_ms=timestamps_to_ms, + type=type) result = arr.to_pandas() assert arr.null_count == pd.isnull(values).sum() @@ -90,7 +89,7 @@ def _check_array_roundtrip(self, values, expected=None, def test_float_no_nulls(self): data = {} fields = [] - dtypes = [('f4', A.float_()), ('f8', A.double())] + dtypes = [('f4', A.float32()), ('f8', A.float64())] num_values = 100 for numpy_dtype, arrow_dtype in dtypes: @@ -106,7 +105,7 @@ def test_float_nulls(self): num_values = 100 null_mask = np.random.randint(0, 10, size=num_values) < 3 - dtypes = [('f4', A.float_()), ('f8', A.double())] + dtypes = [('f4', A.float32()), ('f8', A.float64())] names = ['f4', 'f8'] expected_cols = [] @@ -115,7 +114,7 @@ def test_float_nulls(self): for name, arrow_dtype in dtypes: values = np.random.randn(num_values).astype(name) - arr = A.from_pandas_series(values, null_mask) + arr = A.Array.from_numpy(values, null_mask) arrays.append(arr) fields.append(A.Field.from_py(name, arrow_dtype)) values[null_mask] = np.nan @@ -168,7 +167,7 @@ def test_integer_with_nulls(self): for name in int_dtypes: values = np.random.randint(0, 100, size=num_values) - arr = A.from_pandas_series(values, null_mask) + arr = A.Array.from_numpy(values, null_mask) arrays.append(arr) expected = values.astype('f8') @@ -202,7 +201,7 @@ def test_boolean_nulls(self): mask = np.random.randint(0, 10, size=num_values) < 3 values = np.random.randint(0, 10, size=num_values) < 5 - arr = A.from_pandas_series(values, mask) + arr = A.Array.from_numpy(values, mask) expected = values.astype(object) expected[mask] = None diff --git a/python/pyarrow/tests/test_feather.py b/python/pyarrow/tests/test_feather.py index dd6888f2d1306..525da344c9951 100644 --- a/python/pyarrow/tests/test_feather.py +++ b/python/pyarrow/tests/test_feather.py @@ -15,8 +15,6 @@ import os import unittest -import pytest - from numpy.testing import assert_array_equal import numpy as np diff --git a/python/pyarrow/tests/test_jemalloc.py b/python/pyarrow/tests/test_jemalloc.py index 8efd514dd0cae..c6cc2cc34a08b 100644 --- a/python/pyarrow/tests/test_jemalloc.py +++ 
b/python/pyarrow/tests/test_jemalloc.py @@ -33,11 +33,15 @@ def test_different_memory_pool(): gc.collect() bytes_before_default = pyarrow.total_allocated_bytes() bytes_before_jemalloc = pyarrow.jemalloc.default_pool().bytes_allocated() - array = pyarrow.from_pylist([1, None, 3, None], - memory_pool=pyarrow.jemalloc.default_pool()) + + # it works + array = pyarrow.from_pylist([1, None, 3, None], # noqa + memory_pool=pyarrow.jemalloc.default_pool()) gc.collect() assert pyarrow.total_allocated_bytes() == bytes_before_default - assert pyarrow.jemalloc.default_pool().bytes_allocated() > bytes_before_jemalloc + assert (pyarrow.jemalloc.default_pool().bytes_allocated() > + bytes_before_jemalloc) + @jemalloc def test_default_memory_pool(): @@ -47,10 +51,13 @@ def test_default_memory_pool(): old_memory_pool = pyarrow.memory.default_pool() pyarrow.memory.set_default_pool(pyarrow.jemalloc.default_pool()) - array = pyarrow.from_pylist([1, None, 3, None]) + + array = pyarrow.from_pylist([1, None, 3, None]) # noqa + pyarrow.memory.set_default_pool(old_memory_pool) gc.collect() assert pyarrow.total_allocated_bytes() == bytes_before_default - assert pyarrow.jemalloc.default_pool().bytes_allocated() > bytes_before_jemalloc + assert (pyarrow.jemalloc.default_pool().bytes_allocated() > + bytes_before_jemalloc) From d54ab9a23aa8d4fe52ce91b117511540cc7491bb Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 2 Apr 2017 12:06:56 -0400 Subject: [PATCH 0454/1644] ARROW-737: [C++] Enable mutable buffer slices, SliceMutableBuffer function This patch also results in better microperformance on my laptop. ``` Run on (4 X 1200 MHz CPU s) 2017-04-01 22:35:30 Benchmark Time CPU Iterations ---------------------------------------------------------------------------------------- BM_WriteRecordBatch/1/min_time:1.000/real_time 53617 ns 53423 ns 26903 18.2136GB/s BM_WriteRecordBatch/4/min_time:1.000/real_time 48118 ns 47999 ns 26823 20.295GB/s BM_WriteRecordBatch/16/min_time:1.000/real_time 46679 ns 46562 ns 30409 20.9209GB/s BM_WriteRecordBatch/64/min_time:1.000/real_time 50583 ns 50344 ns 27308 19.3061GB/s BM_WriteRecordBatch/256/min_time:1.000/real_time 93474 ns 93259 ns 11720 10.4474GB/s BM_WriteRecordBatch/1024/min_time:1.000/real_time 254056 ns 253461 ns 5434 3.84389GB/s BM_WriteRecordBatch/4k/min_time:1.000/real_time 892292 ns 888924 ns 1210 1120.71MB/s BM_WriteRecordBatch/8k/min_time:1.000/real_time 1799323 ns 1795023 ns 754 555.764MB/s BM_ReadRecordBatch/1/min_time:1.000/real_time 2501 ns 2452 ns 586633 390.521GB/s BM_ReadRecordBatch/4/min_time:1.000/real_time 6921 ns 5577 ns 227670 141.095GB/s BM_ReadRecordBatch/16/min_time:1.000/real_time 15561 ns 14505 ns 67493 62.758GB/s BM_ReadRecordBatch/64/min_time:1.000/real_time 48583 ns 48242 ns 30070 20.1008GB/s BM_ReadRecordBatch/256/min_time:1.000/real_time 184637 ns 183306 ns 6660 5.2891GB/s BM_ReadRecordBatch/1024/min_time:1.000/real_time 734128 ns 729692 ns 1905 1.33024GB/s BM_ReadRecordBatch/4k/min_time:1.000/real_time 3027028 ns 3008020 ns 445 330.357MB/s BM_ReadRecordBatch/8k/min_time:1.000/real_time 6472022 ns 6426801 ns 211 154.511MB/s ``` before ``` Run on (4 X 1200 MHz CPU s) 2017-04-01 22:37:36 Benchmark Time CPU Iterations ---------------------------------------------------------------------------------------- BM_WriteRecordBatch/1/min_time:1.000/real_time 59317 ns 59202 ns 22727 16.4633GB/s BM_WriteRecordBatch/4/min_time:1.000/real_time 56322 ns 56191 ns 24944 17.3389GB/s BM_WriteRecordBatch/16/min_time:1.000/real_time 52027 ns 51880 ns 26682 18.7703GB/s 
BM_WriteRecordBatch/64/min_time:1.000/real_time 56324 ns 56202 ns 24724 17.3384GB/s BM_WriteRecordBatch/256/min_time:1.000/real_time 108096 ns 107868 ns 11284 9.03423GB/s BM_WriteRecordBatch/1024/min_time:1.000/real_time 279508 ns 278730 ns 4957 3.49386GB/s BM_WriteRecordBatch/4k/min_time:1.000/real_time 1013229 ns 1009772 ns 1191 986.944MB/s BM_WriteRecordBatch/8k/min_time:1.000/real_time 2011309 ns 2005377 ns 661 497.189MB/s BM_ReadRecordBatch/1/min_time:1.000/real_time 2507 ns 2501 ns 558949 389.563GB/s BM_ReadRecordBatch/4/min_time:1.000/real_time 5120 ns 5110 ns 275798 190.735GB/s BM_ReadRecordBatch/16/min_time:1.000/real_time 15800 ns 15706 ns 85481 61.8072GB/s BM_ReadRecordBatch/64/min_time:1.000/real_time 55678 ns 55476 ns 25022 17.5393GB/s BM_ReadRecordBatch/256/min_time:1.000/real_time 218083 ns 217596 ns 6163 4.47794GB/s BM_ReadRecordBatch/1024/min_time:1.000/real_time 875861 ns 873419 ns 1591 1.11497GB/s BM_ReadRecordBatch/4k/min_time:1.000/real_time 3545586 ns 3538141 ns 383 282.041MB/s BM_ReadRecordBatch/8k/min_time:1.000/real_time 7118830 ns 7104433 ns 194 140.473MB/s ``` Author: Wes McKinney Closes #479 from wesm/ARROW-737 and squashes the following commits: b663ca4 [Wes McKinney] Enable mutable buffer slices, SliceMutableBuffer function --- cpp/src/arrow/buffer-test.cc | 17 +++++++++++++++++ cpp/src/arrow/buffer.cc | 17 ++++++++++++----- cpp/src/arrow/buffer.h | 25 +++++++++++++++++-------- 3 files changed, 46 insertions(+), 13 deletions(-) diff --git a/cpp/src/arrow/buffer-test.cc b/cpp/src/arrow/buffer-test.cc index e0a2137b9bd78..5815ed17af50e 100644 --- a/cpp/src/arrow/buffer-test.cc +++ b/cpp/src/arrow/buffer-test.cc @@ -168,4 +168,21 @@ TEST_F(TestBuffer, SliceBuffer) { ASSERT_EQ(2, buf.use_count()); } +TEST_F(TestBuffer, SliceMutableBuffer) { + std::string data_str = "some data to slice"; + auto data = reinterpret_cast(data_str.c_str()); + + std::shared_ptr buffer; + ASSERT_OK(AllocateBuffer(default_memory_pool(), 50, &buffer)); + + memcpy(buffer->mutable_data(), data, data_str.size()); + + std::shared_ptr slice = SliceMutableBuffer(buffer, 5, 10); + ASSERT_TRUE(slice->is_mutable()); + ASSERT_EQ(10, slice->size()); + + Buffer expected(data + 5, 10); + ASSERT_TRUE(slice->Equals(expected)); +} + } // namespace arrow diff --git a/cpp/src/arrow/buffer.cc b/cpp/src/arrow/buffer.cc index 59623403e5c5e..fb6379894c3b0 100644 --- a/cpp/src/arrow/buffer.cc +++ b/cpp/src/arrow/buffer.cc @@ -27,11 +27,6 @@ namespace arrow { -Buffer::Buffer(const std::shared_ptr& parent, int64_t offset, int64_t size) - : Buffer(parent->data() + offset, size) { - parent_ = parent; -} - Buffer::~Buffer() {} Status Buffer::Copy( @@ -116,6 +111,18 @@ Status PoolBuffer::Resize(int64_t new_size, bool shrink_to_fit) { return Status::OK(); } +std::shared_ptr SliceMutableBuffer( + const std::shared_ptr& buffer, int64_t offset, int64_t length) { + return std::make_shared(buffer, offset, length); +} + +MutableBuffer::MutableBuffer( + const std::shared_ptr& parent, int64_t offset, int64_t size) + : MutableBuffer(parent->mutable_data() + offset, size) { + DCHECK(parent->is_mutable()) << "Must pass mutable buffer"; + parent_ = parent; +} + Status AllocateBuffer( MemoryPool* pool, int64_t size, std::shared_ptr* out) { auto buffer = std::make_shared(pool); diff --git a/cpp/src/arrow/buffer.h b/cpp/src/arrow/buffer.h index 713d57a1f101d..3f14c964e83c1 100644 --- a/cpp/src/arrow/buffer.h +++ b/cpp/src/arrow/buffer.h @@ -46,7 +46,7 @@ class Status; class ARROW_EXPORT Buffer { public: Buffer(const uint8_t* data, 
int64_t size) - : is_mutable_(false), data_(data), size_(size), capacity_(size) {} + : is_mutable_(false), data_(data), size_(size), capacity_(size) {} virtual ~Buffer(); /// An offset into data that is owned by another buffer, but we want to be @@ -56,7 +56,10 @@ class ARROW_EXPORT Buffer { /// This method makes no assertions about alignment or padding of the buffer but /// in general we expected buffers to be aligned and padded to 64 bytes. In the future /// we might add utility methods to help determine if a buffer satisfies this contract. - Buffer(const std::shared_ptr& parent, int64_t offset, int64_t size); + Buffer(const std::shared_ptr& parent, int64_t offset, int64_t size) + : Buffer(parent->data() + offset, size) { + parent_ = parent; + } bool is_mutable() const { return is_mutable_; } @@ -74,6 +77,7 @@ class ARROW_EXPORT Buffer { int64_t capacity() const { return capacity_; } const uint8_t* data() const { return data_; } + uint8_t* mutable_data() { return mutable_data_; } int64_t size() const { return size_; } @@ -82,6 +86,7 @@ class ARROW_EXPORT Buffer { protected: bool is_mutable_; const uint8_t* data_; + uint8_t* mutable_data_; int64_t size_; int64_t capacity_; @@ -99,20 +104,24 @@ static inline std::shared_ptr SliceBuffer( return std::make_shared(buffer, offset, length); } +/// Construct a mutable buffer slice. If the parent buffer is not mutable, this +/// will abort in debug builds +std::shared_ptr ARROW_EXPORT SliceMutableBuffer( + const std::shared_ptr& buffer, int64_t offset, int64_t length); + /// A Buffer whose contents can be mutated. May or may not own its data. class ARROW_EXPORT MutableBuffer : public Buffer { public: - MutableBuffer(uint8_t* data, int64_t size) : Buffer(data, size) { - is_mutable_ = true; + MutableBuffer(uint8_t* data, int64_t size) + : Buffer(data, size) { mutable_data_ = data; + is_mutable_ = true; } - uint8_t* mutable_data() { return mutable_data_; } + MutableBuffer(const std::shared_ptr& parent, int64_t offset, int64_t size); protected: - MutableBuffer() : Buffer(nullptr, 0), mutable_data_(nullptr) {} - - uint8_t* mutable_data_; + MutableBuffer() : Buffer(nullptr, 0) {} }; class ARROW_EXPORT ResizableBuffer : public MutableBuffer { From c4d535ca1ed98981c62ecceba81fa7c4efeaf66d Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 2 Apr 2017 14:40:08 -0400 Subject: [PATCH 0455/1644] ARROW-753: [Python] Fix linker error for python-test on OS X We don't link to libpython in the pyarrow C extensions, but we need to run the googletest unit tests. I thought we were building and running this test in Travis CI, but we're not yet. It's not that easy to add that right now, so this triages the builds in the meantime Author: Wes McKinney Closes #478 from wesm/ARROW-753 and squashes the following commits: bcc2455 [Wes McKinney] Fix linker error for python-test on OS X --- cpp/cmake_modules/FindPythonLibsNew.cmake | 10 ++++------ cpp/src/arrow/python/CMakeLists.txt | 2 +- 2 files changed, 5 insertions(+), 7 deletions(-) diff --git a/cpp/cmake_modules/FindPythonLibsNew.cmake b/cpp/cmake_modules/FindPythonLibsNew.cmake index 1000a957a6269..3e248a93342c5 100644 --- a/cpp/cmake_modules/FindPythonLibsNew.cmake +++ b/cpp/cmake_modules/FindPythonLibsNew.cmake @@ -148,12 +148,10 @@ if(CMAKE_HOST_WIN32) set(PYTHON_LIBRARY "${PYTHON_PREFIX}/libs/libpython${PYTHON_LIBRARY_SUFFIX}.a") endif() elseif(APPLE) - # Seems to require "-undefined dynamic_lookup" instead of linking - # against the .dylib, otherwise it crashes. 
This flag is added - # below - set(PYTHON_LIBRARY "") - #set(PYTHON_LIBRARY - # "${PYTHON_PREFIX}/lib/libpython${PYTHON_LIBRARY_SUFFIX}.dylib") + # In Python C extensions on OS X, the flag "-undefined dynamic_lookup" can + # avoid certain kinds of dynamic linking issues with portable binaries, so + # you should avoid targeting libpython at link time if at all possible + set(PYTHON_LIBRARY "${PYTHON_PREFIX}/lib/libpython${PYTHON_LIBRARY_SUFFIX}.dylib") else() if(${PYTHON_SIZEOF_VOID_P} MATCHES 8) set(_PYTHON_LIBS_SEARCH "${PYTHON_PREFIX}/lib64" "${PYTHON_PREFIX}/lib" "${PYTHON_LIBRARY_PATH}") diff --git a/cpp/src/arrow/python/CMakeLists.txt b/cpp/src/arrow/python/CMakeLists.txt index a8b4cc7ff1ded..c69d976737f91 100644 --- a/cpp/src/arrow/python/CMakeLists.txt +++ b/cpp/src/arrow/python/CMakeLists.txt @@ -39,7 +39,7 @@ set(ARROW_PYTHON_MIN_TEST_LIBS arrow_io_static arrow_static) -if(NOT APPLE AND ARROW_BUILD_TESTS) +if(ARROW_BUILD_TESTS) ADD_THIRDPARTY_LIB(python SHARED_LIB "${PYTHON_LIBRARIES}") list(APPEND ARROW_PYTHON_MIN_TEST_LIBS python) From 9f720b117648f42356bbfd0a36f6275b878305d1 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sun, 2 Apr 2017 14:49:20 -0400 Subject: [PATCH 0456/1644] ARROW-738: Fix manylinux1 build Author: Uwe L. Korn Closes #471 from xhochy/fix-manylinux1-build and squashes the following commits: 5259118 [Uwe L. Korn] Instead of copying, symlink to correct ABI versions cffcca3 [Uwe L. Korn] Don't hardcode ABI versions 98f79af [Uwe L. Korn] Set PKG_CONFIG_PATH b97bcf4 [Uwe L. Korn] Read SOVERSION from pkg-config a83dbc8 [Uwe L. Korn] Build arrow-cpp for each Python version --- python/CMakeLists.txt | 39 ++++++++++++++-------------- python/cmake_modules/FindArrow.cmake | 36 ++++++++++++++++--------- python/manylinux1/Dockerfile-x86_64 | 4 ++- python/manylinux1/build_arrow.sh | 10 +++++++ python/setup.py | 18 ++++++++++--- 5 files changed, 71 insertions(+), 36 deletions(-) diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index f315d019bb4de..463a29d87b711 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -226,24 +226,27 @@ endif() find_package(Arrow REQUIRED) include_directories(SYSTEM ${ARROW_INCLUDE_DIR}) -if (PYARROW_BUNDLE_ARROW_CPP) - configure_file(${ARROW_SHARED_LIB} - ${BUILD_OUTPUT_ROOT_DIRECTORY}/libarrow${CMAKE_SHARED_LIBRARY_SUFFIX} +function(bundle_arrow_lib library_path) + get_filename_component(LIBRARY_NAME ${${library_path}} NAME_WE) + configure_file(${${library_path}} + ${BUILD_OUTPUT_ROOT_DIRECTORY}/${LIBRARY_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX} COPYONLY) - SET(ARROW_SHARED_LIB - ${BUILD_OUTPUT_ROOT_DIRECTORY}/libarrow${CMAKE_SHARED_LIBRARY_SUFFIX}) - configure_file(${ARROW_IO_SHARED_LIB} - ${BUILD_OUTPUT_ROOT_DIRECTORY}/libarrow_io${CMAKE_SHARED_LIBRARY_SUFFIX} + configure_file(${${library_path}}.${ARROW_ABI_VERSION} + ${BUILD_OUTPUT_ROOT_DIRECTORY}/${LIBRARY_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}.${ARROW_ABI_VERSION} COPYONLY) - SET(ARROW_IO_SHARED_LIB - ${BUILD_OUTPUT_ROOT_DIRECTORY}/libarrow_io${CMAKE_SHARED_LIBRARY_SUFFIX}) - configure_file(${ARROW_IPC_SHARED_LIB} - ${BUILD_OUTPUT_ROOT_DIRECTORY}/libarrow_ipc${CMAKE_SHARED_LIBRARY_SUFFIX} + configure_file(${${library_path}}.${ARROW_SO_VERSION} + ${BUILD_OUTPUT_ROOT_DIRECTORY}/${LIBRARY_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}.${ARROW_SO_VERSION} COPYONLY) - SET(ARROW_IPC_SHARED_LIB - ${BUILD_OUTPUT_ROOT_DIRECTORY}/libarrow_ipc${CMAKE_SHARED_LIBRARY_SUFFIX}) - SET(ARROW_PYTHON_SHARED_LIB - ${BUILD_OUTPUT_ROOT_DIRECTORY}/libarrow_python${CMAKE_SHARED_LIBRARY_SUFFIX}) + 
SET(ARROW_SHARED_LIB + ${BUILD_OUTPUT_ROOT_DIRECTORY}/${LIBRARY_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) +endfunction(bundle_arrow_lib) + +if (PYARROW_BUNDLE_ARROW_CPP) + # arrow + bundle_arrow_lib(ARROW_SHARED_LIB) + bundle_arrow_lib(ARROW_IO_SHARED_LIB) + bundle_arrow_lib(ARROW_IPC_SHARED_LIB) + bundle_arrow_lib(ARROW_PYTHON_SHARED_LIB) endif() ADD_THIRDPARTY_LIB(arrow @@ -309,11 +312,7 @@ endif() if (PYARROW_BUILD_JEMALLOC) if (PYARROW_BUNDLE_ARROW_CPP) - configure_file(${ARROW_JEMALLOC_SHARED_LIB} - ${BUILD_OUTPUT_ROOT_DIRECTORY}/libarrow_jemalloc${CMAKE_SHARED_LIBRARY_SUFFIX} - COPYONLY) - SET(ARROW_JEMALLOC_SHARED_LIB - ${BUILD_OUTPUT_ROOT_DIRECTORY}/libarrow_jemalloc${CMAKE_SHARED_LIBRARY_SUFFIX}) + bundle_arrow_lib(ARROW_JEMALLOC_SHARED_LIB) endif() ADD_THIRDPARTY_LIB(arrow_jemalloc SHARED_LIB ${ARROW_JEMALLOC_SHARED_LIB}) diff --git a/python/cmake_modules/FindArrow.cmake b/python/cmake_modules/FindArrow.cmake index 5030c9c8ce900..c2ca0f4ad22c8 100644 --- a/python/cmake_modules/FindArrow.cmake +++ b/python/cmake_modules/FindArrow.cmake @@ -23,6 +23,8 @@ # ARROW_SHARED_LIB, path to libarrow's shared library # ARROW_FOUND, whether arrow has been found +include(FindPkgConfig) + set(ARROW_SEARCH_HEADER_PATHS $ENV{ARROW_HOME}/include ) @@ -31,16 +33,27 @@ set(ARROW_SEARCH_LIB_PATH $ENV{ARROW_HOME}/lib ) -find_path(ARROW_INCLUDE_DIR arrow/array.h PATHS - ${ARROW_SEARCH_HEADER_PATHS} - # make sure we don't accidentally pick up a different version - NO_DEFAULT_PATH -) - -find_library(ARROW_LIB_PATH NAMES arrow - PATHS - ${ARROW_SEARCH_LIB_PATH} - NO_DEFAULT_PATH) +pkg_check_modules(ARROW arrow) +if (ARROW_FOUND) + pkg_get_variable(ARROW_ABI_VERSION arrow abi_version) + message(STATUS "Arrow ABI version: ${ARROW_ABI_VERSION}") + pkg_get_variable(ARROW_SO_VERSION arrow so_version) + message(STATUS "Arrow SO version: ${ARROW_SO_VERSION}") + set(ARROW_INCLUDE_DIR ${ARROW_INCLUDE_DIRS}) + set(ARROW_LIBS ${ARROW_LIBRARY_DIRS}) +else() + find_path(ARROW_INCLUDE_DIR arrow/array.h PATHS + ${ARROW_SEARCH_HEADER_PATHS} + # make sure we don't accidentally pick up a different version + NO_DEFAULT_PATH + ) + + find_library(ARROW_LIB_PATH NAMES arrow + PATHS + ${ARROW_SEARCH_LIB_PATH} + NO_DEFAULT_PATH) + get_filename_component(ARROW_LIBS ${ARROW_LIB_PATH} DIRECTORY) +endif() find_library(ARROW_IO_LIB_PATH NAMES arrow_io PATHS @@ -62,7 +75,7 @@ find_library(ARROW_PYTHON_LIB_PATH NAMES arrow_python ${ARROW_SEARCH_LIB_PATH} NO_DEFAULT_PATH) -if (ARROW_INCLUDE_DIR AND ARROW_LIB_PATH) +if (ARROW_INCLUDE_DIR AND ARROW_LIBS) set(ARROW_FOUND TRUE) set(ARROW_LIB_NAME libarrow) set(ARROW_IO_LIB_NAME libarrow_io) @@ -70,7 +83,6 @@ if (ARROW_INCLUDE_DIR AND ARROW_LIB_PATH) set(ARROW_JEMALLOC_LIB_NAME libarrow_jemalloc) set(ARROW_PYTHON_LIB_NAME libarrow_python) - set(ARROW_LIBS ${ARROW_SEARCH_LIB_PATH}) set(ARROW_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_LIB_NAME}.a) set(ARROW_SHARED_LIB ${ARROW_LIBS}/${ARROW_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) diff --git a/python/manylinux1/Dockerfile-x86_64 b/python/manylinux1/Dockerfile-x86_64 index 820b94e306afe..56b27ad2ae808 100644 --- a/python/manylinux1/Dockerfile-x86_64 +++ b/python/manylinux1/Dockerfile-x86_64 @@ -52,7 +52,9 @@ RUN git checkout ffe59955ad8690c2f8bb74766cb7e9b0d0ee3963 ADD arrow /arrow WORKDIR /arrow/cpp -RUN cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/arrow-dist -DARROW_HDFS=ON -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=ON -DARROW_BOOST_USE_SHARED=OFF -DARROW_JEMALLOC=ON -DARROW_RPATH_ORIGIN=ON -DARROW_JEMALLOC_USE_SHARED=OFF . 
+RUN mkdir build-plain +WORKDIR /arrow/cpp/build-plain +RUN cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/arrow-dist -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=ON -DARROW_BOOST_USE_SHARED=OFF -DARROW_JEMALLOC=ON -DARROW_RPATH_ORIGIN=ON -DARROW_JEMALLOC_USE_SHARED=OFF .. RUN make -j5 install WORKDIR / diff --git a/python/manylinux1/build_arrow.sh b/python/manylinux1/build_arrow.sh index 576a983b11c37..8bc4e60235b49 100755 --- a/python/manylinux1/build_arrow.sh +++ b/python/manylinux1/build_arrow.sh @@ -39,6 +39,7 @@ export PYARROW_BUNDLE_ARROW_CPP=1 export LDFLAGS="-Wl,--no-as-needed" export ARROW_HOME="/arrow-dist" export PARQUET_HOME="/usr" +export PKG_CONFIG_PATH=/arrow-dist/lib64/pkgconfig # Ensure the target directory exists mkdir -p /io/dist @@ -52,6 +53,15 @@ for PYTHON in ${PYTHON_VERSIONS}; do echo "=== (${PYTHON}) Installing build dependencies ===" $PIPI_IO "numpy==1.9.0" $PIPI_IO "cython==0.24" + $PIPI_IO "pandas==0.19.2" + + echo "=== (${PYTHON}) Building Arrow C++ libraries ===" + ARROW_BUILD_DIR=/arrow/cpp/build-PY${PYTHON} + mkdir -p "${ARROW_BUILD_DIR}" + pushd "${ARROW_BUILD_DIR}" + PATH="$(cpython_path $PYTHON)/bin:$PATH" cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/arrow-dist -DARROW_BUILD_TESTS=OFF -DARROW_BUILD_SHARED=ON -DARROW_BOOST_USE_SHARED=OFF -DARROW_JEMALLOC=ON -DARROW_RPATH_ORIGIN=ON -DARROW_JEMALLOC_USE_SHARED=OFF -DARROW_PYTHON=ON -DPythonInterp_FIND_VERSION=${PYTHON} .. + make -j5 install + popd # Clear output directory rm -rf dist/ diff --git a/python/setup.py b/python/setup.py index 12b44e1bad520..ba77e688ae1f6 100644 --- a/python/setup.py +++ b/python/setup.py @@ -17,19 +17,20 @@ # specific language governing permissions and limitations # under the License. +import glob +import os import os.path as osp import re import shutil +import sys + from Cython.Distutils import build_ext as _build_ext import Cython -import sys import pkg_resources from setuptools import setup, Extension -import os - from os.path import join as pjoin from distutils.command.clean import clean as _clean @@ -207,8 +208,19 @@ def _run_cmake(self): def move_lib(lib_name): lib_filename = (shared_library_prefix + lib_name + shared_library_suffix) + # Also copy libraries with ABI/SO version suffix + libs = glob.glob(pjoin(self.build_type, lib_filename) + '*') + # Longest suffix library should be copied, all others symlinked + libs.sort(key=lambda s: -len(s)) + print(libs, libs[0]) + lib_filename = os.path.basename(libs[0]) shutil.move(pjoin(self.build_type, lib_filename), pjoin(build_lib, 'pyarrow', lib_filename)) + for lib in libs[1:]: + filename = os.path.basename(lib) + link_name = pjoin(build_lib, 'pyarrow', filename) + if not os.path.exists(link_name): + os.symlink(lib_filename, link_name) if self.bundle_arrow_cpp: move_lib("arrow") From 8f113b4d0fc344ab7d411af85fbf99154d5d1eaa Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Sun, 2 Apr 2017 23:31:19 -0400 Subject: [PATCH 0457/1644] ARROW-755: [GLib] Add garrow_array_get_value_type() garrow_array_get_data_type() is renamed to garrow_array_get_value_data_type() for consistency. 
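For illustration, a minimal C sketch of the renamed accessors (not part of this patch; it assumes an existing GArrowArray built elsewhere, e.g. from a builder):

```c
#include <arrow-glib/arrow-glib.h>

/* Sketch only: inspect the per-value type of an already-built array. */
static void
inspect_value_type(GArrowArray *array)
{
  /* The enum form is cheap to branch on... */
  GArrowType type = garrow_array_get_value_type(array);

  /* ...while the data type object carries the full type details.
     It is returned with (transfer full), so unref it when done. */
  GArrowDataType *data_type = garrow_array_get_value_data_type(array);

  if (type == GARROW_TYPE_BOOL) {
    /* safe to treat the array as a GArrowBooleanArray here */
  }
  g_object_unref(data_type);
}
```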
Author: Kouhei Sutou Closes #481 from kou/glib-support-array-value-type and squashes the following commits: bb4349c [Kouhei Sutou] [GLib] Add index for new symbols in 0.3.0 to API reference 7b07306 [Kouhei Sutou] [GLib] Add garrow_array_get_value_type() --- c_glib/arrow-glib/array.cpp | 23 ++++++++++++++++++++--- c_glib/arrow-glib/array.h | 3 ++- c_glib/doc/reference/arrow-glib-docs.sgml | 4 ++++ c_glib/test/test-array.rb | 10 ++++++++-- 4 files changed, 34 insertions(+), 6 deletions(-) diff --git a/c_glib/arrow-glib/array.cpp b/c_glib/arrow-glib/array.cpp index caf2eb55d6b2c..9d0e101e1b52f 100644 --- a/c_glib/arrow-glib/array.cpp +++ b/c_glib/arrow-glib/array.cpp @@ -35,6 +35,7 @@ #include #include #include +#include #include #include #include @@ -189,19 +190,35 @@ garrow_array_get_n_nulls(GArrowArray *array) } /** - * garrow_array_get_data_type: + * garrow_array_get_value_data_type: * @array: A #GArrowArray. * - * Returns: (transfer full): The #GArrowDataType for the array. + * Since: 0.3.0 + * Returns: (transfer full): The #GArrowDataType for each value of the + * array. */ GArrowDataType * -garrow_array_get_data_type(GArrowArray *array) +garrow_array_get_value_data_type(GArrowArray *array) { auto arrow_array = garrow_array_get_raw(array); auto arrow_data_type = arrow_array->type(); return garrow_data_type_new_raw(&arrow_data_type); } +/** + * garrow_array_get_value_type: + * @array: A #GArrowArray. + * + * Since: 0.3.0 + * Returns: The #GArrowType for each value of the array. + */ +GArrowType +garrow_array_get_value_type(GArrowArray *array) +{ + auto arrow_array = garrow_array_get_raw(array); + return garrow_type_from_raw(arrow_array->type_enum()); +} + /** * garrow_array_slice: * @array: A #GArrowArray. diff --git a/c_glib/arrow-glib/array.h b/c_glib/arrow-glib/array.h index 957b4416fa581..06a37e9b43ad6 100644 --- a/c_glib/arrow-glib/array.h +++ b/c_glib/arrow-glib/array.h @@ -62,7 +62,8 @@ gboolean garrow_array_is_null (GArrowArray *array, gint64 garrow_array_get_length (GArrowArray *array); gint64 garrow_array_get_offset (GArrowArray *array); gint64 garrow_array_get_n_nulls (GArrowArray *array); -GArrowDataType *garrow_array_get_data_type(GArrowArray *array); +GArrowDataType *garrow_array_get_value_data_type(GArrowArray *array); +GArrowType garrow_array_get_value_type(GArrowArray *array); GArrowArray *garrow_array_slice (GArrowArray *array, gint64 offset, gint64 length); diff --git a/c_glib/doc/reference/arrow-glib-docs.sgml b/c_glib/doc/reference/arrow-glib-docs.sgml index a732e09df1269..06a19369640b5 100644 --- a/c_glib/doc/reference/arrow-glib-docs.sgml +++ b/c_glib/doc/reference/arrow-glib-docs.sgml @@ -167,5 +167,9 @@ Index of deprecated API + + Index of new symbols in 0.3.0 + + diff --git a/c_glib/test/test-array.rb b/c_glib/test/test-array.rb index 08908b08961a7..06102eb36575b 100644 --- a/c_glib/test/test-array.rb +++ b/c_glib/test/test-array.rb @@ -40,10 +40,16 @@ def test_n_nulls assert_equal(2, array.n_nulls) end - def test_data_type + def test_value_data_type builder = Arrow::BooleanArrayBuilder.new array = builder.finish - assert_equal(Arrow::BooleanDataType.new, array.data_type) + assert_equal(Arrow::BooleanDataType.new, array.value_data_type) + end + + def test_value_type + builder = Arrow::BooleanArrayBuilder.new + array = builder.finish + assert_equal(Arrow::Type::BOOL, array.value_type) end def test_slice From 96f3d6176d8c95717f4ff45e4226161de3168b05 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 3 Apr 2017 08:43:47 +0200 Subject: [PATCH 0458/1644] 
ARROW-749: [Python] Delete partially-written Feather file when column write fails This is currently the only place where we are doing an atomic create-file/write-file. We should be mindful of other serialization functions which may yield unreadable files in the future. Author: Wes McKinney Closes #484 from wesm/ARROW-749 and squashes the following commits: 137e235 [Wes McKinney] Delete partially-written Feather file when column write fails --- python/pyarrow/feather.py | 79 ++++++++++++++++++---------- python/pyarrow/tests/test_feather.py | 16 ++++++ 2 files changed, 67 insertions(+), 28 deletions(-) diff --git a/python/pyarrow/feather.py b/python/pyarrow/feather.py index f87c7f3a95ee4..3b5716e36be0a 100644 --- a/python/pyarrow/feather.py +++ b/python/pyarrow/feather.py @@ -15,8 +15,10 @@ # specific language governing permissions and limitations # under the License. -import six from distutils.version import LooseVersion +import os + +import six import pandas as pd from pyarrow.compat import pdapi @@ -54,45 +56,66 @@ def read(self, columns=None): return table.to_pandas() -def write_feather(df, dest): - ''' - Write a pandas.DataFrame to Feather format - ''' - writer = ext.FeatherWriter() - writer.open(dest) +class FeatherWriter(object): - if isinstance(df, pd.SparseDataFrame): - df = df.to_dense() + def __init__(self, dest): + self.dest = dest + self.writer = ext.FeatherWriter() + self.writer.open(dest) - if not df.columns.is_unique: - raise ValueError("cannot serialize duplicate column names") + def write(self, df): + if isinstance(df, pd.SparseDataFrame): + df = df.to_dense() - # TODO(wesm): pipeline conversion to Arrow memory layout - for i, name in enumerate(df.columns): - col = df.iloc[:, i] + if not df.columns.is_unique: + raise ValueError("cannot serialize duplicate column names") - if pdapi.is_object_dtype(col): - inferred_type = pd.lib.infer_dtype(col) - msg = ("cannot serialize column {n} " - "named {name} with dtype {dtype}".format( - n=i, name=name, dtype=inferred_type)) + # TODO(wesm): pipeline conversion to Arrow memory layout + for i, name in enumerate(df.columns): + col = df.iloc[:, i] - if inferred_type in ['mixed']: + if pdapi.is_object_dtype(col): + inferred_type = pd.lib.infer_dtype(col) + msg = ("cannot serialize column {n} " + "named {name} with dtype {dtype}".format( + n=i, name=name, dtype=inferred_type)) - # allow columns with nulls + an inferable type - inferred_type = pd.lib.infer_dtype(col[col.notnull()]) if inferred_type in ['mixed']: + + # allow columns with nulls + an inferable type + inferred_type = pd.lib.infer_dtype(col[col.notnull()]) + if inferred_type in ['mixed']: + raise ValueError(msg) + + elif inferred_type not in ['unicode', 'string']: raise ValueError(msg) - elif inferred_type not in ['unicode', 'string']: - raise ValueError(msg) + if not isinstance(name, six.string_types): + name = str(name) - if not isinstance(name, six.string_types): - name = str(name) + self.writer.write_array(name, col) - writer.write_array(name, col) + self.writer.close() - writer.close() + +def write_feather(df, dest): + ''' + Write a pandas.DataFrame to Feather format + ''' + writer = FeatherWriter(dest) + try: + writer.write(df) + except: + # Try to make sure the resource is closed + import gc + writer = None + gc.collect() + if isinstance(dest, six.string_types): + try: + os.remove(dest) + except os.error: + pass + raise def read_feather(source, columns=None): diff --git a/python/pyarrow/tests/test_feather.py b/python/pyarrow/tests/test_feather.py index 
525da344c9951..c7b4f1e997327 100644 --- a/python/pyarrow/tests/test_feather.py +++ b/python/pyarrow/tests/test_feather.py @@ -249,6 +249,22 @@ def test_boolean_object_nulls(self): df = pd.DataFrame({'bools': arr}) self._check_pandas_roundtrip(df, null_counts=[1 * repeats]) + def test_delete_partial_file_on_error(self): + # strings will fail + df = pd.DataFrame( + { + 'numbers': range(5), + 'strings': [b'foo', None, u'bar', 'qux', np.nan]}, + columns=['numbers', 'strings']) + + path = random_path() + try: + write_feather(df, path) + except: + pass + + assert not os.path.exists(path) + def test_strings(self): repeats = 1000 From 7232e5b5df64f9dbdc9405798644cd08a6d9db6b Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Mon, 3 Apr 2017 18:01:24 -0400 Subject: [PATCH 0459/1644] ARROW-676: move from MinorType to FieldType in ValueVectors to carry all the relevant type bits I'm adding all the Type information to the vector with FieldType. This avoids losing information and carrying around extra type metadata that can not go in the MinorType enum (for example: decimals' precision and dictionary encoding) Author: Julien Le Dem Closes #409 from julienledem/minor_field and squashes the following commits: 0407e63 [Julien Le Dem] ARROW-676: move from MinorType to FieldType in ValueVectors to carry all the relevant type bits --- .../apache/arrow/tools/EchoServerTest.java | 55 ++-- .../src/main/codegen/data/ArrowTypes.tdd | 38 ++- .../src/main/codegen/templates/ArrowType.java | 81 +++++- .../main/codegen/templates/MapWriters.java | 10 +- .../templates/NullableValueVectors.java | 42 +-- .../main/codegen/templates/UnionVector.java | 20 +- .../org/apache/arrow/vector/FieldVector.java | 3 +- .../apache/arrow/vector/VectorSchemaRoot.java | 6 +- .../complex/AbstractContainerVector.java | 20 +- .../vector/complex/AbstractMapVector.java | 29 +- .../complex/BaseRepeatedValueVector.java | 11 +- .../arrow/vector/complex/ListVector.java | 13 +- .../arrow/vector/complex/MapVector.java | 6 +- .../vector/complex/NullableMapVector.java | 1 + .../complex/impl/ComplexWriterImpl.java | 5 +- .../vector/complex/impl/PromotableWriter.java | 5 +- .../org/apache/arrow/vector/types/Types.java | 264 ++++-------------- .../apache/arrow/vector/types/pojo/Field.java | 82 +++--- .../arrow/vector/types/pojo/FieldType.java | 60 ++++ .../arrow/vector/TestDecimalVector.java | 14 +- .../arrow/vector/TestDictionaryVector.java | 16 +- .../org/apache/arrow/vector/TestUtils.java | 39 +++ .../apache/arrow/vector/TestValueVector.java | 15 +- .../complex/impl/TestPromotableWriter.java | 3 +- .../arrow/vector/file/TestArrowFile.java | 20 +- .../vector/file/TestArrowReaderWriter.java | 12 +- 26 files changed, 461 insertions(+), 409 deletions(-) create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/types/pojo/FieldType.java create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/TestUtils.java diff --git a/java/tools/src/test/java/org/apache/arrow/tools/EchoServerTest.java b/java/tools/src/test/java/org/apache/arrow/tools/EchoServerTest.java index 5970c57f46583..7d07588892cf9 100644 --- a/java/tools/src/test/java/org/apache/arrow/tools/EchoServerTest.java +++ b/java/tools/src/test/java/org/apache/arrow/tools/EchoServerTest.java @@ -18,7 +18,19 @@ package org.apache.arrow.tools; -import com.google.common.collect.ImmutableList; +import static java.util.Arrays.asList; +import static org.apache.arrow.vector.types.Types.MinorType.TINYINT; +import static org.apache.arrow.vector.types.Types.MinorType.VARCHAR; +import static 
org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +import java.io.IOException; +import java.net.Socket; +import java.net.UnknownHostException; +import java.nio.charset.StandardCharsets; +import java.util.Arrays; +import java.util.Collections; +import java.util.List; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; @@ -39,6 +51,7 @@ import org.apache.arrow.vector.types.pojo.ArrowType.Int; import org.apache.arrow.vector.types.pojo.DictionaryEncoding; import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.FieldType; import org.apache.arrow.vector.types.pojo.Schema; import org.apache.arrow.vector.util.Text; import org.junit.AfterClass; @@ -46,17 +59,7 @@ import org.junit.BeforeClass; import org.junit.Test; -import java.io.IOException; -import java.net.Socket; -import java.net.UnknownHostException; -import java.nio.charset.StandardCharsets; -import java.util.Arrays; -import java.util.Collections; -import java.util.List; - -import static java.util.Arrays.asList; -import static org.junit.Assert.assertEquals; -import static org.junit.Assert.assertTrue; +import com.google.common.collect.ImmutableList; public class EchoServerTest { @@ -133,9 +136,12 @@ private void testEchoServer(int serverPort, public void basicTest() throws InterruptedException, IOException { BufferAllocator alloc = new RootAllocator(Long.MAX_VALUE); - Field field = new Field("testField", true, new ArrowType.Int(8, true), Collections - .emptyList()); - NullableTinyIntVector vector = new NullableTinyIntVector("testField", alloc, null); + Field field = new Field( + "testField", true, + new ArrowType.Int(8, true), + Collections.emptyList()); + NullableTinyIntVector vector = + new NullableTinyIntVector("testField", FieldType.nullable(TINYINT.getType()), alloc); Schema schema = new Schema(asList(field)); // Try an empty stream, just the header. 
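The constructor shape exercised by the test changes below can be sketched as follows (an editorial example, not part of the patch; the vector names and dictionary id are placeholders). Nullability, logical type, and the optional dictionary encoding now travel together in a single FieldType instead of being passed piecemeal:

    BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);

    // Plain nullable INT vector.
    FieldType plain = FieldType.nullable(MinorType.INT.getType());
    NullableIntVector ints = new NullableIntVector("ints", plain, allocator);

    // Dictionary-encoded INT vector: the encoding rides along in the FieldType.
    DictionaryEncoding encoding = new DictionaryEncoding(1L, false, null);
    FieldType encoded = new FieldType(true, MinorType.INT.getType(), encoding);
    NullableIntVector codes = new NullableIntVector("codes", encoded, allocator);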
@@ -152,9 +158,16 @@ public void basicTest() throws InterruptedException, IOException { public void testFlatDictionary() throws IOException { DictionaryEncoding writeEncoding = new DictionaryEncoding(1L, false, null); try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE); - NullableIntVector writeVector = new NullableIntVector("varchar", allocator, writeEncoding); - NullableVarCharVector writeDictionaryVector = new NullableVarCharVector("dict", - allocator, null)) { + NullableIntVector writeVector = + new NullableIntVector( + "varchar", + new FieldType(true, MinorType.INT.getType(), writeEncoding), + allocator); + NullableVarCharVector writeDictionaryVector = + new NullableVarCharVector( + "dict", + FieldType.nullable(VARCHAR.getType()), + allocator)) { writeVector.allocateNewSafe(); NullableIntVector.Mutator mutator = writeVector.getMutator(); mutator.set(0, 0); @@ -222,8 +235,8 @@ public void testFlatDictionary() throws IOException { public void testNestedDictionary() throws IOException { DictionaryEncoding writeEncoding = new DictionaryEncoding(2L, false, null); try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE); - NullableVarCharVector writeDictionaryVector = new NullableVarCharVector("dictionary", - allocator, null); + NullableVarCharVector writeDictionaryVector = + new NullableVarCharVector("dictionary", FieldType.nullable(VARCHAR.getType()), allocator); ListVector writeVector = new ListVector("list", allocator, null, null)) { // data being written: @@ -234,7 +247,7 @@ public void testNestedDictionary() throws IOException { writeDictionaryVector.getMutator().set(1, "bar".getBytes(StandardCharsets.UTF_8)); writeDictionaryVector.getMutator().setValueCount(2); - writeVector.addOrGetVector(MinorType.INT, writeEncoding); + writeVector.addOrGetVector(new FieldType(true, MinorType.INT.getType(), writeEncoding)); writeVector.allocateNew(); UnionListWriter listWriter = new UnionListWriter(writeVector); listWriter.startList(); diff --git a/java/vector/src/main/codegen/data/ArrowTypes.tdd b/java/vector/src/main/codegen/data/ArrowTypes.tdd index 67785ad6b4d19..e1fb5e0619a9b 100644 --- a/java/vector/src/main/codegen/data/ArrowTypes.tdd +++ b/java/vector/src/main/codegen/data/ArrowTypes.tdd @@ -14,59 +14,73 @@ types: [ { name: "Null", - fields: [] + fields: [], + complex: false }, { name: "Struct_", - fields: [] + fields: [], + complex: true }, { name: "List", - fields: [] + fields: [], + complex: true }, { name: "Union", - fields: [{name: "mode", type: short, valueType: UnionMode}, {name: "typeIds", type: "int[]"}] + fields: [{name: "mode", type: short, valueType: UnionMode}, {name: "typeIds", type: "int[]"}], + complex: true }, { name: "Int", - fields: [{name: "bitWidth", type: int}, {name: "isSigned", type: boolean}] + fields: [{name: "bitWidth", type: int}, {name: "isSigned", type: boolean}], + complex: false }, { name: "FloatingPoint", - fields: [{name: precision, type: short, valueType: FloatingPointPrecision}] + fields: [{name: precision, type: short, valueType: FloatingPointPrecision}], + complex: false }, { name: "Utf8", - fields: [] + fields: [], + complex: false }, { name: "Binary", - fields: [] + fields: [], + complex: false }, { name: "Bool", - fields: [] + fields: [], + complex: false }, { name: "Decimal", - fields: [{name: "precision", type: int}, {name: "scale", type: int}] + fields: [{name: "precision", type: int}, {name: "scale", type: int}], + complex: false }, { name: "Date", fields: [{name: "unit", type: short, valueType: DateUnit}] + complex: 
false }, { name: "Time", - fields: [{name: "unit", type: short, valueType: TimeUnit}, {name: "bitWidth", type: int}] + fields: [{name: "unit", type: short, valueType: TimeUnit}, {name: "bitWidth", type: int}], + complex: false }, { name: "Timestamp", fields: [{name: "unit", type: short, valueType: TimeUnit}, {name: "timezone", type: String}] + complex: false }, { name: "Interval", - fields: [{name: "unit", type: short, valueType: IntervalUnit}] + fields: [{name: "unit", type: short, valueType: IntervalUnit}], + complex: false } ] } diff --git a/java/vector/src/main/codegen/templates/ArrowType.java b/java/vector/src/main/codegen/templates/ArrowType.java index 91cbe98196b81..a9e875a2095f7 100644 --- a/java/vector/src/main/codegen/templates/ArrowType.java +++ b/java/vector/src/main/codegen/templates/ArrowType.java @@ -50,13 +50,35 @@ }) public abstract class ArrowType { + public static abstract class PrimitiveType extends ArrowType { + + private PrimitiveType() { + } + + @Override + public boolean isComplex() { + return false; + } + } + + public static abstract class ComplexType extends ArrowType { + + private ComplexType() { + } + + @Override + public boolean isComplex() { + return true; + } + } + public static enum ArrowTypeID { <#list arrowTypes.types as type> <#assign name = type.name> ${name?remove_ending("_")}(Type.${name}), NONE(Type.NONE); - + private final byte flatbufType; public byte getFlatbufID() { @@ -70,6 +92,8 @@ private ArrowTypeID(byte flatbufType) { @JsonIgnore public abstract ArrowTypeID getTypeID(); + @JsonIgnore + public abstract boolean isComplex(); public abstract int getType(FlatBufferBuilder builder); public abstract T accept(ArrowTypeVisitor visitor); @@ -87,21 +111,56 @@ public static interface ArrowTypeVisitor { } + /** + * to visit the Complex ArrowTypes and bundle Primitive ones in one case + */ + public static abstract class ComplexTypeVisitor implements ArrowTypeVisitor { + + public T visit(PrimitiveType type) { + throw new UnsupportedOperationException("Unexpected Primitive type: " + type); + } + + <#list arrowTypes.types as type> + <#if !type.complex> + public final T visit(${type.name?remove_ending("_")} type) { + return visit((PrimitiveType) type); + } + + + } + + /** + * to visit the Primitive ArrowTypes and bundle Complex ones under one case + */ + public static abstract class PrimitiveTypeVisitor implements ArrowTypeVisitor { + + public T visit(ComplexType type) { + throw new UnsupportedOperationException("Unexpected Complex type: " + type); + } + + <#list arrowTypes.types as type> + <#if type.complex> + public final T visit(${type.name?remove_ending("_")} type) { + return visit((ComplexType) type); + } + + + } + <#list arrowTypes.types as type> <#assign name = type.name?remove_ending("_")> <#assign fields = type.fields> - public static class ${name} extends ArrowType { + public static class ${name} extends <#if type.complex>ComplexType<#else>PrimitiveType { public static final ArrowTypeID TYPE_TYPE = ArrowTypeID.${name}; <#if type.fields?size == 0> public static final ${name} INSTANCE = new ${name}(); - + <#else> <#list fields as field> <#assign fieldType = field.valueType!field.type> ${fieldType} ${field.name}; - <#if type.fields?size != 0> @JsonCreator public ${type.name}( <#list type.fields as field> @@ -113,6 +172,13 @@ public static class ${name} extends ArrowType { this.${field.name} = ${field.name}; } + + <#list fields as field> + <#assign fieldType = field.valueType!field.type> + public ${fieldType} get${field.name?cap_first}() { + return 
${field.name}; + } + @Override @@ -143,13 +209,6 @@ public int getType(FlatBufferBuilder builder) { return org.apache.arrow.flatbuf.${type.name}.end${type.name}(builder); } - <#list fields as field> - <#assign fieldType = field.valueType!field.type> - public ${fieldType} get${field.name?cap_first}() { - return ${field.name}; - } - - public String toString() { return "${name}" <#if fields?size != 0> diff --git a/java/vector/src/main/codegen/templates/MapWriters.java b/java/vector/src/main/codegen/templates/MapWriters.java index 428ce0427d4b8..d3e6de9527123 100644 --- a/java/vector/src/main/codegen/templates/MapWriters.java +++ b/java/vector/src/main/codegen/templates/MapWriters.java @@ -64,7 +64,7 @@ public class ${mode}MapWriter extends AbstractFieldWriter { list(child.getName()); break; case UNION: - UnionWriter writer = new UnionWriter(container.addOrGet(child.getName(), MinorType.UNION, UnionVector.class, null), getNullableMapWriterFactory()); + UnionWriter writer = new UnionWriter(container.addOrGet(child.getName(), FieldType.nullable(MinorType.UNION.getType()), UnionVector.class), getNullableMapWriterFactory()); fields.put(handleCase(child.getName()), writer); break; <#list vv.types as type><#list type.minor as minor> @@ -113,7 +113,7 @@ public MapWriter map(String name) { FieldWriter writer = fields.get(finalName); if(writer == null){ int vectorCount=container.size(); - NullableMapVector vector = container.addOrGet(name, MinorType.MAP, NullableMapVector.class, null); + NullableMapVector vector = container.addOrGet(name, FieldType.nullable(MinorType.MAP.getType()), NullableMapVector.class); writer = new PromotableWriter(vector, container, getNullableMapWriterFactory()); if(vectorCount != container.size()) { writer.allocate(); @@ -157,7 +157,7 @@ public ListWriter list(String name) { FieldWriter writer = fields.get(finalName); int vectorCount = container.size(); if(writer == null) { - writer = new PromotableWriter(container.addOrGet(name, MinorType.LIST, ListVector.class, null), container, getNullableMapWriterFactory()); + writer = new PromotableWriter(container.addOrGet(name, FieldType.nullable(MinorType.LIST.getType()), ListVector.class), container, getNullableMapWriterFactory()); if (container.size() > vectorCount) { writer.allocate(); } @@ -222,7 +222,9 @@ public void end() { if(writer == null) { ValueVector vector; ValueVector currentVector = container.getChild(name); - ${vectName}Vector v = container.addOrGet(name, MinorType.${upperName}, ${vectName}Vector.class, null<#if minor.class == "Decimal"> , new int[] {precision, scale}); + ${vectName}Vector v = container.addOrGet(name, + FieldType.nullable(<#if minor.class == "Decimal">new Decimal(precision, scale)<#else>MinorType.${upperName}.getType()), + ${vectName}Vector.class); writer = new PromotableWriter(v, container, getNullableMapWriterFactory()); vector = v; if (currentVector == null || currentVector != vector) { diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index ec2ce7930cf5d..8e1727ca6c820 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -64,28 +64,21 @@ public final class ${className} extends BaseDataValueVector implements <#if type <#if minor.class == "Decimal"> private final int precision; private final int scale; + - public ${className}(String name, BufferAllocator allocator, DictionaryEncoding dictionary, int precision, 
int scale) { + public ${className}(String name, FieldType fieldType, BufferAllocator allocator) { super(name, allocator); - values = new ${valuesName}(valuesField, allocator, precision, scale); - this.precision = precision; - this.scale = scale; - mutator = new Mutator(); - accessor = new Accessor(); - field = new Field(name, true, new Decimal(precision, scale), dictionary, null); - innerVectors = Collections.unmodifiableList(Arrays.asList( - bits, - values - )); - } - <#else> - public ${className}(String name, BufferAllocator allocator, DictionaryEncoding dictionary) { - super(name, allocator); - values = new ${valuesName}(valuesField, allocator); - mutator = new Mutator(); - accessor = new Accessor(); - ArrowType type = Types.MinorType.${minor.class?upper_case}.getType(); - field = new Field(name, true, type, dictionary, null); + <#if minor.class == "Decimal"> + Decimal decimal = (Decimal)fieldType.getType(); + this.precision = decimal.getPrecision(); + this.scale = decimal.getScale(); + this.values = new ${valuesName}(valuesField, allocator, precision, scale); + <#else> + this.values = new ${valuesName}(valuesField, allocator); + + this.mutator = new Mutator(); + this.accessor = new Accessor(); + this.field = new Field(name, fieldType, null); innerVectors = Collections.unmodifiableList(Arrays.asList( bits, <#if type.major = "VarLen"> @@ -94,7 +87,6 @@ public final class ${className} extends BaseDataValueVector implements <#if type values )); } - @Override public BitVector getValidityVector() { @@ -341,12 +333,8 @@ public void splitAndTransferTo(int startIndex, int length, ${className} target) private class TransferImpl implements TransferPair { ${className} to; - public TransferImpl(String name, BufferAllocator allocator){ - <#if minor.class == "Decimal"> - to = new ${className}(name, allocator, field.getDictionary(), precision, scale); - <#else> - to = new ${className}(name, allocator, field.getDictionary()); - + public TransferImpl(String ref, BufferAllocator allocator){ + to = new ${className}(ref, field.getFieldType(), allocator); } public TransferImpl(${className} to){ diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java index d17935b08eefc..797b29342e4c1 100644 --- a/java/vector/src/main/codegen/templates/UnionVector.java +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -119,10 +119,22 @@ public List getFieldInnerVectors() { return this.innerVectors; } + private String fieldName(MinorType type) { + return type.name().toLowerCase(); + } + + private FieldType fieldType(MinorType type) { + return new FieldType(true, type.getType(), null); + } + + private T addOrGet(MinorType minorType, Class c) { + return internalMap.addOrGet(fieldName(minorType), fieldType(minorType), c); + } + public NullableMapVector getMap() { if (mapVector == null) { int vectorCount = internalMap.size(); - mapVector = internalMap.addOrGet("map", MinorType.MAP, NullableMapVector.class, null); + mapVector = addOrGet(MinorType.MAP, NullableMapVector.class); if (internalMap.size() > vectorCount) { mapVector.allocateNew(); if (callBack != null) { @@ -144,7 +156,7 @@ public NullableMapVector getMap() { public Nullable${name}Vector get${name}Vector() { if (${uncappedName}Vector == null) { int vectorCount = internalMap.size(); - ${uncappedName}Vector = internalMap.addOrGet("${lowerCaseName}", MinorType.${name?upper_case}, Nullable${name}Vector.class, null); + ${uncappedName}Vector = addOrGet(MinorType.${name?upper_case}, 
Nullable${name}Vector.class); if (internalMap.size() > vectorCount) { ${uncappedName}Vector.allocateNew(); if (callBack != null) { @@ -162,7 +174,7 @@ public NullableMapVector getMap() { public ListVector getList() { if (listVector == null) { int vectorCount = internalMap.size(); - listVector = internalMap.addOrGet("list", MinorType.LIST, ListVector.class, null); + listVector = addOrGet(MinorType.LIST, ListVector.class); if (internalMap.size() > vectorCount) { listVector.allocateNew(); if (callBack != null) { @@ -267,7 +279,7 @@ public void copyFromSafe(int inIndex, int outIndex, UnionVector from) { public FieldVector addVector(FieldVector v) { String name = v.getMinorType().name().toLowerCase(); Preconditions.checkState(internalMap.getChild(name) == null, String.format("%s vector already exists", name)); - final FieldVector newVector = internalMap.addOrGet(name, v.getMinorType(), v.getClass(), v.getField().getDictionary()); + final FieldVector newVector = internalMap.addOrGet(name, v.getField().getFieldType(), v.getClass()); v.makeTransferPair(newVector).transfer(); internalMap.putChild(name, newVector); if (callBack != null) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/FieldVector.java b/java/vector/src/main/java/org/apache/arrow/vector/FieldVector.java index 0fdbc48552aaa..6c2c8302a7b8b 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/FieldVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/FieldVector.java @@ -19,10 +19,11 @@ import java.util.List; -import io.netty.buffer.ArrowBuf; import org.apache.arrow.vector.schema.ArrowFieldNode; import org.apache.arrow.vector.types.pojo.Field; +import io.netty.buffer.ArrowBuf; + /** * A vector corresponding to a Field in the schema * It has inner vectors backed by buffers (validity, offsets, data, ...) 
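The VectorSchemaRoot.create() change below leans on the same consolidation: a Field can now build its own vector, children included, so the explicit MinorType lookup disappears. A minimal sketch of the resulting call pattern (illustrative only, not part of the patch; the field name is a placeholder):

    Field field = Field.nullable("id", new ArrowType.Int(32, true));
    try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         FieldVector vector = field.createVector(allocator)) {
      vector.allocateNew();
      // ... fill the vector, then hand it to a VectorSchemaRoot or writer ...
    }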
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VectorSchemaRoot.java b/java/vector/src/main/java/org/apache/arrow/vector/VectorSchemaRoot.java index 7e626fb14305e..29b96736001ce 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/VectorSchemaRoot.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/VectorSchemaRoot.java @@ -23,8 +23,6 @@ import java.util.Map; import org.apache.arrow.memory.BufferAllocator; -import org.apache.arrow.vector.types.Types; -import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.types.pojo.Schema; @@ -60,9 +58,7 @@ public VectorSchemaRoot(List fields, List fieldVectors, int public static VectorSchemaRoot create(Schema schema, BufferAllocator allocator) { List fieldVectors = new ArrayList<>(); for (Field field : schema.getFields()) { - MinorType minorType = Types.getMinorTypeForArrowType(field.getType()); - FieldVector vector = minorType.getNewVector(field.getName(), allocator, field.getDictionary(), null); - vector.initializeChildrenFromFields(field.getChildren()); + FieldVector vector = field.createVector(allocator); fieldVectors.add(vector); } if (fieldVectors.size() != schema.getFields().size()) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java index 86a5e82119831..71f2bea5b8fe1 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java @@ -22,7 +22,9 @@ import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.ValueVector; import org.apache.arrow.vector.types.Types.MinorType; -import org.apache.arrow.vector.types.pojo.DictionaryEncoding; +import org.apache.arrow.vector.types.pojo.ArrowType.List; +import org.apache.arrow.vector.types.pojo.ArrowType.Struct; +import org.apache.arrow.vector.types.pojo.FieldType; import org.apache.arrow.vector.util.CallBack; /** @@ -85,12 +87,24 @@ protected boolean supportsDirectRead() { // return the number of child vectors public abstract int size(); - // add a new vector with the input MajorType or return the existing vector if we already added one with the same type - public abstract T addOrGet(String name, MinorType minorType, Class clazz, DictionaryEncoding dictionary, int... 
precisionScale); + // add a new vector with the input FieldType or return the existing vector if we already added one with the same name + public abstract T addOrGet(String name, FieldType fieldType, Class clazz); // return the child vector with the input name public abstract T getChild(String name, Class clazz); // return the child vector's ordinal in the composite container public abstract VectorWithOrdinal getChildVectorWithOrdinal(String name); + + public NullableMapVector addOrGetMap(String name) { + return addOrGet(name, FieldType.nullable(new Struct()), NullableMapVector.class); + } + + public ListVector addOrGetList(String name) { + return addOrGet(name, FieldType.nullable(new List()), ListVector.class); + } + + public UnionVector addOrGetUnion(String name) { + return addOrGet(name, FieldType.nullable(MinorType.UNION.getType()), UnionVector.class); + } } \ No newline at end of file diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java index baeeb07873714..dc833edbed8d0 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java @@ -25,8 +25,7 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.ValueVector; -import org.apache.arrow.vector.types.Types.MinorType; -import org.apache.arrow.vector.types.pojo.DictionaryEncoding; +import org.apache.arrow.vector.types.pojo.FieldType; import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.util.MapWithOrdinal; @@ -102,8 +101,8 @@ public boolean allocateNewSafe() { * * * - * @param name the name of the field - * @param minorType the minorType for the vector + * @param childName the name of the field + * @param fieldType the type for the vector * @param clazz class of expected vector type * @param class type of expected vector type * @throws java.lang.IllegalStateException raised if there is a hard schema change @@ -111,8 +110,8 @@ public boolean allocateNewSafe() { * @return resultant {@link org.apache.arrow.vector.ValueVector} */ @Override - public T addOrGet(String name, MinorType minorType, Class clazz, DictionaryEncoding dictionary, int... precisionScale) { - final ValueVector existing = getChild(name); + public T addOrGet(String childName, FieldType fieldType, Class clazz) { + final ValueVector existing = getChild(childName); boolean create = false; if (existing == null) { create = true; @@ -123,9 +122,9 @@ public T addOrGet(String name, MinorType minorType, Clas create = true; } if (create) { - final T vector = clazz.cast(minorType.getNewVector(name, allocator, dictionary, callBack, precisionScale)); - putChild(name, vector); - if (callBack!=null) { + final T vector = clazz.cast(fieldType.createNewSingleVector(childName, allocator, callBack)); + putChild(childName, vector); + if (callBack != null) { callBack.doWork(); } return vector; @@ -163,14 +162,14 @@ public T getChild(String name, Class clazz) { return typeify(v, clazz); } - protected ValueVector add(String name, MinorType minorType, DictionaryEncoding dictionary, int... 
precisionScale) { - final ValueVector existing = getChild(name); + protected ValueVector add(String childName, FieldType fieldType) { + final ValueVector existing = getChild(childName); if (existing != null) { - throw new IllegalStateException(String.format("Vector already exists: Existing[%s], Requested[%s] ", existing.getClass().getSimpleName(), minorType)); + throw new IllegalStateException(String.format("Vector already exists: Existing[%s], Requested[%s] ", existing.getClass().getSimpleName(), fieldType)); } - FieldVector vector = minorType.getNewVector(name, allocator, dictionary, callBack, precisionScale); - putChild(name, vector); - if (callBack!=null) { + FieldVector vector = fieldType.createNewSingleVector(childName, allocator, callBack); + putChild(childName, vector); + if (callBack != null) { callBack.doWork(); } return vector; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java index eda1f3bc80a96..6b240c04f7124 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java @@ -27,8 +27,7 @@ import org.apache.arrow.vector.UInt4Vector; import org.apache.arrow.vector.ValueVector; import org.apache.arrow.vector.ZeroVector; -import org.apache.arrow.vector.types.Types.MinorType; -import org.apache.arrow.vector.types.pojo.DictionaryEncoding; +import org.apache.arrow.vector.types.pojo.FieldType; import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.util.SchemaChangeRuntimeException; @@ -154,10 +153,10 @@ public int size() { return vector == DEFAULT_DATA_VECTOR ? 0:1; } - public AddOrGetResult addOrGetVector(MinorType minorType, DictionaryEncoding dictionary) { + public AddOrGetResult addOrGetVector(FieldType fieldType) { boolean created = false; if (vector instanceof ZeroVector) { - vector = minorType.getNewVector(DATA_VECTOR_NAME, allocator, dictionary, callBack); + vector = fieldType.createNewSingleVector(DATA_VECTOR_NAME, allocator, callBack); // returned vector must have the same field created = true; if (callBack != null) { @@ -165,9 +164,9 @@ public AddOrGetResult addOrGetVector(MinorType minorT } } - if (vector.getField().getType().getTypeID() != minorType.getType().getTypeID()) { + if (vector.getField().getType().getTypeID() != fieldType.getType().getTypeID()) { final String msg = String.format("Inner vector type mismatch. 
Requested type: [%s], actual type: [%s]", - minorType.getType().getTypeID(), vector.getField().getType().getTypeID()); + fieldType.getType().getTypeID(), vector.getField().getType().getTypeID()); throw new SchemaChangeRuntimeException(msg); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java index 54b051b9781e5..d138ca339e3cf 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java @@ -40,10 +40,10 @@ import org.apache.arrow.vector.complex.reader.FieldReader; import org.apache.arrow.vector.complex.writer.FieldWriter; import org.apache.arrow.vector.schema.ArrowFieldNode; -import org.apache.arrow.vector.types.Types; import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.DictionaryEncoding; import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.FieldType; import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.util.JsonStringArrayList; import org.apache.arrow.vector.util.TransferPair; @@ -80,8 +80,7 @@ public void initializeChildrenFromFields(List children) { throw new IllegalArgumentException("Lists have only one child. Found: " + children); } Field field = children.get(0); - MinorType minorType = Types.getMinorTypeForArrowType(field.getType()); - AddOrGetResult addOrGetVector = addOrGetVector(minorType, field.getDictionary()); + AddOrGetResult addOrGetVector = addOrGetVector(field.getFieldType()); if (!addOrGetVector.isCreated()) { throw new IllegalArgumentException("Child vector already existed: " + addOrGetVector.getVector()); } @@ -164,11 +163,11 @@ public TransferImpl(String name, BufferAllocator allocator, CallBack callBack) { public TransferImpl(ListVector to) { this.to = to; - to.addOrGetVector(vector.getMinorType(), vector.getField().getDictionary()); + to.addOrGetVector(vector.getField().getFieldType()); pairs[0] = offsets.makeTransferPair(to.offsets); pairs[1] = bits.makeTransferPair(to.bits); if (to.getDataVector() instanceof ZeroVector) { - to.addOrGetVector(vector.getMinorType(), vector.getField().getDictionary()); + to.addOrGetVector(vector.getField().getFieldType()); } pairs[2] = getDataVector().makeTransferPair(to.getDataVector()); } @@ -241,8 +240,8 @@ public boolean allocateNewSafe() { return success; } - public AddOrGetResult addOrGetVector(MinorType minorType, DictionaryEncoding dictionary) { - AddOrGetResult result = super.addOrGetVector(minorType, dictionary); + public AddOrGetResult addOrGetVector(FieldType fieldType) { + AddOrGetResult result = super.addOrGetVector(fieldType); reader = new UnionListReader(this); return result; } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java index cb67537c446c6..997a6a38a080a 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/MapVector.java @@ -32,7 +32,6 @@ import org.apache.arrow.vector.complex.impl.SingleMapReaderImpl; import org.apache.arrow.vector.complex.reader.FieldReader; import org.apache.arrow.vector.holders.ComplexHolder; -import org.apache.arrow.vector.types.Types; import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.ArrowType.Struct; import 
org.apache.arrow.vector.types.pojo.Field; @@ -165,7 +164,7 @@ protected MapTransferPair(MapVector from, MapVector to, boolean allocate) { // (This is similar to what happens in ScanBatch where the children cannot be added till they are // read). To take care of this, we ensure that the hashCode of the MaterializedField does not // include the hashCode of the children but is based only on MaterializedField$key. - final FieldVector newVector = to.addOrGet(child, vector.getMinorType(), vector.getClass(), vector.getField().getDictionary()); + final FieldVector newVector = to.addOrGet(child, vector.getField().getFieldType(), vector.getClass()); if (allocate && to.size() != preSize) { newVector.allocateNew(); } @@ -318,8 +317,7 @@ public void close() { public void initializeChildrenFromFields(List children) { for (Field field : children) { - MinorType minorType = Types.getMinorTypeForArrowType(field.getType()); - FieldVector vector = (FieldVector)this.add(field.getName(), minorType, field.getDictionary()); + FieldVector vector = (FieldVector)this.add(field.getName(), field.getFieldType()); vector.initializeChildrenFromFields(field.getChildren()); } } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java index de1d1857370b0..7fe35e8253afb 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java @@ -270,4 +270,5 @@ public Accessor getAccessor() { public Mutator getMutator() { return mutator; } + } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java index 6d0531678488a..6851d6d45d562 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/ComplexWriterImpl.java @@ -22,7 +22,6 @@ import org.apache.arrow.vector.complex.NullableMapVector; import org.apache.arrow.vector.complex.StateTool; import org.apache.arrow.vector.complex.writer.BaseWriter.ComplexWriter; -import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.Field; import com.google.common.base.Preconditions; @@ -150,7 +149,7 @@ public MapWriter rootAsMap() { case INIT: // TODO allow dictionaries in complex types - NullableMapVector map = container.addOrGet(name, MinorType.MAP, NullableMapVector.class, null); + NullableMapVector map = container.addOrGetMap(name); mapRoot = nullableMapWriterFactory.build(map); mapRoot.setPosition(idx()); mode = Mode.MAP; @@ -182,7 +181,7 @@ public ListWriter rootAsList() { case INIT: int vectorCount = container.size(); // TODO allow dictionaries in complex types - ListVector listVector = container.addOrGet(name, MinorType.LIST, ListVector.class, null); + ListVector listVector = container.addOrGetList(name); if (container.size() > vectorCount) { listVector.allocateNew(); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java index 1880c9b490c27..d16718e75a701 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/PromotableWriter.java @@ -27,6 +27,7 
@@ import org.apache.arrow.vector.complex.writer.FieldWriter; import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.FieldType; import org.apache.arrow.vector.util.TransferPair; /** @@ -125,7 +126,7 @@ protected FieldWriter getWriter(MinorType type) { // ??? return null; } - ValueVector v = listVector.addOrGetVector(type, null).getVector(); + ValueVector v = listVector.addOrGetVector(FieldType.nullable(type.getType())).getVector(); v.allocateNew(); setWriter(v); writer.setPosition(position); @@ -151,7 +152,7 @@ private FieldWriter promoteToUnion() { tp.transfer(); if (parentContainer != null) { // TODO allow dictionaries in complex types - unionVector = parentContainer.addOrGet(name, MinorType.UNION, UnionVector.class, null); + unionVector = parentContainer.addOrGetUnion(name); unionVector.allocateNew(); } else if (listVector != null) { unionVector = listVector.promoteToUnion(); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java index 2f070237101d8..f07bb585f810c 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -92,45 +92,15 @@ import org.apache.arrow.vector.types.pojo.ArrowType.Timestamp; import org.apache.arrow.vector.types.pojo.ArrowType.Union; import org.apache.arrow.vector.types.pojo.ArrowType.Utf8; -import org.apache.arrow.vector.types.pojo.DictionaryEncoding; -import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.FieldType; import org.apache.arrow.vector.util.CallBack; public class Types { - private static final Field NULL_FIELD = new Field("", true, Null.INSTANCE, null); - private static final Field TINYINT_FIELD = new Field("", true, new Int(8, true), null); - private static final Field SMALLINT_FIELD = new Field("", true, new Int(16, true), null); - private static final Field INT_FIELD = new Field("", true, new Int(32, true), null); - private static final Field BIGINT_FIELD = new Field("", true, new Int(64, true), null); - private static final Field UINT1_FIELD = new Field("", true, new Int(8, false), null); - private static final Field UINT2_FIELD = new Field("", true, new Int(16, false), null); - private static final Field UINT4_FIELD = new Field("", true, new Int(32, false), null); - private static final Field UINT8_FIELD = new Field("", true, new Int(64, false), null); - private static final Field DATE_FIELD = new Field("", true, new Date(DateUnit.MILLISECOND), null); - private static final Field TIME_FIELD = new Field("", true, new Time(TimeUnit.MILLISECOND, 32), null); - private static final Field TIMESTAMPSEC_FIELD = new Field("", true, new Timestamp(TimeUnit.SECOND, "UTC"), null); - private static final Field TIMESTAMPMILLI_FIELD = new Field("", true, new Timestamp(TimeUnit.MILLISECOND, "UTC"), null); - private static final Field TIMESTAMPMICRO_FIELD = new Field("", true, new Timestamp(TimeUnit.MICROSECOND, "UTC"), null); - private static final Field TIMESTAMPNANO_FIELD = new Field("", true, new Timestamp(TimeUnit.NANOSECOND, "UTC"), null); - private static final Field INTERVALDAY_FIELD = new Field("", true, new Interval(IntervalUnit.DAY_TIME), null); - private static final Field INTERVALYEAR_FIELD = new Field("", true, new Interval(IntervalUnit.YEAR_MONTH), null); - private static final Field FLOAT4_FIELD = new Field("", true, new 
FloatingPoint(FloatingPointPrecision.SINGLE), null); - private static final Field FLOAT8_FIELD = new Field("", true, new FloatingPoint(FloatingPointPrecision.DOUBLE), null); - private static final Field VARCHAR_FIELD = new Field("", true, Utf8.INSTANCE, null); - private static final Field VARBINARY_FIELD = new Field("", true, Binary.INSTANCE, null); - private static final Field BIT_FIELD = new Field("", true, Bool.INSTANCE, null); - - public enum MinorType { NULL(Null.INSTANCE) { @Override - public Field getField() { - return NULL_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { return ZeroVector.INSTANCE; } @@ -141,13 +111,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { }, MAP(Struct.INSTANCE) { @Override - public Field getField() { - throw new UnsupportedOperationException("Cannot get simple field for Map type"); - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { - return new NullableMapVector(name, allocator, dictionary, callBack); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableMapVector(name, allocator, fieldType.getDictionary(), schemaChangeCallback); } @Override @@ -157,13 +122,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { }, TINYINT(new Int(8, true)) { @Override - public Field getField() { - return TINYINT_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { - return new NullableTinyIntVector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableTinyIntVector(name, fieldType, allocator); } @Override @@ -173,13 +133,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { }, SMALLINT(new Int(16, true)) { @Override - public Field getField() { - return SMALLINT_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { - return new NullableSmallIntVector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableSmallIntVector(name, fieldType, allocator); } @Override @@ -189,13 +144,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { }, INT(new Int(32, true)) { @Override - public Field getField() { - return INT_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... 
precisionScale) { - return new NullableIntVector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableIntVector(name, fieldType, allocator); } @Override @@ -205,13 +155,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { }, BIGINT(new Int(64, true)) { @Override - public Field getField() { - return BIGINT_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { - return new NullableBigIntVector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableBigIntVector(name, fieldType, allocator); } @Override @@ -221,13 +166,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { }, DATE(new Date(DateUnit.MILLISECOND)) { @Override - public Field getField() { - return DATE_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { - return new NullableDateVector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableDateVector(name, fieldType, allocator); } @Override @@ -237,13 +177,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { }, TIME(new Time(TimeUnit.MILLISECOND, 32)) { @Override - public Field getField() { - return TIME_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { - return new NullableTimeVector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableTimeVector(name, fieldType, allocator); } @Override @@ -254,13 +189,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { // time in second from the Unix epoch, 00:00:00.000000 on 1 January 1970, UTC. TIMESTAMPSEC(new Timestamp(org.apache.arrow.vector.types.TimeUnit.SECOND, "UTC")) { @Override - public Field getField() { - return TIMESTAMPSEC_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { - return new NullableTimeStampSecVector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableTimeStampSecVector(name, fieldType, allocator); } @Override @@ -271,13 +201,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { // time in millis from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC. TIMESTAMPMILLI(new Timestamp(org.apache.arrow.vector.types.TimeUnit.MILLISECOND, "UTC")) { @Override - public Field getField() { - return TIMESTAMPMILLI_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... 
precisionScale) { - return new NullableTimeStampMilliVector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableTimeStampMilliVector(name, fieldType, allocator); } @Override @@ -288,13 +213,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { // time in microsecond from the Unix epoch, 00:00:00.000000 on 1 January 1970, UTC. TIMESTAMPMICRO(new Timestamp(org.apache.arrow.vector.types.TimeUnit.MICROSECOND, "UTC")) { @Override - public Field getField() { - return TIMESTAMPMICRO_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { - return new NullableTimeStampMicroVector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableTimeStampMicroVector(name, fieldType, allocator); } @Override @@ -305,13 +225,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { // time in nanosecond from the Unix epoch, 00:00:00.000000000 on 1 January 1970, UTC. TIMESTAMPNANO(new Timestamp(org.apache.arrow.vector.types.TimeUnit.NANOSECOND, "UTC")) { @Override - public Field getField() { - return TIMESTAMPNANO_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { - return new NullableTimeStampNanoVector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableTimeStampNanoVector(name, fieldType, allocator); } @Override @@ -321,13 +236,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { }, INTERVALDAY(new Interval(IntervalUnit.DAY_TIME)) { @Override - public Field getField() { - return INTERVALDAY_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { - return new NullableIntervalDayVector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableIntervalDayVector(name, fieldType, allocator); } @Override @@ -337,13 +247,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { }, INTERVALYEAR(new Interval(IntervalUnit.YEAR_MONTH)) { @Override - public Field getField() { - return INTERVALYEAR_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { - return new NullableIntervalDayVector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableIntervalYearVector(name, fieldType, allocator); } @Override @@ -354,13 +259,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { // 4 byte ieee 754 FLOAT4(new FloatingPoint(SINGLE)) { @Override - public Field getField() { - return FLOAT4_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... 
precisionScale) { - return new NullableFloat4Vector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableFloat4Vector(name, fieldType, allocator); } @Override @@ -371,13 +271,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { // 8 byte ieee 754 FLOAT8(new FloatingPoint(DOUBLE)) { @Override - public Field getField() { - return FLOAT8_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { - return new NullableFloat8Vector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableFloat8Vector(name, fieldType, allocator); } @Override @@ -387,13 +282,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { }, BIT(Bool.INSTANCE) { @Override - public Field getField() { - return BIT_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { - return new NullableBitVector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableBitVector(name, fieldType, allocator); } @Override @@ -403,13 +293,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { }, VARCHAR(Utf8.INSTANCE) { @Override - public Field getField() { - return VARCHAR_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { - return new NullableVarCharVector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableVarCharVector(name, fieldType, allocator); } @Override @@ -419,13 +304,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { }, VARBINARY(Binary.INSTANCE) { @Override - public Field getField() { - return VARBINARY_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { - return new NullableVarBinaryVector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableVarBinaryVector(name, fieldType, allocator); } @Override @@ -438,14 +318,10 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { public ArrowType getType() { throw new UnsupportedOperationException("Cannot get simple type for Decimal type"); } - @Override - public Field getField() { - throw new UnsupportedOperationException("Cannot get simple field for Decimal type"); - } @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... 
precisionScale) { - return new NullableDecimalVector(name, allocator, dictionary, precisionScale[0], precisionScale[1]); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableDecimalVector(name, fieldType, allocator); } @Override @@ -455,13 +331,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { }, UINT1(new Int(8, false)) { @Override - public Field getField() { - return UINT1_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { - return new NullableUInt1Vector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableUInt1Vector(name, fieldType, allocator); } @Override @@ -471,13 +342,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { }, UINT2(new Int(16, false)) { @Override - public Field getField() { - return UINT2_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { - return new NullableUInt2Vector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableUInt2Vector(name, fieldType, allocator); } @Override @@ -487,13 +353,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { }, UINT4(new Int(32, false)) { @Override - public Field getField() { - return UINT4_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { - return new NullableUInt4Vector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableUInt4Vector(name, fieldType, allocator); } @Override @@ -503,13 +364,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { }, UINT8(new Int(64, false)) { @Override - public Field getField() { - return UINT8_FIELD; - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { - return new NullableUInt8Vector(name, allocator, dictionary); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableUInt8Vector(name, fieldType, allocator); } @Override @@ -519,13 +375,8 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { }, LIST(List.INSTANCE) { @Override - public Field getField() { - throw new UnsupportedOperationException("Cannot get simple field for List type"); - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... 
precisionScale) { - return new ListVector(name, allocator, dictionary, callBack); + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new ListVector(name, allocator, fieldType.getDictionary(), schemaChangeCallback); } @Override @@ -535,16 +386,11 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { }, UNION(new Union(Sparse, null)) { @Override - public Field getField() { - throw new UnsupportedOperationException("Cannot get simple field for Union type"); - } - - @Override - public FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale) { - if (dictionary != null) { + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + if (fieldType.getDictionary() != null) { throw new UnsupportedOperationException("Dictionary encoding not supported for complex types"); } - return new UnionVector(name, allocator, callBack); + return new UnionVector(name, allocator, schemaChangeCallback); } @Override @@ -563,9 +409,7 @@ public ArrowType getType() { return type; } - public abstract Field getField(); - - public abstract FieldVector getNewVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack, int... precisionScale); + public abstract FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback); public abstract FieldWriter getNewFieldWriter(ValueVector vector); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java index 011f0e6e446a8..05eb9cdceac23 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Field.java @@ -28,11 +28,10 @@ import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.schema.TypeLayout; import org.apache.arrow.vector.schema.VectorLayout; -import org.apache.arrow.vector.types.Types; -import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.ArrowType.Int; import com.fasterxml.jackson.annotation.JsonCreator; +import com.fasterxml.jackson.annotation.JsonIgnore; import com.fasterxml.jackson.annotation.JsonInclude; import com.fasterxml.jackson.annotation.JsonInclude.Include; import com.fasterxml.jackson.annotation.JsonProperty; @@ -41,10 +40,17 @@ import com.google.flatbuffers.FlatBufferBuilder; public class Field { + + public static Field nullablePrimitive(String name, ArrowType.PrimitiveType type) { + return nullable(name, type); + } + + public static Field nullable(String name, ArrowType type) { + return new Field(name, true, type, null, null); + } + private final String name; - private final boolean nullable; - private final ArrowType type; - private final DictionaryEncoding dictionary; + private final FieldType fieldType; private final List children; private final TypeLayout typeLayout; @@ -56,29 +62,31 @@ private Field( @JsonProperty("dictionary") DictionaryEncoding dictionary, @JsonProperty("children") List children, @JsonProperty("typeLayout") TypeLayout typeLayout) { + this(name, new FieldType(nullable, type, dictionary), children, typeLayout); + } + + private Field(String name, FieldType fieldType, List children, TypeLayout typeLayout) { + super(); this.name = name; - this.nullable = 
nullable; - this.type = checkNotNull(type); - this.dictionary = dictionary; - if (children == null) { - this.children = ImmutableList.of(); - } else { - this.children = children; - } + this.fieldType = checkNotNull(fieldType); + this.children = children == null ? ImmutableList.of() : children; this.typeLayout = checkNotNull(typeLayout); } + public Field(String name, FieldType fieldType, List children) { + this(name, fieldType, children, TypeLayout.getTypeLayout(fieldType.getType())); + } + public Field(String name, boolean nullable, ArrowType type, List children) { - this(name, nullable, type, null, children, TypeLayout.getTypeLayout(checkNotNull(type))); + this(name, nullable, type, null, children); } public Field(String name, boolean nullable, ArrowType type, DictionaryEncoding dictionary, List children) { - this(name, nullable, type, dictionary, children, TypeLayout.getTypeLayout(checkNotNull(type))); + this(name, new FieldType(nullable, type, dictionary), children); } public FieldVector createVector(BufferAllocator allocator) { - MinorType minorType = Types.getMinorTypeForArrowType(type); - FieldVector vector = minorType.getNewVector(name, allocator, dictionary, null); + FieldVector vector = fieldType.createNewSingleVector(name, allocator, null); vector.initializeChildrenFromFields(children); return vector; } @@ -110,7 +118,7 @@ public static Field convertField(org.apache.arrow.flatbuf.Field field) { } public void validate() { - TypeLayout expectedLayout = TypeLayout.getTypeLayout(type); + TypeLayout expectedLayout = TypeLayout.getTypeLayout(getType()); if (!expectedLayout.equals(typeLayout)) { throw new IllegalArgumentException("Deserialized field does not match expected vectors. expected: " + expectedLayout + " got " + typeLayout); } @@ -118,8 +126,9 @@ public void validate() { public int getField(FlatBufferBuilder builder) { int nameOffset = name == null ? 
-1 : builder.createString(name); - int typeOffset = type.getType(builder); + int typeOffset = getType().getType(builder); int dictionaryOffset = -1; + DictionaryEncoding dictionary = getDictionary(); if (dictionary != null) { int dictionaryType = dictionary.getIndexType().getType(builder); org.apache.arrow.flatbuf.DictionaryEncoding.startDictionaryEncoding(builder); @@ -143,8 +152,8 @@ public int getField(FlatBufferBuilder builder) { if (name != null) { org.apache.arrow.flatbuf.Field.addName(builder, nameOffset); } - org.apache.arrow.flatbuf.Field.addNullable(builder, nullable); - org.apache.arrow.flatbuf.Field.addTypeType(builder, type.getTypeID().getFlatbufID()); + org.apache.arrow.flatbuf.Field.addNullable(builder, isNullable()); + org.apache.arrow.flatbuf.Field.addTypeType(builder, getType().getTypeID().getFlatbufID()); org.apache.arrow.flatbuf.Field.addType(builder, typeOffset); org.apache.arrow.flatbuf.Field.addChildren(builder, childrenOffset); org.apache.arrow.flatbuf.Field.addLayout(builder, layoutOffset); @@ -159,15 +168,22 @@ public String getName() { } public boolean isNullable() { - return nullable; + return fieldType.isNullable(); } public ArrowType getType() { - return type; + return fieldType.getType(); + } + + @JsonIgnore + public FieldType getFieldType() { + return fieldType; } @JsonInclude(Include.NON_NULL) - public DictionaryEncoding getDictionary() { return dictionary; } + public DictionaryEncoding getDictionary() { + return fieldType.getDictionary(); + } public List getChildren() { return children; @@ -179,7 +195,7 @@ public TypeLayout getTypeLayout() { @Override public int hashCode() { - return Objects.hash(name, nullable, type, dictionary, children); + return Objects.hash(name, isNullable(), getType(), getDictionary(), children); } @Override @@ -189,10 +205,10 @@ public boolean equals(Object obj) { } Field that = (Field) obj; return Objects.equals(this.name, that.name) && - Objects.equals(this.nullable, that.nullable) && - Objects.equals(this.type, that.type) && - Objects.equals(this.dictionary, that.dictionary) && - Objects.equals(this.children, that.children); + Objects.equals(this.isNullable(), that.isNullable()) && + Objects.equals(this.getType(), that.getType()) && + Objects.equals(this.getDictionary(), that.getDictionary()) && + Objects.equals(this.children, that.children); } @Override @@ -201,14 +217,14 @@ public String toString() { if (name != null) { sb.append(name).append(": "); } - sb.append(type); - if (dictionary != null) { - sb.append("[dictionary: ").append(dictionary.getId()).append("]"); + sb.append(getType()); + if (getDictionary() != null) { + sb.append("[dictionary: ").append(getDictionary().getId()).append("]"); } if (!children.isEmpty()) { sb.append("<").append(Joiner.on(", ").join(children)).append(">"); } - if (!nullable) { + if (!isNullable()) { sb.append(" not null"); } return sb.toString(); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/FieldType.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/FieldType.java new file mode 100644 index 0000000000000..fe99e631360cc --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/FieldType.java @@ -0,0 +1,60 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + *
+ * http://www.apache.org/licenses/LICENSE-2.0 + *
+ * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector.types.pojo; + +import static com.google.common.base.Preconditions.checkNotNull; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.types.Types; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.util.CallBack; + +public class FieldType { + + public static FieldType nullable(ArrowType type) { + return new FieldType(true, type, null); + } + + private final boolean nullable; + private final ArrowType type; + private final DictionaryEncoding dictionary; + + public FieldType(boolean nullable, ArrowType type, DictionaryEncoding dictionary) { + super(); + this.nullable = nullable; + this.type = checkNotNull(type); + this.dictionary = dictionary; + } + + public boolean isNullable() { + return nullable; + } + public ArrowType getType() { + return type; + } + public DictionaryEncoding getDictionary() { + return dictionary; + } + + public FieldVector createNewSingleVector(String name, BufferAllocator allocator, CallBack schemaCallBack) { + MinorType minorType = Types.getMinorTypeForArrowType(type); + return minorType.getNewVector(name, this, allocator, schemaCallBack); + } + +} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestDecimalVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestDecimalVector.java index 20f4aa8cf643d..ee7530c8d1085 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestDecimalVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestDecimalVector.java @@ -17,16 +17,16 @@ */ package org.apache.arrow.vector; -import org.apache.arrow.memory.BufferAllocator; -import org.apache.arrow.memory.RootAllocator; -import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; -import org.apache.arrow.vector.util.DecimalUtility; -import org.junit.Test; +import static org.junit.Assert.assertEquals; import java.math.BigDecimal; import java.math.BigInteger; -import static org.junit.Assert.assertEquals; +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.util.DecimalUtility; +import org.junit.Test; public class TestDecimalVector { @@ -44,7 +44,7 @@ public class TestDecimalVector { @Test public void test() { BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); - NullableDecimalVector decimalVector = new NullableDecimalVector("decimal", allocator, null, 10, scale); + NullableDecimalVector decimalVector = TestUtils.newVector(NullableDecimalVector.class, "decimal", new ArrowType.Decimal(10, scale), allocator); decimalVector.allocateNew(); BigDecimal[] values = new BigDecimal[intValues.length]; for (int i = 0; i < intValues.length; i++) { diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestDictionaryVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestDictionaryVector.java index e3087ef8c95cc..3bf3b1cedff38 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestDictionaryVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestDictionaryVector.java @@ -17,19 +17,19 @@ */ 
package org.apache.arrow.vector; +import static org.apache.arrow.vector.TestUtils.newNullableVarCharVector; +import static org.junit.Assert.assertEquals; + +import java.nio.charset.StandardCharsets; + import org.apache.arrow.memory.BufferAllocator; -import org.apache.arrow.vector.dictionary.DictionaryEncoder; import org.apache.arrow.vector.dictionary.Dictionary; -import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.dictionary.DictionaryEncoder; import org.apache.arrow.vector.types.pojo.DictionaryEncoding; import org.junit.After; import org.junit.Before; import org.junit.Test; -import java.nio.charset.StandardCharsets; - -import static org.junit.Assert.assertEquals; - public class TestDictionaryVector { private BufferAllocator allocator; @@ -51,8 +51,8 @@ public void terminate() throws Exception { @Test public void testEncodeStrings() { // Create a new value vector - try (final NullableVarCharVector vector = (NullableVarCharVector) MinorType.VARCHAR.getNewVector("foo", allocator, null, null); - final NullableVarCharVector dictionaryVector = (NullableVarCharVector) MinorType.VARCHAR.getNewVector("dict", allocator, null, null)) { + try (final NullableVarCharVector vector = newNullableVarCharVector("foo", allocator); + final NullableVarCharVector dictionaryVector = newNullableVarCharVector("dict", allocator);) { final NullableVarCharVector.Mutator m = vector.getMutator(); vector.allocateNew(512, 5); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestUtils.java b/java/vector/src/test/java/org/apache/arrow/vector/TestUtils.java new file mode 100644 index 0000000000000..b79f2da9210ab --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestUtils.java @@ -0,0 +1,39 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.FieldType; + +public class TestUtils { + + public static NullableVarCharVector newNullableVarCharVector(String name, BufferAllocator allocator) { + return (NullableVarCharVector)FieldType.nullable(new ArrowType.Utf8()).createNewSingleVector(name, allocator, null); + } + + public static <T> T newVector(Class<T> c, String name, ArrowType type, BufferAllocator allocator) { + return c.cast(FieldType.nullable(type).createNewSingleVector(name, allocator, null)); + } + + public static <T> T newVector(Class<T> c, String name, MinorType type, BufferAllocator allocator) { + return c.cast(FieldType.nullable(type.getType()).createNewSingleVector(name, allocator, null)); + } + +} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java index 6917638d74e4d..78ca14dc406ea 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java @@ -17,6 +17,8 @@ */ package org.apache.arrow.vector; +import static org.apache.arrow.vector.TestUtils.newNullableVarCharVector; +import static org.apache.arrow.vector.TestUtils.newVector; import static org.junit.Assert.assertArrayEquals; import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertTrue; @@ -28,6 +30,7 @@ import org.apache.arrow.memory.RootAllocator; import org.apache.arrow.vector.schema.TypeLayout; import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.ArrowType; import org.apache.arrow.vector.types.pojo.Field; import org.junit.After; import org.junit.Assert; @@ -86,7 +89,7 @@ public void testFixedType() { public void testNullableVarLen2() { // Create a new value vector for 1024 integers. - try (final NullableVarCharVector vector = new NullableVarCharVector(EMPTY_SCHEMA_PATH, allocator, null)) { + try (final NullableVarCharVector vector = newNullableVarCharVector(EMPTY_SCHEMA_PATH, allocator)) { final NullableVarCharVector.Mutator m = vector.getMutator(); vector.allocateNew(1024 * 10, 1024); @@ -116,7 +119,7 @@ public void testNullableVarLen2() { public void testNullableFixedType() { // Create a new value vector for 1024 integers.
- try (final NullableUInt4Vector vector = new NullableUInt4Vector(EMPTY_SCHEMA_PATH, allocator, null)) { + try (final NullableUInt4Vector vector = newVector(NullableUInt4Vector.class, EMPTY_SCHEMA_PATH, new ArrowType.Int(32, false), allocator);) { final NullableUInt4Vector.Mutator m = vector.getMutator(); vector.allocateNew(1024); @@ -186,7 +189,7 @@ public void testNullableFixedType() { @Test public void testNullableFloat() { // Create a new value vector for 1024 integers - try (final NullableFloat4Vector vector = (NullableFloat4Vector) MinorType.FLOAT4.getNewVector(EMPTY_SCHEMA_PATH, allocator, null, null)) { + try (final NullableFloat4Vector vector = newVector(NullableFloat4Vector.class, EMPTY_SCHEMA_PATH, MinorType.FLOAT4, allocator);) { final NullableFloat4Vector.Mutator m = vector.getMutator(); vector.allocateNew(1024); @@ -233,7 +236,7 @@ public void testNullableFloat() { @Test public void testNullableInt() { // Create a new value vector for 1024 integers - try (final NullableIntVector vector = (NullableIntVector) MinorType.INT.getNewVector(EMPTY_SCHEMA_PATH, allocator, null, null)) { + try (final NullableIntVector vector = newVector(NullableIntVector.class, EMPTY_SCHEMA_PATH, MinorType.INT, allocator)) { final NullableIntVector.Mutator m = vector.getMutator(); vector.allocateNew(1024); @@ -403,7 +406,7 @@ private void validateRange(int length, int start, int count) { @Test public void testReAllocNullableFixedWidthVector() { // Create a new value vector for 1024 integers - try (final NullableFloat4Vector vector = (NullableFloat4Vector) MinorType.FLOAT4.getNewVector(EMPTY_SCHEMA_PATH, allocator, null, null)) { + try (final NullableFloat4Vector vector = newVector(NullableFloat4Vector.class, EMPTY_SCHEMA_PATH, MinorType.FLOAT4, allocator)) { final NullableFloat4Vector.Mutator m = vector.getMutator(); vector.allocateNew(1024); @@ -436,7 +439,7 @@ public void testReAllocNullableFixedWidthVector() { @Test public void testReAllocNullableVariableWidthVector() { // Create a new value vector for 1024 integers - try (final NullableVarCharVector vector = (NullableVarCharVector) MinorType.VARCHAR.getNewVector(EMPTY_SCHEMA_PATH, allocator, null, null)) { + try (final NullableVarCharVector vector = newVector(NullableVarCharVector.class, EMPTY_SCHEMA_PATH, MinorType.VARCHAR, allocator)) { final NullableVarCharVector.Mutator m = vector.getMutator(); vector.allocateNew(); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java index 2b49d8ed4b582..65b193c0aee4c 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/impl/TestPromotableWriter.java @@ -27,7 +27,6 @@ import org.apache.arrow.vector.complex.NullableMapVector; import org.apache.arrow.vector.complex.UnionVector; import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; -import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.ArrowType.ArrowTypeID; import org.apache.arrow.vector.types.pojo.Field; import org.junit.After; @@ -53,7 +52,7 @@ public void terminate() throws Exception { public void testPromoteToUnion() throws Exception { try (final MapVector container = new MapVector(EMPTY_SCHEMA_PATH, allocator, null); - final NullableMapVector v = container.addOrGet("test", MinorType.MAP, NullableMapVector.class, null); + final NullableMapVector v = 
container.addOrGetMap("test"); final PromotableWriter writer = new PromotableWriter(v, container)) { container.allocateNew(); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java index 75e5d2d6e5c98..a1104ffe545d8 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java @@ -17,6 +17,8 @@ */ package org.apache.arrow.vector.file; +import static org.apache.arrow.vector.TestUtils.newNullableVarCharVector; + import java.io.ByteArrayInputStream; import java.io.ByteArrayOutputStream; import java.io.File; @@ -28,8 +30,6 @@ import java.util.Arrays; import java.util.List; -import com.google.common.collect.ImmutableList; - import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.NullableTinyIntVector; @@ -40,19 +40,19 @@ import org.apache.arrow.vector.complex.NullableMapVector; import org.apache.arrow.vector.complex.impl.UnionListWriter; import org.apache.arrow.vector.dictionary.Dictionary; +import org.apache.arrow.vector.dictionary.DictionaryEncoder; import org.apache.arrow.vector.dictionary.DictionaryProvider; import org.apache.arrow.vector.dictionary.DictionaryProvider.MapDictionaryProvider; -import org.apache.arrow.vector.dictionary.DictionaryEncoder; import org.apache.arrow.vector.schema.ArrowBuffer; import org.apache.arrow.vector.schema.ArrowMessage; import org.apache.arrow.vector.schema.ArrowRecordBatch; import org.apache.arrow.vector.stream.ArrowStreamReader; import org.apache.arrow.vector.stream.ArrowStreamWriter; import org.apache.arrow.vector.stream.MessageSerializerTest; -import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.ArrowType.Int; import org.apache.arrow.vector.types.pojo.DictionaryEncoding; import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.FieldType; import org.apache.arrow.vector.types.pojo.Schema; import org.apache.arrow.vector.util.Text; import org.junit.Assert; @@ -60,6 +60,8 @@ import org.slf4j.Logger; import org.slf4j.LoggerFactory; +import com.google.common.collect.ImmutableList; + public class TestArrowFile extends BaseFileTest { private static final Logger LOGGER = LoggerFactory.getLogger(TestArrowFile.class); @@ -380,8 +382,8 @@ public void testWriteReadDictionary() throws IOException { // write try (BufferAllocator originalVectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); - NullableVarCharVector vector = new NullableVarCharVector("varchar", originalVectorAllocator, null); - NullableVarCharVector dictionaryVector = new NullableVarCharVector("dict", originalVectorAllocator, null)) { + NullableVarCharVector vector = newNullableVarCharVector("varchar", originalVectorAllocator); + NullableVarCharVector dictionaryVector = newNullableVarCharVector("dict", originalVectorAllocator)) { vector.allocateNewSafe(); NullableVarCharVector.Mutator mutator = vector.getMutator(); mutator.set(0, "foo".getBytes(StandardCharsets.UTF_8)); @@ -483,7 +485,7 @@ public void testWriteReadNestedDictionary() throws IOException { // [['foo', 'bar'], ['foo'], ['bar']] -> [[0, 1], [0], [1]] // write - try (NullableVarCharVector dictionaryVector = new NullableVarCharVector("dictionary", allocator, null); + try (NullableVarCharVector dictionaryVector = newNullableVarCharVector("dictionary", 
allocator); ListVector listVector = new ListVector("list", allocator, null, null)) { Dictionary dictionary = new Dictionary(dictionaryVector, encoding); @@ -495,7 +497,7 @@ public void testWriteReadNestedDictionary() throws IOException { dictionaryVector.getMutator().set(1, "bar".getBytes(StandardCharsets.UTF_8)); dictionaryVector.getMutator().setValueCount(2); - listVector.addOrGetVector(MinorType.INT, encoding); + listVector.addOrGetVector(new FieldType(true, new Int(32, true), encoding)); listVector.allocateNew(); UnionListWriter listWriter = new UnionListWriter(listVector); listWriter.startList(); @@ -511,7 +513,7 @@ public void testWriteReadNestedDictionary() throws IOException { listWriter.setValueCount(3); List fields = ImmutableList.of(listVector.getField()); - List vectors = ImmutableList.of((FieldVector) listVector); + List vectors = ImmutableList.of(listVector); VectorSchemaRoot root = new VectorSchemaRoot(fields, vectors, 3); try ( diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java index 914dfe4319db3..d00cb0f8c0065 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowReaderWriter.java @@ -24,11 +24,8 @@ import static org.junit.Assert.assertTrue; import java.io.ByteArrayOutputStream; -import java.io.File; -import java.io.FileOutputStream; import java.io.IOException; import java.nio.ByteBuffer; -import java.nio.channels.Channels; import java.util.Collections; import java.util.List; @@ -38,13 +35,10 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; import org.apache.arrow.vector.FieldVector; -import org.apache.arrow.vector.NullableIntVector; -import org.apache.arrow.vector.NullableTinyIntVector; +import org.apache.arrow.vector.TestUtils; import org.apache.arrow.vector.VectorSchemaRoot; import org.apache.arrow.vector.schema.ArrowFieldNode; import org.apache.arrow.vector.schema.ArrowRecordBatch; -import org.apache.arrow.vector.types.Types; -import org.apache.arrow.vector.types.Types.MinorType; import org.apache.arrow.vector.types.pojo.ArrowType; import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.types.pojo.Schema; @@ -78,8 +72,8 @@ byte[] array(ArrowBuf buf) { @Test public void test() throws IOException { Schema schema = new Schema(asList(new Field("testField", true, new ArrowType.Int(8, true), Collections.emptyList()))); - MinorType minorType = Types.getMinorTypeForArrowType(schema.getFields().get(0).getType()); - FieldVector vector = minorType.getNewVector("testField", allocator, null,null); + ArrowType type = schema.getFields().get(0).getType(); + FieldVector vector = TestUtils.newVector(FieldVector.class, "testField", type, allocator); vector.initializeChildrenFromFields(schema.getFields().get(0).getChildren()); byte[] validity = new byte[] { (byte) 255, 0}; From 7d1d4e751807ac38cfe7a5c537450ede3ae9eb00 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 3 Apr 2017 18:05:04 -0400 Subject: [PATCH 0460/1644] ARROW-657: [C++/Python] Expose Tensor IPC in Python. Add equals method. Add pyarrow.create_memory_map/memory_map functions This adds a `MemoryMappedFile::Create` C++ function for allocating new memory maps of a particular size, with a Python wrapper. I combined the Cython header declarations for the main libarrow libraries into a single pxd file. 
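For a concrete sense of the new API, a minimal caller-side sketch follows (illustrative only, not part of this patch): the helper name WriteToNewMap, the include paths, and the byte payload are assumptions, while MemoryMappedFile::Create, Write, and Close are the interfaces shown in the diffs below.

    #include <cstdint>
    #include <memory>
    #include <string>

    #include "arrow/io/file.h"  // arrow::io::MemoryMappedFile (assumed include path)
    #include "arrow/status.h"   // arrow::Status, RETURN_NOT_OK (assumed include path)

    // Hypothetical helper: allocate a fixed-size memory map and write a few bytes.
    arrow::Status WriteToNewMap(const std::string& path, int64_t size) {
      std::shared_ptr<arrow::io::MemoryMappedFile> mmap;
      // Create() sizes the file on disk, then reopens it in READWRITE mode.
      RETURN_NOT_OK(arrow::io::MemoryMappedFile::Create(path, size, &mmap));
      const uint8_t data[] = {1, 2, 3, 4};
      RETURN_NOT_OK(mmap->Write(data, sizeof(data)));
      return mmap->Close();
    }

The sizing step is what Create() adds over the existing Open(): as the file.cc hunk below shows, it truncates the file to the requested size (_chsize_s on MSVC, ftruncate elsewhere) before reopening it read/write.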
I am also checking in some tests from ARROW-718 that got left off the patch. Author: Wes McKinney Closes #483 from wesm/ARROW-657 and squashes the following commits: a5d2f00 [Wes McKinney] More readable cinit post refactor fbf438d [Wes McKinney] clang-format 4024c22 [Wes McKinney] Fix MSVC issues, use slower SetBitTo implementation that doesn't have compiler warning 7847d0f [Wes McKinney] Make file names unique 994b83d [Wes McKinney] Try to fix MSVC 25cfc83 [Wes McKinney] Expose Tensor IPC in Python. Add equals method. Add create_memory_map function and memory_map factory --- cpp/CMakeLists.txt | 2 +- cpp/src/arrow/buffer.h | 7 +- cpp/src/arrow/io/CMakeLists.txt | 2 +- cpp/src/arrow/io/file.cc | 13 ++ cpp/src/arrow/io/file.h | 4 + cpp/src/arrow/io/io-file-test.cc | 8 +- cpp/src/arrow/io/test-common.h | 20 +- cpp/src/arrow/ipc/CMakeLists.txt | 2 +- cpp/src/arrow/ipc/feather-internal.h | 16 +- cpp/src/arrow/ipc/feather.h | 1 + cpp/src/arrow/ipc/ipc-read-write-test.cc | 12 +- cpp/src/arrow/ipc/metadata.h | 27 +-- cpp/src/arrow/ipc/writer.h | 18 +- cpp/src/arrow/util/bit-util.h | 10 +- python/pyarrow/__init__.py | 6 +- python/pyarrow/_parquet.pxd | 4 +- python/pyarrow/_parquet.pyx | 2 - python/pyarrow/array.pxd | 1 + python/pyarrow/array.pyx | 6 + python/pyarrow/includes/libarrow.pxd | 235 +++++++++++++++++++++++ python/pyarrow/includes/libarrow_io.pxd | 171 ----------------- python/pyarrow/includes/libarrow_ipc.pxd | 94 --------- python/pyarrow/includes/pyarrow.pxd | 12 +- python/pyarrow/io.pxd | 4 +- python/pyarrow/io.pyx | 124 +++++++++++- python/pyarrow/tests/test_io.py | 16 +- python/pyarrow/tests/test_tensor.py | 93 +++++++++ 27 files changed, 553 insertions(+), 357 deletions(-) delete mode 100644 python/pyarrow/includes/libarrow_io.pxd delete mode 100644 python/pyarrow/includes/libarrow_ipc.pxd create mode 100644 python/pyarrow/tests/test_tensor.py diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index aacc7a15fffc9..d26c847807d79 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -806,7 +806,7 @@ set(ARROW_SRCS src/arrow/util/bit-util.cc ) -if(NOT APPLE) +if(NOT APPLE AND NOT MSVC) # Localize thirdparty symbols using a linker version script. This hides them # from the client application. The OS X linker does not support the # version-script option. diff --git a/cpp/src/arrow/buffer.h b/cpp/src/arrow/buffer.h index 3f14c964e83c1..a02ce3cbe8107 100644 --- a/cpp/src/arrow/buffer.h +++ b/cpp/src/arrow/buffer.h @@ -46,7 +46,7 @@ class Status; class ARROW_EXPORT Buffer { public: Buffer(const uint8_t* data, int64_t size) - : is_mutable_(false), data_(data), size_(size), capacity_(size) {} + : is_mutable_(false), data_(data), size_(size), capacity_(size) {} virtual ~Buffer(); /// An offset into data that is owned by another buffer, but we want to be @@ -57,7 +57,7 @@ class ARROW_EXPORT Buffer { /// in general we expected buffers to be aligned and padded to 64 bytes. In the future /// we might add utility methods to help determine if a buffer satisfies this contract. Buffer(const std::shared_ptr& parent, int64_t offset, int64_t size) - : Buffer(parent->data() + offset, size) { + : Buffer(parent->data() + offset, size) { parent_ = parent; } @@ -112,8 +112,7 @@ std::shared_ptr ARROW_EXPORT SliceMutableBuffer( /// A Buffer whose contents can be mutated. May or may not own its data. 
class ARROW_EXPORT MutableBuffer : public Buffer { public: - MutableBuffer(uint8_t* data, int64_t size) - : Buffer(data, size) { + MutableBuffer(uint8_t* data, int64_t size) : Buffer(data, size) { mutable_data_ = data; is_mutable_ = true; } diff --git a/cpp/src/arrow/io/CMakeLists.txt b/cpp/src/arrow/io/CMakeLists.txt index 3951eac322c6a..791c29c2797f9 100644 --- a/cpp/src/arrow/io/CMakeLists.txt +++ b/cpp/src/arrow/io/CMakeLists.txt @@ -75,7 +75,7 @@ set(ARROW_IO_SRCS memory.cc ) -if(NOT APPLE) +if(NOT APPLE AND NOT MSVC) # Localize thirdparty symbols using a linker version script. This hides them # from the client application. The OS X linker does not support the # version-script option. diff --git a/cpp/src/arrow/io/file.cc b/cpp/src/arrow/io/file.cc index 0aa2c92a07281..720be3d6e739c 100644 --- a/cpp/src/arrow/io/file.cc +++ b/cpp/src/arrow/io/file.cc @@ -604,6 +604,19 @@ class MemoryMappedFile::MemoryMap : public MutableBuffer { MemoryMappedFile::MemoryMappedFile() {} MemoryMappedFile::~MemoryMappedFile() {} +Status MemoryMappedFile::Create( + const std::string& path, int64_t size, std::shared_ptr* out) { + std::shared_ptr file; + RETURN_NOT_OK(FileOutputStream::Open(path, &file)); +#ifdef _MSC_VER + _chsize_s(file->file_descriptor(), static_cast(size)); +#else + ftruncate(file->file_descriptor(), static_cast(size)); +#endif + RETURN_NOT_OK(file->Close()); + return MemoryMappedFile::Open(path, FileMode::READWRITE, out); +} + Status MemoryMappedFile::Open(const std::string& path, FileMode::type mode, std::shared_ptr* out) { std::shared_ptr result(new MemoryMappedFile()); diff --git a/cpp/src/arrow/io/file.h b/cpp/src/arrow/io/file.h index f687fadc299bd..f0be3cf980162 100644 --- a/cpp/src/arrow/io/file.h +++ b/cpp/src/arrow/io/file.h @@ -106,6 +106,10 @@ class ARROW_EXPORT MemoryMappedFile : public ReadWriteFileInterface { public: ~MemoryMappedFile(); + /// Create new file with indicated size, return in read/write mode + static Status Create( + const std::string& path, int64_t size, std::shared_ptr* out); + static Status Open(const std::string& path, FileMode::type mode, std::shared_ptr* out); diff --git a/cpp/src/arrow/io/io-file-test.cc b/cpp/src/arrow/io/io-file-test.cc index 348be17d89341..a5784de3752d9 100644 --- a/cpp/src/arrow/io/io-file-test.cc +++ b/cpp/src/arrow/io/io-file-test.cc @@ -393,10 +393,8 @@ TEST_F(TestMemoryMappedFile, WriteRead) { const int reps = 5; std::string path = "ipc-write-read-test"; - CreateFile(path, reps * buffer_size); - std::shared_ptr result; - ASSERT_OK(MemoryMappedFile::Open(path, FileMode::READWRITE, &result)); + ASSERT_OK(InitMemoryMap(reps * buffer_size, path, &result)); int64_t position = 0; std::shared_ptr out_buffer; @@ -419,10 +417,8 @@ TEST_F(TestMemoryMappedFile, ReadOnly) { const int reps = 5; std::string path = "ipc-read-only-test"; - CreateFile(path, reps * buffer_size); - std::shared_ptr rwmmap; - ASSERT_OK(MemoryMappedFile::Open(path, FileMode::READWRITE, &rwmmap)); + ASSERT_OK(InitMemoryMap(reps * buffer_size, path, &rwmmap)); int64_t position = 0; for (int i = 0; i < reps; ++i) { diff --git a/cpp/src/arrow/io/test-common.h b/cpp/src/arrow/io/test-common.h index db5bcc1b4f49b..d6ec27048d51e 100644 --- a/cpp/src/arrow/io/test-common.h +++ b/cpp/src/arrow/io/test-common.h @@ -67,23 +67,17 @@ class MemoryMapFixture { } } - void CreateFile(const std::string path, int64_t size) { - FILE* file = fopen(path.c_str(), "w"); - if (file != nullptr) { - tmp_files_.push_back(path); -#ifdef _MSC_VER - _chsize(fileno(file), static_cast(size)); 
-#else - ftruncate(fileno(file), static_cast(size)); -#endif - fclose(file); - } + void CreateFile(const std::string& path, int64_t size) { + std::shared_ptr file; + ASSERT_OK(MemoryMappedFile::Create(path, size, &file)); + tmp_files_.push_back(path); } Status InitMemoryMap( int64_t size, const std::string& path, std::shared_ptr* mmap) { - CreateFile(path, size); - return MemoryMappedFile::Open(path, FileMode::READWRITE, mmap); + RETURN_NOT_OK(MemoryMappedFile::Create(path, size, mmap)); + tmp_files_.push_back(path); + return Status::OK(); } private: diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index 5fa7d6125ce5e..57db03311c06f 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -38,7 +38,7 @@ set(ARROW_IPC_SRCS writer.cc ) -if(NOT APPLE) +if(NOT MSVC AND NOT APPLE) # Localize thirdparty symbols using a linker version script. This hides them # from the client application. The OS X linker does not support the # version-script option. diff --git a/cpp/src/arrow/ipc/feather-internal.h b/cpp/src/arrow/ipc/feather-internal.h index 10b0cfd5d5ea2..6847445149bb0 100644 --- a/cpp/src/arrow/ipc/feather-internal.h +++ b/cpp/src/arrow/ipc/feather-internal.h @@ -41,11 +41,11 @@ typedef std::vector> ColumnVector; typedef flatbuffers::FlatBufferBuilder FBB; typedef flatbuffers::Offset FBString; -struct ColumnType { +struct ARROW_EXPORT ColumnType { enum type { PRIMITIVE, CATEGORY, TIMESTAMP, DATE, TIME }; }; -struct ArrayMetadata { +struct ARROW_EXPORT ArrayMetadata { ArrayMetadata() {} ArrayMetadata(fbs::Type type, int64_t offset, int64_t length, int64_t null_count, @@ -69,12 +69,12 @@ struct ArrayMetadata { int64_t total_bytes; }; -struct CategoryMetadata { +struct ARROW_EXPORT CategoryMetadata { ArrayMetadata levels; bool ordered; }; -struct TimestampMetadata { +struct ARROW_EXPORT TimestampMetadata { TimeUnit unit; // A timezone name known to the Olson timezone database. 
For display purposes @@ -82,7 +82,7 @@ struct TimestampMetadata { std::string timezone; }; -struct TimeMetadata { +struct ARROW_EXPORT TimeMetadata { TimeUnit unit; }; @@ -91,7 +91,7 @@ static constexpr const int kFeatherDefaultAlignment = 8; class ColumnBuilder; -class TableBuilder { +class ARROW_EXPORT TableBuilder { public: explicit TableBuilder(int64_t num_rows); ~TableBuilder() = default; @@ -116,7 +116,7 @@ class TableBuilder { int64_t num_rows_; }; -class TableMetadata { +class ARROW_EXPORT TableMetadata { public: TableMetadata() {} ~TableMetadata() = default; @@ -186,7 +186,7 @@ static inline void FromFlatbuffer(const fbs::PrimitiveArray* values, ArrayMetada out->total_bytes = values->total_bytes(); } -class ColumnBuilder { +class ARROW_EXPORT ColumnBuilder { public: ColumnBuilder(TableBuilder* parent, const std::string& name); ~ColumnBuilder() = default; diff --git a/cpp/src/arrow/ipc/feather.h b/cpp/src/arrow/ipc/feather.h index 8cc8ca092a1b2..4d59a8bbd54a9 100644 --- a/cpp/src/arrow/ipc/feather.h +++ b/cpp/src/arrow/ipc/feather.h @@ -27,6 +27,7 @@ #include #include "arrow/type.h" +#include "arrow/util/visibility.h" namespace arrow { diff --git a/cpp/src/arrow/ipc/ipc-read-write-test.cc b/cpp/src/arrow/ipc/ipc-read-write-test.cc index c900d0ba37ed2..86ec7701add20 100644 --- a/cpp/src/arrow/ipc/ipc-read-write-test.cc +++ b/cpp/src/arrow/ipc/ipc-read-write-test.cc @@ -106,6 +106,8 @@ TEST_F(TestSchemaMetadata, NestedFields) { &MakeStruct, &MakeUnion, &MakeDictionary, &MakeDates, &MakeTimestamps, &MakeTimes, \ &MakeFWBinary, &MakeBooleanBatch); +static int g_file_number = 0; + class IpcTestFixture : public io::MemoryMapFixture { public: Status DoStandardRoundTrip(const RecordBatch& batch, bool zero_data, @@ -163,8 +165,9 @@ class IpcTestFixture : public io::MemoryMapFixture { } void CheckRoundtrip(const RecordBatch& batch, int64_t buffer_size) { - std::string path = "test-write-row-batch"; - ASSERT_OK(io::MemoryMapFixture::InitMemoryMap(buffer_size, path, &mmap_)); + std::stringstream ss; + ss << "test-write-row-batch-" << g_file_number++; + ASSERT_OK(io::MemoryMapFixture::InitMemoryMap(buffer_size, ss.str(), &mmap_)); std::shared_ptr result; ASSERT_OK(DoStandardRoundTrip(batch, true, &result)); @@ -303,9 +306,10 @@ class RecursionLimits : public ::testing::Test, public io::MemoryMapFixture { std::vector> arrays = {array}; *batch = std::make_shared(*schema, batch_length, arrays); - std::string path = "test-write-past-max-recursion"; + std::stringstream ss; + ss << "test-write-past-max-recursion-" << g_file_number++; const int memory_map_size = 1 << 20; - io::MemoryMapFixture::InitMemoryMap(memory_map_size, path, &mmap_); + RETURN_NOT_OK(io::MemoryMapFixture::InitMemoryMap(memory_map_size, ss.str(), &mmap_)); if (override_level) { return WriteRecordBatch(**batch, 0, mmap_.get(), metadata_length, body_length, diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h index fac4a70aada8d..451a76d5249e0 100644 --- a/cpp/src/arrow/ipc/metadata.h +++ b/cpp/src/arrow/ipc/metadata.h @@ -72,7 +72,7 @@ using DictionaryMap = std::unordered_map>; using DictionaryTypeMap = std::unordered_map>; // Memoization data structure for handling shared dictionaries -class DictionaryMemo { +class ARROW_EXPORT DictionaryMemo { public: DictionaryMemo(); @@ -114,12 +114,12 @@ Status GetDictionaryTypes(const void* opaque_schema, DictionaryTypeMap* id_to_fi // Construct a complete Schema from the message. 
May be expensive for very // large schemas if you are only interested in a few fields -Status GetSchema(const void* opaque_schema, const DictionaryMemo& dictionary_memo, - std::shared_ptr* out); +Status ARROW_EXPORT GetSchema(const void* opaque_schema, + const DictionaryMemo& dictionary_memo, std::shared_ptr* out); -Status GetTensorMetadata(const void* opaque_tensor, std::shared_ptr* type, - std::vector* shape, std::vector* strides, - std::vector* dim_names); +Status ARROW_EXPORT GetTensorMetadata(const void* opaque_tensor, + std::shared_ptr* type, std::vector* shape, + std::vector* strides, std::vector* dim_names); class ARROW_EXPORT Message { public: @@ -157,18 +157,19 @@ class ARROW_EXPORT Message { /// \param[in] file the seekable file interface to read from /// \param[out] message the message read /// \return Status success or failure -Status ReadMessage(int64_t offset, int32_t metadata_length, io::RandomAccessFile* file, - std::shared_ptr* message); +Status ARROW_EXPORT ReadMessage(int64_t offset, int32_t metadata_length, + io::RandomAccessFile* file, std::shared_ptr* message); /// Read length-prefixed message with as-yet unknown length. Returns nullptr if /// there are not enough bytes available or the message length is 0 (e.g. EOS /// in a stream) -Status ReadMessage(io::InputStream* stream, std::shared_ptr* message); +Status ARROW_EXPORT ReadMessage( + io::InputStream* stream, std::shared_ptr* message); /// Write a serialized message with a length-prefix and padding to an 8-byte offset /// /// -Status WriteMessage( +Status ARROW_EXPORT WriteMessage( const Buffer& message, io::OutputStream* file, int32_t* message_length); // Serialize arrow::Schema as a Flatbuffer @@ -178,14 +179,14 @@ Status WriteMessage( // dictionary ids // \param[out] out the serialized arrow::Buffer // \return Status outcome -Status WriteSchemaMessage( +Status ARROW_EXPORT WriteSchemaMessage( const Schema& schema, DictionaryMemo* dictionary_memo, std::shared_ptr* out); -Status WriteRecordBatchMessage(int64_t length, int64_t body_length, +Status ARROW_EXPORT WriteRecordBatchMessage(int64_t length, int64_t body_length, const std::vector& nodes, const std::vector& buffers, std::shared_ptr* out); -Status WriteTensorMessage( +Status ARROW_EXPORT WriteTensorMessage( const Tensor& tensor, int64_t buffer_start_offset, std::shared_ptr* out); Status WriteDictionaryMessage(int64_t id, int64_t length, int64_t body_length, diff --git a/cpp/src/arrow/ipc/writer.h b/cpp/src/arrow/ipc/writer.h index 8b2dc9cd48788..0b7a6e1b56be5 100644 --- a/cpp/src/arrow/ipc/writer.h +++ b/cpp/src/arrow/ipc/writer.h @@ -66,9 +66,9 @@ namespace ipc { /// including padding to a 64-byte boundary /// @param(out) body_length: the size of the contiguous buffer block plus /// padding bytes -Status WriteRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, - io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, - MemoryPool* pool, int max_recursion_depth = kMaxNestingDepth, +Status ARROW_EXPORT WriteRecordBatch(const RecordBatch& batch, + int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, + int64_t* body_length, MemoryPool* pool, int max_recursion_depth = kMaxNestingDepth, bool allow_64bit = false); // Write Array as a DictionaryBatch message @@ -79,7 +79,7 @@ Status WriteDictionary(int64_t dictionary_id, const std::shared_ptr& dict // Compute the precise number of bytes needed in a contiguous memory segment to // write the record batch. 
This involves generating the complete serialized // Flatbuffers metadata. -Status GetRecordBatchSize(const RecordBatch& batch, int64_t* size); +Status ARROW_EXPORT GetRecordBatchSize(const RecordBatch& batch, int64_t* size); class ARROW_EXPORT StreamWriter { public: @@ -122,14 +122,14 @@ class ARROW_EXPORT FileWriter : public StreamWriter { /// EXPERIMENTAL: Write RecordBatch allowing lengths over INT32_MAX. This data /// may not be readable by all Arrow implementations -Status WriteLargeRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, - io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, - MemoryPool* pool); +Status ARROW_EXPORT WriteLargeRecordBatch(const RecordBatch& batch, + int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, + int64_t* body_length, MemoryPool* pool); /// EXPERIMENTAL: Write arrow::Tensor as a contiguous message /// -Status WriteTensor(const Tensor& tensor, io::OutputStream* dst, int32_t* metadata_length, - int64_t* body_length); +Status ARROW_EXPORT WriteTensor(const Tensor& tensor, io::OutputStream* dst, + int32_t* metadata_length, int64_t* body_length); } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/util/bit-util.h b/cpp/src/arrow/util/bit-util.h index 6e3e8ae9f2160..42afd0705f0f9 100644 --- a/cpp/src/arrow/util/bit-util.h +++ b/cpp/src/arrow/util/bit-util.h @@ -52,7 +52,7 @@ static inline int64_t Ceil2Bytes(int64_t size) { } static inline bool GetBit(const uint8_t* bits, int64_t i) { - return static_cast(bits[i / 8] & kBitmask[i % 8]); + return (bits[i / 8] & kBitmask[i % 8]) != 0; } static inline bool BitNotSet(const uint8_t* bits, int64_t i) { @@ -68,9 +68,13 @@ static inline void SetBit(uint8_t* bits, int64_t i) { } static inline void SetBitTo(uint8_t* bits, int64_t i, bool bit_is_set) { - // See https://graphics.stanford.edu/~seander/bithacks.html + // TODO: speed up. 
See https://graphics.stanford.edu/~seander/bithacks.html // "Conditionally set or clear bits without branching" - bits[i / 8] ^= static_cast(-bit_is_set ^ bits[i / 8]) & kBitmask[i % 8]; + if (bit_is_set) { + SetBit(bits, i); + } else { + ClearBit(bits, i); + } } static inline int64_t NextPower2(int64_t n) { diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 5215028c90f0d..6860f986fb6e8 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -42,8 +42,10 @@ from pyarrow.filesystem import Filesystem, HdfsClient, LocalFilesystem from pyarrow.io import (HdfsFile, NativeFile, PythonFileInterface, - Buffer, InMemoryOutputStream, BufferReader, - frombuffer) + Buffer, BufferReader, InMemoryOutputStream, + MemoryMappedFile, memory_map, + frombuffer, read_tensor, write_tensor, + memory_map, create_memory_map) from pyarrow.ipc import FileReader, FileWriter, StreamReader, StreamWriter diff --git a/python/pyarrow/_parquet.pxd b/python/pyarrow/_parquet.pxd index cf9ec8e787661..f12c86fdebc83 100644 --- a/python/pyarrow/_parquet.pxd +++ b/python/pyarrow/_parquet.pxd @@ -19,8 +19,8 @@ from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport (CArray, CSchema, CStatus, - CTable, CMemoryPool) -from pyarrow.includes.libarrow_io cimport RandomAccessFile, OutputStream + CTable, CMemoryPool, + RandomAccessFile, OutputStream) cdef extern from "parquet/api/schema.h" namespace "parquet::schema" nogil: diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx index c4cbd28e85dab..cfd2816e2a16e 100644 --- a/python/pyarrow/_parquet.pyx +++ b/python/pyarrow/_parquet.pyx @@ -23,8 +23,6 @@ from cython.operator cimport dereference as deref from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport * -from pyarrow.includes.libarrow_io cimport (RandomAccessFile, OutputStream, - FileOutputStream) cimport pyarrow.includes.pyarrow as pyarrow from pyarrow.array cimport Array diff --git a/python/pyarrow/array.pxd b/python/pyarrow/array.pxd index 42675630fd51b..f6aaea2582e21 100644 --- a/python/pyarrow/array.pxd +++ b/python/pyarrow/array.pxd @@ -53,6 +53,7 @@ cdef class Tensor: cdef object box_array(const shared_ptr[CArray]& sp_array) +cdef object box_tensor(const shared_ptr[CTensor]& sp_tensor) cdef class BooleanArray(Array): diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 398e4cbffa94d..e7c456d80a41f 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -347,6 +347,12 @@ strides: {2}""".format(self.type, self.shape, self.strides) &out)) return PyObject_to_object(out) + def equals(self, Tensor other): + """ + Return true if the tensors contains exactly equal data + """ + return self.tp.Equals(deref(other.tp)) + property is_mutable: def __get__(self): diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 8da063cbdc364..67d6af910c2b9 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -303,6 +303,162 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: shared_ptr[CTable]* result) +cdef extern from "arrow/io/interfaces.h" namespace "arrow::io" nogil: + enum FileMode" arrow::io::FileMode::type": + FileMode_READ" arrow::io::FileMode::READ" + FileMode_WRITE" arrow::io::FileMode::WRITE" + FileMode_READWRITE" arrow::io::FileMode::READWRITE" + + enum ObjectType" arrow::io::ObjectType::type": + ObjectType_FILE" arrow::io::ObjectType::FILE" + ObjectType_DIRECTORY" arrow::io::ObjectType::DIRECTORY" + + cdef 
cppclass FileInterface: + CStatus Close() + CStatus Tell(int64_t* position) + FileMode mode() + + cdef cppclass Readable: + CStatus ReadB" Read"(int64_t nbytes, shared_ptr[CBuffer]* out) + CStatus Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) + + cdef cppclass Seekable: + CStatus Seek(int64_t position) + + cdef cppclass Writeable: + CStatus Write(const uint8_t* data, int64_t nbytes) + + cdef cppclass OutputStream(FileInterface, Writeable): + pass + + cdef cppclass InputStream(FileInterface, Readable): + pass + + cdef cppclass RandomAccessFile(InputStream, Seekable): + CStatus GetSize(int64_t* size) + + CStatus ReadAt(int64_t position, int64_t nbytes, + int64_t* bytes_read, uint8_t* buffer) + CStatus ReadAt(int64_t position, int64_t nbytes, + int64_t* bytes_read, shared_ptr[CBuffer]* out) + + cdef cppclass WriteableFileInterface(OutputStream, Seekable): + CStatus WriteAt(int64_t position, const uint8_t* data, + int64_t nbytes) + + cdef cppclass ReadWriteFileInterface(RandomAccessFile, + WriteableFileInterface): + pass + + +cdef extern from "arrow/io/file.h" namespace "arrow::io" nogil: + + + cdef cppclass FileOutputStream(OutputStream): + @staticmethod + CStatus Open(const c_string& path, shared_ptr[FileOutputStream]* file) + + int file_descriptor() + + cdef cppclass ReadableFile(RandomAccessFile): + @staticmethod + CStatus Open(const c_string& path, shared_ptr[ReadableFile]* file) + + @staticmethod + CStatus Open(const c_string& path, CMemoryPool* memory_pool, + shared_ptr[ReadableFile]* file) + + int file_descriptor() + + cdef cppclass CMemoryMappedFile" arrow::io::MemoryMappedFile"\ + (ReadWriteFileInterface): + + @staticmethod + CStatus Create(const c_string& path, int64_t size, + shared_ptr[CMemoryMappedFile]* file) + + @staticmethod + CStatus Open(const c_string& path, FileMode mode, + shared_ptr[CMemoryMappedFile]* file) + + int file_descriptor() + + +cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: + CStatus HaveLibHdfs() + CStatus HaveLibHdfs3() + + enum HdfsDriver" arrow::io::HdfsDriver": + HdfsDriver_LIBHDFS" arrow::io::HdfsDriver::LIBHDFS" + HdfsDriver_LIBHDFS3" arrow::io::HdfsDriver::LIBHDFS3" + + cdef cppclass HdfsConnectionConfig: + c_string host + int port + c_string user + c_string kerb_ticket + HdfsDriver driver + + cdef cppclass HdfsPathInfo: + ObjectType kind; + c_string name + c_string owner + c_string group + int32_t last_modified_time + int32_t last_access_time + int64_t size + int16_t replication + int64_t block_size + int16_t permissions + + cdef cppclass HdfsReadableFile(RandomAccessFile): + pass + + cdef cppclass HdfsOutputStream(OutputStream): + pass + + cdef cppclass CHdfsClient" arrow::io::HdfsClient": + @staticmethod + CStatus Connect(const HdfsConnectionConfig* config, + shared_ptr[CHdfsClient]* client) + + CStatus CreateDirectory(const c_string& path) + + CStatus Delete(const c_string& path, c_bool recursive) + + CStatus Disconnect() + + c_bool Exists(const c_string& path) + + CStatus GetCapacity(int64_t* nbytes) + CStatus GetUsed(int64_t* nbytes) + + CStatus ListDirectory(const c_string& path, + vector[HdfsPathInfo]* listing) + + CStatus GetPathInfo(const c_string& path, HdfsPathInfo* info) + + CStatus Rename(const c_string& src, const c_string& dst) + + CStatus OpenReadable(const c_string& path, + shared_ptr[HdfsReadableFile]* handle) + + CStatus OpenWriteable(const c_string& path, c_bool append, + int32_t buffer_size, int16_t replication, + int64_t default_block_size, + shared_ptr[HdfsOutputStream]* handle) + + +cdef extern from 
"arrow/io/memory.h" namespace "arrow::io" nogil: + cdef cppclass CBufferReader" arrow::io::BufferReader"\ + (RandomAccessFile): + CBufferReader(const shared_ptr[CBuffer]& buffer) + CBufferReader(const uint8_t* data, int64_t nbytes) + + cdef cppclass BufferOutputStream(OutputStream): + BufferOutputStream(const shared_ptr[ResizableBuffer]& buffer) + + cdef extern from "arrow/ipc/metadata.h" namespace "arrow::ipc" nogil: cdef cppclass SchemaMessage: int num_fields() @@ -335,3 +491,82 @@ cdef extern from "arrow/ipc/metadata.h" namespace "arrow::ipc" nogil: shared_ptr[SchemaMessage] GetSchema() shared_ptr[RecordBatchMessage] GetRecordBatch() shared_ptr[DictionaryBatchMessage] GetDictionaryBatch() + + +cdef extern from "arrow/ipc/api.h" namespace "arrow::ipc" nogil: + + cdef cppclass CStreamWriter " arrow::ipc::StreamWriter": + @staticmethod + CStatus Open(OutputStream* sink, const shared_ptr[CSchema]& schema, + shared_ptr[CStreamWriter]* out) + + CStatus Close() + CStatus WriteRecordBatch(const CRecordBatch& batch) + + cdef cppclass CStreamReader " arrow::ipc::StreamReader": + + @staticmethod + CStatus Open(const shared_ptr[InputStream]& stream, + shared_ptr[CStreamReader]* out) + + shared_ptr[CSchema] schema() + + CStatus GetNextRecordBatch(shared_ptr[CRecordBatch]* batch) + + cdef cppclass CFileWriter " arrow::ipc::FileWriter"(CStreamWriter): + @staticmethod + CStatus Open(OutputStream* sink, const shared_ptr[CSchema]& schema, + shared_ptr[CFileWriter]* out) + + cdef cppclass CFileReader " arrow::ipc::FileReader": + + @staticmethod + CStatus Open(const shared_ptr[RandomAccessFile]& file, + shared_ptr[CFileReader]* out) + + @staticmethod + CStatus Open2" Open"(const shared_ptr[RandomAccessFile]& file, + int64_t footer_offset, shared_ptr[CFileReader]* out) + + shared_ptr[CSchema] schema() + + int num_record_batches() + + CStatus GetRecordBatch(int i, shared_ptr[CRecordBatch]* batch) + + CStatus WriteTensor(const CTensor& tensor, OutputStream* dst, + int32_t* metadata_length, + int64_t* body_length) + + CStatus ReadTensor(int64_t offset, RandomAccessFile* file, + shared_ptr[CTensor]* out) + + +cdef extern from "arrow/ipc/feather.h" namespace "arrow::ipc::feather" nogil: + + cdef cppclass CFeatherWriter" arrow::ipc::feather::TableWriter": + @staticmethod + CStatus Open(const shared_ptr[OutputStream]& stream, + unique_ptr[CFeatherWriter]* out) + + void SetDescription(const c_string& desc) + void SetNumRows(int64_t num_rows) + + CStatus Append(const c_string& name, const CArray& values) + CStatus Finalize() + + cdef cppclass CFeatherReader" arrow::ipc::feather::TableReader": + @staticmethod + CStatus Open(const shared_ptr[RandomAccessFile]& file, + unique_ptr[CFeatherReader]* out) + + c_string GetDescription() + c_bool HasDescription() + + int64_t num_rows() + int64_t num_columns() + + shared_ptr[CSchema] schema() + + CStatus GetColumn(int i, shared_ptr[CColumn]* out) + c_string GetColumnName(int i) diff --git a/python/pyarrow/includes/libarrow_io.pxd b/python/pyarrow/includes/libarrow_io.pxd deleted file mode 100644 index 5992c737df512..0000000000000 --- a/python/pyarrow/includes/libarrow_io.pxd +++ /dev/null @@ -1,171 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. 
The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - -# distutils: language = c++ - -from pyarrow.includes.common cimport * -from pyarrow.includes.libarrow cimport * - -cdef extern from "arrow/io/interfaces.h" namespace "arrow::io" nogil: - enum FileMode" arrow::io::FileMode::type": - FileMode_READ" arrow::io::FileMode::READ" - FileMode_WRITE" arrow::io::FileMode::WRITE" - FileMode_READWRITE" arrow::io::FileMode::READWRITE" - - enum ObjectType" arrow::io::ObjectType::type": - ObjectType_FILE" arrow::io::ObjectType::FILE" - ObjectType_DIRECTORY" arrow::io::ObjectType::DIRECTORY" - - cdef cppclass FileInterface: - CStatus Close() - CStatus Tell(int64_t* position) - FileMode mode() - - cdef cppclass Readable: - CStatus ReadB" Read"(int64_t nbytes, shared_ptr[CBuffer]* out) - CStatus Read(int64_t nbytes, int64_t* bytes_read, uint8_t* out) - - cdef cppclass Seekable: - CStatus Seek(int64_t position) - - cdef cppclass Writeable: - CStatus Write(const uint8_t* data, int64_t nbytes) - - cdef cppclass OutputStream(FileInterface, Writeable): - pass - - cdef cppclass InputStream(FileInterface, Readable): - pass - - cdef cppclass RandomAccessFile(InputStream, Seekable): - CStatus GetSize(int64_t* size) - - CStatus ReadAt(int64_t position, int64_t nbytes, - int64_t* bytes_read, uint8_t* buffer) - CStatus ReadAt(int64_t position, int64_t nbytes, - int64_t* bytes_read, shared_ptr[CBuffer]* out) - - cdef cppclass WriteableFileInterface(OutputStream, Seekable): - CStatus WriteAt(int64_t position, const uint8_t* data, - int64_t nbytes) - - cdef cppclass ReadWriteFileInterface(RandomAccessFile, - WriteableFileInterface): - pass - - -cdef extern from "arrow/io/file.h" namespace "arrow::io" nogil: - - - cdef cppclass FileOutputStream(OutputStream): - @staticmethod - CStatus Open(const c_string& path, shared_ptr[FileOutputStream]* file) - - int file_descriptor() - - cdef cppclass ReadableFile(RandomAccessFile): - @staticmethod - CStatus Open(const c_string& path, shared_ptr[ReadableFile]* file) - - @staticmethod - CStatus Open(const c_string& path, CMemoryPool* memory_pool, - shared_ptr[ReadableFile]* file) - - int file_descriptor() - - cdef cppclass CMemoryMappedFile" arrow::io::MemoryMappedFile"\ - (ReadWriteFileInterface): - @staticmethod - CStatus Open(const c_string& path, FileMode mode, - shared_ptr[CMemoryMappedFile]* file) - - int file_descriptor() - - -cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: - CStatus HaveLibHdfs() - CStatus HaveLibHdfs3() - - enum HdfsDriver" arrow::io::HdfsDriver": - HdfsDriver_LIBHDFS" arrow::io::HdfsDriver::LIBHDFS" - HdfsDriver_LIBHDFS3" arrow::io::HdfsDriver::LIBHDFS3" - - cdef cppclass HdfsConnectionConfig: - c_string host - int port - c_string user - c_string kerb_ticket - HdfsDriver driver - - cdef cppclass HdfsPathInfo: - ObjectType kind; - c_string name - c_string owner - c_string group - int32_t last_modified_time - int32_t last_access_time - int64_t size - int16_t replication - int64_t block_size - int16_t permissions - - cdef 
cppclass HdfsReadableFile(RandomAccessFile): - pass - - cdef cppclass HdfsOutputStream(OutputStream): - pass - - cdef cppclass CHdfsClient" arrow::io::HdfsClient": - @staticmethod - CStatus Connect(const HdfsConnectionConfig* config, - shared_ptr[CHdfsClient]* client) - - CStatus CreateDirectory(const c_string& path) - - CStatus Delete(const c_string& path, c_bool recursive) - - CStatus Disconnect() - - c_bool Exists(const c_string& path) - - CStatus GetCapacity(int64_t* nbytes) - CStatus GetUsed(int64_t* nbytes) - - CStatus ListDirectory(const c_string& path, - vector[HdfsPathInfo]* listing) - - CStatus GetPathInfo(const c_string& path, HdfsPathInfo* info) - - CStatus Rename(const c_string& src, const c_string& dst) - - CStatus OpenReadable(const c_string& path, - shared_ptr[HdfsReadableFile]* handle) - - CStatus OpenWriteable(const c_string& path, c_bool append, - int32_t buffer_size, int16_t replication, - int64_t default_block_size, - shared_ptr[HdfsOutputStream]* handle) - - -cdef extern from "arrow/io/memory.h" namespace "arrow::io" nogil: - cdef cppclass CBufferReader" arrow::io::BufferReader"\ - (RandomAccessFile): - CBufferReader(const shared_ptr[CBuffer]& buffer) - CBufferReader(const uint8_t* data, int64_t nbytes) - - cdef cppclass BufferOutputStream(OutputStream): - BufferOutputStream(const shared_ptr[ResizableBuffer]& buffer) diff --git a/python/pyarrow/includes/libarrow_ipc.pxd b/python/pyarrow/includes/libarrow_ipc.pxd deleted file mode 100644 index 59fd90bdac7a8..0000000000000 --- a/python/pyarrow/includes/libarrow_ipc.pxd +++ /dev/null @@ -1,94 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. 
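As context for the HdfsClient declarations above, here is a minimal sketch of driving them from Python through the pyarrow-level HdfsClient wrapper. The constructor and method names are assumptions that mirror HdfsConnectionConfig and the CHdfsClient methods (CreateDirectory, ListDirectory, OpenWriteable, Delete); the host, port, and paths are illustrative only.

    import pyarrow as pa

    # Assumed wrapper over arrow::io::HdfsClient; parameter names mirror
    # HdfsConnectionConfig (host, port, user, kerb_ticket, driver).
    client = pa.HdfsClient('namenode.example.com', 8020, user='hadoop')

    client.mkdir('/tmp/arrow-demo')             # CHdfsClient::CreateDirectory
    print(client.ls('/tmp'))                    # CHdfsClient::ListDirectory
    with client.open('/tmp/arrow-demo/x.bin', 'wb') as f:   # OpenWriteable
        f.write(b'hello arrow')
    client.delete('/tmp/arrow-demo', recursive=True)        # CHdfsClient::Delete
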
- -# distutils: language = c++ - -from pyarrow.includes.common cimport * -from pyarrow.includes.libarrow cimport (CArray, CColumn, CSchema, CRecordBatch) -from pyarrow.includes.libarrow_io cimport (InputStream, OutputStream, - RandomAccessFile) - - -cdef extern from "arrow/ipc/api.h" namespace "arrow::ipc" nogil: - - cdef cppclass CStreamWriter " arrow::ipc::StreamWriter": - @staticmethod - CStatus Open(OutputStream* sink, const shared_ptr[CSchema]& schema, - shared_ptr[CStreamWriter]* out) - - CStatus Close() - CStatus WriteRecordBatch(const CRecordBatch& batch) - - cdef cppclass CStreamReader " arrow::ipc::StreamReader": - - @staticmethod - CStatus Open(const shared_ptr[InputStream]& stream, - shared_ptr[CStreamReader]* out) - - shared_ptr[CSchema] schema() - - CStatus GetNextRecordBatch(shared_ptr[CRecordBatch]* batch) - - cdef cppclass CFileWriter " arrow::ipc::FileWriter"(CStreamWriter): - @staticmethod - CStatus Open(OutputStream* sink, const shared_ptr[CSchema]& schema, - shared_ptr[CFileWriter]* out) - - cdef cppclass CFileReader " arrow::ipc::FileReader": - - @staticmethod - CStatus Open(const shared_ptr[RandomAccessFile]& file, - shared_ptr[CFileReader]* out) - - @staticmethod - CStatus Open2" Open"(const shared_ptr[RandomAccessFile]& file, - int64_t footer_offset, shared_ptr[CFileReader]* out) - - shared_ptr[CSchema] schema() - - int num_record_batches() - - CStatus GetRecordBatch(int i, shared_ptr[CRecordBatch]* batch) - -cdef extern from "arrow/ipc/feather.h" namespace "arrow::ipc::feather" nogil: - - cdef cppclass CFeatherWriter" arrow::ipc::feather::TableWriter": - @staticmethod - CStatus Open(const shared_ptr[OutputStream]& stream, - unique_ptr[CFeatherWriter]* out) - - void SetDescription(const c_string& desc) - void SetNumRows(int64_t num_rows) - - CStatus Append(const c_string& name, const CArray& values) - CStatus Finalize() - - cdef cppclass CFeatherReader" arrow::ipc::feather::TableReader": - @staticmethod - CStatus Open(const shared_ptr[RandomAccessFile]& file, - unique_ptr[CFeatherReader]* out) - - c_string GetDescription() - c_bool HasDescription() - - int64_t num_rows() - int64_t num_columns() - - shared_ptr[CSchema] schema() - - CStatus GetColumn(int i, shared_ptr[CColumn]* out) - c_string GetColumnName(int i) diff --git a/python/pyarrow/includes/pyarrow.pxd b/python/pyarrow/includes/pyarrow.pxd index 9b64435e48d7f..c40df3db8a9c5 100644 --- a/python/pyarrow/includes/pyarrow.pxd +++ b/python/pyarrow/includes/pyarrow.pxd @@ -20,9 +20,9 @@ from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport (CArray, CBuffer, CColumn, CDataType, CTable, CTensor, CStatus, Type, - CMemoryPool, TimeUnit) - -cimport pyarrow.includes.libarrow_io as arrow_io + CMemoryPool, TimeUnit, + RandomAccessFile, OutputStream, + CBufferReader) cdef extern from "arrow/python/api.h" namespace "arrow::py" nogil: @@ -65,11 +65,11 @@ cdef extern from "arrow/python/api.h" namespace "arrow::py" nogil: cdef cppclass PyBuffer(CBuffer): PyBuffer(object o) - cdef cppclass PyReadableFile(arrow_io.RandomAccessFile): + cdef cppclass PyReadableFile(RandomAccessFile): PyReadableFile(object fo) - cdef cppclass PyOutputStream(arrow_io.OutputStream): + cdef cppclass PyOutputStream(OutputStream): PyOutputStream(object fo) - cdef cppclass PyBytesReader(arrow_io.CBufferReader): + cdef cppclass PyBytesReader(CBufferReader): PyBytesReader(object fo) diff --git a/python/pyarrow/io.pxd b/python/pyarrow/io.pxd index cffd29ab39111..0c37a09add574 100644 --- a/python/pyarrow/io.pxd +++ 
b/python/pyarrow/io.pxd @@ -19,8 +19,7 @@ from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport * -from pyarrow.includes.libarrow_io cimport (RandomAccessFile, - OutputStream) + cdef class Buffer: cdef: @@ -30,6 +29,7 @@ cdef class Buffer: cdef init(self, const shared_ptr[CBuffer]& buffer) + cdef class NativeFile: cdef: shared_ptr[RandomAccessFile] rd_file diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index 608b20d896ae3..98b5a62b372a2 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -27,12 +27,10 @@ from cython.operator cimport dereference as deref from libc.stdlib cimport malloc, free from pyarrow.includes.libarrow cimport * -from pyarrow.includes.libarrow_io cimport * -from pyarrow.includes.libarrow_ipc cimport * cimport pyarrow.includes.pyarrow as pyarrow from pyarrow.compat import frombytes, tobytes, encode_file_path -from pyarrow.array cimport Array +from pyarrow.array cimport Array, Tensor, box_tensor from pyarrow.error cimport check_status from pyarrow.memory cimport MemoryPool, maybe_unbox_memory_pool from pyarrow.schema cimport Schema @@ -340,7 +338,32 @@ cdef class MemoryMappedFile(NativeFile): cdef: object path - def __cinit__(self, path, mode='r'): + def __cinit__(self): + self.is_open = False + self.is_readable = 0 + self.is_writeable = 0 + + @staticmethod + def create(path, size): + cdef: + shared_ptr[CMemoryMappedFile] handle + c_string c_path = encode_file_path(path) + int64_t c_size = size + + with nogil: + check_status(CMemoryMappedFile.Create(c_path, c_size, &handle)) + + cdef MemoryMappedFile result = MemoryMappedFile() + result.path = path + result.is_readable = 1 + result.is_writeable = 1 + result.wr_file = handle + result.rd_file = handle + result.is_open = True + + return result + + def open(self, path, mode='r'): self.path = path cdef: @@ -348,8 +371,6 @@ cdef class MemoryMappedFile(NativeFile): shared_ptr[CMemoryMappedFile] handle c_string c_path = encode_file_path(path) - self.is_readable = self.is_writeable = 0 - if mode in ('r', 'rb'): c_mode = FileMode_READ self.is_readable = 1 @@ -370,6 +391,41 @@ cdef class MemoryMappedFile(NativeFile): self.is_open = True +def memory_map(path, mode='r'): + """ + Open memory map at file path. 
Size of the memory map cannot change + + Parameters + ---------- + path : string + mode : {'r', 'w'}, default 'r' + + Returns + ------- + mmap : MemoryMappedFile + """ + cdef MemoryMappedFile mmap = MemoryMappedFile() + mmap.open(path, mode) + return mmap + + +def create_memory_map(path, size): + """ + Create memory map at indicated path of the given size, return open + writeable file object + + Parameters + ---------- + path : string + size : int + + Returns + ------- + mmap : MemoryMappedFile + """ + return MemoryMappedFile.create(path, size) + + cdef class OSFile(NativeFile): """ Supports 'r', 'w' modes @@ -542,7 +598,7 @@ cdef get_reader(object source, shared_ptr[RandomAccessFile]* reader): cdef NativeFile nf if isinstance(source, six.string_types): - source = MemoryMappedFile(source, mode='r') + source = memory_map(source, mode='r') elif isinstance(source, Buffer): source = BufferReader(source) elif not isinstance(source, NativeFile) and hasattr(source, 'read'): @@ -1144,3 +1200,57 @@ cdef class FeatherReader: cdef Column col = Column() col.init(sp_column) return col + + +def write_tensor(Tensor tensor, NativeFile dest): + """ + Write pyarrow.Tensor to pyarrow.NativeFile object its current position + + Parameters + ---------- + tensor : pyarrow.Tensor + dest : pyarrow.NativeFile + + Returns + ------- + bytes_written : int + Total number of bytes written to the file + """ + cdef: + int32_t metadata_length + int64_t body_length + + dest._assert_writeable() + + with nogil: + check_status( + WriteTensor(deref(tensor.tp), dest.wr_file.get(), + &metadata_length, &body_length)) + + return metadata_length + body_length + + +def read_tensor(NativeFile source): + """ + Read pyarrow.Tensor from pyarrow.NativeFile object from current + position. If the file source supports zero copy (e.g. 
a memory map), then + this operation does not allocate any memory + + Parameters + ---------- + source : pyarrow.NativeFile + + Returns + ------- + tensor : Tensor + """ + cdef: + shared_ptr[CTensor] sp_tensor + + source._assert_writeable() + + cdef int64_t offset = source.tell() + with nogil: + check_status(ReadTensor(offset, source.rd_file.get(), &sp_tensor)) + + return box_tensor(sp_tensor) diff --git a/python/pyarrow/tests/test_io.py b/python/pyarrow/tests/test_io.py index 15c5e6b924385..beb6113849ac3 100644 --- a/python/pyarrow/tests/test_io.py +++ b/python/pyarrow/tests/test_io.py @@ -246,10 +246,10 @@ def teardown(): return path, data -def _check_native_file_reader(KLASS, sample_data): +def _check_native_file_reader(FACTORY, sample_data): path, data = sample_data - f = KLASS(path, mode='r') + f = FACTORY(path, mode='r') assert f.read(10) == data[:10] assert f.read(0) == b'' @@ -269,14 +269,14 @@ def _check_native_file_reader(KLASS, sample_data): def test_memory_map_reader(sample_disk_data): - _check_native_file_reader(io.MemoryMappedFile, sample_disk_data) + _check_native_file_reader(pa.memory_map, sample_disk_data) def test_memory_map_retain_buffer_reference(sample_disk_data): path, data = sample_disk_data cases = [] - with io.MemoryMappedFile(path, 'rb') as f: + with pa.memory_map(path, 'rb') as f: cases.append((f.read_buffer(100), data[:100])) cases.append((f.read_buffer(100), data[100:200])) cases.append((f.read_buffer(100), data[200:300])) @@ -309,7 +309,7 @@ def test_memory_map_writer(): with open(path, 'wb') as f: f.write(data) - f = io.MemoryMappedFile(path, mode='r+w') + f = pa.memory_map(path, mode='r+w') f.seek(10) f.write('peekaboo') @@ -318,7 +318,7 @@ def test_memory_map_writer(): f.seek(10) assert f.read(8) == b'peekaboo' - f2 = io.MemoryMappedFile(path, mode='r+w') + f2 = pa.memory_map(path, mode='r+w') f2.seek(10) f2.write(b'booapeak') @@ -328,10 +328,10 @@ def test_memory_map_writer(): assert f.read(8) == b'booapeak' # Does not truncate file - f3 = io.MemoryMappedFile(path, mode='w') + f3 = pa.memory_map(path, mode='w') f3.write('foo') - with io.MemoryMappedFile(path) as f4: + with pa.memory_map(path) as f4: assert f4.size() == SIZE with pytest.raises(IOError): diff --git a/python/pyarrow/tests/test_tensor.py b/python/pyarrow/tests/test_tensor.py new file mode 100644 index 0000000000000..5327f1a74a33e --- /dev/null +++ b/python/pyarrow/tests/test_tensor.py @@ -0,0 +1,93 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
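Putting the new APIs together: memory_map and create_memory_map open or create a map of fixed size, while write_tensor and read_tensor move a Tensor through it, with the read side allocating no memory when the source supports zero copy. A minimal sketch, mirroring the round-trip test added below; the file path and the 1024-byte size are illustrative.

    import numpy as np
    import pyarrow as pa

    tensor = pa.Tensor.from_numpy(np.random.randn(10, 4))

    # Create a writeable memory map large enough for the tensor, write the
    # tensor at the current position, then seek back and read it again.
    mmap = pa.create_memory_map('tensor-demo.arrow', 1024)
    nbytes = pa.write_tensor(tensor, mmap)    # metadata_length + body_length

    mmap.seek(0)
    result = pa.read_tensor(mmap)             # zero-copy read from the map
    assert result.equals(tensor)
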
+ +import os +import pytest + +import numpy as np +import pyarrow as pa + + +def test_tensor_attrs(): + data = np.random.randn(10, 4) + + tensor = pa.Tensor.from_numpy(data) + + assert tensor.ndim == 2 + assert tensor.size == 40 + assert tensor.shape == list(data.shape) + assert tensor.strides == list(data.strides) + + assert tensor.is_contiguous + assert tensor.is_mutable + + # not writeable + data2 = data.copy() + data2.flags.writeable = False + tensor = pa.Tensor.from_numpy(data2) + assert not tensor.is_mutable + + +@pytest.mark.parametrize('dtype_str,arrow_type', [ + ('i1', pa.int8()), + ('i2', pa.int16()), + ('i4', pa.int32()), + ('i8', pa.int64()), + ('u1', pa.uint8()), + ('u2', pa.uint16()), + ('u4', pa.uint32()), + ('u8', pa.uint64()), + ('f2', pa.float16()), + ('f4', pa.float32()), + ('f8', pa.float64()) +]) +def test_tensor_numpy_roundtrip(dtype_str, arrow_type): + dtype = np.dtype(dtype_str) + data = (100 * np.random.randn(10, 4)).astype(dtype) + + tensor = pa.Tensor.from_numpy(data) + assert tensor.type == arrow_type + + repr(tensor) + + result = tensor.to_numpy() + assert (data == result).all() + + +def _try_delete(path): + try: + os.remove(path) + except os.error: + pass + + +def test_tensor_ipc_roundtrip(): + data = np.random.randn(10, 4) + tensor = pa.Tensor.from_numpy(data) + + path = 'pyarrow-tensor-ipc-roundtrip' + try: + mmap = pa.create_memory_map(path, 1024) + + pa.write_tensor(tensor, mmap) + + mmap.seek(0) + result = pa.read_tensor(mmap) + + assert result.equals(tensor) + finally: + _try_delete(path) From f05b7c62cf6151a2a03292508628f8f1a8e7a1aa Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 3 Apr 2017 18:05:39 -0400 Subject: [PATCH 0461/1644] ARROW-443: [Python] Support ingest of strided NumPy arrays from pandas Author: Wes McKinney Closes #482 from wesm/ARROW-443 and squashes the following commits: d9b36c0 [Wes McKinney] Run commented out test cases, fix issue 8f0ff38 [Wes McKinney] cpplint 88eef1a [Wes McKinney] Support strided mask argument in some object conversions 22d4489 [Wes McKinney] First cut at strided NumPy import --- cpp/src/arrow/python/config.cc | 1 + cpp/src/arrow/python/numpy-internal.h | 66 +++++ cpp/src/arrow/python/pandas_convert.cc | 252 +++++++++++++------- python/pyarrow/array.pyx | 4 +- python/pyarrow/tests/test_convert_pandas.py | 59 ++++- 5 files changed, 290 insertions(+), 92 deletions(-) create mode 100644 cpp/src/arrow/python/numpy-internal.h diff --git a/cpp/src/arrow/python/config.cc b/cpp/src/arrow/python/config.cc index 2abc4dda6ee17..c2a69168bb01e 100644 --- a/cpp/src/arrow/python/config.cc +++ b/cpp/src/arrow/python/config.cc @@ -16,6 +16,7 @@ // under the License. #include +#include #include "arrow/python/config.h" diff --git a/cpp/src/arrow/python/numpy-internal.h b/cpp/src/arrow/python/numpy-internal.h new file mode 100644 index 0000000000000..fcc6a58f2a347 --- /dev/null +++ b/cpp/src/arrow/python/numpy-internal.h @@ -0,0 +1,66 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +// Internal utilities for dealing with NumPy + +#ifndef ARROW_PYTHON_NUMPY_INTERNAL_H +#define ARROW_PYTHON_NUMPY_INTERNAL_H + +#include + +#include + +#include "arrow/python/numpy_convert.h" +#include "arrow/python/numpy_interop.h" + +namespace arrow { +namespace py { + +/// Indexing convenience for interacting with strided 1-dim ndarray objects +template +class Ndarray1DIndexer { + public: + typedef int64_t size_type; + + Ndarray1DIndexer() : arr_(nullptr), data_(nullptr) {} + + explicit Ndarray1DIndexer(PyArrayObject* arr) : Ndarray1DIndexer() { Init(arr); } + + void Init(PyArrayObject* arr) { + arr_ = arr; + DCHECK_EQ(1, PyArray_NDIM(arr)) << "Only works with 1-dimensional arrays"; + Py_INCREF(arr); + data_ = reinterpret_cast(PyArray_DATA(arr)); + stride_ = PyArray_STRIDES(arr)[0] / sizeof(T); + } + + ~Ndarray1DIndexer() { Py_XDECREF(arr_); } + + int64_t size() const { return PyArray_SIZE(arr_); } + + T& operator[](size_type index) { return *(data_ + index * stride_); } + + private: + PyArrayObject* arr_; + T* data_; + int64_t stride_; +}; + +} // namespace py +} // namespace arrow + +#endif // ARROW_PYTHON_NUMPY_INTERNAL_H diff --git a/cpp/src/arrow/python/pandas_convert.cc b/cpp/src/arrow/python/pandas_convert.cc index 01019e5669f2d..9577892a55b76 100644 --- a/cpp/src/arrow/python/pandas_convert.cc +++ b/cpp/src/arrow/python/pandas_convert.cc @@ -47,6 +47,7 @@ #include "arrow/python/builtin_convert.h" #include "arrow/python/common.h" #include "arrow/python/config.h" +#include "arrow/python/numpy-internal.h" #include "arrow/python/numpy_convert.h" #include "arrow/python/type_traits.h" #include "arrow/python/util/datetime.h" @@ -70,15 +71,16 @@ static inline bool PyObject_is_string(const PyObject* obj) { } template -static int64_t ValuesToBitmap(const void* data, int64_t length, uint8_t* bitmap) { +static int64_t ValuesToBitmap(PyArrayObject* arr, uint8_t* bitmap) { typedef npy_traits traits; typedef typename traits::value_type T; int64_t null_count = 0; - const T* values = reinterpret_cast(data); + + Ndarray1DIndexer values(arr); // TODO(wesm): striding - for (int i = 0; i < length; ++i) { + for (int i = 0; i < values.size(); ++i) { if (traits::isnull(values[i])) { ++null_count; } else { @@ -92,8 +94,8 @@ static int64_t ValuesToBitmap(const void* data, int64_t length, uint8_t* bitmap) // Returns null count static int64_t MaskToBitmap(PyArrayObject* mask, int64_t length, uint8_t* bitmap) { int64_t null_count = 0; - const uint8_t* mask_values = static_cast(PyArray_DATA(mask)); - // TODO(wesm): strided null mask + + Ndarray1DIndexer mask_values(mask); for (int i = 0; i < length; ++i) { if (mask_values[i]) { ++null_count; @@ -138,13 +140,24 @@ Status CheckFlatNumpyArray(PyArrayObject* numpy_array, int np_type) { return Status::OK(); } -Status AppendObjectStrings(int64_t objects_length, StringBuilder* builder, - PyObject** objects, bool* have_bytes) { +static Status AppendObjectStrings( + PyArrayObject* arr, PyArrayObject* mask, StringBuilder* builder, bool* have_bytes) { PyObject* obj; - for (int64_t i = 0; i < objects_length; ++i) { + Ndarray1DIndexer objects(arr); + 
Ndarray1DIndexer mask_values; + + bool have_mask = false; + if (mask != nullptr) { + mask_values.Init(mask); + have_mask = true; + } + + for (int64_t i = 0; i < objects.size(); ++i) { obj = objects[i]; - if (PyUnicode_Check(obj)) { + if ((have_mask && mask_values[i]) || PyObject_is_null(obj)) { + RETURN_NOT_OK(builder->AppendNull()); + } else if (PyUnicode_Check(obj)) { obj = PyUnicode_AsUTF8String(obj); if (obj == NULL) { PyErr_Clear(); @@ -158,8 +171,6 @@ Status AppendObjectStrings(int64_t objects_length, StringBuilder* builder, *have_bytes = true; const int32_t length = static_cast(PyBytes_GET_SIZE(obj)); RETURN_NOT_OK(builder->Append(PyBytes_AS_STRING(obj), length)); - } else if (PyObject_is_null(obj)) { - RETURN_NOT_OK(builder->AppendNull()); } else { return InvalidConversion(obj, "string or bytes"); } @@ -168,13 +179,24 @@ Status AppendObjectStrings(int64_t objects_length, StringBuilder* builder, return Status::OK(); } -static Status AppendObjectFixedWidthBytes(int64_t objects_length, int byte_width, - FixedSizeBinaryBuilder* builder, PyObject** objects) { +static Status AppendObjectFixedWidthBytes(PyArrayObject* arr, PyArrayObject* mask, + int byte_width, FixedSizeBinaryBuilder* builder) { PyObject* obj; - for (int64_t i = 0; i < objects_length; ++i) { + Ndarray1DIndexer objects(arr); + Ndarray1DIndexer mask_values; + + bool have_mask = false; + if (mask != nullptr) { + mask_values.Init(mask); + have_mask = true; + } + + for (int64_t i = 0; i < objects.size(); ++i) { obj = objects[i]; - if (PyUnicode_Check(obj)) { + if ((have_mask && mask_values[i]) || PyObject_is_null(obj)) { + RETURN_NOT_OK(builder->AppendNull()); + } else if (PyUnicode_Check(obj)) { obj = PyUnicode_AsUTF8String(obj); if (obj == NULL) { PyErr_Clear(); @@ -190,8 +212,6 @@ static Status AppendObjectFixedWidthBytes(int64_t objects_length, int byte_width RETURN_NOT_OK(CheckPythonBytesAreFixedLength(obj, byte_width)); RETURN_NOT_OK( builder->Append(reinterpret_cast(PyBytes_AS_STRING(obj)))); - } else if (PyObject_is_null(obj)) { - RETURN_NOT_OK(builder->AppendNull()); } else { return InvalidConversion(obj, "string or bytes"); } @@ -299,8 +319,7 @@ class PandasConverter : public TypeVisitor { } else if (traits::supports_nulls) { // TODO(wesm): this presumes the NumPy C type and arrow C type are the // same - null_count = ValuesToBitmap( - PyArray_DATA(arr_), length_, null_bitmap_data_); + null_count = ValuesToBitmap(arr_, null_bitmap_data_); } std::vector fields(1); @@ -329,36 +348,33 @@ class PandasConverter : public TypeVisitor { #undef VISIT_NATIVE - Status Convert(std::shared_ptr* out) { + Status Convert() { if (PyArray_NDIM(arr_) != 1) { return Status::Invalid("only handle 1-dimensional arrays"); } - // TODO(wesm): strided arrays - if (is_strided()) { return Status::Invalid("no support for strided data yet"); } if (type_ == nullptr) { return Status::Invalid("Must pass data type"); } // Visit the type to perform conversion RETURN_NOT_OK(type_->Accept(this)); - *out = out_; return Status::OK(); } + std::shared_ptr result() const { return out_; } + // ---------------------------------------------------------------------- // Conversion logic for various object dtype arrays template - Status ConvertTypedLists( - const std::shared_ptr& type, std::shared_ptr* out); + Status ConvertTypedLists(const std::shared_ptr& type); - Status ConvertObjectStrings(std::shared_ptr* out); - Status ConvertObjectFixedWidthBytes( - const std::shared_ptr& type, std::shared_ptr* out); - Status ConvertBooleans(std::shared_ptr* out); - Status 
ConvertDates(std::shared_ptr* out); - Status ConvertLists(const std::shared_ptr& type, std::shared_ptr* out); - Status ConvertObjects(std::shared_ptr* out); + Status ConvertObjectStrings(); + Status ConvertObjectFixedWidthBytes(const std::shared_ptr& type); + Status ConvertBooleans(); + Status ConvertDates(); + Status ConvertLists(const std::shared_ptr& type); + Status ConvertObjects(); protected: MemoryPool* pool_; @@ -374,9 +390,31 @@ class PandasConverter : public TypeVisitor { uint8_t* null_bitmap_data_; }; +template +void CopyStrided(T* input_data, int64_t length, int64_t stride, T* output_data) { + // Passing input_data as non-const is a concession to PyObject* + int64_t j = 0; + for (int64_t i = 0; i < length; ++i) { + output_data[i] = input_data[j]; + j += stride; + } +} + +template <> +void CopyStrided( + PyObject** input_data, int64_t length, int64_t stride, PyObject** output_data) { + int64_t j = 0; + for (int64_t i = 0; i < length; ++i) { + output_data[i] = input_data[j]; + if (output_data[i] != nullptr) { Py_INCREF(output_data[i]); } + j += stride; + } +} + template inline Status PandasConverter::ConvertData(std::shared_ptr* data) { using traits = arrow_traits; + using T = typename traits::T; // Handle LONGLONG->INT64 and other fun things int type_num_compat = cast_npy_type_compat(PyArray_DESCR(arr_)->type_num); @@ -385,7 +423,20 @@ inline Status PandasConverter::ConvertData(std::shared_ptr* data) { return Status::NotImplemented("NumPy type casts not yet implemented"); } - *data = std::make_shared(reinterpret_cast(arr_)); + if (is_strided()) { + // Strided, must copy into new contiguous memory + const int64_t stride = PyArray_STRIDES(arr_)[0]; + const int64_t stride_elements = stride / sizeof(T); + + auto new_buffer = std::make_shared(pool_); + RETURN_NOT_OK(new_buffer->Resize(sizeof(T) * length_)); + CopyStrided(reinterpret_cast(PyArray_DATA(arr_)), length_, stride_elements, + reinterpret_cast(new_buffer->mutable_data())); + *data = new_buffer; + } else { + // Can zero-copy + *data = std::make_shared(reinterpret_cast(arr_)); + } return Status::OK(); } @@ -395,7 +446,7 @@ inline Status PandasConverter::ConvertData(std::shared_ptr* auto buffer = std::make_shared(pool_); RETURN_NOT_OK(buffer->Resize(nbytes)); - const uint8_t* values = reinterpret_cast(PyArray_DATA(arr_)); + Ndarray1DIndexer values(arr_); uint8_t* bitmap = buffer->mutable_data(); @@ -434,13 +485,22 @@ Status InvalidConversion(PyObject* obj, const std::string& expected_type_name) { return Status::TypeError(ss.str()); } -Status PandasConverter::ConvertDates(std::shared_ptr* out) { +Status PandasConverter::ConvertDates() { PyAcquireGIL lock; - PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + Ndarray1DIndexer objects(arr_); + + if (mask_ != nullptr) { + return Status::NotImplemented("mask not supported in object conversions yet"); + } + Date64Builder date_builder(pool_); RETURN_NOT_OK(date_builder.Resize(length_)); + /// We have to run this in this compilation unit, since we cannot use the + /// datetime API otherwise + PyDateTime_IMPORT; + Status s; PyObject* obj; for (int64_t i = 0; i < length_; ++i) { @@ -454,50 +514,57 @@ Status PandasConverter::ConvertDates(std::shared_ptr* out) { return InvalidConversion(obj, "date"); } } - return date_builder.Finish(out); + return date_builder.Finish(&out_); } -Status PandasConverter::ConvertObjectStrings(std::shared_ptr* out) { +Status PandasConverter::ConvertObjectStrings() { PyAcquireGIL lock; // The output type at this point is inconclusive because there may be 
bytes // and unicode mixed in the object array - PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); StringBuilder builder(pool_); RETURN_NOT_OK(builder.Resize(length_)); Status s; bool have_bytes = false; - RETURN_NOT_OK(AppendObjectStrings(length_, &builder, objects, &have_bytes)); - RETURN_NOT_OK(builder.Finish(out)); + RETURN_NOT_OK(AppendObjectStrings(arr_, mask_, &builder, &have_bytes)); + RETURN_NOT_OK(builder.Finish(&out_)); if (have_bytes) { - const auto& arr = static_cast(*out->get()); - *out = std::make_shared(arr.length(), arr.value_offsets(), arr.data(), + const auto& arr = static_cast(*out_); + out_ = std::make_shared(arr.length(), arr.value_offsets(), arr.data(), arr.null_bitmap(), arr.null_count()); } return Status::OK(); } Status PandasConverter::ConvertObjectFixedWidthBytes( - const std::shared_ptr& type, std::shared_ptr* out) { + const std::shared_ptr& type) { PyAcquireGIL lock; - PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + Ndarray1DIndexer objects(arr_); + + int32_t value_size = static_cast(*type).byte_width(); + FixedSizeBinaryBuilder builder(pool_, type); RETURN_NOT_OK(builder.Resize(length_)); - RETURN_NOT_OK(AppendObjectFixedWidthBytes(length_, - std::dynamic_pointer_cast(builder.type())->byte_width(), - &builder, objects)); - RETURN_NOT_OK(builder.Finish(out)); + RETURN_NOT_OK(AppendObjectFixedWidthBytes(arr_, mask_, value_size, &builder)); + RETURN_NOT_OK(builder.Finish(&out_)); return Status::OK(); } -Status PandasConverter::ConvertBooleans(std::shared_ptr* out) { +Status PandasConverter::ConvertBooleans() { PyAcquireGIL lock; - PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + Ndarray1DIndexer objects(arr_); + Ndarray1DIndexer mask_values; + + bool have_mask = false; + if (mask_ != nullptr) { + mask_values.Init(mask_); + have_mask = true; + } int64_t nbytes = BitUtil::BytesForBits(length_); auto data = std::make_shared(pool_); @@ -509,24 +576,24 @@ Status PandasConverter::ConvertBooleans(std::shared_ptr* out) { PyObject* obj; for (int64_t i = 0; i < length_; ++i) { obj = objects[i]; - if (obj == Py_True) { + if ((have_mask && mask_values[i]) || PyObject_is_null(obj)) { + ++null_count; + } else if (obj == Py_True) { BitUtil::SetBit(bitmap, i); BitUtil::SetBit(null_bitmap_data_, i); } else if (obj == Py_False) { BitUtil::SetBit(null_bitmap_data_, i); - } else if (PyObject_is_null(obj)) { - ++null_count; } else { return InvalidConversion(obj, "bool"); } } - *out = std::make_shared(length_, data, null_bitmap_, null_count); + out_ = std::make_shared(length_, data, null_bitmap_, null_count); return Status::OK(); } -Status PandasConverter::ConvertObjects(std::shared_ptr* out) { +Status PandasConverter::ConvertObjects() { // Python object arrays are annoying, since we could have one of: // // * Strings @@ -538,31 +605,27 @@ Status PandasConverter::ConvertObjects(std::shared_ptr* out) { RETURN_NOT_OK(InitNullBitmap()); - // TODO: mask not supported here - if (mask_ != nullptr) { - return Status::NotImplemented("mask not supported in object conversions yet"); - } + Ndarray1DIndexer objects; - const PyObject** objects; { PyAcquireGIL lock; - objects = reinterpret_cast(PyArray_DATA(arr_)); + objects.Init(arr_); PyDateTime_IMPORT; } if (type_) { switch (type_->type) { case Type::STRING: - return ConvertObjectStrings(out); + return ConvertObjectStrings(); case Type::FIXED_SIZE_BINARY: - return ConvertObjectFixedWidthBytes(type_, out); + return ConvertObjectFixedWidthBytes(type_); case Type::BOOL: - return ConvertBooleans(out); + return 
ConvertBooleans(); case Type::DATE64: - return ConvertDates(out); + return ConvertDates(); case Type::LIST: { const auto& list_field = static_cast(*type_); - return ConvertLists(list_field.value_field()->type, out); + return ConvertLists(list_field.value_field()->type); } default: return Status::TypeError("No known conversion to Arrow type"); @@ -572,11 +635,11 @@ Status PandasConverter::ConvertObjects(std::shared_ptr* out) { if (PyObject_is_null(objects[i])) { continue; } else if (PyObject_is_string(objects[i])) { - return ConvertObjectStrings(out); + return ConvertObjectStrings(); } else if (PyBool_Check(objects[i])) { - return ConvertBooleans(out); + return ConvertBooleans(); } else if (PyDate_CheckExact(objects[i])) { - return ConvertDates(out); + return ConvertDates(); } else { return InvalidConversion( const_cast(objects[i]), "string, bool, or date"); @@ -588,14 +651,22 @@ Status PandasConverter::ConvertObjects(std::shared_ptr* out) { } template -inline Status PandasConverter::ConvertTypedLists( - const std::shared_ptr& type, std::shared_ptr* out) { +inline Status PandasConverter::ConvertTypedLists(const std::shared_ptr& type) { typedef npy_traits traits; typedef typename traits::value_type T; typedef typename traits::BuilderClass BuilderT; PyAcquireGIL lock; + // TODO: mask not supported here + if (mask_ != nullptr) { + return Status::NotImplemented("mask not supported in object conversions yet"); + } + + if (is_strided()) { + return Status::NotImplemented("strided arrays not implemented for lists"); + } + auto value_builder = std::make_shared(pool_, type); ListBuilder list_builder(pool_, value_builder); PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); @@ -637,16 +708,25 @@ inline Status PandasConverter::ConvertTypedLists( return Status::TypeError("Unsupported Python type for list items"); } } - return list_builder.Finish(out); + return list_builder.Finish(&out_); } template <> inline Status PandasConverter::ConvertTypedLists( - const std::shared_ptr& type, std::shared_ptr* out) { + const std::shared_ptr& type) { PyAcquireGIL lock; // TODO: If there are bytes involed, convert to Binary representation bool have_bytes = false; + // TODO: mask not supported here + if (mask_ != nullptr) { + return Status::NotImplemented("mask not supported in object conversions yet"); + } + + if (is_strided()) { + return Status::NotImplemented("strided arrays not implemented for lists"); + } + auto value_builder = std::make_shared(pool_); ListBuilder list_builder(pool_, value_builder); PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); @@ -660,9 +740,8 @@ inline Status PandasConverter::ConvertTypedLists( // TODO(uwe): Support more complex numpy array structures RETURN_NOT_OK(CheckFlatNumpyArray(numpy_array, NPY_OBJECT)); - int64_t size = static_cast(PyArray_DIM(numpy_array, 0)); - auto data = reinterpret_cast(PyArray_DATA(numpy_array)); - RETURN_NOT_OK(AppendObjectStrings(size, value_builder.get(), data, &have_bytes)); + RETURN_NOT_OK( + AppendObjectStrings(numpy_array, nullptr, value_builder.get(), &have_bytes)); } else if (PyList_Check(objects[i])) { int64_t size; std::shared_ptr inferred_type; @@ -678,16 +757,15 @@ inline Status PandasConverter::ConvertTypedLists( return Status::TypeError("Unsupported Python type for list items"); } } - return list_builder.Finish(out); + return list_builder.Finish(&out_); } -#define LIST_CASE(TYPE, NUMPY_TYPE, ArrowType) \ - case Type::TYPE: { \ - return ConvertTypedLists(type, out); \ +#define LIST_CASE(TYPE, NUMPY_TYPE, ArrowType) \ + case 
Type::TYPE: { \ + return ConvertTypedLists(type); \ } -Status PandasConverter::ConvertLists( - const std::shared_ptr& type, std::shared_ptr* out) { +Status PandasConverter::ConvertLists(const std::shared_ptr& type) { switch (type->type) { LIST_CASE(UINT8, NPY_UINT8, UInt8Type) LIST_CASE(INT8, NPY_INT8, Int8Type) @@ -711,13 +789,17 @@ Status PandasConverter::ConvertLists( Status PandasToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo, const std::shared_ptr& type, std::shared_ptr* out) { PandasConverter converter(pool, ao, mo, type); - return converter.Convert(out); + RETURN_NOT_OK(converter.Convert()); + *out = converter.result(); + return Status::OK(); } Status PandasObjectsToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo, const std::shared_ptr& type, std::shared_ptr* out) { PandasConverter converter(pool, ao, mo, type); - return converter.ConvertObjects(out); + RETURN_NOT_OK(converter.ConvertObjects()); + *out = converter.result(); + return Status::OK(); } // ---------------------------------------------------------------------- diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index e7c456d80a41f..67785e34075f4 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -82,8 +82,8 @@ cdef class Array: @staticmethod def from_numpy(obj, mask=None, DataType type=None, - timestamps_to_ms=False, - MemoryPool memory_pool=None): + timestamps_to_ms=False, + MemoryPool memory_pool=None): """ Convert pandas.Series to an Arrow Array. diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index 0b3c02e9945eb..56830a88f2ec2 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -75,16 +75,25 @@ def _check_pandas_roundtrip(self, df, expected=None, nthreads=1, expected = df tm.assert_frame_equal(result, expected, check_dtype=check_dtype) - def _check_array_roundtrip(self, values, expected=None, + def _check_array_roundtrip(self, values, expected=None, mask=None, timestamps_to_ms=False, type=None): arr = A.Array.from_numpy(values, timestamps_to_ms=timestamps_to_ms, - type=type) + mask=mask, type=type) result = arr.to_pandas() - assert arr.null_count == pd.isnull(values).sum() + values_nulls = pd.isnull(values) + if mask is None: + assert arr.null_count == values_nulls.sum() + else: + assert arr.null_count == (mask | values_nulls).sum() - tm.assert_series_equal(pd.Series(result), pd.Series(values), - check_names=False) + if mask is None: + tm.assert_series_equal(pd.Series(result), pd.Series(values), + check_names=False) + else: + expected = pd.Series(np.ma.masked_array(values, mask=mask)) + tm.assert_series_equal(pd.Series(result), expected, + check_names=False) def test_float_no_nulls(self): data = {} @@ -402,3 +411,43 @@ def test_mixed_types_fails(self): data = pd.DataFrame({'a': ['a', 1, 2.0]}) with self.assertRaises(A.error.ArrowException): A.Table.from_pandas(data) + + def test_strided_data_import(self): + cases = [] + + columns = ['a', 'b', 'c'] + N, K = 100, 3 + random_numbers = np.random.randn(N, K).copy() * 100 + + numeric_dtypes = ['i1', 'i2', 'i4', 'i8', 'u1', 'u2', 'u4', 'u8', + 'f4', 'f8'] + + for type_name in numeric_dtypes: + cases.append(random_numbers.astype(type_name)) + + # strings + cases.append(np.array([tm.rands(10) for i in range(N * K)], + dtype=object) + .reshape(N, K).copy()) + + # booleans + boolean_objects = (np.array([True, False, True] * N, dtype=object) + .reshape(N, K).copy()) + + # add some nulls, so dtype comes back as objects + 
boolean_objects[5] = None + cases.append(boolean_objects) + + cases.append(np.arange("2016-01-01T00:00:00.001", N * K, + dtype='datetime64[ms]') + .reshape(N, K).copy()) + + strided_mask = (random_numbers > 0).astype(bool)[:, 0] + + for case in cases: + df = pd.DataFrame(case, columns=columns) + col = df['a'] + + self._check_pandas_roundtrip(df) + self._check_array_roundtrip(col) + self._check_array_roundtrip(col, mask=strided_mask) From d0cd03d78547b12aaeb5e50d8c52ace60a973d4e Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Mon, 3 Apr 2017 18:08:06 -0400 Subject: [PATCH 0462/1644] ARROW-763: C++: Use to find libpythonX.X.dylib Author: Uwe L. Korn Closes #485 from xhochy/ARROW-763 and squashes the following commits: d5a475f [Uwe L. Korn] ARROW-763: C++: Use to find libpythonX.X.dylib --- cpp/cmake_modules/FindPythonLibsNew.cmake | 18 ++++++++++++++---- 1 file changed, 14 insertions(+), 4 deletions(-) diff --git a/cpp/cmake_modules/FindPythonLibsNew.cmake b/cpp/cmake_modules/FindPythonLibsNew.cmake index 3e248a93342c5..dfe5661b015b5 100644 --- a/cpp/cmake_modules/FindPythonLibsNew.cmake +++ b/cpp/cmake_modules/FindPythonLibsNew.cmake @@ -148,10 +148,20 @@ if(CMAKE_HOST_WIN32) set(PYTHON_LIBRARY "${PYTHON_PREFIX}/libs/libpython${PYTHON_LIBRARY_SUFFIX}.a") endif() elseif(APPLE) - # In Python C extensions on OS X, the flag "-undefined dynamic_lookup" can - # avoid certain kinds of dynamic linking issues with portable binaries, so - # you should avoid targeting libpython at link time if at all possible - set(PYTHON_LIBRARY "${PYTHON_PREFIX}/lib/libpython${PYTHON_LIBRARY_SUFFIX}.dylib") + # In some cases libpythonX.X.dylib is not part of the PYTHON_PREFIX and we + # need to call `python-config --prefix` to determine the correct location. + + find_program(PYTHON_CONFIG python-config + NO_CMAKE_SYSTEM_PATH) + if (PYTHON_CONFIG) + execute_process( + COMMAND "${PYTHON_CONFIG}" "--prefix" + OUTPUT_VARIABLE PYTHON_CONFIG_PREFIX + OUTPUT_STRIP_TRAILING_WHITESPACE) + set(PYTHON_LIBRARY "${PYTHON_CONFIG_PREFIX}/lib/libpython${PYTHON_LIBRARY_SUFFIX}.dylib") + else() + set(PYTHON_LIBRARY "${PYTHON_PREFIX}/lib/libpython${PYTHON_LIBRARY_SUFFIX}.dylib") + endif() else() if(${PYTHON_SIZEOF_VOID_P} MATCHES 8) set(_PYTHON_LIBS_SEARCH "${PYTHON_PREFIX}/lib64" "${PYTHON_PREFIX}/lib" "${PYTHON_LIBRARY_PATH}") From d560e307749a2397810962db1a5af4fb65675f17 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 4 Apr 2017 08:40:40 +0200 Subject: [PATCH 0463/1644] ARROW-656: [C++] Add random access writer for a mutable buffer. Rename WriteableFileInterface to WriteableFile for better consistency Author: Wes McKinney Closes #486 from wesm/ARROW-656 and squashes the following commits: be0d4bc [Wes McKinney] Fix glib after renaming class 042f533 [Wes McKinney] Add random access writer for a mutable buffer. 
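The diff that follows adds this writer on the C++ side only. As a conceptual analogue, not the Arrow API, the semantics are a seekable, writeable file over a fixed-size mutable buffer, with a bounds-checked Seek and WriteAt implemented as Seek followed by Write. A runnable Python sketch of that behavior:

    # Conceptual analogue of arrow::io::FixedSizeBufferWriter; purely
    # illustrative, the real API is the C++ class in the diff below.
    class FixedBufferWriter:
        def __init__(self, buf):
            self.buf = memoryview(buf)   # fixed-size, mutable region
            self.pos = 0

        def seek(self, position):
            # Mirrors the C++ Seek: out-of-bounds positions are an IOError.
            if not 0 <= position < len(self.buf):
                raise IOError('position out of bounds')
            self.pos = position

        def write(self, data):
            self.buf[self.pos:self.pos + len(data)] = data
            self.pos += len(data)

        def write_at(self, position, data):
            # WriteAt is Seek then Write, as in the C++ implementation.
            self.seek(position)
            self.write(data)

    buf = bytearray(16)
    w = FixedBufferWriter(buf)
    w.write(b'data123456')               # sequential write from position 0
    w.write_at(0, b'head')               # random-access overwrite
    assert bytes(buf[:10]) == b'head123456'
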
Rename WriteableFileInterface to WriteableFile for better consistency --- c_glib/arrow-glib/io-memory-mapped-file.cpp | 2 +- c_glib/arrow-glib/io-writeable-file.cpp | 2 +- c_glib/arrow-glib/io-writeable-file.h | 2 +- c_glib/arrow-glib/io-writeable-file.hpp | 8 ++-- cpp/src/arrow/io/interfaces.h | 6 +-- cpp/src/arrow/io/io-memory-test.cc | 27 +++++++++++++ cpp/src/arrow/io/memory.cc | 45 +++++++++++++++++++++ cpp/src/arrow/io/memory.h | 23 +++++++++++ python/pyarrow/includes/libarrow.pxd | 4 +- 9 files changed, 107 insertions(+), 12 deletions(-) diff --git a/c_glib/arrow-glib/io-memory-mapped-file.cpp b/c_glib/arrow-glib/io-memory-mapped-file.cpp index 12c9a6c95ac12..e2e255c039109 100644 --- a/c_glib/arrow-glib/io-memory-mapped-file.cpp +++ b/c_glib/arrow-glib/io-memory-mapped-file.cpp @@ -127,7 +127,7 @@ garrow_io_writeable_interface_init(GArrowIOWriteableInterface *iface) iface->get_raw = garrow_io_memory_mapped_file_get_raw_writeable_interface; } -static std::shared_ptr +static std::shared_ptr garrow_io_memory_mapped_file_get_raw_writeable_file_interface(GArrowIOWriteableFile *file) { auto memory_mapped_file = GARROW_IO_MEMORY_MAPPED_FILE(file); diff --git a/c_glib/arrow-glib/io-writeable-file.cpp b/c_glib/arrow-glib/io-writeable-file.cpp index 3de42dd60a971..41b682acd1e26 100644 --- a/c_glib/arrow-glib/io-writeable-file.cpp +++ b/c_glib/arrow-glib/io-writeable-file.cpp @@ -76,7 +76,7 @@ garrow_io_writeable_file_write_at(GArrowIOWriteableFile *writeable_file, G_END_DECLS -std::shared_ptr +std::shared_ptr garrow_io_writeable_file_get_raw(GArrowIOWriteableFile *writeable_file) { auto *iface = GARROW_IO_WRITEABLE_FILE_GET_IFACE(writeable_file); diff --git a/c_glib/arrow-glib/io-writeable-file.h b/c_glib/arrow-glib/io-writeable-file.h index 4a4dee5111f5f..d1ebdbe630ef2 100644 --- a/c_glib/arrow-glib/io-writeable-file.h +++ b/c_glib/arrow-glib/io-writeable-file.h @@ -28,7 +28,7 @@ G_BEGIN_DECLS #define GARROW_IO_WRITEABLE_FILE(obj) \ (G_TYPE_CHECK_INSTANCE_CAST((obj), \ GARROW_IO_TYPE_WRITEABLE_FILE, \ - GArrowIOWriteableFileInterface)) + GArrowIOWriteableFile)) #define GARROW_IO_IS_WRITEABLE_FILE(obj) \ (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ GARROW_IO_TYPE_WRITEABLE_FILE)) diff --git a/c_glib/arrow-glib/io-writeable-file.hpp b/c_glib/arrow-glib/io-writeable-file.hpp index 2043007ad58e3..aba95b209d827 100644 --- a/c_glib/arrow-glib/io-writeable-file.hpp +++ b/c_glib/arrow-glib/io-writeable-file.hpp @@ -24,15 +24,15 @@ #include /** - * GArrowIOWriteableFileInterface: + * GArrowIOWriteableFile: * - * It wraps `arrow::io::WriteableFileInterface`. + * It wraps `arrow::io::WriteableFile`. 
*/ struct _GArrowIOWriteableFileInterface { GTypeInterface parent_iface; - std::shared_ptr (*get_raw)(GArrowIOWriteableFile *file); + std::shared_ptr (*get_raw)(GArrowIOWriteableFile *file); }; -std::shared_ptr garrow_io_writeable_file_get_raw(GArrowIOWriteableFile *writeable_file); +std::shared_ptr garrow_io_writeable_file_get_raw(GArrowIOWriteableFile *writeable_file); diff --git a/cpp/src/arrow/io/interfaces.h b/cpp/src/arrow/io/interfaces.h index 258a3155743bf..b5a0bd85bf27b 100644 --- a/cpp/src/arrow/io/interfaces.h +++ b/cpp/src/arrow/io/interfaces.h @@ -121,16 +121,16 @@ class ARROW_EXPORT RandomAccessFile : public InputStream, public Seekable { RandomAccessFile(); }; -class ARROW_EXPORT WriteableFileInterface : public OutputStream, public Seekable { +class ARROW_EXPORT WriteableFile : public OutputStream, public Seekable { public: virtual Status WriteAt(int64_t position, const uint8_t* data, int64_t nbytes) = 0; protected: - WriteableFileInterface() { set_mode(FileMode::READ); } + WriteableFile() { set_mode(FileMode::READ); } }; class ARROW_EXPORT ReadWriteFileInterface : public RandomAccessFile, - public WriteableFileInterface { + public WriteableFile { protected: ReadWriteFileInterface() { RandomAccessFile::set_mode(FileMode::READWRITE); } }; diff --git a/cpp/src/arrow/io/io-memory-test.cc b/cpp/src/arrow/io/io-memory-test.cc index 442cd0c4bbccd..4704fe8f4d391 100644 --- a/cpp/src/arrow/io/io-memory-test.cc +++ b/cpp/src/arrow/io/io-memory-test.cc @@ -66,6 +66,33 @@ TEST_F(TestBufferOutputStream, CloseResizes) { ASSERT_EQ(static_cast(K * data.size()), buffer_->size()); } +TEST(TestFixedSizeBufferWriter, Basics) { + std::shared_ptr buffer; + ASSERT_OK(AllocateBuffer(default_memory_pool(), 1024, &buffer)); + + FixedSizeBufferWriter writer(buffer); + + int64_t position; + ASSERT_OK(writer.Tell(&position)); + ASSERT_EQ(0, position); + + std::string data = "data123456"; + auto nbytes = static_cast(data.size()); + ASSERT_OK(writer.Write(reinterpret_cast(data.c_str()), nbytes)); + + ASSERT_OK(writer.Tell(&position)); + ASSERT_EQ(nbytes, position); + + ASSERT_OK(writer.Seek(4)); + ASSERT_OK(writer.Tell(&position)); + ASSERT_EQ(4, position); + + ASSERT_RAISES(IOError, writer.Seek(-1)); + ASSERT_RAISES(IOError, writer.Seek(1024)); + + ASSERT_OK(writer.Close()); +} + TEST(TestBufferReader, RetainParentReference) { // ARROW-387 std::string data = "data123456"; diff --git a/cpp/src/arrow/io/memory.cc b/cpp/src/arrow/io/memory.cc index 5b5c8649deec4..2e701e1104d1c 100644 --- a/cpp/src/arrow/io/memory.cc +++ b/cpp/src/arrow/io/memory.cc @@ -98,6 +98,51 @@ Status BufferOutputStream::Reserve(int64_t nbytes) { return Status::OK(); } +// ---------------------------------------------------------------------- +// In-memory buffer writer + +/// Input buffer must be mutable, will abort if not +FixedSizeBufferWriter::FixedSizeBufferWriter(const std::shared_ptr& buffer) { + buffer_ = buffer; + DCHECK(buffer->is_mutable()) << "Must pass mutable buffer"; + mutable_data_ = buffer->mutable_data(); + size_ = buffer->size(); + position_ = 0; +} + +FixedSizeBufferWriter::~FixedSizeBufferWriter() {} + +Status FixedSizeBufferWriter::Close() { + // No-op + return Status::OK(); +} + +Status FixedSizeBufferWriter::Seek(int64_t position) { + if (position < 0 || position >= size_) { + return Status::IOError("position out of bounds"); + } + position_ = position; + return Status::OK(); +} + +Status FixedSizeBufferWriter::Tell(int64_t* position) { + *position = position_; + return Status::OK(); +} + +Status 
FixedSizeBufferWriter::Write(const uint8_t* data, int64_t nbytes) { + std::memcpy(mutable_data_ + position_, data, nbytes); + position_ += nbytes; + return Status::OK(); +} + +Status FixedSizeBufferWriter::WriteAt( + int64_t position, const uint8_t* data, int64_t nbytes) { + std::lock_guard guard(lock_); + RETURN_NOT_OK(Seek(position)); + return Write(data, nbytes); +} + // ---------------------------------------------------------------------- // In-memory buffer reader diff --git a/cpp/src/arrow/io/memory.h b/cpp/src/arrow/io/memory.h index eb2a50912889e..fbb186b728022 100644 --- a/cpp/src/arrow/io/memory.h +++ b/cpp/src/arrow/io/memory.h @@ -22,6 +22,7 @@ #include #include +#include #include #include "arrow/io/interfaces.h" @@ -66,6 +67,28 @@ class ARROW_EXPORT BufferOutputStream : public OutputStream { uint8_t* mutable_data_; }; +/// \brief Enables random writes into a fixed-size mutable buffer +/// +class ARROW_EXPORT FixedSizeBufferWriter : public WriteableFile { + public: + /// Input buffer must be mutable, will abort if not + explicit FixedSizeBufferWriter(const std::shared_ptr& buffer); + ~FixedSizeBufferWriter(); + + Status Close() override; + Status Seek(int64_t position) override; + Status Tell(int64_t* position) override; + Status Write(const uint8_t* data, int64_t nbytes) override; + Status WriteAt(int64_t position, const uint8_t* data, int64_t nbytes) override; + + private: + std::mutex lock_; + std::shared_ptr buffer_; + uint8_t* mutable_data_; + int64_t size_; + int64_t position_; +}; + class ARROW_EXPORT BufferReader : public RandomAccessFile { public: explicit BufferReader(const std::shared_ptr& buffer); diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 67d6af910c2b9..2a0488f3a0139 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -342,12 +342,12 @@ cdef extern from "arrow/io/interfaces.h" namespace "arrow::io" nogil: CStatus ReadAt(int64_t position, int64_t nbytes, int64_t* bytes_read, shared_ptr[CBuffer]* out) - cdef cppclass WriteableFileInterface(OutputStream, Seekable): + cdef cppclass WriteableFile(OutputStream, Seekable): CStatus WriteAt(int64_t position, const uint8_t* data, int64_t nbytes) cdef cppclass ReadWriteFileInterface(RandomAccessFile, - WriteableFileInterface): + WriteableFile): pass From ec6188efcc884e46481fe986605e3cbfc33c7e07 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Tue, 4 Apr 2017 18:24:07 +0200 Subject: [PATCH 0464/1644] ARROW-769: [GLib] Support building without installed Arrow C++ It doesn't require "make install"-ed Arrow C++ to build Arrow GLib. But it requires "make"-ed Arrow C++. This is useful to build packages. 
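For packagers, the two new options in the configure.ac diff below point the GLib build at an in-tree Arrow C++ build instead of an installed one; the build directory must contain a subdirectory named after the build type. The paths here are illustrative, and --with-arrow-cpp-build-type defaults to Release.

    # After building Arrow C++ in-tree, configure the GLib bindings
    # against that build directory (no "make install" required).
    $ ./configure \
        --with-arrow-cpp-build-dir=$HOME/arrow/cpp/build \
        --with-arrow-cpp-build-type=Release
    $ make
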
Author: Kouhei Sutou Closes #490 from kou/glib-support-build-without-installed-arrow-cpp and squashes the following commits: 352999b [Kouhei Sutou] [GLib] Support building without installed Arrow C++ --- c_glib/configure.ac | 39 ++++++++++++++++++++++++++++++++++++--- 1 file changed, 36 insertions(+), 3 deletions(-) diff --git a/c_glib/configure.ac b/c_glib/configure.ac index c6913437d93f8..fc24c1b3c4778 100644 --- a/c_glib/configure.ac +++ b/c_glib/configure.ac @@ -61,9 +61,42 @@ AM_PATH_GLIB_2_0([2.32.4], [], [], [gobject]) GOBJECT_INTROSPECTION_REQUIRE([1.32.1]) GTK_DOC_CHECK([1.18-2]) -PKG_CHECK_MODULES([ARROW], [arrow]) -PKG_CHECK_MODULES([ARROW_IO], [arrow-io]) -PKG_CHECK_MODULES([ARROW_IPC], [arrow-ipc]) +AC_ARG_WITH(arrow-cpp-build-type, + [AS_HELP_STRING([--with-arrow-cpp-build-type=TYPE], + [-DCMAKE_BUILD_TYPE option value for Arrow C++ (default=Release)])], + [GARROW_ARROW_CPP_BUILD_TYPE="$withval"], + [GARROW_ARROW_CPP_BUILD_TYPE="Release"]) + +AC_ARG_WITH(arrow-cpp-build-dir, + [AS_HELP_STRING([--with-arrow-cpp-build-dir=PATH], + [Use this option to build with not installed Arrow C++])], + [GARROW_ARROW_CPP_BUILD_DIR="$withval"], + [GARROW_ARROW_CPP_BUILD_DIR=""]) +if test "x$GARROW_ARROW_CPP_BUILD_DIR" = "x"; then + PKG_CHECK_MODULES([ARROW], [arrow]) + PKG_CHECK_MODULES([ARROW_IO], [arrow-io]) + PKG_CHECK_MODULES([ARROW_IPC], [arrow-ipc]) +else + ARROW_INCLUDE_DIR="\$(abs_top_srcdir)/../cpp/src" + ARROW_LIB_DIR="${GARROW_ARROW_CPP_BUILD_DIR}/${GARROW_ARROW_CPP_BUILD_TYPE}" + + ARROW_CFLAGS="-I${ARROW_INCLUDE_DIR}" + ARROW_IO_CFLAGS="-I${ARROW_INCLUDE_DIR}" + ARROW_IPC_CFLAGS="-I${ARROW_INCLUDE_DIR}" + ARROW_LIBS="-L${ARROW_LIB_DIR} -larrow" + ARROW_IO_LIBS="-L${ARROW_LIB_DIR} -larrow_io" + ARROW_IPC_LIBS="-L${ARROW_LIB_DIR} -larrow_ipc" + + AC_SUBST(ARROW_LIB_DIR) + + AC_SUBST(ARROW_CFLAGS) + AC_SUBST(ARROW_IO_CFLAGS) + AC_SUBST(ARROW_IPC_CFLAGS) + AC_SUBST(ARROW_LIBS) + AC_SUBST(ARROW_IO_LIBS) + AC_SUBST(ARROW_IPC_LIBS) +fi + AC_CONFIG_FILES([ Makefile From 2aed7845fbc9e3d91ab9d16965ee9f6f3abc668b Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 4 Apr 2017 13:18:16 -0400 Subject: [PATCH 0465/1644] ARROW-765: [Python] Add more natural Exception type hierarchy for thirdparty users I also took the liberty of changing a number of error types in libarrow_python Author: Wes McKinney Closes #489 from wesm/ARROW-765 and squashes the following commits: 74c43df [Wes McKinney] Make a nicer Exception hierachy, with more intuitive bases for thirdparty users 2a58a1b [Wes McKinney] Add a nicer exception hierarchy. 
Unknown errors return as ValueError --- cpp/src/arrow/python/builtin_convert.cc | 8 +- cpp/src/arrow/python/pandas_convert.cc | 6 +- cpp/src/arrow/status.h | 2 +- python/pyarrow/__init__.py | 8 +- python/pyarrow/error.pyx | 43 ++++++++++- python/pyarrow/includes/common.pxd | 4 +- python/pyarrow/tests/test_convert_builtin.py | 78 ++++++++++---------- python/pyarrow/tests/test_convert_pandas.py | 4 +- python/pyarrow/tests/test_feather.py | 2 +- 9 files changed, 101 insertions(+), 54 deletions(-) diff --git a/cpp/src/arrow/python/builtin_convert.cc b/cpp/src/arrow/python/builtin_convert.cc index 6a13fdccdeaff..25b32ee26a06b 100644 --- a/cpp/src/arrow/python/builtin_convert.cc +++ b/cpp/src/arrow/python/builtin_convert.cc @@ -394,7 +394,7 @@ class BytesConverter : public TypedConverter { } else if (PyBytes_Check(item)) { bytes_obj = item; } else { - return Status::TypeError( + return Status::Invalid( "Value that cannot be converted to bytes was encountered"); } // No error checking @@ -429,7 +429,7 @@ class FixedWidthBytesConverter : public TypedConverter { } else if (PyBytes_Check(item)) { bytes_obj = item; } else { - return Status::TypeError( + return Status::Invalid( "Value that cannot be converted to bytes was encountered"); } // No error checking @@ -458,7 +458,7 @@ class UTF8Converter : public TypedConverter { RETURN_NOT_OK(typed_builder_->AppendNull()); continue; } else if (!PyUnicode_Check(item)) { - return Status::TypeError("Non-unicode value encountered"); + return Status::Invalid("Non-unicode value encountered"); } tmp.reset(PyUnicode_AsUTF8String(item)); RETURN_IF_PYERROR(); @@ -585,7 +585,7 @@ Status CheckPythonBytesAreFixedLength(PyObject* obj, Py_ssize_t expected_length) std::stringstream ss; ss << "Found byte string of length " << length << ", expected length is " << expected_length; - return Status::TypeError(ss.str()); + return Status::Invalid(ss.str()); } return Status::OK(); } diff --git a/cpp/src/arrow/python/pandas_convert.cc b/cpp/src/arrow/python/pandas_convert.cc index 9577892a55b76..48d3489bf900b 100644 --- a/cpp/src/arrow/python/pandas_convert.cc +++ b/cpp/src/arrow/python/pandas_convert.cc @@ -161,7 +161,7 @@ static Status AppendObjectStrings( obj = PyUnicode_AsUTF8String(obj); if (obj == NULL) { PyErr_Clear(); - return Status::TypeError("failed converting unicode to UTF8"); + return Status::Invalid("failed converting unicode to UTF8"); } const int32_t length = static_cast(PyBytes_GET_SIZE(obj)); Status s = builder->Append(PyBytes_AS_STRING(obj), length); @@ -200,7 +200,7 @@ static Status AppendObjectFixedWidthBytes(PyArrayObject* arr, PyArrayObject* mas obj = PyUnicode_AsUTF8String(obj); if (obj == NULL) { PyErr_Clear(); - return Status::TypeError("failed converting unicode to UTF8"); + return Status::Invalid("failed converting unicode to UTF8"); } RETURN_NOT_OK(CheckPythonBytesAreFixedLength(obj, byte_width)); @@ -482,7 +482,7 @@ Status InvalidConversion(PyObject* obj, const std::string& expected_type_name) { std::stringstream ss; ss << "Python object of type " << cpp_type_name << " is not None and is not a " << expected_type_name << " object"; - return Status::TypeError(ss.str()); + return Status::Invalid(ss.str()); } Status PandasConverter::ConvertDates() { diff --git a/cpp/src/arrow/status.h b/cpp/src/arrow/status.h index 05f5b749b60cb..dd65b753fef31 100644 --- a/cpp/src/arrow/status.h +++ b/cpp/src/arrow/status.h @@ -134,7 +134,7 @@ class ARROW_EXPORT Status { bool IsKeyError() const { return code() == StatusCode::KeyError; } bool IsInvalid() const { return 
code() == StatusCode::Invalid; } bool IsIOError() const { return code() == StatusCode::IOError; } - + bool IsTypeError() const { return code() == StatusCode::TypeError; } bool IsUnknownError() const { return code() == StatusCode::UnknownError; } bool IsNotImplemented() const { return code() == StatusCode::NotImplemented; } diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 6860f986fb6e8..8c520748cf316 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -38,7 +38,13 @@ ListArray, StringArray, DictionaryArray) -from pyarrow.error import ArrowException +from pyarrow.error import (ArrowException, + ArrowKeyError, + ArrowInvalid, + ArrowIOError, + ArrowMemoryError, + ArrowNotImplementedError, + ArrowTypeError) from pyarrow.filesystem import Filesystem, HdfsClient, LocalFilesystem from pyarrow.io import (HdfsFile, NativeFile, PythonFileInterface, diff --git a/python/pyarrow/error.pyx b/python/pyarrow/error.pyx index b8a82b3754c1b..259aeb074e3c2 100644 --- a/python/pyarrow/error.pyx +++ b/python/pyarrow/error.pyx @@ -19,13 +19,52 @@ from pyarrow.includes.libarrow cimport CStatus from pyarrow.includes.common cimport c_string from pyarrow.compat import frombytes + class ArrowException(Exception): pass + +class ArrowInvalid(ValueError, ArrowException): + pass + + +class ArrowMemoryError(MemoryError, ArrowException): + pass + + +class ArrowIOError(IOError, ArrowException): + pass + + +class ArrowKeyError(KeyError, ArrowException): + pass + + +class ArrowTypeError(TypeError, ArrowException): + pass + + +class ArrowNotImplementedError(NotImplementedError, ArrowException): + pass + + cdef int check_status(const CStatus& status) nogil except -1: if status.ok(): return 0 - cdef c_string c_message = status.ToString() with gil: - raise ArrowException(frombytes(c_message)) + message = frombytes(status.ToString()) + if status.IsInvalid(): + raise ArrowInvalid(message) + elif status.IsIOError(): + raise ArrowIOError(message) + elif status.IsOutOfMemory(): + raise ArrowMemoryError(message) + elif status.IsKeyError(): + raise ArrowKeyError(message) + elif status.IsNotImplemented(): + raise ArrowNotImplementedError(message) + elif status.IsTypeError(): + raise ArrowTypeError(message) + else: + raise ArrowException(message) diff --git a/python/pyarrow/includes/common.pxd b/python/pyarrow/includes/common.pxd index f689bdc3fd819..ab38ff3084f01 100644 --- a/python/pyarrow/includes/common.pxd +++ b/python/pyarrow/includes/common.pxd @@ -43,10 +43,12 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: c_string ToString() c_bool ok() + c_bool IsIOError() c_bool IsOutOfMemory() + c_bool IsInvalid() c_bool IsKeyError() c_bool IsNotImplemented() - c_bool IsInvalid() + c_bool IsTypeError() cdef inline object PyObject_to_object(PyObject* o): diff --git a/python/pyarrow/tests/test_convert_builtin.py b/python/pyarrow/tests/test_convert_builtin.py index 15fca560c6513..e2b03d85ecd50 100644 --- a/python/pyarrow/tests/test_convert_builtin.py +++ b/python/pyarrow/tests/test_convert_builtin.py @@ -17,7 +17,7 @@ # under the License. 
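# [Editor's illustration -- not part of this patch] A minimal sketch of the
# new hierarchy in use: ArrowInvalid derives from both ValueError and
# ArrowException (see error.pyx above), so callers can catch at either level.
# The data literal is reused from the fixed-size-bytes test below.
import pyarrow as pa

data = [b'foo', None, b'barb', b'2346']
try:
    pa.from_pylist(data, type=pa.binary(4))   # raises ArrowInvalid
except ValueError:
    pass   # caught via the ValueError base; `except pa.ArrowException` works too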
from pyarrow.compat import unittest, u # noqa -import pyarrow +import pyarrow as pa import datetime @@ -26,32 +26,32 @@ class TestConvertList(unittest.TestCase): def test_boolean(self): expected = [True, None, False, None] - arr = pyarrow.from_pylist(expected) + arr = pa.from_pylist(expected) assert len(arr) == 4 assert arr.null_count == 2 - assert arr.type == pyarrow.bool_() + assert arr.type == pa.bool_() assert arr.to_pylist() == expected def test_empty_list(self): - arr = pyarrow.from_pylist([]) + arr = pa.from_pylist([]) assert len(arr) == 0 assert arr.null_count == 0 - assert arr.type == pyarrow.null() + assert arr.type == pa.null() assert arr.to_pylist() == [] def test_all_none(self): - arr = pyarrow.from_pylist([None, None]) + arr = pa.from_pylist([None, None]) assert len(arr) == 2 assert arr.null_count == 2 - assert arr.type == pyarrow.null() + assert arr.type == pa.null() assert arr.to_pylist() == [None, None] def test_integer(self): expected = [1, None, 3, None] - arr = pyarrow.from_pylist(expected) + arr = pa.from_pylist(expected) assert len(arr) == 4 assert arr.null_count == 2 - assert arr.type == pyarrow.int64() + assert arr.type == pa.int64() assert arr.to_pylist() == expected def test_garbage_collection(self): @@ -60,25 +60,25 @@ def test_garbage_collection(self): # Force the cyclic garbage collector to run gc.collect() - bytes_before = pyarrow.total_allocated_bytes() - pyarrow.from_pylist([1, None, 3, None]) + bytes_before = pa.total_allocated_bytes() + pa.from_pylist([1, None, 3, None]) gc.collect() - assert pyarrow.total_allocated_bytes() == bytes_before + assert pa.total_allocated_bytes() == bytes_before def test_double(self): data = [1.5, 1, None, 2.5, None, None] - arr = pyarrow.from_pylist(data) + arr = pa.from_pylist(data) assert len(arr) == 6 assert arr.null_count == 3 - assert arr.type == pyarrow.float64() + assert arr.type == pa.float64() assert arr.to_pylist() == data def test_unicode(self): data = [u'foo', u'bar', None, u'mañana'] - arr = pyarrow.from_pylist(data) + arr = pa.from_pylist(data) assert len(arr) == 4 assert arr.null_count == 1 - assert arr.type == pyarrow.string() + assert arr.type == pa.string() assert arr.to_pylist() == data def test_bytes(self): @@ -86,31 +86,31 @@ def test_bytes(self): data = [b'foo', u1.decode('utf-8'), # unicode gets encoded, None] - arr = pyarrow.from_pylist(data) + arr = pa.from_pylist(data) assert len(arr) == 3 assert arr.null_count == 1 - assert arr.type == pyarrow.binary() + assert arr.type == pa.binary() assert arr.to_pylist() == [b'foo', u1, None] def test_fixed_size_bytes(self): data = [b'foof', None, b'barb', b'2346'] - arr = pyarrow.from_pylist(data, type=pyarrow.binary(4)) + arr = pa.from_pylist(data, type=pa.binary(4)) assert len(arr) == 4 assert arr.null_count == 1 - assert arr.type == pyarrow.binary(4) + assert arr.type == pa.binary(4) assert arr.to_pylist() == data def test_fixed_size_bytes_does_not_accept_varying_lengths(self): data = [b'foo', None, b'barb', b'2346'] - with self.assertRaises(pyarrow.error.ArrowException): - pyarrow.from_pylist(data, type=pyarrow.binary(4)) + with self.assertRaises(pa.ArrowInvalid): + pa.from_pylist(data, type=pa.binary(4)) def test_date(self): data = [datetime.date(2000, 1, 1), None, datetime.date(1970, 1, 1), datetime.date(2040, 2, 26)] - arr = pyarrow.from_pylist(data) + arr = pa.from_pylist(data) assert len(arr) == 4 - assert arr.type == pyarrow.date64() + assert arr.type == pa.date64() assert arr.null_count == 1 assert arr[0].as_py() == datetime.date(2000, 1, 1) assert 
arr[1].as_py() is None @@ -124,9 +124,9 @@ def test_timestamp(self): datetime.datetime(2006, 1, 13, 12, 34, 56, 432539), datetime.datetime(2010, 8, 13, 5, 46, 57, 437699) ] - arr = pyarrow.from_pylist(data) + arr = pa.from_pylist(data) assert len(arr) == 4 - assert arr.type == pyarrow.timestamp('us') + assert arr.type == pa.timestamp('us') assert arr.null_count == 1 assert arr[0].as_py() == datetime.datetime(2007, 7, 13, 1, 23, 34, 123456) @@ -137,28 +137,28 @@ def test_timestamp(self): 46, 57, 437699) def test_mixed_nesting_levels(self): - pyarrow.from_pylist([1, 2, None]) - pyarrow.from_pylist([[1], [2], None]) - pyarrow.from_pylist([[1], [2], [None]]) + pa.from_pylist([1, 2, None]) + pa.from_pylist([[1], [2], None]) + pa.from_pylist([[1], [2], [None]]) - with self.assertRaises(pyarrow.ArrowException): - pyarrow.from_pylist([1, 2, [1]]) + with self.assertRaises(pa.ArrowInvalid): + pa.from_pylist([1, 2, [1]]) - with self.assertRaises(pyarrow.ArrowException): - pyarrow.from_pylist([1, 2, []]) + with self.assertRaises(pa.ArrowInvalid): + pa.from_pylist([1, 2, []]) - with self.assertRaises(pyarrow.ArrowException): - pyarrow.from_pylist([[1], [2], [None, [1]]]) + with self.assertRaises(pa.ArrowInvalid): + pa.from_pylist([[1], [2], [None, [1]]]) def test_list_of_int(self): data = [[1, 2, 3], [], None, [1, 2]] - arr = pyarrow.from_pylist(data) + arr = pa.from_pylist(data) assert len(arr) == 4 assert arr.null_count == 1 - assert arr.type == pyarrow.list_(pyarrow.int64()) + assert arr.type == pa.list_(pa.int64()) assert arr.to_pylist() == data def test_mixed_types_fails(self): data = ['a', 1, 2.0] - with self.assertRaises(pyarrow.error.ArrowException): - pyarrow.from_pylist(data) + with self.assertRaises(pa.ArrowException): + pa.from_pylist(data) diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index 56830a88f2ec2..87c9c03d7da11 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -266,7 +266,7 @@ def test_fixed_size_bytes_does_not_accept_varying_lengths(self): values = [b'foo', None, b'ba', None, None, b'hey'] df = pd.DataFrame({'strings': values}) schema = A.Schema.from_fields([A.field('strings', A.binary(3))]) - with self.assertRaises(A.error.ArrowException): + with self.assertRaises(A.ArrowInvalid): A.Table.from_pandas(df, schema=schema) def test_timestamps_notimezone_no_nulls(self): @@ -409,7 +409,7 @@ def test_category(self): def test_mixed_types_fails(self): data = pd.DataFrame({'a': ['a', 1, 2.0]}) - with self.assertRaises(A.error.ArrowException): + with self.assertRaises(A.ArrowException): A.Table.from_pandas(data) def test_strided_data_import(self): diff --git a/python/pyarrow/tests/test_feather.py b/python/pyarrow/tests/test_feather.py index c7b4f1e997327..cba9464354a4e 100644 --- a/python/pyarrow/tests/test_feather.py +++ b/python/pyarrow/tests/test_feather.py @@ -45,7 +45,7 @@ def tearDown(self): pass def test_file_not_exist(self): - with self.assertRaises(pa.ArrowException): + with self.assertRaises(pa.ArrowIOError): FeatherReader('test_invalid_file') def _get_null_counts(self, path, columns=None): From 5d6c6ad6a81be6194a4f8349a369a94ef927e18b Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 4 Apr 2017 21:57:13 +0200 Subject: [PATCH 0466/1644] ARROW-770: [C++] Move .clang* files back into cpp source tree After ARROW-341, we don't need these files at the top level anymore to get clang-format to work on all of our C++ code Author: Wes McKinney Closes #491 from 
wesm/ARROW-770 and squashes the following commits: 1588a4f [Wes McKinney] Move .clang* files back into cpp source tree --- .clang-format => cpp/.clang-format | 0 .clang-tidy => cpp/.clang-tidy | 0 .clang-tidy-ignore => cpp/.clang-tidy-ignore | 0 3 files changed, 0 insertions(+), 0 deletions(-) rename .clang-format => cpp/.clang-format (100%) rename .clang-tidy => cpp/.clang-tidy (100%) rename .clang-tidy-ignore => cpp/.clang-tidy-ignore (100%) diff --git a/.clang-format b/cpp/.clang-format similarity index 100% rename from .clang-format rename to cpp/.clang-format diff --git a/.clang-tidy b/cpp/.clang-tidy similarity index 100% rename from .clang-tidy rename to cpp/.clang-tidy diff --git a/.clang-tidy-ignore b/cpp/.clang-tidy-ignore similarity index 100% rename from .clang-tidy-ignore rename to cpp/.clang-tidy-ignore From 360942e6171b301d5efb1686794239e3527828f3 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 4 Apr 2017 16:20:56 -0400 Subject: [PATCH 0467/1644] ARROW-672: [Format] Add MetadataVersion::V3 for Arrow 0.3 As a matter of diligence, we increment the metadata version for Arrow 0.3 since we've changed the metadata format in various ways. Author: Wes McKinney Closes #488 from wesm/ARROW-672 and squashes the following commits: f39733e [Wes McKinney] Add C++ unit test for read/write MetadataVersion. Change MetadataVersion to C++11 enum class bb09ba2 [Wes McKinney] Add MetadataVersion::V3 for Arrow 0.3 --- c_glib/arrow-glib/ipc-metadata-version.cpp | 22 ++++++++++-------- c_glib/arrow-glib/ipc-metadata-version.h | 4 +++- c_glib/arrow-glib/ipc-metadata-version.hpp | 4 ++-- cpp/src/arrow/ipc/ipc-read-write-test.cc | 20 ++++++++++++++++ cpp/src/arrow/ipc/metadata.cc | 23 ++++++++++++++++++- cpp/src/arrow/ipc/metadata.h | 6 ++--- cpp/src/arrow/ipc/reader.cc | 11 ++++++--- cpp/src/arrow/ipc/reader.h | 2 +- format/Schema.fbs | 3 ++- .../vector/stream/MessageSerializer.java | 2 +- 10 files changed, 75 insertions(+), 22 deletions(-) diff --git a/c_glib/arrow-glib/ipc-metadata-version.cpp b/c_glib/arrow-glib/ipc-metadata-version.cpp index c5cc8d379843c..f591f295ec886 100644 --- a/c_glib/arrow-glib/ipc-metadata-version.cpp +++ b/c_glib/arrow-glib/ipc-metadata-version.cpp @@ -29,31 +29,35 @@ * @short_description: Metadata version mapping between Arrow and arrow-glib * * #GArrowIPCMetadataVersion provides metadata versions corresponding - * to `arrow::ipc::MetadataVersion::type` values. + * to `arrow::ipc::MetadataVersion` values.
*/ GArrowIPCMetadataVersion -garrow_ipc_metadata_version_from_raw(arrow::ipc::MetadataVersion::type version) +garrow_ipc_metadata_version_from_raw(arrow::ipc::MetadataVersion version) { switch (version) { - case arrow::ipc::MetadataVersion::type::V1: + case arrow::ipc::MetadataVersion::V1: return GARROW_IPC_METADATA_VERSION_V1; - case arrow::ipc::MetadataVersion::type::V2: + case arrow::ipc::MetadataVersion::V2: return GARROW_IPC_METADATA_VERSION_V2; + case arrow::ipc::MetadataVersion::V3: + return GARROW_IPC_METADATA_VERSION_V3; default: - return GARROW_IPC_METADATA_VERSION_V2; + return GARROW_IPC_METADATA_VERSION_V3; } } -arrow::ipc::MetadataVersion::type +arrow::ipc::MetadataVersion garrow_ipc_metadata_version_to_raw(GArrowIPCMetadataVersion version) { switch (version) { case GARROW_IPC_METADATA_VERSION_V1: - return arrow::ipc::MetadataVersion::type::V1; + return arrow::ipc::MetadataVersion::V1; case GARROW_IPC_METADATA_VERSION_V2: - return arrow::ipc::MetadataVersion::type::V2; + return arrow::ipc::MetadataVersion::V2; + case GARROW_IPC_METADATA_VERSION_V3: + return arrow::ipc::MetadataVersion::V3; default: - return arrow::ipc::MetadataVersion::type::V2; + return arrow::ipc::MetadataVersion::V3; } } diff --git a/c_glib/arrow-glib/ipc-metadata-version.h b/c_glib/arrow-glib/ipc-metadata-version.h index ccfd52a81639f..20defdb71b4f2 100644 --- a/c_glib/arrow-glib/ipc-metadata-version.h +++ b/c_glib/arrow-glib/ipc-metadata-version.h @@ -27,13 +27,15 @@ G_BEGIN_DECLS * GArrowIPCMetadataVersion: * @GARROW_IPC_METADATA_VERSION_V1: Version 1. * @GARROW_IPC_METADATA_VERSION_V2: Version 2. + * @GARROW_IPC_METADATA_VERSION_V3: Version 3. * * They are corresponding to `arrow::ipc::MetadataVersion::type` * values. */ typedef enum { GARROW_IPC_METADATA_VERSION_V1, - GARROW_IPC_METADATA_VERSION_V2 + GARROW_IPC_METADATA_VERSION_V2, + GARROW_IPC_METADATA_VERSION_V3 } GArrowIPCMetadataVersion; G_END_DECLS diff --git a/c_glib/arrow-glib/ipc-metadata-version.hpp b/c_glib/arrow-glib/ipc-metadata-version.hpp index 2a7e8cffa8917..229565f002180 100644 --- a/c_glib/arrow-glib/ipc-metadata-version.hpp +++ b/c_glib/arrow-glib/ipc-metadata-version.hpp @@ -23,5 +23,5 @@ #include -GArrowIPCMetadataVersion garrow_ipc_metadata_version_from_raw(arrow::ipc::MetadataVersion::type version); -arrow::ipc::MetadataVersion::type garrow_ipc_metadata_version_to_raw(GArrowIPCMetadataVersion version); +GArrowIPCMetadataVersion garrow_ipc_metadata_version_from_raw(arrow::ipc::MetadataVersion version); +arrow::ipc::MetadataVersion garrow_ipc_metadata_version_to_raw(GArrowIPCMetadataVersion version); diff --git a/cpp/src/arrow/ipc/ipc-read-write-test.cc b/cpp/src/arrow/ipc/ipc-read-write-test.cc index 86ec7701add20..6807296b59a5e 100644 --- a/cpp/src/arrow/ipc/ipc-read-write-test.cc +++ b/cpp/src/arrow/ipc/ipc-read-write-test.cc @@ -211,6 +211,26 @@ TEST_P(TestIpcRoundTrip, RoundTrip) { CheckRoundtrip(*batch, 1 << 20); } +TEST_F(TestIpcRoundTrip, MetadataVersion) { + std::shared_ptr batch; + ASSERT_OK(MakeIntRecordBatch(&batch)); + + ASSERT_OK(io::MemoryMapFixture::InitMemoryMap(1 << 16, "test-metadata", &mmap_)); + + int32_t metadata_length; + int64_t body_length; + + const int64_t buffer_offset = 0; + + ASSERT_OK(WriteRecordBatch( + *batch, buffer_offset, mmap_.get(), &metadata_length, &body_length, pool_)); + + std::shared_ptr message; + ASSERT_OK(ReadMessage(0, metadata_length, mmap_.get(), &message)); + + ASSERT_EQ(MetadataVersion::V3, message->metadata_version()); +} + TEST_P(TestIpcRoundTrip, SliceRoundTrip) { std::shared_ptr 
batch; ASSERT_OK((*GetParam())(&batch)); // NOLINT clang-tidy gtest issue diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index 5007f1309087d..2ff25eeaa9213 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -50,7 +50,7 @@ using VectorLayoutOffset = flatbuffers::Offset; using Offset = flatbuffers::Offset; using FBString = flatbuffers::Offset; -static constexpr flatbuf::MetadataVersion kMetadataVersion = flatbuf::MetadataVersion_V2; +static constexpr flatbuf::MetadataVersion kMetadataVersion = flatbuf::MetadataVersion_V3; static Status IntFromFlatbuffer( const flatbuf::Int* int_data, std::shared_ptr* out) { @@ -826,6 +826,23 @@ class Message::MessageImpl { } } + MetadataVersion version() const { + switch (message_->version()) { + case flatbuf::MetadataVersion_V1: + // Arrow 0.1 + return MetadataVersion::V1; + case flatbuf::MetadataVersion_V2: + // Arrow 0.2 + return MetadataVersion::V2; + case flatbuf::MetadataVersion_V3: + // Arrow 0.3 + return MetadataVersion::V3; + // Add cases as other versions become available + default: + return MetadataVersion::V3; + } + } + const void* header() const { return message_->header(); } int64_t body_length() const { return message_->bodyLength(); } @@ -856,6 +873,10 @@ Message::Type Message::type() const { return impl_->type(); } +MetadataVersion Message::metadata_version() const { + return impl_->version(); +} + int64_t Message::body_length() const { return impl_->body_length(); } diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h index 451a76d5249e0..b042882c7cd31 100644 --- a/cpp/src/arrow/ipc/metadata.h +++ b/cpp/src/arrow/ipc/metadata.h @@ -50,9 +50,7 @@ class RandomAccessFile; namespace ipc { -struct MetadataVersion { - enum type { V1, V2 }; -}; +enum class MetadataVersion : char { V1, V2, V3 }; static constexpr const char* kArrowMagicBytes = "ARROW1"; @@ -134,6 +132,8 @@ class ARROW_EXPORT Message { Type type() const; + MetadataVersion metadata_version() const; + const void* header() const; private: diff --git a/cpp/src/arrow/ipc/reader.cc b/cpp/src/arrow/ipc/reader.cc index 00ea20cf5dfb1..55f632f306b9a 100644 --- a/cpp/src/arrow/ipc/reader.cc +++ b/cpp/src/arrow/ipc/reader.cc @@ -332,15 +332,20 @@ class FileReader::FileReaderImpl { int num_record_batches() const { return footer_->recordBatches()->size(); } - MetadataVersion::type version() const { + MetadataVersion version() const { switch (footer_->version()) { case flatbuf::MetadataVersion_V1: + // Arrow 0.1 return MetadataVersion::V1; case flatbuf::MetadataVersion_V2: + // Arrow 0.2 return MetadataVersion::V2; + case flatbuf::MetadataVersion_V3: + // Arrow 0.3 + return MetadataVersion::V3; // Add cases as other versions become available default: - return MetadataVersion::V2; + return MetadataVersion::V3; } } @@ -454,7 +459,7 @@ int FileReader::num_record_batches() const { return impl_->num_record_batches(); } -MetadataVersion::type FileReader::version() const { +MetadataVersion FileReader::version() const { return impl_->version(); } diff --git a/cpp/src/arrow/ipc/reader.h b/cpp/src/arrow/ipc/reader.h index b62f0527e0ca0..1972446743bc1 100644 --- a/cpp/src/arrow/ipc/reader.h +++ b/cpp/src/arrow/ipc/reader.h @@ -91,7 +91,7 @@ class ARROW_EXPORT FileReader { int num_record_batches() const; - MetadataVersion::type version() const; + MetadataVersion version() const; // Read a record batch from the file. Does not copy memory if the input // source supports zero-copy. 
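// [Editor's illustration -- not part of this patch] A minimal C++ sketch of
// checking the version on a message read back from an IPC source, following
// the MetadataVersion unit test added above. MetadataVersion is now a scoped
// C++11 enum class, so the values must be qualified.
#include "arrow/ipc/metadata.h"
#include "arrow/status.h"

arrow::Status CheckVersion(const std::shared_ptr<arrow::ipc::Message>& message) {
  if (message->metadata_version() == arrow::ipc::MetadataVersion::V3) {
    // Metadata written by Arrow 0.3; V1 and V2 identify 0.1 and 0.2 data.
  }
  return arrow::Status::OK();
}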
diff --git a/format/Schema.fbs b/format/Schema.fbs index 958f09181bfa6..ca9c8e6c3e76c 100644 --- a/format/Schema.fbs +++ b/format/Schema.fbs @@ -21,7 +21,8 @@ namespace org.apache.arrow.flatbuf; enum MetadataVersion:short { V1, - V2 + V2, + V3 } /// These are stored in the flatbuffer in the Type union below diff --git a/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java b/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java index f85fb51710bde..ec7e0f2ffb115 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java @@ -329,7 +329,7 @@ private static ByteBuffer serializeMessage(FlatBufferBuilder builder, byte heade Message.startMessage(builder); Message.addHeaderType(builder, headerType); Message.addHeader(builder, headerOffset); - Message.addVersion(builder, MetadataVersion.V2); + Message.addVersion(builder, MetadataVersion.V3); Message.addBodyLength(builder, bodyLength); builder.finish(Message.endMessage(builder)); return builder.dataBuffer(); From e29a7d4cae943312a1f8598e71c5d46c1954b5fa Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 4 Apr 2017 16:22:29 -0400 Subject: [PATCH 0468/1644] ARROW-668: [Python] Box timestamp values as pandas.Timestamp if available, attach tzinfo I'm not sure how to easily test the behavior if pandas is not present. I created an environment without pandas and added some fixes so that I could verify the behavior, but at some point we should create a "no pandas" test suite to see what using pyarrow is like without pandas installed. Author: Wes McKinney Closes #487 from wesm/ARROW-668 and squashes the following commits: 554a647 [Wes McKinney] Remove cython from requirements.txt 649d28a [Wes McKinney] Box timestamp values as pandas.Timestamp if available, return timezone also if available --- python/pyarrow/array.pyx | 25 +++------ python/pyarrow/compat.py | 17 +++++++ python/pyarrow/scalar.pyx | 47 +++++++++++++---- python/pyarrow/tests/test_scalars.py | 76 ++++++++++++++++++---------- 4 files changed, 112 insertions(+), 53 deletions(-) diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 67785e34075f4..1f59556e94fb8 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -29,7 +29,7 @@ cimport pyarrow.includes.pyarrow as pyarrow import pyarrow.config -from pyarrow.compat import frombytes, tobytes +from pyarrow.compat import frombytes, tobytes, PandasSeries, Categorical from pyarrow.error cimport check_status from pyarrow.memory cimport MemoryPool, maybe_unbox_memory_pool @@ -44,11 +44,6 @@ import pyarrow.schema as schema cimport cpython -cdef _pandas(): - import pandas as pd - return pd - - cdef maybe_coerce_datetime64(values, dtype, DataType type, timestamps_to_ms=False): @@ -66,7 +61,7 @@ cdef maybe_coerce_datetime64(values, dtype, DataType type, tz = dtype.tz unit = 'ms' if coerce_ms else dtype.unit type = schema.timestamp(unit, tz) - else: + elif type is None: # Trust the NumPy dtype type = schema.type_from_numpy_dtype(values.dtype) @@ -141,15 +136,13 @@ cdef class Array: shared_ptr[CDataType] c_type CMemoryPool* pool - pd = _pandas() - if mask is not None: mask = get_series_values(mask) values = get_series_values(obj) pool = maybe_unbox_memory_pool(memory_pool) - if isinstance(values, pd.Categorical): + if isinstance(values, Categorical): return DictionaryArray.from_arrays( values.codes, values.categories.values, mask=mask, memory_pool=memory_pool)
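# [Editor's illustration -- not part of this patch] What the boxing change
# looks like from user code, mirroring the test_timestamp case added to
# test_scalars.py below; assumes pandas and pytz are installed (without
# pandas, values fall back to naive datetime.datetime, and nanosecond
# precision raises NotImplementedError).
import pandas as pd
import pyarrow as pa

values = pd.date_range('2000-01-01 12:34:56', periods=3).values
arr = pa.Array.from_numpy(values.astype('datetime64[us]'),
                          type=pa.timestamp('us', tz='America/New_York'))
arr[0].as_py()   # a tz-aware pandas.Timestamp: the stored UTC instant
                 # converted to America/New_York, as in the test below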
@@ -397,9 +390,9 @@ cdef wrap_array_output(PyObject* output): cdef object obj = PyObject_to_object(output) if isinstance(obj, dict): - return _pandas().Categorical(obj['indices'], - categories=obj['dictionary'], - fastpath=True) + return Categorical(obj['indices'], + categories=obj['dictionary'], + fastpath=True) else: return obj @@ -622,14 +615,12 @@ cdef object box_tensor(const shared_ptr[CTensor]& sp_tensor): cdef object get_series_values(object obj): - import pandas as pd - - if isinstance(obj, pd.Series): + if isinstance(obj, PandasSeries): result = obj.values elif isinstance(obj, np.ndarray): result = obj else: - result = pd.Series(obj).values + result = PandasSeries(obj).values return result diff --git a/python/pyarrow/compat.py b/python/pyarrow/compat.py index b9206aacbc9f1..4dcc11677e7dd 100644 --- a/python/pyarrow/compat.py +++ b/python/pyarrow/compat.py @@ -38,9 +38,26 @@ else: from pandas.types.dtypes import DatetimeTZDtype pdapi = pd.api.types + + PandasSeries = pd.Series + Categorical = pd.Categorical HAVE_PANDAS = True except: HAVE_PANDAS = False + class DatetimeTZDtype(object): + pass + + class ClassPlaceholder(object): + + def __init__(self, *args, **kwargs): + raise NotImplementedError + + class PandasSeries(ClassPlaceholder): + pass + + class Categorical(ClassPlaceholder): + pass + if PY26: import unittest2 as unittest diff --git a/python/pyarrow/scalar.pyx b/python/pyarrow/scalar.pyx index 983a9a7334044..1c0790a4fdc3c 100644 --- a/python/pyarrow/scalar.pyx +++ b/python/pyarrow/scalar.pyx @@ -26,6 +26,12 @@ cimport cpython as cp NA = None + +cdef _pandas(): + import pandas as pd + return pd + + cdef class NAType(Scalar): def __cinit__(self): @@ -146,16 +152,37 @@ cdef class TimestampValue(ArrayValue): CTimestampType* dtype = ap.type().get() int64_t val = ap.Value(self.index) - if dtype.unit == TimeUnit_SECOND: - return datetime.datetime.utcfromtimestamp(val) - elif dtype.unit == TimeUnit_MILLI: - return datetime.datetime.utcfromtimestamp(float(val) / 1000) - elif dtype.unit == TimeUnit_MICRO: - return datetime.datetime.utcfromtimestamp(float(val) / 1000000) - else: - # TimeUnit_NANO - raise NotImplementedError("Cannot convert nanosecond timestamps " - "to datetime.datetime") + timezone = None + tzinfo = None + if dtype.timezone.size() > 0: + timezone = frombytes(dtype.timezone) + import pytz + tzinfo = pytz.timezone(timezone) + + try: + pd = _pandas() + if dtype.unit == TimeUnit_SECOND: + val = val * 1000000000 + elif dtype.unit == TimeUnit_MILLI: + val = val * 1000000 + elif dtype.unit == TimeUnit_MICRO: + val = val * 1000 + return pd.Timestamp(val, tz=tzinfo) + except ImportError: + if dtype.unit == TimeUnit_SECOND: + result = datetime.datetime.utcfromtimestamp(val) + elif dtype.unit == TimeUnit_MILLI: + result = datetime.datetime.utcfromtimestamp(float(val) / 1000) + elif dtype.unit == TimeUnit_MICRO: + result = datetime.datetime.utcfromtimestamp( + float(val) / 1000000) + else: + # TimeUnit_NANO + raise NotImplementedError("Cannot convert nanosecond " + "timestamps without pandas") + if timezone is not None: + result = result.replace(tzinfo=tzinfo) + return result cdef class FloatValue(ArrayValue): diff --git a/python/pyarrow/tests/test_scalars.py b/python/pyarrow/tests/test_scalars.py index a5db7e0835607..f4f275b994228 100644 --- a/python/pyarrow/tests/test_scalars.py +++ b/python/pyarrow/tests/test_scalars.py @@ -19,69 +19,69 @@ import pandas as pd from pyarrow.compat import unittest, u, unicode_type -import pyarrow as A +import pyarrow as pa class 
TestScalars(unittest.TestCase): def test_null_singleton(self): with self.assertRaises(Exception): - A.NAType() + pa.NAType() def test_bool(self): - arr = A.from_pylist([True, None, False, None]) + arr = pa.from_pylist([True, None, False, None]) v = arr[0] - assert isinstance(v, A.BooleanValue) + assert isinstance(v, pa.BooleanValue) assert repr(v) == "True" assert v.as_py() is True - assert arr[1] is A.NA + assert arr[1] is pa.NA def test_int64(self): - arr = A.from_pylist([1, 2, None]) + arr = pa.from_pylist([1, 2, None]) v = arr[0] - assert isinstance(v, A.Int64Value) + assert isinstance(v, pa.Int64Value) assert repr(v) == "1" assert v.as_py() == 1 - assert arr[2] is A.NA + assert arr[2] is pa.NA def test_double(self): - arr = A.from_pylist([1.5, None, 3]) + arr = pa.from_pylist([1.5, None, 3]) v = arr[0] - assert isinstance(v, A.DoubleValue) + assert isinstance(v, pa.DoubleValue) assert repr(v) == "1.5" assert v.as_py() == 1.5 - assert arr[1] is A.NA + assert arr[1] is pa.NA v = arr[2] assert v.as_py() == 3.0 def test_string_unicode(self): - arr = A.from_pylist([u'foo', None, u'mañana']) + arr = pa.from_pylist([u'foo', None, u'mañana']) v = arr[0] - assert isinstance(v, A.StringValue) + assert isinstance(v, pa.StringValue) assert v.as_py() == 'foo' - assert arr[1] is A.NA + assert arr[1] is pa.NA v = arr[2].as_py() assert v == u'mañana' assert isinstance(v, unicode_type) def test_bytes(self): - arr = A.from_pylist([b'foo', None, u('bar')]) + arr = pa.from_pylist([b'foo', None, u('bar')]) v = arr[0] - assert isinstance(v, A.BinaryValue) + assert isinstance(v, pa.BinaryValue) assert v.as_py() == b'foo' - assert arr[1] is A.NA + assert arr[1] is pa.NA v = arr[2].as_py() assert v == b'bar' @@ -89,41 +89,65 @@ def test_bytes(self): def test_fixed_size_bytes(self): data = [b'foof', None, b'barb'] - arr = A.from_pylist(data, type=A.binary(4)) + arr = pa.from_pylist(data, type=pa.binary(4)) v = arr[0] - assert isinstance(v, A.FixedSizeBinaryValue) + assert isinstance(v, pa.FixedSizeBinaryValue) assert v.as_py() == b'foof' - assert arr[1] is A.NA + assert arr[1] is pa.NA v = arr[2].as_py() assert v == b'barb' assert isinstance(v, bytes) def test_list(self): - arr = A.from_pylist([['foo', None], None, ['bar'], []]) + arr = pa.from_pylist([['foo', None], None, ['bar'], []]) v = arr[0] assert len(v) == 2 - assert isinstance(v, A.ListValue) + assert isinstance(v, pa.ListValue) assert repr(v) == "['foo', None]" assert v.as_py() == ['foo', None] assert v[0].as_py() == 'foo' - assert v[1] is A.NA + assert v[1] is pa.NA - assert arr[1] is A.NA + assert arr[1] is pa.NA v = arr[3] assert len(v) == 0 + def test_timestamp(self): + arr = pd.date_range('2000-01-01 12:34:56', periods=10).values + + units = ['s', 'ms', 'us', 'ns'] + + for unit in units: + dtype = 'datetime64[{0}]'.format(unit) + arrow_arr = pa.Array.from_numpy(arr.astype(dtype)) + expected = pd.Timestamp('2000-01-01 12:34:56') + + assert arrow_arr[0].as_py() == expected + + tz = 'America/New_York' + arrow_type = pa.timestamp(unit, tz=tz) + + dtype = 'datetime64[{0}]'.format(unit) + arrow_arr = pa.Array.from_numpy(arr.astype(dtype), + type=arrow_type) + expected = (pd.Timestamp('2000-01-01 12:34:56') + .tz_localize('utc') + .tz_convert(tz)) + + assert arrow_arr[0].as_py() == expected + def test_dictionary(self): colors = ['red', 'green', 'blue'] values = pd.Series(colors * 4) categorical = pd.Categorical(values, categories=colors) - v = A.DictionaryArray.from_arrays(categorical.codes, - categorical.categories) + v = 
pa.DictionaryArray.from_arrays(categorical.codes, + categorical.categories) for i, c in enumerate(values): assert v[i].as_py() == c From f4fcb42c2cb0d463db4ddeef68e4392f8d7c049f Mon Sep 17 00:00:00 2001 From: Leif Walsh Date: Wed, 5 Apr 2017 13:33:25 -0400 Subject: [PATCH 0469/1644] ARROW-510 ARROW-582 ARROW-663 ARROW-729: [Java] Added units for Time and Date types, and integration tests closes #366 Author: Leif Walsh Author: Wes McKinney Closes #475 from leifwalsh/feature/java-date-time-types and squashes the following commits: 2e2a4cf [Leif Walsh] ARROW-729: [Java] removed Joda DateTime getters from Date* and Time* types 47f83a8 [Wes McKinney] Integration tests for all date and time combinations 6e86422 [Wes McKinney] ARROW-733: [C++/Python] Rename FixedWidthBinary to FixedSizeBinary for consistency with FixedSizeList 2dca474 [Leif Walsh] ARROW-729: [Java] Added units for date/time types --- cpp/src/arrow/ipc/json-internal.cc | 13 ++ cpp/src/arrow/ipc/metadata.cc | 11 +- cpp/src/arrow/type-test.cc | 8 +- cpp/src/arrow/type.h | 15 +- integration/integration_test.py | 145 ++++++++++++++---- .../main/codegen/data/ValueVectorTypes.tdd | 18 ++- .../codegen/templates/FixedValueVectors.java | 21 +-- .../vector/file/json/JsonFileReader.java | 28 +++- .../vector/file/json/JsonFileWriter.java | 28 +++- .../org/apache/arrow/vector/types/Types.java | 95 ++++++++++-- .../arrow/vector/types/pojo/TestSchema.java | 40 ++++- 11 files changed, 337 insertions(+), 85 deletions(-) diff --git a/cpp/src/arrow/ipc/json-internal.cc b/cpp/src/arrow/ipc/json-internal.cc index 1e2385b73f82c..124c21b8fc023 100644 --- a/cpp/src/arrow/ipc/json-internal.cc +++ b/cpp/src/arrow/ipc/json-internal.cc @@ -175,6 +175,8 @@ class JsonSchemaWriter { void WriteTypeMetadata(const TimeType& type) { writer_->Key("unit"); writer_->String(GetTimeUnitName(type.unit)); + writer_->Key("bitWidth"); + writer_->Int(type.bit_width()); } void WriteTypeMetadata(const DateType& type) { @@ -608,6 +610,9 @@ static Status GetTime(const RjObject& json_type, std::shared_ptr* type const auto& json_unit = json_type.FindMember("unit"); RETURN_NOT_STRING("unit", json_unit, json_type); + const auto& json_bit_width = json_type.FindMember("bitWidth"); + RETURN_NOT_INT("bitWidth", json_bit_width, json_type); + std::string unit_str = json_unit->value.GetString(); if (unit_str == "SECOND") { @@ -623,6 +628,14 @@ static Status GetTime(const RjObject& json_type, std::shared_ptr* type ss << "Invalid time unit: " << unit_str; return Status::Invalid(ss.str()); } + + const auto& fw_type = static_cast(**type); + + int bit_width = json_bit_width->value.GetInt(); + if (bit_width != fw_type.bit_width()) { + return Status::Invalid("Indicated bit width does not match unit"); + } + return Status::OK(); } diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index 2ff25eeaa9213..d902ec296cff3 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -255,12 +255,19 @@ static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, case flatbuf::Type_Time: { auto time_type = static_cast(type_data); TimeUnit unit = FromFlatbufferUnit(time_type->unit()); + int32_t bit_width = time_type->bitWidth(); switch (unit) { case TimeUnit::SECOND: case TimeUnit::MILLI: + if (bit_width != 32) { + return Status::Invalid("Time is 32 bits for second/milli unit"); + } *out = time32(unit); break; default: + if (bit_width != 64) { + return Status::Invalid("Time is 64 bits for micro/nano unit"); + } *out = time64(unit); break; } @@ 
-386,12 +393,12 @@ static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, case Type::TIME32: { const auto& time_type = static_cast(*type); *out_type = flatbuf::Type_Time; - *offset = flatbuf::CreateTime(fbb, ToFlatbufferUnit(time_type.unit)).Union(); + *offset = flatbuf::CreateTime(fbb, ToFlatbufferUnit(time_type.unit), 32).Union(); } break; case Type::TIME64: { const auto& time_type = static_cast(*type); *out_type = flatbuf::Type_Time; - *offset = flatbuf::CreateTime(fbb, ToFlatbufferUnit(time_type.unit)).Union(); + *offset = flatbuf::CreateTime(fbb, ToFlatbufferUnit(time_type.unit), 64).Union(); } break; case Type::TIMESTAMP: { const auto& ts_type = static_cast(*type); diff --git a/cpp/src/arrow/type-test.cc b/cpp/src/arrow/type-test.cc index dafadc168c191..66164e3430913 100644 --- a/cpp/src/arrow/type-test.cc +++ b/cpp/src/arrow/type-test.cc @@ -191,12 +191,15 @@ TEST(TestListType, Basics) { ASSERT_EQ("list>", lt2.ToString()); } -TEST(TestDateTypes, ToString) { +TEST(TestDateTypes, Attrs) { auto t1 = date32(); auto t2 = date64(); ASSERT_EQ("date32[day]", t1->ToString()); ASSERT_EQ("date64[ms]", t2->ToString()); + + ASSERT_EQ(32, static_cast(*t1).bit_width()); + ASSERT_EQ(64, static_cast(*t2).bit_width()); } TEST(TestTimeType, Equals) { @@ -207,6 +210,9 @@ TEST(TestTimeType, Equals) { Time64Type t4(TimeUnit::NANO); Time64Type t5(TimeUnit::MICRO); + ASSERT_EQ(32, t0.bit_width()); + ASSERT_EQ(64, t3.bit_width()); + ASSERT_TRUE(t0.Equals(t2)); ASSERT_TRUE(t1.Equals(t1)); ASSERT_FALSE(t1.Equals(t3)); diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 6b936f348d4de..0e69133219d55 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -18,6 +18,7 @@ #ifndef ARROW_TYPE_H #define ARROW_TYPE_H +#include #include #include #include @@ -220,7 +221,7 @@ struct ARROW_EXPORT CTypeImpl : public BASE { CTypeImpl() : BASE(TYPE_ID) {} - int bit_width() const override { return static_cast(sizeof(C_TYPE) * 8); } + int bit_width() const override { return static_cast(sizeof(C_TYPE) * CHAR_BIT); } Status Accept(TypeVisitor* visitor) const override { return visitor->Visit(*static_cast(this)); @@ -456,7 +457,7 @@ struct ARROW_EXPORT Date32Type : public DateType { Date32Type(); - int bit_width() const override { return static_cast(sizeof(c_type) * 4); } + int bit_width() const override { return static_cast(sizeof(c_type) * CHAR_BIT); } Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; @@ -470,7 +471,7 @@ struct ARROW_EXPORT Date64Type : public DateType { Date64Type(); - int bit_width() const override { return static_cast(sizeof(c_type) * 8); } + int bit_width() const override { return static_cast(sizeof(c_type) * CHAR_BIT); } Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; @@ -509,7 +510,7 @@ struct ARROW_EXPORT Time32Type : public TimeType { static constexpr Type::type type_id = Type::TIME32; using c_type = int32_t; - int bit_width() const override { return static_cast(sizeof(c_type) * 4); } + int bit_width() const override { return static_cast(sizeof(c_type) * CHAR_BIT); } explicit Time32Type(TimeUnit unit = TimeUnit::MILLI); @@ -521,7 +522,7 @@ struct ARROW_EXPORT Time64Type : public TimeType { static constexpr Type::type type_id = Type::TIME64; using c_type = int64_t; - int bit_width() const override { return static_cast(sizeof(c_type) * 8); } + int bit_width() const override { return static_cast(sizeof(c_type) * CHAR_BIT); } explicit Time64Type(TimeUnit unit = TimeUnit::MILLI); @@ -535,7 
+536,7 @@ struct ARROW_EXPORT TimestampType : public FixedWidthType { typedef int64_t c_type; static constexpr Type::type type_id = Type::TIMESTAMP; - int bit_width() const override { return static_cast(sizeof(int64_t) * 8); } + int bit_width() const override { return static_cast(sizeof(int64_t) * CHAR_BIT); } explicit TimestampType(TimeUnit unit = TimeUnit::MILLI) : FixedWidthType(Type::TIMESTAMP), unit(unit) {} @@ -557,7 +558,7 @@ struct ARROW_EXPORT IntervalType : public FixedWidthType { using c_type = int64_t; static constexpr Type::type type_id = Type::INTERVAL; - int bit_width() const override { return static_cast(sizeof(int64_t) * 8); } + int bit_width() const override { return static_cast(sizeof(int64_t) * CHAR_BIT); } Unit unit; diff --git a/integration/integration_test.py b/integration/integration_test.py index ec2a38d840d0b..6631dc8c2f761 100644 --- a/integration/integration_test.py +++ b/integration/integration_test.py @@ -175,10 +175,14 @@ def _get_buffers(self): class IntegerType(PrimitiveType): - def __init__(self, name, is_signed, bit_width, nullable=True): + def __init__(self, name, is_signed, bit_width, nullable=True, + min_value=TEST_INT_MIN, + max_value=TEST_INT_MAX): PrimitiveType.__init__(self, name, nullable=nullable) self.is_signed = is_signed self.bit_width = bit_width + self.min_value = min_value + self.max_value = max_value @property def numpy_type(self): @@ -194,14 +198,80 @@ def _get_type(self): def generate_column(self, size): iinfo = np.iinfo(self.numpy_type) values = [int(x) for x in - np.random.randint(max(iinfo.min, TEST_INT_MIN), - min(iinfo.max, TEST_INT_MAX), + np.random.randint(max(iinfo.min, self.min_value), + min(iinfo.max, self.max_value), size=size)] is_valid = self._make_is_valid(size) return PrimitiveColumn(self.name, size, is_valid, values) +class DateType(IntegerType): + + DAY = 0 + MILLISECOND = 1 + + def __init__(self, name, unit, nullable=True): + self.unit = unit + bit_width = 32 if unit == self.DAY else 64 + IntegerType.__init__(self, name, True, bit_width, nullable=nullable) + + def _get_type(self): + return OrderedDict([ + ('name', 'date'), + ('unit', 'DAY' if self.unit == self.DAY else 'MILLISECOND') + ]) + + +TIMEUNIT_NAMES = { + 's': 'SECOND', + 'ms': 'MILLISECOND', + 'us': 'MICROSECOND', + 'ns': 'NANOSECOND' +} + + +class TimeType(IntegerType): + + BIT_WIDTHS = { + 's': 32, + 'ms': 32, + 'us': 64, + 'ns': 64 + } + + def __init__(self, name, unit='s', nullable=True): + self.unit = unit + IntegerType.__init__(self, name, True, self.BIT_WIDTHS[unit], + nullable=nullable) + + def _get_type(self): + return OrderedDict([ + ('name', 'time'), + ('unit', TIMEUNIT_NAMES[self.unit]), + ('bitWidth', self.bit_width) + ]) + + +class TimestampType(IntegerType): + + def __init__(self, name, unit='s', tz=None, nullable=True): + self.unit = unit + self.tz = tz + IntegerType.__init__(self, name, True, 64, nullable=nullable) + + def _get_type(self): + fields = [ + ('name', 'timestamp'), + ('unit', TIMEUNIT_NAMES[self.unit]) + ] + + if self.tz is not None: + fields.append(('timezone', self.tz)) + + return OrderedDict(fields) + + class FloatingPointType(PrimitiveType): def __init__(self, name, bit_width, nullable=True): @@ -509,6 +579,20 @@ def get_field(name, type_, nullable=True): raise TypeError(dtype) +def _generate_file(fields, batch_sizes): + schema = JSONSchema(fields) + batches = [] + for size in batch_sizes: + columns = [] + for field in fields: + col = field.generate_column(size) + columns.append(col) + + batches.append(JSONRecordBatch(size, 
columns)) + + return JSONFile(schema, batches) + + def generate_primitive_case(): types = ['bool', 'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64', @@ -520,19 +604,27 @@ def generate_primitive_case(): fields.append(get_field(type_ + "_nullable", type_, True)) fields.append(get_field(type_ + "_nonnullable", type_, False)) - schema = JSONSchema(fields) - batch_sizes = [7, 10] - batches = [] - for size in batch_sizes: - columns = [] - for field in fields: - col = field.generate_column(size) - columns.append(col) + return _generate_file(fields, batch_sizes) - batches.append(JSONRecordBatch(size, columns)) - return JSONFile(schema, batches) +def generate_datetime_case(): + fields = [ + DateType('f0', DateType.DAY), + DateType('f1', DateType.MILLISECOND), + TimeType('f2', 's'), + TimeType('f3', 'ms'), + TimeType('f4', 'us'), + TimeType('f5', 'ns'), + TimestampType('f6', 's'), + TimestampType('f7', 'ms'), + TimestampType('f8', 'us'), + TimestampType('f9', 'ns'), + TimestampType('f10', 'ms', tz='America/New_York') + ] + + batch_sizes = [7, 10] + return _generate_file(fields, batch_sizes) def generate_nested_case(): @@ -545,19 +637,8 @@ def generate_nested_case(): # ListType('list_nonnullable', get_field('item', 'int32'), False), ] - schema = JSONSchema(fields) - batch_sizes = [7, 10] - batches = [] - for size in batch_sizes: - columns = [] - for field in fields: - col = field.generate_column(size) - columns.append(col) - - batches.append(JSONRecordBatch(size, columns)) - - return JSONFile(schema, batches) + return _generate_file(fields, batch_sizes) def get_generated_json_files(): @@ -566,13 +647,13 @@ def get_generated_json_files(): def _temp_path(): return - file_objs = [] - - K = 10 - for i in range(K): - file_objs.append(generate_primitive_case()) - - file_objs.append(generate_nested_case()) + file_objs = [ + generate_primitive_case(), + generate_primitive_case(), + generate_primitive_case(), + generate_datetime_case(), + generate_nested_case() + ] generated_paths = [] for file_obj in file_objs: diff --git a/java/vector/src/main/codegen/data/ValueVectorTypes.tdd b/java/vector/src/main/codegen/data/ValueVectorTypes.tdd index 2181cfdc335b4..b08c100edcac8 100644 --- a/java/vector/src/main/codegen/data/ValueVectorTypes.tdd +++ b/java/vector/src/main/codegen/data/ValueVectorTypes.tdd @@ -56,8 +56,10 @@ { class: "Int", valueHolder: "IntHolder"}, { class: "UInt4", valueHolder: "UInt4Holder" }, { class: "Float4", javaType: "float" , boxedType: "Float", fields: [{name: "value", type: "float"}]}, - { class: "IntervalYear", javaType: "int", friendlyType: "Period" } - { class: "Time", javaType: "int", friendlyType: "DateTime" } + { class: "DateDay" }, + { class: "IntervalYear", javaType: "int", friendlyType: "Period" }, + { class: "TimeSec" }, + { class: "TimeMilli" } ] }, { @@ -70,11 +72,13 @@ { class: "BigInt"}, { class: "UInt8" }, { class: "Float8", javaType: "double" , boxedType: "Double", fields: [{name: "value", type: "double"}], }, - { class: "Date", javaType: "long", friendlyType: "DateTime" }, - { class: "TimeStampSec", javaType: "long", boxedType: "Long", friendlyType: "DateTime" } - { class: "TimeStampMilli", javaType: "long", boxedType: "Long", friendlyType: "DateTime" } - { class: "TimeStampMicro", javaType: "long", boxedType: "Long", friendlyType: "DateTime" } - { class: "TimeStampNano", javaType: "long", boxedType: "Long", friendlyType: "DateTime" } + { class: "DateMilli" }, + { class: "TimeStampSec", javaType: "long", boxedType: "Long", friendlyType: "DateTime" }, 
+ { class: "TimeStampMilli", javaType: "long", boxedType: "Long", friendlyType: "DateTime" }, + { class: "TimeStampMicro", javaType: "long", boxedType: "Long", friendlyType: "DateTime" }, + { class: "TimeStampNano", javaType: "long", boxedType: "Long", friendlyType: "DateTime" }, + { class: "TimeMicro" }, + { class: "TimeNano" } ] }, { diff --git a/java/vector/src/main/codegen/templates/FixedValueVectors.java b/java/vector/src/main/codegen/templates/FixedValueVectors.java index d5265f1140ee0..947c82c74a401 100644 --- a/java/vector/src/main/codegen/templates/FixedValueVectors.java +++ b/java/vector/src/main/codegen/templates/FixedValueVectors.java @@ -19,6 +19,7 @@ import org.apache.arrow.vector.util.DecimalUtility; import java.lang.Override; +import java.util.concurrent.TimeUnit; <@pp.dropOutputFile /> <#list vv.types as type> @@ -482,12 +483,15 @@ public long getTwoAsLong(int index) { - <#if minor.class == "Date"> + <#if minor.class == "DateDay" || + minor.class == "DateMilli" || + minor.class == "TimeSec" || + minor.class == "TimeMilli" || + minor.class == "TimeMicro" || + minor.class == "TimeNano"> @Override public ${friendlyType} getObject(int index) { - org.joda.time.DateTime date = new org.joda.time.DateTime(get(index), org.joda.time.DateTimeZone.UTC); - date = date.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault()); - return date; + return get(index); } <#elseif minor.class == "TimeStampSec"> @@ -554,15 +558,6 @@ public StringBuilder getAsStringBuilder(int index) { append(months).append(monthString)); } - <#elseif minor.class == "Time"> - @Override - public DateTime getObject(int index) { - - org.joda.time.DateTime time = new org.joda.time.DateTime(get(index), org.joda.time.DateTimeZone.UTC); - time = time.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault()); - return time; - } - <#elseif minor.class == "Decimal9" || minor.class == "Decimal18"> @Override public ${friendlyType} getObject(int index) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java index bdb63b92cb105..2f91205cffcbd 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java @@ -32,15 +32,21 @@ import org.apache.arrow.vector.BigIntVector; import org.apache.arrow.vector.BitVector; import org.apache.arrow.vector.BufferBacked; +import org.apache.arrow.vector.DateDayVector; +import org.apache.arrow.vector.DateMilliVector; import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.Float4Vector; import org.apache.arrow.vector.Float8Vector; import org.apache.arrow.vector.IntVector; import org.apache.arrow.vector.SmallIntVector; -import org.apache.arrow.vector.TimeStampSecVector; -import org.apache.arrow.vector.TimeStampMilliVector; +import org.apache.arrow.vector.TimeMicroVector; +import org.apache.arrow.vector.TimeMilliVector; +import org.apache.arrow.vector.TimeNanoVector; +import org.apache.arrow.vector.TimeSecVector; import org.apache.arrow.vector.TimeStampMicroVector; +import org.apache.arrow.vector.TimeStampMilliVector; import org.apache.arrow.vector.TimeStampNanoVector; +import org.apache.arrow.vector.TimeStampSecVector; import org.apache.arrow.vector.TinyIntVector; import org.apache.arrow.vector.UInt1Vector; import org.apache.arrow.vector.UInt2Vector; @@ -240,6 +246,24 @@ private void setValueFromParser(ValueVector valueVector, int i) 
throws IOExcepti case VARCHAR: ((VarCharVector)valueVector).getMutator().setSafe(i, parser.readValueAs(String.class).getBytes(UTF_8)); break; + case DATEDAY: + ((DateDayVector)valueVector).getMutator().set(i, parser.readValueAs(Integer.class)); + break; + case DATEMILLI: + ((DateMilliVector)valueVector).getMutator().set(i, parser.readValueAs(Long.class)); + break; + case TIMESEC: + ((TimeSecVector)valueVector).getMutator().set(i, parser.readValueAs(Integer.class)); + break; + case TIMEMILLI: + ((TimeMilliVector)valueVector).getMutator().set(i, parser.readValueAs(Integer.class)); + break; + case TIMEMICRO: + ((TimeMicroVector)valueVector).getMutator().set(i, parser.readValueAs(Long.class)); + break; + case TIMENANO: + ((TimeNanoVector)valueVector).getMutator().set(i, parser.readValueAs(Long.class)); + break; case TIMESTAMPSEC: ((TimeStampSecVector)valueVector).getMutator().set(i, parser.readValueAs(Long.class)); break; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java index 99040b67e1cd3..d86b3de3b9da3 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileWriter.java @@ -23,11 +23,17 @@ import org.apache.arrow.vector.BitVector; import org.apache.arrow.vector.BufferBacked; +import org.apache.arrow.vector.DateDayVector; +import org.apache.arrow.vector.DateMilliVector; import org.apache.arrow.vector.FieldVector; -import org.apache.arrow.vector.TimeStampSecVector; -import org.apache.arrow.vector.TimeStampMilliVector; +import org.apache.arrow.vector.TimeMicroVector; +import org.apache.arrow.vector.TimeMilliVector; +import org.apache.arrow.vector.TimeNanoVector; +import org.apache.arrow.vector.TimeSecVector; import org.apache.arrow.vector.TimeStampMicroVector; +import org.apache.arrow.vector.TimeStampMilliVector; import org.apache.arrow.vector.TimeStampNanoVector; +import org.apache.arrow.vector.TimeStampSecVector; import org.apache.arrow.vector.ValueVector; import org.apache.arrow.vector.ValueVector.Accessor; import org.apache.arrow.vector.VarBinaryVector; @@ -144,6 +150,24 @@ private void writeVector(Field field, FieldVector vector) throws IOException { private void writeValueToGenerator(ValueVector valueVector, int i) throws IOException { switch (valueVector.getMinorType()) { + case DATEDAY: + generator.writeNumber(((DateDayVector)valueVector).getAccessor().get(i)); + break; + case DATEMILLI: + generator.writeNumber(((DateMilliVector)valueVector).getAccessor().get(i)); + break; + case TIMESEC: + generator.writeNumber(((TimeSecVector)valueVector).getAccessor().get(i)); + break; + case TIMEMILLI: + generator.writeNumber(((TimeMilliVector)valueVector).getAccessor().get(i)); + break; + case TIMEMICRO: + generator.writeNumber(((TimeMicroVector)valueVector).getAccessor().get(i)); + break; + case TIMENANO: + generator.writeNumber(((TimeNanoVector)valueVector).getAccessor().get(i)); + break; case TIMESTAMPSEC: generator.writeNumber(((TimeStampSecVector)valueVector).getAccessor().get(i)); break; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java index f07bb585f810c..b0455fa14e44c 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -25,7 +25,8 @@ import 
org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.NullableBigIntVector; import org.apache.arrow.vector.NullableBitVector; -import org.apache.arrow.vector.NullableDateVector; +import org.apache.arrow.vector.NullableDateDayVector; +import org.apache.arrow.vector.NullableDateMilliVector; import org.apache.arrow.vector.NullableDecimalVector; import org.apache.arrow.vector.NullableFloat4Vector; import org.apache.arrow.vector.NullableFloat8Vector; @@ -33,11 +34,14 @@ import org.apache.arrow.vector.NullableIntervalDayVector; import org.apache.arrow.vector.NullableIntervalYearVector; import org.apache.arrow.vector.NullableSmallIntVector; +import org.apache.arrow.vector.NullableTimeMicroVector; +import org.apache.arrow.vector.NullableTimeMilliVector; +import org.apache.arrow.vector.NullableTimeNanoVector; +import org.apache.arrow.vector.NullableTimeSecVector; import org.apache.arrow.vector.NullableTimeStampMicroVector; import org.apache.arrow.vector.NullableTimeStampMilliVector; import org.apache.arrow.vector.NullableTimeStampNanoVector; import org.apache.arrow.vector.NullableTimeStampSecVector; -import org.apache.arrow.vector.NullableTimeVector; import org.apache.arrow.vector.NullableTinyIntVector; import org.apache.arrow.vector.NullableUInt1Vector; import org.apache.arrow.vector.NullableUInt2Vector; @@ -52,7 +56,8 @@ import org.apache.arrow.vector.complex.UnionVector; import org.apache.arrow.vector.complex.impl.BigIntWriterImpl; import org.apache.arrow.vector.complex.impl.BitWriterImpl; -import org.apache.arrow.vector.complex.impl.DateWriterImpl; +import org.apache.arrow.vector.complex.impl.DateDayWriterImpl; +import org.apache.arrow.vector.complex.impl.DateMilliWriterImpl; import org.apache.arrow.vector.complex.impl.DecimalWriterImpl; import org.apache.arrow.vector.complex.impl.Float4WriterImpl; import org.apache.arrow.vector.complex.impl.Float8WriterImpl; @@ -61,11 +66,14 @@ import org.apache.arrow.vector.complex.impl.IntervalYearWriterImpl; import org.apache.arrow.vector.complex.impl.NullableMapWriter; import org.apache.arrow.vector.complex.impl.SmallIntWriterImpl; +import org.apache.arrow.vector.complex.impl.TimeMicroWriterImpl; +import org.apache.arrow.vector.complex.impl.TimeMilliWriterImpl; +import org.apache.arrow.vector.complex.impl.TimeNanoWriterImpl; +import org.apache.arrow.vector.complex.impl.TimeSecWriterImpl; import org.apache.arrow.vector.complex.impl.TimeStampMicroWriterImpl; import org.apache.arrow.vector.complex.impl.TimeStampMilliWriterImpl; import org.apache.arrow.vector.complex.impl.TimeStampNanoWriterImpl; import org.apache.arrow.vector.complex.impl.TimeStampSecWriterImpl; -import org.apache.arrow.vector.complex.impl.TimeWriterImpl; import org.apache.arrow.vector.complex.impl.TinyIntWriterImpl; import org.apache.arrow.vector.complex.impl.UInt1WriterImpl; import org.apache.arrow.vector.complex.impl.UInt2WriterImpl; @@ -164,26 +172,70 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { return new BigIntWriterImpl((NullableBigIntVector) vector); } }, - DATE(new Date(DateUnit.MILLISECOND)) { + DATEDAY(new Date(DateUnit.DAY)) { @Override public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { - return new NullableDateVector(name, fieldType, allocator); + return new NullableDateDayVector(name, fieldType, allocator); } @Override public FieldWriter getNewFieldWriter(ValueVector vector) { - return new DateWriterImpl((NullableDateVector) vector); + return new 
DateDayWriterImpl((NullableDateDayVector) vector); } }, - TIME(new Time(TimeUnit.MILLISECOND, 32)) { + DATEMILLI(new Date(DateUnit.MILLISECOND)) { @Override public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { - return new NullableTimeVector(name, fieldType, allocator); + return new NullableDateMilliVector(name, fieldType, allocator); } @Override public FieldWriter getNewFieldWriter(ValueVector vector) { - return new TimeWriterImpl((NullableTimeVector) vector); + return new DateMilliWriterImpl((NullableDateMilliVector) vector); + } + }, + TIMESEC(new Time(TimeUnit.SECOND, 32)) { + @Override + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableTimeSecVector(name, fieldType, allocator); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new TimeSecWriterImpl((NullableTimeSecVector) vector); + } + }, + TIMEMILLI(new Time(TimeUnit.MILLISECOND, 32)) { + @Override + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableTimeMilliVector(name, fieldType, allocator); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new TimeMilliWriterImpl((NullableTimeMilliVector) vector); + } + }, + TIMEMICRO(new Time(TimeUnit.MICROSECOND, 64)) { + @Override + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableTimeMicroVector(name, fieldType, allocator); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new TimeMicroWriterImpl((NullableTimeMicroVector) vector); + } + }, + TIMENANO(new Time(TimeUnit.NANOSECOND, 64)) { + @Override + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + return new NullableTimeNanoVector(name, fieldType, allocator); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + return new TimeNanoWriterImpl((NullableTimeNanoVector) vector); } }, // time in second from the Unix epoch, 00:00:00.000000 on 1 January 1970, UTC. 
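// [Editor's illustration -- not part of this patch] A sketch of the new
// unit-specific mapping. The visitor in the hunk below resolves each
// ArrowType to its own MinorType; the entry-point name is assumed to be
// the Types.getMinorTypeForArrowType helper that wraps that visitor.
//
//   ArrowType.Time micros = new ArrowType.Time(TimeUnit.MICROSECOND, 64);
//   Types.getMinorTypeForArrowType(micros);          // -> MinorType.TIMEMICRO
//   Types.getMinorTypeForArrowType(
//       new ArrowType.Date(DateUnit.DAY));           // -> MinorType.DATEDAY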
@@ -479,14 +531,29 @@ public MinorType visit(FloatingPoint type) { } @Override public MinorType visit(Date type) { - return MinorType.DATE; + switch (type.getUnit()) { + case DAY: + return MinorType.DATEDAY; + case MILLISECOND: + return MinorType.DATEMILLI; + default: + throw new IllegalArgumentException("unknown unit: " + type); + } } @Override public MinorType visit(Time type) { - if (type.getUnit() != TimeUnit.MILLISECOND || type.getBitWidth() != 32) { - throw new IllegalArgumentException("Only milliseconds on 32 bits supported for now: " + type); + switch (type.getUnit()) { + case SECOND: + return MinorType.TIMESEC; + case MILLISECOND: + return MinorType.TIMEMILLI; + case MICROSECOND: + return MinorType.TIMEMICRO; + case NANOSECOND: + return MinorType.TIMENANO; + default: + throw new IllegalArgumentException("unknown unit: " + type); } - return MinorType.TIME; } @Override public MinorType visit(Timestamp type) { diff --git a/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java b/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java index 45f3b5656d861..56fa73eccebf0 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/types/pojo/TestSchema.java @@ -86,11 +86,15 @@ public void testAll() throws IOException { field("h", new Binary()), field("i", new Bool()), field("j", new Decimal(5, 5)), - field("k", new Date(DateUnit.MILLISECOND)), - field("l", new Time(TimeUnit.MILLISECOND, 32)), - field("m", new Timestamp(TimeUnit.MILLISECOND, "UTC")), - field("n", new Timestamp(TimeUnit.MICROSECOND, null)), - field("o", new Interval(IntervalUnit.DAY_TIME)) + field("k", new Date(DateUnit.DAY)), + field("l", new Date(DateUnit.MILLISECOND)), + field("m", new Time(TimeUnit.SECOND, 32)), + field("n", new Time(TimeUnit.MILLISECOND, 32)), + field("o", new Time(TimeUnit.MICROSECOND, 64)), + field("p", new Time(TimeUnit.NANOSECOND, 64)), + field("q", new Timestamp(TimeUnit.MILLISECOND, "UTC")), + field("r", new Timestamp(TimeUnit.MICROSECOND, null)), + field("s", new Interval(IntervalUnit.DAY_TIME)) )); roundTrip(schema); } @@ -104,6 +108,32 @@ public void testUnion() throws IOException { contains(schema, "Sparse"); } + @Test + public void testDate() throws IOException { + Schema schema = new Schema(asList( + field("a", new Date(DateUnit.DAY)), + field("b", new Date(DateUnit.MILLISECOND)) + )); + roundTrip(schema); + assertEquals( + "Schema", + schema.toString()); + } + + @Test + public void testTime() throws IOException { + Schema schema = new Schema(asList( + field("a", new Time(TimeUnit.SECOND, 32)), + field("b", new Time(TimeUnit.MILLISECOND, 32)), + field("c", new Time(TimeUnit.MICROSECOND, 64)), + field("d", new Time(TimeUnit.NANOSECOND, 64)) + )); + roundTrip(schema); + assertEquals( + "Schema", + schema.toString()); + } + @Test public void testTS() throws IOException { Schema schema = new Schema(asList( From ddf880b312c1b11739d09bc014d4649b8f2f26d4 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 6 Apr 2017 09:10:13 -0400 Subject: [PATCH 0470/1644] ARROW-752: [Python] Support boxed Arrow arrays as input to DictionaryArray.from_arrays Author: Wes McKinney Closes #496 from wesm/ARROW-752 and squashes the following commits: 2f57574 [Wes McKinney] Support boxed Arrow arrays as input to DictionaryArray.from_arrays --- python/pyarrow/array.pyx | 31 ++++++++---- python/pyarrow/tests/test_array.py | 81 ++++++++++++++++++++++++------ 2 files changed, 89 
insertions(+), 23 deletions(-) diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 1f59556e94fb8..9f302e02cdb04 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -497,8 +497,12 @@ cdef class DictionaryArray(Array): cdef getitem(self, int64_t i): cdef Array dictionary = self.dictionary - cdef int64_t index = self.indices[i].as_py() - return scalar.box_scalar(dictionary.type, dictionary.sp_array, index) + index = self.indices[i] + if index is NA: + return index + else: + return scalar.box_scalar(dictionary.type, dictionary.sp_array, + index.as_py()) property dictionary: @@ -544,15 +548,24 @@ cdef class DictionaryArray(Array): shared_ptr[CDataType] c_type shared_ptr[CArray] c_result - if mask is None: - mask = indices == -1 + if isinstance(indices, Array): + if mask is not None: + raise NotImplementedError( + "mask not implemented with Arrow array inputs yet") + arrow_indices = indices else: - mask = mask | (indices == -1) + if mask is None: + mask = indices == -1 + else: + mask = mask | (indices == -1) + arrow_indices = Array.from_numpy(indices, mask=mask, + memory_pool=memory_pool) - arrow_indices = Array.from_numpy(indices, mask=mask, - memory_pool=memory_pool) - arrow_dictionary = Array.from_numpy(dictionary, - memory_pool=memory_pool) + if isinstance(dictionary, Array): + arrow_dictionary = dictionary + else: + arrow_dictionary = Array.from_numpy(dictionary, + memory_pool=memory_pool) if not isinstance(arrow_indices, IntegerArray): raise ValueError('Indices must be integer type') diff --git a/python/pyarrow/tests/test_array.py b/python/pyarrow/tests/test_array.py index d8b2e2f5d80d6..57b17f6cea756 100644 --- a/python/pyarrow/tests/test_array.py +++ b/python/pyarrow/tests/test_array.py @@ -15,30 +15,33 @@ # specific language governing permissions and limitations # under the License. 
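Before the test changes, a hedged sketch (not part of the patch) of the two input styles the updated `from_arrays` accepts:

```python
# Sketch only, mirroring the array.pyx change above: indices and dictionary may
# now be passed either as NumPy arrays or as already-boxed Arrow arrays.
import numpy as np
import pyarrow as pa

indices = np.repeat([0, 1, 2], 2)
dictionary = np.array(['foo', 'bar', 'baz'], dtype=object)

d1 = pa.DictionaryArray.from_arrays(indices, dictionary)  # NumPy inputs
d2 = pa.DictionaryArray.from_arrays(pa.Array.from_numpy(indices),
                                    pa.Array.from_numpy(dictionary))  # boxed inputs
# Note: combining `mask=` with boxed Arrow indices raises NotImplementedError
# at this stage of the implementation.
```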
+import pytest import sys -import pytest +import numpy as np +import pandas as pd +import pandas.util.testing as tm -import pyarrow +import pyarrow as pa import pyarrow.formatting as fmt def test_total_bytes_allocated(): - assert pyarrow.total_allocated_bytes() == 0 + assert pa.total_allocated_bytes() == 0 def test_repr_on_pre_init_array(): - arr = pyarrow.array.Array() + arr = pa.Array() assert len(repr(arr)) > 0 def test_getitem_NA(): - arr = pyarrow.from_pylist([1, None, 2]) - assert arr[1] is pyarrow.NA + arr = pa.from_pylist([1, None, 2]) + assert arr[1] is pa.NA def test_list_format(): - arr = pyarrow.from_pylist([[1], None, [2, 3, None]]) + arr = pa.from_pylist([[1], None, [2, 3, None]]) result = fmt.array_format(arr) expected = """\ [ @@ -52,7 +55,7 @@ def test_list_format(): def test_string_format(): - arr = pyarrow.from_pylist(['', None, 'foo']) + arr = pa.from_pylist(['', None, 'foo']) result = fmt.array_format(arr) expected = """\ [ @@ -64,7 +67,7 @@ def test_string_format(): def test_long_array_format(): - arr = pyarrow.from_pylist(range(100)) + arr = pa.from_pylist(range(100)) result = fmt.array_format(arr, window=2) expected = """\ [ @@ -80,7 +83,7 @@ def test_long_array_format(): def test_to_pandas_zero_copy(): import gc - arr = pyarrow.from_pylist(range(10)) + arr = pa.from_pylist(range(10)) for i in range(10): np_arr = arr.to_pandas() @@ -90,7 +93,7 @@ def test_to_pandas_zero_copy(): assert sys.getrefcount(arr) == 2 for i in range(10): - arr = pyarrow.from_pylist(range(10)) + arr = pa.from_pylist(range(10)) np_arr = arr.to_pandas() arr = None gc.collect() @@ -105,14 +108,14 @@ def test_to_pandas_zero_copy(): def test_array_slice(): - arr = pyarrow.from_pylist(range(10)) + arr = pa.from_pylist(range(10)) sliced = arr.slice(2) - expected = pyarrow.from_pylist(range(2, 10)) + expected = pa.from_pylist(range(2, 10)) assert sliced.equals(expected) sliced2 = arr.slice(2, 4) - expected2 = pyarrow.from_pylist(range(2, 6)) + expected2 = pa.from_pylist(range(2, 6)) assert sliced2.equals(expected2) # 0 offset @@ -136,3 +139,53 @@ def test_array_slice(): with pytest.raises(IndexError): arr[::2] + + +def test_dictionary_from_numpy(): + indices = np.repeat([0, 1, 2], 2) + dictionary = np.array(['foo', 'bar', 'baz'], dtype=object) + mask = np.array([False, False, True, False, False, False]) + + d1 = pa.DictionaryArray.from_arrays(indices, dictionary) + d2 = pa.DictionaryArray.from_arrays(indices, dictionary, mask=mask) + + for i in range(len(indices)): + assert d1[i].as_py() == dictionary[indices[i]] + + if mask[i]: + assert d2[i] is pa.NA + else: + assert d2[i].as_py() == dictionary[indices[i]] + + +def test_dictionary_from_boxed_arrays(): + indices = np.repeat([0, 1, 2], 2) + dictionary = np.array(['foo', 'bar', 'baz'], dtype=object) + + iarr = pa.Array.from_numpy(indices) + darr = pa.Array.from_numpy(dictionary) + + d1 = pa.DictionaryArray.from_arrays(iarr, darr) + + for i in range(len(indices)): + assert d1[i].as_py() == dictionary[indices[i]] + + +def test_dictionary_with_pandas(): + indices = np.repeat([0, 1, 2], 2) + dictionary = np.array(['foo', 'bar', 'baz'], dtype=object) + mask = np.array([False, False, True, False, False, False]) + + d1 = pa.DictionaryArray.from_arrays(indices, dictionary) + d2 = pa.DictionaryArray.from_arrays(indices, dictionary, mask=mask) + + pandas1 = d1.to_pandas() + ex_pandas1 = pd.Categorical.from_codes(indices, categories=dictionary) + + tm.assert_series_equal(pd.Series(pandas1), pd.Series(ex_pandas1)) + + pandas2 = d2.to_pandas() + ex_pandas2 = 
pd.Categorical.from_codes(np.where(mask, -1, indices), + categories=dictionary) + + tm.assert_series_equal(pd.Series(pandas2), pd.Series(ex_pandas2)) From 621d52740b52af4042c4aaa3ac424c5916aa94da Mon Sep 17 00:00:00 2001 From: Bryan Cutler Date: Thu, 6 Apr 2017 09:33:17 -0400 Subject: [PATCH 0471/1644] ARROW-582: [Java] Added JSON reader/writer unit test for date, time, and timestamp New unit test to verify Java JSON reader/writer round-trip with date, time and timestamp types Author: Bryan Cutler Closes #495 from BryanCutler/java-json-DateTime-Test-ARROW-582 and squashes the following commits: e80683b [Bryan Cutler] added JSON read and write support unit test --- .../vector/file/json/JsonFileReader.java | 2 +- .../arrow/vector/file/BaseFileTest.java | 46 +++++++++++++++++++ .../arrow/vector/file/json/TestJSONFile.java | 38 ++++++++++++++- 3 files changed, 84 insertions(+), 2 deletions(-) diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java index 2f91205cffcbd..fde9954d288bb 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java @@ -275,7 +275,7 @@ private void setValueFromParser(ValueVector valueVector, int i) throws IOExcepti break; case TIMESTAMPNANO: ((TimeStampNanoVector)valueVector).getMutator().set(i, parser.readValueAs(Long.class)); - break; + break; default: throw new UnsupportedOperationException("minor type: " + valueVector.getMinorType()); } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java b/java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java index 774bead3207a7..5c68a1904be70 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java @@ -32,8 +32,12 @@ import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter; import org.apache.arrow.vector.complex.writer.BaseWriter.MapWriter; import org.apache.arrow.vector.complex.writer.BigIntWriter; +import org.apache.arrow.vector.complex.writer.DateMilliWriter; import org.apache.arrow.vector.complex.writer.IntWriter; +import org.apache.arrow.vector.complex.writer.TimeMilliWriter; +import org.apache.arrow.vector.complex.writer.TimeStampMilliWriter; import org.apache.arrow.vector.holders.NullableTimeStampMilliHolder; +import org.joda.time.DateTime; import org.joda.time.DateTimeZone; import org.junit.After; import org.junit.Assert; @@ -138,6 +142,48 @@ protected void validateComplexContent(int count, VectorSchemaRoot root) { } } + private DateTime makeDateTimeFromCount(int i) { + return new DateTime(2000 + i, 1 + i, 1 + i, i, i, i, i, DateTimeZone.UTC); + } + + protected void writeDateTimeData(int count, NullableMapVector parent) { + Assert.assertTrue(count < 100); + ComplexWriter writer = new ComplexWriterImpl("root", parent); + MapWriter rootWriter = writer.rootAsMap(); + DateMilliWriter dateWriter = rootWriter.dateMilli("date"); + TimeMilliWriter timeWriter = rootWriter.timeMilli("time"); + TimeStampMilliWriter timeStampMilliWriter = rootWriter.timeStampMilli("timestamp-milli"); + for (int i = 0; i < count; i++) { + DateTime dt = makeDateTimeFromCount(i); + // Number of days in milliseconds since epoch, stored as 64-bit integer, only date part is used + dateWriter.setPosition(i); + long dateLong = 
dt.minusMillis(dt.getMillisOfDay()).getMillis(); + dateWriter.writeDateMilli(dateLong); + // Time is a value in milliseconds since midnight, stored as 32-bit integer + timeWriter.setPosition(i); + timeWriter.writeTimeMilli(dt.getMillisOfDay()); + // Timestamp is milliseconds since the epoch, stored as 64-bit integer + timeStampMilliWriter.setPosition(i); + timeStampMilliWriter.writeTimeStampMilli(dt.getMillis()); + } + writer.setValueCount(count); + } + + protected void validateDateTimeContent(int count, VectorSchemaRoot root) { + Assert.assertEquals(count, root.getRowCount()); + printVectors(root.getFieldVectors()); + for (int i = 0; i < count; i++) { + Object dateVal = root.getVector("date").getAccessor().getObject(i); + DateTime dt = makeDateTimeFromCount(i); + DateTime dateExpected = dt.minusMillis(dt.getMillisOfDay()); + Assert.assertEquals(dateExpected.getMillis(), dateVal); + Object timeVal = root.getVector("time").getAccessor().getObject(i); + Assert.assertEquals(dt.getMillisOfDay(), timeVal); + Object timestampMilliVal = root.getVector("timestamp-milli").getAccessor().getObject(i); + Assert.assertTrue(dt.withZoneRetainFields(DateTimeZone.getDefault()).equals(timestampMilliVal)); + } + } + protected void writeData(int count, MapVector parent) { ComplexWriter writer = new ComplexWriterImpl("root", parent); MapWriter rootWriter = writer.rootAsMap(); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/json/TestJSONFile.java b/java/vector/src/test/java/org/apache/arrow/vector/file/json/TestJSONFile.java index c88958cbf2c9c..6369c07c3205c 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/json/TestJSONFile.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/json/TestJSONFile.java @@ -103,7 +103,7 @@ public void testWriteReadUnionJSON() throws IOException { writeJSON(file, root); } - // read + // read try ( BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); BufferAllocator vectorAllocator = allocator.newChildAllocator("final vectors", 0, Integer.MAX_VALUE); @@ -119,6 +119,42 @@ public void testWriteReadUnionJSON() throws IOException { } } + @Test + public void testWriteReadDateTimeJSON() throws IOException { + File file = new File("target/mytest_datetime.json"); + int count = COUNT; + + // write + try ( + BufferAllocator vectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + NullableMapVector parent = new NullableMapVector("parent", vectorAllocator, null, null)) { + + writeDateTimeData(count, parent); + + printVectors(parent.getChildrenFromFields()); + + VectorSchemaRoot root = new VectorSchemaRoot(parent.getChild("root")); + validateDateTimeContent(count, root); + + writeJSON(file, new VectorSchemaRoot(parent.getChild("root"))); + } + + // read + try ( + BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + ) { + JsonFileReader reader = new JsonFileReader(file, readerAllocator); + Schema schema = reader.start(); + LOGGER.debug("reading schema: " + schema); + + // initialize vectors + try (VectorSchemaRoot root = reader.read();) { + validateDateTimeContent(count, root); + } + reader.close(); + } + } + @Test public void testSetStructLength() throws IOException { File file = new File("../../integration/data/struct_example.json"); From 49b3e0e2a08f633238875d9663ad5745fbb52db1 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Thu, 6 Apr 2017 10:13:58 -0400 Subject: [PATCH 0472/1644] ARROW-774: [GLib] Remove needless 
LICENSE.txt copy The "make dist" result is included in the source archive. LICENSE.txt is also included in the source archive. Author: Kouhei Sutou Closes #497 from kou/glib-remove-needless-license-copy and squashes the following commits: a2fd0b9 [Kouhei Sutou] [GLib] Remove needless LICENSE.txt copy --- c_glib/Makefile.am | 4 +--- c_glib/autogen.sh | 2 -- 2 files changed, 1 insertion(+), 5 deletions(-) diff --git a/c_glib/Makefile.am b/c_glib/Makefile.am index 40e8395a56824..bb52ce503e04e 100644 --- a/c_glib/Makefile.am +++ b/c_glib/Makefile.am @@ -24,10 +24,8 @@ SUBDIRS = \ EXTRA_DIST = \ README.md \ - LICENSE.txt \ version arrow_glib_docdir = ${datarootdir}/doc/arrow-glib arrow_glib_doc_DATA = \ - README.md \ - LICENSE.txt + README.md diff --git a/c_glib/autogen.sh b/c_glib/autogen.sh index 6e2036da6406b..08e33e6ca07c0 100755 --- a/c_glib/autogen.sh +++ b/c_glib/autogen.sh @@ -25,8 +25,6 @@ ruby \ ../java/pom.xml > \ version -cp ../LICENSE.txt ./ - mkdir -p m4 gtkdocize --copy --docdir doc/reference From ff744ef13c6dff42abf4a0a3ca697634f84b9bf8 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Thu, 6 Apr 2017 09:07:35 -0700 Subject: [PATCH 0473/1644] ARROW-775: add simple constructors to value vectors Author: Julien Le Dem Closes #498 from julienledem/ARROW-775 and squashes the following commits: badf8d1 [Julien Le Dem] ARROW-775: add simple constructors to value vectors --- .../src/main/codegen/templates/NullableValueVectors.java | 8 ++++++++ .../java/org/apache/arrow/vector/complex/ListVector.java | 4 ++++ .../apache/arrow/vector/complex/NullableMapVector.java | 4 ++++ .../java/org/apache/arrow/vector/util/DateUtility.java | 8 ++++---- 4 files changed, 20 insertions(+), 4 deletions(-) diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index 8e1727ca6c820..a50771a45a034 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -64,6 +64,14 @@ public final class ${className} extends BaseDataValueVector implements <#if type <#if minor.class == "Decimal"> private final int precision; private final int scale; + + public ${className}(String name, BufferAllocator allocator, int precision, int scale) { + this(name, new FieldType(true, new Decimal(precision, scale), null), allocator); + } + <#else> + public ${className}(String name, BufferAllocator allocator) { + this(name, new FieldType(true, org.apache.arrow.vector.types.Types.MinorType.${minor.class?upper_case}.getType(), null), allocator); + } </#if> public ${className}(String name, FieldType fieldType, BufferAllocator allocator) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java index d138ca339e3cf..0461a8d9d285a 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java @@ -64,6 +64,10 @@ public class ListVector extends BaseRepeatedValueVector implements FieldVector { private CallBack callBack; private final DictionaryEncoding dictionary; + public ListVector(String name, BufferAllocator allocator, CallBack callBack) { + this(name, allocator, null, callBack); + } + public ListVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack) { super(name, allocator, callBack); this.bits = new BitVector("$bits$", allocator); 
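A hedged sketch (not part of the patch) of the convenience constructors this commit adds; `allocator` is an assumed live BufferAllocator, and the Nullable* classes are generated from the template above:

```java
// Sketch only: the new constructors default the FieldType/dictionary arguments.
NullableIntVector ints = new NullableIntVector("ints", allocator);
NullableDecimalVector dec = new NullableDecimalVector("dec", allocator, 10, 2); // precision, scale
ListVector list = new ListVector("list", allocator, null);       // null CallBack assumed acceptable
NullableMapVector map = new NullableMapVector("map", allocator, null);
```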
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java index 7fe35e8253afb..71fee67d49c9f 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java @@ -54,6 +54,10 @@ public class NullableMapVector extends MapVector implements FieldVector { private final Accessor accessor; private final Mutator mutator; + public NullableMapVector(String name, BufferAllocator allocator, CallBack callBack) { + this(name, allocator, null, callBack); + } + public NullableMapVector(String name, BufferAllocator allocator, DictionaryEncoding dictionary, CallBack callBack) { super(name, checkNotNull(allocator), callBack); this.bits = new BitVector("$bits$", allocator); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/DateUtility.java b/java/vector/src/main/java/org/apache/arrow/vector/util/DateUtility.java index f4fc1736032c0..1f8ce069cf9cf 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/util/DateUtility.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/DateUtility.java @@ -618,10 +618,10 @@ public class DateUtility { } } - public static final DateTimeFormatter formatDate = DateTimeFormat.forPattern("yyyy-MM-dd"); - public static final DateTimeFormatter formatTimeStamp = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss.SSS"); - public static final DateTimeFormatter formatTimeStampTZ = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss.SSS ZZZ"); - public static final DateTimeFormatter formatTime = DateTimeFormat.forPattern("HH:mm:ss.SSS"); + public static final DateTimeFormatter formatDate = DateTimeFormat.forPattern("yyyy-MM-dd"); + public static final DateTimeFormatter formatTimeStampMilli = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss.SSS"); + public static final DateTimeFormatter formatTimeStampTZ = DateTimeFormat.forPattern("yyyy-MM-dd HH:mm:ss.SSS ZZZ"); + public static final DateTimeFormatter formatTime = DateTimeFormat.forPattern("HH:mm:ss.SSS"); public static DateTimeFormatter dateTimeTZFormat = null; public static DateTimeFormatter timeFormat = null; From 56f1e91d2961a13b7f677785fa705bed06d9639d Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 6 Apr 2017 13:49:32 -0400 Subject: [PATCH 0474/1644] ARROW-771: [Python] Add read_row_group / num_row_groups to ParquetFile requires PARQUET-946 https://github.com/apache/parquet-cpp/pull/291 cc @cpcloud @jreback @mrocklin Author: Wes McKinney Closes #494 from wesm/ARROW-771 and squashes the following commits: 126789a [Wes McKinney] Fix docstring 1009423 [Wes McKinney] Add read_row_group / num_row_groups to ParquetFile --- python/pyarrow/_parquet.pxd | 17 ++++++--- python/pyarrow/_parquet.pyx | 37 ++++++++++++++----- python/pyarrow/parquet.py | 53 ++++++++++++++++++++-------- python/pyarrow/tests/test_parquet.py | 29 +++++++++++++++ 4 files changed, 109 insertions(+), 27 deletions(-) diff --git a/python/pyarrow/_parquet.pxd b/python/pyarrow/_parquet.pxd index f12c86fdebc83..1ac1f69b033ce 100644 --- a/python/pyarrow/_parquet.pxd +++ b/python/pyarrow/_parquet.pxd @@ -179,7 +179,7 @@ cdef extern from "parquet/api/reader.h" namespace "parquet" nogil: @staticmethod unique_ptr[ParquetFileReader] OpenFile(const c_string& path) - shared_ptr[CFileMetaData] metadata(); + shared_ptr[CFileMetaData] metadata() cdef extern from "parquet/api/writer.h" namespace "parquet" nogil: @@ -211,11 
+211,18 @@ cdef extern from "parquet/arrow/reader.h" namespace "parquet::arrow" nogil: cdef cppclass FileReader: FileReader(CMemoryPool* pool, unique_ptr[ParquetFileReader] reader) - CStatus ReadColumn(int i, shared_ptr[CArray]* out); - CStatus ReadTable(shared_ptr[CTable]* out); + CStatus ReadColumn(int i, shared_ptr[CArray]* out) + + int num_row_groups() + CStatus ReadRowGroup(int i, shared_ptr[CTable]* out) + CStatus ReadRowGroup(int i, const vector[int]& column_indices, + shared_ptr[CTable]* out) + + CStatus ReadTable(shared_ptr[CTable]* out) CStatus ReadTable(const vector[int]& column_indices, - shared_ptr[CTable]* out); - const ParquetFileReader* parquet_reader(); + shared_ptr[CTable]* out) + + const ParquetFileReader* parquet_reader() void set_num_threads(int num_threads) diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx index cfd2816e2a16e..079bf5ee5924a 100644 --- a/python/pyarrow/_parquet.pyx +++ b/python/pyarrow/_parquet.pyx @@ -31,7 +31,7 @@ from pyarrow.error import ArrowException from pyarrow.error cimport check_status from pyarrow.io import NativeFile from pyarrow.memory cimport MemoryPool, maybe_unbox_memory_pool -from pyarrow.table cimport Table +from pyarrow.table cimport Table, table_from_ctable from pyarrow.io cimport NativeFile, get_reader, get_writer @@ -381,16 +381,39 @@ cdef class ParquetReader: result.init(metadata) return result - def read(self, column_indices=None, nthreads=1): + property num_row_groups: + + def __get__(self): + return self.reader.get().num_row_groups() + + def set_num_threads(self, int nthreads): + self.reader.get().set_num_threads(nthreads) + + def read_row_group(self, int i, column_indices=None): cdef: - Table table = Table() shared_ptr[CTable] ctable vector[int] c_column_indices - self.reader.get().set_num_threads(nthreads) + if column_indices is not None: + for index in column_indices: + c_column_indices.push_back(index) + + with nogil: + check_status(self.reader.get() + .ReadRowGroup(i, c_column_indices, &ctable)) + else: + # Read all columns + with nogil: + check_status(self.reader.get() + .ReadRowGroup(i, &ctable)) + return table_from_ctable(ctable) + + def read_all(self, column_indices=None): + cdef: + shared_ptr[CTable] ctable + vector[int] c_column_indices if column_indices is not None: - # Read only desired column indices for index in column_indices: c_column_indices.push_back(index) @@ -402,9 +425,7 @@ cdef class ParquetReader: with nogil: check_status(self.reader.get() .ReadTable(&ctable)) - - table.init(ctable) - return table + return table_from_ctable(ctable) def column_name_idx(self, column_name): """ diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py index 2985316f35f01..d95c3b3aecaf8 100644 --- a/python/pyarrow/parquet.py +++ b/python/pyarrow/parquet.py @@ -50,7 +50,32 @@ def metadata(self): def schema(self): return self.metadata.schema - def read(self, nrows=None, columns=None, nthreads=1): + @property + def num_row_groups(self): + return self.reader.num_row_groups + + def read_row_group(self, i, columns=None, nthreads=1): + """ + Read a single row group from a Parquet file + + Parameters + ---------- + columns: list + If not None, only these columns will be read from the row group. + nthreads : int, default 1 + Number of columns to read in parallel. 
If > 1, requires that the + underlying file source is threadsafe + + Returns + ------- + pyarrow.table.Table + Content of the row group as a table (of columns) + """ + column_indices = self._get_column_indices(columns) + self.reader.set_num_threads(nthreads) + return self.reader.read_row_group(i, column_indices=column_indices) + + def read(self, columns=None, nthreads=1): """ Read a Table from Parquet format @@ -67,17 +92,16 @@ def read(self, nrows=None, columns=None, nthreads=1): pyarrow.table.Table Content of the file as a table (of columns) """ - if nrows is not None: - raise NotImplementedError("nrows argument") + column_indices = self._get_column_indices(columns) + self.reader.set_num_threads(nthreads) + return self.reader.read_all(column_indices=column_indices) - if columns is None: - column_indices = None + def _get_column_indices(self, column_names): + if column_names is None: + return None else: - column_indices = [self.reader.column_name_idx(column) - for column in columns] - - return self.reader.read(column_indices=column_indices, - nthreads=nthreads) + return [self.reader.column_name_idx(column) + for column in column_names] def read_table(source, columns=None, nthreads=1, metadata=None): @@ -178,8 +202,8 @@ def open_file(path, meta=None): return all_data -def write_table(table, sink, chunk_size=None, version='1.0', - use_dictionary=True, compression='snappy'): +def write_table(table, sink, row_group_size=None, version='1.0', + use_dictionary=True, compression='snappy', **kwargs): """ Write a Table to Parquet format @@ -187,7 +211,7 @@ def write_table(table, sink, chunk_size=None, version='1.0', ---------- table : pyarrow.Table sink: string or pyarrow.io.NativeFile - chunk_size : int, default None + row_group_size : int, default None The maximum number of rows in each Parquet RowGroup. As a default, we will write a single RowGroup per file. version : {"1.0", "2.0"}, default "1.0" @@ -198,7 +222,8 @@ def write_table(table, sink, chunk_size=None, version='1.0', compression : str or dict Specify the compression codec, either on a general basis or per-column. 
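A hedged usage sketch (not part of the patch) for the row-group APIs added in this commit; it mirrors `test_read_single_row_group` further below:

```python
# Sketch only: write a table in several row groups, then read them back one
# at a time and reassemble. `pa`/`pq` import names are those used in the tests.
import io
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.Table.from_pandas(pd.DataFrame({'x': list(range(10000))}))
buf = io.BytesIO()
pq.write_table(table, buf, row_group_size=2500)   # four row groups of 2500 rows
buf.seek(0)

pf = pq.ParquetFile(buf)
assert pf.num_row_groups == 4
pieces = [pf.read_row_group(i) for i in range(pf.num_row_groups)]
result = pa.concat_tables(pieces)
```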
""" + row_group_size = kwargs.get('chunk_size', row_group_size) writer = ParquetWriter(sink, use_dictionary=use_dictionary, compression=compression, version=version) - writer.write_table(table, row_group_size=chunk_size) + writer.write_table(table, row_group_size=row_group_size) diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index b8b2800259caf..86165be7052c6 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -402,6 +402,35 @@ def test_pass_separate_metadata(): pdt.assert_frame_equal(df, fileh.read().to_pandas()) +@parquet +def test_read_single_row_group(): + # ARROW-471 + N, K = 10000, 4 + df = alltypes_sample(size=N) + + a_table = pa.Table.from_pandas(df, timestamps_to_ms=True) + + buf = io.BytesIO() + pq.write_table(a_table, buf, row_group_size=N / K, + compression='snappy', version='2.0') + + buf.seek(0) + + pf = pq.ParquetFile(buf) + + assert pf.num_row_groups == K + + row_groups = [pf.read_row_group(i) for i in range(K)] + result = pa.concat_tables(row_groups) + pdt.assert_frame_equal(df, result.to_pandas()) + + cols = df.columns[:2] + row_groups = [pf.read_row_group(i, columns=cols) + for i in range(K)] + result = pa.concat_tables(row_groups) + pdt.assert_frame_equal(df[cols], result.to_pandas()) + + @parquet def test_read_multiple_files(tmpdir): nfiles = 10 From 58fa4c2fcc75f763a89b44eeedafade771d342e8 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Thu, 6 Apr 2017 20:36:29 +0200 Subject: [PATCH 0475/1644] ARROW-776: [GLib] Fix wrong type name Author: Kouhei Sutou Closes #499 from kou/glib-fix-wrong-type-name and squashes the following commits: 105f2f2 [Kouhei Sutou] [GLib] Fix wrong type name --- c_glib/arrow-glib/io-file.h | 2 +- c_glib/arrow-glib/io-input-stream.h | 2 +- c_glib/arrow-glib/io-output-stream.h | 2 +- c_glib/arrow-glib/io-random-access-file.h | 2 +- c_glib/arrow-glib/io-readable.h | 2 +- c_glib/arrow-glib/io-writeable.h | 2 +- 6 files changed, 6 insertions(+), 6 deletions(-) diff --git a/c_glib/arrow-glib/io-file.h b/c_glib/arrow-glib/io-file.h index 9fa0ec137566f..7181f6d37aeb3 100644 --- a/c_glib/arrow-glib/io-file.h +++ b/c_glib/arrow-glib/io-file.h @@ -28,7 +28,7 @@ G_BEGIN_DECLS #define GARROW_IO_FILE(obj) \ (G_TYPE_CHECK_INSTANCE_CAST((obj), \ GARROW_IO_TYPE_FILE, \ - GArrowIOFileInterface)) + GArrowIOFile)) #define GARROW_IO_IS_FILE(obj) \ (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ GARROW_IO_TYPE_FILE)) diff --git a/c_glib/arrow-glib/io-input-stream.h b/c_glib/arrow-glib/io-input-stream.h index a7f06819b4f97..57902095010c8 100644 --- a/c_glib/arrow-glib/io-input-stream.h +++ b/c_glib/arrow-glib/io-input-stream.h @@ -28,7 +28,7 @@ G_BEGIN_DECLS #define GARROW_IO_INPUT_STREAM(obj) \ (G_TYPE_CHECK_INSTANCE_CAST((obj), \ GARROW_IO_TYPE_INPUT_STREAM, \ - GArrowIOInputStreamInterface)) + GArrowIOInputStream)) #define GARROW_IO_IS_INPUT_STREAM(obj) \ (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ GARROW_IO_TYPE_INPUT_STREAM)) diff --git a/c_glib/arrow-glib/io-output-stream.h b/c_glib/arrow-glib/io-output-stream.h index c4079d50233cd..02478ce9621eb 100644 --- a/c_glib/arrow-glib/io-output-stream.h +++ b/c_glib/arrow-glib/io-output-stream.h @@ -28,7 +28,7 @@ G_BEGIN_DECLS #define GARROW_IO_OUTPUT_STREAM(obj) \ (G_TYPE_CHECK_INSTANCE_CAST((obj), \ GARROW_IO_TYPE_OUTPUT_STREAM, \ - GArrowIOOutputStreamInterface)) + GArrowIOOutputStream)) #define GARROW_IO_IS_OUTPUT_STREAM(obj) \ (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ GARROW_IO_TYPE_OUTPUT_STREAM)) diff --git 
a/c_glib/arrow-glib/io-random-access-file.h b/c_glib/arrow-glib/io-random-access-file.h index e980ab2e3c8e5..8ac63e417a3f2 100644 --- a/c_glib/arrow-glib/io-random-access-file.h +++ b/c_glib/arrow-glib/io-random-access-file.h @@ -28,7 +28,7 @@ G_BEGIN_DECLS #define GARROW_IO_RANDOM_ACCESS_FILE(obj) \ (G_TYPE_CHECK_INSTANCE_CAST((obj), \ GARROW_IO_TYPE_RANDOM_ACCESS_FILE, \ - GArrowIORandomAccessFileInterface)) + GArrowIORandomAccessFile)) #define GARROW_IO_IS_RANDOM_ACCESS_FILE(obj) \ (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ GARROW_IO_TYPE_RANDOM_ACCESS_FILE)) diff --git a/c_glib/arrow-glib/io-readable.h b/c_glib/arrow-glib/io-readable.h index d24b46c50df4c..279984b3014a3 100644 --- a/c_glib/arrow-glib/io-readable.h +++ b/c_glib/arrow-glib/io-readable.h @@ -28,7 +28,7 @@ G_BEGIN_DECLS #define GARROW_IO_READABLE(obj) \ (G_TYPE_CHECK_INSTANCE_CAST((obj), \ GARROW_IO_TYPE_READABLE, \ - GArrowIOReadableInterface)) + GArrowIOReadable)) #define GARROW_IO_IS_READABLE(obj) \ (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ GARROW_IO_TYPE_READABLE)) diff --git a/c_glib/arrow-glib/io-writeable.h b/c_glib/arrow-glib/io-writeable.h index f5c5e9129f8be..ce23247497706 100644 --- a/c_glib/arrow-glib/io-writeable.h +++ b/c_glib/arrow-glib/io-writeable.h @@ -28,7 +28,7 @@ G_BEGIN_DECLS #define GARROW_IO_WRITEABLE(obj) \ (G_TYPE_CHECK_INSTANCE_CAST((obj), \ GARROW_IO_TYPE_WRITEABLE, \ - GArrowIOWriteableInterface)) + GArrowIOWriteable)) #define GARROW_IO_IS_WRITEABLE(obj) \ (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ GARROW_IO_TYPE_WRITEABLE)) From e371ebd7e16e5e5f4b14f0f578049d9246714e77 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 6 Apr 2017 15:19:59 -0400 Subject: [PATCH 0476/1644] ARROW-756: [C++] MSVC build fixes and cleanup, remove -fPIC flag from EP builds on Windows, Dev docs Includes existing patch for ARROW-757 and closes #492 With this patch I'm able to build with ``` cmake -G "NMake Makefiles" -DCMAKE_BUILD_TYPE=Release .. nmake ``` Author: Wes McKinney Author: Max Risuhin Closes #500 from wesm/ARROW-757-2 and squashes the following commits: 106e454 [Wes McKinney] Notes about MSVC solution file f34adf2 [Wes McKinney] Windows developer guide 43e5f3f [Wes McKinney] More MSVC cleaning / fixes. 
Remove -fPIC flags from builds ec7805e [Max Risuhin] ARROW-757: [C++] Resolve nmake build issues on Windows --- cpp/CMakeLists.txt | 40 +++++++------ cpp/README.md | 5 +- cpp/cmake_modules/SetupCxxFlags.cmake | 25 ++++++-- cpp/doc/Windows.md | 83 +++++++++++++++++++++++++++ 4 files changed, 129 insertions(+), 24 deletions(-) create mode 100644 cpp/doc/Windows.md diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index d26c847807d79..9947a34e4e7bb 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -146,9 +146,11 @@ include(BuildUtils) include(SetupCxxFlags) # Add common flags -set(CMAKE_CXX_FLAGS "${CXX_COMMON_FLAGS} ${CMAKE_CXX_FLAGS}") +set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${CXX_COMMON_FLAGS}") set(EP_CXX_FLAGS "${CMAKE_CXX_FLAGS}") -set(CMAKE_CXX_FLAGS "${ARROW_CXXFLAGS} ${CMAKE_CXX_FLAGS}") +set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${ARROW_CXXFLAGS}") + +message(STATUS "CMAKE_CXX_FLAGS: ${CMAKE_CXX_FLAGS}") # Determine compiler version include(CompilerInfo) @@ -446,7 +448,7 @@ if(ARROW_BUILD_TESTS) if("$ENV{GTEST_HOME}" STREQUAL "") if(APPLE) set(GTEST_CMAKE_CXX_FLAGS "-fPIC -DGTEST_USE_OWN_TR1_TUPLE=1 -Wno-unused-value -Wno-ignored-attributes") - else() + elseif(NOT MSVC) set(GTEST_CMAKE_CXX_FLAGS "-fPIC") endif() string(TOUPPER ${CMAKE_BUILD_TYPE} UPPERCASE_BUILD_TYPE) @@ -456,12 +458,15 @@ if(ARROW_BUILD_TESTS) set(GTEST_INCLUDE_DIR "${GTEST_PREFIX}/include") set(GTEST_STATIC_LIB "${GTEST_PREFIX}/${CMAKE_CFG_INTDIR}/${CMAKE_STATIC_LIBRARY_PREFIX}gtest${CMAKE_STATIC_LIBRARY_SUFFIX}") set(GTEST_VENDORED 1) + set(GTEST_CMAKE_ARGS -DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} + -Dgtest_force_shared_crt=ON + -DCMAKE_CXX_FLAGS=${GTEST_CMAKE_CXX_FLAGS}) if (CMAKE_VERSION VERSION_GREATER "3.2") # BUILD_BYPRODUCTS is a 3.2+ feature ExternalProject_Add(googletest_ep URL "https://github.com/google/googletest/archive/release-${GTEST_VERSION}.tar.gz" - CMAKE_ARGS -DCMAKE_CXX_FLAGS=${GTEST_CMAKE_CXX_FLAGS} -Dgtest_force_shared_crt=ON + CMAKE_ARGS ${GTEST_CMAKE_ARGS} # googletest doesn't define install rules, so just build in the # source dir and don't try to install. See its README for # details. @@ -471,7 +476,7 @@ if(ARROW_BUILD_TESTS) else() ExternalProject_Add(googletest_ep URL "https://github.com/google/googletest/archive/release-${GTEST_VERSION}.tar.gz" - CMAKE_ARGS -DCMAKE_CXX_FLAGS=${GTEST_CMAKE_CXX_FLAGS} -Dgtest_force_shared_crt=ON + CMAKE_ARGS ${GTEST_CMAKE_ARGS} # googletest doesn't define install rules, so just build in the # source dir and don't try to install. See its README for # details. 
@@ -556,9 +561,9 @@ if(ARROW_BUILD_BENCHMARKS) if("$ENV{GBENCHMARK_HOME}" STREQUAL "") if(APPLE) - set(GBENCHMARK_CMAKE_CXX_FLAGS "-std=c++11 -stdlib=libc++") - else() - set(GBENCHMARK_CMAKE_CXX_FLAGS "--std=c++11") + set(GBENCHMARK_CMAKE_CXX_FLAGS "-fPIC -std=c++11 -stdlib=libc++") + elseif(NOT MSVC) + set(GBENCHMARK_CMAKE_CXX_FLAGS "-fPIC --std=c++11") endif() set(GBENCHMARK_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/gbenchmark_ep/src/gbenchmark_ep-install") @@ -569,7 +574,7 @@ if(ARROW_BUILD_BENCHMARKS) "-DCMAKE_BUILD_TYPE=Release" "-DCMAKE_INSTALL_PREFIX:PATH=${GBENCHMARK_PREFIX}" "-DBENCHMARK_ENABLE_TESTING=OFF" - "-DCMAKE_CXX_FLAGS=-fPIC ${GBENCHMARK_CMAKE_CXX_FLAGS}") + "-DCMAKE_CXX_FLAGS=${GBENCHMARK_CMAKE_CXX_FLAGS}") if (APPLE) set(GBENCHMARK_CMAKE_ARGS ${GBENCHMARK_CMAKE_ARGS} "-DBENCHMARK_USE_LIBCXX=ON") endif() @@ -621,6 +626,13 @@ endif() message(STATUS "RapidJSON include dir: ${RAPIDJSON_INCLUDE_DIR}") include_directories(SYSTEM ${RAPIDJSON_INCLUDE_DIR}) +#---------------------------------------------------------------------- + +if (MSVC) + # jemalloc is not supported on Windows + set(ARROW_JEMALLOC off) +endif() + if (ARROW_JEMALLOC) find_package(jemalloc) @@ -840,12 +852,10 @@ if(ARROW_IPC) ExternalProject_Add(flatbuffers_ep URL "https://github.com/google/flatbuffers/archive/v${FLATBUFFERS_VERSION}.tar.gz" CMAKE_ARGS - "-DCMAKE_CXX_FLAGS=-fPIC" "-DCMAKE_INSTALL_PREFIX:PATH=${FLATBUFFERS_PREFIX}" "-DFLATBUFFERS_BUILD_TESTS=OFF") set(FLATBUFFERS_INCLUDE_DIR "${FLATBUFFERS_PREFIX}/include") - set(FLATBUFFERS_STATIC_LIB "${FLATBUFFERS_PREFIX}/libflatbuffers.a") set(FLATBUFFERS_COMPILER "${FLATBUFFERS_PREFIX}/bin/flatc") set(FLATBUFFERS_VENDORED 1) else() @@ -854,16 +864,8 @@ if(ARROW_IPC) endif() message(STATUS "Flatbuffers include dir: ${FLATBUFFERS_INCLUDE_DIR}") - message(STATUS "Flatbuffers static library: ${FLATBUFFERS_STATIC_LIB}") message(STATUS "Flatbuffers compiler: ${FLATBUFFERS_COMPILER}") include_directories(SYSTEM ${FLATBUFFERS_INCLUDE_DIR}) - ADD_THIRDPARTY_LIB(flatbuffers - STATIC_LIB ${FLATBUFFERS_STATIC_LIB}) - - if(FLATBUFFERS_VENDORED) - add_dependencies(flatbuffers flatbuffers_ep) - endif() - add_subdirectory(src/arrow/ipc) endif() diff --git a/cpp/README.md b/cpp/README.md index b6f0fa0e3531b..b19fa001198a4 100644 --- a/cpp/README.md +++ b/cpp/README.md @@ -40,6 +40,8 @@ On OS X, you can use [Homebrew][1]: brew install boost cmake ``` +If you are developing on Windows, see the [Windows developer guide][2]. + ## Building Arrow Simple debug build: @@ -123,4 +125,5 @@ both of these options would be used rarely. Current known uses-cases whent hey * Parameterized tests in google test. -[1]: https://brew.sh/ \ No newline at end of file +[1]: https://brew.sh/ +[2]: https://github.com/apache/arrow/blob/master/cpp/doc/Windows.md \ No newline at end of file diff --git a/cpp/cmake_modules/SetupCxxFlags.cmake b/cpp/cmake_modules/SetupCxxFlags.cmake index ee672bd5f6a96..09a662ec6e583 100644 --- a/cpp/cmake_modules/SetupCxxFlags.cmake +++ b/cpp/cmake_modules/SetupCxxFlags.cmake @@ -24,15 +24,32 @@ CHECK_CXX_COMPILER_FLAG("-msse3" CXX_SUPPORTS_SSE3) CHECK_CXX_COMPILER_FLAG("-maltivec" CXX_SUPPORTS_ALTIVEC) # compiler flags that are common across debug/release builds -# - Wall: Enable all warnings. 
-set(CXX_COMMON_FLAGS "-std=c++11 -Wall") + +if (MSVC) + if ("${CMAKE_CXX_COMPILER_ID}" STREQUAL "Clang") + # clang-cl + set(CXX_COMMON_FLAGS "-EHsc") + elseif(${CMAKE_CXX_COMPILER_VERSION} VERSION_LESS 19) + message(FATAL_ERROR "Only MSVC 2015 (Version 19.0) and later are supported + by Arrow. Found version ${CMAKE_CXX_COMPILER_VERSION}.") + else() + # Fix annoying D9025 warning + string(REPLACE "/W3" "" CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}") + + # Set desired warning level (e.g. set /W4 for more warnings) + set(CXX_COMMON_FLAGS "/W3") + endif() +else() + set(CXX_COMMON_FLAGS "-Wall -std=c++11") +endif() # Only enable additional instruction sets if they are supported if (CXX_SUPPORTS_SSE3 AND ARROW_SSE3) - set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -msse3") + set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -msse3") endif() + if (CXX_SUPPORTS_ALTIVEC AND ARROW_ALTIVEC) - set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -maltivec") + set(CXX_COMMON_FLAGS "${CXX_COMMON_FLAGS} -maltivec") endif() if (APPLE) diff --git a/cpp/doc/Windows.md b/cpp/doc/Windows.md new file mode 100644 index 0000000000000..64f6a1b98f62e --- /dev/null +++ b/cpp/doc/Windows.md @@ -0,0 +1,83 @@ + + +# Developing Arrow C++ on Windows + +## System setup, conda, and conda-forge + +Since some of the Arrow developers work in the Python ecosystem, we are +investing time in maintaining the thirdparty build dependencies for Arrow and +related C++ libraries using the conda package manager. Others are free to add +other development instructions for Windows here. + +### Visual Studio + +Microsoft provides the free Visual Studio 2017 Community edition. When doing +development, you must launch the developer command prompt using + +```"C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\Common7\Tools\VsDevCmd.bat" -arch=amd64``` + +It's easiest to configure a console emulator like [cmder][3] to automatically +launch this when starting a new development console. + +### conda and package toolchain + +[Miniconda][1] is a minimal Python distribution including the conda package +manager. To get started, download and install a 64-bit distribution. + +We recommend using packages from [conda-forge][2] + +```shell +conda config --add channels conda-forge +``` + +Now, you can bootstrap a build environment + +```shell +conda create -n arrow-dev cmake git boost +``` + +## Building with NMake + +Activate your conda build environment: + +``` +activate arrow-dev +``` + +Now, do an out of source build using `nmake`: + +``` +cd cpp +mkdir build +cd build +cmake -G "NMake Makefiles" -DCMAKE_BUILD_TYPE=Release .. +nmake +``` + +When using conda, only release builds are currently supported. + +## Build using Visual Studio (MSVC) Solution Files + +To build on the command line by instead generating a MSVC solution, instead +run: + +``` +cmake -G "Visual Studio 14 2015 Win64" -DCMAKE_BUILD_TYPE=Release .. +cmake --build . 
--config Release +``` + +[1]: https://conda.io/miniconda.html +[2]: https://conda-forge.github.io/ +[3]: http://cmder.net/ \ No newline at end of file From e53357cd610f1bdca0cbbac001e417f329d54be1 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 6 Apr 2017 18:49:06 -0400 Subject: [PATCH 0477/1644] ARROW-778: Port merge tool to work on Windows Author: Wes McKinney Closes #501 from wesm/ARROW-778 and squashes the following commits: a554320 [Wes McKinney] Use os.path.sep for splitting paths --- dev/merge_arrow_pr.py | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/dev/merge_arrow_pr.py b/dev/merge_arrow_pr.py index 39db254a9f25d..99ccc43394f27 100755 --- a/dev/merge_arrow_pr.py +++ b/dev/merge_arrow_pr.py @@ -42,8 +42,9 @@ JIRA_IMPORTED = False # Location of your Arrow git clone -ARROW_HOME = os.path.abspath(__file__).rsplit("/", 2)[0] -PROJECT_NAME = ARROW_HOME.rsplit("/", 1)[1] +SEP = os.path.sep +ARROW_HOME = os.path.abspath(__file__).rsplit(SEP, 2)[0] +PROJECT_NAME = ARROW_HOME.rsplit(SEP, 1)[1] print("ARROW_HOME = " + ARROW_HOME) print("PROJECT_NAME = " + PROJECT_NAME) From 1c6609746aeb9584fc83284f2587fa97bdbac47a Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 7 Apr 2017 10:08:08 -0400 Subject: [PATCH 0478/1644] ARROW-758: [C++] Build with /WX in Appveyor, fix MSVC compiler warnings This will help keep the build clean. cc @MaxRis Author: Wes McKinney Closes #502 from wesm/ARROW-758 and squashes the following commits: 054c185 [Wes McKinney] Build with /WX in Appveyor, fix MSVC compiler warnings --- appveyor.yml | 6 +----- cpp/cmake_modules/SetupCxxFlags.cmake | 4 ++++ cpp/src/arrow/array-test.cc | 14 +++++++------- cpp/src/arrow/builder.h | 4 ++++ cpp/src/arrow/io/file.cc | 6 +++--- cpp/src/arrow/io/io-hdfs-test.cc | 2 +- cpp/src/arrow/ipc/json-internal.cc | 2 +- cpp/src/arrow/ipc/test-common.h | 2 +- cpp/src/arrow/tensor.cc | 2 +- 9 files changed, 23 insertions(+), 19 deletions(-) diff --git a/appveyor.yml b/appveyor.yml index 9f3594907d17e..b8c26e6e5084c 100644 --- a/appveyor.yml +++ b/appveyor.yml @@ -30,12 +30,8 @@ build_script: - cd cpp - mkdir build - cd build - # A lot of features are still deactivated as they do not build on Windows - # * gbenchmark doesn't build with MSVC - - cmake -G "%GENERATOR%" -DARROW_BOOST_USE_SHARED=OFF -DARROW_BUILD_BENCHMARKS=OFF -DARROW_JEMALLOC=OFF -DCMAKE_BUILD_TYPE=Release .. + - cmake -G "%GENERATOR%" -DARROW_CXXFLAGS="/WX" -DARROW_BOOST_USE_SHARED=OFF -DCMAKE_BUILD_TYPE=Release .. - cmake --build . 
--config Release # test_script: - ctest -VV - - diff --git a/cpp/cmake_modules/SetupCxxFlags.cmake b/cpp/cmake_modules/SetupCxxFlags.cmake index 09a662ec6e583..694e5a37df4ba 100644 --- a/cpp/cmake_modules/SetupCxxFlags.cmake +++ b/cpp/cmake_modules/SetupCxxFlags.cmake @@ -26,6 +26,10 @@ CHECK_CXX_COMPILER_FLAG("-maltivec" CXX_SUPPORTS_ALTIVEC) # compiler flags that are common across debug/release builds if (MSVC) + # TODO(wesm): Change usages of C runtime functions that MSVC says are + # insecure, like std::getenv + add_definitions(-D_CRT_SECURE_NO_WARNINGS) + if ("${CMAKE_CXX_COMPILER_ID}" STREQUAL "Clang") # clang-cl set(CXX_COMMON_FLAGS "-EHsc") diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc index 68b9864301d20..e50f4fd10b087 100644 --- a/cpp/src/arrow/array-test.cc +++ b/cpp/src/arrow/array-test.cc @@ -124,10 +124,10 @@ TEST_F(TestArray, SliceRecomputeNullCount) { TEST_F(TestArray, TestIsNull) { // clang-format off vector null_bitmap = {1, 0, 1, 1, 0, 1, 0, 0, - 1, 0, 1, 1, 0, 1, 0, 0, - 1, 0, 1, 1, 0, 1, 0, 0, - 1, 0, 1, 1, 0, 1, 0, 0, - 1, 0, 0, 1}; + 1, 0, 1, 1, 0, 1, 0, 0, + 1, 0, 1, 1, 0, 1, 0, 0, + 1, 0, 1, 1, 0, 1, 0, 0, + 1, 0, 0, 1}; // clang-format on int64_t null_count = 0; for (uint8_t x : null_bitmap) { @@ -144,7 +144,7 @@ TEST_F(TestArray, TestIsNull) { ASSERT_TRUE(arr->null_bitmap()->Equals(*null_buf.get())); for (size_t i = 0; i < null_bitmap.size(); ++i) { - EXPECT_EQ(null_bitmap[i], !arr->IsNull(i)) << i; + EXPECT_EQ(null_bitmap[i] != 0, !arr->IsNull(i)) << i; } } @@ -334,7 +334,7 @@ void TestPrimitiveBuilder::Check( for (int64_t i = 0; i < result->length(); ++i) { if (nullable) { ASSERT_EQ(valid_bytes_[i] == 0, result->IsNull(i)) << i; } bool actual = BitUtil::GetBit(result->data()->data(), i); - ASSERT_EQ(static_cast(draws_[i]), actual) << i; + ASSERT_EQ(draws_[i] != 0, actual) << i; } ASSERT_TRUE(result->Equals(*expected)); } @@ -1379,7 +1379,7 @@ void ValidateBasicListArray(const ListArray* result, const vector& valu } for (int i = 0; i < result->length(); ++i) { - ASSERT_EQ(!static_cast(is_valid[i]), result->IsNull(i)); + ASSERT_EQ(is_valid[i] == 0, result->IsNull(i)); } ASSERT_EQ(7, result->values()->length()); diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index 61207a334db32..60cdc4cb3a5db 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -275,6 +275,10 @@ class ARROW_EXPORT BooleanBuilder : public ArrayBuilder { return Status::OK(); } + Status Append(uint8_t val) { + return Append(val != 0); + } + /// Vector append /// /// If passed, valid_bytes is of equal length to values, and any zero byte diff --git a/cpp/src/arrow/io/file.cc b/cpp/src/arrow/io/file.cc index 720be3d6e739c..eb4b9fc43884f 100644 --- a/cpp/src/arrow/io/file.cc +++ b/cpp/src/arrow/io/file.cc @@ -250,9 +250,9 @@ static inline Status FileRead( int fd, uint8_t* buffer, int64_t nbytes, int64_t* bytes_read) { #if defined(_MSC_VER) if (nbytes > INT32_MAX) { return Status::IOError("Unable to read > 2GB blocks yet"); } - *bytes_read = _read(fd, buffer, static_cast(nbytes)); + *bytes_read = static_cast(_read(fd, buffer, static_cast(nbytes))); #else - *bytes_read = read(fd, buffer, static_cast(nbytes)); + *bytes_read = static_cast(read(fd, buffer, static_cast(nbytes))); #endif if (*bytes_read == -1) { @@ -269,7 +269,7 @@ static inline Status FileWrite(int fd, const uint8_t* buffer, int64_t nbytes) { if (nbytes > INT32_MAX) { return Status::IOError("Unable to write > 2GB blocks to file yet"); } - ret = static_cast(_write(fd, buffer, 
static_cast(nbytes))); + ret = static_cast(_write(fd, buffer, static_cast(nbytes))); #else ret = static_cast(write(fd, buffer, static_cast(nbytes))); #endif diff --git a/cpp/src/arrow/io/io-hdfs-test.cc b/cpp/src/arrow/io/io-hdfs-test.cc index f3140be0b2dac..a2c9c5210b10d 100644 --- a/cpp/src/arrow/io/io-hdfs-test.cc +++ b/cpp/src/arrow/io/io-hdfs-test.cc @@ -107,7 +107,7 @@ class TestHdfsClient : public ::testing::Test { const char* port = std::getenv("ARROW_HDFS_TEST_PORT"); const char* user = std::getenv("ARROW_HDFS_TEST_USER"); - ASSERT_TRUE(user) << "Set ARROW_HDFS_TEST_USER"; + ASSERT_TRUE(user != nullptr) << "Set ARROW_HDFS_TEST_USER"; conf_.host = host == nullptr ? "localhost" : host; conf_.user = user; diff --git a/cpp/src/arrow/ipc/json-internal.cc b/cpp/src/arrow/ipc/json-internal.cc index 124c21b8fc023..fe0a7c94226f0 100644 --- a/cpp/src/arrow/ipc/json-internal.cc +++ b/cpp/src/arrow/ipc/json-internal.cc @@ -1102,7 +1102,7 @@ class JsonArrayReader { std::vector is_valid; for (const rj::Value& val : json_validity) { DCHECK(val.IsInt()); - is_valid.push_back(static_cast(val.GetInt())); + is_valid.push_back(val.GetInt() != 0); } #define TYPE_CASE(TYPE) \ diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index d113531822c96..9e0480d4c3634 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -125,7 +125,7 @@ Status MakeRandomListArray(const std::shared_ptr& child_array, int num_li std::partial_sum(list_sizes.begin(), list_sizes.end(), ++offsets.begin()); // Force invariants - const int64_t child_length = child_array->length(); + const int32_t child_length = static_cast(child_array->length()); offsets[0] = 0; std::replace_if(offsets.begin(), offsets.end(), [child_length](int32_t offset) { return offset > child_length; }, child_length); diff --git a/cpp/src/arrow/tensor.cc b/cpp/src/arrow/tensor.cc index 8bbb97b596e18..d1c4083289f96 100644 --- a/cpp/src/arrow/tensor.cc +++ b/cpp/src/arrow/tensor.cc @@ -86,7 +86,7 @@ const std::string& Tensor::dim_name(int i) const { } int64_t Tensor::size() const { - return std::accumulate(shape_.begin(), shape_.end(), 1, std::multiplies()); + return std::accumulate(shape_.begin(), shape_.end(), 1LL, std::multiplies()); } bool Tensor::is_contiguous() const { From 027c6b8084961cf10d80927c8380cce7a23acc1f Mon Sep 17 00:00:00 2001 From: Philipp Moritz Date: Fri, 7 Apr 2017 10:16:47 -0400 Subject: [PATCH 0479/1644] ARROW-781 [C++/Python] Increase reference count of the numpy base array? 
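Before the commit body below, a hedged sketch (not part of the patch) of the behavior ARROW-781 establishes, mirroring the `test_tensor_base_object` test it adds:

```python
# Sketch only: after this fix, the NumPy view returned by to_numpy() holds a
# reference to its backing Tensor, keeping the underlying memory alive.
import sys
import numpy as np
import pyarrow as pa

tensor = pa.Tensor.from_numpy(np.random.randn(10, 4))
n = sys.getrefcount(tensor)
array = tensor.to_numpy()
assert sys.getrefcount(tensor) == n + 1  # the view keeps the tensor alive
```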
see https://issues.apache.org/jira/browse/ARROW-781 Author: Philipp Moritz Closes #503 from pcmoritz/numpy-base-object and squashes the following commits: 207e439 [Philipp Moritz] add test for numpy base object e96c89a [Philipp Moritz] increase reference count of the numpy base array --- cpp/src/arrow/python/numpy_convert.cc | 1 + python/pyarrow/tests/test_tensor.py | 7 +++++++ 2 files changed, 8 insertions(+) diff --git a/cpp/src/arrow/python/numpy_convert.cc b/cpp/src/arrow/python/numpy_convert.cc index 3697819120dbe..23470fbc41aca 100644 --- a/cpp/src/arrow/python/numpy_convert.cc +++ b/cpp/src/arrow/python/numpy_convert.cc @@ -258,6 +258,7 @@ Status TensorToNdarray(const Tensor& tensor, PyObject* base, PyObject** out) { if (base != Py_None) { PyArray_SetBaseObject(reinterpret_cast(result), base); + Py_XINCREF(base); } *out = result; return Status::OK(); diff --git a/python/pyarrow/tests/test_tensor.py b/python/pyarrow/tests/test_tensor.py index 5327f1a74a33e..a39064b49dfbc 100644 --- a/python/pyarrow/tests/test_tensor.py +++ b/python/pyarrow/tests/test_tensor.py @@ -16,6 +16,7 @@ # under the License. import os +import sys import pytest import numpy as np @@ -41,6 +42,12 @@ def test_tensor_attrs(): tensor = pa.Tensor.from_numpy(data2) assert not tensor.is_mutable +def test_tensor_base_object(): + tensor = pa.Tensor.from_numpy(np.random.randn(10, 4)) + n = sys.getrefcount(tensor) + array = tensor.to_numpy() + assert sys.getrefcount(tensor) == n + 1 + @pytest.mark.parametrize('dtype_str,arrow_type', [ ('i1', pa.int8()), From 8ae3283b2ecdf6cb5a6d7e97753781128a57512d Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 7 Apr 2017 16:49:39 -0400 Subject: [PATCH 0480/1644] ARROW-787: [GLib] Fix compilation error caused by introducing BooleanBuilder::Append overload Author: Wes McKinney Closes #506 from wesm/ARROW-787 and squashes the following commits: e0edb47 [Wes McKinney] Fix compilation error caused by introducing BooleanBuilder::Append overload --- c_glib/arrow-glib/boolean-array-builder.cpp | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/c_glib/arrow-glib/boolean-array-builder.cpp b/c_glib/arrow-glib/boolean-array-builder.cpp index 1a4c1f9fd8f7e..146eb31e8bdf8 100644 --- a/c_glib/arrow-glib/boolean-array-builder.cpp +++ b/c_glib/arrow-glib/boolean-array-builder.cpp @@ -84,7 +84,7 @@ garrow_boolean_array_builder_append(GArrowBooleanArrayBuilder *builder, static_cast( garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); - auto status = arrow_builder->Append(value); + auto status = arrow_builder->Append(static_cast(value)); if (status.ok()) { return TRUE; } else { From 35911037031e46784f4e585ac5922642351660c1 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Sat, 8 Apr 2017 09:22:16 -0400 Subject: [PATCH 0481/1644] ARROW-793: [GLib] Fix indent Author: Kouhei Sutou Closes #509 from kou/glib-fix-indent and squashes the following commits: 5453fb6 [Kouhei Sutou] [GLib] Fix indent --- c_glib/arrow-glib/uint32-array.cpp | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/c_glib/arrow-glib/uint32-array.cpp b/c_glib/arrow-glib/uint32-array.cpp index d10f10005f9be..18a9aedc0658f 100644 --- a/c_glib/arrow-glib/uint32-array.cpp +++ b/c_glib/arrow-glib/uint32-array.cpp @@ -60,7 +60,7 @@ garrow_uint32_array_class_init(GArrowUInt32ArrayClass *klass) */ guint32 garrow_uint32_array_get_value(GArrowUInt32Array *array, - gint64 i) + gint64 i) { auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); return static_cast(arrow_array.get())->Value(i); From 
b0e3122b904924dc86cf470edfa726e38bd14f83 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 9 Apr 2017 13:52:20 -0400 Subject: [PATCH 0482/1644] ARROW-724: Add How to Contribute section to README Author: Wes McKinney Closes #504 from wesm/ARROW-724 and squashes the following commits: c1e07b6 [Wes McKinney] Typo 3a6083c [Wes McKinney] Add how to contribute section modifed from parquet-mr --- README.md | 43 +++++++++++++++++++++++++++++++++++-------- 1 file changed, 35 insertions(+), 8 deletions(-) diff --git a/README.md b/README.md index 1eb3f86f98656..2790895878563 100644 --- a/README.md +++ b/README.md @@ -25,7 +25,7 @@

-#### Powering Columnar In-Memory Analytics +### Powering Columnar In-Memory Analytics Arrow is a set of technologies that enable big-data systems to process and move data fast. @@ -39,7 +39,7 @@ Initial implementations include: Arrow is an [Apache Software Foundation](www.apache.org) project. Learn more at [arrow.apache.org](http://arrow.apache.org). -#### What's in the Arrow libraries? +### What's in the Arrow libraries? The reference Arrow implementations contain a number of distinct software components: @@ -59,12 +59,7 @@ components: - Conversions to and from other in-memory data structures (e.g. Python's pandas library) -#### Getting involved - -Right now the primary audience for Apache Arrow are the developers of data -systems; most people will use Apache Arrow indirectly through systems that use -it for internal data handling and interoperating with other Arrow-enabled -systems. +### Getting involved Even if you do not plan to contribute to Apache Arrow itself or Arrow integrations in other projects, we'd be happy to have you involved: @@ -76,6 +71,38 @@ integrations in other projects, we'd be happy to have you involved: - [Learn the format][2] - Contribute code to one of the reference implementations +### How to Contribute + +We prefer to receive contributions in the form of GitHub pull requests. Please +send pull requests against the [github.com/apache/arrow][4] repository. + +If you are looking for some ideas on what to contribute, check out the [JIRA +issues][3] for the Apache Arrow project. Comment on the issue and/or contact +[dev@arrow.apache.org](http://mail-archives.apache.org/mod_mbox/arrow-dev/) +with your questions and ideas. + +If you’d like to report a bug but don’t have time to fix it, you can still post +it on JIRA, or email the mailing list +[dev@arrow.apache.org](http://mail-archives.apache.org/mod_mbox/arrow-dev/) + +To contribute a patch: + +1. Break your work into small, single-purpose patches if possible. It’s much +harder to merge in a large change with a lot of disjoint features. +2. Create a JIRA for your patch on the [Arrow Project +JIRA](https://issues.apache.org/jira/browse/ARROW). +3. Submit the patch as a GitHub pull request against the master branch. For a +tutorial, see the GitHub guides on forking a repo and sending a pull +request. Prefix your pull request name with the JIRA name (ex: +https://github.com/apache/arrow/pull/240). +4. Make sure that your code passes the unit tests. You can find instructions +how to run the unit tests for each Arrow component in its respective README +file. +5. Add new unit tests for your code. + +Thank you in advance for your contributions! 
+ [1]: mailto:dev-subscribe@arrow.apache.org [2]: https://github.com/apache/arrow/tree/master/format [3]: https://issues.apache.org/jira/browse/ARROW +[4]: https://github.com/apache/arrow \ No newline at end of file From 739ed82028e9efae43f00f4e19b39737adb8a348 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 9 Apr 2017 13:54:10 -0400 Subject: [PATCH 0483/1644] ARROW-762: [Python] Start docs page about files and filesystems, adapt C++ docs about HDFS Author: Wes McKinney Closes #511 from wesm/ARROW-762 and squashes the following commits: 273142e [Wes McKinney] Add initial docs about configuring environment to use pyarrow.HdfsClient --- python/doc/filesystems.rst | 58 ++++++++++++++++++++++++++++++++++++++ python/doc/index.rst | 12 ++++---- 2 files changed, 64 insertions(+), 6 deletions(-) create mode 100644 python/doc/filesystems.rst diff --git a/python/doc/filesystems.rst b/python/doc/filesystems.rst new file mode 100644 index 0000000000000..9e00ddd558127 --- /dev/null +++ b/python/doc/filesystems.rst @@ -0,0 +1,58 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +File interfaces and Memory Maps +=============================== + +PyArrow features a number of file-like interfaces + +Hadoop File System (HDFS) +------------------------- + +PyArrow comes with bindings to a C++-based interface to the Hadoop File +System. You connect like so: + +.. code-block:: python + + import pyarrow as pa + hdfs = pa.HdfsClient(host, port, user=user, kerb_ticket=ticket_cache_path) + +By default, ``pyarrow.HdfsClient`` uses libhdfs, a JNI-based interface to the +Java Hadoop client. This library is loaded **at runtime** (rather than at link +/ library load time, since the library may not be in your LD_LIBRARY_PATH), and +relies on some environment variables. + +* ``HADOOP_HOME``: the root of your installed Hadoop distribution. Often has + `lib/native/libhdfs.so`. + +* ``JAVA_HOME``: the location of your Java SDK installation. + +* ``ARROW_LIBHDFS_DIR`` (optional): explicit location of ``libhdfs.so`` if it is + installed somewhere other than ``$HADOOP_HOME/lib/native``. + +* ``CLASSPATH``: must contain the Hadoop jars. You can set these using: + +.. code-block:: shell + + export CLASSPATH=`$HADOOP_HOME/bin/hdfs classpath --glob` + +You can also use libhdfs3, a thirdparty C++ library for HDFS from Pivotal Labs: + +.. code-block:: python + + hdfs3 = pa.HdfsClient(host, port, user=user, kerb_ticket=ticket_cache_path, + driver='libhdfs3') diff --git a/python/doc/index.rst b/python/doc/index.rst index d64354be05520..608fff5d57ba4 100644 --- a/python/doc/index.rst +++ b/python/doc/index.rst @@ -34,15 +34,15 @@ structures. 
:maxdepth: 2 :caption: Getting Started - Installing pyarrow - Pandas - Module Reference - Getting Involved + install + pandas + filesystems + parquet + modules + getting_involved .. toctree:: :maxdepth: 2 :caption: Additional Features - Parquet format jemalloc MemoryPool - From b0863cb63d62ae7c4a429164e5a2e350d3c1f21a Mon Sep 17 00:00:00 2001 From: Philipp Moritz Date: Sun, 9 Apr 2017 13:56:14 -0400 Subject: [PATCH 0484/1644] ARROW-788: [C++] Align WriteTensor message Author: Philipp Moritz Closes #512 from pcmoritz/tensor-alignment and squashes the following commits: fd20f05 [Philipp Moritz] align WriteTensor to 8 bytes --- cpp/src/arrow/ipc/reader.cc | 2 ++ cpp/src/arrow/ipc/writer.cc | 11 +++++++++++ 2 files changed, 13 insertions(+) diff --git a/cpp/src/arrow/ipc/reader.cc b/cpp/src/arrow/ipc/reader.cc index 55f632f306b9a..a7c4f04a4d4cc 100644 --- a/cpp/src/arrow/ipc/reader.cc +++ b/cpp/src/arrow/ipc/reader.cc @@ -495,6 +495,8 @@ Status ReadRecordBatch(const std::shared_ptr& schema, int64_t offset, Status ReadTensor( int64_t offset, io::RandomAccessFile* file, std::shared_ptr* out) { + // Respect alignment of Tensor messages (see WriteTensor) + offset = PaddedLength(offset); std::shared_ptr message; std::shared_ptr data; RETURN_NOT_OK(ReadContiguousPayload(offset, file, &message, &data)); diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc index 5330206480928..9305567e74f6b 100644 --- a/cpp/src/arrow/ipc/writer.cc +++ b/cpp/src/arrow/ipc/writer.cc @@ -470,6 +470,16 @@ class DictionaryWriter : public RecordBatchWriter { int64_t dictionary_id_; }; +// Adds padding bytes if necessary to ensure all memory blocks are written on +// 8-byte boundaries. +Status AlignStreamPosition(io::OutputStream* stream) { + int64_t position; + RETURN_NOT_OK(stream->Tell(&position)); + int64_t remainder = PaddedLength(position) - position; + if (remainder > 0) { return stream->Write(kPaddingBytes, remainder); } + return Status::OK(); +} + Status WriteRecordBatch(const RecordBatch& batch, int64_t buffer_start_offset, io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length, MemoryPool* pool, int max_recursion_depth, bool allow_64bit) { @@ -486,6 +496,7 @@ Status WriteLargeRecordBatch(const RecordBatch& batch, int64_t buffer_start_offs Status WriteTensor(const Tensor& tensor, io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length) { + RETURN_NOT_OK(AlignStreamPosition(dst)); std::shared_ptr metadata; RETURN_NOT_OK(WriteTensorMessage(tensor, 0, &metadata)); RETURN_NOT_OK(WriteMessage(*metadata, dst, metadata_length)); From 449f99162abab52378e2d6b2ca18099df567dc29 Mon Sep 17 00:00:00 2001 From: Nong Li Date: Sun, 9 Apr 2017 14:31:57 -0400 Subject: [PATCH 0485/1644] ARROW-773: [CPP] Add Table::AddColumn API Author: Nong Li Closes #513 from nongli/arrow-773 and squashes the following commits: e6f5846 [Nong Li] ARROW-773: [CPP] Add Table::AddColumn API --- cpp/src/arrow/table-test.cc | 73 +++++++++++++++++++++++++++++ cpp/src/arrow/table.cc | 24 ++++++++++ cpp/src/arrow/table.h | 4 ++ cpp/src/arrow/type.cc | 9 ++++ cpp/src/arrow/type.h | 2 + cpp/src/arrow/util/CMakeLists.txt | 1 + cpp/src/arrow/util/stl-util-test.cc | 60 ++++++++++++++++++++++++ cpp/src/arrow/util/stl.h | 20 ++++++++ 8 files changed, 193 insertions(+) create mode 100644 cpp/src/arrow/util/stl-util-test.cc diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc index cd32f4a387290..156c3d16d4db0 100644 --- a/cpp/src/arrow/table-test.cc +++ b/cpp/src/arrow/table-test.cc @@ -384,6 +384,79 
@@ TEST_F(TestTable, RemoveColumn) { ASSERT_TRUE(result->Equals(Table(ex_schema, ex_columns))); } +TEST_F(TestTable, AddColumn) { + const int64_t length = 10; + MakeExample1(length); + + Table table(schema_, columns_); + + std::shared_ptr<Table> result; + // Some negative tests with invalid index + Status status = table.AddColumn(10, columns_[0], &result); + ASSERT_TRUE(status.IsInvalid()); + status = table.AddColumn(-1, columns_[0], &result); + ASSERT_TRUE(status.IsInvalid()); + + // Add column with wrong length + auto longer_col = std::make_shared<Column>( + schema_->field(0), MakePrimitive<Int32Array>(length + 1)); + status = table.AddColumn(0, longer_col, &result); + ASSERT_TRUE(status.IsInvalid()); + + // Add column 0 in different places + ASSERT_OK(table.AddColumn(0, columns_[0], &result)); + auto ex_schema = std::shared_ptr<Schema>(new Schema({ + schema_->field(0), + schema_->field(0), + schema_->field(1), + schema_->field(2)})); + std::vector<std::shared_ptr<Column>> ex_columns = { + table.column(0), + table.column(0), + table.column(1), + table.column(2)}; + ASSERT_TRUE(result->Equals(Table(ex_schema, ex_columns))); + + ASSERT_OK(table.AddColumn(1, columns_[0], &result)); + ex_schema = std::shared_ptr<Schema>(new Schema({ + schema_->field(0), + schema_->field(0), + schema_->field(1), + schema_->field(2)})); + ex_columns = { + table.column(0), + table.column(0), + table.column(1), + table.column(2)}; + ASSERT_TRUE(result->Equals(Table(ex_schema, ex_columns))); + + ASSERT_OK(table.AddColumn(2, columns_[0], &result)); + ex_schema = std::shared_ptr<Schema>(new Schema({ + schema_->field(0), + schema_->field(1), + schema_->field(0), + schema_->field(2)})); + ex_columns = { + table.column(0), + table.column(1), + table.column(0), + table.column(2)}; + ASSERT_TRUE(result->Equals(Table(ex_schema, ex_columns))); + + ASSERT_OK(table.AddColumn(3, columns_[0], &result)); + ex_schema = std::shared_ptr<Schema>(new Schema({ + schema_->field(0), + schema_->field(1), + schema_->field(2), + schema_->field(0)})); + ex_columns = { + table.column(0), + table.column(1), + table.column(2), + table.column(0)}; + ASSERT_TRUE(result->Equals(Table(ex_schema, ex_columns))); +} + class TestRecordBatch : public TestBase {}; TEST_F(TestRecordBatch, Equals) { diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc index da61fbb9a6daf..9b39f770a17b7 100644 --- a/cpp/src/arrow/table.cc +++ b/cpp/src/arrow/table.cc @@ -321,6 +321,30 @@ Status Table::RemoveColumn(int i, std::shared_ptr<Table>
* out) const { return Status::OK(); } +Status Table::AddColumn(int i, const std::shared_ptr<Column>& col, + std::shared_ptr<Table>
* out) const { + if (i < 0 || i > num_columns() + 1) { + return Status::Invalid("Invalid column index."); + } + if (col == nullptr) { + std::stringstream ss; + ss << "Column " << i << " was null"; + return Status::Invalid(ss.str()); + } + if (col->length() != num_rows_) { + std::stringstream ss; + ss << "Added column's length must match table's length. Expected length " << num_rows_ + << " but got length " << col->length(); + return Status::Invalid(ss.str()); + } + + std::shared_ptr<Schema> new_schema; + RETURN_NOT_OK(schema_->AddField(i, col->field(), &new_schema)); + + *out = std::make_shared<Table>
(new_schema, AddVectorElement(columns_, i, col)); + return Status::OK(); +} + Status Table::ValidateColumns() const { if (num_columns() != schema_->num_fields()) { return Status::Invalid("Number of columns did not match schema"); diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h index 0f35dd888fe2f..dcea53d8fb1dd 100644 --- a/cpp/src/arrow/table.h +++ b/cpp/src/arrow/table.h @@ -181,6 +181,10 @@ class ARROW_EXPORT Table { /// schemas are immutable) Status RemoveColumn(int i, std::shared_ptr<Table>
* out) const; + /// Add column to the table, producing a new Table + Status AddColumn(int i, const std::shared_ptr<Column>& column, + std::shared_ptr<Table>
* out) const; + // @returns: the number of columns in the table int num_columns() const { return static_cast(columns_.size()); } diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index d99551d661d69..abbb626e0fcb4 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -257,6 +257,15 @@ std::shared_ptr Schema::GetFieldByName(const std::string& name) { } } +Status Schema::AddField(int i, const std::shared_ptr& field, + std::shared_ptr* out) const { + DCHECK_GE(i, 0); + DCHECK_LE(i, this->num_fields()); + + *out = std::make_shared(AddVectorElement(fields_, i, field)); + return Status::OK(); +} + Status Schema::RemoveField(int i, std::shared_ptr* out) const { DCHECK_GE(i, 0); DCHECK_LT(i, this->num_fields()); diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 0e69133219d55..36ab9d8b2b9d5 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -619,6 +619,8 @@ class ARROW_EXPORT Schema { // Render a string representation of the schema suitable for debugging std::string ToString() const; + Status AddField(int i, const std::shared_ptr& field, + std::shared_ptr* out) const; Status RemoveField(int i, std::shared_ptr* out) const; int num_fields() const { return static_cast(fields_.size()); } diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt index 8d9afccf867df..c1b6877a3e9ef 100644 --- a/cpp/src/arrow/util/CMakeLists.txt +++ b/cpp/src/arrow/util/CMakeLists.txt @@ -69,3 +69,4 @@ if (ARROW_BUILD_BENCHMARKS) endif() ADD_ARROW_TEST(bit-util-test) +ADD_ARROW_TEST(stl-util-test) diff --git a/cpp/src/arrow/util/stl-util-test.cc b/cpp/src/arrow/util/stl-util-test.cc new file mode 100644 index 0000000000000..526520e7a2dec --- /dev/null +++ b/cpp/src/arrow/util/stl-util-test.cc @@ -0,0 +1,60 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/util/stl.h" + +#include +#include + +#include "gtest/gtest.h" + +#include "arrow/test-util.h" + +namespace arrow { + +TEST(StlUtilTest, VectorAddRemoveTest) { + std::vector values; + std::vector result = AddVectorElement(values, 0, 100); + EXPECT_EQ(values.size(), 0); + EXPECT_EQ(result.size(), 1); + EXPECT_EQ(result[0], 100); + + // Add 200 at index 0 and 300 at the end. 
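+ // AddVectorElement copies its input rather than mutating it, which is why + // values stayed empty above and result keeps its single element below.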
+ std::vector result2 = AddVectorElement(result, 0, 200); + result2 = AddVectorElement(result2, result2.size(), 300); + EXPECT_EQ(result.size(), 1); + EXPECT_EQ(result2.size(), 3); + EXPECT_EQ(result2[0], 200); + EXPECT_EQ(result2[1], 100); + EXPECT_EQ(result2[2], 300); + + // Remove 100, 300, 200 + std::vector result3 = DeleteVectorElement(result2, 1); + EXPECT_EQ(result2.size(), 3); + EXPECT_EQ(result3.size(), 2); + EXPECT_EQ(result3[0], 200); + EXPECT_EQ(result3[1], 300); + + result3 = DeleteVectorElement(result3, 1); + EXPECT_EQ(result3.size(), 1); + EXPECT_EQ(result3[0], 200); + + result3 = DeleteVectorElement(result3, 0); + EXPECT_TRUE(result3.empty()); +} + +} // namespace arrow diff --git a/cpp/src/arrow/util/stl.h b/cpp/src/arrow/util/stl.h index 3ec535d62b920..bd250539a8c8a 100644 --- a/cpp/src/arrow/util/stl.h +++ b/cpp/src/arrow/util/stl.h @@ -20,10 +20,14 @@ #include +#include + namespace arrow { template inline std::vector DeleteVectorElement(const std::vector& values, size_t index) { + DCHECK(!values.empty()); + DCHECK_LT(index, values.size()); std::vector out; out.reserve(values.size() - 1); for (size_t i = 0; i < index; ++i) { @@ -35,6 +39,22 @@ inline std::vector DeleteVectorElement(const std::vector& values, size_t i return out; } +template +inline std::vector AddVectorElement(const std::vector& values, size_t index, + const T& new_element) { + DCHECK_LE(index, values.size()); + std::vector out; + out.reserve(values.size() + 1); + for (size_t i = 0; i < index; ++i) { + out.push_back(values[i]); + } + out.push_back(new_element); + for (size_t i = index; i < values.size(); ++i) { + out.push_back(values[i]); + } + return out; +} + } // namespace arrow #endif // ARROW_UTIL_STL_H From 754bcce686ecf02e123dcf4801715bf155f15e1f Mon Sep 17 00:00:00 2001 From: Phillip Cloud Date: Sun, 9 Apr 2017 15:19:53 -0400 Subject: [PATCH 0486/1644] ARROW-655: [C++/Python] Implement DecimalArray Adds Decimal support for C++ and Python. TODOs: - [x] Tighten up some of the GIL acquisition. E.g., we may not need to hold it when importing the decimal module if we acquire it where we import the decimal module. - [x] Investigate FreeBSD issue (manifesting on OS X) where typeinfo symbols for `__int128_t` are not exported: https://bugs.llvm.org//show_bug.cgi?id=26156. - [x] See if there's a better way to visit scalar decimals, rather than keeping extra state on the class. Seems like an unacceptable hack. 
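The separate sign bitmap in this patch exists because, per the builder.h comment below, boost stores a 128-bit decimal's sign apart from its underlying bytes, so each appended value contributes a 16-byte magnitude plus one sign bit. A minimal sketch of that split in Python; the helper names are illustrative only, and the little-endian byte order is an assumption of the sketch, not taken from the patch:

.. code-block:: python

    def split_decimal128(value):
        # Mirror the builder's layout: a 16-byte magnitude plus a separate
        # is_negative flag destined for the sign bitmap.
        # (Little-endian is a choice for this sketch, not the patch's spec.)
        is_negative = value < 0
        magnitude = abs(value).to_bytes(16, byteorder="little")
        return magnitude, is_negative

    def join_decimal128(magnitude, is_negative):
        value = int.from_bytes(magnitude, byteorder="little")
        return -value if is_negative else value

    n = -39402950693754869342983
    assert join_decimal128(*split_decimal128(n)) == n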
Author: Phillip Cloud Closes #403 from cpcloud/decimal and squashes the following commits: e5470fd [Phillip Cloud] Remove unnecessary header in helpers.h 07713a7 [Phillip Cloud] Remove more boost leakage f764156 [Phillip Cloud] Revert "Transitively link static libs as well" a7109b2 [Phillip Cloud] Transitively link static libs as well bf2a7ea [Phillip Cloud] Move IsNegative to cc file cb2c1ac [Phillip Cloud] Do not link boost regex to jemalloc e63b766 [Phillip Cloud] Remove python extra cmake args 805bbac [Phillip Cloud] ARROW-655: [C++/Python] Implement DecimalArray --- .travis.yml | 1 + cpp/CMakeLists.txt | 27 ++- cpp/cmake_modules/FindPythonLibsNew.cmake | 3 +- cpp/src/arrow/array-decimal-test.cc | 194 ++++++++++++++++++- cpp/src/arrow/array.cc | 49 ++++- cpp/src/arrow/array.h | 31 ++- cpp/src/arrow/builder.cc | 88 ++++++++- cpp/src/arrow/builder.h | 29 ++- cpp/src/arrow/compare.cc | 40 +++- cpp/src/arrow/ipc/CMakeLists.txt | 7 +- cpp/src/arrow/python/CMakeLists.txt | 3 +- cpp/src/arrow/python/builtin_convert.cc | 62 +++++- cpp/src/arrow/python/builtin_convert.h | 2 +- cpp/src/arrow/python/common.h | 9 +- cpp/src/arrow/python/helpers.cc | 79 ++++++++ cpp/src/arrow/python/helpers.h | 26 ++- cpp/src/arrow/python/pandas_convert.cc | 176 ++++++++++++++++- cpp/src/arrow/python/python-test.cc | 33 ++++ cpp/src/arrow/type.cc | 18 +- cpp/src/arrow/type.h | 26 ++- cpp/src/arrow/type_fwd.h | 2 + cpp/src/arrow/type_traits.h | 13 +- cpp/src/arrow/util/CMakeLists.txt | 2 + cpp/src/arrow/util/bit-util.h | 1 - cpp/src/arrow/util/decimal-test.cc | 161 +++++++++++++++ cpp/src/arrow/util/decimal.cc | 141 ++++++++++++++ cpp/src/arrow/util/decimal.h | 144 ++++++++++++++ cpp/src/arrow/visitor_inline.h | 2 +- format/Schema.fbs | 2 + python/pyarrow/__init__.py | 2 +- python/pyarrow/array.pxd | 4 + python/pyarrow/array.pyx | 5 + python/pyarrow/includes/common.pxd | 5 + python/pyarrow/includes/libarrow.pxd | 16 ++ python/pyarrow/scalar.pxd | 1 + python/pyarrow/scalar.pyx | 25 ++- python/pyarrow/schema.pxd | 10 +- python/pyarrow/schema.pyx | 28 ++- python/pyarrow/tests/test_convert_builtin.py | 40 ++++ python/pyarrow/tests/test_convert_pandas.py | 70 +++++++ 40 files changed, 1497 insertions(+), 80 deletions(-) create mode 100644 cpp/src/arrow/util/decimal-test.cc create mode 100644 cpp/src/arrow/util/decimal.cc create mode 100644 cpp/src/arrow/util/decimal.h diff --git a/.travis.yml b/.travis.yml index b219b03e0eb2b..f74a3b205c4b6 100644 --- a/.travis.yml +++ b/.travis.yml @@ -14,6 +14,7 @@ addons: - valgrind - libboost-dev - libboost-filesystem-dev + - libboost-regex-dev - libboost-system-dev - libjemalloc-dev - gtk-doc-tools diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 9947a34e4e7bb..5852fe59da095 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -398,30 +398,36 @@ if (ARROW_BOOST_USE_SHARED) add_definitions(-DBOOST_ALL_DYN_LINK) endif() - find_package(Boost COMPONENTS system filesystem REQUIRED) + find_package(Boost COMPONENTS system filesystem regex REQUIRED) if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG") set(BOOST_SHARED_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_DEBUG}) set(BOOST_SHARED_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_DEBUG}) + set(BOOST_SHARED_REGEX_LIBRARY ${Boost_REGEX_LIBRARY_DEBUG}) else() set(BOOST_SHARED_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_RELEASE}) set(BOOST_SHARED_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_RELEASE}) + set(BOOST_SHARED_REGEX_LIBRARY ${Boost_REGEX_LIBRARY_RELEASE}) endif() set(BOOST_SYSTEM_LIBRARY boost_system_shared) set(BOOST_FILESYSTEM_LIBRARY 
boost_filesystem_shared) + set(BOOST_REGEX_LIBRARY boost_regex_shared) else() # Find static boost headers and libs # TODO Differentiate here between release and debug builds set(Boost_USE_STATIC_LIBS ON) - find_package(Boost COMPONENTS system filesystem REQUIRED) + find_package(Boost COMPONENTS system filesystem regex REQUIRED) if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG") set(BOOST_STATIC_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_DEBUG}) set(BOOST_STATIC_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_DEBUG}) + set(BOOST_STATIC_REGEX_LIBRARY ${Boost_REGEX_LIBRARY_DEBUG}) else() set(BOOST_STATIC_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_RELEASE}) set(BOOST_STATIC_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_RELEASE}) + set(BOOST_STATIC_REGEX_LIBRARY ${Boost_REGEX_LIBRARY_RELEASE}) endif() set(BOOST_SYSTEM_LIBRARY boost_system_static) set(BOOST_FILESYSTEM_LIBRARY boost_filesystem_static) + set(BOOST_REGEX_LIBRARY boost_regex_static) endif() message(STATUS "Boost include dir: " ${Boost_INCLUDE_DIRS}) @@ -435,7 +441,11 @@ ADD_THIRDPARTY_LIB(boost_filesystem STATIC_LIB "${BOOST_STATIC_FILESYSTEM_LIBRARY}" SHARED_LIB "${BOOST_SHARED_FILESYSTEM_LIBRARY}") -SET(ARROW_BOOST_LIBS boost_system boost_filesystem) +ADD_THIRDPARTY_LIB(boost_regex + STATIC_LIB "${BOOST_STATIC_REGEX_LIBRARY}" + SHARED_LIB "${BOOST_SHARED_REGEX_LIBRARY}") + +SET(ARROW_BOOST_LIBS boost_system boost_filesystem boost_regex) include_directories(SYSTEM ${Boost_INCLUDE_DIR}) @@ -695,14 +705,16 @@ endif() set(ARROW_MIN_TEST_LIBS arrow_static arrow_test_main - ${ARROW_BASE_LIBS}) + ${ARROW_BASE_LIBS} + ${BOOST_REGEX_LIBRARY}) set(ARROW_TEST_LINK_LIBS ${ARROW_MIN_TEST_LIBS}) set(ARROW_BENCHMARK_LINK_LIBS arrow_static arrow_benchmark_main - ${ARROW_BASE_LIBS}) + ${ARROW_BASE_LIBS} + ${BOOST_REGEX_LIBRARY}) ############################################################ # "make ctags" target @@ -796,7 +808,7 @@ endif() ############################################################ set(ARROW_LINK_LIBS -) + ${BOOST_REGEX_LIBRARY}) set(ARROW_PRIVATE_LINK_LIBS ) @@ -816,6 +828,7 @@ set(ARROW_SRCS src/arrow/visitor.cc src/arrow/util/bit-util.cc + src/arrow/util/decimal.cc ) if(NOT APPLE AND NOT MSVC) @@ -825,9 +838,11 @@ if(NOT APPLE AND NOT MSVC) set(ARROW_SHARED_LINK_FLAGS "-Wl,--version-script=${CMAKE_CURRENT_SOURCE_DIR}/src/arrow/symbols.map") endif() + ADD_ARROW_LIB(arrow SOURCES ${ARROW_SRCS} SHARED_LINK_FLAGS ${ARROW_SHARED_LINK_FLAGS} + SHARED_LINK_LIBS ${ARROW_LINK_LIBS} ) add_subdirectory(src/arrow) diff --git a/cpp/cmake_modules/FindPythonLibsNew.cmake b/cpp/cmake_modules/FindPythonLibsNew.cmake index dfe5661b015b5..d9cc4b3955734 100644 --- a/cpp/cmake_modules/FindPythonLibsNew.cmake +++ b/cpp/cmake_modules/FindPythonLibsNew.cmake @@ -175,7 +175,8 @@ else() find_library(PYTHON_LIBRARY NAMES "python${PYTHON_LIBRARY_SUFFIX}" PATHS ${_PYTHON_LIBS_SEARCH} - NO_SYSTEM_ENVIRONMENT_PATH) + NO_SYSTEM_ENVIRONMENT_PATH + NO_CMAKE_SYSTEM_PATH) message(STATUS "Found Python lib ${PYTHON_LIBRARY}") endif() diff --git a/cpp/src/arrow/array-decimal-test.cc b/cpp/src/arrow/array-decimal-test.cc index b64023bbc6a1e..4c01f928a6f26 100644 --- a/cpp/src/arrow/array-decimal-test.cc +++ b/cpp/src/arrow/array-decimal-test.cc @@ -15,13 +15,16 @@ // specific language governing permissions and limitations // under the License. 
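The reworked tests that follow pin down how DecimalType picks its storage width from the precision: the byte widths asserted below are 4 bytes at precision 8, 8 bytes at precision 12, and 16 bytes at precision 27. A compact statement of that rule, as an illustrative sketch consistent with those assertions (the cutoffs of 9 and 18 digits are the most base-10 digits an int32 and an int64 can hold):

.. code-block:: python

    def decimal_byte_width(precision: int) -> int:
        # 4, 8, or 16 bytes depending on how many decimal digits must fit.
        if precision <= 9:
            return 4
        if precision <= 18:
            return 8
        return 16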
+#include "arrow/type.h" #include "gtest/gtest.h" -#include "arrow/type.h" +#include "arrow/builder.h" +#include "arrow/test-util.h" +#include "arrow/util/decimal.h" namespace arrow { -TEST(TypesTest, TestDecimalType) { +TEST(TypesTest, TestDecimal32Type) { DecimalType t1(8, 4); ASSERT_EQ(t1.type, Type::DECIMAL); @@ -29,6 +32,193 @@ TEST(TypesTest, TestDecimalType) { ASSERT_EQ(t1.scale, 4); ASSERT_EQ(t1.ToString(), std::string("decimal(8, 4)")); + + // Test properties + ASSERT_EQ(t1.byte_width(), 4); + ASSERT_EQ(t1.bit_width(), 32); } +TEST(TypesTest, TestDecimal64Type) { + DecimalType t1(12, 5); + + ASSERT_EQ(t1.type, Type::DECIMAL); + ASSERT_EQ(t1.precision, 12); + ASSERT_EQ(t1.scale, 5); + + ASSERT_EQ(t1.ToString(), std::string("decimal(12, 5)")); + + // Test properties + ASSERT_EQ(t1.byte_width(), 8); + ASSERT_EQ(t1.bit_width(), 64); +} + +TEST(TypesTest, TestDecimal128Type) { + DecimalType t1(27, 7); + + ASSERT_EQ(t1.type, Type::DECIMAL); + ASSERT_EQ(t1.precision, 27); + ASSERT_EQ(t1.scale, 7); + + ASSERT_EQ(t1.ToString(), std::string("decimal(27, 7)")); + + // Test properties + ASSERT_EQ(t1.byte_width(), 16); + ASSERT_EQ(t1.bit_width(), 128); +} + +template +class DecimalTestBase { + public: + virtual std::vector data( + const std::vector& input, size_t byte_width) const = 0; + + void test(int precision, const std::vector& draw, + const std::vector& valid_bytes, + const std::vector& sign_bitmap = {}, int64_t offset = 0) const { + auto type = std::make_shared(precision, 4); + int byte_width = type->byte_width(); + auto pool = default_memory_pool(); + auto builder = std::make_shared(pool, type); + size_t null_count = 0; + + size_t size = draw.size(); + builder->Reserve(size); + + for (size_t i = 0; i < size; ++i) { + if (valid_bytes[i]) { + builder->Append(draw[i]); + } else { + builder->AppendNull(); + ++null_count; + } + } + + std::shared_ptr expected_sign_bitmap; + if (!sign_bitmap.empty()) { + BitUtil::BytesToBits(sign_bitmap, &expected_sign_bitmap); + } + + auto raw_bytes = data(draw, byte_width); + auto expected_data = std::make_shared(raw_bytes.data(), size * byte_width); + auto expected_null_bitmap = test::bytes_to_null_buffer(valid_bytes); + int64_t expected_null_count = test::null_count(valid_bytes); + auto expected = std::make_shared(type, size, expected_data, + expected_null_bitmap, expected_null_count, offset, expected_sign_bitmap); + + std::shared_ptr out; + ASSERT_OK(builder->Finish(&out)); + ASSERT_TRUE(out->Equals(*expected)); + } +}; + +template +class DecimalTest : public DecimalTestBase { + public: + std::vector data( + const std::vector& input, size_t byte_width) const override { + std::vector result; + result.reserve(input.size() * byte_width); + // TODO(phillipc): There's probably a better way to do this + constexpr static const size_t bytes_per_element = sizeof(T); + for (size_t i = 0, j = 0; i < input.size(); ++i, j += bytes_per_element) { + *reinterpret_cast(&result[j]) = input[i].value; + } + return result; + } +}; + +template <> +class DecimalTest : public DecimalTestBase { + public: + std::vector data( + const std::vector& input, size_t byte_width) const override { + std::vector result; + result.reserve(input.size() * byte_width); + constexpr static const size_t bytes_per_element = 16; + for (size_t i = 0; i < input.size(); ++i) { + uint8_t stack_bytes[bytes_per_element] = {0}; + uint8_t* bytes = stack_bytes; + bool is_negative; + ToBytes(input[i], &bytes, &is_negative); + + for (size_t i = 0; i < bytes_per_element; ++i) { + result.push_back(bytes[i]); + } 
+ } + return result; + } +}; + +class Decimal32BuilderTest : public ::testing::TestWithParam, + public DecimalTest {}; + +class Decimal64BuilderTest : public ::testing::TestWithParam, + public DecimalTest {}; + +class Decimal128BuilderTest : public ::testing::TestWithParam, + public DecimalTest {}; + +TEST_P(Decimal32BuilderTest, NoNulls) { + int precision = GetParam(); + std::vector draw = { + Decimal32(1), Decimal32(2), Decimal32(2389), Decimal32(4), Decimal32(-12348)}; + std::vector valid_bytes = {true, true, true, true, true}; + this->test(precision, draw, valid_bytes); +} + +TEST_P(Decimal64BuilderTest, NoNulls) { + int precision = GetParam(); + std::vector draw = { + Decimal64(1), Decimal64(2), Decimal64(2389), Decimal64(4), Decimal64(-12348)}; + std::vector valid_bytes = {true, true, true, true, true}; + this->test(precision, draw, valid_bytes); +} + +TEST_P(Decimal128BuilderTest, NoNulls) { + int precision = GetParam(); + std::vector draw = { + Decimal128(1), Decimal128(-2), Decimal128(2389), Decimal128(4), Decimal128(-12348)}; + std::vector valid_bytes = {true, true, true, true, true}; + std::vector sign_bitmap = {false, true, false, false, true}; + this->test(precision, draw, valid_bytes, sign_bitmap); +} + +TEST_P(Decimal32BuilderTest, WithNulls) { + int precision = GetParam(); + std::vector draw = { + Decimal32(1), Decimal32(2), Decimal32(-1), Decimal32(4), Decimal32(-1)}; + std::vector valid_bytes = {true, true, false, true, false}; + this->test(precision, draw, valid_bytes); +} + +TEST_P(Decimal64BuilderTest, WithNulls) { + int precision = GetParam(); + std::vector draw = { + Decimal64(-1), Decimal64(2), Decimal64(-1), Decimal64(4), Decimal64(-1)}; + std::vector valid_bytes = {true, true, false, true, false}; + this->test(precision, draw, valid_bytes); +} + +TEST_P(Decimal128BuilderTest, WithNulls) { + int precision = GetParam(); + std::vector draw = {Decimal128(1), Decimal128(2), Decimal128(-1), + Decimal128(4), Decimal128(-1), Decimal128(1), Decimal128(2), + Decimal128("230342903942.234234"), Decimal128("-23049302932.235234")}; + std::vector valid_bytes = { + true, true, false, true, false, true, true, true, true}; + std::vector sign_bitmap = { + false, false, false, false, false, false, false, false, true}; + this->test(precision, draw, valid_bytes, sign_bitmap); +} + +INSTANTIATE_TEST_CASE_P(Decimal32BuilderTest, Decimal32BuilderTest, + ::testing::Range( + DecimalPrecision::minimum, DecimalPrecision::maximum)); +INSTANTIATE_TEST_CASE_P(Decimal64BuilderTest, Decimal64BuilderTest, + ::testing::Range( + DecimalPrecision::minimum, DecimalPrecision::maximum)); +INSTANTIATE_TEST_CASE_P(Decimal128BuilderTest, Decimal128BuilderTest, + ::testing::Range( + DecimalPrecision::minimum, DecimalPrecision::maximum)); + } // namespace arrow diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index bd20654bc87d4..4e73e7176fa9c 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -27,6 +27,7 @@ #include "arrow/status.h" #include "arrow/type_traits.h" #include "arrow/util/bit-util.h" +#include "arrow/util/decimal.h" #include "arrow/util/logging.h" #include "arrow/visitor.h" #include "arrow/visitor_inline.h" @@ -283,10 +284,8 @@ std::shared_ptr StringArray::Slice(int64_t offset, int64_t length) const FixedSizeBinaryArray::FixedSizeBinaryArray(const std::shared_ptr& type, int64_t length, const std::shared_ptr& data, const std::shared_ptr& null_bitmap, int64_t null_count, int64_t offset) - : PrimitiveArray(type, length, data, null_bitmap, null_count, offset) { - 
DCHECK(type->type == Type::FIXED_SIZE_BINARY); - byte_width_ = static_cast(*type).byte_width(); -} + : PrimitiveArray(type, length, data, null_bitmap, null_count, offset), + byte_width_(static_cast(*type).byte_width()) {} std::shared_ptr FixedSizeBinaryArray::Slice(int64_t offset, int64_t length) const { ConformSliceParams(offset_, length_, &offset, &length); @@ -294,6 +293,48 @@ std::shared_ptr FixedSizeBinaryArray::Slice(int64_t offset, int64_t lengt type_, length, data_, null_bitmap_, kUnknownNullCount, offset); } +const uint8_t* FixedSizeBinaryArray::GetValue(int64_t i) const { + return raw_data_ + (i + offset_) * byte_width_; +} + +// ---------------------------------------------------------------------- +// Decimal +DecimalArray::DecimalArray(const std::shared_ptr& type, int64_t length, + const std::shared_ptr& data, const std::shared_ptr& null_bitmap, + int64_t null_count, int64_t offset, const std::shared_ptr& sign_bitmap) + : FixedSizeBinaryArray(type, length, data, null_bitmap, null_count, offset), + sign_bitmap_(sign_bitmap), + sign_bitmap_data_(sign_bitmap != nullptr ? sign_bitmap->data() : nullptr) {} + +bool DecimalArray::IsNegative(int64_t i) const { + return sign_bitmap_data_ != nullptr ? BitUtil::GetBit(sign_bitmap_data_, i) : false; +} + +template +ARROW_EXPORT Decimal DecimalArray::Value(int64_t i) const { + Decimal result; + FromBytes(GetValue(i), &result); + return result; +} + +template ARROW_EXPORT Decimal32 DecimalArray::Value(int64_t i) const; +template ARROW_EXPORT Decimal64 DecimalArray::Value(int64_t i) const; + +template <> +ARROW_EXPORT Decimal128 DecimalArray::Value(int64_t i) const { + Decimal128 result; + FromBytes(GetValue(i), IsNegative(i), &result); + return result; +} + +template ARROW_EXPORT Decimal128 DecimalArray::Value(int64_t i) const; + +std::shared_ptr DecimalArray::Slice(int64_t offset, int64_t length) const { + ConformSliceParams(offset_, length_, &offset, &length); + return std::make_shared( + type_, length, data_, null_bitmap_, kUnknownNullCount, offset, sign_bitmap_); +} + // ---------------------------------------------------------------------- // Struct diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 9f0e73914da84..a4117facdefd0 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -39,6 +39,9 @@ class MemoryPool; class MutableBuffer; class Status; +template +struct Decimal; + /// Immutable data array with some logical type and some length. /// /// Any memory is owned by the respective Buffer instance (or its parents). 
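GetValue, now defined out of line in array.cc above, resolves an element with plain pointer arithmetic: the slice offset is added to the logical index and scaled by the byte width. The same lookup rendered in Python as a sketch, where buf, offset, and byte_width stand in for the C++ members:

.. code-block:: python

    def get_value(buf: bytes, offset: int, byte_width: int, i: int) -> bytes:
        # Element i starts at (i + offset) * byte_width and spans byte_width bytes.
        start = (i + offset) * byte_width
        return buf[start:start + byte_width]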
@@ -356,9 +359,7 @@ class ARROW_EXPORT FixedSizeBinaryArray : public PrimitiveArray { const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, int64_t offset = 0); - const uint8_t* GetValue(int64_t i) const { - return raw_data_ + (i + offset_) * byte_width_; - } + const uint8_t* GetValue(int64_t i) const; int32_t byte_width() const { return byte_width_; } @@ -370,6 +371,30 @@ class ARROW_EXPORT FixedSizeBinaryArray : public PrimitiveArray { int32_t byte_width_; }; +// ---------------------------------------------------------------------- +// DecimalArray +class ARROW_EXPORT DecimalArray : public FixedSizeBinaryArray { + public: + using TypeClass = Type; + + DecimalArray(const std::shared_ptr& type, int64_t length, + const std::shared_ptr& data, + const std::shared_ptr& null_bitmap = nullptr, int64_t null_count = 0, + int64_t offset = 0, const std::shared_ptr& sign_bitmap = nullptr); + + bool IsNegative(int64_t i) const; + + template + ARROW_EXPORT Decimal Value(int64_t i) const; + + std::shared_ptr Slice(int64_t offset, int64_t length) const override; + + private: + /// Only needed for 128 bit Decimals + std::shared_ptr sign_bitmap_; + const uint8_t* sign_bitmap_data_; +}; + // ---------------------------------------------------------------------- // Struct diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index 40b81cf015ab4..a3677eff68669 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -27,6 +27,7 @@ #include "arrow/type.h" #include "arrow/type_traits.h" #include "arrow/util/bit-util.h" +#include "arrow/util/decimal.h" #include "arrow/util/logging.h" namespace arrow { @@ -323,6 +324,85 @@ Status BooleanBuilder::Append( return Status::OK(); } +// ---------------------------------------------------------------------- +// DecimalBuilder +DecimalBuilder::DecimalBuilder(MemoryPool* pool, const std::shared_ptr& type) + : FixedSizeBinaryBuilder(pool, type), + sign_bitmap_(nullptr), + sign_bitmap_data_(nullptr) {} + +template +ARROW_EXPORT Status DecimalBuilder::Append(const Decimal& val) { + DCHECK_EQ(sign_bitmap_, nullptr) << "sign_bitmap_ is not null"; + DCHECK_EQ(sign_bitmap_data_, nullptr) << "sign_bitmap_data_ is not null"; + + RETURN_NOT_OK(FixedSizeBinaryBuilder::Reserve(1)); + return FixedSizeBinaryBuilder::Append(reinterpret_cast(&val.value)); +} + +template ARROW_EXPORT Status DecimalBuilder::Append(const Decimal32& val); +template ARROW_EXPORT Status DecimalBuilder::Append(const Decimal64& val); + +template <> +ARROW_EXPORT Status DecimalBuilder::Append(const Decimal128& value) { + DCHECK_NE(sign_bitmap_, nullptr) << "sign_bitmap_ is null"; + DCHECK_NE(sign_bitmap_data_, nullptr) << "sign_bitmap_data_ is null"; + + RETURN_NOT_OK(FixedSizeBinaryBuilder::Reserve(1)); + uint8_t stack_bytes[16] = {0}; + uint8_t* bytes = stack_bytes; + bool is_negative; + ToBytes(value, &bytes, &is_negative); + RETURN_NOT_OK(FixedSizeBinaryBuilder::Append(bytes)); + + // TODO(phillipc): calculate the proper storage size here (do we have a function to do + // this)? 
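+ // The sign bitmap must grow in lockstep with the value buffer: one bit per + // appended element, set when the just-appended 128-bit value is negative.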
+ // TODO(phillipc): Reserve number of elements + RETURN_NOT_OK(sign_bitmap_->Reserve(1)); + BitUtil::SetBitTo(sign_bitmap_data_, length_ - 1, is_negative); + return Status::OK(); +} + +template ARROW_EXPORT Status DecimalBuilder::Append(const Decimal128& val); + +Status DecimalBuilder::Init(int64_t capacity) { + RETURN_NOT_OK(FixedSizeBinaryBuilder::Init(capacity)); + if (byte_width_ == 16) { + AllocateResizableBuffer(pool_, null_bitmap_->size(), &sign_bitmap_); + sign_bitmap_data_ = sign_bitmap_->mutable_data(); + memset(sign_bitmap_data_, 0, static_cast(sign_bitmap_->capacity())); + } + return Status::OK(); +} + +Status DecimalBuilder::Resize(int64_t capacity) { + int64_t old_bytes = null_bitmap_ != nullptr ? null_bitmap_->size() : 0; + if (sign_bitmap_ == nullptr) { return Init(capacity); } + RETURN_NOT_OK(FixedSizeBinaryBuilder::Resize(capacity)); + + if (byte_width_ == 16) { + RETURN_NOT_OK(sign_bitmap_->Resize(null_bitmap_->size())); + int64_t new_bytes = sign_bitmap_->size(); + sign_bitmap_data_ = sign_bitmap_->mutable_data(); + + // The buffer might be overpadded to deal with padding according to the spec + if (old_bytes < new_bytes) { + memset(sign_bitmap_data_ + old_bytes, 0, + static_cast(sign_bitmap_->capacity() - old_bytes)); + } + } + return Status::OK(); +} + +Status DecimalBuilder::Finish(std::shared_ptr* out) { + std::shared_ptr data = byte_builder_.Finish(); + + /// TODO(phillipc): not sure where to get the offset argument here + *out = std::make_shared( + type_, length_, data, null_bitmap_, null_count_, 0, sign_bitmap_); + return Status::OK(); +} + // ---------------------------------------------------------------------- // ListBuilder @@ -440,10 +520,9 @@ Status StringBuilder::Finish(std::shared_ptr* out) { FixedSizeBinaryBuilder::FixedSizeBinaryBuilder( MemoryPool* pool, const std::shared_ptr& type) - : ArrayBuilder(pool, type), byte_builder_(pool) { - DCHECK(type->type == Type::FIXED_SIZE_BINARY); - byte_width_ = static_cast(*type).byte_width(); -} + : ArrayBuilder(pool, type), + byte_width_(static_cast(*type).byte_width()), + byte_builder_(pool) {} Status FixedSizeBinaryBuilder::Append(const uint8_t* value) { RETURN_NOT_OK(Reserve(1)); @@ -543,6 +622,7 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, BUILDER_CASE(STRING, StringBuilder); BUILDER_CASE(BINARY, BinaryBuilder); BUILDER_CASE(FIXED_SIZE_BINARY, FixedSizeBinaryBuilder); + BUILDER_CASE(DECIMAL, DecimalBuilder); case Type::LIST: { std::shared_ptr value_builder; std::shared_ptr value_type = diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index 60cdc4cb3a5db..d42ab5b01d1ba 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -37,6 +37,9 @@ namespace arrow { class Array; +template +struct Decimal; + static constexpr int64_t kMinBuilderCapacity = 1 << 5; /// Base class for all data array builders. @@ -76,12 +79,12 @@ class ARROW_EXPORT ArrayBuilder { Status SetNotNull(int64_t length); /// Allocates initial capacity requirements for the builder. In most - /// cases subclasses should override and call there parent classes + /// cases subclasses should override and call their parent class's /// method as well. virtual Status Init(int64_t capacity); /// Resizes the null_bitmap array. In most - /// cases subclasses should override and call there parent classes + /// cases subclasses should override and call their parent class's /// method as well. 
virtual Status Resize(int64_t new_bits); @@ -275,9 +278,7 @@ class ARROW_EXPORT BooleanBuilder : public ArrayBuilder { return Status::OK(); } - Status Append(uint8_t val) { - return Append(val != 0); - } + Status Append(uint8_t val) { return Append(val != 0); } /// Vector append /// @@ -415,6 +416,24 @@ class ARROW_EXPORT FixedSizeBinaryBuilder : public ArrayBuilder { BufferBuilder byte_builder_; }; +class ARROW_EXPORT DecimalBuilder : public FixedSizeBinaryBuilder { + public: + explicit DecimalBuilder(MemoryPool* pool, const std::shared_ptr& type); + + template + ARROW_EXPORT Status Append(const Decimal& val); + + Status Init(int64_t capacity) override; + Status Resize(int64_t capacity) override; + Status Finish(std::shared_ptr* out) override; + + private: + /// We only need these for 128 bit decimals, because boost stores the sign + /// separate from the underlying bytes. + std::shared_ptr sign_bitmap_; + uint8_t* sign_bitmap_data_; +}; + // ---------------------------------------------------------------------- // Struct diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc index 7451439a875d6..2297e4b206d1f 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -29,6 +29,7 @@ #include "arrow/type.h" #include "arrow/type_traits.h" #include "arrow/util/bit-util.h" +#include "arrow/util/decimal.h" #include "arrow/util/logging.h" #include "arrow/visitor_inline.h" @@ -232,6 +233,41 @@ class RangeEqualsVisitor { return Status::OK(); } + Status Visit(const DecimalArray& left) { + const auto& right = static_cast(right_); + + int32_t width = left.byte_width(); + + const uint8_t* left_data = nullptr; + const uint8_t* right_data = nullptr; + + if (left.data()) { left_data = left.raw_data() + left.offset() * width; } + + if (right.data()) { right_data = right.raw_data() + right.offset() * width; } + + for (int64_t i = left_start_idx_, o_i = right_start_idx_; i < left_end_idx_; + ++i, ++o_i) { + if (left.IsNegative(i) != right.IsNegative(o_i)) { + result_ = false; + return Status::OK(); + } + + const bool is_null = left.IsNull(i); + if (is_null != right.IsNull(o_i)) { + result_ = false; + return Status::OK(); + } + if (is_null) continue; + + if (std::memcmp(left_data + width * i, right_data + width * o_i, width)) { + result_ = false; + return Status::OK(); + } + } + result_ = true; + return Status::OK(); + } + Status Visit(const NullArray& left) { UNUSED(left); result_ = true; @@ -244,10 +280,6 @@ class RangeEqualsVisitor { return CompareValues(left); } - Status Visit(const DecimalArray& left) { - return Status::NotImplemented("Decimal type"); - } - Status Visit(const ListArray& left) { result_ = CompareLists(left); return Status::OK(); diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index 57db03311c06f..c6880c56e466b 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -27,7 +27,8 @@ set(ARROW_IPC_SHARED_LINK_LIBS set(ARROW_IPC_TEST_LINK_LIBS arrow_ipc_static arrow_io_static - arrow_static) + arrow_static + ${BOOST_REGEX_LIBRARY}) set(ARROW_IPC_SRCS feather.cc @@ -161,7 +162,8 @@ if(MSVC) arrow_io_static arrow_static ${BOOST_FILESYSTEM_LIBRARY} - ${BOOST_SYSTEM_LIBRARY}) + ${BOOST_SYSTEM_LIBRARY} + ${BOOST_REGEX_LIBRARY}) else() set(UTIL_LINK_LIBS arrow_ipc_static @@ -169,6 +171,7 @@ else() arrow_static ${BOOST_FILESYSTEM_LIBRARY} ${BOOST_SYSTEM_LIBRARY} + ${BOOST_REGEX_LIBRARY} dl) endif() diff --git a/cpp/src/arrow/python/CMakeLists.txt b/cpp/src/arrow/python/CMakeLists.txt index 
c69d976737f91..604527f6304ac 100644 --- a/cpp/src/arrow/python/CMakeLists.txt +++ b/cpp/src/arrow/python/CMakeLists.txt @@ -37,7 +37,8 @@ set(ARROW_PYTHON_MIN_TEST_LIBS arrow_python_static arrow_ipc_static arrow_io_static - arrow_static) + arrow_static + ${BOOST_REGEX_LIBRARY}) if(ARROW_BUILD_TESTS) ADD_THIRDPARTY_LIB(python diff --git a/cpp/src/arrow/python/builtin_convert.cc b/cpp/src/arrow/python/builtin_convert.cc index 25b32ee26a06b..189ecee4fe022 100644 --- a/cpp/src/arrow/python/builtin_convert.cc +++ b/cpp/src/arrow/python/builtin_convert.cc @@ -17,12 +17,16 @@ #include #include + +#include #include +#include #include "arrow/python/builtin_convert.h" #include "arrow/api.h" #include "arrow/status.h" +#include "arrow/util/decimal.h" #include "arrow/util/logging.h" #include "arrow/python/helpers.h" @@ -109,7 +113,6 @@ class ScalarVisitor { int64_t float_count_; int64_t binary_count_; int64_t unicode_count_; - // Place to accumulate errors // std::vector errors_; }; @@ -394,8 +397,7 @@ class BytesConverter : public TypedConverter { } else if (PyBytes_Check(item)) { bytes_obj = item; } else { - return Status::Invalid( - "Value that cannot be converted to bytes was encountered"); + return Status::Invalid("Value that cannot be converted to bytes was encountered"); } // No error checking length = PyBytes_GET_SIZE(bytes_obj); @@ -429,8 +431,7 @@ class FixedWidthBytesConverter : public TypedConverter { } else if (PyBytes_Check(item)) { bytes_obj = item; } else { - return Status::Invalid( - "Value that cannot be converted to bytes was encountered"); + return Status::Invalid("Value that cannot be converted to bytes was encountered"); } // No error checking RETURN_NOT_OK(CheckPythonBytesAreFixedLength(bytes_obj, expected_length)); @@ -495,6 +496,54 @@ class ListConverter : public TypedConverter { std::shared_ptr value_converter_; }; +#define DECIMAL_CONVERT_CASE(bit_width, item, builder) \ + case bit_width: { \ + arrow::Decimal##bit_width out; \ + RETURN_NOT_OK(PythonDecimalToArrowDecimal((item), &out)); \ + RETURN_NOT_OK((builder)->Append(out)); \ + break; \ + } + +class DecimalConverter : public TypedConverter { + public: + Status AppendData(PyObject* seq) override { + /// Ensure we've allocated enough space + Py_ssize_t size = PySequence_Size(seq); + RETURN_NOT_OK(typed_builder_->Reserve(size)); + + /// Can the compiler figure out that the case statement below isn't necessary + /// once we're running? + const int bit_width = + std::dynamic_pointer_cast(typed_builder_->type()) + ->bit_width(); + + OwnedRef ref; + PyObject* item = nullptr; + for (int64_t i = 0; i < size; ++i) { + ref.reset(PySequence_GetItem(seq, i)); + item = ref.obj(); + + /// TODO(phillipc): Check for nan? 
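+ // Dispatch on the decimal type's bit width: each DECIMAL_CONVERT_CASE parses + // the Python Decimal through its string form into an Arrow decimal of the + // matching width and appends it to the builder.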
+ if (item != Py_None) { + switch (bit_width) { + DECIMAL_CONVERT_CASE(32, item, typed_builder_) + DECIMAL_CONVERT_CASE(64, item, typed_builder_) + DECIMAL_CONVERT_CASE(128, item, typed_builder_) + default: + break; + } + RETURN_IF_PYERROR(); + } else { + RETURN_NOT_OK(typed_builder_->AppendNull()); + } + } + + return Status::OK(); + } +}; + +#undef DECIMAL_CONVERT_CASE + // Dynamic constructor for sequence converters std::shared_ptr GetConverter(const std::shared_ptr& type) { switch (type->type) { @@ -516,6 +565,9 @@ std::shared_ptr GetConverter(const std::shared_ptr& type return std::make_shared(); case Type::LIST: return std::make_shared(); + case Type::DECIMAL: { + return std::make_shared(); + } case Type::STRUCT: default: return nullptr; diff --git a/cpp/src/arrow/python/builtin_convert.h b/cpp/src/arrow/python/builtin_convert.h index 00ff0fd8236fc..3c2e350269a78 100644 --- a/cpp/src/arrow/python/builtin_convert.h +++ b/cpp/src/arrow/python/builtin_convert.h @@ -25,7 +25,7 @@ #include -#include +#include "arrow/type.h" #include "arrow/util/visibility.h" diff --git a/cpp/src/arrow/python/common.h b/cpp/src/arrow/python/common.h index 32bfa784acbd0..a6806ab95ab95 100644 --- a/cpp/src/arrow/python/common.h +++ b/cpp/src/arrow/python/common.h @@ -57,12 +57,13 @@ class OwnedRef { } void reset(PyObject* obj) { - if (obj_ != nullptr) { Py_XDECREF(obj_); } + /// TODO(phillipc): Should we acquire the GIL here? It definitely needs to be + /// acquired, + /// but callers have probably already acquired it + Py_XDECREF(obj_); obj_ = obj; } - void release() { obj_ = nullptr; } - PyObject* obj() const { return obj_; } private: @@ -72,6 +73,7 @@ class OwnedRef { struct PyObjectStringify { OwnedRef tmp_obj; const char* bytes; + Py_ssize_t size; explicit PyObjectStringify(PyObject* obj) { PyObject* bytes_obj; @@ -82,6 +84,7 @@ struct PyObjectStringify { bytes_obj = obj; } bytes = PyBytes_AsString(bytes_obj); + size = PyBytes_GET_SIZE(bytes_obj); } }; diff --git a/cpp/src/arrow/python/helpers.cc b/cpp/src/arrow/python/helpers.cc index be5f412fbea1c..ffba7bbc21c14 100644 --- a/cpp/src/arrow/python/helpers.cc +++ b/cpp/src/arrow/python/helpers.cc @@ -16,6 +16,8 @@ // under the License. 
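The helpers added below wrap CPython's import machinery in Status-returning functions. Their effect is the same as this ordinary Python (illustrative only):

.. code-block:: python

    import importlib

    decimal_module = importlib.import_module("decimal")  # ImportModule("decimal", &ref)
    Decimal = getattr(decimal_module, "Decimal")         # ImportFromModule(ref, "Decimal", &attr)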
#include "arrow/python/helpers.h" +#include "arrow/python/common.h" +#include "arrow/util/decimal.h" #include @@ -52,5 +54,82 @@ std::shared_ptr GetPrimitiveType(Type::type type) { } } +Status ImportModule(const std::string& module_name, OwnedRef* ref) { + PyAcquireGIL lock; + PyObject* module = PyImport_ImportModule(module_name.c_str()); + RETURN_IF_PYERROR(); + ref->reset(module); + return Status::OK(); +} + +Status ImportFromModule(const OwnedRef& module, const std::string& name, OwnedRef* ref) { + /// Assumes that ImportModule was called first + DCHECK_NE(module.obj(), nullptr) << "Cannot import from nullptr Python module"; + + PyAcquireGIL lock; + PyObject* attr = PyObject_GetAttrString(module.obj(), name.c_str()); + RETURN_IF_PYERROR(); + ref->reset(attr); + return Status::OK(); +} + +template +Status PythonDecimalToArrowDecimal(PyObject* python_decimal, Decimal* arrow_decimal) { + // Call Python's str(decimal_object) + OwnedRef str_obj(PyObject_Str(python_decimal)); + RETURN_IF_PYERROR(); + + PyObjectStringify str(str_obj.obj()); + RETURN_IF_PYERROR(); + + const char* bytes = str.bytes; + DCHECK_NE(bytes, nullptr); + + Py_ssize_t size = str.size; + + std::string c_string(bytes, size); + return FromString(c_string, arrow_decimal); +} + +template Status PythonDecimalToArrowDecimal( + PyObject* python_decimal, Decimal32* arrow_decimal); +template Status PythonDecimalToArrowDecimal( + PyObject* python_decimal, Decimal64* arrow_decimal); +template Status PythonDecimalToArrowDecimal( + PyObject* python_decimal, Decimal128* arrow_decimal); + +Status InferDecimalPrecisionAndScale( + PyObject* python_decimal, int* precision, int* scale) { + // Call Python's str(decimal_object) + OwnedRef str_obj(PyObject_Str(python_decimal)); + RETURN_IF_PYERROR(); + PyObjectStringify str(str_obj.obj()); + + const char* bytes = str.bytes; + DCHECK_NE(bytes, nullptr); + + auto size = str.size; + + std::string c_string(bytes, size); + return FromString(c_string, static_cast(nullptr), precision, scale); +} + +Status DecimalFromString( + PyObject* decimal_constructor, const std::string& decimal_string, PyObject** out) { + DCHECK_NE(decimal_constructor, nullptr); + DCHECK_NE(out, nullptr); + + auto string_size = decimal_string.size(); + DCHECK_GT(string_size, 0); + + auto string_bytes = decimal_string.c_str(); + DCHECK_NE(string_bytes, nullptr); + + *out = PyObject_CallFunction( + decimal_constructor, const_cast("s#"), string_bytes, string_size); + RETURN_IF_PYERROR(); + return Status::OK(); +} + } // namespace py } // namespace arrow diff --git a/cpp/src/arrow/python/helpers.h b/cpp/src/arrow/python/helpers.h index 611e814b7d858..a19b25f7db805 100644 --- a/cpp/src/arrow/python/helpers.h +++ b/cpp/src/arrow/python/helpers.h @@ -18,16 +18,38 @@ #ifndef PYARROW_HELPERS_H #define PYARROW_HELPERS_H +#include + #include +#include +#include #include "arrow/type.h" #include "arrow/util/visibility.h" namespace arrow { + +template +struct Decimal; + namespace py { -ARROW_EXPORT -std::shared_ptr GetPrimitiveType(Type::type type); +class OwnedRef; + +ARROW_EXPORT std::shared_ptr GetPrimitiveType(Type::type type); + +Status ImportModule(const std::string& module_name, OwnedRef* ref); +Status ImportFromModule( + const OwnedRef& module, const std::string& module_name, OwnedRef* ref); + +template +Status PythonDecimalToArrowDecimal(PyObject* python_decimal, Decimal* arrow_decimal); + +Status InferDecimalPrecisionAndScale( + PyObject* python_decimal, int* precision = nullptr, int* scale = nullptr); + +Status DecimalFromString( + 
PyObject* decimal_constructor, const std::string& decimal_string, PyObject** out); } // namespace py } // namespace arrow diff --git a/cpp/src/arrow/python/pandas_convert.cc b/cpp/src/arrow/python/pandas_convert.cc index 48d3489bf900b..f6e627e668e2d 100644 --- a/cpp/src/arrow/python/pandas_convert.cc +++ b/cpp/src/arrow/python/pandas_convert.cc @@ -41,12 +41,14 @@ #include "arrow/type_fwd.h" #include "arrow/type_traits.h" #include "arrow/util/bit-util.h" +#include "arrow/util/decimal.h" #include "arrow/util/logging.h" #include "arrow/util/macros.h" #include "arrow/python/builtin_convert.h" #include "arrow/python/common.h" #include "arrow/python/config.h" +#include "arrow/python/helpers.h" #include "arrow/python/numpy-internal.h" #include "arrow/python/numpy_convert.h" #include "arrow/python/type_traits.h" @@ -375,6 +377,7 @@ class PandasConverter : public TypeVisitor { Status ConvertDates(); Status ConvertLists(const std::shared_ptr& type); Status ConvertObjects(); + Status ConvertDecimals(); protected: MemoryPool* pool_; @@ -468,15 +471,14 @@ Status InvalidConversion(PyObject* obj, const std::string& expected_type_name) { RETURN_IF_PYERROR(); DCHECK_NE(type_name.obj(), nullptr); - OwnedRef bytes_obj(PyUnicode_AsUTF8String(type_name.obj())); + PyObjectStringify bytestring(type_name.obj()); RETURN_IF_PYERROR(); - DCHECK_NE(bytes_obj.obj(), nullptr); - - Py_ssize_t size = PyBytes_GET_SIZE(bytes_obj.obj()); - const char* bytes = PyBytes_AS_STRING(bytes_obj.obj()); + const char* bytes = bytestring.bytes; DCHECK_NE(bytes, nullptr) << "bytes from type(...).__name__ were null"; + Py_ssize_t size = bytestring.size; + std::string cpp_type_name(bytes, size); std::stringstream ss; @@ -517,6 +519,59 @@ Status PandasConverter::ConvertDates() { return date_builder.Finish(&out_); } +#define CONVERT_DECIMAL_CASE(bit_width, builder, object) \ + case bit_width: { \ + Decimal##bit_width d; \ + RETURN_NOT_OK(PythonDecimalToArrowDecimal((object), &d)); \ + RETURN_NOT_OK((builder).Append(d)); \ + break; \ + } + +Status PandasConverter::ConvertDecimals() { + PyAcquireGIL lock; + + // Import the decimal module and Decimal class + OwnedRef decimal; + OwnedRef Decimal; + RETURN_NOT_OK(ImportModule("decimal", &decimal)); + RETURN_NOT_OK(ImportFromModule(decimal, "Decimal", &Decimal)); + + PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); + PyObject* object = objects[0]; + + int precision; + int scale; + + RETURN_NOT_OK(InferDecimalPrecisionAndScale(object, &precision, &scale)); + + type_ = std::make_shared(precision, scale); + + const int bit_width = std::dynamic_pointer_cast(type_)->bit_width(); + DecimalBuilder decimal_builder(pool_, type_); + + RETURN_NOT_OK(decimal_builder.Resize(length_)); + + for (int64_t i = 0; i < length_; ++i) { + object = objects[i]; + if (PyObject_IsInstance(object, Decimal.obj())) { + switch (bit_width) { + CONVERT_DECIMAL_CASE(32, decimal_builder, object) + CONVERT_DECIMAL_CASE(64, decimal_builder, object) + CONVERT_DECIMAL_CASE(128, decimal_builder, object) + default: + break; + } + } else if (PyObject_is_null(object)) { + decimal_builder.AppendNull(); + } else { + return InvalidConversion(object, "decimal.Decimal"); + } + } + return decimal_builder.Finish(&out_); +} + +#undef CONVERT_DECIMAL_CASE + Status PandasConverter::ConvertObjectStrings() { PyAcquireGIL lock; @@ -554,6 +609,90 @@ Status PandasConverter::ConvertObjectFixedWidthBytes( return Status::OK(); } +template +Status validate_precision(int precision) { + constexpr static const int maximum_precision = 
DecimalPrecision::maximum; + if (!(precision > 0 && precision <= maximum_precision)) { + std::stringstream ss; + ss << "Invalid precision: " << precision << ". Minimum is 1, maximum is " + << maximum_precision; + return Status::Invalid(ss.str()); + } + return Status::OK(); +} + +template +Status RawDecimalToString( + const uint8_t* bytes, int precision, int scale, std::string* result) { + DCHECK_NE(bytes, nullptr); + DCHECK_NE(result, nullptr); + RETURN_NOT_OK(validate_precision(precision)); + Decimal decimal; + FromBytes(bytes, &decimal); + *result = ToString(decimal, precision, scale); + return Status::OK(); +} + +template Status RawDecimalToString( + const uint8_t*, int, int, std::string* result); +template Status RawDecimalToString( + const uint8_t*, int, int, std::string* result); + +Status RawDecimalToString(const uint8_t* bytes, int precision, int scale, + bool is_negative, std::string* result) { + DCHECK_NE(bytes, nullptr); + DCHECK_NE(result, nullptr); + RETURN_NOT_OK(validate_precision(precision)); + Decimal128 decimal; + FromBytes(bytes, is_negative, &decimal); + *result = ToString(decimal, precision, scale); + return Status::OK(); +} + +static Status ConvertDecimals(const ChunkedArray& data, PyObject** out_values) { + PyAcquireGIL lock; + OwnedRef decimal_ref; + OwnedRef Decimal_ref; + RETURN_NOT_OK(ImportModule("decimal", &decimal_ref)); + RETURN_NOT_OK(ImportFromModule(decimal_ref, "Decimal", &Decimal_ref)); + PyObject* Decimal = Decimal_ref.obj(); + + for (int c = 0; c < data.num_chunks(); c++) { + auto* arr(static_cast(data.chunk(c).get())); + auto type(std::dynamic_pointer_cast(arr->type())); + const int precision = type->precision; + const int scale = type->scale; + const int bit_width = type->bit_width(); + + for (int64_t i = 0; i < arr->length(); ++i) { + if (arr->IsNull(i)) { + Py_INCREF(Py_None); + *out_values++ = Py_None; + } else { + const uint8_t* raw_value = arr->GetValue(i); + std::string s; + switch (bit_width) { + case 32: + RETURN_NOT_OK(RawDecimalToString(raw_value, precision, scale, &s)); + break; + case 64: + RETURN_NOT_OK(RawDecimalToString(raw_value, precision, scale, &s)); + break; + case 128: + RETURN_NOT_OK( + RawDecimalToString(raw_value, precision, scale, arr->IsNegative(i), &s)); + break; + default: + break; + } + RETURN_NOT_OK(DecimalFromString(Decimal, s, out_values++)); + } + } + } + + return Status::OK(); +} + Status PandasConverter::ConvertBooleans() { PyAcquireGIL lock; @@ -598,6 +737,7 @@ Status PandasConverter::ConvertObjects() { // // * Strings // * Booleans with nulls + // * decimal.Decimals // * Mixed type (not supported at the moment by arrow format) // // Additionally, nulls may be encoded either as np.nan or None. 
So we have to @@ -613,6 +753,7 @@ Status PandasConverter::ConvertObjects() { PyDateTime_IMPORT; } + // This means we received an explicit type from the user if (type_) { switch (type_->type) { case Type::STRING: @@ -627,10 +768,17 @@ Status PandasConverter::ConvertObjects() { const auto& list_field = static_cast(*type_); return ConvertLists(list_field.value_field()->type); } + case Type::DECIMAL: + return ConvertDecimals(); default: return Status::TypeError("No known conversion to Arrow type"); } } else { + OwnedRef decimal; + OwnedRef Decimal; + RETURN_NOT_OK(ImportModule("decimal", &decimal)); + RETURN_NOT_OK(ImportFromModule(decimal, "Decimal", &Decimal)); + for (int64_t i = 0; i < length_; ++i) { if (PyObject_is_null(objects[i])) { continue; @@ -640,6 +788,8 @@ Status PandasConverter::ConvertObjects() { return ConvertBooleans(); } else if (PyDate_CheckExact(objects[i])) { return ConvertDates(); + } else if (PyObject_IsInstance(const_cast(objects[i]), Decimal.obj())) { + return ConvertDecimals(); } else { return InvalidConversion( const_cast(objects[i]), "string, bool, or date"); @@ -847,6 +997,7 @@ class PandasBlock { INT64, FLOAT, DOUBLE, + DECIMAL, BOOL, DATETIME, DATETIME_WITH_TZ, @@ -1193,6 +1344,8 @@ class ObjectBlock : public PandasBlock { RETURN_NOT_OK(ConvertBinaryLike(data, out_buffer)); } else if (type == Type::FIXED_SIZE_BINARY) { RETURN_NOT_OK(ConvertFixedSizeBinary(data, out_buffer)); + } else if (type == Type::DECIMAL) { + RETURN_NOT_OK(ConvertDecimals(data, out_buffer)); } else if (type == Type::LIST) { auto list_type = std::static_pointer_cast(col->type()); switch (list_type->value_type()->type) { @@ -1519,6 +1672,7 @@ Status MakeBlock(PandasBlock::type type, int64_t num_rows, int num_columns, BLOCK_CASE(DOUBLE, Float64Block); BLOCK_CASE(BOOL, BoolBlock); BLOCK_CASE(DATETIME, DatetimeBlock); + BLOCK_CASE(DECIMAL, ObjectBlock); default: return Status::NotImplemented("Unsupported block type"); } @@ -1649,6 +1803,9 @@ class DataFrameBlockCreator { case Type::DICTIONARY: output_type = PandasBlock::CATEGORICAL; break; + case Type::DECIMAL: + output_type = PandasBlock::DECIMAL; + break; default: return Status::NotImplemented(col->type()->ToString()); } @@ -1892,6 +2049,7 @@ class ArrowDeserializer { CONVERT_CASE(TIMESTAMP); CONVERT_CASE(DICTIONARY); CONVERT_CASE(LIST); + CONVERT_CASE(DECIMAL); default: { std::stringstream ss; ss << "Arrow type reading not implemented for " << col_->type()->ToString(); @@ -1999,6 +2157,13 @@ class ArrowDeserializer { return ConvertFixedSizeBinary(data_, out_values); } + template + inline typename std::enable_if::type ConvertValues() { + RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); + auto out_values = reinterpret_cast(PyArray_DATA(arr_)); + return ConvertDecimals(data_, out_values); + } + #define CONVERTVALUES_LISTSLIKE_CASE(ArrowType, ArrowEnum) \ case Type::ArrowEnum: \ return ConvertListsLike(col_, out_values); @@ -2021,6 +2186,7 @@ class ArrowDeserializer { CONVERTVALUES_LISTSLIKE_CASE(FloatType, FLOAT) CONVERTVALUES_LISTSLIKE_CASE(DoubleType, DOUBLE) CONVERTVALUES_LISTSLIKE_CASE(StringType, STRING) + CONVERTVALUES_LISTSLIKE_CASE(DecimalType, DECIMAL) default: { std::stringstream ss; ss << "Not implemented type for lists: " << list_type->value_type()->ToString(); diff --git a/cpp/src/arrow/python/python-test.cc b/cpp/src/arrow/python/python-test.cc index f269ebfb642c7..b63d2ffb1cd2c 100644 --- a/cpp/src/arrow/python/python-test.cc +++ b/cpp/src/arrow/python/python-test.cc @@ -28,8 +28,11 @@ #include "arrow/python/builtin_convert.h" #include 
"arrow/python/common.h" +#include "arrow/python/helpers.h" #include "arrow/python/pandas_convert.h" +#include "arrow/util/decimal.h" + namespace arrow { namespace py { @@ -37,6 +40,36 @@ TEST(PyBuffer, InvalidInputObject) { PyBuffer buffer(Py_None); } +TEST(DecimalTest, TestPythonDecimalToArrowDecimal128) { + PyAcquireGIL lock; + + OwnedRef decimal; + OwnedRef Decimal; + ASSERT_OK(ImportModule("decimal", &decimal)); + ASSERT_NE(decimal.obj(), nullptr); + + ASSERT_OK(ImportFromModule(decimal, "Decimal", &Decimal)); + ASSERT_NE(Decimal.obj(), nullptr); + + std::string decimal_string("-39402950693754869342983"); + const char* format = "s#"; + auto c_string = decimal_string.c_str(); + ASSERT_NE(c_string, nullptr); + + auto c_string_size = decimal_string.size(); + ASSERT_GT(c_string_size, 0); + OwnedRef pydecimal(PyObject_CallFunction( + Decimal.obj(), const_cast(format), c_string, c_string_size)); + ASSERT_NE(pydecimal.obj(), nullptr); + ASSERT_EQ(PyErr_Occurred(), nullptr); + + Decimal128 arrow_decimal; + int128_t boost_decimal(decimal_string); + PyObject* obj = pydecimal.obj(); + ASSERT_OK(PythonDecimalToArrowDecimal(obj, &arrow_decimal)); + ASSERT_EQ(boost_decimal, arrow_decimal.value); +} + TEST(PandasConversionTest, TestObjectBlockWriteFails) { StringBuilder builder(default_memory_pool()); const char value[] = {'\xf1', '\0'}; diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index abbb626e0fcb4..df4590f18d733 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -17,6 +17,7 @@ #include "arrow/type.h" +#include #include #include @@ -91,7 +92,7 @@ std::string BinaryType::ToString() const { } int FixedSizeBinaryType::bit_width() const { - return 8 * byte_width(); + return CHAR_BIT * byte_width(); } std::string FixedSizeBinaryType::ToString() const { @@ -380,6 +381,10 @@ std::shared_ptr field( return std::make_shared(name, type, nullable); } +std::shared_ptr decimal(int precision, int scale) { + return std::make_shared(precision, scale); +} + static const BufferDescr kValidityBuffer(BufferType::VALIDITY, 1); static const BufferDescr kOffsetBuffer(BufferType::OFFSET, 32); static const BufferDescr kTypeBuffer(BufferType::TYPE, 32); @@ -402,7 +407,11 @@ std::vector BinaryType::GetBufferLayout() const { } std::vector FixedSizeBinaryType::GetBufferLayout() const { - return {kValidityBuffer, BufferDescr(BufferType::DATA, byte_width_ * 8)}; + return {kValidityBuffer, BufferDescr(BufferType::DATA, bit_width())}; +} + +std::vector DecimalType::GetBufferLayout() const { + return {kValidityBuffer, kBooleanBuffer, BufferDescr(BufferType::DATA, bit_width())}; } std::vector ListType::GetBufferLayout() const { @@ -427,9 +436,4 @@ std::string DecimalType::ToString() const { return s.str(); } -std::vector DecimalType::GetBufferLayout() const { - // TODO(wesm) - return {}; -} - } // namespace arrow diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 36ab9d8b2b9d5..3a35f56381197 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -360,6 +360,8 @@ class ARROW_EXPORT FixedSizeBinaryType : public FixedWidthType { explicit FixedSizeBinaryType(int32_t byte_width) : FixedWidthType(Type::FIXED_SIZE_BINARY), byte_width_(byte_width) {} + explicit FixedSizeBinaryType(int32_t byte_width, Type::type type_id) + : FixedWidthType(type_id), byte_width_(byte_width) {} Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; @@ -399,19 +401,31 @@ struct ARROW_EXPORT StructType : public NestedType { std::vector GetBufferLayout() const override; }; 
-struct ARROW_EXPORT DecimalType : public DataType { +static inline int decimal_byte_width(int precision) { + if (precision >= 0 && precision < 10) { + return 4; + } else if (precision >= 10 && precision < 19) { + return 8; + } else { + // TODO(phillipc): validate that we can't construct > 128 bit types + return 16; + } +} + +struct ARROW_EXPORT DecimalType : public FixedSizeBinaryType { static constexpr Type::type type_id = Type::DECIMAL; explicit DecimalType(int precision_, int scale_) - : DataType(Type::DECIMAL), precision(precision_), scale(scale_) {} - int precision; - int scale; - + : FixedSizeBinaryType(decimal_byte_width(precision_), Type::DECIMAL), + precision(precision_), + scale(scale_) {} + std::vector GetBufferLayout() const override; Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; static std::string name() { return "decimal"; } - std::vector GetBufferLayout() const override; + int precision; + int scale; }; enum class UnionMode : char { SPARSE, DENSE }; diff --git a/cpp/src/arrow/type_fwd.h b/cpp/src/arrow/type_fwd.h index 2e27ce9858964..acf12c3d9d18e 100644 --- a/cpp/src/arrow/type_fwd.h +++ b/cpp/src/arrow/type_fwd.h @@ -69,6 +69,7 @@ class StructBuilder; struct DecimalType; class DecimalArray; +class DecimalBuilder; struct UnionType; class UnionArray; @@ -146,6 +147,7 @@ std::shared_ptr ARROW_EXPORT binary(); std::shared_ptr ARROW_EXPORT date32(); std::shared_ptr ARROW_EXPORT date64(); +std::shared_ptr ARROW_EXPORT decimal(int precision, int scale); } // namespace arrow diff --git a/cpp/src/arrow/type_traits.h b/cpp/src/arrow/type_traits.h index 353b638fed894..3e8ea23432b98 100644 --- a/cpp/src/arrow/type_traits.h +++ b/cpp/src/arrow/type_traits.h @@ -228,6 +228,13 @@ struct TypeTraits { static inline std::shared_ptr type_singleton() { return float64(); } }; +template <> +struct TypeTraits { + using ArrayType = DecimalArray; + using BuilderType = DecimalBuilder; + constexpr static bool is_parameter_free = false; +}; + template <> struct TypeTraits { using ArrayType = BooleanArray; @@ -289,12 +296,6 @@ struct TypeTraits { constexpr static bool is_parameter_free = false; }; -template <> -struct TypeTraits { - // using ArrayType = DecimalArray; - constexpr static bool is_parameter_free = false; -}; - // Not all type classes have a c_type template struct as_void { diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt index c1b6877a3e9ef..054f11055b60e 100644 --- a/cpp/src/arrow/util/CMakeLists.txt +++ b/cpp/src/arrow/util/CMakeLists.txt @@ -22,6 +22,7 @@ # Headers: top level install(FILES bit-util.h + decimal.h logging.h macros.h random.h @@ -70,3 +71,4 @@ endif() ADD_ARROW_TEST(bit-util-test) ADD_ARROW_TEST(stl-util-test) +ADD_ARROW_TEST(decimal-test) diff --git a/cpp/src/arrow/util/bit-util.h b/cpp/src/arrow/util/bit-util.h index 42afd0705f0f9..90a1c3eab9266 100644 --- a/cpp/src/arrow/util/bit-util.h +++ b/cpp/src/arrow/util/bit-util.h @@ -149,7 +149,6 @@ int64_t ARROW_EXPORT CountSetBits( bool ARROW_EXPORT BitmapEquals(const uint8_t* left, int64_t left_offset, const uint8_t* right, int64_t right_offset, int64_t bit_length); - } // namespace arrow #endif // ARROW_UTIL_BIT_UTIL_H diff --git a/cpp/src/arrow/util/decimal-test.cc b/cpp/src/arrow/util/decimal-test.cc new file mode 100644 index 0000000000000..1e22643962d5b --- /dev/null +++ b/cpp/src/arrow/util/decimal-test.cc @@ -0,0 +1,161 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. 
See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. +// + +#include "arrow/util/decimal.h" + +#include "gtest/gtest.h" + +#include "arrow/test-util.h" + +namespace arrow { + +template +class DecimalTest : public ::testing::Test { + public: + DecimalTest() : string_value("234.23445") { integer_value.value = 23423445; } + Decimal integer_value; + std::string string_value; +}; + +typedef ::testing::Types DecimalTypes; +TYPED_TEST_CASE(DecimalTest, DecimalTypes); + +TYPED_TEST(DecimalTest, TestToString) { + Decimal decimal(this->integer_value); + int precision = 8; + int scale = 5; + std::string result = ToString(decimal, precision, scale); + ASSERT_EQ(result, this->string_value); +} + +TYPED_TEST(DecimalTest, TestFromString) { + Decimal expected(this->integer_value); + Decimal result; + int precision, scale; + ASSERT_OK(FromString(this->string_value, &result, &precision, &scale)); + ASSERT_EQ(result.value, expected.value); + ASSERT_EQ(precision, 8); + ASSERT_EQ(scale, 5); +} + +TEST(DecimalTest, TestStringToInt32) { + int32_t value = 0; + StringToInteger("123", "456", 1, &value); + ASSERT_EQ(value, 123456); +} + +TEST(DecimalTest, TestStringToInt64) { + int64_t value = 0; + StringToInteger("123456789", "456", -1, &value); + ASSERT_EQ(value, -123456789456); +} + +TEST(DecimalTest, TestStringToInt128) { + int128_t value = 0; + StringToInteger("123456789", "456789123", 1, &value); + ASSERT_EQ(value, 123456789456789123); +} + +TEST(DecimalTest, TestFromString128) { + static const std::string string_value("-23049223942343532412"); + Decimal result(string_value); + int128_t expected = -230492239423435324; + ASSERT_EQ(result.value, expected * 100 - 12); + + // Sanity check that our number is actually using more than 64 bits + ASSERT_NE(result.value, static_cast(result.value)); +} + +TEST(DecimalTest, TestFromDecimalString128) { + static const std::string string_value("-23049223942343.532412"); + Decimal result(string_value); + int128_t expected = -230492239423435324; + ASSERT_EQ(result.value, expected * 100 - 12); + + // Sanity check that our number is actually using more than 64 bits + ASSERT_NE(result.value, static_cast(result.value)); +} + +TEST(DecimalTest, TestDecimal32Precision) { + auto min_precision = DecimalPrecision::minimum; + auto max_precision = DecimalPrecision::maximum; + ASSERT_EQ(min_precision, 1); + ASSERT_EQ(max_precision, 9); +} + +TEST(DecimalTest, TestDecimal64Precision) { + auto min_precision = DecimalPrecision::minimum; + auto max_precision = DecimalPrecision::maximum; + ASSERT_EQ(min_precision, 10); + ASSERT_EQ(max_precision, 18); +} + +TEST(DecimalTest, TestDecimal128Precision) { + auto min_precision = DecimalPrecision::minimum; + auto max_precision = DecimalPrecision::maximum; + ASSERT_EQ(min_precision, 19); + ASSERT_EQ(max_precision, 38); +} + +TEST(DecimalTest, TestDecimal32SignedRoundTrip) { + Decimal32 
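// This test exercises the raw byte helpers from arrow/util/decimal.h:
// ToBytes writes value.value into a caller-provided buffer of the
// decimal's width, and FromBytes reinterprets those bytes back in host
// byte order. A hypothetical positive-value sketch of the same API:
//
//   Decimal32 in(std::string("12345"));
//   uint8_t buf[4] = {0};
//   uint8_t* p = buf;
//   ToBytes(in, &p);
//   Decimal32 out;
//   FromBytes(buf, &out);  // out.value == 12345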
expected(std::string("-3402692")); + + uint8_t stack_bytes[4] = {0}; + uint8_t* bytes = stack_bytes; + ToBytes(expected, &bytes); + + Decimal32 result; + FromBytes(bytes, &result); + ASSERT_EQ(expected.value, result.value); +} + +TEST(DecimalTest, TestDecimal64SignedRoundTrip) { + Decimal64 expected(std::string("-34034293045.921")); + + uint8_t stack_bytes[8] = {0}; + uint8_t* bytes = stack_bytes; + ToBytes(expected, &bytes); + + Decimal64 result; + FromBytes(bytes, &result); + + ASSERT_EQ(expected.value, result.value); +} + +TEST(DecimalTest, TestDecimal128StringAndBytesRoundTrip) { + std::string string_value("-340282366920938463463374607431.711455"); + Decimal128 expected(string_value); + + std::string expected_string_value("-340282366920938463463374607431711455"); + int128_t expected_underlying_value(expected_string_value); + + ASSERT_EQ(expected.value, expected_underlying_value); + + uint8_t stack_bytes[16] = {0}; + uint8_t* bytes = stack_bytes; + bool is_negative; + ToBytes(expected, &bytes, &is_negative); + + ASSERT_TRUE(is_negative); + + Decimal128 result; + FromBytes(bytes, is_negative, &result); + + ASSERT_EQ(expected.value, result.value); +} +} // namespace arrow diff --git a/cpp/src/arrow/util/decimal.cc b/cpp/src/arrow/util/decimal.cc new file mode 100644 index 0000000000000..1ac347180fec5 --- /dev/null +++ b/cpp/src/arrow/util/decimal.cc @@ -0,0 +1,141 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#include "arrow/util/decimal.h" + +#include + +namespace arrow { + +static const boost::regex DECIMAL_PATTERN("(\\+?|-?)((0*)(\\d*))(\\.(\\d+))?"); + +template +ARROW_EXPORT Status FromString( + const std::string& s, Decimal* out, int* precision, int* scale) { + if (s.empty()) { + return Status::Invalid("Empty string cannot be converted to decimal"); + } + boost::smatch match; + if (!boost::regex_match(s, match, DECIMAL_PATTERN)) { + std::stringstream ss; + ss << "String " << s << " is not a valid decimal string"; + return Status::Invalid(ss.str()); + } + const int8_t sign = match[1].str() == "-" ? 
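// DECIMAL_PATTERN captures the optional sign in group 1, leading zeros in
// group 3 (discarded), the remaining whole digits in group 4, and the
// fractional digits in group 6. Parsing "-023.45" therefore yields sign
// -1, whole_part "23", fractional_part "45", precision 4 (2 + 2 digits),
// and scale 2.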
-1 : 1; + std::string whole_part = match[4].str(); + std::string fractional_part = match[6].str(); + if (scale != nullptr) { *scale = static_cast(fractional_part.size()); } + if (precision != nullptr) { + *precision = + static_cast(whole_part.size()) + static_cast(fractional_part.size()); + } + if (out != nullptr) { StringToInteger(whole_part, fractional_part, sign, &out->value); } + return Status::OK(); +} + +template ARROW_EXPORT Status FromString( + const std::string& s, Decimal32* out, int* precision, int* scale); +template ARROW_EXPORT Status FromString( + const std::string& s, Decimal64* out, int* precision, int* scale); +template ARROW_EXPORT Status FromString( + const std::string& s, Decimal128* out, int* precision, int* scale); + +void StringToInteger( + const std::string& whole, const std::string& fractional, int8_t sign, int32_t* out) { + DCHECK(sign == -1 || sign == 1); + DCHECK_NE(out, nullptr); + DCHECK(!whole.empty() || !fractional.empty()); + if (!whole.empty()) { + *out = std::stoi(whole, nullptr, 10) * + static_cast(pow(10.0, static_cast(fractional.size()))); + } + if (!fractional.empty()) { *out += std::stoi(fractional, nullptr, 10); } + *out *= sign; +} + +void StringToInteger( + const std::string& whole, const std::string& fractional, int8_t sign, int64_t* out) { + DCHECK(sign == -1 || sign == 1); + DCHECK_NE(out, nullptr); + DCHECK(!whole.empty() || !fractional.empty()); + if (!whole.empty()) { + *out = static_cast(std::stoll(whole, nullptr, 10)) * + static_cast(pow(10.0, static_cast(fractional.size()))); + } + if (!fractional.empty()) { *out += std::stoll(fractional, nullptr, 10); } + *out *= sign; +} + +void StringToInteger( + const std::string& whole, const std::string& fractional, int8_t sign, int128_t* out) { + DCHECK(sign == -1 || sign == 1); + DCHECK_NE(out, nullptr); + DCHECK(!whole.empty() || !fractional.empty()); + *out = int128_t(whole + fractional) * sign; +} + +void FromBytes(const uint8_t* bytes, Decimal32* decimal) { + DCHECK_NE(bytes, nullptr); + DCHECK_NE(decimal, nullptr); + decimal->value = *reinterpret_cast(bytes); +} + +void FromBytes(const uint8_t* bytes, Decimal64* decimal) { + DCHECK_NE(bytes, nullptr); + DCHECK_NE(decimal, nullptr); + decimal->value = *reinterpret_cast(bytes); +} + +constexpr static const size_t BYTES_IN_128_BITS = 128 / CHAR_BIT; +constexpr static const size_t LIMB_SIZE = + sizeof(std::remove_pointer::type); +constexpr static const size_t BYTES_PER_LIMB = BYTES_IN_128_BITS / LIMB_SIZE; + +void FromBytes(const uint8_t* bytes, bool is_negative, Decimal128* decimal) { + DCHECK_NE(bytes, nullptr); + DCHECK_NE(decimal, nullptr); + + auto& decimal_value(decimal->value); + int128_t::backend_type& backend(decimal_value.backend()); + backend.resize(BYTES_PER_LIMB, BYTES_PER_LIMB); + std::memcpy(backend.limbs(), bytes, BYTES_IN_128_BITS); + if (is_negative) { decimal->value = -decimal->value; } +} + +void ToBytes(const Decimal32& value, uint8_t** bytes) { + DCHECK_NE(*bytes, nullptr); + *reinterpret_cast(*bytes) = value.value; +} + +void ToBytes(const Decimal64& value, uint8_t** bytes) { + DCHECK_NE(*bytes, nullptr); + *reinterpret_cast(*bytes) = value.value; +} + +void ToBytes(const Decimal128& decimal, uint8_t** bytes, bool* is_negative) { + DCHECK_NE(*bytes, nullptr); + DCHECK_NE(is_negative, nullptr); + + /// TODO(phillipc): boost multiprecision is unreliable here, int128_t can't be + /// roundtripped + const auto& backend(decimal.value.backend()); + auto boost_bytes = reinterpret_cast(backend.limbs()); + std::memcpy(*bytes, 
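// boost::multiprecision's cpp_int backend stores a sign flag plus
// magnitude limbs rather than a two's-complement value, so only the
// magnitude bytes are copied out here and the sign travels separately
// through *is_negative; the Decimal128 FromBytes overload above memcpys
// the limbs into a resized backend and negates when the flag is set.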
boost_bytes, BYTES_IN_128_BITS); + *is_negative = backend.isneg(); +} + +} // namespace arrow diff --git a/cpp/src/arrow/util/decimal.h b/cpp/src/arrow/util/decimal.h new file mode 100644 index 0000000000000..46883e3de93c3 --- /dev/null +++ b/cpp/src/arrow/util/decimal.h @@ -0,0 +1,144 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_DECIMAL_H +#define ARROW_DECIMAL_H + +#include +#include +#include +#include +#include + +#include "arrow/status.h" +#include "arrow/util/bit-util.h" +#include "arrow/util/logging.h" + +#include + +namespace arrow { + +using boost::multiprecision::int128_t; + +template +struct ARROW_EXPORT Decimal; + +ARROW_EXPORT void StringToInteger( + const std::string& whole, const std::string& fractional, int8_t sign, int32_t* out); +ARROW_EXPORT void StringToInteger( + const std::string& whole, const std::string& fractional, int8_t sign, int64_t* out); +ARROW_EXPORT void StringToInteger( + const std::string& whole, const std::string& fractional, int8_t sign, int128_t* out); + +template +ARROW_EXPORT Status FromString(const std::string& s, Decimal* out, + int* precision = nullptr, int* scale = nullptr); + +template +struct ARROW_EXPORT Decimal { + Decimal() : value() {} + explicit Decimal(const std::string& s) : value() { FromString(s, this); } + explicit Decimal(const char* s) : Decimal(std::string(s)) {} + explicit Decimal(const T& value) : value(value) {} + + using value_type = T; + value_type value; +}; + +using Decimal32 = Decimal; +using Decimal64 = Decimal; +using Decimal128 = Decimal; + +template +struct ARROW_EXPORT DecimalPrecision {}; + +template <> +struct ARROW_EXPORT DecimalPrecision { + constexpr static const int minimum = 1; + constexpr static const int maximum = 9; +}; + +template <> +struct ARROW_EXPORT DecimalPrecision { + constexpr static const int minimum = 10; + constexpr static const int maximum = 18; +}; + +template <> +struct ARROW_EXPORT DecimalPrecision { + constexpr static const int minimum = 19; + constexpr static const int maximum = 38; +}; + +template +ARROW_EXPORT std::string ToString( + const Decimal& decimal_value, int precision, int scale) { + T value = decimal_value.value; + + // Decimal values are sent to clients as strings so in the interest of + // speed the string will be created without the using stringstream with the + // whole/fractional_part(). + size_t last_char_idx = precision + (scale > 0) // Add a space for decimal place + + (scale == precision) // Add a space for leading 0 + + (value < 0); // Add a space for negative sign + std::string str = std::string(last_char_idx, '0'); + // Start filling in the values in reverse order by taking the last digit + // of the value. Use a positive value and worry about the sign later. 
At this + // point the last_char_idx points to the string terminator. + T remaining_value = value; + size_t first_digit_idx = 0; + if (value < 0) { + remaining_value = -value; + first_digit_idx = 1; + } + if (scale > 0) { + int remaining_scale = scale; + do { + str[--last_char_idx] = static_cast( + (remaining_value % 10) + static_cast('0')); // Ascii offset + remaining_value /= 10; + } while (--remaining_scale > 0); + str[--last_char_idx] = '.'; + DCHECK_GT(last_char_idx, first_digit_idx) << "Not enough space remaining"; + } + do { + str[--last_char_idx] = + static_cast((remaining_value % 10) + static_cast('0')); // Ascii offset + remaining_value /= 10; + if (remaining_value == 0) { + // Trim any extra leading 0's. + if (last_char_idx > first_digit_idx) str.erase(0, last_char_idx - first_digit_idx); + break; + } + // For safety, enforce string length independent of remaining_value. + } while (last_char_idx > first_digit_idx); + if (value < 0) str[0] = '-'; + return str; +} + +/// Conversion from raw bytes to a Decimal value +ARROW_EXPORT void FromBytes(const uint8_t* bytes, Decimal32* value); +ARROW_EXPORT void FromBytes(const uint8_t* bytes, Decimal64* value); +ARROW_EXPORT void FromBytes(const uint8_t* bytes, bool is_negative, Decimal128* decimal); + +/// Conversion from a Decimal value to raw bytes +ARROW_EXPORT void ToBytes(const Decimal32& value, uint8_t** bytes); +ARROW_EXPORT void ToBytes(const Decimal64& value, uint8_t** bytes); +ARROW_EXPORT void ToBytes(const Decimal128& decimal, uint8_t** bytes, bool* is_negative); + +} // namespace arrow +#endif // ARROW_DECIMAL_H diff --git a/cpp/src/arrow/visitor_inline.h b/cpp/src/arrow/visitor_inline.h index c61c9f59f7ab2..29b3db60cadf8 100644 --- a/cpp/src/arrow/visitor_inline.h +++ b/cpp/src/arrow/visitor_inline.h @@ -93,7 +93,7 @@ inline Status VisitArrayInline(const Array& array, VISITOR* visitor) { ARRAY_VISIT_INLINE(TimestampType); ARRAY_VISIT_INLINE(Time32Type); ARRAY_VISIT_INLINE(Time64Type); - // ARRAY_VISIT_INLINE(DecimalType); + ARRAY_VISIT_INLINE(DecimalType); ARRAY_VISIT_INLINE(ListType); ARRAY_VISIT_INLINE(StructType); ARRAY_VISIT_INLINE(UnionType); diff --git a/format/Schema.fbs b/format/Schema.fbs index ca9c8e6c3e76c..badc7ea8befbf 100644 --- a/format/Schema.fbs +++ b/format/Schema.fbs @@ -77,7 +77,9 @@ table Bool { } table Decimal { + /// Total number of decimal digits precision: int; + /// Number of digits after the decimal point "." 
scale: int; } diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 8c520748cf316..7b23cf66c6f7e 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -71,7 +71,7 @@ uint8, uint16, uint32, uint64, timestamp, date32, date64, float16, float32, float64, - binary, string, + binary, string, decimal, list_, struct, dictionary, field, DataType, FixedSizeBinaryType, Field, Schema, schema) diff --git a/python/pyarrow/array.pxd b/python/pyarrow/array.pxd index f6aaea2582e21..3ba48718265db 100644 --- a/python/pyarrow/array.pxd +++ b/python/pyarrow/array.pxd @@ -116,6 +116,10 @@ cdef class FixedSizeBinaryArray(Array): pass +cdef class DecimalArray(FixedSizeBinaryArray): + pass + + cdef class ListArray(Array): pass diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index 9f302e02cdb04..ee500e6812974 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -481,6 +481,10 @@ cdef class FixedSizeBinaryArray(Array): pass +cdef class DecimalArray(FixedSizeBinaryArray): + pass + + cdef class ListArray(Array): pass @@ -602,6 +606,7 @@ cdef dict _array_classes = { Type_STRING: StringArray, Type_DICTIONARY: DictionaryArray, Type_FIXED_SIZE_BINARY: FixedSizeBinaryArray, + Type_DECIMAL: DecimalArray, } cdef object box_array(const shared_ptr[CArray]& sp_array): diff --git a/python/pyarrow/includes/common.pxd b/python/pyarrow/includes/common.pxd index ab38ff3084f01..4860334a9213c 100644 --- a/python/pyarrow/includes/common.pxd +++ b/python/pyarrow/includes/common.pxd @@ -51,6 +51,11 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: c_bool IsTypeError() +cdef extern from "arrow/util/decimal.h" namespace "arrow" nogil: + cdef cppclass int128_t: + pass + + cdef inline object PyObject_to_object(PyObject* o): # Cast to "object" increments reference count cdef object result = o diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 2a0488f3a0139..73d96b25f521b 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -39,6 +39,8 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: Type_FLOAT" arrow::Type::FLOAT" Type_DOUBLE" arrow::Type::DOUBLE" + Type_DECIMAL" arrow::Type::DECIMAL" + Type_DATE32" arrow::Type::DATE32" Type_DATE64" arrow::Type::DATE64" Type_TIMESTAMP" arrow::Type::TIMESTAMP" @@ -58,6 +60,11 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: TimeUnit_MICRO" arrow::TimeUnit::MICRO" TimeUnit_NANO" arrow::TimeUnit::NANO" + cdef cppclass Decimal[T]: + Decimal(const T&) + + cdef c_string ToString[T](const Decimal[T]&, int, int) + cdef cppclass CDataType" arrow::DataType": Type type @@ -144,6 +151,12 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass CFixedSizeBinaryType" arrow::FixedSizeBinaryType"(CFixedWidthType): CFixedSizeBinaryType(int byte_width) int byte_width() + int bit_width() + + cdef cppclass CDecimalType" arrow::DecimalType"(CFixedSizeBinaryType): + int precision + int scale + CDecimalType(int precision, int scale) cdef cppclass CField" arrow::Field": c_string name @@ -212,6 +225,9 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass CFixedSizeBinaryArray" arrow::FixedSizeBinaryArray"(CArray): const uint8_t* GetValue(int i) + cdef cppclass CDecimalArray" arrow::DecimalArray"(CFixedSizeBinaryArray): + Decimal[T] Value[T](int i) + cdef cppclass CListArray" arrow::ListArray"(CArray): const int32_t* raw_value_offsets() int32_t value_offset(int i) diff --git a/python/pyarrow/scalar.pxd 
b/python/pyarrow/scalar.pxd index d6c3b35160c12..62a5664e57eb4 100644 --- a/python/pyarrow/scalar.pxd +++ b/python/pyarrow/scalar.pxd @@ -20,6 +20,7 @@ from pyarrow.includes.libarrow cimport * from pyarrow.schema cimport DataType + cdef class Scalar: cdef readonly: DataType type diff --git a/python/pyarrow/scalar.pyx b/python/pyarrow/scalar.pyx index 1c0790a4fdc3c..f3d9321326964 100644 --- a/python/pyarrow/scalar.pyx +++ b/python/pyarrow/scalar.pyx @@ -17,9 +17,10 @@ from pyarrow.schema cimport DataType, box_data_type +from pyarrow.includes.common cimport int128_t from pyarrow.compat import frombytes import pyarrow.schema as schema - +import decimal import datetime cimport cpython as cp @@ -64,7 +65,7 @@ cdef class ArrayValue(Scalar): if hasattr(self, 'as_py'): return repr(self.as_py()) else: - return Scalar.__repr__(self) + return super(Scalar, self).__repr__() cdef class BooleanValue(ArrayValue): @@ -199,6 +200,25 @@ cdef class DoubleValue(ArrayValue): return ap.Value(self.index) +cdef class DecimalValue(ArrayValue): + + def as_py(self): + cdef: + CDecimalArray* ap = self.sp_array.get() + CDecimalType* t = ap.type().get() + int bit_width = t.bit_width() + int precision = t.precision + int scale = t.scale + c_string s + if bit_width == 32: + s = ToString[int32_t](ap.Value[int32_t](self.index), precision, scale) + elif bit_width == 64: + s = ToString[int64_t](ap.Value[int64_t](self.index), precision, scale) + elif bit_width == 128: + s = ToString[int128_t](ap.Value[int128_t](self.index), precision, scale) + return decimal.Decimal(s.decode('utf8')) + + cdef class StringValue(ArrayValue): def as_py(self): @@ -286,6 +306,7 @@ cdef dict _scalar_classes = { Type_BINARY: BinaryValue, Type_STRING: StringValue, Type_FIXED_SIZE_BINARY: FixedSizeBinaryValue, + Type_DECIMAL: DecimalValue, } cdef object box_scalar(DataType type, const shared_ptr[CArray]& sp_array, diff --git a/python/pyarrow/schema.pxd b/python/pyarrow/schema.pxd index 94d65bfc157a1..eceedbad0ba0d 100644 --- a/python/pyarrow/schema.pxd +++ b/python/pyarrow/schema.pxd @@ -20,6 +20,7 @@ from pyarrow.includes.libarrow cimport (CDataType, CDictionaryType, CTimestampType, CFixedSizeBinaryType, + CDecimalType, CField, CSchema) cdef class DataType: @@ -27,7 +28,7 @@ cdef class DataType: shared_ptr[CDataType] sp_type CDataType* type - cdef init(self, const shared_ptr[CDataType]& type) + cdef void init(self, const shared_ptr[CDataType]& type) cdef class DictionaryType(DataType): @@ -45,6 +46,11 @@ cdef class FixedSizeBinaryType(DataType): const CFixedSizeBinaryType* fixed_size_binary_type +cdef class DecimalType(FixedSizeBinaryType): + cdef: + const CDecimalType* decimal_type + + cdef class Field: cdef: shared_ptr[CField] sp_field @@ -55,6 +61,7 @@ cdef class Field: cdef init(self, const shared_ptr[CField]& field) + cdef class Schema: cdef: shared_ptr[CSchema] sp_schema @@ -63,6 +70,7 @@ cdef class Schema: cdef init(self, const vector[shared_ptr[CField]]& fields) cdef init_schema(self, const shared_ptr[CSchema]& schema) + cdef DataType box_data_type(const shared_ptr[CDataType]& type) cdef Field box_field(const shared_ptr[CField]& field) cdef Schema box_schema(const shared_ptr[CSchema]& schema) diff --git a/python/pyarrow/schema.pyx b/python/pyarrow/schema.pyx index 253be4590b518..4b931bf452239 100644 --- a/python/pyarrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ -29,6 +29,7 @@ from pyarrow.array cimport Array from pyarrow.error cimport check_status from pyarrow.includes.libarrow cimport (CDataType, CStructType, CListType, 
CFixedSizeBinaryType, + CDecimalType, TimeUnit_SECOND, TimeUnit_MILLI, TimeUnit_MICRO, TimeUnit_NANO, Type, TimeUnit) @@ -45,7 +46,7 @@ cdef class DataType: def __cinit__(self): pass - cdef init(self, const shared_ptr[CDataType]& type): + cdef void init(self, const shared_ptr[CDataType]& type): self.sp_type = type self.type = type.get() @@ -66,14 +67,14 @@ cdef class DataType: cdef class DictionaryType(DataType): - cdef init(self, const shared_ptr[CDataType]& type): + cdef void init(self, const shared_ptr[CDataType]& type): DataType.init(self, type) self.dict_type = type.get() cdef class TimestampType(DataType): - cdef init(self, const shared_ptr[CDataType]& type): + cdef void init(self, const shared_ptr[CDataType]& type): DataType.init(self, type) self.ts_type = type.get() @@ -93,7 +94,7 @@ cdef class TimestampType(DataType): cdef class FixedSizeBinaryType(DataType): - cdef init(self, const shared_ptr[CDataType]& type): + cdef void init(self, const shared_ptr[CDataType]& type): DataType.init(self, type) self.fixed_size_binary_type = type.get() @@ -103,6 +104,13 @@ cdef class FixedSizeBinaryType(DataType): return self.fixed_size_binary_type.byte_width() +cdef class DecimalType(FixedSizeBinaryType): + + cdef void init(self, const shared_ptr[CDataType]& type): + DataType.init(self, type) + self.decimal_type = type.get() + + cdef class Field: def __cinit__(self): @@ -354,6 +362,12 @@ def float64(): return primitive_type(la.Type_DOUBLE) +cpdef DataType decimal(int precision, int scale=0): + cdef shared_ptr[CDataType] decimal_type + decimal_type.reset(new CDecimalType(precision, scale)) + return box_data_type(decimal_type) + + def string(): """ UTF8 string @@ -374,11 +388,9 @@ def binary(int length=-1): if length == -1: return primitive_type(la.Type_BINARY) - cdef FixedSizeBinaryType out = FixedSizeBinaryType() cdef shared_ptr[CDataType] fixed_size_binary_type fixed_size_binary_type.reset(new CFixedSizeBinaryType(length)) - out.init(fixed_size_binary_type) - return out + return box_data_type(fixed_size_binary_type) def list_(DataType value_type): @@ -436,6 +448,8 @@ cdef DataType box_data_type(const shared_ptr[CDataType]& type): out = TimestampType() elif type.get().type == la.Type_FIXED_SIZE_BINARY: out = FixedSizeBinaryType() + elif type.get().type == la.Type_DECIMAL: + out = DecimalType() else: out = DataType() diff --git a/python/pyarrow/tests/test_convert_builtin.py b/python/pyarrow/tests/test_convert_builtin.py index e2b03d85ecd50..d89a8e0c54ceb 100644 --- a/python/pyarrow/tests/test_convert_builtin.py +++ b/python/pyarrow/tests/test_convert_builtin.py @@ -20,6 +20,7 @@ import pyarrow as pa import datetime +import decimal class TestConvertList(unittest.TestCase): @@ -162,3 +163,42 @@ def test_mixed_types_fails(self): data = ['a', 1, 2.0] with self.assertRaises(pa.ArrowException): pa.from_pylist(data) + + def test_decimal(self): + data = [decimal.Decimal('1234.183'), decimal.Decimal('8094.234')] + type = pa.decimal(precision=7, scale=3) + arr = pa.from_pylist(data, type=type) + assert arr.to_pylist() == data + + def test_decimal_different_precisions(self): + data = [ + decimal.Decimal('1234234983.183'), decimal.Decimal('80943244.234') + ] + type = pa.decimal(precision=13, scale=3) + arr = pa.from_pylist(data, type=type) + assert arr.to_pylist() == data + + def test_decimal_no_scale(self): + data = [decimal.Decimal('1234234983'), decimal.Decimal('8094324')] + type = pa.decimal(precision=10) + arr = pa.from_pylist(data, type=type) + assert arr.to_pylist() == data + + def 
test_decimal_negative(self): + data = [decimal.Decimal('-1234.234983'), decimal.Decimal('-8.094324')] + type = pa.decimal(precision=10, scale=6) + arr = pa.from_pylist(data, type=type) + assert arr.to_pylist() == data + + def test_decimal_no_whole_part(self): + data = [decimal.Decimal('-.4234983'), decimal.Decimal('.0103943')] + type = pa.decimal(precision=7, scale=7) + arr = pa.from_pylist(data, type=type) + assert arr.to_pylist() == data + + def test_decimal_large_integer(self): + data = [decimal.Decimal('-394029506937548693.42983'), + decimal.Decimal('32358695912932.01033')] + type = pa.decimal(precision=23, scale=5) + arr = pa.from_pylist(data, type=type) + assert arr.to_pylist() == data diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index 87c9c03d7da11..0504e1ddb4f53 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -20,6 +20,7 @@ import datetime import unittest +import decimal import numpy as np @@ -451,3 +452,72 @@ def test_strided_data_import(self): self._check_pandas_roundtrip(df) self._check_array_roundtrip(col) self._check_array_roundtrip(col, mask=strided_mask) + + def test_decimal_32_from_pandas(self): + expected = pd.DataFrame({ + 'decimals': [ + decimal.Decimal('-1234.123'), + decimal.Decimal('1234.439'), + ] + }) + converted = A.Table.from_pandas(expected) + field = A.Field.from_py('decimals', A.decimal(7, 3)) + schema = A.Schema.from_fields([field]) + assert converted.schema.equals(schema) + + def test_decimal_32_to_pandas(self): + expected = pd.DataFrame({ + 'decimals': [ + decimal.Decimal('-1234.123'), + decimal.Decimal('1234.439'), + ] + }) + converted = A.Table.from_pandas(expected) + df = converted.to_pandas() + tm.assert_frame_equal(df, expected) + + def test_decimal_64_from_pandas(self): + expected = pd.DataFrame({ + 'decimals': [ + decimal.Decimal('-129934.123331'), + decimal.Decimal('129534.123731'), + ] + }) + converted = A.Table.from_pandas(expected) + field = A.Field.from_py('decimals', A.decimal(12, 6)) + schema = A.Schema.from_fields([field]) + assert converted.schema.equals(schema) + + def test_decimal_64_to_pandas(self): + expected = pd.DataFrame({ + 'decimals': [ + decimal.Decimal('-129934.123331'), + decimal.Decimal('129534.123731'), + ] + }) + converted = A.Table.from_pandas(expected) + df = converted.to_pandas() + tm.assert_frame_equal(df, expected) + + def test_decimal_128_from_pandas(self): + expected = pd.DataFrame({ + 'decimals': [ + decimal.Decimal('394092382910493.12341234678'), + -decimal.Decimal('314292388910493.12343437128'), + ] + }) + converted = A.Table.from_pandas(expected) + field = A.Field.from_py('decimals', A.decimal(26, 11)) + schema = A.Schema.from_fields([field]) + assert converted.schema.equals(schema) + + def test_decimal_128_to_pandas(self): + expected = pd.DataFrame({ + 'decimals': [ + decimal.Decimal('394092382910493.12341234678'), + -decimal.Decimal('314292388910493.12343437128'), + ] + }) + converted = A.Table.from_pandas(expected) + df = converted.to_pandas() + tm.assert_frame_equal(df, expected) From 137aade404cf53a7dbe0eaa31a868d1c376654b3 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 9 Apr 2017 20:00:30 -0400 Subject: [PATCH 0487/1644] ARROW-722: [Python] Support additional date/time types and metadata, conversion to/from NumPy and pandas.DataFrame Would appreciate a close look from @xhochy, @cpcloud. 
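For context on the visitor cleanup mentioned in the next sentence: the
per-type VISIT_NATIVE and CONVERT_CASE macro ladders give way to
overloaded Visit methods dispatched statically through VisitTypeInline
(arrow/visitor_inline.h). A minimal sketch of the pattern, with a
hypothetical ExampleVisitor and hedged on the exact Arrow signatures:

#include "arrow/type.h"
#include "arrow/visitor_inline.h"

namespace arrow {

// Illustrative only: counts timestamp columns, rejects everything else.
struct ExampleVisitor {
  int timestamps = 0;

  // Matched statically when VisitTypeInline hits a TimestampType.
  Status Visit(const TimestampType&) {
    ++timestamps;
    return Status::OK();
  }

  // Catch-all overload for types this visitor does not handle.
  Status Visit(const DataType& type) {
    return Status::NotImplemented(type.ToString());
  }
};

// Usage: ExampleVisitor v; RETURN_NOT_OK(VisitTypeInline(*some_type, &v));

}  // namespace arrow

SFINAE-constrained template overloads, as in PandasConverter and
ArrowDeserializer below, extend the same idea so that one method body can
cover a whole family of types.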
Also did some inline visitor cleaning for nicer code reuse Author: Wes McKinney Closes #510 from wesm/ARROW-722 and squashes the following commits: 3e1fda3 [Wes McKinney] cpplint cb32a6b [Wes McKinney] Nicer error message. Run clang-format 854f470 [Wes McKinney] First cut refactor 06dce15 [Wes McKinney] Rebase conflicts d1dc342 [Wes McKinney] Bring Python bindings to date/time types up to spec. Handle zero-copy creation from same-size int32/64. Use inline visitor in PandasConverter --- cpp/CMakeLists.txt | 6 + cpp/src/arrow/CMakeLists.txt | 1 + cpp/src/arrow/python/builtin_convert.cc | 29 +- cpp/src/arrow/python/builtin_convert.h | 4 + cpp/src/arrow/python/pandas_convert.cc | 345 ++++++++++---------- cpp/src/arrow/python/pandas_convert.h | 3 - cpp/src/arrow/python/type_traits.h | 63 ++-- cpp/src/arrow/python/util/datetime.h | 6 +- cpp/src/arrow/table-test.cc | 55 +--- cpp/src/arrow/table.cc | 10 +- cpp/src/arrow/table.h | 4 +- cpp/src/arrow/type.cc | 4 +- cpp/src/arrow/type.h | 4 +- cpp/src/arrow/util/stl.h | 4 +- python/pyarrow/scalar.pyx | 6 +- python/pyarrow/tests/test_convert_pandas.py | 182 +++++++---- 16 files changed, 399 insertions(+), 327 deletions(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 5852fe59da095..b29cb7b075a94 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -386,6 +386,12 @@ enable_testing() set(Boost_DEBUG TRUE) set(Boost_USE_MULTITHREADED ON) +set(Boost_ADDITIONAL_VERSIONS + "1.63.0" "1.63" + "1.62.0" "1.61" + "1.61.0" "1.62" + "1.60.0" "1.60") + if (ARROW_BOOST_USE_SHARED) # Find shared Boost libraries. set(Boost_USE_STATIC_LIBS OFF) diff --git a/cpp/src/arrow/CMakeLists.txt b/cpp/src/arrow/CMakeLists.txt index 8eaa76ae9e843..cb5282cbf1eff 100644 --- a/cpp/src/arrow/CMakeLists.txt +++ b/cpp/src/arrow/CMakeLists.txt @@ -34,6 +34,7 @@ install(FILES type_traits.h test-util.h visitor.h + visitor_inline.h DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}/arrow") # pkg-config support diff --git a/cpp/src/arrow/python/builtin_convert.cc b/cpp/src/arrow/python/builtin_convert.cc index 189ecee4fe022..a064a3daf970d 100644 --- a/cpp/src/arrow/python/builtin_convert.cc +++ b/cpp/src/arrow/python/builtin_convert.cc @@ -43,6 +43,31 @@ static inline bool IsPyInteger(PyObject* obj) { #endif } +Status InvalidConversion(PyObject* obj, const std::string& expected_type_name) { + OwnedRef type(PyObject_Type(obj)); + RETURN_IF_PYERROR(); + DCHECK_NE(type.obj(), nullptr); + + OwnedRef type_name(PyObject_GetAttrString(type.obj(), "__name__")); + RETURN_IF_PYERROR(); + DCHECK_NE(type_name.obj(), nullptr); + + PyObjectStringify bytestring(type_name.obj()); + RETURN_IF_PYERROR(); + + const char* bytes = bytestring.bytes; + DCHECK_NE(bytes, nullptr) << "bytes from type(...).__name__ were null"; + + Py_ssize_t size = bytestring.size; + + std::string cpp_type_name(bytes, size); + + std::stringstream ss; + ss << "Python object of type " << cpp_type_name << " is not None and is not a " + << expected_type_name << " object"; + return Status::Invalid(ss.str()); +} + class ScalarVisitor { public: ScalarVisitor() @@ -397,7 +422,7 @@ class BytesConverter : public TypedConverter { } else if (PyBytes_Check(item)) { bytes_obj = item; } else { - return Status::Invalid("Value that cannot be converted to bytes was encountered"); + return InvalidConversion(item, "bytes"); } // No error checking length = PyBytes_GET_SIZE(bytes_obj); @@ -431,7 +456,7 @@ class FixedWidthBytesConverter : public TypedConverter { } else if (PyBytes_Check(item)) { bytes_obj = item; } else { - return 
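// InvalidConversion, now shared from builtin_convert, derives the message
// from type(obj).__name__, so a stray float in a bytes column reports
// something like "Python object of type float is not None and is not a
// bytes object" instead of this generic string.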
Status::Invalid("Value that cannot be converted to bytes was encountered"); + return InvalidConversion(item, "bytes"); } // No error checking RETURN_NOT_OK(CheckPythonBytesAreFixedLength(bytes_obj, expected_length)); diff --git a/cpp/src/arrow/python/builtin_convert.h b/cpp/src/arrow/python/builtin_convert.h index 3c2e350269a78..2141c25e95ef0 100644 --- a/cpp/src/arrow/python/builtin_convert.h +++ b/cpp/src/arrow/python/builtin_convert.h @@ -24,6 +24,7 @@ #include #include +#include #include "arrow/type.h" @@ -60,6 +61,9 @@ ARROW_EXPORT Status ConvertPySequence(PyObject* obj, MemoryPool* pool, std::shared_ptr* out, const std::shared_ptr& type, int64_t size); +ARROW_EXPORT +Status InvalidConversion(PyObject* obj, const std::string& expected_type_name); + ARROW_EXPORT Status CheckPythonBytesAreFixedLength( PyObject* obj, Py_ssize_t expected_length); diff --git a/cpp/src/arrow/python/pandas_convert.cc b/cpp/src/arrow/python/pandas_convert.cc index f6e627e668e2d..5bb8e45e191a9 100644 --- a/cpp/src/arrow/python/pandas_convert.cc +++ b/cpp/src/arrow/python/pandas_convert.cc @@ -44,6 +44,7 @@ #include "arrow/util/decimal.h" #include "arrow/util/logging.h" #include "arrow/util/macros.h" +#include "arrow/visitor_inline.h" #include "arrow/python/builtin_convert.h" #include "arrow/python/common.h" @@ -271,7 +272,7 @@ static inline bool ListTypeSupported(const Type::type type_id) { // ---------------------------------------------------------------------- // Conversion from NumPy-in-Pandas to Arrow -class PandasConverter : public TypeVisitor { +class PandasConverter { public: PandasConverter( MemoryPool* pool, PyObject* ao, PyObject* mo, const std::shared_ptr& type) @@ -332,23 +333,37 @@ class PandasConverter : public TypeVisitor { return LoadArray(type_, fields, {null_bitmap_, data}, &out_); } -#define VISIT_NATIVE(TYPE) \ - Status Visit(const TYPE& type) override { return VisitNative(); } + template + typename std::enable_if::value || + std::is_same::value, + Status>::type + Visit(const T& type) { + return VisitNative(); + } + + Status Visit(const Date32Type& type) { return VisitNative(); } + Status Visit(const Date64Type& type) { return VisitNative(); } + Status Visit(const TimestampType& type) { return VisitNative(); } + Status Visit(const Time32Type& type) { return VisitNative(); } + Status Visit(const Time64Type& type) { return VisitNative(); } + + Status Visit(const NullType& type) { return Status::NotImplemented("null"); } + + Status Visit(const BinaryType& type) { return Status::NotImplemented(type.ToString()); } + + Status Visit(const FixedSizeBinaryType& type) { + return Status::NotImplemented(type.ToString()); + } - VISIT_NATIVE(BooleanType); - VISIT_NATIVE(Int8Type); - VISIT_NATIVE(Int16Type); - VISIT_NATIVE(Int32Type); - VISIT_NATIVE(Int64Type); - VISIT_NATIVE(UInt8Type); - VISIT_NATIVE(UInt16Type); - VISIT_NATIVE(UInt32Type); - VISIT_NATIVE(UInt64Type); - VISIT_NATIVE(FloatType); - VISIT_NATIVE(DoubleType); - VISIT_NATIVE(TimestampType); + Status Visit(const DecimalType& type) { + return Status::NotImplemented(type.ToString()); + } -#undef VISIT_NATIVE + Status Visit(const DictionaryType& type) { + return Status::NotImplemented(type.ToString()); + } + + Status Visit(const NestedType& type) { return Status::NotImplemented(type.ToString()); } Status Convert() { if (PyArray_NDIM(arr_) != 1) { @@ -358,9 +373,7 @@ class PandasConverter : public TypeVisitor { if (type_ == nullptr) { return Status::Invalid("Must pass data type"); } // Visit the type to perform conversion - 
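// The virtual double dispatch through TypeVisitor (Accept, below) is
// dropped in favor of static dispatch via VisitTypeInline, which lets a
// single enable_if-constrained template Visit cover every native-width
// type instead of the VISIT_NATIVE macro list above.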
RETURN_NOT_OK(type_->Accept(this)); - - return Status::OK(); + return VisitTypeInline(*type_, this); } std::shared_ptr result() const { return out_; } @@ -371,10 +384,12 @@ class PandasConverter : public TypeVisitor { template Status ConvertTypedLists(const std::shared_ptr& type); + template + Status ConvertDates(); + Status ConvertObjectStrings(); Status ConvertObjectFixedWidthBytes(const std::shared_ptr& type); Status ConvertBooleans(); - Status ConvertDates(); Status ConvertLists(const std::shared_ptr& type); Status ConvertObjects(); Status ConvertDecimals(); @@ -462,41 +477,36 @@ inline Status PandasConverter::ConvertData(std::shared_ptr* return Status::OK(); } -Status InvalidConversion(PyObject* obj, const std::string& expected_type_name) { - OwnedRef type(PyObject_Type(obj)); - RETURN_IF_PYERROR(); - DCHECK_NE(type.obj(), nullptr); - - OwnedRef type_name(PyObject_GetAttrString(type.obj(), "__name__")); - RETURN_IF_PYERROR(); - DCHECK_NE(type_name.obj(), nullptr); - - PyObjectStringify bytestring(type_name.obj()); - RETURN_IF_PYERROR(); - - const char* bytes = bytestring.bytes; - DCHECK_NE(bytes, nullptr) << "bytes from type(...).__name__ were null"; - - Py_ssize_t size = bytestring.size; +template +struct UnboxDate {}; - std::string cpp_type_name(bytes, size); +template <> +struct UnboxDate { + static int64_t Unbox(PyObject* obj) { + return PyDate_to_days(reinterpret_cast(obj)); + } +}; - std::stringstream ss; - ss << "Python object of type " << cpp_type_name << " is not None and is not a " - << expected_type_name << " object"; - return Status::Invalid(ss.str()); -} +template <> +struct UnboxDate { + static int64_t Unbox(PyObject* obj) { + return PyDate_to_ms(reinterpret_cast(obj)); + } +}; +template Status PandasConverter::ConvertDates() { PyAcquireGIL lock; + using BuilderType = typename TypeTraits::BuilderType; + Ndarray1DIndexer objects(arr_); if (mask_ != nullptr) { return Status::NotImplemented("mask not supported in object conversions yet"); } - Date64Builder date_builder(pool_); + BuilderType date_builder(pool_); RETURN_NOT_OK(date_builder.Resize(length_)); /// We have to run this in this compilation unit, since we cannot use the @@ -508,8 +518,7 @@ Status PandasConverter::ConvertDates() { for (int64_t i = 0; i < length_; ++i) { obj = objects[i]; if (PyDate_CheckExact(obj)) { - PyDateTime_Date* pydate = reinterpret_cast(obj); - date_builder.Append(PyDate_to_ms(pydate)); + date_builder.Append(UnboxDate::Unbox(obj)); } else if (PyObject_is_null(obj)) { date_builder.AppendNull(); } else { @@ -762,8 +771,10 @@ Status PandasConverter::ConvertObjects() { return ConvertObjectFixedWidthBytes(type_); case Type::BOOL: return ConvertBooleans(); + case Type::DATE32: + return ConvertDates(); case Type::DATE64: - return ConvertDates(); + return ConvertDates(); case Type::LIST: { const auto& list_field = static_cast(*type_); return ConvertLists(list_field.value_field()->type); @@ -787,7 +798,8 @@ Status PandasConverter::ConvertObjects() { } else if (PyBool_Check(objects[i])) { return ConvertBooleans(); } else if (PyDate_CheckExact(objects[i])) { - return ConvertDates(); + // We could choose Date32 or Date64 + return ConvertDates(); } else if (PyObject_IsInstance(const_cast(objects[i]), Decimal.obj())) { return ConvertDecimals(); } else { @@ -955,34 +967,6 @@ Status PandasObjectsToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo, // ---------------------------------------------------------------------- // pandas 0.x DataFrame conversion internals -inline void set_numpy_metadata(int type, 
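// set_numpy_metadata stamps the frequency unit on NumPy datetime64
// outputs: the unit lives in the array descriptor's c_metadata, so
// TIMESTAMP columns map SECOND/MILLI/MICRO/NANO to NPY_FR_s, NPY_FR_ms,
// NPY_FR_us, and NPY_FR_ns, while date columns fall back to NPY_FR_D.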
DataType* datatype, PyArrayObject* out) { - if (type == NPY_DATETIME) { - PyArray_Descr* descr = PyArray_DESCR(out); - auto date_dtype = reinterpret_cast(descr->c_metadata); - if (datatype->type == Type::TIMESTAMP) { - auto timestamp_type = static_cast(datatype); - - switch (timestamp_type->unit) { - case TimestampType::Unit::SECOND: - date_dtype->meta.base = NPY_FR_s; - break; - case TimestampType::Unit::MILLI: - date_dtype->meta.base = NPY_FR_ms; - break; - case TimestampType::Unit::MICRO: - date_dtype->meta.base = NPY_FR_us; - break; - case TimestampType::Unit::NANO: - date_dtype->meta.base = NPY_FR_ns; - break; - } - } else { - // datatype->type == Type::DATE64 - date_dtype->meta.base = NPY_FR_D; - } - } -} - class PandasBlock { public: enum type { @@ -1148,8 +1132,9 @@ static void ConvertBooleanNoNulls(const ChunkedArray& data, uint8_t* out_values) } } -template +template inline Status ConvertBinaryLike(const ChunkedArray& data, PyObject** out_values) { + using ArrayType = typename TypeTraits::ArrayType; PyAcquireGIL lock; for (int c = 0; c < data.num_chunks(); c++) { auto arr = static_cast(data.chunk(c).get()); @@ -1287,21 +1272,7 @@ inline void ConvertNumericNullableCast( } } -template -inline void ConvertDates(const ChunkedArray& data, T na_value, T* out_values) { - for (int c = 0; c < data.num_chunks(); c++) { - const std::shared_ptr arr = data.chunk(c); - auto prim_arr = static_cast(arr.get()); - auto in_values = reinterpret_cast(prim_arr->data()->data()); - - for (int64_t i = 0; i < arr->length(); ++i) { - // There are 1000 * 60 * 60 * 24 = 86400000ms in a day - *out_values++ = arr->IsNull(i) ? na_value : in_values[i] / 86400000; - } - } -} - -template +template inline void ConvertDatetimeNanos(const ChunkedArray& data, int64_t* out_values) { for (int c = 0; c < data.num_chunks(); c++) { const std::shared_ptr arr = data.chunk(c); @@ -1339,9 +1310,9 @@ class ObjectBlock : public PandasBlock { if (type == Type::BOOL) { RETURN_NOT_OK(ConvertBooleanWithNulls(data, out_buffer)); } else if (type == Type::BINARY) { - RETURN_NOT_OK(ConvertBinaryLike(data, out_buffer)); + RETURN_NOT_OK(ConvertBinaryLike(data, out_buffer)); } else if (type == Type::STRING) { - RETURN_NOT_OK(ConvertBinaryLike(data, out_buffer)); + RETURN_NOT_OK(ConvertBinaryLike(data, out_buffer)); } else if (type == Type::FIXED_SIZE_BINARY) { RETURN_NOT_OK(ConvertFixedSizeBinary(data, out_buffer)); } else if (type == Type::DECIMAL) { @@ -1532,7 +1503,11 @@ class DatetimeBlock : public PandasBlock { const ChunkedArray& data = *col.get()->data(); - if (type == Type::DATE64) { + if (type == Type::DATE32) { + // Date64Type is millisecond timestamp stored as int64_t + // TODO(wesm): Do we want to make sure to zero out the milliseconds? + ConvertDatetimeNanos(data, out_buffer); + } else if (type == Type::DATE64) { // Date64Type is millisecond timestamp stored as int64_t // TODO(wesm): Do we want to make sure to zero out the milliseconds? 
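// Date32 stores days as int32 and Date64 stores milliseconds as int64;
// both branches widen into the nanosecond datetime block, so the shift
// factors are presumably 86,400,000,000,000 (ns per day) for DATE32 and
// 1,000,000 (ns per ms) for DATE64.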
ConvertDatetimeNanos(data, out_buffer); @@ -1779,6 +1754,9 @@ class DataFrameBlockCreator { case Type::FIXED_SIZE_BINARY: output_type = PandasBlock::OBJECT; break; + case Type::DATE32: + output_type = PandasBlock::DATETIME; + break; case Type::DATE64: output_type = PandasBlock::DATETIME; break; @@ -1960,6 +1938,34 @@ class DataFrameBlockCreator { BlockMap datetimetz_blocks_; }; +inline void set_numpy_metadata(int type, DataType* datatype, PyArrayObject* out) { + if (type == NPY_DATETIME) { + PyArray_Descr* descr = PyArray_DESCR(out); + auto date_dtype = reinterpret_cast(descr->c_metadata); + if (datatype->type == Type::TIMESTAMP) { + auto timestamp_type = static_cast(datatype); + + switch (timestamp_type->unit) { + case TimestampType::Unit::SECOND: + date_dtype->meta.base = NPY_FR_s; + break; + case TimestampType::Unit::MILLI: + date_dtype->meta.base = NPY_FR_ms; + break; + case TimestampType::Unit::MICRO: + date_dtype->meta.base = NPY_FR_us; + break; + case TimestampType::Unit::NANO: + date_dtype->meta.base = NPY_FR_ns; + break; + } + } else { + // datatype->type == Type::DATE64 + date_dtype->meta.base = NPY_FR_D; + } + } +} + class ArrowDeserializer { public: ArrowDeserializer(const std::shared_ptr& col, PyObject* py_ref) @@ -2024,51 +2030,14 @@ class ArrowDeserializer { // Allocate new array and deserialize. Can do a zero copy conversion for some // types - Status Convert(PyObject** out) { -#define CONVERT_CASE(TYPE) \ - case Type::TYPE: { \ - RETURN_NOT_OK(ConvertValues()); \ - } break; - - switch (col_->type()->type) { - CONVERT_CASE(BOOL); - CONVERT_CASE(INT8); - CONVERT_CASE(INT16); - CONVERT_CASE(INT32); - CONVERT_CASE(INT64); - CONVERT_CASE(UINT8); - CONVERT_CASE(UINT16); - CONVERT_CASE(UINT32); - CONVERT_CASE(UINT64); - CONVERT_CASE(FLOAT); - CONVERT_CASE(DOUBLE); - CONVERT_CASE(BINARY); - CONVERT_CASE(STRING); - CONVERT_CASE(FIXED_SIZE_BINARY); - CONVERT_CASE(DATE64); - CONVERT_CASE(TIMESTAMP); - CONVERT_CASE(DICTIONARY); - CONVERT_CASE(LIST); - CONVERT_CASE(DECIMAL); - default: { - std::stringstream ss; - ss << "Arrow type reading not implemented for " << col_->type()->ToString(); - return Status::NotImplemented(ss.str()); - } - } - -#undef CONVERT_CASE - - *out = result_; - return Status::OK(); - } + template + typename std::enable_if::value, Status>::type + Visit(const Type& type) { + constexpr int TYPE = Type::type_id; + using traits = arrow_traits; - template - inline typename std::enable_if< - (TYPE != Type::DATE64) & arrow_traits::is_numeric_nullable, Status>::type - ConvertValues() { - typedef typename arrow_traits::T T; - int npy_type = arrow_traits::npy_type; + typedef typename traits::T T; + int npy_type = traits::npy_type; if (data_.num_chunks() == 1 && data_.null_count() == 0 && py_ref_ != nullptr) { return ConvertValuesZeroCopy(npy_type, data_.chunk(0)); @@ -2076,31 +2045,56 @@ class ArrowDeserializer { RETURN_NOT_OK(AllocateOutput(npy_type)); auto out_values = reinterpret_cast(PyArray_DATA(arr_)); - ConvertNumericNullable(data_, arrow_traits::na_value, out_values); + ConvertNumericNullable(data_, traits::na_value, out_values); return Status::OK(); } - template - inline typename std::enable_if::type ConvertValues() { - typedef typename arrow_traits::T T; + template + typename std::enable_if::value || + std::is_base_of::value, + Status>::type + Visit(const Type& type) { + constexpr int TYPE = Type::type_id; + using traits = arrow_traits; + + typedef typename traits::T T; - RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + 
RETURN_NOT_OK(AllocateOutput(traits::npy_type)); auto out_values = reinterpret_cast(PyArray_DATA(arr_)); - ConvertDates(data_, arrow_traits::na_value, out_values); + + constexpr T na_value = traits::na_value; + constexpr int64_t kShift = traits::npy_shift; + + for (int c = 0; c < data_.num_chunks(); c++) { + const std::shared_ptr arr = data_.chunk(c); + auto prim_arr = static_cast(arr.get()); + auto in_values = reinterpret_cast(prim_arr->data()->data()); + + for (int64_t i = 0; i < arr->length(); ++i) { + *out_values++ = arr->IsNull(i) ? na_value : in_values[i] / kShift; + } + } return Status::OK(); } + template + typename std::enable_if::value, Status>::type Visit( + const Type& type) { + return Status::NotImplemented("Don't know how to serialize Arrow time type to NumPy"); + } + // Integer specialization - template - inline - typename std::enable_if::is_numeric_not_nullable, Status>::type - ConvertValues() { - typedef typename arrow_traits::T T; - int npy_type = arrow_traits::npy_type; + template + typename std::enable_if::value, Status>::type Visit( + const Type& type) { + constexpr int TYPE = Type::type_id; + using traits = arrow_traits; + + typedef typename traits::T T; if (data_.num_chunks() == 1 && data_.null_count() == 0 && py_ref_ != nullptr) { - return ConvertValuesZeroCopy(npy_type, data_.chunk(0)); + return ConvertValuesZeroCopy(traits::npy_type, data_.chunk(0)); } if (data_.null_count() > 0) { @@ -2108,7 +2102,7 @@ class ArrowDeserializer { auto out_values = reinterpret_cast(PyArray_DATA(arr_)); ConvertIntegerWithNulls(data_, out_values); } else { - RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + RETURN_NOT_OK(AllocateOutput(traits::npy_type)); auto out_values = reinterpret_cast(PyArray_DATA(arr_)); ConvertIntegerNoNullsSameType(data_, out_values); } @@ -2117,15 +2111,13 @@ class ArrowDeserializer { } // Boolean specialization - template - inline typename std::enable_if::is_boolean, Status>::type - ConvertValues() { + Status Visit(const BooleanType& type) { if (data_.null_count() > 0) { RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); auto out_values = reinterpret_cast(PyArray_DATA(arr_)); RETURN_NOT_OK(ConvertBooleanWithNulls(data_, out_values)); } else { - RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); + RETURN_NOT_OK(AllocateOutput(arrow_traits::npy_type)); auto out_values = reinterpret_cast(PyArray_DATA(arr_)); ConvertBooleanNoNulls(data_, out_values); } @@ -2133,43 +2125,32 @@ class ArrowDeserializer { } // UTF8 strings - template - inline typename std::enable_if::type ConvertValues() { + template + typename std::enable_if::value, Status>::type Visit( + const Type& type) { RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); auto out_values = reinterpret_cast(PyArray_DATA(arr_)); - return ConvertBinaryLike(data_, out_values); - } - - // Binary strings - template - inline typename std::enable_if::type ConvertValues() { - RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); - auto out_values = reinterpret_cast(PyArray_DATA(arr_)); - return ConvertBinaryLike(data_, out_values); + return ConvertBinaryLike(data_, out_values); } // Fixed length binary strings - template - inline typename std::enable_if::type - ConvertValues() { + Status Visit(const FixedSizeBinaryType& type) { RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); auto out_values = reinterpret_cast(PyArray_DATA(arr_)); return ConvertFixedSizeBinary(data_, out_values); } - template - inline typename std::enable_if::type ConvertValues() { + Status Visit(const DecimalType& type) { RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); auto 
out_values = reinterpret_cast(PyArray_DATA(arr_)); return ConvertDecimals(data_, out_values); } + Status Visit(const ListType& type) { #define CONVERTVALUES_LISTSLIKE_CASE(ArrowType, ArrowEnum) \ case Type::ArrowEnum: \ return ConvertListsLike(col_, out_values); - template - inline typename std::enable_if::type ConvertValues() { RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); auto out_values = reinterpret_cast(PyArray_DATA(arr_)); auto list_type = std::static_pointer_cast(col_->type()); @@ -2193,10 +2174,10 @@ class ArrowDeserializer { return Status::NotImplemented(ss.str()); } } +#undef CONVERTVALUES_LISTSLIKE_CASE } - template - inline typename std::enable_if::type ConvertValues() { + Status Visit(const DictionaryType& type) { std::shared_ptr block; RETURN_NOT_OK(MakeCategoricalBlock(col_->type(), col_->length(), &block)); RETURN_NOT_OK(block->Write(col_, 0, 0)); @@ -2216,6 +2197,18 @@ class ArrowDeserializer { return Status::OK(); } + Status Visit(const NullType& type) { return Status::NotImplemented("null type"); } + + Status Visit(const StructType& type) { return Status::NotImplemented("struct type"); } + + Status Visit(const UnionType& type) { return Status::NotImplemented("union type"); } + + Status Convert(PyObject** out) { + RETURN_NOT_OK(VisitTypeInline(*col_->type(), this)); + *out = result_; + return Status::OK(); + } + private: std::shared_ptr col_; const ChunkedArray& data_; diff --git a/cpp/src/arrow/python/pandas_convert.h b/cpp/src/arrow/python/pandas_convert.h index 8fd31076a994f..4d32c8b86cf50 100644 --- a/cpp/src/arrow/python/pandas_convert.h +++ b/cpp/src/arrow/python/pandas_convert.h @@ -71,9 +71,6 @@ ARROW_EXPORT Status PandasObjectsToArrow(MemoryPool* pool, PyObject* ao, PyObject* mo, const std::shared_ptr& type, std::shared_ptr* out); -ARROW_EXPORT -Status InvalidConversion(PyObject* obj, const std::string& expected_type_name); - } // namespace py } // namespace arrow diff --git a/cpp/src/arrow/python/type_traits.h b/cpp/src/arrow/python/type_traits.h index f78dc360095dc..c464d65a4946c 100644 --- a/cpp/src/arrow/python/type_traits.h +++ b/cpp/src/arrow/python/type_traits.h @@ -119,9 +119,6 @@ template <> struct arrow_traits { static constexpr int npy_type = NPY_BOOL; static constexpr bool supports_nulls = false; - static constexpr bool is_boolean = true; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = false; }; #define INT_DECL(TYPE) \ @@ -130,9 +127,6 @@ struct arrow_traits { static constexpr int npy_type = NPY_##TYPE; \ static constexpr bool supports_nulls = false; \ static constexpr double na_value = NAN; \ - static constexpr bool is_boolean = false; \ - static constexpr bool is_numeric_not_nullable = true; \ - static constexpr bool is_numeric_nullable = false; \ typedef typename npy_traits::value_type T; \ }; @@ -150,9 +144,6 @@ struct arrow_traits { static constexpr int npy_type = NPY_FLOAT32; static constexpr bool supports_nulls = true; static constexpr float na_value = NAN; - static constexpr bool is_boolean = false; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = true; typedef typename npy_traits::value_type T; }; @@ -161,33 +152,63 @@ struct arrow_traits { static constexpr int npy_type = NPY_FLOAT64; static constexpr bool supports_nulls = true; static constexpr double na_value = NAN; - static constexpr bool is_boolean = false; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = true; typedef typename 
npy_traits::value_type T; }; static constexpr int64_t kPandasTimestampNull = std::numeric_limits::min(); +constexpr int64_t kNanosecondsInDay = 86400000000000LL; + template <> struct arrow_traits { static constexpr int npy_type = NPY_DATETIME; + static constexpr int64_t npy_shift = 1; + static constexpr bool supports_nulls = true; static constexpr int64_t na_value = kPandasTimestampNull; - static constexpr bool is_boolean = false; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = true; typedef typename npy_traits::value_type T; }; +template <> +struct arrow_traits { + // Data stores as FR_D day unit + static constexpr int npy_type = NPY_DATETIME; + static constexpr int64_t npy_shift = 1; + + static constexpr bool supports_nulls = true; + typedef typename npy_traits::value_type T; + + static constexpr int64_t na_value = kPandasTimestampNull; + static inline bool isnull(int64_t v) { return npy_traits::isnull(v); } +}; + template <> struct arrow_traits { + // Data stores as FR_D day unit static constexpr int npy_type = NPY_DATETIME; + + // There are 1000 * 60 * 60 * 24 = 86400000ms in a day + static constexpr int64_t npy_shift = 86400000; + + static constexpr bool supports_nulls = true; + typedef typename npy_traits::value_type T; + + static constexpr int64_t na_value = kPandasTimestampNull; + static inline bool isnull(int64_t v) { return npy_traits::isnull(v); } +}; + +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_OBJECT; static constexpr bool supports_nulls = true; static constexpr int64_t na_value = kPandasTimestampNull; - static constexpr bool is_boolean = false; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = true; + typedef typename npy_traits::value_type T; +}; + +template <> +struct arrow_traits { + static constexpr int npy_type = NPY_OBJECT; + static constexpr bool supports_nulls = true; typedef typename npy_traits::value_type T; }; @@ -195,18 +216,12 @@ template <> struct arrow_traits { static constexpr int npy_type = NPY_OBJECT; static constexpr bool supports_nulls = true; - static constexpr bool is_boolean = false; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = false; }; template <> struct arrow_traits { static constexpr int npy_type = NPY_OBJECT; static constexpr bool supports_nulls = true; - static constexpr bool is_boolean = false; - static constexpr bool is_numeric_not_nullable = false; - static constexpr bool is_numeric_nullable = false; }; } // namespace py diff --git a/cpp/src/arrow/python/util/datetime.h b/cpp/src/arrow/python/util/datetime.h index f704a96d91bba..82cf6fc48cad4 100644 --- a/cpp/src/arrow/python/util/datetime.h +++ b/cpp/src/arrow/python/util/datetime.h @@ -24,7 +24,7 @@ namespace arrow { namespace py { -inline int64_t PyDate_to_ms(PyDateTime_Date* pydate) { +static inline int64_t PyDate_to_ms(PyDateTime_Date* pydate) { struct tm date = {0}; date.tm_year = PyDateTime_GET_YEAR(pydate) - 1900; date.tm_mon = PyDateTime_GET_MONTH(pydate) - 1; @@ -36,6 +36,10 @@ inline int64_t PyDate_to_ms(PyDateTime_Date* pydate) { return lrint(difftime(mktime(&date), mktime(&epoch)) * 1000); } +static inline int32_t PyDate_to_days(PyDateTime_Date* pydate) { + return static_cast(PyDate_to_ms(pydate) / 86400000LL); +} + } // namespace py } // namespace arrow diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc index 156c3d16d4db0..cdc0238cf4ab8 100644 --- 
a/cpp/src/arrow/table-test.cc +++ b/cpp/src/arrow/table-test.cc @@ -398,62 +398,35 @@ TEST_F(TestTable, AddColumn) { ASSERT_TRUE(status.IsInvalid()); // Add column with wrong length - auto longer_col = std::make_shared( - schema_->field(0), MakePrimitive(length + 1)); + auto longer_col = + std::make_shared(schema_->field(0), MakePrimitive(length + 1)); status = table.AddColumn(0, longer_col, &result); ASSERT_TRUE(status.IsInvalid()); // Add column 0 in different places ASSERT_OK(table.AddColumn(0, columns_[0], &result)); - auto ex_schema = std::shared_ptr(new Schema({ - schema_->field(0), - schema_->field(0), - schema_->field(1), - schema_->field(2)})); + auto ex_schema = std::shared_ptr(new Schema( + {schema_->field(0), schema_->field(0), schema_->field(1), schema_->field(2)})); std::vector> ex_columns = { - table.column(0), - table.column(0), - table.column(1), - table.column(2)}; + table.column(0), table.column(0), table.column(1), table.column(2)}; ASSERT_TRUE(result->Equals(Table(ex_schema, ex_columns))); ASSERT_OK(table.AddColumn(1, columns_[0], &result)); - ex_schema = std::shared_ptr(new Schema({ - schema_->field(0), - schema_->field(0), - schema_->field(1), - schema_->field(2)})); - ex_columns = { - table.column(0), - table.column(0), - table.column(1), - table.column(2)}; + ex_schema = std::shared_ptr(new Schema( + {schema_->field(0), schema_->field(0), schema_->field(1), schema_->field(2)})); + ex_columns = {table.column(0), table.column(0), table.column(1), table.column(2)}; ASSERT_TRUE(result->Equals(Table(ex_schema, ex_columns))); ASSERT_OK(table.AddColumn(2, columns_[0], &result)); - ex_schema = std::shared_ptr(new Schema({ - schema_->field(0), - schema_->field(1), - schema_->field(0), - schema_->field(2)})); - ex_columns = { - table.column(0), - table.column(1), - table.column(0), - table.column(2)}; + ex_schema = std::shared_ptr(new Schema( + {schema_->field(0), schema_->field(1), schema_->field(0), schema_->field(2)})); + ex_columns = {table.column(0), table.column(1), table.column(0), table.column(2)}; ASSERT_TRUE(result->Equals(Table(ex_schema, ex_columns))); ASSERT_OK(table.AddColumn(3, columns_[0], &result)); - ex_schema = std::shared_ptr(new Schema({ - schema_->field(0), - schema_->field(1), - schema_->field(2), - schema_->field(0)})); - ex_columns = { - table.column(0), - table.column(1), - table.column(2), - table.column(0)}; + ex_schema = std::shared_ptr(new Schema( + {schema_->field(0), schema_->field(1), schema_->field(2), schema_->field(0)})); + ex_columns = {table.column(0), table.column(1), table.column(2), table.column(0)}; ASSERT_TRUE(result->Equals(Table(ex_schema, ex_columns))); } diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc index 9b39f770a17b7..4c5257b92c033 100644 --- a/cpp/src/arrow/table.cc +++ b/cpp/src/arrow/table.cc @@ -321,11 +321,9 @@ Status Table::RemoveColumn(int i, std::shared_ptr
* out) const { return Status::OK(); } -Status Table::AddColumn(int i, const std::shared_ptr& col, - std::shared_ptr
* out) const { - if (i < 0 || i > num_columns() + 1) { - return Status::Invalid("Invalid column index."); - } +Status Table::AddColumn( + int i, const std::shared_ptr& col, std::shared_ptr
* out) const { + if (i < 0 || i > num_columns() + 1) { return Status::Invalid("Invalid column index."); } if (col == nullptr) { std::stringstream ss; ss << "Column " << i << " was null"; @@ -334,7 +332,7 @@ Status Table::AddColumn(int i, const std::shared_ptr& col, if (col->length() != num_rows_) { std::stringstream ss; ss << "Added column's length must match table's length. Expected length " << num_rows_ - << " but got length " << col->length(); + << " but got length " << col->length(); return Status::Invalid(ss.str()); } diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h index dcea53d8fb1dd..b15d31b23a872 100644 --- a/cpp/src/arrow/table.h +++ b/cpp/src/arrow/table.h @@ -182,8 +182,8 @@ class ARROW_EXPORT Table { Status RemoveColumn(int i, std::shared_ptr
* out) const; /// Add column to the table, producing a new Table - Status AddColumn(int i, const std::shared_ptr& column, - std::shared_ptr
* out) const; + Status AddColumn( + int i, const std::shared_ptr& column, std::shared_ptr
* out) const; // @returns: the number of columns in the table int num_columns() const { return static_cast(columns_.size()); } diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index df4590f18d733..93cab14d797c3 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -258,8 +258,8 @@ std::shared_ptr Schema::GetFieldByName(const std::string& name) { } } -Status Schema::AddField(int i, const std::shared_ptr& field, - std::shared_ptr* out) const { +Status Schema::AddField( + int i, const std::shared_ptr& field, std::shared_ptr* out) const { DCHECK_GE(i, 0); DCHECK_LE(i, this->num_fields()); diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 3a35f56381197..730cbed8f4d67 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -633,8 +633,8 @@ class ARROW_EXPORT Schema { // Render a string representation of the schema suitable for debugging std::string ToString() const; - Status AddField(int i, const std::shared_ptr& field, - std::shared_ptr* out) const; + Status AddField( + int i, const std::shared_ptr& field, std::shared_ptr* out) const; Status RemoveField(int i, std::shared_ptr* out) const; int num_fields() const { return static_cast(fields_.size()); } diff --git a/cpp/src/arrow/util/stl.h b/cpp/src/arrow/util/stl.h index bd250539a8c8a..bfce111ff8a22 100644 --- a/cpp/src/arrow/util/stl.h +++ b/cpp/src/arrow/util/stl.h @@ -40,8 +40,8 @@ inline std::vector DeleteVectorElement(const std::vector& values, size_t i } template -inline std::vector AddVectorElement(const std::vector& values, size_t index, - const T& new_element) { +inline std::vector AddVectorElement( + const std::vector& values, size_t index, const T& new_element) { DCHECK_LE(index, values.size()); std::vector out; out.reserve(values.size() + 1); diff --git a/python/pyarrow/scalar.pyx b/python/pyarrow/scalar.pyx index f3d9321326964..196deedefa959 100644 --- a/python/pyarrow/scalar.pyx +++ b/python/pyarrow/scalar.pyx @@ -134,7 +134,11 @@ cdef class UInt64Value(ArrayValue): cdef class Date32Value(ArrayValue): def as_py(self): - raise NotImplementedError + cdef CDate32Array* ap = self.sp_array.get() + + # Shift to seconds since epoch + return datetime.datetime.utcfromtimestamp( + int(ap.Value(self.index)) * 86400).date() cdef class Date64Value(ArrayValue): diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index 0504e1ddb4f53..d1bea0b3e32f0 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -28,7 +28,7 @@ import pandas.util.testing as tm from pyarrow.compat import u -import pyarrow as A +import pyarrow as pa from .pandas_examples import dataframe_with_arrays, dataframe_with_lists @@ -67,7 +67,7 @@ def tearDown(self): def _check_pandas_roundtrip(self, df, expected=None, nthreads=1, timestamps_to_ms=False, expected_schema=None, check_dtype=True, schema=None): - table = A.Table.from_pandas(df, timestamps_to_ms=timestamps_to_ms, + table = pa.Table.from_pandas(df, timestamps_to_ms=timestamps_to_ms, schema=schema) result = table.to_pandas(nthreads=nthreads) if expected_schema: @@ -78,7 +78,7 @@ def _check_pandas_roundtrip(self, df, expected=None, nthreads=1, def _check_array_roundtrip(self, values, expected=None, mask=None, timestamps_to_ms=False, type=None): - arr = A.Array.from_numpy(values, timestamps_to_ms=timestamps_to_ms, + arr = pa.Array.from_numpy(values, timestamps_to_ms=timestamps_to_ms, mask=mask, type=type) result = arr.to_pandas() @@ -99,23 +99,23 @@ def _check_array_roundtrip(self, 
values, expected=None, mask=None, def test_float_no_nulls(self): data = {} fields = [] - dtypes = [('f4', A.float32()), ('f8', A.float64())] + dtypes = [('f4', pa.float32()), ('f8', pa.float64())] num_values = 100 for numpy_dtype, arrow_dtype in dtypes: values = np.random.randn(num_values) data[numpy_dtype] = values.astype(numpy_dtype) - fields.append(A.Field.from_py(numpy_dtype, arrow_dtype)) + fields.append(pa.Field.from_py(numpy_dtype, arrow_dtype)) df = pd.DataFrame(data) - schema = A.Schema.from_fields(fields) + schema = pa.Schema.from_fields(fields) self._check_pandas_roundtrip(df, expected_schema=schema) def test_float_nulls(self): num_values = 100 null_mask = np.random.randint(0, 10, size=num_values) < 3 - dtypes = [('f4', A.float32()), ('f8', A.float64())] + dtypes = [('f4', pa.float32()), ('f8', pa.float64())] names = ['f4', 'f8'] expected_cols = [] @@ -124,9 +124,9 @@ def test_float_nulls(self): for name, arrow_dtype in dtypes: values = np.random.randn(num_values).astype(name) - arr = A.Array.from_numpy(values, null_mask) + arr = pa.Array.from_numpy(values, null_mask) arrays.append(arr) - fields.append(A.Field.from_py(name, arrow_dtype)) + fields.append(pa.Field.from_py(name, arrow_dtype)) values[null_mask] = np.nan expected_cols.append(values) @@ -134,8 +134,8 @@ def test_float_nulls(self): ex_frame = pd.DataFrame(dict(zip(names, expected_cols)), columns=names) - table = A.Table.from_arrays(arrays, names) - assert table.schema.equals(A.Schema.from_fields(fields)) + table = pa.Table.from_arrays(arrays, names) + assert table.schema.equals(pa.Schema.from_fields(fields)) result = table.to_pandas() tm.assert_frame_equal(result, ex_frame) @@ -144,11 +144,11 @@ def test_integer_no_nulls(self): fields = [] numpy_dtypes = [ - ('i1', A.int8()), ('i2', A.int16()), - ('i4', A.int32()), ('i8', A.int64()), - ('u1', A.uint8()), ('u2', A.uint16()), - ('u4', A.uint32()), ('u8', A.uint64()), - ('longlong', A.int64()), ('ulonglong', A.uint64()) + ('i1', pa.int8()), ('i2', pa.int16()), + ('i4', pa.int32()), ('i8', pa.int64()), + ('u1', pa.uint8()), ('u2', pa.uint16()), + ('u4', pa.uint32()), ('u8', pa.uint64()), + ('longlong', pa.int64()), ('ulonglong', pa.uint64()) ] num_values = 100 @@ -158,10 +158,10 @@ def test_integer_no_nulls(self): min(info.max, np.iinfo('i8').max), size=num_values) data[dtype] = values.astype(dtype) - fields.append(A.Field.from_py(dtype, arrow_dtype)) + fields.append(pa.Field.from_py(dtype, arrow_dtype)) df = pd.DataFrame(data) - schema = A.Schema.from_fields(fields) + schema = pa.Schema.from_fields(fields) self._check_pandas_roundtrip(df, expected_schema=schema) def test_integer_with_nulls(self): @@ -177,7 +177,7 @@ def test_integer_with_nulls(self): for name in int_dtypes: values = np.random.randint(0, 100, size=num_values) - arr = A.Array.from_numpy(values, null_mask) + arr = pa.Array.from_numpy(values, null_mask) arrays.append(arr) expected = values.astype('f8') @@ -188,7 +188,7 @@ def test_integer_with_nulls(self): ex_frame = pd.DataFrame(dict(zip(int_dtypes, expected_cols)), columns=int_dtypes) - table = A.Table.from_arrays(arrays, int_dtypes) + table = pa.Table.from_arrays(arrays, int_dtypes) result = table.to_pandas() tm.assert_frame_equal(result, ex_frame) @@ -199,8 +199,8 @@ def test_boolean_no_nulls(self): np.random.seed(0) df = pd.DataFrame({'bools': np.random.randn(num_values) > 0}) - field = A.Field.from_py('bools', A.bool_()) - schema = A.Schema.from_fields([field]) + field = pa.Field.from_py('bools', pa.bool_()) + schema = pa.Schema.from_fields([field]) 
self._check_pandas_roundtrip(df, expected_schema=schema) def test_boolean_nulls(self): @@ -211,16 +211,16 @@ def test_boolean_nulls(self): mask = np.random.randint(0, 10, size=num_values) < 3 values = np.random.randint(0, 10, size=num_values) < 5 - arr = A.Array.from_numpy(values, mask) + arr = pa.Array.from_numpy(values, mask) expected = values.astype(object) expected[mask] = None - field = A.Field.from_py('bools', A.bool_()) - schema = A.Schema.from_fields([field]) + field = pa.Field.from_py('bools', pa.bool_()) + schema = pa.Schema.from_fields([field]) ex_frame = pd.DataFrame({'bools': expected}) - table = A.Table.from_arrays([arr], ['bools']) + table = pa.Table.from_arrays([arr], ['bools']) assert table.schema.equals(schema) result = table.to_pandas() @@ -229,16 +229,16 @@ def test_boolean_nulls(self): def test_boolean_object_nulls(self): arr = np.array([False, None, True] * 100, dtype=object) df = pd.DataFrame({'bools': arr}) - field = A.Field.from_py('bools', A.bool_()) - schema = A.Schema.from_fields([field]) + field = pa.Field.from_py('bools', pa.bool_()) + schema = pa.Schema.from_fields([field]) self._check_pandas_roundtrip(df, expected_schema=schema) def test_unicode(self): repeats = 1000 values = [u'foo', None, u'bar', u'mañana', np.nan] df = pd.DataFrame({'strings': values * repeats}) - field = A.Field.from_py('strings', A.string()) - schema = A.Schema.from_fields([field]) + field = pa.Field.from_py('strings', pa.string()) + schema = pa.Schema.from_fields([field]) self._check_pandas_roundtrip(df, expected_schema=schema) @@ -246,8 +246,8 @@ def test_bytes_to_binary(self): values = [u('qux'), b'foo', None, 'bar', 'qux', np.nan] df = pd.DataFrame({'strings': values}) - table = A.Table.from_pandas(df) - assert table[0].type == A.binary() + table = pa.Table.from_pandas(df) + assert table[0].type == pa.binary() values2 = [b'qux', b'foo', None, b'bar', b'qux', np.nan] expected = pd.DataFrame({'strings': values2}) @@ -256,8 +256,8 @@ def test_bytes_to_binary(self): def test_fixed_size_bytes(self): values = [b'foo', None, b'bar', None, None, b'hey'] df = pd.DataFrame({'strings': values}) - schema = A.Schema.from_fields([A.field('strings', A.binary(3))]) - table = A.Table.from_pandas(df, schema=schema) + schema = pa.Schema.from_fields([pa.field('strings', pa.binary(3))]) + table = pa.Table.from_pandas(df, schema=schema) assert table.schema[0].type == schema[0].type assert table.schema[0].name == schema[0].name result = table.to_pandas() @@ -266,9 +266,9 @@ def test_fixed_size_bytes(self): def test_fixed_size_bytes_does_not_accept_varying_lengths(self): values = [b'foo', None, b'ba', None, None, b'hey'] df = pd.DataFrame({'strings': values}) - schema = A.Schema.from_fields([A.field('strings', A.binary(3))]) - with self.assertRaises(A.ArrowInvalid): - A.Table.from_pandas(df, schema=schema) + schema = pa.Schema.from_fields([pa.field('strings', pa.binary(3))]) + with self.assertRaises(pa.ArrowInvalid): + pa.Table.from_pandas(df, schema=schema) def test_timestamps_notimezone_no_nulls(self): df = pd.DataFrame({ @@ -278,8 +278,8 @@ def test_timestamps_notimezone_no_nulls(self): '2010-08-13T05:46:57.437'], dtype='datetime64[ms]') }) - field = A.Field.from_py('datetime64', A.timestamp('ms')) - schema = A.Schema.from_fields([field]) + field = pa.Field.from_py('datetime64', pa.timestamp('ms')) + schema = pa.Schema.from_fields([field]) self._check_pandas_roundtrip(df, timestamps_to_ms=True, expected_schema=schema) @@ -290,8 +290,8 @@ def test_timestamps_notimezone_no_nulls(self): 
'2010-08-13T05:46:57.437699912'], dtype='datetime64[ns]') }) - field = A.Field.from_py('datetime64', A.timestamp('ns')) - schema = A.Schema.from_fields([field]) + field = pa.Field.from_py('datetime64', pa.timestamp('ns')) + schema = pa.Schema.from_fields([field]) self._check_pandas_roundtrip(df, timestamps_to_ms=False, expected_schema=schema) @@ -303,8 +303,8 @@ def test_timestamps_notimezone_nulls(self): '2010-08-13T05:46:57.437'], dtype='datetime64[ms]') }) - field = A.Field.from_py('datetime64', A.timestamp('ms')) - schema = A.Schema.from_fields([field]) + field = pa.Field.from_py('datetime64', pa.timestamp('ms')) + schema = pa.Schema.from_fields([field]) self._check_pandas_roundtrip(df, timestamps_to_ms=True, expected_schema=schema) @@ -315,8 +315,8 @@ def test_timestamps_notimezone_nulls(self): '2010-08-13T05:46:57.437699912'], dtype='datetime64[ns]') }) - field = A.Field.from_py('datetime64', A.timestamp('ns')) - schema = A.Schema.from_fields([field]) + field = pa.Field.from_py('datetime64', pa.timestamp('ns')) + schema = pa.Schema.from_fields([field]) self._check_pandas_roundtrip(df, timestamps_to_ms=False, expected_schema=schema) @@ -345,25 +345,77 @@ def test_timestamps_with_timezone(self): .to_frame()) self._check_pandas_roundtrip(df, timestamps_to_ms=False) - def test_date(self): + def test_date_infer(self): df = pd.DataFrame({ 'date': [datetime.date(2000, 1, 1), None, datetime.date(1970, 1, 1), datetime.date(2040, 2, 26)]}) - table = A.Table.from_pandas(df) - field = A.Field.from_py('date', A.date64()) - schema = A.Schema.from_fields([field]) + table = pa.Table.from_pandas(df) + field = pa.Field.from_py('date', pa.date32()) + schema = pa.Schema.from_fields([field]) assert table.schema.equals(schema) result = table.to_pandas() expected = df.copy() expected['date'] = pd.to_datetime(df['date']) tm.assert_frame_equal(result, expected) + def test_date_objects_typed(self): + arr = np.array([ + datetime.date(2017, 4, 3), + None, + datetime.date(2017, 4, 4), + datetime.date(2017, 4, 5)], dtype=object) + + arr_i4 = np.array([17259, -1, 17260, 17261], dtype='int32') + arr_i8 = arr_i4.astype('int64') * 86400000 + mask = np.array([False, True, False, False]) + + t32 = pa.date32() + t64 = pa.date64() + + a32 = pa.Array.from_numpy(arr, type=t32) + a64 = pa.Array.from_numpy(arr, type=t64) + + a32_expected = pa.Array.from_numpy(arr_i4, mask=mask, type=t32) + a64_expected = pa.Array.from_numpy(arr_i8, mask=mask, type=t64) + + assert a32.equals(a32_expected) + assert a64.equals(a64_expected) + + # Test converting back to pandas + colnames = ['date32', 'date64'] + table = pa.Table.from_arrays([a32, a64], colnames) + table_pandas = table.to_pandas() + + ex_values = (np.array(['2017-04-03', '2017-04-04', '2017-04-04', + '2017-04-05'], + dtype='datetime64[D]') + .astype('datetime64[ns]')) + ex_values[1] = pd.NaT.value + expected_pandas = pd.DataFrame({'date32': ex_values, + 'date64': ex_values}, + columns=colnames) + tm.assert_frame_equal(table_pandas, expected_pandas) + + def test_dates_from_integers(self): + t1 = pa.date32() + t2 = pa.date64() + + arr = np.array([17259, 17260, 17261], dtype='int32') + arr2 = arr.astype('int64') * 86400000 + + a1 = pa.Array.from_numpy(arr, type=t1) + a2 = pa.Array.from_numpy(arr2, type=t2) + + expected = datetime.date(2017, 4, 3) + assert a1[0].as_py() == expected + assert a2[0].as_py() == expected + def test_column_of_arrays(self): df, schema = dataframe_with_arrays() self._check_pandas_roundtrip(df, schema=schema, expected_schema=schema) - table = 
A.Table.from_pandas(df, schema=schema) + table = pa.Table.from_pandas(df, schema=schema) assert table.schema.equals(schema) for column in df.columns: @@ -373,7 +425,7 @@ def test_column_of_arrays(self): def test_column_of_lists(self): df, schema = dataframe_with_lists() self._check_pandas_roundtrip(df, schema=schema, expected_schema=schema) - table = A.Table.from_pandas(df, schema=schema) + table = pa.Table.from_pandas(df, schema=schema) assert table.schema.equals(schema) for column in df.columns: @@ -410,8 +462,8 @@ def test_category(self): def test_mixed_types_fails(self): data = pd.DataFrame({'a': ['a', 1, 2.0]}) - with self.assertRaises(A.ArrowException): - A.Table.from_pandas(data) + with self.assertRaises(pa.ArrowException): + pa.Table.from_pandas(data) def test_strided_data_import(self): cases = [] @@ -460,9 +512,9 @@ def test_decimal_32_from_pandas(self): decimal.Decimal('1234.439'), ] }) - converted = A.Table.from_pandas(expected) - field = A.Field.from_py('decimals', A.decimal(7, 3)) - schema = A.Schema.from_fields([field]) + converted = pa.Table.from_pandas(expected) + field = pa.Field.from_py('decimals', pa.decimal(7, 3)) + schema = pa.Schema.from_fields([field]) assert converted.schema.equals(schema) def test_decimal_32_to_pandas(self): @@ -472,7 +524,7 @@ def test_decimal_32_to_pandas(self): decimal.Decimal('1234.439'), ] }) - converted = A.Table.from_pandas(expected) + converted = pa.Table.from_pandas(expected) df = converted.to_pandas() tm.assert_frame_equal(df, expected) @@ -483,9 +535,9 @@ def test_decimal_64_from_pandas(self): decimal.Decimal('129534.123731'), ] }) - converted = A.Table.from_pandas(expected) - field = A.Field.from_py('decimals', A.decimal(12, 6)) - schema = A.Schema.from_fields([field]) + converted = pa.Table.from_pandas(expected) + field = pa.Field.from_py('decimals', pa.decimal(12, 6)) + schema = pa.Schema.from_fields([field]) assert converted.schema.equals(schema) def test_decimal_64_to_pandas(self): @@ -495,7 +547,7 @@ def test_decimal_64_to_pandas(self): decimal.Decimal('129534.123731'), ] }) - converted = A.Table.from_pandas(expected) + converted = pa.Table.from_pandas(expected) df = converted.to_pandas() tm.assert_frame_equal(df, expected) @@ -506,9 +558,9 @@ def test_decimal_128_from_pandas(self): -decimal.Decimal('314292388910493.12343437128'), ] }) - converted = A.Table.from_pandas(expected) - field = A.Field.from_py('decimals', A.decimal(26, 11)) - schema = A.Schema.from_fields([field]) + converted = pa.Table.from_pandas(expected) + field = pa.Field.from_py('decimals', pa.decimal(26, 11)) + schema = pa.Schema.from_fields([field]) assert converted.schema.equals(schema) def test_decimal_128_to_pandas(self): @@ -518,6 +570,6 @@ def test_decimal_128_to_pandas(self): -decimal.Decimal('314292388910493.12343437128'), ] }) - converted = A.Table.from_pandas(expected) + converted = pa.Table.from_pandas(expected) df = converted.to_pandas() tm.assert_frame_equal(df, expected) From 72e1e08754003a56b413f49a107d55d61519f7ef Mon Sep 17 00:00:00 2001 From: Phillip Cloud Date: Sun, 9 Apr 2017 21:11:33 -0400 Subject: [PATCH 0488/1644] ARROW-800: [C++] Boost headers being transitively included in pyarrow thanks to @wesm for suggesting the idea of returning `std::string` and doing the dispatching in c++. 
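For intuition about the new interface: after this change a decimal value crosses the C++/Python boundary as a single formatted string, which `DecimalValue.as_py` hands to Python's `decimal` module (see the `scalar.pyx` hunk below). Here is a minimal sketch of that contract in plain Python; `format_decimal` is an illustrative stand-in for the C++ `DecimalArray::FormatValue`, not a pyarrow API, and the value, precision, and scale are made up for the example.

```python
import decimal

def format_decimal(raw: int, scale: int) -> str:
    # Render a raw scaled integer (e.g. 1234439 with scale 3) as '1234.439',
    # analogous to the std::string that DecimalArray::FormatValue returns.
    sign = '-' if raw < 0 else ''
    digits = str(abs(raw)).rjust(scale + 1, '0')
    return f"{sign}{digits[:-scale]}.{digits[-scale:]}" if scale else sign + digits

s = format_decimal(1234439, scale=3)   # '1234.439'
value = decimal.Decimal(s)             # what DecimalValue.as_py() then returns
assert value == decimal.Decimal('1234.439')
```

Doing the width dispatch (4, 8, or 16 bytes) once in C++ and parsing one string in Python avoids exposing `boost::multiprecision::int128_t` through the Cython layer, which is the point of the patch.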
Author: Phillip Cloud Closes #518 from cpcloud/ARROW-800 and squashes the following commits: a983841 [Phillip Cloud] Formatting ba46502 [Phillip Cloud] decimal namespace and change to FormatValue f326f3a [Phillip Cloud] Const things 9001432 [Phillip Cloud] Remove ARROW_EXPORT of method inside ARROW_EXPORTed class 0c300ec [Phillip Cloud] ARROW-800: [C++] Boost headers being transitively included in pyarrow --- cpp/src/arrow/array-decimal-test.cc | 2 ++ cpp/src/arrow/array.cc | 44 +++++++++++++++---------- cpp/src/arrow/array.h | 3 +- cpp/src/arrow/builder.cc | 12 +++---- cpp/src/arrow/builder.h | 6 +++- cpp/src/arrow/python/builtin_convert.cc | 2 +- cpp/src/arrow/python/helpers.cc | 12 ++++--- cpp/src/arrow/python/helpers.h | 7 +++- cpp/src/arrow/python/pandas_convert.cc | 10 +++--- cpp/src/arrow/python/python-test.cc | 4 +-- cpp/src/arrow/type_fwd.h | 2 +- cpp/src/arrow/util/CMakeLists.txt | 1 - cpp/src/arrow/util/decimal-test.cc | 2 ++ cpp/src/arrow/util/decimal.cc | 2 ++ cpp/src/arrow/util/decimal.h | 2 ++ python/pyarrow/includes/common.pxd | 5 --- python/pyarrow/includes/libarrow.pxd | 7 +--- python/pyarrow/scalar.pyx | 9 +---- 18 files changed, 71 insertions(+), 61 deletions(-) diff --git a/cpp/src/arrow/array-decimal-test.cc b/cpp/src/arrow/array-decimal-test.cc index 4c01f928a6f26..8353acc454f40 100644 --- a/cpp/src/arrow/array-decimal-test.cc +++ b/cpp/src/arrow/array-decimal-test.cc @@ -23,6 +23,7 @@ #include "arrow/util/decimal.h" namespace arrow { +namespace decimal { TEST(TypesTest, TestDecimal32Type) { DecimalType t1(8, 4); @@ -221,4 +222,5 @@ INSTANTIATE_TEST_CASE_P(Decimal128BuilderTest, Decimal128BuilderTest, ::testing::Range( DecimalPrecision::minimum, DecimalPrecision::maximum)); +} // namespace decimal } // namespace arrow diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index 4e73e7176fa9c..c4a78f3b2e400 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -310,25 +310,35 @@ bool DecimalArray::IsNegative(int64_t i) const { return sign_bitmap_data_ != nullptr ? 
BitUtil::GetBit(sign_bitmap_data_, i) : false; } -template -ARROW_EXPORT Decimal DecimalArray::Value(int64_t i) const { - Decimal result; - FromBytes(GetValue(i), &result); - return result; -} - -template ARROW_EXPORT Decimal32 DecimalArray::Value(int64_t i) const; -template ARROW_EXPORT Decimal64 DecimalArray::Value(int64_t i) const; - -template <> -ARROW_EXPORT Decimal128 DecimalArray::Value(int64_t i) const { - Decimal128 result; - FromBytes(GetValue(i), IsNegative(i), &result); - return result; +std::string DecimalArray::FormatValue(int64_t i) const { + const auto type_ = std::dynamic_pointer_cast(type()); + const int precision = type_->precision; + const int scale = type_->scale; + const int byte_width = byte_width_; + const uint8_t* bytes = GetValue(i); + switch (byte_width) { + case 4: { + decimal::Decimal32 value; + decimal::FromBytes(bytes, &value); + return decimal::ToString(value, precision, scale); + } + case 8: { + decimal::Decimal64 value; + decimal::FromBytes(bytes, &value); + return decimal::ToString(value, precision, scale); + } + case 16: { + decimal::Decimal128 value; + decimal::FromBytes(bytes, IsNegative(i), &value); + return decimal::ToString(value, precision, scale); + } + default: { + DCHECK(false) << "Invalid byte width: " << byte_width; + return ""; + } + } } -template ARROW_EXPORT Decimal128 DecimalArray::Value(int64_t i) const; - std::shared_ptr DecimalArray::Slice(int64_t offset, int64_t length) const { ConformSliceParams(offset_, length_, &offset, &length); return std::make_shared( diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index a4117facdefd0..4f8b22e31b4eb 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -384,8 +384,7 @@ class ARROW_EXPORT DecimalArray : public FixedSizeBinaryArray { bool IsNegative(int64_t i) const; - template - ARROW_EXPORT Decimal Value(int64_t i) const; + std::string FormatValue(int64_t i) const; std::shared_ptr Slice(int64_t offset, int64_t length) const override; diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index a3677eff68669..4281a61474cce 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -332,7 +332,7 @@ DecimalBuilder::DecimalBuilder(MemoryPool* pool, const std::shared_ptr sign_bitmap_data_(nullptr) {} template -ARROW_EXPORT Status DecimalBuilder::Append(const Decimal& val) { +ARROW_EXPORT Status DecimalBuilder::Append(const decimal::Decimal& val) { DCHECK_EQ(sign_bitmap_, nullptr) << "sign_bitmap_ is not null"; DCHECK_EQ(sign_bitmap_data_, nullptr) << "sign_bitmap_data_ is not null"; @@ -340,11 +340,11 @@ ARROW_EXPORT Status DecimalBuilder::Append(const Decimal& val) { return FixedSizeBinaryBuilder::Append(reinterpret_cast(&val.value)); } -template ARROW_EXPORT Status DecimalBuilder::Append(const Decimal32& val); -template ARROW_EXPORT Status DecimalBuilder::Append(const Decimal64& val); +template ARROW_EXPORT Status DecimalBuilder::Append(const decimal::Decimal32& val); +template ARROW_EXPORT Status DecimalBuilder::Append(const decimal::Decimal64& val); template <> -ARROW_EXPORT Status DecimalBuilder::Append(const Decimal128& value) { +ARROW_EXPORT Status DecimalBuilder::Append(const decimal::Decimal128& value) { DCHECK_NE(sign_bitmap_, nullptr) << "sign_bitmap_ is null"; DCHECK_NE(sign_bitmap_data_, nullptr) << "sign_bitmap_data_ is null"; @@ -352,7 +352,7 @@ ARROW_EXPORT Status DecimalBuilder::Append(const Decimal128& value) { uint8_t stack_bytes[16] = {0}; uint8_t* bytes = stack_bytes; bool is_negative; - ToBytes(value, &bytes, &is_negative); + 
decimal::ToBytes(value, &bytes, &is_negative); RETURN_NOT_OK(FixedSizeBinaryBuilder::Append(bytes)); // TODO(phillipc): calculate the proper storage size here (do we have a function to do @@ -363,7 +363,7 @@ ARROW_EXPORT Status DecimalBuilder::Append(const Decimal128& value) { return Status::OK(); } -template ARROW_EXPORT Status DecimalBuilder::Append(const Decimal128& val); +template ARROW_EXPORT Status DecimalBuilder::Append(const decimal::Decimal128& val); Status DecimalBuilder::Init(int64_t capacity) { RETURN_NOT_OK(FixedSizeBinaryBuilder::Init(capacity)); diff --git a/cpp/src/arrow/builder.h b/cpp/src/arrow/builder.h index d42ab5b01d1ba..68769165b02c0 100644 --- a/cpp/src/arrow/builder.h +++ b/cpp/src/arrow/builder.h @@ -37,9 +37,13 @@ namespace arrow { class Array; +namespace decimal { + template struct Decimal; +} // namespace decimal + static constexpr int64_t kMinBuilderCapacity = 1 << 5; /// Base class for all data array builders. @@ -421,7 +425,7 @@ class ARROW_EXPORT DecimalBuilder : public FixedSizeBinaryBuilder { explicit DecimalBuilder(MemoryPool* pool, const std::shared_ptr& type); template - ARROW_EXPORT Status Append(const Decimal& val); + ARROW_EXPORT Status Append(const decimal::Decimal& val); Status Init(int64_t capacity) override; Status Resize(int64_t capacity) override; diff --git a/cpp/src/arrow/python/builtin_convert.cc b/cpp/src/arrow/python/builtin_convert.cc index a064a3daf970d..1ae13f3db061c 100644 --- a/cpp/src/arrow/python/builtin_convert.cc +++ b/cpp/src/arrow/python/builtin_convert.cc @@ -523,7 +523,7 @@ class ListConverter : public TypedConverter { #define DECIMAL_CONVERT_CASE(bit_width, item, builder) \ case bit_width: { \ - arrow::Decimal##bit_width out; \ + arrow::decimal::Decimal##bit_width out; \ RETURN_NOT_OK(PythonDecimalToArrowDecimal((item), &out)); \ RETURN_NOT_OK((builder)->Append(out)); \ break; \ diff --git a/cpp/src/arrow/python/helpers.cc b/cpp/src/arrow/python/helpers.cc index ffba7bbc21c14..3d3d07a515833 100644 --- a/cpp/src/arrow/python/helpers.cc +++ b/cpp/src/arrow/python/helpers.cc @@ -74,7 +74,8 @@ Status ImportFromModule(const OwnedRef& module, const std::string& name, OwnedRe } template -Status PythonDecimalToArrowDecimal(PyObject* python_decimal, Decimal* arrow_decimal) { +Status PythonDecimalToArrowDecimal( + PyObject* python_decimal, decimal::Decimal* arrow_decimal) { // Call Python's str(decimal_object) OwnedRef str_obj(PyObject_Str(python_decimal)); RETURN_IF_PYERROR(); @@ -92,11 +93,11 @@ Status PythonDecimalToArrowDecimal(PyObject* python_decimal, Decimal* arrow_d } template Status PythonDecimalToArrowDecimal( - PyObject* python_decimal, Decimal32* arrow_decimal); + PyObject* python_decimal, decimal::Decimal32* arrow_decimal); template Status PythonDecimalToArrowDecimal( - PyObject* python_decimal, Decimal64* arrow_decimal); + PyObject* python_decimal, decimal::Decimal64* arrow_decimal); template Status PythonDecimalToArrowDecimal( - PyObject* python_decimal, Decimal128* arrow_decimal); + PyObject* python_decimal, decimal::Decimal128* arrow_decimal); Status InferDecimalPrecisionAndScale( PyObject* python_decimal, int* precision, int* scale) { @@ -111,7 +112,8 @@ Status InferDecimalPrecisionAndScale( auto size = str.size; std::string c_string(bytes, size); - return FromString(c_string, static_cast(nullptr), precision, scale); + return FromString( + c_string, static_cast(nullptr), precision, scale); } Status DecimalFromString( diff --git a/cpp/src/arrow/python/helpers.h b/cpp/src/arrow/python/helpers.h index 
a19b25f7db805..77fde263de7e0 100644 --- a/cpp/src/arrow/python/helpers.h +++ b/cpp/src/arrow/python/helpers.h @@ -29,9 +29,13 @@ namespace arrow { +namespace decimal { + template struct Decimal; +} // namespace decimal + namespace py { class OwnedRef; @@ -43,7 +47,8 @@ Status ImportFromModule( const OwnedRef& module, const std::string& module_name, OwnedRef* ref); template -Status PythonDecimalToArrowDecimal(PyObject* python_decimal, Decimal* arrow_decimal); +Status PythonDecimalToArrowDecimal( + PyObject* python_decimal, decimal::Decimal* arrow_decimal); Status InferDecimalPrecisionAndScale( PyObject* python_decimal, int* precision = nullptr, int* scale = nullptr); diff --git a/cpp/src/arrow/python/pandas_convert.cc b/cpp/src/arrow/python/pandas_convert.cc index 5bb8e45e191a9..1a250e83c5093 100644 --- a/cpp/src/arrow/python/pandas_convert.cc +++ b/cpp/src/arrow/python/pandas_convert.cc @@ -530,7 +530,7 @@ Status PandasConverter::ConvertDates() { #define CONVERT_DECIMAL_CASE(bit_width, builder, object) \ case bit_width: { \ - Decimal##bit_width d; \ + decimal::Decimal##bit_width d; \ RETURN_NOT_OK(PythonDecimalToArrowDecimal((object), &d)); \ RETURN_NOT_OK((builder).Append(d)); \ break; \ @@ -620,7 +620,7 @@ Status PandasConverter::ConvertObjectFixedWidthBytes( template Status validate_precision(int precision) { - constexpr static const int maximum_precision = DecimalPrecision::maximum; + constexpr static const int maximum_precision = decimal::DecimalPrecision::maximum; if (!(precision > 0 && precision <= maximum_precision)) { std::stringstream ss; ss << "Invalid precision: " << precision << ". Minimum is 1, maximum is " @@ -636,7 +636,7 @@ Status RawDecimalToString( DCHECK_NE(bytes, nullptr); DCHECK_NE(result, nullptr); RETURN_NOT_OK(validate_precision(precision)); - Decimal decimal; + decimal::Decimal decimal; FromBytes(bytes, &decimal); *result = ToString(decimal, precision, scale); return Status::OK(); @@ -651,8 +651,8 @@ Status RawDecimalToString(const uint8_t* bytes, int precision, int scale, bool is_negative, std::string* result) { DCHECK_NE(bytes, nullptr); DCHECK_NE(result, nullptr); - RETURN_NOT_OK(validate_precision(precision)); - Decimal128 decimal; + RETURN_NOT_OK(validate_precision(precision)); + decimal::Decimal128 decimal; FromBytes(bytes, is_negative, &decimal); *result = ToString(decimal, precision, scale); return Status::OK(); diff --git a/cpp/src/arrow/python/python-test.cc b/cpp/src/arrow/python/python-test.cc index b63d2ffb1cd2c..a4a11c039b60c 100644 --- a/cpp/src/arrow/python/python-test.cc +++ b/cpp/src/arrow/python/python-test.cc @@ -63,8 +63,8 @@ TEST(DecimalTest, TestPythonDecimalToArrowDecimal128) { ASSERT_NE(pydecimal.obj(), nullptr); ASSERT_EQ(PyErr_Occurred(), nullptr); - Decimal128 arrow_decimal; - int128_t boost_decimal(decimal_string); + decimal::Decimal128 arrow_decimal; + boost::multiprecision::int128_t boost_decimal(decimal_string); PyObject* obj = pydecimal.obj(); ASSERT_OK(PythonDecimalToArrowDecimal(obj, &arrow_decimal)); ASSERT_EQ(boost_decimal, arrow_decimal.value); diff --git a/cpp/src/arrow/type_fwd.h b/cpp/src/arrow/type_fwd.h index acf12c3d9d18e..2bb05f853a094 100644 --- a/cpp/src/arrow/type_fwd.h +++ b/cpp/src/arrow/type_fwd.h @@ -147,7 +147,7 @@ std::shared_ptr ARROW_EXPORT binary(); std::shared_ptr ARROW_EXPORT date32(); std::shared_ptr ARROW_EXPORT date64(); -std::shared_ptr ARROW_EXPORT decimal(int precision, int scale); +std::shared_ptr ARROW_EXPORT decimal_type(int precision, int scale); } // namespace arrow diff --git 
a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt index 054f11055b60e..9aa8bae273fb8 100644 --- a/cpp/src/arrow/util/CMakeLists.txt +++ b/cpp/src/arrow/util/CMakeLists.txt @@ -22,7 +22,6 @@ # Headers: top level install(FILES bit-util.h - decimal.h logging.h macros.h random.h diff --git a/cpp/src/arrow/util/decimal-test.cc b/cpp/src/arrow/util/decimal-test.cc index 1e22643962d5b..dcaa9afd8724a 100644 --- a/cpp/src/arrow/util/decimal-test.cc +++ b/cpp/src/arrow/util/decimal-test.cc @@ -23,6 +23,7 @@ #include "arrow/test-util.h" namespace arrow { +namespace decimal { template class DecimalTest : public ::testing::Test { @@ -158,4 +159,5 @@ TEST(DecimalTest, TestDecimal128StringAndBytesRoundTrip) { ASSERT_EQ(expected.value, result.value); } +} // namespace decimal } // namespace arrow diff --git a/cpp/src/arrow/util/decimal.cc b/cpp/src/arrow/util/decimal.cc index 1ac347180fec5..3b8a3ff0398b5 100644 --- a/cpp/src/arrow/util/decimal.cc +++ b/cpp/src/arrow/util/decimal.cc @@ -20,6 +20,7 @@ #include namespace arrow { +namespace decimal { static const boost::regex DECIMAL_PATTERN("(\\+?|-?)((0*)(\\d*))(\\.(\\d+))?"); @@ -138,4 +139,5 @@ void ToBytes(const Decimal128& decimal, uint8_t** bytes, bool* is_negative) { *is_negative = backend.isneg(); } +} // namespace decimal } // namespace arrow diff --git a/cpp/src/arrow/util/decimal.h b/cpp/src/arrow/util/decimal.h index 46883e3de93c3..c73bae1b4c995 100644 --- a/cpp/src/arrow/util/decimal.h +++ b/cpp/src/arrow/util/decimal.h @@ -31,6 +31,7 @@ #include namespace arrow { +namespace decimal { using boost::multiprecision::int128_t; @@ -140,5 +141,6 @@ ARROW_EXPORT void ToBytes(const Decimal32& value, uint8_t** bytes); ARROW_EXPORT void ToBytes(const Decimal64& value, uint8_t** bytes); ARROW_EXPORT void ToBytes(const Decimal128& decimal, uint8_t** bytes, bool* is_negative); +} // namespace decimal } // namespace arrow #endif // ARROW_DECIMAL_H diff --git a/python/pyarrow/includes/common.pxd b/python/pyarrow/includes/common.pxd index 4860334a9213c..ab38ff3084f01 100644 --- a/python/pyarrow/includes/common.pxd +++ b/python/pyarrow/includes/common.pxd @@ -51,11 +51,6 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: c_bool IsTypeError() -cdef extern from "arrow/util/decimal.h" namespace "arrow" nogil: - cdef cppclass int128_t: - pass - - cdef inline object PyObject_to_object(PyObject* o): # Cast to "object" increments reference count cdef object result = o diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 73d96b25f521b..e719e185b7b13 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -60,11 +60,6 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: TimeUnit_MICRO" arrow::TimeUnit::MICRO" TimeUnit_NANO" arrow::TimeUnit::NANO" - cdef cppclass Decimal[T]: - Decimal(const T&) - - cdef c_string ToString[T](const Decimal[T]&, int, int) - cdef cppclass CDataType" arrow::DataType": Type type @@ -226,7 +221,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: const uint8_t* GetValue(int i) cdef cppclass CDecimalArray" arrow::DecimalArray"(CFixedSizeBinaryArray): - Decimal[T] Value[T](int i) + c_string FormatValue(int i) cdef cppclass CListArray" arrow::ListArray"(CArray): const int32_t* raw_value_offsets() diff --git a/python/pyarrow/scalar.pyx b/python/pyarrow/scalar.pyx index 196deedefa959..7591ae880da3d 100644 --- a/python/pyarrow/scalar.pyx +++ b/python/pyarrow/scalar.pyx @@ -17,7 +17,6 @@ from pyarrow.schema cimport DataType, 
box_data_type -from pyarrow.includes.common cimport int128_t from pyarrow.compat import frombytes import pyarrow.schema as schema import decimal @@ -213,13 +212,7 @@ cdef class DecimalValue(ArrayValue): int bit_width = t.bit_width() int precision = t.precision int scale = t.scale - c_string s - if bit_width == 32: - s = ToString[int32_t](ap.Value[int32_t](self.index), precision, scale) - elif bit_width == 64: - s = ToString[int64_t](ap.Value[int64_t](self.index), precision, scale) - elif bit_width == 128: - s = ToString[int128_t](ap.Value[int128_t](self.index), precision, scale) + c_string s = ap.FormatValue(self.index) return decimal.Decimal(s.decode('utf8')) From acbda1893c55c68b3afd6c4cde1ee11e6926bb75 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 10 Apr 2017 08:11:25 -0400 Subject: [PATCH 0489/1644] ARROW-794: [C++/Python] Disallow strided tensors in ipc::WriteTensor Author: Wes McKinney Closes #519 from wesm/ARROW-794 and squashes the following commits: ab82ba7 [Wes McKinney] Typo 6f0a350 [Wes McKinney] Typo e945298 [Wes McKinney] Raise ValueError if writing strided ndarray with WriteTensor and pyarrow.write_tensor --- cpp/src/arrow/ipc/ipc-read-write-test.cc | 18 ++++++++++++++++++ cpp/src/arrow/ipc/writer.cc | 4 ++++ python/pyarrow/tests/test_tensor.py | 13 +++++++++++++ 3 files changed, 35 insertions(+) diff --git a/cpp/src/arrow/ipc/ipc-read-write-test.cc b/cpp/src/arrow/ipc/ipc-read-write-test.cc index 6807296b59a5e..1a91ec39ca1fc 100644 --- a/cpp/src/arrow/ipc/ipc-read-write-test.cc +++ b/cpp/src/arrow/ipc/ipc-read-write-test.cc @@ -642,5 +642,23 @@ TEST_F(TestTensorRoundTrip, BasicRoundtrip) { CheckTensorRoundTrip(tzero); } +TEST_F(TestTensorRoundTrip, NonContiguous) { + std::string path = "test-write-tensor-strided"; + constexpr int64_t kBufferSize = 1 << 20; + ASSERT_OK(io::MemoryMapFixture::InitMemoryMap(kBufferSize, path, &mmap_)); + + std::vector values; + test::randint(24, 0, 100, &values); + + auto data = test::GetBufferFromVector(values); + Int64Tensor tensor(data, {4, 3}, {48, 16}); + + int32_t metadata_length; + int64_t body_length; + ASSERT_OK(mmap_->Seek(0)); + ASSERT_RAISES( + Invalid, WriteTensor(tensor, mmap_.get(), &metadata_length, &body_length)); +} + } // namespace ipc } // namespace arrow diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc index 9305567e74f6b..d38a65c983d98 100644 --- a/cpp/src/arrow/ipc/writer.cc +++ b/cpp/src/arrow/ipc/writer.cc @@ -496,6 +496,10 @@ Status WriteLargeRecordBatch(const RecordBatch& batch, int64_t buffer_start_offs Status WriteTensor(const Tensor& tensor, io::OutputStream* dst, int32_t* metadata_length, int64_t* body_length) { + if (!tensor.is_contiguous()) { + return Status::Invalid("No support yet for writing non-contiguous tensors"); + } + RETURN_NOT_OK(AlignStreamPosition(dst)); std::shared_ptr metadata; RETURN_NOT_OK(WriteTensorMessage(tensor, 0, &metadata)); diff --git a/python/pyarrow/tests/test_tensor.py b/python/pyarrow/tests/test_tensor.py index a39064b49dfbc..327b7f08a37f1 100644 --- a/python/pyarrow/tests/test_tensor.py +++ b/python/pyarrow/tests/test_tensor.py @@ -98,3 +98,16 @@ def test_tensor_ipc_roundtrip(): assert result.equals(tensor) finally: _try_delete(path) + + +def test_tensor_ipc_strided(): + data = np.random.randn(10, 4) + tensor = pa.Tensor.from_numpy(data[::2]) + + path = 'pyarrow-tensor-ipc-strided' + try: + with pytest.raises(ValueError): + mmap = pa.create_memory_map(path, 1024) + pa.write_tensor(tensor, mmap) + finally: + _try_delete(path) From 
ddda3039e6fb6a9d4f2c5b1189369204bfe1ea93 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 10 Apr 2017 08:30:27 -0400 Subject: [PATCH 0490/1644] ARROW-526: [Format] Revise Format documents for evolution in IPC stream / file / tensor formats Author: Wes McKinney Closes #515 from wesm/ARROW-526 and squashes the following commits: 6a38432 [Wes McKinney] Typo 5d564a6 [Wes McKinney] Revise Format documents for evolution in IPC stream / file / tensor formats --- format/IPC.md | 131 ++++++++++++++++++++++++++++++++------------- format/Metadata.md | 57 +++++++++++++++++--- 2 files changed, 146 insertions(+), 42 deletions(-) diff --git a/format/IPC.md b/format/IPC.md index d386e6048cf12..f0a67e292186c 100644 --- a/format/IPC.md +++ b/format/IPC.md @@ -14,65 +14,106 @@ # Interprocess messaging / communication (IPC) -## File format +## Encapsulated message format + +Data components in the stream and file formats are represented as encapsulated +*messages* consisting of: -We define a self-contained "file format" containing an Arrow schema along with -one or more record batches defining a dataset. See [format/File.fbs][1] for the -precise details of the file metadata. +* A length prefix indicating the metadata size +* The message metadata as a [Flatbuffer][3] +* Padding bytes to an 8-byte boundary +* The message body -In general, the file looks like: +Schematically, we have: ``` - - + + + + +``` + +The `metadata_size` includes the size of the flatbuffer plus padding. The +`Message` flatbuffer includes a version number, the particular message (as a +flatbuffer union), and the size of the message body: + +``` +table Message { + version: org.apache.arrow.flatbuf.MetadataVersion; + header: MessageHeader; + bodyLength: long; +} +``` + +Currently, we support 4 types of messages: + +* Schema +* RecordBatch +* DictionaryBatch +* Tensor + +## Streaming format + +We provide a streaming format for record batches. It is presented as a sequence +of encapsulated messages, each of which follows the format above. The schema +comes first in the stream, and it is the same for all of the record batches +that follow. If any fields in the schema are dictionary-encoded, one or more +`DictionaryBatch` messages will follow the schema. + +``` + ... ... - - - + ``` -See the File.fbs document for details about the Flatbuffers metadata. The -record batches have a particular structure, defined next. +When a stream reader implementation is reading a stream, after each message, it +may read the next 4 bytes to know how large the message metadata that follows +is. Once the message flatbuffer is read, you can then read the message body. + +The stream writer can signal end-of-stream (EOS) either by writing a 0 length +as an `int32` or simply closing the stream interface. + +## File format -### Record batches +We define a "file format" supporting random access in a very similar format to +the streaming format. The file starts and ends with a magic string `ARROW1` +(plus padding). What follows in the file is identical to the stream format. At +the end of the file, we write a *footer* including offsets and sizes for each +of the data blocks in the file, so that random access is possible. See +[format/File.fbs][1] for the precise details of the file footer. 
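As a concrete illustration of the framing described above, the sketch below writes one encapsulated message followed by the end-of-stream marker. It is plain Python with no Arrow library code; it assumes the 4-byte size prefix counts toward the 8-byte alignment boundary and that the prefix is little-endian. In practice `metadata` would be a serialized `Message` flatbuffer.

```python
import struct

def write_encapsulated_message(stream, metadata: bytes, body: bytes) -> None:
    # <metadata_size: int32> <metadata flatbuffer> <padding> <message body>
    # metadata_size counts the flatbuffer plus its padding, per the text above.
    padding = (8 - (4 + len(metadata)) % 8) % 8
    stream.write(struct.pack('<i', len(metadata) + padding))
    stream.write(metadata)
    stream.write(b'\x00' * padding)
    stream.write(body)

def write_eos(stream) -> None:
    # A 0 length written as an int32 signals end-of-stream (EOS) to readers.
    stream.write(struct.pack('<i', 0))
```

A reader reverses this: read 4 bytes and unpack the `int32`; a value of 0 (or a closed stream) terminates iteration, otherwise read that many metadata bytes, recover `bodyLength` from the `Message` flatbuffer, and read the body.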
-The record batch metadata is written as a flatbuffer (see
-[format/Message.fbs][2] -- the RecordBatch message type) prefixed by its size,
-followed by each of the memory buffers in the batch written end to end (with
-appropriate alignment and padding):
+Schematically we have:
 
 ```
-
-
-
-
+<magic number "ARROW1">
+<empty padding bytes [to 8 byte boundary]>
+<STREAMING FORMAT>
+<FOOTER>
+<FOOTER SIZE: int32>
+<magic number "ARROW1">
+```
+
+### RecordBatch body structure
+
 The `RecordBatch` metadata contains a depth-first (pre-order) flattened set of
 field metadata and physical memory buffers (some comments from [Message.fbs][2]
 have been shortened / removed):
 
 ```
 table RecordBatch {
-  length: int;
+  length: long;
   nodes: [FieldNode];
   buffers: [Buffer];
 }
 
 struct FieldNode {
-  /// The number of value slots in the Arrow array at this level of a nested
-  /// tree
-  length: int;
-
-  /// The number of observed nulls. Fields with null_count == 0 may choose not
-  /// to write their physical validity bitmap out as a materialized buffer,
-  /// instead setting the length of the bitmap buffer to 0.
-  null_count: int;
+  length: long;
+  null_count: long;
 }
 
 struct Buffer {
@@ -91,9 +132,9 @@ struct Buffer {
 ```
 
 In the context of a file, the `page` is not used, and the `Buffer` offsets use
-as a frame of reference the start of the segment where they are written in the
-file. So, while in a general IPC setting these offsets may be anyplace in one
-or more shared memory regions, in the file format the offsets start from 0.
+as a frame of reference the start of the message body. So, while in a general
+IPC setting these offsets may be anyplace in one or more shared memory regions,
+in the file format the offsets start from 0.
 
 The location of a record batch and the size of the metadata block as well as
 the body of buffers is stored in the file footer:
@@ -112,12 +153,30 @@
 Some notes about this
 
 * The metadata length includes the flatbuffer size, the record batch metadata
   flatbuffer, and any padding bytes
 
-
-### Dictionary batches
+### Dictionary Batches
 
 Dictionary batches have not yet been implemented, while they are provided for
 in the metadata. For the time being, the `DICTIONARY` segments shown above in
 the file do not appear in any of the file implementations.
 
+### Tensor (Multi-dimensional Array) Message Format
+
+The `Tensor` message type provides a way to write a multidimensional array of
+fixed-size values (such as a NumPy ndarray) using Arrow's shared memory
+tools. Arrow implementations in general are not required to implement this data
+format, though we provide a reference implementation in C++.
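Two practical constraints from this patch series are worth making concrete before the writing details that follow: `ipc::WriteTensor` (ARROW-794 above) rejects non-contiguous, strided tensors, and a standalone tensor message's starting offset is aligned to a multiple of 8. The NumPy-only sketch below illustrates both checks; it is not pyarrow code.

```python
import numpy as np

def tensor_write_offset(arr: np.ndarray, offset: int) -> int:
    # Mirror the ipc::WriteTensor precondition added in ARROW-794: strided
    # (non-contiguous) data is rejected rather than silently copied.
    if not arr.flags['C_CONTIGUOUS']:
        raise ValueError("No support yet for writing non-contiguous tensors")
    # Round the starting offset up to the next multiple of 8.
    return offset + (-offset) % 8

assert tensor_write_offset(np.arange(12).reshape(3, 4), offset=13) == 16
assert tensor_write_offset(np.zeros((4, 3)), offset=16) == 16
```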
+
+When writing a standalone encapsulated tensor message, we use the format as
+indicated above, but additionally align the starting offset (if writing to a
+shared memory region) to be a multiple of 8:
+
+```
+<PADDING>
+<metadata size: int32>
+<metadata>
+<tensor body>
+```
+
 [1]: https://github.com/apache/arrow/blob/master/format/File.fbs
-[1]: https://github.com/apache/arrow/blob/master/format/Message.fbs
\ No newline at end of file
+[2]: https://github.com/apache/arrow/blob/master/format/Message.fbs
+[3]: https://github.com/google/flatbuffers
diff --git a/format/Metadata.md b/format/Metadata.md
index a4878f347073f..18fac527470d5 100644
--- a/format/Metadata.md
+++ b/format/Metadata.md
@@ -86,8 +86,8 @@ VectorLayout:
 Type:
 ```
 {
-  "name" : "null|struct|list|union|int|floatingpoint|utf8|binary|bool|decimal|date|time|timestamp|interval"
-  // fields as defined in the flatbuff depending on the type name
+  "name" : "null|struct|list|union|int|floatingpoint|utf8|binary|fixedsizebinary|bool|decimal|date|time|timestamp|interval"
+  // fields as defined in the Flatbuffer depending on the type name
 }
 ```
 Union:
@@ -126,14 +126,37 @@ Decimal:
   "scale" : /* integer */
 }
 ```
+
 Timestamp:
+
 ```
 {
   "name" : "timestamp",
   "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND"
 }
 ```
+
+Date:
+
+```
+{
+  "name" : "date",
+  "unit" : "DAY|MILLISECOND"
+}
+```
+
+Time:
+
+```
+{
+  "name" : "time",
+  "unit" : "SECOND|MILLISECOND|MICROSECOND|NANOSECOND",
+  "bitWidth": /* integer: 32 or 64 */
+}
+```
+
 Interval:
+
 ```
 {
   "name" : "interval",
@@ -161,12 +184,16 @@ Flatbuffers IDL for a record batch data header
 
 ```
 table RecordBatch {
-  length: int;
+  length: long;
   nodes: [FieldNode];
   buffers: [Buffer];
 }
 ```
 
+The `RecordBatch` metadata provides for record batches with length exceeding
+2^31 - 1, but Arrow implementations are not required to implement support
+beyond this size.
+
 The `nodes` and `buffers` fields are produced by a depth-first traversal /
 flattening of a schema (possibly containing nested types) for a given in-memory
 data set.
@@ -205,13 +232,17 @@ hierarchy.
 struct FieldNode {
   /// The number of value slots in the Arrow array at this level of a nested
   /// tree
-  length: int;
+  length: long;
 
   /// The number of observed nulls.
-  null_count: int;
+  null_count: long;
 }
 ```
 
+The `FieldNode` metadata provides for fields with length exceeding 2^31 - 1,
+but Arrow implementations are not required to implement support for large
+arrays.
+
 ## Flattening of nested data
 
 Nested types are flattened in the record batch in depth-first order. When
@@ -359,7 +390,21 @@ TBD
 
 ### Timestamp
 
-TBD
+All timestamps are stored as a 64-bit integer, with one of four unit
+resolutions: second, millisecond, microsecond, and nanosecond.
+
+### Date
+
+We support two different date types:
+
+* Days since the UNIX epoch as a 32-bit integer
+* Milliseconds since the UNIX epoch as a 64-bit integer
+
+### Time
+
+Time supports the same unit resolutions: second, millisecond, microsecond, and
+nanosecond. We represent time as the smallest integer accommodating the
+indicated unit. For second and millisecond: 32-bit, for the others 64-bit.
 
 ## Dictionary encoding
 
From d1a9aff2937efe54fe3a5c80f7fbe19851cb71f3 Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Mon, 10 Apr 2017 09:23:12 -0400
Subject: [PATCH 0491/1644] ARROW-795: [C++] Consolidate arrow/arrow_io/arrow_ipc
 into a single shared and static library

This leaves c_glib in a broken state, since I'd rather let @kou fix up things
the way he prefers (I started refactoring the automake files and realized I
would probably just make a mess).
I'm going to submit a patch to make parquet-cpp work on top of this. Author: Wes McKinney Author: Kouhei Sutou Closes #516 from wesm/ARROW-795 and squashes the following commits: 7221e66 [Kouhei Sutou] Update arrow-glib after codebase consolidation 207749c [Wes McKinney] Consolidate arrow/arrow_io/arrow_ipc into a single shared and static library --- c_glib/arrow-glib/Makefile.am | 295 +++++----------------- c_glib/arrow-glib/arrow-glib.h | 17 ++ c_glib/arrow-glib/arrow-glib.hpp | 16 ++ c_glib/arrow-glib/arrow-io-glib.h | 32 --- c_glib/arrow-glib/arrow-io-glib.hpp | 30 --- c_glib/arrow-glib/arrow-io-glib.pc.in | 28 -- c_glib/arrow-glib/arrow-ipc-glib.h | 27 -- c_glib/arrow-glib/arrow-ipc-glib.hpp | 30 --- c_glib/arrow-glib/arrow-ipc-glib.pc.in | 28 -- c_glib/arrow-glib/io-enums.c.template | 56 ---- c_glib/arrow-glib/io-enums.h.template | 41 --- c_glib/arrow-glib/ipc-enums.c.template | 56 ---- c_glib/arrow-glib/ipc-enums.h.template | 41 --- c_glib/configure.ac | 12 - c_glib/doc/reference/Makefile.am | 8 +- c_glib/test/run-test.rb | 2 - c_glib/test/test-io-file-output-stream.rb | 4 +- c_glib/test/test-io-memory-mapped-file.rb | 20 +- c_glib/test/test-ipc-file-writer.rb | 8 +- c_glib/test/test-ipc-stream-writer.rb | 8 +- cpp/CMakeLists.txt | 130 +++++++--- cpp/src/arrow/io/CMakeLists.txt | 89 ------- cpp/src/arrow/io/arrow-io.pc.in | 30 --- cpp/src/arrow/io/symbols.map | 30 --- cpp/src/arrow/ipc/CMakeLists.txt | 83 +----- cpp/src/arrow/ipc/arrow-ipc.pc.in | 30 --- cpp/src/arrow/ipc/symbols.map | 30 --- cpp/src/arrow/python/CMakeLists.txt | 4 - python/CMakeLists.txt | 8 - python/cmake_modules/FindArrow.cmake | 24 -- python/setup.py | 2 - 31 files changed, 214 insertions(+), 1005 deletions(-) delete mode 100644 c_glib/arrow-glib/arrow-io-glib.h delete mode 100644 c_glib/arrow-glib/arrow-io-glib.hpp delete mode 100644 c_glib/arrow-glib/arrow-io-glib.pc.in delete mode 100644 c_glib/arrow-glib/arrow-ipc-glib.h delete mode 100644 c_glib/arrow-glib/arrow-ipc-glib.hpp delete mode 100644 c_glib/arrow-glib/arrow-ipc-glib.pc.in delete mode 100644 c_glib/arrow-glib/io-enums.c.template delete mode 100644 c_glib/arrow-glib/io-enums.h.template delete mode 100644 c_glib/arrow-glib/ipc-enums.c.template delete mode 100644 c_glib/arrow-glib/ipc-enums.h.template delete mode 100644 cpp/src/arrow/io/arrow-io.pc.in delete mode 100644 cpp/src/arrow/io/symbols.map delete mode 100644 cpp/src/arrow/ipc/arrow-ipc.pc.in delete mode 100644 cpp/src/arrow/ipc/symbols.map diff --git a/c_glib/arrow-glib/Makefile.am b/c_glib/arrow-glib/Makefile.am index a72d1e874402a..e719cccfa85ab 100644 --- a/c_glib/arrow-glib/Makefile.am +++ b/c_glib/arrow-glib/Makefile.am @@ -101,6 +101,25 @@ libarrow_glib_la_headers = \ uint64-array-builder.h \ uint64-data-type.h +libarrow_glib_la_headers += \ + io-file.h \ + io-file-mode.h \ + io-file-output-stream.h \ + io-input-stream.h \ + io-memory-mapped-file.h \ + io-output-stream.h \ + io-random-access-file.h \ + io-readable.h \ + io-writeable.h \ + io-writeable-file.h + +libarrow_glib_la_headers += \ + ipc-file-reader.h \ + ipc-file-writer.h \ + ipc-stream-reader.h \ + ipc-stream-writer.h \ + ipc-metadata-version.h + libarrow_glib_la_generated_headers = \ enums.h @@ -170,6 +189,25 @@ libarrow_glib_la_sources = \ $(libarrow_glib_la_headers) \ $(libarrow_glib_la_generated_sources) +libarrow_glib_la_sources += \ + io-file.cpp \ + io-file-mode.cpp \ + io-file-output-stream.cpp \ + io-input-stream.cpp \ + io-memory-mapped-file.cpp \ + io-output-stream.cpp \ + io-random-access-file.cpp \ + 
io-readable.cpp \ + io-writeable.cpp \ + io-writeable-file.cpp + +libarrow_glib_la_sources += \ + ipc-file-reader.cpp \ + ipc-file-writer.cpp \ + ipc-metadata-version.cpp \ + ipc-stream-reader.cpp \ + ipc-stream-writer.cpp + libarrow_glib_la_cpp_headers = \ array.hpp \ array-builder.hpp \ @@ -184,6 +222,25 @@ libarrow_glib_la_cpp_headers = \ table.hpp \ type.hpp +libarrow_glib_la_cpp_headers += \ + io-file.hpp \ + io-file-mode.hpp \ + io-file-output-stream.hpp \ + io-input-stream.hpp \ + io-memory-mapped-file.hpp \ + io-output-stream.hpp \ + io-random-access-file.hpp \ + io-readable.hpp \ + io-writeable.hpp \ + io-writeable-file.hpp + +libarrow_glib_la_cpp_headers += \ + ipc-file-reader.hpp \ + ipc-file-writer.hpp \ + ipc-metadata-version.hpp \ + ipc-stream-reader.hpp \ + ipc-stream-writer.hpp + libarrow_glib_la_SOURCES = \ $(libarrow_glib_la_sources) \ $(libarrow_glib_la_cpp_headers) @@ -221,205 +278,15 @@ stamp-enums.c: $(libarrow_glib_la_headers) enums.c.template $(libarrow_glib_la_headers)) > enums.c touch $@ -# libarrow-io-glib -lib_LTLIBRARIES += \ - libarrow-io-glib.la - -libarrow_io_glib_la_CXXFLAGS = \ - $(GLIB_CFLAGS) \ - $(ARROW_IO_CFLAGS) \ - $(GARROW_CXXFLAGS) - -libarrow_io_glib_la_LIBADD = \ - $(GLIB_LIBS) \ - $(ARROW_IO_LIBS) \ - libarrow-glib.la - -libarrow_io_glib_la_headers = \ - arrow-io-glib.h \ - io-file.h \ - io-file-mode.h \ - io-file-output-stream.h \ - io-input-stream.h \ - io-memory-mapped-file.h \ - io-output-stream.h \ - io-random-access-file.h \ - io-readable.h \ - io-writeable.h \ - io-writeable-file.h - -libarrow_io_glib_la_generated_headers = \ - io-enums.h - -libarrow_io_glib_la_generated_sources = \ - io-enums.c \ - $(libarrow_io_glib_la_generated_headers) - -libarrow_io_glib_la_sources = \ - io-file.cpp \ - io-file-mode.cpp \ - io-file-output-stream.cpp \ - io-input-stream.cpp \ - io-memory-mapped-file.cpp \ - io-output-stream.cpp \ - io-random-access-file.cpp \ - io-readable.cpp \ - io-writeable.cpp \ - io-writeable-file.cpp \ - $(libarrow_io_glib_la_headers) \ - $(libarrow_io_glib_la_generated_sources) - -libarrow_io_glib_la_cpp_headers = \ - arrow-io-glib.hpp \ - io-file.hpp \ - io-file-mode.hpp \ - io-file-output-stream.hpp \ - io-input-stream.hpp \ - io-memory-mapped-file.hpp \ - io-output-stream.hpp \ - io-random-access-file.hpp \ - io-readable.hpp \ - io-writeable.hpp \ - io-writeable-file.hpp - -libarrow_io_glib_la_SOURCES = \ - $(libarrow_io_glib_la_sources) \ - $(libarrow_io_glib_la_cpp_headers) - -BUILT_SOURCES += \ - $(libarrow_io_glib_la_genearted_sources) \ - stamp-io-enums.c \ - stamp-io-enums.h - -EXTRA_DIST += \ - io-enums.c.template \ - io-enums.h.template - -io-enums.h: stamp-io-enums.h - @true -stamp-io-enums.h: $(libarrow_io_glib_la_headers) io-enums.h.template - $(AM_V_GEN) \ - (cd $(srcdir) && \ - $(GLIB_MKENUMS) \ - --identifier-prefix GArrowIO \ - --symbol-prefix garrow_io \ - --template io-enums.h.template \ - $(libarrow_io_glib_la_headers)) > io-enums.h - touch $@ - -io-enums.c: stamp-io-enums.c - @true -stamp-io-enums.c: $(libarrow_io_glib_la_headers) io-enums.c.template - $(AM_V_GEN) \ - (cd $(srcdir) && \ - $(GLIB_MKENUMS) \ - --identifier-prefix GArrowIO \ - --symbol-prefix garrow_io \ - --template io-enums.c.template \ - $(libarrow_io_glib_la_headers)) > io-enums.c - touch $@ - -# libarrow-ipc-glib -lib_LTLIBRARIES += \ - libarrow-ipc-glib.la - -libarrow_ipc_glib_la_CXXFLAGS = \ - $(GLIB_CFLAGS) \ - $(ARROW_IPC_CFLAGS) \ - $(GARROW_CXXFLAGS) - -libarrow_ipc_glib_la_LIBADD = \ - $(GLIB_LIBS) \ - $(ARROW_IPC_LIBS) \ - 
libarrow-glib.la \ - libarrow-io-glib.la - -libarrow_ipc_glib_la_headers = \ - arrow-ipc-glib.h \ - ipc-file-reader.h \ - ipc-file-writer.h \ - ipc-stream-reader.h \ - ipc-stream-writer.h \ - ipc-metadata-version.h - -libarrow_ipc_glib_la_generated_headers = \ - ipc-enums.h - -libarrow_ipc_glib_la_generated_sources = \ - ipc-enums.c \ - $(libarrow_ipc_glib_la_generated_headers) - -libarrow_ipc_glib_la_sources = \ - ipc-file-reader.cpp \ - ipc-file-writer.cpp \ - ipc-metadata-version.cpp \ - ipc-stream-reader.cpp \ - ipc-stream-writer.cpp \ - $(libarrow_ipc_glib_la_headers) \ - $(libarrow_ipc_glib_la_generated_sources) - -libarrow_ipc_glib_la_cpp_headers = \ - arrow-ipc-glib.hpp \ - ipc-file-reader.hpp \ - ipc-file-writer.hpp \ - ipc-metadata-version.hpp \ - ipc-stream-reader.hpp \ - ipc-stream-writer.hpp - -libarrow_ipc_glib_la_SOURCES = \ - $(libarrow_ipc_glib_la_sources) \ - $(libarrow_ipc_glib_la_cpp_headers) - -BUILT_SOURCES += \ - $(libarrow_ipc_glib_la_genearted_sources) \ - stamp-ipc-enums.c \ - stamp-ipc-enums.h - -EXTRA_DIST += \ - ipc-enums.c.template \ - ipc-enums.h.template - -ipc-enums.h: stamp-ipc-enums.h - @true -stamp-ipc-enums.h: $(libarrow_ipc_glib_la_headers) ipc-enums.h.template - $(AM_V_GEN) \ - (cd $(srcdir) && \ - $(GLIB_MKENUMS) \ - --identifier-prefix GArrowIPC \ - --symbol-prefix garrow_ipc \ - --template ipc-enums.h.template \ - $(libarrow_ipc_glib_la_headers)) > ipc-enums.h - touch $@ - -ipc-enums.c: stamp-ipc-enums.c - @true -stamp-ipc-enums.c: $(libarrow_ipc_glib_la_headers) ipc-enums.c.template - $(AM_V_GEN) \ - (cd $(srcdir) && \ - $(GLIB_MKENUMS) \ - --identifier-prefix GArrowIPC \ - --symbol-prefix garrow_ipc \ - --template ipc-enums.c.template \ - $(libarrow_ipc_glib_la_headers)) > ipc-enums.c - touch $@ - arrow_glib_includedir = $(includedir)/arrow-glib -arrow_glib_include_HEADERS = \ - $(libarrow_glib_la_headers) \ - $(libarrow_glib_la_cpp_headers) \ - $(libarrow_glib_la_generated_headers) \ - $(libarrow_io_glib_la_headers) \ - $(libarrow_io_glib_la_cpp_headers) \ - $(libarrow_io_glib_la_generated_headers) \ - $(libarrow_ipc_glib_la_headers) \ - $(libarrow_ipc_glib_la_cpp_headers) \ - $(libarrow_ipc_glib_la_generated_headers) +arrow_glib_include_HEADERS = \ + $(libarrow_glib_la_headers) \ + $(libarrow_glib_la_cpp_headers) \ + $(libarrow_glib_la_generated_headers) pkgconfigdir = $(libdir)/pkgconfig pkgconfig_DATA = \ - arrow-glib.pc \ - arrow-io-glib.pc \ - arrow-ipc-glib.pc + arrow-glib.pc # GObject Introspection -include $(INTROSPECTION_MAKEFILE) @@ -443,44 +310,6 @@ Arrow_1_0_gir_SCANNERFLAGS = \ --symbol-prefix=garrow INTROSPECTION_GIRS += Arrow-1.0.gir -ArrowIO-1.0.gir: libarrow-io-glib.la -ArrowIO-1.0.gir: Arrow-1.0.gir -ArrowIO_1_0_gir_PACKAGES = \ - gobject-2.0 -ArrowIO_1_0_gir_EXPORT_PACKAGES = arrow-io -ArrowIO_1_0_gir_INCLUDES = \ - GObject-2.0 -ArrowIO_1_0_gir_CFLAGS = \ - $(AM_CPPFLAGS) -ArrowIO_1_0_gir_LIBS = libarrow-io-glib.la -ArrowIO_1_0_gir_FILES = $(libarrow_io_glib_la_sources) -ArrowIO_1_0_gir_SCANNERFLAGS = \ - --include-uninstalled=$(builddir)/Arrow-1.0.gir \ - --warn-all \ - --identifier-prefix=GArrowIO \ - --symbol-prefix=garrow_io -INTROSPECTION_GIRS += ArrowIO-1.0.gir - -ArrowIPC-1.0.gir: libarrow-ipc-glib.la -ArrowIPC-1.0.gir: Arrow-1.0.gir -ArrowIPC-1.0.gir: ArrowIO-1.0.gir -ArrowIPC_1_0_gir_PACKAGES = \ - gobject-2.0 -ArrowIPC_1_0_gir_EXPORT_PACKAGES = arrow-ipc -ArrowIPC_1_0_gir_INCLUDES = \ - GObject-2.0 -ArrowIPC_1_0_gir_CFLAGS = \ - $(AM_CPPFLAGS) -ArrowIPC_1_0_gir_LIBS = libarrow-ipc-glib.la -ArrowIPC_1_0_gir_FILES = 
$(libarrow_ipc_glib_la_sources) -ArrowIPC_1_0_gir_SCANNERFLAGS = \ - --include-uninstalled=$(builddir)/Arrow-1.0.gir \ - --include-uninstalled=$(builddir)/ArrowIO-1.0.gir \ - --warn-all \ - --identifier-prefix=GArrowIPC \ - --symbol-prefix=garrow_ipc -INTROSPECTION_GIRS += ArrowIPC-1.0.gir - girdir = $(datadir)/gir-1.0 gir_DATA = $(INTROSPECTION_GIRS) diff --git a/c_glib/arrow-glib/arrow-glib.h b/c_glib/arrow-glib/arrow-glib.h index 4356234a4a63d..9b03175799f44 100644 --- a/c_glib/arrow-glib/arrow-glib.h +++ b/c_glib/arrow-glib/arrow-glib.h @@ -78,3 +78,20 @@ #include #include #include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include diff --git a/c_glib/arrow-glib/arrow-glib.hpp b/c_glib/arrow-glib/arrow-glib.hpp index 70fda8da7c526..fd59d4a1a9240 100644 --- a/c_glib/arrow-glib/arrow-glib.hpp +++ b/c_glib/arrow-glib/arrow-glib.hpp @@ -35,3 +35,19 @@ #include #include #include + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include diff --git a/c_glib/arrow-glib/arrow-io-glib.h b/c_glib/arrow-glib/arrow-io-glib.h deleted file mode 100644 index 4d49a9859d82a..0000000000000 --- a/c_glib/arrow-glib/arrow-io-glib.h +++ /dev/null @@ -1,32 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include diff --git a/c_glib/arrow-glib/arrow-io-glib.hpp b/c_glib/arrow-glib/arrow-io-glib.hpp deleted file mode 100644 index 3e7636cc7ef99..0000000000000 --- a/c_glib/arrow-glib/arrow-io-glib.hpp +++ /dev/null @@ -1,30 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#pragma once - -#include -#include -#include -#include -#include -#include -#include -#include -#include diff --git a/c_glib/arrow-glib/arrow-io-glib.pc.in b/c_glib/arrow-glib/arrow-io-glib.pc.in deleted file mode 100644 index 4256184cf7348..0000000000000 --- a/c_glib/arrow-glib/arrow-io-glib.pc.in +++ /dev/null @@ -1,28 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - -prefix=@prefix@ -exec_prefix=@exec_prefix@ -libdir=@libdir@ -includedir=@includedir@ - -Name: Apache Arrow I/O GLib -Description: C API for Apache Arrow I/O based on GLib -Version: @VERSION@ -Libs: -L${libdir} -larrow-glib-io -Cflags: -I${includedir} -Requires: arrow-glib arrow-io diff --git a/c_glib/arrow-glib/arrow-ipc-glib.h b/c_glib/arrow-glib/arrow-ipc-glib.h deleted file mode 100644 index 4954d83cd0728..0000000000000 --- a/c_glib/arrow-glib/arrow-ipc-glib.h +++ /dev/null @@ -1,27 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include -#include -#include -#include -#include -#include diff --git a/c_glib/arrow-glib/arrow-ipc-glib.hpp b/c_glib/arrow-glib/arrow-ipc-glib.hpp deleted file mode 100644 index d32bc052b98e5..0000000000000 --- a/c_glib/arrow-glib/arrow-ipc-glib.hpp +++ /dev/null @@ -1,30 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. 
See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -#include - -#include -#include -#include -#include -#include diff --git a/c_glib/arrow-glib/arrow-ipc-glib.pc.in b/c_glib/arrow-glib/arrow-ipc-glib.pc.in deleted file mode 100644 index 0b04c4a808ff1..0000000000000 --- a/c_glib/arrow-glib/arrow-ipc-glib.pc.in +++ /dev/null @@ -1,28 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - -prefix=@prefix@ -exec_prefix=@exec_prefix@ -libdir=@libdir@ -includedir=@includedir@ - -Name: Apache Arrow IPC GLib -Description: C API for Apache Arrow IPC based on GLib -Version: @VERSION@ -Libs: -L${libdir} -larrow-glib-ipc -Cflags: -I${includedir} -Requires: arrow-glib-io arrow-ipc diff --git a/c_glib/arrow-glib/io-enums.c.template b/c_glib/arrow-glib/io-enums.c.template deleted file mode 100644 index 10ee77588d98b..0000000000000 --- a/c_glib/arrow-glib/io-enums.c.template +++ /dev/null @@ -1,56 +0,0 @@ -/*** BEGIN file-header ***/ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -/*** END file-header ***/ - -/*** BEGIN file-production ***/ - -/* enumerations from "@filename@" */ -/*** END file-production ***/ - -/*** BEGIN value-header ***/ -GType -@enum_name@_get_type(void) -{ - static GType etype = 0; - if (G_UNLIKELY(etype == 0)) { - static const G@Type@Value values[] = { -/*** END value-header ***/ - -/*** BEGIN value-production ***/ - {@VALUENAME@, "@VALUENAME@", "@valuenick@"}, -/*** END value-production ***/ - -/*** BEGIN value-tail ***/ - {0, NULL, NULL} - }; - etype = g_@type@_register_static(g_intern_static_string("@EnumName@"), values); - } - return etype; -} -/*** END value-tail ***/ - -/*** BEGIN file-tail ***/ -/*** END file-tail ***/ diff --git a/c_glib/arrow-glib/io-enums.h.template b/c_glib/arrow-glib/io-enums.h.template deleted file mode 100644 index 429141dc76a60..0000000000000 --- a/c_glib/arrow-glib/io-enums.h.template +++ /dev/null @@ -1,41 +0,0 @@ -/*** BEGIN file-header ***/ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS -/*** END file-header ***/ - -/*** BEGIN file-production ***/ - -/* enumerations from "@filename@" */ -/*** END file-production ***/ - -/*** BEGIN value-header ***/ -GType @enum_name@_get_type(void) G_GNUC_CONST; -#define @ENUMPREFIX@_TYPE_@ENUMSHORT@ (@enum_name@_get_type()) -/*** END value-header ***/ - -/*** BEGIN file-tail ***/ - -G_END_DECLS -/*** END file-tail ***/ diff --git a/c_glib/arrow-glib/ipc-enums.c.template b/c_glib/arrow-glib/ipc-enums.c.template deleted file mode 100644 index c938f77477172..0000000000000 --- a/c_glib/arrow-glib/ipc-enums.c.template +++ /dev/null @@ -1,56 +0,0 @@ -/*** BEGIN file-header ***/ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -/*** END file-header ***/ - -/*** BEGIN file-production ***/ - -/* enumerations from "@filename@" */ -/*** END file-production ***/ - -/*** BEGIN value-header ***/ -GType -@enum_name@_get_type(void) -{ - static GType etype = 0; - if (G_UNLIKELY(etype == 0)) { - static const G@Type@Value values[] = { -/*** END value-header ***/ - -/*** BEGIN value-production ***/ - {@VALUENAME@, "@VALUENAME@", "@valuenick@"}, -/*** END value-production ***/ - -/*** BEGIN value-tail ***/ - {0, NULL, NULL} - }; - etype = g_@type@_register_static(g_intern_static_string("@EnumName@"), values); - } - return etype; -} -/*** END value-tail ***/ - -/*** BEGIN file-tail ***/ -/*** END file-tail ***/ diff --git a/c_glib/arrow-glib/ipc-enums.h.template b/c_glib/arrow-glib/ipc-enums.h.template deleted file mode 100644 index e103c5bfeb985..0000000000000 --- a/c_glib/arrow-glib/ipc-enums.h.template +++ /dev/null @@ -1,41 +0,0 @@ -/*** BEGIN file-header ***/ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#pragma once - -#include - -G_BEGIN_DECLS -/*** END file-header ***/ - -/*** BEGIN file-production ***/ - -/* enumerations from "@filename@" */ -/*** END file-production ***/ - -/*** BEGIN value-header ***/ -GType @enum_name@_get_type(void) G_GNUC_CONST; -#define @ENUMPREFIX@_TYPE_@ENUMSHORT@ (@enum_name@_get_type()) -/*** END value-header ***/ - -/*** BEGIN file-tail ***/ - -G_END_DECLS -/*** END file-tail ***/ diff --git a/c_glib/configure.ac b/c_glib/configure.ac index fc24c1b3c4778..d63132e6f293c 100644 --- a/c_glib/configure.ac +++ b/c_glib/configure.ac @@ -74,27 +74,17 @@ AC_ARG_WITH(arrow-cpp-build-dir, [GARROW_ARROW_CPP_BUILD_DIR=""]) if test "x$GARROW_ARROW_CPP_BUILD_DIR" = "x"; then PKG_CHECK_MODULES([ARROW], [arrow]) - PKG_CHECK_MODULES([ARROW_IO], [arrow-io]) - PKG_CHECK_MODULES([ARROW_IPC], [arrow-ipc]) else ARROW_INCLUDE_DIR="\$(abs_top_srcdir)/../cpp/src" ARROW_LIB_DIR="${GARROW_ARROW_CPP_BUILD_DIR}/${GARROW_ARROW_CPP_BUILD_TYPE}" ARROW_CFLAGS="-I${ARROW_INCLUDE_DIR}" - ARROW_IO_CFLAGS="-I${ARROW_INCLUDE_DIR}" - ARROW_IPC_CFLAGS="-I${ARROW_INCLUDE_DIR}" ARROW_LIBS="-L${ARROW_LIB_DIR} -larrow" - ARROW_IO_LIBS="-L${ARROW_LIB_DIR} -larrow_io" - ARROW_IPC_LIBS="-L${ARROW_LIB_DIR} -larrow_ipc" AC_SUBST(ARROW_LIB_DIR) AC_SUBST(ARROW_CFLAGS) - AC_SUBST(ARROW_IO_CFLAGS) - AC_SUBST(ARROW_IPC_CFLAGS) AC_SUBST(ARROW_LIBS) - AC_SUBST(ARROW_IO_LIBS) - AC_SUBST(ARROW_IPC_LIBS) fi @@ -102,8 +92,6 @@ AC_CONFIG_FILES([ Makefile arrow-glib/Makefile arrow-glib/arrow-glib.pc - arrow-glib/arrow-io-glib.pc - arrow-glib/arrow-ipc-glib.pc doc/Makefile doc/reference/Makefile example/Makefile diff --git a/c_glib/doc/reference/Makefile.am b/c_glib/doc/reference/Makefile.am index d1c8e01c299a0..116bc6ce1b9a6 100644 --- a/c_glib/doc/reference/Makefile.am +++ b/c_glib/doc/reference/Makefile.am @@ -33,9 +33,7 @@ HFILE_GLOB = \ $(top_srcdir)/arrow-glib/*.h IGNORE_HFILES = \ - enums.h \ - io-enums.h \ - ipc-enums.h + enums.h CFILE_GLOB = \ $(top_srcdir)/arrow-glib/*.cpp @@ -49,9 +47,7 @@ AM_CFLAGS = \ $(ARROW_CFLAGS) GTKDOC_LIBS = \ - $(top_builddir)/arrow-glib/libarrow-glib.la \ - $(top_builddir)/arrow-glib/libarrow-io-glib.la \ - $(top_builddir)/arrow-glib/libarrow-ipc-glib.la + $(top_builddir)/arrow-glib/libarrow-glib.la include $(srcdir)/gtk-doc.make diff --git a/c_glib/test/run-test.rb b/c_glib/test/run-test.rb index 32ceb4ad61d2e..53805caef374f 100755 --- a/c_glib/test/run-test.rb +++ b/c_glib/test/run-test.rb @@ -32,8 +32,6 @@ require "gi" Arrow = GI.load("Arrow") -ArrowIO = GI.load("ArrowIO") -ArrowIPC = GI.load("ArrowIPC") require "tempfile" require_relative "helper/buildable" diff --git a/c_glib/test/test-io-file-output-stream.rb b/c_glib/test/test-io-file-output-stream.rb index 1f2ae5fa10fd1..e35a18361aab6 100644 --- a/c_glib/test/test-io-file-output-stream.rb +++ b/c_glib/test/test-io-file-output-stream.rb @@ -21,7 +21,7 @@ def test_create tempfile = Tempfile.open("arrow-io-file-output-stream") tempfile.write("Hello") tempfile.close - file = ArrowIO::FileOutputStream.open(tempfile.path, false) + file = Arrow::IOFileOutputStream.open(tempfile.path, false) file.close assert_equal("", File.read(tempfile.path)) end @@ -30,7 +30,7 @@ def test_append tempfile = Tempfile.open("arrow-io-file-output-stream") tempfile.write("Hello") tempfile.close - file = ArrowIO::FileOutputStream.open(tempfile.path, true) + file = Arrow::IOFileOutputStream.open(tempfile.path, true) file.close assert_equal("Hello", File.read(tempfile.path)) end diff --git a/c_glib/test/test-io-memory-mapped-file.rb 
b/c_glib/test/test-io-memory-mapped-file.rb index 609819833614f..197d1886f1e86 100644 --- a/c_glib/test/test-io-memory-mapped-file.rb +++ b/c_glib/test/test-io-memory-mapped-file.rb @@ -20,7 +20,7 @@ def test_open tempfile = Tempfile.open("arrow-io-memory-mapped-file") tempfile.write("Hello") tempfile.close - file = ArrowIO::MemoryMappedFile.open(tempfile.path, :read) + file = Arrow::IOMemoryMappedFile.open(tempfile.path, :read) begin buffer = " " * 5 file.read(buffer) @@ -34,7 +34,7 @@ def test_size tempfile = Tempfile.open("arrow-io-memory-mapped-file") tempfile.write("Hello") tempfile.close - file = ArrowIO::MemoryMappedFile.open(tempfile.path, :read) + file = Arrow::IOMemoryMappedFile.open(tempfile.path, :read) begin assert_equal(5, file.size) ensure @@ -46,7 +46,7 @@ def test_read tempfile = Tempfile.open("arrow-io-memory-mapped-file") tempfile.write("Hello World") tempfile.close - file = ArrowIO::MemoryMappedFile.open(tempfile.path, :read) + file = Arrow::IOMemoryMappedFile.open(tempfile.path, :read) begin buffer = " " * 5 _success, n_read_bytes = file.read(buffer) @@ -60,7 +60,7 @@ def test_read_at tempfile = Tempfile.open("arrow-io-memory-mapped-file") tempfile.write("Hello World") tempfile.close - file = ArrowIO::MemoryMappedFile.open(tempfile.path, :read) + file = Arrow::IOMemoryMappedFile.open(tempfile.path, :read) begin buffer = " " * 5 _success, n_read_bytes = file.read_at(6, buffer) @@ -74,7 +74,7 @@ def test_write tempfile = Tempfile.open("arrow-io-memory-mapped-file") tempfile.write("Hello") tempfile.close - file = ArrowIO::MemoryMappedFile.open(tempfile.path, :readwrite) + file = Arrow::IOMemoryMappedFile.open(tempfile.path, :readwrite) begin file.write("World") ensure @@ -87,7 +87,7 @@ def test_write_at tempfile = Tempfile.open("arrow-io-memory-mapped-file") tempfile.write("Hello") tempfile.close - file = ArrowIO::MemoryMappedFile.open(tempfile.path, :readwrite) + file = Arrow::IOMemoryMappedFile.open(tempfile.path, :readwrite) begin file.write_at(2, "rld") ensure @@ -100,7 +100,7 @@ def test_flush tempfile = Tempfile.open("arrow-io-memory-mapped-file") tempfile.write("Hello") tempfile.close - file = ArrowIO::MemoryMappedFile.open(tempfile.path, :readwrite) + file = Arrow::IOMemoryMappedFile.open(tempfile.path, :readwrite) begin file.write("World") file.flush @@ -114,7 +114,7 @@ def test_tell tempfile = Tempfile.open("arrow-io-memory-mapped-file") tempfile.write("Hello World") tempfile.close - file = ArrowIO::MemoryMappedFile.open(tempfile.path, :read) + file = Arrow::IOMemoryMappedFile.open(tempfile.path, :read) begin buffer = " " * 5 file.read(buffer) @@ -128,9 +128,9 @@ def test_mode tempfile = Tempfile.open("arrow-io-memory-mapped-file") tempfile.write("Hello World") tempfile.close - file = ArrowIO::MemoryMappedFile.open(tempfile.path, :readwrite) + file = Arrow::IOMemoryMappedFile.open(tempfile.path, :readwrite) begin - assert_equal(ArrowIO::FileMode::READWRITE, file.mode) + assert_equal(Arrow::IOFileMode::READWRITE, file.mode) ensure file.close end diff --git a/c_glib/test/test-ipc-file-writer.rb b/c_glib/test/test-ipc-file-writer.rb index 369bff324e6d9..1c33ccc1919e7 100644 --- a/c_glib/test/test-ipc-file-writer.rb +++ b/c_glib/test/test-ipc-file-writer.rb @@ -18,11 +18,11 @@ class TestIPCFileWriter < Test::Unit::TestCase def test_write_record_batch tempfile = Tempfile.open("arrow-ipc-file-writer") - output = ArrowIO::FileOutputStream.open(tempfile.path, false) + output = Arrow::IOFileOutputStream.open(tempfile.path, false) begin field = Arrow::Field.new("enabled", 
Arrow::BooleanDataType.new) schema = Arrow::Schema.new([field]) - file_writer = ArrowIPC::FileWriter.open(output, schema) + file_writer = Arrow::IPCFileWriter.open(output, schema) begin record_batch = Arrow::RecordBatch.new(schema, 0, []) file_writer.write_record_batch(record_batch) @@ -33,9 +33,9 @@ def test_write_record_batch output.close end - input = ArrowIO::MemoryMappedFile.open(tempfile.path, :read) + input = Arrow::IOMemoryMappedFile.open(tempfile.path, :read) begin - file_reader = ArrowIPC::FileReader.open(input) + file_reader = Arrow::IPCFileReader.open(input) assert_equal(["enabled"], file_reader.schema.fields.collect(&:name)) ensure diff --git a/c_glib/test/test-ipc-stream-writer.rb b/c_glib/test/test-ipc-stream-writer.rb index 62ac45dce2c79..78bb4a7c1743c 100644 --- a/c_glib/test/test-ipc-stream-writer.rb +++ b/c_glib/test/test-ipc-stream-writer.rb @@ -20,11 +20,11 @@ class TestIPCStreamWriter < Test::Unit::TestCase def test_write_record_batch tempfile = Tempfile.open("arrow-ipc-stream-writer") - output = ArrowIO::FileOutputStream.open(tempfile.path, false) + output = Arrow::IOFileOutputStream.open(tempfile.path, false) begin field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) schema = Arrow::Schema.new([field]) - stream_writer = ArrowIPC::StreamWriter.open(output, schema) + stream_writer = Arrow::IPCStreamWriter.open(output, schema) begin columns = [ build_boolean_array([true]), @@ -38,9 +38,9 @@ def test_write_record_batch output.close end - input = ArrowIO::MemoryMappedFile.open(tempfile.path, :read) + input = Arrow::IOMemoryMappedFile.open(tempfile.path, :read) begin - stream_reader = ArrowIPC::StreamReader.open(input) + stream_reader = Arrow::IPCStreamReader.open(input) assert_equal(["enabled"], stream_reader.schema.fields.collect(&:name)) assert_equal(true, diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index b29cb7b075a94..0e4a4bbf34b67 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -705,6 +705,49 @@ endif() # set(ARROW_TCMALLOC_AVAILABLE 1) # endif() +## Flatbuffers + +if("$ENV{FLATBUFFERS_HOME}" STREQUAL "") + set(FLATBUFFERS_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/flatbuffers_ep-prefix/src/flatbuffers_ep-install") + ExternalProject_Add(flatbuffers_ep + URL "https://github.com/google/flatbuffers/archive/v${FLATBUFFERS_VERSION}.tar.gz" + CMAKE_ARGS + "-DCMAKE_INSTALL_PREFIX:PATH=${FLATBUFFERS_PREFIX}" + "-DFLATBUFFERS_BUILD_TESTS=OFF") + + set(FLATBUFFERS_INCLUDE_DIR "${FLATBUFFERS_PREFIX}/include") + set(FLATBUFFERS_COMPILER "${FLATBUFFERS_PREFIX}/bin/flatc") + set(FLATBUFFERS_VENDORED 1) +else() + find_package(Flatbuffers REQUIRED) + set(FLATBUFFERS_VENDORED 0) +endif() + +message(STATUS "Flatbuffers include dir: ${FLATBUFFERS_INCLUDE_DIR}") +message(STATUS "Flatbuffers compiler: ${FLATBUFFERS_COMPILER}") +include_directories(SYSTEM ${FLATBUFFERS_INCLUDE_DIR}) + +######################################################################## +# HDFS thirdparty setup + +if (DEFINED ENV{HADOOP_HOME}) + set(HADOOP_HOME $ENV{HADOOP_HOME}) + if (NOT EXISTS "${HADOOP_HOME}/include/hdfs.h") + message(STATUS "Did not find hdfs.h in expected location, using vendored one") + set(HADOOP_HOME "${THIRDPARTY_DIR}/hadoop") + endif() +else() + set(HADOOP_HOME "${THIRDPARTY_DIR}/hadoop") +endif() + +set(HDFS_H_PATH "${HADOOP_HOME}/include/hdfs.h") +if (NOT EXISTS ${HDFS_H_PATH}) + message(FATAL_ERROR "Did not find hdfs.h at ${HDFS_H_PATH}") +endif() +message(STATUS "Found hdfs.h at: " ${HDFS_H_PATH}) + +include_directories(SYSTEM "${HADOOP_HOME}/include") + 
############################################################ # Linker setup ############################################################ @@ -814,10 +857,37 @@ endif() ############################################################ set(ARROW_LINK_LIBS - ${BOOST_REGEX_LIBRARY}) + ${BOOST_REGEX_LIBRARY}) -set(ARROW_PRIVATE_LINK_LIBS -) +set(ARROW_STATIC_LINK_LIBS) + +set(ARROW_SHARED_PRIVATE_LINK_LIBS + ${BOOST_SYSTEM_LIBRARY} + ${BOOST_FILESYSTEM_LIBRARY}) + +set(ARROW_STATIC_PRIVATE_LINK_LIBS + ${BOOST_SYSTEM_LIBRARY} + ${BOOST_FILESYSTEM_LIBRARY}) + +if (NOT MSVC) + set(ARROW_LINK_LIBS + ${ARROW_LINK_LIBS} + ${CMAKE_DL_LIBS}) +endif() + +if(RAPIDJSON_VENDORED) + set(ARROW_DEPENDENCIES ${ARROW_DEPENDENCIES} rapidjson_ep) +endif() + +if(FLATBUFFERS_VENDORED) + set(ARROW_DEPENDENCIES ${ARROW_DEPENDENCIES} flatbuffers_ep) +endif() + +add_subdirectory(src/arrow) +add_subdirectory(src/arrow/io) +add_subdirectory(src/arrow/ipc) + +set(ARROW_DEPENDENCIES ${ARROW_DEPENDENCIES} metadata_fbs) set(ARROW_SRCS src/arrow/array.cc @@ -833,6 +903,19 @@ set(ARROW_SRCS src/arrow/type.cc src/arrow/visitor.cc + src/arrow/io/file.cc + src/arrow/io/hdfs.cc + src/arrow/io/hdfs-internal.cc + src/arrow/io/interfaces.cc + src/arrow/io/memory.cc + + src/arrow/ipc/feather.cc + src/arrow/ipc/json.cc + src/arrow/ipc/json-internal.cc + src/arrow/ipc/metadata.cc + src/arrow/ipc/reader.cc + src/arrow/ipc/writer.cc + src/arrow/util/bit-util.cc src/arrow/util/decimal.cc ) @@ -844,52 +927,25 @@ if(NOT APPLE AND NOT MSVC) set(ARROW_SHARED_LINK_FLAGS "-Wl,--version-script=${CMAKE_CURRENT_SOURCE_DIR}/src/arrow/symbols.map") endif() +set(ARROW_ALL_SRCS + ${ARROW_SRCS}) ADD_ARROW_LIB(arrow - SOURCES ${ARROW_SRCS} + SOURCES ${ARROW_ALL_SRCS} + DEPENDENCIES ${ARROW_DEPENDENCIES} SHARED_LINK_FLAGS ${ARROW_SHARED_LINK_FLAGS} SHARED_LINK_LIBS ${ARROW_LINK_LIBS} + SHARED_PRIVATE_LINK_LIBS ${ARROW_SHARED_PRIVATE_LINK_LIBS} + STATIC_LINK_LIBS ${ARROW_STATIC_LINK_LIBS} + STATIC_PRIVATE_LINK_LIBS ${ARROW_STATIC_PRIVATE_LINK_LIBS} ) -add_subdirectory(src/arrow) -add_subdirectory(src/arrow/io) add_subdirectory(src/arrow/util) if(ARROW_JEMALLOC) add_subdirectory(src/arrow/jemalloc) endif() -#---------------------------------------------------------------------- -# IPC library - -if(ARROW_PYTHON) - set(ARROW_IPC on) -endif() - -## Flatbuffers -if(ARROW_IPC) - if("$ENV{FLATBUFFERS_HOME}" STREQUAL "") - set(FLATBUFFERS_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/flatbuffers_ep-prefix/src/flatbuffers_ep-install") - ExternalProject_Add(flatbuffers_ep - URL "https://github.com/google/flatbuffers/archive/v${FLATBUFFERS_VERSION}.tar.gz" - CMAKE_ARGS - "-DCMAKE_INSTALL_PREFIX:PATH=${FLATBUFFERS_PREFIX}" - "-DFLATBUFFERS_BUILD_TESTS=OFF") - - set(FLATBUFFERS_INCLUDE_DIR "${FLATBUFFERS_PREFIX}/include") - set(FLATBUFFERS_COMPILER "${FLATBUFFERS_PREFIX}/bin/flatc") - set(FLATBUFFERS_VENDORED 1) - else() - find_package(Flatbuffers REQUIRED) - set(FLATBUFFERS_VENDORED 0) - endif() - - message(STATUS "Flatbuffers include dir: ${FLATBUFFERS_INCLUDE_DIR}") - message(STATUS "Flatbuffers compiler: ${FLATBUFFERS_COMPILER}") - include_directories(SYSTEM ${FLATBUFFERS_INCLUDE_DIR}) - add_subdirectory(src/arrow/ipc) -endif() - if(ARROW_PYTHON) find_package(PythonLibsNew REQUIRED) find_package(NumPy REQUIRED) diff --git a/cpp/src/arrow/io/CMakeLists.txt b/cpp/src/arrow/io/CMakeLists.txt index 791c29c2797f9..c0199d7ef2599 100644 --- a/cpp/src/arrow/io/CMakeLists.txt +++ b/cpp/src/arrow/io/CMakeLists.txt @@ -18,90 +18,9 @@ # 
---------------------------------------------------------------------- # arrow_io : Arrow IO interfaces -# HDFS thirdparty setup -if (DEFINED ENV{HADOOP_HOME}) - set(HADOOP_HOME $ENV{HADOOP_HOME}) - if (NOT EXISTS "${HADOOP_HOME}/include/hdfs.h") - message(STATUS "Did not find hdfs.h in expected location, using vendored one") - set(HADOOP_HOME "${THIRDPARTY_DIR}/hadoop") - endif() -else() - set(HADOOP_HOME "${THIRDPARTY_DIR}/hadoop") -endif() - -set(HDFS_H_PATH "${HADOOP_HOME}/include/hdfs.h") -if (NOT EXISTS ${HDFS_H_PATH}) - message(FATAL_ERROR "Did not find hdfs.h at ${HDFS_H_PATH}") -endif() -message(STATUS "Found hdfs.h at: " ${HDFS_H_PATH}) - -include_directories(SYSTEM "${HADOOP_HOME}/include") - -# arrow_io library -if (MSVC) - set(ARROW_IO_STATIC_LINK_LIBS - arrow_static - ) - set(ARROW_IO_SHARED_LINK_LIBS - arrow_shared - ) -else() - set(ARROW_IO_STATIC_LINK_LIBS - arrow_static - ${CMAKE_DL_LIBS} - ) - set(ARROW_IO_SHARED_LINK_LIBS - arrow_shared - ${CMAKE_DL_LIBS} - ) -endif() - -set(ARROW_IO_SHARED_PRIVATE_LINK_LIBS - ${BOOST_SYSTEM_LIBRARY} - ${BOOST_FILESYSTEM_LIBRARY}) - -set(ARROW_IO_STATIC_PRIVATE_LINK_LIBS - ${BOOST_SYSTEM_LIBRARY} - ${BOOST_FILESYSTEM_LIBRARY}) - -set(ARROW_IO_TEST_LINK_LIBS - arrow_io_static) - -set(ARROW_IO_SRCS - file.cc - hdfs.cc - hdfs-internal.cc - interfaces.cc - memory.cc -) - -if(NOT APPLE AND NOT MSVC) - # Localize thirdparty symbols using a linker version script. This hides them - # from the client application. The OS X linker does not support the - # version-script option. - set(ARROW_IO_LINK_FLAGS "-Wl,--version-script=${CMAKE_CURRENT_SOURCE_DIR}/symbols.map") -endif() - -ADD_ARROW_LIB(arrow_io - SOURCES ${ARROW_IO_SRCS} - SHARED_LINK_FLAGS ${ARROW_IO_LINK_FLAGS} - SHARED_LINK_LIBS ${ARROW_IO_SHARED_LINK_LIBS} - SHARED_PRIVATE_LINK_LIBS ${ARROW_IO_SHARED_PRIVATE_LINK_LIBS} - STATIC_LINK_LIBS ${ARROW_IO_STATIC_LINK_LIBS} - STATIC_PRIVATE_LINK_LIBS ${ARROW_IO_STATIC_PRIVATE_LINK_LIBS} -) - ADD_ARROW_TEST(io-file-test) -ARROW_TEST_LINK_LIBRARIES(io-file-test - ${ARROW_IO_TEST_LINK_LIBS}) - ADD_ARROW_TEST(io-hdfs-test) -ARROW_TEST_LINK_LIBRARIES(io-hdfs-test - ${ARROW_IO_TEST_LINK_LIBS}) - ADD_ARROW_TEST(io-memory-test) -ARROW_TEST_LINK_LIBRARIES(io-memory-test - ${ARROW_IO_TEST_LINK_LIBS}) # Headers: top level install(FILES @@ -110,11 +29,3 @@ install(FILES interfaces.h memory.h DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}/arrow/io") - -# pkg-config support -configure_file(arrow-io.pc.in - "${CMAKE_CURRENT_BINARY_DIR}/arrow-io.pc" - @ONLY) -install( - FILES "${CMAKE_CURRENT_BINARY_DIR}/arrow-io.pc" - DESTINATION "${CMAKE_INSTALL_LIBDIR}/pkgconfig/") diff --git a/cpp/src/arrow/io/arrow-io.pc.in b/cpp/src/arrow/io/arrow-io.pc.in deleted file mode 100644 index 61af3577f5a38..0000000000000 --- a/cpp/src/arrow/io/arrow-io.pc.in +++ /dev/null @@ -1,30 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. 
See the License for the -# specific language governing permissions and limitations -# under the License. - -prefix=@CMAKE_INSTALL_PREFIX@ -libdir=${prefix}/@CMAKE_INSTALL_LIBDIR@ -includedir=${prefix}/include - -so_version=@ARROW_SO_VERSION@ -abi_version=@ARROW_ABI_VERSION@ - -Name: Apache Arrow I/O -Description: I/O interface for Arrow. -Version: @ARROW_VERSION@ -Libs: -L${libdir} -larrow_io -Cflags: -I${includedir} -Requires: arrow diff --git a/cpp/src/arrow/io/symbols.map b/cpp/src/arrow/io/symbols.map deleted file mode 100644 index 1e87caef9c8c1..0000000000000 --- a/cpp/src/arrow/io/symbols.map +++ /dev/null @@ -1,30 +0,0 @@ -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. - -{ - # Symbols marked as 'local' are not exported by the DSO and thus may not - # be used by client applications. - local: - # devtoolset / static-libstdc++ symbols - __cxa_*; - - extern "C++" { - # boost - boost::*; - - # devtoolset or -static-libstdc++ - the Red Hat devtoolset statically - # links c++11 symbols into binaries so that the result may be executed on - # a system with an older libstdc++ which doesn't include the necessary - # c++11 symbols. - std::*; - }; -}; diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index c6880c56e466b..37b455395644f 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -16,85 +16,24 @@ # under the License. ####################################### -# arrow_ipc -####################################### - -set(ARROW_IPC_SHARED_LINK_LIBS - arrow_io_shared - arrow_shared -) - -set(ARROW_IPC_TEST_LINK_LIBS - arrow_ipc_static - arrow_io_static - arrow_static - ${BOOST_REGEX_LIBRARY}) - -set(ARROW_IPC_SRCS - feather.cc - json.cc - json-internal.cc - metadata.cc - reader.cc - writer.cc -) - -if(NOT MSVC AND NOT APPLE) - # Localize thirdparty symbols using a linker version script. This hides them - # from the client application. The OS X linker does not support the - # version-script option. 
- set(ARROW_IPC_LINK_FLAGS "-Wl,--version-script=${CMAKE_CURRENT_SOURCE_DIR}/symbols.map") -endif() - -if(RAPIDJSON_VENDORED) - set(IPC_DEPENDENCIES ${IPC_DEPENDENCIES} rapidjson_ep) -endif() - -if(FLATBUFFERS_VENDORED) - set(IPC_DEPENDENCIES ${IPC_DEPENDENCIES} flatbuffers_ep) -endif() - -ADD_ARROW_LIB(arrow_ipc - SOURCES ${ARROW_IPC_SRCS} - DEPENDENCIES ${IPC_DEPENDENCIES} - SHARED_LINK_FLAGS ${ARROW_IPC_LINK_FLAGS} - SHARED_LINK_LIBS ${ARROW_IPC_SHARED_LINK_LIBS} - STATIC_LINK_LIBS ${ARROW_IO_SHARED_PRIVATE_LINK_LIBS} -) +# Messaging and interprocess communication ADD_ARROW_TEST(feather-test) -ARROW_TEST_LINK_LIBRARIES(feather-test - ${ARROW_IPC_TEST_LINK_LIBS}) - ADD_ARROW_TEST(ipc-read-write-test) -ARROW_TEST_LINK_LIBRARIES(ipc-read-write-test - ${ARROW_IPC_TEST_LINK_LIBS}) - ADD_ARROW_TEST(ipc-json-test) -ARROW_TEST_LINK_LIBRARIES(ipc-json-test - ${ARROW_IPC_TEST_LINK_LIBS}) - ADD_ARROW_TEST(json-integration-test) -ARROW_TEST_LINK_LIBRARIES(json-integration-test - ${ARROW_IPC_TEST_LINK_LIBS}) if (ARROW_BUILD_TESTS) target_link_libraries(json-integration-test - gflags - gtest - ${BOOST_FILESYSTEM_LIBRARY} - ${BOOST_SYSTEM_LIBRARY}) + gflags) if (UNIX) if (APPLE) - target_link_libraries(json-integration-test - ${CMAKE_DL_LIBS}) set_target_properties(json-integration-test PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") else() target_link_libraries(json-integration-test - pthread - ${CMAKE_DL_LIBS}) + pthread) endif() endif() endif() @@ -127,6 +66,7 @@ if(FLATBUFFERS_VENDORED) else() set(FBS_DEPENDS ${ABS_FBS_SRC}) endif() + add_custom_command( OUTPUT ${FBS_OUTPUT_FILES} COMMAND ${FLATBUFFERS_COMPILER} -c -o ${OUTPUT_DIR} ${ABS_FBS_SRC} @@ -136,7 +76,6 @@ add_custom_command( ) add_custom_target(metadata_fbs DEPENDS ${FBS_OUTPUT_FILES}) -add_dependencies(arrow_ipc_objlib metadata_fbs) # Headers: top level install(FILES @@ -148,26 +87,14 @@ install(FILES writer.h DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}/arrow/ipc") -# pkg-config support -configure_file(arrow-ipc.pc.in - "${CMAKE_CURRENT_BINARY_DIR}/arrow-ipc.pc" - @ONLY) -install( - FILES "${CMAKE_CURRENT_BINARY_DIR}/arrow-ipc.pc" - DESTINATION "${CMAKE_INSTALL_LIBDIR}/pkgconfig/") - if(MSVC) set(UTIL_LINK_LIBS - arrow_ipc_static - arrow_io_static arrow_static ${BOOST_FILESYSTEM_LIBRARY} ${BOOST_SYSTEM_LIBRARY} ${BOOST_REGEX_LIBRARY}) else() set(UTIL_LINK_LIBS - arrow_ipc_static - arrow_io_static arrow_static ${BOOST_FILESYSTEM_LIBRARY} ${BOOST_SYSTEM_LIBRARY} @@ -183,5 +110,3 @@ if (ARROW_BUILD_UTILITIES) endif() ADD_ARROW_BENCHMARK(ipc-read-write-benchmark) -ARROW_BENCHMARK_LINK_LIBRARIES(ipc-read-write-benchmark - ${ARROW_IPC_TEST_LINK_LIBS}) diff --git a/cpp/src/arrow/ipc/arrow-ipc.pc.in b/cpp/src/arrow/ipc/arrow-ipc.pc.in deleted file mode 100644 index 29a942acf0331..0000000000000 --- a/cpp/src/arrow/ipc/arrow-ipc.pc.in +++ /dev/null @@ -1,30 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. 
See the License for the -# specific language governing permissions and limitations -# under the License. - -prefix=@CMAKE_INSTALL_PREFIX@ -libdir=${prefix}/@CMAKE_INSTALL_LIBDIR@ -includedir=${prefix}/include - -so_version=@ARROW_SO_VERSION@ -abi_version=@ARROW_ABI_VERSION@ - -Name: Apache Arrow IPC -Description: IPC extension for Arrow. -Version: @ARROW_VERSION@ -Libs: -L${libdir} -larrow_ipc -Cflags: -I${includedir} -Requires: arrow-io diff --git a/cpp/src/arrow/ipc/symbols.map b/cpp/src/arrow/ipc/symbols.map deleted file mode 100644 index 1e87caef9c8c1..0000000000000 --- a/cpp/src/arrow/ipc/symbols.map +++ /dev/null @@ -1,30 +0,0 @@ -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. See accompanying LICENSE file. - -{ - # Symbols marked as 'local' are not exported by the DSO and thus may not - # be used by client applications. - local: - # devtoolset / static-libstdc++ symbols - __cxa_*; - - extern "C++" { - # boost - boost::*; - - # devtoolset or -static-libstdc++ - the Red Hat devtoolset statically - # links c++11 symbols into binaries so that the result may be executed on - # a system with an older libstdc++ which doesn't include the necessary - # c++11 symbols. - std::*; - }; -}; diff --git a/cpp/src/arrow/python/CMakeLists.txt b/cpp/src/arrow/python/CMakeLists.txt index 604527f6304ac..8f7991e7f6832 100644 --- a/cpp/src/arrow/python/CMakeLists.txt +++ b/cpp/src/arrow/python/CMakeLists.txt @@ -35,8 +35,6 @@ endif() set(ARROW_PYTHON_MIN_TEST_LIBS arrow_python_test_main arrow_python_static - arrow_ipc_static - arrow_io_static arrow_static ${BOOST_REGEX_LIBRARY}) @@ -61,8 +59,6 @@ set(ARROW_PYTHON_SRCS ) set(ARROW_PYTHON_SHARED_LINK_LIBS - arrow_io_shared - arrow_ipc_shared arrow_shared ) diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 463a29d87b711..3e86521757342 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -244,17 +244,11 @@ endfunction(bundle_arrow_lib) if (PYARROW_BUNDLE_ARROW_CPP) # arrow bundle_arrow_lib(ARROW_SHARED_LIB) - bundle_arrow_lib(ARROW_IO_SHARED_LIB) - bundle_arrow_lib(ARROW_IPC_SHARED_LIB) bundle_arrow_lib(ARROW_PYTHON_SHARED_LIB) endif() ADD_THIRDPARTY_LIB(arrow SHARED_LIB ${ARROW_SHARED_LIB}) -ADD_THIRDPARTY_LIB(arrow_io - SHARED_LIB ${ARROW_IO_SHARED_LIB}) -ADD_THIRDPARTY_LIB(arrow_ipc - SHARED_LIB ${ARROW_IPC_SHARED_LIB}) ADD_THIRDPARTY_LIB(arrow_python SHARED_LIB ${ARROW_PYTHON_SHARED_LIB}) @@ -279,8 +273,6 @@ set(CYTHON_EXTENSIONS set(LINK_LIBS arrow_shared - arrow_io_shared - arrow_ipc_shared arrow_python_shared ) diff --git a/python/cmake_modules/FindArrow.cmake b/python/cmake_modules/FindArrow.cmake index c2ca0f4ad22c8..51a887189ccd4 100644 --- a/python/cmake_modules/FindArrow.cmake +++ b/python/cmake_modules/FindArrow.cmake @@ -55,16 +55,6 @@ else() get_filename_component(ARROW_LIBS ${ARROW_LIB_PATH} DIRECTORY) endif() -find_library(ARROW_IO_LIB_PATH NAMES arrow_io - PATHS - ${ARROW_SEARCH_LIB_PATH} - NO_DEFAULT_PATH) - -find_library(ARROW_IPC_LIB_PATH NAMES arrow_ipc - PATHS - ${ARROW_SEARCH_LIB_PATH} - NO_DEFAULT_PATH) - 
find_library(ARROW_JEMALLOC_LIB_PATH NAMES arrow_jemalloc PATHS ${ARROW_SEARCH_LIB_PATH} @@ -78,20 +68,12 @@ find_library(ARROW_PYTHON_LIB_PATH NAMES arrow_python if (ARROW_INCLUDE_DIR AND ARROW_LIBS) set(ARROW_FOUND TRUE) set(ARROW_LIB_NAME libarrow) - set(ARROW_IO_LIB_NAME libarrow_io) - set(ARROW_IPC_LIB_NAME libarrow_ipc) set(ARROW_JEMALLOC_LIB_NAME libarrow_jemalloc) set(ARROW_PYTHON_LIB_NAME libarrow_python) set(ARROW_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_LIB_NAME}.a) set(ARROW_SHARED_LIB ${ARROW_LIBS}/${ARROW_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) - set(ARROW_IO_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_IO_LIB_NAME}.a) - set(ARROW_IO_SHARED_LIB ${ARROW_LIBS}/${ARROW_IO_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) - - set(ARROW_IPC_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_IPC_LIB_NAME}.a) - set(ARROW_IPC_SHARED_LIB ${ARROW_LIBS}/${ARROW_IPC_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) - set(ARROW_JEMALLOC_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_JEMALLOC_LIB_NAME}.a) set(ARROW_JEMALLOC_SHARED_LIB ${ARROW_LIBS}/${ARROW_JEMALLOC_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) @@ -100,8 +82,6 @@ if (ARROW_INCLUDE_DIR AND ARROW_LIBS) if (NOT Arrow_FIND_QUIETLY) message(STATUS "Found the Arrow core library: ${ARROW_LIB_PATH}") - message(STATUS "Found the Arrow IO library: ${ARROW_IO_LIB_PATH}") - message(STATUS "Found the Arrow IPC library: ${ARROW_IPC_LIB_PATH}") message(STATUS "Found the Arrow jemalloc library: ${ARROW_JEMALLOC_LIB_PATH}") endif () else () @@ -123,10 +103,6 @@ mark_as_advanced( ARROW_LIBS ARROW_STATIC_LIB ARROW_SHARED_LIB - ARROW_IO_STATIC_LIB - ARROW_IO_SHARED_LIB - ARROW_IPC_STATIC_LIB - ARROW_IPC_SHARED_LIB ARROW_JEMALLOC_STATIC_LIB ARROW_JEMALLOC_SHARED_LIB ) diff --git a/python/setup.py b/python/setup.py index ba77e688ae1f6..99bac15c779e6 100644 --- a/python/setup.py +++ b/python/setup.py @@ -224,8 +224,6 @@ def move_lib(lib_name): if self.bundle_arrow_cpp: move_lib("arrow") - move_lib("arrow_io") - move_lib("arrow_ipc") move_lib("arrow_python") if self.with_jemalloc: move_lib("arrow_jemalloc") From 793f4e0c51e320247ba71c9ccc7970e3eac1d01e Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 10 Apr 2017 09:25:17 -0400 Subject: [PATCH 0492/1644] ARROW-782: [C++] API cleanup, change public member access in DataType classes to functions, use class instead of struct Breaking APIs isn't ideal, but this one is fairly long overdue. The DataType classes are more than passive data carriers, and Google's C++ guide recommends using class instead of struct for this. That means we should put members in protected or private scope, and access them. 
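As a rough sketch of what that change looks like in practice (simplified, not the actual arrow::DataType definition), members move behind accessor functions:

```
// Before: a passive struct with a directly accessed member.
struct DataTypeBefore {
  int type;  // read as some_type.type
};

// After: a class that owns its state and exposes it through a function.
class DataTypeAfter {
 public:
  explicit DataTypeAfter(int id) : id_(id) {}
  int id() const { return id_; }  // read as some_type.id()

 private:
  int id_;
};
```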
I also renamed a couple of things to help with code clarity

* `DataType::type` is now `DataType::id()`
* `Array::type_enum` is now `Array::type_id()`

Author: Wes McKinney

Closes #520 from wesm/ARROW-782 and squashes the following commits:

f8dd131 [Wes McKinney] Revert changes with garrow_data_type_new_raw
40de60e [Wes McKinney] Fix glib usages of changed APIs
0097122 [Wes McKinney] Update post rebase
f725655 [Wes McKinney] cpplint
e77f6a5 [Wes McKinney] Change public member access in DataType classes to functions, use class instead of struct
---
 c_glib/arrow-glib/array-builder.cpp      |   2 +-
 c_glib/arrow-glib/array.cpp              |   4 +-
 c_glib/arrow-glib/data-type.cpp          |   4 +-
 c_glib/arrow-glib/field.cpp              |   7 +-
 cpp/src/arrow/array-decimal-test.cc      |  42 -----
 cpp/src/arrow/array-test.cc              |  12 +-
 cpp/src/arrow/array.cc                   |   8 +-
 cpp/src/arrow/array.h                    |   4 +-
 cpp/src/arrow/builder.cc                 |   8 +-
 cpp/src/arrow/compare.cc                 |  23 +--
 cpp/src/arrow/compare.h                  |   2 +-
 cpp/src/arrow/ipc/feather-test.cc        |   2 +-
 cpp/src/arrow/ipc/feather.cc             |  16 +-
 cpp/src/arrow/ipc/ipc-read-write-test.cc |   6 +-
 cpp/src/arrow/ipc/json-internal.cc       |  44 ++---
 cpp/src/arrow/ipc/json.cc                |   4 +-
 cpp/src/arrow/ipc/metadata.cc            |  32 ++--
 cpp/src/arrow/ipc/metadata.h             |   4 +-
 cpp/src/arrow/ipc/reader.cc              |   2 +-
 cpp/src/arrow/ipc/test-common.h          |  18 +-
 cpp/src/arrow/ipc/writer.cc              |   4 +-
 cpp/src/arrow/ipc/writer.h               |   2 +-
 cpp/src/arrow/loader.cc                  |  11 +-
 cpp/src/arrow/loader.h                   |   2 +-
 cpp/src/arrow/python/builtin_convert.cc  |   4 +-
 cpp/src/arrow/python/numpy_convert.cc    |   2 +-
 cpp/src/arrow/python/numpy_convert.h     |   2 +-
 cpp/src/arrow/python/pandas_convert.cc   |  52 +++---
 cpp/src/arrow/python/pandas_convert.h    |   2 +-
 cpp/src/arrow/table-test.cc              |   4 +-
 cpp/src/arrow/table.cc                   |   4 +-
 cpp/src/arrow/table.h                    |   4 +-
 cpp/src/arrow/tensor.cc                  |   8 +-
 cpp/src/arrow/tensor.h                   |   2 +-
 cpp/src/arrow/type-test.cc               |  64 ++++++--
 cpp/src/arrow/type.cc                    |  34 ++--
 cpp/src/arrow/type.h                     | 200 +++++++++++++++--------
 cpp/src/arrow/type_fwd.h                 |  34 ++--
 cpp/src/arrow/visitor_inline.h           |   6 +-
 python/pyarrow/array.pyx                 |   2 +-
 python/pyarrow/includes/libarrow.pxd     |  25 ++-
 python/pyarrow/scalar.pyx                |  24 ++-
 python/pyarrow/schema.pyx                |  22 +--
 43 files changed, 407 insertions(+), 351 deletions(-)

diff --git a/c_glib/arrow-glib/array-builder.cpp b/c_glib/arrow-glib/array-builder.cpp
index 0f038c8f66cee..aea93d00bafe4 100644
--- a/c_glib/arrow-glib/array-builder.cpp
+++ b/c_glib/arrow-glib/array-builder.cpp
@@ -161,7 +161,7 @@ garrow_array_builder_new_raw(std::shared_ptr<arrow::ArrayBuilder> *arrow_builder
 {
   GType type;
 
-  switch ((*arrow_builder)->type()->type) {
+  switch ((*arrow_builder)->type()->id()) {
   case arrow::Type::type::BOOL:
     type = GARROW_TYPE_BOOLEAN_ARRAY_BUILDER;
     break;
diff --git a/c_glib/arrow-glib/array.cpp b/c_glib/arrow-glib/array.cpp
index 9d0e101e1b52f..e016ba9728dec 100644
--- a/c_glib/arrow-glib/array.cpp
+++ b/c_glib/arrow-glib/array.cpp
@@ -216,7 +216,7 @@ GArrowType
 garrow_array_get_value_type(GArrowArray *array)
 {
   auto arrow_array = garrow_array_get_raw(array);
-  return garrow_type_from_raw(arrow_array->type_enum());
+  return garrow_type_from_raw(arrow_array->type_id());
 }
 
 /**
@@ -247,7 +247,7 @@ garrow_array_new_raw(std::shared_ptr<arrow::Array> *arrow_array)
   GType type;
   GArrowArray *array;
 
-  switch ((*arrow_array)->type_enum()) {
+  switch ((*arrow_array)->type_id()) {
   case arrow::Type::type::NA:
     type = GARROW_TYPE_NULL_ARRAY;
     break;
diff --git a/c_glib/arrow-glib/data-type.cpp b/c_glib/arrow-glib/data-type.cpp
index 2df9e7a38da91..12932a16269e8 100644
--- a/c_glib/arrow-glib/data-type.cpp
+++ 
b/c_glib/arrow-glib/data-type.cpp @@ -180,7 +180,7 @@ GArrowType garrow_data_type_type(GArrowDataType *data_type) { const auto arrow_data_type = garrow_data_type_get_raw(data_type); - return garrow_type_from_raw(arrow_data_type->type); + return garrow_type_from_raw(arrow_data_type->id()); } G_END_DECLS @@ -191,7 +191,7 @@ garrow_data_type_new_raw(std::shared_ptr *arrow_data_type) GType type; GArrowDataType *data_type; - switch ((*arrow_data_type)->type) { + switch ((*arrow_data_type)->id()) { case arrow::Type::type::NA: type = GARROW_TYPE_NULL_DATA_TYPE; break; diff --git a/c_glib/arrow-glib/field.cpp b/c_glib/arrow-glib/field.cpp index 0dcaf0a009a6d..5fd0c4d221bba 100644 --- a/c_glib/arrow-glib/field.cpp +++ b/c_glib/arrow-glib/field.cpp @@ -171,7 +171,7 @@ const gchar * garrow_field_get_name(GArrowField *field) { const auto arrow_field = garrow_field_get_raw(field); - return arrow_field->name.c_str(); + return arrow_field->name().c_str(); } /** @@ -184,7 +184,8 @@ GArrowDataType * garrow_field_get_data_type(GArrowField *field) { const auto arrow_field = garrow_field_get_raw(field); - return garrow_data_type_new_raw(&arrow_field->type); + auto type = arrow_field->type(); + return garrow_data_type_new_raw(&type); } /** @@ -197,7 +198,7 @@ gboolean garrow_field_is_nullable(GArrowField *field) { const auto arrow_field = garrow_field_get_raw(field); - return arrow_field->nullable; + return arrow_field->nullable(); } /** diff --git a/cpp/src/arrow/array-decimal-test.cc b/cpp/src/arrow/array-decimal-test.cc index 8353acc454f40..4bde7abd9221a 100644 --- a/cpp/src/arrow/array-decimal-test.cc +++ b/cpp/src/arrow/array-decimal-test.cc @@ -25,48 +25,6 @@ namespace arrow { namespace decimal { -TEST(TypesTest, TestDecimal32Type) { - DecimalType t1(8, 4); - - ASSERT_EQ(t1.type, Type::DECIMAL); - ASSERT_EQ(t1.precision, 8); - ASSERT_EQ(t1.scale, 4); - - ASSERT_EQ(t1.ToString(), std::string("decimal(8, 4)")); - - // Test properties - ASSERT_EQ(t1.byte_width(), 4); - ASSERT_EQ(t1.bit_width(), 32); -} - -TEST(TypesTest, TestDecimal64Type) { - DecimalType t1(12, 5); - - ASSERT_EQ(t1.type, Type::DECIMAL); - ASSERT_EQ(t1.precision, 12); - ASSERT_EQ(t1.scale, 5); - - ASSERT_EQ(t1.ToString(), std::string("decimal(12, 5)")); - - // Test properties - ASSERT_EQ(t1.byte_width(), 8); - ASSERT_EQ(t1.bit_width(), 64); -} - -TEST(TypesTest, TestDecimal128Type) { - DecimalType t1(27, 7); - - ASSERT_EQ(t1.type, Type::DECIMAL); - ASSERT_EQ(t1.precision, 27); - ASSERT_EQ(t1.scale, 7); - - ASSERT_EQ(t1.ToString(), std::string("decimal(27, 7)")); - - // Test properties - ASSERT_EQ(t1.byte_width(), 16); - ASSERT_EQ(t1.bit_width(), 128); -} - template class DecimalTestBase { public: diff --git a/cpp/src/arrow/array-test.cc b/cpp/src/arrow/array-test.cc index e50f4fd10b087..99279f3a8bb65 100644 --- a/cpp/src/arrow/array-test.cc +++ b/cpp/src/arrow/array-test.cc @@ -691,8 +691,8 @@ TEST_F(TestStringArray, TestArrayBasics) { TEST_F(TestStringArray, TestType) { std::shared_ptr type = strings_->type(); - ASSERT_EQ(Type::STRING, type->type); - ASSERT_EQ(Type::STRING, strings_->type_enum()); + ASSERT_EQ(Type::STRING, type->id()); + ASSERT_EQ(Type::STRING, strings_->type_id()); } TEST_F(TestStringArray, TestListFunctions) { @@ -905,8 +905,8 @@ TEST_F(TestBinaryArray, TestArrayBasics) { TEST_F(TestBinaryArray, TestType) { std::shared_ptr type = strings_->type(); - ASSERT_EQ(Type::BINARY, type->type); - ASSERT_EQ(Type::BINARY, strings_->type_enum()); + ASSERT_EQ(Type::BINARY, type->id()); + ASSERT_EQ(Type::BINARY, strings_->type_id()); 
} TEST_F(TestBinaryArray, TestListFunctions) { @@ -1679,8 +1679,8 @@ TEST_F(TestStructBuilder, TestAppendNull) { ASSERT_TRUE(result_->field(1)->IsNull(0)); ASSERT_TRUE(result_->field(1)->IsNull(1)); - ASSERT_EQ(Type::LIST, result_->field(0)->type_enum()); - ASSERT_EQ(Type::INT32, result_->field(1)->type_enum()); + ASSERT_EQ(Type::LIST, result_->field(0)->type_id()); + ASSERT_EQ(Type::INT32, result_->field(1)->type_id()); } TEST_F(TestStructBuilder, TestBasics) { diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index c4a78f3b2e400..e640bbd4a206e 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -312,8 +312,8 @@ bool DecimalArray::IsNegative(int64_t i) const { std::string DecimalArray::FormatValue(int64_t i) const { const auto type_ = std::dynamic_pointer_cast(type()); - const int precision = type_->precision; - const int scale = type_->scale; + const int precision = type_->precision(); + const int scale = type_->scale(); const int byte_width = byte_width_; const uint8_t* bytes = GetValue(i); switch (byte_width) { @@ -453,11 +453,11 @@ DictionaryArray::DictionaryArray( indices->offset()), dict_type_(static_cast(type.get())), indices_(indices) { - DCHECK_EQ(type->type, Type::DICTIONARY); + DCHECK_EQ(type->id(), Type::DICTIONARY); } Status DictionaryArray::Validate() const { - Type::type index_type_id = indices_->type()->type; + Type::type index_type_id = indices_->type()->id(); if (!is_integer(index_type_id)) { return Status::Invalid("Dictionary indices must be integer type"); } diff --git a/cpp/src/arrow/array.h b/cpp/src/arrow/array.h index 4f8b22e31b4eb..071d4e30f52dd 100644 --- a/cpp/src/arrow/array.h +++ b/cpp/src/arrow/array.h @@ -80,7 +80,7 @@ class ARROW_EXPORT Array { int64_t null_count() const; std::shared_ptr type() const { return type_; } - Type::type type_enum() const { return type_->type; } + Type::type type_id() const { return type_->id(); } /// Buffer for the null bitmap. 
/// @@ -447,7 +447,7 @@ class ARROW_EXPORT UnionArray : public Array { const type_id_t* raw_type_ids() const { return raw_type_ids_ + offset_; } const int32_t* raw_value_offsets() const { return raw_value_offsets_ + offset_; } - UnionMode mode() const { return static_cast(*type_.get()).mode; } + UnionMode mode() const { return static_cast(*type_.get()).mode(); } std::shared_ptr child(int pos) const; diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index 4281a61474cce..d85eb32652c47 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -253,7 +253,7 @@ BooleanBuilder::BooleanBuilder(MemoryPool* pool) BooleanBuilder::BooleanBuilder(MemoryPool* pool, const std::shared_ptr& type) : BooleanBuilder(pool) { - DCHECK_EQ(Type::BOOL, type->type); + DCHECK_EQ(Type::BOOL, type->id()); } Status BooleanBuilder::Init(int64_t capacity) { @@ -602,7 +602,7 @@ std::shared_ptr StructBuilder::field_builder(int pos) const { // TODO(wesm): come up with a less monolithic strategy Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, std::shared_ptr* out) { - switch (type->type) { + switch (type->id()) { BUILDER_CASE(UINT8, UInt8Builder); BUILDER_CASE(INT8, Int8Builder); BUILDER_CASE(UINT16, UInt16Builder); @@ -633,12 +633,12 @@ Status MakeBuilder(MemoryPool* pool, const std::shared_ptr& type, } case Type::STRUCT: { - std::vector& fields = type->children_; + const std::vector& fields = type->children(); std::vector> values_builder; for (auto it : fields) { std::shared_ptr builder; - RETURN_NOT_OK(MakeBuilder(pool, it->type, &builder)); + RETURN_NOT_OK(MakeBuilder(pool, it->type(), &builder)); values_builder.push_back(builder); } out->reset(new StructBuilder(pool, type, values_builder)); diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc index 2297e4b206d1f..e02f3f0a9a69c 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -151,7 +151,7 @@ class RangeEqualsVisitor { // Define a mapping from the type id to child number uint8_t max_code = 0; - const std::vector type_codes = left_type.type_codes; + const std::vector& type_codes = left_type.type_codes(); for (size_t i = 0; i < type_codes.size(); ++i) { const uint8_t code = type_codes[i]; if (code > max_code) { max_code = code; } @@ -532,7 +532,7 @@ class ApproxEqualsVisitor : public ArrayEqualsVisitor { static bool BaseDataEquals(const Array& left, const Array& right) { if (left.length() != right.length() || left.null_count() != right.null_count() || - left.type_enum() != right.type_enum()) { + left.type_id() != right.type_id()) { return false; } if (left.null_count() > 0) { @@ -571,7 +571,7 @@ Status ArrayRangeEquals(const Array& left, const Array& right, int64_t left_star int64_t left_end_idx, int64_t right_start_idx, bool* are_equal) { if (&left == &right) { *are_equal = true; - } else if (left.type_enum() != right.type_enum()) { + } else if (left.type_id() != right.type_id()) { *are_equal = false; } else if (left.length() == 0) { *are_equal = true; @@ -615,7 +615,7 @@ Status TensorEquals(const Tensor& left, const Tensor& right, bool* are_equal) { // The arrays are the same object if (&left == &right) { *are_equal = true; - } else if (left.type_enum() != right.type_enum()) { + } else if (left.type_id() != right.type_id()) { *are_equal = false; } else if (left.size() == 0) { *are_equal = true; @@ -670,13 +670,13 @@ class TypeEqualsVisitor { Status>::type Visit(const T& left) { const auto& right = static_cast(right_); - result_ = left.unit == right.unit; + result_ = left.unit() == 
right.unit(); return Status::OK(); } Status Visit(const TimestampType& left) { const auto& right = static_cast(right_); - result_ = left.unit == right.unit && left.timezone == right.timezone; + result_ = left.unit() == right.unit() && left.timezone() == right.timezone(); return Status::OK(); } @@ -688,7 +688,7 @@ class TypeEqualsVisitor { Status Visit(const DecimalType& left) { const auto& right = static_cast(right_); - result_ = left.precision == right.precision && left.scale == right.scale; + result_ = left.precision() == right.precision() && left.scale() == right.scale(); return Status::OK(); } @@ -699,13 +699,14 @@ class TypeEqualsVisitor { Status Visit(const UnionType& left) { const auto& right = static_cast(right_); - if (left.mode != right.mode || left.type_codes.size() != right.type_codes.size()) { + if (left.mode() != right.mode() || + left.type_codes().size() != right.type_codes().size()) { result_ = false; return Status::OK(); } - const std::vector left_codes = left.type_codes; - const std::vector right_codes = right.type_codes; + const std::vector& left_codes = left.type_codes(); + const std::vector& right_codes = right.type_codes(); for (size_t i = 0; i < left_codes.size(); ++i) { if (left_codes[i] != right_codes[i]) { @@ -743,7 +744,7 @@ Status TypeEquals(const DataType& left, const DataType& right, bool* are_equal) // The arrays are the same object if (&left == &right) { *are_equal = true; - } else if (left.type != right.type) { + } else if (left.id() != right.id()) { *are_equal = false; } else { TypeEqualsVisitor visitor(right); diff --git a/cpp/src/arrow/compare.h b/cpp/src/arrow/compare.h index 522b11dadec47..96a6435c5df33 100644 --- a/cpp/src/arrow/compare.h +++ b/cpp/src/arrow/compare.h @@ -27,7 +27,7 @@ namespace arrow { class Array; -struct DataType; +class DataType; class Status; class Tensor; diff --git a/cpp/src/arrow/ipc/feather-test.cc b/cpp/src/arrow/ipc/feather-test.cc index 077a44b896fc1..fb26df6e130f2 100644 --- a/cpp/src/arrow/ipc/feather-test.cc +++ b/cpp/src/arrow/ipc/feather-test.cc @@ -379,7 +379,7 @@ TEST_F(TestTableWriter, TimeTypes) { for (int i = 1; i < schema->num_fields(); ++i) { std::shared_ptr arr; - LoadArray(schema->field(i)->type, fields, buffers, &arr); + LoadArray(schema->field(i)->type(), fields, buffers, &arr); arrays.push_back(arr); } diff --git a/cpp/src/arrow/ipc/feather.cc b/cpp/src/arrow/ipc/feather.cc index e838e1fdbcd61..5dc039662ce9d 100644 --- a/cpp/src/arrow/ipc/feather.cc +++ b/cpp/src/arrow/ipc/feather.cc @@ -349,7 +349,7 @@ class TableReader::TableReaderImpl { buffers.push_back(nullptr); } - if (is_binary_like(type->type)) { + if (is_binary_like(type->id())) { int64_t offsets_size = GetOutputLength((meta->length() + 1) * sizeof(int32_t)); buffers.push_back(SliceBuffer(buffer, offset, offsets_size)); offset += offsets_size; @@ -516,13 +516,13 @@ class TableWriter::TableWriterImpl : public ArrayVisitor { } Status LoadArrayMetadata(const Array& values, ArrayMetadata* meta) { - if (!(is_primitive(values.type_enum()) || is_binary_like(values.type_enum()))) { + if (!(is_primitive(values.type_id()) || is_binary_like(values.type_id()))) { std::stringstream ss; ss << "Array is not primitive type: " << values.type()->ToString(); return Status::Invalid(ss.str()); } - meta->type = ToFlatbufferType(values.type_enum()); + meta->type = ToFlatbufferType(values.type_id()); RETURN_NOT_OK(stream_->Tell(&meta->offset)); @@ -552,7 +552,7 @@ class TableWriter::TableWriterImpl : public ArrayVisitor { const uint8_t* values_buffer = nullptr; - if 
(is_binary_like(values.type_enum())) { + if (is_binary_like(values.type_id())) { const auto& bin_values = static_cast(values); int64_t offset_bytes = sizeof(int32_t) * (values.length() + 1); @@ -570,7 +570,7 @@ class TableWriter::TableWriterImpl : public ArrayVisitor { const auto& prim_values = static_cast(values); const auto& fw_type = static_cast(*values.type()); - if (values.type_enum() == Type::BOOL) { + if (values.type_id() == Type::BOOL) { // Booleans are bit-packed values_bytes = BitUtil::BytesForBits(values.length()); } else { @@ -616,7 +616,7 @@ class TableWriter::TableWriterImpl : public ArrayVisitor { Status Visit(const DictionaryArray& values) override { const auto& dict_type = static_cast(*values.type()); - if (!is_integer(values.indices()->type_enum())) { + if (!is_integer(values.indices()->type_id())) { return Status::Invalid("Category values must be integers"); } @@ -631,7 +631,7 @@ class TableWriter::TableWriterImpl : public ArrayVisitor { Status Visit(const TimestampArray& values) override { RETURN_NOT_OK(WritePrimitiveValues(values)); const auto& ts_type = static_cast(*values.type()); - current_column_->SetTimestamp(ts_type.unit, ts_type.timezone); + current_column_->SetTimestamp(ts_type.unit(), ts_type.timezone()); return Status::OK(); } @@ -643,7 +643,7 @@ class TableWriter::TableWriterImpl : public ArrayVisitor { Status Visit(const Time32Array& values) override { RETURN_NOT_OK(WritePrimitiveValues(values)); - auto unit = static_cast(*values.type()).unit; + auto unit = static_cast(*values.type()).unit(); current_column_->SetTime(unit); return Status::OK(); } diff --git a/cpp/src/arrow/ipc/ipc-read-write-test.cc b/cpp/src/arrow/ipc/ipc-read-write-test.cc index 1a91ec39ca1fc..98a7c3dd58a6b 100644 --- a/cpp/src/arrow/ipc/ipc-read-write-test.cc +++ b/cpp/src/arrow/ipc/ipc-read-write-test.cc @@ -569,13 +569,13 @@ void CheckBatchDictionaries(const RecordBatch& batch) { // Check that dictionaries that should be the same are the same auto schema = batch.schema(); - const auto& t0 = static_cast(*schema->field(0)->type); - const auto& t1 = static_cast(*schema->field(1)->type); + const auto& t0 = static_cast(*schema->field(0)->type()); + const auto& t1 = static_cast(*schema->field(1)->type()); ASSERT_EQ(t0.dictionary().get(), t1.dictionary().get()); // Same dictionary used for list values - const auto& t3 = static_cast(*schema->field(3)->type); + const auto& t3 = static_cast(*schema->field(3)->type()); const auto& t3_value = static_cast(*t3.value_type()); ASSERT_EQ(t0.dictionary().get(), t3_value.dictionary().get()); } diff --git a/cpp/src/arrow/ipc/json-internal.cc b/cpp/src/arrow/ipc/json-internal.cc index fe0a7c94226f0..18ee8349da66a 100644 --- a/cpp/src/arrow/ipc/json-internal.cc +++ b/cpp/src/arrow/ipc/json-internal.cc @@ -114,13 +114,13 @@ class JsonSchemaWriter { writer_->StartObject(); writer_->Key("name"); - writer_->String(field.name.c_str()); + writer_->String(field.name().c_str()); writer_->Key("nullable"); - writer_->Bool(field.nullable); + writer_->Bool(field.nullable()); // Visit the type - RETURN_NOT_OK(VisitTypeInline(*field.type, this)); + RETURN_NOT_OK(VisitTypeInline(*field.type(), this)); writer_->EndObject(); return Status::OK(); @@ -153,7 +153,7 @@ class JsonSchemaWriter { void WriteTypeMetadata(const IntervalType& type) { writer_->Key("unit"); - switch (type.unit) { + switch (type.unit()) { case IntervalType::Unit::YEAR_MONTH: writer_->String("YEAR_MONTH"); break; @@ -165,23 +165,23 @@ class JsonSchemaWriter { void WriteTypeMetadata(const TimestampType& 
type) { writer_->Key("unit"); - writer_->String(GetTimeUnitName(type.unit)); - if (type.timezone.size() > 0) { + writer_->String(GetTimeUnitName(type.unit())); + if (type.timezone().size() > 0) { writer_->Key("timezone"); - writer_->String(type.timezone); + writer_->String(type.timezone()); } } void WriteTypeMetadata(const TimeType& type) { writer_->Key("unit"); - writer_->String(GetTimeUnitName(type.unit)); + writer_->String(GetTimeUnitName(type.unit())); writer_->Key("bitWidth"); writer_->Int(type.bit_width()); } void WriteTypeMetadata(const DateType& type) { writer_->Key("unit"); - switch (type.unit) { + switch (type.unit()) { case DateUnit::DAY: writer_->String("DAY"); break; @@ -198,14 +198,14 @@ class JsonSchemaWriter { void WriteTypeMetadata(const DecimalType& type) { writer_->Key("precision"); - writer_->Int(type.precision); + writer_->Int(type.precision()); writer_->Key("scale"); - writer_->Int(type.scale); + writer_->Int(type.scale()); } void WriteTypeMetadata(const UnionType& type) { writer_->Key("mode"); - switch (type.mode) { + switch (type.mode()) { case UnionMode::SPARSE: writer_->String("SPARSE"); break; @@ -217,8 +217,8 @@ class JsonSchemaWriter { // Write type ids writer_->Key("typeIds"); writer_->StartArray(); - for (size_t i = 0; i < type.type_codes.size(); ++i) { - writer_->Uint(type.type_codes[i]); + for (size_t i = 0; i < type.type_codes().size(); ++i) { + writer_->Uint(type.type_codes()[i]); } writer_->EndArray(); } @@ -461,7 +461,7 @@ class JsonArrayWriter { writer_->Key("children"); writer_->StartArray(); for (size_t i = 0; i < fields.size(); ++i) { - RETURN_NOT_OK(VisitArray(fields[i]->name, *arrays[i].get())); + RETURN_NOT_OK(VisitArray(fields[i]->name(), *arrays[i].get())); } writer_->EndArray(); return Status::OK(); @@ -513,7 +513,7 @@ class JsonArrayWriter { auto type = static_cast(array.type().get()); WriteIntegerField("TYPE_ID", array.raw_type_ids(), array.length()); - if (type->mode == UnionMode::DENSE) { + if (type->mode() == UnionMode::DENSE) { WriteIntegerField("OFFSET", array.raw_value_offsets(), array.length()); } return WriteChildren(type->children(), array.children()); @@ -1026,7 +1026,7 @@ class JsonArrayReader { RETURN_NOT_OK( GetIntArray(json_type_ids->value.GetArray(), length, &type_id_buffer)); - if (union_type.mode == UnionMode::DENSE) { + if (union_type.mode() == UnionMode::DENSE) { const auto& json_offsets = json_array.FindMember("OFFSET"); RETURN_NOT_ARRAY("OFFSET", json_offsets, json_array); RETURN_NOT_OK( @@ -1072,9 +1072,9 @@ class JsonArrayReader { auto it = json_child.FindMember("name"); RETURN_NOT_STRING("name", it, json_child); - DCHECK_EQ(it->value.GetString(), child_field->name); + DCHECK_EQ(it->value.GetString(), child_field->name()); std::shared_ptr child; - RETURN_NOT_OK(GetArray(json_children_arr[i], child_field->type, &child)); + RETURN_NOT_OK(GetArray(json_children_arr[i], child_field->type(), &child)); array->emplace_back(child); } @@ -1109,7 +1109,7 @@ class JsonArrayReader { case TYPE::type_id: \ return ReadArray(json_array, length, is_valid, type, array); - switch (type->type) { + switch (type->id()) { TYPE_CASE(NullType); TYPE_CASE(BooleanType); TYPE_CASE(UInt8Type); @@ -1192,7 +1192,7 @@ Status ReadJsonArray(MemoryPool* pool, const rj::Value& json_array, const Schema std::shared_ptr result = nullptr; for (const std::shared_ptr& field : schema.fields()) { - if (field->name == name) { + if (field->name() == name) { result = field; break; } @@ -1204,7 +1204,7 @@ Status ReadJsonArray(MemoryPool* pool, const rj::Value& 
json_array, const Schema return Status::KeyError(ss.str()); } - return ReadJsonArray(pool, json_array, result->type, array); + return ReadJsonArray(pool, json_array, result->type(), array); } } // namespace ipc diff --git a/cpp/src/arrow/ipc/json.cc b/cpp/src/arrow/ipc/json.cc index 8056b6f3e758e..0abd6d7ffe3df 100644 --- a/cpp/src/arrow/ipc/json.cc +++ b/cpp/src/arrow/ipc/json.cc @@ -79,7 +79,7 @@ class JsonWriter::JsonWriterImpl { DCHECK_EQ(batch.num_rows(), column->length()) << "Array length did not match record batch length"; - RETURN_NOT_OK(WriteJsonArray(schema_->field(i)->name, *column, writer_.get())); + RETURN_NOT_OK(WriteJsonArray(schema_->field(i)->name(), *column, writer_.get())); } writer_->EndArray(); @@ -158,7 +158,7 @@ class JsonReader::JsonReaderImpl { std::vector> columns(json_columns.Size()); for (int i = 0; i < static_cast(columns.size()); ++i) { - const std::shared_ptr& type = schema_->field(i)->type; + const std::shared_ptr& type = schema_->field(i)->type(); RETURN_NOT_OK(ReadJsonArray(pool_, json_columns[i], type, &columns[i])); } diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index d902ec296cff3..84f8883ffb949 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -163,13 +163,13 @@ static Status UnionToFlatBuffer(FBB& fbb, const std::shared_ptr& type, const auto& union_type = static_cast(*type); - flatbuf::UnionMode mode = union_type.mode == UnionMode::SPARSE + flatbuf::UnionMode mode = union_type.mode() == UnionMode::SPARSE ? flatbuf::UnionMode_Sparse : flatbuf::UnionMode_Dense; std::vector type_ids; - type_ids.reserve(union_type.type_codes.size()); - for (uint8_t code : union_type.type_codes) { + type_ids.reserve(union_type.type_codes().size()); + for (uint8_t code : union_type.type_codes()) { type_ids.push_back(code); } @@ -306,7 +306,7 @@ static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, std::vector* children, std::vector* layout, flatbuf::Type* out_type, DictionaryMemo* dictionary_memo, Offset* offset) { - if (type->type == Type::DICTIONARY) { + if (type->id() == Type::DICTIONARY) { // In this library, the dictionary "type" is a logical construct. 
Here we // pass through to the value type, as we've already captured the index // type in the DictionaryEncoding metadata in the parent field @@ -340,7 +340,7 @@ static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, layout->push_back(offset); } - switch (type->type) { + switch (type->id()) { case Type::BOOL: *out_type = flatbuf::Type_Bool; *offset = flatbuf::CreateBool(fbb).Union(); @@ -393,21 +393,21 @@ static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, case Type::TIME32: { const auto& time_type = static_cast(*type); *out_type = flatbuf::Type_Time; - *offset = flatbuf::CreateTime(fbb, ToFlatbufferUnit(time_type.unit), 32).Union(); + *offset = flatbuf::CreateTime(fbb, ToFlatbufferUnit(time_type.unit()), 32).Union(); } break; case Type::TIME64: { const auto& time_type = static_cast(*type); *out_type = flatbuf::Type_Time; - *offset = flatbuf::CreateTime(fbb, ToFlatbufferUnit(time_type.unit), 64).Union(); + *offset = flatbuf::CreateTime(fbb, ToFlatbufferUnit(time_type.unit()), 64).Union(); } break; case Type::TIMESTAMP: { const auto& ts_type = static_cast(*type); *out_type = flatbuf::Type_Timestamp; - flatbuf::TimeUnit fb_unit = ToFlatbufferUnit(ts_type.unit); + flatbuf::TimeUnit fb_unit = ToFlatbufferUnit(ts_type.unit()); FBString fb_timezone = 0; - if (ts_type.timezone.size() > 0) { - fb_timezone = fbb.CreateString(ts_type.timezone); + if (ts_type.timezone().size() > 0) { + fb_timezone = fbb.CreateString(ts_type.timezone()); } *offset = flatbuf::CreateTimestamp(fbb, fb_unit, fb_timezone).Union(); } break; @@ -431,7 +431,7 @@ static Status TypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, static Status TensorTypeToFlatbuffer(FBB& fbb, const std::shared_ptr& type, flatbuf::Type* out_type, Offset* offset) { - switch (type->type) { + switch (type->id()) { case Type::UINT8: INT_TO_FB_CASE(8, false); case Type::INT8: @@ -486,7 +486,7 @@ static DictionaryOffset GetDictionaryEncoding( static Status FieldToFlatbuffer(FBB& fbb, const std::shared_ptr& field, DictionaryMemo* dictionary_memo, FieldOffset* offset) { - auto fb_name = fbb.CreateString(field->name); + auto fb_name = fbb.CreateString(field->name()); flatbuf::Type type_enum; Offset type_offset; @@ -495,18 +495,18 @@ static Status FieldToFlatbuffer(FBB& fbb, const std::shared_ptr& field, std::vector layout; RETURN_NOT_OK(TypeToFlatbuffer( - fbb, field->type, &children, &layout, &type_enum, dictionary_memo, &type_offset)); + fbb, field->type(), &children, &layout, &type_enum, dictionary_memo, &type_offset)); auto fb_children = fbb.CreateVector(children); auto fb_layout = fbb.CreateVector(layout); DictionaryOffset dictionary = 0; - if (field->type->type == Type::DICTIONARY) { + if (field->type()->id() == Type::DICTIONARY) { dictionary = GetDictionaryEncoding( - fbb, static_cast(*field->type), dictionary_memo); + fbb, static_cast(*field->type()), dictionary_memo); } // TODO: produce the list of VectorTypes - *offset = flatbuf::CreateField(fbb, fb_name, field->nullable, type_enum, type_offset, + *offset = flatbuf::CreateField(fbb, fb_name, field->nullable(), type_enum, type_offset, dictionary, fb_children, fb_layout); return Status::OK(); diff --git a/cpp/src/arrow/ipc/metadata.h b/cpp/src/arrow/ipc/metadata.h index b042882c7cd31..84026c452ad27 100644 --- a/cpp/src/arrow/ipc/metadata.h +++ b/cpp/src/arrow/ipc/metadata.h @@ -34,8 +34,8 @@ namespace arrow { class Array; class Buffer; -struct DataType; -struct Field; +class DataType; +class Field; class Schema; class Status; class Tensor; diff --git 
a/cpp/src/arrow/ipc/reader.cc b/cpp/src/arrow/ipc/reader.cc index a7c4f04a4d4cc..69fde1783d7d3 100644 --- a/cpp/src/arrow/ipc/reader.cc +++ b/cpp/src/arrow/ipc/reader.cc @@ -97,7 +97,7 @@ static Status LoadRecordBatchFromSource(const std::shared_ptr& schema, context.max_recursion_depth = max_recursion_depth; for (int i = 0; i < schema->num_fields(); ++i) { - RETURN_NOT_OK(LoadArray(schema->field(i)->type, &context, &arrays[i])); + RETURN_NOT_OK(LoadArray(schema->field(i)->type(), &context, &arrays[i])); DCHECK_EQ(num_rows, arrays[i]->length()) << "Array length did not match record batch length"; } diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index 9e0480d4c3634..a17b609bbcba2 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -552,9 +552,9 @@ Status MakeTimestamps(std::shared_ptr* out) { 1489272000000, 1489272000000, 1489273000000}; std::shared_ptr a0, a1, a2; - ArrayFromVector(f0->type, is_valid, ts_values, &a0); - ArrayFromVector(f1->type, is_valid, ts_values, &a1); - ArrayFromVector(f2->type, is_valid, ts_values, &a2); + ArrayFromVector(f0->type(), is_valid, ts_values, &a0); + ArrayFromVector(f1->type(), is_valid, ts_values, &a1); + ArrayFromVector(f2->type(), is_valid, ts_values, &a2); ArrayVector arrays = {a0, a1, a2}; *out = std::make_shared(schema, a0->length(), arrays); @@ -575,10 +575,10 @@ Status MakeTimes(std::shared_ptr* out) { 1489272000000, 1489272000000, 1489273000000}; std::shared_ptr a0, a1, a2, a3; - ArrayFromVector(f0->type, is_valid, t32_values, &a0); - ArrayFromVector(f1->type, is_valid, t64_values, &a1); - ArrayFromVector(f2->type, is_valid, t32_values, &a2); - ArrayFromVector(f3->type, is_valid, t64_values, &a3); + ArrayFromVector(f0->type(), is_valid, t32_values, &a0); + ArrayFromVector(f1->type(), is_valid, t64_values, &a1); + ArrayFromVector(f2->type(), is_valid, t32_values, &a2); + ArrayFromVector(f3->type(), is_valid, t64_values, &a3); ArrayVector arrays = {a0, a1, a2, a3}; *out = std::make_shared(schema, a0->length(), arrays); @@ -605,8 +605,8 @@ Status MakeFWBinary(std::shared_ptr* out) { std::shared_ptr a1, a2; - FixedSizeBinaryBuilder b1(default_memory_pool(), f0->type); - FixedSizeBinaryBuilder b2(default_memory_pool(), f1->type); + FixedSizeBinaryBuilder b1(default_memory_pool(), f0->type()); + FixedSizeBinaryBuilder b2(default_memory_pool(), f1->type()); std::vector values1 = {"foo1", "foo2", "foo3", "foo4"}; AppendValues(is_valid, values1, &b1); diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc index d38a65c983d98..18a585599a31b 100644 --- a/cpp/src/arrow/ipc/writer.cc +++ b/cpp/src/arrow/ipc/writer.cc @@ -364,7 +364,7 @@ class RecordBatchWriter : public ArrayVisitor { // The Union type codes are not necessary 0-indexed uint8_t max_code = 0; - for (uint8_t code : type.type_codes) { + for (uint8_t code : type.type_codes()) { if (code > max_code) { max_code = code; } } @@ -406,7 +406,7 @@ class RecordBatchWriter : public ArrayVisitor { for (int i = 0; i < type.num_children(); ++i) { std::shared_ptr child = array.child(i); if (array.offset() != 0) { - const uint8_t code = type.type_codes[i]; + const uint8_t code = type.type_codes()[i]; child = child->Slice(child_offsets[code], child_lengths[code]); } RETURN_NOT_OK(VisitArray(*child)); diff --git a/cpp/src/arrow/ipc/writer.h b/cpp/src/arrow/ipc/writer.h index 0b7a6e1b56be5..629bcb9c6c980 100644 --- a/cpp/src/arrow/ipc/writer.h +++ b/cpp/src/arrow/ipc/writer.h @@ -31,7 +31,7 @@ namespace arrow { class Array; class 
Buffer; -struct Field; +class Field; class MemoryPool; class RecordBatch; class Schema; diff --git a/cpp/src/arrow/loader.cc b/cpp/src/arrow/loader.cc index f3347f92e6d87..f9f6e6fcac826 100644 --- a/cpp/src/arrow/loader.cc +++ b/cpp/src/arrow/loader.cc @@ -24,6 +24,7 @@ #include "arrow/array.h" #include "arrow/buffer.h" +#include "arrow/status.h" #include "arrow/type.h" #include "arrow/type_traits.h" #include "arrow/util/logging.h" @@ -32,10 +33,6 @@ namespace arrow { -class Array; -struct DataType; -class Status; - class ArrayLoader { public: ArrayLoader(const std::shared_ptr& type, ArrayLoaderContext* context) @@ -114,7 +111,7 @@ class ArrayLoader { } Status LoadChild(const Field& field, std::shared_ptr* out) { - ArrayLoader loader(field.type, context_); + ArrayLoader loader(field.type(), context_); --context_->max_recursion_depth; RETURN_NOT_OK(loader.Load(out)); ++context_->max_recursion_depth; @@ -211,11 +208,11 @@ class ArrayLoader { RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap)); if (field_meta.length > 0) { RETURN_NOT_OK(GetBuffer(context_->buffer_index, &type_ids)); - if (type.mode == UnionMode::DENSE) { + if (type.mode() == UnionMode::DENSE) { RETURN_NOT_OK(GetBuffer(context_->buffer_index + 1, &offsets)); } } - context_->buffer_index += type.mode == UnionMode::DENSE ? 2 : 1; + context_->buffer_index += type.mode() == UnionMode::DENSE ? 2 : 1; std::vector> fields; RETURN_NOT_OK(LoadChildren(type.children(), &fields)); diff --git a/cpp/src/arrow/loader.h b/cpp/src/arrow/loader.h index 9b650e2da7426..f5e399537fd7b 100644 --- a/cpp/src/arrow/loader.h +++ b/cpp/src/arrow/loader.h @@ -33,7 +33,7 @@ namespace arrow { class Array; class Buffer; -struct DataType; +class DataType; // ARROW-109: We set this number arbitrarily to help catch user mistakes. 
For // deeply nested schemas, it is expected the user will indicate explicitly the diff --git a/cpp/src/arrow/python/builtin_convert.cc b/cpp/src/arrow/python/builtin_convert.cc index 1ae13f3db061c..8cc9876fa9fc5 100644 --- a/cpp/src/arrow/python/builtin_convert.cc +++ b/cpp/src/arrow/python/builtin_convert.cc @@ -571,7 +571,7 @@ class DecimalConverter : public TypedConverter { // Dynamic constructor for sequence converters std::shared_ptr GetConverter(const std::shared_ptr& type) { - switch (type->type) { + switch (type->id()) { case Type::BOOL: return std::make_shared(); case Type::INT64: @@ -637,7 +637,7 @@ Status ConvertPySequence(PyObject* obj, MemoryPool* pool, std::shared_ptr Status ConvertPySequence(PyObject* obj, MemoryPool* pool, std::shared_ptr* out, const std::shared_ptr& type, int64_t size) { // Handle NA / NullType case - if (type->type == Type::NA) { + if (type->id() == Type::NA) { out->reset(new NullArray(size)); return Status::OK(); } diff --git a/cpp/src/arrow/python/numpy_convert.cc b/cpp/src/arrow/python/numpy_convert.cc index 23470fbc41aca..ab79e179c7ea5 100644 --- a/cpp/src/arrow/python/numpy_convert.cc +++ b/cpp/src/arrow/python/numpy_convert.cc @@ -118,7 +118,7 @@ Status GetNumPyType(const DataType& type, int* type_num) { *type_num = NPY_##NPY_NAME; \ break; - switch (type.type) { + switch (type.id()) { NUMPY_TYPE_CASE(UINT8, UINT8); NUMPY_TYPE_CASE(INT8, INT8); NUMPY_TYPE_CASE(UINT16, UINT16); diff --git a/cpp/src/arrow/python/numpy_convert.h b/cpp/src/arrow/python/numpy_convert.h index 685a626d4ca28..c2526403213b1 100644 --- a/cpp/src/arrow/python/numpy_convert.h +++ b/cpp/src/arrow/python/numpy_convert.h @@ -31,7 +31,7 @@ namespace arrow { -struct DataType; +class DataType; class MemoryPool; class Status; class Tensor; diff --git a/cpp/src/arrow/python/pandas_convert.cc b/cpp/src/arrow/python/pandas_convert.cc index 1a250e83c5093..643c5fb8796a0 100644 --- a/cpp/src/arrow/python/pandas_convert.cc +++ b/cpp/src/arrow/python/pandas_convert.cc @@ -669,8 +669,8 @@ static Status ConvertDecimals(const ChunkedArray& data, PyObject** out_values) { for (int c = 0; c < data.num_chunks(); c++) { auto* arr(static_cast(data.chunk(c).get())); auto type(std::dynamic_pointer_cast(arr->type())); - const int precision = type->precision; - const int scale = type->scale; + const int precision = type->precision(); + const int scale = type->scale(); const int bit_width = type->bit_width(); for (int64_t i = 0; i < arr->length(); ++i) { @@ -764,7 +764,7 @@ Status PandasConverter::ConvertObjects() { // This means we received an explicit type from the user if (type_) { - switch (type_->type) { + switch (type_->id()) { case Type::STRING: return ConvertObjectStrings(); case Type::FIXED_SIZE_BINARY: @@ -777,7 +777,7 @@ Status PandasConverter::ConvertObjects() { return ConvertDates(); case Type::LIST: { const auto& list_field = static_cast(*type_); - return ConvertLists(list_field.value_field()->type); + return ConvertLists(list_field.value_field()->type()); } case Type::DECIMAL: return ConvertDecimals(); @@ -860,7 +860,7 @@ inline Status PandasConverter::ConvertTypedLists(const std::shared_ptr std::shared_ptr inferred_type; RETURN_NOT_OK(list_builder.Append(true)); RETURN_NOT_OK(InferArrowTypeAndSize(objects[i], &size, &inferred_type)); - if (inferred_type->type != type->type) { + if (inferred_type->id() != type->id()) { std::stringstream ss; ss << inferred_type->ToString() << " cannot be converted to " << type->ToString(); return Status::TypeError(ss.str()); @@ -909,7 +909,7 @@ inline Status 
PandasConverter::ConvertTypedLists( std::shared_ptr inferred_type; RETURN_NOT_OK(list_builder.Append(true)); RETURN_NOT_OK(InferArrowTypeAndSize(objects[i], &size, &inferred_type)); - if (inferred_type->type != Type::STRING) { + if (inferred_type->id() != Type::STRING) { std::stringstream ss; ss << inferred_type->ToString() << " cannot be converted to STRING."; return Status::TypeError(ss.str()); @@ -928,7 +928,7 @@ inline Status PandasConverter::ConvertTypedLists( } Status PandasConverter::ConvertLists(const std::shared_ptr& type) { - switch (type->type) { + switch (type->id()) { LIST_CASE(UINT8, NPY_UINT8, UInt8Type) LIST_CASE(INT8, NPY_INT8, Int8Type) LIST_CASE(UINT16, NPY_UINT16, UInt16Type) @@ -1300,7 +1300,7 @@ class ObjectBlock : public PandasBlock { Status Write(const std::shared_ptr& col, int64_t abs_placement, int64_t rel_placement) override { - Type::type type = col->type()->type; + Type::type type = col->type()->id(); PyObject** out_buffer = reinterpret_cast(block_data_) + rel_placement * num_rows_; @@ -1319,7 +1319,7 @@ class ObjectBlock : public PandasBlock { RETURN_NOT_OK(ConvertDecimals(data, out_buffer)); } else if (type == Type::LIST) { auto list_type = std::static_pointer_cast(col->type()); - switch (list_type->value_type()->type) { + switch (list_type->value_type()->id()) { CONVERTLISTSLIKE_CASE(UInt8Type, UINT8) CONVERTLISTSLIKE_CASE(Int8Type, INT8) CONVERTLISTSLIKE_CASE(UInt16Type, UINT16) @@ -1360,7 +1360,7 @@ class IntBlock : public PandasBlock { Status Write(const std::shared_ptr& col, int64_t abs_placement, int64_t rel_placement) override { - Type::type type = col->type()->type; + Type::type type = col->type()->id(); C_TYPE* out_buffer = reinterpret_cast(block_data_) + rel_placement * num_rows_; @@ -1392,7 +1392,7 @@ class Float32Block : public PandasBlock { Status Write(const std::shared_ptr& col, int64_t abs_placement, int64_t rel_placement) override { - Type::type type = col->type()->type; + Type::type type = col->type()->id(); if (type != Type::FLOAT) { return Status::NotImplemented(col->type()->ToString()); } @@ -1412,7 +1412,7 @@ class Float64Block : public PandasBlock { Status Write(const std::shared_ptr& col, int64_t abs_placement, int64_t rel_placement) override { - Type::type type = col->type()->type; + Type::type type = col->type()->id(); double* out_buffer = reinterpret_cast(block_data_) + rel_placement * num_rows_; @@ -1465,7 +1465,7 @@ class BoolBlock : public PandasBlock { Status Write(const std::shared_ptr& col, int64_t abs_placement, int64_t rel_placement) override { - Type::type type = col->type()->type; + Type::type type = col->type()->id(); if (type != Type::BOOL) { return Status::NotImplemented(col->type()->ToString()); } @@ -1496,7 +1496,7 @@ class DatetimeBlock : public PandasBlock { Status Write(const std::shared_ptr& col, int64_t abs_placement, int64_t rel_placement) override { - Type::type type = col->type()->type; + Type::type type = col->type()->id(); int64_t* out_buffer = reinterpret_cast(block_data_) + rel_placement * num_rows_; @@ -1514,13 +1514,13 @@ class DatetimeBlock : public PandasBlock { } else if (type == Type::TIMESTAMP) { auto ts_type = static_cast(col->type().get()); - if (ts_type->unit == TimeUnit::NANO) { + if (ts_type->unit() == TimeUnit::NANO) { ConvertNumericNullable(data, kPandasTimestampNull, out_buffer); - } else if (ts_type->unit == TimeUnit::MICRO) { + } else if (ts_type->unit() == TimeUnit::MICRO) { ConvertDatetimeNanos(data, out_buffer); - } else if (ts_type->unit == TimeUnit::MILLI) { + } else if 
(ts_type->unit() == TimeUnit::MILLI) { ConvertDatetimeNanos(data, out_buffer); - } else if (ts_type->unit == TimeUnit::SECOND) { + } else if (ts_type->unit() == TimeUnit::SECOND) { ConvertDatetimeNanos(data, out_buffer); } else { return Status::NotImplemented("Unsupported time unit"); @@ -1661,7 +1661,7 @@ static inline Status MakeCategoricalBlock(const std::shared_ptr& type, int64_t num_rows, std::shared_ptr* block) { // All categoricals become a block with a single column auto dict_type = static_cast(type.get()); - switch (dict_type->index_type()->type) { + switch (dict_type->index_type()->id()) { case Type::INT8: *block = std::make_shared>(num_rows); break; @@ -1714,7 +1714,7 @@ class DataFrameBlockCreator { std::shared_ptr col = table_->column(i); PandasBlock::type output_type; - Type::type column_type = col->type()->type; + Type::type column_type = col->type()->id(); switch (column_type) { case Type::BOOL: output_type = col->null_count() > 0 ? PandasBlock::OBJECT : PandasBlock::BOOL; @@ -1762,7 +1762,7 @@ class DataFrameBlockCreator { break; case Type::TIMESTAMP: { const auto& ts_type = static_cast(*col->type()); - if (ts_type.timezone != "") { + if (ts_type.timezone() != "") { output_type = PandasBlock::DATETIME_WITH_TZ; } else { output_type = PandasBlock::DATETIME; @@ -1770,7 +1770,7 @@ class DataFrameBlockCreator { } break; case Type::LIST: { auto list_type = std::static_pointer_cast(col->type()); - if (!ListTypeSupported(list_type->value_type()->type)) { + if (!ListTypeSupported(list_type->value_type()->id())) { std::stringstream ss; ss << "Not implemented type for lists: " << list_type->value_type()->ToString(); @@ -1795,7 +1795,7 @@ class DataFrameBlockCreator { categorical_blocks_[i] = block; } else if (output_type == PandasBlock::DATETIME_WITH_TZ) { const auto& ts_type = static_cast(*col->type()); - block = std::make_shared(ts_type.timezone, table_->num_rows()); + block = std::make_shared(ts_type.timezone(), table_->num_rows()); RETURN_NOT_OK(block->Allocate()); datetimetz_blocks_[i] = block; } else { @@ -1942,10 +1942,10 @@ inline void set_numpy_metadata(int type, DataType* datatype, PyArrayObject* out) if (type == NPY_DATETIME) { PyArray_Descr* descr = PyArray_DESCR(out); auto date_dtype = reinterpret_cast(descr->c_metadata); - if (datatype->type == Type::TIMESTAMP) { + if (datatype->id() == Type::TIMESTAMP) { auto timestamp_type = static_cast(datatype); - switch (timestamp_type->unit) { + switch (timestamp_type->unit()) { case TimestampType::Unit::SECOND: date_dtype->meta.base = NPY_FR_s; break; @@ -2154,7 +2154,7 @@ class ArrowDeserializer { RETURN_NOT_OK(AllocateOutput(NPY_OBJECT)); auto out_values = reinterpret_cast(PyArray_DATA(arr_)); auto list_type = std::static_pointer_cast(col_->type()); - switch (list_type->value_type()->type) { + switch (list_type->value_type()->id()) { CONVERTVALUES_LISTSLIKE_CASE(UInt8Type, UINT8) CONVERTVALUES_LISTSLIKE_CASE(Int8Type, INT8) CONVERTVALUES_LISTSLIKE_CASE(UInt16Type, UINT16) diff --git a/cpp/src/arrow/python/pandas_convert.h b/cpp/src/arrow/python/pandas_convert.h index 4d32c8b86cf50..fd901d8f09fce 100644 --- a/cpp/src/arrow/python/pandas_convert.h +++ b/cpp/src/arrow/python/pandas_convert.h @@ -32,7 +32,7 @@ namespace arrow { class Array; class Column; -struct DataType; +class DataType; class MemoryPool; class Status; class Table; diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc index cdc0238cf4ab8..0da4c0f9641a3 100644 --- a/cpp/src/arrow/table-test.cc +++ b/cpp/src/arrow/table-test.cc @@ -244,8 +244,8 @@ 
TEST_F(TestTable, Metadata) { ASSERT_TRUE(table_->schema()->Equals(*schema_)); auto col = table_->column(0); - ASSERT_EQ(schema_->field(0)->name, col->name()); - ASSERT_EQ(schema_->field(0)->type, col->type()); + ASSERT_EQ(schema_->field(0)->name(), col->name()); + ASSERT_EQ(schema_->field(0)->type(), col->type()); } TEST_F(TestTable, InvalidColumns) { diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc index 4c5257b92c033..eabd98bda1893 100644 --- a/cpp/src/arrow/table.cc +++ b/cpp/src/arrow/table.cc @@ -153,7 +153,7 @@ RecordBatch::RecordBatch(const std::shared_ptr& schema, int64_t num_rows : schema_(schema), num_rows_(num_rows), columns_(std::move(columns)) {} const std::string& RecordBatch::column_name(int i) const { - return schema_->field(i)->name; + return schema_->field(i)->name(); } bool RecordBatch::Equals(const RecordBatch& other) const { @@ -204,7 +204,7 @@ Status RecordBatch::Validate() const { << " vs " << num_rows_; return Status::Invalid(ss.str()); } - const auto& schema_type = *schema_->field(i)->type; + const auto& schema_type = *schema_->field(i)->type(); if (!arr.type()->Equals(schema_type)) { std::stringstream ss; ss << "Column " << i << " type not match schema: " << arr.type()->ToString() diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h index b15d31b23a872..cfd1f366b039f 100644 --- a/cpp/src/arrow/table.h +++ b/cpp/src/arrow/table.h @@ -81,10 +81,10 @@ class ARROW_EXPORT Column { std::shared_ptr field() const { return field_; } // @returns: the column's name in the passed metadata - const std::string& name() const { return field_->name; } + const std::string& name() const { return field_->name(); } // @returns: the column's type according to the metadata - std::shared_ptr type() const { return field_->type; } + std::shared_ptr type() const { return field_->type(); } // @returns: the column's data as a chunked logical array std::shared_ptr data() const { return data_; } diff --git a/cpp/src/arrow/tensor.cc b/cpp/src/arrow/tensor.cc index d1c4083289f96..fa3e203c998ba 100644 --- a/cpp/src/arrow/tensor.cc +++ b/cpp/src/arrow/tensor.cc @@ -61,7 +61,7 @@ Tensor::Tensor(const std::shared_ptr& type, const std::shared_ptr& shape, const std::vector& strides, const std::vector& dim_names) : type_(type), data_(data), shape_(shape), strides_(strides), dim_names_(dim_names) { - DCHECK(is_tensor_supported(type->type)); + DCHECK(is_tensor_supported(type->id())); if (shape.size() > 0 && strides.size() == 0) { ComputeRowMajorStrides(static_cast(*type_), shape, &strides_); } @@ -107,6 +107,10 @@ bool Tensor::is_column_major() const { return strides_ == f_strides; } +Type::type Tensor::type_id() const { + return type_->id(); +} + bool Tensor::Equals(const Tensor& other) const { bool are_equal = false; Status error = TensorEquals(*this, other, &are_equal); @@ -161,7 +165,7 @@ Status ARROW_EXPORT MakeTensor(const std::shared_ptr& type, const std::shared_ptr& data, const std::vector& shape, const std::vector& strides, const std::vector& dim_names, std::shared_ptr* tensor) { - switch (type->type) { + switch (type->id()) { TENSOR_CASE(INT8, Int8Tensor); TENSOR_CASE(INT16, Int16Tensor); TENSOR_CASE(INT32, Int32Tensor); diff --git a/cpp/src/arrow/tensor.h b/cpp/src/arrow/tensor.h index 12015f14b1d3d..7741c305f870d 100644 --- a/cpp/src/arrow/tensor.h +++ b/cpp/src/arrow/tensor.h @@ -98,7 +98,7 @@ class ARROW_EXPORT Tensor { /// AKA "Fortran order" bool is_column_major() const; - Type::type type_enum() const { return type_->type; } + Type::type type_id() const; bool 
Equals(const Tensor& other) const; diff --git a/cpp/src/arrow/type-test.cc b/cpp/src/arrow/type-test.cc index 66164e3430913..dec7268a5a8b5 100644 --- a/cpp/src/arrow/type-test.cc +++ b/cpp/src/arrow/type-test.cc @@ -34,11 +34,11 @@ TEST(TestField, Basics) { Field f0("f0", int32()); Field f0_nn("f0", int32(), false); - ASSERT_EQ(f0.name, "f0"); - ASSERT_EQ(f0.type->ToString(), int32()->ToString()); + ASSERT_EQ(f0.name(), "f0"); + ASSERT_EQ(f0.type()->ToString(), int32()->ToString()); - ASSERT_TRUE(f0.nullable); - ASSERT_FALSE(f0_nn.nullable); + ASSERT_TRUE(f0.nullable()); + ASSERT_FALSE(f0_nn.nullable()); } TEST(TestField, Equals) { @@ -121,7 +121,7 @@ TEST_F(TestSchema, GetFieldByName) { TEST(TypesTest, TestPrimitive_##ENUM) { \ KLASS tp; \ \ - ASSERT_EQ(tp.type, Type::ENUM); \ + ASSERT_EQ(tp.id(), Type::ENUM); \ ASSERT_EQ(tp.ToString(), std::string(NAME)); \ } @@ -145,19 +145,19 @@ TEST(TestBinaryType, ToString) { StringType t2; EXPECT_TRUE(t1.Equals(e1)); EXPECT_FALSE(t1.Equals(t2)); - ASSERT_EQ(t1.type, Type::BINARY); + ASSERT_EQ(t1.id(), Type::BINARY); ASSERT_EQ(t1.ToString(), std::string("binary")); } TEST(TestStringType, ToString) { StringType str; - ASSERT_EQ(str.type, Type::STRING); + ASSERT_EQ(str.id(), Type::STRING); ASSERT_EQ(str.ToString(), std::string("string")); } TEST(TestFixedSizeBinaryType, ToString) { auto t = fixed_size_binary(10); - ASSERT_EQ(t->type, Type::FIXED_SIZE_BINARY); + ASSERT_EQ(t->id(), Type::FIXED_SIZE_BINARY); ASSERT_EQ("fixed_size_binary[10]", t->ToString()); } @@ -175,13 +175,13 @@ TEST(TestListType, Basics) { std::shared_ptr vt = std::make_shared(); ListType list_type(vt); - ASSERT_EQ(list_type.type, Type::LIST); + ASSERT_EQ(list_type.id(), Type::LIST); ASSERT_EQ("list", list_type.name()); ASSERT_EQ("list", list_type.ToString()); - ASSERT_EQ(list_type.value_type()->type, vt->type); - ASSERT_EQ(list_type.value_type()->type, vt->type); + ASSERT_EQ(list_type.value_type()->id(), vt->id()); + ASSERT_EQ(list_type.value_type()->id(), vt->id()); std::shared_ptr st = std::make_shared(); std::shared_ptr lt = std::make_shared(st); @@ -315,4 +315,46 @@ TEST(TestStructType, Basics) { // TODO(wesm): out of bounds for field(...) 
} +TEST(TypesTest, TestDecimal32Type) { + DecimalType t1(8, 4); + + ASSERT_EQ(t1.id(), Type::DECIMAL); + ASSERT_EQ(t1.precision(), 8); + ASSERT_EQ(t1.scale(), 4); + + ASSERT_EQ(t1.ToString(), std::string("decimal(8, 4)")); + + // Test properties + ASSERT_EQ(t1.byte_width(), 4); + ASSERT_EQ(t1.bit_width(), 32); +} + +TEST(TypesTest, TestDecimal64Type) { + DecimalType t1(12, 5); + + ASSERT_EQ(t1.id(), Type::DECIMAL); + ASSERT_EQ(t1.precision(), 12); + ASSERT_EQ(t1.scale(), 5); + + ASSERT_EQ(t1.ToString(), std::string("decimal(12, 5)")); + + // Test properties + ASSERT_EQ(t1.byte_width(), 8); + ASSERT_EQ(t1.bit_width(), 64); +} + +TEST(TypesTest, TestDecimal128Type) { + DecimalType t1(27, 7); + + ASSERT_EQ(t1.id(), Type::DECIMAL); + ASSERT_EQ(t1.precision(), 27); + ASSERT_EQ(t1.scale(), 7); + + ASSERT_EQ(t1.ToString(), std::string("decimal(27, 7)")); + + // Test properties + ASSERT_EQ(t1.byte_width(), 16); + ASSERT_EQ(t1.bit_width(), 128); +} + } // namespace arrow diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 93cab14d797c3..a2300d6029e39 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -32,8 +32,8 @@ namespace arrow { bool Field::Equals(const Field& other) const { return (this == &other) || - (this->name == other.name && this->nullable == other.nullable && - this->type->Equals(*other.type.get())); + (this->name_ == other.name_ && this->nullable_ == other.nullable_ && + this->type_->Equals(*other.type_.get())); } bool Field::Equals(const std::shared_ptr& other) const { @@ -42,8 +42,8 @@ bool Field::Equals(const std::shared_ptr& other) const { std::string Field::ToString() const { std::stringstream ss; - ss << this->name << ": " << this->type->ToString(); - if (!this->nullable) { ss << " not null"; } + ss << this->name_ << ": " << this->type_->ToString(); + if (!this->nullable_) { ss << " not null"; } return ss.str(); } @@ -107,7 +107,7 @@ std::string StructType::ToString() const { for (int i = 0; i < this->num_children(); ++i) { if (i > 0) { s << ", "; } std::shared_ptr field = this->child(i); - s << field->name << ": " << field->type->ToString(); + s << field->name() << ": " << field->type()->ToString(); } s << ">"; return s.str(); @@ -117,7 +117,7 @@ std::string StructType::ToString() const { // Date types DateType::DateType(Type::type type_id, DateUnit unit) - : FixedWidthType(type_id), unit(unit) {} + : FixedWidthType(type_id), unit_(unit) {} Date32Type::Date32Type() : DateType(Type::DATE32, DateUnit::DAY) {} @@ -135,7 +135,7 @@ std::string Date32Type::ToString() const { // Time types TimeType::TimeType(Type::type type_id, TimeUnit unit) - : FixedWidthType(type_id), unit(unit) {} + : FixedWidthType(type_id), unit_(unit) {} Time32Type::Time32Type(TimeUnit unit) : TimeType(Type::TIME32, unit) { DCHECK(unit == TimeUnit::SECOND || unit == TimeUnit::MILLI) @@ -144,7 +144,7 @@ Time32Type::Time32Type(TimeUnit unit) : TimeType(Type::TIME32, unit) { std::string Time32Type::ToString() const { std::stringstream ss; - ss << "time32[" << this->unit << "]"; + ss << "time32[" << this->unit_ << "]"; return ss.str(); } @@ -155,7 +155,7 @@ Time64Type::Time64Type(TimeUnit unit) : TimeType(Type::TIME64, unit) { std::string Time64Type::ToString() const { std::stringstream ss; - ss << "time64[" << this->unit << "]"; + ss << "time64[" << this->unit_ << "]"; return ss.str(); } @@ -164,8 +164,8 @@ std::string Time64Type::ToString() const { std::string TimestampType::ToString() const { std::stringstream ss; - ss << "timestamp[" << this->unit; - if (this->timezone.size() > 0) { 
ss << ", tz=" << this->timezone; } + ss << "timestamp[" << this->unit_; + if (this->timezone_.size() > 0) { ss << ", tz=" << this->timezone_; } ss << "]"; return ss.str(); } @@ -175,14 +175,14 @@ std::string TimestampType::ToString() const { UnionType::UnionType(const std::vector>& fields, const std::vector& type_codes, UnionMode mode) - : NestedType(Type::UNION), mode(mode), type_codes(type_codes) { + : NestedType(Type::UNION), mode_(mode), type_codes_(type_codes) { children_ = fields; } std::string UnionType::ToString() const { std::stringstream s; - if (mode == UnionMode::SPARSE) { + if (mode_ == UnionMode::SPARSE) { s << "union[sparse]<"; } else { s << "union[dense]<"; @@ -190,7 +190,7 @@ std::string UnionType::ToString() const { for (size_t i = 0; i < children_.size(); ++i) { if (i) { s << ", "; } - s << children_[i]->ToString() << "=" << static_cast(type_codes[i]); + s << children_[i]->ToString() << "=" << static_cast(type_codes_[i]); } s << ">"; return s.str(); @@ -246,7 +246,7 @@ bool Schema::Equals(const Schema& other) const { std::shared_ptr Schema::GetFieldByName(const std::string& name) { if (fields_.size() > 0 && name_to_index_.size() == 0) { for (size_t i = 0; i < fields_.size(); ++i) { - name_to_index_[fields_[i]->name] = static_cast(i); + name_to_index_[fields_[i]->name()] = static_cast(i); } } @@ -423,7 +423,7 @@ std::vector StructType::GetBufferLayout() const { } std::vector UnionType::GetBufferLayout() const { - if (mode == UnionMode::SPARSE) { + if (mode_ == UnionMode::SPARSE) { return {kValidityBuffer, kTypeBuffer}; } else { return {kValidityBuffer, kTypeBuffer, kOffsetBuffer}; @@ -432,7 +432,7 @@ std::vector UnionType::GetBufferLayout() const { std::string DecimalType::ToString() const { std::stringstream s; - s << "decimal(" << precision << ", " << scale << ")"; + s << "decimal(" << precision_ << ", " << scale_ << ")"; return s.str(); } diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 730cbed8f4d67..6810b35f05b70 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -127,13 +127,9 @@ class BufferDescr { int bit_width_; }; -struct ARROW_EXPORT DataType { - Type::type type; - - std::vector> children_; - - explicit DataType(Type::type type) : type(type) {} - +class ARROW_EXPORT DataType { + public: + explicit DataType(Type::type id) : id_(id) {} virtual ~DataType(); // Return whether the types are equal @@ -155,13 +151,20 @@ struct ARROW_EXPORT DataType { virtual std::vector GetBufferLayout() const = 0; + Type::type id() const { return id_; } + + protected: + Type::type id_; + std::vector> children_; + private: DISALLOW_COPY_AND_ASSIGN(DataType); }; typedef std::shared_ptr TypePtr; -struct ARROW_EXPORT FixedWidthType : public DataType { +class ARROW_EXPORT FixedWidthType : public DataType { + public: using DataType::DataType; virtual int bit_width() const = 0; @@ -169,53 +172,64 @@ struct ARROW_EXPORT FixedWidthType : public DataType { std::vector GetBufferLayout() const override; }; -struct ARROW_EXPORT PrimitiveCType : public FixedWidthType { +class ARROW_EXPORT PrimitiveCType : public FixedWidthType { + public: using FixedWidthType::FixedWidthType; }; -struct ARROW_EXPORT Integer : public PrimitiveCType { +class ARROW_EXPORT Integer : public PrimitiveCType { + public: using PrimitiveCType::PrimitiveCType; virtual bool is_signed() const = 0; }; -struct ARROW_EXPORT FloatingPoint : public PrimitiveCType { +class ARROW_EXPORT FloatingPoint : public PrimitiveCType { + public: using PrimitiveCType::PrimitiveCType; enum Precision { HALF, SINGLE, 
DOUBLE }; virtual Precision precision() const = 0; }; -struct ARROW_EXPORT NestedType : public DataType { +class ARROW_EXPORT NestedType : public DataType { + public: using DataType::DataType; }; -struct NoExtraMeta {}; +class NoExtraMeta {}; // A field is a piece of metadata that includes (for now) a name and a data // type -struct ARROW_EXPORT Field { - // Field name - std::string name; - - // The field's data type - std::shared_ptr type; - - // Fields can be nullable - bool nullable; - +class ARROW_EXPORT Field { + public: Field(const std::string& name, const std::shared_ptr& type, bool nullable = true) - : name(name), type(type), nullable(nullable) {} + : name_(name), type_(type), nullable_(nullable) {} bool Equals(const Field& other) const; bool Equals(const std::shared_ptr& other) const; std::string ToString() const; + + const std::string& name() const { return name_; } + std::shared_ptr type() const { return type_; } + bool nullable() const { return nullable_; } + + private: + // Field name + std::string name_; + + // The field's data type + std::shared_ptr type_; + + // Fields can be nullable + bool nullable_; }; typedef std::shared_ptr FieldPtr; template -struct ARROW_EXPORT CTypeImpl : public BASE { +class ARROW_EXPORT CTypeImpl : public BASE { + public: using c_type = C_TYPE; static constexpr Type::type type_id = TYPE_ID; @@ -230,7 +244,8 @@ struct ARROW_EXPORT CTypeImpl : public BASE { std::string ToString() const override { return std::string(DERIVED::name()); } }; -struct ARROW_EXPORT NullType : public DataType, public NoExtraMeta { +class ARROW_EXPORT NullType : public DataType, public NoExtraMeta { + public: static constexpr Type::type type_id = Type::NA; NullType() : DataType(Type::NA) {} @@ -244,11 +259,12 @@ struct ARROW_EXPORT NullType : public DataType, public NoExtraMeta { }; template -struct IntegerTypeImpl : public CTypeImpl { +class IntegerTypeImpl : public CTypeImpl { bool is_signed() const override { return std::is_signed::value; } }; -struct ARROW_EXPORT BooleanType : public FixedWidthType, public NoExtraMeta { +class ARROW_EXPORT BooleanType : public FixedWidthType, public NoExtraMeta { + public: static constexpr Type::type type_id = Type::BOOL; BooleanType() : FixedWidthType(Type::BOOL) {} @@ -260,60 +276,72 @@ struct ARROW_EXPORT BooleanType : public FixedWidthType, public NoExtraMeta { static std::string name() { return "bool"; } }; -struct ARROW_EXPORT UInt8Type : public IntegerTypeImpl { +class ARROW_EXPORT UInt8Type : public IntegerTypeImpl { + public: static std::string name() { return "uint8"; } }; -struct ARROW_EXPORT Int8Type : public IntegerTypeImpl { +class ARROW_EXPORT Int8Type : public IntegerTypeImpl { + public: static std::string name() { return "int8"; } }; -struct ARROW_EXPORT UInt16Type +class ARROW_EXPORT UInt16Type : public IntegerTypeImpl { + public: static std::string name() { return "uint16"; } }; -struct ARROW_EXPORT Int16Type : public IntegerTypeImpl { +class ARROW_EXPORT Int16Type : public IntegerTypeImpl { + public: static std::string name() { return "int16"; } }; -struct ARROW_EXPORT UInt32Type +class ARROW_EXPORT UInt32Type : public IntegerTypeImpl { + public: static std::string name() { return "uint32"; } }; -struct ARROW_EXPORT Int32Type : public IntegerTypeImpl { +class ARROW_EXPORT Int32Type : public IntegerTypeImpl { + public: static std::string name() { return "int32"; } }; -struct ARROW_EXPORT UInt64Type +class ARROW_EXPORT UInt64Type : public IntegerTypeImpl { + public: static std::string name() { return "uint64"; } }; 
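// ----------------------------------------------------------------------
// Editor's illustration -- not part of this patch. The struct-to-class
// refactor in type.h replaces public data members (DataType::type,
// Field::name, DecimalType::precision, ...) with accessor methods. A
// minimal sketch of the new call sites follows; it assumes a separate
// translation unit that includes "arrow/type.h", and the function name
// is hypothetical. Only declarations visible in this diff are used.

#include <cassert>
#include <memory>

#include "arrow/type.h"

void AccessorExample() {
  // DataType::id() replaces the former public member DataType::type.
  auto type = std::make_shared<arrow::DecimalType>(12, 5);
  assert(type->id() == arrow::Type::DECIMAL);
  assert(type->precision() == 12 && type->scale() == 5);

  // Field::name(), type(), and nullable() replace the former public members.
  arrow::Field amount("amount", type, /*nullable=*/false);
  assert(amount.name() == "amount" && !amount.nullable());
  assert(amount.ToString() == "amount: decimal(12, 5) not null");
}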
-struct ARROW_EXPORT Int64Type : public IntegerTypeImpl { +class ARROW_EXPORT Int64Type : public IntegerTypeImpl { + public: static std::string name() { return "int64"; } }; -struct ARROW_EXPORT HalfFloatType +class ARROW_EXPORT HalfFloatType : public CTypeImpl { + public: Precision precision() const override; static std::string name() { return "halffloat"; } }; -struct ARROW_EXPORT FloatType +class ARROW_EXPORT FloatType : public CTypeImpl { + public: Precision precision() const override; static std::string name() { return "float"; } }; -struct ARROW_EXPORT DoubleType +class ARROW_EXPORT DoubleType : public CTypeImpl { + public: Precision precision() const override; static std::string name() { return "double"; } }; -struct ARROW_EXPORT ListType : public NestedType { +class ARROW_EXPORT ListType : public NestedType { + public: static constexpr Type::type type_id = Type::LIST; // List can contain any other logical value type @@ -326,7 +354,7 @@ struct ARROW_EXPORT ListType : public NestedType { std::shared_ptr value_field() const { return children_[0]; } - std::shared_ptr value_type() const { return children_[0]->type; } + std::shared_ptr value_type() const { return children_[0]->type(); } Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; @@ -337,7 +365,8 @@ struct ARROW_EXPORT ListType : public NestedType { }; // BinaryType type is represents lists of 1-byte values. -struct ARROW_EXPORT BinaryType : public DataType, public NoExtraMeta { +class ARROW_EXPORT BinaryType : public DataType, public NoExtraMeta { + public: static constexpr Type::type type_id = Type::BINARY; BinaryType() : BinaryType(Type::BINARY) {} @@ -376,7 +405,8 @@ class ARROW_EXPORT FixedSizeBinaryType : public FixedWidthType { }; // UTF-8 encoded strings -struct ARROW_EXPORT StringType : public BinaryType { +class ARROW_EXPORT StringType : public BinaryType { + public: static constexpr Type::type type_id = Type::STRING; StringType() : BinaryType(Type::STRING) {} @@ -386,7 +416,8 @@ struct ARROW_EXPORT StringType : public BinaryType { static std::string name() { return "utf8"; } }; -struct ARROW_EXPORT StructType : public NestedType { +class ARROW_EXPORT StructType : public NestedType { + public: static constexpr Type::type type_id = Type::STRUCT; explicit StructType(const std::vector>& fields) @@ -412,25 +443,32 @@ static inline int decimal_byte_width(int precision) { } } -struct ARROW_EXPORT DecimalType : public FixedSizeBinaryType { +class ARROW_EXPORT DecimalType : public FixedSizeBinaryType { + public: static constexpr Type::type type_id = Type::DECIMAL; - explicit DecimalType(int precision_, int scale_) - : FixedSizeBinaryType(decimal_byte_width(precision_), Type::DECIMAL), - precision(precision_), - scale(scale_) {} + explicit DecimalType(int precision, int scale) + : FixedSizeBinaryType(decimal_byte_width(precision), Type::DECIMAL), + precision_(precision), + scale_(scale) {} + std::vector GetBufferLayout() const override; Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; static std::string name() { return "decimal"; } - int precision; - int scale; + int precision() const { return precision_; } + int scale() const { return scale_; } + + private: + int precision_; + int scale_; }; enum class UnionMode : char { SPARSE, DENSE }; -struct ARROW_EXPORT UnionType : public NestedType { +class ARROW_EXPORT UnionType : public NestedType { + public: static constexpr Type::type type_id = Type::UNION; UnionType(const std::vector>& fields, @@ -442,12 
+480,17 @@ struct ARROW_EXPORT UnionType : public NestedType { std::vector GetBufferLayout() const override; - UnionMode mode; + const std::vector& type_codes() const { return type_codes_; } + + UnionMode mode() const { return mode_; } + + private: + UnionMode mode_; // The type id used in the data to indicate each data type in the union. For // example, the first type in the union might be denoted by the id 5 (instead // of 0). - std::vector type_codes; + std::vector type_codes_; }; // ---------------------------------------------------------------------- @@ -455,16 +498,18 @@ struct ARROW_EXPORT UnionType : public NestedType { enum class DateUnit : char { DAY = 0, MILLI = 1 }; -struct ARROW_EXPORT DateType : public FixedWidthType { +class ARROW_EXPORT DateType : public FixedWidthType { public: - DateUnit unit; + DateUnit unit() const { return unit_; } protected: DateType(Type::type type_id, DateUnit unit); + DateUnit unit_; }; /// Date as int32_t days since UNIX epoch -struct ARROW_EXPORT Date32Type : public DateType { +class ARROW_EXPORT Date32Type : public DateType { + public: static constexpr Type::type type_id = Type::DATE32; using c_type = int32_t; @@ -478,7 +523,8 @@ struct ARROW_EXPORT Date32Type : public DateType { }; /// Date as int64_t milliseconds since UNIX epoch -struct ARROW_EXPORT Date64Type : public DateType { +class ARROW_EXPORT Date64Type : public DateType { + public: static constexpr Type::type type_id = Type::DATE64; using c_type = int64_t; @@ -512,15 +558,17 @@ static inline std::ostream& operator<<(std::ostream& os, TimeUnit unit) { return os; } -struct ARROW_EXPORT TimeType : public FixedWidthType { +class ARROW_EXPORT TimeType : public FixedWidthType { public: - TimeUnit unit; + TimeUnit unit() const { return unit_; } protected: TimeType(Type::type type_id, TimeUnit unit); + TimeUnit unit_; }; -struct ARROW_EXPORT Time32Type : public TimeType { +class ARROW_EXPORT Time32Type : public TimeType { + public: static constexpr Type::type type_id = Type::TIME32; using c_type = int32_t; @@ -532,7 +580,8 @@ struct ARROW_EXPORT Time32Type : public TimeType { std::string ToString() const override; }; -struct ARROW_EXPORT Time64Type : public TimeType { +class ARROW_EXPORT Time64Type : public TimeType { + public: static constexpr Type::type type_id = Type::TIME64; using c_type = int64_t; @@ -544,7 +593,8 @@ struct ARROW_EXPORT Time64Type : public TimeType { std::string ToString() const override; }; -struct ARROW_EXPORT TimestampType : public FixedWidthType { +class ARROW_EXPORT TimestampType : public FixedWidthType { + public: using Unit = TimeUnit; typedef int64_t c_type; @@ -553,20 +603,25 @@ struct ARROW_EXPORT TimestampType : public FixedWidthType { int bit_width() const override { return static_cast(sizeof(int64_t) * CHAR_BIT); } explicit TimestampType(TimeUnit unit = TimeUnit::MILLI) - : FixedWidthType(Type::TIMESTAMP), unit(unit) {} + : FixedWidthType(Type::TIMESTAMP), unit_(unit) {} explicit TimestampType(TimeUnit unit, const std::string& timezone) - : FixedWidthType(Type::TIMESTAMP), unit(unit), timezone(timezone) {} + : FixedWidthType(Type::TIMESTAMP), unit_(unit), timezone_(timezone) {} Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; static std::string name() { return "timestamp"; } - TimeUnit unit; - std::string timezone; + TimeUnit unit() const { return unit_; } + const std::string& timezone() const { return timezone_; } + + private: + TimeUnit unit_; + std::string timezone_; }; -struct ARROW_EXPORT IntervalType : public 
FixedWidthType { +class ARROW_EXPORT IntervalType : public FixedWidthType { + public: enum class Unit : char { YEAR_MONTH = 0, DAY_TIME = 1 }; using c_type = int64_t; @@ -574,14 +629,17 @@ struct ARROW_EXPORT IntervalType : public FixedWidthType { int bit_width() const override { return static_cast(sizeof(int64_t) * CHAR_BIT); } - Unit unit; - explicit IntervalType(Unit unit = Unit::YEAR_MONTH) - : FixedWidthType(Type::INTERVAL), unit(unit) {} + : FixedWidthType(Type::INTERVAL), unit_(unit) {} Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override { return name(); } static std::string name() { return "date"; } + + Unit unit() const { return unit_; } + + private: + Unit unit_; }; // ---------------------------------------------------------------------- diff --git a/cpp/src/arrow/type_fwd.h b/cpp/src/arrow/type_fwd.h index 2bb05f853a094..99c09bd6b7dca 100644 --- a/cpp/src/arrow/type_fwd.h +++ b/cpp/src/arrow/type_fwd.h @@ -26,10 +26,10 @@ namespace arrow { class Status; -struct DataType; +class DataType; class Array; class ArrayBuilder; -struct Field; +class Field; class Tensor; class Buffer; @@ -40,14 +40,14 @@ class Schema; class DictionaryType; class DictionaryArray; -struct NullType; +class NullType; class NullArray; -struct BooleanType; +class BooleanType; class BooleanArray; class BooleanBuilder; -struct BinaryType; +class BinaryType; class BinaryArray; class BinaryBuilder; @@ -55,23 +55,23 @@ class FixedSizeBinaryType; class FixedSizeBinaryArray; class FixedSizeBinaryBuilder; -struct StringType; +class StringType; class StringArray; class StringBuilder; -struct ListType; +class ListType; class ListArray; class ListBuilder; -struct StructType; +class StructType; class StructArray; class StructBuilder; -struct DecimalType; +class DecimalType; class DecimalArray; class DecimalBuilder; -struct UnionType; +class UnionType; class UnionArray; template @@ -84,7 +84,7 @@ template class NumericTensor; #define _NUMERIC_TYPE_DECL(KLASS) \ - struct KLASS##Type; \ + class KLASS##Type; \ using KLASS##Array = NumericArray; \ using KLASS##Builder = NumericBuilder; \ using KLASS##Tensor = NumericTensor; @@ -103,27 +103,27 @@ _NUMERIC_TYPE_DECL(Double); #undef _NUMERIC_TYPE_DECL -struct Date64Type; +class Date64Type; using Date64Array = NumericArray; using Date64Builder = NumericBuilder; -struct Date32Type; +class Date32Type; using Date32Array = NumericArray; using Date32Builder = NumericBuilder; -struct Time32Type; +class Time32Type; using Time32Array = NumericArray; using Time32Builder = NumericBuilder; -struct Time64Type; +class Time64Type; using Time64Array = NumericArray; using Time64Builder = NumericBuilder; -struct TimestampType; +class TimestampType; using TimestampArray = NumericArray; using TimestampBuilder = NumericBuilder; -struct IntervalType; +class IntervalType; using IntervalArray = NumericArray; // ---------------------------------------------------------------------- diff --git a/cpp/src/arrow/visitor_inline.h b/cpp/src/arrow/visitor_inline.h index 29b3db60cadf8..bc5f493fa1f9a 100644 --- a/cpp/src/arrow/visitor_inline.h +++ b/cpp/src/arrow/visitor_inline.h @@ -33,7 +33,7 @@ namespace arrow { template inline Status VisitTypeInline(const DataType& type, VISITOR* visitor) { - switch (type.type) { + switch (type.id()) { TYPE_VISIT_INLINE(NullType); TYPE_VISIT_INLINE(BooleanType); TYPE_VISIT_INLINE(Int8Type); @@ -72,7 +72,7 @@ inline Status VisitTypeInline(const DataType& type, VISITOR* visitor) { template inline Status VisitArrayInline(const Array& array, 
VISITOR* visitor) { - switch (array.type_enum()) { + switch (array.type_id()) { ARRAY_VISIT_INLINE(NullType); ARRAY_VISIT_INLINE(BooleanType); ARRAY_VISIT_INLINE(Int8Type); @@ -111,7 +111,7 @@ inline Status VisitArrayInline(const Array& array, VISITOR* visitor) { template inline Status VisitTensorInline(const Tensor& array, VISITOR* visitor) { - switch (array.type_enum()) { + switch (array.type_id()) { TENSOR_VISIT_INLINE(Int8Type); TENSOR_VISIT_INLINE(UInt8Type); TENSOR_VISIT_INLINE(Int16Type); diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx index ee500e6812974..1c4253eebe46a 100644 --- a/python/pyarrow/array.pyx +++ b/python/pyarrow/array.pyx @@ -618,7 +618,7 @@ cdef object box_array(const shared_ptr[CArray]& sp_array): if data_type == NULL: raise ValueError('Array data type was NULL') - cdef Array arr = _array_classes[data_type.type]() + cdef Array arr = _array_classes[data_type.id()]() arr.init(sp_array) return arr diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index e719e185b7b13..71b5c8d2172dc 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -61,7 +61,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: TimeUnit_NANO" arrow::TimeUnit::NANO" cdef cppclass CDataType" arrow::DataType": - Type type + Type id() c_bool Equals(const CDataType& other) @@ -72,7 +72,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: int64_t length() int64_t null_count() - Type type_enum() + Type type_id() c_bool Equals(const CArray& arr) c_bool IsNull(int i) @@ -97,14 +97,14 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: pass cdef cppclass CTimestampType" arrow::TimestampType"(CFixedWidthType): - TimeUnit unit - c_string timezone + TimeUnit unit() + const c_string& timezone() cdef cppclass CTime32Type" arrow::Time32Type"(CFixedWidthType): - TimeUnit unit + TimeUnit unit() cdef cppclass CTime64Type" arrow::Time64Type"(CFixedWidthType): - TimeUnit unit + TimeUnit unit() cdef cppclass CDictionaryType" arrow::DictionaryType"(CFixedWidthType): CDictionaryType(const shared_ptr[CDataType]& index_type, @@ -149,15 +149,14 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: int bit_width() cdef cppclass CDecimalType" arrow::DecimalType"(CFixedSizeBinaryType): - int precision - int scale + int precision() + int scale() CDecimalType(int precision, int scale) cdef cppclass CField" arrow::Field": - c_string name - shared_ptr[CDataType] type - - c_bool nullable + const c_string& name() + shared_ptr[CDataType] type() + c_bool nullable() CField(const c_string& name, const shared_ptr[CDataType]& type, c_bool nullable) @@ -307,7 +306,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: c_bool is_mutable() c_bool is_contiguous() - Type type_enum() + Type type_id() c_bool Equals(const CTensor& other) CStatus ConcatenateTables(const vector[shared_ptr[CTable]]& tables, diff --git a/python/pyarrow/scalar.pyx b/python/pyarrow/scalar.pyx index 7591ae880da3d..2b6746a3cf815 100644 --- a/python/pyarrow/scalar.pyx +++ b/python/pyarrow/scalar.pyx @@ -158,26 +158,26 @@ cdef class TimestampValue(ArrayValue): timezone = None tzinfo = None - if dtype.timezone.size() > 0: - timezone = frombytes(dtype.timezone) + if dtype.timezone().size() > 0: + timezone = frombytes(dtype.timezone()) import pytz tzinfo = pytz.timezone(timezone) try: pd = _pandas() - if dtype.unit == TimeUnit_SECOND: + if dtype.unit() == TimeUnit_SECOND: val = val * 1000000000 - elif dtype.unit == TimeUnit_MILLI: + elif 
dtype.unit() == TimeUnit_MILLI: val = val * 1000000 - elif dtype.unit == TimeUnit_MICRO: + elif dtype.unit() == TimeUnit_MICRO: val = val * 1000 return pd.Timestamp(val, tz=tzinfo) except ImportError: - if dtype.unit == TimeUnit_SECOND: + if dtype.unit() == TimeUnit_SECOND: result = datetime.datetime.utcfromtimestamp(val) - elif dtype.unit == TimeUnit_MILLI: + elif dtype.unit() == TimeUnit_MILLI: result = datetime.datetime.utcfromtimestamp(float(val) / 1000) - elif dtype.unit == TimeUnit_MICRO: + elif dtype.unit() == TimeUnit_MICRO: result = datetime.datetime.utcfromtimestamp( float(val) / 1000000) else: @@ -208,10 +208,6 @@ cdef class DecimalValue(ArrayValue): def as_py(self): cdef: CDecimalArray* ap = self.sp_array.get() - CDecimalType* t = ap.type().get() - int bit_width = t.bit_width() - int precision = t.precision - int scale = t.scale c_string s = ap.FormatValue(self.index) return decimal.Decimal(s.decode('utf8')) @@ -309,11 +305,11 @@ cdef dict _scalar_classes = { cdef object box_scalar(DataType type, const shared_ptr[CArray]& sp_array, int64_t index): cdef ArrayValue val - if type.type.type == Type_NA: + if type.type.id() == Type_NA: return NA elif sp_array.get().IsNull(index): return NA else: - val = _scalar_classes[type.type.type]() + val = _scalar_classes[type.type.id()]() val.init(type, sp_array, index) return val diff --git a/python/pyarrow/schema.pyx b/python/pyarrow/schema.pyx index 4b931bf452239..474980973959f 100644 --- a/python/pyarrow/schema.pyx +++ b/python/pyarrow/schema.pyx @@ -81,13 +81,13 @@ cdef class TimestampType(DataType): property unit: def __get__(self): - return timeunit_to_string(self.ts_type.unit) + return timeunit_to_string(self.ts_type.unit()) property tz: def __get__(self): - if self.ts_type.timezone.size() > 0: - return frombytes(self.ts_type.timezone) + if self.ts_type.timezone().size() > 0: + return frombytes(self.ts_type.timezone()) else: return None @@ -119,7 +119,7 @@ cdef class Field: cdef init(self, const shared_ptr[CField]& field): self.sp_field = field self.field = field.get() - self.type = box_data_type(field.get().type) + self.type = box_data_type(field.get().type()) @classmethod def from_py(cls, object name, DataType type, bint nullable=True): @@ -137,7 +137,7 @@ cdef class Field: property nullable: def __get__(self): - return self.field.nullable + return self.field.nullable() property name: @@ -145,7 +145,7 @@ cdef class Field: if box_field(self.sp_field) is None: raise ReferenceError( 'Field not initialized (references NULL pointer)') - return frombytes(self.field.name) + return frombytes(self.field.name()) cdef class Schema: @@ -162,7 +162,7 @@ cdef class Schema: cdef Field result = Field() result.init(self.schema.field(i)) - result.type = box_data_type(result.field.type) + result.type = box_data_type(result.field.type()) return result @@ -442,13 +442,13 @@ cdef DataType box_data_type(const shared_ptr[CDataType]& type): if type.get() == NULL: return None - if type.get().type == la.Type_DICTIONARY: + if type.get().id() == la.Type_DICTIONARY: out = DictionaryType() - elif type.get().type == la.Type_TIMESTAMP: + elif type.get().id() == la.Type_TIMESTAMP: out = TimestampType() - elif type.get().type == la.Type_FIXED_SIZE_BINARY: + elif type.get().id() == la.Type_FIXED_SIZE_BINARY: out = FixedSizeBinaryType() - elif type.get().type == la.Type_DECIMAL: + elif type.get().id() == la.Type_DECIMAL: out = DecimalType() else: out = DataType() From e327c2e08d51ee13b3cf3b8801cd3adfe88b3f7c Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 10 Apr 
2017 09:47:08 -0400 Subject: [PATCH 0493/1644] ARROW-761: [C++/Python] Add GetTensorSize method, Python bindings This computes the memory footprint of a serialized `arrow::Tensor` so that an appropriate memory region can be allocated. Author: Wes McKinney Closes #521 from wesm/ARROW-761 and squashes the following commits: 983177e [Wes McKinney] Fix sign comparison warning 0d787ad [Wes McKinney] Add GetTensorSize method, Python bindings --- cpp/src/arrow/ipc/ipc-read-write-test.cc | 4 +++ cpp/src/arrow/ipc/writer.cc | 29 ++++++++++++--------- cpp/src/arrow/ipc/writer.h | 4 +++ python/pyarrow/__init__.py | 3 ++- python/pyarrow/includes/libarrow.pxd | 3 +++ python/pyarrow/io.pyx | 20 ++++++++++++++ python/pyarrow/tests/test_convert_pandas.py | 4 +-- python/pyarrow/tests/test_ipc.py | 9 +++++++ python/pyarrow/tests/test_tensor.py | 9 ++++++- 9 files changed, 69 insertions(+), 16 deletions(-) diff --git a/cpp/src/arrow/ipc/ipc-read-write-test.cc b/cpp/src/arrow/ipc/ipc-read-write-test.cc index 98a7c3dd58a6b..cfba0d0a95106 100644 --- a/cpp/src/arrow/ipc/ipc-read-write-test.cc +++ b/cpp/src/arrow/ipc/ipc-read-write-test.cc @@ -640,6 +640,10 @@ TEST_F(TestTensorRoundTrip, BasicRoundtrip) { CheckTensorRoundTrip(t0); CheckTensorRoundTrip(tzero); + + int64_t serialized_size; + ASSERT_OK(GetTensorSize(t0, &serialized_size)); + ASSERT_TRUE(serialized_size > static_cast(size * sizeof(int64_t))); } TEST_F(TestTensorRoundTrip, NonContiguous) { diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc index 18a585599a31b..8ba00a6ffd599 100644 --- a/cpp/src/arrow/ipc/writer.cc +++ b/cpp/src/arrow/ipc/writer.cc @@ -192,16 +192,6 @@ class RecordBatchWriter : public ArrayVisitor { return Status::OK(); } - Status GetTotalSize(const RecordBatch& batch, int64_t* size) { - // emulates the behavior of Write without actually writing - int32_t metadata_length = 0; - int64_t body_length = 0; - MockOutputStream dst; - RETURN_NOT_OK(Write(batch, &dst, &metadata_length, &body_length)); - *size = dst.GetExtentBytesWritten(); - return Status::OK(); - } - protected: template Status VisitFixedWidth(const ArrayType& array) { @@ -522,8 +512,23 @@ Status WriteDictionary(int64_t dictionary_id, const std::shared_ptr& dict } Status GetRecordBatchSize(const RecordBatch& batch, int64_t* size) { - RecordBatchWriter writer(default_memory_pool(), 0, kMaxNestingDepth, true); - RETURN_NOT_OK(writer.GetTotalSize(batch, size)); + // emulates the behavior of Write without actually writing + int32_t metadata_length = 0; + int64_t body_length = 0; + MockOutputStream dst; + RETURN_NOT_OK(WriteRecordBatch(batch, 0, &dst, &metadata_length, &body_length, + default_memory_pool(), kMaxNestingDepth, true)); + *size = dst.GetExtentBytesWritten(); + return Status::OK(); +} + +Status GetTensorSize(const Tensor& tensor, int64_t* size) { + // emulates the behavior of Write without actually writing + int32_t metadata_length = 0; + int64_t body_length = 0; + MockOutputStream dst; + RETURN_NOT_OK(WriteTensor(tensor, &dst, &metadata_length, &body_length)); + *size = dst.GetExtentBytesWritten(); return Status::OK(); } diff --git a/cpp/src/arrow/ipc/writer.h b/cpp/src/arrow/ipc/writer.h index 629bcb9c6c980..b71becb8c73b8 100644 --- a/cpp/src/arrow/ipc/writer.h +++ b/cpp/src/arrow/ipc/writer.h @@ -81,6 +81,10 @@ Status WriteDictionary(int64_t dictionary_id, const std::shared_ptr& dict // Flatbuffers metadata. 
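// ----------------------------------------------------------------------
// Editor's illustration -- not part of this patch. GetTensorSize exists so
// a caller can reserve a contiguous region of exactly the right size before
// serializing, as the commit message describes. A sketch of that pattern,
// assuming it runs in a function that returns arrow::Status; PoolBuffer and
// io::FixedSizeBufferWriter are taken from the Arrow codebase of this era,
// and their use here is an assumption, not something this commit shows:
//
//   int64_t size = 0;
//   RETURN_NOT_OK(arrow::ipc::GetTensorSize(tensor, &size));
//
//   auto buffer = std::make_shared<arrow::PoolBuffer>(arrow::default_memory_pool());
//   RETURN_NOT_OK(buffer->Resize(size));
//
//   arrow::io::FixedSizeBufferWriter dst(buffer);
//   int32_t metadata_length = 0;
//   int64_t body_length = 0;
//   RETURN_NOT_OK(arrow::ipc::WriteTensor(tensor, &dst, &metadata_length, &body_length));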
Status ARROW_EXPORT GetRecordBatchSize(const RecordBatch& batch, int64_t* size); +// Compute the precise number of bytes needed in a contiguous memory segment to +// write the tensor including metadata, padding, and data +Status ARROW_EXPORT GetTensorSize(const Tensor& tensor, int64_t* size); + class ARROW_EXPORT StreamWriter { public: virtual ~StreamWriter(); diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 7b23cf66c6f7e..df615b428c1c1 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -51,7 +51,8 @@ Buffer, BufferReader, InMemoryOutputStream, MemoryMappedFile, memory_map, frombuffer, read_tensor, write_tensor, - memory_map, create_memory_map) + memory_map, create_memory_map, + get_record_batch_size, get_tensor_size) from pyarrow.ipc import FileReader, FileWriter, StreamReader, StreamWriter diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 71b5c8d2172dc..40dd83776b82d 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -544,6 +544,9 @@ cdef extern from "arrow/ipc/api.h" namespace "arrow::ipc" nogil: CStatus GetRecordBatch(int i, shared_ptr[CRecordBatch]* batch) + CStatus GetRecordBatchSize(const CRecordBatch& batch, int64_t* size) + CStatus GetTensorSize(const CTensor& tensor, int64_t* size) + CStatus WriteTensor(const CTensor& tensor, OutputStream* dst, int32_t* metadata_length, int64_t* body_length) diff --git a/python/pyarrow/io.pyx b/python/pyarrow/io.pyx index 98b5a62b372a2..4eb0816ecbdea 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/io.pyx @@ -1202,6 +1202,26 @@ cdef class FeatherReader: return col +def get_tensor_size(Tensor tensor): + """ + Return total size of serialized Tensor including metadata and padding + """ + cdef int64_t size + with nogil: + check_status(GetTensorSize(deref(tensor.tp), &size)) + return size + + +def get_record_batch_size(RecordBatch batch): + """ + Return total size of serialized RecordBatch including metadata and padding + """ + cdef int64_t size + with nogil: + check_status(GetRecordBatchSize(deref(batch.batch), &size)) + return size + + def write_tensor(Tensor tensor, NativeFile dest): """ Write pyarrow.Tensor to pyarrow.NativeFile object its current position diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index d1bea0b3e32f0..4a57e4ba1d4fb 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -68,7 +68,7 @@ def _check_pandas_roundtrip(self, df, expected=None, nthreads=1, timestamps_to_ms=False, expected_schema=None, check_dtype=True, schema=None): table = pa.Table.from_pandas(df, timestamps_to_ms=timestamps_to_ms, - schema=schema) + schema=schema) result = table.to_pandas(nthreads=nthreads) if expected_schema: assert table.schema.equals(expected_schema) @@ -79,7 +79,7 @@ def _check_pandas_roundtrip(self, df, expected=None, nthreads=1, def _check_array_roundtrip(self, values, expected=None, mask=None, timestamps_to_ms=False, type=None): arr = pa.Array.from_numpy(values, timestamps_to_ms=timestamps_to_ms, - mask=mask, type=type) + mask=mask, type=type) result = arr.to_pandas() values_nulls = pd.isnull(values) diff --git a/python/pyarrow/tests/test_ipc.py b/python/pyarrow/tests/test_ipc.py index 4c9dad1b840a8..31d418d5150ac 100644 --- a/python/pyarrow/tests/test_ipc.py +++ b/python/pyarrow/tests/test_ipc.py @@ -151,6 +151,15 @@ def test_ipc_zero_copy_numpy(): assert_frame_equal(df, rdf) +def 
test_get_record_batch_size(): + N = 10 + itemsize = 8 + df = pd.DataFrame({'foo': np.random.randn(N)}) + + batch = pa.RecordBatch.from_pandas(df) + assert pa.get_record_batch_size(batch) > (N * itemsize) + + def write_file(batch, sink): writer = pa.FileWriter(sink, batch.schema) writer.write_batch(batch) diff --git a/python/pyarrow/tests/test_tensor.py b/python/pyarrow/tests/test_tensor.py index 327b7f08a37f1..ec71735b2a540 100644 --- a/python/pyarrow/tests/test_tensor.py +++ b/python/pyarrow/tests/test_tensor.py @@ -42,10 +42,11 @@ def test_tensor_attrs(): tensor = pa.Tensor.from_numpy(data2) assert not tensor.is_mutable + def test_tensor_base_object(): tensor = pa.Tensor.from_numpy(np.random.randn(10, 4)) n = sys.getrefcount(tensor) - array = tensor.to_numpy() + array = tensor.to_numpy() # noqa assert sys.getrefcount(tensor) == n + 1 @@ -111,3 +112,9 @@ def test_tensor_ipc_strided(): pa.write_tensor(tensor, mmap) finally: _try_delete(path) + + +def test_tensor_size(): + data = np.random.randn(10, 4) + tensor = pa.Tensor.from_numpy(data) + assert pa.get_tensor_size(tensor) > (data.size * 8) From c2f28cd07413e262fa0b741c286f86d5c7277c56 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 10 Apr 2017 09:47:49 -0400 Subject: [PATCH 0494/1644] ARROW-741: [Python] Switch Travis CI to use Python 3.6 instead of 3.5 I'm OK with not building Python 3.5 in Travis CI anymore because 3.5 and 3.6 are essentially the same at the C API level. Other opinions? Author: Wes McKinney Closes #514 from wesm/ARROW-741 and squashes the following commits: 3aee721 [Wes McKinney] Remove apache channel 116b229 [Wes McKinney] Switch Travis CI to use Python 3.6 instead of 3.5 --- ci/travis_install_conda.sh | 1 - ci/travis_script_python.sh | 8 ++++---- 2 files changed, 4 insertions(+), 5 deletions(-) diff --git a/ci/travis_install_conda.sh b/ci/travis_install_conda.sh index 9c13b1bc0f079..e064317f12303 100644 --- a/ci/travis_install_conda.sh +++ b/ci/travis_install_conda.sh @@ -32,7 +32,6 @@ conda info -a conda config --set show_channel_urls True conda config --add channels https://repo.continuum.io/pkgs/free conda config --add channels conda-forge -conda config --add channels apache conda info -a conda install --yes conda-build jinja2 anaconda-client diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index df11209e7c49b..604cd13916299 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -28,7 +28,7 @@ pushd $PYTHON_DIR export PARQUET_HOME=$TRAVIS_BUILD_DIR/parquet-env build_parquet_cpp() { - conda create -y -q -p $PARQUET_HOME python=3.5 + conda create -y -q -p $PARQUET_HOME python=3.6 source activate $PARQUET_HOME # In case some package wants to download the MKL @@ -120,15 +120,15 @@ python_version_tests() { python -m pytest -vv -r sxX pyarrow # Build documentation once - if [[ "$PYTHON_VERSION" == "3.5" ]] + if [[ "$PYTHON_VERSION" == "3.6" ]] then pip install -r doc/requirements.txt python setup.py build_sphinx fi } -# run tests for python 2.7 and 3.5 +# run tests for python 2.7 and 3.6 python_version_tests 2.7 -python_version_tests 3.5 +python_version_tests 3.6 popd From 06d92bbab426d8b343d238e3e61166353da11877 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 10 Apr 2017 17:35:18 -0400 Subject: [PATCH 0495/1644] ARROW-779: [C++] Check for old metadata and raise exception if found Author: Wes McKinney Closes #507 from wesm/ARROW-779 and squashes the following commits: dad42f7 [Wes McKinney] Check for old metadata and raise exception if found --- 
cpp/src/arrow/ipc/metadata.cc | 18 +++++++++++++----- 1 file changed, 13 insertions(+), 5 deletions(-) diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index 84f8883ffb949..ee21156c08c67 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -50,7 +50,11 @@ using VectorLayoutOffset = flatbuffers::Offset; using Offset = flatbuffers::Offset; using FBString = flatbuffers::Offset; -static constexpr flatbuf::MetadataVersion kMetadataVersion = flatbuf::MetadataVersion_V3; +static constexpr flatbuf::MetadataVersion kCurrentMetadataVersion = + flatbuf::MetadataVersion_V3; + +static constexpr flatbuf::MetadataVersion kMinMetadataVersion = + flatbuf::MetadataVersion_V3; static Status IntFromFlatbuffer( const flatbuf::Int* int_data, std::shared_ptr* out) { @@ -605,8 +609,8 @@ static Status WriteFlatbufferBuilder(FBB& fbb, std::shared_ptr* out) { static Status WriteFBMessage(FBB& fbb, flatbuf::MessageHeader header_type, flatbuffers::Offset header, int64_t body_length, std::shared_ptr* out) { - auto message = - flatbuf::CreateMessage(fbb, kMetadataVersion, header_type, header, body_length); + auto message = flatbuf::CreateMessage( + fbb, kCurrentMetadataVersion, header_type, header, body_length); fbb.Finish(message); return WriteFlatbufferBuilder(fbb, out); } @@ -738,7 +742,7 @@ Status WriteFileFooter(const Schema& schema, const std::vector& dicti auto fb_record_batches = FileBlocksToFlatbuffer(fbb, record_batches); auto footer = flatbuf::CreateFooter( - fbb, kMetadataVersion, fb_schema, fb_dictionaries, fb_record_batches); + fbb, kCurrentMetadataVersion, fb_schema, fb_dictionaries, fb_record_batches); fbb.Finish(footer); @@ -814,7 +818,11 @@ class Message::MessageImpl { Status Open() { message_ = flatbuf::GetMessage(buffer_->data() + offset_); - // TODO(wesm): verify the message + // Check that the metadata version is supported + if (message_->version() < kMinMetadataVersion) { + return Status::Invalid("Old metadata version not supported"); + } + return Status::OK(); } From 85b870e72803641568ff260af2306d9fc993a6d4 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Tue, 11 Apr 2017 10:17:43 -0400 Subject: [PATCH 0496/1644] ARROW-802: [GLib] Add read examples Author: Kouhei Sutou Closes #522 from kou/glib-add-read-examples and squashes the following commits: 3fd5a2f [Kouhei Sutou] [GLib] Add read examples --- c_glib/.gitignore | 2 + c_glib/configure.ac | 2 + c_glib/example/Makefile.am | 15 +++- c_glib/example/README.md | 42 ++++++++++ c_glib/example/read-batch.c | 144 +++++++++++++++++++++++++++++++++++ c_glib/example/read-stream.c | 143 ++++++++++++++++++++++++++++++++++ 6 files changed, 347 insertions(+), 1 deletion(-) create mode 100644 c_glib/example/README.md create mode 100644 c_glib/example/read-batch.c create mode 100644 c_glib/example/read-stream.c diff --git a/c_glib/.gitignore b/c_glib/.gitignore index e57a0594c1af3..8928158f6ca3a 100644 --- a/c_glib/.gitignore +++ b/c_glib/.gitignore @@ -43,3 +43,5 @@ Makefile.in /arrow-glib/stamp-* /arrow-glib/*.pc /example/build +/example/read-batch +/example/read-stream diff --git a/c_glib/configure.ac b/c_glib/configure.ac index d63132e6f293c..f36719284711b 100644 --- a/c_glib/configure.ac +++ b/c_glib/configure.ac @@ -87,6 +87,8 @@ else AC_SUBST(ARROW_LIBS) fi +exampledir="\$(datadir)/arrow-glib/example" +AC_SUBST(exampledir) AC_CONFIG_FILES([ Makefile diff --git a/c_glib/example/Makefile.am b/c_glib/example/Makefile.am index 3d456d7844231..8bf3c15526759 100644 --- a/c_glib/example/Makefile.am +++ 
b/c_glib/example/Makefile.am @@ -28,7 +28,20 @@ AM_LDFLAGS = \ $(builddir)/../arrow-glib/libarrow-glib.la noinst_PROGRAMS = \ - build + build \ + read-batch \ + read-stream build_SOURCES = \ build.c + +read_batch_SOURCES = \ + read-batch.c + +read_stream_SOURCES = \ + read-stream.c + +example_DATA = \ + $(build_SOURCES) \ + $(read_batch_SOURCES) \ + $(read_stream_SOURCES) diff --git a/c_glib/example/README.md b/c_glib/example/README.md new file mode 100644 index 0000000000000..b1ba259534cb1 --- /dev/null +++ b/c_glib/example/README.md @@ -0,0 +1,42 @@ + + +# Arrow GLib example + +There are example codes in this directory. + +C example codes exist in this directory. + +## C example codes + +Here are example codes in this directory: + + * `build.c`: It shows how to create an array by array builder. + + + + * `read-batch.c`: It shows how to read Arrow array from file in batch + mode. + + + + * `read-stream.c`: It shows how to read Arrow array from file in + stream mode. + diff --git a/c_glib/example/read-batch.c b/c_glib/example/read-batch.c new file mode 100644 index 0000000000000..a55b085d961d1 --- /dev/null +++ b/c_glib/example/read-batch.c @@ -0,0 +1,144 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +#include + +#include + +static void +print_array(GArrowArray *array) +{ + GArrowType value_type; + gint64 i, n; + + value_type = garrow_array_get_value_type(array); + + g_print("["); + n = garrow_array_get_length(array); + +#define ARRAY_CASE(type, Type, TYPE, format) \ + case GARROW_TYPE_ ## TYPE: \ + { \ + GArrow ## Type ## Array *real_array; \ + real_array = GARROW_ ## TYPE ## _ARRAY(array); \ + for (i = 0; i < n; i++) { \ + if (i > 0) { \ + g_print(", "); \ + } \ + g_print(format, \ + garrow_ ## type ## _array_get_value(real_array, i)); \ + } \ + } \ + break + + switch (value_type) { + ARRAY_CASE(uint8, UInt8, UINT8, "%hhu"); + ARRAY_CASE(uint16, UInt16, UINT16, "%" G_GUINT16_FORMAT); + ARRAY_CASE(uint32, UInt32, UINT32, "%" G_GUINT32_FORMAT); + ARRAY_CASE(uint64, UInt64, UINT64, "%" G_GUINT64_FORMAT); + ARRAY_CASE( int8, Int8, INT8, "%hhd"); + ARRAY_CASE( int16, Int16, INT16, "%" G_GINT16_FORMAT); + ARRAY_CASE( int32, Int32, INT32, "%" G_GINT32_FORMAT); + ARRAY_CASE( int64, Int64, INT64, "%" G_GINT64_FORMAT); + ARRAY_CASE( float, Float, FLOAT, "%g"); + ARRAY_CASE(double, Double, DOUBLE, "%g"); + default: + break; + } +#undef ARRAY_CASE + + g_print("]\n"); +} + +static void +print_record_batch(GArrowRecordBatch *record_batch) +{ + guint nth_column, n_columns; + + n_columns = garrow_record_batch_get_n_columns(record_batch); + for (nth_column = 0; nth_column < n_columns; nth_column++) { + GArrowArray *array; + + g_print("columns[%u](%s): ", + nth_column, + garrow_record_batch_get_column_name(record_batch, nth_column)); + array = garrow_record_batch_get_column(record_batch, nth_column); + print_array(array); + } +} + +int +main(int argc, char **argv) +{ + const char *input_path = "/tmp/batch.arrow"; + GArrowIOMemoryMappedFile *input; + GError *error = NULL; + + if (argc > 1) + input_path = argv[1]; + input = garrow_io_memory_mapped_file_open(input_path, + GARROW_IO_FILE_MODE_READ, + &error); + if (!input) { + g_print("failed to open file: %s\n", error->message); + g_error_free(error); + return EXIT_FAILURE; + } + + { + GArrowIPCFileReader *reader; + + reader = garrow_ipc_file_reader_open(GARROW_IO_RANDOM_ACCESS_FILE(input), + &error); + if (!reader) { + g_print("failed to open file reader: %s\n", error->message); + g_error_free(error); + g_object_unref(input); + return EXIT_FAILURE; + } + + { + guint i, n; + + n = garrow_ipc_file_reader_get_n_record_batches(reader); + for (i = 0; i < n; i++) { + GArrowRecordBatch *record_batch; + + record_batch = + garrow_ipc_file_reader_get_record_batch(reader, i, &error); + if (!record_batch) { + g_print("failed to open file reader: %s\n", error->message); + g_error_free(error); + g_object_unref(reader); + g_object_unref(input); + return EXIT_FAILURE; + } + + print_record_batch(record_batch); + g_object_unref(record_batch); + } + } + + g_object_unref(reader); + } + + g_object_unref(input); + + return EXIT_SUCCESS; +} diff --git a/c_glib/example/read-stream.c b/c_glib/example/read-stream.c new file mode 100644 index 0000000000000..c56942c7770c5 --- /dev/null +++ b/c_glib/example/read-stream.c @@ -0,0 +1,143 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#include + +#include + +static void +print_array(GArrowArray *array) +{ + GArrowType value_type; + gint64 i, n; + + value_type = garrow_array_get_value_type(array); + + g_print("["); + n = garrow_array_get_length(array); + +#define ARRAY_CASE(type, Type, TYPE, format) \ + case GARROW_TYPE_ ## TYPE: \ + { \ + GArrow ## Type ## Array *real_array; \ + real_array = GARROW_ ## TYPE ## _ARRAY(array); \ + for (i = 0; i < n; i++) { \ + if (i > 0) { \ + g_print(", "); \ + } \ + g_print(format, \ + garrow_ ## type ## _array_get_value(real_array, i)); \ + } \ + } \ + break + + switch (value_type) { + ARRAY_CASE(uint8, UInt8, UINT8, "%hhu"); + ARRAY_CASE(uint16, UInt16, UINT16, "%" G_GUINT16_FORMAT); + ARRAY_CASE(uint32, UInt32, UINT32, "%" G_GUINT32_FORMAT); + ARRAY_CASE(uint64, UInt64, UINT64, "%" G_GUINT64_FORMAT); + ARRAY_CASE( int8, Int8, INT8, "%hhd"); + ARRAY_CASE( int16, Int16, INT16, "%" G_GINT16_FORMAT); + ARRAY_CASE( int32, Int32, INT32, "%" G_GINT32_FORMAT); + ARRAY_CASE( int64, Int64, INT64, "%" G_GINT64_FORMAT); + ARRAY_CASE( float, Float, FLOAT, "%g"); + ARRAY_CASE(double, Double, DOUBLE, "%g"); + default: + break; + } +#undef ARRAY_CASE + + g_print("]\n"); +} + +static void +print_record_batch(GArrowRecordBatch *record_batch) +{ + guint nth_column, n_columns; + + n_columns = garrow_record_batch_get_n_columns(record_batch); + for (nth_column = 0; nth_column < n_columns; nth_column++) { + GArrowArray *array; + + g_print("columns[%u](%s): ", + nth_column, + garrow_record_batch_get_column_name(record_batch, nth_column)); + array = garrow_record_batch_get_column(record_batch, nth_column); + print_array(array); + } +} + +int +main(int argc, char **argv) +{ + const char *input_path = "/tmp/stream.arrow"; + GArrowIOMemoryMappedFile *input; + GError *error = NULL; + + if (argc > 1) + input_path = argv[1]; + input = garrow_io_memory_mapped_file_open(input_path, + GARROW_IO_FILE_MODE_READ, + &error); + if (!input) { + g_print("failed to open file: %s\n", error->message); + g_error_free(error); + return EXIT_FAILURE; + } + + { + GArrowIPCStreamReader *reader; + + reader = garrow_ipc_stream_reader_open(GARROW_IO_INPUT_STREAM(input), + &error); + if (!reader) { + g_print("failed to open stream reader: %s\n", error->message); + g_error_free(error); + g_object_unref(input); + return EXIT_FAILURE; + } + + while (TRUE) { + GArrowRecordBatch *record_batch; + + record_batch = + garrow_ipc_stream_reader_get_next_record_batch(reader, &error); + if (error) { + g_print("failed to get record batch: %s\n", error->message); + g_error_free(error); + g_object_unref(reader); + g_object_unref(input); + return EXIT_FAILURE; + } + + if (!record_batch) { + break; + } + + print_record_batch(record_batch); + g_object_unref(record_batch); + } + + g_object_unref(reader); + } + + g_object_unref(input); + + return EXIT_SUCCESS; +} From b7423a63cc53ad7d86a5e003a0bd48f855622354 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Tue, 11 Apr 2017 10:20:24 -0400 Subject: [PATCH 0497/1644] ARROW-803: [GLib] Update package repository URL Author: Kouhei Sutou Closes #523 from 
kou/glib-update-package-repository-url and squashes the following commits: d130478 [Kouhei Sutou] [GLib] Update package repository URL --- c_glib/README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/c_glib/README.md b/c_glib/README.md index 95cc9a65c5bd8..0137d9059ee1e 100644 --- a/c_glib/README.md +++ b/c_glib/README.md @@ -66,8 +66,8 @@ You need to add the following apt-lines to `/etc/apt/sources.list.d/groonga.list`: ```text -deb http://packages.groonga.org/debian/ jessie main -deb-src http://packages.groonga.org/debian/ jessie main +deb https://packages.groonga.org/debian/ jessie main +deb-src https://packages.groonga.org/debian/ jessie main ``` Then you need to run the following command lines: @@ -105,7 +105,7 @@ Now you can install Arrow GLib packages: You need to add a Yum repository: ```text -% sudo yum install -y http://packages.groonga.org/centos/groonga-release-1.2.0-1.noarch.rpm +% sudo yum install -y https://packages.groonga.org/centos/groonga-release-1.3.0-1.noarch.rpm ``` Now you can install Arrow GLib packages: From b3cec804bbd1b2626ff55e1a733deca9b7ba032b Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Tue, 11 Apr 2017 10:22:11 -0400 Subject: [PATCH 0498/1644] ARROW-804: [GLib] Update build document Author: Kouhei Sutou Closes #524 from kou/glib-update-build-document and squashes the following commits: 07085e6 [Kouhei Sutou] [GLib] Update build document --- c_glib/README.md | 25 ++++++++++++++++++++++++- 1 file changed, 24 insertions(+), 1 deletion(-) diff --git a/c_glib/README.md b/c_glib/README.md index 0137d9059ee1e..b253d32b266d4 100644 --- a/c_glib/README.md +++ b/c_glib/README.md @@ -114,7 +114,30 @@ Now you can install Arrow GLib packages: % sudo yum install -y --enablerepo=epel arrow-glib-devel ``` -### Build +### How to build by users + +Arrow GLib users should use released source archive to build Arrow +GLib: + +```text +% wget https://dist.apache.org/repos/dist/release/arrow/arrow-0.3.0/apache-arrow-0.3.0.tar.gz +% tar xf apache-arrow-0.3.0.tar.gz +% cd apache-arrow-0.3.0 +``` + +You need to build and install Arrow C++ before you build and install +Arrow GLib. See Arrow C++ document about how to install Arrow C++. + +You can build and install Arrow GLib after you install Arrow C++: + +```text +% cd c_glib +% ./configure +% make +% sudo make install +``` + +### How to build by developers You need to install Arrow C++ before you install Arrow GLib. See Arrow C++ document about how to install Arrow C++. From f5245cc6b1811217df78acfb7bf6163d9dd09f32 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Tue, 11 Apr 2017 13:55:11 -0400 Subject: [PATCH 0499/1644] ARROW-806: [GLib] Support add/remove a column from table Author: Kouhei Sutou Closes #525 from kou/glib-add-remove-column and squashes the following commits: 72d495a [Kouhei Sutou] [GLib] Support add/remove a column from table --- c_glib/arrow-glib/table.cpp | 58 +++++++++++++++++++++++++++++++++++++ c_glib/arrow-glib/table.h | 8 +++++ c_glib/test/test-table.rb | 14 +++++++++ 3 files changed, 80 insertions(+) diff --git a/c_glib/arrow-glib/table.cpp b/c_glib/arrow-glib/table.cpp index 2f82ffa4320e0..1d743b70731bb 100644 --- a/c_glib/arrow-glib/table.cpp +++ b/c_glib/arrow-glib/table.cpp @@ -22,6 +22,7 @@ #endif #include +#include #include #include @@ -203,6 +204,63 @@ garrow_table_get_n_rows(GArrowTable *table) return arrow_table->num_rows(); } +/** + * garrow_table_add_column: + * @table: A #GArrowTable. + * @i: The index of the new column. + * @column: The column to be added. 
+ * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: (nullable) (transfer full): The newly allocated + * #GArrowTable that has a new column or %NULL on error. + * + * Since: 0.3.0 + */ +GArrowTable * +garrow_table_add_column(GArrowTable *table, + guint i, + GArrowColumn *column, + GError **error) +{ + const auto arrow_table = garrow_table_get_raw(table); + const auto arrow_column = garrow_column_get_raw(column); + std::shared_ptr arrow_new_table; + auto status = arrow_table->AddColumn(i, arrow_column, &arrow_new_table); + if (status.ok()) { + return garrow_table_new_raw(&arrow_new_table); + } else { + garrow_error_set(error, status, "[table][add-column]"); + return NULL; + } +} + +/** + * garrow_table_remove_column: + * @table: A #GArrowTable. + * @i: The index of the column to be removed. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: (nullable) (transfer full): The newly allocated + * #GArrowTable that doesn't have the column or %NULL on error. + * + * Since: 0.3.0 + */ +GArrowTable * +garrow_table_remove_column(GArrowTable *table, + guint i, + GError **error) +{ + const auto arrow_table = garrow_table_get_raw(table); + std::shared_ptr arrow_new_table; + auto status = arrow_table->RemoveColumn(i, &arrow_new_table); + if (status.ok()) { + return garrow_table_new_raw(&arrow_new_table); + } else { + garrow_error_set(error, status, "[table][remove-column]"); + return NULL; + } +} + G_END_DECLS GArrowTable * diff --git a/c_glib/arrow-glib/table.h b/c_glib/arrow-glib/table.h index 4dbb8c587a2ec..9ae0cce1b7d9d 100644 --- a/c_glib/arrow-glib/table.h +++ b/c_glib/arrow-glib/table.h @@ -75,4 +75,12 @@ GArrowColumn *garrow_table_get_column (GArrowTable *table, guint garrow_table_get_n_columns (GArrowTable *table); guint64 garrow_table_get_n_rows (GArrowTable *table); +GArrowTable *garrow_table_add_column (GArrowTable *table, + guint i, + GArrowColumn *column, + GError **error); +GArrowTable *garrow_table_remove_column (GArrowTable *table, + guint i, + GError **error); + G_END_DECLS diff --git a/c_glib/test/test-table.rb b/c_glib/test/test-table.rb index e2b71b31e44c0..da6871ec1d090 100644 --- a/c_glib/test/test-table.rb +++ b/c_glib/test/test-table.rb @@ -82,5 +82,19 @@ def test_n_columns def test_n_rows assert_equal(1, @table.n_rows) end + + def test_add_column + field = Arrow::Field.new("added", Arrow::BooleanDataType.new) + column = Arrow::Column.new(field, build_boolean_array([true])) + new_table = @table.add_column(1, column) + assert_equal(["visible", "added", "valid"], + new_table.schema.fields.collect(&:name)) + end + + def test_remove_column + new_table = @table.remove_column(0) + assert_equal(["valid"], + new_table.schema.fields.collect(&:name)) + end end end From 7b4723b2b4f259ac27b959d108fdc65c734d7359 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Tue, 11 Apr 2017 13:55:57 -0400 Subject: [PATCH 0500/1644] ARROW-807: [GLib] Update "Since" tag Author: Kouhei Sutou Closes #526 from kou/glib-update-since-tag and squashes the following commits: 2ad64cc [Kouhei Sutou] [GLib] Update "Since" tag --- c_glib/arrow-glib/array.cpp | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/c_glib/arrow-glib/array.cpp b/c_glib/arrow-glib/array.cpp index e016ba9728dec..3bd7b40ff9493 100644 --- a/c_glib/arrow-glib/array.cpp +++ b/c_glib/arrow-glib/array.cpp @@ -142,6 +142,8 @@ garrow_array_class_init(GArrowArrayClass *klass) * @i: The index of the target value. * * Returns: Whether the i-th value is null or not. 
+ * + * Since: 0.3.0 */ gboolean garrow_array_is_null(GArrowArray *array, gint64 i) @@ -193,9 +195,10 @@ garrow_array_get_n_nulls(GArrowArray *array) * garrow_array_get_value_data_type: * @array: A #GArrowArray. * - * Since: 0.3.0 * Returns: (transfer full): The #GArrowDataType for each value of the * array. + * + * Since: 0.3.0 */ GArrowDataType * garrow_array_get_value_data_type(GArrowArray *array) @@ -209,8 +212,9 @@ garrow_array_get_value_data_type(GArrowArray *array) * garrow_array_get_value_type: * @array: A #GArrowArray. * - * Since: 0.3.0 * Returns: The #GArrowType for each value of the array. + * + * Since: 0.3.0 */ GArrowType garrow_array_get_value_type(GArrowArray *array) From ab520cbc7a1e3fe14a2290322ca2e392af30d612 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Tue, 11 Apr 2017 13:56:41 -0400 Subject: [PATCH 0501/1644] ARROW-808: [GLib] Remove needless ignore entries Author: Kouhei Sutou Closes #527 from kou/glib-remove-needless-ignore-entries and squashes the following commits: 57f734c [Kouhei Sutou] [GLib] Remove needless ignore entries --- c_glib/.gitignore | 4 ---- 1 file changed, 4 deletions(-) diff --git a/c_glib/.gitignore b/c_glib/.gitignore index 8928158f6ca3a..6f2de80d4f79e 100644 --- a/c_glib/.gitignore +++ b/c_glib/.gitignore @@ -36,10 +36,6 @@ Makefile.in /version /arrow-glib/enums.c /arrow-glib/enums.h -/arrow-glib/io-enums.c -/arrow-glib/io-enums.h -/arrow-glib/ipc-enums.c -/arrow-glib/ipc-enums.h /arrow-glib/stamp-* /arrow-glib/*.pc /example/build From 5e5a5878d7be62e0ae26ca0b45b4aafd761eb43d Mon Sep 17 00:00:00 2001 From: Leif Walsh Date: Tue, 11 Apr 2017 19:04:21 -0400 Subject: [PATCH 0502/1644] ARROW-805: [C++] Don't throw IOError when listing empty HDFS dir Author: Leif Walsh Closes #528 from leifwalsh/ARROW-805 and squashes the following commits: 4e1bb05 [Leif Walsh] ARROW-805: [C++] Don't throw IOError when listing empty HDFS dir --- cpp/src/arrow/io/hdfs.cc | 7 +++++-- cpp/src/arrow/io/io-hdfs-test.cc | 4 ++++ 2 files changed, 9 insertions(+), 2 deletions(-) diff --git a/cpp/src/arrow/io/hdfs.cc b/cpp/src/arrow/io/hdfs.cc index 408b85f8daeb7..3510ba183d8e4 100644 --- a/cpp/src/arrow/io/hdfs.cc +++ b/cpp/src/arrow/io/hdfs.cc @@ -406,8 +406,11 @@ class HdfsClient::HdfsClientImpl { // errno indicates error // // Note: errno is thread-locala - if (errno == 0) { num_entries = 0; } - { return Status::IOError("HDFS: list directory failed"); } + if (errno == 0) { + num_entries = 0; + } else { + return Status::IOError("HDFS: list directory failed"); + } } // Allocate additional space for elements diff --git a/cpp/src/arrow/io/io-hdfs-test.cc b/cpp/src/arrow/io/io-hdfs-test.cc index a2c9c5210b10d..0a9f5d9885e19 100644 --- a/cpp/src/arrow/io/io-hdfs-test.cc +++ b/cpp/src/arrow/io/io-hdfs-test.cc @@ -170,8 +170,12 @@ TYPED_TEST(TestHdfsClient, CreateDirectory) { ASSERT_OK(this->client_->CreateDirectory(path)); ASSERT_TRUE(this->client_->Exists(path)); + std::vector listing; + EXPECT_OK(this->client_->ListDirectory(path, &listing)); + ASSERT_EQ(0, listing.size()); EXPECT_OK(this->client_->Delete(path, true)); ASSERT_FALSE(this->client_->Exists(path)); + ASSERT_RAISES(IOError, this->client_->ListDirectory(path, &listing)); } TYPED_TEST(TestHdfsClient, GetCapacityUsed) { From 6443b82878489ed6a308d1e5ace33088788a060e Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Wed, 12 Apr 2017 10:54:42 -0400 Subject: [PATCH 0503/1644] ARROW-810: [GLib] Remove io/ipc prefix Author: Kouhei Sutou Closes #530 from kou/glib-remove-io-ipc-prefix and squashes the following commits: 
adfad7c [Kouhei Sutou] [GLib] Remove io/ipc prefix --- c_glib/arrow-glib/Makefile.am | 90 +++--- c_glib/arrow-glib/arrow-glib.h | 30 +- c_glib/arrow-glib/arrow-glib.hpp | 28 +- .../{io-file-mode.cpp => file-mode.cpp} | 28 +- .../{io-file-mode.h => file-mode.h} | 16 +- .../{io-file-mode.hpp => file-mode.hpp} | 6 +- ...tput-stream.cpp => file-output-stream.cpp} | 120 ++++---- ...e-output-stream.h => file-output-stream.h} | 44 +-- ...mapped-file.hpp => file-output-stream.hpp} | 6 +- .../{ipc-file-reader.cpp => file-reader.cpp} | 110 +++---- c_glib/arrow-glib/file-reader.h | 83 +++++ .../{ipc-file-writer.hpp => file-reader.hpp} | 6 +- .../{ipc-file-writer.cpp => file-writer.cpp} | 58 ++-- .../{ipc-file-writer.h => file-writer.h} | 52 ++-- .../{ipc-file-reader.hpp => file-writer.hpp} | 6 +- c_glib/arrow-glib/{io-file.cpp => file.cpp} | 48 +-- c_glib/arrow-glib/{io-file.h => file.h} | 34 +-- c_glib/arrow-glib/{io-file.hpp => file.hpp} | 10 +- .../{io-input-stream.cpp => input-stream.cpp} | 18 +- .../{io-input-stream.h => input-stream.h} | 26 +- .../{io-input-stream.hpp => input-stream.hpp} | 10 +- c_glib/arrow-glib/io-memory-mapped-file.cpp | 287 ------------------ c_glib/arrow-glib/ipc-file-reader.h | 83 ----- c_glib/arrow-glib/ipc-stream-reader.h | 80 ----- c_glib/arrow-glib/memory-mapped-file.cpp | 287 ++++++++++++++++++ ...ory-mapped-file.h => memory-mapped-file.h} | 48 +-- ...tput-stream.hpp => memory-mapped-file.hpp} | 6 +- ...adata-version.cpp => metadata-version.cpp} | 28 +- ...-metadata-version.h => metadata-version.h} | 16 +- ...adata-version.hpp => metadata-version.hpp} | 6 +- ...io-output-stream.cpp => output-stream.cpp} | 18 +- .../{io-output-stream.h => output-stream.h} | 26 +- ...io-output-stream.hpp => output-stream.hpp} | 10 +- ...access-file.cpp => random-access-file.cpp} | 42 +-- ...dom-access-file.h => random-access-file.h} | 32 +- ...access-file.hpp => random-access-file.hpp} | 10 +- .../{io-readable.cpp => readable.cpp} | 26 +- .../arrow-glib/{io-readable.h => readable.h} | 28 +- .../{io-readable.hpp => readable.hpp} | 10 +- ...pc-stream-reader.cpp => stream-reader.cpp} | 90 +++--- c_glib/arrow-glib/stream-reader.h | 80 +++++ ...pc-stream-reader.hpp => stream-reader.hpp} | 6 +- ...pc-stream-writer.cpp => stream-writer.cpp} | 88 +++--- .../{ipc-stream-writer.h => stream-writer.h} | 50 +-- ...pc-stream-writer.hpp => stream-writer.hpp} | 6 +- ...-writeable-file.cpp => writeable-file.cpp} | 26 +- .../{io-writeable-file.h => writeable-file.h} | 28 +- ...-writeable-file.hpp => writeable-file.hpp} | 10 +- .../{io-writeable.cpp => writeable.cpp} | 34 +-- .../{io-writeable.h => writeable.h} | 30 +- .../{io-writeable.hpp => writeable.hpp} | 10 +- c_glib/doc/reference/Makefile.am | 2 +- c_glib/doc/reference/arrow-glib-docs.sgml | 62 ++-- c_glib/example/read-batch.c | 18 +- c_glib/example/read-stream.c | 16 +- ...t-stream.rb => test-file-output-stream.rb} | 6 +- ...ipc-file-writer.rb => test-file-writer.rb} | 10 +- ...ped-file.rb => test-memory-mapped-file.rb} | 22 +- ...stream-writer.rb => test-stream-writer.rb} | 10 +- 59 files changed, 1238 insertions(+), 1238 deletions(-) rename c_glib/arrow-glib/{io-file-mode.cpp => file-mode.cpp} (72%) rename c_glib/arrow-glib/{io-file-mode.h => file-mode.h} (78%) rename c_glib/arrow-glib/{io-file-mode.hpp => file-mode.hpp} (81%) rename c_glib/arrow-glib/{io-file-output-stream.cpp => file-output-stream.cpp} (52%) rename c_glib/arrow-glib/{io-file-output-stream.h => file-output-stream.h} (52%) rename c_glib/arrow-glib/{io-memory-mapped-file.hpp => 
file-output-stream.hpp} (73%) rename c_glib/arrow-glib/{ipc-file-reader.cpp => file-reader.cpp} (60%) create mode 100644 c_glib/arrow-glib/file-reader.h rename c_glib/arrow-glib/{ipc-file-writer.hpp => file-reader.hpp} (78%) rename c_glib/arrow-glib/{ipc-file-writer.cpp => file-writer.cpp} (68%) rename c_glib/arrow-glib/{ipc-file-writer.h => file-writer.h} (52%) rename c_glib/arrow-glib/{ipc-file-reader.hpp => file-writer.hpp} (77%) rename c_glib/arrow-glib/{io-file.cpp => file.cpp} (68%) rename c_glib/arrow-glib/{io-file.h => file.h} (55%) rename c_glib/arrow-glib/{io-file.hpp => file.hpp} (79%) rename c_glib/arrow-glib/{io-input-stream.cpp => input-stream.cpp} (71%) rename c_glib/arrow-glib/{io-input-stream.h => input-stream.h} (57%) rename c_glib/arrow-glib/{io-input-stream.hpp => input-stream.hpp} (76%) delete mode 100644 c_glib/arrow-glib/io-memory-mapped-file.cpp delete mode 100644 c_glib/arrow-glib/ipc-file-reader.h delete mode 100644 c_glib/arrow-glib/ipc-stream-reader.h create mode 100644 c_glib/arrow-glib/memory-mapped-file.cpp rename c_glib/arrow-glib/{io-memory-mapped-file.h => memory-mapped-file.h} (51%) rename c_glib/arrow-glib/{io-file-output-stream.hpp => memory-mapped-file.hpp} (73%) rename c_glib/arrow-glib/{ipc-metadata-version.cpp => metadata-version.cpp} (68%) rename c_glib/arrow-glib/{ipc-metadata-version.h => metadata-version.h} (76%) rename c_glib/arrow-glib/{ipc-metadata-version.hpp => metadata-version.hpp} (77%) rename c_glib/arrow-glib/{io-output-stream.cpp => output-stream.cpp} (71%) rename c_glib/arrow-glib/{io-output-stream.h => output-stream.h} (57%) rename c_glib/arrow-glib/{io-output-stream.hpp => output-stream.hpp} (75%) rename c_glib/arrow-glib/{io-random-access-file.cpp => random-access-file.cpp} (70%) rename c_glib/arrow-glib/{io-random-access-file.h => random-access-file.h} (57%) rename c_glib/arrow-glib/{io-random-access-file.hpp => random-access-file.hpp} (78%) rename c_glib/arrow-glib/{io-readable.cpp => readable.cpp} (75%) rename c_glib/arrow-glib/{io-readable.h => readable.h} (60%) rename c_glib/arrow-glib/{io-readable.hpp => readable.hpp} (77%) rename c_glib/arrow-glib/{ipc-stream-reader.cpp => stream-reader.cpp} (63%) create mode 100644 c_glib/arrow-glib/stream-reader.h rename c_glib/arrow-glib/{ipc-stream-reader.hpp => stream-reader.hpp} (75%) rename c_glib/arrow-glib/{ipc-stream-writer.cpp => stream-writer.cpp} (65%) rename c_glib/arrow-glib/{ipc-stream-writer.h => stream-writer.h} (54%) rename c_glib/arrow-glib/{ipc-stream-writer.hpp => stream-writer.hpp} (75%) rename c_glib/arrow-glib/{io-writeable-file.cpp => writeable-file.cpp} (73%) rename c_glib/arrow-glib/{io-writeable-file.h => writeable-file.h} (59%) rename c_glib/arrow-glib/{io-writeable-file.hpp => writeable-file.hpp} (75%) rename c_glib/arrow-glib/{io-writeable.cpp => writeable.cpp} (72%) rename c_glib/arrow-glib/{io-writeable.h => writeable.h} (58%) rename c_glib/arrow-glib/{io-writeable.hpp => writeable.hpp} (77%) rename c_glib/test/{test-io-file-output-stream.rb => test-file-output-stream.rb} (87%) rename c_glib/test/{test-ipc-file-writer.rb => test-file-writer.rb} (82%) rename c_glib/test/{test-io-memory-mapped-file.rb => test-memory-mapped-file.rb} (81%) rename c_glib/test/{test-ipc-stream-writer.rb => test-stream-writer.rb} (84%) diff --git a/c_glib/arrow-glib/Makefile.am b/c_glib/arrow-glib/Makefile.am index e719cccfa85ab..387707c7d5897 100644 --- a/c_glib/arrow-glib/Makefile.am +++ b/c_glib/arrow-glib/Makefile.am @@ -102,23 +102,23 @@ libarrow_glib_la_headers = \ 
uint64-data-type.h libarrow_glib_la_headers += \ - io-file.h \ - io-file-mode.h \ - io-file-output-stream.h \ - io-input-stream.h \ - io-memory-mapped-file.h \ - io-output-stream.h \ - io-random-access-file.h \ - io-readable.h \ - io-writeable.h \ - io-writeable-file.h + file.h \ + file-mode.h \ + file-output-stream.h \ + input-stream.h \ + memory-mapped-file.h \ + output-stream.h \ + random-access-file.h \ + readable.h \ + writeable.h \ + writeable-file.h libarrow_glib_la_headers += \ - ipc-file-reader.h \ - ipc-file-writer.h \ - ipc-stream-reader.h \ - ipc-stream-writer.h \ - ipc-metadata-version.h + file-reader.h \ + file-writer.h \ + stream-reader.h \ + stream-writer.h \ + metadata-version.h libarrow_glib_la_generated_headers = \ enums.h @@ -190,23 +190,23 @@ libarrow_glib_la_sources = \ $(libarrow_glib_la_generated_sources) libarrow_glib_la_sources += \ - io-file.cpp \ - io-file-mode.cpp \ - io-file-output-stream.cpp \ - io-input-stream.cpp \ - io-memory-mapped-file.cpp \ - io-output-stream.cpp \ - io-random-access-file.cpp \ - io-readable.cpp \ - io-writeable.cpp \ - io-writeable-file.cpp + file.cpp \ + file-mode.cpp \ + file-output-stream.cpp \ + input-stream.cpp \ + memory-mapped-file.cpp \ + output-stream.cpp \ + random-access-file.cpp \ + readable.cpp \ + writeable.cpp \ + writeable-file.cpp libarrow_glib_la_sources += \ - ipc-file-reader.cpp \ - ipc-file-writer.cpp \ - ipc-metadata-version.cpp \ - ipc-stream-reader.cpp \ - ipc-stream-writer.cpp + file-reader.cpp \ + file-writer.cpp \ + metadata-version.cpp \ + stream-reader.cpp \ + stream-writer.cpp libarrow_glib_la_cpp_headers = \ array.hpp \ @@ -223,23 +223,23 @@ libarrow_glib_la_cpp_headers = \ type.hpp libarrow_glib_la_cpp_headers += \ - io-file.hpp \ - io-file-mode.hpp \ - io-file-output-stream.hpp \ - io-input-stream.hpp \ - io-memory-mapped-file.hpp \ - io-output-stream.hpp \ - io-random-access-file.hpp \ - io-readable.hpp \ - io-writeable.hpp \ - io-writeable-file.hpp + file.hpp \ + file-mode.hpp \ + file-output-stream.hpp \ + input-stream.hpp \ + memory-mapped-file.hpp \ + output-stream.hpp \ + random-access-file.hpp \ + readable.hpp \ + writeable.hpp \ + writeable-file.hpp libarrow_glib_la_cpp_headers += \ - ipc-file-reader.hpp \ - ipc-file-writer.hpp \ - ipc-metadata-version.hpp \ - ipc-stream-reader.hpp \ - ipc-stream-writer.hpp + file-reader.hpp \ + file-writer.hpp \ + metadata-version.hpp \ + stream-reader.hpp \ + stream-writer.hpp libarrow_glib_la_SOURCES = \ $(libarrow_glib_la_sources) \ diff --git a/c_glib/arrow-glib/arrow-glib.h b/c_glib/arrow-glib/arrow-glib.h index 9b03175799f44..b15c56f7bb486 100644 --- a/c_glib/arrow-glib/arrow-glib.h +++ b/c_glib/arrow-glib/arrow-glib.h @@ -79,19 +79,19 @@ #include #include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include -#include -#include -#include -#include -#include +#include +#include +#include +#include +#include diff --git a/c_glib/arrow-glib/arrow-glib.hpp b/c_glib/arrow-glib/arrow-glib.hpp index fd59d4a1a9240..3404d4d212917 100644 --- a/c_glib/arrow-glib/arrow-glib.hpp +++ b/c_glib/arrow-glib/arrow-glib.hpp @@ -36,18 +36,18 @@ #include #include -#include -#include -#include -#include -#include -#include -#include -#include -#include +#include +#include +#include +#include +#include +#include +#include +#include +#include -#include -#include -#include -#include -#include +#include +#include +#include +#include 
+#include diff --git a/c_glib/arrow-glib/io-file-mode.cpp b/c_glib/arrow-glib/file-mode.cpp similarity index 72% rename from c_glib/arrow-glib/io-file-mode.cpp rename to c_glib/arrow-glib/file-mode.cpp index 7998d3f5bb061..1fb17062ab2c9 100644 --- a/c_glib/arrow-glib/io-file-mode.cpp +++ b/c_glib/arrow-glib/file-mode.cpp @@ -21,41 +21,41 @@ # include #endif -#include +#include /** - * SECTION: io-file-mode - * @title: GArrowIOFileMode + * SECTION: file-mode + * @title: GArrowFileMode * @short_description: File mode mapping between Arrow and arrow-glib * - * #GArrowIOFileMode provides file modes corresponding to + * #GArrowFileMode provides file modes corresponding to * `arrow::io::FileMode::type` values. */ -GArrowIOFileMode -garrow_io_file_mode_from_raw(arrow::io::FileMode::type mode) +GArrowFileMode +garrow_file_mode_from_raw(arrow::io::FileMode::type mode) { switch (mode) { case arrow::io::FileMode::type::READ: - return GARROW_IO_FILE_MODE_READ; + return GARROW_FILE_MODE_READ; case arrow::io::FileMode::type::WRITE: - return GARROW_IO_FILE_MODE_WRITE; + return GARROW_FILE_MODE_WRITE; case arrow::io::FileMode::type::READWRITE: - return GARROW_IO_FILE_MODE_READWRITE; + return GARROW_FILE_MODE_READWRITE; default: - return GARROW_IO_FILE_MODE_READ; + return GARROW_FILE_MODE_READ; } } arrow::io::FileMode::type -garrow_io_file_mode_to_raw(GArrowIOFileMode mode) +garrow_file_mode_to_raw(GArrowFileMode mode) { switch (mode) { - case GARROW_IO_FILE_MODE_READ: + case GARROW_FILE_MODE_READ: return arrow::io::FileMode::type::READ; - case GARROW_IO_FILE_MODE_WRITE: + case GARROW_FILE_MODE_WRITE: return arrow::io::FileMode::type::WRITE; - case GARROW_IO_FILE_MODE_READWRITE: + case GARROW_FILE_MODE_READWRITE: return arrow::io::FileMode::type::READWRITE; default: return arrow::io::FileMode::type::READ; diff --git a/c_glib/arrow-glib/io-file-mode.h b/c_glib/arrow-glib/file-mode.h similarity index 78% rename from c_glib/arrow-glib/io-file-mode.h rename to c_glib/arrow-glib/file-mode.h index 03eca353bbdbb..8812af805abd5 100644 --- a/c_glib/arrow-glib/io-file-mode.h +++ b/c_glib/arrow-glib/file-mode.h @@ -24,17 +24,17 @@ G_BEGIN_DECLS /** - * GArrowIOFileMode: - * @GARROW_IO_FILE_MODE_READ: For read. - * @GARROW_IO_FILE_MODE_WRITE: For write. - * @GARROW_IO_FILE_MODE_READWRITE: For read-write. + * GArrowFileMode: + * @GARROW_FILE_MODE_READ: For read. + * @GARROW_FILE_MODE_WRITE: For write. + * @GARROW_FILE_MODE_READWRITE: For read-write. * * They are corresponding to `arrow::io::FileMode::type` values. 
*/ typedef enum { - GARROW_IO_FILE_MODE_READ, - GARROW_IO_FILE_MODE_WRITE, - GARROW_IO_FILE_MODE_READWRITE -} GArrowIOFileMode; + GARROW_FILE_MODE_READ, + GARROW_FILE_MODE_WRITE, + GARROW_FILE_MODE_READWRITE +} GArrowFileMode; G_END_DECLS diff --git a/c_glib/arrow-glib/io-file-mode.hpp b/c_glib/arrow-glib/file-mode.hpp similarity index 81% rename from c_glib/arrow-glib/io-file-mode.hpp rename to c_glib/arrow-glib/file-mode.hpp index b3d8ac6d8e053..2b67379421d5a 100644 --- a/c_glib/arrow-glib/io-file-mode.hpp +++ b/c_glib/arrow-glib/file-mode.hpp @@ -21,7 +21,7 @@ #include -#include +#include -GArrowIOFileMode garrow_io_file_mode_from_raw(arrow::io::FileMode::type mode); -arrow::io::FileMode::type garrow_io_file_mode_to_raw(GArrowIOFileMode mode); +GArrowFileMode garrow_file_mode_from_raw(arrow::io::FileMode::type mode); +arrow::io::FileMode::type garrow_file_mode_to_raw(GArrowFileMode mode); diff --git a/c_glib/arrow-glib/io-file-output-stream.cpp b/c_glib/arrow-glib/file-output-stream.cpp similarity index 52% rename from c_glib/arrow-glib/io-file-output-stream.cpp rename to c_glib/arrow-glib/file-output-stream.cpp index 673e8cd36a60a..b6ca42a1d59da 100644 --- a/c_glib/arrow-glib/io-file-output-stream.cpp +++ b/c_glib/arrow-glib/file-output-stream.cpp @@ -24,23 +24,23 @@ #include #include -#include -#include -#include -#include +#include +#include +#include +#include G_BEGIN_DECLS /** - * SECTION: io-file-output-stream + * SECTION: file-output-stream * @short_description: A file output stream. * - * The #GArrowIOFileOutputStream is a class for file output stream. + * The #GArrowFileOutputStream is a class for file output stream. */ -typedef struct GArrowIOFileOutputStreamPrivate_ { +typedef struct GArrowFileOutputStreamPrivate_ { std::shared_ptr file_output_stream; -} GArrowIOFileOutputStreamPrivate; +} GArrowFileOutputStreamPrivate; enum { PROP_0, @@ -48,87 +48,87 @@ enum { }; static std::shared_ptr -garrow_io_file_output_stream_get_raw_file_interface(GArrowIOFile *file) +garrow_file_output_stream_get_raw_file_interface(GArrowFile *file) { - auto file_output_stream = GARROW_IO_FILE_OUTPUT_STREAM(file); + auto file_output_stream = GARROW_FILE_OUTPUT_STREAM(file); auto arrow_file_output_stream = - garrow_io_file_output_stream_get_raw(file_output_stream); + garrow_file_output_stream_get_raw(file_output_stream); return arrow_file_output_stream; } static void -garrow_io_file_interface_init(GArrowIOFileInterface *iface) +garrow_file_interface_init(GArrowFileInterface *iface) { - iface->get_raw = garrow_io_file_output_stream_get_raw_file_interface; + iface->get_raw = garrow_file_output_stream_get_raw_file_interface; } static std::shared_ptr -garrow_io_file_output_stream_get_raw_writeable_interface(GArrowIOWriteable *writeable) +garrow_file_output_stream_get_raw_writeable_interface(GArrowWriteable *writeable) { - auto file_output_stream = GARROW_IO_FILE_OUTPUT_STREAM(writeable); + auto file_output_stream = GARROW_FILE_OUTPUT_STREAM(writeable); auto arrow_file_output_stream = - garrow_io_file_output_stream_get_raw(file_output_stream); + garrow_file_output_stream_get_raw(file_output_stream); return arrow_file_output_stream; } static void -garrow_io_writeable_interface_init(GArrowIOWriteableInterface *iface) +garrow_writeable_interface_init(GArrowWriteableInterface *iface) { - iface->get_raw = garrow_io_file_output_stream_get_raw_writeable_interface; + iface->get_raw = garrow_file_output_stream_get_raw_writeable_interface; } static std::shared_ptr 
-garrow_io_file_output_stream_get_raw_output_stream_interface(GArrowIOOutputStream *output_stream) +garrow_file_output_stream_get_raw_output_stream_interface(GArrowOutputStream *output_stream) { - auto file_output_stream = GARROW_IO_FILE_OUTPUT_STREAM(output_stream); + auto file_output_stream = GARROW_FILE_OUTPUT_STREAM(output_stream); auto arrow_file_output_stream = - garrow_io_file_output_stream_get_raw(file_output_stream); + garrow_file_output_stream_get_raw(file_output_stream); return arrow_file_output_stream; } static void -garrow_io_output_stream_interface_init(GArrowIOOutputStreamInterface *iface) +garrow_output_stream_interface_init(GArrowOutputStreamInterface *iface) { - iface->get_raw = garrow_io_file_output_stream_get_raw_output_stream_interface; + iface->get_raw = garrow_file_output_stream_get_raw_output_stream_interface; } -G_DEFINE_TYPE_WITH_CODE(GArrowIOFileOutputStream, - garrow_io_file_output_stream, +G_DEFINE_TYPE_WITH_CODE(GArrowFileOutputStream, + garrow_file_output_stream, G_TYPE_OBJECT, - G_ADD_PRIVATE(GArrowIOFileOutputStream) - G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_FILE, - garrow_io_file_interface_init) - G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_WRITEABLE, - garrow_io_writeable_interface_init) - G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_OUTPUT_STREAM, - garrow_io_output_stream_interface_init)); - -#define GARROW_IO_FILE_OUTPUT_STREAM_GET_PRIVATE(obj) \ + G_ADD_PRIVATE(GArrowFileOutputStream) + G_IMPLEMENT_INTERFACE(GARROW_TYPE_FILE, + garrow_file_interface_init) + G_IMPLEMENT_INTERFACE(GARROW_TYPE_WRITEABLE, + garrow_writeable_interface_init) + G_IMPLEMENT_INTERFACE(GARROW_TYPE_OUTPUT_STREAM, + garrow_output_stream_interface_init)); + +#define GARROW_FILE_OUTPUT_STREAM_GET_PRIVATE(obj) \ (G_TYPE_INSTANCE_GET_PRIVATE((obj), \ - GARROW_IO_TYPE_FILE_OUTPUT_STREAM, \ - GArrowIOFileOutputStreamPrivate)) + GARROW_TYPE_FILE_OUTPUT_STREAM, \ + GArrowFileOutputStreamPrivate)) static void -garrow_io_file_output_stream_finalize(GObject *object) +garrow_file_output_stream_finalize(GObject *object) { - GArrowIOFileOutputStreamPrivate *priv; + GArrowFileOutputStreamPrivate *priv; - priv = GARROW_IO_FILE_OUTPUT_STREAM_GET_PRIVATE(object); + priv = GARROW_FILE_OUTPUT_STREAM_GET_PRIVATE(object); priv->file_output_stream = nullptr; - G_OBJECT_CLASS(garrow_io_file_output_stream_parent_class)->finalize(object); + G_OBJECT_CLASS(garrow_file_output_stream_parent_class)->finalize(object); } static void -garrow_io_file_output_stream_set_property(GObject *object, +garrow_file_output_stream_set_property(GObject *object, guint prop_id, const GValue *value, GParamSpec *pspec) { - GArrowIOFileOutputStreamPrivate *priv; + GArrowFileOutputStreamPrivate *priv; - priv = GARROW_IO_FILE_OUTPUT_STREAM_GET_PRIVATE(object); + priv = GARROW_FILE_OUTPUT_STREAM_GET_PRIVATE(object); switch (prop_id) { case PROP_FILE_OUTPUT_STREAM: @@ -142,7 +142,7 @@ garrow_io_file_output_stream_set_property(GObject *object, } static void -garrow_io_file_output_stream_get_property(GObject *object, +garrow_file_output_stream_get_property(GObject *object, guint prop_id, GValue *value, GParamSpec *pspec) @@ -155,21 +155,21 @@ garrow_io_file_output_stream_get_property(GObject *object, } static void -garrow_io_file_output_stream_init(GArrowIOFileOutputStream *object) +garrow_file_output_stream_init(GArrowFileOutputStream *object) { } static void -garrow_io_file_output_stream_class_init(GArrowIOFileOutputStreamClass *klass) +garrow_file_output_stream_class_init(GArrowFileOutputStreamClass *klass) { GObjectClass *gobject_class; GParamSpec 
*spec; gobject_class = G_OBJECT_CLASS(klass); - gobject_class->finalize = garrow_io_file_output_stream_finalize; - gobject_class->set_property = garrow_io_file_output_stream_set_property; - gobject_class->get_property = garrow_io_file_output_stream_get_property; + gobject_class->finalize = garrow_file_output_stream_finalize; + gobject_class->set_property = garrow_file_output_stream_set_property; + gobject_class->get_property = garrow_file_output_stream_get_property; spec = g_param_spec_pointer("file-output-stream", "io::FileOutputStream", @@ -180,16 +180,16 @@ garrow_io_file_output_stream_class_init(GArrowIOFileOutputStreamClass *klass) } /** - * garrow_io_file_output_stream_open: + * garrow_file_output_stream_open: * @path: The path of the file output stream. * @append: Whether the path is opened as append mode or recreate mode. * @error: (nullable): Return location for a #GError or %NULL. * * Returns: (nullable) (transfer full): A newly opened - * #GArrowIOFileOutputStream or %NULL on error. + * #GArrowFileOutputStream or %NULL on error. */ -GArrowIOFileOutputStream * -garrow_io_file_output_stream_open(const gchar *path, +GArrowFileOutputStream * +garrow_file_output_stream_open(const gchar *path, gboolean append, GError **error) { @@ -199,7 +199,7 @@ garrow_io_file_output_stream_open(const gchar *path, append, &arrow_file_output_stream); if (status.ok()) { - return garrow_io_file_output_stream_new_raw(&arrow_file_output_stream); + return garrow_file_output_stream_new_raw(&arrow_file_output_stream); } else { std::string context("[io][file-output-stream][open]: <"); context += path; @@ -211,21 +211,21 @@ garrow_io_file_output_stream_open(const gchar *path, G_END_DECLS -GArrowIOFileOutputStream * -garrow_io_file_output_stream_new_raw(std::shared_ptr *arrow_file_output_stream) +GArrowFileOutputStream * +garrow_file_output_stream_new_raw(std::shared_ptr *arrow_file_output_stream) { auto file_output_stream = - GARROW_IO_FILE_OUTPUT_STREAM(g_object_new(GARROW_IO_TYPE_FILE_OUTPUT_STREAM, + GARROW_FILE_OUTPUT_STREAM(g_object_new(GARROW_TYPE_FILE_OUTPUT_STREAM, "file-output-stream", arrow_file_output_stream, NULL)); return file_output_stream; } std::shared_ptr -garrow_io_file_output_stream_get_raw(GArrowIOFileOutputStream *file_output_stream) +garrow_file_output_stream_get_raw(GArrowFileOutputStream *file_output_stream) { - GArrowIOFileOutputStreamPrivate *priv; + GArrowFileOutputStreamPrivate *priv; - priv = GARROW_IO_FILE_OUTPUT_STREAM_GET_PRIVATE(file_output_stream); + priv = GARROW_FILE_OUTPUT_STREAM_GET_PRIVATE(file_output_stream); return priv->file_output_stream; } diff --git a/c_glib/arrow-glib/io-file-output-stream.h b/c_glib/arrow-glib/file-output-stream.h similarity index 52% rename from c_glib/arrow-glib/io-file-output-stream.h rename to c_glib/arrow-glib/file-output-stream.h index 032b125544e77..bef3700039921 100644 --- a/c_glib/arrow-glib/io-file-output-stream.h +++ b/c_glib/arrow-glib/file-output-stream.h @@ -23,49 +23,49 @@ G_BEGIN_DECLS -#define GARROW_IO_TYPE_FILE_OUTPUT_STREAM \ - (garrow_io_file_output_stream_get_type()) -#define GARROW_IO_FILE_OUTPUT_STREAM(obj) \ +#define GARROW_TYPE_FILE_OUTPUT_STREAM \ + (garrow_file_output_stream_get_type()) +#define GARROW_FILE_OUTPUT_STREAM(obj) \ (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_IO_TYPE_FILE_OUTPUT_STREAM, \ - GArrowIOFileOutputStream)) -#define GARROW_IO_FILE_OUTPUT_STREAM_CLASS(klass) \ + GARROW_TYPE_FILE_OUTPUT_STREAM, \ + GArrowFileOutputStream)) +#define GARROW_FILE_OUTPUT_STREAM_CLASS(klass) \ 
(G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_IO_TYPE_FILE_OUTPUT_STREAM, \ - GArrowIOFileOutputStreamClass)) -#define GARROW_IO_IS_FILE_OUTPUT_STREAM(obj) \ + GARROW_TYPE_FILE_OUTPUT_STREAM, \ + GArrowFileOutputStreamClass)) +#define GARROW_IS_FILE_OUTPUT_STREAM(obj) \ (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_IO_TYPE_FILE_OUTPUT_STREAM)) -#define GARROW_IO_IS_FILE_OUTPUT_STREAM_CLASS(klass) \ + GARROW_TYPE_FILE_OUTPUT_STREAM)) +#define GARROW_IS_FILE_OUTPUT_STREAM_CLASS(klass) \ (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_IO_TYPE_FILE_OUTPUT_STREAM)) -#define GARROW_IO_FILE_OUTPUT_STREAM_GET_CLASS(obj) \ + GARROW_TYPE_FILE_OUTPUT_STREAM)) +#define GARROW_FILE_OUTPUT_STREAM_GET_CLASS(obj) \ (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_IO_TYPE_FILE_OUTPUT_STREAM, \ - GArrowIOFileOutputStreamClass)) + GARROW_TYPE_FILE_OUTPUT_STREAM, \ + GArrowFileOutputStreamClass)) -typedef struct _GArrowIOFileOutputStream GArrowIOFileOutputStream; -typedef struct _GArrowIOFileOutputStreamClass GArrowIOFileOutputStreamClass; +typedef struct _GArrowFileOutputStream GArrowFileOutputStream; +typedef struct _GArrowFileOutputStreamClass GArrowFileOutputStreamClass; /** - * GArrowIOFileOutputStream: + * GArrowFileOutputStream: * * It wraps `arrow::io::FileOutputStream`. */ -struct _GArrowIOFileOutputStream +struct _GArrowFileOutputStream { /*< private >*/ GObject parent_instance; }; -struct _GArrowIOFileOutputStreamClass +struct _GArrowFileOutputStreamClass { GObjectClass parent_class; }; -GType garrow_io_file_output_stream_get_type(void) G_GNUC_CONST; +GType garrow_file_output_stream_get_type(void) G_GNUC_CONST; -GArrowIOFileOutputStream *garrow_io_file_output_stream_open(const gchar *path, +GArrowFileOutputStream *garrow_file_output_stream_open(const gchar *path, gboolean append, GError **error); diff --git a/c_glib/arrow-glib/io-memory-mapped-file.hpp b/c_glib/arrow-glib/file-output-stream.hpp similarity index 73% rename from c_glib/arrow-glib/io-memory-mapped-file.hpp rename to c_glib/arrow-glib/file-output-stream.hpp index b48e05f2f9e7b..0b10418cdf176 100644 --- a/c_glib/arrow-glib/io-memory-mapped-file.hpp +++ b/c_glib/arrow-glib/file-output-stream.hpp @@ -22,7 +22,7 @@ #include #include -#include +#include -GArrowIOMemoryMappedFile *garrow_io_memory_mapped_file_new_raw(std::shared_ptr *arrow_memory_mapped_file); -std::shared_ptr garrow_io_memory_mapped_file_get_raw(GArrowIOMemoryMappedFile *memory_mapped_file); +GArrowFileOutputStream *garrow_file_output_stream_new_raw(std::shared_ptr *arrow_file_output_stream); +std::shared_ptr garrow_file_output_stream_get_raw(GArrowFileOutputStream *file_output_stream); diff --git a/c_glib/arrow-glib/ipc-file-reader.cpp b/c_glib/arrow-glib/file-reader.cpp similarity index 60% rename from c_glib/arrow-glib/ipc-file-reader.cpp rename to c_glib/arrow-glib/file-reader.cpp index 223be857d9beb..c2aeabe5eed21 100644 --- a/c_glib/arrow-glib/ipc-file-reader.cpp +++ b/c_glib/arrow-glib/file-reader.cpp @@ -27,59 +27,59 @@ #include #include -#include +#include -#include -#include +#include +#include G_BEGIN_DECLS /** - * SECTION: ipc-file-reader + * SECTION: file-reader * @short_description: File reader class * - * #GArrowIPCFileReader is a class for receiving data by file based IPC. + * #GArrowFileReader is a class for receiving data by file based IPC. 
 */

-typedef struct GArrowIPCFileReaderPrivate_ {
+typedef struct GArrowFileReaderPrivate_ {
   std::shared_ptr<arrow::ipc::FileReader> file_reader;
-} GArrowIPCFileReaderPrivate;
+} GArrowFileReaderPrivate;

 enum {
   PROP_0,
   PROP_FILE_READER
 };

-G_DEFINE_TYPE_WITH_PRIVATE(GArrowIPCFileReader,
-                           garrow_ipc_file_reader,
+G_DEFINE_TYPE_WITH_PRIVATE(GArrowFileReader,
+                           garrow_file_reader,
                            G_TYPE_OBJECT);

-#define GARROW_IPC_FILE_READER_GET_PRIVATE(obj)         \
+#define GARROW_FILE_READER_GET_PRIVATE(obj)             \
   (G_TYPE_INSTANCE_GET_PRIVATE((obj),                   \
-                               GARROW_IPC_TYPE_FILE_READER,    \
-                               GArrowIPCFileReaderPrivate))
+                               GARROW_TYPE_FILE_READER,        \
+                               GArrowFileReaderPrivate))

 static void
-garrow_ipc_file_reader_finalize(GObject *object)
+garrow_file_reader_finalize(GObject *object)
 {
-  GArrowIPCFileReaderPrivate *priv;
+  GArrowFileReaderPrivate *priv;

-  priv = GARROW_IPC_FILE_READER_GET_PRIVATE(object);
+  priv = GARROW_FILE_READER_GET_PRIVATE(object);

   priv->file_reader = nullptr;

-  G_OBJECT_CLASS(garrow_ipc_file_reader_parent_class)->finalize(object);
+  G_OBJECT_CLASS(garrow_file_reader_parent_class)->finalize(object);
 }

 static void
-garrow_ipc_file_reader_set_property(GObject *object,
+garrow_file_reader_set_property(GObject *object,
                                     guint prop_id,
                                     const GValue *value,
                                     GParamSpec *pspec)
 {
-  GArrowIPCFileReaderPrivate *priv;
+  GArrowFileReaderPrivate *priv;

-  priv = GARROW_IPC_FILE_READER_GET_PRIVATE(object);
+  priv = GARROW_FILE_READER_GET_PRIVATE(object);

   switch (prop_id) {
   case PROP_FILE_READER:
@@ -93,7 +93,7 @@ garrow_ipc_file_reader_set_property(GObject *object,
 }

 static void
-garrow_ipc_file_reader_get_property(GObject *object,
+garrow_file_reader_get_property(GObject *object,
                                     guint prop_id,
                                     GValue *value,
                                     GParamSpec *pspec)
@@ -106,21 +106,21 @@ garrow_ipc_file_reader_get_property(GObject *object,
 }

 static void
-garrow_ipc_file_reader_init(GArrowIPCFileReader *object)
+garrow_file_reader_init(GArrowFileReader *object)
 {
 }

 static void
-garrow_ipc_file_reader_class_init(GArrowIPCFileReaderClass *klass)
+garrow_file_reader_class_init(GArrowFileReaderClass *klass)
 {
   GObjectClass *gobject_class;
   GParamSpec *spec;

   gobject_class = G_OBJECT_CLASS(klass);

-  gobject_class->finalize = garrow_ipc_file_reader_finalize;
-  gobject_class->set_property = garrow_ipc_file_reader_set_property;
-  gobject_class->get_property = garrow_ipc_file_reader_get_property;
+  gobject_class->finalize = garrow_file_reader_finalize;
+  gobject_class->set_property = garrow_file_reader_set_property;
+  gobject_class->get_property = garrow_file_reader_get_property;

   spec = g_param_spec_pointer("file-reader",
                               "ipc::FileReader",
@@ -131,23 +131,23 @@ garrow_ipc_file_reader_class_init(GArrowIPCFileReaderClass *klass)
 }

 /**
- * garrow_ipc_file_reader_open:
+ * garrow_file_reader_open:
  * @file: The file to be read.
  * @error: (nullable): Return location for a #GError or %NULL.
  *
  * Returns: (nullable) (transfer full): A newly opened
- * #GArrowIPCFileReader or %NULL on error.
+ * #GArrowFileReader or %NULL on error.
 */
-GArrowIPCFileReader *
-garrow_ipc_file_reader_open(GArrowIORandomAccessFile *file,
+GArrowFileReader *
+garrow_file_reader_open(GArrowRandomAccessFile *file,
                             GError **error)
 {
   std::shared_ptr<arrow::ipc::FileReader> arrow_file_reader;
   auto status =
-    arrow::ipc::FileReader::Open(garrow_io_random_access_file_get_raw(file),
+    arrow::ipc::FileReader::Open(garrow_random_access_file_get_raw(file),
                                  &arrow_file_reader);
   if (status.ok()) {
-    return garrow_ipc_file_reader_new_raw(&arrow_file_reader);
+    return garrow_file_reader_new_raw(&arrow_file_reader);
   } else {
     garrow_error_set(error, status, "[ipc][file-reader][open]");
     return NULL;
@@ -155,52 +155,52 @@ garrow_ipc_file_reader_open(GArrowIORandomAccessFile *file,
 }

 /**
- * garrow_ipc_file_reader_get_schema:
- * @file_reader: A #GArrowIPCFileReader.
+ * garrow_file_reader_get_schema:
+ * @file_reader: A #GArrowFileReader.
  *
  * Returns: (transfer full): The schema in the file.
  */
 GArrowSchema *
-garrow_ipc_file_reader_get_schema(GArrowIPCFileReader *file_reader)
+garrow_file_reader_get_schema(GArrowFileReader *file_reader)
 {
   auto arrow_file_reader =
-    garrow_ipc_file_reader_get_raw(file_reader);
+    garrow_file_reader_get_raw(file_reader);
   auto arrow_schema = arrow_file_reader->schema();
   return garrow_schema_new_raw(&arrow_schema);
 }

 /**
- * garrow_ipc_file_reader_get_n_record_batches:
- * @file_reader: A #GArrowIPCFileReader.
+ * garrow_file_reader_get_n_record_batches:
+ * @file_reader: A #GArrowFileReader.
  *
  * Returns: The number of record batches in the file.
  */
 guint
-garrow_ipc_file_reader_get_n_record_batches(GArrowIPCFileReader *file_reader)
+garrow_file_reader_get_n_record_batches(GArrowFileReader *file_reader)
 {
   auto arrow_file_reader =
-    garrow_ipc_file_reader_get_raw(file_reader);
+    garrow_file_reader_get_raw(file_reader);
   return arrow_file_reader->num_record_batches();
 }

 /**
- * garrow_ipc_file_reader_get_version:
- * @file_reader: A #GArrowIPCFileReader.
+ * garrow_file_reader_get_version:
+ * @file_reader: A #GArrowFileReader.
  *
  * Returns: The format version in the file.
  */
-GArrowIPCMetadataVersion
-garrow_ipc_file_reader_get_version(GArrowIPCFileReader *file_reader)
+GArrowMetadataVersion
+garrow_file_reader_get_version(GArrowFileReader *file_reader)
 {
   auto arrow_file_reader =
-    garrow_ipc_file_reader_get_raw(file_reader);
+    garrow_file_reader_get_raw(file_reader);
   auto arrow_version = arrow_file_reader->version();
-  return garrow_ipc_metadata_version_from_raw(arrow_version);
+  return garrow_metadata_version_from_raw(arrow_version);
 }

 /**
- * garrow_ipc_file_reader_get_record_batch:
- * @file_reader: A #GArrowIPCFileReader.
+ * garrow_file_reader_get_record_batch:
+ * @file_reader: A #GArrowFileReader.
  * @i: The index of the target record batch.
  * @error: (nullable): Return location for a #GError or %NULL.
  *
  * Returns: (nullable) (transfer full):
@@ -208,12 +208,12 @@ garrow_ipc_file_reader_get_version(GArrowIPCFileReader *file_reader)
  * The i-th record batch in the file or %NULL on error.
*/ GArrowRecordBatch * -garrow_ipc_file_reader_get_record_batch(GArrowIPCFileReader *file_reader, +garrow_file_reader_get_record_batch(GArrowFileReader *file_reader, guint i, GError **error) { auto arrow_file_reader = - garrow_ipc_file_reader_get_raw(file_reader); + garrow_file_reader_get_raw(file_reader); std::shared_ptr arrow_record_batch; auto status = arrow_file_reader->GetRecordBatch(i, &arrow_record_batch); @@ -227,21 +227,21 @@ garrow_ipc_file_reader_get_record_batch(GArrowIPCFileReader *file_reader, G_END_DECLS -GArrowIPCFileReader * -garrow_ipc_file_reader_new_raw(std::shared_ptr *arrow_file_reader) +GArrowFileReader * +garrow_file_reader_new_raw(std::shared_ptr *arrow_file_reader) { auto file_reader = - GARROW_IPC_FILE_READER(g_object_new(GARROW_IPC_TYPE_FILE_READER, + GARROW_FILE_READER(g_object_new(GARROW_TYPE_FILE_READER, "file-reader", arrow_file_reader, NULL)); return file_reader; } std::shared_ptr -garrow_ipc_file_reader_get_raw(GArrowIPCFileReader *file_reader) +garrow_file_reader_get_raw(GArrowFileReader *file_reader) { - GArrowIPCFileReaderPrivate *priv; + GArrowFileReaderPrivate *priv; - priv = GARROW_IPC_FILE_READER_GET_PRIVATE(file_reader); + priv = GARROW_FILE_READER_GET_PRIVATE(file_reader); return priv->file_reader; } diff --git a/c_glib/arrow-glib/file-reader.h b/c_glib/arrow-glib/file-reader.h new file mode 100644 index 0000000000000..084f7148ed903 --- /dev/null +++ b/c_glib/arrow-glib/file-reader.h @@ -0,0 +1,83 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include +#include + +#include + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_FILE_READER \ + (garrow_file_reader_get_type()) +#define GARROW_FILE_READER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_FILE_READER, \ + GArrowFileReader)) +#define GARROW_FILE_READER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_FILE_READER, \ + GArrowFileReaderClass)) +#define GARROW_IS_FILE_READER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_FILE_READER)) +#define GARROW_IS_FILE_READER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_FILE_READER)) +#define GARROW_FILE_READER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_FILE_READER, \ + GArrowFileReaderClass)) + +typedef struct _GArrowFileReader GArrowFileReader; +typedef struct _GArrowFileReaderClass GArrowFileReaderClass; + +/** + * GArrowFileReader: + * + * It wraps `arrow::ipc::FileReader`. 
+ */
+struct _GArrowFileReader
+{
+  /*< private >*/
+  GObject parent_instance;
+};
+
+struct _GArrowFileReaderClass
+{
+  GObjectClass parent_class;
+};
+
+GType garrow_file_reader_get_type(void) G_GNUC_CONST;
+
+GArrowFileReader *garrow_file_reader_open(GArrowRandomAccessFile *file,
+                                          GError **error);
+
+GArrowSchema *garrow_file_reader_get_schema(GArrowFileReader *file_reader);
+guint garrow_file_reader_get_n_record_batches(GArrowFileReader *file_reader);
+GArrowMetadataVersion garrow_file_reader_get_version(GArrowFileReader *file_reader);
+GArrowRecordBatch *garrow_file_reader_get_record_batch(GArrowFileReader *file_reader,
+                                                       guint i,
+                                                       GError **error);
+
+G_END_DECLS
diff --git a/c_glib/arrow-glib/ipc-file-writer.hpp b/c_glib/arrow-glib/file-reader.hpp
similarity index 78%
rename from c_glib/arrow-glib/ipc-file-writer.hpp
rename to c_glib/arrow-glib/file-reader.hpp
index b8ae1137a99ad..152379bbda4ff 100644
--- a/c_glib/arrow-glib/ipc-file-writer.hpp
+++ b/c_glib/arrow-glib/file-reader.hpp
@@ -22,7 +22,7 @@
 #include
 #include

-#include
+#include

-GArrowIPCFileWriter *garrow_ipc_file_writer_new_raw(std::shared_ptr<arrow::ipc::FileWriter> *arrow_file_writer);
-arrow::ipc::FileWriter *garrow_ipc_file_writer_get_raw(GArrowIPCFileWriter *file_writer);
+GArrowFileReader *garrow_file_reader_new_raw(std::shared_ptr<arrow::ipc::FileReader> *arrow_file_reader);
+std::shared_ptr<arrow::ipc::FileReader> garrow_file_reader_get_raw(GArrowFileReader *file_reader);
diff --git a/c_glib/arrow-glib/ipc-file-writer.cpp b/c_glib/arrow-glib/file-writer.cpp
similarity index 68%
rename from c_glib/arrow-glib/ipc-file-writer.cpp
rename to c_glib/arrow-glib/file-writer.cpp
index d8b3c2e72fa31..68eca2edf77c1 100644
--- a/c_glib/arrow-glib/ipc-file-writer.cpp
+++ b/c_glib/arrow-glib/file-writer.cpp
@@ -28,55 +28,55 @@
 #include
 #include

-#include
+#include

-#include
-#include
+#include
+#include

 G_BEGIN_DECLS

 /**
- * SECTION: ipc-file-writer
+ * SECTION: file-writer
  * @short_description: File writer class
  *
- * #GArrowIPCFileWriter is a class for sending data by file based IPC.
+ * #GArrowFileWriter is a class for sending data by file based IPC.
  */

-G_DEFINE_TYPE(GArrowIPCFileWriter,
-              garrow_ipc_file_writer,
-              GARROW_IPC_TYPE_STREAM_WRITER);
+G_DEFINE_TYPE(GArrowFileWriter,
+              garrow_file_writer,
+              GARROW_TYPE_STREAM_WRITER);

 static void
-garrow_ipc_file_writer_init(GArrowIPCFileWriter *object)
+garrow_file_writer_init(GArrowFileWriter *object)
 {
 }

 static void
-garrow_ipc_file_writer_class_init(GArrowIPCFileWriterClass *klass)
+garrow_file_writer_class_init(GArrowFileWriterClass *klass)
 {
 }

 /**
- * garrow_ipc_file_writer_open:
+ * garrow_file_writer_open:
  * @sink: The output of the writer.
  * @schema: The schema of the writer.
  * @error: (nullable): Return location for a #GError or %NULL.
  *
  * Returns: (nullable) (transfer full): A newly opened
- * #GArrowIPCFileWriter or %NULL on error.
+ * #GArrowFileWriter or %NULL on error.
 */
-GArrowIPCFileWriter *
-garrow_ipc_file_writer_open(GArrowIOOutputStream *sink,
+GArrowFileWriter *
+garrow_file_writer_open(GArrowOutputStream *sink,
                             GArrowSchema *schema,
                             GError **error)
 {
   std::shared_ptr<arrow::ipc::FileWriter> arrow_file_writer;
   auto status =
-    arrow::ipc::FileWriter::Open(garrow_io_output_stream_get_raw(sink).get(),
+    arrow::ipc::FileWriter::Open(garrow_output_stream_get_raw(sink).get(),
                                  garrow_schema_get_raw(schema),
                                  &arrow_file_writer);
   if (status.ok()) {
-    return garrow_ipc_file_writer_new_raw(&arrow_file_writer);
+    return garrow_file_writer_new_raw(&arrow_file_writer);
   } else {
     garrow_error_set(error, status, "[ipc][file-writer][open]");
     return NULL;
@@ -84,20 +84,20 @@ garrow_ipc_file_writer_open(GArrowIOOutputStream *sink,
 }

 /**
- * garrow_ipc_file_writer_write_record_batch:
- * @file_writer: A #GArrowIPCFileWriter.
+ * garrow_file_writer_write_record_batch:
+ * @file_writer: A #GArrowFileWriter.
  * @record_batch: The record batch to be written.
  * @error: (nullable): Return location for a #GError or %NULL.
  *
  * Returns: %TRUE on success, %FALSE if there was an error.
  */
 gboolean
-garrow_ipc_file_writer_write_record_batch(GArrowIPCFileWriter *file_writer,
+garrow_file_writer_write_record_batch(GArrowFileWriter *file_writer,
                                           GArrowRecordBatch *record_batch,
                                           GError **error)
 {
   auto arrow_file_writer =
-    garrow_ipc_file_writer_get_raw(file_writer);
+    garrow_file_writer_get_raw(file_writer);
   auto arrow_record_batch = garrow_record_batch_get_raw(record_batch);
   auto arrow_record_batch_raw =
@@ -113,18 +113,18 @@ garrow_ipc_file_writer_write_record_batch(GArrowIPCFileWriter *file_writer,
 }

 /**
- * garrow_ipc_file_writer_close:
- * @file_writer: A #GArrowIPCFileWriter.
+ * garrow_file_writer_close:
+ * @file_writer: A #GArrowFileWriter.
  * @error: (nullable): Return location for a #GError or %NULL.
  *
  * Returns: %TRUE on success, %FALSE if there was an error.
*/ gboolean -garrow_ipc_file_writer_close(GArrowIPCFileWriter *file_writer, +garrow_file_writer_close(GArrowFileWriter *file_writer, GError **error) { auto arrow_file_writer = - garrow_ipc_file_writer_get_raw(file_writer); + garrow_file_writer_get_raw(file_writer); auto status = arrow_file_writer->Close(); if (status.ok()) { @@ -137,21 +137,21 @@ garrow_ipc_file_writer_close(GArrowIPCFileWriter *file_writer, G_END_DECLS -GArrowIPCFileWriter * -garrow_ipc_file_writer_new_raw(std::shared_ptr *arrow_file_writer) +GArrowFileWriter * +garrow_file_writer_new_raw(std::shared_ptr *arrow_file_writer) { auto file_writer = - GARROW_IPC_FILE_WRITER(g_object_new(GARROW_IPC_TYPE_FILE_WRITER, + GARROW_FILE_WRITER(g_object_new(GARROW_TYPE_FILE_WRITER, "stream-writer", arrow_file_writer, NULL)); return file_writer; } arrow::ipc::FileWriter * -garrow_ipc_file_writer_get_raw(GArrowIPCFileWriter *file_writer) +garrow_file_writer_get_raw(GArrowFileWriter *file_writer) { auto arrow_stream_writer = - garrow_ipc_stream_writer_get_raw(GARROW_IPC_STREAM_WRITER(file_writer)); + garrow_stream_writer_get_raw(GARROW_STREAM_WRITER(file_writer)); auto arrow_file_writer_raw = dynamic_cast(arrow_stream_writer.get()); return arrow_file_writer_raw; diff --git a/c_glib/arrow-glib/ipc-file-writer.h b/c_glib/arrow-glib/file-writer.h similarity index 52% rename from c_glib/arrow-glib/ipc-file-writer.h rename to c_glib/arrow-glib/file-writer.h index 732d9426aec8e..7f9a4f0399454 100644 --- a/c_glib/arrow-glib/ipc-file-writer.h +++ b/c_glib/arrow-glib/file-writer.h @@ -19,60 +19,60 @@ #pragma once -#include +#include G_BEGIN_DECLS -#define GARROW_IPC_TYPE_FILE_WRITER \ - (garrow_ipc_file_writer_get_type()) -#define GARROW_IPC_FILE_WRITER(obj) \ +#define GARROW_TYPE_FILE_WRITER \ + (garrow_file_writer_get_type()) +#define GARROW_FILE_WRITER(obj) \ (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_IPC_TYPE_FILE_WRITER, \ - GArrowIPCFileWriter)) -#define GARROW_IPC_FILE_WRITER_CLASS(klass) \ + GARROW_TYPE_FILE_WRITER, \ + GArrowFileWriter)) +#define GARROW_FILE_WRITER_CLASS(klass) \ (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_IPC_TYPE_FILE_WRITER, \ - GArrowIPCFileWriterClass)) -#define GARROW_IPC_IS_FILE_WRITER(obj) \ + GARROW_TYPE_FILE_WRITER, \ + GArrowFileWriterClass)) +#define GARROW_IS_FILE_WRITER(obj) \ (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_IPC_TYPE_FILE_WRITER)) -#define GARROW_IPC_IS_FILE_WRITER_CLASS(klass) \ + GARROW_TYPE_FILE_WRITER)) +#define GARROW_IS_FILE_WRITER_CLASS(klass) \ (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_IPC_TYPE_FILE_WRITER)) -#define GARROW_IPC_FILE_WRITER_GET_CLASS(obj) \ + GARROW_TYPE_FILE_WRITER)) +#define GARROW_FILE_WRITER_GET_CLASS(obj) \ (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_IPC_TYPE_FILE_WRITER, \ - GArrowIPCFileWriterClass)) + GARROW_TYPE_FILE_WRITER, \ + GArrowFileWriterClass)) -typedef struct _GArrowIPCFileWriter GArrowIPCFileWriter; -typedef struct _GArrowIPCFileWriterClass GArrowIPCFileWriterClass; +typedef struct _GArrowFileWriter GArrowFileWriter; +typedef struct _GArrowFileWriterClass GArrowFileWriterClass; /** - * GArrowIPCFileWriter: + * GArrowFileWriter: * * It wraps `arrow::ipc::FileWriter`. 
*/ -struct _GArrowIPCFileWriter +struct _GArrowFileWriter { /*< private >*/ - GArrowIPCStreamWriter parent_instance; + GArrowStreamWriter parent_instance; }; -struct _GArrowIPCFileWriterClass +struct _GArrowFileWriterClass { GObjectClass parent_class; }; -GType garrow_ipc_file_writer_get_type(void) G_GNUC_CONST; +GType garrow_file_writer_get_type(void) G_GNUC_CONST; -GArrowIPCFileWriter *garrow_ipc_file_writer_open(GArrowIOOutputStream *sink, +GArrowFileWriter *garrow_file_writer_open(GArrowOutputStream *sink, GArrowSchema *schema, GError **error); -gboolean garrow_ipc_file_writer_write_record_batch(GArrowIPCFileWriter *file_writer, +gboolean garrow_file_writer_write_record_batch(GArrowFileWriter *file_writer, GArrowRecordBatch *record_batch, GError **error); -gboolean garrow_ipc_file_writer_close(GArrowIPCFileWriter *file_writer, +gboolean garrow_file_writer_close(GArrowFileWriter *file_writer, GError **error); G_END_DECLS diff --git a/c_glib/arrow-glib/ipc-file-reader.hpp b/c_glib/arrow-glib/file-writer.hpp similarity index 77% rename from c_glib/arrow-glib/ipc-file-reader.hpp rename to c_glib/arrow-glib/file-writer.hpp index 66cd45d51ddf5..f6a720a6cde7e 100644 --- a/c_glib/arrow-glib/ipc-file-reader.hpp +++ b/c_glib/arrow-glib/file-writer.hpp @@ -22,7 +22,7 @@ #include #include -#include +#include -GArrowIPCFileReader *garrow_ipc_file_reader_new_raw(std::shared_ptr *arrow_file_reader); -std::shared_ptr garrow_ipc_file_reader_get_raw(GArrowIPCFileReader *file_reader); +GArrowFileWriter *garrow_file_writer_new_raw(std::shared_ptr *arrow_file_writer); +arrow::ipc::FileWriter *garrow_file_writer_get_raw(GArrowFileWriter *file_writer); diff --git a/c_glib/arrow-glib/io-file.cpp b/c_glib/arrow-glib/file.cpp similarity index 68% rename from c_glib/arrow-glib/io-file.cpp rename to c_glib/arrow-glib/file.cpp index 536ae3e705f59..0d0fe1d8b9c83 100644 --- a/c_glib/arrow-glib/io-file.cpp +++ b/c_glib/arrow-glib/file.cpp @@ -24,40 +24,40 @@ #include #include -#include -#include +#include +#include G_BEGIN_DECLS /** - * SECTION: io-file - * @title: GArrowIOFile + * SECTION: file + * @title: GArrowFile * @short_description: File interface * - * #GArrowIOFile is an interface for file. + * #GArrowFile is an interface for file. */ -G_DEFINE_INTERFACE(GArrowIOFile, - garrow_io_file, +G_DEFINE_INTERFACE(GArrowFile, + garrow_file, G_TYPE_OBJECT) static void -garrow_io_file_default_init (GArrowIOFileInterface *iface) +garrow_file_default_init (GArrowFileInterface *iface) { } /** - * garrow_io_file_close: - * @file: A #GArrowIOFile. + * garrow_file_close: + * @file: A #GArrowFile. * @error: (nullable): Return location for a #GError or %NULL. * * Returns: %TRUE on success, %FALSE if there was an error. */ gboolean -garrow_io_file_close(GArrowIOFile *file, +garrow_file_close(GArrowFile *file, GError **error) { - auto arrow_file = garrow_io_file_get_raw(file); + auto arrow_file = garrow_file_get_raw(file); auto status = arrow_file->Close(); if (status.ok()) { @@ -69,17 +69,17 @@ garrow_io_file_close(GArrowIOFile *file, } /** - * garrow_io_file_tell: - * @file: A #GArrowIOFile. + * garrow_file_tell: + * @file: A #GArrowFile. * @error: (nullable): Return location for a #GError or %NULL. * * Returns: The current offset on success, -1 if there was an error. 
*/ gint64 -garrow_io_file_tell(GArrowIOFile *file, +garrow_file_tell(GArrowFile *file, GError **error) { - auto arrow_file = garrow_io_file_get_raw(file); + auto arrow_file = garrow_file_get_raw(file); gint64 position; auto status = arrow_file->Tell(&position); @@ -92,25 +92,25 @@ garrow_io_file_tell(GArrowIOFile *file, } /** - * garrow_io_file_get_mode: - * @file: A #GArrowIOFile. + * garrow_file_get_mode: + * @file: A #GArrowFile. * * Returns: The mode of the file. */ -GArrowIOFileMode -garrow_io_file_get_mode(GArrowIOFile *file) +GArrowFileMode +garrow_file_get_mode(GArrowFile *file) { - auto arrow_file = garrow_io_file_get_raw(file); + auto arrow_file = garrow_file_get_raw(file); auto arrow_mode = arrow_file->mode(); - return garrow_io_file_mode_from_raw(arrow_mode); + return garrow_file_mode_from_raw(arrow_mode); } G_END_DECLS std::shared_ptr -garrow_io_file_get_raw(GArrowIOFile *file) +garrow_file_get_raw(GArrowFile *file) { - auto *iface = GARROW_IO_FILE_GET_IFACE(file); + auto *iface = GARROW_FILE_GET_IFACE(file); return iface->get_raw(file); } diff --git a/c_glib/arrow-glib/io-file.h b/c_glib/arrow-glib/file.h similarity index 55% rename from c_glib/arrow-glib/io-file.h rename to c_glib/arrow-glib/file.h index 7181f6d37aeb3..68054aa5b6217 100644 --- a/c_glib/arrow-glib/io-file.h +++ b/c_glib/arrow-glib/file.h @@ -19,33 +19,33 @@ #pragma once -#include +#include G_BEGIN_DECLS -#define GARROW_IO_TYPE_FILE \ - (garrow_io_file_get_type()) -#define GARROW_IO_FILE(obj) \ +#define GARROW_TYPE_FILE \ + (garrow_file_get_type()) +#define GARROW_FILE(obj) \ (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_IO_TYPE_FILE, \ - GArrowIOFile)) -#define GARROW_IO_IS_FILE(obj) \ + GARROW_TYPE_FILE, \ + GArrowFile)) +#define GARROW_IS_FILE(obj) \ (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_IO_TYPE_FILE)) -#define GARROW_IO_FILE_GET_IFACE(obj) \ + GARROW_TYPE_FILE)) +#define GARROW_FILE_GET_IFACE(obj) \ (G_TYPE_INSTANCE_GET_INTERFACE((obj), \ - GARROW_IO_TYPE_FILE, \ - GArrowIOFileInterface)) + GARROW_TYPE_FILE, \ + GArrowFileInterface)) -typedef struct _GArrowIOFile GArrowIOFile; -typedef struct _GArrowIOFileInterface GArrowIOFileInterface; +typedef struct _GArrowFile GArrowFile; +typedef struct _GArrowFileInterface GArrowFileInterface; -GType garrow_io_file_get_type(void) G_GNUC_CONST; +GType garrow_file_get_type(void) G_GNUC_CONST; -gboolean garrow_io_file_close(GArrowIOFile *file, +gboolean garrow_file_close(GArrowFile *file, GError **error); -gint64 garrow_io_file_tell(GArrowIOFile *file, +gint64 garrow_file_tell(GArrowFile *file, GError **error); -GArrowIOFileMode garrow_io_file_get_mode(GArrowIOFile *file); +GArrowFileMode garrow_file_get_mode(GArrowFile *file); G_END_DECLS diff --git a/c_glib/arrow-glib/io-file.hpp b/c_glib/arrow-glib/file.hpp similarity index 79% rename from c_glib/arrow-glib/io-file.hpp rename to c_glib/arrow-glib/file.hpp index afaca90a10fa3..c4cc78747cf6a 100644 --- a/c_glib/arrow-glib/io-file.hpp +++ b/c_glib/arrow-glib/file.hpp @@ -21,18 +21,18 @@ #include -#include +#include /** - * GArrowIOFileInterface: + * GArrowFileInterface: * * It wraps `arrow::io::FileInterface`. 
*/ -struct _GArrowIOFileInterface +struct _GArrowFileInterface { GTypeInterface parent_iface; - std::shared_ptr (*get_raw)(GArrowIOFile *file); + std::shared_ptr (*get_raw)(GArrowFile *file); }; -std::shared_ptr garrow_io_file_get_raw(GArrowIOFile *file); +std::shared_ptr garrow_file_get_raw(GArrowFile *file); diff --git a/c_glib/arrow-glib/io-input-stream.cpp b/c_glib/arrow-glib/input-stream.cpp similarity index 71% rename from c_glib/arrow-glib/io-input-stream.cpp rename to c_glib/arrow-glib/input-stream.cpp index a28b9c6556ccd..36bef80422489 100644 --- a/c_glib/arrow-glib/io-input-stream.cpp +++ b/c_glib/arrow-glib/input-stream.cpp @@ -24,33 +24,33 @@ #include #include -#include +#include G_BEGIN_DECLS /** - * SECTION: io-input-stream - * @title: GArrowIOInputStream + * SECTION: input-stream + * @title: GArrowInputStream * @short_description: Stream input interface * - * #GArrowIOInputStream is an interface for stream input. Stream input + * #GArrowInputStream is an interface for stream input. Stream input * is file based and readable. */ -G_DEFINE_INTERFACE(GArrowIOInputStream, - garrow_io_input_stream, +G_DEFINE_INTERFACE(GArrowInputStream, + garrow_input_stream, G_TYPE_OBJECT) static void -garrow_io_input_stream_default_init (GArrowIOInputStreamInterface *iface) +garrow_input_stream_default_init (GArrowInputStreamInterface *iface) { } G_END_DECLS std::shared_ptr -garrow_io_input_stream_get_raw(GArrowIOInputStream *input_stream) +garrow_input_stream_get_raw(GArrowInputStream *input_stream) { - auto *iface = GARROW_IO_INPUT_STREAM_GET_IFACE(input_stream); + auto *iface = GARROW_INPUT_STREAM_GET_IFACE(input_stream); return iface->get_raw(input_stream); } diff --git a/c_glib/arrow-glib/io-input-stream.h b/c_glib/arrow-glib/input-stream.h similarity index 57% rename from c_glib/arrow-glib/io-input-stream.h rename to c_glib/arrow-glib/input-stream.h index 57902095010c8..4b331b93fb27f 100644 --- a/c_glib/arrow-glib/io-input-stream.h +++ b/c_glib/arrow-glib/input-stream.h @@ -23,23 +23,23 @@ G_BEGIN_DECLS -#define GARROW_IO_TYPE_INPUT_STREAM \ - (garrow_io_input_stream_get_type()) -#define GARROW_IO_INPUT_STREAM(obj) \ +#define GARROW_TYPE_INPUT_STREAM \ + (garrow_input_stream_get_type()) +#define GARROW_INPUT_STREAM(obj) \ (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_IO_TYPE_INPUT_STREAM, \ - GArrowIOInputStream)) -#define GARROW_IO_IS_INPUT_STREAM(obj) \ + GARROW_TYPE_INPUT_STREAM, \ + GArrowInputStream)) +#define GARROW_IS_INPUT_STREAM(obj) \ (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_IO_TYPE_INPUT_STREAM)) -#define GARROW_IO_INPUT_STREAM_GET_IFACE(obj) \ + GARROW_TYPE_INPUT_STREAM)) +#define GARROW_INPUT_STREAM_GET_IFACE(obj) \ (G_TYPE_INSTANCE_GET_INTERFACE((obj), \ - GARROW_IO_TYPE_INPUT_STREAM, \ - GArrowIOInputStreamInterface)) + GARROW_TYPE_INPUT_STREAM, \ + GArrowInputStreamInterface)) -typedef struct _GArrowIOInputStream GArrowIOInputStream; -typedef struct _GArrowIOInputStreamInterface GArrowIOInputStreamInterface; +typedef struct _GArrowInputStream GArrowInputStream; +typedef struct _GArrowInputStreamInterface GArrowInputStreamInterface; -GType garrow_io_input_stream_get_type(void) G_GNUC_CONST; +GType garrow_input_stream_get_type(void) G_GNUC_CONST; G_END_DECLS diff --git a/c_glib/arrow-glib/io-input-stream.hpp b/c_glib/arrow-glib/input-stream.hpp similarity index 76% rename from c_glib/arrow-glib/io-input-stream.hpp rename to c_glib/arrow-glib/input-stream.hpp index 3b1de5da5c226..7958df1585887 100644 --- a/c_glib/arrow-glib/io-input-stream.hpp +++ 
b/c_glib/arrow-glib/input-stream.hpp @@ -21,18 +21,18 @@ #include -#include +#include /** - * GArrowIOInputStreamInterface: + * GArrowInputStreamInterface: * * It wraps `arrow::io::InputStream`. */ -struct _GArrowIOInputStreamInterface +struct _GArrowInputStreamInterface { GTypeInterface parent_iface; - std::shared_ptr (*get_raw)(GArrowIOInputStream *file); + std::shared_ptr (*get_raw)(GArrowInputStream *file); }; -std::shared_ptr garrow_io_input_stream_get_raw(GArrowIOInputStream *input_stream); +std::shared_ptr garrow_input_stream_get_raw(GArrowInputStream *input_stream); diff --git a/c_glib/arrow-glib/io-memory-mapped-file.cpp b/c_glib/arrow-glib/io-memory-mapped-file.cpp deleted file mode 100644 index e2e255c039109..0000000000000 --- a/c_glib/arrow-glib/io-memory-mapped-file.cpp +++ /dev/null @@ -1,287 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include - -#include -#include -#include -#include -#include -#include -#include -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: io-memory-mapped-file - * @short_description: Memory mapped file class - * - * #GArrowIOMemoryMappedFile is a class for memory mapped file. It's - * readable and writeable. It supports zero copy. 
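A note on the diffstat above: io-memory-mapped-file.cpp shows up as a delete plus a create rather than a rename, likely because git only pairs old and new paths whose content similarity is at or above its 50% detection threshold. With nearly every line of this file touched by the prefix removal, the .cpp fell below it, while the corresponding header was still paired (51%); the boilerplate .hpp files are so similar that git even cross-paired some of them (io-file-output-stream.hpp => memory-mapped-file.hpp at 73%).
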
- */ - -typedef struct GArrowIOMemoryMappedFilePrivate_ { - std::shared_ptr memory_mapped_file; -} GArrowIOMemoryMappedFilePrivate; - -enum { - PROP_0, - PROP_MEMORY_MAPPED_FILE -}; - -static std::shared_ptr -garrow_io_memory_mapped_file_get_raw_file_interface(GArrowIOFile *file) -{ - auto memory_mapped_file = GARROW_IO_MEMORY_MAPPED_FILE(file); - auto arrow_memory_mapped_file = - garrow_io_memory_mapped_file_get_raw(memory_mapped_file); - return arrow_memory_mapped_file; -} - -static void -garrow_io_file_interface_init(GArrowIOFileInterface *iface) -{ - iface->get_raw = garrow_io_memory_mapped_file_get_raw_file_interface; -} - -static std::shared_ptr -garrow_io_memory_mapped_file_get_raw_readable_interface(GArrowIOReadable *readable) -{ - auto memory_mapped_file = GARROW_IO_MEMORY_MAPPED_FILE(readable); - auto arrow_memory_mapped_file = - garrow_io_memory_mapped_file_get_raw(memory_mapped_file); - return arrow_memory_mapped_file; -} - -static void -garrow_io_readable_interface_init(GArrowIOReadableInterface *iface) -{ - iface->get_raw = garrow_io_memory_mapped_file_get_raw_readable_interface; -} - -static std::shared_ptr -garrow_io_memory_mapped_file_get_raw_input_stream_interface(GArrowIOInputStream *input_stream) -{ - auto memory_mapped_file = GARROW_IO_MEMORY_MAPPED_FILE(input_stream); - auto arrow_memory_mapped_file = - garrow_io_memory_mapped_file_get_raw(memory_mapped_file); - return arrow_memory_mapped_file; -} - -static void -garrow_io_input_stream_interface_init(GArrowIOInputStreamInterface *iface) -{ - iface->get_raw = garrow_io_memory_mapped_file_get_raw_input_stream_interface; -} - -static std::shared_ptr -garrow_io_memory_mapped_file_get_raw_random_access_file_interface(GArrowIORandomAccessFile *file) -{ - auto memory_mapped_file = GARROW_IO_MEMORY_MAPPED_FILE(file); - auto arrow_memory_mapped_file = - garrow_io_memory_mapped_file_get_raw(memory_mapped_file); - return arrow_memory_mapped_file; -} - -static void -garrow_io_random_access_file_interface_init(GArrowIORandomAccessFileInterface *iface) -{ - iface->get_raw = garrow_io_memory_mapped_file_get_raw_random_access_file_interface; -} - -static std::shared_ptr -garrow_io_memory_mapped_file_get_raw_writeable_interface(GArrowIOWriteable *writeable) -{ - auto memory_mapped_file = GARROW_IO_MEMORY_MAPPED_FILE(writeable); - auto arrow_memory_mapped_file = - garrow_io_memory_mapped_file_get_raw(memory_mapped_file); - return arrow_memory_mapped_file; -} - -static void -garrow_io_writeable_interface_init(GArrowIOWriteableInterface *iface) -{ - iface->get_raw = garrow_io_memory_mapped_file_get_raw_writeable_interface; -} - -static std::shared_ptr -garrow_io_memory_mapped_file_get_raw_writeable_file_interface(GArrowIOWriteableFile *file) -{ - auto memory_mapped_file = GARROW_IO_MEMORY_MAPPED_FILE(file); - auto arrow_memory_mapped_file = - garrow_io_memory_mapped_file_get_raw(memory_mapped_file); - return arrow_memory_mapped_file; -} - -static void -garrow_io_writeable_file_interface_init(GArrowIOWriteableFileInterface *iface) -{ - iface->get_raw = garrow_io_memory_mapped_file_get_raw_writeable_file_interface; -} - -G_DEFINE_TYPE_WITH_CODE(GArrowIOMemoryMappedFile, - garrow_io_memory_mapped_file, - G_TYPE_OBJECT, - G_ADD_PRIVATE(GArrowIOMemoryMappedFile) - G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_FILE, - garrow_io_file_interface_init) - G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_READABLE, - garrow_io_readable_interface_init) - G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_INPUT_STREAM, - garrow_io_input_stream_interface_init) - 
G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_RANDOM_ACCESS_FILE, - garrow_io_random_access_file_interface_init) - G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_WRITEABLE, - garrow_io_writeable_interface_init) - G_IMPLEMENT_INTERFACE(GARROW_IO_TYPE_WRITEABLE_FILE, - garrow_io_writeable_file_interface_init)); - -#define GARROW_IO_MEMORY_MAPPED_FILE_GET_PRIVATE(obj) \ - (G_TYPE_INSTANCE_GET_PRIVATE((obj), \ - GARROW_IO_TYPE_MEMORY_MAPPED_FILE, \ - GArrowIOMemoryMappedFilePrivate)) - -static void -garrow_io_memory_mapped_file_finalize(GObject *object) -{ - GArrowIOMemoryMappedFilePrivate *priv; - - priv = GARROW_IO_MEMORY_MAPPED_FILE_GET_PRIVATE(object); - - priv->memory_mapped_file = nullptr; - - G_OBJECT_CLASS(garrow_io_memory_mapped_file_parent_class)->finalize(object); -} - -static void -garrow_io_memory_mapped_file_set_property(GObject *object, - guint prop_id, - const GValue *value, - GParamSpec *pspec) -{ - GArrowIOMemoryMappedFilePrivate *priv; - - priv = GARROW_IO_MEMORY_MAPPED_FILE_GET_PRIVATE(object); - - switch (prop_id) { - case PROP_MEMORY_MAPPED_FILE: - priv->memory_mapped_file = - *static_cast *>(g_value_get_pointer(value)); - break; - default: - G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); - break; - } -} - -static void -garrow_io_memory_mapped_file_get_property(GObject *object, - guint prop_id, - GValue *value, - GParamSpec *pspec) -{ - switch (prop_id) { - default: - G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); - break; - } -} - -static void -garrow_io_memory_mapped_file_init(GArrowIOMemoryMappedFile *object) -{ -} - -static void -garrow_io_memory_mapped_file_class_init(GArrowIOMemoryMappedFileClass *klass) -{ - GObjectClass *gobject_class; - GParamSpec *spec; - - gobject_class = G_OBJECT_CLASS(klass); - - gobject_class->finalize = garrow_io_memory_mapped_file_finalize; - gobject_class->set_property = garrow_io_memory_mapped_file_set_property; - gobject_class->get_property = garrow_io_memory_mapped_file_get_property; - - spec = g_param_spec_pointer("memory-mapped-file", - "io::MemoryMappedFile", - "The raw std::shared *", - static_cast(G_PARAM_WRITABLE | - G_PARAM_CONSTRUCT_ONLY)); - g_object_class_install_property(gobject_class, PROP_MEMORY_MAPPED_FILE, spec); -} - -/** - * garrow_io_memory_mapped_file_open: - * @path: The path of the memory mapped file. - * @mode: The mode of the memory mapped file. - * @error: (nullable): Return location for a #GError or %NULL. - * - * Returns: (nullable) (transfer full): A newly opened - * #GArrowIOMemoryMappedFile or %NULL on error. 
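Under the new naming, this entry point presumably becomes garrow_memory_mapped_file_open(); the recreated memory-mapped-file.cpp is not shown in full in this excerpt, so the signature below is an assumption based on the deleted code and the rename pattern:

    #include <arrow-glib/arrow-glib.h>

    /* Assumed post-rename equivalent of the deleted
     * garrow_io_memory_mapped_file_open(). */
    static GArrowMemoryMappedFile *
    open_for_update(const gchar *path, GError **error)
    {
      return garrow_memory_mapped_file_open(path,
                                            GARROW_FILE_MODE_READWRITE,
                                            error);
    }
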
- */ -GArrowIOMemoryMappedFile * -garrow_io_memory_mapped_file_open(const gchar *path, - GArrowIOFileMode mode, - GError **error) -{ - std::shared_ptr arrow_memory_mapped_file; - auto status = - arrow::io::MemoryMappedFile::Open(std::string(path), - garrow_io_file_mode_to_raw(mode), - &arrow_memory_mapped_file); - if (status.ok()) { - return garrow_io_memory_mapped_file_new_raw(&arrow_memory_mapped_file); - } else { - std::string context("[io][memory-mapped-file][open]: <"); - context += path; - context += ">"; - garrow_error_set(error, status, context.c_str()); - return NULL; - } -} - -G_END_DECLS - -GArrowIOMemoryMappedFile * -garrow_io_memory_mapped_file_new_raw(std::shared_ptr *arrow_memory_mapped_file) -{ - auto memory_mapped_file = - GARROW_IO_MEMORY_MAPPED_FILE(g_object_new(GARROW_IO_TYPE_MEMORY_MAPPED_FILE, - "memory-mapped-file", arrow_memory_mapped_file, - NULL)); - return memory_mapped_file; -} - -std::shared_ptr -garrow_io_memory_mapped_file_get_raw(GArrowIOMemoryMappedFile *memory_mapped_file) -{ - GArrowIOMemoryMappedFilePrivate *priv; - - priv = GARROW_IO_MEMORY_MAPPED_FILE_GET_PRIVATE(memory_mapped_file); - return priv->memory_mapped_file; -} diff --git a/c_glib/arrow-glib/ipc-file-reader.h b/c_glib/arrow-glib/ipc-file-reader.h deleted file mode 100644 index 15eba8e35a273..0000000000000 --- a/c_glib/arrow-glib/ipc-file-reader.h +++ /dev/null @@ -1,83 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include -#include - -#include - -#include - -G_BEGIN_DECLS - -#define GARROW_IPC_TYPE_FILE_READER \ - (garrow_ipc_file_reader_get_type()) -#define GARROW_IPC_FILE_READER(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_IPC_TYPE_FILE_READER, \ - GArrowIPCFileReader)) -#define GARROW_IPC_FILE_READER_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_IPC_TYPE_FILE_READER, \ - GArrowIPCFileReaderClass)) -#define GARROW_IPC_IS_FILE_READER(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_IPC_TYPE_FILE_READER)) -#define GARROW_IPC_IS_FILE_READER_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_IPC_TYPE_FILE_READER)) -#define GARROW_IPC_FILE_READER_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_IPC_TYPE_FILE_READER, \ - GArrowIPCFileReaderClass)) - -typedef struct _GArrowIPCFileReader GArrowIPCFileReader; -typedef struct _GArrowIPCFileReaderClass GArrowIPCFileReaderClass; - -/** - * GArrowIPCFileReader: - * - * It wraps `arrow::ipc::FileReader`. 
- */ -struct _GArrowIPCFileReader -{ - /*< private >*/ - GObject parent_instance; -}; - -struct _GArrowIPCFileReaderClass -{ - GObjectClass parent_class; -}; - -GType garrow_ipc_file_reader_get_type(void) G_GNUC_CONST; - -GArrowIPCFileReader *garrow_ipc_file_reader_open(GArrowIORandomAccessFile *file, - GError **error); - -GArrowSchema *garrow_ipc_file_reader_get_schema(GArrowIPCFileReader *file_reader); -guint garrow_ipc_file_reader_get_n_record_batches(GArrowIPCFileReader *file_reader); -GArrowIPCMetadataVersion garrow_ipc_file_reader_get_version(GArrowIPCFileReader *file_reader); -GArrowRecordBatch *garrow_ipc_file_reader_get_record_batch(GArrowIPCFileReader *file_reader, - guint i, - GError **error); - -G_END_DECLS diff --git a/c_glib/arrow-glib/ipc-stream-reader.h b/c_glib/arrow-glib/ipc-stream-reader.h deleted file mode 100644 index 993cd85003bb9..0000000000000 --- a/c_glib/arrow-glib/ipc-stream-reader.h +++ /dev/null @@ -1,80 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include -#include - -#include - -#include - -G_BEGIN_DECLS - -#define GARROW_IPC_TYPE_STREAM_READER \ - (garrow_ipc_stream_reader_get_type()) -#define GARROW_IPC_STREAM_READER(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_IPC_TYPE_STREAM_READER, \ - GArrowIPCStreamReader)) -#define GARROW_IPC_STREAM_READER_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_IPC_TYPE_STREAM_READER, \ - GArrowIPCStreamReaderClass)) -#define GARROW_IPC_IS_STREAM_READER(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_IPC_TYPE_STREAM_READER)) -#define GARROW_IPC_IS_STREAM_READER_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_IPC_TYPE_STREAM_READER)) -#define GARROW_IPC_STREAM_READER_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_IPC_TYPE_STREAM_READER, \ - GArrowIPCStreamReaderClass)) - -typedef struct _GArrowIPCStreamReader GArrowIPCStreamReader; -typedef struct _GArrowIPCStreamReaderClass GArrowIPCStreamReaderClass; - -/** - * GArrowIPCStreamReader: - * - * It wraps `arrow::ipc::StreamReader`. 
- */ -struct _GArrowIPCStreamReader -{ - /*< private >*/ - GObject parent_instance; -}; - -struct _GArrowIPCStreamReaderClass -{ - GObjectClass parent_class; -}; - -GType garrow_ipc_stream_reader_get_type(void) G_GNUC_CONST; - -GArrowIPCStreamReader *garrow_ipc_stream_reader_open(GArrowIOInputStream *stream, - GError **error); - -GArrowSchema *garrow_ipc_stream_reader_get_schema(GArrowIPCStreamReader *stream_reader); -GArrowRecordBatch *garrow_ipc_stream_reader_get_next_record_batch(GArrowIPCStreamReader *stream_reader, - GError **error); - -G_END_DECLS diff --git a/c_glib/arrow-glib/memory-mapped-file.cpp b/c_glib/arrow-glib/memory-mapped-file.cpp new file mode 100644 index 0000000000000..a3e1d0c45f142 --- /dev/null +++ b/c_glib/arrow-glib/memory-mapped-file.cpp @@ -0,0 +1,287 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: memory-mapped-file + * @short_description: Memory mapped file class + * + * #GArrowMemoryMappedFile is a class for memory mapped file. It's + * readable and writeable. It supports zero copy. 
+ */
+
+typedef struct GArrowMemoryMappedFilePrivate_ {
+  std::shared_ptr<arrow::io::MemoryMappedFile> memory_mapped_file;
+} GArrowMemoryMappedFilePrivate;
+
+enum {
+  PROP_0,
+  PROP_MEMORY_MAPPED_FILE
+};
+
+static std::shared_ptr<arrow::io::FileInterface>
+garrow_memory_mapped_file_get_raw_file_interface(GArrowFile *file)
+{
+  auto memory_mapped_file = GARROW_MEMORY_MAPPED_FILE(file);
+  auto arrow_memory_mapped_file =
+    garrow_memory_mapped_file_get_raw(memory_mapped_file);
+  return arrow_memory_mapped_file;
+}
+
+static void
+garrow_file_interface_init(GArrowFileInterface *iface)
+{
+  iface->get_raw = garrow_memory_mapped_file_get_raw_file_interface;
+}
+
+static std::shared_ptr<arrow::io::Readable>
+garrow_memory_mapped_file_get_raw_readable_interface(GArrowReadable *readable)
+{
+  auto memory_mapped_file = GARROW_MEMORY_MAPPED_FILE(readable);
+  auto arrow_memory_mapped_file =
+    garrow_memory_mapped_file_get_raw(memory_mapped_file);
+  return arrow_memory_mapped_file;
+}
+
+static void
+garrow_readable_interface_init(GArrowReadableInterface *iface)
+{
+  iface->get_raw = garrow_memory_mapped_file_get_raw_readable_interface;
+}
+
+static std::shared_ptr<arrow::io::InputStream>
+garrow_memory_mapped_file_get_raw_input_stream_interface(GArrowInputStream *input_stream)
+{
+  auto memory_mapped_file = GARROW_MEMORY_MAPPED_FILE(input_stream);
+  auto arrow_memory_mapped_file =
+    garrow_memory_mapped_file_get_raw(memory_mapped_file);
+  return arrow_memory_mapped_file;
+}
+
+static void
+garrow_input_stream_interface_init(GArrowInputStreamInterface *iface)
+{
+  iface->get_raw = garrow_memory_mapped_file_get_raw_input_stream_interface;
+}
+
+static std::shared_ptr<arrow::io::RandomAccessFile>
+garrow_memory_mapped_file_get_raw_random_access_file_interface(GArrowRandomAccessFile *file)
+{
+  auto memory_mapped_file = GARROW_MEMORY_MAPPED_FILE(file);
+  auto arrow_memory_mapped_file =
+    garrow_memory_mapped_file_get_raw(memory_mapped_file);
+  return arrow_memory_mapped_file;
+}
+
+static void
+garrow_random_access_file_interface_init(GArrowRandomAccessFileInterface *iface)
+{
+  iface->get_raw = garrow_memory_mapped_file_get_raw_random_access_file_interface;
+}
+
+static std::shared_ptr<arrow::io::Writeable>
+garrow_memory_mapped_file_get_raw_writeable_interface(GArrowWriteable *writeable)
+{
+  auto memory_mapped_file = GARROW_MEMORY_MAPPED_FILE(writeable);
+  auto arrow_memory_mapped_file =
+    garrow_memory_mapped_file_get_raw(memory_mapped_file);
+  return arrow_memory_mapped_file;
+}
+
+static void
+garrow_writeable_interface_init(GArrowWriteableInterface *iface)
+{
+  iface->get_raw = garrow_memory_mapped_file_get_raw_writeable_interface;
+}
+
+static std::shared_ptr<arrow::io::WriteableFile>
+garrow_memory_mapped_file_get_raw_writeable_file_interface(GArrowWriteableFile *file)
+{
+  auto memory_mapped_file = GARROW_MEMORY_MAPPED_FILE(file);
+  auto arrow_memory_mapped_file =
+    garrow_memory_mapped_file_get_raw(memory_mapped_file);
+  return arrow_memory_mapped_file;
+}
+
+static void
+garrow_writeable_file_interface_init(GArrowWriteableFileInterface *iface)
+{
+  iface->get_raw = garrow_memory_mapped_file_get_raw_writeable_file_interface;
+}
+
+G_DEFINE_TYPE_WITH_CODE(GArrowMemoryMappedFile,
+                        garrow_memory_mapped_file,
+                        G_TYPE_OBJECT,
+                        G_ADD_PRIVATE(GArrowMemoryMappedFile)
+                        G_IMPLEMENT_INTERFACE(GARROW_TYPE_FILE,
+                                              garrow_file_interface_init)
+                        G_IMPLEMENT_INTERFACE(GARROW_TYPE_READABLE,
+                                              garrow_readable_interface_init)
+                        G_IMPLEMENT_INTERFACE(GARROW_TYPE_INPUT_STREAM,
+                                              garrow_input_stream_interface_init)
+                        G_IMPLEMENT_INTERFACE(GARROW_TYPE_RANDOM_ACCESS_FILE,
+                                              garrow_random_access_file_interface_init)
+                        G_IMPLEMENT_INTERFACE(GARROW_TYPE_WRITEABLE,
+                                              garrow_writeable_interface_init)
+                        G_IMPLEMENT_INTERFACE(GARROW_TYPE_WRITEABLE_FILE,
+                                              garrow_writeable_file_interface_init));
+
+#define GARROW_MEMORY_MAPPED_FILE_GET_PRIVATE(obj)              \
+  (G_TYPE_INSTANCE_GET_PRIVATE((obj),                           \
+                               GARROW_TYPE_MEMORY_MAPPED_FILE,  \
+                               GArrowMemoryMappedFilePrivate))
+
+static void
+garrow_memory_mapped_file_finalize(GObject *object)
+{
+  GArrowMemoryMappedFilePrivate *priv;
+
+  priv = GARROW_MEMORY_MAPPED_FILE_GET_PRIVATE(object);
+
+  priv->memory_mapped_file = nullptr;
+
+  G_OBJECT_CLASS(garrow_memory_mapped_file_parent_class)->finalize(object);
+}
+
+static void
+garrow_memory_mapped_file_set_property(GObject *object,
+                                       guint prop_id,
+                                       const GValue *value,
+                                       GParamSpec *pspec)
+{
+  GArrowMemoryMappedFilePrivate *priv;
+
+  priv = GARROW_MEMORY_MAPPED_FILE_GET_PRIVATE(object);
+
+  switch (prop_id) {
+  case PROP_MEMORY_MAPPED_FILE:
+    priv->memory_mapped_file =
+      *static_cast<std::shared_ptr<arrow::io::MemoryMappedFile> *>(g_value_get_pointer(value));
+    break;
+  default:
+    G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec);
+    break;
+  }
+}
+
+static void
+garrow_memory_mapped_file_get_property(GObject *object,
+                                       guint prop_id,
+                                       GValue *value,
+                                       GParamSpec *pspec)
+{
+  switch (prop_id) {
+  default:
+    G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec);
+    break;
+  }
+}
+
+static void
+garrow_memory_mapped_file_init(GArrowMemoryMappedFile *object)
+{
+}
+
+static void
+garrow_memory_mapped_file_class_init(GArrowMemoryMappedFileClass *klass)
+{
+  GObjectClass *gobject_class;
+  GParamSpec *spec;
+
+  gobject_class = G_OBJECT_CLASS(klass);
+
+  gobject_class->finalize = garrow_memory_mapped_file_finalize;
+  gobject_class->set_property = garrow_memory_mapped_file_set_property;
+  gobject_class->get_property = garrow_memory_mapped_file_get_property;
+
+  spec = g_param_spec_pointer("memory-mapped-file",
+                              "io::MemoryMappedFile",
+                              "The raw std::shared_ptr<arrow::io::MemoryMappedFile> *",
+                              static_cast<GParamFlags>(G_PARAM_WRITABLE |
+                                                       G_PARAM_CONSTRUCT_ONLY));
+  g_object_class_install_property(gobject_class, PROP_MEMORY_MAPPED_FILE, spec);
+}
+
+/**
+ * garrow_memory_mapped_file_open:
+ * @path: The path of the memory mapped file.
+ * @mode: The mode of the memory mapped file.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: (nullable) (transfer full): A newly opened
+ * #GArrowMemoryMappedFile or %NULL on error.
+ */
+GArrowMemoryMappedFile *
+garrow_memory_mapped_file_open(const gchar *path,
+                               GArrowFileMode mode,
+                               GError **error)
+{
+  std::shared_ptr<arrow::io::MemoryMappedFile> arrow_memory_mapped_file;
+  auto status =
+    arrow::io::MemoryMappedFile::Open(std::string(path),
+                                      garrow_file_mode_to_raw(mode),
+                                      &arrow_memory_mapped_file);
+  if (status.ok()) {
+    return garrow_memory_mapped_file_new_raw(&arrow_memory_mapped_file);
+  } else {
+    std::string context("[io][memory-mapped-file][open]: <");
+    context += path;
+    context += ">";
+    garrow_error_set(error, status, context.c_str());
+    return NULL;
+  }
+}
+
+G_END_DECLS
+
+GArrowMemoryMappedFile *
+garrow_memory_mapped_file_new_raw(std::shared_ptr<arrow::io::MemoryMappedFile> *arrow_memory_mapped_file)
+{
+  auto memory_mapped_file =
+    GARROW_MEMORY_MAPPED_FILE(g_object_new(GARROW_TYPE_MEMORY_MAPPED_FILE,
+                                           "memory-mapped-file", arrow_memory_mapped_file,
+                                           NULL));
+  return memory_mapped_file;
+}
+
+std::shared_ptr<arrow::io::MemoryMappedFile>
+garrow_memory_mapped_file_get_raw(GArrowMemoryMappedFile *memory_mapped_file)
+{
+  GArrowMemoryMappedFilePrivate *priv;
+
+  priv = GARROW_MEMORY_MAPPED_FILE_GET_PRIVATE(memory_mapped_file);
+  return priv->memory_mapped_file;
+}
diff --git a/c_glib/arrow-glib/io-memory-mapped-file.h b/c_glib/arrow-glib/memory-mapped-file.h
similarity index 51%
rename from c_glib/arrow-glib/io-memory-mapped-file.h
rename to c_glib/arrow-glib/memory-mapped-file.h
index 0d2d6c2f835de..40b8de04a5a75 100644
--- a/c_glib/arrow-glib/io-memory-mapped-file.h
+++ b/c_glib/arrow-glib/memory-mapped-file.h
@@ -19,54 +19,54 @@
 
 #pragma once
 
-#include
+#include
 
 G_BEGIN_DECLS
 
-#define GARROW_IO_TYPE_MEMORY_MAPPED_FILE       \
-  (garrow_io_memory_mapped_file_get_type())
-#define GARROW_IO_MEMORY_MAPPED_FILE(obj)       \
+#define GARROW_TYPE_MEMORY_MAPPED_FILE          \
+  (garrow_memory_mapped_file_get_type())
+#define GARROW_MEMORY_MAPPED_FILE(obj)          \
   (G_TYPE_CHECK_INSTANCE_CAST((obj),            \
-                              GARROW_IO_TYPE_MEMORY_MAPPED_FILE,  \
-                              GArrowIOMemoryMappedFile))
-#define GARROW_IO_MEMORY_MAPPED_FILE_CLASS(klass)       \
+                              GARROW_TYPE_MEMORY_MAPPED_FILE,     \
+                              GArrowMemoryMappedFile))
+#define GARROW_MEMORY_MAPPED_FILE_CLASS(klass)          \
   (G_TYPE_CHECK_CLASS_CAST((klass),                     \
-                           GARROW_IO_TYPE_MEMORY_MAPPED_FILE,   \
-                           GArrowIOMemoryMappedFileClass))
-#define GARROW_IO_IS_MEMORY_MAPPED_FILE(obj)            \
+                           GARROW_TYPE_MEMORY_MAPPED_FILE,      \
+                           GArrowMemoryMappedFileClass))
+#define GARROW_IS_MEMORY_MAPPED_FILE(obj)               \
   (G_TYPE_CHECK_INSTANCE_TYPE((obj),                    \
-                              GARROW_IO_TYPE_MEMORY_MAPPED_FILE))
-#define GARROW_IO_IS_MEMORY_MAPPED_FILE_CLASS(klass)    \
+                              GARROW_TYPE_MEMORY_MAPPED_FILE))
+#define GARROW_IS_MEMORY_MAPPED_FILE_CLASS(klass)       \
   (G_TYPE_CHECK_CLASS_TYPE((klass),                     \
-                           GARROW_IO_TYPE_MEMORY_MAPPED_FILE))
-#define GARROW_IO_MEMORY_MAPPED_FILE_GET_CLASS(obj)     \
+                           GARROW_TYPE_MEMORY_MAPPED_FILE))
+#define GARROW_MEMORY_MAPPED_FILE_GET_CLASS(obj)        \
   (G_TYPE_INSTANCE_GET_CLASS((obj),                     \
-                             GARROW_IO_TYPE_MEMORY_MAPPED_FILE, \
-                             GArrowIOMemoryMappedFileClass))
+                             GARROW_TYPE_MEMORY_MAPPED_FILE,    \
+                             GArrowMemoryMappedFileClass))
 
-typedef struct _GArrowIOMemoryMappedFile GArrowIOMemoryMappedFile;
-typedef struct _GArrowIOMemoryMappedFileClass GArrowIOMemoryMappedFileClass;
+typedef struct _GArrowMemoryMappedFile GArrowMemoryMappedFile;
+typedef struct _GArrowMemoryMappedFileClass GArrowMemoryMappedFileClass;
 
 /**
- * GArrowIOMemoryMappedFile:
+ * GArrowMemoryMappedFile:
 *
 * It wraps `arrow::io::MemoryMappedFile`.
 */
-struct _GArrowIOMemoryMappedFile
+struct _GArrowMemoryMappedFile
 {
   /*< private >*/
   GObject parent_instance;
 };
 
-struct _GArrowIOMemoryMappedFileClass
+struct _GArrowMemoryMappedFileClass
 {
   GObjectClass parent_class;
 };
 
-GType garrow_io_memory_mapped_file_get_type(void) G_GNUC_CONST;
+GType garrow_memory_mapped_file_get_type(void) G_GNUC_CONST;
 
-GArrowIOMemoryMappedFile *garrow_io_memory_mapped_file_open(const gchar *path,
-                                                            GArrowIOFileMode mode,
+GArrowMemoryMappedFile *garrow_memory_mapped_file_open(const gchar *path,
+                                                       GArrowFileMode mode,
                                                        GError **error);
 
 G_END_DECLS
diff --git a/c_glib/arrow-glib/io-file-output-stream.hpp b/c_glib/arrow-glib/memory-mapped-file.hpp
similarity index 73%
rename from c_glib/arrow-glib/io-file-output-stream.hpp
rename to c_glib/arrow-glib/memory-mapped-file.hpp
index 76b8e91f6cf43..522e43d117f39 100644
--- a/c_glib/arrow-glib/io-file-output-stream.hpp
+++ b/c_glib/arrow-glib/memory-mapped-file.hpp
@@ -22,7 +22,7 @@
 
 #include
 #include
 
-#include
+#include
 
-GArrowIOFileOutputStream *garrow_io_file_output_stream_new_raw(std::shared_ptr<arrow::io::FileOutputStream> *arrow_file_output_stream);
-std::shared_ptr<arrow::io::FileOutputStream> garrow_io_file_output_stream_get_raw(GArrowIOFileOutputStream *file_output_stream);
+GArrowMemoryMappedFile *garrow_memory_mapped_file_new_raw(std::shared_ptr<arrow::io::MemoryMappedFile> *arrow_memory_mapped_file);
+std::shared_ptr<arrow::io::MemoryMappedFile> garrow_memory_mapped_file_get_raw(GArrowMemoryMappedFile *memory_mapped_file);
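For readers tracking the rename, the renamed class keeps the old semantics; only the GArrowIO prefix is gone. A minimal C usage sketch of the API above follows. It is illustrative only and not part of this change: the path is a placeholder, and GARROW_FILE_MODE_READ is assumed to be the renamed file-mode enum value from the same series.

    /* Open a memory-mapped file read-only; "/tmp/data.arrow" is a
     * hypothetical path used only for illustration. */
    GError *error = NULL;
    GArrowMemoryMappedFile *file =
      garrow_memory_mapped_file_open("/tmp/data.arrow",
                                     GARROW_FILE_MODE_READ,
                                     &error);
    if (file == NULL) {
      g_printerr("open failed: %s\n", error->message);
      g_clear_error(&error);
    } else {
      g_object_unref(file);
    }

On failure, the GError message carries the "[io][memory-mapped-file][open]: <path>" context string built in the implementation above.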
diff --git a/c_glib/arrow-glib/ipc-metadata-version.cpp b/c_glib/arrow-glib/metadata-version.cpp
similarity index 68%
rename from c_glib/arrow-glib/ipc-metadata-version.cpp
rename to c_glib/arrow-glib/metadata-version.cpp
index f591f295ec886..ee458ebfcea4f 100644
--- a/c_glib/arrow-glib/ipc-metadata-version.cpp
+++ b/c_glib/arrow-glib/metadata-version.cpp
@@ -21,41 +21,41 @@
 # include
 #endif
 
-#include
+#include
 
 /**
- * SECTION: ipc-metadata-version
- * @title: GArrowIPCMetadataVersion
+ * SECTION: metadata-version
+ * @title: GArrowMetadataVersion
  * @short_description: Metadata version mapping between Arrow and arrow-glib
 *
- * #GArrowIPCMetadataVersion provides metadata versions corresponding
+ * #GArrowMetadataVersion provides metadata versions corresponding
 * to `arrow::ipc::MetadataVersion` values.
*/ -GArrowIPCMetadataVersion -garrow_ipc_metadata_version_from_raw(arrow::ipc::MetadataVersion version) +GArrowMetadataVersion +garrow_metadata_version_from_raw(arrow::ipc::MetadataVersion version) { switch (version) { case arrow::ipc::MetadataVersion::V1: - return GARROW_IPC_METADATA_VERSION_V1; + return GARROW_METADATA_VERSION_V1; case arrow::ipc::MetadataVersion::V2: - return GARROW_IPC_METADATA_VERSION_V2; + return GARROW_METADATA_VERSION_V2; case arrow::ipc::MetadataVersion::V3: - return GARROW_IPC_METADATA_VERSION_V3; + return GARROW_METADATA_VERSION_V3; default: - return GARROW_IPC_METADATA_VERSION_V3; + return GARROW_METADATA_VERSION_V3; } } arrow::ipc::MetadataVersion -garrow_ipc_metadata_version_to_raw(GArrowIPCMetadataVersion version) +garrow_metadata_version_to_raw(GArrowMetadataVersion version) { switch (version) { - case GARROW_IPC_METADATA_VERSION_V1: + case GARROW_METADATA_VERSION_V1: return arrow::ipc::MetadataVersion::V1; - case GARROW_IPC_METADATA_VERSION_V2: + case GARROW_METADATA_VERSION_V2: return arrow::ipc::MetadataVersion::V2; - case GARROW_IPC_METADATA_VERSION_V3: + case GARROW_METADATA_VERSION_V3: return arrow::ipc::MetadataVersion::V3; default: return arrow::ipc::MetadataVersion::V3; diff --git a/c_glib/arrow-glib/ipc-metadata-version.h b/c_glib/arrow-glib/metadata-version.h similarity index 76% rename from c_glib/arrow-glib/ipc-metadata-version.h rename to c_glib/arrow-glib/metadata-version.h index 20defdb71b4f2..d902a3949e69a 100644 --- a/c_glib/arrow-glib/ipc-metadata-version.h +++ b/c_glib/arrow-glib/metadata-version.h @@ -24,18 +24,18 @@ G_BEGIN_DECLS /** - * GArrowIPCMetadataVersion: - * @GARROW_IPC_METADATA_VERSION_V1: Version 1. - * @GARROW_IPC_METADATA_VERSION_V2: Version 2. - * @GARROW_IPC_METADATA_VERSION_V3: Version 3. + * GArrowMetadataVersion: + * @GARROW_METADATA_VERSION_V1: Version 1. + * @GARROW_METADATA_VERSION_V2: Version 2. + * @GARROW_METADATA_VERSION_V3: Version 3. * * They are corresponding to `arrow::ipc::MetadataVersion::type` * values. 
*/ typedef enum { - GARROW_IPC_METADATA_VERSION_V1, - GARROW_IPC_METADATA_VERSION_V2, - GARROW_IPC_METADATA_VERSION_V3 -} GArrowIPCMetadataVersion; + GARROW_METADATA_VERSION_V1, + GARROW_METADATA_VERSION_V2, + GARROW_METADATA_VERSION_V3 +} GArrowMetadataVersion; G_END_DECLS diff --git a/c_glib/arrow-glib/ipc-metadata-version.hpp b/c_glib/arrow-glib/metadata-version.hpp similarity index 77% rename from c_glib/arrow-glib/ipc-metadata-version.hpp rename to c_glib/arrow-glib/metadata-version.hpp index 229565f002180..7b3865e59216b 100644 --- a/c_glib/arrow-glib/ipc-metadata-version.hpp +++ b/c_glib/arrow-glib/metadata-version.hpp @@ -21,7 +21,7 @@ #include -#include +#include -GArrowIPCMetadataVersion garrow_ipc_metadata_version_from_raw(arrow::ipc::MetadataVersion version); -arrow::ipc::MetadataVersion garrow_ipc_metadata_version_to_raw(GArrowIPCMetadataVersion version); +GArrowMetadataVersion garrow_metadata_version_from_raw(arrow::ipc::MetadataVersion version); +arrow::ipc::MetadataVersion garrow_metadata_version_to_raw(GArrowMetadataVersion version); diff --git a/c_glib/arrow-glib/io-output-stream.cpp b/c_glib/arrow-glib/output-stream.cpp similarity index 71% rename from c_glib/arrow-glib/io-output-stream.cpp rename to c_glib/arrow-glib/output-stream.cpp index bdf5587ba1c07..bbc29b794f7c6 100644 --- a/c_glib/arrow-glib/io-output-stream.cpp +++ b/c_glib/arrow-glib/output-stream.cpp @@ -24,33 +24,33 @@ #include #include -#include +#include G_BEGIN_DECLS /** - * SECTION: io-output-stream - * @title: GArrowIOOutputStream + * SECTION: output-stream + * @title: GArrowOutputStream * @short_description: Stream output interface * - * #GArrowIOOutputStream is an interface for stream output. Stream + * #GArrowOutputStream is an interface for stream output. Stream * output is file based and writeable */ -G_DEFINE_INTERFACE(GArrowIOOutputStream, - garrow_io_output_stream, +G_DEFINE_INTERFACE(GArrowOutputStream, + garrow_output_stream, G_TYPE_OBJECT) static void -garrow_io_output_stream_default_init (GArrowIOOutputStreamInterface *iface) +garrow_output_stream_default_init (GArrowOutputStreamInterface *iface) { } G_END_DECLS std::shared_ptr -garrow_io_output_stream_get_raw(GArrowIOOutputStream *output_stream) +garrow_output_stream_get_raw(GArrowOutputStream *output_stream) { - auto *iface = GARROW_IO_OUTPUT_STREAM_GET_IFACE(output_stream); + auto *iface = GARROW_OUTPUT_STREAM_GET_IFACE(output_stream); return iface->get_raw(output_stream); } diff --git a/c_glib/arrow-glib/io-output-stream.h b/c_glib/arrow-glib/output-stream.h similarity index 57% rename from c_glib/arrow-glib/io-output-stream.h rename to c_glib/arrow-glib/output-stream.h index 02478ce9621eb..3481072c27d8b 100644 --- a/c_glib/arrow-glib/io-output-stream.h +++ b/c_glib/arrow-glib/output-stream.h @@ -23,23 +23,23 @@ G_BEGIN_DECLS -#define GARROW_IO_TYPE_OUTPUT_STREAM \ - (garrow_io_output_stream_get_type()) -#define GARROW_IO_OUTPUT_STREAM(obj) \ +#define GARROW_TYPE_OUTPUT_STREAM \ + (garrow_output_stream_get_type()) +#define GARROW_OUTPUT_STREAM(obj) \ (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_IO_TYPE_OUTPUT_STREAM, \ - GArrowIOOutputStream)) -#define GARROW_IO_IS_OUTPUT_STREAM(obj) \ + GARROW_TYPE_OUTPUT_STREAM, \ + GArrowOutputStream)) +#define GARROW_IS_OUTPUT_STREAM(obj) \ (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_IO_TYPE_OUTPUT_STREAM)) -#define GARROW_IO_OUTPUT_STREAM_GET_IFACE(obj) \ + GARROW_TYPE_OUTPUT_STREAM)) +#define GARROW_OUTPUT_STREAM_GET_IFACE(obj) \ (G_TYPE_INSTANCE_GET_INTERFACE((obj), \ - 
GARROW_IO_TYPE_OUTPUT_STREAM, \ - GArrowIOOutputStreamInterface)) + GARROW_TYPE_OUTPUT_STREAM, \ + GArrowOutputStreamInterface)) -typedef struct _GArrowIOOutputStream GArrowIOOutputStream; -typedef struct _GArrowIOOutputStreamInterface GArrowIOOutputStreamInterface; +typedef struct _GArrowOutputStream GArrowOutputStream; +typedef struct _GArrowOutputStreamInterface GArrowOutputStreamInterface; -GType garrow_io_output_stream_get_type(void) G_GNUC_CONST; +GType garrow_output_stream_get_type(void) G_GNUC_CONST; G_END_DECLS diff --git a/c_glib/arrow-glib/io-output-stream.hpp b/c_glib/arrow-glib/output-stream.hpp similarity index 75% rename from c_glib/arrow-glib/io-output-stream.hpp rename to c_glib/arrow-glib/output-stream.hpp index f144130b1420e..635da10e24766 100644 --- a/c_glib/arrow-glib/io-output-stream.hpp +++ b/c_glib/arrow-glib/output-stream.hpp @@ -21,18 +21,18 @@ #include -#include +#include /** - * GArrowIOOutputStreamInterface: + * GArrowOutputStreamInterface: * * It wraps `arrow::io::OutputStream`. */ -struct _GArrowIOOutputStreamInterface +struct _GArrowOutputStreamInterface { GTypeInterface parent_iface; - std::shared_ptr (*get_raw)(GArrowIOOutputStream *file); + std::shared_ptr (*get_raw)(GArrowOutputStream *file); }; -std::shared_ptr garrow_io_output_stream_get_raw(GArrowIOOutputStream *output_stream); +std::shared_ptr garrow_output_stream_get_raw(GArrowOutputStream *output_stream); diff --git a/c_glib/arrow-glib/io-random-access-file.cpp b/c_glib/arrow-glib/random-access-file.cpp similarity index 70% rename from c_glib/arrow-glib/io-random-access-file.cpp rename to c_glib/arrow-glib/random-access-file.cpp index 552b879c19794..71f315ec7efaa 100644 --- a/c_glib/arrow-glib/io-random-access-file.cpp +++ b/c_glib/arrow-glib/random-access-file.cpp @@ -24,39 +24,39 @@ #include #include -#include +#include G_BEGIN_DECLS /** - * SECTION: io-random-access-file - * @title: GArrowIORandomAccessFile + * SECTION: random-access-file + * @title: GArrowRandomAccessFile * @short_description: File input interface * - * #GArrowIORandomAccessFile is an interface for file input. + * #GArrowRandomAccessFile is an interface for file input. */ -G_DEFINE_INTERFACE(GArrowIORandomAccessFile, - garrow_io_random_access_file, +G_DEFINE_INTERFACE(GArrowRandomAccessFile, + garrow_random_access_file, G_TYPE_OBJECT) static void -garrow_io_random_access_file_default_init (GArrowIORandomAccessFileInterface *iface) +garrow_random_access_file_default_init (GArrowRandomAccessFileInterface *iface) { } /** - * garrow_io_random_access_file_get_size: - * @file: A #GArrowIORandomAccessFile. + * garrow_random_access_file_get_size: + * @file: A #GArrowRandomAccessFile. * @error: (nullable): Return location for a #GError or %NULL. * * Returns: The size of the file. */ guint64 -garrow_io_random_access_file_get_size(GArrowIORandomAccessFile *file, +garrow_random_access_file_get_size(GArrowRandomAccessFile *file, GError **error) { - auto *iface = GARROW_IO_RANDOM_ACCESS_FILE_GET_IFACE(file); + auto *iface = GARROW_RANDOM_ACCESS_FILE_GET_IFACE(file); auto arrow_random_access_file = iface->get_raw(file); int64_t size; @@ -70,23 +70,23 @@ garrow_io_random_access_file_get_size(GArrowIORandomAccessFile *file, } /** - * garrow_io_random_access_file_get_support_zero_copy: - * @file: A #GArrowIORandomAccessFile. + * garrow_random_access_file_get_support_zero_copy: + * @file: A #GArrowRandomAccessFile. * * Returns: Whether zero copy read is supported or not. 
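The two queries documented above (size and zero-copy capability) are typically combined in application code. A hedged C sketch, where `file` stands for any object implementing the interface (for example the memory-mapped file from earlier) and is an assumption of the example:

    GError *error = NULL;
    guint64 size =
      garrow_random_access_file_get_size(GARROW_RANDOM_ACCESS_FILE(file),
                                         &error);
    if (error != NULL) {
      g_printerr("get_size failed: %s\n", error->message);
      g_clear_error(&error);
    } else if (garrow_random_access_file_get_support_zero_copy(
                 GARROW_RANDOM_ACCESS_FILE(file))) {
      g_print("%" G_GUINT64_FORMAT " bytes, zero-copy reads available\n",
              size);
    }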
*/ gboolean -garrow_io_random_access_file_get_support_zero_copy(GArrowIORandomAccessFile *file) +garrow_random_access_file_get_support_zero_copy(GArrowRandomAccessFile *file) { - auto *iface = GARROW_IO_RANDOM_ACCESS_FILE_GET_IFACE(file); + auto *iface = GARROW_RANDOM_ACCESS_FILE_GET_IFACE(file); auto arrow_random_access_file = iface->get_raw(file); return arrow_random_access_file->supports_zero_copy(); } /** - * garrow_io_random_access_file_read_at: - * @file: A #GArrowIORandomAccessFile. + * garrow_random_access_file_read_at: + * @file: A #GArrowRandomAccessFile. * @position: The read start position. * @n_bytes: The number of bytes to be read. * @n_read_bytes: (out): The read number of bytes. @@ -96,7 +96,7 @@ garrow_io_random_access_file_get_support_zero_copy(GArrowIORandomAccessFile *fil * Returns: %TRUE on success, %FALSE if there was an error. */ gboolean -garrow_io_random_access_file_read_at(GArrowIORandomAccessFile *file, +garrow_random_access_file_read_at(GArrowRandomAccessFile *file, gint64 position, gint64 n_bytes, gint64 *n_read_bytes, @@ -104,7 +104,7 @@ garrow_io_random_access_file_read_at(GArrowIORandomAccessFile *file, GError **error) { const auto arrow_random_access_file = - garrow_io_random_access_file_get_raw(file); + garrow_random_access_file_get_raw(file); auto status = arrow_random_access_file->ReadAt(position, n_bytes, @@ -121,8 +121,8 @@ garrow_io_random_access_file_read_at(GArrowIORandomAccessFile *file, G_END_DECLS std::shared_ptr -garrow_io_random_access_file_get_raw(GArrowIORandomAccessFile *random_access_file) +garrow_random_access_file_get_raw(GArrowRandomAccessFile *random_access_file) { - auto *iface = GARROW_IO_RANDOM_ACCESS_FILE_GET_IFACE(random_access_file); + auto *iface = GARROW_RANDOM_ACCESS_FILE_GET_IFACE(random_access_file); return iface->get_raw(random_access_file); } diff --git a/c_glib/arrow-glib/io-random-access-file.h b/c_glib/arrow-glib/random-access-file.h similarity index 57% rename from c_glib/arrow-glib/io-random-access-file.h rename to c_glib/arrow-glib/random-access-file.h index 8ac63e417a3f2..8a7f6b4218a31 100644 --- a/c_glib/arrow-glib/io-random-access-file.h +++ b/c_glib/arrow-glib/random-access-file.h @@ -23,29 +23,29 @@ G_BEGIN_DECLS -#define GARROW_IO_TYPE_RANDOM_ACCESS_FILE \ - (garrow_io_random_access_file_get_type()) -#define GARROW_IO_RANDOM_ACCESS_FILE(obj) \ +#define GARROW_TYPE_RANDOM_ACCESS_FILE \ + (garrow_random_access_file_get_type()) +#define GARROW_RANDOM_ACCESS_FILE(obj) \ (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_IO_TYPE_RANDOM_ACCESS_FILE, \ - GArrowIORandomAccessFile)) -#define GARROW_IO_IS_RANDOM_ACCESS_FILE(obj) \ + GARROW_TYPE_RANDOM_ACCESS_FILE, \ + GArrowRandomAccessFile)) +#define GARROW_IS_RANDOM_ACCESS_FILE(obj) \ (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_IO_TYPE_RANDOM_ACCESS_FILE)) -#define GARROW_IO_RANDOM_ACCESS_FILE_GET_IFACE(obj) \ + GARROW_TYPE_RANDOM_ACCESS_FILE)) +#define GARROW_RANDOM_ACCESS_FILE_GET_IFACE(obj) \ (G_TYPE_INSTANCE_GET_INTERFACE((obj), \ - GARROW_IO_TYPE_RANDOM_ACCESS_FILE, \ - GArrowIORandomAccessFileInterface)) + GARROW_TYPE_RANDOM_ACCESS_FILE, \ + GArrowRandomAccessFileInterface)) -typedef struct _GArrowIORandomAccessFile GArrowIORandomAccessFile; -typedef struct _GArrowIORandomAccessFileInterface GArrowIORandomAccessFileInterface; +typedef struct _GArrowRandomAccessFile GArrowRandomAccessFile; +typedef struct _GArrowRandomAccessFileInterface GArrowRandomAccessFileInterface; -GType garrow_io_random_access_file_get_type(void) G_GNUC_CONST; +GType 
garrow_random_access_file_get_type(void) G_GNUC_CONST; -guint64 garrow_io_random_access_file_get_size(GArrowIORandomAccessFile *file, +guint64 garrow_random_access_file_get_size(GArrowRandomAccessFile *file, GError **error); -gboolean garrow_io_random_access_file_get_support_zero_copy(GArrowIORandomAccessFile *file); -gboolean garrow_io_random_access_file_read_at(GArrowIORandomAccessFile *file, +gboolean garrow_random_access_file_get_support_zero_copy(GArrowRandomAccessFile *file); +gboolean garrow_random_access_file_read_at(GArrowRandomAccessFile *file, gint64 position, gint64 n_bytes, gint64 *n_read_bytes, diff --git a/c_glib/arrow-glib/io-random-access-file.hpp b/c_glib/arrow-glib/random-access-file.hpp similarity index 78% rename from c_glib/arrow-glib/io-random-access-file.hpp rename to c_glib/arrow-glib/random-access-file.hpp index 7c97c9ecedb5b..6d6fed70b4b62 100644 --- a/c_glib/arrow-glib/io-random-access-file.hpp +++ b/c_glib/arrow-glib/random-access-file.hpp @@ -21,18 +21,18 @@ #include -#include +#include /** - * GArrowIORandomAccessFileInterface: + * GArrowRandomAccessFileInterface: * * It wraps `arrow::io::RandomAccessFile`. */ -struct _GArrowIORandomAccessFileInterface +struct _GArrowRandomAccessFileInterface { GTypeInterface parent_iface; - std::shared_ptr (*get_raw)(GArrowIORandomAccessFile *file); + std::shared_ptr (*get_raw)(GArrowRandomAccessFile *file); }; -std::shared_ptr garrow_io_random_access_file_get_raw(GArrowIORandomAccessFile *random_access_file); +std::shared_ptr garrow_random_access_file_get_raw(GArrowRandomAccessFile *random_access_file); diff --git a/c_glib/arrow-glib/io-readable.cpp b/c_glib/arrow-glib/readable.cpp similarity index 75% rename from c_glib/arrow-glib/io-readable.cpp rename to c_glib/arrow-glib/readable.cpp index b372a66090ceb..b8c0cd99df06a 100644 --- a/c_glib/arrow-glib/io-readable.cpp +++ b/c_glib/arrow-glib/readable.cpp @@ -24,31 +24,31 @@ #include #include -#include +#include G_BEGIN_DECLS /** - * SECTION: io-readable - * @title: GArrowIOReadable + * SECTION: readable + * @title: GArrowReadable * @short_description: Input interface * - * #GArrowIOReadable is an interface for input. Input must be + * #GArrowReadable is an interface for input. Input must be * readable. */ -G_DEFINE_INTERFACE(GArrowIOReadable, - garrow_io_readable, +G_DEFINE_INTERFACE(GArrowReadable, + garrow_readable, G_TYPE_OBJECT) static void -garrow_io_readable_default_init (GArrowIOReadableInterface *iface) +garrow_readable_default_init (GArrowReadableInterface *iface) { } /** - * garrow_io_readable_read: - * @readable: A #GArrowIOReadable. + * garrow_readable_read: + * @readable: A #GArrowReadable. * @n_bytes: The number of bytes to be read. * @n_read_bytes: (out): The read number of bytes. * @buffer: (array length=n_bytes): The buffer to be read data. @@ -57,13 +57,13 @@ garrow_io_readable_default_init (GArrowIOReadableInterface *iface) * Returns: %TRUE on success, %FALSE if there was an error. 
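Because #GArrowMemoryMappedFile implements #GArrowRandomAccessFile, positioned reads work directly on it. A short sketch of read_at under the same assumptions (`file` already exists; the offset 128 and length 16 are arbitrary example values):

    guint8 buffer[16];
    gint64 n_read = 0;
    GError *error = NULL;

    /* Read up to 16 bytes starting at absolute byte offset 128. */
    if (!garrow_random_access_file_read_at(GARROW_RANDOM_ACCESS_FILE(file),
                                           128,
                                           sizeof(buffer),
                                           &n_read,
                                           buffer,
                                           &error)) {
      g_printerr("read_at failed: %s\n", error->message);
      g_clear_error(&error);
    }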
*/ gboolean -garrow_io_readable_read(GArrowIOReadable *readable, +garrow_readable_read(GArrowReadable *readable, gint64 n_bytes, gint64 *n_read_bytes, guint8 *buffer, GError **error) { - const auto arrow_readable = garrow_io_readable_get_raw(readable); + const auto arrow_readable = garrow_readable_get_raw(readable); auto status = arrow_readable->Read(n_bytes, n_read_bytes, buffer); if (status.ok()) { @@ -77,8 +77,8 @@ garrow_io_readable_read(GArrowIOReadable *readable, G_END_DECLS std::shared_ptr -garrow_io_readable_get_raw(GArrowIOReadable *readable) +garrow_readable_get_raw(GArrowReadable *readable) { - auto *iface = GARROW_IO_READABLE_GET_IFACE(readable); + auto *iface = GARROW_READABLE_GET_IFACE(readable); return iface->get_raw(readable); } diff --git a/c_glib/arrow-glib/io-readable.h b/c_glib/arrow-glib/readable.h similarity index 60% rename from c_glib/arrow-glib/io-readable.h rename to c_glib/arrow-glib/readable.h index 279984b3014a3..bde4b01ee1f15 100644 --- a/c_glib/arrow-glib/io-readable.h +++ b/c_glib/arrow-glib/readable.h @@ -23,26 +23,26 @@ G_BEGIN_DECLS -#define GARROW_IO_TYPE_READABLE \ - (garrow_io_readable_get_type()) -#define GARROW_IO_READABLE(obj) \ +#define GARROW_TYPE_READABLE \ + (garrow_readable_get_type()) +#define GARROW_READABLE(obj) \ (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_IO_TYPE_READABLE, \ - GArrowIOReadable)) -#define GARROW_IO_IS_READABLE(obj) \ + GARROW_TYPE_READABLE, \ + GArrowReadable)) +#define GARROW_IS_READABLE(obj) \ (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_IO_TYPE_READABLE)) -#define GARROW_IO_READABLE_GET_IFACE(obj) \ + GARROW_TYPE_READABLE)) +#define GARROW_READABLE_GET_IFACE(obj) \ (G_TYPE_INSTANCE_GET_INTERFACE((obj), \ - GARROW_IO_TYPE_READABLE, \ - GArrowIOReadableInterface)) + GARROW_TYPE_READABLE, \ + GArrowReadableInterface)) -typedef struct _GArrowIOReadable GArrowIOReadable; -typedef struct _GArrowIOReadableInterface GArrowIOReadableInterface; +typedef struct _GArrowReadable GArrowReadable; +typedef struct _GArrowReadableInterface GArrowReadableInterface; -GType garrow_io_readable_get_type(void) G_GNUC_CONST; +GType garrow_readable_get_type(void) G_GNUC_CONST; -gboolean garrow_io_readable_read(GArrowIOReadable *readable, +gboolean garrow_readable_read(GArrowReadable *readable, gint64 n_bytes, gint64 *n_read_bytes, guint8 *buffer, diff --git a/c_glib/arrow-glib/io-readable.hpp b/c_glib/arrow-glib/readable.hpp similarity index 77% rename from c_glib/arrow-glib/io-readable.hpp rename to c_glib/arrow-glib/readable.hpp index 3d27b3f92ba78..c241c77aa0329 100644 --- a/c_glib/arrow-glib/io-readable.hpp +++ b/c_glib/arrow-glib/readable.hpp @@ -21,18 +21,18 @@ #include -#include +#include /** - * GArrowIOReadableInterface: + * GArrowReadableInterface: * * It wraps `arrow::io::Readable`. 
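The sequential counterpart goes through #GArrowReadable, which advances an implicit position instead of taking an offset. A sketch under the same assumption that `file` is already opened:

    guint8 chunk[4096];
    gint64 n_read = 0;
    GError *error = NULL;

    /* Read the next chunk from the current position. */
    if (garrow_readable_read(GARROW_READABLE(file),
                             sizeof(chunk),
                             &n_read,
                             chunk,
                             &error)) {
      g_print("read %" G_GINT64_FORMAT " bytes\n", n_read);
    } else {
      g_printerr("read failed: %s\n", error->message);
      g_clear_error(&error);
    }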
*/ -struct _GArrowIOReadableInterface +struct _GArrowReadableInterface { GTypeInterface parent_iface; - std::shared_ptr (*get_raw)(GArrowIOReadable *file); + std::shared_ptr (*get_raw)(GArrowReadable *file); }; -std::shared_ptr garrow_io_readable_get_raw(GArrowIOReadable *readable); +std::shared_ptr garrow_readable_get_raw(GArrowReadable *readable); diff --git a/c_glib/arrow-glib/ipc-stream-reader.cpp b/c_glib/arrow-glib/stream-reader.cpp similarity index 63% rename from c_glib/arrow-glib/ipc-stream-reader.cpp rename to c_glib/arrow-glib/stream-reader.cpp index 48047842aaac6..c4ccebe56f6ba 100644 --- a/c_glib/arrow-glib/ipc-stream-reader.cpp +++ b/c_glib/arrow-glib/stream-reader.cpp @@ -27,60 +27,60 @@ #include #include -#include +#include -#include -#include +#include +#include G_BEGIN_DECLS /** - * SECTION: ipc-stream-reader + * SECTION: stream-reader * @short_description: Stream reader class * - * #GArrowIPCStreamReader is a class for receiving data by stream + * #GArrowStreamReader is a class for receiving data by stream * based IPC. */ -typedef struct GArrowIPCStreamReaderPrivate_ { +typedef struct GArrowStreamReaderPrivate_ { std::shared_ptr stream_reader; -} GArrowIPCStreamReaderPrivate; +} GArrowStreamReaderPrivate; enum { PROP_0, PROP_STREAM_READER }; -G_DEFINE_TYPE_WITH_PRIVATE(GArrowIPCStreamReader, - garrow_ipc_stream_reader, +G_DEFINE_TYPE_WITH_PRIVATE(GArrowStreamReader, + garrow_stream_reader, G_TYPE_OBJECT); -#define GARROW_IPC_STREAM_READER_GET_PRIVATE(obj) \ +#define GARROW_STREAM_READER_GET_PRIVATE(obj) \ (G_TYPE_INSTANCE_GET_PRIVATE((obj), \ - GARROW_IPC_TYPE_STREAM_READER, \ - GArrowIPCStreamReaderPrivate)) + GARROW_TYPE_STREAM_READER, \ + GArrowStreamReaderPrivate)) static void -garrow_ipc_stream_reader_finalize(GObject *object) +garrow_stream_reader_finalize(GObject *object) { - GArrowIPCStreamReaderPrivate *priv; + GArrowStreamReaderPrivate *priv; - priv = GARROW_IPC_STREAM_READER_GET_PRIVATE(object); + priv = GARROW_STREAM_READER_GET_PRIVATE(object); priv->stream_reader = nullptr; - G_OBJECT_CLASS(garrow_ipc_stream_reader_parent_class)->finalize(object); + G_OBJECT_CLASS(garrow_stream_reader_parent_class)->finalize(object); } static void -garrow_ipc_stream_reader_set_property(GObject *object, +garrow_stream_reader_set_property(GObject *object, guint prop_id, const GValue *value, GParamSpec *pspec) { - GArrowIPCStreamReaderPrivate *priv; + GArrowStreamReaderPrivate *priv; - priv = GARROW_IPC_STREAM_READER_GET_PRIVATE(object); + priv = GARROW_STREAM_READER_GET_PRIVATE(object); switch (prop_id) { case PROP_STREAM_READER: @@ -94,7 +94,7 @@ garrow_ipc_stream_reader_set_property(GObject *object, } static void -garrow_ipc_stream_reader_get_property(GObject *object, +garrow_stream_reader_get_property(GObject *object, guint prop_id, GValue *value, GParamSpec *pspec) @@ -107,21 +107,21 @@ garrow_ipc_stream_reader_get_property(GObject *object, } static void -garrow_ipc_stream_reader_init(GArrowIPCStreamReader *object) +garrow_stream_reader_init(GArrowStreamReader *object) { } static void -garrow_ipc_stream_reader_class_init(GArrowIPCStreamReaderClass *klass) +garrow_stream_reader_class_init(GArrowStreamReaderClass *klass) { GObjectClass *gobject_class; GParamSpec *spec; gobject_class = G_OBJECT_CLASS(klass); - gobject_class->finalize = garrow_ipc_stream_reader_finalize; - gobject_class->set_property = garrow_ipc_stream_reader_set_property; - gobject_class->get_property = garrow_ipc_stream_reader_get_property; + gobject_class->finalize = garrow_stream_reader_finalize; + 
gobject_class->set_property = garrow_stream_reader_set_property;
+  gobject_class->get_property = garrow_stream_reader_get_property;
 
   spec = g_param_spec_pointer("stream-reader",
                               "ipc::StreamReader",
@@ -132,23 +132,23 @@ garrow_ipc_stream_reader_class_init(GArrowIPCStreamReaderClass *klass)
 }
 
 /**
- * garrow_ipc_stream_reader_open:
+ * garrow_stream_reader_open:
  * @stream: The stream to be read.
  * @error: (nullable): Return location for a #GError or %NULL.
 *
 * Returns: (nullable) (transfer full): A newly opened
- * #GArrowIPCStreamReader or %NULL on error.
+ * #GArrowStreamReader or %NULL on error.
 */
-GArrowIPCStreamReader *
-garrow_ipc_stream_reader_open(GArrowIOInputStream *stream,
+GArrowStreamReader *
+garrow_stream_reader_open(GArrowInputStream *stream,
                           GError **error)
 {
   std::shared_ptr<arrow::ipc::StreamReader> arrow_stream_reader;
   auto status =
-    arrow::ipc::StreamReader::Open(garrow_io_input_stream_get_raw(stream),
+    arrow::ipc::StreamReader::Open(garrow_input_stream_get_raw(stream),
                                    &arrow_stream_reader);
   if (status.ok()) {
-    return garrow_ipc_stream_reader_new_raw(&arrow_stream_reader);
+    return garrow_stream_reader_new_raw(&arrow_stream_reader);
   } else {
     garrow_error_set(error, status, "[ipc][stream-reader][open]");
     return NULL;
@@ -156,34 +156,34 @@ garrow_ipc_stream_reader_open(GArrowIOInputStream *stream,
 }
 
 /**
- * garrow_ipc_stream_reader_get_schema:
- * @stream_reader: A #GArrowIPCStreamReader.
+ * garrow_stream_reader_get_schema:
+ * @stream_reader: A #GArrowStreamReader.
 *
 * Returns: (transfer full): The schema in the stream.
 */
 GArrowSchema *
-garrow_ipc_stream_reader_get_schema(GArrowIPCStreamReader *stream_reader)
+garrow_stream_reader_get_schema(GArrowStreamReader *stream_reader)
 {
   auto arrow_stream_reader =
-    garrow_ipc_stream_reader_get_raw(stream_reader);
+    garrow_stream_reader_get_raw(stream_reader);
   auto arrow_schema = arrow_stream_reader->schema();
   return garrow_schema_new_raw(&arrow_schema);
 }
 
 /**
- * garrow_ipc_stream_reader_get_next_record_batch:
- * @stream_reader: A #GArrowIPCStreamReader.
+ * garrow_stream_reader_get_next_record_batch:
+ * @stream_reader: A #GArrowStreamReader.
  * @error: (nullable): Return location for a #GError or %NULL.
 *
 * Returns: (nullable) (transfer full):
 * The next record batch in the stream or %NULL on end of stream.
*/ GArrowRecordBatch * -garrow_ipc_stream_reader_get_next_record_batch(GArrowIPCStreamReader *stream_reader, +garrow_stream_reader_get_next_record_batch(GArrowStreamReader *stream_reader, GError **error) { auto arrow_stream_reader = - garrow_ipc_stream_reader_get_raw(stream_reader); + garrow_stream_reader_get_raw(stream_reader); std::shared_ptr arrow_record_batch; auto status = arrow_stream_reader->GetNextRecordBatch(&arrow_record_batch); @@ -201,21 +201,21 @@ garrow_ipc_stream_reader_get_next_record_batch(GArrowIPCStreamReader *stream_rea G_END_DECLS -GArrowIPCStreamReader * -garrow_ipc_stream_reader_new_raw(std::shared_ptr *arrow_stream_reader) +GArrowStreamReader * +garrow_stream_reader_new_raw(std::shared_ptr *arrow_stream_reader) { auto stream_reader = - GARROW_IPC_STREAM_READER(g_object_new(GARROW_IPC_TYPE_STREAM_READER, + GARROW_STREAM_READER(g_object_new(GARROW_TYPE_STREAM_READER, "stream-reader", arrow_stream_reader, NULL)); return stream_reader; } std::shared_ptr -garrow_ipc_stream_reader_get_raw(GArrowIPCStreamReader *stream_reader) +garrow_stream_reader_get_raw(GArrowStreamReader *stream_reader) { - GArrowIPCStreamReaderPrivate *priv; + GArrowStreamReaderPrivate *priv; - priv = GARROW_IPC_STREAM_READER_GET_PRIVATE(stream_reader); + priv = GARROW_STREAM_READER_GET_PRIVATE(stream_reader); return priv->stream_reader; } diff --git a/c_glib/arrow-glib/stream-reader.h b/c_glib/arrow-glib/stream-reader.h new file mode 100644 index 0000000000000..16a7f57bf801b --- /dev/null +++ b/c_glib/arrow-glib/stream-reader.h @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include +#include + +#include + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_STREAM_READER \ + (garrow_stream_reader_get_type()) +#define GARROW_STREAM_READER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_STREAM_READER, \ + GArrowStreamReader)) +#define GARROW_STREAM_READER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_STREAM_READER, \ + GArrowStreamReaderClass)) +#define GARROW_IS_STREAM_READER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_STREAM_READER)) +#define GARROW_IS_STREAM_READER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_STREAM_READER)) +#define GARROW_STREAM_READER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_STREAM_READER, \ + GArrowStreamReaderClass)) + +typedef struct _GArrowStreamReader GArrowStreamReader; +typedef struct _GArrowStreamReaderClass GArrowStreamReaderClass; + +/** + * GArrowStreamReader: + * + * It wraps `arrow::ipc::StreamReader`. 
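Putting the reader pieces together, an end-to-end C sketch of consuming a stream. It assumes `input` is an already-constructed #GArrowInputStream, which this change does not create; note that get_next_record_batch returns %NULL both at end of stream and on error, so the GError must be checked after the loop:

    static void
    consume_stream(GArrowInputStream *input)
    {
      GError *error = NULL;
      GArrowStreamReader *reader = garrow_stream_reader_open(input, &error);
      if (reader == NULL) {
        g_printerr("open failed: %s\n", error->message);
        g_clear_error(&error);
        return;
      }

      GArrowSchema *schema = garrow_stream_reader_get_schema(reader);
      /* ... inspect the schema ... */
      g_object_unref(schema);

      while (TRUE) {
        GArrowRecordBatch *batch =
          garrow_stream_reader_get_next_record_batch(reader, &error);
        if (batch == NULL) {
          break;  /* end of stream, or error is set on failure */
        }
        /* ... process the batch ... */
        g_object_unref(batch);
      }
      if (error != NULL) {
        g_printerr("read failed: %s\n", error->message);
        g_clear_error(&error);
      }
      g_object_unref(reader);
    }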
+ */ +struct _GArrowStreamReader +{ + /*< private >*/ + GObject parent_instance; +}; + +struct _GArrowStreamReaderClass +{ + GObjectClass parent_class; +}; + +GType garrow_stream_reader_get_type(void) G_GNUC_CONST; + +GArrowStreamReader *garrow_stream_reader_open(GArrowInputStream *stream, + GError **error); + +GArrowSchema *garrow_stream_reader_get_schema(GArrowStreamReader *stream_reader); +GArrowRecordBatch *garrow_stream_reader_get_next_record_batch(GArrowStreamReader *stream_reader, + GError **error); + +G_END_DECLS diff --git a/c_glib/arrow-glib/ipc-stream-reader.hpp b/c_glib/arrow-glib/stream-reader.hpp similarity index 75% rename from c_glib/arrow-glib/ipc-stream-reader.hpp rename to c_glib/arrow-glib/stream-reader.hpp index a35bdab7e69d4..ca8e6895a4fd6 100644 --- a/c_glib/arrow-glib/ipc-stream-reader.hpp +++ b/c_glib/arrow-glib/stream-reader.hpp @@ -22,7 +22,7 @@ #include #include -#include +#include -GArrowIPCStreamReader *garrow_ipc_stream_reader_new_raw(std::shared_ptr *arrow_stream_reader); -std::shared_ptr garrow_ipc_stream_reader_get_raw(GArrowIPCStreamReader *stream_reader); +GArrowStreamReader *garrow_stream_reader_new_raw(std::shared_ptr *arrow_stream_reader); +std::shared_ptr garrow_stream_reader_get_raw(GArrowStreamReader *stream_reader); diff --git a/c_glib/arrow-glib/ipc-stream-writer.cpp b/c_glib/arrow-glib/stream-writer.cpp similarity index 65% rename from c_glib/arrow-glib/ipc-stream-writer.cpp rename to c_glib/arrow-glib/stream-writer.cpp index e2455a4a9c61c..016ce93759c87 100644 --- a/c_glib/arrow-glib/ipc-stream-writer.cpp +++ b/c_glib/arrow-glib/stream-writer.cpp @@ -28,59 +28,59 @@ #include #include -#include +#include -#include +#include G_BEGIN_DECLS /** - * SECTION: ipc-stream-writer + * SECTION: stream-writer * @short_description: Stream writer class * - * #GArrowIPCStreamWriter is a class for sending data by stream based + * #GArrowStreamWriter is a class for sending data by stream based * IPC. 
 */
-typedef struct GArrowIPCStreamWriterPrivate_ {
-  std::shared_ptr<arrow::ipc::StreamWriter> stream_writer;
-} GArrowIPCStreamWriterPrivate;
+typedef struct GArrowStreamWriterPrivate_ {
+  std::shared_ptr<arrow::ipc::StreamWriter> stream_writer;
+} GArrowStreamWriterPrivate;
 
 enum {
   PROP_0,
   PROP_STREAM_WRITER
 };
 
-G_DEFINE_TYPE_WITH_PRIVATE(GArrowIPCStreamWriter,
-                           garrow_ipc_stream_writer,
+G_DEFINE_TYPE_WITH_PRIVATE(GArrowStreamWriter,
+                           garrow_stream_writer,
                            G_TYPE_OBJECT);
 
-#define GARROW_IPC_STREAM_WRITER_GET_PRIVATE(obj)       \
+#define GARROW_STREAM_WRITER_GET_PRIVATE(obj)           \
   (G_TYPE_INSTANCE_GET_PRIVATE((obj),                   \
-                               GARROW_IPC_TYPE_STREAM_WRITER,  \
-                               GArrowIPCStreamWriterPrivate))
+                               GARROW_TYPE_STREAM_WRITER,      \
+                               GArrowStreamWriterPrivate))
 
 static void
-garrow_ipc_stream_writer_finalize(GObject *object)
+garrow_stream_writer_finalize(GObject *object)
 {
-  GArrowIPCStreamWriterPrivate *priv;
+  GArrowStreamWriterPrivate *priv;
 
-  priv = GARROW_IPC_STREAM_WRITER_GET_PRIVATE(object);
+  priv = GARROW_STREAM_WRITER_GET_PRIVATE(object);
 
   priv->stream_writer = nullptr;
 
-  G_OBJECT_CLASS(garrow_ipc_stream_writer_parent_class)->finalize(object);
+  G_OBJECT_CLASS(garrow_stream_writer_parent_class)->finalize(object);
 }
 
 static void
-garrow_ipc_stream_writer_set_property(GObject *object,
+garrow_stream_writer_set_property(GObject *object,
                                   guint prop_id,
                                   const GValue *value,
                                   GParamSpec *pspec)
 {
-  GArrowIPCStreamWriterPrivate *priv;
+  GArrowStreamWriterPrivate *priv;
 
-  priv = GARROW_IPC_STREAM_WRITER_GET_PRIVATE(object);
+  priv = GARROW_STREAM_WRITER_GET_PRIVATE(object);
 
   switch (prop_id) {
   case PROP_STREAM_WRITER:
@@ -94,7 +94,7 @@ garrow_ipc_stream_writer_set_property(GObject *object,
 }
 
 static void
-garrow_ipc_stream_writer_get_property(GObject *object,
+garrow_stream_writer_get_property(GObject *object,
                                   guint prop_id,
                                   GValue *value,
                                   GParamSpec *pspec)
@@ -107,21 +107,21 @@ garrow_ipc_stream_writer_get_property(GObject *object,
 }
 
 static void
-garrow_ipc_stream_writer_init(GArrowIPCStreamWriter *object)
+garrow_stream_writer_init(GArrowStreamWriter *object)
 {
 }
 
 static void
-garrow_ipc_stream_writer_class_init(GArrowIPCStreamWriterClass *klass)
+garrow_stream_writer_class_init(GArrowStreamWriterClass *klass)
 {
   GObjectClass *gobject_class;
   GParamSpec *spec;
 
   gobject_class = G_OBJECT_CLASS(klass);
 
-  gobject_class->finalize = garrow_ipc_stream_writer_finalize;
-  gobject_class->set_property = garrow_ipc_stream_writer_set_property;
-  gobject_class->get_property = garrow_ipc_stream_writer_get_property;
+  gobject_class->finalize = garrow_stream_writer_finalize;
+  gobject_class->set_property = garrow_stream_writer_set_property;
+  gobject_class->get_property = garrow_stream_writer_get_property;
 
   spec = g_param_spec_pointer("stream-writer",
                               "ipc::StreamWriter",
@@ -132,26 +132,26 @@ garrow_ipc_stream_writer_class_init(GArrowIPCStreamWriterClass *klass)
 }
 
 /**
- * garrow_ipc_stream_writer_open:
+ * garrow_stream_writer_open:
  * @sink: The output of the writer.
  * @schema: The schema of the writer.
  * @error: (nullable): Return location for a #GError or %NULL.
 *
 * Returns: (nullable) (transfer full): A newly opened
- * #GArrowIPCStreamWriter or %NULL on error.
+ * #GArrowStreamWriter or %NULL on error.
 */
-GArrowIPCStreamWriter *
-garrow_ipc_stream_writer_open(GArrowIOOutputStream *sink,
+GArrowStreamWriter *
+garrow_stream_writer_open(GArrowOutputStream *sink,
                           GArrowSchema *schema,
                           GError **error)
 {
   std::shared_ptr<arrow::ipc::StreamWriter> arrow_stream_writer;
   auto status =
-    arrow::ipc::StreamWriter::Open(garrow_io_output_stream_get_raw(sink).get(),
+    arrow::ipc::StreamWriter::Open(garrow_output_stream_get_raw(sink).get(),
                                    garrow_schema_get_raw(schema),
                                    &arrow_stream_writer);
   if (status.ok()) {
-    return garrow_ipc_stream_writer_new_raw(&arrow_stream_writer);
+    return garrow_stream_writer_new_raw(&arrow_stream_writer);
   } else {
     garrow_error_set(error, status, "[ipc][stream-writer][open]");
     return NULL;
@@ -159,20 +159,20 @@ garrow_ipc_stream_writer_open(GArrowIOOutputStream *sink,
 }
 
 /**
- * garrow_ipc_stream_writer_write_record_batch:
- * @stream_writer: A #GArrowIPCStreamWriter.
+ * garrow_stream_writer_write_record_batch:
+ * @stream_writer: A #GArrowStreamWriter.
  * @record_batch: The record batch to be written.
  * @error: (nullable): Return location for a #GError or %NULL.
 *
 * Returns: %TRUE on success, %FALSE if there was an error.
 */
 gboolean
-garrow_ipc_stream_writer_write_record_batch(GArrowIPCStreamWriter *stream_writer,
+garrow_stream_writer_write_record_batch(GArrowStreamWriter *stream_writer,
                                         GArrowRecordBatch *record_batch,
                                         GError **error)
 {
   auto arrow_stream_writer =
-    garrow_ipc_stream_writer_get_raw(stream_writer);
+    garrow_stream_writer_get_raw(stream_writer);
   auto arrow_record_batch =
     garrow_record_batch_get_raw(record_batch);
   auto arrow_record_batch_raw =
@@ -188,18 +188,18 @@ garrow_ipc_stream_writer_write_record_batch(GArrowIPCStreamWriter *stream_writer
 }
 
 /**
- * garrow_ipc_stream_writer_close:
- * @stream_writer: A #GArrowIPCStreamWriter.
+ * garrow_stream_writer_close:
+ * @stream_writer: A #GArrowStreamWriter.
  * @error: (nullable): Return location for a #GError or %NULL.
 *
 * Returns: %TRUE on success, %FALSE if there was an error.
*/ gboolean -garrow_ipc_stream_writer_close(GArrowIPCStreamWriter *stream_writer, +garrow_stream_writer_close(GArrowStreamWriter *stream_writer, GError **error) { auto arrow_stream_writer = - garrow_ipc_stream_writer_get_raw(stream_writer); + garrow_stream_writer_get_raw(stream_writer); auto status = arrow_stream_writer->Close(); if (status.ok()) { @@ -212,21 +212,21 @@ garrow_ipc_stream_writer_close(GArrowIPCStreamWriter *stream_writer, G_END_DECLS -GArrowIPCStreamWriter * -garrow_ipc_stream_writer_new_raw(std::shared_ptr *arrow_stream_writer) +GArrowStreamWriter * +garrow_stream_writer_new_raw(std::shared_ptr *arrow_stream_writer) { auto stream_writer = - GARROW_IPC_STREAM_WRITER(g_object_new(GARROW_IPC_TYPE_STREAM_WRITER, + GARROW_STREAM_WRITER(g_object_new(GARROW_TYPE_STREAM_WRITER, "stream-writer", arrow_stream_writer, NULL)); return stream_writer; } std::shared_ptr -garrow_ipc_stream_writer_get_raw(GArrowIPCStreamWriter *stream_writer) +garrow_stream_writer_get_raw(GArrowStreamWriter *stream_writer) { - GArrowIPCStreamWriterPrivate *priv; + GArrowStreamWriterPrivate *priv; - priv = GARROW_IPC_STREAM_WRITER_GET_PRIVATE(stream_writer); + priv = GARROW_STREAM_WRITER_GET_PRIVATE(stream_writer); return priv->stream_writer; } diff --git a/c_glib/arrow-glib/ipc-stream-writer.h b/c_glib/arrow-glib/stream-writer.h similarity index 54% rename from c_glib/arrow-glib/ipc-stream-writer.h rename to c_glib/arrow-glib/stream-writer.h index 4488204736d51..6e773f1fc316e 100644 --- a/c_glib/arrow-glib/ipc-stream-writer.h +++ b/c_glib/arrow-glib/stream-writer.h @@ -23,60 +23,60 @@ #include #include -#include +#include G_BEGIN_DECLS -#define GARROW_IPC_TYPE_STREAM_WRITER \ - (garrow_ipc_stream_writer_get_type()) -#define GARROW_IPC_STREAM_WRITER(obj) \ +#define GARROW_TYPE_STREAM_WRITER \ + (garrow_stream_writer_get_type()) +#define GARROW_STREAM_WRITER(obj) \ (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_IPC_TYPE_STREAM_WRITER, \ - GArrowIPCStreamWriter)) -#define GARROW_IPC_STREAM_WRITER_CLASS(klass) \ + GARROW_TYPE_STREAM_WRITER, \ + GArrowStreamWriter)) +#define GARROW_STREAM_WRITER_CLASS(klass) \ (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_IPC_TYPE_STREAM_WRITER, \ - GArrowIPCStreamWriterClass)) -#define GARROW_IPC_IS_STREAM_WRITER(obj) \ + GARROW_TYPE_STREAM_WRITER, \ + GArrowStreamWriterClass)) +#define GARROW_IS_STREAM_WRITER(obj) \ (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_IPC_TYPE_STREAM_WRITER)) -#define GARROW_IPC_IS_STREAM_WRITER_CLASS(klass) \ + GARROW_TYPE_STREAM_WRITER)) +#define GARROW_IS_STREAM_WRITER_CLASS(klass) \ (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_IPC_TYPE_STREAM_WRITER)) -#define GARROW_IPC_STREAM_WRITER_GET_CLASS(obj) \ + GARROW_TYPE_STREAM_WRITER)) +#define GARROW_STREAM_WRITER_GET_CLASS(obj) \ (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_IPC_TYPE_STREAM_WRITER, \ - GArrowIPCStreamWriterClass)) + GARROW_TYPE_STREAM_WRITER, \ + GArrowStreamWriterClass)) -typedef struct _GArrowIPCStreamWriter GArrowIPCStreamWriter; -typedef struct _GArrowIPCStreamWriterClass GArrowIPCStreamWriterClass; +typedef struct _GArrowStreamWriter GArrowStreamWriter; +typedef struct _GArrowStreamWriterClass GArrowStreamWriterClass; /** - * GArrowIPCStreamWriter: + * GArrowStreamWriter: * * It wraps `arrow::ipc::StreamWriter`. 
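For the writing side, a mirror-image sketch of the renamed writer API. It is illustrative only: `sink`, `schema`, and `batch` are assumed to exist already, and the helper name is hypothetical. Close is called even when the write fails so the writer is not leaked, and %NULL is passed for the second error location because a GError must not be set twice:

    static gboolean
    write_one_batch(GArrowOutputStream *sink,
                    GArrowSchema *schema,
                    GArrowRecordBatch *batch,
                    GError **error)
    {
      GArrowStreamWriter *writer =
        garrow_stream_writer_open(sink, schema, error);
      if (writer == NULL) {
        return FALSE;
      }
      gboolean success =
        garrow_stream_writer_write_record_batch(writer, batch, error);
      /* Always close; keep the first error if the write already failed. */
      if (!garrow_stream_writer_close(writer, success ? error : NULL)) {
        success = FALSE;
      }
      g_object_unref(writer);
      return success;
    }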
*/ -struct _GArrowIPCStreamWriter +struct _GArrowStreamWriter { /*< private >*/ GObject parent_instance; }; -struct _GArrowIPCStreamWriterClass +struct _GArrowStreamWriterClass { GObjectClass parent_class; }; -GType garrow_ipc_stream_writer_get_type(void) G_GNUC_CONST; +GType garrow_stream_writer_get_type(void) G_GNUC_CONST; -GArrowIPCStreamWriter *garrow_ipc_stream_writer_open(GArrowIOOutputStream *sink, +GArrowStreamWriter *garrow_stream_writer_open(GArrowOutputStream *sink, GArrowSchema *schema, GError **error); -gboolean garrow_ipc_stream_writer_write_record_batch(GArrowIPCStreamWriter *stream_writer, +gboolean garrow_stream_writer_write_record_batch(GArrowStreamWriter *stream_writer, GArrowRecordBatch *record_batch, GError **error); -gboolean garrow_ipc_stream_writer_close(GArrowIPCStreamWriter *stream_writer, +gboolean garrow_stream_writer_close(GArrowStreamWriter *stream_writer, GError **error); G_END_DECLS diff --git a/c_glib/arrow-glib/ipc-stream-writer.hpp b/c_glib/arrow-glib/stream-writer.hpp similarity index 75% rename from c_glib/arrow-glib/ipc-stream-writer.hpp rename to c_glib/arrow-glib/stream-writer.hpp index 9d097404582a9..994c83b8f4ad5 100644 --- a/c_glib/arrow-glib/ipc-stream-writer.hpp +++ b/c_glib/arrow-glib/stream-writer.hpp @@ -22,7 +22,7 @@ #include #include -#include +#include -GArrowIPCStreamWriter *garrow_ipc_stream_writer_new_raw(std::shared_ptr *arrow_stream_writer); -std::shared_ptr garrow_ipc_stream_writer_get_raw(GArrowIPCStreamWriter *stream_writer); +GArrowStreamWriter *garrow_stream_writer_new_raw(std::shared_ptr *arrow_stream_writer); +std::shared_ptr garrow_stream_writer_get_raw(GArrowStreamWriter *stream_writer); diff --git a/c_glib/arrow-glib/io-writeable-file.cpp b/c_glib/arrow-glib/writeable-file.cpp similarity index 73% rename from c_glib/arrow-glib/io-writeable-file.cpp rename to c_glib/arrow-glib/writeable-file.cpp index 41b682acd1e26..d0937ea2612d2 100644 --- a/c_glib/arrow-glib/io-writeable-file.cpp +++ b/c_glib/arrow-glib/writeable-file.cpp @@ -24,30 +24,30 @@ #include #include -#include +#include G_BEGIN_DECLS /** - * SECTION: io-writeable-file - * @title: GArrowIOWriteableFile + * SECTION: writeable-file + * @title: GArrowWriteableFile * @short_description: File output interface * - * #GArrowIOWriteableFile is an interface for file output. + * #GArrowWriteableFile is an interface for file output. */ -G_DEFINE_INTERFACE(GArrowIOWriteableFile, - garrow_io_writeable_file, +G_DEFINE_INTERFACE(GArrowWriteableFile, + garrow_writeable_file, G_TYPE_OBJECT) static void -garrow_io_writeable_file_default_init (GArrowIOWriteableFileInterface *iface) +garrow_writeable_file_default_init (GArrowWriteableFileInterface *iface) { } /** - * garrow_io_writeable_file_write_at: - * @writeable_file: A #GArrowIOWriteableFile. + * garrow_writeable_file_write_at: + * @writeable_file: A #GArrowWriteableFile. * @position: The write start position. * @data: (array length=n_bytes): The data to be written. * @n_bytes: The number of bytes to be written. @@ -56,14 +56,14 @@ garrow_io_writeable_file_default_init (GArrowIOWriteableFileInterface *iface) * Returns: %TRUE on success, %FALSE if there was an error. 
*/ gboolean -garrow_io_writeable_file_write_at(GArrowIOWriteableFile *writeable_file, +garrow_writeable_file_write_at(GArrowWriteableFile *writeable_file, gint64 position, const guint8 *data, gint64 n_bytes, GError **error) { const auto arrow_writeable_file = - garrow_io_writeable_file_get_raw(writeable_file); + garrow_writeable_file_get_raw(writeable_file); auto status = arrow_writeable_file->WriteAt(position, data, n_bytes); if (status.ok()) { @@ -77,8 +77,8 @@ garrow_io_writeable_file_write_at(GArrowIOWriteableFile *writeable_file, G_END_DECLS std::shared_ptr -garrow_io_writeable_file_get_raw(GArrowIOWriteableFile *writeable_file) +garrow_writeable_file_get_raw(GArrowWriteableFile *writeable_file) { - auto *iface = GARROW_IO_WRITEABLE_FILE_GET_IFACE(writeable_file); + auto *iface = GARROW_WRITEABLE_FILE_GET_IFACE(writeable_file); return iface->get_raw(writeable_file); } diff --git a/c_glib/arrow-glib/io-writeable-file.h b/c_glib/arrow-glib/writeable-file.h similarity index 59% rename from c_glib/arrow-glib/io-writeable-file.h rename to c_glib/arrow-glib/writeable-file.h index d1ebdbe630ef2..7f4c186379b7e 100644 --- a/c_glib/arrow-glib/io-writeable-file.h +++ b/c_glib/arrow-glib/writeable-file.h @@ -23,26 +23,26 @@ G_BEGIN_DECLS -#define GARROW_IO_TYPE_WRITEABLE_FILE \ - (garrow_io_writeable_file_get_type()) -#define GARROW_IO_WRITEABLE_FILE(obj) \ +#define GARROW_TYPE_WRITEABLE_FILE \ + (garrow_writeable_file_get_type()) +#define GARROW_WRITEABLE_FILE(obj) \ (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_IO_TYPE_WRITEABLE_FILE, \ - GArrowIOWriteableFile)) -#define GARROW_IO_IS_WRITEABLE_FILE(obj) \ + GARROW_TYPE_WRITEABLE_FILE, \ + GArrowWriteableFile)) +#define GARROW_IS_WRITEABLE_FILE(obj) \ (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_IO_TYPE_WRITEABLE_FILE)) -#define GARROW_IO_WRITEABLE_FILE_GET_IFACE(obj) \ + GARROW_TYPE_WRITEABLE_FILE)) +#define GARROW_WRITEABLE_FILE_GET_IFACE(obj) \ (G_TYPE_INSTANCE_GET_INTERFACE((obj), \ - GARROW_IO_TYPE_WRITEABLE_FILE, \ - GArrowIOWriteableFileInterface)) + GARROW_TYPE_WRITEABLE_FILE, \ + GArrowWriteableFileInterface)) -typedef struct _GArrowIOWriteableFile GArrowIOWriteableFile; -typedef struct _GArrowIOWriteableFileInterface GArrowIOWriteableFileInterface; +typedef struct _GArrowWriteableFile GArrowWriteableFile; +typedef struct _GArrowWriteableFileInterface GArrowWriteableFileInterface; -GType garrow_io_writeable_file_get_type(void) G_GNUC_CONST; +GType garrow_writeable_file_get_type(void) G_GNUC_CONST; -gboolean garrow_io_writeable_file_write_at(GArrowIOWriteableFile *writeable_file, +gboolean garrow_writeable_file_write_at(GArrowWriteableFile *writeable_file, gint64 position, const guint8 *data, gint64 n_bytes, diff --git a/c_glib/arrow-glib/io-writeable-file.hpp b/c_glib/arrow-glib/writeable-file.hpp similarity index 75% rename from c_glib/arrow-glib/io-writeable-file.hpp rename to c_glib/arrow-glib/writeable-file.hpp index aba95b209d827..aa3cc5082d0c5 100644 --- a/c_glib/arrow-glib/io-writeable-file.hpp +++ b/c_glib/arrow-glib/writeable-file.hpp @@ -21,18 +21,18 @@ #include -#include +#include /** - * GArrowIOWriteableFile: + * GArrowWriteableFile: * * It wraps `arrow::io::WriteableFile`. 
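A short hedged sketch of the renamed positional-write interface; `writeable_file` is assumed to be an object implementing #GArrowWriteableFile (for example, a memory-mapped file opened in read-write mode, as in the Ruby tests later in this patch), and the payload and offset mirror those tests' illustrative values:

/* Overwrite three bytes starting at position 2. */
GError *error = NULL;
const guint8 data[] = "rld";
if (!garrow_writeable_file_write_at(GARROW_WRITEABLE_FILE(writeable_file),
                                    2, data, 3, &error)) {
  g_print("failed to write at position: %s\n", error->message);
  g_error_free(error);
}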
*/ -struct _GArrowIOWriteableFileInterface +struct _GArrowWriteableFileInterface { GTypeInterface parent_iface; - std::shared_ptr (*get_raw)(GArrowIOWriteableFile *file); + std::shared_ptr (*get_raw)(GArrowWriteableFile *file); }; -std::shared_ptr garrow_io_writeable_file_get_raw(GArrowIOWriteableFile *writeable_file); +std::shared_ptr garrow_writeable_file_get_raw(GArrowWriteableFile *writeable_file); diff --git a/c_glib/arrow-glib/io-writeable.cpp b/c_glib/arrow-glib/writeable.cpp similarity index 72% rename from c_glib/arrow-glib/io-writeable.cpp rename to c_glib/arrow-glib/writeable.cpp index 9ea69e3adccde..6f4c63008ae49 100644 --- a/c_glib/arrow-glib/io-writeable.cpp +++ b/c_glib/arrow-glib/writeable.cpp @@ -24,31 +24,31 @@ #include #include -#include +#include G_BEGIN_DECLS /** - * SECTION: io-writeable - * @title: GArrowIOWriteable + * SECTION: writeable + * @title: GArrowWriteable * @short_description: Output interface * - * #GArrowIOWriteable is an interface for output. Output must be + * #GArrowWriteable is an interface for output. Output must be * writeable. */ -G_DEFINE_INTERFACE(GArrowIOWriteable, - garrow_io_writeable, +G_DEFINE_INTERFACE(GArrowWriteable, + garrow_writeable, G_TYPE_OBJECT) static void -garrow_io_writeable_default_init (GArrowIOWriteableInterface *iface) +garrow_writeable_default_init (GArrowWriteableInterface *iface) { } /** - * garrow_io_writeable_write: - * @writeable: A #GArrowIOWriteable. + * garrow_writeable_write: + * @writeable: A #GArrowWriteable. * @data: (array length=n_bytes): The data to be written. * @n_bytes: The number of bytes to be written. * @error: (nullable): Return location for a #GError or %NULL. @@ -56,12 +56,12 @@ garrow_io_writeable_default_init (GArrowIOWriteableInterface *iface) * Returns: %TRUE on success, %FALSE if there was an error. */ gboolean -garrow_io_writeable_write(GArrowIOWriteable *writeable, +garrow_writeable_write(GArrowWriteable *writeable, const guint8 *data, gint64 n_bytes, GError **error) { - const auto arrow_writeable = garrow_io_writeable_get_raw(writeable); + const auto arrow_writeable = garrow_writeable_get_raw(writeable); auto status = arrow_writeable->Write(data, n_bytes); if (status.ok()) { @@ -73,8 +73,8 @@ garrow_io_writeable_write(GArrowIOWriteable *writeable, } /** - * garrow_io_writeable_flush: - * @writeable: A #GArrowIOWriteable. + * garrow_writeable_flush: + * @writeable: A #GArrowWriteable. * @error: (nullable): Return location for a #GError or %NULL. * * It ensures writing all data on memory to storage. @@ -82,10 +82,10 @@ garrow_io_writeable_write(GArrowIOWriteable *writeable, * Returns: %TRUE on success, %FALSE if there was an error. 
*/ gboolean -garrow_io_writeable_flush(GArrowIOWriteable *writeable, +garrow_writeable_flush(GArrowWriteable *writeable, GError **error) { - const auto arrow_writeable = garrow_io_writeable_get_raw(writeable); + const auto arrow_writeable = garrow_writeable_get_raw(writeable); auto status = arrow_writeable->Flush(); if (status.ok()) { @@ -99,8 +99,8 @@ garrow_io_writeable_flush(GArrowIOWriteable *writeable, G_END_DECLS std::shared_ptr -garrow_io_writeable_get_raw(GArrowIOWriteable *writeable) +garrow_writeable_get_raw(GArrowWriteable *writeable) { - auto *iface = GARROW_IO_WRITEABLE_GET_IFACE(writeable); + auto *iface = GARROW_WRITEABLE_GET_IFACE(writeable); return iface->get_raw(writeable); } diff --git a/c_glib/arrow-glib/io-writeable.h b/c_glib/arrow-glib/writeable.h similarity index 58% rename from c_glib/arrow-glib/io-writeable.h rename to c_glib/arrow-glib/writeable.h index ce23247497706..66d6922360ae4 100644 --- a/c_glib/arrow-glib/io-writeable.h +++ b/c_glib/arrow-glib/writeable.h @@ -23,30 +23,30 @@ G_BEGIN_DECLS -#define GARROW_IO_TYPE_WRITEABLE \ - (garrow_io_writeable_get_type()) -#define GARROW_IO_WRITEABLE(obj) \ +#define GARROW_TYPE_WRITEABLE \ + (garrow_writeable_get_type()) +#define GARROW_WRITEABLE(obj) \ (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_IO_TYPE_WRITEABLE, \ - GArrowIOWriteable)) -#define GARROW_IO_IS_WRITEABLE(obj) \ + GARROW_TYPE_WRITEABLE, \ + GArrowWriteable)) +#define GARROW_IS_WRITEABLE(obj) \ (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_IO_TYPE_WRITEABLE)) -#define GARROW_IO_WRITEABLE_GET_IFACE(obj) \ + GARROW_TYPE_WRITEABLE)) +#define GARROW_WRITEABLE_GET_IFACE(obj) \ (G_TYPE_INSTANCE_GET_INTERFACE((obj), \ - GARROW_IO_TYPE_WRITEABLE, \ - GArrowIOWriteableInterface)) + GARROW_TYPE_WRITEABLE, \ + GArrowWriteableInterface)) -typedef struct _GArrowIOWriteable GArrowIOWriteable; -typedef struct _GArrowIOWriteableInterface GArrowIOWriteableInterface; +typedef struct _GArrowWriteable GArrowWriteable; +typedef struct _GArrowWriteableInterface GArrowWriteableInterface; -GType garrow_io_writeable_get_type(void) G_GNUC_CONST; +GType garrow_writeable_get_type(void) G_GNUC_CONST; -gboolean garrow_io_writeable_write(GArrowIOWriteable *writeable, +gboolean garrow_writeable_write(GArrowWriteable *writeable, const guint8 *data, gint64 n_bytes, GError **error); -gboolean garrow_io_writeable_flush(GArrowIOWriteable *writeable, +gboolean garrow_writeable_flush(GArrowWriteable *writeable, GError **error); G_END_DECLS diff --git a/c_glib/arrow-glib/io-writeable.hpp b/c_glib/arrow-glib/writeable.hpp similarity index 77% rename from c_glib/arrow-glib/io-writeable.hpp rename to c_glib/arrow-glib/writeable.hpp index f833924a61ae8..2b398f8b507c1 100644 --- a/c_glib/arrow-glib/io-writeable.hpp +++ b/c_glib/arrow-glib/writeable.hpp @@ -21,18 +21,18 @@ #include -#include +#include /** - * GArrowIOWriteableInterface: + * GArrowWriteableInterface: * * It wraps `arrow::io::Writeable`. 
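A corresponding C sketch for the generic output interface; `writeable` is a hypothetical #GArrowWriteable (any object implementing the interface can be cast with GARROW_WRITEABLE()):

/* Write five bytes, then flush them from memory to storage. */
GError *error = NULL;
const guint8 data[] = "Hello";
if (!garrow_writeable_write(writeable, data, 5, &error)) {
  g_print("failed to write: %s\n", error->message);
  g_clear_error(&error);
} else if (!garrow_writeable_flush(writeable, &error)) {
  g_print("failed to flush: %s\n", error->message);
  g_clear_error(&error);
}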
*/ -struct _GArrowIOWriteableInterface +struct _GArrowWriteableInterface { GTypeInterface parent_iface; - std::shared_ptr (*get_raw)(GArrowIOWriteable *file); + std::shared_ptr (*get_raw)(GArrowWriteable *file); }; -std::shared_ptr garrow_io_writeable_get_raw(GArrowIOWriteable *writeable); +std::shared_ptr garrow_writeable_get_raw(GArrowWriteable *writeable); diff --git a/c_glib/doc/reference/Makefile.am b/c_glib/doc/reference/Makefile.am index 116bc6ce1b9a6..d3389dc2ae81e 100644 --- a/c_glib/doc/reference/Makefile.am +++ b/c_glib/doc/reference/Makefile.am @@ -26,7 +26,7 @@ SCAN_OPTIONS = \ --deprecated-guards="GARROW_DISABLE_DEPRECATED" MKDB_OPTIONS = \ - --name-space=arrow \ + --name-space=garrow \ --source-suffixes="c,cpp,h" HFILE_GLOB = \ diff --git a/c_glib/doc/reference/arrow-glib-docs.sgml b/c_glib/doc/reference/arrow-glib-docs.sgml index 06a19369640b5..396dce5049837 100644 --- a/c_glib/doc/reference/arrow-glib-docs.sgml +++ b/c_glib/doc/reference/arrow-glib-docs.sgml @@ -31,8 +31,8 @@ - - GArrow + + Data Array @@ -111,47 +111,47 @@ - - GArrowIO - - Enums - + + IO + + Mode + - + Input - - - + + + - + Output - - - - + + + + - + Input and output - - + + - - GArrowIPC - - Enums - + + IPC + + Metadata + - + Reader - - + + - - Input - - + + Writer + + diff --git a/c_glib/example/read-batch.c b/c_glib/example/read-batch.c index a55b085d961d1..dce96b8713362 100644 --- a/c_glib/example/read-batch.c +++ b/c_glib/example/read-batch.c @@ -87,14 +87,14 @@ int main(int argc, char **argv) { const char *input_path = "/tmp/batch.arrow"; - GArrowIOMemoryMappedFile *input; + GArrowMemoryMappedFile *input; GError *error = NULL; if (argc > 1) input_path = argv[1]; - input = garrow_io_memory_mapped_file_open(input_path, - GARROW_IO_FILE_MODE_READ, - &error); + input = garrow_memory_mapped_file_open(input_path, + GARROW_FILE_MODE_READ, + &error); if (!input) { g_print("failed to open file: %s\n", error->message); g_error_free(error); @@ -102,10 +102,10 @@ main(int argc, char **argv) } { - GArrowIPCFileReader *reader; + GArrowFileReader *reader; - reader = garrow_ipc_file_reader_open(GARROW_IO_RANDOM_ACCESS_FILE(input), - &error); + reader = garrow_file_reader_open(GARROW_RANDOM_ACCESS_FILE(input), + &error); if (!reader) { g_print("failed to open file reader: %s\n", error->message); g_error_free(error); @@ -116,12 +116,12 @@ main(int argc, char **argv) { guint i, n; - n = garrow_ipc_file_reader_get_n_record_batches(reader); + n = garrow_file_reader_get_n_record_batches(reader); for (i = 0; i < n; i++) { GArrowRecordBatch *record_batch; record_batch = - garrow_ipc_file_reader_get_record_batch(reader, i, &error); + garrow_file_reader_get_record_batch(reader, i, &error); if (!record_batch) { g_print("failed to open file reader: %s\n", error->message); g_error_free(error); diff --git a/c_glib/example/read-stream.c b/c_glib/example/read-stream.c index c56942c7770c5..ba461e3ad6aed 100644 --- a/c_glib/example/read-stream.c +++ b/c_glib/example/read-stream.c @@ -87,14 +87,14 @@ int main(int argc, char **argv) { const char *input_path = "/tmp/stream.arrow"; - GArrowIOMemoryMappedFile *input; + GArrowMemoryMappedFile *input; GError *error = NULL; if (argc > 1) input_path = argv[1]; - input = garrow_io_memory_mapped_file_open(input_path, - GARROW_IO_FILE_MODE_READ, - &error); + input = garrow_memory_mapped_file_open(input_path, + GARROW_FILE_MODE_READ, + &error); if (!input) { g_print("failed to open file: %s\n", error->message); g_error_free(error); @@ -102,10 +102,10 @@ main(int argc, char **argv) } { - 
GArrowIPCStreamReader *reader; + GArrowStreamReader *reader; - reader = garrow_ipc_stream_reader_open(GARROW_IO_INPUT_STREAM(input), - &error); + reader = garrow_stream_reader_open(GARROW_INPUT_STREAM(input), + &error); if (!reader) { g_print("failed to open stream reader: %s\n", error->message); g_error_free(error); @@ -117,7 +117,7 @@ main(int argc, char **argv) GArrowRecordBatch *record_batch; record_batch = - garrow_ipc_stream_reader_get_next_record_batch(reader, &error); + garrow_stream_reader_get_next_record_batch(reader, &error); if (error) { g_print("failed to get record batch: %s\n", error->message); g_error_free(error); diff --git a/c_glib/test/test-io-file-output-stream.rb b/c_glib/test/test-file-output-stream.rb similarity index 87% rename from c_glib/test/test-io-file-output-stream.rb rename to c_glib/test/test-file-output-stream.rb index e35a18361aab6..26737c0c87b38 100644 --- a/c_glib/test/test-io-file-output-stream.rb +++ b/c_glib/test/test-file-output-stream.rb @@ -15,13 +15,13 @@ # specific language governing permissions and limitations # under the License. -class TestIOFileOutputStream < Test::Unit::TestCase +class TestFileOutputStream < Test::Unit::TestCase sub_test_case(".open") do def test_create tempfile = Tempfile.open("arrow-io-file-output-stream") tempfile.write("Hello") tempfile.close - file = Arrow::IOFileOutputStream.open(tempfile.path, false) + file = Arrow::FileOutputStream.open(tempfile.path, false) file.close assert_equal("", File.read(tempfile.path)) end @@ -30,7 +30,7 @@ def test_append tempfile = Tempfile.open("arrow-io-file-output-stream") tempfile.write("Hello") tempfile.close - file = Arrow::IOFileOutputStream.open(tempfile.path, true) + file = Arrow::FileOutputStream.open(tempfile.path, true) file.close assert_equal("Hello", File.read(tempfile.path)) end diff --git a/c_glib/test/test-ipc-file-writer.rb b/c_glib/test/test-file-writer.rb similarity index 82% rename from c_glib/test/test-ipc-file-writer.rb rename to c_glib/test/test-file-writer.rb index 1c33ccc1919e7..31c094dd3bd29 100644 --- a/c_glib/test/test-ipc-file-writer.rb +++ b/c_glib/test/test-file-writer.rb @@ -15,14 +15,14 @@ # specific language governing permissions and limitations # under the License. 
-class TestIPCFileWriter < Test::Unit::TestCase +class TestFileWriter < Test::Unit::TestCase def test_write_record_batch tempfile = Tempfile.open("arrow-ipc-file-writer") - output = Arrow::IOFileOutputStream.open(tempfile.path, false) + output = Arrow::FileOutputStream.open(tempfile.path, false) begin field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) schema = Arrow::Schema.new([field]) - file_writer = Arrow::IPCFileWriter.open(output, schema) + file_writer = Arrow::FileWriter.open(output, schema) begin record_batch = Arrow::RecordBatch.new(schema, 0, []) file_writer.write_record_batch(record_batch) @@ -33,9 +33,9 @@ def test_write_record_batch output.close end - input = Arrow::IOMemoryMappedFile.open(tempfile.path, :read) + input = Arrow::MemoryMappedFile.open(tempfile.path, :read) begin - file_reader = Arrow::IPCFileReader.open(input) + file_reader = Arrow::FileReader.open(input) assert_equal(["enabled"], file_reader.schema.fields.collect(&:name)) ensure diff --git a/c_glib/test/test-io-memory-mapped-file.rb b/c_glib/test/test-memory-mapped-file.rb similarity index 81% rename from c_glib/test/test-io-memory-mapped-file.rb rename to c_glib/test/test-memory-mapped-file.rb index 197d1886f1e86..e78d07a72d3b8 100644 --- a/c_glib/test/test-io-memory-mapped-file.rb +++ b/c_glib/test/test-memory-mapped-file.rb @@ -15,12 +15,12 @@ # specific language governing permissions and limitations # under the License. -class TestIOMemoryMappedFile < Test::Unit::TestCase +class TestMemoryMappedFile < Test::Unit::TestCase def test_open tempfile = Tempfile.open("arrow-io-memory-mapped-file") tempfile.write("Hello") tempfile.close - file = Arrow::IOMemoryMappedFile.open(tempfile.path, :read) + file = Arrow::MemoryMappedFile.open(tempfile.path, :read) begin buffer = " " * 5 file.read(buffer) @@ -34,7 +34,7 @@ def test_size tempfile = Tempfile.open("arrow-io-memory-mapped-file") tempfile.write("Hello") tempfile.close - file = Arrow::IOMemoryMappedFile.open(tempfile.path, :read) + file = Arrow::MemoryMappedFile.open(tempfile.path, :read) begin assert_equal(5, file.size) ensure @@ -46,7 +46,7 @@ def test_read tempfile = Tempfile.open("arrow-io-memory-mapped-file") tempfile.write("Hello World") tempfile.close - file = Arrow::IOMemoryMappedFile.open(tempfile.path, :read) + file = Arrow::MemoryMappedFile.open(tempfile.path, :read) begin buffer = " " * 5 _success, n_read_bytes = file.read(buffer) @@ -60,7 +60,7 @@ def test_read_at tempfile = Tempfile.open("arrow-io-memory-mapped-file") tempfile.write("Hello World") tempfile.close - file = Arrow::IOMemoryMappedFile.open(tempfile.path, :read) + file = Arrow::MemoryMappedFile.open(tempfile.path, :read) begin buffer = " " * 5 _success, n_read_bytes = file.read_at(6, buffer) @@ -74,7 +74,7 @@ def test_write tempfile = Tempfile.open("arrow-io-memory-mapped-file") tempfile.write("Hello") tempfile.close - file = Arrow::IOMemoryMappedFile.open(tempfile.path, :readwrite) + file = Arrow::MemoryMappedFile.open(tempfile.path, :readwrite) begin file.write("World") ensure @@ -87,7 +87,7 @@ def test_write_at tempfile = Tempfile.open("arrow-io-memory-mapped-file") tempfile.write("Hello") tempfile.close - file = Arrow::IOMemoryMappedFile.open(tempfile.path, :readwrite) + file = Arrow::MemoryMappedFile.open(tempfile.path, :readwrite) begin file.write_at(2, "rld") ensure @@ -100,7 +100,7 @@ def test_flush tempfile = Tempfile.open("arrow-io-memory-mapped-file") tempfile.write("Hello") tempfile.close - file = Arrow::IOMemoryMappedFile.open(tempfile.path, :readwrite) + file = 
Arrow::MemoryMappedFile.open(tempfile.path, :readwrite) begin file.write("World") file.flush @@ -114,7 +114,7 @@ def test_tell tempfile = Tempfile.open("arrow-io-memory-mapped-file") tempfile.write("Hello World") tempfile.close - file = Arrow::IOMemoryMappedFile.open(tempfile.path, :read) + file = Arrow::MemoryMappedFile.open(tempfile.path, :read) begin buffer = " " * 5 file.read(buffer) @@ -128,9 +128,9 @@ def test_mode tempfile = Tempfile.open("arrow-io-memory-mapped-file") tempfile.write("Hello World") tempfile.close - file = Arrow::IOMemoryMappedFile.open(tempfile.path, :readwrite) + file = Arrow::MemoryMappedFile.open(tempfile.path, :readwrite) begin - assert_equal(Arrow::IOFileMode::READWRITE, file.mode) + assert_equal(Arrow::FileMode::READWRITE, file.mode) ensure file.close end diff --git a/c_glib/test/test-ipc-stream-writer.rb b/c_glib/test/test-stream-writer.rb similarity index 84% rename from c_glib/test/test-ipc-stream-writer.rb rename to c_glib/test/test-stream-writer.rb index 78bb4a7c1743c..306115ee78925 100644 --- a/c_glib/test/test-ipc-stream-writer.rb +++ b/c_glib/test/test-stream-writer.rb @@ -15,16 +15,16 @@ # specific language governing permissions and limitations # under the License. -class TestIPCStreamWriter < Test::Unit::TestCase +class TestStreamWriter < Test::Unit::TestCase include Helper::Buildable def test_write_record_batch tempfile = Tempfile.open("arrow-ipc-stream-writer") - output = Arrow::IOFileOutputStream.open(tempfile.path, false) + output = Arrow::FileOutputStream.open(tempfile.path, false) begin field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) schema = Arrow::Schema.new([field]) - stream_writer = Arrow::IPCStreamWriter.open(output, schema) + stream_writer = Arrow::StreamWriter.open(output, schema) begin columns = [ build_boolean_array([true]), @@ -38,9 +38,9 @@ def test_write_record_batch output.close end - input = Arrow::IOMemoryMappedFile.open(tempfile.path, :read) + input = Arrow::MemoryMappedFile.open(tempfile.path, :read) begin - stream_reader = Arrow::IPCStreamReader.open(input) + stream_reader = Arrow::StreamReader.open(input) assert_equal(["enabled"], stream_reader.schema.fields.collect(&:name)) assert_equal(true, From 9db96fea4e5de59860a481da3036b3129eb97e3b Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Wed, 12 Apr 2017 12:23:22 -0400 Subject: [PATCH 0504/1644] ARROW-811: [GLib] Add GArrowBuffer Author: Kouhei Sutou Closes #531 from kou/glib-add-buffer and squashes the following commits: 1954c95 [Kouhei Sutou] [GLib] Add GArrowBuffer --- c_glib/arrow-glib/Makefile.am | 7 +- c_glib/arrow-glib/buffer.cpp | 289 ++++++++++++++++++++++ c_glib/arrow-glib/buffer.h | 77 ++++++ c_glib/arrow-glib/buffer.hpp | 27 ++ c_glib/doc/reference/arrow-glib-docs.sgml | 4 + c_glib/test/test-buffer.rb | 55 ++++ 6 files changed, 457 insertions(+), 2 deletions(-) create mode 100644 c_glib/arrow-glib/buffer.cpp create mode 100644 c_glib/arrow-glib/buffer.h create mode 100644 c_glib/arrow-glib/buffer.hpp create mode 100644 c_glib/test/test-buffer.rb diff --git a/c_glib/arrow-glib/Makefile.am b/c_glib/arrow-glib/Makefile.am index 387707c7d5897..2e7a9a0e439eb 100644 --- a/c_glib/arrow-glib/Makefile.am +++ b/c_glib/arrow-glib/Makefile.am @@ -44,14 +44,15 @@ libarrow_glib_la_headers = \ array.h \ array-builder.h \ arrow-glib.h \ - chunked-array.h \ - column.h \ binary-array.h \ binary-array-builder.h \ binary-data-type.h \ boolean-array.h \ boolean-array-builder.h \ boolean-data-type.h \ + buffer.h \ + chunked-array.h \ + column.h \ data-type.h \ 
double-array.h \ double-array-builder.h \ @@ -136,6 +137,7 @@ libarrow_glib_la_sources = \ boolean-array.cpp \ boolean-array-builder.cpp \ boolean-data-type.cpp \ + buffer.cpp \ chunked-array.cpp \ column.cpp \ data-type.cpp \ @@ -212,6 +214,7 @@ libarrow_glib_la_cpp_headers = \ array.hpp \ array-builder.hpp \ arrow-glib.hpp \ + buffer.hpp \ chunked-array.hpp \ column.hpp \ data-type.hpp \ diff --git a/c_glib/arrow-glib/buffer.cpp b/c_glib/arrow-glib/buffer.cpp new file mode 100644 index 0000000000000..0ec52df0aee67 --- /dev/null +++ b/c_glib/arrow-glib/buffer.cpp @@ -0,0 +1,289 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: buffer + * @short_description: Buffer class + * + * #GArrowBuffer is a class for keeping data. Other classes such as + * #GArrowArray and #GArrowTensor can use data in buffer. + */ + +typedef struct GArrowBufferPrivate_ { + std::shared_ptr buffer; +} GArrowBufferPrivate; + +enum { + PROP_0, + PROP_BUFFER +}; + +G_DEFINE_TYPE_WITH_PRIVATE(GArrowBuffer, garrow_buffer, G_TYPE_OBJECT) + +#define GARROW_BUFFER_GET_PRIVATE(obj) \ + (G_TYPE_INSTANCE_GET_PRIVATE((obj), GARROW_TYPE_BUFFER, GArrowBufferPrivate)) + +static void +garrow_buffer_finalize(GObject *object) +{ + auto priv = GARROW_BUFFER_GET_PRIVATE(object); + + priv->buffer = nullptr; + + G_OBJECT_CLASS(garrow_buffer_parent_class)->finalize(object); +} + +static void +garrow_buffer_set_property(GObject *object, + guint prop_id, + const GValue *value, + GParamSpec *pspec) +{ + auto priv = GARROW_BUFFER_GET_PRIVATE(object); + + switch (prop_id) { + case PROP_BUFFER: + priv->buffer = + *static_cast *>(g_value_get_pointer(value)); + break; + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_buffer_get_property(GObject *object, + guint prop_id, + GValue *value, + GParamSpec *pspec) +{ + switch (prop_id) { + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_buffer_init(GArrowBuffer *object) +{ +} + +static void +garrow_buffer_class_init(GArrowBufferClass *klass) +{ + GParamSpec *spec; + + auto gobject_class = G_OBJECT_CLASS(klass); + + gobject_class->finalize = garrow_buffer_finalize; + gobject_class->set_property = garrow_buffer_set_property; + gobject_class->get_property = garrow_buffer_get_property; + + spec = g_param_spec_pointer("buffer", + "Buffer", + "The raw std::shared *", + static_cast(G_PARAM_WRITABLE | + G_PARAM_CONSTRUCT_ONLY)); + g_object_class_install_property(gobject_class, PROP_BUFFER, spec); +} + +/** + * garrow_buffer_new: + * @data: (array length=size): Data for the buffer. + * They aren't owned by the new buffer. 
+ * You must not free the data while the new buffer is alive. + * @size: The number of bytes of the data. + * + * Returns: A newly created #GArrowBuffer. + * + * Since: 0.3.0 + */ +GArrowBuffer * +garrow_buffer_new(const guint8 *data, gint64 size) +{ + auto arrow_buffer = std::make_shared(data, size); + return garrow_buffer_new_raw(&arrow_buffer); + +} + +/** + * garrow_buffer_is_mutable: + * @buffer: A #GArrowBuffer. + * + * Returns: %TRUE if the buffer is mutable, %FALSE otherwise. + * + * Since: 0.3.0 + */ +gboolean +garrow_buffer_is_mutable(GArrowBuffer *buffer) +{ + auto arrow_buffer = garrow_buffer_get_raw(buffer); + return arrow_buffer->is_mutable(); +} + +/** + * garrow_buffer_get_capacity: + * @buffer: A #GArrowBuffer. + * + * Returns: The number of bytes that were allocated for the buffer in + * total. + * + * Since: 0.3.0 + */ +gint64 +garrow_buffer_get_capacity(GArrowBuffer *buffer) +{ + auto arrow_buffer = garrow_buffer_get_raw(buffer); + return arrow_buffer->capacity(); +} + +/** + * garrow_buffer_get_data: + * @buffer: A #GArrowBuffer. + * @size: (out): The number of bytes of the data. + * + * Returns: (array length=size): The data of the buffer. + * + * Since: 0.3.0 + */ +const guint8 * +garrow_buffer_get_data(GArrowBuffer *buffer, gint64 *size) +{ + auto arrow_buffer = garrow_buffer_get_raw(buffer); + *size = arrow_buffer->size(); + return arrow_buffer->data(); +} + +/** + * garrow_buffer_get_size: + * @buffer: A #GArrowBuffer. + * + * Returns: The number of bytes that might have valid data. + * + * Since: 0.3.0 + */ +gint64 +garrow_buffer_get_size(GArrowBuffer *buffer) +{ + auto arrow_buffer = garrow_buffer_get_raw(buffer); + return arrow_buffer->size(); +} + +/** + * garrow_buffer_get_parent: + * @buffer: A #GArrowBuffer. + * + * Returns: (nullable) (transfer full): + * The parent #GArrowBuffer or %NULL. + * + * Since: 0.3.0 + */ +GArrowBuffer * +garrow_buffer_get_parent(GArrowBuffer *buffer) +{ + auto arrow_buffer = garrow_buffer_get_raw(buffer); + auto arrow_parent_buffer = arrow_buffer->parent(); + + if (arrow_parent_buffer) { + return garrow_buffer_new_raw(&arrow_parent_buffer); + } else { + return NULL; + } +} + +/** + * garrow_buffer_copy: + * @buffer: A #GArrowBuffer. + * @start: An offset of data to be copied in bytes. + * @size: The number of bytes to be copied from the start. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: (nullable) (transfer full): + * A newly copied #GArrowBuffer on success, %NULL on error. + * + * Since: 0.3.0 + */ +GArrowBuffer * +garrow_buffer_copy(GArrowBuffer *buffer, + gint64 start, + gint64 size, + GError **error) +{ + auto arrow_buffer = garrow_buffer_get_raw(buffer); + std::shared_ptr arrow_copied_buffer; + auto status = arrow_buffer->Copy(start, size, &arrow_copied_buffer); + if (status.ok()) { + return garrow_buffer_new_raw(&arrow_copied_buffer); + } else { + garrow_error_set(error, status, "[buffer][copy]"); + return NULL; + } +} + +/** + * garrow_buffer_slice: + * @buffer: A #GArrowBuffer. + * @offset: An offset in the buffer data in bytes. + * @size: The number of bytes of the sliced data. + * + * Returns: (transfer full): A newly created #GArrowBuffer that shares + * data of the base #GArrowBuffer. The created #GArrowBuffer's data + * starts at the specified offset into the base buffer's data and spans + * the specified number of bytes.
+ * + * Since: 0.3.0 + */ +GArrowBuffer * +garrow_buffer_slice(GArrowBuffer *buffer, gint64 offset, gint64 size) +{ + auto arrow_parent_buffer = garrow_buffer_get_raw(buffer); + auto arrow_buffer = std::make_shared(arrow_parent_buffer, + offset, + size); + return garrow_buffer_new_raw(&arrow_buffer); +} + +G_END_DECLS + +GArrowBuffer * +garrow_buffer_new_raw(std::shared_ptr *arrow_buffer) +{ + auto buffer = GARROW_BUFFER(g_object_new(GARROW_TYPE_BUFFER, + "buffer", arrow_buffer, + NULL)); + return buffer; +} + +std::shared_ptr +garrow_buffer_get_raw(GArrowBuffer *buffer) +{ + auto priv = GARROW_BUFFER_GET_PRIVATE(buffer); + return priv->buffer; +} diff --git a/c_glib/arrow-glib/buffer.h b/c_glib/arrow-glib/buffer.h new file mode 100644 index 0000000000000..1e7d55182fd1d --- /dev/null +++ b/c_glib/arrow-glib/buffer.h @@ -0,0 +1,77 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_BUFFER \ + (garrow_buffer_get_type()) +#define GARROW_BUFFER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), GARROW_TYPE_BUFFER, GArrowBuffer)) +#define GARROW_BUFFER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), GARROW_TYPE_BUFFER, GArrowBufferClass)) +#define GARROW_IS_BUFFER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), GARROW_TYPE_BUFFER)) +#define GARROW_IS_BUFFER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), GARROW_TYPE_BUFFER)) +#define GARROW_BUFFER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), GARROW_TYPE_BUFFER, GArrowBufferClass)) + +typedef struct _GArrowBuffer GArrowBuffer; +typedef struct _GArrowBufferClass GArrowBufferClass; + +/** + * GArrowBuffer: + * + * It wraps `arrow::Buffer`. 
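The Ruby test added below exercises the whole buffer API; for C users, a hedged equivalent sketch (the "Hello" payload and the offsets mirror that test's illustrative values; the buffer never owns `data`, so `data` must outlive it):

/* Wrap caller-owned bytes, then take a zero-copy slice and an
 * owned copy of bytes 1..3 ("ell"). */
GError *error = NULL;
static const guint8 data[] = "Hello";
GArrowBuffer *buffer = garrow_buffer_new(data, 5);
GArrowBuffer *sliced = garrow_buffer_slice(buffer, 1, 3); /* shares data */
GArrowBuffer *copied = garrow_buffer_copy(buffer, 1, 3, &error);
if (copied) {
  gint64 size;
  const guint8 *bytes = garrow_buffer_get_data(copied, &size);
  /* bytes points at "ell"; size == 3 */
  g_object_unref(copied);
} else {
  g_print("failed to copy: %s\n", error->message);
  g_error_free(error);
}
g_object_unref(sliced);
g_object_unref(buffer);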
+ */ +struct _GArrowBuffer +{ + /*< private >*/ + GObject parent_instance; +}; + +struct _GArrowBufferClass +{ + GObjectClass parent_class; +}; + +GType garrow_buffer_get_type (void) G_GNUC_CONST; + +GArrowBuffer *garrow_buffer_new (const guint8 *data, + gint64 size); +gboolean garrow_buffer_is_mutable (GArrowBuffer *buffer); +gint64 garrow_buffer_get_capacity (GArrowBuffer *buffer); +const guint8 *garrow_buffer_get_data (GArrowBuffer *buffer, + gint64 *size); +gint64 garrow_buffer_get_size (GArrowBuffer *buffer); +GArrowBuffer *garrow_buffer_get_parent (GArrowBuffer *buffer); + +GArrowBuffer *garrow_buffer_copy (GArrowBuffer *buffer, + gint64 start, + gint64 size, + GError **error); +GArrowBuffer *garrow_buffer_slice (GArrowBuffer *buffer, + gint64 offset, + gint64 size); + +G_END_DECLS diff --git a/c_glib/arrow-glib/buffer.hpp b/c_glib/arrow-glib/buffer.hpp new file mode 100644 index 0000000000000..00dd3de3bfd50 --- /dev/null +++ b/c_glib/arrow-glib/buffer.hpp @@ -0,0 +1,27 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +#include + +GArrowBuffer *garrow_buffer_new_raw(std::shared_ptr *arrow_buffer); +std::shared_ptr garrow_buffer_get_raw(GArrowBuffer *buffer); diff --git a/c_glib/doc/reference/arrow-glib-docs.sgml b/c_glib/doc/reference/arrow-glib-docs.sgml index 396dce5049837..3c1d8d161179c 100644 --- a/c_glib/doc/reference/arrow-glib-docs.sgml +++ b/c_glib/doc/reference/arrow-glib-docs.sgml @@ -105,6 +105,10 @@ + + Buffer + + Error diff --git a/c_glib/test/test-buffer.rb b/c_glib/test/test-buffer.rb new file mode 100644 index 0000000000000..1ea26f24ce873 --- /dev/null +++ b/c_glib/test/test-buffer.rb @@ -0,0 +1,55 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestBuffer < Test::Unit::TestCase + def setup + @data = "Hello" + @buffer = Arrow::Buffer.new(@data) + end + + def test_mutable? + assert do + not @buffer.mutable? 
+ end + end + + def test_capacity + assert_equal(@data.bytesize, @buffer.capacity) + end + + def test_data + assert_equal(@data, @buffer.data.pack("C*")) + end + + def test_size + assert_equal(@data.bytesize, @buffer.size) + end + + def test_parent + assert_nil(@buffer.parent) + end + + def test_copy + copied_buffer = @buffer.copy(1, 3) + assert_equal(@data[1, 3], copied_buffer.data.pack("C*")) + end + + def test_slice + sliced_buffer = @buffer.slice(1, 3) + assert_equal(@data[1, 3], sliced_buffer.data.pack("C*")) + end +end From 9d532c49d563ec22f73af3cc49549eb2e5cb6898 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 12 Apr 2017 13:05:21 -0400 Subject: [PATCH 0505/1644] ARROW-539: [Python] Add support for reading partitioned Parquet files with Hive-like directory schemes I probably didn't get all the use cases, but this should be a good start. First, the directory structure is walked to determine the distinct partition keys. These keys are later used as the dictionary for `arrow::DictionaryArray` objects which are constructed. I also created the `ParquetDatasetPiece` class to enable distributed processing of file components in frameworks like Dask. We may need to address pickling of the `ParquetPartitions` object (which must be passed to `ParquetDatasetPiece.read` so the right array metadata can be constructed. Author: Wes McKinney Author: Miki Tebeka Closes #529 from wesm/ARROW-539 and squashes the following commits: a0451fa [Wes McKinney] Code review comments deb6d82 [Wes McKinney] Don't make file-like Python object on LocalFilesystem 04dc691 [Wes McKinney] Complete initial partitioned reads, supporting unit tests. Expose arrow::Table::AddColumn 7d33755 [Wes McKinney] Untested draft of ParquetManifest for partitioned directory structures. Get test suite passing again ba8825f [Wes McKinney] Prototyping 18fe639 [Wes McKinney] Refactoring, add ParquetDataset, ParquetDatasetPiece 016b445 [Miki Tebeka] [ARROW-539] [Python] Support reading Parquet datasets with standard partition directory schemes --- python/pyarrow/filesystem.py | 25 +- python/pyarrow/includes/libarrow.pxd | 2 + python/pyarrow/parquet.py | 547 +++++++++++++++++++++++---- python/pyarrow/table.pxd | 1 + python/pyarrow/table.pyx | 40 +- python/pyarrow/tests/test_parquet.py | 156 +++++++- python/pyarrow/tests/test_table.py | 31 ++ 7 files changed, 692 insertions(+), 110 deletions(-) diff --git a/python/pyarrow/filesystem.py b/python/pyarrow/filesystem.py index e820806ab4e68..269cf1c8ffa12 100644 --- a/python/pyarrow/filesystem.py +++ b/python/pyarrow/filesystem.py @@ -87,20 +87,10 @@ def read_parquet(self, path, columns=None, metadata=None, schema=None, ------- table : pyarrow.Table """ - from pyarrow.parquet import read_multiple_files - - if self.isdir(path): - paths_to_read = [] - for path in self.ls(path): - if path.endswith('parq') or path.endswith('parquet'): - paths_to_read.append(path) - else: - paths_to_read = [path] - - return read_multiple_files(paths_to_read, columns=columns, - filesystem=self, schema=schema, - metadata=metadata, - nthreads=nthreads) + from pyarrow.parquet import ParquetDataset + dataset = ParquetDataset(path, schema=schema, metadata=metadata, + filesystem=self) + return dataset.read(columns=columns, nthreads=nthreads) class LocalFilesystem(Filesystem): @@ -117,6 +107,13 @@ def get_instance(cls): def ls(self, path): return sorted(pjoin(path, x) for x in os.listdir(path)) + @implements(Filesystem.mkdir) + def mkdir(self, path, create_parents=True): + if create_parents: + os.makedirs(path) + else: 
+ os.mkdir(path) + @implements(Filesystem.isdir) def isdir(self, path): return os.path.isdir(path) diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 40dd83776b82d..ae2b45fbdb212 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -291,6 +291,8 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: shared_ptr[CSchema] schema() shared_ptr[CColumn] column(int i) + CStatus AddColumn(int i, const shared_ptr[CColumn]& column, + shared_ptr[CTable]* out) CStatus RemoveColumn(int i, shared_ptr[CTable]* out) cdef cppclass CTensor" arrow::Tensor": diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py index d95c3b3aecaf8..f81b6c24c691f 100644 --- a/python/pyarrow/parquet.py +++ b/python/pyarrow/parquet.py @@ -17,18 +17,23 @@ import six +import numpy as np + +from pyarrow.filesystem import LocalFilesystem from pyarrow._parquet import (ParquetReader, FileMetaData, # noqa RowGroupMetaData, Schema, ParquetWriter) import pyarrow._parquet as _parquet # noqa -from pyarrow.table import concat_tables +import pyarrow.array as _array +import pyarrow.table as _table -EXCLUDED_PARQUET_PATHS = {'_metadata', '_common_metadata', '_SUCCESS'} +# ---------------------------------------------------------------------- +# Reading a single Parquet file class ParquetFile(object): """ - Open a Parquet binary file for reading + Reader interface for a single Parquet file Parameters ---------- @@ -72,7 +77,8 @@ def read_row_group(self, i, columns=None, nthreads=1): Content of the row group as a table (of columns) """ column_indices = self._get_column_indices(columns) - self.reader.set_num_threads(nthreads) + if nthreads is not None: + self.reader.set_num_threads(nthreads) return self.reader.read_row_group(i, column_indices=column_indices) def read(self, columns=None, nthreads=1): @@ -93,7 +99,8 @@ def read(self, columns=None, nthreads=1): Content of the file as a table (of columns) """ column_indices = self._get_column_indices(columns) - self.reader.set_num_threads(nthreads) + if nthreads is not None: + self.reader.set_num_threads(nthreads) return self.reader.read_all(column_indices=column_indices) def _get_column_indices(self, column_names): @@ -104,6 +111,463 @@ def _get_column_indices(self, column_names): for column in column_names] +# ---------------------------------------------------------------------- +# Metadata container providing instructions about reading a single Parquet +# file, possibly part of a partitioned dataset + + +class ParquetDatasetPiece(object): + """ + A single chunk of a potentially larger Parquet dataset to read. The + arguments will indicate to read either a single row group or all row + groups, and whether to add partition keys to the resulting pyarrow.Table + + Parameters + ---------- + path : str + Path to file in the file system where this piece is located + partition_keys : list of tuples + [(column name, ordinal index)] + row_group : int, default None + Row group to load. 
By default, reads all row groups + """ + + def __init__(self, path, row_group=None, partition_keys=None): + self.path = path + self.row_group = row_group + self.partition_keys = partition_keys or [] + + def __eq__(self, other): + if not isinstance(other, ParquetDatasetPiece): + return False + return (self.path == other.path and + self.row_group == other.row_group and + self.partition_keys == other.partition_keys) + + def __ne__(self, other): + return not (self == other) + + def __repr__(self): + return ('{0}({1!r}, row_group={2!r}, partition_keys={3!r})' + .format(type(self).__name__, self.path, + self.row_group, + self.partition_keys)) + + def __str__(self): + result = '' + + if len(self.partition_keys) > 0: + partition_str = ', '.join('{0}={1}'.format(name, index) + for name, index in self.partition_keys) + result += 'partition[{0}] '.format(partition_str) + + result += self.path + + if self.row_group is not None: + result += ' | row_group={0}'.format(self.row_group) + + return result + + def get_metadata(self, open_file_func=None): + """ + Given a function that can create an open ParquetFile object, return the + file's metadata + """ + return self._open(open_file_func).metadata + + def _open(self, open_file_func=None): + """ + Returns instance of ParquetFile + """ + if open_file_func is None: + def simple_opener(path): + return ParquetFile(path) + open_file_func = simple_opener + return open_file_func(self.path) + + def read(self, columns=None, nthreads=1, partitions=None, + open_file_func=None): + """ + Read this piece as a pyarrow.Table + + Parameters + ---------- + columns : list of column names, default None + nthreads : int, default 1 + For multithreaded file reads + partitions : ParquetPartitions, default None + open_file_func : function, default None + A function that knows how to construct a ParquetFile object given + the file path in this piece + + Returns + ------- + table : pyarrow.Table + """ + reader = self._open(open_file_func) + + if self.row_group is not None: + table = reader.read_row_group(self.row_group, columns=columns, + nthreads=nthreads) + else: + table = reader.read(columns=columns, nthreads=nthreads) + + if len(self.partition_keys) > 0: + if partitions is None: + raise ValueError('Must pass partition sets') + + # Here, the index is the categorical code of the partition where + # this piece is located. Suppose we had + # + # /foo=a/0.parq + # /foo=b/0.parq + # /foo=c/0.parq + # + # Then we assign a=0, b=1, c=2. And the resulting Table pieces will + # have a DictionaryArray column named foo having the constant index + # value as indicated. The distinct categories of the partition have + # been computed in the ParquetManifest + for i, (name, index) in enumerate(self.partition_keys): + # The partition code is the same for all values in this piece + indices = np.array([index], dtype='i4').repeat(len(table)) + + # This is set of all partition values, computed as part of the + # manifest, so ['a', 'b', 'c'] as in our example above. + dictionary = partitions.levels[i].dictionary + + arr = _array.DictionaryArray.from_arrays(indices, dictionary) + col = _table.Column.from_array(name, arr) + table = table.append_column(col) + + return table + + +def _is_parquet_file(path): + return path.endswith('parq') or path.endswith('parquet') + + +class PartitionSet(object): + """A data structure for cataloguing the observed Parquet partitions at a + particular level. 
So if we have + + /foo=a/bar=0 + /foo=a/bar=1 + /foo=a/bar=2 + /foo=b/bar=0 + /foo=b/bar=1 + /foo=b/bar=2 + + Then we have two partition sets, one for foo, another for bar. As we visit + levels of the partition hierarchy, a PartitionSet tracks the distinct + values and assigns categorical codes to use when reading the pieces + """ + + def __init__(self, name, keys=None): + self.name = name + self.keys = keys or [] + self.key_indices = {k: i for i, k in enumerate(self.keys)} + self._dictionary = None + + def get_index(self, key): + """ + Get the index of the partition value if it is known, otherwise assign + one + """ + if key in self.key_indices: + return self.key_indices[key] + else: + index = len(self.key_indices) + self.keys.append(key) + self.key_indices[key] = index + return index + + @property + def dictionary(self): + if self._dictionary is not None: + return self._dictionary + + if len(self.keys) == 0: + raise ValueError('No known partition keys') + + # Only integer and string partition types are supported right now + try: + integer_keys = [int(x) for x in self.keys] + dictionary = _array.from_pylist(integer_keys) + except ValueError: + dictionary = _array.from_pylist(self.keys) + + self._dictionary = dictionary + return dictionary + + @property + def is_sorted(self): + return list(self.keys) == sorted(self.keys) + + +class ParquetPartitions(object): + + def __init__(self): + self.levels = [] + self.partition_names = set() + + def __len__(self): + return len(self.levels) + + def __getitem__(self, i): + return self.levels[i] + + def get_index(self, level, name, key): + """ + Record a partition value at a particular level, returning the distinct + code for that value at that level. Example: + + partitions.get_index(1, 'foo', 'a') returns 0 + partitions.get_index(1, 'foo', 'b') returns 1 + partitions.get_index(1, 'foo', 'c') returns 2 + partitions.get_index(1, 'foo', 'a') returns 0 + + Parameters + ---------- + level : int + The nesting level of the partition we are observing + name : string + The partition name + key : string or int + The partition value + """ + if level == len(self.levels): + if name in self.partition_names: + raise ValueError('{0} was the name of the partition in ' + 'another level'.format(name)) + + part_set = PartitionSet(name) + self.levels.append(part_set) + self.partition_names.add(name) + + return self.levels[level].get_index(key) + + +def is_string(x): + return isinstance(x, six.string_types) + + +class ParquetManifest(object): + """ + + """ + def __init__(self, dirpath, filesystem=None, pathsep='/', + partition_scheme='hive'): + self.filesystem = filesystem or LocalFilesystem.get_instance() + self.pathsep = pathsep + self.dirpath = dirpath + self.partition_scheme = partition_scheme + self.partitions = ParquetPartitions() + self.pieces = [] + + self.common_metadata_path = None + self.metadata_path = None + + self._visit_level(0, self.dirpath, []) + + def _visit_level(self, level, base_path, part_keys): + directories = [] + files = [] + fs = self.filesystem + + if not fs.isdir(base_path): + raise ValueError('"{0}" is not a directory'.format(base_path)) + + for path in sorted(fs.ls(base_path)): + if fs.isfile(path): + if _is_parquet_file(path): + files.append(path) + elif path.endswith('_common_metadata'): + self.common_metadata_path = path + elif path.endswith('_metadata'): + self.metadata_path = path + elif not self._should_silently_exclude(path): + print('Ignoring path: {0}'.format(path)) + elif fs.isdir(path): + directories.append(path) + + if len(files) > 
0 and len(directories) > 0: + raise ValueError('Found files in an intermediate ' + 'directory: {0}'.format(base_path)) + elif len(directories) > 0: + self._visit_directories(level, directories, part_keys) + else: + self._push_pieces(files, part_keys) + + def _should_silently_exclude(self, path): + _, tail = path.rsplit(self.pathsep, 1) + return tail.endswith('.crc') or tail in EXCLUDED_PARQUET_PATHS + + def _visit_directories(self, level, directories, part_keys): + for path in directories: + head, tail = _path_split(path, self.pathsep) + name, key = _parse_hive_partition(tail) + + index = self.partitions.get_index(level, name, key) + dir_part_keys = part_keys + [(name, index)] + self._visit_level(level + 1, path, dir_part_keys) + + def _parse_partition(self, dirname): + if self.partition_scheme == 'hive': + return _parse_hive_partition(dirname) + else: + raise NotImplementedError('partition schema: {0}' + .format(self.partition_scheme)) + + def _push_pieces(self, files, part_keys): + self.pieces.extend([ + ParquetDatasetPiece(path, partition_keys=part_keys) + for path in files + ]) + + +def _parse_hive_partition(value): + if '=' not in value: + raise ValueError('Directory name did not appear to be a ' + 'partition: {0}'.format(value)) + return value.split('=', 1) + + +def _path_split(path, sep): + i = path.rfind(sep) + 1 + head, tail = path[:i], path[i:] + head = head.rstrip(sep) + return head, tail + + +EXCLUDED_PARQUET_PATHS = {'_SUCCESS'} + + +class ParquetDataset(object): + """ + Encapsulates details of reading a complete Parquet dataset possibly + consisting of multiple files and partitions in subdirectories + + Parameters + ---------- + path_or_paths : str or List[str] + A directory name, single file name, or list of file names + filesystem : Filesystem, default None + If nothing passed, paths assumed to be found in the local on-disk + filesystem + metadata : pyarrow.parquet.FileMetaData + Use metadata obtained elsewhere to validate file schemas + schema : pyarrow.parquet.Schema + Use schema obtained elsewhere to validate file schemas. Alternative to + metadata parameter + split_row_groups : boolean, default False + Divide files into pieces for each row group in the file + validate_schema : boolean, default True + Check that individual file schemas are all the same / compatible + """ + def __init__(self, path_or_paths, filesystem=None, schema=None, + metadata=None, split_row_groups=False, validate_schema=True): + if filesystem is None: + self.fs = LocalFilesystem.get_instance() + else: + self.fs = filesystem + + self.pieces, self.partitions = _make_manifest(path_or_paths, self.fs) + + self.metadata = metadata + self.schema = schema + + self.split_row_groups = split_row_groups + + if split_row_groups: + raise NotImplementedError("split_row_groups not yet implemented") + + if validate_schema: + self.validate_schemas() + + def validate_schemas(self): + open_file = self._get_open_file_func() + + if self.metadata is None and self.schema is None: + self.schema = self.pieces[0].get_metadata(open_file).schema + elif self.schema is None: + self.schema = self.metadata.schema + + # Verify schemas are all equal + for piece in self.pieces: + file_metadata = piece.get_metadata(open_file) + if not self.schema.equals(file_metadata.schema): + raise ValueError('Schema in {0!s} was different. 
' + '{1!s} vs {2!s}' + .format(piece, file_metadata.schema, + self.schema)) + + def read(self, columns=None, nthreads=1): + """ + Read multiple Parquet files as a single pyarrow.Table + + Parameters + ---------- + columns : List[str] + Names of columns to read from the file + nthreads : int, default 1 + Number of columns to read in parallel. Requires that the underlying + file source is threadsafe + + Returns + ------- + pyarrow.Table + Content of the file as a table (of columns) + """ + open_file = self._get_open_file_func() + + tables = [] + for piece in self.pieces: + table = piece.read(columns=columns, nthreads=nthreads, + partitions=self.partitions, + open_file_func=open_file) + tables.append(table) + + all_data = _table.concat_tables(tables) + return all_data + + def _get_open_file_func(self): + if self.fs is None or isinstance(self.fs, LocalFilesystem): + def open_file(path, meta=None): + return ParquetFile(path, metadata=meta) + else: + def open_file(path, meta=None): + return ParquetFile(self.fs.open(path, mode='rb'), + metadata=meta) + return open_file + + +def _make_manifest(path_or_paths, fs, pathsep='/'): + partitions = None + + if is_string(path_or_paths) and fs.isdir(path_or_paths): + manifest = ParquetManifest(path_or_paths, filesystem=fs, + pathsep=pathsep) + pieces = manifest.pieces + partitions = manifest.partitions + else: + if not isinstance(path_or_paths, list): + path_or_paths = [path_or_paths] + + # List of paths + if len(path_or_paths) == 0: + raise ValueError('Must pass at least one file path') + + pieces = [] + for path in path_or_paths: + if not fs.isfile(path): + raise IOError('Passed non-file path: {0}' + .format(path)) + piece = ParquetDatasetPiece(path) + pieces.append(piece) + + return pieces, partitions + + def read_table(source, columns=None, nthreads=1, metadata=None): """ Read a Table from Parquet format @@ -127,9 +591,7 @@ def read_table(source, columns=None, nthreads=1, metadata=None): pyarrow.Table Content of the file as a table (of columns) """ - from pyarrow.filesystem import LocalFilesystem - - if isinstance(source, six.string_types): + if is_string(source): fs = LocalFilesystem.get_instance() if fs.isdir(source): return fs.read_parquet(source, columns=columns, @@ -139,70 +601,7 @@ def read_table(source, columns=None, nthreads=1, metadata=None): return pf.read(columns=columns, nthreads=nthreads) -def read_multiple_files(paths, columns=None, filesystem=None, nthreads=1, - metadata=None, schema=None): - """ - Read multiple Parquet files as a single pyarrow.Table - - Parameters - ---------- - paths : List[str] - List of file paths - columns : List[str] - Names of columns to read from the file - filesystem : Filesystem, default None - If nothing passed, paths assumed to be found in the local on-disk - filesystem - nthreads : int, default 1 - Number of columns to read in parallel. Requires that the underlying - file source is threadsafe - metadata : pyarrow.parquet.FileMetaData - Use metadata obtained elsewhere to validate file schemas - schema : pyarrow.parquet.Schema - Use schema obtained elsewhere to validate file schemas. 
Alternative to - metadata parameter - - Returns - ------- - pyarrow.Table - Content of the file as a table (of columns) - """ - if filesystem is None: - def open_file(path, meta=None): - return ParquetFile(path, metadata=meta) - else: - def open_file(path, meta=None): - return ParquetFile(filesystem.open(path, mode='rb'), metadata=meta) - - if len(paths) == 0: - raise ValueError('Must pass at least one file path') - - if metadata is None and schema is None: - schema = open_file(paths[0]).schema - elif schema is None: - schema = metadata.schema - - # Verify schemas are all equal - all_file_metadata = [] - for path in paths: - file_metadata = open_file(path).metadata - if not schema.equals(file_metadata.schema): - raise ValueError('Schema in {0} was different. {1!s} vs {2!s}' - .format(path, file_metadata.schema, schema)) - all_file_metadata.append(file_metadata) - - # Read the tables - tables = [] - for path, path_metadata in zip(paths, all_file_metadata): - reader = open_file(path, meta=path_metadata) - table = reader.read(columns=columns, nthreads=nthreads) - tables.append(table) - - all_data = concat_tables(tables) - return all_data - - -def write_table(table, sink, row_group_size=None, version='1.0', +def write_table(table, where, row_group_size=None, version='1.0', use_dictionary=True, compression='snappy', **kwargs): """ Write a Table to Parquet format @@ -210,7 +609,7 @@ def write_table(table, sink, row_group_size=None, version='1.0', Parameters ---------- table : pyarrow.Table - sink: string or pyarrow.io.NativeFile + where: string or pyarrow.io.NativeFile row_group_size : int, default None The maximum number of rows in each Parquet RowGroup. As a default, we will write a single RowGroup per file. @@ -223,7 +622,7 @@ def write_table(table, sink, row_group_size=None, version='1.0', Specify the compression codec, either on a general basis or per-column. 
""" row_group_size = kwargs.get('chunk_size', row_group_size) - writer = ParquetWriter(sink, use_dictionary=use_dictionary, + writer = ParquetWriter(where, use_dictionary=use_dictionary, compression=compression, version=version) writer.write_table(table, row_group_size=row_group_size) diff --git a/python/pyarrow/table.pxd b/python/pyarrow/table.pxd index 389727b4dc1d7..f564042b62d7b 100644 --- a/python/pyarrow/table.pxd +++ b/python/pyarrow/table.pxd @@ -58,5 +58,6 @@ cdef class RecordBatch: cdef init(self, const shared_ptr[CRecordBatch]& table) cdef _check_nullptr(self) +cdef object box_column(const shared_ptr[CColumn]& ccolumn) cdef api object table_from_ctable(const shared_ptr[CTable]& ctable) cdef api object batch_from_cbatch(const shared_ptr[CRecordBatch]& cbatch) diff --git a/python/pyarrow/table.pyx b/python/pyarrow/table.pyx index 94389a73cc974..3972bda4ee425 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/table.pyx @@ -30,8 +30,9 @@ import pyarrow.config from pyarrow.array cimport Array, box_array, wrap_array_output from pyarrow.error import ArrowException from pyarrow.error cimport check_status -from pyarrow.schema cimport box_data_type, box_schema, DataType +from pyarrow.schema cimport box_data_type, box_schema, DataType, Field +from pyarrow.schema import field from pyarrow.compat import frombytes, tobytes cimport cpython @@ -141,6 +142,19 @@ cdef class Column: self.sp_column = column self.column = column.get() + @staticmethod + def from_array(object field_or_name, Array arr): + cdef Field boxed_field + + if isinstance(field_or_name, Field): + boxed_field = field_or_name + else: + boxed_field = field(field_or_name, arr.type) + + cdef shared_ptr[CColumn] sp_column + sp_column.reset(new CColumn(boxed_field.sp_field, arr.sp_array)) + return box_column(sp_column) + def to_pandas(self): """ Convert the arrow::Column to a pandas.Series @@ -828,6 +842,24 @@ cdef class Table: """ return (self.num_rows, self.num_columns) + def add_column(self, int i, Column column): + """ + Add column to Table at position. Returns new table + """ + cdef: + shared_ptr[CTable] c_table + + with nogil: + check_status(self.table.AddColumn(i, column.sp_column, &c_table)) + + return table_from_ctable(c_table) + + def append_column(self, Column column): + """ + Append column at end of columns. Returns new table + """ + return self.add_column(self.num_columns, column) + def remove_column(self, int i): """ Create new Table with the indicated column removed @@ -865,6 +897,12 @@ def concat_tables(tables): return table_from_ctable(c_result) +cdef object box_column(const shared_ptr[CColumn]& ccolumn): + cdef Column column = Column() + column.init(ccolumn) + return column + + cdef api object table_from_ctable(const shared_ptr[CTable]& ctable): cdef Table table = Table() table.init(ctable) diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index 86165be7052c6..de1b1488c1475 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -16,11 +16,13 @@ # under the License. 
from os.path import join as pjoin +import datetime import io import os import pytest -from pyarrow.compat import guid +from pyarrow.compat import guid, u +from pyarrow.filesystem import LocalFilesystem import pyarrow as pa import pyarrow.io as paio from .pandas_examples import dataframe_with_arrays, dataframe_with_lists @@ -28,7 +30,7 @@ import numpy as np import pandas as pd -import pandas.util.testing as pdt +import pandas.util.testing as tm try: import pyarrow.parquet as pq @@ -93,7 +95,7 @@ def test_pandas_parquet_2_0_rountrip(tmpdir): pq.write_table(arrow_table, filename.strpath, version="2.0") table_read = pq.read_table(filename.strpath) df_read = table_read.to_pandas() - pdt.assert_frame_equal(df, df_read) + tm.assert_frame_equal(df, df_read) @parquet @@ -125,7 +127,7 @@ def test_pandas_parquet_1_0_rountrip(tmpdir): # We pass uint32_t as int64_t if we write Parquet version 1.0 df['uint32'] = df['uint32'].values.astype(np.int64) - pdt.assert_frame_equal(df, df_read) + tm.assert_frame_equal(df, df_read) @parquet @@ -142,7 +144,7 @@ def test_pandas_column_selection(tmpdir): table_read = pq.read_table(filename.strpath, columns=['uint8']) df_read = table_read.to_pandas() - pdt.assert_frame_equal(df[['uint8']], df_read) + tm.assert_frame_equal(df[['uint8']], df_read) def _random_integers(size, dtype): @@ -169,7 +171,7 @@ def _test_dataframe(size=10000, seed=0): 'float64': np.random.randn(size), 'float64': np.arange(size, dtype=np.float64), 'bool': np.random.randn(size) > 0, - 'strings': [pdt.rands(10) for i in range(size)] + 'strings': [tm.rands(10) for i in range(size)] }) return df @@ -183,7 +185,7 @@ def test_pandas_parquet_native_file_roundtrip(tmpdir): buf = imos.get_result() reader = paio.BufferReader(buf) df_read = pq.read_table(reader).to_pandas() - pdt.assert_frame_equal(df, df_read) + tm.assert_frame_equal(df, df_read) @parquet @@ -207,7 +209,7 @@ def test_pandas_parquet_pyfile_roundtrip(tmpdir): table_read = pq.read_table(data) df_read = table_read.to_pandas() - pdt.assert_frame_equal(df, df_read) + tm.assert_frame_equal(df, df_read) @parquet @@ -236,7 +238,7 @@ def test_pandas_parquet_configuration_options(tmpdir): use_dictionary=use_dictionary) table_read = pq.read_table(filename.strpath) df_read = table_read.to_pandas() - pdt.assert_frame_equal(df, df_read) + tm.assert_frame_equal(df, df_read) for compression in ['NONE', 'SNAPPY', 'GZIP']: pq.write_table(arrow_table, filename.strpath, @@ -244,7 +246,7 @@ def test_pandas_parquet_configuration_options(tmpdir): compression=compression) table_read = pq.read_table(filename.strpath) df_read = table_read.to_pandas() - pdt.assert_frame_equal(df, df_read) + tm.assert_frame_equal(df, df_read) def make_sample_file(df): @@ -331,7 +333,7 @@ def test_column_of_arrays(tmpdir): pq.write_table(arrow_table, filename.strpath, version="2.0") table_read = pq.read_table(filename.strpath) df_read = table_read.to_pandas() - pdt.assert_frame_equal(df, df_read) + tm.assert_frame_equal(df, df_read) @parquet @@ -344,7 +346,7 @@ def test_column_of_lists(tmpdir): pq.write_table(arrow_table, filename.strpath, version="2.0") table_read = pq.read_table(filename.strpath) df_read = table_read.to_pandas() - pdt.assert_frame_equal(df, df_read) + tm.assert_frame_equal(df, df_read) @parquet @@ -399,7 +401,7 @@ def test_pass_separate_metadata(): fileh = pq.ParquetFile(buf, metadata=metadata) - pdt.assert_frame_equal(df, fileh.read().to_pandas()) + tm.assert_frame_equal(df, fileh.read().to_pandas()) @parquet @@ -422,13 +424,121 @@ def test_read_single_row_group(): 
row_groups = [pf.read_row_group(i) for i in range(K)] result = pa.concat_tables(row_groups) - pdt.assert_frame_equal(df, result.to_pandas()) + tm.assert_frame_equal(df, result.to_pandas()) cols = df.columns[:2] row_groups = [pf.read_row_group(i, columns=cols) for i in range(K)] result = pa.concat_tables(row_groups) - pdt.assert_frame_equal(df[cols], result.to_pandas()) + tm.assert_frame_equal(df[cols], result.to_pandas()) + + +@parquet +def test_parquet_piece_basics(): + path = '/baz.parq' + + piece1 = pq.ParquetDatasetPiece(path) + piece2 = pq.ParquetDatasetPiece(path, row_group=1) + piece3 = pq.ParquetDatasetPiece( + path, row_group=1, partition_keys=[('foo', 0), ('bar', 1)]) + + assert str(piece1) == path + assert str(piece2) == '/baz.parq | row_group=1' + assert str(piece3) == 'partition[foo=0, bar=1] /baz.parq | row_group=1' + + assert piece1 == piece1 + assert piece2 == piece2 + assert piece3 == piece3 + assert piece1 != piece3 + + +@parquet +def test_partition_set_dictionary_type(): + set1 = pq.PartitionSet('key1', [u('foo'), u('bar'), u('baz')]) + set2 = pq.PartitionSet('key2', [2007, 2008, 2009]) + + assert isinstance(set1.dictionary, pa.StringArray) + assert isinstance(set2.dictionary, pa.IntegerArray) + + set3 = pq.PartitionSet('key2', [datetime.datetime(2007, 1, 1)]) + with pytest.raises(TypeError): + set3.dictionary + + +@parquet +def test_read_partitioned_directory(tmpdir): + foo_keys = [0, 1] + bar_keys = ['a', 'b', 'c'] + partition_spec = [ + ['foo', foo_keys], + ['bar', bar_keys] + ] + N = 30 + + df = pd.DataFrame({ + 'index': np.arange(N), + 'foo': np.array(foo_keys, dtype='i4').repeat(15), + 'bar': np.tile(np.tile(np.array(bar_keys, dtype=object), 5), 2), + 'values': np.random.randn(N) + }, columns=['index', 'foo', 'bar', 'values']) + + base_path = str(tmpdir) + _generate_partition_directories(base_path, partition_spec, df) + + dataset = pq.ParquetDataset(base_path) + table = dataset.read() + result_df = (table.to_pandas() + .sort_values(by='index') + .reset_index(drop=True)) + + expected_df = (df.sort_values(by='index') + .reset_index(drop=True) + .reindex(columns=result_df.columns)) + expected_df['foo'] = pd.Categorical(df['foo'], categories=foo_keys) + expected_df['bar'] = pd.Categorical(df['bar'], categories=bar_keys) + + assert (result_df.columns == ['index', 'values', 'foo', 'bar']).all() + + tm.assert_frame_equal(result_df, expected_df) + + +def _generate_partition_directories(base_dir, partition_spec, df): + # partition_spec : list of lists, e.g. 
[['foo', [0, 1, 2], + # ['bar', ['a', 'b', 'c']] + # part_table : a pyarrow.Table to write to each partition + DEPTH = len(partition_spec) + fs = LocalFilesystem.get_instance() + + def _visit_level(base_dir, level, part_keys): + name, values = partition_spec[level] + for value in values: + this_part_keys = part_keys + [(name, value)] + + level_dir = pjoin(base_dir, '{0}={1}'.format(name, value)) + fs.mkdir(level_dir) + + if level == DEPTH - 1: + # Generate example data + file_path = pjoin(level_dir, 'data.parq') + + filtered_df = _filter_partition(df, this_part_keys) + part_table = pa.Table.from_pandas(filtered_df) + pq.write_table(part_table, file_path) + else: + _visit_level(level_dir, level + 1, this_part_keys) + + _visit_level(base_dir, 0, []) + + +def _filter_partition(df, part_keys): + predicate = np.ones(len(df), dtype=bool) + + to_drop = [] + for name, value in part_keys: + to_drop.append(name) + predicate &= df[name] == value + + return df[predicate].drop(to_drop, axis=1) @parquet @@ -459,7 +569,11 @@ def test_read_multiple_files(tmpdir): with open(pjoin(dirpath, '_SUCCESS.crc'), 'wb') as f: f.write(b'0') - result = pq.read_multiple_files(paths) + def read_multiple_files(paths, columns=None, nthreads=None, **kwargs): + dataset = pq.ParquetDataset(paths, **kwargs) + return dataset.read(columns=columns, nthreads=nthreads) + + result = read_multiple_files(paths) expected = pa.concat_tables(test_data) assert result.equals(expected) @@ -467,7 +581,7 @@ def test_read_multiple_files(tmpdir): # Read with provided metadata metadata = pq.ParquetFile(paths[0]).metadata - result2 = pq.read_multiple_files(paths, metadata=metadata) + result2 = read_multiple_files(paths, metadata=metadata) assert result2.equals(expected) result3 = pa.localfs.read_parquet(dirpath, schema=metadata.schema) @@ -493,15 +607,15 @@ def test_read_multiple_files(tmpdir): bad_meta = pq.ParquetFile(bad_apple_path).metadata with pytest.raises(ValueError): - pq.read_multiple_files(paths + [bad_apple_path]) + read_multiple_files(paths + [bad_apple_path]) with pytest.raises(ValueError): - pq.read_multiple_files(paths, metadata=bad_meta) + read_multiple_files(paths, metadata=bad_meta) mixed_paths = [bad_apple_path, paths[0]] with pytest.raises(ValueError): - pq.read_multiple_files(mixed_paths, schema=bad_meta.schema) + read_multiple_files(mixed_paths, schema=bad_meta.schema) with pytest.raises(ValueError): - pq.read_multiple_files(mixed_paths) + read_multiple_files(mixed_paths) diff --git a/python/pyarrow/tests/test_table.py b/python/pyarrow/tests/test_table.py index 548f4782a7030..79b4c159fd10a 100644 --- a/python/pyarrow/tests/test_table.py +++ b/python/pyarrow/tests/test_table.py @@ -39,6 +39,14 @@ def test_basics(self): assert column.shape == (5,) assert column.to_pylist() == [-10, -5, 0, 5, 10] + def test_from_array(self): + arr = pa.from_pylist([0, 1, 2, 3, 4]) + + col1 = pa.Column.from_array('foo', arr) + col2 = pa.Column.from_array(pa.field('foo', arr.type), arr) + + assert col1.equals(col2) + def test_pandas(self): data = [ pa.from_pylist([-10, -5, 0, 5, 10]) @@ -169,6 +177,29 @@ def test_table_basics(): assert chunk is not None +def test_table_add_column(): + data = [ + pa.from_pylist(range(5)), + pa.from_pylist([-10, -5, 0, 5, 10]), + pa.from_pylist(range(5, 10)) + ] + table = pa.Table.from_arrays(data, names=('a', 'b', 'c')) + + col = pa.Column.from_array('d', data[1]) + t2 = table.add_column(3, col) + t3 = table.append_column(col) + + expected = pa.Table.from_arrays(data + [data[1]], + names=('a', 'b', 'c', 'd')) + 
assert t2.equals(expected) + assert t3.equals(expected) + + t4 = table.add_column(0, col) + expected = pa.Table.from_arrays([data[1]] + data, + names=('d', 'a', 'b', 'c')) + assert t4.equals(expected) + + def test_table_remove_column(): data = [ pa.from_pylist(range(5)), From 3d9bfc2aeb376c994ca9b257cb9156d08b870455 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 12 Apr 2017 15:51:47 -0400 Subject: [PATCH 0506/1644] ARROW-646: [Python] Conda s3 robustness, set CONDA_PKGS_DIR env variable and add Travis CI caching Author: Wes McKinney Closes #532 from wesm/ARROW-646 and squashes the following commits: 2f27123 [Wes McKinney] Fix env variable name 7ead593 [Wes McKinney] Set CONDA_PKGS_DIR env variable and add to Travis CI cache. Change some other conda settings --- .travis.yml | 6 ++++++ ci/travis_install_conda.sh | 7 +++++++ 2 files changed, 13 insertions(+) diff --git a/.travis.yml b/.travis.yml index f74a3b205c4b6..4a49c717bf75d 100644 --- a/.travis.yml +++ b/.travis.yml @@ -21,6 +21,12 @@ addons: - autoconf-archive - libgirepository1.0-dev +cache: + ccache: true + directories: + - $HOME/.conda_packages + - $HOME/.ccache + matrix: fast_finish: true allow_failures: diff --git a/ci/travis_install_conda.sh b/ci/travis_install_conda.sh index e064317f12303..c036e925427a9 100644 --- a/ci/travis_install_conda.sh +++ b/ci/travis_install_conda.sh @@ -23,13 +23,20 @@ fi wget -O miniconda.sh $MINICONDA_URL export MINICONDA=$HOME/miniconda +export CONDA_PKGS_DIRS=$HOME/.conda_packages +mkdir -p $CONDA_PKGS_DIRS bash miniconda.sh -b -p $MINICONDA export PATH="$MINICONDA/bin:$PATH" conda update -y -q conda +conda config --set auto_update_conda false conda info -a conda config --set show_channel_urls True + +# Help with SSL timeouts to S3 +conda config --set remote_connect_timeout_secs 12 + conda config --add channels https://repo.continuum.io/pkgs/free conda config --add channels conda-forge conda info -a From e9343650355b1820562bfa85d370cac2070b7c92 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 13 Apr 2017 12:46:58 +0200 Subject: [PATCH 0507/1644] ARROW-797: [Python] Make more explicitly curated public API page, sphinx cleanup Author: Wes McKinney Closes #535 from wesm/ARROW-797 and squashes the following commits: bc344a8 [Wes McKinney] rat warning fb1d916 [Wes McKinney] build_sphinx target needs extra options 00c6a03 [Wes McKinney] Remove sphinxext until it's actually needed. 
Add some ASF license headers 60d6ab6 [Wes McKinney] Update gitignore 2b9f3f9 [Wes McKinney] Add _static stub 80e4a4b [Wes McKinney] Remove unused options b662b85 [Wes McKinney] Remove unused options 30ebd05 [Wes McKinney] Cleaning, explicit API index 83e31d5 [Wes McKinney] Initial API doc d7f4ed7 [Wes McKinney] Add NumPy extensions from pandas --- ci/travis_script_python.sh | 2 +- python/cmake_modules/UseCython.cmake | 5 +- python/doc/.gitignore | 22 ++- python/doc/Makefile | 4 +- python/doc/source/_static/stub | 18 +++ python/doc/source/api.rst | 153 +++++++++++++++++++ python/doc/{ => source}/conf.py | 22 ++- python/doc/{ => source}/filesystems.rst | 0 python/doc/{ => source}/getting_involved.rst | 0 python/doc/{ => source}/index.rst | 2 +- python/doc/{ => source}/install.rst | 0 python/doc/{ => source}/jemalloc.rst | 0 python/doc/{ => source}/pandas.rst | 0 python/doc/{ => source}/parquet.rst | 0 14 files changed, 207 insertions(+), 21 deletions(-) create mode 100644 python/doc/source/_static/stub create mode 100644 python/doc/source/api.rst rename python/doc/{ => source}/conf.py (96%) rename python/doc/{ => source}/filesystems.rst (100%) rename python/doc/{ => source}/getting_involved.rst (100%) rename python/doc/{ => source}/index.rst (99%) rename python/doc/{ => source}/install.rst (100%) rename python/doc/{ => source}/jemalloc.rst (100%) rename python/doc/{ => source}/pandas.rst (100%) rename python/doc/{ => source}/parquet.rst (100%) diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index 604cd13916299..680eb01158d8f 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -123,7 +123,7 @@ python_version_tests() { if [[ "$PYTHON_VERSION" == "3.6" ]] then pip install -r doc/requirements.txt - python setup.py build_sphinx + python setup.py build_sphinx -s doc/source fi } diff --git a/python/cmake_modules/UseCython.cmake b/python/cmake_modules/UseCython.cmake index cee6066d31de0..7c06b023db2cb 100644 --- a/python/cmake_modules/UseCython.cmake +++ b/python/cmake_modules/UseCython.cmake @@ -64,7 +64,7 @@ set( CYTHON_NO_DOCSTRINGS OFF CACHE BOOL "Strip docstrings from the compiled module." ) set( CYTHON_FLAGS "" CACHE STRING "Extra flags to the cython compiler." ) -mark_as_advanced( CYTHON_ANNOTATE CYTHON_NO_DOCSTRINGS CYTHON_FLAGS ) +mark_as_advanced( CYTHON_ANNOTATE CYTHON_NO_DOCSTRINGS CYTHON_FLAGS) find_package( Cython REQUIRED ) find_package( PythonLibsNew REQUIRED ) @@ -131,7 +131,8 @@ function( compile_pyx _name pyx_target_name generated_files pyx_file) # Add the command to run the compiler. add_custom_target(${pyx_target_name} COMMAND ${CYTHON_EXECUTABLE} ${cxx_arg} ${include_directory_arg} - ${annotate_arg} ${no_docstrings_arg} ${cython_debug_arg} ${CYTHON_FLAGS} + ${annotate_arg} ${no_docstrings_arg} ${cython_debug_arg} + ${CYTHON_FLAGS} --output-file "${_name}.${extension}" ${pyx_location} DEPENDS ${pyx_location} # do not specify byproducts for now since they don't work with the older diff --git a/python/doc/.gitignore b/python/doc/.gitignore index 87d04134d6fc3..3bee39fa36fe4 100644 --- a/python/doc/.gitignore +++ b/python/doc/.gitignore @@ -1,3 +1,19 @@ -# auto-generated module documentation -pyarrow*.rst -modules.rst +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +_build +source/generated \ No newline at end of file diff --git a/python/doc/Makefile b/python/doc/Makefile index 7257583952481..65d6a4df3b20f 100644 --- a/python/doc/Makefile +++ b/python/doc/Makefile @@ -22,9 +22,9 @@ BUILDDIR = _build # Internal variables. PAPEROPT_a4 = -D latex_paper_size=a4 PAPEROPT_letter = -D latex_paper_size=letter -ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . +ALLSPHINXOPTS = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source # the i18n builder cannot share the environment and doctrees with the others -I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) . +I18NSPHINXOPTS = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) source .PHONY: help help: diff --git a/python/doc/source/_static/stub b/python/doc/source/_static/stub new file mode 100644 index 0000000000000..765c78f7bc0d2 --- /dev/null +++ b/python/doc/source/_static/stub @@ -0,0 +1,18 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ \ No newline at end of file diff --git a/python/doc/source/api.rst b/python/doc/source/api.rst new file mode 100644 index 0000000000000..514dcf966f8cc --- /dev/null +++ b/python/doc/source/api.rst @@ -0,0 +1,153 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +.. currentmodule:: pyarrow +.. _api: + +************* +API Reference +************* + +.. _api.functions: + +Type Metadata and Schemas +------------------------- + +.. 
autosummary:: + :toctree: generated/ + + null + bool_ + int8 + int16 + int32 + int64 + uint8 + uint16 + uint32 + uint64 + float16 + float32 + float64 + timestamp + date32 + date64 + binary + string + decimal + list_ + struct + dictionary + field + DataType + Field + Schema + schema + +Scalar Value Types +------------------ + +.. autosummary:: + :toctree: generated/ + + NA + NAType + Scalar + ArrayValue + Int8Value + Int16Value + Int32Value + Int64Value + UInt8Value + UInt16Value + UInt32Value + UInt64Value + FloatValue + DoubleValue + ListValue + BinaryValue + StringValue + FixedSizeBinaryValue + +Array Types +----------- + +.. autosummary:: + :toctree: generated/ + + Array + NumericArray + IntegerArray + FloatingPointArray + BooleanArray + Int8Array + Int16Array + Int32Array + Int64Array + UInt8Array + UInt16Array + UInt32Array + UInt64Array + DictionaryArray + StringArray + +Tables and Record Batches +------------------------- + +.. autosummary:: + :toctree: generated/ + + Column + RecordBatch + Table + +Tensor type and Functions +------------------------- + +.. autosummary:: + :toctree: generated/ + + Tensor + write_tensor + get_tensor_size + read_tensor + +Input / Output and Shared Memory +-------------------------------- + +.. autosummary:: + :toctree: generated/ + + Buffer + BufferReader + InMemoryOutputStream + NativeFile + MemoryMappedFile + memory_map + create_memory_map + PythonFileInterface + +Interprocess Communication and Messaging +---------------------------------------- + +.. autosummary:: + :toctree: generated/ + + FileReader + FileWriter + StreamReader + StreamWriter diff --git a/python/doc/conf.py b/python/doc/source/conf.py similarity index 96% rename from python/doc/conf.py rename to python/doc/source/conf.py index e817bbdd42bd3..a9262bf7db3dd 100644 --- a/python/doc/conf.py +++ b/python/doc/source/conf.py @@ -29,19 +29,8 @@ import os import sys -from sphinx import apidoc - import sphinx_rtd_theme - -__location__ = os.path.join(os.getcwd(), os.path.dirname( - inspect.getfile(inspect.currentframe()))) -output_dir = os.path.join(__location__) -module_dir = os.path.join(__location__, "..", "pyarrow") -cmd_line_template = "sphinx-apidoc -f -e -o {outputdir} {moduledir}" -cmd_line = cmd_line_template.format(outputdir=output_dir, moduledir=module_dir) -apidoc.main(cmd_line.split(" ")) - on_rtd = os.environ.get('READTHEDOCS') == 'True' if not on_rtd: @@ -49,6 +38,12 @@ # build pyarrow there. sys.path.insert(0, os.path.abspath('..')) +sys.path.extend([ + os.path.join(os.path.dirname(__file__), + '..', '../..') + +]) + # -- General configuration ------------------------------------------------ # If your documentation needs a minimal Sphinx version, state it here. @@ -64,7 +59,7 @@ 'sphinx.ext.doctest', 'sphinx.ext.mathjax', 'sphinx.ext.viewcode', - 'sphinx.ext.napoleon' + 'sphinx.ext.napoleon', ] # numpydoc configuration @@ -79,6 +74,9 @@ # source_suffix = ['.rst', '.md'] source_suffix = '.rst' +import glob +autosummary_generate = glob.glob("*.rst") + # The encoding of source files. 
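[Aside: the conf.py change above is the heart of ARROW-797. Instead of shelling out to sphinx-apidoc at build time to dump one page per module, the build now relies on autosummary, and api.rst curates exactly which names get stub pages under source/generated. A condensed sketch of the mechanism; the extensions list here is abbreviated for illustration, not the full list from the patch:

    # conf.py (excerpt)
    import glob

    extensions = [
        'sphinx.ext.autodoc',
        'sphinx.ext.autosummary',
    ]

    # Expand the ".. autosummary:: :toctree: generated/" directives found in
    # any reST page next to conf.py (api.rst, in this layout) into stub pages.
    autosummary_generate = glob.glob("*.rst")

With the sources relocated under doc/source, the CI hunk earlier points the builder at them explicitly: python setup.py build_sphinx -s doc/source.]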
# # source_encoding = 'utf-8-sig' diff --git a/python/doc/filesystems.rst b/python/doc/source/filesystems.rst similarity index 100% rename from python/doc/filesystems.rst rename to python/doc/source/filesystems.rst diff --git a/python/doc/getting_involved.rst b/python/doc/source/getting_involved.rst similarity index 100% rename from python/doc/getting_involved.rst rename to python/doc/source/getting_involved.rst diff --git a/python/doc/index.rst b/python/doc/source/index.rst similarity index 99% rename from python/doc/index.rst rename to python/doc/source/index.rst index 608fff5d57ba4..ecb8e8f4830f3 100644 --- a/python/doc/index.rst +++ b/python/doc/source/index.rst @@ -38,7 +38,7 @@ structures. pandas filesystems parquet - modules + api getting_involved .. toctree:: diff --git a/python/doc/install.rst b/python/doc/source/install.rst similarity index 100% rename from python/doc/install.rst rename to python/doc/source/install.rst diff --git a/python/doc/jemalloc.rst b/python/doc/source/jemalloc.rst similarity index 100% rename from python/doc/jemalloc.rst rename to python/doc/source/jemalloc.rst diff --git a/python/doc/pandas.rst b/python/doc/source/pandas.rst similarity index 100% rename from python/doc/pandas.rst rename to python/doc/source/pandas.rst diff --git a/python/doc/parquet.rst b/python/doc/source/parquet.rst similarity index 100% rename from python/doc/parquet.rst rename to python/doc/source/parquet.rst From 8b64a4fb2d3973813e2094e108021606034d27f4 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 13 Apr 2017 12:51:47 +0200 Subject: [PATCH 0508/1644] ARROW-751: [Python] Make all Cython modules private. Some code tidying I also combined schema/array/scalar, as they are all interrelated. Author: Wes McKinney Closes #533 from wesm/ARROW-751 and squashes the following commits: 63b479b [Wes McKinney] jemalloc is now private 0f46116 [Wes McKinney] Fix APIs in Parquet 1074e7c [Wes McKinney] Make all Cython modules private. 
Code cleaning --- ci/travis_script_python.sh | 2 +- python/CMakeLists.txt | 16 +- python/pyarrow/__init__.py | 84 +- python/pyarrow/{array.pxd => _array.pxd} | 116 +- python/pyarrow/_array.pyx | 1368 +++++++++++++++++ python/pyarrow/{config.pyx => _config.pyx} | 0 python/pyarrow/{error.pxd => _error.pxd} | 0 python/pyarrow/{error.pyx => _error.pyx} | 0 python/pyarrow/{io.pxd => _io.pxd} | 0 python/pyarrow/{io.pyx => _io.pyx} | 17 +- .../pyarrow/{jemalloc.pyx => _jemalloc.pyx} | 2 +- python/pyarrow/{memory.pxd => _memory.pxd} | 0 python/pyarrow/{memory.pyx => _memory.pyx} | 0 python/pyarrow/_parquet.pyx | 16 +- python/pyarrow/{table.pxd => _table.pxd} | 3 +- python/pyarrow/{table.pyx => _table.pyx} | 18 +- python/pyarrow/array.pyx | 646 -------- python/pyarrow/feather.py | 6 +- python/pyarrow/filesystem.py | 2 +- python/pyarrow/formatting.py | 4 +- python/pyarrow/includes/libarrow.pxd | 5 +- python/pyarrow/ipc.py | 10 +- python/pyarrow/parquet.py | 4 +- python/pyarrow/scalar.pxd | 72 - python/pyarrow/scalar.pyx | 315 ---- python/pyarrow/schema.pxd | 76 - python/pyarrow/schema.pyx | 477 ------ python/pyarrow/tests/test_feather.py | 2 +- python/pyarrow/tests/test_hdfs.py | 8 +- python/pyarrow/tests/test_io.py | 31 +- python/pyarrow/tests/test_parquet.py | 5 +- python/pyarrow/tests/test_schema.py | 8 +- python/setup.py | 18 +- 33 files changed, 1591 insertions(+), 1740 deletions(-) rename python/pyarrow/{array.pxd => _array.pxd} (54%) create mode 100644 python/pyarrow/_array.pyx rename python/pyarrow/{config.pyx => _config.pyx} (100%) rename python/pyarrow/{error.pxd => _error.pxd} (100%) rename python/pyarrow/{error.pyx => _error.pyx} (100%) rename python/pyarrow/{io.pxd => _io.pxd} (100%) rename python/pyarrow/{io.pyx => _io.pyx} (99%) rename python/pyarrow/{jemalloc.pyx => _jemalloc.pyx} (96%) rename python/pyarrow/{memory.pxd => _memory.pxd} (100%) rename python/pyarrow/{memory.pyx => _memory.pyx} (100%) rename python/pyarrow/{table.pxd => _table.pxd} (98%) rename python/pyarrow/{table.pyx => _table.pyx} (98%) delete mode 100644 python/pyarrow/array.pyx delete mode 100644 python/pyarrow/scalar.pxd delete mode 100644 python/pyarrow/scalar.pyx delete mode 100644 python/pyarrow/schema.pxd delete mode 100644 python/pyarrow/schema.pyx diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index 680eb01158d8f..549fe1141cfb1 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -115,7 +115,7 @@ python_version_tests() { python setup.py build_ext --inplace --with-parquet --with-jemalloc python -c "import pyarrow.parquet" - python -c "import pyarrow.jemalloc" + python -c "import pyarrow._jemalloc" python -m pytest -vv -r sxX pyarrow diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 3e86521757342..36052bc257232 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -261,14 +261,12 @@ if (UNIX) endif() set(CYTHON_EXTENSIONS - array - config - error - io - memory - scalar - schema - table + _array + _config + _error + _io + _memory + _table ) set(LINK_LIBS @@ -313,7 +311,7 @@ if (PYARROW_BUILD_JEMALLOC) arrow_jemalloc_shared) set(CYTHON_EXTENSIONS ${CYTHON_EXTENSIONS} - jemalloc) + _jemalloc) endif() ############################################################ diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index df615b428c1c1..66bde4933ee2d 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -25,49 +25,10 @@ pass -import pyarrow.config -from pyarrow.config import cpu_count, set_cpu_count +import 
pyarrow._config +from pyarrow._config import cpu_count, set_cpu_count -from pyarrow.array import (Array, Tensor, from_pylist, - NumericArray, IntegerArray, FloatingPointArray, - BooleanArray, - Int8Array, UInt8Array, - Int16Array, UInt16Array, - Int32Array, UInt32Array, - Int64Array, UInt64Array, - ListArray, StringArray, - DictionaryArray) - -from pyarrow.error import (ArrowException, - ArrowKeyError, - ArrowInvalid, - ArrowIOError, - ArrowMemoryError, - ArrowNotImplementedError, - ArrowTypeError) - -from pyarrow.filesystem import Filesystem, HdfsClient, LocalFilesystem -from pyarrow.io import (HdfsFile, NativeFile, PythonFileInterface, - Buffer, BufferReader, InMemoryOutputStream, - MemoryMappedFile, memory_map, - frombuffer, read_tensor, write_tensor, - memory_map, create_memory_map, - get_record_batch_size, get_tensor_size) - -from pyarrow.ipc import FileReader, FileWriter, StreamReader, StreamWriter - -from pyarrow.memory import MemoryPool, total_allocated_bytes - -from pyarrow.scalar import (ArrayValue, Scalar, NA, NAType, - BooleanValue, - Int8Value, Int16Value, Int32Value, Int64Value, - UInt8Value, UInt16Value, UInt32Value, UInt64Value, - FloatValue, DoubleValue, ListValue, - BinaryValue, StringValue, FixedSizeBinaryValue) - -import pyarrow.schema as _schema - -from pyarrow.schema import (null, bool_, +from pyarrow._array import (null, bool_, int8, int16, int32, int64, uint8, uint16, uint32, uint64, timestamp, date32, date64, @@ -75,10 +36,45 @@ binary, string, decimal, list_, struct, dictionary, field, DataType, FixedSizeBinaryType, - Field, Schema, schema) + Field, Schema, schema, + Array, Tensor, + from_pylist, + from_numpy_dtype, + NumericArray, IntegerArray, FloatingPointArray, + BooleanArray, + Int8Array, UInt8Array, + Int16Array, UInt16Array, + Int32Array, UInt32Array, + Int64Array, UInt64Array, + ListArray, StringArray, + DictionaryArray, + ArrayValue, Scalar, NA, NAType, + BooleanValue, + Int8Value, Int16Value, Int32Value, Int64Value, + UInt8Value, UInt16Value, UInt32Value, UInt64Value, + FloatValue, DoubleValue, ListValue, + BinaryValue, StringValue, FixedSizeBinaryValue) +from pyarrow._io import (HdfsFile, NativeFile, PythonFileInterface, + Buffer, BufferReader, InMemoryOutputStream, + OSFile, MemoryMappedFile, memory_map, + frombuffer, read_tensor, write_tensor, + memory_map, create_memory_map, + get_record_batch_size, get_tensor_size) + +from pyarrow._memory import MemoryPool, total_allocated_bytes +from pyarrow._table import Column, RecordBatch, Table, concat_tables +from pyarrow._error import (ArrowException, + ArrowKeyError, + ArrowInvalid, + ArrowIOError, + ArrowMemoryError, + ArrowNotImplementedError, + ArrowTypeError) -from pyarrow.table import Column, RecordBatch, Table, concat_tables +from pyarrow.filesystem import Filesystem, HdfsClient, LocalFilesystem + +from pyarrow.ipc import FileReader, FileWriter, StreamReader, StreamWriter localfs = LocalFilesystem.get_instance() diff --git a/python/pyarrow/array.pxd b/python/pyarrow/_array.pxd similarity index 54% rename from python/pyarrow/array.pxd rename to python/pyarrow/_array.pxd index 3ba48718265db..40413746fc94b 100644 --- a/python/pyarrow/array.pxd +++ b/python/pyarrow/_array.pxd @@ -15,20 +15,109 @@ # specific language governing permissions and limitations # under the License. 
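[Aside: the __init__.py rewiring above is the crux of ARROW-751. The Cython implementation moves into underscore-prefixed modules (_array, _io, _table, ...), and the top-level pyarrow namespace re-exports the curated surface, so the old public module paths go away. A before/after sketch of what this means for user code; the names used are ones re-exported in the hunk above:

    # Before this patch, extension modules were addressable directly:
    #   import pyarrow.io as paio
    #   from pyarrow.array import Array
    #
    # After it, the top-level namespace is the supported surface:
    import pyarrow as pa

    arr = pa.from_pylist([1, 2, 3])      # re-exported from pyarrow._array
    buf = pa.frombuffer(b'some bytes')   # re-exported from pyarrow._io
    t = pa.Table.from_arrays([arr], names=('a',))   # from pyarrow._table

Private modules remain importable (the CI script above smoke-tests pyarrow._jemalloc), but nothing outside pyarrow itself should depend on them.]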
-from pyarrow.includes.common cimport shared_ptr, int64_t -from pyarrow.includes.libarrow cimport CArray, CTensor - -from pyarrow.scalar import NA - -from pyarrow.schema cimport DataType +from pyarrow.includes.common cimport * +from pyarrow.includes.libarrow cimport * from cpython cimport PyObject - cdef extern from "Python.h": int PySlice_Check(object) +cdef class DataType: + cdef: + shared_ptr[CDataType] sp_type + CDataType* type + + cdef void init(self, const shared_ptr[CDataType]& type) + + +cdef class DictionaryType(DataType): + cdef: + const CDictionaryType* dict_type + + +cdef class TimestampType(DataType): + cdef: + const CTimestampType* ts_type + + +cdef class FixedSizeBinaryType(DataType): + cdef: + const CFixedSizeBinaryType* fixed_size_binary_type + + +cdef class DecimalType(FixedSizeBinaryType): + cdef: + const CDecimalType* decimal_type + + +cdef class Field: + cdef: + shared_ptr[CField] sp_field + CField* field + + cdef readonly: + DataType type + + cdef init(self, const shared_ptr[CField]& field) + + +cdef class Schema: + cdef: + shared_ptr[CSchema] sp_schema + CSchema* schema + + cdef init(self, const vector[shared_ptr[CField]]& fields) + cdef init_schema(self, const shared_ptr[CSchema]& schema) + + +cdef class Scalar: + cdef readonly: + DataType type + + +cdef class NAType(Scalar): + pass + + +cdef class ArrayValue(Scalar): + cdef: + shared_ptr[CArray] sp_array + int64_t index + + cdef void init(self, DataType type, + const shared_ptr[CArray]& sp_array, int64_t index) + + cdef void _set_array(self, const shared_ptr[CArray]& sp_array) + + +cdef class Int8Value(ArrayValue): + pass + + +cdef class Int64Value(ArrayValue): + pass + + +cdef class ListValue(ArrayValue): + cdef readonly: + DataType value_type + + cdef: + CListArray* ap + + cdef getitem(self, int64_t i) + + +cdef class StringValue(ArrayValue): + pass + + +cdef class FixedSizeBinaryValue(ArrayValue): + pass + + cdef class Array: cdef: shared_ptr[CArray] sp_array @@ -52,10 +141,6 @@ cdef class Tensor: cdef init(self, const shared_ptr[CTensor]& sp_tensor) -cdef object box_array(const shared_ptr[CArray]& sp_array) -cdef object box_tensor(const shared_ptr[CTensor]& sp_tensor) - - cdef class BooleanArray(Array): pass @@ -137,5 +222,12 @@ cdef class DictionaryArray(Array): object _indices, _dictionary - cdef wrap_array_output(PyObject* output) +cdef DataType box_data_type(const shared_ptr[CDataType]& type) +cdef Field box_field(const shared_ptr[CField]& field) +cdef Schema box_schema(const shared_ptr[CSchema]& schema) +cdef object box_array(const shared_ptr[CArray]& sp_array) +cdef object box_tensor(const shared_ptr[CTensor]& sp_tensor) +cdef object box_scalar(DataType type, + const shared_ptr[CArray]& sp_array, + int64_t index) diff --git a/python/pyarrow/_array.pyx b/python/pyarrow/_array.pyx new file mode 100644 index 0000000000000..7ef8e5867a1a2 --- /dev/null +++ b/python/pyarrow/_array.pyx @@ -0,0 +1,1368 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +# cython: profile=False +# distutils: language = c++ +# cython: embedsignature = True + +from cython.operator cimport dereference as deref +from pyarrow.includes.libarrow cimport * +from pyarrow.includes.common cimport PyObject_to_object +cimport pyarrow.includes.pyarrow as pyarrow +from pyarrow._error cimport check_status +from pyarrow._memory cimport MemoryPool, maybe_unbox_memory_pool +cimport cpython as cp + + +import datetime +import decimal as _pydecimal +import numpy as np +import six +import pyarrow._config +from pyarrow.compat import frombytes, tobytes, PandasSeries, Categorical + + +cdef _pandas(): + import pandas as pd + return pd + + +cdef class DataType: + + def __cinit__(self): + pass + + cdef void init(self, const shared_ptr[CDataType]& type): + self.sp_type = type + self.type = type.get() + + def __str__(self): + return frombytes(self.type.ToString()) + + def __repr__(self): + return '{0.__class__.__name__}({0})'.format(self) + + def __richcmp__(DataType self, DataType other, int op): + if op == cp.Py_EQ: + return self.type.Equals(deref(other.type)) + elif op == cp.Py_NE: + return not self.type.Equals(deref(other.type)) + else: + raise TypeError('Invalid comparison') + + +cdef class DictionaryType(DataType): + + cdef void init(self, const shared_ptr[CDataType]& type): + DataType.init(self, type) + self.dict_type = type.get() + + +cdef class TimestampType(DataType): + + cdef void init(self, const shared_ptr[CDataType]& type): + DataType.init(self, type) + self.ts_type = type.get() + + property unit: + + def __get__(self): + return timeunit_to_string(self.ts_type.unit()) + + property tz: + + def __get__(self): + if self.ts_type.timezone().size() > 0: + return frombytes(self.ts_type.timezone()) + else: + return None + + +cdef class FixedSizeBinaryType(DataType): + + cdef void init(self, const shared_ptr[CDataType]& type): + DataType.init(self, type) + self.fixed_size_binary_type = ( + type.get()) + + property byte_width: + + def __get__(self): + return self.fixed_size_binary_type.byte_width() + + +cdef class DecimalType(FixedSizeBinaryType): + + cdef void init(self, const shared_ptr[CDataType]& type): + DataType.init(self, type) + self.decimal_type = type.get() + + +cdef class Field: + + def __cinit__(self): + pass + + cdef init(self, const shared_ptr[CField]& field): + self.sp_field = field + self.field = field.get() + self.type = box_data_type(field.get().type()) + + @classmethod + def from_py(cls, object name, DataType type, bint nullable=True): + cdef Field result = Field() + result.type = type + result.sp_field.reset(new CField(tobytes(name), type.sp_type, + nullable)) + result.field = result.sp_field.get() + + return result + + def __repr__(self): + return 'Field({0!r}, type={1})'.format(self.name, str(self.type)) + + property nullable: + + def __get__(self): + return self.field.nullable() + + property name: + + def __get__(self): + if box_field(self.sp_field) is None: + raise ReferenceError( + 'Field not initialized (references NULL pointer)') + return frombytes(self.field.name()) + + +cdef class Schema: + + def __cinit__(self): + pass + + def __len__(self): 
+ return self.schema.num_fields() + + def __getitem__(self, i): + if i < 0 or i >= len(self): + raise IndexError("{0} is out of bounds".format(i)) + + cdef Field result = Field() + result.init(self.schema.field(i)) + result.type = box_data_type(result.field.type()) + + return result + + cdef init(self, const vector[shared_ptr[CField]]& fields): + self.schema = new CSchema(fields) + self.sp_schema.reset(self.schema) + + cdef init_schema(self, const shared_ptr[CSchema]& schema): + self.schema = schema.get() + self.sp_schema = schema + + def equals(self, other): + """ + Test if this schema is equal to the other + """ + cdef Schema _other + _other = other + + return self.sp_schema.get().Equals(deref(_other.schema)) + + def field_by_name(self, name): + """ + Access a field by its name rather than the column index. + + Parameters + ---------- + name: str + + Returns + ------- + field: pyarrow.Field + """ + return box_field(self.schema.GetFieldByName(tobytes(name))) + + @classmethod + def from_fields(cls, fields): + cdef: + Schema result + Field field + vector[shared_ptr[CField]] c_fields + + c_fields.resize(len(fields)) + + for i in range(len(fields)): + field = fields[i] + c_fields[i] = field.sp_field + + result = Schema() + result.init(c_fields) + + return result + + def __str__(self): + return frombytes(self.schema.ToString()) + + def __repr__(self): + return self.__str__() + + +cdef dict _type_cache = {} + + +cdef DataType primitive_type(Type type): + if type in _type_cache: + return _type_cache[type] + + cdef DataType out = DataType() + out.init(pyarrow.GetPrimitiveType(type)) + + _type_cache[type] = out + return out + +#------------------------------------------------------------ +# Type factory functions + +def field(name, type, bint nullable=True): + return Field.from_py(name, type, nullable) + + +cdef set PRIMITIVE_TYPES = set([ + Type_NA, Type_BOOL, + Type_UINT8, Type_INT8, + Type_UINT16, Type_INT16, + Type_UINT32, Type_INT32, + Type_UINT64, Type_INT64, + Type_TIMESTAMP, Type_DATE32, + Type_DATE64, + Type_HALF_FLOAT, + Type_FLOAT, + Type_DOUBLE]) + + +def null(): + return primitive_type(Type_NA) + + +def bool_(): + return primitive_type(Type_BOOL) + + +def uint8(): + return primitive_type(Type_UINT8) + + +def int8(): + return primitive_type(Type_INT8) + + +def uint16(): + return primitive_type(Type_UINT16) + + +def int16(): + return primitive_type(Type_INT16) + + +def uint32(): + return primitive_type(Type_UINT32) + + +def int32(): + return primitive_type(Type_INT32) + + +def uint64(): + return primitive_type(Type_UINT64) + + +def int64(): + return primitive_type(Type_INT64) + + +cdef dict _timestamp_type_cache = {} + + +cdef timeunit_to_string(TimeUnit unit): + if unit == TimeUnit_SECOND: + return 's' + elif unit == TimeUnit_MILLI: + return 'ms' + elif unit == TimeUnit_MICRO: + return 'us' + elif unit == TimeUnit_NANO: + return 'ns' + + +def timestamp(unit_str, tz=None): + cdef: + TimeUnit unit + c_string c_timezone + + if unit_str == "s": + unit = TimeUnit_SECOND + elif unit_str == 'ms': + unit = TimeUnit_MILLI + elif unit_str == 'us': + unit = TimeUnit_MICRO + elif unit_str == 'ns': + unit = TimeUnit_NANO + else: + raise TypeError('Invalid TimeUnit string') + + cdef TimestampType out = TimestampType() + + if tz is None: + out.init(ctimestamp(unit)) + if unit in _timestamp_type_cache: + return _timestamp_type_cache[unit] + _timestamp_type_cache[unit] = out + else: + if not isinstance(tz, six.string_types): + tz = tz.zone + + c_timezone = tobytes(tz) + out.init(ctimestamp(unit, 
c_timezone)) + + return out + + +def date32(): + return primitive_type(Type_DATE32) + + +def date64(): + return primitive_type(Type_DATE64) + + +def float16(): + return primitive_type(Type_HALF_FLOAT) + + +def float32(): + return primitive_type(Type_FLOAT) + + +def float64(): + return primitive_type(Type_DOUBLE) + + +cpdef DataType decimal(int precision, int scale=0): + cdef shared_ptr[CDataType] decimal_type + decimal_type.reset(new CDecimalType(precision, scale)) + return box_data_type(decimal_type) + + +def string(): + """ + UTF8 string + """ + return primitive_type(Type_STRING) + + +def binary(int length=-1): + """Binary (PyBytes-like) type + + Parameters + ---------- + length : int, optional, default -1 + If length == -1 then return a variable length binary type. If length is + greater than or equal to 0 then return a fixed size binary type of + width `length`. + """ + if length == -1: + return primitive_type(Type_BINARY) + + cdef shared_ptr[CDataType] fixed_size_binary_type + fixed_size_binary_type.reset(new CFixedSizeBinaryType(length)) + return box_data_type(fixed_size_binary_type) + + +def list_(DataType value_type): + cdef DataType out = DataType() + cdef shared_ptr[CDataType] list_type + list_type.reset(new CListType(value_type.sp_type)) + out.init(list_type) + return out + + +def dictionary(DataType index_type, Array dictionary): + """ + Dictionary (categorical, or simply encoded) type + """ + cdef DictionaryType out = DictionaryType() + cdef shared_ptr[CDataType] dict_type + dict_type.reset(new CDictionaryType(index_type.sp_type, + dictionary.sp_array)) + out.init(dict_type) + return out + + +def struct(fields): + """ + + """ + cdef: + DataType out = DataType() + Field field + vector[shared_ptr[CField]] c_fields + cdef shared_ptr[CDataType] struct_type + + for field in fields: + c_fields.push_back(field.sp_field) + + struct_type.reset(new CStructType(c_fields)) + out.init(struct_type) + return out + + +def schema(fields): + return Schema.from_fields(fields) + + +cdef DataType box_data_type(const shared_ptr[CDataType]& type): + cdef: + DataType out + + if type.get() == NULL: + return None + + if type.get().id() == Type_DICTIONARY: + out = DictionaryType() + elif type.get().id() == Type_TIMESTAMP: + out = TimestampType() + elif type.get().id() == Type_FIXED_SIZE_BINARY: + out = FixedSizeBinaryType() + elif type.get().id() == Type_DECIMAL: + out = DecimalType() + else: + out = DataType() + + out.init(type) + return out + +cdef Field box_field(const shared_ptr[CField]& field): + if field.get() == NULL: + return None + cdef Field out = Field() + out.init(field) + return out + +cdef Schema box_schema(const shared_ptr[CSchema]& type): + cdef Schema out = Schema() + out.init_schema(type) + return out + + +def from_numpy_dtype(object dtype): + cdef shared_ptr[CDataType] c_type + with nogil: + check_status(pyarrow.NumPyDtypeToArrow(dtype, &c_type)) + + return box_data_type(c_type) + + +NA = None + + +cdef class NAType(Scalar): + + def __cinit__(self): + global NA + if NA is not None: + raise Exception('Cannot create multiple NAType instances') + + self.type = null() + + def __repr__(self): + return 'NA' + + def as_py(self): + return None + + +NA = NAType() + + +cdef class ArrayValue(Scalar): + + cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array, + int64_t index): + self.type = type + self.index = index + self._set_array(sp_array) + + cdef void _set_array(self, const shared_ptr[CArray]& sp_array): + self.sp_array = sp_array + + def __repr__(self): + if hasattr(self, 
'as_py'): + return repr(self.as_py()) + else: + return super(Scalar, self).__repr__() + + +cdef class BooleanValue(ArrayValue): + + def as_py(self): + cdef CBooleanArray* ap = self.sp_array.get() + return ap.Value(self.index) + + +cdef class Int8Value(ArrayValue): + + def as_py(self): + cdef CInt8Array* ap = self.sp_array.get() + return ap.Value(self.index) + + +cdef class UInt8Value(ArrayValue): + + def as_py(self): + cdef CUInt8Array* ap = self.sp_array.get() + return ap.Value(self.index) + + +cdef class Int16Value(ArrayValue): + + def as_py(self): + cdef CInt16Array* ap = self.sp_array.get() + return ap.Value(self.index) + + +cdef class UInt16Value(ArrayValue): + + def as_py(self): + cdef CUInt16Array* ap = self.sp_array.get() + return ap.Value(self.index) + + +cdef class Int32Value(ArrayValue): + + def as_py(self): + cdef CInt32Array* ap = self.sp_array.get() + return ap.Value(self.index) + + +cdef class UInt32Value(ArrayValue): + + def as_py(self): + cdef CUInt32Array* ap = self.sp_array.get() + return ap.Value(self.index) + + +cdef class Int64Value(ArrayValue): + + def as_py(self): + cdef CInt64Array* ap = self.sp_array.get() + return ap.Value(self.index) + + +cdef class UInt64Value(ArrayValue): + + def as_py(self): + cdef CUInt64Array* ap = self.sp_array.get() + return ap.Value(self.index) + + +cdef class Date32Value(ArrayValue): + + def as_py(self): + cdef CDate32Array* ap = self.sp_array.get() + + # Shift to seconds since epoch + return datetime.datetime.utcfromtimestamp( + int(ap.Value(self.index)) * 86400).date() + + +cdef class Date64Value(ArrayValue): + + def as_py(self): + cdef CDate64Array* ap = self.sp_array.get() + return datetime.datetime.utcfromtimestamp( + ap.Value(self.index) / 1000).date() + + +cdef class TimestampValue(ArrayValue): + + def as_py(self): + cdef: + CTimestampArray* ap = self.sp_array.get() + CTimestampType* dtype = ap.type().get() + int64_t val = ap.Value(self.index) + + timezone = None + tzinfo = None + if dtype.timezone().size() > 0: + timezone = frombytes(dtype.timezone()) + import pytz + tzinfo = pytz.timezone(timezone) + + try: + pd = _pandas() + if dtype.unit() == TimeUnit_SECOND: + val = val * 1000000000 + elif dtype.unit() == TimeUnit_MILLI: + val = val * 1000000 + elif dtype.unit() == TimeUnit_MICRO: + val = val * 1000 + return pd.Timestamp(val, tz=tzinfo) + except ImportError: + if dtype.unit() == TimeUnit_SECOND: + result = datetime.datetime.utcfromtimestamp(val) + elif dtype.unit() == TimeUnit_MILLI: + result = datetime.datetime.utcfromtimestamp(float(val) / 1000) + elif dtype.unit() == TimeUnit_MICRO: + result = datetime.datetime.utcfromtimestamp( + float(val) / 1000000) + else: + # TimeUnit_NANO + raise NotImplementedError("Cannot convert nanosecond " + "timestamps without pandas") + if timezone is not None: + result = result.replace(tzinfo=tzinfo) + return result + + +cdef class FloatValue(ArrayValue): + + def as_py(self): + cdef CFloatArray* ap = self.sp_array.get() + return ap.Value(self.index) + + +cdef class DoubleValue(ArrayValue): + + def as_py(self): + cdef CDoubleArray* ap = self.sp_array.get() + return ap.Value(self.index) + + +cdef class DecimalValue(ArrayValue): + + def as_py(self): + cdef: + CDecimalArray* ap = self.sp_array.get() + c_string s = ap.FormatValue(self.index) + return _pydecimal.Decimal(s.decode('utf8')) + + +cdef class StringValue(ArrayValue): + + def as_py(self): + cdef CStringArray* ap = self.sp_array.get() + return ap.GetString(self.index).decode('utf-8') + + +cdef class BinaryValue(ArrayValue): + + def 
as_py(self): + cdef: + const uint8_t* ptr + int32_t length + CBinaryArray* ap = self.sp_array.get() + + ptr = ap.GetValue(self.index, &length) + return cp.PyBytes_FromStringAndSize((ptr), length) + + +cdef class ListValue(ArrayValue): + + def __len__(self): + return self.ap.value_length(self.index) + + def __getitem__(self, i): + return self.getitem(i) + + def __iter__(self): + for i in range(len(self)): + yield self.getitem(i) + raise StopIteration + + cdef void _set_array(self, const shared_ptr[CArray]& sp_array): + self.sp_array = sp_array + self.ap = sp_array.get() + self.value_type = box_data_type(self.ap.value_type()) + + cdef getitem(self, int64_t i): + cdef int64_t j = self.ap.value_offset(self.index) + i + return box_scalar(self.value_type, self.ap.values(), j) + + def as_py(self): + cdef: + int64_t j + list result = [] + + for j in range(len(self)): + result.append(self.getitem(j).as_py()) + + return result + + +cdef class FixedSizeBinaryValue(ArrayValue): + + def as_py(self): + cdef: + CFixedSizeBinaryArray* ap + CFixedSizeBinaryType* ap_type + int32_t length + const char* data + ap = self.sp_array.get() + ap_type = ap.type().get() + length = ap_type.byte_width() + data = ap.GetValue(self.index) + return cp.PyBytes_FromStringAndSize(data, length) + + + +cdef dict _scalar_classes = { + Type_BOOL: BooleanValue, + Type_UINT8: Int8Value, + Type_UINT16: Int16Value, + Type_UINT32: Int32Value, + Type_UINT64: Int64Value, + Type_INT8: Int8Value, + Type_INT16: Int16Value, + Type_INT32: Int32Value, + Type_INT64: Int64Value, + Type_DATE32: Date32Value, + Type_DATE64: Date64Value, + Type_TIMESTAMP: TimestampValue, + Type_FLOAT: FloatValue, + Type_DOUBLE: DoubleValue, + Type_LIST: ListValue, + Type_BINARY: BinaryValue, + Type_STRING: StringValue, + Type_FIXED_SIZE_BINARY: FixedSizeBinaryValue, + Type_DECIMAL: DecimalValue, +} + +cdef object box_scalar(DataType type, const shared_ptr[CArray]& sp_array, + int64_t index): + cdef ArrayValue val + if type.type.id() == Type_NA: + return NA + elif sp_array.get().IsNull(index): + return NA + else: + val = _scalar_classes[type.type.id()]() + val.init(type, sp_array, index) + return val + + +cdef maybe_coerce_datetime64(values, dtype, DataType type, + timestamps_to_ms=False): + + from pyarrow.compat import DatetimeTZDtype + + if values.dtype.type != np.datetime64: + return values, type + + coerce_ms = timestamps_to_ms and values.dtype != 'datetime64[ms]' + + if coerce_ms: + values = values.astype('datetime64[ms]') + + if isinstance(dtype, DatetimeTZDtype): + tz = dtype.tz + unit = 'ms' if coerce_ms else dtype.unit + type = timestamp(unit, tz) + elif type is None: + # Trust the NumPy dtype + type = from_numpy_dtype(values.dtype) + + return values, type + + +cdef class Array: + + cdef init(self, const shared_ptr[CArray]& sp_array): + self.sp_array = sp_array + self.ap = sp_array.get() + self.type = box_data_type(self.sp_array.get().type()) + + @staticmethod + def from_numpy(obj, mask=None, DataType type=None, + timestamps_to_ms=False, + MemoryPool memory_pool=None): + """ + Convert pandas.Series to an Arrow Array. + + Parameters + ---------- + series : pandas.Series or numpy.ndarray + + mask : pandas.Series or numpy.ndarray, optional + boolean mask if the object is valid or null + + type : pyarrow.DataType + Explicit type to attempt to coerce to + + timestamps_to_ms : bool, optional + Convert datetime columns to ms resolution. This is needed for + compatibility with other functionality like Parquet I/O which + only supports milliseconds. 
+ + memory_pool: MemoryPool, optional + Specific memory pool to use to allocate the resulting Arrow array. + + Notes + ----- + Localized timestamps will currently be returned as UTC (pandas's native + representation). Timezone-naive data will be implicitly interpreted as + UTC. + + Examples + -------- + + >>> import pandas as pd + >>> import pyarrow as pa + >>> pa.Array.from_numpy(pd.Series([1, 2])) + + [ + 1, + 2 + ] + + >>> import numpy as np + >>> pa.Array.from_numpy(pd.Series([1, 2]), np.array([0, 1], + ... dtype=bool)) + + [ + 1, + NA + ] + + Returns + ------- + pyarrow.Array + """ + cdef: + shared_ptr[CArray] out + shared_ptr[CDataType] c_type + CMemoryPool* pool + + if mask is not None: + mask = get_series_values(mask) + + values = get_series_values(obj) + pool = maybe_unbox_memory_pool(memory_pool) + + if isinstance(values, Categorical): + return DictionaryArray.from_arrays( + values.codes, values.categories.values, + mask=mask, memory_pool=memory_pool) + elif values.dtype == object: + # Object dtype undergoes a different conversion path as more type + # inference may be needed + if type is not None: + c_type = type.sp_type + with nogil: + check_status(pyarrow.PandasObjectsToArrow( + pool, values, mask, c_type, &out)) + else: + values, type = maybe_coerce_datetime64( + values, obj.dtype, type, timestamps_to_ms=timestamps_to_ms) + + if type is None: + check_status(pyarrow.NumPyDtypeToArrow(values.dtype, &c_type)) + else: + c_type = type.sp_type + + with nogil: + check_status(pyarrow.PandasToArrow( + pool, values, mask, c_type, &out)) + + return box_array(out) + + @staticmethod + def from_list(object list_obj, DataType type=None, + MemoryPool memory_pool=None): + """ + Convert a Python list to an Arrow array + + Parameters + ---------- + list_obj : array_like + + Returns + ------- + pyarrow.Array + """ + cdef: + shared_ptr[CArray] sp_array + CMemoryPool* pool + + pool = maybe_unbox_memory_pool(memory_pool) + if type is None: + check_status(pyarrow.ConvertPySequence(list_obj, pool, &sp_array)) + else: + check_status( + pyarrow.ConvertPySequence( + list_obj, pool, &sp_array, type.sp_type + ) + ) + + return box_array(sp_array) + + property null_count: + + def __get__(self): + return self.sp_array.get().null_count() + + def __iter__(self): + for i in range(len(self)): + yield self.getitem(i) + + def __repr__(self): + from pyarrow.formatting import array_format + type_format = object.__repr__(self) + values = array_format(self, window=10) + return '{0}\n{1}'.format(type_format, values) + + def equals(Array self, Array other): + return self.ap.Equals(deref(other.ap)) + + def __len__(self): + if self.sp_array.get(): + return self.sp_array.get().length() + else: + return 0 + + def isnull(self): + raise NotImplementedError + + def __getitem__(self, key): + cdef: + Py_ssize_t n = len(self) + + if PySlice_Check(key): + start = key.start or 0 + while start < 0: + start += n + + stop = key.stop if key.stop is not None else n + while stop < 0: + stop += n + + step = key.step or 1 + if step != 1: + raise IndexError('only slices with step 1 supported') + else: + return self.slice(start, stop - start) + + while key < 0: + key += len(self) + + return self.getitem(key) + + cdef getitem(self, int64_t i): + return box_scalar(self.type, self.sp_array, i) + + def slice(self, offset=0, length=None): + """ + Compute zero-copy slice of this array + + Parameters + ---------- + offset : int, default 0 + Offset from start of array to slice + length : int, default None + Length of slice 
(default is until end of Array starting from + offset) + + Returns + ------- + sliced : Array + """ + cdef: + shared_ptr[CArray] result + + if offset < 0: + raise IndexError('Offset must be non-negative') + + if length is None: + result = self.ap.Slice(offset) + else: + result = self.ap.Slice(offset, length) + + return box_array(result) + + def to_pandas(self): + """ + Convert to an array object suitable for use in pandas + + See also + -------- + Column.to_pandas + Table.to_pandas + RecordBatch.to_pandas + """ + cdef: + PyObject* out + + with nogil: + check_status( + pyarrow.ConvertArrayToPandas(self.sp_array, self, + &out)) + return wrap_array_output(out) + + def to_pylist(self): + """ + Convert to a list of native Python objects. + """ + return [x.as_py() for x in self] + + +cdef class Tensor: + + cdef init(self, const shared_ptr[CTensor]& sp_tensor): + self.sp_tensor = sp_tensor + self.tp = sp_tensor.get() + self.type = box_data_type(self.tp.type()) + + def __repr__(self): + return """ +type: {0} +shape: {1} +strides: {2}""".format(self.type, self.shape, self.strides) + + @staticmethod + def from_numpy(obj): + cdef shared_ptr[CTensor] ctensor + check_status(pyarrow.NdarrayToTensor(default_memory_pool(), + obj, &ctensor)) + return box_tensor(ctensor) + + def to_numpy(self): + """ + Convert arrow::Tensor to numpy.ndarray with zero copy + """ + cdef: + PyObject* out + + check_status(pyarrow.TensorToNdarray(deref(self.tp), self, + &out)) + return PyObject_to_object(out) + + def equals(self, Tensor other): + """ + Return True if the tensors contain exactly equal data + """ + return self.tp.Equals(deref(other.tp)) + + property is_mutable: + + def __get__(self): + return self.tp.is_mutable() + + property is_contiguous: + + def __get__(self): + return self.tp.is_contiguous() + + property ndim: + + def __get__(self): + return self.tp.ndim() + + property size: + + def __get__(self): + return self.tp.size() + + property shape: + + def __get__(self): + cdef size_t i + py_shape = [] + for i in range(self.tp.shape().size()): + py_shape.append(self.tp.shape()[i]) + return py_shape + + property strides: + + def __get__(self): + cdef size_t i + py_strides = [] + for i in range(self.tp.strides().size()): + py_strides.append(self.tp.strides()[i]) + return py_strides + + + +cdef wrap_array_output(PyObject* output): + cdef object obj = PyObject_to_object(output) + + if isinstance(obj, dict): + return Categorical(obj['indices'], + categories=obj['dictionary'], + fastpath=True) + else: + return obj + + +cdef class NullArray(Array): + pass + + +cdef class BooleanArray(Array): + pass + + +cdef class NumericArray(Array): + pass + + +cdef class IntegerArray(NumericArray): + pass + + +cdef class FloatingPointArray(NumericArray): + pass + + +cdef class Int8Array(IntegerArray): + pass + + +cdef class UInt8Array(IntegerArray): + pass + + +cdef class Int16Array(IntegerArray): + pass + + +cdef class UInt16Array(IntegerArray): + pass + + +cdef class Int32Array(IntegerArray): + pass + + +cdef class UInt32Array(IntegerArray): + pass + + +cdef class Int64Array(IntegerArray): + pass + + +cdef class UInt64Array(IntegerArray): + pass + + +cdef class Date32Array(NumericArray): + pass + + +cdef class Date64Array(NumericArray): + pass + + +cdef class TimestampArray(NumericArray): + pass + + +cdef class Time32Array(NumericArray): + pass + + +cdef class Time64Array(NumericArray): + pass + + +cdef class FloatArray(FloatingPointArray): + pass + + +cdef class DoubleArray(FloatingPointArray): + pass + + +cdef class 
FixedSizeBinaryArray(Array): + pass + + +cdef class DecimalArray(FixedSizeBinaryArray): + pass + + +cdef class ListArray(Array): + pass + + +cdef class StringArray(Array): + pass + + +cdef class BinaryArray(Array): + pass + + +cdef class DictionaryArray(Array): + + cdef getitem(self, int64_t i): + cdef Array dictionary = self.dictionary + index = self.indices[i] + if index is NA: + return index + else: + return box_scalar(dictionary.type, dictionary.sp_array, + index.as_py()) + + property dictionary: + + def __get__(self): + cdef CDictionaryArray* darr = (self.ap) + + if self._dictionary is None: + self._dictionary = box_array(darr.dictionary()) + + return self._dictionary + + property indices: + + def __get__(self): + cdef CDictionaryArray* darr = (self.ap) + + if self._indices is None: + self._indices = box_array(darr.indices()) + + return self._indices + + @staticmethod + def from_arrays(indices, dictionary, mask=None, + MemoryPool memory_pool=None): + """ + Construct Arrow DictionaryArray from array of indices (must be + non-negative integers) and corresponding array of dictionary values + + Parameters + ---------- + indices : ndarray or pandas.Series, integer type + dictionary : ndarray or pandas.Series + mask : ndarray or pandas.Series, boolean type + True values indicate that indices are actually null + + Returns + ------- + dict_array : DictionaryArray + """ + cdef: + Array arrow_indices, arrow_dictionary + DictionaryArray result + shared_ptr[CDataType] c_type + shared_ptr[CArray] c_result + + if isinstance(indices, Array): + if mask is not None: + raise NotImplementedError( + "mask not implemented with Arrow array inputs yet") + arrow_indices = indices + else: + if mask is None: + mask = indices == -1 + else: + mask = mask | (indices == -1) + arrow_indices = Array.from_numpy(indices, mask=mask, + memory_pool=memory_pool) + + if isinstance(dictionary, Array): + arrow_dictionary = dictionary + else: + arrow_dictionary = Array.from_numpy(dictionary, + memory_pool=memory_pool) + + if not isinstance(arrow_indices, IntegerArray): + raise ValueError('Indices must be integer type') + + c_type.reset(new CDictionaryType(arrow_indices.type.sp_type, + arrow_dictionary.sp_array)) + c_result.reset(new CDictionaryArray(c_type, arrow_indices.sp_array)) + + result = DictionaryArray() + result.init(c_result) + return result + + +cdef dict _array_classes = { + Type_NA: NullArray, + Type_BOOL: BooleanArray, + Type_UINT8: UInt8Array, + Type_UINT16: UInt16Array, + Type_UINT32: UInt32Array, + Type_UINT64: UInt64Array, + Type_INT8: Int8Array, + Type_INT16: Int16Array, + Type_INT32: Int32Array, + Type_INT64: Int64Array, + Type_DATE32: Date32Array, + Type_DATE64: Date64Array, + Type_TIMESTAMP: TimestampArray, + Type_TIME32: Time32Array, + Type_TIME64: Time64Array, + Type_FLOAT: FloatArray, + Type_DOUBLE: DoubleArray, + Type_LIST: ListArray, + Type_BINARY: BinaryArray, + Type_STRING: StringArray, + Type_DICTIONARY: DictionaryArray, + Type_FIXED_SIZE_BINARY: FixedSizeBinaryArray, + Type_DECIMAL: DecimalArray, +} + +cdef object box_array(const shared_ptr[CArray]& sp_array): + if sp_array.get() == NULL: + raise ValueError('Array was NULL') + + cdef CDataType* data_type = sp_array.get().type().get() + + if data_type == NULL: + raise ValueError('Array data type was NULL') + + cdef Array arr = _array_classes[data_type.id()]() + arr.init(sp_array) + return arr + + +cdef object box_tensor(const shared_ptr[CTensor]& sp_tensor): + if sp_tensor.get() == NULL: + raise ValueError('Tensor was NULL') + + cdef Tensor 
tensor = Tensor() + tensor.init(sp_tensor) + return tensor + + +cdef object get_series_values(object obj): + if isinstance(obj, PandasSeries): + result = obj.values + elif isinstance(obj, np.ndarray): + result = obj + else: + result = PandasSeries(obj).values + + return result + + +from_pylist = Array.from_list diff --git a/python/pyarrow/config.pyx b/python/pyarrow/_config.pyx similarity index 100% rename from python/pyarrow/config.pyx rename to python/pyarrow/_config.pyx diff --git a/python/pyarrow/error.pxd b/python/pyarrow/_error.pxd similarity index 100% rename from python/pyarrow/error.pxd rename to python/pyarrow/_error.pxd diff --git a/python/pyarrow/error.pyx b/python/pyarrow/_error.pyx similarity index 100% rename from python/pyarrow/error.pyx rename to python/pyarrow/_error.pyx diff --git a/python/pyarrow/io.pxd b/python/pyarrow/_io.pxd similarity index 100% rename from python/pyarrow/io.pxd rename to python/pyarrow/_io.pxd diff --git a/python/pyarrow/io.pyx b/python/pyarrow/_io.pyx similarity index 99% rename from python/pyarrow/io.pyx rename to python/pyarrow/_io.pyx index 4eb0816ecbdea..9f067fb2166c6 100644 --- a/python/pyarrow/io.pyx +++ b/python/pyarrow/_io.pyx @@ -23,21 +23,18 @@ # cython: embedsignature = True from cython.operator cimport dereference as deref - from libc.stdlib cimport malloc, free - from pyarrow.includes.libarrow cimport * cimport pyarrow.includes.pyarrow as pyarrow +from pyarrow._array cimport Array, Tensor, box_tensor, Schema +from pyarrow._error cimport check_status +from pyarrow._memory cimport MemoryPool, maybe_unbox_memory_pool +from pyarrow._table cimport (Column, RecordBatch, batch_from_cbatch, + table_from_ctable) +cimport cpython as cp +import pyarrow._config from pyarrow.compat import frombytes, tobytes, encode_file_path -from pyarrow.array cimport Array, Tensor, box_tensor -from pyarrow.error cimport check_status -from pyarrow.memory cimport MemoryPool, maybe_unbox_memory_pool -from pyarrow.schema cimport Schema -from pyarrow.table cimport (Column, RecordBatch, batch_from_cbatch, - table_from_ctable) - -cimport cpython as cp import re import six diff --git a/python/pyarrow/jemalloc.pyx b/python/pyarrow/_jemalloc.pyx similarity index 96% rename from python/pyarrow/jemalloc.pyx rename to python/pyarrow/_jemalloc.pyx index 97583f4b0da95..3b41964a39cb6 100644 --- a/python/pyarrow/jemalloc.pyx +++ b/python/pyarrow/_jemalloc.pyx @@ -20,7 +20,7 @@ # cython: embedsignature = True from pyarrow.includes.libarrow_jemalloc cimport CJemallocMemoryPool -from pyarrow.memory cimport MemoryPool +from pyarrow._memory cimport MemoryPool def default_pool(): cdef MemoryPool pool = MemoryPool() diff --git a/python/pyarrow/memory.pxd b/python/pyarrow/_memory.pxd similarity index 100% rename from python/pyarrow/memory.pxd rename to python/pyarrow/_memory.pxd diff --git a/python/pyarrow/memory.pyx b/python/pyarrow/_memory.pyx similarity index 100% rename from python/pyarrow/memory.pyx rename to python/pyarrow/_memory.pyx diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx index 079bf5ee5924a..5418e1dc82730 100644 --- a/python/pyarrow/_parquet.pyx +++ b/python/pyarrow/_parquet.pyx @@ -20,20 +20,18 @@ # cython: embedsignature = True from cython.operator cimport dereference as deref - from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport * cimport pyarrow.includes.pyarrow as pyarrow +from pyarrow._array cimport Array +from pyarrow._error cimport check_status +from pyarrow._memory cimport MemoryPool, maybe_unbox_memory_pool 
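
The renames running through these hunks make every Cython extension module private (config becomes _config, io becomes _io, and so on), so user code is expected to go through the top-level pyarrow namespace rather than importing the Cython modules directly. A minimal sketch of the public-facing usage this enables, assuming frombuffer and BufferReader are re-exported by pyarrow/__init__.py (the test_io.py hunk later in this patch exercises exactly these names):

    import pyarrow as pa

    # BufferReader gives zero-copy, file-like reads over a bytes object
    data = b'some sample data'
    reader = pa.BufferReader(data)
    assert reader.size() == len(data)
    reader.seek(5)
    chunk = reader.read_buffer(6)   # a pa.Buffer view that keeps its parent alive
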
+from pyarrow._table cimport Table, table_from_ctable +from pyarrow._io cimport NativeFile, get_reader, get_writer -from pyarrow.array cimport Array from pyarrow.compat import tobytes, frombytes -from pyarrow.error import ArrowException -from pyarrow.error cimport check_status -from pyarrow.io import NativeFile -from pyarrow.memory cimport MemoryPool, maybe_unbox_memory_pool -from pyarrow.table cimport Table, table_from_ctable - -from pyarrow.io cimport NativeFile, get_reader, get_writer +from pyarrow._error import ArrowException +from pyarrow._io import NativeFile import six diff --git a/python/pyarrow/table.pxd b/python/pyarrow/_table.pxd similarity index 98% rename from python/pyarrow/table.pxd rename to python/pyarrow/_table.pxd index f564042b62d7b..e61e90d7462f9 100644 --- a/python/pyarrow/table.pxd +++ b/python/pyarrow/_table.pxd @@ -18,8 +18,7 @@ from pyarrow.includes.common cimport shared_ptr from pyarrow.includes.libarrow cimport (CChunkedArray, CColumn, CTable, CRecordBatch) - -from pyarrow.schema cimport Schema +from pyarrow._array cimport Schema cdef class ChunkedArray: diff --git a/python/pyarrow/table.pyx b/python/pyarrow/_table.pyx similarity index 98% rename from python/pyarrow/table.pyx rename to python/pyarrow/_table.pyx index 3972bda4ee425..6558b2ea463fa 100644 --- a/python/pyarrow/table.pyx +++ b/python/pyarrow/_table.pyx @@ -24,18 +24,16 @@ from cython.operator cimport dereference as deref from pyarrow.includes.libarrow cimport * from pyarrow.includes.common cimport * cimport pyarrow.includes.pyarrow as pyarrow +from pyarrow._array cimport (Array, box_array, wrap_array_output, + box_data_type, box_schema, DataType, Field) +from pyarrow._error cimport check_status +cimport cpython -import pyarrow.config - -from pyarrow.array cimport Array, box_array, wrap_array_output -from pyarrow.error import ArrowException -from pyarrow.error cimport check_status -from pyarrow.schema cimport box_data_type, box_schema, DataType, Field - -from pyarrow.schema import field +import pyarrow._config +from pyarrow._error import ArrowException +from pyarrow._array import field from pyarrow.compat import frombytes, tobytes -cimport cpython from collections import OrderedDict @@ -744,7 +742,7 @@ cdef class Table: pandas.DataFrame """ if nthreads is None: - nthreads = pyarrow.config.cpu_count() + nthreads = pyarrow._config.cpu_count() mgr = table_to_blockmanager(self.sp_table, nthreads) return _pandas().DataFrame(mgr) diff --git a/python/pyarrow/array.pyx b/python/pyarrow/array.pyx deleted file mode 100644 index 1c4253eebe46a..0000000000000 --- a/python/pyarrow/array.pyx +++ /dev/null @@ -1,646 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. 
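
The hunk beginning here deletes python/pyarrow/array.pyx outright; everything in it was folded into _array.pyx above. One detail worth restating from the moved code is the unit handling in TimestampValue.as_py: the stored int64 is rescaled to nanoseconds before a pandas Timestamp is constructed. A pure-Python sketch of that scaling (the function and dict names here are illustrative, not from the patch):

    import pandas as pd

    # Nanoseconds per Arrow time unit, mirroring the TimeUnit_* branches
    # in TimestampValue.as_py
    NANOS_PER_UNIT = {'s': 1000000000, 'ms': 1000000, 'us': 1000, 'ns': 1}

    def to_pandas_timestamp(raw_value, unit, tzinfo=None):
        # pd.Timestamp interprets a bare integer as nanoseconds since the epoch
        return pd.Timestamp(raw_value * NANOS_PER_UNIT[unit], tz=tzinfo)
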
- -# cython: profile=False -# distutils: language = c++ -# cython: embedsignature = True - -from cython.operator cimport dereference as deref - -import numpy as np - -from pyarrow.includes.libarrow cimport * -from pyarrow.includes.common cimport PyObject_to_object -cimport pyarrow.includes.pyarrow as pyarrow - -import pyarrow.config - -from pyarrow.compat import frombytes, tobytes, PandasSeries, Categorical -from pyarrow.error cimport check_status -from pyarrow.memory cimport MemoryPool, maybe_unbox_memory_pool - -cimport pyarrow.scalar as scalar -from pyarrow.scalar import NA - -from pyarrow.schema cimport (DataType, Field, Schema, DictionaryType, - FixedSizeBinaryType, - box_data_type) -import pyarrow.schema as schema - -cimport cpython - - -cdef maybe_coerce_datetime64(values, dtype, DataType type, - timestamps_to_ms=False): - - from pyarrow.compat import DatetimeTZDtype - - if values.dtype.type != np.datetime64: - return values, type - - coerce_ms = timestamps_to_ms and values.dtype != 'datetime64[ms]' - - if coerce_ms: - values = values.astype('datetime64[ms]') - - if isinstance(dtype, DatetimeTZDtype): - tz = dtype.tz - unit = 'ms' if coerce_ms else dtype.unit - type = schema.timestamp(unit, tz) - elif type is None: - # Trust the NumPy dtype - type = schema.type_from_numpy_dtype(values.dtype) - - return values, type - - -cdef class Array: - - cdef init(self, const shared_ptr[CArray]& sp_array): - self.sp_array = sp_array - self.ap = sp_array.get() - self.type = box_data_type(self.sp_array.get().type()) - - @staticmethod - def from_numpy(obj, mask=None, DataType type=None, - timestamps_to_ms=False, - MemoryPool memory_pool=None): - """ - Convert pandas.Series to an Arrow Array. - - Parameters - ---------- - series : pandas.Series or numpy.ndarray - - mask : pandas.Series or numpy.ndarray, optional - boolean mask if the object is valid or null - - type : pyarrow.DataType - Explicit type to attempt to coerce to - - timestamps_to_ms : bool, optional - Convert datetime columns to ms resolution. This is needed for - compatibility with other functionality like Parquet I/O which - only supports milliseconds. - - memory_pool: MemoryPool, optional - Specific memory pool to use to allocate the resulting Arrow array. - - Notes - ----- - Localized timestamps will currently be returned as UTC (pandas's native - representation). Timezone-naive data will be implicitly interpreted as - UTC. - - Examples - -------- - - >>> import pandas as pd - >>> import pyarrow as pa - >>> pa.Array.from_numpy(pd.Series([1, 2])) - - [ - 1, - 2 - ] - - >>> import numpy as np - >>> pa.Array.from_numpy(pd.Series([1, 2]), np.array([0, 1], - ... 
dtype=bool)) - - [ - 1, - NA - ] - - Returns - ------- - pyarrow.array.Array - """ - cdef: - shared_ptr[CArray] out - shared_ptr[CDataType] c_type - CMemoryPool* pool - - if mask is not None: - mask = get_series_values(mask) - - values = get_series_values(obj) - pool = maybe_unbox_memory_pool(memory_pool) - - if isinstance(values, Categorical): - return DictionaryArray.from_arrays( - values.codes, values.categories.values, - mask=mask, memory_pool=memory_pool) - elif values.dtype == object: - # Object dtype undergoes a different conversion path as more type - # inference may be needed - if type is not None: - c_type = type.sp_type - with nogil: - check_status(pyarrow.PandasObjectsToArrow( - pool, values, mask, c_type, &out)) - else: - values, type = maybe_coerce_datetime64( - values, obj.dtype, type, timestamps_to_ms=timestamps_to_ms) - - if type is None: - check_status(pyarrow.NumPyDtypeToArrow(values.dtype, &c_type)) - else: - c_type = type.sp_type - - with nogil: - check_status(pyarrow.PandasToArrow( - pool, values, mask, c_type, &out)) - - return box_array(out) - - @staticmethod - def from_list(object list_obj, DataType type=None, - MemoryPool memory_pool=None): - """ - Convert Python list to Arrow array - - Parameters - ---------- - list_obj : array_like - - Returns - ------- - pyarrow.array.Array - """ - cdef: - shared_ptr[CArray] sp_array - CMemoryPool* pool - - pool = maybe_unbox_memory_pool(memory_pool) - if type is None: - check_status(pyarrow.ConvertPySequence(list_obj, pool, &sp_array)) - else: - check_status( - pyarrow.ConvertPySequence( - list_obj, pool, &sp_array, type.sp_type - ) - ) - - return box_array(sp_array) - - property null_count: - - def __get__(self): - return self.sp_array.get().null_count() - - def __iter__(self): - for i in range(len(self)): - yield self.getitem(i) - raise StopIteration - - def __repr__(self): - from pyarrow.formatting import array_format - type_format = object.__repr__(self) - values = array_format(self, window=10) - return '{0}\n{1}'.format(type_format, values) - - def equals(Array self, Array other): - return self.ap.Equals(deref(other.ap)) - - def __len__(self): - if self.sp_array.get(): - return self.sp_array.get().length() - else: - return 0 - - def isnull(self): - raise NotImplemented - - def __getitem__(self, key): - cdef: - Py_ssize_t n = len(self) - - if PySlice_Check(key): - start = key.start or 0 - while start < 0: - start += n - - stop = key.stop if key.stop is not None else n - while stop < 0: - stop += n - - step = key.step or 1 - if step != 1: - raise IndexError('only slices with step 1 supported') - else: - return self.slice(start, stop - start) - - while key < 0: - key += len(self) - - return self.getitem(key) - - cdef getitem(self, int64_t i): - return scalar.box_scalar(self.type, self.sp_array, i) - - def slice(self, offset=0, length=None): - """ - Compute zero-copy slice of this array - - Parameters - ---------- - offset : int, default 0 - Offset from start of array to slice - length : int, default None - Length of slice (default is until end of Array starting from - offset) - - Returns - ------- - sliced : RecordBatch - """ - cdef: - shared_ptr[CArray] result - - if offset < 0: - raise IndexError('Offset must be non-negative') - - if length is None: - result = self.ap.Slice(offset) - else: - result = self.ap.Slice(offset, length) - - return box_array(result) - - def to_pandas(self): - """ - Convert to an array object suitable for use in pandas - - See also - -------- - Column.to_pandas - Table.to_pandas - 
RecordBatch.to_pandas - """ - cdef: - PyObject* out - - with nogil: - check_status( - pyarrow.ConvertArrayToPandas(self.sp_array, self, - &out)) - return wrap_array_output(out) - - def to_pylist(self): - """ - Convert to an list of native Python objects. - """ - return [x.as_py() for x in self] - - -cdef class Tensor: - - cdef init(self, const shared_ptr[CTensor]& sp_tensor): - self.sp_tensor = sp_tensor - self.tp = sp_tensor.get() - self.type = box_data_type(self.tp.type()) - - def __repr__(self): - return """ -type: {0} -shape: {1} -strides: {2}""".format(self.type, self.shape, self.strides) - - @staticmethod - def from_numpy(obj): - cdef shared_ptr[CTensor] ctensor - check_status(pyarrow.NdarrayToTensor(default_memory_pool(), - obj, &ctensor)) - return box_tensor(ctensor) - - def to_numpy(self): - """ - Convert arrow::Tensor to numpy.ndarray with zero copy - """ - cdef: - PyObject* out - - check_status(pyarrow.TensorToNdarray(deref(self.tp), self, - &out)) - return PyObject_to_object(out) - - def equals(self, Tensor other): - """ - Return true if the tensors contains exactly equal data - """ - return self.tp.Equals(deref(other.tp)) - - property is_mutable: - - def __get__(self): - return self.tp.is_mutable() - - property is_contiguous: - - def __get__(self): - return self.tp.is_contiguous() - - property ndim: - - def __get__(self): - return self.tp.ndim() - - property size: - - def __get__(self): - return self.tp.size() - - property shape: - - def __get__(self): - cdef size_t i - py_shape = [] - for i in range(self.tp.shape().size()): - py_shape.append(self.tp.shape()[i]) - return py_shape - - property strides: - - def __get__(self): - cdef size_t i - py_strides = [] - for i in range(self.tp.strides().size()): - py_strides.append(self.tp.strides()[i]) - return py_strides - - - -cdef wrap_array_output(PyObject* output): - cdef object obj = PyObject_to_object(output) - - if isinstance(obj, dict): - return Categorical(obj['indices'], - categories=obj['dictionary'], - fastpath=True) - else: - return obj - - -cdef class NullArray(Array): - pass - - -cdef class BooleanArray(Array): - pass - - -cdef class NumericArray(Array): - pass - - -cdef class IntegerArray(NumericArray): - pass - - -cdef class FloatingPointArray(NumericArray): - pass - - -cdef class Int8Array(IntegerArray): - pass - - -cdef class UInt8Array(IntegerArray): - pass - - -cdef class Int16Array(IntegerArray): - pass - - -cdef class UInt16Array(IntegerArray): - pass - - -cdef class Int32Array(IntegerArray): - pass - - -cdef class UInt32Array(IntegerArray): - pass - - -cdef class Int64Array(IntegerArray): - pass - - -cdef class UInt64Array(IntegerArray): - pass - - -cdef class Date32Array(NumericArray): - pass - - -cdef class Date64Array(NumericArray): - pass - - -cdef class TimestampArray(NumericArray): - pass - - -cdef class Time32Array(NumericArray): - pass - - -cdef class Time64Array(NumericArray): - pass - - -cdef class FloatArray(FloatingPointArray): - pass - - -cdef class DoubleArray(FloatingPointArray): - pass - - -cdef class FixedSizeBinaryArray(Array): - pass - - -cdef class DecimalArray(FixedSizeBinaryArray): - pass - - -cdef class ListArray(Array): - pass - - -cdef class StringArray(Array): - pass - - -cdef class BinaryArray(Array): - pass - - -cdef class DictionaryArray(Array): - - cdef getitem(self, int64_t i): - cdef Array dictionary = self.dictionary - index = self.indices[i] - if index is NA: - return index - else: - return scalar.box_scalar(dictionary.type, dictionary.sp_array, - index.as_py()) - - property 
dictionary: - - def __get__(self): - cdef CDictionaryArray* darr = (self.ap) - - if self._dictionary is None: - self._dictionary = box_array(darr.dictionary()) - - return self._dictionary - - property indices: - - def __get__(self): - cdef CDictionaryArray* darr = (self.ap) - - if self._indices is None: - self._indices = box_array(darr.indices()) - - return self._indices - - @staticmethod - def from_arrays(indices, dictionary, mask=None, - MemoryPool memory_pool=None): - """ - Construct Arrow DictionaryArray from array of indices (must be - non-negative integers) and corresponding array of dictionary values - - Parameters - ---------- - indices : ndarray or pandas.Series, integer type - dictionary : ndarray or pandas.Series - mask : ndarray or pandas.Series, boolean type - True values indicate that indices are actually null - - Returns - ------- - dict_array : DictionaryArray - """ - cdef: - Array arrow_indices, arrow_dictionary - DictionaryArray result - shared_ptr[CDataType] c_type - shared_ptr[CArray] c_result - - if isinstance(indices, Array): - if mask is not None: - raise NotImplementedError( - "mask not implemented with Arrow array inputs yet") - arrow_indices = indices - else: - if mask is None: - mask = indices == -1 - else: - mask = mask | (indices == -1) - arrow_indices = Array.from_numpy(indices, mask=mask, - memory_pool=memory_pool) - - if isinstance(dictionary, Array): - arrow_dictionary = dictionary - else: - arrow_dictionary = Array.from_numpy(dictionary, - memory_pool=memory_pool) - - if not isinstance(arrow_indices, IntegerArray): - raise ValueError('Indices must be integer type') - - c_type.reset(new CDictionaryType(arrow_indices.type.sp_type, - arrow_dictionary.sp_array)) - c_result.reset(new CDictionaryArray(c_type, arrow_indices.sp_array)) - - result = DictionaryArray() - result.init(c_result) - return result - - -cdef dict _array_classes = { - Type_NA: NullArray, - Type_BOOL: BooleanArray, - Type_UINT8: UInt8Array, - Type_UINT16: UInt16Array, - Type_UINT32: UInt32Array, - Type_UINT64: UInt64Array, - Type_INT8: Int8Array, - Type_INT16: Int16Array, - Type_INT32: Int32Array, - Type_INT64: Int64Array, - Type_DATE32: Date32Array, - Type_DATE64: Date64Array, - Type_TIMESTAMP: TimestampArray, - Type_TIME32: Time32Array, - Type_TIME64: Time64Array, - Type_FLOAT: FloatArray, - Type_DOUBLE: DoubleArray, - Type_LIST: ListArray, - Type_BINARY: BinaryArray, - Type_STRING: StringArray, - Type_DICTIONARY: DictionaryArray, - Type_FIXED_SIZE_BINARY: FixedSizeBinaryArray, - Type_DECIMAL: DecimalArray, -} - -cdef object box_array(const shared_ptr[CArray]& sp_array): - if sp_array.get() == NULL: - raise ValueError('Array was NULL') - - cdef CDataType* data_type = sp_array.get().type().get() - - if data_type == NULL: - raise ValueError('Array data type was NULL') - - cdef Array arr = _array_classes[data_type.id()]() - arr.init(sp_array) - return arr - - -cdef object box_tensor(const shared_ptr[CTensor]& sp_tensor): - if sp_tensor.get() == NULL: - raise ValueError('Tensor was NULL') - - cdef Tensor tensor = Tensor() - tensor.init(sp_tensor) - return tensor - - -cdef object get_series_values(object obj): - if isinstance(obj, PandasSeries): - result = obj.values - elif isinstance(obj, np.ndarray): - result = obj - else: - result = PandasSeries(obj).values - - return result - - -from_pylist = Array.from_list diff --git a/python/pyarrow/feather.py b/python/pyarrow/feather.py index 3b5716e36be0a..c7b118e60a46d 100644 --- a/python/pyarrow/feather.py +++ b/python/pyarrow/feather.py @@ -22,9 
+22,9 @@ import pandas as pd from pyarrow.compat import pdapi -from pyarrow.io import FeatherError # noqa -from pyarrow.table import Table -import pyarrow.io as ext +from pyarrow._io import FeatherError # noqa +from pyarrow._table import Table +import pyarrow._io as ext if LooseVersion(pd.__version__) < '0.17.0': diff --git a/python/pyarrow/filesystem.py b/python/pyarrow/filesystem.py index 269cf1c8ffa12..92dd91ce9de07 100644 --- a/python/pyarrow/filesystem.py +++ b/python/pyarrow/filesystem.py @@ -19,7 +19,7 @@ import os from pyarrow.util import implements -import pyarrow.io as io +import pyarrow._io as io class Filesystem(object): diff --git a/python/pyarrow/formatting.py b/python/pyarrow/formatting.py index 5fe0611f8450b..c3583448d6e17 100644 --- a/python/pyarrow/formatting.py +++ b/python/pyarrow/formatting.py @@ -17,7 +17,7 @@ # Pretty-printing and other formatting utilities for Arrow data structures -import pyarrow.scalar as scalar +import pyarrow._array as _array def array_format(arr, window=None): @@ -42,7 +42,7 @@ def array_format(arr, window=None): def value_format(x, indent_level=0): - if isinstance(x, scalar.ListValue): + if isinstance(x, _array.ListValue): contents = ',\n'.join(value_format(item) for item in x) return '[{0}]'.format(_indent(contents, 1).strip()) else: diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index ae2b45fbdb212..2444f3fd0683e 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -113,8 +113,9 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: shared_ptr[CDataType] index_type() shared_ptr[CArray] dictionary() - shared_ptr[CDataType] timestamp(TimeUnit unit) - shared_ptr[CDataType] timestamp(TimeUnit unit, const c_string& timezone) + shared_ptr[CDataType] ctimestamp" arrow::timestamp"(TimeUnit unit) + shared_ptr[CDataType] ctimestamp" arrow::timestamp"( + TimeUnit unit, const c_string& timezone) cdef cppclass CMemoryPool" arrow::MemoryPool": int64_t bytes_allocated() diff --git a/python/pyarrow/ipc.py b/python/pyarrow/ipc.py index 5a5616564324c..f96ead3b92346 100644 --- a/python/pyarrow/ipc.py +++ b/python/pyarrow/ipc.py @@ -17,10 +17,10 @@ # Arrow file and stream reader/writer classes, and other messaging tools -import pyarrow.io as io +import pyarrow._io as _io -class StreamReader(io._StreamReader): +class StreamReader(_io._StreamReader): """ Reader for the Arrow streaming binary format @@ -37,7 +37,7 @@ def __iter__(self): yield self.get_next_batch() -class StreamWriter(io._StreamWriter): +class StreamWriter(_io._StreamWriter): """ Writer for the Arrow streaming binary format @@ -52,7 +52,7 @@ def __init__(self, sink, schema): self._open(sink, schema) -class FileReader(io._FileReader): +class FileReader(_io._FileReader): """ Class for reading Arrow record batch data from the Arrow binary file format @@ -68,7 +68,7 @@ def __init__(self, source, footer_offset=None): self._open(source, footer_offset=footer_offset) -class FileWriter(io._FileWriter): +class FileWriter(_io._FileWriter): """ Writer to create the Arrow binary file format diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py index f81b6c24c691f..aaec43ab06027 100644 --- a/python/pyarrow/parquet.py +++ b/python/pyarrow/parquet.py @@ -23,8 +23,8 @@ from pyarrow._parquet import (ParquetReader, FileMetaData, # noqa RowGroupMetaData, Schema, ParquetWriter) import pyarrow._parquet as _parquet # noqa -import pyarrow.array as _array -import pyarrow.table as _table +import pyarrow._array as _array 
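
One subtlety in the includes/libarrow.pxd hunk above: once schema.pyx merges into _array.pyx, the Python-level timestamp() factory and the C++ arrow::timestamp() declaration would collide under the star cimport, so the declaration is bound to the distinct Cython identifier ctimestamp via a cname string. The pattern, shown as a trimmed illustrative excerpt (surrounding declarations elided):

    # Cython cname aliasing: the Cython-level name is `ctimestamp`, but calls
    # compile down to the C++ symbol arrow::timestamp
    cdef extern from "arrow/api.h" namespace "arrow" nogil:
        shared_ptr[CDataType] ctimestamp" arrow::timestamp"(TimeUnit unit)
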
+import pyarrow._table as _table # ---------------------------------------------------------------------- diff --git a/python/pyarrow/scalar.pxd b/python/pyarrow/scalar.pxd deleted file mode 100644 index 62a5664e57eb4..0000000000000 --- a/python/pyarrow/scalar.pxd +++ /dev/null @@ -1,72 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - -from pyarrow.includes.common cimport * -from pyarrow.includes.libarrow cimport * - -from pyarrow.schema cimport DataType - - -cdef class Scalar: - cdef readonly: - DataType type - - -cdef class NAType(Scalar): - pass - - -cdef class ArrayValue(Scalar): - cdef: - shared_ptr[CArray] sp_array - int64_t index - - cdef void init(self, DataType type, - const shared_ptr[CArray]& sp_array, int64_t index) - - cdef void _set_array(self, const shared_ptr[CArray]& sp_array) - - -cdef class Int8Value(ArrayValue): - pass - - -cdef class Int64Value(ArrayValue): - pass - - -cdef class ListValue(ArrayValue): - cdef readonly: - DataType value_type - - cdef: - CListArray* ap - - cdef getitem(self, int64_t i) - - -cdef class StringValue(ArrayValue): - pass - - -cdef class FixedSizeBinaryValue(ArrayValue): - pass - - -cdef object box_scalar(DataType type, - const shared_ptr[CArray]& sp_array, - int64_t index) diff --git a/python/pyarrow/scalar.pyx b/python/pyarrow/scalar.pyx deleted file mode 100644 index 2b6746a3cf815..0000000000000 --- a/python/pyarrow/scalar.pyx +++ /dev/null @@ -1,315 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. 
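
The scalar.pxd declarations removed just above back the value-boxing behavior that now lives in _array.pyx: box_scalar hands back the NA singleton for null slots and a typed ArrayValue wrapper otherwise. A short usage sketch, assuming NA and Array are re-exported from the top-level pyarrow package:

    import pyarrow as pa

    arr = pa.Array.from_list([1, None, 3])
    assert arr[1] is pa.NA        # null slots box to the NA singleton
    assert arr[0].as_py() == 1    # non-null slots box to typed value wrappers
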
- -from pyarrow.schema cimport DataType, box_data_type - -from pyarrow.compat import frombytes -import pyarrow.schema as schema -import decimal -import datetime - -cimport cpython as cp - -NA = None - - -cdef _pandas(): - import pandas as pd - return pd - - -cdef class NAType(Scalar): - - def __cinit__(self): - global NA - if NA is not None: - raise Exception('Cannot create multiple NAType instances') - - self.type = schema.null() - - def __repr__(self): - return 'NA' - - def as_py(self): - return None - -NA = NAType() - -cdef class ArrayValue(Scalar): - - cdef void init(self, DataType type, const shared_ptr[CArray]& sp_array, - int64_t index): - self.type = type - self.index = index - self._set_array(sp_array) - - cdef void _set_array(self, const shared_ptr[CArray]& sp_array): - self.sp_array = sp_array - - def __repr__(self): - if hasattr(self, 'as_py'): - return repr(self.as_py()) - else: - return super(Scalar, self).__repr__() - - -cdef class BooleanValue(ArrayValue): - - def as_py(self): - cdef CBooleanArray* ap = self.sp_array.get() - return ap.Value(self.index) - - -cdef class Int8Value(ArrayValue): - - def as_py(self): - cdef CInt8Array* ap = self.sp_array.get() - return ap.Value(self.index) - - -cdef class UInt8Value(ArrayValue): - - def as_py(self): - cdef CUInt8Array* ap = self.sp_array.get() - return ap.Value(self.index) - - -cdef class Int16Value(ArrayValue): - - def as_py(self): - cdef CInt16Array* ap = self.sp_array.get() - return ap.Value(self.index) - - -cdef class UInt16Value(ArrayValue): - - def as_py(self): - cdef CUInt16Array* ap = self.sp_array.get() - return ap.Value(self.index) - - -cdef class Int32Value(ArrayValue): - - def as_py(self): - cdef CInt32Array* ap = self.sp_array.get() - return ap.Value(self.index) - - -cdef class UInt32Value(ArrayValue): - - def as_py(self): - cdef CUInt32Array* ap = self.sp_array.get() - return ap.Value(self.index) - - -cdef class Int64Value(ArrayValue): - - def as_py(self): - cdef CInt64Array* ap = self.sp_array.get() - return ap.Value(self.index) - - -cdef class UInt64Value(ArrayValue): - - def as_py(self): - cdef CUInt64Array* ap = self.sp_array.get() - return ap.Value(self.index) - - -cdef class Date32Value(ArrayValue): - - def as_py(self): - cdef CDate32Array* ap = self.sp_array.get() - - # Shift to seconds since epoch - return datetime.datetime.utcfromtimestamp( - int(ap.Value(self.index)) * 86400).date() - - -cdef class Date64Value(ArrayValue): - - def as_py(self): - cdef CDate64Array* ap = self.sp_array.get() - return datetime.datetime.utcfromtimestamp( - ap.Value(self.index) / 1000).date() - - -cdef class TimestampValue(ArrayValue): - - def as_py(self): - cdef: - CTimestampArray* ap = self.sp_array.get() - CTimestampType* dtype = ap.type().get() - int64_t val = ap.Value(self.index) - - timezone = None - tzinfo = None - if dtype.timezone().size() > 0: - timezone = frombytes(dtype.timezone()) - import pytz - tzinfo = pytz.timezone(timezone) - - try: - pd = _pandas() - if dtype.unit() == TimeUnit_SECOND: - val = val * 1000000000 - elif dtype.unit() == TimeUnit_MILLI: - val = val * 1000000 - elif dtype.unit() == TimeUnit_MICRO: - val = val * 1000 - return pd.Timestamp(val, tz=tzinfo) - except ImportError: - if dtype.unit() == TimeUnit_SECOND: - result = datetime.datetime.utcfromtimestamp(val) - elif dtype.unit() == TimeUnit_MILLI: - result = datetime.datetime.utcfromtimestamp(float(val) / 1000) - elif dtype.unit() == TimeUnit_MICRO: - result = datetime.datetime.utcfromtimestamp( - float(val) / 1000000) - else: - # TimeUnit_NANO - 
raise NotImplementedError("Cannot convert nanosecond " - "timestamps without pandas") - if timezone is not None: - result = result.replace(tzinfo=tzinfo) - return result - - -cdef class FloatValue(ArrayValue): - - def as_py(self): - cdef CFloatArray* ap = self.sp_array.get() - return ap.Value(self.index) - - -cdef class DoubleValue(ArrayValue): - - def as_py(self): - cdef CDoubleArray* ap = self.sp_array.get() - return ap.Value(self.index) - - -cdef class DecimalValue(ArrayValue): - - def as_py(self): - cdef: - CDecimalArray* ap = self.sp_array.get() - c_string s = ap.FormatValue(self.index) - return decimal.Decimal(s.decode('utf8')) - - -cdef class StringValue(ArrayValue): - - def as_py(self): - cdef CStringArray* ap = self.sp_array.get() - return ap.GetString(self.index).decode('utf-8') - - -cdef class BinaryValue(ArrayValue): - - def as_py(self): - cdef: - const uint8_t* ptr - int32_t length - CBinaryArray* ap = self.sp_array.get() - - ptr = ap.GetValue(self.index, &length) - return cp.PyBytes_FromStringAndSize((ptr), length) - - -cdef class ListValue(ArrayValue): - - def __len__(self): - return self.ap.value_length(self.index) - - def __getitem__(self, i): - return self.getitem(i) - - def __iter__(self): - for i in range(len(self)): - yield self.getitem(i) - raise StopIteration - - cdef void _set_array(self, const shared_ptr[CArray]& sp_array): - self.sp_array = sp_array - self.ap = sp_array.get() - self.value_type = box_data_type(self.ap.value_type()) - - cdef getitem(self, int64_t i): - cdef int64_t j = self.ap.value_offset(self.index) + i - return box_scalar(self.value_type, self.ap.values(), j) - - def as_py(self): - cdef: - int64_t j - list result = [] - - for j in range(len(self)): - result.append(self.getitem(j).as_py()) - - return result - - -cdef class FixedSizeBinaryValue(ArrayValue): - - def as_py(self): - cdef: - CFixedSizeBinaryArray* ap - CFixedSizeBinaryType* ap_type - int32_t length - const char* data - ap = self.sp_array.get() - ap_type = ap.type().get() - length = ap_type.byte_width() - data = ap.GetValue(self.index) - return cp.PyBytes_FromStringAndSize(data, length) - - - -cdef dict _scalar_classes = { - Type_BOOL: BooleanValue, - Type_UINT8: Int8Value, - Type_UINT16: Int16Value, - Type_UINT32: Int32Value, - Type_UINT64: Int64Value, - Type_INT8: Int8Value, - Type_INT16: Int16Value, - Type_INT32: Int32Value, - Type_INT64: Int64Value, - Type_DATE32: Date32Value, - Type_DATE64: Date64Value, - Type_TIMESTAMP: TimestampValue, - Type_FLOAT: FloatValue, - Type_DOUBLE: DoubleValue, - Type_LIST: ListValue, - Type_BINARY: BinaryValue, - Type_STRING: StringValue, - Type_FIXED_SIZE_BINARY: FixedSizeBinaryValue, - Type_DECIMAL: DecimalValue, -} - -cdef object box_scalar(DataType type, const shared_ptr[CArray]& sp_array, - int64_t index): - cdef ArrayValue val - if type.type.id() == Type_NA: - return NA - elif sp_array.get().IsNull(index): - return NA - else: - val = _scalar_classes[type.type.id()]() - val.init(type, sp_array, index) - return val diff --git a/python/pyarrow/schema.pxd b/python/pyarrow/schema.pxd deleted file mode 100644 index eceedbad0ba0d..0000000000000 --- a/python/pyarrow/schema.pxd +++ /dev/null @@ -1,76 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. 
The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. - -from pyarrow.includes.common cimport * -from pyarrow.includes.libarrow cimport (CDataType, - CDictionaryType, - CTimestampType, - CFixedSizeBinaryType, - CDecimalType, - CField, CSchema) - -cdef class DataType: - cdef: - shared_ptr[CDataType] sp_type - CDataType* type - - cdef void init(self, const shared_ptr[CDataType]& type) - - -cdef class DictionaryType(DataType): - cdef: - const CDictionaryType* dict_type - - -cdef class TimestampType(DataType): - cdef: - const CTimestampType* ts_type - - -cdef class FixedSizeBinaryType(DataType): - cdef: - const CFixedSizeBinaryType* fixed_size_binary_type - - -cdef class DecimalType(FixedSizeBinaryType): - cdef: - const CDecimalType* decimal_type - - -cdef class Field: - cdef: - shared_ptr[CField] sp_field - CField* field - - cdef readonly: - DataType type - - cdef init(self, const shared_ptr[CField]& field) - - -cdef class Schema: - cdef: - shared_ptr[CSchema] sp_schema - CSchema* schema - - cdef init(self, const vector[shared_ptr[CField]]& fields) - cdef init_schema(self, const shared_ptr[CSchema]& schema) - - -cdef DataType box_data_type(const shared_ptr[CDataType]& type) -cdef Field box_field(const shared_ptr[CField]& field) -cdef Schema box_schema(const shared_ptr[CSchema]& schema) diff --git a/python/pyarrow/schema.pyx b/python/pyarrow/schema.pyx deleted file mode 100644 index 474980973959f..0000000000000 --- a/python/pyarrow/schema.pyx +++ /dev/null @@ -1,477 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. 
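
schema.pyx, whose deletion starts here, carried the user-facing type factories (int64(), timestamp(), field(), schema(), and friends) that the patch relocates into _array.pyx. A small sketch of that factory API, assuming the functions remain re-exported at the top level as before:

    import pyarrow as pa

    fields = [
        pa.field('id', pa.int64()),                      # nullable by default
        pa.field('when', pa.timestamp('ms', tz='UTC')),
    ]
    sch = pa.schema(fields)                              # Schema.from_fields under the hood
    assert sch.field_by_name('id').nullable
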
- -######################################## -# Data types, fields, schemas, and so forth - -# cython: profile=False -# distutils: language = c++ -# cython: embedsignature = True - -from cython.operator cimport dereference as deref - -from pyarrow.compat import frombytes, tobytes -from pyarrow.array cimport Array -from pyarrow.error cimport check_status -from pyarrow.includes.libarrow cimport (CDataType, CStructType, CListType, - CFixedSizeBinaryType, - CDecimalType, - TimeUnit_SECOND, TimeUnit_MILLI, - TimeUnit_MICRO, TimeUnit_NANO, - Type, TimeUnit) -cimport pyarrow.includes.pyarrow as pyarrow -cimport pyarrow.includes.libarrow as la - -cimport cpython - -import six - - -cdef class DataType: - - def __cinit__(self): - pass - - cdef void init(self, const shared_ptr[CDataType]& type): - self.sp_type = type - self.type = type.get() - - def __str__(self): - return frombytes(self.type.ToString()) - - def __repr__(self): - return '{0.__class__.__name__}({0})'.format(self) - - def __richcmp__(DataType self, DataType other, int op): - if op == cpython.Py_EQ: - return self.type.Equals(deref(other.type)) - elif op == cpython.Py_NE: - return not self.type.Equals(deref(other.type)) - else: - raise TypeError('Invalid comparison') - - -cdef class DictionaryType(DataType): - - cdef void init(self, const shared_ptr[CDataType]& type): - DataType.init(self, type) - self.dict_type = type.get() - - -cdef class TimestampType(DataType): - - cdef void init(self, const shared_ptr[CDataType]& type): - DataType.init(self, type) - self.ts_type = type.get() - - property unit: - - def __get__(self): - return timeunit_to_string(self.ts_type.unit()) - - property tz: - - def __get__(self): - if self.ts_type.timezone().size() > 0: - return frombytes(self.ts_type.timezone()) - else: - return None - - -cdef class FixedSizeBinaryType(DataType): - - cdef void init(self, const shared_ptr[CDataType]& type): - DataType.init(self, type) - self.fixed_size_binary_type = type.get() - - property byte_width: - - def __get__(self): - return self.fixed_size_binary_type.byte_width() - - -cdef class DecimalType(FixedSizeBinaryType): - - cdef void init(self, const shared_ptr[CDataType]& type): - DataType.init(self, type) - self.decimal_type = type.get() - - -cdef class Field: - - def __cinit__(self): - pass - - cdef init(self, const shared_ptr[CField]& field): - self.sp_field = field - self.field = field.get() - self.type = box_data_type(field.get().type()) - - @classmethod - def from_py(cls, object name, DataType type, bint nullable=True): - cdef Field result = Field() - result.type = type - result.sp_field.reset(new CField(tobytes(name), type.sp_type, - nullable)) - result.field = result.sp_field.get() - - return result - - def __repr__(self): - return 'Field({0!r}, type={1})'.format(self.name, str(self.type)) - - property nullable: - - def __get__(self): - return self.field.nullable() - - property name: - - def __get__(self): - if box_field(self.sp_field) is None: - raise ReferenceError( - 'Field not initialized (references NULL pointer)') - return frombytes(self.field.name()) - - -cdef class Schema: - - def __cinit__(self): - pass - - def __len__(self): - return self.schema.num_fields() - - def __getitem__(self, i): - if i < 0 or i >= len(self): - raise IndexError("{0} is out of bounds".format(i)) - - cdef Field result = Field() - result.init(self.schema.field(i)) - result.type = box_data_type(result.field.type()) - - return result - - cdef init(self, const vector[shared_ptr[CField]]& fields): - self.schema = new CSchema(fields) - 
self.sp_schema.reset(self.schema) - - cdef init_schema(self, const shared_ptr[CSchema]& schema): - self.schema = schema.get() - self.sp_schema = schema - - def equals(self, other): - """ - Test if this schema is equal to the other - """ - cdef Schema _other - _other = other - - return self.sp_schema.get().Equals(deref(_other.schema)) - - def field_by_name(self, name): - """ - Access a field by its name rather than the column index. - - Parameters - ---------- - name: str - - Returns - ------- - field: pyarrow.Field - """ - return box_field(self.schema.GetFieldByName(tobytes(name))) - - @classmethod - def from_fields(cls, fields): - cdef: - Schema result - Field field - vector[shared_ptr[CField]] c_fields - - c_fields.resize(len(fields)) - - for i in range(len(fields)): - field = fields[i] - c_fields[i] = field.sp_field - - result = Schema() - result.init(c_fields) - - return result - - def __str__(self): - return frombytes(self.schema.ToString()) - - def __repr__(self): - return self.__str__() - - -cdef dict _type_cache = {} - - -cdef DataType primitive_type(Type type): - if type in _type_cache: - return _type_cache[type] - - cdef DataType out = DataType() - out.init(pyarrow.GetPrimitiveType(type)) - - _type_cache[type] = out - return out - -#------------------------------------------------------------ -# Type factory functions - -def field(name, type, bint nullable=True): - return Field.from_py(name, type, nullable) - - -cdef set PRIMITIVE_TYPES = set([ - la.Type_NA, la.Type_BOOL, - la.Type_UINT8, la.Type_INT8, - la.Type_UINT16, la.Type_INT16, - la.Type_UINT32, la.Type_INT32, - la.Type_UINT64, la.Type_INT64, - la.Type_TIMESTAMP, la.Type_DATE32, - la.Type_DATE64, - la.Type_HALF_FLOAT, - la.Type_FLOAT, - la.Type_DOUBLE]) - - -def null(): - return primitive_type(la.Type_NA) - - -def bool_(): - return primitive_type(la.Type_BOOL) - - -def uint8(): - return primitive_type(la.Type_UINT8) - - -def int8(): - return primitive_type(la.Type_INT8) - - -def uint16(): - return primitive_type(la.Type_UINT16) - - -def int16(): - return primitive_type(la.Type_INT16) - - -def uint32(): - return primitive_type(la.Type_UINT32) - - -def int32(): - return primitive_type(la.Type_INT32) - - -def uint64(): - return primitive_type(la.Type_UINT64) - - -def int64(): - return primitive_type(la.Type_INT64) - - -cdef dict _timestamp_type_cache = {} - - -cdef timeunit_to_string(TimeUnit unit): - if unit == TimeUnit_SECOND: - return 's' - elif unit == TimeUnit_MILLI: - return 'ms' - elif unit == TimeUnit_MICRO: - return 'us' - elif unit == TimeUnit_NANO: - return 'ns' - - -def timestamp(unit_str, tz=None): - cdef: - TimeUnit unit - c_string c_timezone - - if unit_str == "s": - unit = TimeUnit_SECOND - elif unit_str == 'ms': - unit = TimeUnit_MILLI - elif unit_str == 'us': - unit = TimeUnit_MICRO - elif unit_str == 'ns': - unit = TimeUnit_NANO - else: - raise TypeError('Invalid TimeUnit string') - - cdef TimestampType out = TimestampType() - - if tz is None: - out.init(la.timestamp(unit)) - if unit in _timestamp_type_cache: - return _timestamp_type_cache[unit] - _timestamp_type_cache[unit] = out - else: - if not isinstance(tz, six.string_types): - tz = tz.zone - - c_timezone = tobytes(tz) - out.init(la.timestamp(unit, c_timezone)) - - return out - - -def date32(): - return primitive_type(la.Type_DATE32) - - -def date64(): - return primitive_type(la.Type_DATE64) - - -def float16(): - return primitive_type(la.Type_HALF_FLOAT) - - -def float32(): - return primitive_type(la.Type_FLOAT) - - -def float64(): - return 
primitive_type(la.Type_DOUBLE) - - -cpdef DataType decimal(int precision, int scale=0): - cdef shared_ptr[CDataType] decimal_type - decimal_type.reset(new CDecimalType(precision, scale)) - return box_data_type(decimal_type) - - -def string(): - """ - UTF8 string - """ - return primitive_type(la.Type_STRING) - - -def binary(int length=-1): - """Binary (PyBytes-like) type - - Parameters - ---------- - length : int, optional, default -1 - If length == -1 then return a variable length binary type. If length is - greater than or equal to 0 then return a fixed size binary type of - width `length`. - """ - if length == -1: - return primitive_type(la.Type_BINARY) - - cdef shared_ptr[CDataType] fixed_size_binary_type - fixed_size_binary_type.reset(new CFixedSizeBinaryType(length)) - return box_data_type(fixed_size_binary_type) - - -def list_(DataType value_type): - cdef DataType out = DataType() - cdef shared_ptr[CDataType] list_type - list_type.reset(new CListType(value_type.sp_type)) - out.init(list_type) - return out - - -def dictionary(DataType index_type, Array dictionary): - """ - Dictionary (categorical, or simply encoded) type - """ - cdef DictionaryType out = DictionaryType() - cdef shared_ptr[CDataType] dict_type - dict_type.reset(new CDictionaryType(index_type.sp_type, - dictionary.sp_array)) - out.init(dict_type) - return out - - -def struct(fields): - """ - - """ - cdef: - DataType out = DataType() - Field field - vector[shared_ptr[CField]] c_fields - cdef shared_ptr[CDataType] struct_type - - for field in fields: - c_fields.push_back(field.sp_field) - - struct_type.reset(new CStructType(c_fields)) - out.init(struct_type) - return out - - -def schema(fields): - return Schema.from_fields(fields) - - -cdef DataType box_data_type(const shared_ptr[CDataType]& type): - cdef: - DataType out - - if type.get() == NULL: - return None - - if type.get().id() == la.Type_DICTIONARY: - out = DictionaryType() - elif type.get().id() == la.Type_TIMESTAMP: - out = TimestampType() - elif type.get().id() == la.Type_FIXED_SIZE_BINARY: - out = FixedSizeBinaryType() - elif type.get().id() == la.Type_DECIMAL: - out = DecimalType() - else: - out = DataType() - - out.init(type) - return out - -cdef Field box_field(const shared_ptr[CField]& field): - if field.get() == NULL: - return None - cdef Field out = Field() - out.init(field) - return out - -cdef Schema box_schema(const shared_ptr[CSchema]& type): - cdef Schema out = Schema() - out.init_schema(type) - return out - - -def type_from_numpy_dtype(object dtype): - cdef shared_ptr[CDataType] c_type - with nogil: - check_status(pyarrow.NumPyDtypeToArrow(dtype, &c_type)) - - return box_data_type(c_type) diff --git a/python/pyarrow/tests/test_feather.py b/python/pyarrow/tests/test_feather.py index cba9464354a4e..6f8040fd483c9 100644 --- a/python/pyarrow/tests/test_feather.py +++ b/python/pyarrow/tests/test_feather.py @@ -25,7 +25,7 @@ from pyarrow.compat import guid from pyarrow.feather import (read_feather, write_feather, FeatherReader) -from pyarrow.io import FeatherWriter +from pyarrow._io import FeatherWriter def random_path(): diff --git a/python/pyarrow/tests/test_hdfs.py b/python/pyarrow/tests/test_hdfs.py index b8f7e25233421..d2a54790668d5 100644 --- a/python/pyarrow/tests/test_hdfs.py +++ b/python/pyarrow/tests/test_hdfs.py @@ -26,8 +26,6 @@ import pytest from pyarrow.compat import guid -from pyarrow.filesystem import HdfsClient -import pyarrow.io as io import pyarrow as pa import pyarrow.tests.test_parquet as test_parquet @@ -45,7 +43,7 @@ def 
hdfs_test_client(driver='libhdfs'): raise ValueError('Env variable ARROW_HDFS_TEST_PORT was not ' 'an integer') - return HdfsClient(host, port, user, driver=driver) + return pa.HdfsClient(host, port, user, driver=driver) @pytest.mark.hdfs @@ -190,7 +188,7 @@ class TestLibHdfs(HdfsTestCases, unittest.TestCase): @classmethod def check_driver(cls): - if not io.have_libhdfs(): + if not pa.have_libhdfs(): pytest.fail('No libhdfs available on system') def test_hdfs_orphaned_file(self): @@ -209,5 +207,5 @@ class TestLibHdfs3(HdfsTestCases, unittest.TestCase): @classmethod def check_driver(cls): - if not io.have_libhdfs3(): + if not pa.have_libhdfs3(): pytest.fail('No libhdfs3 available on system') diff --git a/python/pyarrow/tests/test_io.py b/python/pyarrow/tests/test_io.py index beb6113849ac3..c5d3708d6a9ac 100644 --- a/python/pyarrow/tests/test_io.py +++ b/python/pyarrow/tests/test_io.py @@ -24,7 +24,6 @@ from pyarrow.compat import u, guid import pyarrow as pa -import pyarrow.io as io # ---------------------------------------------------------------------- # Python file-like objects @@ -33,7 +32,7 @@ def test_python_file_write(): buf = BytesIO() - f = io.PythonFileInterface(buf) + f = pa.PythonFileInterface(buf) assert f.tell() == 0 @@ -57,7 +56,7 @@ def test_python_file_read(): data = b'some sample data' buf = BytesIO(data) - f = io.PythonFileInterface(buf, mode='r') + f = pa.PythonFileInterface(buf, mode='r') assert f.size() == len(data) @@ -82,7 +81,7 @@ def test_python_file_read(): def test_bytes_reader(): # Like a BytesIO, but zero-copy underneath for C++ consumers data = b'some sample data' - f = io.BufferReader(data) + f = pa.BufferReader(data) assert f.tell() == 0 assert f.size() == len(data) @@ -103,7 +102,7 @@ def test_bytes_reader(): def test_bytes_reader_non_bytes(): with pytest.raises(ValueError): - io.BufferReader(u('some sample data')) + pa.BufferReader(u('some sample data')) def test_bytes_reader_retains_parent_reference(): @@ -112,7 +111,7 @@ def test_bytes_reader_retains_parent_reference(): # ARROW-421 def get_buffer(): data = b'some sample data' * 1000 - reader = io.BufferReader(data) + reader = pa.BufferReader(data) reader.seek(5) return reader.read_buffer(6) @@ -129,7 +128,7 @@ def test_buffer_bytes(): val = b'some data' buf = pa.frombuffer(val) - assert isinstance(buf, io.Buffer) + assert isinstance(buf, pa.Buffer) result = buf.to_pybytes() @@ -140,7 +139,7 @@ def test_buffer_memoryview(): val = b'some data' buf = pa.frombuffer(val) - assert isinstance(buf, io.Buffer) + assert isinstance(buf, pa.Buffer) result = memoryview(buf) @@ -151,7 +150,7 @@ def test_buffer_bytearray(): val = bytearray(b'some data') buf = pa.frombuffer(val) - assert isinstance(buf, io.Buffer) + assert isinstance(buf, pa.Buffer) result = bytearray(buf) @@ -162,7 +161,7 @@ def test_buffer_memoryview_is_immutable(): val = b'some data' buf = pa.frombuffer(val) - assert isinstance(buf, io.Buffer) + assert isinstance(buf, pa.Buffer) result = memoryview(buf) @@ -180,7 +179,7 @@ def test_memory_output_stream(): # 10 bytes val = b'dataabcdef' - f = io.InMemoryOutputStream() + f = pa.InMemoryOutputStream() K = 1000 for i in range(K): @@ -193,7 +192,7 @@ def test_memory_output_stream(): def test_inmemory_write_after_closed(): - f = io.InMemoryOutputStream() + f = pa.InMemoryOutputStream() f.write(b'ok') f.get_result() @@ -213,7 +212,7 @@ def make_buffer(bytes_obj): def test_nativefile_write_memoryview(): - f = io.InMemoryOutputStream() + f = pa.InMemoryOutputStream() data = b'ok' arr = np.frombuffer(data, 
dtype='S1') @@ -289,7 +288,7 @@ def test_memory_map_retain_buffer_reference(sample_disk_data): def test_os_file_reader(sample_disk_data): - _check_native_file_reader(io.OSFile, sample_disk_data) + _check_native_file_reader(pa.OSFile, sample_disk_data) def _try_delete(path): @@ -354,10 +353,10 @@ def test_os_file_writer(): f.write(data) # Truncates file - f2 = io.OSFile(path, mode='w') + f2 = pa.OSFile(path, mode='w') f2.write('foo') - with io.OSFile(path) as f3: + with pa.OSFile(path) as f3: assert f3.size() == 3 with pytest.raises(IOError): diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index de1b1488c1475..a5c70aa16225f 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -24,7 +24,6 @@ from pyarrow.compat import guid, u from pyarrow.filesystem import LocalFilesystem import pyarrow as pa -import pyarrow.io as paio from .pandas_examples import dataframe_with_arrays, dataframe_with_lists import numpy as np @@ -180,10 +179,10 @@ def _test_dataframe(size=10000, seed=0): def test_pandas_parquet_native_file_roundtrip(tmpdir): df = _test_dataframe(10000) arrow_table = pa.Table.from_pandas(df) - imos = paio.InMemoryOutputStream() + imos = pa.InMemoryOutputStream() pq.write_table(arrow_table, imos, version="2.0") buf = imos.get_result() - reader = paio.BufferReader(buf) + reader = pa.BufferReader(buf) df_read = pq.read_table(reader).to_pandas() tm.assert_frame_equal(df, df_read) diff --git a/python/pyarrow/tests/test_schema.py b/python/pyarrow/tests/test_schema.py index 5588840cceb1f..53b6b68cfde3c 100644 --- a/python/pyarrow/tests/test_schema.py +++ b/python/pyarrow/tests/test_schema.py @@ -16,13 +16,9 @@ # under the License. import pytest - -import pyarrow as pa - import numpy as np -# XXX: pyarrow.schema.schema masks the module on imports -sch = pa._schema +import pyarrow as pa def test_type_integers(): @@ -62,7 +58,7 @@ def test_type_from_numpy_dtype_timestamps(): ] for dt, pt in cases: - result = sch.type_from_numpy_dtype(dt) + result = pa.from_numpy_dtype(dt) assert result == pt diff --git a/python/setup.py b/python/setup.py index 99bac15c779e6..3991856404bc8 100644 --- a/python/setup.py +++ b/python/setup.py @@ -99,16 +99,14 @@ def initialize_options(self): os.environ.get('PYARROW_BUNDLE_ARROW_CPP', '0')) CYTHON_MODULE_NAMES = [ - 'array', - 'config', - 'error', - 'io', - 'jemalloc', - 'memory', + '_array', + '_config', + '_error', + '_io', + '_jemalloc', + '_memory', '_parquet', - 'scalar', - 'schema', - 'table'] + '_table'] def _run_cmake(self): # The directory containing this setup.py @@ -261,7 +259,7 @@ def move_lib(lib_name): def _failure_permitted(self, name): if name == '_parquet' and not self.with_parquet: return True - if name == 'jemalloc' and not self.with_jemalloc: + if name == '_jemalloc' and not self.with_jemalloc: return True return False From 19da86ab96fa839786eef768ff4521f46acaa3a4 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 13 Apr 2017 11:16:38 -0400 Subject: [PATCH 0509/1644] ARROW-817: [Python] Fix comment in date32 conversion Author: Wes McKinney Closes #536 from wesm/ARROW-817 and squashes the following commits: 3982948 [Wes McKinney] Fix comment --- cpp/src/arrow/python/pandas_convert.cc | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/cpp/src/arrow/python/pandas_convert.cc b/cpp/src/arrow/python/pandas_convert.cc index 643c5fb8796a0..b33aea4565817 100644 --- a/cpp/src/arrow/python/pandas_convert.cc +++ b/cpp/src/arrow/python/pandas_convert.cc @@ -1504,8 
+1504,7 @@ class DatetimeBlock : public PandasBlock { const ChunkedArray& data = *col.get()->data(); if (type == Type::DATE32) { - // Date64Type is millisecond timestamp stored as int64_t - // TODO(wesm): Do we want to make sure to zero out the milliseconds? + // Convert from days since epoch to datetime64[ns] ConvertDatetimeNanos(data, out_buffer); } else if (type == Type::DATE64) { // Date64Type is millisecond timestamp stored as int64_t From 874666a61c4c7bf9f1242d8bb05274b7d1bbe2bd Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 13 Apr 2017 14:01:56 -0400 Subject: [PATCH 0510/1644] ARROW-816: [C++] Travis CI script cleanup, add C++ toolchain env with Flatbuffers, RapidJSON Author: Wes McKinney Closes #537 from wesm/ARROW-816 and squashes the following commits: 16992b6 [Wes McKinney] Disable Travis CI cache on OS X. brew install ccache 4621d2d [Wes McKinney] Fix variable name dc86821 [Wes McKinney] Fixes for integration tests Travis script 5e2c226 [Wes McKinney] Change file mode ed4be57 [Wes McKinney] Travis CI script cleanup, add C++ toolchain env with flatbuffers, rapidjson --- .travis.yml | 1 + ci/travis_before_script_c_glib.sh | 4 ++-- ci/travis_before_script_cpp.sh | 19 ++++++++++--------- ci/travis_env_common.sh | 31 +++++++++++++++++++++++++++++++ ci/travis_install_conda.sh | 3 +-- ci/travis_script_integration.sh | 15 ++++----------- ci/travis_script_python.sh | 14 ++++---------- 7 files changed, 53 insertions(+), 34 deletions(-) create mode 100755 ci/travis_env_common.sh diff --git a/.travis.yml b/.travis.yml index 4a49c717bf75d..824f62bccaab9 100644 --- a/.travis.yml +++ b/.travis.yml @@ -48,6 +48,7 @@ matrix: - compiler: clang osx_image: xcode6.4 os: osx + cache: addons: before_script: - $TRAVIS_BUILD_DIR/ci/travis_before_script_cpp.sh diff --git a/ci/travis_before_script_c_glib.sh b/ci/travis_before_script_c_glib.sh index 1a828e7659bd9..74bdd94b96a8b 100755 --- a/ci/travis_before_script_c_glib.sh +++ b/ci/travis_before_script_c_glib.sh @@ -15,14 +15,14 @@ set -ex +source $TRAVIS_BUILD_DIR/ci/travis_env_common.sh + if [ $TRAVIS_OS_NAME == "osx" ]; then brew install gtk-doc autoconf-archive gobject-introspection fi gem install gobject-introspection -ARROW_C_GLIB_DIR=$TRAVIS_BUILD_DIR/c_glib - pushd $ARROW_C_GLIB_DIR : ${ARROW_C_GLIB_INSTALL=$TRAVIS_BUILD_DIR/c-glib-install} diff --git a/ci/travis_before_script_cpp.sh b/ci/travis_before_script_cpp.sh index f804a38e76484..3f9f67c359289 100755 --- a/ci/travis_before_script_cpp.sh +++ b/ci/travis_before_script_cpp.sh @@ -15,19 +15,20 @@ set -ex -: ${CPP_BUILD_DIR=$TRAVIS_BUILD_DIR/cpp-build} +source $TRAVIS_BUILD_DIR/ci/travis_env_common.sh +source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh + +# Set up C++ toolchain from conda-forge packages for faster builds +conda create -y -q -p $CPP_TOOLCHAIN python=2.7 flatbuffers rapidjson if [ $TRAVIS_OS_NAME == "osx" ]; then brew update > /dev/null brew install jemalloc + brew install ccache fi -mkdir $CPP_BUILD_DIR -pushd $CPP_BUILD_DIR - -CPP_DIR=$TRAVIS_BUILD_DIR/cpp - -: ${ARROW_CPP_INSTALL=$TRAVIS_BUILD_DIR/cpp-install} +mkdir $ARROW_CPP_BUILD_DIR +pushd $ARROW_CPP_BUILD_DIR CMAKE_COMMON_FLAGS="\ -DARROW_BUILD_BENCHMARKS=ON \ @@ -37,11 +38,11 @@ if [ $TRAVIS_OS_NAME == "linux" ]; then cmake -DARROW_TEST_MEMCHECK=on \ $CMAKE_COMMON_FLAGS \ -DARROW_CXXFLAGS="-Wconversion -Werror" \ - $CPP_DIR + $ARROW_CPP_DIR else cmake $CMAKE_COMMON_FLAGS \ -DARROW_CXXFLAGS=-Werror \ - $CPP_DIR + $ARROW_CPP_DIR fi make -j4 diff --git a/ci/travis_env_common.sh b/ci/travis_env_common.sh new file 
mode 100755 index 0000000000000..5593f0079f411 --- /dev/null +++ b/ci/travis_env_common.sh @@ -0,0 +1,31 @@ +#!/usr/bin/env bash + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + +export MINICONDA=$HOME/miniconda +export PATH="$MINICONDA/bin:$PATH" +export CONDA_PKGS_DIRS=$HOME/.conda_packages + +# C++ toolchain +export CPP_TOOLCHAIN=$TRAVIS_BUILD_DIR/cpp-toolchain +export FLATBUFFERS_HOME=$CPP_TOOLCHAIN +export RAPIDJSON_HOME=$CPP_TOOLCHAIN + +export ARROW_CPP_DIR=$TRAVIS_BUILD_DIR/cpp +export ARROW_PYTHON_DIR=$TRAVIS_BUILD_DIR/python +export ARROW_C_GLIB_DIR=$TRAVIS_BUILD_DIR/c_glib +export ARROW_JAVA_DIR=${TRAVIS_BUILD_DIR}/java +export ARROW_INTEGRATION_DIR=$TRAVIS_BUILD_DIR/integration + +export ARROW_CPP_INSTALL=$TRAVIS_BUILD_DIR/cpp-install +export ARROW_CPP_BUILD_DIR=$TRAVIS_BUILD_DIR/cpp-build diff --git a/ci/travis_install_conda.sh b/ci/travis_install_conda.sh index c036e925427a9..7d185ee82275b 100644 --- a/ci/travis_install_conda.sh +++ b/ci/travis_install_conda.sh @@ -22,8 +22,7 @@ fi wget -O miniconda.sh $MINICONDA_URL -export MINICONDA=$HOME/miniconda -export CONDA_PKGS_DIRS=$HOME/.conda_packages +source $TRAVIS_BUILD_DIR/ci/travis_env_common.sh mkdir -p $CONDA_PKGS_DIRS bash miniconda.sh -b -p $MINICONDA diff --git a/ci/travis_script_integration.sh b/ci/travis_script_integration.sh index 8ddd89b1639b0..56f5ab7d2d35e 100755 --- a/ci/travis_script_integration.sh +++ b/ci/travis_script_integration.sh @@ -14,23 +14,16 @@ set -e -: ${CPP_BUILD_DIR=$TRAVIS_BUILD_DIR/cpp-build} +source $TRAVIS_BUILD_DIR/ci/travis_env_common.sh -JAVA_DIR=${TRAVIS_BUILD_DIR}/java - -pushd $JAVA_DIR +pushd $ARROW_JAVA_DIR mvn package popd -pushd $TRAVIS_BUILD_DIR/integration - -export ARROW_CPP_EXE_PATH=$CPP_BUILD_DIR/debug - -source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh -export MINICONDA=$HOME/miniconda -export PATH="$MINICONDA/bin:$PATH" +pushd $ARROW_INTEGRATION_DIR +export ARROW_CPP_EXE_PATH=$ARROW_CPP_BUILD_DIR/debug CONDA_ENV_NAME=arrow-integration-test conda create -y -q -n $CONDA_ENV_NAME python=3.5 diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index 549fe1141cfb1..bde1fd7e249ec 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -14,17 +14,11 @@ set -e -source $TRAVIS_BUILD_DIR/ci/travis_install_conda.sh - -PYTHON_DIR=$TRAVIS_BUILD_DIR/python - -# Re-use conda installation from C++ -export MINICONDA=$HOME/miniconda -export PATH="$MINICONDA/bin:$PATH" +source $TRAVIS_BUILD_DIR/ci/travis_env_common.sh export ARROW_HOME=$ARROW_CPP_INSTALL -pushd $PYTHON_DIR +pushd $ARROW_PYTHON_DIR export PARQUET_HOME=$TRAVIS_BUILD_DIR/parquet-env build_parquet_cpp() { @@ -101,10 +95,10 @@ python_version_tests() { which python # faster builds, please - conda install -y nomkl + conda install -y -q nomkl # Expensive dependencies install from Continuum package repo - conda install -y pip numpy pandas cython + conda install -y -q pip numpy pandas cython # Build C++ libraries build_arrow_libraries arrow-build-$PYTHON_VERSION $ARROW_HOME 
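With the common environment file in place, every CI stage can start from the same preamble. A minimal sketch of what a new stage script looks like under the conventions this patch establishes (travis_env_common.sh and the ARROW_* variables are real and shown above; the ctest stage body is an illustrative assumption, not part of the patch):

    #!/usr/bin/env bash
    # Sketch of a CI stage built on the shared environment file added above.
    # travis_env_common.sh and ARROW_CPP_BUILD_DIR come from this patch; the
    # test invocation itself is an assumed example, not part of the patch.
    set -e

    source $TRAVIS_BUILD_DIR/ci/travis_env_common.sh

    pushd $ARROW_CPP_BUILD_DIR
    ctest -VV
    popd
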
From b4892fd9fb676a678a966da51407b3ce4ba3ec65 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 14 Apr 2017 12:15:57 -0400 Subject: [PATCH 0511/1644] ARROW-528: [Python] Utilize improved Parquet writer C++ API, add write_metadata function, test _metadata files Author: Wes McKinney Closes #539 from wesm/ARROW-528 and squashes the following commits: 848ff93 [Wes McKinney] Add test for _metadata file 8b8f333 [Wes McKinney] Refactor to use APIs introduced in PARQUET-953. Add write_metadata function --- python/pyarrow/_parquet.pxd | 16 ++++++--- python/pyarrow/_parquet.pyx | 52 ++++++++++++++++------------ python/pyarrow/parquet.py | 34 +++++++++++++++--- python/pyarrow/tests/test_parquet.py | 24 +++++++++++++ 4 files changed, 94 insertions(+), 32 deletions(-) diff --git a/python/pyarrow/_parquet.pxd b/python/pyarrow/_parquet.pxd index 1ac1f69b033ce..9f6edc0b31dc6 100644 --- a/python/pyarrow/_parquet.pxd +++ b/python/pyarrow/_parquet.pxd @@ -235,8 +235,14 @@ cdef extern from "parquet/arrow/schema.h" namespace "parquet::arrow" nogil: cdef extern from "parquet/arrow/writer.h" namespace "parquet::arrow" nogil: - cdef CStatus WriteTable( - const CTable& table, CMemoryPool* pool, - const shared_ptr[OutputStream]& sink, - int64_t chunk_size, - const shared_ptr[WriterProperties]& properties) + cdef cppclass FileWriter: + + @staticmethod + CStatus Open(const CSchema& schema, CMemoryPool* pool, + const shared_ptr[OutputStream]& sink, + const shared_ptr[WriterProperties]& properties, + unique_ptr[FileWriter]* writer) + + CStatus WriteTable(const CTable& table, int64_t chunk_size) + CStatus NewRowGroup(int64_t chunk_size) + CStatus Close() diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx index 5418e1dc82730..b7358a6a47386 100644 --- a/python/pyarrow/_parquet.pyx +++ b/python/pyarrow/_parquet.pyx @@ -23,7 +23,7 @@ from cython.operator cimport dereference as deref from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport * cimport pyarrow.includes.pyarrow as pyarrow -from pyarrow._array cimport Array +from pyarrow._array cimport Array, Schema from pyarrow._error cimport check_status from pyarrow._memory cimport MemoryPool, maybe_unbox_memory_pool from pyarrow._table cimport Table, table_from_ctable @@ -108,7 +108,7 @@ cdef class FileMetaData: if self._schema is not None: return self._schema - cdef Schema schema = Schema() + cdef ParquetSchema schema = ParquetSchema() schema.init_from_filemeta(self) self._schema = schema return schema @@ -160,7 +160,7 @@ cdef class FileMetaData: return result -cdef class Schema: +cdef class ParquetSchema: cdef: object parent # the FileMetaData owning the SchemaDescriptor const SchemaDescriptor* schema @@ -194,7 +194,7 @@ cdef class Schema: def __getitem__(self, i): return self.column(i) - def equals(self, Schema other): + def equals(self, ParquetSchema other): """ Returns True if the Parquet schemas are equal """ @@ -217,7 +217,7 @@ cdef class ColumnSchema: def __cinit__(self): self.descr = NULL - cdef init_from_schema(self, Schema schema, int i): + cdef init_from_schema(self, ParquetSchema schema, int i): self.parent = schema self.descr = schema.schema.Column(i) @@ -373,7 +373,8 @@ cdef class ParquetReader: if self._metadata is not None: return self._metadata - metadata = self.reader.get().parquet_reader().metadata() + with nogil: + metadata = self.reader.get().parquet_reader().metadata() self._metadata = result = FileMetaData() result.init(metadata) @@ -487,9 +488,7 @@ cdef ParquetCompression compression_from_name(object 
name): cdef class ParquetWriter: cdef: - shared_ptr[WriterProperties] properties - shared_ptr[OutputStream] sink - CMemoryPool* allocator + unique_ptr[FileWriter] writer cdef readonly: object use_dictionary @@ -497,28 +496,34 @@ cdef class ParquetWriter: object version int row_group_size - def __cinit__(self, where, use_dictionary=None, compression=None, - version=None, MemoryPool memory_pool=None): - cdef shared_ptr[FileOutputStream] filestream + def __cinit__(self, where, Schema schema, use_dictionary=None, + compression=None, version=None, + MemoryPool memory_pool=None): + cdef: + shared_ptr[FileOutputStream] filestream + shared_ptr[OutputStream] sink + shared_ptr[WriterProperties] properties if isinstance(where, six.string_types): check_status(FileOutputStream.Open(tobytes(where), &filestream)) - self.sink = filestream + sink = filestream else: - get_writer(where, &self.sink) - self.allocator = maybe_unbox_memory_pool(memory_pool) + get_writer(where, &sink) self.use_dictionary = use_dictionary self.compression = compression self.version = version - self._setup_properties() - cdef _setup_properties(self): cdef WriterProperties.Builder properties_builder self._set_version(&properties_builder) self._set_compression_props(&properties_builder) self._set_dictionary_props(&properties_builder) - self.properties = properties_builder.build() + properties = properties_builder.build() + + check_status( + FileWriter.Open(deref(schema.schema), + maybe_unbox_memory_pool(memory_pool), + sink, properties, &self.writer)) cdef _set_version(self, WriterProperties.Builder* props): if self.version is not None: @@ -546,12 +551,16 @@ cdef class ParquetWriter: props.enable_dictionary() else: props.disable_dictionary() - else: + elif self.use_dictionary is not None: # Deactivate dictionary encoding by default props.disable_dictionary() for column in self.use_dictionary: props.enable_dictionary(column) + def close(self): + with nogil: + check_status(self.writer.get().Close()) + def write_table(self, Table table, row_group_size=None): cdef CTable* ctable = table.table @@ -563,6 +572,5 @@ cdef class ParquetWriter: cdef int c_row_group_size = row_group_size with nogil: - check_status(WriteTable(deref(ctable), self.allocator, - self.sink, c_row_group_size, - self.properties)) + check_status(self.writer.get() + .WriteTable(deref(ctable), c_row_group_size)) diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py index aaec43ab06027..4ff7e038b5e6c 100644 --- a/python/pyarrow/parquet.py +++ b/python/pyarrow/parquet.py @@ -21,7 +21,8 @@ from pyarrow.filesystem import LocalFilesystem from pyarrow._parquet import (ParquetReader, FileMetaData, # noqa - RowGroupMetaData, Schema, ParquetWriter) + RowGroupMetaData, ParquetSchema, + ParquetWriter) import pyarrow._parquet as _parquet # noqa import pyarrow._array as _array import pyarrow._table as _table @@ -471,7 +472,8 @@ def __init__(self, path_or_paths, filesystem=None, schema=None, else: self.fs = filesystem - self.pieces, self.partitions = _make_manifest(path_or_paths, self.fs) + (self.pieces, self.partitions, + self.metadata_path) = _make_manifest(path_or_paths, self.fs) self.metadata = metadata self.schema = schema @@ -488,7 +490,10 @@ def validate_schemas(self): open_file = self._get_open_file_func() if self.metadata is None and self.schema is None: - self.schema = self.pieces[0].get_metadata(open_file).schema + if self.metadata_path is not None: + self.schema = open_file(self.metadata_path).schema + else: + self.schema = 
self.pieces[0].get_metadata(open_file).schema
         elif self.schema is None:
             self.schema = self.metadata.schema
@@ -543,10 +548,12 @@ def open_file(path, meta=None):
 
 def _make_manifest(path_or_paths, fs, pathsep='/'):
     partitions = None
+    metadata_path = None
 
     if is_string(path_or_paths) and fs.isdir(path_or_paths):
         manifest = ParquetManifest(path_or_paths, filesystem=fs,
                                    pathsep=pathsep)
+        metadata_path = manifest.metadata_path
         pieces = manifest.pieces
         partitions = manifest.partitions
     else:
@@ -565,7 +572,7 @@ def _make_manifest(path_or_paths, fs, pathsep='/'):
             piece = ParquetDatasetPiece(path)
             pieces.append(piece)
 
-    return pieces, partitions
+    return pieces, partitions, metadata_path
 
 
 def read_table(source, columns=None, nthreads=1, metadata=None):
@@ -622,7 +629,24 @@ def write_table(table, where, row_group_size=None, version='1.0',
 
         Specify the compression codec, either on a general basis or per-column.
     """
     row_group_size = kwargs.get('chunk_size', row_group_size)
-    writer = ParquetWriter(where, use_dictionary=use_dictionary,
+    writer = ParquetWriter(where, table.schema,
+                           use_dictionary=use_dictionary,
                            compression=compression, version=version)
     writer.write_table(table, row_group_size=row_group_size)
+    writer.close()
+
+
+def write_metadata(schema, where, version='1.0'):
+    """
+    Write metadata-only Parquet file from schema
+
+    Parameters
+    ----------
+    schema : pyarrow.Schema
+    where: string or pyarrow.io.NativeFile
+    version : {"1.0", "2.0"}, default "1.0"
+        The Parquet format version, defaults to 1.0
+    """
+    writer = ParquetWriter(where, schema, version=version)
+    writer.close()
diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py
index a5c70aa16225f..ca6ae2d0b3be0 100644
--- a/python/pyarrow/tests/test_parquet.py
+++ b/python/pyarrow/tests/test_parquet.py
@@ -529,6 +529,30 @@ def _visit_level(base_dir, level, part_keys):
 
     _visit_level(base_dir, 0, [])
 
 
+@parquet
+def test_read_common_metadata_files(tmpdir):
+    N = 100
+    df = pd.DataFrame({
+        'index': np.arange(N),
+        'values': np.random.randn(N)
+    }, columns=['index', 'values'])
+
+    base_path = str(tmpdir)
+    data_path = pjoin(base_path, 'data.parquet')
+
+    table = pa.Table.from_pandas(df)
+    pq.write_table(table, data_path)
+
+    metadata_path = pjoin(base_path, '_metadata')
+    pq.write_metadata(table.schema, metadata_path)
+
+    dataset = pq.ParquetDataset(base_path)
+    assert dataset.metadata_path == metadata_path
+
+    pf = pq.ParquetFile(data_path)
+    assert dataset.schema.equals(pf.schema)
+
+
 def _filter_partition(df, part_keys):
     predicate = np.ones(len(df), dtype=bool)

From 01114d831b1cd0cdb9a7f28958d181dcece2537f Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Fri, 14 Apr 2017 16:20:06 -0400
Subject: [PATCH 0512/1644] ARROW-783: [Java/C++] Fixes for 0-length record batches

@StevenMPhillips @nongli @julienledem I found a number of issues in both C++
and Java around the handling of 0-length vectors. It seems that preserving a
single inconsequential offset for a length-0 variable length vector can be a
bit difficult, so I relaxed a restriction in `loadFieldVectors` about this.
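In practice, the new contract reads like this on the consumer side (a sketch: ArrowStreamReader, VectorSchemaRoot, and loadNextBatch are the real APIs touched in the diffs below; the stream source and the loop body are assumed for illustration):

    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.VectorSchemaRoot;
    import org.apache.arrow.vector.stream.ArrowStreamReader;

    class DrainStream {
        // Reads every record batch from an Arrow stream. loadNextBatch() now
        // returns false on end-of-stream instead of leaving an ambiguous
        // 0-row count, so 0-length batches inside the stream are legal.
        static void drain(InputStream in) throws IOException {
            try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
                 ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) {
                VectorSchemaRoot root = reader.getVectorSchemaRoot();
                while (reader.loadNextBatch()) {
                    // root may legitimately hold 0 rows here
                    System.out.println("rows: " + root.getRowCount());
                }
            }
        }
    }
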
Let me know if there's anything concerning about the other changes around EOS signaling

Author: Wes McKinney

Closes #505 from wesm/ARROW-783 and squashes the following commits:

28ddcab [Wes McKinney] * Have loadNextBatch return true/false for EOS to accommodate 0-length record batches * Relax n + 1 restriction for 0-length vectors
---
 cpp/src/arrow/loader.cc                       | 16 +++---------
 integration/integration_test.py               |  8 +++---
 .../org/apache/arrow/tools/FileRoundtrip.java |  4 +--
 .../org/apache/arrow/tools/FileToStream.java  | 10 +++++---
 .../org/apache/arrow/tools/Integration.java   | 17 ++++++++-----
 .../org/apache/arrow/tools/StreamToFile.java  | 10 +++++---
 .../arrow/tools/ArrowFileTestFixtures.java    |  4 ++-
 .../apache/arrow/tools/EchoServerTest.java    |  4 +--
 .../templates/NullableValueVectors.java       |  4 ++-
 .../arrow/vector/file/ArrowFileReader.java    |  4 +--
 .../apache/arrow/vector/file/ArrowReader.java | 14 +++++++++--
 .../vector/file/json/JsonFileReader.java      |  4 ++-
 .../arrow/vector/file/TestArrowFile.java      | 25 +++++++++----------
 .../arrow/vector/file/TestArrowStream.java    | 12 +++++----
 .../vector/file/TestArrowStreamPipe.java      |  9 ++++---
 15 files changed, 82 insertions(+), 63 deletions(-)

diff --git a/cpp/src/arrow/loader.cc b/cpp/src/arrow/loader.cc
index f9f6e6fcac826..e4e1ba42ff600 100644
--- a/cpp/src/arrow/loader.cc
+++ b/cpp/src/arrow/loader.cc
@@ -97,13 +97,8 @@ class ArrayLoader {
     std::shared_ptr null_bitmap, offsets, values;
 
     RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap));
-    if (field_meta.length > 0) {
-      RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &offsets));
-      RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &values));
-    } else {
-      context_->buffer_index += 2;
-      offsets = values = nullptr;
-    }
+    RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &offsets));
+    RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &values));
 
     result_ = std::make_shared(
         field_meta.length, offsets, values, null_bitmap, field_meta.null_count);
@@ -166,12 +161,7 @@ class ArrayLoader {
     std::shared_ptr null_bitmap, offsets;
 
     RETURN_NOT_OK(LoadCommon(&field_meta, &null_bitmap));
-    if (field_meta.length > 0) {
-      RETURN_NOT_OK(GetBuffer(context_->buffer_index, &offsets));
-    } else {
-      offsets = nullptr;
-    }
-    ++context_->buffer_index;
+    RETURN_NOT_OK(GetBuffer(context_->buffer_index++, &offsets));
 
     const int num_children = type.num_children();
     if (num_children != 1) {
diff --git a/integration/integration_test.py b/integration/integration_test.py
index 6631dc8c2f761..661f5c97770d9 100644
--- a/integration/integration_test.py
+++ b/integration/integration_test.py
@@ -593,7 +593,7 @@ def _generate_file(fields, batch_sizes):
     return JSONFile(schema, batches)
 
 
-def generate_primitive_case():
+def generate_primitive_case(batch_sizes):
     types = ['bool', 'int8', 'int16', 'int32', 'int64',
              'uint8', 'uint16', 'uint32', 'uint64',
              'float32', 'float64', 'binary', 'utf8']
@@ -604,7 +604,6 @@ def generate_primitive_case():
         fields.append(get_field(type_ + "_nullable", type_, True))
         fields.append(get_field(type_ + "_nonnullable", type_, False))
 
-    batch_sizes = [7, 10]
     return _generate_file(fields, batch_sizes)
 
 
@@ -648,9 +647,8 @@ def _temp_path():
         return
 
     file_objs = [
-        generate_primitive_case(),
-        generate_primitive_case(),
-        generate_primitive_case(),
+        generate_primitive_case([7, 10]),
+        generate_primitive_case([0, 0, 0]),
         generate_datetime_case(),
         generate_nested_case()
     ]
diff --git a/java/tools/src/main/java/org/apache/arrow/tools/FileRoundtrip.java b/java/tools/src/main/java/org/apache/arrow/tools/FileRoundtrip.java
index 
b8621920d3348..135d4921ed128 100644 --- a/java/tools/src/main/java/org/apache/arrow/tools/FileRoundtrip.java +++ b/java/tools/src/main/java/org/apache/arrow/tools/FileRoundtrip.java @@ -93,9 +93,7 @@ int run(String[] args) { fileOutputStream.getChannel())) { arrowWriter.start(); while (true) { - arrowReader.loadNextBatch(); - int loaded = root.getRowCount(); - if (loaded == 0) { + if (!arrowReader.loadNextBatch()) { break; } else { arrowWriter.writeBatch(); diff --git a/java/tools/src/main/java/org/apache/arrow/tools/FileToStream.java b/java/tools/src/main/java/org/apache/arrow/tools/FileToStream.java index be404fd4c5950..6722b30fa7f50 100644 --- a/java/tools/src/main/java/org/apache/arrow/tools/FileToStream.java +++ b/java/tools/src/main/java/org/apache/arrow/tools/FileToStream.java @@ -41,12 +41,16 @@ public static void convert(FileInputStream in, OutputStream out) throws IOExcept try (ArrowFileReader reader = new ArrowFileReader(in.getChannel(), allocator)) { VectorSchemaRoot root = reader.getVectorSchemaRoot(); // load the first batch before instantiating the writer so that we have any dictionaries - reader.loadNextBatch(); + if (!reader.loadNextBatch()) { + throw new IOException("Unable to read first record batch"); + } try (ArrowStreamWriter writer = new ArrowStreamWriter(root, reader, out)) { writer.start(); - while (root.getRowCount() > 0) { + while (true) { writer.writeBatch(); - reader.loadNextBatch(); + if (!reader.loadNextBatch()) { + break; + } } writer.end(); } diff --git a/java/tools/src/main/java/org/apache/arrow/tools/Integration.java b/java/tools/src/main/java/org/apache/arrow/tools/Integration.java index 453693d7fa489..e8266d50786d3 100644 --- a/java/tools/src/main/java/org/apache/arrow/tools/Integration.java +++ b/java/tools/src/main/java/org/apache/arrow/tools/Integration.java @@ -126,7 +126,9 @@ public void execute(File arrowFile, File jsonFile) throws IOException { .pretty(true))) { writer.start(schema); for (ArrowBlock rbBlock : arrowReader.getRecordBlocks()) { - arrowReader.loadRecordBatch(rbBlock); + if (!arrowReader.loadRecordBatch(rbBlock)) { + throw new IOException("Expected to load record batch"); + } writer.write(root); } } @@ -148,10 +150,8 @@ public void execute(File arrowFile, File jsonFile) throws IOException { ArrowFileWriter arrowWriter = new ArrowFileWriter(root, null, fileOutputStream .getChannel())) { arrowWriter.start(); - reader.read(root); - while (root.getRowCount() != 0) { + while (reader.read(root)) { arrowWriter.writeBatch(); - reader.read(root); } arrowWriter.end(); } @@ -179,16 +179,21 @@ public void execute(File arrowFile, File jsonFile) throws IOException { List recordBatches = arrowReader.getRecordBlocks(); Iterator iterator = recordBatches.iterator(); VectorSchemaRoot jsonRoot; + int totalBatches = 0; while ((jsonRoot = jsonReader.read()) != null && iterator.hasNext()) { ArrowBlock rbBlock = iterator.next(); - arrowReader.loadRecordBatch(rbBlock); + if (!arrowReader.loadRecordBatch(rbBlock)) { + throw new IOException("Expected to load record batch"); + } Validator.compareVectorSchemaRoot(arrowRoot, jsonRoot); jsonRoot.close(); + totalBatches++; } boolean hasMoreJSON = jsonRoot != null; boolean hasMoreArrow = iterator.hasNext(); if (hasMoreJSON || hasMoreArrow) { - throw new IllegalArgumentException("Unexpected RecordBatches. J:" + hasMoreJSON + " " + throw new IllegalArgumentException("Unexpected RecordBatches. 
Total: " + totalBatches + + " J:" + hasMoreJSON + " " + "A:" + hasMoreArrow); } } diff --git a/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java b/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java index 41dfd347be579..ef1a11f6bfac8 100644 --- a/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java +++ b/java/tools/src/main/java/org/apache/arrow/tools/StreamToFile.java @@ -41,12 +41,16 @@ public static void convert(InputStream in, OutputStream out) throws IOException try (ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) { VectorSchemaRoot root = reader.getVectorSchemaRoot(); // load the first batch before instantiating the writer so that we have any dictionaries - reader.loadNextBatch(); + if (!reader.loadNextBatch()) { + throw new IOException("Unable to read first record batch"); + } try (ArrowFileWriter writer = new ArrowFileWriter(root, reader, Channels.newChannel(out))) { writer.start(); - while (root.getRowCount() > 0) { + while (true) { writer.writeBatch(); - reader.loadNextBatch(); + if (!reader.loadNextBatch()) { + break; + } } writer.end(); } diff --git a/java/tools/src/test/java/org/apache/arrow/tools/ArrowFileTestFixtures.java b/java/tools/src/test/java/org/apache/arrow/tools/ArrowFileTestFixtures.java index 1a389098b4f47..34c93ed232c80 100644 --- a/java/tools/src/test/java/org/apache/arrow/tools/ArrowFileTestFixtures.java +++ b/java/tools/src/test/java/org/apache/arrow/tools/ArrowFileTestFixtures.java @@ -67,7 +67,9 @@ static void validateOutput(File testOutFile, BufferAllocator allocator) throws E VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); Schema schema = root.getSchema(); for (ArrowBlock rbBlock : arrowReader.getRecordBlocks()) { - arrowReader.loadRecordBatch(rbBlock); + if (!arrowReader.loadRecordBatch(rbBlock)) { + throw new IOException("Expected to read record batch"); + } validateContent(COUNT, root); } } diff --git a/java/tools/src/test/java/org/apache/arrow/tools/EchoServerTest.java b/java/tools/src/test/java/org/apache/arrow/tools/EchoServerTest.java index 7d07588892cf9..7cca33955d93a 100644 --- a/java/tools/src/test/java/org/apache/arrow/tools/EchoServerTest.java +++ b/java/tools/src/test/java/org/apache/arrow/tools/EchoServerTest.java @@ -118,7 +118,7 @@ private void testEchoServer(int serverPort, NullableTinyIntVector readVector = (NullableTinyIntVector) reader.getVectorSchemaRoot() .getFieldVectors().get(0); for (int i = 0; i < batches; i++) { - reader.loadNextBatch(); + Assert.assertTrue(reader.loadNextBatch()); assertEquals(16, reader.getVectorSchemaRoot().getRowCount()); assertEquals(16, readVector.getAccessor().getValueCount()); for (int j = 0; j < 8; j++) { @@ -126,7 +126,7 @@ private void testEchoServer(int serverPort, assertTrue(readVector.getAccessor().isNull(j + 8)); } } - reader.loadNextBatch(); + Assert.assertFalse(reader.loadNextBatch()); assertEquals(0, reader.getVectorSchemaRoot().getRowCount()); assertEquals(reader.bytesRead(), writer.bytesWritten()); } diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index a50771a45a034..e5257ce554e3b 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -122,7 +122,9 @@ public List getChildrenFromFields() { public void loadFieldBuffers(ArrowFieldNode fieldNode, List ownBuffers) { <#if type.major = "VarLen"> // variable width values: truncate offset 
vector buffer to size (#1) - org.apache.arrow.vector.BaseDataValueVector.truncateBufferBasedOnSize(ownBuffers, 1, values.offsetVector.getBufferSizeFor(fieldNode.getLength() + 1)); + org.apache.arrow.vector.BaseDataValueVector.truncateBufferBasedOnSize(ownBuffers, 1, + values.offsetVector.getBufferSizeFor( + fieldNode.getLength() == 0? 0 : fieldNode.getLength() + 1)); <#else> // fixed width values truncate value vector to size (#1) org.apache.arrow.vector.BaseDataValueVector.truncateBufferBasedOnSize(ownBuffers, 1, values.getBufferSizeFor(fieldNode.getLength())); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFileReader.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFileReader.java index 28440a190ad43..f4d6ada932494 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFileReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowFileReader.java @@ -103,14 +103,14 @@ public List getRecordBlocks() throws IOException { return footer.getRecordBatches(); } - public void loadRecordBatch(ArrowBlock block) throws IOException { + public boolean loadRecordBatch(ArrowBlock block) throws IOException { ensureInitialized(); int blockIndex = footer.getRecordBatches().indexOf(block); if (blockIndex == -1) { throw new IllegalArgumentException("Arrow bock does not exist in record batches: " + block); } currentRecordBatch = blockIndex; - loadNextBatch(); + return loadNextBatch(); } private ArrowDictionaryBatch readDictionaryBatch(SeekableReadChannel in, diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java index 1646fbe803687..1d33913f71a95 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowReader.java @@ -89,7 +89,8 @@ public Dictionary lookup(long id) { } } - public void loadNextBatch() throws IOException { + // Returns true if a batch was read, false on EOS + public boolean loadNextBatch() throws IOException { ensureInitialized(); // read in all dictionary batches, then stop after our first record batch ArrowMessageVisitor visitor = new ArrowMessageVisitor() { @@ -106,9 +107,18 @@ public Boolean visit(ArrowRecordBatch message) { }; root.setRowCount(0); ArrowMessage message = readMessage(in, allocator); - while (message != null && message.accepts(visitor)) { + + boolean readBatch = false; + while (message != null) { + if (!message.accepts(visitor)) { + readBatch = true; + break; + } + // else read a dictionary message = readMessage(in, allocator); } + + return readBatch; } public long bytesRead() { return in.bytesRead(); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java index fde9954d288bb..21aa0372c6b38 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/json/JsonFileReader.java @@ -94,7 +94,7 @@ public Schema start() throws JsonParseException, IOException { } } - public void read(VectorSchemaRoot root) throws IOException { + public boolean read(VectorSchemaRoot root) throws IOException { JsonToken t = parser.nextToken(); if (t == START_OBJECT) { { @@ -111,8 +111,10 @@ public void read(VectorSchemaRoot root) throws IOException { readToken(END_ARRAY); } readToken(END_OBJECT); + return true; } else 
if (t == END_ARRAY) { root.setRowCount(0); + return false; } else { throw new IllegalArgumentException("Invalid token: " + t); } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java index a1104ffe545d8..11730afd55406 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java @@ -152,7 +152,7 @@ protected ArrowMessage readMessage(ReadChannel in, BufferAllocator allocator) th VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); Schema schema = root.getSchema(); LOGGER.debug("reading schema: " + schema); - arrowReader.loadNextBatch(); + Assert.assertTrue(arrowReader.loadNextBatch()); Assert.assertEquals(count, root.getRowCount()); validateContent(count, root); } @@ -193,7 +193,7 @@ public void testWriteReadComplex() throws IOException { VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); Schema schema = root.getSchema(); LOGGER.debug("reading schema: " + schema); - arrowReader.loadNextBatch(); + Assert.assertTrue(arrowReader.loadNextBatch()); Assert.assertEquals(count, root.getRowCount()); validateComplexContent(count, root); } @@ -263,13 +263,12 @@ public void testWriteReadMultipleRBs() throws IOException { int i = 0; for (int n = 0; n < 2; n++) { - arrowReader.loadNextBatch(); + Assert.assertTrue(arrowReader.loadNextBatch()); Assert.assertEquals("RB #" + i, counts[i], root.getRowCount()); validateContent(counts[i], root); ++i; } - arrowReader.loadNextBatch(); - Assert.assertEquals(0, root.getRowCount()); + Assert.assertFalse(arrowReader.loadNextBatch()); } } @@ -294,7 +293,7 @@ public void testWriteReadUnion() throws IOException { VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); Schema schema = root.getSchema(); LOGGER.debug("reading schema: " + schema); - arrowReader.loadNextBatch(); + Assert.assertTrue(arrowReader.loadNextBatch()); validateUnionData(count, root); } @@ -305,7 +304,7 @@ public void testWriteReadUnion() throws IOException { VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); Schema schema = root.getSchema(); LOGGER.debug("reading schema: " + schema); - arrowReader.loadNextBatch(); + Assert.assertTrue(arrowReader.loadNextBatch()); validateUnionData(count, root); } } @@ -347,7 +346,7 @@ public void testWriteReadTiny() throws IOException { VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); Schema schema = root.getSchema(); LOGGER.debug("reading schema: " + schema); - arrowReader.loadNextBatch(); + Assert.assertTrue(arrowReader.loadNextBatch()); validateTinyData(root); } @@ -358,7 +357,7 @@ public void testWriteReadTiny() throws IOException { VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); Schema schema = root.getSchema(); LOGGER.debug("reading schema: " + schema); - arrowReader.loadNextBatch(); + Assert.assertTrue(arrowReader.loadNextBatch()); validateTinyData(root); } } @@ -433,7 +432,7 @@ public void testWriteReadDictionary() throws IOException { VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); Schema schema = root.getSchema(); LOGGER.debug("reading schema: " + schema); - arrowReader.loadNextBatch(); + Assert.assertTrue(arrowReader.loadNextBatch()); validateFlatDictionary(root.getFieldVectors().get(0), arrowReader); } @@ -444,7 +443,7 @@ public void testWriteReadDictionary() throws IOException { VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); Schema schema = root.getSchema(); 
LOGGER.debug("reading schema: " + schema); - arrowReader.loadNextBatch(); + Assert.assertTrue(arrowReader.loadNextBatch()); validateFlatDictionary(root.getFieldVectors().get(0), arrowReader); } } @@ -537,7 +536,7 @@ public void testWriteReadNestedDictionary() throws IOException { VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); Schema schema = root.getSchema(); LOGGER.debug("reading schema: " + schema); - arrowReader.loadNextBatch(); + Assert.assertTrue(arrowReader.loadNextBatch()); validateNestedDictionary((ListVector) root.getFieldVectors().get(0), arrowReader); } @@ -548,7 +547,7 @@ public void testWriteReadNestedDictionary() throws IOException { VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); Schema schema = root.getSchema(); LOGGER.debug("reading schema: " + schema); - arrowReader.loadNextBatch(); + Assert.assertTrue(arrowReader.loadNextBatch()); validateNestedDictionary((ListVector) root.getFieldVectors().get(0), arrowReader); } } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowStream.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowStream.java index e7cdf3fea4b8b..7e9afd381c181 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowStream.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowStream.java @@ -19,6 +19,7 @@ import static java.util.Arrays.asList; import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertFalse; import static org.junit.Assert.assertTrue; import java.io.ByteArrayInputStream; @@ -36,6 +37,7 @@ import org.apache.arrow.vector.stream.ArrowStreamWriter; import org.apache.arrow.vector.stream.MessageSerializerTest; import org.apache.arrow.vector.types.pojo.Schema; +import org.junit.Assert; import org.junit.Test; public class TestArrowStream extends BaseFileTest { @@ -52,10 +54,10 @@ public void testEmptyStream() throws IOException { ByteArrayInputStream in = new ByteArrayInputStream(out.toByteArray()); try (ArrowStreamReader reader = new ArrowStreamReader(in, allocator)) { assertEquals(schema, reader.getVectorSchemaRoot().getSchema()); - // Empty should return nothing. Can be called repeatedly. 
- reader.loadNextBatch(); + // Empty should return false + Assert.assertFalse(reader.loadNextBatch()); assertEquals(0, reader.getVectorSchemaRoot().getRowCount()); - reader.loadNextBatch(); + Assert.assertFalse(reader.loadNextBatch()); assertEquals(0, reader.getVectorSchemaRoot().getRowCount()); } } @@ -90,11 +92,11 @@ public void testReadWrite() throws IOException { Schema readSchema = reader.getVectorSchemaRoot().getSchema(); assertEquals(schema, readSchema); for (int i = 0; i < numBatches; i++) { - reader.loadNextBatch(); + assertTrue(reader.loadNextBatch()); } // TODO figure out why reader isn't getting padding bytes assertEquals(bytesWritten, reader.bytesRead() + 4); - reader.loadNextBatch(); + assertFalse(reader.loadNextBatch()); assertEquals(0, reader.getVectorSchemaRoot().getRowCount()); } } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowStreamPipe.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowStreamPipe.java index 46d46794bbefa..20d4482da7c98 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowStreamPipe.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowStreamPipe.java @@ -105,8 +105,10 @@ protected ArrowMessage readMessage(ReadChannel in, BufferAllocator allocator) th return message; } @Override - public void loadNextBatch() throws IOException { - super.loadNextBatch(); + public boolean loadNextBatch() throws IOException { + if (!super.loadNextBatch()) { + return false; + } if (!done) { VectorSchemaRoot root = getVectorSchemaRoot(); Assert.assertEquals(16, root.getRowCount()); @@ -120,6 +122,7 @@ public void loadNextBatch() throws IOException { } } } + return true; } }; } @@ -132,7 +135,7 @@ public void run() { reader.getVectorSchemaRoot().getSchema().getFields().get(0).getTypeLayout().getVectorTypes().toString(), reader.getVectorSchemaRoot().getSchema().getFields().get(0).getTypeLayout().getVectors().size() > 0); while (!done) { - reader.loadNextBatch(); + assertTrue(reader.loadNextBatch()); } } catch (IOException e) { e.printStackTrace(); From b6033378c2533ed7b396f111cc5aed10450907fb Mon Sep 17 00:00:00 2001 From: Emilio Lahr-Vivaz Date: Fri, 14 Apr 2017 16:37:25 -0400 Subject: [PATCH 0513/1644] ARROW-815 [Java] Exposing reAlloc for ValueVector Author: Emilio Lahr-Vivaz Closes #534 from elahrvivaz/ARROW-815 and squashes the following commits: cf14944 [Emilio Lahr-Vivaz] unit test 45fa773 [Emilio Lahr-Vivaz] ARROW-815 [Java] Exposing reAlloc for ValueVector --- .../templates/NullableValueVectors.java | 6 + .../main/codegen/templates/UnionVector.java | 6 + .../org/apache/arrow/vector/ValueVector.java | 6 + .../org/apache/arrow/vector/ZeroVector.java | 3 + .../vector/complex/AbstractMapVector.java | 7 + .../complex/BaseRepeatedValueVector.java | 14 +- .../arrow/vector/complex/ListVector.java | 6 + .../vector/complex/NullableMapVector.java | 7 + .../arrow/vector/TestVectorReAlloc.java | 144 ++++++++++++++++++ 9 files changed, 194 insertions(+), 5 deletions(-) create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/TestVectorReAlloc.java diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index e5257ce554e3b..acee6cb738d76 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -242,6 +242,12 @@ public boolean allocateNewSafe() { return success; } + @Override + public void 
reAlloc() { + bits.reAlloc(); + values.reAlloc(); + } + <#if type.major == "VarLen"> @Override public void allocateNew(int totalBytes, int valueCount) { diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java index 797b29342e4c1..d70cbae02bf33 100644 --- a/java/vector/src/main/codegen/templates/UnionVector.java +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -214,6 +214,12 @@ public boolean allocateNewSafe() { return safe; } + @Override + public void reAlloc() { + internalMap.reAlloc(); + typeVector.reAlloc(); + } + @Override public void setInitialCapacity(int numRecords) { } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java index 8e35398b9394b..685b0be010a08 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java @@ -68,6 +68,12 @@ public interface ValueVector extends Closeable, Iterable { */ boolean allocateNewSafe(); + /** + * Allocate new buffer with double capacity, and copy data into the new buffer. + * Replace vector's buffer with new buffer, and release old one + */ + void reAlloc(); + BufferAllocator getAllocator(); /** diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java b/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java index 73f858e4d35a0..e48788c6ae7c0 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/ZeroVector.java @@ -142,6 +142,9 @@ public boolean allocateNewSafe() { return true; } + @Override + public void reAlloc() {} + @Override public BufferAllocator getAllocator() { throw new UnsupportedOperationException("Tried to get allocator from ZeroVector"); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java index dc833edbed8d0..15e8a5bc624ac 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java @@ -81,6 +81,13 @@ public boolean allocateNewSafe() { return true; } + @Override + public void reAlloc() { + for (final ValueVector v: vectors.values()) { + v.reAlloc(); + } + } + /** * Adds a new field with the given parameters or replaces the existing one and consequently returns the resultant * {@link org.apache.arrow.vector.ValueVector}. 
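Each of these overrides simply cascades reAlloc() to the underlying buffers or child vectors. As a usage sketch of the newly public method (mirroring the fixed-width case in the TestVectorReAlloc suite added at the end of this patch; the allocator setup comes from that test):

    import org.apache.arrow.memory.BufferAllocator;
    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.UInt4Vector;

    // Sketch: grow a vector past its initial capacity with reAlloc().
    class ReAllocDemo {
        public static void main(String[] args) {
            try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
                 UInt4Vector vector = new UInt4Vector("", allocator)) {
                vector.setInitialCapacity(512);
                vector.allocateNew();               // capacity is 512 values
                vector.reAlloc();                   // doubles capacity, copying data
                vector.getMutator().set(512, 100);  // index 512 is now in bounds
            }
        }
    }
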
diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java index 6b240c04f7124..da221e33013d1 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java @@ -20,6 +20,10 @@ import java.util.Collections; import java.util.Iterator; +import com.google.common.base.Preconditions; +import com.google.common.collect.ObjectArrays; + +import io.netty.buffer.ArrowBuf; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.AddOrGetResult; import org.apache.arrow.vector.BaseValueVector; @@ -31,11 +35,6 @@ import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.util.SchemaChangeRuntimeException; -import com.google.common.base.Preconditions; -import com.google.common.collect.ObjectArrays; - -import io.netty.buffer.ArrowBuf; - public abstract class BaseRepeatedValueVector extends BaseValueVector implements RepeatedValueVector { public final static FieldVector DEFAULT_DATA_VECTOR = ZeroVector.INSTANCE; @@ -79,6 +78,11 @@ public boolean allocateNewSafe() { return success; } + @Override + public void reAlloc() { + offsets.reAlloc(); + vector.reAlloc(); + } @Override public UInt4Vector getOffsetVector() { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java index 0461a8d9d285a..63235dfda87df 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java @@ -124,6 +124,12 @@ public void allocateNew() throws OutOfMemoryException { bits.allocateNewSafe(); } + @Override + public void reAlloc() { + super.reAlloc(); + bits.reAlloc(); + } + public void copyFromSafe(int inIndex, int outIndex, ListVector from) { copyFrom(inIndex, outIndex, from); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java index 71fee67d49c9f..647ab28352f0d 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java @@ -197,6 +197,13 @@ public boolean allocateNewSafe() { bits.zeroVector(); return success; } + + @Override + public void reAlloc() { + bits.reAlloc(); + super.reAlloc(); + } + public final class Accessor extends MapVector.Accessor { final BitVector.Accessor bAccessor = bits.getAccessor(); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestVectorReAlloc.java b/java/vector/src/test/java/org/apache/arrow/vector/TestVectorReAlloc.java new file mode 100644 index 0000000000000..a7c35b6363cf1 --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestVectorReAlloc.java @@ -0,0 +1,144 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertNull; + +import java.nio.charset.StandardCharsets; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.complex.ListVector; +import org.apache.arrow.vector.complex.NullableMapVector; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.FieldType; +import org.junit.After; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Test; + + +public class TestVectorReAlloc { + + private BufferAllocator allocator; + + @Before + public void init() { + allocator = new RootAllocator(Long.MAX_VALUE); + } + + @After + public void terminate() throws Exception { + allocator.close(); + } + + @Test + public void testFixedType() { + try (final UInt4Vector vector = new UInt4Vector("", allocator)) { + final UInt4Vector.Mutator m = vector.getMutator(); + vector.setInitialCapacity(512); + vector.allocateNew(); + + assertEquals(512, vector.getValueCapacity()); + + try { + m.set(512, 0); + Assert.fail("Expected out of bounds exception"); + } catch (Exception e) { + // ok + } + + vector.reAlloc(); + assertEquals(1024, vector.getValueCapacity()); + + m.set(512, 100); + assertEquals(100, vector.getAccessor().get(512)); + } + } + + @Test + public void testNullableType() { + try (final NullableVarCharVector vector = new NullableVarCharVector("", allocator)) { + final NullableVarCharVector.Mutator m = vector.getMutator(); + vector.setInitialCapacity(512); + vector.allocateNew(); + + assertEquals(512, vector.getValueCapacity()); + + try { + m.set(512, "foo".getBytes(StandardCharsets.UTF_8)); + Assert.fail("Expected out of bounds exception"); + } catch (Exception e) { + // ok + } + + vector.reAlloc(); + assertEquals(1023, vector.getValueCapacity()); // note: size - 1 for some reason... + + m.set(512, "foo".getBytes(StandardCharsets.UTF_8)); + assertEquals("foo", new String(vector.getAccessor().get(512), StandardCharsets.UTF_8)); + } + } + + @Test + public void testListType() { + try (final ListVector vector = new ListVector("", allocator, null)) { + vector.addOrGetVector(FieldType.nullable(MinorType.INT.getType())); + + vector.setInitialCapacity(512); + vector.allocateNew(); + + assertEquals(1023, vector.getValueCapacity()); // TODO this doubles for some reason... 
+ + try { + vector.getOffsetVector().getAccessor().get(2014); + Assert.fail("Expected out of bounds exception"); + } catch (Exception e) { + // ok + } + + vector.reAlloc(); + assertEquals(2047, vector.getValueCapacity()); // note: size - 1 + assertEquals(0, vector.getOffsetVector().getAccessor().get(2014)); + } + } + + @Test + public void testMapType() { + try (final NullableMapVector vector = new NullableMapVector("", allocator, null)) { + vector.addOrGet("", FieldType.nullable(MinorType.INT.getType()), NullableIntVector.class); + + vector.setInitialCapacity(512); + vector.allocateNew(); + + assertEquals(512, vector.getValueCapacity()); + + try { + vector.getAccessor().getObject(513); + Assert.fail("Expected out of bounds exception"); + } catch (Exception e) { + // ok + } + + vector.reAlloc(); + assertEquals(1024, vector.getValueCapacity()); + assertNull(vector.getAccessor().getObject(513)); + } + } +} From 794d0204cb33bc98bce418785b4643ee4f1083d8 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Fri, 14 Apr 2017 20:14:02 -0400 Subject: [PATCH 0514/1644] ARROW-777: restore getObject behavior on Date and Time Author: Julien Le Dem Closes #542 from julienledem/ARROW-777 and squashes the following commits: c77f5a0 [Julien Le Dem] ARROW-777: restore getObject behavior on Date and Time --- .../src/main/codegen/data/ValueVectorTypes.tdd | 4 ++-- .../codegen/templates/FixedValueVectors.java | 18 ++++++++++++++++-- .../apache/arrow/vector/file/BaseFileTest.java | 6 ++++-- 3 files changed, 22 insertions(+), 6 deletions(-) diff --git a/java/vector/src/main/codegen/data/ValueVectorTypes.tdd b/java/vector/src/main/codegen/data/ValueVectorTypes.tdd index b08c100edcac8..d472b559347f0 100644 --- a/java/vector/src/main/codegen/data/ValueVectorTypes.tdd +++ b/java/vector/src/main/codegen/data/ValueVectorTypes.tdd @@ -59,7 +59,7 @@ { class: "DateDay" }, { class: "IntervalYear", javaType: "int", friendlyType: "Period" }, { class: "TimeSec" }, - { class: "TimeMilli" } + { class: "TimeMilli", javaType: "int", friendlyType: "DateTime" } ] }, { @@ -72,7 +72,7 @@ { class: "BigInt"}, { class: "UInt8" }, { class: "Float8", javaType: "double" , boxedType: "Double", fields: [{name: "value", type: "double"}], }, - { class: "DateMilli" }, + { class: "DateMilli", javaType: "long", friendlyType: "DateTime" }, { class: "TimeStampSec", javaType: "long", boxedType: "Long", friendlyType: "DateTime" }, { class: "TimeStampMilli", javaType: "long", boxedType: "Long", friendlyType: "DateTime" }, { class: "TimeStampMicro", javaType: "long", boxedType: "Long", friendlyType: "DateTime" }, diff --git a/java/vector/src/main/codegen/templates/FixedValueVectors.java b/java/vector/src/main/codegen/templates/FixedValueVectors.java index 947c82c74a401..5c09e30c71487 100644 --- a/java/vector/src/main/codegen/templates/FixedValueVectors.java +++ b/java/vector/src/main/codegen/templates/FixedValueVectors.java @@ -484,9 +484,7 @@ public long getTwoAsLong(int index) { <#if minor.class == "DateDay" || - minor.class == "DateMilli" || minor.class == "TimeSec" || - minor.class == "TimeMilli" || minor.class == "TimeMicro" || minor.class == "TimeNano"> @Override @@ -494,6 +492,22 @@ public long getTwoAsLong(int index) { return get(index); } + <#elseif minor.class == "DateMilli"> + @Override + public ${friendlyType} getObject(int index) { + org.joda.time.DateTime date = new org.joda.time.DateTime(get(index), org.joda.time.DateTimeZone.UTC); + date = date.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault()); + return date; + } + + <#elseif 
minor.class == "TimeMilli">
+  @Override
+  public ${friendlyType} getObject(int index) {
+    org.joda.time.DateTime time = new org.joda.time.DateTime(get(index), org.joda.time.DateTimeZone.UTC);
+    time = time.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault());
+    return time;
+  }
+
  <#elseif minor.class == "TimeStampSec">
  @Override
  public ${friendlyType} getObject(int index) {
diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java b/java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java
index 5c68a1904be70..5ca083aa2dfab 100644
--- a/java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java
+++ b/java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java
@@ -22,6 +22,8 @@
 import org.apache.arrow.memory.BufferAllocator;
 import org.apache.arrow.memory.RootAllocator;
 import org.apache.arrow.vector.FieldVector;
+import org.apache.arrow.vector.NullableDateMilliVector;
+import org.apache.arrow.vector.NullableTimeMilliVector;
 import org.apache.arrow.vector.ValueVector.Accessor;
 import org.apache.arrow.vector.VectorSchemaRoot;
 import org.apache.arrow.vector.complex.MapVector;
@@ -173,11 +175,11 @@ protected void validateDateTimeContent(int count, VectorSchemaRoot root) {
 Assert.assertEquals(count, root.getRowCount());
 printVectors(root.getFieldVectors());
 for (int i = 0; i < count; i++) {
- Object dateVal = root.getVector("date").getAccessor().getObject(i);
+ long dateVal = ((NullableDateMilliVector)root.getVector("date")).getAccessor().get(i);
 DateTime dt = makeDateTimeFromCount(i);
 DateTime dateExpected = dt.minusMillis(dt.getMillisOfDay());
 Assert.assertEquals(dateExpected.getMillis(), dateVal);
- Object timeVal = root.getVector("time").getAccessor().getObject(i);
+ long timeVal = ((NullableTimeMilliVector)root.getVector("time")).getAccessor().get(i);
 Assert.assertEquals(dt.getMillisOfDay(), timeVal);
 Object timestampMilliVal = root.getVector("timestamp-milli").getAccessor().getObject(i);
 Assert.assertTrue(dt.withZoneRetainFields(DateTimeZone.getDefault()).equals(timestampMilliVal));
From 88c351abc24179ae1b1fa76450c2c8a4d6e4f04e Mon Sep 17 00:00:00 2001
From: Julien Le Dem
Date: Fri, 14 Apr 2017 20:14:36 -0400
Subject: [PATCH 0515/1644] =?UTF-8?q?ARROW-720:=20arrow=20should=20not=20h?=
 =?UTF-8?q?ave=20a=20dependency=20on=20slf4j=20bridges=20in=20com=E2=80=A6?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

…pile scope

Author: Julien Le Dem

Closes #541 from julienledem/ARROW-720 and squashes the following commits:

fa63e20 [Julien Le Dem] ARROW-720: arrow should not have a dependency on slf4j bridges in compile scope
---
 java/pom.xml | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/java/pom.xml b/java/pom.xml
index 5edd605e8eedb..5d07186e3e714 100644
--- a/java/pom.xml
+++ b/java/pom.xml
@@ -469,25 +469,28 @@
       <version>${dep.slf4j.version}</version>
     </dependency>
+
     <dependency>
       <groupId>org.slf4j</groupId>
       <artifactId>jul-to-slf4j</artifactId>
       <version>${dep.slf4j.version}</version>
+      <scope>test</scope>
     </dependency>
     <dependency>
       <groupId>org.slf4j</groupId>
       <artifactId>jcl-over-slf4j</artifactId>
       <version>${dep.slf4j.version}</version>
+      <scope>test</scope>
     </dependency>
     <dependency>
       <groupId>org.slf4j</groupId>
       <artifactId>log4j-over-slf4j</artifactId>
       <version>${dep.slf4j.version}</version>
+      <scope>test</scope>
     </dependency>
-
     <dependency>
       <groupId>com.googlecode.jmockit</groupId>
From 4b030dd0ea193eeb60644518f897ec966eb6b720 Mon Sep 17 00:00:00 2001
From: Jeff Knupp
Date: Sat, 15 Apr 2017 11:09:51 +0200
Subject: [PATCH 0516/1644] ARROW-828: [C++] Add new dependency to README

`libboost-regex-dev` is required to build on Ubuntu; added to `apt` install command.
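
A note on the ARROW-777 template change above: the regenerated `DateMilli`/`TimeMilli` accessors no longer return the raw stored value; they build a joda-time `DateTime` on the UTC timeline and then re-tag it with the JVM's default zone so the wall-clock fields are preserved. A minimal standalone sketch of that conversion, using the same joda-time calls as the template — the class wrapper and the `millisSinceEpoch` variable (standing in for the vector's `get(index)`) are illustrative, not part of the patch:

```java
import org.joda.time.DateTime;
import org.joda.time.DateTimeZone;

public class DateMilliGetObjectSketch {
  public static void main(String[] args) {
    long millisSinceEpoch = 86400000L; // stands in for get(index); 1970-01-02T00:00:00 UTC

    // Interpret the stored millisecond value on the UTC timeline first...
    DateTime date = new DateTime(millisSinceEpoch, DateTimeZone.UTC);
    // ...then re-tag it with the default zone while keeping the wall-clock
    // fields, so the calendar date does not shift with the machine's zone.
    date = date.withZoneRetainFields(DateTimeZone.getDefault());

    System.out.println(date); // 1970-01-02T00:00:00.000 with the local offset
  }
}
```

Because `getObject` now returns a `DateTime` rather than a raw long, the companion test change in `BaseFileTest` above compares millisecond values through `get(i)` instead of `getObject(i)`.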
Author: Jeff Knupp Closes #545 from jeffknupp/master and squashes the following commits: b527ebb [Jeff Knupp] Add new dependency to README --- cpp/README.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/cpp/README.md b/cpp/README.md index b19fa001198a4..339b6b47533cb 100644 --- a/cpp/README.md +++ b/cpp/README.md @@ -31,6 +31,7 @@ On Ubuntu/Debian you can install the requirements with: sudo apt-get install cmake \ libboost-dev \ libboost-filesystem-dev \ + libboost-regex-dev \ libboost-system-dev ``` @@ -126,4 +127,4 @@ both of these options would be used rarely. Current known uses-cases whent hey * Parameterized tests in google test. [1]: https://brew.sh/ -[2]: https://github.com/apache/arrow/blob/master/cpp/doc/Windows.md \ No newline at end of file +[2]: https://github.com/apache/arrow/blob/master/cpp/doc/Windows.md From ce5b98e1d8254219419220c42e45959ca1aeac21 Mon Sep 17 00:00:00 2001 From: Deepak Majeti Date: Sat, 15 Apr 2017 11:27:46 +0200 Subject: [PATCH 0517/1644] =?UTF-8?q?ARROW-820:=20[C++]=20Build=20dependen?= =?UTF-8?q?cies=20for=20Parquet=20library=20without=20arrow=E2=80=A6?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit … support Author: Deepak Majeti Closes #538 from majetideepak/ARROW-820 and squashes the following commits: 10ca617 [Deepak Majeti] Revert HDFS change f399ab5 [Deepak Majeti] Add flags for ARROW_IPC and ARROW_HDFS add683a [Deepak Majeti] ARROW-820: [C++] Build dependencies for Parquet library without arrow support --- cpp/CMakeLists.txt | 107 ++++++++++++++++++++++++--------------------- 1 file changed, 57 insertions(+), 50 deletions(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 0e4a4bbf34b67..83610d33e6af1 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -621,27 +621,49 @@ if(ARROW_BUILD_BENCHMARKS) endif() endif() -# RapidJSON, header only dependency -if("$ENV{RAPIDJSON_HOME}" STREQUAL "") - ExternalProject_Add(rapidjson_ep - PREFIX "${CMAKE_BINARY_DIR}" - URL "https://github.com/miloyip/rapidjson/archive/v1.1.0.tar.gz" - URL_MD5 "badd12c511e081fec6c89c43a7027bce" - CONFIGURE_COMMAND "" - BUILD_COMMAND "" - BUILD_IN_SOURCE 1 - INSTALL_COMMAND "") - - ExternalProject_Get_Property(rapidjson_ep SOURCE_DIR) - set(RAPIDJSON_INCLUDE_DIR "${SOURCE_DIR}/include") - set(RAPIDJSON_VENDORED 1) -else() - set(RAPIDJSON_INCLUDE_DIR "$ENV{RAPIDJSON_HOME}/include") - set(RAPIDJSON_VENDORED 0) -endif() -message(STATUS "RapidJSON include dir: ${RAPIDJSON_INCLUDE_DIR}") -include_directories(SYSTEM ${RAPIDJSON_INCLUDE_DIR}) +if (ARROW_IPC) + # RapidJSON, header only dependency + if("$ENV{RAPIDJSON_HOME}" STREQUAL "") + ExternalProject_Add(rapidjson_ep + PREFIX "${CMAKE_BINARY_DIR}" + URL "https://github.com/miloyip/rapidjson/archive/v1.1.0.tar.gz" + URL_MD5 "badd12c511e081fec6c89c43a7027bce" + CONFIGURE_COMMAND "" + BUILD_COMMAND "" + BUILD_IN_SOURCE 1 + INSTALL_COMMAND "") + + ExternalProject_Get_Property(rapidjson_ep SOURCE_DIR) + set(RAPIDJSON_INCLUDE_DIR "${SOURCE_DIR}/include") + set(RAPIDJSON_VENDORED 1) + else() + set(RAPIDJSON_INCLUDE_DIR "$ENV{RAPIDJSON_HOME}/include") + set(RAPIDJSON_VENDORED 0) + endif() + message(STATUS "RapidJSON include dir: ${RAPIDJSON_INCLUDE_DIR}") + include_directories(SYSTEM ${RAPIDJSON_INCLUDE_DIR}) + + ## Flatbuffers + if("$ENV{FLATBUFFERS_HOME}" STREQUAL "") + set(FLATBUFFERS_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/flatbuffers_ep-prefix/src/flatbuffers_ep-install") + ExternalProject_Add(flatbuffers_ep + URL 
"https://github.com/google/flatbuffers/archive/v${FLATBUFFERS_VERSION}.tar.gz" + CMAKE_ARGS + "-DCMAKE_INSTALL_PREFIX:PATH=${FLATBUFFERS_PREFIX}" + "-DFLATBUFFERS_BUILD_TESTS=OFF") + + set(FLATBUFFERS_INCLUDE_DIR "${FLATBUFFERS_PREFIX}/include") + set(FLATBUFFERS_COMPILER "${FLATBUFFERS_PREFIX}/bin/flatc") + set(FLATBUFFERS_VENDORED 1) + else() + find_package(Flatbuffers REQUIRED) + set(FLATBUFFERS_VENDORED 0) + endif() + message(STATUS "Flatbuffers include dir: ${FLATBUFFERS_INCLUDE_DIR}") + message(STATUS "Flatbuffers compiler: ${FLATBUFFERS_COMPILER}") + include_directories(SYSTEM ${FLATBUFFERS_INCLUDE_DIR}) +endif() #---------------------------------------------------------------------- if (MSVC) @@ -705,28 +727,6 @@ endif() # set(ARROW_TCMALLOC_AVAILABLE 1) # endif() -## Flatbuffers - -if("$ENV{FLATBUFFERS_HOME}" STREQUAL "") - set(FLATBUFFERS_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/flatbuffers_ep-prefix/src/flatbuffers_ep-install") - ExternalProject_Add(flatbuffers_ep - URL "https://github.com/google/flatbuffers/archive/v${FLATBUFFERS_VERSION}.tar.gz" - CMAKE_ARGS - "-DCMAKE_INSTALL_PREFIX:PATH=${FLATBUFFERS_PREFIX}" - "-DFLATBUFFERS_BUILD_TESTS=OFF") - - set(FLATBUFFERS_INCLUDE_DIR "${FLATBUFFERS_PREFIX}/include") - set(FLATBUFFERS_COMPILER "${FLATBUFFERS_PREFIX}/bin/flatc") - set(FLATBUFFERS_VENDORED 1) -else() - find_package(Flatbuffers REQUIRED) - set(FLATBUFFERS_VENDORED 0) -endif() - -message(STATUS "Flatbuffers include dir: ${FLATBUFFERS_INCLUDE_DIR}") -message(STATUS "Flatbuffers compiler: ${FLATBUFFERS_COMPILER}") -include_directories(SYSTEM ${FLATBUFFERS_INCLUDE_DIR}) - ######################################################################## # HDFS thirdparty setup @@ -885,7 +885,9 @@ endif() add_subdirectory(src/arrow) add_subdirectory(src/arrow/io) -add_subdirectory(src/arrow/ipc) +if (ARROW_IPC) + add_subdirectory(src/arrow/ipc) +endif() set(ARROW_DEPENDENCIES ${ARROW_DEPENDENCIES} metadata_fbs) @@ -909,17 +911,22 @@ set(ARROW_SRCS src/arrow/io/interfaces.cc src/arrow/io/memory.cc - src/arrow/ipc/feather.cc - src/arrow/ipc/json.cc - src/arrow/ipc/json-internal.cc - src/arrow/ipc/metadata.cc - src/arrow/ipc/reader.cc - src/arrow/ipc/writer.cc - src/arrow/util/bit-util.cc src/arrow/util/decimal.cc ) +if (ARROW_IPC) + set(ARROW_SRCS ${ARROW_SRCS} + src/arrow/ipc/feather.cc + src/arrow/ipc/json.cc + src/arrow/ipc/json-internal.cc + src/arrow/ipc/metadata.cc + src/arrow/ipc/reader.cc + src/arrow/ipc/writer.cc + ) +endif() + + if(NOT APPLE AND NOT MSVC) # Localize thirdparty symbols using a linker version script. This hides them # from the client application. 
The OS X linker does not support the
From 4d2ac871c9126ba431ebb193ea19bd5eb7ef8ab3 Mon Sep 17 00:00:00 2001
From: Philipp Moritz
Date: Sat, 15 Apr 2017 09:35:41 -0400
Subject: [PATCH 0518/1644] ARROW-826: [C++/Python] Fix compilation error on Mac with -DARROW_PYTHON=on

This fixes https://github.com/ray-project/ray/issues/461

Author: Philipp Moritz

Closes #544 from pcmoritz/fix-python-macos and squashes the following commits:

cf59732 [Philipp Moritz] include <Python.h> before <datetime.h>
---
 cpp/src/arrow/python/config.h | 1 +
 1 file changed, 1 insertion(+)
diff --git a/cpp/src/arrow/python/config.h b/cpp/src/arrow/python/config.h
index dd554e05b9379..c13272667540a 100644
--- a/cpp/src/arrow/python/config.h
+++ b/cpp/src/arrow/python/config.h
@@ -18,6 +18,7 @@
 #ifndef ARROW_PYTHON_CONFIG_H
 #define ARROW_PYTHON_CONFIG_H

+#include <Python.h>
 #include <datetime.h>

 #include "arrow/python/numpy_interop.h"
From edb8252c7534b787cb4dc0234080765e9bd6a045 Mon Sep 17 00:00:00 2001
From: "Uwe L. Korn"
Date: Sat, 15 Apr 2017 12:43:20 -0400
Subject: [PATCH 0519/1644] =?UTF-8?q?ARROW-829:=20Don't=20deactivate=20Par?=
 =?UTF-8?q?quet=20dictionary=20encoding=20on=20column-wis=E2=80=A6?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

…e compression

Change-Id: Icae5494babc7cbac2e1c3e405e440ff42b2b6ae5

Author: Uwe L. Korn

Closes #546 from xhochy/ARROW-829 and squashes the following commits:

7962877 [Uwe L. Korn] ARROW-829: Don't deactivate Parquet dictionary encoding on column-wise compression
---
 python/manylinux1/build_arrow.sh | 2 +-
 python/pyarrow/_parquet.pyx | 2 --
 2 files changed, 1 insertion(+), 3 deletions(-)
diff --git a/python/manylinux1/build_arrow.sh b/python/manylinux1/build_arrow.sh
index 8bc4e60235b49..3df322581b54c 100755
--- a/python/manylinux1/build_arrow.sh
+++ b/python/manylinux1/build_arrow.sh
@@ -72,7 +72,7 @@ for PYTHON in ${PYTHON_VERSIONS}; do
 echo "=== (${PYTHON}) Test the existence of optional modules ==="
 $PIPI_IO -r requirements.txt
 PATH="$PATH:$(cpython_path $PYTHON)/bin" $PYTHON_INTERPRETER -c "import pyarrow.parquet"
- PATH="$PATH:$(cpython_path $PYTHON)/bin" $PYTHON_INTERPRETER -c "import pyarrow.jemalloc"
+ PATH="$PATH:$(cpython_path $PYTHON)/bin" $PYTHON_INTERPRETER -c "import pyarrow._jemalloc"
 echo "=== (${PYTHON}) Tag the wheel with manylinux1 ==="
 mkdir -p repaired_wheels/
diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx
index b7358a6a47386..dafcdaff9bfee 100644
--- a/python/pyarrow/_parquet.pyx
+++ b/python/pyarrow/_parquet.pyx
@@ -539,8 +539,6 @@ cdef class ParquetWriter:
 check_compression_name(self.compression)
 props.compression(compression_from_name(self.compression))
 elif self.compression is not None:
- # Deactivate dictionary encoding by default
- props.disable_dictionary()
 for column, codec in self.compression.iteritems():
 check_compression_name(codec)
 props.compression(column, compression_from_name(codec))
From 0f9c88f71bc64ec3288e381c8a4edb48c696b182 Mon Sep 17 00:00:00 2001
From: Emilio Lahr-Vivaz
Date: Sat, 15 Apr 2017 17:15:07 -0400
Subject: [PATCH 0520/1644] ARROW-725: [Formats/Java] FixedSizeList message and java implementation

~Currently only added minor type for 2-tuples~

Author: Emilio Lahr-Vivaz

Closes #452 from elahrvivaz/ARROW-725 and squashes the following commits:

b139d3d [Emilio Lahr-Vivaz] adding reAlloc to FixedSizeListVector
229e24a [Emilio Lahr-Vivaz] re-ordering imports
594c0a2 [Emilio Lahr-Vivaz] simplifying writing of list vectors through mutator
7cb2324 [Emilio Lahr-Vivaz] reverting writer changes, adding examples of
writing fixed size list using vector mutators 756dc8a [Emilio Lahr-Vivaz] ARROW-725: [Formats/Java] FixedSizeList message and java implementation --- format/Schema.fbs | 8 +- .../src/main/codegen/data/ArrowTypes.tdd | 5 + .../main/codegen/templates/ComplexCopier.java | 2 + .../complex/BaseRepeatedValueVector.java | 6 +- .../vector/complex/FixedSizeListVector.java | 387 ++++++++++++++++++ .../arrow/vector/complex/ListVector.java | 18 +- .../vector/complex/NullableMapVector.java | 8 +- .../arrow/vector/complex/Positionable.java | 1 + .../vector/complex/PromotableVector.java | 32 ++ .../vector/complex/RepeatedValueVector.java | 6 +- .../complex/impl/AbstractBaseReader.java | 5 + .../complex/impl/AbstractBaseWriter.java | 5 + .../impl/UnionFixedSizeListReader.java | 103 +++++ .../arrow/vector/schema/TypeLayout.java | 8 + .../org/apache/arrow/vector/types/Types.java | 23 ++ .../vector/util/JsonStringArrayList.java | 8 + .../arrow/vector/TestFixedSizeListVector.java | 156 +++++++ .../arrow/vector/file/TestArrowFile.java | 69 +++- 18 files changed, 838 insertions(+), 12 deletions(-) create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/FixedSizeListVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/PromotableVector.java create mode 100644 java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionFixedSizeListReader.java create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/TestFixedSizeListVector.java diff --git a/format/Schema.fbs b/format/Schema.fbs index badc7ea8befbf..ff6119931dd34 100644 --- a/format/Schema.fbs +++ b/format/Schema.fbs @@ -39,6 +39,11 @@ table Struct_ { table List { } +table FixedSizeList { + /// Number of list items per value + listSize: int; +} + enum UnionMode:short { Sparse, Dense } /// A union is a complex type with children in Field @@ -159,7 +164,8 @@ union Type { List, Struct_, Union, - FixedSizeBinary + FixedSizeBinary, + FixedSizeList } /// ---------------------------------------------------------------------- diff --git a/java/vector/src/main/codegen/data/ArrowTypes.tdd b/java/vector/src/main/codegen/data/ArrowTypes.tdd index e1fb5e0619a9b..ce92c1333a501 100644 --- a/java/vector/src/main/codegen/data/ArrowTypes.tdd +++ b/java/vector/src/main/codegen/data/ArrowTypes.tdd @@ -27,6 +27,11 @@ fields: [], complex: true }, + { + name: "FixedSizeList", + fields: [{name: "listSize", type: int}], + complex: true + }, { name: "Union", fields: [{name: "mode", type: short, valueType: UnionMode}, {name: "typeIds", type: "int[]"}], diff --git a/java/vector/src/main/codegen/templates/ComplexCopier.java b/java/vector/src/main/codegen/templates/ComplexCopier.java index 0dffe5e30bea0..89368ce6e0b96 100644 --- a/java/vector/src/main/codegen/templates/ComplexCopier.java +++ b/java/vector/src/main/codegen/templates/ComplexCopier.java @@ -55,6 +55,8 @@ private static void writeValue(FieldReader reader, FieldWriter writer) { writer.endList(); } break; + case FIXED_SIZE_LIST: + throw new UnsupportedOperationException("Copy fixed size list"); case MAP: if (reader.isSet()) { writer.start(); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java index da221e33013d1..c9a9319c69154 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java @@ 
-213,12 +213,14 @@ public boolean isEmpty(int index) { public abstract class BaseRepeatedMutator extends BaseValueVector.BaseMutator implements RepeatedMutator { @Override - public void startNewValue(int index) { + public int startNewValue(int index) { while (offsets.getValueCapacity() <= index) { offsets.reAlloc(); } - offsets.getMutator().setSafe(index+1, offsets.getAccessor().get(index)); + int offset = offsets.getAccessor().get(index); + offsets.getMutator().setSafe(index+1, offset); setValueCount(index+1); + return offset; } @Override diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/FixedSizeListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/FixedSizeListVector.java new file mode 100644 index 0000000000000..7ac9f3bd5137f --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/FixedSizeListVector.java @@ -0,0 +1,387 @@ +/******************************************************************************* + + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ ******************************************************************************/ +package org.apache.arrow.vector.complex; + +import static java.util.Collections.singletonList; +import static org.apache.arrow.vector.complex.BaseRepeatedValueVector.DATA_VECTOR_NAME; + +import java.util.Collections; +import java.util.Iterator; +import java.util.List; +import java.util.Objects; + +import com.google.common.base.Preconditions; +import com.google.common.collect.ImmutableList; +import com.google.common.collect.ObjectArrays; + +import io.netty.buffer.ArrowBuf; +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.OutOfMemoryException; +import org.apache.arrow.vector.AddOrGetResult; +import org.apache.arrow.vector.BaseDataValueVector; +import org.apache.arrow.vector.BaseValueVector; +import org.apache.arrow.vector.BitVector; +import org.apache.arrow.vector.BufferBacked; +import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.ZeroVector; +import org.apache.arrow.vector.complex.impl.UnionFixedSizeListReader; +import org.apache.arrow.vector.schema.ArrowFieldNode; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.DictionaryEncoding; +import org.apache.arrow.vector.types.pojo.Field; +import org.apache.arrow.vector.types.pojo.FieldType; +import org.apache.arrow.vector.util.CallBack; +import org.apache.arrow.vector.util.JsonStringArrayList; +import org.apache.arrow.vector.util.SchemaChangeRuntimeException; +import org.apache.arrow.vector.util.TransferPair; + +public class FixedSizeListVector extends BaseValueVector implements FieldVector, PromotableVector { + + private FieldVector vector; + private final BitVector bits; + private final int listSize; + private final DictionaryEncoding dictionary; + private final List innerVectors; + + private UnionFixedSizeListReader reader; + + private Mutator mutator = new Mutator(); + private Accessor accessor = new Accessor(); + + public FixedSizeListVector(String name, + BufferAllocator allocator, + int listSize, + DictionaryEncoding dictionary, + CallBack schemaChangeCallback) { + super(name, allocator); + Preconditions.checkArgument(listSize > 0, "list size must be positive"); + this.bits = new BitVector("$bits$", allocator); + this.vector = ZeroVector.INSTANCE; + this.listSize = listSize; + this.dictionary = dictionary; + this.innerVectors = Collections.singletonList((BufferBacked) bits); + this.reader = new UnionFixedSizeListReader(this); + } + + @Override + public Field getField() { + List children = ImmutableList.of(getDataVector().getField()); + return new Field(name, true, new ArrowType.FixedSizeList(listSize), children); + } + + @Override + public MinorType getMinorType() { + return MinorType.FIXED_SIZE_LIST; + } + + public int getListSize() { + return listSize; + } + + @Override + public void initializeChildrenFromFields(List children) { + if (children.size() != 1) { + throw new IllegalArgumentException("Lists have only one child. 
Found: " + children); + } + Field field = children.get(0); + FieldType type = new FieldType(field.isNullable(), field.getType(), field.getDictionary()); + AddOrGetResult addOrGetVector = addOrGetVector(type); + if (!addOrGetVector.isCreated()) { + throw new IllegalArgumentException("Child vector already existed: " + addOrGetVector.getVector()); + } + addOrGetVector.getVector().initializeChildrenFromFields(field.getChildren()); + } + + @Override + public List getChildrenFromFields() { + return singletonList(vector); + } + + @Override + public void loadFieldBuffers(ArrowFieldNode fieldNode, List ownBuffers) { + BaseDataValueVector.load(fieldNode, innerVectors, ownBuffers); + } + + @Override + public List getFieldBuffers() { + return BaseDataValueVector.unload(innerVectors); + } + + @Override + public List getFieldInnerVectors() { + return innerVectors; + } + + @Override + public Accessor getAccessor() { + return accessor; + } + + @Override + public Mutator getMutator() { + return mutator; + } + + @Override + public UnionFixedSizeListReader getReader() { + return reader; + } + + @Override + public void allocateNew() throws OutOfMemoryException { + allocateNewSafe(); + } + + @Override + public boolean allocateNewSafe() { + /* boolean to keep track if all the memory allocation were successful + * Used in the case of composite vectors when we need to allocate multiple + * buffers for multiple vectors. If one of the allocations failed we need to + * clear all the memory that we allocated + */ + boolean success = false; + try { + success = bits.allocateNewSafe() && vector.allocateNewSafe(); + } finally { + if (!success) { + clear(); + } + } + if (success) { + bits.zeroVector(); + } + return success; + } + + @Override + public void reAlloc() { + bits.reAlloc(); + vector.reAlloc(); + } + + public FieldVector getDataVector() { + return vector; + } + + @Override + public void setInitialCapacity(int numRecords) { + bits.setInitialCapacity(numRecords); + vector.setInitialCapacity(numRecords * listSize); + } + + @Override + public int getValueCapacity() { + if (vector == ZeroVector.INSTANCE) { + return 0; + } + return vector.getValueCapacity() / listSize; + } + + @Override + public int getBufferSize() { + if (accessor.getValueCount() == 0) { + return 0; + } + return bits.getBufferSize() + vector.getBufferSize(); + } + + @Override + public int getBufferSizeFor(int valueCount) { + if (valueCount == 0) { + return 0; + } + return bits.getBufferSizeFor(valueCount) + vector.getBufferSizeFor(valueCount * listSize); + } + + @Override + public Iterator iterator() { + return Collections.singleton(vector).iterator(); + } + + @Override + public void clear() { + bits.clear(); + vector.clear(); + super.clear(); + } + + @Override + public ArrowBuf[] getBuffers(boolean clear) { + final ArrowBuf[] buffers = ObjectArrays.concat(bits.getBuffers(false), vector.getBuffers(false), ArrowBuf.class); + if (clear) { + for (ArrowBuf buffer: buffers) { + buffer.retain(); + } + clear(); + } + return buffers; + } + + /** + * Returns 1 if inner vector is explicitly set via #addOrGetVector else 0 + */ + public int size() { + return vector == ZeroVector.INSTANCE ? 
0 : 1; + } + + @Override + @SuppressWarnings("unchecked") + public AddOrGetResult addOrGetVector(FieldType type) { + boolean created = false; + if (vector instanceof ZeroVector) { + vector = type.createNewSingleVector(DATA_VECTOR_NAME, allocator, null); + this.reader = new UnionFixedSizeListReader(this); + created = true; + } + // returned vector must have the same field + if (!Objects.equals(vector.getField().getType(), type.getType())) { + final String msg = String.format("Inner vector type mismatch. Requested type: [%s], actual type: [%s]", + type.getType(), vector.getField().getType()); + throw new SchemaChangeRuntimeException(msg); + } + + return new AddOrGetResult<>((T) vector, created); + } + + public void copyFromSafe(int inIndex, int outIndex, FixedSizeListVector from) { + copyFrom(inIndex, outIndex, from); + } + + public void copyFrom(int inIndex, int outIndex, FixedSizeListVector from) { + throw new UnsupportedOperationException("FixedSizeListVector.copyFrom"); + } + + @Override + public UnionVector promoteToUnion() { + UnionVector vector = new UnionVector(name, allocator, null); + this.vector.clear(); + this.vector = vector; + this.reader = new UnionFixedSizeListReader(this); + return vector; + } + + public class Accessor extends BaseValueVector.BaseAccessor { + + @Override + public Object getObject(int index) { + if (isNull(index)) { + return null; + } + final List vals = new JsonStringArrayList<>(listSize); + final ValueVector.Accessor valuesAccessor = vector.getAccessor(); + for(int i = 0; i < listSize; i++) { + vals.add(valuesAccessor.getObject(index * listSize + i)); + } + return vals; + } + + @Override + public boolean isNull(int index) { + return bits.getAccessor().get(index) == 0; + } + + @Override + public int getNullCount() { + return bits.getAccessor().getNullCount(); + } + + @Override + public int getValueCount() { + return bits.getAccessor().getValueCount(); + } + } + + public class Mutator extends BaseValueVector.BaseMutator { + + public void setNull(int index) { + bits.getMutator().setSafe(index, 0); + } + + public void setNotNull(int index) { + bits.getMutator().setSafe(index, 1); + } + + @Override + public void setValueCount(int valueCount) { + bits.getMutator().setValueCount(valueCount); + vector.getMutator().setValueCount(valueCount * listSize); + } + } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator) { + return getTransferPair(ref, allocator, null); + } + + @Override + public TransferPair getTransferPair(String ref, BufferAllocator allocator, CallBack callBack) { + return new TransferImpl(ref, allocator, callBack); + } + + @Override + public TransferPair makeTransferPair(ValueVector target) { + return new TransferImpl((FixedSizeListVector) target); + } + + private class TransferImpl implements TransferPair { + + FixedSizeListVector to; + TransferPair pairs[] = new TransferPair[2]; + + public TransferImpl(String name, BufferAllocator allocator, CallBack callBack) { + this(new FixedSizeListVector(name, allocator, listSize, dictionary, callBack)); + } + + public TransferImpl(FixedSizeListVector to) { + this.to = to; + Field field = vector.getField(); + FieldType type = new FieldType(field.isNullable(), field.getType(), field.getDictionary()); + to.addOrGetVector(type); + pairs[0] = bits.makeTransferPair(to.bits); + pairs[1] = getDataVector().makeTransferPair(to.getDataVector()); + } + + @Override + public void transfer() { + for (TransferPair pair : pairs) { + pair.transfer(); + } + } + + @Override + public void 
splitAndTransfer(int startIndex, int length) { + to.allocateNew(); + for (int i = 0; i < length; i++) { + copyValueSafe(startIndex + i, i); + } + } + + @Override + public ValueVector getTo() { + return to; + } + + @Override + public void copyValueSafe(int from, int to) { + this.to.copyFrom(from, to, FixedSizeListVector.this); + } + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java index 63235dfda87df..9392afbccdaa8 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/ListVector.java @@ -53,7 +53,7 @@ import io.netty.buffer.ArrowBuf; -public class ListVector extends BaseRepeatedValueVector implements FieldVector { +public class ListVector extends BaseRepeatedValueVector implements FieldVector, PromotableVector { final UInt4Vector offsets; final BitVector bits; @@ -220,7 +220,7 @@ public Mutator getMutator() { } @Override - public FieldReader getReader() { + public UnionListReader getReader() { return reader; } @@ -297,6 +297,7 @@ public ArrowBuf[] getBuffers(boolean clear) { return buffers; } + @Override public UnionVector promoteToUnion() { UnionVector vector = new UnionVector(name, allocator, callBack); replaceDataVector(vector); @@ -345,12 +346,23 @@ public void setNotNull(int index) { } @Override - public void startNewValue(int index) { + public int startNewValue(int index) { for (int i = lastSet; i <= index; i++) { offsets.getMutator().setSafe(i + 1, offsets.getAccessor().get(i)); } setNotNull(index); lastSet = index + 1; + return offsets.getAccessor().get(lastSet); + } + + /** + * End the current value + * + * @param index index of the value to end + * @param size number of elements in the list that was written + */ + public void endValue(int index, int size) { + offsets.getMutator().set(index + 1, offsets.getAccessor().get(index + 1) + size); } @Override diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java index 647ab28352f0d..6456efba0dcb4 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/NullableMapVector.java @@ -31,6 +31,7 @@ import org.apache.arrow.vector.NullableVectorDefinitionSetter; import org.apache.arrow.vector.ValueVector; import org.apache.arrow.vector.complex.impl.NullableMapReaderImpl; +import org.apache.arrow.vector.complex.impl.NullableMapWriter; import org.apache.arrow.vector.complex.reader.FieldReader; import org.apache.arrow.vector.holders.ComplexHolder; import org.apache.arrow.vector.schema.ArrowFieldNode; @@ -45,6 +46,7 @@ public class NullableMapVector extends MapVector implements FieldVector { private final NullableMapReaderImpl reader = new NullableMapReaderImpl(this); + private final NullableMapWriter writer = new NullableMapWriter(this); protected final BitVector bits; @@ -84,10 +86,14 @@ public List getFieldInnerVectors() { } @Override - public FieldReader getReader() { + public NullableMapReaderImpl getReader() { return reader; } + public NullableMapWriter getWriter() { + return writer; + } + @Override public TransferPair getTransferPair(BufferAllocator allocator) { return new NullableMapTransferPair(this, new NullableMapVector(name, allocator, dictionary, null), false); diff --git 
a/java/vector/src/main/java/org/apache/arrow/vector/complex/Positionable.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/Positionable.java index 93451181ca949..e1a4f36296987 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/Positionable.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/Positionable.java @@ -18,5 +18,6 @@ package org.apache.arrow.vector.complex; public interface Positionable { + public int getPosition(); public void setPosition(int index); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/PromotableVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/PromotableVector.java new file mode 100644 index 0000000000000..8b528b4ccab9b --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/PromotableVector.java @@ -0,0 +1,32 @@ +/******************************************************************************* + + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + ******************************************************************************/ +package org.apache.arrow.vector.complex; + +import org.apache.arrow.vector.AddOrGetResult; +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.DictionaryEncoding; +import org.apache.arrow.vector.types.pojo.FieldType; + +public interface PromotableVector { + + AddOrGetResult addOrGetVector(FieldType type); + + UnionVector promoteToUnion(); +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedValueVector.java index 54db393e8310d..b01a4e7cf49d4 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedValueVector.java @@ -73,13 +73,13 @@ interface RepeatedAccessor extends ValueVector.Accessor { } interface RepeatedMutator extends ValueVector.Mutator { + /** * Starts a new value that is a container of cells. 
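+   * The offset returned points at the first cell of the new value in the underlying data vector, so callers can write elements there directly.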
* * @param index index of new value to start + * @return index into the child vector */ - void startNewValue(int index); - - + int startNewValue(int index); } } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseReader.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseReader.java index e7c3c8c7e4b42..7c73c27ecff41 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseReader.java @@ -35,6 +35,11 @@ public AbstractBaseReader() { super(); } + @Override + public int getPosition() { + return index; + } + public void setPosition(int index){ this.index = index; } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseWriter.java index e6cf098f16f59..13a0a6bd9e28f 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/AbstractBaseWriter.java @@ -34,6 +34,11 @@ int idx() { return index; } + @Override + public int getPosition() { + return index; + } + @Override public void setPosition(int index) { this.index = index; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionFixedSizeListReader.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionFixedSizeListReader.java new file mode 100644 index 0000000000000..515d4ab8ce907 --- /dev/null +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/impl/UnionFixedSizeListReader.java @@ -0,0 +1,103 @@ +/******************************************************************************* + + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ ******************************************************************************/ +package org.apache.arrow.vector.complex.impl; + +import org.apache.arrow.vector.ValueVector; +import org.apache.arrow.vector.complex.FixedSizeListVector; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.complex.writer.BaseWriter.ListWriter; +import org.apache.arrow.vector.complex.writer.FieldWriter; +import org.apache.arrow.vector.holders.UnionHolder; +import org.apache.arrow.vector.types.Types.MinorType; + +/** + * Reader for fixed size list vectors + */ +public class UnionFixedSizeListReader extends AbstractFieldReader { + + private final FixedSizeListVector vector; + private final ValueVector data; + private final int listSize; + + private int currentOffset; + + public UnionFixedSizeListReader(FixedSizeListVector vector) { + this.vector = vector; + this.data = vector.getDataVector(); + this.listSize = vector.getListSize(); + } + + @Override + public boolean isSet() { + return !vector.getAccessor().isNull(idx()); + } + + @Override + public FieldReader reader() { + return data.getReader(); + } + + @Override + public Object readObject() { + return vector.getAccessor().getObject(idx()); + } + + @Override + public MinorType getMinorType() { + return vector.getMinorType(); + } + + @Override + public void setPosition(int index) { + super.setPosition(index); + data.getReader().setPosition(index * listSize); + currentOffset = 0; + } + + @Override + public void read(int index, UnionHolder holder) { + setPosition(idx()); + for (int i = -1; i < index; i++) { + if (!next()) { + throw new IndexOutOfBoundsException("Requested " + index + ", size " + listSize); + } + } + holder.reader = data.getReader(); + holder.isSet = vector.getAccessor().isNull(idx()) ? 
0 : 1; + } + + @Override + public int size() { + return listSize; + } + + @Override + public boolean next() { + if (currentOffset < listSize) { + data.getReader().setPosition(idx() * listSize + currentOffset++); + return true; + } else { + return false; + } + } + + public void copyAsValue(ListWriter writer) { + ComplexCopier.copy(this, (FieldWriter) writer); + } +} diff --git a/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java b/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java index 69d550fc9f799..24840ec988ac3 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/schema/TypeLayout.java @@ -35,6 +35,7 @@ import org.apache.arrow.vector.types.pojo.ArrowType.Bool; import org.apache.arrow.vector.types.pojo.ArrowType.Date; import org.apache.arrow.vector.types.pojo.ArrowType.Decimal; +import org.apache.arrow.vector.types.pojo.ArrowType.FixedSizeList; import org.apache.arrow.vector.types.pojo.ArrowType.FloatingPoint; import org.apache.arrow.vector.types.pojo.ArrowType.Int; import org.apache.arrow.vector.types.pojo.ArrowType.Interval; @@ -105,6 +106,13 @@ public static TypeLayout getTypeLayout(final ArrowType arrowType) { return new TypeLayout(vectors); } + @Override public TypeLayout visit(FixedSizeList type) { + List vectors = asList( + validityVector() + ); + return new TypeLayout(vectors); + } + @Override public TypeLayout visit(FloatingPoint type) { int bitWidth; switch (type.getPrecision()) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java index b0455fa14e44c..6023f1c9500e7 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -51,6 +51,7 @@ import org.apache.arrow.vector.NullableVarCharVector; import org.apache.arrow.vector.ValueVector; import org.apache.arrow.vector.ZeroVector; +import org.apache.arrow.vector.complex.FixedSizeListVector; import org.apache.arrow.vector.complex.ListVector; import org.apache.arrow.vector.complex.NullableMapVector; import org.apache.arrow.vector.complex.UnionVector; @@ -90,6 +91,7 @@ import org.apache.arrow.vector.types.pojo.ArrowType.Bool; import org.apache.arrow.vector.types.pojo.ArrowType.Date; import org.apache.arrow.vector.types.pojo.ArrowType.Decimal; +import org.apache.arrow.vector.types.pojo.ArrowType.FixedSizeList; import org.apache.arrow.vector.types.pojo.ArrowType.FloatingPoint; import org.apache.arrow.vector.types.pojo.ArrowType.Int; import org.apache.arrow.vector.types.pojo.ArrowType.Interval; @@ -436,6 +438,23 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { return new UnionListWriter((ListVector) vector); } }, + FIXED_SIZE_LIST(null) { + @Override + public ArrowType getType() { + throw new UnsupportedOperationException("Cannot get simple type for FixedSizeList type"); + } + + @Override + public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { + int size = ((FixedSizeList)fieldType.getType()).getListSize(); + return new FixedSizeListVector(name, allocator, size, fieldType.getDictionary(), schemaChangeCallback); + } + + @Override + public FieldWriter getNewFieldWriter(ValueVector vector) { + throw new UnsupportedOperationException("FieldWriter not implemented for FixedSizeList type"); + } + }, UNION(new Union(Sparse, null)) { 
@Override public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { @@ -480,6 +499,10 @@ public static MinorType getMinorTypeForArrowType(ArrowType arrowType) { return MinorType.LIST; } + @Override public MinorType visit(FixedSizeList type) { + return MinorType.FIXED_SIZE_LIST; + } + @Override public MinorType visit(Union type) { return MinorType.UNION; } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/JsonStringArrayList.java b/java/vector/src/main/java/org/apache/arrow/vector/util/JsonStringArrayList.java index 6291bfeaee666..c598069c2c309 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/util/JsonStringArrayList.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/JsonStringArrayList.java @@ -31,6 +31,14 @@ public class JsonStringArrayList extends ArrayList { mapper = new ObjectMapper(); } + public JsonStringArrayList() { + super(); + } + + public JsonStringArrayList(int size) { + super(size); + } + @Override public boolean equals(Object obj) { if (this == obj) { diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestFixedSizeListVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestFixedSizeListVector.java new file mode 100644 index 0000000000000..cfb7b3d2a26ac --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestFixedSizeListVector.java @@ -0,0 +1,156 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.arrow.vector; + +import com.google.common.collect.Lists; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.vector.complex.FixedSizeListVector; +import org.apache.arrow.vector.complex.ListVector; +import org.apache.arrow.vector.complex.impl.UnionFixedSizeListReader; +import org.apache.arrow.vector.complex.impl.UnionListReader; +import org.apache.arrow.vector.complex.reader.FieldReader; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.ArrowType; +import org.apache.arrow.vector.types.pojo.FieldType; +import org.junit.After; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Test; + +public class TestFixedSizeListVector { + + private BufferAllocator allocator; + + @Before + public void init() { + allocator = new DirtyRootAllocator(Long.MAX_VALUE, (byte) 100); + } + + @After + public void terminate() throws Exception { + allocator.close(); + } + + @Test + public void testIntType() { + try (FixedSizeListVector vector = new FixedSizeListVector("list", allocator, 2, null, null)) { + NullableIntVector nested = (NullableIntVector) vector.addOrGetVector(FieldType.nullable(MinorType.INT.getType())).getVector(); + NullableIntVector.Mutator mutator = nested.getMutator(); + vector.allocateNew(); + + for (int i = 0; i < 10; i++) { + vector.getMutator().setNotNull(i); + mutator.set(i * 2, i); + mutator.set(i * 2 + 1, i + 10); + } + vector.getMutator().setValueCount(10); + + UnionFixedSizeListReader reader = vector.getReader(); + for (int i = 0; i < 10; i++) { + reader.setPosition(i); + Assert.assertTrue(reader.isSet()); + Assert.assertTrue(reader.next()); + Assert.assertEquals(i, reader.reader().readInteger().intValue()); + Assert.assertTrue(reader.next()); + Assert.assertEquals(i + 10, reader.reader().readInteger().intValue()); + Assert.assertFalse(reader.next()); + Assert.assertEquals(Lists.newArrayList(i, i + 10), reader.readObject()); + } + } + } + + @Test + public void testFloatTypeNullable() { + try (FixedSizeListVector vector = new FixedSizeListVector("list", allocator, 2, null, null)) { + NullableFloat4Vector nested = (NullableFloat4Vector) vector.addOrGetVector(FieldType.nullable(MinorType.FLOAT4.getType())).getVector(); + NullableFloat4Vector.Mutator mutator = nested.getMutator(); + vector.allocateNew(); + + for (int i = 0; i < 10; i++) { + if (i % 2 == 0) { + vector.getMutator().setNotNull(i); + mutator.set(i * 2, i + 0.1f); + mutator.set(i * 2 + 1, i + 10.1f); + } + } + vector.getMutator().setValueCount(10); + + UnionFixedSizeListReader reader = vector.getReader(); + for (int i = 0; i < 10; i++) { + reader.setPosition(i); + if (i % 2 == 0) { + Assert.assertTrue(reader.isSet()); + Assert.assertTrue(reader.next()); + Assert.assertEquals(i + 0.1f, reader.reader().readFloat(), 0.00001); + Assert.assertTrue(reader.next()); + Assert.assertEquals(i + 10.1f, reader.reader().readFloat(), 0.00001); + Assert.assertFalse(reader.next()); + Assert.assertEquals(Lists.newArrayList(i + 0.1f, i + 10.1f), reader.readObject()); + } else { + Assert.assertFalse(reader.isSet()); + Assert.assertNull(reader.readObject()); + } + } + } + } + + @Test + public void testNestedInList() { + try (ListVector vector = new ListVector("list", allocator, null, null)) { + ListVector.Mutator mutator = vector.getMutator(); + FixedSizeListVector tuples = (FixedSizeListVector) vector.addOrGetVector(FieldType.nullable(new ArrowType.FixedSizeList(2))).getVector(); + FixedSizeListVector.Mutator tupleMutator = 
tuples.getMutator(); + NullableIntVector.Mutator innerMutator = (NullableIntVector.Mutator) tuples.addOrGetVector(FieldType.nullable(MinorType.INT.getType())).getVector().getMutator(); + vector.allocateNew(); + + for (int i = 0; i < 10; i++) { + if (i % 2 == 0) { + int position = mutator.startNewValue(i); + for (int j = 0; j < i % 7; j++) { + tupleMutator.setNotNull(position + j); + innerMutator.set((position + j) * 2, j); + innerMutator.set((position + j) * 2 + 1, j + 1); + } + mutator.endValue(i, i % 7); + } + } + mutator.setValueCount(10); + + UnionListReader reader = vector.getReader(); + for (int i = 0; i < 10; i++) { + reader.setPosition(i); + if (i % 2 == 0) { + for (int j = 0; j < i % 7; j++) { + Assert.assertTrue(reader.next()); + FieldReader innerListReader = reader.reader(); + for (int k = 0; k < 2; k++) { + Assert.assertTrue(innerListReader.next()); + Assert.assertEquals(k + j, innerListReader.reader().readInteger().intValue()); + } + Assert.assertFalse(innerListReader.next()); + } + Assert.assertFalse(reader.next()); + } else { + Assert.assertFalse(reader.isSet()); + Assert.assertNull(reader.readObject()); + } + } + } + } +} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java index 11730afd55406..3bed45361fc20 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/TestArrowFile.java @@ -30,11 +30,17 @@ import java.util.Arrays; import java.util.List; +import com.google.common.collect.ImmutableList; +import com.google.common.collect.Lists; + import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.FieldVector; +import org.apache.arrow.vector.NullableFloat4Vector; +import org.apache.arrow.vector.NullableIntVector; import org.apache.arrow.vector.NullableTinyIntVector; import org.apache.arrow.vector.NullableVarCharVector; import org.apache.arrow.vector.VectorSchemaRoot; +import org.apache.arrow.vector.complex.FixedSizeListVector; import org.apache.arrow.vector.complex.ListVector; import org.apache.arrow.vector.complex.MapVector; import org.apache.arrow.vector.complex.NullableMapVector; @@ -49,6 +55,8 @@ import org.apache.arrow.vector.stream.ArrowStreamReader; import org.apache.arrow.vector.stream.ArrowStreamWriter; import org.apache.arrow.vector.stream.MessageSerializerTest; +import org.apache.arrow.vector.types.Types.MinorType; +import org.apache.arrow.vector.types.pojo.ArrowType.FixedSizeList; import org.apache.arrow.vector.types.pojo.ArrowType.Int; import org.apache.arrow.vector.types.pojo.DictionaryEncoding; import org.apache.arrow.vector.types.pojo.Field; @@ -60,8 +68,6 @@ import org.slf4j.Logger; import org.slf4j.LoggerFactory; -import com.google.common.collect.ImmutableList; - public class TestArrowFile extends BaseFileTest { private static final Logger LOGGER = LoggerFactory.getLogger(TestArrowFile.class); @@ -576,6 +582,65 @@ private void validateNestedDictionary(ListVector vector, DictionaryProvider prov Assert.assertEquals(new Text("bar"), dictionaryAccessor.getObject(1)); } + @Test + public void testWriteReadFixedSizeList() throws IOException { + File file = new File("target/mytest_fixed_list.arrow"); + ByteArrayOutputStream stream = new ByteArrayOutputStream(); + int count = COUNT; + + // write + try (BufferAllocator originalVectorAllocator = allocator.newChildAllocator("original vectors", 0, Integer.MAX_VALUE); + NullableMapVector parent = new 
NullableMapVector("parent", originalVectorAllocator, null, null)) { + FixedSizeListVector tuples = parent.addOrGet("float-pairs", new FieldType(true, new FixedSizeList(2), null), FixedSizeListVector.class); + NullableFloat4Vector floats = (NullableFloat4Vector) tuples.addOrGetVector(new FieldType(true, MinorType.FLOAT4.getType(), null)).getVector(); + NullableIntVector ints = parent.addOrGet("ints", new FieldType(true, new Int(32, true), null), NullableIntVector.class); + parent.allocateNew(); + + for (int i = 0; i < 10; i++) { + tuples.getMutator().setNotNull(i); + floats.getMutator().set(i * 2, i + 0.1f); + floats.getMutator().set(i * 2 + 1, i + 10.1f); + ints.getMutator().set(i, i); + } + + parent.getMutator().setValueCount(10); + write(parent, file, stream); + } + + // read + try (BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + FileInputStream fileInputStream = new FileInputStream(file); + ArrowFileReader arrowReader = new ArrowFileReader(fileInputStream.getChannel(), readerAllocator)) { + VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); + Schema schema = root.getSchema(); + LOGGER.debug("reading schema: " + schema); + + for (ArrowBlock rbBlock : arrowReader.getRecordBlocks()) { + arrowReader.loadRecordBatch(rbBlock); + Assert.assertEquals(count, root.getRowCount()); + for (int i = 0; i < 10; i++) { + Assert.assertEquals(Lists.newArrayList(i + 0.1f, i + 10.1f), root.getVector("float-pairs").getAccessor().getObject(i)); + Assert.assertEquals(i, root.getVector("ints").getAccessor().getObject(i)); + } + } + } + + // read from stream + try (BufferAllocator readerAllocator = allocator.newChildAllocator("reader", 0, Integer.MAX_VALUE); + ByteArrayInputStream input = new ByteArrayInputStream(stream.toByteArray()); + ArrowStreamReader arrowReader = new ArrowStreamReader(input, readerAllocator)) { + VectorSchemaRoot root = arrowReader.getVectorSchemaRoot(); + Schema schema = root.getSchema(); + LOGGER.debug("reading schema: " + schema); + arrowReader.loadNextBatch(); + Assert.assertEquals(count, root.getRowCount()); + for (int i = 0; i < 10; i++) { + Assert.assertEquals(Lists.newArrayList(i + 0.1f, i + 10.1f), root.getVector("float-pairs").getAccessor().getObject(i)); + Assert.assertEquals(i, root.getVector("ints").getAccessor().getObject(i)); + } + } + } + /** * Writes the contents of parents to file. If outStream is non-null, also writes it * to outStream in the streaming serialized format. 
From 30e03a90718971c2a1d773145fb042d0c2857036 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Sat, 15 Apr 2017 18:26:19 -0400 Subject: [PATCH 0521/1644] =?UTF-8?q?ARROW-703:=20Fix=20issue=20where=20se?= =?UTF-8?q?tValueCount(0)=20doesn=E2=80=99t=20work=20in=20the=20case=20tha?= =?UTF-8?q?t=20we=E2=80=99ve=20shipped=20vectors=20across=20the=20wire?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Author: Julien Le Dem Closes #428 from julienledem/arrow_703 and squashes the following commits: 72b0f79 [Julien Le Dem] ARROW-703: Fix issue where setValueCount(0) doesn’t work in the case that we’ve shipped vectors across the wire --- .../templates/VariableLengthVectors.java | 23 +++++++++++-------- 1 file changed, 14 insertions(+), 9 deletions(-) diff --git a/java/vector/src/main/codegen/templates/VariableLengthVectors.java b/java/vector/src/main/codegen/templates/VariableLengthVectors.java index bcd639ab8c30c..4a460c5475323 100644 --- a/java/vector/src/main/codegen/templates/VariableLengthVectors.java +++ b/java/vector/src/main/codegen/templates/VariableLengthVectors.java @@ -613,16 +613,21 @@ protected void set(int index, ${minor.class}Holder holder){ @Override public void setValueCount(int valueCount) { - final int currentByteCapacity = getByteCapacity(); - final int idx = offsetVector.getAccessor().get(valueCount); - data.writerIndex(idx); - if (valueCount > 0 && currentByteCapacity > idx * 2) { - incrementAllocationMonitor(); - } else if (allocationMonitor > 0) { - allocationMonitor = 0; + if (valueCount == 0) { + // if no values in vector, don't try to retrieve the current value count. + offsetVector.getMutator().setValueCount(0); + } else { + final int currentByteCapacity = getByteCapacity(); + final int idx = offsetVector.getAccessor().get(valueCount); + data.writerIndex(idx); + if (currentByteCapacity > idx * 2) { + incrementAllocationMonitor(); + } else if (allocationMonitor > 0) { + allocationMonitor = 0; + } + VectorTrimmer.trim(data, idx); + offsetVector.getMutator().setValueCount(valueCount+1); } - VectorTrimmer.trim(data, idx); - offsetVector.getMutator().setValueCount(valueCount == 0 ? 
0 : valueCount+1); } @Override From ee5cb2ad171f0f4c7673f2937dc226d62aad972c Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 16 Apr 2017 09:28:34 -0400 Subject: [PATCH 0522/1644] ARROW-830: [Python] Expose jemalloc memory pool and other memory pool functions in public pyarrow API Author: Wes McKinney Closes #550 from wesm/ARROW-830 and squashes the following commits: c1ca9fb [Wes McKinney] Expose jemalloc memory pool and other memory pool functions in public pyarrow API --- python/README.md | 2 +- python/doc/source/api.rst | 12 ++++++++++ python/doc/source/jemalloc.rst | 8 ++----- python/pyarrow/__init__.py | 13 +++++++++- python/pyarrow/_memory.pyx | 12 +++++++--- python/pyarrow/tests/test_jemalloc.py | 34 +++++++++++++++------------ 6 files changed, 55 insertions(+), 26 deletions(-) diff --git a/python/README.md b/python/README.md index 25a3a67b83b03..ed008ea975d21 100644 --- a/python/README.md +++ b/python/README.md @@ -89,7 +89,7 @@ export PYARROW_CMAKE_OPTIONS=-DPYARROW_BUILD_PARQUET=on ```bash pip install -r doc/requirements.txt -python setup.py build_sphinx +python setup.py build_sphinx -s doc/source ``` [1]: https://github.com/apache/parquet-cpp \ No newline at end of file diff --git a/python/doc/source/api.rst b/python/doc/source/api.rst index 514dcf966f8cc..801ab34126c7c 100644 --- a/python/doc/source/api.rst +++ b/python/doc/source/api.rst @@ -151,3 +151,15 @@ Interprocess Communication and Messaging FileWriter StreamReader StreamWriter + +Memory Pools +------------ + +.. autosummary:: + :toctree: generated/ + + MemoryPool + default_memory_pool + jemalloc_memory_pool + total_allocated_bytes + set_memory_pool diff --git a/python/doc/source/jemalloc.rst b/python/doc/source/jemalloc.rst index 33fe61729c1e9..8d7a5dc4a82ec 100644 --- a/python/doc/source/jemalloc.rst +++ b/python/doc/source/jemalloc.rst @@ -35,18 +35,14 @@ operations. .. code:: python import pyarrow as pa - import pyarrow.jemalloc - import pyarrow.memory - jemalloc_pool = pyarrow.jemalloc.default_pool() + jemalloc_pool = pyarrow.jemalloc_memory_pool() # Explicitly use jemalloc for allocating memory for an Arrow Table object array = pa.Array.from_pylist([1, 2, 3], memory_pool=jemalloc_pool) # Set the global pool - pyarrow.memory.set_default_pool(jemalloc_pool) + pyarrow.set_memory_pool(jemalloc_pool) # This operation has no explicit MemoryPool specified and will thus will # also use jemalloc for its allocations. 
array = pa.Array.from_pylist([1, 2, 3]) - - diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 66bde4933ee2d..506d567b0c508 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -62,7 +62,8 @@ memory_map, create_memory_map, get_record_batch_size, get_tensor_size) -from pyarrow._memory import MemoryPool, total_allocated_bytes +from pyarrow._memory import (MemoryPool, total_allocated_bytes, + set_memory_pool, default_memory_pool) from pyarrow._table import Column, RecordBatch, Table, concat_tables from pyarrow._error import (ArrowException, ArrowKeyError, @@ -72,6 +73,16 @@ ArrowNotImplementedError, ArrowTypeError) + +def jemalloc_memory_pool(): + """ + Returns a jemalloc-based memory allocator, which can be passed to + pyarrow.set_memory_pool + """ + from pyarrow._jemalloc import default_pool + return default_pool() + + from pyarrow.filesystem import Filesystem, HdfsClient, LocalFilesystem from pyarrow.ipc import FileReader, FileWriter, StreamReader, StreamWriter diff --git a/python/pyarrow/_memory.pyx b/python/pyarrow/_memory.pyx index 98dbf66c8e0af..8b73a17553edf 100644 --- a/python/pyarrow/_memory.pyx +++ b/python/pyarrow/_memory.pyx @@ -22,6 +22,7 @@ from pyarrow.includes.libarrow cimport CMemoryPool, CLoggingMemoryPool from pyarrow.includes.pyarrow cimport set_default_memory_pool, get_memory_pool + cdef class MemoryPool: cdef init(self, CMemoryPool* pool): self.pool = pool @@ -29,24 +30,29 @@ cdef class MemoryPool: def bytes_allocated(self): return self.pool.bytes_allocated() + cdef CMemoryPool* maybe_unbox_memory_pool(MemoryPool memory_pool): if memory_pool is None: return get_memory_pool() else: return memory_pool.pool + cdef class LoggingMemoryPool(MemoryPool): pass -def default_pool(): - cdef: + +def default_memory_pool(): + cdef: MemoryPool pool = MemoryPool() pool.init(get_memory_pool()) return pool -def set_default_pool(MemoryPool pool): + +def set_memory_pool(MemoryPool pool): set_default_memory_pool(pool.pool) + def total_allocated_bytes(): cdef CMemoryPool* pool = get_memory_pool() return pool.bytes_allocated() diff --git a/python/pyarrow/tests/test_jemalloc.py b/python/pyarrow/tests/test_jemalloc.py index c6cc2cc34a08b..0a4d8a63ad2d2 100644 --- a/python/pyarrow/tests/test_jemalloc.py +++ b/python/pyarrow/tests/test_jemalloc.py @@ -18,12 +18,16 @@ import gc import pytest +import pyarrow as pa + + try: - import pyarrow.jemalloc + pa.jemalloc_memory_pool() HAVE_JEMALLOC = True except ImportError: HAVE_JEMALLOC = False + jemalloc = pytest.mark.skipif(not HAVE_JEMALLOC, reason='jemalloc support not built') @@ -31,33 +35,33 @@ @jemalloc def test_different_memory_pool(): gc.collect() - bytes_before_default = pyarrow.total_allocated_bytes() - bytes_before_jemalloc = pyarrow.jemalloc.default_pool().bytes_allocated() + bytes_before_default = pa.total_allocated_bytes() + bytes_before_jemalloc = pa.jemalloc_memory_pool().bytes_allocated() # it works - array = pyarrow.from_pylist([1, None, 3, None], # noqa - memory_pool=pyarrow.jemalloc.default_pool()) + array = pa.from_pylist([1, None, 3, None], # noqa + memory_pool=pa.jemalloc_memory_pool()) gc.collect() - assert pyarrow.total_allocated_bytes() == bytes_before_default - assert (pyarrow.jemalloc.default_pool().bytes_allocated() > + assert pa.total_allocated_bytes() == bytes_before_default + assert (pa.jemalloc_memory_pool().bytes_allocated() > bytes_before_jemalloc) @jemalloc def test_default_memory_pool(): gc.collect() - bytes_before_default = pyarrow.total_allocated_bytes() - 
bytes_before_jemalloc = pyarrow.jemalloc.default_pool().bytes_allocated() + bytes_before_default = pa.total_allocated_bytes() + bytes_before_jemalloc = pa.jemalloc_memory_pool().bytes_allocated() - old_memory_pool = pyarrow.memory.default_pool() - pyarrow.memory.set_default_pool(pyarrow.jemalloc.default_pool()) + old_memory_pool = pa.default_memory_pool() + pa.set_memory_pool(pa.jemalloc_memory_pool()) - array = pyarrow.from_pylist([1, None, 3, None]) # noqa + array = pa.from_pylist([1, None, 3, None]) # noqa - pyarrow.memory.set_default_pool(old_memory_pool) + pa.set_memory_pool(old_memory_pool) gc.collect() - assert pyarrow.total_allocated_bytes() == bytes_before_default + assert pa.total_allocated_bytes() == bytes_before_default - assert (pyarrow.jemalloc.default_pool().bytes_allocated() > + assert (pa.jemalloc_memory_pool().bytes_allocated() > bytes_before_jemalloc) From dad1a8ee3810d1584b96a5324f0d84215cd48216 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 16 Apr 2017 09:29:15 -0400 Subject: [PATCH 0523/1644] ARROW-832: [C++] Update to gtest 1.8.0, remove now unneeded test_main.cc I haven't tried this out on MSVC yet. Also includes .gitignore fix for ARROW-821 Author: Wes McKinney Closes #549 from wesm/ARROW-832 and squashes the following commits: 2f246a0 [Wes McKinney] Remove unused CMake variable 7a62cf4 [Wes McKinney] Small fix when ARROW_BUILD_BENCHMARKS=off 8eaa318 [Wes McKinney] Add dependency on gtest for benchmarks 5f692db [Wes McKinney] Update to gtest 1.8.0, remove now unneeded test_main.cc --- cpp/CMakeLists.txt | 42 ++++++++++++++++++------------- cpp/src/arrow/util/CMakeLists.txt | 25 +++--------------- cpp/src/arrow/util/test_main.cc | 26 ------------------- python/.gitignore | 2 +- 4 files changed, 29 insertions(+), 66 deletions(-) delete mode 100644 cpp/src/arrow/util/test_main.cc diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 83610d33e6af1..08120e9ea68a5 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -43,7 +43,7 @@ set(BUILD_SUPPORT_DIR "${CMAKE_SOURCE_DIR}/build-support") set(THIRDPARTY_DIR "${CMAKE_SOURCE_DIR}/thirdparty") set(GFLAGS_VERSION "2.1.2") -set(GTEST_VERSION "1.7.0") +set(GTEST_VERSION "1.8.0") set(GBENCHMARK_VERSION "1.1.0") set(FLATBUFFERS_VERSION "1.6.0") set(JEMALLOC_VERSION "4.4.0") @@ -458,7 +458,7 @@ include_directories(SYSTEM ${Boost_INCLUDE_DIR}) # ---------------------------------------------------------------------- # Enable / disable tests and benchmarks -if(ARROW_BUILD_TESTS) +if(ARROW_BUILD_TESTS OR ARROW_BUILD_BENCHMARKS) add_custom_target(unittest ctest -L unittest) if("$ENV{GTEST_HOME}" STREQUAL "") @@ -472,9 +472,13 @@ if(ARROW_BUILD_TESTS) set(GTEST_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/googletest_ep-prefix/src/googletest_ep") set(GTEST_INCLUDE_DIR "${GTEST_PREFIX}/include") - set(GTEST_STATIC_LIB "${GTEST_PREFIX}/${CMAKE_CFG_INTDIR}/${CMAKE_STATIC_LIBRARY_PREFIX}gtest${CMAKE_STATIC_LIBRARY_SUFFIX}") + set(GTEST_STATIC_LIB + "${GTEST_PREFIX}/lib/${CMAKE_STATIC_LIBRARY_PREFIX}gtest${CMAKE_STATIC_LIBRARY_SUFFIX}") + set(GTEST_MAIN_STATIC_LIB + "${GTEST_PREFIX}/lib/${CMAKE_STATIC_LIBRARY_PREFIX}gtest_main${CMAKE_STATIC_LIBRARY_SUFFIX}") set(GTEST_VENDORED 1) set(GTEST_CMAKE_ARGS -DCMAKE_BUILD_TYPE=${CMAKE_BUILD_TYPE} + -DCMAKE_INSTALL_PREFIX=${GTEST_PREFIX} -Dgtest_force_shared_crt=ON -DCMAKE_CXX_FLAGS=${GTEST_CMAKE_CXX_FLAGS}) @@ -482,22 +486,11 @@ if(ARROW_BUILD_TESTS) # BUILD_BYPRODUCTS is a 3.2+ feature ExternalProject_Add(googletest_ep URL 
"https://github.com/google/googletest/archive/release-${GTEST_VERSION}.tar.gz" - CMAKE_ARGS ${GTEST_CMAKE_ARGS} - # googletest doesn't define install rules, so just build in the - # source dir and don't try to install. See its README for - # details. - BUILD_IN_SOURCE 1 - BUILD_BYPRODUCTS "${GTEST_STATIC_LIB}" - INSTALL_COMMAND "") + CMAKE_ARGS ${GTEST_CMAKE_ARGS}) else() ExternalProject_Add(googletest_ep URL "https://github.com/google/googletest/archive/release-${GTEST_VERSION}.tar.gz" - CMAKE_ARGS ${GTEST_CMAKE_ARGS} - # googletest doesn't define install rules, so just build in the - # source dir and don't try to install. See its README for - # details. - BUILD_IN_SOURCE 1 - INSTALL_COMMAND "") + CMAKE_ARGS ${GTEST_CMAKE_ARGS}) endif() else() find_package(GTest REQUIRED) @@ -509,9 +502,12 @@ if(ARROW_BUILD_TESTS) include_directories(SYSTEM ${GTEST_INCLUDE_DIR}) ADD_THIRDPARTY_LIB(gtest STATIC_LIB ${GTEST_STATIC_LIB}) + ADD_THIRDPARTY_LIB(gtest_main + STATIC_LIB ${GTEST_MAIN_STATIC_LIB}) if(GTEST_VENDORED) add_dependencies(gtest googletest_ep) + add_dependencies(gtest_main googletest_ep) endif() # gflags (formerly Googleflags) command line parsing @@ -753,10 +749,22 @@ include_directories(SYSTEM "${HADOOP_HOME}/include") ############################################################ set(ARROW_MIN_TEST_LIBS arrow_static - arrow_test_main + gtest + gtest_main ${ARROW_BASE_LIBS} ${BOOST_REGEX_LIBRARY}) +if (APPLE) + set(ARROW_MIN_TEST_LIBS + ${ARROW_MIN_TEST_LIBS} + ${CMAKE_DL_LIBS}) +elseif(NOT MSVC) + set(ARROW_MIN_TEST_LIBS + ${ARROW_MIN_TEST_LIBS} + pthread + ${CMAKE_DL_LIBS}) +endif() + set(ARROW_TEST_LINK_LIBS ${ARROW_MIN_TEST_LIBS}) set(ARROW_BENCHMARK_LINK_LIBS diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt index 9aa8bae273fb8..b22c8aca11c5d 100644 --- a/cpp/src/arrow/util/CMakeLists.txt +++ b/cpp/src/arrow/util/CMakeLists.txt @@ -32,28 +32,6 @@ install(FILES # arrow_test_main ####################################### -if (ARROW_BUILD_TESTS) - add_library(arrow_test_main - test_main.cc) - - if (APPLE) - target_link_libraries(arrow_test_main - gtest - dl) - set_target_properties(arrow_test_main - PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") - elseif(MSVC) - target_link_libraries(arrow_test_main - gtest) - else() - target_link_libraries(arrow_test_main - gtest - pthread - dl - ) - endif() -endif() - if (ARROW_BUILD_BENCHMARKS) add_library(arrow_benchmark_main benchmark_main.cc) if (APPLE) @@ -66,6 +44,9 @@ if (ARROW_BUILD_BENCHMARKS) pthread ) endif() + + # TODO(wesm): Some benchmarks include gtest.h + add_dependencies(arrow_benchmark_main gtest) endif() ADD_ARROW_TEST(bit-util-test) diff --git a/cpp/src/arrow/util/test_main.cc b/cpp/src/arrow/util/test_main.cc deleted file mode 100644 index f928047023966..0000000000000 --- a/cpp/src/arrow/util/test_main.cc +++ /dev/null @@ -1,26 +0,0 @@ -// Licensed to the Apache Software Foundation (ASF) under one -// or more contributor license agreements. See the NOTICE file -// distributed with this work for additional information -// regarding copyright ownership. The ASF licenses this file -// to you under the Apache License, Version 2.0 (the -// "License"); you may not use this file except in compliance -// with the License. 
You may obtain a copy of the License at -// -// http://www.apache.org/licenses/LICENSE-2.0 -// -// Unless required by applicable law or agreed to in writing, -// software distributed under the License is distributed on an -// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -// KIND, either express or implied. See the License for the -// specific language governing permissions and limitations -// under the License. - -#include "gtest/gtest.h" - -int main(int argc, char** argv) { - ::testing::InitGoogleTest(&argc, argv); - - int ret = RUN_ALL_TESTS(); - - return ret; -} diff --git a/python/.gitignore b/python/.gitignore index 4ab802006914e..ba40c3ea88882 100644 --- a/python/.gitignore +++ b/python/.gitignore @@ -16,7 +16,7 @@ Testing/ *.c *.cpp pyarrow/version.py -pyarrow/table_api.h +pyarrow/*_api.h # Python files # setup.py working directory From 09e6eade166b60db95694d291ebfb074f1442ff8 Mon Sep 17 00:00:00 2001 From: Jeff Reback Date: Sun, 16 Apr 2017 13:11:38 -0400 Subject: [PATCH 0524/1644] ARROW-836: add test for pandas conversion of timedelta, currently unimplemented xref https://github.com/pandas-dev/pandas/pull/16004 Author: Jeff Reback Closes #551 from jreback/timedelta and squashes the following commits: cfd310e [Jeff Reback] TST: add test for pandas conversion of timedelta, currently unimplemented --- python/pyarrow/tests/test_convert_pandas.py | 13 +++++++++++++ python/pyarrow/tests/test_feather.py | 10 ++++++++++ 2 files changed, 23 insertions(+) diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index 4a57e4ba1d4fb..2394d638d073e 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -18,6 +18,7 @@ from collections import OrderedDict +import pytest import datetime import unittest import decimal @@ -412,6 +413,18 @@ def test_dates_from_integers(self): assert a1[0].as_py() == expected assert a2[0].as_py() == expected + @pytest.mark.xfail(reason="not supported ATM", + raises=NotImplementedError) + def test_timedelta(self): + # TODO(jreback): Pandas only support ns resolution + # Arrow supports ??? 
for resolution + df = pd.DataFrame({ + 'timedelta': np.arange(start=0, stop=3*86400000, + step=86400000, + dtype='timedelta64[ms]') + }) + pa.Table.from_pandas(df) + def test_column_of_arrays(self): df, schema = dataframe_with_arrays() self._check_pandas_roundtrip(df, schema=schema, expected_schema=schema) diff --git a/python/pyarrow/tests/test_feather.py b/python/pyarrow/tests/test_feather.py index 6f8040fd483c9..ef73a8feeb65c 100644 --- a/python/pyarrow/tests/test_feather.py +++ b/python/pyarrow/tests/test_feather.py @@ -14,6 +14,7 @@ import os import unittest +import pytest from numpy.testing import assert_array_equal import numpy as np @@ -320,6 +321,15 @@ def test_timestamp_with_nulls(self): self._check_pandas_roundtrip(df, null_counts=[1, 1]) + @pytest.mark.xfail(reason="not supported ATM", + raises=NotImplementedError) + def test_timedelta_with_nulls(self): + df = pd.DataFrame({'test': [pd.Timedelta('1 day'), + None, + pd.Timedelta('3 day')]}) + + self._check_pandas_roundtrip(df, null_counts=[1, 1]) + def test_out_of_float64_timestamp_with_nulls(self): df = pd.DataFrame( {'test': pd.DatetimeIndex([1451606400000000001, From f51259068640af92490c0832d5d55885a510776d Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 17 Apr 2017 09:22:15 -0400 Subject: [PATCH 0525/1644] ARROW-827: [Python] Miscellaneous improvements to help with Dask support Author: Wes McKinney Closes #543 from wesm/dask-improvements and squashes the following commits: 1f587e2 [Wes McKinney] Store the input Parquet paths on the dataset object 3504281 [Wes McKinney] Add some more cases edc9b59 [Wes McKinney] Unit tests 88f4380 [Wes McKinney] Use dict for type mapping for now 7e69cab [Wes McKinney] Miscellaneous improvements to help with Dask support --- python/pyarrow/_array.pyx | 193 +++++++++++++++++---------- python/pyarrow/_parquet.pyx | 23 +++- python/pyarrow/includes/libarrow.pxd | 64 ++++----- python/pyarrow/parquet.py | 22 ++- python/pyarrow/tests/test_parquet.py | 4 + python/pyarrow/tests/test_schema.py | 30 +++++ 6 files changed, 222 insertions(+), 114 deletions(-) diff --git a/python/pyarrow/_array.pyx b/python/pyarrow/_array.pyx index 7ef8e5867a1a2..c5a595c6c67e5 100644 --- a/python/pyarrow/_array.pyx +++ b/python/pyarrow/_array.pyx @@ -41,6 +41,31 @@ cdef _pandas(): return pd +# These are imprecise because the type (in pandas 0.x) depends on the presence +# of nulls +_pandas_type_map = { + _Type_NA: np.float64, # NaNs + _Type_BOOL: np.bool_, + _Type_INT8: np.int8, + _Type_INT16: np.int16, + _Type_INT32: np.int32, + _Type_INT64: np.int64, + _Type_UINT8: np.uint8, + _Type_UINT16: np.uint16, + _Type_UINT32: np.uint32, + _Type_UINT64: np.uint64, + _Type_HALF_FLOAT: np.float16, + _Type_FLOAT: np.float32, + _Type_DOUBLE: np.float64, + _Type_DATE32: np.dtype('datetime64[ns]'), + _Type_DATE64: np.dtype('datetime64[ns]'), + _Type_TIMESTAMP: np.dtype('datetime64[ns]'), + _Type_BINARY: np.object_, + _Type_FIXED_SIZE_BINARY: np.object_, + _Type_STRING: np.object_, + _Type_LIST: np.object_ +} + cdef class DataType: def __cinit__(self): @@ -64,6 +89,16 @@ cdef class DataType: else: raise TypeError('Invalid comparison') + def to_pandas_dtype(self): + """ + Return the NumPy dtype that would be used for storing this + """ + cdef Type type_id = self.type.id() + if type_id in _pandas_type_map: + return _pandas_type_map[type_id] + else: + raise NotImplementedError(str(self)) + cdef class DictionaryType(DataType): @@ -167,6 +202,16 @@ cdef class Schema: return result + property names: + + def __get__(self): + cdef int i + 
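            # (Editorial gloss, not in the original diff: the lines below
            # collect the field names in schema order, decoding each C++
            # std::string into a Python str via frombytes.)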
result = [] + for i in range(self.schema.num_fields()): + name = frombytes(self.schema.field(i).get().name()) + result.append(name) + return result + cdef init(self, const vector[shared_ptr[CField]]& fields): self.schema = new CSchema(fields) self.sp_schema.reset(self.schema) @@ -244,56 +289,56 @@ def field(name, type, bint nullable=True): cdef set PRIMITIVE_TYPES = set([ - Type_NA, Type_BOOL, - Type_UINT8, Type_INT8, - Type_UINT16, Type_INT16, - Type_UINT32, Type_INT32, - Type_UINT64, Type_INT64, - Type_TIMESTAMP, Type_DATE32, - Type_DATE64, - Type_HALF_FLOAT, - Type_FLOAT, - Type_DOUBLE]) + _Type_NA, _Type_BOOL, + _Type_UINT8, _Type_INT8, + _Type_UINT16, _Type_INT16, + _Type_UINT32, _Type_INT32, + _Type_UINT64, _Type_INT64, + _Type_TIMESTAMP, _Type_DATE32, + _Type_DATE64, + _Type_HALF_FLOAT, + _Type_FLOAT, + _Type_DOUBLE]) def null(): - return primitive_type(Type_NA) + return primitive_type(_Type_NA) def bool_(): - return primitive_type(Type_BOOL) + return primitive_type(_Type_BOOL) def uint8(): - return primitive_type(Type_UINT8) + return primitive_type(_Type_UINT8) def int8(): - return primitive_type(Type_INT8) + return primitive_type(_Type_INT8) def uint16(): - return primitive_type(Type_UINT16) + return primitive_type(_Type_UINT16) def int16(): - return primitive_type(Type_INT16) + return primitive_type(_Type_INT16) def uint32(): - return primitive_type(Type_UINT32) + return primitive_type(_Type_UINT32) def int32(): - return primitive_type(Type_INT32) + return primitive_type(_Type_INT32) def uint64(): - return primitive_type(Type_UINT64) + return primitive_type(_Type_UINT64) def int64(): - return primitive_type(Type_INT64) + return primitive_type(_Type_INT64) cdef dict _timestamp_type_cache = {} @@ -344,23 +389,23 @@ def timestamp(unit_str, tz=None): def date32(): - return primitive_type(Type_DATE32) + return primitive_type(_Type_DATE32) def date64(): - return primitive_type(Type_DATE64) + return primitive_type(_Type_DATE64) def float16(): - return primitive_type(Type_HALF_FLOAT) + return primitive_type(_Type_HALF_FLOAT) def float32(): - return primitive_type(Type_FLOAT) + return primitive_type(_Type_FLOAT) def float64(): - return primitive_type(Type_DOUBLE) + return primitive_type(_Type_DOUBLE) cpdef DataType decimal(int precision, int scale=0): @@ -373,7 +418,7 @@ def string(): """ UTF8 string """ - return primitive_type(Type_STRING) + return primitive_type(_Type_STRING) def binary(int length=-1): @@ -387,7 +432,7 @@ def binary(int length=-1): width `length`. 
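    For example, binary() describes variable-size binary values, while
    binary(12) describes values that are exactly 12 bytes wide.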
""" if length == -1: - return primitive_type(Type_BINARY) + return primitive_type(_Type_BINARY) cdef shared_ptr[CDataType] fixed_size_binary_type fixed_size_binary_type.reset(new CFixedSizeBinaryType(length)) @@ -443,13 +488,13 @@ cdef DataType box_data_type(const shared_ptr[CDataType]& type): if type.get() == NULL: return None - if type.get().id() == Type_DICTIONARY: + if type.get().id() == _Type_DICTIONARY: out = DictionaryType() - elif type.get().id() == Type_TIMESTAMP: + elif type.get().id() == _Type_TIMESTAMP: out = TimestampType() - elif type.get().id() == Type_FIXED_SIZE_BINARY: + elif type.get().id() == _Type_FIXED_SIZE_BINARY: out = FixedSizeBinaryType() - elif type.get().id() == Type_DECIMAL: + elif type.get().id() == _Type_DECIMAL: out = DecimalType() else: out = DataType() @@ -732,31 +777,31 @@ cdef class FixedSizeBinaryValue(ArrayValue): cdef dict _scalar_classes = { - Type_BOOL: BooleanValue, - Type_UINT8: Int8Value, - Type_UINT16: Int16Value, - Type_UINT32: Int32Value, - Type_UINT64: Int64Value, - Type_INT8: Int8Value, - Type_INT16: Int16Value, - Type_INT32: Int32Value, - Type_INT64: Int64Value, - Type_DATE32: Date32Value, - Type_DATE64: Date64Value, - Type_TIMESTAMP: TimestampValue, - Type_FLOAT: FloatValue, - Type_DOUBLE: DoubleValue, - Type_LIST: ListValue, - Type_BINARY: BinaryValue, - Type_STRING: StringValue, - Type_FIXED_SIZE_BINARY: FixedSizeBinaryValue, - Type_DECIMAL: DecimalValue, + _Type_BOOL: BooleanValue, + _Type_UINT8: Int8Value, + _Type_UINT16: Int16Value, + _Type_UINT32: Int32Value, + _Type_UINT64: Int64Value, + _Type_INT8: Int8Value, + _Type_INT16: Int16Value, + _Type_INT32: Int32Value, + _Type_INT64: Int64Value, + _Type_DATE32: Date32Value, + _Type_DATE64: Date64Value, + _Type_TIMESTAMP: TimestampValue, + _Type_FLOAT: FloatValue, + _Type_DOUBLE: DoubleValue, + _Type_LIST: ListValue, + _Type_BINARY: BinaryValue, + _Type_STRING: StringValue, + _Type_FIXED_SIZE_BINARY: FixedSizeBinaryValue, + _Type_DECIMAL: DecimalValue, } cdef object box_scalar(DataType type, const shared_ptr[CArray]& sp_array, int64_t index): cdef ArrayValue val - if type.type.id() == Type_NA: + if type.type.id() == _Type_NA: return NA elif sp_array.get().IsNull(index): return NA @@ -1306,29 +1351,29 @@ cdef class DictionaryArray(Array): cdef dict _array_classes = { - Type_NA: NullArray, - Type_BOOL: BooleanArray, - Type_UINT8: UInt8Array, - Type_UINT16: UInt16Array, - Type_UINT32: UInt32Array, - Type_UINT64: UInt64Array, - Type_INT8: Int8Array, - Type_INT16: Int16Array, - Type_INT32: Int32Array, - Type_INT64: Int64Array, - Type_DATE32: Date32Array, - Type_DATE64: Date64Array, - Type_TIMESTAMP: TimestampArray, - Type_TIME32: Time32Array, - Type_TIME64: Time64Array, - Type_FLOAT: FloatArray, - Type_DOUBLE: DoubleArray, - Type_LIST: ListArray, - Type_BINARY: BinaryArray, - Type_STRING: StringArray, - Type_DICTIONARY: DictionaryArray, - Type_FIXED_SIZE_BINARY: FixedSizeBinaryArray, - Type_DECIMAL: DecimalArray, + _Type_NA: NullArray, + _Type_BOOL: BooleanArray, + _Type_UINT8: UInt8Array, + _Type_UINT16: UInt16Array, + _Type_UINT32: UInt32Array, + _Type_UINT64: UInt64Array, + _Type_INT8: Int8Array, + _Type_INT16: Int16Array, + _Type_INT32: Int32Array, + _Type_INT64: Int64Array, + _Type_DATE32: Date32Array, + _Type_DATE64: Date64Array, + _Type_TIMESTAMP: TimestampArray, + _Type_TIME32: Time32Array, + _Type_TIME64: Time64Array, + _Type_FLOAT: FloatArray, + _Type_DOUBLE: DoubleArray, + _Type_LIST: ListArray, + _Type_BINARY: BinaryArray, + _Type_STRING: StringArray, + _Type_DICTIONARY: 
DictionaryArray, + _Type_FIXED_SIZE_BINARY: FixedSizeBinaryArray, + _Type_DECIMAL: DecimalArray, } cdef object box_array(const shared_ptr[CArray]& sp_array): diff --git a/python/pyarrow/_parquet.pyx b/python/pyarrow/_parquet.pyx index dafcdaff9bfee..c06eab2630210 100644 --- a/python/pyarrow/_parquet.pyx +++ b/python/pyarrow/_parquet.pyx @@ -23,7 +23,7 @@ from cython.operator cimport dereference as deref from pyarrow.includes.common cimport * from pyarrow.includes.libarrow cimport * cimport pyarrow.includes.pyarrow as pyarrow -from pyarrow._array cimport Array, Schema +from pyarrow._array cimport Array, Schema, box_schema from pyarrow._error cimport check_status from pyarrow._memory cimport MemoryPool, maybe_unbox_memory_pool from pyarrow._table cimport Table, table_from_ctable @@ -194,6 +194,27 @@ cdef class ParquetSchema: def __getitem__(self, i): return self.column(i) + property names: + + def __get__(self): + return [self[i].name for i in range(len(self))] + + def to_arrow_schema(self): + """ + Convert Parquet schema to effective Arrow schema + + Returns + ------- + schema : pyarrow.Schema + """ + cdef: + shared_ptr[CSchema] sp_arrow_schema + + with nogil: + check_status(FromParquetSchema(self.schema, &sp_arrow_schema)) + + return box_schema(sp_arrow_schema) + def equals(self, ParquetSchema other): """ Returns True if the Parquet schemas are equal diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 2444f3fd0683e..b8aa24c65e11b 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -1,4 +1,4 @@ -# Licensed to the Apache Software Foundation (ASF) under one +#t Licensed to the Apache Software Foundation (ASF) under one # or more contributor license agreements. See the NOTICE file # distributed with this work for additional information # regarding copyright ownership. 
The ASF licenses this file @@ -22,37 +22,37 @@ from pyarrow.includes.common cimport * cdef extern from "arrow/api.h" namespace "arrow" nogil: enum Type" arrow::Type::type": - Type_NA" arrow::Type::NA" - - Type_BOOL" arrow::Type::BOOL" - - Type_UINT8" arrow::Type::UINT8" - Type_INT8" arrow::Type::INT8" - Type_UINT16" arrow::Type::UINT16" - Type_INT16" arrow::Type::INT16" - Type_UINT32" arrow::Type::UINT32" - Type_INT32" arrow::Type::INT32" - Type_UINT64" arrow::Type::UINT64" - Type_INT64" arrow::Type::INT64" - - Type_HALF_FLOAT" arrow::Type::HALF_FLOAT" - Type_FLOAT" arrow::Type::FLOAT" - Type_DOUBLE" arrow::Type::DOUBLE" - - Type_DECIMAL" arrow::Type::DECIMAL" - - Type_DATE32" arrow::Type::DATE32" - Type_DATE64" arrow::Type::DATE64" - Type_TIMESTAMP" arrow::Type::TIMESTAMP" - Type_TIME32" arrow::Type::TIME32" - Type_TIME64" arrow::Type::TIME64" - Type_BINARY" arrow::Type::BINARY" - Type_STRING" arrow::Type::STRING" - Type_FIXED_SIZE_BINARY" arrow::Type::FIXED_SIZE_BINARY" - - Type_LIST" arrow::Type::LIST" - Type_STRUCT" arrow::Type::STRUCT" - Type_DICTIONARY" arrow::Type::DICTIONARY" + _Type_NA" arrow::Type::NA" + + _Type_BOOL" arrow::Type::BOOL" + + _Type_UINT8" arrow::Type::UINT8" + _Type_INT8" arrow::Type::INT8" + _Type_UINT16" arrow::Type::UINT16" + _Type_INT16" arrow::Type::INT16" + _Type_UINT32" arrow::Type::UINT32" + _Type_INT32" arrow::Type::INT32" + _Type_UINT64" arrow::Type::UINT64" + _Type_INT64" arrow::Type::INT64" + + _Type_HALF_FLOAT" arrow::Type::HALF_FLOAT" + _Type_FLOAT" arrow::Type::FLOAT" + _Type_DOUBLE" arrow::Type::DOUBLE" + + _Type_DECIMAL" arrow::Type::DECIMAL" + + _Type_DATE32" arrow::Type::DATE32" + _Type_DATE64" arrow::Type::DATE64" + _Type_TIMESTAMP" arrow::Type::TIMESTAMP" + _Type_TIME32" arrow::Type::TIME32" + _Type_TIME64" arrow::Type::TIME64" + _Type_BINARY" arrow::Type::BINARY" + _Type_STRING" arrow::Type::STRING" + _Type_FIXED_SIZE_BINARY" arrow::Type::FIXED_SIZE_BINARY" + + _Type_LIST" arrow::Type::LIST" + _Type_STRUCT" arrow::Type::STRUCT" + _Type_DICTIONARY" arrow::Type::DICTIONARY" enum TimeUnit" arrow::TimeUnit": TimeUnit_SECOND" arrow::TimeUnit::SECOND" diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py index 4ff7e038b5e6c..fef99d5e12a06 100644 --- a/python/pyarrow/parquet.py +++ b/python/pyarrow/parquet.py @@ -180,14 +180,13 @@ def _open(self, open_file_func=None): """ Returns instance of ParquetFile """ - if open_file_func is None: - def simple_opener(path): - return ParquetFile(path) - open_file_func = simple_opener - return open_file_func(self.path) + reader = open_file_func(self.path) + if not isinstance(reader, ParquetFile): + reader = ParquetFile(reader) + return reader def read(self, columns=None, nthreads=1, partitions=None, - open_file_func=None): + open_file_func=None, file=None): """ Read this piece as a pyarrow.Table @@ -205,7 +204,10 @@ def read(self, columns=None, nthreads=1, partitions=None, ------- table : pyarrow.Table """ - reader = self._open(open_file_func) + if open_file_func is not None: + reader = self._open(open_file_func) + elif file is not None: + reader = ParquetFile(file) if self.row_group is not None: table = reader.read_row_group(self.row_group, columns=columns, @@ -472,6 +474,8 @@ def __init__(self, path_or_paths, filesystem=None, schema=None, else: self.fs = filesystem + self.paths = path_or_paths + (self.pieces, self.partitions, self.metadata_path) = _make_manifest(path_or_paths, self.fs) @@ -550,6 +554,10 @@ def _make_manifest(path_or_paths, fs, pathsep='/'): partitions = None metadata_path = None + 
if len(path_or_paths) == 1: + # Dask passes a directory as a list of length 1 + path_or_paths = path_or_paths[0] + if is_string(path_or_paths) and fs.isdir(path_or_paths): manifest = ParquetManifest(path_or_paths, filesystem=fs, pathsep=pathsep) diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index ca6ae2d0b3be0..fc35781c54722 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -552,6 +552,10 @@ def test_read_common_metadata_files(tmpdir): pf = pq.ParquetFile(data_path) assert dataset.schema.equals(pf.schema) + # handle list of one directory + dataset2 = pq.ParquetDataset([base_path]) + assert dataset2.schema.equals(dataset.schema) + def _filter_partition(df, part_keys): predicate = np.ones(len(df), dtype=bool) diff --git a/python/pyarrow/tests/test_schema.py b/python/pyarrow/tests/test_schema.py index 53b6b68cfde3c..d1107fb1faf3f 100644 --- a/python/pyarrow/tests/test_schema.py +++ b/python/pyarrow/tests/test_schema.py @@ -31,6 +31,34 @@ def test_type_integers(): assert str(t) == name +def test_type_to_pandas_dtype(): + M8_ns = np.dtype('datetime64[ns]') + cases = [ + (pa.null(), np.float64), + (pa.bool_(), np.bool_), + (pa.int8(), np.int8), + (pa.int16(), np.int16), + (pa.int32(), np.int32), + (pa.int64(), np.int64), + (pa.uint8(), np.uint8), + (pa.uint16(), np.uint16), + (pa.uint32(), np.uint32), + (pa.uint64(), np.uint64), + (pa.float16(), np.float16), + (pa.float32(), np.float32), + (pa.float64(), np.float64), + (pa.date32(), M8_ns), + (pa.date64(), M8_ns), + (pa.timestamp('ms'), M8_ns), + (pa.binary(), np.object_), + (pa.binary(12), np.object_), + (pa.string(), np.object_), + (pa.list_(pa.int8()), np.object_), + ] + for arrow_type, numpy_type in cases: + assert arrow_type.to_pandas_dtype() == numpy_type + + def test_type_list(): value_type = pa.int32() list_type = pa.list_(value_type) @@ -83,6 +111,8 @@ def test_schema(): ] sch = pa.schema(fields) + assert sch.names == ['foo', 'bar', 'baz'] + assert len(sch) == 3 assert sch[0].name == 'foo' assert sch[0].type == fields[0].type From 312a665353c420452e98b6b266a5a7cb214c936f Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 17 Apr 2017 09:56:53 -0400 Subject: [PATCH 0526/1644] ARROW-707: [Python] Return NullArray for array of all None in Array.from_pandas. Revert from_numpy -> from_pandas per ARROW-838, I reverted the `Array.from_numpy` name to `Array.from_pandas` to reflect that the import is specific to pandas 0.x's memory representation Author: Wes McKinney Closes #554 from wesm/ARROW-707 and squashes the following commits: a875257 [Wes McKinney] Rename PyObject_is_null to reflect domain-specific nature 093b057 [Wes McKinney] Check more cases of all nulls. Fix segfault for NaN that resulted from computations 7d97f28 [Wes McKinney] Return NullArray for array of all None in Array.from_pandas. 
Revert from_numpy -> from_pandas
---
 cpp/src/arrow/python/pandas_convert.cc | 31 ++++++++++++-------
 python/doc/source/api.rst | 1 +
 python/pyarrow/__init__.py | 1 +
 python/pyarrow/_array.pxd | 4 +++
 python/pyarrow/_array.pyx | 18 +++++------
 python/pyarrow/_io.pyx | 2 +-
 python/pyarrow/_table.pyx | 2 +-
 python/pyarrow/tests/test_array.py | 4 +--
 python/pyarrow/tests/test_convert_pandas.py | 34 ++++++++++++++-------
 python/pyarrow/tests/test_scalars.py | 6 ++--
 10 files changed, 65 insertions(+), 38 deletions(-)

diff --git a/cpp/src/arrow/python/pandas_convert.cc b/cpp/src/arrow/python/pandas_convert.cc
index b33aea4565817..5cdcb6fa49602 100644
--- a/cpp/src/arrow/python/pandas_convert.cc
+++ b/cpp/src/arrow/python/pandas_convert.cc
@@ -61,8 +61,16 @@ namespace py {
 // ----------------------------------------------------------------------
 // Utility code
 
-static inline bool PyObject_is_null(const PyObject* obj) {
-  return obj == Py_None || obj == numpy_nan;
+static inline bool PyFloat_isnan(const PyObject* obj) {
+  if (PyFloat_Check(obj)) {
+    double val = PyFloat_AS_DOUBLE(obj);
+    return val != val;
+  } else {
+    return false;
+  }
+}
+static inline bool PandasObjectIsNull(const PyObject* obj) {
+  return obj == Py_None || obj == numpy_nan || PyFloat_isnan(obj);
 }
 
 static inline bool PyObject_is_string(const PyObject* obj) {
@@ -158,7 +166,7 @@ static Status AppendObjectStrings(
   for (int64_t i = 0; i < objects.size(); ++i) {
     obj = objects[i];
-    if ((have_mask && mask_values[i]) || PyObject_is_null(obj)) {
+    if ((have_mask && mask_values[i]) || PandasObjectIsNull(obj)) {
       RETURN_NOT_OK(builder->AppendNull());
     } else if (PyUnicode_Check(obj)) {
       obj = PyUnicode_AsUTF8String(obj);
@@ -197,7 +205,7 @@ static Status AppendObjectFixedWidthBytes(PyArrayObject* arr, PyArrayObject* mas
   for (int64_t i = 0; i < objects.size(); ++i) {
     obj = objects[i];
-    if ((have_mask && mask_values[i]) || PyObject_is_null(obj)) {
+    if ((have_mask && mask_values[i]) || PandasObjectIsNull(obj)) {
       RETURN_NOT_OK(builder->AppendNull());
     } else if (PyUnicode_Check(obj)) {
       obj = PyUnicode_AsUTF8String(obj);
@@ -519,7 +527,7 @@ Status PandasConverter::ConvertDates() {
     obj = objects[i];
     if (PyDate_CheckExact(obj)) {
       date_builder.Append(UnboxDate::Unbox(obj));
-    } else if (PyObject_is_null(obj)) {
+    } else if (PandasObjectIsNull(obj)) {
       date_builder.AppendNull();
     } else {
       return InvalidConversion(obj, "date");
@@ -570,7 +578,7 @@ Status PandasConverter::ConvertDecimals() {
       default: break;
     }
-  } else if (PyObject_is_null(object)) {
+  } else if (PandasObjectIsNull(object)) {
     decimal_builder.AppendNull();
   } else {
     return InvalidConversion(object, "decimal.Decimal");
@@ -724,7 +732,7 @@ Status PandasConverter::ConvertBooleans() {
   PyObject* obj;
   for (int64_t i = 0; i < length_; ++i) {
     obj = objects[i];
-    if ((have_mask && mask_values[i]) || PyObject_is_null(obj)) {
+    if ((have_mask && mask_values[i]) || PandasObjectIsNull(obj)) {
       ++null_count;
     } else if (obj == Py_True) {
       BitUtil::SetBit(bitmap, i);
@@ -791,7 +799,7 @@ Status PandasConverter::ConvertObjects() {
   RETURN_NOT_OK(ImportFromModule(decimal, "Decimal", &Decimal));
 
   for (int64_t i = 0; i < length_; ++i) {
-    if (PyObject_is_null(objects[i])) {
+    if (PandasObjectIsNull(objects[i])) {
       continue;
     } else if (PyObject_is_string(objects[i])) {
       return ConvertObjectStrings();
@@ -809,7 +817,8 @@
     }
   }
 
-  return Status::TypeError("Unable to infer type of object array, were all null");
+  out_ = std::make_shared<NullArray>(length_);
+  return Status::OK();
 }
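The last hunk above is the crux of ARROW-707: an object column that contains nothing but nulls now infers as a typeless NullArray instead of raising TypeError. A minimal sketch of the resulting Python-level behavior, mirroring the test_all_nones test added later in this same patch (variable names here are illustrative, and this assumes a pyarrow build with the patch applied):

```python
import numpy as np
import pandas as pd
import pyarrow as pa

# None, np.nan, and computed NaNs are all treated as nulls now that
# PandasObjectIsNull also catches float NaN values.
all_nulls = pd.Series([None, np.nan, None], dtype=object)
converted = pa.Array.from_pandas(all_nulls)

assert isinstance(converted, pa.NullArray)
assert len(converted) == 3
assert converted.null_count == 3
assert converted[0] is pa.NA
```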
template @@ -833,7 +842,7 @@ inline Status PandasConverter::ConvertTypedLists(const std::shared_ptr ListBuilder list_builder(pool_, value_builder); PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); for (int64_t i = 0; i < length_; ++i) { - if (PyObject_is_null(objects[i])) { + if (PandasObjectIsNull(objects[i])) { RETURN_NOT_OK(list_builder.AppendNull()); } else if (PyArray_Check(objects[i])) { auto numpy_array = reinterpret_cast(objects[i]); @@ -893,7 +902,7 @@ inline Status PandasConverter::ConvertTypedLists( ListBuilder list_builder(pool_, value_builder); PyObject** objects = reinterpret_cast(PyArray_DATA(arr_)); for (int64_t i = 0; i < length_; ++i) { - if (PyObject_is_null(objects[i])) { + if (PandasObjectIsNull(objects[i])) { RETURN_NOT_OK(list_builder.AppendNull()); } else if (PyArray_Check(objects[i])) { auto numpy_array = reinterpret_cast(objects[i]); diff --git a/python/doc/source/api.rst b/python/doc/source/api.rst index 801ab34126c7c..1b7b9bdc8f8c8 100644 --- a/python/doc/source/api.rst +++ b/python/doc/source/api.rst @@ -90,6 +90,7 @@ Array Types :toctree: generated/ Array + NullArray NumericArray IntegerArray FloatingPointArray diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 506d567b0c508..3db2a4f4dd0c8 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -40,6 +40,7 @@ Array, Tensor, from_pylist, from_numpy_dtype, + NullArray, NumericArray, IntegerArray, FloatingPointArray, BooleanArray, Int8Array, UInt8Array, diff --git a/python/pyarrow/_array.pxd b/python/pyarrow/_array.pxd index 40413746fc94b..afb0c27d4e1ef 100644 --- a/python/pyarrow/_array.pxd +++ b/python/pyarrow/_array.pxd @@ -141,6 +141,10 @@ cdef class Tensor: cdef init(self, const shared_ptr[CTensor]& sp_tensor) +cdef class NullArray(Array): + pass + + cdef class BooleanArray(Array): pass diff --git a/python/pyarrow/_array.pyx b/python/pyarrow/_array.pyx index c5a595c6c67e5..99ff6f28096ef 100644 --- a/python/pyarrow/_array.pyx +++ b/python/pyarrow/_array.pyx @@ -843,9 +843,9 @@ cdef class Array: self.type = box_data_type(self.sp_array.get().type()) @staticmethod - def from_numpy(obj, mask=None, DataType type=None, - timestamps_to_ms=False, - MemoryPool memory_pool=None): + def from_pandas(obj, mask=None, DataType type=None, + timestamps_to_ms=False, + MemoryPool memory_pool=None): """ Convert pandas.Series to an Arrow Array. @@ -878,7 +878,7 @@ cdef class Array: >>> import pandas as pd >>> import pyarrow as pa - >>> pa.Array.from_numpy(pd.Series([1, 2])) + >>> pa.Array.from_pandas(pd.Series([1, 2])) [ 1, @@ -886,7 +886,7 @@ cdef class Array: ] >>> import numpy as np - >>> pa.Array.from_numpy(pd.Series([1, 2]), np.array([0, 1], + >>> pa.Array.from_pandas(pd.Series([1, 2]), np.array([0, 1], ... 
dtype=bool)) [ @@ -1329,14 +1329,14 @@ cdef class DictionaryArray(Array): mask = indices == -1 else: mask = mask | (indices == -1) - arrow_indices = Array.from_numpy(indices, mask=mask, - memory_pool=memory_pool) + arrow_indices = Array.from_pandas(indices, mask=mask, + memory_pool=memory_pool) if isinstance(dictionary, Array): arrow_dictionary = dictionary else: - arrow_dictionary = Array.from_numpy(dictionary, - memory_pool=memory_pool) + arrow_dictionary = Array.from_pandas(dictionary, + memory_pool=memory_pool) if not isinstance(arrow_indices, IntegerArray): raise ValueError('Indices must be integer type') diff --git a/python/pyarrow/_io.pyx b/python/pyarrow/_io.pyx index 9f067fb2166c6..ec37de0d72de9 100644 --- a/python/pyarrow/_io.pyx +++ b/python/pyarrow/_io.pyx @@ -1148,7 +1148,7 @@ cdef class FeatherWriter: if isinstance(col, Array): arr = col else: - arr = Array.from_numpy(col, mask=mask) + arr = Array.from_pandas(col, mask=mask) cdef c_string c_name = tobytes(name) diff --git a/python/pyarrow/_table.pyx b/python/pyarrow/_table.pyx index 6558b2ea463fa..78fec75cf3e7d 100644 --- a/python/pyarrow/_table.pyx +++ b/python/pyarrow/_table.pyx @@ -321,7 +321,7 @@ cdef _dataframe_to_arrays(df, timestamps_to_ms, Schema schema): if schema is not None: type = schema.field_by_name(name).type - arr = Array.from_numpy(col, type=type, + arr = Array.from_pandas(col, type=type, timestamps_to_ms=timestamps_to_ms) names.append(name) arrays.append(arr) diff --git a/python/pyarrow/tests/test_array.py b/python/pyarrow/tests/test_array.py index 57b17f6cea756..a1fe842c7ab8b 100644 --- a/python/pyarrow/tests/test_array.py +++ b/python/pyarrow/tests/test_array.py @@ -162,8 +162,8 @@ def test_dictionary_from_boxed_arrays(): indices = np.repeat([0, 1, 2], 2) dictionary = np.array(['foo', 'bar', 'baz'], dtype=object) - iarr = pa.Array.from_numpy(indices) - darr = pa.Array.from_numpy(dictionary) + iarr = pa.Array.from_pandas(indices) + darr = pa.Array.from_pandas(dictionary) d1 = pa.DictionaryArray.from_arrays(iarr, darr) diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index 2394d638d073e..f3602347a78a6 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -79,8 +79,8 @@ def _check_pandas_roundtrip(self, df, expected=None, nthreads=1, def _check_array_roundtrip(self, values, expected=None, mask=None, timestamps_to_ms=False, type=None): - arr = pa.Array.from_numpy(values, timestamps_to_ms=timestamps_to_ms, - mask=mask, type=type) + arr = pa.Array.from_pandas(values, timestamps_to_ms=timestamps_to_ms, + mask=mask, type=type) result = arr.to_pandas() values_nulls = pd.isnull(values) @@ -125,7 +125,7 @@ def test_float_nulls(self): for name, arrow_dtype in dtypes: values = np.random.randn(num_values).astype(name) - arr = pa.Array.from_numpy(values, null_mask) + arr = pa.Array.from_pandas(values, null_mask) arrays.append(arr) fields.append(pa.Field.from_py(name, arrow_dtype)) values[null_mask] = np.nan @@ -178,7 +178,7 @@ def test_integer_with_nulls(self): for name in int_dtypes: values = np.random.randint(0, 100, size=num_values) - arr = pa.Array.from_numpy(values, null_mask) + arr = pa.Array.from_pandas(values, null_mask) arrays.append(arr) expected = values.astype('f8') @@ -212,7 +212,7 @@ def test_boolean_nulls(self): mask = np.random.randint(0, 10, size=num_values) < 3 values = np.random.randint(0, 10, size=num_values) < 5 - arr = pa.Array.from_numpy(values, mask) + arr = pa.Array.from_pandas(values, 
mask) expected = values.astype(object) expected[mask] = None @@ -375,11 +375,11 @@ def test_date_objects_typed(self): t32 = pa.date32() t64 = pa.date64() - a32 = pa.Array.from_numpy(arr, type=t32) - a64 = pa.Array.from_numpy(arr, type=t64) + a32 = pa.Array.from_pandas(arr, type=t32) + a64 = pa.Array.from_pandas(arr, type=t64) - a32_expected = pa.Array.from_numpy(arr_i4, mask=mask, type=t32) - a64_expected = pa.Array.from_numpy(arr_i8, mask=mask, type=t64) + a32_expected = pa.Array.from_pandas(arr_i4, mask=mask, type=t32) + a64_expected = pa.Array.from_pandas(arr_i8, mask=mask, type=t64) assert a32.equals(a32_expected) assert a64.equals(a64_expected) @@ -406,8 +406,8 @@ def test_dates_from_integers(self): arr = np.array([17259, 17260, 17261], dtype='int32') arr2 = arr.astype('int64') * 86400000 - a1 = pa.Array.from_numpy(arr, type=t1) - a2 = pa.Array.from_numpy(arr2, type=t2) + a1 = pa.Array.from_pandas(arr, type=t1) + a2 = pa.Array.from_pandas(arr2, type=t2) expected = datetime.date(2017, 4, 3) assert a1[0].as_py() == expected @@ -586,3 +586,15 @@ def test_decimal_128_to_pandas(self): converted = pa.Table.from_pandas(expected) df = converted.to_pandas() tm.assert_frame_equal(df, expected) + + def test_all_nones(self): + def _check_series(s): + converted = pa.Array.from_pandas(s) + assert isinstance(converted, pa.NullArray) + assert len(converted) == 3 + assert converted.null_count == 3 + assert converted[0] is pa.NA + + _check_series(pd.Series([None] * 3, dtype=object)) + _check_series(pd.Series([np.nan] * 3, dtype=object)) + _check_series(pd.Series([np.sqrt(-1)] * 3, dtype=object)) diff --git a/python/pyarrow/tests/test_scalars.py b/python/pyarrow/tests/test_scalars.py index f4f275b994228..df2a8980710f8 100644 --- a/python/pyarrow/tests/test_scalars.py +++ b/python/pyarrow/tests/test_scalars.py @@ -124,7 +124,7 @@ def test_timestamp(self): for unit in units: dtype = 'datetime64[{0}]'.format(unit) - arrow_arr = pa.Array.from_numpy(arr.astype(dtype)) + arrow_arr = pa.Array.from_pandas(arr.astype(dtype)) expected = pd.Timestamp('2000-01-01 12:34:56') assert arrow_arr[0].as_py() == expected @@ -133,8 +133,8 @@ def test_timestamp(self): arrow_type = pa.timestamp(unit, tz=tz) dtype = 'datetime64[{0}]'.format(unit) - arrow_arr = pa.Array.from_numpy(arr.astype(dtype), - type=arrow_type) + arrow_arr = pa.Array.from_pandas(arr.astype(dtype), + type=arrow_type) expected = (pd.Timestamp('2000-01-01 12:34:56') .tz_localize('utc') .tz_convert(tz)) From 7238d544c1f0b05a393cdf68b2e2c9485bdb154e Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 17 Apr 2017 17:46:11 -0400 Subject: [PATCH 0527/1644] ARROW-734: [C++/Python] Support building PyArrow on MSVC Author: Wes McKinney Closes #553 from wesm/ARROW-734 and squashes the following commits: 6e00485 [Wes McKinney] Restore -undefined,dynamic_lookup linker logic on Linux for Python extensions 5be7e31 [Wes McKinney] do_import_numpy.h no longer needed 2d00e6b [Wes McKinney] Fix Unix library names 1e6bb6e [Wes McKinney] typos 8f4928e [Wes McKinney] More build fixes. Can now import pyarrow 5162203 [Wes McKinney] Remove unneeded exports 024579e [Wes McKinney] Wow, MSVC mangles the name CreateDirectory 990fdc2 [Wes McKinney] Install DLLs fixes in FindArrow.cmake ccf941e [Wes McKinney] Restore CompilerInfo to original state 1e657ad [Wes McKinney] More fixes. 
Change TimeUnit to struct-based enum 2be93f0 [Wes McKinney] NumPy initialization / build fixes 1744f83 [Wes McKinney] Add new files 68e2d5b [Wes McKinney] Move NumPy API initialization into libarrow_python 0a2d387 [Wes McKinney] WIP MSVC support for PyArrow. Linker errors --- cpp/cmake_modules/BuildUtils.cmake | 4 ++ cpp/cmake_modules/CompilerInfo.cmake | 5 +- cpp/cmake_modules/FindPythonLibsNew.cmake | 15 ++++-- cpp/src/arrow/io/hdfs-internal.cc | 2 +- cpp/src/arrow/io/hdfs-internal.h | 2 +- cpp/src/arrow/io/hdfs.cc | 8 ++-- cpp/src/arrow/io/hdfs.h | 2 +- cpp/src/arrow/io/io-hdfs-test.cc | 8 ++-- cpp/src/arrow/ipc/feather-internal.h | 16 +++---- cpp/src/arrow/ipc/feather.cc | 8 ++-- cpp/src/arrow/ipc/json-internal.cc | 4 +- cpp/src/arrow/ipc/metadata.cc | 8 ++-- cpp/src/arrow/python/CMakeLists.txt | 34 ++++++------- cpp/src/arrow/python/builtin_convert.cc | 3 +- cpp/src/arrow/python/builtin_convert.h | 3 +- cpp/src/arrow/python/common.h | 6 +-- cpp/src/arrow/python/config.cc | 4 +- cpp/src/arrow/python/config.h | 6 +-- cpp/src/arrow/python/helpers.cc | 6 +-- cpp/src/arrow/python/helpers.h | 12 ++--- .../python/{do_import_numpy.h => init.cc} | 15 +++++- cpp/src/arrow/python/init.h | 35 ++++++++++++++ cpp/src/arrow/python/io.h | 2 +- cpp/src/arrow/python/numpy-internal.h | 7 ++- cpp/src/arrow/python/numpy_convert.cc | 9 ++-- cpp/src/arrow/python/numpy_convert.h | 7 ++- cpp/src/arrow/python/numpy_interop.h | 2 +- cpp/src/arrow/python/pandas_convert.cc | 5 +- cpp/src/arrow/python/pandas_convert.h | 2 +- cpp/src/arrow/python/platform.h | 32 +++++++++++++ cpp/src/arrow/python/python-test.cc | 2 +- cpp/src/arrow/python/type_traits.h | 2 +- cpp/src/arrow/python/util/datetime.h | 2 +- cpp/src/arrow/python/util/test_main.cc | 7 ++- cpp/src/arrow/type.cc | 14 +++--- cpp/src/arrow/type.h | 32 +++++++------ python/CMakeLists.txt | 34 ++++++------- python/cmake_modules/CompilerInfo.cmake | 48 ------------------- python/cmake_modules/FindArrow.cmake | 25 ++++++---- python/cmake_modules/UseCython.cmake | 8 ++-- python/pyarrow/_config.pyx | 11 ++--- python/pyarrow/_io.pyx | 2 +- python/pyarrow/includes/common.pxd | 3 +- python/pyarrow/includes/libarrow.pxd | 4 +- python/setup.py | 28 ++++++----- 45 files changed, 269 insertions(+), 225 deletions(-) rename cpp/src/arrow/python/{do_import_numpy.h => init.cc} (79%) create mode 100644 cpp/src/arrow/python/init.h create mode 100644 cpp/src/arrow/python/platform.h delete mode 100644 python/cmake_modules/CompilerInfo.cmake diff --git a/cpp/cmake_modules/BuildUtils.cmake b/cpp/cmake_modules/BuildUtils.cmake index 3a3b53678f6e5..4e6532be9aa7a 100644 --- a/cpp/cmake_modules/BuildUtils.cmake +++ b/cpp/cmake_modules/BuildUtils.cmake @@ -102,6 +102,8 @@ function(ADD_ARROW_LIB LIB_NAME) # Necessary to make static linking into other shared libraries work properly set_property(TARGET ${LIB_NAME}_objlib PROPERTY POSITION_INDEPENDENT_CODE 1) + set(RUNTIME_INSTALL_DIR bin) + if (ARROW_BUILD_SHARED) add_library(${LIB_NAME}_shared SHARED $) @@ -139,6 +141,7 @@ function(ADD_ARROW_LIB LIB_NAME) endif() install(TARGETS ${LIB_NAME}_shared + RUNTIME DESTINATION ${RUNTIME_INSTALL_DIR} LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR} ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR}) endif() @@ -155,6 +158,7 @@ function(ADD_ARROW_LIB LIB_NAME) LINK_PRIVATE ${ARG_STATIC_PRIVATE_LINK_LIBS}) install(TARGETS ${LIB_NAME}_static + RUNTIME DESTINATION ${RUNTIME_INSTALL_DIR} LIBRARY DESTINATION ${CMAKE_INSTALL_LIBDIR} ARCHIVE DESTINATION ${CMAKE_INSTALL_LIBDIR}) endif() diff --git 
a/cpp/cmake_modules/CompilerInfo.cmake b/cpp/cmake_modules/CompilerInfo.cmake index 079d9d1f3270d..3c603918a82ec 100644 --- a/cpp/cmake_modules/CompilerInfo.cmake +++ b/cpp/cmake_modules/CompilerInfo.cmake @@ -19,8 +19,8 @@ # Sets COMPILER_VERSION to the version execute_process(COMMAND "${CMAKE_CXX_COMPILER}" -v ERROR_VARIABLE COMPILER_VERSION_FULL) -message(INFO " ${COMPILER_VERSION_FULL}") -message(INFO " ${CMAKE_CXX_COMPILER_ID}") +message(INFO "Compiler version: ${COMPILER_VERSION_FULL}") +message(INFO "Compiler id: ${CMAKE_CXX_COMPILER_ID}") string(TOLOWER "${COMPILER_VERSION_FULL}" COMPILER_VERSION_FULL_LOWER) if(MSVC) @@ -62,4 +62,3 @@ else() message(FATAL_ERROR "Unknown compiler. Version info:\n${COMPILER_VERSION_FULL}") endif() message("Selected compiler ${COMPILER_FAMILY} ${COMPILER_VERSION}") - diff --git a/cpp/cmake_modules/FindPythonLibsNew.cmake b/cpp/cmake_modules/FindPythonLibsNew.cmake index d9cc4b3955734..961081609cb12 100644 --- a/cpp/cmake_modules/FindPythonLibsNew.cmake +++ b/cpp/cmake_modules/FindPythonLibsNew.cmake @@ -233,12 +233,17 @@ FUNCTION(PYTHON_ADD_MODULE _NAME ) # segfaults, so do this dynamic lookup instead. SET_TARGET_PROPERTIES(${_NAME} PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") + ELSEIF(MSVC) + target_link_libraries(${_NAME} ${PYTHON_LIBRARIES}) ELSE() - # In general, we should not link against libpython as we do not embed - # the Python interpreter. The python binary itself can then define where - # the symbols should loaded from. - SET_TARGET_PROPERTIES(${_NAME} PROPERTIES LINK_FLAGS - "-Wl,-undefined,dynamic_lookup") + # In general, we should not link against libpython as we do not embed the + # Python interpreter. The python binary itself can then define where the + # symbols should loaded from. For being manylinux1 compliant, one is not + # allowed to link to libpython. Partly because not all systems ship it, + # also because the interpreter ABI/API was not stable between patch + # releases for Python < 3.5. 
+ SET_TARGET_PROPERTIES(${_NAME} PROPERTIES LINK_FLAGS + "-Wl,-undefined,dynamic_lookup") ENDIF() IF(PYTHON_MODULE_${_NAME}_BUILD_SHARED) SET_TARGET_PROPERTIES(${_NAME} PROPERTIES PREFIX "${PYTHON_MODULE_PREFIX}") diff --git a/cpp/src/arrow/io/hdfs-internal.cc b/cpp/src/arrow/io/hdfs-internal.cc index e4b2cd55978cb..e67419b5fa501 100644 --- a/cpp/src/arrow/io/hdfs-internal.cc +++ b/cpp/src/arrow/io/hdfs-internal.cc @@ -409,7 +409,7 @@ int LibHdfsShim::SetWorkingDirectory(hdfsFS fs, const char* path) { } } -int LibHdfsShim::CreateDirectory(hdfsFS fs, const char* path) { +int LibHdfsShim::MakeDirectory(hdfsFS fs, const char* path) { return this->hdfsCreateDirectory(fs, path); } diff --git a/cpp/src/arrow/io/hdfs-internal.h b/cpp/src/arrow/io/hdfs-internal.h index 01cf1499857d9..c5ea397af0bd5 100644 --- a/cpp/src/arrow/io/hdfs-internal.h +++ b/cpp/src/arrow/io/hdfs-internal.h @@ -173,7 +173,7 @@ struct LibHdfsShim { int SetWorkingDirectory(hdfsFS fs, const char* path); - int CreateDirectory(hdfsFS fs, const char* path); + int MakeDirectory(hdfsFS fs, const char* path); int SetReplication(hdfsFS fs, const char* path, int16_t replication); diff --git a/cpp/src/arrow/io/hdfs.cc b/cpp/src/arrow/io/hdfs.cc index 3510ba183d8e4..a27e132d155b1 100644 --- a/cpp/src/arrow/io/hdfs.cc +++ b/cpp/src/arrow/io/hdfs.cc @@ -347,8 +347,8 @@ class HdfsClient::HdfsClientImpl { return Status::OK(); } - Status CreateDirectory(const std::string& path) { - int ret = driver_->CreateDirectory(fs_, path.c_str()); + Status MakeDirectory(const std::string& path) { + int ret = driver_->MakeDirectory(fs_, path.c_str()); CHECK_FAILURE(ret, "create directory"); return Status::OK(); } @@ -505,8 +505,8 @@ Status HdfsClient::Connect( return Status::OK(); } -Status HdfsClient::CreateDirectory(const std::string& path) { - return impl_->CreateDirectory(path); +Status HdfsClient::MakeDirectory(const std::string& path) { + return impl_->MakeDirectory(path); } Status HdfsClient::Delete(const std::string& path, bool recursive) { diff --git a/cpp/src/arrow/io/hdfs.h b/cpp/src/arrow/io/hdfs.h index e3f5442f48ead..f3de4a2bf174f 100644 --- a/cpp/src/arrow/io/hdfs.h +++ b/cpp/src/arrow/io/hdfs.h @@ -82,7 +82,7 @@ class ARROW_EXPORT HdfsClient : public FileSystemClient { // // @param path (in): absolute HDFS path // @returns Status - Status CreateDirectory(const std::string& path); + Status MakeDirectory(const std::string& path); // Delete file or directory // @param path: absolute path to data diff --git a/cpp/src/arrow/io/io-hdfs-test.cc b/cpp/src/arrow/io/io-hdfs-test.cc index 0a9f5d9885e19..0fdb897d94410 100644 --- a/cpp/src/arrow/io/io-hdfs-test.cc +++ b/cpp/src/arrow/io/io-hdfs-test.cc @@ -45,7 +45,7 @@ class TestHdfsClient : public ::testing::Test { if (client_->Exists(scratch_dir_)) { RETURN_NOT_OK((client_->Delete(scratch_dir_, true))); } - return client_->CreateDirectory(scratch_dir_); + return client_->MakeDirectory(scratch_dir_); } Status WriteDummyFile(const std::string& path, const uint8_t* buffer, int64_t size, @@ -161,14 +161,14 @@ TYPED_TEST(TestHdfsClient, ConnectsAgain) { ASSERT_OK(client->Disconnect()); } -TYPED_TEST(TestHdfsClient, CreateDirectory) { +TYPED_TEST(TestHdfsClient, MakeDirectory) { SKIP_IF_NO_DRIVER(); std::string path = this->ScratchPath("create-directory"); if (this->client_->Exists(path)) { ASSERT_OK(this->client_->Delete(path, true)); } - ASSERT_OK(this->client_->CreateDirectory(path)); + ASSERT_OK(this->client_->MakeDirectory(path)); ASSERT_TRUE(this->client_->Exists(path)); std::vector listing; 
EXPECT_OK(this->client_->ListDirectory(path, &listing)); @@ -253,7 +253,7 @@ TYPED_TEST(TestHdfsClient, ListDirectory) { ASSERT_OK(this->MakeScratchDir()); ASSERT_OK(this->WriteDummyFile(p1, data.data(), size)); ASSERT_OK(this->WriteDummyFile(p2, data.data(), size / 2)); - ASSERT_OK(this->client_->CreateDirectory(d1)); + ASSERT_OK(this->client_->MakeDirectory(d1)); std::vector listing; ASSERT_OK(this->client_->ListDirectory(this->scratch_dir_, &listing)); diff --git a/cpp/src/arrow/ipc/feather-internal.h b/cpp/src/arrow/ipc/feather-internal.h index 6847445149bb0..646c3b2f9f2e3 100644 --- a/cpp/src/arrow/ipc/feather-internal.h +++ b/cpp/src/arrow/ipc/feather-internal.h @@ -75,7 +75,7 @@ struct ARROW_EXPORT CategoryMetadata { }; struct ARROW_EXPORT TimestampMetadata { - TimeUnit unit; + TimeUnit::type unit; // A timezone name known to the Olson timezone database. For display purposes // because the actual data is all UTC @@ -83,7 +83,7 @@ struct ARROW_EXPORT TimestampMetadata { }; struct ARROW_EXPORT TimeMetadata { - TimeUnit unit; + TimeUnit::type unit; }; static constexpr const char* kFeatherMagicBytes = "FEA1"; @@ -156,12 +156,12 @@ static inline flatbuffers::Offset GetPrimitiveArray( array.length, array.null_count, array.total_bytes); } -static inline fbs::TimeUnit ToFlatbufferEnum(TimeUnit unit) { +static inline fbs::TimeUnit ToFlatbufferEnum(TimeUnit::type unit) { return static_cast(static_cast(unit)); } -static inline TimeUnit FromFlatbufferEnum(fbs::TimeUnit unit) { - return static_cast(static_cast(unit)); +static inline TimeUnit::type FromFlatbufferEnum(fbs::TimeUnit unit) { + return static_cast(static_cast(unit)); } // Convert Feather enums to Flatbuffer enums @@ -197,10 +197,10 @@ class ARROW_EXPORT ColumnBuilder { void SetValues(const ArrayMetadata& values); void SetUserMetadata(const std::string& data); void SetCategory(const ArrayMetadata& levels, bool ordered = false); - void SetTimestamp(TimeUnit unit); - void SetTimestamp(TimeUnit unit, const std::string& timezone); + void SetTimestamp(TimeUnit::type unit); + void SetTimestamp(TimeUnit::type unit, const std::string& timezone); void SetDate(); - void SetTime(TimeUnit unit); + void SetTime(TimeUnit::type unit); FBB& fbb(); private: diff --git a/cpp/src/arrow/ipc/feather.cc b/cpp/src/arrow/ipc/feather.cc index 5dc039662ce9d..7d0abdda1aadc 100644 --- a/cpp/src/arrow/ipc/feather.cc +++ b/cpp/src/arrow/ipc/feather.cc @@ -184,12 +184,12 @@ void ColumnBuilder::SetCategory(const ArrayMetadata& levels, bool ordered) { meta_category_.ordered = ordered; } -void ColumnBuilder::SetTimestamp(TimeUnit unit) { +void ColumnBuilder::SetTimestamp(TimeUnit::type unit) { type_ = ColumnType::TIMESTAMP; meta_timestamp_.unit = unit; } -void ColumnBuilder::SetTimestamp(TimeUnit unit, const std::string& timezone) { +void ColumnBuilder::SetTimestamp(TimeUnit::type unit, const std::string& timezone) { SetTimestamp(unit); meta_timestamp_.timezone = timezone; } @@ -198,7 +198,7 @@ void ColumnBuilder::SetDate() { type_ = ColumnType::DATE; } -void ColumnBuilder::SetTime(TimeUnit unit) { +void ColumnBuilder::SetTime(TimeUnit::type unit) { type_ = ColumnType::TIME; meta_time_.unit = unit; } @@ -279,7 +279,7 @@ class TableReader::TableReaderImpl { } case fbs::TypeMetadata_TimestampMetadata: { auto meta = static_cast(metadata); - TimeUnit unit = FromFlatbufferEnum(meta->unit()); + TimeUnit::type unit = FromFlatbufferEnum(meta->unit()); std::string tz; // flatbuffer non-null if (meta->timezone() != 0) { diff --git a/cpp/src/arrow/ipc/json-internal.cc 
b/cpp/src/arrow/ipc/json-internal.cc index 18ee8349da66a..2ab3acba37f8d 100644 --- a/cpp/src/arrow/ipc/json-internal.cc +++ b/cpp/src/arrow/ipc/json-internal.cc @@ -77,7 +77,7 @@ static std::string GetFloatingPrecisionName(FloatingPoint::Precision precision) return "UNKNOWN"; } -static std::string GetTimeUnitName(TimeUnit unit) { +static std::string GetTimeUnitName(TimeUnit::type unit) { switch (unit) { case TimeUnit::SECOND: return "SECOND"; @@ -645,7 +645,7 @@ static Status GetTimestamp(const RjObject& json_type, std::shared_ptr* std::string unit_str = json_unit->value.GetString(); - TimeUnit unit; + TimeUnit::type unit; if (unit_str == "SECOND") { unit = TimeUnit::SECOND; } else if (unit_str == "MILLISECOND") { diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index ee21156c08c67..791948b50b0ac 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -188,7 +188,7 @@ static Status UnionToFlatBuffer(FBB& fbb, const std::shared_ptr& type, *offset = IntToFlatbuffer(fbb, BIT_WIDTH, IS_SIGNED); \ break; -static inline flatbuf::TimeUnit ToFlatbufferUnit(TimeUnit unit) { +static inline flatbuf::TimeUnit ToFlatbufferUnit(TimeUnit::type unit) { switch (unit) { case TimeUnit::SECOND: return flatbuf::TimeUnit_SECOND; @@ -204,7 +204,7 @@ static inline flatbuf::TimeUnit ToFlatbufferUnit(TimeUnit unit) { return flatbuf::TimeUnit_MIN; } -static inline TimeUnit FromFlatbufferUnit(flatbuf::TimeUnit unit) { +static inline TimeUnit::type FromFlatbufferUnit(flatbuf::TimeUnit unit) { switch (unit) { case flatbuf::TimeUnit_SECOND: return TimeUnit::SECOND; @@ -258,7 +258,7 @@ static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, } case flatbuf::Type_Time: { auto time_type = static_cast(type_data); - TimeUnit unit = FromFlatbufferUnit(time_type->unit()); + TimeUnit::type unit = FromFlatbufferUnit(time_type->unit()); int32_t bit_width = time_type->bitWidth(); switch (unit) { case TimeUnit::SECOND: @@ -279,7 +279,7 @@ static Status TypeFromFlatbuffer(flatbuf::Type type, const void* type_data, } case flatbuf::Type_Timestamp: { auto ts_type = static_cast(type_data); - TimeUnit unit = FromFlatbufferUnit(ts_type->unit()); + TimeUnit::type unit = FromFlatbufferUnit(ts_type->unit()); if (ts_type->timezone() != 0 && ts_type->timezone()->Length() > 0) { *out = timestamp(unit, ts_type->timezone()->str()); } else { diff --git a/cpp/src/arrow/python/CMakeLists.txt b/cpp/src/arrow/python/CMakeLists.txt index 8f7991e7f6832..607a1c436c45d 100644 --- a/cpp/src/arrow/python/CMakeLists.txt +++ b/cpp/src/arrow/python/CMakeLists.txt @@ -17,18 +17,18 @@ if (ARROW_BUILD_TESTS) add_library(arrow_python_test_main STATIC util/test_main.cc) + target_link_libraries(arrow_python_test_main + gtest) + if (APPLE) target_link_libraries(arrow_python_test_main - gtest - dl) + ${CMAKE_DL_LIBS}) set_target_properties(arrow_python_test_main PROPERTIES LINK_FLAGS "-undefined dynamic_lookup") - else() + elseif(NOT MSVC) target_link_libraries(arrow_python_test_main - gtest pthread - dl - ) + ${CMAKE_DL_LIBS}) endif() endif() @@ -38,12 +38,6 @@ set(ARROW_PYTHON_MIN_TEST_LIBS arrow_static ${BOOST_REGEX_LIBRARY}) -if(ARROW_BUILD_TESTS) - ADD_THIRDPARTY_LIB(python - SHARED_LIB "${PYTHON_LIBRARIES}") - list(APPEND ARROW_PYTHON_MIN_TEST_LIBS python) -endif() - set(ARROW_PYTHON_TEST_LINK_LIBS ${ARROW_PYTHON_MIN_TEST_LIBS}) # ---------------------------------------------------------------------- @@ -53,6 +47,7 @@ set(ARROW_PYTHON_SRCS common.cc config.cc helpers.cc + init.cc io.cc 
numpy_convert.cc pandas_convert.cc @@ -66,9 +61,14 @@ ADD_ARROW_LIB(arrow_python SOURCES ${ARROW_PYTHON_SRCS} SHARED_LINK_FLAGS "" SHARED_LINK_LIBS ${ARROW_PYTHON_SHARED_LINK_LIBS} - STATIC_LINK_LIBS ${ARROW_IO_SHARED_PRIVATE_LINK_LIBS} + STATIC_LINK_LIBS "" ) +if (MSVC) + target_link_libraries(arrow_python_shared + ${PYTHON_LIBRARIES}) +endif() + if ("${COMPILER_FAMILY}" STREQUAL "clang") # Clang, be quiet. Python C API has lots of macros set_property(SOURCE ${ARROW_PYTHON_SRCS} @@ -82,19 +82,19 @@ install(FILES builtin_convert.h common.h config.h - do_import_numpy.h helpers.h + init.h io.h numpy_convert.h numpy_interop.h pandas_convert.h + platform.h type_traits.h DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}/arrow/python") -# set_target_properties(arrow_python_shared PROPERTIES -# INSTALL_RPATH "\$ORIGIN") - if (ARROW_BUILD_TESTS) ADD_ARROW_TEST(python-test STATIC_LINK_LIBS "${ARROW_PYTHON_TEST_LINK_LIBS}") + target_link_libraries(python-test + ${PYTHON_LIBRARIES}) endif() diff --git a/cpp/src/arrow/python/builtin_convert.cc b/cpp/src/arrow/python/builtin_convert.cc index 8cc9876fa9fc5..137937c0946df 100644 --- a/cpp/src/arrow/python/builtin_convert.cc +++ b/cpp/src/arrow/python/builtin_convert.cc @@ -15,7 +15,8 @@ // specific language governing permissions and limitations // under the License. -#include +#include "arrow/python/platform.h" + #include #include diff --git a/cpp/src/arrow/python/builtin_convert.h b/cpp/src/arrow/python/builtin_convert.h index 2141c25e95ef0..a6180d496a920 100644 --- a/cpp/src/arrow/python/builtin_convert.h +++ b/cpp/src/arrow/python/builtin_convert.h @@ -21,13 +21,12 @@ #ifndef ARROW_PYTHON_ADAPTERS_BUILTIN_H #define ARROW_PYTHON_ADAPTERS_BUILTIN_H -#include +#include "arrow/python/platform.h" #include #include #include "arrow/type.h" - #include "arrow/util/visibility.h" #include "arrow/python/common.h" diff --git a/cpp/src/arrow/python/common.h b/cpp/src/arrow/python/common.h index a6806ab95ab95..882bb156224c0 100644 --- a/cpp/src/arrow/python/common.h +++ b/cpp/src/arrow/python/common.h @@ -32,7 +32,7 @@ class MemoryPool; namespace py { -class PyAcquireGIL { +class ARROW_EXPORT PyAcquireGIL { public: PyAcquireGIL() { state_ = PyGILState_Ensure(); } @@ -45,7 +45,7 @@ class PyAcquireGIL { #define PYARROW_IS_PY2 PY_MAJOR_VERSION <= 2 -class OwnedRef { +class ARROW_EXPORT OwnedRef { public: OwnedRef() : obj_(nullptr) {} @@ -70,7 +70,7 @@ class OwnedRef { PyObject* obj_; }; -struct PyObjectStringify { +struct ARROW_EXPORT PyObjectStringify { OwnedRef tmp_obj; const char* bytes; Py_ssize_t size; diff --git a/cpp/src/arrow/python/config.cc b/cpp/src/arrow/python/config.cc index c2a69168bb01e..3cec7c41a2f31 100644 --- a/cpp/src/arrow/python/config.cc +++ b/cpp/src/arrow/python/config.cc @@ -15,7 +15,7 @@ // specific language governing permissions and limitations // under the License. 
-#include +#include "arrow/python/platform.h" #include #include "arrow/python/config.h" @@ -23,8 +23,6 @@ namespace arrow { namespace py { -void Init() {} - PyObject* numpy_nan = nullptr; void set_numpy_nan(PyObject* obj) { diff --git a/cpp/src/arrow/python/config.h b/cpp/src/arrow/python/config.h index c13272667540a..c2b089d382c00 100644 --- a/cpp/src/arrow/python/config.h +++ b/cpp/src/arrow/python/config.h @@ -18,8 +18,7 @@ #ifndef ARROW_PYTHON_CONFIG_H #define ARROW_PYTHON_CONFIG_H -#include -#include +#include "arrow/python/platform.h" #include "arrow/python/numpy_interop.h" #include "arrow/util/visibility.h" @@ -34,9 +33,6 @@ namespace py { ARROW_EXPORT extern PyObject* numpy_nan; -ARROW_EXPORT -void Init(); - ARROW_EXPORT void set_numpy_nan(PyObject* obj); diff --git a/cpp/src/arrow/python/helpers.cc b/cpp/src/arrow/python/helpers.cc index 3d3d07a515833..f7c73a87fbf16 100644 --- a/cpp/src/arrow/python/helpers.cc +++ b/cpp/src/arrow/python/helpers.cc @@ -92,11 +92,11 @@ Status PythonDecimalToArrowDecimal( return FromString(c_string, arrow_decimal); } -template Status PythonDecimalToArrowDecimal( +template Status ARROW_TEMPLATE_EXPORT PythonDecimalToArrowDecimal( PyObject* python_decimal, decimal::Decimal32* arrow_decimal); -template Status PythonDecimalToArrowDecimal( +template Status ARROW_TEMPLATE_EXPORT PythonDecimalToArrowDecimal( PyObject* python_decimal, decimal::Decimal64* arrow_decimal); -template Status PythonDecimalToArrowDecimal( +template Status ARROW_TEMPLATE_EXPORT PythonDecimalToArrowDecimal( PyObject* python_decimal, decimal::Decimal128* arrow_decimal); Status InferDecimalPrecisionAndScale( diff --git a/cpp/src/arrow/python/helpers.h b/cpp/src/arrow/python/helpers.h index 77fde263de7e0..c6402d8796fe2 100644 --- a/cpp/src/arrow/python/helpers.h +++ b/cpp/src/arrow/python/helpers.h @@ -18,7 +18,7 @@ #ifndef PYARROW_HELPERS_H #define PYARROW_HELPERS_H -#include +#include "arrow/python/platform.h" #include #include @@ -42,18 +42,18 @@ class OwnedRef; ARROW_EXPORT std::shared_ptr GetPrimitiveType(Type::type type); -Status ImportModule(const std::string& module_name, OwnedRef* ref); -Status ImportFromModule( +Status ARROW_EXPORT ImportModule(const std::string& module_name, OwnedRef* ref); +Status ARROW_EXPORT ImportFromModule( const OwnedRef& module, const std::string& module_name, OwnedRef* ref); template -Status PythonDecimalToArrowDecimal( +Status ARROW_EXPORT PythonDecimalToArrowDecimal( PyObject* python_decimal, decimal::Decimal* arrow_decimal); -Status InferDecimalPrecisionAndScale( +Status ARROW_EXPORT InferDecimalPrecisionAndScale( PyObject* python_decimal, int* precision = nullptr, int* scale = nullptr); -Status DecimalFromString( +Status ARROW_EXPORT DecimalFromString( PyObject* decimal_constructor, const std::string& decimal_string, PyObject** out); } // namespace py diff --git a/cpp/src/arrow/python/do_import_numpy.h b/cpp/src/arrow/python/init.cc similarity index 79% rename from cpp/src/arrow/python/do_import_numpy.h rename to cpp/src/arrow/python/init.cc index bb4a382959102..fa70af7e44db3 100644 --- a/cpp/src/arrow/python/do_import_numpy.h +++ b/cpp/src/arrow/python/init.cc @@ -15,7 +15,20 @@ // specific language governing permissions and limitations // under the License. 
-// Trick borrowed from dynd-python for initializing the NumPy array API +#include "arrow/python/platform.h" // Trigger the array import (inversion of NO_IMPORT_ARRAY) #define NUMPY_IMPORT_ARRAY + +#include "arrow/python/init.h" +#include "arrow/python/numpy_interop.h" + +namespace arrow { +namespace py { + +void InitNumPy() { + import_numpy(); +} + +} // namespace py +} // namespace arrow diff --git a/cpp/src/arrow/python/init.h b/cpp/src/arrow/python/init.h new file mode 100644 index 0000000000000..a2533d8059273 --- /dev/null +++ b/cpp/src/arrow/python/init.h @@ -0,0 +1,35 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_PYTHON_INIT_H +#define ARROW_PYTHON_INIT_H + +#include "arrow/python/platform.h" + +#include "arrow/python/numpy_interop.h" +#include "arrow/util/visibility.h" + +namespace arrow { +namespace py { + +ARROW_EXPORT +void InitNumPy(); + +} // namespace py +} // namespace arrow + +#endif // ARROW_PYTHON_INIT_H diff --git a/cpp/src/arrow/python/io.h b/cpp/src/arrow/python/io.h index 905bd6c7a6aed..bf14cd6f45dbd 100644 --- a/cpp/src/arrow/python/io.h +++ b/cpp/src/arrow/python/io.h @@ -34,7 +34,7 @@ namespace py { // A common interface to a Python file-like object. Must acquire GIL before // calling any methods -class PythonFile { +class ARROW_EXPORT PythonFile { public: explicit PythonFile(PyObject* file); ~PythonFile(); diff --git a/cpp/src/arrow/python/numpy-internal.h b/cpp/src/arrow/python/numpy-internal.h index fcc6a58f2a347..f1ef7dadde084 100644 --- a/cpp/src/arrow/python/numpy-internal.h +++ b/cpp/src/arrow/python/numpy-internal.h @@ -20,12 +20,11 @@ #ifndef ARROW_PYTHON_NUMPY_INTERNAL_H #define ARROW_PYTHON_NUMPY_INTERNAL_H -#include +#include "arrow/python/numpy_interop.h" -#include +#include "arrow/python/platform.h" -#include "arrow/python/numpy_convert.h" -#include "arrow/python/numpy_interop.h" +#include namespace arrow { namespace py { diff --git a/cpp/src/arrow/python/numpy_convert.cc b/cpp/src/arrow/python/numpy_convert.cc index ab79e179c7ea5..2c1a5910f06d5 100644 --- a/cpp/src/arrow/python/numpy_convert.cc +++ b/cpp/src/arrow/python/numpy_convert.cc @@ -15,10 +15,9 @@ // specific language governing permissions and limitations // under the License. 
-#include +#include "arrow/python/numpy_interop.h" #include "arrow/python/numpy_convert.h" -#include "arrow/python/numpy_interop.h" #include #include @@ -38,8 +37,8 @@ namespace py { bool is_contiguous(PyObject* array) { if (PyArray_Check(array)) { - return PyArray_FLAGS(reinterpret_cast(array)) & - (NPY_ARRAY_C_CONTIGUOUS | NPY_ARRAY_F_CONTIGUOUS); + return (PyArray_FLAGS(reinterpret_cast(array)) & + (NPY_ARRAY_C_CONTIGUOUS | NPY_ARRAY_F_CONTIGUOUS)) != 0; } else { return false; } @@ -167,7 +166,7 @@ Status NumPyDtypeToArrow(PyObject* dtype, std::shared_ptr* out) { case NPY_DATETIME: { auto date_dtype = reinterpret_cast(descr->c_metadata); - TimeUnit unit; + TimeUnit::type unit; switch (date_dtype->meta.base) { case NPY_FR_s: unit = TimeUnit::SECOND; diff --git a/cpp/src/arrow/python/numpy_convert.h b/cpp/src/arrow/python/numpy_convert.h index c2526403213b1..a486646cdec64 100644 --- a/cpp/src/arrow/python/numpy_convert.h +++ b/cpp/src/arrow/python/numpy_convert.h @@ -21,7 +21,7 @@ #ifndef ARROW_PYTHON_NUMPY_CONVERT_H #define ARROW_PYTHON_NUMPY_CONVERT_H -#include +#include "arrow/python/platform.h" #include #include @@ -48,14 +48,19 @@ class ARROW_EXPORT NumPyBuffer : public Buffer { }; // Handle misbehaved types like LONGLONG and ULONGLONG +ARROW_EXPORT int cast_npy_type_compat(int type_num); +ARROW_EXPORT bool is_contiguous(PyObject* array); ARROW_EXPORT Status NumPyDtypeToArrow(PyObject* dtype, std::shared_ptr* out); +ARROW_EXPORT Status GetTensorType(PyObject* dtype, std::shared_ptr* out); + +ARROW_EXPORT Status GetNumPyType(const DataType& type, int* type_num); ARROW_EXPORT Status NdarrayToTensor( diff --git a/cpp/src/arrow/python/numpy_interop.h b/cpp/src/arrow/python/numpy_interop.h index 0a4b425e734f7..b93200cc8972d 100644 --- a/cpp/src/arrow/python/numpy_interop.h +++ b/cpp/src/arrow/python/numpy_interop.h @@ -18,7 +18,7 @@ #ifndef PYARROW_NUMPY_INTEROP_H #define PYARROW_NUMPY_INTEROP_H -#include +#include "arrow/python/platform.h" #include diff --git a/cpp/src/arrow/python/pandas_convert.cc b/cpp/src/arrow/python/pandas_convert.cc index 5cdcb6fa49602..636a3fd15c044 100644 --- a/cpp/src/arrow/python/pandas_convert.cc +++ b/cpp/src/arrow/python/pandas_convert.cc @@ -17,9 +17,8 @@ // Functions for pandas conversion via NumPy -#include - #include "arrow/python/numpy_interop.h" + #include "arrow/python/pandas_convert.h" #include @@ -490,7 +489,7 @@ struct UnboxDate {}; template <> struct UnboxDate { - static int64_t Unbox(PyObject* obj) { + static int32_t Unbox(PyObject* obj) { return PyDate_to_days(reinterpret_cast(obj)); } }; diff --git a/cpp/src/arrow/python/pandas_convert.h b/cpp/src/arrow/python/pandas_convert.h index fd901d8f09fce..45c8a1a21fe20 100644 --- a/cpp/src/arrow/python/pandas_convert.h +++ b/cpp/src/arrow/python/pandas_convert.h @@ -21,7 +21,7 @@ #ifndef ARROW_PYTHON_ADAPTERS_PANDAS_H #define ARROW_PYTHON_ADAPTERS_PANDAS_H -#include +#include "arrow/python/platform.h" #include #include diff --git a/cpp/src/arrow/python/platform.h b/cpp/src/arrow/python/platform.h new file mode 100644 index 0000000000000..38f8e0f1c23ff --- /dev/null +++ b/cpp/src/arrow/python/platform.h @@ -0,0 +1,32 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +// Functions for converting between pandas's NumPy-based data representation +// and Arrow data structures + +#ifndef ARROW_PYTHON_PLATFORM_H +#define ARROW_PYTHON_PLATFORM_H + +#include +#include + +// Work around C2528 error +#if _MSC_VER >= 1900 +#undef timezone +#endif + +#endif // ARROW_PYTHON_PLATFORM_H diff --git a/cpp/src/arrow/python/python-test.cc b/cpp/src/arrow/python/python-test.cc index a4a11c039b60c..cbc93776f98ef 100644 --- a/cpp/src/arrow/python/python-test.cc +++ b/cpp/src/arrow/python/python-test.cc @@ -19,7 +19,7 @@ #include -#include +#include "arrow/python/platform.h" #include "arrow/array.h" #include "arrow/builder.h" diff --git a/cpp/src/arrow/python/type_traits.h b/cpp/src/arrow/python/type_traits.h index c464d65a4946c..26b15bdc9f464 100644 --- a/cpp/src/arrow/python/type_traits.h +++ b/cpp/src/arrow/python/type_traits.h @@ -15,7 +15,7 @@ // specific language governing permissions and limitations // under the License. -#include +#include "arrow/python/platform.h" #include #include diff --git a/cpp/src/arrow/python/util/datetime.h b/cpp/src/arrow/python/util/datetime.h index 82cf6fc48cad4..852f426c157c2 100644 --- a/cpp/src/arrow/python/util/datetime.h +++ b/cpp/src/arrow/python/util/datetime.h @@ -18,7 +18,7 @@ #ifndef PYARROW_UTIL_DATETIME_H #define PYARROW_UTIL_DATETIME_H -#include +#include "arrow/python/platform.h" #include namespace arrow { diff --git a/cpp/src/arrow/python/util/test_main.cc b/cpp/src/arrow/python/util/test_main.cc index c83514d0dbd37..c24da40aadcf6 100644 --- a/cpp/src/arrow/python/util/test_main.cc +++ b/cpp/src/arrow/python/util/test_main.cc @@ -15,18 +15,17 @@ // specific language governing permissions and limitations // under the License. 
-#include +#include "arrow/python/platform.h" #include -#include "arrow/python/do_import_numpy.h" -#include "arrow/python/numpy_interop.h" +#include "arrow/python/init.h" int main(int argc, char** argv) { ::testing::InitGoogleTest(&argc, argv); Py_Initialize(); - arrow::py::import_numpy(); + arrow::py::InitNumPy(); int ret = RUN_ALL_TESTS(); diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index a2300d6029e39..2e454ae81886f 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -134,10 +134,10 @@ std::string Date32Type::ToString() const { // ---------------------------------------------------------------------- // Time types -TimeType::TimeType(Type::type type_id, TimeUnit unit) +TimeType::TimeType(Type::type type_id, TimeUnit::type unit) : FixedWidthType(type_id), unit_(unit) {} -Time32Type::Time32Type(TimeUnit unit) : TimeType(Type::TIME32, unit) { +Time32Type::Time32Type(TimeUnit::type unit) : TimeType(Type::TIME32, unit) { DCHECK(unit == TimeUnit::SECOND || unit == TimeUnit::MILLI) << "Must be seconds or milliseconds"; } @@ -148,7 +148,7 @@ std::string Time32Type::ToString() const { return ss.str(); } -Time64Type::Time64Type(TimeUnit unit) : TimeType(Type::TIME64, unit) { +Time64Type::Time64Type(TimeUnit::type unit) : TimeType(Type::TIME64, unit) { DCHECK(unit == TimeUnit::MICRO || unit == TimeUnit::NANO) << "Must be microseconds or nanoseconds"; } @@ -338,19 +338,19 @@ std::shared_ptr fixed_size_binary(int32_t byte_width) { return std::make_shared(byte_width); } -std::shared_ptr timestamp(TimeUnit unit) { +std::shared_ptr timestamp(TimeUnit::type unit) { return std::make_shared(unit); } -std::shared_ptr timestamp(TimeUnit unit, const std::string& timezone) { +std::shared_ptr timestamp(TimeUnit::type unit, const std::string& timezone) { return std::make_shared(unit, timezone); } -std::shared_ptr time32(TimeUnit unit) { +std::shared_ptr time32(TimeUnit::type unit) { return std::make_shared(unit); } -std::shared_ptr time64(TimeUnit unit) { +std::shared_ptr time64(TimeUnit::type unit) { return std::make_shared(unit); } diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index 6810b35f05b70..ea4ea03ff569a 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -538,9 +538,11 @@ class ARROW_EXPORT Date64Type : public DateType { static std::string name() { return "date"; } }; -enum class TimeUnit : char { SECOND = 0, MILLI = 1, MICRO = 2, NANO = 3 }; +struct TimeUnit { + enum type { SECOND = 0, MILLI = 1, MICRO = 2, NANO = 3 }; +}; -static inline std::ostream& operator<<(std::ostream& os, TimeUnit unit) { +static inline std::ostream& operator<<(std::ostream& os, TimeUnit::type unit) { switch (unit) { case TimeUnit::SECOND: os << "s"; @@ -560,11 +562,11 @@ static inline std::ostream& operator<<(std::ostream& os, TimeUnit unit) { class ARROW_EXPORT TimeType : public FixedWidthType { public: - TimeUnit unit() const { return unit_; } + TimeUnit::type unit() const { return unit_; } protected: - TimeType(Type::type type_id, TimeUnit unit); - TimeUnit unit_; + TimeType(Type::type type_id, TimeUnit::type unit); + TimeUnit::type unit_; }; class ARROW_EXPORT Time32Type : public TimeType { @@ -574,7 +576,7 @@ class ARROW_EXPORT Time32Type : public TimeType { int bit_width() const override { return static_cast(sizeof(c_type) * CHAR_BIT); } - explicit Time32Type(TimeUnit unit = TimeUnit::MILLI); + explicit Time32Type(TimeUnit::type unit = TimeUnit::MILLI); Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; @@ -587,7 +589,7 @@ class 
ARROW_EXPORT Time64Type : public TimeType { int bit_width() const override { return static_cast(sizeof(c_type) * CHAR_BIT); } - explicit Time64Type(TimeUnit unit = TimeUnit::MILLI); + explicit Time64Type(TimeUnit::type unit = TimeUnit::MILLI); Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; @@ -602,21 +604,21 @@ class ARROW_EXPORT TimestampType : public FixedWidthType { int bit_width() const override { return static_cast(sizeof(int64_t) * CHAR_BIT); } - explicit TimestampType(TimeUnit unit = TimeUnit::MILLI) + explicit TimestampType(TimeUnit::type unit = TimeUnit::MILLI) : FixedWidthType(Type::TIMESTAMP), unit_(unit) {} - explicit TimestampType(TimeUnit unit, const std::string& timezone) + explicit TimestampType(TimeUnit::type unit, const std::string& timezone) : FixedWidthType(Type::TIMESTAMP), unit_(unit), timezone_(timezone) {} Status Accept(TypeVisitor* visitor) const override; std::string ToString() const override; static std::string name() { return "timestamp"; } - TimeUnit unit() const { return unit_; } + TimeUnit::type unit() const { return unit_; } const std::string& timezone() const { return timezone_; } private: - TimeUnit unit_; + TimeUnit::type unit_; std::string timezone_; }; @@ -710,15 +712,15 @@ std::shared_ptr ARROW_EXPORT fixed_size_binary(int32_t byte_width); std::shared_ptr ARROW_EXPORT list(const std::shared_ptr& value_type); std::shared_ptr ARROW_EXPORT list(const std::shared_ptr& value_type); -std::shared_ptr ARROW_EXPORT timestamp(TimeUnit unit); +std::shared_ptr ARROW_EXPORT timestamp(TimeUnit::type unit); std::shared_ptr ARROW_EXPORT timestamp( - TimeUnit unit, const std::string& timezone); + TimeUnit::type unit, const std::string& timezone); /// Unit can be either SECOND or MILLI -std::shared_ptr ARROW_EXPORT time32(TimeUnit unit); +std::shared_ptr ARROW_EXPORT time32(TimeUnit::type unit); /// Unit can be either MICRO or NANO -std::shared_ptr ARROW_EXPORT time64(TimeUnit unit); +std::shared_ptr ARROW_EXPORT time64(TimeUnit::type unit); std::shared_ptr ARROW_EXPORT struct_( const std::vector>& fields); diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 36052bc257232..c1431af67ed55 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -28,7 +28,7 @@ set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_SOURCE_DIR}/../cpp/cmake_mod include(CMakeParseArguments) -set(BUILD_SUPPORT_DIR ${CMAKE_CURRENT_SOURCE_DIR}/../cpp/build-support) +set(BUILD_SUPPORT_DIR "${CMAKE_CURRENT_SOURCE_DIR}/../cpp/build-support") # Allow "make install" to not depend on all targets. 
# @@ -58,10 +58,6 @@ if("${CMAKE_SOURCE_DIR}" STREQUAL "${CMAKE_CURRENT_SOURCE_DIR}") OFF) endif() -if(NOT PYARROW_BUILD_TESTS) - set(NO_TESTS 1) -endif() - find_program(CCACHE_FOUND ccache) if(CCACHE_FOUND) set_property(GLOBAL PROPERTY RULE_LAUNCH_COMPILE ccache) @@ -73,18 +69,19 @@ endif(CCACHE_FOUND) ############################################################ include(BuildUtils) -include(CompilerInfo) include(SetupCxxFlags) +include(CompilerInfo) # Add common flags set(CMAKE_CXX_FLAGS "${CXX_COMMON_FLAGS} ${CMAKE_CXX_FLAGS}") -# Enable perf and other tools to work properly -set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-omit-frame-pointer") - -# Suppress Cython warnings -set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-unused-variable") +if (NOT MSVC) + # Enable perf and other tools to work properly + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fno-omit-frame-pointer") + # Suppress Cython warnings + set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wno-unused-variable") +endif() if ("${COMPILER_FAMILY}" STREQUAL "clang") # Using Clang with ccache causes a bunch of spurious warnings that are @@ -146,9 +143,10 @@ endif() # # The gold linker is only for ELF binaries, which OSX doesn't use. We can # just skip. -if (NOT APPLE) +if (NOT APPLE AND NOT MSVC) execute_process(COMMAND ${CMAKE_CXX_COMPILER} -Wl,--version OUTPUT_VARIABLE LINKER_OUTPUT) endif () + if (LINKER_OUTPUT MATCHES "gold") if ("${PYARROW_LINK}" STREQUAL "d" AND "${CMAKE_BUILD_TYPE}" STREQUAL "RELEASE") @@ -166,9 +164,6 @@ endif() # act on its value. if ("${PYARROW_LINK}" STREQUAL "d") set(BUILD_SHARED_LIBS ON) - - # Position independent code is only necessary when producing shared objects. - add_definitions(-fPIC) endif() # set compile output directory @@ -188,9 +183,16 @@ if (${CMAKE_SOURCE_DIR} STREQUAL ${CMAKE_CURRENT_BINARY_DIR}) EXECUTE_PROCESS(COMMAND ln ${MORE_ARGS} -sf ${BUILD_OUTPUT_ROOT_DIRECTORY} ${CMAKE_CURRENT_BINARY_DIR}/build/latest) else() - set(BUILD_OUTPUT_ROOT_DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}/${BUILD_SUBDIR_NAME}/") + if (MSVC) + # MSVC makes its own output directories based on the build configuration + set(BUILD_OUTPUT_ROOT_DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}/") + else() + set(BUILD_OUTPUT_ROOT_DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}/${BUILD_SUBDIR_NAME}/") + endif() endif() +message(STATUS "Build output directory: ${BUILD_OUTPUT_ROOT_DIRECTORY}") + # where to put generated archives (.a files) set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}") set(ARCHIVE_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}") diff --git a/python/cmake_modules/CompilerInfo.cmake b/python/cmake_modules/CompilerInfo.cmake deleted file mode 100644 index 8e85bdea96ea5..0000000000000 --- a/python/cmake_modules/CompilerInfo.cmake +++ /dev/null @@ -1,48 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. 
See the License for the -# specific language governing permissions and limitations -# under the License. -# -# Sets COMPILER_FAMILY to 'clang' or 'gcc' -# Sets COMPILER_VERSION to the version -execute_process(COMMAND "${CMAKE_CXX_COMPILER}" -v - ERROR_VARIABLE COMPILER_VERSION_FULL) -message(INFO " ${COMPILER_VERSION_FULL}") - -# clang on Linux and Mac OS X before 10.9 -if("${COMPILER_VERSION_FULL}" MATCHES ".*clang version.*") - set(COMPILER_FAMILY "clang") - string(REGEX REPLACE ".*clang version ([0-9]+\\.[0-9]+).*" "\\1" - COMPILER_VERSION "${COMPILER_VERSION_FULL}") -# clang on Mac OS X 10.9 and later -elseif("${COMPILER_VERSION_FULL}" MATCHES ".*based on LLVM.*") - set(COMPILER_FAMILY "clang") - string(REGEX REPLACE ".*based on LLVM ([0-9]+\\.[0.9]+).*" "\\1" - COMPILER_VERSION "${COMPILER_VERSION_FULL}") - -# clang on Mac OS X, XCode 7+. No version replacement is done -# because Apple no longer advertises the upstream LLVM version. -elseif("${COMPILER_VERSION_FULL}" MATCHES "clang-.*") - set(COMPILER_FAMILY "clang") - -# gcc -elseif("${COMPILER_VERSION_FULL}" MATCHES ".*gcc version.*") - set(COMPILER_FAMILY "gcc") - string(REGEX REPLACE ".*gcc version ([0-9\\.]+).*" "\\1" - COMPILER_VERSION "${COMPILER_VERSION_FULL}") -else() - message(FATAL_ERROR "Unknown compiler. Version info:\n${COMPILER_VERSION_FULL}") -endif() -message("Selected compiler ${COMPILER_FAMILY} ${COMPILER_VERSION}") diff --git a/python/cmake_modules/FindArrow.cmake b/python/cmake_modules/FindArrow.cmake index 51a887189ccd4..8e13dd66b9f0f 100644 --- a/python/cmake_modules/FindArrow.cmake +++ b/python/cmake_modules/FindArrow.cmake @@ -67,18 +67,23 @@ find_library(ARROW_PYTHON_LIB_PATH NAMES arrow_python if (ARROW_INCLUDE_DIR AND ARROW_LIBS) set(ARROW_FOUND TRUE) - set(ARROW_LIB_NAME libarrow) - set(ARROW_JEMALLOC_LIB_NAME libarrow_jemalloc) - set(ARROW_PYTHON_LIB_NAME libarrow_python) - set(ARROW_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_LIB_NAME}.a) - set(ARROW_SHARED_LIB ${ARROW_LIBS}/${ARROW_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) + if (MSVC) + set(ARROW_STATIC_LIB ${ARROW_LIB_PATH}) + set(ARROW_PYTHON_STATIC_LIB ${ARROW_PYTHON_LIB_PATH}) + set(ARROW_JEMALLOC_STATIC_LIB ${ARROW_JEMALLOC_LIB_PATH}) + set(ARROW_SHARED_LIB ${ARROW_STATIC_LIB}) + set(ARROW_PYTHON_SHARED_LIB ${ARROW_PYTHON_STATIC_LIB}) + set(ARROW_JEMALLOC_SHARED_LIB ${ARROW_JEMALLOC_STATIC_LIB}) + else() + set(ARROW_STATIC_LIB ${ARROW_PYTHON_LIB_PATH}/libarrow.a) + set(ARROW_PYTHON_STATIC_LIB ${ARROW_PYTHON_LIB_PATH}/libarrow_python.a) + set(ARROW_JEMALLOC_STATIC_LIB ${ARROW_PYTHON_LIB_PATH}/libarrow_jemalloc.a) - set(ARROW_JEMALLOC_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_JEMALLOC_LIB_NAME}.a) - set(ARROW_JEMALLOC_SHARED_LIB ${ARROW_LIBS}/${ARROW_JEMALLOC_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) - - set(ARROW_PYTHON_STATIC_LIB ${ARROW_SEARCH_LIB_PATH}/${ARROW_PYTHON_LIB_NAME}.a) - set(ARROW_PYTHON_SHARED_LIB ${ARROW_LIBS}/${ARROW_PYTHON_LIB_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(ARROW_SHARED_LIB ${ARROW_LIBS}/libarrow${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(ARROW_JEMALLOC_SHARED_LIB ${ARROW_LIBS}/libarrow_jemalloc${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(ARROW_PYTHON_SHARED_LIB ${ARROW_LIBS}/libarrow_python${CMAKE_SHARED_LIBRARY_SUFFIX}) + endif() if (NOT Arrow_FIND_QUIETLY) message(STATUS "Found the Arrow core library: ${ARROW_LIB_PATH}") diff --git a/python/cmake_modules/UseCython.cmake b/python/cmake_modules/UseCython.cmake index 7c06b023db2cb..7920940e688c7 100644 --- a/python/cmake_modules/UseCython.cmake +++ 
b/python/cmake_modules/UseCython.cmake @@ -122,9 +122,11 @@ function( compile_pyx _name pyx_target_name generated_files pyx_file) endif() set_source_files_properties( ${_generated_files} PROPERTIES GENERATED TRUE ) - # Cython creates a lot of compiler warning detritus on clang - set_source_files_properties(${_generated_files} PROPERTIES - COMPILE_FLAGS -Wno-unused-function) + if (NOT WIN32) + # Cython creates a lot of compiler warning detritus on clang + set_source_files_properties(${_generated_files} PROPERTIES + COMPILE_FLAGS -Wno-unused-function) + endif() set( ${generated_files} ${_generated_files} PARENT_SCOPE ) diff --git a/python/pyarrow/_config.pyx b/python/pyarrow/_config.pyx index 536f27839ae91..2c1e6bf3143aa 100644 --- a/python/pyarrow/_config.pyx +++ b/python/pyarrow/_config.pyx @@ -14,18 +14,13 @@ # distutils: language = c++ # cython: embedsignature = True -cdef extern from 'arrow/python/do_import_numpy.h': - pass - -cdef extern from 'arrow/python/numpy_interop.h' namespace 'arrow::py': - int import_numpy() +cdef extern from 'arrow/python/init.h' namespace 'arrow::py': + void InitNumPy() cdef extern from 'arrow/python/config.h' namespace 'arrow::py': - void Init() void set_numpy_nan(object o) -import_numpy() -Init() +InitNumPy() import numpy as np set_numpy_nan(np.nan) diff --git a/python/pyarrow/_io.pyx b/python/pyarrow/_io.pyx index ec37de0d72de9..09e8233bcbc2f 100644 --- a/python/pyarrow/_io.pyx +++ b/python/pyarrow/_io.pyx @@ -807,7 +807,7 @@ cdef class _HdfsClient: cdef c_string c_path = tobytes(path) with nogil: check_status(self.client.get() - .CreateDirectory(c_path)) + .MakeDirectory(c_path)) def delete(self, path, bint recursive=False): """ diff --git a/python/pyarrow/includes/common.pxd b/python/pyarrow/includes/common.pxd index ab38ff3084f01..44723faa7400e 100644 --- a/python/pyarrow/includes/common.pxd +++ b/python/pyarrow/includes/common.pxd @@ -26,8 +26,7 @@ from libcpp.vector cimport vector from cpython cimport PyObject cimport cpython -# This must be included for cerr and other things to work -cdef extern from "": +cdef extern from "arrow/python/platform.h": pass cdef extern from "": diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index b8aa24c65e11b..ea835f6d7bbc8 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -54,7 +54,7 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: _Type_STRUCT" arrow::Type::STRUCT" _Type_DICTIONARY" arrow::Type::DICTIONARY" - enum TimeUnit" arrow::TimeUnit": + enum TimeUnit" arrow::TimeUnit::type": TimeUnit_SECOND" arrow::TimeUnit::SECOND" TimeUnit_MILLI" arrow::TimeUnit::MILLI" TimeUnit_MICRO" arrow::TimeUnit::MICRO" @@ -435,7 +435,7 @@ cdef extern from "arrow/io/hdfs.h" namespace "arrow::io" nogil: CStatus Connect(const HdfsConnectionConfig* config, shared_ptr[CHdfsClient]* client) - CStatus CreateDirectory(const c_string& path) + CStatus MakeDirectory(const c_string& path) CStatus Delete(const c_string& path, c_bool recursive) diff --git a/python/setup.py b/python/setup.py index 3991856404bc8..ab71e7858e626 100644 --- a/python/setup.py +++ b/python/setup.py @@ -91,6 +91,13 @@ def initialize_options(self): _build_ext.initialize_options(self) self.extra_cmake_args = os.environ.get('PYARROW_CMAKE_OPTIONS', '') self.build_type = os.environ.get('PYARROW_BUILD_TYPE', 'debug').lower() + + if sys.platform == 'win32': + # Cannot do debug builds in Windows unless Python itself is a debug + # build + if not hasattr(sys, 'gettotalrefcount'): + 
self.build_type = 'release' + self.with_parquet = strtobool( os.environ.get('PYARROW_WITH_PARQUET', '0')) self.with_jemalloc = strtobool( @@ -132,13 +139,10 @@ def _run_cmake(self): return static_lib_option = '' - build_tests_option = '' cmake_options = [ '-DPYTHON_EXECUTABLE=%s' % sys.executable, - '-DPYARROW_BUILD_TESTS=off', static_lib_option, - build_tests_option, ] if self.with_parquet: @@ -150,10 +154,10 @@ def _run_cmake(self): if self.bundle_arrow_cpp: cmake_options.append('-DPYARROW_BUNDLE_ARROW_CPP=ON') - if sys.platform != 'win32': - cmake_options.append('-DCMAKE_BUILD_TYPE={0}' - .format(self.build_type)) + cmake_options.append('-DCMAKE_BUILD_TYPE={0}' + .format(self.build_type)) + if sys.platform != 'win32': cmake_command = (['cmake', self.extra_cmake_args] + cmake_options + [source]) @@ -167,15 +171,15 @@ def _run_cmake(self): self.spawn(args) else: import shlex - cmake_generator = 'Visual Studio 14 2015' - if is_64_bit: - cmake_generator += ' Win64' + cmake_generator = 'Visual Studio 14 2015 Win64' + if not is_64_bit: + raise RuntimeError('Not supported on 32-bit Windows') + # Generate the build files extra_cmake_args = shlex.split(self.extra_cmake_args) cmake_command = (['cmake'] + extra_cmake_args + cmake_options + - [source, - '-G', cmake_generator]) + [source, '-G', cmake_generator]) if "-G" in self.extra_cmake_args: cmake_command = cmake_command[:-2] @@ -336,7 +340,7 @@ def get_outputs(self): use_scm_version={"root": "..", "relative_to": __file__}, setup_requires=['setuptools_scm', 'cython >= 0.23'], install_requires=['numpy >= 1.9', 'six >= 1.0.0'], - test_requires=['pytest'], + tests_require=['pytest'], description="Python library for Apache Arrow", long_description=long_description, classifiers=[ From 84d725ba2610c778af75060d1c69a4ff8b2a2efc Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 17 Apr 2017 17:47:51 -0400 Subject: [PATCH 0528/1644] ARROW-825: [Python] Rename pyarrow.from_pylist to pyarrow.array, test on tuples The idea is to make this function more semantically analogous to `numpy.array` -- convert to native data structure with optional explicit type. Author: Wes McKinney Closes #552 from wesm/ARROW-825 and squashes the following commits: 5d69c70 [Wes McKinney] Update test_jemalloc after ARROW-830 c25fdee [Wes McKinney] Update docstring 3a284b7 [Wes McKinney] Rename pyarrow.from_pylist to pyarrow.array, test on tuples --- python/doc/source/api.rst | 5 +- python/doc/source/install.rst | 17 +++-- python/pyarrow/__init__.py | 2 +- python/pyarrow/_array.pyx | 69 ++++++++++---------- python/pyarrow/parquet.py | 4 +- python/pyarrow/tests/test_array.py | 18 ++--- python/pyarrow/tests/test_convert_builtin.py | 60 +++++++++-------- python/pyarrow/tests/test_jemalloc.py | 6 +- python/pyarrow/tests/test_parquet.py | 2 +- python/pyarrow/tests/test_scalars.py | 14 ++-- python/pyarrow/tests/test_table.py | 40 ++++++------ 11 files changed, 123 insertions(+), 114 deletions(-) diff --git a/python/doc/source/api.rst b/python/doc/source/api.rst index 1b7b9bdc8f8c8..92e248b686ac0 100644 --- a/python/doc/source/api.rst +++ b/python/doc/source/api.rst @@ -83,12 +83,13 @@ Scalar Value Types StringValue FixedSizeBinaryValue -Array Types ------------ +Array Types and Constructors +---------------------------- .. 
autosummary:: :toctree: generated/ + array Array NullArray NumericArray diff --git a/python/doc/source/install.rst b/python/doc/source/install.rst index 16d19ef123135..278b466941a6f 100644 --- a/python/doc/source/install.rst +++ b/python/doc/source/install.rst @@ -90,7 +90,7 @@ using the default system install location will work, but for now we are being explicit: .. code-block:: bash - + export ARROW_HOME=$HOME/local Now, we build Arrow: @@ -98,18 +98,18 @@ Now, we build Arrow: .. code-block:: bash cd arrow/cpp - + mkdir dev-build cd dev-build - + cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME .. - + make - + # Use sudo here if $ARROW_HOME requires it make install -To get the optional Parquet support, you should also build and install +To get the optional Parquet support, you should also build and install `parquet-cpp `_. Install `pyarrow` @@ -138,10 +138,10 @@ Install `pyarrow` .. code-block:: python - + In [1]: import pyarrow - In [2]: pyarrow.from_pylist([1,2,3]) + In [2]: pyarrow.array([1,2,3]) Out[2]: [ @@ -149,4 +149,3 @@ Install `pyarrow` 2, 3 ] - diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 3db2a4f4dd0c8..87f23524ab49f 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -38,7 +38,7 @@ DataType, FixedSizeBinaryType, Field, Schema, schema, Array, Tensor, - from_pylist, + array, from_numpy_dtype, NullArray, NumericArray, IntegerArray, FloatingPointArray, diff --git a/python/pyarrow/_array.pyx b/python/pyarrow/_array.pyx index 99ff6f28096ef..e41380d0a6685 100644 --- a/python/pyarrow/_array.pyx +++ b/python/pyarrow/_array.pyx @@ -835,6 +835,42 @@ cdef maybe_coerce_datetime64(values, dtype, DataType type, return values, type + +def array(object sequence, DataType type=None, MemoryPool memory_pool=None): + """ + Create pyarrow.Array instance from a Python sequence + + Parameters + ---------- + sequence : sequence-like object of Python objects + type : pyarrow.DataType, optional + If not passed, will be inferred from the data + memory_pool : pyarrow.MemoryPool, optional + If not passed, will allocate memory from the currently-set default + memory pool + + Returns + ------- + array : pyarrow.Array + """ + cdef: + shared_ptr[CArray] sp_array + CMemoryPool* pool + + pool = maybe_unbox_memory_pool(memory_pool) + if type is None: + check_status(pyarrow.ConvertPySequence(sequence, pool, &sp_array)) + else: + check_status( + pyarrow.ConvertPySequence( + sequence, pool, &sp_array, type.sp_type + ) + ) + + return box_array(sp_array) + + + cdef class Array: cdef init(self, const shared_ptr[CArray]& sp_array): @@ -936,36 +972,6 @@ cdef class Array: return box_array(out) - @staticmethod - def from_list(object list_obj, DataType type=None, - MemoryPool memory_pool=None): - """ - Convert Python list to Arrow array - - Parameters - ---------- - list_obj : array_like - - Returns - ------- - pyarrow.array.Array - """ - cdef: - shared_ptr[CArray] sp_array - CMemoryPool* pool - - pool = maybe_unbox_memory_pool(memory_pool) - if type is None: - check_status(pyarrow.ConvertPySequence(list_obj, pool, &sp_array)) - else: - check_status( - pyarrow.ConvertPySequence( - list_obj, pool, &sp_array, type.sp_type - ) - ) - - return box_array(sp_array) - property null_count: def __get__(self): @@ -1408,6 +1414,3 @@ cdef object get_series_values(object obj): result = PandasSeries(obj).values return result - - -from_pylist = Array.from_list diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py index fef99d5e12a06..94ad227fbefa9 100644 --- 
a/python/pyarrow/parquet.py +++ b/python/pyarrow/parquet.py @@ -295,9 +295,9 @@ def dictionary(self): # Only integer and string partition types are supported right now try: integer_keys = [int(x) for x in self.keys] - dictionary = _array.from_pylist(integer_keys) + dictionary = _array.array(integer_keys) except ValueError: - dictionary = _array.from_pylist(self.keys) + dictionary = _array.array(self.keys) self._dictionary = dictionary return dictionary diff --git a/python/pyarrow/tests/test_array.py b/python/pyarrow/tests/test_array.py index a1fe842c7ab8b..7c91785e12b2a 100644 --- a/python/pyarrow/tests/test_array.py +++ b/python/pyarrow/tests/test_array.py @@ -36,12 +36,12 @@ def test_repr_on_pre_init_array(): def test_getitem_NA(): - arr = pa.from_pylist([1, None, 2]) + arr = pa.array([1, None, 2]) assert arr[1] is pa.NA def test_list_format(): - arr = pa.from_pylist([[1], None, [2, 3, None]]) + arr = pa.array([[1], None, [2, 3, None]]) result = fmt.array_format(arr) expected = """\ [ @@ -55,7 +55,7 @@ def test_list_format(): def test_string_format(): - arr = pa.from_pylist(['', None, 'foo']) + arr = pa.array(['', None, 'foo']) result = fmt.array_format(arr) expected = """\ [ @@ -67,7 +67,7 @@ def test_string_format(): def test_long_array_format(): - arr = pa.from_pylist(range(100)) + arr = pa.array(range(100)) result = fmt.array_format(arr, window=2) expected = """\ [ @@ -83,7 +83,7 @@ def test_long_array_format(): def test_to_pandas_zero_copy(): import gc - arr = pa.from_pylist(range(10)) + arr = pa.array(range(10)) for i in range(10): np_arr = arr.to_pandas() @@ -93,7 +93,7 @@ def test_to_pandas_zero_copy(): assert sys.getrefcount(arr) == 2 for i in range(10): - arr = pa.from_pylist(range(10)) + arr = pa.array(range(10)) np_arr = arr.to_pandas() arr = None gc.collect() @@ -108,14 +108,14 @@ def test_to_pandas_zero_copy(): def test_array_slice(): - arr = pa.from_pylist(range(10)) + arr = pa.array(range(10)) sliced = arr.slice(2) - expected = pa.from_pylist(range(2, 10)) + expected = pa.array(range(2, 10)) assert sliced.equals(expected) sliced2 = arr.slice(2, 4) - expected2 = pa.from_pylist(range(2, 6)) + expected2 = pa.array(range(2, 6)) assert sliced2.equals(expected2) # 0 offset diff --git a/python/pyarrow/tests/test_convert_builtin.py b/python/pyarrow/tests/test_convert_builtin.py index d89a8e0c54ceb..d25055d828062 100644 --- a/python/pyarrow/tests/test_convert_builtin.py +++ b/python/pyarrow/tests/test_convert_builtin.py @@ -23,25 +23,31 @@ import decimal -class TestConvertList(unittest.TestCase): +class TestConvertSequence(unittest.TestCase): + + def test_sequence_types(self): + arr1 = pa.array([1, 2, 3]) + arr2 = pa.array((1, 2, 3)) + + assert arr1.equals(arr2) def test_boolean(self): expected = [True, None, False, None] - arr = pa.from_pylist(expected) + arr = pa.array(expected) assert len(arr) == 4 assert arr.null_count == 2 assert arr.type == pa.bool_() assert arr.to_pylist() == expected def test_empty_list(self): - arr = pa.from_pylist([]) + arr = pa.array([]) assert len(arr) == 0 assert arr.null_count == 0 assert arr.type == pa.null() assert arr.to_pylist() == [] def test_all_none(self): - arr = pa.from_pylist([None, None]) + arr = pa.array([None, None]) assert len(arr) == 2 assert arr.null_count == 2 assert arr.type == pa.null() @@ -49,7 +55,7 @@ def test_all_none(self): def test_integer(self): expected = [1, None, 3, None] - arr = pa.from_pylist(expected) + arr = pa.array(expected) assert len(arr) == 4 assert arr.null_count == 2 assert arr.type == pa.int64() @@ -62,13 
+68,13 @@ def test_garbage_collection(self): gc.collect() bytes_before = pa.total_allocated_bytes() - pa.from_pylist([1, None, 3, None]) + pa.array([1, None, 3, None]) gc.collect() assert pa.total_allocated_bytes() == bytes_before def test_double(self): data = [1.5, 1, None, 2.5, None, None] - arr = pa.from_pylist(data) + arr = pa.array(data) assert len(arr) == 6 assert arr.null_count == 3 assert arr.type == pa.float64() @@ -76,7 +82,7 @@ def test_double(self): def test_unicode(self): data = [u'foo', u'bar', None, u'mañana'] - arr = pa.from_pylist(data) + arr = pa.array(data) assert len(arr) == 4 assert arr.null_count == 1 assert arr.type == pa.string() @@ -87,7 +93,7 @@ def test_bytes(self): data = [b'foo', u1.decode('utf-8'), # unicode gets encoded, None] - arr = pa.from_pylist(data) + arr = pa.array(data) assert len(arr) == 3 assert arr.null_count == 1 assert arr.type == pa.binary() @@ -95,7 +101,7 @@ def test_bytes(self): def test_fixed_size_bytes(self): data = [b'foof', None, b'barb', b'2346'] - arr = pa.from_pylist(data, type=pa.binary(4)) + arr = pa.array(data, type=pa.binary(4)) assert len(arr) == 4 assert arr.null_count == 1 assert arr.type == pa.binary(4) @@ -104,12 +110,12 @@ def test_fixed_size_bytes(self): def test_fixed_size_bytes_does_not_accept_varying_lengths(self): data = [b'foo', None, b'barb', b'2346'] with self.assertRaises(pa.ArrowInvalid): - pa.from_pylist(data, type=pa.binary(4)) + pa.array(data, type=pa.binary(4)) def test_date(self): data = [datetime.date(2000, 1, 1), None, datetime.date(1970, 1, 1), datetime.date(2040, 2, 26)] - arr = pa.from_pylist(data) + arr = pa.array(data) assert len(arr) == 4 assert arr.type == pa.date64() assert arr.null_count == 1 @@ -125,7 +131,7 @@ def test_timestamp(self): datetime.datetime(2006, 1, 13, 12, 34, 56, 432539), datetime.datetime(2010, 8, 13, 5, 46, 57, 437699) ] - arr = pa.from_pylist(data) + arr = pa.array(data) assert len(arr) == 4 assert arr.type == pa.timestamp('us') assert arr.null_count == 1 @@ -138,22 +144,22 @@ def test_timestamp(self): 46, 57, 437699) def test_mixed_nesting_levels(self): - pa.from_pylist([1, 2, None]) - pa.from_pylist([[1], [2], None]) - pa.from_pylist([[1], [2], [None]]) + pa.array([1, 2, None]) + pa.array([[1], [2], None]) + pa.array([[1], [2], [None]]) with self.assertRaises(pa.ArrowInvalid): - pa.from_pylist([1, 2, [1]]) + pa.array([1, 2, [1]]) with self.assertRaises(pa.ArrowInvalid): - pa.from_pylist([1, 2, []]) + pa.array([1, 2, []]) with self.assertRaises(pa.ArrowInvalid): - pa.from_pylist([[1], [2], [None, [1]]]) + pa.array([[1], [2], [None, [1]]]) def test_list_of_int(self): data = [[1, 2, 3], [], None, [1, 2]] - arr = pa.from_pylist(data) + arr = pa.array(data) assert len(arr) == 4 assert arr.null_count == 1 assert arr.type == pa.list_(pa.int64()) @@ -162,12 +168,12 @@ def test_list_of_int(self): def test_mixed_types_fails(self): data = ['a', 1, 2.0] with self.assertRaises(pa.ArrowException): - pa.from_pylist(data) + pa.array(data) def test_decimal(self): data = [decimal.Decimal('1234.183'), decimal.Decimal('8094.234')] type = pa.decimal(precision=7, scale=3) - arr = pa.from_pylist(data, type=type) + arr = pa.array(data, type=type) assert arr.to_pylist() == data def test_decimal_different_precisions(self): @@ -175,30 +181,30 @@ def test_decimal_different_precisions(self): decimal.Decimal('1234234983.183'), decimal.Decimal('80943244.234') ] type = pa.decimal(precision=13, scale=3) - arr = pa.from_pylist(data, type=type) + arr = pa.array(data, type=type) assert arr.to_pylist() == data def 
test_decimal_no_scale(self): data = [decimal.Decimal('1234234983'), decimal.Decimal('8094324')] type = pa.decimal(precision=10) - arr = pa.from_pylist(data, type=type) + arr = pa.array(data, type=type) assert arr.to_pylist() == data def test_decimal_negative(self): data = [decimal.Decimal('-1234.234983'), decimal.Decimal('-8.094324')] type = pa.decimal(precision=10, scale=6) - arr = pa.from_pylist(data, type=type) + arr = pa.array(data, type=type) assert arr.to_pylist() == data def test_decimal_no_whole_part(self): data = [decimal.Decimal('-.4234983'), decimal.Decimal('.0103943')] type = pa.decimal(precision=7, scale=7) - arr = pa.from_pylist(data, type=type) + arr = pa.array(data, type=type) assert arr.to_pylist() == data def test_decimal_large_integer(self): data = [decimal.Decimal('-394029506937548693.42983'), decimal.Decimal('32358695912932.01033')] type = pa.decimal(precision=23, scale=5) - arr = pa.from_pylist(data, type=type) + arr = pa.array(data, type=type) assert arr.to_pylist() == data diff --git a/python/pyarrow/tests/test_jemalloc.py b/python/pyarrow/tests/test_jemalloc.py index 0a4d8a63ad2d2..50eb74ae0e2c6 100644 --- a/python/pyarrow/tests/test_jemalloc.py +++ b/python/pyarrow/tests/test_jemalloc.py @@ -39,8 +39,8 @@ def test_different_memory_pool(): bytes_before_jemalloc = pa.jemalloc_memory_pool().bytes_allocated() # it works - array = pa.from_pylist([1, None, 3, None], # noqa - memory_pool=pa.jemalloc_memory_pool()) + array = pa.array([1, None, 3, None], # noqa + memory_pool=pa.jemalloc_memory_pool()) gc.collect() assert pa.total_allocated_bytes() == bytes_before_default assert (pa.jemalloc_memory_pool().bytes_allocated() > @@ -56,7 +56,7 @@ def test_default_memory_pool(): old_memory_pool = pa.default_memory_pool() pa.set_memory_pool(pa.jemalloc_memory_pool()) - array = pa.from_pylist([1, None, 3, None]) # noqa + array = pa.array([1, None, 3, None]) # noqa pa.set_memory_pool(old_memory_pool) gc.collect() diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index fc35781c54722..268e87af7dda4 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -47,7 +47,7 @@ def test_single_pylist_column_roundtrip(tmpdir): for dtype in [int, float]: filename = tmpdir.join('single_{}_column.parquet' .format(dtype.__name__)) - data = [pa.from_pylist(list(map(dtype, range(5))))] + data = [pa.array(list(map(dtype, range(5))))] table = pa.Table.from_arrays(data, names=('a', 'b')) pq.write_table(table, filename.strpath) table_read = pq.read_table(filename.strpath) diff --git a/python/pyarrow/tests/test_scalars.py b/python/pyarrow/tests/test_scalars.py index df2a8980710f8..149973b7831cb 100644 --- a/python/pyarrow/tests/test_scalars.py +++ b/python/pyarrow/tests/test_scalars.py @@ -29,7 +29,7 @@ def test_null_singleton(self): pa.NAType() def test_bool(self): - arr = pa.from_pylist([True, None, False, None]) + arr = pa.array([True, None, False, None]) v = arr[0] assert isinstance(v, pa.BooleanValue) @@ -39,7 +39,7 @@ def test_bool(self): assert arr[1] is pa.NA def test_int64(self): - arr = pa.from_pylist([1, 2, None]) + arr = pa.array([1, 2, None]) v = arr[0] assert isinstance(v, pa.Int64Value) @@ -49,7 +49,7 @@ def test_int64(self): assert arr[2] is pa.NA def test_double(self): - arr = pa.from_pylist([1.5, None, 3]) + arr = pa.array([1.5, None, 3]) v = arr[0] assert isinstance(v, pa.DoubleValue) @@ -62,7 +62,7 @@ def test_double(self): assert v.as_py() == 3.0 def test_string_unicode(self): - arr = 
pa.from_pylist([u'foo', None, u'mañana']) + arr = pa.array([u'foo', None, u'mañana']) v = arr[0] assert isinstance(v, pa.StringValue) @@ -75,7 +75,7 @@ def test_string_unicode(self): assert isinstance(v, unicode_type) def test_bytes(self): - arr = pa.from_pylist([b'foo', None, u('bar')]) + arr = pa.array([b'foo', None, u('bar')]) v = arr[0] assert isinstance(v, pa.BinaryValue) @@ -89,7 +89,7 @@ def test_bytes(self): def test_fixed_size_bytes(self): data = [b'foof', None, b'barb'] - arr = pa.from_pylist(data, type=pa.binary(4)) + arr = pa.array(data, type=pa.binary(4)) v = arr[0] assert isinstance(v, pa.FixedSizeBinaryValue) @@ -102,7 +102,7 @@ def test_fixed_size_bytes(self): assert isinstance(v, bytes) def test_list(self): - arr = pa.from_pylist([['foo', None], None, ['bar'], []]) + arr = pa.array([['foo', None], None, ['bar'], []]) v = arr[0] assert len(v) == 2 diff --git a/python/pyarrow/tests/test_table.py b/python/pyarrow/tests/test_table.py index 79b4c159fd10a..0567e8aba685a 100644 --- a/python/pyarrow/tests/test_table.py +++ b/python/pyarrow/tests/test_table.py @@ -29,7 +29,7 @@ class TestColumn(unittest.TestCase): def test_basics(self): data = [ - pa.from_pylist([-10, -5, 0, 5, 10]) + pa.array([-10, -5, 0, 5, 10]) ] table = pa.Table.from_arrays(data, names=['a']) column = table.column(0) @@ -40,7 +40,7 @@ def test_basics(self): assert column.to_pylist() == [-10, -5, 0, 5, 10] def test_from_array(self): - arr = pa.from_pylist([0, 1, 2, 3, 4]) + arr = pa.array([0, 1, 2, 3, 4]) col1 = pa.Column.from_array('foo', arr) col2 = pa.Column.from_array(pa.field('foo', arr.type), arr) @@ -49,7 +49,7 @@ def test_from_array(self): def test_pandas(self): data = [ - pa.from_pylist([-10, -5, 0, 5, 10]) + pa.array([-10, -5, 0, 5, 10]) ] table = pa.Table.from_arrays(data, names=['a']) column = table.column(0) @@ -61,8 +61,8 @@ def test_pandas(self): def test_recordbatch_basics(): data = [ - pa.from_pylist(range(5)), - pa.from_pylist([-10, -5, 0, 5, 10]) + pa.array(range(5)), + pa.array([-10, -5, 0, 5, 10]) ] batch = pa.RecordBatch.from_arrays(data, ['c0', 'c1']) @@ -78,8 +78,8 @@ def test_recordbatch_basics(): def test_recordbatch_slice(): data = [ - pa.from_pylist(range(5)), - pa.from_pylist([-10, -5, 0, 5, 10]) + pa.array(range(5)), + pa.array([-10, -5, 0, 5, 10]) ] names = ['c0', 'c1'] @@ -159,8 +159,8 @@ def test_recordbatchlist_schema_equals(): def test_table_basics(): data = [ - pa.from_pylist(range(5)), - pa.from_pylist([-10, -5, 0, 5, 10]) + pa.array(range(5)), + pa.array([-10, -5, 0, 5, 10]) ] table = pa.Table.from_arrays(data, names=('a', 'b')) assert len(table) == 5 @@ -179,9 +179,9 @@ def test_table_basics(): def test_table_add_column(): data = [ - pa.from_pylist(range(5)), - pa.from_pylist([-10, -5, 0, 5, 10]), - pa.from_pylist(range(5, 10)) + pa.array(range(5)), + pa.array([-10, -5, 0, 5, 10]), + pa.array(range(5, 10)) ] table = pa.Table.from_arrays(data, names=('a', 'b', 'c')) @@ -202,9 +202,9 @@ def test_table_add_column(): def test_table_remove_column(): data = [ - pa.from_pylist(range(5)), - pa.from_pylist([-10, -5, 0, 5, 10]), - pa.from_pylist(range(5, 10)) + pa.array(range(5)), + pa.array([-10, -5, 0, 5, 10]), + pa.array(range(5, 10)) ] table = pa.Table.from_arrays(data, names=('a', 'b', 'c')) @@ -223,15 +223,15 @@ def test_concat_tables(): [1., 2., 3., 4., 5.] 
] - t1 = pa.Table.from_arrays([pa.from_pylist(x) for x in data], + t1 = pa.Table.from_arrays([pa.array(x) for x in data], names=('a', 'b')) - t2 = pa.Table.from_arrays([pa.from_pylist(x) for x in data2], + t2 = pa.Table.from_arrays([pa.array(x) for x in data2], names=('a', 'b')) result = pa.concat_tables([t1, t2]) assert len(result) == 10 - expected = pa.Table.from_arrays([pa.from_pylist(x + y) + expected = pa.Table.from_arrays([pa.array(x + y) for x, y in zip(data, data2)], names=('a', 'b')) @@ -240,8 +240,8 @@ def test_concat_tables(): def test_table_pandas(): data = [ - pa.from_pylist(range(5)), - pa.from_pylist([-10, -5, 0, 5, 10]) + pa.array(range(5)), + pa.array([-10, -5, 0, 5, 10]) ] table = pa.Table.from_arrays(data, names=('a', 'b')) From bb8514cc9d7068c8b62d346577370751d68221d8 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 17 Apr 2017 17:50:05 -0400 Subject: [PATCH 0529/1644] ARROW-833: [Python] Add Developer quickstart for conda users cc @mrocklin @cpcloud, would you mind giving this a shot to make sure I haven't missed anything idiosyncratic from my environment? Author: Wes McKinney Closes #548 from wesm/ARROW-833 and squashes the following commits: a8ff608 [Wes McKinney] Don't check out parquet-cpp in the root of your arrow clone, more instructions c60fd60 [Wes McKinney] Add build type to pyarrow build command 3cfeb04 [Wes McKinney] Mark blocks as shell a738820 [Wes McKinney] Add Developer quickstart for conda users --- python/DEVELOPMENT.md | 136 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 136 insertions(+) create mode 100644 python/DEVELOPMENT.md diff --git a/python/DEVELOPMENT.md b/python/DEVELOPMENT.md new file mode 100644 index 0000000000000..280314f702f3d --- /dev/null +++ b/python/DEVELOPMENT.md @@ -0,0 +1,136 @@ + + +## Developer guide for conda users + +First, set up your thirdparty C++ toolchain using libraries from conda-forge: + +```shell +conda config --add channels conda-forge + +export ARROW_BUILD_TYPE=Release + +export CPP_TOOLCHAIN=$HOME/cpp-toolchain +export LD_LIBRARY_PATH=$CPP_TOOLCHAIN/lib:$LD_LIBRARY_PATH + +export BOOST_ROOT=$CPP_TOOLCHAIN +export FLATBUFFERS_HOME=$CPP_TOOLCHAIN +export RAPIDJSON_HOME=$CPP_TOOLCHAIN +export THRIFT_HOME=$CPP_TOOLCHAIN +export ZLIB_HOME=$CPP_TOOLCHAIN +export SNAPPY_HOME=$CPP_TOOLCHAIN +export BROTLI_HOME=$CPP_TOOLCHAIN +export JEMALLOC_HOME=$CPP_TOOLCHAIN +export ARROW_HOME=$CPP_TOOLCHAIN +export PARQUET_HOME=$CPP_TOOLCHAIN + +conda create -y -q -p $CPP_TOOLCHAIN \ + flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib brotli jemalloc +``` + +Now, activate a conda environment containing your target Python version and +NumPy installed: + +```shell +conda create -y -q -n pyarrow-dev python=3.6 numpy +source activate pyarrow-dev +``` + +Now, let's clone the Arrow and Parquet git repositories: + +```shell +mkdir repos +cd repos +git clone https://github.com/apache/arrow.git +git clone https://github.com/apache/parquet-cpp.git +``` + +You should now see + +```shell +$ ls -l +total 8 +drwxrwxr-x 12 wesm wesm 4096 Apr 15 19:19 arrow/ +drwxrwxr-x 12 wesm wesm 4096 Apr 15 19:19 parquet-cpp/ +``` + +Now build and install the Arrow C++ libraries: + +```shell +mkdir arrow/cpp/build +pushd arrow/cpp/build + +cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \ + -DCMAKE_INSTALL_PREFIX=$CPP_TOOLCHAIN \ + -DARROW_PYTHON=on \ + -DARROW_BUILD_TESTS=OFF \ + .. 
+make -j4 +make install +popd +``` + +Now build and install the Apache Parquet libraries in your toolchain: + +```shell +mkdir parquet-cpp/build +pushd parquet-cpp/build + +cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \ + -DCMAKE_INSTALL_PREFIX=$CPP_TOOLCHAIN \ + -DPARQUET_BUILD_BENCHMARKS=off \ + -DPARQUET_BUILD_EXECUTABLES=off \ + -DPARQUET_ZLIB_VENDORED=off \ + -DPARQUET_BUILD_TESTS=off \ + .. + +make -j4 +make install +popd +``` + +Now, install requisite build requirements for pyarrow, then build: + +```shell +conda install -y -q six setuptools cython pandas pytest + +cd arrow/python +python setup.py build_ext --build-type=$ARROW_BUILD_TYPE --with-parquet --inplace +``` + +You should be able to run the unit tests with: + +```shell +$ py.test pyarrow +================================ test session starts ================================ +platform linux -- Python 3.6.1, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 +rootdir: /home/wesm/arrow-clone/python, inifile: +collected 198 items + +pyarrow/tests/test_array.py ........... +pyarrow/tests/test_convert_builtin.py ..................... +pyarrow/tests/test_convert_pandas.py ............................. +pyarrow/tests/test_feather.py .......................... +pyarrow/tests/test_hdfs.py sssssssssssssss +pyarrow/tests/test_io.py .................. +pyarrow/tests/test_ipc.py ........ +pyarrow/tests/test_jemalloc.py ss +pyarrow/tests/test_parquet.py .................... +pyarrow/tests/test_scalars.py .......... +pyarrow/tests/test_schema.py ......... +pyarrow/tests/test_table.py ............. +pyarrow/tests/test_tensor.py ................ + +====================== 181 passed, 17 skipped in 0.98 seconds ======================= +``` From 0bcb7852feb464790791cf5f9c4da1aaaf429970 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 18 Apr 2017 16:25:02 +0200 Subject: [PATCH 0530/1644] ARROW-839: [Python] Use mktime variant that is reliable on MSVC This also reverts an unintentional regression from https://github.com/apache/arrow/pull/544 when code from config.h was moved to platform.h Author: Wes McKinney Closes #559 from wesm/ARROW-839 and squashes the following commits: 2e9b300 [Wes McKinney] Use _mkgmtime64 on MSVC f182bab [Wes McKinney] Restore include order in platform.h 38c29bf [Wes McKinney] Add Windows build instructions for Python --- cpp/CMakeLists.txt | 4 ++- cpp/src/arrow/python/platform.h | 2 +- cpp/src/arrow/python/util/datetime.h | 6 ++++ python/DEVELOPMENT.md | 48 ++++++++++++++++++++++++++++ 4 files changed, 58 insertions(+), 2 deletions(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 08120e9ea68a5..65fb2c9b1f7ea 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -837,7 +837,9 @@ if (${CLANG_FORMAT_FOUND}) add_custom_target(format ${BUILD_SUPPORT_DIR}/run-clang-format.sh ${CMAKE_CURRENT_SOURCE_DIR} ${CLANG_FORMAT_BIN} 1 `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc -or -name \\*.h | sed -e '/_generated/g' | - sed -e '/windows_compatibility.h/g'`) + sed -e '/windows_compatibility.h/g' | + sed -e '/config.h/g' | + sed -e '/platform.h/g'`) # runs clang format and exits with a non-zero exit code if any files need to be reformatted add_custom_target(check-format ${BUILD_SUPPORT_DIR}/run-clang-format.sh ${CMAKE_CURRENT_SOURCE_DIR} ${CLANG_FORMAT_BIN} 0 diff --git a/cpp/src/arrow/python/platform.h b/cpp/src/arrow/python/platform.h index 38f8e0f1c23ff..a354b38f04cea 100644 --- a/cpp/src/arrow/python/platform.h +++ b/cpp/src/arrow/python/platform.h @@ -21,8 +21,8 @@ #ifndef ARROW_PYTHON_PLATFORM_H #define 
ARROW_PYTHON_PLATFORM_H -#include #include +#include // Work around C2528 error #if _MSC_VER >= 1900 diff --git a/cpp/src/arrow/python/util/datetime.h b/cpp/src/arrow/python/util/datetime.h index 852f426c157c2..bd80d9f636890 100644 --- a/cpp/src/arrow/python/util/datetime.h +++ b/cpp/src/arrow/python/util/datetime.h @@ -33,7 +33,13 @@ static inline int64_t PyDate_to_ms(PyDateTime_Date* pydate) { epoch.tm_year = 70; epoch.tm_mday = 1; // Milliseconds since the epoch +#ifdef _MSC_VER + const int64_t current_timestamp = static_cast(_mkgmtime64(&date)); + const int64_t epoch_timestamp = static_cast(_mkgmtime64(&epoch)); + return (current_timestamp - epoch_timestamp) * 1000LL; +#else return lrint(difftime(mktime(&date), mktime(&epoch)) * 1000); +#endif } static inline int32_t PyDate_to_days(PyDateTime_Date* pydate) { diff --git a/python/DEVELOPMENT.md b/python/DEVELOPMENT.md index 280314f702f3d..ca744628da1b5 100644 --- a/python/DEVELOPMENT.md +++ b/python/DEVELOPMENT.md @@ -14,6 +14,8 @@ ## Developer guide for conda users +### Linux and macOS + First, set up your thirdparty C++ toolchain using libraries from conda-forge: ```shell @@ -134,3 +136,49 @@ pyarrow/tests/test_tensor.py ................ ====================== 181 passed, 17 skipped in 0.98 seconds ======================= ``` + +### Windows + +First, make sure you can [build the C++ library][1]. + +Now, we need to build and install the C++ libraries someplace. + +```shell +mkdir cpp\build +cd cpp\build +set ARROW_HOME=C:\thirdparty +cmake -G "Visual Studio 14 2015 Win64" ^ + -DCMAKE_INSTALL_PREFIX=%ARROW_HOME% ^ + -DCMAKE_BUILD_TYPE=Release ^ + -DARROW_BUILD_TESTS=off ^ + -DARROW_PYTHON=on .. +cmake --build . --target INSTALL --config Release +cd ..\.. +``` + +After that, we must put the install directory's bin path in our `%PATH%`: + +```shell +set PATH=%ARROW_HOME%\bin;%PATH% +``` + +Now, we can build pyarrow: + +```shell +cd python +python setup.py build_ext --inplace +``` + +#### Running C++ unit tests with Python + +Getting `python-test.exe` to run is a bit tricky because your `%PYTHONPATH%` +must be configured given the active conda environment: + +```shell +set CONDA_ENV=C:\Users\wesm\Miniconda\envs\arrow-test +set PYTHONPATH=%CONDA_ENV%\Lib;%CONDA_ENV%\Lib\site-packages;%CONDA_ENV%\python35.zip;%CONDA_ENV%\DLLs;%CONDA_ENV% +``` + +Now `python-test.exe` or simply `ctest` (to run all tests) should work. 
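+For example, a minimal sketch of invoking only this test through ctest,
+assuming the `cpp\build` directory created above and that the registered
+test name matches the `python-test` executable used above:
+
+```shell
+cd cpp\build
+ctest -R python-test --output-on-failure
+```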
+ +[1]: https://github.com/apache/arrow/blob/master/cpp/doc/Windows.md \ No newline at end of file From bb287e2030c2b209edc4040099b138866e6e4692 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 18 Apr 2017 16:34:08 +0200 Subject: [PATCH 0531/1644] ARROW-845: [Python] Sync changes from PARQUET-955; explicit ARROW_HOME will override pkgconfig This will avoid build failures due to a stale system-level Arrow install Author: Wes McKinney Closes #558 from wesm/ARROW-845 and squashes the following commits: 4f89207 [Wes McKinney] Sync changes from PARQUET-955; explicit ARROW_HOME will override pkgconfig --- python/cmake_modules/FindArrow.cmake | 92 +++++++++++++++------------- 1 file changed, 50 insertions(+), 42 deletions(-) diff --git a/python/cmake_modules/FindArrow.cmake b/python/cmake_modules/FindArrow.cmake index 8e13dd66b9f0f..fbe4545a520af 100644 --- a/python/cmake_modules/FindArrow.cmake +++ b/python/cmake_modules/FindArrow.cmake @@ -25,68 +25,75 @@ include(FindPkgConfig) -set(ARROW_SEARCH_HEADER_PATHS - $ENV{ARROW_HOME}/include -) +if ("$ENV{ARROW_HOME}" STREQUAL "") + pkg_check_modules(ARROW arrow) + if (ARROW_FOUND) + pkg_get_variable(ARROW_ABI_VERSION arrow abi_version) + message(STATUS "Arrow ABI version: ${ARROW_ABI_VERSION}") + pkg_get_variable(ARROW_SO_VERSION arrow so_version) + message(STATUS "Arrow SO version: ${ARROW_SO_VERSION}") + set(ARROW_INCLUDE_DIR ${ARROW_INCLUDE_DIRS}) + set(ARROW_LIBS ${ARROW_LIBRARY_DIRS}) + endif() +else() + set(ARROW_HOME "$ENV{ARROW_HOME}") -set(ARROW_SEARCH_LIB_PATH - $ENV{ARROW_HOME}/lib -) + set(ARROW_SEARCH_HEADER_PATHS + ${ARROW_HOME}/include + ) + + set(ARROW_SEARCH_LIB_PATH + ${ARROW_HOME}/lib + ) -pkg_check_modules(ARROW arrow) -if (ARROW_FOUND) - pkg_get_variable(ARROW_ABI_VERSION arrow abi_version) - message(STATUS "Arrow ABI version: ${ARROW_ABI_VERSION}") - pkg_get_variable(ARROW_SO_VERSION arrow so_version) - message(STATUS "Arrow SO version: ${ARROW_SO_VERSION}") - set(ARROW_INCLUDE_DIR ${ARROW_INCLUDE_DIRS}) - set(ARROW_LIBS ${ARROW_LIBRARY_DIRS}) -else() find_path(ARROW_INCLUDE_DIR arrow/array.h PATHS ${ARROW_SEARCH_HEADER_PATHS} # make sure we don't accidentally pick up a different version NO_DEFAULT_PATH - ) + ) find_library(ARROW_LIB_PATH NAMES arrow PATHS ${ARROW_SEARCH_LIB_PATH} NO_DEFAULT_PATH) get_filename_component(ARROW_LIBS ${ARROW_LIB_PATH} DIRECTORY) -endif() -find_library(ARROW_JEMALLOC_LIB_PATH NAMES arrow_jemalloc - PATHS - ${ARROW_SEARCH_LIB_PATH} - NO_DEFAULT_PATH) + find_library(ARROW_JEMALLOC_LIB_PATH NAMES arrow_jemalloc + PATHS + ${ARROW_SEARCH_LIB_PATH} + NO_DEFAULT_PATH) -find_library(ARROW_PYTHON_LIB_PATH NAMES arrow_python - PATHS - ${ARROW_SEARCH_LIB_PATH} - NO_DEFAULT_PATH) + find_library(ARROW_PYTHON_LIB_PATH NAMES arrow_python + PATHS + ${ARROW_SEARCH_LIB_PATH} + NO_DEFAULT_PATH) -if (ARROW_INCLUDE_DIR AND ARROW_LIBS) - set(ARROW_FOUND TRUE) + if (ARROW_INCLUDE_DIR AND ARROW_LIBS) + set(ARROW_FOUND TRUE) - if (MSVC) - set(ARROW_STATIC_LIB ${ARROW_LIB_PATH}) - set(ARROW_PYTHON_STATIC_LIB ${ARROW_PYTHON_LIB_PATH}) - set(ARROW_JEMALLOC_STATIC_LIB ${ARROW_JEMALLOC_LIB_PATH}) - set(ARROW_SHARED_LIB ${ARROW_STATIC_LIB}) - set(ARROW_PYTHON_SHARED_LIB ${ARROW_PYTHON_STATIC_LIB}) - set(ARROW_JEMALLOC_SHARED_LIB ${ARROW_JEMALLOC_STATIC_LIB}) - else() - set(ARROW_STATIC_LIB ${ARROW_PYTHON_LIB_PATH}/libarrow.a) - set(ARROW_PYTHON_STATIC_LIB ${ARROW_PYTHON_LIB_PATH}/libarrow_python.a) - set(ARROW_JEMALLOC_STATIC_LIB ${ARROW_PYTHON_LIB_PATH}/libarrow_jemalloc.a) + if (MSVC) + set(ARROW_STATIC_LIB 
${ARROW_LIB_PATH}) + set(ARROW_PYTHON_STATIC_LIB ${ARROW_PYTHON_LIB_PATH}) + set(ARROW_JEMALLOC_STATIC_LIB ${ARROW_JEMALLOC_LIB_PATH}) + set(ARROW_SHARED_LIB ${ARROW_STATIC_LIB}) + set(ARROW_PYTHON_SHARED_LIB ${ARROW_PYTHON_STATIC_LIB}) + set(ARROW_JEMALLOC_SHARED_LIB ${ARROW_JEMALLOC_STATIC_LIB}) + else() + set(ARROW_STATIC_LIB ${ARROW_PYTHON_LIB_PATH}/libarrow.a) + set(ARROW_PYTHON_STATIC_LIB ${ARROW_PYTHON_LIB_PATH}/libarrow_python.a) + set(ARROW_JEMALLOC_STATIC_LIB ${ARROW_PYTHON_LIB_PATH}/libarrow_jemalloc.a) - set(ARROW_SHARED_LIB ${ARROW_LIBS}/libarrow${CMAKE_SHARED_LIBRARY_SUFFIX}) - set(ARROW_JEMALLOC_SHARED_LIB ${ARROW_LIBS}/libarrow_jemalloc${CMAKE_SHARED_LIBRARY_SUFFIX}) - set(ARROW_PYTHON_SHARED_LIB ${ARROW_LIBS}/libarrow_python${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(ARROW_SHARED_LIB ${ARROW_LIBS}/libarrow${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(ARROW_JEMALLOC_SHARED_LIB ${ARROW_LIBS}/libarrow_jemalloc${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(ARROW_PYTHON_SHARED_LIB ${ARROW_LIBS}/libarrow_python${CMAKE_SHARED_LIBRARY_SUFFIX}) + endif() endif() +endif() +if (ARROW_FOUND) if (NOT Arrow_FIND_QUIETLY) message(STATUS "Found the Arrow core library: ${ARROW_LIB_PATH}") + message(STATUS "Found the Arrow Python library: ${ARROW_PYTHON_LIB_PATH}") message(STATUS "Found the Arrow jemalloc library: ${ARROW_JEMALLOC_LIB_PATH}") endif () else () @@ -105,9 +112,10 @@ endif () mark_as_advanced( ARROW_INCLUDE_DIR - ARROW_LIBS ARROW_STATIC_LIB ARROW_SHARED_LIB + ARROW_PYTHON_STATIC_LIB + ARROW_PYTHON_SHARED_LIB ARROW_JEMALLOC_STATIC_LIB ARROW_JEMALLOC_SHARED_LIB ) From 7f20f6e738a2e163b0b753416ee4c4ed00998f4b Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 18 Apr 2017 16:37:03 +0200 Subject: [PATCH 0532/1644] ARROW-818: [Python] Expand Sphinx API docs, pyarrow.* namespace. Add factory functions for time32, time64 Author: Wes McKinney Closes #557 from wesm/ARROW-818 and squashes the following commits: 96ce436 [Wes McKinney] Expand Sphinx API docs, pyarrow.* namespace. Add factory functions for time32, time64 --- python/doc/source/api.rst | 69 +++++++++++++++++++++----- python/pyarrow/__init__.py | 33 ++++++++++--- python/pyarrow/_array.pxd | 10 ++++ python/pyarrow/_array.pyx | 74 +++++++++++++++++++++++++++- python/pyarrow/_io.pyx | 6 +-- python/pyarrow/includes/libarrow.pxd | 3 ++ python/pyarrow/tests/test_io.py | 4 +- python/pyarrow/tests/test_schema.py | 21 ++++++++ 8 files changed, 195 insertions(+), 25 deletions(-) diff --git a/python/doc/source/api.rst b/python/doc/source/api.rst index 92e248b686ac0..08a06948a3fba 100644 --- a/python/doc/source/api.rst +++ b/python/doc/source/api.rst @@ -24,8 +24,8 @@ API Reference .. _api.functions: -Type Metadata and Schemas -------------------------- +Type and Schema Factory Functions +--------------------------------- .. 
autosummary:: :toctree: generated/ @@ -43,6 +43,8 @@ Type Metadata and Schemas float16 float32 float64 + time32 + time64 timestamp date32 date64 @@ -53,10 +55,8 @@ Type Metadata and Schemas struct dictionary field - DataType - Field - Schema schema + from_numpy_dtype Scalar Value Types ------------------ @@ -68,6 +68,7 @@ Scalar Value Types NAType Scalar ArrayValue + BooleanValue Int8Value Int16Value Int32Value @@ -82,6 +83,11 @@ Scalar Value Types BinaryValue StringValue FixedSizeBinaryValue + Date32Value + Date64Value + TimestampValue + DecimalValue + Array Types and Constructors ---------------------------- @@ -91,21 +97,30 @@ Array Types and Constructors array Array - NullArray - NumericArray - IntegerArray - FloatingPointArray BooleanArray + DictionaryArray + FloatingPointArray + IntegerArray Int8Array Int16Array Int32Array Int64Array + NullArray + NumericArray UInt8Array UInt16Array UInt32Array UInt64Array - DictionaryArray + BinaryArray + FixedSizeBinaryArray StringArray + Time32Array + Time64Array + Date32Array + Date64Array + TimestampArray + DecimalArray + ListArray Tables and Record Batches ------------------------- @@ -113,9 +128,11 @@ Tables and Record Batches .. autosummary:: :toctree: generated/ + ChunkedArray Column RecordBatch Table + get_record_batch_size Tensor type and Functions ------------------------- @@ -141,7 +158,7 @@ Input / Output and Shared Memory MemoryMappedFile memory_map create_memory_map - PythonFileInterface + PythonFile Interprocess Communication and Messaging ---------------------------------------- @@ -165,3 +182,33 @@ Memory Pools jemalloc_memory_pool total_allocated_bytes set_memory_pool + +Type Classes +------------ + +.. autosummary:: + :toctree: generated/ + + DataType + DecimalType + DictionaryType + FixedSizeBinaryType + Time32Type + Time64Type + TimestampType + Field + Schema + +.. currentmodule:: pyarrow.parquet + +Apache Parquet +-------------- + +.. 
autosummary:: + :toctree: generated/ + + ParquetDataset + ParquetFile + read_table + write_metadata + write_table diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py index 87f23524ab49f..4d8da9f5a10ed 100644 --- a/python/pyarrow/__init__.py +++ b/python/pyarrow/__init__.py @@ -31,12 +31,20 @@ from pyarrow._array import (null, bool_, int8, int16, int32, int64, uint8, uint16, uint32, uint64, - timestamp, date32, date64, + time32, time64, timestamp, date32, date64, float16, float32, float64, binary, string, decimal, list_, struct, dictionary, field, - DataType, FixedSizeBinaryType, - Field, Schema, schema, + DataType, + DecimalType, + DictionaryType, + FixedSizeBinaryType, + TimestampType, + Time32Type, + Time64Type, + Field, + Schema, + schema, Array, Tensor, array, from_numpy_dtype, @@ -47,25 +55,34 @@ Int16Array, UInt16Array, Int32Array, UInt32Array, Int64Array, UInt64Array, - ListArray, StringArray, + ListArray, + BinaryArray, StringArray, + FixedSizeBinaryArray, DictionaryArray, + Date32Array, Date64Array, + TimestampArray, Time32Array, Time64Array, + DecimalArray, ArrayValue, Scalar, NA, NAType, BooleanValue, Int8Value, Int16Value, Int32Value, Int64Value, UInt8Value, UInt16Value, UInt32Value, UInt64Value, FloatValue, DoubleValue, ListValue, - BinaryValue, StringValue, FixedSizeBinaryValue) + BinaryValue, StringValue, FixedSizeBinaryValue, + DecimalValue, + Date32Value, Date64Value, TimestampValue) -from pyarrow._io import (HdfsFile, NativeFile, PythonFileInterface, +from pyarrow._io import (HdfsFile, NativeFile, PythonFile, Buffer, BufferReader, InMemoryOutputStream, OSFile, MemoryMappedFile, memory_map, frombuffer, read_tensor, write_tensor, memory_map, create_memory_map, - get_record_batch_size, get_tensor_size) + get_record_batch_size, get_tensor_size, + have_libhdfs, have_libhdfs3) from pyarrow._memory import (MemoryPool, total_allocated_bytes, set_memory_pool, default_memory_pool) -from pyarrow._table import Column, RecordBatch, Table, concat_tables +from pyarrow._table import (ChunkedArray, Column, RecordBatch, Table, + concat_tables) from pyarrow._error import (ArrowException, ArrowKeyError, ArrowInvalid, diff --git a/python/pyarrow/_array.pxd b/python/pyarrow/_array.pxd index afb0c27d4e1ef..464de316f0437 100644 --- a/python/pyarrow/_array.pxd +++ b/python/pyarrow/_array.pxd @@ -42,6 +42,16 @@ cdef class TimestampType(DataType): const CTimestampType* ts_type +cdef class Time32Type(DataType): + cdef: + const CTime32Type* time_type + + +cdef class Time64Type(DataType): + cdef: + const CTime64Type* time_type + + cdef class FixedSizeBinaryType(DataType): cdef: const CFixedSizeBinaryType* fixed_size_binary_type diff --git a/python/pyarrow/_array.pyx b/python/pyarrow/_array.pyx index e41380d0a6685..1c571ba153dfa 100644 --- a/python/pyarrow/_array.pyx +++ b/python/pyarrow/_array.pyx @@ -127,6 +127,30 @@ cdef class TimestampType(DataType): return None +cdef class Time32Type(DataType): + + cdef void init(self, const shared_ptr[CDataType]& type): + DataType.init(self, type) + self.time_type = type.get() + + property unit: + + def __get__(self): + return timeunit_to_string(self.time_type.unit()) + + +cdef class Time64Type(DataType): + + cdef void init(self, const shared_ptr[CDataType]& type): + DataType.init(self, type) + self.time_type = type.get() + + property unit: + + def __get__(self): + return timeunit_to_string(self.time_type.unit()) + + cdef class FixedSizeBinaryType(DataType): cdef void init(self, const shared_ptr[CDataType]& type): @@ -342,6 +366,7 @@ def int64(): 
cdef dict _timestamp_type_cache = {} +cdef dict _time_type_cache = {} cdef timeunit_to_string(TimeUnit unit): @@ -369,7 +394,7 @@ def timestamp(unit_str, tz=None): elif unit_str == 'ns': unit = TimeUnit_NANO else: - raise TypeError('Invalid TimeUnit string') + raise ValueError('Invalid TimeUnit string') cdef TimestampType out = TimestampType() @@ -388,6 +413,50 @@ def timestamp(unit_str, tz=None): return out +def time32(unit_str): + cdef: + TimeUnit unit + c_string c_timezone + + if unit_str == "s": + unit = TimeUnit_SECOND + elif unit_str == 'ms': + unit = TimeUnit_MILLI + else: + raise ValueError('Invalid TimeUnit for time32: {}'.format(unit_str)) + + cdef Time32Type out + if unit in _time_type_cache: + return _time_type_cache[unit] + else: + out = Time32Type() + out.init(ctime32(unit)) + _time_type_cache[unit] = out + return out + + +def time64(unit_str): + cdef: + TimeUnit unit + c_string c_timezone + + if unit_str == "us": + unit = TimeUnit_MICRO + elif unit_str == 'ns': + unit = TimeUnit_NANO + else: + raise ValueError('Invalid TimeUnit for time64: {}'.format(unit_str)) + + cdef Time64Type out + if unit in _time_type_cache: + return _time_type_cache[unit] + else: + out = Time64Type() + out.init(ctime64(unit)) + _time_type_cache[unit] = out + return out + + def date32(): return primitive_type(_Type_DATE32) @@ -516,6 +585,9 @@ cdef Schema box_schema(const shared_ptr[CSchema]& type): def from_numpy_dtype(object dtype): + """ + Convert NumPy dtype to pyarrow.DataType + """ cdef shared_ptr[CDataType] c_type with nogil: check_status(pyarrow.NumPyDtypeToArrow(dtype, &c_type)) diff --git a/python/pyarrow/_io.pyx b/python/pyarrow/_io.pyx index 09e8233bcbc2f..40c76f8363cd2 100644 --- a/python/pyarrow/_io.pyx +++ b/python/pyarrow/_io.pyx @@ -307,7 +307,7 @@ cdef class NativeFile: # Python file-like objects -cdef class PythonFileInterface(NativeFile): +cdef class PythonFile(NativeFile): cdef: object handle @@ -600,7 +600,7 @@ cdef get_reader(object source, shared_ptr[RandomAccessFile]* reader): source = BufferReader(source) elif not isinstance(source, NativeFile) and hasattr(source, 'read'): # Optimistically hope this is file-like - source = PythonFileInterface(source, mode='r') + source = PythonFile(source, mode='r') if isinstance(source, NativeFile): nf = source @@ -622,7 +622,7 @@ cdef get_writer(object source, shared_ptr[OutputStream]* writer): source = OSFile(source, mode='w') elif not isinstance(source, NativeFile) and hasattr(source, 'write'): # Optimistically hope this is file-like - source = PythonFileInterface(source, mode='w') + source = PythonFile(source, mode='w') if isinstance(source, NativeFile): nf = source diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index ea835f6d7bbc8..473a0b9cd9b6d 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -106,6 +106,9 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass CTime64Type" arrow::Time64Type"(CFixedWidthType): TimeUnit unit() + shared_ptr[CDataType] ctime32" arrow::time32"(TimeUnit unit) + shared_ptr[CDataType] ctime64" arrow::time64"(TimeUnit unit) + cdef cppclass CDictionaryType" arrow::DictionaryType"(CFixedWidthType): CDictionaryType(const shared_ptr[CDataType]& index_type, const shared_ptr[CArray]& dictionary) diff --git a/python/pyarrow/tests/test_io.py b/python/pyarrow/tests/test_io.py index c5d3708d6a9ac..a14898ff2ffd0 100644 --- a/python/pyarrow/tests/test_io.py +++ b/python/pyarrow/tests/test_io.py @@ -32,7 +32,7 @@ def 
test_python_file_write(): buf = BytesIO() - f = pa.PythonFileInterface(buf) + f = pa.PythonFile(buf) assert f.tell() == 0 @@ -56,7 +56,7 @@ def test_python_file_read(): data = b'some sample data' buf = BytesIO(data) - f = pa.PythonFileInterface(buf, mode='r') + f = pa.PythonFile(buf, mode='r') assert f.size() == len(data) diff --git a/python/pyarrow/tests/test_schema.py b/python/pyarrow/tests/test_schema.py index d1107fb1faf3f..da704f378873b 100644 --- a/python/pyarrow/tests/test_schema.py +++ b/python/pyarrow/tests/test_schema.py @@ -77,6 +77,27 @@ def test_type_timestamp_with_tz(): assert t.tz == tz +def test_time_types(): + t1 = pa.time32('s') + t2 = pa.time32('ms') + t3 = pa.time64('us') + t4 = pa.time64('ns') + + assert t1.unit == 's' + assert t2.unit == 'ms' + assert t3.unit == 'us' + assert t4.unit == 'ns' + + assert str(t1) == 'time32[s]' + assert str(t4) == 'time64[ns]' + + with pytest.raises(ValueError): + pa.time32('us') + + with pytest.raises(ValueError): + pa.time64('s') + + def test_type_from_numpy_dtype_timestamps(): cases = [ (np.dtype('datetime64[s]'), pa.timestamp('s')), From 38efabea9bbc8d6386f96a635a95c53ba70e6149 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 18 Apr 2017 11:43:13 -0400 Subject: [PATCH 0533/1644] ARROW-844: [Format] Update README documents in format/ Added a section reflecting specification maturity and stability. Author: Wes McKinney Closes #556 from wesm/ARROW-844 and squashes the following commits: 03dbb71 [Wes McKinney] Update README documents in format/ --- format/README.md | 20 +++++++++++++++----- 1 file changed, 15 insertions(+), 5 deletions(-) diff --git a/format/README.md b/format/README.md index 048badb12214b..3aa8fdd6d4d6e 100644 --- a/format/README.md +++ b/format/README.md @@ -14,16 +14,14 @@ ## Arrow specification documents -> **Work-in-progress specification documents**. These are discussion documents -> created by the Arrow developers during late 2015 and in no way represents a -> finalized specification. - Currently, the Arrow specification consists of these pieces: - Metadata specification (see Metadata.md) - Physical memory layout specification (see Layout.md) -- Metadata serialized representation (see Message.fbs) +- Logical Types, Schemas, and Record Batch Metadata (see Schema.fbs) +- Encapsulated Messages (see Message.fbs) - Mechanics of messaging between Arrow systems (IPC, RPC, etc.) (see IPC.md) +- Tensor (Multi-dimensional array) Metadata (see Tensor.fbs) The metadata currently uses Google's [flatbuffers library][1] for serializing a couple related pieces of information: @@ -35,4 +33,16 @@ couple related pieces of information: schema, and enable a system to send and receive Arrow row batches in a form that can be precisely disassembled or reconstructed. +## Arrow Format Maturity and Stability + +We have made significant progress hardening the Arrow in-memory format and +Flatbuffer metadata since the project started in February 2016. We have +integration tests which verify binary compatibility between the Java and C++ +implementations, for example. + +Major versions may still include breaking changes to the memory format or +metadata, so it is recommended to use the same released version of all +libraries in your applications for maximum compatibility. Data stored in the +Arrow IPC formats should not be used for long term storage. + [1]: http://github.com/google/flatbuffers From 4baaa88c3f36d92ffe44f70198c510b7b326932c Mon Sep 17 00:00:00 2001 From: "Uwe L. 
Korn" Date: Tue, 18 Apr 2017 21:11:07 -0400 Subject: [PATCH 0534/1644] ARROW-847: Specify BUILD_BYPRODUCTS for gtest Author: Uwe L. Korn Closes #561 from xhochy/ARROW-847 and squashes the following commits: e8d5439 [Uwe L. Korn] ARROW-847: Specify BUILD_BYPRODUCTS for gtest --- cpp/CMakeLists.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 65fb2c9b1f7ea..5d8a0f6f9dd45 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -486,6 +486,7 @@ if(ARROW_BUILD_TESTS OR ARROW_BUILD_BENCHMARKS) # BUILD_BYPRODUCTS is a 3.2+ feature ExternalProject_Add(googletest_ep URL "https://github.com/google/googletest/archive/release-${GTEST_VERSION}.tar.gz" + BUILD_BYPRODUCTS ${GTEST_STATIC_LIB} ${GTEST_MAIN_STATIC_LIB} CMAKE_ARGS ${GTEST_CMAKE_ARGS}) else() ExternalProject_Add(googletest_ep From a94c03a02f1da8fa61ac86ba2d6c5e91d29c5767 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 18 Apr 2017 21:12:06 -0400 Subject: [PATCH 0535/1644] ARROW-809: [C++] Do not write excess bytes in IPC writer after slicing arrays cc @itaiin Author: Wes McKinney Closes #555 from wesm/ARROW-809 and squashes the following commits: 318c748 [Wes McKinney] Fix compiler warning 7fd6410 [Wes McKinney] Add sparse union test 290a300 [Wes McKinney] clang-format 1d14aa8 [Wes McKinney] Buffer truncation for unions 51f450f [Wes McKinney] Fix struct 7da5cac [Wes McKinney] Add List test and fix implementation. Fix list comparison bug for sliced arrays 9da3936 [Wes McKinney] Refactor to construct explicit non-nullable arrays 33eaa53 [Wes McKinney] Sliced array buffer truncation for fixed size types, string/binary --- cpp/CMakeLists.txt | 5 +- cpp/src/arrow/compare.cc | 10 +- cpp/src/arrow/ipc/ipc-read-write-test.cc | 69 ++++++++++ cpp/src/arrow/ipc/test-common.h | 50 ++++--- cpp/src/arrow/ipc/writer.cc | 158 +++++++++++++++-------- cpp/src/arrow/pretty_print.cc | 6 +- cpp/src/arrow/python/util/datetime.h | 2 +- cpp/src/arrow/table.cc | 4 +- cpp/src/arrow/table.h | 4 +- 9 files changed, 216 insertions(+), 92 deletions(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 5d8a0f6f9dd45..f702da16e7a4e 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -839,8 +839,9 @@ if (${CLANG_FORMAT_FOUND}) `find ${CMAKE_CURRENT_SOURCE_DIR}/src -name \\*.cc -or -name \\*.h | sed -e '/_generated/g' | sed -e '/windows_compatibility.h/g' | - sed -e '/config.h/g' | - sed -e '/platform.h/g'`) + sed -e '/config.h/g' | # python/config.h + sed -e '/platform.h/g'` # python/platform.h + ) # runs clang format and exits with a non-zero exit code if any files need to be reformatted add_custom_target(check-format ${BUILD_SUPPORT_DIR}/run-clang-format.sh ${CMAKE_CURRENT_SOURCE_DIR} ${CLANG_FORMAT_BIN} 0 diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc index e02f3f0a9a69c..ccb299e53a11e 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -460,14 +460,8 @@ class ArrayEqualsVisitor : public RangeEqualsVisitor { return Status::OK(); } - if (left.offset() == 0 && right.offset() == 0) { - result_ = left.values()->Equals(right.values()); - } else { - // One of the arrays is sliced - result_ = left.values()->RangeEquals(left.value_offset(0), - left.value_offset(left.length()), right.value_offset(0), right.values()); - } - + result_ = left.values()->RangeEquals(left.value_offset(0), + left.value_offset(left.length()), right.value_offset(0), right.values()); return Status::OK(); } diff --git a/cpp/src/arrow/ipc/ipc-read-write-test.cc 
b/cpp/src/arrow/ipc/ipc-read-write-test.cc index cfba0d0a95106..b39136ec12d04 100644 --- a/cpp/src/arrow/ipc/ipc-read-write-test.cc +++ b/cpp/src/arrow/ipc/ipc-read-write-test.cc @@ -270,6 +270,75 @@ TEST_P(TestIpcRoundTrip, ZeroLengthArrays) { CheckRoundtrip(bin_array2, 1 << 20); } +TEST_F(TestWriteRecordBatch, SliceTruncatesBuffers) { + auto CheckArray = [this](const std::shared_ptr& array) { + auto f0 = field("f0", array->type()); + auto schema = std::shared_ptr(new Schema({f0})); + RecordBatch batch(schema, array->length(), {array}); + auto sliced_batch = batch.Slice(0, 5); + + int64_t full_size; + int64_t sliced_size; + + ASSERT_OK(GetRecordBatchSize(batch, &full_size)); + ASSERT_OK(GetRecordBatchSize(*sliced_batch, &sliced_size)); + ASSERT_TRUE(sliced_size < full_size) << sliced_size << " " << full_size; + + // make sure we can write and read it + this->CheckRoundtrip(*sliced_batch, 1 << 20); + }; + + std::shared_ptr a0, a1; + auto pool = default_memory_pool(); + + // Integer + ASSERT_OK(MakeRandomInt32Array(500, false, pool, &a0)); + CheckArray(a0); + + // String / Binary + { + auto s = MakeRandomBinaryArray(500, false, pool, &a0); + ASSERT_TRUE(s.ok()); + } + CheckArray(a0); + + // Boolean + ASSERT_OK(MakeRandomBooleanArray(10000, false, &a0)); + CheckArray(a0); + + // List + ASSERT_OK(MakeRandomInt32Array(500, false, pool, &a0)); + ASSERT_OK(MakeRandomListArray(a0, 200, false, pool, &a1)); + CheckArray(a1); + + // Struct + auto struct_type = struct_({field("f0", a0->type())}); + std::vector> struct_children = {a0}; + a1 = std::make_shared(struct_type, a0->length(), struct_children); + CheckArray(a1); + + // Sparse Union + auto union_type = union_({field("f0", a0->type())}, {0}); + std::vector type_ids(a0->length()); + std::shared_ptr ids_buffer; + ASSERT_OK(test::CopyBufferFromVector(type_ids, &ids_buffer)); + a1 = std::make_shared( + union_type, a0->length(), struct_children, ids_buffer); + CheckArray(a1); + + // Dense union + auto dense_union_type = union_({field("f0", a0->type())}, {0}, UnionMode::DENSE); + std::vector type_offsets; + for (int32_t i = 0; i < a0->length(); ++i) { + type_offsets.push_back(i); + } + std::shared_ptr offsets_buffer; + ASSERT_OK(test::CopyBufferFromVector(type_offsets, &offsets_buffer)); + a1 = std::make_shared( + dense_union_type, a0->length(), struct_children, ids_buffer, offsets_buffer); + CheckArray(a1); +} + void TestGetRecordBatchSize(std::shared_ptr batch) { ipc::MockOutputStream mock; int32_t mock_metadata_length = -1; diff --git a/cpp/src/arrow/ipc/test-common.h b/cpp/src/arrow/ipc/test-common.h index a17b609bbcba2..c8ca21cb8f17d 100644 --- a/cpp/src/arrow/ipc/test-common.h +++ b/cpp/src/arrow/ipc/test-common.h @@ -138,31 +138,41 @@ Status MakeRandomListArray(const std::shared_ptr& child_array, int num_li typedef Status MakeRecordBatch(std::shared_ptr* out); -Status MakeBooleanBatch(std::shared_ptr* out) { - const int length = 1000; +Status MakeRandomBooleanArray( + const int length, bool include_nulls, std::shared_ptr* out) { + std::vector values(length); + test::random_null_bytes(length, 0.5, values.data()); + auto data = test::bytes_to_null_buffer(values); + if (include_nulls) { + std::vector valid_bytes(length); + auto null_bitmap = test::bytes_to_null_buffer(valid_bytes); + test::random_null_bytes(length, 0.1, valid_bytes.data()); + *out = std::make_shared(length, data, null_bitmap, -1); + } else { + *out = std::make_shared(length, data, nullptr, 0); + } + return Status::OK(); +} + +Status MakeBooleanBatchSized(const int length, 
std::shared_ptr* out) { // Make the schema auto f0 = field("f0", boolean()); auto f1 = field("f1", boolean()); std::shared_ptr schema(new Schema({f0, f1})); - std::vector values(length); - std::vector valid_bytes(length); - test::random_null_bytes(length, 0.5, values.data()); - test::random_null_bytes(length, 0.1, valid_bytes.data()); - - auto data = test::bytes_to_null_buffer(values); - auto null_bitmap = test::bytes_to_null_buffer(valid_bytes); - - auto a0 = std::make_shared(length, data, null_bitmap, -1); - auto a1 = std::make_shared(length, data, nullptr, 0); + std::shared_ptr a0, a1; + RETURN_NOT_OK(MakeRandomBooleanArray(length, true, &a0)); + RETURN_NOT_OK(MakeRandomBooleanArray(length, false, &a1)); out->reset(new RecordBatch(schema, length, {a0, a1})); return Status::OK(); } -Status MakeIntRecordBatch(std::shared_ptr* out) { - const int length = 10; +Status MakeBooleanBatch(std::shared_ptr* out) { + return MakeBooleanBatchSized(1000, out); +} +Status MakeIntBatchSized(int length, std::shared_ptr* out) { // Make the schema auto f0 = field("f0", int32()); auto f1 = field("f1", int32()); @@ -177,16 +187,20 @@ Status MakeIntRecordBatch(std::shared_ptr* out) { return Status::OK(); } +Status MakeIntRecordBatch(std::shared_ptr* out) { + return MakeIntBatchSized(10, out); +} + template Status MakeRandomBinaryArray( - int64_t length, MemoryPool* pool, std::shared_ptr* out) { + int64_t length, bool include_nulls, MemoryPool* pool, std::shared_ptr* out) { const std::vector values = { "", "", "abc", "123", "efg", "456!@#!@#", "12312"}; Builder builder(pool); const size_t values_len = values.size(); for (int64_t i = 0; i < length; ++i) { int64_t values_index = i % values_len; - if (values_index == 0) { + if (include_nulls && values_index == 0) { RETURN_NOT_OK(builder.AppendNull()); } else { const std::string& value = values[values_index]; @@ -210,12 +224,12 @@ Status MakeStringTypesRecordBatch(std::shared_ptr* out) { // Quirk with RETURN_NOT_OK macro and templated functions { - auto s = MakeRandomBinaryArray(length, pool, &a0); + auto s = MakeRandomBinaryArray(length, true, pool, &a0); RETURN_NOT_OK(s); } { - auto s = MakeRandomBinaryArray(length, pool, &a1); + auto s = MakeRandomBinaryArray(length, true, pool, &a1); RETURN_NOT_OK(s); } out->reset(new RecordBatch(schema, length, {a0, a1})); diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc index 8ba00a6ffd599..61caf6403c8dc 100644 --- a/cpp/src/arrow/ipc/writer.cc +++ b/cpp/src/arrow/ipc/writer.cc @@ -45,6 +45,49 @@ namespace ipc { // ---------------------------------------------------------------------- // Record batch write path +static inline Status GetTruncatedBitmap(int64_t offset, int64_t length, + const std::shared_ptr input, MemoryPool* pool, + std::shared_ptr* buffer) { + if (!input) { + *buffer = input; + return Status::OK(); + } + int64_t min_length = PaddedLength(BitUtil::BytesForBits(length)); + if (offset != 0 || min_length < input->size()) { + // With a sliced array / non-zero offset, we must copy the bitmap + RETURN_NOT_OK(CopyBitmap(pool, input->data(), offset, length, buffer)); + } else { + *buffer = input; + } + return Status::OK(); +} + +template +inline Status GetTruncatedBuffer(int64_t offset, int64_t length, + const std::shared_ptr input, MemoryPool* pool, + std::shared_ptr* buffer) { + if (!input) { + *buffer = input; + return Status::OK(); + } + int32_t byte_width = static_cast(sizeof(T)); + int64_t padded_length = PaddedLength(length * byte_width); + if (offset != 0 || padded_length < 
input->size()) { + *buffer = + SliceBuffer(input, offset * byte_width, std::min(padded_length, input->size())); + } else { + *buffer = input; + } + return Status::OK(); +} + +static inline bool NeedTruncate( + int64_t offset, const Buffer* buffer, int64_t min_length) { + // buffer can be NULL + if (buffer == nullptr) { return false; } + return offset != 0 || min_length < buffer->size(); +} + class RecordBatchWriter : public ArrayVisitor { public: RecordBatchWriter(MemoryPool* pool, int64_t buffer_start_offset, @@ -71,14 +114,9 @@ class RecordBatchWriter : public ArrayVisitor { field_nodes_.emplace_back(arr.length(), arr.null_count(), 0); if (arr.null_count() > 0) { - std::shared_ptr bitmap = arr.null_bitmap(); - - if (arr.offset() != 0) { - // With a sliced array / non-zero offset, we must copy the bitmap - RETURN_NOT_OK( - CopyBitmap(pool_, bitmap->data(), arr.offset(), arr.length(), &bitmap)); - } - + std::shared_ptr bitmap; + RETURN_NOT_OK(GetTruncatedBitmap( + arr.offset(), arr.length(), arr.null_bitmap(), pool_, &bitmap)); buffers_.push_back(bitmap); } else { // Push a dummy zero-length buffer, not to be copied @@ -195,21 +233,23 @@ class RecordBatchWriter : public ArrayVisitor { protected: template Status VisitFixedWidth(const ArrayType& array) { - std::shared_ptr data_buffer = array.data(); + std::shared_ptr data = array.data(); - if (array.offset() != 0) { + const auto& fw_type = static_cast(*array.type()); + const int64_t type_width = fw_type.bit_width() / 8; + int64_t min_length = PaddedLength(array.length() * type_width); + + if (NeedTruncate(array.offset(), data.get(), min_length)) { // Non-zero offset, slice the buffer - const auto& fw_type = static_cast(*array.type()); - const int type_width = fw_type.bit_width() / 8; const int64_t byte_offset = array.offset() * type_width; // Send padding if it's available const int64_t buffer_length = std::min(BitUtil::RoundUpToMultipleOf64(array.length() * type_width), - data_buffer->size() - byte_offset); - data_buffer = SliceBuffer(data_buffer, byte_offset, buffer_length); + data->size() - byte_offset); + data = SliceBuffer(data, byte_offset, buffer_length); } - buffers_.push_back(data_buffer); + buffers_.push_back(data); return Status::OK(); } @@ -249,9 +289,16 @@ class RecordBatchWriter : public ArrayVisitor { RETURN_NOT_OK(GetZeroBasedValueOffsets(array, &value_offsets)); auto data = array.data(); - if (array.offset() != 0) { + int64_t total_data_bytes = 0; + if (value_offsets) { + total_data_bytes = array.value_offset(array.length()) - array.value_offset(0); + } + if (NeedTruncate(array.offset(), data.get(), total_data_bytes)) { // Slice the data buffer to include only the range we need now - data = SliceBuffer(data, array.value_offset(0), array.value_offset(array.length())); + const int64_t start_offset = array.value_offset(0); + const int64_t slice_length = + std::min(PaddedLength(total_data_bytes), data->size() - start_offset); + data = SliceBuffer(data, start_offset, slice_length); } buffers_.push_back(value_offsets); @@ -259,24 +306,11 @@ class RecordBatchWriter : public ArrayVisitor { return Status::OK(); } - Status Visit(const FixedSizeBinaryArray& array) override { - auto data = array.data(); - int32_t width = array.byte_width(); - - if (data && array.offset() != 0) { - data = SliceBuffer(data, array.offset() * width, width * array.length()); - } - buffers_.push_back(data); - return Status::OK(); - } - Status Visit(const BooleanArray& array) override { - std::shared_ptr bits = array.data(); - if (array.offset() != 0) { - 
RETURN_NOT_OK( - CopyBitmap(pool_, bits->data(), array.offset(), array.length(), &bits)); - } - buffers_.push_back(bits); + std::shared_ptr data; + RETURN_NOT_OK( + GetTruncatedBitmap(array.offset(), array.length(), array.data(), pool_, &data)); + buffers_.push_back(data); return Status::OK(); } @@ -299,6 +333,7 @@ class RecordBatchWriter : public ArrayVisitor { VISIT_FIXED_WIDTH(TimestampArray); VISIT_FIXED_WIDTH(Time32Array); VISIT_FIXED_WIDTH(Time64Array); + VISIT_FIXED_WIDTH(FixedSizeBinaryArray); #undef VISIT_FIXED_WIDTH @@ -314,11 +349,16 @@ class RecordBatchWriter : public ArrayVisitor { --max_recursion_depth_; std::shared_ptr values = array.values(); - if (array.offset() != 0) { - // For non-zero offset, we slice the values array accordingly - const int32_t offset = array.value_offset(0); - const int32_t length = array.value_offset(array.length()) - offset; - values = values->Slice(offset, length); + int32_t values_offset = 0; + int32_t values_length = 0; + if (value_offsets) { + values_offset = array.value_offset(0); + values_length = array.value_offset(array.length()) - values_offset; + } + + if (array.offset() != 0 || values_length < values->length()) { + // Must also slice the values + values = values->Slice(values_offset, values_length); } RETURN_NOT_OK(VisitArray(*values)); ++max_recursion_depth_; @@ -328,7 +368,7 @@ class RecordBatchWriter : public ArrayVisitor { Status Visit(const StructArray& array) override { --max_recursion_depth_; for (std::shared_ptr field : array.fields()) { - if (array.offset() != 0) { + if (array.offset() != 0 || array.length() < field->length()) { // If offset is non-zero, slice the child array field = field->Slice(array.offset(), array.length()); } @@ -339,18 +379,21 @@ class RecordBatchWriter : public ArrayVisitor { } Status Visit(const UnionArray& array) override { - auto type_ids = array.type_ids(); - if (array.offset() != 0) { - type_ids = SliceBuffer(type_ids, array.offset() * sizeof(UnionArray::type_id_t), - array.length() * sizeof(UnionArray::type_id_t)); - } + const int64_t offset = array.offset(); + const int64_t length = array.length(); + std::shared_ptr type_ids; + RETURN_NOT_OK(GetTruncatedBuffer( + offset, length, array.type_ids(), pool_, &type_ids)); buffers_.push_back(type_ids); --max_recursion_depth_; if (array.mode() == UnionMode::DENSE) { const auto& type = static_cast(*array.type()); - auto value_offsets = array.value_offsets(); + + std::shared_ptr value_offsets; + RETURN_NOT_OK(GetTruncatedBuffer( + offset, length, array.value_offsets(), pool_, &value_offsets)); // The Union type codes are not necessary 0-indexed uint8_t max_code = 0; @@ -363,7 +406,7 @@ class RecordBatchWriter : public ArrayVisitor { std::vector child_offsets(max_code + 1); std::vector child_lengths(max_code + 1, 0); - if (array.offset() != 0) { + if (offset != 0) { // This is an unpleasant case. 
Because the offsets are different for // each child array, when we have a sliced array, we need to "rebase" // the value_offsets for each array @@ -373,12 +416,12 @@ class RecordBatchWriter : public ArrayVisitor { // Allocate the shifted offsets std::shared_ptr shifted_offsets_buffer; - RETURN_NOT_OK(AllocateBuffer( - pool_, array.length() * sizeof(int32_t), &shifted_offsets_buffer)); + RETURN_NOT_OK( + AllocateBuffer(pool_, length * sizeof(int32_t), &shifted_offsets_buffer)); int32_t* shifted_offsets = reinterpret_cast(shifted_offsets_buffer->mutable_data()); - for (int64_t i = 0; i < array.length(); ++i) { + for (int64_t i = 0; i < length; ++i) { const uint8_t code = type_ids[i]; int32_t shift = child_offsets[code]; if (shift == -1) { child_offsets[code] = shift = unshifted_offsets[i]; } @@ -395,18 +438,23 @@ class RecordBatchWriter : public ArrayVisitor { // Visit children and slice accordingly for (int i = 0; i < type.num_children(); ++i) { std::shared_ptr child = array.child(i); - if (array.offset() != 0) { - const uint8_t code = type.type_codes()[i]; - child = child->Slice(child_offsets[code], child_lengths[code]); + + // TODO: ARROW-809, for sliced unions, tricky to know how much to + // truncate the children. For now, we are truncating the children to be + // no longer than the parent union + const uint8_t code = type.type_codes()[i]; + const int64_t child_length = child_lengths[code]; + if (offset != 0 || length < child_length) { + child = child->Slice(child_offsets[code], std::min(length, child_length)); } RETURN_NOT_OK(VisitArray(*child)); } } else { for (std::shared_ptr child : array.children()) { // Sparse union, slicing is simpler - if (array.offset() != 0) { + if (offset != 0 || length < child->length()) { // If offset is non-zero, slice the child array - child = child->Slice(array.offset(), array.length()); + child = child->Slice(offset, length); } RETURN_NOT_OK(VisitArray(*child)); } diff --git a/cpp/src/arrow/pretty_print.cc b/cpp/src/arrow/pretty_print.cc index 0f46f0306fe08..1f4bfa9acd357 100644 --- a/cpp/src/arrow/pretty_print.cc +++ b/cpp/src/arrow/pretty_print.cc @@ -162,10 +162,8 @@ class ArrayPrinter { Newline(); Write("-- values: "); - auto values = array.values(); - if (array.offset() != 0) { - values = values->Slice(array.value_offset(0), array.value_offset(array.length())); - } + auto values = + array.values()->Slice(array.value_offset(0), array.value_offset(array.length())); RETURN_NOT_OK(PrettyPrint(*values, indent_ + 2, sink_)); return Status::OK(); diff --git a/cpp/src/arrow/python/util/datetime.h b/cpp/src/arrow/python/util/datetime.h index bd80d9f636890..ad0ee0f5056da 100644 --- a/cpp/src/arrow/python/util/datetime.h +++ b/cpp/src/arrow/python/util/datetime.h @@ -32,8 +32,8 @@ static inline int64_t PyDate_to_ms(PyDateTime_Date* pydate) { struct tm epoch = {0}; epoch.tm_year = 70; epoch.tm_mday = 1; - // Milliseconds since the epoch #ifdef _MSC_VER + // Milliseconds since the epoch const int64_t current_timestamp = static_cast(_mkgmtime64(&date)); const int64_t epoch_timestamp = static_cast(_mkgmtime64(&epoch)); return (current_timestamp - epoch_timestamp) * 1000LL; diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc index eabd98bda1893..db17da72a6a33 100644 --- a/cpp/src/arrow/table.cc +++ b/cpp/src/arrow/table.cc @@ -180,11 +180,11 @@ bool RecordBatch::ApproxEquals(const RecordBatch& other) const { return true; } -std::shared_ptr RecordBatch::Slice(int64_t offset) { +std::shared_ptr RecordBatch::Slice(int64_t offset) const { return Slice(offset, 
this->num_rows() - offset); } -std::shared_ptr RecordBatch::Slice(int64_t offset, int64_t length) { +std::shared_ptr RecordBatch::Slice(int64_t offset, int64_t length) const { std::vector> arrays; arrays.reserve(num_columns()); for (const auto& field : columns_) { diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h index cfd1f366b039f..efc2077bd009a 100644 --- a/cpp/src/arrow/table.h +++ b/cpp/src/arrow/table.h @@ -137,8 +137,8 @@ class ARROW_EXPORT RecordBatch { int64_t num_rows() const { return num_rows_; } /// Slice each of the arrays in the record batch and construct a new RecordBatch object - std::shared_ptr Slice(int64_t offset); - std::shared_ptr Slice(int64_t offset, int64_t length); + std::shared_ptr Slice(int64_t offset) const; + std::shared_ptr Slice(int64_t offset, int64_t length) const; /// Returns error status is there is something wrong with the record batch /// contents, like a schema/array mismatch or inconsistent lengths From 59cd801a7645783c0c33ed2435be08db4ffcd378 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Tue, 18 Apr 2017 22:32:59 -0400 Subject: [PATCH 0536/1644] ARROW-852: Also search for ARROW libs when pkg-config provided the path Change-Id: Ic7fb227342782dfed5885f8fc5e73418fd31d504 Author: Uwe L. Korn Closes #563 from xhochy/ARROW-852 and squashes the following commits: 9630352 [Uwe L. Korn] Remove ARROW_HOME 5fc43ce [Uwe L. Korn] Always search for libs --- python/cmake_modules/FindArrow.cmake | 61 ++++++++++++++-------------- python/manylinux1/build_arrow.sh | 1 - 2 files changed, 31 insertions(+), 31 deletions(-) diff --git a/python/cmake_modules/FindArrow.cmake b/python/cmake_modules/FindArrow.cmake index fbe4545a520af..9fb1355fe1d52 100644 --- a/python/cmake_modules/FindArrow.cmake +++ b/python/cmake_modules/FindArrow.cmake @@ -34,6 +34,7 @@ if ("$ENV{ARROW_HOME}" STREQUAL "") message(STATUS "Arrow SO version: ${ARROW_SO_VERSION}") set(ARROW_INCLUDE_DIR ${ARROW_INCLUDE_DIRS}) set(ARROW_LIBS ${ARROW_LIBRARY_DIRS}) + set(ARROW_SEARCH_LIB_PATH ${ARROW_LIBRARY_DIRS}) endif() else() set(ARROW_HOME "$ENV{ARROW_HOME}") @@ -51,42 +52,42 @@ else() # make sure we don't accidentally pick up a different version NO_DEFAULT_PATH ) +endif() - find_library(ARROW_LIB_PATH NAMES arrow - PATHS - ${ARROW_SEARCH_LIB_PATH} - NO_DEFAULT_PATH) - get_filename_component(ARROW_LIBS ${ARROW_LIB_PATH} DIRECTORY) +find_library(ARROW_LIB_PATH NAMES arrow + PATHS + ${ARROW_SEARCH_LIB_PATH} + NO_DEFAULT_PATH) +get_filename_component(ARROW_LIBS ${ARROW_LIB_PATH} DIRECTORY) - find_library(ARROW_JEMALLOC_LIB_PATH NAMES arrow_jemalloc - PATHS - ${ARROW_SEARCH_LIB_PATH} - NO_DEFAULT_PATH) +find_library(ARROW_JEMALLOC_LIB_PATH NAMES arrow_jemalloc + PATHS + ${ARROW_SEARCH_LIB_PATH} + NO_DEFAULT_PATH) - find_library(ARROW_PYTHON_LIB_PATH NAMES arrow_python - PATHS - ${ARROW_SEARCH_LIB_PATH} - NO_DEFAULT_PATH) +find_library(ARROW_PYTHON_LIB_PATH NAMES arrow_python + PATHS + ${ARROW_SEARCH_LIB_PATH} + NO_DEFAULT_PATH) - if (ARROW_INCLUDE_DIR AND ARROW_LIBS) - set(ARROW_FOUND TRUE) +if (ARROW_INCLUDE_DIR AND ARROW_LIBS) + set(ARROW_FOUND TRUE) - if (MSVC) - set(ARROW_STATIC_LIB ${ARROW_LIB_PATH}) - set(ARROW_PYTHON_STATIC_LIB ${ARROW_PYTHON_LIB_PATH}) - set(ARROW_JEMALLOC_STATIC_LIB ${ARROW_JEMALLOC_LIB_PATH}) - set(ARROW_SHARED_LIB ${ARROW_STATIC_LIB}) - set(ARROW_PYTHON_SHARED_LIB ${ARROW_PYTHON_STATIC_LIB}) - set(ARROW_JEMALLOC_SHARED_LIB ${ARROW_JEMALLOC_STATIC_LIB}) - else() - set(ARROW_STATIC_LIB ${ARROW_PYTHON_LIB_PATH}/libarrow.a) - set(ARROW_PYTHON_STATIC_LIB 
${ARROW_PYTHON_LIB_PATH}/libarrow_python.a) - set(ARROW_JEMALLOC_STATIC_LIB ${ARROW_PYTHON_LIB_PATH}/libarrow_jemalloc.a) + if (MSVC) + set(ARROW_STATIC_LIB ${ARROW_LIB_PATH}) + set(ARROW_PYTHON_STATIC_LIB ${ARROW_PYTHON_LIB_PATH}) + set(ARROW_JEMALLOC_STATIC_LIB ${ARROW_JEMALLOC_LIB_PATH}) + set(ARROW_SHARED_LIB ${ARROW_STATIC_LIB}) + set(ARROW_PYTHON_SHARED_LIB ${ARROW_PYTHON_STATIC_LIB}) + set(ARROW_JEMALLOC_SHARED_LIB ${ARROW_JEMALLOC_STATIC_LIB}) + else() + set(ARROW_STATIC_LIB ${ARROW_PYTHON_LIB_PATH}/libarrow.a) + set(ARROW_PYTHON_STATIC_LIB ${ARROW_PYTHON_LIB_PATH}/libarrow_python.a) + set(ARROW_JEMALLOC_STATIC_LIB ${ARROW_PYTHON_LIB_PATH}/libarrow_jemalloc.a) - set(ARROW_SHARED_LIB ${ARROW_LIBS}/libarrow${CMAKE_SHARED_LIBRARY_SUFFIX}) - set(ARROW_JEMALLOC_SHARED_LIB ${ARROW_LIBS}/libarrow_jemalloc${CMAKE_SHARED_LIBRARY_SUFFIX}) - set(ARROW_PYTHON_SHARED_LIB ${ARROW_LIBS}/libarrow_python${CMAKE_SHARED_LIBRARY_SUFFIX}) - endif() + set(ARROW_SHARED_LIB ${ARROW_LIBS}/libarrow${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(ARROW_JEMALLOC_SHARED_LIB ${ARROW_LIBS}/libarrow_jemalloc${CMAKE_SHARED_LIBRARY_SUFFIX}) + set(ARROW_PYTHON_SHARED_LIB ${ARROW_LIBS}/libarrow_python${CMAKE_SHARED_LIBRARY_SUFFIX}) endif() endif() diff --git a/python/manylinux1/build_arrow.sh b/python/manylinux1/build_arrow.sh index 3df322581b54c..8ef087c7d262f 100755 --- a/python/manylinux1/build_arrow.sh +++ b/python/manylinux1/build_arrow.sh @@ -37,7 +37,6 @@ export PYARROW_WITH_JEMALLOC=1 export PYARROW_BUNDLE_ARROW_CPP=1 # Need as otherwise arrow_io is sometimes not linked export LDFLAGS="-Wl,--no-as-needed" -export ARROW_HOME="/arrow-dist" export PARQUET_HOME="/usr" export PKG_CONFIG_PATH=/arrow-dist/lib64/pkgconfig From 4555ab92b174ce645ed20c7c6e15ee236a0e2f7a Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 19 Apr 2017 11:22:54 -0400 Subject: [PATCH 0537/1644] ARROW-841: [Python] Add pyarrow build to Appveyor Author: Wes McKinney Closes #566 from wesm/ARROW-841 and squashes the following commits: 4a04d57 [Wes McKinney] Set CMP0054 policy also in python/CMakeLists.txt d2b4ffb [Wes McKinney] install cython 695e9b3 [Wes McKinney] Fix directory 36ad9e2 [Wes McKinney] Another fix for compiler id check ba31cf5 [Wes McKinney] typo 7c32abb [Wes McKinney] Set CMP0054 policy d27563f [Wes McKinney] Fix for CMP0054 ee883d8 [Wes McKinney] Exit early on failure of things ac903db [Wes McKinney] Fix cmake warning 949558a [Wes McKinney] Remove conda list 54fbd48 [Wes McKinney] Add directory 86f91d2 [Wes McKinney] Write msvc build script that builds pyarrow --- appveyor.yml | 16 +++---- ci/msvc-build.bat | 52 +++++++++++++++++++++++ cpp/CMakeLists.txt | 6 +++ cpp/cmake_modules/FindPythonLibsNew.cmake | 13 +++--- cpp/cmake_modules/SetupCxxFlags.cmake | 2 +- cpp/src/arrow/python/CMakeLists.txt | 12 +++--- python/CMakeLists.txt | 6 +++ 7 files changed, 86 insertions(+), 21 deletions(-) create mode 100644 ci/msvc-build.bat diff --git a/appveyor.yml b/appveyor.yml index b8c26e6e5084c..f2954a92e9e19 100644 --- a/appveyor.yml +++ b/appveyor.yml @@ -21,17 +21,15 @@ os: Visual Studio 2015 environment: matrix: - GENERATOR: Visual Studio 14 2015 Win64 - # - GENERATOR: Visual Studio 14 2015 + PYTHON: "3.5" + ARCH: "64" MSVC_DEFAULT_OPTIONS: ON BOOST_ROOT: C:\Libraries\boost_1_63_0 BOOST_LIBRARYDIR: C:\Libraries\boost_1_63_0\lib64-msvc-14.0 -build_script: - - cd cpp - - mkdir build - - cd build - - cmake -G "%GENERATOR%" -DARROW_CXXFLAGS="/WX" -DARROW_BOOST_USE_SHARED=OFF -DCMAKE_BUILD_TYPE=Release .. - - cmake --build . 
--config Release +init: + - set MINICONDA=C:\Miniconda35-x64 + - set PATH=%MINICONDA%;%MINICONDA%/Scripts;%MINICONDA%/Library/bin;%PATH% -# test_script: - - ctest -VV +build_script: + - call ci\msvc-build.bat diff --git a/ci/msvc-build.bat b/ci/msvc-build.bat new file mode 100644 index 0000000000000..de428b6e46e14 --- /dev/null +++ b/ci/msvc-build.bat @@ -0,0 +1,52 @@ +@rem Licensed to the Apache Software Foundation (ASF) under one +@rem or more contributor license agreements. See the NOTICE file +@rem distributed with this work for additional information +@rem regarding copyright ownership. The ASF licenses this file +@rem to you under the Apache License, Version 2.0 (the +@rem "License"); you may not use this file except in compliance +@rem with the License. You may obtain a copy of the License at +@rem +@rem http://www.apache.org/licenses/LICENSE-2.0 +@rem +@rem Unless required by applicable law or agreed to in writing, +@rem software distributed under the License is distributed on an +@rem "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +@rem KIND, either express or implied. See the License for the +@rem specific language governing permissions and limitations +@rem under the License. + +@echo on + +set CONDA_ENV=C:\arrow-conda-env +set ARROW_HOME=C:\arrow-install + +conda create -p %CONDA_ENV% -q -y python=%PYTHON% ^ + six pytest setuptools numpy pandas cython +call activate %CONDA_ENV% + +@rem Build and test Arrow C++ libraries + +cd cpp +mkdir build +cd build +cmake -G "%GENERATOR%" ^ + -DCMAKE_INSTALL_PREFIX=%ARROW_HOME% ^ + -DARROW_BOOST_USE_SHARED=OFF ^ + -DCMAKE_BUILD_TYPE=Release ^ + -DARROW_CXXFLAGS="/WX" ^ + -DARROW_PYTHON=on ^ + .. || exit /B +cmake --build . --target INSTALL --config Release || exit /B + +@rem Needed so python-test.exe works +set PYTHONPATH=%CONDA_ENV%\Lib;%CONDA_ENV%\Lib\site-packages;%CONDA_ENV%\python35.zip;%CONDA_ENV%\DLLs;%CONDA_ENV% + +ctest -VV || exit /B + +@rem Build and import pyarrow + +set PATH=%ARROW_HOME%\bin;%PATH% + +cd ..\..\python +python setup.py build_ext --inplace || exit /B +python -c "import pyarrow" || exit /B diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index f702da16e7a4e..c1cf7852a30b9 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -36,6 +36,12 @@ else() include(GNUInstallDirs) endif() +# Compatibility with CMake 3.1 +if(POLICY CMP0054) + # http://www.cmake.org/cmake/help/v3.1/policy/CMP0054.html + cmake_policy(SET CMP0054 NEW) +endif() + set(ARROW_SO_VERSION "0") set(ARROW_ABI_VERSION "${ARROW_SO_VERSION}.0.0") diff --git a/cpp/cmake_modules/FindPythonLibsNew.cmake b/cpp/cmake_modules/FindPythonLibsNew.cmake index 961081609cb12..09124aa17bb9c 100644 --- a/cpp/cmake_modules/FindPythonLibsNew.cmake +++ b/cpp/cmake_modules/FindPythonLibsNew.cmake @@ -141,12 +141,13 @@ string(REGEX REPLACE "\\\\" "/" PYTHON_INCLUDE_DIR ${PYTHON_INCLUDE_DIR}) string(REGEX REPLACE "\\\\" "/" PYTHON_SITE_PACKAGES ${PYTHON_SITE_PACKAGES}) if(CMAKE_HOST_WIN32) - if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "MSVC") - set(PYTHON_LIBRARY - "${PYTHON_PREFIX}/libs/Python${PYTHON_LIBRARY_SUFFIX}.lib") - else() - set(PYTHON_LIBRARY "${PYTHON_PREFIX}/libs/libpython${PYTHON_LIBRARY_SUFFIX}.a") - endif() + # Appease CMP0054 + if(CMAKE_CXX_COMPILER_ID STREQUAL "MSVC") + set(PYTHON_LIBRARY + "${PYTHON_PREFIX}/libs/Python${PYTHON_LIBRARY_SUFFIX}.lib") + else() + set(PYTHON_LIBRARY "${PYTHON_PREFIX}/libs/libpython${PYTHON_LIBRARY_SUFFIX}.a") + endif() elseif(APPLE) # In some cases libpythonX.X.dylib is not part of the PYTHON_PREFIX and we # need to 
call `python-config --prefix` to determine the correct location. diff --git a/cpp/cmake_modules/SetupCxxFlags.cmake b/cpp/cmake_modules/SetupCxxFlags.cmake index 694e5a37df4ba..7e229ff90a3e7 100644 --- a/cpp/cmake_modules/SetupCxxFlags.cmake +++ b/cpp/cmake_modules/SetupCxxFlags.cmake @@ -30,7 +30,7 @@ if (MSVC) # insecure, like std::getenv add_definitions(-D_CRT_SECURE_NO_WARNINGS) - if ("${CMAKE_CXX_COMPILER_ID}" STREQUAL "Clang") + if (CMAKE_CXX_COMPILER_ID STREQUAL "Clang") # clang-cl set(CXX_COMMON_FLAGS "-EHsc") elseif(${CMAKE_CXX_COMPILER_VERSION} VERSION_LESS 19) diff --git a/cpp/src/arrow/python/CMakeLists.txt b/cpp/src/arrow/python/CMakeLists.txt index 607a1c436c45d..5c2b58825294c 100644 --- a/cpp/src/arrow/python/CMakeLists.txt +++ b/cpp/src/arrow/python/CMakeLists.txt @@ -57,6 +57,13 @@ set(ARROW_PYTHON_SHARED_LINK_LIBS arrow_shared ) +if (MSVC) + set(ARROW_PYTHON_SHARED_LINK_LIBS + ${ARROW_PYTHON_SHARED_LINK_LIBS} + ${PYTHON_LIBRARIES} + ) +endif() + ADD_ARROW_LIB(arrow_python SOURCES ${ARROW_PYTHON_SRCS} SHARED_LINK_FLAGS "" @@ -64,11 +71,6 @@ ADD_ARROW_LIB(arrow_python STATIC_LINK_LIBS "" ) -if (MSVC) - target_link_libraries(arrow_python_shared - ${PYTHON_LIBRARIES}) -endif() - if ("${COMPILER_FAMILY}" STREQUAL "clang") # Clang, be quiet. Python C API has lots of macros set_property(SOURCE ${ARROW_PYTHON_SRCS} diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index c1431af67ed55..3db7b7bf83822 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -28,6 +28,12 @@ set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_SOURCE_DIR}/../cpp/cmake_mod include(CMakeParseArguments) +# Compatibility with CMake 3.1 +if(POLICY CMP0054) + # http://www.cmake.org/cmake/help/v3.1/policy/CMP0054.html + cmake_policy(SET CMP0054 NEW) +endif() + set(BUILD_SUPPORT_DIR "${CMAKE_CURRENT_SOURCE_DIR}/../cpp/build-support") # Allow "make install" to not depend on all targets. From 41a8ff9ad18a4970c16b674b56ade25b8e8986ec Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 19 Apr 2017 19:42:49 +0200 Subject: [PATCH 0538/1644] ARROW-853: [Python] Only set RPATH when bundling the shared libraries See discussion in https://github.com/apache/arrow/pull/562. Modifying RPATH is no longer needed when libarrow/libarrow_python are installed someplace else in the loader path. Author: Wes McKinney Closes #564 from wesm/ARROW-853 and squashes the following commits: 262f43a [Wes McKinney] Only set RPATH when bundling the shared libraries --- python/CMakeLists.txt | 24 ++++++++++++++---------- 1 file changed, 14 insertions(+), 10 deletions(-) diff --git a/python/CMakeLists.txt b/python/CMakeLists.txt index 3db7b7bf83822..0d34bcdfa6e49 100644 --- a/python/CMakeLists.txt +++ b/python/CMakeLists.txt @@ -346,21 +346,25 @@ foreach(module ${CYTHON_EXTENSIONS}) LIBRARY_OUTPUT_DIRECTORY ${module_output_directory}) endif() - if(APPLE) + if (PYARROW_BUNDLE_ARROW_CPP) + # In the event that we are bundling the shared libraries (e.g. 
in a + # manylinux1 wheel), we need to set the RPATH of the extensions to the + # root of the pyarrow/ package so that libarrow/libarrow_python are able + # to be loaded properly + if(APPLE) set(module_install_rpath "@loader_path") - else() + else() set(module_install_rpath "\$ORIGIN") - endif() - list(LENGTH directories i) - while(${i} GREATER 0) + endif() + list(LENGTH directories i) + while(${i} GREATER 0) set(module_install_rpath "${module_install_rpath}/..") math(EXPR i "${i} - 1" ) - endwhile(${i} GREATER 0) + endwhile(${i} GREATER 0) - # for inplace development for now - #set(module_install_rpath "${CMAKE_SOURCE_DIR}/pyarrow/") + set_target_properties(${module_name} PROPERTIES + INSTALL_RPATH ${module_install_rpath}) + endif() - set_target_properties(${module_name} PROPERTIES - INSTALL_RPATH ${module_install_rpath}) target_link_libraries(${module_name} ${LINK_LIBS}) endforeach(module) From 391242a17d5bdb041b7b1b036b48e69e82ec29b8 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Wed, 19 Apr 2017 15:49:00 -0400 Subject: [PATCH 0539/1644] ARROW-848: [Python] Another pass on conda dev guide Per feedback in https://github.com/apache/arrow/commit/bb8514cc9d7068c8b62d346577370751d68221d8 Author: Wes McKinney Closes #562 from wesm/conda-quickstart-iterate and squashes the following commits: 881a44d [Wes McKinney] Add system requirements notes about gcc 4.9, use boost shared libs 8c95705 [Wes McKinney] Install cmake in conda env 8c2885e [Wes McKinney] Another pass on conda dev guide, do not require LD_LIBRARY_PATH. Install everything in a single conda environment --- python/DEVELOPMENT.md | 73 ++++++++++++++++++++++++++++--------------- python/setup.py | 2 +- 2 files changed, 49 insertions(+), 26 deletions(-) diff --git a/python/DEVELOPMENT.md b/python/DEVELOPMENT.md index ca744628da1b5..7f08169d613f0 100644 --- a/python/DEVELOPMENT.md +++ b/python/DEVELOPMENT.md @@ -16,36 +16,41 @@ ### Linux and macOS -First, set up your thirdparty C++ toolchain using libraries from conda-forge: +#### System Requirements + +On macOS, any modern XCode (6.4 or higher; the current version is 8.3.1) is +sufficient. + +On Linux, for this guide, we recommend using gcc 4.8 or 4.9, or clang 3.7 or +higher. 
You can check your version by running ```shell -conda config --add channels conda-forge +$ gcc --version +``` -export ARROW_BUILD_TYPE=Release +On Ubuntu 16.04 and higher, you can obtain gcc 4.9 with: -export CPP_TOOLCHAIN=$HOME/cpp-toolchain -export LD_LIBRARY_PATH=$CPP_TOOLCHAIN/lib:$LD_LIBRARY_PATH +```shell +$ sudo apt-get install g++-4.9 +``` -export BOOST_ROOT=$CPP_TOOLCHAIN -export FLATBUFFERS_HOME=$CPP_TOOLCHAIN -export RAPIDJSON_HOME=$CPP_TOOLCHAIN -export THRIFT_HOME=$CPP_TOOLCHAIN -export ZLIB_HOME=$CPP_TOOLCHAIN -export SNAPPY_HOME=$CPP_TOOLCHAIN -export BROTLI_HOME=$CPP_TOOLCHAIN -export JEMALLOC_HOME=$CPP_TOOLCHAIN -export ARROW_HOME=$CPP_TOOLCHAIN -export PARQUET_HOME=$CPP_TOOLCHAIN +Finally, set gcc 4.9 as the active compiler using: -conda create -y -q -p $CPP_TOOLCHAIN \ - flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib brotli jemalloc +```shell +export CC=gcc-4.9 +export CXX=g++-4.9 ``` -Now, activate a conda environment containing your target Python version and -NumPy installed: +#### Environment Setup and Build + +First, let's create a conda environment with all the C++ build and Python +dependencies from conda-forge: ```shell -conda create -y -q -n pyarrow-dev python=3.6 numpy +conda create -y -q -n pyarrow-dev \ + python=3.6 numpy six setuptools cython pandas pytest \ + cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib \ + brotli jemalloc -c conda-forge source activate pyarrow-dev ``` @@ -67,6 +72,26 @@ drwxrwxr-x 12 wesm wesm 4096 Apr 15 19:19 arrow/ drwxrwxr-x 12 wesm wesm 4096 Apr 15 19:19 parquet-cpp/ ``` +We need to set a number of environment variables to let Arrow's build system +know about our build toolchain: + +``` +export ARROW_BUILD_TYPE=release + +export BOOST_ROOT=$CONDA_PREFIX +export BOOST_LIBRARYDIR=$CONDA_PREFIX/lib + +export FLATBUFFERS_HOME=$CONDA_PREFIX +export RAPIDJSON_HOME=$CONDA_PREFIX +export THRIFT_HOME=$CONDA_PREFIX +export ZLIB_HOME=$CONDA_PREFIX +export SNAPPY_HOME=$CONDA_PREFIX +export BROTLI_HOME=$CONDA_PREFIX +export JEMALLOC_HOME=$CONDA_PREFIX +export ARROW_HOME=$CONDA_PREFIX +export PARQUET_HOME=$CONDA_PREFIX +``` + Now build and install the Arrow C++ libraries: ```shell @@ -74,7 +99,7 @@ mkdir arrow/cpp/build pushd arrow/cpp/build cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \ - -DCMAKE_INSTALL_PREFIX=$CPP_TOOLCHAIN \ + -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX \ -DARROW_PYTHON=on \ -DARROW_BUILD_TESTS=OFF \ .. 
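A quick sanity check (a minimal sketch, assuming the `make install` step of the
Arrow C++ build has completed): the libraries should now be visible under the
conda prefix that was passed as the install prefix:

```shell
# libarrow and libarrow_python are installed into the active conda env
ls $CONDA_PREFIX/lib/libarrow*
ls $CONDA_PREFIX/include/arrow
```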
@@ -90,7 +115,7 @@ mkdir parquet-cpp/build pushd parquet-cpp/build cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \ - -DCMAKE_INSTALL_PREFIX=$CPP_TOOLCHAIN \ + -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX \ -DPARQUET_BUILD_BENCHMARKS=off \ -DPARQUET_BUILD_EXECUTABLES=off \ -DPARQUET_ZLIB_VENDORED=off \ @@ -102,11 +127,9 @@ make install popd ``` -Now, install requisite build requirements for pyarrow, then build: +Now, build pyarrow: ```shell -conda install -y -q six setuptools cython pandas pytest - cd arrow/python python setup.py build_ext --build-type=$ARROW_BUILD_TYPE --with-parquet --inplace ``` diff --git a/python/setup.py b/python/setup.py index ab71e7858e626..1c46617066925 100644 --- a/python/setup.py +++ b/python/setup.py @@ -155,7 +155,7 @@ def _run_cmake(self): cmake_options.append('-DPYARROW_BUNDLE_ARROW_CPP=ON') cmake_options.append('-DCMAKE_BUILD_TYPE={0}' - .format(self.build_type)) + .format(self.build_type.lower())) if sys.platform != 'win32': cmake_command = (['cmake', self.extra_cmake_args] + From 74f89cfbe0793043eb579ec30b3d6467b0ad9af2 Mon Sep 17 00:00:00 2001 From: Phillip Cloud Date: Wed, 19 Apr 2017 17:11:51 -0400 Subject: [PATCH 0540/1644] ARROW-858: Remove boost_regex from arrow dependencies Author: Phillip Cloud Closes #567 from cpcloud/decimal-no-regex and squashes the following commits: b5c59bd [Phillip Cloud] ARROW-858: Remove boost_regex from arrow dependencies --- .travis.yml | 1 - ci/travis_script_python.sh | 1 + cpp/CMakeLists.txt | 22 ++----- cpp/README.md | 1 - cpp/src/arrow/ipc/CMakeLists.txt | 4 +- cpp/src/arrow/ipc/ipc-read-write-test.cc | 4 +- cpp/src/arrow/python/CMakeLists.txt | 3 +- cpp/src/arrow/util/decimal-test.cc | 40 +++++++++++++ cpp/src/arrow/util/decimal.cc | 73 +++++++++++++++++++----- 9 files changed, 108 insertions(+), 41 deletions(-) diff --git a/.travis.yml b/.travis.yml index 824f62bccaab9..6ebebd4513fc7 100644 --- a/.travis.yml +++ b/.travis.yml @@ -14,7 +14,6 @@ addons: - valgrind - libboost-dev - libboost-filesystem-dev - - libboost-regex-dev - libboost-system-dev - libjemalloc-dev - gtk-doc-tools diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index bde1fd7e249ec..c1426da7247b2 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -22,6 +22,7 @@ pushd $ARROW_PYTHON_DIR export PARQUET_HOME=$TRAVIS_BUILD_DIR/parquet-env build_parquet_cpp() { + export PARQUET_ARROW_VERSION=$(git rev-parse HEAD) conda create -y -q -p $PARQUET_HOME python=3.6 source activate $PARQUET_HOME diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index c1cf7852a30b9..81e4c90c73147 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -410,19 +410,16 @@ if (ARROW_BOOST_USE_SHARED) add_definitions(-DBOOST_ALL_DYN_LINK) endif() - find_package(Boost COMPONENTS system filesystem regex REQUIRED) + find_package(Boost COMPONENTS system filesystem REQUIRED) if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG") set(BOOST_SHARED_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_DEBUG}) set(BOOST_SHARED_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_DEBUG}) - set(BOOST_SHARED_REGEX_LIBRARY ${Boost_REGEX_LIBRARY_DEBUG}) else() set(BOOST_SHARED_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_RELEASE}) set(BOOST_SHARED_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_RELEASE}) - set(BOOST_SHARED_REGEX_LIBRARY ${Boost_REGEX_LIBRARY_RELEASE}) endif() set(BOOST_SYSTEM_LIBRARY boost_system_shared) set(BOOST_FILESYSTEM_LIBRARY boost_filesystem_shared) - set(BOOST_REGEX_LIBRARY boost_regex_shared) else() # Find static boost headers and libs # TODO Differentiate here between 
release and debug builds @@ -431,15 +428,12 @@ else() if ("${CMAKE_BUILD_TYPE}" STREQUAL "DEBUG") set(BOOST_STATIC_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_DEBUG}) set(BOOST_STATIC_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_DEBUG}) - set(BOOST_STATIC_REGEX_LIBRARY ${Boost_REGEX_LIBRARY_DEBUG}) else() set(BOOST_STATIC_SYSTEM_LIBRARY ${Boost_SYSTEM_LIBRARY_RELEASE}) set(BOOST_STATIC_FILESYSTEM_LIBRARY ${Boost_FILESYSTEM_LIBRARY_RELEASE}) - set(BOOST_STATIC_REGEX_LIBRARY ${Boost_REGEX_LIBRARY_RELEASE}) endif() set(BOOST_SYSTEM_LIBRARY boost_system_static) set(BOOST_FILESYSTEM_LIBRARY boost_filesystem_static) - set(BOOST_REGEX_LIBRARY boost_regex_static) endif() message(STATUS "Boost include dir: " ${Boost_INCLUDE_DIRS}) @@ -453,11 +447,7 @@ ADD_THIRDPARTY_LIB(boost_filesystem STATIC_LIB "${BOOST_STATIC_FILESYSTEM_LIBRARY}" SHARED_LIB "${BOOST_SHARED_FILESYSTEM_LIBRARY}") -ADD_THIRDPARTY_LIB(boost_regex - STATIC_LIB "${BOOST_STATIC_REGEX_LIBRARY}" - SHARED_LIB "${BOOST_SHARED_REGEX_LIBRARY}") - -SET(ARROW_BOOST_LIBS boost_system boost_filesystem boost_regex) +SET(ARROW_BOOST_LIBS boost_system boost_filesystem) include_directories(SYSTEM ${Boost_INCLUDE_DIR}) @@ -758,8 +748,7 @@ set(ARROW_MIN_TEST_LIBS arrow_static gtest gtest_main - ${ARROW_BASE_LIBS} - ${BOOST_REGEX_LIBRARY}) + ${ARROW_BASE_LIBS}) if (APPLE) set(ARROW_MIN_TEST_LIBS @@ -777,8 +766,7 @@ set(ARROW_TEST_LINK_LIBS ${ARROW_MIN_TEST_LIBS}) set(ARROW_BENCHMARK_LINK_LIBS arrow_static arrow_benchmark_main - ${ARROW_BASE_LIBS} - ${BOOST_REGEX_LIBRARY}) + ${ARROW_BASE_LIBS}) ############################################################ # "make ctags" target @@ -875,7 +863,7 @@ endif() ############################################################ set(ARROW_LINK_LIBS - ${BOOST_REGEX_LIBRARY}) + ) set(ARROW_STATIC_LINK_LIBS) diff --git a/cpp/README.md b/cpp/README.md index 339b6b47533cb..69c695020add5 100644 --- a/cpp/README.md +++ b/cpp/README.md @@ -31,7 +31,6 @@ On Ubuntu/Debian you can install the requirements with: sudo apt-get install cmake \ libboost-dev \ libboost-filesystem-dev \ - libboost-regex-dev \ libboost-system-dev ``` diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index 37b455395644f..fc1d53e18a3dc 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -91,14 +91,12 @@ if(MSVC) set(UTIL_LINK_LIBS arrow_static ${BOOST_FILESYSTEM_LIBRARY} - ${BOOST_SYSTEM_LIBRARY} - ${BOOST_REGEX_LIBRARY}) + ${BOOST_SYSTEM_LIBRARY}) else() set(UTIL_LINK_LIBS arrow_static ${BOOST_FILESYSTEM_LIBRARY} ${BOOST_SYSTEM_LIBRARY} - ${BOOST_REGEX_LIBRARY} dl) endif() diff --git a/cpp/src/arrow/ipc/ipc-read-write-test.cc b/cpp/src/arrow/ipc/ipc-read-write-test.cc index b39136ec12d04..cd793e08a26be 100644 --- a/cpp/src/arrow/ipc/ipc-read-write-test.cc +++ b/cpp/src/arrow/ipc/ipc-read-write-test.cc @@ -322,8 +322,8 @@ TEST_F(TestWriteRecordBatch, SliceTruncatesBuffers) { std::vector type_ids(a0->length()); std::shared_ptr ids_buffer; ASSERT_OK(test::CopyBufferFromVector(type_ids, &ids_buffer)); - a1 = std::make_shared( - union_type, a0->length(), struct_children, ids_buffer); + a1 = + std::make_shared(union_type, a0->length(), struct_children, ids_buffer); CheckArray(a1); // Dense union diff --git a/cpp/src/arrow/python/CMakeLists.txt b/cpp/src/arrow/python/CMakeLists.txt index 5c2b58825294c..c5cbc50845de0 100644 --- a/cpp/src/arrow/python/CMakeLists.txt +++ b/cpp/src/arrow/python/CMakeLists.txt @@ -35,8 +35,7 @@ endif() set(ARROW_PYTHON_MIN_TEST_LIBS arrow_python_test_main 
arrow_python_static - arrow_static - ${BOOST_REGEX_LIBRARY}) + arrow_static) set(ARROW_PYTHON_TEST_LINK_LIBS ${ARROW_PYTHON_MIN_TEST_LIBS}) diff --git a/cpp/src/arrow/util/decimal-test.cc b/cpp/src/arrow/util/decimal-test.cc index dcaa9afd8724a..5d95c2cadc107 100644 --- a/cpp/src/arrow/util/decimal-test.cc +++ b/cpp/src/arrow/util/decimal-test.cc @@ -159,5 +159,45 @@ TEST(DecimalTest, TestDecimal128StringAndBytesRoundTrip) { ASSERT_EQ(expected.value, result.value); } + +template +class DecimalZerosTest : public ::testing::Test {}; +TYPED_TEST_CASE(DecimalZerosTest, DecimalTypes); + +TYPED_TEST(DecimalZerosTest, LeadingZerosNoDecimalPoint) { + std::string string_value("0000000"); + Decimal d; + int precision; + int scale; + FromString(string_value, &d, &precision, &scale); + ASSERT_EQ(precision, 7); + ASSERT_EQ(scale, 0); + ASSERT_EQ(d.value, 0); +} + +TYPED_TEST(DecimalZerosTest, LeadingZerosDecimalPoint) { + std::string string_value("000.0000"); + Decimal d; + int precision; + int scale; + FromString(string_value, &d, &precision, &scale); + // We explicitly do not support this for now, otherwise this would be ASSERT_EQ + ASSERT_NE(precision, 7); + + ASSERT_EQ(scale, 4); + ASSERT_EQ(d.value, 0); +} + +TYPED_TEST(DecimalZerosTest, NoLeadingZerosDecimalPoint) { + std::string string_value(".00000"); + Decimal d; + int precision; + int scale; + FromString(string_value, &d, &precision, &scale); + ASSERT_EQ(precision, 5); + ASSERT_EQ(scale, 5); + ASSERT_EQ(d.value, 0); +} + } // namespace decimal } // namespace arrow diff --git a/cpp/src/arrow/util/decimal.cc b/cpp/src/arrow/util/decimal.cc index 3b8a3ff0398b5..2fe9da4aba9a2 100644 --- a/cpp/src/arrow/util/decimal.cc +++ b/cpp/src/arrow/util/decimal.cc @@ -17,34 +17,77 @@ #include "arrow/util/decimal.h" -#include - namespace arrow { namespace decimal { -static const boost::regex DECIMAL_PATTERN("(\\+?|-?)((0*)(\\d*))(\\.(\\d+))?"); - template ARROW_EXPORT Status FromString( const std::string& s, Decimal* out, int* precision, int* scale) { + // Implements this regex: "(\\+?|-?)((0*)(\\d*))(\\.(\\d+))?"; if (s.empty()) { return Status::Invalid("Empty string cannot be converted to decimal"); } - boost::smatch match; - if (!boost::regex_match(s, match, DECIMAL_PATTERN)) { - std::stringstream ss; - ss << "String " << s << " is not a valid decimal string"; - return Status::Invalid(ss.str()); + + int8_t sign = 1; + auto charp = s.cbegin(); + auto end = s.cend(); + + if (*charp == '+' || *charp == '-') { + if (*charp == '-') { sign = -1; } + ++charp; } - const int8_t sign = match[1].str() == "-" ? 
-1 : 1; - std::string whole_part = match[4].str(); - std::string fractional_part = match[6].str(); - if (scale != nullptr) { *scale = static_cast(fractional_part.size()); } + + auto numeric_string_start = charp; + + // skip leading zeros + while (*charp == '0') { + ++charp; + } + + // all zeros and no decimal point + if (charp == end) { + if (out != nullptr) { out->value = static_cast(0); } + + // Not sure what other libraries assign precision to for this case (this case of + // a string consisting only of one or more zeros) + if (precision != nullptr) { + *precision = static_cast(charp - numeric_string_start); + } + + if (scale != nullptr) { *scale = 0; } + + return Status::OK(); + } + + auto whole_part_start = charp; + while (isdigit(*charp)) { + ++charp; + } + auto whole_part_end = charp; + std::string whole_part(whole_part_start, whole_part_end); + + if (*charp == '.') { + ++charp; + } else { + // no decimal point + DCHECK_EQ(charp, end); + } + + auto fractional_part_start = charp; + while (isdigit(*charp)) { + ++charp; + } + auto fractional_part_end = charp; + std::string fractional_part(fractional_part_start, fractional_part_end); + if (precision != nullptr) { - *precision = - static_cast(whole_part.size()) + static_cast(fractional_part.size()); + *precision = static_cast(whole_part.size() + fractional_part.size()); } + + if (scale != nullptr) { *scale = static_cast(fractional_part.size()); } + if (out != nullptr) { StringToInteger(whole_part, fractional_part, sign, &out->value); } + return Status::OK(); } From 0dc6fe8f33befaaa5fc8055b6c157ac1ccb09e6b Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Wed, 19 Apr 2017 17:17:38 -0400 Subject: [PATCH 0541/1644] ARROW-846: [GLib] Add GArrowTensor, GArrowInt8Tensor and GArrowUInt8Tensor Author: Kouhei Sutou Closes #560 from kou/glib-add-tensor and squashes the following commits: ed949d4 [Kouhei Sutou] [GLib] Support running tests on Ubuntu 14.04 39d40f0 [Kouhei Sutou] [GLib] Add GArrowTensor, GArrowInt8Tensor and GArrowUInt8Tensor --- c_glib/arrow-glib/Makefile.am | 8 + c_glib/arrow-glib/arrow-glib.h | 3 + c_glib/arrow-glib/arrow-glib.hpp | 2 + c_glib/arrow-glib/int8-tensor.cpp | 105 ++++++++ c_glib/arrow-glib/int8-tensor.h | 79 ++++++ c_glib/arrow-glib/numeric-tensor.hpp | 64 +++++ c_glib/arrow-glib/tensor.cpp | 390 +++++++++++++++++++++++++++ c_glib/arrow-glib/tensor.h | 77 ++++++ c_glib/arrow-glib/tensor.hpp | 27 ++ c_glib/arrow-glib/uint8-tensor.cpp | 105 ++++++++ c_glib/arrow-glib/uint8-tensor.h | 79 ++++++ c_glib/test/helper/omittable.rb | 28 ++ c_glib/test/run-test.rb | 1 + c_glib/test/test-int8-tensor.rb | 43 +++ c_glib/test/test-tensor.rb | 100 +++++++ c_glib/test/test-uint8-tensor.rb | 43 +++ 16 files changed, 1154 insertions(+) create mode 100644 c_glib/arrow-glib/int8-tensor.cpp create mode 100644 c_glib/arrow-glib/int8-tensor.h create mode 100644 c_glib/arrow-glib/numeric-tensor.hpp create mode 100644 c_glib/arrow-glib/tensor.cpp create mode 100644 c_glib/arrow-glib/tensor.h create mode 100644 c_glib/arrow-glib/tensor.hpp create mode 100644 c_glib/arrow-glib/uint8-tensor.cpp create mode 100644 c_glib/arrow-glib/uint8-tensor.h create mode 100644 c_glib/test/helper/omittable.rb create mode 100644 c_glib/test/test-int8-tensor.rb create mode 100644 c_glib/test/test-tensor.rb create mode 100644 c_glib/test/test-uint8-tensor.rb diff --git a/c_glib/arrow-glib/Makefile.am b/c_glib/arrow-glib/Makefile.am index 2e7a9a0e439eb..fbfe3a4071000 100644 --- a/c_glib/arrow-glib/Makefile.am +++ b/c_glib/arrow-glib/Makefile.am @@ -65,6 +65,7 @@ 
libarrow_glib_la_headers = \ int8-array.h \ int8-array-builder.h \ int8-data-type.h \ + int8-tensor.h \ int16-array.h \ int16-array-builder.h \ int16-data-type.h \ @@ -88,10 +89,12 @@ libarrow_glib_la_headers = \ struct-array-builder.h \ struct-data-type.h \ table.h \ + tensor.h \ type.h \ uint8-array.h \ uint8-array-builder.h \ uint8-data-type.h \ + uint8-tensor.h \ uint16-array.h \ uint16-array-builder.h \ uint16-data-type.h \ @@ -152,6 +155,7 @@ libarrow_glib_la_sources = \ int8-array.cpp \ int8-array-builder.cpp \ int8-data-type.cpp \ + int8-tensor.cpp \ int16-array.cpp \ int16-array-builder.cpp \ int16-data-type.cpp \ @@ -175,10 +179,12 @@ libarrow_glib_la_sources = \ struct-array-builder.cpp \ struct-data-type.cpp \ table.cpp \ + tensor.cpp \ type.cpp \ uint8-array.cpp \ uint8-array-builder.cpp \ uint8-data-type.cpp \ + uint8-tensor.cpp \ uint16-array.cpp \ uint16-array-builder.cpp \ uint16-data-type.cpp \ @@ -220,9 +226,11 @@ libarrow_glib_la_cpp_headers = \ data-type.hpp \ error.hpp \ field.hpp \ + numeric-tensor.hpp \ record-batch.hpp \ schema.hpp \ table.hpp \ + tensor.hpp \ type.hpp libarrow_glib_la_cpp_headers += \ diff --git a/c_glib/arrow-glib/arrow-glib.h b/c_glib/arrow-glib/arrow-glib.h index b15c56f7bb486..eec9e25ebf690 100644 --- a/c_glib/arrow-glib/arrow-glib.h +++ b/c_glib/arrow-glib/arrow-glib.h @@ -42,6 +42,7 @@ #include #include #include +#include #include #include #include @@ -65,10 +66,12 @@ #include #include #include +#include #include #include #include #include +#include #include #include #include diff --git a/c_glib/arrow-glib/arrow-glib.hpp b/c_glib/arrow-glib/arrow-glib.hpp index 3404d4d212917..d6ef370095bdf 100644 --- a/c_glib/arrow-glib/arrow-glib.hpp +++ b/c_glib/arrow-glib/arrow-glib.hpp @@ -31,9 +31,11 @@ #include #include #include +#include #include #include #include +#include #include #include diff --git a/c_glib/arrow-glib/int8-tensor.cpp b/c_glib/arrow-glib/int8-tensor.cpp new file mode 100644 index 0000000000000..06521a00997c0 --- /dev/null +++ b/c_glib/arrow-glib/int8-tensor.cpp @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: int8-tensor + * @short_description: 8-bit integer tensor class + * + * #GArrowInt8Tensor is a class for 8-bit integer tensor. It can store + * zero or more 8-bit integer data. + */ + +G_DEFINE_TYPE(GArrowInt8Tensor, \ + garrow_int8_tensor, \ + GARROW_TYPE_TENSOR) + +static void +garrow_int8_tensor_init(GArrowInt8Tensor *object) +{ +} + +static void +garrow_int8_tensor_class_init(GArrowInt8TensorClass *klass) +{ +} + +/** + * garrow_int8_tensor_new: + * @data: A #GArrowBuffer that contains tensor data. 
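+ *   With no @strides, the data is interpreted as row-major (C order).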
+ * @shape: (array length=n_dimensions): A list of dimension sizes. + * @n_dimensions: The number of dimensions. + * @strides: (array length=n_strides) (nullable): A list of the number of + * bytes in each dimension. + * @n_strides: The number of strides. + * @dimention_names: (array length=n_dimention_names) (nullable): A list of + * dimension names. + * @n_dimention_names: The number of dimension names + * + * Returns: The newly created #GArrowInt8Tensor. + * + * Since: 0.3.0 + */ +GArrowInt8Tensor * +garrow_int8_tensor_new(GArrowBuffer *data, + gint64 *shape, + gsize n_dimensions, + gint64 *strides, + gsize n_strides, + gchar **dimension_names, + gsize n_dimension_names) +{ + auto tensor = + garrow::numeric_tensor_new(data, + shape, + n_dimensions, + strides, + n_strides, + dimension_names, + n_dimension_names); + return GARROW_INT8_TENSOR(tensor); +} + +/** + * garrow_int8_tensor_get_raw_data: + * @tensor: A #GArrowInt8Tensor. + * @n_data: (out): The number of data. + * + * Returns: (array length=n_data): The raw data in the tensor. + * + * Since: 0.3.0 + */ +const gint8 * +garrow_int8_tensor_get_raw_data(GArrowInt8Tensor *tensor, + gint64 *n_data) +{ + return garrow::numeric_tensor_get_raw_data(GARROW_TENSOR(tensor), + n_data); +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/int8-tensor.h b/c_glib/arrow-glib/int8-tensor.h new file mode 100644 index 0000000000000..76ed3c8d7a7ee --- /dev/null +++ b/c_glib/arrow-glib/int8-tensor.h @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_INT8_TENSOR \ + (garrow_int8_tensor_get_type()) +#define GARROW_INT8_TENSOR(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INT8_TENSOR, \ + GArrowInt8Tensor)) +#define GARROW_INT8_TENSOR_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INT8_TENSOR, \ + GArrowInt8TensorClass)) +#define GARROW_IS_INT8_TENSOR(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_INT8_TENSOR)) +#define GARROW_IS_INT8_TENSOR_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INT8_TENSOR)) +#define GARROW_INT8_TENSOR_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INT8_TENSOR, \ + GArrowInt8TensorClass)) + +typedef struct _GArrowInt8Tensor GArrowInt8Tensor; +typedef struct _GArrowInt8TensorClass GArrowInt8TensorClass; + +/** + * GArrowInt8Tensor: + * + * It wraps `arrow::Int8Tensor`. 
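+ * (`arrow::Int8Tensor` is the typed tensor container for 8-bit signed
+ * integers.)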
+ */ +struct _GArrowInt8Tensor +{ + /*< private >*/ + GArrowTensor parent_instance; +}; + +struct _GArrowInt8TensorClass +{ + GArrowTensorClass parent_class; +}; + +GType garrow_int8_tensor_get_type(void) G_GNUC_CONST; + +GArrowInt8Tensor *garrow_int8_tensor_new(GArrowBuffer *data, + gint64 *shape, + gsize n_dimensions, + gint64 *strides, + gsize n_strides, + gchar **dimention_names, + gsize n_dimention_names); + +const gint8 *garrow_int8_tensor_get_raw_data(GArrowInt8Tensor *tensor, + gint64 *n_data); + +G_END_DECLS diff --git a/c_glib/arrow-glib/numeric-tensor.hpp b/c_glib/arrow-glib/numeric-tensor.hpp new file mode 100644 index 0000000000000..07cea62bd7b25 --- /dev/null +++ b/c_glib/arrow-glib/numeric-tensor.hpp @@ -0,0 +1,64 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +namespace garrow { + template + GArrowTensor *numeric_tensor_new(GArrowBuffer *data, + gint64 *shape, + gsize n_dimensions, + gint64 *strides, + gsize n_strides, + gchar **dimention_names, + gsize n_dimention_names) { + auto arrow_data = garrow_buffer_get_raw(data); + std::vector arrow_shape; + for (gsize i = 0; i < n_dimensions; ++i) { + arrow_shape.push_back(shape[i]); + } + std::vector arrow_strides; + for (gsize i = 0; i < n_strides; ++i) { + arrow_strides.push_back(strides[i]); + } + std::vector arrow_dimention_names; + for (gsize i = 0; i < n_dimention_names; ++i) { + arrow_dimention_names.push_back(dimention_names[i]); + } + auto arrow_numeric_tensor = + std::make_shared(arrow_data, + arrow_shape, + arrow_strides, + arrow_dimention_names); + std::shared_ptr arrow_tensor = arrow_numeric_tensor; + auto tensor = garrow_tensor_new_raw(&arrow_tensor); + return tensor; + } + + template + const value_type *numeric_tensor_get_raw_data(GArrowTensor *tensor, + gint64 *n_data) { + auto arrow_tensor = garrow_tensor_get_raw(tensor); + auto arrow_numeric_tensor = static_cast(arrow_tensor.get()); + *n_data = arrow_numeric_tensor->size(); + return arrow_numeric_tensor->raw_data(); + } +} diff --git a/c_glib/arrow-glib/tensor.cpp b/c_glib/arrow-glib/tensor.cpp new file mode 100644 index 0000000000000..cbc9d8e31fe9d --- /dev/null +++ b/c_glib/arrow-glib/tensor.cpp @@ -0,0 +1,390 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include +#include +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: tensor + * @short_description: Base class for all tensor classes + * + * #GArrowTensor is a base class for all tensor classes such as + * #GArrowInt8Tensor. + * #GArrowBooleanTensorBuilder to create a new tensor. + * + * Since: 0.3.0 + */ + +typedef struct GArrowTensorPrivate_ { + std::shared_ptr tensor; +} GArrowTensorPrivate; + +enum { + PROP_0, + PROP_TENSOR +}; + +G_DEFINE_TYPE_WITH_PRIVATE(GArrowTensor, garrow_tensor, G_TYPE_OBJECT) + +#define GARROW_TENSOR_GET_PRIVATE(obj) \ + (G_TYPE_INSTANCE_GET_PRIVATE((obj), GARROW_TYPE_TENSOR, GArrowTensorPrivate)) + +static void +garrow_tensor_finalize(GObject *object) +{ + auto priv = GARROW_TENSOR_GET_PRIVATE(object); + + priv->tensor = nullptr; + + G_OBJECT_CLASS(garrow_tensor_parent_class)->finalize(object); +} + +static void +garrow_tensor_set_property(GObject *object, + guint prop_id, + const GValue *value, + GParamSpec *pspec) +{ + auto priv = GARROW_TENSOR_GET_PRIVATE(object); + + switch (prop_id) { + case PROP_TENSOR: + priv->tensor = + *static_cast *>(g_value_get_pointer(value)); + break; + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_tensor_get_property(GObject *object, + guint prop_id, + GValue *value, + GParamSpec *pspec) +{ + switch (prop_id) { + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_tensor_init(GArrowTensor *object) +{ +} + +static void +garrow_tensor_class_init(GArrowTensorClass *klass) +{ + GParamSpec *spec; + + auto gobject_class = G_OBJECT_CLASS(klass); + + gobject_class->finalize = garrow_tensor_finalize; + gobject_class->set_property = garrow_tensor_set_property; + gobject_class->get_property = garrow_tensor_get_property; + + spec = g_param_spec_pointer("tensor", + "Tensor", + "The raw std::shared *", + static_cast(G_PARAM_WRITABLE | + G_PARAM_CONSTRUCT_ONLY)); + g_object_class_install_property(gobject_class, PROP_TENSOR, spec); +} + +/** + * garrow_tensor_get_value_data_type: + * @tensor: A #GArrowTensor. + * + * Returns: (transfer full): The data type of each value in the tensor. + * + * Since: 0.3.0 + */ +GArrowDataType * +garrow_tensor_get_value_data_type(GArrowTensor *tensor) +{ + auto arrow_tensor = garrow_tensor_get_raw(tensor); + auto arrow_data_type = arrow_tensor->type(); + return garrow_data_type_new_raw(&arrow_data_type); +} + +/** + * garrow_tensor_get_value_type: + * @tensor: A #GArrowTensor. + * + * Returns: The type of each value in the tensor. + * + * Since: 0.3.0 + */ +GArrowType +garrow_tensor_get_value_type(GArrowTensor *tensor) +{ + auto arrow_tensor = garrow_tensor_get_raw(tensor); + auto arrow_type = arrow_tensor->type_id(); + return garrow_type_from_raw(arrow_type); +} + +/** + * garrow_tensor_get_buffer: + * @tensor: A #GArrowTensor. + * + * Returns: (transfer full): The data of the tensor. 
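+ * No data is copied; the returned buffer shares the tensor's storage.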
+ * + * Since: 0.3.0 + */ +GArrowBuffer * +garrow_tensor_get_buffer(GArrowTensor *tensor) +{ + auto arrow_tensor = garrow_tensor_get_raw(tensor); + auto arrow_buffer = arrow_tensor->data(); + return garrow_buffer_new_raw(&arrow_buffer); +} + +/** + * garrow_tensor_get_shape: + * @tensor: A #GArrowTensor. + * @n_dimensions: (out): The number of dimensions. + * + * Returns: (array length=n_dimensions): The shape of the tensor. + * It should be freed with g_free() when no longer needed. + * + * Since: 0.3.0 + */ +gint64 * +garrow_tensor_get_shape(GArrowTensor *tensor, gint *n_dimensions) +{ + auto arrow_tensor = garrow_tensor_get_raw(tensor); + auto arrow_shape = arrow_tensor->shape(); + auto n_dimensions_raw = arrow_shape.size(); + auto shape = + static_cast(g_malloc_n(sizeof(gint64), n_dimensions_raw)); + for (gsize i = 0; i < n_dimensions_raw; ++i) { + shape[i] = arrow_shape[i]; + } + *n_dimensions = static_cast(n_dimensions_raw); + return shape; +} + +/** + * garrow_tensor_get_strides: + * @tensor: A #GArrowTensor. + * @n_strides: (out): The number of strides. + * + * Returns: (array length=n_strides): The strides of the tensor. + * It should be freed with g_free() when no longer needed. + * + * Since: 0.3.0 + */ +gint64 * +garrow_tensor_get_strides(GArrowTensor *tensor, gint *n_strides) +{ + auto arrow_tensor = garrow_tensor_get_raw(tensor); + auto arrow_strides = arrow_tensor->strides(); + auto n_strides_raw = arrow_strides.size(); + auto strides = + static_cast(g_malloc_n(sizeof(gint64), n_strides_raw)); + for (gsize i = 0; i < n_strides_raw; ++i) { + strides[i] = arrow_strides[i]; + } + *n_strides = static_cast(n_strides_raw); + return strides; +} + +/** + * garrow_tensor_get_n_dimensions: + * @tensor: A #GArrowTensor. + * + * Returns: The number of dimensions of the tensor. + * + * Since: 0.3.0 + */ +gint +garrow_tensor_get_n_dimensions(GArrowTensor *tensor) +{ + auto arrow_tensor = garrow_tensor_get_raw(tensor); + return arrow_tensor->ndim(); +} + +/** + * garrow_tensor_get_dimension_name: + * @tensor: A #GArrowTensor. + * @i: The index of the target dimension. + * + * Returns: The i-th dimension name of the tensor. + * + * Since: 0.3.0 + */ +const gchar * +garrow_tensor_get_dimension_name(GArrowTensor *tensor, gint i) +{ + auto arrow_tensor = garrow_tensor_get_raw(tensor); + auto arrow_dimension_name = arrow_tensor->dim_name(i); + return arrow_dimension_name.c_str(); +} + +/** + * garrow_tensor_get_size: + * @tensor: A #GArrowTensor. + * + * Returns: The number of value cells in the tensor. + * + * Since: 0.3.0 + */ +gint64 +garrow_tensor_get_size(GArrowTensor *tensor) +{ + auto arrow_tensor = garrow_tensor_get_raw(tensor); + return arrow_tensor->size(); +} + +/** + * garrow_tensor_is_mutable: + * @tensor: A #GArrowTensor. + * + * Returns: %TRUE if the tensor is mutable, %FALSE otherwise. + * + * Since: 0.3.0 + */ +gboolean +garrow_tensor_is_mutable(GArrowTensor *tensor) +{ + auto arrow_tensor = garrow_tensor_get_raw(tensor); + return arrow_tensor->is_mutable(); +} + +/** + * garrow_tensor_is_contiguous: + * @tensor: A #GArrowTensor. + * + * Returns: %TRUE if the tensor is contiguous, %FALSE otherwise. + * + * Since: 0.3.0 + */ +gboolean +garrow_tensor_is_contiguous(GArrowTensor *tensor) +{ + auto arrow_tensor = garrow_tensor_get_raw(tensor); + return arrow_tensor->is_contiguous(); +} + +/** + * garrow_tensor_is_row_major: + * @tensor: A #GArrowTensor. + * + * Returns: %TRUE if the tensor is row major a.k.a. C order, + * %FALSE otherwise. 
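+ * (Row major means the last dimension varies fastest in memory.)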
+ * + * Since: 0.3.0 + */ +gboolean +garrow_tensor_is_row_major(GArrowTensor *tensor) +{ + auto arrow_tensor = garrow_tensor_get_raw(tensor); + return arrow_tensor->is_row_major(); +} + +/** + * garrow_tensor_is_column_major: + * @tensor: A #GArrowTensor. + * + * Returns: %TRUE if the tensor is column major a.k.a. Fortran order, + * %FALSE otherwise. + * + * Since: 0.3.0 + */ +gboolean +garrow_tensor_is_column_major(GArrowTensor *tensor) +{ + auto arrow_tensor = garrow_tensor_get_raw(tensor); + return arrow_tensor->is_column_major(); +} + +G_END_DECLS + +GArrowTensor * +garrow_tensor_new_raw(std::shared_ptr *arrow_tensor) +{ + GType type; + GArrowTensor *tensor; + + switch ((*arrow_tensor)->type_id()) { + case arrow::Type::type::UINT8: + type = GARROW_TYPE_UINT8_TENSOR; + break; + case arrow::Type::type::INT8: + type = GARROW_TYPE_INT8_TENSOR; + break; +/* + case arrow::Type::type::UINT16: + type = GARROW_TYPE_UINT16_TENSOR; + break; + case arrow::Type::type::INT16: + type = GARROW_TYPE_INT16_TENSOR; + break; + case arrow::Type::type::UINT32: + type = GARROW_TYPE_UINT32_TENSOR; + break; + case arrow::Type::type::INT32: + type = GARROW_TYPE_INT32_TENSOR; + break; + case arrow::Type::type::UINT64: + type = GARROW_TYPE_UINT64_TENSOR; + break; + case arrow::Type::type::INT64: + type = GARROW_TYPE_INT64_TENSOR; + break; + case arrow::Type::type::HALF_FLOAT: + type = GARROW_TYPE_HALF_FLOAT_TENSOR; + break; + case arrow::Type::type::FLOAT: + type = GARROW_TYPE_FLOAT_TENSOR; + break; + case arrow::Type::type::DOUBLE: + type = GARROW_TYPE_DOUBLE_TENSOR; + break; +*/ + default: + type = GARROW_TYPE_TENSOR; + break; + } + tensor = GARROW_TENSOR(g_object_new(type, + "tensor", arrow_tensor, + NULL)); + return tensor; +} + +std::shared_ptr +garrow_tensor_get_raw(GArrowTensor *tensor) +{ + auto priv = GARROW_TENSOR_GET_PRIVATE(tensor); + return priv->tensor; +} diff --git a/c_glib/arrow-glib/tensor.h b/c_glib/arrow-glib/tensor.h new file mode 100644 index 0000000000000..bedc80324f581 --- /dev/null +++ b/c_glib/arrow-glib/tensor.h @@ -0,0 +1,77 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +#pragma once + +#include +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_TENSOR \ + (garrow_tensor_get_type()) +#define GARROW_TENSOR(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), GARROW_TYPE_TENSOR, GArrowTensor)) +#define GARROW_TENSOR_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), GARROW_TYPE_TENSOR, GArrowTensorClass)) +#define GARROW_IS_TENSOR(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), GARROW_TYPE_TENSOR)) +#define GARROW_IS_TENSOR_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), GARROW_TYPE_TENSOR)) +#define GARROW_TENSOR_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), GARROW_TYPE_TENSOR, GArrowTensorClass)) + +typedef struct _GArrowTensor GArrowTensor; +typedef struct _GArrowTensorClass GArrowTensorClass; + +/** + * GArrowTensor: + * + * It wraps `arrow::Tensor`. + */ +struct _GArrowTensor +{ + /*< private >*/ + GObject parent_instance; +}; + +struct _GArrowTensorClass +{ + GObjectClass parent_class; +}; + +GType garrow_tensor_get_type (void) G_GNUC_CONST; + +GArrowDataType *garrow_tensor_get_value_data_type(GArrowTensor *tensor); +GArrowType garrow_tensor_get_value_type (GArrowTensor *tensor); +GArrowBuffer *garrow_tensor_get_buffer (GArrowTensor *tensor); +gint64 *garrow_tensor_get_shape (GArrowTensor *tensor, + gint *n_dimensions); +gint64 *garrow_tensor_get_strides (GArrowTensor *tensor, + gint *n_strides); +gint garrow_tensor_get_n_dimensions (GArrowTensor *tensor); +const gchar *garrow_tensor_get_dimension_name (GArrowTensor *tensor, + gint i); +gint64 garrow_tensor_get_size (GArrowTensor *tensor); +gboolean garrow_tensor_is_mutable (GArrowTensor *tensor); +gboolean garrow_tensor_is_contiguous (GArrowTensor *tensor); +gboolean garrow_tensor_is_row_major (GArrowTensor *tensor); +gboolean garrow_tensor_is_column_major (GArrowTensor *tensor); + +G_END_DECLS diff --git a/c_glib/arrow-glib/tensor.hpp b/c_glib/arrow-glib/tensor.hpp new file mode 100644 index 0000000000000..392aeeebb6d2c --- /dev/null +++ b/c_glib/arrow-glib/tensor.hpp @@ -0,0 +1,27 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include + +#include + +GArrowTensor *garrow_tensor_new_raw(std::shared_ptr *arrow_tensor); +std::shared_ptr garrow_tensor_get_raw(GArrowTensor *tensor); diff --git a/c_glib/arrow-glib/uint8-tensor.cpp b/c_glib/arrow-glib/uint8-tensor.cpp new file mode 100644 index 0000000000000..69f0f694530ad --- /dev/null +++ b/c_glib/arrow-glib/uint8-tensor.cpp @@ -0,0 +1,105 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. 
The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#ifdef HAVE_CONFIG_H +# include +#endif + +#include +#include +#include + +G_BEGIN_DECLS + +/** + * SECTION: uint8-tensor + * @short_description: 8-bit unsigned integer tensor class + * + * #GArrowUint8Tensor is a class for 8-bit unsigned integer tensor. It + * can store zero or more 8-bit integer data. + */ + +G_DEFINE_TYPE(GArrowUInt8Tensor, \ + garrow_uint8_tensor, \ + GARROW_TYPE_TENSOR) + +static void +garrow_uint8_tensor_init(GArrowUInt8Tensor *object) +{ +} + +static void +garrow_uint8_tensor_class_init(GArrowUInt8TensorClass *klass) +{ +} + +/** + * garrow_uint8_tensor_new: + * @data: A #GArrowBuffer that contains tensor data. + * @shape: (array length=n_dimensions): A list of dimension sizes. + * @n_dimensions: The number of dimensions. + * @strides: (array length=n_strides) (nullable): A list of the number of + * bytes in each dimension. + * @n_strides: The number of strides. + * @dimention_names: (array length=n_dimention_names) (nullable): A list of + * dimension names. + * @n_dimention_names: The number of dimension names + * + * Returns: The newly created #GArrowUInt8Tensor. + * + * Since: 0.3.0 + */ +GArrowUInt8Tensor * +garrow_uint8_tensor_new(GArrowBuffer *data, + gint64 *shape, + gsize n_dimensions, + gint64 *strides, + gsize n_strides, + gchar **dimension_names, + gsize n_dimension_names) +{ + auto tensor = + garrow::numeric_tensor_new(data, + shape, + n_dimensions, + strides, + n_strides, + dimension_names, + n_dimension_names); + return GARROW_UINT8_TENSOR(tensor); +} + +/** + * garrow_uint8_tensor_get_raw_data: + * @tensor: A #GArrowUInt8Tensor. + * @n_data: (out): The number of data. + * + * Returns: (array length=n_data): The raw data in the tensor. + * + * Since: 0.3.0 + */ +const guint8 * +garrow_uint8_tensor_get_raw_data(GArrowUInt8Tensor *tensor, + gint64 *n_data) +{ + return garrow::numeric_tensor_get_raw_data(GARROW_TENSOR(tensor), + n_data); +} + +G_END_DECLS diff --git a/c_glib/arrow-glib/uint8-tensor.h b/c_glib/arrow-glib/uint8-tensor.h new file mode 100644 index 0000000000000..248c507b4f646 --- /dev/null +++ b/c_glib/arrow-glib/uint8-tensor.h @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. 
+ */ + +#pragma once + +#include + +G_BEGIN_DECLS + +#define GARROW_TYPE_UINT8_TENSOR \ + (garrow_uint8_tensor_get_type()) +#define GARROW_UINT8_TENSOR(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT8_TENSOR, \ + GArrowUInt8Tensor)) +#define GARROW_UINT8_TENSOR_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT8_TENSOR, \ + GArrowUInt8TensorClass)) +#define GARROW_IS_UINT8_TENSOR(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT8_TENSOR)) +#define GARROW_IS_UINT8_TENSOR_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT8_TENSOR)) +#define GARROW_UINT8_TENSOR_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT8_TENSOR, \ + GArrowUInt8TensorClass)) + +typedef struct _GArrowUInt8Tensor GArrowUInt8Tensor; +typedef struct _GArrowUInt8TensorClass GArrowUInt8TensorClass; + +/** + * GArrowUInt8Tensor: + * + * It wraps `arrow::UInt8Tensor`. + */ +struct _GArrowUInt8Tensor +{ + /*< private >*/ + GArrowTensor parent_instance; +}; + +struct _GArrowUInt8TensorClass +{ + GArrowTensorClass parent_class; +}; + +GType garrow_uint8_tensor_get_type(void) G_GNUC_CONST; + +GArrowUInt8Tensor *garrow_uint8_tensor_new(GArrowBuffer *data, + gint64 *shape, + gsize n_dimensions, + gint64 *strides, + gsize n_strides, + gchar **dimention_names, + gsize n_dimention_names); + +const guint8 *garrow_uint8_tensor_get_raw_data(GArrowUInt8Tensor *tensor, + gint64 *n_data); + +G_END_DECLS diff --git a/c_glib/test/helper/omittable.rb b/c_glib/test/helper/omittable.rb new file mode 100644 index 0000000000000..a16ad32485e15 --- /dev/null +++ b/c_glib/test/helper/omittable.rb @@ -0,0 +1,28 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +module Helper + module Omittable + def require_gi(major, minor, micro) + return if GLib.check_binding_version?(major, minor, micro) + message = + "Require gobject-introspection #{major}.#{minor}.#{micro} or later: " + + GLib::BINDING_VERSION.join(".") + omit(message) + end + end +end diff --git a/c_glib/test/run-test.rb b/c_glib/test/run-test.rb index 53805caef374f..50f548f3f5b3b 100755 --- a/c_glib/test/run-test.rb +++ b/c_glib/test/run-test.rb @@ -35,5 +35,6 @@ require "tempfile" require_relative "helper/buildable" +require_relative "helper/omittable" exit(Test::Unit::AutoRunner.run(true, test_dir.to_s)) diff --git a/c_glib/test/test-int8-tensor.rb b/c_glib/test/test-int8-tensor.rb new file mode 100644 index 0000000000000..a96a4076b5a8e --- /dev/null +++ b/c_glib/test/test-int8-tensor.rb @@ -0,0 +1,43 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. 
The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestInt8Tensor < Test::Unit::TestCase + include Helper::Omittable + + def setup + @raw_data = [ + 1, 2, + 3, 4, + + 5, 6, + 7, 8, + + 9, 10, + 11, 12, + ] + data = Arrow::Buffer.new(@raw_data.pack("c*")) + shape = [3, 2, 2] + strides = [] + names = [] + @tensor = Arrow::Int8Tensor.new(data, shape, strides, names) + end + + def test_raw_data + require_gi(3, 1, 2) + assert_equal(@raw_data, @tensor.raw_data) + end +end diff --git a/c_glib/test/test-tensor.rb b/c_glib/test/test-tensor.rb new file mode 100644 index 0000000000000..455b0d9d90acb --- /dev/null +++ b/c_glib/test/test-tensor.rb @@ -0,0 +1,100 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestTensor < Test::Unit::TestCase + include Helper::Omittable + + def setup + @raw_data = [ + 1, 2, + 3, 4, + + 5, 6, + 7, 8, + + 9, 10, + 11, 12, + ] + data = Arrow::Buffer.new(@raw_data.pack("c*")) + @shape = [3, 2, 2] + strides = [] + names = ["a", "b", "c"] + @tensor = Arrow::Int8Tensor.new(data, @shape, strides, names) + end + + def test_value_data_type + assert_equal(Arrow::Int8DataType, @tensor.value_data_type.class) + end + + def test_value_type + assert_equal(Arrow::Type::INT8, @tensor.value_type) + end + + def test_buffer + assert_equal(@raw_data, @tensor.buffer.data) + end + + def test_shape + require_gi(3, 1, 2) + assert_equal(@shape, @tensor.shape) + end + + def test_strides + require_gi(3, 1, 2) + assert_equal([4, 2, 1], @tensor.strides) + end + + def test_n_dimensions + assert_equal(@shape.size, @tensor.n_dimensions) + end + + def test_dimension_name + dimension_names = @tensor.n_dimensions.times.collect do |i| + @tensor.get_dimension_name(i) + end + assert_equal(["a", "b", "c"], + dimension_names) + end + + def test_size + assert_equal(@raw_data.size, @tensor.size) + end + + def test_mutable? + assert do + not @tensor.mutable? + end + end + + def test_contiguous? + assert do + @tensor.contiguous? + end + end + + def test_row_major? + assert do + @tensor.row_major? + end + end + + def test_column_major? + assert do + not @tensor.column_major? 
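+      # Empty strides default to row-major (C order) layout,
+      # so the tensor cannot be column-major (Fortran order).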
+ end + end +end diff --git a/c_glib/test/test-uint8-tensor.rb b/c_glib/test/test-uint8-tensor.rb new file mode 100644 index 0000000000000..0fe758ba676cc --- /dev/null +++ b/c_glib/test/test-uint8-tensor.rb @@ -0,0 +1,43 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestUInt8Tensor < Test::Unit::TestCase + include Helper::Omittable + + def setup + @raw_data = [ + 1, 2, + 3, 4, + + 5, 6, + 7, 8, + + 9, 10, + 11, 12, + ] + data = Arrow::Buffer.new(@raw_data.pack("c*")) + shape = [3, 2, 2] + strides = [] + names = [] + @tensor = Arrow::UInt8Tensor.new(data, shape, strides, names) + end + + def test_raw_data + require_gi(3, 1, 2) + assert_equal(@raw_data, @tensor.raw_data) + end +end From a68f31b0f3f2c094c5d6660a2d936baa05da3103 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 20 Apr 2017 09:36:21 +0200 Subject: [PATCH 0542/1644] ARROW-860: [C++] Remove typed Tensor containers cc @kou for opinions -- this patch breaks glib for the moment. Since tensors are all fixed width types, there's less reason to have strongly-typed containers for them (unlike the `arrow::Array` subclasses, where ListArray is quite different from Int8Array). 
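For illustration, a minimal sketch of working with a tensor through the base
class only (a hypothetical helper, not from this patch; it assumes contiguous
storage and uses only base `arrow::Tensor` members that appear elsewhere in
this commit):

```cpp
#include <cstdint>

#include "arrow/buffer.h"
#include "arrow/tensor.h"
#include "arrow/type.h"

// Hypothetical example: sum the values of an INT8 tensor without a
// typed container, checking the runtime type id instead of relying on
// a strongly-typed subclass. Assumes row-major, contiguous storage.
int64_t SumInt8(const arrow::Tensor& tensor) {
  if (tensor.type_id() != arrow::Type::INT8 || !tensor.is_contiguous()) {
    return 0;  // a real helper would return an arrow::Status instead
  }
  const auto* values = reinterpret_cast<const int8_t*>(tensor.data()->data());
  int64_t sum = 0;
  for (int64_t i = 0; i < tensor.size(); ++i) {
    sum += values[i];
  }
  return sum;
}
```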
My view is that if the visitor pattern needs to be employed, we can do it using the `type()` member on the tensor (which also provides compile-time access to `TypeClass::c_type` if needed) Author: Wes McKinney Author: Kouhei Sutou Closes #571 from wesm/ARROW-860 and squashes the following commits: fe0b4d8 [Kouhei Sutou] Remove typed Tensors from glib 357f441 [Wes McKinney] Remove typed Tensor containers --- c_glib/arrow-glib/Makefile.am | 5 -- c_glib/arrow-glib/arrow-glib.h | 2 - c_glib/arrow-glib/int8-tensor.cpp | 105 ----------------------- c_glib/arrow-glib/int8-tensor.h | 79 ----------------- c_glib/arrow-glib/numeric-tensor.hpp | 64 -------------- c_glib/arrow-glib/tensor.cpp | 103 +++++++++++----------- c_glib/arrow-glib/tensor.h | 8 ++ c_glib/arrow-glib/uint8-tensor.cpp | 105 ----------------------- c_glib/arrow-glib/uint8-tensor.h | 79 ----------------- c_glib/test/test-int8-tensor.rb | 43 ---------- c_glib/test/test-tensor.rb | 6 +- c_glib/test/test-uint8-tensor.rb | 43 ---------- cpp/src/arrow/compare.cc | 37 ++------ cpp/src/arrow/ipc/ipc-read-write-test.cc | 6 +- cpp/src/arrow/ipc/reader.cc | 3 +- cpp/src/arrow/python/numpy_convert.cc | 3 +- cpp/src/arrow/tensor-test.cc | 14 +-- cpp/src/arrow/tensor.cc | 65 -------------- cpp/src/arrow/tensor.h | 47 +--------- cpp/src/arrow/visitor_inline.h | 25 ------ 20 files changed, 94 insertions(+), 748 deletions(-) delete mode 100644 c_glib/arrow-glib/int8-tensor.cpp delete mode 100644 c_glib/arrow-glib/int8-tensor.h delete mode 100644 c_glib/arrow-glib/numeric-tensor.hpp delete mode 100644 c_glib/arrow-glib/uint8-tensor.cpp delete mode 100644 c_glib/arrow-glib/uint8-tensor.h delete mode 100644 c_glib/test/test-int8-tensor.rb delete mode 100644 c_glib/test/test-uint8-tensor.rb diff --git a/c_glib/arrow-glib/Makefile.am b/c_glib/arrow-glib/Makefile.am index fbfe3a4071000..11b6508df0745 100644 --- a/c_glib/arrow-glib/Makefile.am +++ b/c_glib/arrow-glib/Makefile.am @@ -65,7 +65,6 @@ libarrow_glib_la_headers = \ int8-array.h \ int8-array-builder.h \ int8-data-type.h \ - int8-tensor.h \ int16-array.h \ int16-array-builder.h \ int16-data-type.h \ @@ -94,7 +93,6 @@ libarrow_glib_la_headers = \ uint8-array.h \ uint8-array-builder.h \ uint8-data-type.h \ - uint8-tensor.h \ uint16-array.h \ uint16-array-builder.h \ uint16-data-type.h \ @@ -155,7 +153,6 @@ libarrow_glib_la_sources = \ int8-array.cpp \ int8-array-builder.cpp \ int8-data-type.cpp \ - int8-tensor.cpp \ int16-array.cpp \ int16-array-builder.cpp \ int16-data-type.cpp \ @@ -184,7 +181,6 @@ libarrow_glib_la_sources = \ uint8-array.cpp \ uint8-array-builder.cpp \ uint8-data-type.cpp \ - uint8-tensor.cpp \ uint16-array.cpp \ uint16-array-builder.cpp \ uint16-data-type.cpp \ @@ -226,7 +222,6 @@ libarrow_glib_la_cpp_headers = \ data-type.hpp \ error.hpp \ field.hpp \ - numeric-tensor.hpp \ record-batch.hpp \ schema.hpp \ table.hpp \ diff --git a/c_glib/arrow-glib/arrow-glib.h b/c_glib/arrow-glib/arrow-glib.h index eec9e25ebf690..8d9bfe2da9c38 100644 --- a/c_glib/arrow-glib/arrow-glib.h +++ b/c_glib/arrow-glib/arrow-glib.h @@ -42,7 +42,6 @@ #include #include #include -#include #include #include #include @@ -71,7 +70,6 @@ #include #include #include -#include #include #include #include diff --git a/c_glib/arrow-glib/int8-tensor.cpp b/c_glib/arrow-glib/int8-tensor.cpp deleted file mode 100644 index 06521a00997c0..0000000000000 --- a/c_glib/arrow-glib/int8-tensor.cpp +++ /dev/null @@ -1,105 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license 
agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: int8-tensor - * @short_description: 8-bit integer tensor class - * - * #GArrowInt8Tensor is a class for 8-bit integer tensor. It can store - * zero or more 8-bit integer data. - */ - -G_DEFINE_TYPE(GArrowInt8Tensor, \ - garrow_int8_tensor, \ - GARROW_TYPE_TENSOR) - -static void -garrow_int8_tensor_init(GArrowInt8Tensor *object) -{ -} - -static void -garrow_int8_tensor_class_init(GArrowInt8TensorClass *klass) -{ -} - -/** - * garrow_int8_tensor_new: - * @data: A #GArrowBuffer that contains tensor data. - * @shape: (array length=n_dimensions): A list of dimension sizes. - * @n_dimensions: The number of dimensions. - * @strides: (array length=n_strides) (nullable): A list of the number of - * bytes in each dimension. - * @n_strides: The number of strides. - * @dimention_names: (array length=n_dimention_names) (nullable): A list of - * dimension names. - * @n_dimention_names: The number of dimension names - * - * Returns: The newly created #GArrowInt8Tensor. - * - * Since: 0.3.0 - */ -GArrowInt8Tensor * -garrow_int8_tensor_new(GArrowBuffer *data, - gint64 *shape, - gsize n_dimensions, - gint64 *strides, - gsize n_strides, - gchar **dimension_names, - gsize n_dimension_names) -{ - auto tensor = - garrow::numeric_tensor_new(data, - shape, - n_dimensions, - strides, - n_strides, - dimension_names, - n_dimension_names); - return GARROW_INT8_TENSOR(tensor); -} - -/** - * garrow_int8_tensor_get_raw_data: - * @tensor: A #GArrowInt8Tensor. - * @n_data: (out): The number of data. - * - * Returns: (array length=n_data): The raw data in the tensor. - * - * Since: 0.3.0 - */ -const gint8 * -garrow_int8_tensor_get_raw_data(GArrowInt8Tensor *tensor, - gint64 *n_data) -{ - return garrow::numeric_tensor_get_raw_data(GARROW_TENSOR(tensor), - n_data); -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/int8-tensor.h b/c_glib/arrow-glib/int8-tensor.h deleted file mode 100644 index 76ed3c8d7a7ee..0000000000000 --- a/c_glib/arrow-glib/int8-tensor.h +++ /dev/null @@ -1,79 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. 
See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_INT8_TENSOR \ - (garrow_int8_tensor_get_type()) -#define GARROW_INT8_TENSOR(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_INT8_TENSOR, \ - GArrowInt8Tensor)) -#define GARROW_INT8_TENSOR_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_INT8_TENSOR, \ - GArrowInt8TensorClass)) -#define GARROW_IS_INT8_TENSOR(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_INT8_TENSOR)) -#define GARROW_IS_INT8_TENSOR_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_INT8_TENSOR)) -#define GARROW_INT8_TENSOR_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_INT8_TENSOR, \ - GArrowInt8TensorClass)) - -typedef struct _GArrowInt8Tensor GArrowInt8Tensor; -typedef struct _GArrowInt8TensorClass GArrowInt8TensorClass; - -/** - * GArrowInt8Tensor: - * - * It wraps `arrow::Int8Tensor`. - */ -struct _GArrowInt8Tensor -{ - /*< private >*/ - GArrowTensor parent_instance; -}; - -struct _GArrowInt8TensorClass -{ - GArrowTensorClass parent_class; -}; - -GType garrow_int8_tensor_get_type(void) G_GNUC_CONST; - -GArrowInt8Tensor *garrow_int8_tensor_new(GArrowBuffer *data, - gint64 *shape, - gsize n_dimensions, - gint64 *strides, - gsize n_strides, - gchar **dimention_names, - gsize n_dimention_names); - -const gint8 *garrow_int8_tensor_get_raw_data(GArrowInt8Tensor *tensor, - gint64 *n_data); - -G_END_DECLS diff --git a/c_glib/arrow-glib/numeric-tensor.hpp b/c_glib/arrow-glib/numeric-tensor.hpp deleted file mode 100644 index 07cea62bd7b25..0000000000000 --- a/c_glib/arrow-glib/numeric-tensor.hpp +++ /dev/null @@ -1,64 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#pragma once - -#include - -namespace garrow { - template - GArrowTensor *numeric_tensor_new(GArrowBuffer *data, - gint64 *shape, - gsize n_dimensions, - gint64 *strides, - gsize n_strides, - gchar **dimention_names, - gsize n_dimention_names) { - auto arrow_data = garrow_buffer_get_raw(data); - std::vector arrow_shape; - for (gsize i = 0; i < n_dimensions; ++i) { - arrow_shape.push_back(shape[i]); - } - std::vector arrow_strides; - for (gsize i = 0; i < n_strides; ++i) { - arrow_strides.push_back(strides[i]); - } - std::vector arrow_dimention_names; - for (gsize i = 0; i < n_dimention_names; ++i) { - arrow_dimention_names.push_back(dimention_names[i]); - } - auto arrow_numeric_tensor = - std::make_shared(arrow_data, - arrow_shape, - arrow_strides, - arrow_dimention_names); - std::shared_ptr arrow_tensor = arrow_numeric_tensor; - auto tensor = garrow_tensor_new_raw(&arrow_tensor); - return tensor; - } - - template - const value_type *numeric_tensor_get_raw_data(GArrowTensor *tensor, - gint64 *n_data) { - auto arrow_tensor = garrow_tensor_get_raw(tensor); - auto arrow_numeric_tensor = static_cast(arrow_tensor.get()); - *n_data = arrow_numeric_tensor->size(); - return arrow_numeric_tensor->raw_data(); - } -} diff --git a/c_glib/arrow-glib/tensor.cpp b/c_glib/arrow-glib/tensor.cpp index cbc9d8e31fe9d..468eb0729357b 100644 --- a/c_glib/arrow-glib/tensor.cpp +++ b/c_glib/arrow-glib/tensor.cpp @@ -23,10 +23,8 @@ #include #include -#include #include #include -#include G_BEGIN_DECLS @@ -121,6 +119,58 @@ garrow_tensor_class_init(GArrowTensorClass *klass) g_object_class_install_property(gobject_class, PROP_TENSOR, spec); } +/** + * garrow_tensor_new: + * @data_type: A #GArrowDataType that indicates each element type + * in the tensor. + * @data: A #GArrowBuffer that contains tensor data. + * @shape: (array length=n_dimensions): A list of dimension sizes. + * @n_dimensions: The number of dimensions. + * @strides: (array length=n_strides) (nullable): A list of the number of + * bytes in each dimension. + * @n_strides: The number of strides. + * @dimension_names: (array length=n_dimension_names) (nullable): A list of + * dimension names. + * @n_dimension_names: The number of dimension names. + * + * Returns: The newly created #GArrowTensor. + * + * Since: 0.3.0 + */ +GArrowTensor * +garrow_tensor_new(GArrowDataType *data_type, + GArrowBuffer *data, + gint64 *shape, + gsize n_dimensions, + gint64 *strides, + gsize n_strides, + gchar **dimension_names, + gsize n_dimension_names) +{ + auto arrow_data_type = garrow_data_type_get_raw(data_type); + auto arrow_data = garrow_buffer_get_raw(data); + std::vector<int64_t> arrow_shape; + for (gsize i = 0; i < n_dimensions; ++i) { + arrow_shape.push_back(shape[i]); + } + std::vector<int64_t> arrow_strides; + for (gsize i = 0; i < n_strides; ++i) { + arrow_strides.push_back(strides[i]); + } + std::vector<std::string> arrow_dimension_names; + for (gsize i = 0; i < n_dimension_names; ++i) { + arrow_dimension_names.push_back(dimension_names[i]); + } + auto arrow_tensor = + std::make_shared<arrow::Tensor>(arrow_data_type, + arrow_data, + arrow_shape, + arrow_strides, + arrow_dimension_names); + auto tensor = garrow_tensor_new_raw(&arrow_tensor); + return tensor; +} + /** * garrow_tensor_get_value_data_type: * @tensor: A #GArrowTensor. 
@@ -333,52 +383,9 @@ G_END_DECLS GArrowTensor * garrow_tensor_new_raw(std::shared_ptr<arrow::Tensor> *arrow_tensor) { - GType type; - GArrowTensor *tensor; - - switch ((*arrow_tensor)->type_id()) { - case arrow::Type::type::UINT8: - type = GARROW_TYPE_UINT8_TENSOR; - break; - case arrow::Type::type::INT8: - type = GARROW_TYPE_INT8_TENSOR; - break; -/* - case arrow::Type::type::UINT16: - type = GARROW_TYPE_UINT16_TENSOR; - break; - case arrow::Type::type::INT16: - type = GARROW_TYPE_INT16_TENSOR; - break; - case arrow::Type::type::UINT32: - type = GARROW_TYPE_UINT32_TENSOR; - break; - case arrow::Type::type::INT32: - type = GARROW_TYPE_INT32_TENSOR; - break; - case arrow::Type::type::UINT64: - type = GARROW_TYPE_UINT64_TENSOR; - break; - case arrow::Type::type::INT64: - type = GARROW_TYPE_INT64_TENSOR; - break; - case arrow::Type::type::HALF_FLOAT: - type = GARROW_TYPE_HALF_FLOAT_TENSOR; - break; - case arrow::Type::type::FLOAT: - type = GARROW_TYPE_FLOAT_TENSOR; - break; - case arrow::Type::type::DOUBLE: - type = GARROW_TYPE_DOUBLE_TENSOR; - break; -*/ - default: - type = GARROW_TYPE_TENSOR; - break; - } - tensor = GARROW_TENSOR(g_object_new(type, - "tensor", arrow_tensor, - NULL)); + auto tensor = GARROW_TENSOR(g_object_new(GARROW_TYPE_TENSOR, + "tensor", arrow_tensor, + NULL)); return tensor; } diff --git a/c_glib/arrow-glib/tensor.h b/c_glib/arrow-glib/tensor.h index bedc80324f581..71c6b4e9031dd 100644 --- a/c_glib/arrow-glib/tensor.h +++ b/c_glib/arrow-glib/tensor.h @@ -58,6 +58,14 @@ struct _GArrowTensorClass GType garrow_tensor_get_type (void) G_GNUC_CONST; +GArrowTensor *garrow_tensor_new (GArrowDataType *data_type, + GArrowBuffer *data, + gint64 *shape, + gsize n_dimensions, + gint64 *strides, + gsize n_strides, + gchar **dimension_names, + gsize n_dimension_names); GArrowDataType *garrow_tensor_get_value_data_type(GArrowTensor *tensor); GArrowType garrow_tensor_get_value_type (GArrowTensor *tensor); GArrowBuffer *garrow_tensor_get_buffer (GArrowTensor *tensor); diff --git a/c_glib/arrow-glib/uint8-tensor.cpp b/c_glib/arrow-glib/uint8-tensor.cpp deleted file mode 100644 index 69f0f694530ad..0000000000000 --- a/c_glib/arrow-glib/uint8-tensor.cpp +++ /dev/null @@ -1,105 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: uint8-tensor - * @short_description: 8-bit unsigned integer tensor class - * - * #GArrowUint8Tensor is a class for 8-bit unsigned integer tensor. It - * can store zero or more 8-bit integer data. 
- */ - -G_DEFINE_TYPE(GArrowUInt8Tensor, \ - garrow_uint8_tensor, \ - GARROW_TYPE_TENSOR) - -static void -garrow_uint8_tensor_init(GArrowUInt8Tensor *object) -{ -} - -static void -garrow_uint8_tensor_class_init(GArrowUInt8TensorClass *klass) -{ -} - -/** - * garrow_uint8_tensor_new: - * @data: A #GArrowBuffer that contains tensor data. - * @shape: (array length=n_dimensions): A list of dimension sizes. - * @n_dimensions: The number of dimensions. - * @strides: (array length=n_strides) (nullable): A list of the number of - * bytes in each dimension. - * @n_strides: The number of strides. - * @dimention_names: (array length=n_dimention_names) (nullable): A list of - * dimension names. - * @n_dimention_names: The number of dimension names - * - * Returns: The newly created #GArrowUInt8Tensor. - * - * Since: 0.3.0 - */ -GArrowUInt8Tensor * -garrow_uint8_tensor_new(GArrowBuffer *data, - gint64 *shape, - gsize n_dimensions, - gint64 *strides, - gsize n_strides, - gchar **dimension_names, - gsize n_dimension_names) -{ - auto tensor = - garrow::numeric_tensor_new(data, - shape, - n_dimensions, - strides, - n_strides, - dimension_names, - n_dimension_names); - return GARROW_UINT8_TENSOR(tensor); -} - -/** - * garrow_uint8_tensor_get_raw_data: - * @tensor: A #GArrowUInt8Tensor. - * @n_data: (out): The number of data. - * - * Returns: (array length=n_data): The raw data in the tensor. - * - * Since: 0.3.0 - */ -const guint8 * -garrow_uint8_tensor_get_raw_data(GArrowUInt8Tensor *tensor, - gint64 *n_data) -{ - return garrow::numeric_tensor_get_raw_data(GARROW_TENSOR(tensor), - n_data); -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint8-tensor.h b/c_glib/arrow-glib/uint8-tensor.h deleted file mode 100644 index 248c507b4f646..0000000000000 --- a/c_glib/arrow-glib/uint8-tensor.h +++ /dev/null @@ -1,79 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_UINT8_TENSOR \ - (garrow_uint8_tensor_get_type()) -#define GARROW_UINT8_TENSOR(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_UINT8_TENSOR, \ - GArrowUInt8Tensor)) -#define GARROW_UINT8_TENSOR_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_UINT8_TENSOR, \ - GArrowUInt8TensorClass)) -#define GARROW_IS_UINT8_TENSOR(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_UINT8_TENSOR)) -#define GARROW_IS_UINT8_TENSOR_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_UINT8_TENSOR)) -#define GARROW_UINT8_TENSOR_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_UINT8_TENSOR, \ - GArrowUInt8TensorClass)) - -typedef struct _GArrowUInt8Tensor GArrowUInt8Tensor; -typedef struct _GArrowUInt8TensorClass GArrowUInt8TensorClass; - -/** - * GArrowUInt8Tensor: - * - * It wraps `arrow::UInt8Tensor`. - */ -struct _GArrowUInt8Tensor -{ - /*< private >*/ - GArrowTensor parent_instance; -}; - -struct _GArrowUInt8TensorClass -{ - GArrowTensorClass parent_class; -}; - -GType garrow_uint8_tensor_get_type(void) G_GNUC_CONST; - -GArrowUInt8Tensor *garrow_uint8_tensor_new(GArrowBuffer *data, - gint64 *shape, - gsize n_dimensions, - gint64 *strides, - gsize n_strides, - gchar **dimention_names, - gsize n_dimention_names); - -const guint8 *garrow_uint8_tensor_get_raw_data(GArrowUInt8Tensor *tensor, - gint64 *n_data); - -G_END_DECLS diff --git a/c_glib/test/test-int8-tensor.rb b/c_glib/test/test-int8-tensor.rb deleted file mode 100644 index a96a4076b5a8e..0000000000000 --- a/c_glib/test/test-int8-tensor.rb +++ /dev/null @@ -1,43 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. 
- -class TestInt8Tensor < Test::Unit::TestCase - include Helper::Omittable - - def setup - @raw_data = [ - 1, 2, - 3, 4, - - 5, 6, - 7, 8, - - 9, 10, - 11, 12, - ] - data = Arrow::Buffer.new(@raw_data.pack("c*")) - shape = [3, 2, 2] - strides = [] - names = [] - @tensor = Arrow::Int8Tensor.new(data, shape, strides, names) - end - - def test_raw_data - require_gi(3, 1, 2) - assert_equal(@raw_data, @tensor.raw_data) - end -end diff --git a/c_glib/test/test-tensor.rb b/c_glib/test/test-tensor.rb index 455b0d9d90acb..3e1f541cfd4b5 100644 --- a/c_glib/test/test-tensor.rb +++ b/c_glib/test/test-tensor.rb @@ -33,7 +33,11 @@ def setup @shape = [3, 2, 2] strides = [] names = ["a", "b", "c"] - @tensor = Arrow::Int8Tensor.new(data, @shape, strides, names) + @tensor = Arrow::Tensor.new(Arrow::Int8DataType.new, + data, + @shape, + strides, + names) end def test_value_data_type diff --git a/c_glib/test/test-uint8-tensor.rb b/c_glib/test/test-uint8-tensor.rb deleted file mode 100644 index 0fe758ba676cc..0000000000000 --- a/c_glib/test/test-uint8-tensor.rb +++ /dev/null @@ -1,43 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. 
- -class TestUInt8Tensor < Test::Unit::TestCase - include Helper::Omittable - - def setup - @raw_data = [ - 1, 2, - 3, 4, - - 5, 6, - 7, 8, - - 9, 10, - 11, 12, - ] - data = Arrow::Buffer.new(@raw_data.pack("c*")) - shape = [3, 2, 2] - strides = [] - names = [] - @tensor = Arrow::UInt8Tensor.new(data, shape, strides, names) - end - - def test_raw_data - require_gi(3, 1, 2) - assert_equal(@raw_data, @tensor.raw_data) - end -end diff --git a/cpp/src/arrow/compare.cc b/cpp/src/arrow/compare.cc index ccb299e53a11e..562d4e1b4ddff 100644 --- a/cpp/src/arrow/compare.cc +++ b/cpp/src/arrow/compare.cc @@ -580,31 +580,6 @@ Status ArrayRangeEquals(const Array& left, const Array& right, int64_t left_star // ---------------------------------------------------------------------- // Implement TensorEquals -class TensorEqualsVisitor { - public: - explicit TensorEqualsVisitor(const Tensor& right) : right_(right) {} - - template - Status Visit(const TensorType& left) { - const auto& size_meta = dynamic_cast(*left.type()); - const int byte_width = size_meta.bit_width() / 8; - DCHECK_GT(byte_width, 0); - - const uint8_t* left_data = left.data()->data(); - const uint8_t* right_data = right_.data()->data(); - - result_ = - memcmp(left_data, right_data, static_cast(byte_width * left.size())) == 0; - return Status::OK(); - } - - bool result() const { return result_; } - - protected: - const Tensor& right_; - bool result_; -}; - Status TensorEquals(const Tensor& left, const Tensor& right, bool* are_equal) { // The arrays are the same object if (&left == &right) { @@ -619,9 +594,15 @@ Status TensorEquals(const Tensor& left, const Tensor& right, bool* are_equal) { "Comparison not implemented for non-contiguous tensors"); } - TensorEqualsVisitor visitor(right); - RETURN_NOT_OK(VisitTensorInline(left, &visitor)); - *are_equal = visitor.result(); + const auto& size_meta = dynamic_cast<const FixedWidthType&>(*left.type()); + const int byte_width = size_meta.bit_width() / 8; + DCHECK_GT(byte_width, 0); + + const uint8_t* left_data = left.data()->data(); + const uint8_t* right_data = right.data()->data(); + + *are_equal = + memcmp(left_data, right_data, static_cast<size_t>(byte_width * left.size())) == 0; } return Status::OK(); } diff --git a/cpp/src/arrow/ipc/ipc-read-write-test.cc b/cpp/src/arrow/ipc/ipc-read-write-test.cc index cd793e08a26be..b4a88b5519b7e 100644 --- a/cpp/src/arrow/ipc/ipc-read-write-test.cc +++ b/cpp/src/arrow/ipc/ipc-read-write-test.cc @@ -704,8 +704,8 @@ TEST_F(TestTensorRoundTrip, BasicRoundtrip) { auto data = test::GetBufferFromVector(values); - Int64Tensor t0(data, shape, strides, dim_names); - Int64Tensor tzero(data, {}, {}, {}); + Tensor t0(int64(), data, shape, strides, dim_names); + Tensor tzero(int64(), data, {}, {}, {}); CheckTensorRoundTrip(t0); CheckTensorRoundTrip(tzero); @@ -724,7 +724,7 @@ TEST_F(TestTensorRoundTrip, NonContiguous) { test::randint(24, 0, 100, &values); auto data = test::GetBufferFromVector(values); - Int64Tensor tensor(data, {4, 3}, {48, 16}); + Tensor tensor(int64(), data, {4, 3}, {48, 16}); int32_t metadata_length; int64_t body_length; diff --git a/cpp/src/arrow/ipc/reader.cc b/cpp/src/arrow/ipc/reader.cc index 69fde1783d7d3..aea4c9cd5ec1c 100644 --- a/cpp/src/arrow/ipc/reader.cc +++ b/cpp/src/arrow/ipc/reader.cc @@ -507,7 +507,8 @@ Status ReadTensor( std::vector<std::string> dim_names; RETURN_NOT_OK( GetTensorMetadata(message->header(), &type, &shape, &strides, &dim_names)); - return MakeTensor(type, data, shape, strides, dim_names, out); + *out = std::make_shared<Tensor>(type, data, shape, strides, dim_names); + 
 return Status::OK(); } } // namespace ipc diff --git a/cpp/src/arrow/python/numpy_convert.cc b/cpp/src/arrow/python/numpy_convert.cc index 2c1a5910f06d5..c391b5d7a1018 100644 --- a/cpp/src/arrow/python/numpy_convert.cc +++ b/cpp/src/arrow/python/numpy_convert.cc @@ -223,7 +223,8 @@ Status NdarrayToTensor(MemoryPool* pool, PyObject* ao, std::shared_ptr<Tensor>* std::shared_ptr<DataType> type; RETURN_NOT_OK( GetTensorType(reinterpret_cast<PyObject*>(PyArray_DESCR(ndarray)), &type)); - return MakeTensor(type, data, shape, strides, {}, out); + *out = std::make_shared<Tensor>(type, data, shape, strides); + return Status::OK(); } Status TensorToNdarray(const Tensor& tensor, PyObject* base, PyObject** out) { diff --git a/cpp/src/arrow/tensor-test.cc b/cpp/src/arrow/tensor-test.cc index 336905c21ae81..c41683a3db5a2 100644 --- a/cpp/src/arrow/tensor-test.cc +++ b/cpp/src/arrow/tensor-test.cc @@ -39,7 +39,7 @@ TEST(TestTensor, ZeroDim) { std::shared_ptr<Buffer> buffer; ASSERT_OK(AllocateBuffer(default_memory_pool(), values * sizeof(T), &buffer)); - Int64Tensor t0(buffer, shape); + Tensor t0(int64(), buffer, shape); ASSERT_EQ(1, t0.size()); } @@ -55,9 +55,9 @@ TEST(TestTensor, BasicCtors) { std::shared_ptr<Buffer> buffer; ASSERT_OK(AllocateBuffer(default_memory_pool(), values * sizeof(T), &buffer)); - Int64Tensor t1(buffer, shape); - Int64Tensor t2(buffer, shape, strides); - Int64Tensor t3(buffer, shape, strides, dim_names); + Tensor t1(int64(), buffer, shape); + Tensor t2(int64(), buffer, shape, strides); + Tensor t3(int64(), buffer, shape, strides, dim_names); ASSERT_EQ(24, t1.size()); ASSERT_TRUE(t1.is_mutable()); @@ -84,9 +84,9 @@ TEST(TestTensor, IsContiguous) { std::vector<int64_t> c_strides = {48, 8}; std::vector<int64_t> f_strides = {8, 32}; std::vector<int64_t> noncontig_strides = {8, 8}; - Int64Tensor t1(buffer, shape, c_strides); - Int64Tensor t2(buffer, shape, f_strides); - Int64Tensor t3(buffer, shape, noncontig_strides); + Tensor t1(int64(), buffer, shape, c_strides); + Tensor t2(int64(), buffer, shape, f_strides); + Tensor t3(int64(), buffer, shape, noncontig_strides); ASSERT_TRUE(t1.is_contiguous()); ASSERT_TRUE(t2.is_contiguous()); diff --git a/cpp/src/arrow/tensor.cc b/cpp/src/arrow/tensor.cc index fa3e203c998ba..909b05ebe8f80 100644 --- a/cpp/src/arrow/tensor.cc +++ b/cpp/src/arrow/tensor.cc @@ -118,69 +118,4 @@ bool Tensor::Equals(const Tensor& other) const { return are_equal; } -template -NumericTensor::NumericTensor(const std::shared_ptr& data, - const std::vector& shape, const std::vector& strides, - const std::vector& dim_names) - : Tensor(TypeTraits::type_singleton(), data, shape, strides, dim_names), - raw_data_(nullptr), - mutable_raw_data_(nullptr) { - if (data_) { - raw_data_ = reinterpret_cast(data_->data()); - if (data_->is_mutable()) { - auto mut_buf = static_cast(data_.get()); - mutable_raw_data_ = reinterpret_cast(mut_buf->mutable_data()); - } - } -} - -template -NumericTensor::NumericTensor( - const std::shared_ptr& data, const std::vector& shape) - : NumericTensor(data, shape, {}, {}) {} - -template -NumericTensor::NumericTensor(const std::shared_ptr& data, - const std::vector& shape, const std::vector& strides) - : NumericTensor(data, shape, strides, {}) {} - -template class ARROW_TEMPLATE_EXPORT NumericTensor; -template class ARROW_TEMPLATE_EXPORT NumericTensor; -template class ARROW_TEMPLATE_EXPORT NumericTensor; -template class ARROW_TEMPLATE_EXPORT NumericTensor; -template class ARROW_TEMPLATE_EXPORT NumericTensor; -template class ARROW_TEMPLATE_EXPORT NumericTensor; -template class ARROW_TEMPLATE_EXPORT NumericTensor; -template class 
ARROW_TEMPLATE_EXPORT NumericTensor; -template class ARROW_TEMPLATE_EXPORT NumericTensor; -template class ARROW_TEMPLATE_EXPORT NumericTensor; -template class ARROW_TEMPLATE_EXPORT NumericTensor; - -#define TENSOR_CASE(TYPE, TENSOR_TYPE) \ - case Type::TYPE: \ - *tensor = std::make_shared(data, shape, strides, dim_names); \ - break; - -Status ARROW_EXPORT MakeTensor(const std::shared_ptr& type, - const std::shared_ptr& data, const std::vector& shape, - const std::vector& strides, const std::vector& dim_names, - std::shared_ptr* tensor) { - switch (type->id()) { - TENSOR_CASE(INT8, Int8Tensor); - TENSOR_CASE(INT16, Int16Tensor); - TENSOR_CASE(INT32, Int32Tensor); - TENSOR_CASE(INT64, Int64Tensor); - TENSOR_CASE(UINT8, UInt8Tensor); - TENSOR_CASE(UINT16, UInt16Tensor); - TENSOR_CASE(UINT32, UInt32Tensor); - TENSOR_CASE(UINT64, UInt64Tensor); - TENSOR_CASE(HALF_FLOAT, HalfFloatTensor); - TENSOR_CASE(FLOAT, FloatTensor); - TENSOR_CASE(DOUBLE, DoubleTensor); - default: - return Status::NotImplemented(type->ToString()); - } - return Status::OK(); -} - } // namespace arrow diff --git a/cpp/src/arrow/tensor.h b/cpp/src/arrow/tensor.h index 7741c305f870d..371f5911a4396 100644 --- a/cpp/src/arrow/tensor.h +++ b/cpp/src/arrow/tensor.h @@ -76,6 +76,9 @@ class ARROW_EXPORT Tensor { std::shared_ptr type() const { return type_; } std::shared_ptr data() const { return data_; } + const uint8_t* raw_data() const { return data_->data(); } + uint8_t* raw_data() { return data_->mutable_data(); } + const std::vector& shape() const { return shape_; } const std::vector& strides() const { return strides_; } @@ -117,50 +120,6 @@ class ARROW_EXPORT Tensor { DISALLOW_COPY_AND_ASSIGN(Tensor); }; -template -class ARROW_EXPORT NumericTensor : public Tensor { - public: - using value_type = typename T::c_type; - - NumericTensor(const std::shared_ptr& data, const std::vector& shape); - - /// Constructor with non-negative strides - NumericTensor(const std::shared_ptr& data, const std::vector& shape, - const std::vector& strides); - - /// Constructor with strides and dimension names - NumericTensor(const std::shared_ptr& data, const std::vector& shape, - const std::vector& strides, const std::vector& dim_names); - - const value_type* raw_data() const { return raw_data_; } - value_type* raw_data() { return mutable_raw_data_; } - - private: - const value_type* raw_data_; - value_type* mutable_raw_data_; -}; - -Status ARROW_EXPORT MakeTensor(const std::shared_ptr& type, - const std::shared_ptr& data, const std::vector& shape, - const std::vector& strides, const std::vector& dim_names, - std::shared_ptr* tensor); - -// ---------------------------------------------------------------------- -// extern templates and other details - -// Only instantiate these templates once -ARROW_EXTERN_TEMPLATE NumericTensor; -ARROW_EXTERN_TEMPLATE NumericTensor; -ARROW_EXTERN_TEMPLATE NumericTensor; -ARROW_EXTERN_TEMPLATE NumericTensor; -ARROW_EXTERN_TEMPLATE NumericTensor; -ARROW_EXTERN_TEMPLATE NumericTensor; -ARROW_EXTERN_TEMPLATE NumericTensor; -ARROW_EXTERN_TEMPLATE NumericTensor; -ARROW_EXTERN_TEMPLATE NumericTensor; -ARROW_EXTERN_TEMPLATE NumericTensor; -ARROW_EXTERN_TEMPLATE NumericTensor; - } // namespace arrow #endif // ARROW_TENSOR_H diff --git a/cpp/src/arrow/visitor_inline.h b/cpp/src/arrow/visitor_inline.h index bc5f493fa1f9a..7478950b894c5 100644 --- a/cpp/src/arrow/visitor_inline.h +++ b/cpp/src/arrow/visitor_inline.h @@ -104,31 +104,6 @@ inline Status VisitArrayInline(const Array& array, VISITOR* visitor) { return 
Status::NotImplemented("Type not implemented"); } -#define TENSOR_VISIT_INLINE(TYPE_CLASS) \ - case TYPE_CLASS::type_id: \ - return visitor->Visit( \ - static_cast::TensorType&>(array)); - -template -inline Status VisitTensorInline(const Tensor& array, VISITOR* visitor) { - switch (array.type_id()) { - TENSOR_VISIT_INLINE(Int8Type); - TENSOR_VISIT_INLINE(UInt8Type); - TENSOR_VISIT_INLINE(Int16Type); - TENSOR_VISIT_INLINE(UInt16Type); - TENSOR_VISIT_INLINE(Int32Type); - TENSOR_VISIT_INLINE(UInt32Type); - TENSOR_VISIT_INLINE(Int64Type); - TENSOR_VISIT_INLINE(UInt64Type); - TENSOR_VISIT_INLINE(HalfFloatType); - TENSOR_VISIT_INLINE(FloatType); - TENSOR_VISIT_INLINE(DoubleType); - default: - break; - } - return Status::NotImplemented("Type not implemented"); -} - } // namespace arrow #endif // ARROW_VISITOR_INLINE_H From 3f9b26c0edc84fb0d5c121937f966553bb12c0bf Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Thu, 20 Apr 2017 10:06:15 -0400 Subject: [PATCH 0543/1644] ARROW-863: [GLib] Use GBytes to implement zero-copy Author: Kouhei Sutou Closes #572 from kou/glib-buffer-use-gbytes and squashes the following commits: dc37de3 [Kouhei Sutou] [GLib] Use GBytes to implement zero-copy --- c_glib/arrow-glib/buffer.cpp | 13 +++++++------ c_glib/arrow-glib/buffer.h | 3 +-- c_glib/test/test-buffer.rb | 6 +++--- c_glib/test/test-tensor.rb | 2 +- 4 files changed, 12 insertions(+), 12 deletions(-) diff --git a/c_glib/arrow-glib/buffer.cpp b/c_glib/arrow-glib/buffer.cpp index 0ec52df0aee67..9853e896b3dcc 100644 --- a/c_glib/arrow-glib/buffer.cpp +++ b/c_glib/arrow-glib/buffer.cpp @@ -167,18 +167,19 @@ garrow_buffer_get_capacity(GArrowBuffer *buffer) /** * garrow_buffer_get_data: * @buffer: A #GArrowBuffer. - * @size: (out): The number of bytes of the data. * - * Returns: (array length=size): The data of the buffer. + * Returns: (transfer full): The data of the buffer. The data is owned by + * the buffer. You should not free or modify the data. 
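+ * + * Because the returned #GBytes is created with g_bytes_new_static(), it only wraps the buffer's memory: keep the buffer alive while you use the bytes and release the wrapper with g_bytes_unref(). A minimal usage sketch (assuming `buffer` is an existing #GArrowBuffer): + * |[<!-- language="C" --> + * GBytes *data = garrow_buffer_get_data(buffer); + * gsize size; + * const guint8 *raw = (const guint8 *)g_bytes_get_data(data, &size); + * g_print("%" G_GSIZE_FORMAT " bytes\n", size); + * g_bytes_unref(data); + * ]|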
* * Since: 0.3.0 */ -const guint8 * -garrow_buffer_get_data(GArrowBuffer *buffer, gint64 *size) +GBytes * +garrow_buffer_get_data(GArrowBuffer *buffer) { auto arrow_buffer = garrow_buffer_get_raw(buffer); - *size = arrow_buffer->size(); - return arrow_buffer->data(); + auto data = g_bytes_new_static(arrow_buffer->data(), + arrow_buffer->size()); + return data; } /** diff --git a/c_glib/arrow-glib/buffer.h b/c_glib/arrow-glib/buffer.h index 1e7d55182fd1d..83e1d0d66bf28 100644 --- a/c_glib/arrow-glib/buffer.h +++ b/c_glib/arrow-glib/buffer.h @@ -61,8 +61,7 @@ GArrowBuffer *garrow_buffer_new (const guint8 *data, gint64 size); gboolean garrow_buffer_is_mutable (GArrowBuffer *buffer); gint64 garrow_buffer_get_capacity (GArrowBuffer *buffer); -const guint8 *garrow_buffer_get_data (GArrowBuffer *buffer, - gint64 *size); +GBytes *garrow_buffer_get_data (GArrowBuffer *buffer); gint64 garrow_buffer_get_size (GArrowBuffer *buffer); GArrowBuffer *garrow_buffer_get_parent (GArrowBuffer *buffer); diff --git a/c_glib/test/test-buffer.rb b/c_glib/test/test-buffer.rb index 1ea26f24ce873..6bb96714c8283 100644 --- a/c_glib/test/test-buffer.rb +++ b/c_glib/test/test-buffer.rb @@ -32,7 +32,7 @@ def test_capacity end def test_data - assert_equal(@data, @buffer.data.pack("C*")) + assert_equal(@data, @buffer.data.to_s) end def test_size @@ -45,11 +45,11 @@ def test_parent def test_copy copied_buffer = @buffer.copy(1, 3) - assert_equal(@data[1, 3], copied_buffer.data.pack("C*")) + assert_equal(@data[1, 3], copied_buffer.data.to_s) end def test_slice sliced_buffer = @buffer.slice(1, 3) - assert_equal(@data[1, 3], sliced_buffer.data.pack("C*")) + assert_equal(@data[1, 3], sliced_buffer.data.to_s) end end diff --git a/c_glib/test/test-tensor.rb b/c_glib/test/test-tensor.rb index 3e1f541cfd4b5..225857b52da98 100644 --- a/c_glib/test/test-tensor.rb +++ b/c_glib/test/test-tensor.rb @@ -49,7 +49,7 @@ def test_value_type end def test_buffer - assert_equal(@raw_data, @tensor.buffer.data) + assert_equal(@raw_data, @tensor.buffer.data.to_s.unpack("c*")) end def test_shape From 7c1fef51ca1add0af53dcdb43590f367974607c2 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Thu, 20 Apr 2017 10:08:56 -0400 Subject: [PATCH 0544/1644] ARROW-864: [GLib] Unify Array files Author: Kouhei Sutou Closes #573 from kou/glib-unify-array-file and squashes the following commits: 4c47afb [Kouhei Sutou] [GLib] Unify Array files --- c_glib/arrow-glib/Makefile.am | 32 - c_glib/arrow-glib/array.cpp | 624 +++++++++++++++++- c_glib/arrow-glib/array.h | 735 ++++++++++++++++++++++ c_glib/arrow-glib/arrow-glib.h | 16 - c_glib/arrow-glib/binary-array.cpp | 73 --- c_glib/arrow-glib/binary-array.h | 72 --- c_glib/arrow-glib/boolean-array.cpp | 69 -- c_glib/arrow-glib/boolean-array.h | 70 --- c_glib/arrow-glib/double-array.cpp | 69 -- c_glib/arrow-glib/double-array.h | 71 --- c_glib/arrow-glib/float-array.cpp | 69 -- c_glib/arrow-glib/float-array.h | 71 --- c_glib/arrow-glib/int16-array.cpp | 69 -- c_glib/arrow-glib/int16-array.h | 71 --- c_glib/arrow-glib/int32-array.cpp | 69 -- c_glib/arrow-glib/int32-array.h | 71 --- c_glib/arrow-glib/int64-array.cpp | 69 -- c_glib/arrow-glib/int64-array.h | 71 --- c_glib/arrow-glib/int8-array.cpp | 69 -- c_glib/arrow-glib/int8-array.h | 71 --- c_glib/arrow-glib/list-array.cpp | 92 --- c_glib/arrow-glib/list-array.h | 73 --- c_glib/arrow-glib/null-array.cpp | 69 -- c_glib/arrow-glib/null-array.h | 70 --- c_glib/arrow-glib/string-array.cpp | 74 --- c_glib/arrow-glib/string-array.h | 71 --- c_glib/arrow-glib/struct-array.cpp 
| 97 --- c_glib/arrow-glib/struct-array.h | 73 --- c_glib/arrow-glib/uint16-array.cpp | 69 -- c_glib/arrow-glib/uint16-array.h | 71 --- c_glib/arrow-glib/uint32-array.cpp | 69 -- c_glib/arrow-glib/uint32-array.h | 71 --- c_glib/arrow-glib/uint64-array.cpp | 69 -- c_glib/arrow-glib/uint64-array.h | 71 --- c_glib/arrow-glib/uint8-array.cpp | 69 -- c_glib/arrow-glib/uint8-array.h | 71 --- c_glib/doc/reference/arrow-glib-docs.sgml | 16 - 37 files changed, 1340 insertions(+), 2386 deletions(-) delete mode 100644 c_glib/arrow-glib/binary-array.cpp delete mode 100644 c_glib/arrow-glib/binary-array.h delete mode 100644 c_glib/arrow-glib/boolean-array.cpp delete mode 100644 c_glib/arrow-glib/boolean-array.h delete mode 100644 c_glib/arrow-glib/double-array.cpp delete mode 100644 c_glib/arrow-glib/double-array.h delete mode 100644 c_glib/arrow-glib/float-array.cpp delete mode 100644 c_glib/arrow-glib/float-array.h delete mode 100644 c_glib/arrow-glib/int16-array.cpp delete mode 100644 c_glib/arrow-glib/int16-array.h delete mode 100644 c_glib/arrow-glib/int32-array.cpp delete mode 100644 c_glib/arrow-glib/int32-array.h delete mode 100644 c_glib/arrow-glib/int64-array.cpp delete mode 100644 c_glib/arrow-glib/int64-array.h delete mode 100644 c_glib/arrow-glib/int8-array.cpp delete mode 100644 c_glib/arrow-glib/int8-array.h delete mode 100644 c_glib/arrow-glib/list-array.cpp delete mode 100644 c_glib/arrow-glib/list-array.h delete mode 100644 c_glib/arrow-glib/null-array.cpp delete mode 100644 c_glib/arrow-glib/null-array.h delete mode 100644 c_glib/arrow-glib/string-array.cpp delete mode 100644 c_glib/arrow-glib/string-array.h delete mode 100644 c_glib/arrow-glib/struct-array.cpp delete mode 100644 c_glib/arrow-glib/struct-array.h delete mode 100644 c_glib/arrow-glib/uint16-array.cpp delete mode 100644 c_glib/arrow-glib/uint16-array.h delete mode 100644 c_glib/arrow-glib/uint32-array.cpp delete mode 100644 c_glib/arrow-glib/uint32-array.h delete mode 100644 c_glib/arrow-glib/uint64-array.cpp delete mode 100644 c_glib/arrow-glib/uint64-array.h delete mode 100644 c_glib/arrow-glib/uint8-array.cpp delete mode 100644 c_glib/arrow-glib/uint8-array.h diff --git a/c_glib/arrow-glib/Makefile.am b/c_glib/arrow-glib/Makefile.am index 11b6508df0745..570a033f4512c 100644 --- a/c_glib/arrow-glib/Makefile.am +++ b/c_glib/arrow-glib/Makefile.am @@ -44,62 +44,46 @@ libarrow_glib_la_headers = \ array.h \ array-builder.h \ arrow-glib.h \ - binary-array.h \ binary-array-builder.h \ binary-data-type.h \ - boolean-array.h \ boolean-array-builder.h \ boolean-data-type.h \ buffer.h \ chunked-array.h \ column.h \ data-type.h \ - double-array.h \ double-array-builder.h \ double-data-type.h \ error.h \ field.h \ - float-array.h \ float-array-builder.h \ float-data-type.h \ - int8-array.h \ int8-array-builder.h \ int8-data-type.h \ - int16-array.h \ int16-array-builder.h \ int16-data-type.h \ - int32-array.h \ int32-array-builder.h \ int32-data-type.h \ - int64-array.h \ int64-array-builder.h \ int64-data-type.h \ - list-array.h \ list-array-builder.h \ list-data-type.h \ - null-array.h \ null-data-type.h \ record-batch.h \ schema.h \ - string-array.h \ string-array-builder.h \ string-data-type.h \ - struct-array.h \ struct-array-builder.h \ struct-data-type.h \ table.h \ tensor.h \ type.h \ - uint8-array.h \ uint8-array-builder.h \ uint8-data-type.h \ - uint16-array.h \ uint16-array-builder.h \ uint16-data-type.h \ - uint32-array.h \ uint32-array-builder.h \ uint32-data-type.h \ - uint64-array.h \ uint64-array-builder.h \ 
 uint64-data-type.h @@ -132,62 +116,46 @@ libarrow_glib_la_generated_sources = \ libarrow_glib_la_sources = \ array.cpp \ array-builder.cpp \ - binary-array.cpp \ binary-array-builder.cpp \ binary-data-type.cpp \ - boolean-array.cpp \ boolean-array-builder.cpp \ boolean-data-type.cpp \ buffer.cpp \ chunked-array.cpp \ column.cpp \ data-type.cpp \ - double-array.cpp \ double-array-builder.cpp \ double-data-type.cpp \ error.cpp \ field.cpp \ - float-array.cpp \ float-array-builder.cpp \ float-data-type.cpp \ - int8-array.cpp \ int8-array-builder.cpp \ int8-data-type.cpp \ - int16-array.cpp \ int16-array-builder.cpp \ int16-data-type.cpp \ - int32-array.cpp \ int32-array-builder.cpp \ int32-data-type.cpp \ - int64-array.cpp \ int64-array-builder.cpp \ int64-data-type.cpp \ - list-array.cpp \ list-array-builder.cpp \ list-data-type.cpp \ - null-array.cpp \ null-data-type.cpp \ record-batch.cpp \ schema.cpp \ - string-array.cpp \ string-array-builder.cpp \ string-data-type.cpp \ - struct-array.cpp \ struct-array-builder.cpp \ struct-data-type.cpp \ table.cpp \ tensor.cpp \ type.cpp \ - uint8-array.cpp \ uint8-array-builder.cpp \ uint8-data-type.cpp \ - uint16-array.cpp \ uint16-array-builder.cpp \ uint16-data-type.cpp \ - uint32-array.cpp \ uint32-array-builder.cpp \ uint32-data-type.cpp \ - uint64-array.cpp \ uint64-array-builder.cpp \ uint64-data-type.cpp \ $(libarrow_glib_la_headers) \ diff --git a/c_glib/arrow-glib/array.cpp b/c_glib/arrow-glib/array.cpp index 3bd7b40ff9493..c86bff90d40d6 100644 --- a/c_glib/arrow-glib/array.cpp +++ b/c_glib/arrow-glib/array.cpp @@ -22,24 +22,8 @@ #endif #include -#include -#include #include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include #include -#include -#include -#include -#include #include @@ -47,13 +31,80 @@ G_BEGIN_DECLS /** * SECTION: array - * @short_description: Base class for all array classes + * @section_id: array-classes + * @title: Array classes + * @include: arrow-glib/arrow-glib.h * * #GArrowArray is a base class for all array classes such as * #GArrowBooleanArray. * - * Array is immutable. You need to use array builder class such as - * #GArrowBooleanArrayBuilder to create a new array. + * All array classes are immutable. You need to use an array builder + * class such as #GArrowBooleanArrayBuilder to create a new array + * except #GArrowNullArray. + * + * #GArrowNullArray is a class for null array. It can store zero or + * more null values. You need to specify an array length to create a + * new array. + * + * #GArrowBooleanArray is a class for boolean array. It can store zero + * or more boolean data. You need to use #GArrowBooleanArrayBuilder to + * create a new array. + * + * #GArrowInt8Array is a class for 8-bit integer array. It can store + * zero or more 8-bit integer data. You need to use + * #GArrowInt8ArrayBuilder to create a new array. + * + * #GArrowUInt8Array is a class for 8-bit unsigned integer array. It + * can store zero or more 8-bit unsigned integer data. You need to use + * #GArrowUInt8ArrayBuilder to create a new array. + * + * #GArrowInt16Array is a class for 16-bit integer array. It can store + * zero or more 16-bit integer data. You need to use + * #GArrowInt16ArrayBuilder to create a new array. + * + * #GArrowUInt16Array is a class for 16-bit unsigned integer array. It + * can store zero or more 16-bit unsigned integer data. You need to use + * #GArrowUInt16ArrayBuilder to create a new array. + * + * #GArrowInt32Array is a class for 32-bit integer array. 
It can store + * zero or more 32-bit integer data. You need to use + * #GArrowInt32ArrayBuilder to create a new array. + * + * #GArrowUInt32Array is a class for 32-bit unsigned integer array. It + * can store zero or more 32-bit unsigned integer data. You need to use + * #GArrowUInt32ArrayBuilder to create a new array. + * + * #GArrowInt64Array is a class for 64-bit integer array. It can store + * zero or more 64-bit integer data. You need to use + * #GArrowInt64ArrayBuilder to create a new array. + * + * #GArrowUInt64Array is a class for 64-bit unsigned integer array. It + * can store zero or more 64-bit unsigned integer data. You need to + * use #GArrowUInt64ArrayBuilder to create a new array. + * + * #GArrowFloatArray is a class for 32-bit floating point array. It + * can store zero or more 32-bit floating point data. You need to use + * #GArrowFloatArrayBuilder to create a new array. + * + * #GArrowDoubleArray is a class for 64-bit floating point array. It + * can store zero or more 64-bit floating point data. You need to use + * #GArrowDoubleArrayBuilder to create a new array. + * + * #GArrowBinaryArray is a class for binary array. It can store zero + * or more binary data. You need to use #GArrowBinaryArrayBuilder to + * create a new array. + * + * #GArrowStringArray is a class for UTF-8 encoded string array. It + * can store zero or more UTF-8 encoded string data. You need to use + * #GArrowStringArrayBuilder to create a new array. + * + * #GArrowListArray is a class for list array. It can store zero or + * more list data. You need to use #GArrowListArrayBuilder to create a + * new array. + * + * #GArrowStructArray is a class for struct array. It can store zero + * or more structs. One struct has zero or more fields. You need to + * use #GArrowStructArrayBuilder to create a new array. */ typedef struct GArrowArrayPrivate_ { @@ -243,6 +294,541 @@ garrow_array_slice(GArrowArray *array, return garrow_array_new_raw(&arrow_sub_array); } + +G_DEFINE_TYPE(GArrowNullArray, \ + garrow_null_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_null_array_init(GArrowNullArray *object) +{ +} + +static void +garrow_null_array_class_init(GArrowNullArrayClass *klass) +{ +} + +/** + * garrow_null_array_new: + * @length: An array length. + * + * Returns: A newly created #GArrowNullArray. + */ +GArrowNullArray * +garrow_null_array_new(gint64 length) +{ + auto arrow_null_array = std::make_shared<arrow::NullArray>(length); + std::shared_ptr<arrow::Array> arrow_array = arrow_null_array; + auto array = garrow_array_new_raw(&arrow_array); + return GARROW_NULL_ARRAY(array); +} + + +G_DEFINE_TYPE(GArrowBooleanArray, \ + garrow_boolean_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_boolean_array_init(GArrowBooleanArray *object) +{ +} + +static void +garrow_boolean_array_class_init(GArrowBooleanArrayClass *klass) +{ +} + +/** + * garrow_boolean_array_get_value: + * @array: A #GArrowBooleanArray. + * @i: The index of the target value. + * + * Returns: The i-th value. + */ +gboolean +garrow_boolean_array_get_value(GArrowBooleanArray *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + return static_cast<arrow::BooleanArray *>(arrow_array.get())->Value(i); +} + + +G_DEFINE_TYPE(GArrowInt8Array, \ + garrow_int8_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_int8_array_init(GArrowInt8Array *object) +{ +} + +static void +garrow_int8_array_class_init(GArrowInt8ArrayClass *klass) +{ +} + +/** + * garrow_int8_array_get_value: + * @array: A #GArrowInt8Array. + * @i: The index of the target value. + * + * Returns: The i-th value. 
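+ * + * A minimal read sketch (assuming `array` is a populated #GArrowInt8Array and `i` is within range; a null entry still yields a plain value, so test it with the existing garrow_array_is_null() when nulls are possible): + * |[<!-- language="C" --> + * if (!garrow_array_is_null(GARROW_ARRAY(array), i)) { + *   gint8 value = garrow_int8_array_get_value(array, i); + *   g_print("array[%" G_GINT64_FORMAT "] = %d\n", i, value); + * } + * ]|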
+ */ +gint8 +garrow_int8_array_get_value(GArrowInt8Array *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + return static_cast<arrow::Int8Array *>(arrow_array.get())->Value(i); +} + + +G_DEFINE_TYPE(GArrowUInt8Array, \ + garrow_uint8_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_uint8_array_init(GArrowUInt8Array *object) +{ +} + +static void +garrow_uint8_array_class_init(GArrowUInt8ArrayClass *klass) +{ +} + +/** + * garrow_uint8_array_get_value: + * @array: A #GArrowUInt8Array. + * @i: The index of the target value. + * + * Returns: The i-th value. + */ +guint8 +garrow_uint8_array_get_value(GArrowUInt8Array *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + return static_cast<arrow::UInt8Array *>(arrow_array.get())->Value(i); +} + + +G_DEFINE_TYPE(GArrowInt16Array, \ + garrow_int16_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_int16_array_init(GArrowInt16Array *object) +{ +} + +static void +garrow_int16_array_class_init(GArrowInt16ArrayClass *klass) +{ +} + +/** + * garrow_int16_array_get_value: + * @array: A #GArrowInt16Array. + * @i: The index of the target value. + * + * Returns: The i-th value. + */ +gint16 +garrow_int16_array_get_value(GArrowInt16Array *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + return static_cast<arrow::Int16Array *>(arrow_array.get())->Value(i); +} + + +G_DEFINE_TYPE(GArrowUInt16Array, \ + garrow_uint16_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_uint16_array_init(GArrowUInt16Array *object) +{ +} + +static void +garrow_uint16_array_class_init(GArrowUInt16ArrayClass *klass) +{ +} + +/** + * garrow_uint16_array_get_value: + * @array: A #GArrowUInt16Array. + * @i: The index of the target value. + * + * Returns: The i-th value. + */ +guint16 +garrow_uint16_array_get_value(GArrowUInt16Array *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + return static_cast<arrow::UInt16Array *>(arrow_array.get())->Value(i); +} + + +G_DEFINE_TYPE(GArrowInt32Array, \ + garrow_int32_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_int32_array_init(GArrowInt32Array *object) +{ +} + +static void +garrow_int32_array_class_init(GArrowInt32ArrayClass *klass) +{ +} + +/** + * garrow_int32_array_get_value: + * @array: A #GArrowInt32Array. + * @i: The index of the target value. + * + * Returns: The i-th value. + */ +gint32 +garrow_int32_array_get_value(GArrowInt32Array *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + return static_cast<arrow::Int32Array *>(arrow_array.get())->Value(i); +} + + +G_DEFINE_TYPE(GArrowUInt32Array, \ + garrow_uint32_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_uint32_array_init(GArrowUInt32Array *object) +{ +} + +static void +garrow_uint32_array_class_init(GArrowUInt32ArrayClass *klass) +{ +} + +/** + * garrow_uint32_array_get_value: + * @array: A #GArrowUInt32Array. + * @i: The index of the target value. + * + * Returns: The i-th value. + */ +guint32 +garrow_uint32_array_get_value(GArrowUInt32Array *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + return static_cast<arrow::UInt32Array *>(arrow_array.get())->Value(i); +} + + +G_DEFINE_TYPE(GArrowInt64Array, \ + garrow_int64_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_int64_array_init(GArrowInt64Array *object) +{ +} + +static void +garrow_int64_array_class_init(GArrowInt64ArrayClass *klass) +{ +} + +/** + * garrow_int64_array_get_value: + * @array: A #GArrowInt64Array. + * @i: The index of the target value. + * + * Returns: The i-th value. 
+ */ +gint64 +garrow_int64_array_get_value(GArrowInt64Array *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + return static_cast<arrow::Int64Array *>(arrow_array.get())->Value(i); +} + + +G_DEFINE_TYPE(GArrowUInt64Array, \ + garrow_uint64_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_uint64_array_init(GArrowUInt64Array *object) +{ +} + +static void +garrow_uint64_array_class_init(GArrowUInt64ArrayClass *klass) +{ +} + +/** + * garrow_uint64_array_get_value: + * @array: A #GArrowUInt64Array. + * @i: The index of the target value. + * + * Returns: The i-th value. + */ +guint64 +garrow_uint64_array_get_value(GArrowUInt64Array *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + return static_cast<arrow::UInt64Array *>(arrow_array.get())->Value(i); +} + +G_DEFINE_TYPE(GArrowFloatArray, \ + garrow_float_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_float_array_init(GArrowFloatArray *object) +{ +} + +static void +garrow_float_array_class_init(GArrowFloatArrayClass *klass) +{ +} + +/** + * garrow_float_array_get_value: + * @array: A #GArrowFloatArray. + * @i: The index of the target value. + * + * Returns: The i-th value. + */ +gfloat +garrow_float_array_get_value(GArrowFloatArray *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + return static_cast<arrow::FloatArray *>(arrow_array.get())->Value(i); +} + + +G_DEFINE_TYPE(GArrowDoubleArray, \ + garrow_double_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_double_array_init(GArrowDoubleArray *object) +{ +} + +static void +garrow_double_array_class_init(GArrowDoubleArrayClass *klass) +{ +} + +/** + * garrow_double_array_get_value: + * @array: A #GArrowDoubleArray. + * @i: The index of the target value. + * + * Returns: The i-th value. 
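+ * + * A sketch of scanning a whole array with the generic length accessor from the existing GArrowArray API (assuming `array` is a populated #GArrowDoubleArray): + * |[<!-- language="C" --> + * gint64 n = garrow_array_get_length(GARROW_ARRAY(array)); + * for (gint64 i = 0; i < n; ++i) { + *   g_print("%g\n", garrow_double_array_get_value(array, i)); + * } + * ]|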
+ */ +gdouble +garrow_double_array_get_value(GArrowDoubleArray *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + return static_cast<arrow::DoubleArray *>(arrow_array.get())->Value(i); +} + + +G_DEFINE_TYPE(GArrowBinaryArray, \ + garrow_binary_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_binary_array_init(GArrowBinaryArray *object) +{ +} + +static void +garrow_binary_array_class_init(GArrowBinaryArrayClass *klass) +{ +} + +/** + * garrow_binary_array_get_value: + * @array: A #GArrowBinaryArray. + * @i: The index of the target value. + * @length: (out): The length of the value. + * + * Returns: (array length=length): The i-th value. + */ +const guint8 * +garrow_binary_array_get_value(GArrowBinaryArray *array, + gint64 i, + gint32 *length) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + auto arrow_binary_array = + static_cast<arrow::BinaryArray *>(arrow_array.get()); + return arrow_binary_array->GetValue(i, length); +} + + +G_DEFINE_TYPE(GArrowStringArray, \ + garrow_string_array, \ + GARROW_TYPE_BINARY_ARRAY) + +static void +garrow_string_array_init(GArrowStringArray *object) +{ +} + +static void +garrow_string_array_class_init(GArrowStringArrayClass *klass) +{ +} + +/** + * garrow_string_array_get_string: + * @array: A #GArrowStringArray. + * @i: The index of the target value. + * + * Returns: (transfer full): The i-th UTF-8 encoded string. 
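+ * + * The string is copied with g_strndup(), so the caller owns it and must release it with g_free(). A minimal sketch (assuming `array` is a populated #GArrowStringArray): + * |[<!-- language="C" --> + * gchar *value = garrow_string_array_get_string(array, 0); + * g_print("%s\n", value); + * g_free(value); + * ]|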
+ */ +gchar * +garrow_string_array_get_string(GArrowStringArray *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + auto arrow_string_array = + static_cast<arrow::StringArray *>(arrow_array.get()); + gint32 length; + auto value = + reinterpret_cast<const gchar *>(arrow_string_array->GetValue(i, &length)); + return g_strndup(value, length); +} + + +G_DEFINE_TYPE(GArrowListArray, \ + garrow_list_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_list_array_init(GArrowListArray *object) +{ +} + +static void +garrow_list_array_class_init(GArrowListArrayClass *klass) +{ +} + +/** + * garrow_list_array_get_value_type: + * @array: A #GArrowListArray. + * + * Returns: (transfer full): The data type of the values in each list. + */ +GArrowDataType * +garrow_list_array_get_value_type(GArrowListArray *array) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + auto arrow_list_array = + static_cast<arrow::ListArray *>(arrow_array.get()); + auto arrow_value_type = arrow_list_array->value_type(); + return garrow_data_type_new_raw(&arrow_value_type); +} + +/** + * garrow_list_array_get_value: + * @array: A #GArrowListArray. + * @i: The index of the target value. + * + * Returns: (transfer full): The i-th list. + */ +GArrowArray * +garrow_list_array_get_value(GArrowListArray *array, + gint64 i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + auto arrow_list_array = + static_cast<arrow::ListArray *>(arrow_array.get()); + auto arrow_list = + arrow_list_array->values()->Slice(arrow_list_array->value_offset(i), + arrow_list_array->value_length(i)); + return garrow_array_new_raw(&arrow_list); +} + + +G_DEFINE_TYPE(GArrowStructArray, \ + garrow_struct_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_struct_array_init(GArrowStructArray *object) +{ +} + +static void +garrow_struct_array_class_init(GArrowStructArrayClass *klass) +{ +} + +/** + * garrow_struct_array_get_field: + * @array: A #GArrowStructArray. + * @i: The index of the field in the struct. + * + * Returns: (transfer full): The i-th field. + */ +GArrowArray * +garrow_struct_array_get_field(GArrowStructArray *array, + gint i) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + auto arrow_struct_array = + static_cast<arrow::StructArray *>(arrow_array.get()); + auto arrow_field = arrow_struct_array->field(i); + return garrow_array_new_raw(&arrow_field); +} + +/** + * garrow_struct_array_get_fields: + * @array: A #GArrowStructArray. + * + * Returns: (element-type GArrowArray) (transfer full): + * The fields in the struct. 
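+ * + * The list and every element are transferred to the caller. A minimal sketch of walking the fields and releasing them (assuming `array` is a populated #GArrowStructArray): + * |[<!-- language="C" --> + * GList *fields = garrow_struct_array_get_fields(array); + * for (GList *node = fields; node; node = node->next) { + *   GArrowArray *field = GARROW_ARRAY(node->data); + *   g_print("length: %" G_GINT64_FORMAT "\n", garrow_array_get_length(field)); + * } + * g_list_free_full(fields, g_object_unref); + * ]|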
+ */ +GList * +garrow_struct_array_get_fields(GArrowStructArray *array) +{ + const auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + const auto arrow_struct_array = + static_cast<const arrow::StructArray *>(arrow_array.get()); + + GList *fields = NULL; + for (auto arrow_field : arrow_struct_array->fields()) { + GArrowArray *field = garrow_array_new_raw(&arrow_field); + fields = g_list_prepend(fields, field); + } + + return g_list_reverse(fields); +} + G_END_DECLS GArrowArray * diff --git a/c_glib/arrow-glib/array.h b/c_glib/arrow-glib/array.h index 06a37e9b43ad6..b417cdbab3631 100644 --- a/c_glib/arrow-glib/array.h +++ b/c_glib/arrow-glib/array.h @@ -68,4 +68,739 @@ GArrowArray *garrow_array_slice (GArrowArray *array, gint64 offset, gint64 length); +#define GARROW_TYPE_NULL_ARRAY \ + (garrow_null_array_get_type()) +#define GARROW_NULL_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_NULL_ARRAY, \ + GArrowNullArray)) +#define GARROW_NULL_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_NULL_ARRAY, \ + GArrowNullArrayClass)) +#define GARROW_IS_NULL_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_NULL_ARRAY)) +#define GARROW_IS_NULL_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_NULL_ARRAY)) +#define GARROW_NULL_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_NULL_ARRAY, \ + GArrowNullArrayClass)) + +typedef struct _GArrowNullArray GArrowNullArray; +typedef struct _GArrowNullArrayClass GArrowNullArrayClass; + +/** + * GArrowNullArray: + * + * It wraps `arrow::NullArray`. + */ +struct _GArrowNullArray +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowNullArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_null_array_get_type(void) G_GNUC_CONST; + +GArrowNullArray *garrow_null_array_new(gint64 length); + + +#define GARROW_TYPE_BOOLEAN_ARRAY \ + (garrow_boolean_array_get_type()) +#define GARROW_BOOLEAN_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_BOOLEAN_ARRAY, \ + GArrowBooleanArray)) +#define GARROW_BOOLEAN_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_BOOLEAN_ARRAY, \ + GArrowBooleanArrayClass)) +#define GARROW_IS_BOOLEAN_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_BOOLEAN_ARRAY)) +#define GARROW_IS_BOOLEAN_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_BOOLEAN_ARRAY)) +#define GARROW_BOOLEAN_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_BOOLEAN_ARRAY, \ + GArrowBooleanArrayClass)) + +typedef struct _GArrowBooleanArray GArrowBooleanArray; +typedef struct _GArrowBooleanArrayClass GArrowBooleanArrayClass; + +/** + * GArrowBooleanArray: + * + * It wraps `arrow::BooleanArray`. 
+ */ +struct _GArrowInt16Array +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowInt16ArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_int16_array_get_type(void) G_GNUC_CONST; + +gint16 garrow_int16_array_get_value(GArrowInt16Array *array, + gint64 i); + + +#define GARROW_TYPE_UINT16_ARRAY \ + (garrow_uint16_array_get_type()) +#define GARROW_UINT16_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT16_ARRAY, \ + GArrowUInt16Array)) +#define GARROW_UINT16_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT16_ARRAY, \ + GArrowUInt16ArrayClass)) +#define GARROW_IS_UINT16_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT16_ARRAY)) +#define GARROW_IS_UINT16_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT16_ARRAY)) +#define GARROW_UINT16_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT16_ARRAY, \ + GArrowUInt16ArrayClass)) + +typedef struct _GArrowUInt16Array GArrowUInt16Array; +typedef struct _GArrowUInt16ArrayClass GArrowUInt16ArrayClass; + +/** + * GArrowUInt16Array: + * + * It wraps `arrow::UInt16Array`. + */ +struct _GArrowUInt16Array +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowUInt16ArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_uint16_array_get_type(void) G_GNUC_CONST; + +guint16 garrow_uint16_array_get_value(GArrowUInt16Array *array, + gint64 i); + + +#define GARROW_TYPE_INT32_ARRAY \ + (garrow_int32_array_get_type()) +#define GARROW_INT32_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INT32_ARRAY, \ + GArrowInt32Array)) +#define GARROW_INT32_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INT32_ARRAY, \ + GArrowInt32ArrayClass)) +#define GARROW_IS_INT32_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_INT32_ARRAY)) +#define GARROW_IS_INT32_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INT32_ARRAY)) +#define GARROW_INT32_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INT32_ARRAY, \ + GArrowInt32ArrayClass)) + +typedef struct _GArrowInt32Array GArrowInt32Array; +typedef struct _GArrowInt32ArrayClass GArrowInt32ArrayClass; + +/** + * GArrowInt32Array: + * + * It wraps `arrow::Int32Array`. + */ +struct _GArrowInt32Array +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowInt32ArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_int32_array_get_type(void) G_GNUC_CONST; + +gint32 garrow_int32_array_get_value(GArrowInt32Array *array, + gint64 i); + + +#define GARROW_TYPE_UINT32_ARRAY \ + (garrow_uint32_array_get_type()) +#define GARROW_UINT32_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT32_ARRAY, \ + GArrowUInt32Array)) +#define GARROW_UINT32_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT32_ARRAY, \ + GArrowUInt32ArrayClass)) +#define GARROW_IS_UINT32_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT32_ARRAY)) +#define GARROW_IS_UINT32_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT32_ARRAY)) +#define GARROW_UINT32_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT32_ARRAY, \ + GArrowUInt32ArrayClass)) + +typedef struct _GArrowUInt32Array GArrowUInt32Array; +typedef struct _GArrowUInt32ArrayClass GArrowUInt32ArrayClass; + +/** + * GArrowUInt32Array: + * + * It wraps `arrow::UInt32Array`. 
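+ *
+ * A sketch of a checked cast from a generic object followed by a value
+ * read (it assumes `object` really holds an unsigned 32-bit array):
+ * |[<!-- language="C" -->
+ * GArrowUInt32Array *array = GARROW_UINT32_ARRAY(object);
+ * guint32 value = garrow_uint32_array_get_value(array, 0);
+ * ]|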
+ */ +struct _GArrowUInt32Array +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowUInt32ArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_uint32_array_get_type(void) G_GNUC_CONST; + +guint32 garrow_uint32_array_get_value(GArrowUInt32Array *array, + gint64 i); + + +#define GARROW_TYPE_INT64_ARRAY \ + (garrow_int64_array_get_type()) +#define GARROW_INT64_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INT64_ARRAY, \ + GArrowInt64Array)) +#define GARROW_INT64_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INT64_ARRAY, \ + GArrowInt64ArrayClass)) +#define GARROW_IS_INT64_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_INT64_ARRAY)) +#define GARROW_IS_INT64_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INT64_ARRAY)) +#define GARROW_INT64_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INT64_ARRAY, \ + GArrowInt64ArrayClass)) + +typedef struct _GArrowInt64Array GArrowInt64Array; +typedef struct _GArrowInt64ArrayClass GArrowInt64ArrayClass; + +/** + * GArrowInt64Array: + * + * It wraps `arrow::Int64Array`. + */ +struct _GArrowInt64Array +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowInt64ArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_int64_array_get_type(void) G_GNUC_CONST; + +gint64 garrow_int64_array_get_value(GArrowInt64Array *array, + gint64 i); + + +#define GARROW_TYPE_UINT64_ARRAY \ + (garrow_uint64_array_get_type()) +#define GARROW_UINT64_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT64_ARRAY, \ + GArrowUInt64Array)) +#define GARROW_UINT64_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT64_ARRAY, \ + GArrowUInt64ArrayClass)) +#define GARROW_IS_UINT64_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT64_ARRAY)) +#define GARROW_IS_UINT64_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT64_ARRAY)) +#define GARROW_UINT64_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT64_ARRAY, \ + GArrowUInt64ArrayClass)) + +typedef struct _GArrowUInt64Array GArrowUInt64Array; +typedef struct _GArrowUInt64ArrayClass GArrowUInt64ArrayClass; + +/** + * GArrowUInt64Array: + * + * It wraps `arrow::UInt64Array`. + */ +struct _GArrowUInt64Array +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowUInt64ArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_uint64_array_get_type(void) G_GNUC_CONST; + +guint64 garrow_uint64_array_get_value(GArrowUInt64Array *array, + gint64 i); + + +#define GARROW_TYPE_FLOAT_ARRAY \ + (garrow_float_array_get_type()) +#define GARROW_FLOAT_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_FLOAT_ARRAY, \ + GArrowFloatArray)) +#define GARROW_FLOAT_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_FLOAT_ARRAY, \ + GArrowFloatArrayClass)) +#define GARROW_IS_FLOAT_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_FLOAT_ARRAY)) +#define GARROW_IS_FLOAT_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_FLOAT_ARRAY)) +#define GARROW_FLOAT_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_FLOAT_ARRAY, \ + GArrowFloatArrayClass)) + +typedef struct _GArrowFloatArray GArrowFloatArray; +typedef struct _GArrowFloatArrayClass GArrowFloatArrayClass; + +/** + * GArrowFloatArray: + * + * It wraps `arrow::FloatArray`. 
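+ *
+ * A sketch of a defensive read that checks the runtime type first:
+ * |[<!-- language="C" -->
+ * if (GARROW_IS_FLOAT_ARRAY(object)) {
+ *   gfloat value =
+ *     garrow_float_array_get_value(GARROW_FLOAT_ARRAY(object), 0);
+ * }
+ * ]|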
+ */ +struct _GArrowFloatArray +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowFloatArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_float_array_get_type(void) G_GNUC_CONST; + +gfloat garrow_float_array_get_value(GArrowFloatArray *array, + gint64 i); + + +#define GARROW_TYPE_DOUBLE_ARRAY \ + (garrow_double_array_get_type()) +#define GARROW_DOUBLE_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_DOUBLE_ARRAY, \ + GArrowDoubleArray)) +#define GARROW_DOUBLE_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_DOUBLE_ARRAY, \ + GArrowDoubleArrayClass)) +#define GARROW_IS_DOUBLE_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_DOUBLE_ARRAY)) +#define GARROW_IS_DOUBLE_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_DOUBLE_ARRAY)) +#define GARROW_DOUBLE_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_DOUBLE_ARRAY, \ + GArrowDoubleArrayClass)) + +typedef struct _GArrowDoubleArray GArrowDoubleArray; +typedef struct _GArrowDoubleArrayClass GArrowDoubleArrayClass; + +/** + * GArrowDoubleArray: + * + * It wraps `arrow::DoubleArray`. + */ +struct _GArrowDoubleArray +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowDoubleArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_double_array_get_type(void) G_GNUC_CONST; + +gdouble garrow_double_array_get_value(GArrowDoubleArray *array, + gint64 i); + + +#define GARROW_TYPE_BINARY_ARRAY \ + (garrow_binary_array_get_type()) +#define GARROW_BINARY_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_BINARY_ARRAY, \ + GArrowBinaryArray)) +#define GARROW_BINARY_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_BINARY_ARRAY, \ + GArrowBinaryArrayClass)) +#define GARROW_IS_BINARY_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_BINARY_ARRAY)) +#define GARROW_IS_BINARY_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_BINARY_ARRAY)) +#define GARROW_BINARY_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_BINARY_ARRAY, \ + GArrowBinaryArrayClass)) + +typedef struct _GArrowBinaryArray GArrowBinaryArray; +typedef struct _GArrowBinaryArrayClass GArrowBinaryArrayClass; + +/** + * GArrowBinaryArray: + * + * It wraps `arrow::BinaryArray`. 
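+ *
+ * A sketch that reads one value together with its byte length via the
+ * (out) parameter; the returned bytes point into the array's buffer and
+ * must not be freed:
+ * |[<!-- language="C" -->
+ * gint32 length;
+ * const guint8 *data = garrow_binary_array_get_value(array, 0, &length);
+ * ]|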
+ */ +struct _GArrowBinaryArray +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowBinaryArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_binary_array_get_type(void) G_GNUC_CONST; + +const guint8 *garrow_binary_array_get_value(GArrowBinaryArray *array, + gint64 i, + gint32 *length); + +#define GARROW_TYPE_STRING_ARRAY \ + (garrow_string_array_get_type()) +#define GARROW_STRING_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_STRING_ARRAY, \ + GArrowStringArray)) +#define GARROW_STRING_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_STRING_ARRAY, \ + GArrowStringArrayClass)) +#define GARROW_IS_STRING_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_STRING_ARRAY)) +#define GARROW_IS_STRING_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_STRING_ARRAY)) +#define GARROW_STRING_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_STRING_ARRAY, \ + GArrowStringArrayClass)) + +typedef struct _GArrowStringArray GArrowStringArray; +typedef struct _GArrowStringArrayClass GArrowStringArrayClass; + +/** + * GArrowStringArray: + * + * It wraps `arrow::StringArray`. + */ +struct _GArrowStringArray +{ + /*< private >*/ + GArrowBinaryArray parent_instance; +}; + +struct _GArrowStringArrayClass +{ + GArrowBinaryArrayClass parent_class; +}; + +GType garrow_string_array_get_type(void) G_GNUC_CONST; + +gchar *garrow_string_array_get_string(GArrowStringArray *array, + gint64 i); + + +#define GARROW_TYPE_LIST_ARRAY \ + (garrow_list_array_get_type()) +#define GARROW_LIST_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_LIST_ARRAY, \ + GArrowListArray)) +#define GARROW_LIST_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_LIST_ARRAY, \ + GArrowListArrayClass)) +#define GARROW_IS_LIST_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_LIST_ARRAY)) +#define GARROW_IS_LIST_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_LIST_ARRAY)) +#define GARROW_LIST_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_LIST_ARRAY, \ + GArrowListArrayClass)) + +typedef struct _GArrowListArray GArrowListArray; +typedef struct _GArrowListArrayClass GArrowListArrayClass; + +/** + * GArrowListArray: + * + * It wraps `arrow::ListArray`. 
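+ *
+ * A sketch that takes the i-th list as a new #GArrowArray;
+ * garrow_list_array_get_value() is (transfer full), so the caller
+ * releases the result:
+ * |[<!-- language="C" -->
+ * GArrowArray *value = garrow_list_array_get_value(array, 0);
+ * g_object_unref(value);
+ * ]|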
+ */ +struct _GArrowListArray +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowListArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_list_array_get_type(void) G_GNUC_CONST; + +GArrowDataType *garrow_list_array_get_value_type(GArrowListArray *array); +GArrowArray *garrow_list_array_get_value(GArrowListArray *array, + gint64 i); + + +#define GARROW_TYPE_STRUCT_ARRAY \ + (garrow_struct_array_get_type()) +#define GARROW_STRUCT_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_STRUCT_ARRAY, \ + GArrowStructArray)) +#define GARROW_STRUCT_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_STRUCT_ARRAY, \ + GArrowStructArrayClass)) +#define GARROW_IS_STRUCT_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_STRUCT_ARRAY)) +#define GARROW_IS_STRUCT_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_STRUCT_ARRAY)) +#define GARROW_STRUCT_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_STRUCT_ARRAY, \ + GArrowStructArrayClass)) + +typedef struct _GArrowStructArray GArrowStructArray; +typedef struct _GArrowStructArrayClass GArrowStructArrayClass; + +/** + * GArrowStructArray: + * + * It wraps `arrow::StructArray`. + */ +struct _GArrowStructArray +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowStructArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_struct_array_get_type(void) G_GNUC_CONST; + +GArrowArray *garrow_struct_array_get_field(GArrowStructArray *array, + gint i); +GList *garrow_struct_array_get_fields(GArrowStructArray *array); + G_END_DECLS diff --git a/c_glib/arrow-glib/arrow-glib.h b/c_glib/arrow-glib/arrow-glib.h index 8d9bfe2da9c38..ee408cde3615e 100644 --- a/c_glib/arrow-glib/arrow-glib.h +++ b/c_glib/arrow-glib/arrow-glib.h @@ -21,62 +21,46 @@ #include #include -#include #include #include -#include #include #include #include #include #include -#include #include #include #include #include #include -#include #include #include -#include #include #include -#include #include #include -#include #include #include -#include #include #include -#include #include #include -#include #include #include #include -#include #include #include -#include #include #include #include #include #include -#include #include #include -#include #include #include -#include #include #include -#include #include #include diff --git a/c_glib/arrow-glib/binary-array.cpp b/c_glib/arrow-glib/binary-array.cpp deleted file mode 100644 index c149d14025ae7..0000000000000 --- a/c_glib/arrow-glib/binary-array.cpp +++ /dev/null @@ -1,73 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: binary-array - * @short_description: Binary array class - * - * #GArrowBinaryArray is a class for binary array. It can store zero - * or more binary data. - * - * #GArrowBinaryArray is immutable. You need to use - * #GArrowBinaryArrayBuilder to create a new array. - */ - -G_DEFINE_TYPE(GArrowBinaryArray, \ - garrow_binary_array, \ - GARROW_TYPE_ARRAY) - -static void -garrow_binary_array_init(GArrowBinaryArray *object) -{ -} - -static void -garrow_binary_array_class_init(GArrowBinaryArrayClass *klass) -{ -} - -/** - * garrow_binary_array_get_value: - * @array: A #GArrowBinaryArray. - * @i: The index of the target value. - * @length: (out): The length of the value. - * - * Returns: (array length=length): The i-th value. - */ -const guint8 * -garrow_binary_array_get_value(GArrowBinaryArray *array, - gint64 i, - gint32 *length) -{ - auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); - auto arrow_binary_array = - static_cast(arrow_array.get()); - return arrow_binary_array->GetValue(i, length); -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/binary-array.h b/c_glib/arrow-glib/binary-array.h deleted file mode 100644 index ab63ece9844f8..0000000000000 --- a/c_glib/arrow-glib/binary-array.h +++ /dev/null @@ -1,72 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_BINARY_ARRAY \ - (garrow_binary_array_get_type()) -#define GARROW_BINARY_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_BINARY_ARRAY, \ - GArrowBinaryArray)) -#define GARROW_BINARY_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_BINARY_ARRAY, \ - GArrowBinaryArrayClass)) -#define GARROW_IS_BINARY_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_BINARY_ARRAY)) -#define GARROW_IS_BINARY_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_BINARY_ARRAY)) -#define GARROW_BINARY_ARRAY_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_BINARY_ARRAY, \ - GArrowBinaryArrayClass)) - -typedef struct _GArrowBinaryArray GArrowBinaryArray; -typedef struct _GArrowBinaryArrayClass GArrowBinaryArrayClass; - -/** - * GArrowBinaryArray: - * - * It wraps `arrow::BinaryArray`. 
- */
-struct _GArrowBinaryArray
-{
-  /*< private >*/
-  GArrowArray parent_instance;
-};
-
-struct _GArrowBinaryArrayClass
-{
-  GArrowArrayClass parent_class;
-};
-
-GType garrow_binary_array_get_type(void) G_GNUC_CONST;
-
-const guint8 *garrow_binary_array_get_value(GArrowBinaryArray *array,
-                                            gint64 i,
-                                            gint32 *length);
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/boolean-array.cpp b/c_glib/arrow-glib/boolean-array.cpp
deleted file mode 100644
index 62fc40fd54112..0000000000000
--- a/c_glib/arrow-glib/boolean-array.cpp
+++ /dev/null
@@ -1,69 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- *   http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#ifdef HAVE_CONFIG_H
-# include <config.h>
-#endif
-
-#include <arrow-glib/array.hpp>
-#include <arrow-glib/boolean-array.h>
-
-G_BEGIN_DECLS
-
-/**
- * SECTION: boolean-array
- * @short_description: Boolean array class
- *
- * #GArrowBooleanArray is a class for boolean array. It can store zero
- * or more boolean data.
- *
- * #GArrowBooleanArray is immutable. You need to use
- * #GArrowBooleanArrayBuilder to create a new array.
- */
-
-G_DEFINE_TYPE(GArrowBooleanArray, \
-              garrow_boolean_array, \
-              GARROW_TYPE_ARRAY)
-
-static void
-garrow_boolean_array_init(GArrowBooleanArray *object)
-{
-}
-
-static void
-garrow_boolean_array_class_init(GArrowBooleanArrayClass *klass)
-{
-}
-
-/**
- * garrow_boolean_array_get_value:
- * @array: A #GArrowBooleanArray.
- * @i: The index of the target value.
- *
- * Returns: The i-th value.
- */
-gboolean
-garrow_boolean_array_get_value(GArrowBooleanArray *array,
-                               gint64 i)
-{
-  auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array));
-  return static_cast<arrow::BooleanArray *>(arrow_array.get())->Value(i);
-}
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/boolean-array.h b/c_glib/arrow-glib/boolean-array.h
deleted file mode 100644
index 9899fdf0ceca8..0000000000000
--- a/c_glib/arrow-glib/boolean-array.h
+++ /dev/null
@@ -1,70 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- *   http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_BOOLEAN_ARRAY \ - (garrow_boolean_array_get_type()) -#define GARROW_BOOLEAN_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_BOOLEAN_ARRAY, \ - GArrowBooleanArray)) -#define GARROW_BOOLEAN_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_BOOLEAN_ARRAY, \ - GArrowBooleanArrayClass)) -#define GARROW_IS_BOOLEAN_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_BOOLEAN_ARRAY)) -#define GARROW_IS_BOOLEAN_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_BOOLEAN_ARRAY)) -#define GARROW_BOOLEAN_ARRAY_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_BOOLEAN_ARRAY, \ - GArrowBooleanArrayClass)) - -typedef struct _GArrowBooleanArray GArrowBooleanArray; -typedef struct _GArrowBooleanArrayClass GArrowBooleanArrayClass; - -/** - * GArrowBooleanArray: - * - * It wraps `arrow::BooleanArray`. - */ -struct _GArrowBooleanArray -{ - /*< private >*/ - GArrowArray parent_instance; -}; - -struct _GArrowBooleanArrayClass -{ - GArrowArrayClass parent_class; -}; - -GType garrow_boolean_array_get_type (void) G_GNUC_CONST; -gboolean garrow_boolean_array_get_value (GArrowBooleanArray *array, - gint64 i); - -G_END_DECLS diff --git a/c_glib/arrow-glib/double-array.cpp b/c_glib/arrow-glib/double-array.cpp deleted file mode 100644 index ecc55d7541689..0000000000000 --- a/c_glib/arrow-glib/double-array.cpp +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: double-array - * @short_description: 64-bit floating point array class - * - * #GArrowDoubleArray is a class for 64-bit floating point array. It - * can store zero or more 64-bit floating data. - * - * #GArrowDoubleArray is immutable. You need to use - * #GArrowDoubleArrayBuilder to create a new array. - */ - -G_DEFINE_TYPE(GArrowDoubleArray, \ - garrow_double_array, \ - GARROW_TYPE_ARRAY) - -static void -garrow_double_array_init(GArrowDoubleArray *object) -{ -} - -static void -garrow_double_array_class_init(GArrowDoubleArrayClass *klass) -{ -} - -/** - * garrow_double_array_get_value: - * @array: A #GArrowDoubleArray. - * @i: The index of the target value. - * - * Returns: The i-th value. 
- */ -gdouble -garrow_double_array_get_value(GArrowDoubleArray *array, - gint64 i) -{ - auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); - return static_cast(arrow_array.get())->Value(i); -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/double-array.h b/c_glib/arrow-glib/double-array.h deleted file mode 100644 index b9a236532e3bf..0000000000000 --- a/c_glib/arrow-glib/double-array.h +++ /dev/null @@ -1,71 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_DOUBLE_ARRAY \ - (garrow_double_array_get_type()) -#define GARROW_DOUBLE_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_DOUBLE_ARRAY, \ - GArrowDoubleArray)) -#define GARROW_DOUBLE_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_DOUBLE_ARRAY, \ - GArrowDoubleArrayClass)) -#define GARROW_IS_DOUBLE_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_DOUBLE_ARRAY)) -#define GARROW_IS_DOUBLE_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_DOUBLE_ARRAY)) -#define GARROW_DOUBLE_ARRAY_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_DOUBLE_ARRAY, \ - GArrowDoubleArrayClass)) - -typedef struct _GArrowDoubleArray GArrowDoubleArray; -typedef struct _GArrowDoubleArrayClass GArrowDoubleArrayClass; - -/** - * GArrowDoubleArray: - * - * It wraps `arrow::DoubleArray`. - */ -struct _GArrowDoubleArray -{ - /*< private >*/ - GArrowArray parent_instance; -}; - -struct _GArrowDoubleArrayClass -{ - GArrowArrayClass parent_class; -}; - -GType garrow_double_array_get_type(void) G_GNUC_CONST; - -gdouble garrow_double_array_get_value(GArrowDoubleArray *array, - gint64 i); - -G_END_DECLS diff --git a/c_glib/arrow-glib/float-array.cpp b/c_glib/arrow-glib/float-array.cpp deleted file mode 100644 index 28e8047652f7e..0000000000000 --- a/c_glib/arrow-glib/float-array.cpp +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: float-array - * @short_description: 32-bit floating point array class - * - * #GArrowFloatArray is a class for 32-bit floating point array. It - * can store zero or more 32-bit floating data. - * - * #GArrowFloatArray is immutable. You need to use - * #GArrowFloatArrayBuilder to create a new array. - */ - -G_DEFINE_TYPE(GArrowFloatArray, \ - garrow_float_array, \ - GARROW_TYPE_ARRAY) - -static void -garrow_float_array_init(GArrowFloatArray *object) -{ -} - -static void -garrow_float_array_class_init(GArrowFloatArrayClass *klass) -{ -} - -/** - * garrow_float_array_get_value: - * @array: A #GArrowFloatArray. - * @i: The index of the target value. - * - * Returns: The i-th value. - */ -gfloat -garrow_float_array_get_value(GArrowFloatArray *array, - gint64 i) -{ - auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); - return static_cast(arrow_array.get())->Value(i); -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/float-array.h b/c_glib/arrow-glib/float-array.h deleted file mode 100644 index d113f9757a511..0000000000000 --- a/c_glib/arrow-glib/float-array.h +++ /dev/null @@ -1,71 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_FLOAT_ARRAY \ - (garrow_float_array_get_type()) -#define GARROW_FLOAT_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_FLOAT_ARRAY, \ - GArrowFloatArray)) -#define GARROW_FLOAT_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_FLOAT_ARRAY, \ - GArrowFloatArrayClass)) -#define GARROW_IS_FLOAT_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_FLOAT_ARRAY)) -#define GARROW_IS_FLOAT_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_FLOAT_ARRAY)) -#define GARROW_FLOAT_ARRAY_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_FLOAT_ARRAY, \ - GArrowFloatArrayClass)) - -typedef struct _GArrowFloatArray GArrowFloatArray; -typedef struct _GArrowFloatArrayClass GArrowFloatArrayClass; - -/** - * GArrowFloatArray: - * - * It wraps `arrow::FloatArray`. 
- */ -struct _GArrowFloatArray -{ - /*< private >*/ - GArrowArray parent_instance; -}; - -struct _GArrowFloatArrayClass -{ - GArrowArrayClass parent_class; -}; - -GType garrow_float_array_get_type(void) G_GNUC_CONST; - -gfloat garrow_float_array_get_value(GArrowFloatArray *array, - gint64 i); - -G_END_DECLS diff --git a/c_glib/arrow-glib/int16-array.cpp b/c_glib/arrow-glib/int16-array.cpp deleted file mode 100644 index 456d085a3449a..0000000000000 --- a/c_glib/arrow-glib/int16-array.cpp +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: int16-array - * @short_description: 16-bit integer array class - * - * #GArrowInt16Array is a class for 16-bit integer array. It can store - * zero or more 16-bit integer data. - * - * #GArrowInt16Array is immutable. You need to use - * #GArrowInt16ArrayBuilder to create a new array. - */ - -G_DEFINE_TYPE(GArrowInt16Array, \ - garrow_int16_array, \ - GARROW_TYPE_ARRAY) - -static void -garrow_int16_array_init(GArrowInt16Array *object) -{ -} - -static void -garrow_int16_array_class_init(GArrowInt16ArrayClass *klass) -{ -} - -/** - * garrow_int16_array_get_value: - * @array: A #GArrowInt16Array. - * @i: The index of the target value. - * - * Returns: The i-th value. - */ -gint16 -garrow_int16_array_get_value(GArrowInt16Array *array, - gint64 i) -{ - auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); - return static_cast(arrow_array.get())->Value(i); -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/int16-array.h b/c_glib/arrow-glib/int16-array.h deleted file mode 100644 index d37144cef51f2..0000000000000 --- a/c_glib/arrow-glib/int16-array.h +++ /dev/null @@ -1,71 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_INT16_ARRAY \ - (garrow_int16_array_get_type()) -#define GARROW_INT16_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_INT16_ARRAY, \ - GArrowInt16Array)) -#define GARROW_INT16_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_INT16_ARRAY, \ - GArrowInt16ArrayClass)) -#define GARROW_IS_INT16_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_INT16_ARRAY)) -#define GARROW_IS_INT16_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_INT16_ARRAY)) -#define GARROW_INT16_ARRAY_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_INT16_ARRAY, \ - GArrowInt16ArrayClass)) - -typedef struct _GArrowInt16Array GArrowInt16Array; -typedef struct _GArrowInt16ArrayClass GArrowInt16ArrayClass; - -/** - * GArrowInt16Array: - * - * It wraps `arrow::Int16Array`. - */ -struct _GArrowInt16Array -{ - /*< private >*/ - GArrowArray parent_instance; -}; - -struct _GArrowInt16ArrayClass -{ - GArrowArrayClass parent_class; -}; - -GType garrow_int16_array_get_type(void) G_GNUC_CONST; - -gint16 garrow_int16_array_get_value(GArrowInt16Array *array, - gint64 i); - -G_END_DECLS diff --git a/c_glib/arrow-glib/int32-array.cpp b/c_glib/arrow-glib/int32-array.cpp deleted file mode 100644 index 8bd6f35fd6431..0000000000000 --- a/c_glib/arrow-glib/int32-array.cpp +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: int32-array - * @short_description: 32-bit integer array class - * - * #GArrowInt32Array is a class for 32-bit integer array. It can store - * zero or more 32-bit integer data. - * - * #GArrowInt32Array is immutable. You need to use - * #GArrowInt32ArrayBuilder to create a new array. - */ - -G_DEFINE_TYPE(GArrowInt32Array, \ - garrow_int32_array, \ - GARROW_TYPE_ARRAY) - -static void -garrow_int32_array_init(GArrowInt32Array *object) -{ -} - -static void -garrow_int32_array_class_init(GArrowInt32ArrayClass *klass) -{ -} - -/** - * garrow_int32_array_get_value: - * @array: A #GArrowInt32Array. - * @i: The index of the target value. - * - * Returns: The i-th value. 
- */ -gint32 -garrow_int32_array_get_value(GArrowInt32Array *array, - gint64 i) -{ - auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); - return static_cast(arrow_array.get())->Value(i); -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/int32-array.h b/c_glib/arrow-glib/int32-array.h deleted file mode 100644 index cce2b41aafe26..0000000000000 --- a/c_glib/arrow-glib/int32-array.h +++ /dev/null @@ -1,71 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_INT32_ARRAY \ - (garrow_int32_array_get_type()) -#define GARROW_INT32_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_INT32_ARRAY, \ - GArrowInt32Array)) -#define GARROW_INT32_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_INT32_ARRAY, \ - GArrowInt32ArrayClass)) -#define GARROW_IS_INT32_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_INT32_ARRAY)) -#define GARROW_IS_INT32_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_INT32_ARRAY)) -#define GARROW_INT32_ARRAY_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_INT32_ARRAY, \ - GArrowInt32ArrayClass)) - -typedef struct _GArrowInt32Array GArrowInt32Array; -typedef struct _GArrowInt32ArrayClass GArrowInt32ArrayClass; - -/** - * GArrowInt32Array: - * - * It wraps `arrow::Int32Array`. - */ -struct _GArrowInt32Array -{ - /*< private >*/ - GArrowArray parent_instance; -}; - -struct _GArrowInt32ArrayClass -{ - GArrowArrayClass parent_class; -}; - -GType garrow_int32_array_get_type(void) G_GNUC_CONST; - -gint32 garrow_int32_array_get_value(GArrowInt32Array *array, - gint64 i); - -G_END_DECLS diff --git a/c_glib/arrow-glib/int64-array.cpp b/c_glib/arrow-glib/int64-array.cpp deleted file mode 100644 index be49d5bf35251..0000000000000 --- a/c_glib/arrow-glib/int64-array.cpp +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: int64-array - * @short_description: 64-bit integer array class - * - * #GArrowInt64Array is a class for 64-bit integer array. It can store - * zero or more 64-bit integer data. - * - * #GArrowInt64Array is immutable. You need to use - * #GArrowInt64ArrayBuilder to create a new array. - */ - -G_DEFINE_TYPE(GArrowInt64Array, \ - garrow_int64_array, \ - GARROW_TYPE_ARRAY) - -static void -garrow_int64_array_init(GArrowInt64Array *object) -{ -} - -static void -garrow_int64_array_class_init(GArrowInt64ArrayClass *klass) -{ -} - -/** - * garrow_int64_array_get_value: - * @array: A #GArrowInt64Array. - * @i: The index of the target value. - * - * Returns: The i-th value. - */ -gint64 -garrow_int64_array_get_value(GArrowInt64Array *array, - gint64 i) -{ - auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); - return static_cast(arrow_array.get())->Value(i); -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/int64-array.h b/c_glib/arrow-glib/int64-array.h deleted file mode 100644 index 73d4c6453a6d5..0000000000000 --- a/c_glib/arrow-glib/int64-array.h +++ /dev/null @@ -1,71 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_INT64_ARRAY \ - (garrow_int64_array_get_type()) -#define GARROW_INT64_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_INT64_ARRAY, \ - GArrowInt64Array)) -#define GARROW_INT64_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_INT64_ARRAY, \ - GArrowInt64ArrayClass)) -#define GARROW_IS_INT64_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_INT64_ARRAY)) -#define GARROW_IS_INT64_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_INT64_ARRAY)) -#define GARROW_INT64_ARRAY_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_INT64_ARRAY, \ - GArrowInt64ArrayClass)) - -typedef struct _GArrowInt64Array GArrowInt64Array; -typedef struct _GArrowInt64ArrayClass GArrowInt64ArrayClass; - -/** - * GArrowInt64Array: - * - * It wraps `arrow::Int64Array`. 
- */ -struct _GArrowInt64Array -{ - /*< private >*/ - GArrowArray parent_instance; -}; - -struct _GArrowInt64ArrayClass -{ - GArrowArrayClass parent_class; -}; - -GType garrow_int64_array_get_type(void) G_GNUC_CONST; - -gint64 garrow_int64_array_get_value(GArrowInt64Array *array, - gint64 i); - -G_END_DECLS diff --git a/c_glib/arrow-glib/int8-array.cpp b/c_glib/arrow-glib/int8-array.cpp deleted file mode 100644 index d3f12ece9bbf7..0000000000000 --- a/c_glib/arrow-glib/int8-array.cpp +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: int8-array - * @short_description: 8-bit integer array class - * - * #GArrowInt8Array is a class for 8-bit integer array. It can store - * zero or more 8-bit integer data. - * - * #GArrowInt8Array is immutable. You need to use - * #GArrowInt8ArrayBuilder to create a new array. - */ - -G_DEFINE_TYPE(GArrowInt8Array, \ - garrow_int8_array, \ - GARROW_TYPE_ARRAY) - -static void -garrow_int8_array_init(GArrowInt8Array *object) -{ -} - -static void -garrow_int8_array_class_init(GArrowInt8ArrayClass *klass) -{ -} - -/** - * garrow_int8_array_get_value: - * @array: A #GArrowInt8Array. - * @i: The index of the target value. - * - * Returns: The i-th value. - */ -gint8 -garrow_int8_array_get_value(GArrowInt8Array *array, - gint64 i) -{ - auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); - return static_cast(arrow_array.get())->Value(i); -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/int8-array.h b/c_glib/arrow-glib/int8-array.h deleted file mode 100644 index 0e1e901f4fdb6..0000000000000 --- a/c_glib/arrow-glib/int8-array.h +++ /dev/null @@ -1,71 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_INT8_ARRAY \ - (garrow_int8_array_get_type()) -#define GARROW_INT8_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_INT8_ARRAY, \ - GArrowInt8Array)) -#define GARROW_INT8_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_INT8_ARRAY, \ - GArrowInt8ArrayClass)) -#define GARROW_IS_INT8_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_INT8_ARRAY)) -#define GARROW_IS_INT8_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_INT8_ARRAY)) -#define GARROW_INT8_ARRAY_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_INT8_ARRAY, \ - GArrowInt8ArrayClass)) - -typedef struct _GArrowInt8Array GArrowInt8Array; -typedef struct _GArrowInt8ArrayClass GArrowInt8ArrayClass; - -/** - * GArrowInt8Array: - * - * It wraps `arrow::Int8Array`. - */ -struct _GArrowInt8Array -{ - /*< private >*/ - GArrowArray parent_instance; -}; - -struct _GArrowInt8ArrayClass -{ - GArrowArrayClass parent_class; -}; - -GType garrow_int8_array_get_type(void) G_GNUC_CONST; - -gint8 garrow_int8_array_get_value(GArrowInt8Array *array, - gint64 i); - -G_END_DECLS diff --git a/c_glib/arrow-glib/list-array.cpp b/c_glib/arrow-glib/list-array.cpp deleted file mode 100644 index 2b3fb311280d0..0000000000000 --- a/c_glib/arrow-glib/list-array.cpp +++ /dev/null @@ -1,92 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: list-array - * @short_description: List array class - * @include: arrow-glib/arrow-glib.h - * - * #GArrowListArray is a class for list array. It can store zero - * or more list data. - * - * #GArrowListArray is immutable. You need to use - * #GArrowListArrayBuilder to create a new array. - */ - -G_DEFINE_TYPE(GArrowListArray, \ - garrow_list_array, \ - GARROW_TYPE_ARRAY) - -static void -garrow_list_array_init(GArrowListArray *object) -{ -} - -static void -garrow_list_array_class_init(GArrowListArrayClass *klass) -{ -} - -/** - * garrow_list_array_get_value_type: - * @array: A #GArrowListArray. - * - * Returns: (transfer full): The data type of value in each list. - */ -GArrowDataType * -garrow_list_array_get_value_type(GArrowListArray *array) -{ - auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); - auto arrow_list_array = - static_cast(arrow_array.get()); - auto arrow_value_type = arrow_list_array->value_type(); - return garrow_data_type_new_raw(&arrow_value_type); -} - -/** - * garrow_list_array_get_value: - * @array: A #GArrowListArray. - * @i: The index of the target value. - * - * Returns: (transfer full): The i-th list. 
- */ -GArrowArray * -garrow_list_array_get_value(GArrowListArray *array, - gint64 i) -{ - auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); - auto arrow_list_array = - static_cast(arrow_array.get()); - auto arrow_list = - arrow_list_array->values()->Slice(arrow_list_array->value_offset(i), - arrow_list_array->value_length(i)); - return garrow_array_new_raw(&arrow_list); -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/list-array.h b/c_glib/arrow-glib/list-array.h deleted file mode 100644 index c49aed1b9599e..0000000000000 --- a/c_glib/arrow-glib/list-array.h +++ /dev/null @@ -1,73 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_LIST_ARRAY \ - (garrow_list_array_get_type()) -#define GARROW_LIST_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_LIST_ARRAY, \ - GArrowListArray)) -#define GARROW_LIST_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_LIST_ARRAY, \ - GArrowListArrayClass)) -#define GARROW_IS_LIST_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_LIST_ARRAY)) -#define GARROW_IS_LIST_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_LIST_ARRAY)) -#define GARROW_LIST_ARRAY_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_LIST_ARRAY, \ - GArrowListArrayClass)) - -typedef struct _GArrowListArray GArrowListArray; -typedef struct _GArrowListArrayClass GArrowListArrayClass; - -/** - * GArrowListArray: - * - * It wraps `arrow::ListArray`. - */ -struct _GArrowListArray -{ - /*< private >*/ - GArrowArray parent_instance; -}; - -struct _GArrowListArrayClass -{ - GArrowArrayClass parent_class; -}; - -GType garrow_list_array_get_type(void) G_GNUC_CONST; - -GArrowDataType *garrow_list_array_get_value_type(GArrowListArray *array); -GArrowArray *garrow_list_array_get_value(GArrowListArray *array, - gint64 i); - -G_END_DECLS diff --git a/c_glib/arrow-glib/null-array.cpp b/c_glib/arrow-glib/null-array.cpp deleted file mode 100644 index 0e0ea51e24c04..0000000000000 --- a/c_glib/arrow-glib/null-array.cpp +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. 
You may obtain a copy of the License at
- *
- *   http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#ifdef HAVE_CONFIG_H
-# include <config.h>
-#endif
-
-#include <arrow-glib/array.hpp>
-#include <arrow-glib/null-array.h>
-
-G_BEGIN_DECLS
-
-/**
- * SECTION: null-array
- * @short_description: Null array class
- *
- * #GArrowNullArray is a class for null array. It can store zero
- * or more null values.
- *
- * #GArrowNullArray is immutable. You need to specify an array length
- * to create a new array.
- */
-
-G_DEFINE_TYPE(GArrowNullArray, \
-              garrow_null_array, \
-              GARROW_TYPE_ARRAY)
-
-static void
-garrow_null_array_init(GArrowNullArray *object)
-{
-}
-
-static void
-garrow_null_array_class_init(GArrowNullArrayClass *klass)
-{
-}
-
-/**
- * garrow_null_array_new:
- * @length: An array length.
- *
- * Returns: A newly created #GArrowNullArray.
- */
-GArrowNullArray *
-garrow_null_array_new(gint64 length)
-{
-  auto arrow_null_array = std::make_shared<arrow::NullArray>(length);
-  std::shared_ptr<arrow::Array> arrow_array = arrow_null_array;
-  auto array = garrow_array_new_raw(&arrow_array);
-  return GARROW_NULL_ARRAY(array);
-}
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/null-array.h b/c_glib/arrow-glib/null-array.h
deleted file mode 100644
index e25f3054843e4..0000000000000
--- a/c_glib/arrow-glib/null-array.h
+++ /dev/null
@@ -1,70 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- *   http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#pragma once
-
-#include <arrow-glib/array.h>
-
-G_BEGIN_DECLS
-
-#define GARROW_TYPE_NULL_ARRAY \
-  (garrow_null_array_get_type())
-#define GARROW_NULL_ARRAY(obj) \
-  (G_TYPE_CHECK_INSTANCE_CAST((obj), \
-                              GARROW_TYPE_NULL_ARRAY, \
-                              GArrowNullArray))
-#define GARROW_NULL_ARRAY_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_CAST((klass), \
-                           GARROW_TYPE_NULL_ARRAY, \
-                           GArrowNullArrayClass))
-#define GARROW_IS_NULL_ARRAY(obj) \
-  (G_TYPE_CHECK_INSTANCE_TYPE((obj), \
-                              GARROW_TYPE_NULL_ARRAY))
-#define GARROW_IS_NULL_ARRAY_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_TYPE((klass), \
-                           GARROW_TYPE_NULL_ARRAY))
-#define GARROW_NULL_ARRAY_GET_CLASS(obj) \
-  (G_TYPE_INSTANCE_GET_CLASS((obj), \
-                             GARROW_TYPE_NULL_ARRAY, \
-                             GArrowNullArrayClass))
-
-typedef struct _GArrowNullArray GArrowNullArray;
-typedef struct _GArrowNullArrayClass GArrowNullArrayClass;
-
-/**
- * GArrowNullArray:
- *
- * It wraps `arrow::NullArray`.
- */ -struct _GArrowNullArray -{ - /*< private >*/ - GArrowArray parent_instance; -}; - -struct _GArrowNullArrayClass -{ - GArrowArrayClass parent_class; -}; - -GType garrow_null_array_get_type(void) G_GNUC_CONST; - -GArrowNullArray *garrow_null_array_new(gint64 length); - -G_END_DECLS diff --git a/c_glib/arrow-glib/string-array.cpp b/c_glib/arrow-glib/string-array.cpp deleted file mode 100644 index 329c742ccafe1..0000000000000 --- a/c_glib/arrow-glib/string-array.cpp +++ /dev/null @@ -1,74 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: string-array - * @short_description: UTF-8 encoded string array class - * - * #GArrowStringArray is a class for UTF-8 encoded string array. It - * can store zero or more UTF-8 encoded string data. - * - * #GArrowStringArray is immutable. You need to use - * #GArrowStringArrayBuilder to create a new array. - */ - -G_DEFINE_TYPE(GArrowStringArray, \ - garrow_string_array, \ - GARROW_TYPE_BINARY_ARRAY) - -static void -garrow_string_array_init(GArrowStringArray *object) -{ -} - -static void -garrow_string_array_class_init(GArrowStringArrayClass *klass) -{ -} - -/** - * garrow_string_array_get_string: - * @array: A #GArrowStringArray. - * @i: The index of the target value. - * - * Returns: The i-th UTF-8 encoded string. - */ -gchar * -garrow_string_array_get_string(GArrowStringArray *array, - gint64 i) -{ - auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); - auto arrow_string_array = - static_cast(arrow_array.get()); - gint32 length; - auto value = - reinterpret_cast(arrow_string_array->GetValue(i, &length)); - return g_strndup(value, length); -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/string-array.h b/c_glib/arrow-glib/string-array.h deleted file mode 100644 index 41a53cd5f1d4a..0000000000000 --- a/c_glib/arrow-glib/string-array.h +++ /dev/null @@ -1,71 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_STRING_ARRAY \ - (garrow_string_array_get_type()) -#define GARROW_STRING_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_STRING_ARRAY, \ - GArrowStringArray)) -#define GARROW_STRING_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_STRING_ARRAY, \ - GArrowStringArrayClass)) -#define GARROW_IS_STRING_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_STRING_ARRAY)) -#define GARROW_IS_STRING_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_STRING_ARRAY)) -#define GARROW_STRING_ARRAY_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_STRING_ARRAY, \ - GArrowStringArrayClass)) - -typedef struct _GArrowStringArray GArrowStringArray; -typedef struct _GArrowStringArrayClass GArrowStringArrayClass; - -/** - * GArrowStringArray: - * - * It wraps `arrow::StringArray`. - */ -struct _GArrowStringArray -{ - /*< private >*/ - GArrowBinaryArray parent_instance; -}; - -struct _GArrowStringArrayClass -{ - GArrowBinaryArrayClass parent_class; -}; - -GType garrow_string_array_get_type(void) G_GNUC_CONST; - -gchar *garrow_string_array_get_string(GArrowStringArray *array, - gint64 i); - -G_END_DECLS diff --git a/c_glib/arrow-glib/struct-array.cpp b/c_glib/arrow-glib/struct-array.cpp deleted file mode 100644 index 14c2d17cdd737..0000000000000 --- a/c_glib/arrow-glib/struct-array.cpp +++ /dev/null @@ -1,97 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: struct-array - * @short_description: Struct array class - * @include: arrow-glib/arrow-glib.h - * - * #GArrowStructArray is a class for struct array. It can store zero - * or more structs. One struct has zero or more fields. - * - * #GArrowStructArray is immutable. You need to use - * #GArrowStructArrayBuilder to create a new array. - */ - -G_DEFINE_TYPE(GArrowStructArray, \ - garrow_struct_array, \ - GARROW_TYPE_ARRAY) - -static void -garrow_struct_array_init(GArrowStructArray *object) -{ -} - -static void -garrow_struct_array_class_init(GArrowStructArrayClass *klass) -{ -} - -/** - * garrow_struct_array_get_field - * @array: A #GArrowStructArray. - * @i: The index of the field in the struct. - * - * Returns: (transfer full): The i-th field. 
- */ -GArrowArray * -garrow_struct_array_get_field(GArrowStructArray *array, - gint i) -{ - auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); - auto arrow_struct_array = - static_cast(arrow_array.get()); - auto arrow_field = arrow_struct_array->field(i); - return garrow_array_new_raw(&arrow_field); -} - -/** - * garrow_struct_array_get_fields - * @array: A #GArrowStructArray. - * - * Returns: (element-type GArrowArray) (transfer full): - * The fields in the struct. - */ -GList * -garrow_struct_array_get_fields(GArrowStructArray *array) -{ - const auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); - const auto arrow_struct_array = - static_cast(arrow_array.get()); - - GList *fields = NULL; - for (auto arrow_field : arrow_struct_array->fields()) { - GArrowArray *field = garrow_array_new_raw(&arrow_field); - fields = g_list_prepend(fields, field); - } - - return g_list_reverse(fields); -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/struct-array.h b/c_glib/arrow-glib/struct-array.h deleted file mode 100644 index f96e9d468f350..0000000000000 --- a/c_glib/arrow-glib/struct-array.h +++ /dev/null @@ -1,73 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_STRUCT_ARRAY \ - (garrow_struct_array_get_type()) -#define GARROW_STRUCT_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_STRUCT_ARRAY, \ - GArrowStructArray)) -#define GARROW_STRUCT_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_STRUCT_ARRAY, \ - GArrowStructArrayClass)) -#define GARROW_IS_STRUCT_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_STRUCT_ARRAY)) -#define GARROW_IS_STRUCT_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_STRUCT_ARRAY)) -#define GARROW_STRUCT_ARRAY_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_STRUCT_ARRAY, \ - GArrowStructArrayClass)) - -typedef struct _GArrowStructArray GArrowStructArray; -typedef struct _GArrowStructArrayClass GArrowStructArrayClass; - -/** - * GArrowStructArray: - * - * It wraps `arrow::StructArray`. 
- */ -struct _GArrowStructArray -{ - /*< private >*/ - GArrowArray parent_instance; -}; - -struct _GArrowStructArrayClass -{ - GArrowArrayClass parent_class; -}; - -GType garrow_struct_array_get_type(void) G_GNUC_CONST; - -GArrowArray *garrow_struct_array_get_field(GArrowStructArray *array, - gint i); -GList *garrow_struct_array_get_fields(GArrowStructArray *array); - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint16-array.cpp b/c_glib/arrow-glib/uint16-array.cpp deleted file mode 100644 index 6c416c6592935..0000000000000 --- a/c_glib/arrow-glib/uint16-array.cpp +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: uint16-array - * @short_description: 16-bit unsigned integer array class - * - * #GArrowUInt16Array is a class for 16-bit unsigned integer array. It - * can store zero or more 16-bit unsigned integer data. - * - * #GArrowUInt16Array is immutable. You need to use - * #GArrowUInt16ArrayBuilder to create a new array. - */ - -G_DEFINE_TYPE(GArrowUInt16Array, \ - garrow_uint16_array, \ - GARROW_TYPE_ARRAY) - -static void -garrow_uint16_array_init(GArrowUInt16Array *object) -{ -} - -static void -garrow_uint16_array_class_init(GArrowUInt16ArrayClass *klass) -{ -} - -/** - * garrow_uint16_array_get_value: - * @array: A #GArrowUInt16Array. - * @i: The index of the target value. - * - * Returns: The i-th value. - */ -guint16 -garrow_uint16_array_get_value(GArrowUInt16Array *array, - gint64 i) -{ - auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); - return static_cast(arrow_array.get())->Value(i); -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint16-array.h b/c_glib/arrow-glib/uint16-array.h deleted file mode 100644 index 44725510062c8..0000000000000 --- a/c_glib/arrow-glib/uint16-array.h +++ /dev/null @@ -1,71 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_UINT16_ARRAY \ - (garrow_uint16_array_get_type()) -#define GARROW_UINT16_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_UINT16_ARRAY, \ - GArrowUInt16Array)) -#define GARROW_UINT16_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_UINT16_ARRAY, \ - GArrowUInt16ArrayClass)) -#define GARROW_IS_UINT16_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_UINT16_ARRAY)) -#define GARROW_IS_UINT16_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_UINT16_ARRAY)) -#define GARROW_UINT16_ARRAY_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_UINT16_ARRAY, \ - GArrowUInt16ArrayClass)) - -typedef struct _GArrowUInt16Array GArrowUInt16Array; -typedef struct _GArrowUInt16ArrayClass GArrowUInt16ArrayClass; - -/** - * GArrowUInt16Array: - * - * It wraps `arrow::UInt16Array`. - */ -struct _GArrowUInt16Array -{ - /*< private >*/ - GArrowArray parent_instance; -}; - -struct _GArrowUInt16ArrayClass -{ - GArrowArrayClass parent_class; -}; - -GType garrow_uint16_array_get_type(void) G_GNUC_CONST; - -guint16 garrow_uint16_array_get_value(GArrowUInt16Array *array, - gint64 i); - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint32-array.cpp b/c_glib/arrow-glib/uint32-array.cpp deleted file mode 100644 index 18a9aedc0658f..0000000000000 --- a/c_glib/arrow-glib/uint32-array.cpp +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: uint32-array - * @short_description: 32-bit unsigned integer array class - * - * #GArrowUInt32Array is a class for 32-bit unsigned integer array. It - * can store zero or more 32-bit unsigned integer data. - * - * #GArrowUInt32Array is immutable. You need to use - * #GArrowUInt32ArrayBuilder to create a new array. - */ - -G_DEFINE_TYPE(GArrowUInt32Array, \ - garrow_uint32_array, \ - GARROW_TYPE_ARRAY) - -static void -garrow_uint32_array_init(GArrowUInt32Array *object) -{ -} - -static void -garrow_uint32_array_class_init(GArrowUInt32ArrayClass *klass) -{ -} - -/** - * garrow_uint32_array_get_value: - * @array: A #GArrowUInt32Array. - * @i: The index of the target value. - * - * Returns: The i-th value. 
- */ -guint32 -garrow_uint32_array_get_value(GArrowUInt32Array *array, - gint64 i) -{ - auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); - return static_cast(arrow_array.get())->Value(i); -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint32-array.h b/c_glib/arrow-glib/uint32-array.h deleted file mode 100644 index 57d4beaee6186..0000000000000 --- a/c_glib/arrow-glib/uint32-array.h +++ /dev/null @@ -1,71 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_UINT32_ARRAY \ - (garrow_uint32_array_get_type()) -#define GARROW_UINT32_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_UINT32_ARRAY, \ - GArrowUInt32Array)) -#define GARROW_UINT32_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_UINT32_ARRAY, \ - GArrowUInt32ArrayClass)) -#define GARROW_IS_UINT32_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_UINT32_ARRAY)) -#define GARROW_IS_UINT32_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_UINT32_ARRAY)) -#define GARROW_UINT32_ARRAY_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_UINT32_ARRAY, \ - GArrowUInt32ArrayClass)) - -typedef struct _GArrowUInt32Array GArrowUInt32Array; -typedef struct _GArrowUInt32ArrayClass GArrowUInt32ArrayClass; - -/** - * GArrowUInt32Array: - * - * It wraps `arrow::UInt32Array`. - */ -struct _GArrowUInt32Array -{ - /*< private >*/ - GArrowArray parent_instance; -}; - -struct _GArrowUInt32ArrayClass -{ - GArrowArrayClass parent_class; -}; - -GType garrow_uint32_array_get_type(void) G_GNUC_CONST; - -guint32 garrow_uint32_array_get_value(GArrowUInt32Array *array, - gint64 i); - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint64-array.cpp b/c_glib/arrow-glib/uint64-array.cpp deleted file mode 100644 index 1f900842674b8..0000000000000 --- a/c_glib/arrow-glib/uint64-array.cpp +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: uint64-array - * @short_description: 64-bit unsigned integer array class - * - * #GArrowUInt64Array is a class for 64-bit unsigned integer array. It - * can store zero or more 64-bit unsigned integer data. - * - * #GArrowUInt64Array is immutable. You need to use - * #GArrowUInt64ArrayBuilder to create a new array. - */ - -G_DEFINE_TYPE(GArrowUInt64Array, \ - garrow_uint64_array, \ - GARROW_TYPE_ARRAY) - -static void -garrow_uint64_array_init(GArrowUInt64Array *object) -{ -} - -static void -garrow_uint64_array_class_init(GArrowUInt64ArrayClass *klass) -{ -} - -/** - * garrow_uint64_array_get_value: - * @array: A #GArrowUInt64Array. - * @i: The index of the target value. - * - * Returns: The i-th value. - */ -guint64 -garrow_uint64_array_get_value(GArrowUInt64Array *array, - gint64 i) -{ - auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); - return static_cast(arrow_array.get())->Value(i); -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint64-array.h b/c_glib/arrow-glib/uint64-array.h deleted file mode 100644 index b5abde52bd263..0000000000000 --- a/c_glib/arrow-glib/uint64-array.h +++ /dev/null @@ -1,71 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_UINT64_ARRAY \ - (garrow_uint64_array_get_type()) -#define GARROW_UINT64_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_UINT64_ARRAY, \ - GArrowUInt64Array)) -#define GARROW_UINT64_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_UINT64_ARRAY, \ - GArrowUInt64ArrayClass)) -#define GARROW_IS_UINT64_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_UINT64_ARRAY)) -#define GARROW_IS_UINT64_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_UINT64_ARRAY)) -#define GARROW_UINT64_ARRAY_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_UINT64_ARRAY, \ - GArrowUInt64ArrayClass)) - -typedef struct _GArrowUInt64Array GArrowUInt64Array; -typedef struct _GArrowUInt64ArrayClass GArrowUInt64ArrayClass; - -/** - * GArrowUInt64Array: - * - * It wraps `arrow::UInt64Array`. 
- */ -struct _GArrowUInt64Array -{ - /*< private >*/ - GArrowArray parent_instance; -}; - -struct _GArrowUInt64ArrayClass -{ - GArrowArrayClass parent_class; -}; - -GType garrow_uint64_array_get_type(void) G_GNUC_CONST; - -guint64 garrow_uint64_array_get_value(GArrowUInt64Array *array, - gint64 i); - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint8-array.cpp b/c_glib/arrow-glib/uint8-array.cpp deleted file mode 100644 index b5a2595b1ef09..0000000000000 --- a/c_glib/arrow-glib/uint8-array.cpp +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: uint8-array - * @short_description: 8-bit unsigned integer array class - * - * #GArrowUInt8Array is a class for 8-bit unsigned integer array. It - * can store zero or more 8-bit unsigned integer data. - * - * #GArrowUInt8Array is immutable. You need to use - * #GArrowUInt8ArrayBuilder to create a new array. - */ - -G_DEFINE_TYPE(GArrowUInt8Array, \ - garrow_uint8_array, \ - GARROW_TYPE_ARRAY) - -static void -garrow_uint8_array_init(GArrowUInt8Array *object) -{ -} - -static void -garrow_uint8_array_class_init(GArrowUInt8ArrayClass *klass) -{ -} - -/** - * garrow_uint8_array_get_value: - * @array: A #GArrowUInt8Array. - * @i: The index of the target value. - * - * Returns: The i-th value. - */ -guint8 -garrow_uint8_array_get_value(GArrowUInt8Array *array, - gint64 i) -{ - auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); - return static_cast(arrow_array.get())->Value(i); -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint8-array.h b/c_glib/arrow-glib/uint8-array.h deleted file mode 100644 index a572bc549670e..0000000000000 --- a/c_glib/arrow-glib/uint8-array.h +++ /dev/null @@ -1,71 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_UINT8_ARRAY \ - (garrow_uint8_array_get_type()) -#define GARROW_UINT8_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_UINT8_ARRAY, \ - GArrowUInt8Array)) -#define GARROW_UINT8_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_UINT8_ARRAY, \ - GArrowUInt8ArrayClass)) -#define GARROW_IS_UINT8_ARRAY(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_UINT8_ARRAY)) -#define GARROW_IS_UINT8_ARRAY_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_UINT8_ARRAY)) -#define GARROW_UINT8_ARRAY_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_UINT8_ARRAY, \ - GArrowUInt8ArrayClass)) - -typedef struct _GArrowUInt8Array GArrowUInt8Array; -typedef struct _GArrowUInt8ArrayClass GArrowUInt8ArrayClass; - -/** - * GArrowUInt8Array: - * - * It wraps `arrow::UInt8Array`. - */ -struct _GArrowUInt8Array -{ - /*< private >*/ - GArrowArray parent_instance; -}; - -struct _GArrowUInt8ArrayClass -{ - GArrowArrayClass parent_class; -}; - -GType garrow_uint8_array_get_type(void) G_GNUC_CONST; - -guint8 garrow_uint8_array_get_value(GArrowUInt8Array *array, - gint64 i); - -G_END_DECLS diff --git a/c_glib/doc/reference/arrow-glib-docs.sgml b/c_glib/doc/reference/arrow-glib-docs.sgml index 3c1d8d161179c..11e6a4de244d4 100644 --- a/c_glib/doc/reference/arrow-glib-docs.sgml +++ b/c_glib/doc/reference/arrow-glib-docs.sgml @@ -36,22 +36,6 @@ Array - - - - - - - - - - - - - - - - Array builder From 6c352e2057d5f9a442c1ebf0d35c716f475fd343 Mon Sep 17 00:00:00 2001 From: Bryan Cutler Date: Thu, 20 Apr 2017 21:42:20 -0400 Subject: [PATCH 0545/1644] ARROW-822: [Python] StreamWriter Wrapper for Socket and File-like Objects without tell() Added a wrapper for StreamWriter to implement the required tell() method so that Python sockets and file-like objects can be used as sinks. The tell() method will report the position by starting at 0 when the StreamWriter is created and incrementing by the number of bytes after each write. Added unit tests that use a local socket as the source/sink for streaming.
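[Editorial aside] The bookkeeping described above is easy to picture outside of Arrow. Below is a minimal, stand-alone Python sketch of the same idea: tell() starts at zero when the writer is created and advances by the number of bytes written. The CountingSink name is illustrative only and is not part of pyarrow.

import io

class CountingSink:
    """Tracks the write position of a write-only, non-seekable object
    (for example, the file object returned by socket.makefile(mode='wb'))."""

    def __init__(self, raw):
        self._raw = raw
        self._position = 0  # position starts at 0 when the sink is created

    def write(self, data):
        self._raw.write(data)
        self._position += len(data)  # advance by the number of bytes written

    def tell(self):
        # Report the tracked position; the wrapped object itself never
        # needs to support tell() or seek().
        return self._position

sink = CountingSink(io.BytesIO())
sink.write(b"arrow")
assert sink.tell() == 5

With this bookkeeping done inside PyOutputStream itself (see the io.cc hunk below), plain sockets and other non-seekable file-like objects can be handed to StreamWriter directly.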
Author: Bryan Cutler Closes #569 from BryanCutler/pyarrow-stream-writer-socket-ARROW-822 and squashes the following commits: 6cdec4f [Bryan Cutler] Removed StreamWriter wrapper and put position handling in PyStreamWriter instead 2bd669f [Bryan Cutler] Added StreamSinkWrapper to ensure stream sink has tell() method, added unittest for StreamWriter and StreamReader over a local socket --- cpp/src/arrow/python/io.cc | 7 +-- cpp/src/arrow/python/io.h | 1 + python/pyarrow/tests/test_ipc.py | 79 ++++++++++++++++++++++++++++++++ 3 files changed, 84 insertions(+), 3 deletions(-) diff --git a/cpp/src/arrow/python/io.cc b/cpp/src/arrow/python/io.cc index ba82a45411c4c..327e8fe9ff781 100644 --- a/cpp/src/arrow/python/io.cc +++ b/cpp/src/arrow/python/io.cc @@ -189,7 +189,7 @@ bool PyReadableFile::supports_zero_copy() const { // ---------------------------------------------------------------------- // Output stream -PyOutputStream::PyOutputStream(PyObject* file) { +PyOutputStream::PyOutputStream(PyObject* file) : position_(0) { file_.reset(new PythonFile(file)); } @@ -201,12 +201,13 @@ Status PyOutputStream::Close() { } Status PyOutputStream::Tell(int64_t* position) { - PyAcquireGIL lock; - return file_->Tell(position); + *position = position_; + return Status::OK(); } Status PyOutputStream::Write(const uint8_t* data, int64_t nbytes) { PyAcquireGIL lock; + position_ += nbytes; return file_->Write(data, nbytes); } diff --git a/cpp/src/arrow/python/io.h b/cpp/src/arrow/python/io.h index bf14cd6f45dbd..ebd4c5a61e938 100644 --- a/cpp/src/arrow/python/io.h +++ b/cpp/src/arrow/python/io.h @@ -82,6 +82,7 @@ class ARROW_EXPORT PyOutputStream : public io::OutputStream { private: std::unique_ptr<PythonFile> file_; + int64_t position_; }; // A zero-copy reader backed by a PyBuffer object diff --git a/python/pyarrow/tests/test_ipc.py b/python/pyarrow/tests/test_ipc.py index 31d418d5150ac..81213ede3151e 100644 --- a/python/pyarrow/tests/test_ipc.py +++ b/python/pyarrow/tests/test_ipc.py @@ -17,6 +17,8 @@ import io import pytest +import socket +import threading import numpy as np @@ -126,6 +128,83 @@ def test_read_all(self): assert result.equals(expected) +class TestSocket(MessagingTest, unittest.TestCase): + + class StreamReaderServer(threading.Thread): + + def init(self, do_read_all): + self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) + self._sock.bind(('127.0.0.1', 0)) + self._sock.listen(1) + host, port = self._sock.getsockname() + self._do_read_all = do_read_all + self._schema = None + self._batches = [] + self._table = None + return port + + def run(self): + connection, client_address = self._sock.accept() + try: + source = connection.makefile(mode='rb') + reader = pa.StreamReader(source) + self._schema = reader.schema + if self._do_read_all: + self._table = reader.read_all() + else: + for i, batch in enumerate(reader): + self._batches.append(batch) + finally: + connection.close() + + def get_result(self): + return (self._schema, self._table if self._do_read_all else self._batches) + + def setUp(self): + # NOTE: must start and stop server in test + pass + + def start_server(self, do_read_all): + self._server = TestSocket.StreamReaderServer() + port = self._server.init(do_read_all) + self._server.start() + self._sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) + self._sock.connect(('127.0.0.1', port)) + self.sink = self._get_sink() + + def stop_and_get_result(self): + import struct + self.sink.write(struct.pack('i', 0)) + self.sink.flush() + self._sock.close() + self._server.join() + return
self._server.get_result() + + def _get_sink(self): + return self._sock.makefile(mode='wb') + + def _get_writer(self, sink, schema): + return pa.StreamWriter(sink, schema) + + def test_simple_roundtrip(self): + self.start_server(do_read_all=False) + writer_batches = self.write_batches() + reader_schema, reader_batches = self.stop_and_get_result() + + assert reader_schema.equals(writer_batches[0].schema) + assert len(reader_batches) == len(writer_batches) + for i, batch in enumerate(writer_batches): + assert reader_batches[i].equals(batch) + + def test_read_all(self): + self.start_server(do_read_all=True) + writer_batches = self.write_batches() + _, result = self.stop_and_get_result() + + expected = pa.Table.from_batches(writer_batches) + assert result.equals(expected) + + class TestInMemoryFile(TestFile): def _get_sink(self): From 6867e93cc78bffdb42b38ad9581999b567de28d6 Mon Sep 17 00:00:00 2001 From: Brian Hulette Date: Fri, 21 Apr 2017 17:42:54 -0400 Subject: [PATCH 0546/1644] ARROW-869 [JS] Rename directory to js/ Author: Brian Hulette Closes #578 from TheNeuralBit/js-rename and squashes the following commits: 62244d0 [Brian Hulette] moved javascript/ to js/ --- {javascript => js}/.gitignore | 0 {javascript => js}/README.md | 0 {javascript => js}/bin/arrow2csv.js | 0 {javascript => js}/bin/arrow_schema.js | 0 {javascript => js}/examples/read_file.html | 0 {javascript => js}/lib/Arrow_generated.d.ts | 0 {javascript => js}/lib/arrow.ts | 0 {javascript => js}/lib/bitarray.ts | 0 {javascript => js}/lib/types.ts | 0 {javascript => js}/package.json | 0 {javascript => js}/postinstall.sh | 0 {javascript => js}/tsconfig.json | 0 {javascript => js}/webpack.config.js | 0 13 files changed, 0 insertions(+), 0 deletions(-) rename {javascript => js}/.gitignore (100%) rename {javascript => js}/README.md (100%) rename {javascript => js}/bin/arrow2csv.js (100%) rename {javascript => js}/bin/arrow_schema.js (100%) rename {javascript => js}/examples/read_file.html (100%) rename {javascript => js}/lib/Arrow_generated.d.ts (100%) rename {javascript => js}/lib/arrow.ts (100%) rename {javascript => js}/lib/bitarray.ts (100%) rename {javascript => js}/lib/types.ts (100%) rename {javascript => js}/package.json (100%) rename {javascript => js}/postinstall.sh (100%) rename {javascript => js}/tsconfig.json (100%) rename {javascript => js}/webpack.config.js (100%) diff --git a/javascript/.gitignore b/js/.gitignore similarity index 100% rename from javascript/.gitignore rename to js/.gitignore diff --git a/javascript/README.md b/js/README.md similarity index 100% rename from javascript/README.md rename to js/README.md diff --git a/javascript/bin/arrow2csv.js b/js/bin/arrow2csv.js similarity index 100% rename from javascript/bin/arrow2csv.js rename to js/bin/arrow2csv.js diff --git a/javascript/bin/arrow_schema.js b/js/bin/arrow_schema.js similarity index 100% rename from javascript/bin/arrow_schema.js rename to js/bin/arrow_schema.js diff --git a/javascript/examples/read_file.html b/js/examples/read_file.html similarity index 100% rename from javascript/examples/read_file.html rename to js/examples/read_file.html diff --git a/javascript/lib/Arrow_generated.d.ts b/js/lib/Arrow_generated.d.ts similarity index 100% rename from javascript/lib/Arrow_generated.d.ts rename to js/lib/Arrow_generated.d.ts diff --git a/javascript/lib/arrow.ts b/js/lib/arrow.ts similarity index 100% rename from javascript/lib/arrow.ts rename to js/lib/arrow.ts diff --git a/javascript/lib/bitarray.ts b/js/lib/bitarray.ts similarity index 100% 
rename from javascript/lib/bitarray.ts rename to js/lib/bitarray.ts diff --git a/javascript/lib/types.ts b/js/lib/types.ts similarity index 100% rename from javascript/lib/types.ts rename to js/lib/types.ts diff --git a/javascript/package.json b/js/package.json similarity index 100% rename from javascript/package.json rename to js/package.json diff --git a/javascript/postinstall.sh b/js/postinstall.sh similarity index 100% rename from javascript/postinstall.sh rename to js/postinstall.sh diff --git a/javascript/tsconfig.json b/js/tsconfig.json similarity index 100% rename from javascript/tsconfig.json rename to js/tsconfig.json diff --git a/javascript/webpack.config.js b/js/webpack.config.js similarity index 100% rename from javascript/webpack.config.js rename to js/webpack.config.js From 16ea3703022304843c1eaef4a75636dbdc49e8e5 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 21 Apr 2017 17:44:26 -0400 Subject: [PATCH 0547/1644] ARROW-616: [C++] Do not include debug symbols in release builds by default This reduces binary size on Linux by about 80-90%. If the user wants debug symbols, they can re-enable them with `-DARROW_CXXFLAGS="-g"`. Author: Wes McKinney Closes #574 from wesm/ARROW-616 and squashes the following commits: 71fc105 [Wes McKinney] Do not include debug symbols in release builds by default --- cpp/cmake_modules/SetupCxxFlags.cmake | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/cpp/cmake_modules/SetupCxxFlags.cmake b/cpp/cmake_modules/SetupCxxFlags.cmake index 7e229ff90a3e7..e2106559ba028 100644 --- a/cpp/cmake_modules/SetupCxxFlags.cmake +++ b/cpp/cmake_modules/SetupCxxFlags.cmake @@ -71,11 +71,12 @@ endif() # Same as DEBUG, except with some optimizations on. # For CMAKE_BUILD_TYPE=Release # -O3: Enable all compiler optimizations -# -g: Enable symbols for profiler tools (TODO: remove for shipping) +# Debug symbols are stripped for reduced binary size.
Add +# -DARROW_CXXFLAGS="-g" to add them if (NOT MSVC) set(CXX_FLAGS_DEBUG "-ggdb -O0") set(CXX_FLAGS_FASTDEBUG "-ggdb -O1") - set(CXX_FLAGS_RELEASE "-O3 -g -DNDEBUG") + set(CXX_FLAGS_RELEASE "-O3 -DNDEBUG") endif() set(CXX_FLAGS_PROFILE_GEN "${CXX_FLAGS_RELEASE} -fprofile-generate") From b4a75b1e17ef0356892ec9d5d210a6e156517440 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Fri, 21 Apr 2017 17:50:28 -0400 Subject: [PATCH 0548/1644] ARROW-871: [GLib] Unify DataType files Author: Kouhei Sutou Closes #577 from kou/glib-data-type-unify-file and squashes the following commits: 188a12c [Kouhei Sutou] [GLib] Unify DataType files --- c_glib/arrow-glib/Makefile.am | 34 +- c_glib/arrow-glib/arrow-glib.h | 16 - c_glib/arrow-glib/binary-data-type.cpp | 67 -- c_glib/arrow-glib/binary-data-type.h | 69 --- c_glib/arrow-glib/boolean-data-type.cpp | 67 -- c_glib/arrow-glib/boolean-data-type.h | 69 --- c_glib/arrow-glib/data-type.cpp | 599 +++++++++++++++++- c_glib/arrow-glib/data-type.h | 711 ++++++++++++++++++++++ c_glib/arrow-glib/double-data-type.cpp | 68 --- c_glib/arrow-glib/double-data-type.h | 70 --- c_glib/arrow-glib/float-data-type.cpp | 68 --- c_glib/arrow-glib/float-data-type.h | 69 --- c_glib/arrow-glib/int16-data-type.cpp | 67 -- c_glib/arrow-glib/int16-data-type.h | 69 --- c_glib/arrow-glib/int32-data-type.cpp | 67 -- c_glib/arrow-glib/int32-data-type.h | 69 --- c_glib/arrow-glib/int64-data-type.cpp | 67 -- c_glib/arrow-glib/int64-data-type.h | 69 --- c_glib/arrow-glib/int8-data-type.cpp | 67 -- c_glib/arrow-glib/int8-data-type.h | 69 --- c_glib/arrow-glib/list-data-type.cpp | 91 --- c_glib/arrow-glib/list-data-type.h | 73 --- c_glib/arrow-glib/null-data-type.cpp | 67 -- c_glib/arrow-glib/null-data-type.h | 69 --- c_glib/arrow-glib/string-data-type.cpp | 68 --- c_glib/arrow-glib/string-data-type.h | 69 --- c_glib/arrow-glib/struct-array-builder.h | 2 +- c_glib/arrow-glib/struct-data-type.cpp | 75 --- c_glib/arrow-glib/struct-data-type.h | 71 --- c_glib/arrow-glib/uint16-data-type.cpp | 67 -- c_glib/arrow-glib/uint16-data-type.h | 69 --- c_glib/arrow-glib/uint32-data-type.cpp | 67 -- c_glib/arrow-glib/uint32-data-type.h | 69 --- c_glib/arrow-glib/uint64-data-type.cpp | 67 -- c_glib/arrow-glib/uint64-data-type.h | 69 --- c_glib/arrow-glib/uint8-data-type.cpp | 67 -- c_glib/arrow-glib/uint8-data-type.h | 69 --- c_glib/doc/reference/arrow-glib-docs.sgml | 16 - 38 files changed, 1294 insertions(+), 2302 deletions(-) delete mode 100644 c_glib/arrow-glib/binary-data-type.cpp delete mode 100644 c_glib/arrow-glib/binary-data-type.h delete mode 100644 c_glib/arrow-glib/boolean-data-type.cpp delete mode 100644 c_glib/arrow-glib/boolean-data-type.h delete mode 100644 c_glib/arrow-glib/double-data-type.cpp delete mode 100644 c_glib/arrow-glib/double-data-type.h delete mode 100644 c_glib/arrow-glib/float-data-type.cpp delete mode 100644 c_glib/arrow-glib/float-data-type.h delete mode 100644 c_glib/arrow-glib/int16-data-type.cpp delete mode 100644 c_glib/arrow-glib/int16-data-type.h delete mode 100644 c_glib/arrow-glib/int32-data-type.cpp delete mode 100644 c_glib/arrow-glib/int32-data-type.h delete mode 100644 c_glib/arrow-glib/int64-data-type.cpp delete mode 100644 c_glib/arrow-glib/int64-data-type.h delete mode 100644 c_glib/arrow-glib/int8-data-type.cpp delete mode 100644 c_glib/arrow-glib/int8-data-type.h delete mode 100644 c_glib/arrow-glib/list-data-type.cpp delete mode 100644 c_glib/arrow-glib/list-data-type.h delete mode 100644 c_glib/arrow-glib/null-data-type.cpp delete mode 100644 
c_glib/arrow-glib/null-data-type.h delete mode 100644 c_glib/arrow-glib/string-data-type.cpp delete mode 100644 c_glib/arrow-glib/string-data-type.h delete mode 100644 c_glib/arrow-glib/struct-data-type.cpp delete mode 100644 c_glib/arrow-glib/struct-data-type.h delete mode 100644 c_glib/arrow-glib/uint16-data-type.cpp delete mode 100644 c_glib/arrow-glib/uint16-data-type.h delete mode 100644 c_glib/arrow-glib/uint32-data-type.cpp delete mode 100644 c_glib/arrow-glib/uint32-data-type.h delete mode 100644 c_glib/arrow-glib/uint64-data-type.cpp delete mode 100644 c_glib/arrow-glib/uint64-data-type.h delete mode 100644 c_glib/arrow-glib/uint8-data-type.cpp delete mode 100644 c_glib/arrow-glib/uint8-data-type.h diff --git a/c_glib/arrow-glib/Makefile.am b/c_glib/arrow-glib/Makefile.am index 570a033f4512c..d0c8c799b71cf 100644 --- a/c_glib/arrow-glib/Makefile.am +++ b/c_glib/arrow-glib/Makefile.am @@ -45,47 +45,31 @@ libarrow_glib_la_headers = \ array-builder.h \ arrow-glib.h \ binary-array-builder.h \ - binary-data-type.h \ boolean-array-builder.h \ - boolean-data-type.h \ buffer.h \ chunked-array.h \ column.h \ data-type.h \ double-array-builder.h \ - double-data-type.h \ error.h \ field.h \ float-array-builder.h \ - float-data-type.h \ int8-array-builder.h \ - int8-data-type.h \ int16-array-builder.h \ - int16-data-type.h \ int32-array-builder.h \ - int32-data-type.h \ int64-array-builder.h \ - int64-data-type.h \ list-array-builder.h \ - list-data-type.h \ - null-data-type.h \ record-batch.h \ schema.h \ string-array-builder.h \ - string-data-type.h \ struct-array-builder.h \ - struct-data-type.h \ table.h \ tensor.h \ type.h \ uint8-array-builder.h \ - uint8-data-type.h \ uint16-array-builder.h \ - uint16-data-type.h \ uint32-array-builder.h \ - uint32-data-type.h \ - uint64-array-builder.h \ - uint64-data-type.h + uint64-array-builder.h libarrow_glib_la_headers += \ file.h \ @@ -117,47 +101,31 @@ libarrow_glib_la_sources = \ array.cpp \ array-builder.cpp \ binary-array-builder.cpp \ - binary-data-type.cpp \ boolean-array-builder.cpp \ - boolean-data-type.cpp \ buffer.cpp \ chunked-array.cpp \ column.cpp \ data-type.cpp \ double-array-builder.cpp \ - double-data-type.cpp \ error.cpp \ field.cpp \ float-array-builder.cpp \ - float-data-type.cpp \ int8-array-builder.cpp \ - int8-data-type.cpp \ int16-array-builder.cpp \ - int16-data-type.cpp \ int32-array-builder.cpp \ - int32-data-type.cpp \ int64-array-builder.cpp \ - int64-data-type.cpp \ list-array-builder.cpp \ - list-data-type.cpp \ - null-data-type.cpp \ record-batch.cpp \ schema.cpp \ string-array-builder.cpp \ - string-data-type.cpp \ struct-array-builder.cpp \ - struct-data-type.cpp \ table.cpp \ tensor.cpp \ type.cpp \ uint8-array-builder.cpp \ - uint8-data-type.cpp \ uint16-array-builder.cpp \ - uint16-data-type.cpp \ uint32-array-builder.cpp \ - uint32-data-type.cpp \ uint64-array-builder.cpp \ - uint64-data-type.cpp \ $(libarrow_glib_la_headers) \ $(libarrow_glib_la_generated_sources) diff --git a/c_glib/arrow-glib/arrow-glib.h b/c_glib/arrow-glib/arrow-glib.h index ee408cde3615e..46e98d2b8ed4c 100644 --- a/c_glib/arrow-glib/arrow-glib.h +++ b/c_glib/arrow-glib/arrow-glib.h @@ -22,47 +22,31 @@ #include #include #include -#include #include -#include #include #include #include #include -#include #include #include #include #include -#include #include -#include #include -#include #include -#include #include -#include #include -#include -#include #include #include #include -#include #include -#include #include #include #include 
#include -#include #include -#include #include -#include #include -#include #include #include diff --git a/c_glib/arrow-glib/binary-data-type.cpp b/c_glib/arrow-glib/binary-data-type.cpp deleted file mode 100644 index e5187f7d94efe..0000000000000 --- a/c_glib/arrow-glib/binary-data-type.cpp +++ /dev/null @@ -1,67 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: binary-data-type - * @short_description: Binary data type - * - * #GArrowBinaryDataType is a class for binary data type. - */ - -G_DEFINE_TYPE(GArrowBinaryDataType, \ - garrow_binary_data_type, \ - GARROW_TYPE_DATA_TYPE) - -static void -garrow_binary_data_type_init(GArrowBinaryDataType *object) -{ -} - -static void -garrow_binary_data_type_class_init(GArrowBinaryDataTypeClass *klass) -{ -} - -/** - * garrow_binary_data_type_new: - * - * Returns: The newly created binary data type. - */ -GArrowBinaryDataType * -garrow_binary_data_type_new(void) -{ - auto arrow_data_type = arrow::binary(); - - GArrowBinaryDataType *data_type = - GARROW_BINARY_DATA_TYPE(g_object_new(GARROW_TYPE_BINARY_DATA_TYPE, - "data-type", &arrow_data_type, - NULL)); - return data_type; -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/binary-data-type.h b/c_glib/arrow-glib/binary-data-type.h deleted file mode 100644 index 9654fe216376e..0000000000000 --- a/c_glib/arrow-glib/binary-data-type.h +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_BINARY_DATA_TYPE \ - (garrow_binary_data_type_get_type()) -#define GARROW_BINARY_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_BINARY_DATA_TYPE, \ - GArrowBinaryDataType)) -#define GARROW_BINARY_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_BINARY_DATA_TYPE, \ - GArrowBinaryDataTypeClass)) -#define GARROW_IS_BINARY_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_BINARY_DATA_TYPE)) -#define GARROW_IS_BINARY_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_BINARY_DATA_TYPE)) -#define GARROW_BINARY_DATA_TYPE_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_BINARY_DATA_TYPE, \ - GArrowBinaryDataTypeClass)) - -typedef struct _GArrowBinaryDataType GArrowBinaryDataType; -typedef struct _GArrowBinaryDataTypeClass GArrowBinaryDataTypeClass; - -/** - * GArrowBinaryDataType: - * - * It wraps `arrow::BinaryType`. - */ -struct _GArrowBinaryDataType -{ - /*< private >*/ - GArrowDataType parent_instance; -}; - -struct _GArrowBinaryDataTypeClass -{ - GArrowDataTypeClass parent_class; -}; - -GType garrow_binary_data_type_get_type (void) G_GNUC_CONST; -GArrowBinaryDataType *garrow_binary_data_type_new (void); - -G_END_DECLS diff --git a/c_glib/arrow-glib/boolean-data-type.cpp b/c_glib/arrow-glib/boolean-data-type.cpp deleted file mode 100644 index 99c73d9ff8873..0000000000000 --- a/c_glib/arrow-glib/boolean-data-type.cpp +++ /dev/null @@ -1,67 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: boolean-data-type - * @short_description: Boolean data type - * - * #GArrowBooleanDataType is a class for boolean data type. - */ - -G_DEFINE_TYPE(GArrowBooleanDataType, \ - garrow_boolean_data_type, \ - GARROW_TYPE_DATA_TYPE) - -static void -garrow_boolean_data_type_init(GArrowBooleanDataType *object) -{ -} - -static void -garrow_boolean_data_type_class_init(GArrowBooleanDataTypeClass *klass) -{ -} - -/** - * garrow_boolean_data_type_new: - * - * Returns: The newly created boolean data type. 
- */ -GArrowBooleanDataType * -garrow_boolean_data_type_new(void) -{ - auto arrow_data_type = arrow::boolean(); - - GArrowBooleanDataType *data_type = - GARROW_BOOLEAN_DATA_TYPE(g_object_new(GARROW_TYPE_BOOLEAN_DATA_TYPE, - "data-type", &arrow_data_type, - NULL)); - return data_type; -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/boolean-data-type.h b/c_glib/arrow-glib/boolean-data-type.h deleted file mode 100644 index ad30c99960a8e..0000000000000 --- a/c_glib/arrow-glib/boolean-data-type.h +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_BOOLEAN_DATA_TYPE \ - (garrow_boolean_data_type_get_type()) -#define GARROW_BOOLEAN_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_BOOLEAN_DATA_TYPE, \ - GArrowBooleanDataType)) -#define GARROW_BOOLEAN_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_BOOLEAN_DATA_TYPE, \ - GArrowBooleanDataTypeClass)) -#define GARROW_IS_BOOLEAN_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_BOOLEAN_DATA_TYPE)) -#define GARROW_IS_BOOLEAN_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_BOOLEAN_DATA_TYPE)) -#define GARROW_BOOLEAN_DATA_TYPE_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_BOOLEAN_DATA_TYPE, \ - GArrowBooleanDataTypeClass)) - -typedef struct _GArrowBooleanDataType GArrowBooleanDataType; -typedef struct _GArrowBooleanDataTypeClass GArrowBooleanDataTypeClass; - -/** - * GArrowBooleanDataType: - * - * It wraps `arrow::BooleanType`. - */ -struct _GArrowBooleanDataType -{ - /*< private >*/ - GArrowDataType parent_instance; -}; - -struct _GArrowBooleanDataTypeClass -{ - GArrowDataTypeClass parent_class; -}; - -GType garrow_boolean_data_type_get_type (void) G_GNUC_CONST; -GArrowBooleanDataType *garrow_boolean_data_type_new (void); - -G_END_DECLS diff --git a/c_glib/arrow-glib/data-type.cpp b/c_glib/arrow-glib/data-type.cpp index 12932a16269e8..2fd261dc91947 100644 --- a/c_glib/arrow-glib/data-type.cpp +++ b/c_glib/arrow-glib/data-type.cpp @@ -21,34 +21,55 @@ # include #endif -#include -#include -#include #include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include +#include #include -#include -#include -#include -#include G_BEGIN_DECLS /** * SECTION: data-type - * @short_description: Base class for all data type classes + * @section_id: data-type-classes + * @title: Data type classes + * @include: arrow-glib/arrow-glib.h * * #GArrowDataType is a base class for all data type classes such as * #GArrowBooleanDataType. + * + * #GArrowNullDataType is a class for null data type. + * + * #GArrowBooleanDataType is a class for boolean data type. 
+ * + * #GArrowInt8DataType is a class for 8-bit integer data type. + * + * #GArrowUInt8DataType is a class for 8-bit unsigned integer data type. + * + * #GArrowInt16DataType is a class for 16-bit integer data type. + * + * #GArrowUInt16DataType is a class for 16-bit unsigned integer data type. + * + * #GArrowInt32DataType is a class for 32-bit integer data type. + * + * #GArrowUInt32DataType is a class for 32-bit unsigned integer data type. + * + * #GArrowInt64DataType is a class for 64-bit integer data type. + * + * #GArrowUInt64DataType is a class for 64-bit unsigned integer data type. + * + * #GArrowFloatDataType is a class for 32-bit floating point data + * type. + * + * #GArrowDoubleDataType is a class for 64-bit floating point data + * type. + * + * #GArrowBinaryDataType is a class for binary data type. + * + * #GArrowStringDataType is a class for UTF-8 encoded string data + * type. + * + * #GArrowListDataType is a class for list data type. + * + * #GArrowStructDataType is a class for struct data type. */ typedef struct GArrowDataTypePrivate_ { @@ -183,6 +204,548 @@ garrow_data_type_type(GArrowDataType *data_type) return garrow_type_from_raw(arrow_data_type->id()); } + +G_DEFINE_TYPE(GArrowNullDataType, \ + garrow_null_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_null_data_type_init(GArrowNullDataType *object) +{ +} + +static void +garrow_null_data_type_class_init(GArrowNullDataTypeClass *klass) +{ +} + +/** + * garrow_null_data_type_new: + * + * Returns: The newly created null data type. + */ +GArrowNullDataType * +garrow_null_data_type_new(void) +{ + auto arrow_data_type = arrow::null(); + + GArrowNullDataType *data_type = + GARROW_NULL_DATA_TYPE(g_object_new(GARROW_TYPE_NULL_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + + +G_DEFINE_TYPE(GArrowBooleanDataType, \ + garrow_boolean_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_boolean_data_type_init(GArrowBooleanDataType *object) +{ +} + +static void +garrow_boolean_data_type_class_init(GArrowBooleanDataTypeClass *klass) +{ +} + +/** + * garrow_boolean_data_type_new: + * + * Returns: The newly created boolean data type. + */ +GArrowBooleanDataType * +garrow_boolean_data_type_new(void) +{ + auto arrow_data_type = arrow::boolean(); + + GArrowBooleanDataType *data_type = + GARROW_BOOLEAN_DATA_TYPE(g_object_new(GARROW_TYPE_BOOLEAN_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + + +G_DEFINE_TYPE(GArrowInt8DataType, \ + garrow_int8_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_int8_data_type_init(GArrowInt8DataType *object) +{ +} + +static void +garrow_int8_data_type_class_init(GArrowInt8DataTypeClass *klass) +{ +} + +/** + * garrow_int8_data_type_new: + * + * Returns: The newly created 8-bit integer data type. + */ +GArrowInt8DataType * +garrow_int8_data_type_new(void) +{ + auto arrow_data_type = arrow::int8(); + + GArrowInt8DataType *data_type = + GARROW_INT8_DATA_TYPE(g_object_new(GARROW_TYPE_INT8_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + + +G_DEFINE_TYPE(GArrowUInt8DataType, \ + garrow_uint8_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_uint8_data_type_init(GArrowUInt8DataType *object) +{ +} + +static void +garrow_uint8_data_type_class_init(GArrowUInt8DataTypeClass *klass) +{ +} + +/** + * garrow_uint8_data_type_new: + * + * Returns: The newly created 8-bit unsigned integer data type. 
+ */ +GArrowUInt8DataType * +garrow_uint8_data_type_new(void) +{ + auto arrow_data_type = arrow::uint8(); + + GArrowUInt8DataType *data_type = + GARROW_UINT8_DATA_TYPE(g_object_new(GARROW_TYPE_UINT8_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + + +G_DEFINE_TYPE(GArrowInt16DataType, \ + garrow_int16_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_int16_data_type_init(GArrowInt16DataType *object) +{ +} + +static void +garrow_int16_data_type_class_init(GArrowInt16DataTypeClass *klass) +{ +} + +/** + * garrow_int16_data_type_new: + * + * Returns: The newly created 16-bit integer data type. + */ +GArrowInt16DataType * +garrow_int16_data_type_new(void) +{ + auto arrow_data_type = arrow::int16(); + + GArrowInt16DataType *data_type = + GARROW_INT16_DATA_TYPE(g_object_new(GARROW_TYPE_INT16_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + + +G_DEFINE_TYPE(GArrowUInt16DataType, \ + garrow_uint16_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_uint16_data_type_init(GArrowUInt16DataType *object) +{ +} + +static void +garrow_uint16_data_type_class_init(GArrowUInt16DataTypeClass *klass) +{ +} + +/** + * garrow_uint16_data_type_new: + * + * Returns: The newly created 16-bit unsigned integer data type. + */ +GArrowUInt16DataType * +garrow_uint16_data_type_new(void) +{ + auto arrow_data_type = arrow::uint16(); + + GArrowUInt16DataType *data_type = + GARROW_UINT16_DATA_TYPE(g_object_new(GARROW_TYPE_UINT16_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + + +G_DEFINE_TYPE(GArrowInt32DataType, \ + garrow_int32_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_int32_data_type_init(GArrowInt32DataType *object) +{ +} + +static void +garrow_int32_data_type_class_init(GArrowInt32DataTypeClass *klass) +{ +} + +/** + * garrow_int32_data_type_new: + * + * Returns: The newly created 32-bit integer data type. + */ +GArrowInt32DataType * +garrow_int32_data_type_new(void) +{ + auto arrow_data_type = arrow::int32(); + + GArrowInt32DataType *data_type = + GARROW_INT32_DATA_TYPE(g_object_new(GARROW_TYPE_INT32_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + + +G_DEFINE_TYPE(GArrowUInt32DataType, \ + garrow_uint32_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_uint32_data_type_init(GArrowUInt32DataType *object) +{ +} + +static void +garrow_uint32_data_type_class_init(GArrowUInt32DataTypeClass *klass) +{ +} + +/** + * garrow_uint32_data_type_new: + * + * Returns: The newly created 32-bit unsigned integer data type. + */ +GArrowUInt32DataType * +garrow_uint32_data_type_new(void) +{ + auto arrow_data_type = arrow::uint32(); + + GArrowUInt32DataType *data_type = + GARROW_UINT32_DATA_TYPE(g_object_new(GARROW_TYPE_UINT32_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + + +G_DEFINE_TYPE(GArrowInt64DataType, \ + garrow_int64_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_int64_data_type_init(GArrowInt64DataType *object) +{ +} + +static void +garrow_int64_data_type_class_init(GArrowInt64DataTypeClass *klass) +{ +} + +/** + * garrow_int64_data_type_new: + * + * Returns: The newly created 64-bit integer data type. 
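All integer constructors share the same zero-argument shape, so they compose naturally. A sketch iterating over the signed family; the "int8" through "int64" descriptions are assumed from arrow::DataType::ToString().

#include <arrow-glib/arrow-glib.h>

static void
print_signed_integer_types(void)
{
  GArrowDataType *types[] = {
    GARROW_DATA_TYPE(garrow_int8_data_type_new()),
    GARROW_DATA_TYPE(garrow_int16_data_type_new()),
    GARROW_DATA_TYPE(garrow_int32_data_type_new()),
    GARROW_DATA_TYPE(garrow_int64_data_type_new()),
  };

  for (gsize i = 0; i < G_N_ELEMENTS(types); i++) {
    gchar *description = garrow_data_type_to_string(types[i]);
    g_print("%s\n", description);  /* assumed: "int8", ..., "int64" */
    g_free(description);
    g_object_unref(types[i]);      /* each constructor returns a new ref */
  }
}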
+ */ +GArrowInt64DataType * +garrow_int64_data_type_new(void) +{ + auto arrow_data_type = arrow::int64(); + + GArrowInt64DataType *data_type = + GARROW_INT64_DATA_TYPE(g_object_new(GARROW_TYPE_INT64_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + + +G_DEFINE_TYPE(GArrowUInt64DataType, \ + garrow_uint64_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_uint64_data_type_init(GArrowUInt64DataType *object) +{ +} + +static void +garrow_uint64_data_type_class_init(GArrowUInt64DataTypeClass *klass) +{ +} + +/** + * garrow_uint64_data_type_new: + * + * Returns: The newly created 64-bit unsigned integer data type. + */ +GArrowUInt64DataType * +garrow_uint64_data_type_new(void) +{ + auto arrow_data_type = arrow::uint64(); + + GArrowUInt64DataType *data_type = + GARROW_UINT64_DATA_TYPE(g_object_new(GARROW_TYPE_UINT64_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + + +G_DEFINE_TYPE(GArrowFloatDataType, \ + garrow_float_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_float_data_type_init(GArrowFloatDataType *object) +{ +} + +static void +garrow_float_data_type_class_init(GArrowFloatDataTypeClass *klass) +{ +} + +/** + * garrow_float_data_type_new: + * + * Returns: The newly created float data type. + */ +GArrowFloatDataType * +garrow_float_data_type_new(void) +{ + auto arrow_data_type = arrow::float32(); + + GArrowFloatDataType *data_type = + GARROW_FLOAT_DATA_TYPE(g_object_new(GARROW_TYPE_FLOAT_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + + +G_DEFINE_TYPE(GArrowDoubleDataType, \ + garrow_double_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_double_data_type_init(GArrowDoubleDataType *object) +{ +} + +static void +garrow_double_data_type_class_init(GArrowDoubleDataTypeClass *klass) +{ +} + +/** + * garrow_double_data_type_new: + * + * Returns: The newly created 64-bit floating point data type. + */ +GArrowDoubleDataType * +garrow_double_data_type_new(void) +{ + auto arrow_data_type = arrow::float64(); + + GArrowDoubleDataType *data_type = + GARROW_DOUBLE_DATA_TYPE(g_object_new(GARROW_TYPE_DOUBLE_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + + +G_DEFINE_TYPE(GArrowBinaryDataType, \ + garrow_binary_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_binary_data_type_init(GArrowBinaryDataType *object) +{ +} + +static void +garrow_binary_data_type_class_init(GArrowBinaryDataTypeClass *klass) +{ +} + +/** + * garrow_binary_data_type_new: + * + * Returns: The newly created binary data type. + */ +GArrowBinaryDataType * +garrow_binary_data_type_new(void) +{ + auto arrow_data_type = arrow::binary(); + + GArrowBinaryDataType *data_type = + GARROW_BINARY_DATA_TYPE(g_object_new(GARROW_TYPE_BINARY_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + + +G_DEFINE_TYPE(GArrowStringDataType, \ + garrow_string_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_string_data_type_init(GArrowStringDataType *object) +{ +} + +static void +garrow_string_data_type_class_init(GArrowStringDataTypeClass *klass) +{ +} + +/** + * garrow_string_data_type_new: + * + * Returns: The newly created UTF-8 encoded string data type. 
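Binary and string sketch: both wrap variable-length byte sequences, but garrow_string_data_type_new() wraps arrow::utf8() and therefore carries the UTF-8 validity guarantee, so the two types are not equal.

#include <arrow-glib/arrow-glib.h>

static void
compare_binary_and_string_types(void)
{
  GArrowBinaryDataType *binary_type = garrow_binary_data_type_new();
  GArrowStringDataType *string_type = garrow_string_data_type_new();

  /* arrow::binary() and arrow::utf8() are distinct types. */
  g_assert(!garrow_data_type_equal(GARROW_DATA_TYPE(binary_type),
                                   GARROW_DATA_TYPE(string_type)));

  g_object_unref(binary_type);
  g_object_unref(string_type);
}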
+ */ +GArrowStringDataType * +garrow_string_data_type_new(void) +{ + auto arrow_data_type = arrow::utf8(); + + GArrowStringDataType *data_type = + GARROW_STRING_DATA_TYPE(g_object_new(GARROW_TYPE_STRING_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + + +G_DEFINE_TYPE(GArrowListDataType, \ + garrow_list_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_list_data_type_init(GArrowListDataType *object) +{ +} + +static void +garrow_list_data_type_class_init(GArrowListDataTypeClass *klass) +{ +} + +/** + * garrow_list_data_type_new: + * @field: The field of elements + * + * Returns: The newly created list data type. + */ +GArrowListDataType * +garrow_list_data_type_new(GArrowField *field) +{ + auto arrow_field = garrow_field_get_raw(field); + auto arrow_data_type = + std::make_shared(arrow_field); + + GArrowListDataType *data_type = + GARROW_LIST_DATA_TYPE(g_object_new(GARROW_TYPE_LIST_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + +/** + * garrow_list_data_type_get_value_field: + * @list_data_type: A #GArrowListDataType. + * + * Returns: (transfer full): The field of value. + */ +GArrowField * +garrow_list_data_type_get_value_field(GArrowListDataType *list_data_type) +{ + auto arrow_data_type = + garrow_data_type_get_raw(GARROW_DATA_TYPE(list_data_type)); + auto arrow_list_data_type = + static_cast(arrow_data_type.get()); + + auto arrow_field = arrow_list_data_type->value_field(); + auto field = garrow_field_new_raw(&arrow_field); + + return field; +} + + +G_DEFINE_TYPE(GArrowStructDataType, \ + garrow_struct_data_type, \ + GARROW_TYPE_DATA_TYPE) + +static void +garrow_struct_data_type_init(GArrowStructDataType *object) +{ +} + +static void +garrow_struct_data_type_class_init(GArrowStructDataTypeClass *klass) +{ +} + +/** + * garrow_struct_data_type_new: + * @fields: (element-type GArrowField): The fields of the struct. + * + * Returns: The newly created struct data type. 
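Struct sketch: garrow_struct_data_type_new() takes a GList of #GArrowField. garrow_field_new(name, data_type) is assumed from arrow-glib's field API (it is not part of this hunk), as is the field keeping its own reference to the data type. The constructor implementation that follows only copies each wrapped arrow::Field, so the caller keeps ownership of the list and its elements.

#include <arrow-glib/arrow-glib.h>

/* Builds struct<x: int32, y: int32>. */
static GArrowStructDataType *
build_point_data_type(void)
{
  GArrowInt32DataType *x_type = garrow_int32_data_type_new();
  GArrowInt32DataType *y_type = garrow_int32_data_type_new();
  /* garrow_field_new() is assumed API; see the lead-in above. */
  GArrowField *x = garrow_field_new("x", GARROW_DATA_TYPE(x_type));
  GArrowField *y = garrow_field_new("y", GARROW_DATA_TYPE(y_type));

  GList *fields = NULL;
  fields = g_list_append(fields, x);
  fields = g_list_append(fields, y);

  GArrowStructDataType *point_type = garrow_struct_data_type_new(fields);

  /* The constructor copies the wrapped arrow::Fields, so we still
     own and release everything we created here. */
  g_list_free(fields);
  g_object_unref(x);
  g_object_unref(y);
  g_object_unref(x_type);
  g_object_unref(y_type);
  return point_type;
}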
+ */ +GArrowStructDataType * +garrow_struct_data_type_new(GList *fields) +{ + std::vector> arrow_fields; + for (GList *node = fields; node; node = g_list_next(node)) { + auto field = GARROW_FIELD(node->data); + auto arrow_field = garrow_field_get_raw(field); + arrow_fields.push_back(arrow_field); + } + + auto arrow_data_type = std::make_shared(arrow_fields); + GArrowStructDataType *data_type = + GARROW_STRUCT_DATA_TYPE(g_object_new(GARROW_TYPE_STRUCT_DATA_TYPE, + "data-type", &arrow_data_type, + NULL)); + return data_type; +} + G_END_DECLS GArrowDataType * diff --git a/c_glib/arrow-glib/data-type.h b/c_glib/arrow-glib/data-type.h index 3203d09b5c651..babf0ee1712a0 100644 --- a/c_glib/arrow-glib/data-type.h +++ b/c_glib/arrow-glib/data-type.h @@ -19,10 +19,16 @@ #pragma once +#include + #include G_BEGIN_DECLS +#ifndef __GTK_DOC_IGNORE__ +typedef struct _GArrowField GArrowField; +#endif + #define GARROW_TYPE_DATA_TYPE \ (garrow_data_type_get_type()) #define GARROW_DATA_TYPE(obj) \ @@ -69,4 +75,709 @@ gboolean garrow_data_type_equal (GArrowDataType *data_type, gchar *garrow_data_type_to_string (GArrowDataType *data_type); GArrowType garrow_data_type_type (GArrowDataType *data_type); + +#define GARROW_TYPE_NULL_DATA_TYPE \ + (garrow_null_data_type_get_type()) +#define GARROW_NULL_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_NULL_DATA_TYPE, \ + GArrowNullDataType)) +#define GARROW_NULL_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_NULL_DATA_TYPE, \ + GArrowNullDataTypeClass)) +#define GARROW_IS_NULL_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_NULL_DATA_TYPE)) +#define GARROW_IS_NULL_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_NULL_DATA_TYPE)) +#define GARROW_NULL_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_NULL_DATA_TYPE, \ + GArrowNullDataTypeClass)) + +typedef struct _GArrowNullDataType GArrowNullDataType; +typedef struct _GArrowNullDataTypeClass GArrowNullDataTypeClass; + +/** + * GArrowNullDataType: + * + * It wraps `arrow::NullType`. + */ +struct _GArrowNullDataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowNullDataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_null_data_type_get_type (void) G_GNUC_CONST; +GArrowNullDataType *garrow_null_data_type_new (void); + + +#define GARROW_TYPE_BOOLEAN_DATA_TYPE \ + (garrow_boolean_data_type_get_type()) +#define GARROW_BOOLEAN_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_BOOLEAN_DATA_TYPE, \ + GArrowBooleanDataType)) +#define GARROW_BOOLEAN_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_BOOLEAN_DATA_TYPE, \ + GArrowBooleanDataTypeClass)) +#define GARROW_IS_BOOLEAN_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_BOOLEAN_DATA_TYPE)) +#define GARROW_IS_BOOLEAN_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_BOOLEAN_DATA_TYPE)) +#define GARROW_BOOLEAN_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_BOOLEAN_DATA_TYPE, \ + GArrowBooleanDataTypeClass)) + +typedef struct _GArrowBooleanDataType GArrowBooleanDataType; +typedef struct _GArrowBooleanDataTypeClass GArrowBooleanDataTypeClass; + +/** + * GArrowBooleanDataType: + * + * It wraps `arrow::BooleanType`. 
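The macro block above is the stock GObject boilerplate; a sketch of how callers typically pair the type-check and checked-cast macros:

#include <arrow-glib/arrow-glib.h>

static void
describe_if_boolean(GArrowDataType *data_type)
{
  /* GARROW_IS_BOOLEAN_DATA_TYPE() checks the instance type;
     GARROW_BOOLEAN_DATA_TYPE() performs the checked cast. */
  if (GARROW_IS_BOOLEAN_DATA_TYPE(data_type)) {
    GArrowBooleanDataType *boolean_type =
      GARROW_BOOLEAN_DATA_TYPE(data_type);
    g_print("got a boolean data type: %p\n", (gpointer)boolean_type);
  } else {
    g_print("not a boolean data type\n");
  }
}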
+ */ +struct _GArrowBooleanDataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowBooleanDataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_boolean_data_type_get_type (void) G_GNUC_CONST; +GArrowBooleanDataType *garrow_boolean_data_type_new (void); + + +#define GARROW_TYPE_INT8_DATA_TYPE \ + (garrow_int8_data_type_get_type()) +#define GARROW_INT8_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INT8_DATA_TYPE, \ + GArrowInt8DataType)) +#define GARROW_INT8_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INT8_DATA_TYPE, \ + GArrowInt8DataTypeClass)) +#define GARROW_IS_INT8_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_INT8_DATA_TYPE)) +#define GARROW_IS_INT8_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INT8_DATA_TYPE)) +#define GARROW_INT8_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INT8_DATA_TYPE, \ + GArrowInt8DataTypeClass)) + +typedef struct _GArrowInt8DataType GArrowInt8DataType; +typedef struct _GArrowInt8DataTypeClass GArrowInt8DataTypeClass; + +/** + * GArrowInt8DataType: + * + * It wraps `arrow::Int8Type`. + */ +struct _GArrowInt8DataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowInt8DataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_int8_data_type_get_type (void) G_GNUC_CONST; +GArrowInt8DataType *garrow_int8_data_type_new (void); + + +#define GARROW_TYPE_UINT8_DATA_TYPE \ + (garrow_uint8_data_type_get_type()) +#define GARROW_UINT8_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT8_DATA_TYPE, \ + GArrowUInt8DataType)) +#define GARROW_UINT8_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT8_DATA_TYPE, \ + GArrowUInt8DataTypeClass)) +#define GARROW_IS_UINT8_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT8_DATA_TYPE)) +#define GARROW_IS_UINT8_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT8_DATA_TYPE)) +#define GARROW_UINT8_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT8_DATA_TYPE, \ + GArrowUInt8DataTypeClass)) + +typedef struct _GArrowUInt8DataType GArrowUInt8DataType; +typedef struct _GArrowUInt8DataTypeClass GArrowUInt8DataTypeClass; + +/** + * GArrowUInt8DataType: + * + * It wraps `arrow::UInt8Type`. 
+ */ +struct _GArrowUInt8DataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowUInt8DataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_uint8_data_type_get_type (void) G_GNUC_CONST; +GArrowUInt8DataType *garrow_uint8_data_type_new (void); + + +#define GARROW_TYPE_INT16_DATA_TYPE \ + (garrow_int16_data_type_get_type()) +#define GARROW_INT16_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INT16_DATA_TYPE, \ + GArrowInt16DataType)) +#define GARROW_INT16_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INT16_DATA_TYPE, \ + GArrowInt16DataTypeClass)) +#define GARROW_IS_INT16_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_INT16_DATA_TYPE)) +#define GARROW_IS_INT16_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INT16_DATA_TYPE)) +#define GARROW_INT16_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INT16_DATA_TYPE, \ + GArrowInt16DataTypeClass)) + +typedef struct _GArrowInt16DataType GArrowInt16DataType; +typedef struct _GArrowInt16DataTypeClass GArrowInt16DataTypeClass; + +/** + * GArrowInt16DataType: + * + * It wraps `arrow::Int16Type`. + */ +struct _GArrowInt16DataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowInt16DataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_int16_data_type_get_type (void) G_GNUC_CONST; +GArrowInt16DataType *garrow_int16_data_type_new (void); + + +#define GARROW_TYPE_UINT16_DATA_TYPE \ + (garrow_uint16_data_type_get_type()) +#define GARROW_UINT16_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT16_DATA_TYPE, \ + GArrowUInt16DataType)) +#define GARROW_UINT16_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT16_DATA_TYPE, \ + GArrowUInt16DataTypeClass)) +#define GARROW_IS_UINT16_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT16_DATA_TYPE)) +#define GARROW_IS_UINT16_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT16_DATA_TYPE)) +#define GARROW_UINT16_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT16_DATA_TYPE, \ + GArrowUInt16DataTypeClass)) + +typedef struct _GArrowUInt16DataType GArrowUInt16DataType; +typedef struct _GArrowUInt16DataTypeClass GArrowUInt16DataTypeClass; + +/** + * GArrowUInt16DataType: + * + * It wraps `arrow::UInt16Type`. 
+ */ +struct _GArrowUInt16DataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowUInt16DataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_uint16_data_type_get_type (void) G_GNUC_CONST; +GArrowUInt16DataType *garrow_uint16_data_type_new (void); + + +#define GARROW_TYPE_INT32_DATA_TYPE \ + (garrow_int32_data_type_get_type()) +#define GARROW_INT32_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INT32_DATA_TYPE, \ + GArrowInt32DataType)) +#define GARROW_INT32_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INT32_DATA_TYPE, \ + GArrowInt32DataTypeClass)) +#define GARROW_IS_INT32_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_INT32_DATA_TYPE)) +#define GARROW_IS_INT32_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INT32_DATA_TYPE)) +#define GARROW_INT32_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INT32_DATA_TYPE, \ + GArrowInt32DataTypeClass)) + +typedef struct _GArrowInt32DataType GArrowInt32DataType; +typedef struct _GArrowInt32DataTypeClass GArrowInt32DataTypeClass; + +/** + * GArrowInt32DataType: + * + * It wraps `arrow::Int32Type`. + */ +struct _GArrowInt32DataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowInt32DataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_int32_data_type_get_type (void) G_GNUC_CONST; +GArrowInt32DataType *garrow_int32_data_type_new (void); + + +#define GARROW_TYPE_UINT32_DATA_TYPE \ + (garrow_uint32_data_type_get_type()) +#define GARROW_UINT32_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT32_DATA_TYPE, \ + GArrowUInt32DataType)) +#define GARROW_UINT32_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT32_DATA_TYPE, \ + GArrowUInt32DataTypeClass)) +#define GARROW_IS_UINT32_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT32_DATA_TYPE)) +#define GARROW_IS_UINT32_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT32_DATA_TYPE)) +#define GARROW_UINT32_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT32_DATA_TYPE, \ + GArrowUInt32DataTypeClass)) + +typedef struct _GArrowUInt32DataType GArrowUInt32DataType; +typedef struct _GArrowUInt32DataTypeClass GArrowUInt32DataTypeClass; + +/** + * GArrowUInt32DataType: + * + * It wraps `arrow::UInt32Type`. 
+ */ +struct _GArrowUInt32DataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowUInt32DataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_uint32_data_type_get_type (void) G_GNUC_CONST; +GArrowUInt32DataType *garrow_uint32_data_type_new (void); + + +#define GARROW_TYPE_INT64_DATA_TYPE \ + (garrow_int64_data_type_get_type()) +#define GARROW_INT64_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INT64_DATA_TYPE, \ + GArrowInt64DataType)) +#define GARROW_INT64_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INT64_DATA_TYPE, \ + GArrowInt64DataTypeClass)) +#define GARROW_IS_INT64_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_INT64_DATA_TYPE)) +#define GARROW_IS_INT64_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INT64_DATA_TYPE)) +#define GARROW_INT64_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INT64_DATA_TYPE, \ + GArrowInt64DataTypeClass)) + +typedef struct _GArrowInt64DataType GArrowInt64DataType; +typedef struct _GArrowInt64DataTypeClass GArrowInt64DataTypeClass; + +/** + * GArrowInt64DataType: + * + * It wraps `arrow::Int64Type`. + */ +struct _GArrowInt64DataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowInt64DataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_int64_data_type_get_type (void) G_GNUC_CONST; +GArrowInt64DataType *garrow_int64_data_type_new (void); + + +#define GARROW_TYPE_UINT64_DATA_TYPE \ + (garrow_uint64_data_type_get_type()) +#define GARROW_UINT64_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT64_DATA_TYPE, \ + GArrowUInt64DataType)) +#define GARROW_UINT64_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT64_DATA_TYPE, \ + GArrowUInt64DataTypeClass)) +#define GARROW_IS_UINT64_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT64_DATA_TYPE)) +#define GARROW_IS_UINT64_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT64_DATA_TYPE)) +#define GARROW_UINT64_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT64_DATA_TYPE, \ + GArrowUInt64DataTypeClass)) + +typedef struct _GArrowUInt64DataType GArrowUInt64DataType; +typedef struct _GArrowUInt64DataTypeClass GArrowUInt64DataTypeClass; + +/** + * GArrowUInt64DataType: + * + * It wraps `arrow::UInt64Type`. 
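Since every concrete class here is registered with G_DEFINE_TYPE under GARROW_TYPE_DATA_TYPE as its parent, any instance passes the base instance check. A sketch using the generic GLib macro directly, so no unshown GARROW_IS_DATA_TYPE convenience macro is assumed:

#include <arrow-glib/arrow-glib.h>

static void
check_base_class(void)
{
  GArrowUInt64DataType *uint64_type = garrow_uint64_data_type_new();

  /* Every data type class inherits from GArrowDataType. */
  g_assert(G_TYPE_CHECK_INSTANCE_TYPE(uint64_type, GARROW_TYPE_DATA_TYPE));

  g_object_unref(uint64_type);
}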
+ */ +struct _GArrowUInt64DataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowUInt64DataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_uint64_data_type_get_type (void) G_GNUC_CONST; +GArrowUInt64DataType *garrow_uint64_data_type_new (void); + + +#define GARROW_TYPE_FLOAT_DATA_TYPE \ + (garrow_float_data_type_get_type()) +#define GARROW_FLOAT_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_FLOAT_DATA_TYPE, \ + GArrowFloatDataType)) +#define GARROW_FLOAT_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_FLOAT_DATA_TYPE, \ + GArrowFloatDataTypeClass)) +#define GARROW_IS_FLOAT_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_FLOAT_DATA_TYPE)) +#define GARROW_IS_FLOAT_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_FLOAT_DATA_TYPE)) +#define GARROW_FLOAT_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_FLOAT_DATA_TYPE, \ + GArrowFloatDataTypeClass)) + +typedef struct _GArrowFloatDataType GArrowFloatDataType; +typedef struct _GArrowFloatDataTypeClass GArrowFloatDataTypeClass; + +/** + * GArrowFloatDataType: + * + * It wraps `arrow::FloatType`. + */ +struct _GArrowFloatDataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowFloatDataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_float_data_type_get_type (void) G_GNUC_CONST; +GArrowFloatDataType *garrow_float_data_type_new (void); + + +#define GARROW_TYPE_DOUBLE_DATA_TYPE \ + (garrow_double_data_type_get_type()) +#define GARROW_DOUBLE_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_DOUBLE_DATA_TYPE, \ + GArrowDoubleDataType)) +#define GARROW_DOUBLE_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_DOUBLE_DATA_TYPE, \ + GArrowDoubleDataTypeClass)) +#define GARROW_IS_DOUBLE_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_DOUBLE_DATA_TYPE)) +#define GARROW_IS_DOUBLE_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_DOUBLE_DATA_TYPE)) +#define GARROW_DOUBLE_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_DOUBLE_DATA_TYPE, \ + GArrowDoubleDataTypeClass)) + +typedef struct _GArrowDoubleDataType GArrowDoubleDataType; +typedef struct _GArrowDoubleDataTypeClass GArrowDoubleDataTypeClass; + +/** + * GArrowDoubleDataType: + * + * It wraps `arrow::DoubleType`. 
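Note the naming asymmetry: garrow_float_data_type_new() wraps arrow::float32() and garrow_double_data_type_new() wraps arrow::float64(), as the constructors earlier in this patch show. A sketch follows; the "float"/"double" descriptions are assumed.

#include <arrow-glib/arrow-glib.h>

static void
print_floating_point_types(void)
{
  GArrowFloatDataType  *float_type  = garrow_float_data_type_new();
  GArrowDoubleDataType *double_type = garrow_double_data_type_new();

  gchar *float_description =
    garrow_data_type_to_string(GARROW_DATA_TYPE(float_type));
  gchar *double_description =
    garrow_data_type_to_string(GARROW_DATA_TYPE(double_type));
  g_print("%s %s\n", float_description, double_description);

  g_free(float_description);
  g_free(double_description);
  g_object_unref(float_type);
  g_object_unref(double_type);
}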
+ */ +struct _GArrowDoubleDataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowDoubleDataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_double_data_type_get_type (void) G_GNUC_CONST; +GArrowDoubleDataType *garrow_double_data_type_new (void); + + +#define GARROW_TYPE_BINARY_DATA_TYPE \ + (garrow_binary_data_type_get_type()) +#define GARROW_BINARY_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_BINARY_DATA_TYPE, \ + GArrowBinaryDataType)) +#define GARROW_BINARY_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_BINARY_DATA_TYPE, \ + GArrowBinaryDataTypeClass)) +#define GARROW_IS_BINARY_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_BINARY_DATA_TYPE)) +#define GARROW_IS_BINARY_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_BINARY_DATA_TYPE)) +#define GARROW_BINARY_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_BINARY_DATA_TYPE, \ + GArrowBinaryDataTypeClass)) + +typedef struct _GArrowBinaryDataType GArrowBinaryDataType; +typedef struct _GArrowBinaryDataTypeClass GArrowBinaryDataTypeClass; + +/** + * GArrowBinaryDataType: + * + * It wraps `arrow::BinaryType`. + */ +struct _GArrowBinaryDataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowBinaryDataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_binary_data_type_get_type (void) G_GNUC_CONST; +GArrowBinaryDataType *garrow_binary_data_type_new (void); + + +#define GARROW_TYPE_STRING_DATA_TYPE \ + (garrow_string_data_type_get_type()) +#define GARROW_STRING_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_STRING_DATA_TYPE, \ + GArrowStringDataType)) +#define GARROW_STRING_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_STRING_DATA_TYPE, \ + GArrowStringDataTypeClass)) +#define GARROW_IS_STRING_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_STRING_DATA_TYPE)) +#define GARROW_IS_STRING_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_STRING_DATA_TYPE)) +#define GARROW_STRING_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_STRING_DATA_TYPE, \ + GArrowStringDataTypeClass)) + +typedef struct _GArrowStringDataType GArrowStringDataType; +typedef struct _GArrowStringDataTypeClass GArrowStringDataTypeClass; + +/** + * GArrowStringDataType: + * + * It wraps `arrow::StringType`. 
+ */ +struct _GArrowStringDataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowStringDataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_string_data_type_get_type (void) G_GNUC_CONST; +GArrowStringDataType *garrow_string_data_type_new (void); + + +#define GARROW_TYPE_LIST_DATA_TYPE \ + (garrow_list_data_type_get_type()) +#define GARROW_LIST_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_LIST_DATA_TYPE, \ + GArrowListDataType)) +#define GARROW_LIST_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_LIST_DATA_TYPE, \ + GArrowListDataTypeClass)) +#define GARROW_IS_LIST_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_LIST_DATA_TYPE)) +#define GARROW_IS_LIST_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_LIST_DATA_TYPE)) +#define GARROW_LIST_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_LIST_DATA_TYPE, \ + GArrowListDataTypeClass)) + +typedef struct _GArrowListDataType GArrowListDataType; +typedef struct _GArrowListDataTypeClass GArrowListDataTypeClass; + +/** + * GArrowListDataType: + * + * It wraps `arrow::ListType`. + */ +struct _GArrowListDataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowListDataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_list_data_type_get_type (void) G_GNUC_CONST; +GArrowListDataType *garrow_list_data_type_new (GArrowField *field); +GArrowField *garrow_list_data_type_get_value_field (GArrowListDataType *list_data_type); + + +#define GARROW_TYPE_STRUCT_DATA_TYPE \ + (garrow_struct_data_type_get_type()) +#define GARROW_STRUCT_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_STRUCT_DATA_TYPE, \ + GArrowStructDataType)) +#define GARROW_STRUCT_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_STRUCT_DATA_TYPE, \ + GArrowStructDataTypeClass)) +#define GARROW_IS_STRUCT_DATA_TYPE(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_STRUCT_DATA_TYPE)) +#define GARROW_IS_STRUCT_DATA_TYPE_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_STRUCT_DATA_TYPE)) +#define GARROW_STRUCT_DATA_TYPE_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_STRUCT_DATA_TYPE, \ + GArrowStructDataTypeClass)) + +typedef struct _GArrowStructDataType GArrowStructDataType; +typedef struct _GArrowStructDataTypeClass GArrowStructDataTypeClass; + +/** + * GArrowStructDataType: + * + * It wraps `arrow::StructType`. + */ +struct _GArrowStructDataType +{ + /*< private >*/ + GArrowDataType parent_instance; +}; + +struct _GArrowStructDataTypeClass +{ + GArrowDataTypeClass parent_class; +}; + +GType garrow_struct_data_type_get_type (void) G_GNUC_CONST; +GArrowStructDataType *garrow_struct_data_type_new (GList *fields); + G_END_DECLS diff --git a/c_glib/arrow-glib/double-data-type.cpp b/c_glib/arrow-glib/double-data-type.cpp deleted file mode 100644 index c132f97ebe58f..0000000000000 --- a/c_glib/arrow-glib/double-data-type.cpp +++ /dev/null @@ -1,68 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: double-data-type - * @short_description: 64-bit floating point data type - * - * #GArrowDoubleDataType is a class for 64-bit floating point data - * type. - */ - -G_DEFINE_TYPE(GArrowDoubleDataType, \ - garrow_double_data_type, \ - GARROW_TYPE_DATA_TYPE) - -static void -garrow_double_data_type_init(GArrowDoubleDataType *object) -{ -} - -static void -garrow_double_data_type_class_init(GArrowDoubleDataTypeClass *klass) -{ -} - -/** - * garrow_double_data_type_new: - * - * Returns: The newly created 64-bit floating point data type. - */ -GArrowDoubleDataType * -garrow_double_data_type_new(void) -{ - auto arrow_data_type = arrow::float64(); - - GArrowDoubleDataType *data_type = - GARROW_DOUBLE_DATA_TYPE(g_object_new(GARROW_TYPE_DOUBLE_DATA_TYPE, - "data-type", &arrow_data_type, - NULL)); - return data_type; -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/double-data-type.h b/c_glib/arrow-glib/double-data-type.h deleted file mode 100644 index ec725cbed3ba2..0000000000000 --- a/c_glib/arrow-glib/double-data-type.h +++ /dev/null @@ -1,70 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_DOUBLE_DATA_TYPE \ - (garrow_double_data_type_get_type()) -#define GARROW_DOUBLE_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_DOUBLE_DATA_TYPE, \ - GArrowDoubleDataType)) -#define GARROW_DOUBLE_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_DOUBLE_DATA_TYPE, \ - GArrowDoubleDataTypeClass)) -#define GARROW_IS_DOUBLE_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_DOUBLE_DATA_TYPE)) -#define GARROW_IS_DOUBLE_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_DOUBLE_DATA_TYPE)) -#define GARROW_DOUBLE_DATA_TYPE_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_DOUBLE_DATA_TYPE, \ - GArrowDoubleDataTypeClass)) - -typedef struct _GArrowDoubleDataType GArrowDoubleDataType; -typedef struct _GArrowDoubleDataTypeClass GArrowDoubleDataTypeClass; - -/** - * GArrowDoubleDataType: - * - * It wraps `arrow::DoubleType`. 
- */ -struct _GArrowDoubleDataType -{ - /*< private >*/ - GArrowDataType parent_instance; -}; - -struct _GArrowDoubleDataTypeClass -{ - GArrowDataTypeClass parent_class; -}; - -GType garrow_double_data_type_get_type(void) G_GNUC_CONST; - -GArrowDoubleDataType *garrow_double_data_type_new(void); - -G_END_DECLS diff --git a/c_glib/arrow-glib/float-data-type.cpp b/c_glib/arrow-glib/float-data-type.cpp deleted file mode 100644 index ce7f28acfcb45..0000000000000 --- a/c_glib/arrow-glib/float-data-type.cpp +++ /dev/null @@ -1,68 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: float-data-type - * @short_description: 32-bit floating point data type - * - * #GArrowFloatDataType is a class for 32-bit floating point data - * type. - */ - -G_DEFINE_TYPE(GArrowFloatDataType, \ - garrow_float_data_type, \ - GARROW_TYPE_DATA_TYPE) - -static void -garrow_float_data_type_init(GArrowFloatDataType *object) -{ -} - -static void -garrow_float_data_type_class_init(GArrowFloatDataTypeClass *klass) -{ -} - -/** - * garrow_float_data_type_new: - * - * Returns: The newly created float data type. - */ -GArrowFloatDataType * -garrow_float_data_type_new(void) -{ - auto arrow_data_type = arrow::float32(); - - GArrowFloatDataType *data_type = - GARROW_FLOAT_DATA_TYPE(g_object_new(GARROW_TYPE_FLOAT_DATA_TYPE, - "data-type", &arrow_data_type, - NULL)); - return data_type; -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/float-data-type.h b/c_glib/arrow-glib/float-data-type.h deleted file mode 100644 index dcb6c2ab13d25..0000000000000 --- a/c_glib/arrow-glib/float-data-type.h +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_FLOAT_DATA_TYPE \ - (garrow_float_data_type_get_type()) -#define GARROW_FLOAT_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_FLOAT_DATA_TYPE, \ - GArrowFloatDataType)) -#define GARROW_FLOAT_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_FLOAT_DATA_TYPE, \ - GArrowFloatDataTypeClass)) -#define GARROW_IS_FLOAT_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_FLOAT_DATA_TYPE)) -#define GARROW_IS_FLOAT_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_FLOAT_DATA_TYPE)) -#define GARROW_FLOAT_DATA_TYPE_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_FLOAT_DATA_TYPE, \ - GArrowFloatDataTypeClass)) - -typedef struct _GArrowFloatDataType GArrowFloatDataType; -typedef struct _GArrowFloatDataTypeClass GArrowFloatDataTypeClass; - -/** - * GArrowFloatDataType: - * - * It wraps `arrow::FloatType`. - */ -struct _GArrowFloatDataType -{ - /*< private >*/ - GArrowDataType parent_instance; -}; - -struct _GArrowFloatDataTypeClass -{ - GArrowDataTypeClass parent_class; -}; - -GType garrow_float_data_type_get_type (void) G_GNUC_CONST; -GArrowFloatDataType *garrow_float_data_type_new (void); - -G_END_DECLS diff --git a/c_glib/arrow-glib/int16-data-type.cpp b/c_glib/arrow-glib/int16-data-type.cpp deleted file mode 100644 index 45e109e1759dc..0000000000000 --- a/c_glib/arrow-glib/int16-data-type.cpp +++ /dev/null @@ -1,67 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: int16-data-type - * @short_description: 16-bit integer data type - * - * #GArrowInt16DataType is a class for 16-bit integer data type. - */ - -G_DEFINE_TYPE(GArrowInt16DataType, \ - garrow_int16_data_type, \ - GARROW_TYPE_DATA_TYPE) - -static void -garrow_int16_data_type_init(GArrowInt16DataType *object) -{ -} - -static void -garrow_int16_data_type_class_init(GArrowInt16DataTypeClass *klass) -{ -} - -/** - * garrow_int16_data_type_new: - * - * Returns: The newly created 16-bit integer data type. 
- */ -GArrowInt16DataType * -garrow_int16_data_type_new(void) -{ - auto arrow_data_type = arrow::int16(); - - GArrowInt16DataType *data_type = - GARROW_INT16_DATA_TYPE(g_object_new(GARROW_TYPE_INT16_DATA_TYPE, - "data-type", &arrow_data_type, - NULL)); - return data_type; -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/int16-data-type.h b/c_glib/arrow-glib/int16-data-type.h deleted file mode 100644 index eaa199c4fc7f8..0000000000000 --- a/c_glib/arrow-glib/int16-data-type.h +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_INT16_DATA_TYPE \ - (garrow_int16_data_type_get_type()) -#define GARROW_INT16_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_INT16_DATA_TYPE, \ - GArrowInt16DataType)) -#define GARROW_INT16_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_INT16_DATA_TYPE, \ - GArrowInt16DataTypeClass)) -#define GARROW_IS_INT16_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_INT16_DATA_TYPE)) -#define GARROW_IS_INT16_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_INT16_DATA_TYPE)) -#define GARROW_INT16_DATA_TYPE_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_INT16_DATA_TYPE, \ - GArrowInt16DataTypeClass)) - -typedef struct _GArrowInt16DataType GArrowInt16DataType; -typedef struct _GArrowInt16DataTypeClass GArrowInt16DataTypeClass; - -/** - * GArrowInt16DataType: - * - * It wraps `arrow::Int16Type`. - */ -struct _GArrowInt16DataType -{ - /*< private >*/ - GArrowDataType parent_instance; -}; - -struct _GArrowInt16DataTypeClass -{ - GArrowDataTypeClass parent_class; -}; - -GType garrow_int16_data_type_get_type (void) G_GNUC_CONST; -GArrowInt16DataType *garrow_int16_data_type_new (void); - -G_END_DECLS diff --git a/c_glib/arrow-glib/int32-data-type.cpp b/c_glib/arrow-glib/int32-data-type.cpp deleted file mode 100644 index add21135364f9..0000000000000 --- a/c_glib/arrow-glib/int32-data-type.cpp +++ /dev/null @@ -1,67 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: int32-data-type - * @short_description: 32-bit integer data type - * - * #GArrowInt32DataType is a class for 32-bit integer data type. - */ - -G_DEFINE_TYPE(GArrowInt32DataType, \ - garrow_int32_data_type, \ - GARROW_TYPE_DATA_TYPE) - -static void -garrow_int32_data_type_init(GArrowInt32DataType *object) -{ -} - -static void -garrow_int32_data_type_class_init(GArrowInt32DataTypeClass *klass) -{ -} - -/** - * garrow_int32_data_type_new: - * - * Returns: The newly created 32-bit integer data type. - */ -GArrowInt32DataType * -garrow_int32_data_type_new(void) -{ - auto arrow_data_type = arrow::int32(); - - GArrowInt32DataType *data_type = - GARROW_INT32_DATA_TYPE(g_object_new(GARROW_TYPE_INT32_DATA_TYPE, - "data-type", &arrow_data_type, - NULL)); - return data_type; -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/int32-data-type.h b/c_glib/arrow-glib/int32-data-type.h deleted file mode 100644 index 75cccbd40560d..0000000000000 --- a/c_glib/arrow-glib/int32-data-type.h +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_INT32_DATA_TYPE \ - (garrow_int32_data_type_get_type()) -#define GARROW_INT32_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_INT32_DATA_TYPE, \ - GArrowInt32DataType)) -#define GARROW_INT32_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_INT32_DATA_TYPE, \ - GArrowInt32DataTypeClass)) -#define GARROW_IS_INT32_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_INT32_DATA_TYPE)) -#define GARROW_IS_INT32_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_INT32_DATA_TYPE)) -#define GARROW_INT32_DATA_TYPE_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_INT32_DATA_TYPE, \ - GArrowInt32DataTypeClass)) - -typedef struct _GArrowInt32DataType GArrowInt32DataType; -typedef struct _GArrowInt32DataTypeClass GArrowInt32DataTypeClass; - -/** - * GArrowInt32DataType: - * - * It wraps `arrow::Int32Type`. 
- */ -struct _GArrowInt32DataType -{ - /*< private >*/ - GArrowDataType parent_instance; -}; - -struct _GArrowInt32DataTypeClass -{ - GArrowDataTypeClass parent_class; -}; - -GType garrow_int32_data_type_get_type (void) G_GNUC_CONST; -GArrowInt32DataType *garrow_int32_data_type_new (void); - -G_END_DECLS diff --git a/c_glib/arrow-glib/int64-data-type.cpp b/c_glib/arrow-glib/int64-data-type.cpp deleted file mode 100644 index 8e85b9d2ab922..0000000000000 --- a/c_glib/arrow-glib/int64-data-type.cpp +++ /dev/null @@ -1,67 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: int64-data-type - * @short_description: 64-bit integer data type - * - * #GArrowInt64DataType is a class for 64-bit integer data type. - */ - -G_DEFINE_TYPE(GArrowInt64DataType, \ - garrow_int64_data_type, \ - GARROW_TYPE_DATA_TYPE) - -static void -garrow_int64_data_type_init(GArrowInt64DataType *object) -{ -} - -static void -garrow_int64_data_type_class_init(GArrowInt64DataTypeClass *klass) -{ -} - -/** - * garrow_int64_data_type_new: - * - * Returns: The newly created 64-bit integer data type. - */ -GArrowInt64DataType * -garrow_int64_data_type_new(void) -{ - auto arrow_data_type = arrow::int64(); - - GArrowInt64DataType *data_type = - GARROW_INT64_DATA_TYPE(g_object_new(GARROW_TYPE_INT64_DATA_TYPE, - "data-type", &arrow_data_type, - NULL)); - return data_type; -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/int64-data-type.h b/c_glib/arrow-glib/int64-data-type.h deleted file mode 100644 index 499e79f7ab7a7..0000000000000 --- a/c_glib/arrow-glib/int64-data-type.h +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_INT64_DATA_TYPE \ - (garrow_int64_data_type_get_type()) -#define GARROW_INT64_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_INT64_DATA_TYPE, \ - GArrowInt64DataType)) -#define GARROW_INT64_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_INT64_DATA_TYPE, \ - GArrowInt64DataTypeClass)) -#define GARROW_IS_INT64_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_INT64_DATA_TYPE)) -#define GARROW_IS_INT64_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_INT64_DATA_TYPE)) -#define GARROW_INT64_DATA_TYPE_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_INT64_DATA_TYPE, \ - GArrowInt64DataTypeClass)) - -typedef struct _GArrowInt64DataType GArrowInt64DataType; -typedef struct _GArrowInt64DataTypeClass GArrowInt64DataTypeClass; - -/** - * GArrowInt64DataType: - * - * It wraps `arrow::Int64Type`. - */ -struct _GArrowInt64DataType -{ - /*< private >*/ - GArrowDataType parent_instance; -}; - -struct _GArrowInt64DataTypeClass -{ - GArrowDataTypeClass parent_class; -}; - -GType garrow_int64_data_type_get_type (void) G_GNUC_CONST; -GArrowInt64DataType *garrow_int64_data_type_new (void); - -G_END_DECLS diff --git a/c_glib/arrow-glib/int8-data-type.cpp b/c_glib/arrow-glib/int8-data-type.cpp deleted file mode 100644 index 55b1ebc852d10..0000000000000 --- a/c_glib/arrow-glib/int8-data-type.cpp +++ /dev/null @@ -1,67 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: int8-data-type - * @short_description: 8-bit integer data type - * - * #GArrowInt8DataType is a class for 8-bit integer data type. - */ - -G_DEFINE_TYPE(GArrowInt8DataType, \ - garrow_int8_data_type, \ - GARROW_TYPE_DATA_TYPE) - -static void -garrow_int8_data_type_init(GArrowInt8DataType *object) -{ -} - -static void -garrow_int8_data_type_class_init(GArrowInt8DataTypeClass *klass) -{ -} - -/** - * garrow_int8_data_type_new: - * - * Returns: The newly created 8-bit integer data type. - */ -GArrowInt8DataType * -garrow_int8_data_type_new(void) -{ - auto arrow_data_type = arrow::int8(); - - GArrowInt8DataType *data_type = - GARROW_INT8_DATA_TYPE(g_object_new(GARROW_TYPE_INT8_DATA_TYPE, - "data-type", &arrow_data_type, - NULL)); - return data_type; -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/int8-data-type.h b/c_glib/arrow-glib/int8-data-type.h deleted file mode 100644 index 4343bd17a725b..0000000000000 --- a/c_glib/arrow-glib/int8-data-type.h +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. 
See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_INT8_DATA_TYPE \ - (garrow_int8_data_type_get_type()) -#define GARROW_INT8_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_INT8_DATA_TYPE, \ - GArrowInt8DataType)) -#define GARROW_INT8_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_INT8_DATA_TYPE, \ - GArrowInt8DataTypeClass)) -#define GARROW_IS_INT8_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_INT8_DATA_TYPE)) -#define GARROW_IS_INT8_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_INT8_DATA_TYPE)) -#define GARROW_INT8_DATA_TYPE_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_INT8_DATA_TYPE, \ - GArrowInt8DataTypeClass)) - -typedef struct _GArrowInt8DataType GArrowInt8DataType; -typedef struct _GArrowInt8DataTypeClass GArrowInt8DataTypeClass; - -/** - * GArrowInt8DataType: - * - * It wraps `arrow::Int8Type`. - */ -struct _GArrowInt8DataType -{ - /*< private >*/ - GArrowDataType parent_instance; -}; - -struct _GArrowInt8DataTypeClass -{ - GArrowDataTypeClass parent_class; -}; - -GType garrow_int8_data_type_get_type (void) G_GNUC_CONST; -GArrowInt8DataType *garrow_int8_data_type_new (void); - -G_END_DECLS diff --git a/c_glib/arrow-glib/list-data-type.cpp b/c_glib/arrow-glib/list-data-type.cpp deleted file mode 100644 index e82e6fdee48ba..0000000000000 --- a/c_glib/arrow-glib/list-data-type.cpp +++ /dev/null @@ -1,91 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: list-data-type - * @short_description: List data type - * - * #GArrowListDataType is a class for list data type. 
- */ - -G_DEFINE_TYPE(GArrowListDataType, \ - garrow_list_data_type, \ - GARROW_TYPE_DATA_TYPE) - -static void -garrow_list_data_type_init(GArrowListDataType *object) -{ -} - -static void -garrow_list_data_type_class_init(GArrowListDataTypeClass *klass) -{ -} - -/** - * garrow_list_data_type_new: - * @field: The field of elements - * - * Returns: The newly created list data type. - */ -GArrowListDataType * -garrow_list_data_type_new(GArrowField *field) -{ - auto arrow_field = garrow_field_get_raw(field); - auto arrow_data_type = - std::make_shared(arrow_field); - - GArrowListDataType *data_type = - GARROW_LIST_DATA_TYPE(g_object_new(GARROW_TYPE_LIST_DATA_TYPE, - "data-type", &arrow_data_type, - NULL)); - return data_type; -} - -/** - * garrow_list_data_type_get_value_field: - * @list_data_type: A #GArrowListDataType. - * - * Returns: (transfer full): The field of value. - */ -GArrowField * -garrow_list_data_type_get_value_field(GArrowListDataType *list_data_type) -{ - auto arrow_data_type = - garrow_data_type_get_raw(GARROW_DATA_TYPE(list_data_type)); - auto arrow_list_data_type = - static_cast(arrow_data_type.get()); - - auto arrow_field = arrow_list_data_type->value_field(); - auto field = garrow_field_new_raw(&arrow_field); - - return field; -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/list-data-type.h b/c_glib/arrow-glib/list-data-type.h deleted file mode 100644 index bb406e2c62074..0000000000000 --- a/c_glib/arrow-glib/list-data-type.h +++ /dev/null @@ -1,73 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_LIST_DATA_TYPE \ - (garrow_list_data_type_get_type()) -#define GARROW_LIST_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_LIST_DATA_TYPE, \ - GArrowListDataType)) -#define GARROW_LIST_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_LIST_DATA_TYPE, \ - GArrowListDataTypeClass)) -#define GARROW_IS_LIST_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_LIST_DATA_TYPE)) -#define GARROW_IS_LIST_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_LIST_DATA_TYPE)) -#define GARROW_LIST_DATA_TYPE_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_LIST_DATA_TYPE, \ - GArrowListDataTypeClass)) - -typedef struct _GArrowListDataType GArrowListDataType; -typedef struct _GArrowListDataTypeClass GArrowListDataTypeClass; - -/** - * GArrowListDataType: - * - * It wraps `arrow::ListType`. 
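List sketch: garrow_list_data_type_new() only copies the wrapped arrow::Field (as the implementation being deleted above shows), and garrow_list_data_type_get_value_field() is documented (transfer full), so the caller must release the returned field. garrow_field_new() is again assumed from the field API, which is not part of this hunk.

#include <arrow-glib/arrow-glib.h>

static void
round_trip_list_value_field(void)
{
  GArrowInt32DataType *element_type = garrow_int32_data_type_new();
  /* garrow_field_new() is assumed API; see the lead-in above. */
  GArrowField *element = garrow_field_new("element",
                                          GARROW_DATA_TYPE(element_type));
  GArrowListDataType *list_type = garrow_list_data_type_new(element);

  /* (transfer full): we own the returned field and must unref it. */
  GArrowField *value_field = garrow_list_data_type_get_value_field(list_type);
  g_object_unref(value_field);

  g_object_unref(list_type);
  g_object_unref(element);
  g_object_unref(element_type);
}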
- */ -struct _GArrowListDataType -{ - /*< private >*/ - GArrowDataType parent_instance; -}; - -struct _GArrowListDataTypeClass -{ - GArrowDataTypeClass parent_class; -}; - -GType garrow_list_data_type_get_type (void) G_GNUC_CONST; - -GArrowListDataType *garrow_list_data_type_new (GArrowField *field); - -GArrowField *garrow_list_data_type_get_value_field (GArrowListDataType *list_data_type); - -G_END_DECLS diff --git a/c_glib/arrow-glib/null-data-type.cpp b/c_glib/arrow-glib/null-data-type.cpp deleted file mode 100644 index 1f75d3bb88c37..0000000000000 --- a/c_glib/arrow-glib/null-data-type.cpp +++ /dev/null @@ -1,67 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: null-data-type - * @short_description: Null data type - * - * #GArrowNullDataType is a class for null data type. - */ - -G_DEFINE_TYPE(GArrowNullDataType, \ - garrow_null_data_type, \ - GARROW_TYPE_DATA_TYPE) - -static void -garrow_null_data_type_init(GArrowNullDataType *object) -{ -} - -static void -garrow_null_data_type_class_init(GArrowNullDataTypeClass *klass) -{ -} - -/** - * garrow_null_data_type_new: - * - * Returns: The newly created null data type. - */ -GArrowNullDataType * -garrow_null_data_type_new(void) -{ - auto arrow_data_type = arrow::null(); - - GArrowNullDataType *data_type = - GARROW_NULL_DATA_TYPE(g_object_new(GARROW_TYPE_NULL_DATA_TYPE, - "data-type", &arrow_data_type, - NULL)); - return data_type; -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/null-data-type.h b/c_glib/arrow-glib/null-data-type.h deleted file mode 100644 index 006b76c961f3b..0000000000000 --- a/c_glib/arrow-glib/null-data-type.h +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_NULL_DATA_TYPE \ - (garrow_null_data_type_get_type()) -#define GARROW_NULL_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_NULL_DATA_TYPE, \ - GArrowNullDataType)) -#define GARROW_NULL_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_NULL_DATA_TYPE, \ - GArrowNullDataTypeClass)) -#define GARROW_IS_NULL_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_NULL_DATA_TYPE)) -#define GARROW_IS_NULL_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_NULL_DATA_TYPE)) -#define GARROW_NULL_DATA_TYPE_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_NULL_DATA_TYPE, \ - GArrowNullDataTypeClass)) - -typedef struct _GArrowNullDataType GArrowNullDataType; -typedef struct _GArrowNullDataTypeClass GArrowNullDataTypeClass; - -/** - * GArrowNullDataType: - * - * It wraps `arrow::NullType`. - */ -struct _GArrowNullDataType -{ - /*< private >*/ - GArrowDataType parent_instance; -}; - -struct _GArrowNullDataTypeClass -{ - GArrowDataTypeClass parent_class; -}; - -GType garrow_null_data_type_get_type (void) G_GNUC_CONST; -GArrowNullDataType *garrow_null_data_type_new (void); - -G_END_DECLS diff --git a/c_glib/arrow-glib/string-data-type.cpp b/c_glib/arrow-glib/string-data-type.cpp deleted file mode 100644 index 96a31bf2f906a..0000000000000 --- a/c_glib/arrow-glib/string-data-type.cpp +++ /dev/null @@ -1,68 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: string-data-type - * @short_description: UTF-8 encoded string data type - * - * #GArrowStringDataType is a class for UTF-8 encoded string data - * type. - */ - -G_DEFINE_TYPE(GArrowStringDataType, \ - garrow_string_data_type, \ - GARROW_TYPE_DATA_TYPE) - -static void -garrow_string_data_type_init(GArrowStringDataType *object) -{ -} - -static void -garrow_string_data_type_class_init(GArrowStringDataTypeClass *klass) -{ -} - -/** - * garrow_string_data_type_new: - * - * Returns: The newly created UTF-8 encoded string data type. 
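The scalar data types deleted in this commit all follow one shape: a zero-argument constructor wrapping a shared arrow::DataType. A minimal sketch of their use, limited to functions shown in this series:

    GArrowNullDataType *null_type = garrow_null_data_type_new();
    GArrowStringDataType *utf8_type = garrow_string_data_type_new();
    /* Each is a GObject; release with g_object_unref() when done. */
    g_object_unref(null_type);
    g_object_unref(utf8_type);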
- */ -GArrowStringDataType * -garrow_string_data_type_new(void) -{ - auto arrow_data_type = arrow::utf8(); - - GArrowStringDataType *data_type = - GARROW_STRING_DATA_TYPE(g_object_new(GARROW_TYPE_STRING_DATA_TYPE, - "data-type", &arrow_data_type, - NULL)); - return data_type; -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/string-data-type.h b/c_glib/arrow-glib/string-data-type.h deleted file mode 100644 index d10a325e1bb6c..0000000000000 --- a/c_glib/arrow-glib/string-data-type.h +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_STRING_DATA_TYPE \ - (garrow_string_data_type_get_type()) -#define GARROW_STRING_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_STRING_DATA_TYPE, \ - GArrowStringDataType)) -#define GARROW_STRING_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_STRING_DATA_TYPE, \ - GArrowStringDataTypeClass)) -#define GARROW_IS_STRING_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_STRING_DATA_TYPE)) -#define GARROW_IS_STRING_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_STRING_DATA_TYPE)) -#define GARROW_STRING_DATA_TYPE_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_STRING_DATA_TYPE, \ - GArrowStringDataTypeClass)) - -typedef struct _GArrowStringDataType GArrowStringDataType; -typedef struct _GArrowStringDataTypeClass GArrowStringDataTypeClass; - -/** - * GArrowStringDataType: - * - * It wraps `arrow::StringType`. - */ -struct _GArrowStringDataType -{ - /*< private >*/ - GArrowDataType parent_instance; -}; - -struct _GArrowStringDataTypeClass -{ - GArrowDataTypeClass parent_class; -}; - -GType garrow_string_data_type_get_type (void) G_GNUC_CONST; -GArrowStringDataType *garrow_string_data_type_new (void); - -G_END_DECLS diff --git a/c_glib/arrow-glib/struct-array-builder.h b/c_glib/arrow-glib/struct-array-builder.h index 7dd86625616e3..237b2b3264f24 100644 --- a/c_glib/arrow-glib/struct-array-builder.h +++ b/c_glib/arrow-glib/struct-array-builder.h @@ -20,7 +20,7 @@ #pragma once #include -#include +#include G_BEGIN_DECLS diff --git a/c_glib/arrow-glib/struct-data-type.cpp b/c_glib/arrow-glib/struct-data-type.cpp deleted file mode 100644 index 9a4f2a2deead0..0000000000000 --- a/c_glib/arrow-glib/struct-data-type.cpp +++ /dev/null @@ -1,75 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. 
The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: struct-data-type - * @short_description: Struct data type - * - * #GArrowStructDataType is a class for struct data type. - */ - -G_DEFINE_TYPE(GArrowStructDataType, \ - garrow_struct_data_type, \ - GARROW_TYPE_DATA_TYPE) - -static void -garrow_struct_data_type_init(GArrowStructDataType *object) -{ -} - -static void -garrow_struct_data_type_class_init(GArrowStructDataTypeClass *klass) -{ -} - -/** - * garrow_struct_data_type_new: - * @fields: (element-type GArrowField): The fields of the struct. - * - * Returns: The newly created struct data type. - */ -GArrowStructDataType * -garrow_struct_data_type_new(GList *fields) -{ - std::vector> arrow_fields; - for (GList *node = fields; node; node = g_list_next(node)) { - auto field = GARROW_FIELD(node->data); - auto arrow_field = garrow_field_get_raw(field); - arrow_fields.push_back(arrow_field); - } - - auto arrow_data_type = std::make_shared(arrow_fields); - GArrowStructDataType *data_type = - GARROW_STRUCT_DATA_TYPE(g_object_new(GARROW_TYPE_STRUCT_DATA_TYPE, - "data-type", &arrow_data_type, - NULL)); - return data_type; -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/struct-data-type.h b/c_glib/arrow-glib/struct-data-type.h deleted file mode 100644 index 0a2c743e280b7..0000000000000 --- a/c_glib/arrow-glib/struct-data-type.h +++ /dev/null @@ -1,71 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
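garrow_struct_data_type_new() takes a GList of #GArrowField, so callers assemble the fields first. A minimal sketch, assuming garrow_field_new() from the field API and illustrative field names:

    GArrowInt8DataType *int8_type = garrow_int8_data_type_new();
    GArrowStringDataType *utf8_type = garrow_string_data_type_new();
    GList *fields = NULL;
    fields = g_list_append(fields,
                           garrow_field_new("score",
                                            GARROW_DATA_TYPE(int8_type)));
    fields = g_list_append(fields,
                           garrow_field_new("name",
                                            GARROW_DATA_TYPE(utf8_type)));

    /* struct<score: int8, name: utf8> */
    GArrowStructDataType *struct_type = garrow_struct_data_type_new(fields);

    g_list_free_full(fields, g_object_unref);
    g_object_unref(int8_type);
    g_object_unref(utf8_type);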
- */ - -#pragma once - -#include -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_STRUCT_DATA_TYPE \ - (garrow_struct_data_type_get_type()) -#define GARROW_STRUCT_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_STRUCT_DATA_TYPE, \ - GArrowStructDataType)) -#define GARROW_STRUCT_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_STRUCT_DATA_TYPE, \ - GArrowStructDataTypeClass)) -#define GARROW_IS_STRUCT_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_STRUCT_DATA_TYPE)) -#define GARROW_IS_STRUCT_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_STRUCT_DATA_TYPE)) -#define GARROW_STRUCT_DATA_TYPE_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_STRUCT_DATA_TYPE, \ - GArrowStructDataTypeClass)) - -typedef struct _GArrowStructDataType GArrowStructDataType; -typedef struct _GArrowStructDataTypeClass GArrowStructDataTypeClass; - -/** - * GArrowStructDataType: - * - * It wraps `arrow::StructType`. - */ -struct _GArrowStructDataType -{ - /*< private >*/ - GArrowDataType parent_instance; -}; - -struct _GArrowStructDataTypeClass -{ - GArrowDataTypeClass parent_class; -}; - -GType garrow_struct_data_type_get_type (void) G_GNUC_CONST; - -GArrowStructDataType *garrow_struct_data_type_new(GList *fields); - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint16-data-type.cpp b/c_glib/arrow-glib/uint16-data-type.cpp deleted file mode 100644 index 918b75d61c3eb..0000000000000 --- a/c_glib/arrow-glib/uint16-data-type.cpp +++ /dev/null @@ -1,67 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: uint16-data-type - * @short_description: 16-bit unsigned integer data type - * - * #GArrowUInt16DataType is a class for 16-bit unsigned integer data type. - */ - -G_DEFINE_TYPE(GArrowUInt16DataType, \ - garrow_uint16_data_type, \ - GARROW_TYPE_DATA_TYPE) - -static void -garrow_uint16_data_type_init(GArrowUInt16DataType *object) -{ -} - -static void -garrow_uint16_data_type_class_init(GArrowUInt16DataTypeClass *klass) -{ -} - -/** - * garrow_uint16_data_type_new: - * - * Returns: The newly created 16-bit unsigned integer data type. 
- */ -GArrowUInt16DataType * -garrow_uint16_data_type_new(void) -{ - auto arrow_data_type = arrow::uint16(); - - GArrowUInt16DataType *data_type = - GARROW_UINT16_DATA_TYPE(g_object_new(GARROW_TYPE_UINT16_DATA_TYPE, - "data-type", &arrow_data_type, - NULL)); - return data_type; -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint16-data-type.h b/c_glib/arrow-glib/uint16-data-type.h deleted file mode 100644 index b65189d888fcd..0000000000000 --- a/c_glib/arrow-glib/uint16-data-type.h +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_UINT16_DATA_TYPE \ - (garrow_uint16_data_type_get_type()) -#define GARROW_UINT16_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_UINT16_DATA_TYPE, \ - GArrowUInt16DataType)) -#define GARROW_UINT16_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_UINT16_DATA_TYPE, \ - GArrowUInt16DataTypeClass)) -#define GARROW_IS_UINT16_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_UINT16_DATA_TYPE)) -#define GARROW_IS_UINT16_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_UINT16_DATA_TYPE)) -#define GARROW_UINT16_DATA_TYPE_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_UINT16_DATA_TYPE, \ - GArrowUInt16DataTypeClass)) - -typedef struct _GArrowUInt16DataType GArrowUInt16DataType; -typedef struct _GArrowUInt16DataTypeClass GArrowUInt16DataTypeClass; - -/** - * GArrowUInt16DataType: - * - * It wraps `arrow::UInt16Type`. - */ -struct _GArrowUInt16DataType -{ - /*< private >*/ - GArrowDataType parent_instance; -}; - -struct _GArrowUInt16DataTypeClass -{ - GArrowDataTypeClass parent_class; -}; - -GType garrow_uint16_data_type_get_type (void) G_GNUC_CONST; -GArrowUInt16DataType *garrow_uint16_data_type_new (void); - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint32-data-type.cpp b/c_glib/arrow-glib/uint32-data-type.cpp deleted file mode 100644 index fde14f3274174..0000000000000 --- a/c_glib/arrow-glib/uint32-data-type.cpp +++ /dev/null @@ -1,67 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: uint32-data-type - * @short_description: 32-bit unsigned integer data type - * - * #GArrowUInt32DataType is a class for 32-bit unsigned integer data type. - */ - -G_DEFINE_TYPE(GArrowUInt32DataType, \ - garrow_uint32_data_type, \ - GARROW_TYPE_DATA_TYPE) - -static void -garrow_uint32_data_type_init(GArrowUInt32DataType *object) -{ -} - -static void -garrow_uint32_data_type_class_init(GArrowUInt32DataTypeClass *klass) -{ -} - -/** - * garrow_uint32_data_type_new: - * - * Returns: The newly created 32-bit unsigned integer data type. - */ -GArrowUInt32DataType * -garrow_uint32_data_type_new(void) -{ - auto arrow_data_type = arrow::uint32(); - - GArrowUInt32DataType *data_type = - GARROW_UINT32_DATA_TYPE(g_object_new(GARROW_TYPE_UINT32_DATA_TYPE, - "data-type", &arrow_data_type, - NULL)); - return data_type; -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint32-data-type.h b/c_glib/arrow-glib/uint32-data-type.h deleted file mode 100644 index 4fe60cd850ba8..0000000000000 --- a/c_glib/arrow-glib/uint32-data-type.h +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_UINT32_DATA_TYPE \ - (garrow_uint32_data_type_get_type()) -#define GARROW_UINT32_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_UINT32_DATA_TYPE, \ - GArrowUInt32DataType)) -#define GARROW_UINT32_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_UINT32_DATA_TYPE, \ - GArrowUInt32DataTypeClass)) -#define GARROW_IS_UINT32_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_UINT32_DATA_TYPE)) -#define GARROW_IS_UINT32_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_UINT32_DATA_TYPE)) -#define GARROW_UINT32_DATA_TYPE_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_UINT32_DATA_TYPE, \ - GArrowUInt32DataTypeClass)) - -typedef struct _GArrowUInt32DataType GArrowUInt32DataType; -typedef struct _GArrowUInt32DataTypeClass GArrowUInt32DataTypeClass; - -/** - * GArrowUInt32DataType: - * - * It wraps `arrow::UInt32Type`. 
- */ -struct _GArrowUInt32DataType -{ - /*< private >*/ - GArrowDataType parent_instance; -}; - -struct _GArrowUInt32DataTypeClass -{ - GArrowDataTypeClass parent_class; -}; - -GType garrow_uint32_data_type_get_type (void) G_GNUC_CONST; -GArrowUInt32DataType *garrow_uint32_data_type_new (void); - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint64-data-type.cpp b/c_glib/arrow-glib/uint64-data-type.cpp deleted file mode 100644 index 7c18b36a01b3b..0000000000000 --- a/c_glib/arrow-glib/uint64-data-type.cpp +++ /dev/null @@ -1,67 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: uint64-data-type - * @short_description: 64-bit unsigned integer data type - * - * #GArrowUInt64DataType is a class for 64-bit unsigned integer data type. - */ - -G_DEFINE_TYPE(GArrowUInt64DataType, \ - garrow_uint64_data_type, \ - GARROW_TYPE_DATA_TYPE) - -static void -garrow_uint64_data_type_init(GArrowUInt64DataType *object) -{ -} - -static void -garrow_uint64_data_type_class_init(GArrowUInt64DataTypeClass *klass) -{ -} - -/** - * garrow_uint64_data_type_new: - * - * Returns: The newly created 64-bit unsigned integer data type. - */ -GArrowUInt64DataType * -garrow_uint64_data_type_new(void) -{ - auto arrow_data_type = arrow::uint64(); - - GArrowUInt64DataType *data_type = - GARROW_UINT64_DATA_TYPE(g_object_new(GARROW_TYPE_UINT64_DATA_TYPE, - "data-type", &arrow_data_type, - NULL)); - return data_type; -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint64-data-type.h b/c_glib/arrow-glib/uint64-data-type.h deleted file mode 100644 index 221023c863818..0000000000000 --- a/c_glib/arrow-glib/uint64-data-type.h +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_UINT64_DATA_TYPE \ - (garrow_uint64_data_type_get_type()) -#define GARROW_UINT64_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_UINT64_DATA_TYPE, \ - GArrowUInt64DataType)) -#define GARROW_UINT64_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_UINT64_DATA_TYPE, \ - GArrowUInt64DataTypeClass)) -#define GARROW_IS_UINT64_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_UINT64_DATA_TYPE)) -#define GARROW_IS_UINT64_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_UINT64_DATA_TYPE)) -#define GARROW_UINT64_DATA_TYPE_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_UINT64_DATA_TYPE, \ - GArrowUInt64DataTypeClass)) - -typedef struct _GArrowUInt64DataType GArrowUInt64DataType; -typedef struct _GArrowUInt64DataTypeClass GArrowUInt64DataTypeClass; - -/** - * GArrowUInt64DataType: - * - * It wraps `arrow::UInt64Type`. - */ -struct _GArrowUInt64DataType -{ - /*< private >*/ - GArrowDataType parent_instance; -}; - -struct _GArrowUInt64DataTypeClass -{ - GArrowDataTypeClass parent_class; -}; - -GType garrow_uint64_data_type_get_type (void) G_GNUC_CONST; -GArrowUInt64DataType *garrow_uint64_data_type_new (void); - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint8-data-type.cpp b/c_glib/arrow-glib/uint8-data-type.cpp deleted file mode 100644 index 7c93e455a4e96..0000000000000 --- a/c_glib/arrow-glib/uint8-data-type.cpp +++ /dev/null @@ -1,67 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: uint8-data-type - * @short_description: 8-bit unsigned integer data type - * - * #GArrowUInt8DataType is a class for 8-bit unsigned integer data type. - */ - -G_DEFINE_TYPE(GArrowUInt8DataType, \ - garrow_uint8_data_type, \ - GARROW_TYPE_DATA_TYPE) - -static void -garrow_uint8_data_type_init(GArrowUInt8DataType *object) -{ -} - -static void -garrow_uint8_data_type_class_init(GArrowUInt8DataTypeClass *klass) -{ -} - -/** - * garrow_uint8_data_type_new: - * - * Returns: The newly created 8-bit unsigned integer data type. 
- */ -GArrowUInt8DataType * -garrow_uint8_data_type_new(void) -{ - auto arrow_data_type = arrow::uint8(); - - GArrowUInt8DataType *data_type = - GARROW_UINT8_DATA_TYPE(g_object_new(GARROW_TYPE_UINT8_DATA_TYPE, - "data-type", &arrow_data_type, - NULL)); - return data_type; -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint8-data-type.h b/c_glib/arrow-glib/uint8-data-type.h deleted file mode 100644 index 6e058524f4b10..0000000000000 --- a/c_glib/arrow-glib/uint8-data-type.h +++ /dev/null @@ -1,69 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_UINT8_DATA_TYPE \ - (garrow_uint8_data_type_get_type()) -#define GARROW_UINT8_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_UINT8_DATA_TYPE, \ - GArrowUInt8DataType)) -#define GARROW_UINT8_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_UINT8_DATA_TYPE, \ - GArrowUInt8DataTypeClass)) -#define GARROW_IS_UINT8_DATA_TYPE(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_UINT8_DATA_TYPE)) -#define GARROW_IS_UINT8_DATA_TYPE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_UINT8_DATA_TYPE)) -#define GARROW_UINT8_DATA_TYPE_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_UINT8_DATA_TYPE, \ - GArrowUInt8DataTypeClass)) - -typedef struct _GArrowUInt8DataType GArrowUInt8DataType; -typedef struct _GArrowUInt8DataTypeClass GArrowUInt8DataTypeClass; - -/** - * GArrowUInt8DataType: - * - * It wraps `arrow::UInt8Type`. 
- */ -struct _GArrowUInt8DataType -{ - /*< private >*/ - GArrowDataType parent_instance; -}; - -struct _GArrowUInt8DataTypeClass -{ - GArrowDataTypeClass parent_class; -}; - -GType garrow_uint8_data_type_get_type (void) G_GNUC_CONST; -GArrowUInt8DataType *garrow_uint8_data_type_new (void); - -G_END_DECLS diff --git a/c_glib/doc/reference/arrow-glib-docs.sgml b/c_glib/doc/reference/arrow-glib-docs.sgml index 11e6a4de244d4..5df9f64a85c92 100644 --- a/c_glib/doc/reference/arrow-glib-docs.sgml +++ b/c_glib/doc/reference/arrow-glib-docs.sgml @@ -60,22 +60,6 @@ Type - - - - - - - - - - - - - - - - Schema From 423235ccb39737d66e1c47d119879787d9e10847 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Fri, 21 Apr 2017 17:51:31 -0400 Subject: [PATCH 0549/1644] ARROW-868: [GLib] Use GBytes to reduce copy Author: Kouhei Sutou Closes #576 from kou/glib-binary-array-use-gbytes and squashes the following commits: 7aeb799 [Kouhei Sutou] [GLib] Use GBytes to reduce copy --- c_glib/arrow-glib/array.cpp | 13 +++++++------ c_glib/arrow-glib/array.h | 5 ++--- c_glib/test/test-binary-array.rb | 5 +++-- 3 files changed, 12 insertions(+), 11 deletions(-) diff --git a/c_glib/arrow-glib/array.cpp b/c_glib/arrow-glib/array.cpp index c86bff90d40d6..dc1386b0daab9 100644 --- a/c_glib/arrow-glib/array.cpp +++ b/c_glib/arrow-glib/array.cpp @@ -672,19 +672,20 @@ garrow_binary_array_class_init(GArrowBinaryArrayClass *klass) * garrow_binary_array_get_value: * @array: A #GArrowBinaryArray. * @i: The index of the target value. - * @length: (out): The length of the value. * - * Returns: (array length=length): The i-th value. + * Returns: (transfer full): The i-th value. */ -const guint8 * +GBytes * garrow_binary_array_get_value(GArrowBinaryArray *array, - gint64 i, - gint32 *length) + gint64 i) { auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); auto arrow_binary_array = static_cast(arrow_array.get()); - return arrow_binary_array->GetValue(i, length); + + int32_t length; + auto value = arrow_binary_array->GetValue(i, &length); + return g_bytes_new_static(value, length); } diff --git a/c_glib/arrow-glib/array.h b/c_glib/arrow-glib/array.h index b417cdbab3631..74064562d6f39 100644 --- a/c_glib/arrow-glib/array.h +++ b/c_glib/arrow-glib/array.h @@ -660,9 +660,8 @@ struct _GArrowBinaryArrayClass GType garrow_binary_array_get_type(void) G_GNUC_CONST; -const guint8 *garrow_binary_array_get_value(GArrowBinaryArray *array, - gint64 i, - gint32 *length); +GBytes *garrow_binary_array_get_value(GArrowBinaryArray *array, + gint64 i); #define GARROW_TYPE_STRING_ARRAY \ (garrow_string_array_get_type()) diff --git a/c_glib/test/test-binary-array.rb b/c_glib/test/test-binary-array.rb index 82a537ef29e9e..6fe89247c8649 100644 --- a/c_glib/test/test-binary-array.rb +++ b/c_glib/test/test-binary-array.rb @@ -17,9 +17,10 @@ class TestBinaryArray < Test::Unit::TestCase def test_value + data = "\x00\x01\x02" builder = Arrow::BinaryArrayBuilder.new - builder.append("\x00\x01\x02") + builder.append(data) array = builder.finish - assert_equal([0, 1, 2], array.get_value(0)) + assert_equal(data, array.get_value(0).to_s) end end From 76dfd9878529c010b43726058ef3e913a78501f0 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Sat, 22 Apr 2017 10:43:01 -0400 Subject: [PATCH 0550/1644] ARROW-876: [GLib] Unify ArrayBuilder files Author: Kouhei Sutou Closes #581 from kou/glib-array-builder-unify-file and squashes the following commits: 5449d50 [Kouhei Sutou] [GLib] Unify ArrayBuilder files --- c_glib/arrow-glib/Makefile.am | 32 +- 
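The ARROW-868 change above swaps the (array length=length) out-parameter for a GBytes return built with g_bytes_new_static(), so the value buffer is wrapped rather than copied. A caller sketch; array is assumed to be a GArrowBinaryArray obtained elsewhere, and because the GBytes is static-backed the data stays valid only while the array is alive:

    GBytes *value = garrow_binary_array_get_value(array, 0);
    gsize size;
    const guint8 *data = g_bytes_get_data(value, &size);
    /* ... read data[0 .. size - 1] without copying ... */
    g_bytes_unref(value);  /* (transfer full): caller owns the GBytes. */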
c_glib/arrow-glib/array-builder.cpp | 1381 ++++++++++++++++++- c_glib/arrow-glib/array-builder.h | 769 +++++++++++ c_glib/arrow-glib/arrow-glib.h | 15 - c_glib/arrow-glib/binary-array-builder.cpp | 122 -- c_glib/arrow-glib/binary-array-builder.h | 77 -- c_glib/arrow-glib/boolean-array-builder.cpp | 120 -- c_glib/arrow-glib/boolean-array-builder.h | 76 - c_glib/arrow-glib/double-array-builder.cpp | 120 -- c_glib/arrow-glib/double-array-builder.h | 76 - c_glib/arrow-glib/float-array-builder.cpp | 120 -- c_glib/arrow-glib/float-array-builder.h | 76 - c_glib/arrow-glib/int16-array-builder.cpp | 120 -- c_glib/arrow-glib/int16-array-builder.h | 76 - c_glib/arrow-glib/int32-array-builder.cpp | 120 -- c_glib/arrow-glib/int32-array-builder.h | 76 - c_glib/arrow-glib/int64-array-builder.cpp | 120 -- c_glib/arrow-glib/int64-array-builder.h | 76 - c_glib/arrow-glib/int8-array-builder.cpp | 120 -- c_glib/arrow-glib/int8-array-builder.h | 76 - c_glib/arrow-glib/list-array-builder.cpp | 173 --- c_glib/arrow-glib/list-array-builder.h | 77 -- c_glib/arrow-glib/string-array-builder.cpp | 97 -- c_glib/arrow-glib/string-array-builder.h | 74 - c_glib/arrow-glib/struct-array-builder.cpp | 187 --- c_glib/arrow-glib/struct-array-builder.h | 81 -- c_glib/arrow-glib/uint16-array-builder.cpp | 120 -- c_glib/arrow-glib/uint16-array-builder.h | 76 - c_glib/arrow-glib/uint32-array-builder.cpp | 120 -- c_glib/arrow-glib/uint32-array-builder.h | 76 - c_glib/arrow-glib/uint64-array-builder.cpp | 120 -- c_glib/arrow-glib/uint64-array-builder.h | 76 - c_glib/arrow-glib/uint8-array-builder.cpp | 120 -- c_glib/arrow-glib/uint8-array-builder.h | 76 - c_glib/doc/reference/arrow-glib-docs.sgml | 15 - 35 files changed, 2135 insertions(+), 3121 deletions(-) delete mode 100644 c_glib/arrow-glib/binary-array-builder.cpp delete mode 100644 c_glib/arrow-glib/binary-array-builder.h delete mode 100644 c_glib/arrow-glib/boolean-array-builder.cpp delete mode 100644 c_glib/arrow-glib/boolean-array-builder.h delete mode 100644 c_glib/arrow-glib/double-array-builder.cpp delete mode 100644 c_glib/arrow-glib/double-array-builder.h delete mode 100644 c_glib/arrow-glib/float-array-builder.cpp delete mode 100644 c_glib/arrow-glib/float-array-builder.h delete mode 100644 c_glib/arrow-glib/int16-array-builder.cpp delete mode 100644 c_glib/arrow-glib/int16-array-builder.h delete mode 100644 c_glib/arrow-glib/int32-array-builder.cpp delete mode 100644 c_glib/arrow-glib/int32-array-builder.h delete mode 100644 c_glib/arrow-glib/int64-array-builder.cpp delete mode 100644 c_glib/arrow-glib/int64-array-builder.h delete mode 100644 c_glib/arrow-glib/int8-array-builder.cpp delete mode 100644 c_glib/arrow-glib/int8-array-builder.h delete mode 100644 c_glib/arrow-glib/list-array-builder.cpp delete mode 100644 c_glib/arrow-glib/list-array-builder.h delete mode 100644 c_glib/arrow-glib/string-array-builder.cpp delete mode 100644 c_glib/arrow-glib/string-array-builder.h delete mode 100644 c_glib/arrow-glib/struct-array-builder.cpp delete mode 100644 c_glib/arrow-glib/struct-array-builder.h delete mode 100644 c_glib/arrow-glib/uint16-array-builder.cpp delete mode 100644 c_glib/arrow-glib/uint16-array-builder.h delete mode 100644 c_glib/arrow-glib/uint32-array-builder.cpp delete mode 100644 c_glib/arrow-glib/uint32-array-builder.h delete mode 100644 c_glib/arrow-glib/uint64-array-builder.cpp delete mode 100644 c_glib/arrow-glib/uint64-array-builder.h delete mode 100644 c_glib/arrow-glib/uint8-array-builder.cpp delete mode 100644 
c_glib/arrow-glib/uint8-array-builder.h diff --git a/c_glib/arrow-glib/Makefile.am b/c_glib/arrow-glib/Makefile.am index d0c8c799b71cf..bbc11011474bc 100644 --- a/c_glib/arrow-glib/Makefile.am +++ b/c_glib/arrow-glib/Makefile.am @@ -44,32 +44,17 @@ libarrow_glib_la_headers = \ array.h \ array-builder.h \ arrow-glib.h \ - binary-array-builder.h \ - boolean-array-builder.h \ buffer.h \ chunked-array.h \ column.h \ data-type.h \ - double-array-builder.h \ error.h \ field.h \ - float-array-builder.h \ - int8-array-builder.h \ - int16-array-builder.h \ - int32-array-builder.h \ - int64-array-builder.h \ - list-array-builder.h \ record-batch.h \ schema.h \ - string-array-builder.h \ - struct-array-builder.h \ table.h \ tensor.h \ - type.h \ - uint8-array-builder.h \ - uint16-array-builder.h \ - uint32-array-builder.h \ - uint64-array-builder.h + type.h libarrow_glib_la_headers += \ file.h \ @@ -100,32 +85,17 @@ libarrow_glib_la_generated_sources = \ libarrow_glib_la_sources = \ array.cpp \ array-builder.cpp \ - binary-array-builder.cpp \ - boolean-array-builder.cpp \ buffer.cpp \ chunked-array.cpp \ column.cpp \ data-type.cpp \ - double-array-builder.cpp \ error.cpp \ field.cpp \ - float-array-builder.cpp \ - int8-array-builder.cpp \ - int16-array-builder.cpp \ - int32-array-builder.cpp \ - int64-array-builder.cpp \ - list-array-builder.cpp \ record-batch.cpp \ schema.cpp \ - string-array-builder.cpp \ - struct-array-builder.cpp \ table.cpp \ tensor.cpp \ type.cpp \ - uint8-array-builder.cpp \ - uint16-array-builder.cpp \ - uint32-array-builder.cpp \ - uint64-array-builder.cpp \ $(libarrow_glib_la_headers) \ $(libarrow_glib_la_generated_sources) diff --git a/c_glib/arrow-glib/array-builder.cpp b/c_glib/arrow-glib/array-builder.cpp index aea93d00bafe4..97d43e1f0c022 100644 --- a/c_glib/arrow-glib/array-builder.cpp +++ b/c_glib/arrow-glib/array-builder.cpp @@ -22,32 +22,66 @@ #endif #include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include +#include +#include G_BEGIN_DECLS /** * SECTION: array-builder - * @short_description: Base class for all array builder classes. + * @section_id: array-builder-classes + * @title: Array builder classes + * @include: arrow-glib/arrow-glib.h * * #GArrowArrayBuilder is a base class for all array builder classes * such as #GArrowBooleanArrayBuilder. * * You need to use array builder class to create a new array. + * + * #GArrowBooleanArrayBuilder is the class to create a new + * #GArrowBooleanArray. + * + * #GArrowInt8ArrayBuilder is the class to create a new + * #GArrowInt8Array. + * + * #GArrowUInt8ArrayBuilder is the class to create a new + * #GArrowUInt8Array. + * + * #GArrowInt16ArrayBuilder is the class to create a new + * #GArrowInt16Array. + * + * #GArrowUInt16ArrayBuilder is the class to create a new + * #GArrowUInt16Array. + * + * #GArrowInt32ArrayBuilder is the class to create a new + * #GArrowInt32Array. + * + * #GArrowUInt32ArrayBuilder is the class to create a new + * #GArrowUInt32Array. + * + * #GArrowInt64ArrayBuilder is the class to create a new + * #GArrowInt64Array. + * + * #GArrowUInt64ArrayBuilder is the class to create a new + * #GArrowUInt64Array. + * + * #GArrowFloatArrayBuilder is the class to creating a new + * #GArrowFloatArray. + * + * #GArrowDoubleArrayBuilder is the class to create a new + * #GArrowDoubleArray. + * + * #GArrowBinaryArrayBuilder is the class to create a new + * #GArrowBinaryArray. 
+ * + * #GArrowStringArrayBuilder is the class to create a new + * #GArrowStringArray. + * + * #GArrowListArrayBuilder is the class to create a new + * #GArrowListArray. + * + * #GArrowStructArrayBuilder is the class to create a new + * #GArrowStructArray. */ typedef struct GArrowArrayBuilderPrivate_ { @@ -154,6 +188,1321 @@ garrow_array_builder_finish(GArrowArrayBuilder *builder) return garrow_array_new_raw(&arrow_array); } + +G_DEFINE_TYPE(GArrowBooleanArrayBuilder, + garrow_boolean_array_builder, + GARROW_TYPE_ARRAY_BUILDER) + +static void +garrow_boolean_array_builder_init(GArrowBooleanArrayBuilder *builder) +{ +} + +static void +garrow_boolean_array_builder_class_init(GArrowBooleanArrayBuilderClass *klass) +{ +} + +/** + * garrow_boolean_array_builder_new: + * + * Returns: A newly created #GArrowBooleanArrayBuilder. + */ +GArrowBooleanArrayBuilder * +garrow_boolean_array_builder_new(void) +{ + auto memory_pool = arrow::default_memory_pool(); + auto arrow_boolean_builder = + std::make_shared(memory_pool); + std::shared_ptr arrow_builder = arrow_boolean_builder; + auto builder = garrow_array_builder_new_raw(&arrow_builder); + return GARROW_BOOLEAN_ARRAY_BUILDER(builder); +} + +/** + * garrow_boolean_array_builder_append: + * @builder: A #GArrowBooleanArrayBuilder. + * @value: A boolean value. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_boolean_array_builder_append(GArrowBooleanArrayBuilder *builder, + gboolean value, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->Append(static_cast(value)); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[boolean-array-builder][append]"); + return FALSE; + } +} + +/** + * garrow_boolean_array_builder_append_null: + * @builder: A #GArrowBooleanArrayBuilder. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_boolean_array_builder_append_null(GArrowBooleanArrayBuilder *builder, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->AppendNull(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[boolean-array-builder][append-null]"); + return FALSE; + } +} + + +G_DEFINE_TYPE(GArrowInt8ArrayBuilder, + garrow_int8_array_builder, + GARROW_TYPE_ARRAY_BUILDER) + +static void +garrow_int8_array_builder_init(GArrowInt8ArrayBuilder *builder) +{ +} + +static void +garrow_int8_array_builder_class_init(GArrowInt8ArrayBuilderClass *klass) +{ +} + +/** + * garrow_int8_array_builder_new: + * + * Returns: A newly created #GArrowInt8ArrayBuilder. + */ +GArrowInt8ArrayBuilder * +garrow_int8_array_builder_new(void) +{ + auto memory_pool = arrow::default_memory_pool(); + auto arrow_int8_builder = + std::make_shared(memory_pool, arrow::int8()); + std::shared_ptr arrow_builder = arrow_int8_builder; + auto builder = garrow_array_builder_new_raw(&arrow_builder); + return GARROW_INT8_ARRAY_BUILDER(builder); +} + +/** + * garrow_int8_array_builder_append: + * @builder: A #GArrowInt8ArrayBuilder. + * @value: A int8 value. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. 
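Every builder gathered into this unified file follows the same protocol: a zero-argument constructor, append/append_null with a GError out-parameter, then garrow_array_builder_finish() on the base class. A minimal sketch with the boolean builder:

    GError *error = NULL;
    GArrowBooleanArrayBuilder *builder = garrow_boolean_array_builder_new();
    if (!garrow_boolean_array_builder_append(builder, TRUE, &error) ||
        !garrow_boolean_array_builder_append_null(builder, &error)) {
      g_print("append failed: %s\n", error->message);
      g_error_free(error);
    } else {
      /* Finish on the base class; the result is a GArrowBooleanArray
         holding [true, null]. */
      GArrowArray *array =
        garrow_array_builder_finish(GARROW_ARRAY_BUILDER(builder));
      g_object_unref(array);
    }
    g_object_unref(builder);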
+ */ +gboolean +garrow_int8_array_builder_append(GArrowInt8ArrayBuilder *builder, + gint8 value, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->Append(value); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[int8-array-builder][append]"); + return FALSE; + } +} + +/** + * garrow_int8_array_builder_append_null: + * @builder: A #GArrowInt8ArrayBuilder. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_int8_array_builder_append_null(GArrowInt8ArrayBuilder *builder, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->AppendNull(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[int8-array-builder][append-null]"); + return FALSE; + } +} + + +G_DEFINE_TYPE(GArrowUInt8ArrayBuilder, + garrow_uint8_array_builder, + GARROW_TYPE_ARRAY_BUILDER) + +static void +garrow_uint8_array_builder_init(GArrowUInt8ArrayBuilder *builder) +{ +} + +static void +garrow_uint8_array_builder_class_init(GArrowUInt8ArrayBuilderClass *klass) +{ +} + +/** + * garrow_uint8_array_builder_new: + * + * Returns: A newly created #GArrowUInt8ArrayBuilder. + */ +GArrowUInt8ArrayBuilder * +garrow_uint8_array_builder_new(void) +{ + auto memory_pool = arrow::default_memory_pool(); + auto arrow_uint8_builder = + std::make_shared(memory_pool, arrow::uint8()); + std::shared_ptr arrow_builder = arrow_uint8_builder; + auto builder = garrow_array_builder_new_raw(&arrow_builder); + return GARROW_UINT8_ARRAY_BUILDER(builder); +} + +/** + * garrow_uint8_array_builder_append: + * @builder: A #GArrowUInt8ArrayBuilder. + * @value: An uint8 value. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_uint8_array_builder_append(GArrowUInt8ArrayBuilder *builder, + guint8 value, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->Append(value); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[uint8-array-builder][append]"); + return FALSE; + } +} + +/** + * garrow_uint8_array_builder_append_null: + * @builder: A #GArrowUInt8ArrayBuilder. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_uint8_array_builder_append_null(GArrowUInt8ArrayBuilder *builder, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->AppendNull(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[uint8-array-builder][append-null]"); + return FALSE; + } +} + + +G_DEFINE_TYPE(GArrowInt16ArrayBuilder, + garrow_int16_array_builder, + GARROW_TYPE_ARRAY_BUILDER) + +static void +garrow_int16_array_builder_init(GArrowInt16ArrayBuilder *builder) +{ +} + +static void +garrow_int16_array_builder_class_init(GArrowInt16ArrayBuilderClass *klass) +{ +} + +/** + * garrow_int16_array_builder_new: + * + * Returns: A newly created #GArrowInt16ArrayBuilder. 
+ */ +GArrowInt16ArrayBuilder * +garrow_int16_array_builder_new(void) +{ + auto memory_pool = arrow::default_memory_pool(); + auto arrow_int16_builder = + std::make_shared(memory_pool, arrow::int16()); + std::shared_ptr arrow_builder = arrow_int16_builder; + auto builder = garrow_array_builder_new_raw(&arrow_builder); + return GARROW_INT16_ARRAY_BUILDER(builder); +} + +/** + * garrow_int16_array_builder_append: + * @builder: A #GArrowInt16ArrayBuilder. + * @value: A int16 value. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_int16_array_builder_append(GArrowInt16ArrayBuilder *builder, + gint16 value, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->Append(value); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[int16-array-builder][append]"); + return FALSE; + } +} + +/** + * garrow_int16_array_builder_append_null: + * @builder: A #GArrowInt16ArrayBuilder. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_int16_array_builder_append_null(GArrowInt16ArrayBuilder *builder, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->AppendNull(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[int16-array-builder][append-null]"); + return FALSE; + } +} + + +G_DEFINE_TYPE(GArrowUInt16ArrayBuilder, + garrow_uint16_array_builder, + GARROW_TYPE_ARRAY_BUILDER) + +static void +garrow_uint16_array_builder_init(GArrowUInt16ArrayBuilder *builder) +{ +} + +static void +garrow_uint16_array_builder_class_init(GArrowUInt16ArrayBuilderClass *klass) +{ +} + +/** + * garrow_uint16_array_builder_new: + * + * Returns: A newly created #GArrowUInt16ArrayBuilder. + */ +GArrowUInt16ArrayBuilder * +garrow_uint16_array_builder_new(void) +{ + auto memory_pool = arrow::default_memory_pool(); + auto arrow_uint16_builder = + std::make_shared(memory_pool, arrow::uint16()); + std::shared_ptr arrow_builder = arrow_uint16_builder; + auto builder = garrow_array_builder_new_raw(&arrow_builder); + return GARROW_UINT16_ARRAY_BUILDER(builder); +} + +/** + * garrow_uint16_array_builder_append: + * @builder: A #GArrowUInt16ArrayBuilder. + * @value: An uint16 value. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_uint16_array_builder_append(GArrowUInt16ArrayBuilder *builder, + guint16 value, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->Append(value); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[uint16-array-builder][append]"); + return FALSE; + } +} + +/** + * garrow_uint16_array_builder_append_null: + * @builder: A #GArrowUInt16ArrayBuilder. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. 
+ */ +gboolean +garrow_uint16_array_builder_append_null(GArrowUInt16ArrayBuilder *builder, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->AppendNull(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[uint16-array-builder][append-null]"); + return FALSE; + } +} + + +G_DEFINE_TYPE(GArrowInt32ArrayBuilder, + garrow_int32_array_builder, + GARROW_TYPE_ARRAY_BUILDER) + +static void +garrow_int32_array_builder_init(GArrowInt32ArrayBuilder *builder) +{ +} + +static void +garrow_int32_array_builder_class_init(GArrowInt32ArrayBuilderClass *klass) +{ +} + +/** + * garrow_int32_array_builder_new: + * + * Returns: A newly created #GArrowInt32ArrayBuilder. + */ +GArrowInt32ArrayBuilder * +garrow_int32_array_builder_new(void) +{ + auto memory_pool = arrow::default_memory_pool(); + auto arrow_int32_builder = + std::make_shared(memory_pool, arrow::int32()); + std::shared_ptr arrow_builder = arrow_int32_builder; + auto builder = garrow_array_builder_new_raw(&arrow_builder); + return GARROW_INT32_ARRAY_BUILDER(builder); +} + +/** + * garrow_int32_array_builder_append: + * @builder: A #GArrowInt32ArrayBuilder. + * @value: A int32 value. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_int32_array_builder_append(GArrowInt32ArrayBuilder *builder, + gint32 value, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->Append(value); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[int32-array-builder][append]"); + return FALSE; + } +} + +/** + * garrow_int32_array_builder_append_null: + * @builder: A #GArrowInt32ArrayBuilder. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_int32_array_builder_append_null(GArrowInt32ArrayBuilder *builder, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->AppendNull(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[int32-array-builder][append-null]"); + return FALSE; + } +} + + +G_DEFINE_TYPE(GArrowUInt32ArrayBuilder, + garrow_uint32_array_builder, + GARROW_TYPE_ARRAY_BUILDER) + +static void +garrow_uint32_array_builder_init(GArrowUInt32ArrayBuilder *builder) +{ +} + +static void +garrow_uint32_array_builder_class_init(GArrowUInt32ArrayBuilderClass *klass) +{ +} + +/** + * garrow_uint32_array_builder_new: + * + * Returns: A newly created #GArrowUInt32ArrayBuilder. + */ +GArrowUInt32ArrayBuilder * +garrow_uint32_array_builder_new(void) +{ + auto memory_pool = arrow::default_memory_pool(); + auto arrow_uint32_builder = + std::make_shared(memory_pool, arrow::uint32()); + std::shared_ptr arrow_builder = arrow_uint32_builder; + auto builder = garrow_array_builder_new_raw(&arrow_builder); + return GARROW_UINT32_ARRAY_BUILDER(builder); +} + +/** + * garrow_uint32_array_builder_append: + * @builder: A #GArrowUInt32ArrayBuilder. + * @value: An uint32 value. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. 
+ */ +gboolean +garrow_uint32_array_builder_append(GArrowUInt32ArrayBuilder *builder, + guint32 value, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->Append(value); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[uint32-array-builder][append]"); + return FALSE; + } +} + +/** + * garrow_uint32_array_builder_append_null: + * @builder: A #GArrowUInt32ArrayBuilder. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_uint32_array_builder_append_null(GArrowUInt32ArrayBuilder *builder, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->AppendNull(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[uint32-array-builder][append-null]"); + return FALSE; + } +} + + +G_DEFINE_TYPE(GArrowInt64ArrayBuilder, + garrow_int64_array_builder, + GARROW_TYPE_ARRAY_BUILDER) + +static void +garrow_int64_array_builder_init(GArrowInt64ArrayBuilder *builder) +{ +} + +static void +garrow_int64_array_builder_class_init(GArrowInt64ArrayBuilderClass *klass) +{ +} + +/** + * garrow_int64_array_builder_new: + * + * Returns: A newly created #GArrowInt64ArrayBuilder. + */ +GArrowInt64ArrayBuilder * +garrow_int64_array_builder_new(void) +{ + auto memory_pool = arrow::default_memory_pool(); + auto arrow_int64_builder = + std::make_shared(memory_pool, arrow::int64()); + std::shared_ptr arrow_builder = arrow_int64_builder; + auto builder = garrow_array_builder_new_raw(&arrow_builder); + return GARROW_INT64_ARRAY_BUILDER(builder); +} + +/** + * garrow_int64_array_builder_append: + * @builder: A #GArrowInt64ArrayBuilder. + * @value: A int64 value. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_int64_array_builder_append(GArrowInt64ArrayBuilder *builder, + gint64 value, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->Append(value); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[int64-array-builder][append]"); + return FALSE; + } +} + +/** + * garrow_int64_array_builder_append_null: + * @builder: A #GArrowInt64ArrayBuilder. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + */ +gboolean +garrow_int64_array_builder_append_null(GArrowInt64ArrayBuilder *builder, + GError **error) +{ + auto arrow_builder = + static_cast( + garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); + + auto status = arrow_builder->AppendNull(); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[int64-array-builder][append-null]"); + return FALSE; + } +} + + +G_DEFINE_TYPE(GArrowUInt64ArrayBuilder, + garrow_uint64_array_builder, + GARROW_TYPE_ARRAY_BUILDER) + +static void +garrow_uint64_array_builder_init(GArrowUInt64ArrayBuilder *builder) +{ +} + +static void +garrow_uint64_array_builder_class_init(GArrowUInt64ArrayBuilderClass *klass) +{ +} + +/** + * garrow_uint64_array_builder_new: + * + * Returns: A newly created #GArrowUInt64ArrayBuilder. 
+ */
+GArrowUInt64ArrayBuilder *
+garrow_uint64_array_builder_new(void)
+{
+  auto memory_pool = arrow::default_memory_pool();
+  auto arrow_uint64_builder =
+    std::make_shared<arrow::UInt64Builder>(memory_pool, arrow::uint64());
+  std::shared_ptr<arrow::ArrayBuilder> arrow_builder = arrow_uint64_builder;
+  auto builder = garrow_array_builder_new_raw(&arrow_builder);
+  return GARROW_UINT64_ARRAY_BUILDER(builder);
+}
+
+/**
+ * garrow_uint64_array_builder_append:
+ * @builder: A #GArrowUInt64ArrayBuilder.
+ * @value: A uint64 value.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ */
+gboolean
+garrow_uint64_array_builder_append(GArrowUInt64ArrayBuilder *builder,
+                                   guint64 value,
+                                   GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::UInt64Builder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->Append(value);
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[uint64-array-builder][append]");
+    return FALSE;
+  }
+}
+
+/**
+ * garrow_uint64_array_builder_append_null:
+ * @builder: A #GArrowUInt64ArrayBuilder.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ */
+gboolean
+garrow_uint64_array_builder_append_null(GArrowUInt64ArrayBuilder *builder,
+                                        GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::UInt64Builder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->AppendNull();
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[uint64-array-builder][append-null]");
+    return FALSE;
+  }
+}
+
+
+G_DEFINE_TYPE(GArrowFloatArrayBuilder,
+              garrow_float_array_builder,
+              GARROW_TYPE_ARRAY_BUILDER)
+
+static void
+garrow_float_array_builder_init(GArrowFloatArrayBuilder *builder)
+{
+}
+
+static void
+garrow_float_array_builder_class_init(GArrowFloatArrayBuilderClass *klass)
+{
+}
+
+/**
+ * garrow_float_array_builder_new:
+ *
+ * Returns: A newly created #GArrowFloatArrayBuilder.
+ */
+GArrowFloatArrayBuilder *
+garrow_float_array_builder_new(void)
+{
+  auto memory_pool = arrow::default_memory_pool();
+  auto arrow_float_builder =
+    std::make_shared<arrow::FloatBuilder>(memory_pool, arrow::float32());
+  std::shared_ptr<arrow::ArrayBuilder> arrow_builder = arrow_float_builder;
+  auto builder = garrow_array_builder_new_raw(&arrow_builder);
+  return GARROW_FLOAT_ARRAY_BUILDER(builder);
+}
+
+/**
+ * garrow_float_array_builder_append:
+ * @builder: A #GArrowFloatArrayBuilder.
+ * @value: A float value.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ */
+gboolean
+garrow_float_array_builder_append(GArrowFloatArrayBuilder *builder,
+                                  gfloat value,
+                                  GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::FloatBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->Append(value);
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[float-array-builder][append]");
+    return FALSE;
+  }
+}
+
+/**
+ * garrow_float_array_builder_append_null:
+ * @builder: A #GArrowFloatArrayBuilder.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
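+ *
+ * A minimal sketch of checking the returned #GError (the failure
+ * message shown is illustrative only):
+ * |[
+ * GError *error = NULL;
+ * if (!garrow_float_array_builder_append_null(builder, &error)) {
+ *   g_print("append-null failed: %s\n", error->message);
+ *   g_clear_error(&error);
+ * }
+ * ]|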
+ */
+gboolean
+garrow_float_array_builder_append_null(GArrowFloatArrayBuilder *builder,
+                                       GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::FloatBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->AppendNull();
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[float-array-builder][append-null]");
+    return FALSE;
+  }
+}
+
+
+G_DEFINE_TYPE(GArrowDoubleArrayBuilder,
+              garrow_double_array_builder,
+              GARROW_TYPE_ARRAY_BUILDER)
+
+static void
+garrow_double_array_builder_init(GArrowDoubleArrayBuilder *builder)
+{
+}
+
+static void
+garrow_double_array_builder_class_init(GArrowDoubleArrayBuilderClass *klass)
+{
+}
+
+/**
+ * garrow_double_array_builder_new:
+ *
+ * Returns: A newly created #GArrowDoubleArrayBuilder.
+ */
+GArrowDoubleArrayBuilder *
+garrow_double_array_builder_new(void)
+{
+  auto memory_pool = arrow::default_memory_pool();
+  auto arrow_double_builder =
+    std::make_shared<arrow::DoubleBuilder>(memory_pool, arrow::float64());
+  std::shared_ptr<arrow::ArrayBuilder> arrow_builder = arrow_double_builder;
+  auto builder = garrow_array_builder_new_raw(&arrow_builder);
+  return GARROW_DOUBLE_ARRAY_BUILDER(builder);
+}
+
+/**
+ * garrow_double_array_builder_append:
+ * @builder: A #GArrowDoubleArrayBuilder.
+ * @value: A double value.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ */
+gboolean
+garrow_double_array_builder_append(GArrowDoubleArrayBuilder *builder,
+                                   gdouble value,
+                                   GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::DoubleBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->Append(value);
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[double-array-builder][append]");
+    return FALSE;
+  }
+}
+
+/**
+ * garrow_double_array_builder_append_null:
+ * @builder: A #GArrowDoubleArrayBuilder.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ */
+gboolean
+garrow_double_array_builder_append_null(GArrowDoubleArrayBuilder *builder,
+                                        GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::DoubleBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->AppendNull();
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[double-array-builder][append-null]");
+    return FALSE;
+  }
+}
+
+
+G_DEFINE_TYPE(GArrowBinaryArrayBuilder,
+              garrow_binary_array_builder,
+              GARROW_TYPE_ARRAY_BUILDER)
+
+static void
+garrow_binary_array_builder_init(GArrowBinaryArrayBuilder *builder)
+{
+}
+
+static void
+garrow_binary_array_builder_class_init(GArrowBinaryArrayBuilderClass *klass)
+{
+}
+
+/**
+ * garrow_binary_array_builder_new:
+ *
+ * Returns: A newly created #GArrowBinaryArrayBuilder.
+ */
+GArrowBinaryArrayBuilder *
+garrow_binary_array_builder_new(void)
+{
+  auto memory_pool = arrow::default_memory_pool();
+  auto arrow_binary_builder =
+    std::make_shared<arrow::BinaryBuilder>(memory_pool, arrow::binary());
+  std::shared_ptr<arrow::ArrayBuilder> arrow_builder = arrow_binary_builder;
+  auto builder = garrow_array_builder_new_raw(&arrow_builder);
+  return GARROW_BINARY_ARRAY_BUILDER(builder);
+}
+
+/**
+ * garrow_binary_array_builder_append:
+ * @builder: A #GArrowBinaryArrayBuilder.
+ * @value: (array length=length): A binary value.
+ * @length: A value length.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
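+ *
+ * A minimal sketch (the bytes are arbitrary sample data; error
+ * handling omitted):
+ * |[
+ * const guint8 data[] = {0x01, 0x02, 0x03};
+ * GArrowBinaryArrayBuilder *builder = garrow_binary_array_builder_new();
+ * garrow_binary_array_builder_append(builder, data, sizeof(data), NULL);
+ * ]|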
+ */
+gboolean
+garrow_binary_array_builder_append(GArrowBinaryArrayBuilder *builder,
+                                   const guint8 *value,
+                                   gint32 length,
+                                   GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::BinaryBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->Append(value, length);
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[binary-array-builder][append]");
+    return FALSE;
+  }
+}
+
+/**
+ * garrow_binary_array_builder_append_null:
+ * @builder: A #GArrowBinaryArrayBuilder.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ */
+gboolean
+garrow_binary_array_builder_append_null(GArrowBinaryArrayBuilder *builder,
+                                        GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::BinaryBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->AppendNull();
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[binary-array-builder][append-null]");
+    return FALSE;
+  }
+}
+
+
+G_DEFINE_TYPE(GArrowStringArrayBuilder,
+              garrow_string_array_builder,
+              GARROW_TYPE_BINARY_ARRAY_BUILDER)
+
+static void
+garrow_string_array_builder_init(GArrowStringArrayBuilder *builder)
+{
+}
+
+static void
+garrow_string_array_builder_class_init(GArrowStringArrayBuilderClass *klass)
+{
+}
+
+/**
+ * garrow_string_array_builder_new:
+ *
+ * Returns: A newly created #GArrowStringArrayBuilder.
+ */
+GArrowStringArrayBuilder *
+garrow_string_array_builder_new(void)
+{
+  auto memory_pool = arrow::default_memory_pool();
+  auto arrow_string_builder =
+    std::make_shared<arrow::StringBuilder>(memory_pool);
+  std::shared_ptr<arrow::ArrayBuilder> arrow_builder = arrow_string_builder;
+  auto builder = garrow_array_builder_new_raw(&arrow_builder);
+  return GARROW_STRING_ARRAY_BUILDER(builder);
+}
+
+/**
+ * garrow_string_array_builder_append:
+ * @builder: A #GArrowStringArrayBuilder.
+ * @value: A string value.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ */
+gboolean
+garrow_string_array_builder_append(GArrowStringArrayBuilder *builder,
+                                   const gchar *value,
+                                   GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::StringBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->Append(value,
+                                      static_cast<gint32>(strlen(value)));
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[string-array-builder][append]");
+    return FALSE;
+  }
+}
+
+
+G_DEFINE_TYPE(GArrowListArrayBuilder,
+              garrow_list_array_builder,
+              GARROW_TYPE_ARRAY_BUILDER)
+
+static void
+garrow_list_array_builder_init(GArrowListArrayBuilder *builder)
+{
+}
+
+static void
+garrow_list_array_builder_class_init(GArrowListArrayBuilderClass *klass)
+{
+}
+
+/**
+ * garrow_list_array_builder_new:
+ * @value_builder: A #GArrowArrayBuilder for value array.
+ *
+ * Returns: A newly created #GArrowListArrayBuilder.
+ */
+GArrowListArrayBuilder *
+garrow_list_array_builder_new(GArrowArrayBuilder *value_builder)
+{
+  auto memory_pool = arrow::default_memory_pool();
+  auto arrow_value_builder = garrow_array_builder_get_raw(value_builder);
+  auto arrow_list_builder =
+    std::make_shared<arrow::ListBuilder>(memory_pool, arrow_value_builder);
+  std::shared_ptr<arrow::ArrayBuilder> arrow_builder = arrow_list_builder;
+  auto builder = garrow_array_builder_new_raw(&arrow_builder);
+  return GARROW_LIST_ARRAY_BUILDER(builder);
+}
+
+/**
+ * garrow_list_array_builder_append:
+ * @builder: A #GArrowListArrayBuilder.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ *
+ * It appends a new list element. To append a new list element, call
+ * this function and then append the list element's values to
+ * `value_builder`. `value_builder` is the #GArrowArrayBuilder passed
+ * to the constructor. You can get `value_builder` with
+ * garrow_list_array_builder_get_value_builder().
+ *
+ * |[
+ * GArrowInt8ArrayBuilder *value_builder;
+ * GArrowListArrayBuilder *builder;
+ *
+ * value_builder = garrow_int8_array_builder_new();
+ * builder = garrow_list_array_builder_new(GARROW_ARRAY_BUILDER(value_builder));
+ *
+ * // Start the 0th list element: [1, 0, -1]
+ * garrow_list_array_builder_append(builder, NULL);
+ * garrow_int8_array_builder_append(value_builder, 1, NULL);
+ * garrow_int8_array_builder_append(value_builder, 0, NULL);
+ * garrow_int8_array_builder_append(value_builder, -1, NULL);
+ *
+ * // Start the 1st list element: [-29, 29]
+ * garrow_list_array_builder_append(builder, NULL);
+ * garrow_int8_array_builder_append(value_builder, -29, NULL);
+ * garrow_int8_array_builder_append(value_builder, 29, NULL);
+ *
+ * {
+ *   // [[1, 0, -1], [-29, 29]]
+ *   GArrowArray *array =
+ *     garrow_array_builder_finish(GARROW_ARRAY_BUILDER(builder));
+ *   // Now, builder is no longer needed.
+ *   g_object_unref(builder);
+ *   g_object_unref(value_builder);
+ *
+ *   // Use array...
+ *   g_object_unref(array);
+ * }
+ * ]|
+ */
+gboolean
+garrow_list_array_builder_append(GArrowListArrayBuilder *builder,
+                                 GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::ListBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->Append();
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[list-array-builder][append]");
+    return FALSE;
+  }
+}
+
+/**
+ * garrow_list_array_builder_append_null:
+ * @builder: A #GArrowListArrayBuilder.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ *
+ * It appends a new NULL element.
+ */
+gboolean
+garrow_list_array_builder_append_null(GArrowListArrayBuilder *builder,
+                                      GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::ListBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->AppendNull();
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[list-array-builder][append-null]");
+    return FALSE;
+  }
+}
+
+/**
+ * garrow_list_array_builder_get_value_builder:
+ * @builder: A #GArrowListArrayBuilder.
+ *
+ * Returns: (transfer full): The #GArrowArrayBuilder for values.
+ */
+GArrowArrayBuilder *
+garrow_list_array_builder_get_value_builder(GArrowListArrayBuilder *builder)
+{
+  auto arrow_builder =
+    static_cast<arrow::ListBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+  auto arrow_value_builder = arrow_builder->value_builder();
+  return garrow_array_builder_new_raw(&arrow_value_builder);
+}
+
+
+G_DEFINE_TYPE(GArrowStructArrayBuilder,
+              garrow_struct_array_builder,
+              GARROW_TYPE_ARRAY_BUILDER)
+
+static void
+garrow_struct_array_builder_init(GArrowStructArrayBuilder *builder)
+{
+}
+
+static void
+garrow_struct_array_builder_class_init(GArrowStructArrayBuilderClass *klass)
+{
+}
+
+/**
+ * garrow_struct_array_builder_new:
+ * @data_type: #GArrowStructDataType for the struct.
+ * @field_builders: (element-type GArrowArrayBuilder): #GArrowArrayBuilders
+ *   for fields.
+ *
+ * Returns: A newly created #GArrowStructArrayBuilder.
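+ *
+ * A minimal sketch of building a one-field struct builder. It
+ * assumes the usual arrow-glib constructors for data types and
+ * fields (garrow_int32_data_type_new(), garrow_field_new() and
+ * garrow_struct_data_type_new()):
+ * |[
+ * GArrowDataType *int32_type =
+ *   GARROW_DATA_TYPE(garrow_int32_data_type_new());
+ * GList *fields =
+ *   g_list_append(NULL, garrow_field_new("count", int32_type));
+ * GArrowStructDataType *data_type = garrow_struct_data_type_new(fields);
+ *
+ * GArrowArrayBuilder *count_builder =
+ *   GARROW_ARRAY_BUILDER(garrow_int32_array_builder_new());
+ * GList *field_builders = g_list_append(NULL, count_builder);
+ * GArrowStructArrayBuilder *builder =
+ *   garrow_struct_array_builder_new(data_type, field_builders);
+ * ]|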
+ */
+GArrowStructArrayBuilder *
+garrow_struct_array_builder_new(GArrowStructDataType *data_type,
+                                GList *field_builders)
+{
+  auto memory_pool = arrow::default_memory_pool();
+  auto arrow_data_type = garrow_data_type_get_raw(GARROW_DATA_TYPE(data_type));
+  std::vector<std::shared_ptr<arrow::ArrayBuilder>> arrow_field_builders;
+  for (GList *node = field_builders; node; node = g_list_next(node)) {
+    auto field_builder = static_cast<GArrowArrayBuilder *>(node->data);
+    auto arrow_field_builder = garrow_array_builder_get_raw(field_builder);
+    arrow_field_builders.push_back(arrow_field_builder);
+  }
+
+  auto arrow_struct_builder =
+    std::make_shared<arrow::StructBuilder>(memory_pool,
+                                           arrow_data_type,
+                                           arrow_field_builders);
+  std::shared_ptr<arrow::ArrayBuilder> arrow_builder = arrow_struct_builder;
+  auto builder = garrow_array_builder_new_raw(&arrow_builder);
+  return GARROW_STRUCT_ARRAY_BUILDER(builder);
+}
+
+/**
+ * garrow_struct_array_builder_append:
+ * @builder: A #GArrowStructArrayBuilder.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ *
+ * It appends a new struct element. To append a new struct element,
+ * call this function and then append the element's field values to
+ * all `field_builder`s. The `field_builder`s are the
+ * #GArrowArrayBuilder objects passed to the constructor. You can get
+ * each `field_builder` with
+ * garrow_struct_array_builder_get_field_builder() or
+ * garrow_struct_array_builder_get_field_builders().
+ *
+ * |[
+ * // TODO
+ * ]|
+ */
+gboolean
+garrow_struct_array_builder_append(GArrowStructArrayBuilder *builder,
+                                   GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::StructBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->Append();
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[struct-array-builder][append]");
+    return FALSE;
+  }
+}
+
+/**
+ * garrow_struct_array_builder_append_null:
+ * @builder: A #GArrowStructArrayBuilder.
+ * @error: (nullable): Return location for a #GError or %NULL.
+ *
+ * Returns: %TRUE on success, %FALSE if there was an error.
+ *
+ * It appends a new NULL element.
+ */
+gboolean
+garrow_struct_array_builder_append_null(GArrowStructArrayBuilder *builder,
+                                        GError **error)
+{
+  auto arrow_builder =
+    static_cast<arrow::StructBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  auto status = arrow_builder->AppendNull();
+  if (status.ok()) {
+    return TRUE;
+  } else {
+    garrow_error_set(error, status, "[struct-array-builder][append-null]");
+    return FALSE;
+  }
+}
+
+/**
+ * garrow_struct_array_builder_get_field_builder:
+ * @builder: A #GArrowStructArrayBuilder.
+ * @i: The index of the field in the struct.
+ *
+ * Returns: (transfer full): The #GArrowArrayBuilder for the i-th field.
+ */
+GArrowArrayBuilder *
+garrow_struct_array_builder_get_field_builder(GArrowStructArrayBuilder *builder,
+                                              gint i)
+{
+  auto arrow_builder =
+    static_cast<arrow::StructBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+  auto arrow_field_builder = arrow_builder->field_builder(i);
+  return garrow_array_builder_new_raw(&arrow_field_builder);
+}
+
+/**
+ * garrow_struct_array_builder_get_field_builders:
+ * @builder: A #GArrowStructArrayBuilder.
+ *
+ * Returns: (element-type GArrowArrayBuilder) (transfer full):
+ *   The #GArrowArrayBuilder for all fields.
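+ *
+ * A minimal sketch of walking the returned list. Since the list is
+ * (transfer full), both the list and its elements must be released:
+ * |[
+ * GList *builders =
+ *   garrow_struct_array_builder_get_field_builders(builder);
+ * for (GList *node = builders; node; node = node->next) {
+ *   GArrowArrayBuilder *field_builder = GARROW_ARRAY_BUILDER(node->data);
+ *   // Append the next value to each field builder...
+ * }
+ * g_list_free_full(builders, g_object_unref);
+ * ]|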
+ */
+GList *
+garrow_struct_array_builder_get_field_builders(GArrowStructArrayBuilder *builder)
+{
+  auto arrow_struct_builder =
+    static_cast<arrow::StructBuilder *>(
+      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
+
+  GList *field_builders = NULL;
+  for (auto arrow_field_builder : arrow_struct_builder->field_builders()) {
+    auto field_builder = garrow_array_builder_new_raw(&arrow_field_builder);
+    field_builders = g_list_prepend(field_builders, field_builder);
+  }
+
+  return g_list_reverse(field_builders);
+}
+
 G_END_DECLS
 
 GArrowArrayBuilder *
diff --git a/c_glib/arrow-glib/array-builder.h b/c_glib/arrow-glib/array-builder.h
index 3717aef04a2f4..ad72f9ae8488b 100644
--- a/c_glib/arrow-glib/array-builder.h
+++ b/c_glib/arrow-glib/array-builder.h
@@ -67,4 +67,773 @@ GType garrow_array_builder_get_type (void) G_GNUC_CONST;
 
 GArrowArray *garrow_array_builder_finish (GArrowArrayBuilder *builder);
 
+
+#define GARROW_TYPE_BOOLEAN_ARRAY_BUILDER       \
+  (garrow_boolean_array_builder_get_type())
+#define GARROW_BOOLEAN_ARRAY_BUILDER(obj)                       \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj),                            \
+                              GARROW_TYPE_BOOLEAN_ARRAY_BUILDER,        \
+                              GArrowBooleanArrayBuilder))
+#define GARROW_BOOLEAN_ARRAY_BUILDER_CLASS(klass)               \
+  (G_TYPE_CHECK_CLASS_CAST((klass),                             \
+                           GARROW_TYPE_BOOLEAN_ARRAY_BUILDER,   \
+                           GArrowBooleanArrayBuilderClass))
+#define GARROW_IS_BOOLEAN_ARRAY_BUILDER(obj)                    \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj),                            \
+                              GARROW_TYPE_BOOLEAN_ARRAY_BUILDER))
+#define GARROW_IS_BOOLEAN_ARRAY_BUILDER_CLASS(klass)            \
+  (G_TYPE_CHECK_CLASS_TYPE((klass),                             \
+                           GARROW_TYPE_BOOLEAN_ARRAY_BUILDER))
+#define GARROW_BOOLEAN_ARRAY_BUILDER_GET_CLASS(obj)             \
+  (G_TYPE_INSTANCE_GET_CLASS((obj),                             \
+                             GARROW_TYPE_BOOLEAN_ARRAY_BUILDER, \
+                             GArrowBooleanArrayBuilderClass))
+
+typedef struct _GArrowBooleanArrayBuilder GArrowBooleanArrayBuilder;
+typedef struct _GArrowBooleanArrayBuilderClass GArrowBooleanArrayBuilderClass;
+
+/**
+ * GArrowBooleanArrayBuilder:
+ *
+ * It wraps `arrow::BooleanBuilder`.
+ */
+struct _GArrowBooleanArrayBuilder
+{
+  /*< private >*/
+  GArrowArrayBuilder parent_instance;
+};
+
+struct _GArrowBooleanArrayBuilderClass
+{
+  GArrowArrayBuilderClass parent_class;
+};
+
+GType garrow_boolean_array_builder_get_type(void) G_GNUC_CONST;
+
+GArrowBooleanArrayBuilder *garrow_boolean_array_builder_new(void);
+
+gboolean garrow_boolean_array_builder_append(GArrowBooleanArrayBuilder *builder,
+                                             gboolean value,
+                                             GError **error);
+gboolean garrow_boolean_array_builder_append_null(GArrowBooleanArrayBuilder *builder,
+                                                  GError **error);
+
+
+#define GARROW_TYPE_INT8_ARRAY_BUILDER          \
+  (garrow_int8_array_builder_get_type())
+#define GARROW_INT8_ARRAY_BUILDER(obj)                          \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj),                            \
+                              GARROW_TYPE_INT8_ARRAY_BUILDER,   \
+                              GArrowInt8ArrayBuilder))
+#define GARROW_INT8_ARRAY_BUILDER_CLASS(klass)                  \
+  (G_TYPE_CHECK_CLASS_CAST((klass),                             \
+                           GARROW_TYPE_INT8_ARRAY_BUILDER,      \
+                           GArrowInt8ArrayBuilderClass))
+#define GARROW_IS_INT8_ARRAY_BUILDER(obj)                       \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj),                            \
+                              GARROW_TYPE_INT8_ARRAY_BUILDER))
+#define GARROW_IS_INT8_ARRAY_BUILDER_CLASS(klass)               \
+  (G_TYPE_CHECK_CLASS_TYPE((klass),                             \
+                           GARROW_TYPE_INT8_ARRAY_BUILDER))
+#define GARROW_INT8_ARRAY_BUILDER_GET_CLASS(obj)                \
+  (G_TYPE_INSTANCE_GET_CLASS((obj),                             \
+                             GARROW_TYPE_INT8_ARRAY_BUILDER,    \
+                             GArrowInt8ArrayBuilderClass))
+
+typedef struct _GArrowInt8ArrayBuilder GArrowInt8ArrayBuilder;
+typedef struct _GArrowInt8ArrayBuilderClass GArrowInt8ArrayBuilderClass;
+
+/**
+ * GArrowInt8ArrayBuilder:
+ *
+ * It wraps `arrow::Int8Builder`.
+ */ +struct _GArrowInt8ArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowInt8ArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_int8_array_builder_get_type(void) G_GNUC_CONST; + +GArrowInt8ArrayBuilder *garrow_int8_array_builder_new(void); + +gboolean garrow_int8_array_builder_append(GArrowInt8ArrayBuilder *builder, + gint8 value, + GError **error); +gboolean garrow_int8_array_builder_append_null(GArrowInt8ArrayBuilder *builder, + GError **error); + + +#define GARROW_TYPE_UINT8_ARRAY_BUILDER \ + (garrow_uint8_array_builder_get_type()) +#define GARROW_UINT8_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT8_ARRAY_BUILDER, \ + GArrowUInt8ArrayBuilder)) +#define GARROW_UINT8_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT8_ARRAY_BUILDER, \ + GArrowUInt8ArrayBuilderClass)) +#define GARROW_IS_UINT8_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT8_ARRAY_BUILDER)) +#define GARROW_IS_UINT8_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT8_ARRAY_BUILDER)) +#define GARROW_UINT8_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT8_ARRAY_BUILDER, \ + GArrowUInt8ArrayBuilderClass)) + +typedef struct _GArrowUInt8ArrayBuilder GArrowUInt8ArrayBuilder; +typedef struct _GArrowUInt8ArrayBuilderClass GArrowUInt8ArrayBuilderClass; + +/** + * GArrowUInt8ArrayBuilder: + * + * It wraps `arrow::UInt8Builder`. + */ +struct _GArrowUInt8ArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowUInt8ArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_uint8_array_builder_get_type(void) G_GNUC_CONST; + +GArrowUInt8ArrayBuilder *garrow_uint8_array_builder_new(void); + +gboolean garrow_uint8_array_builder_append(GArrowUInt8ArrayBuilder *builder, + guint8 value, + GError **error); +gboolean garrow_uint8_array_builder_append_null(GArrowUInt8ArrayBuilder *builder, + GError **error); + + +#define GARROW_TYPE_INT16_ARRAY_BUILDER \ + (garrow_int16_array_builder_get_type()) +#define GARROW_INT16_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INT16_ARRAY_BUILDER, \ + GArrowInt16ArrayBuilder)) +#define GARROW_INT16_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INT16_ARRAY_BUILDER, \ + GArrowInt16ArrayBuilderClass)) +#define GARROW_IS_INT16_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_INT16_ARRAY_BUILDER)) +#define GARROW_IS_INT16_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INT16_ARRAY_BUILDER)) +#define GARROW_INT16_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INT16_ARRAY_BUILDER, \ + GArrowInt16ArrayBuilderClass)) + +typedef struct _GArrowInt16ArrayBuilder GArrowInt16ArrayBuilder; +typedef struct _GArrowInt16ArrayBuilderClass GArrowInt16ArrayBuilderClass; + +/** + * GArrowInt16ArrayBuilder: + * + * It wraps `arrow::Int16Builder`. 
+ */ +struct _GArrowInt16ArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowInt16ArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_int16_array_builder_get_type(void) G_GNUC_CONST; + +GArrowInt16ArrayBuilder *garrow_int16_array_builder_new(void); + +gboolean garrow_int16_array_builder_append(GArrowInt16ArrayBuilder *builder, + gint16 value, + GError **error); +gboolean garrow_int16_array_builder_append_null(GArrowInt16ArrayBuilder *builder, + GError **error); + + +#define GARROW_TYPE_UINT16_ARRAY_BUILDER \ + (garrow_uint16_array_builder_get_type()) +#define GARROW_UINT16_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT16_ARRAY_BUILDER, \ + GArrowUInt16ArrayBuilder)) +#define GARROW_UINT16_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT16_ARRAY_BUILDER, \ + GArrowUInt16ArrayBuilderClass)) +#define GARROW_IS_UINT16_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT16_ARRAY_BUILDER)) +#define GARROW_IS_UINT16_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT16_ARRAY_BUILDER)) +#define GARROW_UINT16_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT16_ARRAY_BUILDER, \ + GArrowUInt16ArrayBuilderClass)) + +typedef struct _GArrowUInt16ArrayBuilder GArrowUInt16ArrayBuilder; +typedef struct _GArrowUInt16ArrayBuilderClass GArrowUInt16ArrayBuilderClass; + +/** + * GArrowUInt16ArrayBuilder: + * + * It wraps `arrow::UInt16Builder`. + */ +struct _GArrowUInt16ArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowUInt16ArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_uint16_array_builder_get_type(void) G_GNUC_CONST; + +GArrowUInt16ArrayBuilder *garrow_uint16_array_builder_new(void); + +gboolean garrow_uint16_array_builder_append(GArrowUInt16ArrayBuilder *builder, + guint16 value, + GError **error); +gboolean garrow_uint16_array_builder_append_null(GArrowUInt16ArrayBuilder *builder, + GError **error); + + +#define GARROW_TYPE_INT32_ARRAY_BUILDER \ + (garrow_int32_array_builder_get_type()) +#define GARROW_INT32_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INT32_ARRAY_BUILDER, \ + GArrowInt32ArrayBuilder)) +#define GARROW_INT32_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INT32_ARRAY_BUILDER, \ + GArrowInt32ArrayBuilderClass)) +#define GARROW_IS_INT32_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_INT32_ARRAY_BUILDER)) +#define GARROW_IS_INT32_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INT32_ARRAY_BUILDER)) +#define GARROW_INT32_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INT32_ARRAY_BUILDER, \ + GArrowInt32ArrayBuilderClass)) + +typedef struct _GArrowInt32ArrayBuilder GArrowInt32ArrayBuilder; +typedef struct _GArrowInt32ArrayBuilderClass GArrowInt32ArrayBuilderClass; + +/** + * GArrowInt32ArrayBuilder: + * + * It wraps `arrow::Int32Builder`. 
+ */ +struct _GArrowInt32ArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowInt32ArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_int32_array_builder_get_type(void) G_GNUC_CONST; + +GArrowInt32ArrayBuilder *garrow_int32_array_builder_new(void); + +gboolean garrow_int32_array_builder_append(GArrowInt32ArrayBuilder *builder, + gint32 value, + GError **error); +gboolean garrow_int32_array_builder_append_null(GArrowInt32ArrayBuilder *builder, + GError **error); + + +#define GARROW_TYPE_UINT32_ARRAY_BUILDER \ + (garrow_uint32_array_builder_get_type()) +#define GARROW_UINT32_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT32_ARRAY_BUILDER, \ + GArrowUInt32ArrayBuilder)) +#define GARROW_UINT32_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT32_ARRAY_BUILDER, \ + GArrowUInt32ArrayBuilderClass)) +#define GARROW_IS_UINT32_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT32_ARRAY_BUILDER)) +#define GARROW_IS_UINT32_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT32_ARRAY_BUILDER)) +#define GARROW_UINT32_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT32_ARRAY_BUILDER, \ + GArrowUInt32ArrayBuilderClass)) + +typedef struct _GArrowUInt32ArrayBuilder GArrowUInt32ArrayBuilder; +typedef struct _GArrowUInt32ArrayBuilderClass GArrowUInt32ArrayBuilderClass; + +/** + * GArrowUInt32ArrayBuilder: + * + * It wraps `arrow::UInt32Builder`. + */ +struct _GArrowUInt32ArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowUInt32ArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_uint32_array_builder_get_type(void) G_GNUC_CONST; + +GArrowUInt32ArrayBuilder *garrow_uint32_array_builder_new(void); + +gboolean garrow_uint32_array_builder_append(GArrowUInt32ArrayBuilder *builder, + guint32 value, + GError **error); +gboolean garrow_uint32_array_builder_append_null(GArrowUInt32ArrayBuilder *builder, + GError **error); + + +#define GARROW_TYPE_INT64_ARRAY_BUILDER \ + (garrow_int64_array_builder_get_type()) +#define GARROW_INT64_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INT64_ARRAY_BUILDER, \ + GArrowInt64ArrayBuilder)) +#define GARROW_INT64_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INT64_ARRAY_BUILDER, \ + GArrowInt64ArrayBuilderClass)) +#define GARROW_IS_INT64_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_INT64_ARRAY_BUILDER)) +#define GARROW_IS_INT64_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INT64_ARRAY_BUILDER)) +#define GARROW_INT64_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INT64_ARRAY_BUILDER, \ + GArrowInt64ArrayBuilderClass)) + +typedef struct _GArrowInt64ArrayBuilder GArrowInt64ArrayBuilder; +typedef struct _GArrowInt64ArrayBuilderClass GArrowInt64ArrayBuilderClass; + +/** + * GArrowInt64ArrayBuilder: + * + * It wraps `arrow::Int64Builder`. 
+ */ +struct _GArrowInt64ArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowInt64ArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_int64_array_builder_get_type(void) G_GNUC_CONST; + +GArrowInt64ArrayBuilder *garrow_int64_array_builder_new(void); + +gboolean garrow_int64_array_builder_append(GArrowInt64ArrayBuilder *builder, + gint64 value, + GError **error); +gboolean garrow_int64_array_builder_append_null(GArrowInt64ArrayBuilder *builder, + GError **error); + + +#define GARROW_TYPE_UINT64_ARRAY_BUILDER \ + (garrow_uint64_array_builder_get_type()) +#define GARROW_UINT64_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_UINT64_ARRAY_BUILDER, \ + GArrowUInt64ArrayBuilder)) +#define GARROW_UINT64_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_UINT64_ARRAY_BUILDER, \ + GArrowUInt64ArrayBuilderClass)) +#define GARROW_IS_UINT64_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_UINT64_ARRAY_BUILDER)) +#define GARROW_IS_UINT64_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_UINT64_ARRAY_BUILDER)) +#define GARROW_UINT64_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_UINT64_ARRAY_BUILDER, \ + GArrowUInt64ArrayBuilderClass)) + +typedef struct _GArrowUInt64ArrayBuilder GArrowUInt64ArrayBuilder; +typedef struct _GArrowUInt64ArrayBuilderClass GArrowUInt64ArrayBuilderClass; + +/** + * GArrowUInt64ArrayBuilder: + * + * It wraps `arrow::UInt64Builder`. + */ +struct _GArrowUInt64ArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowUInt64ArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_uint64_array_builder_get_type(void) G_GNUC_CONST; + +GArrowUInt64ArrayBuilder *garrow_uint64_array_builder_new(void); + +gboolean garrow_uint64_array_builder_append(GArrowUInt64ArrayBuilder *builder, + guint64 value, + GError **error); +gboolean garrow_uint64_array_builder_append_null(GArrowUInt64ArrayBuilder *builder, + GError **error); + + +#define GARROW_TYPE_FLOAT_ARRAY_BUILDER \ + (garrow_float_array_builder_get_type()) +#define GARROW_FLOAT_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_FLOAT_ARRAY_BUILDER, \ + GArrowFloatArrayBuilder)) +#define GARROW_FLOAT_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_FLOAT_ARRAY_BUILDER, \ + GArrowFloatArrayBuilderClass)) +#define GARROW_IS_FLOAT_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_FLOAT_ARRAY_BUILDER)) +#define GARROW_IS_FLOAT_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_FLOAT_ARRAY_BUILDER)) +#define GARROW_FLOAT_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_FLOAT_ARRAY_BUILDER, \ + GArrowFloatArrayBuilderClass)) + +typedef struct _GArrowFloatArrayBuilder GArrowFloatArrayBuilder; +typedef struct _GArrowFloatArrayBuilderClass GArrowFloatArrayBuilderClass; + +/** + * GArrowFloatArrayBuilder: + * + * It wraps `arrow::FloatBuilder`. 
+ */ +struct _GArrowFloatArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowFloatArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_float_array_builder_get_type(void) G_GNUC_CONST; + +GArrowFloatArrayBuilder *garrow_float_array_builder_new(void); + +gboolean garrow_float_array_builder_append(GArrowFloatArrayBuilder *builder, + gfloat value, + GError **error); +gboolean garrow_float_array_builder_append_null(GArrowFloatArrayBuilder *builder, + GError **error); + + +#define GARROW_TYPE_DOUBLE_ARRAY_BUILDER \ + (garrow_double_array_builder_get_type()) +#define GARROW_DOUBLE_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_DOUBLE_ARRAY_BUILDER, \ + GArrowDoubleArrayBuilder)) +#define GARROW_DOUBLE_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_DOUBLE_ARRAY_BUILDER, \ + GArrowDoubleArrayBuilderClass)) +#define GARROW_IS_DOUBLE_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_DOUBLE_ARRAY_BUILDER)) +#define GARROW_IS_DOUBLE_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_DOUBLE_ARRAY_BUILDER)) +#define GARROW_DOUBLE_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_DOUBLE_ARRAY_BUILDER, \ + GArrowDoubleArrayBuilderClass)) + +typedef struct _GArrowDoubleArrayBuilder GArrowDoubleArrayBuilder; +typedef struct _GArrowDoubleArrayBuilderClass GArrowDoubleArrayBuilderClass; + +/** + * GArrowDoubleArrayBuilder: + * + * It wraps `arrow::DoubleBuilder`. + */ +struct _GArrowDoubleArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowDoubleArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_double_array_builder_get_type(void) G_GNUC_CONST; + +GArrowDoubleArrayBuilder *garrow_double_array_builder_new(void); + +gboolean garrow_double_array_builder_append(GArrowDoubleArrayBuilder *builder, + gdouble value, + GError **error); +gboolean garrow_double_array_builder_append_null(GArrowDoubleArrayBuilder *builder, + GError **error); + + +#define GARROW_TYPE_BINARY_ARRAY_BUILDER \ + (garrow_binary_array_builder_get_type()) +#define GARROW_BINARY_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_BINARY_ARRAY_BUILDER, \ + GArrowBinaryArrayBuilder)) +#define GARROW_BINARY_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_BINARY_ARRAY_BUILDER, \ + GArrowBinaryArrayBuilderClass)) +#define GARROW_IS_BINARY_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_BINARY_ARRAY_BUILDER)) +#define GARROW_IS_BINARY_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_BINARY_ARRAY_BUILDER)) +#define GARROW_BINARY_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_BINARY_ARRAY_BUILDER, \ + GArrowBinaryArrayBuilderClass)) + +typedef struct _GArrowBinaryArrayBuilder GArrowBinaryArrayBuilder; +typedef struct _GArrowBinaryArrayBuilderClass GArrowBinaryArrayBuilderClass; + +/** + * GArrowBinaryArrayBuilder: + * + * It wraps `arrow::BinaryBuilder`. 
+ */ +struct _GArrowBinaryArrayBuilder +{ + /*< private >*/ + GArrowArrayBuilder parent_instance; +}; + +struct _GArrowBinaryArrayBuilderClass +{ + GArrowArrayBuilderClass parent_class; +}; + +GType garrow_binary_array_builder_get_type(void) G_GNUC_CONST; + +GArrowBinaryArrayBuilder *garrow_binary_array_builder_new(void); + +gboolean garrow_binary_array_builder_append(GArrowBinaryArrayBuilder *builder, + const guint8 *value, + gint32 length, + GError **error); +gboolean garrow_binary_array_builder_append_null(GArrowBinaryArrayBuilder *builder, + GError **error); + + +#define GARROW_TYPE_STRING_ARRAY_BUILDER \ + (garrow_string_array_builder_get_type()) +#define GARROW_STRING_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_STRING_ARRAY_BUILDER, \ + GArrowStringArrayBuilder)) +#define GARROW_STRING_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_STRING_ARRAY_BUILDER, \ + GArrowStringArrayBuilderClass)) +#define GARROW_IS_STRING_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_STRING_ARRAY_BUILDER)) +#define GARROW_IS_STRING_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_STRING_ARRAY_BUILDER)) +#define GARROW_STRING_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_STRING_ARRAY_BUILDER, \ + GArrowStringArrayBuilderClass)) + +typedef struct _GArrowStringArrayBuilder GArrowStringArrayBuilder; +typedef struct _GArrowStringArrayBuilderClass GArrowStringArrayBuilderClass; + +/** + * GArrowStringArrayBuilder: + * + * It wraps `arrow::StringBuilder`. + */ +struct _GArrowStringArrayBuilder +{ + /*< private >*/ + GArrowBinaryArrayBuilder parent_instance; +}; + +struct _GArrowStringArrayBuilderClass +{ + GArrowBinaryArrayBuilderClass parent_class; +}; + +GType garrow_string_array_builder_get_type(void) G_GNUC_CONST; + +GArrowStringArrayBuilder *garrow_string_array_builder_new(void); + +gboolean garrow_string_array_builder_append(GArrowStringArrayBuilder *builder, + const gchar *value, + GError **error); + + +#define GARROW_TYPE_LIST_ARRAY_BUILDER \ + (garrow_list_array_builder_get_type()) +#define GARROW_LIST_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_LIST_ARRAY_BUILDER, \ + GArrowListArrayBuilder)) +#define GARROW_LIST_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_LIST_ARRAY_BUILDER, \ + GArrowListArrayBuilderClass)) +#define GARROW_IS_LIST_ARRAY_BUILDER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_LIST_ARRAY_BUILDER)) +#define GARROW_IS_LIST_ARRAY_BUILDER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_LIST_ARRAY_BUILDER)) +#define GARROW_LIST_ARRAY_BUILDER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_LIST_ARRAY_BUILDER, \ + GArrowListArrayBuilderClass)) + +typedef struct _GArrowListArrayBuilder GArrowListArrayBuilder; +typedef struct _GArrowListArrayBuilderClass GArrowListArrayBuilderClass; + +/** + * GArrowListArrayBuilder: + * + * It wraps `arrow::ListBuilder`. 
+ */
+struct _GArrowListArrayBuilder
+{
+  /*< private >*/
+  GArrowArrayBuilder parent_instance;
+};
+
+struct _GArrowListArrayBuilderClass
+{
+  GArrowArrayBuilderClass parent_class;
+};
+
+GType garrow_list_array_builder_get_type(void) G_GNUC_CONST;
+
+GArrowListArrayBuilder *garrow_list_array_builder_new(GArrowArrayBuilder *value_builder);
+
+gboolean garrow_list_array_builder_append(GArrowListArrayBuilder *builder,
+                                          GError **error);
+gboolean garrow_list_array_builder_append_null(GArrowListArrayBuilder *builder,
+                                               GError **error);
+
+GArrowArrayBuilder *garrow_list_array_builder_get_value_builder(GArrowListArrayBuilder *builder);
+
+
+#define GARROW_TYPE_STRUCT_ARRAY_BUILDER        \
+  (garrow_struct_array_builder_get_type())
+#define GARROW_STRUCT_ARRAY_BUILDER(obj)                        \
+  (G_TYPE_CHECK_INSTANCE_CAST((obj),                            \
+                              GARROW_TYPE_STRUCT_ARRAY_BUILDER, \
+                              GArrowStructArrayBuilder))
+#define GARROW_STRUCT_ARRAY_BUILDER_CLASS(klass)                \
+  (G_TYPE_CHECK_CLASS_CAST((klass),                             \
+                           GARROW_TYPE_STRUCT_ARRAY_BUILDER,    \
+                           GArrowStructArrayBuilderClass))
+#define GARROW_IS_STRUCT_ARRAY_BUILDER(obj)                     \
+  (G_TYPE_CHECK_INSTANCE_TYPE((obj),                            \
+                              GARROW_TYPE_STRUCT_ARRAY_BUILDER))
+#define GARROW_IS_STRUCT_ARRAY_BUILDER_CLASS(klass)             \
+  (G_TYPE_CHECK_CLASS_TYPE((klass),                             \
+                           GARROW_TYPE_STRUCT_ARRAY_BUILDER))
+#define GARROW_STRUCT_ARRAY_BUILDER_GET_CLASS(obj)              \
+  (G_TYPE_INSTANCE_GET_CLASS((obj),                             \
+                             GARROW_TYPE_STRUCT_ARRAY_BUILDER,  \
+                             GArrowStructArrayBuilderClass))
+
+typedef struct _GArrowStructArrayBuilder GArrowStructArrayBuilder;
+typedef struct _GArrowStructArrayBuilderClass GArrowStructArrayBuilderClass;
+
+/**
+ * GArrowStructArrayBuilder:
+ *
+ * It wraps `arrow::StructBuilder`.
+ */
+struct _GArrowStructArrayBuilder
+{
+  /*< private >*/
+  GArrowArrayBuilder parent_instance;
+};
+
+struct _GArrowStructArrayBuilderClass
+{
+  GArrowArrayBuilderClass parent_class;
+};
+
+GType garrow_struct_array_builder_get_type(void) G_GNUC_CONST;
+
+GArrowStructArrayBuilder *garrow_struct_array_builder_new(GArrowStructDataType *data_type,
+                                                          GList *field_builders);
+
+gboolean garrow_struct_array_builder_append(GArrowStructArrayBuilder *builder,
+                                            GError **error);
+gboolean garrow_struct_array_builder_append_null(GArrowStructArrayBuilder *builder,
+                                                 GError **error);
+
+GArrowArrayBuilder *garrow_struct_array_builder_get_field_builder(GArrowStructArrayBuilder *builder,
+                                                                  gint i);
+GList *garrow_struct_array_builder_get_field_builders(GArrowStructArrayBuilder *builder);
+
 G_END_DECLS
diff --git a/c_glib/arrow-glib/arrow-glib.h b/c_glib/arrow-glib/arrow-glib.h
index 46e98d2b8ed4c..efff5710308a8 100644
--- a/c_glib/arrow-glib/arrow-glib.h
+++ b/c_glib/arrow-glib/arrow-glib.h
@@ -21,32 +21,17 @@
 #include <arrow-glib/array.h>
 #include <arrow-glib/array-builder.h>
-#include <arrow-glib/binary-array-builder.h>
-#include <arrow-glib/boolean-array-builder.h>
 #include <arrow-glib/chunked-array.h>
 #include <arrow-glib/column.h>
 #include <arrow-glib/data-type.h>
-#include <arrow-glib/double-array-builder.h>
 #include <arrow-glib/error.h>
 #include <arrow-glib/field.h>
 #include <arrow-glib/file.h>
 #include <arrow-glib/file-mode.h>
-#include <arrow-glib/float-array-builder.h>
-#include <arrow-glib/int16-array-builder.h>
-#include <arrow-glib/int32-array-builder.h>
-#include <arrow-glib/int64-array-builder.h>
-#include <arrow-glib/int8-array-builder.h>
-#include <arrow-glib/list-array-builder.h>
 #include <arrow-glib/record-batch.h>
 #include <arrow-glib/schema.h>
-#include <arrow-glib/string-array-builder.h>
-#include <arrow-glib/struct-array-builder.h>
 #include <arrow-glib/table.h>
 #include <arrow-glib/tensor.h>
 #include <arrow-glib/type.h>
-#include <arrow-glib/uint16-array-builder.h>
-#include <arrow-glib/uint32-array-builder.h>
-#include <arrow-glib/uint64-array-builder.h>
-#include <arrow-glib/uint8-array-builder.h>
 #include <arrow-glib/writeable.h>
 #include <arrow-glib/writeable-file.h>
 
diff --git a/c_glib/arrow-glib/binary-array-builder.cpp b/c_glib/arrow-glib/binary-array-builder.cpp
deleted file mode 100644
index ab11535eb8595..0000000000000
--- a/c_glib/arrow-glib/binary-array-builder.cpp
+++ /dev/null
@@ -1,122 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- *   http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#ifdef HAVE_CONFIG_H
-#  include <config.h>
-#endif
-
-#include <arrow-glib/array-builder.hpp>
-#include <arrow-glib/binary-array-builder.h>
-#include <arrow-glib/error.hpp>
-
-G_BEGIN_DECLS
-
-/**
- * SECTION: binary-array-builder
- * @short_description: Binary array builder class
- *
- * #GArrowBinaryArrayBuilder is the class to create a new
- * #GArrowBinaryArray.
- */
-
-G_DEFINE_TYPE(GArrowBinaryArrayBuilder,
-              garrow_binary_array_builder,
-              GARROW_TYPE_ARRAY_BUILDER)
-
-static void
-garrow_binary_array_builder_init(GArrowBinaryArrayBuilder *builder)
-{
-}
-
-static void
-garrow_binary_array_builder_class_init(GArrowBinaryArrayBuilderClass *klass)
-{
-}
-
-/**
- * garrow_binary_array_builder_new:
- *
- * Returns: A newly created #GArrowBinaryArrayBuilder.
- */
-GArrowBinaryArrayBuilder *
-garrow_binary_array_builder_new(void)
-{
-  auto memory_pool = arrow::default_memory_pool();
-  auto arrow_builder =
-    std::make_shared<arrow::BinaryBuilder>(memory_pool, arrow::binary());
-  auto builder =
-    GARROW_BINARY_ARRAY_BUILDER(g_object_new(GARROW_TYPE_BINARY_ARRAY_BUILDER,
-                                             "array-builder", &arrow_builder,
-                                             NULL));
-  return builder;
-}
-
-/**
- * garrow_binary_array_builder_append:
- * @builder: A #GArrowBinaryArrayBuilder.
- * @value: (array length=length): A binary value.
- * @length: A value length.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- */
-gboolean
-garrow_binary_array_builder_append(GArrowBinaryArrayBuilder *builder,
-                                   const guint8 *value,
-                                   gint32 length,
-                                   GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::BinaryBuilder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->Append(value, length);
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[binary-array-builder][append]");
-    return FALSE;
-  }
-}
-
-/**
- * garrow_binary_array_builder_append_null:
- * @builder: A #GArrowBinaryArrayBuilder.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- */
-gboolean
-garrow_binary_array_builder_append_null(GArrowBinaryArrayBuilder *builder,
-                                        GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::BinaryBuilder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->AppendNull();
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[binary-array-builder][append-null]");
-    return FALSE;
-  }
-}
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/binary-array-builder.h b/c_glib/arrow-glib/binary-array-builder.h
deleted file mode 100644
index 111a83a3a09b0..0000000000000
--- a/c_glib/arrow-glib/binary-array-builder.h
+++ /dev/null
@@ -1,77 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- *   http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#pragma once
-
-#include <arrow-glib/array-builder.h>
-
-G_BEGIN_DECLS
-
-#define GARROW_TYPE_BINARY_ARRAY_BUILDER        \
-  (garrow_binary_array_builder_get_type())
-#define GARROW_BINARY_ARRAY_BUILDER(obj)                        \
-  (G_TYPE_CHECK_INSTANCE_CAST((obj),                            \
-                              GARROW_TYPE_BINARY_ARRAY_BUILDER, \
-                              GArrowBinaryArrayBuilder))
-#define GARROW_BINARY_ARRAY_BUILDER_CLASS(klass)                \
-  (G_TYPE_CHECK_CLASS_CAST((klass),                             \
-                           GARROW_TYPE_BINARY_ARRAY_BUILDER,    \
-                           GArrowBinaryArrayBuilderClass))
-#define GARROW_IS_BINARY_ARRAY_BUILDER(obj)                     \
-  (G_TYPE_CHECK_INSTANCE_TYPE((obj),                            \
-                              GARROW_TYPE_BINARY_ARRAY_BUILDER))
-#define GARROW_IS_BINARY_ARRAY_BUILDER_CLASS(klass)             \
-  (G_TYPE_CHECK_CLASS_TYPE((klass),                             \
-                           GARROW_TYPE_BINARY_ARRAY_BUILDER))
-#define GARROW_BINARY_ARRAY_BUILDER_GET_CLASS(obj)              \
-  (G_TYPE_INSTANCE_GET_CLASS((obj),                             \
-                             GARROW_TYPE_BINARY_ARRAY_BUILDER,  \
-                             GArrowBinaryArrayBuilderClass))
-
-typedef struct _GArrowBinaryArrayBuilder GArrowBinaryArrayBuilder;
-typedef struct _GArrowBinaryArrayBuilderClass GArrowBinaryArrayBuilderClass;
-
-/**
- * GArrowBinaryArrayBuilder:
- *
- * It wraps `arrow::BinaryBuilder`.
- */
-struct _GArrowBinaryArrayBuilder
-{
-  /*< private >*/
-  GArrowArrayBuilder parent_instance;
-};
-
-struct _GArrowBinaryArrayBuilderClass
-{
-  GArrowArrayBuilderClass parent_class;
-};
-
-GType garrow_binary_array_builder_get_type(void) G_GNUC_CONST;
-
-GArrowBinaryArrayBuilder *garrow_binary_array_builder_new(void);
-
-gboolean garrow_binary_array_builder_append(GArrowBinaryArrayBuilder *builder,
-                                            const guint8 *value,
-                                            gint32 length,
-                                            GError **error);
-gboolean garrow_binary_array_builder_append_null(GArrowBinaryArrayBuilder *builder,
-                                                 GError **error);
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/boolean-array-builder.cpp b/c_glib/arrow-glib/boolean-array-builder.cpp
deleted file mode 100644
index 146eb31e8bdf8..0000000000000
--- a/c_glib/arrow-glib/boolean-array-builder.cpp
+++ /dev/null
@@ -1,120 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- *   http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#ifdef HAVE_CONFIG_H
-#  include <config.h>
-#endif
-
-#include <arrow-glib/array-builder.hpp>
-#include <arrow-glib/boolean-array-builder.h>
-#include <arrow-glib/error.hpp>
-
-G_BEGIN_DECLS
-
-/**
- * SECTION: boolean-array-builder
- * @short_description: Boolean array builder class
- *
- * #GArrowBooleanArrayBuilder is the class to create a new
- * #GArrowBooleanArray.
- */
-
-G_DEFINE_TYPE(GArrowBooleanArrayBuilder,
-              garrow_boolean_array_builder,
-              GARROW_TYPE_ARRAY_BUILDER)
-
-static void
-garrow_boolean_array_builder_init(GArrowBooleanArrayBuilder *builder)
-{
-}
-
-static void
-garrow_boolean_array_builder_class_init(GArrowBooleanArrayBuilderClass *klass)
-{
-}
-
-/**
- * garrow_boolean_array_builder_new:
- *
- * Returns: A newly created #GArrowBooleanArrayBuilder.
- */
-GArrowBooleanArrayBuilder *
-garrow_boolean_array_builder_new(void)
-{
-  auto memory_pool = arrow::default_memory_pool();
-  auto arrow_builder =
-    std::make_shared<arrow::BooleanBuilder>(memory_pool);
-  auto builder =
-    GARROW_BOOLEAN_ARRAY_BUILDER(g_object_new(GARROW_TYPE_BOOLEAN_ARRAY_BUILDER,
-                                              "array-builder", &arrow_builder,
-                                              NULL));
-  return builder;
-}
-
-/**
- * garrow_boolean_array_builder_append:
- * @builder: A #GArrowBooleanArrayBuilder.
- * @value: A boolean value.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- */
-gboolean
-garrow_boolean_array_builder_append(GArrowBooleanArrayBuilder *builder,
-                                    gboolean value,
-                                    GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::BooleanBuilder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->Append(static_cast<bool>(value));
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[boolean-array-builder][append]");
-    return FALSE;
-  }
-}
-
-/**
- * garrow_boolean_array_builder_append_null:
- * @builder: A #GArrowBooleanArrayBuilder.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- */
-gboolean
-garrow_boolean_array_builder_append_null(GArrowBooleanArrayBuilder *builder,
-                                         GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::BooleanBuilder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->AppendNull();
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[boolean-array-builder][append-null]");
-    return FALSE;
-  }
-}
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/boolean-array-builder.h b/c_glib/arrow-glib/boolean-array-builder.h
deleted file mode 100644
index ca50e9797d41c..0000000000000
--- a/c_glib/arrow-glib/boolean-array-builder.h
+++ /dev/null
@@ -1,76 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- *   http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#pragma once
-
-#include <arrow-glib/array-builder.h>
-
-G_BEGIN_DECLS
-
-#define GARROW_TYPE_BOOLEAN_ARRAY_BUILDER       \
-  (garrow_boolean_array_builder_get_type())
-#define GARROW_BOOLEAN_ARRAY_BUILDER(obj)                       \
-  (G_TYPE_CHECK_INSTANCE_CAST((obj),                            \
-                              GARROW_TYPE_BOOLEAN_ARRAY_BUILDER,        \
-                              GArrowBooleanArrayBuilder))
-#define GARROW_BOOLEAN_ARRAY_BUILDER_CLASS(klass)               \
-  (G_TYPE_CHECK_CLASS_CAST((klass),                             \
-                           GARROW_TYPE_BOOLEAN_ARRAY_BUILDER,   \
-                           GArrowBooleanArrayBuilderClass))
-#define GARROW_IS_BOOLEAN_ARRAY_BUILDER(obj)                    \
-  (G_TYPE_CHECK_INSTANCE_TYPE((obj),                            \
-                              GARROW_TYPE_BOOLEAN_ARRAY_BUILDER))
-#define GARROW_IS_BOOLEAN_ARRAY_BUILDER_CLASS(klass)            \
-  (G_TYPE_CHECK_CLASS_TYPE((klass),                             \
-                           GARROW_TYPE_BOOLEAN_ARRAY_BUILDER))
-#define GARROW_BOOLEAN_ARRAY_BUILDER_GET_CLASS(obj)             \
-  (G_TYPE_INSTANCE_GET_CLASS((obj),                             \
-                             GARROW_TYPE_BOOLEAN_ARRAY_BUILDER, \
-                             GArrowBooleanArrayBuilderClass))
-
-typedef struct _GArrowBooleanArrayBuilder GArrowBooleanArrayBuilder;
-typedef struct _GArrowBooleanArrayBuilderClass GArrowBooleanArrayBuilderClass;
-
-/**
- * GArrowBooleanArrayBuilder:
- *
- * It wraps `arrow::BooleanBuilder`.
- */
-struct _GArrowBooleanArrayBuilder
-{
-  /*< private >*/
-  GArrowArrayBuilder parent_instance;
-};
-
-struct _GArrowBooleanArrayBuilderClass
-{
-  GArrowArrayBuilderClass parent_class;
-};
-
-GType garrow_boolean_array_builder_get_type(void) G_GNUC_CONST;
-
-GArrowBooleanArrayBuilder *garrow_boolean_array_builder_new(void);
-
-gboolean garrow_boolean_array_builder_append(GArrowBooleanArrayBuilder *builder,
-                                             gboolean value,
-                                             GError **error);
-gboolean garrow_boolean_array_builder_append_null(GArrowBooleanArrayBuilder *builder,
-                                                  GError **error);
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/double-array-builder.cpp b/c_glib/arrow-glib/double-array-builder.cpp
deleted file mode 100644
index cc44eeabfb686..0000000000000
--- a/c_glib/arrow-glib/double-array-builder.cpp
+++ /dev/null
@@ -1,120 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- *   http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#ifdef HAVE_CONFIG_H
-#  include <config.h>
-#endif
-
-#include <arrow-glib/array-builder.hpp>
-#include <arrow-glib/double-array-builder.h>
-#include <arrow-glib/error.hpp>
-
-G_BEGIN_DECLS
-
-/**
- * SECTION: double-array-builder
- * @short_description: 64-bit floating point array builder class
- *
- * #GArrowDoubleArrayBuilder is the class to create a new
- * #GArrowDoubleArray.
- */
-
-G_DEFINE_TYPE(GArrowDoubleArrayBuilder,
-              garrow_double_array_builder,
-              GARROW_TYPE_ARRAY_BUILDER)
-
-static void
-garrow_double_array_builder_init(GArrowDoubleArrayBuilder *builder)
-{
-}
-
-static void
-garrow_double_array_builder_class_init(GArrowDoubleArrayBuilderClass *klass)
-{
-}
-
-/**
- * garrow_double_array_builder_new:
- *
- * Returns: A newly created #GArrowDoubleArrayBuilder.
- */
-GArrowDoubleArrayBuilder *
-garrow_double_array_builder_new(void)
-{
-  auto memory_pool = arrow::default_memory_pool();
-  auto arrow_builder =
-    std::make_shared<arrow::DoubleBuilder>(memory_pool, arrow::float64());
-  auto builder =
-    GARROW_DOUBLE_ARRAY_BUILDER(g_object_new(GARROW_TYPE_DOUBLE_ARRAY_BUILDER,
-                                             "array-builder", &arrow_builder,
-                                             NULL));
-  return builder;
-}
-
-/**
- * garrow_double_array_builder_append:
- * @builder: A #GArrowDoubleArrayBuilder.
- * @value: A double value.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- */
-gboolean
-garrow_double_array_builder_append(GArrowDoubleArrayBuilder *builder,
-                                   gdouble value,
-                                   GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::DoubleBuilder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->Append(value);
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[double-array-builder][append]");
-    return FALSE;
-  }
-}
-
-/**
- * garrow_double_array_builder_append_null:
- * @builder: A #GArrowDoubleArrayBuilder.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- */
-gboolean
-garrow_double_array_builder_append_null(GArrowDoubleArrayBuilder *builder,
-                                        GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::DoubleBuilder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->AppendNull();
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[double-array-builder][append-null]");
-    return FALSE;
-  }
-}
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/double-array-builder.h b/c_glib/arrow-glib/double-array-builder.h
deleted file mode 100644
index 5d95c898bc8a7..0000000000000
--- a/c_glib/arrow-glib/double-array-builder.h
+++ /dev/null
@@ -1,76 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- *   http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
diff --git a/c_glib/arrow-glib/double-array-builder.h b/c_glib/arrow-glib/double-array-builder.h
deleted file mode 100644
index 5d95c898bc8a7..0000000000000
--- a/c_glib/arrow-glib/double-array-builder.h
+++ /dev/null
@@ -1,76 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#pragma once
-
-#include <arrow-glib/array-builder.h>
-
-G_BEGIN_DECLS
-
-#define GARROW_TYPE_DOUBLE_ARRAY_BUILDER \
-  (garrow_double_array_builder_get_type())
-#define GARROW_DOUBLE_ARRAY_BUILDER(obj) \
-  (G_TYPE_CHECK_INSTANCE_CAST((obj), \
-                              GARROW_TYPE_DOUBLE_ARRAY_BUILDER, \
-                              GArrowDoubleArrayBuilder))
-#define GARROW_DOUBLE_ARRAY_BUILDER_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_CAST((klass), \
-                           GARROW_TYPE_DOUBLE_ARRAY_BUILDER, \
-                           GArrowDoubleArrayBuilderClass))
-#define GARROW_IS_DOUBLE_ARRAY_BUILDER(obj) \
-  (G_TYPE_CHECK_INSTANCE_TYPE((obj), \
-                              GARROW_TYPE_DOUBLE_ARRAY_BUILDER))
-#define GARROW_IS_DOUBLE_ARRAY_BUILDER_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_TYPE((klass), \
-                           GARROW_TYPE_DOUBLE_ARRAY_BUILDER))
-#define GARROW_DOUBLE_ARRAY_BUILDER_GET_CLASS(obj) \
-  (G_TYPE_INSTANCE_GET_CLASS((obj), \
-                             GARROW_TYPE_DOUBLE_ARRAY_BUILDER, \
-                             GArrowDoubleArrayBuilderClass))
-
-typedef struct _GArrowDoubleArrayBuilder GArrowDoubleArrayBuilder;
-typedef struct _GArrowDoubleArrayBuilderClass GArrowDoubleArrayBuilderClass;
-
-/**
- * GArrowDoubleArrayBuilder:
- *
- * It wraps `arrow::DoubleBuilder`.
- */
-struct _GArrowDoubleArrayBuilder
-{
-  /*< private >*/
-  GArrowArrayBuilder parent_instance;
-};
-
-struct _GArrowDoubleArrayBuilderClass
-{
-  GArrowArrayBuilderClass parent_class;
-};
-
-GType garrow_double_array_builder_get_type(void) G_GNUC_CONST;
-
-GArrowDoubleArrayBuilder *garrow_double_array_builder_new(void);
-
-gboolean garrow_double_array_builder_append(GArrowDoubleArrayBuilder *builder,
-                                            gdouble value,
-                                            GError **error);
-gboolean garrow_double_array_builder_append_null(GArrowDoubleArrayBuilder *builder,
-                                                 GError **error);
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/float-array-builder.cpp b/c_glib/arrow-glib/float-array-builder.cpp
deleted file mode 100644
index 77a9a0bb75a05..0000000000000
--- a/c_glib/arrow-glib/float-array-builder.cpp
+++ /dev/null
@@ -1,120 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#ifdef HAVE_CONFIG_H
-#  include <config.h>
-#endif
-
-#include <arrow-glib/array-builder.hpp>
-#include <arrow-glib/error.hpp>
-#include <arrow-glib/float-array-builder.h>
-
-G_BEGIN_DECLS
-
-/**
- * SECTION: float-array-builder
- * @short_description: 32-bit floating point array builder class
- *
- * #GArrowFloatArrayBuilder is the class to create a new
- * #GArrowFloatArray.
- */
-
-G_DEFINE_TYPE(GArrowFloatArrayBuilder,
-              garrow_float_array_builder,
-              GARROW_TYPE_ARRAY_BUILDER)
-
-static void
-garrow_float_array_builder_init(GArrowFloatArrayBuilder *builder)
-{
-}
-
-static void
-garrow_float_array_builder_class_init(GArrowFloatArrayBuilderClass *klass)
-{
-}
-
-/**
- * garrow_float_array_builder_new:
- *
- * Returns: A newly created #GArrowFloatArrayBuilder.
- */
-GArrowFloatArrayBuilder *
-garrow_float_array_builder_new(void)
-{
-  auto memory_pool = arrow::default_memory_pool();
-  auto arrow_builder =
-    std::make_shared<arrow::FloatBuilder>(memory_pool, arrow::float32());
-  auto builder =
-    GARROW_FLOAT_ARRAY_BUILDER(g_object_new(GARROW_TYPE_FLOAT_ARRAY_BUILDER,
-                                            "array-builder", &arrow_builder,
-                                            NULL));
-  return builder;
-}
-
-/**
- * garrow_float_array_builder_append:
- * @builder: A #GArrowFloatArrayBuilder.
- * @value: A float value.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- */
-gboolean
-garrow_float_array_builder_append(GArrowFloatArrayBuilder *builder,
                                   gfloat value,
-                                  GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::FloatBuilder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->Append(value);
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[float-array-builder][append]");
-    return FALSE;
-  }
-}
-
-/**
- * garrow_float_array_builder_append_null:
- * @builder: A #GArrowFloatArrayBuilder.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- */
-gboolean
-garrow_float_array_builder_append_null(GArrowFloatArrayBuilder *builder,
-                                       GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::FloatBuilder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->AppendNull();
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[float-array-builder][append-null]");
-    return FALSE;
-  }
-}
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/float-array-builder.h b/c_glib/arrow-glib/float-array-builder.h
deleted file mode 100644
index 003900313cca4..0000000000000
--- a/c_glib/arrow-glib/float-array-builder.h
+++ /dev/null
@@ -1,76 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#pragma once
-
-#include <arrow-glib/array-builder.h>
-
-G_BEGIN_DECLS
-
-#define GARROW_TYPE_FLOAT_ARRAY_BUILDER \
-  (garrow_float_array_builder_get_type())
-#define GARROW_FLOAT_ARRAY_BUILDER(obj) \
-  (G_TYPE_CHECK_INSTANCE_CAST((obj), \
-                              GARROW_TYPE_FLOAT_ARRAY_BUILDER, \
-                              GArrowFloatArrayBuilder))
-#define GARROW_FLOAT_ARRAY_BUILDER_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_CAST((klass), \
-                           GARROW_TYPE_FLOAT_ARRAY_BUILDER, \
-                           GArrowFloatArrayBuilderClass))
-#define GARROW_IS_FLOAT_ARRAY_BUILDER(obj) \
-  (G_TYPE_CHECK_INSTANCE_TYPE((obj), \
-                              GARROW_TYPE_FLOAT_ARRAY_BUILDER))
-#define GARROW_IS_FLOAT_ARRAY_BUILDER_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_TYPE((klass), \
-                           GARROW_TYPE_FLOAT_ARRAY_BUILDER))
-#define GARROW_FLOAT_ARRAY_BUILDER_GET_CLASS(obj) \
-  (G_TYPE_INSTANCE_GET_CLASS((obj), \
-                             GARROW_TYPE_FLOAT_ARRAY_BUILDER, \
-                             GArrowFloatArrayBuilderClass))
-
-typedef struct _GArrowFloatArrayBuilder GArrowFloatArrayBuilder;
-typedef struct _GArrowFloatArrayBuilderClass GArrowFloatArrayBuilderClass;
-
-/**
- * GArrowFloatArrayBuilder:
- *
- * It wraps `arrow::FloatBuilder`.
- */
-struct _GArrowFloatArrayBuilder
-{
-  /*< private >*/
-  GArrowArrayBuilder parent_instance;
-};
-
-struct _GArrowFloatArrayBuilderClass
-{
-  GArrowArrayBuilderClass parent_class;
-};
-
-GType garrow_float_array_builder_get_type(void) G_GNUC_CONST;
-
-GArrowFloatArrayBuilder *garrow_float_array_builder_new(void);
-
-gboolean garrow_float_array_builder_append(GArrowFloatArrayBuilder *builder,
-                                           gfloat value,
-                                           GError **error);
-gboolean garrow_float_array_builder_append_null(GArrowFloatArrayBuilder *builder,
-                                                GError **error);
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/int16-array-builder.cpp b/c_glib/arrow-glib/int16-array-builder.cpp
deleted file mode 100644
index fbf18ef1e6ce7..0000000000000
--- a/c_glib/arrow-glib/int16-array-builder.cpp
+++ /dev/null
@@ -1,120 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#ifdef HAVE_CONFIG_H
-#  include <config.h>
-#endif
-
-#include <arrow-glib/array-builder.hpp>
-#include <arrow-glib/error.hpp>
-#include <arrow-glib/int16-array-builder.h>
-
-G_BEGIN_DECLS
-
-/**
- * SECTION: int16-array-builder
- * @short_description: 16-bit integer array builder class
- *
- * #GArrowInt16ArrayBuilder is the class to create a new
- * #GArrowInt16Array.
- */
-
-G_DEFINE_TYPE(GArrowInt16ArrayBuilder,
-              garrow_int16_array_builder,
-              GARROW_TYPE_ARRAY_BUILDER)
-
-static void
-garrow_int16_array_builder_init(GArrowInt16ArrayBuilder *builder)
-{
-}
-
-static void
-garrow_int16_array_builder_class_init(GArrowInt16ArrayBuilderClass *klass)
-{
-}
-
-/**
- * garrow_int16_array_builder_new:
- *
- * Returns: A newly created #GArrowInt16ArrayBuilder.
- */
-GArrowInt16ArrayBuilder *
-garrow_int16_array_builder_new(void)
-{
-  auto memory_pool = arrow::default_memory_pool();
-  auto arrow_builder =
-    std::make_shared<arrow::Int16Builder>(memory_pool, arrow::int16());
-  auto builder =
-    GARROW_INT16_ARRAY_BUILDER(g_object_new(GARROW_TYPE_INT16_ARRAY_BUILDER,
-                                            "array-builder", &arrow_builder,
-                                            NULL));
-  return builder;
-}
-
-/**
- * garrow_int16_array_builder_append:
- * @builder: A #GArrowInt16ArrayBuilder.
- * @value: An int16 value.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- */
-gboolean
-garrow_int16_array_builder_append(GArrowInt16ArrayBuilder *builder,
-                                  gint16 value,
-                                  GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::Int16Builder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->Append(value);
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[int16-array-builder][append]");
-    return FALSE;
-  }
-}
-
-/**
- * garrow_int16_array_builder_append_null:
- * @builder: A #GArrowInt16ArrayBuilder.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- */
-gboolean
-garrow_int16_array_builder_append_null(GArrowInt16ArrayBuilder *builder,
-                                       GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::Int16Builder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->AppendNull();
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[int16-array-builder][append-null]");
-    return FALSE;
-  }
-}
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/int16-array-builder.h b/c_glib/arrow-glib/int16-array-builder.h
deleted file mode 100644
index f222cfdccc9b7..0000000000000
--- a/c_glib/arrow-glib/int16-array-builder.h
+++ /dev/null
@@ -1,76 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#pragma once
-
-#include <arrow-glib/array-builder.h>
-
-G_BEGIN_DECLS
-
-#define GARROW_TYPE_INT16_ARRAY_BUILDER \
-  (garrow_int16_array_builder_get_type())
-#define GARROW_INT16_ARRAY_BUILDER(obj) \
-  (G_TYPE_CHECK_INSTANCE_CAST((obj), \
-                              GARROW_TYPE_INT16_ARRAY_BUILDER, \
-                              GArrowInt16ArrayBuilder))
-#define GARROW_INT16_ARRAY_BUILDER_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_CAST((klass), \
-                           GARROW_TYPE_INT16_ARRAY_BUILDER, \
-                           GArrowInt16ArrayBuilderClass))
-#define GARROW_IS_INT16_ARRAY_BUILDER(obj) \
-  (G_TYPE_CHECK_INSTANCE_TYPE((obj), \
-                              GARROW_TYPE_INT16_ARRAY_BUILDER))
-#define GARROW_IS_INT16_ARRAY_BUILDER_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_TYPE((klass), \
-                           GARROW_TYPE_INT16_ARRAY_BUILDER))
-#define GARROW_INT16_ARRAY_BUILDER_GET_CLASS(obj) \
-  (G_TYPE_INSTANCE_GET_CLASS((obj), \
-                             GARROW_TYPE_INT16_ARRAY_BUILDER, \
-                             GArrowInt16ArrayBuilderClass))
-
-typedef struct _GArrowInt16ArrayBuilder GArrowInt16ArrayBuilder;
-typedef struct _GArrowInt16ArrayBuilderClass GArrowInt16ArrayBuilderClass;
-
-/**
- * GArrowInt16ArrayBuilder:
- *
- * It wraps `arrow::Int16Builder`.
- */
-struct _GArrowInt16ArrayBuilder
-{
-  /*< private >*/
-  GArrowArrayBuilder parent_instance;
-};
-
-struct _GArrowInt16ArrayBuilderClass
-{
-  GArrowArrayBuilderClass parent_class;
-};
-
-GType garrow_int16_array_builder_get_type(void) G_GNUC_CONST;
-
-GArrowInt16ArrayBuilder *garrow_int16_array_builder_new(void);
-
-gboolean garrow_int16_array_builder_append(GArrowInt16ArrayBuilder *builder,
-                                           gint16 value,
-                                           GError **error);
-gboolean garrow_int16_array_builder_append_null(GArrowInt16ArrayBuilder *builder,
-                                                GError **error);
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/int32-array-builder.cpp b/c_glib/arrow-glib/int32-array-builder.cpp
deleted file mode 100644
index 30cc4702f68fb..0000000000000
--- a/c_glib/arrow-glib/int32-array-builder.cpp
+++ /dev/null
@@ -1,120 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#ifdef HAVE_CONFIG_H
-#  include <config.h>
-#endif
-
-#include <arrow-glib/array-builder.hpp>
-#include <arrow-glib/error.hpp>
-#include <arrow-glib/int32-array-builder.h>
-
-G_BEGIN_DECLS
-
-/**
- * SECTION: int32-array-builder
- * @short_description: 32-bit integer array builder class
- *
- * #GArrowInt32ArrayBuilder is the class to create a new
- * #GArrowInt32Array.
- */
-
-G_DEFINE_TYPE(GArrowInt32ArrayBuilder,
-              garrow_int32_array_builder,
-              GARROW_TYPE_ARRAY_BUILDER)
-
-static void
-garrow_int32_array_builder_init(GArrowInt32ArrayBuilder *builder)
-{
-}
-
-static void
-garrow_int32_array_builder_class_init(GArrowInt32ArrayBuilderClass *klass)
-{
-}
-
-/**
- * garrow_int32_array_builder_new:
- *
- * Returns: A newly created #GArrowInt32ArrayBuilder.
- */
-GArrowInt32ArrayBuilder *
-garrow_int32_array_builder_new(void)
-{
-  auto memory_pool = arrow::default_memory_pool();
-  auto arrow_builder =
-    std::make_shared<arrow::Int32Builder>(memory_pool, arrow::int32());
-  auto builder =
-    GARROW_INT32_ARRAY_BUILDER(g_object_new(GARROW_TYPE_INT32_ARRAY_BUILDER,
-                                            "array-builder", &arrow_builder,
-                                            NULL));
-  return builder;
-}
-
-/**
- * garrow_int32_array_builder_append:
- * @builder: A #GArrowInt32ArrayBuilder.
- * @value: An int32 value.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- */
-gboolean
-garrow_int32_array_builder_append(GArrowInt32ArrayBuilder *builder,
-                                  gint32 value,
-                                  GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::Int32Builder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->Append(value);
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[int32-array-builder][append]");
-    return FALSE;
-  }
-}
-
-/**
- * garrow_int32_array_builder_append_null:
- * @builder: A #GArrowInt32ArrayBuilder.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- */
-gboolean
-garrow_int32_array_builder_append_null(GArrowInt32ArrayBuilder *builder,
-                                       GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::Int32Builder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->AppendNull();
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[int32-array-builder][append-null]");
-    return FALSE;
-  }
-}
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/int32-array-builder.h b/c_glib/arrow-glib/int32-array-builder.h
deleted file mode 100644
index bdb380d6070b0..0000000000000
--- a/c_glib/arrow-glib/int32-array-builder.h
+++ /dev/null
@@ -1,76 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#pragma once
-
-#include <arrow-glib/array-builder.h>
-
-G_BEGIN_DECLS
-
-#define GARROW_TYPE_INT32_ARRAY_BUILDER \
-  (garrow_int32_array_builder_get_type())
-#define GARROW_INT32_ARRAY_BUILDER(obj) \
-  (G_TYPE_CHECK_INSTANCE_CAST((obj), \
-                              GARROW_TYPE_INT32_ARRAY_BUILDER, \
-                              GArrowInt32ArrayBuilder))
-#define GARROW_INT32_ARRAY_BUILDER_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_CAST((klass), \
-                           GARROW_TYPE_INT32_ARRAY_BUILDER, \
-                           GArrowInt32ArrayBuilderClass))
-#define GARROW_IS_INT32_ARRAY_BUILDER(obj) \
-  (G_TYPE_CHECK_INSTANCE_TYPE((obj), \
-                              GARROW_TYPE_INT32_ARRAY_BUILDER))
-#define GARROW_IS_INT32_ARRAY_BUILDER_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_TYPE((klass), \
-                           GARROW_TYPE_INT32_ARRAY_BUILDER))
-#define GARROW_INT32_ARRAY_BUILDER_GET_CLASS(obj) \
-  (G_TYPE_INSTANCE_GET_CLASS((obj), \
-                             GARROW_TYPE_INT32_ARRAY_BUILDER, \
-                             GArrowInt32ArrayBuilderClass))
-
-typedef struct _GArrowInt32ArrayBuilder GArrowInt32ArrayBuilder;
-typedef struct _GArrowInt32ArrayBuilderClass GArrowInt32ArrayBuilderClass;
-
-/**
- * GArrowInt32ArrayBuilder:
- *
- * It wraps `arrow::Int32Builder`.
- */
-struct _GArrowInt32ArrayBuilder
-{
-  /*< private >*/
-  GArrowArrayBuilder parent_instance;
-};
-
-struct _GArrowInt32ArrayBuilderClass
-{
-  GArrowArrayBuilderClass parent_class;
-};
-
-GType garrow_int32_array_builder_get_type(void) G_GNUC_CONST;
-
-GArrowInt32ArrayBuilder *garrow_int32_array_builder_new(void);
-
-gboolean garrow_int32_array_builder_append(GArrowInt32ArrayBuilder *builder,
-                                           gint32 value,
-                                           GError **error);
-gboolean garrow_int32_array_builder_append_null(GArrowInt32ArrayBuilder *builder,
-                                                GError **error);
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/int64-array-builder.cpp b/c_glib/arrow-glib/int64-array-builder.cpp
deleted file mode 100644
index b5eff114f92c9..0000000000000
--- a/c_glib/arrow-glib/int64-array-builder.cpp
+++ /dev/null
@@ -1,120 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#ifdef HAVE_CONFIG_H
-#  include <config.h>
-#endif
-
-#include <arrow-glib/array-builder.hpp>
-#include <arrow-glib/error.hpp>
-#include <arrow-glib/int64-array-builder.h>
-
-G_BEGIN_DECLS
-
-/**
- * SECTION: int64-array-builder
- * @short_description: 64-bit integer array builder class
- *
- * #GArrowInt64ArrayBuilder is the class to create a new
- * #GArrowInt64Array.
- */
-
-G_DEFINE_TYPE(GArrowInt64ArrayBuilder,
-              garrow_int64_array_builder,
-              GARROW_TYPE_ARRAY_BUILDER)
-
-static void
-garrow_int64_array_builder_init(GArrowInt64ArrayBuilder *builder)
-{
-}
-
-static void
-garrow_int64_array_builder_class_init(GArrowInt64ArrayBuilderClass *klass)
-{
-}
-
-/**
- * garrow_int64_array_builder_new:
- *
- * Returns: A newly created #GArrowInt64ArrayBuilder.
- */
-GArrowInt64ArrayBuilder *
-garrow_int64_array_builder_new(void)
-{
-  auto memory_pool = arrow::default_memory_pool();
-  auto arrow_builder =
-    std::make_shared<arrow::Int64Builder>(memory_pool, arrow::int64());
-  auto builder =
-    GARROW_INT64_ARRAY_BUILDER(g_object_new(GARROW_TYPE_INT64_ARRAY_BUILDER,
-                                            "array-builder", &arrow_builder,
-                                            NULL));
-  return builder;
-}
-
-/**
- * garrow_int64_array_builder_append:
- * @builder: A #GArrowInt64ArrayBuilder.
- * @value: An int64 value.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- */
-gboolean
-garrow_int64_array_builder_append(GArrowInt64ArrayBuilder *builder,
-                                  gint64 value,
-                                  GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::Int64Builder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->Append(value);
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[int64-array-builder][append]");
-    return FALSE;
-  }
-}
-
-/**
- * garrow_int64_array_builder_append_null:
- * @builder: A #GArrowInt64ArrayBuilder.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- */
-gboolean
-garrow_int64_array_builder_append_null(GArrowInt64ArrayBuilder *builder,
-                                       GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::Int64Builder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->AppendNull();
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[int64-array-builder][append-null]");
-    return FALSE;
-  }
-}
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/int64-array-builder.h b/c_glib/arrow-glib/int64-array-builder.h
deleted file mode 100644
index 8f4947eb7d9b1..0000000000000
--- a/c_glib/arrow-glib/int64-array-builder.h
+++ /dev/null
@@ -1,76 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#pragma once
-
-#include <arrow-glib/array-builder.h>
-
-G_BEGIN_DECLS
-
-#define GARROW_TYPE_INT64_ARRAY_BUILDER \
-  (garrow_int64_array_builder_get_type())
-#define GARROW_INT64_ARRAY_BUILDER(obj) \
-  (G_TYPE_CHECK_INSTANCE_CAST((obj), \
-                              GARROW_TYPE_INT64_ARRAY_BUILDER, \
-                              GArrowInt64ArrayBuilder))
-#define GARROW_INT64_ARRAY_BUILDER_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_CAST((klass), \
-                           GARROW_TYPE_INT64_ARRAY_BUILDER, \
-                           GArrowInt64ArrayBuilderClass))
-#define GARROW_IS_INT64_ARRAY_BUILDER(obj) \
-  (G_TYPE_CHECK_INSTANCE_TYPE((obj), \
-                              GARROW_TYPE_INT64_ARRAY_BUILDER))
-#define GARROW_IS_INT64_ARRAY_BUILDER_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_TYPE((klass), \
-                           GARROW_TYPE_INT64_ARRAY_BUILDER))
-#define GARROW_INT64_ARRAY_BUILDER_GET_CLASS(obj) \
-  (G_TYPE_INSTANCE_GET_CLASS((obj), \
-                             GARROW_TYPE_INT64_ARRAY_BUILDER, \
-                             GArrowInt64ArrayBuilderClass))
-
-typedef struct _GArrowInt64ArrayBuilder GArrowInt64ArrayBuilder;
-typedef struct _GArrowInt64ArrayBuilderClass GArrowInt64ArrayBuilderClass;
-
-/**
- * GArrowInt64ArrayBuilder:
- *
- * It wraps `arrow::Int64Builder`.
- */
-struct _GArrowInt64ArrayBuilder
-{
-  /*< private >*/
-  GArrowArrayBuilder parent_instance;
-};
-
-struct _GArrowInt64ArrayBuilderClass
-{
-  GArrowArrayBuilderClass parent_class;
-};
-
-GType garrow_int64_array_builder_get_type(void) G_GNUC_CONST;
-
-GArrowInt64ArrayBuilder *garrow_int64_array_builder_new(void);
-
-gboolean garrow_int64_array_builder_append(GArrowInt64ArrayBuilder *builder,
-                                           gint64 value,
-                                           GError **error);
-gboolean garrow_int64_array_builder_append_null(GArrowInt64ArrayBuilder *builder,
-                                                GError **error);
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/int8-array-builder.cpp b/c_glib/arrow-glib/int8-array-builder.cpp
deleted file mode 100644
index 5107a6fae1f6a..0000000000000
--- a/c_glib/arrow-glib/int8-array-builder.cpp
+++ /dev/null
@@ -1,120 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#ifdef HAVE_CONFIG_H
-#  include <config.h>
-#endif
-
-#include <arrow-glib/array-builder.hpp>
-#include <arrow-glib/error.hpp>
-#include <arrow-glib/int8-array-builder.h>
-
-G_BEGIN_DECLS
-
-/**
- * SECTION: int8-array-builder
- * @short_description: 8-bit integer array builder class
- *
- * #GArrowInt8ArrayBuilder is the class to create a new
- * #GArrowInt8Array.
- */
-
-G_DEFINE_TYPE(GArrowInt8ArrayBuilder,
-              garrow_int8_array_builder,
-              GARROW_TYPE_ARRAY_BUILDER)
-
-static void
-garrow_int8_array_builder_init(GArrowInt8ArrayBuilder *builder)
-{
-}
-
-static void
-garrow_int8_array_builder_class_init(GArrowInt8ArrayBuilderClass *klass)
-{
-}
-
-/**
- * garrow_int8_array_builder_new:
- *
- * Returns: A newly created #GArrowInt8ArrayBuilder.
- */
-GArrowInt8ArrayBuilder *
-garrow_int8_array_builder_new(void)
-{
-  auto memory_pool = arrow::default_memory_pool();
-  auto arrow_builder =
-    std::make_shared<arrow::Int8Builder>(memory_pool, arrow::int8());
-  auto builder =
-    GARROW_INT8_ARRAY_BUILDER(g_object_new(GARROW_TYPE_INT8_ARRAY_BUILDER,
-                                           "array-builder", &arrow_builder,
-                                           NULL));
-  return builder;
-}
-
-/**
- * garrow_int8_array_builder_append:
- * @builder: A #GArrowInt8ArrayBuilder.
- * @value: An int8 value.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- */
-gboolean
-garrow_int8_array_builder_append(GArrowInt8ArrayBuilder *builder,
-                                 gint8 value,
-                                 GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::Int8Builder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->Append(value);
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[int8-array-builder][append]");
-    return FALSE;
-  }
-}
-
-/**
- * garrow_int8_array_builder_append_null:
- * @builder: A #GArrowInt8ArrayBuilder.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- */
-gboolean
-garrow_int8_array_builder_append_null(GArrowInt8ArrayBuilder *builder,
-                                      GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::Int8Builder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->AppendNull();
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[int8-array-builder][append-null]");
-    return FALSE;
-  }
-}
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/int8-array-builder.h b/c_glib/arrow-glib/int8-array-builder.h
deleted file mode 100644
index 321e9310a6447..0000000000000
--- a/c_glib/arrow-glib/int8-array-builder.h
+++ /dev/null
@@ -1,76 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#pragma once
-
-#include <arrow-glib/array-builder.h>
-
-G_BEGIN_DECLS
-
-#define GARROW_TYPE_INT8_ARRAY_BUILDER \
-  (garrow_int8_array_builder_get_type())
-#define GARROW_INT8_ARRAY_BUILDER(obj) \
-  (G_TYPE_CHECK_INSTANCE_CAST((obj), \
-                              GARROW_TYPE_INT8_ARRAY_BUILDER, \
-                              GArrowInt8ArrayBuilder))
-#define GARROW_INT8_ARRAY_BUILDER_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_CAST((klass), \
-                           GARROW_TYPE_INT8_ARRAY_BUILDER, \
-                           GArrowInt8ArrayBuilderClass))
-#define GARROW_IS_INT8_ARRAY_BUILDER(obj) \
-  (G_TYPE_CHECK_INSTANCE_TYPE((obj), \
-                              GARROW_TYPE_INT8_ARRAY_BUILDER))
-#define GARROW_IS_INT8_ARRAY_BUILDER_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_TYPE((klass), \
-                           GARROW_TYPE_INT8_ARRAY_BUILDER))
-#define GARROW_INT8_ARRAY_BUILDER_GET_CLASS(obj) \
-  (G_TYPE_INSTANCE_GET_CLASS((obj), \
-                             GARROW_TYPE_INT8_ARRAY_BUILDER, \
-                             GArrowInt8ArrayBuilderClass))
-
-typedef struct _GArrowInt8ArrayBuilder GArrowInt8ArrayBuilder;
-typedef struct _GArrowInt8ArrayBuilderClass GArrowInt8ArrayBuilderClass;
-
-/**
- * GArrowInt8ArrayBuilder:
- *
- * It wraps `arrow::Int8Builder`.
- */
-struct _GArrowInt8ArrayBuilder
-{
-  /*< private >*/
-  GArrowArrayBuilder parent_instance;
-};
-
-struct _GArrowInt8ArrayBuilderClass
-{
-  GArrowArrayBuilderClass parent_class;
-};
-
-GType garrow_int8_array_builder_get_type(void) G_GNUC_CONST;
-
-GArrowInt8ArrayBuilder *garrow_int8_array_builder_new(void);
-
-gboolean garrow_int8_array_builder_append(GArrowInt8ArrayBuilder *builder,
-                                          gint8 value,
-                                          GError **error);
-gboolean garrow_int8_array_builder_append_null(GArrowInt8ArrayBuilder *builder,
-                                               GError **error);
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/list-array-builder.cpp b/c_glib/arrow-glib/list-array-builder.cpp
deleted file mode 100644
index 6c8f53da1fc98..0000000000000
--- a/c_glib/arrow-glib/list-array-builder.cpp
+++ /dev/null
@@ -1,173 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#ifdef HAVE_CONFIG_H
-#  include <config.h>
-#endif
-
-#include <arrow-glib/array-builder.hpp>
-#include <arrow-glib/error.hpp>
-#include <arrow-glib/list-array-builder.h>
-
-G_BEGIN_DECLS
-
-/**
- * SECTION: list-array-builder
- * @short_description: List array builder class
- * @include: arrow-glib/arrow-glib.h
- *
- * #GArrowListArrayBuilder is the class to create a new
- * #GArrowListArray.
- */
-
-G_DEFINE_TYPE(GArrowListArrayBuilder,
-              garrow_list_array_builder,
-              GARROW_TYPE_ARRAY_BUILDER)
-
-static void
-garrow_list_array_builder_init(GArrowListArrayBuilder *builder)
-{
-}
-
-static void
-garrow_list_array_builder_class_init(GArrowListArrayBuilderClass *klass)
-{
-}
-
-/**
- * garrow_list_array_builder_new:
- * @value_builder: A #GArrowArrayBuilder for value array.
- *
- * Returns: A newly created #GArrowListArrayBuilder.
- */
-GArrowListArrayBuilder *
-garrow_list_array_builder_new(GArrowArrayBuilder *value_builder)
-{
-  auto memory_pool = arrow::default_memory_pool();
-  auto arrow_value_builder = garrow_array_builder_get_raw(value_builder);
-  auto arrow_list_builder =
-    std::make_shared<arrow::ListBuilder>(memory_pool, arrow_value_builder);
-  std::shared_ptr<arrow::ArrayBuilder> arrow_builder = arrow_list_builder;
-  auto builder = garrow_array_builder_new_raw(&arrow_builder);
-  return GARROW_LIST_ARRAY_BUILDER(builder);
-}
-
-/**
- * garrow_list_array_builder_append:
- * @builder: A #GArrowListArrayBuilder.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- *
- * It appends a new list element. To append a new list element, you
- * need to call this function and then append the list element's
- * values to `value_builder`. `value_builder` is the
- * #GArrowArrayBuilder specified to the constructor. You can get
- * `value_builder` by garrow_list_array_builder_get_value_builder().
- *
- * |[
- * GArrowInt8ArrayBuilder *value_builder;
- * GArrowListArrayBuilder *builder;
- *
- * value_builder = garrow_int8_array_builder_new();
- * builder =
- *   garrow_list_array_builder_new(GARROW_ARRAY_BUILDER(value_builder));
- *
- * // Start the 0th list element: [1, 0, -1]
- * garrow_list_array_builder_append(builder, NULL);
- * garrow_int8_array_builder_append(value_builder, 1, NULL);
- * garrow_int8_array_builder_append(value_builder, 0, NULL);
- * garrow_int8_array_builder_append(value_builder, -1, NULL);
- *
- * // Start the 1st list element: [-29, 29]
- * garrow_list_array_builder_append(builder, NULL);
- * garrow_int8_array_builder_append(value_builder, -29, NULL);
- * garrow_int8_array_builder_append(value_builder, 29, NULL);
- *
- * {
- *   // [[1, 0, -1], [-29, 29]]
- *   GArrowArray *array =
- *     garrow_array_builder_finish(GARROW_ARRAY_BUILDER(builder));
- *   // Now, the builder is no longer needed.
- *   g_object_unref(builder);
- *   g_object_unref(value_builder);
- *
- *   // Use array...
- *   g_object_unref(array);
- * }
- * ]|
- */
-gboolean
-garrow_list_array_builder_append(GArrowListArrayBuilder *builder,
-                                 GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::ListBuilder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->Append();
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[list-array-builder][append]");
-    return FALSE;
-  }
-}
-
-/**
- * garrow_list_array_builder_append_null:
- * @builder: A #GArrowListArrayBuilder.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- *
- * It appends a new NULL element.
- */
-gboolean
-garrow_list_array_builder_append_null(GArrowListArrayBuilder *builder,
-                                      GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::ListBuilder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->AppendNull();
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[list-array-builder][append-null]");
-    return FALSE;
-  }
-}
-
-/**
- * garrow_list_array_builder_get_value_builder:
- * @builder: A #GArrowListArrayBuilder.
- *
- * Returns: (transfer full): The #GArrowArrayBuilder for values.
- */
-GArrowArrayBuilder *
-garrow_list_array_builder_get_value_builder(GArrowListArrayBuilder *builder)
-{
-  auto arrow_builder =
-    static_cast<arrow::ListBuilder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-  auto arrow_value_builder = arrow_builder->value_builder();
-  return garrow_array_builder_new_raw(&arrow_value_builder);
-}
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/list-array-builder.h b/c_glib/arrow-glib/list-array-builder.h
deleted file mode 100644
index 2c2e58e54309b..0000000000000
--- a/c_glib/arrow-glib/list-array-builder.h
+++ /dev/null
@@ -1,77 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#pragma once
-
-#include <arrow-glib/array-builder.h>
-
-G_BEGIN_DECLS
-
-#define GARROW_TYPE_LIST_ARRAY_BUILDER \
-  (garrow_list_array_builder_get_type())
-#define GARROW_LIST_ARRAY_BUILDER(obj) \
-  (G_TYPE_CHECK_INSTANCE_CAST((obj), \
-                              GARROW_TYPE_LIST_ARRAY_BUILDER, \
-                              GArrowListArrayBuilder))
-#define GARROW_LIST_ARRAY_BUILDER_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_CAST((klass), \
-                           GARROW_TYPE_LIST_ARRAY_BUILDER, \
-                           GArrowListArrayBuilderClass))
-#define GARROW_IS_LIST_ARRAY_BUILDER(obj) \
-  (G_TYPE_CHECK_INSTANCE_TYPE((obj), \
-                              GARROW_TYPE_LIST_ARRAY_BUILDER))
-#define GARROW_IS_LIST_ARRAY_BUILDER_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_TYPE((klass), \
-                           GARROW_TYPE_LIST_ARRAY_BUILDER))
-#define GARROW_LIST_ARRAY_BUILDER_GET_CLASS(obj) \
-  (G_TYPE_INSTANCE_GET_CLASS((obj), \
-                             GARROW_TYPE_LIST_ARRAY_BUILDER, \
-                             GArrowListArrayBuilderClass))
-
-typedef struct _GArrowListArrayBuilder GArrowListArrayBuilder;
-typedef struct _GArrowListArrayBuilderClass GArrowListArrayBuilderClass;
-
-/**
- * GArrowListArrayBuilder:
- *
- * It wraps `arrow::ListBuilder`.
- */
-struct _GArrowListArrayBuilder
-{
-  /*< private >*/
-  GArrowArrayBuilder parent_instance;
-};
-
-struct _GArrowListArrayBuilderClass
-{
-  GArrowArrayBuilderClass parent_class;
-};
-
-GType garrow_list_array_builder_get_type(void) G_GNUC_CONST;
-
-GArrowListArrayBuilder *garrow_list_array_builder_new(GArrowArrayBuilder *value_builder);
-
-gboolean garrow_list_array_builder_append(GArrowListArrayBuilder *builder,
-                                          GError **error);
-gboolean garrow_list_array_builder_append_null(GArrowListArrayBuilder *builder,
-                                               GError **error);
-
-GArrowArrayBuilder *garrow_list_array_builder_get_value_builder(GArrowListArrayBuilder *builder);
-
-G_END_DECLS
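One subtlety in the list-builder API above is the (transfer full)
annotation on garrow_list_array_builder_get_value_builder(): each call
returns a GObject wrapper that the caller owns and must release. A
short sketch of the implied contract, assuming a `builder` created as
in the documentation example:

    /* The returned wrapper is a full reference owned by the caller,
     * so it must be unreffed when done. */
    GArrowArrayBuilder *value_builder =
      garrow_list_array_builder_get_value_builder(builder);
    garrow_int8_array_builder_append(GARROW_INT8_ARRAY_BUILDER(value_builder),
                                     42, NULL);
    g_object_unref(value_builder);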
diff --git a/c_glib/arrow-glib/string-array-builder.cpp b/c_glib/arrow-glib/string-array-builder.cpp
deleted file mode 100644
index ebad53a18704a..0000000000000
--- a/c_glib/arrow-glib/string-array-builder.cpp
+++ /dev/null
@@ -1,97 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#ifdef HAVE_CONFIG_H
-#  include <config.h>
-#endif
-
-#include <arrow-glib/array-builder.hpp>
-#include <arrow-glib/error.hpp>
-#include <arrow-glib/string-array-builder.h>
-
-G_BEGIN_DECLS
-
-/**
- * SECTION: string-array-builder
- * @short_description: UTF-8 encoded string array builder class
- *
- * #GArrowStringArrayBuilder is the class to create a new
- * #GArrowStringArray.
- */
-
-G_DEFINE_TYPE(GArrowStringArrayBuilder,
-              garrow_string_array_builder,
-              GARROW_TYPE_BINARY_ARRAY_BUILDER)
-
-static void
-garrow_string_array_builder_init(GArrowStringArrayBuilder *builder)
-{
-}
-
-static void
-garrow_string_array_builder_class_init(GArrowStringArrayBuilderClass *klass)
-{
-}
-
-/**
- * garrow_string_array_builder_new:
- *
- * Returns: A newly created #GArrowStringArrayBuilder.
- */
-GArrowStringArrayBuilder *
-garrow_string_array_builder_new(void)
-{
-  auto memory_pool = arrow::default_memory_pool();
-  auto arrow_builder =
-    std::make_shared<arrow::StringBuilder>(memory_pool);
-  auto builder =
-    GARROW_STRING_ARRAY_BUILDER(g_object_new(GARROW_TYPE_STRING_ARRAY_BUILDER,
-                                             "array-builder", &arrow_builder,
-                                             NULL));
-  return builder;
-}
-
-/**
- * garrow_string_array_builder_append:
- * @builder: A #GArrowStringArrayBuilder.
- * @value: A string value.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- */
-gboolean
-garrow_string_array_builder_append(GArrowStringArrayBuilder *builder,
-                                   const gchar *value,
-                                   GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::StringBuilder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->Append(value,
-                                      static_cast<gint32>(strlen(value)));
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[string-array-builder][append]");
-    return FALSE;
-  }
-}
-
-G_END_DECLS
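Because garrow_string_array_builder_append() measures @value with
strlen(), values must be NUL-terminated UTF-8 and cannot contain
embedded NUL bytes. A minimal usage sketch under that assumption,
again relying on garrow_array_builder_finish() to produce the array:

    GError *error = NULL;
    GArrowStringArrayBuilder *builder = garrow_string_array_builder_new();

    /* Build ["hello", "world"]; short-circuit on the first failure. */
    if (!garrow_string_array_builder_append(builder, "hello", &error) ||
        !garrow_string_array_builder_append(builder, "world", &error)) {
      g_print("append failed: %s\n", error->message);
      g_clear_error(&error);
    }

    GArrowArray *array =
      garrow_array_builder_finish(GARROW_ARRAY_BUILDER(builder));
    g_object_unref(builder);

    /* Use array... */
    g_object_unref(array);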
diff --git a/c_glib/arrow-glib/string-array-builder.h b/c_glib/arrow-glib/string-array-builder.h
deleted file mode 100644
index f370ed9edec9d..0000000000000
--- a/c_glib/arrow-glib/string-array-builder.h
+++ /dev/null
@@ -1,74 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#pragma once
-
-#include <arrow-glib/binary-array-builder.h>
-
-G_BEGIN_DECLS
-
-#define GARROW_TYPE_STRING_ARRAY_BUILDER \
-  (garrow_string_array_builder_get_type())
-#define GARROW_STRING_ARRAY_BUILDER(obj) \
-  (G_TYPE_CHECK_INSTANCE_CAST((obj), \
-                              GARROW_TYPE_STRING_ARRAY_BUILDER, \
-                              GArrowStringArrayBuilder))
-#define GARROW_STRING_ARRAY_BUILDER_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_CAST((klass), \
-                           GARROW_TYPE_STRING_ARRAY_BUILDER, \
-                           GArrowStringArrayBuilderClass))
-#define GARROW_IS_STRING_ARRAY_BUILDER(obj) \
-  (G_TYPE_CHECK_INSTANCE_TYPE((obj), \
-                              GARROW_TYPE_STRING_ARRAY_BUILDER))
-#define GARROW_IS_STRING_ARRAY_BUILDER_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_TYPE((klass), \
-                           GARROW_TYPE_STRING_ARRAY_BUILDER))
-#define GARROW_STRING_ARRAY_BUILDER_GET_CLASS(obj) \
-  (G_TYPE_INSTANCE_GET_CLASS((obj), \
-                             GARROW_TYPE_STRING_ARRAY_BUILDER, \
-                             GArrowStringArrayBuilderClass))
-
-typedef struct _GArrowStringArrayBuilder GArrowStringArrayBuilder;
-typedef struct _GArrowStringArrayBuilderClass GArrowStringArrayBuilderClass;
-
-/**
- * GArrowStringArrayBuilder:
- *
- * It wraps `arrow::StringBuilder`.
- */
-struct _GArrowStringArrayBuilder
-{
-  /*< private >*/
-  GArrowBinaryArrayBuilder parent_instance;
-};
-
-struct _GArrowStringArrayBuilderClass
-{
-  GArrowBinaryArrayBuilderClass parent_class;
-};
-
-GType garrow_string_array_builder_get_type(void) G_GNUC_CONST;
-
-GArrowStringArrayBuilder *garrow_string_array_builder_new(void);
-
-gboolean garrow_string_array_builder_append(GArrowStringArrayBuilder *builder,
-                                            const gchar *value,
-                                            GError **error);
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/struct-array-builder.cpp b/c_glib/arrow-glib/struct-array-builder.cpp
deleted file mode 100644
index 2453a5baf2ec8..0000000000000
--- a/c_glib/arrow-glib/struct-array-builder.cpp
+++ /dev/null
@@ -1,187 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#ifdef HAVE_CONFIG_H
-#  include <config.h>
-#endif
-
-#include <arrow-glib/array-builder.hpp>
-#include <arrow-glib/data-type.hpp>
-#include <arrow-glib/error.hpp>
-#include <arrow-glib/struct-array-builder.h>
-#include <arrow-glib/struct-data-type.h>
-
-G_BEGIN_DECLS
-
-/**
- * SECTION: struct-array-builder
- * @short_description: Struct array builder class
- * @include: arrow-glib/arrow-glib.h
- *
- * #GArrowStructArrayBuilder is the class to create a new
- * #GArrowStructArray.
- */
-
-G_DEFINE_TYPE(GArrowStructArrayBuilder,
-              garrow_struct_array_builder,
-              GARROW_TYPE_ARRAY_BUILDER)
-
-static void
-garrow_struct_array_builder_init(GArrowStructArrayBuilder *builder)
-{
-}
-
-static void
-garrow_struct_array_builder_class_init(GArrowStructArrayBuilderClass *klass)
-{
-}
-
-/**
- * garrow_struct_array_builder_new:
- * @data_type: #GArrowStructDataType for the struct.
- * @field_builders: (element-type GArrowArrayBuilder): #GArrowArrayBuilders
- *   for fields.
- *
- * Returns: A newly created #GArrowStructArrayBuilder.
- */
-GArrowStructArrayBuilder *
-garrow_struct_array_builder_new(GArrowStructDataType *data_type,
-                                GList *field_builders)
-{
-  auto memory_pool = arrow::default_memory_pool();
-  auto arrow_data_type = garrow_data_type_get_raw(GARROW_DATA_TYPE(data_type));
-  std::vector<std::shared_ptr<arrow::ArrayBuilder>> arrow_field_builders;
-  for (GList *node = field_builders; node; node = g_list_next(node)) {
-    auto field_builder = static_cast<GArrowArrayBuilder *>(node->data);
-    auto arrow_field_builder = garrow_array_builder_get_raw(field_builder);
-    arrow_field_builders.push_back(arrow_field_builder);
-  }
-
-  auto arrow_struct_builder =
-    std::make_shared<arrow::StructBuilder>(memory_pool,
-                                           arrow_data_type,
-                                           arrow_field_builders);
-  std::shared_ptr<arrow::ArrayBuilder> arrow_builder = arrow_struct_builder;
-  auto builder = garrow_array_builder_new_raw(&arrow_builder);
-  return GARROW_STRUCT_ARRAY_BUILDER(builder);
-}
-
-/**
- * garrow_struct_array_builder_append:
- * @builder: A #GArrowStructArrayBuilder.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- *
- * It appends a new struct element. To append a new struct element,
- * you need to call this function and then append the struct element's
- * field values to all `field_builder`s. Each `field_builder` is a
- * #GArrowArrayBuilder specified to the constructor. You can get a
- * `field_builder` by garrow_struct_array_builder_get_field_builder()
- * or garrow_struct_array_builder_get_field_builders().
- *
- * |[
- * // TODO
- * ]|
- */
-gboolean
-garrow_struct_array_builder_append(GArrowStructArrayBuilder *builder,
-                                   GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::StructBuilder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->Append();
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[struct-array-builder][append]");
-    return FALSE;
-  }
-}
-
-/**
- * garrow_struct_array_builder_append_null:
- * @builder: A #GArrowStructArrayBuilder.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- *
- * It appends a new NULL element.
- */
-gboolean
-garrow_struct_array_builder_append_null(GArrowStructArrayBuilder *builder,
-                                        GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::StructBuilder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->AppendNull();
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[struct-array-builder][append-null]");
-    return FALSE;
-  }
-}
-
-/**
- * garrow_struct_array_builder_get_field_builder:
- * @builder: A #GArrowStructArrayBuilder.
- * @i: The index of the field in the struct.
- *
- * Returns: (transfer full): The #GArrowArrayBuilder for the i-th field.
- */
-GArrowArrayBuilder *
-garrow_struct_array_builder_get_field_builder(GArrowStructArrayBuilder *builder,
-                                              gint i)
-{
-  auto arrow_builder =
-    static_cast<arrow::StructBuilder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-  auto arrow_field_builder = arrow_builder->field_builder(i);
-  return garrow_array_builder_new_raw(&arrow_field_builder);
-}
-
-/**
- * garrow_struct_array_builder_get_field_builders:
- * @builder: A #GArrowStructArrayBuilder.
- *
- * Returns: (element-type GArrowArrayBuilder) (transfer full):
- *   The #GArrowArrayBuilders for all fields.
- */
-GList *
-garrow_struct_array_builder_get_field_builders(GArrowStructArrayBuilder *builder)
-{
-  auto arrow_struct_builder =
-    static_cast<arrow::StructBuilder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  GList *field_builders = NULL;
-  for (auto arrow_field_builder : arrow_struct_builder->field_builders()) {
-    auto field_builder = garrow_array_builder_new_raw(&arrow_field_builder);
-    field_builders = g_list_prepend(field_builders, field_builder);
-  }
-
-  return g_list_reverse(field_builders);
-}
-
-G_END_DECLS
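The garrow_struct_array_builder_append() documentation above leaves its
usage example as a TODO. One plausible sketch, mirroring the
list-builder example and assuming arrow-glib's garrow_field_new(),
garrow_int8_data_type_new(), and garrow_struct_data_type_new()
constructors for building the struct type:

    /* struct<score: int8> with a single int8 field builder. */
    GList *fields = NULL;
    fields = g_list_append(fields,
                           garrow_field_new("score",
                                            GARROW_DATA_TYPE(garrow_int8_data_type_new())));
    GArrowStructDataType *data_type = garrow_struct_data_type_new(fields);

    GList *field_builders = NULL;
    field_builders = g_list_append(field_builders,
                                   garrow_int8_array_builder_new());
    GArrowStructArrayBuilder *builder =
      garrow_struct_array_builder_new(data_type, field_builders);

    /* Start the 0th struct element: {score: 1}, then fill each
     * field's builder. The field builder is (transfer full), so it
     * must be unreffed after use. */
    garrow_struct_array_builder_append(builder, NULL);
    {
      GArrowArrayBuilder *field_builder =
        garrow_struct_array_builder_get_field_builder(builder, 0);
      garrow_int8_array_builder_append(GARROW_INT8_ARRAY_BUILDER(field_builder),
                                       1, NULL);
      g_object_unref(field_builder);
    }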
diff --git a/c_glib/arrow-glib/struct-array-builder.h b/c_glib/arrow-glib/struct-array-builder.h
deleted file mode 100644
index 237b2b3264f24..0000000000000
--- a/c_glib/arrow-glib/struct-array-builder.h
+++ /dev/null
@@ -1,81 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#pragma once
-
-#include <arrow-glib/array-builder.h>
-#include <arrow-glib/struct-data-type.h>
-
-G_BEGIN_DECLS
-
-#define GARROW_TYPE_STRUCT_ARRAY_BUILDER \
-  (garrow_struct_array_builder_get_type())
-#define GARROW_STRUCT_ARRAY_BUILDER(obj) \
-  (G_TYPE_CHECK_INSTANCE_CAST((obj), \
-                              GARROW_TYPE_STRUCT_ARRAY_BUILDER, \
-                              GArrowStructArrayBuilder))
-#define GARROW_STRUCT_ARRAY_BUILDER_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_CAST((klass), \
-                           GARROW_TYPE_STRUCT_ARRAY_BUILDER, \
-                           GArrowStructArrayBuilderClass))
-#define GARROW_IS_STRUCT_ARRAY_BUILDER(obj) \
-  (G_TYPE_CHECK_INSTANCE_TYPE((obj), \
-                              GARROW_TYPE_STRUCT_ARRAY_BUILDER))
-#define GARROW_IS_STRUCT_ARRAY_BUILDER_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_TYPE((klass), \
-                           GARROW_TYPE_STRUCT_ARRAY_BUILDER))
-#define GARROW_STRUCT_ARRAY_BUILDER_GET_CLASS(obj) \
-  (G_TYPE_INSTANCE_GET_CLASS((obj), \
-                             GARROW_TYPE_STRUCT_ARRAY_BUILDER, \
-                             GArrowStructArrayBuilderClass))
-
-typedef struct _GArrowStructArrayBuilder GArrowStructArrayBuilder;
-typedef struct _GArrowStructArrayBuilderClass GArrowStructArrayBuilderClass;
-
-/**
- * GArrowStructArrayBuilder:
- *
- * It wraps `arrow::StructBuilder`.
- */
-struct _GArrowStructArrayBuilder
-{
-  /*< private >*/
-  GArrowArrayBuilder parent_instance;
-};
-
-struct _GArrowStructArrayBuilderClass
-{
-  GArrowArrayBuilderClass parent_class;
-};
-
-GType garrow_struct_array_builder_get_type(void) G_GNUC_CONST;
-
-GArrowStructArrayBuilder *garrow_struct_array_builder_new(GArrowStructDataType *data_type,
-                                                          GList *field_builders);
-
-gboolean garrow_struct_array_builder_append(GArrowStructArrayBuilder *builder,
-                                            GError **error);
-gboolean garrow_struct_array_builder_append_null(GArrowStructArrayBuilder *builder,
-                                                 GError **error);
-
-GArrowArrayBuilder *garrow_struct_array_builder_get_field_builder(GArrowStructArrayBuilder *builder,
-                                                                  gint i);
-GList *garrow_struct_array_builder_get_field_builders(GArrowStructArrayBuilder *builder);
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/uint16-array-builder.cpp b/c_glib/arrow-glib/uint16-array-builder.cpp
deleted file mode 100644
index bfade2de7a84d..0000000000000
--- a/c_glib/arrow-glib/uint16-array-builder.cpp
+++ /dev/null
@@ -1,120 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#ifdef HAVE_CONFIG_H
-#  include <config.h>
-#endif
-
-#include <arrow-glib/array-builder.hpp>
-#include <arrow-glib/error.hpp>
-#include <arrow-glib/uint16-array-builder.h>
-
-G_BEGIN_DECLS
-
-/**
- * SECTION: uint16-array-builder
- * @short_description: 16-bit unsigned integer array builder class
- *
- * #GArrowUInt16ArrayBuilder is the class to create a new
- * #GArrowUInt16Array.
- */
-
-G_DEFINE_TYPE(GArrowUInt16ArrayBuilder,
-              garrow_uint16_array_builder,
-              GARROW_TYPE_ARRAY_BUILDER)
-
-static void
-garrow_uint16_array_builder_init(GArrowUInt16ArrayBuilder *builder)
-{
-}
-
-static void
-garrow_uint16_array_builder_class_init(GArrowUInt16ArrayBuilderClass *klass)
-{
-}
-
-/**
- * garrow_uint16_array_builder_new:
- *
- * Returns: A newly created #GArrowUInt16ArrayBuilder.
- */
-GArrowUInt16ArrayBuilder *
-garrow_uint16_array_builder_new(void)
-{
-  auto memory_pool = arrow::default_memory_pool();
-  auto arrow_builder =
-    std::make_shared<arrow::UInt16Builder>(memory_pool, arrow::uint16());
-  auto builder =
-    GARROW_UINT16_ARRAY_BUILDER(g_object_new(GARROW_TYPE_UINT16_ARRAY_BUILDER,
-                                             "array-builder", &arrow_builder,
-                                             NULL));
-  return builder;
-}
-
-/**
- * garrow_uint16_array_builder_append:
- * @builder: A #GArrowUInt16ArrayBuilder.
- * @value: A uint16 value.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- */
-gboolean
-garrow_uint16_array_builder_append(GArrowUInt16ArrayBuilder *builder,
-                                   guint16 value,
-                                   GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::UInt16Builder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->Append(value);
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[uint16-array-builder][append]");
-    return FALSE;
-  }
-}
-
-/**
- * garrow_uint16_array_builder_append_null:
- * @builder: A #GArrowUInt16ArrayBuilder.
- * @error: (nullable): Return location for a #GError or %NULL.
- *
- * Returns: %TRUE on success, %FALSE if there was an error.
- */
-gboolean
-garrow_uint16_array_builder_append_null(GArrowUInt16ArrayBuilder *builder,
-                                        GError **error)
-{
-  auto arrow_builder =
-    static_cast<arrow::UInt16Builder *>(
-      garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get());
-
-  auto status = arrow_builder->AppendNull();
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[uint16-array-builder][append-null]");
-    return FALSE;
-  }
-}
-
-G_END_DECLS
diff --git a/c_glib/arrow-glib/uint16-array-builder.h b/c_glib/arrow-glib/uint16-array-builder.h
deleted file mode 100644
index c08966ecc1d91..0000000000000
--- a/c_glib/arrow-glib/uint16-array-builder.h
+++ /dev/null
@@ -1,76 +0,0 @@
-/*
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied. See the License for the
- * specific language governing permissions and limitations
- * under the License.
- */
-
-#pragma once
-
-#include <arrow-glib/array-builder.h>
-
-G_BEGIN_DECLS
-
-#define GARROW_TYPE_UINT16_ARRAY_BUILDER \
-  (garrow_uint16_array_builder_get_type())
-#define GARROW_UINT16_ARRAY_BUILDER(obj) \
-  (G_TYPE_CHECK_INSTANCE_CAST((obj), \
-                              GARROW_TYPE_UINT16_ARRAY_BUILDER, \
-                              GArrowUInt16ArrayBuilder))
-#define GARROW_UINT16_ARRAY_BUILDER_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_CAST((klass), \
-                           GARROW_TYPE_UINT16_ARRAY_BUILDER, \
-                           GArrowUInt16ArrayBuilderClass))
-#define GARROW_IS_UINT16_ARRAY_BUILDER(obj) \
-  (G_TYPE_CHECK_INSTANCE_TYPE((obj), \
-                              GARROW_TYPE_UINT16_ARRAY_BUILDER))
-#define GARROW_IS_UINT16_ARRAY_BUILDER_CLASS(klass) \
-  (G_TYPE_CHECK_CLASS_TYPE((klass), \
-                           GARROW_TYPE_UINT16_ARRAY_BUILDER))
-#define GARROW_UINT16_ARRAY_BUILDER_GET_CLASS(obj) \
-  (G_TYPE_INSTANCE_GET_CLASS((obj), \
-                             GARROW_TYPE_UINT16_ARRAY_BUILDER, \
-                             GArrowUInt16ArrayBuilderClass))
-
-typedef struct _GArrowUInt16ArrayBuilder GArrowUInt16ArrayBuilder;
-typedef struct _GArrowUInt16ArrayBuilderClass GArrowUInt16ArrayBuilderClass;
-
-/**
- * GArrowUInt16ArrayBuilder:
- *
- * It wraps `arrow::UInt16Builder`.
- */ -struct _GArrowUInt16ArrayBuilder -{ - /*< private >*/ - GArrowArrayBuilder parent_instance; -}; - -struct _GArrowUInt16ArrayBuilderClass -{ - GArrowArrayBuilderClass parent_class; -}; - -GType garrow_uint16_array_builder_get_type(void) G_GNUC_CONST; - -GArrowUInt16ArrayBuilder *garrow_uint16_array_builder_new(void); - -gboolean garrow_uint16_array_builder_append(GArrowUInt16ArrayBuilder *builder, - guint16 value, - GError **error); -gboolean garrow_uint16_array_builder_append_null(GArrowUInt16ArrayBuilder *builder, - GError **error); - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint32-array-builder.cpp b/c_glib/arrow-glib/uint32-array-builder.cpp deleted file mode 100644 index 35b1893619fa5..0000000000000 --- a/c_glib/arrow-glib/uint32-array-builder.cpp +++ /dev/null @@ -1,120 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: uint32-array-builder - * @short_description: 32-bit unsigned integer array builder class - * - * #GArrowUInt32ArrayBuilder is the class to create a new - * #GArrowUInt32Array. - */ - -G_DEFINE_TYPE(GArrowUInt32ArrayBuilder, - garrow_uint32_array_builder, - GARROW_TYPE_ARRAY_BUILDER) - -static void -garrow_uint32_array_builder_init(GArrowUInt32ArrayBuilder *builder) -{ -} - -static void -garrow_uint32_array_builder_class_init(GArrowUInt32ArrayBuilderClass *klass) -{ -} - -/** - * garrow_uint32_array_builder_new: - * - * Returns: A newly created #GArrowUInt32ArrayBuilder. - */ -GArrowUInt32ArrayBuilder * -garrow_uint32_array_builder_new(void) -{ - auto memory_pool = arrow::default_memory_pool(); - auto arrow_builder = - std::make_shared(memory_pool, arrow::uint32()); - auto builder = - GARROW_UINT32_ARRAY_BUILDER(g_object_new(GARROW_TYPE_UINT32_ARRAY_BUILDER, - "array-builder", &arrow_builder, - NULL)); - return builder; -} - -/** - * garrow_uint32_array_builder_append: - * @builder: A #GArrowUInt32ArrayBuilder. - * @value: An uint32 value. - * @error: (nullable): Return location for a #GError or %NULL. - * - * Returns: %TRUE on success, %FALSE if there was an error. - */ -gboolean -garrow_uint32_array_builder_append(GArrowUInt32ArrayBuilder *builder, - guint32 value, - GError **error) -{ - auto arrow_builder = - static_cast( - garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); - - auto status = arrow_builder->Append(value); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[uint32-array-builder][append]"); - return FALSE; - } -} - -/** - * garrow_uint32_array_builder_append_null: - * @builder: A #GArrowUInt32ArrayBuilder. - * @error: (nullable): Return location for a #GError or %NULL. 
- * - * Returns: %TRUE on success, %FALSE if there was an error. - */ -gboolean -garrow_uint32_array_builder_append_null(GArrowUInt32ArrayBuilder *builder, - GError **error) -{ - auto arrow_builder = - static_cast( - garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); - - auto status = arrow_builder->AppendNull(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[uint32-array-builder][append-null]"); - return FALSE; - } -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint32-array-builder.h b/c_glib/arrow-glib/uint32-array-builder.h deleted file mode 100644 index 4881d3b17ff0d..0000000000000 --- a/c_glib/arrow-glib/uint32-array-builder.h +++ /dev/null @@ -1,76 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_UINT32_ARRAY_BUILDER \ - (garrow_uint32_array_builder_get_type()) -#define GARROW_UINT32_ARRAY_BUILDER(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_UINT32_ARRAY_BUILDER, \ - GArrowUInt32ArrayBuilder)) -#define GARROW_UINT32_ARRAY_BUILDER_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_UINT32_ARRAY_BUILDER, \ - GArrowUInt32ArrayBuilderClass)) -#define GARROW_IS_UINT32_ARRAY_BUILDER(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_UINT32_ARRAY_BUILDER)) -#define GARROW_IS_UINT32_ARRAY_BUILDER_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_UINT32_ARRAY_BUILDER)) -#define GARROW_UINT32_ARRAY_BUILDER_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_UINT32_ARRAY_BUILDER, \ - GArrowUInt32ArrayBuilderClass)) - -typedef struct _GArrowUInt32ArrayBuilder GArrowUInt32ArrayBuilder; -typedef struct _GArrowUInt32ArrayBuilderClass GArrowUInt32ArrayBuilderClass; - -/** - * GArrowUInt32ArrayBuilder: - * - * It wraps `arrow::UInt32Builder`. - */ -struct _GArrowUInt32ArrayBuilder -{ - /*< private >*/ - GArrowArrayBuilder parent_instance; -}; - -struct _GArrowUInt32ArrayBuilderClass -{ - GArrowArrayBuilderClass parent_class; -}; - -GType garrow_uint32_array_builder_get_type(void) G_GNUC_CONST; - -GArrowUInt32ArrayBuilder *garrow_uint32_array_builder_new(void); - -gboolean garrow_uint32_array_builder_append(GArrowUInt32ArrayBuilder *builder, - guint32 value, - GError **error); -gboolean garrow_uint32_array_builder_append_null(GArrowUInt32ArrayBuilder *builder, - GError **error); - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint64-array-builder.cpp b/c_glib/arrow-glib/uint64-array-builder.cpp deleted file mode 100644 index 85d24ca54ab8b..0000000000000 --- a/c_glib/arrow-glib/uint64-array-builder.cpp +++ /dev/null @@ -1,120 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. 
See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: uint64-array-builder - * @short_description: 64-bit unsigned integer array builder class - * - * #GArrowUInt64ArrayBuilder is the class to create a new - * #GArrowUInt64Array. - */ - -G_DEFINE_TYPE(GArrowUInt64ArrayBuilder, - garrow_uint64_array_builder, - GARROW_TYPE_ARRAY_BUILDER) - -static void -garrow_uint64_array_builder_init(GArrowUInt64ArrayBuilder *builder) -{ -} - -static void -garrow_uint64_array_builder_class_init(GArrowUInt64ArrayBuilderClass *klass) -{ -} - -/** - * garrow_uint64_array_builder_new: - * - * Returns: A newly created #GArrowUInt64ArrayBuilder. - */ -GArrowUInt64ArrayBuilder * -garrow_uint64_array_builder_new(void) -{ - auto memory_pool = arrow::default_memory_pool(); - auto arrow_builder = - std::make_shared(memory_pool, arrow::uint64()); - auto builder = - GARROW_UINT64_ARRAY_BUILDER(g_object_new(GARROW_TYPE_UINT64_ARRAY_BUILDER, - "array-builder", &arrow_builder, - NULL)); - return builder; -} - -/** - * garrow_uint64_array_builder_append: - * @builder: A #GArrowUInt64ArrayBuilder. - * @value: An uint64 value. - * @error: (nullable): Return location for a #GError or %NULL. - * - * Returns: %TRUE on success, %FALSE if there was an error. - */ -gboolean -garrow_uint64_array_builder_append(GArrowUInt64ArrayBuilder *builder, - guint64 value, - GError **error) -{ - auto arrow_builder = - static_cast( - garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); - - auto status = arrow_builder->Append(value); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[uint64-array-builder][append]"); - return FALSE; - } -} - -/** - * garrow_uint64_array_builder_append_null: - * @builder: A #GArrowUInt64ArrayBuilder. - * @error: (nullable): Return location for a #GError or %NULL. - * - * Returns: %TRUE on success, %FALSE if there was an error. - */ -gboolean -garrow_uint64_array_builder_append_null(GArrowUInt64ArrayBuilder *builder, - GError **error) -{ - auto arrow_builder = - static_cast( - garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); - - auto status = arrow_builder->AppendNull(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[uint64-array-builder][append-null]"); - return FALSE; - } -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint64-array-builder.h b/c_glib/arrow-glib/uint64-array-builder.h deleted file mode 100644 index c51d1e2485d6f..0000000000000 --- a/c_glib/arrow-glib/uint64-array-builder.h +++ /dev/null @@ -1,76 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. 
The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_UINT64_ARRAY_BUILDER \ - (garrow_uint64_array_builder_get_type()) -#define GARROW_UINT64_ARRAY_BUILDER(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_UINT64_ARRAY_BUILDER, \ - GArrowUInt64ArrayBuilder)) -#define GARROW_UINT64_ARRAY_BUILDER_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_UINT64_ARRAY_BUILDER, \ - GArrowUInt64ArrayBuilderClass)) -#define GARROW_IS_UINT64_ARRAY_BUILDER(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_UINT64_ARRAY_BUILDER)) -#define GARROW_IS_UINT64_ARRAY_BUILDER_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_UINT64_ARRAY_BUILDER)) -#define GARROW_UINT64_ARRAY_BUILDER_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_UINT64_ARRAY_BUILDER, \ - GArrowUInt64ArrayBuilderClass)) - -typedef struct _GArrowUInt64ArrayBuilder GArrowUInt64ArrayBuilder; -typedef struct _GArrowUInt64ArrayBuilderClass GArrowUInt64ArrayBuilderClass; - -/** - * GArrowUInt64ArrayBuilder: - * - * It wraps `arrow::UInt64Builder`. - */ -struct _GArrowUInt64ArrayBuilder -{ - /*< private >*/ - GArrowArrayBuilder parent_instance; -}; - -struct _GArrowUInt64ArrayBuilderClass -{ - GArrowArrayBuilderClass parent_class; -}; - -GType garrow_uint64_array_builder_get_type(void) G_GNUC_CONST; - -GArrowUInt64ArrayBuilder *garrow_uint64_array_builder_new(void); - -gboolean garrow_uint64_array_builder_append(GArrowUInt64ArrayBuilder *builder, - guint64 value, - GError **error); -gboolean garrow_uint64_array_builder_append_null(GArrowUInt64ArrayBuilder *builder, - GError **error); - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint8-array-builder.cpp b/c_glib/arrow-glib/uint8-array-builder.cpp deleted file mode 100644 index 2f49693236b24..0000000000000 --- a/c_glib/arrow-glib/uint8-array-builder.cpp +++ /dev/null @@ -1,120 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: uint8-array-builder - * @short_description: 8-bit unsigned integer array builder class - * - * #GArrowUInt8ArrayBuilder is the class to create a new - * #GArrowUInt8Array. - */ - -G_DEFINE_TYPE(GArrowUInt8ArrayBuilder, - garrow_uint8_array_builder, - GARROW_TYPE_ARRAY_BUILDER) - -static void -garrow_uint8_array_builder_init(GArrowUInt8ArrayBuilder *builder) -{ -} - -static void -garrow_uint8_array_builder_class_init(GArrowUInt8ArrayBuilderClass *klass) -{ -} - -/** - * garrow_uint8_array_builder_new: - * - * Returns: A newly created #GArrowUInt8ArrayBuilder. - */ -GArrowUInt8ArrayBuilder * -garrow_uint8_array_builder_new(void) -{ - auto memory_pool = arrow::default_memory_pool(); - auto arrow_builder = - std::make_shared(memory_pool, arrow::uint8()); - auto builder = - GARROW_UINT8_ARRAY_BUILDER(g_object_new(GARROW_TYPE_UINT8_ARRAY_BUILDER, - "array-builder", &arrow_builder, - NULL)); - return builder; -} - -/** - * garrow_uint8_array_builder_append: - * @builder: A #GArrowUInt8ArrayBuilder. - * @value: An uint8 value. - * @error: (nullable): Return location for a #GError or %NULL. - * - * Returns: %TRUE on success, %FALSE if there was an error. - */ -gboolean -garrow_uint8_array_builder_append(GArrowUInt8ArrayBuilder *builder, - guint8 value, - GError **error) -{ - auto arrow_builder = - static_cast( - garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); - - auto status = arrow_builder->Append(value); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[uint8-array-builder][append]"); - return FALSE; - } -} - -/** - * garrow_uint8_array_builder_append_null: - * @builder: A #GArrowUInt8ArrayBuilder. - * @error: (nullable): Return location for a #GError or %NULL. - * - * Returns: %TRUE on success, %FALSE if there was an error. - */ -gboolean -garrow_uint8_array_builder_append_null(GArrowUInt8ArrayBuilder *builder, - GError **error) -{ - auto arrow_builder = - static_cast( - garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); - - auto status = arrow_builder->AppendNull(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[uint8-array-builder][append-null]"); - return FALSE; - } -} - -G_END_DECLS diff --git a/c_glib/arrow-glib/uint8-array-builder.h b/c_glib/arrow-glib/uint8-array-builder.h deleted file mode 100644 index e7216931a511c..0000000000000 --- a/c_glib/arrow-glib/uint8-array-builder.h +++ /dev/null @@ -1,76 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
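Taken together, the deletions above fold every per-type builder into the consolidated array-builder module while keeping the public C API shape. As a reference, a minimal usage sketch of that API follows; the garrow_array_builder_finish() call mirrors the `builder.finish` used by the Ruby tests later in this series, and the umbrella header name is an assumption, not something this patch defines:

    /* Sketch only: builds [1, null, 3] with the uint8 builder declared above. */
    #include <arrow-glib/arrow-glib.h>  /* assumed umbrella header */

    GArrowArray *
    build_uint8_array(GError **error)
    {
      GArrowUInt8ArrayBuilder *builder = garrow_uint8_array_builder_new();

      /* Every append reports failure through the trailing GError,
       * exactly as in the removed sources. */
      if (!garrow_uint8_array_builder_append(builder, 1, error) ||
          !garrow_uint8_array_builder_append_null(builder, error) ||
          !garrow_uint8_array_builder_append(builder, 3, error)) {
        g_object_unref(builder);
        return NULL;
      }

      GArrowArray *array =
        garrow_array_builder_finish(GARROW_ARRAY_BUILDER(builder));
      g_object_unref(builder);
      return array;
    }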
- */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_UINT8_ARRAY_BUILDER \ - (garrow_uint8_array_builder_get_type()) -#define GARROW_UINT8_ARRAY_BUILDER(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_UINT8_ARRAY_BUILDER, \ - GArrowUInt8ArrayBuilder)) -#define GARROW_UINT8_ARRAY_BUILDER_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_UINT8_ARRAY_BUILDER, \ - GArrowUInt8ArrayBuilderClass)) -#define GARROW_IS_UINT8_ARRAY_BUILDER(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_UINT8_ARRAY_BUILDER)) -#define GARROW_IS_UINT8_ARRAY_BUILDER_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_UINT8_ARRAY_BUILDER)) -#define GARROW_UINT8_ARRAY_BUILDER_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_UINT8_ARRAY_BUILDER, \ - GArrowUInt8ArrayBuilderClass)) - -typedef struct _GArrowUInt8ArrayBuilder GArrowUInt8ArrayBuilder; -typedef struct _GArrowUInt8ArrayBuilderClass GArrowUInt8ArrayBuilderClass; - -/** - * GArrowUInt8ArrayBuilder: - * - * It wraps `arrow::UInt8Builder`. - */ -struct _GArrowUInt8ArrayBuilder -{ - /*< private >*/ - GArrowArrayBuilder parent_instance; -}; - -struct _GArrowUInt8ArrayBuilderClass -{ - GArrowArrayBuilderClass parent_class; -}; - -GType garrow_uint8_array_builder_get_type(void) G_GNUC_CONST; - -GArrowUInt8ArrayBuilder *garrow_uint8_array_builder_new(void); - -gboolean garrow_uint8_array_builder_append(GArrowUInt8ArrayBuilder *builder, - guint8 value, - GError **error); -gboolean garrow_uint8_array_builder_append_null(GArrowUInt8ArrayBuilder *builder, - GError **error); - -G_END_DECLS diff --git a/c_glib/doc/reference/arrow-glib-docs.sgml b/c_glib/doc/reference/arrow-glib-docs.sgml index 5df9f64a85c92..bfb2776f621cc 100644 --- a/c_glib/doc/reference/arrow-glib-docs.sgml +++ b/c_glib/doc/reference/arrow-glib-docs.sgml @@ -40,21 +40,6 @@ Array builder - - - - - - - - - - - - - - - Type From 578b0ff15ebc2d3751c9b4ee87d9e31f1c7ae0b6 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Sat, 22 Apr 2017 10:49:17 -0400 Subject: [PATCH 0551/1644] ARROW-877: [GLib] Add garrow_array_get_null_bitmap() Author: Kouhei Sutou Closes #582 from kou/glib-array-null-bitmap and squashes the following commits: 7f679f6 [Kouhei Sutou] [GLib] Add garrow_array_get_null_bitmap() --- c_glib/arrow-glib/array.cpp | 19 +++++++++++++++++++ c_glib/arrow-glib/array.h | 2 ++ c_glib/test/test-array.rb | 11 +++++++++++ 3 files changed, 32 insertions(+) diff --git a/c_glib/arrow-glib/array.cpp b/c_glib/arrow-glib/array.cpp index dc1386b0daab9..1229f27ff906f 100644 --- a/c_glib/arrow-glib/array.cpp +++ b/c_glib/arrow-glib/array.cpp @@ -22,6 +22,7 @@ #endif #include +#include #include #include @@ -242,6 +243,24 @@ garrow_array_get_n_nulls(GArrowArray *array) return arrow_array->null_count(); } +/** + * garrow_array_get_null_bitmap: + * @array: A #GArrowArray. + * + * Returns: (transfer full) (nullable): The bitmap that indicates null + * value indexes for the array as #GArrowBuffer or %NULL when + * garrow_array_get_n_nulls() returns 0. + * + * Since: 0.3.0 + */ +GArrowBuffer * +garrow_array_get_null_bitmap(GArrowArray *array) +{ + auto arrow_array = garrow_array_get_raw(array); + auto arrow_null_bitmap = arrow_array->null_bitmap(); + return garrow_buffer_new_raw(&arrow_null_bitmap); +} + /** * garrow_array_get_value_data_type: * @array: A #GArrowArray. 
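The new garrow_array_get_null_bitmap() accessor above pairs with the Ruby test that follows; from C the same check looks roughly like this. A garrow_buffer_get_data() accessor returning a GBytes is an assumption here, inferred from the `null_bitmap.data` chain in that test:

    /* Sketch: dump the validity bitmap; a set bit marks a non-null element. */
    void
    print_null_bitmap(GArrowArray *array)
    {
      GArrowBuffer *bitmap = garrow_array_get_null_bitmap(array);
      if (!bitmap) {
        return;  /* NULL when garrow_array_get_n_nulls() returns 0 */
      }
      gsize size;
      GBytes *data = garrow_buffer_get_data(bitmap);  /* assumed accessor */
      const guint8 *bytes = (const guint8 *)g_bytes_get_data(data, &size);
      for (gsize i = 0; i < size; i++) {
        g_print("%02x ", bytes[i]);
      }
      g_print("\n");
      g_bytes_unref(data);
      g_object_unref(bitmap);
    }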
diff --git a/c_glib/arrow-glib/array.h b/c_glib/arrow-glib/array.h
index 74064562d6f39..f08ab84ef9e15 100644
--- a/c_glib/arrow-glib/array.h
+++ b/c_glib/arrow-glib/array.h
@@ -19,6 +19,7 @@
 
 #pragma once
 
+#include <arrow-glib/buffer.h>
 #include <arrow-glib/data-type.h>
 
 G_BEGIN_DECLS
@@ -62,6 +63,7 @@ gboolean garrow_array_is_null (GArrowArray *array,
 gint64 garrow_array_get_length (GArrowArray *array);
 gint64 garrow_array_get_offset (GArrowArray *array);
 gint64 garrow_array_get_n_nulls (GArrowArray *array);
+GArrowBuffer *garrow_array_get_null_bitmap(GArrowArray *array);
 GArrowDataType *garrow_array_get_value_data_type(GArrowArray *array);
 GArrowType garrow_array_get_value_type(GArrowArray *array);
 GArrowArray *garrow_array_slice (GArrowArray *array,
diff --git a/c_glib/test/test-array.rb b/c_glib/test/test-array.rb
index 06102eb36575b..a2a2a1e003862 100644
--- a/c_glib/test/test-array.rb
+++ b/c_glib/test/test-array.rb
@@ -40,6 +40,17 @@ def test_n_nulls
     assert_equal(2, array.n_nulls)
   end
 
+  def test_null_bitmap
+    builder = Arrow::BooleanArrayBuilder.new
+    builder.append_null
+    builder.append(true)
+    builder.append(false)
+    builder.append_null
+    builder.append(false)
+    array = builder.finish
+    assert_equal(0b10110, array.null_bitmap.data.to_s.unpack("c*")[0])
+  end
+
   def test_value_data_type
     builder = Arrow::BooleanArrayBuilder.new
     array = builder.finish

From 07c6ade9b8362ba30c5d784986aedcb3cfb6483a Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Sat, 22 Apr 2017 10:52:08 -0400
Subject: [PATCH 0552/1644] ARROW-849: [C++] Support setting production build
 dependencies with ARROW_BUILD_TOOLCHAIN

Opening this for comment. If we like this, we can do the same thing in
parquet-cpp. Will need to be documented in the README.

I did not use the environment variable for gflags/gtest/gbenchmark, since
these are test/benchmark-only dependencies, and they build automatically
when needed.
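The lookup order this patch establishes: a package-specific environment variable such as FLATBUFFERS_HOME wins, otherwise ARROW_BUILD_TOOLCHAIN supplies the prefix, otherwise the system search paths are used. A small C++ analogue of that precedence, illustrative only, since the real logic lives in the CMake hunks below:

    #include <cstdlib>
    #include <string>

    // Resolve a dependency prefix the way the CMakeLists.txt change does:
    // the package-specific variable overrides the shared toolchain prefix.
    std::string resolve_home(const char *package_env) {
      if (const char *home = std::getenv(package_env)) return home;
      if (const char *toolchain = std::getenv("ARROW_BUILD_TOOLCHAIN")) return toolchain;
      return "";  // empty: fall back to system search paths
    }
    // e.g. resolve_home("FLATBUFFERS_HOME") or resolve_home("JEMALLOC_HOME")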
Author: Wes McKinney Closes #565 from wesm/ARROW-849 and squashes the following commits: 4507712 [Wes McKinney] Fix use of RAPIDJSON_HOME e9fa400 [Wes McKinney] Use ARROW_BUILD_TOOLCHAIN if it's defined, but override with environment variables d056a83 [Wes McKinney] Pull environment variables by default, override if toolchain variable is present ec003c6 [Wes McKinney] Support setting production build dependencies with ARROW_BUILD_TOOLCHAIN environment variable --- cpp/CMakeLists.txt | 50 ++++++++++++++++++------- cpp/cmake_modules/FindFlatbuffers.cmake | 6 +-- cpp/cmake_modules/Findjemalloc.cmake | 4 +- 3 files changed, 42 insertions(+), 18 deletions(-) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 81e4c90c73147..978f70a361756 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -46,13 +46,6 @@ set(ARROW_SO_VERSION "0") set(ARROW_ABI_VERSION "${ARROW_SO_VERSION}.0.0") set(BUILD_SUPPORT_DIR "${CMAKE_SOURCE_DIR}/build-support") -set(THIRDPARTY_DIR "${CMAKE_SOURCE_DIR}/thirdparty") - -set(GFLAGS_VERSION "2.1.2") -set(GTEST_VERSION "1.8.0") -set(GBENCHMARK_VERSION "1.1.0") -set(FLATBUFFERS_VERSION "1.6.0") -set(JEMALLOC_VERSION "4.4.0") find_package(ClangTools) if ("$ENV{CMAKE_EXPORT_COMPILE_COMMANDS}" STREQUAL "1" OR CLANG_TIDY_FOUND) @@ -387,6 +380,40 @@ enable_testing() # Dependencies ############################################################ +# ---------------------------------------------------------------------- +# Thirdparty toolchain + +set(THIRDPARTY_DIR "${CMAKE_SOURCE_DIR}/thirdparty") +set(GFLAGS_VERSION "2.1.2") +set(GTEST_VERSION "1.8.0") +set(GBENCHMARK_VERSION "1.1.0") +set(FLATBUFFERS_VERSION "1.6.0") +set(JEMALLOC_VERSION "4.4.0") + +if (NOT "$ENV{ARROW_BUILD_TOOLCHAIN}" STREQUAL "") + set(FLATBUFFERS_HOME "$ENV{ARROW_BUILD_TOOLCHAIN}") + set(RAPIDJSON_HOME "$ENV{ARROW_BUILD_TOOLCHAIN}") + set(JEMALLOC_HOME "$ENV{ARROW_BUILD_TOOLCHAIN}") + + if (NOT DEFINED ENV{BOOST_ROOT}) + # Since we have to set this in the environment, we check whether + # $BOOST_ROOT is defined inside here + set(ENV{BOOST_ROOT} "$ENV{ARROW_BUILD_TOOLCHAIN}") + endif() +endif() + +if (DEFINED ENV{FLATBUFFERS_HOME}) + set(FLATBUFFERS_HOME "$ENV{FLATBUFFERS_HOME}") +endif() + +if (DEFINED ENV{RAPIDJSON_HOME}) + set(RAPIDJSON_HOME "$ENV{RAPIDJSON_HOME}") +endif() + +if (DEFINED ENV{JEMALLOC_HOME}) + set(JEMALLOC_HOME "$ENV{JEMALLOC_HOME}") +endif() + # ---------------------------------------------------------------------- # Add Boost dependencies (code adapted from Apache Kudu (incubating)) @@ -451,9 +478,6 @@ SET(ARROW_BOOST_LIBS boost_system boost_filesystem) include_directories(SYSTEM ${Boost_INCLUDE_DIR}) -# ---------------------------------------------------------------------- -# Enable / disable tests and benchmarks - if(ARROW_BUILD_TESTS OR ARROW_BUILD_BENCHMARKS) add_custom_target(unittest ctest -L unittest) @@ -616,7 +640,7 @@ endif() if (ARROW_IPC) # RapidJSON, header only dependency - if("$ENV{RAPIDJSON_HOME}" STREQUAL "") + if("${RAPIDJSON_HOME}" STREQUAL "") ExternalProject_Add(rapidjson_ep PREFIX "${CMAKE_BINARY_DIR}" URL "https://github.com/miloyip/rapidjson/archive/v1.1.0.tar.gz" @@ -630,14 +654,14 @@ if (ARROW_IPC) set(RAPIDJSON_INCLUDE_DIR "${SOURCE_DIR}/include") set(RAPIDJSON_VENDORED 1) else() - set(RAPIDJSON_INCLUDE_DIR "$ENV{RAPIDJSON_HOME}/include") + set(RAPIDJSON_INCLUDE_DIR "${RAPIDJSON_HOME}/include") set(RAPIDJSON_VENDORED 0) endif() message(STATUS "RapidJSON include dir: ${RAPIDJSON_INCLUDE_DIR}") include_directories(SYSTEM ${RAPIDJSON_INCLUDE_DIR}) ## 
Flatbuffers - if("$ENV{FLATBUFFERS_HOME}" STREQUAL "") + if("${FLATBUFFERS_HOME}" STREQUAL "") set(FLATBUFFERS_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/flatbuffers_ep-prefix/src/flatbuffers_ep-install") ExternalProject_Add(flatbuffers_ep URL "https://github.com/google/flatbuffers/archive/v${FLATBUFFERS_VERSION}.tar.gz" diff --git a/cpp/cmake_modules/FindFlatbuffers.cmake b/cpp/cmake_modules/FindFlatbuffers.cmake index ee472d1c8995f..7fa640ac9542f 100644 --- a/cpp/cmake_modules/FindFlatbuffers.cmake +++ b/cpp/cmake_modules/FindFlatbuffers.cmake @@ -31,8 +31,8 @@ # FLATBUFFERS_STATIC_LIB, path to libflatbuffers.a # FLATBUFFERS_FOUND, whether flatbuffers has been found -if( NOT "$ENV{FLATBUFFERS_HOME}" STREQUAL "") - file( TO_CMAKE_PATH "$ENV{FLATBUFFERS_HOME}" _native_path ) +if( NOT "${FLATBUFFERS_HOME}" STREQUAL "") + file( TO_CMAKE_PATH "${FLATBUFFERS_HOME}" _native_path ) list( APPEND _flatbuffers_roots ${_native_path} ) elseif ( Flatbuffers_HOME ) list( APPEND _flatbuffers_roots ${Flatbuffers_HOME} ) @@ -52,7 +52,7 @@ else () endif () find_program(FLATBUFFERS_COMPILER flatc - $ENV{FLATBUFFERS_HOME}/bin + ${FLATBUFFERS_HOME}/bin /usr/local/bin /usr/bin NO_DEFAULT_PATH diff --git a/cpp/cmake_modules/Findjemalloc.cmake b/cpp/cmake_modules/Findjemalloc.cmake index e511d4dde0f71..93458982b1dc6 100644 --- a/cpp/cmake_modules/Findjemalloc.cmake +++ b/cpp/cmake_modules/Findjemalloc.cmake @@ -30,8 +30,8 @@ # JEMALLOC_SHARED_LIB, path to libjemalloc.so/dylib # JEMALLOC_FOUND, whether flatbuffers has been found -if( NOT "$ENV{JEMALLOC_HOME}" STREQUAL "") - file( TO_CMAKE_PATH "$ENV{JEMALLOC_HOME}" _native_path ) +if( NOT "${JEMALLOC_HOME}" STREQUAL "") + file( TO_CMAKE_PATH "${JEMALLOC_HOME}" _native_path ) list( APPEND _jemalloc_roots ${_native_path} ) elseif ( JEMALLOC_HOME ) list( APPEND _jemalloc_roots ${JEMALLOC_HOME} ) From 39a37f76fa2cbf1dd52d3bc51b277553b772c343 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Sat, 22 Apr 2017 11:59:59 -0400 Subject: [PATCH 0553/1644] ARROW-878: [GLib] Add garrow_binary_array_get_buffer() This will be conflicted with #582 . Author: Kouhei Sutou Closes #583 from kou/glib-binary-array-buffer and squashes the following commits: a84b8e8 [Kouhei Sutou] [GLib] Add garrow_binary_array_get_buffer() --- c_glib/arrow-glib/array.cpp | 16 ++++++++++++++++ c_glib/arrow-glib/array.h | 1 + c_glib/test/test-binary-array.rb | 10 ++++++++++ c_glib/test/test-string-array.rb | 8 ++++++++ 4 files changed, 35 insertions(+) diff --git a/c_glib/arrow-glib/array.cpp b/c_glib/arrow-glib/array.cpp index 1229f27ff906f..2fd09015d39ec 100644 --- a/c_glib/arrow-glib/array.cpp +++ b/c_glib/arrow-glib/array.cpp @@ -707,6 +707,22 @@ garrow_binary_array_get_value(GArrowBinaryArray *array, return g_bytes_new_static(value, length); } +/** + * garrow_binary_array_get_buffer: + * @array: A #GArrowBinaryArray. + * + * Returns: (transfer full): The data of the array as #GArrowBuffer. 
+ */ +GArrowBuffer * +garrow_binary_array_get_buffer(GArrowBinaryArray *array) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + auto arrow_binary_array = + static_cast(arrow_array.get()); + auto arrow_data = arrow_binary_array->data(); + return garrow_buffer_new_raw(&arrow_data); +} + G_DEFINE_TYPE(GArrowStringArray, \ garrow_string_array, \ diff --git a/c_glib/arrow-glib/array.h b/c_glib/arrow-glib/array.h index f08ab84ef9e15..f8c6734a88308 100644 --- a/c_glib/arrow-glib/array.h +++ b/c_glib/arrow-glib/array.h @@ -664,6 +664,7 @@ GType garrow_binary_array_get_type(void) G_GNUC_CONST; GBytes *garrow_binary_array_get_value(GArrowBinaryArray *array, gint64 i); +GArrowBuffer *garrow_binary_array_get_buffer(GArrowBinaryArray *array); #define GARROW_TYPE_STRING_ARRAY \ (garrow_string_array_get_type()) diff --git a/c_glib/test/test-binary-array.rb b/c_glib/test/test-binary-array.rb index 6fe89247c8649..ccdf378ad41b9 100644 --- a/c_glib/test/test-binary-array.rb +++ b/c_glib/test/test-binary-array.rb @@ -23,4 +23,14 @@ def test_value array = builder.finish assert_equal(data, array.get_value(0).to_s) end + + def test_buffer + data1 = "\x00\x01\x02" + data2 = "\x03\x04\x05" + builder = Arrow::BinaryArrayBuilder.new + builder.append(data1) + builder.append(data2) + array = builder.finish + assert_equal(data1 + data2, array.buffer.data.to_s) + end end diff --git a/c_glib/test/test-string-array.rb b/c_glib/test/test-string-array.rb index a0f5a7b6b0fda..a076c228e0a4f 100644 --- a/c_glib/test/test-string-array.rb +++ b/c_glib/test/test-string-array.rb @@ -22,4 +22,12 @@ def test_value array = builder.finish assert_equal("Hello", array.get_string(0)) end + + def test_buffer + builder = Arrow::StringArrayBuilder.new + builder.append("Hello") + builder.append("World") + array = builder.finish + assert_equal("HelloWorld", array.buffer.data.to_s) + end end From a0a925b42541d1ed2711c547c9eccaaa91820711 Mon Sep 17 00:00:00 2001 From: Steven Phillips Date: Sat, 22 Apr 2017 15:07:00 -0400 Subject: [PATCH 0554/1644] ARROW-875: Avoid setting an extra empty in fillEmpties() Author: Steven Phillips Closes #579 from StevenMPhillips/fillEmpties and squashes the following commits: e454876 [Steven Phillips] ARROW-875: Avoid setting an extra empty in fillEmpties() --- .../main/codegen/templates/NullableValueVectors.java | 4 ++-- .../org/apache/arrow/vector/TestValueVector.java | 12 ++++++++++++ 2 files changed, 14 insertions(+), 2 deletions(-) diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index acee6cb738d76..178d5bd913910 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -526,8 +526,8 @@ public void set(int index, <#if type.major == "VarLen">byte[]<#elseif (type.widt private void fillEmpties(int index){ final ${valuesName}.Mutator valuesMutator = values.getMutator(); - for (int i = lastSet; i < index; i++) { - valuesMutator.setSafe(i + 1, emptyByteArray); + for (int i = lastSet + 1; i < index; i++) { + valuesMutator.setSafe(i, emptyByteArray); } while(index > bits.getValueCapacity()) { bits.reAlloc(); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java index 78ca14dc406ea..e6e49ab8d9341 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java +++ 
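The ARROW-875 hunk just shown is a subtle off-by-one, so a concrete trace helps. With hypothetical values lastSet = 3 and index = 6, a self-contained sketch of the before and after loop bounds (not part of the patch itself):

    #include <cstdio>

    int main() {
      const int lastSet = 3, index = 6;
      // old bounds: i runs 3..5, so slot i + 1 covers 4, 5 and 6, one slot
      // past the one the caller is about to fill itself
      for (int i = lastSet; i < index; i++) std::printf("old fills slot %d\n", i + 1);
      // new bounds: i runs 4..5, stopping at index - 1 as intended
      for (int i = lastSet + 1; i < index; i++) std::printf("new fills slot %d\n", i);
      return 0;
    }

The test that follows pins this down: after setSafe(4094, ...) and setValueCount(4095), the offset buffer should stay at 4096 * 4 bytes rather than doubling because of one extra empty written past the end.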
b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java @@ -466,4 +466,16 @@ public void testReAllocNullableVariableWidthVector() { } } + @Test + public void testFillEmptiesNotOverfill() { + try (final NullableVarCharVector vector = newVector(NullableVarCharVector.class, EMPTY_SCHEMA_PATH, MinorType.VARCHAR, allocator)) { + vector.allocateNew(); + + vector.getMutator().setSafe(4094, "hello".getBytes(), 0, 5); + vector.getMutator().setValueCount(4095); + assertEquals(4096 * 4, vector.getFieldBuffers().get(1).capacity()); + } + } + + } From 26e5bb1627f3b9768afccf018946720a688cf6f6 Mon Sep 17 00:00:00 2001 From: Jeff Reback Date: Sun, 23 Apr 2017 18:37:00 -0400 Subject: [PATCH 0555/1644] ARROW-879: compat with pandas v0.20.0 Author: Jeff Reback Closes #585 from jreback/compat and squashes the following commits: 1f1f4ed [Jeff Reback] use permanent pandas.api.types import 28c6608 [Jeff Reback] compat with pandas v0.20.0 --- python/pyarrow/compat.py | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/python/pyarrow/compat.py b/python/pyarrow/compat.py index 4dcc11677e7dd..8d15c4c1e3fb5 100644 --- a/python/pyarrow/compat.py +++ b/python/pyarrow/compat.py @@ -32,9 +32,18 @@ try: import pandas as pd - if LooseVersion(pd.__version__) < '0.19.0': - pdapi = pd.core.common + pdver = LooseVersion(pd.__version__) + if pdver >= '0.20.0': + try: + from pandas.api.types import DatetimeTZDtype + except AttributeError: + # can be removed once 0.20.0 is released + from pandas.core.dtypes.dtypes import DatetimeTZDtype + + pdapi = pd.api.types + elif pdver < '0.19.0': from pandas.core.dtypes import DatetimeTZDtype + pdapi = pd.core.common else: from pandas.types.dtypes import DatetimeTZDtype pdapi = pd.api.types From 33ac8a29176df340faa204b6c2e61b2973db028e Mon Sep 17 00:00:00 2001 From: Max Risuhin Date: Sun, 23 Apr 2017 21:56:19 -0400 Subject: [PATCH 0556/1644] =?UTF-8?q?ARROW-882:=20[C++]=20Rename=20statica?= =?UTF-8?q?lly=20build=20library=20on=20Windows=20to=20avoid=20=E2=80=A6?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit …conflict with shared version Currently, statically built arrow.lib file overwrites previously built arrow.lib file of shared build. 
To resolve this, statically built library renamed to arrow_static.lib Author: Max Risuhin Closes #590 from MaxRis/ARROW-882 and squashes the following commits: 4f2f3f0 [Max Risuhin] ARROW-882: [C++] Rename statically build library on Windows to avoid conflict with shared version --- cpp/cmake_modules/BuildUtils.cmake | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/cpp/cmake_modules/BuildUtils.cmake b/cpp/cmake_modules/BuildUtils.cmake index 4e6532be9aa7a..db83efed35031 100644 --- a/cpp/cmake_modules/BuildUtils.cmake +++ b/cpp/cmake_modules/BuildUtils.cmake @@ -147,11 +147,16 @@ function(ADD_ARROW_LIB LIB_NAME) endif() if (ARROW_BUILD_STATIC) + if (MSVC) + set(LIB_NAME_STATIC ${LIB_NAME}_static) + else() + set(LIB_NAME_STATIC ${LIB_NAME}) + endif() add_library(${LIB_NAME}_static STATIC $) set_target_properties(${LIB_NAME}_static PROPERTIES LIBRARY_OUTPUT_DIRECTORY "${BUILD_OUTPUT_ROOT_DIRECTORY}" - OUTPUT_NAME ${LIB_NAME}) + OUTPUT_NAME ${LIB_NAME_STATIC}) target_link_libraries(${LIB_NAME}_static LINK_PUBLIC ${ARG_STATIC_LINK_LIBS} From 95f489c4c62f964cc32374686e4917774aa8aef2 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Sun, 23 Apr 2017 22:15:46 -0400 Subject: [PATCH 0557/1644] ARROW-880: [GLib] Support getting raw data of primitive arrays Author: Kouhei Sutou Closes #586 from kou/glib-primitive-array-buffer and squashes the following commits: 970b109 [Kouhei Sutou] [GLib] Support getting raw data of primitive arrays --- c_glib/arrow-glib/array.cpp | 53 ++++++++++++++---- c_glib/arrow-glib/array.h | 89 +++++++++++++++++++++++-------- c_glib/test/test-boolean-array.rb | 9 ++++ c_glib/test/test-double-array.rb | 9 ++++ c_glib/test/test-float-array.rb | 9 ++++ c_glib/test/test-int16-array.rb | 9 ++++ c_glib/test/test-int32-array.rb | 9 ++++ c_glib/test/test-int64-array.rb | 9 ++++ c_glib/test/test-int8-array.rb | 9 ++++ c_glib/test/test-uint16-array.rb | 9 ++++ c_glib/test/test-uint32-array.rb | 9 ++++ c_glib/test/test-uint64-array.rb | 9 ++++ c_glib/test/test-uint8-array.rb | 9 ++++ 13 files changed, 208 insertions(+), 33 deletions(-) diff --git a/c_glib/arrow-glib/array.cpp b/c_glib/arrow-glib/array.cpp index 2fd09015d39ec..3ca860d2ff6d3 100644 --- a/c_glib/arrow-glib/array.cpp +++ b/c_glib/arrow-glib/array.cpp @@ -344,9 +344,40 @@ garrow_null_array_new(gint64 length) } +G_DEFINE_TYPE(GArrowPrimitiveArray, \ + garrow_primitive_array, \ + GARROW_TYPE_ARRAY) + +static void +garrow_primitive_array_init(GArrowPrimitiveArray *object) +{ +} + +static void +garrow_primitive_array_class_init(GArrowPrimitiveArrayClass *klass) +{ +} + +/** + * garrow_primitive_array_get_buffer: + * @array: A #GArrowPrimitiveArray. + * + * Returns: (transfer full): The data of the array as #GArrowBuffer. 
+ */ +GArrowBuffer * +garrow_primitive_array_get_buffer(GArrowPrimitiveArray *array) +{ + auto arrow_array = garrow_array_get_raw(GARROW_ARRAY(array)); + auto arrow_primitive_array = + static_cast(arrow_array.get()); + auto arrow_data = arrow_primitive_array->data(); + return garrow_buffer_new_raw(&arrow_data); +} + + G_DEFINE_TYPE(GArrowBooleanArray, \ garrow_boolean_array, \ - GARROW_TYPE_ARRAY) + GARROW_TYPE_PRIMITIVE_ARRAY) static void garrow_boolean_array_init(GArrowBooleanArray *object) @@ -376,7 +407,7 @@ garrow_boolean_array_get_value(GArrowBooleanArray *array, G_DEFINE_TYPE(GArrowInt8Array, \ garrow_int8_array, \ - GARROW_TYPE_ARRAY) + GARROW_TYPE_PRIMITIVE_ARRAY) static void garrow_int8_array_init(GArrowInt8Array *object) @@ -406,7 +437,7 @@ garrow_int8_array_get_value(GArrowInt8Array *array, G_DEFINE_TYPE(GArrowUInt8Array, \ garrow_uint8_array, \ - GARROW_TYPE_ARRAY) + GARROW_TYPE_PRIMITIVE_ARRAY) static void garrow_uint8_array_init(GArrowUInt8Array *object) @@ -436,7 +467,7 @@ garrow_uint8_array_get_value(GArrowUInt8Array *array, G_DEFINE_TYPE(GArrowInt16Array, \ garrow_int16_array, \ - GARROW_TYPE_ARRAY) + GARROW_TYPE_PRIMITIVE_ARRAY) static void garrow_int16_array_init(GArrowInt16Array *object) @@ -466,7 +497,7 @@ garrow_int16_array_get_value(GArrowInt16Array *array, G_DEFINE_TYPE(GArrowUInt16Array, \ garrow_uint16_array, \ - GARROW_TYPE_ARRAY) + GARROW_TYPE_PRIMITIVE_ARRAY) static void garrow_uint16_array_init(GArrowUInt16Array *object) @@ -496,7 +527,7 @@ garrow_uint16_array_get_value(GArrowUInt16Array *array, G_DEFINE_TYPE(GArrowInt32Array, \ garrow_int32_array, \ - GARROW_TYPE_ARRAY) + GARROW_TYPE_PRIMITIVE_ARRAY) static void garrow_int32_array_init(GArrowInt32Array *object) @@ -526,7 +557,7 @@ garrow_int32_array_get_value(GArrowInt32Array *array, G_DEFINE_TYPE(GArrowUInt32Array, \ garrow_uint32_array, \ - GARROW_TYPE_ARRAY) + GARROW_TYPE_PRIMITIVE_ARRAY) static void garrow_uint32_array_init(GArrowUInt32Array *object) @@ -556,7 +587,7 @@ garrow_uint32_array_get_value(GArrowUInt32Array *array, G_DEFINE_TYPE(GArrowInt64Array, \ garrow_int64_array, \ - GARROW_TYPE_ARRAY) + GARROW_TYPE_PRIMITIVE_ARRAY) static void garrow_int64_array_init(GArrowInt64Array *object) @@ -586,7 +617,7 @@ garrow_int64_array_get_value(GArrowInt64Array *array, G_DEFINE_TYPE(GArrowUInt64Array, \ garrow_uint64_array, \ - GARROW_TYPE_ARRAY) + GARROW_TYPE_PRIMITIVE_ARRAY) static void garrow_uint64_array_init(GArrowUInt64Array *object) @@ -615,7 +646,7 @@ garrow_uint64_array_get_value(GArrowUInt64Array *array, G_DEFINE_TYPE(GArrowFloatArray, \ garrow_float_array, \ - GARROW_TYPE_ARRAY) + GARROW_TYPE_PRIMITIVE_ARRAY) static void garrow_float_array_init(GArrowFloatArray *object) @@ -645,7 +676,7 @@ garrow_float_array_get_value(GArrowFloatArray *array, G_DEFINE_TYPE(GArrowDoubleArray, \ garrow_double_array, \ - GARROW_TYPE_ARRAY) + GARROW_TYPE_PRIMITIVE_ARRAY) static void garrow_double_array_init(GArrowDoubleArray *object) diff --git a/c_glib/arrow-glib/array.h b/c_glib/arrow-glib/array.h index f8c6734a88308..9bb502e4044a9 100644 --- a/c_glib/arrow-glib/array.h +++ b/c_glib/arrow-glib/array.h @@ -115,6 +115,51 @@ GType garrow_null_array_get_type(void) G_GNUC_CONST; GArrowNullArray *garrow_null_array_new(gint64 length); +#define GARROW_TYPE_PRIMITIVE_ARRAY \ + (garrow_primitive_array_get_type()) +#define GARROW_PRIMITIVE_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_PRIMITIVE_ARRAY, \ + GArrowPrimitiveArray)) +#define GARROW_PRIMITIVE_ARRAY_CLASS(klass) \ + 
(G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_PRIMITIVE_ARRAY, \ + GArrowPrimitiveArrayClass)) +#define GARROW_IS_PRIMITIVE_ARRAY(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_PRIMITIVE_ARRAY)) +#define GARROW_IS_PRIMITIVE_ARRAY_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_PRIMITIVE_ARRAY)) +#define GARROW_PRIMITIVE_ARRAY_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_PRIMITIVE_ARRAY, \ + GArrowPrimitiveArrayClass)) + +typedef struct _GArrowPrimitiveArray GArrowPrimitiveArray; +typedef struct _GArrowPrimitiveArrayClass GArrowPrimitiveArrayClass; + +/** + * GArrowPrimitiveArray: + * + * It wraps `arrow::PrimitiveArray`. + */ +struct _GArrowPrimitiveArray +{ + /*< private >*/ + GArrowArray parent_instance; +}; + +struct _GArrowPrimitiveArrayClass +{ + GArrowArrayClass parent_class; +}; + +GType garrow_primitive_array_get_type(void) G_GNUC_CONST; + +GArrowBuffer *garrow_primitive_array_get_buffer(GArrowPrimitiveArray *array); + + #define GARROW_TYPE_BOOLEAN_ARRAY \ (garrow_boolean_array_get_type()) #define GARROW_BOOLEAN_ARRAY(obj) \ @@ -147,12 +192,12 @@ typedef struct _GArrowBooleanArrayClass GArrowBooleanArrayClass; struct _GArrowBooleanArray { /*< private >*/ - GArrowArray parent_instance; + GArrowPrimitiveArray parent_instance; }; struct _GArrowBooleanArrayClass { - GArrowArrayClass parent_class; + GArrowPrimitiveArrayClass parent_class; }; GType garrow_boolean_array_get_type (void) G_GNUC_CONST; @@ -192,12 +237,12 @@ typedef struct _GArrowInt8ArrayClass GArrowInt8ArrayClass; struct _GArrowInt8Array { /*< private >*/ - GArrowArray parent_instance; + GArrowPrimitiveArray parent_instance; }; struct _GArrowInt8ArrayClass { - GArrowArrayClass parent_class; + GArrowPrimitiveArrayClass parent_class; }; GType garrow_int8_array_get_type(void) G_GNUC_CONST; @@ -238,12 +283,12 @@ typedef struct _GArrowUInt8ArrayClass GArrowUInt8ArrayClass; struct _GArrowUInt8Array { /*< private >*/ - GArrowArray parent_instance; + GArrowPrimitiveArray parent_instance; }; struct _GArrowUInt8ArrayClass { - GArrowArrayClass parent_class; + GArrowPrimitiveArrayClass parent_class; }; GType garrow_uint8_array_get_type(void) G_GNUC_CONST; @@ -284,12 +329,12 @@ typedef struct _GArrowInt16ArrayClass GArrowInt16ArrayClass; struct _GArrowInt16Array { /*< private >*/ - GArrowArray parent_instance; + GArrowPrimitiveArray parent_instance; }; struct _GArrowInt16ArrayClass { - GArrowArrayClass parent_class; + GArrowPrimitiveArrayClass parent_class; }; GType garrow_int16_array_get_type(void) G_GNUC_CONST; @@ -330,12 +375,12 @@ typedef struct _GArrowUInt16ArrayClass GArrowUInt16ArrayClass; struct _GArrowUInt16Array { /*< private >*/ - GArrowArray parent_instance; + GArrowPrimitiveArray parent_instance; }; struct _GArrowUInt16ArrayClass { - GArrowArrayClass parent_class; + GArrowPrimitiveArrayClass parent_class; }; GType garrow_uint16_array_get_type(void) G_GNUC_CONST; @@ -376,12 +421,12 @@ typedef struct _GArrowInt32ArrayClass GArrowInt32ArrayClass; struct _GArrowInt32Array { /*< private >*/ - GArrowArray parent_instance; + GArrowPrimitiveArray parent_instance; }; struct _GArrowInt32ArrayClass { - GArrowArrayClass parent_class; + GArrowPrimitiveArrayClass parent_class; }; GType garrow_int32_array_get_type(void) G_GNUC_CONST; @@ -422,12 +467,12 @@ typedef struct _GArrowUInt32ArrayClass GArrowUInt32ArrayClass; struct _GArrowUInt32Array { /*< private >*/ - GArrowArray parent_instance; + GArrowPrimitiveArray parent_instance; }; struct _GArrowUInt32ArrayClass { - 
GArrowArrayClass parent_class; + GArrowPrimitiveArrayClass parent_class; }; GType garrow_uint32_array_get_type(void) G_GNUC_CONST; @@ -468,12 +513,12 @@ typedef struct _GArrowInt64ArrayClass GArrowInt64ArrayClass; struct _GArrowInt64Array { /*< private >*/ - GArrowArray parent_instance; + GArrowPrimitiveArray parent_instance; }; struct _GArrowInt64ArrayClass { - GArrowArrayClass parent_class; + GArrowPrimitiveArrayClass parent_class; }; GType garrow_int64_array_get_type(void) G_GNUC_CONST; @@ -514,12 +559,12 @@ typedef struct _GArrowUInt64ArrayClass GArrowUInt64ArrayClass; struct _GArrowUInt64Array { /*< private >*/ - GArrowArray parent_instance; + GArrowPrimitiveArray parent_instance; }; struct _GArrowUInt64ArrayClass { - GArrowArrayClass parent_class; + GArrowPrimitiveArrayClass parent_class; }; GType garrow_uint64_array_get_type(void) G_GNUC_CONST; @@ -560,12 +605,12 @@ typedef struct _GArrowFloatArrayClass GArrowFloatArrayClass; struct _GArrowFloatArray { /*< private >*/ - GArrowArray parent_instance; + GArrowPrimitiveArray parent_instance; }; struct _GArrowFloatArrayClass { - GArrowArrayClass parent_class; + GArrowPrimitiveArrayClass parent_class; }; GType garrow_float_array_get_type(void) G_GNUC_CONST; @@ -606,12 +651,12 @@ typedef struct _GArrowDoubleArrayClass GArrowDoubleArrayClass; struct _GArrowDoubleArray { /*< private >*/ - GArrowArray parent_instance; + GArrowPrimitiveArray parent_instance; }; struct _GArrowDoubleArrayClass { - GArrowArrayClass parent_class; + GArrowPrimitiveArrayClass parent_class; }; GType garrow_double_array_get_type(void) G_GNUC_CONST; diff --git a/c_glib/test/test-boolean-array.rb b/c_glib/test/test-boolean-array.rb index 9cc3c94d554bf..15df1ed95b274 100644 --- a/c_glib/test/test-boolean-array.rb +++ b/c_glib/test/test-boolean-array.rb @@ -16,6 +16,15 @@ # under the License. class TestBooleanArray < Test::Unit::TestCase + def test_buffer + builder = Arrow::BooleanArrayBuilder.new + builder.append(true) + builder.append(false) + builder.append(true) + array = builder.finish + assert_equal([0b101].pack("C*"), array.buffer.data.to_s) + end + def test_value builder = Arrow::BooleanArrayBuilder.new builder.append(true) diff --git a/c_glib/test/test-double-array.rb b/c_glib/test/test-double-array.rb index f9c000d23f173..c644ac6cc0c07 100644 --- a/c_glib/test/test-double-array.rb +++ b/c_glib/test/test-double-array.rb @@ -16,6 +16,15 @@ # under the License. class TestDoubleArray < Test::Unit::TestCase + def test_buffer + builder = Arrow::DoubleArrayBuilder.new + builder.append(-1.1) + builder.append(2.2) + builder.append(-4.4) + array = builder.finish + assert_equal([-1.1, 2.2, -4.4].pack("d*"), array.buffer.data.to_s) + end + def test_value builder = Arrow::DoubleArrayBuilder.new builder.append(1.5) diff --git a/c_glib/test/test-float-array.rb b/c_glib/test/test-float-array.rb index 020c705aad241..84876f9754da7 100644 --- a/c_glib/test/test-float-array.rb +++ b/c_glib/test/test-float-array.rb @@ -16,6 +16,15 @@ # under the License. 
class TestFloatArray < Test::Unit::TestCase + def test_buffer + builder = Arrow::FloatArrayBuilder.new + builder.append(-1.1) + builder.append(2.2) + builder.append(-4.4) + array = builder.finish + assert_equal([-1.1, 2.2, -4.4].pack("f*"), array.buffer.data.to_s) + end + def test_value builder = Arrow::FloatArrayBuilder.new builder.append(1.5) diff --git a/c_glib/test/test-int16-array.rb b/c_glib/test/test-int16-array.rb index 2aa5b0c054563..4b30ddd99ff9b 100644 --- a/c_glib/test/test-int16-array.rb +++ b/c_glib/test/test-int16-array.rb @@ -16,6 +16,15 @@ # under the License. class TestInt16Array < Test::Unit::TestCase + def test_buffer + builder = Arrow::Int16ArrayBuilder.new + builder.append(-1) + builder.append(2) + builder.append(-4) + array = builder.finish + assert_equal([-1, 2, -4].pack("s*"), array.buffer.data.to_s) + end + def test_value builder = Arrow::Int16ArrayBuilder.new builder.append(-1) diff --git a/c_glib/test/test-int32-array.rb b/c_glib/test/test-int32-array.rb index 9dd6b3afc8676..90cf0224c1c30 100644 --- a/c_glib/test/test-int32-array.rb +++ b/c_glib/test/test-int32-array.rb @@ -16,6 +16,15 @@ # under the License. class TestInt32Array < Test::Unit::TestCase + def test_buffer + builder = Arrow::Int32ArrayBuilder.new + builder.append(-1) + builder.append(2) + builder.append(-4) + array = builder.finish + assert_equal([-1, 2, -4].pack("l*"), array.buffer.data.to_s) + end + def test_value builder = Arrow::Int32ArrayBuilder.new builder.append(-1) diff --git a/c_glib/test/test-int64-array.rb b/c_glib/test/test-int64-array.rb index 612a8b4f69276..d3022017bb0ee 100644 --- a/c_glib/test/test-int64-array.rb +++ b/c_glib/test/test-int64-array.rb @@ -16,6 +16,15 @@ # under the License. class TestInt64Array < Test::Unit::TestCase + def test_buffer + builder = Arrow::Int64ArrayBuilder.new + builder.append(-1) + builder.append(2) + builder.append(-4) + array = builder.finish + assert_equal([-1, 2, -4].pack("q*"), array.buffer.data.to_s) + end + def test_value builder = Arrow::Int64ArrayBuilder.new builder.append(-1) diff --git a/c_glib/test/test-int8-array.rb b/c_glib/test/test-int8-array.rb index ab009964ab16f..9f28fa7fcd3a3 100644 --- a/c_glib/test/test-int8-array.rb +++ b/c_glib/test/test-int8-array.rb @@ -16,6 +16,15 @@ # under the License. class TestInt8Array < Test::Unit::TestCase + def test_buffer + builder = Arrow::Int8ArrayBuilder.new + builder.append(-1) + builder.append(2) + builder.append(-4) + array = builder.finish + assert_equal([-1, 2, -4].pack("c*"), array.buffer.data.to_s) + end + def test_value builder = Arrow::Int8ArrayBuilder.new builder.append(-1) diff --git a/c_glib/test/test-uint16-array.rb b/c_glib/test/test-uint16-array.rb index ad85f09326bd3..82e898e733625 100644 --- a/c_glib/test/test-uint16-array.rb +++ b/c_glib/test/test-uint16-array.rb @@ -16,6 +16,15 @@ # under the License. class TestUInt16Array < Test::Unit::TestCase + def test_buffer + builder = Arrow::UInt16ArrayBuilder.new + builder.append(1) + builder.append(2) + builder.append(4) + array = builder.finish + assert_equal([1, 2, 4].pack("S*"), array.buffer.data.to_s) + end + def test_value builder = Arrow::UInt16ArrayBuilder.new builder.append(1) diff --git a/c_glib/test/test-uint32-array.rb b/c_glib/test/test-uint32-array.rb index 59e19f3ed796f..c8be06fead5b9 100644 --- a/c_glib/test/test-uint32-array.rb +++ b/c_glib/test/test-uint32-array.rb @@ -16,6 +16,15 @@ # under the License. 
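Each of these Ruby tests reads raw little-endian values straight out of the new accessor. From C the same round trip looks roughly like this; as before, garrow_buffer_get_data() returning a GBytes is an assumption:

    /* Sketch: dump an int32 array through garrow_primitive_array_get_buffer(). */
    void
    dump_int32_values(GArrowInt32Array *array)
    {
      GArrowBuffer *buffer =
        garrow_primitive_array_get_buffer(GARROW_PRIMITIVE_ARRAY(array));
      gsize size;
      GBytes *data = garrow_buffer_get_data(buffer);  /* assumed accessor */
      const gint32 *values = (const gint32 *)g_bytes_get_data(data, &size);
      for (gsize i = 0; i < size / sizeof(gint32); i++) {
        g_print("%d\n", values[i]);
      }
      g_bytes_unref(data);
      g_object_unref(buffer);
    }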
class TestUInt32Array < Test::Unit::TestCase + def test_buffer + builder = Arrow::UInt32ArrayBuilder.new + builder.append(1) + builder.append(2) + builder.append(4) + array = builder.finish + assert_equal([1, 2, 4].pack("L*"), array.buffer.data.to_s) + end + def test_value builder = Arrow::UInt32ArrayBuilder.new builder.append(1) diff --git a/c_glib/test/test-uint64-array.rb b/c_glib/test/test-uint64-array.rb index e0195c1d49817..03082f33014ce 100644 --- a/c_glib/test/test-uint64-array.rb +++ b/c_glib/test/test-uint64-array.rb @@ -16,6 +16,15 @@ # under the License. class TestUInt64Array < Test::Unit::TestCase + def test_buffer + builder = Arrow::UInt64ArrayBuilder.new + builder.append(1) + builder.append(2) + builder.append(4) + array = builder.finish + assert_equal([1, 2, 4].pack("Q*"), array.buffer.data.to_s) + end + def test_value builder = Arrow::UInt64ArrayBuilder.new builder.append(1) diff --git a/c_glib/test/test-uint8-array.rb b/c_glib/test/test-uint8-array.rb index 02f3470774c10..d7464e336da79 100644 --- a/c_glib/test/test-uint8-array.rb +++ b/c_glib/test/test-uint8-array.rb @@ -16,6 +16,15 @@ # under the License. class TestUInt8Array < Test::Unit::TestCase + def test_buffer + builder = Arrow::UInt8ArrayBuilder.new + builder.append(1) + builder.append(2) + builder.append(4) + array = builder.finish + assert_equal([1, 2, 4].pack("C*"), array.buffer.data.to_s) + end + def test_value builder = Arrow::UInt8ArrayBuilder.new builder.append(1) From de54eff19af024c1ca0e82f4b45c6021443a635b Mon Sep 17 00:00:00 2001 From: Philipp Moritz Date: Mon, 24 Apr 2017 08:30:08 -0400 Subject: [PATCH 0558/1644] ARROW-659: [C++] Add multithreaded memcpy implementation parallelize memcopy operations for large objects with a multi-threaded implementation. Author: Philipp Moritz Closes #580 from atumanov/parallel-memcpy and squashes the following commits: 6ea9873 [Philipp Moritz] fix windows build (?) 
66dfa74 [Philipp Moritz] linting 9dd6f3f [Philipp Moritz] cleanup e81bad9 [Philipp Moritz] add license header 0beb870 [Philipp Moritz] add pthread library 1d73612 [Philipp Moritz] add test of parallel memcopy 1a27431 [Philipp Moritz] restructure code 70d767c [Philipp Moritz] add benchmarks b320b47 [Philipp Moritz] make memcopy generic f99606a [Philipp Moritz] add parallel memcpy, contributed by Alexey Tumanov --- cpp/CMakeLists.txt | 3 +- cpp/src/arrow/io/CMakeLists.txt | 2 + cpp/src/arrow/io/io-memory-benchmark.cc | 75 +++++++++++++++++++++++++ cpp/src/arrow/io/io-memory-test.cc | 22 ++++++++ cpp/src/arrow/io/memory.cc | 31 +++++++++- cpp/src/arrow/io/memory.h | 8 +++ cpp/src/arrow/ipc/CMakeLists.txt | 1 + cpp/src/arrow/util/memory.h | 69 +++++++++++++++++++++++ 8 files changed, 207 insertions(+), 4 deletions(-) create mode 100644 cpp/src/arrow/io/io-memory-benchmark.cc create mode 100644 cpp/src/arrow/util/memory.h diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 978f70a361756..2d8c00fd80803 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -902,7 +902,8 @@ set(ARROW_STATIC_PRIVATE_LINK_LIBS if (NOT MSVC) set(ARROW_LINK_LIBS ${ARROW_LINK_LIBS} - ${CMAKE_DL_LIBS}) + ${CMAKE_DL_LIBS} + pthread) endif() if(RAPIDJSON_VENDORED) diff --git a/cpp/src/arrow/io/CMakeLists.txt b/cpp/src/arrow/io/CMakeLists.txt index c0199d7ef2599..cd489746b48ea 100644 --- a/cpp/src/arrow/io/CMakeLists.txt +++ b/cpp/src/arrow/io/CMakeLists.txt @@ -22,6 +22,8 @@ ADD_ARROW_TEST(io-file-test) ADD_ARROW_TEST(io-hdfs-test) ADD_ARROW_TEST(io-memory-test) +ADD_ARROW_BENCHMARK(io-memory-benchmark) + # Headers: top level install(FILES file.h diff --git a/cpp/src/arrow/io/io-memory-benchmark.cc b/cpp/src/arrow/io/io-memory-benchmark.cc new file mode 100644 index 0000000000000..59b511a6cf8fe --- /dev/null +++ b/cpp/src/arrow/io/io-memory-benchmark.cc @@ -0,0 +1,75 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
+ +#include "arrow/api.h" +#include "arrow/io/memory.h" +#include "arrow/test-util.h" + +#include "benchmark/benchmark.h" + +#include + +namespace arrow { + +static void BM_SerialMemcopy(benchmark::State& state) { // NOLINT non-const reference + constexpr int64_t kTotalSize = 100 * 1024 * 1024; // 100MB + + auto buffer1 = std::make_shared(default_memory_pool()); + buffer1->Resize(kTotalSize); + + auto buffer2 = std::make_shared(default_memory_pool()); + buffer2->Resize(kTotalSize); + test::random_bytes(kTotalSize, 0, buffer2->mutable_data()); + + while (state.KeepRunning()) { + io::FixedSizeBufferWriter writer(buffer1); + writer.Write(buffer2->data(), buffer2->size()); + } + state.SetBytesProcessed(int64_t(state.iterations()) * kTotalSize); +} + +static void BM_ParallelMemcopy(benchmark::State& state) { // NOLINT non-const reference + constexpr int64_t kTotalSize = 100 * 1024 * 1024; // 100MB + + auto buffer1 = std::make_shared(default_memory_pool()); + buffer1->Resize(kTotalSize); + + auto buffer2 = std::make_shared(default_memory_pool()); + buffer2->Resize(kTotalSize); + test::random_bytes(kTotalSize, 0, buffer2->mutable_data()); + + while (state.KeepRunning()) { + io::FixedSizeBufferWriter writer(buffer1); + writer.set_memcopy_threads(4); + writer.Write(buffer2->data(), buffer2->size()); + } + state.SetBytesProcessed(int64_t(state.iterations()) * kTotalSize); +} + +BENCHMARK(BM_SerialMemcopy) + ->RangeMultiplier(4) + ->Range(1, 1 << 13) + ->MinTime(1.0) + ->UseRealTime(); + +BENCHMARK(BM_ParallelMemcopy) + ->RangeMultiplier(4) + ->Range(1, 1 << 13) + ->MinTime(1.0) + ->UseRealTime(); + +} // namespace arrow diff --git a/cpp/src/arrow/io/io-memory-test.cc b/cpp/src/arrow/io/io-memory-test.cc index 4704fe8f4d391..33249cb27f200 100644 --- a/cpp/src/arrow/io/io-memory-test.cc +++ b/cpp/src/arrow/io/io-memory-test.cc @@ -17,6 +17,7 @@ #include #include +#include #include #include #include @@ -114,5 +115,26 @@ TEST(TestBufferReader, RetainParentReference) { ASSERT_EQ(0, std::memcmp(slice2->data(), data.c_str() + 4, 6)); } +TEST(TestMemcopy, ParallelMemcopy) { + for (int i = 0; i < 5; ++i) { + // randomize size so the memcopy alignment is tested + int64_t total_size = 3 * 1024 * 1024 + std::rand() % 100; + + auto buffer1 = std::make_shared(default_memory_pool()); + buffer1->Resize(total_size); + + auto buffer2 = std::make_shared(default_memory_pool()); + buffer2->Resize(total_size); + test::random_bytes(total_size, 0, buffer2->mutable_data()); + + io::FixedSizeBufferWriter writer(buffer1); + writer.set_memcopy_threads(4); + writer.set_memcopy_threshold(1024 * 1024); + writer.Write(buffer2->data(), buffer2->size()); + + ASSERT_EQ(0, memcmp(buffer1->data(), buffer2->data(), buffer1->size())); + } +} + } // namespace io } // namespace arrow diff --git a/cpp/src/arrow/io/memory.cc b/cpp/src/arrow/io/memory.cc index 2e701e1104d1c..95c6206f0fab0 100644 --- a/cpp/src/arrow/io/memory.cc +++ b/cpp/src/arrow/io/memory.cc @@ -29,6 +29,7 @@ #include "arrow/io/interfaces.h" #include "arrow/status.h" #include "arrow/util/logging.h" +#include "arrow/util/memory.h" namespace arrow { namespace io { @@ -80,7 +81,7 @@ Status BufferOutputStream::Tell(int64_t* position) { Status BufferOutputStream::Write(const uint8_t* data, int64_t nbytes) { DCHECK(buffer_); RETURN_NOT_OK(Reserve(nbytes)); - std::memcpy(mutable_data_ + position_, data, nbytes); + memcpy(mutable_data_ + position_, data, nbytes); position_ += nbytes; return Status::OK(); } @@ -101,8 +102,15 @@ Status BufferOutputStream::Reserve(int64_t nbytes) { // 
---------------------------------------------------------------------- // In-memory buffer writer +static constexpr int kMemcopyDefaultNumThreads = 1; +static constexpr int64_t kMemcopyDefaultBlocksize = 64; +static constexpr int64_t kMemcopyDefaultThreshold = 1024 * 1024; + /// Input buffer must be mutable, will abort if not -FixedSizeBufferWriter::FixedSizeBufferWriter(const std::shared_ptr& buffer) { +FixedSizeBufferWriter::FixedSizeBufferWriter(const std::shared_ptr& buffer) + : memcopy_num_threads_(kMemcopyDefaultNumThreads), + memcopy_blocksize_(kMemcopyDefaultBlocksize), + memcopy_threshold_(kMemcopyDefaultThreshold) { buffer_ = buffer; DCHECK(buffer->is_mutable()) << "Must pass mutable buffer"; mutable_data_ = buffer->mutable_data(); @@ -131,7 +139,12 @@ Status FixedSizeBufferWriter::Tell(int64_t* position) { } Status FixedSizeBufferWriter::Write(const uint8_t* data, int64_t nbytes) { - std::memcpy(mutable_data_ + position_, data, nbytes); + if (nbytes > memcopy_threshold_ && memcopy_num_threads_ > 1) { + parallel_memcopy(mutable_data_ + position_, data, nbytes, + memcopy_blocksize_, memcopy_num_threads_); + } else { + memcpy(mutable_data_ + position_, data, nbytes); + } position_ += nbytes; return Status::OK(); } @@ -143,6 +156,18 @@ Status FixedSizeBufferWriter::WriteAt( return Write(data, nbytes); } +void FixedSizeBufferWriter::set_memcopy_threads(int num_threads) { + memcopy_num_threads_ = num_threads; +} + +void FixedSizeBufferWriter::set_memcopy_blocksize(int64_t blocksize) { + memcopy_blocksize_ = blocksize; +} + +void FixedSizeBufferWriter::set_memcopy_threshold(int64_t threshold) { + memcopy_threshold_ = threshold; +} + // ---------------------------------------------------------------------- // In-memory buffer reader diff --git a/cpp/src/arrow/io/memory.h b/cpp/src/arrow/io/memory.h index fbb186b728022..f1b59905d8a3a 100644 --- a/cpp/src/arrow/io/memory.h +++ b/cpp/src/arrow/io/memory.h @@ -81,12 +81,20 @@ class ARROW_EXPORT FixedSizeBufferWriter : public WriteableFile { Status Write(const uint8_t* data, int64_t nbytes) override; Status WriteAt(int64_t position, const uint8_t* data, int64_t nbytes) override; + void set_memcopy_threads(int num_threads); + void set_memcopy_blocksize(int64_t blocksize); + void set_memcopy_threshold(int64_t threshold); + private: std::mutex lock_; std::shared_ptr buffer_; uint8_t* mutable_data_; int64_t size_; int64_t position_; + + int memcopy_num_threads_; + int64_t memcopy_blocksize_; + int64_t memcopy_threshold_; }; class ARROW_EXPORT BufferReader : public RandomAccessFile { diff --git a/cpp/src/arrow/ipc/CMakeLists.txt b/cpp/src/arrow/ipc/CMakeLists.txt index fc1d53e18a3dc..41ab5d7a1f39a 100644 --- a/cpp/src/arrow/ipc/CMakeLists.txt +++ b/cpp/src/arrow/ipc/CMakeLists.txt @@ -95,6 +95,7 @@ if(MSVC) else() set(UTIL_LINK_LIBS arrow_static + pthread ${BOOST_FILESYSTEM_LIBRARY} ${BOOST_SYSTEM_LIBRARY} dl) diff --git a/cpp/src/arrow/util/memory.h b/cpp/src/arrow/util/memory.h new file mode 100644 index 0000000000000..7feeb291ef4a0 --- /dev/null +++ b/cpp/src/arrow/util/memory.h @@ -0,0 +1,69 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +#ifndef ARROW_UTIL_MEMORY_H +#define ARROW_UTIL_MEMORY_H + +#include +#include + +namespace arrow { + +uint8_t* pointer_logical_and(const uint8_t* address, uintptr_t bits) { + uintptr_t value = reinterpret_cast(address); + return reinterpret_cast(value & bits); +} + +// A helper function for doing memcpy with multiple threads. This is required +// to saturate the memory bandwidth of modern cpus. +void parallel_memcopy(uint8_t* dst, const uint8_t* src, int64_t nbytes, + uintptr_t block_size, int num_threads) { + std::vector threadpool(num_threads); + uint8_t* left = pointer_logical_and(src + block_size - 1, ~(block_size - 1)); + uint8_t* right = pointer_logical_and(src + nbytes, ~(block_size - 1)); + int64_t num_blocks = (right - left) / block_size; + + // Update right address + right = right - (num_blocks % num_threads) * block_size; + + // Now we divide these blocks between available threads. The remainder is + // handled on the main thread. + int64_t chunk_size = (right - left) / num_threads; + int64_t prefix = left - src; + int64_t suffix = src + nbytes - right; + // Now the data layout is | prefix | k * num_threads * block_size | suffix |. + // We have chunk_size = k * block_size, therefore the data layout is + // | prefix | num_threads * chunk_size | suffix |. + // Each thread gets a "chunk" of k blocks. + + // Start all threads first and handle leftovers while threads run. + for (int i = 0; i < num_threads; i++) { + threadpool[i] = std::thread(memcpy, dst + prefix + i * chunk_size, + left + i * chunk_size, chunk_size); + } + + memcpy(dst, src, prefix); + memcpy(dst + prefix + num_threads * chunk_size, right, suffix); + + for (auto& t : threadpool) { + if (t.joinable()) { t.join(); } + } +} + +} // namespace arrow + +#endif // ARROW_UTIL_MEMORY_H From 76d56d3aa9607976b162f6d924a23c12c8800236 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 24 Apr 2017 15:57:31 -0400 Subject: [PATCH 0559/1644] ARROW-95: Add Jekyll-based website publishing toolchain, migrate existing arrow-site This also renders the format Markdown documents on the website. Used the Apache Calcite website for guidance about best practices with Jekyll. See rendered website at https://wesm.github.io/arrow-site-test/ Author: Wes McKinney Closes #589 from wesm/ARROW-95 and squashes the following commits: a6b65cb [Wes McKinney] Fix some incomplete instructions 2806d26 [Wes McKinney] Exclude flatbuffers from C++ API docs. Add C++ docs link to site 512ea71 [Wes McKinney] Migrate website to Jekyll with bootstrap-sass. Add navbar. Render specification Markdown documents with website. 
Instructions for publishing Java and Python docs --- cpp/apidoc/Doxyfile | 3 +- dev/release/run-rat.sh | 4 +- java/pom.xml | 25 ++++ site/.gitignore | 6 + site/Gemfile | 25 ++++ site/README.md | 85 +++++++++++++ site/_config.yml | 43 +++++++ site/_docs/.gitignore | 1 + site/_docs/ipc.md | 25 ++++ site/_docs/memory_layout.md | 25 ++++ site/_docs/metadata.md | 25 ++++ site/_includes/footer.html | 4 + site/_includes/header.html | 53 ++++++++ site/_includes/top.html | 20 +++ site/_layouts/default.html | 12 ++ site/_layouts/docs.html | 14 +++ site/_sass/_font-awesome.scss | 25 ++++ site/css/main.scss | 10 ++ site/img/asf_logo.svg | 210 +++++++++++++++++++++++++++++++ site/img/copy.png | Bin 0 -> 23204 bytes site/img/copy2.png | Bin 0 -> 37973 bytes site/img/shared.png | Bin 0 -> 37973 bytes site/img/shared2.png | Bin 0 -> 23204 bytes site/img/simd.png | Bin 0 -> 101031 bytes site/index.html | 171 +++++++++++++++++++++++++ site/scripts/sync_format_docs.sh | 23 ++++ 26 files changed, 807 insertions(+), 2 deletions(-) create mode 100644 site/.gitignore create mode 100644 site/Gemfile create mode 100644 site/README.md create mode 100644 site/_config.yml create mode 100644 site/_docs/.gitignore create mode 100644 site/_docs/ipc.md create mode 100644 site/_docs/memory_layout.md create mode 100644 site/_docs/metadata.md create mode 100644 site/_includes/footer.html create mode 100644 site/_includes/header.html create mode 100644 site/_includes/top.html create mode 100644 site/_layouts/default.html create mode 100644 site/_layouts/docs.html create mode 100644 site/_sass/_font-awesome.scss create mode 100644 site/css/main.scss create mode 100644 site/img/asf_logo.svg create mode 100644 site/img/copy.png create mode 100644 site/img/copy2.png create mode 100644 site/img/shared.png create mode 100644 site/img/shared2.png create mode 100644 site/img/simd.png create mode 100644 site/index.html create mode 100755 site/scripts/sync_format_docs.sh diff --git a/cpp/apidoc/Doxyfile b/cpp/apidoc/Doxyfile index 51f5543b2de1b..3127662413328 100644 --- a/cpp/apidoc/Doxyfile +++ b/cpp/apidoc/Doxyfile @@ -891,7 +891,7 @@ RECURSIVE = YES # Note that relative paths are relative to the directory from which doxygen is # run. -EXCLUDE = +EXCLUDE = # The EXCLUDE_SYMLINKS tag can be used to select whether or not files or # directories that are symbolic links (a Unix file system feature) are excluded @@ -908,6 +908,7 @@ EXCLUDE_SYMLINKS = NO # exclude all test directories for example use the pattern */test/* EXCLUDE_PATTERNS = *-test.cc \ + *_generated.h \ *-benchmark.cc # The EXCLUDE_SYMBOLS tag can be used to specify one or more symbol names diff --git a/dev/release/run-rat.sh b/dev/release/run-rat.sh index a3c12a0ce8a92..9c34e073e628e 100755 --- a/dev/release/run-rat.sh +++ b/dev/release/run-rat.sh @@ -58,13 +58,15 @@ $RAT $1 \ -e "*.html" \ -e "*.css" \ -e "*.png" \ + -e "*.svg" \ -e "*.devhelp2" \ + -e "*.scss" \ > rat.txt cat rat.txt UNAPPROVED=`cat rat.txt | grep "Unknown Licenses" | head -n 1 | cut -d " " -f 1` if [ "0" -eq "${UNAPPROVED}" ]; then - echo "No unnaproved licenses" + echo "No unapproved licenses" else echo "${UNAPPROVED} unapproved licences. 
Check rat report: rat.txt" exit 1 diff --git a/java/pom.xml b/java/pom.xml index 5d07186e3e714..e586005e395c0 100644 --- a/java/pom.xml +++ b/java/pom.xml @@ -532,6 +532,31 @@ + + + + org.apache.maven.plugins + maven-javadoc-plugin + 2.9 + + + + javadoc + test-javadoc + + + + aggregate + false + + aggregate + + + + + + + format memory diff --git a/site/.gitignore b/site/.gitignore new file mode 100644 index 0000000000000..46bc466d3028e --- /dev/null +++ b/site/.gitignore @@ -0,0 +1,6 @@ +_site +.sass-cache +.jekyll-metadata +Gemfile.lock +asf-site +build/ diff --git a/site/Gemfile b/site/Gemfile new file mode 100644 index 0000000000000..98decaf35dbe6 --- /dev/null +++ b/site/Gemfile @@ -0,0 +1,25 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to you under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +source "https://rubygems.org" +ruby RUBY_VERSION +gem "jekyll", "3.4.3" +gem 'jekyll-bootstrap-sass' +gem 'github-pages' +group :jekyll_plugins do + gem "jekyll-feed", "~> 0.6" +end +gem 'tzinfo-data', platforms: [:mingw, :mswin, :x64_mingw, :jruby] diff --git a/site/README.md b/site/README.md new file mode 100644 index 0000000000000..3f8da2252f965 --- /dev/null +++ b/site/README.md @@ -0,0 +1,85 @@ + + +## Apache Arrow Website + +### Development instructions + +If you are planning to publish the website, you must first clone the arrow-site +git repository: + +```shell +git clone --branch=asf-site https://git-wip-us.apache.org/repos/asf/arrow-site.git asf-site +``` + +Now, with Ruby >= 2.1 installed, run: + +```shell +gem install jekyll bundler +bundle install + +# This imports the format Markdown documents so they will be rendered +scripts/sync_format_docs.sh + +bundle exec jekyll serve +``` + +### Publishing + +After following the above instructions the base `site/` directory, run: + +```shell +bundle exec jekyll build +rsync -r build/ asf-site/ +cd asf-site +git status +``` + +Now `git add` any new files, then commit everything, and push: + +``` +git push +``` + +### Updating Code Documentation + +#### Java + +``` +cd ../java +mvn install +mvn site +rsync -r target/site/apidocs/ ../site/asf-site/docs/java/ +``` + +#### C++ + +``` +cd ../cpp/apidoc +doxygen Doxyfile +rsync -r html/ ../../site/asf-site/docs/cpp +``` + +#### Python + +First, build PyArrow with all optional extensions (Apache Parquet, jemalloc). + +``` +cd ../python +python setup.py build_ext --inplace --with-parquet --with-jemalloc +python setup.py build_sphinx -s doc/source +rsync -r doc/_build/html/ ../site/asf-site/docs/python/ +``` + +Then add/commit/push from the site/asf-site git checkout. 
\ No newline at end of file diff --git a/site/_config.yml b/site/_config.yml new file mode 100644 index 0000000000000..922af4a08059c --- /dev/null +++ b/site/_config.yml @@ -0,0 +1,43 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to you under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +markdown: kramdown +repository: https://github.com/apache/arrow +destination: build + +exclude: + - Gemfile + - Gemfile.lock + - _docs/format/* + - asf-site + - scripts + - README.md + +collections: + docs: + output: true + +sass: + style: compressed + +# The base path where the website is deployed +baseurl: + +gems: + - jekyll-feed + - jekyll-bootstrap-sass + +bootstrap: + assets: true diff --git a/site/_docs/.gitignore b/site/_docs/.gitignore new file mode 100644 index 0000000000000..1e942fc5eadfb --- /dev/null +++ b/site/_docs/.gitignore @@ -0,0 +1 @@ +format/ \ No newline at end of file diff --git a/site/_docs/ipc.md b/site/_docs/ipc.md new file mode 100644 index 0000000000000..bc22dc3bfa7b7 --- /dev/null +++ b/site/_docs/ipc.md @@ -0,0 +1,25 @@ +--- +layout: docs +title: Arrow Messaging and IPC +permalink: /docs/ipc.html +--- + + +{% include_relative format/IPC.md %} \ No newline at end of file diff --git a/site/_docs/memory_layout.md b/site/_docs/memory_layout.md new file mode 100644 index 0000000000000..74cd7ed7f7de0 --- /dev/null +++ b/site/_docs/memory_layout.md @@ -0,0 +1,25 @@ +--- +layout: docs +title: Physical Memory Layout +permalink: /docs/memory_layout.html +--- + + +{% include_relative format/Layout.md %} \ No newline at end of file diff --git a/site/_docs/metadata.md b/site/_docs/metadata.md new file mode 100644 index 0000000000000..382ab0eaaf61b --- /dev/null +++ b/site/_docs/metadata.md @@ -0,0 +1,25 @@ +--- +layout: docs +title: Arrow Metadata +permalink: /docs/metadata.html +--- + + +{% include_relative format/Metadata.md %} \ No newline at end of file diff --git a/site/_includes/footer.html b/site/_includes/footer.html new file mode 100644 index 0000000000000..c2a7d5e92bb20 --- /dev/null +++ b/site/_includes/footer.html @@ -0,0 +1,4 @@ +
+

Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache Arrow project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.

+

© 2017 Apache Software Foundation

+
diff --git a/site/_includes/header.html b/site/_includes/header.html new file mode 100644 index 0000000000000..5963c22abea0d --- /dev/null +++ b/site/_includes/header.html @@ -0,0 +1,53 @@ + diff --git a/site/_includes/top.html b/site/_includes/top.html new file mode 100644 index 0000000000000..cc537bac07ba3 --- /dev/null +++ b/site/_includes/top.html @@ -0,0 +1,20 @@ + + + + + {{ page.title }} + + + + + + + Apache Arrow Homepage + + + + + + diff --git a/site/_layouts/default.html b/site/_layouts/default.html new file mode 100644 index 0000000000000..d0ff799b97ab3 --- /dev/null +++ b/site/_layouts/default.html @@ -0,0 +1,12 @@ +{% include top.html %} + + +
+ {% include header.html %} + + {{ content }} + + {% include footer.html %} +
+ + diff --git a/site/_layouts/docs.html b/site/_layouts/docs.html new file mode 100644 index 0000000000000..2ef9cf485e47c --- /dev/null +++ b/site/_layouts/docs.html @@ -0,0 +1,14 @@ +{% include top.html %} + + +
+ {% include header.html %} + + {{ content }} + +
+ + {% include footer.html %} +
+ + diff --git a/site/_sass/_font-awesome.scss b/site/_sass/_font-awesome.scss new file mode 100644 index 0000000000000..d90676c2b9e59 --- /dev/null +++ b/site/_sass/_font-awesome.scss @@ -0,0 +1,25 @@ +/*! + * Font Awesome 4.2.0 by @davegandy - http://fontawesome.io - @fontawesome + * License - http://fontawesome.io/license (Font: SIL OFL 1.1, CSS: MIT License) + */ +@font-face { + font-family: 'FontAwesome'; + src: url('../fonts/fontawesome-webfont.eot?v=4.2.0'); + src: url('../fonts/fontawesome-webfont.eot?#iefix&v=4.2.0') format('embedded-opentype'), url('../fonts/fontawesome-webfont.woff?v=4.2.0') format('woff'), url('../fonts/fontawesome-webfont.ttf?v=4.2.0') format('truetype'), url('../fonts/fontawesome-webfont.svg?v=4.2.0#fontawesomeregular') format('svg'); + font-weight: normal; + font-style: normal; +} +.fa { + display: inline-block; + font: normal normal normal 14px/1 FontAwesome; + font-size: inherit; + text-rendering: auto; + -webkit-font-smoothing: antialiased; + -moz-osx-font-smoothing: grayscale; +} +.fa-link:before { + content: "\f0c1"; +} +.fa-pencil:before { + content: "\f040"; +} diff --git a/site/css/main.scss b/site/css/main.scss new file mode 100644 index 0000000000000..24b46ae24ccf2 --- /dev/null +++ b/site/css/main.scss @@ -0,0 +1,10 @@ +--- +--- + +$container-desktop: 960px; +$container-large-desktop: $container-desktop; +$grid-gutter-width: 15px; + +@import "bootstrap-sprockets"; +@import "bootstrap"; +@import "font-awesome"; diff --git a/site/img/asf_logo.svg b/site/img/asf_logo.svg new file mode 100644 index 0000000000000..620694c52418a --- /dev/null +++ b/site/img/asf_logo.svg @@ -0,0 +1,210 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/site/img/copy.png b/site/img/copy.png new file mode 100644 index 0000000000000000000000000000000000000000..a1e04999eb3fd3bd7c350a0850659740d4a8cd03 GIT binary patch literal 23204 zcmeFZWmuG5*9HvSV1Nq93@9N8(gGq~LrB-qLxXf9-6f(j2&f214&5*`(jg%rAYIa( zGebAug>gUc`yR*l@B98e9)~b#&b`jH4OLN=xs6YbkA;PGTTWI|4GRk= z9t#V*_{KHh$<4vGr@$|47d06PtfF4ZwX0_@WOZDyun0)5{$gXrC*A`Z3|nhxyJ{;b z3Yt3Fvp+X;G%;uQw0{A##=;Wz6a+rno4Y=z@wB&da1rzrq5aiD5cqucn1hz)R})uT z5n62}6`E&`&gL{c?A+{6Xd(DCG&I7_W)^~KlG1;~fo~$TR<5ou1UWc7JUrMvxY!+? 
zEjc&^1OzyqJmq-$lnrRX=Hlhx`rMPv!G-R3Cx82qGkgy zTH31@{rl(lI9;tR{_`dWmp{`2CdhI1FC3igPdNVV8xR$~dMfzL(f)_6}M-Co7Y(bW;qg0r=$oP(>mGa&5x{AzTNr+=jX-&_3OzM#(5=D@6fNpt>@ z{-4kOv=`>Mn)v@v#BVMCdJ3o+f-lVR?~+0Aix-|wVPT13$w`W9cw%o&T~E{)9`C?d z64ZO(Vq23+f>c2&rvIp#B0se5v=?!fNGcVSjNbD>jrnlA#ltHhfXkNji+elP?p`2j zzdt3I^yo^8o5Aezmw9X_`fMNG5we8VT0Fuf62rp&?~niL;J^Fezh>}XSMdM4C%lm^ zrhJb;DvszILs0z#J(A9>-P^m>cSI|-a{L*0#&7gx)7v4!(ce_=;&zdJW4kmrl=_cv zO7X_g;LcxZul0lFa&;?@Q+@QJop#cqqTkQYzB>VH7CQu9y5M#xfH~)8b zLF+eF`U{dNP!j)ko+Ac`;fd9~``ZtViU>AK{IJV{!S8;rItSkX#IWAR692nX@)*UW z%Df%w2>!L66dXVdjD>mnpZD+M=2&Uhh!Gq00tKHwW_N`-Qydp|8ANl zaYA+L=6-0%!FC@p>pM^*=#cUF#O0vHU}rN+6_3|c4u=qQ_%Z5vL6d;c%7p*aXq6I3 znzkm{LX_j}7%K@auZ8Y=BmUE+}k!$OxcP1b2g?QJIohYSAz4RTKll3`f!@18n_!^p`q5;rK8c{4yR?U-9D}PF_M>>SFek#d+Id)2G6~q2I5oIQ^GTGF@GfC<9~M2)s{7Za`bbt z&({*Dl#QMk9TK5V7$UNGV;5%{T(g{KK5W?#x_Vu4Of+mc7RwlYh@L%@z?)xlF=ta>r^9ZNO6!8?8y6^4c zRo3f@^$n~nl=BVMGzjjSXXD;{f$1iwm8%W@;BFGQ9A3)tAPQ zJ`q%3q#|mp2<+Br%j_o?wWs#KDug|qyG$jym(LmR~XZxh|$HkXwmV^!RM>5dJvDLMW$OUjOpawi}SDXUfNNNep zzsFxsdngH2Xr6&4_->%8(LQq9S)>1PjYZF%fY9+i&Tb#+O>Ep!b-+cq8+!B#D&QTc zdbkCxV})r++s&W7of5A3k(D$2gx4!RhT68keLEjvvc1&rEpOX`l^(}cnfSG}&K!p6KE2#=4%!-F z?r?|w6y%?I9l>_1-8v?g1DUDS1mi@TmswaJ{R7ON=ejmGgIKiu@#S++RBfVs`=I^ z@u+)-u*5e|Dk*)0(z4jnzV#^N>vtlxN$M$e2pAV*LqO!L+-`TPM!Veew<}aRSH3lA zna&auSLrwgp@>`VJ~k`DVvCRSu@~UN&7+@CnyUV!v8* zF`9ZjmKnyF=@q%P{{+2{kAcPMoTa?;H|TvM1sv13bEW|FZ-}Mo1Y&N~t=_r6p*qNp z81Tnzu=riczcKg!vu=p;CJX70e@1!pJJ82!Y&#RS7cO=!$+d@F_KR0)Q6|3&c6#@P zKscY0k%c|Nr`Jr|8oNQAMJXChb*#*c_)OZfi3D)aI7B{%bSbnI8~+jlSd0phOLiwA;+2F(c=-#$8>>W zOIN|ZH?$tmT>}m^Q!OEQI$T+v5W7)}4)k>}{mZAj*tjh;@2Ou)s9eJ(`X(Rb;x;nF z`kJNPEK}?xv7FG2->^jU1Qhp?Hc;#yu2ui7ubL^hZkS~^p#--An77r@wM=yg>_*;0 zR(6%|w;ef|@8jHfr5b3o{k)Fl)6JkwVo&Yt^@Vfg?qqf4S}qX-dc#z-t<)9J$$KO= z>J|NE>^?Bn$3&|Bkm^N}wlH`czM3cQL@pDAx2&-R3BFqPoR7fv(4051#3hmqY>MpK zWpA^;`7PnGEyL{SqH^1-c=NBd#9}mhcsW)U8__Fu#5ZD_hFO#yLy&{A_-_J{E;UIl z$lJeqiUIUyQ-Wat{6`snWfyjl?>>m3$Iu8W;-*Ww5&SraUCoayQo%) zxqC9iG8_vB14@zNN{)Znm+wRcxZzH=#fUBE0?Yix2T|#n6~Gv>4t9<`TK<6tuqYkPJ0UJwOl!#x={uP`pF8@h}$BM`V57i&KL>__H9pd^m$iA=p8Is zxY{B0g^URx7E834`PVpv)?aPfaw!^``fif(tIKHTc|Mwxna8rGd(qEceenIcoI@tl zvlRf4*_|pmcX>wlJO`z02)~2TN0{&6FJ;y)O(=*;Z=M?w6gc=t+}`|XDzo>kqQnJ% z!Shk61dRhue+7>WQ%5YQpzuE5gfo{762_sv$Lhghcx6g?S5OjPd+o~x#K2{11y45v z-EEcm)MTdDWNOK^E=EUd86!C31FB+)_+IJ)9bTI*`o~96Q4^b=9oW{%!18>r&7$wM zYSS%KB4;Mg)A=UU$}79IKLG%%aVZqpnA7C)#GUnzOe!LK0LtXutKA2yQCqGdJ#3Rr zLnegJV&6_3x}_Fso)7PSvFH;PI-SBfwMFSu{CLd^=EOif-qveIDRZl@a*^PPC;RcToXC7W z37o~}5D0fG2i>>ZD_bKRItLk3)T1bCb-`_Y*)p4(9{8JaP9fr%&M&IcL^js@K5(5b zk29WRVydd}Hw`tP5dNg?xTEE{BRs*^dz0n8e^gEFu=mM!9|eq7Zo@gttLn3la=aZ^48~bz>vc^n$o#JRqHS9vRBf00IY47Smn_ux zM9Qif7%OcPY$GELF3*W0QSmt*=8Xwyr@NoYv30rC7m|n4d;3LCgfy{*caCv{F6(i+ z`>8)qhp4hHuOsy@eQ1~_q2#0E$_Dc#otNG}E0rR)GMDIWR+hBWC!_*$r4Ns_*^)Ce zw^L>omoSU-alDVErsjo-r1;9G9)U2^7ya3qjI)q zT;kO2le?iIZX9uNUN!F@()V`xpGU9wA6=>sj(!)0a*EzL{(y(0Nt)B?p`_H9kf*Ob zt-=OG&S4V;1fScFmde2)zXzv1KQoH_BP6em6Qt(vd^ghN#6@#ouCBdjbFt$qZ1Z|q zGaFk((L4;GuJ(D2qMB-L4|#t#iPrex@Ug~gi`N!}>lF~M#K5=@h8jAUeeIQ~%jEK^ zn{>KNQs+zJEv4=NN8}WS86KW(Dv7vc*+uU+Yv;wDtYl1R}00En&Su{lX$4lVs(R`-SHn0PTpXjho=jse$vl}FO{nu%u=VG%62nv#plyU<13Vxx5vC$x$%h> z9b7q;kFMZ}WLa|K)ij({-Z~L&%jjwh5h2d0{xzQCDUDq&x;Le`dDD+IwUq5RV8+Of zkpf%7#~xL$8deVl4;?nghDt)mpIu*`^!f^CiJ2a_H#u;>)=}N98mS^RI4Y^I!hJTT z>$^R8h_kM4wE(hYnpU0}l(p_89O*}@@2M3hs>*Xn?GrwDjrx@L3G!ivC-=f(UMTYn zio+JabZRz~PoecxL8uylWDSm+b}a+b-MqLTX=Yt?>uX0DOYAbWQ)x5}n?^r=G2oOG zr!U)a*t~BU8ja9p|4KR+&?HcD*|@U zm@jo@ah2`zj_$$~nMlfZJ>x8u3`E60uKtmk$iNn#nPN7`fbjC4F8;W{Q?_;{Fj@}g 
z*P}n2REMMaf2TZ*kdGhM93KbvDY=ewqg4(#MkUg80u4X1&w?2d@$Gj-z=M|twcpim6?(!B zu97SBi~v?Sd)x_Nh>xtZ|Aaxcd^0?CI2FVvv{Qn44GPAb7GlR<0=b$;T7jfJttTIa zu0A%6j_hr_7?FI`sA)2o@GYCq9x$?C1+c?=e!$E^%jmBFO+-_)prTP5q*Fj&mug8R zx-wz@bvX-9h!j@PyDN;!8T*k$oW>R}&DM6BF34LazZetBH*QN`n@e`2Eyb^yTH6@)RMPl!?}B<;=6z=fhCe(%x>FAho&Rv(>y<=GxV%%$e3{Cp z*Al-nRG(PTGWLG*C9;^j7F2?g!e^oZ;p5n14xb^eo<6z`$;iq(wpYXYd z`Y+Qw^-9DrIVSI40(l?+2ssCh8O;)Cj;wOLFmieSDH#Vs;P;6(0hoLr>RuZszM41w zkXp%cFuw?rDpMNJraaiHu&|$<+41I%&iBXH)mb!i_`LC4GOlEVp)07WxJQ|>fdTH< znYOjQYd&d}xp75|(2(Hy#uUxS_{`U!R~o`UJ;2e+|J=x0$0{ zuj);gfA~BsWJYUWLP`ej_ql-pw@!0VkI#}70z0D8DlNfDA5_j?H2Mj-!#6oVAvhOc zJ}H+yckGb&^imyDXc9l~EyeLiQ~}R7sdn!8>r&az|aiCZTPJ{0|sPO$% zH1-+feE=C=AxPqe_|EpOt%OR1B-xxY)$Ft+^{X>h`q_GFAC?CiN6wNcW@VKl>VN)_ zN_lLv{~Yez^Ci9;z-^p z!i40v3V^f-!BLYx^{l8Srf}Xj1d2sNZ!`;{a?YS7(Tm%4Gc*Ouu*!Cltrk2pEVze6 zKPQJV@xtE!K5NOs17QxnO^qJ5y)fS#emU`tD58&_&dA@XB_?Jp(CD31; z1c?##n}s$Rt@zw32OUh)Dvvd8X>R7DNzJc}1ITfvl2~Qjr9$4zLf`y{koiFEND@cQ z1h+`D63W1^@oDiSCc;}>J|dCo`o4SyjG{P4GA;MQ1gNFC3h9*>+R8Cq%( zs&YiT4&j|3)0|On9 zxX10D2V{AVrsI`>XGkQ{$Oq9fiaZx{N=wWAqkAkIP0*O)sZqv}_#6qjugTXKe6?|$ z$<<5&_=q{`Zahpd|B>PIbx{1>kTRNjjRFeLAZUi=BG(G}Xa59>CF1%@1VimU7okJ% zK;}}DUY3_MxlJs7C2UEaR>iI#kOX{qI`6r{+*f>mb^XHC)PE&$@O)neLtW7~38%?< z^)UhPc6_xDfylz+d=pd}^|Ry4bj>`#AEe8%+(kd9gY+Jlwh6Zxz_IcFY`hy8Voi(? z$~;f;5^YS~JW&Yr)3*a|ZL9_d>nS3Mqt7jI@FWTCN;9&~OD2AKOdf!UnA@nq8ciW$ zzm-Sx7^i`|*9eSa^$m?Z1=q6r;z(6yS-Jw=4ZVocjN88Ty!xYNhA@@Y`+3z682gOY znN+{ZznXtfa5UDl3z}i_Rm_HFjA?za&n;Bux3jf9fW`Sf#Mn`;7pM202EcXTeIQHb zf#(A;aJj#$Hh^0>AdQg4dXeP6cehrk-@?kqNW(-Ue5$-*~d3Vz{eX z2a9hf1GUHTTa%O)nB2e16@$QGRs8>~$9U)$6sfl`Qm8 zD1SL29={l-yhT(9$rPdS$|1g&+!K zEy5CDO%}|)?QVkEg_=jUphlJkb^V-gL(j1UdH!ycM+!}h&PT8AAfFuL5+YXTNFW28 zBKZje2{nc|XCUEfI6&t6So$8ZeGSqsk zCY;W%qN+fA;97&fYDz$&e)o9eer^|4Z<105VOFmnGeYV;4}QllM8^0Dow1;r_6OY z<&pFMw2{C^L9F1+OpA%tsapFHDHjo5`yVOd`QH2M6KwiAvLMt z?qZ1!{UYrw{H^)K#Ke#Zh!H9-zixT8-_rl|Rse!ol$D`wzUKK3e770DY76;7y^1P} zN#!b%1bqR>%69~GaX^OcC0<qyM=;-rg`Qv!tHIjZLdSyI&)^xS}V4 z&uBPHDWT;KmEdBf-FT^1zE}4)wq%6Jr&jiPev=mJho25x3NrG^T2mx_Scv* zRk_;5fDN^rCh%-$%fG$?wAiSI?Tpp!)R;EIJpKa>1p~5*^@^ds{K*R2sv4kUFQCW# zQ9%1=%1MGcla=;i&kFq?incA1L8GAwB0la!d9NluF{{%?F+vSfwF_6O%Pe~GvgIqQ zmX${ra5W`6zWwa5QdU#c@MLBFKY7WV)Ibk=$9(o#`{9W(D4X3Rn?a4MdIf)x8hxp8 zf_-J~WV%eG;8d+g6-OScGCLc_EfDF(y(4JS|MHPscLM>$#HL@z+n5k`=+KH5fvN_M<&NG zBfknQw$yz$HcO^a)%D&0qezI;H&#T*FHflgr0UB)zhnfI%!rK$EURi4d5|(vJf%_c zQxe<>_Sr0vImiCz+9IlMw^tnIkr*Mx%S?pkOt-<5tENkjza^&A@;$?J;JGY+?;Pc! z&WEogkQ1Clz?TXUijU=+_vhZogT*MB@C zv>!qmiB7st+AAqsWUR!guFDe~Y5(XGz5F@twUWextUs9OvL4J@${%)Ftn=D6RAE3e zS!fAzxl8`916eN4MQwUxyT*g;k$gHHImVT@bP+j?eUQ^dpQVA=(}(6&flx7WUc1lr z$IA)bhW^J(zURbG#ZASje|Y{}d9={Y@c}r}5CFMmi)N{15b@jeK#qs5@OgbU3XQ{~ zg-FL2L#i@M4kc}*PQti4=JA0RC6l$N2HR(E?ii$w^iulyLo_pY$&*%m;@pc834bHq z8yW^FNj2hcPSWONsQvcfjt$4s-LQ{DIW>W#-y(ko0GY5t_F@fF!R*y^&zCK+m9&3q z0K{T$=J=1JS=@Zs8y7z+Xypo(uk#R02Z`Osq4q_Ko4uiO*G_wQufbv^kT-zJ9@GAOJ2F~umY`ObZJ~4ByaC-p(9IS)lKSo&cUtQ8 zrR~^44x9T0Z(m!0zI{VUw?4%iN9(JhtvIt&8r34B~HM z*w>6LXQh@t5#@vi)^ww^Hh#tj6GFZ)jB6a%>pJ!ih1if2*wmb7Ou50lhk1OL&vh)5 zG}pZvf$?TZ^@NWfXYPO;_x80!Pgc_fz0b;cx3$l3pujrY1Jp#;R%ZI_jYraAY%W5! 
zO5Za$ULDT{6YBc=H=G~$1yrgl3$!JBuQTfv8;gfVQF2?UsNAn#N%r}U3Z5=CR~Que z671u8+UH*@$yRMv6Ht@%;w99ASDiJ^z@>+0q|6ekPCrpO=}?<;H{ZQtcHcBYpj{bq zyg2K%7Z7Xcox8s_*ipW0>KawQ)5Y5CH_BsM|0 z>km9r)*dX9PLcXWiZ&ch`&4ce)n2=EQG+ll$Xv-EF&$jTt^{0C0)x-^AXL;|&%Pph z3NXVGoj9Rfo{^&3?V7ujB>TZNz4o(BL2jQe-9rVh9sk~FfnspnA$tL(!!q{T04{7K z2n3SUcnHsVd4I7;UR5$0@vVn%Mt|Na@4ZwxjW3bMw3p>#>YE4nhSYox`ajs}yV+|c zHTrK=p1!ky=DmpCI0OuQTE)HZW1S9Gn+fP=8TNQT5 zauWuKdkDgFoXIvxP8+R8s~wu}eQH$CRwkx)yr%@ycaqAG3L_<+5@0O7yf}@$m%jE4 zuf~0g-0NW665wgWg^w4bn!Qdvm7&0H^iqjvyx2#60K7488h%=&>8P1`tmPK|Hrn{WxpQeSjZzGi!Nuvsq2TMlO8SaEEwGaKPi{&eo6AXJ#E#9`AD0 zsHy=;+QDpSWR3)RVl4}*+5e)^%3t(+PaRV!DcX<5LG~a%r3K(nlKx>v@6xBpkL~7V z6ezoz)AC-98U={tFtr!;t$`&gu%w6bwCnO=+_kQ2cXdk54RnOwx&QRh-JU+yi2aSxNa2g41^OKf zwFzrH5G})V(hT)OlLXyt&=4fZ8(wJTO|LTgB3+meh)!>7W`ob`)7FmL7?7XymdZS% zU@w5LU`MD;$!(UlHlfu}g!*QQu#u?c%!34f!1)`Wrbj|_>%7pj&BV9UnJxtSp#W+i zR#n))8}0E%(yzUf49L=5CG2(z1n%NkKTL_K9~>I4`&kQ*iq??BZ|BIX=;cmjxPcl- z680*c5vrBstULW#+oYgiw@y)wGpoL* zhbU^-vRWB#9>;0nRo=*p%}B`Wy~u#G^eN|Vwq^^>p?@yR`XEB?Rj))mOf^4^u40&} z7Q0_jy?qC8q1HIZazb~n1L)S3_{eX()ZA|v&LeL6uwIBib%WeHWD^*}opzFoIl=&@ z=Zjz{0+gXuad(pS^qK3srPt9FtMy~`CJS<=YQgwvGc5g#+g9!M)(@Scg2O2V23(O2 zgML4iHiWf2t&kddtP>F;2ig=oHX15&`*Q3Q4Af+6ovZ_f;^9jDJ8cg6-(5tFklK!( z4Hspx57$)<4I#-x$Y7YKU!65-y57EO%AFO(=H#Ws76;*VS$cP$t3fr)ty8?rNdV^x zEwbRsU*smj@4isPCAyIwh%_HlijG*bq1Wd$@A~vA8na!Cx}&=YU`31dYwX0+r3dX4 zj-N^2AxAJi`|IF3=!Q#-XJeJhNh=*}!5R0dIFS6j z!opARb7BR*;4Q%=f4&^(SjZ*dn>S@pP#_>#_&i0yOi<~Mw_x>bB zebz{qaaunk!uKvl&&uY&^~tvCR5Sg-eyF5P$-6|CUV*krBcH7b;{=u$#MFH{x^SxW z`=>Q5@x+?jCVbI=W8yfHe1OQwAe}P)nTzO%vrN9PN&wg+wZa-ns7Vw{5rvZc02?J; zeEoyXWs)$B@$eiRYQ-lzCD}I7w-~67{n|yMV;s>fHJ8@cyq{zZcI34y--1suKIrp% zmow6{r>CMk@+Ls#-X*pQz^(tPN0hA-6Jl%J{lcQp%1!4fwV9C=ufhMo|7gxwYyUH+ zY1j95)>QI0_b~EmZY=TZ7k3!;tvb_Fb}~X4n$4)(qGI{8`c@C{bvNJnAkD0W7p1|q zV|+21{qLq2PlVl{zIWV(9{>wS5gOBDIYI5U_<`7s`nn@?p@g=o)Zh%i>*|jb=fX&- zC+X9X`C4l<&`TrhW75VPM*nGlvh@6-8WwjK{p3@(J)Q0N)upP$y)0*=H}L6-)=P@z z8^7~suMAUiW6rC^3KLy;C+*7G@`?-g(9oC)I46RMs5a z?GvSbT()Qqm)U`Dl-Dw)@DW->-RYEQZE_Bu!+S|Sa5DqEgD6?Eat$u zkr5quxxJdr-nXFfK|tQ1Kp;ke3#=r+XK`mR0kIopn_i4&5YHFW*|x~S z7Kc(*Tlp1ssb)|p9Hxp~w@uu3jduOS5(AlbE;mVRVblj;Hdc@cz>u_ogvGIf9^sew zm(;aX7h32a4t}ZM(!kzIS;=<`YpbZ?@GvNJ(#iHY+-|OUPbK8et*H^EUAHOti9s%I z2G43BL+)brP{}BZv>RtCH79xGL_b%4NwSCZQgQT~9&-naze+(DJAE$@Z1d=VV~(Ir z-F0l-yANj-e!8l;9{zBauP<-y<;|OQ9l0yTw!dprC{V3iHnm&9tt_G`yDR9n4s!kK z1FL&;0F|sB5QG}Non~A|l5D?t{!6EAI*2V zHjk}HdfYjFyvoqDpD1VE1aL*s16yked8&#}%Qmm;qQQBs4z82ZRCL^(9g;Q}Z*@Mm zxYYYowxIl>LW2}He*5j8)yLjYk+_0SgE%6+BV(O-d9u7e3iCk1YudeDd2_LeF#>{gXxlZ`hpN6x7zAiSNdT4&x znuDW2P$u8CRJ>oOS|T!FdpSLSL;_S5MFm!AJ;*@gzn|GXxg7!voYpMRe8h5$l0cLX znHe0pa3~bI^(#hRs{btb+~A;xj%V?!@O4(_*1+Ct&Zy4AEm93}4}SoYc{VxP+Dhtu zLDp6cU*eqfL(t{-QY9wwfYX$*UWY1yARt8@g0bnasQ^$G9tx{!24Pu z9vE+rqZv>k=()mlo3K*<6KuTQ?el8*gPY%Kvvf{~UX_83SS(oTcszXmFhoFHyOAlz z7rpM>_NWcFEfZb@FJ?epqek-c2jK}zc|~fx@?wgDKB$qvM74;p11V*;zb~Z$C?3ZO zAq7la3uErjv@!7ycRE=$kdMBk7&f12IH|piBtdR^Xh~PG*3o-^n!hD=PcKdLS7GEY zk|Qn9gX>f>wh9axL8-RUghX~I#x52jrwgYZ-o)F^H9ae`N_u75wn22Me~7mz@fQdZDS>;T5L3gsIxezq!*uu}mP-ZU|$pGvQXoIEKtXo|33>CB2HIS8C%Q z@N@l%0K0+X-V1I#@=eT#((HsvLQ<^={-brKhd-1>kLTx&ShH9iH0-j!mAf`7uU3MR zKM{(2VF>Af4yF28KQf}Vem$=cLZ~n%`7b;07>pKLOiV#z;JaW#rUR8lrl*{@f5>^! 
zPmr-gHFtDQAh#aA14`t+wvq+RD9nX5pGuZIhcAFHbm;!7?KQ#B$nkN_hdyRfRznTS zteo_3iy8`yhr#ndfMh>2%U4;cEUne_){8hneay!e#`hxm{IYWh@xM!N_BIe^<*xoS z8y(Y42x4T3XDX#X<{uoDa4z*1foLXzKJGk9zo#^sgnHwhnEeGdhv)?KzXAHAJhQAO zUctf*kUH2Dbg5AeGt*noBRSY!k9a7ENoMIr!CiIrxD0(0SkiaF7jb)iSpm}LW%7R+ z4uBoOStMg>X8~BB{4w1RJ!=L@bx>D)y&c=9Ppdp z$!alo0`9>zAHGc)d{j)GOiBJr2e~rQ&St&d9l&7XN-Hgd_0NBcd2f^j#6Y!-pphY$v6=Xh$DRJB~P{bBFKXzt+D zM|UW}@54F3maG#jvgQPPg9_WjUp5UScIC5>h4(qEJWk%s6SO^U3jq6p{&1E=IcZ{+ zTLB+hM*a7ShFQUSOtN4C&(V!Td}`hI>}w#fYsO2bis)LCsfx(}P!w2{g!&%;u0U?;z$AY76`w3TDi#d$a}%Oww4K1eJjuJhRfG z|I)1^1Vgb+{rEinEvaANTsE;OqU+?!1BPpzDT$LIh~_^b_#%fM1uO>3BK$#HOjPlcH90#9ks|fEX!vqFS~QSnpQT zU^nL427|kRw*F31g7Jq6{Yh9$#|?HB;kQX=+}SCX19EmD^~&D;7daFOw2+$K~_9)D;i^ze*F-a8EXvJXq@5)=1i{@ewb?GE z;!TYBN3Vh?z?00fb3~r%ZC_|8rR1!bi!`?(!oi?B*wlPJS7jlDlV4Q&RsVRw6bWnu z){B=djJTd2aZn<3C@_>j(1Rd4jZY}>@f@>Si~PK)e?7gQaZ7|)`!F$@eea6>t26g| zH4l<<`CA@gf=#^|@V@2x1&3B$4*Djx8z=*P6DYml0*$*;ANjd90Hv}GeJ16S6?=$%J0e+3cNZV6TY^3fIlMyL(rm8G2i;v%*=K|wGYh}q<^_l*=Io(EW1~{;xx~;!D21T(MwW^_^e$G z2UNuOVBW=raY{I4dZ7MTqb~EP?By#l-5WogqcXaOwqBjm$_awr zm4q@Gsyxm$_ApeY4jR=Q`hZV;Ajaa1?mFZ&HhUM?r*bIz?#-&uQsjLX zEo*<;6XjHP&0d~8jN&8QQYB;K%A^;hiAWBfVI3z~(nMQxCA131f!v^rtYk^OnsZ)D$AeL&R_t#i zSJ>c+l@ zp$cxA92YYz>MF4SdIp}3R=!gBM+UJli-8;gBA{+fr^#{wlEN=jLeuMIspekh_LPiy z8}`KdVN}uMm6u}V0N*nehY>dcwK0dreGfO z_OX+EMDqRX=Jq|`zeFVB0jWKp1i$=?A_%y{@aTt()2FoSFg2UlCa4>sEg510e2b{-{fm3-=E~QHKv~NRm41C zg&`|)8#zj#?Mh{}MwjiRj}0`>bjJ{`4wJ1hNPW|cetah^tb6)b{}(_R89SkmDzniJ z-bL24leEoeQ30#(MADFjUUl=5A+UM58=qK;yGORR=3EZT8mD(+!jBv#q)*Rw(GBi9 z{*x10j3UOi1hk}6qBRTg6(J|}4c{B$xx2T$wz&R-WB?a%q`00T z_fBUnNE3AuEcSytAXAfLGlbiwB7&9WFI(Q$^Ljg@BmI@cQ{8N)#vTa!6rZovjBX0M z1>M@LkE=i3=Ww5UM)&g2dn3mq%++pq%s)}E@46Cy+-!~9>z_J<7xr-i(qOzuf?_F+*3!()59&BEnV(%xk0kA0Hd=O2$ks9%FQ1? zd-;uhCNsRA2|ZVVPI`3o@*c#_Yz1)mf$L%~%`2M31y6x%Q5F>vCeRCf^UL6t(Gkj+ zv1cEAcnQAi&F9<3^%!XEt@b7G+*%n;x+$%8kJ3Du6(@1DEGspONeXPK_?Y3OSdXc7` zRDmBViW#fm@sk$4Ji6w%X*CM&L?^)d57%mOF9GY=}SQ84(ufq!522b2ajkUu+pQXMF%i@O zqHNzzQ@V{FRB=4NhVV2Ki+bE`hhK9 z@)w^?hb732V)WRv>vj%?s(u^YMlb#9_g#J*B#eKkwYP>@Q}%sx8Vuf=9^UD!Sy{5^ z3LB?+k*T~RBBfg9$(!i3-plE3uhjTP_-0TzZm)prb-ns5m$=v364iAb3%Ee z`9GXpoC`w8sT&gcyOWPsJMNwu?u;?g`}?nd#kjYX)Z&JdO`mFJYbARHcMgDU?dlrR zg@rYjeKIB6&)z=;CeEB5i3;-gS!a)qP45#tWPekJ@|&49gyn`k=+xu&aTw|H&GJr~ z@Y3SQUR(-T$u~-`y%sr5yA}}Q-B!wwE7L*RJP93l6cTcJft)YU<+C?7f+lO8aY2dR z60a7nG$i8KTC}G;I${Rk>^=UevO`q#jzyBAr+-60`CN)-l(%^B*9{n@1>?+#@)HD& zd%$J#m`1(nMbbj9)A>Z6@kP;`=HOQU?g0V0()8K9rO&f)kfGO-54{%0&pw3_?iK?r zd00eXqz;BXbl@ZG9*3GoYx1OfIR`^8z=@lvXM0%J#*D?pCggKIspj<&?}?$%t+ud# zkGecRWb5I6JkDCB)%ZmAh8@jp6o^{oOkH)zx1{;b*=M34U^{yB6T?p!Nt~IU_V5PH z@UZVr^L5~I^%2gP@6N-QFZ^n8O`l|+HH@{N^YU3WzneHxlk1p1viNaYiHb1Te0m#u z*Xc|~e?JdXe^(R#k`CO1HcBUtMRWM?dxZ8uzBVdU+#`a1N-_JA7hCjGCpPQjb=Cz(~au6_w3T+)c=k=Y!9Wmgh zn|h`ynu?uL=81%eDMQf+N5Iv0fN9{EM_mE@~rRLd4IHD3;Wa4^ZLbQNA zBEvXo^U!s30QIa>sA&u9y3082aeYrsBeu$Y;*ZdtNC0+g_18$ zK?C9($Q6~>M~bYUh+8AK;0$bqhI+ouI@-ILs{dPnp) zp=$c+R#fq7s_)A|XonX@qJuCIOlaoerRj^}y8d+C>eeVI?d^ty5uW7J!w#eCj_STe z)6xDr4U?o$-l@`O1ce0k46D)k-w)-Tq=n^jfqO{IOVHkC#L%n_qF-dE@tNIP*Kl)d zf>3$Zc=I_QvKY&w;lQn!vcE8UJJb9Feow4Yje-Yd&En^aHq?3Fj}Pu0_y5`IBp@sf zStk$J3C6giC}wm~4jYZx>;b+RH8SqSdeBLAQ^$4O!j$CEF{5SBipxyG%!Gh0ZzpRd zQXAq;(#4bqp~lPrVzK%d$JU zNGC{8mnm^e|JYB@N0Vn&C@L)ZIS5&W+=Vuj$~8XN+H|~b@96K5O3$h_IfHs&_>&73 zhl-}5yNFc_+CNWVW|Ll~uA97hnCV*=g@i2!+%D3u#}qWfivl^Is`r?0IdHfaTNIoY z>~1Lsw8pH(eCQv_5U0{Bx6|C~cuJkXQ$TXog2wPsWKHPWbOuywWWS?W)*R`JqK3rv z`RVZIPWcDk2`Ed;HgU5F`?xQIWW+Yo+cm?pwSA3VAOdcNc$Ro}TNKvm>uI+izG+-O z{j^^bn_h>C_KYqo@I6vyR>-mv`g2``M<-#RwvtsoI@iqm!O~QKU&VzDy?LhCKS}#Z 
za8t0xQ}s(?0Hh5W1s5=j+XYKmGj5p*cITU&0Cnev%xus~dT1bLd}mqjXSyDP@}JVS z{r^upSN;zL`o~v^a>P0+VM$+#+30Yk9Bt7mXYOm*$`q5vk$W7IP1`uy%@yNJhlnA^ zgfZ$RIYtaJXpnm_!_Z(1Gh?4A_O*Y)_lNyuUa$G#^UP;H&+~kq_vd;CGcFxDBzXh{ zN$uJ|OH8CH4SsT01{Tkvn(YPchiN;iZo;O#v4iFpZ~Q?1dH+a~NyXcch}9d-pGcO5KejiQ06F zaaEth?!I32cGloboSbDEyPD}R`bqu;bI#!{^!QE~myH(j`htrG+VF&B&b;Um?Ks|IFsAsVWZ89?LDng!z`S-Ce?)pigQo67rHYojo3?8S>}m}`S=6Pu*Lf1lhc6{}nS~kI{XuB z8e1(!Km#4n&EhDXjN3{Bf*>eW3luz%O8;U<`4@cOVkbd;1!C!prjo5H6$pF!04aux zrT$1fSmImDI{(*Xx8|R}R}oF}v%u(f$(|O<)^`vEweJUF1PSa`a|jT&=W`+GjD^W> zuM=y+24Y0kn8Q{z5^R_+Fvlx31;qL9Rag*%9Q6cZY$H6{^AGX>lAgC+T`o({zF&MC3-#tm48c^Tx z`-USubbYRd*VgL7-<{m>?~tiqVt4s1eeIhKI!U|a3b5I210RWv{_{$MsizZq%Q%_# zqc^k+NFpLCt1oC!NPsJVzUB8j2m43ZbNlP*OMj_COWGx(>TuQa)= zRb6>TdB=&l94~xdS##^-M^`!acA(igdKu59WAb1WwgmF#U%y%gO>%)&vp}m;T#=w? zIrVa8`uwf8FjAn0p!UoskF_BbuAwu}1nM z@Sze$xC^dHdhWx`f5axfG7TYnG{tqf@ z9Zq6T=pTcwjaCE?gQty#dROmytPlK3A-Ks$6)=2HO>2p3ATQ3>JQE)*yq~kyh?}Qo zCcI+o!#Eo@wAh7X9a6CmtLfyj%;A#u>3J;DkA+_-F#)3D6Gcw-j&!xCcwd)|^rW9dhcDzsh0eUHz|~Vg zj7h5IDHlCnrykdJP|F+OhJzHx7`U5qJotq@KMwqwAC(Y4SCc3x9itl&5GPJ_Dy@#w zjCmPG+k)N{P2V3^(wjT|FhVKuYF*9;ELson<`K;s=`2@Mw%MuL@dws%OGqA5yesPV zMS9J>6-u#(P4}|M?cVs{`Joc1`U0s+)WVI!z0vH7vFX!izMSEUfWoUZKc6 zT$mZyB~z9<bzi0^RNYDsfj(R4 z_@E(*WDyxPv|8(89gg${6Y3`1q+Fc!0Gmu|h^ibn)-PUy+*5c$ zn;2R81X;+=U^%-Dj}sK^g0GxS_RN`p(IRu#{qfag{g0K}>=#dS<@#=ts(c;V2G1J0v8Ek-q_ zM5jSWhN3T3p)M|*Hl2*YRZ9!4ROywS?dUYT)iP=+P9uxd4HjPR79-ISh~Y%FJVksc z!|L6`Y9><~R{4OP=w&_C;Al%YCnU>ZH2)?2@knfT%zLYIa2G~d>HKOhi<}6{6K^Cu zx#Ft3q#y0WeqD-0UH*wXV8%T@ykHQN&qfR6f=mx2EV|<-Hy5T^a9mff@!nZGmBWo> z;aQ2tC<=m~EvekOEc`+4AWjaWkz!5f@yV|i`QBwy=gIFNwKQ6qChWd`m%aZDPGn^B z;XBu8SPat}jmI#RSOg_J!`v_r7j`d+!;q8@CQ6=hmPRd*#uvKLFBo?UTu(GM(84z6 zlCSAR>VM&#y2E{nc?~VGAUxZ57FKAzx7JxUqbX!?$f%Zm;8>9M+uJ>gNyw{LOx1sC z!Ok4O6Dnf09!?SExvw0Tg{$bPxNKu=00HaU_KFxUMhaGtvY4*32@OFj=Fyr?#78kp zh9l;nx}3T~KP_$mlU08|W8lR}_APgn8UQIC?`lV!+#S>gsAc3-NFcL2{+b@#6j6OE zjoi6hed6-vk)A4KUzQXN{$YX2@T{yM;W0r?_9JlFsruib-ry}m7oRFT4Q*isT1Kx; z0*;x){u_lqGs6(r5_1?&*up2TV5S1!hd+MB`K-N4o#AT5N%s4NyhJL%Isco;J}tCk zNzyA3c2=fsCrWbn+FH%m9*!!67&+J;8N+wKh+5&K@Ct)8i7XLRF!sMnPPlkV7*HBV3AfPamG)SW$BGLvmh;)Z^41$C- zNJ%~8|GMt`x$gHk-f!=R_ruMXnRCyMbFIDCZ>_z87y7#DB!mov004kQQ$y7d005-` z06_FDd|XRp{5d7=1L$X{ejiXZ#I%Wfxb3ZB?gs!6;eH^Z1Ojqi(cwCLcQH0YnCWOk z?YumMZ0)_CI0yxKc;k8l0P;al+*c0=ge_Z;hr6dAG)RH{pB_-$_rJ};>}>yZLAWWf zo9XDYJ@E2%V3QP*5E5ZmBxGY_llQfEgc_>C{yiM`q`>ZsKzKuig#!Zvg#yKeynLO6 zMP+4Wg+;`K#l!@0Jp}!NJrTA+f}VaH|C;2#=23OKf{cYk@|zl8q>BmTwZe_C;DRwR@c{_ibQB&=Q$TL1u*0h+2R#zDaCJc1yqM~u;4 z6ge8|PrV4s;b7G}Dk>1_5L4Bf_!jsSs~im#QGh{pNn{*}ni@m}s6t+BnEjjrMq|tRqb3==xx^jX&DdX2y1oOJXz3Sb(o;wy{R@euaBK(i+YJp51S z{r~b+m#KS(#NLFY3K)HSawair+& z%|2yO1p-rl+Qqt$zw4~Zx}Wc`9Li7!&vSs?*k*6?+sS=gZ|s#?&G_IusMjB`wtQ;W zvuas~u-iDs1UQ@HRv$3LZH{A}aDxfqvq#typU;jQOg%B_b=ntC&DZ`U&b;xV?w6P2 z5_V_VNd_f?<^s&tCVk}E_|ZIKUqaz|0!JE2?sF-jzsfe2N|sD59Df})gJMT{Eede^ zN~QOAkA{ybO88KOmo<}OTa64y7YzNOQcGf|%=sR=_tOr)Z+5xQ7=R$7ce`H$I3`i# z-7frSdhsc-*ok2=aMRD}l(kbQ^+XM^KO3rSD;1V6UL&b`G;s4euDTeNyjDg)!Yy61 zNsnuFju+78aaFxxu5^CZbME8Kk24lJuf7*#1c$rk&!7NC3aWV-ZV_}1oZmfn!-V?>MPd+;#VibQYF_t4Q+B{ z%qbpWE_=r1^zfEk?g&>A+ zl+{&}rEX$tJaQ9wV{d+6@A!Ey4~y$*_lduEbEHp6sc{b;OS zKFuRL)xsgyLe44ovzy0X5=%Tn{54lkPJ!T{%ro?Jwqohbn%A1RQjK zj#(1S#$gMW5e_G~j+(d1P>rg5v95{qle8^3BoCHie>w^`^%Hb?U)( zeXDGQbiav^pQ#~TyW5gCUwPH=UQSlJ(m00gy=rLkdwA9+X%Jq*C%0uQOFg=vc@Y|( zzd0(=jg}GZM%%b&3)8MTY!Gi+BLVLbDh}*o` z>>K5;m3{Q#n9R>7-*WW3K@p>ti$ZfuXrWyj@<*#H7C&};1QI-*py?^-p)Q>~U+w5I zIc}HAh1&j2c&idQNG(0dLRJip*+)tP4Na-R<&u3)x%Zw6b-9L~*njNm8oBH6E!E_nv;4~Y 
zpnF)G=O$M1o`~+D`EK*s^Ygs`JH?JtV&4NGeoC)^OiJ{)vtSsN48sG4+3$l=QTqb= ze|c@lrjKJb$`yw}R7Q}12lmM^P8xGv=JjZ;s&CtyW;;X(g@TESw9jW4_oJZ!dF?`x zH^PpO*AHhz&Rco5e4Zq|7BxQn-9i+Y%}S`ALAvpbD`WW@KPa_VUVh|p##6|dyM`{J zSa$e@dy;5ftVEAkiQ`K9agcl3KeXN{$Eg%$$$9}u+*=UqV9lKu5^Fm~zQi0TBhh>h znJ+0J64BhWC4(1pS?x+|Iq{^JBW{Z|%(?iQPo>S0kEzU|gLWzX2=82TplhOn-KmF6 zF6FWa{w!Bd#KW?@#6E5LS5HTtSsCd3bRWEx_yaxSRJsbSy1SZ*LZ7@oD{sisw@MD^5SPw`6dT{a?2nY%xylTTvOQ zY5rfeTUd%ad9|glT$s!LyB{0G(pXuyH2z5i&wps`G%6hMXQYY0Ur7VG6k<+v+y-OL zQ^yOrI!kT`&wb2JPhE;Xhf4{a_3ktE{9_n+@+6yl0Q#*U4Gm(fZ88YIKbU$ul^H$B z)%$)crR_FmuC`CghD=9>&o+`8tMakzA6`g z9nAvqcWzp%enlGg_F?s=glP!K#_x238kyvYUxLcr2Oj%EXP^l{;%M*;D+00tWF|1* zCY*qc5wsV;Sx3OCer${!Bn`WSC*XEc`!M$qf>+=t{%k1|68JrrAg)Dy=|096!efa5+9){;8JxNtz+@2vfhwUxb?}{l=Jt@s zaA}UPI^|$*fDG*=K0AHG6}`TjtL_f?J(CFxrQSY`VHOpThUG*9H^aMqw~)s{G-Yg} z9JzlluVNaw4W>ubP8jnhgCKNjmMl&!!|Yelg;bv?yr{(Bx^10Y4P{n8o_w_53&iAKi4$qD)CjjXX3p`1bL#9NgdBEH$$oQLXZT-;*# zqv@)IrKz|>Grs9wE@;ShMO9d`Z~AKRiK;8hMqWWu;_Cf<-dhOH8rN_LLF8V$u{SgJ ztmUP9(=vARnI3^K@&LKM>{(5tD(0$A@|H{Tb>oCg+CpRkn(5vT|ZwIpEE_pTSKK4e2IQk6V;y}W<;=`BM%76Hk*SQ`~@J|?woZAMzv!jLE3ek}S z1WK(%cy_OY3z;_{(^rovy1OO2e3XbYKv1kS(gf>*u?3Bya}4BMH3Z-CHjp@Q!&j)W zGS!mCk#vna(GnW8QN8d-bVCD`G2vhJ#x@Uc7Kakf>NE7K(gcesyPrvkG?=GpwXU0}--En-grq%MLk?-NpNNY2<9&Z%@Az@c3L7$V_glEr z_!EKOK!x@f$do1`74u|Xn0n|PGs2B+S%Yq(>VcJ{??D6d0j|n|<+mG(V54YC&qZKr z`-4%6?&^Y+p3S>sH6HSds_wVNJOCG*FU^B-XtZyGlb36^Du&m}Z+J%TkXV7Qb*kG3 z)YT7HMb2YwvSB@CIfLa4O5G=?UM2RTc0HFL!G9i&1CQq1n3|^0NT~pSF_^10E0Iu zh*AKe9u=aZfDyV3YFHILm<~eAo3Mx{T`}HCvP$}fzBO*JM6TB*;_}c=jw5CK{`PQ( z6s&4yQFZ&F{jUrS*iYPL4hKf7s((Iwc1;6i7X!LmqI2ZQ^)9;EeQ1>l_+Rg9}^&qUUt%3*jpxuv=VYacWfK% zE|-Mrm%qotm?TPWmw9T3Ea<1;Wz%MXUdk_epAZD-$E&(CiygdSpFQPDsZHk|q~2KM z?4S6Mww7RgM97QNiOEdaAAOxW1jK)QV>-Hac|D-+w3=1a;;WsTbIqh)AT_i0spU8I zq54kR#E%%t-k1pd=?2C@x)~^>861v(fN=A5Jen({Ed(Di1~Cl~JOFl6nBc!)@EZLB zf<8wC?cpN=PVi$R%K#iHIDUSWQUI@^?jk)L=YomlX;_Sk1MY~mk|q+MiNClAgcv>p zrTVa&ZdbKta&AHLx=?+AaG9N{k^! zYP9#Y_zX?;pnFJdSJm$%30qr(yq=%5w? 
zx)gcVZ*yvBfs>Guh)!vijN1+`N zG@Hr|_-I3@M28qhq+L5RHl&*5lZU?nmKH0etUl)PE(ydRX--bKb{aE#&-JoZ#BdJp zGTZrXA2-g-I>pKFZ)>|@*XdgD@EB$|IFFD{qv5l&52MfgGwXVN66Tk1_u$@+wGY?z14l+j1Egnf1Xiv){016qR&&BHoN{$ZY6r;T z1&{VsyL}cO3D5bcXEGiAjFdmR-nRO=r&aWt*bq6CVPPN;PTz$r!4R>TSj0VZ?`}a_ zb+3b<{x4=DdU`IOt8^&3pOAvi5$KM-z~riAjc1{y%Ca@oQuji{cm)66A%^TK%EMn9rWX ze{9#i!o@9V?+G>xl$;GJh1d|=&WiV^eSJShwgC+B_06BjXgY$XY&YDSxC{Ft6|MEb zgy<_%IG=n?-A z6>wZT@RPtHSJL8e!0AZ9h;#{wyLJzh9UTr(rXFf<<(RL1iK-i{h^QUtG0(7qwP=|) z`h9XQafo66^_l%y$Kw+DKYyOCW@$#4`?94f*0~rcbJzb6r?vRHtCvBaMA`f6V-yfM z+OUk3^k(br!8E1 z>Z}gJRz3bc|D9TdG`2M9VWc#&VZHxN0c+ownTtR%ug(%2oS z1s4v#K^>d59D#2d)9>}6*a6RlVhhF~^b2;GZpDrG^huBp zASgF#Vc4u8Vd?U!y#X#^Mr2KZjyv!Z#vH_DMg^)=gJ0;-JIFck#YCw+!<)MER1zieEaVj|}nN87R(4yk*Gc|4e@LQ0cNm>`}k@T88e-TA!ZZ z==YjBrmI0S2&qc{=Kg07N&~q1zHKDLuadn4O@In0ybDKc2k(mq5mGsok;j_C(VxaN zg-CSq2O93pxEbYW8E)rLk8GN6 zC%!jhz0*`su$^s(&zf#`hB=^l-tpTt-(k4$nu^l#dZu-2UvEjSKfk?2PDt8mFV~@B zC&OVe|MU2Z`Gib1x=Y=;vqDoC*GhwF*?P81Q1hLpG4mvW5^8w;J^@)GGLe^g!mN^* zV$`6SxEaq8+15^3NPWieiZp;ApL>UhR%m2eO<{F|$%uMrr6NMWZ!hpOWH|NGIWby;6INaD((bzR^>eEq9is6~ z>f>j7!Yr3}hlv#_lBl2(j7ije{8?Nq(=>u*Eg(eS<49hX`ZQh6Jo&^GKiIEB>E}cE zTZqEcX?$->W)8Fx$_c>St4}&E=I$i1%Y$)V)%GSayA@Z`5) zV`iu(cac?jOH&wCP~MV&rj5*`6jVOHG{OwJCFD-syP@IJ#jpk znc)K38HO-$NxZ*G7>0*_Il3T`y?JZ1XU;Kud%j*1z0O|pxFdeftkLztULw_wLk~C%GRNn^t~I=&v&t32 z4s-aGepa;L)ulPT$8 zg~nb&MN>J({rNPEAJtlMm4(-1rwq8PeZ2oE(r?*wDW+>0U>2+{wxtsBpfVl;8m0mqmr44mN z>Ha81dZ4e@BAuluZx1DA^mJaO8km@l5Xs-dM{WV;@_A2~82WPAp~+_-P8R5*nevq1 z6Qz`efX*Z9LSslelsPcDO6EDd`u9Ibx&+V|jx))oydAY_`11inYTR2B$eAlC5SH|{aMpY?migH@#r(tCHu8U7F1* zVf~7Jgpi5Ea9O$dvERLo;b<8T;FxrRxL*0{T+IQAHwzsdAFA>14f5ai!S|d+!jWaW ziG*)%Gd=pZWfB>g4WZuc1*o|`xKdLPGf|>_%;R+_Ys*da*liNdc_|iC&JZ%W0uZaZ zR@!Ro=bm9_kz?E3D+%agM|&=oXo!@t$k4e}A$)*%dj7sq)v;Q?!G=d&kxKkIFd?ny;n(aslRL>1T5+O6Yg!YsG z-7`7|wZ*p_#4Di0&G@=Nvd9Y{bD9BiT_nAWkr1}X7nzTghP@nTcug4x29r9sgIT=J*gBiI~lti{}RqE4{M)OpekP!rU`Y6aS zn;sb9Tr3B2Wn5R!Nl^f1K=+q};ttWEK;XyXyF3-s=JZVD9G}q0bK8donjA`eT7`)& zJ${PLgpS}LEyrgWsSleG*3hjpvjzk?m@X7iRZBell2Qlatg z-_QspI%96RRr*x>UJg9SLx!NaE1lE*?xB<}dxWYoPEL->5{FP6APQToz33AD8rI1%TD5j-fuW2=(zUiEy&Nod%=c%dnIz z+COE=gLGnt#XWMDr-1IO7NE@5cO54{r$cE=w293xyMD%!DmBH)b(2SklPYi-|wxn919lA^)=%KX@vnt^oqk__RBJ-{r1@mPxmslZVW$vX$ z>gMJ)5)g(E@5{zC7j`q)+S?xziqKy|b_oWRHS6PKHWCUek_c=F{jUgoLo3I3DuUX> z@I;`!x7gcOYUYjanxh)pDcT}3soXUSAW#LxU{KnTO#gS}VCXQh>yxy<7kjuZ+ z$R7<*e0-xqm8Qe~+%zZ$QxuFa{WtPJyY7Pnv zlq#KBcW@?S=8#6Wk<2NZOA`|D*l}hElbD_Qrbwp~L>>V+3NHsQXthC?c`}R=U{884 zEf_dlodxe0H5VX4XbJqnsW0M{_QPMEvK(fPdrzL*g!{2+O2st{!hBQUYvp_zw>L85 z4eEbBvJj{ZK%h^+onEM`$T?;zC6aO7L*JNGyWne7#h zYVu2vA_Q$FDppV>MMB~?-u`-gbpB_Q!$o=7+$0|41pgx#I#$AZr$RZ1>ThBO%+4cU2<6dkV>kHU7zPrLVne_e+h?MmwlvuABZmfzoO<~6) z!XO8<*kJ@p+OK)0&~N+osHQbOMGJ~GvAJ3F-k7}_SPs;=aJOkz=|fS(Rte! 
z!@8r2wj(!w3EEN)Wjk&q)Laa66u2$*Y8@yGz> z3N(9fjdEZuFwR_FX0WJO%sFx>)^v!=t9oN%Kc-iwi7a|y*AaE@klxKb*S{y`Z)bnO zrM$C|>;;+Fvr|@27qp{UuR23xQ~d|)S~TJC>s=#57@ow zv1(hqA{3VdV+%pH4i>SpB(Dj63T%xlL2BW)uKisnEz0~OVyr@`I`9BRghmD)hVhI1 zoex$C#M>;29E)g!v7?(P>6}cMCheyX>B@>)Gl;3B5r=%e3Zky-m{>k5 z1PD_UpzgulB99L)g6tHx!x{47vZJ*%`F_9%{h1bjuo)4Br1f?Z_)3Kl!#e8(6n?Vl zOfp(-qim(`U4^qr){ND>**PT36$7A&TpVuAxBcLpDT$oS{n5x#=-0gn!N<+8YWjU?;Zep;XpbBv)t^Dkm&#}SIFIqO2%x3n<*EMSDct+z z&qkH->y07&=8*O1A)IcgmDMVRxJ#xdUu`3hIg2sP59 zj%c!o5a>M?q;Hg7|PG1m#Z|GgIfa4XYSo;`PP4;%CE ziVT?db>UEHlj;DfLigP_BX8Ja8m2CK2m0a$*u9}W2u5(WRD<<5oJZ(J$aSMfUD}8( ziQgd0Fx<$YfQsz^<4_72V$O@E4!bYprE?^rIh;5vu26;-cQb7eV!0Cmpany15y)mS z=+lpp4OlG<(Ho0N2&9FC?&q@4HpI|DzA^B@BVBI8wKh=EQ1tNJ;}XRH<~p}&{*pTL zp?6;mc0VG?mD}En^Mw-ZM}*vh>#0J`Y$Tp2mM`D8dt8KmjTu~0rOkQG{#OYk5^l#w zJ>eAY?u>f(tkxPt1kBSCPM}2{VAZQX_EUdX2@XW1BU31$nc+D13lXNhi0^nQ%rL>2 zZQ}EO@LXN_9%<5+<@3YbATxp{H>V)8aQG-9<^mW{bOV&3$?147*7>yX?q!0lbR%Qe z9t=uM?Y&}V9P>A}T!=AsUYTL3ujB5!B=UYm-ZbAV7Vy2s!j^d=J}l{j zPZ%Rr;DTqjm~Nk0r06!1G1M_2d}|C48T3AX16XX3Py;O8gji=uPDSCoDUic8_Nj5K zkpM?5E7BaUXwXQ?q)i*D3og`M)-{@j`uv0o}0Xbi4{@;O<`A zO(RRlw&**rrPfJmQ%l>HtCKksYisLLfvW9X5!*=rb_lImruEae$K9+Esset$Cf-iw z1unYe`A?cpdVE}P)X9|+eDYK&z;mf}VYbFj{aeMne6LSNr!9K~7&hH3l?d70^WnN& zx?~>Nd^lM?n2ywYeg?6;H64{))hNf%rLjFIB)$JM?etZ@Lxl{El2T=3$h1{|Q=ssB zK0;RD8t6l}*5UnrjYS!K+8~Dn{e3tl{=%(L(7U5m?e4lNAJ40@qj5bY^9Hx6MqSyT zR9mkzt!u4A57gfItq)CJoi4Eo#q-b63O=^~{@!Bh&tbpt`xFlK#|aFQgj>lZ3=%yR zW;MnB{emCOyqXVUjkB@(SuM5Aq3N)R4>3mmuDYQ<~aUkpY3`ZVi(kY4Zd51z( z2XDT3^q!xI{G~jRLy}(cV2C__8aUThlfQTo*&RG3bWAVXOx6-uf;!m%r!>(JE0Pv%+xx(*G#=9 zo4Y+6aeZDk@|4d+SLy0E!k?YK#fJ0*7;dIyd{|3HeITbu_Dy^u`ey1fVvU47L|W!N z!A?!QQ-%qpdU>OZg}dGzeyR^eHTSXqV|qSDh3JEtYA!_*xtIvAK`jbZ)e3#<(ZAF% zo}9u7O4xN!z}qXrA-v{|PP=BAd|JT=JltEj15?juTn<${n}uzMJ#LhCrP(S0l9PBf$&fl z)`M@$xRUq7S`3H$b({G=w}*F-OLr+&Qhcx!5I zF63m}C2YT)U~(~J&x08;qC%6@WlS=FeN9zpCcYzbzb4StrZ`DK>ZFGz0h+bJZ&p6)FuY z-8An2@I8dgeg?H=2wFVeYrUvy*_pD5Xkg}sS`pC-)oZ2mIfPy8d^}MMH~C<_@c5%= z-lIB49k;$T-jkCgrR! 
zB}-=uEaBYOs-2(xnHK0mQWp7X923%820Zt{Wbz{c#=!yzycwOF5&H4dPghJ-B&t z_f}wNUiBUchKDgk_%g1mn9p>_JFO8U;ny%>6&)7YNg4RwRg;kF5!>Q(*eC@qIs2I@ zoFpI&GIXop+p)q@(Kjh%;j$ymdUu?rA&BP5LI5TI99_2;F!P$Ua^PVHgn2Ekm8NNVE#{t$Gd0koB3Q zbZVf?*y%ery#0B2^|1~U!QA^#Mgl)|%)~JxcIgsA$$tYgfx_YL>c#UfkPC_gU})fS zi_kWrT}CseTG91kPNqQ+8oE<=)_zIw_W;1@syme<$4XwfA$+Z=w`LyQMflor+PqK{7j1O$T?Bw z$E`a5tx3@OcrOOAQ6&|)6c(7$D5B%dxT!(mNham#GVU_EVZ7{lPNcn0h$_vD$!r670}g81W&{QdShzeeQ2GH1iR?u%g=;Y z4&!jiX?TQ1$~{ZL;){8mBgCqasrz|(zCv&G!pzUU^g5gbJJR-7`6SA%+IcyLy6fSz z(zfT5kOvsn#JttBKHRQPThIzyyGwe^PR#lZE4=^y&Xl0EUO=XdW^SZRkCn*4AY`fH z(K~_0kj3U;nr?q~sF_IP%N_XB0Gm2FM)J-V|S6IvR3Ov0o zerDQi`{eMOR8~*PviJ_Yl06Y?s#ChT^b46e?>NW|Wuj;r?J(Eo?ru?S&i%vZt$LE!7 zqU`FYuZpn#6U-c)QQh=XhKN z$TB0@v?!#cwUx+HUUuexjhvHeg?vgU?j!Y)?eVsUe)lw8Dy72xjcxUe>&B|3tclLxXZ=e#A$a~5XK^6cydpMO0|wuqZ-e< z%je!>D~UhCBmHGvWa z4@VCcmte~6c-$frnDjbDQ%zA~g_yXSf{e*n6K_zn#WYy$2tRUNDP0PXu9T(>?z=l2 z3bD-RD}Iym=s~?E!3;;w}Hsbu8z{uXukc7DV0`Fg{91a>ZL*IBP>1Ekv7mj z?UWMxERzYFVHF>#s8?NW-ShEC<+<-xs3}{!AB|h1twvB_ilcHMY?z341~NpEa=9y$976MG7z8PpC)*d~eVYXv&LE6fe1R1f1*I z4g9d47d`2kIPk5hM$IB(0}By)@~!0&9bN04QSjl#Kcd_!S%4|3bB#rVf$knmmlzHL(DaEGOob0{^1M_ z33{Y(&T7(6U;!YdiVe1FkSgwlE5gzVnOgEjTPJd?{3HbP(e2!$`STTR+CkHk?|(D8 zsGdIpT5C*#ugx5D4!+g_LIF8nMuK2B_$c$j@W`Jvb4jDjA0y6!4c6TsL*c@J#C61+ zOmA_)T=ZI|DM^96-Z;kkd`@eudsAlmEpn*_A<`IS&@GR-xhX0X)QDz|(V~s?eOoTC zmG_|f=0S4Y(IoY;F5quat5o?hKZdn}WBg0>GNUO?(}rcuMFJ^Xckn$ors1A)T)I9{ zMdEMtc72k#jePIS>j))y;JFoM@cF2BS|E-09H}}g0u`%5|B+}IdAPwO2x5xLa(x$# zzn>!SK0J(N)yGAOV8y_XufLG193Ms4vXnE`FMN2EU#5Opp?&;MkuSv~r0%%*3rF45 z&}JE5UmPZp!==r|CJC}X0x}riwV1P7=XrSAh@oi2_NFpR zO51zR@DbS`cVdm{OPnVPJm3=txCHh>p8*f?5Z;CenU!3FfPnX9Q9eBb4Ap^zj7Q|Z z0*-D(qhpIJaCpo!M7W(4Wk>7oQ+?%qXDM>CkYA%~NX?OGoAJTjmXsI+qPC`0XpwF} zs_#6_u$`C`O?C*Uk+Sk)hM8j%30N)C+KFeR#7hrcaJHYhM{e zFH2U*KYO^Ku^(Qqs<{2(8FUS#ILnjXSDKI|R zI9&Em^-J;z;rs4d4^knhK7CaDMC4&3x-G%RlkA10HXF3MB@;7-B68dXF$b zCFC?xRY)IQRTJ_Sc&nbMS`6$RsrSW4+(Mvfa&Ge11O;F1;A7y-Pze(HhV2q$IP|57Z|#~NRc7VF1B56+cWf(fdxSMREqI*(yN zvA>DDjqI-~6fWmAH0{TZbB-z#+M_|IMX$l)ZyI(b(hQ!Dw=?Zi z%h0KcLtmk1uZSsjU}%$A(~G%6tPgUb&@!NshGqi}tK&$4-wt9p6Sd`h9D>W5Odxvw zxFbLLEKov)hLqNYDPl6q5Ai??>K;mE^ug znQe4n{e06lO-cO|G4D_P9=eU65_zz(_CyBPFW)K$^vb(4`{(drG&w@vXj)u)!i4x% zD@WF_5BDKKlz6{t7$?=E2ryGvU(D|taJHjxNjrjRlgf`qlmm_6doYtdtwG>N;!HwU zgL7yqQ3cVZ8-q+gTbk4KF_NA|V0uEK^i4JN%>VLsifI91D~%x78?7%j?@nlxyML*T z+-Ziypp@+->_lZ`z3pNIPgO&LUl9R05)&mcpzg41!dY2H@t@+7J~H2bwV#&o_St_7 zcu3p7HsA!ixXk)Q=~eeZ4PQ4hRTK4z8oQ8!E4wfR!d}b3k`to_=ycgXlkdldaclrK zB=Bd$&6eH!cHvL4iNufBhE+49%0Gi*t5W`XoYkA-Ys_%J;-#wK}x}tQDc3kJv z%Ol5*P^2YH*~2toint2%YLL1BCevW(?WJ zsWrMsp1yIbIO#6K)#iMCD%tVtmNz^6bSe*fPlVF@A_`6)MC&a!6oRDGYmCya&2e@& zQg+HpnTkVp`?3~}4~Uj@cD>Qbb^PlO!+ExxawJ@y{B>^DErnkMKMj3n`avp5@lUz= z7sOlH&C*o&ua7=?Ey1i_SSjppIUb{_G4y^23+Y;DhHFmTc7chfs#4erA4`1RTCwUGy8k~n8UTkv5o z_h1y+VK`UAWNX-ku@oG_x2`j^g-o5d|9iE>Wf?dU?k|s~j#cWsbtdih4+G#T33|`& z?8Rc1aCJ;YfHt7@_SnJ(=Qz-slx}&G^#07}=}#2iq4wwcx5s=e%6Wy_ZmydiHM*O- zjg@Lf2izPDD@9MXnAx`6T%8tJ{T|2`e;-XmOY`=gu<$Sxiv(8!!T40OnYp=c9y7JC zN2Jk0>UB1vb}IS^K~7;N@w? 
z#7%;%S7pNA@oSiFMaKvNg(LZpD^QN(^}{o3Fp-Lk6J^s__kAm_4^6w!t>5E%WDWg` zF!lElH`gY)QXT>bKMPY!pXg(puYUI8>G3azmcu@N6Wn%I;*&5;Uu^vT#}^+j+-j!3 zHu2U-Q1uScWL|U>nd0;x+?Dv%_v8Yl>$8$aMfV~9hpx8_ild9Vg<)`afdSa1k#U-P{8y>n}%y{~LbyswYv#O@DGkLu>X%pFv;+b7qV4dNe!(M4CSRuJ$QwU6u@mR(B+9n(NH5w2vUk}j5ELdmy-C27-1HhQ=QT?+0Tt)lS zS4!6zRZTay`q?@gt%Er|2UaE)kIqmBxb_M~_|T}}S{R@8Dm9W+CZ^qYphwm7wvtMI zg!6@g*A2O#)nI{#|3!zxn5WY1FY)32S;fEdEM?YME}#`SOVL)LglNaHJ9wWj_Eh>yG~5 zt?7f3n)+TXT53iTvmT0kbYJt_)d5h;4HJmnTFc^pP%r?h=&t)-XS7moVU$t(-Q>_< zy~^}|^29Lc3#mscKU}El{t5)d<#F@LQriAX)kmRnsBIE((`A253MqFKUTxgXw8;@( z`C`#p_jR_K80bH&IuEnX3GSwwR_RNjcy9KD*CR9nkc{?LUsMgB-ICdEx3*G=&+Yk@ zbH4awn*`g@u6r^bSnN}FnC#cAE6+lSID#CkHkNG8xHiRdbZMrJ@1I4{Tbv%~C^{<9#GYv^-;;|7TwGUu(gjK#5yW{*UTIIEj zAWpj6xT~8nW%0)uHWXE9mx0$5ae>0Y_N$l_ir%Dn8W$@r!|$k;{u3srg63za#m zR7vh<2*H?33pJEd%?9^QHC58xLIDkGdUl(48 z0L)CB-}G8+zMcQ?J&9xd;Q0HwQ8d0o`z<0*00OiC9(@9^LRU$1Wdx;YG!x9m z<}5etj;6up*zPca!GunEm?sCWEsC`rICJ9BM%gwekHi*S4Qfbc));bY37mGy7}LrS zay>KvQ1@0+5sykq_v#}LWT7v67WIsi6OycI1yE8^uUbK%J@c{E>vH7z>AGD?mTt+K z_?@;9hud_fzM2)IX5e!Fqb;3@V(0s#EysHIy@^hB0Iv4|rVKz-nVfrL_0u(?{Hx#u zQAq#WfaQrYyW|5~#p!YH$?~xP+BGU0ay8xyO1v*Y#yD5*uqB%cUJ?`B`oUrkFNuY| zKPM9TOKp&$!MJg%Pb^2tpf?8a-@Zh@HKyh1I;&Q^1}Dd)MZe8UV3jbGkay;S(FIip zB+KFKx^@iEQMNhGVG1v_W`xa3F$l#&o9dPF;VRC9|JH8Z$hk>h?@#v!5?C>AXm`G# zon2dgr7s|y6yfCBm{T~=^l5S3isD~7I2@DX8UOMG?5|}&Ff16fuT5B)M$1) zq9w3c@P!VTOY-wE&R6;g_!FlKHMew#+hSN)>LoBL*Ef%)W}32{4Nx}^FYe1(QVFDM z0=Lm&xKx;sZkQP{VMx)>RXIz9ewVJ)+UD-XH_FXh(ZYwE8Ha6!PMdhmYl2_!@pQb? zE`TQxFEPc(d+m&WV{IME6=Rku&Ym^HIgIj1$2hjiuyY_H&9dl3rUt$CehF!p@lQqn{Opxrke0<#gR!hcZ68)+-HC4qd19(4#%3&d zICV^U+cDYBKuCY&<8Ye(vS3L%rHjS|I3x+;10CPtx)H$3?Ff2(tDbwv2(kqC>0s z#8;tT(5rJiB}iYG)8pM`8PGgNwr8uRINFCF=~%o;=a_BrbU8Tf#%Vb9QER4H z`^e$-UOr2i81~LfE=I8U>&@u~lIs}*B2GoaygZb|czon!rZ~YjuPgO|M~P`G0nww| zWkM2Vv40dzA4tY34D6XuQf@zkk2T7Gy2iSo1GjdUxjtMRlk1sC0d}uK?YV{=*>UEy z(@wD;+Ly|x@9Z{dh$2va_0KWOkf(M^GxIrRUYkBxE9ReN6;v;KM~$LV)x76~z))nq1> z1WK&I@^CVHLl$y?({IfZ&9Xytxe8A@Y57Ec!T-0fmx;mlyTb(N>`l4}{ooN(4N!U) zb(B9nV(+o*q(-7#s*q1DuK#{6j(An#JylS(WgBSO5uh|bb8y|EuE`uM_c|i_M8C=7 zNPjzemgEhYq@gmMi5)8>@3TT=|C6o`W`}*ewp!g+h=01eoHrzw@sr{sj7zU{$VN+* zBwy7GSQl0bYeo%D$4xmLaTi$|@Oz{uADdS>C;ILa^2V|a$ZWP&Y}<9h9O%(`1Wb1J z@?!YHbjij%LR~BbFftuzHW7bGL|*+1P!_xGVM7?p5_S5U8_C9%UNt_b&C}LIAkBlx z*nVm!Uz+AxEe1v5O;<-OocedPJYV^4u0LGEhZde{SDk}9%lu40ASkP!b(jK)BqI4e zL>*u;am`v9?fyw~pM<^%P$H%57QwX=t?bg72UFpI`~zhGI;WEfG>s1-!&3UnRW8oJzbemx}2QE@^iWzMh8U>a%5h!YM=^=NG3^{i-}rJY$VL*5j9vps!I zDS2tF8p;gmS2eBCI_Jo%16N$s1R;xa^%r5G)#M*LxKbX>%3IDq2%}pu4n0ewxZM5Q zM;d!l)HtXPL!-Ew z6#+-b#NAd%gjE>G+j>)<{*p$9Z|E{?;hIDGLRR7IbF~wi{Uegd>D`JkOtuo+2Yw8x z`QCp{osKBrsLsh&}I-5g~N$! 
zxIV*as$>-#W16^0p`*fJ96zo43={-yt->=mhGMi4TbzR3tE7yi+Qi9}kC&-JNH;-D z37}$4*c%=rkjHy*%wjl%$own#h-pACb-Dgg!shY{a@eu2OGDT=(u zM2S%Vx8$0QW0cIto(^V^5AYb*qx++S+2Tf{8uBQYMt*kt#)CivYfrsT*ZouePXsS) zqhzs+KGTj02~92{*l3T$1pETFSt(aAAXXBefMBw*FF?%r`Kc;QIZTBGtaC19`bZd- z`lN{ZEAb-{22BEEM8D-G_E;i;@^Hko%Kqj{KtR-{`#Psi`NDVVEn-mjbDwpNrXWlz zJL8+CMn~}*tLDc{Ba!BDs_SkAP(QNMiW)O((tPMfv>$ct8QuiN#2Fb<*W$*)Bw(Tf z9hekMLoneKDn^jF*2Zs6DIgx&{5^IVp!8Z?nzUQ8A3=^n7+#4Zx8{N9)MUjT?Pvd6gL z>U}oW0%98i1^EKG%>5^0dS|@TWzF2K@_?~po=R~?DF5_o$FLFU9>yA+Yp18tyVEYn zyCCOv_=cePO^HpFBUDbx>W?+v6R_UwScLsv$q9b+ZHa2OvA`sR{gA)RY5*qvQ3J-v z(?Ur;E_`Ryo#8(8h1zXQaCgDM)+p`f3I|&S$J}RtZM_`#p)bY}p-Vb(W44~ekXrLi zI6F>8P7OfG1LKlfuOI|-eF;0ktn2Q#E6gsR4H#QSFnXJ9I>_cd)^s^6kYk%~nD|pw z6eo`r>Sh`5DruS=6uWLy@#L;yZ-&IRkC9&wC>~%LAPiL=I`!9pXTuk2;{92DppYzl zj9_iEhEy!;R-3&aslf3*tm-eV853eTXD7Ylvzq9T>fy2#YLovBl=oMMa&5GjF>_2F zM|u@sS`SHMePTG)<(w`?`SpA#7c#w%t(|uAO%YJAYb|BYc^y48(;x3R##(&-#K|Ah zi?Fp_ZU9wx?I{T^M$n1c*!N|e>c>#i$d ztJ?)kvZmoO-MX)fuBX0r(AaYg(|o-K`>QQ+Vm5uN-PVKh=0kaZ4jfk53Y-%3PvuB8 zaQ*EDI3&HKIM@h&WTA>2pYPBd)0gQg_2FC4NJ4BeAFA()!d4GZ?lp{m%yQ!%y?5mi zqI&PKbzzo|c_sei>84yF-zC@giOIu8C4e$n zRE93d)FFgnr4rnS1ul4Vo4&pzl)PO!(9XUFu8GN9S#MvG^>>|UyZ|w-fw4f~B*&FP ztniS_e27Hp-%E}A_=DZj8^#EDatQYE1%IW!=BBjY{}hT%`4Qkyj=nPVF+S}u3_7+I z7CzWQq;4yj-d$0z~O?EQzH=I%K#R-;%gA~T6baDIcigs@F^lyoOr!0x-!4S%+& zPpSxJ>2M@byE4DPHCR?EbRA3pGSKt0qr5~{fAPCE`tJe>bw;kGgYMkc9~h=ccA-%? z|9pXB4$5sR1bln*m0fHpR1?6;T!37A`2f5Scr+I5*s7!@_v;N>D^SoYlCl^}_t7*4Rzl~}TBoRZ2o|KdO5M*dctZh%$Dptz1 z)LR7egYk4=yrOKK3{aI8dKFG$87_`U98}nu6@aJgwFm~$3dSzjEz>{Pp^U+GOA5m3 z&k(=B2{6n!H=klM$GAHQb7i-V=S8sxnbGwrsvzgBJlInqPmOvI^rOCbBmoGi6FSFv zw=j6Q3+q?-)m^1R#mfq+w%|9QVD72xoE==uR8XGvqw=a4YU_2%D z<5QM|!`L19t&np`r&o|I;IKn~Qy*f*^9sc{faDks81%Fyo(i-r2f@Vkn6O)5 z{$ihG1yGLehzMIvKelwP#wM=A^S2d-H{@E4RaEVt=BCyC=iD0mgbtM=STr!xt3)9D zDtf~y1hL*TLehoZ8|=p-ku#R-yNvR7reT!Th$(r6&*Fx^<2hsCp#CCjfyWQsMx=K5 z?d*9N7u`9zzBTU0&XGk9h?C2gKs(Gh~$A!0cxp)XZo%>jm<{b zCOnia?VO5XP83R;*&(T)(TCE-*<2#2hqL4%G-PI0-G*Y$s7vqgg9HLp55I|w8i1A> z(Ap4!f84T`$gS8x_Y2W1t-y&?H!N}s{d?>-CAh_zacL%|qglF*MD5LW$q?7@=6ozV z)}P_Z07by!oTK$@D0=Y?prn z1fuZ_HeN>-BJcr-aqFXTp-}UNsiH#8%CZ9E`A?DvQ@vk-5x3SCr~3j_@r(qkGtmFQx??Vq!`7gFS_nAw@q&FKF*KA8VVNkt zkojl#BHG;1uC)ZZf(fO!xN?v~j^#QX^qU+Bd3CPI;Rq5U%q;Up5i$gTwK`gYMKz3XZN21(FA!m%S}nI){K( zfy4+}dpGFD3|qm?Pow%Yk$vCM*kQwXO*T_S>Vr50Q76k7&@8qM_zS={66o}~9oz#w z)YBjg{sQIYRt-2t*lu6~Y!uX-x37S7<#)R2iEs3Zy7^KJ$yd@Gcko7lNbc$j+(CzB zfRZB;Eb;8yJrM`eOmq&u5Iqb0@>`bD4vx|*)|R*1HV;a^D0~0z%^#Fn(e@OQu1cnG zFHfHxSq9Lml+*s}ds>%t@YI${8y&Q`CaQZ0AUR?aLl8jYA;ZJzJl#c9>`DC)JR{K? 
[... base85-encoded binary image data omitted ...]

literal 0
HcmV?d00001

diff --git a/site/img/shared.png b/site/img/shared.png
new file mode 100644
index 0000000000000000000000000000000000000000..7869daddefe9f5476f678395e3aae91442ffddb6
GIT binary patch
literal 37973
[... base85-encoded binary image data omitted ...]

literal 0
HcmV?d00001

diff --git a/site/img/shared2.png b/site/img/shared2.png
new file mode 100644
index 0000000000000000000000000000000000000000..a1e04999eb3fd3bd7c350a0850659740d4a8cd03
GIT binary patch
literal 23204
zGebAug>gUc`yR*l@B98e9)~b#&b`jH4OLN=xs6YbkA;PGTTWI|4GRk= z9t#V*_{KHh$<4vGr@$|47d06PtfF4ZwX0_@WOZDyun0)5{$gXrC*A`Z3|nhxyJ{;b z3Yt3Fvp+X;G%;uQw0{A##=;Wz6a+rno4Y=z@wB&da1rzrq5aiD5cqucn1hz)R})uT z5n62}6`E&`&gL{c?A+{6Xd(DCG&I7_W)^~KlG1;~fo~$TR<5ou1UWc7JUrMvxY!+? zEjc&^1OzyqJmq-$lnrRX=Hlhx`rMPv!G-R3Cx82qGkgy zTH31@{rl(lI9;tR{_`dWmp{`2CdhI1FC3igPdNVV8xR$~dMfzL(f)_6}M-Co7Y(bW;qg0r=$oP(>mGa&5x{AzTNr+=jX-&_3OzM#(5=D@6fNpt>@ z{-4kOv=`>Mn)v@v#BVMCdJ3o+f-lVR?~+0Aix-|wVPT13$w`W9cw%o&T~E{)9`C?d z64ZO(Vq23+f>c2&rvIp#B0se5v=?!fNGcVSjNbD>jrnlA#ltHhfXkNji+elP?p`2j zzdt3I^yo^8o5Aezmw9X_`fMNG5we8VT0Fuf62rp&?~niL;J^Fezh>}XSMdM4C%lm^ zrhJb;DvszILs0z#J(A9>-P^m>cSI|-a{L*0#&7gx)7v4!(ce_=;&zdJW4kmrl=_cv zO7X_g;LcxZul0lFa&;?@Q+@QJop#cqqTkQYzB>VH7CQu9y5M#xfH~)8b zLF+eF`U{dNP!j)ko+Ac`;fd9~``ZtViU>AK{IJV{!S8;rItSkX#IWAR692nX@)*UW z%Df%w2>!L66dXVdjD>mnpZD+M=2&Uhh!Gq00tKHwW_N`-Qydp|8ANl zaYA+L=6-0%!FC@p>pM^*=#cUF#O0vHU}rN+6_3|c4u=qQ_%Z5vL6d;c%7p*aXq6I3 znzkm{LX_j}7%K@auZ8Y=BmUE+}k!$OxcP1b2g?QJIohYSAz4RTKll3`f!@18n_!^p`q5;rK8c{4yR?U-9D}PF_M>>SFek#d+Id)2G6~q2I5oIQ^GTGF@GfC<9~M2)s{7Za`bbt z&({*Dl#QMk9TK5V7$UNGV;5%{T(g{KK5W?#x_Vu4Of+mc7RwlYh@L%@z?)xlF=ta>r^9ZNO6!8?8y6^4c zRo3f@^$n~nl=BVMGzjjSXXD;{f$1iwm8%W@;BFGQ9A3)tAPQ zJ`q%3q#|mp2<+Br%j_o?wWs#KDug|qyG$jym(LmR~XZxh|$HkXwmV^!RM>5dJvDLMW$OUjOpawi}SDXUfNNNep zzsFxsdngH2Xr6&4_->%8(LQq9S)>1PjYZF%fY9+i&Tb#+O>Ep!b-+cq8+!B#D&QTc zdbkCxV})r++s&W7of5A3k(D$2gx4!RhT68keLEjvvc1&rEpOX`l^(}cnfSG}&K!p6KE2#=4%!-F z?r?|w6y%?I9l>_1-8v?g1DUDS1mi@TmswaJ{R7ON=ejmGgIKiu@#S++RBfVs`=I^ z@u+)-u*5e|Dk*)0(z4jnzV#^N>vtlxN$M$e2pAV*LqO!L+-`TPM!Veew<}aRSH3lA zna&auSLrwgp@>`VJ~k`DVvCRSu@~UN&7+@CnyUV!v8* zF`9ZjmKnyF=@q%P{{+2{kAcPMoTa?;H|TvM1sv13bEW|FZ-}Mo1Y&N~t=_r6p*qNp z81Tnzu=riczcKg!vu=p;CJX70e@1!pJJ82!Y&#RS7cO=!$+d@F_KR0)Q6|3&c6#@P zKscY0k%c|Nr`Jr|8oNQAMJXChb*#*c_)OZfi3D)aI7B{%bSbnI8~+jlSd0phOLiwA;+2F(c=-#$8>>W zOIN|ZH?$tmT>}m^Q!OEQI$T+v5W7)}4)k>}{mZAj*tjh;@2Ou)s9eJ(`X(Rb;x;nF z`kJNPEK}?xv7FG2->^jU1Qhp?Hc;#yu2ui7ubL^hZkS~^p#--An77r@wM=yg>_*;0 zR(6%|w;ef|@8jHfr5b3o{k)Fl)6JkwVo&Yt^@Vfg?qqf4S}qX-dc#z-t<)9J$$KO= z>J|NE>^?Bn$3&|Bkm^N}wlH`czM3cQL@pDAx2&-R3BFqPoR7fv(4051#3hmqY>MpK zWpA^;`7PnGEyL{SqH^1-c=NBd#9}mhcsW)U8__Fu#5ZD_hFO#yLy&{A_-_J{E;UIl z$lJeqiUIUyQ-Wat{6`snWfyjl?>>m3$Iu8W;-*Ww5&SraUCoayQo%) zxqC9iG8_vB14@zNN{)Znm+wRcxZzH=#fUBE0?Yix2T|#n6~Gv>4t9<`TK<6tuqYkPJ0UJwOl!#x={uP`pF8@h}$BM`V57i&KL>__H9pd^m$iA=p8Is zxY{B0g^URx7E834`PVpv)?aPfaw!^``fif(tIKHTc|Mwxna8rGd(qEceenIcoI@tl zvlRf4*_|pmcX>wlJO`z02)~2TN0{&6FJ;y)O(=*;Z=M?w6gc=t+}`|XDzo>kqQnJ% z!Shk61dRhue+7>WQ%5YQpzuE5gfo{762_sv$Lhghcx6g?S5OjPd+o~x#K2{11y45v z-EEcm)MTdDWNOK^E=EUd86!C31FB+)_+IJ)9bTI*`o~96Q4^b=9oW{%!18>r&7$wM zYSS%KB4;Mg)A=UU$}79IKLG%%aVZqpnA7C)#GUnzOe!LK0LtXutKA2yQCqGdJ#3Rr zLnegJV&6_3x}_Fso)7PSvFH;PI-SBfwMFSu{CLd^=EOif-qveIDRZl@a*^PPC;RcToXC7W z37o~}5D0fG2i>>ZD_bKRItLk3)T1bCb-`_Y*)p4(9{8JaP9fr%&M&IcL^js@K5(5b zk29WRVydd}Hw`tP5dNg?xTEE{BRs*^dz0n8e^gEFu=mM!9|eq7Zo@gttLn3la=aZ^48~bz>vc^n$o#JRqHS9vRBf00IY47Smn_ux zM9Qif7%OcPY$GELF3*W0QSmt*=8Xwyr@NoYv30rC7m|n4d;3LCgfy{*caCv{F6(i+ z`>8)qhp4hHuOsy@eQ1~_q2#0E$_Dc#otNG}E0rR)GMDIWR+hBWC!_*$r4Ns_*^)Ce zw^L>omoSU-alDVErsjo-r1;9G9)U2^7ya3qjI)q zT;kO2le?iIZX9uNUN!F@()V`xpGU9wA6=>sj(!)0a*EzL{(y(0Nt)B?p`_H9kf*Ob zt-=OG&S4V;1fScFmde2)zXzv1KQoH_BP6em6Qt(vd^ghN#6@#ouCBdjbFt$qZ1Z|q zGaFk((L4;GuJ(D2qMB-L4|#t#iPrex@Ug~gi`N!}>lF~M#K5=@h8jAUeeIQ~%jEK^ zn{>KNQs+zJEv4=NN8}WS86KW(Dv7vc*+uU+Yv;wDtYl1R}00En&Su{lX$4lVs(R`-SHn0PTpXjho=jse$vl}FO{nu%u=VG%62nv#plyU<13Vxx5vC$x$%h> z9b7q;kFMZ}WLa|K)ij({-Z~L&%jjwh5h2d0{xzQCDUDq&x;Le`dDD+IwUq5RV8+Of zkpf%7#~xL$8deVl4;?nghDt)mpIu*`^!f^CiJ2a_H#u;>)=}N98mS^RI4Y^I!hJTT 
z>$^R8h_kM4wE(hYnpU0}l(p_89O*}@@2M3hs>*Xn?GrwDjrx@L3G!ivC-=f(UMTYn zio+JabZRz~PoecxL8uylWDSm+b}a+b-MqLTX=Yt?>uX0DOYAbWQ)x5}n?^r=G2oOG zr!U)a*t~BU8ja9p|4KR+&?HcD*|@U zm@jo@ah2`zj_$$~nMlfZJ>x8u3`E60uKtmk$iNn#nPN7`fbjC4F8;W{Q?_;{Fj@}g z*P}n2REMMaf2TZ*kdGhM93KbvDY=ewqg4(#MkUg80u4X1&w?2d@$Gj-z=M|twcpim6?(!B zu97SBi~v?Sd)x_Nh>xtZ|Aaxcd^0?CI2FVvv{Qn44GPAb7GlR<0=b$;T7jfJttTIa zu0A%6j_hr_7?FI`sA)2o@GYCq9x$?C1+c?=e!$E^%jmBFO+-_)prTP5q*Fj&mug8R zx-wz@bvX-9h!j@PyDN;!8T*k$oW>R}&DM6BF34LazZetBH*QN`n@e`2Eyb^yTH6@)RMPl!?}B<;=6z=fhCe(%x>FAho&Rv(>y<=GxV%%$e3{Cp z*Al-nRG(PTGWLG*C9;^j7F2?g!e^oZ;p5n14xb^eo<6z`$;iq(wpYXYd z`Y+Qw^-9DrIVSI40(l?+2ssCh8O;)Cj;wOLFmieSDH#Vs;P;6(0hoLr>RuZszM41w zkXp%cFuw?rDpMNJraaiHu&|$<+41I%&iBXH)mb!i_`LC4GOlEVp)07WxJQ|>fdTH< znYOjQYd&d}xp75|(2(Hy#uUxS_{`U!R~o`UJ;2e+|J=x0$0{ zuj);gfA~BsWJYUWLP`ej_ql-pw@!0VkI#}70z0D8DlNfDA5_j?H2Mj-!#6oVAvhOc zJ}H+yckGb&^imyDXc9l~EyeLiQ~}R7sdn!8>r&az|aiCZTPJ{0|sPO$% zH1-+feE=C=AxPqe_|EpOt%OR1B-xxY)$Ft+^{X>h`q_GFAC?CiN6wNcW@VKl>VN)_ zN_lLv{~Yez^Ci9;z-^p z!i40v3V^f-!BLYx^{l8Srf}Xj1d2sNZ!`;{a?YS7(Tm%4Gc*Ouu*!Cltrk2pEVze6 zKPQJV@xtE!K5NOs17QxnO^qJ5y)fS#emU`tD58&_&dA@XB_?Jp(CD31; z1c?##n}s$Rt@zw32OUh)Dvvd8X>R7DNzJc}1ITfvl2~Qjr9$4zLf`y{koiFEND@cQ z1h+`D63W1^@oDiSCc;}>J|dCo`o4SyjG{P4GA;MQ1gNFC3h9*>+R8Cq%( zs&YiT4&j|3)0|On9 zxX10D2V{AVrsI`>XGkQ{$Oq9fiaZx{N=wWAqkAkIP0*O)sZqv}_#6qjugTXKe6?|$ z$<<5&_=q{`Zahpd|B>PIbx{1>kTRNjjRFeLAZUi=BG(G}Xa59>CF1%@1VimU7okJ% zK;}}DUY3_MxlJs7C2UEaR>iI#kOX{qI`6r{+*f>mb^XHC)PE&$@O)neLtW7~38%?< z^)UhPc6_xDfylz+d=pd}^|Ry4bj>`#AEe8%+(kd9gY+Jlwh6Zxz_IcFY`hy8Voi(? z$~;f;5^YS~JW&Yr)3*a|ZL9_d>nS3Mqt7jI@FWTCN;9&~OD2AKOdf!UnA@nq8ciW$ zzm-Sx7^i`|*9eSa^$m?Z1=q6r;z(6yS-Jw=4ZVocjN88Ty!xYNhA@@Y`+3z682gOY znN+{ZznXtfa5UDl3z}i_Rm_HFjA?za&n;Bux3jf9fW`Sf#Mn`;7pM202EcXTeIQHb zf#(A;aJj#$Hh^0>AdQg4dXeP6cehrk-@?kqNW(-Ue5$-*~d3Vz{eX z2a9hf1GUHTTa%O)nB2e16@$QGRs8>~$9U)$6sfl`Qm8 zD1SL29={l-yhT(9$rPdS$|1g&+!K zEy5CDO%}|)?QVkEg_=jUphlJkb^V-gL(j1UdH!ycM+!}h&PT8AAfFuL5+YXTNFW28 zBKZje2{nc|XCUEfI6&t6So$8ZeGSqsk zCY;W%qN+fA;97&fYDz$&e)o9eer^|4Z<105VOFmnGeYV;4}QllM8^0Dow1;r_6OY z<&pFMw2{C^L9F1+OpA%tsapFHDHjo5`yVOd`QH2M6KwiAvLMt z?qZ1!{UYrw{H^)K#Ke#Zh!H9-zixT8-_rl|Rse!ol$D`wzUKK3e770DY76;7y^1P} zN#!b%1bqR>%69~GaX^OcC0<qyM=;-rg`Qv!tHIjZLdSyI&)^xS}V4 z&uBPHDWT;KmEdBf-FT^1zE}4)wq%6Jr&jiPev=mJho25x3NrG^T2mx_Scv* zRk_;5fDN^rCh%-$%fG$?wAiSI?Tpp!)R;EIJpKa>1p~5*^@^ds{K*R2sv4kUFQCW# zQ9%1=%1MGcla=;i&kFq?incA1L8GAwB0la!d9NluF{{%?F+vSfwF_6O%Pe~GvgIqQ zmX${ra5W`6zWwa5QdU#c@MLBFKY7WV)Ibk=$9(o#`{9W(D4X3Rn?a4MdIf)x8hxp8 zf_-J~WV%eG;8d+g6-OScGCLc_EfDF(y(4JS|MHPscLM>$#HL@z+n5k`=+KH5fvN_M<&NG zBfknQw$yz$HcO^a)%D&0qezI;H&#T*FHflgr0UB)zhnfI%!rK$EURi4d5|(vJf%_c zQxe<>_Sr0vImiCz+9IlMw^tnIkr*Mx%S?pkOt-<5tENkjza^&A@;$?J;JGY+?;Pc! z&WEogkQ1Clz?TXUijU=+_vhZogT*MB@C zv>!qmiB7st+AAqsWUR!guFDe~Y5(XGz5F@twUWextUs9OvL4J@${%)Ftn=D6RAE3e zS!fAzxl8`916eN4MQwUxyT*g;k$gHHImVT@bP+j?eUQ^dpQVA=(}(6&flx7WUc1lr z$IA)bhW^J(zURbG#ZASje|Y{}d9={Y@c}r}5CFMmi)N{15b@jeK#qs5@OgbU3XQ{~ zg-FL2L#i@M4kc}*PQti4=JA0RC6l$N2HR(E?ii$w^iulyLo_pY$&*%m;@pc834bHq z8yW^FNj2hcPSWONsQvcfjt$4s-LQ{DIW>W#-y(ko0GY5t_F@fF!R*y^&zCK+m9&3q z0K{T$=J=1JS=@Zs8y7z+Xypo(uk#R02Z`Osq4q_Ko4uiO*G_wQufbv^kT-zJ9@GAOJ2F~umY`ObZJ~4ByaC-p(9IS)lKSo&cUtQ8 zrR~^44x9T0Z(m!0zI{VUw?4%iN9(JhtvIt&8r34B~HM z*w>6LXQh@t5#@vi)^ww^Hh#tj6GFZ)jB6a%>pJ!ih1if2*wmb7Ou50lhk1OL&vh)5 zG}pZvf$?TZ^@NWfXYPO;_x80!Pgc_fz0b;cx3$l3pujrY1Jp#;R%ZI_jYraAY%W5! 
zO5Za$ULDT{6YBc=H=G~$1yrgl3$!JBuQTfv8;gfVQF2?UsNAn#N%r}U3Z5=CR~Que z671u8+UH*@$yRMv6Ht@%;w99ASDiJ^z@>+0q|6ekPCrpO=}?<;H{ZQtcHcBYpj{bq zyg2K%7Z7Xcox8s_*ipW0>KawQ)5Y5CH_BsM|0 z>km9r)*dX9PLcXWiZ&ch`&4ce)n2=EQG+ll$Xv-EF&$jTt^{0C0)x-^AXL;|&%Pph z3NXVGoj9Rfo{^&3?V7ujB>TZNz4o(BL2jQe-9rVh9sk~FfnspnA$tL(!!q{T04{7K z2n3SUcnHsVd4I7;UR5$0@vVn%Mt|Na@4ZwxjW3bMw3p>#>YE4nhSYox`ajs}yV+|c zHTrK=p1!ky=DmpCI0OuQTE)HZW1S9Gn+fP=8TNQT5 zauWuKdkDgFoXIvxP8+R8s~wu}eQH$CRwkx)yr%@ycaqAG3L_<+5@0O7yf}@$m%jE4 zuf~0g-0NW665wgWg^w4bn!Qdvm7&0H^iqjvyx2#60K7488h%=&>8P1`tmPK|Hrn{WxpQeSjZzGi!Nuvsq2TMlO8SaEEwGaKPi{&eo6AXJ#E#9`AD0 zsHy=;+QDpSWR3)RVl4}*+5e)^%3t(+PaRV!DcX<5LG~a%r3K(nlKx>v@6xBpkL~7V z6ezoz)AC-98U={tFtr!;t$`&gu%w6bwCnO=+_kQ2cXdk54RnOwx&QRh-JU+yi2aSxNa2g41^OKf zwFzrH5G})V(hT)OlLXyt&=4fZ8(wJTO|LTgB3+meh)!>7W`ob`)7FmL7?7XymdZS% zU@w5LU`MD;$!(UlHlfu}g!*QQu#u?c%!34f!1)`Wrbj|_>%7pj&BV9UnJxtSp#W+i zR#n))8}0E%(yzUf49L=5CG2(z1n%NkKTL_K9~>I4`&kQ*iq??BZ|BIX=;cmjxPcl- z680*c5vrBstULW#+oYgiw@y)wGpoL* zhbU^-vRWB#9>;0nRo=*p%}B`Wy~u#G^eN|Vwq^^>p?@yR`XEB?Rj))mOf^4^u40&} z7Q0_jy?qC8q1HIZazb~n1L)S3_{eX()ZA|v&LeL6uwIBib%WeHWD^*}opzFoIl=&@ z=Zjz{0+gXuad(pS^qK3srPt9FtMy~`CJS<=YQgwvGc5g#+g9!M)(@Scg2O2V23(O2 zgML4iHiWf2t&kddtP>F;2ig=oHX15&`*Q3Q4Af+6ovZ_f;^9jDJ8cg6-(5tFklK!( z4Hspx57$)<4I#-x$Y7YKU!65-y57EO%AFO(=H#Ws76;*VS$cP$t3fr)ty8?rNdV^x zEwbRsU*smj@4isPCAyIwh%_HlijG*bq1Wd$@A~vA8na!Cx}&=YU`31dYwX0+r3dX4 zj-N^2AxAJi`|IF3=!Q#-XJeJhNh=*}!5R0dIFS6j z!opARb7BR*;4Q%=f4&^(SjZ*dn>S@pP#_>#_&i0yOi<~Mw_x>bB zebz{qaaunk!uKvl&&uY&^~tvCR5Sg-eyF5P$-6|CUV*krBcH7b;{=u$#MFH{x^SxW z`=>Q5@x+?jCVbI=W8yfHe1OQwAe}P)nTzO%vrN9PN&wg+wZa-ns7Vw{5rvZc02?J; zeEoyXWs)$B@$eiRYQ-lzCD}I7w-~67{n|yMV;s>fHJ8@cyq{zZcI34y--1suKIrp% zmow6{r>CMk@+Ls#-X*pQz^(tPN0hA-6Jl%J{lcQp%1!4fwV9C=ufhMo|7gxwYyUH+ zY1j95)>QI0_b~EmZY=TZ7k3!;tvb_Fb}~X4n$4)(qGI{8`c@C{bvNJnAkD0W7p1|q zV|+21{qLq2PlVl{zIWV(9{>wS5gOBDIYI5U_<`7s`nn@?p@g=o)Zh%i>*|jb=fX&- zC+X9X`C4l<&`TrhW75VPM*nGlvh@6-8WwjK{p3@(J)Q0N)upP$y)0*=H}L6-)=P@z z8^7~suMAUiW6rC^3KLy;C+*7G@`?-g(9oC)I46RMs5a z?GvSbT()Qqm)U`Dl-Dw)@DW->-RYEQZE_Bu!+S|Sa5DqEgD6?Eat$u zkr5quxxJdr-nXFfK|tQ1Kp;ke3#=r+XK`mR0kIopn_i4&5YHFW*|x~S z7Kc(*Tlp1ssb)|p9Hxp~w@uu3jduOS5(AlbE;mVRVblj;Hdc@cz>u_ogvGIf9^sew zm(;aX7h32a4t}ZM(!kzIS;=<`YpbZ?@GvNJ(#iHY+-|OUPbK8et*H^EUAHOti9s%I z2G43BL+)brP{}BZv>RtCH79xGL_b%4NwSCZQgQT~9&-naze+(DJAE$@Z1d=VV~(Ir z-F0l-yANj-e!8l;9{zBauP<-y<;|OQ9l0yTw!dprC{V3iHnm&9tt_G`yDR9n4s!kK z1FL&;0F|sB5QG}Non~A|l5D?t{!6EAI*2V zHjk}HdfYjFyvoqDpD1VE1aL*s16yked8&#}%Qmm;qQQBs4z82ZRCL^(9g;Q}Z*@Mm zxYYYowxIl>LW2}He*5j8)yLjYk+_0SgE%6+BV(O-d9u7e3iCk1YudeDd2_LeF#>{gXxlZ`hpN6x7zAiSNdT4&x znuDW2P$u8CRJ>oOS|T!FdpSLSL;_S5MFm!AJ;*@gzn|GXxg7!voYpMRe8h5$l0cLX znHe0pa3~bI^(#hRs{btb+~A;xj%V?!@O4(_*1+Ct&Zy4AEm93}4}SoYc{VxP+Dhtu zLDp6cU*eqfL(t{-QY9wwfYX$*UWY1yARt8@g0bnasQ^$G9tx{!24Pu z9vE+rqZv>k=()mlo3K*<6KuTQ?el8*gPY%Kvvf{~UX_83SS(oTcszXmFhoFHyOAlz z7rpM>_NWcFEfZb@FJ?epqek-c2jK}zc|~fx@?wgDKB$qvM74;p11V*;zb~Z$C?3ZO zAq7la3uErjv@!7ycRE=$kdMBk7&f12IH|piBtdR^Xh~PG*3o-^n!hD=PcKdLS7GEY zk|Qn9gX>f>wh9axL8-RUghX~I#x52jrwgYZ-o)F^H9ae`N_u75wn22Me~7mz@fQdZDS>;T5L3gsIxezq!*uu}mP-ZU|$pGvQXoIEKtXo|33>CB2HIS8C%Q z@N@l%0K0+X-V1I#@=eT#((HsvLQ<^={-brKhd-1>kLTx&ShH9iH0-j!mAf`7uU3MR zKM{(2VF>Af4yF28KQf}Vem$=cLZ~n%`7b;07>pKLOiV#z;JaW#rUR8lrl*{@f5>^! 
zPmr-gHFtDQAh#aA14`t+wvq+RD9nX5pGuZIhcAFHbm;!7?KQ#B$nkN_hdyRfRznTS zteo_3iy8`yhr#ndfMh>2%U4;cEUne_){8hneay!e#`hxm{IYWh@xM!N_BIe^<*xoS z8y(Y42x4T3XDX#X<{uoDa4z*1foLXzKJGk9zo#^sgnHwhnEeGdhv)?KzXAHAJhQAO zUctf*kUH2Dbg5AeGt*noBRSY!k9a7ENoMIr!CiIrxD0(0SkiaF7jb)iSpm}LW%7R+ z4uBoOStMg>X8~BB{4w1RJ!=L@bx>D)y&c=9Ppdp z$!alo0`9>zAHGc)d{j)GOiBJr2e~rQ&St&d9l&7XN-Hgd_0NBcd2f^j#6Y!-pphY$v6=Xh$DRJB~P{bBFKXzt+D zM|UW}@54F3maG#jvgQPPg9_WjUp5UScIC5>h4(qEJWk%s6SO^U3jq6p{&1E=IcZ{+ zTLB+hM*a7ShFQUSOtN4C&(V!Td}`hI>}w#fYsO2bis)LCsfx(}P!w2{g!&%;u0U?;z$AY76`w3TDi#d$a}%Oww4K1eJjuJhRfG z|I)1^1Vgb+{rEinEvaANTsE;OqU+?!1BPpzDT$LIh~_^b_#%fM1uO>3BK$#HOjPlcH90#9ks|fEX!vqFS~QSnpQT zU^nL427|kRw*F31g7Jq6{Yh9$#|?HB;kQX=+}SCX19EmD^~&D;7daFOw2+$K~_9)D;i^ze*F-a8EXvJXq@5)=1i{@ewb?GE z;!TYBN3Vh?z?00fb3~r%ZC_|8rR1!bi!`?(!oi?B*wlPJS7jlDlV4Q&RsVRw6bWnu z){B=djJTd2aZn<3C@_>j(1Rd4jZY}>@f@>Si~PK)e?7gQaZ7|)`!F$@eea6>t26g| zH4l<<`CA@gf=#^|@V@2x1&3B$4*Djx8z=*P6DYml0*$*;ANjd90Hv}GeJ16S6?=$%J0e+3cNZV6TY^3fIlMyL(rm8G2i;v%*=K|wGYh}q<^_l*=Io(EW1~{;xx~;!D21T(MwW^_^e$G z2UNuOVBW=raY{I4dZ7MTqb~EP?By#l-5WogqcXaOwqBjm$_awr zm4q@Gsyxm$_ApeY4jR=Q`hZV;Ajaa1?mFZ&HhUM?r*bIz?#-&uQsjLX zEo*<;6XjHP&0d~8jN&8QQYB;K%A^;hiAWBfVI3z~(nMQxCA131f!v^rtYk^OnsZ)D$AeL&R_t#i zSJ>c+l@ zp$cxA92YYz>MF4SdIp}3R=!gBM+UJli-8;gBA{+fr^#{wlEN=jLeuMIspekh_LPiy z8}`KdVN}uMm6u}V0N*nehY>dcwK0dreGfO z_OX+EMDqRX=Jq|`zeFVB0jWKp1i$=?A_%y{@aTt()2FoSFg2UlCa4>sEg510e2b{-{fm3-=E~QHKv~NRm41C zg&`|)8#zj#?Mh{}MwjiRj}0`>bjJ{`4wJ1hNPW|cetah^tb6)b{}(_R89SkmDzniJ z-bL24leEoeQ30#(MADFjUUl=5A+UM58=qK;yGORR=3EZT8mD(+!jBv#q)*Rw(GBi9 z{*x10j3UOi1hk}6qBRTg6(J|}4c{B$xx2T$wz&R-WB?a%q`00T z_fBUnNE3AuEcSytAXAfLGlbiwB7&9WFI(Q$^Ljg@BmI@cQ{8N)#vTa!6rZovjBX0M z1>M@LkE=i3=Ww5UM)&g2dn3mq%++pq%s)}E@46Cy+-!~9>z_J<7xr-i(qOzuf?_F+*3!()59&BEnV(%xk0kA0Hd=O2$ks9%FQ1? zd-;uhCNsRA2|ZVVPI`3o@*c#_Yz1)mf$L%~%`2M31y6x%Q5F>vCeRCf^UL6t(Gkj+ zv1cEAcnQAi&F9<3^%!XEt@b7G+*%n;x+$%8kJ3Du6(@1DEGspONeXPK_?Y3OSdXc7` zRDmBViW#fm@sk$4Ji6w%X*CM&L?^)d57%mOF9GY=}SQ84(ufq!522b2ajkUu+pQXMF%i@O zqHNzzQ@V{FRB=4NhVV2Ki+bE`hhK9 z@)w^?hb732V)WRv>vj%?s(u^YMlb#9_g#J*B#eKkwYP>@Q}%sx8Vuf=9^UD!Sy{5^ z3LB?+k*T~RBBfg9$(!i3-plE3uhjTP_-0TzZm)prb-ns5m$=v364iAb3%Ee z`9GXpoC`w8sT&gcyOWPsJMNwu?u;?g`}?nd#kjYX)Z&JdO`mFJYbARHcMgDU?dlrR zg@rYjeKIB6&)z=;CeEB5i3;-gS!a)qP45#tWPekJ@|&49gyn`k=+xu&aTw|H&GJr~ z@Y3SQUR(-T$u~-`y%sr5yA}}Q-B!wwE7L*RJP93l6cTcJft)YU<+C?7f+lO8aY2dR z60a7nG$i8KTC}G;I${Rk>^=UevO`q#jzyBAr+-60`CN)-l(%^B*9{n@1>?+#@)HD& zd%$J#m`1(nMbbj9)A>Z6@kP;`=HOQU?g0V0()8K9rO&f)kfGO-54{%0&pw3_?iK?r zd00eXqz;BXbl@ZG9*3GoYx1OfIR`^8z=@lvXM0%J#*D?pCggKIspj<&?}?$%t+ud# zkGecRWb5I6JkDCB)%ZmAh8@jp6o^{oOkH)zx1{;b*=M34U^{yB6T?p!Nt~IU_V5PH z@UZVr^L5~I^%2gP@6N-QFZ^n8O`l|+HH@{N^YU3WzneHxlk1p1viNaYiHb1Te0m#u z*Xc|~e?JdXe^(R#k`CO1HcBUtMRWM?dxZ8uzBVdU+#`a1N-_JA7hCjGCpPQjb=Cz(~au6_w3T+)c=k=Y!9Wmgh zn|h`ynu?uL=81%eDMQf+N5Iv0fN9{EM_mE@~rRLd4IHD3;Wa4^ZLbQNA zBEvXo^U!s30QIa>sA&u9y3082aeYrsBeu$Y;*ZdtNC0+g_18$ zK?C9($Q6~>M~bYUh+8AK;0$bqhI+ouI@-ILs{dPnp) zp=$c+R#fq7s_)A|XonX@qJuCIOlaoerRj^}y8d+C>eeVI?d^ty5uW7J!w#eCj_STe z)6xDr4U?o$-l@`O1ce0k46D)k-w)-Tq=n^jfqO{IOVHkC#L%n_qF-dE@tNIP*Kl)d zf>3$Zc=I_QvKY&w;lQn!vcE8UJJb9Feow4Yje-Yd&En^aHq?3Fj}Pu0_y5`IBp@sf zStk$J3C6giC}wm~4jYZx>;b+RH8SqSdeBLAQ^$4O!j$CEF{5SBipxyG%!Gh0ZzpRd zQXAq;(#4bqp~lPrVzK%d$JU zNGC{8mnm^e|JYB@N0Vn&C@L)ZIS5&W+=Vuj$~8XN+H|~b@96K5O3$h_IfHs&_>&73 zhl-}5yNFc_+CNWVW|Ll~uA97hnCV*=g@i2!+%D3u#}qWfivl^Is`r?0IdHfaTNIoY z>~1Lsw8pH(eCQv_5U0{Bx6|C~cuJkXQ$TXog2wPsWKHPWbOuywWWS?W)*R`JqK3rv z`RVZIPWcDk2`Ed;HgU5F`?xQIWW+Yo+cm?pwSA3VAOdcNc$Ro}TNKvm>uI+izG+-O z{j^^bn_h>C_KYqo@I6vyR>-mv`g2``M<-#RwvtsoI@iqm!O~QKU&VzDy?LhCKS}#Z 
za8t0xQ}s(?0Hh5W1s5=j+XYKmGj5p*cITU&0Cnev%xus~dT1bLd}mqjXSyDP@}JVS z{r^upSN;zL`o~v^a>P0+VM$+#+30Yk9Bt7mXYOm*$`q5vk$W7IP1`uy%@yNJhlnA^ zgfZ$RIYtaJXpnm_!_Z(1Gh?4A_O*Y)_lNyuUa$G#^UP;H&+~kq_vd;CGcFxDBzXh{ zN$uJ|OH8CH4SsT01{Tkvn(YPchiN;iZo;O#v4iFpZ~Q?1dH+a~NyXcch}9d-pGcO5KejiQ06F zaaEth?!I32cGloboSbDEyPD}R`bqu;bI#!{^!QE~myH(j`htrG+VF&B&b;Um?Ks|IFsAsVWZ89?LDng!z`S-Ce?)pigQo67rHYojo3?8S>}m}`S=6Pu*Lf1lhc6{}nS~kI{XuB z8e1(!Km#4n&EhDXjN3{Bf*>eW3luz%O8;U<`4@cOVkbd;1!C!prjo5H6$pF!04aux zrT$1fSmImDI{(*Xx8|R}R}oF}v%u(f$(|O<)^`vEweJUF1PSa`a|jT&=W`+GjD^W> zuM=y+24Y0kn8Q{z5^R_+Fvlx31;qL9Rag*%9Q6cZY$H6{^AGX>lAgC+T`o({zF&MC3-#tm48c^Tx z`-USubbYRd*VgL7-<{m>?~tiqVt4s1eeIhKI!U|a3b5I210RWv{_{$MsizZq%Q%_# zqc^k+NFpLCt1oC!NPsJVzUB8j2m43ZbNlP*OMj_COWGx(>TuQa)= zRb6>TdB=&l94~xdS##^-M^`!acA(igdKu59WAb1WwgmF#U%y%gO>%)&vp}m;T#=w? zIrVa8`uwf8FjAn0p!UoskF_BbuAwu}1nM z@Sze$xC^dHdhWx`f5axfG7TYnG{tqf@ z9Zq6T=pTcwjaCE?gQty#dROmytPlK3A-Ks$6)=2HO>2p3ATQ3>JQE)*yq~kyh?}Qo zCcI+o!#Eo@wAh7X9a6CmtLfyj%;A#u>3J;DkA+_-F#)3D6Gcw-j&!xCcwd)|^rW9dhcDzsh0eUHz|~Vg zj7h5IDHlCnrykdJP|F+OhJzHx7`U5qJotq@KMwqwAC(Y4SCc3x9itl&5GPJ_Dy@#w zjCmPG+k)N{P2V3^(wjT|FhVKuYF*9;ELson<`K;s=`2@Mw%MuL@dws%OGqA5yesPV zMS9J>6-u#(P4}|M?cVs{`Joc1`U0s+)WVI!z0vH7vFX!izMSEUfWoUZKc6 zT$mZyB~z9<bzi0^RNYDsfj(R4 z_@E(*WDyxPv|8(89gg${6Y3`1q+Fc!0Gmu|h^ibn)-PUy+*5c$ zn;2R81X;+=U^%-Dj}sK^g0GxS_RN`p(IRu#{qfag{g0K}>=#dS<@#=ts(c;V2G1J0v8Ek-q_ zM5jSWhN3T3p)M|*Hl2*YRZ9!4ROywS?dUYT)iP=+P9uxd4HjPR79-ISh~Y%FJVksc z!|L6`Y9><~R{4OP=w&_C;Al%YCnU>ZH2)?2@knfT%zLYIa2G~d>HKOhi<}6{6K^Cu zx#Ft3q#y0WeqD-0UH*wXV8%T@ykHQN&qfR6f=mx2EV|<-Hy5T^a9mff@!nZGmBWo> z;aQ2tC<=m~EvekOEc`+4AWjaWkz!5f@yV|i`QBwy=gIFNwKQ6qChWd`m%aZDPGn^B z;XBu8SPat}jmI#RSOg_J!`v_r7j`d+!;q8@CQ6=hmPRd*#uvKLFBo?UTu(GM(84z6 zlCSAR>VM&#y2E{nc?~VGAUxZ57FKAzx7JxUqbX!?$f%Zm;8>9M+uJ>gNyw{LOx1sC z!Ok4O6Dnf09!?SExvw0Tg{$bPxNKu=00HaU_KFxUMhaGtvY4*32@OFj=Fyr?#78kp zh9l;nx}3T~KP_$mlU08|W8lR}_APgn8UQIC?`lV!+#S>gsAc3-NFcL2{+b@#6j6OE zjoi6hed6-vk)A4KUzQXN{$YX2@T{yM;W0r?_9JlFsruib-ry}m7oRFT4Q*isT1Kx; z0*;x){u_lqGs6(r5_1?&*up2TV5S1!hd+MB`K-N4o#AT5N%s4NyhJL%Isco;J}tCk zNzyA3c2=fsCrWbn+FH%m9*!!67&+J;8N+wKh+5&K@Ct)8i7XLRF!sMnPPlti)TPjz*5)voT{dtcX8_{)41L3oe-{>6(I2x6jw-(I|URr%rtOyIk> z(2}mC4jt&9mp0!-_+OL^;qF4ez*&l_*t~cFkNW%f<%@(QOlXBkV|is;WhqH^T?-Jc zww{HKJ}nqz39bF&1t*vtx(m{`)g}ak%*<`r!Cb_D)nJG2|1Jg)6aH1j)|87_SxSab zz`|Oe5J<~JOGnK8o{*4`(^}7f{hOfh-={;raSfq@2EgT}_u+*TV*V{Sw8uSWi~BdBkqYi(?4Yiwao_`6+g9Sb{KE@I-}7y9St zU;VT-Hu$eAncMt5Ea(6MzmEXuY3Ts}v<*F#^LHt`fCb3XTHnS7+CDds^RJr!OWA*2 z=U??@j4W&|pjNOp))h0i)whP8Y^(jdciaqrpZ$Mdy>j{2{&>Hj|aznA@8 zpA+zV;Q!4K|FZI5rBE|-zvl$}GiBWGOV=3|U%cRbAtuNt4}N)&3=5W@obqY85T_&} zeaTN~8)lw4FVkrKhT88tKwVb0%(F#9_5gf{QTF`}^^btO&+kOTmx!XmcnQ6j#`asL z@C_Km$m3tY(JVWqMB5AEzx_PRnRV4A(BV>67H-gM|oBC^)+jFcW z6u`N7?6J-x_F9o=8Y4Mp*(t)wuqKzo<5Efbmk-Ky%2Ca)(^OIzyq35ku5-2>(o@5l z-(8SVdwpY2UM*UV_mS}B|GIggivM_f2-i6{B>Lm^zilRgFA(PDh*tr) zl@y;?I7t4jy_X;JXkeI(Oan>hiT>jdYx--WkrbJQgP?!ElxSM@yOFeFnh3LPFVquj z2ca;CbMh6=+V)F_ngR#AodxO_!LOS3Y(JPS@Dp}Wy*Q#Z@8E+mWO^Aovch=x1h)Cx z(lYbVD@-xVRy(x4p@Ff0s;iNmaD=1?f!4yFRx!tn zrFc8`(^@+XkIv>Vkll#dq}0;b6~n-C1GX4jz5<30BHYABk0ek-7=t0V^*X9)z7Vs( z-_F#K8)|dICqtL6rPXDtU72EuTO)0L4d&tMLPF$S9Mvz|?rLrYQHs-WAziZ5uH}z+ z5^0~gP+^-N(w|^fk|SThg+aGgdgL@d=MN*woopWS_blzvi|tasip6HZ+EeyfdT1G9 zr7ZZk6EKGv*o2T$%w@QtP$zCru%Z7=eE9j zCz-!&8xj@LN_SOn&PYpx_xe_|)Zl^!FvqotHMhdVsXN9-Zr=lQ=?vzk)_E<{ z4M}$~9S^MAE-ATtPuM))^v>pry7P7wkB%t^sXFWUR}~HH%6D^ie9fzuCFY|@?W9~S zwaCZj3Eo%@=m~sGr||70$-bGcmP}^>@F#~QLUHAjI)tfsQ|L$z&FC*v+(i3_Bbp7q z&w31k4;vds$n(BDjX%8~?|$wOXry*zG> z+-@J%@Lf^JX6yW1^CY)X4OSOdh>a-i!3pTq$udS0W~*q;Y664Q 
z8d(it-4??<{RjczK3v7A#9%vKP>8YC>Z9Kk5$)%` z%k@cLPdxD`XK(;*KaB2Hw{`N(%S8?>UB&LvjS^Wu_gmKuvw)utfd+nVCy4ZlS`k)w zw9}!NZJm>7_(Xh`;mg8k!=*C114oHxRrG!&=FM-Vj+;@eLL1-pz$3r^%PT=0(~pO1 ziFc$>2Yx}{KRu3MSo4m7>NO-}_>+YhoKqDdUnvAOrZYQTpM-}A=^X7R1M1_PYMX>a zmJSQOkDdN%)`d$)?nXkoLbN&0(Ngix)#F1S2e;*5&x~(obBW&4PvTzKWeATn!WJQz zlY0n?>rUg%)I|%?9@QwD1m6$mi@@BKBD#!*4pV!tx?VNXuF_rgvwo?PSh3WmwGKE7 z{sra@8oKbL_Qlh~*jn|yA?u+T>MmN87*R&INGP)*df$1^W;3m&6A_c4|DE)X>Bx(a zo*?F45wI#KCaw@9PkVSubVQ`Mgl` zR)Vx8uM+e8x)?pw2Ifx4wtDk?1M{6<`u5%?89Y^$PhcXBb@h>;F(1)GL#1yb1Sul- z*=wwgc;j&4XVh#9Nm{pxCDQvmS&>`k>p&3)W^0FzOgb`zr{ruYdfP~{;>uFw?6q`- zpGssK8B0A&jRIrugB=utC)y@#v6Re$@5<4|;~#O}M0q0d{x!0?6KQx$UQ=cvBxuT3?S zy6Dnfu6Vwy)BO~y1oZq`%8zf>LNTjyc)fYBGm799^I59X!NFCb1%e1R`*D?$F+uHe zie@F^WAKPL(&h@ck`7;a1)Q#>?-7H2Gy;0mAkmhq*)z|bNGE2WJlg~qwwtreLimnC zxN`VHee_$zCNE_Og^QZNK4OVFSuyAs^Aw3Tmz4#pMcUnpy>q2#Z(zR_#frF65B1U8 zP1qEzc7OUOU`yMhpdTahSSO^@?R3jCQ9i;U;rM>#h2-~`~<22H&e2dXpTR- zH1cibI;}iE;RT93O@7qnp8ejKA2N2RHj^42>Vv=uTpOOUSKCz+YHIo=!XJ6Kk9+8t zF!1u~dNpY_bJM=OpF)qoE3q&!ez+jL15H12;Ahi1WEz>;MlShwEu&N;e z5cPuP7k_r57OpmmwAs1pP%oXe2^e>MwqyIKsXm@E(R%p~F-n=@R=a{Rs;U{ggzU|m zE@!pztEA}hx)w@y&HwTk-Y*0~Zl&_FlhWq*)wg4YjwUlyLY^n>BSfX=l)H-UFDu9s7log@QF9QZBo#{ZtYs?OV=0cV?&&d z`?>e<6iF6}c8iX5umCZmvY2rNKY~c1SnJN6!DzHFf=4Hb=v}nq&$(%s*Dm!ap@D|i zIowjKt6MNYUb~mMa}n{+r0KfjZ*T0)94clegV4m{0M|nKt}rd6MXk}JO{KOy^EaAX zS$i2$-IVniCvifNr}Ec0r5k-U1J_@-E51gJ=m6z?dx@n&9dP{B;gE)CC|d>kmn!To zqp9Cz2XgNgbcRyjBab`KwyHH9@ih>tSSen0P2QF_;W$)A^UR?h%3)LlT)+bZzb@8n zfEanMjpns)8DQ|!`TQCm+0XAY;kocM8^jdxHBB4K>*M4$K+_qkXc$=Qa={C(8krAs9TrDnOa0T z-k}g)S~euLb#MDFWBfv`(5$C>u%G%)Uf&~g;3VR>ZLV z$;HbZuK4mQaWbE4QhC9d3hZf2)f_)5TeD@LLj70rC^cwAMkY|VDPrxi%N z8LWWkPfM&ostFXyz}Iw^ZLcnfjy>$h8)OTf%}iIQDBt5M7Pl2zD-cwolc0+tV-<(oehfo1My2AYYHb1F_ed^Op(TA^JG^*b3i?{)E-Ozr*Tw zq~sb&7*Z$}K+RLTxq;+K!kP`2^;~xmg--x0@^^Na3-hwX+J(~PJ?^VN5!JtX0{v{2 zMTO>k6Ftfv|B-jE$P@DZypq59#QOH1_spLik4$ptvApmynEz}spZCW-B>8v5=@Ahiqv6xKNDbOZi~6Z+lX+*dEL ziF4a?ZB0VUD)-mnr4Lc8ZF8wTCWkMDa(8P|%GcLhzk*=)At{~+Bu89Tu)U+n<*(V@ z=vBJAw$tRUP}f?2V-jfwzi|zNB;|nr9M>ljXx~5G^RnXp8O}d&ghv8O8lM{DSL(m| z@=rehd&e>WddNAzl@J&4KgP^E_#2|YqKF~>r=>qTtewz9T%Fv=h`#^P6KE%)a~L!p zFW)C+zUAe1R8s!MQvwZ##K`GKsB^SS@}?S9y@HL%WR*p+J45J^IqiuxWb5tMt(C9V z%e3X0!asbqsxtkIk!ABGfB0Hd*)eJD&jvyoFcvf}QhjwtT@1fzxDK8>Tw6^EX1f?( zYvseB_Muo_7N^uzJXiXF2WHF{6jEYYf3oAuNs75SAzFBR*JP8 zbh`pb)nZLHLK=$R_+0t@@Wr(4^@TrKJ7A?=NHFPR?XO{;Xr0H46tUxldZ?qM`lST3 zkm(duk5LRkfqp%VPwS7jr#EZIuA6;3-q0K|C-^7;bYDT|b~hz{tNJ|7!J)ewV%x}M zC{nz6?^M6)Po^e|lVfs@{b>RSao)urt6NA=Dy*Szzp+5#aSaBV#i8w=hr zvEDXToY-N=?Aa1p{*jV$I&7KM5z=3IQd7lo2r&}DvunLcjvj%&~^9VSA@ z%~V-qgB5yC=i^qMQC%f%df83hvH;j<_I|`U|1-CF%FS6@cNvfb1^TXP)w>Se3F>V+ zqTeRVCiA&Y5U`4CT;(7&AYj@Ze;eeRo5pCOW&K)f0*4m==(Bjg;1~sCSd~+zKGV<{UeIif6g(4-0|%lD7p!p$fR>5~RF!zg zpuX7u#4)&-?3awHSsiU`eI5Lc6OwqSDqX$!`~e!}M@v*_*52I`-u>c`RaN#{1UyVb zr*&w~WTb*IVbMXNjvcl+AvkI_x`ktjUJAu=-V)3b$h66*RMg&#)WR!f-dtxN6Nt(grMENvmJV#4RN+*^5f@IOnrc zS2!o?8{N>DZ6TMmOo=(&*!1y!?2PX0fjkZ@rS`$a5v%+xV3k_yn6Su$VyUi-qz#6% z446OgV}uiQe-xOgdtiuD<|Iu%6o}B32o0eFqtvQ61OjLD9?;4ro%{)&9jFoe<1`-~ zbBjYBd&{j_OkDPQnlbfGE{d~wzFFwt|+|FE5(_MCH;%q5CB)l+LKxub#{pegF zXDLu9LARU4VL2L#v}T#!3}4=}IXKkm>1x_JLj5BJ$#@AzV4KThqtVvZV&?tvwd0G# zu{D%byH?dV>6$NDsj}YSYg`dc)H*jzH{BR)!vx(tI(j+5t}HHHuP%iM&{++}>PX)C zS>4d(m?s8?RzC#A{dm_n?o4g{{#-JFxnTJiP6N`NE#*9SSIXm9*>C4}g3w9E!xbb+ zZ`+-0hoH%0@qzaCl*rd7ed>_rn!9>kW})dusz}@)75(g+XjdJZYbZVD;d~7b(#1u4 zpy6`JKpXD;hV?dTeX!tv6Y{RYSo_K21cx$}FXcqRn5yeBEN&^TZXY5( zX<_dA1^f23f;bNr90H^k8KK^xE#x#ZXcQ9~l`=J5lXq}&qIko~tx~tF#o^fQ`{gwV 
zvt0~1F(V3y*(2I>-p3TH$~xUHV?xM-!%wCtdIF`H^u^E4I(ulfx>RSmSdm9VpJ4H= z2j-igHy#JOX}{W?xLWOpFBKt}rDWNkMZcC!)}+*e9f|MKIl}Jdbq<`?fVFR!c`ryZ zM|Tq3wVx>%p|{6;es7c7{d)MHJ20Tt-`}vHa5OdLu;5;5F-%Qx8U2yZHq1z)uO-Xs z@^vWkqFgMV%KXtd`Rie=$tf`BwV^Vn#uO*l6&hrtp{1K-RR%vmLeEm-s!guVTo+rF zUnqIg6!Ua~gI!m_eg-5ReIOO0NOnn4{G+Bs9HG}LgKZ4dLlUJP~URevNGhP zJywE>ht-5_@OYN^9g=$$mQd#=mvwg}0opF(w+m$CCI})Q@>Hd@6;L=II&;DSL z%`1e6-pLph*t2zTizAfXxkKs$GM9b1G|x1zPS7LOe%*i(NWU-_MSwnz0o%#?^!*$+ zvPGjY;n*VK=e`c$M`GaS_W3S8OGYB*FcPin@RaYRVDREnFvn_@-FSFJtOpD|dail0 z>4z4Br@OEHWf!QS`Ae}&Kb<@;o0+o@nT6}uv12AMIwCpMj%fI5*&^IUuKDL%b8cMj zk3-6X4cUOtbWjZK=k)Y2FX}U$Eh0#pszEqtpMLGlwG|bC+u{fZa;B=5XO%Cd2;B^2 zYHMVgCRf3-<V)1;D`qp+=nBKD0Ga>&p^Cg_?TZ1kc8ZrN;A5((=poJ2HD?1#KfDI z{irG$l>^MC;@?9*Ro3!Q121-EgVkM$C#P?*v5ORy<95tPt1}cGg}&yf0X8tHu_+2b zGyT?eI{@}>?lRh87E=7KLxSRVEfIb{sw=zpds8|Kf%s`c;vRUivytd1b9%SrO~Wm8 ztXGz62Y5M}omareV68r1+dHTFgb`iSS@d8Rdi(Dm@b883nF1+!PJk$x=fC)9=(O+9 z4~9#L@>Fp5o|jfwqnj2n~{>Fwu%T*N9|rR0+n6C0Ov} zvmsy<&a4Vb^<}XRpiXs=K9yJeV51pU`8L>9U}S)2K|UUeq?zB@L|`^YbD5h2Y0??u zPG!e9URSxInTl4_<(-JM7iy_lJWHzyI5mOnY@d!hDl7#2lwPR~CaNMat@a*+##G)! zXY2El;hC49WN&}*$nfCBegNQD9M#I)bOYo>* zumh^)gMiLozI^`SoVMv5Q{i&&RL8Mac3)T1*A{)qxT%iN(cu#YcuB!xIR18<#GGRa&h&0`(bw%P)JVf z=j}-*c;Bt;RgjD{Ii_KTsD(!;G=H_b-Losszc9fII^hL8lz*jH*&6&>UtK(Kxah3` zYeObjXDxwBI>A~~`Wgwr#GeojLGz1^y4I~RETWz&{!Iu44?_Zt*YA;RHBVx_{Y*Z#QCfNNKTUy(?Jh9I zLIqq)pMAEq#P)HqTNM=K0T#Nc$ni*xZJ+#+f7UC3MI=w};Pzw>sh4&WJ9-+At__1l z8gw|8KB_+U)n%S;-B}FA0kdY~KZN2D+JwXRW3970*pJOYBxkCe!hn@(!TY~Px7Q1n zQ@WzpUUH+6nl2i6X-*N$uI*Cyu?Dr;G_*k(m#nAejt|2s&5vB43=R0IC}lItJiu1K zPctN>@3Mu2L?4bRdV&L46c)$^C&yn&9o50nG7L{u$i9m(eDD(Bd0H*)Eviby%Q4Lu zukMJm$7rt$A0SW!lqm_4=vjm?u8-1f?+FG!ti0SNx=kd&L$`@KK3KwI?hyQw6YBrw zSFwx`5^xCh?E{%VX^^pt+J@%^r0mO`KKrBl?~U)1%G%ZC?HkwcypshQ0X|MDgnbk8 z4QK`w2t#7nKw0vQqciRwK%!5fcUAh)hlC}&jCUU z9=I+ozrAKGR=nmi0#&!4U*=-oIstbQ?GkKUitjRMDpvS_sz!UL!3J&h7_ z8zj6Z?ws?N;V)xi`po1IUOmxRWyd=dEKPHxB%iD$0L&~WUd_^e##ZYfDNOIvW=K6> z5I!VxvCqbIjn7G<_jI5soh;)ywYBRu-Cp$0)@dCIsKMk;$gvSW6lyBLetSsdC_+V| z3t;fM>bXTVRj*Ru{giTO#n=V8Nj>C_!JYAIt~h1{<-uNV+8lM97v>Os5phSP><7oM zEFV1;TIXo=?OQ)nN5tp4u&chKZ<#Q6kq2f+yQFGP770>i7kMH z=VfIt1T4Q-TI^Q~rrH{;h56K^(N)@1W6!GQg%(`-l*_YCYWwG+e<%EQVy#uQIEM3x za1Z49?2bo@qgNSMRsr7_SEYfz2^+(9F z`u2($Vz0K4rq?{L5U9N8+52PfixpLYSCQVC?5mrp_By}N``eqb3w}0u_cu-w2Q`s) z8EM6z&y%+;C;8qv_o&bFQb#HTyDQuAHd>7=+_zh{xU4JxNV9j+BEu^@dTjnF>PBhyk<34&d^>>OmysWtU{7j#G8UAOFHvnp6YSrijS2;Xo_zO zurpkRD$Xr42TZwgW!2PY8)H=@`x78Gwsy3|n)}2mit%{QECvUvej>5%eiH6Z9Rt}_ zTeAdO(QLnBfm}Fup1kel%f-Y4v&}>PoPgdzXl~QoI`u5c+h+RwQ>6X-f$Osk+Vx!+~P%kT{zr8l6>PP+K0EB)!0F*6RW}re}A$`Hyyyx)xyLwHoGINoV z%aOyLx~=@O26_4R5CFzxZBMG8bATpAfj)CU%Gk$^eME$71xT*Kd;MS8P^jFHNimV$~dPf2^mm%X}8ZXgO15(R3jg7+-U zV)e!O+vgJ}0}hqGRu4Pame_o?Bg$7pmn;|yR4q%vUM9G}qv$$NiZf+Nt<|ZsaG?2^ z=^jJ<+HNF=UF1r}&Bv@Rk{EygHD|?{T*X{Pm8272%&E)%nruB*zy8VND1d8}UnnkDavcO!)|cO5J_a8t3QD)vJW zk!oEo_9h2YK76r%32xHR;cBOJ)XesF;DW*_?`%OD_?L!kO0kh$x%rn9suwRu9W$JT z6mkMho_I3$i2KlzsqTvMuzPYF^J*CRx57-`L9ij%IvnLnpNRF3c9yM-yKq!B+@#E9 z5ZY_>B~pjckAXINYToyJ3QZ6!D5FvHe9$e)p+P0A*OsmxS>kh;nur)b=>sscm?cr% z--VHN7Ztr&ZOm97S|$(VqnV3&zj!86b0n*4|FEKb5Kgary;j`on&PiNF%$3!X`S(- zL19<1&*NS-mn=i7U9Sg%K|kK2;S}`2j{fx#4hQ_k$@#KY6`}}hCD-jKWI7|=MOJ4=D z4;oDnxbL}-%pBg7#LBE==%rJ*BWaB{A35%C?rCckE!3G)Wv$=I-_1I*_9}MoNr8x| z@%I_{8#|k2)@^*eWakT=5~6*N1PTE#6ZW#o3HVwTGDl9vk6Mi_py=@Y=J(@M0OsLEU3^#Trrcj(Tv!dEoyP%Y6 zv!Ni6uQ#9v!70T%t=X`DVYM?y!(C;wZZ$eK&IY)ZNp_)fUC%{I#lH_Dttq9iO$7z+ z*Qi^_ug$U72>y5jT^JD_xS|kP`Qe1&nENJS+vc6a41xen@;i?mk__y~g04Ly*+Qe4 zQ_~ES3Dr#-wmL=m;Kc$#fyB!$T77T^) z9%`9q&3>L|9W>nSlWZppt;<QR 
z#_5jw2f@Z*j@NfiE1Ga$6s!8ql6Nt=?3ymi`*btB9=S;mIjw~xW+iHh2NI||A?1|k z{pFvd`cAh@an|)wsDVcAxYOg9>dyyXyERhHMv9yO>oLuyALe*UMj-6SfD}apBE`^5 zY(Hosa=^z$k=|vpF^G5Z@l1~jUujlxJFs>+h@vc%x**Gj^H51_s{T2A(y@Ul2soGo zaBiBD_zl#NcU?-kH+yY6Ef718BciPXkhHo6yM5LA(1);)XFH2ni-ars06#BM6~DY4 z!1tYL6`kGMS|)!?UW#FZZ1F#FQmxzB5+p$2S>FWEL(b(CipziNgS2ia=X%^qr>hS_ z3G@QfrFKuJ4p5!s1#Wa5;VcURj)42~^>6IC$#@(-BC5$2^j!y-1g!)*k?8K4tk+FA zfp}cR+_*L!gxhTF%*}87+!H7L>^KN9SXh_{c-~H-d5*hj>7l!pU*&e0W){Mx_r}Q@ zY_1(h%_BdLH5l~c0zCz2k6tD37tG(~2`+sBP!1jGeE5!`q*R&A05{7}16;lQ(uG`7 zPLp5Qm2Z)V>4nfRQK=B-^3H$FaWLFFUEbDW>BkXX_}L?;7ebTdjj|ig<1nR#j78l! zJ@g?Yo7~!u;TbwwwC&^hL+y9Jrk4J0w*Z&OLk@#&kuUKyH=N6rpL0F+V3ErTRr~J;hbQKUShp)cDmKj|FfKPx{p{knxG!<8oZTN;#LA+ zmuji%!x{6Aoe(fxToVrND$`MJuGodG$SuqaP|fiG(=x(ex`qqcBJOTgaz!`RPdy~y zy6UAAKPD|EOyu`7G#_bpA7~W!TP62YPit|#!4Y0i+t*MH8ZVbI%f;*A`MxNi*z`-m zf2+`{(7Aaoa7b1`U?O}Jq&4)^SWsQ0ug=NYL1+)8!c!X`knPHc=iDV$sUio~UaRgG z#^CJzW~0J~`Kkt@2g85!EB@rXNQeI9n&(eUQ8z&tz_s^OoGY`-mCzir+|Eujx(sx9 z3_iCT{Su%H!|pKU7Oj05)l#ONv)(51ttzOAzOZ*Wkm4M?DbHYVsTl&4zX7vN_}&dw z*XDQda<)!lRj6uS``D3o>!>SE##B9yweizO6f4<_U~QLO7*6wVjzhuJO>xP;6_*}! zB&}?fRoxLZzdQH8Zb`h*;6a+u%>F;Q4;km*Lh zGjCOgHC+~)R*u#TQss}-&y&UME`h_#i`^=y(PznvEw0`PC)QnE&0Hz5D3+&|zJalZ zD*T|NI{fA1E?I>1dIB*}kIIv*U02)RGitf^P97wdR7*dRo08KzsOi4pW3!0~xU1U< z%n-`3BSXkqy%vO^Oj6_xX4KzqGmNrFmyEsyuiqhp_oo-*%7}ZXVzDjPX-z%SxrkR* zw>&r6-+-;`ana4Cfci}hFvuN}T>SV%EzBg^oIX8( zT~4o?t@SVXgGY@s8+7KMsDo0QtG+c)JzoqBDc6(W0S0Ha`z<~m1XK7{^2hG*R)nt> z+Oe*C)WZDlEvNJ!DuccTeJWOsgY}-ZRg=lD>{5Zro6j z;MXX0bnR>SXFKJt=b~yOhAVW^#yg^N7EsOqo^5J*hhvY0jnMQ-?a1XrrwZGZ%GLrD zV|`C>YVjMNNfc&lUKS5#_``00MfNCv9+Z~xJ$T#l{^x2&upr=*tA;#}Z?53J_)?Tl zBHkzyTQl+9Vfeld+m!!>DURoMi3v&QT=Zf^%>^jnTQh2Sm);ROMk*)kywP+nw z{$1%O)un}^RvdeC>fXIp8nio^>M*yr(@RA%r#-$zy}$v=#M~^2V`{|1e^jEzztM^O zGMy1lU;yF%5sFS8w;wU~?LevuQ;eMNNxr*BHHRDD!j$7(6ShM~sa9&o(!r;Nk-d+S z?AwUy7wbP+)$@)$sjVF=Y|6ebtI{*Ur)>))Tz8E)N)xgH@D_2Z+wNWQTXQPG9fo+N zihOxWFNUBZEsYr{E66eb59AWS z2K2nWH5Ho?E*<7NCdPxrnhWoFm;5tXx>3_$pf8UT$2Xmn1C}GjwV$~wEYVnM~(fFIXXBUYtZ?HHp%*LTZ_x|UwT3|wEVJ>WfbS;jr5{Nq1yvY&?QSc)3 z*~1t2-2S+VJeH&W%?*jkg=p5pi7A_|BD^EeILNDpn@`gvBVP_G>d9d0)}0=~1ny8_ zw^k1&Z4!LhS+05oGkk1b*!3r4XN`WB)C$g(+pU(~Q+}WAU?*UX;;ME5JIV)jk#|^H zG#k*NK00+0Rw!!~h|p!|QwHf3TDfhVJaonqw9k#sfJc)+fY zADB*&1hz-et7)hGthj_5SGdhL+aWkkusZVl5~Wu%DzASQkBPBcA;Ix)6i*Wy@7Zop zM1>BVpl8zkRxzqgt{0e`FDYU|EcP+80G|LYPcr^lG|KOfx1`dDdhXVBiY>VAsw})+ zjG=kQQ%5WY3lE%~`x*C!*{n8a?WRqC$plGJmvtYt>$5S%+1daV@j@l&5F)OtSRk~- zjoS=TBi25XZxHVL3gr>Sppr1bCFTW7>cZYXfds4FZ{5jlY9~IHXXm3fRv-g zr&0C?J5sM&9VC{@T`Of4)uKJ+$=QxsKYTv%u;#rXmt6LI#1i303*9(GgIGUf z#9&*LS2g7!TO6>-Z*c`7Tnia!qRa(`ll}^vDB3*uAWt9KCHp9d1S-(u{HMgWgD9Qr<-0+rz_* z?%5i7j#L7E)?w9V4`y!$nATnfY|tn>TB_AuMA{Rjbdx`0`44QG(sUVldw_xdysPQ9 zU|^t^VeLbu_-z3xzzxutfjwW|MBE;XQn=cNCbT`njO#*==*Cf_6H})sZ80`sLEV-9 zI8W0s1{G_1DCy~vHnR-UPFnAgrEIPxsY5k_KOJnN#@U*T&ttz|Ge&m)OjOCIk+O_I zf*Ulq9u8)`Sa;btqP%z5D5WQl_*X#NDn?8QiTgqk_apHl*Id-~W0TB<{b-_~f{UO# z#-sR!>gd-Dh`$>#yP*7{U%Ox`h623tfAUPP9A(DbKGGalTgsd$M~;=lr~eOxljUjN z)Bj93@gV&c<}10)dAt(Y>QOaSC$!NAm@OkCkiEW(KdwpCbLB=nyKIevPb8sgc}K6^ zd?T_~c$!3V?xZ!G=?2NM^5Ty8^QweV#F#ML2 zT-eOhYp^Xqa#D>T^`_@Vh^02;3|EIEs9y}bL`|tQB#@s&j0!G zb)D5|bXPI{61lE<7+d9FmU`Lg5L7v;l5L%U%w7t3kBHAdH8+TO4lJwZyrlF+YJqvf zurw39AryEK`!sDG=s^v5VA2vuNL;W%Ro8oh7-k}q4&R7<7~;XbGHfD9hCL8X+Vs`B z2b`*JuZ~E&i!KB2-8lAqjsGA4vA@$tGL3E!QrN~~27-{1KjKFXD08`eH*w;gad)Sc zs!8EL^ig4Y+;$Wt!(>F0v|f`*!8UK_KEvd~Bz|v$yZ3;?Nz=~ztGs4DXI8#P?$YcL zy+>Tj7onkkCx;j<#qvGI%P4NHtSXh#FPDDd#Wou2h#M>S(fT%2hia_3`p82uGP*z+ 
zclF2*?iMVZWS~akK+fQg$K_2%pRJwDiRJ`vGF_v1CY-5iegEeeS%`>k)ayNVn;turf!~E5NKC?HI3a+nTuR1?!`$qrWuoFP9^8b#wx=gip%nwkK znlX}q^gk9tm#hiAlGc(`rhkL&0|M0VL|(bq3^2Mu7nIdP#_3#YoKfrvmMpXmHj?}2 zvxKJ~;N$zJFr|&kF!hKqF8Rk{&AUUoH5rx^fZr5?MQ^)EAWY{vHOtqM*;c0S%6u@k zk0peS%P=bkGgg-C@XqYXqNw)w*1bpi0j*08xlm#}0bZjvf5#I2D&SX9iuQ3_j-T5O z;tr?EWrP7_Ibg7c+st&V#+5pO#|gul`Iz@;XznuY&cQgeRg@DC^<7yDyB zpQDn|UazG$d3U2uO=v!FuKP<@k6lLIBkF~9ouV-F1G}45 zrjq^Vm0K?XRe^AouEbY{uUVu$#A59KVy|FMt*`vPwB!X6@gFMlP^!Z{C_3L>3tYQl){ed z%saTBjE>iOvZ5e=tQA1_Hu*v|JGZSm#7HXWbaeM~XJOzYv1;9GIb0Ocp06==mCL#l z1c0@SJoBB|yx=caP(mxc5M^ISMzOACT!CTQmMcZ&5{rA08%a0zVY;Z1qfy^oPOv@H zPH3IyI@+BmJ$|X2b>e@=X-y!j1LB7pq|ZWxg{$f@r)%op)zE@x4dU1fmsJ%t-Lw>~ z+{DoH?%IFVQcXbuVi}pI>1~XQ9k05T^uM0yOAY%p6ONP~ zII!$;)W``;?C))@tfG?s`CkCXfvyCuy686iVsPKKAH6+a!LGV1HuU|BiY3o)q46

fK{CPfWOX$NO1E~(Z+&I(8bCATatt2 z*tiihu4#qLFMrU5g5PUZ&$#{b$EvV1O@7=FVEHubC`lL1hnsTQa#)u*njFxgO_r|r z3$>s=idE+B*k!01Wag1L3Vu9BIRsz>oW@9drtz?k3N@t*KSAxpn2U_9b3nbU+C3|~ zUDE6zCxS};5QrJ6IGYgVz3ny=GFZ-UrDIYu`e9d%s}s(;IFJ)BhECgAx@>BaMoIpR z^n;2(T(|lsP)V^nBL0v_X{uo~6}d0)q~KGDS#iHEmIrkGXQTCz5$cnvlmZ~1R{7JR zMSOqW{O!(yg0nHkf+YWserLE#)uw7n0n#XC1m??m;4NGaFB1lS=W7+os;Vi^w+nwme2ez5Xv!(oJQ}B34Sn)SRwsSl4cU`g^pkxj~9* zK;|}Zbu%cqYVh`6F%jbp&ba-&B6BsAv$Ug1h1d_U6mm$1Po_fIGdZDv3LAa#q6f7} zIyeri@Y}TPH~IkHgYtoZb}|ME#iq=wH5wZ;1=~RU!cPQ4%s%A~lBlL1ZObTtMv;Ya z9Hvw|8-m;2iXX8Y#N6Dn^&$zcckxGFc;|06?S#M+2o;n{+=0m%Vv3?LUVagVm!o#w zv4f(h=Z*xO8fBmVLb_V9#9J>vG(pBUtGjx5FL&>nNM1dugxqypmgcES7Yh*}CsWib z$?MnWZ9S6|&=R}CjG>99peekX4d*2C-;1Aof)Tp3Ybglbfy2187k=%Sh^z_JpR|E~vSfcfZojNiLCMrKY0Y zPOt@cF)C2&xYs2~el zN_FzF*k7KQu)&Be9;#@{FeMpKPLYpXo!)2IeWT(RC6tyDPNxQ(VmeT&IrNXXEs{Vq zqf;nrLMeiOTCws zQTLLre)Np-kTu|$+bZ8K>SI#TP{By37u#y8Z>Z&KcRxO4lJ5(^7#!j-TQ=xC=a|>< z*e9E~d84w^yWBoq32|O>s@r1%cW2DM->v12cC+u(tW(ugsKNba{0~EB*(K4h)Gd4- zDl|cL-*p#X4pa5SnautzDj_7Pm}BsYot9yFqH|CR-K1U={o9$yRv!Yx^#nq&UP^(_ z`5uqCw91B9&6?P~K-+*mneL*ZNKO5yV|6`CF?Nr(0O^g8`#V8?+9L!6z|icsYI(k$ z7Fs&|m7>Y$Wr{s~2ozCh4%ue18Mk zRRXPNJhq3Gor+1QBfSd0GGgb^4nJ?73sLyFeJ=3Xg3Y*lz47HYQ8ub4i)IIZfYPLS zAt4)0B?#P-o-eo&vKm6Q6Zk(+_0I(ocrlLuLhG@SjvxHKOWK(u1 za_Od?81zmhZMQeQco4c~TeGHmG;DE`V_eG4@K$xlI3E&7@94+3ODU||4E`E>DB4~N zU|G+F9cG^2i|aagGP=*qEvbX>>^+Z|7 z&_W%^zsp@mSq~f$h^-L>|JVp*G#agzS8;S~P?@2uPY+d*6yIJe*lC>WWpJ=>#_YTL zhDUO5ZMb*)*PiTq=cl!s<)U;OI4AiHsReDX*iM<4k-SzN{7c zFfSTEoahT!-&#ME;+ZM<;CG;-h^v@6+pX~duY%wD52fbf=J7OMLOm>QRB6|r9#D7w zR*He@i;|=k`L#QAWwcegmsSgzVpp+|5k@R0s6dmb5PRf;PcQi$_kG%;ZwrD_Pe^wxKAJ#mG zWUIh#D?Hb>3jvj4`j#w}*E?yx9Q)0!DUAumJ+N2P-*+6j9nn-Yg-DxKVumR7_O-8` z9XEb5p`&oUN(QOnFEUn7QlSnkl}1R_XCo+94rseRD?qHo z_@KC^nNZulT+rykP66M$@XqWBWT;#O4RQ<+y$6No<5-&iU+leiP*mHt|GPy6Q8FT_ z$pT6gk({%Vm7FEVCg&!jG)Pi`CTEeHp-J5!IX56VHIh?9lXLjB`|N%1x#!&X>b+mR zSM~ejR?S+~T@*d%8gtDs=4X7z(s|YB`5 z-Da+`YuFrSkt4^zxyF_idx|PVZRs;(SEsGk(ybddNV(Ib7G;e#LF3+6Uz<{cxq=Zv zOaM90yTBKqM*aS2x?1JC`-WCetY*KKJ0KsWj-F^9Y|T6z`_-YaHa#;mT`H9z%(-OG zo6+A`*i+2W?Yb7O()luf)I$rq^q6*xy4YYS{?mx=n~te5^~I%pO&aCBxuK|8K_+xtt3?qnTHIsT!L$?nvS}6*%%M|*P)(=I)zhL_7S6)t?HCcJ zo}pZ*K&022Sn7(cIyqE$Hx@PeaGE;}*fo4Y&J}HGCkC)lRtu|c8#kPom-Y0Es}0dq z7V54kH)_WgcMyv-GxNNl!4Ny$=*~~?^mT1usbr(+Ms&q4dI>zm*P8)Q0=eBy-AlJA zTGGEFKz_ZV$MCxRR&WQyU*7pYK_^WNSAxDI{r)c;@ArLS|0YD(bR^)vL&E+Ov#BRz z7*?NhlJvi$JgVP&Dk{I%0LFhsPM;-zhZeH)n*N8v_CMP3!xp3W`d!)2w*>odp5KnW zoe;`cYCLCaQRtSb2$)XB(T3YVHKqU0Ryg`59f8qTl zBidW1K6AYPk=5q`>c*Y9;-aB>M%e>B3G zp{kRFpO;xMN~>TdjN0p~$v@0@A17AM?=f6pLqx74c_Qytt{%*{FEf@mY(tP*A2b{KH;F+L8 zY*;i!AvJo}Z>|{d^}kTM|9ASXIaFLdU(FEPw9?-ziot z;YprnI(@`g7fc%Cgqm`%$KyR*+(IK2j#2FVIFm)U^Gz3mx6N91<&7Wg0-hTAuYxwC zr1*}Dw;?`%JMEMAil9%o8rLtTZ(cR*Y{$;|mI0RKB*^{W*1WDMlg<2TLJF4|({L}} zpct9X@nFN`^VsV;rrDb)Z{jhR`&!D)=lVPAiwsEihYVih|2DmAn{6i>mG|m|Uh7dv z_}l0CsniF)$CFQ^`ZJFb1gi7~e zu|zi}HOp#J!IyKX1)LY&yZ7P(yuj={u+{rL++31{-(Zcq)K_U0Fhd%+cg`h#68mNo z0c$oUc}&Ju0F(ZGeev>z)3Ua>TfZn&j;KcY{|?Kez!%|-+gs_2trYi2ERD(~g~#&H zVCM!?EC@rq1Q2aC^$P&9k0<-?r>VmdRgpYeI)J%Pxa{K6b^+f@{cgh%-t~vhA1<+T zE51>fpwI3)m1RbdG!PD^VwOh@=(n1VJ!7|L1aSg>;0{zMi~mf}{+_#A_ufJVpN9iJ zozMu+0M(f&CM{7rQ@?mYl7Cy7zgcQ-=jf>j3jG(eGstjCePYAE>j^mXCBQ6^4oY6` z@3i!EoVaC?0W?(Z+cr^cX@j$0&UV+IAw*FHlyPfY>!BMsXjxlRV$YVn1=m$+>=U`E z&%w@zhjNRpnyAzRzA6Dnb{%={t0B8c7IP=5wqiO=6wZm_`3bIRJC{sucA{B-dTh2J zdpBftvXGO}shNt)IK%n)NVe;1;OC>lz2{t&2_tbuTZ!x)kNz5qf^UA$hTeNEiiz!V z)DgzdE*c902-G4b5=2?>P1t-P1>$lN^cM*>@ivV%>91R$4Bop%`~+~R|DT9oy;CZY z;f$U{*}>0$B6?AO5j`oos|Fn2+i8DaYk3V+jA{5yj$H;3@-qu)yq;`w?AJ?x|p 
zw6n2?J?}a%IGuOo(F!`#xef)2Nwc4oTsg806>~QOo`}^u5RzBNi7*6_C zJcEIDnj?N?&2&LwbdiSU{_%$Pg%B5z<|A}i_08Awiw7;)3*G4%2Cf6~WK|4-;+pFrZj z(?_fSppTH#EBo(Mx#R?ltDR60wUw<8Q;smna>qwP6jmTw<7zf zSt3SbJ-%kXr=V1dS?v+^WqgvKbG!owZrP^&6QC+cFR6^U=9wn5zDZ{Q0Xh~0S$gk? z?P%*}+h9lK2xs(pq)l4_pfV`?^`?Hz7oeyR_8GvRowRgCbrDd_Fru*Sr zTL#;Ef4eWNsX>>$w31?|1(jXB`Oo;6upPM35zAXLz2v)8+s6qTwK?gsCZyN0?%@UE zws`+n;1bImU2wGHHT$Nn%8#*2=$IB$R6yHTxRb zv)QSKMb;fBdVK1A>E!vd7HH{+=0XW4u`~1~7&KnyP*_B+yU#4aZJwGAp5$!j-#hnx zKk#v|vq}F$(hHLffG%H*n?HYqMnsQ8XdCL z+y8Ve9D*@JmB*(kdv@ujLCn0Yqv!fQ6xU7M0MC$b$pCW_IcOH| zSjl_BK0!w-BDuy$4&m+cXx{l&Ym!`He@t(0Q%5x!nmg33)gl(no(s})So>xI?n@gB z2?hVoOa$cEQqd}n(A1g8HGS}qP%ESMt@ks0TGMQT)H*t_{~a}3CAsYxB~v?L(b$?* zT9OZYi#l2K%w(0^y^%>dX5VgW=gdm%pDEt&jjo1OyAu`xjxk~Q(H&mjllRsg_9;N0 z>KdN1l7j#jUZu#TPLClRs)+1?FHxwrb^Z zvepqr|EY^J;x1CFl95{=*UHuD*JQR++0;m*9WS`nuBkd^j$l<(TxC$kmpn#z#^QEd zsXSi=kRuN4U#cE|!!XGIUL^T&x-2n~-2F$-_$7%@7ZKOxIgBSEaPkf$NJ>slO7qiWDaoQE zBty-;gY@c#R0qdhcmwVm`d**Rbqi~2D{Je3rs*a$0_6Y%QIX2{lp0ij(k^u<@qfj< ze0~S4!+NO#YXee{Q{y}7tr7k5Y0c==OshggT-lDhufsJ?nlsel31Xt#aQAoaKT{Hl zmj!JBMTV4UH(G-{p|BxUwM;;-OUP~N=jLbuB5F#YntS66OMCFO-8r2q$?};6J~o2hn}8l0%WXDAiwghJC(8vlrSo}Jnw&#TISGLQ2M-~1@!J6k>XRIfTN zMg=W=#un&i9w>)(_NGEY3BjbdPFw7%jhE|hYBIH}H7(g`sAD7Yp?>nR-d8j$dViWJ#%{QES<&eYX-B$Re!(rXBTb0PNg*I+!s{Tvg zq|@?Qo5IUegF>t55{r2%CYLwk>$IcT$vXBQ~ z8E+{C9V+UD%~~a97_Cxkv44!n)t@#uPn`J>xOtiCQDTBXbX=C8^k_U4#T3%c3f~;x zV>QHW2P>Qa&Di} zp0+quJN_n>rhCe0{i}?#(l37`-|PMedq5T%WO;dNuNj}fik@3DtUXn` zd192!Vf73#ecozpkrL>WCdx)BR`semv1vcjWwSlx^rr5o&pLEqL_ug%=)655XXzIa zci_hTZc5|RlXd87{u_v3y@~3lx5mUocYYT^)V-Bq+Gd#-*iGMg6>(m*>O_<2RXMpZJ3N59$9@tWGp|ysminD zk%5)Fg-~;^Dwl8<&&@^cBs8@7RefGGRZ~YmQ1#|aNNp49hMRoaZ-5_W*H5c3Tj_N2 zxV+|OfXwEzP3njX02o{`MHye!E^G|_qi{axWTuLUB>TJbv0SDQ%F09&iCfE z%K6w$sAYn9Pf1+U;m{g!3PojbG`!e&`JO^rgEucy1b(+YMW|$=R>=D_p*ojMnn`9M+8>{PafCTd)6R7(`68m=O^AVBfa6lNzx{-PgV}gmpxPv-q|)DgjrY5+u{Sb| zj6+|b%S;YB{lUy)IR*@#Im3%EVk(@{FdKK(f8Q?hWL7Onw@SPw89|tiy=BB~B*fNn;_2VXmc&yq5Ec(>s zJjNlnC=cLrz;g0EAWUs*uUfe5CXBR!M64<*-&_f2N@dxVBh3J@*!F_e)ocizXJjuF zCncA=4HapA1CpO*+l&*hj6sdSt3bdp(}bPiI8^P&XXZcKIFqCrvN3r(gsZ*MriVAl zJX<=q2Tju2W>3&nvdUN<++ou{8&)gy0-#`Y=z)=XfGc<**@HFIb^0neyN#Srv^V)# zgv*ttikhF|x))yTs)hR1x1szE|0Dh9$~7HBlY1HhPFz>5RPm*Yez8!lW+5dIPBpt* zPGPGznopRqPw?urQQQ9uIou)xY3h`Gv&f#3ZhFl#mQ+o|+R-7hdYD?r%y>@a_uiaZ zkNYHttFz8EIdh2>Y~yb1_fxD%&8^ltP>D|y-QTf2Y3p#hKVtd;qP)64qdaiNu~)e zKBTC=;%s*DFLv(3g#x8#wh3{8DF%7fYPi-qQ2UdXn8SgWHh_%5Dxte>cCKp;cx`Oz z$+D97ewSY_h_gnYvI}DQeJK_l>#y#IJctE{$&^b2L{1)dh-B%8BeoKTvBR}V_3s>m zkzgBZNx)%u)fLEwIAM1BWFd*$_@@8R z*Yx~6)oBr+-F;7m2{VjUvE`QYfm7_l9-I8uVomN=R1rIBZ7(p#pAY6P<)*}y>t9*+ zYBz0OpYxxr#dtf_e$arN685*a4gj3uNp}Mc@$KI-wVn^CLu6&0-xWAHZQ0PsE4zN( z*8Eg$pz8F#`z(ykEoqO}+spF`R%04EFh-63Wjqa!IR-iUJ^@2E_bcDV$jp+Vr+zM8 z*iZ8}U5MLe?oHek4^EK(qX2}hvLr{mRHt2 zU|qCE!aibqf>xJjB3h+VC?i~~nAf#uW+@?|&PVAO~5Fd=xG z%KmxkWh^aLH$O4@aJTl9JhpCq-9Nb*S0cHc66}Mwycb|M5~w&m7aLV3J5R&}$e{b@Q&%n;V`W7$YX+*;E}v`h?LZB0-2TASQKy<~H)3wf%F-^p4#YQ&`Wrr<{X4Ta z>+S}pYx$~r5oar^$uU3NK#jd-XXHjb8k>Q`(tV2$+a4v9WKV`SoY&nK%EA)GU2vpuwC1_gXaJG8CHZZv8fVN=k2-I&r#j`KJ%#!xQ z)AIN|$WQ5@3|i)l%Z?Wgy#P<-{fJ1MlGzSe3Pa3PK>QcK1#J3_jt{OIy(E;- zsAdv+E_Q)xOb6afkm>n)jPiaU4sQoRkzStdQ%_B*on2XP)%GjMlM#`(vfAhRU)puO zUw}{4i#u@RHm+8Cem(vw#`@tv{~pbng<+SN3<$p2+7@&7VLoq?TvxYHHHo!i_f2|4 z(XXfL9gIrZC1N(*&He^y$^<|c;TgN)2jP-$S%{3{sdaj0pZu~c#7F?p!>-GDr z)fO>$b9Pgv{O1%TdG8^6a8H}Q%ecW`-~XDMFw;|(H|b{?16XW}KKGxS-hXU;|Ga*L z1OaBGK^vMbe=Uyx*bZT?&ybi^@(Fsmd?GUFLHIux$!unNFm#w!Sj+HVtNJ8xN2sRr z@~rsyul@CZRHbf%yCG8Z@nraZ>DhyCbEL2M(4=R157QNVhHC5M3l8(^0h?F0`H0M` 
zMSVlX)$?4O%g$As;&jaK$YI*}g%M=)sKm4oh)&aeFESeQ=}B<%saK^o9U-TF^gH{9 zQ9sfGa(6N#%h)TfZ58AqCHqe^Oheu2G;0Wo^_j33)$wpQdM$WFH|(nZ>&*l`czhqr z)h4p-ttRz$1&}{{SZ!1C;#3rxqc%{tB-4b{smh{GN8!M>kNIhB2c)5;_4<-JZg~RKXkZ7hydyT|Pj8w$b=9(`e z!y};KB$cgS$^L$5&K2qFS-+GwaY8?~ghEbFYXb-&{n=jG;Oxf*^d-EW2kh?RN0jm9 z6Hrx3P*LC`&`G!DNpDT<=4DQx_o(f0DSBj_lZCB~A2sJh7zm$Y;hI`n>E)W|EO76& z0a?9&G@I!m$9qkEOr_-1(Lm77`vD)6=0r_0Hg3v>J$p_A>^*t&sXC)Mn)+VWMX}`;aIIEspr{jg>|afi$kN-iW9W*vki;p zpR4jBhnI`0SGAl7=p!okI)9uK12!|-%e@4XZ!vX)UL|I*YG%2OLr%Q&*>@TjGOG)z zZ}Ahx-`HzZX2S!PI6eF06QhCqb&{(drO26#&8tAwiO@nE8e}y;H`My(d`_Dg=)SbD z=?U(Xmz%vWF>j4?J%hk=Sl1DwtJq9^GIRSKw&dU@c+*=gEajPQW<`;LNefhoL;TpU!HIW(UmCIRWfvW^VNwd$F#}UQ0i>U_~$f1G=jwY^<#QhSLw)dNE z=UALoe!(xMKXcWh>g88=VC(s_DWxFb4;Q|j?86lOq3-WfWpEshCif>g4ellI7X^7* zT%`7oCaeUBl+C9nGYlug;9`?jWmBfNx4TEOA{ z!2aapFAX)Dn-dNCN+Rps#86?|IHS3Z$DpI6{?hI`fmwuw_#?$?G1_+=c~; zU5!)kZINBa*$S`J+VZg;WAR1#)H;s_&V^d!d?(LwR$2%&kX;^b)ZBa3V8car~F(jGq!#sUm!s;dYuoyJ8`RZ)L6 z;vAAOk#I=LLt`Y) zbW>mnR2Q$mROdPHwXx?WKwfn1V4}--h>+F#=&JEDi}!l7#^d#ZY2n7}OY;%!)E2A6 zTAA3&N%<37hue~wO~D$ZcNkin8eZHJxQ!DG>Xgo#ECx6Yl8~Yz(1|n1mC%?B%G1o+ zQjLiX*(Pl=#96vEUk!Lz1a9MRl|MJA0Z3$JfkM|u9yOXxVv7XPw%3tgha+2)INCp4 zX5*Zi3+C4FlTlB7^SBj6a7^Yq>019-V%*KDKQ}y$EP|c}vaeKXRgVasYEewS~=dvN7%)2pw!lO%Nas z2jff>7r@k6)gZrG?UN2vEN~RL1EoVOCEZ^Ti80x^CqDWR`1@G&IyA#jxMp!GbmPqp z70>|JF_@$2V5%Bsxu8m2x5W)|MX+k~W>?jgB~6HQA6NnP9VWBLMY-4x_qG+8`)7W1 z_?U!A9W`-I}QB4oC45 zlU6%cO%Ldq#ilIv)poA4U-ik?%DLTzIg9E#4v$Cyh?*PtRX*$4eL?pWv9|X7=o2NL z?#jS|<-9Ih!%1v`-3Z!MZ^3nwhPTO(?7sf^kveIGlr?|w`7Bu+a5-a#|66Ziv!U%H zg1QXoqDg8q&))63?*sQ?ZC)bm={{A!E7WC_*m7q1n$}&1?y;6pKvh&c2gAu3SJSQD zEFZ5Mnu`|_qM-thrtH;r8E040_~N%)EydbJrViw^07W%4Zl@-5WW?;TOpX1%sXu|90|4)@Ao zl}eNfWuUT-I;Xt&LzvF)zStc&?s2f-4GOSv=NUHDAC<))5L_nuTle=>MVEIe*(;W_ z46+E+gXe4=R22EyM2d}g5EmXLjywAqo6bjTjxz~9sp^|35By1C0WK%4enYLn`ZrCOVqt3<%FYk!nE`>kPK}~m_IL)Dsq8iEA7E0qrc_}G%ex|^V=zWdrrov*ID~+moeVylFD4#qm zJ>|TI5SWhvQTn|Xj(=BmSGCT|rBW0+WdYZSZ>AZ~Pci!CA(WL~UU zfu9*wJGgN3wHX(bPPjX3JGKo!aX`N zwqD%Gv37u}Tsg{7&l5 z7;aF{)th4Oy*HX<_{qb^o+OhRkWwiCL8Fm=zW$es6Xp)%5V&@e*_+Gv2UN3Q+!Ddf zaX!|`+RQaa*Dz|<9i_?u8j_ueo57u{b?d1E6-qcGSUau_&NuYv1oFKtdilk#Ol=h} zfbf5^#3O9}XK^-zedCpZBTW6-eF^*G>y=t5=4U8(L4R8@^%&Btg%3wSSnyPj4Jc=H zoS%{arPF`w{FS2HOA+2>tV0kQBc-njlx(Ms$rXElCBq;9uR5oMlF0;ac2wh}km~Fk znN)e3FA|<~Uo=dtZf~c*oC>{|;?bG9%Zse6ZSoQ7v71gHzfmk)QxjD$GxRHLTC$>*X<;| z?BB!{3o>mZ^L2;n3m97|IRQF?! 
z$(2SkGx6b)iVo8G+WX4-9foP2LW>t`xLTXaHpm>OI}~ee9PEw#7oaLBR&QNg5I1%v z68F%!5?`8)RPQU;53MBZR6HE!xx3dLwA0*`=`xYPsq%d zlQ`91tRKz}=F|&>R%r5d3Ac;s(KJ6bOGdY(GbMMjWj91OWHy*K3=5u)ksf&2(8(Hd zj1l{e<?mkV=+{~KAwCOqy#&9PX90+%uA5|Pi5%gQzh0({S|@=BCvg1l`1D}s#PGFsQQ#x_yK zUY)@&MRbbPxS^1gPO}D)UDFj38E#fcc$Sn7>mxATD9@G0Act4nk>wj@f@1)nYjDGV z6|cLjRCNX=*tIL1=uZFNBFI((sDIO^Xw>-m!gs>_wGM4Mj`Q(GR_I_=Z-ushT5{HT z*$YZ_K#srsjw=5HbS;LOptn~8XIO7lYG%JT+vUexXC86^AvBCjE-+F*{nD5&oO=Ms zwb^G=L@<=Ir~XsEhg({?QT19L8WP7Q$K0`OG#Q(AD!ubdYWC%>z-&tZ6N`f#B!fFm z@}P?o@l^H3Czn0G*5c?Ps8n`GR{6xew$y){cUI_{&gorQd(%2+7;gh|iE2Opu1sLc zMBTT*CsR;djkHsJLX={a2Rd+dKswEc+o3?wx&FAZPQ11@iJ|f0R{&x~_bqN-$p!PZ z0~t%flJl;L$&F1vtuM0ZNQiIr#fthWo627&O9vsC^C?(XQq#+>_q@Wn?vkW_a(k8% zCmr(V6Ad=&k9eaIyA~UO0`d-*FXjvc%*{k*84aH&&#;gWY$6mciLAUn-40vGuC`#H zCRAa_$qtv9HznDRe%$kYI2V}0I+4~9e|r1k=PwAq9g5`A6w#%ex-*lyr2J$xdc~ST z=&4|NbbI0Xd6tOy&E#gf<+;dWhx{J1e#_(vG?dAq-)Z@OrN-g|W*w zDgU>a(q`CL6YZJSJAG-)J(}a(yjLxCb21$m_v-|BNE#7|EXh_LD0FFKm+8Ph{?^p4T>soN_6Xy_#yS)8?H5O0-7A!v z`(VHiA!4KNHOx+Lw&f)1LEPY43pU734vjM6N?75$@*Qo#SB-r>SINB<7BFr=G-Y}( z^eMnSVfnp;e9~s16bkFt)EGMZl3uvyh994p`_fiyLT}AlJ$M#6|6Chx_@?Pk zoY^I>F!h^fR{FiUBt^$s{pOhl;~c=DfGb|G^>=%Odp)uvQ zwQTY36SHGn=IeBEHYN!A6c}%S{syri+AkI&lWJtpV%{>?hBqj%WB_9}XXK<>Sb~33 zQnAKf!M(NGopz~v&eadR^wMMWr9gM$wYWP9p%OE93WHDeZgp>C4|{0>tS8d(`c@Bc z!XBE8?R`6n_)2B=!1dqJW#Puxvzj4%=&j@@ctkb~zi-01S;2sOAuSIcI-MXi@|e}| zQ%rc3yrtcgN9hG4y8bgkrWc*gE0OjK^Md`*vlvHdWtfI|i;n^0%N@g`HIMd%(zGQQ zsX?(O8EX+|Gp7d;jlD(=s?%pzO^e*W=gZpE1KGUV|J-sUPn?(LesE%NL`Vq&o}64 zGN4|zU8%h`<>?*B-^aMGGP@OuaA~$rT>+hXb(>WGW}EEMr$&plBESOviddc-dcgSy z{=w3p)3)%F=^f|WFcrkz-~(07of9T#7VYWk8#wA-Z&g_+aCYF(Jmm-r)A^ zHh1mv5B!p^#4n8hW}qfg9Q`feCvWvoY8|53s~=@jXHF0Usv%8v5$z(2Mk5cpO_f@s z(w}d9TwS+qz6OP2UcRNmbiwb*5^5IE%zLp>r{u8pvnE2RN@<;+sq*>_#dpA6>(64T zX>YyzzLt89ULFEbgEXa7CyP!Tw|zWrXhhUOHJPz`W@*@lyjMujQH%BLIC_{Q^o^PUpXBUB<&R`wtl#~*7I81N+q z7#}REuP&MKqS0oh;%gh116^-oH--;D1vRT1w*C|gAAcjy{ZU@O*inz&IZmN$(sURt z-|Vch-zQd*%yA#5eYF!TKso8&CP27JN5!{qaF@fWOTd)-1iybUJ}7~-nQHCmp#Y~& z|E`4Z1Hh~-ti5p(X(Ecu$UL?5J#mZxM)4Ai+Qvk74Z?V3T#Gy|FIYTi?!Nmw7l6Gk z{|Isi5x7(0Nf##HlnFRcip?RUuqWGFd0uU1VNWa;vh) z%yA1;Z5cTklAE-gCQk2{&kYkK^Q2XpKlTP7){GZzd}S;FT(e8T5*%#?FalJ@v(h6yiV<+pHT@SskbhYiC%Q;d8*u2bCKuull3bm zXT2C=-D%Fhso%(reC7abT*1OE9)z3oeO4H6J&$=w1ZpV^bxG4cDWjdsMBym zRs%~FTShsd4~?UL@bTFqrPt%6E+_MkJKR74>yY-4%_O%6Jtw`qBfM7)yap~Ff=J7m zx|;hs3!Nneye;LR%29Kgi0Q0z6@wG>Zy>s^A$GN*UK=^&!cw`^>9juJquQha&4b6O zTEj20+|s9`g% z^Q<@0^^pu^D?q<)H_bajtY(+hapM(%{WQ7iRw6>cXxB=WTRUus_?3z=2-seOO7fxFO8#bUxE^GzNINSGUsEXpZ ztD76!@kXh#r8fa6k$S)E@1;pTXYDdJTjv!BkohQ+(8_=bj-Xtku0}l|AW9TW! 
zSr2F2}Jb)FJ z9{N3@+s6a*Q$H_$p$-*2&rP4V`R$tnRX4W4fh){g7i)^2#kYD{!>jQQ;Xm_l-fZ_i zoDg5z7p@*yUr42Pm{s2gwHRflR*5BD*oyEAQ7`^ z``@VyLsl2RV%}{gpH+1F_`Sz@Eq1!X{Rgn{9mRDuVv(%-bAWV-`*@h|G3&9kDDukC zqP71*bpNfD|MS=XvB3XW;D0QDvA|Cs{ zsnNA68N|=PU)k_Mon7wMIe%k>d z%o0Po*-K0$O4CX0WX(~~`PQ#<^cVkPf2!-}c#;i#C=g8d)<46`|Dh?b`*n|?O=EyX zh5GA1J!1OU2Nqs#8468rdG0I!GDQu|@(N>Z!wMTmHDd8IoCEIPO11K3m1fZ3NuG%o zq_W1eO8Q(^0Zzu(3R>xH+{07mB5RiR0zID3G|;l$O@A;lB1mIL$Qk5n2J2mwiP0t+ zx;q)t*|e$ugb=HCSUDzZ>OOv$IQU`i=EWD6F-&XTXMg^C36Z3@H{~YG9pwj3>1+Y> ztNXaW3&4$crdV#<$)1lC<_a|n9ZQHYVyhu!+Y+ns7?Cfp^hyP`^ixw>dGjB@Y3Q9Q z&i>FN)B1pmr!Y$(9V>%sUB9(AeJQTIavN4gX9c^7lQZhEZw&CZefE#QmCsCAUxfNf z2N<0CfZQTaX5 zOqb46WY7lSMt?X+pHRs1AOdyR0ABVR z6AN3TwB9!xf1cb(3cz&{+Yy3z)3CY9-%!!?yi9+A6u znhhFg^dE>lVwNnc#OKZG8|EciShWzz_yB_!EY{$pTG+dNUSuo9J;6*J6D6uWso z+&?TE?gxW}lhGc!!M{v@i1Xhp7CZ0R<{uIc70Z$%8Vcse^k)XOLnuhq{Q(IYDguB6 z&qSJ8evFc|9!s1I_oB~GgURpTwk?FDFT>>b47u>nP1bd>hs18^IKrS(q*-wcll2RP zb7Tu&*LO$?@=dNrb=Qv>k}7W+1MYp#-Bx(d^=VKsi#^BABASbzCqFlj{Gwyq;Kef9 zYxI2t!_xUD1=WB{na6ogfA1}N46uU30AAf(@3JQukZEE#(LpxclDATmlJ0&RnywT; z|8Q;W9L(e6Z7N1~L1gA*B2!kGrqWC%93&vkkth)BE?8hkG?<#WohZ$LW)4RjSq@We zI6b4vX#l?Wakudy4j^6ftdlOYoXlYB=>cx2V^>kgHg2jAsRoCgTV9cEmmP|eEuM;P zCNd3BLv^ys)B|nqy5p?O$ljwkrIB4T&k`H9*#L#B0_v>AV6z|FBP?(K7+JiFaFr?t}vl%^rX*?xbJaR>W0CA@E}e^Yg7v^a))wvJ+Xmy;uZ(~2>uX1$cBdD z7C3(#nFd(AUX*!0)2%M>jKp=BXkbEBhQO<3_dt9|o7MYguQk19TO8v3J$KwA?2ebJ z!G_Dnp)Lt?4>-8@IunjNXzccc9`BLWQ2Fw{HdP@SvL92E`Y9uKSnv2KfnlH~yGMcL z<)8h|i^co)r)bO1IcoT3cjXu1Fy!iaNj&p&LEuk1rgN9tMwq$E8C*z5?J-YW7?h@q^6! zE+FBWvVGl_r7~~G&AyN+WdS@Rhm7OiJ#YM_&;)KVkJ5-2G^$7#=pX&+&2_I4vA>pz zU~{^G@l&$1Gu$i)1l7YqZZTQ=fzBKqxAv#E2?X~KYlE&-CS&m!e^2&3#lx>3*WC5R zX}sS^x74sRBlDmnWaC@Il2iTUnTQMyR}!Z@Ryvaxh->J`V)x^VTb;mA@2K<3eF53! z+MBnTXHBrKYSs1UfJpi84Z}ObHU)$ow?4O?^lSyuk`K`ebB)eE`K+gW532DUW(|lA zRYT8iw`UR`?9E7icanTQd`z_SCVVo^z5y%p!&?lEmKA{72uc>@)?vICgN|m`ti21V%Exw|e_1k^2Wbl*6 zo7Rk)O1h-GkSO->k@~YH(ui$_8)sVlPW2zYC&8f-qju3!S>lPK3Us_0uDW?W%nKLG_MZn9>^ynILiI~hTW+;d|!o9NsBJ5 zOLxn(NW@4TN}ZxDLG5&hbo<^@Qs)d8fVdmmz=D*6&zR!qw1;#y^jq5EDsBih zC%KV<3FPo+Ej?wrgcGzaCDUbG3Nbp_1`H>PhdF(8jgs2!7B4ESvG%2?O>gB9@V2g{0+ zH3hh))eR8^iXF-g$p?A9B``%kJqc;v>i`R{1^+y|t?sFYQ?JxUYm+bC-Z^#Q>)2kH z9)4=W^NLpZ(FNH-`J^U>sBKLlqu7VGO&5h`z^gOh+lHo1Gn|Fr|>&_nVqc6ED44{5HWEmgK@hz!TxCr{^K5F z$*1LR^I{Q`3_dM!oOoC{8jE+Gi;h@3-SSbrFyFE?Kc{2{uwCj(RNYL>-U>(VRf^qOQN8U_uvu zPk>igPN=>#mvl$^?(5fzDBLK{PseI*7ZE)Q%*E7KDE;U?pKY*qmb4M%TiEVv3=i4r zLrfM$O~@tPTUNjhn{?3ZP6tV~WQGg9xEB&YM~yWfz`BamSP!0yi7 zy>17^n(4V?*&I$0jSnNODV`5T3D`a5mIPouV{g4;Rx21{&qG(8qWpaPm}sxjHs6iD z4e$HN9os178RKEQr=YfZe8~^CU#-QmHhvE82s4dM96oE7vb=}biKLhet|#N+RHM2p z=S0d&xre!avP{f~`)OB>vvJyNcaD;Us1*?*xh`!_R|$Jngg<&$Pxvbq=YGm|6)6DC z2rY4b7N?rheW|y$Y`A~>0%1qYSZ&k(2;>&)bT66Ufb=G(`!+i-qf7MHXi2g+(hQ+K zO6y>ND8f0_Bf0U*3t7wq0v;-hDuQ1;Vr}|a&=2c2Z%0k+_ zvm=fr9{={3O65hu*>A9Z1wW4itOn-&sNIipHmJb}XAIX^>x$@3byX|y3x6+O$BCJ0 z={UwK$nXvxcT_!*1eXkHH?uAhP2wb_q%@X@bm*k zT@v;PzI5+2i|s7Ko(kQ)zAA8f>S8S?#M*V!Z3qLt7H<2%_rG4|=DX3(kJ8l)8J)BH zR7-WOq+!h&fN#MNFn`p+=F7=7z&P!iW~UK;EmayLH}qHnXlR`_aGyTm*T9rtpNF5% zkT4(EH_Fi!bz5t=sf5AvM0_!sQRMQfBc0g_cZm6nd7jhDH}|Vud9K6pOm{YzUvlj- z`FaVPLD~txyOv2WNc+sex)}*22=V^)<5>Ye+7_Z4%N2(vr7zi5EFI;DGY3>h25CWE%ee2z3PP#J zJZU$@qa;+zX}Nc?iGnn}w^#7D21EtYbh7qLA^0>e&Z|h_8o9}d@mOp@civAv;tgjU zwa#A!DB$zh@^fjMZjG7$r{ZGoh|%`w<0O{=Q4mU}EjRP{PlXD!FDe)M^|*iFJt9wn zS0X)7gvEZPn+sv;51=gg?yK0KTrsAg?e@DrgRkkBAPS`&3JKnrycE7z4+Le%=M+Wf zURSJ4DUTAlrW(RRFS6DVuRMsBPZDF5977ReQud0*M)^cwjHvF3j@P`(@G{~2uh0nt z!Hc58{>NAoT4D1-+9hLVL#8sK3&+Tn$X=AaRu}5zIu<^-1HSxh@EnYf59fK88S@nm zFA*8sc@B-?W%sEcsGe4+KF7;<%R7LMpAWv zAvYmf??6Fksbw 
z1!O{)K!8UL!$}65*y3|hu_+FT2_GYgxU0h4#=umZ0pu_4AzS(S;7hRf@Uug)hHPYv zZPnfKNh5s)Mko}M-T8{`k0Ub0l?{opqmZjb6DBW1;tSyneLBW{k3jCRnJZ8|e}CrX z7$sFLNtP`(Pe$hqBMc&zEQ!VLrV1~E0;7w-b~p~0Gs~!!FMU`r;ua`F!~T@KR0VWH zk(Ta`trBFXt=}Ra2z=LxJt_B&rV_=TA1~u+$M`Qq53pM2uz?O;ggGYuHm{Tf+cDcA z4qidihi}Frfje7`A~kg|pJC6%lOW-CqD)L}Y_MrN!^nXGQYW5-IGz)&IbBzCx)*CF zXg0)GnC~AEw$6){LODI9Y{iCG{mnT3X;?&)0P!cOUc4!c`j?9SYJu-az#2(h3wC57imJ9Cyiy(ZEx z+TSSj=d$A_2~_BOX*)39z;pXi6->)5qfVk;H|OH~GRLW!1U};zT#h~4hkcjk753l^ zI(2PI{vOcAl{wu$f{f!x+TDh#dlL$k$S$mhabIlsc*8mVQoO%5Cs)#cVXu))_a0>l z_PpZ>*ELoYN<1(ofUy;LuL>MzorN!`4c&H&+O1>yCY_TT+y%4|b)?HmpwOpA`ulP7 z_VK;(m7dD>1vEGapiKm1xVFF@_!$bK5nPocEsn&#;|n^G?5YAs`|bQVGEkfv1~*vF zqS`Ky)X#IO+5Z zc5JOUjxXv1H80!}TC1l39Dcq@{w=@tu(8c>D1%Hw*62qFZRfaOgnh}q+zXJIcn z>E17pt9|T!jr3hZKNqQ~rYIJM{J;uT8}bE*!Bw$!!}?M7J5jIiHlKoO))1pZ(*Qfe zMRwUZ8~1vYOE3Yj0m%0T=6BQ#&C**xL0m$Lo;R5pq+*eL%3U zxbG6>aicF$VTkMzY~_dVZP+ogNqkaQx%xm#K z(UsvbJw%$#71g8HwMYBt)1;>Qfh5^-!nPEFSvrF`Cc>pE((dvHybmHY(l`R7pRu#i zCpV8VT4fFS_oenl_r(#W)`z9U4!7H!fieOapsXTY2X|o9d_Qd>@Y{Izz>e$}4Oq*dz2FIm zV9Lg_p&kxc*pQRQZX0K(yK^>U(~M>lH@~yoXS9js8gJjYzaSwP&bZ!l>#Lyi3r${) zvQKN>Oi@ad$C3evHRSIz|2xnIbun>@a0aRz!jimLUo4#T=t?pA$^GYFB6~c&5JAy7 zHRK(n!0Z#Ww0MAA&NU~>eCR@q9-=Sm#nMk*meVhHNSi-%u2m(4Ke}_01KbgwD#_mD zgpUUgKH}vLx)RKcGzVfrYt0HTn!?a_j6bBzxGrq8X(}QKO~Y5|aV(^bp@vNREMX>L z!vl>Z2usrJ<`nKkyr^hHb>-3y1Q(+2e7u7?G_BsZr!iP9rmMvWigIkp*ciRFGjF+# z;zbGbF7xbYg*S~wFxmW_f8{J=f_W0`Hd9Xbc9S|QETq?}R^wr)O*^oRc4Z&Z5_K;& zC*bypONkmNw(c+Sck}}g#(~4lY%|ms&?7=zgc>k>$sY(^IZBS5o2R*n|3a?8edqu0 zjxw-MF<6-`TW1Y4Ny-L$zPGlXUu2Gp-b%EKBnpZ~Nx!PNEMB*BS7eo58L9=V58I*Y znzh-vIMRG&0DanQtuxsL_*iJ}HwwFRNsr(((U zD;Y1+1N`WhzF)Le?uypZRh`j2Z%doVfeh4CH8~9#s>nfD^u=2>C}Ia z%s4=M6nB{i;eC2&9;vjCx-YbE93^n&zO(?p+c=Xbz*WJPU=uN^G z2P}H=tKpyL-p|RuWmh1&`;R#H5pQCL=v68CY1aHWy<>hxrRssxapEOnAL$C;fzQRo z;ut`izLZhej_Ub233e-SK2;ucPibFt*HJOgW4v@)pVz_jj$ziMSWGIOh;Gu*Wz?Eq z?C{?Zy>W)O5VD#c{veerc}aUiR##$!nldk2vdtMaK0*f;Wv;d-L~BDT(C~f7OI85? 
zh}-~Iz!ShgP+>Ik(Qk=K_`==QG14*E@R1ZYI&K;rysV1JOkohG2)Vbk2YLukxW{$I zRhTr18mxvTkx6xUsho2x&QYog4j$)VBqiFvVkCJoPr6eanC9RV;0TAfe{`EFH^II4 zVTZQMI7MF?m`mCFBZ&z(V7E%cZ}T8|AM&qN>GqMM27NDdFMi3<7=(Tlw>Mh^eqH7Y z+g}*Pf``pn`8$99cRasW#+0oLv{o#x%4EaWPo=On{ED#KzW`5ouo&E_1#~eO2nJ%<~Fl!jPx^>a&vorF1ta(E*`mG@j^|292;&^l>dwqA3(TIR{G`&@%?s}o=T3&m>;e?;|zI$|B_hbqz)_4AV^>$E3?HY79f1^90Cq?E`_ z1Bc*_{h00FeAzI??c=Z^aU6pt!ZN(V@W2-)Wq0*y$Gwbk+M%lKcmUc=5Am;yCy=R% zZp$1x>47MC^2fBy&~8w|e9NmIPW&=w)O}8bVPR;$Fc{WsRXlY6uH}Xm&QUo1GH=}R zP1;fZl@t8@>BWu^v6-+Ddk{7+QUQ;&9n!lXrK-ClN%kNYonGAtjmS;rNC(A`9H3)( z7Z7c9#VyX^Iyz3GlB(aEN`Z{D{t`Sm?bXNS{8QZx@PtmUbMmeqp&!M4x?+ z=kL>jF5FNK-K0ndiidcw9Y^d*Gf7pQKbia3BVT@v6o!5VMrV$T@<5M&tFjy#`CVd3 zg}Z@|l!3C72Kc-z!nEgU;Gi>xI@ZyRj=E2#IW`e0ZPQgH5eZY>;`GPs*%q6sV+3pH zo6}offvmAne10Iwjs>=AR?#Yq7WCe&_a?mHN zIW}4@%6Gy|k$nqn@^0C~4UKdVF4F1Ab(G2TTY;Of3*C3ILw}sfC^d3;{G+xSnJdl& z+aYw>PCG1#Hdlv03-ih{{VzKCTHF{#mJKgSxKOyJ1z%^7WSs43f-Js zxD1%3huAB+5y zVIcQ&`|!AXV}5+wF1(B>c4#@OQSUG3=kby4@pydO9`6dRS8KH2YP2tvt$n`fCtZAV zolC|k!*?deg-4y{M$|r$#au71zX1Th99Q(`In6arh1S{xAmyrh$53qgQHYV;b0VMTHvqRFE^JCLrc|O?sZ&Ff=j}4klWV3?7>G8FrUjC-S)CGu$WTx zbusYWlS|p@e_u9KmrcO{;^tLNPD;CK6W^5Ca*Fh#1I)7izVb$9u= zZ^W+ttx%Pk&!fX8(WC7Nqe6J0aL{kgN4Il+y9i?~eM~5fiOrxl=PvN)@nR6z^GL3C zqPS8?#X6^b!qMXRPfD$U#9exMMupbzsfCl^&Wl4d1)+q`hh9g`T-H_1o*p2J(V^7G z4c#CSyU@yo4gm#ytt*suA8l#B{lkGKn*jARciFE;(+e+pjnghR@RlfP;u|f{+36HM z(}q4Q3f{+~Y7c(=O>|*P4={XcxgahMum{2Yd1RyDPz5PQ6srS2a-|u3<=ymUl(q&CO}~W|WoxeQ;;xmtFZEBB|xgCyiCu%1uUo zJihPk@*u7QdtwI+0sIcxn3Q76{KreVR1l;E4d1=wAffR(avikOsVSkGrGcUOnknZl zlY7h7S}f!3hX4qFZ7#&IsoL7{jIk(kLZY~BR@&*(D3?<){t_vqs2;f$G3(?Dx}cd= zgwmZ?y+>(@*%0Y_0z4q}B%y1j*o-O(Aqg+nnt@^%pItCgZ~^mQUVs?|F2h8WJxc5| z^*~4ubxc#l>^NK7Y?C=zN-%3gc)_1D-W`y3B;^})365q7 zqLRj#&gS`S^@%U9J8d^AJ8p-BbcCBCiuYW0b$r!oOJ#-W zX{Wz^W2GaSkL8*eXYmu5el7TA;>X0s)u<8kH?t@-Eeeq6D{d$BM@~3dVHY6#ROV`o zxGR76|2^W7O*={9nAN~F@Zj;Wf$2e@`7$@x3?y(|3u>ROt&|(#e@aGdayCkG!DH=e zZYP({UWZawT0CgZgr|LN@`*jnx3a(4(}K2>Ho&ORCG$&lJ{NfeJa719T9>=?sn~W08ehfd*gjY4N6(*P565_lXtOjsOo30$ z9*mJ>;NB)v-HYJ~o(Mz}bZb7O+zVS&CSMX?J(jNe<#r`4@`cuj*`c?e#=k%H=)`2* zdfmqzB73iUFhq!6KAY&1cXU54g%QT_fUbSKvF{35?4V?<@`jihUA}hu6ZV!@7rwFh zixsPBt@Cr4RW;B>zlI03pwN;k2~zf66hv2>z6C|Sf5O2E;hz49zBLb@wA|B3pxZ0sQSxD z_=Fw48Ne;S3-d;>Lf)$wyduA+)KxaU0CoIUPAq?W0|#nWVh zr;Cou*-6Ka_nmrykDd})GtGe}mY8KJb+TFtdaLrbSeG5M?ME78tVgp}g7wMgfk)mm z!1^~6f1!bLwAU9vAYsF)B1VcoB-{N;4W|%$_qAUh67|!Aw$cRwLUaxnpra`hnbVWJ zZ95@{5;}hQp&Ui$kOL05v;o;z=}$w#R=xZSV?t+~H3OA!N+dzh}hjqA>Yb-hbW0 z6jITb-fGm0{MvB554>&`ILpXyW~_9j&@a~d<-E~;W=rEWw<+VTl6E@nYV$7kbkS9U z7$s)V&-=DG?!oRzIfWs>ud9i4<3*PZh&N9yDdQ-GLu$o>*2qr z_)3|nRzhr(t0bV&^g&Ui|4_y6=#Hy98@vBOB+y0wpJa*|2CglG9cEa@n^@9`xzB-z z$!5!%KSU4C*Wk97RQWuP6e~@Ldg~zJgG=eVKR9E8F?F~2(l7T=K+T!Ws|wnEx*EYe z_IX4d^Bc8HS6QghoQ zYA3t#g1188YM=!3xEJj!WE>To4h~ANyT%nnf55yDXa5;>9g{Zz_DZmOqL&JaTIYE} zadrCfb`SIbNTn4wmGeW34^~8o=7xRY8GXM|gNw2|R;S*6W9c7roc0&gj9@leRdQRD;f7|O?!R61H=5AqYna`A+ zpoh%f9Z`_y31+~lMC-r%gVTJBw8xT%|3=mR;}jPG-WKdsSXu6a8N>QZ6dx=wf8>nV6;AB;i1lTUbf8ANoRuQ}I zB}|)r``-)vb0y{_Xsd|mc)NV=e~BFhNc@Y`ZJ0`*?_WN-|KzgK{II`W4IAQ}0-HGr zbB!lUaxXGnCpp$1b5tiCiuBV={z?M4>IMo8je??wT*hOxeJ4^)Ps@?~yk^@gA8jd{ zl~3B*ui_zXUXRj?%?3?rZ&c|~he=tIA{%?iQkVX0=OZ@p9GBYQew%jUCI4ZkIw@R% zxcFWA1@$ zh?Jjw4uibI8|+w#Xd?WMHTg~{_|BH<<>WlGF-Xc?eX+G)?PPe*KDmsZ~Pg1R+n0(7UIv}FSQD)Cx2J8usyx6*z84WFE$__nP zm>`_qPFyfTOBiU9w3g2y|3Smyzim4x53wfP6TiJ$UBO&VfxG=onkWtQ=Ayx+Z~mhV z=m#`RSSVtb5N&38RMy*1I=BBJ_ba=_V9px3zlW8dIGTD8R$(B)7T6*Kg+LxkZil{W zrcosGKel6B<$_N|*Gc!9@Dol!Z!f9j7H#z8C8y`6ewuYAn6y;d78U$gw@DcHzp_G;k`&>F z-;Exy_;&=4hlsQ2b?dLv{xzG!lKesig+di~i!^*s*yrnravBXN1ZaN=J7A{H@E>90 
zaNyRQ;@Y&fL^A4nrGs;aNXZM@tZ6Elg&IKM9Efy-j@%p7QmC9R`QrYqRsM=q&10Jp zmY=pbxcf_FGmNXoeFPUkeyA~WY;I7rPOskWAxTkM7WHxX5zo4c`BAU_F zCb#FygZG}6Hb?lDmN_aXnps_T+3z=CA5BNR{rqo+^gO-(qLX_})j00XwmWVqXnxrSXv;|INH=ZHIQdjvs_@|i*KUhTUvw?2~xp^GQ;_S&DJ%uJa;#O<{?DmG`{9m^d}+J}3?3&zl^XyO{=cP9AIm4yGTw zEZ!yBzK)CBFbPo@o1A4ox7|2dw*#3^AG9G0h!X2gF-_I-Tp;q6l4lx`az{wsnR z)&d%HJ*vog(wv$_B=k(lJ)PrHG1{S|$n6~ll!rGPxlI8o^1f1+{Rr)cN3wG@pO>U- zO!ge=5>43aegtGxTdw4J9CP3NNhmyOir-tl^U>oPX})$e9pYzpdN%oH>>qD7fsVvq5hV;yoUmWkmWqvE~u1qvo)hP4y8ajEE)$ zY`3okKpZdzNUBMy=6(udR*@Nq{uFSxOn41kAm<0L=jDCmni7GP!bSy#AtWUPRkmhi zwhA;CC1oBx2e9$DEoId+3T24-eWt37k?*}C%Xdz~t4BwFI7~Gg`}t5_3s9|cuq8DKrZgIR zAEq}}Tt8(!WbY)&c2>w?L2@pbDjyK2VbkR6OCMZ)lGiT&=y`#DUz)s^XeM(|-F(JQ z*?kw}iVXL`R;x1~BZAxDdv|c0*9@Cw-W5qHyv%i_8s!gV%~Jcd+!N=*J&*EH9-Zd2 zp{_-Cw|twXpVTZGj&bQqkJCFPe37wrTqf(<{6|X3eiFu9v~Nwi1Uo9fc2|V2ydEIY zQLm4GJw)HWHv+5KOa>4}U(GArf`+|x6e^=|e6|O%T<7myF1Faa1ojlJS$|y|)y%ny z@~}xvx_XKeoU*kI4wrhpx~N(OkivsRpLQ9|4ec)wQV#ltE(u*#DD>L$N~%az(JSqY zpIA~)rKpMP`!ua2IIPGSw4QmmFx6awmZVABlKi)5P3lvCuD^q=P0CzXJ-SZ}xqZ8- z@5sI(G$4P6DPlP-Jb!`d{+L#F6^_?p`bu`s;37ou!*J2rJ^d-bxw9e0lVU5O_}OCb zL?!~UiGhF>Emd!{w!%H5-HX96=3ecc3E79a)NiblbysJMy2@~Q&VwKe-N0mS4a~zo zn#|n|X!VADQA^w~X?m=_BTVrM`koPrhWE7%Rg)jv<{WK#h{t%Z2BRD;~YDL;!m(8Rwnbrncr5?W4u3hldsgQM4 ziq|RS7~OyQHBuDox9rr6n5NhE`8M{zRE0<7B!>EP^XE)kUrnT^L?AQs7Y(OF`i8Q= z+Hcbpg;}%Q`Y$U2InWM`NnEo7`1_DRg{ znszaOe*N=jqG%wu_~t0l#Y+6L#i8VfqLpQmeWcYNo`hf7WIP!xNorW@*95sA^FB0V zoB;)cUNz7w4cbO6BOdwWyS(Pl5eII48I0=aAJp*tQ?6PX#4LNPkwOAEe7~8!e`@`UPR* z5-S%m(h8bqpXff1BuSGWA1Jo7R&}qmZFI{nHO+M5Me|i@E@4@yu18{Hv&Uwdp*D_! ziy^}TBE%$ens1TE>G|&QaPb@s!F=C1$0E7o&%ruwKtN zel%Eu8-VCOg}uCUTFJs~1=xm@zO&gh{(;M%-`FS_`&O^*xiqncn6nk4tZ#2cKD6+C zQ&pJf$d)IKGC^nWC}pVzK?GEn{nK@T4MdqlSCf`DTC3%FM9emu=dW@fkbd=f{HM6d zLdh3t1JKI#hn*pEu9l=PH1b1}|xMY_2gU8g%i2~>xS6N<$p*(~nscheorp>&u8#}_B&$C2#@Zc~{D zjBe;tBazYE8_gJ{QhrucMp`TU2pbkx!hrPD0GM6NJSJy;fJ6gw@H%j=N6M7fl(m7* zMY+BgXTvBs`O7rc$6d>Ep9qIN85k0O7Xxn;hKc)-Q%CmTq24w2OXk zMwkPyG{1}1T?P2u9z9wj&b9J?Eu|Z(=iVu->|V+&1RXKN`i<0>2c-tJ_sHyt%l1bP2<6CEwr?>Z}m-cew z?v3nY`EAC`@PWTzQuR#2(6~}ZQc-_itVY8lmSNcID2&_bEUOyk4wC6=%rP##XdALL zSdm_mkW$f0x6LOD1S-8v{$XDDnY+wHn)LkU^T^0N&+)yEM>&&GO<=lY(Vp0f!Hwid z;GuwbThG=X_Nqa^p`-W51krephWuZ!Q!v_)X2=O%<{NvvUpjGFrzP{ANKKm+G0$kA zd=7}O{tM{_G3%w#sFo9iW-?RCu5#$Mp}i8d$o~atNq|t_#BqQ)dpUrm)G`G&$O1sP z#B(`w#}GftL^$^f=Cguii2BWHAksw~DX&2OI1JyJqNmy+9Uec}42Ezi2KFV{ z2s;H`C?gqhe%^daTzFjfu#M?CVUmkg%j7uA>69*%c~a|MRW>?*->SekUbwOdb8ekG z0Y4mA+~UZGI9km{I=SV4U|f>=_1kS$A|)E4Ai88uR=8x@HyvGDQbIQT_I3JrcE4_b zDvOGP(Cy*Q0ekCyN#v2%gS1>1M%AyrK)=x(E3xx5r!>nACZw@4Dl*#^t)TR?^i58O zz02V|ooc{HKoKN{F=i4Bzg^vSm~ zElysibM}4Usr$hl?bjVKn=H#meYe;SnMb4f4YEP8P&6Qj_@>{Bh zmP}Le%uEUOYgrpSTqu$0VOU=#U|#Uh8N!2pGHYHUq7mN&hEu8l$v=m1tLBh0A5p*e zp^A~X=D^CXgonB{NBQD*ly0O5I1yNl(yJF% z;WbD<{bXq3sztAfPMD{%B?YC0*R7nc%Ff8uCUod&VA#YxVg17y2!Z&|uIh{Ju~h?v zPB}3vR_g8@@?naTjZfFg(hVL$-XD(Y%1m0#I;|Ao=wMHSgiukmsH+S*DQXOhQEFwK zsaqY6Smta3q$(fXIVr12NoDR6=L9LmA42B*5!;fTpH2sp*2(S21GHexVE14=CDnZ* ze?16^`7`HyC)uxpzy-8sv*qrH3Ec4eKUCLYcGj(obF6EFJ$0Z>r0fKDDd9H-b*v5A zH_;x-;d6TXMyVTFrSrRKK*xR3^~%yCp;ya!D3LCw{*Yi!zOvSDm&@$r(?ZtK;C1di zwBrmv_jc#Q$)08yqb~>X3eD!<3CLmj>*>>Upu9Y1lz{wR`S3KSPbM%-Fl~c|=!S5L z2;-$+kDVKF8@>SZ6opa}KkFL$h#{O*Nu21@3$@;Nd18<>B6AKGbAw!0MkD=ZrevDM z=QifB;_D3TGMq+#S8^rgt9p`7dUX#BcspER`8k*NyJy<*+?K_Du-J`nezDTN^|e@E z`y-&8V{rXb%qDeSo_V9nWUkH73FJv9UAB~5DCGe0o5CWWb5ObeVEW7-{;Mo-^>F$1 zNBYaXlsLs(D!Y_hnz!i(v5rkYK3L{$aXVuZjP8oWlL7O_)Tm@P81>(=1Q^IDSThGY z^$yIZ^bM)VyT30rGZOO+(P*&#iATm<9os_|*Z{H_)n@os^*a+cYLsD&ii=$_# 
z7D+HT#wOdi8s4n!_n1NCp##QTlMl4MQhIM7zTf-i$WXgFz^Z39{kL)HbI}Bc%`cvj z9+?)Q_43|v3#QStweyyAkx*nKnJR=p&c2uZ$`l};B~8tHgXuvuP>-UfH{l&d-awjF z-9e8z%zD+`eIOj5Xzqzgu6cV6=C=wqVNLbd^dwqJ&@i4H|o#yT2RFy2g{|&}l zI4n#j@t!FCu|ryy=ObYD*b9efjqyW&Lo%n2TC8mm#p<6oeSbb4(i)}wHdVEvq#A1I zG;|Cue@92E!D7*15h8!KHzTI2(ymQ)DlGr`(;*dWeY^)bYWiL%w}exWww*!McxxrN zy=Khy_yCJ}!Rm3vbvW7k5AmT3Tb^U10jU!B(*xg1wbEEfh}=NVJci9K$GV2Z%5EjF z?v=jj<5rtxs{9BHo&HpLAuj_`?DTej_LYpRxmQL}Y7QFl7E?q6cL9-x(~fw}{7o5G zH{=*L6Qy|@*P1USza|Q)FG*j78^LdmUtq-}K+TG-Hw5E$O+!_^giA4H&x7skh`5EP47e+R61Oan|~j8l&rwU_2@jkh?%b%IsByb;VXYfteJB& zi8;2uen@M??JtE)1>rYb(+}n9pD@@zDBPP6IQ&-6*J2@q_n*jDAGmvCNn7cXJX$d3 zyPYhWfK6!G;{>f7jx&PXHQ$TCQ^OjfS^9Md|500MT1Ica+QN>S<+IV{#)u=2$|K@>i;r-%`Pw#ae z>JavD7k(2l&!za~O=XhK=7VtSL^d7;3A34Gegxa`9fQW#>ndWtBU+u;haY+?f`jw+ ztX_?fn*N$RGA17h#;T`bU5N+q6w3mdb;?-l4x)6cal~<;qt{Yo>ftA3z9U8qVZtbz zz8K_g+8Audl-iiJlytv!#|k-0@}NeX;^=tl22VC)_qQ8AhwMcI2F9+&pJ zn=vcgqk@I|q&`;G^oM>;Q&*6}2Lwg56nxbyaoHX7wrlnMDUlaY>;+6pPh`z4d-Kgb z`tp;gka>Nl1f=07%LJAKyh0g=?M;v`*)CFNi>{+|9t-P8ia1As1L8#>*0+X_>Sn-e z#_igmBl88a{rC?LBr?rZAJf|<44Fm_nu=gCPIhcDR@`pJFSo!$b7x^Asg@$@u@7?} zK{oY9T5D16U(7=$6z*H^@b&xM(W={ihFJELmQZ-Sm9 zfAg_=lW~GNIlf@}1cNFw;r6Elmi%J{#y5femx9(GnzSMI7E zU6kEg_Qb1KH0Dx%xdd238w7p6pNoa43r%fzn?#BgxGM|JCzUK^dvnKS0=lrUP`Fp_ z3=zLvv?k)Q>ms80$tv##Li9w&?p_MqR8e1y2@8IrtEF+dUg5BC_#j|%!MwP-h4Ovm zij&G*wuH;%YI_U7Z?17+N~~#eZO^nXpQvf?ElN-h7mcbKOyls%a@w}V7%S`@-_T}_ z!>I7xvu7mmdLa$w-xF$L7ufXuuJPU2*E*OQPFa}HApFo4CY58kjW#PK@hSQ-E7z;q z&iMyZR=snaA3s;3+v4#!qWt*-WrG-F_45apteq|6z62}Z+vkW&NzPH5ByLq&5|k>O z!qRV{g0W`1b|HbnvgO<$TL1JK@ZrZeOBX%oF8lCh4mFCA@!u>pjFp1Ngw2kYI#w~D z+&M0Oc!@&90^WgHbZy(UYOsT!y49jC8GE+1dP`H<8GX*+2Frv-=647W{H2;JI(Srl z8#XHl#XHbB4bAsY8jy3C))=Yfy&c&ZzNwp5j~oly`Y+Zb&4zQ$AJ&EDEU-v6h?;6h zDX^9RQf6w58A*XI`abg&xxIq`&g>uh~h|`h%YeTlw)K*ji@y)JrSo_{Z%P$oICgEUOV)s_ePXXnWF!&pNi{DIVcC zfD*xoFrb!X5{vgcmg%5D9JBx93zivLq5ci;lV}U=_=d5H3|OXj$!wTR&g;4{7VlYPr_aO zu@r+DeiXo97EF|@4O5AI)mbw2(af~y9%kAXF00+vLyS1jj;=Br9;h13HP8OR%+DHL zlM_GqcW}L7NZ6kqa{yxvhBqz=#<#S!ZNVU{yn?ZGs+%J8jq;V~kRW&sU~9N#=~|I{DeyS*XvHfMqf>VpmE`K(R33PTR{K(wnL%8FH4~UDWv^Ey=pX zItq}n2&PWIgJquO7Sw;Sf_-Y=S}*A3L4E>jX>R&|ck z`w_|sQt514konM1|DDYBaPdHYL*-+J-wg#8?hj`7bOQ~5DzK~t%dq-JL-18Ql=OJ3 z!PLOU9%PuxpP?{NCJI<%WKM*NUScx+PocqC9T;s08sBv01=-fH^{YP#PjhlI)z(1%qcalXb=OYvk?r>s(2#h`~p2Wr4{S9Bwf z!HUD%B*zM zTwqG(s+v>$;8*w?7&;i8xli$%f7+NOzn z<2w*5x2u@T5Ao+gEWIqXINW>qLLkP5iYdk!sw!dLEX2T67=U;cK!a0?q$>H$c=}`_ z*oF85vZ+sz9nK|04BxOJjt~Z?32;J|mtXMYGQ|h%K2YYha5};ctVR+AsQ&zck#aGV zAe*NJY36xtZzMDNhSX9AuJhW<8ga8mbx}UMw2Lcg%Y1hh6s&Kvnov0M@++9N;pIH5 zAswoERUwi)t_Zmq1RLAN(V-TaJ9R0Y`##13UQQEy6|&MrF%f^`*+j z604o_BzY-ZP1hp7a%?4csiDEH@L_o7rDxhzvS5i;8XEtELZ5Ji6|}P98HIUu?1q2H zoix@w$-E3*Wa{DFpzg_T01WcxeaVD#H>&zLg&e8}u>!Owh@<6^H$lByItId{Wp(59 z-)5<4kCc_FcW9$gNu1_Z-uqdOpFmS%q_MbNc)~e#pZ1t=$7m;|8-G4suqY#4vjuw3 z=wSdm(lt4F(AN3MyB<{+?JJU|ArD{zC-8_h+Bq;25nS}`gUMcdxmp#MsvbQCyu-d# zjUp2@SKA1*X`i>klZWQe8uWfi66()Ykw!V-Z}Y9gqPbNwNb~9d#J<*+X$pvrNKLS; z3==RMHPbNq7TKZVv?GOM(vELbCzMz?4@*BfZa>u|ZCh0!C%m+0iYjR%J%#*b&|B|p znNt6X6RLlOi@Ypu07zS7)0HJ=jx8joAzn6kWo=6NZt6#cC_ulqHARtwb=~J|~Z5x4# z^z&jc7w16NO0;~Ih|KsZ9+t@xN4%wA&h>7s`4W?59zBU_coV(u4FBH&>>t6nxz8w& zlx{UOTClE=db0gW^pxY*NY99h7W~jA@OEBh1Cv}?ZahDz}x=9kE> zd$}aYPSE@7wDq$SHT6yvok;EGmD>gjN@3{U0r4on(mZ(dj~(4I=;zTiw-xL9?8R=! 
zO{3+?!0#W9jjAIdb@#hT0T-haBFBhivXLNsX?=MJ<1D5`XKK5;W$%3EF1S|XhFi2X zz|_3e0f0~X);95nmBsSq5?@rx5K?mr(K~1d=9D~X=jUltnW+6C72u2Xs5#EmULHXB&10_)n=7VcQNM(ZBeCXUxTt;A z5&iz%b0+Yq+4Yqw@}`U89nM@$7jK7k1AP~tQ5=^6UH-eqqLDih`ooXW^sg5s-XN~G z1iJEv@l^T%7amZ!Km`*EdfLY+Y<|SFy)=!;3o+}C-OS@}A6S=bz2f?#lSR=hGl1qc zox-kUX*0+8Ce0;Q)Jl!HexPRJemRShCiUATki{)XW2b+Q5^u3KmiJ}CAqf@nzQ6d` z_;OL3WUvczjm_YuQ!vR}xw*)4|NIW`I+9y;E8X-->Y#Z8YNM{@rcGN zo0^Ghe94i6bv9p7T%@9rVq(noqX3uat{Vk*Q0vIP04O&&_4>>uvsVo57BkXgf%X4z z_7*^Gb>H4FkOIYvJCst~io3KFcXubayGskDEmqv!-QB$qT!LG%;uHujZ|I}8*Yy-h?*`$n!C! zI`BH4XAUcPo@HsLwgxDvSdeO6wi9fx8GMU_YbFT}IxRC9hE+K9_7(S5YU{H>Jjbky4({I{O ztomTme+sBZ-s!pJcykH~Q9$Ks?w>1WgcRlD(X*f`eo0w}TYVMoU84@I1$dY>Hq?Oi zuLO6c`S8>%7pca{Hp=r7$E@dcmj?exE#%;4cG~PP>St zKA=?8FfJz1YqVA`-N?zkGGxkgvWX+U?UUxDT~+a0-4X~V=o8rDE>?rT2gMVqiRP*a z-ret%LF(%51v_j-Os;{p(GE6@i)7N5A4Uejuh{&jFn1?#WZtX#05d~FKUihCzWtg( zjhiYUfbrUYByh)xDo$ta7S~OHYDU3Uf{+#)VS1}k)}maOTS})!iYUBI?A1bD>^Yha z6=Qw+bo-q^x zM`Hsl&P1Or%)~D^82J%d8_H>=Ky5j|X6`Sj7||5V`-8+C)N`S3HmY*s5Nle>75ziD z{LQPJ3Yd=a1pEn|GBFxp$AE5npbQLXo_0D;?i#odk|<1_lr1 zI_pX#DeQCm;sV*9l2!8$5M)1v=25b%O@7OhighdkGI|;el26rY#9vokOg2|eQN%D9 z-=rL;8dhN8XsP`myKR^zbfPHD5GE)eAUiwc^ySZ*iGA~I$f*tUV{l+neT46VBy^yU zn2o~sWDXMdfyayFZ?o##7#9`IIaLM_FQ=^?`*FL7VmU>9AbH(Gl+_@zu0uU*^5JZuN4%0Y zKLlYuew>vcUd=;J!wffr@UuJp;CGYa7d%nQ5iAm!_AK}{{@~!f!*M)E>fQwm_sdAJ zeggL-N%QxO7y_Ht@ob7vfkJS-t}xl-p)PVZ5w`Bm{P~3U=J$hDD(rnxxhFb_;UH=v zGP0|C5JsjPx;~ch*-d52(md8HTIsozk9(#1P7hKo==kaPxER=-j+0jV-=Olg8|wuw zdV;taQs*k=0HRcV+utxKKVy}p#b| zyiCsJhSq#+2Ac+FEprzaHpa>33FmSCKO_wnsey@B&>( z7{o87rosUn63FM9(t-v!V{?!F=oANd1Hque@?X)Xx2w@<@pkXg%N^Ym<{>{?#Fgx? zyA&n&b$KPx{cmeb3%vZuk)ES>r&lF>uc9gmFLQlYrQfjfnWDd~>BXpy5Pxr*i-OI( z<$LP#$?Pv2euiNrNxTa3M}o9Q;Wn7!yH3OH9-sA*z9dM@rbNEUX^a-nn`j zQtL!DXob_s2`F-)@H3WVu+ggNnmf%m$)Rnk*YNdx#7GHb!dwKm3bdwWk@z*7W=;o$ zo-~H05FfR+)7D@bz3t}`qDfcc}li;k0AR{HHOahZe_nryA7^wDg-6mX1C{hq~$zeHdrK`2sV8^hH zEnw7kq=_#;w={?ijDDiDqfJ#7tJym_Ga&ZO;^>{>7dT6XX?-ewb#}k7X39S5WtDw8 zo8z|+f}OlDs+uWLAj){xV0dW5T$-b%!yTOM5#<=F5)xtfxENU(5$nt?l;kyN-dX(f z>{g1tG$;Sv;&vPUd&5)1&O9&5is;1sir(gV7Bn%gm$Z^U$eK`~HqLN@A=9CyjxZ6~ z{*iHE`)f`_2!8q5mO6juxR0f#g>kLW>s9em9^^f`;;Bz^Gp_A4-US8sw~jDodEYCW zJv1;}@V__bvhOQ{I{V;3{uR~?GnjNq@$26C6)|ZcGTkUstunzv#qV^o$Sct0p%w65 zIIQ_Bz<tDyrxQs!P>}8)kcMq~}P1!5ri5Tj9>03fEj)wJ)p( z#yg%$SwLu|%&%|cFjPs>^*!li%*4M0qkOZR_J@#+Ih~A55`RibDet{| zb=jOUI&SNUu6X{XSFR_y6g3tXV3ozhkPi`RXNL(DHHy{ZN==D_Z|n#z@S*=0jCDR% z#s>N%u!--c+kPJEeNy}c*{&0w~vb@KTQ3NR{jNAWHYv=pRic+H;M#XT^OHK)?2Pv#R*Vqo!FIc zatAiI(mv)el(eQ4Z&5cz=Fy;<{jdw~$J>nhldKB{+molc>HY(9JRvyGQOC1|xuK;{sXK$VIxp@N*r@PZq|y(CZ|=1@KEE+ zn@qd~tSlwsKRJZo5R3?6;Y(D3K*u-F_%7FsRjsG)6DK~n17ghrnlujeQ|%331p)bq z%wI`3>ZZGseKBDPz>H@P?1MS(a(8S-41o>fXnppEyW}Y1j0h%JT^-u1!HuT?!~x{* zlz$|~ndtLZ6D35PC2NknOu6m9;Eg!FICe!0Z5w31GFgkV?22QV2 zzp_8NIR9$l&L(*Xw@8rw6v)CjcIze+v%$1$uQgdwoBL=aRggV+PryTS09OnX``i$t z(`7panNEHSqcc@>uAHtlBk7?q=`-c6d4A0*6wS>u6k)2xJ!ZvsHE7Fzqjcc=GSo?> zOof%8Q|KG;x<#8+a!N^)@Hy>~P%XIsyw!;{^GspOYo(8*4C1f$EstrxT5!LqF08Kn zjK9ItmTV-yY!-Y2b}PHAN!I~s?kC&sD-jlcL)CnIPQMA`s`zKL{MUbQW1z9auS zQi!T5iBK9f{XhMF5lli4GDzbXW84UDJXd=+Z&}fj-;Rn5-(^}Gi3O66v6Q)Q{v~2N zVxt&Fu!P-hudV09Gub@+f>W4E z%_{q;pgB5uq zy9+x+jo!y~I7XFAK}X0;L@$Eje1nPnBg*NCjyjWoCR|yGSZXr~85V624bsK9$jtal zB=l1UfCAcW;*Kq9JIBTsN`pOXLMSroFcMJE!DR2>vSswJevY>WL_Xu_|6cjB0;0!W zQ1o{w_(ShM?MDxJzj>K<#&)jiw$vdZCX@VfSArhTLu(K;1&~^{%8HJQVK88xr>VGW zozl-Ya%mJ`&CPSMQ(ivV(Q0B)qxNajLMP=1vh>6(B67tqB5*{(#^6a=iPsBO=zWy* zv~iHJ#Kbnry>F+GO24ry$<5eARFl}9=8;}$Y03$EbjlfC?2OEcV0)1L4FlDK%lEqHd2QrYbReIi>f>ca#AGdcLOhT3S+m+!Qi%>|Gs8_XlIAfwW%W^_ 
zd{0S%`6#jCuFc0+XFevC&RqDm0oBK^>K7K*Lp9#TCyOeNt5ZC-R(F7Rvgf;!p%TU1 z$2pJpHv(;n%(utN=Q+hou70Uny?&+oRGR6eVC@L=WErgkt^UDl#ny_SM3fjl?Gta4 zAV5^cvy+4r)9)o9G~KHoScpI{wnRkRI0JO9X;A^`5A{CRi%ccg1Iu^aRmv1wh+Qwa z$!J&lukJW)w;oTTDrZ$tfLAFDPq^o1a(cuW(B=JS0bf^!>(vi}#lv{p__84AsQxDF z7@$$iq!F28>_Uw!{bPEO52>u!fRWikji&@yd)Gvz<)@SJx~US+W+{O>w4OAszOq=^ z*eI<`Lr|7_Y@UL!Kpw=Zsc+uNXS7{2yiwWi77Y2(HHL!lE7V>S4dP@vMBYM03|{?M zx#!ohm+2B7YOIEs^t{gtyeC?HE7?9X&G-Tvmma&^hGOi#QLdg$mOCaqr-5}Gty$`j zo=n$MF!P@)ltUS1ZtR8-Gbe!Aksic!Gh?8Ag)sRRs z@o;+*Q{rx^CHl)nV*8QYxl9O&7S&;0Pk&RdX8zTfHLg7c`FnFap@HHVa~bLJ8Rv1x zKXb|J2Um4eOos#d;l1uB-XWm8ws~i5{gKG1+Jzr*L4o*(m=`pBOQ^$6B2Dx0dfv3Z z8gM$4nx;$~Z)l#ixP)7w`4RT(dqXCsIFpj4ITrQex7X>Xx6HdUt)DnSn$BE~7S>5E z51;QEzXl?a_AL?+%@n_~Oqy5DvHF~qT63JFb$xl+AGNW3{=_o)06P~~;~P4bs}ucs zc07{}3QbpTIyn0W+Wie)Xc+!n>6g^l#J(ytj~gx}DZ)(>Hqc)zY3B&WAB5Pm1{6zE z5RSVSyu-@x`_0AB$|1NYGt0R#I4-K1`C`(ys#0rE=MK3upyL#>Rq8SER_NJ^I{N;G zM?&$1zA^-Z8BNG2dp?>UC$z$iNk9_y}>y zmk##xh*Y;_t8Rm4$Dv0J6hxn+@pQ}rUc(tYQ$2usp)qMFO zzrXbY(zg6!V6ApI2p<%?>W;5zvR{M|$HP*J6{#0e{+l;%GJ)0$PM4DO8X@wp3{{o2 z<7-39s3AsqdDf|FKiPaZC}p%~Bi|=U_IO#3T)N2vPl2s-On-?&XTAR|3Z-r|J^Y_U zq1vTlIVe?Xy5UBwN3-rfRxgZ~e;Hqc3q3>~4LW8olr6`mv7tfBLy6jWvsmYr%dwr& zJ<0-pxbl(*CdhD68(Ja=4^4;$=`%tObfL1r#2=%5TAFv3JMGAo7Td! zk&jO>JN$-kZi_DtaK$F?Yf5&seHKDAp&vaZKd7>5SC!C2JY@B}{&W|H>Snc4`NbcI z1mQfly2Z$w4`TO~GnySTC`UV=yIT=if_43O7@^wA(|>6zze3kjDCggLe;D6?r!se) zS19XS=hC73B@Z9Jqjvf;&Li}Z?lvN=aMI?zPy6P&e9TeLObT-AcUd)N}ks}H=skmQXRGNx@*$LM`~*GxvsL0(V16H_e+kYu;^>A zxC;)M;Iib*x667UTnIaMN|+U2K4&B9X$GZ%_UfK~3K;;pw8(fr`~v#X4Nk1OJF1vm z_oO&uAB;Hp3Ik5Hg1zV264Spbp!PP><{_7%D04_X%|u`|l&oB+Hq<&^?-Zfe{hT4< zA}7w2zw+JJkT7%evQ^L!d)-aoN9iXv9G&ZDe+cz%>7Imok?G%8t-)DJAhGwF-_}>W zk)`*}gTG9++6C42`qesq=98U=Nm#XyCwJ(?q^$|9Ito*IONi^q`6_57($~py3rKFq z*Fu4__adoduwJ38ffe-#&*b2(jLe(a6Cn>vXQ=_us~oZtXiC?aWCC4K>@3HG+lN`G z`bNd(kZqs;fEBobP*eVBnLkRc3kfHH3t#&ryK6Jtg|>MchdA=e>wH+m7k|FK6P=mu zkDm&y2V4jP__DH5agqX;n6dF1OYf%D%EAogRexH{3w(LrAyZ%Qy>X0-Ncupt@f!V5 ziySrvF5FeeCBz$J@gT8Vkbpb@ivSzfOcUz?TE51t!3{&|akiaxQ0RNr!k+i@iZ{Jw zVUC9;7=il^Z!dvh;wKWW8*ZO*s&ZufyaVfSQFmHwJa(Y+p;SU8h02owoa882wp%@L z>qXQ9Pih59+v$@MZGE%diBoMOQQZ2*Ii-ilNP&d+vf`C~1ysU?F0+&Q^~wQcChN!l z=#gR|x%xrk*6@p1z%L~xVrFXHF%Yi>Sh!vjW@S4Pk5#P---Cvd=&nH?_VOodx94fI z5ZY^=H;R)lXaZofe&*|HkVSlH zgOFhuxup-?Y45uih7Rqn-nml;A$z{i3un?f;<@aZB1jw%VelbTb2QCNWQmOk-&F&e zb{_AL8*C>``ceWggAxq`1*WE8-TSEav(N1F|qIjTNGb@t< zuey5s+EFFdlo0570eeomgZ{9i=O==6H3vEiQnVka^3)1iNCnp0Np7sl&XmX>l0J!CGL3&xWK%6`@lZKHC|K`TZ&KJg;AgOqat0C zGnW(sS}dZ%Cb@78fBwqpWDOXyrwMR^@2uE*)*=0?XJ|*nxieOehtOVMGva_|zfXJe zT$S)d2by$179|QuHU6##0@=*ggqe@HL1M|&YC;I3ecz6mvW&Rj@e>Ceb(XX(_l4x8`}$xcI9>rvwcOCzNqSY9ki^Q#ddae3n>{#owin zbgmPN)QSaH-VUP(s(T<<83#qea04y@E-Zc+iD~XOp(cWcsNY}}sR5*cfiQ6k1%Vg= zAXxYA)7TE%kazd3D0%FGRKdl0!Evy{=1}>Buz$dUFLVR?v*x1317sC0H-2L(eSb|1 zySTj>Tw2<99}ioeuuY^kMBMp0;OoJ=muQ({aa~EYcQ3mk6{;UB5GkAfy5 z>GG9rTAUu8#r4(MNvvV++Bj~%CFCIjGmW40Bz=DEd4kQlHoh+u7q}-b7#IOfxNfF# z?uNh{*|U26qE?&z@1Z>+O8y#6Z9_Gl4jntXj$Zk$_rb9Am3M+p1*yt`dCD?kHC%>6 zGCW8eEV?AKJ9!hz`yjr#NX3XDmN-ptp0cepvV7xKWX+p!Uj^6Eacf{FIkbcYlV@mO z?>dXzX*9%UtXL@AeGRPZ5gL!A@PqXF?Ad4uP3flj`*=}ne}@MIC;Cq+`BI9A1zsU( zB0_EcYi26h^*z45zv_5rG* z!GO~>k!jduI4##%<9U$?2UR?`FQYTU_yMTlh+*mnAx(#YXRrpFgg8I+o1rTwtY0P} zLZvT4x-v9)o&nNz$>@7wTwnZM*FiTe55s`+9x3KdNLBH$*Gu^t5`rr@jk`BaB$=*p z-H>Vq7;VvO{tLvlP*zLMAy>G8B%huWBB*wKiz6g zz|OdC527}vVm449o< zZtqJhPd|uxbec@@>DGr&%UgGI<)>p&hswA>@_jyKW}7)tH~^gKK-K+e?^nxeF3$p` zz-t9pVx}FgpEiXvAawh9iG+KFXazp>6qa`T^EM6nw9$BfCJ)|6Galj^7n<#%7I#OdiGyF0xI@d~WeB{}>wT4GJKtyd zuSi8qeRlS9{bsbCx(*QYb{gD~aludrIDMIFW)tgoZ5D5yu>;y>cQ!Tn@oHRk2)PfX 
z*Y-Gp-7mk-3op?WR6*8z(rk4`u7^f))iNYIjvtxaS$y>B$z)DI(4+KUybyle#DX%Q;MV?~2~c|M&)^)2enOa|5AVFG zHe7@|<;rMa4+E)1ERi^+>_+c7Ix7#o1ByE5p+jgu8;EjI5oLgtAj$-aLz8eHTt%Ff zX|WifK(|~_FgwyW)il(3mz8~&+BnQZ29i!_^493CYYcX@eXtc}=;K>u{ht zC)e(?MNPS=?hc8vJv|swK$T~OVX7RSRBN$E?a1O8^8F=xsBSwz(x!}X7YZFS zWbVgAg$Dh&+_D{6iXeqWMb&#j98a_rgI14rDcwv+|l@| z@s=ZcF|x%rPyXmssF6^8T9i;C_-iVi3#UG!`zD#|0@OlHmZ1fS5V328_`UbHJp zl|CuaaCr~+ay7xsAHl98f7zms`pHMKpIq##GM75LIK_Iser&5*kc7K@=z*p0O3y`LdxV$$rCRv<^vJ>;e#w(Oc1shR&!j-k6WF4Q zEgfuo{N?oqmxMyzAf?P|g8Nvz0og$E9NqmfH)9&P?fuLOTJx~}WMjhJx043Jk!w@c z5|k|k%@B+p<3*VGF#ye#T&EJc>1)xha$AMmnHh~Nu|(-h1#g>pQ-gi2TimaX=iy%x z;+42cYU{$!D!HzB8?7(DSkfqzCz_dFcZb)ZrB5AJn=f-!pN~_Al5iR3*N=ieXh zUiM+cU~rAaZw_x zn>F!FGx3`d3Pi45sVKEmS(f)}qcP7c*2HPeX(P0rpKdauD)+T^OM?nEljiQJpILkT zr?ppCeNcQ=8FRXbsec(gb;4^Oh1(`9FoWPtiI+w0y869dx51~DT8KER7i1{8Vk>L3 z2Z;B<4lV2u+PWj^bRym$k2m&-XnOUup?+?-E)W)))ccf&^yC!Y%DIfUSBDOi={u+* z<2SMk91&Sx-2*j;;GX$85~!0J!oNVnb7go_x4^5FS z4?!Y^lBq7@p-QsyIa3cRs_To!zG!``g0NbdtkA=sV0;2=&HW$^Zg7b9ax{zG^pxV- zr~8}SWr(}QR84r{z(;@2B?}YU*tG`k%3Jtg7M0ls9tFGf$85d%r4+yS7QbBZ#P#)^ zU$4r|^RCWtexz=jdKB`6RZ8$*m00k_*0?`abr#CT!frFv^UazixMp$;zUoF>+~ix` zdj3@`lsQL`-0|zC&5AoZm~M6a)TgQLkm@h6m3U3Oikr3DkNQvj?Hm|$m`Ikj-M|(#`f+^zGVvknMsX4hgOF^;uhR?%_zbW z0BqQ)#~8Z-4mvJ4S)^_$8bfz|I9zr5+F2N=zcqljv#ygWfIm3Ce6R*KSyBQnA1iz+ zISsB@oWE^!ABrKQ0!TsKG_SN8^pe;)`jHIYJiOOHS0K5&8>dH;F)P*~xLXln$1yIz6&%{uI3%Tc~5 zP8U41H5n*zZw>nNI#Tzi!mraW?(m{mHcQMjIZewk_&fnKwyU>6_xSu& zZsNa26f}(ljx=uEoBcRdTVoHv(@7r82Smc^n&_Yz2eK2yYK082+Oi}NuUQRC_D9Xp#A5~`(RZb& zTp9`(HIKrNZogq)PL$}nl)XxAmZ56_}cMqN(n$hA&t1- znji|VaIT4U^bEsTd0syM>h0?J!u!gJMcZjNoLJ%BAweSeVQ`C_Xg_{AZo=iMwi>@4 zIBWfGZk6VL6?J6I`1pnug}m{~C~6k9aEGabwu2H{0t9oG<@~X#uSLMR^12Pk&gZFM z9-{g@b3i-LTziDkcDVMz3Xp*}~u zVend~2oef1r|x{e$*M-Vdt0FqMibG|cl9q7^QuQyV_W^N>>-r4*U(f#qW@+hBwoHt z9Vf?hl!}F52*PG_>3o|}v5DVHINOJAH77^lLjPhy2p_{w6pws~)&#$ej^(m_Su=)@ z>%odbR?EpE=c#YWc3#hfPKQ}HpJ3~?NGsbmk@W}cOhe(yE7F4^?>2$SErh|55zcx+ zFPm%C96g04o~6cb_^ZdpVXV+w*7%R&V8Zi(S4F{KCxgAObw%ShjQDjxM5K>cIhYW$ zY)38N+({t&$rJf&PVne#sLWB8kYrYaEB}D2w7@L+j1@Uv-LA!?`Slb%5&5#J+3UMq z-RT``Z6N?0oG!w@9Fe0%a7>e5Q@@-a=+2U@k1pX@bVg!<(b?NOev)iTjFqOks_kUF z5_;i&rhu7p2D0LNyxKCEl4p1#iPoZF>&BaN^K@T31zNUKYce;hUdcl-4e$eVSBu(- zCSC&k4oJPR4YLnZ^zAT{JPMW0Y#bYiKNo23(Hh0hO@?Bf4^@6YMG=H0pyhhN)lyFF zFWhH{8_@iz-ba}HFzxr^{U23f05P<@b`=ey{(U9?qNRVoFV_BvyjmTT@cyq9?4R#j z`vlErmtW)z{^v=5f74dVQ?`3~HnH+QaN?P$jYc93!w%4IhWaBYS@(~>DX3af;G@1d zs`XQKej7%8-3u*BgeG!>m)<~kC1PU^t>5*hzllg3roXrGR!K+GetAcO4@A;?gB#W5 z>YVdJZRcf&r}0*Kj{ap9le-`oJ&qW`g?z#3jJB5Y_a)+_z}7|Jq`b5J8?E`LmW2hb z$wFcA!z4q!;Cra{QDCAQl`4L75q<=BvSPxAeh6k`b#tL&qY0bxA~JJXzba(*Vrr87n*g*ZcI#F zwlW`wPG&RiFTj_bYzGQa=>h6b*>)hb0>}DkRATbG05?+n?*({b#?te%9yrmZ&21V) z_{I!C_PbZ7C^sJ4Wd@@oL5RV~>+(M9{d#(Ge9rR=Gjx zkL@(O#PthOmL8Y1MJoj*_sKLmN8Z}i=%kL48)EN$h%Fta-XJ|kiR&C!l)CqBI>>kk zmmIbH(qvRQ-EJxp$R|5@eO}^Ze1@Lj=~~{dxu@?MWkYmUxtW{hpq2mdGpo_vdY*M1 z$eZRksg59RzF5Q}W$5sbAus+e`@rLzD#7+mmDRr4q{mX3(@AJjG@H9opi>~GTJE5K zzoead{cBmT91$UGo{Mel0jceyw&<&Vmjob(5_IltP*J}gQe(I@tM927QD_x7Z=M~V zZFX3<(~FF4EZfEMddClqZrpB_CSIIcBoOCHJ0xaJkF2!4PdB(vGTqtcnQqB6a_1GU zFS&CzZBZ4QA83w*bc;+V+5W?XhT&4oqcfg`y%p zJ)fb^?tpe+$a9m)$zVXYkCKS@eTm4f9G1P(sO31rkA-ECf*qMA&Fyiux}C8ZdH+ZN zk@$ar1^$e(BZBY#4DK55rs4u$vAw{-rK@U){FZy{@2Y4vudCVPRsGwDl5n`S?XoR3 zC@;n2P>}h(c+%ovr24pRY;*318WxTcR7Tw~t7shKm&y>*aj&JkeSfIPl5LAp#D8uK zRYMO@EH`$_(BP_>0~eYMRVzl7H&XShH-2fH#%bXomuX1I7i4^Ua>-~63Taoanwiy= zlfJ$fjj}7N>6)1NZNzytFdWooRtG0Re;!EC^2C{Wm}QEo0+OcB3=~O(zZqa_NG=kE zZ+&zo`xZ{u0=!74x)`eK&dxO%F=O)^^i~$+ct?HNB$}@IkOeKPWi8Os{(p4DlPMQ~ 
zf)ZcQOQ`W*J*vvr%N{t!v?BFEnmxzn7E6isMOSwHGqQZ_`Sy6sA%7whP2TGz)ohk2 z=Vi0{`V#ly67Ha{xl9*asR~cid-n)wQUMnnM7KP6jVQ`KF7CK_Hl9J)+$K_loY^gB zXxO|jl`Zkq44CaFb@T=I?V3CvLVp$x;=^FHu5NAXkG(1ohw2 zqZLAjCQqK zV0<(Hcsf_@pDPW@?}zMS6JBr)*IT77%S4-ve(Ylh(VE;(tbx~+>*Sp0&lXgzt9v{h zMu;ZY;;n&?(-oVW>AvWm>x#YXmj~{R1cypr3|cU*}&a(xk^(6$quhyY=rTvxx=bG&3w2+EM?t{vv0EN zQ3HVSU-eH1H=^UyD6lpB7|X0VqRT6TlnZ$j$WJP>QvK>}(>;lAe?7G;{O>*P7E!e2 zq0^^4%>xwPS1Y4qsknZ}<|I#K^5``H$8XT?6cc4z(sJ{i%hPR(27xAl>&Uia8Bfz} zw|JR5|D)iDKog5!;c~HU{_fM7^t+rmiG!~5zCdXZ?$^NNk2f-!AHAZlMhD>cg1d&2 z0Ad`4+Cf=`6KUYI)Sy!W0xU*~y6i)-IlFDc1%xRLT}G~q_QR?g$@Mk z${bKe8!BTX+A#seK>P!x1U_ROO$v<>+P(}pDhCVP7#P$)?P%=@^rSyS8f#^?ECHv? zkVdGEUMk#ADyDALS}(rWv=RK$^nl!xESF%43!R#i;m!~pKd1(HwH`nF13s~mj9USz zljKt~I-i;i1xj0;J$S9Fqph~2Smf9 zSQmF}MsP1#mPQAX;LHV9j@?zIAwNjrK@D_tihHZ8 z9K`B29x0g^`RC<}>U+aXo! z@hRvs)NyLsz{tCEiNI>vLjD6Rsbu*5gQ5>$t!;Cg#lDZvWAEOxU`? z@HSskKRHL8tXAK?&$isJFEx820^TA(A)R1FBkc&zEOxhrSdI7}$%^!i-U_b?<08_f zVA_ zrmkTSjAul@EPPm{742JQeu)231p-BLdTU?z{IIs&5GY04zFFgQ_~_@<$3Y3gw=1W;S0+O%?ap4vq$pB&OF`xzi)?5jVE3xj*bc1 zYXh}D)>lC8q8$=tB!y!`bGHM#^ zOVETc9)1aFSt+WjI)ePRX%_^NSQ{C^4sJe@YS6MQQuVE4Uk%QHB16SK|3-%XzxR9# zGL$>Ue@?$u#I)m=CXhyje}DcBL_hyihaL>)yzI#DAiiJ5vQ){6;Ah-fx|bF8NeAMN zQ~$&=zIKbOKbGKaF>4nn zvvwA?hjiTb9pR-mIY+*Jyh(Wfc0Cu`970uAp&M?{wQ3ZZ=_9vV;JeoQb$qma=6!Jt z0U08Z`NfOO!@dcN$RS#|w}jyNn1WRM^(oG>a@2Qq$NmOH>W1pl4_gXfgeq9FIUC|k zOD|ntXr{#Y>(JIQ^UYRXnUTA0^dBe=HiUk@D0@xxwkUf26{+(wu6Q239I=dpORS@> z>BVt$S}`8l>fE$)TpTR4mD_lqo69~*7-y+#cM0*nXqnw0KKGs_twU{la(RbnO1kZP zL9#p0hB&e4|Juf_)`JgacCMT^`Z)g}^>{D&Z43d=9{hW~2WYcP4KdyPro9Bv2hT5O zm~n|e;rI7Y+WNoSKQG=w;VvCrQvhY%O=W{D#w1c-@upm+eRLh-t`#t_v-SR>J6YJ! zT5tDBaqwTGJS1YM#9#&O2ucdjCD#$DTprnt@zh!6F@Hyh7Ew&CH3$!&nP~Rjd$|Oc zC6N4Xi@Wx1*t4LZL@BJ9lF%&G{JZ31b&Ij{U$1Ed0Xlugwh3GMr9=VkQ)~y)1Zh_s z;QGl|dePlK$8LqLppE1LHfXBX{p1+9#}x3DELjGHtaOG%JWUO4LQ2pQjICGEfw|H# z-@U{qUo^HnG+IYNH}HCVqs~IX%HiZ8M)t-yjLX8jP&?f^FJ*p#A-l3Vjo!hh4r}|` zs2s8L_jvdlb(%WX2qA1B=ZXc7_v-cKEu3ddr1K4Q=8+%SZkrSd&()ToS&qu19%KU%5a8W*?)TG zQ+9lMgFI)oGk=lb72F28ZVS;DUyFGJs!Hj^2`);&Vh+KMmU$Dxtky(LJ&n#8hH#K{#|sx>x4$8%iIy?;2Nkl7*i zy4LsM9)*Ko)YVOGmXUxLl~Lk->UrC-&$B|qKJc5A>lzXXK#Cut!xLDc^h!O@Z+`0 z1(#vr7i}_h&^@sf?u`ZQ;CnOGV&BuAa@RL^5|}u^Yzt^slJ}CeFNMbIdTsPu3@8yht|sO})hm}VK1$^op)ogf zv*w9+kbSBiyyXk6W7<{|ZK-l!v8d_&(fnjQsuF8h;_YgDOy|y*`slAmW1BBv&*AEH zsh?xdI}p+Hio_<2<|wKq$l4qFG^Oa+CVxWMGR;KHFF9k!G(~g9-(3*x&Q?&yy8&T- zv&5dbukZ{k-*d^|TQ^Pk0c>n>W@>6C{Ix$U`kZ9vRxC1l>tA<%pCym`5NHv`MppR% zr_citfyt@`$?|Q;yTnwA$ura}#vg?LZ@tte$-&=)HV{o$zfhucr#0BewIa{X4$eh# z68MKeTd7t5Tthfe;_uTu_EPF0bE;E$Q+Smuw^e?jL3G$UV_z!j4b4JLOUZpLc5ytj z7ip862T?vbwV^KFFCEolF#JZ=M&=$l-@H`B`vY0CIDox%I$jLtEP^t?c-^s7AzVXc zC=b^9d?alUYwFUeo)nDU)&%QPB>8|3_nkl7O$9E}d~-o)4vK^S9aq!V{7$6(x3Je? 
z1m=zb+9Tx^`)^N@pn5)hywVpvq)Vebr<9bKjYPD4nmCCa$-Y2L3kE>^;M8^bhX`Cf z-6l(rm{iDX?Z`d{PPeLh6x-_Ku+9TUE-{U=g3;E-=A)BcYgsU_WY9x8H;ky1WV-HW z6XG>B_AJzhjCHbAScBNEHt8RDO}`A5e^inFzd&|4zPtdY4BgQ+ddb+cd(e+&WSonz z)DIw93$fAnUT)=6sj0Jc=5|Bb6?v|abv07>%wnGOnE7ef~cDC2e9IB32-{F zpJpF$o+Hv4!T2l#X)1}YunJKW_h425-mu`b=g(kkhusj;VdgNjKhwE*=i`0%u52rT zQ9+)fsUNs6F9(3(P$%wLU%>u0^ZbQIDVicoztb_@tIhs>q|ax1D!u&=uRK>2`A5wWP@Z*S|Vew!d;Z|UvA}ZMUJggfIk;frWU;)fZ4y#CC1HXty z5P#qd%>Lu^#86qnG+TknXN|;HFRsusKj1W&Z_@{?P%Ga+-hvD+P@34K=yu^xHVbOvL$O?Uz{{!+=7{iZD(wPLpMeuc7PNN zt|5I-Ykf1<#^l={4z4I7eTj!E?`%WhgqI;jB3-S|Rx-fA|N8repOmWx!?C)W$D@#98;RyMUPPa9Y&ouV>m9ZS5`%!%jz??i-&iSJ`T)u>X zg9O0{)Iv7j{@@aUdA27ISj=Gt|0#%Ehs`W+G@6xROZ%$o!1~q7QZbIqPT|lmR=no{ zFuz|SfLAXKPAH(KczP+OJu?ko)SB^x3q$*T0D5fv4N_;OZQ>ZI`|m@XQfrJE2cgLV z6r=F}yyZFcEd*nP?nr+f`r~}i3yXk3UGTb;bQJybEB^We^t}Jq4_U9G@y=iAY;PRN zMsVTDJGeW(R?_)lV*;a~LF43B?Fgp>m{m+mNKrvcLn?p`{3O;n+ZRJ(TRQv-VkE>A zmU-3aF?T>2?#*FoJv}hr^`$jUFl-1FBX?zFqObD%kM@|45cjl-(1bprdgMdB1WN-2 zb1kyD=0-lggngmX$N-`jFtAUrLN7SY`hvYA<6dPrjZ&x^u<1eO}6)!1IGAO0a?^tN)7{+{dL1fT}25+>qOvMP3dj0@DwjY3>Qz zK91^s&H({hCfVDLUSe?VEv4yNx=y=!d9F67Y8u~-Hge^%=wUCVT+n1tv%>5^(q(6Z z&o?lf(Q%?MCC=Ra0mE%TKw5?%=@HJPA@yb*2Q_8cThOTO$b@$i^xtS@iUe+m8sg1ueYQlHX0`$KE*Y_~|oW z_iITMyV}>oC-Pzx>cw_%r3*to8^I*((L13_Z{5P}Bg?QMuT8d5tVF%7eXe?WNl(XZ zl*jNfOXBU&C{U4m4?Vs1228Oh$SB-7$9H_w5}&MY=re93vuGTfZ$gIQ)&Ib)bs1jq zMjg`d^T^gMFdX$C2g)C2?{5#Fw1!-FHD9u-j2!!im_;P*E?bG&I zi}9BL)bB!=6)mw;&YfUV=WI43N+Sw*Shj3R@%SKQ`=nq)7EebXSXzNI*MP?}2Y%gu zeW+zmZ(HU0soZrg|FvW*gVY6l>gl>3iFrD#qgRj_=G6J+-8V{9m*{d-w%X^%KTTW- zLL89o*j>^AWD8Im$qQOgLB~83mOoY)A5Jnmt+$>I#Z+$_h`fe6UkSM-@1KF&mfRsX zPO#Et89$K^J4wEvDa60#@gwhJG%o=SZFyrAizh4Anl3u1XvI4hvnzeqS_=zdtjS_& zO{nB@NO<|8`b;LfdCWU~+DhS{&j(m7*F=KugHw}k^?E)J?gF`1PtQg+)Xp2vE%qgW zZjMqYAtc;wem5AyqW21u z+t?yiqv7MbFOOcRl6>?J`59u3xK$jd)zj)kvQGoU@{FsytYkIi<23(R7&w#9U9vx& ziuN``h9u*C&_wcVYg#)9Fr3*C?fB@K%jIxro4)x5M@ch|Owja6VpL*LLxE)2eO^P$ z2whtNB{FCbg6%@-j73rY(GPR)Dv*=aA}<((18nh+MRpH#AWu%UoTfIVQmNV!;1e-l zx{ln@X8pnUNmQOVQ!iGMB-~fX&%jcwZRet!xvX3o6dEcqfevSJ~U*Pok($psJHMLBFpTz z+nThqvqo_tB1WR(EdpS+F|Av#&FpTfb~IRzmUgK#J5uPz(rRPpof|KZ!NQ&dQbe_E zNADd$8Ls+y)yiBazWUu`^pH{qe_Hs0OZvjidWAj#y6^4|Wcc+3njoua_(g6yZiw1g zxQ$H7feMCSdyGH1L3nq~s%zdw01O%d^oBaQs3Z+L86P}CeGS!vf?NV=LQ&JcAA6L* zO)+JMZBLV;iNtXCm8XM9V9y&f063_iu^)!Oi+RY5fWV(>`ht51^v>^uSD+pZI0Q7Z z>nI4OIRUl*_Ab zjg3E$TK+gMy+hWc`y5#=%_A6rHzV{5|&hQ;86gs7J8D z0;%~(Yt}8fnSD=o(DsCArzELQ43UKe?MH21=IXz)v{)_VPVzlFKi5ho*g6Z)GQQbn zR&P&hCqhGm9fOli-nxEYl?v`T9+ON2b1$Sm1M!*P__Cy5WPhb@z5P`yWs;7?Ff)lS zH8xmmPupKd(_QjfI1~&rXEhIUhfAXygyX3KiViiECRvtA7EZ|vl<&?Zd@0kZ0R9LH z@mw20nS4gOc09_IUYR5s@s0LCv2;o#WEs%vWQ3HdBk~z|!ketnQ(P%%S@oxQij;7K zqPUiD7S!@1?UpMR5A+1RJX5}^gArQPu<``=s1x5EOLCG@Tl=Ud@a>P*4;`o$&OsZe zaSw;NB!AU4JD>A#H6`d)O%Pil7mpk78rS%ybdOf6_D$04tSmZHMG}Q3hl7WF=rGC7 zOx241*!Cow9tg>*GDi^lr&}2jsO~(+dy<~H^Z$J5J;;F*(KVVeGi-1kCy`s0CZ<|s ze9s!Pou$Dw!kOYojg;O)j5brh7D^Y@qT4-q?{b@Hywi zN;YaJw%?0h?nN#0+n{QVG9L0o0~_V!t1v$2lw*SZ;xv=w;>}N5Pubs20!4gFCr~tg zOKA%F$6xVdW8cjv;dtMd3UHJpAED-cl1U#+Ko$=dAh=~Gwl>Q+A=U08=A&0x*J8;p zu~}N)5wo<-UVV3d2d}W|D^l{6)wAS)rag0QOQ|7)5umbBGgDA)C!zLoa{z?9^U{)1F-h7Ja19>tuuwQA| z7{w8`zo)$JJvQU-jo1I7%lAYTLcD*7;BZB>|WpP~La7wEORrHQ8geqFG}c9DW$3G({! 
zPdd}9Esqf5lCplN`FxIHevOxUh?@Jg-N>j2R z_bG5vsp>>pr3$_N+z7dBe-H++sBhSP#kVS|JwBRrx0Ebd-D&uHsI=1ogs!|)nQk_W z?b8wdLd?!W@}(_#|Kbm6!?*r)Ng0s9^y$@?)vzR%ASea@kAt%Up`El~#@ixgLmL4E10 z=|w_Pq-k)=!UT{VC#s;F7*zh@5+l~VvvduxN;>73i;=*1q8CR4WCaWDzaw{|g1oT&jIxRTG8Zbo7|>D!eaDOa(-p(a5- znli@Gvx8u<9KS|(gD7#e9nS>_v zU%@SWq$hg*EV`CE@-olYdvwdEldPR^28VCYPhIN1L{bTBvja<~>txlzmuID}+tMiR z(PzrZ1Oyy&6VX*2fHZRFd6js5PNGae$hwt*5Yh!I+;m%j1TNrl_m%_-B{nIZh;Hs( z`fpss3wmnb2Qn5p9^3F;SgPyI#cZ~tpl)Yu6mT9ha?Dyd`&G3AveYydvnCPeiZ?O` zUIHZ36U>3^5~Hk*X#}|7eXU8Va~Ys9SFllc%+@9XaNTONu3lkOaBgY=bBXH7n(GWh4hs>u1QB4RX%3?WS3YWDJG_NVUSmCIV|n*JgJLQ z(AYEUj*Ir4=%GdVBKQxqAVqLGCxpwOzr^yN*B!#g*cJAN@yme%|7weS^jiIvjCAq4{xvuX00ftov2-O^ysNt0wbbDr= zkVg~V=ElH=o}72t`1?i1LRUTY9Pn){EWE*L+AiSoA^e<`^^ebfb+a0~XoBBArE{1M79D88YmT zyq`;?rZsuuZs~V|eg4wu%pac&An$q5pY4aPyi!$jc`Muy0-EpQjAeUlv~ z_rBjKG-=xvYd3;WPm%b?BNI=DizDWMMXZ>Gr^hL`rc6VrT8tNt2k=l0ew9Z4AL~2s;W>k{n5s^XJUT;*9d<*CO(|08Woa zgn1s*Ior+NtTP%Gn-90{@o2H)hP!z+$bOeD-oIZcQQcpK>H8wUQym{VoaGaIwj+D9 za)i{3uc*eB)9LN^%p_FSWg?@}JU=y`kW%Kj7lpXPJNtR*+Zu+0M=*?XNoVY^%;c>5 z6;uT*zEur;0@gTXVXzY6)VaBAfWI7hgxq)G-GTk(p%sT6)tij}$5iEZNTR!*KmHp> z#pgklu~KVE5KV6!nn;8BmO=&BHiCPvwwyvZ!0G>csFLa?#1I3wEw`aF9W3lmyz7J? z){2T8(!*{e4wrRd_+d^#Jwr1+RkeOiflt(GE~s1eV0Zn6h;sLa$;RgssD0^%nQs^P zTok9^xDmdp=ITEAXl#aE^KDc(I%YlA0))0M33&_Ab1Kr`@1`PRs9zZ$U!z*!=*3ON z_;vnDSsyFYg2s~$$Ye||av<)rN;L~xY13g&?si*v^{GXs;)6Dakb*pkv5VtaKH*BI4_cP>f7ZJ zclKPzhb7o&B4?#F+t8GKumG@JoaQJmcvFql=4QqrPe0Kjy^q((z1baUgN8TuS&1I3 zf5cm8+`F0}i4M62h8B_`Q8Fh@a=%YyI$J|U1Q5cc%>rp(jHx|AhQWwYWfmBM*f~M_ zweuCnd!8bLBd0Lp=;P(R%EzPIC#zFDm3I`Z8Yvu$vY4uXC zyv5A82{TS6uWf+6E|x^&LVo!==e&`#xI)v&%^`nSR=gHRvCB4WUC0aj%)~sHcWT{` z1*jaM!!vD01S%J0-4Se%Hy%iZ2{07zBS&4|2jI|J{pgT za23$;C3usIih#RwBbv_Z!nVuL$mY2!eJ{GG#iIg<*S)?A-{g+WQ8K_>8!C(RRCC)K zJa1HR{=E6?9V(9Hkr$$3W^j9lrvF743nRtf+&CBI$-p=RMMs6?(u?Xhs|uPHVWl1| zR78WTyc3CPEY&Lt4SdN;ugBLRL25#3gQW{?oUuw|A|73w5ohmqb(PJCHl+;r+h<}c z`=H&3O0>>Mv%3^(Qk~2Utf= z1o^_oAES(c(43#WfHCcLmv zEx@)IgL$6cI||)HiI`SH7Whmw{iwd&LbMXh>%Cl!Bbn8(cffN#N%C|? 
zrCWA}RTq28Eg~KQcwX-P?VcR_=pA^UU%dPZl9Tgl{U)MKs$LDI8l)=xMNLa>uuCA% zzrDBXt(PN*urLo+R1UlC^N!8(TGskWGN5`qm-)IiuLbs>Yp*atK3c$kIg^MDZru zvnwp`|3p$S%2)r3qC9iGYJ)SB^Z&_ENM2j@G!^H?DH+|Z0#kC_L~qQD)i-xJ9``+xX*-Jy*M(HVx%UpIx9{a$II2Q~fn{?F zJ%_hKsLR(yn;7RSe+mFHF)&)#UhMLy#lvyb#6w5yU4TCpFz-!<>c+UGB)cFqu|j25 zhp0N&#kq-ww&w5brX{pD{9Q++UfM1_p5)A!hhReIk2Y(c&7hxy?FyCA77n!d6|he!|V@tis> zM-+aLSd9g{kuqzNV?jabU5@6gsE}bf!%Z^d^TKN8qEZK~jzYUh$`4b%HqM{1DsKnW z&a<&4bk5ZWmxfk3X8Bl|1WDDR`{L2Vs#x@u@g&Ql7XG9w-kiXPy^oRoRVmHBlDyt6 z1t*wE;Xdl*?1SJc-CU(1C${x)DUlvOyT58w$dg>1xJ_OZAjMA@a>(sb zr|DeL8bjkU*_^Yv`-p5nT%=;6tbc?iwf|Aj{VcXb{fI8^JiKv~Yw?q#Rc@&P) z8mUHEE_QS0XofTIzCcPO*EbTp672GmYpIRz8&2gUQZx2(79h1W)M+MaTOje&&xwv$ zQa|&6!rcguvUlG}3em0tH<2XC^wqK-eCeAL&qw=yF_3hFUweA>919j}u_iC*(^t;T zw)i8Sr1_$nBsZ+-rWUkn&7!|s0&Rzjy-EwPqqPe?mC-#ql(7vsf6|Z zf)T&P>vg{N%WgVBq`NAi=!}rHbl=n7Y47#yo>@G;i*Z~;=OG)pwsqymOK}4h)-1sk ztPQsw|0SnCCqEJlEiBK=!)5s7TISKs8DWpPdtVxesFDIC*+qsh8YzDYR=!dqDymmh z4j+dA$~~wMF+Y)|1tMP1!Zr5#EbI{CSgKs>|6$@QG*6NPCyLsy-_W*W8N@BB(6V6A zsJ!;=#D(6=!RBiCl+3TO$E9E(G!)-rS3+YoY$&qfprDU|;O$6QhYfpK+Si*0BgIl# z_z)S>97MG2P&PM^L;yIFf2XiDxT_<_*WvIN5<$~wlrrF*`t_aHv@xI9m7!|>7zG@} zCsAJ2C|QhJqth>*40l$4Vwp~{ipmDN32`yc%+wbt?IouSIbidBMS~?R<>AXTlpM}e zaRmK}ApOeAOBNAKu+ zkeYJ(A~tCtFY@Jkhw$l@p*iQJTx@S^R~J+kuR|-zT*#qy_gr|;J(Xpc{*Slne=p5Z zVP13=xo+U$d_yGvi#Z>bf1dkl&beYOrGTJQK0Y5w_#xTifmDmzW{}13&tHC4PIwfN>mAKU|rYjykK=01rqo}JR*#UWNS=jIr z7?Q@H+Wf|N>kx3|WwhN30xdEfr@ejhJGP}V2j$-M@A3e|Q7@4Fwh)t*p0v z75O&~fq(4v%*^>GE1rM1trI=+snU!adhst>lUntFd}mW$6e0e*O%tKSK&19#S~$j& zhyWZ=}}M)r%#Veh|uzIoi*gPBKEp2l2_DOauom z$u62Ts{a$XJbDa&M)m)!g?~P$eIWelInQdWO8@-Vzu)7+`wSP!{NFHY=iLLS+Ob~N z`*#~N+}L12`^WY87u@)*87Sc`yJM>@?Z4fUf3z9HfsiRl3Dgb&pZeG2P`8VQsAF~D zyiZ6eeo|6bb9a^+p&)&j7^|zw*sMIXa+cB06hLG#u56L32yVT6yIU40G1Sf_# z`l`(a(UzPH)J!cS-*7oQ;}Rz9&la%ldOi8wDY*hq23VZvlJdRZZ2dYvSzB40+Ov2y z%CvPE?0jv*u%gb#nk@11C8fnBqrhab*Ot-PuB(ZPr4 z4TC2VM_n!MyzSE0#Nc3Aqb8BD>u$uZZ`HNoo53>)1SL6w!J?x8nUR5TPycTNFaT+> zda(z8U76hMb-=b_8{a^LN+A=TL__98M?{AnHrJKQXn@}j1z(3>r?Yuys~2)@4Q#<3 z$5044OXD$Ppp!#@w7#+jJK|uR`LN`ev8)bDM;-^gg03m2vx8}`l%ecEym1fTkK6&3 z4>BSu{!iZ+$05uKs>^oWXUq4w_HMNdJYB<`QiPhlj8LFlFP1haUIvz3^y zNsjOs$0t#uvYPag5~g z?olF(GX}c~lt?x%LgdOC$%U4ZN}ewUP311@K%r%_Wn)mDFCpT7#`t2vCtF&!L;F-i zvVZ0pnLzBD2~N~yFK)!insBH5%Ai?pccswO49op4*o-FDi)?L_JZXhjXMk{8idE5; z^1K|-4YmIm(D@1d=o(?(7O$NaCYKDPW}A1p*pX;PT@YtDq|5k{sW@tSy!S>V{w8e4 z^HEP&*OxYfW8+!%%0deAa`e=#2Jh(lB8U0a8RV>Z5!06`}%y{ku ze!~ajerXt`ejhBhQG4FMv-IT=*SZG?H~zxJ^T08bhSpVNymCdsiS8|xjO-v;c&Sjz zfqDNu5@6sgS9)!MW}0Nn8Q4%)F&SN2O}6GO9#yK(mj#|nvotax-#k?d)VyQ`9q^dv zB|vdEyJ*>Eb;wU(NqFU*3BAHpJ>qpq&d)zrvA z>}4I&mhWc^99J*QX^;a1KWvd2wyYsa(R^a^AuA?{c5G^;kN3lNIX6c%$ZEc1kVUC-m~r1s`dDv?aYOC zUWXj(t(QYEli|8Z(CYrH-~kqfKAwFH~GW!EG9g= z$JGm|uLamV7C{6&!VHIQX-xV1&}FkLwlp1AL1X`N%dG zi4aWCyf8iQa%ip;X=P`<0k9YDXtlZ26qjS@4ub^x``{J~L4tzm?T=SkmL~8h^>29B zwdhv98{C^cm+KYH8Q#v(meX}KOa{N}LItC;8rx10Q|pE^>P!@9kV`B!F6yCC!PCxO z!UI^Myxb5t2fN3=93=CH9u!4&o(yX}O(?D5*caHzWn$#h^U#qD@Vp}>Qz3s2o0ZLqhz?E`c zn;bb2sy(s>2YF5?ZLaEXDaEAZM6yf(M)xA776neiYlvohMtr$%gsAA0#SIMxk{mAv zvgh{*5oPd&u`W9u=W-Smj(0pMT#m?h5Niy0eBtZ9VhTT%5ex6t))UUVrmMA|_uWm! 
znkj(%0H{l$mT+~1JC9F!lxr1g`P8^Jv|VQ6<#e3PmFufon^XU+EsS_9v+_neO(dNd z13T+pkMvjdAvu;}=&olo@N&%@!p`__S9RAHDjoij4g5;zaCJ5Qi6JG0{}Danhh;M= z@xaCkQ%8e=csBZEeG2X6l|jc zLc_B7>{p879*w$kwdbUBQ_Lis^sEBT;9izTxNt!_MKW|7I!t2Ah3;`VugB$jhRHQe zX?K2-^+YDvNP2HvW7jB8lXnB-q|E3W77Y+wqC$V-A@sxL`+(;aI>?+!n@(!LsCZ#3 zfuvz(((iDd`*DgK7yjS7q!&NN!!xw=GbDw2VVgyeDwLtBC^on{q-9rYY7|*?d)`VG z;n&Ts`RcAMt$jkn#p;rjBY`!hg!Dbh?^2BjrnFsr8{MoLbMRK5bt^?fsVkktAp%~e zp+WdW{FXSL=ci3$c1F>1CdEQja|$vNP7JAc#TwJO1%H{(wGP`ST67X{d&aymKYh@N z^?t}V7r6e>e-c7>DC8l8m`k%7@pxVlGOn_g{9&Qjuv#zb%ij7bcFbij6i?9`YvGH4(+vu z<=;BDJ<+V)thz^@<@!pStniCdV+>8k2fmiJR<*tkqlvWDchEHj(`Lz)%k%9H+tH0Y zZQcd|_o~b~LG&4z)FpH2l|Usw??UIpL;oa)xTkDA{tb_B>}kz&~PIHD>AU_URFIgWGqX zm)W80;nE*7_aYP(4~7}z@Ex{l^SqtgKnQ|oTu z(C=4jxZijA6_ywo$=$Lwl{8n6?@@vO>hgea^*Ig^*lcX%8aau=tz*+p?CXe3r>ci8 z|6Y5;=(iY#Eba`%*xWcN<4RN_A_J#16}m8d;cC23?=i8u3&I#%F5F~Ciu!?**w;$F z1wUtXf8yr2v)b!uvf!#aqL6pD3#oD~e(h=Qt1o|A7J$dMaWnw(iZZF6JE#)9{k>{E zd<}|U$XIQtiO|$lTb>t%E85Vg>C3eQ^`pr|qtN-*Ky0bCKMM`RhM2V(+e#Kzcz<*T z@Zgu7iVC6KrkaZZ>(vh`CmKBQHQDuAu<%Do@`=10ImBMm!{#AD~qLKBhc}9OEfUS z#ru?Og~v9$Gj9jJUwnguCAwx2chEw?#G!_aApm{~7G?_xpT*xcN!-BtZAyO{OvMriAs7 z@&|J8-!H|taQjmU0j*V)e92(dGCr)HiDO1JU~IlU{1Z5!JmfLAR&v*LcL$O=vZgwD zE7ufG%LI>0J;(IEPG~PbIy}xmQtapOXzN{2D$%6TZkAlQb_UChLz0j(~@^ud6 zncx58&#d=LZj>^CwydxJM}~>T^-htr$$1y+yQNi$XTqeXt>>!hWgEeb(*8fV`T_GP zE>#wv^dt{B5@l>-0XbiT-Q4k}Yduw|>nx4UaO;ezvo_!D?A9STk2Yw`DAe=QE1eI9 zlepZ)Mn*-eWLjqIb35y6JjCN3Tz)QAb1$A~JoR+6%c`Cf897e2R>y)(Oz>z=ck9gS zrxxR3VD|EK9I+euJBF+sJ9wP#d>uILn)EY)N33UXK5R&eDPE2jGVhLm&A)?nbVk{) zWqPE3_sf>e@P?jsmf5a^W97d8Fr6!f0yk0RhO}0@oBN~jwX`TtNed%?*HqonHTtD4GgAG3Bl2z|$JVNW&CJzA-o zQGPNq*)?3d=Q+<2&hYN2&kVEkyWn6bICp%{_8i&*Jl&Kdpgo%~w5 zW8Y2|{{eLW%XAc;eW>1onqevO%io}3N&YRxP18zN-IJthZ8YN!-yK(lIyojC6;8rf zkX;z~DYwb;R4B7$yR~|!uVD`6!s+n*vK+?hrXS0hdvx=2G0`#8<1$PdpK zBiX7lM4LF$ciN}lcTRp|W5LxxY0J1F!yD;!Y#Uxfy^uqv$8y;musB^}Nju#+%n6y% zY4ttJUGQdJZ-p6lK(I4kb=n*Fs7smJmO6&fFV4iLsCfiW0K~In?ha1rPW*v`8HZC& zlWk$e``=0lw#q|>KN_-7Onu?Ucw=>Zmc4XCE;rIsDFR1a`5F$?Yz8}Wncz#m zoNoSACkeSF(3E4>A`o_|(KA&Qo7BKfc{&~6<1c*;CLq0Q@RgsuH?phLPM($fSrZYJa(0G=7TC`%c;?CGUgnI;8w(6HHn7lH%J;c;|ds?W4uj`kI66r66) zhkP-Ts@h4mKR5eS_`}Uh(sVFOf;Il znMs7^{NLqhV0Cqz&n3*ETwoBuw%e_fx3&RcqgSVz86GyD4+-047kafp z42Ht+JfH2v!csf?Niw+efAwmZU`w{|8ED89^q?#?xxJURs}%;e%cD+r)mNJ@pT=th z@%L~#1s|`XMXAG6c_KA>-;b*VtA!J>k})|PkcXY??$oeziSqffDzAQsl^e-F+dHC> z1;4(3k-+NB%~hGTRULziiMP*n!gm{=4=NKy|AfGQzOS#Hzen->8>+qL>uUw4L1*@q>$&O;dt04LZxuq&UK*j<> zQCJN&13wnbB@nrN6kpp6O{cf|@^>c1BE(6<2e)2Mv24JKJ2pY`cQVCtbMxNve@Yw0LzY}Buk7)f?TJ5qBhSRl- z;}isK#_??j-Q4QEbpDJ6+M5EJ;5kM`#~##vt4RH_#%yDT@u|qWAL62&Rf~NR{%SG1 z59-(75&nc>U^V70^|evC^P4Q5xi5(dOZyp7Y#M?Q|H+=jd$5yOd81vdr0bY@41>UUqbXf>*PsjMac9P zKH~;bKQ_jKHa8_ka?gOssj95$i#InkFh>c;`8>5}B&L$)^BKV>9c^^*9FuUylG&1F zPJ1g!3}ssVKu4#IO5L4?UZeA6=eG8WKxh{Lh7Q`V&MK4?Ab{%)%Ew1m&t+-?jw9uB z=c$2m;<|6M;5K*wA58o>T5^X#ON(O7Q=!{?GrjKlr9#W`c+~9;pW*8u#&U`cXtAQI z1*R)wyDCPp`7z)=p)&dlTnH+R9PTCeSKl6Vao8)L*;<(6)21VDu-T;^FLAomD0HG? 
[git binary patch: base85-encoded PNG image data for the site omitted; not human-readable]
literal 0
HcmV?d00001

diff --git a/site/index.html b/site/index.html
new file mode 100644
index 0000000000000..aecaea525166e
--- /dev/null
+++ b/site/index.html
@@ -0,0 +1,171 @@
+---
+layout: default
+---
Apache Arrow

Powering Columnar In-Memory Analytics

Join Mailing List


Fast

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single Instruction, Multiple Data) operations included in modern processors, for native vectorized optimization of analytical data processing. The columnar layout of data also allows for better use of CPU caches by placing all data relevant to a column operation in as compact a format as possible.
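As a hedged aside (not part of this page), the cache point above can be seen even from NumPy: a field sliced out of row-oriented storage is strided, while a true columnar copy is contiguous, which is what lets the CPU vectorize over it.

```python
import numpy as np

# Row-oriented storage: one record per element, fields interleaved in memory.
rows = np.zeros(1_000_000, dtype=[("price", "f8"), ("qty", "i4")])

price_strided = rows["price"]                         # record-sized stride
price_columnar = np.ascontiguousarray(price_strided)  # compact columnar copy

assert price_strided.strides[0] > price_columnar.strides[0]
total = price_columnar.sum()  # reduction over contiguous, SIMD-friendly memory
```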

Flexible

Arrow acts as a new high-performance interface between various systems. It also focuses on supporting a wide variety of industry-standard programming languages: implementations in Java, C, C++, and Python are underway, and more languages are expected soon.
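A hedged sketch of that interface role, written against today's pyarrow API (the 0.2-era calls differed): data crosses the pandas/Arrow boundary without a bespoke serialization format in between.

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"city": ["NYC", "SF"], "pop": [8_400_000, 880_000]})

table = pa.Table.from_pandas(df)   # pandas -> Arrow columnar memory
round_tripped = table.to_pandas()  # Arrow -> pandas

assert round_tripped.equals(df)
```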
Standard

Apache Arrow is backed by key developers of 13 major open source projects (Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, and Storm), making it the de facto standard for columnar in-memory analytics.

Developer Mailing List

Developer Resources

Arrow is still early in development.

  • Source Code (http) (git)
  • Issue Tracker (JIRA)
  • Chat Room (Slack)

Latest release

Apache Arrow 0.2.0 is an early release, and the APIs are still evolving. The metadata and physical data representation should be fairly stable, as we have spent time finalizing the details (see the IPC sketch below).

  • source release
  • tag apache-arrow-0.2.0
  • java artifacts on maven central
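As a sketch of that stability claim, the metadata and physical layout are exactly what travel through Arrow's IPC stream format; a round trip looks like this in today's pyarrow (API names assumed from current releases, not the 0.2 API):

```python
import pyarrow as pa

batch = pa.record_batch({"id": [1, 2, 3], "name": ["a", "b", "c"]})

sink = pa.BufferOutputStream()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)   # schema metadata + body buffers, as-is

with pa.ipc.open_stream(sink.getvalue()) as reader:
    restored = reader.read_all()

assert restored.num_rows == 3
```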
Performance Advantage of Columnar In-Memory

[figure: SIMD performance illustration]

Advantages of a Common Data Layer

[figure: today, each system has its own internal memory format]

  • Each system has its own internal memory format
  • 70-80% CPU wasted on serialization and deserialization (see the sketch after this list)
  • Similar functionality implemented in multiple projects
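To make the serialization bullet concrete, here is a small hedged timing sketch (JSON stands in for any row-oriented wire format; the 70-80% figure is the page's claim, not something this snippet measures exactly):

```python
import json
import time

# Two systems exchanging 100k records over a row-oriented wire format
# pay a full serialize + deserialize pass on every hop.
rows = [{"a": i, "b": float(i)} for i in range(100_000)]

start = time.perf_counter()
wire = json.dumps(rows)   # producer serializes
_ = json.loads(wire)      # consumer deserializes
print(f"round trip: {time.perf_counter() - start:.3f}s, {len(wire):,} bytes")
```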

[figure: with a common data layer, all systems use the same memory format]

  • All systems utilize the same memory format
  • No overhead for cross-system communication
  • Projects can share functionality (e.g., a Parquet-to-Arrow reader; see the sketch after this list)
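The shared-functionality bullet, sketched with the modern pyarrow.parquet module (module layout assumed from today's pyarrow, not the 0.2-era API): one Parquet reader produces Arrow memory that any Arrow-consuming system can use directly.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})
pq.write_table(table, "example.parquet")   # one shared writer...

shared = pq.read_table("example.parquet")  # ...one shared reader, Arrow out
assert shared.equals(table)
```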
Committers

Name               | Alias (email is <alias>@apache.org)
-------------------|-------------------------------------
Jacques Nadeau     | jacques
Todd Lipcon        | todd
Ted Dunning        | tdunning
Michael Stack      | stack
P. Taylor Goetz    | ptgoetz
Julian Hyde        | jhyde
Reynold Xin        | rxin
James Taylor       | jamestaylor
Julien Le Dem      | julien
Jake Luciani       | jake
Jason Altekruse    | json
Alex Levenson      | alexlevenson
Parth Chandra      | parthc
Marcel Kornacker   | marcel
Steven Phillips    | smp
Hanifi Gunes       | hg
Abdelhakim Deneche | adeneche
Wes McKinney       | wesm
David Alves        | dralves
Ippokratis Pandis  | ippokratis
Uwe L. Korn        | uwe
+ + + + diff --git a/site/scripts/sync_format_docs.sh b/site/scripts/sync_format_docs.sh new file mode 100755 index 0000000000000..4b50f9d1707d0 --- /dev/null +++ b/site/scripts/sync_format_docs.sh @@ -0,0 +1,23 @@ +#!/bin/bash +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. +# + +set -ex + +cp -r $(dirname "$BASH_SOURCE")/../../format _docs/format From 6239abd1a61fc254818548a7b6ee3f8a88777a7f Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 24 Apr 2017 15:58:19 -0400 Subject: [PATCH 0560/1644] ARROW-862: [Python] Simplify README landing documentation to direct users and developers toward the documentation Also migrates DEVELOPMENT.md to the Sphinx docs Author: Wes McKinney Closes #584 from wesm/ARROW-862 and squashes the following commits: 50049dd [Wes McKinney] Revise python/README.md. Move DEVELOPMENT.md to Sphinx docs. Other cleaning 2187c1c [Wes McKinney] Migrate DEVELOPMENT.md to sphinx docs --- python/DEVELOPMENT.md | 207 ---------------------------- python/README.md | 71 ++-------- python/doc/source/development.rst | 215 ++++++++++++++++++++++++++++++ python/doc/source/index.rst | 1 + python/doc/source/install.rst | 117 ++-------------- 5 files changed, 236 insertions(+), 375 deletions(-) delete mode 100644 python/DEVELOPMENT.md create mode 100644 python/doc/source/development.rst diff --git a/python/DEVELOPMENT.md b/python/DEVELOPMENT.md deleted file mode 100644 index 7f08169d613f0..0000000000000 --- a/python/DEVELOPMENT.md +++ /dev/null @@ -1,207 +0,0 @@ - - -## Developer guide for conda users - -### Linux and macOS - -#### System Requirements - -On macOS, any modern XCode (6.4 or higher; the current version is 8.3.1) is -sufficient. - -On Linux, for this guide, we recommend using gcc 4.8 or 4.9, or clang 3.7 or -higher. 
You can check your version by running - -```shell -$ gcc --version -``` - -On Ubuntu 16.04 and higher, you can obtain gcc 4.9 with: - -```shell -$ sudo apt-get install g++-4.9 -``` - -Finally, set gcc 4.9 as the active compiler using: - -```shell -export CC=gcc-4.9 -export CXX=g++-4.9 -``` - -#### Environment Setup and Build - -First, let's create a conda environment with all the C++ build and Python -dependencies from conda-forge: - -```shell -conda create -y -q -n pyarrow-dev \ - python=3.6 numpy six setuptools cython pandas pytest \ - cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib \ - brotli jemalloc -c conda-forge -source activate pyarrow-dev -``` - -Now, let's clone the Arrow and Parquet git repositories: - -```shell -mkdir repos -cd repos -git clone https://github.com/apache/arrow.git -git clone https://github.com/apache/parquet-cpp.git -``` - -You should now see - -```shell -$ ls -l -total 8 -drwxrwxr-x 12 wesm wesm 4096 Apr 15 19:19 arrow/ -drwxrwxr-x 12 wesm wesm 4096 Apr 15 19:19 parquet-cpp/ -``` - -We need to set a number of environment variables to let Arrow's build system -know about our build toolchain: - -``` -export ARROW_BUILD_TYPE=release - -export BOOST_ROOT=$CONDA_PREFIX -export BOOST_LIBRARYDIR=$CONDA_PREFIX/lib - -export FLATBUFFERS_HOME=$CONDA_PREFIX -export RAPIDJSON_HOME=$CONDA_PREFIX -export THRIFT_HOME=$CONDA_PREFIX -export ZLIB_HOME=$CONDA_PREFIX -export SNAPPY_HOME=$CONDA_PREFIX -export BROTLI_HOME=$CONDA_PREFIX -export JEMALLOC_HOME=$CONDA_PREFIX -export ARROW_HOME=$CONDA_PREFIX -export PARQUET_HOME=$CONDA_PREFIX -``` - -Now build and install the Arrow C++ libraries: - -```shell -mkdir arrow/cpp/build -pushd arrow/cpp/build - -cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \ - -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX \ - -DARROW_PYTHON=on \ - -DARROW_BUILD_TESTS=OFF \ - .. -make -j4 -make install -popd -``` - -Now build and install the Apache Parquet libraries in your toolchain: - -```shell -mkdir parquet-cpp/build -pushd parquet-cpp/build - -cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \ - -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX \ - -DPARQUET_BUILD_BENCHMARKS=off \ - -DPARQUET_BUILD_EXECUTABLES=off \ - -DPARQUET_ZLIB_VENDORED=off \ - -DPARQUET_BUILD_TESTS=off \ - .. - -make -j4 -make install -popd -``` - -Now, build pyarrow: - -```shell -cd arrow/python -python setup.py build_ext --build-type=$ARROW_BUILD_TYPE --with-parquet --inplace -``` - -You should be able to run the unit tests with: - -```shell -$ py.test pyarrow -================================ test session starts ================================ -platform linux -- Python 3.6.1, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 -rootdir: /home/wesm/arrow-clone/python, inifile: -collected 198 items - -pyarrow/tests/test_array.py ........... -pyarrow/tests/test_convert_builtin.py ..................... -pyarrow/tests/test_convert_pandas.py ............................. -pyarrow/tests/test_feather.py .......................... -pyarrow/tests/test_hdfs.py sssssssssssssss -pyarrow/tests/test_io.py .................. -pyarrow/tests/test_ipc.py ........ -pyarrow/tests/test_jemalloc.py ss -pyarrow/tests/test_parquet.py .................... -pyarrow/tests/test_scalars.py .......... -pyarrow/tests/test_schema.py ......... -pyarrow/tests/test_table.py ............. -pyarrow/tests/test_tensor.py ................ - -====================== 181 passed, 17 skipped in 0.98 seconds ======================= -``` - -### Windows - -First, make sure you can [build the C++ library][1]. 
- -Now, we need to build and install the C++ libraries someplace. - -```shell -mkdir cpp\build -cd cpp\build -set ARROW_HOME=C:\thirdparty -cmake -G "Visual Studio 14 2015 Win64" ^ - -DCMAKE_INSTALL_PREFIX=%ARROW_HOME% ^ - -DCMAKE_BUILD_TYPE=Release ^ - -DARROW_BUILD_TESTS=off ^ - -DARROW_PYTHON=on .. -cmake --build . --target INSTALL --config Release -cd ..\.. -``` - -After that, we must put the install directory's bin path in our `%PATH%`: - -```shell -set PATH=%ARROW_HOME%\bin;%PATH% -``` - -Now, we can build pyarrow: - -```shell -cd python -python setup.py build_ext --inplace -``` - -#### Running C++ unit tests with Python - -Getting `python-test.exe` to run is a bit tricky because your `%PYTHONPATH%` -must be configured given the active conda environment: - -```shell -set CONDA_ENV=C:\Users\wesm\Miniconda\envs\arrow-test -set PYTHONPATH=%CONDA_ENV%\Lib;%CONDA_ENV%\Lib\site-packages;%CONDA_ENV%\python35.zip;%CONDA_ENV%\DLLs;%CONDA_ENV% -``` - -Now `python-test.exe` or simply `ctest` (to run all tests) should work. - -[1]: https://github.com/apache/arrow/blob/master/cpp/doc/Windows.md \ No newline at end of file diff --git a/python/README.md b/python/README.md index ed008ea975d21..816fbf0c85daf 100644 --- a/python/README.md +++ b/python/README.md @@ -18,78 +18,31 @@ This library provides a Pythonic API wrapper for the reference Arrow C++ implementation, along with tools for interoperability with pandas, NumPy, and other traditional Python scientific computing packages. -### Development details - -This project is layered in two pieces: - -* arrow_python, a library part of the main Arrow C++ project for Python, - pandas, and NumPy interoperability -* Cython extensions and pure Python code under pyarrow/ which expose Arrow C++ - and pyarrow to pure Python users +## Installing -#### PyArrow Dependencies: - -To build pyarrow, first build and install Arrow C++ with the Python component -enabled using `-DARROW_PYTHON=on`, see -(https://github.com/apache/arrow/blob/master/cpp/README.md) . These components -must be installed either in the default system location (e.g. `/usr/local`) or -in a custom `$ARROW_HOME` location. +Across platforms, you can install a recent version of pyarrow with the conda +package manager: ```shell -mkdir cpp/build -pushd cpp/build -cmake -DARROW_PYTHON=on -DCMAKE_INSTALL_PREFIX=$ARROW_HOME .. -make -j4 -make install -``` - -If you build with a custom `CMAKE_INSTALL_PREFIX`, during development, you must -set `ARROW_HOME` as an environment variable and add it to your -`LD_LIBRARY_PATH` on Linux and OS X: - -```bash -export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ARROW_HOME/lib -``` - -5. **Python dependencies: numpy, pandas, cython, pytest** - -#### Build pyarrow and run the unit tests - -```bash -python setup.py build_ext --inplace -py.test pyarrow -``` - -To change the build type, use the `--build-type` option or set -`$PYARROW_BUILD_TYPE`: - -```bash -python setup.py build_ext --build-type=release --inplace +conda install pyarrow -c conda-forge ``` -To pass through other build options to CMake, set the environment variable -`$PYARROW_CMAKE_OPTIONS`. - -#### Build the pyarrow Parquet file extension +On Linux, you can also install binary wheels from PyPI with pip: -To build the integration with [parquet-cpp][1], pass `--with-parquet` to -the `build_ext` option in setup.py: - -``` -python setup.py build_ext --with-parquet install +```shell +pip install pyarrow ``` -Alternately, add `-DPYARROW_BUILD_PARQUET=on` to the general CMake options. 
+### Development details -``` -export PYARROW_CMAKE_OPTIONS=-DPYARROW_BUILD_PARQUET=on -``` +See the [Development][2] page in the documentation. -#### Build the documentation +### Building the documentation ```bash pip install -r doc/requirements.txt python setup.py build_sphinx -s doc/source ``` -[1]: https://github.com/apache/parquet-cpp \ No newline at end of file +[1]: https://github.com/apache/parquet-cpp +[2]: https://github.com/apache/arrow/blob/master/python/doc/source/development.rst \ No newline at end of file diff --git a/python/doc/source/development.rst b/python/doc/source/development.rst new file mode 100644 index 0000000000000..01add1142642a --- /dev/null +++ b/python/doc/source/development.rst @@ -0,0 +1,215 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +.. currentmodule:: pyarrow +.. _development: + +*********** +Development +*********** + +Developing with conda +===================== + +Linux and macOS +--------------- + +System Requirements +~~~~~~~~~~~~~~~~~~~ + +On macOS, any modern XCode (6.4 or higher; the current version is 8.3.1) is +sufficient. + +On Linux, for this guide, we recommend using gcc 4.8 or 4.9, or clang 3.7 or +higher. You can check your version by running + +.. code-block:: shell + + $ gcc --version + +On Ubuntu 16.04 and higher, you can obtain gcc 4.9 with: + +.. code-block:: shell + + $ sudo apt-get install g++-4.9 + +Finally, set gcc 4.9 as the active compiler using: + +.. code-block:: shell + + export CC=gcc-4.9 + export CXX=g++-4.9 + +Environment Setup and Build +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +First, let's create a conda environment with all the C++ build and Python +dependencies from conda-forge: + +.. code-block:: shell + + conda create -y -q -n pyarrow-dev \ + python=3.6 numpy six setuptools cython pandas pytest \ + cmake flatbuffers rapidjson boost-cpp thrift-cpp snappy zlib \ + brotli jemalloc -c conda-forge + source activate pyarrow-dev + +Now, let's clone the Arrow and Parquet git repositories: + +.. code-block:: shell + + mkdir repos + cd repos + git clone https://github.com/apache/arrow.git + git clone https://github.com/apache/parquet-cpp.git + +You should now see + + +.. code-block:: shell + + $ ls -l + total 8 + drwxrwxr-x 12 wesm wesm 4096 Apr 15 19:19 arrow/ + drwxrwxr-x 12 wesm wesm 4096 Apr 15 19:19 parquet-cpp/ + +We need to set some environment variables to let Arrow's build system know +about our build toolchain: + +.. code-block:: shell + + export ARROW_BUILD_TYPE=release + export ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX + export PARQUET_BUILD_TOOLCHAIN=$CONDA_PREFIX + +Now build and install the Arrow C++ libraries: + +.. 
code-block:: shell + + mkdir arrow/cpp/build + pushd arrow/cpp/build + + cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \ + -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX \ + -DARROW_PYTHON=on \ + -DARROW_BUILD_TESTS=OFF \ + .. + make -j4 + make install + popd + +Now, optionally build and install the Apache Parquet libraries in your +toolchain: + +.. code-block:: shell + + mkdir parquet-cpp/build + pushd parquet-cpp/build + + cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \ + -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX \ + -DPARQUET_BUILD_BENCHMARKS=off \ + -DPARQUET_BUILD_EXECUTABLES=off \ + -DPARQUET_ZLIB_VENDORED=off \ + -DPARQUET_BUILD_TESTS=off \ + .. + + make -j4 + make install + popd + +Now, build pyarrow: + +.. code-block:: shell + + cd arrow/python + python setup.py build_ext --build-type=$ARROW_BUILD_TYPE \ + --with-parquet --with-jemalloc --inplace + +If you did not build parquet-cpp, you can omit ``--with-parquet``. + +You should be able to run the unit tests with: + +.. code-block:: shell + + $ py.test pyarrow + ================================ test session starts ==================== + platform linux -- Python 3.6.1, pytest-3.0.7, py-1.4.33, pluggy-0.4.0 + rootdir: /home/wesm/arrow-clone/python, inifile: + collected 198 items + + pyarrow/tests/test_array.py ........... + pyarrow/tests/test_convert_builtin.py ..................... + pyarrow/tests/test_convert_pandas.py ............................. + pyarrow/tests/test_feather.py .......................... + pyarrow/tests/test_hdfs.py sssssssssssssss + pyarrow/tests/test_io.py .................. + pyarrow/tests/test_ipc.py ........ + pyarrow/tests/test_jemalloc.py ss + pyarrow/tests/test_parquet.py .................... + pyarrow/tests/test_scalars.py .......... + pyarrow/tests/test_schema.py ......... + pyarrow/tests/test_table.py ............. + pyarrow/tests/test_tensor.py ................ + + ====================== 181 passed, 17 skipped in 0.98 seconds =========== + +Windows +======= + +First, make sure you can `build the C++ library `_. + +Now, we need to build and install the C++ libraries someplace. + +.. code-block:: shell + + mkdir cpp\build + cd cpp\build + set ARROW_HOME=C:\thirdparty + cmake -G "Visual Studio 14 2015 Win64" ^ + -DCMAKE_INSTALL_PREFIX=%ARROW_HOME% ^ + -DCMAKE_BUILD_TYPE=Release ^ + -DARROW_BUILD_TESTS=off ^ + -DARROW_PYTHON=on .. + cmake --build . --target INSTALL --config Release + cd ..\.. + +After that, we must put the install directory's bin path in our ``%PATH%``: + +.. code-block:: shell + + set PATH=%ARROW_HOME%\bin;%PATH% + +Now, we can build pyarrow: + +.. code-block:: shell + + cd python + python setup.py build_ext --inplace + +Running C++ unit tests with Python +---------------------------------- + +Getting ``python-test.exe`` to run is a bit tricky because your +``%PYTHONPATH%`` must be configured given the active conda environment: + +.. code-block:: shell + + set CONDA_ENV=C:\Users\wesm\Miniconda\envs\arrow-test + set PYTHONPATH=%CONDA_ENV%\Lib;%CONDA_ENV%\Lib\site-packages;%CONDA_ENV%\python35.zip;%CONDA_ENV%\DLLs;%CONDA_ENV% + +Now ``python-test.exe`` or simply ``ctest`` (to run all tests) should work. diff --git a/python/doc/source/index.rst b/python/doc/source/index.rst index ecb8e8f4830f3..55b4efc79bc3f 100644 --- a/python/doc/source/index.rst +++ b/python/doc/source/index.rst @@ -35,6 +35,7 @@ structures. 
:caption: Getting Started install + development pandas filesystems parquet diff --git a/python/doc/source/install.rst b/python/doc/source/install.rst index 278b466941a6f..a2a6520be4884 100644 --- a/python/doc/source/install.rst +++ b/python/doc/source/install.rst @@ -37,115 +37,14 @@ Install the latest version from PyPI: pip install pyarrow .. note:: - Currently there are only binary artifcats available for Linux and MacOS. - Otherwise this will only pull the python sources and assumes an existing - installation of the C++ part of Arrow. - To retrieve the binary artifacts, you'll need a recent ``pip`` version that - supports features like the ``manylinux1`` tag. - -Building from source --------------------- - -First, clone the master git repository: - -.. code-block:: bash - - git clone https://github.com/apache/arrow.git arrow - -System requirements -~~~~~~~~~~~~~~~~~~~ - -Building pyarrow requires: - -* A C++11 compiler - - * Linux: gcc >= 4.8 or clang >= 3.5 - * OS X: XCode 6.4 or higher preferred - -* `CMake `_ - -Python requirements -~~~~~~~~~~~~~~~~~~~ - -You will need Python (CPython) 2.7, 3.4, or 3.5 installed. Earlier releases and -are not being targeted. - -.. note:: - This library targets CPython only due to an emphasis on interoperability with - pandas and NumPy, which are only available for CPython. - -The build requires NumPy, Cython, and a few other Python dependencies: - -.. code-block:: bash - - pip install cython - cd arrow/python - pip install -r requirements.txt - -Installing Arrow C++ library -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -First, you should choose an installation location for Arrow C++. In the future -using the default system install location will work, but for now we are being -explicit: - -.. code-block:: bash - - export ARROW_HOME=$HOME/local - -Now, we build Arrow: - -.. code-block:: bash - - cd arrow/cpp - - mkdir dev-build - cd dev-build - - cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME .. - - make - - # Use sudo here if $ARROW_HOME requires it - make install - -To get the optional Parquet support, you should also build and install -`parquet-cpp `_. -Install `pyarrow` -~~~~~~~~~~~~~~~~~ - - -.. code-block:: bash - - cd arrow/python - - # --with-parquet enables the Apache Parquet support in PyArrow - # --with-jemalloc enables the jemalloc allocator support in PyArrow - # --build-type=release disables debugging information and turns on - # compiler optimizations for native code - python setup.py build_ext --with-parquet --with-jemalloc --build-type=release install - python setup.py install - -.. warning:: - On XCode 6 and prior there are some known OS X `@rpath` issues. If you are - unable to import pyarrow, upgrading XCode may be the solution. - -.. note:: - In development installations, you will also need to set a correct - ``LD_LIBRARY_PATH``. This is most probably done with - ``export LD_LIBRARY_PATH=$ARROW_HOME/lib:$LD_LIBRARY_PATH``. - - -.. code-block:: python + Currently there are only binary artifacts available for Linux and MacOS. + Otherwise this will only pull the python sources and assumes an existing + installation of the C++ part of Arrow. To retrieve the binary artifacts, + you'll need a recent ``pip`` version that supports features like the + ``manylinux1`` tag. - In [1]: import pyarrow +Installing from source +---------------------- - In [2]: pyarrow.array([1,2,3]) - Out[2]: - - [ - 1, - 2, - 3 - ] +See :ref:`development`. 
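As a quick smoke test after either install path, the snippet removed from the old install page above still applies (a sketch; exact output formatting varies by version):

```python
import pyarrow

# Build a small in-memory Arrow array from a Python list.
arr = pyarrow.array([1, 2, 3])
print(arr)  # three int64 values
```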
From eaf2118efc56823d93dbd57c7e9afdb1d904ac2f Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Mon, 24 Apr 2017 13:29:20 -0700 Subject: [PATCH 0561/1644] ARROW-887: add default value to units for backward compatibility Author: Julien Le Dem Closes #592 from julienledem/ARROW-887 and squashes the following commits: bc49f8a [Julien Le Dem] ARROW-887: add default value to units for backward compatibility --- format/Schema.fbs | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/format/Schema.fbs b/format/Schema.fbs index ff6119931dd34..b48859f50eea2 100644 --- a/format/Schema.fbs +++ b/format/Schema.fbs @@ -100,7 +100,7 @@ enum DateUnit: short { /// leap seconds), where the values are evenly divisible by 86400000 /// * Days (32 bits) since the UNIX epoch table Date { - unit: DateUnit; + unit: DateUnit = MILLISECOND; } enum TimeUnit: short { SECOND, MILLISECOND, MICROSECOND, NANOSECOND } @@ -109,8 +109,8 @@ enum TimeUnit: short { SECOND, MILLISECOND, MICROSECOND, NANOSECOND } /// - SECOND and MILLISECOND: 32 bits /// - MICROSECOND and NANOSECOND: 64 bits table Time { - unit: TimeUnit; - bitWidth: int; + unit: TimeUnit = MILLISECOND; + bitWidth: int = 32; } /// Time elapsed from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC. From f00e2ab590ad8f04409e7bc09f70622e73ebd741 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Tue, 25 Apr 2017 08:45:01 +0200 Subject: [PATCH 0562/1644] ARROW-890: [GLib] Add GArrowMutableBuffer Author: Kouhei Sutou Closes #596 from kou/glib-mutable-buffer and squashes the following commits: 73c2663 [Kouhei Sutou] [GLib] Support running tests on Ubuntu 14.04 d211a22 [Kouhei Sutou] [GLib] Add GArrowMutableBuffer --- c_glib/arrow-glib/buffer.cpp | 97 +++++++++++++++++++++++++++++- c_glib/arrow-glib/buffer.h | 52 ++++++++++++++++ c_glib/arrow-glib/buffer.hpp | 2 + c_glib/test/test-buffer.rb | 7 +++ c_glib/test/test-mutable-buffer.rb | 38 ++++++++++++ 5 files changed, 195 insertions(+), 1 deletion(-) create mode 100644 c_glib/test/test-mutable-buffer.rb diff --git a/c_glib/arrow-glib/buffer.cpp b/c_glib/arrow-glib/buffer.cpp index 9853e896b3dcc..5fc3b077a1cdb 100644 --- a/c_glib/arrow-glib/buffer.cpp +++ b/c_glib/arrow-glib/buffer.cpp @@ -28,10 +28,16 @@ G_BEGIN_DECLS /** * SECTION: buffer - * @short_description: Buffer class + * @section_id: buffer-classes + * @title: Buffer classes + * @include: arrow-glib/arrow-glib.h * * #GArrowBuffer is a class for keeping data. Other classes such as * #GArrowArray and #GArrowTensor can use data in buffer. + * + * #GArrowBuffer is immutable. + * + * #GArrowMutableBuffer is mutable. */ typedef struct GArrowBufferPrivate_ { @@ -182,6 +188,27 @@ garrow_buffer_get_data(GArrowBuffer *buffer) return data; } +/** + * garrow_buffer_get_mutable_data: + * @buffer: A #GArrowBuffer. + * + * Returns: (transfer full) (nullable): The data of the buffer. If the + * buffer is imutable, it returns %NULL. The data is owned by the + * buffer. You should not free the data. + * + * Since: 0.3.0 + */ +GBytes * +garrow_buffer_get_mutable_data(GArrowBuffer *buffer) +{ + auto arrow_buffer = garrow_buffer_get_raw(buffer); + if (!arrow_buffer->is_mutable()) { + return NULL; + } + return g_bytes_new_static(arrow_buffer->mutable_data(), + arrow_buffer->size()); +} + /** * garrow_buffer_get_size: * @buffer: A #GArrowBuffer. 
@@ -271,6 +298,65 @@ garrow_buffer_slice(GArrowBuffer *buffer, gint64 offset, gint64 size) return garrow_buffer_new_raw(&arrow_buffer); } + +G_DEFINE_TYPE(GArrowMutableBuffer, \ + garrow_mutable_buffer, \ + GARROW_TYPE_BUFFER) + +static void +garrow_mutable_buffer_init(GArrowMutableBuffer *object) +{ +} + +static void +garrow_mutable_buffer_class_init(GArrowMutableBufferClass *klass) +{ +} + +/** + * garrow_mutable_buffer_new: + * @data: (array length=size): Data for the buffer. + * They aren't owned by the new buffer. + * You must not free the data while the new buffer is alive. + * @size: The number of bytes of the data. + * + * Returns: A newly created #GArrowMutableBuffer. + * + * Since: 0.3.0 + */ +GArrowMutableBuffer * +garrow_mutable_buffer_new(guint8 *data, gint64 size) +{ + auto arrow_buffer = std::make_shared(data, size); + return garrow_mutable_buffer_new_raw(&arrow_buffer); +} + +/** + * garrow_mutable_buffer_slice: + * @buffer: A #GArrowMutableBuffer. + * @offset: An offset in the buffer data in byte. + * @size: The number of bytes of the sliced data. + * + * Returns: (transfer full): A newly created #GArrowMutableBuffer that + * shares data of the base #GArrowMutableBuffer. The created + * #GArrowMutableBuffer has data start with offset from the base + * buffer data and are the specified bytes size. + * + * Since: 0.3.0 + */ +GArrowMutableBuffer * +garrow_mutable_buffer_slice(GArrowMutableBuffer *buffer, + gint64 offset, + gint64 size) +{ + auto arrow_parent_buffer = garrow_buffer_get_raw(GARROW_BUFFER(buffer)); + auto arrow_buffer = + std::make_shared(arrow_parent_buffer, + offset, + size); + return garrow_mutable_buffer_new_raw(&arrow_buffer); +} + G_END_DECLS GArrowBuffer * @@ -288,3 +374,12 @@ garrow_buffer_get_raw(GArrowBuffer *buffer) auto priv = GARROW_BUFFER_GET_PRIVATE(buffer); return priv->buffer; } + +GArrowMutableBuffer * +garrow_mutable_buffer_new_raw(std::shared_ptr *arrow_buffer) +{ + auto buffer = GARROW_MUTABLE_BUFFER(g_object_new(GARROW_TYPE_MUTABLE_BUFFER, + "buffer", arrow_buffer, + NULL)); + return buffer; +} diff --git a/c_glib/arrow-glib/buffer.h b/c_glib/arrow-glib/buffer.h index 83e1d0d66bf28..5334614c151c9 100644 --- a/c_glib/arrow-glib/buffer.h +++ b/c_glib/arrow-glib/buffer.h @@ -62,6 +62,7 @@ GArrowBuffer *garrow_buffer_new (const guint8 *data, gboolean garrow_buffer_is_mutable (GArrowBuffer *buffer); gint64 garrow_buffer_get_capacity (GArrowBuffer *buffer); GBytes *garrow_buffer_get_data (GArrowBuffer *buffer); +GBytes *garrow_buffer_get_mutable_data(GArrowBuffer *buffer); gint64 garrow_buffer_get_size (GArrowBuffer *buffer); GArrowBuffer *garrow_buffer_get_parent (GArrowBuffer *buffer); @@ -73,4 +74,55 @@ GArrowBuffer *garrow_buffer_slice (GArrowBuffer *buffer, gint64 offset, gint64 size); + +#define GARROW_TYPE_MUTABLE_BUFFER \ + (garrow_mutable_buffer_get_type()) +#define GARROW_MUTABLE_BUFFER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_MUTABLE_BUFFER, \ + GArrowMutableBuffer)) +#define GARROW_MUTABLE_BUFFER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_MUTABLE_BUFFER, \ + GArrowMutableBufferClass)) +#define GARROW_IS_MUTABLE_BUFFER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), GARROW_TYPE_MUTABLE_BUFFER)) +#define GARROW_IS_MUTABLE_BUFFER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), GARROW_TYPE_MUTABLE_BUFFER)) +#define GARROW_MUTABLE_BUFFER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_MUTABLE_BUFFER, \ + GArrowMutableBufferClass)) + +typedef struct _GArrowMutableBuffer 
GArrowMutableBuffer; +#ifndef __GTK_DOC_IGNORE__ +typedef struct _GArrowMutableBufferClass GArrowMutableBufferClass; +#endif + +/** + * GArrowMutableBuffer: + * + * It wraps `arrow::MutableBuffer`. + */ +struct _GArrowMutableBuffer +{ + /*< private >*/ + GArrowBuffer parent_instance; +}; + +#ifndef __GTK_DOC_IGNORE__ +struct _GArrowMutableBufferClass +{ + GArrowBufferClass parent_class; +}; +#endif + +GType garrow_mutable_buffer_get_type(void) G_GNUC_CONST; + +GArrowMutableBuffer *garrow_mutable_buffer_new (guint8 *data, + gint64 size); +GArrowMutableBuffer *garrow_mutable_buffer_slice(GArrowMutableBuffer *buffer, + gint64 offset, + gint64 size); + G_END_DECLS diff --git a/c_glib/arrow-glib/buffer.hpp b/c_glib/arrow-glib/buffer.hpp index 00dd3de3bfd50..1337d9ed596f9 100644 --- a/c_glib/arrow-glib/buffer.hpp +++ b/c_glib/arrow-glib/buffer.hpp @@ -25,3 +25,5 @@ GArrowBuffer *garrow_buffer_new_raw(std::shared_ptr *arrow_buffer); std::shared_ptr garrow_buffer_get_raw(GArrowBuffer *buffer); + +GArrowMutableBuffer *garrow_mutable_buffer_new_raw(std::shared_ptr *arrow_buffer); diff --git a/c_glib/test/test-buffer.rb b/c_glib/test/test-buffer.rb index 6bb96714c8283..9f76a805f7577 100644 --- a/c_glib/test/test-buffer.rb +++ b/c_glib/test/test-buffer.rb @@ -16,6 +16,8 @@ # under the License. class TestBuffer < Test::Unit::TestCase + include Helper::Omittable + def setup @data = "Hello" @buffer = Arrow::Buffer.new(@data) @@ -35,6 +37,11 @@ def test_data assert_equal(@data, @buffer.data.to_s) end + def test_mutable_data + require_gi(3, 1, 2) + assert_nil(@buffer.mutable_data) + end + def test_size assert_equal(@data.bytesize, @buffer.size) end diff --git a/c_glib/test/test-mutable-buffer.rb b/c_glib/test/test-mutable-buffer.rb new file mode 100644 index 0000000000000..df62dcf1e8d15 --- /dev/null +++ b/c_glib/test/test-mutable-buffer.rb @@ -0,0 +1,38 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestMutableBuffer < Test::Unit::TestCase + def setup + @data = "Hello" + @buffer = Arrow::MutableBuffer.new(@data) + end + + def test_mutable? + assert do + @buffer.mutable? 
+ end + end + + def test_mutable_data + assert_equal(@data, @buffer.mutable_data.to_s) + end + + def test_slice + sliced_buffer = @buffer.slice(1, 3) + assert_equal(@data[1, 3], sliced_buffer.data.to_s) + end +end From 1a73c352d023dfa0e8aca4c16f3e421745524ea8 Mon Sep 17 00:00:00 2001 From: Steven Phillips Date: Tue, 25 Apr 2017 11:25:03 -0700 Subject: [PATCH 0563/1644] ARROW-895: Fix lastSet in fillEmpties() and copyFrom() Author: Steven Phillips Closes #601 from StevenMPhillips/fillEmpties4 and squashes the following commits: 4707673 [Steven Phillips] ARROW-895: Fix lastSet in fillEmpties() and copyFrom() --- .../templates/NullableValueVectors.java | 5 ++- .../apache/arrow/vector/TestValueVector.java | 38 +++++++++++++++++++ 2 files changed, 42 insertions(+), 1 deletion(-) diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index 178d5bd913910..31adc2bdd0789 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -393,6 +393,7 @@ public void copyFrom(int fromIndex, int thisIndex, ${className} from){ if (!fromAccessor.isNull(fromIndex)) { mutator.set(thisIndex, fromAccessor.get(fromIndex)); } + <#if type.major == "VarLen">mutator.lastSet = thisIndex; } public void copyFromSafe(int fromIndex, int thisIndex, ${valuesName} from){ @@ -401,6 +402,7 @@ public void copyFromSafe(int fromIndex, int thisIndex, ${valuesName} from){ values.copyFromSafe(fromIndex, thisIndex, from); bits.getMutator().setSafeToOne(thisIndex); + <#if type.major == "VarLen">mutator.lastSet = thisIndex; } public void copyFromSafe(int fromIndex, int thisIndex, ${className} from){ @@ -409,6 +411,7 @@ public void copyFromSafe(int fromIndex, int thisIndex, ${className} from){ bits.copyFromSafe(fromIndex, thisIndex, from.bits); values.copyFromSafe(fromIndex, thisIndex, from.values); + <#if type.major == "VarLen">mutator.lastSet = thisIndex; } public final class Accessor extends BaseDataValueVector.BaseAccessor <#if type.major = "VarLen">implements VariableWidthVector.VariableWidthAccessor { @@ -532,7 +535,7 @@ private void fillEmpties(int index){ while(index > bits.getValueCapacity()) { bits.reAlloc(); } - lastSet = index; + lastSet = index - 1; } @Override diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java index e6e49ab8d9341..63543b0932908 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestValueVector.java @@ -21,6 +21,7 @@ import static org.apache.arrow.vector.TestUtils.newVector; import static org.junit.Assert.assertArrayEquals; import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertNull; import static org.junit.Assert.assertTrue; import java.nio.charset.Charset; @@ -473,9 +474,46 @@ public void testFillEmptiesNotOverfill() { vector.getMutator().setSafe(4094, "hello".getBytes(), 0, 5); vector.getMutator().setValueCount(4095); + assertEquals(4096 * 4, vector.getFieldBuffers().get(1).capacity()); } } + @Test + public void testCopyFromWithNulls() { + try (final NullableVarCharVector vector = newVector(NullableVarCharVector.class, EMPTY_SCHEMA_PATH, MinorType.VARCHAR, allocator); + final NullableVarCharVector vector2 = newVector(NullableVarCharVector.class, EMPTY_SCHEMA_PATH, MinorType.VARCHAR, allocator)) { + 
vector.allocateNew(); + + for (int i = 0; i < 4095; i++) { + if (i % 3 == 0) { + continue; + } + byte[] b = Integer.toString(i).getBytes(); + vector.getMutator().setSafe(i, b, 0, b.length); + } + + vector.getMutator().setValueCount(4095); + + vector2.allocateNew(); + + for (int i = 0; i < 4095; i++) { + vector2.copyFromSafe(i, i, vector); + } + + vector2.getMutator().setValueCount(4095); + + for (int i = 0; i < 4095; i++) { + if (i % 3 == 0) { + assertNull(vector2.getAccessor().getObject(i)); + } else { + assertEquals(Integer.toString(i), vector2.getAccessor().getObject(i).toString()); + } + + } + + + } + } } From 0bee8040e29ebbb4542bc267804f56dcf7feaf4e Mon Sep 17 00:00:00 2001 From: Steven Phillips Date: Tue, 25 Apr 2017 11:36:32 -0700 Subject: [PATCH 0564/1644] ARROW-888: Transfer ownership of buffer in BitVector transferTo() Author: Steven Phillips Closes #594 from StevenMPhillips/bitVectorOwnership and squashes the following commits: 117f6b2 [Steven Phillips] ARROW-888: Transfer ownership of buffer in BitVector transferTo() --- .../org/apache/arrow/vector/BitVector.java | 6 +- .../vector/TestBufferOwnershipTransfer.java | 65 +++++++++++++++++++ 2 files changed, 66 insertions(+), 5 deletions(-) create mode 100644 java/vector/src/test/java/org/apache/arrow/vector/TestBufferOwnershipTransfer.java diff --git a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java index ed574333beacd..82cbd47d75816 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/BitVector.java @@ -251,11 +251,7 @@ public TransferPair makeTransferPair(ValueVector to) { public void transferTo(BitVector target) { target.clear(); - if (target.data != null) { - target.data.release(); - } - target.data = data; - target.data.retain(1); + target.data = data.transferOwnership(target.allocator).buffer; target.valueCount = valueCount; clear(); } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestBufferOwnershipTransfer.java b/java/vector/src/test/java/org/apache/arrow/vector/TestBufferOwnershipTransfer.java new file mode 100644 index 0000000000000..fa657875d6d92 --- /dev/null +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestBufferOwnershipTransfer.java @@ -0,0 +1,65 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + *
+ * http://www.apache.org/licenses/LICENSE-2.0 + *
+ * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.arrow.vector; + +import static org.junit.Assert.assertEquals; + +import org.apache.arrow.memory.BufferAllocator; +import org.apache.arrow.memory.RootAllocator; +import org.junit.Test; + +public class TestBufferOwnershipTransfer { + + @Test + public void testTransferFixedWidth() { + BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); + BufferAllocator childAllocator1 = allocator.newChildAllocator("child1", 100000, 100000); + BufferAllocator childAllocator2 = allocator.newChildAllocator("child2", 100000, 100000); + + NullableIntVector v1 = new NullableIntVector("v1", childAllocator1); + v1.allocateNew(); + v1.getMutator().setValueCount(4095); + + NullableIntVector v2 = new NullableIntVector("v2", childAllocator2); + + v1.makeTransferPair(v2).transfer(); + + assertEquals(0, childAllocator1.getAllocatedMemory()); + assertEquals(5 * 4096, childAllocator2.getAllocatedMemory()); + } + + @Test + public void testTransferVariableidth() { + BufferAllocator allocator = new RootAllocator(Integer.MAX_VALUE); + BufferAllocator childAllocator1 = allocator.newChildAllocator("child1", 100000, 100000); + BufferAllocator childAllocator2 = allocator.newChildAllocator("child2", 100000, 100000); + + NullableVarCharVector v1 = new NullableVarCharVector("v1", childAllocator1); + v1.allocateNew(); + v1.getMutator().setSafe(4094, "hello world".getBytes(), 0, 11); + v1.getMutator().setValueCount(4001); + + NullableVarCharVector v2 = new NullableVarCharVector("v2", childAllocator2); + + v1.makeTransferPair(v2).transfer(); + + assertEquals(0, childAllocator1.getAllocatedMemory()); + int expected = 8*4096 + 4*4096 + 4096; + assertEquals(expected, childAllocator2.getAllocatedMemory()); + } +} From 68decb6f33cb1ed262006d4b237137e36f89057c Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 25 Apr 2017 14:54:18 -0400 Subject: [PATCH 0565/1644] ARROW-865: [Python] Add unit tests validating Parquet date/time type roundtrips Requires PARQUET-915 https://github.com/apache/parquet-cpp/pull/311 Author: Wes McKinney Closes #595 from wesm/ARROW-865 and squashes the following commits: db16940 [Wes McKinney] Add tests for auto-casted types, and unsupported nanosecond time 475fa3f [Wes McKinney] Fix test case fad3934 [Wes McKinney] Update test case da96a38 [Wes McKinney] Add failing Parquet test case. 
Enable same-type-size cases in pandas_convert.cc --- cpp/src/arrow/python/pandas_convert.cc | 2 +- cpp/src/arrow/python/type_traits.h | 48 +++++++++++++++++++++ cpp/src/arrow/util/stl.h | 2 +- python/pyarrow/tests/test_ipc.py | 3 +- python/pyarrow/tests/test_parquet.py | 60 ++++++++++++++++++++++++++ 5 files changed, 112 insertions(+), 3 deletions(-) diff --git a/cpp/src/arrow/python/pandas_convert.cc b/cpp/src/arrow/python/pandas_convert.cc index 636a3fd15c044..9f65af41bb294 100644 --- a/cpp/src/arrow/python/pandas_convert.cc +++ b/cpp/src/arrow/python/pandas_convert.cc @@ -444,7 +444,7 @@ inline Status PandasConverter::ConvertData(std::shared_ptr* data) { // Handle LONGLONG->INT64 and other fun things int type_num_compat = cast_npy_type_compat(PyArray_DESCR(arr_)->type_num); - if (traits::npy_type != type_num_compat) { + if (numpy_type_size(traits::npy_type) != numpy_type_size(type_num_compat)) { return Status::NotImplemented("NumPy type casts not yet implemented"); } diff --git a/cpp/src/arrow/python/type_traits.h b/cpp/src/arrow/python/type_traits.h index 26b15bdc9f464..b6761ae0d2611 100644 --- a/cpp/src/arrow/python/type_traits.h +++ b/cpp/src/arrow/python/type_traits.h @@ -15,6 +15,8 @@ // specific language governing permissions and limitations // under the License. +// Internal header + #include "arrow/python/platform.h" #include @@ -24,6 +26,7 @@ #include "arrow/builder.h" #include "arrow/type.h" +#include "arrow/util/logging.h" namespace arrow { namespace py { @@ -224,5 +227,50 @@ struct arrow_traits { static constexpr bool supports_nulls = true; }; +static inline int numpy_type_size(int npy_type) { + switch (npy_type) { + case NPY_BOOL: + return 1; + case NPY_INT8: + return 1; + case NPY_INT16: + return 2; + case NPY_INT32: + return 4; + case NPY_INT64: + return 8; +#if (NPY_INT64 != NPY_LONGLONG) + case NPY_LONGLONG: + return 8; +#endif + case NPY_UINT8: + return 1; + case NPY_UINT16: + return 2; + case NPY_UINT32: + return 4; + case NPY_UINT64: + return 8; +#if (NPY_UINT64 != NPY_ULONGLONG) + case NPY_ULONGLONG: + return 8; +#endif + case NPY_FLOAT16: + return 2; + case NPY_FLOAT32: + return 4; + case NPY_FLOAT64: + return 8; + case NPY_DATETIME: + return 8; + case NPY_OBJECT: + return sizeof(void*); + default: + DCHECK(false) << "unhandled numpy type"; + break; + } + return -1; +} + } // namespace py } // namespace arrow diff --git a/cpp/src/arrow/util/stl.h b/cpp/src/arrow/util/stl.h index bfce111ff8a22..d58689b748896 100644 --- a/cpp/src/arrow/util/stl.h +++ b/cpp/src/arrow/util/stl.h @@ -20,7 +20,7 @@ #include -#include +#include "arrow/util/logging.h" namespace arrow { diff --git a/python/pyarrow/tests/test_ipc.py b/python/pyarrow/tests/test_ipc.py index 81213ede3151e..02040678958ed 100644 --- a/python/pyarrow/tests/test_ipc.py +++ b/python/pyarrow/tests/test_ipc.py @@ -158,7 +158,8 @@ def run(self): connection.close() def get_result(self): - return(self._schema, self._table if self._do_read_all else self._batches) + return(self._schema, self._table if self._do_read_all + else self._batches) def setUp(self): # NOTE: must start and stop server in test diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index 268e87af7dda4..8c446af03fc16 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -348,6 +348,66 @@ def test_column_of_lists(tmpdir): tm.assert_frame_equal(df, df_read) +@parquet +def test_date_time_types(tmpdir): + buf = io.BytesIO() + + t1 = pa.date32() + data1 = np.array([17259, 17260, 
17261], dtype='int32') + a1 = pa.Array.from_pandas(data1, type=t1) + + t2 = pa.date64() + data2 = data1.astype('int64') * 86400000 + a2 = pa.Array.from_pandas(data2, type=t2) + + t3 = pa.timestamp('us') + start = pd.Timestamp('2000-01-01').value / 1000 + data3 = np.array([start, start + 1, start + 2], dtype='int64') + a3 = pa.Array.from_pandas(data3, type=t3) + + t4 = pa.time32('ms') + data4 = np.arange(3, dtype='i4') + a4 = pa.Array.from_pandas(data4, type=t4) + + t5 = pa.time64('us') + a5 = pa.Array.from_pandas(data4.astype('int64'), type=t5) + + t6 = pa.time32('s') + a6 = pa.Array.from_pandas(data4, type=t6) + + ex_t6 = pa.time32('ms') + ex_a6 = pa.Array.from_pandas(data4 * 1000, type=ex_t6) + + table = pa.Table.from_arrays([a1, a2, a3, a4, a5, a6], + ['date32', 'date64', 'timestamp[us]', + 'time32[s]', 'time64[us]', 'time32[s]']) + + # date64 as date32 + # time32[s] to time32[ms] + expected = pa.Table.from_arrays([a1, a1, a3, a4, a5, ex_a6], + ['date32', 'date64', 'timestamp[us]', + 'time32[s]', 'time64[us]', 'time32[s]']) + + pq.write_table(table, buf, version="2.0") + buf.seek(0) + + result = pq.read_table(buf) + assert result.equals(expected) + + # Unsupported stuff + def _assert_unsupported(array): + table = pa.Table.from_arrays([array], ['unsupported']) + buf = io.BytesIO() + + with pytest.raises(NotImplementedError): + pq.write_table(table, buf, version="2.0") + + t7 = pa.time64('ns') + a7 = pa.Array.from_pandas(data4.astype('int64'), type=t7) + + _assert_unsupported(a7) + + @parquet def test_multithreaded_read(): df = alltypes_sample(size=10000) From 6ae49a1dd6a3a8c4292987643cd11af4f35ab9b2 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Tue, 25 Apr 2017 15:09:25 -0400 Subject: [PATCH 0566/1644] ARROW-892: [GLib] Fix GArrowTensor document Author: Kouhei Sutou Closes #598 from kou/glib-tensor-doc and squashes the following commits: 3982db1 [Kouhei Sutou] [GLib] Fix GArrowTensor document --- c_glib/arrow-glib/tensor.cpp | 7 +++---- c_glib/doc/reference/arrow-glib-docs.sgml | 4 ++++ 2 files changed, 7 insertions(+), 4 deletions(-) diff --git a/c_glib/arrow-glib/tensor.cpp b/c_glib/arrow-glib/tensor.cpp index 468eb0729357b..82f66352d66a0 100644 --- a/c_glib/arrow-glib/tensor.cpp +++ b/c_glib/arrow-glib/tensor.cpp @@ -30,11 +30,10 @@ G_BEGIN_DECLS /** * SECTION: tensor - * @short_description: Base class for all tensor classes + * @short_description: Tensor class. + * @include: arrow-glib/arrow-glib.h * - * #GArrowTensor is a base class for all tensor classes such as - * #GArrowInt8Tensor. - * #GArrowBooleanTensorBuilder to create a new tensor. + * #GArrowTensor is a tensor class. 
* * Since: 0.3.0 */ diff --git a/c_glib/doc/reference/arrow-glib-docs.sgml b/c_glib/doc/reference/arrow-glib-docs.sgml index bfb2776f621cc..75e4a0a37286f 100644 --- a/c_glib/doc/reference/arrow-glib-docs.sgml +++ b/c_glib/doc/reference/arrow-glib-docs.sgml @@ -41,6 +41,10 @@ Array builder + + Tensor + + Type From 015b2849299be4bee9b470e3965465e1b0278881 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Tue, 25 Apr 2017 15:47:53 -0400 Subject: [PATCH 0567/1644] ARROW-894: [GLib] Add GArrowResizableBuffer and GArrowPoolBuffer Author: Kouhei Sutou Closes #600 from kou/glib-pool-buffer and squashes the following commits: f8845aa [Kouhei Sutou] [GLib] Add GArrowResizableBuffer and GArrowPoolBuffer --- c_glib/arrow-glib/buffer.cpp | 114 ++++++++++++++++++++++++++++++++ c_glib/arrow-glib/buffer.h | 99 +++++++++++++++++++++++++++ c_glib/arrow-glib/buffer.hpp | 1 + c_glib/test/test-pool-buffer.rb | 32 +++++++++ 4 files changed, 246 insertions(+) create mode 100644 c_glib/test/test-pool-buffer.rb diff --git a/c_glib/arrow-glib/buffer.cpp b/c_glib/arrow-glib/buffer.cpp index 5fc3b077a1cdb..5c28daf674e4e 100644 --- a/c_glib/arrow-glib/buffer.cpp +++ b/c_glib/arrow-glib/buffer.cpp @@ -38,6 +38,11 @@ G_BEGIN_DECLS * #GArrowBuffer is immutable. * * #GArrowMutableBuffer is mutable. + * + * #GArrowResizableBuffer is mutable and + * resizable. #GArrowResizableBuffer isn't instantiatable. + * + * #GArrowPoolBuffer is mutable, resizable and instantiatable. */ typedef struct GArrowBufferPrivate_ { @@ -357,6 +362,106 @@ garrow_mutable_buffer_slice(GArrowMutableBuffer *buffer, return garrow_mutable_buffer_new_raw(&arrow_buffer); } + +G_DEFINE_TYPE(GArrowResizableBuffer, \ + garrow_resizable_buffer, \ + GARROW_TYPE_MUTABLE_BUFFER) + +static void +garrow_resizable_buffer_init(GArrowResizableBuffer *object) +{ +} + +static void +garrow_resizable_buffer_class_init(GArrowResizableBufferClass *klass) +{ +} + +/** + * garrow_resizable_buffer_resize: + * @buffer: A #GArrowResizableBuffer. + * @new_size: The new buffer size in bytes. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. + * + * Since: 0.3.0 + */ +gboolean +garrow_resizable_buffer_resize(GArrowResizableBuffer *buffer, + gint64 new_size, + GError **error) +{ + auto arrow_buffer = garrow_buffer_get_raw(GARROW_BUFFER(buffer)); + auto arrow_resizable_buffer = + std::static_pointer_cast(arrow_buffer); + auto status = arrow_resizable_buffer->Resize(new_size); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[resizable-buffer][resize]"); + return FALSE; + } +} + +/** + * garrow_resizable_buffer_reserve: + * @buffer: A #GArrowResizableBuffer. + * @new_capacity: The new buffer capacity in bytes. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: %TRUE on success, %FALSE if there was an error. 
+ * + * Since: 0.3.0 + */ +gboolean +garrow_resizable_buffer_reserve(GArrowResizableBuffer *buffer, + gint64 new_capacity, + GError **error) +{ + auto arrow_buffer = garrow_buffer_get_raw(GARROW_BUFFER(buffer)); + auto arrow_resizable_buffer = + std::static_pointer_cast(arrow_buffer); + auto status = arrow_resizable_buffer->Reserve(new_capacity); + if (status.ok()) { + return TRUE; + } else { + garrow_error_set(error, status, "[resizable-buffer][capacity]"); + return FALSE; + } +} + + +G_DEFINE_TYPE(GArrowPoolBuffer, \ + garrow_pool_buffer, \ + GARROW_TYPE_RESIZABLE_BUFFER) + +static void +garrow_pool_buffer_init(GArrowPoolBuffer *object) +{ +} + +static void +garrow_pool_buffer_class_init(GArrowPoolBufferClass *klass) +{ +} + +/** + * garrow_pool_buffer_new: + * + * Returns: A newly created #GArrowPoolBuffer. + * + * Since: 0.3.0 + */ +GArrowPoolBuffer * +garrow_pool_buffer_new(void) +{ + auto arrow_memory_pool = arrow::default_memory_pool(); + auto arrow_buffer = std::make_shared(arrow_memory_pool); + return garrow_pool_buffer_new_raw(&arrow_buffer); +} + + G_END_DECLS GArrowBuffer * @@ -383,3 +488,12 @@ garrow_mutable_buffer_new_raw(std::shared_ptr *arrow_buffe NULL)); return buffer; } + +GArrowPoolBuffer * +garrow_pool_buffer_new_raw(std::shared_ptr *arrow_buffer) +{ + auto buffer = GARROW_POOL_BUFFER(g_object_new(GARROW_TYPE_POOL_BUFFER, + "buffer", arrow_buffer, + NULL)); + return buffer; +} diff --git a/c_glib/arrow-glib/buffer.h b/c_glib/arrow-glib/buffer.h index 5334614c151c9..22a5e9bb2549a 100644 --- a/c_glib/arrow-glib/buffer.h +++ b/c_glib/arrow-glib/buffer.h @@ -125,4 +125,103 @@ GArrowMutableBuffer *garrow_mutable_buffer_slice(GArrowMutableBuffer *buffer, gint64 offset, gint64 size); + +#define GARROW_TYPE_RESIZABLE_BUFFER \ + (garrow_resizable_buffer_get_type()) +#define GARROW_RESIZABLE_BUFFER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_RESIZABLE_BUFFER, \ + GArrowResizableBuffer)) +#define GARROW_RESIZABLE_BUFFER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_RESIZABLE_BUFFER, \ + GArrowResizableBufferClass)) +#define GARROW_IS_RESIZABLE_BUFFER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), GARROW_TYPE_RESIZABLE_BUFFER)) +#define GARROW_IS_RESIZABLE_BUFFER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), GARROW_TYPE_RESIZABLE_BUFFER)) +#define GARROW_RESIZABLE_BUFFER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_RESIZABLE_BUFFER, \ + GArrowResizableBufferClass)) + +typedef struct _GArrowResizableBuffer GArrowResizableBuffer; +#ifndef __GTK_DOC_IGNORE__ +typedef struct _GArrowResizableBufferClass GArrowResizableBufferClass; +#endif + +/** + * GArrowResizableBuffer: + * + * It wraps `arrow::ResizableBuffer`. 
+ */ +struct _GArrowResizableBuffer +{ + /*< private >*/ + GArrowMutableBuffer parent_instance; +}; + +#ifndef __GTK_DOC_IGNORE__ +struct _GArrowResizableBufferClass +{ + GArrowMutableBufferClass parent_class; +}; +#endif + +GType garrow_resizable_buffer_get_type(void) G_GNUC_CONST; + +gboolean garrow_resizable_buffer_resize(GArrowResizableBuffer *buffer, + gint64 new_size, + GError **error); +gboolean garrow_resizable_buffer_reserve(GArrowResizableBuffer *buffer, + gint64 new_capacity, + GError **error); + + +#define GARROW_TYPE_POOL_BUFFER \ + (garrow_pool_buffer_get_type()) +#define GARROW_POOL_BUFFER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_POOL_BUFFER, \ + GArrowPoolBuffer)) +#define GARROW_POOL_BUFFER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_POOL_BUFFER, \ + GArrowPoolBufferClass)) +#define GARROW_IS_POOL_BUFFER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), GARROW_TYPE_POOL_BUFFER)) +#define GARROW_IS_POOL_BUFFER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), GARROW_TYPE_POOL_BUFFER)) +#define GARROW_POOL_BUFFER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_POOL_BUFFER, \ + GArrowPoolBufferClass)) + +typedef struct _GArrowPoolBuffer GArrowPoolBuffer; +#ifndef __GTK_DOC_IGNORE__ +typedef struct _GArrowPoolBufferClass GArrowPoolBufferClass; +#endif + +/** + * GArrowPoolBuffer: + * + * It wraps `arrow::PoolBuffer`. + */ +struct _GArrowPoolBuffer +{ + /*< private >*/ + GArrowResizableBuffer parent_instance; +}; + +#ifndef __GTK_DOC_IGNORE__ +struct _GArrowPoolBufferClass +{ + GArrowResizableBufferClass parent_class; +}; +#endif + +GType garrow_pool_buffer_get_type(void) G_GNUC_CONST; + +GArrowPoolBuffer *garrow_pool_buffer_new(void); + G_END_DECLS diff --git a/c_glib/arrow-glib/buffer.hpp b/c_glib/arrow-glib/buffer.hpp index 1337d9ed596f9..d1664b11b17c9 100644 --- a/c_glib/arrow-glib/buffer.hpp +++ b/c_glib/arrow-glib/buffer.hpp @@ -27,3 +27,4 @@ GArrowBuffer *garrow_buffer_new_raw(std::shared_ptr *arrow_buffer std::shared_ptr garrow_buffer_get_raw(GArrowBuffer *buffer); GArrowMutableBuffer *garrow_mutable_buffer_new_raw(std::shared_ptr *arrow_buffer); +GArrowPoolBuffer *garrow_pool_buffer_new_raw(std::shared_ptr *arrow_buffer); diff --git a/c_glib/test/test-pool-buffer.rb b/c_glib/test/test-pool-buffer.rb new file mode 100644 index 0000000000000..57f3458ef1efb --- /dev/null +++ b/c_glib/test/test-pool-buffer.rb @@ -0,0 +1,32 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
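Before the Ruby test that follows, a hedged Python analogue of the buffer semantics these GLib classes wrap (pyarrow names are from current releases and assumed here for illustration; the GLib API itself is the C one above):

```python
import pyarrow as pa

buf = pa.allocate_buffer(1, resizable=True)  # like a pool-backed buffer
buf.resize(10)                               # grow the logical size
assert buf.size == 10

view = pa.py_buffer(b"Hello")                # immutable buffer over bytes
assert view.slice(1, 3).to_pybytes() == b"ell"
```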
+ +class TestPoolBuffer < Test::Unit::TestCase + def setup + @buffer = Arrow::PoolBuffer.new + end + + def test_resize + @buffer.resize(1) + assert_equal(1, @buffer.size) + end + + def test_reserve + @buffer.reserve(1) + assert_equal(64, @buffer.capacity) + end +end From 949249d9e85d2464a3f1c65b176b636d1cfbaf1a Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Tue, 25 Apr 2017 17:29:41 -0400 Subject: [PATCH 0568/1644] ARROW-893: Add GLib document to Web site Author: Kouhei Sutou Closes #599 from kou/site-glib-doc and squashes the following commits: f85ad44 [Kouhei Sutou] Add GLib document to Web site --- site/README.md | 21 ++++++++++++++++++++- site/_includes/header.html | 1 + 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/site/README.md b/site/README.md index 3f8da2252f965..aeebaa1790ef9 100644 --- a/site/README.md +++ b/site/README.md @@ -82,4 +82,23 @@ python setup.py build_sphinx -s doc/source rsync -r doc/_build/html/ ../site/asf-site/docs/python/ ``` -Then add/commit/push from the site/asf-site git checkout. \ No newline at end of file +#### C (GLib) + +First, build Apache Arrow C++ and Apache Arrow GLib. + +``` +mkdir -p ../cpp/build +cd ../cpp/build +cmake .. -DCMAKE_BUILD_TYPE=debug +make +cd ../../c_glib +./autogen.sh +./configure \ + --with-arrow-cpp-build-dir=$PWD/../cpp/build \ + --with-arrow-cpp-build-type=debug \ + --enable-gtk-doc +LD_LIBRARY_PATH=$PWD/../cpp/build/debug make GTK_DOC_V_XREF=": " +rsync -r doc/reference/html/ ../site/asf-site/docs/c_glib/ +``` + +Then add/commit/push from the site/asf-site git checkout. diff --git a/site/_includes/header.html b/site/_includes/header.html index 5963c22abea0d..3d61494f2f109 100644 --- a/site/_includes/header.html +++ b/site/_includes/header.html @@ -28,6 +28,7 @@
   • C++
   • Java
   • Python
+  • C (GLib)
  • From 7d433dc27bf70b5d80b8c88261a19cdc615defdb Mon Sep 17 00:00:00 2001 From: Phillip Cloud Date: Tue, 25 Apr 2017 17:36:31 -0400 Subject: [PATCH 0569/1644] ARROW-483: [C++/Python] Provide access to "custom_metadata" Field attribute in IPC setting Author: Phillip Cloud Closes #588 from cpcloud/ARROW-483 and squashes the following commits: f671ba4 [Phillip Cloud] ARROW-483: [C++/Python] Provide access to "custom_metadata" Field attribute in IPC setting --- cpp/CMakeLists.txt | 1 + cpp/src/arrow/array.cc | 2 +- cpp/src/arrow/builder.cc | 13 ++- cpp/src/arrow/ipc/metadata.cc | 30 +++++- cpp/src/arrow/type-test.cc | 34 +++++++ cpp/src/arrow/type.cc | 20 +++- cpp/src/arrow/type.h | 10 +- cpp/src/arrow/util/CMakeLists.txt | 2 + cpp/src/arrow/util/key-value-metadata-test.cc | 87 ++++++++++++++++ cpp/src/arrow/util/key_value_metadata.cc | 99 +++++++++++++++++++ cpp/src/arrow/util/key_value_metadata.h | 56 +++++++++++ format/Schema.fbs | 2 +- python/.gitignore | 1 + python/pyarrow/_array.pxd | 2 + python/pyarrow/_array.pyx | 7 ++ python/pyarrow/_table.pyx | 64 +++++++----- python/pyarrow/includes/common.pxd | 3 +- python/pyarrow/includes/libarrow.pxd | 11 ++- 18 files changed, 401 insertions(+), 43 deletions(-) create mode 100644 cpp/src/arrow/util/key-value-metadata-test.cc create mode 100644 cpp/src/arrow/util/key_value_metadata.cc create mode 100644 cpp/src/arrow/util/key_value_metadata.h diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 2d8c00fd80803..5abe5f1436ea7 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -944,6 +944,7 @@ set(ARROW_SRCS src/arrow/util/bit-util.cc src/arrow/util/decimal.cc + src/arrow/util/key_value_metadata.cc ) if (ARROW_IPC) diff --git a/cpp/src/arrow/array.cc b/cpp/src/arrow/array.cc index e640bbd4a206e..76dda2ca7b94f 100644 --- a/cpp/src/arrow/array.cc +++ b/cpp/src/arrow/array.cc @@ -113,7 +113,7 @@ Status Array::Validate() const { static inline void ConformSliceParams( int64_t array_offset, int64_t array_length, int64_t* offset, int64_t* length) { DCHECK_LE(*offset, array_length); - DCHECK_GE(offset, 0); + DCHECK_NE(offset, nullptr); *length = std::min(array_length - *offset, *length); *offset = array_offset + *offset; } diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index d85eb32652c47..4ecb8d3500981 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -363,8 +363,6 @@ ARROW_EXPORT Status DecimalBuilder::Append(const decimal::Decimal128& value) { return Status::OK(); } -template ARROW_EXPORT Status DecimalBuilder::Append(const decimal::Decimal128& val); - Status DecimalBuilder::Init(int64_t capacity) { RETURN_NOT_OK(FixedSizeBinaryBuilder::Init(capacity)); if (byte_width_ == 16) { @@ -408,16 +406,17 @@ Status DecimalBuilder::Finish(std::shared_ptr* out) { ListBuilder::ListBuilder(MemoryPool* pool, std::shared_ptr value_builder, const std::shared_ptr& type) - : ArrayBuilder( - pool, type ? type : std::static_pointer_cast( - std::make_shared(value_builder->type()))), + : ArrayBuilder(pool, + type ? type : std::static_pointer_cast( + std::make_shared(value_builder->type()))), offset_builder_(pool), value_builder_(value_builder) {} ListBuilder::ListBuilder(MemoryPool* pool, std::shared_ptr values, const std::shared_ptr& type) - : ArrayBuilder(pool, type ? type : std::static_pointer_cast( - std::make_shared(values->type()))), + : ArrayBuilder(pool, + type ? 
type : std::static_pointer_cast( + std::make_shared(values->type()))), offset_builder_(pool), values_(values) {} diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index 791948b50b0ac..c0b518a0d8e50 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -45,6 +45,7 @@ namespace ipc { using FBB = flatbuffers::FlatBufferBuilder; using DictionaryOffset = flatbuffers::Offset; using FieldOffset = flatbuffers::Offset; +using KeyValueOffset = flatbuffers::Offset; using RecordBatchOffset = flatbuffers::Offset; using VectorLayoutOffset = flatbuffers::Offset; using Offset = flatbuffers::Offset; @@ -583,6 +584,7 @@ flatbuf::Endianness endianness() { static Status SchemaToFlatbuffer(FBB& fbb, const Schema& schema, DictionaryMemo* dictionary_memo, flatbuffers::Offset* out) { + /// Fields std::vector field_offsets; for (int i = 0; i < schema.num_fields(); ++i) { std::shared_ptr field = schema.field(i); @@ -591,7 +593,20 @@ static Status SchemaToFlatbuffer(FBB& fbb, const Schema& schema, field_offsets.push_back(offset); } - *out = flatbuf::CreateSchema(fbb, endianness(), fbb.CreateVector(field_offsets)); + /// Custom metadata + const auto& custom_metadata_ = schema.custom_metadata(); + std::vector key_value_offsets; + size_t metadata_size = custom_metadata_.size(); + key_value_offsets.reserve(metadata_size); + for (size_t i = 0; i < metadata_size; ++i) { + const auto& key = custom_metadata_.key(i); + const auto& value = custom_metadata_.value(i); + key_value_offsets.push_back( + flatbuf::CreateKeyValue(fbb, fbb.CreateString(key), fbb.CreateString(value))); + } + + *out = flatbuf::CreateSchema(fbb, endianness(), fbb.CreateVector(field_offsets), + fbb.CreateVector(key_value_offsets)); return Status::OK(); } @@ -939,7 +954,18 @@ Status GetSchema(const void* opaque_schema, const DictionaryMemo& dictionary_mem const flatbuf::Field* field = schema->fields()->Get(i); RETURN_NOT_OK(FieldFromFlatbuffer(field, dictionary_memo, &fields[i])); } - *out = std::make_shared(fields); + + KeyValueMetadata custom_metadata; + auto fb_metadata = schema->custom_metadata(); + if (fb_metadata != nullptr) { + custom_metadata.reserve(fb_metadata->size()); + + for (const auto& pair : *fb_metadata) { + custom_metadata.Append(pair->key()->str(), pair->value()->str()); + } + } + + *out = std::make_shared(fields, custom_metadata); return Status::OK(); } diff --git a/cpp/src/arrow/type-test.cc b/cpp/src/arrow/type-test.cc index dec7268a5a8b5..8e2dfd50e431d 100644 --- a/cpp/src/arrow/type-test.cc +++ b/cpp/src/arrow/type-test.cc @@ -117,6 +117,40 @@ TEST_F(TestSchema, GetFieldByName) { ASSERT_TRUE(result == nullptr); } +TEST_F(TestSchema, TestCustomMetadataConstruction) { + auto f0 = field("f0", int32()); + auto f1 = field("f1", uint8(), false); + auto f2 = field("f2", utf8()); + vector> fields = {f0, f1, f2}; + KeyValueMetadata metadata({"foo", "bar"}, {"bizz", "buzz"}); + auto schema = std::make_shared(fields, metadata); + ASSERT_TRUE(metadata.Equals(schema->custom_metadata())); +} + +TEST_F(TestSchema, TestAddCustomMetadata) { + auto f0 = field("f0", int32()); + auto f1 = field("f1", uint8(), false); + auto f2 = field("f2", utf8()); + vector> fields = {f0, f1, f2}; + KeyValueMetadata metadata({"foo", "bar"}, {"bizz", "buzz"}); + auto schema = std::make_shared(fields); + std::shared_ptr new_schema; + schema->AddCustomMetadata(metadata, &new_schema); + ASSERT_TRUE(metadata.Equals(new_schema->custom_metadata())); +} + +TEST_F(TestSchema, TestRemoveCustomMetadata) { + auto f0 = 
field("f0", int32()); + auto f1 = field("f1", uint8(), false); + auto f2 = field("f2", utf8()); + vector> fields = {f0, f1, f2}; + KeyValueMetadata metadata({"foo", "bar"}, {"bizz", "buzz"}); + auto schema = std::make_shared(fields); + std::shared_ptr new_schema; + schema->RemoveCustomMetadata(&new_schema); + ASSERT_EQ(0, new_schema->custom_metadata().size()); +} + #define PRIMITIVE_TEST(KLASS, ENUM, NAME) \ TEST(TypesTest, TestPrimitive_##ENUM) { \ KLASS tp; \ diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index 2e454ae81886f..f59f8fb26c9ba 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -24,6 +24,7 @@ #include "arrow/array.h" #include "arrow/compare.h" #include "arrow/status.h" +#include "arrow/util/key_value_metadata.h" #include "arrow/util/logging.h" #include "arrow/util/stl.h" #include "arrow/visitor.h" @@ -231,7 +232,9 @@ std::string NullType::ToString() const { // ---------------------------------------------------------------------- // Schema implementation -Schema::Schema(const std::vector>& fields) : fields_(fields) {} +Schema::Schema(const std::vector>& fields, + const KeyValueMetadata& custom_metadata) + : fields_(fields), custom_metadata_(custom_metadata) {} bool Schema::Equals(const Schema& other) const { if (this == &other) { return true; } @@ -263,7 +266,18 @@ Status Schema::AddField( DCHECK_GE(i, 0); DCHECK_LE(i, this->num_fields()); - *out = std::make_shared(AddVectorElement(fields_, i, field)); + *out = std::make_shared(AddVectorElement(fields_, i, field), custom_metadata_); + return Status::OK(); +} + +Status Schema::AddCustomMetadata( + const KeyValueMetadata& custom_metadata, std::shared_ptr* out) const { + *out = std::make_shared(fields_, custom_metadata); + return Status::OK(); +} + +Status Schema::RemoveCustomMetadata(std::shared_ptr* out) { + *out = std::make_shared(fields_, KeyValueMetadata()); return Status::OK(); } @@ -271,7 +285,7 @@ Status Schema::RemoveField(int i, std::shared_ptr* out) const { DCHECK_GE(i, 0); DCHECK_LT(i, this->num_fields()); - *out = std::make_shared(DeleteVectorElement(fields_, i)); + *out = std::make_shared(DeleteVectorElement(fields_, i), custom_metadata_); return Status::OK(); } diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index ea4ea03ff569a..dc9456137030f 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -28,6 +28,7 @@ #include "arrow/status.h" #include "arrow/type_fwd.h" +#include "arrow/util/key_value_metadata.h" #include "arrow/util/macros.h" #include "arrow/util/visibility.h" #include "arrow/visitor.h" @@ -677,7 +678,8 @@ class ARROW_EXPORT DictionaryType : public FixedWidthType { class ARROW_EXPORT Schema { public: - explicit Schema(const std::vector>& fields); + explicit Schema(const std::vector>& fields, + const KeyValueMetadata& custom_metadata = KeyValueMetadata()); // Returns true if all of the schema fields are equal bool Equals(const Schema& other) const; @@ -689,6 +691,7 @@ class ARROW_EXPORT Schema { std::shared_ptr GetFieldByName(const std::string& name); const std::vector>& fields() const { return fields_; } + const KeyValueMetadata& custom_metadata() const { return custom_metadata_; } // Render a string representation of the schema suitable for debugging std::string ToString() const; @@ -697,11 +700,16 @@ class ARROW_EXPORT Schema { int i, const std::shared_ptr& field, std::shared_ptr* out) const; Status RemoveField(int i, std::shared_ptr* out) const; + Status AddCustomMetadata( + const KeyValueMetadata& metadata, std::shared_ptr* out) const; + Status 
RemoveCustomMetadata(std::shared_ptr* out); + int num_fields() const { return static_cast(fields_.size()); } private: std::vector> fields_; std::unordered_map name_to_index_; + KeyValueMetadata custom_metadata_; }; // ---------------------------------------------------------------------- diff --git a/cpp/src/arrow/util/CMakeLists.txt b/cpp/src/arrow/util/CMakeLists.txt index b22c8aca11c5d..ac7e86615eb40 100644 --- a/cpp/src/arrow/util/CMakeLists.txt +++ b/cpp/src/arrow/util/CMakeLists.txt @@ -26,6 +26,7 @@ install(FILES macros.h random.h visibility.h + key_value_metadata.h DESTINATION include/arrow/util) ####################################### @@ -52,3 +53,4 @@ endif() ADD_ARROW_TEST(bit-util-test) ADD_ARROW_TEST(stl-util-test) ADD_ARROW_TEST(decimal-test) +ADD_ARROW_TEST(key-value-metadata-test) diff --git a/cpp/src/arrow/util/key-value-metadata-test.cc b/cpp/src/arrow/util/key-value-metadata-test.cc new file mode 100644 index 0000000000000..aadc989cb403f --- /dev/null +++ b/cpp/src/arrow/util/key-value-metadata-test.cc @@ -0,0 +1,87 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
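+
+// Editorial sketch, not part of the original patch: the API under test
+// (declared in arrow/util/key_value_metadata.h) keeps keys and values in
+// two parallel vectors, preserving insertion order:
+//
+//   arrow::KeyValueMetadata md({"foo", "bar"}, {"bizz", "buzz"});
+//   md.Append("baz", "qux");  // md.size() == 3, md.key(2) == "baz"
+//   md.Equals(md);            // element-wise, order-sensitive comparison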
+ +#include "gtest/gtest.h" + +#include "arrow/util/key_value_metadata.h" + +#include "arrow/test-util.h" + +namespace arrow { + +TEST(KeyValueMetadataTest, SimpleConstruction) { + KeyValueMetadata metadata; + ASSERT_EQ(0, metadata.size()); +} + +TEST(KeyValueMetadataTest, StringVectorConstruction) { + std::vector keys = {"foo", "bar"}; + std::vector values = {"bizz", "buzz"}; + + KeyValueMetadata metadata(keys, values); + ASSERT_EQ("foo", metadata.key(0)); + ASSERT_EQ("bar", metadata.key(1)); + ASSERT_EQ("bizz", metadata.value(0)); + ASSERT_EQ("buzz", metadata.value(1)); + ASSERT_EQ(2, metadata.size()); +} + +TEST(KeyValueMetadataTest, StringMapConstruction) { + std::unordered_map pairs = {{"foo", "bizz"}, {"bar", "buzz"}}; + std::unordered_map result_map; + result_map.reserve(pairs.size()); + + KeyValueMetadata metadata(pairs); + metadata.ToUnorderedMap(&result_map); + ASSERT_EQ(pairs, result_map); + ASSERT_EQ(2, metadata.size()); +} + +TEST(KeyValueMetadataTest, StringAppend) { + std::vector keys = {"foo", "bar"}; + std::vector values = {"bizz", "buzz"}; + + KeyValueMetadata metadata(keys, values); + ASSERT_EQ("foo", metadata.key(0)); + ASSERT_EQ("bar", metadata.key(1)); + ASSERT_EQ("bizz", metadata.value(0)); + ASSERT_EQ("buzz", metadata.value(1)); + ASSERT_EQ(2, metadata.size()); + + metadata.Append("purple", "orange"); + metadata.Append("blue", "red"); + + ASSERT_EQ("purple", metadata.key(2)); + ASSERT_EQ("blue", metadata.key(3)); + + ASSERT_EQ("orange", metadata.value(2)); + ASSERT_EQ("red", metadata.value(3)); +} + +TEST(KeyValueMetadataTest, Equals) { + std::vector keys = {"foo", "bar"}; + std::vector values = {"bizz", "buzz"}; + + KeyValueMetadata metadata(keys, values); + KeyValueMetadata metadata2(keys, values); + KeyValueMetadata metadata3(keys, {"buzz", "bizz"}); + + ASSERT_TRUE(metadata.Equals(metadata2)); + ASSERT_FALSE(metadata.Equals(metadata3)); +} + +} // namespace arrow diff --git a/cpp/src/arrow/util/key_value_metadata.cc b/cpp/src/arrow/util/key_value_metadata.cc new file mode 100644 index 0000000000000..c91478bd1acd6 --- /dev/null +++ b/cpp/src/arrow/util/key_value_metadata.cc @@ -0,0 +1,99 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
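+
+// Editorial note, not part of the original patch: the unordered_map
+// constructor below walks the same map twice, via UnorderedMapKeys() and
+// UnorderedMapValues(); iteration order over an unmodified container is
+// stable, so keys_ and values_ stay aligned index by index.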
+ +#include + +#include "arrow/util/key_value_metadata.h" +#include "arrow/util/logging.h" + +namespace arrow { + +static std::vector UnorderedMapKeys( + const std::unordered_map& map) { + std::vector keys; + keys.reserve(map.size()); + for (const auto& pair : map) { + keys.push_back(pair.first); + } + return keys; +} + +static std::vector UnorderedMapValues( + const std::unordered_map& map) { + std::vector values; + values.reserve(map.size()); + for (const auto& pair : map) { + values.push_back(pair.second); + } + return values; +} + +KeyValueMetadata::KeyValueMetadata() : keys_(), values_() {} + +KeyValueMetadata::KeyValueMetadata( + const std::unordered_map& map) + : keys_(UnorderedMapKeys(map)), values_(UnorderedMapValues(map)) {} + +KeyValueMetadata::KeyValueMetadata( + const std::vector& keys, const std::vector& values) + : keys_(keys), values_(values) { + DCHECK_EQ(keys.size(), values.size()); +} + +void KeyValueMetadata::ToUnorderedMap( + std::unordered_map* out) const { + DCHECK_NE(out, nullptr); + const int64_t n = size(); + out->reserve(n); + for (int64_t i = 0; i < n; ++i) { + out->insert(std::make_pair(key(i), value(i))); + } +} + +void KeyValueMetadata::Append(const std::string& key, const std::string& value) { + keys_.push_back(key); + values_.push_back(value); +} + +void KeyValueMetadata::reserve(int64_t n) { + DCHECK_GE(n, 0); + const auto m = static_cast(n); + keys_.reserve(m); + values_.reserve(m); +} + +int64_t KeyValueMetadata::size() const { + DCHECK_EQ(keys_.size(), values_.size()); + return static_cast(keys_.size()); +} + +std::string KeyValueMetadata::key(int64_t i) const { + DCHECK_GE(i, 0); + return keys_[static_cast(i)]; +} + +std::string KeyValueMetadata::value(int64_t i) const { + DCHECK_GE(i, 0); + return values_[static_cast(i)]; +} + +bool KeyValueMetadata::Equals(const KeyValueMetadata& other) const { + return size() == other.size() && + std::equal(keys_.cbegin(), keys_.cend(), other.keys_.cbegin()) && + std::equal(values_.cbegin(), values_.cend(), other.values_.cbegin()); +} +} // namespace arrow diff --git a/cpp/src/arrow/util/key_value_metadata.h b/cpp/src/arrow/util/key_value_metadata.h new file mode 100644 index 0000000000000..713b2c0b0bcfb --- /dev/null +++ b/cpp/src/arrow/util/key_value_metadata.h @@ -0,0 +1,56 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. 
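+
+// Editorial sketch, not part of the original patch: how arrow::Schema
+// consumes this class in this change (see the cpp/src/arrow/type.h diff
+// above), given some std::vector<std::shared_ptr<arrow::Field>> fields:
+//
+//   arrow::KeyValueMetadata metadata({"origin"}, {"ipc"});
+//   auto schema = std::make_shared<arrow::Schema>(fields, metadata);
+//   assert(schema->custom_metadata().Equals(metadata));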
+ +#ifndef ARROW_UTIL_KEY_VALUE_METADATA_H +#define ARROW_UTIL_KEY_VALUE_METADATA_H + +#include +#include +#include +#include + +#include "arrow/util/visibility.h" + +namespace arrow { + +class ARROW_EXPORT KeyValueMetadata { + public: + KeyValueMetadata(); + KeyValueMetadata( + const std::vector& keys, const std::vector& values); + explicit KeyValueMetadata(const std::unordered_map& map); + + void ToUnorderedMap(std::unordered_map* out) const; + + void Append(const std::string& key, const std::string& value); + + void reserve(int64_t n); + int64_t size() const; + + std::string key(int64_t i) const; + std::string value(int64_t i) const; + + bool Equals(const KeyValueMetadata& other) const; + + private: + std::vector keys_; + std::vector values_; +}; + +} // namespace arrow + +#endif // ARROW_UTIL_KEY_VALUE_METADATA_H diff --git a/format/Schema.fbs b/format/Schema.fbs index b48859f50eea2..8de5c6d466c36 100644 --- a/format/Schema.fbs +++ b/format/Schema.fbs @@ -200,7 +200,7 @@ table VectorLayout { table KeyValue { key: string; - value: [ubyte]; + value: string; } /// ---------------------------------------------------------------------- diff --git a/python/.gitignore b/python/.gitignore index ba40c3ea88882..6c0d5a93cd35c 100644 --- a/python/.gitignore +++ b/python/.gitignore @@ -33,3 +33,4 @@ coverage.xml # benchmark working dir .asv +pyarrow/_table_api.h diff --git a/python/pyarrow/_array.pxd b/python/pyarrow/_array.pxd index 464de316f0437..4d5db8618a377 100644 --- a/python/pyarrow/_array.pxd +++ b/python/pyarrow/_array.pxd @@ -81,6 +81,8 @@ cdef class Schema: cdef init(self, const vector[shared_ptr[CField]]& fields) cdef init_schema(self, const shared_ptr[CSchema]& schema) + cpdef dict custom_metadata(self) + cdef class Scalar: cdef readonly: diff --git a/python/pyarrow/_array.pyx b/python/pyarrow/_array.pyx index 1c571ba153dfa..2fb20b7553e93 100644 --- a/python/pyarrow/_array.pyx +++ b/python/pyarrow/_array.pyx @@ -244,6 +244,13 @@ cdef class Schema: self.schema = schema.get() self.sp_schema = schema + cpdef dict custom_metadata(self): + cdef: + CKeyValueMetadata metadata = self.schema.custom_metadata() + unordered_map[c_string, c_string] result + metadata.ToUnorderedMap(&result) + return result + def equals(self, other): """ Test if this schema is equal to the other diff --git a/python/pyarrow/_table.pyx b/python/pyarrow/_table.pyx index 78fec75cf3e7d..ed0782bbba0a3 100644 --- a/python/pyarrow/_table.pyx +++ b/python/pyarrow/_table.pyx @@ -34,7 +34,6 @@ from pyarrow._error import ArrowException from pyarrow._array import field from pyarrow.compat import frombytes, tobytes - from collections import OrderedDict @@ -273,15 +272,22 @@ cdef class Column: return chunked_array -cdef _schema_from_arrays(arrays, names, shared_ptr[CSchema]* schema): +cdef CKeyValueMetadata key_value_metadata_from_dict(dict metadata): + cdef: + unordered_map[c_string, c_string] unordered_metadata = metadata + CKeyValueMetadata c_metadata = CKeyValueMetadata(unordered_metadata) + return c_metadata + + +cdef int _schema_from_arrays( + arrays, names, dict metadata, shared_ptr[CSchema]* schema) except -1: cdef: Array arr Column col c_string c_name vector[shared_ptr[CField]] fields - cdef shared_ptr[CDataType] type_ - - cdef int K = len(arrays) + shared_ptr[CDataType] type_ + int K = len(arrays) fields.resize(K) @@ -306,15 +312,16 @@ cdef _schema_from_arrays(arrays, names, shared_ptr[CSchema]* schema): else: raise TypeError(type(arrays[0])) - schema.reset(new CSchema(fields)) - + schema.reset(new CSchema(fields, 
key_value_metadata_from_dict(metadata))) + return 0 -cdef _dataframe_to_arrays(df, timestamps_to_ms, Schema schema): +cdef tuple _dataframe_to_arrays(df, bint timestamps_to_ms, Schema schema): cdef: list names = [] list arrays = [] DataType type = None + dict metadata = {} for name in df.columns: col = df[name] @@ -326,7 +333,7 @@ cdef _dataframe_to_arrays(df, timestamps_to_ms, Schema schema): names.append(name) arrays.append(arr) - return names, arrays + return names, arrays, metadata cdef class RecordBatch: @@ -486,11 +493,11 @@ cdef class RecordBatch: ------- pyarrow.table.RecordBatch """ - names, arrays = _dataframe_to_arrays(df, False, schema) - return cls.from_arrays(arrays, names) + names, arrays, metadata = _dataframe_to_arrays(df, False, schema) + return cls.from_arrays(arrays, names, metadata) @staticmethod - def from_arrays(arrays, names): + def from_arrays(list arrays, list names, dict metadata=None): """ Construct a RecordBatch from multiple pyarrow.Arrays @@ -512,15 +519,17 @@ cdef class RecordBatch: shared_ptr[CRecordBatch] batch vector[shared_ptr[CArray]] c_arrays int64_t num_rows + int64_t i + int64_t number_of_arrays = len(arrays) - if len(arrays) == 0: + if not number_of_arrays: raise ValueError('Record batch cannot contain no arrays (for now)') num_rows = len(arrays[0]) - _schema_from_arrays(arrays, names, &schema) + _schema_from_arrays(arrays, names, metadata or {}, &schema) - for i in range(len(arrays)): - arr = arrays[i] + c_arrays.reserve(len(arrays)) + for arr in arrays: c_arrays.push_back(arr.sp_array) batch.reset(new CRecordBatch(schema, num_rows, c_arrays)) @@ -656,13 +665,13 @@ cdef class Table: >>> pa.Table.from_pandas(df) """ - names, arrays = _dataframe_to_arrays(df, + names, arrays, metadata = _dataframe_to_arrays(df, timestamps_to_ms=timestamps_to_ms, schema=schema) - return cls.from_arrays(arrays, names=names) + return cls.from_arrays(arrays, names=names, metadata=metadata) @staticmethod - def from_arrays(arrays, names=None): + def from_arrays(arrays, names=None, dict metadata=None): """ Construct a Table from Arrow arrays or columns @@ -680,22 +689,25 @@ cdef class Table: """ cdef: - vector[shared_ptr[CField]] fields vector[shared_ptr[CColumn]] columns shared_ptr[CSchema] schema shared_ptr[CTable] table + size_t K = len(arrays) - _schema_from_arrays(arrays, names, &schema) + _schema_from_arrays(arrays, names, metadata or {}, &schema) - cdef int K = len(arrays) - columns.resize(K) + columns.reserve(K) for i in range(K): if isinstance(arrays[i], Array): - columns[i].reset(new CColumn(schema.get().field(i), - ( arrays[i]).sp_array)) + columns.push_back( + make_shared[CColumn]( + schema.get().field(i), + ( arrays[i]).sp_array + ) + ) elif isinstance(arrays[i], Column): - columns[i] = ( arrays[i]).sp_column + columns.push_back(( arrays[i]).sp_column) else: raise ValueError(type(arrays[i])) diff --git a/python/pyarrow/includes/common.pxd b/python/pyarrow/includes/common.pxd index 44723faa7400e..cc3b4b6fdaf92 100644 --- a/python/pyarrow/includes/common.pxd +++ b/python/pyarrow/includes/common.pxd @@ -19,9 +19,10 @@ from libc.stdint cimport * from libcpp cimport bool as c_bool -from libcpp.memory cimport shared_ptr, unique_ptr +from libcpp.memory cimport shared_ptr, unique_ptr, make_shared from libcpp.string cimport string as c_string from libcpp.vector cimport vector +from libcpp.unordered_map cimport unordered_map from cpython cimport PyObject cimport cpython diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index 
473a0b9cd9b6d..ef1a332bed52f 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -1,4 +1,4 @@ -#t Licensed to the Apache Software Foundation (ASF) under one +# Licensed to the Apache Software Foundation (ASF) under one # or more contributor license agreements. See the NOTICE file # distributed with this work for additional information # regarding copyright ownership. The ASF licenses this file @@ -19,6 +19,12 @@ from pyarrow.includes.common cimport * +cdef extern from "arrow/util/key_value_metadata.h" namespace "arrow" nogil: + cdef cppclass CKeyValueMetadata" arrow::KeyValueMetadata": + CKeyValueMetadata() + CKeyValueMetadata(const unordered_map[c_string, c_string]&) + void ToUnorderedMap(unordered_map[c_string, c_string]*) const + cdef extern from "arrow/api.h" namespace "arrow" nogil: enum Type" arrow::Type::type": @@ -170,10 +176,13 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: cdef cppclass CSchema" arrow::Schema": CSchema(const vector[shared_ptr[CField]]& fields) + CSchema(const vector[shared_ptr[CField]]& fields, + const CKeyValueMetadata& custom_metadata) c_bool Equals(const CSchema& other) shared_ptr[CField] field(int i) + const CKeyValueMetadata& custom_metadata() const shared_ptr[CField] GetFieldByName(c_string& name) int num_fields() c_string ToString() From 3ad9d09f39ead51266299ec4bbb703724b8ac69d Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Wed, 26 Apr 2017 10:30:26 -0400 Subject: [PATCH 0570/1644] ARROW-904: [GLib] Simplify error check codes Author: Kouhei Sutou Closes #604 from kou/glib-simplify-error-check and squashes the following commits: 1cf6b76 [Kouhei Sutou] [GLib] Simplify error check codes --- c_glib/arrow-glib/array-builder.cpp | 210 +++++------------------ c_glib/arrow-glib/buffer.cpp | 17 +- c_glib/arrow-glib/error.cpp | 25 +-- c_glib/arrow-glib/error.hpp | 6 +- c_glib/arrow-glib/file-output-stream.cpp | 2 +- c_glib/arrow-glib/file-reader.cpp | 8 +- c_glib/arrow-glib/file-writer.cpp | 19 +- c_glib/arrow-glib/file.cpp | 10 +- c_glib/arrow-glib/memory-mapped-file.cpp | 2 +- c_glib/arrow-glib/random-access-file.cpp | 10 +- c_glib/arrow-glib/readable.cpp | 7 +- c_glib/arrow-glib/stream-reader.cpp | 8 +- c_glib/arrow-glib/stream-writer.cpp | 19 +- c_glib/arrow-glib/table.cpp | 6 +- c_glib/arrow-glib/writeable-file.cpp | 7 +- c_glib/arrow-glib/writeable.cpp | 14 +- 16 files changed, 90 insertions(+), 280 deletions(-) diff --git a/c_glib/arrow-glib/array-builder.cpp b/c_glib/arrow-glib/array-builder.cpp index 97d43e1f0c022..30158b07b11d3 100644 --- a/c_glib/arrow-glib/array-builder.cpp +++ b/c_glib/arrow-glib/array-builder.cpp @@ -237,12 +237,7 @@ garrow_boolean_array_builder_append(GArrowBooleanArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->Append(static_cast(value)); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[boolean-array-builder][append]"); - return FALSE; - } + return garrow_error_check(error, status, "[boolean-array-builder][append]"); } /** @@ -261,12 +256,9 @@ garrow_boolean_array_builder_append_null(GArrowBooleanArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->AppendNull(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[boolean-array-builder][append-null]"); - return FALSE; - } + return garrow_error_check(error, + status, + "[boolean-array-builder][append-null]"); } @@ -318,12 +310,7 
@@ garrow_int8_array_builder_append(GArrowInt8ArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->Append(value); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[int8-array-builder][append]"); - return FALSE; - } + return garrow_error_check(error, status, "[int8-array-builder][append]"); } /** @@ -342,12 +329,7 @@ garrow_int8_array_builder_append_null(GArrowInt8ArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->AppendNull(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[int8-array-builder][append-null]"); - return FALSE; - } + return garrow_error_check(error, status, "[int8-array-builder][append-null]"); } @@ -399,12 +381,7 @@ garrow_uint8_array_builder_append(GArrowUInt8ArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->Append(value); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[uint8-array-builder][append]"); - return FALSE; - } + return garrow_error_check(error, status, "[uint8-array-builder][append]"); } /** @@ -423,12 +400,7 @@ garrow_uint8_array_builder_append_null(GArrowUInt8ArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->AppendNull(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[uint8-array-builder][append-null]"); - return FALSE; - } + return garrow_error_check(error, status, "[uint8-array-builder][append-null]"); } @@ -480,12 +452,7 @@ garrow_int16_array_builder_append(GArrowInt16ArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->Append(value); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[int16-array-builder][append]"); - return FALSE; - } + return garrow_error_check(error, status, "[int16-array-builder][append]"); } /** @@ -504,12 +471,7 @@ garrow_int16_array_builder_append_null(GArrowInt16ArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->AppendNull(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[int16-array-builder][append-null]"); - return FALSE; - } + return garrow_error_check(error, status, "[int16-array-builder][append-null]"); } @@ -561,12 +523,7 @@ garrow_uint16_array_builder_append(GArrowUInt16ArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->Append(value); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[uint16-array-builder][append]"); - return FALSE; - } + return garrow_error_check(error, status, "[uint16-array-builder][append]"); } /** @@ -585,12 +542,9 @@ garrow_uint16_array_builder_append_null(GArrowUInt16ArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->AppendNull(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[uint16-array-builder][append-null]"); - return FALSE; - } + return garrow_error_check(error, + status, + "[uint16-array-builder][append-null]"); } @@ -642,12 +596,7 @@ garrow_int32_array_builder_append(GArrowInt32ArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = 
arrow_builder->Append(value); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[int32-array-builder][append]"); - return FALSE; - } + return garrow_error_check(error, status, "[int32-array-builder][append]"); } /** @@ -666,12 +615,7 @@ garrow_int32_array_builder_append_null(GArrowInt32ArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->AppendNull(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[int32-array-builder][append-null]"); - return FALSE; - } + return garrow_error_check(error, status, "[int32-array-builder][append-null]"); } @@ -723,12 +667,7 @@ garrow_uint32_array_builder_append(GArrowUInt32ArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->Append(value); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[uint32-array-builder][append]"); - return FALSE; - } + return garrow_error_check(error, status, "[uint32-array-builder][append]"); } /** @@ -747,12 +686,9 @@ garrow_uint32_array_builder_append_null(GArrowUInt32ArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->AppendNull(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[uint32-array-builder][append-null]"); - return FALSE; - } + return garrow_error_check(error, + status, + "[uint32-array-builder][append-null]"); } @@ -804,12 +740,7 @@ garrow_int64_array_builder_append(GArrowInt64ArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->Append(value); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[int64-array-builder][append]"); - return FALSE; - } + return garrow_error_check(error, status, "[int64-array-builder][append]"); } /** @@ -828,12 +759,7 @@ garrow_int64_array_builder_append_null(GArrowInt64ArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->AppendNull(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[int64-array-builder][append-null]"); - return FALSE; - } + return garrow_error_check(error, status, "[int64-array-builder][append-null]"); } @@ -885,12 +811,7 @@ garrow_uint64_array_builder_append(GArrowUInt64ArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->Append(value); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[uint64-array-builder][append]"); - return FALSE; - } + return garrow_error_check(error, status, "[uint64-array-builder][append]"); } /** @@ -912,7 +833,7 @@ garrow_uint64_array_builder_append_null(GArrowUInt64ArrayBuilder *builder, if (status.ok()) { return TRUE; } else { - garrow_error_set(error, status, "[uint64-array-builder][append-null]"); + garrow_error_check(error, status, "[uint64-array-builder][append-null]"); return FALSE; } } @@ -966,12 +887,7 @@ garrow_float_array_builder_append(GArrowFloatArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->Append(value); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[float-array-builder][append]"); - return FALSE; - } + return garrow_error_check(error, status, "[float-array-builder][append]"); } /** @@ -990,12 +906,7 @@ 
garrow_float_array_builder_append_null(GArrowFloatArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->AppendNull(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[float-array-builder][append-null]"); - return FALSE; - } + return garrow_error_check(error, status, "[float-array-builder][append-null]"); } @@ -1047,12 +958,7 @@ garrow_double_array_builder_append(GArrowDoubleArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->Append(value); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[double-array-builder][append]"); - return FALSE; - } + return garrow_error_check(error, status, "[double-array-builder][append]"); } /** @@ -1071,12 +977,9 @@ garrow_double_array_builder_append_null(GArrowDoubleArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->AppendNull(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[double-array-builder][append-null]"); - return FALSE; - } + return garrow_error_check(error, + status, + "[double-array-builder][append-null]"); } @@ -1130,12 +1033,7 @@ garrow_binary_array_builder_append(GArrowBinaryArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->Append(value, length); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[binary-array-builder][append]"); - return FALSE; - } + return garrow_error_check(error, status, "[binary-array-builder][append]"); } /** @@ -1154,12 +1052,9 @@ garrow_binary_array_builder_append_null(GArrowBinaryArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->AppendNull(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[binary-array-builder][append-null]"); - return FALSE; - } + return garrow_error_check(error, + status, + "[binary-array-builder][append-null]"); } @@ -1212,12 +1107,7 @@ garrow_string_array_builder_append(GArrowStringArrayBuilder *builder, auto status = arrow_builder->Append(value, static_cast(strlen(value))); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[string-array-builder][append]"); - return FALSE; - } + return garrow_error_check(error, status, "[string-array-builder][append]"); } @@ -1305,12 +1195,7 @@ garrow_list_array_builder_append(GArrowListArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->Append(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[list-array-builder][append]"); - return FALSE; - } + return garrow_error_check(error, status, "[list-array-builder][append]"); } /** @@ -1331,12 +1216,7 @@ garrow_list_array_builder_append_null(GArrowListArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->AppendNull(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[list-array-builder][append-null]"); - return FALSE; - } + return garrow_error_check(error, status, "[list-array-builder][append-null]"); } /** @@ -1427,12 +1307,7 @@ garrow_struct_array_builder_append(GArrowStructArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = 
arrow_builder->Append(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[struct-array-builder][append]"); - return FALSE; - } + return garrow_error_check(error, status, "[struct-array-builder][append]"); } /** @@ -1453,12 +1328,9 @@ garrow_struct_array_builder_append_null(GArrowStructArrayBuilder *builder, garrow_array_builder_get_raw(GARROW_ARRAY_BUILDER(builder)).get()); auto status = arrow_builder->AppendNull(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[struct-array-builder][append-null]"); - return FALSE; - } + return garrow_error_check(error, + status, + "[struct-array-builder][append-null]"); } /** diff --git a/c_glib/arrow-glib/buffer.cpp b/c_glib/arrow-glib/buffer.cpp index 5c28daf674e4e..4373ef1c83447 100644 --- a/c_glib/arrow-glib/buffer.cpp +++ b/c_glib/arrow-glib/buffer.cpp @@ -272,10 +272,9 @@ garrow_buffer_copy(GArrowBuffer *buffer, auto arrow_buffer = garrow_buffer_get_raw(buffer); std::shared_ptr arrow_copied_buffer; auto status = arrow_buffer->Copy(start, size, &arrow_copied_buffer); - if (status.ok()) { + if (garrow_error_check(error, status, "[buffer][copy]")) { return garrow_buffer_new_raw(&arrow_copied_buffer); } else { - garrow_error_set(error, status, "[buffer][copy]"); return NULL; } } @@ -396,12 +395,7 @@ garrow_resizable_buffer_resize(GArrowResizableBuffer *buffer, auto arrow_resizable_buffer = std::static_pointer_cast(arrow_buffer); auto status = arrow_resizable_buffer->Resize(new_size); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[resizable-buffer][resize]"); - return FALSE; - } + return garrow_error_check(error, status, "[resizable-buffer][resize]"); } /** @@ -423,12 +417,7 @@ garrow_resizable_buffer_reserve(GArrowResizableBuffer *buffer, auto arrow_resizable_buffer = std::static_pointer_cast(arrow_buffer); auto status = arrow_resizable_buffer->Reserve(new_capacity); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[resizable-buffer][capacity]"); - return FALSE; - } + return garrow_error_check(error, status, "[resizable-buffer][capacity]"); } diff --git a/c_glib/arrow-glib/error.cpp b/c_glib/arrow-glib/error.cpp index efbc6ae60452a..e5d2ad6f6eb40 100644 --- a/c_glib/arrow-glib/error.cpp +++ b/c_glib/arrow-glib/error.cpp @@ -63,19 +63,20 @@ garrow_error_code(const arrow::Status &status) G_END_DECLS -void -garrow_error_set(GError **error, - const arrow::Status &status, - const char *context) +gboolean +garrow_error_check(GError **error, + const arrow::Status &status, + const char *context) { if (status.ok()) { - return; + return TRUE; + } else { + g_set_error(error, + GARROW_ERROR, + garrow_error_code(status), + "%s: %s", + context, + status.ToString().c_str()); + return FALSE; } - - g_set_error(error, - GARROW_ERROR, - garrow_error_code(status), - "%s: %s", - context, - status.ToString().c_str()); } diff --git a/c_glib/arrow-glib/error.hpp b/c_glib/arrow-glib/error.hpp index 357d293c4f127..dad27bd5c9b5a 100644 --- a/c_glib/arrow-glib/error.hpp +++ b/c_glib/arrow-glib/error.hpp @@ -23,6 +23,6 @@ #include -void garrow_error_set(GError **error, - const arrow::Status &status, - const char *context); +gboolean garrow_error_check(GError **error, + const arrow::Status &status, + const char *context); diff --git a/c_glib/arrow-glib/file-output-stream.cpp b/c_glib/arrow-glib/file-output-stream.cpp index b6ca42a1d59da..e1e1e27a06193 100644 --- a/c_glib/arrow-glib/file-output-stream.cpp +++ 
b/c_glib/arrow-glib/file-output-stream.cpp @@ -204,7 +204,7 @@ garrow_file_output_stream_open(const gchar *path, std::string context("[io][file-output-stream][open]: <"); context += path; context += ">"; - garrow_error_set(error, status, context.c_str()); + garrow_error_check(error, status, context.c_str()); return NULL; } } diff --git a/c_glib/arrow-glib/file-reader.cpp b/c_glib/arrow-glib/file-reader.cpp index c2aeabe5eed21..b952b52ddbe6d 100644 --- a/c_glib/arrow-glib/file-reader.cpp +++ b/c_glib/arrow-glib/file-reader.cpp @@ -146,10 +146,9 @@ garrow_file_reader_open(GArrowRandomAccessFile *file, auto status = arrow::ipc::FileReader::Open(garrow_random_access_file_get_raw(file), &arrow_file_reader); - if (status.ok()) { + if (garrow_error_check(error, status, "[ipc][file-reader][open]")) { return garrow_file_reader_new_raw(&arrow_file_reader); } else { - garrow_error_set(error, status, "[ipc][file-reader][open]"); return NULL; } } @@ -217,10 +216,11 @@ garrow_file_reader_get_record_batch(GArrowFileReader *file_reader, std::shared_ptr arrow_record_batch; auto status = arrow_file_reader->GetRecordBatch(i, &arrow_record_batch); - if (status.ok()) { + if (garrow_error_check(error, + status, + "[ipc][file-reader][get-record-batch]")) { return garrow_record_batch_new_raw(&arrow_record_batch); } else { - garrow_error_set(error, status, "[ipc][file-reader][get-record-batch]"); return NULL; } } diff --git a/c_glib/arrow-glib/file-writer.cpp b/c_glib/arrow-glib/file-writer.cpp index 68eca2edf77c1..e615cf554fd64 100644 --- a/c_glib/arrow-glib/file-writer.cpp +++ b/c_glib/arrow-glib/file-writer.cpp @@ -75,10 +75,9 @@ garrow_file_writer_open(GArrowOutputStream *sink, arrow::ipc::FileWriter::Open(garrow_output_stream_get_raw(sink).get(), garrow_schema_get_raw(schema), &arrow_file_writer); - if (status.ok()) { + if (garrow_error_check(error, status, "[ipc][file-writer][open]")) { return garrow_file_writer_new_raw(&arrow_file_writer); } else { - garrow_error_set(error, status, "[ipc][file-writer][open]"); return NULL; } } @@ -104,12 +103,9 @@ garrow_file_writer_write_record_batch(GArrowFileWriter *file_writer, arrow_record_batch.get(); auto status = arrow_file_writer->WriteRecordBatch(*arrow_record_batch_raw); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[ipc][file-writer][write-record-batch]"); - return FALSE; - } + return garrow_error_check(error, + status, + "[ipc][file-writer][write-record-batch]"); } /** @@ -127,12 +123,7 @@ garrow_file_writer_close(GArrowFileWriter *file_writer, garrow_file_writer_get_raw(file_writer); auto status = arrow_file_writer->Close(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[ipc][file-writer][close]"); - return FALSE; - } + return garrow_error_check(error, status, "[ipc][file-writer][close]"); } G_END_DECLS diff --git a/c_glib/arrow-glib/file.cpp b/c_glib/arrow-glib/file.cpp index 0d0fe1d8b9c83..775339386c6b5 100644 --- a/c_glib/arrow-glib/file.cpp +++ b/c_glib/arrow-glib/file.cpp @@ -60,12 +60,7 @@ garrow_file_close(GArrowFile *file, auto arrow_file = garrow_file_get_raw(file); auto status = arrow_file->Close(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[io][file][close]"); - return FALSE; - } + return garrow_error_check(error, status, "[io][file][close]"); } /** @@ -83,10 +78,9 @@ garrow_file_tell(GArrowFile *file, gint64 position; auto status = arrow_file->Tell(&position); - if (status.ok()) { + if (garrow_error_check(error, status, 
"[io][file][tell]")) { return position; } else { - garrow_error_set(error, status, "[io][file][tell]"); return -1; } } diff --git a/c_glib/arrow-glib/memory-mapped-file.cpp b/c_glib/arrow-glib/memory-mapped-file.cpp index a3e1d0c45f142..f9cbf079105c1 100644 --- a/c_glib/arrow-glib/memory-mapped-file.cpp +++ b/c_glib/arrow-glib/memory-mapped-file.cpp @@ -260,7 +260,7 @@ garrow_memory_mapped_file_open(const gchar *path, std::string context("[io][memory-mapped-file][open]: <"); context += path; context += ">"; - garrow_error_set(error, status, context.c_str()); + garrow_error_check(error, status, context.c_str()); return NULL; } } diff --git a/c_glib/arrow-glib/random-access-file.cpp b/c_glib/arrow-glib/random-access-file.cpp index 71f315ec7efaa..976a80dce0693 100644 --- a/c_glib/arrow-glib/random-access-file.cpp +++ b/c_glib/arrow-glib/random-access-file.cpp @@ -61,10 +61,9 @@ garrow_random_access_file_get_size(GArrowRandomAccessFile *file, int64_t size; auto status = arrow_random_access_file->GetSize(&size); - if (status.ok()) { + if (garrow_error_check(error, status, "[io][random-access-file][get-size]")) { return size; } else { - garrow_error_set(error, status, "[io][random-access-file][get-size]"); return 0; } } @@ -110,12 +109,7 @@ garrow_random_access_file_read_at(GArrowRandomAccessFile *file, n_bytes, n_read_bytes, buffer); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[io][random-access-file][read-at]"); - return FALSE; - } + return garrow_error_check(error, status, "[io][random-access-file][read-at]"); } G_END_DECLS diff --git a/c_glib/arrow-glib/readable.cpp b/c_glib/arrow-glib/readable.cpp index b8c0cd99df06a..d893853eea015 100644 --- a/c_glib/arrow-glib/readable.cpp +++ b/c_glib/arrow-glib/readable.cpp @@ -66,12 +66,7 @@ garrow_readable_read(GArrowReadable *readable, const auto arrow_readable = garrow_readable_get_raw(readable); auto status = arrow_readable->Read(n_bytes, n_read_bytes, buffer); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[io][readable][read]"); - return FALSE; - } + return garrow_error_check(error, status, "[io][readable][read]"); } G_END_DECLS diff --git a/c_glib/arrow-glib/stream-reader.cpp b/c_glib/arrow-glib/stream-reader.cpp index c4ccebe56f6ba..017d5e91a8a4d 100644 --- a/c_glib/arrow-glib/stream-reader.cpp +++ b/c_glib/arrow-glib/stream-reader.cpp @@ -147,10 +147,9 @@ garrow_stream_reader_open(GArrowInputStream *stream, auto status = arrow::ipc::StreamReader::Open(garrow_input_stream_get_raw(stream), &arrow_stream_reader); - if (status.ok()) { + if (garrow_error_check(error, status, "[ipc][stream-reader][open]")) { return garrow_stream_reader_new_raw(&arrow_stream_reader); } else { - garrow_error_set(error, status, "[ipc][stream-reader][open]"); return NULL; } } @@ -187,14 +186,15 @@ garrow_stream_reader_get_next_record_batch(GArrowStreamReader *stream_reader, std::shared_ptr arrow_record_batch; auto status = arrow_stream_reader->GetNextRecordBatch(&arrow_record_batch); - if (status.ok()) { + if (garrow_error_check(error, + status, + "[ipc][stream-reader][get-next-record-batch]")) { if (arrow_record_batch == nullptr) { return NULL; } else { return garrow_record_batch_new_raw(&arrow_record_batch); } } else { - garrow_error_set(error, status, "[ipc][stream-reader][get-next-record-batch]"); return NULL; } } diff --git a/c_glib/arrow-glib/stream-writer.cpp b/c_glib/arrow-glib/stream-writer.cpp index 016ce93759c87..cc24f263bfca9 100644 --- a/c_glib/arrow-glib/stream-writer.cpp +++ 
b/c_glib/arrow-glib/stream-writer.cpp @@ -150,10 +150,9 @@ garrow_stream_writer_open(GArrowOutputStream *sink, arrow::ipc::StreamWriter::Open(garrow_output_stream_get_raw(sink).get(), garrow_schema_get_raw(schema), &arrow_stream_writer); - if (status.ok()) { + if (garrow_error_check(error, status, "[ipc][stream-writer][open]")) { return garrow_stream_writer_new_raw(&arrow_stream_writer); } else { - garrow_error_set(error, status, "[ipc][stream-writer][open]"); return NULL; } } @@ -179,12 +178,9 @@ garrow_stream_writer_write_record_batch(GArrowStreamWriter *stream_writer, arrow_record_batch.get(); auto status = arrow_stream_writer->WriteRecordBatch(*arrow_record_batch_raw); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[ipc][stream-writer][write-record-batch]"); - return FALSE; - } + return garrow_error_check(error, + status, + "[ipc][stream-writer][write-record-batch]"); } /** @@ -202,12 +198,7 @@ garrow_stream_writer_close(GArrowStreamWriter *stream_writer, garrow_stream_writer_get_raw(stream_writer); auto status = arrow_stream_writer->Close(); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[ipc][stream-writer][close]"); - return FALSE; - } + return garrow_error_check(error, status, "[ipc][stream-writer][close]"); } G_END_DECLS diff --git a/c_glib/arrow-glib/table.cpp b/c_glib/arrow-glib/table.cpp index 1d743b70731bb..2aba21b564243 100644 --- a/c_glib/arrow-glib/table.cpp +++ b/c_glib/arrow-glib/table.cpp @@ -226,10 +226,9 @@ garrow_table_add_column(GArrowTable *table, const auto arrow_column = garrow_column_get_raw(column); std::shared_ptr arrow_new_table; auto status = arrow_table->AddColumn(i, arrow_column, &arrow_new_table); - if (status.ok()) { + if (garrow_error_check(error, status, "[table][add-column]")) { return garrow_table_new_raw(&arrow_new_table); } else { - garrow_error_set(error, status, "[table][add-column]"); return NULL; } } @@ -253,10 +252,9 @@ garrow_table_remove_column(GArrowTable *table, const auto arrow_table = garrow_table_get_raw(table); std::shared_ptr arrow_new_table; auto status = arrow_table->RemoveColumn(i, &arrow_new_table); - if (status.ok()) { + if (garrow_error_check(error, status, "[table][remove-column]")) { return garrow_table_new_raw(&arrow_new_table); } else { - garrow_error_set(error, status, "[table][remove-column]"); return NULL; } } diff --git a/c_glib/arrow-glib/writeable-file.cpp b/c_glib/arrow-glib/writeable-file.cpp index d0937ea2612d2..b717c32932fc0 100644 --- a/c_glib/arrow-glib/writeable-file.cpp +++ b/c_glib/arrow-glib/writeable-file.cpp @@ -66,12 +66,7 @@ garrow_writeable_file_write_at(GArrowWriteableFile *writeable_file, garrow_writeable_file_get_raw(writeable_file); auto status = arrow_writeable_file->WriteAt(position, data, n_bytes); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[io][writeable-file][write-at]"); - return FALSE; - } + return garrow_error_check(error, status, "[io][writeable-file][write-at]"); } G_END_DECLS diff --git a/c_glib/arrow-glib/writeable.cpp b/c_glib/arrow-glib/writeable.cpp index 6f4c63008ae49..eb6adfee8c985 100644 --- a/c_glib/arrow-glib/writeable.cpp +++ b/c_glib/arrow-glib/writeable.cpp @@ -64,12 +64,7 @@ garrow_writeable_write(GArrowWriteable *writeable, const auto arrow_writeable = garrow_writeable_get_raw(writeable); auto status = arrow_writeable->Write(data, n_bytes); - if (status.ok()) { - return TRUE; - } else { - garrow_error_set(error, status, "[io][writeable][write]"); - return FALSE; - } + 
return garrow_error_check(error, status, "[io][writeable][write]");
 }
 
 /**
@@ -88,12 +83,7 @@ garrow_writeable_flush(GArrowWriteable *writeable,
   const auto arrow_writeable = garrow_writeable_get_raw(writeable);
 
   auto status = arrow_writeable->Flush();
-  if (status.ok()) {
-    return TRUE;
-  } else {
-    garrow_error_set(error, status, "[io][writeable][flush]");
-    return FALSE;
-  }
+  return garrow_error_check(error, status, "[io][writeable][flush]");
 }
 
 G_END_DECLS

From 02c32ff938de4507483fc69f39847291319f427f Mon Sep 17 00:00:00 2001
From: Kouhei Sutou
Date: Wed, 26 Apr 2017 10:33:44 -0400
Subject: [PATCH 0571/1644] ARROW-903: [GLib] Remove a needless "."

Author: Kouhei Sutou

Closes #603 from kou/glib-tensor-doc and squashes the following commits:

527c4db [Kouhei Sutou] [GLib] Remove a needless "."
---
 c_glib/arrow-glib/tensor.cpp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/c_glib/arrow-glib/tensor.cpp b/c_glib/arrow-glib/tensor.cpp
index 82f66352d66a0..27af7532f3451 100644
--- a/c_glib/arrow-glib/tensor.cpp
+++ b/c_glib/arrow-glib/tensor.cpp
@@ -30,7 +30,7 @@ G_BEGIN_DECLS
 /**
  * SECTION: tensor
- * @short_description: Tensor class.
+ * @short_description: Tensor class
  * @include: arrow-glib/arrow-glib.h
  *
  * #GArrowTensor is a tensor class.

From 8bf61d1682b883a7a538678f7f3c68dc06bb758d Mon Sep 17 00:00:00 2001
From: Holden Karau
Date: Wed, 26 Apr 2017 15:14:49 -0400
Subject: [PATCH 0572/1644] ARROW-697: JAVA Throw exception for record batches > 2GB

Add a test to verify that we throw a clear error message for record batches
over 2GB. This entry point is the easiest to test without adding magic bytes
to the test suite, since it's explicit on the input, and the other public
entry points for deserialization have the same checks (just extracted from
the metadata).
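(Editorial sketch, not part of the original commit; the class and method
names below are hypothetical. The patch applies this narrowing guard inline
wherever a Flatbuffer node length, null count, or record batch length is
read: a long survives a round trip through int only if it fits in 32 bits.)

    import java.io.IOException;

    class NarrowingGuard {
      // Reject any long that cannot be represented as an int, e.g. the
      // length of a record batch over 2GB.
      static int checkedIntCast(long value, String what) throws IOException {
        if ((int) value != value) {
          throw new IOException("Cannot currently deserialize " + what
              + " larger than Integer.MAX_VALUE");
        }
        return (int) value;
      }

      public static void main(String[] args) throws IOException {
        System.out.println(checkedIntCast(1024L, "record batch length")); // 1024
        checkedIntCast(Integer.MAX_VALUE + 10L, "record batch length");   // throws IOException
      }
    }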
Author: Holden Karau Closes #597 from holdenk/ARROW-697-java-raise-exception-for-large-batch-size and squashes the following commits: d2d6b3d [Holden Karau] Merge branch 'master' into ARROW-697-java-raise-exception-for-large-batch-size d56daab [Holden Karau] Throw IOException if record batch length, node length, or null count are larger than Int.MAX_VALUE 0a96b74 [Holden Karau] Add a test to verify that we throw a clear error message for record batches over 2GB in size --- .../arrow/vector/stream/MessageSerializer.java | 10 +++++++++- .../vector/stream/MessageSerializerTest.java | 18 ++++++++++++++++++ 2 files changed, 27 insertions(+), 1 deletion(-) diff --git a/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java b/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java index ec7e0f2ffb115..228ab613466d2 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java @@ -201,12 +201,17 @@ public static ArrowRecordBatch deserializeRecordBatch(ReadChannel in, ArrowBlock // Deserializes a record batch given the Flatbuffer metadata and in-memory body private static ArrowRecordBatch deserializeRecordBatch(RecordBatch recordBatchFB, - ArrowBuf body) { + ArrowBuf body) throws IOException { // Now read the body int nodesLength = recordBatchFB.nodesLength(); List nodes = new ArrayList<>(); for (int i = 0; i < nodesLength; ++i) { FieldNode node = recordBatchFB.nodes(i); + if ((int)node.length() != node.length() || + (int)node.nullCount() != node.nullCount()) { + throw new IOException("Cannot currently deserialize record batches with " + + "node length larger than Int.MAX_VALUE"); + } nodes.add(new ArrowFieldNode((int)node.length(), (int)node.nullCount())); } List buffers = new ArrayList<>(); @@ -215,6 +220,9 @@ private static ArrowRecordBatch deserializeRecordBatch(RecordBatch recordBatchFB ArrowBuf vectorBuffer = body.slice((int)bufferFB.offset(), (int)bufferFB.length()); buffers.add(vectorBuffer); } + if ((int)recordBatchFB.length() != recordBatchFB.length()) { + throw new IOException("Cannot currently deserialize record batches over 2GB"); + } ArrowRecordBatch arrowRecordBatch = new ArrowRecordBatch((int)recordBatchFB.length(), nodes, buffers); body.release(); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/stream/MessageSerializerTest.java b/java/vector/src/test/java/org/apache/arrow/vector/stream/MessageSerializerTest.java index d3d49d5fb8096..27879efeaf117 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/stream/MessageSerializerTest.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/stream/MessageSerializerTest.java @@ -31,6 +31,7 @@ import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.RootAllocator; +import org.apache.arrow.vector.file.ArrowBlock; import org.apache.arrow.vector.file.ReadChannel; import org.apache.arrow.vector.file.WriteChannel; import org.apache.arrow.vector.schema.ArrowFieldNode; @@ -41,6 +42,8 @@ import org.apache.arrow.vector.types.pojo.Field; import org.apache.arrow.vector.types.pojo.Schema; import org.junit.Test; +import org.junit.Rule; +import org.junit.rules.ExpectedException; import io.netty.buffer.ArrowBuf; @@ -87,6 +90,21 @@ public void testSchemaDictionaryMessageSerialization() throws IOException { assertEquals(schema, deserialized); } + @Rule + public ExpectedException expectedEx = ExpectedException.none(); + + @Test + public void 
testdeSerializeRecordBatchLongMetaData() throws IOException { + expectedEx.expect(IOException.class); + expectedEx.expectMessage("Cannot currently deserialize record batches over 2GB"); + int offset = 0; + int metadataLength = 1; + long bodyLength = Integer.MAX_VALUE + 10L; + ArrowBlock block = new ArrowBlock(offset, metadataLength, bodyLength); + long totalLen = block.getMetadataLength() + block.getBodyLength(); + MessageSerializer.deserializeRecordBatch(null, block, null); + } + @Test public void testSerializeRecordBatch() throws IOException { byte[] validity = new byte[] { (byte)255, 0}; From 3fdeac74c80593ebde7a8eeb148cea9f6e0d1b38 Mon Sep 17 00:00:00 2001 From: Emilio Lahr-Vivaz Date: Wed, 26 Apr 2017 15:16:53 -0400 Subject: [PATCH 0573/1644] ARROW-886 [Java] Fixing reallocation of VariableLengthVector offsets Author: Emilio Lahr-Vivaz Closes #591 from elahrvivaz/ARROW-886 and squashes the following commits: 5f6b4be [Emilio Lahr-Vivaz] ARROW-886 Fixing reallocation of VariableLengthVector offsets --- .../templates/VariableLengthVectors.java | 1 + .../arrow/vector/TestVectorReAlloc.java | 27 ++++++++++++++++++- 2 files changed, 27 insertions(+), 1 deletion(-) diff --git a/java/vector/src/main/codegen/templates/VariableLengthVectors.java b/java/vector/src/main/codegen/templates/VariableLengthVectors.java index 4a460c5475323..11f0cc894d004 100644 --- a/java/vector/src/main/codegen/templates/VariableLengthVectors.java +++ b/java/vector/src/main/codegen/templates/VariableLengthVectors.java @@ -355,6 +355,7 @@ public void reset() { } public void reAlloc() { + offsetVector.reAlloc(); final long newAllocationSize = allocationSizeInBytes*2L; if (newAllocationSize > MAX_ALLOCATION_SIZE) { throw new OversizedAllocationException("Unable to expand the buffer. Max allowed buffer size is reached."); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/TestVectorReAlloc.java b/java/vector/src/test/java/org/apache/arrow/vector/TestVectorReAlloc.java index a7c35b6363cf1..40c7bc5ac9add 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/TestVectorReAlloc.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/TestVectorReAlloc.java @@ -72,6 +72,31 @@ public void testFixedType() { } } + @Test + public void testVariableLengthType() { + try (final VarCharVector vector = new VarCharVector("", allocator)) { + final VarCharVector.Mutator m = vector.getMutator(); + // note: capacity ends up being - 1 due to offsets vector + vector.setInitialCapacity(511); + vector.allocateNew(); + + assertEquals(511, vector.getValueCapacity()); + + try { + m.set(512, "foo".getBytes(StandardCharsets.UTF_8)); + Assert.fail("Expected out of bounds exception"); + } catch (Exception e) { + // ok + } + + vector.reAlloc(); + assertEquals(1023, vector.getValueCapacity()); + + m.set(512, "foo".getBytes(StandardCharsets.UTF_8)); + assertEquals("foo", new String(vector.getAccessor().get(512), StandardCharsets.UTF_8)); + } + } + @Test public void testNullableType() { try (final NullableVarCharVector vector = new NullableVarCharVector("", allocator)) { @@ -89,7 +114,7 @@ public void testNullableType() { } vector.reAlloc(); - assertEquals(1023, vector.getValueCapacity()); // note: size - 1 for some reason... 
+ assertEquals(1024, vector.getValueCapacity()); m.set(512, "foo".getBytes(StandardCharsets.UTF_8)); assertEquals("foo", new String(vector.getAccessor().get(512), StandardCharsets.UTF_8)); From e876abbdf4f7bac53ae5d56f4680259f021ea8d9 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 27 Apr 2017 09:28:54 -0400 Subject: [PATCH 0574/1644] ARROW-898: [C++/Python] Use shared_ptr to avoid copying KeyValueMetadata, add to Field type also This also adds support for adding and removing schema-level metadata to the Python Schema wrapper object. Need to do the same for Field but putting this up for @cpcloud to review since he's working on using this in parquet-cpp Author: Wes McKinney Closes #605 from wesm/ARROW-898 and squashes the following commits: 03873f7 [Wes McKinney] RemoveMetadata not return Status c621c2c [Wes McKinney] Add metadata methods to Field, some code cleaning 581b9fa [Wes McKinney] Add unit tests for passing metadata to Field constructor 51fae29 [Wes McKinney] Add metadata to Field 2ce4003 [Wes McKinney] Test sharing metadata 48aa3ca [Wes McKinney] Use shared_ptr for KeyValueMetadata so metadata can be shared / not copied --- cpp/src/arrow/builder.cc | 11 +- cpp/src/arrow/io/io-memory-benchmark.cc | 6 +- cpp/src/arrow/io/memory.cc | 4 +- cpp/src/arrow/ipc/metadata.cc | 38 +-- cpp/src/arrow/type-test.cc | 61 ++++- cpp/src/arrow/type.cc | 60 +++-- cpp/src/arrow/type.h | 32 ++- cpp/src/arrow/util/key-value-metadata-test.cc | 9 + cpp/src/arrow/util/key_value_metadata.cc | 4 + cpp/src/arrow/util/key_value_metadata.h | 6 + cpp/src/arrow/util/memory.h | 8 +- python/pyarrow/_array.pxd | 2 - python/pyarrow/_array.pyx | 220 ++++++++++++++---- python/pyarrow/_table.pyx | 7 +- python/pyarrow/includes/libarrow.pxd | 31 ++- python/pyarrow/tests/pandas_examples.py | 4 +- python/pyarrow/tests/test_convert_pandas.py | 64 ++--- python/pyarrow/tests/test_schema.py | 48 +++- 18 files changed, 466 insertions(+), 149 deletions(-) diff --git a/cpp/src/arrow/builder.cc b/cpp/src/arrow/builder.cc index 4ecb8d3500981..ab43c2a51baf4 100644 --- a/cpp/src/arrow/builder.cc +++ b/cpp/src/arrow/builder.cc @@ -406,17 +406,16 @@ Status DecimalBuilder::Finish(std::shared_ptr* out) { ListBuilder::ListBuilder(MemoryPool* pool, std::shared_ptr value_builder, const std::shared_ptr& type) - : ArrayBuilder(pool, - type ? type : std::static_pointer_cast( - std::make_shared(value_builder->type()))), + : ArrayBuilder( + pool, type ? type : std::static_pointer_cast( + std::make_shared(value_builder->type()))), offset_builder_(pool), value_builder_(value_builder) {} ListBuilder::ListBuilder(MemoryPool* pool, std::shared_ptr values, const std::shared_ptr& type) - : ArrayBuilder(pool, - type ? type : std::static_pointer_cast( - std::make_shared(values->type()))), + : ArrayBuilder(pool, type ? 
type : std::static_pointer_cast( + std::make_shared(values->type()))), offset_builder_(pool), values_(values) {} diff --git a/cpp/src/arrow/io/io-memory-benchmark.cc b/cpp/src/arrow/io/io-memory-benchmark.cc index 59b511a6cf8fe..6aa9577f0a1d9 100644 --- a/cpp/src/arrow/io/io-memory-benchmark.cc +++ b/cpp/src/arrow/io/io-memory-benchmark.cc @@ -26,7 +26,7 @@ namespace arrow { static void BM_SerialMemcopy(benchmark::State& state) { // NOLINT non-const reference - constexpr int64_t kTotalSize = 100 * 1024 * 1024; // 100MB + constexpr int64_t kTotalSize = 100 * 1024 * 1024; // 100MB auto buffer1 = std::make_shared(default_memory_pool()); buffer1->Resize(kTotalSize); @@ -43,7 +43,7 @@ static void BM_SerialMemcopy(benchmark::State& state) { // NOLINT non-const ref } static void BM_ParallelMemcopy(benchmark::State& state) { // NOLINT non-const reference - constexpr int64_t kTotalSize = 100 * 1024 * 1024; // 100MB + constexpr int64_t kTotalSize = 100 * 1024 * 1024; // 100MB auto buffer1 = std::make_shared(default_memory_pool()); buffer1->Resize(kTotalSize); @@ -72,4 +72,4 @@ BENCHMARK(BM_ParallelMemcopy) ->MinTime(1.0) ->UseRealTime(); -} // namespace arrow +} // namespace arrow diff --git a/cpp/src/arrow/io/memory.cc b/cpp/src/arrow/io/memory.cc index 95c6206f0fab0..faf02d2fa22f7 100644 --- a/cpp/src/arrow/io/memory.cc +++ b/cpp/src/arrow/io/memory.cc @@ -140,8 +140,8 @@ Status FixedSizeBufferWriter::Tell(int64_t* position) { Status FixedSizeBufferWriter::Write(const uint8_t* data, int64_t nbytes) { if (nbytes > memcopy_threshold_ && memcopy_num_threads_ > 1) { - parallel_memcopy(mutable_data_ + position_, data, nbytes, - memcopy_blocksize_, memcopy_num_threads_); + parallel_memcopy(mutable_data_ + position_, data, nbytes, memcopy_blocksize_, + memcopy_num_threads_); } else { memcpy(mutable_data_ + position_, data, nbytes); } diff --git a/cpp/src/arrow/ipc/metadata.cc b/cpp/src/arrow/ipc/metadata.cc index c0b518a0d8e50..706ab2e8aab0e 100644 --- a/cpp/src/arrow/ipc/metadata.cc +++ b/cpp/src/arrow/ipc/metadata.cc @@ -593,20 +593,27 @@ static Status SchemaToFlatbuffer(FBB& fbb, const Schema& schema, field_offsets.push_back(offset); } + auto fb_offsets = fbb.CreateVector(field_offsets); + /// Custom metadata - const auto& custom_metadata_ = schema.custom_metadata(); - std::vector key_value_offsets; - size_t metadata_size = custom_metadata_.size(); - key_value_offsets.reserve(metadata_size); - for (size_t i = 0; i < metadata_size; ++i) { - const auto& key = custom_metadata_.key(i); - const auto& value = custom_metadata_.value(i); - key_value_offsets.push_back( - flatbuf::CreateKeyValue(fbb, fbb.CreateString(key), fbb.CreateString(value))); + const KeyValueMetadata* metadata = schema.metadata().get(); + + if (metadata != nullptr) { + std::vector key_value_offsets; + size_t metadata_size = metadata->size(); + key_value_offsets.reserve(metadata_size); + for (size_t i = 0; i < metadata_size; ++i) { + const auto& key = metadata->key(i); + const auto& value = metadata->value(i); + key_value_offsets.push_back( + flatbuf::CreateKeyValue(fbb, fbb.CreateString(key), fbb.CreateString(value))); + } + *out = flatbuf::CreateSchema( + fbb, endianness(), fb_offsets, fbb.CreateVector(key_value_offsets)); + } else { + *out = flatbuf::CreateSchema(fbb, endianness(), fb_offsets); } - *out = flatbuf::CreateSchema(fbb, endianness(), fbb.CreateVector(field_offsets), - fbb.CreateVector(key_value_offsets)); return Status::OK(); } @@ -955,17 +962,16 @@ Status GetSchema(const void* opaque_schema, const DictionaryMemo& 
dictionary_mem RETURN_NOT_OK(FieldFromFlatbuffer(field, dictionary_memo, &fields[i])); } - KeyValueMetadata custom_metadata; + auto metadata = std::make_shared(); auto fb_metadata = schema->custom_metadata(); if (fb_metadata != nullptr) { - custom_metadata.reserve(fb_metadata->size()); - + metadata->reserve(fb_metadata->size()); for (const auto& pair : *fb_metadata) { - custom_metadata.Append(pair->key()->str(), pair->value()->str()); + metadata->Append(pair->key()->str(), pair->value()->str()); } } - *out = std::make_shared(fields, custom_metadata); + *out = std::make_shared(fields, metadata); return Status::OK(); } diff --git a/cpp/src/arrow/type-test.cc b/cpp/src/arrow/type-test.cc index 8e2dfd50e431d..e73adecdcb5aa 100644 --- a/cpp/src/arrow/type-test.cc +++ b/cpp/src/arrow/type-test.cc @@ -23,6 +23,7 @@ #include "gtest/gtest.h" +#include "arrow/test-util.h" #include "arrow/type.h" using std::shared_ptr; @@ -50,6 +51,40 @@ TEST(TestField, Equals) { ASSERT_FALSE(f0.Equals(f0_nn)); } +TEST(TestField, TestMetadataConstruction) { + auto metadata = std::shared_ptr( + new KeyValueMetadata({"foo", "bar"}, {"bizz", "buzz"})); + auto metadata2 = metadata->Copy(); + auto f0 = field("f0", int32(), true, metadata); + auto f1 = field("f0", int32(), true, metadata2); + ASSERT_TRUE(metadata->Equals(*f0->metadata())); + ASSERT_TRUE(f0->Equals(*f1)); +} + +TEST(TestField, TestAddMetadata) { + auto metadata = std::shared_ptr( + new KeyValueMetadata({"foo", "bar"}, {"bizz", "buzz"})); + auto f0 = field("f0", int32()); + auto f1 = field("f0", int32(), true, metadata); + std::shared_ptr f2; + ASSERT_OK(f0->AddMetadata(metadata, &f2)); + + ASSERT_FALSE(f2->Equals(*f0)); + ASSERT_TRUE(f2->Equals(*f1)); + + // Not copied + ASSERT_TRUE(metadata.get() == f1->metadata().get()); +} + +TEST(TestField, TestRemoveMetadata) { + auto metadata = std::shared_ptr( + new KeyValueMetadata({"foo", "bar"}, {"bizz", "buzz"})); + auto f0 = field("f0", int32()); + auto f1 = field("f0", int32(), true, metadata); + std::shared_ptr f2 = f1->RemoveMetadata(); + ASSERT_TRUE(f2->metadata() == nullptr); +} + class TestSchema : public ::testing::Test { public: void SetUp() {} @@ -117,38 +152,42 @@ TEST_F(TestSchema, GetFieldByName) { ASSERT_TRUE(result == nullptr); } -TEST_F(TestSchema, TestCustomMetadataConstruction) { +TEST_F(TestSchema, TestMetadataConstruction) { auto f0 = field("f0", int32()); auto f1 = field("f1", uint8(), false); auto f2 = field("f2", utf8()); vector> fields = {f0, f1, f2}; - KeyValueMetadata metadata({"foo", "bar"}, {"bizz", "buzz"}); + auto metadata = std::shared_ptr( + new KeyValueMetadata({"foo", "bar"}, {"bizz", "buzz"})); auto schema = std::make_shared(fields, metadata); - ASSERT_TRUE(metadata.Equals(schema->custom_metadata())); + ASSERT_TRUE(metadata->Equals(*schema->metadata())); } -TEST_F(TestSchema, TestAddCustomMetadata) { +TEST_F(TestSchema, TestAddMetadata) { auto f0 = field("f0", int32()); auto f1 = field("f1", uint8(), false); auto f2 = field("f2", utf8()); vector> fields = {f0, f1, f2}; - KeyValueMetadata metadata({"foo", "bar"}, {"bizz", "buzz"}); + auto metadata = std::shared_ptr( + new KeyValueMetadata({"foo", "bar"}, {"bizz", "buzz"})); auto schema = std::make_shared(fields); std::shared_ptr new_schema; - schema->AddCustomMetadata(metadata, &new_schema); - ASSERT_TRUE(metadata.Equals(new_schema->custom_metadata())); + schema->AddMetadata(metadata, &new_schema); + ASSERT_TRUE(metadata->Equals(*new_schema->metadata())); + + // Not copied + ASSERT_TRUE(metadata.get() == 
new_schema->metadata().get()); } -TEST_F(TestSchema, TestRemoveCustomMetadata) { +TEST_F(TestSchema, TestRemoveMetadata) { auto f0 = field("f0", int32()); auto f1 = field("f1", uint8(), false); auto f2 = field("f2", utf8()); vector> fields = {f0, f1, f2}; KeyValueMetadata metadata({"foo", "bar"}, {"bizz", "buzz"}); auto schema = std::make_shared(fields); - std::shared_ptr new_schema; - schema->RemoveCustomMetadata(&new_schema); - ASSERT_EQ(0, new_schema->custom_metadata().size()); + std::shared_ptr new_schema = schema->RemoveMetadata(); + ASSERT_TRUE(new_schema->metadata() == nullptr); } #define PRIMITIVE_TEST(KLASS, ENUM, NAME) \ diff --git a/cpp/src/arrow/type.cc b/cpp/src/arrow/type.cc index f59f8fb26c9ba..b1e322ce1b321 100644 --- a/cpp/src/arrow/type.cc +++ b/cpp/src/arrow/type.cc @@ -31,10 +31,31 @@ namespace arrow { +Status Field::AddMetadata(const std::shared_ptr& metadata, + std::shared_ptr* out) const { + *out = std::make_shared(name_, type_, nullable_, metadata); + return Status::OK(); +} + +std::shared_ptr Field::RemoveMetadata() const { + return std::make_shared(name_, type_, nullable_); +} + bool Field::Equals(const Field& other) const { - return (this == &other) || - (this->name_ == other.name_ && this->nullable_ == other.nullable_ && - this->type_->Equals(*other.type_.get())); + if (this == &other) { + return true; + } + if (this->name_ == other.name_ && this->nullable_ == other.nullable_ && + this->type_->Equals(*other.type_.get())) { + if (metadata_ == nullptr && other.metadata_ == nullptr) { + return true; + } else if ((metadata_ == nullptr) ^ (other.metadata_ == nullptr)) { + return false; + } else { + return metadata_->Equals(*other.metadata_); + } + } + return false; } bool Field::Equals(const std::shared_ptr& other) const { @@ -233,8 +254,8 @@ std::string NullType::ToString() const { // Schema implementation Schema::Schema(const std::vector>& fields, - const KeyValueMetadata& custom_metadata) - : fields_(fields), custom_metadata_(custom_metadata) {} + const std::shared_ptr& metadata) + : fields_(fields), metadata_(metadata) {} bool Schema::Equals(const Schema& other) const { if (this == &other) { return true; } @@ -266,26 +287,25 @@ Status Schema::AddField( DCHECK_GE(i, 0); DCHECK_LE(i, this->num_fields()); - *out = std::make_shared(AddVectorElement(fields_, i, field), custom_metadata_); + *out = std::make_shared(AddVectorElement(fields_, i, field), metadata_); return Status::OK(); } -Status Schema::AddCustomMetadata( - const KeyValueMetadata& custom_metadata, std::shared_ptr* out) const { - *out = std::make_shared(fields_, custom_metadata); +Status Schema::AddMetadata(const std::shared_ptr& metadata, + std::shared_ptr* out) const { + *out = std::make_shared(fields_, metadata); return Status::OK(); } -Status Schema::RemoveCustomMetadata(std::shared_ptr* out) { - *out = std::make_shared(fields_, KeyValueMetadata()); - return Status::OK(); +std::shared_ptr Schema::RemoveMetadata() const { + return std::make_shared(fields_); } Status Schema::RemoveField(int i, std::shared_ptr* out) const { DCHECK_GE(i, 0); DCHECK_LT(i, this->num_fields()); - *out = std::make_shared(DeleteVectorElement(fields_, i), custom_metadata_); + *out = std::make_shared(DeleteVectorElement(fields_, i), metadata_); return Status::OK(); } @@ -298,6 +318,15 @@ std::string Schema::ToString() const { buffer << field->ToString(); ++i; } + + if (metadata_) { + buffer << "\n-- metadata --"; + for (int64_t i = 0; i < metadata_->size(); ++i) { + buffer << "\n" << metadata_->key(i) << ": " + << 
metadata_->value(i); + } + } + return buffer.str(); } @@ -391,8 +420,9 @@ std::shared_ptr dictionary(const std::shared_ptr& index_type } std::shared_ptr field( - const std::string& name, const TypePtr& type, bool nullable) { - return std::make_shared(name, type, nullable); + const std::string& name, const std::shared_ptr& type, bool nullable, + const std::shared_ptr& metadata) { + return std::make_shared(name, type, nullable, metadata); } std::shared_ptr decimal(int precision, int scale) { diff --git a/cpp/src/arrow/type.h b/cpp/src/arrow/type.h index dc9456137030f..bb258578da327 100644 --- a/cpp/src/arrow/type.h +++ b/cpp/src/arrow/type.h @@ -203,8 +203,16 @@ class NoExtraMeta {}; class ARROW_EXPORT Field { public: Field(const std::string& name, const std::shared_ptr& type, - bool nullable = true) - : name_(name), type_(type), nullable_(nullable) {} + bool nullable = true, + const std::shared_ptr& metadata = nullptr) + : name_(name), type_(type), nullable_(nullable), metadata_(metadata) {} + + std::shared_ptr metadata() const { return metadata_; } + + Status AddMetadata(const std::shared_ptr& metadata, + std::shared_ptr* out) const; + + std::shared_ptr RemoveMetadata() const; bool Equals(const Field& other) const; bool Equals(const std::shared_ptr& other) const; @@ -224,6 +232,9 @@ class ARROW_EXPORT Field { // Fields can be nullable bool nullable_; + + // The field's metadata, if any + std::shared_ptr metadata_; }; typedef std::shared_ptr FieldPtr; @@ -679,7 +690,7 @@ class ARROW_EXPORT DictionaryType : public FixedWidthType { class ARROW_EXPORT Schema { public: explicit Schema(const std::vector>& fields, - const KeyValueMetadata& custom_metadata = KeyValueMetadata()); + const std::shared_ptr& metadata = nullptr); // Returns true if all of the schema fields are equal bool Equals(const Schema& other) const; @@ -691,7 +702,7 @@ class ARROW_EXPORT Schema { std::shared_ptr GetFieldByName(const std::string& name); const std::vector>& fields() const { return fields_; } - const KeyValueMetadata& custom_metadata() const { return custom_metadata_; } + std::shared_ptr metadata() const { return metadata_; } // Render a string representation of the schema suitable for debugging std::string ToString() const; @@ -700,16 +711,18 @@ class ARROW_EXPORT Schema { int i, const std::shared_ptr& field, std::shared_ptr* out) const; Status RemoveField(int i, std::shared_ptr* out) const; - Status AddCustomMetadata( - const KeyValueMetadata& metadata, std::shared_ptr* out) const; - Status RemoveCustomMetadata(std::shared_ptr* out); + Status AddMetadata(const std::shared_ptr& metadata, + std::shared_ptr* out) const; + + std::shared_ptr RemoveMetadata() const; int num_fields() const { return static_cast(fields_.size()); } private: std::vector> fields_; std::unordered_map name_to_index_; - KeyValueMetadata custom_metadata_; + + std::shared_ptr metadata_; }; // ---------------------------------------------------------------------- @@ -741,7 +754,8 @@ std::shared_ptr ARROW_EXPORT dictionary( const std::shared_ptr& index_type, const std::shared_ptr& values); std::shared_ptr ARROW_EXPORT field( - const std::string& name, const std::shared_ptr& type, bool nullable = true); + const std::string& name, const std::shared_ptr& type, bool nullable = true, + const std::shared_ptr& metadata = nullptr); // ---------------------------------------------------------------------- // diff --git a/cpp/src/arrow/util/key-value-metadata-test.cc b/cpp/src/arrow/util/key-value-metadata-test.cc index aadc989cb403f..59cfdf597308c 100644 --- 
a/cpp/src/arrow/util/key-value-metadata-test.cc +++ b/cpp/src/arrow/util/key-value-metadata-test.cc @@ -72,6 +72,15 @@ TEST(KeyValueMetadataTest, StringAppend) { ASSERT_EQ("red", metadata.value(3)); } +TEST(KeyValueMetadataTest, Copy) { + std::vector keys = {"foo", "bar"}; + std::vector values = {"bizz", "buzz"}; + + KeyValueMetadata metadata(keys, values); + auto metadata2 = metadata.Copy(); + ASSERT_TRUE(metadata.Equals(*metadata2)); +} + TEST(KeyValueMetadataTest, Equals) { std::vector keys = {"foo", "bar"}; std::vector values = {"bizz", "buzz"}; diff --git a/cpp/src/arrow/util/key_value_metadata.cc b/cpp/src/arrow/util/key_value_metadata.cc index c91478bd1acd6..8bddd5d0164c2 100644 --- a/cpp/src/arrow/util/key_value_metadata.cc +++ b/cpp/src/arrow/util/key_value_metadata.cc @@ -91,6 +91,10 @@ std::string KeyValueMetadata::value(int64_t i) const { return values_[static_cast(i)]; } +std::shared_ptr KeyValueMetadata::Copy() const { + return std::make_shared(keys_, values_); +} + bool KeyValueMetadata::Equals(const KeyValueMetadata& other) const { return size() == other.size() && std::equal(keys_.cbegin(), keys_.cend(), other.keys_.cbegin()) && diff --git a/cpp/src/arrow/util/key_value_metadata.h b/cpp/src/arrow/util/key_value_metadata.h index 713b2c0b0bcfb..bae4ad806dd62 100644 --- a/cpp/src/arrow/util/key_value_metadata.h +++ b/cpp/src/arrow/util/key_value_metadata.h @@ -19,10 +19,12 @@ #define ARROW_UTIL_KEY_VALUE_METADATA_H #include +#include #include #include #include +#include "arrow/util/macros.h" #include "arrow/util/visibility.h" namespace arrow { @@ -44,11 +46,15 @@ class ARROW_EXPORT KeyValueMetadata { std::string key(int64_t i) const; std::string value(int64_t i) const; + std::shared_ptr Copy() const; + bool Equals(const KeyValueMetadata& other) const; private: std::vector keys_; std::vector values_; + + DISALLOW_COPY_AND_ASSIGN(KeyValueMetadata); }; } // namespace arrow diff --git a/cpp/src/arrow/util/memory.h b/cpp/src/arrow/util/memory.h index 7feeb291ef4a0..c5c17ef907c22 100644 --- a/cpp/src/arrow/util/memory.h +++ b/cpp/src/arrow/util/memory.h @@ -31,7 +31,7 @@ uint8_t* pointer_logical_and(const uint8_t* address, uintptr_t bits) { // A helper function for doing memcpy with multiple threads. This is required // to saturate the memory bandwidth of modern cpus. void parallel_memcopy(uint8_t* dst, const uint8_t* src, int64_t nbytes, - uintptr_t block_size, int num_threads) { + uintptr_t block_size, int num_threads) { std::vector threadpool(num_threads); uint8_t* left = pointer_logical_and(src + block_size - 1, ~(block_size - 1)); uint8_t* right = pointer_logical_and(src + nbytes, ~(block_size - 1)); @@ -52,8 +52,8 @@ void parallel_memcopy(uint8_t* dst, const uint8_t* src, int64_t nbytes, // Start all threads first and handle leftovers while threads run. 
for (int i = 0; i < num_threads; i++) { - threadpool[i] = std::thread(memcpy, dst + prefix + i * chunk_size, - left + i * chunk_size, chunk_size); + threadpool[i] = std::thread( + memcpy, dst + prefix + i * chunk_size, left + i * chunk_size, chunk_size); } memcpy(dst, src, prefix); @@ -64,6 +64,6 @@ void parallel_memcopy(uint8_t* dst, const uint8_t* src, int64_t nbytes, } } -} // namespace arrow +} // namespace arrow #endif // ARROW_UTIL_MEMORY_H diff --git a/python/pyarrow/_array.pxd b/python/pyarrow/_array.pxd index 4d5db8618a377..464de316f0437 100644 --- a/python/pyarrow/_array.pxd +++ b/python/pyarrow/_array.pxd @@ -81,8 +81,6 @@ cdef class Schema: cdef init(self, const vector[shared_ptr[CField]]& fields) cdef init_schema(self, const shared_ptr[CSchema]& schema) - cpdef dict custom_metadata(self) - cdef class Scalar: cdef readonly: diff --git a/python/pyarrow/_array.pyx b/python/pyarrow/_array.pyx index 2fb20b7553e93..658f4b314b3a2 100644 --- a/python/pyarrow/_array.pyx +++ b/python/pyarrow/_array.pyx @@ -172,7 +172,14 @@ cdef class DecimalType(FixedSizeBinaryType): cdef class Field: + """ + Represents a named field, with a data type, nullability, and optional + metadata + Notes + ----- + Do not use this class's constructor directly; use pyarrow.field + """ def __cinit__(self): pass @@ -181,32 +188,77 @@ cdef class Field: self.field = field.get() self.type = box_data_type(field.get().type()) - @classmethod - def from_py(cls, object name, DataType type, bint nullable=True): - cdef Field result = Field() - result.type = type - result.sp_field.reset(new CField(tobytes(name), type.sp_type, - nullable)) - result.field = result.sp_field.get() + def equals(self, Field other): + """ + Test if this field is equal to the other + """ + return self.field.Equals(deref(other.field)) - return result + def __str__(self): + self._check_null() + return 'pyarrow.Field<{0}>'.format(frombytes(self.field.ToString())) def __repr__(self): - return 'Field({0!r}, type={1})'.format(self.name, str(self.type)) + return self.__str__() property nullable: def __get__(self): + self._check_null() return self.field.nullable() property name: def __get__(self): - if box_field(self.sp_field) is None: - raise ReferenceError( - 'Field not initialized (references NULL pointer)') + self._check_null() return frombytes(self.field.name()) + property metadata: + + def __get__(self): + self._check_null() + return box_metadata(self.field.metadata().get()) + + def _check_null(self): + if self.field == NULL: + raise ReferenceError( + 'Field not initialized (references NULL pointer)') + + def add_metadata(self, dict metadata): + """ + Add metadata as dict of string keys and values to Field + + Parameters + ---------- + metadata : dict + Keys and values must be string-like / coercible to bytes + + Returns + ------- + field : pyarrow.Field + """ + cdef shared_ptr[CKeyValueMetadata] c_meta + convert_metadata(metadata, &c_meta) + + cdef shared_ptr[CField] new_field + with nogil: + check_status(self.field.AddMetadata(c_meta, &new_field)) + + return box_field(new_field) + + def remove_metadata(self): + """ + Create new field without metadata, if any + + Returns + ------- + field : pyarrow.Field + """ + cdef shared_ptr[CField] new_field + with nogil: + new_field = self.field.RemoveMetadata() + return box_field(new_field) + cdef class Schema: @@ -226,6 +278,14 @@ cdef class Schema: return result + cdef init(self, const vector[shared_ptr[CField]]& fields): + self.schema = new CSchema(fields) + self.sp_schema.reset(self.schema) + + cdef 
init_schema(self, const shared_ptr[CSchema]& schema): + self.schema = schema.get() + self.sp_schema = schema + property names: def __get__(self): @@ -236,20 +296,10 @@ cdef class Schema: result.append(name) return result - cdef init(self, const vector[shared_ptr[CField]]& fields): - self.schema = new CSchema(fields) - self.sp_schema.reset(self.schema) - - cdef init_schema(self, const shared_ptr[CSchema]& schema): - self.schema = schema.get() - self.sp_schema = schema + property metadata: - cpdef dict custom_metadata(self): - cdef: - CKeyValueMetadata metadata = self.schema.custom_metadata() - unordered_map[c_string, c_string] result - metadata.ToUnorderedMap(&result) - return result + def __get__(self): + return box_metadata(self.schema.metadata().get()) def equals(self, other): """ @@ -274,23 +324,40 @@ cdef class Schema: """ return box_field(self.schema.GetFieldByName(tobytes(name))) - @classmethod - def from_fields(cls, fields): - cdef: - Schema result - Field field - vector[shared_ptr[CField]] c_fields + def add_metadata(self, dict metadata): + """ + Add metadata as dict of string keys and values to Schema - c_fields.resize(len(fields)) + Parameters + ---------- + metadata : dict + Keys and values must be string-like / coercible to bytes - for i in range(len(fields)): - field = fields[i] - c_fields[i] = field.sp_field + Returns + ------- + schema : pyarrow.Schema + """ + cdef shared_ptr[CKeyValueMetadata] c_meta + convert_metadata(metadata, &c_meta) - result = Schema() - result.init(c_fields) + cdef shared_ptr[CSchema] new_schema + with nogil: + check_status(self.schema.AddMetadata(c_meta, &new_schema)) - return result + return box_schema(new_schema) + + def remove_metadata(self): + """ + Create new schema without metadata, if any + + Returns + ------- + schema : pyarrow.Schema + """ + cdef shared_ptr[CSchema] new_schema + with nogil: + new_schema = self.schema.RemoveMetadata() + return box_schema(new_schema) def __str__(self): return frombytes(self.schema.ToString()) @@ -299,6 +366,15 @@ cdef class Schema: return self.__str__() +cdef box_metadata(const CKeyValueMetadata* metadata): + cdef unordered_map[c_string, c_string] result + if metadata != NULL: + metadata.ToUnorderedMap(&result) + return result + else: + return None + + cdef dict _type_cache = {} @@ -315,8 +391,49 @@ cdef DataType primitive_type(Type type): #------------------------------------------------------------ # Type factory functions -def field(name, type, bint nullable=True): - return Field.from_py(name, type, nullable) +cdef int convert_metadata(dict metadata, + shared_ptr[CKeyValueMetadata]* out) except -1: + cdef: + shared_ptr[CKeyValueMetadata] meta = ( + make_shared[CKeyValueMetadata]()) + c_string key, value + + for py_key, py_value in metadata.items(): + key = tobytes(py_key) + value = tobytes(py_value) + meta.get().Append(key, value) + out[0] = meta + return 0 + + +def field(name, DataType type, bint nullable=True, dict metadata=None): + """ + Create a pyarrow.Field instance + + Parameters + ---------- + name : string or bytes + type : pyarrow.DataType + nullable : boolean, default True + metadata : dict, default None + Keys and values must be coercible to bytes + + Returns + ------- + field : pyarrow.Field + """ + cdef: + shared_ptr[CKeyValueMetadata] c_meta + Field result = Field() + + if metadata is not None: + convert_metadata(metadata, &c_meta) + + result.sp_field.reset(new CField(tobytes(name), type.sp_type, + nullable, c_meta)) + result.field = result.sp_field.get() + result.type = type + return result 
cdef set PRIMITIVE_TYPES = set([ @@ -554,7 +671,28 @@ def struct(fields): def schema(fields): - return Schema.from_fields(fields) + """ + Construct pyarrow.Schema from collection of fields + + Parameters + ---------- + field : list or iterable + + Returns + ------- + schema : pyarrow.Schema + """ + cdef: + Schema result + Field field + vector[shared_ptr[CField]] c_fields + + for i, field in enumerate(fields): + c_fields.push_back(field.sp_field) + + result = Schema() + result.init(c_fields) + return result cdef DataType box_data_type(const shared_ptr[CDataType]& type): diff --git a/python/pyarrow/_table.pyx b/python/pyarrow/_table.pyx index ed0782bbba0a3..599e046e956b6 100644 --- a/python/pyarrow/_table.pyx +++ b/python/pyarrow/_table.pyx @@ -272,11 +272,12 @@ cdef class Column: return chunked_array -cdef CKeyValueMetadata key_value_metadata_from_dict(dict metadata): +cdef shared_ptr[const CKeyValueMetadata] key_value_metadata_from_dict( + dict metadata): cdef: unordered_map[c_string, c_string] unordered_metadata = metadata - CKeyValueMetadata c_metadata = CKeyValueMetadata(unordered_metadata) - return c_metadata + return ( + make_shared[CKeyValueMetadata](unordered_metadata)) cdef int _schema_from_arrays( diff --git a/python/pyarrow/includes/libarrow.pxd b/python/pyarrow/includes/libarrow.pxd index ef1a332bed52f..8a730b3988441 100644 --- a/python/pyarrow/includes/libarrow.pxd +++ b/python/pyarrow/includes/libarrow.pxd @@ -23,6 +23,10 @@ cdef extern from "arrow/util/key_value_metadata.h" namespace "arrow" nogil: cdef cppclass CKeyValueMetadata" arrow::KeyValueMetadata": CKeyValueMetadata() CKeyValueMetadata(const unordered_map[c_string, c_string]&) + + c_bool Equals(const CKeyValueMetadata& other) + + void Append(const c_string& key, const c_string& value) void ToUnorderedMap(unordered_map[c_string, c_string]*) const cdef extern from "arrow/api.h" namespace "arrow" nogil: @@ -168,25 +172,48 @@ cdef extern from "arrow/api.h" namespace "arrow" nogil: shared_ptr[CDataType] type() c_bool nullable() + c_string ToString() + c_bool Equals(const CField& other) + + shared_ptr[const CKeyValueMetadata] metadata() + CField(const c_string& name, const shared_ptr[CDataType]& type, c_bool nullable) + CField(const c_string& name, const shared_ptr[CDataType]& type, + c_bool nullable, const shared_ptr[CKeyValueMetadata]& metadata) + + # Removed const in Cython so don't have to cast to get code to generate + CStatus AddMetadata(const shared_ptr[CKeyValueMetadata]& metadata, + shared_ptr[CField]* out) + shared_ptr[CField] RemoveMetadata() + + cdef cppclass CStructType" arrow::StructType"(CDataType): CStructType(const vector[shared_ptr[CField]]& fields) cdef cppclass CSchema" arrow::Schema": CSchema(const vector[shared_ptr[CField]]& fields) CSchema(const vector[shared_ptr[CField]]& fields, - const CKeyValueMetadata& custom_metadata) + const shared_ptr[const CKeyValueMetadata]& metadata) + + # Does not actually exist, but gets Cython to not complain + CSchema(const vector[shared_ptr[CField]]& fields, + const shared_ptr[CKeyValueMetadata]& metadata) c_bool Equals(const CSchema& other) shared_ptr[CField] field(int i) - const CKeyValueMetadata& custom_metadata() const + shared_ptr[const CKeyValueMetadata] metadata() shared_ptr[CField] GetFieldByName(c_string& name) int num_fields() c_string ToString() + # Removed const in Cython so don't have to cast to get code to generate + CStatus AddMetadata(const shared_ptr[CKeyValueMetadata]& metadata, + shared_ptr[CSchema]* out) + shared_ptr[CSchema] RemoveMetadata() + cdef 
cppclass CBooleanArray" arrow::BooleanArray"(CArray): c_bool Value(int i) diff --git a/python/pyarrow/tests/pandas_examples.py b/python/pyarrow/tests/pandas_examples.py index e081c38713057..313a3ae9f1747 100644 --- a/python/pyarrow/tests/pandas_examples.py +++ b/python/pyarrow/tests/pandas_examples.py @@ -73,7 +73,7 @@ def dataframe_with_arrays(): ] df = pd.DataFrame(arrays) - schema = pa.Schema.from_fields(fields) + schema = pa.schema(fields) return df, schema @@ -114,6 +114,6 @@ def dataframe_with_lists(): ] df = pd.DataFrame(arrays) - schema = pa.Schema.from_fields(fields) + schema = pa.schema(fields) return df, schema diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index f3602347a78a6..2779da3320c6f 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -106,10 +106,10 @@ def test_float_no_nulls(self): for numpy_dtype, arrow_dtype in dtypes: values = np.random.randn(num_values) data[numpy_dtype] = values.astype(numpy_dtype) - fields.append(pa.Field.from_py(numpy_dtype, arrow_dtype)) + fields.append(pa.field(numpy_dtype, arrow_dtype)) df = pd.DataFrame(data) - schema = pa.Schema.from_fields(fields) + schema = pa.schema(fields) self._check_pandas_roundtrip(df, expected_schema=schema) def test_float_nulls(self): @@ -127,7 +127,7 @@ def test_float_nulls(self): arr = pa.Array.from_pandas(values, null_mask) arrays.append(arr) - fields.append(pa.Field.from_py(name, arrow_dtype)) + fields.append(pa.field(name, arrow_dtype)) values[null_mask] = np.nan expected_cols.append(values) @@ -136,7 +136,7 @@ def test_float_nulls(self): columns=names) table = pa.Table.from_arrays(arrays, names) - assert table.schema.equals(pa.Schema.from_fields(fields)) + assert table.schema.equals(pa.schema(fields)) result = table.to_pandas() tm.assert_frame_equal(result, ex_frame) @@ -159,10 +159,10 @@ def test_integer_no_nulls(self): min(info.max, np.iinfo('i8').max), size=num_values) data[dtype] = values.astype(dtype) - fields.append(pa.Field.from_py(dtype, arrow_dtype)) + fields.append(pa.field(dtype, arrow_dtype)) df = pd.DataFrame(data) - schema = pa.Schema.from_fields(fields) + schema = pa.schema(fields) self._check_pandas_roundtrip(df, expected_schema=schema) def test_integer_with_nulls(self): @@ -200,8 +200,8 @@ def test_boolean_no_nulls(self): np.random.seed(0) df = pd.DataFrame({'bools': np.random.randn(num_values) > 0}) - field = pa.Field.from_py('bools', pa.bool_()) - schema = pa.Schema.from_fields([field]) + field = pa.field('bools', pa.bool_()) + schema = pa.schema([field]) self._check_pandas_roundtrip(df, expected_schema=schema) def test_boolean_nulls(self): @@ -217,8 +217,8 @@ def test_boolean_nulls(self): expected = values.astype(object) expected[mask] = None - field = pa.Field.from_py('bools', pa.bool_()) - schema = pa.Schema.from_fields([field]) + field = pa.field('bools', pa.bool_()) + schema = pa.schema([field]) ex_frame = pd.DataFrame({'bools': expected}) table = pa.Table.from_arrays([arr], ['bools']) @@ -230,16 +230,16 @@ def test_boolean_nulls(self): def test_boolean_object_nulls(self): arr = np.array([False, None, True] * 100, dtype=object) df = pd.DataFrame({'bools': arr}) - field = pa.Field.from_py('bools', pa.bool_()) - schema = pa.Schema.from_fields([field]) + field = pa.field('bools', pa.bool_()) + schema = pa.schema([field]) self._check_pandas_roundtrip(df, expected_schema=schema) def test_unicode(self): repeats = 1000 values = [u'foo', None, u'bar', u'mañana', np.nan] df = 
pd.DataFrame({'strings': values * repeats}) - field = pa.Field.from_py('strings', pa.string()) - schema = pa.Schema.from_fields([field]) + field = pa.field('strings', pa.string()) + schema = pa.schema([field]) self._check_pandas_roundtrip(df, expected_schema=schema) @@ -257,7 +257,7 @@ def test_bytes_to_binary(self): def test_fixed_size_bytes(self): values = [b'foo', None, b'bar', None, None, b'hey'] df = pd.DataFrame({'strings': values}) - schema = pa.Schema.from_fields([pa.field('strings', pa.binary(3))]) + schema = pa.schema([pa.field('strings', pa.binary(3))]) table = pa.Table.from_pandas(df, schema=schema) assert table.schema[0].type == schema[0].type assert table.schema[0].name == schema[0].name @@ -267,7 +267,7 @@ def test_fixed_size_bytes(self): def test_fixed_size_bytes_does_not_accept_varying_lengths(self): values = [b'foo', None, b'ba', None, None, b'hey'] df = pd.DataFrame({'strings': values}) - schema = pa.Schema.from_fields([pa.field('strings', pa.binary(3))]) + schema = pa.schema([pa.field('strings', pa.binary(3))]) with self.assertRaises(pa.ArrowInvalid): pa.Table.from_pandas(df, schema=schema) @@ -279,8 +279,8 @@ def test_timestamps_notimezone_no_nulls(self): '2010-08-13T05:46:57.437'], dtype='datetime64[ms]') }) - field = pa.Field.from_py('datetime64', pa.timestamp('ms')) - schema = pa.Schema.from_fields([field]) + field = pa.field('datetime64', pa.timestamp('ms')) + schema = pa.schema([field]) self._check_pandas_roundtrip(df, timestamps_to_ms=True, expected_schema=schema) @@ -291,8 +291,8 @@ def test_timestamps_notimezone_no_nulls(self): '2010-08-13T05:46:57.437699912'], dtype='datetime64[ns]') }) - field = pa.Field.from_py('datetime64', pa.timestamp('ns')) - schema = pa.Schema.from_fields([field]) + field = pa.field('datetime64', pa.timestamp('ns')) + schema = pa.schema([field]) self._check_pandas_roundtrip(df, timestamps_to_ms=False, expected_schema=schema) @@ -304,8 +304,8 @@ def test_timestamps_notimezone_nulls(self): '2010-08-13T05:46:57.437'], dtype='datetime64[ms]') }) - field = pa.Field.from_py('datetime64', pa.timestamp('ms')) - schema = pa.Schema.from_fields([field]) + field = pa.field('datetime64', pa.timestamp('ms')) + schema = pa.schema([field]) self._check_pandas_roundtrip(df, timestamps_to_ms=True, expected_schema=schema) @@ -316,8 +316,8 @@ def test_timestamps_notimezone_nulls(self): '2010-08-13T05:46:57.437699912'], dtype='datetime64[ns]') }) - field = pa.Field.from_py('datetime64', pa.timestamp('ns')) - schema = pa.Schema.from_fields([field]) + field = pa.field('datetime64', pa.timestamp('ns')) + schema = pa.schema([field]) self._check_pandas_roundtrip(df, timestamps_to_ms=False, expected_schema=schema) @@ -353,8 +353,8 @@ def test_date_infer(self): datetime.date(1970, 1, 1), datetime.date(2040, 2, 26)]}) table = pa.Table.from_pandas(df) - field = pa.Field.from_py('date', pa.date32()) - schema = pa.Schema.from_fields([field]) + field = pa.field('date', pa.date32()) + schema = pa.schema([field]) assert table.schema.equals(schema) result = table.to_pandas() expected = df.copy() @@ -526,8 +526,8 @@ def test_decimal_32_from_pandas(self): ] }) converted = pa.Table.from_pandas(expected) - field = pa.Field.from_py('decimals', pa.decimal(7, 3)) - schema = pa.Schema.from_fields([field]) + field = pa.field('decimals', pa.decimal(7, 3)) + schema = pa.schema([field]) assert converted.schema.equals(schema) def test_decimal_32_to_pandas(self): @@ -549,8 +549,8 @@ def test_decimal_64_from_pandas(self): ] }) converted = pa.Table.from_pandas(expected) - field = 
pa.Field.from_py('decimals', pa.decimal(12, 6)) - schema = pa.Schema.from_fields([field]) + field = pa.field('decimals', pa.decimal(12, 6)) + schema = pa.schema([field]) assert converted.schema.equals(schema) def test_decimal_64_to_pandas(self): @@ -572,8 +572,8 @@ def test_decimal_128_from_pandas(self): ] }) converted = pa.Table.from_pandas(expected) - field = pa.Field.from_py('decimals', pa.decimal(26, 11)) - schema = pa.Schema.from_fields([field]) + field = pa.field('decimals', pa.decimal(26, 11)) + schema = pa.schema([field]) assert converted.schema.equals(schema) def test_decimal_128_to_pandas(self): diff --git a/python/pyarrow/tests/test_schema.py b/python/pyarrow/tests/test_schema.py index da704f378873b..b3abc0f04f418 100644 --- a/python/pyarrow/tests/test_schema.py +++ b/python/pyarrow/tests/test_schema.py @@ -118,7 +118,7 @@ def test_field(): assert f.name == 'foo' assert f.nullable assert f.type is t - assert repr(f) == "Field('foo', type=string)" + assert repr(f) == "pyarrow.Field" f = pa.field('foo', t, False) assert not f.nullable @@ -152,6 +152,52 @@ def test_field_empty(): repr(f) +def test_field_add_remove_metadata(): + f0 = pa.field('foo', pa.int32()) + + assert f0.metadata is None + + metadata = {b'foo': b'bar', b'pandas': b'badger'} + + f1 = f0.add_metadata(metadata) + assert f1.metadata == metadata + + f3 = f1.remove_metadata() + assert f3.metadata is None + + # idempotent + f4 = f3.remove_metadata() + assert f4.metadata is None + + f5 = pa.field('foo', pa.int32(), True, metadata) + f6 = f0.add_metadata(metadata) + assert f5.equals(f6) + + +def test_schema_add_remove_metadata(): + fields = [ + pa.field('foo', pa.int32()), + pa.field('bar', pa.string()), + pa.field('baz', pa.list_(pa.int8())) + ] + + s1 = pa.schema(fields) + + assert s1.metadata is None + + metadata = {b'foo': b'bar', b'pandas': b'badger'} + + s2 = s1.add_metadata(metadata) + assert s2.metadata == metadata + + s3 = s2.remove_metadata() + assert s3.metadata is None + + # idempotent + s4 = s3.remove_metadata() + assert s4.metadata is None + + def test_schema_equals(): fields = [ pa.field('foo', pa.int32()), From 909f826b55973e93f4656c43a84c8e740a86601f Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 27 Apr 2017 11:19:02 -0400 Subject: [PATCH 0575/1644] ARROW-867: [Python] pyarrow MSVC fixes Author: Wes McKinney Closes #575 from wesm/ARROW-867 and squashes the following commits: 0483cfb [Wes McKinney] Do not encode file paths to utf-16le on Windows. Fix date/time conversion, platform ints. Add release/acquire methods to PyAcquireGIL lock object. 
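One pattern worth noting before the diffstat: the PyAcquireGIL change keeps the RAII destructor but adds explicit, idempotent acquire() and release() methods, so a scope can drop the GIL around blocking work while the destructor stays correct whether or not the lock is still held. Below is a minimal C++ sketch of the same pattern over a plain std::mutex; ScopedLock is illustrative only and not part of this patch.

#include <mutex>

// Idempotent scoped lock: release() may be called early to drop the
// lock around long-running work, acquire() re-takes it, and the
// destructor only unlocks if the lock is still held.
class ScopedLock {
 public:
  explicit ScopedLock(std::mutex& m) : mutex_(m) { acquire(); }
  ~ScopedLock() { release(); }

  void acquire() {
    if (!held_) {
      mutex_.lock();
      held_ = true;
    }
  }

  void release() {
    if (held_) {
      mutex_.unlock();
      held_ = false;
    }
  }

 private:
  std::mutex& mutex_;
  bool held_ = false;
};

This is the same shape pandas_convert.cc uses below, releasing the GIL before ConvertArrayToPandas and re-acquiring it afterwards.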
Remove a couple unneeded GIL acquisitions --- ci/msvc-build.bat | 5 ++ cpp/src/arrow/python/builtin_convert.cc | 16 +--- cpp/src/arrow/python/common.h | 22 +++++- cpp/src/arrow/python/helpers.cc | 2 - cpp/src/arrow/python/init.cc | 12 +-- cpp/src/arrow/python/init.h | 12 +-- cpp/src/arrow/python/numpy_interop.h | 1 + cpp/src/arrow/python/pandas_convert.cc | 17 +++-- cpp/src/arrow/python/util/datetime.h | 23 ++++++ cpp/src/arrow/python/util/test_main.cc | 2 +- python/pyarrow/_array.pyx | 6 +- python/pyarrow/_config.pyx | 6 +- python/pyarrow/_table.pyx | 2 +- python/pyarrow/compat.py | 11 +-- python/pyarrow/includes/pyarrow.pxd | 6 +- python/pyarrow/tests/test_convert_pandas.py | 4 +- python/pyarrow/tests/test_feather.py | 5 ++ python/pyarrow/tests/test_io.py | 83 ++++++++++----------- python/pyarrow/tests/test_schema.py | 1 - python/pyarrow/tests/test_tensor.py | 32 ++++---- 20 files changed, 140 insertions(+), 128 deletions(-) diff --git a/ci/msvc-build.bat b/ci/msvc-build.bat index de428b6e46e14..08c5033849539 100644 --- a/ci/msvc-build.bat +++ b/ci/msvc-build.bat @@ -43,6 +43,8 @@ set PYTHONPATH=%CONDA_ENV%\Lib;%CONDA_ENV%\Lib\site-packages;%CONDA_ENV%\python3 ctest -VV || exit /B +set PYTHONPATH= + @rem Build and import pyarrow set PATH=%ARROW_HOME%\bin;%PATH% @@ -50,3 +52,6 @@ set PATH=%ARROW_HOME%\bin;%PATH% cd ..\..\python python setup.py build_ext --inplace || exit /B python -c "import pyarrow" || exit /B + +@rem TODO: re-enable when last tests are fixed +@rem py.test pyarrow -v -s || exit /B diff --git a/cpp/src/arrow/python/builtin_convert.cc b/cpp/src/arrow/python/builtin_convert.cc index 137937c0946df..3197c2ade4bae 100644 --- a/cpp/src/arrow/python/builtin_convert.cc +++ b/cpp/src/arrow/python/builtin_convert.cc @@ -358,22 +358,8 @@ class TimestampConverter : public TypedConverter { } else { PyDateTime_DateTime* pydatetime = reinterpret_cast(item.obj()); - struct tm datetime = {0}; - datetime.tm_year = PyDateTime_GET_YEAR(pydatetime) - 1900; - datetime.tm_mon = PyDateTime_GET_MONTH(pydatetime) - 1; - datetime.tm_mday = PyDateTime_GET_DAY(pydatetime); - datetime.tm_hour = PyDateTime_DATE_GET_HOUR(pydatetime); - datetime.tm_min = PyDateTime_DATE_GET_MINUTE(pydatetime); - datetime.tm_sec = PyDateTime_DATE_GET_SECOND(pydatetime); - int us = PyDateTime_DATE_GET_MICROSECOND(pydatetime); + typed_builder_->Append(PyDateTime_to_us(pydatetime)); RETURN_IF_PYERROR(); - struct tm epoch = {0}; - epoch.tm_year = 70; - epoch.tm_mday = 1; - // Microseconds since the epoch - int64_t val = static_cast( - lrint(difftime(mktime(&datetime), mktime(&epoch))) * 1000000 + us); - typed_builder_->Append(val); } } return Status::OK(); diff --git a/cpp/src/arrow/python/common.h b/cpp/src/arrow/python/common.h index 882bb156224c0..0211823c8c8fe 100644 --- a/cpp/src/arrow/python/common.h +++ b/cpp/src/arrow/python/common.h @@ -34,11 +34,29 @@ namespace py { class ARROW_EXPORT PyAcquireGIL { public: - PyAcquireGIL() { state_ = PyGILState_Ensure(); } + PyAcquireGIL() : acquired_gil_(false) { + acquire(); + } + + ~PyAcquireGIL() { release(); } - ~PyAcquireGIL() { PyGILState_Release(state_); } + void acquire() { + if (!acquired_gil_) { + state_ = PyGILState_Ensure(); + acquired_gil_ = true; + } + } + + // idempotent + void release() { + if (acquired_gil_) { + PyGILState_Release(state_); + acquired_gil_ = false; + } + } private: + bool acquired_gil_; PyGILState_STATE state_; DISALLOW_COPY_AND_ASSIGN(PyAcquireGIL); }; diff --git a/cpp/src/arrow/python/helpers.cc b/cpp/src/arrow/python/helpers.cc index 
f7c73a87fbf16..e5d1d388cb54c 100644 --- a/cpp/src/arrow/python/helpers.cc +++ b/cpp/src/arrow/python/helpers.cc @@ -55,7 +55,6 @@ std::shared_ptr GetPrimitiveType(Type::type type) { } Status ImportModule(const std::string& module_name, OwnedRef* ref) { - PyAcquireGIL lock; PyObject* module = PyImport_ImportModule(module_name.c_str()); RETURN_IF_PYERROR(); ref->reset(module); @@ -66,7 +65,6 @@ Status ImportFromModule(const OwnedRef& module, const std::string& name, OwnedRe /// Assumes that ImportModule was called first DCHECK_NE(module.obj(), nullptr) << "Cannot import from nullptr Python module"; - PyAcquireGIL lock; PyObject* attr = PyObject_GetAttrString(module.obj(), name.c_str()); RETURN_IF_PYERROR(); ref->reset(attr); diff --git a/cpp/src/arrow/python/init.cc b/cpp/src/arrow/python/init.cc index fa70af7e44db3..db648915465a8 100644 --- a/cpp/src/arrow/python/init.cc +++ b/cpp/src/arrow/python/init.cc @@ -15,20 +15,12 @@ // specific language governing permissions and limitations // under the License. -#include "arrow/python/platform.h" - // Trigger the array import (inversion of NO_IMPORT_ARRAY) #define NUMPY_IMPORT_ARRAY #include "arrow/python/init.h" #include "arrow/python/numpy_interop.h" -namespace arrow { -namespace py { - -void InitNumPy() { - import_numpy(); +int arrow_init_numpy() { + return arrow::py::import_numpy(); } - -} // namespace py -} // namespace arrow diff --git a/cpp/src/arrow/python/init.h b/cpp/src/arrow/python/init.h index a2533d8059273..1daa5a3d2624d 100644 --- a/cpp/src/arrow/python/init.h +++ b/cpp/src/arrow/python/init.h @@ -19,17 +19,11 @@ #define ARROW_PYTHON_INIT_H #include "arrow/python/platform.h" - -#include "arrow/python/numpy_interop.h" #include "arrow/util/visibility.h" -namespace arrow { -namespace py { - +extern "C" { ARROW_EXPORT -void InitNumPy(); - -} // namespace py -} // namespace arrow +int arrow_init_numpy(); +} #endif // ARROW_PYTHON_INIT_H diff --git a/cpp/src/arrow/python/numpy_interop.h b/cpp/src/arrow/python/numpy_interop.h index b93200cc8972d..023fdc8249c0c 100644 --- a/cpp/src/arrow/python/numpy_interop.h +++ b/cpp/src/arrow/python/numpy_interop.h @@ -47,6 +47,7 @@ namespace py { inline int import_numpy() { #ifdef NUMPY_IMPORT_ARRAY + std::cout << "Importing NumPy" << std::endl; import_array1(-1); import_umath1(-1); #endif diff --git a/cpp/src/arrow/python/pandas_convert.cc b/cpp/src/arrow/python/pandas_convert.cc index 9f65af41bb294..b54197e5145b0 100644 --- a/cpp/src/arrow/python/pandas_convert.cc +++ b/cpp/src/arrow/python/pandas_convert.cc @@ -763,11 +763,10 @@ Status PandasConverter::ConvertObjects() { Ndarray1DIndexer objects; - { - PyAcquireGIL lock; - objects.Init(arr_); - PyDateTime_IMPORT; - } + PyAcquireGIL lock; + objects.Init(arr_); + PyDateTime_IMPORT; + lock.release(); // This means we received an explicit type from the user if (type_) { @@ -792,6 +791,9 @@ Status PandasConverter::ConvertObjects() { return Status::TypeError("No known conversion to Arrow type"); } } else { + // Re-acquire GIL + lock.acquire(); + OwnedRef decimal; OwnedRef Decimal; RETURN_NOT_OK(ImportModule("decimal", &decimal)); @@ -2196,7 +2198,12 @@ class ArrowDeserializer { RETURN_IF_PYERROR(); PyObject* dictionary; + + // Release GIL before calling ConvertArrayToPandas, will be reacquired + // there if needed + lock.release(); RETURN_NOT_OK(ConvertArrayToPandas(dict_type->dictionary(), nullptr, &dictionary)); + lock.acquire(); PyDict_SetItemString(result_, "indices", block->block_arr()); PyDict_SetItemString(result_, "dictionary", dictionary); diff 
--git a/cpp/src/arrow/python/util/datetime.h b/cpp/src/arrow/python/util/datetime.h index ad0ee0f5056da..7ebf46a92fd5c 100644 --- a/cpp/src/arrow/python/util/datetime.h +++ b/cpp/src/arrow/python/util/datetime.h @@ -42,6 +42,29 @@ static inline int64_t PyDate_to_ms(PyDateTime_Date* pydate) { #endif } +static inline int64_t PyDateTime_to_us(PyDateTime_DateTime* pydatetime) { + struct tm datetime = {0}; + datetime.tm_year = PyDateTime_GET_YEAR(pydatetime) - 1900; + datetime.tm_mon = PyDateTime_GET_MONTH(pydatetime) - 1; + datetime.tm_mday = PyDateTime_GET_DAY(pydatetime); + datetime.tm_hour = PyDateTime_DATE_GET_HOUR(pydatetime); + datetime.tm_min = PyDateTime_DATE_GET_MINUTE(pydatetime); + datetime.tm_sec = PyDateTime_DATE_GET_SECOND(pydatetime); + int us = PyDateTime_DATE_GET_MICROSECOND(pydatetime); + struct tm epoch = {0}; + epoch.tm_year = 70; + epoch.tm_mday = 1; +#ifdef _MSC_VER + // Microseconds since the epoch + const int64_t current_timestamp = static_cast(_mkgmtime64(&datetime)); + const int64_t epoch_timestamp = static_cast(_mkgmtime64(&epoch)); + return (current_timestamp - epoch_timestamp) * 1000000L + us; +#else + return static_cast( + lrint(difftime(mktime(&datetime), mktime(&epoch))) * 1000000 + us); +#endif +} + static inline int32_t PyDate_to_days(PyDateTime_Date* pydate) { return static_cast(PyDate_to_ms(pydate) / 86400000LL); } diff --git a/cpp/src/arrow/python/util/test_main.cc b/cpp/src/arrow/python/util/test_main.cc index c24da40aadcf6..efb44754b2b3a 100644 --- a/cpp/src/arrow/python/util/test_main.cc +++ b/cpp/src/arrow/python/util/test_main.cc @@ -25,7 +25,7 @@ int main(int argc, char** argv) { ::testing::InitGoogleTest(&argc, argv); Py_Initialize(); - arrow::py::InitNumPy(); + arrow_init_numpy(); int ret = RUN_ALL_TESTS(); diff --git a/python/pyarrow/_array.pyx b/python/pyarrow/_array.pyx index 658f4b314b3a2..f01cff6cc99f3 100644 --- a/python/pyarrow/_array.pyx +++ b/python/pyarrow/_array.pyx @@ -1288,8 +1288,7 @@ cdef class Array: with nogil: check_status( - pyarrow.ConvertArrayToPandas(self.sp_array, self, - &out)) + pyarrow.ConvertArrayToPandas(self.sp_array, self, &out)) return wrap_array_output(out) def to_pylist(self): @@ -1326,8 +1325,7 @@ strides: {2}""".format(self.type, self.shape, self.strides) cdef: PyObject* out - check_status(pyarrow.TensorToNdarray(deref(self.tp), self, - &out)) + check_status(pyarrow.TensorToNdarray(deref(self.tp), self, &out)) return PyObject_to_object(out) def equals(self, Tensor other): diff --git a/python/pyarrow/_config.pyx b/python/pyarrow/_config.pyx index 2c1e6bf3143aa..e5fdbef8af5f6 100644 --- a/python/pyarrow/_config.pyx +++ b/python/pyarrow/_config.pyx @@ -14,13 +14,13 @@ # distutils: language = c++ # cython: embedsignature = True -cdef extern from 'arrow/python/init.h' namespace 'arrow::py': - void InitNumPy() +cdef extern from 'arrow/python/init.h': + int arrow_init_numpy() except -1 cdef extern from 'arrow/python/config.h' namespace 'arrow::py': void set_numpy_nan(object o) -InitNumPy() +arrow_init_numpy() import numpy as np set_numpy_nan(np.nan) diff --git a/python/pyarrow/_table.pyx b/python/pyarrow/_table.pyx index 599e046e956b6..223fe27ea9819 100644 --- a/python/pyarrow/_table.pyx +++ b/python/pyarrow/_table.pyx @@ -164,7 +164,7 @@ cdef class Column: PyObject* out check_status(pyarrow.ConvertColumnToPandas(self.sp_column, - self, &out)) + self, &out)) return _pandas().Series(wrap_array_output(out), name=self.name) diff --git a/python/pyarrow/compat.py b/python/pyarrow/compat.py index 8d15c4c1e3fb5..928a2c0724298 
100644 --- a/python/pyarrow/compat.py +++ b/python/pyarrow/compat.py @@ -139,14 +139,9 @@ def encode_file_path(path): import os # Windows requires utf-16le encoding for unicode file names if isinstance(path, unicode_type): - if os.name == 'nt': - # try: - # encoded_path = path.encode('ascii') - # except: - encoded_path = path.encode('utf-16le') - else: - # POSIX systems can handle utf-8 - encoded_path = path.encode('utf-8') + # POSIX systems can handle utf-8. UTF8 is converted to utf16-le in + # libarrow + encoded_path = path.encode('utf-8') else: encoded_path = path diff --git a/python/pyarrow/includes/pyarrow.pxd b/python/pyarrow/includes/pyarrow.pxd index c40df3db8a9c5..35c71107f8db1 100644 --- a/python/pyarrow/includes/pyarrow.pxd +++ b/python/pyarrow/includes/pyarrow.pxd @@ -47,14 +47,14 @@ cdef extern from "arrow/python/api.h" namespace "arrow::py" nogil: CStatus NdarrayToTensor(CMemoryPool* pool, object ao, shared_ptr[CTensor]* out); - CStatus TensorToNdarray(const CTensor& tensor, PyObject* base, + CStatus TensorToNdarray(const CTensor& tensor, object base, PyObject** out) CStatus ConvertArrayToPandas(const shared_ptr[CArray]& arr, - PyObject* py_ref, PyObject** out) + object py_ref, PyObject** out) CStatus ConvertColumnToPandas(const shared_ptr[CColumn]& arr, - PyObject* py_ref, PyObject** out) + object py_ref, PyObject** out) CStatus ConvertTableToPandas(const shared_ptr[CTable]& table, int nthreads, PyObject** out) diff --git a/python/pyarrow/tests/test_convert_pandas.py b/python/pyarrow/tests/test_convert_pandas.py index 2779da3320c6f..9b9b7519fd994 100644 --- a/python/pyarrow/tests/test_convert_pandas.py +++ b/python/pyarrow/tests/test_convert_pandas.py @@ -155,8 +155,8 @@ def test_integer_no_nulls(self): for dtype, arrow_dtype in numpy_dtypes: info = np.iinfo(dtype) - values = np.random.randint(info.min, - min(info.max, np.iinfo('i8').max), + values = np.random.randint(max(info.min, np.iinfo(np.int_).min), + min(info.max, np.iinfo(np.int_).max), size=num_values) data[dtype] = values.astype(dtype) fields.append(pa.field(dtype, arrow_dtype)) diff --git a/python/pyarrow/tests/test_feather.py b/python/pyarrow/tests/test_feather.py index ef73a8feeb65c..7a8abf486f4b5 100644 --- a/python/pyarrow/tests/test_feather.py +++ b/python/pyarrow/tests/test_feather.py @@ -13,6 +13,7 @@ # limitations under the License. 
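For background on the compat.py change above, whose new comment asserts that "UTF8 is converted to utf16-le in libarrow" when a file name reaches Windows APIs, here is a minimal sketch of such a widening step using C++11 standard facilities. The Utf8PathToUtf16 helper is illustrative only and is not the code libarrow uses:

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Widen a UTF-8 encoded path to UTF-16 (what Windows wide-char file
// APIs consume). std::wstring_convert is C++11-era, matching this
// patch's vintage, although it was later deprecated in C++17.
std::u16string Utf8PathToUtf16(const std::string& utf8_path) {
  std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
  return converter.from_bytes(utf8_path);
}
```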
import os +import sys import unittest import pytest @@ -251,6 +252,9 @@ def test_boolean_object_nulls(self): self._check_pandas_roundtrip(df, null_counts=[1 * repeats]) def test_delete_partial_file_on_error(self): + if sys.platform == 'win32': + pytest.skip('Windows hangs on to file handle for some reason') + # strings will fail df = pd.DataFrame( { @@ -361,6 +365,7 @@ def test_read_columns(self): def test_overwritten_file(self): path = random_path() + self.test_files.append(path) num_values = 100 np.random.seed(0) diff --git a/python/pyarrow/tests/test_io.py b/python/pyarrow/tests/test_io.py index a14898ff2ffd0..610dedc6a7640 100644 --- a/python/pyarrow/tests/test_io.py +++ b/python/pyarrow/tests/test_io.py @@ -230,12 +230,13 @@ def test_nativefile_write_memoryview(): @pytest.fixture -def sample_disk_data(request): +def sample_disk_data(request, tmpdir): SIZE = 4096 arr = np.random.randint(0, 256, size=SIZE).astype('u1') data = arr.tobytes()[:SIZE] - path = guid() + path = os.path.join(str(tmpdir), guid()) + with open(path, 'wb') as f: f.write(data) @@ -298,68 +299,62 @@ def _try_delete(path): pass -def test_memory_map_writer(): +def test_memory_map_writer(tmpdir): SIZE = 4096 arr = np.random.randint(0, 256, size=SIZE).astype('u1') data = arr.tobytes()[:SIZE] - path = guid() - try: - with open(path, 'wb') as f: - f.write(data) + path = os.path.join(str(tmpdir), guid()) + with open(path, 'wb') as f: + f.write(data) - f = pa.memory_map(path, mode='r+w') + f = pa.memory_map(path, mode='r+w') - f.seek(10) - f.write('peekaboo') - assert f.tell() == 18 + f.seek(10) + f.write('peekaboo') + assert f.tell() == 18 - f.seek(10) - assert f.read(8) == b'peekaboo' + f.seek(10) + assert f.read(8) == b'peekaboo' - f2 = pa.memory_map(path, mode='r+w') + f2 = pa.memory_map(path, mode='r+w') - f2.seek(10) - f2.write(b'booapeak') - f2.seek(10) + f2.seek(10) + f2.write(b'booapeak') + f2.seek(10) - f.seek(10) - assert f.read(8) == b'booapeak' + f.seek(10) + assert f.read(8) == b'booapeak' - # Does not truncate file - f3 = pa.memory_map(path, mode='w') - f3.write('foo') + # Does not truncate file + f3 = pa.memory_map(path, mode='w') + f3.write('foo') - with pa.memory_map(path) as f4: - assert f4.size() == SIZE + with pa.memory_map(path) as f4: + assert f4.size() == SIZE - with pytest.raises(IOError): - f3.read(5) + with pytest.raises(IOError): + f3.read(5) - f.seek(0) - assert f.read(3) == b'foo' - finally: - _try_delete(path) + f.seek(0) + assert f.read(3) == b'foo' -def test_os_file_writer(): +def test_os_file_writer(tmpdir): SIZE = 4096 arr = np.random.randint(0, 256, size=SIZE).astype('u1') data = arr.tobytes()[:SIZE] - path = guid() - try: - with open(path, 'wb') as f: - f.write(data) + path = os.path.join(str(tmpdir), guid()) + with open(path, 'wb') as f: + f.write(data) - # Truncates file - f2 = pa.OSFile(path, mode='w') - f2.write('foo') + # Truncates file + f2 = pa.OSFile(path, mode='w') + f2.write('foo') - with pa.OSFile(path) as f3: - assert f3.size() == 3 + with pa.OSFile(path) as f3: + assert f3.size() == 3 - with pytest.raises(IOError): - f2.read(5) - finally: - _try_delete(path) + with pytest.raises(IOError): + f2.read(5) diff --git a/python/pyarrow/tests/test_schema.py b/python/pyarrow/tests/test_schema.py index b3abc0f04f418..2d98865b56e73 100644 --- a/python/pyarrow/tests/test_schema.py +++ b/python/pyarrow/tests/test_schema.py @@ -206,7 +206,6 @@ def test_schema_equals(): ] sch1 = pa.schema(fields) - print(dir(sch1)) sch2 = pa.schema(fields) assert sch1.equals(sch2) diff --git 
a/python/pyarrow/tests/test_tensor.py b/python/pyarrow/tests/test_tensor.py index ec71735b2a540..b0924e3504ff7 100644 --- a/python/pyarrow/tests/test_tensor.py +++ b/python/pyarrow/tests/test_tensor.py @@ -77,41 +77,37 @@ def test_tensor_numpy_roundtrip(dtype_str, arrow_type): def _try_delete(path): + import gc + gc.collect() try: os.remove(path) except os.error: pass -def test_tensor_ipc_roundtrip(): +def test_tensor_ipc_roundtrip(tmpdir): data = np.random.randn(10, 4) tensor = pa.Tensor.from_numpy(data) - path = 'pyarrow-tensor-ipc-roundtrip' - try: - mmap = pa.create_memory_map(path, 1024) + path = os.path.join(str(tmpdir), 'pyarrow-tensor-ipc-roundtrip') + mmap = pa.create_memory_map(path, 1024) - pa.write_tensor(tensor, mmap) + pa.write_tensor(tensor, mmap) - mmap.seek(0) - result = pa.read_tensor(mmap) + mmap.seek(0) + result = pa.read_tensor(mmap) - assert result.equals(tensor) - finally: - _try_delete(path) + assert result.equals(tensor) -def test_tensor_ipc_strided(): +def test_tensor_ipc_strided(tmpdir): data = np.random.randn(10, 4) tensor = pa.Tensor.from_numpy(data[::2]) - path = 'pyarrow-tensor-ipc-strided' - try: - with pytest.raises(ValueError): - mmap = pa.create_memory_map(path, 1024) - pa.write_tensor(tensor, mmap) - finally: - _try_delete(path) + path = os.path.join(str(tmpdir), 'pyarrow-tensor-ipc-strided') + with pytest.raises(ValueError): + mmap = pa.create_memory_map(path, 1024) + pa.write_tensor(tensor, mmap) def test_tensor_size(): From 81be9c6679466177d4b8e5dbca55f81185bb3ec6 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 27 Apr 2017 18:10:24 +0200 Subject: [PATCH 0576/1644] ARROW-866: [Python] Be robust to PyErr_Fetch returning a null exc value cc @BryanCutler. This was a tricky one. I am not sure how to reproduce with our current code -- I reverted the patch from ARROW-822 to get a reproduction so I could fix this. 
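For illustration, the guard this patch adds (see CheckPyError in the diff below) amounts to the following sketch; the FetchErrorMessage wrapper is hypothetical, and Python 3 string handling is assumed:

```cpp
#include <Python.h>
#include <string>

// Sketch: PyErr_Occurred() being non-null does not guarantee that
// PyErr_Fetch() hands back a non-null exc_value, so the value must be
// null-checked before any attempt to format it.
std::string FetchErrorMessage() {
  PyObject *exc_type, *exc_value, *traceback;
  PyErr_Fetch(&exc_type, &exc_value, &traceback);
  std::string message = "Error message was null";  // fallback
  if (exc_value != nullptr) {
    PyObject* str = PyObject_Str(exc_value);
    if (str != nullptr) {
      const char* utf8 = PyUnicode_AsUTF8(str);  // Python 3 API
      if (utf8 != nullptr) {
        message = utf8;
      }
      Py_DECREF(str);
    }
  }
  Py_XDECREF(exc_type);
  Py_XDECREF(exc_value);
  Py_XDECREF(traceback);
  PyErr_Clear();
  return message;
}
```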
Now, the error raised is: ``` /home/wesm/code/arrow/python/pyarrow/_error.pyx in pyarrow._error.check_status (/home/wesm/code/arrow/python/build/temp.linux-x86_64-2.7/_error.cxx:1324)() 58 raise ArrowInvalid(message) 59 elif status.IsIOError(): ---> 60 raise ArrowIOError(message) 61 elif status.IsOutOfMemory(): 62 raise ArrowMemoryError(message) ArrowIOError: IOError: Error message was null ``` I'm not sure why calling `tell` on the socket object results in a bad exception state, but in any case it seems that the result of `PyErr_Fetch` cannot be relied upon to be non-null even when `PyErr_Occurred()` returns non-null Author: Wes McKinney Closes #606 from wesm/ARROW-866 and squashes the following commits: fa395cd [Wes McKinney] Enable other kinds of Status errors to be returned 0bd11c2 [Wes McKinney] Consolidate error handling code a bit 9d59dd2 [Wes McKinney] Be robust to PyErr_Fetch returning a null exc value --- cpp/src/arrow/python/common.cc | 22 ++++++++++++++++++++++ cpp/src/arrow/python/common.h | 29 ++++++++++++++--------------- cpp/src/arrow/python/io.cc | 29 +++++++---------------------- cpp/src/arrow/status.h | 3 +++ 4 files changed, 46 insertions(+), 37 deletions(-) diff --git a/cpp/src/arrow/python/common.cc b/cpp/src/arrow/python/common.cc index 717cb5c5cc122..bedd458c783f4 100644 --- a/cpp/src/arrow/python/common.cc +++ b/cpp/src/arrow/python/common.cc @@ -64,5 +64,27 @@ PyBuffer::~PyBuffer() { Py_XDECREF(obj_); } +Status CheckPyError(StatusCode code) { + if (PyErr_Occurred()) { + PyObject *exc_type, *exc_value, *traceback; + PyErr_Fetch(&exc_type, &exc_value, &traceback); + PyObjectStringify stringified(exc_value); + Py_XDECREF(exc_type); + Py_XDECREF(exc_value); + Py_XDECREF(traceback); + PyErr_Clear(); + + // ARROW-866: in some esoteric cases, formatting exc_value can fail. This + // was encountered when calling tell() on a socket file + if (stringified.bytes != nullptr) { + std::string message(stringified.bytes); + return Status(code, message); + } else { + return Status(code, "Error message was null"); + } + } + return Status::OK(); +} + } // namespace py } // namespace arrow diff --git a/cpp/src/arrow/python/common.h b/cpp/src/arrow/python/common.h index 0211823c8c8fe..c5745a53f70dc 100644 --- a/cpp/src/arrow/python/common.h +++ b/cpp/src/arrow/python/common.h @@ -98,27 +98,26 @@ struct ARROW_EXPORT PyObjectStringify { if (PyUnicode_Check(obj)) { bytes_obj = PyUnicode_AsUTF8String(obj); tmp_obj.reset(bytes_obj); + bytes = PyBytes_AsString(bytes_obj); + size = PyBytes_GET_SIZE(bytes_obj); + } else if (PyBytes_Check(obj)) { + bytes = PyBytes_AsString(obj); + size = PyBytes_GET_SIZE(obj); } else { - bytes_obj = obj; + bytes = nullptr; + size = -1; } - bytes = PyBytes_AsString(bytes_obj); - size = PyBytes_GET_SIZE(bytes_obj); } }; +Status CheckPyError(StatusCode code = StatusCode::UnknownError); + // TODO(wesm): We can just let errors pass through. 
To be explored later -#define RETURN_IF_PYERROR() \ - if (PyErr_Occurred()) { \ - PyObject *exc_type, *exc_value, *traceback; \ - PyErr_Fetch(&exc_type, &exc_value, &traceback); \ - PyObjectStringify stringified(exc_value); \ - std::string message(stringified.bytes); \ - Py_DECREF(exc_type); \ - Py_XDECREF(exc_value); \ - Py_XDECREF(traceback); \ - PyErr_Clear(); \ - return Status::UnknownError(message); \ - } +#define RETURN_IF_PYERROR() \ + RETURN_NOT_OK(CheckPyError()); + +#define PY_RETURN_IF_ERROR(CODE) \ + RETURN_NOT_OK(CheckPyError(CODE)); // Return the common PyArrow memory pool ARROW_EXPORT void set_default_memory_pool(MemoryPool* pool); diff --git a/cpp/src/arrow/python/io.cc b/cpp/src/arrow/python/io.cc index 327e8fe9ff781..a7193854c4d01 100644 --- a/cpp/src/arrow/python/io.cc +++ b/cpp/src/arrow/python/io.cc @@ -41,21 +41,6 @@ PythonFile::~PythonFile() { Py_DECREF(file_); } -static Status CheckPyError() { - if (PyErr_Occurred()) { - PyObject *exc_type, *exc_value, *traceback; - PyErr_Fetch(&exc_type, &exc_value, &traceback); - PyObjectStringify stringified(exc_value); - std::string message(stringified.bytes); - Py_XDECREF(exc_type); - Py_XDECREF(exc_value); - Py_XDECREF(traceback); - PyErr_Clear(); - return Status::IOError(message); - } - return Status::OK(); -} - // This is annoying: because C++11 does not allow implicit conversion of string // literals to non-const char*, we need to go through some gymnastics to use // PyObject_CallMethod without a lot of pain (its arguments are non-const @@ -71,7 +56,7 @@ Status PythonFile::Close() { // whence: 0 for relative to start of file, 2 for end of file PyObject* result = cpp_PyObject_CallMethod(file_, "close", "()"); Py_XDECREF(result); - ARROW_RETURN_NOT_OK(CheckPyError()); + PY_RETURN_IF_ERROR(StatusCode::IOError); return Status::OK(); } @@ -79,13 +64,13 @@ Status PythonFile::Seek(int64_t position, int whence) { // whence: 0 for relative to start of file, 2 for end of file PyObject* result = cpp_PyObject_CallMethod(file_, "seek", "(ii)", position, whence); Py_XDECREF(result); - ARROW_RETURN_NOT_OK(CheckPyError()); + PY_RETURN_IF_ERROR(StatusCode::IOError); return Status::OK(); } Status PythonFile::Read(int64_t nbytes, PyObject** out) { PyObject* result = cpp_PyObject_CallMethod(file_, "read", "(i)", nbytes); - ARROW_RETURN_NOT_OK(CheckPyError()); + PY_RETURN_IF_ERROR(StatusCode::IOError); *out = result; return Status::OK(); } @@ -93,24 +78,24 @@ Status PythonFile::Read(int64_t nbytes, PyObject** out) { Status PythonFile::Write(const uint8_t* data, int64_t nbytes) { PyObject* py_data = PyBytes_FromStringAndSize(reinterpret_cast(data), nbytes); - ARROW_RETURN_NOT_OK(CheckPyError()); + PY_RETURN_IF_ERROR(StatusCode::IOError); PyObject* result = cpp_PyObject_CallMethod(file_, "write", "(O)", py_data); Py_XDECREF(py_data); Py_XDECREF(result); - ARROW_RETURN_NOT_OK(CheckPyError()); + PY_RETURN_IF_ERROR(StatusCode::IOError); return Status::OK(); } Status PythonFile::Tell(int64_t* position) { PyObject* result = cpp_PyObject_CallMethod(file_, "tell", "()"); - ARROW_RETURN_NOT_OK(CheckPyError()); + PY_RETURN_IF_ERROR(StatusCode::IOError); *position = PyLong_AsLongLong(result); Py_DECREF(result); // PyLong_AsLongLong can raise OverflowError - ARROW_RETURN_NOT_OK(CheckPyError()); + PY_RETURN_IF_ERROR(StatusCode::IOError); return Status::OK(); } diff --git a/cpp/src/arrow/status.h b/cpp/src/arrow/status.h index dd65b753fef31..6a8cee27cb756 100644 --- a/cpp/src/arrow/status.h +++ b/cpp/src/arrow/status.h @@ -91,6 +91,9 @@ class ARROW_EXPORT 
Status { Status() : state_(NULL) {} ~Status() { delete[] state_; } + Status(StatusCode code, const std::string& msg) + : Status(code, msg, -1) {} + // Copy the specified status. Status(const Status& s); void operator=(const Status& s); From 03dce9dcab1df587f2293decf49708f872aaad3d Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Thu, 27 Apr 2017 18:11:44 +0200 Subject: [PATCH 0577/1644] ARROW-900: [Python] Fix UnboundLocalError in ParquetDatasetPiece.read Author: Wes McKinney Closes #607 from wesm/ARROW-900 and squashes the following commits: 81f8394 [Wes McKinney] Fix UnboundLocalError in ParquetDatasetPiece.read --- python/pyarrow/parquet.py | 3 +++ python/pyarrow/tests/test_parquet.py | 14 ++++++++++++++ 2 files changed, 17 insertions(+) diff --git a/python/pyarrow/parquet.py b/python/pyarrow/parquet.py index 94ad227fbefa9..21359f137f24e 100644 --- a/python/pyarrow/parquet.py +++ b/python/pyarrow/parquet.py @@ -208,6 +208,9 @@ def read(self, columns=None, nthreads=1, partitions=None, reader = self._open(open_file_func) elif file is not None: reader = ParquetFile(file) + else: + # try to read the local path + reader = ParquetFile(self.path) if self.row_group is not None: table = reader.read_row_group(self.row_group, columns=columns, diff --git a/python/pyarrow/tests/test_parquet.py b/python/pyarrow/tests/test_parquet.py index 8c446af03fc16..bb3a9ed5f4a25 100644 --- a/python/pyarrow/tests/test_parquet.py +++ b/python/pyarrow/tests/test_parquet.py @@ -492,6 +492,20 @@ def test_read_single_row_group(): tm.assert_frame_equal(df[cols], result.to_pandas()) +@parquet +def test_parquet_piece_read(tmpdir): + df = _test_dataframe(1000) + table = pa.Table.from_pandas(df) + + path = tmpdir.join('parquet_piece_read.parquet').strpath + pq.write_table(table, path, version='2.0') + + piece1 = pq.ParquetDatasetPiece(path) + + result = piece1.read() + assert result.equals(table) + + @parquet def test_parquet_piece_basics(): path = '/baz.parq' From 14bec24c584dc6fa05b84b6ed00d7474d62fd1d7 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Thu, 27 Apr 2017 18:13:47 +0200 Subject: [PATCH 0578/1644] ARROW-908: [GLib] Unify OutputStream files Author: Kouhei Sutou Closes #609 from kou/glib-unify-output-stream and squashes the following commits: f62f869 [Kouhei Sutou] [GLib] Unify OutputStream files --- c_glib/arrow-glib/Makefile.am | 3 - c_glib/arrow-glib/arrow-glib.h | 1 - c_glib/arrow-glib/arrow-glib.hpp | 1 - c_glib/arrow-glib/file-output-stream.cpp | 231 ---------------------- c_glib/arrow-glib/file-output-stream.h | 72 ------- c_glib/arrow-glib/file-output-stream.hpp | 28 --- c_glib/arrow-glib/output-stream.cpp | 201 ++++++++++++++++++- c_glib/arrow-glib/output-stream.h | 52 +++++ c_glib/arrow-glib/output-stream.hpp | 5 +- c_glib/doc/reference/arrow-glib-docs.sgml | 3 +- 10 files changed, 256 insertions(+), 341 deletions(-) delete mode 100644 c_glib/arrow-glib/file-output-stream.cpp delete mode 100644 c_glib/arrow-glib/file-output-stream.h delete mode 100644 c_glib/arrow-glib/file-output-stream.hpp diff --git a/c_glib/arrow-glib/Makefile.am b/c_glib/arrow-glib/Makefile.am index bbc11011474bc..54fb7f8c7a799 100644 --- a/c_glib/arrow-glib/Makefile.am +++ b/c_glib/arrow-glib/Makefile.am @@ -59,7 +59,6 @@ libarrow_glib_la_headers = \ libarrow_glib_la_headers += \ file.h \ file-mode.h \ - file-output-stream.h \ input-stream.h \ memory-mapped-file.h \ output-stream.h \ @@ -102,7 +101,6 @@ libarrow_glib_la_sources = \ libarrow_glib_la_sources += \ file.cpp \ file-mode.cpp \ - file-output-stream.cpp \ 
input-stream.cpp \ memory-mapped-file.cpp \ output-stream.cpp \ @@ -137,7 +135,6 @@ libarrow_glib_la_cpp_headers = \ libarrow_glib_la_cpp_headers += \ file.hpp \ file-mode.hpp \ - file-output-stream.hpp \ input-stream.hpp \ memory-mapped-file.hpp \ output-stream.hpp \ diff --git a/c_glib/arrow-glib/arrow-glib.h b/c_glib/arrow-glib/arrow-glib.h index efff5710308a8..e88b66b6ae9b2 100644 --- a/c_glib/arrow-glib/arrow-glib.h +++ b/c_glib/arrow-glib/arrow-glib.h @@ -35,7 +35,6 @@ #include #include -#include #include #include #include diff --git a/c_glib/arrow-glib/arrow-glib.hpp b/c_glib/arrow-glib/arrow-glib.hpp index d6ef370095bdf..339773f651de1 100644 --- a/c_glib/arrow-glib/arrow-glib.hpp +++ b/c_glib/arrow-glib/arrow-glib.hpp @@ -40,7 +40,6 @@ #include #include -#include #include #include #include diff --git a/c_glib/arrow-glib/file-output-stream.cpp b/c_glib/arrow-glib/file-output-stream.cpp deleted file mode 100644 index e1e1e27a06193..0000000000000 --- a/c_glib/arrow-glib/file-output-stream.cpp +++ /dev/null @@ -1,231 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include - -#include -#include -#include -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: file-output-stream - * @short_description: A file output stream. - * - * The #GArrowFileOutputStream is a class for file output stream. 
- */ - -typedef struct GArrowFileOutputStreamPrivate_ { - std::shared_ptr file_output_stream; -} GArrowFileOutputStreamPrivate; - -enum { - PROP_0, - PROP_FILE_OUTPUT_STREAM -}; - -static std::shared_ptr -garrow_file_output_stream_get_raw_file_interface(GArrowFile *file) -{ - auto file_output_stream = GARROW_FILE_OUTPUT_STREAM(file); - auto arrow_file_output_stream = - garrow_file_output_stream_get_raw(file_output_stream); - return arrow_file_output_stream; -} - -static void -garrow_file_interface_init(GArrowFileInterface *iface) -{ - iface->get_raw = garrow_file_output_stream_get_raw_file_interface; -} - -static std::shared_ptr -garrow_file_output_stream_get_raw_writeable_interface(GArrowWriteable *writeable) -{ - auto file_output_stream = GARROW_FILE_OUTPUT_STREAM(writeable); - auto arrow_file_output_stream = - garrow_file_output_stream_get_raw(file_output_stream); - return arrow_file_output_stream; -} - -static void -garrow_writeable_interface_init(GArrowWriteableInterface *iface) -{ - iface->get_raw = garrow_file_output_stream_get_raw_writeable_interface; -} - -static std::shared_ptr -garrow_file_output_stream_get_raw_output_stream_interface(GArrowOutputStream *output_stream) -{ - auto file_output_stream = GARROW_FILE_OUTPUT_STREAM(output_stream); - auto arrow_file_output_stream = - garrow_file_output_stream_get_raw(file_output_stream); - return arrow_file_output_stream; -} - -static void -garrow_output_stream_interface_init(GArrowOutputStreamInterface *iface) -{ - iface->get_raw = garrow_file_output_stream_get_raw_output_stream_interface; -} - -G_DEFINE_TYPE_WITH_CODE(GArrowFileOutputStream, - garrow_file_output_stream, - G_TYPE_OBJECT, - G_ADD_PRIVATE(GArrowFileOutputStream) - G_IMPLEMENT_INTERFACE(GARROW_TYPE_FILE, - garrow_file_interface_init) - G_IMPLEMENT_INTERFACE(GARROW_TYPE_WRITEABLE, - garrow_writeable_interface_init) - G_IMPLEMENT_INTERFACE(GARROW_TYPE_OUTPUT_STREAM, - garrow_output_stream_interface_init)); - -#define GARROW_FILE_OUTPUT_STREAM_GET_PRIVATE(obj) \ - (G_TYPE_INSTANCE_GET_PRIVATE((obj), \ - GARROW_TYPE_FILE_OUTPUT_STREAM, \ - GArrowFileOutputStreamPrivate)) - -static void -garrow_file_output_stream_finalize(GObject *object) -{ - GArrowFileOutputStreamPrivate *priv; - - priv = GARROW_FILE_OUTPUT_STREAM_GET_PRIVATE(object); - - priv->file_output_stream = nullptr; - - G_OBJECT_CLASS(garrow_file_output_stream_parent_class)->finalize(object); -} - -static void -garrow_file_output_stream_set_property(GObject *object, - guint prop_id, - const GValue *value, - GParamSpec *pspec) -{ - GArrowFileOutputStreamPrivate *priv; - - priv = GARROW_FILE_OUTPUT_STREAM_GET_PRIVATE(object); - - switch (prop_id) { - case PROP_FILE_OUTPUT_STREAM: - priv->file_output_stream = - *static_cast *>(g_value_get_pointer(value)); - break; - default: - G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); - break; - } -} - -static void -garrow_file_output_stream_get_property(GObject *object, - guint prop_id, - GValue *value, - GParamSpec *pspec) -{ - switch (prop_id) { - default: - G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); - break; - } -} - -static void -garrow_file_output_stream_init(GArrowFileOutputStream *object) -{ -} - -static void -garrow_file_output_stream_class_init(GArrowFileOutputStreamClass *klass) -{ - GObjectClass *gobject_class; - GParamSpec *spec; - - gobject_class = G_OBJECT_CLASS(klass); - - gobject_class->finalize = garrow_file_output_stream_finalize; - gobject_class->set_property = garrow_file_output_stream_set_property; - gobject_class->get_property = 
garrow_file_output_stream_get_property; - - spec = g_param_spec_pointer("file-output-stream", - "io::FileOutputStream", - "The raw std::shared *", - static_cast(G_PARAM_WRITABLE | - G_PARAM_CONSTRUCT_ONLY)); - g_object_class_install_property(gobject_class, PROP_FILE_OUTPUT_STREAM, spec); -} - -/** - * garrow_file_output_stream_open: - * @path: The path of the file output stream. - * @append: Whether the path is opened as append mode or recreate mode. - * @error: (nullable): Return location for a #GError or %NULL. - * - * Returns: (nullable) (transfer full): A newly opened - * #GArrowFileOutputStream or %NULL on error. - */ -GArrowFileOutputStream * -garrow_file_output_stream_open(const gchar *path, - gboolean append, - GError **error) -{ - std::shared_ptr arrow_file_output_stream; - auto status = - arrow::io::FileOutputStream::Open(std::string(path), - append, - &arrow_file_output_stream); - if (status.ok()) { - return garrow_file_output_stream_new_raw(&arrow_file_output_stream); - } else { - std::string context("[io][file-output-stream][open]: <"); - context += path; - context += ">"; - garrow_error_check(error, status, context.c_str()); - return NULL; - } -} - -G_END_DECLS - -GArrowFileOutputStream * -garrow_file_output_stream_new_raw(std::shared_ptr *arrow_file_output_stream) -{ - auto file_output_stream = - GARROW_FILE_OUTPUT_STREAM(g_object_new(GARROW_TYPE_FILE_OUTPUT_STREAM, - "file-output-stream", arrow_file_output_stream, - NULL)); - return file_output_stream; -} - -std::shared_ptr -garrow_file_output_stream_get_raw(GArrowFileOutputStream *file_output_stream) -{ - GArrowFileOutputStreamPrivate *priv; - - priv = GARROW_FILE_OUTPUT_STREAM_GET_PRIVATE(file_output_stream); - return priv->file_output_stream; -} diff --git a/c_glib/arrow-glib/file-output-stream.h b/c_glib/arrow-glib/file-output-stream.h deleted file mode 100644 index bef3700039921..0000000000000 --- a/c_glib/arrow-glib/file-output-stream.h +++ /dev/null @@ -1,72 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. 
- */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_FILE_OUTPUT_STREAM \ - (garrow_file_output_stream_get_type()) -#define GARROW_FILE_OUTPUT_STREAM(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_FILE_OUTPUT_STREAM, \ - GArrowFileOutputStream)) -#define GARROW_FILE_OUTPUT_STREAM_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_FILE_OUTPUT_STREAM, \ - GArrowFileOutputStreamClass)) -#define GARROW_IS_FILE_OUTPUT_STREAM(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_FILE_OUTPUT_STREAM)) -#define GARROW_IS_FILE_OUTPUT_STREAM_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_FILE_OUTPUT_STREAM)) -#define GARROW_FILE_OUTPUT_STREAM_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_FILE_OUTPUT_STREAM, \ - GArrowFileOutputStreamClass)) - -typedef struct _GArrowFileOutputStream GArrowFileOutputStream; -typedef struct _GArrowFileOutputStreamClass GArrowFileOutputStreamClass; - -/** - * GArrowFileOutputStream: - * - * It wraps `arrow::io::FileOutputStream`. - */ -struct _GArrowFileOutputStream -{ - /*< private >*/ - GObject parent_instance; -}; - -struct _GArrowFileOutputStreamClass -{ - GObjectClass parent_class; -}; - -GType garrow_file_output_stream_get_type(void) G_GNUC_CONST; - -GArrowFileOutputStream *garrow_file_output_stream_open(const gchar *path, - gboolean append, - GError **error); - -G_END_DECLS diff --git a/c_glib/arrow-glib/file-output-stream.hpp b/c_glib/arrow-glib/file-output-stream.hpp deleted file mode 100644 index 0b10418cdf176..0000000000000 --- a/c_glib/arrow-glib/file-output-stream.hpp +++ /dev/null @@ -1,28 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include -#include - -#include - -GArrowFileOutputStream *garrow_file_output_stream_new_raw(std::shared_ptr *arrow_file_output_stream); -std::shared_ptr garrow_file_output_stream_get_raw(GArrowFileOutputStream *file_output_stream); diff --git a/c_glib/arrow-glib/output-stream.cpp b/c_glib/arrow-glib/output-stream.cpp index bbc29b794f7c6..037814c1ffeb4 100644 --- a/c_glib/arrow-glib/output-stream.cpp +++ b/c_glib/arrow-glib/output-stream.cpp @@ -24,17 +24,22 @@ #include #include +#include #include +#include G_BEGIN_DECLS /** * SECTION: output-stream - * @title: GArrowOutputStream - * @short_description: Stream output interface + * @section_id: output-stream-classes + * @title: Output stream classes + * @include: arrow-glib/arrow-glib.h * * #GArrowOutputStream is an interface for stream output. Stream * output is file based and writeable + * + * #GArrowFileOutputStream is a class for file output stream. 
*/ G_DEFINE_INTERFACE(GArrowOutputStream, @@ -46,6 +51,178 @@ garrow_output_stream_default_init (GArrowOutputStreamInterface *iface) { } + +typedef struct GArrowFileOutputStreamPrivate_ { + std::shared_ptr file_output_stream; +} GArrowFileOutputStreamPrivate; + +enum { + PROP_0, + PROP_FILE_OUTPUT_STREAM +}; + +static std::shared_ptr +garrow_file_output_stream_get_raw_file_interface(GArrowFile *file) +{ + auto file_output_stream = GARROW_FILE_OUTPUT_STREAM(file); + auto arrow_file_output_stream = + garrow_file_output_stream_get_raw(file_output_stream); + return arrow_file_output_stream; +} + +static void +garrow_file_interface_init(GArrowFileInterface *iface) +{ + iface->get_raw = garrow_file_output_stream_get_raw_file_interface; +} + +static std::shared_ptr +garrow_file_output_stream_get_raw_writeable_interface(GArrowWriteable *writeable) +{ + auto file_output_stream = GARROW_FILE_OUTPUT_STREAM(writeable); + auto arrow_file_output_stream = + garrow_file_output_stream_get_raw(file_output_stream); + return arrow_file_output_stream; +} + +static void +garrow_writeable_interface_init(GArrowWriteableInterface *iface) +{ + iface->get_raw = garrow_file_output_stream_get_raw_writeable_interface; +} + +static std::shared_ptr +garrow_file_output_stream_get_raw_output_stream_interface(GArrowOutputStream *output_stream) +{ + auto file_output_stream = GARROW_FILE_OUTPUT_STREAM(output_stream); + auto arrow_file_output_stream = + garrow_file_output_stream_get_raw(file_output_stream); + return arrow_file_output_stream; +} + +static void +garrow_output_stream_interface_init(GArrowOutputStreamInterface *iface) +{ + iface->get_raw = garrow_file_output_stream_get_raw_output_stream_interface; +} + +G_DEFINE_TYPE_WITH_CODE(GArrowFileOutputStream, + garrow_file_output_stream, + G_TYPE_OBJECT, + G_ADD_PRIVATE(GArrowFileOutputStream) + G_IMPLEMENT_INTERFACE(GARROW_TYPE_FILE, + garrow_file_interface_init) + G_IMPLEMENT_INTERFACE(GARROW_TYPE_WRITEABLE, + garrow_writeable_interface_init) + G_IMPLEMENT_INTERFACE(GARROW_TYPE_OUTPUT_STREAM, + garrow_output_stream_interface_init)); + +#define GARROW_FILE_OUTPUT_STREAM_GET_PRIVATE(obj) \ + (G_TYPE_INSTANCE_GET_PRIVATE((obj), \ + GARROW_TYPE_FILE_OUTPUT_STREAM, \ + GArrowFileOutputStreamPrivate)) + +static void +garrow_file_output_stream_finalize(GObject *object) +{ + GArrowFileOutputStreamPrivate *priv; + + priv = GARROW_FILE_OUTPUT_STREAM_GET_PRIVATE(object); + + priv->file_output_stream = nullptr; + + G_OBJECT_CLASS(garrow_file_output_stream_parent_class)->finalize(object); +} + +static void +garrow_file_output_stream_set_property(GObject *object, + guint prop_id, + const GValue *value, + GParamSpec *pspec) +{ + GArrowFileOutputStreamPrivate *priv; + + priv = GARROW_FILE_OUTPUT_STREAM_GET_PRIVATE(object); + + switch (prop_id) { + case PROP_FILE_OUTPUT_STREAM: + priv->file_output_stream = + *static_cast *>(g_value_get_pointer(value)); + break; + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_file_output_stream_get_property(GObject *object, + guint prop_id, + GValue *value, + GParamSpec *pspec) +{ + switch (prop_id) { + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_file_output_stream_init(GArrowFileOutputStream *object) +{ +} + +static void +garrow_file_output_stream_class_init(GArrowFileOutputStreamClass *klass) +{ + GObjectClass *gobject_class; + GParamSpec *spec; + + gobject_class = G_OBJECT_CLASS(klass); + + gobject_class->finalize = 
garrow_file_output_stream_finalize; + gobject_class->set_property = garrow_file_output_stream_set_property; + gobject_class->get_property = garrow_file_output_stream_get_property; + + spec = g_param_spec_pointer("file-output-stream", + "io::FileOutputStream", + "The raw std::shared *", + static_cast(G_PARAM_WRITABLE | + G_PARAM_CONSTRUCT_ONLY)); + g_object_class_install_property(gobject_class, PROP_FILE_OUTPUT_STREAM, spec); +} + +/** + * garrow_file_output_stream_open: + * @path: The path of the file output stream. + * @append: Whether the path is opened as append mode or recreate mode. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: (nullable) (transfer full): A newly opened + * #GArrowFileOutputStream or %NULL on error. + */ +GArrowFileOutputStream * +garrow_file_output_stream_open(const gchar *path, + gboolean append, + GError **error) +{ + std::shared_ptr arrow_file_output_stream; + auto status = + arrow::io::FileOutputStream::Open(std::string(path), + append, + &arrow_file_output_stream); + if (status.ok()) { + return garrow_file_output_stream_new_raw(&arrow_file_output_stream); + } else { + std::string context("[io][file-output-stream][open]: <"); + context += path; + context += ">"; + garrow_error_check(error, status, context.c_str()); + return NULL; + } +} + G_END_DECLS std::shared_ptr @@ -54,3 +231,23 @@ garrow_output_stream_get_raw(GArrowOutputStream *output_stream) auto *iface = GARROW_OUTPUT_STREAM_GET_IFACE(output_stream); return iface->get_raw(output_stream); } + + +GArrowFileOutputStream * +garrow_file_output_stream_new_raw(std::shared_ptr *arrow_file_output_stream) +{ + auto file_output_stream = + GARROW_FILE_OUTPUT_STREAM(g_object_new(GARROW_TYPE_FILE_OUTPUT_STREAM, + "file-output-stream", arrow_file_output_stream, + NULL)); + return file_output_stream; +} + +std::shared_ptr +garrow_file_output_stream_get_raw(GArrowFileOutputStream *file_output_stream) +{ + GArrowFileOutputStreamPrivate *priv; + + priv = GARROW_FILE_OUTPUT_STREAM_GET_PRIVATE(file_output_stream); + return priv->file_output_stream; +} diff --git a/c_glib/arrow-glib/output-stream.h b/c_glib/arrow-glib/output-stream.h index 3481072c27d8b..043832efccd78 100644 --- a/c_glib/arrow-glib/output-stream.h +++ b/c_glib/arrow-glib/output-stream.h @@ -42,4 +42,56 @@ typedef struct _GArrowOutputStreamInterface GArrowOutputStreamInterface; GType garrow_output_stream_get_type(void) G_GNUC_CONST; + +#define GARROW_TYPE_FILE_OUTPUT_STREAM \ + (garrow_file_output_stream_get_type()) +#define GARROW_FILE_OUTPUT_STREAM(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_FILE_OUTPUT_STREAM, \ + GArrowFileOutputStream)) +#define GARROW_FILE_OUTPUT_STREAM_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_FILE_OUTPUT_STREAM, \ + GArrowFileOutputStreamClass)) +#define GARROW_IS_FILE_OUTPUT_STREAM(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_FILE_OUTPUT_STREAM)) +#define GARROW_IS_FILE_OUTPUT_STREAM_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_FILE_OUTPUT_STREAM)) +#define GARROW_FILE_OUTPUT_STREAM_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_FILE_OUTPUT_STREAM, \ + GArrowFileOutputStreamClass)) + +typedef struct _GArrowFileOutputStream GArrowFileOutputStream; +#ifndef __GTK_DOC_IGNORE__ +typedef struct _GArrowFileOutputStreamClass GArrowFileOutputStreamClass; +#endif + +/** + * GArrowFileOutputStream: + * + * It wraps `arrow::io::FileOutputStream`. 
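+ *
+ * A short usage sketch of the wrapped C++ class, using the same Open()
+ * signature that garrow_file_output_stream_open() calls (illustrative
+ * only, with an assumed example path):
+ *
+ *     std::shared_ptr<arrow::io::FileOutputStream> stream;
+ *     arrow::io::FileOutputStream::Open("/tmp/example.dat", false, &stream);
+ *     // second argument: append (true) vs. recreate (false)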
+ */ +struct _GArrowFileOutputStream +{ + /*< private >*/ + GObject parent_instance; +}; + +#ifndef __GTK_DOC_IGNORE__ +struct _GArrowFileOutputStreamClass +{ + GObjectClass parent_class; +}; +#endif + +GType garrow_file_output_stream_get_type(void) G_GNUC_CONST; + +GArrowFileOutputStream *garrow_file_output_stream_open(const gchar *path, + gboolean append, + GError **error); + + G_END_DECLS diff --git a/c_glib/arrow-glib/output-stream.hpp b/c_glib/arrow-glib/output-stream.hpp index 635da10e24766..e8e73216c499b 100644 --- a/c_glib/arrow-glib/output-stream.hpp +++ b/c_glib/arrow-glib/output-stream.hpp @@ -19,7 +19,7 @@ #pragma once -#include +#include #include @@ -36,3 +36,6 @@ struct _GArrowOutputStreamInterface }; std::shared_ptr garrow_output_stream_get_raw(GArrowOutputStream *output_stream); + +GArrowFileOutputStream *garrow_file_output_stream_new_raw(std::shared_ptr *arrow_file_output_stream); +std::shared_ptr garrow_file_output_stream_get_raw(GArrowFileOutputStream *file_output_stream); diff --git a/c_glib/doc/reference/arrow-glib-docs.sgml b/c_glib/doc/reference/arrow-glib-docs.sgml index 75e4a0a37286f..7ba37b45068e0 100644 --- a/c_glib/doc/reference/arrow-glib-docs.sgml +++ b/c_glib/doc/reference/arrow-glib-docs.sgml @@ -87,9 +87,8 @@ Output - - + Input and output From f13a9286c1444391c04fc0d20909a672122d10c1 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Thu, 27 Apr 2017 21:37:12 -0400 Subject: [PATCH 0579/1644] ARROW-907: C++: Construct Table from schema and arrays Author: Uwe L. Korn Closes #610 from xhochy/ARROW-907 and squashes the following commits: b8ee8dc [Uwe L. Korn] Fix signed comparison 25518b3 [Uwe L. Korn] ARROW-907: C++: Construct Table from schema and arrays --- cpp/src/arrow/table-test.cc | 5 +++++ cpp/src/arrow/table.cc | 21 +++++++++++++++++++++ cpp/src/arrow/table.h | 3 +++ 3 files changed, 29 insertions(+) diff --git a/cpp/src/arrow/table-test.cc b/cpp/src/arrow/table-test.cc index 0da4c0f9641a3..e46fdc77cf761 100644 --- a/cpp/src/arrow/table-test.cc +++ b/cpp/src/arrow/table-test.cc @@ -233,6 +233,11 @@ TEST_F(TestTable, Ctors) { table_.reset(new Table(schema_, columns_, length)); ASSERT_OK(table_->ValidateColumns()); ASSERT_EQ(length, table_->num_rows()); + + ASSERT_OK(MakeTable(schema_, arrays_, &table_)); + ASSERT_OK(table_->ValidateColumns()); + ASSERT_EQ(length, table_->num_rows()); + ASSERT_EQ(3, table_->num_columns()); } TEST_F(TestTable, Metadata) { diff --git a/cpp/src/arrow/table.cc b/cpp/src/arrow/table.cc index db17da72a6a33..c110ec16a0494 100644 --- a/cpp/src/arrow/table.cc +++ b/cpp/src/arrow/table.cc @@ -366,4 +366,25 @@ Status Table::ValidateColumns() const { return Status::OK(); } +Status ARROW_EXPORT MakeTable(const std::shared_ptr& schema, + const std::vector>& arrays, std::shared_ptr* table) { + // Make sure the length of the schema corresponds to the length of the vector + if (schema->num_fields() != static_cast(arrays.size())) { + std::stringstream ss; + ss << "Schema and Array vector have different lengths: " << schema->num_fields() + << " != " << arrays.size(); + return Status::Invalid(ss.str()); + } + + std::vector> columns; + columns.reserve(schema->num_fields()); + for (int i = 0; i < schema->num_fields(); i++) { + columns.emplace_back(std::make_shared(schema->field(i), arrays[i])); + } + + *table = std::make_shared
<Table>(schema, columns);
+
+  return Status::OK();
+}
+
 }  // namespace arrow
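The new MakeTable pairs a schema with already-converted arrays and wraps each array in a Column. A short usage sketch (the MakeExampleTable helper is hypothetical; factory functions as in the 2017-era C++ API):

```cpp
#include "arrow/api.h"

// Sketch: assemble a single-column Table through the new MakeTable().
// MakeTable itself validates that the schema and array counts match.
arrow::Status MakeExampleTable(const std::shared_ptr<arrow::Array>& values,
                               std::shared_ptr<arrow::Table>* out) {
  auto f = arrow::field("values", values->type());
  auto schema = std::make_shared<arrow::Schema>(
      std::vector<std::shared_ptr<arrow::Field>>{f});
  return arrow::MakeTable(schema, {values}, out);
}
```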
diff --git a/cpp/src/arrow/table.h b/cpp/src/arrow/table.h
index efc2077bd009a..67710a8216010 100644
--- a/cpp/src/arrow/table.h
+++ b/cpp/src/arrow/table.h
@@ -208,6 +208,9 @@ class ARROW_EXPORT Table {
 Status ARROW_EXPORT ConcatenateTables(
     const std::vector<std::shared_ptr<Table>>& tables, std::shared_ptr<Table>* table);

+Status ARROW_EXPORT MakeTable(const std::shared_ptr<Schema>& schema,
+    const std::vector<std::shared_ptr<Array>>& arrays, std::shared_ptr<Table>
    * table); + } // namespace arrow #endif // ARROW_TABLE_H From f7ab7270bb07466dabf84c015a6db2a192eb3dad Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Thu, 27 Apr 2017 21:38:23 -0400 Subject: [PATCH 0580/1644] ARROW-896: Support Jupyter Notebook in Web site We can embed a Jupyter Notebook (`getting_started.ipynb`) in the same directory by the following code: ```markdown {::nomarkdown} {% jupyter_notebook getting_started.ipynb %} {:/nomarkdown} ``` Author: Kouhei Sutou Closes #608 from kou/site-support-jupyter-notebook and squashes the following commits: 12186b1 [Kouhei Sutou] Support Jupyter Notebook in Web site --- site/Gemfile | 1 + site/_config.yml | 1 + 2 files changed, 2 insertions(+) diff --git a/site/Gemfile b/site/Gemfile index 98decaf35dbe6..e6691a0857140 100644 --- a/site/Gemfile +++ b/site/Gemfile @@ -21,5 +21,6 @@ gem 'jekyll-bootstrap-sass' gem 'github-pages' group :jekyll_plugins do gem "jekyll-feed", "~> 0.6" + gem "jekyll-jupyter-notebook" end gem 'tzinfo-data', platforms: [:mingw, :mswin, :x64_mingw, :jruby] diff --git a/site/_config.yml b/site/_config.yml index 922af4a08059c..346565e6d5cca 100644 --- a/site/_config.yml +++ b/site/_config.yml @@ -38,6 +38,7 @@ baseurl: gems: - jekyll-feed - jekyll-bootstrap-sass + - jekyll-jupyter-notebook bootstrap: assets: true From 53c093b521d87794cba066032e827788c75d42fe Mon Sep 17 00:00:00 2001 From: Phillip Cloud Date: Sat, 29 Apr 2017 20:53:19 -0400 Subject: [PATCH 0581/1644] ARROW-914 [C++/Python] Fix Decimal ToBytes Author: Phillip Cloud Closes #613 from cpcloud/ARROW-914 and squashes the following commits: b0f3c10 [Phillip Cloud] Use a more appropriate name 418fc9c [Phillip Cloud] ARROW-914 [C++/Python] Fix Decimal ToBytes --- ci/msvc-build.bat | 3 +-- cpp/src/arrow/util/decimal.cc | 8 ++++---- 2 files changed, 5 insertions(+), 6 deletions(-) diff --git a/ci/msvc-build.bat b/ci/msvc-build.bat index 08c5033849539..aca1f8cc3c073 100644 --- a/ci/msvc-build.bat +++ b/ci/msvc-build.bat @@ -53,5 +53,4 @@ cd ..\..\python python setup.py build_ext --inplace || exit /B python -c "import pyarrow" || exit /B -@rem TODO: re-enable when last tests are fixed -@rem py.test pyarrow -v -s || exit /B +py.test pyarrow -v -s || exit /B diff --git a/cpp/src/arrow/util/decimal.cc b/cpp/src/arrow/util/decimal.cc index 2fe9da4aba9a2..3d9fbd31bf22a 100644 --- a/cpp/src/arrow/util/decimal.cc +++ b/cpp/src/arrow/util/decimal.cc @@ -147,7 +147,7 @@ void FromBytes(const uint8_t* bytes, Decimal64* decimal) { constexpr static const size_t BYTES_IN_128_BITS = 128 / CHAR_BIT; constexpr static const size_t LIMB_SIZE = sizeof(std::remove_pointer::type); -constexpr static const size_t BYTES_PER_LIMB = BYTES_IN_128_BITS / LIMB_SIZE; +constexpr static const size_t LIMBS_IN_INT128 = BYTES_IN_128_BITS / LIMB_SIZE; void FromBytes(const uint8_t* bytes, bool is_negative, Decimal128* decimal) { DCHECK_NE(bytes, nullptr); @@ -155,7 +155,7 @@ void FromBytes(const uint8_t* bytes, bool is_negative, Decimal128* decimal) { auto& decimal_value(decimal->value); int128_t::backend_type& backend(decimal_value.backend()); - backend.resize(BYTES_PER_LIMB, BYTES_PER_LIMB); + backend.resize(LIMBS_IN_INT128, LIMBS_IN_INT128); std::memcpy(backend.limbs(), bytes, BYTES_IN_128_BITS); if (is_negative) { decimal->value = -decimal->value; } } @@ -177,8 +177,8 @@ void ToBytes(const Decimal128& decimal, uint8_t** bytes, bool* is_negative) { /// TODO(phillipc): boost multiprecision is unreliable here, int128_t can't be /// roundtripped const auto& backend(decimal.value.backend()); - auto 
boost_bytes = reinterpret_cast(backend.limbs()); - std::memcpy(*bytes, boost_bytes, BYTES_IN_128_BITS); + const size_t bytes_in_use = LIMB_SIZE * backend.size(); + std::memcpy(*bytes, backend.limbs(), bytes_in_use); *is_negative = backend.isneg(); } From ed5a1d4f9ae13c0474418d2b0534cbacdca57ef8 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Sun, 30 Apr 2017 13:08:13 -0400 Subject: [PATCH 0582/1644] ARROW-916: [GLib] Add GArrowBufferOutputStream Author: Kouhei Sutou Closes #616 from kou/glib-buffer-output-stream and squashes the following commits: 75c89cb [Kouhei Sutou] [GLib] Add GArrowBufferOutputStream --- c_glib/arrow-glib/output-stream.cpp | 204 ++++++++++++++----------- c_glib/arrow-glib/output-stream.h | 99 ++++++++++-- c_glib/arrow-glib/output-stream.hpp | 17 +-- c_glib/test/test-buffer-output-file.rb | 26 ++++ 4 files changed, 237 insertions(+), 109 deletions(-) create mode 100644 c_glib/test/test-buffer-output-file.rb diff --git a/c_glib/arrow-glib/output-stream.cpp b/c_glib/arrow-glib/output-stream.cpp index 037814c1ffeb4..b757d44cef44e 100644 --- a/c_glib/arrow-glib/output-stream.cpp +++ b/c_glib/arrow-glib/output-stream.cpp @@ -22,7 +22,9 @@ #endif #include +#include +#include #include #include #include @@ -40,114 +42,87 @@ G_BEGIN_DECLS * output is file based and writeable * * #GArrowFileOutputStream is a class for file output stream. + * + * #GArrowBufferOutputStream is a class for buffer output stream. */ -G_DEFINE_INTERFACE(GArrowOutputStream, - garrow_output_stream, - G_TYPE_OBJECT) - -static void -garrow_output_stream_default_init (GArrowOutputStreamInterface *iface) -{ -} - - -typedef struct GArrowFileOutputStreamPrivate_ { - std::shared_ptr file_output_stream; -} GArrowFileOutputStreamPrivate; +typedef struct GArrowOutputStreamPrivate_ { + std::shared_ptr output_stream; +} GArrowOutputStreamPrivate; enum { PROP_0, - PROP_FILE_OUTPUT_STREAM + PROP_OUTPUT_STREAM }; static std::shared_ptr -garrow_file_output_stream_get_raw_file_interface(GArrowFile *file) +garrow_output_stream_get_raw_file_interface(GArrowFile *file) { - auto file_output_stream = GARROW_FILE_OUTPUT_STREAM(file); - auto arrow_file_output_stream = - garrow_file_output_stream_get_raw(file_output_stream); - return arrow_file_output_stream; + auto output_stream = GARROW_OUTPUT_STREAM(file); + auto arrow_output_stream = garrow_output_stream_get_raw(output_stream); + return arrow_output_stream; } static void -garrow_file_interface_init(GArrowFileInterface *iface) +garrow_output_stream_file_interface_init(GArrowFileInterface *iface) { - iface->get_raw = garrow_file_output_stream_get_raw_file_interface; + iface->get_raw = garrow_output_stream_get_raw_file_interface; } static std::shared_ptr -garrow_file_output_stream_get_raw_writeable_interface(GArrowWriteable *writeable) +garrow_output_stream_get_raw_writeable_interface(GArrowWriteable *writeable) { - auto file_output_stream = GARROW_FILE_OUTPUT_STREAM(writeable); - auto arrow_file_output_stream = - garrow_file_output_stream_get_raw(file_output_stream); - return arrow_file_output_stream; + auto output_stream = GARROW_OUTPUT_STREAM(writeable); + auto arrow_output_stream = garrow_output_stream_get_raw(output_stream); + return arrow_output_stream; } static void -garrow_writeable_interface_init(GArrowWriteableInterface *iface) -{ - iface->get_raw = garrow_file_output_stream_get_raw_writeable_interface; -} - -static std::shared_ptr -garrow_file_output_stream_get_raw_output_stream_interface(GArrowOutputStream *output_stream) 
+garrow_output_stream_writeable_interface_init(GArrowWriteableInterface *iface) { - auto file_output_stream = GARROW_FILE_OUTPUT_STREAM(output_stream); - auto arrow_file_output_stream = - garrow_file_output_stream_get_raw(file_output_stream); - return arrow_file_output_stream; + iface->get_raw = garrow_output_stream_get_raw_writeable_interface; } -static void -garrow_output_stream_interface_init(GArrowOutputStreamInterface *iface) -{ - iface->get_raw = garrow_file_output_stream_get_raw_output_stream_interface; -} - -G_DEFINE_TYPE_WITH_CODE(GArrowFileOutputStream, - garrow_file_output_stream, +G_DEFINE_TYPE_WITH_CODE(GArrowOutputStream, + garrow_output_stream, G_TYPE_OBJECT, - G_ADD_PRIVATE(GArrowFileOutputStream) + G_ADD_PRIVATE(GArrowOutputStream) G_IMPLEMENT_INTERFACE(GARROW_TYPE_FILE, - garrow_file_interface_init) + garrow_output_stream_file_interface_init) G_IMPLEMENT_INTERFACE(GARROW_TYPE_WRITEABLE, - garrow_writeable_interface_init) - G_IMPLEMENT_INTERFACE(GARROW_TYPE_OUTPUT_STREAM, - garrow_output_stream_interface_init)); + garrow_output_stream_writeable_interface_init)); -#define GARROW_FILE_OUTPUT_STREAM_GET_PRIVATE(obj) \ +#define GARROW_OUTPUT_STREAM_GET_PRIVATE(obj) \ (G_TYPE_INSTANCE_GET_PRIVATE((obj), \ - GARROW_TYPE_FILE_OUTPUT_STREAM, \ - GArrowFileOutputStreamPrivate)) + GARROW_TYPE_OUTPUT_STREAM, \ + GArrowOutputStreamPrivate)) static void -garrow_file_output_stream_finalize(GObject *object) +garrow_output_stream_finalize(GObject *object) { - GArrowFileOutputStreamPrivate *priv; + GArrowOutputStreamPrivate *priv; - priv = GARROW_FILE_OUTPUT_STREAM_GET_PRIVATE(object); + priv = GARROW_OUTPUT_STREAM_GET_PRIVATE(object); - priv->file_output_stream = nullptr; + priv->output_stream = nullptr; - G_OBJECT_CLASS(garrow_file_output_stream_parent_class)->finalize(object); + G_OBJECT_CLASS(garrow_output_stream_parent_class)->finalize(object); } static void -garrow_file_output_stream_set_property(GObject *object, +garrow_output_stream_set_property(GObject *object, guint prop_id, const GValue *value, GParamSpec *pspec) { - GArrowFileOutputStreamPrivate *priv; + GArrowOutputStreamPrivate *priv; - priv = GARROW_FILE_OUTPUT_STREAM_GET_PRIVATE(object); + priv = GARROW_OUTPUT_STREAM_GET_PRIVATE(object); switch (prop_id) { - case PROP_FILE_OUTPUT_STREAM: - priv->file_output_stream = - *static_cast *>(g_value_get_pointer(value)); + case PROP_OUTPUT_STREAM: + priv->output_stream = + *static_cast *>(g_value_get_pointer(value)); break; default: G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); @@ -156,7 +131,7 @@ garrow_file_output_stream_set_property(GObject *object, } static void -garrow_file_output_stream_get_property(GObject *object, +garrow_output_stream_get_property(GObject *object, guint prop_id, GValue *value, GParamSpec *pspec) @@ -169,28 +144,43 @@ garrow_file_output_stream_get_property(GObject *object, } static void -garrow_file_output_stream_init(GArrowFileOutputStream *object) +garrow_output_stream_init(GArrowOutputStream *object) { } static void -garrow_file_output_stream_class_init(GArrowFileOutputStreamClass *klass) +garrow_output_stream_class_init(GArrowOutputStreamClass *klass) { GObjectClass *gobject_class; GParamSpec *spec; gobject_class = G_OBJECT_CLASS(klass); - gobject_class->finalize = garrow_file_output_stream_finalize; - gobject_class->set_property = garrow_file_output_stream_set_property; - gobject_class->get_property = garrow_file_output_stream_get_property; + gobject_class->finalize = garrow_output_stream_finalize; + gobject_class->set_property = 
garrow_output_stream_set_property; + gobject_class->get_property = garrow_output_stream_get_property; - spec = g_param_spec_pointer("file-output-stream", - "io::FileOutputStream", - "The raw std::shared *", + spec = g_param_spec_pointer("output-stream", + "io::OutputStream", + "The raw std::shared *", static_cast(G_PARAM_WRITABLE | G_PARAM_CONSTRUCT_ONLY)); - g_object_class_install_property(gobject_class, PROP_FILE_OUTPUT_STREAM, spec); + g_object_class_install_property(gobject_class, PROP_OUTPUT_STREAM, spec); +} + + +G_DEFINE_TYPE(GArrowFileOutputStream, + garrow_file_output_stream, + GARROW_TYPE_OUTPUT_STREAM); + +static void +garrow_file_output_stream_init(GArrowFileOutputStream *file_output_stream) +{ +} + +static void +garrow_file_output_stream_class_init(GArrowFileOutputStreamClass *klass) +{ } /** @@ -204,8 +194,8 @@ garrow_file_output_stream_class_init(GArrowFileOutputStreamClass *klass) */ GArrowFileOutputStream * garrow_file_output_stream_open(const gchar *path, - gboolean append, - GError **error) + gboolean append, + GError **error) { std::shared_ptr arrow_file_output_stream; auto status = @@ -223,13 +213,56 @@ garrow_file_output_stream_open(const gchar *path, } } + +G_DEFINE_TYPE(GArrowBufferOutputStream, + garrow_buffer_output_stream, + GARROW_TYPE_OUTPUT_STREAM); + +static void +garrow_buffer_output_stream_init(GArrowBufferOutputStream *buffer_output_stream) +{ +} + +static void +garrow_buffer_output_stream_class_init(GArrowBufferOutputStreamClass *klass) +{ +} + +/** + * garrow_buffer_output_stream_new: + * @buffer: The resizable buffer to be output. + * + * Returns: (transfer full): A newly created #GArrowBufferOutputStream. + */ +GArrowBufferOutputStream * +garrow_buffer_output_stream_new(GArrowResizableBuffer *buffer) +{ + auto arrow_buffer = garrow_buffer_get_raw(GARROW_BUFFER(buffer)); + auto arrow_resizable_buffer = + std::static_pointer_cast(arrow_buffer); + auto arrow_buffer_output_stream = + std::make_shared(arrow_resizable_buffer); + return garrow_buffer_output_stream_new_raw(&arrow_buffer_output_stream); +} G_END_DECLS +GArrowOutputStream * +garrow_output_stream_new_raw(std::shared_ptr *arrow_output_stream) +{ + auto output_stream = + GARROW_OUTPUT_STREAM(g_object_new(GARROW_TYPE_OUTPUT_STREAM, + "output-stream", arrow_output_stream, + NULL)); + return output_stream; +} + std::shared_ptr garrow_output_stream_get_raw(GArrowOutputStream *output_stream) { - auto *iface = GARROW_OUTPUT_STREAM_GET_IFACE(output_stream); - return iface->get_raw(output_stream); + GArrowOutputStreamPrivate *priv; + + priv = GARROW_OUTPUT_STREAM_GET_PRIVATE(output_stream); + return priv->output_stream; } @@ -238,16 +271,17 @@ garrow_file_output_stream_new_raw(std::shared_ptr * { auto file_output_stream = GARROW_FILE_OUTPUT_STREAM(g_object_new(GARROW_TYPE_FILE_OUTPUT_STREAM, - "file-output-stream", arrow_file_output_stream, + "output-stream", arrow_file_output_stream, NULL)); return file_output_stream; } -std::shared_ptr -garrow_file_output_stream_get_raw(GArrowFileOutputStream *file_output_stream) +GArrowBufferOutputStream * +garrow_buffer_output_stream_new_raw(std::shared_ptr *arrow_buffer_output_stream) { - GArrowFileOutputStreamPrivate *priv; - - priv = GARROW_FILE_OUTPUT_STREAM_GET_PRIVATE(file_output_stream); - return priv->file_output_stream; + auto buffer_output_stream = + GARROW_BUFFER_OUTPUT_STREAM(g_object_new(GARROW_TYPE_BUFFER_OUTPUT_STREAM, + "output-stream", arrow_buffer_output_stream, + NULL)); + return buffer_output_stream; } diff --git a/c_glib/arrow-glib/output-stream.h 
b/c_glib/arrow-glib/output-stream.h index 043832efccd78..2a14a24ea9051 100644 --- a/c_glib/arrow-glib/output-stream.h +++ b/c_glib/arrow-glib/output-stream.h @@ -21,24 +21,53 @@ #include +#include + G_BEGIN_DECLS -#define GARROW_TYPE_OUTPUT_STREAM \ +#define GARROW_TYPE_OUTPUT_STREAM \ (garrow_output_stream_get_type()) -#define GARROW_OUTPUT_STREAM(obj) \ +#define GARROW_OUTPUT_STREAM(obj) \ (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_OUTPUT_STREAM, \ + GARROW_TYPE_OUTPUT_STREAM, \ GArrowOutputStream)) -#define GARROW_IS_OUTPUT_STREAM(obj) \ +#define GARROW_OUTPUT_STREAM_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_OUTPUT_STREAM, \ + GArrowOutputStreamClass)) +#define GARROW_IS_OUTPUT_STREAM(obj) \ (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ GARROW_TYPE_OUTPUT_STREAM)) -#define GARROW_OUTPUT_STREAM_GET_IFACE(obj) \ - (G_TYPE_INSTANCE_GET_INTERFACE((obj), \ - GARROW_TYPE_OUTPUT_STREAM, \ - GArrowOutputStreamInterface)) +#define GARROW_IS_OUTPUT_STREAM_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_OUTPUT_STREAM)) +#define GARROW_OUTPUT_STREAM_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_OUTPUT_STREAM, \ + GArrowOutputStreamClass)) typedef struct _GArrowOutputStream GArrowOutputStream; -typedef struct _GArrowOutputStreamInterface GArrowOutputStreamInterface; +#ifndef __GTK_DOC_IGNORE__ +typedef struct _GArrowOutputStreamClass GArrowOutputStreamClass; +#endif + +/** + * GArrowOutputStream: + * + * It wraps `arrow::io::OutputStream`. + */ +struct _GArrowOutputStream +{ + /*< private >*/ + GObject parent_instance; +}; + +#ifndef __GTK_DOC_IGNORE__ +struct _GArrowOutputStreamClass +{ + GObjectClass parent_class; +}; +#endif GType garrow_output_stream_get_type(void) G_GNUC_CONST; @@ -77,13 +106,13 @@ typedef struct _GArrowFileOutputStreamClass GArrowFileOutputStreamClass; struct _GArrowFileOutputStream { /*< private >*/ - GObject parent_instance; + GArrowOutputStream parent_instance; }; #ifndef __GTK_DOC_IGNORE__ struct _GArrowFileOutputStreamClass { - GObjectClass parent_class; + GArrowOutputStreamClass parent_class; }; #endif @@ -94,4 +123,52 @@ GArrowFileOutputStream *garrow_file_output_stream_open(const gchar *path, GError **error); +#define GARROW_TYPE_BUFFER_OUTPUT_STREAM \ + (garrow_buffer_output_stream_get_type()) +#define GARROW_BUFFER_OUTPUT_STREAM(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_BUFFER_OUTPUT_STREAM, \ + GArrowBufferOutputStream)) +#define GARROW_BUFFER_OUTPUT_STREAM_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_BUFFER_OUTPUT_STREAM, \ + GArrowBufferOutputStreamClass)) +#define GARROW_IS_BUFFER_OUTPUT_STREAM(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_BUFFER_OUTPUT_STREAM)) +#define GARROW_IS_BUFFER_OUTPUT_STREAM_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_BUFFER_OUTPUT_STREAM)) +#define GARROW_BUFFER_OUTPUT_STREAM_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_BUFFER_OUTPUT_STREAM, \ + GArrowBufferOutputStreamClass)) + +typedef struct _GArrowBufferOutputStream GArrowBufferOutputStream; +#ifndef __GTK_DOC_IGNORE__ +typedef struct _GArrowBufferOutputStreamClass GArrowBufferOutputStreamClass; +#endif + +/** + * GArrowBufferOutputStream: + * + * It wraps `arrow::io::BufferOutputStream`. 
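+ *
+ * A short usage sketch of the wrapped C++ class, constructed from a
+ * resizable buffer exactly as garrow_buffer_output_stream_new() does
+ * (illustrative only):
+ *
+ *     auto buffer = std::make_shared<arrow::PoolBuffer>();
+ *     arrow::io::BufferOutputStream stream(buffer);
+ *     stream.Write(reinterpret_cast<const uint8_t*>("Hello"), 5);
+ *     stream.Close();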
+ */ +struct _GArrowBufferOutputStream +{ + /*< private >*/ + GArrowOutputStream parent_instance; +}; + +#ifndef __GTK_DOC_IGNORE__ +struct _GArrowBufferOutputStreamClass +{ + GArrowOutputStreamClass parent_class; +}; +#endif + +GType garrow_buffer_output_stream_get_type(void) G_GNUC_CONST; + +GArrowBufferOutputStream *garrow_buffer_output_stream_new(GArrowResizableBuffer *buffer); + G_END_DECLS diff --git a/c_glib/arrow-glib/output-stream.hpp b/c_glib/arrow-glib/output-stream.hpp index e8e73216c499b..5d22f1d2e7026 100644 --- a/c_glib/arrow-glib/output-stream.hpp +++ b/c_glib/arrow-glib/output-stream.hpp @@ -20,22 +20,13 @@ #pragma once #include +#include #include -/** - * GArrowOutputStreamInterface: - * - * It wraps `arrow::io::OutputStream`. - */ -struct _GArrowOutputStreamInterface -{ - GTypeInterface parent_iface; - - std::shared_ptr (*get_raw)(GArrowOutputStream *file); -}; - +GArrowOutputStream *garrow_output_stream_new_raw(std::shared_ptr *arrow_output_stream); std::shared_ptr garrow_output_stream_get_raw(GArrowOutputStream *output_stream); + GArrowFileOutputStream *garrow_file_output_stream_new_raw(std::shared_ptr *arrow_file_output_stream); -std::shared_ptr garrow_file_output_stream_get_raw(GArrowFileOutputStream *file_output_stream); +GArrowBufferOutputStream *garrow_buffer_output_stream_new_raw(std::shared_ptr *arrow_buffer_output_stream); diff --git a/c_glib/test/test-buffer-output-file.rb b/c_glib/test/test-buffer-output-file.rb new file mode 100644 index 0000000000000..1b7fae9e6f6d4 --- /dev/null +++ b/c_glib/test/test-buffer-output-file.rb @@ -0,0 +1,26 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
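+
+# The test below drives the new GArrowBufferOutputStream end to end:
+# bytes written through the stream land in the caller-supplied
+# resizable buffer, so the result can be asserted without any file I/O.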
+ +class TestBufferOutputStream < Test::Unit::TestCase + def test_new + buffer = Arrow::PoolBuffer.new + output_stream = Arrow::BufferOutputStream.new(buffer) + output_stream.write("Hello") + output_stream.close + assert_equal("Hello", buffer.data.to_s) + end +end From ce0c96221f0db74b51af5484bd39f0619b71e58f Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Sun, 30 Apr 2017 13:19:22 -0400 Subject: [PATCH 0583/1644] ARROW-917: [GLib] Add GArrowBufferReader Author: Kouhei Sutou Closes #617 from kou/glib-buffer-reader and squashes the following commits: 399acda [Kouhei Sutou] [GLib] Add GArrowBufferReader --- c_glib/arrow-glib/input-stream.cpp | 232 +++++++++++++++++++++-- c_glib/arrow-glib/input-stream.h | 102 ++++++++-- c_glib/arrow-glib/input-stream.hpp | 16 +- c_glib/arrow-glib/memory-mapped-file.cpp | 17 -- c_glib/test/test-buffer-reader.rb | 26 +++ 5 files changed, 341 insertions(+), 52 deletions(-) create mode 100644 c_glib/test/test-buffer-reader.rb diff --git a/c_glib/arrow-glib/input-stream.cpp b/c_glib/arrow-glib/input-stream.cpp index 36bef80422489..56b811ad1c368 100644 --- a/c_glib/arrow-glib/input-stream.cpp +++ b/c_glib/arrow-glib/input-stream.cpp @@ -21,36 +21,246 @@ # include #endif -#include +#include +#include +#include #include +#include #include +#include +#include G_BEGIN_DECLS /** * SECTION: input-stream - * @title: GArrowInputStream - * @short_description: Stream input interface + * @section_id: input-stream-classes + * @title: Input stream classes + * @include: arrow-glib/arrow-glib.h * - * #GArrowInputStream is an interface for stream input. Stream input - * is file based and readable. + * #GArrowInputStream is a base class for input stream. + * + * #GArrowBufferReader is a class for buffer input stream. */ -G_DEFINE_INTERFACE(GArrowInputStream, - garrow_input_stream, - G_TYPE_OBJECT) +typedef struct GArrowInputStreamPrivate_ { + std::shared_ptr input_stream; +} GArrowInputStreamPrivate; + +enum { + PROP_0, + PROP_INPUT_STREAM +}; + +static std::shared_ptr +garrow_input_stream_get_raw_file_interface(GArrowFile *file) +{ + auto input_stream = GARROW_INPUT_STREAM(file); + auto arrow_input_stream = + garrow_input_stream_get_raw(input_stream); + return arrow_input_stream; +} + +static void +garrow_input_stream_file_interface_init(GArrowFileInterface *iface) +{ + iface->get_raw = garrow_input_stream_get_raw_file_interface; +} + +static std::shared_ptr +garrow_input_stream_get_raw_readable_interface(GArrowReadable *readable) +{ + auto input_stream = GARROW_INPUT_STREAM(readable); + auto arrow_input_stream = garrow_input_stream_get_raw(input_stream); + return arrow_input_stream; +} + +static void +garrow_input_stream_readable_interface_init(GArrowReadableInterface *iface) +{ + iface->get_raw = garrow_input_stream_get_raw_readable_interface; +} + +G_DEFINE_TYPE_WITH_CODE(GArrowInputStream, + garrow_input_stream, + G_TYPE_OBJECT, + G_ADD_PRIVATE(GArrowInputStream) + G_IMPLEMENT_INTERFACE(GARROW_TYPE_FILE, + garrow_input_stream_file_interface_init) + G_IMPLEMENT_INTERFACE(GARROW_TYPE_READABLE, + garrow_input_stream_readable_interface_init)); + +#define GARROW_INPUT_STREAM_GET_PRIVATE(obj) \ + (G_TYPE_INSTANCE_GET_PRIVATE((obj), \ + GARROW_TYPE_INPUT_STREAM, \ + GArrowInputStreamPrivate)) + +static void +garrow_input_stream_finalize(GObject *object) +{ + GArrowInputStreamPrivate *priv; + + priv = GARROW_INPUT_STREAM_GET_PRIVATE(object); + + priv->input_stream = nullptr; + + G_OBJECT_CLASS(garrow_input_stream_parent_class)->finalize(object); +} + +static void 
+garrow_input_stream_set_property(GObject *object, + guint prop_id, + const GValue *value, + GParamSpec *pspec) +{ + GArrowInputStreamPrivate *priv; + + priv = GARROW_INPUT_STREAM_GET_PRIVATE(object); + + switch (prop_id) { + case PROP_INPUT_STREAM: + priv->input_stream = + *static_cast<std::shared_ptr<arrow::io::InputStream> *>(g_value_get_pointer(value)); + break; + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_input_stream_get_property(GObject *object, + guint prop_id, + GValue *value, + GParamSpec *pspec) +{ + switch (prop_id) { + default: + G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); + break; + } +} + +static void +garrow_input_stream_init(GArrowInputStream *object) +{ +} + +static void +garrow_input_stream_class_init(GArrowInputStreamClass *klass) +{ + GObjectClass *gobject_class; + GParamSpec *spec; + + gobject_class = G_OBJECT_CLASS(klass); + + gobject_class->finalize = garrow_input_stream_finalize; + gobject_class->set_property = garrow_input_stream_set_property; + gobject_class->get_property = garrow_input_stream_get_property; + + spec = g_param_spec_pointer("input-stream", + "Input stream", + "The raw std::shared<arrow::io::InputStream> *", + static_cast<GParamFlags>(G_PARAM_WRITABLE | + G_PARAM_CONSTRUCT_ONLY)); + g_object_class_install_property(gobject_class, PROP_INPUT_STREAM, spec); +} + + +static std::shared_ptr<arrow::io::RandomAccessFile> +garrow_buffer_reader_get_raw_random_access_file_interface(GArrowRandomAccessFile *random_access_file) +{ + auto input_stream = GARROW_INPUT_STREAM(random_access_file); + auto arrow_input_stream = garrow_input_stream_get_raw(input_stream); + auto arrow_buffer_reader = + std::static_pointer_cast<arrow::io::BufferReader>(arrow_input_stream); + return arrow_buffer_reader; +} static void -garrow_input_stream_default_init (GArrowInputStreamInterface *iface) +garrow_buffer_reader_random_access_file_interface_init(GArrowRandomAccessFileInterface *iface) +{ + iface->get_raw = garrow_buffer_reader_get_raw_random_access_file_interface; +} + +G_DEFINE_TYPE_WITH_CODE(GArrowBufferReader, \ + garrow_buffer_reader, \ + GARROW_TYPE_INPUT_STREAM, + G_IMPLEMENT_INTERFACE(GARROW_TYPE_RANDOM_ACCESS_FILE, + garrow_buffer_reader_random_access_file_interface_init)); + +static void +garrow_buffer_reader_init(GArrowBufferReader *object) +{ +} + +static void +garrow_buffer_reader_class_init(GArrowBufferReaderClass *klass) +{ +} + +/** + * garrow_buffer_reader_new: + * @buffer: The buffer to be read. + * + * Returns: A newly created #GArrowBufferReader. + */ +GArrowBufferReader * +garrow_buffer_reader_new(GArrowBuffer *buffer) +{ + auto arrow_buffer = garrow_buffer_get_raw(buffer); + auto arrow_buffer_reader = + std::make_shared<arrow::io::BufferReader>(arrow_buffer); + return garrow_buffer_reader_new_raw(&arrow_buffer_reader); +} + +/** + * garrow_buffer_reader_get_buffer: + * @buffer_reader: A #GArrowBufferReader. + * + * Returns: (transfer full): The data of the array as #GArrowBuffer.
+ */ +GArrowBuffer * +garrow_buffer_reader_get_buffer(GArrowBufferReader *buffer_reader) { + auto arrow_input_stream = + garrow_input_stream_get_raw(GARROW_INPUT_STREAM(buffer_reader)); + auto arrow_buffer_reader = + std::static_pointer_cast(arrow_input_stream); + auto arrow_buffer = arrow_buffer_reader->buffer(); + return garrow_buffer_new_raw(&arrow_buffer); } + G_END_DECLS +GArrowInputStream * +garrow_input_stream_new_raw(std::shared_ptr *arrow_input_stream) +{ + auto input_stream = + GARROW_INPUT_STREAM(g_object_new(GARROW_TYPE_INPUT_STREAM, + "input-stream", arrow_input_stream, + NULL)); + return input_stream; +} + std::shared_ptr garrow_input_stream_get_raw(GArrowInputStream *input_stream) { - auto *iface = GARROW_INPUT_STREAM_GET_IFACE(input_stream); - return iface->get_raw(input_stream); + GArrowInputStreamPrivate *priv; + + priv = GARROW_INPUT_STREAM_GET_PRIVATE(input_stream); + return priv->input_stream; +} + + +GArrowBufferReader * +garrow_buffer_reader_new_raw(std::shared_ptr *arrow_buffer_reader) +{ + auto buffer_reader = + GARROW_BUFFER_READER(g_object_new(GARROW_TYPE_BUFFER_READER, + "input-stream", arrow_buffer_reader, + NULL)); + return buffer_reader; } diff --git a/c_glib/arrow-glib/input-stream.h b/c_glib/arrow-glib/input-stream.h index 4b331b93fb27f..caa11b575452b 100644 --- a/c_glib/arrow-glib/input-stream.h +++ b/c_glib/arrow-glib/input-stream.h @@ -19,27 +19,105 @@ #pragma once -#include +#include G_BEGIN_DECLS -#define GARROW_TYPE_INPUT_STREAM \ +#define GARROW_TYPE_INPUT_STREAM \ (garrow_input_stream_get_type()) -#define GARROW_INPUT_STREAM(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_INPUT_STREAM, \ +#define GARROW_INPUT_STREAM(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_INPUT_STREAM, \ GArrowInputStream)) -#define GARROW_IS_INPUT_STREAM(obj) \ +#define GARROW_INPUT_STREAM_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_INPUT_STREAM, \ + GArrowInputStreamClass)) +#define GARROW_IS_INPUT_STREAM(obj) \ (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ GARROW_TYPE_INPUT_STREAM)) -#define GARROW_INPUT_STREAM_GET_IFACE(obj) \ - (G_TYPE_INSTANCE_GET_INTERFACE((obj), \ - GARROW_TYPE_INPUT_STREAM, \ - GArrowInputStreamInterface)) +#define GARROW_IS_INPUT_STREAM_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_INPUT_STREAM)) +#define GARROW_INPUT_STREAM_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_INPUT_STREAM, \ + GArrowInputStreamClass)) + +typedef struct _GArrowInputStream GArrowInputStream; +#ifndef __GTK_DOC_IGNORE__ +typedef struct _GArrowInputStreamClass GArrowInputStreamClass; +#endif + +/** + * GArrowInputStream: + * + * It wraps `arrow::io::InputStream`. 
+ */ +struct _GArrowInputStream +{ + /*< private >*/ + GObject parent_instance; +}; -typedef struct _GArrowInputStream GArrowInputStream; -typedef struct _GArrowInputStreamInterface GArrowInputStreamInterface; +#ifndef __GTK_DOC_IGNORE__ +struct _GArrowInputStreamClass +{ + GObjectClass parent_class; +}; +#endif GType garrow_input_stream_get_type(void) G_GNUC_CONST; + +#define GARROW_TYPE_BUFFER_READER \ + (garrow_buffer_reader_get_type()) +#define GARROW_BUFFER_READER(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_BUFFER_READER, \ + GArrowBufferReader)) +#define GARROW_BUFFER_READER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_BUFFER_READER, \ + GArrowBufferReaderClass)) +#define GARROW_IS_BUFFER_READER(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_BUFFER_READER)) +#define GARROW_IS_BUFFER_READER_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_BUFFER_READER)) +#define GARROW_BUFFER_READER_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_BUFFER_READER, \ + GArrowBufferReaderClass)) + +typedef struct _GArrowBufferReader GArrowBufferReader; +#ifndef __GTK_DOC_IGNORE__ +typedef struct _GArrowBufferReaderClass GArrowBufferReaderClass; +#endif + +/** + * GArrowBufferReader: + * + * It wraps `arrow::io::BufferReader`. + */ +struct _GArrowBufferReader +{ + /*< private >*/ + GArrowInputStream parent_instance; +}; + +#ifndef __GTK_DOC_IGNORE__ +struct _GArrowBufferReaderClass +{ + GArrowInputStreamClass parent_class; +}; +#endif + +GType garrow_buffer_reader_get_type(void) G_GNUC_CONST; + +GArrowBufferReader *garrow_buffer_reader_new(GArrowBuffer *buffer); + +GArrowBuffer *garrow_buffer_reader_get_buffer(GArrowBufferReader *buffer_reader); + G_END_DECLS diff --git a/c_glib/arrow-glib/input-stream.hpp b/c_glib/arrow-glib/input-stream.hpp index 7958df1585887..008f5f2b4e157 100644 --- a/c_glib/arrow-glib/input-stream.hpp +++ b/c_glib/arrow-glib/input-stream.hpp @@ -20,19 +20,11 @@ #pragma once #include +#include #include -/** - * GArrowInputStreamInterface: - * - * It wraps `arrow::io::InputStream`. 
- */ -struct _GArrowInputStreamInterface -{ - GTypeInterface parent_iface; - - std::shared_ptr (*get_raw)(GArrowInputStream *file); -}; - +GArrowInputStream *garrow_input_stream_new_raw(std::shared_ptr *arrow_input_stream); std::shared_ptr garrow_input_stream_get_raw(GArrowInputStream *input_stream); + +GArrowBufferReader *garrow_buffer_reader_new_raw(std::shared_ptr *arrow_buffer_reader); diff --git a/c_glib/arrow-glib/memory-mapped-file.cpp b/c_glib/arrow-glib/memory-mapped-file.cpp index f9cbf079105c1..71a9a6dad3134 100644 --- a/c_glib/arrow-glib/memory-mapped-file.cpp +++ b/c_glib/arrow-glib/memory-mapped-file.cpp @@ -82,21 +82,6 @@ garrow_readable_interface_init(GArrowReadableInterface *iface) iface->get_raw = garrow_memory_mapped_file_get_raw_readable_interface; } -static std::shared_ptr -garrow_memory_mapped_file_get_raw_input_stream_interface(GArrowInputStream *input_stream) -{ - auto memory_mapped_file = GARROW_MEMORY_MAPPED_FILE(input_stream); - auto arrow_memory_mapped_file = - garrow_memory_mapped_file_get_raw(memory_mapped_file); - return arrow_memory_mapped_file; -} - -static void -garrow_input_stream_interface_init(GArrowInputStreamInterface *iface) -{ - iface->get_raw = garrow_memory_mapped_file_get_raw_input_stream_interface; -} - static std::shared_ptr garrow_memory_mapped_file_get_raw_random_access_file_interface(GArrowRandomAccessFile *file) { @@ -150,8 +135,6 @@ G_DEFINE_TYPE_WITH_CODE(GArrowMemoryMappedFile, garrow_file_interface_init) G_IMPLEMENT_INTERFACE(GARROW_TYPE_READABLE, garrow_readable_interface_init) - G_IMPLEMENT_INTERFACE(GARROW_TYPE_INPUT_STREAM, - garrow_input_stream_interface_init) G_IMPLEMENT_INTERFACE(GARROW_TYPE_RANDOM_ACCESS_FILE, garrow_random_access_file_interface_init) G_IMPLEMENT_INTERFACE(GARROW_TYPE_WRITEABLE, diff --git a/c_glib/test/test-buffer-reader.rb b/c_glib/test/test-buffer-reader.rb new file mode 100644 index 0000000000000..b3517b230e421 --- /dev/null +++ b/c_glib/test/test-buffer-reader.rb @@ -0,0 +1,26 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestBufferReader < Test::Unit::TestCase + def test_read + buffer = Arrow::Buffer.new("Hello World") + buffer_reader = Arrow::BufferReader.new(buffer) + read_buffer = " " * 5 + _success, n_read_bytes = buffer_reader.read(read_buffer) + assert_equal("Hello", read_buffer.byteslice(0, n_read_bytes)) + end +end From 2d5142cd3fc9a5f5150daf6ea6335029de8002ae Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Sun, 30 Apr 2017 13:31:49 -0400 Subject: [PATCH 0584/1644] ARROW-918: [GLib] Use GArrowBuffer for read buffer It's efficient. 
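The old API filled a caller-supplied byte buffer and returned a success flag plus a byte count; the new API returns a #GArrowBuffer that owns exactly the bytes it read. A minimal before/after sketch in Ruby, mirroring the test updates in the diffs below (`readable` stands in for any readable object, such as a memory-mapped file):

```ruby
# Before: the caller allocates the destination and slices it afterwards.
raw = " " * 5
_success, n_read_bytes = readable.read(raw)
data = raw.byteslice(0, n_read_bytes)

# After: read(n_bytes) returns a GArrowBuffer holding the bytes read.
buffer = readable.read(5)
data = buffer.data.to_s
```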
Author: Kouhei Sutou Closes #618 from kou/glib-read-buffer and squashes the following commits: e14bb40 [Kouhei Sutou] [GLib] Use GArrowBuffer for read buffer --- c_glib/arrow-glib/random-access-file.cpp | 20 +++++++++++--------- c_glib/arrow-glib/random-access-file.h | 10 ++++------ c_glib/arrow-glib/readable.cpp | 23 +++++++++++++---------- c_glib/arrow-glib/readable.h | 10 ++++------ c_glib/test/test-memory-mapped-file.rb | 18 +++++++----------- 5 files changed, 39 insertions(+), 42 deletions(-) diff --git a/c_glib/arrow-glib/random-access-file.cpp b/c_glib/arrow-glib/random-access-file.cpp index 976a80dce0693..744131632cbc2 100644 --- a/c_glib/arrow-glib/random-access-file.cpp +++ b/c_glib/arrow-glib/random-access-file.cpp @@ -23,6 +23,7 @@ #include +#include #include #include @@ -88,28 +89,29 @@ garrow_random_access_file_get_support_zero_copy(GArrowRandomAccessFile *file) * @file: A #GArrowRandomAccessFile. * @position: The read start position. * @n_bytes: The number of bytes to be read. - * @n_read_bytes: (out): The read number of bytes. - * @buffer: (array length=n_bytes): The buffer to be read data. * @error: (nullable): Return location for a #GError or %NULL. * - * Returns: %TRUE on success, %FALSE if there was an error. + * Returns: (transfer full) (nullable): #GArrowBuffer that has read + * data on success, %NULL if there was an error. */ -gboolean +GArrowBuffer * garrow_random_access_file_read_at(GArrowRandomAccessFile *file, gint64 position, gint64 n_bytes, - gint64 *n_read_bytes, - guint8 *buffer, GError **error) { const auto arrow_random_access_file = garrow_random_access_file_get_raw(file); + std::shared_ptr arrow_buffer; auto status = arrow_random_access_file->ReadAt(position, n_bytes, - n_read_bytes, - buffer); - return garrow_error_check(error, status, "[io][random-access-file][read-at]"); + &arrow_buffer); + if (garrow_error_check(error, status, "[io][random-access-file][read-at]")) { + return garrow_buffer_new_raw(&arrow_buffer); + } else { + return NULL; + } } G_END_DECLS diff --git a/c_glib/arrow-glib/random-access-file.h b/c_glib/arrow-glib/random-access-file.h index 8a7f6b4218a31..83a7d8cd14b95 100644 --- a/c_glib/arrow-glib/random-access-file.h +++ b/c_glib/arrow-glib/random-access-file.h @@ -45,11 +45,9 @@ GType garrow_random_access_file_get_type(void) G_GNUC_CONST; guint64 garrow_random_access_file_get_size(GArrowRandomAccessFile *file, GError **error); gboolean garrow_random_access_file_get_support_zero_copy(GArrowRandomAccessFile *file); -gboolean garrow_random_access_file_read_at(GArrowRandomAccessFile *file, - gint64 position, - gint64 n_bytes, - gint64 *n_read_bytes, - guint8 *buffer, - GError **error); +GArrowBuffer *garrow_random_access_file_read_at(GArrowRandomAccessFile *file, + gint64 position, + gint64 n_bytes, + GError **error); G_END_DECLS diff --git a/c_glib/arrow-glib/readable.cpp b/c_glib/arrow-glib/readable.cpp index d893853eea015..6a9023e6cddf0 100644 --- a/c_glib/arrow-glib/readable.cpp +++ b/c_glib/arrow-glib/readable.cpp @@ -23,6 +23,7 @@ #include +#include #include #include @@ -50,23 +51,25 @@ garrow_readable_default_init (GArrowReadableInterface *iface) * garrow_readable_read: * @readable: A #GArrowReadable. * @n_bytes: The number of bytes to be read. - * @n_read_bytes: (out): The read number of bytes. - * @buffer: (array length=n_bytes): The buffer to be read data. * @error: (nullable): Return location for a #GError or %NULL. * - * Returns: %TRUE on success, %FALSE if there was an error. 
+ * Returns: (transfer full) (nullable): #GArrowBuffer that has read + * data on success, %NULL if there was an error. */ -gboolean +GArrowBuffer * garrow_readable_read(GArrowReadable *readable, - gint64 n_bytes, - gint64 *n_read_bytes, - guint8 *buffer, - GError **error) + gint64 n_bytes, + GError **error) { const auto arrow_readable = garrow_readable_get_raw(readable); - auto status = arrow_readable->Read(n_bytes, n_read_bytes, buffer); - return garrow_error_check(error, status, "[io][readable][read]"); + std::shared_ptr arrow_buffer; + auto status = arrow_readable->Read(n_bytes, &arrow_buffer); + if (garrow_error_check(error, status, "[io][readable][read]")) { + return garrow_buffer_new_raw(&arrow_buffer); + } else { + return NULL; + } } G_END_DECLS diff --git a/c_glib/arrow-glib/readable.h b/c_glib/arrow-glib/readable.h index bde4b01ee1f15..216e7369c76f7 100644 --- a/c_glib/arrow-glib/readable.h +++ b/c_glib/arrow-glib/readable.h @@ -19,7 +19,7 @@ #pragma once -#include +#include G_BEGIN_DECLS @@ -42,10 +42,8 @@ typedef struct _GArrowReadableInterface GArrowReadableInterface; GType garrow_readable_get_type(void) G_GNUC_CONST; -gboolean garrow_readable_read(GArrowReadable *readable, - gint64 n_bytes, - gint64 *n_read_bytes, - guint8 *buffer, - GError **error); +GArrowBuffer *garrow_readable_read(GArrowReadable *readable, + gint64 n_bytes, + GError **error); G_END_DECLS diff --git a/c_glib/test/test-memory-mapped-file.rb b/c_glib/test/test-memory-mapped-file.rb index e78d07a72d3b8..e09e3697074c9 100644 --- a/c_glib/test/test-memory-mapped-file.rb +++ b/c_glib/test/test-memory-mapped-file.rb @@ -22,9 +22,8 @@ def test_open tempfile.close file = Arrow::MemoryMappedFile.open(tempfile.path, :read) begin - buffer = " " * 5 - file.read(buffer) - assert_equal("Hello", buffer) + buffer = file.read(5) + assert_equal("Hello", buffer.data.to_s) ensure file.close end @@ -48,9 +47,8 @@ def test_read tempfile.close file = Arrow::MemoryMappedFile.open(tempfile.path, :read) begin - buffer = " " * 5 - _success, n_read_bytes = file.read(buffer) - assert_equal("Hello", buffer.byteslice(0, n_read_bytes)) + buffer = file.read(5) + assert_equal("Hello", buffer.data.to_s) ensure file.close end @@ -62,9 +60,8 @@ def test_read_at tempfile.close file = Arrow::MemoryMappedFile.open(tempfile.path, :read) begin - buffer = " " * 5 - _success, n_read_bytes = file.read_at(6, buffer) - assert_equal("World", buffer.byteslice(0, n_read_bytes)) + buffer = file.read_at(6, 5) + assert_equal("World", buffer.data.to_s) ensure file.close end @@ -116,8 +113,7 @@ def test_tell tempfile.close file = Arrow::MemoryMappedFile.open(tempfile.path, :read) begin - buffer = " " * 5 - file.read(buffer) + file.read(5) assert_equal(5, file.tell) ensure file.close From b4886da0f19484fe829fdb23a231a864070bf58c Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Sun, 30 Apr 2017 13:33:06 -0400 Subject: [PATCH 0585/1644] ARROW-919: [GLib] Use "id" to get type enum value from GArrowDataType Author: Kouhei Sutou Closes #619 from kou/glib-data-type-use-id and squashes the following commits: e20d445 [Kouhei Sutou] [GLib] Use "id" to get type enum value from GArrowDataType --- c_glib/arrow-glib/data-type.cpp | 6 +++--- c_glib/arrow-glib/data-type.h | 2 +- c_glib/test/test-binary-data-type.rb | 2 +- c_glib/test/test-boolean-data-type.rb | 2 +- c_glib/test/test-double-data-type.rb | 2 +- c_glib/test/test-float-data-type.rb | 2 +- c_glib/test/test-int16-data-type.rb | 2 +- c_glib/test/test-int32-data-type.rb | 2 +- c_glib/test/test-int64-data-type.rb | 2 
+- c_glib/test/test-int8-data-type.rb | 2 +- c_glib/test/test-list-data-type.rb | 2 +- c_glib/test/test-null-data-type.rb | 2 +- c_glib/test/test-string-data-type.rb | 2 +- c_glib/test/test-uint16-data-type.rb | 2 +- c_glib/test/test-uint32-data-type.rb | 2 +- c_glib/test/test-uint64-data-type.rb | 2 +- c_glib/test/test-uint8-data-type.rb | 2 +- 17 files changed, 19 insertions(+), 19 deletions(-) diff --git a/c_glib/arrow-glib/data-type.cpp b/c_glib/arrow-glib/data-type.cpp index 2fd261dc91947..c3c7fdb0f7c21 100644 --- a/c_glib/arrow-glib/data-type.cpp +++ b/c_glib/arrow-glib/data-type.cpp @@ -192,13 +192,13 @@ garrow_data_type_to_string(GArrowDataType *data_type) } /** - * garrow_data_type_type: + * garrow_data_type_get_id: * @data_type: A #GArrowDataType. * - * Returns: The type of the data type. + * Returns: The #GArrowType of the data type. */ GArrowType -garrow_data_type_type(GArrowDataType *data_type) +garrow_data_type_get_id(GArrowDataType *data_type) { const auto arrow_data_type = garrow_data_type_get_raw(data_type); return garrow_type_from_raw(arrow_data_type->id()); diff --git a/c_glib/arrow-glib/data-type.h b/c_glib/arrow-glib/data-type.h index babf0ee1712a0..933fcfc4b2ccb 100644 --- a/c_glib/arrow-glib/data-type.h +++ b/c_glib/arrow-glib/data-type.h @@ -73,7 +73,7 @@ GType garrow_data_type_get_type (void) G_GNUC_CONST; gboolean garrow_data_type_equal (GArrowDataType *data_type, GArrowDataType *other_data_type); gchar *garrow_data_type_to_string (GArrowDataType *data_type); -GArrowType garrow_data_type_type (GArrowDataType *data_type); +GArrowType garrow_data_type_get_id (GArrowDataType *data_type); #define GARROW_TYPE_NULL_DATA_TYPE \ diff --git a/c_glib/test/test-binary-data-type.rb b/c_glib/test/test-binary-data-type.rb index 3d4095c1b0648..5a1cb89b30061 100644 --- a/c_glib/test/test-binary-data-type.rb +++ b/c_glib/test/test-binary-data-type.rb @@ -18,7 +18,7 @@ class TestBinaryDataType < Test::Unit::TestCase def test_type data_type = Arrow::BinaryDataType.new - assert_equal(Arrow::Type::BINARY, data_type.type) + assert_equal(Arrow::Type::BINARY, data_type.id) end def test_to_s diff --git a/c_glib/test/test-boolean-data-type.rb b/c_glib/test/test-boolean-data-type.rb index ac5667140fb8e..39b8128989de3 100644 --- a/c_glib/test/test-boolean-data-type.rb +++ b/c_glib/test/test-boolean-data-type.rb @@ -18,7 +18,7 @@ class TestBooleanDataType < Test::Unit::TestCase def test_type data_type = Arrow::BooleanDataType.new - assert_equal(Arrow::Type::BOOL, data_type.type) + assert_equal(Arrow::Type::BOOL, data_type.id) end def test_to_s diff --git a/c_glib/test/test-double-data-type.rb b/c_glib/test/test-double-data-type.rb index 18c870cb9e62b..0edd64eed300f 100644 --- a/c_glib/test/test-double-data-type.rb +++ b/c_glib/test/test-double-data-type.rb @@ -18,7 +18,7 @@ class TestDoubleDataType < Test::Unit::TestCase def test_type data_type = Arrow::DoubleDataType.new - assert_equal(Arrow::Type::DOUBLE, data_type.type) + assert_equal(Arrow::Type::DOUBLE, data_type.id) end def test_to_s diff --git a/c_glib/test/test-float-data-type.rb b/c_glib/test/test-float-data-type.rb index ab315fd336b84..8384b526e0203 100644 --- a/c_glib/test/test-float-data-type.rb +++ b/c_glib/test/test-float-data-type.rb @@ -18,7 +18,7 @@ class TestFloatDataType < Test::Unit::TestCase def test_type data_type = Arrow::FloatDataType.new - assert_equal(Arrow::Type::FLOAT, data_type.type) + assert_equal(Arrow::Type::FLOAT, data_type.id) end def test_to_s diff --git a/c_glib/test/test-int16-data-type.rb 
b/c_glib/test/test-int16-data-type.rb index 273ec809c198e..aad5f11fbf60d 100644 --- a/c_glib/test/test-int16-data-type.rb +++ b/c_glib/test/test-int16-data-type.rb @@ -18,7 +18,7 @@ class TestInt16DataType < Test::Unit::TestCase def test_type data_type = Arrow::Int16DataType.new - assert_equal(Arrow::Type::INT16, data_type.type) + assert_equal(Arrow::Type::INT16, data_type.id) end def test_to_s diff --git a/c_glib/test/test-int32-data-type.rb b/c_glib/test/test-int32-data-type.rb index f6b9b34e1d827..2d9d44d66236a 100644 --- a/c_glib/test/test-int32-data-type.rb +++ b/c_glib/test/test-int32-data-type.rb @@ -18,7 +18,7 @@ class TestInt32DataType < Test::Unit::TestCase def test_type data_type = Arrow::Int32DataType.new - assert_equal(Arrow::Type::INT32, data_type.type) + assert_equal(Arrow::Type::INT32, data_type.id) end def test_to_s diff --git a/c_glib/test/test-int64-data-type.rb b/c_glib/test/test-int64-data-type.rb index 032b24dac3ecc..3c5263e848ca2 100644 --- a/c_glib/test/test-int64-data-type.rb +++ b/c_glib/test/test-int64-data-type.rb @@ -18,7 +18,7 @@ class TestInt64DataType < Test::Unit::TestCase def test_type data_type = Arrow::Int64DataType.new - assert_equal(Arrow::Type::INT64, data_type.type) + assert_equal(Arrow::Type::INT64, data_type.id) end def test_to_s diff --git a/c_glib/test/test-int8-data-type.rb b/c_glib/test/test-int8-data-type.rb index d33945614db8e..40de1be95c652 100644 --- a/c_glib/test/test-int8-data-type.rb +++ b/c_glib/test/test-int8-data-type.rb @@ -18,7 +18,7 @@ class TestInt8DataType < Test::Unit::TestCase def test_type data_type = Arrow::Int8DataType.new - assert_equal(Arrow::Type::INT8, data_type.type) + assert_equal(Arrow::Type::INT8, data_type.id) end def test_to_s diff --git a/c_glib/test/test-list-data-type.rb b/c_glib/test/test-list-data-type.rb index 6fde203517684..aa6a8fa65fd8c 100644 --- a/c_glib/test/test-list-data-type.rb +++ b/c_glib/test/test-list-data-type.rb @@ -19,7 +19,7 @@ class TestListDataType < Test::Unit::TestCase def test_type field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) data_type = Arrow::ListDataType.new(field) - assert_equal(Arrow::Type::LIST, data_type.type) + assert_equal(Arrow::Type::LIST, data_type.id) end def test_to_s diff --git a/c_glib/test/test-null-data-type.rb b/c_glib/test/test-null-data-type.rb index 95e54833b0896..fd766675e40c3 100644 --- a/c_glib/test/test-null-data-type.rb +++ b/c_glib/test/test-null-data-type.rb @@ -18,7 +18,7 @@ class TestNullDataType < Test::Unit::TestCase def test_type data_type = Arrow::NullDataType.new - assert_equal(Arrow::Type::NA, data_type.type) + assert_equal(Arrow::Type::NA, data_type.id) end def test_to_s diff --git a/c_glib/test/test-string-data-type.rb b/c_glib/test/test-string-data-type.rb index daba7fd9ec768..550bf13f19f39 100644 --- a/c_glib/test/test-string-data-type.rb +++ b/c_glib/test/test-string-data-type.rb @@ -18,7 +18,7 @@ class TestStringDataType < Test::Unit::TestCase def test_type data_type = Arrow::StringDataType.new - assert_equal(Arrow::Type::STRING, data_type.type) + assert_equal(Arrow::Type::STRING, data_type.id) end def test_to_s diff --git a/c_glib/test/test-uint16-data-type.rb b/c_glib/test/test-uint16-data-type.rb index f5a6cc0be28bb..56f6cf4a2f0b4 100644 --- a/c_glib/test/test-uint16-data-type.rb +++ b/c_glib/test/test-uint16-data-type.rb @@ -18,7 +18,7 @@ class TestUInt16DataType < Test::Unit::TestCase def test_type data_type = Arrow::UInt16DataType.new - assert_equal(Arrow::Type::UINT16, data_type.type) + assert_equal(Arrow::Type::UINT16, 
data_type.id) end def test_to_s diff --git a/c_glib/test/test-uint32-data-type.rb b/c_glib/test/test-uint32-data-type.rb index 7a50257d6d3b9..7ad3f5697e510 100644 --- a/c_glib/test/test-uint32-data-type.rb +++ b/c_glib/test/test-uint32-data-type.rb @@ -18,7 +18,7 @@ class TestUInt32DataType < Test::Unit::TestCase def test_type data_type = Arrow::UInt32DataType.new - assert_equal(Arrow::Type::UINT32, data_type.type) + assert_equal(Arrow::Type::UINT32, data_type.id) end def test_to_s diff --git a/c_glib/test/test-uint64-data-type.rb b/c_glib/test/test-uint64-data-type.rb index 403fc9acdfcfa..f5bf3c9786b93 100644 --- a/c_glib/test/test-uint64-data-type.rb +++ b/c_glib/test/test-uint64-data-type.rb @@ -18,7 +18,7 @@ class TestUInt64DataType < Test::Unit::TestCase def test_type data_type = Arrow::UInt64DataType.new - assert_equal(Arrow::Type::UINT64, data_type.type) + assert_equal(Arrow::Type::UINT64, data_type.id) end def test_to_s diff --git a/c_glib/test/test-uint8-data-type.rb b/c_glib/test/test-uint8-data-type.rb index eb91da2761efe..d4bf797a95b7e 100644 --- a/c_glib/test/test-uint8-data-type.rb +++ b/c_glib/test/test-uint8-data-type.rb @@ -18,7 +18,7 @@ class TestUInt8DataType < Test::Unit::TestCase def test_type data_type = Arrow::UInt8DataType.new - assert_equal(Arrow::Type::UINT8, data_type.type) + assert_equal(Arrow::Type::UINT8, data_type.id) end def test_to_s From 00994b82015365fec8474605bf09bd11637859af Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Sun, 30 Apr 2017 13:35:29 -0400 Subject: [PATCH 0586/1644] ARROW-920: [GLib] Add Lua examples Author: Kouhei Sutou Closes #620 from kou/glib-lua-examples and squashes the following commits: 491c0e4 [Kouhei Sutou] [GLib] Add Lua examples --- c_glib/README.md | 2 + c_glib/configure.ac | 1 + c_glib/example/Makefile.am | 6 ++- c_glib/example/README.md | 5 +- c_glib/example/lua/Makefile.am | 24 ++++++++++ c_glib/example/lua/README.md | 45 ++++++++++++++++++ c_glib/example/lua/read-batch.lua | 44 +++++++++++++++++ c_glib/example/lua/read-stream.lua | 51 ++++++++++++++++++++ c_glib/example/lua/write-batch.lua | 74 +++++++++++++++++++++++++++++ c_glib/example/lua/write-stream.lua | 74 +++++++++++++++++++++++++++++ 10 files changed, 323 insertions(+), 3 deletions(-) create mode 100644 c_glib/example/lua/Makefile.am create mode 100644 c_glib/example/lua/README.md create mode 100644 c_glib/example/lua/read-batch.lua create mode 100644 c_glib/example/lua/read-stream.lua create mode 100644 c_glib/example/lua/write-batch.lua create mode 100644 c_glib/example/lua/write-stream.lua diff --git a/c_glib/README.md b/c_glib/README.md index b253d32b266d4..6eadb797032bc 100644 --- a/c_glib/README.md +++ b/c_glib/README.md @@ -194,10 +194,12 @@ You can use Arrow GLib with non C languages with GObject Introspection based bindings. Here are languages that support GObject Introspection: * Ruby: [red-arrow gem](https://rubygems.org/gems/red-arrow) should be used. + * Examples: https://github.com/red-data-tools/red-arrow/tree/master/example * Python: [PyGObject](https://wiki.gnome.org/Projects/PyGObject) should be used. (Note that you should use PyArrow than Arrow GLib.) * Lua: [LGI](https://github.com/pavouk/lgi) should be used. + * Examples: `example/lua/` directory. * Go: [Go-gir-generator](https://github.com/linuxdeepin/go-gir-generator) should be used. 
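All of these bindings expose the GLib API through GObject Introspection, so class and method names follow directly from the C names. A rough sketch of the pattern in Ruby (assuming the typelib is loaded with the gi gem's `GI.load`; red-arrow layers convenience APIs on top of the same surface, and the calls below mirror this repository's own GLib tests):

```ruby
require "gi"
Arrow = GI.load("Arrow")  # load the Arrow GLib typelib via introspection

buffer = Arrow::Buffer.new("Hello")       # wraps garrow_buffer_new()
reader = Arrow::BufferReader.new(buffer)  # wraps garrow_buffer_reader_new()
puts reader.read(5).data.to_s             # => "Hello"
```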
diff --git a/c_glib/configure.ac b/c_glib/configure.ac index f36719284711b..e010d962f377a 100644 --- a/c_glib/configure.ac +++ b/c_glib/configure.ac @@ -97,6 +97,7 @@ AC_CONFIG_FILES([ doc/Makefile doc/reference/Makefile example/Makefile + example/lua/Makefile ]) AC_OUTPUT diff --git a/c_glib/example/Makefile.am b/c_glib/example/Makefile.am index 8bf3c15526759..66d2cddcac5fb 100644 --- a/c_glib/example/Makefile.am +++ b/c_glib/example/Makefile.am @@ -15,6 +15,9 @@ # specific language governing permissions and limitations # under the License. +SUBDIRS = \ + lua + AM_CPPFLAGS = \ -I$(top_builddir) \ -I$(top_srcdir) @@ -41,7 +44,8 @@ read_batch_SOURCES = \ read_stream_SOURCES = \ read-stream.c -example_DATA = \ +dist_example_DATA = \ + README.md \ $(build_SOURCES) \ $(read_batch_SOURCES) \ $(read_stream_SOURCES) diff --git a/c_glib/example/README.md b/c_glib/example/README.md index b1ba259534cb1..99730d59ce1c2 100644 --- a/c_glib/example/README.md +++ b/c_glib/example/README.md @@ -16,7 +16,9 @@ There are example codes in this directory. -C example codes exist in this directory. +C example codes exist in this directory. Language bindings example +codes exists in sub directories. For example, Lua example codes exists +in `lua/` sub directory. ## C example codes @@ -39,4 +41,3 @@ Here are example codes in this directory: * `read-stream.c`: It shows how to read Arrow array from file in stream mode. - diff --git a/c_glib/example/lua/Makefile.am b/c_glib/example/lua/Makefile.am new file mode 100644 index 0000000000000..9019d24741c1a --- /dev/null +++ b/c_glib/example/lua/Makefile.am @@ -0,0 +1,24 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +lua_exampledir = $(exampledir)/lua +dist_lua_example_DATA = \ + README.md \ + read-batch.lua \ + read-stream.lua \ + write-batch.lua \ + write-stream.lua diff --git a/c_glib/example/lua/README.md b/c_glib/example/lua/README.md new file mode 100644 index 0000000000000..d127573bcc368 --- /dev/null +++ b/c_glib/example/lua/README.md @@ -0,0 +1,45 @@ + + +# Arrow Lua example + +There are Lua example codes in this directory. + +## How to run + +All example codes use [LGI](https://github.com/pavouk/lgi) to use +Arrow GLib based bindings. + +Here are command lines to install LGI on Debian GNU/Linux and Ubuntu: + +```text +% sudo apt install -y luarocks +% sudo luarocks install lgi +``` + +## Lua example codes + +Here are example codes in this directory: + + * `write-batch.lua`: It shows how to write Arrow array to file in + batch mode. + + * `read-batch.lua`: It shows how to read Arrow array from file in + batch mode. + + * `write-stream.lua`: It shows how to write Arrow array to file in + stream mode. + + * `read-stream.lua`: It shows how to read Arrow array from file in + stream mode. 
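Each script accepts an optional path as its first argument (`arg[1]`) and falls back to a default under `/tmp`, so a typical round trip looks like this (a sketch; invoke whichever Lua interpreter LGI was built against):

```text
% lua write-batch.lua /tmp/batch.arrow
% lua read-batch.lua /tmp/batch.arrow
```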
diff --git a/c_glib/example/lua/read-batch.lua b/c_glib/example/lua/read-batch.lua new file mode 100644 index 0000000000000..b28d346863820 --- /dev/null +++ b/c_glib/example/lua/read-batch.lua @@ -0,0 +1,44 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. + +local lgi = require 'lgi' +local Arrow = lgi.Arrow + +local input_path = arg[1] or "/tmp/batch.arrow"; + +local input = Arrow.MemoryMappedFile.open(input_path, Arrow.FileMode.READ) +local reader = Arrow.FileReader.open(input) + +for i = 0, reader:get_n_record_batches() - 1 do + local record_batch = reader:get_record_batch(i) + print(string.rep("=", 40)) + print("record-batch["..i.."]:") + for j = 0, record_batch:get_n_columns() - 1 do + local column = record_batch:get_column(j) + local column_name = record_batch:get_column_name(j) + io.write(" "..column_name..": [") + for k = 0, record_batch:get_n_rows() - 1 do + if k > 0 then + io.write(", ") + end + io.write(column:get_value(k)) + end + print("]") + end +end + +input:close() diff --git a/c_glib/example/lua/read-stream.lua b/c_glib/example/lua/read-stream.lua new file mode 100644 index 0000000000000..3b0820627e6b2 --- /dev/null +++ b/c_glib/example/lua/read-stream.lua @@ -0,0 +1,51 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. 
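+
+-- A sketch of the flow below: the stream format carries no record
+-- batch count up front, so the loop keeps calling
+-- get_next_record_batch() until it returns nothing, printing each
+-- batch as it arrives.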
+ +local lgi = require 'lgi' +local Arrow = lgi.Arrow + +local input_path = arg[1] or "/tmp/stream.arrow"; + +local input = Arrow.MemoryMappedFile.open(input_path, Arrow.FileMode.READ) +local reader = Arrow.StreamReader.open(input) + +local i = 0 +while true do + local record_batch = reader:get_next_record_batch(i) + if not record_batch then + break + end + + print(string.rep("=", 40)) + print("record-batch["..i.."]:") + for j = 0, record_batch:get_n_columns() - 1 do + local column = record_batch:get_column(j) + local column_name = record_batch:get_column_name(j) + io.write(" "..column_name..": [") + for k = 0, record_batch:get_n_rows() - 1 do + if k > 0 then + io.write(", ") + end + io.write(column:get_value(k)) + end + print("]") + end + + i = i + 1 +end + +input:close() diff --git a/c_glib/example/lua/write-batch.lua b/c_glib/example/lua/write-batch.lua new file mode 100644 index 0000000000000..3a22cd57fd81e --- /dev/null +++ b/c_glib/example/lua/write-batch.lua @@ -0,0 +1,74 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. 
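+
+-- A sketch of the flow below: build a ten-column schema, append the
+-- same test values through one builder per column, write a four-row
+-- record batch, then write a second batch made of three-row column
+-- slices.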
+ +local lgi = require 'lgi' +local Arrow = lgi.Arrow + +local output_path = arg[1] or "/tmp/batch.arrow"; + +local fields = { + Arrow.Field.new("uint8", Arrow.UInt8DataType.new()), + Arrow.Field.new("uint16", Arrow.UInt16DataType.new()), + Arrow.Field.new("uint32", Arrow.UInt32DataType.new()), + Arrow.Field.new("uint64", Arrow.UInt64DataType.new()), + Arrow.Field.new("int8", Arrow.Int8DataType.new()), + Arrow.Field.new("int16", Arrow.Int16DataType.new()), + Arrow.Field.new("int32", Arrow.Int32DataType.new()), + Arrow.Field.new("int64", Arrow.Int64DataType.new()), + Arrow.Field.new("float", Arrow.FloatDataType.new()), + Arrow.Field.new("double", Arrow.DoubleDataType.new()), +} +local schema = Arrow.Schema.new(fields) + +local output = Arrow.FileOutputStream.open(output_path, false) +local writer = Arrow.FileWriter.open(output, schema) + +function build_array(builder, values) + for _, value in pairs(values) do + builder:append(value) + end + return builder:finish() +end + +local uints = {1, 2, 4, 8} +local ints = {1, -2, 4, -8} +local floats = {1.1, -2.2, 4.4, -8.8} +local columns = { + build_array(Arrow.UInt8ArrayBuilder.new(), uints), + build_array(Arrow.UInt16ArrayBuilder.new(), uints), + build_array(Arrow.UInt32ArrayBuilder.new(), uints), + build_array(Arrow.UInt64ArrayBuilder.new(), uints), + build_array(Arrow.Int8ArrayBuilder.new(), ints), + build_array(Arrow.Int16ArrayBuilder.new(), ints), + build_array(Arrow.Int32ArrayBuilder.new(), ints), + build_array(Arrow.Int64ArrayBuilder.new(), ints), + build_array(Arrow.FloatArrayBuilder.new(), floats), + build_array(Arrow.DoubleArrayBuilder.new(), floats), +} + +local record_batch = Arrow.RecordBatch.new(schema, 4, columns) +writer:write_record_batch(record_batch) + +local sliced_columns = {} +for i, column in pairs(columns) do + sliced_columns[i] = column:slice(1, 3) +end +record_batch = Arrow.RecordBatch.new(schema, 3, sliced_columns) +writer:write_record_batch(record_batch) + +writer:close() +output:close() diff --git a/c_glib/example/lua/write-stream.lua b/c_glib/example/lua/write-stream.lua new file mode 100644 index 0000000000000..37c6bb54cd8f4 --- /dev/null +++ b/c_glib/example/lua/write-stream.lua @@ -0,0 +1,74 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. See the License for the +-- specific language governing permissions and limitations +-- under the License. 
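+
+-- Same flow as write-batch.lua, but opened through Arrow.StreamWriter,
+-- which produces the streaming format rather than the random access
+-- file format.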
+ +local lgi = require 'lgi' +local Arrow = lgi.Arrow + +local output_path = arg[1] or "/tmp/stream.arrow"; + +local fields = { + Arrow.Field.new("uint8", Arrow.UInt8DataType.new()), + Arrow.Field.new("uint16", Arrow.UInt16DataType.new()), + Arrow.Field.new("uint32", Arrow.UInt32DataType.new()), + Arrow.Field.new("uint64", Arrow.UInt64DataType.new()), + Arrow.Field.new("int8", Arrow.Int8DataType.new()), + Arrow.Field.new("int16", Arrow.Int16DataType.new()), + Arrow.Field.new("int32", Arrow.Int32DataType.new()), + Arrow.Field.new("int64", Arrow.Int64DataType.new()), + Arrow.Field.new("float", Arrow.FloatDataType.new()), + Arrow.Field.new("double", Arrow.DoubleDataType.new()), +} +local schema = Arrow.Schema.new(fields) + +local output = Arrow.FileOutputStream.open(output_path, false) +local writer = Arrow.StreamWriter.open(output, schema) + +function build_array(builder, values) + for _, value in pairs(values) do + builder:append(value) + end + return builder:finish() +end + +local uints = {1, 2, 4, 8} +local ints = {1, -2, 4, -8} +local floats = {1.1, -2.2, 4.4, -8.8} +local columns = { + build_array(Arrow.UInt8ArrayBuilder.new(), uints), + build_array(Arrow.UInt16ArrayBuilder.new(), uints), + build_array(Arrow.UInt32ArrayBuilder.new(), uints), + build_array(Arrow.UInt64ArrayBuilder.new(), uints), + build_array(Arrow.Int8ArrayBuilder.new(), ints), + build_array(Arrow.Int16ArrayBuilder.new(), ints), + build_array(Arrow.Int32ArrayBuilder.new(), ints), + build_array(Arrow.Int64ArrayBuilder.new(), ints), + build_array(Arrow.FloatArrayBuilder.new(), floats), + build_array(Arrow.DoubleArrayBuilder.new(), floats), +} + +local record_batch = Arrow.RecordBatch.new(schema, 4, columns) +writer:write_record_batch(record_batch) + +local sliced_columns = {} +for i, column in pairs(columns) do + sliced_columns[i] = column:slice(1, 3) +end +record_batch = Arrow.RecordBatch.new(schema, 3, sliced_columns) +writer:write_record_batch(record_batch) + +writer:close() +output:close() From d4a2a75a50cb8ccaddb29bc5462ed3aa34af1d9f Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 30 Apr 2017 13:39:42 -0400 Subject: [PATCH 0587/1644] ARROW-910: [C++] Write 0 length at EOS in StreamWriter Author: Wes McKinney Closes #614 from wesm/ARROW-910 and squashes the following commits: e1ef336 [Wes McKinney] Write 0 length at EOS --- cpp/src/arrow/ipc/writer.cc | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/cpp/src/arrow/ipc/writer.cc b/cpp/src/arrow/ipc/writer.cc index 61caf6403c8dc..78d6b9eb92b4e 100644 --- a/cpp/src/arrow/ipc/writer.cc +++ b/cpp/src/arrow/ipc/writer.cc @@ -620,7 +620,11 @@ class StreamWriter::StreamWriterImpl { virtual Status Close() { // Write the schema if not already written // User is responsible for closing the OutputStream - return CheckStarted(); + RETURN_NOT_OK(CheckStarted()); + + // Write 0 EOS message + const int32_t kEos = 0; + return Write(reinterpret_cast(&kEos), sizeof(int32_t)); } Status CheckStarted() { From 6950e45db78934924a41c39b79bb2c99996d4d56 Mon Sep 17 00:00:00 2001 From: Phillip Cloud Date: Sun, 30 Apr 2017 23:01:12 -0400 Subject: [PATCH 0588/1644] ARROW-922: Allow Flatbuffers and RapidJSON to be used locally on Windows Author: Phillip Cloud Closes #621 from cpcloud/ARROW-922 and squashes the following commits: 0da7a56 [Phillip Cloud] Add parallel builds 8b377da [Phillip Cloud] ARROW-922: Allow Flatbuffers and RapidJSON to be used locally on Windows --- ci/msvc-build.bat | 33 +++++++++++-------------- cpp/cmake_modules/FindFlatbuffers.cmake | 16 
++++++------ 2 files changed, 23 insertions(+), 26 deletions(-) diff --git a/ci/msvc-build.bat b/ci/msvc-build.bat index aca1f8cc3c073..504da7638daa0 100644 --- a/ci/msvc-build.bat +++ b/ci/msvc-build.bat @@ -17,40 +17,37 @@ @echo on -set CONDA_ENV=C:\arrow-conda-env -set ARROW_HOME=C:\arrow-install - -conda create -p %CONDA_ENV% -q -y python=%PYTHON% ^ +conda create -n arrow -q -y python=%PYTHON% ^ six pytest setuptools numpy pandas cython -call activate %CONDA_ENV% +conda install -n arrow -q -y -c conda-forge flatbuffers rapidjson +call activate arrow + +set ARROW_HOME=%CONDA_PREFIX%\Library +set FLATBUFFERS_HOME=%CONDA_PREFIX%\Library +set RAPIDJSON_HOME=%CONDA_PREFIX%\Library @rem Build and test Arrow C++ libraries -cd cpp -mkdir build -cd build +mkdir cpp\build +cd cpp\build + cmake -G "%GENERATOR%" ^ - -DCMAKE_INSTALL_PREFIX=%ARROW_HOME% ^ + -DCMAKE_INSTALL_PREFIX=%CONDA_PREFIX%\Library ^ -DARROW_BOOST_USE_SHARED=OFF ^ -DCMAKE_BUILD_TYPE=Release ^ - -DARROW_CXXFLAGS="/WX" ^ - -DARROW_PYTHON=on ^ + -DARROW_CXXFLAGS="/WX /MP" ^ + -DARROW_PYTHON=ON ^ .. || exit /B cmake --build . --target INSTALL --config Release || exit /B @rem Needed so python-test.exe works -set PYTHONPATH=%CONDA_ENV%\Lib;%CONDA_ENV%\Lib\site-packages;%CONDA_ENV%\python35.zip;%CONDA_ENV%\DLLs;%CONDA_ENV% +set PYTHONPATH=%CONDA_PREFIX%\Lib;%CONDA_PREFIX%\Lib\site-packages;%CONDA_PREFIX%\python35.zip;%CONDA_PREFIX%\DLLs;%CONDA_PREFIX% ctest -VV || exit /B -set PYTHONPATH= - @rem Build and import pyarrow - -set PATH=%ARROW_HOME%\bin;%PATH% +set PYTHONPATH= cd ..\..\python python setup.py build_ext --inplace || exit /B -python -c "import pyarrow" || exit /B - py.test pyarrow -v -s || exit /B diff --git a/cpp/cmake_modules/FindFlatbuffers.cmake b/cpp/cmake_modules/FindFlatbuffers.cmake index 7fa640ac9542f..804f4797241da 100644 --- a/cpp/cmake_modules/FindFlatbuffers.cmake +++ b/cpp/cmake_modules/FindFlatbuffers.cmake @@ -33,18 +33,18 @@ if( NOT "${FLATBUFFERS_HOME}" STREQUAL "") file( TO_CMAKE_PATH "${FLATBUFFERS_HOME}" _native_path ) - list( APPEND _flatbuffers_roots ${_native_path} ) + list( APPEND _flatbuffers_roots "${_native_path}" ) elseif ( Flatbuffers_HOME ) - list( APPEND _flatbuffers_roots ${Flatbuffers_HOME} ) + list( APPEND _flatbuffers_roots "${Flatbuffers_HOME}" ) endif() # Try the parameterized roots, if they exist if ( _flatbuffers_roots ) find_path( FLATBUFFERS_INCLUDE_DIR NAMES flatbuffers/flatbuffers.h - PATHS ${_flatbuffers_roots} NO_DEFAULT_PATH + PATHS "${_flatbuffers_roots}" NO_DEFAULT_PATH PATH_SUFFIXES "include" ) find_library( FLATBUFFERS_LIBRARIES NAMES flatbuffers - PATHS ${_flatbuffers_roots} NO_DEFAULT_PATH + PATHS "${_flatbuffers_roots}" NO_DEFAULT_PATH PATH_SUFFIXES "lib" ) else () find_path( FLATBUFFERS_INCLUDE_DIR NAMES flatbuffers/flatbuffers.h ) @@ -52,7 +52,7 @@ else () endif () find_program(FLATBUFFERS_COMPILER flatc - ${FLATBUFFERS_HOME}/bin + "${FLATBUFFERS_HOME}/bin" /usr/local/bin /usr/bin NO_DEFAULT_PATH @@ -60,9 +60,9 @@ find_program(FLATBUFFERS_COMPILER flatc if (FLATBUFFERS_INCLUDE_DIR AND FLATBUFFERS_LIBRARIES) set(FLATBUFFERS_FOUND TRUE) - get_filename_component( FLATBUFFERS_LIBS ${FLATBUFFERS_LIBRARIES} PATH ) + get_filename_component( FLATBUFFERS_LIBS "${FLATBUFFERS_LIBRARIES}" PATH ) set(FLATBUFFERS_LIB_NAME libflatbuffers) - set(FLATBUFFERS_STATIC_LIB ${FLATBUFFERS_LIBS}/${FLATBUFFERS_LIB_NAME}.a) + set(FLATBUFFERS_STATIC_LIB "${FLATBUFFERS_LIBS}/${FLATBUFFERS_LIB_NAME}.a") else () set(FLATBUFFERS_FOUND FALSE) endif () @@ -75,7 +75,7 @@ else () if (NOT 
Flatbuffers_FIND_QUIETLY) set(FLATBUFFERS_ERR_MSG "Could not find the Flatbuffers library. Looked in ") if ( _flatbuffers_roots ) - set(FLATBUFFERS_ERR_MSG "${FLATBUFFERS_ERR_MSG} in ${_flatbuffers_roots}.") + set(FLATBUFFERS_ERR_MSG "${FLATBUFFERS_ERR_MSG} ${_flatbuffers_roots}.") else () set(FLATBUFFERS_ERR_MSG "${FLATBUFFERS_ERR_MSG} system search paths.") endif () From 8013cf3189f3be4785f3f88ee3fbcaea94bd4960 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Sun, 30 Apr 2017 23:04:01 -0400 Subject: [PATCH 0589/1644] ARROW-925: [GLib] Fix GArrowBufferReader test GArrowReadable#read API was changed. Author: Kouhei Sutou Closes #622 from kou/glib-fix-buffer-reader-test and squashes the following commits: e41374b [Kouhei Sutou] [GLib] Fix GArrowBufferReader test --- c_glib/test/test-buffer-reader.rb | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/c_glib/test/test-buffer-reader.rb b/c_glib/test/test-buffer-reader.rb index b3517b230e421..d05ed062ebdb7 100644 --- a/c_glib/test/test-buffer-reader.rb +++ b/c_glib/test/test-buffer-reader.rb @@ -19,8 +19,7 @@ class TestBufferReader < Test::Unit::TestCase def test_read buffer = Arrow::Buffer.new("Hello World") buffer_reader = Arrow::BufferReader.new(buffer) - read_buffer = " " * 5 - _success, n_read_bytes = buffer_reader.read(read_buffer) - assert_equal("Hello", read_buffer.byteslice(0, n_read_bytes)) + read_buffer = buffer_reader.read(5) + assert_equal("Hello", read_buffer.data.to_s) end end From c9e61cd77be59b0709610e60484df64c3810b4ca Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 1 May 2017 10:38:43 -0400 Subject: [PATCH 0590/1644] ARROW-926: Add wesm to KEYS Author: Wes McKinney Closes #623 from wesm/add-wesm-keys and squashes the following commits: 5892a67 [Wes McKinney] Add wesm to KEYS --- KEYS | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 60 insertions(+), 1 deletion(-) diff --git a/KEYS b/KEYS index ad0f5cf6f4340..05862c8f643ca 100644 --- a/KEYS +++ b/KEYS @@ -2,7 +2,7 @@ This file contains the PGP keys of various developers. Users: pgp < KEYS gpg --import KEYS -Developers: +Developers: pgp -kxa and append it to this file. (pgpk -ll && pgpk -xa ) >> this file. 
(gpg --list-sigs @@ -178,3 +178,62 @@ wfxeCKBeSpMuy3pvOnNy8uNYjNqizVlpNBx01I2R1MD8P14Pxteg6APi0jcusXrD s8g7c7dzdXM0lxreeXge8JSmxuwcCqVUswac6zbX4li03m/lov2YYxCwuw== =ESbR -----END PGP PUBLIC KEY BLOCK----- + +pub 4096R/1735623D 2017-05-01 +uid William Wesley McKinney (CODE SIGNING KEY) +sig 3 1735623D 2017-05-01 William Wesley McKinney (CODE SIGNING KEY) +sub 4096R/E83E9940 2017-05-01 +sig 1735623D 2017-05-01 William Wesley McKinney (CODE SIGNING KEY) + +-----BEGIN PGP PUBLIC KEY BLOCK----- +Version: GnuPG v1 + +mQINBFkGqIQBEACuKfRxQ2zjpWtuEpKTr0qhpucl5h57cnbPG8M2t2eAbl7fD6mD +ZyLePZEHSoNgUTqFTh8b850qD2b1loyuk6fx5mesweeWlSxt24Y5pXneH7WL/a8K +H81jl+Qy5J8DfG8oEnlQp8bPjb3n8xFgNkpt09kxj9lRhDCK0+M0lN/JRGK2BfTx +TCJWH2vC8Xh+apXmlSR5vohx7dj5RoFlIwNXsi+5JRkZCLoER8Fvozdq7qYNNmgL +a8l38VnW5fQkx1Pl0mMBi0d4XwFCY6W5BfzfAU3t+ujb0a/6ZzFHiW6q53Fct4BM +dMX91Xi73Myb3AF3x8dnv7E09dwXaShwUQu76WD/v7js1COS9o3SaCZfOdrJ9+KN +bYc2zuzXCWtDQ1GU07ocq2Z8VnhGC/qAUwOY9K0JagFOx7xV3gc8bkWqFII0XeCK +QBhKZHx7oFGz6bH2W/THLolbezwC7+0iuiWeDjY6y6Hk1/S25120wqdUfpa2QDlz +5V+ayyF8Lt77CnowYeMuDSFZzBjg67SpbbkyZJwKUtTJBUOLKiJF37QCAYENHthB +lmRgvOcCIic5cnJivgIs6Q7hCpFahWgr2g/6clu04YKFSaup+LU6F3UGvbKW6nnF +HRSsVFkof0+Ni+yT/oiQUAYyCbrfptpgUZXrVuee8d4frbPfKeiWd4MTrwARAQAB +tDxXaWxsaWFtIFdlc2xleSBNY0tpbm5leSAoQ09ERSBTSUdOSU5HIEtFWSkgPHdl +c21AYXBhY2hlLm9yZz6JAjcEEwEKACEFAlkGqIQCGwMFCwkIBwMFFQoJCAsFFgID +AQACHgECF4AACgkQ8QWIOhc1Yj1IQRAAm71yO273ulTxYlpFTN+CnTqTdxAQIGmc +gfS55/XmjKfQySQTKOfQPafJe7MazbVG/jG5CZeKHEgHvM0qi8vnAezzeTKEDHPP +Q1ziHyTt7ND+GbKChrLKA/lbgJkoBxKohyi6eQfz33cvh0fPsv8zej5M6+FAVJaA +GCMUS/yIC0Oiq0JgYH38sPOhNtw3z8pODg6WjJFWKHXw5qGng11/3BtTVu5KXzqf +85IJHqMgyOnU0r4mdKgqmSdaCpU/CMJlT3iflF5wN79c46FwAceCiYT8eJiWl1cB +wAV/mRhTzWGQkWVhE+6EK6+PyuzkjJgGhMtv3zuzKKN8iOv3eb7xptzZydEPqRFf +50f1cERfsf8um8W9IXQb60vrALyWwQFjF9B2oxsk28ZgzZ5ibA1xU9TJAS+iFo3e +eITPZnxxT3jZ2WQVWIQB8/yn0sAg5mLQ+Clcghik60KQsjAVS27QrlMTimK6eXey +tKTS4cw7LPo7GkuiBy3FuERX/ABXg8Wxd+EXOvLuZXNV/p9uBhU0w5tfaasnXFy0 +0NoKAVQ9ffW9MTV3CjrPakjHGLIzHgfFYuHnBdo67E3LR16kLcTusH8e8A3wYgbM +/gbXNS1C+i31ATNWfHaZtAFrUdzUvotDVo2UTw4nRqy27XBM9NVS+EwfwiZLWoLH +9gZEMGFQ0MW5Ag0EWQaohAEQAJnHTGcy0ol//23alysOuwYFHsS7PFizcCHuy4jv +iB8YR5Y4Ts8nAgo8gz2O2m9bgNfbFHStDoqOUWTV7ILYv/CDZiNhvR+fAeWl3Gmt +o72YYu85r+KZj22YfiXtfOb70IT7jYsTpjlgaqFUFHEHXzoa7EMscra6r+i1qDa9 +QfDjMIBaBCu/Q9CFfhHtIhBHV2Wt8IEJfgYSMHb24Db0RY+OSQixX4QRcVeSnJ5D +I4ZjA3//o9DwthrJf+GxW/f8TGZy6vtertnxLJXGoHFPVuI895m7wSfzt1+/2nlc +0obAMX4Q1yRYTOKQGPDeDZ7k5pxnhYkOHDf2gtY5wORw6vN9KR51YFJYXVmK+2zr +P0fKr0AUG3C7CwQp6bDeYaTndon8S9VNyPypvJ7lpxKy/DIujdvbaJHF3i4rI+w0 +veScfkGtLDc37OeVQEBV4vnHcMvDIC2SEtli4BZjwOcihOv3DgtmQnAjkkAZLtys +x/W4/MPoZiIWl0DnQev/ujwLkwHCYg/Oo7E70OKpdxDk/2cZyM1US2Uz2NQ4lo5O +8M+F9sMWj2EPX/kJxZpb6N/+xJnKf4oIdJkaammVllX0TGtoxGOadPST9D8gtSCr +yRdLMp0bB0+Ghbc+STGo78atg+J+HRvgzXG/gwaEiCIezuLB4W6rFjbldYfbeKTs +OoAlABEBAAGJAh8EGAEKAAkFAlkGqIQCGwwACgkQ8QWIOhc1Yj23pw/+JNWYULOd +uM4Khfyx3NgCLiX9VqmwZ7PQQsPKtxviQXdEgs+NJUrCePmjSV9Sf+exTZ4wqSTC +BilGUppAJbO9avR2wRkYbdiYW+g0jDwAD9cyfAiDBSUiRTimKsKqYN0PbIKJ2Ric +xvtBw4jW/f1lHkrySqOHetmFTe2ocXkFm8BjqDpt5XCoZa4ADcofNpRJYwVu0Uck +8MQ/wYjoNRZiz0Sjx9vOBVW9ZKMWS6RgnPStsK3UJiG3c7c83kpDx8nk4bUp8seY +cBjiViXh6QMXRPdlqsGEMiBVtyXF7Sy3cK3gUcH7808VmKMHEgWvq9MRrZoE0rLK +74pZrEuWnwD6o77w4DCBtKJyDNlR23kLObS+1Ur7fIXe2yXmbqwEmjpSX4H2Teth +77PU7nKMAkFsPJDNI7K/kEy3x7KM3G1gIcWaz3pL5gthLV+H3RfIojrK1hS7ZSSI +gCzYEkQCMsigT5YTgK5+n0I4U7zoDBd1sttwK2FahvuCKUDwc+ZiX/ciYiAjUMb9 +6yTNHlNr/H31EWVZMEd7+fhFZWXJjFsQD11GkXvy6vMBn3Kq+Vd7Yr4CJUGTV3rW +bWo1vt2ED7h5rbZTrS1UssxLUpy5iXrjyGwn2h/Ei9MzXpNvH8p2raf0eQ0Qn65Q 
+UoUryip3RD0yaMCyL/IK3KoPt74f2eJsFwM= +=feO2 +-----END PGP PUBLIC KEY BLOCK----- \ No newline at end of file From da523ce72524de6243b8ea3c40cf50f92d60ac3e Mon Sep 17 00:00:00 2001 From: Max Risuhin Date: Mon, 1 May 2017 16:38:21 -0400 Subject: [PATCH 0591/1644] ARROW-928: [C++] Detect supported MSVC versions Author: Max Risuhin Closes #625 from MaxRis/ARROW-928 and squashes the following commits: db81a27 [Max Risuhin] ARROW-928: [C++] Detect supported MSVC versions --- cpp/cmake_modules/CompilerInfo.cmake | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/cpp/cmake_modules/CompilerInfo.cmake b/cpp/cmake_modules/CompilerInfo.cmake index 3c603918a82ec..21e2dafba2e24 100644 --- a/cpp/cmake_modules/CompilerInfo.cmake +++ b/cpp/cmake_modules/CompilerInfo.cmake @@ -17,7 +17,11 @@ # # Sets COMPILER_FAMILY to 'clang' or 'gcc' # Sets COMPILER_VERSION to the version -execute_process(COMMAND "${CMAKE_CXX_COMPILER}" -v +if (NOT MSVC) + set(COMPILER_GET_VERSION_SWITCH "-v") +endif() + +execute_process(COMMAND "${CMAKE_CXX_COMPILER}" ${COMPILER_GET_VERSION_SWITCH} ERROR_VARIABLE COMPILER_VERSION_FULL) message(INFO "Compiler version: ${COMPILER_VERSION_FULL}") message(INFO "Compiler id: ${CMAKE_CXX_COMPILER_ID}") @@ -25,6 +29,13 @@ string(TOLOWER "${COMPILER_VERSION_FULL}" COMPILER_VERSION_FULL_LOWER) if(MSVC) set(COMPILER_FAMILY "msvc") + if ("${COMPILER_VERSION_FULL}" MATCHES ".*Microsoft \\(R\\) C/C\\+\\+ Optimizing Compiler Version 19.*x64") + string(REGEX REPLACE ".*Optimizing Compiler Version ([0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+).*" "\\1" + COMPILER_VERSION "${COMPILER_VERSION_FULL}") + elseif(NOT "${COMPILER_VERSION_FULL}" STREQUAL "") + message(FATAL_ERROR "Not supported MSVC compiler:\n${COMPILER_VERSION_FULL}\n" + "Supported MSVC versions: Visual Studio 2015 2017 x64") + endif() # clang on Linux and Mac OS X before 10.9 elseif("${COMPILER_VERSION_FULL}" MATCHES ".*clang version.*") From 569426b917007d7eb8f238d657184d5789527646 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Tue, 2 May 2017 11:27:19 -0400 Subject: [PATCH 0592/1644] ARROW-930: javadoc generation fails with java 8 Author: Julien Le Dem Closes #627 from julienledem/fix_javadoc and squashes the following commits: 4160d5b [Julien Le Dem] ARROW-930: javadoc generation fails with java 8 --- .../arrow/memory/AllocationManager.java | 22 ++++--- .../arrow/memory/AllocationReservation.java | 13 ++-- .../apache/arrow/memory/BufferAllocator.java | 10 ++- .../apache/arrow/memory/BufferManager.java | 8 --- .../arrow/memory/util/HistoricalLog.java | 27 +++----- .../templates/AbstractFieldReader.java | 63 +++++++++---------- .../AbstractPromotableFieldWriter.java | 7 +-- .../src/main/codegen/templates/ArrowType.java | 2 +- .../main/codegen/templates/BaseWriter.java | 6 +- .../main/codegen/templates/ComplexCopier.java | 4 +- .../codegen/templates/FixedValueVectors.java | 2 +- .../templates/NullableValueVectors.java | 7 +-- .../templates/VariableLengthVectors.java | 2 +- .../arrow/vector/SchemaChangeCallBack.java | 1 + .../org/apache/arrow/vector/ValueVector.java | 29 ++++++--- .../arrow/vector/VariableWidthVector.java | 4 +- .../org/apache/arrow/vector/VectorLoader.java | 2 +- .../complex/AbstractContainerVector.java | 2 + .../vector/complex/AbstractMapVector.java | 18 ++++-- .../complex/BaseRepeatedValueVector.java | 11 ++-- .../vector/complex/FixedSizeListVector.java | 13 ++-- .../vector/complex/RepeatedValueVector.java | 15 ++--- .../RepeatedVariableWidthVectorLike.java | 2 +- 
.../apache/arrow/vector/file/ArrowWriter.java | 10 +-- .../apache/arrow/vector/file/ReadChannel.java | 7 +++ .../vector/stream/ArrowStreamReader.java | 14 +++-- .../vector/stream/MessageSerializer.java | 47 +++++++++++++- .../arrow/vector/types/pojo/Schema.java | 2 +- .../arrow/vector/util/MapWithOrdinal.java | 14 +++-- .../util/OversizedAllocationException.java | 3 +- .../org/apache/arrow/vector/util/Text.java | 50 +++++++++++---- .../apache/arrow/vector/util/Validator.java | 4 ++ 32 files changed, 253 insertions(+), 168 deletions(-) diff --git a/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java b/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java index 683752e6a4980..70ca1dc32a1b3 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/AllocationManager.java @@ -18,15 +18,7 @@ package org.apache.arrow.memory; -import com.google.common.base.Preconditions; - -import io.netty.buffer.ArrowBuf; -import io.netty.buffer.PooledByteBufAllocatorL; -import io.netty.buffer.UnsafeDirectLittleEndian; - -import org.apache.arrow.memory.BaseAllocator.Verbosity; -import org.apache.arrow.memory.util.AutoCloseableLock; -import org.apache.arrow.memory.util.HistoricalLog; +import static org.apache.arrow.memory.BaseAllocator.indent; import java.util.IdentityHashMap; import java.util.concurrent.atomic.AtomicInteger; @@ -34,7 +26,15 @@ import java.util.concurrent.locks.ReadWriteLock; import java.util.concurrent.locks.ReentrantReadWriteLock; -import static org.apache.arrow.memory.BaseAllocator.indent; +import org.apache.arrow.memory.BaseAllocator.Verbosity; +import org.apache.arrow.memory.util.AutoCloseableLock; +import org.apache.arrow.memory.util.HistoricalLog; + +import com.google.common.base.Preconditions; + +import io.netty.buffer.ArrowBuf; +import io.netty.buffer.PooledByteBufAllocatorL; +import io.netty.buffer.UnsafeDirectLittleEndian; /** * Manages the relationship between one or more allocators and a particular UDLE. Ensures that @@ -328,6 +328,8 @@ private void inc() { * Decrement the ledger's reference count. If the ledger is decremented to zero, this ledger * should release its * ownership back to the AllocationManager + * @param decrement amout to decrease the reference count by + * @return the new reference count */ public int decrement(int decrement) { allocator.assertOpen(); diff --git a/java/memory/src/main/java/org/apache/arrow/memory/AllocationReservation.java b/java/memory/src/main/java/org/apache/arrow/memory/AllocationReservation.java index 7f5aa313779a7..b0ce574deefcb 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/AllocationReservation.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/AllocationReservation.java @@ -30,13 +30,14 @@ * For the purposes of airtight memory accounting, the reservation must be close()d whether it is * used or not. * This is not threadsafe. + *

*/ public interface AllocationReservation extends AutoCloseable { /** * Add to the current reservation. - * <p/> - * <p>Adding may fail if the allocator is not allowed to consume any more space. + * <p>Adding may fail if the allocator is not allowed to consume any more space.</p> * * @param nBytes the number of bytes to add * @return true if the addition is possible, false otherwise
@@ -46,8 +47,8 @@ public interface AllocationReservation extends AutoCloseable { /** * Requests a reservation of additional space. - * <p/> - * <p>The implementation of the allocator's inner class provides this. + * <p>The implementation of the allocator's inner class provides this.</p> * * @param nBytes the amount to reserve * @return true if the reservation can be satisfied, false otherwise
@@ -56,9 +57,9 @@ public interface AllocationReservation extends AutoCloseable { /** * Allocate a buffer whose size is the total of all the add()s made. - * <p/> + * * <p>The allocation request can still fail, even if the amount of space - * requested is available, if the allocation cannot be made contiguously. + * requested is available, if the allocation cannot be made contiguously.</p>
* * @return the buffer, or null, if the request cannot be satisfied * @throws IllegalStateException if called more than once diff --git a/java/memory/src/main/java/org/apache/arrow/memory/BufferAllocator.java b/java/memory/src/main/java/org/apache/arrow/memory/BufferAllocator.java index c05e9acb0aa96..8a40441863889 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/BufferAllocator.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/BufferAllocator.java @@ -70,9 +70,9 @@ public interface BufferAllocator extends AutoCloseable { /** * Close and release all buffers generated from this buffer pool. - * <p/> + * * <p>When assertions are on, complains if there are any outstanding buffers; to avoid - * that, release all buffers before the allocator is closed. + * that, release all buffers before the allocator is closed.</p>
    */ @Override public void close(); @@ -116,7 +116,8 @@ public interface BufferAllocator extends AutoCloseable { /** * Create an allocation reservation. A reservation is a way of building up * a request for a buffer whose size is not known in advance. See - * {@see AllocationReservation}. + * + * @see AllocationReservation * * @return the newly created reservation */ @@ -127,6 +128,7 @@ public interface BufferAllocator extends AutoCloseable { * special because we don't * worry about them leaking or managing reference counts on them since they don't actually * point to any memory. + * @return the empty buffer */ public ArrowBuf getEmpty(); @@ -134,6 +136,7 @@ public interface BufferAllocator extends AutoCloseable { * Return the name of this allocator. This is a human readable name that can help debugging. * Typically provides * coordinates about where this allocator was created + * @return the name of the allocator */ public String getName(); @@ -142,6 +145,7 @@ public interface BufferAllocator extends AutoCloseable { * that an allocator is * over its limit, all consumers of that allocator should aggressively try to addrss the * overlimit situation. + * @return whether or not this allocator (or one if its parents) is over its limits */ public boolean isOverLimit(); diff --git a/java/memory/src/main/java/org/apache/arrow/memory/BufferManager.java b/java/memory/src/main/java/org/apache/arrow/memory/BufferManager.java index 2fe763e10aff9..3075ebeef996f 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/BufferManager.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/BufferManager.java @@ -25,14 +25,6 @@ * re-allocation the old buffer will be freed. Managing a list of these buffers * prevents some parts of the system from needing to define a correct location * to place the final call to free them. - *

    - * The current uses of these types of buffers are within the pluggable components of Drill. - * In UDFs, memory management should not be a concern. We provide access to re-allocatable - * ArrowBufs to give UDF writers general purpose buffers we can account for. To prevent the need - * for UDFs to contain boilerplate to close all of the buffers they request, this list - * is tracked at a higher level and all of the buffers are freed once we are sure that - * the code depending on them is done executing (currently {@link FragmentContext} - * and {@link QueryContext}. */ public interface BufferManager extends AutoCloseable { diff --git a/java/memory/src/main/java/org/apache/arrow/memory/util/HistoricalLog.java b/java/memory/src/main/java/org/apache/arrow/memory/util/HistoricalLog.java index c464598bfb856..0452dc9adf256 100644 --- a/java/memory/src/main/java/org/apache/arrow/memory/util/HistoricalLog.java +++ b/java/memory/src/main/java/org/apache/arrow/memory/util/HistoricalLog.java @@ -18,11 +18,11 @@ package org.apache.arrow.memory.util; -import org.slf4j.Logger; - import java.util.Arrays; import java.util.LinkedList; +import org.slf4j.Logger; + /** * Utility class that can be used to log activity within a class * for later logging and debugging. Supports recording events and @@ -98,19 +98,20 @@ public synchronized void recordEvent(final String noteFormat, Object... args) { * events with their stack traces. * * @param sb {@link StringBuilder} to write to + * @param includeStackTrace whether to include the stacktrace of each event in the history */ public void buildHistory(final StringBuilder sb, boolean includeStackTrace) { buildHistory(sb, 0, includeStackTrace); } /** - * - * @param sb - * @param indent - * @param includeStackTrace + * build the history and write it to sb + * @param sb output + * @param indent starting indent (usually "") + * @param includeStackTrace whether to include the stacktrace of each event. */ - public synchronized void buildHistory(final StringBuilder sb, int indent, boolean - includeStackTrace) { + public synchronized void buildHistory( + final StringBuilder sb, int indent, boolean includeStackTrace) { final char[] indentation = new char[indent]; final char[] innerIndentation = new char[indent + 2]; Arrays.fill(indentation, ' '); @@ -150,16 +151,6 @@ public synchronized void buildHistory(final StringBuilder sb, int indent, boolea } } - /** - * Write the history of this object to the given {@link StringBuilder}. The history - * includes the identifying string provided at construction time, and all the recorded - * events with their stack traces. - * - * @param sb {@link StringBuilder} to write to - * @param additional an extra string that will be written between the identifying - * information and the history; often used for a current piece of state - */ - /** * Write the history of this object to the given {@link Logger}. The history * includes the identifying string provided at construction time, and all the recorded diff --git a/java/vector/src/main/codegen/templates/AbstractFieldReader.java b/java/vector/src/main/codegen/templates/AbstractFieldReader.java index e0d0fc9715ba2..79d4c122f0e4e 100644 --- a/java/vector/src/main/codegen/templates/AbstractFieldReader.java +++ b/java/vector/src/main/codegen/templates/AbstractFieldReader.java @@ -26,16 +26,19 @@ <#include "/@includes/vv_imports.ftl" /> +/* + * This class is generated using freemarker and the ${.template_name} template. 
+ */ @SuppressWarnings("unused") abstract class AbstractFieldReader extends AbstractBaseReader implements FieldReader{ - + AbstractFieldReader(){ super(); } /** * Returns true if the current value of the reader is not null - * @return + * @return whether the current value is set */ public boolean isSet() { return true; @@ -52,78 +55,74 @@ public Field getField() { "Text", "String", "Byte", "Short", "byte[]"] as friendlyType> <#assign safeType=friendlyType /> <#if safeType=="byte[]"><#assign safeType="ByteArray" /> - - public ${friendlyType} read${safeType}(int arrayIndex){ + public ${friendlyType} read${safeType}(int arrayIndex) { fail("read${safeType}(int arrayIndex)"); return null; } - - public ${friendlyType} read${safeType}(){ + + public ${friendlyType} read${safeType}() { fail("read${safeType}()"); return null; } - + - - public void copyAsValue(MapWriter writer){ + public void copyAsValue(MapWriter writer) { fail("CopyAsValue MapWriter"); } - public void copyAsField(String name, MapWriter writer){ + + public void copyAsField(String name, MapWriter writer) { fail("CopyAsField MapWriter"); } - public void copyAsField(String name, ListWriter writer){ + public void copyAsField(String name, ListWriter writer) { fail("CopyAsFieldList"); } - + <#list vv.types as type><#list type.minor as minor><#assign name = minor.class?cap_first /> <#assign boxedType = (minor.boxedType!type.boxedType) /> - - public void read(${name}Holder holder){ + public void read(${name}Holder holder) { fail("${name}"); } - public void read(Nullable${name}Holder holder){ + public void read(Nullable${name}Holder holder) { fail("${name}"); } - - public void read(int arrayIndex, ${name}Holder holder){ + + public void read(int arrayIndex, ${name}Holder holder) { fail("Repeated${name}"); } - - public void read(int arrayIndex, Nullable${name}Holder holder){ + + public void read(int arrayIndex, Nullable${name}Holder holder) { fail("Repeated${name}"); } - - public void copyAsValue(${name}Writer writer){ + + public void copyAsValue(${name}Writer writer) { fail("CopyAsValue${name}"); } - public void copyAsField(String name, ${name}Writer writer){ + + public void copyAsField(String name, ${name}Writer writer) { fail("CopyAsField${name}"); } + - - public FieldReader reader(String name){ + public FieldReader reader(String name) { fail("reader(String name)"); return null; } - public FieldReader reader(){ + public FieldReader reader() { fail("reader()"); return null; - } - - public int size(){ + + public int size() { fail("size()"); return -1; } - - private void fail(String name){ + + private void fail(String name) { throw new IllegalArgumentException(String.format("You tried to read a [%s] type when you are using a field reader of type [%s].", name, this.getClass().getSimpleName())); } - - } diff --git a/java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java b/java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java index 60dd0c7b7adf8..ada0b1d5c7816 100644 --- a/java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java +++ b/java/vector/src/main/codegen/templates/AbstractPromotableFieldWriter.java @@ -39,14 +39,13 @@ abstract class AbstractPromotableFieldWriter extends AbstractFieldWriter { /** * Retrieve the FieldWriter, promoting if it is not a FieldWriter of the specified type - * @param type - * @return + * @param type the type of the values we want to write + * @return the corresponding field writer */ abstract protected FieldWriter getWriter(MinorType type); /** - * Return 
the current FieldWriter - * @return + * @return the current FieldWriter */ abstract protected FieldWriter getWriter(); diff --git a/java/vector/src/main/codegen/templates/ArrowType.java b/java/vector/src/main/codegen/templates/ArrowType.java index a9e875a2095f7..dc99aad0bb3a2 100644 --- a/java/vector/src/main/codegen/templates/ArrowType.java +++ b/java/vector/src/main/codegen/templates/ArrowType.java @@ -100,7 +100,7 @@ private ArrowTypeID(byte flatbufType) { /** * to visit the ArrowTypes * - * type.accept(new ArrowTypeVisitor() { + * type.accept(new ArrowTypeVisitor<Type>() { * ... * }); * diff --git a/java/vector/src/main/codegen/templates/BaseWriter.java b/java/vector/src/main/codegen/templates/BaseWriter.java index 08bd39eae2358..3da02b00a0dbf 100644 --- a/java/vector/src/main/codegen/templates/BaseWriter.java +++ b/java/vector/src/main/codegen/templates/BaseWriter.java @@ -30,7 +30,7 @@ * File generated from ${.template_name} using FreeMarker. */ @SuppressWarnings("unused") - public interface BaseWriter extends AutoCloseable, Positionable { +public interface BaseWriter extends AutoCloseable, Positionable { int getValueCapacity(); public interface MapWriter extends BaseWriter { @@ -39,12 +39,12 @@ public interface MapWriter extends BaseWriter { /** * Whether this writer is a map writer and is empty (has no children). - * + * *

<p> * Intended only for use in determining whether to add dummy vector to * avoid empty (zero-column) schema, as in JsonReader. * </p>
    - * + * @return whether the map is empty */ boolean isEmptyMap(); diff --git a/java/vector/src/main/codegen/templates/ComplexCopier.java b/java/vector/src/main/codegen/templates/ComplexCopier.java index 89368ce6e0b96..fb7ae0f2ef57e 100644 --- a/java/vector/src/main/codegen/templates/ComplexCopier.java +++ b/java/vector/src/main/codegen/templates/ComplexCopier.java @@ -34,8 +34,8 @@ public class ComplexCopier { /** * Do a deep copy of the value in input into output - * @param in - * @param out + * @param input field to read from + * @param output field to write to */ public static void copy(FieldReader input, FieldWriter output) { writeValue(input, output); diff --git a/java/vector/src/main/codegen/templates/FixedValueVectors.java b/java/vector/src/main/codegen/templates/FixedValueVectors.java index 5c09e30c71487..05faaae1e9e2f 100644 --- a/java/vector/src/main/codegen/templates/FixedValueVectors.java +++ b/java/vector/src/main/codegen/templates/FixedValueVectors.java @@ -150,7 +150,7 @@ public boolean allocateNewSafe() { * * Note that the maximum number of values a vector can allocate is Integer.MAX_VALUE / value width. * - * @param valueCount + * @param valueCount the number of values to allocate for * @throws org.apache.arrow.memory.OutOfMemoryException if it can't allocate the new buffer */ @Override diff --git a/java/vector/src/main/codegen/templates/NullableValueVectors.java b/java/vector/src/main/codegen/templates/NullableValueVectors.java index 31adc2bdd0789..76d2bad36bc18 100644 --- a/java/vector/src/main/codegen/templates/NullableValueVectors.java +++ b/java/vector/src/main/codegen/templates/NullableValueVectors.java @@ -421,9 +421,8 @@ public final class Accessor extends BaseDataValueVector.BaseAccessor <#if type.m /** * Get the element at the specified position. * - * @param index position of the value - * @return value of the element, if not null - * @throws NullValueException if the value is null + * @param index position of the value + * @return value of the element, if not null */ public <#if type.major == "VarLen">byte[]<#else>${minor.javaType!type.javaType} get(int index) { if (isNull(index)) { @@ -509,7 +508,7 @@ public void setIndexDefined(int index){ * Set the variable length element at the specified index to the supplied byte array. * * @param index position of the bit to set - * @param bytes array of bytes to write + * @param value array of bytes (or int if smaller than 4 bytes) to write */ public void set(int index, <#if type.major == "VarLen">byte[]<#elseif (type.width < 4)>int<#else>${minor.javaType!type.javaType} value) { setCount++; diff --git a/java/vector/src/main/codegen/templates/VariableLengthVectors.java b/java/vector/src/main/codegen/templates/VariableLengthVectors.java index 11f0cc894d004..3d933addb6208 100644 --- a/java/vector/src/main/codegen/templates/VariableLengthVectors.java +++ b/java/vector/src/main/codegen/templates/VariableLengthVectors.java @@ -139,7 +139,7 @@ public int getCurrentSizeInBytes() { /** * Return the number of bytes contained in the current var len byte vector. 
- * @return + * @return the number of bytes contained in the current var len byte vector */ public int getVarByteLength(){ final int valueCount = getAccessor().getValueCount(); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/SchemaChangeCallBack.java b/java/vector/src/main/java/org/apache/arrow/vector/SchemaChangeCallBack.java index fc0a066749a91..6fdcda20480f8 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/SchemaChangeCallBack.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/SchemaChangeCallBack.java @@ -42,6 +42,7 @@ public void doWork() { /** * Returns the value of schema-changed state, resetting the * schema-changed state to {@code false}. + * @return the previous schema-changed state */ public boolean getSchemaChangedAndReset() { final boolean current = schemaChanged; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java index 685b0be010a08..2e83836b64626 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/ValueVector.java @@ -52,7 +52,7 @@ * * This interface "should" strive to guarantee this order of operation: *
- * allocate > mutate > setvaluecount > access > clear (or allocate to start the process over). + * allocate &gt; mutate &gt; setvaluecount &gt; access &gt; clear (or allocate to start the process over). *
    */ public interface ValueVector extends Closeable, Iterable { @@ -84,6 +84,7 @@ public interface ValueVector extends Closeable, Iterable { /** * Returns the maximum number of values that can be stored in this vector instance. + * @return the maximum number of values that can be stored in this vector instance. */ int getValueCapacity(); @@ -100,13 +101,16 @@ public interface ValueVector extends Closeable, Iterable { /** * Get information about how this field is materialized. + * @return the field corresponding to this vector */ Field getField(); MinorType getMinorType(); /** - * Returns a {@link org.apache.arrow.vector.util.TransferPair transfer pair}, creating a new target vector of + * to transfer quota responsibility + * @param allocator the target allocator + * @return a {@link org.apache.arrow.vector.util.TransferPair transfer pair}, creating a new target vector of * the same type. */ TransferPair getTransferPair(BufferAllocator allocator); @@ -116,31 +120,33 @@ public interface ValueVector extends Closeable, Iterable { TransferPair getTransferPair(String ref, BufferAllocator allocator, CallBack callBack); /** - * Returns a new {@link org.apache.arrow.vector.util.TransferPair transfer pair} that is used to transfer underlying + * makes a new transfer pair used to transfer underlying buffers + * @param target the target for the transfer + * @return a new {@link org.apache.arrow.vector.util.TransferPair transfer pair} that is used to transfer underlying * buffers into the target vector. */ TransferPair makeTransferPair(ValueVector target); /** - * Returns an {@link org.apache.arrow.vector.ValueVector.Accessor accessor} that is used to read from this vector + * @return an {@link org.apache.arrow.vector.ValueVector.Accessor accessor} that is used to read from this vector * instance. */ Accessor getAccessor(); /** - * Returns an {@link org.apache.arrow.vector.ValueVector.Mutator mutator} that is used to write to this vector + * @return an {@link org.apache.arrow.vector.ValueVector.Mutator mutator} that is used to write to this vector * instance. */ Mutator getMutator(); /** - * Returns a {@link org.apache.arrow.vector.complex.reader.FieldReader field reader} that supports reading values + * @return a {@link org.apache.arrow.vector.complex.reader.FieldReader field reader} that supports reading values * from this vector. */ FieldReader getReader(); /** - * Returns the number of bytes that is used by this vector instance. + * @return the number of bytes that is used by this vector instance. */ int getBufferSize(); @@ -177,21 +183,23 @@ interface Accessor { * * @param index * Index of the value to get + * @return the friendly java type */ Object getObject(int index); /** - * Returns the number of values that is stored in this vector. + * @return the number of values that is stored in this vector. */ int getValueCount(); /** - * Returns true if the value at the given index is null, false otherwise. + * @param index the index to check for nullity + * @return true if the value at the given index is null, false otherwise. */ boolean isNull(int index); /** - * Returns the number of null values + * @return the number of null values */ int getNullCount(); } @@ -214,6 +222,7 @@ interface Mutator { /** * @deprecated this has nothing to do with value vector abstraction and should be removed. 
+ * @param values the number of values to generate */ @Deprecated void generateTestData(int values); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VariableWidthVector.java b/java/vector/src/main/java/org/apache/arrow/vector/VariableWidthVector.java index 971a241adafc2..ed164b548b5bd 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/VariableWidthVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/VariableWidthVector.java @@ -17,8 +17,6 @@ */ package org.apache.arrow.vector; -import io.netty.buffer.ArrowBuf; - public interface VariableWidthVector extends ValueVector{ /** @@ -31,7 +29,7 @@ public interface VariableWidthVector extends ValueVector{ /** * Provide the maximum amount of variable width bytes that can be stored in this vector. - * @return + * @return the byte capacity of this vector */ int getByteCapacity(); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java b/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java index 76de250e0e972..33a608cd92922 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/VectorLoader.java @@ -50,7 +50,7 @@ public VectorLoader(VectorSchemaRoot root) { /** * Loads the record batch in the vectors * will not close the record batch - * @param recordBatch + * @param recordBatch the batch to load */ public void load(ArrowRecordBatch recordBatch) { Iterator buffers = recordBatch.getBuffers().iterator(); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java index 71f2bea5b8fe1..7f8e6796285fd 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractContainerVector.java @@ -58,6 +58,8 @@ public BufferAllocator getAllocator() { /** * Returns a {@link org.apache.arrow.vector.ValueVector} corresponding to the given field name if exists or null. + * @param name the name of the child to return + * @return the corresponding FieldVector */ public FieldVector getChild(String name) { return getChild(name, FieldVector.class); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java index 15e8a5bc624ac..4b6d82cc8b291 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/AbstractMapVector.java @@ -151,14 +151,19 @@ private boolean nullFilled(ValueVector vector) { /** * Returns a {@link org.apache.arrow.vector.ValueVector} corresponding to the given ordinal identifier. + * @param id the ordinal of the child to return + * @return the corresponding child */ public ValueVector getChildByOrdinal(int id) { return vectors.getByOrdinal(id); } /** - * Returns a {@link org.apache.arrow.vector.ValueVector} instance of subtype of corresponding to the given + * Returns a {@link org.apache.arrow.vector.ValueVector} instance of subtype of T corresponding to the given * field name if exists or null. 
+ * @param name the name of the child to return + * @param clazz the expected type of the child + * @return the child corresponding to this name */ @Override public T getChild(String name, Class clazz) { @@ -186,6 +191,8 @@ protected ValueVector add(String childName, FieldType fieldType) { * Inserts the vector with the given name if it does not exist else replaces it with the new value. * * Note that this method does not enforce any vector type check nor throws a schema change exception. + * @param name the name of the child to add + * @param vector the vector to add as a child */ protected void putChild(String name, FieldVector vector) { putVector(name, vector); @@ -208,7 +215,7 @@ protected void putVector(String name, FieldVector vector) { } /** - * Returns a sequence of underlying child vectors. + * @return a sequence of underlying child vectors. */ protected List getChildren() { int size = vectors.size(); @@ -228,7 +235,7 @@ protected List getChildFieldNames() { } /** - * Returns the number of underlying child vectors. + * @return the number of underlying child vectors. */ @Override public int size() { @@ -241,7 +248,7 @@ public Iterator iterator() { } /** - * Returns a list of scalar child vectors recursing the entire vector hierarchy. + * @return a list of scalar child vectors recursing the entire vector hierarchy. */ public List getPrimitiveVectors() { final List primitiveVectors = Lists.newArrayList(); @@ -257,7 +264,8 @@ public List getPrimitiveVectors() { } /** - * Returns a vector with its corresponding ordinal mapping if field exists or null. + * @param name the name of the child to return + * @return a vector with its corresponding ordinal mapping if field exists or null. */ @Override public VectorWithOrdinal getChildVectorWithOrdinal(String name) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java index c9a9319c69154..5ff4c2c8172c3 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/BaseRepeatedValueVector.java @@ -20,10 +20,6 @@ import java.util.Collections; import java.util.Iterator; -import com.google.common.base.Preconditions; -import com.google.common.collect.ObjectArrays; - -import io.netty.buffer.ArrowBuf; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.AddOrGetResult; import org.apache.arrow.vector.BaseValueVector; @@ -35,6 +31,11 @@ import org.apache.arrow.vector.util.CallBack; import org.apache.arrow.vector.util.SchemaChangeRuntimeException; +import com.google.common.base.Preconditions; +import com.google.common.collect.ObjectArrays; + +import io.netty.buffer.ArrowBuf; + public abstract class BaseRepeatedValueVector extends BaseValueVector implements RepeatedValueVector { public final static FieldVector DEFAULT_DATA_VECTOR = ZeroVector.INSTANCE; @@ -151,7 +152,7 @@ public ArrowBuf[] getBuffers(boolean clear) { } /** - * Returns 1 if inner vector is explicitly set via #addOrGetVector else 0 + * @return 1 if inner vector is explicitly set via #addOrGetVector else 0 */ public int size() { return vector == DEFAULT_DATA_VECTOR ? 
0:1; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/FixedSizeListVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/FixedSizeListVector.java index 7ac9f3bd5137f..0dceeed50d484 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/FixedSizeListVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/FixedSizeListVector.java @@ -26,11 +26,6 @@ import java.util.List; import java.util.Objects; -import com.google.common.base.Preconditions; -import com.google.common.collect.ImmutableList; -import com.google.common.collect.ObjectArrays; - -import io.netty.buffer.ArrowBuf; import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.memory.OutOfMemoryException; import org.apache.arrow.vector.AddOrGetResult; @@ -53,6 +48,12 @@ import org.apache.arrow.vector.util.SchemaChangeRuntimeException; import org.apache.arrow.vector.util.TransferPair; +import com.google.common.base.Preconditions; +import com.google.common.collect.ImmutableList; +import com.google.common.collect.ObjectArrays; + +import io.netty.buffer.ArrowBuf; + public class FixedSizeListVector extends BaseValueVector implements FieldVector, PromotableVector { private FieldVector vector; @@ -236,7 +237,7 @@ public ArrowBuf[] getBuffers(boolean clear) { } /** - * Returns 1 if inner vector is explicitly set via #addOrGetVector else 0 + * @return 1 if inner vector is explicitly set via #addOrGetVector else 0 */ public int size() { return vector == ZeroVector.INSTANCE ? 0 : 1; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedValueVector.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedValueVector.java index b01a4e7cf49d4..de58eda0b11a2 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedValueVector.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedValueVector.java @@ -33,14 +33,12 @@ public interface RepeatedValueVector extends ValueVector { final static int DEFAULT_REPEAT_PER_RECORD = 5; /** - * Returns the underlying offset vector or null if none exists. - * - * TODO(DRILL-2995): eliminate exposing low-level interfaces. + * @return the underlying offset vector or null if none exists. */ UInt4Vector getOffsetVector(); /** - * Returns the underlying data vector or null if none exists. + * @return the underlying data vector or null if none exists. */ ValueVector getDataVector(); @@ -52,22 +50,21 @@ public interface RepeatedValueVector extends ValueVector { interface RepeatedAccessor extends ValueVector.Accessor { /** - * Returns total number of cells that vector contains. - * * The result includes empty, null valued cells. + * @return total number of cells that vector contains. */ int getInnerValueCount(); /** - * Returns number of cells that the value at the given index contains. + * @param index the index of the value for which we want the size + * @return number of cells that the value at the given index contains. */ int getInnerValueCountAt(int index); /** - * Returns true if the value at the given index is empty, false otherwise. - * * @param index value index + * @return true if the value at the given index is empty, false otherwise. 
*/ boolean isEmpty(int index); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedVariableWidthVectorLike.java b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedVariableWidthVectorLike.java index 93b744e108719..29f9d75c74671 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedVariableWidthVectorLike.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/complex/RepeatedVariableWidthVectorLike.java @@ -29,7 +29,7 @@ public interface RepeatedVariableWidthVectorLike { /** * Provide the maximum amount of variable width bytes that can be stored int his vector. - * @return + * @return the byte capacity */ int getByteCapacity(); } diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java index 60a6afb565318..1716287f722ff 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ArrowWriter.java @@ -25,8 +25,6 @@ import java.util.List; import java.util.Map; -import com.google.common.collect.ImmutableList; - import org.apache.arrow.vector.FieldVector; import org.apache.arrow.vector.VectorSchemaRoot; import org.apache.arrow.vector.VectorUnloader; @@ -42,6 +40,8 @@ import org.slf4j.Logger; import org.slf4j.LoggerFactory; +import com.google.common.collect.ImmutableList; + public abstract class ArrowWriter implements AutoCloseable { private static final Logger LOGGER = LoggerFactory.getLogger(ArrowWriter.class); @@ -62,9 +62,9 @@ public abstract class ArrowWriter implements AutoCloseable { /** * Note: fields are not closed when the writer is closed * - * @param root - * @param provider - * @param out + * @param root the vectors to write to the output + * @param provider where to find the dictionaries + * @param out the output where to write */ protected ArrowWriter(VectorSchemaRoot root, DictionaryProvider provider, WritableByteChannel out) { this.unloader = new VectorUnloader(root); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/file/ReadChannel.java b/java/vector/src/main/java/org/apache/arrow/vector/file/ReadChannel.java index b062f3826eab3..87450e38f6852 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/file/ReadChannel.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/file/ReadChannel.java @@ -42,6 +42,9 @@ public ReadChannel(ReadableByteChannel in) { /** * Reads bytes into buffer until it is full (buffer.remaining() == 0). Returns the * number of bytes read which can be less than full if there are no more. + * @param buffer The buffer to read to + * @return the number of byte read + * @throws IOException if nit enough bytes left to read */ public int readFully(ByteBuffer buffer) throws IOException { LOGGER.debug("Reading buffer with size: " + buffer.remaining()); @@ -58,6 +61,10 @@ public int readFully(ByteBuffer buffer) throws IOException { /** * Reads up to len into buffer. Returns bytes read. 
+ * @param buffer the buffer to read to + * @param l the amount of bytes to read + * @return the number of bytes read + * @throws IOException if nit enough bytes left to read */ public int readFully(ArrowBuf buffer, int l) throws IOException { int n = readFully(buffer.nioBuffer(buffer.writerIndex(), l)); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamReader.java b/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamReader.java index 2deef37cd4e56..641978a516ae4 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamReader.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/stream/ArrowStreamReader.java @@ -17,17 +17,17 @@ */ package org.apache.arrow.vector.stream; +import java.io.IOException; +import java.io.InputStream; +import java.nio.channels.Channels; +import java.nio.channels.ReadableByteChannel; + import org.apache.arrow.memory.BufferAllocator; import org.apache.arrow.vector.file.ArrowReader; import org.apache.arrow.vector.file.ReadChannel; import org.apache.arrow.vector.schema.ArrowMessage; import org.apache.arrow.vector.types.pojo.Schema; -import java.io.IOException; -import java.io.InputStream; -import java.nio.channels.Channels; -import java.nio.channels.ReadableByteChannel; - /** * This classes reads from an input stream and produces ArrowRecordBatches. */ @@ -35,6 +35,8 @@ public class ArrowStreamReader extends ArrowReader { /** * Constructs a streaming read, reading bytes from 'in'. Non-blocking. + * @param in the stream to read from + * @param allocator to allocate new buffers */ public ArrowStreamReader(ReadableByteChannel in, BufferAllocator allocator) { super(new ReadChannel(in), allocator); @@ -46,6 +48,8 @@ public ArrowStreamReader(InputStream in, BufferAllocator allocator) { /** * Reads the schema message from the beginning of the stream. + * @param in to allocate new buffers + * @return the deserialized arrow schema */ @Override protected Schema readSchema(ReadChannel in) throws IOException { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java b/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java index 228ab613466d2..2fd93749976c6 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/stream/MessageSerializer.java @@ -22,9 +22,6 @@ import java.util.ArrayList; import java.util.List; -import com.google.flatbuffers.FlatBufferBuilder; - -import io.netty.buffer.ArrowBuf; import org.apache.arrow.flatbuf.Buffer; import org.apache.arrow.flatbuf.DictionaryBatch; import org.apache.arrow.flatbuf.FieldNode; @@ -43,6 +40,10 @@ import org.apache.arrow.vector.schema.ArrowRecordBatch; import org.apache.arrow.vector.types.pojo.Schema; +import com.google.flatbuffers.FlatBufferBuilder; + +import io.netty.buffer.ArrowBuf; + /** * Utility class for serializing Messages. Messages are all serialized a similar way. * 1. 4 byte little endian message header prefix @@ -69,6 +70,10 @@ public static int bytesToInt(byte[] bytes) { /** * Serialize a schema object. 
+ * @param out where to write the schema + * @param schema the object to serialize to out + * @return the resulting size of the serialized schema + * @throws IOException if something went wrong */ public static long serialize(WriteChannel out, Schema schema) throws IOException { FlatBufferBuilder builder = new FlatBufferBuilder(); @@ -81,6 +86,9 @@ public static long serialize(WriteChannel out, Schema schema) throws IOException /** * Deserializes a schema object. Format is from serialize(). + * @param in the channel to deserialize from + * @return the deserialized object + * @throws IOException if something went wrong */ public static Schema deserializeSchema(ReadChannel in) throws IOException { Message message = deserializeMessage(in); @@ -98,6 +106,10 @@ public static Schema deserializeSchema(ReadChannel in) throws IOException { /** * Serializes an ArrowRecordBatch. Returns the offset and length of the written batch. + * @param out where to write the batch + * @param batch the object to serialize to out + * @return the serialized block metadata + * @throws IOException if something went wrong */ public static ArrowBlock serialize(WriteChannel out, ArrowRecordBatch batch) throws IOException { @@ -153,6 +165,11 @@ private static long writeBatchBuffers(WriteChannel out, ArrowRecordBatch batch) /** * Deserializes a RecordBatch + * @param in the channel to deserialize from + * @param message the object to derialize to + * @param alloc to allocate buffers + * @return the deserialized object + * @throws IOException if something went wrong */ private static ArrowRecordBatch deserializeRecordBatch(ReadChannel in, Message message, BufferAllocator alloc) throws IOException { @@ -171,6 +188,11 @@ private static ArrowRecordBatch deserializeRecordBatch(ReadChannel in, Message m /** * Deserializes a RecordBatch knowing the size of the entire message up front. This * minimizes the number of reads to the underlying stream. + * @param in the channel to deserialize from + * @param block the object to derialize to + * @param alloc to allocate buffers + * @return the deserialized object + * @throws IOException if something went wrong */ public static ArrowRecordBatch deserializeRecordBatch(ReadChannel in, ArrowBlock block, BufferAllocator alloc) throws IOException { @@ -231,6 +253,10 @@ private static ArrowRecordBatch deserializeRecordBatch(RecordBatch recordBatchFB /** * Serializes a dictionary ArrowRecordBatch. Returns the offset and length of the written batch. + * @param out where to serialize + * @param batch the batch to serialize + * @return the metadata of the serialized block + * @throws IOException if something went wrong */ public static ArrowBlock serialize(WriteChannel out, ArrowDictionaryBatch batch) throws IOException { long start = out.getCurrentPosition(); @@ -264,6 +290,11 @@ public static ArrowBlock serialize(WriteChannel out, ArrowDictionaryBatch batch) /** * Deserializes a DictionaryBatch + * @param in where to read from + * @param message the message message metadata to deserialize + * @param alloc the allocator for new buffers + * @return the corresponding dictionary batch + * @throws IOException if something went wrong */ private static ArrowDictionaryBatch deserializeDictionaryBatch(ReadChannel in, Message message, @@ -284,6 +315,11 @@ private static ArrowDictionaryBatch deserializeDictionaryBatch(ReadChannel in, /** * Deserializes a DictionaryBatch knowing the size of the entire message up front. This * minimizes the number of reads to the underlying stream. 
+ * @param in where to read from + * @param block block metadata for deserializing + * @param alloc to allocate new buffers + * @return the corresponding dictionary + * @throws IOException if something went wrong */ public static ArrowDictionaryBatch deserializeDictionaryBatch(ReadChannel in, ArrowBlock block, @@ -331,6 +367,11 @@ public static ArrowMessage deserializeMessageBatch(ReadChannel in, BufferAllocat /** * Serializes a message header. + * @param builder to write the flatbuf to + * @param headerType headerType field + * @param headerOffset header offset field + * @param bodyLength body length field + * @return the corresponding ByteBuffer */ private static ByteBuffer serializeMessage(FlatBufferBuilder builder, byte headerType, int headerOffset, int bodyLength) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java index c33bd6e6e61b0..cede3e801c1e4 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/pojo/Schema.java @@ -44,7 +44,7 @@ public class Schema { /** - * @param the list of the fields + * @param fields the list of the fields * @param name the name of the field to return * @return the corresponding field * @throws IllegalArgumentException if the field was not found diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/MapWithOrdinal.java b/java/vector/src/main/java/org/apache/arrow/vector/util/MapWithOrdinal.java index d7f9d382e4865..b35aaa401bae4 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/util/MapWithOrdinal.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/MapWithOrdinal.java @@ -24,16 +24,18 @@ import java.util.Map; import java.util.Set; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + import com.google.common.base.Function; import com.google.common.base.Preconditions; import com.google.common.collect.Iterables; import com.google.common.collect.Lists; import com.google.common.collect.Maps; import com.google.common.collect.Sets; + import io.netty.util.collection.IntObjectHashMap; import io.netty.util.collection.IntObjectMap; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; /** * An implementation of map that supports constant time look-up by a generic key or an ordinal. @@ -194,7 +196,7 @@ public V get(Object key) { * assignment. A new ordinal is assigned if key does not exists. Otherwise the same ordinal is re-used but the value * is replaced. * - * {@see java.util.Map#put} + * @see java.util.Map#put */ @Override public V put(K key, V value) { @@ -217,11 +219,11 @@ public boolean containsValue(Object value) { } /** - * Removes the element corresponding to the key if exists extending the semantics of {@link Map#remove} with ordinal + * Removes the element corresponding to the key if exists extending the semantics of {@link java.util.Map#remove} with ordinal * re-cycling. The ordinal corresponding to the given key may be re-assigned to another tuple. It is important that - * consumer checks the ordinal value via {@link #getOrdinal(Object)} before attempting to look-up by ordinal. + * consumer checks the ordinal value via {@link org.apache.arrow.vector.util.MapWithOrdinal#getOrdinal(Object)} before attempting to look-up by ordinal. 
* - * {@see java.util.Map#remove} + * @see java.util.Map#remove */ @Override public V remove(Object key) { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/OversizedAllocationException.java b/java/vector/src/main/java/org/apache/arrow/vector/util/OversizedAllocationException.java index ec628b22c2d90..bd7396249a72c 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/util/OversizedAllocationException.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/OversizedAllocationException.java @@ -22,8 +22,7 @@ * An exception that is used to signal that allocation request in bytes is greater than the maximum allowed by * {@link org.apache.arrow.memory.BufferAllocator#buffer(int) allocator}. * - *

<p>Operators should handle this exception to split the batch and later resume the execution on the next - * {@link RecordBatch#next() iteration}.</p> + * <p>Operators should handle this exception to split the batch and later resume the execution on the next iteration.</p>
    * */ public class OversizedAllocationException extends RuntimeException { diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/Text.java b/java/vector/src/main/java/org/apache/arrow/vector/util/Text.java index 3db4358ea9155..ce82f445ad883 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/util/Text.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/Text.java @@ -74,18 +74,22 @@ public Text() { /** * Construct from a string. + * @param string initialize from that string */ public Text(String string) { set(string); } - /** Construct from another text. */ + /** Construct from another text. + * @param utf8 initialize from that Text + */ public Text(Text utf8) { set(utf8); } /** * Construct from a byte array. + * @param utf8 initialize from that byte array */ public Text(byte[] utf8) { set(utf8); @@ -94,6 +98,7 @@ public Text(byte[] utf8) { /** * Get a copy of the bytes that is exactly the length of the data. See {@link #getBytes()} for faster access to the * underlying array. + * @return a copy of the underlying array */ public byte[] copyBytes() { byte[] result = new byte[length]; @@ -104,12 +109,13 @@ public byte[] copyBytes() { /** * Returns the raw bytes; however, only data up to {@link #getLength()} is valid. Please use {@link #copyBytes()} if * you need the returned array to be precisely the length of the data. + * @return the underlying array */ public byte[] getBytes() { return bytes; } - /** Returns the number of bytes in the byte array */ + /** @return the number of bytes in the byte array */ public int getLength() { return length; } @@ -118,6 +124,7 @@ public int getLength() { * Returns the Unicode Scalar Value (32-bit integer value) for the character at position. Note that this * method avoids using the converter or doing String instantiation * + * @param position the index of the char we want to retrieve * @return the Unicode scalar value at position or -1 if the position is invalid or points to a trailing byte */ public int charAt(int position) { @@ -143,6 +150,8 @@ public int find(String what) { * starting position is measured in bytes and the return value is in terms of byte position in the buffer. The backing * buffer is not converted to a string for this operation. * + * @param what the string to search for + * @param start where to start from * @return byte position of the first occurence of the search string in the UTF-8 buffer or -1 if not found */ public int find(String what, int start) { @@ -187,6 +196,7 @@ public int find(String what, int start) { /** * Set to contain the contents of a string. + * @param string the string to initialize from */ public void set(String string) { try { @@ -200,12 +210,14 @@ public void set(String string) { /** * Set to a utf8 byte array + * @param utf8 the byte array to initialize from */ public void set(byte[] utf8) { set(utf8, 0, utf8.length); } - /** copy a text. */ + /** copy a text. + * @param other the text to initialize from */ public void set(Text other) { set(other.getBytes(), 0, other.getLength()); } @@ -253,13 +265,12 @@ public void clear() { length = 0; } - /* + /** * Sets the capacity of this Text object to at least len bytes. If the current buffer is longer, * then the capacity and existing content of the buffer are unchanged. If len is larger than the current * capacity, the Text object's capacity is increased to match. 
* * @param len the number of bytes we need - * * @param keepData should the old data be kept */ private void setCapacity(int len, boolean keepData) { @@ -272,11 +283,6 @@ private void setCapacity(int len, boolean keepData) { } } - /** - * Convert text back to string - * - * @see java.lang.Object#toString() - */ @Override public String toString() { try { @@ -289,6 +295,9 @@ public String toString() { /** * Read a Text object whose length is already known. This allows creating Text from a stream which uses a different * serialization format. + * @param in the input to initialize from + * @param len how many bytes to read from in + * @throws IOException if something bad happens */ public void readWithKnownLength(DataInput in, int len) throws IOException { setCapacity(len, false); @@ -296,7 +305,6 @@ public void readWithKnownLength(DataInput in, int len) throws IOException { length = len; } - /** Returns true iff o is a Text with the same contents. */ @Override public boolean equals(Object o) { if (o == this) { @@ -326,7 +334,7 @@ public boolean equals(Object o) { /** * Copied from Arrays.hashCode so we don't have to copy the byte array * - * @return + * @return hashCode */ @Override public int hashCode() { @@ -346,6 +354,9 @@ public int hashCode() { /** * Converts the provided byte array to a String using the UTF-8 encoding. If the input is malformed, replace by a * default value. + * @param utf8 bytes to decode + * @return the decoded string + * @throws CharacterCodingException if this is not valid UTF-8 */ public static String decode(byte[] utf8) throws CharacterCodingException { return decode(ByteBuffer.wrap(utf8), true); @@ -360,6 +371,12 @@ public static String decode(byte[] utf8, int start, int length) * Converts the provided byte array to a String using the UTF-8 encoding. If replace is true, then * malformed input is replaced with the substitution character, which is U+FFFD. Otherwise the method throws a * MalformedInputException. + * @param utf8 the bytes to decode + * @param start where to start from + * @param length length of the bytes to decode + * @param replace whether to replace malformed characters with U+FFFD + * @return the decoded string + * @throws CharacterCodingException if the input could not be decoded */ public static String decode(byte[] utf8, int start, int length, boolean replace) throws CharacterCodingException { @@ -387,9 +404,10 @@ private static String decode(ByteBuffer utf8, boolean replace) * Converts the provided String to bytes using the UTF-8 encoding. If the input is malformed, invalid chars are * replaced by a default value. * + * @param string the string to encode * @return ByteBuffer: bytes stores at ByteBuffer.array() and length is ByteBuffer.limit() + * @throws CharacterCodingException if the string could not be encoded */ - public static ByteBuffer encode(String string) throws CharacterCodingException { return encode(string, true); @@ -400,7 +418,11 @@ public static ByteBuffer encode(String string) * input is replaced with the substitution character, which is U+FFFD. Otherwise the method throws a * MalformedInputException. 
* + + * @param string the string to encode + * @param replace whether to replace malformed characters with U+FFFD + * @return ByteBuffer: bytes stored at ByteBuffer.array() and length is ByteBuffer.limit() + * @throws CharacterCodingException if the string could not be encoded */ public static ByteBuffer encode(String string, boolean replace) throws CharacterCodingException { @@ -553,6 +575,8 @@ public static void validateUTF8(byte[] utf8, int start, int len) /** * Returns the next code point at the current position in the buffer. The buffer's position will be incremented. Any * mark set on this buffer will be changed by this method! + * @param bytes the incoming bytes + * @return the corresponding Unicode code point */ public static int bytesToCodePoint(ByteBuffer bytes) { bytes.mark(); diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/Validator.java b/java/vector/src/main/java/org/apache/arrow/vector/util/Validator.java index f294e20b029c5..3035576da3327 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/util/Validator.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/Validator.java @@ -36,6 +36,8 @@ public class Validator { /** * Validate two arrow schemas are equal. * + * @param schema1 the 1st schema to compare + * @param schema2 the 2nd schema to compare * @throws IllegalArgumentException if they are different. */ public static void compareSchemas(Schema schema1, Schema schema2) { @@ -47,6 +49,8 @@ public static void compareSchemas(Schema schema1, Schema schema2) { /** * Validate two arrow vectorSchemaRoot are equal. * + * @param root1 the 1st VectorSchemaRoot to compare + * @param root2 the 2nd VectorSchemaRoot to compare * @throws IllegalArgumentException if they are different. */ public static void compareVectorSchemaRoot(VectorSchemaRoot root1, VectorSchemaRoot root2) { From 02a121f18b6b5a34b63dd8d2bf7b1955ac7e11b2 Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Tue, 2 May 2017 13:40:47 -0400 Subject: [PATCH 0593/1644] ARROW-927: C++/Python: Add manylinux1 builds to Travis matrix Also pre-builds flatbuffers and gtest so that they are only built once. Author: Uwe L. Korn Closes #624 from xhochy/ARROW-927 and squashes the following commits: c901b7a [Uwe L. Korn] Separate scripts folder 0dd4e08 [Uwe L. Korn] Add scripts folder e705813 [Uwe L. Korn] Move boost and openssl to scripts 1b44878 [Uwe L. Korn] Add base image for thirdparties f4ff321 [Uwe L.
Korn] ARROW-927: C++/Python: Add manylinux1 builds to Travis matrix --- .travis.yml | 11 +++++ python/manylinux1/Dockerfile-x86_64 | 41 +--------------- python/manylinux1/Dockerfile-x86_64_base | 52 +++++++++++++++++++++ python/manylinux1/build_arrow.sh | 3 +- python/manylinux1/scripts/build_boost.sh | 21 +++++++++ python/manylinux1/scripts/build_jemalloc.sh | 21 +++++++++ python/manylinux1/scripts/build_openssl.sh | 21 +++++++++ 7 files changed, 128 insertions(+), 42 deletions(-) create mode 100644 python/manylinux1/Dockerfile-x86_64_base create mode 100755 python/manylinux1/scripts/build_boost.sh create mode 100755 python/manylinux1/scripts/build_jemalloc.sh create mode 100755 python/manylinux1/scripts/build_openssl.sh diff --git a/.travis.yml b/.travis.yml index 6ebebd4513fc7..19e71ae1e68f0 100644 --- a/.travis.yml +++ b/.travis.yml @@ -19,6 +19,8 @@ addons: - gtk-doc-tools - autoconf-archive - libgirepository1.0-dev +services: + - docker cache: ccache: true @@ -54,6 +56,15 @@ matrix: script: - $TRAVIS_BUILD_DIR/ci/travis_script_cpp.sh - $TRAVIS_BUILD_DIR/ci/travis_script_python.sh + - language: cpp + before_script: + - docker pull quay.io/xhochy/arrow_manylinux1_x86_64_base:ARROW-927 + script: | + pushd python/manylinux1 + git clone ../../ arrow + docker build -t arrow-base-x86_64 -f Dockerfile-x86_64 . + docker run --rm -v $PWD:/io arrow-base-x86_64 /io/build_arrow.sh + ls -l dist/ - language: java os: linux jdk: oraclejdk7 diff --git a/python/manylinux1/Dockerfile-x86_64 b/python/manylinux1/Dockerfile-x86_64 index 56b27ad2ae808..8f55ba7e1deed 100644 --- a/python/manylinux1/Dockerfile-x86_64 +++ b/python/manylinux1/Dockerfile-x86_64 @@ -9,46 +9,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. See accompanying LICENSE file. - -FROM quay.io/pypa/manylinux1_x86_64:latest - -# Install dependencies -RUN yum install -y flex zlib-devel - -# Build a newer OpenSSL version to support Thrift 0.10.0, note that we don't trigger the SSL code in Arrow. 
-WORKDIR / -RUN wget --no-check-certificate https://www.openssl.org/source/openssl-1.0.2k.tar.gz -O openssl-1.0.2k.tar.gz -RUN tar xf openssl-1.0.2k.tar.gz -WORKDIR openssl-1.0.2k -RUN ./config -fpic shared --prefix=/usr -RUN make -j5 -RUN make install - -WORKDIR / -RUN wget --no-check-certificate http://downloads.sourceforge.net/project/boost/boost/1.60.0/boost_1_60_0.tar.gz -O /boost_1_60_0.tar.gz -RUN tar xf boost_1_60_0.tar.gz -WORKDIR /boost_1_60_0 -RUN ./bootstrap.sh -RUN ./bjam cxxflags=-fPIC cflags=-fPIC --prefix=/usr --with-filesystem --with-date_time --with-system --with-regex install - -WORKDIR / -RUN wget https://github.com/jemalloc/jemalloc/releases/download/4.4.0/jemalloc-4.4.0.tar.bz2 -O jemalloc-4.4.0.tar.bz2 -RUN tar xf jemalloc-4.4.0.tar.bz2 -WORKDIR /jemalloc-4.4.0 -RUN ./configure -RUN make -j5 -RUN make install - -WORKDIR / -# Install cmake manylinux1 package -RUN /opt/python/cp35-cp35m/bin/pip install cmake -RUN ln -s /opt/python/cp35-cp35m/bin/cmake /usr/bin/cmake - -WORKDIR / -RUN git clone https://github.com/matthew-brett/multibuild.git -WORKDIR /multibuild -RUN git checkout ffe59955ad8690c2f8bb74766cb7e9b0d0ee3963 - +FROM quay.io/xhochy/arrow_manylinux1_x86_64_base:ARROW-927 ADD arrow /arrow WORKDIR /arrow/cpp diff --git a/python/manylinux1/Dockerfile-x86_64_base b/python/manylinux1/Dockerfile-x86_64_base new file mode 100644 index 0000000000000..e38296d78de1c --- /dev/null +++ b/python/manylinux1/Dockerfile-x86_64_base @@ -0,0 +1,52 @@ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. +FROM quay.io/pypa/manylinux1_x86_64:latest + +# Install dependencies +RUN yum install -y flex zlib-devel + +ADD scripts/build_openssl.sh / +RUN /build_openssl.sh + +ADD scripts/build_boost.sh / +RUN /build_boost.sh + +ADD scripts/build_jemalloc.sh / +RUN /build_jemalloc.sh + +WORKDIR / +# Install cmake manylinux1 package +RUN /opt/python/cp35-cp35m/bin/pip install cmake +RUN ln -s /opt/python/cp35-cp35m/bin/cmake /usr/bin/cmake + +WORKDIR / +RUN git clone https://github.com/matthew-brett/multibuild.git +WORKDIR /multibuild +RUN git checkout ffe59955ad8690c2f8bb74766cb7e9b0d0ee3963 + +WORKDIR / +RUN wget https://github.com/google/googletest/archive/release-1.7.0.tar.gz -O googletest-release-1.7.0.tar.gz +RUN tar xf googletest-release-1.7.0.tar.gz +WORKDIR /googletest-release-1.7.0 +RUN cmake -DCMAKE_CXX_FLAGS='-fPIC' -Dgtest_force_shared_crt=ON . 
+RUN make -j5 +ENV GTEST_HOME /googletest-release-1.7.0 + +WORKDIR / +RUN wget https://github.com/google/flatbuffers/archive/v1.6.0.tar.gz -O flatbuffers-1.6.0.tar.gz +RUN tar xf flatbuffers-1.6.0.tar.gz +WORKDIR /flatbuffers-1.6.0 +RUN cmake "-DCMAKE_CXX_FLAGS=-fPIC" "-DCMAKE_INSTALL_PREFIX:PATH=/usr" "-DFLATBUFFERS_BUILD_TESTS=OFF" +RUN make -j5 +RUN make install +ENV FLATBUFFERS_HOME /usr + diff --git a/python/manylinux1/build_arrow.sh b/python/manylinux1/build_arrow.sh index 8ef087c7d262f..a11d3d41f49f7 100755 --- a/python/manylinux1/build_arrow.sh +++ b/python/manylinux1/build_arrow.sh @@ -31,7 +31,6 @@ cd /arrow/python # PyArrow build configuration export PYARROW_BUILD_TYPE='release' -export PYARROW_CMAKE_OPTIONS='-DPYARROW_BUILD_TESTS=ON' export PYARROW_WITH_PARQUET=1 export PYARROW_WITH_JEMALLOC=1 export PYARROW_BUNDLE_ARROW_CPP=1 @@ -51,7 +50,7 @@ for PYTHON in ${PYTHON_VERSIONS}; do echo "=== (${PYTHON}) Installing build dependencies ===" $PIPI_IO "numpy==1.9.0" - $PIPI_IO "cython==0.24" + $PIPI_IO "cython==0.25.2" $PIPI_IO "pandas==0.19.2" echo "=== (${PYTHON}) Building Arrow C++ libraries ===" diff --git a/python/manylinux1/scripts/build_boost.sh b/python/manylinux1/scripts/build_boost.sh new file mode 100755 index 0000000000000..6a313366494c6 --- /dev/null +++ b/python/manylinux1/scripts/build_boost.sh @@ -0,0 +1,21 @@ +#!/bin/bash -ex +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + + +wget --no-check-certificate http://downloads.sourceforge.net/project/boost/boost/1.60.0/boost_1_60_0.tar.gz -O /boost_1_60_0.tar.gz +tar xf boost_1_60_0.tar.gz +pushd /boost_1_60_0 +./bootstrap.sh +./bjam cxxflags=-fPIC cflags=-fPIC --prefix=/usr --with-filesystem --with-date_time --with-system --with-regex install +popd +rm -rf boost_1_60_0.tar.gz boost_1_60_0 diff --git a/python/manylinux1/scripts/build_jemalloc.sh b/python/manylinux1/scripts/build_jemalloc.sh new file mode 100755 index 0000000000000..8153baa097e52 --- /dev/null +++ b/python/manylinux1/scripts/build_jemalloc.sh @@ -0,0 +1,21 @@ +#!/bin/bash -ex +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. 
+ +wget https://github.com/jemalloc/jemalloc/releases/download/4.4.0/jemalloc-4.4.0.tar.bz2 -O jemalloc-4.4.0.tar.bz2 +tar xf jemalloc-4.4.0.tar.bz2 +pushd /jemalloc-4.4.0 +./configure +make -j5 +make install +popd +rm -rf jemalloc-4.4.0.tar.bz2 jemalloc-4.4.0 diff --git a/python/manylinux1/scripts/build_openssl.sh b/python/manylinux1/scripts/build_openssl.sh new file mode 100755 index 0000000000000..3bcb2b9a053a9 --- /dev/null +++ b/python/manylinux1/scripts/build_openssl.sh @@ -0,0 +1,21 @@ +#!/bin/bash -ex +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + +wget --no-check-certificate https://www.openssl.org/source/openssl-1.0.2k.tar.gz -O openssl-1.0.2k.tar.gz +tar xf openssl-1.0.2k.tar.gz +pushd openssl-1.0.2k +./config -fpic shared --prefix=/usr +make -j5 +make install +popd +rm -rf openssl-1.0.2k.tar.gz openssl-1.0.2k From f1bd49d5bb32636f5a8bbe8d26b4269b678dec55 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Tue, 2 May 2017 13:49:37 -0400 Subject: [PATCH 0594/1644] ARROW-931: [GLib] Reconstruct input stream * GArrowRandomAccessFile -> GArrowSeekableInputStream * GArrowBufferReader -> GArrowBufferInputStream * GArrowMemoryMappedFile -> GArrowMemoryMappedInputStream Author: Kouhei Sutou Closes #628 from kou/glib-reconstruct-input-stream and squashes the following commits: 3488cad [Kouhei Sutou] [GLib] Reconstruct input stream --- c_glib/arrow-glib/Makefile.am | 6 - c_glib/arrow-glib/arrow-glib.h | 2 - c_glib/arrow-glib/file-reader.cpp | 12 +- c_glib/arrow-glib/file-reader.h | 6 +- c_glib/arrow-glib/input-stream.cpp | 211 +++++++++++--- c_glib/arrow-glib/input-stream.h | 163 +++++++++-- c_glib/arrow-glib/input-stream.hpp | 8 +- c_glib/arrow-glib/memory-mapped-file.cpp | 270 ------------------ c_glib/arrow-glib/memory-mapped-file.h | 72 ----- c_glib/arrow-glib/memory-mapped-file.hpp | 28 -- c_glib/arrow-glib/random-access-file.cpp | 124 -------- c_glib/arrow-glib/random-access-file.h | 53 ---- c_glib/arrow-glib/random-access-file.hpp | 38 --- c_glib/doc/reference/arrow-glib-docs.sgml | 2 - c_glib/example/lua/read-batch.lua | 2 +- c_glib/example/lua/read-stream.lua | 2 +- c_glib/example/read-batch.c | 9 +- c_glib/example/read-stream.c | 6 +- ...-reader.rb => test-buffer-input-stream.rb} | 6 +- c_glib/test/test-file-writer.rb | 2 +- c_glib/test/test-memory-mapped-file.rb | 134 --------- .../test/test-memory-mapped-input-stream.rb | 82 ++++++ c_glib/test/test-stream-writer.rb | 2 +- 23 files changed, 420 insertions(+), 820 deletions(-) delete mode 100644 c_glib/arrow-glib/memory-mapped-file.cpp delete mode 100644 c_glib/arrow-glib/memory-mapped-file.h delete mode 100644 c_glib/arrow-glib/memory-mapped-file.hpp delete mode 100644 c_glib/arrow-glib/random-access-file.cpp delete mode 100644 c_glib/arrow-glib/random-access-file.h delete mode 100644 c_glib/arrow-glib/random-access-file.hpp rename c_glib/test/{test-buffer-reader.rb => test-buffer-input-stream.rb} (85%) delete mode 100644 c_glib/test/test-memory-mapped-file.rb create mode 100644 
c_glib/test/test-memory-mapped-input-stream.rb diff --git a/c_glib/arrow-glib/Makefile.am b/c_glib/arrow-glib/Makefile.am index 54fb7f8c7a799..242507273f451 100644 --- a/c_glib/arrow-glib/Makefile.am +++ b/c_glib/arrow-glib/Makefile.am @@ -60,9 +60,7 @@ libarrow_glib_la_headers += \ file.h \ file-mode.h \ input-stream.h \ - memory-mapped-file.h \ output-stream.h \ - random-access-file.h \ readable.h \ writeable.h \ writeable-file.h @@ -102,9 +100,7 @@ libarrow_glib_la_sources += \ file.cpp \ file-mode.cpp \ input-stream.cpp \ - memory-mapped-file.cpp \ output-stream.cpp \ - random-access-file.cpp \ readable.cpp \ writeable.cpp \ writeable-file.cpp @@ -136,9 +132,7 @@ libarrow_glib_la_cpp_headers += \ file.hpp \ file-mode.hpp \ input-stream.hpp \ - memory-mapped-file.hpp \ output-stream.hpp \ - random-access-file.hpp \ readable.hpp \ writeable.hpp \ writeable-file.hpp diff --git a/c_glib/arrow-glib/arrow-glib.h b/c_glib/arrow-glib/arrow-glib.h index e88b66b6ae9b2..0a06cb824dc85 100644 --- a/c_glib/arrow-glib/arrow-glib.h +++ b/c_glib/arrow-glib/arrow-glib.h @@ -36,9 +36,7 @@ #include #include #include -#include #include -#include #include #include #include diff --git a/c_glib/arrow-glib/file-reader.cpp b/c_glib/arrow-glib/file-reader.cpp index b952b52ddbe6d..bbba5a1ede7b2 100644 --- a/c_glib/arrow-glib/file-reader.cpp +++ b/c_glib/arrow-glib/file-reader.cpp @@ -27,7 +27,7 @@ #include #include -#include +#include #include #include @@ -132,19 +132,21 @@ garrow_file_reader_class_init(GArrowFileReaderClass *klass) /** * garrow_file_reader_open: - * @file: The file to be read. + * @input_stream: The seekable input stream to read data from. * @error: (nullable): Return location for a #GError or %NULL. * * Returns: (nullable) (transfer full): A newly opened * #GArrowFileReader or %NULL on error. */ GArrowFileReader * -garrow_file_reader_open(GArrowRandomAccessFile *file, - GError **error) +garrow_file_reader_open(GArrowSeekableInputStream *input_stream, + GError **error) { + auto arrow_random_access_file = + garrow_seekable_input_stream_get_raw(input_stream); std::shared_ptr arrow_file_reader; auto status = - arrow::ipc::FileReader::Open(garrow_random_access_file_get_raw(file), + arrow::ipc::FileReader::Open(arrow_random_access_file, &arrow_file_reader); if (garrow_error_check(error, status, "[ipc][file-reader][open]")) { return garrow_file_reader_new_raw(&arrow_file_reader); diff --git a/c_glib/arrow-glib/file-reader.h b/c_glib/arrow-glib/file-reader.h index 084f7148ed903..b737269a2945b 100644 --- a/c_glib/arrow-glib/file-reader.h +++ b/c_glib/arrow-glib/file-reader.h @@ -22,7 +22,7 @@ #include #include -#include +#include #include @@ -70,8 +70,8 @@ struct _GArrowFileReaderClass GType garrow_file_reader_get_type(void) G_GNUC_CONST; -GArrowFileReader *garrow_file_reader_open(GArrowRandomAccessFile *file, - GError **error); +GArrowFileReader *garrow_file_reader_open(GArrowSeekableInputStream *input_stream, + GError **error); GArrowSchema *garrow_file_reader_get_schema(GArrowFileReader *file_reader); guint garrow_file_reader_get_n_record_batches(GArrowFileReader *file_reader); diff --git a/c_glib/arrow-glib/input-stream.cpp b/c_glib/arrow-glib/input-stream.cpp index 56b811ad1c368..b931cf8250607 100644 --- a/c_glib/arrow-glib/input-stream.cpp +++ b/c_glib/arrow-glib/input-stream.cpp @@ -28,7 +28,6 @@ #include #include #include -#include #include G_BEGIN_DECLS @@ -41,7 +40,13 @@ * * #GArrowInputStream is a base class for input stream.
* - #GArrowBufferReader is a class for buffer input stream. + * #GArrowSeekableInputStream is a base class for input streams that + * support random access. + * + * #GArrowBufferInputStream is a class to read data from a buffer. + * + * #GArrowMemoryMappedInputStream is a class to read data from a file by + * mapping the file into memory. It supports zero copy. */ typedef struct GArrowInputStreamPrivate_ { @@ -168,71 +173,174 @@ garrow_input_stream_class_init(GArrowInputStreamClass *klass) } -static std::shared_ptr -garrow_buffer_reader_get_raw_random_access_file_interface(GArrowRandomAccessFile *random_access_file) +G_DEFINE_TYPE(GArrowSeekableInputStream, \ + garrow_seekable_input_stream, \ + GARROW_TYPE_INPUT_STREAM); + +static void +garrow_seekable_input_stream_init(GArrowSeekableInputStream *object) { - auto input_stream = GARROW_INPUT_STREAM(random_access_file); - auto arrow_input_stream = garrow_input_stream_get_raw(input_stream); - auto arrow_buffer_reader = - std::static_pointer_cast(arrow_input_stream); - return arrow_buffer_reader; } static void -garrow_buffer_reader_random_access_file_interface_init(GArrowRandomAccessFileInterface *iface) +garrow_seekable_input_stream_class_init(GArrowSeekableInputStreamClass *klass) +{ +} + +/** + * garrow_seekable_input_stream_get_size: + * @input_stream: A #GArrowSeekableInputStream. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: The size of the file. + */ +guint64 +garrow_seekable_input_stream_get_size(GArrowSeekableInputStream *input_stream, + GError **error) +{ + auto arrow_random_access_file = + garrow_seekable_input_stream_get_raw(input_stream); + int64_t size; + auto status = arrow_random_access_file->GetSize(&size); + if (garrow_error_check(error, status, "[seekable-input-stream][get-size]")) { + return size; + } else { + return 0; + } +} + +/** + * garrow_seekable_input_stream_get_support_zero_copy: + * @input_stream: A #GArrowSeekableInputStream. + * + * Returns: Whether zero copy read is supported or not. + */ +gboolean +garrow_seekable_input_stream_get_support_zero_copy(GArrowSeekableInputStream *input_stream) +{ + auto arrow_random_access_file = + garrow_seekable_input_stream_get_raw(input_stream); + return arrow_random_access_file->supports_zero_copy(); +} + +/** + * garrow_seekable_input_stream_read_at: + * @input_stream: A #GArrowSeekableInputStream. + * @position: The read start position. + * @n_bytes: The number of bytes to be read. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: (transfer full) (nullable): #GArrowBuffer that has read + * data on success, %NULL if there was an error.
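As a usage sketch of the seekable API above: the hypothetical helper below is not part of the commit; it assumes only the garrow_seekable_input_stream_* functions introduced in this patch plus plain GLib, and works on any #GArrowSeekableInputStream subclass.

#include <arrow-glib/arrow-glib.h>

/* Illustrative helper: report the size and zero-copy support of a
 * seekable input stream, then read its first bytes via read-at. */
static void
print_stream_info(GArrowSeekableInputStream *input_stream)
{
  GError *error = NULL;
  guint64 size = garrow_seekable_input_stream_get_size(input_stream, &error);
  if (error) {
    g_print("get-size failed: %s\n", error->message);
    g_clear_error(&error);
    return;
  }
  g_print("size=%" G_GUINT64_FORMAT " zero-copy=%s\n",
          size,
          garrow_seekable_input_stream_get_support_zero_copy(input_stream) ?
          "true" : "false");

  /* Read up to the first 16 bytes starting at position 0. */
  gint64 n_bytes = size < 16 ? (gint64)size : 16;
  GArrowBuffer *buffer =
    garrow_seekable_input_stream_read_at(input_stream, 0, n_bytes, &error);
  if (buffer) {
    g_object_unref(buffer);
  } else {
    g_print("read-at failed: %s\n", error->message);
    g_clear_error(&error);
  }
}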
+ */ +GArrowBuffer * +garrow_seekable_input_stream_read_at(GArrowSeekableInputStream *input_stream, + gint64 position, + gint64 n_bytes, + GError **error) { - iface->get_raw = garrow_buffer_reader_get_raw_random_access_file_interface; + auto arrow_random_access_file = + garrow_seekable_input_stream_get_raw(input_stream); + + std::shared_ptr arrow_buffer; + auto status = arrow_random_access_file->ReadAt(position, + n_bytes, + &arrow_buffer); + if (garrow_error_check(error, status, "[seekable-input-stream][read-at]")) { + return garrow_buffer_new_raw(&arrow_buffer); + } else { + return NULL; + } } -G_DEFINE_TYPE_WITH_CODE(GArrowBufferReader, \ - garrow_buffer_reader, \ - GARROW_TYPE_INPUT_STREAM, - G_IMPLEMENT_INTERFACE(GARROW_TYPE_RANDOM_ACCESS_FILE, - garrow_buffer_reader_random_access_file_interface_init)); + +G_DEFINE_TYPE(GArrowBufferInputStream, \ + garrow_buffer_input_stream, \ + GARROW_TYPE_SEEKABLE_INPUT_STREAM); static void -garrow_buffer_reader_init(GArrowBufferReader *object) +garrow_buffer_input_stream_init(GArrowBufferInputStream *object) { } static void -garrow_buffer_reader_class_init(GArrowBufferReaderClass *klass) +garrow_buffer_input_stream_class_init(GArrowBufferInputStreamClass *klass) { } /** - * garrow_buffer_reader_new: + * garrow_buffer_input_stream_new: * @buffer: The buffer to be read. * - * Returns: A newly created #GArrowBufferReader. + * Returns: A newly created #GArrowBufferInputStream. */ -GArrowBufferReader * -garrow_buffer_reader_new(GArrowBuffer *buffer) +GArrowBufferInputStream * +garrow_buffer_input_stream_new(GArrowBuffer *buffer) { auto arrow_buffer = garrow_buffer_get_raw(buffer); auto arrow_buffer_reader = std::make_shared(arrow_buffer); - return garrow_buffer_reader_new_raw(&arrow_buffer_reader); + return garrow_buffer_input_stream_new_raw(&arrow_buffer_reader); } /** - * garrow_buffer_reader_get_buffer: - * @buffer_reader: A #GArrowBufferReader. + * garrow_buffer_input_stream_get_buffer: + * @input_stream: A #GArrowBufferInputStream. * * Returns: (transfer full): The data of the array as #GArrowBuffer. */ GArrowBuffer * -garrow_buffer_reader_get_buffer(GArrowBufferReader *buffer_reader) +garrow_buffer_input_stream_get_buffer(GArrowBufferInputStream *input_stream) { - auto arrow_input_stream = - garrow_input_stream_get_raw(GARROW_INPUT_STREAM(buffer_reader)); - auto arrow_buffer_reader = - std::static_pointer_cast(arrow_input_stream); + auto arrow_buffer_reader = garrow_buffer_input_stream_get_raw(input_stream); auto arrow_buffer = arrow_buffer_reader->buffer(); return garrow_buffer_new_raw(&arrow_buffer); } +G_DEFINE_TYPE(GArrowMemoryMappedInputStream, \ + garrow_memory_mapped_input_stream, \ + GARROW_TYPE_SEEKABLE_INPUT_STREAM); + +static void +garrow_memory_mapped_input_stream_init(GArrowMemoryMappedInputStream *object) +{ +} + +static void +garrow_memory_mapped_input_stream_class_init(GArrowMemoryMappedInputStreamClass *klass) +{ +} + +/** + * garrow_memory_mapped_input_stream_new: + * @path: The path of the file to be mapped on memory. + * @error: (nullable): Return location for a #GError or %NULL. + * + * Returns: (nullable): A newly created #GArrowMemoryMappedInputStream + * or %NULL on error. 
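A condensed sketch of the new open path, mirroring the read-batch.c example updated later in this patch; the input path "/tmp/batch.arrow" is an assumed file previously written by the Arrow file writer, and only functions whose signatures appear in this patch are used.

#include <stdlib.h>
#include <arrow-glib/arrow-glib.h>

int
main(void)
{
  GError *error = NULL;
  /* Map the Arrow file into memory (read-only). */
  GArrowMemoryMappedInputStream *input =
    garrow_memory_mapped_input_stream_new("/tmp/batch.arrow", &error);
  if (!input) {
    g_print("failed to open input: %s\n", error->message);
    g_error_free(error);
    return EXIT_FAILURE;
  }

  /* The file reader now takes any seekable input stream. */
  GArrowFileReader *reader =
    garrow_file_reader_open(GARROW_SEEKABLE_INPUT_STREAM(input), &error);
  if (!reader) {
    g_print("failed to open reader: %s\n", error->message);
    g_error_free(error);
    g_object_unref(input);
    return EXIT_FAILURE;
  }

  g_print("record batches: %u\n",
          garrow_file_reader_get_n_record_batches(reader));
  g_object_unref(reader);
  g_object_unref(input);
  return EXIT_SUCCESS;
}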
+ */ +GArrowMemoryMappedInputStream * +garrow_memory_mapped_input_stream_new(const gchar *path, + GError **error) +{ + std::shared_ptr arrow_memory_mapped_file; + auto status = + arrow::io::MemoryMappedFile::Open(std::string(path), + arrow::io::FileMode::READ, + &arrow_memory_mapped_file); + if (status.ok()) { + return garrow_memory_mapped_input_stream_new_raw(&arrow_memory_mapped_file); + } else { + std::string context("[memory-mapped-input-stream][open]: <"); + context += path; + context += ">"; + garrow_error_check(error, status, context.c_str()); + return NULL; + } +} + + G_END_DECLS GArrowInputStream * @@ -254,13 +362,42 @@ garrow_input_stream_get_raw(GArrowInputStream *input_stream) return priv->input_stream; } +std::shared_ptr +garrow_seekable_input_stream_get_raw(GArrowSeekableInputStream *seekable_input_stream) +{ + auto arrow_input_stream = + garrow_input_stream_get_raw(GARROW_INPUT_STREAM(seekable_input_stream)); + auto arrow_random_access_file = + std::static_pointer_cast(arrow_input_stream); + return arrow_random_access_file; +} + +GArrowBufferInputStream * +garrow_buffer_input_stream_new_raw(std::shared_ptr *arrow_buffer_reader) +{ + auto buffer_input_stream = + GARROW_BUFFER_INPUT_STREAM(g_object_new(GARROW_TYPE_BUFFER_INPUT_STREAM, + "input-stream", arrow_buffer_reader, + NULL)); + return buffer_input_stream; +} + +std::shared_ptr +garrow_buffer_input_stream_get_raw(GArrowBufferInputStream *buffer_input_stream) +{ + auto arrow_input_stream = + garrow_input_stream_get_raw(GARROW_INPUT_STREAM(buffer_input_stream)); + auto arrow_buffer_reader = + std::static_pointer_cast(arrow_input_stream); + return arrow_buffer_reader; +} -GArrowBufferReader * -garrow_buffer_reader_new_raw(std::shared_ptr *arrow_buffer_reader) +GArrowMemoryMappedInputStream * +garrow_memory_mapped_input_stream_new_raw(std::shared_ptr *arrow_memory_mapped_file) { - auto buffer_reader = - GARROW_BUFFER_READER(g_object_new(GARROW_TYPE_BUFFER_READER, - "input-stream", arrow_buffer_reader, - NULL)); - return buffer_reader; + auto object = g_object_new(GARROW_TYPE_MEMORY_MAPPED_INPUT_STREAM, + "input-stream", arrow_memory_mapped_file, + NULL); + auto memory_mapped_input_stream = GARROW_MEMORY_MAPPED_INPUT_STREAM(object); + return memory_mapped_input_stream; } diff --git a/c_glib/arrow-glib/input-stream.h b/c_glib/arrow-glib/input-stream.h index caa11b575452b..511882863760d 100644 --- a/c_glib/arrow-glib/input-stream.h +++ b/c_glib/arrow-glib/input-stream.h @@ -70,54 +70,159 @@ struct _GArrowInputStreamClass GType garrow_input_stream_get_type(void) G_GNUC_CONST; -#define GARROW_TYPE_BUFFER_READER \ - (garrow_buffer_reader_get_type()) -#define GARROW_BUFFER_READER(obj) \ +#define GARROW_TYPE_SEEKABLE_INPUT_STREAM \ + (garrow_seekable_input_stream_get_type()) +#define GARROW_SEEKABLE_INPUT_STREAM(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_SEEKABLE_INPUT_STREAM, \ + GArrowSeekableInputStream)) +#define GARROW_SEEKABLE_INPUT_STREAM_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_SEEKABLE_INPUT_STREAM, \ + GArrowSeekableInputStreamClass)) +#define GARROW_IS_SEEKABLE_INPUT_STREAM(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_SEEKABLE_INPUT_STREAM)) +#define GARROW_IS_SEEKABLE_INPUT_STREAM_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_SEEKABLE_INPUT_STREAM)) +#define GARROW_SEEKABLE_INPUT_STREAM_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_SEEKABLE_INPUT_STREAM, \ + GArrowSeekableInputStreamClass)) + +typedef struct 
_GArrowSeekableInputStream GArrowSeekableInputStream; +#ifndef __GTK_DOC_IGNORE__ +typedef struct _GArrowSeekableInputStreamClass GArrowSeekableInputStreamClass; +#endif + +/** + * GArrowSeekableInputStream: + * + * It wraps `arrow::io::RandomAccessFile`. + */ +struct _GArrowSeekableInputStream +{ + /*< private >*/ + GArrowInputStream parent_instance; +}; + +#ifndef __GTK_DOC_IGNORE__ +struct _GArrowSeekableInputStreamClass +{ + GArrowInputStreamClass parent_class; +}; +#endif + +GType garrow_seekable_input_stream_get_type(void) G_GNUC_CONST; + +guint64 garrow_seekable_input_stream_get_size(GArrowSeekableInputStream *input_stream, + GError **error); +gboolean garrow_seekable_input_stream_get_support_zero_copy(GArrowSeekableInputStream *input_stream); +GArrowBuffer *garrow_seekable_input_stream_read_at(GArrowSeekableInputStream *input_stream, + gint64 position, + gint64 n_bytes, + GError **error); + + +#define GARROW_TYPE_BUFFER_INPUT_STREAM \ + (garrow_buffer_input_stream_get_type()) +#define GARROW_BUFFER_INPUT_STREAM(obj) \ (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_BUFFER_READER, \ - GArrowBufferReader)) -#define GARROW_BUFFER_READER_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_BUFFER_READER, \ - GArrowBufferReaderClass)) -#define GARROW_IS_BUFFER_READER(obj) \ + GARROW_TYPE_BUFFER_INPUT_STREAM, \ + GArrowBufferInputStream)) +#define GARROW_BUFFER_INPUT_STREAM_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_BUFFER_INPUT_STREAM, \ + GArrowBufferInputStreamClass)) +#define GARROW_IS_BUFFER_INPUT_STREAM(obj) \ (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_BUFFER_READER)) -#define GARROW_IS_BUFFER_READER_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_BUFFER_READER)) -#define GARROW_BUFFER_READER_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_BUFFER_READER, \ - GArrowBufferReaderClass)) - -typedef struct _GArrowBufferReader GArrowBufferReader; + GARROW_TYPE_BUFFER_INPUT_STREAM)) +#define GARROW_IS_BUFFER_INPUT_STREAM_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_BUFFER_INPUT_STREAM)) +#define GARROW_BUFFER_INPUT_STREAM_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_BUFFER_INPUT_STREAM, \ + GArrowBufferInputStreamClass)) + +typedef struct _GArrowBufferInputStream GArrowBufferInputStream; #ifndef __GTK_DOC_IGNORE__ -typedef struct _GArrowBufferReaderClass GArrowBufferReaderClass; +typedef struct _GArrowBufferInputStreamClass GArrowBufferInputStreamClass; #endif /** - * GArrowBufferReader: + * GArrowBufferInputStream: * * It wraps `arrow::io::BufferReader`. 
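For the buffer-backed variant declared here, a minimal sketch pairing garrow_buffer_input_stream_new() with the seekable read path; garrow_buffer_new() is assumed from the pre-existing GArrowBuffer API and is not part of this patch.

#include <string.h>
#include <arrow-glib/arrow-glib.h>

int
main(void)
{
  static const gchar data[] = "Hello World";
  /* Assumed pre-existing constructor wrapping raw bytes in a GArrowBuffer. */
  GArrowBuffer *buffer =
    garrow_buffer_new((const guint8 *)data, (gint64)strlen(data));
  GArrowBufferInputStream *input_stream =
    garrow_buffer_input_stream_new(buffer);

  GError *error = NULL;
  /* A buffer input stream is seekable, so read-at works on it too. */
  GArrowBuffer *chunk =
    garrow_seekable_input_stream_read_at(
      GARROW_SEEKABLE_INPUT_STREAM(input_stream), 0, 5, &error); /* "Hello" */
  if (chunk) {
    g_object_unref(chunk);
  } else {
    g_print("read-at failed: %s\n", error->message);
    g_clear_error(&error);
  }

  g_object_unref(input_stream);
  g_object_unref(buffer);
  return 0;
}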
*/ -struct _GArrowBufferReader +struct _GArrowBufferInputStream { /*< private >*/ - GArrowInputStream parent_instance; + GArrowSeekableInputStream parent_instance; }; #ifndef __GTK_DOC_IGNORE__ -struct _GArrowBufferReaderClass +struct _GArrowBufferInputStreamClass { - GArrowInputStreamClass parent_class; + GArrowSeekableInputStreamClass parent_class; }; #endif -GType garrow_buffer_reader_get_type(void) G_GNUC_CONST; +GType garrow_buffer_input_stream_get_type(void) G_GNUC_CONST; + +GArrowBufferInputStream *garrow_buffer_input_stream_new(GArrowBuffer *buffer); + +GArrowBuffer *garrow_buffer_input_stream_get_buffer(GArrowBufferInputStream *input_stream); + + +#define GARROW_TYPE_MEMORY_MAPPED_INPUT_STREAM \ + (garrow_memory_mapped_input_stream_get_type()) +#define GARROW_MEMORY_MAPPED_INPUT_STREAM(obj) \ + (G_TYPE_CHECK_INSTANCE_CAST((obj), \ + GARROW_TYPE_MEMORY_MAPPED_INPUT_STREAM, \ + GArrowMemoryMappedInputStream)) +#define GARROW_MEMORY_MAPPED_INPUT_STREAM_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_CAST((klass), \ + GARROW_TYPE_MEMORY_MAPPED_INPUT_STREAM, \ + GArrowMemoryMappedInputStreamClass)) +#define GARROW_IS_MEMORY_MAPPED_INPUT_STREAM(obj) \ + (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ + GARROW_TYPE_MEMORY_MAPPED_INPUT_STREAM)) +#define GARROW_IS_MEMORY_MAPPED_INPUT_STREAM_CLASS(klass) \ + (G_TYPE_CHECK_CLASS_TYPE((klass), \ + GARROW_TYPE_MEMORY_MAPPED_INPUT_STREAM)) +#define GARROW_MEMORY_MAPPED_INPUT_STREAM_GET_CLASS(obj) \ + (G_TYPE_INSTANCE_GET_CLASS((obj), \ + GARROW_TYPE_MEMORY_MAPPED_INPUT_STREAM, \ + GArrowMemoryMappedInputStreamClass)) + +typedef struct _GArrowMemoryMappedInputStream GArrowMemoryMappedInputStream; +#ifndef __GTK_DOC_IGNORE__ +typedef struct _GArrowMemoryMappedInputStreamClass GArrowMemoryMappedInputStreamClass; +#endif + +/** + * GArrowMemoryMappedInputStream: + * + * It wraps `arrow::io::MemoryMappedFile`. 
+ */ +struct _GArrowMemoryMappedInputStream +{ + /*< private >*/ + GArrowSeekableInputStream parent_instance; +}; + +#ifndef __GTK_DOC_IGNORE__ +struct _GArrowMemoryMappedInputStreamClass +{ + GArrowSeekableInputStreamClass parent_class; +}; +#endif -GArrowBufferReader *garrow_buffer_reader_new(GArrowBuffer *buffer); +GType garrow_memory_mapped_input_stream_get_type(void) G_GNUC_CONST; -GArrowBuffer *garrow_buffer_reader_get_buffer(GArrowBufferReader *buffer_reader); +GArrowMemoryMappedInputStream *garrow_memory_mapped_input_stream_new(const gchar *path, + GError **error); G_END_DECLS diff --git a/c_glib/arrow-glib/input-stream.hpp b/c_glib/arrow-glib/input-stream.hpp index 008f5f2b4e157..17d2bd92422d6 100644 --- a/c_glib/arrow-glib/input-stream.hpp +++ b/c_glib/arrow-glib/input-stream.hpp @@ -19,6 +19,7 @@ #pragma once +#include #include #include @@ -27,4 +28,9 @@ GArrowInputStream *garrow_input_stream_new_raw(std::shared_ptr *arrow_input_stream); std::shared_ptr garrow_input_stream_get_raw(GArrowInputStream *input_stream); -GArrowBufferReader *garrow_buffer_reader_new_raw(std::shared_ptr *arrow_buffer_reader); +std::shared_ptr garrow_seekable_input_stream_get_raw(GArrowSeekableInputStream *input_stream); + +GArrowBufferInputStream *garrow_buffer_input_stream_new_raw(std::shared_ptr *arrow_buffer_reader); +std::shared_ptr garrow_buffer_input_stream_get_raw(GArrowBufferInputStream *input_stream); + +GArrowMemoryMappedInputStream *garrow_memory_mapped_input_stream_new_raw(std::shared_ptr *arrow_memory_mapped_file); diff --git a/c_glib/arrow-glib/memory-mapped-file.cpp b/c_glib/arrow-glib/memory-mapped-file.cpp deleted file mode 100644 index 71a9a6dad3134..0000000000000 --- a/c_glib/arrow-glib/memory-mapped-file.cpp +++ /dev/null @@ -1,270 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include - -#include -#include -#include -#include -#include -#include -#include -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: memory-mapped-file - * @short_description: Memory mapped file class - * - * #GArrowMemoryMappedFile is a class for memory mapped file. It's - * readable and writeable. It supports zero copy. 
- */ - -typedef struct GArrowMemoryMappedFilePrivate_ { - std::shared_ptr memory_mapped_file; -} GArrowMemoryMappedFilePrivate; - -enum { - PROP_0, - PROP_MEMORY_MAPPED_FILE -}; - -static std::shared_ptr -garrow_memory_mapped_file_get_raw_file_interface(GArrowFile *file) -{ - auto memory_mapped_file = GARROW_MEMORY_MAPPED_FILE(file); - auto arrow_memory_mapped_file = - garrow_memory_mapped_file_get_raw(memory_mapped_file); - return arrow_memory_mapped_file; -} - -static void -garrow_file_interface_init(GArrowFileInterface *iface) -{ - iface->get_raw = garrow_memory_mapped_file_get_raw_file_interface; -} - -static std::shared_ptr -garrow_memory_mapped_file_get_raw_readable_interface(GArrowReadable *readable) -{ - auto memory_mapped_file = GARROW_MEMORY_MAPPED_FILE(readable); - auto arrow_memory_mapped_file = - garrow_memory_mapped_file_get_raw(memory_mapped_file); - return arrow_memory_mapped_file; -} - -static void -garrow_readable_interface_init(GArrowReadableInterface *iface) -{ - iface->get_raw = garrow_memory_mapped_file_get_raw_readable_interface; -} - -static std::shared_ptr -garrow_memory_mapped_file_get_raw_random_access_file_interface(GArrowRandomAccessFile *file) -{ - auto memory_mapped_file = GARROW_MEMORY_MAPPED_FILE(file); - auto arrow_memory_mapped_file = - garrow_memory_mapped_file_get_raw(memory_mapped_file); - return arrow_memory_mapped_file; -} - -static void -garrow_random_access_file_interface_init(GArrowRandomAccessFileInterface *iface) -{ - iface->get_raw = garrow_memory_mapped_file_get_raw_random_access_file_interface; -} - -static std::shared_ptr -garrow_memory_mapped_file_get_raw_writeable_interface(GArrowWriteable *writeable) -{ - auto memory_mapped_file = GARROW_MEMORY_MAPPED_FILE(writeable); - auto arrow_memory_mapped_file = - garrow_memory_mapped_file_get_raw(memory_mapped_file); - return arrow_memory_mapped_file; -} - -static void -garrow_writeable_interface_init(GArrowWriteableInterface *iface) -{ - iface->get_raw = garrow_memory_mapped_file_get_raw_writeable_interface; -} - -static std::shared_ptr -garrow_memory_mapped_file_get_raw_writeable_file_interface(GArrowWriteableFile *file) -{ - auto memory_mapped_file = GARROW_MEMORY_MAPPED_FILE(file); - auto arrow_memory_mapped_file = - garrow_memory_mapped_file_get_raw(memory_mapped_file); - return arrow_memory_mapped_file; -} - -static void -garrow_writeable_file_interface_init(GArrowWriteableFileInterface *iface) -{ - iface->get_raw = garrow_memory_mapped_file_get_raw_writeable_file_interface; -} - -G_DEFINE_TYPE_WITH_CODE(GArrowMemoryMappedFile, - garrow_memory_mapped_file, - G_TYPE_OBJECT, - G_ADD_PRIVATE(GArrowMemoryMappedFile) - G_IMPLEMENT_INTERFACE(GARROW_TYPE_FILE, - garrow_file_interface_init) - G_IMPLEMENT_INTERFACE(GARROW_TYPE_READABLE, - garrow_readable_interface_init) - G_IMPLEMENT_INTERFACE(GARROW_TYPE_RANDOM_ACCESS_FILE, - garrow_random_access_file_interface_init) - G_IMPLEMENT_INTERFACE(GARROW_TYPE_WRITEABLE, - garrow_writeable_interface_init) - G_IMPLEMENT_INTERFACE(GARROW_TYPE_WRITEABLE_FILE, - garrow_writeable_file_interface_init)); - -#define GARROW_MEMORY_MAPPED_FILE_GET_PRIVATE(obj) \ - (G_TYPE_INSTANCE_GET_PRIVATE((obj), \ - GARROW_TYPE_MEMORY_MAPPED_FILE, \ - GArrowMemoryMappedFilePrivate)) - -static void -garrow_memory_mapped_file_finalize(GObject *object) -{ - GArrowMemoryMappedFilePrivate *priv; - - priv = GARROW_MEMORY_MAPPED_FILE_GET_PRIVATE(object); - - priv->memory_mapped_file = nullptr; - - G_OBJECT_CLASS(garrow_memory_mapped_file_parent_class)->finalize(object); -} - -static void 
-garrow_memory_mapped_file_set_property(GObject *object, - guint prop_id, - const GValue *value, - GParamSpec *pspec) -{ - GArrowMemoryMappedFilePrivate *priv; - - priv = GARROW_MEMORY_MAPPED_FILE_GET_PRIVATE(object); - - switch (prop_id) { - case PROP_MEMORY_MAPPED_FILE: - priv->memory_mapped_file = - *static_cast *>(g_value_get_pointer(value)); - break; - default: - G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); - break; - } -} - -static void -garrow_memory_mapped_file_get_property(GObject *object, - guint prop_id, - GValue *value, - GParamSpec *pspec) -{ - switch (prop_id) { - default: - G_OBJECT_WARN_INVALID_PROPERTY_ID(object, prop_id, pspec); - break; - } -} - -static void -garrow_memory_mapped_file_init(GArrowMemoryMappedFile *object) -{ -} - -static void -garrow_memory_mapped_file_class_init(GArrowMemoryMappedFileClass *klass) -{ - GObjectClass *gobject_class; - GParamSpec *spec; - - gobject_class = G_OBJECT_CLASS(klass); - - gobject_class->finalize = garrow_memory_mapped_file_finalize; - gobject_class->set_property = garrow_memory_mapped_file_set_property; - gobject_class->get_property = garrow_memory_mapped_file_get_property; - - spec = g_param_spec_pointer("memory-mapped-file", - "io::MemoryMappedFile", - "The raw std::shared *", - static_cast(G_PARAM_WRITABLE | - G_PARAM_CONSTRUCT_ONLY)); - g_object_class_install_property(gobject_class, PROP_MEMORY_MAPPED_FILE, spec); -} - -/** - * garrow_memory_mapped_file_open: - * @path: The path of the memory mapped file. - * @mode: The mode of the memory mapped file. - * @error: (nullable): Return location for a #GError or %NULL. - * - * Returns: (nullable) (transfer full): A newly opened - * #GArrowMemoryMappedFile or %NULL on error. - */ -GArrowMemoryMappedFile * -garrow_memory_mapped_file_open(const gchar *path, - GArrowFileMode mode, - GError **error) -{ - std::shared_ptr arrow_memory_mapped_file; - auto status = - arrow::io::MemoryMappedFile::Open(std::string(path), - garrow_file_mode_to_raw(mode), - &arrow_memory_mapped_file); - if (status.ok()) { - return garrow_memory_mapped_file_new_raw(&arrow_memory_mapped_file); - } else { - std::string context("[io][memory-mapped-file][open]: <"); - context += path; - context += ">"; - garrow_error_check(error, status, context.c_str()); - return NULL; - } -} - -G_END_DECLS - -GArrowMemoryMappedFile * -garrow_memory_mapped_file_new_raw(std::shared_ptr *arrow_memory_mapped_file) -{ - auto memory_mapped_file = - GARROW_MEMORY_MAPPED_FILE(g_object_new(GARROW_TYPE_MEMORY_MAPPED_FILE, - "memory-mapped-file", arrow_memory_mapped_file, - NULL)); - return memory_mapped_file; -} - -std::shared_ptr -garrow_memory_mapped_file_get_raw(GArrowMemoryMappedFile *memory_mapped_file) -{ - GArrowMemoryMappedFilePrivate *priv; - - priv = GARROW_MEMORY_MAPPED_FILE_GET_PRIVATE(memory_mapped_file); - return priv->memory_mapped_file; -} diff --git a/c_glib/arrow-glib/memory-mapped-file.h b/c_glib/arrow-glib/memory-mapped-file.h deleted file mode 100644 index 40b8de04a5a75..0000000000000 --- a/c_glib/arrow-glib/memory-mapped-file.h +++ /dev/null @@ -1,72 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_MEMORY_MAPPED_FILE \ - (garrow_memory_mapped_file_get_type()) -#define GARROW_MEMORY_MAPPED_FILE(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_MEMORY_MAPPED_FILE, \ - GArrowMemoryMappedFile)) -#define GARROW_MEMORY_MAPPED_FILE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_CAST((klass), \ - GARROW_TYPE_MEMORY_MAPPED_FILE, \ - GArrowMemoryMappedFileClass)) -#define GARROW_IS_MEMORY_MAPPED_FILE(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_MEMORY_MAPPED_FILE)) -#define GARROW_IS_MEMORY_MAPPED_FILE_CLASS(klass) \ - (G_TYPE_CHECK_CLASS_TYPE((klass), \ - GARROW_TYPE_MEMORY_MAPPED_FILE)) -#define GARROW_MEMORY_MAPPED_FILE_GET_CLASS(obj) \ - (G_TYPE_INSTANCE_GET_CLASS((obj), \ - GARROW_TYPE_MEMORY_MAPPED_FILE, \ - GArrowMemoryMappedFileClass)) - -typedef struct _GArrowMemoryMappedFile GArrowMemoryMappedFile; -typedef struct _GArrowMemoryMappedFileClass GArrowMemoryMappedFileClass; - -/** - * GArrowMemoryMappedFile: - * - * It wraps `arrow::io::MemoryMappedFile`. - */ -struct _GArrowMemoryMappedFile -{ - /*< private >*/ - GObject parent_instance; -}; - -struct _GArrowMemoryMappedFileClass -{ - GObjectClass parent_class; -}; - -GType garrow_memory_mapped_file_get_type(void) G_GNUC_CONST; - -GArrowMemoryMappedFile *garrow_memory_mapped_file_open(const gchar *path, - GArrowFileMode mode, - GError **error); - -G_END_DECLS diff --git a/c_glib/arrow-glib/memory-mapped-file.hpp b/c_glib/arrow-glib/memory-mapped-file.hpp deleted file mode 100644 index 522e43d117f39..0000000000000 --- a/c_glib/arrow-glib/memory-mapped-file.hpp +++ /dev/null @@ -1,28 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include -#include - -#include - -GArrowMemoryMappedFile *garrow_memory_mapped_file_new_raw(std::shared_ptr *arrow_memory_mapped_file); -std::shared_ptr garrow_memory_mapped_file_get_raw(GArrowMemoryMappedFile *memory_mapped_file); diff --git a/c_glib/arrow-glib/random-access-file.cpp b/c_glib/arrow-glib/random-access-file.cpp deleted file mode 100644 index 744131632cbc2..0000000000000 --- a/c_glib/arrow-glib/random-access-file.cpp +++ /dev/null @@ -1,124 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. 
See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#ifdef HAVE_CONFIG_H -# include -#endif - -#include - -#include -#include -#include - -G_BEGIN_DECLS - -/** - * SECTION: random-access-file - * @title: GArrowRandomAccessFile - * @short_description: File input interface - * - * #GArrowRandomAccessFile is an interface for file input. - */ - -G_DEFINE_INTERFACE(GArrowRandomAccessFile, - garrow_random_access_file, - G_TYPE_OBJECT) - -static void -garrow_random_access_file_default_init (GArrowRandomAccessFileInterface *iface) -{ -} - -/** - * garrow_random_access_file_get_size: - * @file: A #GArrowRandomAccessFile. - * @error: (nullable): Return location for a #GError or %NULL. - * - * Returns: The size of the file. - */ -guint64 -garrow_random_access_file_get_size(GArrowRandomAccessFile *file, - GError **error) -{ - auto *iface = GARROW_RANDOM_ACCESS_FILE_GET_IFACE(file); - auto arrow_random_access_file = iface->get_raw(file); - int64_t size; - - auto status = arrow_random_access_file->GetSize(&size); - if (garrow_error_check(error, status, "[io][random-access-file][get-size]")) { - return size; - } else { - return 0; - } -} - -/** - * garrow_random_access_file_get_support_zero_copy: - * @file: A #GArrowRandomAccessFile. - * - * Returns: Whether zero copy read is supported or not. - */ -gboolean -garrow_random_access_file_get_support_zero_copy(GArrowRandomAccessFile *file) -{ - auto *iface = GARROW_RANDOM_ACCESS_FILE_GET_IFACE(file); - auto arrow_random_access_file = iface->get_raw(file); - - return arrow_random_access_file->supports_zero_copy(); -} - -/** - * garrow_random_access_file_read_at: - * @file: A #GArrowRandomAccessFile. - * @position: The read start position. - * @n_bytes: The number of bytes to be read. - * @error: (nullable): Return location for a #GError or %NULL. - * - * Returns: (transfer full) (nullable): #GArrowBuffer that has read - * data on success, %NULL if there was an error. 
- */ -GArrowBuffer * -garrow_random_access_file_read_at(GArrowRandomAccessFile *file, - gint64 position, - gint64 n_bytes, - GError **error) -{ - const auto arrow_random_access_file = - garrow_random_access_file_get_raw(file); - - std::shared_ptr arrow_buffer; - auto status = arrow_random_access_file->ReadAt(position, - n_bytes, - &arrow_buffer); - if (garrow_error_check(error, status, "[io][random-access-file][read-at]")) { - return garrow_buffer_new_raw(&arrow_buffer); - } else { - return NULL; - } -} - -G_END_DECLS - -std::shared_ptr -garrow_random_access_file_get_raw(GArrowRandomAccessFile *random_access_file) -{ - auto *iface = GARROW_RANDOM_ACCESS_FILE_GET_IFACE(random_access_file); - return iface->get_raw(random_access_file); -} diff --git a/c_glib/arrow-glib/random-access-file.h b/c_glib/arrow-glib/random-access-file.h deleted file mode 100644 index 83a7d8cd14b95..0000000000000 --- a/c_glib/arrow-glib/random-access-file.h +++ /dev/null @@ -1,53 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -G_BEGIN_DECLS - -#define GARROW_TYPE_RANDOM_ACCESS_FILE \ - (garrow_random_access_file_get_type()) -#define GARROW_RANDOM_ACCESS_FILE(obj) \ - (G_TYPE_CHECK_INSTANCE_CAST((obj), \ - GARROW_TYPE_RANDOM_ACCESS_FILE, \ - GArrowRandomAccessFile)) -#define GARROW_IS_RANDOM_ACCESS_FILE(obj) \ - (G_TYPE_CHECK_INSTANCE_TYPE((obj), \ - GARROW_TYPE_RANDOM_ACCESS_FILE)) -#define GARROW_RANDOM_ACCESS_FILE_GET_IFACE(obj) \ - (G_TYPE_INSTANCE_GET_INTERFACE((obj), \ - GARROW_TYPE_RANDOM_ACCESS_FILE, \ - GArrowRandomAccessFileInterface)) - -typedef struct _GArrowRandomAccessFile GArrowRandomAccessFile; -typedef struct _GArrowRandomAccessFileInterface GArrowRandomAccessFileInterface; - -GType garrow_random_access_file_get_type(void) G_GNUC_CONST; - -guint64 garrow_random_access_file_get_size(GArrowRandomAccessFile *file, - GError **error); -gboolean garrow_random_access_file_get_support_zero_copy(GArrowRandomAccessFile *file); -GArrowBuffer *garrow_random_access_file_read_at(GArrowRandomAccessFile *file, - gint64 position, - gint64 n_bytes, - GError **error); - -G_END_DECLS diff --git a/c_glib/arrow-glib/random-access-file.hpp b/c_glib/arrow-glib/random-access-file.hpp deleted file mode 100644 index 6d6fed70b4b62..0000000000000 --- a/c_glib/arrow-glib/random-access-file.hpp +++ /dev/null @@ -1,38 +0,0 @@ -/* - * Licensed to the Apache Software Foundation (ASF) under one - * or more contributor license agreements. See the NOTICE file - * distributed with this work for additional information - * regarding copyright ownership. The ASF licenses this file - * to you under the Apache License, Version 2.0 (the - * "License"); you may not use this file except in compliance - * with the License. 
You may obtain a copy of the License at - * - * http://www.apache.org/licenses/LICENSE-2.0 - * - * Unless required by applicable law or agreed to in writing, - * software distributed under the License is distributed on an - * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - * KIND, either express or implied. See the License for the - * specific language governing permissions and limitations - * under the License. - */ - -#pragma once - -#include - -#include - -/** - * GArrowRandomAccessFileInterface: - * - * It wraps `arrow::io::RandomAccessFile`. - */ -struct _GArrowRandomAccessFileInterface -{ - GTypeInterface parent_iface; - - std::shared_ptr (*get_raw)(GArrowRandomAccessFile *file); -}; - -std::shared_ptr garrow_random_access_file_get_raw(GArrowRandomAccessFile *random_access_file); diff --git a/c_glib/doc/reference/arrow-glib-docs.sgml b/c_glib/doc/reference/arrow-glib-docs.sgml index 7ba37b45068e0..8c691de93bce4 100644 --- a/c_glib/doc/reference/arrow-glib-docs.sgml +++ b/c_glib/doc/reference/arrow-glib-docs.sgml @@ -82,7 +82,6 @@ Input - Output @@ -93,7 +92,6 @@ Input and output - diff --git a/c_glib/example/lua/read-batch.lua b/c_glib/example/lua/read-batch.lua index b28d346863820..8b129c9e4e7a3 100644 --- a/c_glib/example/lua/read-batch.lua +++ b/c_glib/example/lua/read-batch.lua @@ -20,7 +20,7 @@ local Arrow = lgi.Arrow local input_path = arg[1] or "/tmp/batch.arrow"; -local input = Arrow.MemoryMappedFile.open(input_path, Arrow.FileMode.READ) +local input = Arrow.MemoryMappedInputStream.new(input_path) local reader = Arrow.FileReader.open(input) for i = 0, reader:get_n_record_batches() - 1 do diff --git a/c_glib/example/lua/read-stream.lua b/c_glib/example/lua/read-stream.lua index 3b0820627e6b2..e744bed22ad4b 100644 --- a/c_glib/example/lua/read-stream.lua +++ b/c_glib/example/lua/read-stream.lua @@ -20,7 +20,7 @@ local Arrow = lgi.Arrow local input_path = arg[1] or "/tmp/stream.arrow"; -local input = Arrow.MemoryMappedFile.open(input_path, Arrow.FileMode.READ) +local input = Arrow.MemoryMappedInputStream.new(input_path) local reader = Arrow.StreamReader.open(input) local i = 0 diff --git a/c_glib/example/read-batch.c b/c_glib/example/read-batch.c index dce96b8713362..25f19b24393b2 100644 --- a/c_glib/example/read-batch.c +++ b/c_glib/example/read-batch.c @@ -87,14 +87,13 @@ int main(int argc, char **argv) { const char *input_path = "/tmp/batch.arrow"; - GArrowMemoryMappedFile *input; + GArrowMemoryMappedInputStream *input; GError *error = NULL; if (argc > 1) input_path = argv[1]; - input = garrow_memory_mapped_file_open(input_path, - GARROW_FILE_MODE_READ, - &error); + input = garrow_memory_mapped_input_stream_new(input_path, + &error); if (!input) { g_print("failed to open file: %s\n", error->message); g_error_free(error); @@ -104,7 +103,7 @@ main(int argc, char **argv) { GArrowFileReader *reader; - reader = garrow_file_reader_open(GARROW_RANDOM_ACCESS_FILE(input), + reader = garrow_file_reader_open(GARROW_SEEKABLE_INPUT_STREAM(input), &error); if (!reader) { g_print("failed to open file reader: %s\n", error->message); diff --git a/c_glib/example/read-stream.c b/c_glib/example/read-stream.c index ba461e3ad6aed..ca5b9d97cc9df 100644 --- a/c_glib/example/read-stream.c +++ b/c_glib/example/read-stream.c @@ -87,14 +87,12 @@ int main(int argc, char **argv) { const char *input_path = "/tmp/stream.arrow"; - GArrowMemoryMappedFile *input; + GArrowMemoryMappedInputStream *input; GError *error = NULL; if (argc > 1) input_path = argv[1]; - input = 
garrow_memory_mapped_file_open(input_path, - GARROW_FILE_MODE_READ, - &error); + input = garrow_memory_mapped_input_stream_new(input_path, &error); if (!input) { g_print("failed to open file: %s\n", error->message); g_error_free(error); diff --git a/c_glib/test/test-buffer-reader.rb b/c_glib/test/test-buffer-input-stream.rb similarity index 85% rename from c_glib/test/test-buffer-reader.rb rename to c_glib/test/test-buffer-input-stream.rb index d05ed062ebdb7..51ed8b3961a75 100644 --- a/c_glib/test/test-buffer-reader.rb +++ b/c_glib/test/test-buffer-input-stream.rb @@ -15,11 +15,11 @@ # specific language governing permissions and limitations # under the License. -class TestBufferReader < Test::Unit::TestCase +class TestBufferInputStream < Test::Unit::TestCase def test_read buffer = Arrow::Buffer.new("Hello World") - buffer_reader = Arrow::BufferReader.new(buffer) - read_buffer = buffer_reader.read(5) + buffer_input_stream = Arrow::BufferInputStream.new(buffer) + read_buffer = buffer_input_stream.read(5) assert_equal("Hello", read_buffer.data.to_s) end end diff --git a/c_glib/test/test-file-writer.rb b/c_glib/test/test-file-writer.rb index 31c094dd3bd29..6d4100a8cd38a 100644 --- a/c_glib/test/test-file-writer.rb +++ b/c_glib/test/test-file-writer.rb @@ -33,7 +33,7 @@ def test_write_record_batch output.close end - input = Arrow::MemoryMappedFile.open(tempfile.path, :read) + input = Arrow::MemoryMappedInputStream.new(tempfile.path) begin file_reader = Arrow::FileReader.open(input) assert_equal(["enabled"], diff --git a/c_glib/test/test-memory-mapped-file.rb b/c_glib/test/test-memory-mapped-file.rb deleted file mode 100644 index e09e3697074c9..0000000000000 --- a/c_glib/test/test-memory-mapped-file.rb +++ /dev/null @@ -1,134 +0,0 @@ -# Licensed to the Apache Software Foundation (ASF) under one -# or more contributor license agreements. See the NOTICE file -# distributed with this work for additional information -# regarding copyright ownership. The ASF licenses this file -# to you under the Apache License, Version 2.0 (the -# "License"); you may not use this file except in compliance -# with the License. You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, -# software distributed under the License is distributed on an -# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -# KIND, either express or implied. See the License for the -# specific language governing permissions and limitations -# under the License. 
- -class TestMemoryMappedFile < Test::Unit::TestCase - def test_open - tempfile = Tempfile.open("arrow-io-memory-mapped-file") - tempfile.write("Hello") - tempfile.close - file = Arrow::MemoryMappedFile.open(tempfile.path, :read) - begin - buffer = file.read(5) - assert_equal("Hello", buffer.data.to_s) - ensure - file.close - end - end - - def test_size - tempfile = Tempfile.open("arrow-io-memory-mapped-file") - tempfile.write("Hello") - tempfile.close - file = Arrow::MemoryMappedFile.open(tempfile.path, :read) - begin - assert_equal(5, file.size) - ensure - file.close - end - end - - def test_read - tempfile = Tempfile.open("arrow-io-memory-mapped-file") - tempfile.write("Hello World") - tempfile.close - file = Arrow::MemoryMappedFile.open(tempfile.path, :read) - begin - buffer = file.read(5) - assert_equal("Hello", buffer.data.to_s) - ensure - file.close - end - end - - def test_read_at - tempfile = Tempfile.open("arrow-io-memory-mapped-file") - tempfile.write("Hello World") - tempfile.close - file = Arrow::MemoryMappedFile.open(tempfile.path, :read) - begin - buffer = file.read_at(6, 5) - assert_equal("World", buffer.data.to_s) - ensure - file.close - end - end - - def test_write - tempfile = Tempfile.open("arrow-io-memory-mapped-file") - tempfile.write("Hello") - tempfile.close - file = Arrow::MemoryMappedFile.open(tempfile.path, :readwrite) - begin - file.write("World") - ensure - file.close - end - assert_equal("World", File.read(tempfile.path)) - end - - def test_write_at - tempfile = Tempfile.open("arrow-io-memory-mapped-file") - tempfile.write("Hello") - tempfile.close - file = Arrow::MemoryMappedFile.open(tempfile.path, :readwrite) - begin - file.write_at(2, "rld") - ensure - file.close - end - assert_equal("Herld", File.read(tempfile.path)) - end - - def test_flush - tempfile = Tempfile.open("arrow-io-memory-mapped-file") - tempfile.write("Hello") - tempfile.close - file = Arrow::MemoryMappedFile.open(tempfile.path, :readwrite) - begin - file.write("World") - file.flush - assert_equal("World", File.read(tempfile.path)) - ensure - file.close - end - end - - def test_tell - tempfile = Tempfile.open("arrow-io-memory-mapped-file") - tempfile.write("Hello World") - tempfile.close - file = Arrow::MemoryMappedFile.open(tempfile.path, :read) - begin - file.read(5) - assert_equal(5, file.tell) - ensure - file.close - end - end - - def test_mode - tempfile = Tempfile.open("arrow-io-memory-mapped-file") - tempfile.write("Hello World") - tempfile.close - file = Arrow::MemoryMappedFile.open(tempfile.path, :readwrite) - begin - assert_equal(Arrow::FileMode::READWRITE, file.mode) - ensure - file.close - end - end -end diff --git a/c_glib/test/test-memory-mapped-input-stream.rb b/c_glib/test/test-memory-mapped-input-stream.rb new file mode 100644 index 0000000000000..c3a5f62fbeeb5 --- /dev/null +++ b/c_glib/test/test-memory-mapped-input-stream.rb @@ -0,0 +1,82 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +class TestMemoryMappedInputStream < Test::Unit::TestCase + def test_new + tempfile = Tempfile.open("arrow-memory-mapped-input-stream") + tempfile.write("Hello") + tempfile.close + input = Arrow::MemoryMappedInputStream.new(tempfile.path) + begin + buffer = input.read(5) + assert_equal("Hello", buffer.data.to_s) + ensure + input.close + end + end + + def test_size + tempfile = Tempfile.open("arrow-memory-mapped-input-stream") + tempfile.write("Hello") + tempfile.close + input = Arrow::MemoryMappedInputStream.new(tempfile.path) + begin + assert_equal(5, input.size) + ensure + input.close + end + end + + def test_read + tempfile = Tempfile.open("arrow-memory-mapped-input-stream") + tempfile.write("Hello World") + tempfile.close + input = Arrow::MemoryMappedInputStream.new(tempfile.path) + begin + buffer = input.read(5) + assert_equal("Hello", buffer.data.to_s) + ensure + input.close + end + end + + def test_read_at + tempfile = Tempfile.open("arrow-memory-mapped-input-stream") + tempfile.write("Hello World") + tempfile.close + input = Arrow::MemoryMappedInputStream.new(tempfile.path) + begin + buffer = input.read_at(6, 5) + assert_equal("World", buffer.data.to_s) + ensure + input.close + end + end + + + def test_mode + tempfile = Tempfile.open("arrow-memory-mapped-input-stream") + tempfile.write("Hello World") + tempfile.close + input = Arrow::MemoryMappedInputStream.new(tempfile.path) + begin + assert_equal(Arrow::FileMode::READWRITE, input.mode) + ensure + input.close + end + end +end diff --git a/c_glib/test/test-stream-writer.rb b/c_glib/test/test-stream-writer.rb index 306115ee78925..4280c1b32e0f7 100644 --- a/c_glib/test/test-stream-writer.rb +++ b/c_glib/test/test-stream-writer.rb @@ -38,7 +38,7 @@ def test_write_record_batch output.close end - input = Arrow::MemoryMappedFile.open(tempfile.path, :read) + input = Arrow::MemoryMappedInputStream.new(tempfile.path) begin stream_reader = Arrow::StreamReader.open(input) assert_equal(["enabled"], From 0eff2174f9454e54c24acd988706b2f10a2a380d Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 2 May 2017 15:10:23 -0400 Subject: [PATCH 0595/1644] ARROW-933: [Python] Remove debug print statement Author: Wes McKinney Closes #629 from wesm/ARROW-933 and squashes the following commits: a6fc22e [Wes McKinney] Remove debug print statement --- cpp/src/arrow/python/numpy_interop.h | 1 - 1 file changed, 1 deletion(-) diff --git a/cpp/src/arrow/python/numpy_interop.h b/cpp/src/arrow/python/numpy_interop.h index 023fdc8249c0c..b93200cc8972d 100644 --- a/cpp/src/arrow/python/numpy_interop.h +++ b/cpp/src/arrow/python/numpy_interop.h @@ -47,7 +47,6 @@ namespace py { inline int import_numpy() { #ifdef NUMPY_IMPORT_ARRAY - std::cout << "Importing NumPy" << std::endl; import_array1(-1); import_umath1(-1); #endif From e794a598b89427d5be0442b5009d61086a4af789 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Tue, 2 May 2017 15:13:41 -0400 Subject: [PATCH 0596/1644] ARROW-936: fix release README Author: Julien Le Dem Closes #626 from julienledem/fix_README and squashes the following commits: 0489913 [Julien Le Dem] update README e9f2aec [Julien Le 
Dem] move tag to rc; add set -e 38f5017 [Julien Le Dem] fix release README --- dev/release/00-prepare.sh | 34 ++++++++++---------- dev/release/01-perform.sh | 1 + dev/release/02-source.sh | 66 ++++++++++++++++++--------------------- dev/release/README | 15 ++++++--- 4 files changed, 60 insertions(+), 56 deletions(-) diff --git a/dev/release/00-prepare.sh b/dev/release/00-prepare.sh index 00af5e7768161..398f15d8270f1 100644 --- a/dev/release/00-prepare.sh +++ b/dev/release/00-prepare.sh @@ -17,30 +17,30 @@ # specific language governing permissions and limitations # under the License. # +set -e SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" -if [ -z "$1" ]; then - echo "Usage: $0 " - exit -fi +if [ "$#" -eq 3 ]; then + version=$1 + nextVersion=$2 + nextVersionSNAPSHOT=${nextVersion}-SNAPSHOT + rcnum=$3 + tag=apache-arrow-${version}-rc${rcnum} -if [ -z "$2" ]; then - echo "Usage: $0 " - exit -fi + echo "prepare release ${version} rc ${rcnum} on tag ${tag} then reset to version ${nextVersionSNAPSHOT}" -version=$1 + cd "${SOURCE_DIR}/../../java" -tag=apache-arrow-${version} + mvn release:clean + mvn release:prepare -Dtag=${tag} -DreleaseVersion=${version} -DautoVersionSubmodules -DdevelopmentVersion=${nextVersionSNAPSHOT} -nextVersion=$2 + cd - -cd "${SOURCE_DIR}/../../java" + echo "Finish staging binary artifacts by running: sh dev/release/01-perform.sh" -mvn release:clean -mvn release:prepare -Dtag=${tag} -DreleaseVersion=${version} -DautoVersionSubmodules -DdevelopmentVersion=${nextVersion}-SNAPSHOT - -cd - +else + echo "Usage: $0 " + exit +fi -echo "Finish staging binary artifacts by running: sh dev/release/01-perform.sh" diff --git a/dev/release/01-perform.sh b/dev/release/01-perform.sh index d7140f6cba1e7..876dae8442dfb 100644 --- a/dev/release/01-perform.sh +++ b/dev/release/01-perform.sh @@ -17,6 +17,7 @@ # specific language governing permissions and limitations # under the License. # +set -e SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" diff --git a/dev/release/02-source.sh b/dev/release/02-source.sh index 924b94fd6caa0..3bb72045988cd 100755 --- a/dev/release/02-source.sh +++ b/dev/release/02-source.sh @@ -17,15 +17,11 @@ # specific language governing permissions and limitations # under the License. # +set -e SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" -if [ -z "$1" ]; then - echo "Usage: $0 " - exit -fi - -if [ -z "$2" ]; then +if [ "$#" -ne 2 ]; then echo "Usage: $0 " exit fi @@ -38,42 +34,42 @@ if [ -d tmp/ ]; then exit fi -tag=apache-arrow-$version +tag=apache-arrow-${version} tagrc=${tag}-rc${rc} -echo "Preparing source for $tagrc" +echo "Preparing source for ${tagrc}" -release_hash=`git rev-list $tag 2> /dev/null | head -n 1 ` +release_hash=`git rev-list $tagrc 2> /dev/null | head -n 1 ` if [ -z "$release_hash" ]; then - echo "Cannot continue: unknown git tag: $tag" + echo "Cannot continue: unknown git tag: $tagrc" exit fi echo "Using commit $release_hash" -tarball=$tag.tar.gz +tarball=${tag}.tar.gz extract_dir=tmp-apache-arrow -rm -rf $extract_dir +rm -rf ${extract_dir} # be conservative and use the release hash, even though git produces the same # archive (identical hashes) using the scm tag -git archive $release_hash --prefix $extract_dir/ | tar xf - +git archive ${release_hash} --prefix ${extract_dir}/ | tar xf - # build Apache Arrow C++ before building Apache Arrow GLib because # Apache Arrow GLib requires Apache Arrow C++. 
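The ordering in this script matters: the GLib `configure` step locates Arrow C++ through `pkg-config`, so the C++ build must be installed first. A condensed sketch of the same flow (paths illustrative, not the script's exact variables):

```sh
# Build and install Arrow C++ into a scratch prefix first...
mkdir -p cpp/build && cd cpp/build
cmake .. -DCMAKE_INSTALL_PREFIX="$PWD/../install" -DARROW_BUILD_TESTS=no
make -j8 && make install
cd ../..
# ...then point the GLib build at that prefix and run "make dist".
cd c_glib
./autogen.sh
./configure PKG_CONFIG_PATH="$PWD/../cpp/install/lib/pkgconfig"
make dist   # produces apache-arrow-glib-*.tar.gz
```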
-mkdir -p $extract_dir/cpp/build
-cpp_install_dir=$PWD/$extract_dir/cpp/install
-cd $extract_dir/cpp/build
+mkdir -p ${extract_dir}/cpp/build
+cpp_install_dir=${PWD}/${extract_dir}/cpp/install
+cd ${extract_dir}/cpp/build
 cmake .. \
-  -DCMAKE_INSTALL_PREFIX=$cpp_install_dir \
+  -DCMAKE_INSTALL_PREFIX=${cpp_install_dir} \
   -DARROW_BUILD_TESTS=no
 make -j8
 make install
 cd -
 
 # build source archive for Apache Arrow GLib by "make dist".
-cd $extract_dir/c_glib
+cd ${extract_dir}/c_glib
 ./autogen.sh
 ./configure \
   PKG_CONFIG_PATH=$cpp_install_dir/lib/pkgconfig \
@@ -84,38 +80,38 @@
 tar xzf *.tar.gz
 rm *.tar.gz
 cd -
 rm -rf tmp-c_glib/
-mv $extract_dir/c_glib/apache-arrow-glib-* tmp-c_glib/
-rm -rf $extract_dir
+mv ${extract_dir}/c_glib/apache-arrow-glib-* tmp-c_glib/
+rm -rf ${extract_dir}
 
 # replace c_glib/ by tar.gz generated by "make dist"
-rm -rf $tag
-git archive $release_hash --prefix $tag/ | tar xf -
-rm -rf $tag/c_glib
-mv tmp-c_glib $tag/c_glib
-tar czf $tarball $tag
-rm -rf $tag
+rm -rf ${tag}
+git archive $release_hash --prefix ${tag}/ | tar xf -
+rm -rf ${tag}/c_glib
+mv tmp-c_glib ${tag}/c_glib
+tar czf ${tarball} ${tag}
+rm -rf ${tag}
 
-${SOURCE_DIR}/run-rat.sh $tarball
+${SOURCE_DIR}/run-rat.sh ${tarball}
 
 # sign the archive
-gpg --armor --output ${tarball}.asc --detach-sig $tarball
-gpg --print-md MD5 $tarball > ${tarball}.md5
+gpg --armor --output ${tarball}.asc --detach-sig ${tarball}
+gpg --print-md MD5 ${tarball} > ${tarball}.md5
 shasum $tarball > ${tarball}.sha
 
 # check out the arrow RC folder
 svn co --depth=empty https://dist.apache.org/repos/dist/dev/arrow tmp
 
 # add the release candidate for the tag
-mkdir -p tmp/$tagrc
-cp ${tarball}* tmp/$tagrc
-svn add tmp/$tagrc
-svn ci -m 'Apache Arrow $version RC${rc}' tmp/$tagrc
+mkdir -p tmp/${tagrc}
+cp ${tarball}* tmp/${tagrc}
+svn add tmp/${tagrc}
+svn ci -m 'Apache Arrow ${version} RC${rc}' tmp/${tagrc}
 
 # clean up
 rm -rf tmp
 
 echo "Success! The release candidate is available here:"
-echo "  https://dist.apache.org/repos/dist/dev/arrow/$tagrc"
+echo "  https://dist.apache.org/repos/dist/dev/arrow/${tagrc}"
 echo ""
-echo "Commit SHA1: $release_hash"
+echo "Commit SHA1: ${release_hash}"
diff --git a/dev/release/README b/dev/release/README
index 07402030bf699..265a23494e43e 100644
--- a/dev/release/README
+++ b/dev/release/README
@@ -1,17 +1,24 @@
 requirements:
 - being a committer to be able to push to dist and maven repository
 - a gpg key to sign the artifacts
+- use Java 7. Check your JAVA_HOME environment variable (at least for now; see ARROW-930)
+- have the build requirements for cpp and c_glib installed (see their README)
 
 to release, run the following (replace 0.1.0 with version to release):
 
 #create a release branch
 git co -b release-0_1_0
 
-# prepare release v 0.1.0 (run tests, sign artifacts). Next version will be 0.1.1-SNAPSHOT
-dev/release/00-prepare.sh 0.1.0 0.1.1
+#setup gpg agent for signing artifacts
+source dev/release/setup-gpg-agent.sh
+
+# prepare release v 0.1.0 (run tests, sign artifacts).
Next version will be 0.1.1-SNAPSHOT, RC 0 +sh dev/release/00-prepare.sh 0.1.0 0.1.1 0 +# push the tag +git push apache apache-arrow-0.1.0-rc0 # tag and push to maven repo (repo will have to be finalized separately) -dev/release/01-perform.sh +sh dev/release/01-perform.sh # create the source release -dev/release/02-source.sh 0.1.0 0 +sh dev/release/02-source.sh 0.1.0 0 useful commands: - to set the mvn version in the poms From 32a4d70db7faf196543b5701c3a5d4b527b7f947 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Tue, 2 May 2017 15:36:55 -0400 Subject: [PATCH 0597/1644] ARROW-936: add missing file; revert tag change Change-Id: I22603129e9b442ded0def4cd36b564de4a712fc8 Author: Julien Le Dem Closes #631 from julienledem/fix_README and squashes the following commits: 13bc318 [Julien Le Dem] revert tag change 2f6b14c [Julien Le Dem] add missing file; revert tag change --- dev/release/00-prepare.sh | 7 +++---- dev/release/02-source.sh | 6 +++--- dev/release/README | 6 +++--- dev/release/setup-gpg-agent.sh | 3 +++ 4 files changed, 12 insertions(+), 10 deletions(-) create mode 100644 dev/release/setup-gpg-agent.sh diff --git a/dev/release/00-prepare.sh b/dev/release/00-prepare.sh index 398f15d8270f1..c8d7909ba06f9 100644 --- a/dev/release/00-prepare.sh +++ b/dev/release/00-prepare.sh @@ -21,12 +21,11 @@ set -e SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )" -if [ "$#" -eq 3 ]; then +if [ "$#" -eq 2 ]; then version=$1 nextVersion=$2 nextVersionSNAPSHOT=${nextVersion}-SNAPSHOT - rcnum=$3 - tag=apache-arrow-${version}-rc${rcnum} + tag=apache-arrow-${version} echo "prepare release ${version} rc ${rcnum} on tag ${tag} then reset to version ${nextVersionSNAPSHOT}" @@ -40,7 +39,7 @@ if [ "$#" -eq 3 ]; then echo "Finish staging binary artifacts by running: sh dev/release/01-perform.sh" else - echo "Usage: $0 " + echo "Usage: $0 " exit fi diff --git a/dev/release/02-source.sh b/dev/release/02-source.sh index 3bb72045988cd..d3d94af0468cb 100755 --- a/dev/release/02-source.sh +++ b/dev/release/02-source.sh @@ -37,12 +37,12 @@ fi tag=apache-arrow-${version} tagrc=${tag}-rc${rc} -echo "Preparing source for ${tagrc}" +echo "Preparing source for tag ${tag}" -release_hash=`git rev-list $tagrc 2> /dev/null | head -n 1 ` +release_hash=`git rev-list $tag 2> /dev/null | head -n 1 ` if [ -z "$release_hash" ]; then - echo "Cannot continue: unknown git tag: $tagrc" + echo "Cannot continue: unknown git tag: $tag" exit fi diff --git a/dev/release/README b/dev/release/README index 265a23494e43e..cf68028005e14 100644 --- a/dev/release/README +++ b/dev/release/README @@ -11,10 +11,10 @@ git co -b release-0_1_0 #setup gpg agent for signing artifacts source dev/release/setup-gpg-agent.sh -# prepare release v 0.1.0 (run tests, sign artifacts). Next version will be 0.1.1-SNAPSHOT, RC 0 -sh dev/release/00-prepare.sh 0.1.0 0.1.1 0 +# prepare release v 0.1.0 (run tests, sign artifacts). 
Next version will be 0.1.1-SNAPSHOT +sh dev/release/00-prepare.sh 0.1.0 0.1.1 # push the tag -git push apache apache-arrow-0.1.0-rc0 +git push apache apache-arrow-0.1.0 # tag and push to maven repo (repo will have to be finalized separately) sh dev/release/01-perform.sh # create the source release diff --git a/dev/release/setup-gpg-agent.sh b/dev/release/setup-gpg-agent.sh new file mode 100644 index 0000000000000..acc08094bc5c1 --- /dev/null +++ b/dev/release/setup-gpg-agent.sh @@ -0,0 +1,3 @@ +# source me +eval $(gpg-agent --daemon --allow-preset-passphrase) +gpg --use-agent -s LICENSE.txt From 928b63f40d4d8234644acca36d41bed9390f5f3a Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 2 May 2017 15:59:30 -0400 Subject: [PATCH 0598/1644] ARROW-938: Fix Rat license warnings The .sgml files are generated by the GTK-Doc in c_glib. Author: Wes McKinney Closes #632 from wesm/ARROW-938 and squashes the following commits: 65ca79a [Wes McKinney] Fix Rat warnings --- dev/release/run-rat.sh | 1 + dev/release/setup-gpg-agent.sh | 23 ++++++++++++++++++++++- 2 files changed, 23 insertions(+), 1 deletion(-) diff --git a/dev/release/run-rat.sh b/dev/release/run-rat.sh index 9c34e073e628e..3ff9ef083e523 100755 --- a/dev/release/run-rat.sh +++ b/dev/release/run-rat.sh @@ -56,6 +56,7 @@ $RAT $1 \ -e arrow-glib-overrides.txt \ -e gtk-doc.make \ -e "*.html" \ + -e "*.sgml" \ -e "*.css" \ -e "*.png" \ -e "*.svg" \ diff --git a/dev/release/setup-gpg-agent.sh b/dev/release/setup-gpg-agent.sh index acc08094bc5c1..3e31d0e4e3c55 100644 --- a/dev/release/setup-gpg-agent.sh +++ b/dev/release/setup-gpg-agent.sh @@ -1,3 +1,24 @@ +#!/bin/bash +# +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. 
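The three-line helper that follows is meant to be sourced, not executed: the `eval` exports the agent's environment into the current shell so later `gpg` calls in the release scripts reuse the same agent, and the trial signature primes the passphrase cache. A hedged sketch of the intended usage (the artifact name is illustrative):

```sh
source dev/release/setup-gpg-agent.sh   # starts the agent, caches the passphrase once
gpg --armor --detach-sig apache-arrow-0.1.0.tar.gz   # no interactive prompt expected now
```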
+# + # source me eval $(gpg-agent --daemon --allow-preset-passphrase) -gpg --use-agent -s LICENSE.txt +gpg --use-agent -s LICENSE.txt +rm -rf LICENSE.txt.gpg From 2c3e111d45c056d429cef312533c9f3f96b08ae8 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 5 May 2017 08:18:53 +0200 Subject: [PATCH 0599/1644] ARROW-923: Changelog generation Python script, add 0.1.0 and 0.2.0 changelog Author: Wes McKinney Closes #640 from wesm/ARROW-923 and squashes the following commits: 289d3cd [Wes McKinney] Add license header 96f55f8 [Wes McKinney] Add option to write Markdown JIRA links (for website) 6c808da [Wes McKinney] Changelog Python script, add 0.1.0 and 0.2.0 changelog --- CHANGELOG.md | 403 ++++++++++++++++++++++++++++++++++++++++++ dev/make_changelog.py | 85 +++++++++ 2 files changed, 488 insertions(+) create mode 100644 CHANGELOG.md create mode 100644 dev/make_changelog.py diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100644 index 0000000000000..3d54838e1a7f0 --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,403 @@ + + +# Apache Arrow 0.2.0 (15 February 2017) + +## Bug + +* ARROW-112 - [C++] Style fix for constants/enums +* ARROW-202 - [C++] Integrate with appveyor ci for windows support and get arrow building on windows +* ARROW-220 - [C++] Build conda artifacts in a build environment with better cross-linux ABI compatibility +* ARROW-224 - [C++] Address static linking of boost dependencies +* ARROW-230 - Python: Do not name modules like native ones (i.e. rename pyarrow.io) +* ARROW-239 - [Python] HdfsFile.read called with no arguments should read remainder of file +* ARROW-261 - [C++] Refactor BinaryArray/StringArray classes to not inherit from ListArray +* ARROW-275 - Add tests for UnionVector in Arrow File +* ARROW-294 - [C++] Do not use fopen / fclose / etc. 
methods for memory mapped file implementation +* ARROW-322 - [C++] Do not build HDFS IO interface optionally +* ARROW-323 - [Python] Opt-in to PyArrow parquet build rather than skipping silently on failure +* ARROW-334 - [Python] OS X rpath issues on some configurations +* ARROW-337 - UnionListWriter.list() is doing more than it should, this can cause data corruption +* ARROW-339 - Make merge_arrow_pr script work with Python 3 +* ARROW-340 - [C++] Opening a writeable file on disk that already exists does not truncate to zero +* ARROW-342 - Set Python version on release +* ARROW-345 - libhdfs integration doesn't work for Mac +* ARROW-346 - Python API Documentation +* ARROW-348 - [Python] CMake build type should be configurable on the command line +* ARROW-349 - Six is missing as a requirement in the python setup.py +* ARROW-351 - Time type has no unit +* ARROW-354 - Connot compare an array of empty strings to another +* ARROW-357 - Default Parquet chunk_size of 64k is too small +* ARROW-358 - [C++] libhdfs can be in non-standard locations in some Hadoop distributions +* ARROW-362 - Python: Calling to_pandas on a table read from Parquet leaks memory +* ARROW-371 - Python: Table with null timestamp becomes float in pandas +* ARROW-375 - columns parameter in parquet.read_table() raises KeyError for valid column +* ARROW-384 - Align Java and C++ RecordBatch data and metadata layout +* ARROW-386 - [Java] Respect case of struct / map field names +* ARROW-387 - [C++] arrow::io::BufferReader does not permit shared memory ownership in zero-copy reads +* ARROW-390 - C++: CMake fails on json-integration-test with ARROW_BUILD_TESTS=OFF +* ARROW-392 - Fix string/binary integration tests +* ARROW-393 - [JAVA] JSON file reader fails to set the buffer size on String data vector +* ARROW-395 - Arrow file format writes record batches in reverse order. +* ARROW-398 - [Java] Java file format requires bitmaps of all 1's to be written when there are no nulls +* ARROW-399 - [Java] ListVector.loadFieldBuffers ignores the ArrowFieldNode length metadata +* ARROW-400 - [Java] ArrowWriter writes length 0 for Struct types +* ARROW-401 - [Java] Floating point vectors should do an approximate comparison in integration tests +* ARROW-402 - [Java] "refCnt gone negative" error in integration tests +* ARROW-403 - [JAVA] UnionVector: Creating a transfer pair doesn't transfer the schema to destination vector +* ARROW-404 - [Python] Closing an HdfsClient while there are still open file handles results in a crash +* ARROW-405 - [C++] Be less stringent about finding include/hdfs.h in HADOOP_HOME +* ARROW-406 - [C++] Large HDFS reads must utilize the set file buffer size when making RPCs +* ARROW-408 - [C++/Python] Remove defunct conda recipes +* ARROW-414 - [Java] "Buffer too large to resize to ..." 
error +* ARROW-420 - Align Date implementation between Java and C++ +* ARROW-421 - [Python] Zero-copy buffers read by pyarrow::PyBytesReader must retain a reference to the parent PyBytes to avoid premature garbage collection issues +* ARROW-422 - C++: IPC should depend on rapidjson_ep if RapidJSON is vendored +* ARROW-429 - git-archive SHA-256 checksums are changing +* ARROW-433 - [Python] Date conversion is locale-dependent +* ARROW-434 - Segfaults and encoding issues in Python Parquet reads +* ARROW-435 - C++: Spelling mistake in if(RAPIDJSON_VENDORED) +* ARROW-437 - [C++] clang compiler warnings from overridden virtual functions +* ARROW-445 - C++: arrow_ipc is built before arrow/ipc/Message_generated.h was generated +* ARROW-447 - Python: Align scalar/pylist string encoding with pandas' one. +* ARROW-455 - [C++] BufferOutputStream dtor does not call Close() +* ARROW-469 - C++: Add option so that resize doesn't decrease the capacity +* ARROW-481 - [Python] Fix Python 2.7 regression in patch for PARQUET-472 +* ARROW-486 - [C++] arrow::io::MemoryMappedFile can't be casted to arrow::io::FileInterface +* ARROW-487 - Python: ConvertTableToPandas segfaults if ObjectBlock::Write fails +* ARROW-494 - [C++] When MemoryMappedFile is destructed, memory is unmapped even if buffer referecnes still exist +* ARROW-499 - Update file serialization to use streaming serialization format +* ARROW-505 - [C++] Fix compiler warnings in release mode +* ARROW-511 - [Python] List[T] conversions not implemented for single arrays +* ARROW-513 - [C++] Fix Appveyor build +* ARROW-519 - [C++] Missing vtable in libarrow.dylib on Xcode 6.4 +* ARROW-523 - Python: Account for changes in PARQUET-834 +* ARROW-533 - [C++] arrow::TimestampArray / TimeArray has a broken constructor +* ARROW-535 - [Python] Add type mapping for NPY_LONGLONG +* ARROW-537 - [C++] StringArray/BinaryArray comparisons may be incorrect when values with non-zero length are null +* ARROW-540 - [C++] Fix build in aftermath of ARROW-33 +* ARROW-543 - C++: Lazily computed null_counts counts number of non-null entries +* ARROW-544 - [C++] ArrayLoader::LoadBinary fails for length-0 arrays +* ARROW-545 - [Python] Ignore files without .parq or .parquet prefix when reading directory of files +* ARROW-548 - [Python] Add nthreads option to pyarrow.Filesystem.read_parquet +* ARROW-551 - C++: Construction of Column with nullptr Array segfaults +* ARROW-556 - [Integration] Can not run Integration tests if different cpp build path +* ARROW-561 - Update java & python dependencies to improve downstream packaging experience + +## Improvement + +* ARROW-189 - C++: Use ExternalProject to build thirdparty dependencies +* ARROW-191 - Python: Provide infrastructure for manylinux1 wheels +* ARROW-328 - [C++] Return shared_ptr by value instead of const-ref? +* ARROW-330 - [C++] CMake functions to simplify shared / static library configuration +* ARROW-333 - Make writers update their internal schema even when no data is written. 
+* ARROW-335 - Improve Type apis and toString() by encapsulating flatbuffers better +* ARROW-336 - Run Apache Rat in Travis builds +* ARROW-338 - [C++] Refactor IPC vector "loading" and "unloading" to be based on cleaner visitor pattern +* ARROW-350 - Add Kerberos support to HDFS shim +* ARROW-355 - Add tests for serialising arrays of empty strings to Parquet +* ARROW-356 - Add documentation about reading Parquet +* ARROW-360 - C++: Add method to shrink PoolBuffer using realloc +* ARROW-361 - Python: Support reading a column-selection from Parquet files +* ARROW-365 - Python: Provide Array.to_pandas() +* ARROW-366 - [java] implement Dictionary vector +* ARROW-374 - Python: clarify unicode vs. binary in API +* ARROW-379 - Python: Use setuptools_scm/setuptools_scm_git_archive to provide the version number +* ARROW-380 - [Java] optimize null count when serializing vectors. +* ARROW-382 - Python: Extend API documentation +* ARROW-396 - Python: Add pyarrow.schema.Schema.equals +* ARROW-409 - Python: Change pyarrow.Table.dataframe_from_batches API to create Table instead +* ARROW-411 - [Java] Move Intergration.compare and Intergration.compareSchemas to a public utils class +* ARROW-423 - C++: Define BUILD_BYPRODUCTS in external project to support non-make CMake generators +* ARROW-425 - Python: Expose a C function to convert arrow::Table to pyarrow.Table +* ARROW-426 - Python: Conversion from pyarrow.Array to a Python list +* ARROW-430 - Python: Better version handling +* ARROW-432 - [Python] Avoid unnecessary memory copy in to_pandas conversion by using low-level pandas internals APIs +* ARROW-450 - Python: Fixes for PARQUET-818 +* ARROW-457 - Python: Better control over memory pool +* ARROW-458 - Python: Expose jemalloc MemoryPool +* ARROW-463 - C++: Support jemalloc 4.x +* ARROW-466 - C++: ExternalProject for jemalloc +* ARROW-468 - Python: Conversion of nested data in pd.DataFrames to/from Arrow structures +* ARROW-474 - Create an Arrow streaming file fomat +* ARROW-479 - Python: Test for expected schema in Pandas conversion +* ARROW-485 - [Java] Users are required to initialize VariableLengthVectors.offsetVector before calling VariableLengthVectors.mutator.getSafe +* ARROW-490 - Python: Update manylinux1 build scripts +* ARROW-524 - [java] provide apis to access nested vectors and buffers +* ARROW-525 - Python: Add more documentation to the package +* ARROW-529 - Python: Add jemalloc and Python 3.6 to manylinux1 build +* ARROW-546 - Python: Account for changes in PARQUET-867 +* ARROW-553 - C++: Faster valid bitmap building + +## New Feature + +* ARROW-108 - [C++] Add IPC round trip for union types +* ARROW-221 - Add switch for writing Parquet 1.0 compatible logical types +* ARROW-227 - [C++/Python] Hook arrow_io generic reader / writer interface into arrow_parquet +* ARROW-228 - [Python] Create an Arrow-cpp-compatible interface for reading bytes from Python file-like objects +* ARROW-243 - [C++] Add "driver" option to HdfsClient to choose between libhdfs and libhdfs3 at runtime +* ARROW-303 - [C++] Also build static libraries for leaf libraries +* ARROW-312 - [Python] Provide Python API to read/write the Arrow IPC file format +* ARROW-317 - [C++] Implement zero-copy Slice method on arrow::Buffer that retains reference to parent +* ARROW-33 - C++: Implement zero-copy array slicing +* ARROW-332 - [Python] Add helper function to convert RecordBatch to pandas.DataFrame +* ARROW-363 - Set up Java/C++ integration test harness +* ARROW-369 - [Python] Add ability to convert multiple record batches 
at once to pandas +* ARROW-373 - [C++] Implement C++ version of JSON file format for testing +* ARROW-377 - Python: Add support for conversion of Pandas.Categorical +* ARROW-381 - [C++] Simplify primitive array type builders to use a default type singleton +* ARROW-383 - [C++] Implement C++ version of ARROW-367 integration test validator +* ARROW-389 - Python: Write Parquet files to pyarrow.io.NativeFile objects +* ARROW-394 - Add integration tests for boolean, list, struct, and other basic types +* ARROW-410 - [C++] Add Flush method to arrow::io::OutputStream +* ARROW-415 - C++: Add Equals implementation to compare Tables +* ARROW-416 - C++: Add Equals implementation to compare Columns +* ARROW-417 - C++: Add Equals implementation to compare ChunkedArrays +* ARROW-418 - [C++] Consolidate array container and builder code, remove arrow/types +* ARROW-419 - [C++] Promote util/{status.h, buffer.h, memory-pool.h} to top level of arrow/ source directory +* ARROW-427 - [C++] Implement dictionary-encoded array container +* ARROW-428 - [Python] Deserialize from Arrow record batches to pandas in parallel using a thread pool +* ARROW-438 - [Python] Concatenate Table instances with equal schemas +* ARROW-440 - [C++] Support pkg-config +* ARROW-441 - [Python] Expose Arrow's file and memory map classes as NativeFile subclasses +* ARROW-442 - [Python] Add public Python API to inspect Parquet file metadata +* ARROW-444 - [Python] Avoid unnecessary memory copies from use of PyBytes_* C APIs +* ARROW-449 - Python: Conversion from pyarrow.{Table,RecordBatch} to a Python dict +* ARROW-456 - C++: Add jemalloc based MemoryPool +* ARROW-461 - [Python] Implement conversion between arrow::DictionaryArray and pandas.Categorical +* ARROW-467 - [Python] Run parquet-cpp unit tests in Travis CI +* ARROW-470 - [Python] Add "FileSystem" abstraction to access directories of files in a uniform way +* ARROW-471 - [Python] Enable ParquetFile to pass down separately-obtained file metadata +* ARROW-472 - [Python] Expose parquet::{SchemaDescriptor, ColumnDescriptor}::Equals +* ARROW-475 - [Python] High level support for reading directories of Parquet files (as a single Arrow table) from supported file system interfaces +* ARROW-476 - [Integration] Add integration tests for Binary / Varbytes type +* ARROW-477 - [Java] Add support for second/microsecond/nanosecond timestamps in-memory and in IPC/JSON layer +* ARROW-478 - [Python] Accept a PyBytes object in the pyarrow.io.BufferReader ctor +* ARROW-484 - Add more detail about what of technology can be found in the Arrow implementations to README +* ARROW-495 - [C++] Add C++ implementation of streaming serialized format +* ARROW-497 - [Java] Integration test harness for streaming format +* ARROW-498 - [C++] Integration test harness for streaming format +* ARROW-503 - [Python] Interface to streaming binary format +* ARROW-508 - [C++] Make file/memory-mapped file interfaces threadsafe +* ARROW-509 - [Python] Add support for PARQUET-835 (parallel column reads) +* ARROW-512 - C++: Add method to check for primitive types +* ARROW-514 - [Python] Accept pyarrow.io.Buffer as input to StreamReader, FileReader classes +* ARROW-515 - [Python] Add StreamReader/FileReader methods that read all record batches as a Table +* ARROW-521 - [C++/Python] Track peak memory use in default MemoryPool +* ARROW-531 - Python: Document jemalloc, extend Pandas section, add Getting Involved +* ARROW-538 - [C++] Set up AddressSanitizer (ASAN) builds +* ARROW-547 - [Python] Expose Array::Slice and 
RecordBatch::Slice +* ARROW-81 - [Format] Add a Category logical type (distinct from dictionary-encoding) + +## Task + +* ARROW-268 - [C++] Flesh out union implementation to have all required methods for IPC +* ARROW-327 - [Python] Remove conda builds from Travis CI processes +* ARROW-353 - Arrow release 0.2 +* ARROW-359 - Need to document ARROW_LIBHDFS_DIR +* ARROW-367 - [java] converter csv/json <=> Arrow file format for Integration tests +* ARROW-368 - Document use of LD_LIBRARY_PATH when using Python +* ARROW-372 - Create JSON arrow file format for integration tests +* ARROW-506 - Implement Arrow Echo server for integration testing +* ARROW-527 - clean drill-module.conf file +* ARROW-558 - Add KEYS files +* ARROW-96 - C++: API documentation using Doxygen +* ARROW-97 - Python: API documentation via sphinx-apidoc + +# Apache Arrow 0.1.0 (7 October 2016) + +## Bug + +* ARROW-103 - Missing patterns from .gitignore +* ARROW-104 - Update Layout.md based on discussion on the mailing list +* ARROW-105 - Unit tests fail if assertions are disabled +* ARROW-113 - TestValueVector test fails if cannot allocate 2GB of memory +* ARROW-16 - Building cpp issues on XCode 7.2.1 +* ARROW-17 - Set some vector fields to default access level for Drill compatibility +* ARROW-18 - Fix bug with decimal precision and scale +* ARROW-185 - [C++] Make sure alignment and memory padding conform to spec +* ARROW-188 - Python: Add numpy as install requirement +* ARROW-193 - For the instruction, typos "int his" should be "in this" +* ARROW-194 - C++: Allow read-only memory mapped source +* ARROW-200 - [Python] Convert Values String looks like it has incorrect error handling +* ARROW-209 - [C++] Broken builds: llvm.org apt repos are unavailable +* ARROW-210 - [C++] Tidy up the type system a little bit +* ARROW-211 - Several typos/errors in Layout.md examples +* ARROW-217 - Fix Travis w.r.t conda 4.1.0 changes +* ARROW-219 - [C++] Passed CMAKE_CXX_FLAGS are being dropped, fix compiler warnings +* ARROW-223 - Do not link against libpython +* ARROW-225 - [C++/Python] master Travis CI build is broken +* ARROW-244 - [C++] Some global APIs of IPC module should be visible to the outside +* ARROW-246 - [Java] UnionVector doesn't call allocateNew() when creating it's vectorType +* ARROW-247 - [C++] Missing explicit destructor in RowBatchReader causes an incomplete type error +* ARROW-250 - Fix for ARROW-246 may cause memory leaks +* ARROW-259 - Use flatbuffer fields in java implementation +* ARROW-265 - Negative decimal values have wrong padding +* ARROW-266 - [C++] Fix the broken build +* ARROW-274 - Make the MapVector nullable +* ARROW-278 - [Format] Struct type name consistency in implementations and metadata +* ARROW-283 - [C++] Update arrow_parquet to account for API changes in PARQUET-573 +* ARROW-284 - [C++] Triage builds by disabling Arrow-Parquet module +* ARROW-287 - [java] Make nullable vectors use a BitVecor instead of UInt1Vector for bits +* ARROW-297 - Fix Arrow pom for release +* ARROW-304 - NullableMapReaderImpl.isSet() always returns true +* ARROW-308 - UnionListWriter.setPosition() should not call startList() +* ARROW-309 - Types.getMinorTypeForArrowType() does not work for Union type +* ARROW-313 - XCode 8.0 breaks builds +* ARROW-314 - JSONScalar is unnecessary and unused. 
+* ARROW-320 - ComplexCopier.copy(FieldReader, FieldWriter) should not start a list if reader is not set +* ARROW-321 - Fix Arrow licences +* ARROW-36 - Remove fixVersions from patch tool (until we have them) +* ARROW-46 - Port DRILL-4410 to Arrow +* ARROW-5 - Error when run maven install +* ARROW-51 - Move ValueVector test from Drill project +* ARROW-55 - Python: fix legacy Python (2.7) tests and add to Travis CI +* ARROW-62 - Format: Are the nulls bits 0 or 1 for null values? +* ARROW-63 - C++: ctest fails if Python 3 is the active Python interpreter +* ARROW-65 - Python: FindPythonLibsNew does not work in a virtualenv +* ARROW-69 - Change permissions for assignable users +* ARROW-72 - FindParquet searches for non-existent header +* ARROW-75 - C++: Fix handling of empty strings +* ARROW-77 - C++: conform null bit interpretation to match ARROW-62 +* ARROW-80 - Segmentation fault on len(Array) for empty arrays +* ARROW-88 - C++: Refactor given PARQUET-572 +* ARROW-93 - XCode 7.3 breaks builds +* ARROW-94 - Expand list example to clarify null vs empty list + +## Improvement + +* ARROW-10 - Fix mismatch of javadoc names and method parameters +* ARROW-15 - Fix a naming typo for memory.AllocationManager.AllocationOutcome +* ARROW-190 - Python: Provide installable sdist builds +* ARROW-199 - [C++] Refine third party dependency +* ARROW-206 - [C++] Expose an equality API for arrays that compares a range of slots on two arrays +* ARROW-212 - [C++] Clarify the fact that PrimitiveArray is now abstract class +* ARROW-213 - Exposing static arrow build +* ARROW-218 - Add option to use GitHub API token via environment variable when merging PRs +* ARROW-234 - [C++] Build with libhdfs support in arrow_io in conda builds +* ARROW-238 - C++: InternalMemoryPool::Free() should throw an error when there is insufficient allocated memory +* ARROW-245 - [Format] Clarify Arrow's relationship with big endian platforms +* ARROW-252 - Add implementation guidelines to the documentation +* ARROW-253 - Int types should only have width of 8*2^n (8, 16, 32, 64) +* ARROW-254 - Remove Bit type as it is redundant with boolean +* ARROW-255 - Finalize Dictionary representation +* ARROW-256 - Add versioning to the arrow spec. 
+* ARROW-257 - Add a typeids Vector to Union type +* ARROW-264 - Create an Arrow File format +* ARROW-270 - [Format] Define more generic Interval logical type +* ARROW-271 - Update Field structure to be more explicit +* ARROW-279 - rename vector module to arrow-vector for consistency +* ARROW-280 - [C++] Consolidate file and shared memory IO interfaces +* ARROW-285 - Allow for custom flatc compiler +* ARROW-286 - Build thirdparty dependencies in parallel +* ARROW-289 - Install test-util.h +* ARROW-290 - Specialize alloc() in ArrowBuf +* ARROW-292 - [Java] Upgrade Netty to 4.041 +* ARROW-299 - Use absolute namespace in macros +* ARROW-305 - Add compression and use_dictionary options to Parquet interface +* ARROW-306 - Add option to pass cmake arguments via environment variable +* ARROW-315 - Finalize timestamp type +* ARROW-319 - Add canonical Arrow Schema json representation +* ARROW-324 - Update arrow metadata diagram +* ARROW-325 - make TestArrowFile not dependent on timezone +* ARROW-50 - C++: Enable library builds for 3rd-party users without having to build thirdparty googletest +* ARROW-54 - Python: rename package to "pyarrow" +* ARROW-64 - Add zsh support to C++ build scripts +* ARROW-66 - Maybe some missing steps in installation guide +* ARROW-68 - Update setup_build_env and third-party script to be more userfriendly +* ARROW-71 - C++: Add script to run clang-tidy on codebase +* ARROW-73 - Support CMake 2.8 +* ARROW-78 - C++: Add constructor for DecimalType +* ARROW-79 - Python: Add benchmarks +* ARROW-8 - Set up Travis CI +* ARROW-85 - C++: memcmp can be avoided in Equal when comparing with the same Buffer +* ARROW-86 - Python: Implement zero-copy Arrow-to-Pandas conversion +* ARROW-87 - Implement Decimal schema conversion for all ways supported in Parquet +* ARROW-89 - Python: Add benchmarks for Arrow<->Pandas conversion +* ARROW-9 - Rename some unchanged "Drill" to "Arrow" +* ARROW-91 - C++: First draft of an adapter class for parquet-cpp's ParquetFileReader that produces Arrow table/row batch objects + +## New Feature + +* ARROW-100 - [C++] Computing RowBatch size +* ARROW-106 - Add IPC round trip for string types (string, char, varchar, binary) +* ARROW-107 - [C++] add ipc round trip for struct types +* ARROW-13 - Add PR merge tool similar to that used in Parquet +* ARROW-19 - C++: Externalize memory allocations and add a MemoryPool abstract interface to builder classes +* ARROW-197 - [Python] Add conda dev recipe for pyarrow +* ARROW-2 - Post Simple Website +* ARROW-20 - C++: Add null count member to Array containers, remove nullable member +* ARROW-201 - C++: Initial ParquetWriter implementation +* ARROW-203 - Python: Basic filename based Parquet read/write +* ARROW-204 - [Python] Automate uploading conda build artifacts for libarrow and pyarrow +* ARROW-21 - C++: Add in-memory schema metadata container +* ARROW-214 - C++: Add String support to Parquet I/O +* ARROW-215 - C++: Support other integer types in Parquet I/O +* ARROW-22 - C++: Add schema adapter routines for converting flat Parquet schemas to in-memory Arrow schemas +* ARROW-222 - [C++] Create prototype file-like interface to HDFS (via libhdfs) and begin defining more general IO interface for Arrow data adapters +* ARROW-23 - C++: Add logical "Column" container for chunked data +* ARROW-233 - [C++] Add visibility defines for limiting shared library symbol visibility +* ARROW-236 - [Python] Enable Parquet read/write to work with HDFS file objects +* ARROW-237 - [C++] Create Arrow specializations of Parquet allocator 
and read interfaces +* ARROW-24 - C++: Add logical "Table" container +* ARROW-242 - C++/Python: Support Timestamp Data Type +* ARROW-26 - C++: Add developer instructions for building parquet-cpp integration +* ARROW-262 - [Format] Add a new format document for metadata and logical types for messaging and IPC / on-wire/file representations +* ARROW-267 - [C++] C++ implementation of file-like layout for RPC / IPC +* ARROW-28 - C++: Add google/benchmark to the 3rd-party build toolchain +* ARROW-293 - [C++] Implementations of IO interfaces for operating system files +* ARROW-296 - [C++] Remove arrow_parquet C++ module and related parts of build system +* ARROW-3 - Post Initial Arrow Format Spec +* ARROW-30 - Python: pandas/NumPy to/from Arrow conversion routines +* ARROW-301 - [Format] Add some form of user field metadata to IPC schemas +* ARROW-302 - [Python] Add support to use the Arrow file format with file-like objects +* ARROW-31 - Python: basic PyList <-> Arrow marshaling code +* ARROW-318 - [Python] Revise README to reflect current state of project +* ARROW-37 - C++: Represent boolean array data in bit-packed form +* ARROW-4 - Initial Arrow CPP Implementation +* ARROW-42 - Python: Add to Travis CI build +* ARROW-43 - Python: Add rudimentary console __repr__ for array types +* ARROW-44 - Python: Implement basic object model for scalar values (i.e. results of arrow_arr[i]) +* ARROW-48 - Python: Add Schema object wrapper +* ARROW-49 - Python: Add Column and Table wrapper interface +* ARROW-53 - Python: Fix RPATH and add source installation instructions +* ARROW-56 - Format: Specify LSB bit ordering in bit arrays +* ARROW-57 - Format: Draft data headers IDL for data interchange +* ARROW-58 - Format: Draft type metadata ("schemas") IDL +* ARROW-59 - Python: Boolean data support for builtin data structures +* ARROW-60 - C++: Struct type builder API +* ARROW-67 - C++: Draft type metadata conversion to/from IPC representation +* ARROW-7 - Add Python library build toolchain +* ARROW-70 - C++: Add "lite" DCHECK macros used in parquet-cpp +* ARROW-76 - Revise format document to include null count, defer non-nullable arrays to the domain of metadata +* ARROW-82 - C++: Implement IPC exchange for List types +* ARROW-90 - Apache Arrow cpp code does not support power architecture +* ARROW-92 - C++: Arrow to Parquet Schema conversion + +## Task + +* ARROW-1 - Import Initial Codebase +* ARROW-101 - Fix java warnings emitted by java compiler +* ARROW-102 - travis-ci support for java project +* ARROW-11 - Mirror JIRA activity to dev@arrow.apache.org +* ARROW-14 - Add JIRA components +* ARROW-251 - [C++] Expose APIs for getting code and message of the status +* ARROW-272 - Arrow release 0.1 +* ARROW-298 - create release scripts +* ARROW-35 - Add a short call-to-action / how-to-get-involved to the main README.md + +## Test + +* ARROW-260 - TestValueVector.testFixedVectorReallocation and testVariableVectorReallocation are flaky +* ARROW-83 - Add basic test infrastructure for DecimalType diff --git a/dev/make_changelog.py b/dev/make_changelog.py new file mode 100644 index 0000000000000..0ad1607c79326 --- /dev/null +++ b/dev/make_changelog.py @@ -0,0 +1,85 @@ +#!/usr/bin/env python + +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. 
+# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# Utility for generating changelogs for fix versions +# requirements: pip install jira +# Set $JIRA_USERNAME, $JIRA_PASSWORD environment variables + +from collections import defaultdict +from io import StringIO +import os +import sys + +import jira.client + +# ASF JIRA username +JIRA_USERNAME = os.environ.get("JIRA_USERNAME") +# ASF JIRA password +JIRA_PASSWORD = os.environ.get("JIRA_PASSWORD") + +JIRA_API_BASE = "https://issues.apache.org/jira" + +asf_jira = jira.client.JIRA({'server': JIRA_API_BASE}, + basic_auth=(JIRA_USERNAME, JIRA_PASSWORD)) + + +def get_issues_for_version(version): + jql = ("project=ARROW " + "AND fixVersion='{0}' " + "AND status = Resolved " + "AND resolution in (Fixed, Done) " + "ORDER BY issuetype DESC").format(version) + + return asf_jira.search_issues(jql, maxResults=9999) + + +LINK_TEMPLATE = '[{0}](https://issues.apache.org/jira/browse/{0})' + + +def format_changelog_markdown(issues, out, links=False): + issues_by_type = defaultdict(list) + for issue in issues: + issues_by_type[issue.fields.issuetype.name].append(issue) + + + for issue_type, issue_group in sorted(issues_by_type.items()): + issue_group.sort(key=lambda x: x.key) + + out.write('## {0}\n\n'.format(issue_type)) + for issue in issue_group: + if links: + name = LINK_TEMPLATE.format(issue.key) + else: + name = issue.key + out.write('* {0} - {1}\n'.format(name, + issue.fields.summary)) + out.write('\n') + + +if __name__ == '__main__': + if len(sys.argv) < 2: + print('Usage: make_changelog.py $FIX_VERSION [$LINKS]') + + buf = StringIO() + + links = len(sys.argv) > 2 and sys.argv[2] == '1' + + issues_for_version = get_issues_for_version(sys.argv[1]) + format_changelog_markdown(issues_for_version, buf, links=links) + print(buf.getvalue()) From 80b72d43e093f90e4c207e61a0f408aef7057c94 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Tue, 2 May 2017 16:07:00 -0400 Subject: [PATCH 0600/1644] [maven-release-plugin] prepare for next development iteration Change-Id: I2c3a4705c6e7c39c75cd1feb195ad167aeb12084 --- java/format/pom.xml | 2 +- java/memory/pom.xml | 20 ++++++++++---------- java/pom.xml | 2 +- java/tools/pom.xml | 2 +- java/vector/pom.xml | 2 +- 5 files changed, 14 insertions(+), 14 deletions(-) diff --git a/java/format/pom.xml b/java/format/pom.xml index 98a113a30cf78..4ef748235152d 100644 --- a/java/format/pom.xml +++ b/java/format/pom.xml @@ -15,7 +15,7 @@ arrow-java-root org.apache.arrow - 0.2.1-SNAPSHOT + 0.4.0-SNAPSHOT arrow-format diff --git a/java/memory/pom.xml b/java/memory/pom.xml index f20228b1bee62..e6d9900e24e80 100644 --- a/java/memory/pom.xml +++ b/java/memory/pom.xml @@ -1,20 +1,20 @@ - 4.0.0 org.apache.arrow arrow-java-root - 0.2.1-SNAPSHOT + 0.4.0-SNAPSHOT arrow-memory Arrow Memory diff --git a/java/pom.xml b/java/pom.xml index e586005e395c0..1fa8ef9b457be 100644 --- a/java/pom.xml +++ b/java/pom.xml @@ -20,7 +20,7 @@ org.apache.arrow arrow-java-root - 0.2.1-SNAPSHOT + 0.4.0-SNAPSHOT pom Apache Arrow Java Root 
POM diff --git a/java/tools/pom.xml b/java/tools/pom.xml index 35e5599b3b64c..6124b85379fe4 100644 --- a/java/tools/pom.xml +++ b/java/tools/pom.xml @@ -14,7 +14,7 @@ org.apache.arrow arrow-java-root - 0.2.1-SNAPSHOT + 0.4.0-SNAPSHOT arrow-tools Arrow Tools diff --git a/java/vector/pom.xml b/java/vector/pom.xml index fc3ce66ac1f80..e09193692a94d 100644 --- a/java/vector/pom.xml +++ b/java/vector/pom.xml @@ -14,7 +14,7 @@ org.apache.arrow arrow-java-root - 0.2.1-SNAPSHOT + 0.4.0-SNAPSHOT arrow-vector Arrow Vectors From bcf073c3aeca872e41f86cee14d2c43598ce3149 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Fri, 5 May 2017 10:38:33 -0400 Subject: [PATCH 0601/1644] ARROW-945: [GLib] Add a Lua example to show Torch integration Author: Kouhei Sutou Closes #637 from kou/glib-lua-to-torch-tensor and squashes the following commits: 4aba395 [Kouhei Sutou] [GLib] Add a Lua example to show Torch integration --- c_glib/example/lua/Makefile.am | 1 + c_glib/example/lua/README.md | 5 + c_glib/example/lua/read-stream.lua | 2 +- c_glib/example/lua/stream-to-torch-tensor.lua | 101 ++++++++++++++++++ 4 files changed, 108 insertions(+), 1 deletion(-) create mode 100644 c_glib/example/lua/stream-to-torch-tensor.lua diff --git a/c_glib/example/lua/Makefile.am b/c_glib/example/lua/Makefile.am index 9019d24741c1a..86bdbed8a0228 100644 --- a/c_glib/example/lua/Makefile.am +++ b/c_glib/example/lua/Makefile.am @@ -20,5 +20,6 @@ dist_lua_example_DATA = \ README.md \ read-batch.lua \ read-stream.lua \ + stream-to-torch-tensor.lua \ write-batch.lua \ write-stream.lua diff --git a/c_glib/example/lua/README.md b/c_glib/example/lua/README.md index d127573bcc368..6145bc74ddd5a 100644 --- a/c_glib/example/lua/README.md +++ b/c_glib/example/lua/README.md @@ -43,3 +43,8 @@ Here are example codes in this directory: * `read-stream.lua`: It shows how to read Arrow array from file in stream mode. + + * `stream-to-torch-tensor.lua`: It shows how to read Arrow array + from file in stream mode and convert it to + [Torch](http://torch.ch/)'s + [`Tensor` object](http://torch7.readthedocs.io/en/rtd/tensor/index.html). diff --git a/c_glib/example/lua/read-stream.lua b/c_glib/example/lua/read-stream.lua index e744bed22ad4b..987d463b981cf 100644 --- a/c_glib/example/lua/read-stream.lua +++ b/c_glib/example/lua/read-stream.lua @@ -25,7 +25,7 @@ local reader = Arrow.StreamReader.open(input) local i = 0 while true do - local record_batch = reader:get_next_record_batch(i) + local record_batch = reader:get_next_record_batch() if not record_batch then break end diff --git a/c_glib/example/lua/stream-to-torch-tensor.lua b/c_glib/example/lua/stream-to-torch-tensor.lua new file mode 100644 index 0000000000000..237d759d93e20 --- /dev/null +++ b/c_glib/example/lua/stream-to-torch-tensor.lua @@ -0,0 +1,101 @@ +-- Licensed to the Apache Software Foundation (ASF) under one +-- or more contributor license agreements. See the NOTICE file +-- distributed with this work for additional information +-- regarding copyright ownership. The ASF licenses this file +-- to you under the Apache License, Version 2.0 (the +-- "License"); you may not use this file except in compliance +-- with the License. You may obtain a copy of the License at +-- +-- http://www.apache.org/licenses/LICENSE-2.0 +-- +-- Unless required by applicable law or agreed to in writing, +-- software distributed under the License is distributed on an +-- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +-- KIND, either express or implied. 
See the License for the +-- specific language governing permissions and limitations +-- under the License. + +local lgi = require 'lgi' +local Arrow = lgi.Arrow + +local torch = require 'torch' + +Arrow.Array.torch_types = function(self) + return nil +end + +Arrow.Array.to_torch = function(self) + local types = self:torch_types() + if not types then + return nil + end + + local storage_type = types[1] + local tensor_type = types[2] + + local size = self:get_length() + local storage = storage_type(size) + if not storage then + return nil + end + + for i = 1, size do + storage[i] = self:get_value(i - 1) + end + return tensor_type(storage) +end + +Arrow.UInt8Array.torch_types = function(self) + return {torch.ByteStorage, torch.ByteTensor} +end + +Arrow.Int8Array.torch_types = function(self) + return {torch.CharStorage, torch.CharTensor} +end + +Arrow.Int16Array.torch_types = function(self) + return {torch.ShortStorage, torch.ShortTensor} +end + +Arrow.Int32Array.torch_types = function(self) + return {torch.IntStorage, torch.IntTensor} +end + +Arrow.Int64Array.torch_types = function(self) + return {torch.LongStorage, torch.LongTensor} +end + +Arrow.FloatArray.torch_types = function(self) + return {torch.FloatStorage, torch.FloatTensor} +end + +Arrow.DoubleArray.torch_types = function(self) + return {torch.DoubleStorage, torch.DoubleTensor} +end + + +local input_path = arg[1] or "/tmp/stream.arrow"; + +local input = Arrow.MemoryMappedInputStream.new(input_path) +local reader = Arrow.StreamReader.open(input) + +local i = 0 +while true do + local record_batch = reader:get_next_record_batch() + if not record_batch then + break + end + + print(string.rep("=", 40)) + print("record-batch["..i.."]:") + for j = 0, record_batch:get_n_columns() - 1 do + local column = record_batch:get_column(j) + local column_name = record_batch:get_column_name(j) + print(" "..column_name..":") + print(column:to_torch()) + end + + i = i + 1 +end + +input:close() From 9a48773afa369cbbfd4c3354134125e82e0691b7 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Fri, 5 May 2017 10:44:20 -0400 Subject: [PATCH 0602/1644] ARROW-943: [GLib] Support running unit tests with source archive Author: Kouhei Sutou Closes #635 from kou/glib-dist-test and squashes the following commits: 2c30729 [Kouhei Sutou] [GLib] Support running unit tests with source archive --- c_glib/Makefile.am | 1 + c_glib/README.md | 50 ++++++++++++++++++++++++++++++++++++++++- c_glib/test/run-test.sh | 2 +- 3 files changed, 51 insertions(+), 2 deletions(-) diff --git a/c_glib/Makefile.am b/c_glib/Makefile.am index bb52ce503e04e..2e23f125683ba 100644 --- a/c_glib/Makefile.am +++ b/c_glib/Makefile.am @@ -24,6 +24,7 @@ SUBDIRS = \ EXTRA_DIST = \ README.md \ + test \ version arrow_glib_docdir = ${datarootdir}/doc/arrow-glib diff --git a/c_glib/README.md b/c_glib/README.md index 6eadb797032bc..b6e08e358d244 100644 --- a/c_glib/README.md +++ b/c_glib/README.md @@ -143,7 +143,7 @@ You need to install Arrow C++ before you install Arrow GLib. See Arrow C++ document about how to install Arrow C++. You need [GTK-Doc](https://www.gtk.org/gtk-doc/) and -[GObject Introspection](https://wiki.gnome.org/action/show/Projects/GObjectIntrospection) +[GObject Introspection](https://wiki.gnome.org/Projects/GObjectIntrospection) to build Arrow GLib. You can install them by the followings: On Debian GNU/Linux or Ubuntu: @@ -206,3 +206,51 @@ based bindings. 
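The `stream-to-torch-tensor.lua` example above boils down to: pick a Torch storage/tensor pair per Arrow array type, copy the values element by element, and wrap the storage in a tensor. A minimal usage sketch, assuming those `to_torch` extensions are loaded:

```lua
-- Sketch only: relies on the to_torch extensions defined in
-- stream-to-torch-tensor.lua above.
local lgi = require 'lgi'
local Arrow = lgi.Arrow

local input = Arrow.MemoryMappedInputStream.new("/tmp/stream.arrow")
local reader = Arrow.StreamReader.open(input)
local record_batch = reader:get_next_record_batch()
-- An Int32 column comes back as a torch.IntTensor.
print(record_batch:get_column(0):to_torch())
input:close()
```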
Here are languages that support GObject Introspection:

See also
[Projects/GObjectIntrospection/Users - GNOME Wiki!](https://wiki.gnome.org/Projects/GObjectIntrospection/Users)
for other languages.
+
+## How to run tests
+
+Arrow GLib has unit tests. You can confirm that you installed Apache
+Arrow GLib correctly by running the unit tests.
+
+You need to install the following to run the unit tests:
+
+ * [Ruby](https://www.ruby-lang.org/)
+ * [gobject-introspection gem](https://rubygems.org/gems/gobject-introspection)
+ * [test-unit gem](https://rubygems.org/gems/test-unit)
+
+You can install them as follows:
+
+On Debian GNU/Linux or Ubuntu:
+
+```text
+% sudo apt install -y -V ruby-dev
+% sudo gem install gobject-introspection test-unit
+```
+
+On CentOS 7 or later:
+
+```text
+% sudo yum install -y git
+% git clone https://github.com/sstephenson/rbenv.git ~/.rbenv
+% git clone https://github.com/sstephenson/ruby-build.git ~/.rbenv/plugins/ruby-build
+% echo 'export PATH="$HOME/.rbenv/bin:$PATH"' >> ~/.bash_profile
+% echo 'eval "$(rbenv init -)"' >> ~/.bash_profile
+% exec ${SHELL} --login
+% sudo yum install -y gcc make patch openssl-devel readline-devel zlib-devel
+% rbenv install 2.4.1
+% rbenv global 2.4.1
+% gem install gobject-introspection test-unit
+```
+
+On macOS with [Homebrew](https://brew.sh/):
+
+```text
+% gem install gobject-introspection test-unit
+```
+
+Now, you can run the unit tests as follows:
+
+```text
+% cd c_glib
+% test/run-test.sh
+```
diff --git a/c_glib/test/run-test.sh b/c_glib/test/run-test.sh
index 9b0ec8e45f52f..efa2829d74a29 100755
--- a/c_glib/test/run-test.sh
+++ b/c_glib/test/run-test.sh
@@ -22,7 +22,7 @@ lib_dir="${base_dir}/arrow-glib/.libs"
 
 LD_LIBRARY_PATH="${lib_dir}:${LD_LIBRARY_PATH}"
 
-if [ "${NO_MAKE}" != "yes" ]; then
+if [ -f "Makefile" -a "${NO_MAKE}" != "yes" ]; then
   make -j8 > /dev/null || exit $?
 fi

From ba2880c77c5e0ebb4baf83322899223f7c5e1068 Mon Sep 17 00:00:00 2001
From: Kouhei Sutou
Date: Fri, 5 May 2017 14:51:59 -0400
Subject: [PATCH 0603/1644] ARROW-946: [GLib] Use "new" instead of "open" for
 constructor name

Because "new" is the standard constructor name.
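Concretely, the rename only touches the constructor entry points; call sites change as in this sketch (a minimal example mirroring the updated `c_glib/example` programs, with error handling trimmed):

```c
#include <stdlib.h>
#include <arrow-glib/arrow-glib.h>

int
main(void)
{
  GError *error = NULL;
  GArrowMemoryMappedInputStream *input =
    garrow_memory_mapped_input_stream_new("/tmp/batch.arrow", &error);
  if (!input) {
    g_error_free(error);
    return EXIT_FAILURE;
  }
  /* formerly: garrow_file_reader_open(...) */
  GArrowFileReader *reader =
    garrow_file_reader_new(GARROW_SEEKABLE_INPUT_STREAM(input), &error);
  if (reader)
    g_object_unref(reader);
  g_object_unref(input);
  return EXIT_SUCCESS;
}
```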
Author: Kouhei Sutou Closes #638 from kou/glib-use-new and squashes the following commits: 6b16b5d [Kouhei Sutou] [GLib] Use "new" instead of "open" for constructor name --- c_glib/arrow-glib/file-reader.cpp | 10 +++++----- c_glib/arrow-glib/file-reader.h | 4 ++-- c_glib/arrow-glib/file-writer.cpp | 12 ++++++------ c_glib/arrow-glib/file-writer.h | 6 +++--- c_glib/arrow-glib/output-stream.cpp | 12 ++++++------ c_glib/arrow-glib/output-stream.h | 6 +++--- c_glib/arrow-glib/stream-reader.cpp | 10 +++++----- c_glib/arrow-glib/stream-reader.h | 4 ++-- c_glib/arrow-glib/stream-writer.cpp | 12 ++++++------ c_glib/arrow-glib/stream-writer.h | 6 +++--- c_glib/example/lua/read-batch.lua | 2 +- c_glib/example/lua/read-stream.lua | 2 +- c_glib/example/lua/write-batch.lua | 4 ++-- c_glib/example/lua/write-stream.lua | 4 ++-- c_glib/example/read-batch.c | 4 ++-- c_glib/example/read-stream.c | 4 ++-- c_glib/test/test-file-output-stream.rb | 6 +++--- c_glib/test/test-file-writer.rb | 6 +++--- c_glib/test/test-stream-writer.rb | 6 +++--- 19 files changed, 60 insertions(+), 60 deletions(-) diff --git a/c_glib/arrow-glib/file-reader.cpp b/c_glib/arrow-glib/file-reader.cpp index bbba5a1ede7b2..c16bf194821cd 100644 --- a/c_glib/arrow-glib/file-reader.cpp +++ b/c_glib/arrow-glib/file-reader.cpp @@ -131,16 +131,16 @@ garrow_file_reader_class_init(GArrowFileReaderClass *klass) } /** - * garrow_file_reader_open: + * garrow_file_reader_new: * @input_stream: The seekable input stream to read data. * @error: (nullable): Return location for a #GError or %NULL. * - * Returns: (nullable) (transfer full): A newly opened - * #GArrowFileReader or %NULL on error. + * Returns: (nullable): A newly created #GArrowFileReader or %NULL on + * error. */ GArrowFileReader * -garrow_file_reader_open(GArrowSeekableInputStream *input_stream, - GError **error) +garrow_file_reader_new(GArrowSeekableInputStream *input_stream, + GError **error) { auto arrow_random_access_file = garrow_seekable_input_stream_get_raw(input_stream); diff --git a/c_glib/arrow-glib/file-reader.h b/c_glib/arrow-glib/file-reader.h index b737269a2945b..551e05a3d1413 100644 --- a/c_glib/arrow-glib/file-reader.h +++ b/c_glib/arrow-glib/file-reader.h @@ -70,8 +70,8 @@ struct _GArrowFileReaderClass GType garrow_file_reader_get_type(void) G_GNUC_CONST; -GArrowFileReader *garrow_file_reader_open(GArrowSeekableInputStream *input_stream, - GError **error); +GArrowFileReader *garrow_file_reader_new(GArrowSeekableInputStream *input_stream, + GError **error); GArrowSchema *garrow_file_reader_get_schema(GArrowFileReader *file_reader); guint garrow_file_reader_get_n_record_batches(GArrowFileReader *file_reader); diff --git a/c_glib/arrow-glib/file-writer.cpp b/c_glib/arrow-glib/file-writer.cpp index e615cf554fd64..e3c721c49b16f 100644 --- a/c_glib/arrow-glib/file-writer.cpp +++ b/c_glib/arrow-glib/file-writer.cpp @@ -57,18 +57,18 @@ garrow_file_writer_class_init(GArrowFileWriterClass *klass) } /** - * garrow_file_writer_open: + * garrow_file_writer_new: * @sink: The output of the writer. * @schema: The schema of the writer. * @error: (nullable): Return location for a #GError or %NULL. * - * Returns: (nullable) (transfer full): A newly opened - * #GArrowFileWriter or %NULL on error. + * Returns: (nullable): A newly created #GArrowFileWriter or %NULL on + * error.
*/ GArrowFileWriter * -garrow_file_writer_open(GArrowOutputStream *sink, - GArrowSchema *schema, - GError **error) +garrow_file_writer_new(GArrowOutputStream *sink, + GArrowSchema *schema, + GError **error) { std::shared_ptr arrow_file_writer; auto status = diff --git a/c_glib/arrow-glib/file-writer.h b/c_glib/arrow-glib/file-writer.h index 7f9a4f0399454..346dc6f242ae3 100644 --- a/c_glib/arrow-glib/file-writer.h +++ b/c_glib/arrow-glib/file-writer.h @@ -65,9 +65,9 @@ struct _GArrowFileWriterClass GType garrow_file_writer_get_type(void) G_GNUC_CONST; -GArrowFileWriter *garrow_file_writer_open(GArrowOutputStream *sink, - GArrowSchema *schema, - GError **error); +GArrowFileWriter *garrow_file_writer_new(GArrowOutputStream *sink, + GArrowSchema *schema, + GError **error); gboolean garrow_file_writer_write_record_batch(GArrowFileWriter *file_writer, GArrowRecordBatch *record_batch, diff --git a/c_glib/arrow-glib/output-stream.cpp b/c_glib/arrow-glib/output-stream.cpp index b757d44cef44e..48c48b8fdc327 100644 --- a/c_glib/arrow-glib/output-stream.cpp +++ b/c_glib/arrow-glib/output-stream.cpp @@ -184,18 +184,18 @@ garrow_file_output_stream_class_init(GArrowFileOutputStreamClass *klass) } /** - * garrow_file_output_stream_open: + * garrow_file_output_stream_new: * @path: The path of the file output stream. * @append: Whether the path is opened as append mode or recreate mode. * @error: (nullable): Return location for a #GError or %NULL. * - * Returns: (nullable) (transfer full): A newly opened - * #GArrowFileOutputStream or %NULL on error. + * Returns: (nullable): A newly opened #GArrowFileOutputStream or + * %NULL on error. */ GArrowFileOutputStream * -garrow_file_output_stream_open(const gchar *path, - gboolean append, - GError **error) +garrow_file_output_stream_new(const gchar *path, + gboolean append, + GError **error) { std::shared_ptr arrow_file_output_stream; auto status = diff --git a/c_glib/arrow-glib/output-stream.h b/c_glib/arrow-glib/output-stream.h index 2a14a24ea9051..48b891c19733d 100644 --- a/c_glib/arrow-glib/output-stream.h +++ b/c_glib/arrow-glib/output-stream.h @@ -118,9 +118,9 @@ struct _GArrowFileOutputStreamClass GType garrow_file_output_stream_get_type(void) G_GNUC_CONST; -GArrowFileOutputStream *garrow_file_output_stream_open(const gchar *path, - gboolean append, - GError **error); +GArrowFileOutputStream *garrow_file_output_stream_new(const gchar *path, + gboolean append, + GError **error); #define GARROW_TYPE_BUFFER_OUTPUT_STREAM \ diff --git a/c_glib/arrow-glib/stream-reader.cpp b/c_glib/arrow-glib/stream-reader.cpp index 017d5e91a8a4d..cc18cd84d3142 100644 --- a/c_glib/arrow-glib/stream-reader.cpp +++ b/c_glib/arrow-glib/stream-reader.cpp @@ -132,16 +132,16 @@ garrow_stream_reader_class_init(GArrowStreamReaderClass *klass) } /** - * garrow_stream_reader_open: + * garrow_stream_reader_new: * @stream: The stream to be read. * @error: (nullable): Return location for a #GError or %NULL. * - * Returns: (nullable) (transfer full): A newly opened - * #GArrowStreamReader or %NULL on error. + * Returns: (nullable): A newly created #GArrowStreamReader or %NULL + * on error.
*/ GArrowStreamReader * -garrow_stream_reader_open(GArrowInputStream *stream, - GError **error) +garrow_stream_reader_new(GArrowInputStream *stream, + GError **error) { std::shared_ptr arrow_stream_reader; auto status = diff --git a/c_glib/arrow-glib/stream-reader.h b/c_glib/arrow-glib/stream-reader.h index 16a7f57bf801b..2ea2c26a9e541 100644 --- a/c_glib/arrow-glib/stream-reader.h +++ b/c_glib/arrow-glib/stream-reader.h @@ -70,8 +70,8 @@ struct _GArrowStreamReaderClass GType garrow_stream_reader_get_type(void) G_GNUC_CONST; -GArrowStreamReader *garrow_stream_reader_open(GArrowInputStream *stream, - GError **error); +GArrowStreamReader *garrow_stream_reader_new(GArrowInputStream *stream, + GError **error); GArrowSchema *garrow_stream_reader_get_schema(GArrowStreamReader *stream_reader); GArrowRecordBatch *garrow_stream_reader_get_next_record_batch(GArrowStreamReader *stream_reader, diff --git a/c_glib/arrow-glib/stream-writer.cpp b/c_glib/arrow-glib/stream-writer.cpp index cc24f263bfca9..45e2fb099535b 100644 --- a/c_glib/arrow-glib/stream-writer.cpp +++ b/c_glib/arrow-glib/stream-writer.cpp @@ -132,18 +132,18 @@ garrow_stream_writer_class_init(GArrowStreamWriterClass *klass) } /** - * garrow_stream_writer_open: + * garrow_stream_writer_new: * @sink: The output of the writer. * @schema: The schema of the writer. * @error: (nullable): Return location for a #GError or %NULL. * - * Returns: (nullable) (transfer full): A newly opened - * #GArrowStreamWriter or %NULL on + * error. + * Returns: (nullable): A newly created #GArrowStreamWriter or %NULL on + * error. */ GArrowStreamWriter * -garrow_stream_writer_open(GArrowOutputStream *sink, - GArrowSchema *schema, - GError **error) +garrow_stream_writer_new(GArrowOutputStream *sink, + GArrowSchema *schema, + GError **error) { std::shared_ptr arrow_stream_writer; auto status = diff --git a/c_glib/arrow-glib/stream-writer.h b/c_glib/arrow-glib/stream-writer.h index 6e773f1fc316e..d718b188d8fff 100644 --- a/c_glib/arrow-glib/stream-writer.h +++ b/c_glib/arrow-glib/stream-writer.h @@ -69,9 +69,9 @@ struct _GArrowStreamWriterClass GType garrow_stream_writer_get_type(void) G_GNUC_CONST; -GArrowStreamWriter *garrow_stream_writer_open(GArrowOutputStream *sink, - GArrowSchema *schema, - GError **error); +GArrowStreamWriter *garrow_stream_writer_new(GArrowOutputStream *sink, + GArrowSchema *schema, + GError **error); gboolean garrow_stream_writer_write_record_batch(GArrowStreamWriter *stream_writer, GArrowRecordBatch *record_batch, diff --git a/c_glib/example/lua/read-batch.lua b/c_glib/example/lua/read-batch.lua index 8b129c9e4e7a3..090a857ee555d 100644 --- a/c_glib/example/lua/read-batch.lua +++ b/c_glib/example/lua/read-batch.lua @@ -21,7 +21,7 @@ local Arrow = lgi.Arrow local input_path = arg[1] or "/tmp/batch.arrow"; local input = Arrow.MemoryMappedInputStream.new(input_path) -local reader = Arrow.FileReader.open(input) +local reader = Arrow.FileReader.new(input) for i = 0, reader:get_n_record_batches() - 1 do local record_batch = reader:get_record_batch(i) diff --git a/c_glib/example/lua/read-stream.lua b/c_glib/example/lua/read-stream.lua index 987d463b981cf..d7ac5ebbd2d97 100644 --- a/c_glib/example/lua/read-stream.lua +++ b/c_glib/example/lua/read-stream.lua @@ -21,7 +21,7 @@ local Arrow = lgi.Arrow local input_path = arg[1] or "/tmp/stream.arrow"; local input = Arrow.MemoryMappedInputStream.new(input_path) -local reader = Arrow.StreamReader.open(input) +local reader = Arrow.StreamReader.new(input) local i = 0 while true do diff --git
a/c_glib/example/lua/write-batch.lua b/c_glib/example/lua/write-batch.lua index 3a22cd57fd81e..663f8ef995551 100644 --- a/c_glib/example/lua/write-batch.lua +++ b/c_glib/example/lua/write-batch.lua @@ -34,8 +34,8 @@ local fields = { } local schema = Arrow.Schema.new(fields) -local output = Arrow.FileOutputStream.open(output_path, false) -local writer = Arrow.FileWriter.open(output, schema) +local output = Arrow.FileOutputStream.new(output_path, false) +local writer = Arrow.FileWriter.new(output, schema) function build_array(builder, values) for _, value in pairs(values) do diff --git a/c_glib/example/lua/write-stream.lua b/c_glib/example/lua/write-stream.lua index 37c6bb54cd8f4..fb6cc557a98c2 100644 --- a/c_glib/example/lua/write-stream.lua +++ b/c_glib/example/lua/write-stream.lua @@ -34,8 +34,8 @@ local fields = { } local schema = Arrow.Schema.new(fields) -local output = Arrow.FileOutputStream.open(output_path, false) -local writer = Arrow.StreamWriter.open(output, schema) +local output = Arrow.FileOutputStream.new(output_path, false) +local writer = Arrow.StreamWriter.new(output, schema) function build_array(builder, values) for _, value in pairs(values) do diff --git a/c_glib/example/read-batch.c b/c_glib/example/read-batch.c index 25f19b24393b2..212b2a7a342f0 100644 --- a/c_glib/example/read-batch.c +++ b/c_glib/example/read-batch.c @@ -103,8 +103,8 @@ main(int argc, char **argv) { GArrowFileReader *reader; - reader = garrow_file_reader_open(GARROW_SEEKABLE_INPUT_STREAM(input), - &error); + reader = garrow_file_reader_new(GARROW_SEEKABLE_INPUT_STREAM(input), + &error); if (!reader) { g_print("failed to open file reader: %s\n", error->message); g_error_free(error); diff --git a/c_glib/example/read-stream.c b/c_glib/example/read-stream.c index ca5b9d97cc9df..28a3f5e2b9e1c 100644 --- a/c_glib/example/read-stream.c +++ b/c_glib/example/read-stream.c @@ -102,8 +102,8 @@ main(int argc, char **argv) { GArrowStreamReader *reader; - reader = garrow_stream_reader_open(GARROW_INPUT_STREAM(input), - &error); + reader = garrow_stream_reader_new(GARROW_INPUT_STREAM(input), + &error); if (!reader) { g_print("failed to open stream reader: %s\n", error->message); g_error_free(error); diff --git a/c_glib/test/test-file-output-stream.rb b/c_glib/test/test-file-output-stream.rb index 26737c0c87b38..237781ac00e2b 100644 --- a/c_glib/test/test-file-output-stream.rb +++ b/c_glib/test/test-file-output-stream.rb @@ -16,12 +16,12 @@ # under the License. 
class TestFileOutputStream < Test::Unit::TestCase - sub_test_case(".open") do + sub_test_case(".new") do def test_create tempfile = Tempfile.open("arrow-io-file-output-stream") tempfile.write("Hello") tempfile.close - file = Arrow::FileOutputStream.open(tempfile.path, false) + file = Arrow::FileOutputStream.new(tempfile.path, false) file.close assert_equal("", File.read(tempfile.path)) end @@ -30,7 +30,7 @@ def test_append tempfile = Tempfile.open("arrow-io-file-output-stream") tempfile.write("Hello") tempfile.close - file = Arrow::FileOutputStream.open(tempfile.path, true) + file = Arrow::FileOutputStream.new(tempfile.path, true) file.close assert_equal("Hello", File.read(tempfile.path)) end diff --git a/c_glib/test/test-file-writer.rb b/c_glib/test/test-file-writer.rb index 6d4100a8cd38a..1d9102b6b0085 100644 --- a/c_glib/test/test-file-writer.rb +++ b/c_glib/test/test-file-writer.rb @@ -18,11 +18,11 @@ class TestFileWriter < Test::Unit::TestCase def test_write_record_batch tempfile = Tempfile.open("arrow-ipc-file-writer") - output = Arrow::FileOutputStream.open(tempfile.path, false) + output = Arrow::FileOutputStream.new(tempfile.path, false) begin field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) schema = Arrow::Schema.new([field]) - file_writer = Arrow::FileWriter.open(output, schema) + file_writer = Arrow::FileWriter.new(output, schema) begin record_batch = Arrow::RecordBatch.new(schema, 0, []) file_writer.write_record_batch(record_batch) @@ -35,7 +35,7 @@ def test_write_record_batch input = Arrow::MemoryMappedInputStream.new(tempfile.path) begin - file_reader = Arrow::FileReader.open(input) + file_reader = Arrow::FileReader.new(input) assert_equal(["enabled"], file_reader.schema.fields.collect(&:name)) ensure diff --git a/c_glib/test/test-stream-writer.rb b/c_glib/test/test-stream-writer.rb index 4280c1b32e0f7..d27eaa54fc53c 100644 --- a/c_glib/test/test-stream-writer.rb +++ b/c_glib/test/test-stream-writer.rb @@ -20,11 +20,11 @@ class TestStreamWriter < Test::Unit::TestCase def test_write_record_batch tempfile = Tempfile.open("arrow-ipc-stream-writer") - output = Arrow::FileOutputStream.open(tempfile.path, false) + output = Arrow::FileOutputStream.new(tempfile.path, false) begin field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) schema = Arrow::Schema.new([field]) - stream_writer = Arrow::StreamWriter.open(output, schema) + stream_writer = Arrow::StreamWriter.new(output, schema) begin columns = [ build_boolean_array([true]), @@ -40,7 +40,7 @@ def test_write_record_batch input = Arrow::MemoryMappedInputStream.new(tempfile.path) begin - stream_reader = Arrow::StreamReader.open(input) + stream_reader = Arrow::StreamReader.new(input) assert_equal(["enabled"], stream_reader.schema.fields.collect(&:name)) assert_equal(true, From cc06197bc2825e4602a72730611d523dbc3b80e8 Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Fri, 5 May 2017 15:06:42 -0400 Subject: [PATCH 0604/1644] ARROW-948: [GLib] Update C++ header file list Author: Kouhei Sutou Closes #641 from kou/glib-fix-cpp-header-list and squashes the following commits: f6b63ab [Kouhei Sutou] [GLib] Update C++ header file list --- c_glib/arrow-glib/arrow-glib.hpp | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/c_glib/arrow-glib/arrow-glib.hpp b/c_glib/arrow-glib/arrow-glib.hpp index 339773f651de1..3184761d4e148 100644 --- a/c_glib/arrow-glib/arrow-glib.hpp +++ b/c_glib/arrow-glib/arrow-glib.hpp @@ -20,18 +20,15 @@ #pragma once #include + #include #include -#include -#include 
+#include #include #include #include -#include #include #include -#include -#include #include #include #include @@ -41,11 +38,10 @@ #include #include #include -#include #include -#include #include #include +#include #include #include From f63ff08643a79a7f9902fb01157e88902c85c9fc Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Fri, 5 May 2017 19:28:38 -0400 Subject: [PATCH 0605/1644] ARROW-52: Set up project blog, draft 0.3 release posting This will need additional updates after the 0.3 release to fix dates and URLs, but wanted to get this up for review and comment. Some things I did here: * Fixed top navbar on mobile devices * Set up blogroll and simple blog templates * Added "Project Links" menu * Moved committers to a separate page Author: Wes McKinney Closes #639 from wesm/ARROW-52 and squashes the following commits: a104fbd [Wes McKinney] Code review comments and fix typos 5262583 [Wes McKinney] Typo 2236a7c [Wes McKinney] tweak 3f57b51 [Wes McKinney] Typo 4b22731 [Wes McKinney] Finish 0.3 release blog draft 5e34079 [Wes McKinney] Drafting 0.3 release blog post 058c625 [Wes McKinney] Rename post d22490d [Wes McKinney] Fix mobile navbar. Move committers to separate page. Add project links nav, install page. Blog page placeholder --- site/_config.yml | 2 + site/_includes/blog_contents.html | 12 ++ site/_includes/blog_entry.html | 39 ++++ site/_includes/footer.html | 1 + site/_includes/header.html | 31 ++- site/_layouts/blog.html | 15 ++ site/_layouts/post.html | 34 ++++ site/_posts/2017-05-08-0.3-release.md | 263 ++++++++++++++++++++++++++ site/blog.html | 28 +++ site/committers.html | 100 ++++++++++ site/css/main.scss | 5 +- site/index.html | 120 +----------- site/install.html | 11 ++ 13 files changed, 537 insertions(+), 124 deletions(-) create mode 100644 site/_includes/blog_contents.html create mode 100644 site/_includes/blog_entry.html create mode 100644 site/_layouts/blog.html create mode 100644 site/_layouts/post.html create mode 100644 site/_posts/2017-05-08-0.3-release.md create mode 100644 site/blog.html create mode 100644 site/committers.html create mode 100644 site/install.html diff --git a/site/_config.yml b/site/_config.yml index 346565e6d5cca..d7e0bb37e3eb0 100644 --- a/site/_config.yml +++ b/site/_config.yml @@ -14,8 +14,10 @@ # limitations under the License. # markdown: kramdown +permalink: /blog/:year/:month/:day/:title/ repository: https://github.com/apache/arrow destination: build +excerpt_separator: "" exclude: - Gemfile diff --git a/site/_includes/blog_contents.html b/site/_includes/blog_contents.html new file mode 100644 index 0000000000000..b3b785d4318cd --- /dev/null +++ b/site/_includes/blog_contents.html @@ -0,0 +1,12 @@ +
+[blog archive navigation markup lost in extraction]
    diff --git a/site/_includes/blog_entry.html b/site/_includes/blog_entry.html new file mode 100644 index 0000000000000..cdc0060669c2f --- /dev/null +++ b/site/_includes/blog_entry.html @@ -0,0 +1,39 @@ +{% comment %} +Licensed to the Apache Software Foundation (ASF) under one or more +contributor license agreements. See the NOTICE file distributed with +this work for additional information regarding copyright ownership. +The ASF licenses this file to you under the Apache License, Version 2.0 +(the "License"); you may not use this file except in compliance with +the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +{% endcomment %} +
+{{ post.title }}
+Published {{ post.date | date_to_string }}
+{{ post.content }}
    diff --git a/site/_includes/footer.html b/site/_includes/footer.html index c2a7d5e92bb20..c3ce968b8fb8e 100644 --- a/site/_includes/footer.html +++ b/site/_includes/footer.html @@ -1,3 +1,4 @@ +

    Apache Arrow, Arrow, Apache, the Apache feather logo, and the Apache Arrow project logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.

    © 2017 Apache Software Foundation

    diff --git a/site/_includes/header.html b/site/_includes/header.html index 3d61494f2f109..d1526f69faf16 100644 --- a/site/_includes/header.html +++ b/site/_includes/header.html @@ -1,12 +1,33 @@
+[header navigation markup lost in extraction; the new files site/_layouts/blog.html, site/_layouts/post.html, site/_posts/2017-05-08-0.3-release.md, and site/blog.html follow here, and the table below is the new site/committers.html]
+<table>
+  <tr><th>Name</th><th>Alias (email is &lt;alias&gt;@apache.org)</th></tr>
+  <tr><td>Jacques Nadeau</td><td>jacques</td></tr>
+  <tr><td>Todd Lipcon</td><td>todd</td></tr>
+  <tr><td>Ted Dunning</td><td>tdunning</td></tr>
+  <tr><td>Michael Stack</td><td>stack</td></tr>
+  <tr><td>P. Taylor Goetz</td><td>ptgoetz</td></tr>
+  <tr><td>Julian Hyde</td><td>jhyde</td></tr>
+  <tr><td>Reynold Xin</td><td>rxin</td></tr>
+  <tr><td>James Taylor</td><td>jamestaylor</td></tr>
+  <tr><td>Julien Le Dem</td><td>julien</td></tr>
+  <tr><td>Jake Luciani</td><td>jake</td></tr>
+  <tr><td>Jason Altekruse</td><td>json</td></tr>
+  <tr><td>Alex Levenson</td><td>alexlevenson</td></tr>
+  <tr><td>Parth Chandra</td><td>parthc</td></tr>
+  <tr><td>Marcel Kornacker</td><td>marcel</td></tr>
+  <tr><td>Steven Phillips</td><td>smp</td></tr>
+  <tr><td>Hanifi Gunes</td><td>hg</td></tr>
+  <tr><td>Abdelhakim Deneche</td><td>adeneche</td></tr>
+  <tr><td>Wes McKinney</td><td>wesm</td></tr>
+  <tr><td>David Alves</td><td>dralves</td></tr>
+  <tr><td>Ippokratis Pandis</td><td>ippokratis</td></tr>
+  <tr><td>Uwe L. Korn</td><td>uwe</td></tr>
+</table>
diff --git a/site/css/main.scss b/site/css/main.scss index 24b46ae24ccf2..e8b2165bbcda7 100644 --- a/site/css/main.scss +++ b/site/css/main.scss @@ -5,6 +5,9 @@ $container-desktop: 960px; $container-large-desktop: $container-desktop; $grid-gutter-width: 15px; +$font-family-base: "Droid Serif",Georgia,Helvetica,sans-serif; +$font-size-base: 14px; + @import "bootstrap-sprockets"; @import "bootstrap"; -@import "font-awesome"; +@import "font-awesome"; \ No newline at end of file diff --git a/site/index.html b/site/index.html index aecaea525166e..d80925ce24bcd 100644

 Apache Arrow
 Powering Columnar In-Memory Analytics
 Join Mailing List
+Install (0.3.0 Release - May 5, 2017)
+Latest News: Apache Arrow 0.3.0 release
 Fast
@@ -24,31 +26,6 @@
 Standard
-Developer Mailing List
-Developer Resources
-Arrow is still early in development.
-Source Code (http) (git)
-Issue Tracker (JIRA)
-Chat Room (Slack)
-Latest release
-Apache Arrow 0.2.0 is an early release and the APIs are still evolving. The metadata and physical data representation should be fairly stable as we have spent time finalizing the details.
-source release
-tag apache-arrow-0.2.0
-java artifacts on maven central
 Performance Advantage of Columnar In-Memory
 SIMD
@@ -73,99 +50,6 @@
 Advantages of a Common Data Layer
-Committers
-<table>
-  <tr><th>Name</th><th>Alias (email is &lt;alias&gt;@apache.org)</th></tr>
-  <tr><td>Jacques Nadeau</td><td>jacques</td></tr>
-  <tr><td>Todd Lipcon</td><td>todd</td></tr>
-  <tr><td>Ted Dunning</td><td>tdunning</td></tr>
-  <tr><td>Michael Stack</td><td>stack</td></tr>
-  <tr><td>P. Taylor Goetz</td><td>ptgoetz</td></tr>
-  <tr><td>Julian Hyde</td><td>jhyde</td></tr>
-  <tr><td>Reynold Xin</td><td>rxin</td></tr>
-  <tr><td>James Taylor</td><td>jamestaylor</td></tr>
-  <tr><td>Julien Le Dem</td><td>julien</td></tr>
-  <tr><td>Jake Luciani</td><td>jake</td></tr>
-  <tr><td>Jason Altekruse</td><td>json</td></tr>
-  <tr><td>Alex Levenson</td><td>alexlevenson</td></tr>
-  <tr><td>Parth Chandra</td><td>parthc</td></tr>
-  <tr><td>Marcel Kornacker</td><td>marcel</td></tr>
-  <tr><td>Steven Phillips</td><td>smp</td></tr>
-  <tr><td>Hanifi Gunes</td><td>hg</td></tr>
-  <tr><td>Abdelhakim Deneche</td><td>adeneche</td></tr>
-  <tr><td>Wes McKinney</td><td>wesm</td></tr>
-  <tr><td>David Alves</td><td>dralves</td></tr>
-  <tr><td>Ippokratis Pandis</td><td>ippokratis</td></tr>
-  <tr><td>Uwe L. Korn</td><td>uwe</td></tr>
-</table>
diff --git a/site/install.html b/site/install.html new file mode 100644 index 0000000000000..7734eeb303315 --- /dev/null +++ b/site/install.html @@ -0,0 +1,11 @@ +--- +layout: default +--- +

+Current Version: 0.2.0
+Released: May 5, 2017
+
+Apache Arrow 0.2.0 is an early release and the APIs are still evolving. The
+metadata and physical data representation should be fairly stable as we have
+spent time finalizing the details.
+
+source release
+tag apache-arrow-0.2.0
+java artifacts on maven central

From 1a6d135bed84919c166e5a08d894811d26eb3ea7 Mon Sep 17 00:00:00 2001 From: Philipp Moritz Date: Fri, 5 May 2017 19:42:03 -0400 Subject: [PATCH 0606/1644] ARROW-952: fix regex include from C++ standard library Author: Philipp Moritz Closes #643 from pcmoritz/fix-regex-include and squashes the following commits: d72e2c2 [Philipp Moritz] fix regex include from C++ standard library --- cpp/src/arrow/util/decimal.h | 1 - 1 file changed, 1 deletion(-) diff --git a/cpp/src/arrow/util/decimal.h b/cpp/src/arrow/util/decimal.h index c73bae1b4c995..f113c3359eaeb 100644 --- a/cpp/src/arrow/util/decimal.h +++ b/cpp/src/arrow/util/decimal.h @@ -21,7 +21,6 @@ #include #include #include -#include #include #include "arrow/status.h" From 316c63dbaec8e5df33d0cf0fa78d38ac8cc375b8 Mon Sep 17 00:00:00 2001 From: Julien Le Dem Date: Fri, 5 May 2017 22:05:46 -0400 Subject: [PATCH 0607/1644] ARROW-824: Date and Time Vectors should reflect timezone-less semantics The current Java vectors support the timezone-less versions of the time-related types, but in an incomplete way. This change fixes that and clarifies what the vector implementations do. We'll need separate vectors, or to adapt these, to deal with timezone-aware time types. Author: Julien Le Dem Closes #568 from julienledem/ARROW-824 and squashes the following commits: 3528ad8 [Julien Le Dem] add license e37385c [Julien Le Dem] centralize LocalDateTime.toMillis bdac7ff [Julien Le Dem] make integration tests output more readable b0da88c [Julien Le Dem] fix failing integration test 61518ec [Julien Le Dem] improve travis integration ec19e7d [Julien Le Dem] ARROW-824: Date and Time Vectors should reflect timezone-less semantics --- ci/travis_script_integration.sh | 3 +- integration/integration_test.py | 27 ++++++++++------- .../main/codegen/data/ValueVectorTypes.tdd | 14 ++++----- .../src/main/codegen/includes/vv_imports.ftl | 1 + .../templates/AbstractFieldReader.java | 6 ++-- .../src/main/codegen/templates/ArrowType.java | 1 + .../main/codegen/templates/BaseReader.java | 5 ++-- .../codegen/templates/ComplexReaders.java | 7 +++++ .../codegen/templates/ComplexWriters.java | 3 ++ .../codegen/templates/FixedValueVectors.java | 18 ++++------- .../codegen/templates/HolderReaderImpl.java | 12 ++++---- .../main/codegen/templates/NullReader.java | 6 ++-- .../main/codegen/templates/UnionReader.java | 6 ++-- .../main/codegen/templates/UnionVector.java | 1 + .../main/codegen/templates/ValueHolders.java | 3 ++ .../org/apache/arrow/vector/types/Types.java | 11 ++++--- .../apache/arrow/vector/util/DateUtility.java | 10 +++++++ .../java/org/joda/time/LocalDateTimes.java | 30 +++++++++++++++++++ .../complex/writer/TestComplexWriter.java | 19 ++++++------ .../arrow/vector/file/BaseFileTest.java | 20 ++++++------- .../apache/arrow/vector/pojo/TestConvert.java | 2 +- 21 files changed, 134 insertions(+), 71 deletions(-) create mode 100644 java/vector/src/main/java/org/joda/time/LocalDateTimes.java diff --git a/ci/travis_script_integration.sh b/ci/travis_script_integration.sh index 56f5ab7d2d35e..6e93ed79a2266 100755 --- a/ci/travis_script_integration.sh +++ b/ci/travis_script_integration.sh @@ -18,7 +18,8 @@ source $TRAVIS_BUILD_DIR/ci/travis_env_common.sh pushd $ARROW_JAVA_DIR -mvn package +echo "mvn package" +mvn -B clean package 2>&1 > mvn_package.log || (cat mvn_package.log && false) popd diff --git a/integration/integration_test.py b/integration/integration_test.py index 661f5c97770d9..646646997f72c 100644 --- a/integration/integration_test.py +++ b/integration/integration_test.py
@@ -544,7 +544,8 @@ def get_json(self): class JSONFile(object): - def __init__(self, schema, batches): + def __init__(self, name, schema, batches): + self.name = name self.schema = schema self.batches = batches @@ -579,7 +580,7 @@ def get_field(name, type_, nullable=True): raise TypeError(dtype) -def _generate_file(fields, batch_sizes): +def _generate_file(name, fields, batch_sizes): schema = JSONSchema(fields) batches = [] for size in batch_sizes: @@ -590,7 +591,7 @@ def _generate_file(fields, batch_sizes): batches.append(JSONRecordBatch(size, columns)) - return JSONFile(schema, batches) + return JSONFile(name, schema, batches) def generate_primitive_case(batch_sizes): @@ -604,7 +605,7 @@ def generate_primitive_case(batch_sizes): fields.append(get_field(type_ + "_nullable", type_, True)) fields.append(get_field(type_ + "_nonnullable", type_, False)) - return _generate_file(fields, batch_sizes) + return _generate_file("primitive", fields, batch_sizes) def generate_datetime_case(): @@ -619,11 +620,11 @@ def generate_datetime_case(): TimestampType('f7', 'ms'), TimestampType('f8', 'us'), TimestampType('f9', 'ns'), - TimestampType('f10', 'ms', tz='America/New_York') + TimestampType('f10', 'ms', tz=None) ] batch_sizes = [7, 10] - return _generate_file(fields, batch_sizes) + return _generate_file("datetime", fields, batch_sizes) def generate_nested_case(): @@ -637,7 +638,7 @@ def generate_nested_case(): ] batch_sizes = [7, 10] - return _generate_file(fields, batch_sizes) + return _generate_file("nested", fields, batch_sizes) def get_generated_json_files(): @@ -655,7 +656,7 @@ def _temp_path(): generated_paths = [] for file_obj in file_objs: - out_path = os.path.join(temp_dir, guid() + '.json') + out_path = os.path.join(temp_dir, 'generated_' + file_obj.name + '.json') file_obj.write(out_path) generated_paths.append(out_path) @@ -684,11 +685,15 @@ def _compare_implementations(self, producer, consumer): consumer.name)) for json_path in self.json_files: + print('=====================================================================================') print('Testing file {0}'.format(json_path)) + print('=====================================================================================') + + name = os.path.splitext(os.path.basename(json_path))[0] # Make the random access file print('-- Creating binary inputs') - producer_file_path = os.path.join(self.temp_dir, guid()) + producer_file_path = os.path.join(self.temp_dir, guid() + '_' + name + '.json_to_arrow') producer.json_to_file(json_path, producer_file_path) # Validate the file @@ -696,8 +701,8 @@ def _compare_implementations(self, producer, consumer): consumer.validate(json_path, producer_file_path) print('-- Validating stream') - producer_stream_path = os.path.join(self.temp_dir, guid()) - consumer_file_path = os.path.join(self.temp_dir, guid()) + producer_stream_path = os.path.join(self.temp_dir, guid() + '_' + name + '.arrow_to_stream') + consumer_file_path = os.path.join(self.temp_dir, guid() + '_' + name + '.stream_to_arrow') producer.file_to_stream(producer_file_path, producer_stream_path) consumer.stream_to_file(producer_stream_path, diff --git a/java/vector/src/main/codegen/data/ValueVectorTypes.tdd b/java/vector/src/main/codegen/data/ValueVectorTypes.tdd index d472b559347f0..ca6d9ecbe85e6 100644 --- a/java/vector/src/main/codegen/data/ValueVectorTypes.tdd +++ b/java/vector/src/main/codegen/data/ValueVectorTypes.tdd @@ -59,7 +59,7 @@ { class: "DateDay" }, { class: "IntervalYear", javaType: "int", friendlyType: "Period" }, { class: 
"TimeSec" }, - { class: "TimeMilli", javaType: "int", friendlyType: "DateTime" } + { class: "TimeMilli", javaType: "int", friendlyType: "LocalDateTime" } ] }, { @@ -71,12 +71,12 @@ minor: [ { class: "BigInt"}, { class: "UInt8" }, - { class: "Float8", javaType: "double" , boxedType: "Double", fields: [{name: "value", type: "double"}], }, - { class: "DateMilli", javaType: "long", friendlyType: "DateTime" }, - { class: "TimeStampSec", javaType: "long", boxedType: "Long", friendlyType: "DateTime" }, - { class: "TimeStampMilli", javaType: "long", boxedType: "Long", friendlyType: "DateTime" }, - { class: "TimeStampMicro", javaType: "long", boxedType: "Long", friendlyType: "DateTime" }, - { class: "TimeStampNano", javaType: "long", boxedType: "Long", friendlyType: "DateTime" }, + { class: "Float8", javaType: "double", boxedType: "Double", fields: [{name: "value", type: "double"}], }, + { class: "DateMilli", javaType: "long", friendlyType: "LocalDateTime" }, + { class: "TimeStampSec", javaType: "long", boxedType: "Long", friendlyType: "LocalDateTime" }, + { class: "TimeStampMilli", javaType: "long", boxedType: "Long", friendlyType: "LocalDateTime" }, + { class: "TimeStampMicro", javaType: "long", boxedType: "Long", friendlyType: "LocalDateTime" }, + { class: "TimeStampNano", javaType: "long", boxedType: "Long", friendlyType: "LocalDateTime" }, { class: "TimeMicro" }, { class: "TimeNano" } ] diff --git a/java/vector/src/main/codegen/includes/vv_imports.ftl b/java/vector/src/main/codegen/includes/vv_imports.ftl index 9b4b79bfd7b90..e723e7d7ea3d0 100644 --- a/java/vector/src/main/codegen/includes/vv_imports.ftl +++ b/java/vector/src/main/codegen/includes/vv_imports.ftl @@ -57,6 +57,7 @@ import java.math.BigDecimal; import java.math.BigInteger; import org.joda.time.DateTime; +import org.joda.time.LocalDateTime; import org.joda.time.Period; diff --git a/java/vector/src/main/codegen/templates/AbstractFieldReader.java b/java/vector/src/main/codegen/templates/AbstractFieldReader.java index 79d4c122f0e4e..b16ee162fde9e 100644 --- a/java/vector/src/main/codegen/templates/AbstractFieldReader.java +++ b/java/vector/src/main/codegen/templates/AbstractFieldReader.java @@ -26,8 +26,8 @@ <#include "/@includes/vv_imports.ftl" /> -/* - * This class is generated using freemarker and the ${.template_name} template. 
+/** + * Source code generated using FreeMarker template ${.template_name} */ @SuppressWarnings("unused") abstract class AbstractFieldReader extends AbstractBaseReader implements FieldReader{ @@ -51,7 +51,7 @@ public Field getField() { } <#list ["Object", "BigDecimal", "Integer", "Long", "Boolean", - "Character", "DateTime", "Period", "Double", "Float", + "Character", "LocalDateTime", "Period", "Double", "Float", "Text", "String", "Byte", "Short", "byte[]"] as friendlyType> <#assign safeType=friendlyType /> <#if safeType=="byte[]"><#assign safeType="ByteArray" /> diff --git a/java/vector/src/main/codegen/templates/ArrowType.java b/java/vector/src/main/codegen/templates/ArrowType.java index dc99aad0bb3a2..93746303d9311 100644 --- a/java/vector/src/main/codegen/templates/ArrowType.java +++ b/java/vector/src/main/codegen/templates/ArrowType.java @@ -38,6 +38,7 @@ /** * Arrow types + * Source code generated using FreeMarker template ${.template_name} **/ @JsonTypeInfo( use = JsonTypeInfo.Id.NAME, diff --git a/java/vector/src/main/codegen/templates/BaseReader.java b/java/vector/src/main/codegen/templates/BaseReader.java index 72fea58d0bc9e..ea3286e86817a 100644 --- a/java/vector/src/main/codegen/templates/BaseReader.java +++ b/java/vector/src/main/codegen/templates/BaseReader.java @@ -26,8 +26,9 @@ <#include "/@includes/vv_imports.ftl" /> - - +/** + * Source code generated using FreeMarker template ${.template_name} + */ @SuppressWarnings("unused") public interface BaseReader extends Positionable{ Field getField(); diff --git a/java/vector/src/main/codegen/templates/ComplexReaders.java b/java/vector/src/main/codegen/templates/ComplexReaders.java index d53744539aae8..38cd1bfdeb3c5 100644 --- a/java/vector/src/main/codegen/templates/ComplexReaders.java +++ b/java/vector/src/main/codegen/templates/ComplexReaders.java @@ -47,6 +47,9 @@ <#include "/@includes/vv_imports.ftl" /> +/** + * Source code generated using FreeMarker template ${.template_name} + */ @SuppressWarnings("unused") public class ${name}ReaderImpl extends AbstractFieldReader { @@ -123,12 +126,16 @@ public Object readObject(){ package org.apache.arrow.vector.complex.reader; <#include "/@includes/vv_imports.ftl" /> +/** + * Source code generated using FreeMarker template ${.template_name} + */ @SuppressWarnings("unused") public interface ${name}Reader extends BaseReader{ public void read(${minor.class?cap_first}Holder h); public void read(Nullable${minor.class?cap_first}Holder h); public Object readObject(); + // read friendly type public ${friendlyType} read${safeType}(); public boolean isSet(); public void copyAsValue(${minor.class}Writer writer); diff --git a/java/vector/src/main/codegen/templates/ComplexWriters.java b/java/vector/src/main/codegen/templates/ComplexWriters.java index 3457545cea5d7..c23b89d1bbcfb 100644 --- a/java/vector/src/main/codegen/templates/ComplexWriters.java +++ b/java/vector/src/main/codegen/templates/ComplexWriters.java @@ -139,6 +139,9 @@ public void writeNull() { package org.apache.arrow.vector.complex.writer; <#include "/@includes/vv_imports.ftl" /> +/* + * This class is generated using FreeMarker on the ${.template_name} template. 
+ */ @SuppressWarnings("unused") public interface ${eName}Writer extends BaseWriter { public void write(${minor.class}Holder h); diff --git a/java/vector/src/main/codegen/templates/FixedValueVectors.java b/java/vector/src/main/codegen/templates/FixedValueVectors.java index 05faaae1e9e2f..f403ecfac1f93 100644 --- a/java/vector/src/main/codegen/templates/FixedValueVectors.java +++ b/java/vector/src/main/codegen/templates/FixedValueVectors.java @@ -495,16 +495,14 @@ public long getTwoAsLong(int index) { <#elseif minor.class == "DateMilli"> @Override public ${friendlyType} getObject(int index) { - org.joda.time.DateTime date = new org.joda.time.DateTime(get(index), org.joda.time.DateTimeZone.UTC); - date = date.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault()); + org.joda.time.LocalDateTime date = new org.joda.time.LocalDateTime(get(index), org.joda.time.DateTimeZone.UTC); return date; } <#elseif minor.class == "TimeMilli"> @Override public ${friendlyType} getObject(int index) { - org.joda.time.DateTime time = new org.joda.time.DateTime(get(index), org.joda.time.DateTimeZone.UTC); - time = time.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault()); + org.joda.time.LocalDateTime time = new org.joda.time.LocalDateTime(get(index), org.joda.time.DateTimeZone.UTC); return time; } @@ -512,16 +510,14 @@ public long getTwoAsLong(int index) { @Override public ${friendlyType} getObject(int index) { long secs = java.util.concurrent.TimeUnit.SECONDS.toMillis(get(index)); - org.joda.time.DateTime date = new org.joda.time.DateTime(secs, org.joda.time.DateTimeZone.UTC); - date = date.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault()); + org.joda.time.LocalDateTime date = new org.joda.time.LocalDateTime(secs, org.joda.time.DateTimeZone.UTC); return date; } <#elseif minor.class == "TimeStampMilli"> @Override public ${friendlyType} getObject(int index) { - org.joda.time.DateTime date = new org.joda.time.DateTime(get(index), org.joda.time.DateTimeZone.UTC); - date = date.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault()); + org.joda.time.LocalDateTime date = new org.joda.time.LocalDateTime(get(index), org.joda.time.DateTimeZone.UTC); return date; } @@ -530,8 +526,7 @@ public long getTwoAsLong(int index) { public ${friendlyType} getObject(int index) { // value is truncated when converting microseconds to milliseconds in order to use DateTime type long micros = java.util.concurrent.TimeUnit.MICROSECONDS.toMillis(get(index)); - org.joda.time.DateTime date = new org.joda.time.DateTime(micros, org.joda.time.DateTimeZone.UTC); - date = date.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault()); + org.joda.time.LocalDateTime date = new org.joda.time.LocalDateTime(micros, org.joda.time.DateTimeZone.UTC); return date; } @@ -540,8 +535,7 @@ public long getTwoAsLong(int index) { public ${friendlyType} getObject(int index) { // value is truncated when converting nanoseconds to milliseconds in order to use DateTime type long millis = java.util.concurrent.TimeUnit.NANOSECONDS.toMillis(get(index)); - org.joda.time.DateTime date = new org.joda.time.DateTime(millis, org.joda.time.DateTimeZone.UTC); - date = date.withZoneRetainFields(org.joda.time.DateTimeZone.getDefault()); + org.joda.time.LocalDateTime date = new org.joda.time.LocalDateTime(millis, org.joda.time.DateTimeZone.UTC); return date; } diff --git a/java/vector/src/main/codegen/templates/HolderReaderImpl.java b/java/vector/src/main/codegen/templates/HolderReaderImpl.java index d66577bc1e444..e990fcc933479 100644 --- 
a/java/vector/src/main/codegen/templates/HolderReaderImpl.java +++ b/java/vector/src/main/codegen/templates/HolderReaderImpl.java @@ -84,7 +84,7 @@ public boolean isSet() { } -@Override + @Override public void read(${name}Holder h) { <#list fields as field> h.${field.name} = holder.${field.name}; @@ -99,7 +99,7 @@ public void read(Nullable${name}Holder h) { h.isSet = isSet() ? 1 : 0; } - + // read friendly type @Override public ${friendlyType} read${safeType}(){ <#if nullMode == "Nullable"> @@ -114,15 +114,15 @@ public void read(Nullable${name}Holder h) { byte[] value = new byte [length]; holder.buffer.getBytes(holder.start, value, 0, length); -<#if minor.class == "VarBinary"> + <#if minor.class == "VarBinary"> return value; -<#elseif minor.class == "Var16Char"> + <#elseif minor.class == "Var16Char"> return new String(value); -<#elseif minor.class == "VarChar"> + <#elseif minor.class == "VarChar"> Text text = new Text(); text.set(value); return text; - + <#elseif minor.class == "Interval"> Period p = new Period(); diff --git a/java/vector/src/main/codegen/templates/NullReader.java b/java/vector/src/main/codegen/templates/NullReader.java index ba0c088add7c9..7c75b3ae1df9d 100644 --- a/java/vector/src/main/codegen/templates/NullReader.java +++ b/java/vector/src/main/codegen/templates/NullReader.java @@ -29,7 +29,9 @@ <#include "/@includes/vv_imports.ftl" /> - +/** + * Source code generated using FreeMarker template ${.template_name} + */ @SuppressWarnings("unused") public class NullReader extends AbstractBaseReader implements FieldReader{ @@ -127,7 +129,7 @@ private void fail(String name){ } <#list ["Object", "BigDecimal", "Integer", "Long", "Boolean", - "Character", "DateTime", "Period", "Double", "Float", + "Character", "LocalDateTime", "Period", "Double", "Float", "Text", "String", "Byte", "Short", "byte[]"] as friendlyType> <#assign safeType=friendlyType /> <#if safeType=="byte[]"><#assign safeType="ByteArray" /> diff --git a/java/vector/src/main/codegen/templates/UnionReader.java b/java/vector/src/main/codegen/templates/UnionReader.java index c56e95c89dc81..0b5a209d40ac4 100644 --- a/java/vector/src/main/codegen/templates/UnionReader.java +++ b/java/vector/src/main/codegen/templates/UnionReader.java @@ -28,7 +28,9 @@ package org.apache.arrow.vector.complex.impl; <#include "/@includes/vv_imports.ftl" /> - +/** + * Source code generated using FreeMarker template ${.template_name} + */ @SuppressWarnings("unused") public class UnionReader extends AbstractFieldReader { @@ -122,7 +124,7 @@ public void copyAsValue(UnionWriter writer) { } <#list ["Object", "Integer", "Long", "Boolean", - "Character", "DateTime", "Double", "Float", + "Character", "LocalDateTime", "Double", "Float", "Text", "Byte", "Short", "byte[]"] as friendlyType> <#assign safeType=friendlyType /> <#if safeType=="byte[]"><#assign safeType="ByteArray" /> diff --git a/java/vector/src/main/codegen/templates/UnionVector.java b/java/vector/src/main/codegen/templates/UnionVector.java index d70cbae02bf33..9d5dee5d237e6 100644 --- a/java/vector/src/main/codegen/templates/UnionVector.java +++ b/java/vector/src/main/codegen/templates/UnionVector.java @@ -50,6 +50,7 @@ * * For performance reasons, UnionVector stores a cached reference to each subtype vector, to avoid having to do the map lookup * each time the vector is accessed. 
+ * Source code generated using FreeMarker template ${.template_name} */ public class UnionVector implements FieldVector { diff --git a/java/vector/src/main/codegen/templates/ValueHolders.java b/java/vector/src/main/codegen/templates/ValueHolders.java index d744c523265f7..a474b691080c8 100644 --- a/java/vector/src/main/codegen/templates/ValueHolders.java +++ b/java/vector/src/main/codegen/templates/ValueHolders.java @@ -29,6 +29,9 @@ <#include "/@includes/vv_imports.ftl" /> +/** + * Source code generated using FreeMarker template ${.template_name} + */ public final class ${className} implements ValueHolder{ <#if mode.name == "Repeated"> diff --git a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java index 6023f1c9500e7..d5076d82c2a4d 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/types/Types.java @@ -241,7 +241,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { } }, // time in second from the Unix epoch, 00:00:00.000000 on 1 January 1970, UTC. - TIMESTAMPSEC(new Timestamp(org.apache.arrow.vector.types.TimeUnit.SECOND, "UTC")) { + TIMESTAMPSEC(new Timestamp(org.apache.arrow.vector.types.TimeUnit.SECOND, null)) { @Override public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { return new NullableTimeStampSecVector(name, fieldType, allocator); @@ -253,7 +253,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { } }, // time in millis from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC. - TIMESTAMPMILLI(new Timestamp(org.apache.arrow.vector.types.TimeUnit.MILLISECOND, "UTC")) { + TIMESTAMPMILLI(new Timestamp(org.apache.arrow.vector.types.TimeUnit.MILLISECOND, null)) { @Override public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { return new NullableTimeStampMilliVector(name, fieldType, allocator); @@ -265,7 +265,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { } }, // time in microsecond from the Unix epoch, 00:00:00.000000 on 1 January 1970, UTC. - TIMESTAMPMICRO(new Timestamp(org.apache.arrow.vector.types.TimeUnit.MICROSECOND, "UTC")) { + TIMESTAMPMICRO(new Timestamp(org.apache.arrow.vector.types.TimeUnit.MICROSECOND, null)) { @Override public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { return new NullableTimeStampMicroVector(name, fieldType, allocator); @@ -277,7 +277,7 @@ public FieldWriter getNewFieldWriter(ValueVector vector) { } }, // time in nanosecond from the Unix epoch, 00:00:00.000000000 on 1 January 1970, UTC. 
- TIMESTAMPNANO(new Timestamp(org.apache.arrow.vector.types.TimeUnit.NANOSECOND, "UTC")) { + TIMESTAMPNANO(new Timestamp(org.apache.arrow.vector.types.TimeUnit.NANOSECOND, null)) { @Override public FieldVector getNewVector(String name, FieldType fieldType, BufferAllocator allocator, CallBack schemaChangeCallback) { return new NullableTimeStampNanoVector(name, fieldType, allocator); @@ -580,6 +580,9 @@ public MinorType visit(FloatingPoint type) { } @Override public MinorType visit(Timestamp type) { + if (type.getTimezone() != null) { + throw new IllegalArgumentException("only timezone-less timestamps are supported for now: " + type); + } switch (type.getUnit()) { case SECOND: return MinorType.TIMESTAMPSEC; diff --git a/java/vector/src/main/java/org/apache/arrow/vector/util/DateUtility.java b/java/vector/src/main/java/org/apache/arrow/vector/util/DateUtility.java index 1f8ce069cf9cf..8aad41744f673 100644 --- a/java/vector/src/main/java/org/apache/arrow/vector/util/DateUtility.java +++ b/java/vector/src/main/java/org/apache/arrow/vector/util/DateUtility.java @@ -18,6 +18,9 @@ package org.apache.arrow.vector.util; +import org.joda.time.DateTimeZone; +import org.joda.time.LocalDateTime; +import org.joda.time.LocalDateTimes; import org.joda.time.Period; import org.joda.time.format.DateTimeFormat; import org.joda.time.format.DateTimeFormatter; @@ -679,4 +682,11 @@ public static int millisFromPeriod(final Period period){ (period.getMillis()); } + public static long toMillis(LocalDateTime localDateTime) { + return LocalDateTimes.getLocalMillis(localDateTime); + } + + public static int toMillisOfDay(final LocalDateTime localDateTime) { + return localDateTime.toDateTime(DateTimeZone.UTC).millisOfDay().get(); + } } diff --git a/java/vector/src/main/java/org/joda/time/LocalDateTimes.java b/java/vector/src/main/java/org/joda/time/LocalDateTimes.java new file mode 100644 index 0000000000000..e4f999e1d828e --- /dev/null +++ b/java/vector/src/main/java/org/joda/time/LocalDateTimes.java @@ -0,0 +1,30 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.joda.time; + +/** + * Workaround to access package protected fields in JODA + * + */ +public class LocalDateTimes { + + public static long getLocalMillis(LocalDateTime localDateTime) { + return localDateTime.getLocalMillis(); + } + +} diff --git a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java index 99ba19bec80e7..aba65dbf374d4 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/complex/writer/TestComplexWriter.java @@ -56,8 +56,7 @@ import org.apache.arrow.vector.util.JsonStringHashMap; import org.apache.arrow.vector.util.Text; import org.apache.arrow.vector.util.TransferPair; -import org.joda.time.DateTime; -import org.joda.time.DateTimeZone; +import org.joda.time.LocalDateTime; import org.junit.Assert; import org.junit.Test; @@ -602,10 +601,10 @@ public void timeStampWriters() throws Exception { final long expectedMicros = 981173106123456L; final long expectedMillis = 981173106123L; final long expectedSecs = 981173106L; - final DateTime expectedSecDateTime = new DateTime(2001, 2, 3, 4, 5, 6, 0).withZoneRetainFields(DateTimeZone.getDefault()); - final DateTime expectedMilliDateTime = new DateTime(2001, 2, 3, 4, 5, 6, 123).withZoneRetainFields(DateTimeZone.getDefault()); - final DateTime expectedMicroDateTime = expectedMilliDateTime; - final DateTime expectedNanoDateTime = expectedMilliDateTime; + final LocalDateTime expectedSecDateTime = new LocalDateTime(2001, 2, 3, 4, 5, 6, 0); + final LocalDateTime expectedMilliDateTime = new LocalDateTime(2001, 2, 3, 4, 5, 6, 123); + final LocalDateTime expectedMicroDateTime = expectedMilliDateTime; + final LocalDateTime expectedNanoDateTime = expectedMilliDateTime; // write MapVector parent = new MapVector("parent", allocator, null); @@ -650,28 +649,28 @@ public void timeStampWriters() throws Exception { FieldReader secReader = rootReader.reader("sec"); secReader.setPosition(0); - DateTime secDateTime = secReader.readDateTime(); + LocalDateTime secDateTime = secReader.readLocalDateTime(); Assert.assertEquals(expectedSecDateTime, secDateTime); long secLong = secReader.readLong(); Assert.assertEquals(expectedSecs, secLong); FieldReader milliReader = rootReader.reader("milli"); milliReader.setPosition(1); - DateTime milliDateTime = milliReader.readDateTime(); + LocalDateTime milliDateTime = milliReader.readLocalDateTime(); Assert.assertEquals(expectedMilliDateTime, milliDateTime); long milliLong = milliReader.readLong(); Assert.assertEquals(expectedMillis, milliLong); FieldReader microReader = rootReader.reader("micro"); microReader.setPosition(2); - DateTime microDateTime = microReader.readDateTime(); + LocalDateTime microDateTime = microReader.readLocalDateTime(); Assert.assertEquals(expectedMicroDateTime, microDateTime); long microLong = microReader.readLong(); Assert.assertEquals(expectedMicros, microLong); FieldReader nanoReader = rootReader.reader("nano"); nanoReader.setPosition(3); - DateTime nanoDateTime = nanoReader.readDateTime(); + LocalDateTime nanoDateTime = nanoReader.readLocalDateTime(); Assert.assertEquals(expectedNanoDateTime, nanoDateTime); long nanoLong = nanoReader.readLong(); Assert.assertEquals(expectedNanos, nanoLong); diff --git a/java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java b/java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java 
index 5ca083aa2dfab..5cc36a3b82000 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/file/BaseFileTest.java @@ -39,8 +39,8 @@ import org.apache.arrow.vector.complex.writer.TimeMilliWriter; import org.apache.arrow.vector.complex.writer.TimeStampMilliWriter; import org.apache.arrow.vector.holders.NullableTimeStampMilliHolder; -import org.joda.time.DateTime; import org.joda.time.DateTimeZone; +import org.joda.time.LocalDateTime; import org.junit.After; import org.junit.Assert; import org.junit.Before; @@ -144,8 +144,8 @@ protected void validateComplexContent(int count, VectorSchemaRoot root) { } } - private DateTime makeDateTimeFromCount(int i) { - return new DateTime(2000 + i, 1 + i, 1 + i, i, i, i, i, DateTimeZone.UTC); + private LocalDateTime makeDateTimeFromCount(int i) { + return new LocalDateTime(2000 + i, 1 + i, 1 + i, i, i, i, i); } protected void writeDateTimeData(int count, NullableMapVector parent) { @@ -156,17 +156,17 @@ protected void writeDateTimeData(int count, NullableMapVector parent) { TimeMilliWriter timeWriter = rootWriter.timeMilli("time"); TimeStampMilliWriter timeStampMilliWriter = rootWriter.timeStampMilli("timestamp-milli"); for (int i = 0; i < count; i++) { - DateTime dt = makeDateTimeFromCount(i); + LocalDateTime dt = makeDateTimeFromCount(i); // Number of days in milliseconds since epoch, stored as 64-bit integer, only date part is used dateWriter.setPosition(i); - long dateLong = dt.minusMillis(dt.getMillisOfDay()).getMillis(); + long dateLong = org.apache.arrow.vector.util.DateUtility.toMillis(dt.minusMillis(dt.getMillisOfDay())); dateWriter.writeDateMilli(dateLong); // Time is a value in milliseconds since midnight, stored as 32-bit integer timeWriter.setPosition(i); timeWriter.writeTimeMilli(dt.getMillisOfDay()); // Timestamp is milliseconds since the epoch, stored as 64-bit integer timeStampMilliWriter.setPosition(i); - timeStampMilliWriter.writeTimeStampMilli(dt.getMillis()); + timeStampMilliWriter.writeTimeStampMilli(org.apache.arrow.vector.util.DateUtility.toMillis(dt)); } writer.setValueCount(count); } @@ -176,13 +176,13 @@ protected void validateDateTimeContent(int count, VectorSchemaRoot root) { printVectors(root.getFieldVectors()); for (int i = 0; i < count; i++) { long dateVal = ((NullableDateMilliVector)root.getVector("date")).getAccessor().get(i); - DateTime dt = makeDateTimeFromCount(i); - DateTime dateExpected = dt.minusMillis(dt.getMillisOfDay()); - Assert.assertEquals(dateExpected.getMillis(), dateVal); + LocalDateTime dt = makeDateTimeFromCount(i); + LocalDateTime dateExpected = dt.minusMillis(dt.getMillisOfDay()); + Assert.assertEquals(org.apache.arrow.vector.util.DateUtility.toMillis(dateExpected), dateVal); long timeVal = ((NullableTimeMilliVector)root.getVector("time")).getAccessor().get(i); Assert.assertEquals(dt.getMillisOfDay(), timeVal); Object timestampMilliVal = root.getVector("timestamp-milli").getAccessor().getObject(i); - Assert.assertTrue(dt.withZoneRetainFields(DateTimeZone.getDefault()).equals(timestampMilliVal)); + Assert.assertEquals(dt, timestampMilliVal); } } diff --git a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java index 824c62aa5fbf3..f9c8f726ab6c6 100644 --- a/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java +++ b/java/vector/src/test/java/org/apache/arrow/vector/pojo/TestConvert.java @@ -81,7 +81,7 @@ public 
void nestedSchema() { new Field("child4.1", true, Utf8.INSTANCE, null) ))); childrenBuilder.add(new Field("child5", true, new Union(UnionMode.Sparse, new int[] { MinorType.TIMESTAMPMILLI.ordinal(), MinorType.FLOAT8.ordinal() } ), ImmutableList.of( - new Field("child5.1", true, new Timestamp(TimeUnit.MILLISECOND, "UTC"), null), + new Field("child5.1", true, new Timestamp(TimeUnit.MILLISECOND, null), null), new Field("child5.2", true, new FloatingPoint(DOUBLE), ImmutableList.of()) ))); Schema initialSchema = new Schema(childrenBuilder.build()); From 5af8069d234a7b16ab324085ecc802e6f915ae88 Mon Sep 17 00:00:00 2001 From: Bryan Cutler Date: Fri, 5 May 2017 22:07:42 -0400 Subject: [PATCH 0608/1644] ARROW-866: [Python] Normalize PyErr exc_value to be more predictable It is possible when using `PyErr_Fetch(&exc_type, &exc_value, &traceback)` for the `exc_value` to be a string, tuple or NULL. Calling `PyErr_Normalize` after this will cause `exc_value` to always be a valid object of the same type as `exc_type` which can then be converted to a string predictably. Author: Bryan Cutler Closes #630 from BryanCutler/python-pyerr_normalize-ARROW866 and squashes the following commits: fb93356 [Bryan Cutler] use PyObjectStringify to be Unicode safe d56c6bf [Bryan Cutler] Added PyErr_NormalizeException to CheckPyError to make more predictable exc_value --- cpp/src/arrow/python/common.cc | 16 ++++++---------- 1 file changed, 6 insertions(+), 10 deletions(-) diff --git a/cpp/src/arrow/python/common.cc b/cpp/src/arrow/python/common.cc index bedd458c783f4..5702c71b4d8d5 100644 --- a/cpp/src/arrow/python/common.cc +++ b/cpp/src/arrow/python/common.cc @@ -68,20 +68,16 @@ Status CheckPyError(StatusCode code) { if (PyErr_Occurred()) { PyObject *exc_type, *exc_value, *traceback; PyErr_Fetch(&exc_type, &exc_value, &traceback); - PyObjectStringify stringified(exc_value); + PyErr_NormalizeException(&exc_type, &exc_value, &traceback); + PyObject *exc_value_str = PyObject_Str(exc_value); + PyObjectStringify stringified(exc_value_str); + std::string message(stringified.bytes); Py_XDECREF(exc_type); Py_XDECREF(exc_value); + Py_XDECREF(exc_value_str); Py_XDECREF(traceback); PyErr_Clear(); - - // ARROW-866: in some esoteric cases, formatting exc_value can fail. This - // was encountered when calling tell() on a socket file - if (stringified.bytes != nullptr) { - std::string message(stringified.bytes); - return Status(code, message); - } else { - return Status(code, "Error message was null"); - } + return Status(code, message); } return Status::OK(); } From 995317ae9ecb54bc1aec02f7c7e133ab61ac387f Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sat, 6 May 2017 12:56:46 +0200 Subject: [PATCH 0609/1644] ARROW-929: Remove KEYS file from git I have updated the SVN KEYS file at https://dist.apache.org/repos/dist/release/arrow/KEYS Author: Wes McKinney Closes #646 from wesm/ARROW-929 and squashes the following commits: 8ad3c0a [Wes McKinney] Remove KEYS file from git --- KEYS | 239 ----------------------------------------------------------- 1 file changed, 239 deletions(-) delete mode 100644 KEYS diff --git a/KEYS b/KEYS deleted file mode 100644 index 05862c8f643ca..0000000000000 --- a/KEYS +++ /dev/null @@ -1,239 +0,0 @@ -This file contains the PGP keys of various developers. - -Users: pgp < KEYS - gpg --import KEYS -Developers: - pgp -kxa and append it to this file. - (pgpk -ll && pgpk -xa ) >> this file. - (gpg --list-sigs - && gpg --armor --export ) >> this file. 
- -pub 2048R/7AE7E47B 2013-04-10 [expires: 2017-04-10] -uid Julien Le Dem -sig 3 7AE7E47B 2013-04-10 Julien Le Dem -sig D3924CCD 2014-09-08 Ryan Blue (CODE SIGNING KEY) -sig 71F0F13B 2014-09-08 Tianshuo Deng -sub 2048R/03C4E111 2013-04-10 [expires: 2017-04-10] -sig 7AE7E47B 2013-04-10 Julien Le Dem - -pub 4096R/1679D194 2016-09-19 [expires: 2020-09-19] -uid Julien Le Dem -sig 3 1679D194 2016-09-19 Julien Le Dem -sub 4096R/61C65CFD 2016-09-19 [expires: 2020-09-19] -sig 1679D194 2016-09-19 Julien Le Dem - ------BEGIN PGP PUBLIC KEY BLOCK----- - -mQENBFFll5kBCACk/tTfHSxUT2W9phkLQzJs6AV4GElqcFo7ZNE1DwAB/gk8uJwR -Po7WYaO2/91hNu4y1SooDRGnqz0FvZzOA8sW/KujK13MMqmGYb1jJdwPjNq6KOK/ -3EygCxq9DxSS+TILvq3NsFgYGdopdJxRl9zh15Po/3c/jNMPtnGZzP39EsfMhgIS -YwwiEHPVPB00Q0IGRQMhtJqh1AQ5KrxqK4+uEwwu3Sb52DpBjfgffl8GMGKfH/tk -VvJ6L+7rPXtNqho5b7i8379//Bn9xwgO2YCtjPoZMVg37M6f6hVWMr3fFmX/OXgU -UWwLGOTAeuLKWkikFJr5y0rzDaF2qcD9t7wfABEBAAG0IEp1bGllbiBMZSBEZW0g -PGp1bGllbkBsZWRlbS5uZXQ+iQE9BBMBCgAnBQJRZZeZAhsvBQkHhh+ABQsJCAcD -BRUKCQgLBRYCAwEAAh4BAheAAAoJEJfX6GR65+R7au4IAIfZVA9eWBZn9NuaWX7L -Xi+xDtzrfUrsWZxMIP6zkQsIspiX9AThGv3zDn+Tpfw7svV1QfUQX0LHbwMMYqq+ -mRJB/kqYutpLxw7h63zrWR2k2Sdzvole2c3Rfk1vblIdWZk7ArLSivqTk/oGwr7d -MejvOMmKSzqW0vQF6dNbYerLOiqPr4mKqONWm4nOLZEBzjE3IfbK3gNBSFq+92jV -iWY6ozqAxydYafNUSZRrcniYskxd9JCSSLZiIZW3X9lToA/74LjpPbmzvQtkH68D -0EnC1mkPTKCA4r+CLb3a9GJ9Surg2T0OptyPHsXipgViVryXgopD2odA3fh9SY5l -Ee+JAhwEEAECAAYFAlQN+kQACgkQ/LPL2dOSTM3+OA//dYj9kiZhZNVb6hMfrubn -OjTmY8Hcax8G+aJWxRrGE8HrCUjEJ4NThK523+fmol1PxNWsguljlsZvJ189YPOh -weDJzNmKwhLntq/uBgtJyWBN1v9bUzkR9Ud+UdD1tPbNj7sNiIQE1ZqWMxra3sq/ -gcodVgqSADGgjKO9tenQhWvQXxBR55MOqZbxnyazRPEYS0mkN0A0DwtG82tHNRL7 -Z3vs/kG5hoW3kYifCZn5pW3wKtfIY5JH7usYOzA86p7GH4hOfO+dzhDANH+C+u9O -ZRbCdUE8oEp3fAWY9+3VzlO5ixpFOeHGfbSJp44Jv6wUOxNwRmD/gk+DxVrsS/Yn -rLFCZgDHgkFHGJ1D7PnxTy4qtwGasYxWYJOUiaAJbOvRa8nbhan2/wsrgnJTbXAH -+7v5tFfCV77Po//V0fojYZNvbkEO8/yRpQL+uKiVRaRD5dMfHRb31OR0A59ssYX9 -63QpBEof/OeELC0VowG+KCc+4CfSMmAGnQMdEhMAUPz+79nJw7ijeF5C82Z5mQof -v+nf+kdqr80UbG+RoODKtlHFETxJ5STQe6uiPOfvb+EADPA0cZ34u5tD3Z+SMV1k -Gf7Jxi45jmkn9Z9AkVj6KgdDeSjV7EkRiY0pm43Vvd6WvV5t54cgJcwXrjG+h03f -65w7F+KBrh7YAcUvrf4JeXKJARwEEAECAAYFAlQN/XwACgkQfNgniXHw8TtU9Af/ -b9CYFtsG9q1ZbnV9SChxjLLUipGsmKTUjCnz7oiZvJJ04e+0np1NQJKJbthGfEDM -eLt1WiYpTDu66zAuLDA7ACcbv3UUXXsUTEfN76J+9DJHrtK1soHGLkKLW2hZeWKp -PKya/HRF4Rv3/aAwWtRjEuQr9pLt/wAOedV6mrpyTngOKQn97tzo/yUeDNG7be8A -xtUStQY/2zJmHkaLeULKOspgUchBQ1S+M4q46dE+tyel47BLyHIECqk/geLOlZmh -lo6TtVgnBSXC5SqMwh5pz/P5ntQ8FVLedGQI9dwVhxbjoo5DNB/6ntfbwkheiak1 -CFBm0ZVPJjX7F2XFcq7VCrkBDQRRZZeZAQgA4eixR7xHvnTyF12CYLsnFE8x1tI+ -78FCjKm0n1YPCzEYa70bnnZmpW4KCwO0flN4RhhP+g2KRCCov2ZH7bxvhTxe4n/j -T6I/+61Fpba4I7qExYqX+tylyjUKhynLcWCbvRQnyjOMTaLbMVrftV+ATVmj7fi0 -PdzRW/7QvCSrDsMFtTSaNBdeMbzptpoXAxTgVZOIoHbWOIfovN1uPnFItrmNnKXX -KGyDPX2s2KCz10G1lrw0l9tqDg+BtqE9/xCtqWoZJMnT8jAJZeJ0V37R1jDBDEHK -AfPOUKNYf5GWxJeCWYzL77ve8VdItKwPhtjW7zFKuyrqiBHE40fgTLKvNQARAQAB -iQJEBBgBCgAPBQJRZZeZAhsuBQkHhh+AASkJEJfX6GR65+R7wF0gBBkBCgAGBQJR -ZZeZAAoJECrRWHEDxOERzmEIAOCrfYGPdLyzBn/xAdymx2FaNTS48ybMIGjcu6Od -nKzvgBJObLPQf0+WKhkbQf2HEHYinBVpX8K4dNY9RhzIRbQNhCWY5E5/leI/nQ9O -ZBUMpT8Gw5saj0YtF3By4E9ywxNWiAyX2SAHjPv/lub0PEaUiWWe6s9MaX5fp71C -TupkdElpxucEpVefUaUOSMQ2ecOniCh/9ltPLYcjwnC1ti+Et8/cAK2N554GNE+x -fO3qtGXGUleWhpt3fblTcCyO+odAPKxm70jnABLk8m+KpffcdBYSJ5ai5hPkrnyq -3NBRDPGlLdtDkzn0/xKYnVbLW1d+d2NFwJzEKncQphHoo0T19wf8DSfym7dIsstj -jwFI8+N/1yCdMD886x8bgmsSsNiD9tro+1083yr+IL5+gUs8Q4ETpsow+IS6sfp2 -fzA0TaLBLEOFYy/XFxnzO+YtVNIDAnrDEgTOMahFUrJ/HVZF9xT+kKwhyHaRNIQL 
-CYc4VoSWldqoDVOGI30NjtVo5EGzf3qVWkTm4yplBhJvJanxrMHuJAWRgFX8D48B -cs/senr8s+O0oXQQYIjz/FkZh/mQFtrgsvnzyUR52SnwEzNMmXjZNkydPZwcY6mu -cqCIvQIvmBpPdlyaoglwJ8wWb76uIE6VFcN71FF3EfV51/yUeQGJaoExWLY6IH8x -Xtn3IWkBWJkCDQRX4AxuARAAzzTxE1FGdmJYPZyTys51oDi8+CJ8VXF6wlTkjuOW -abkGUu0FjnN++D7G9LRDvN7QnVUHU+h6QWPZ0LanmjYh0ABO5SeWCjOX6ajcACkz -pEzMv2DbOPfJuPJmtuFfiLOQAUVBB1dSSPFMPPaGTco2iE7uLr8edtQBvgpx/PGd -52lma3qAAZFzonKWyTRonUjV4SU3C96Xhbs+DExTL8H6MX8NzZCz4UZj5u4NsEH+ -oQD0U4tSOe3xgroJpOR6ZPvlhgbWVqlYvkEWt0AaPJsXJwnWe17GgDmxME2cwsuI -fgv/9shB7VYmLglY6dV/6HYoHh+2qKXndTMjlqXXvUHW0J3uRryoCR+C2gin38/f -sPFICpt5yJVnR517O/jsviDz4TwjhqFsUUM7Ud8IydriJX02Oj5UitUF7l5MSqkS -/Z7jwPEErCRWwmfj4ZjjWWV60I9SYgPZhBp0//s2o/gbIBBtIdHI2+xaMt0lWOsA -Xi7dsY1NLGoSGUlhdSiP032tVHpGiOV3AWwf399Qus4iuwf6N8KEVSTRdaA7b1Um -b7PepfEHIrOS5oUYjgZJK+JFGey5SOsPvG3Yv9cbEKWqmoEzEDb6y3HI/iRbk/qC -SWGKvEiqYSo6wlvZFDv1qoApylfBaI8Lf32vawlMCSI37KCWfua1RqbCYMi/4wux -bfsAEQEAAbQgSnVsaWVuIExlIERlbSA8anVsaWVuQGxlZGVtLm5ldD6JAj8EEwEI -ACkFAlfgDG4CGwMFCQeGH4AHCwkIBwMCAQYVCAIJCgsEFgIDAQIeAQIXgAAKCRAC -2r/fFnnRlHqND/oCaPPGn8u5oyVml9J3+lpYWwT69qHwYV5IX+72zqq02uvYEqlY -CseEwOvkfDClh81KWO4A9kzVwWcu611d/UFsA94EnZuJ46m8DflPeidhJoTRnNpr -IRzH2lL72QyQFeT9viWHdxu3cKlJkChQuD3zR9yyVH/QVFOlBvdx/ZT0dOFpbgJd -2n4fy8ExGSXLP6wGf5RQRKEYiZX4VB4Bkz6sK0Con6GPsqqtaUgj9fCxA1ebhGA2 -k1m79mR9gh4oJWeefSuXyf3x8oBoQ46Lury8HiuxLh6cy9SqHZ8uXu3hfQEZ6rhd -C9yBjK2+8z6GLhjgSasMkJK0OAR8yLgARZJwt9+wV/Ww15Sq76B6IrKJnSR6P4FM -jA+ItCDtiooGz6rJGFidsH99fU2IydcsSqbTN3h3/2cBXFgxspesHWsEeTvtXSgR -I82kUyA0g1v9ESY0leiLVzKyL0zmCjaPg0nHoH8tFqFkqaSXSZu75TefnsokkoXN -ewkDf+yD2J/BMtUHAgFOlvYkywGzS9cbAxxzc607Jvww5LjtPI0wYRIwzOlvZfws -xoYPrqJe1R7hRy0QS6pnSL7TgdbvwbGtiUAZ9w/Y5FEugV8bgyZMvF7Z9gt3ThMg -XOSWlZrsDym7jg/yd/h/4aPZuPC73oNvgV4g6OT510fkkMNWbZR8C2uHX7kCDQRX -4AxuARAAphEmWY5Z3Q+gQ1X9+b55VE17ORMKjXtE2gQnYk9Fxpt31F0kZcoK/25Y -BItkjcmIaC4LTLjbdwe6IW4zlOjULxaTstTJsfCcrJONlSmEJ0OWaXx9i/tAXt8d -0IZn2hkQ9aevJuoWqta+wFNhpLdPuPQq6vO6hIl7j0w1tAGFHV+IQ7Q7VFuUVo/h -gZbtJOjufZWqulz6pMVu4p3TW9OM98CWioO1eidcwKYEsgk/fJ1uKc599SSCz2Cg -+lEho6rHtvojk34TLD9QQHaEcCZToq7WSqwqLi4OCuhcfAVuwydj0RMByE1TOpsg -RwOh2egBLNplK+0k2jVaQPX38laolOAMNLg+VVRy73T1MpyelY/m+cRr8292X/f4 -GgHNHbmQU+LDzsezC+ryPXdP3FjVo66xNlYHzw1x0hRdnwExkqOYmdTz1YN83Z+6 -p0d2RkTZTpLnE449KiNsgTPttplBGE8QKNqYxoKIk40DlDuya7q/acgTcqe8vW0v -34E6RRIX8dbCJeTBB2vUDp6bD3ICXGI09EuUAh9yy/FlNv1OdggDfTnF/NztWmmT -CpNwmdx+GTT2Sv0i6H9RelOl0uGj351+7PSFSFrHV9T3TUaMB0QkkZDxklIvPVZv -dhx7UGXFJPDjQyJxcN7UW6Pc7m+m2W3/u2MZaL7xPbWRPVkqs48AEQEAAYkCJQQY -AQgADwUCV+AMbgIbDAUJB4YfgAAKCRAC2r/fFnnRlGwCEACXcfMAz79G2sLs6z1N -6tMbO0qGGQJ9vAXRKjb7JN/yd+z+zaejs/+cmRM+wHKZtANYtnSzGiWJO4TIP5A+ -DeRE93GJaVr0ly0C+du/uSm6wVg+w1wgy6JE6q/IsMW5qHd0qWi/npq4esfH1Uho -T7Kl/AxUyT0N23n5oK0GrVUFhFcU9dUx6auhHxEOM5tIgNqV6lAn72lykPYUvV5f -aAiz2OAlVYxgBb6wxjXTUVlrhaxbgNQ7PPzkjzMVZaE/TZrcyl4Ck3grYDBFZEGi -jhjsl/HX+/lhJvr5gcFkisG5A2pnrkAe1wnXm4HoKGN2xUWCCipN5oPc3Lw6ge76 -YDX1t5CXqd94cDBlwFDtd4kykI3rJDvTI3P/fevMNqVS3tzW9AwkHkPil1DE+4rI -/qCib+G6BAgloUGYLuNxSa1ySOd0yckFTrNBB5yk+yWvrLpKGFVdQS7BwUcgdeCJ -3XU3fyhfXcIn3tMHabZ6laB3Xzi3Gi8iL6SJywSXIqTGw3MmLJlxr1IKWTMNxjjs -d/XBF7ZpCCisH7s9hyMCAet72YFAxVcB3bwbd3mzcGfTg/Y+sSum82vaSvAJ0QBc -pp4X8HzEsSsJ88N6ON7IU92r+1mxWhglKZx2NORHIvNFwIrvAzKWhqGdHd5/xq3f -EwCykGi6RtdCStNFh6h16kCkgA== -=YkSF ------END PGP PUBLIC KEY BLOCK----- -pub 4096R/8CAAD602 2016-11-06 -uid Uwe L. Korn -sig 3 8CAAD602 2016-11-06 Uwe L. Korn -sub 4096R/7BD1BC86 2016-11-06 -sig 8CAAD602 2016-11-06 Uwe L. 
Korn - ------BEGIN PGP PUBLIC KEY BLOCK----- -Version: GnuPG v1 - -mQINBFgfd4wBEACylQqqVH/aK00fgU/v1ZggNwtgJhzH7yswAzQz9eUU5t4Q9kzI -zdkR1yJvaEDHtZy2D0mCM1CuGVPXzf+0kSFDaRPcm6LNAD15KC7eUzyad1Y4MwNn -UYE3pZlnvSwUBAigQSN1quw+u1eHc+IJc32iCRcK8DihQgrDivg8yZckoGGZj/6w -Epfp8SLrI+OmqBgwYYjRqy9uC0aWypKb9waZmc2NIZZu1y3bL6hx54+Dk+4hF01E -OtT79HQV1e4MyqiuGUKa34QAHb1CGrju+1Z9sDNdI7hBDqfQKjisR2WaJM4kXHjj -m7Tv3M1LUB4eh1+Yd514d/wpSChkLvMCJ9tYGSpQ8c+qrLAFvgRD7YCYp4ypslcx -Sg30gU0bcTu8aiIm7qfl9CUjtBYwirUGC/t2SUxnhOpxWuzZdAiUJHi0QFa+LnZa -ecA5fIoMfqTWAqfQr3noxB6qLLNCgZd7IIH5KXIIhJZHpO3eMCCTJuDXiMS1Z/uo -D1FvUL8c19nmMjPJSfQo95Uynw6gZKFy0d3xg7NKUvnJBsVI24/PTVabzRrDh/qb -RCHvQOFjXOSYsPm2sz1BPs+ucV4AoxPZFgsCfUN2t4FRbcb39vr6oYFb+Nd3sIKX -7wknSwAid6pATvfZuLC9NI8ykjcEDGeLL0sET3kdUeuGYjpj2kuhnrV4cwARAQAB -tBxVd2UgTC4gS29ybiA8dXdlQGFwYWNoZS5vcmc+iQI4BBMBAgAiBQJYH3eMAhsD -BgsJCAcDAgYVCAIJCgsEFgIDAQIeAQIXgAAKCRAp2U4ijKrWAos1D/98UBoLbt6L -c7mnXTww069nkt0vOOHSz/QWJxo5rQsqFSKcSRuBhwLuaVMGTjBqCOLdEmA+XKJ8 -O+OgCZz0QZXuwL3PklX3DFvsYO0wIEIssovEJMu5e3XxDcCf5ZZtfszW5dnbWTjc -JXP0TlEbjOR5Z0/O/24iysGtoEMiktRTLOz9R5oRXFQLN4jQSykvMfKhanCVFljX -qEdMszjtvZhLwOiCaWkIOEo3jCrCDhdThI5nTiu/pH3vi7mkFYTNKpiva3XYKH7V -ITEdn5WO/QNFu/VBRjtOxT+F068vuuNpvAddn5rOtZOyGMCHnEqnlRnqIIZGtJeo -EJ87N2ytn8CtKpQKhyJIJhQIfW5jS3YW8qj1HeKN2s5wqQKnBYYsJOh+/QC9g3oE -nllgoSHAKSzys2Y1VoOQbRxYipCqRx7uS2aAqFr6r3hQpzySWeKQuxVZSZD7ar/0 -AFB0Hg4EgUGDl6Lw5icJ7scXTgoQKZWH1UmNc/FwFbG/F1GVbU88R6DlF84D1X/P -ArtP20eT+B3u5nfO2pCaBVi6GYyMsL2WKHO7AQAgURMgEPk0AQZZpv/OSJFa/TzI -UQ8xTLgmwZRL/XjjNFYWs+eYecGQsHKLbKNm1BpZMEfbVSFw54PiyJgoOhdMKdyA -Cmb+aUBkbPXf5S0ScZOoq8e8k1dYseDGOLkCDQRYH3eMARAAx/joL6ScsKMmPGRn -n79gQ3zbcKxWSfEDMYeeFfSssRgRd2iIrgvjzr9phka2yknzPnQPi7C8GLkUTj5e -V1dBxIGkGmP28n0DoowMqGb1xqn0WeoxDL0VQycGjkv5SOkxcbCCKS/MHOn6zenh -patSJsEHkCqk3f4GtPngYN5oMRTXUfUj1s7AooNti1ONSQSvZNbOMKAg8MgAjAHm -z3A+INLVTa59vqUNr5ptG/n+cB65ggeNhJf3gMaDyUy7oRZtOhrmA4D9CLpy2OBA -gezgOCZk/mPNP5jW0sbRiL6nYqC9VTp0E+f3hYSdgXNTWGIcxOwK7xe09SRqUQ7u -WnoKBTjkkYdCaCN4rv8IhJdrufgYdfqMGuldQZ9R/gcN3Iel7JMdon2onk94KZPs -W58/1DCD2eRuz8CsIgleUHVXJ+mCpkdtAt45ZGyv5pFC/+6s8mS/pBQEdVl7wjEX -kf2lrtFZCfK1uUiUTDnJJdtXNhwdtvnxJYeRg51jlD9Qg/mPV6m8KFyINtLKedLv -hChFkAIfFsdC/r1Xt4fMiCv2eZ8Dop2dM6xV/6Ueicti0lywoTpVtugSUWPO1j8a -N48jUfkZUV0jdELNHAloZaIDeLc7mU0uZJ3JykC4laD+YDwHT8tYUvamtU2uNgh1 -V7I3jrEu8YO4T2fiXe+0EzBwzjEAEQEAAYkCHwQYAQIACQUCWB93jAIbDAAKCRAp -2U4ijKrWAs3bD/wOE8NLnzKqebz0v+lxQf7fRL+RMaJ8mFda/t7UFtxj6XdePGZy -HWdqlvBFSDo/K6aEiicmpEIPbMi+V7d1Dg3tGhwtkHzgbpxNVoolR+2cF4jtrkoV -NC7uAMaDPt0X+wqinGg4E7IFuJoT4WiS+i4lzCUbD8n7lxe6Kj9bDt8tb6gOCgld -oweGN2k3bc4hIzeRt0jqGu1xm91Zbf8YbI3vyi8WQqmxX3zugY46NWwj8a+4Mhxz -Ysd7SI1pPs5k7vdHif3MD3Wwx68CCuZSm2KzNsm0iGxrCXSA6dXVflK9rlq6O1Us -UTxfX60o6S8PdFr4oOPFHYXmvDU5PY575xscWB2VVAyuSCyZWtq8d1BBU9JxcozS -6PTefVUqgr0XXRwVldAIabSA5q13j+b5+vU6LnAuoeMlFFprRlcJN03XTWKXF/gP -SpCDscCEMbz7aHpox8wmFckeiT+TgwDLMKO5PKRSMEBErUk+SsOyBnFpuGaPsCem -Pi6NwQyPCt3eep4Ti0dPo3u/dCUEtdKWMpOhsPIoCvGpgqS7o5PuBC2MDHQCc7q8 -wfxeCKBeSpMuy3pvOnNy8uNYjNqizVlpNBx01I2R1MD8P14Pxteg6APi0jcusXrD -s8g7c7dzdXM0lxreeXge8JSmxuwcCqVUswac6zbX4li03m/lov2YYxCwuw== -=ESbR ------END PGP PUBLIC KEY BLOCK----- - -pub 4096R/1735623D 2017-05-01 -uid William Wesley McKinney (CODE SIGNING KEY) -sig 3 1735623D 2017-05-01 William Wesley McKinney (CODE SIGNING KEY) -sub 4096R/E83E9940 2017-05-01 -sig 1735623D 2017-05-01 William Wesley McKinney (CODE SIGNING KEY) - ------BEGIN PGP PUBLIC KEY BLOCK----- -Version: GnuPG v1 - -mQINBFkGqIQBEACuKfRxQ2zjpWtuEpKTr0qhpucl5h57cnbPG8M2t2eAbl7fD6mD 
-ZyLePZEHSoNgUTqFTh8b850qD2b1loyuk6fx5mesweeWlSxt24Y5pXneH7WL/a8K -H81jl+Qy5J8DfG8oEnlQp8bPjb3n8xFgNkpt09kxj9lRhDCK0+M0lN/JRGK2BfTx -TCJWH2vC8Xh+apXmlSR5vohx7dj5RoFlIwNXsi+5JRkZCLoER8Fvozdq7qYNNmgL -a8l38VnW5fQkx1Pl0mMBi0d4XwFCY6W5BfzfAU3t+ujb0a/6ZzFHiW6q53Fct4BM -dMX91Xi73Myb3AF3x8dnv7E09dwXaShwUQu76WD/v7js1COS9o3SaCZfOdrJ9+KN -bYc2zuzXCWtDQ1GU07ocq2Z8VnhGC/qAUwOY9K0JagFOx7xV3gc8bkWqFII0XeCK -QBhKZHx7oFGz6bH2W/THLolbezwC7+0iuiWeDjY6y6Hk1/S25120wqdUfpa2QDlz -5V+ayyF8Lt77CnowYeMuDSFZzBjg67SpbbkyZJwKUtTJBUOLKiJF37QCAYENHthB -lmRgvOcCIic5cnJivgIs6Q7hCpFahWgr2g/6clu04YKFSaup+LU6F3UGvbKW6nnF -HRSsVFkof0+Ni+yT/oiQUAYyCbrfptpgUZXrVuee8d4frbPfKeiWd4MTrwARAQAB -tDxXaWxsaWFtIFdlc2xleSBNY0tpbm5leSAoQ09ERSBTSUdOSU5HIEtFWSkgPHdl -c21AYXBhY2hlLm9yZz6JAjcEEwEKACEFAlkGqIQCGwMFCwkIBwMFFQoJCAsFFgID -AQACHgECF4AACgkQ8QWIOhc1Yj1IQRAAm71yO273ulTxYlpFTN+CnTqTdxAQIGmc -gfS55/XmjKfQySQTKOfQPafJe7MazbVG/jG5CZeKHEgHvM0qi8vnAezzeTKEDHPP -Q1ziHyTt7ND+GbKChrLKA/lbgJkoBxKohyi6eQfz33cvh0fPsv8zej5M6+FAVJaA -GCMUS/yIC0Oiq0JgYH38sPOhNtw3z8pODg6WjJFWKHXw5qGng11/3BtTVu5KXzqf -85IJHqMgyOnU0r4mdKgqmSdaCpU/CMJlT3iflF5wN79c46FwAceCiYT8eJiWl1cB -wAV/mRhTzWGQkWVhE+6EK6+PyuzkjJgGhMtv3zuzKKN8iOv3eb7xptzZydEPqRFf -50f1cERfsf8um8W9IXQb60vrALyWwQFjF9B2oxsk28ZgzZ5ibA1xU9TJAS+iFo3e -eITPZnxxT3jZ2WQVWIQB8/yn0sAg5mLQ+Clcghik60KQsjAVS27QrlMTimK6eXey -tKTS4cw7LPo7GkuiBy3FuERX/ABXg8Wxd+EXOvLuZXNV/p9uBhU0w5tfaasnXFy0 -0NoKAVQ9ffW9MTV3CjrPakjHGLIzHgfFYuHnBdo67E3LR16kLcTusH8e8A3wYgbM -/gbXNS1C+i31ATNWfHaZtAFrUdzUvotDVo2UTw4nRqy27XBM9NVS+EwfwiZLWoLH -9gZEMGFQ0MW5Ag0EWQaohAEQAJnHTGcy0ol//23alysOuwYFHsS7PFizcCHuy4jv -iB8YR5Y4Ts8nAgo8gz2O2m9bgNfbFHStDoqOUWTV7ILYv/CDZiNhvR+fAeWl3Gmt -o72YYu85r+KZj22YfiXtfOb70IT7jYsTpjlgaqFUFHEHXzoa7EMscra6r+i1qDa9 -QfDjMIBaBCu/Q9CFfhHtIhBHV2Wt8IEJfgYSMHb24Db0RY+OSQixX4QRcVeSnJ5D -I4ZjA3//o9DwthrJf+GxW/f8TGZy6vtertnxLJXGoHFPVuI895m7wSfzt1+/2nlc -0obAMX4Q1yRYTOKQGPDeDZ7k5pxnhYkOHDf2gtY5wORw6vN9KR51YFJYXVmK+2zr -P0fKr0AUG3C7CwQp6bDeYaTndon8S9VNyPypvJ7lpxKy/DIujdvbaJHF3i4rI+w0 -veScfkGtLDc37OeVQEBV4vnHcMvDIC2SEtli4BZjwOcihOv3DgtmQnAjkkAZLtys -x/W4/MPoZiIWl0DnQev/ujwLkwHCYg/Oo7E70OKpdxDk/2cZyM1US2Uz2NQ4lo5O -8M+F9sMWj2EPX/kJxZpb6N/+xJnKf4oIdJkaammVllX0TGtoxGOadPST9D8gtSCr -yRdLMp0bB0+Ghbc+STGo78atg+J+HRvgzXG/gwaEiCIezuLB4W6rFjbldYfbeKTs -OoAlABEBAAGJAh8EGAEKAAkFAlkGqIQCGwwACgkQ8QWIOhc1Yj23pw/+JNWYULOd -uM4Khfyx3NgCLiX9VqmwZ7PQQsPKtxviQXdEgs+NJUrCePmjSV9Sf+exTZ4wqSTC -BilGUppAJbO9avR2wRkYbdiYW+g0jDwAD9cyfAiDBSUiRTimKsKqYN0PbIKJ2Ric -xvtBw4jW/f1lHkrySqOHetmFTe2ocXkFm8BjqDpt5XCoZa4ADcofNpRJYwVu0Uck -8MQ/wYjoNRZiz0Sjx9vOBVW9ZKMWS6RgnPStsK3UJiG3c7c83kpDx8nk4bUp8seY -cBjiViXh6QMXRPdlqsGEMiBVtyXF7Sy3cK3gUcH7808VmKMHEgWvq9MRrZoE0rLK -74pZrEuWnwD6o77w4DCBtKJyDNlR23kLObS+1Ur7fIXe2yXmbqwEmjpSX4H2Teth -77PU7nKMAkFsPJDNI7K/kEy3x7KM3G1gIcWaz3pL5gthLV+H3RfIojrK1hS7ZSSI -gCzYEkQCMsigT5YTgK5+n0I4U7zoDBd1sttwK2FahvuCKUDwc+ZiX/ciYiAjUMb9 -6yTNHlNr/H31EWVZMEd7+fhFZWXJjFsQD11GkXvy6vMBn3Kq+Vd7Yr4CJUGTV3rW -bWo1vt2ED7h5rbZTrS1UssxLUpy5iXrjyGwn2h/Ei9MzXpNvH8p2raf0eQ0Qn65Q -UoUryip3RD0yaMCyL/IK3KoPt74f2eJsFwM= -=feO2 ------END PGP PUBLIC KEY BLOCK----- \ No newline at end of file From 8febd03f862eab0ca83871e9ff8c5062550b646d Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sat, 6 May 2017 12:57:14 +0200 Subject: [PATCH 0610/1644] ARROW-953: Use conda-forge cmake, curl in CI toolchain Author: Wes McKinney Closes #645 from wesm/ARROW-953 and squashes the following commits: 4f719c1 [Wes McKinney] Use conda-forge cmake, curl in CI toolchain --- .travis.yml | 2 -- ci/travis_install_conda.sh | 4 ++-- ci/travis_script_python.sh | 4 ++-- 3 files 
changed, 4 insertions(+), 6 deletions(-)

diff --git a/.travis.yml b/.travis.yml
index 19e71ae1e68f0..d821b5accb973 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -4,13 +4,11 @@ addons:
   apt:
     sources:
     - ubuntu-toolchain-r-test
-    - kalakris-cmake
     packages:
     - gcc-4.9   # Needed for C++11
     - g++-4.9   # Needed for C++11
    - gdb
    - ccache
-    - cmake
    - valgrind
    - libboost-dev
    - libboost-filesystem-dev

diff --git a/ci/travis_install_conda.sh b/ci/travis_install_conda.sh
index 7d185ee82275b..369820b37f5c1 100644
--- a/ci/travis_install_conda.sh
+++ b/ci/travis_install_conda.sh
@@ -40,7 +40,7 @@
 conda config --add channels https://repo.continuum.io/pkgs/free
 conda config --add channels conda-forge
 conda info -a

-conda install --yes conda-build jinja2 anaconda-client
-
 # faster builds, please
 conda install -y nomkl
+
+conda install --yes conda-build jinja2 anaconda-client cmake curl

diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh
index c1426da7247b2..20b0f2aadb900 100755
--- a/ci/travis_script_python.sh
+++ b/ci/travis_script_python.sh
@@ -23,7 +23,7 @@ export PARQUET_HOME=$TRAVIS_BUILD_DIR/parquet-env

 build_parquet_cpp() {
   export PARQUET_ARROW_VERSION=$(git rev-parse HEAD)
-  conda create -y -q -p $PARQUET_HOME python=3.6
+  conda create -y -q -p $PARQUET_HOME python=3.6 cmake curl
   source activate $PARQUET_HOME

   # In case some package wants to download the MKL
@@ -89,7 +89,7 @@ python_version_tests() {
   export ARROW_HOME=$TRAVIS_BUILD_DIR/arrow-install-$PYTHON_VERSION
   export LD_LIBRARY_PATH=$ARROW_HOME/lib:$PARQUET_HOME/lib
-  conda create -y -q -p $CONDA_ENV_DIR python=$PYTHON_VERSION
+  conda create -y -q -p $CONDA_ENV_DIR python=$PYTHON_VERSION cmake curl
   source activate $CONDA_ENV_DIR

   python --version

From c3a122e1cbe83028531bfd73f9a4e1401031c824 Mon Sep 17 00:00:00 2001
From: Philipp Moritz
Date: Sat, 6 May 2017 10:56:00 -0400
Subject: [PATCH 0611/1644] ARROW-939: fix division by zero if one of the tensor dimensions is zero

This was reported and fixed by @stephanie-wang, see https://github.com/ray-project/ray/issues/500

Author: Philipp Moritz

Closes #634 from pcmoritz/master and squashes the following commits:

399681b [Philipp Moritz] fix linting
44ee13a [Philipp Moritz] fix strides if one of the tensor dimensions is zero
4d831ed [Philipp Moritz] fix division by zero if one of the tensor dimensions is zero
---
 cpp/src/arrow/tensor-test.cc | 10 ++++++++++
 cpp/src/arrow/tensor.cc      | 11 +++++++++++
 2 files changed, 21 insertions(+)

diff --git a/cpp/src/arrow/tensor-test.cc b/cpp/src/arrow/tensor-test.cc
index c41683a3db5a2..0a11422b75d13 100644
--- a/cpp/src/arrow/tensor-test.cc
+++ b/cpp/src/arrow/tensor-test.cc
@@ -93,4 +93,14 @@ TEST(TestTensor, IsContiguous) {
   ASSERT_FALSE(t3.is_contiguous());
 }

+TEST(TestTensor, ZeroDimensionalTensor) {
+  std::vector<int64_t> shape = {0};
+
+  std::shared_ptr<Buffer> buffer;
+  ASSERT_OK(AllocateBuffer(default_memory_pool(), 0, &buffer));
+
+  Tensor t(int64(), buffer, shape);
+  ASSERT_EQ(t.strides().size(), 1);
+}
+
 }  // namespace arrow

diff --git a/cpp/src/arrow/tensor.cc b/cpp/src/arrow/tensor.cc
index 909b05ebe8f80..bcd9d8d94c6b4 100644
--- a/cpp/src/arrow/tensor.cc
+++ b/cpp/src/arrow/tensor.cc
@@ -41,6 +41,11 @@ static void ComputeRowMajorStrides(const FixedWidthType& type,
     remaining *= dimsize;
   }

+  if (remaining == 0) {
+    strides->assign(shape.size(), type.bit_width() / 8);
+    return;
+  }
+
   for (int64_t dimsize : shape) {
     remaining /= dimsize;
     strides->push_back(remaining);
@@ -50,6 +55,12 @@ static void ComputeColumnMajorStrides(const FixedWidthType& type,
     const std::vector<int64_t>& shape, std::vector<int64_t>* strides) {
   int64_t total = type.bit_width() / 8;
+  for (int64_t dimsize : shape) {
+    if (dimsize == 0) {
+      strides->assign(shape.size(), type.bit_width() / 8);
+      return;
+    }
+  }
   for (int64_t dimsize : shape) {
     strides->push_back(total);
     total *= dimsize;
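To see the failure the guard above prevents: the row-major code first multiplies the element size by every dimension to get the total byte count, then recovers each stride by dividing that running total by the dimension sizes, so a single zero-length dimension zeroes the product and the later division fails. A minimal editor's re-creation of the same integer arithmetic in Python (illustrative only, not Arrow code; `row_major_strides` is an invented name):

    def row_major_strides(item_size, shape):
        remaining = item_size
        for dim in shape:
            remaining *= dim
        if remaining == 0:
            # The ARROW-939 fix: a zero-sized dimension collapses every
            # stride to the element size instead of dividing by zero below.
            return [item_size] * len(shape)
        strides = []
        for dim in shape:
            remaining //= dim  # without the guard, dim == 0 raises ZeroDivisionError
            strides.append(remaining)
        return strides

    assert row_major_strides(8, [0]) == [8]         # the regression case from the test
    assert row_major_strides(8, [2, 3]) == [24, 8]  # ordinary row-major layout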
From 75ebf5ca809ee648a3f64aa9b967246167c509f1 Mon Sep 17 00:00:00 2001
From: Jeff Reback
Date: Sat, 6 May 2017 15:40:43 -0400
Subject: [PATCH 0612/1644] ARROW-956: [Python] compat with pandas >= 0.20.0

ARROW-944: [Python] compat with pandas < 0.19.0

Author: Jeff Reback

Closes #649 from jreback/pandas_compat2 and squashes the following commits:

91fd9fb [Jeff Reback] ARROW-956: [Python] compat with pandas >= 0.20.0
025639c [Jeff Reback] ARROW-944: [Python] compat with pandas < 0.19.0
---
 python/pyarrow/compat.py | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/python/pyarrow/compat.py b/python/pyarrow/compat.py
index 928a2c0724298..7be35dfc2c81f 100644
--- a/python/pyarrow/compat.py
+++ b/python/pyarrow/compat.py
@@ -34,19 +34,14 @@
 import pandas as pd
 pdver = LooseVersion(pd.__version__)
 if pdver >= '0.20.0':
-    try:
-        from pandas.api.types import DatetimeTZDtype
-    except AttributeError:
-        # can be removed once 0.20.0 is released
-        from pandas.core.dtypes.dtypes import DatetimeTZDtype
-
+    from pandas.api.types import DatetimeTZDtype
     pdapi = pd.api.types
-elif pdver < '0.19.0':
-    from pandas.core.dtypes import DatetimeTZDtype
-    pdapi = pd.core.common
-else:
+elif pdver >= '0.19.0':
     from pandas.types.dtypes import DatetimeTZDtype
     pdapi = pd.api.types
+else:
+    from pandas.types.dtypes import DatetimeTZDtype
+    pdapi = pd.core.common

 PandasSeries = pd.Series
 Categorical = pd.Categorical
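The version-gated imports above are one of two common shapes for this kind of shim; the other probes for the new module and falls back on ImportError. A small editor's sketch of the probing variant (not pyarrow code; it assumes only the two import locations that appear in the patch):

    try:
        # pandas >= 0.20 exposes the type helpers under pandas.api.types
        from pandas.api.types import DatetimeTZDtype
    except ImportError:
        # older pandas kept them in the since-removed pandas.types.dtypes
        from pandas.types.dtypes import DatetimeTZDtype

The patch prefers explicit version comparisons, which also lets it pick the matching `pdapi` namespace (`pd.api.types` versus the older `pd.core.common`) in the same branch.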
From 959ec47b3e2828a44088d054bbdc5eabac9113e5 Mon Sep 17 00:00:00 2001
From: "Uwe L. Korn"
Date: Sat, 6 May 2017 17:54:31 -0400
Subject: [PATCH 0613/1644] ARROW-856: Also read compiler info from stdout

As I cannot reproduce the problem, this may not be the correct solution. Still, given the output of `gcc -v` in the ticket, it is very likely that some systems print the info on stdout. The additional message should also improve debugging.

Author: Uwe L. Korn

Closes #650 from xhochy/ARROW-856 and squashes the following commits:

d87c7bd [Uwe L. Korn] ARROW-856: Also read compiler info from stdout
---
 cpp/cmake_modules/CompilerInfo.cmake | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/cpp/cmake_modules/CompilerInfo.cmake b/cpp/cmake_modules/CompilerInfo.cmake
index 21e2dafba2e24..4a18376df59f4 100644
--- a/cpp/cmake_modules/CompilerInfo.cmake
+++ b/cpp/cmake_modules/CompilerInfo.cmake
@@ -21,7 +21,11 @@ if (NOT MSVC)
   set(COMPILER_GET_VERSION_SWITCH "-v")
 endif()

+message(INFO "Compiler command: ${CMAKE_CXX_COMPILER}")
+# Some gcc builds seem to print their version on stdout while most print it on
+# stderr, so simply merge both pipes into a single variable
 execute_process(COMMAND "${CMAKE_CXX_COMPILER}" ${COMPILER_GET_VERSION_SWITCH}
+                OUTPUT_VARIABLE COMPILER_VERSION_FULL
                 ERROR_VARIABLE COMPILER_VERSION_FULL)
 message(INFO "Compiler version: ${COMPILER_VERSION_FULL}")
 message(INFO "Compiler id: ${CMAKE_CXX_COMPILER_ID}")
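The stdout-versus-stderr ambiguity is easy to demonstrate outside CMake: `gcc -v` traditionally writes its banner to stderr, but per the ticket some builds emit it on stdout, so a robust probe captures both streams. An editor's sketch in Python (assumes some `gcc` on PATH) that merges the two pipes the same way the patch points OUTPUT_VARIABLE and ERROR_VARIABLE at one variable:

    import subprocess

    # Redirect stderr into stdout so the version banner is captured no
    # matter which stream this particular compiler writes it to.
    result = subprocess.run(
        ['gcc', '-v'],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    print(result.stdout.strip().splitlines()[-1])  # usually the "gcc version ..." line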
From bd36f6f590e3f5ebe3ad8ed2cc81b988272c9215 Mon Sep 17 00:00:00 2001
From: Wes McKinney
Date: Sat, 6 May 2017 17:57:29 -0400
Subject: [PATCH 0614/1644] ARROW-899: [Doc] Add 0.3.0 changelog

Author: Wes McKinney

Closes #652 from wesm/ARROW-899 and squashes the following commits:

c3af6b5 [Wes McKinney] Remove asterisks causing weird Markdown formatting
b1e707c [Wes McKinney] Add 0.3.0 changelog
---
 CHANGELOG.md | 307 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 307 insertions(+)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 3d54838e1a7f0..85a43ef7952d9 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -12,6 +12,313 @@
 limitations under the License. See accompanying LICENSE file.
 -->

+# Apache Arrow 0.3.0 (5 May 2017)
+
+## Bug
+
+* ARROW-109 - [C++] Investigate recursive data types limit in flatbuffers
+* ARROW-208 - Add checkstyle policy to java project
+* ARROW-347 - Add method to pass CallBack when creating a transfer pair
+* ARROW-413 - DATE type is not specified clearly
+* ARROW-431 - [Python] Review GIL release and acquisition in to_pandas conversion
+* ARROW-443 - [Python] Support for converting from strided pandas data in Table.from_pandas
+* ARROW-451 - [C++] Override DataType::Equals for other types with additional metadata
+* ARROW-454 - pojo.Field doesn't implement hashCode()
+* ARROW-526 - [Format] Update IPC.md to account for File format changes and Streaming format
+* ARROW-565 - [C++] Examine "Field::dictionary" member
+* ARROW-570 - Determine Java tools JAR location from project metadata
+* ARROW-584 - [C++] Fix compiler warnings exposed with -Wconversion
+* ARROW-588 - [C++] Fix compiler warnings on 32-bit platforms
+* ARROW-595 - [Python] StreamReader.schema returns None
+* ARROW-604 - Python: boxed Field instances are missing the reference to DataType
+* ARROW-613 - [JS] Implement random-access file format
+* ARROW-617 - Time type is not specified clearly
+* ARROW-619 - Python: Fix typos in setup.py args and LD_LIBRARY_PATH
+* ARROW-623 - segfault with __repr__ of empty Field
+* ARROW-624 - [C++] Restore MakePrimitiveArray function
+* ARROW-627 - [C++] Compatibility macros for exported extern template class declarations
+* ARROW-628 - [Python] Install nomkl metapackage when building parquet-cpp for faster Travis builds
+* ARROW-630 - [C++] IPC unloading for BooleanArray does not account for offset
+* ARROW-636 - [C++] Add Boost / other system requirements to C++ README
+* ARROW-639 - [C++] Invalid offset in slices
+* ARROW-642 - [Java] Remove temporary file in java/tools
+* ARROW-644 - Python: Cython should be a setup-only requirement
+* ARROW-652 - Remove trailing f in merge script output
+* ARROW-654 - [C++] Support timezone metadata in file/stream formats
+* ARROW-668 - [Python] Convert nanosecond timestamps to pandas.Timestamp when converting from TimestampValue
+* ARROW-671 - [GLib] License file isn't installed
+* ARROW-673 - [Java] Support additional Time metadata
+* ARROW-677 - [java] Fix checkstyle jcl-over-slf4j conflict issue
+* ARROW-678 - [GLib] Fix dependencies
+* ARROW-680 - [C++] Multiarch support impacts user-supplied install prefix
+* ARROW-682 - Add self-validation checks in integration tests
+* ARROW-683 - [C++] Support date32 (DateUnit::DAY) in IPC metadata, rename date to date64
+* ARROW-686 - [C++] Account for time metadata changes, add time32 and time64 types
+* ARROW-689 - [GLib] Install header files and documents to wrong directories
+* ARROW-691 - [Java] Encode dictionary Int type in message format
+* ARROW-697 - [Java] Raise appropriate exceptions when encountering large (> INT32_MAX) record batches
+* ARROW-699 - [C++] Arrow dynamic libraries are missed on run of unit tests on Windows
+* ARROW-702 - Fix BitVector.copyFromSafe to reAllocate instead of returning false
+* ARROW-703 - Fix issue where setValueCount(0) doesn’t work in the case that we’ve shipped vectors across the wire
+* ARROW-704 - Fix bad import caused by conflicting changes
+* ARROW-709 - [C++] Restore type comparator for DecimalType
+* ARROW-713 - [C++] Fix linking issue with ipc benchmark
+* ARROW-715 - Python: Explicit pandas import makes it a hard requirement
+* ARROW-716 - error building arrow/python
+* ARROW-720 - [java] arrow should not have a dependency on slf4j bridges in compile
+* ARROW-723 - Arrow freezes on write if chunk_size=0
+* ARROW-726 - [C++] PyBuffer dtor may segfault if constructor passed an object not exporting buffer protocol
+* ARROW-732 - Schema comparison bugs in struct and union types
+* ARROW-736 - [Python] Mixed-type object DataFrame columns should not silently coerce to an Arrow type by default
+* ARROW-738 - [Python] Fix manylinux1 packaging
+* ARROW-739 - Parallel build fails non-deterministically.
+* ARROW-740 - FileReader fails for large objects
+* ARROW-747 - [C++] Fix spurious warning caused by passing dl to add_dependencies
+* ARROW-749 - [Python] Delete incomplete binary files when writing fails
+* ARROW-753 - [Python] Unit tests in arrow/python fail to link on some OS X platforms
+* ARROW-756 - [C++] Do not pass -fPIC when compiling with MSVC
+* ARROW-757 - [C++] MSVC build fails on googletest when using NMake
+* ARROW-762 - Kerberos Problem with PyArrow
+* ARROW-776 - [GLib] Cast type is wrong
+* ARROW-777 - [Java] Resolve getObject behavior per changes / discussion in ARROW-729
+* ARROW-778 - Modify merge tool to work on Windows
+* ARROW-781 - [Python/C++] Increase reference count for base object?
+* ARROW-783 - Integration tests fail for length-0 record batch
+* ARROW-787 - [GLib] Fix compilation errors caused by ARROW-758
+* ARROW-793 - [GLib] Wrong indent
+* ARROW-794 - [C++] Check whether data is contiguous in ipc::WriteTensor
+* ARROW-797 - [Python] Add updated pyarrow. public API listing in Sphinx docs
+* ARROW-800 - [C++] Boost headers being transitively included in pyarrow
+* ARROW-805 - listing empty HDFS directory returns an error instead of returning empty list
+* ARROW-809 - C++: Writing sliced record batch to IPC writes the entire array
+* ARROW-812 - Pip install pyarrow on mac failed.
+* ARROW-817 - [C++] Fix incorrect code comment from ARROW-722
+* ARROW-821 - [Python] Extra file _table_api.h generated during Python build process
+* ARROW-822 - [Python] StreamWriter fails to open with socket as sink
+* ARROW-826 - Compilation error on Mac with -DARROW_PYTHON=on
+* ARROW-829 - Python: Parquet: Dictionary encoding is deactivated if column-wise compression was selected
+* ARROW-830 - Python: jemalloc is not anymore publicly exposed
+* ARROW-839 - [C++] Portable alternative to PyDate_to_ms function
+* ARROW-847 - C++: BUILD_BYPRODUCTS not specified anymore for gtest
+* ARROW-852 - Python: Also set Arrow Library PATHS when detection was done through pkg-config
+* ARROW-853 - [Python] It is no longer necessary to modify the RPATH of the Cython extensions on many environments
+* ARROW-858 - Remove dependency on boost regex
+* ARROW-866 - [Python] Error from file object destructor
+* ARROW-867 - [Python] Miscellaneous pyarrow MSVC fixes
+* ARROW-875 - Nullable variable length vector fillEmpties() fills an extra value
+* ARROW-879 - compat with pandas 0.20.0
+* ARROW-882 - [C++] On Windows statically built lib file overwrites lib file of shared build
+* ARROW-886 - VariableLengthVectors don't reAlloc offsets
+* ARROW-887 - [format] For backward compatibility, new unit fields must have default values matching previous implied unit
+* ARROW-888 - BitVector transfer() does not transfer ownership
+* ARROW-895 - Nullable variable length vector lastSet not set correctly
+* ARROW-900 - [Python] UnboundLocalError in ParquetDatasetPiece
+* ARROW-903 - [GLib] Remove a needless "."
+* ARROW-914 - [C++/Python] Fix Decimal ToBytes
+* ARROW-922 - Allow Flatbuffers and RapidJSON to be used locally on Windows
+* ARROW-928 - Update CMAKE script to detect unsupported msvc compilers versions
+* ARROW-933 - [Python] arrow_python bindings have debug print statement
+* ARROW-934 - [GLib] Glib sources missing from result of 02-source.sh
+* ARROW-936 - Fix release README
+* ARROW-938 - Fix Apache Rat errors from source release build
+
+## Improvement
+
+* ARROW-316 - Finalize Date type
+* ARROW-542 - [Java] Implement dictionaries in stream/file encoding
+* ARROW-563 - C++: Support non-standard gcc version strings
+* ARROW-566 - Python: Deterministic position of libarrow in manylinux1 wheels
+* ARROW-569 - [C++] Set version for .pc
+* ARROW-577 - [C++] Refactor StreamWriter and FileWriter to have private implementations
+* ARROW-580 - C++: Also provide jemalloc_X targets if only a static or shared version is found
+* ARROW-582 - [Java] Add Date/Time Support to JSON File
+* ARROW-589 - C++: Use system provided shared jemalloc if static is unavailable
+* ARROW-593 - [C++] Rename ReadableFileInterface to RandomAccessFile
+* ARROW-612 - [Java] Field toString should show nullable flag status
+* ARROW-615 - Move ByteArrayReadableSeekableByteChannel to vector.util package
+* ARROW-631 - [GLib] Import C API (C++ API wrapper) based on GLib from https://github.com/kou/arrow-glib
+* ARROW-646 - Cache miniconda packages
+* ARROW-647 - [C++] Don't require Boost static libraries to support CentOS 7
+* ARROW-648 - [C++] Support multiarch on Debian
+* ARROW-650 - [GLib] Follow ReadableFileInterface -> RandomAccessFile change
+* ARROW-651 - [C++] Set shared library version for .deb packages
+* ARROW-655 - Implement DecimalArray
+* ARROW-662 - [Format] Factor Flatbuffer schema metadata into a Schema.fbs
+* ARROW-664 - Make C++ Arrow serialization deterministic
+* ARROW-674 - [Java] Support additional Timestamp timezone
metadata +* ARROW-675 - [GLib] Update package metadata +* ARROW-676 - [java] move from MinorType to FieldType in ValueVectors to carry all the relevant type bits +* ARROW-679 - [Format] Change RecordBatch and Field length members from int to long +* ARROW-681 - [C++] Build Arrow on Windows with dynamically linked boost +* ARROW-684 - Python: More informative message when parquet-cpp but not parquet-arrow is available +* ARROW-688 - [C++] Use CMAKE_INSTALL_INCLUDEDIR for consistency +* ARROW-690 - Only send JIRA updates to issues@arrow.apache.org +* ARROW-700 - Add headroom interface for allocator. +* ARROW-706 - [GLib] Add package install document +* ARROW-707 - Python: All none-Pandas column should be converted to NullArray +* ARROW-708 - [C++] Some IPC code simplification, perf analysis +* ARROW-712 - [C++] Implement Array::Accept as inline visitor +* ARROW-719 - [GLib] Support prepared source archive release +* ARROW-724 - Add "How to Contribute" section to README +* ARROW-725 - [Format] Constant length list type +* ARROW-727 - [Python] Write memoryview-compatible objects in NativeFile.write with zero copy +* ARROW-728 - [C++/Python] Add arrow::Table function for removing a column +* ARROW-731 - [C++] Add shared library related versions to .pc +* ARROW-741 - [Python] Add Python 3.6 to Travis CI +* ARROW-743 - [C++] Consolidate unit tests for code in array.h +* ARROW-744 - [GLib] Re-add an assertion to garrow_table_new() test +* ARROW-745 - [C++] Allow use of system cpplint +* ARROW-746 - [GLib] Add garrow_array_get_data_type() +* ARROW-751 - [Python] Rename all Cython extensions to "private" status with leading underscore +* ARROW-752 - [Python] Construct pyarrow.DictionaryArray from boxed pyarrow array objects +* ARROW-754 - [GLib] Add garrow_array_is_null() +* ARROW-755 - [GLib] Add garrow_array_get_value_type() +* ARROW-758 - [C++] Fix compiler warnings on MSVC x64 +* ARROW-761 - [Python] Add function to compute the total size of tensor payloads, including metadata and padding +* ARROW-763 - C++: Use `python-config` to find libpythonX.X.dylib +* ARROW-765 - [Python] Make generic ArrowException subclass value error +* ARROW-769 - [GLib] Support building without installed Arrow C++ +* ARROW-770 - [C++] Move clang-tidy/format config files back to C++ source tree +* ARROW-774 - [GLib] Remove needless LICENSE.txt copy +* ARROW-775 - [Java] add simple constructors to value vectors +* ARROW-779 - [C++/Python] Raise exception if old metadata encountered +* ARROW-782 - [C++] Change struct to class for objects that meet the criteria in the Google style guide +* ARROW-788 - Possible nondeterminism in Tensor serialization code +* ARROW-795 - [C++] Combine libarrow/libarrow_io/libarrow_ipc +* ARROW-802 - [GLib] Add read examples +* ARROW-803 - [GLib] Update package repository URL +* ARROW-804 - [GLib] Update build document +* ARROW-806 - [GLib] Support add/remove a column from table +* ARROW-807 - [GLib] Update "Since" tag +* ARROW-808 - [GLib] Remove needless ignore entries +* ARROW-810 - [GLib] Remove io/ipc prefix +* ARROW-811 - [GLib] Add GArrowBuffer +* ARROW-815 - [Java] Allow for expanding underlying buffer size after allocation +* ARROW-816 - [C++] Use conda packages for RapidJSON, Flatbuffers to speed up builds +* ARROW-818 - [Python] Review public pyarrow. 
API completeness and update docs +* ARROW-820 - [C++] Build dependencies for Parquet library without arrow support +* ARROW-825 - [Python] Generalize pyarrow.from_pylist to accept any object implementing the PySequence protocol +* ARROW-827 - [Python] Variety of Parquet improvements to support Dask integration +* ARROW-828 - [CPP] Document new requirement (libboost-regex-dev) in README.md +* ARROW-832 - [C++] Upgrade thirdparty gtest to 1.8.0 +* ARROW-833 - [Python] "Quickstart" build / environment setup guide for Python developers +* ARROW-841 - [Python] Add pyarrow build to Appveyor +* ARROW-844 - [Format] Revise format/README.md to reflect progress reaching a more complete specification +* ARROW-845 - [Python] Sync FindArrow.cmake changes from parquet-cpp +* ARROW-846 - [GLib] Add GArrowTensor, GArrowInt8Tensor and GArrowUInt8Tensor +* ARROW-848 - [Python] Improvements / fixes to conda quickstart guide +* ARROW-849 - [C++] Add optional $ARROW_BUILD_TOOLCHAIN environment variable option for configuring build environment +* ARROW-857 - [Python] Automate publishing Python documentation to arrow-site +* ARROW-860 - [C++] Decide if typed Tensor subclasses are worthwhile +* ARROW-861 - [Python] Move DEVELOPMENT.md to Sphinx docs +* ARROW-862 - [Python] Improve source build instructions in README +* ARROW-863 - [GLib] Use GBytes to implement zero-copy +* ARROW-864 - [GLib] Unify Array files +* ARROW-868 - [GLib] Use GBytes to reduce copy +* ARROW-871 - [GLib] Unify DataType files +* ARROW-876 - [GLib] Unify ArrayBuffer files +* ARROW-877 - [GLib] Add garrow_array_get_null_bitmap() +* ARROW-878 - [GLib] Add garrow_binary_array_get_buffer() +* ARROW-892 - [GLib] Fix GArrowTensor document +* ARROW-893 - Add GLib document to Web site +* ARROW-894 - [GLib] Add GArrowPoolBuffer +* ARROW-896 - [Docs] Add Jekyll plugin for including rendered Jupyter notebooks on website +* ARROW-898 - [C++] Expand metadata support to field level, provide for sharing instances of KeyValueMetadata +* ARROW-904 - [GLib] Simplify error check codes +* ARROW-907 - C++: Convenience construct Table from schema and arrays +* ARROW-908 - [GLib] Unify OutputStream files +* ARROW-910 - [C++] Write 0-length EOS indicator at end of stream +* ARROW-916 - [GLib] Add GArrowBufferOutputStream +* ARROW-917 - [GLib] Add GArrowBufferReader +* ARROW-918 - [GLib] Use GArrowBuffer for read +* ARROW-919 - [GLib] Use "id" to get type enum value from GArrowDataType +* ARROW-920 - [GLib] Add Lua examples +* ARROW-925 - [GLib] Fix GArrowBufferReader test +* ARROW-930 - javadoc generation fails with java 8 +* ARROW-931 - [GLib] Reconstruct input stream + +## New Feature + +* ARROW-231 - C++: Add typed Resize to PoolBuffer +* ARROW-281 - [C++] IPC/RPC support on Win32 platforms +* ARROW-341 - [Python] Making libpyarrow available to third parties +* ARROW-452 - [C++/Python] Merge "Feather" file format implementation +* ARROW-459 - [C++] Implement IPC round trip for DictionaryArray, dictionaries shared across record batches +* ARROW-483 - [C++/Python] Provide access to "custom_metadata" Field attribute in IPC setting +* ARROW-491 - [C++] Add FixedWidthBinary type +* ARROW-493 - [C++] Allow in-memory array over 2^31 -1 elements but require splitting at IPC / RPC boundaries +* ARROW-502 - [C++/Python] Add MemoryPool implementation that logs allocation activity to std::cout +* ARROW-510 - Add integration tests for date and time types +* ARROW-520 - [C++] Add STL-compliant allocator that hooks into an arrow::MemoryPool +* ARROW-528 - [Python] Support 
_metadata or _common_metadata files when reading Parquet directories +* ARROW-534 - [C++] Add IPC tests for date/time types +* ARROW-539 - [Python] Support reading Parquet datasets with standard partition directory schemes +* ARROW-550 - [Format] Add a TensorMessage type +* ARROW-552 - [Python] Add scalar value support for Dictionary type +* ARROW-557 - [Python] Explicitly opt in to HDFS unit tests +* ARROW-568 - [C++] Add default implementations for TypeVisitor, ArrayVisitor methods that return NotImplemented +* ARROW-574 - Python: Add support for nested Python lists in Pandas conversion +* ARROW-576 - [C++] Complete round trip Union file/stream IPC tests +* ARROW-578 - [C++] Add CMake option to add custom $CXXFLAGS +* ARROW-598 - [Python] Add support for converting pyarrow.Buffer to a memoryview with zero copy +* ARROW-603 - [C++] Add RecordBatch::Validate method that at least checks that schema matches the array metadata +* ARROW-605 - [C++] Refactor generic ArrayLoader class, support work for Feather merge +* ARROW-606 - [C++] Upgrade to flatbuffers 1.6.0 +* ARROW-608 - [Format] Days since epoch date type +* ARROW-610 - [C++] Win32 compatibility in file.cc +* ARROW-616 - [C++] Remove -g flag in release builds +* ARROW-618 - [Python] Implement support for DatetimeTZ custom type from pandas +* ARROW-620 - [C++] Add date/time support to JSON reader/writer for integration testing +* ARROW-621 - [C++] Implement an "inline visitor" template that enables visitor-pattern-like code without virtual function dispatch +* ARROW-625 - [C++] Add time unit to TimeType::ToString +* ARROW-626 - [Python] Enable pyarrow.BufferReader to read from any Python object implementing the buffer/memoryview protocol +* ARROW-632 - [Python] Add support for FixedWidthBinary type +* ARROW-635 - [C++] Add JSON read/write support for FixedWidthBinary +* ARROW-637 - [Format] Add time zone metadata to Timestamp type +* ARROW-656 - [C++] Implement IO interface that can read and write to a fixed-size mutable buffer +* ARROW-657 - [Python] Write and read tensors (with zero copy) into shared memory +* ARROW-658 - [C++] Implement in-memory arrow::Tensor objects +* ARROW-659 - [C++] Add multithreaded memcpy implementation (for hardware where it helps) +* ARROW-660 - [C++] Restore function that can read a complete encapsulated record batch message +* ARROW-661 - [C++] Add a Flatbuffer metadata type that supports array data over 2^31 - 1 elements +* ARROW-663 - [Java] Support additional Time metadata + vector value accessors +* ARROW-669 - [Python] Attach proper tzinfo when computing boxed scalars for TimestampArray +* ARROW-687 - [C++] Build and run full test suite in Appveyor +* ARROW-698 - [C++] Add options to StreamWriter/FileWriter to permit large record batches +* ARROW-701 - [Java] Support additional Date metadata +* ARROW-710 - [Python] Enable Feather APIs to read and write using Python file-like objects +* ARROW-717 - [C++] IPC zero-copy round trips for arrow::Tensor +* ARROW-718 - [Python] Expose arrow::Tensor with conversions to/from NumPy arrays +* ARROW-722 - [Python] pandas conversions for new date and time types/metadata +* ARROW-729 - [Java] Add vector type for 32-bit date as days since UNIX epoch +* ARROW-733 - [C++/Format] Change name of Fixed Width Binary to Fixed Size Binary for consistency +* ARROW-734 - [Python] Support for pyarrow on Windows / MSVC +* ARROW-735 - [C++] Developer instruction document for MSVC on Windows +* ARROW-737 - [C++] Support obtaining mutable slices of mutable buffers +* ARROW-768 - 
[Java] Change the "boxed" object representation of date and time types +* ARROW-771 - [Python] Add APIs for reading individual Parquet row groups +* ARROW-773 - [C++] Add function to create arrow::Table with column appended to existing table +* ARROW-865 - [Python] Verify Parquet roundtrips for new date/time types +* ARROW-880 - [GLib] Add garrow_primitive_array_get_buffer() +* ARROW-890 - [GLib] Add GArrowMutableBuffer +* ARROW-926 - Update KEYS to include wesm + +## Task + +* ARROW-52 - Set up project blog +* ARROW-670 - Arrow 0.3 release +* ARROW-672 - [Format] Bump metadata version for 0.3 release +* ARROW-748 - [Python] Pin runtime library versions in conda-forge packages to force upgrades +* ARROW-798 - [Docs] Publish Format Markdown documents somehow on arrow.apache.org +* ARROW-869 - [JS] Rename directory to js/ +* ARROW-95 - Scaffold Main Documentation using asciidoc +* ARROW-98 - Java: API documentation + +## Test + +* ARROW-836 - Test for timedelta compat with pandas +* ARROW-927 - C++/Python: Add manylinux1 builds to Travis matrix + # Apache Arrow 0.2.0 (15 February 2017) ## Bug From 03c242c7c614a4ca2089ea378796c38ae1d9fa3a Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sat, 6 May 2017 17:58:10 -0400 Subject: [PATCH 0615/1644] ARROW-947: [Python] Improve execution time of manylinux1 build Got it down to 14min: https://travis-ci.org/xhochy/arrow/builds/229461571 We can probably squeeze out 1-2 minutes more but that won't be easy. Build times are probably longer here as we build in release mode. Author: Uwe L. Korn Closes #648 from xhochy/ARROW-947 and squashes the following commits: ebe7507 [Uwe L. Korn] Use this branch as docker tag 3203b22 [Uwe L. Korn] Move manylinux commands to script a6397da [Uwe L. Korn] Reset docker tag to latest f2b5909 [Uwe L. Korn] Explicit set e 352215f [Uwe L. Korn] Pre-build virtualenvs c1a1c89 [Uwe L. Korn] Pre-install python packages 0e83522 [Uwe L. Korn] Fix image tags 60a0c8c [Uwe L. Korn] Add ccache to the image 3426195 [Uwe L. Korn] Add brotli dc0802b [Uwe L. Korn] Add more dependencies 9cadeb5 [Uwe L. 
Korn] Move gtest and flatbuffers to scripts --- .travis.yml | 10 ++--- ci/travis_script_manylinux.sh | 21 ++++++++++ python/manylinux1/Dockerfile-x86_64 | 2 +- python/manylinux1/Dockerfile-x86_64_base | 41 +++++++++++-------- python/manylinux1/build_arrow.sh | 17 +++----- python/manylinux1/scripts/build_brotli.sh | 30 ++++++++++++++ python/manylinux1/scripts/build_ccache.sh | 21 ++++++++++ .../manylinux1/scripts/build_flatbuffers.sh | 21 ++++++++++ python/manylinux1/scripts/build_gtest.sh | 21 ++++++++++ python/manylinux1/scripts/build_snappy.sh | 22 ++++++++++ python/manylinux1/scripts/build_thrift.sh | 37 +++++++++++++++++ .../manylinux1/scripts/build_virtualenvs.sh | 41 +++++++++++++++++++ 12 files changed, 248 insertions(+), 36 deletions(-) create mode 100755 ci/travis_script_manylinux.sh create mode 100755 python/manylinux1/scripts/build_brotli.sh create mode 100755 python/manylinux1/scripts/build_ccache.sh create mode 100755 python/manylinux1/scripts/build_flatbuffers.sh create mode 100755 python/manylinux1/scripts/build_gtest.sh create mode 100755 python/manylinux1/scripts/build_snappy.sh create mode 100755 python/manylinux1/scripts/build_thrift.sh create mode 100755 python/manylinux1/scripts/build_virtualenvs.sh diff --git a/.travis.yml b/.travis.yml index d821b5accb973..e6941620c2c91 100644 --- a/.travis.yml +++ b/.travis.yml @@ -56,13 +56,9 @@ matrix: - $TRAVIS_BUILD_DIR/ci/travis_script_python.sh - language: cpp before_script: - - docker pull quay.io/xhochy/arrow_manylinux1_x86_64_base:ARROW-927 - script: | - pushd python/manylinux1 - git clone ../../ arrow - docker build -t arrow-base-x86_64 -f Dockerfile-x86_64 . - docker run --rm -v $PWD:/io arrow-base-x86_64 /io/build_arrow.sh - ls -l dist/ + - docker pull quay.io/xhochy/arrow_manylinux1_x86_64_base:latest + script: + - $TRAVIS_BUILD_DIR/ci/travis_script_manylinux.sh - language: java os: linux jdk: oraclejdk7 diff --git a/ci/travis_script_manylinux.sh b/ci/travis_script_manylinux.sh new file mode 100755 index 0000000000000..69feb685b5136 --- /dev/null +++ b/ci/travis_script_manylinux.sh @@ -0,0 +1,21 @@ +#!/usr/bin/env bash + +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + + +set -ex + +pushd python/manylinux1 +git clone ../../ arrow +docker build -t arrow-base-x86_64 -f Dockerfile-x86_64 . +docker run --rm -v $PWD:/io arrow-base-x86_64 /io/build_arrow.sh diff --git a/python/manylinux1/Dockerfile-x86_64 b/python/manylinux1/Dockerfile-x86_64 index 8f55ba7e1deed..08fecb0da9276 100644 --- a/python/manylinux1/Dockerfile-x86_64 +++ b/python/manylinux1/Dockerfile-x86_64 @@ -9,7 +9,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. See accompanying LICENSE file. 
-FROM quay.io/xhochy/arrow_manylinux1_x86_64_base:ARROW-927 +FROM quay.io/xhochy/arrow_manylinux1_x86_64_base:ARROW-947 ADD arrow /arrow WORKDIR /arrow/cpp diff --git a/python/manylinux1/Dockerfile-x86_64_base b/python/manylinux1/Dockerfile-x86_64_base index e38296d78de1c..2ae7e0ff4f98f 100644 --- a/python/manylinux1/Dockerfile-x86_64_base +++ b/python/manylinux1/Dockerfile-x86_64_base @@ -28,25 +28,34 @@ WORKDIR / RUN /opt/python/cp35-cp35m/bin/pip install cmake RUN ln -s /opt/python/cp35-cp35m/bin/cmake /usr/bin/cmake +ADD scripts/build_gtest.sh / +RUN /build_gtest.sh +ENV GTEST_HOME /googletest-release-1.7.0 + +ADD scripts/build_flatbuffers.sh / +RUN /build_flatbuffers.sh +ENV FLATBUFFERS_HOME /usr + +ADD scripts/build_thrift.sh / +RUN /build_thrift.sh +ENV THRIFT_HOME /usr + +ADD scripts/build_brotli.sh / +RUN /build_brotli.sh +ENV BROTLI_HOME /usr + +ADD scripts/build_snappy.sh / +RUN /build_snappy.sh +ENV SNAPPY_HOME /usr + +ADD scripts/build_ccache.sh / +RUN /build_ccache.sh + WORKDIR / RUN git clone https://github.com/matthew-brett/multibuild.git WORKDIR /multibuild RUN git checkout ffe59955ad8690c2f8bb74766cb7e9b0d0ee3963 - -WORKDIR / -RUN wget https://github.com/google/googletest/archive/release-1.7.0.tar.gz -O googletest-release-1.7.0.tar.gz -RUN tar xf googletest-release-1.7.0.tar.gz -WORKDIR /googletest-release-1.7.0 -RUN cmake -DCMAKE_CXX_FLAGS='-fPIC' -Dgtest_force_shared_crt=ON . -RUN make -j5 -ENV GTEST_HOME /googletest-release-1.7.0 - WORKDIR / -RUN wget https://github.com/google/flatbuffers/archive/v1.6.0.tar.gz -O flatbuffers-1.6.0.tar.gz -RUN tar xf flatbuffers-1.6.0.tar.gz -WORKDIR /flatbuffers-1.6.0 -RUN cmake "-DCMAKE_CXX_FLAGS=-fPIC" "-DCMAKE_INSTALL_PREFIX:PATH=/usr" "-DFLATBUFFERS_BUILD_TESTS=OFF" -RUN make -j5 -RUN make install -ENV FLATBUFFERS_HOME /usr +ADD scripts/build_virtualenvs.sh / +RUN /build_virtualenvs.sh diff --git a/python/manylinux1/build_arrow.sh b/python/manylinux1/build_arrow.sh index a11d3d41f49f7..e0727495cff4a 100755 --- a/python/manylinux1/build_arrow.sh +++ b/python/manylinux1/build_arrow.sh @@ -27,6 +27,9 @@ MANYLINUX_URL=https://nipy.bic.berkeley.edu/manylinux source /multibuild/manylinux_utils.sh +# Quit on failure +set -e + cd /arrow/python # PyArrow build configuration @@ -48,11 +51,6 @@ for PYTHON in ${PYTHON_VERSIONS}; do PIPI_IO="$PIP install -f $MANYLINUX_URL" PATH="$PATH:$(cpython_path $PYTHON)" - echo "=== (${PYTHON}) Installing build dependencies ===" - $PIPI_IO "numpy==1.9.0" - $PIPI_IO "cython==0.25.2" - $PIPI_IO "pandas==0.19.2" - echo "=== (${PYTHON}) Building Arrow C++ libraries ===" ARROW_BUILD_DIR=/arrow/cpp/build-PY${PYTHON} mkdir -p "${ARROW_BUILD_DIR}" @@ -77,14 +75,9 @@ for PYTHON in ${PYTHON_VERSIONS}; do auditwheel -v repair -L . 
dist/pyarrow-*.whl -w repaired_wheels/ echo "=== (${PYTHON}) Testing manylinux1 wheel ===" - # Fix version to keep build reproducible" - $PIPI_IO "virtualenv==15.1.0" - rm -rf venv - "$(cpython_path $PYTHON)/bin/virtualenv" -p ${PYTHON_INTERPRETER} --no-download venv - source ./venv/bin/activate + source /venv-test-${PYTHON}/bin/activate pip install repaired_wheels/*.whl - pip install pytest pandas - py.test venv/lib/*/site-packages/pyarrow + py.test /venv-test-${PYTHON}/lib/*/site-packages/pyarrow deactivate mv repaired_wheels/*.whl /io/dist diff --git a/python/manylinux1/scripts/build_brotli.sh b/python/manylinux1/scripts/build_brotli.sh new file mode 100755 index 0000000000000..4b4cbf17ca9bf --- /dev/null +++ b/python/manylinux1/scripts/build_brotli.sh @@ -0,0 +1,30 @@ +#!/bin/bash -ex +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + +export BROTLI_VERSION="0.6.0" +wget "https://github.com/google/brotli/archive/v${BROTLI_VERSION}.tar.gz" -O brotli-${BROTLI_VERSION}.tar.gz +tar xf brotli-${BROTLI_VERSION}.tar.gz +pushd brotli-${BROTLI_VERSION} +mkdir build +pushd build +cmake -DCMAKE_BUILD_TYPE=release \ + "-DCMAKE_CXX_FLAGS=-fPIC" \ + "-DCMAKE_C_FLAGS=-fPIC" \ + -DCMAKE_INSTALL_PREFIX=/usr \ + -DBUILD_SHARED_LIBS=OFF \ + .. +make -j5 +make install +popd +popd +rm -rf brotli-${BROTLI_VERSION}.tar.gz brotli-${BROTLI_VERSION} diff --git a/python/manylinux1/scripts/build_ccache.sh b/python/manylinux1/scripts/build_ccache.sh new file mode 100755 index 0000000000000..6ad5d29f83292 --- /dev/null +++ b/python/manylinux1/scripts/build_ccache.sh @@ -0,0 +1,21 @@ +#!/bin/bash -ex +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + +wget https://www.samba.org/ftp/ccache/ccache-3.3.4.tar.bz2 -O ccache-3.3.4.tar.bz2 +tar xf ccache-3.3.4.tar.bz2 +pushd ccache-3.3.4 +./configure --prefix=/usr +make -j5 +make install +popd +rm -rf ccache-3.3.4.tar.bz2 ccache-3.3.4 diff --git a/python/manylinux1/scripts/build_flatbuffers.sh b/python/manylinux1/scripts/build_flatbuffers.sh new file mode 100755 index 0000000000000..7703855b6efbf --- /dev/null +++ b/python/manylinux1/scripts/build_flatbuffers.sh @@ -0,0 +1,21 @@ +#!/bin/bash -ex +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + +wget https://github.com/google/flatbuffers/archive/v1.6.0.tar.gz -O flatbuffers-1.6.0.tar.gz +tar xf flatbuffers-1.6.0.tar.gz +pushd flatbuffers-1.6.0 +cmake "-DCMAKE_CXX_FLAGS=-fPIC" "-DCMAKE_INSTALL_PREFIX:PATH=/usr" "-DFLATBUFFERS_BUILD_TESTS=OFF" +make -j5 +make install +popd +rm -rf flatbuffers-1.6.0.tar.gz flatbuffers-1.6.0 diff --git a/python/manylinux1/scripts/build_gtest.sh b/python/manylinux1/scripts/build_gtest.sh new file mode 100755 index 0000000000000..3427bed091ed3 --- /dev/null +++ b/python/manylinux1/scripts/build_gtest.sh @@ -0,0 +1,21 @@ +#!/bin/bash -ex +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + +wget https://github.com/google/googletest/archive/release-1.7.0.tar.gz -O googletest-release-1.7.0.tar.gz +tar xf googletest-release-1.7.0.tar.gz +ls -l +pushd googletest-release-1.7.0 +cmake -DCMAKE_CXX_FLAGS='-fPIC' -Dgtest_force_shared_crt=ON . +make -j5 +popd +rm -rf googletest-release-1.7.0.tar.gz diff --git a/python/manylinux1/scripts/build_snappy.sh b/python/manylinux1/scripts/build_snappy.sh new file mode 100755 index 0000000000000..973b4ff7d8089 --- /dev/null +++ b/python/manylinux1/scripts/build_snappy.sh @@ -0,0 +1,22 @@ +#!/bin/bash -ex +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + +export SNAPPY_VERSION="1.1.3" +wget "https://github.com/google/snappy/releases/download/${SNAPPY_VERSION}/snappy-${SNAPPY_VERSION}.tar.gz" -O snappy-${SNAPPY_VERSION}.tar.gz +tar xf snappy-${SNAPPY_VERSION}.tar.gz +pushd snappy-${SNAPPY_VERSION} +./configure --with-pic "--prefix=/usr" CXXFLAGS='-DNDEBUG -O2' +make -j5 +make install +popd +rm -rf snappy-${SNAPPY_VERSION}.tar.gz snappy-${SNAPPY_VERSION} diff --git a/python/manylinux1/scripts/build_thrift.sh b/python/manylinux1/scripts/build_thrift.sh new file mode 100755 index 0000000000000..1db745855489f --- /dev/null +++ b/python/manylinux1/scripts/build_thrift.sh @@ -0,0 +1,37 @@ +#!/bin/bash -ex +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + +export THRIFT_VERSION=0.10.0 +wget http://archive.apache.org/dist/thrift/${THRIFT_VERSION}/thrift-${THRIFT_VERSION}.tar.gz +tar xf thrift-${THRIFT_VERSION}.tar.gz +pushd thrift-${THRIFT_VERSION} +mkdir build-tmp +pushd build-tmp +cmake -DCMAKE_BUILD_TYPE=release \ + "-DCMAKE_CXX_FLAGS=-fPIC" \ + "-DCMAKE_C_FLAGS=-fPIC" \ + "-DCMAKE_INSTALL_PREFIX=/usr" \ + "-DCMAKE_INSTALL_RPATH=/usr/lib" \ + "-DBUILD_SHARED_LIBS=OFF" \ + "-DBUILD_TESTING=OFF" \ + "-DWITH_QT4=OFF" \ + "-DWITH_C_GLIB=OFF" \ + "-DWITH_JAVA=OFF" \ + "-DWITH_PYTHON=OFF" \ + "-DWITH_CPP=ON" \ + "-DWITH_STATIC_LIB=ON" .. +make -j5 +make install +popd +popd +rm -rf thrift-${THRIFT_VERSION}.tar.gz thrift-${THRIFT_VERSION} diff --git a/python/manylinux1/scripts/build_virtualenvs.sh b/python/manylinux1/scripts/build_virtualenvs.sh new file mode 100755 index 0000000000000..ee8a82730281f --- /dev/null +++ b/python/manylinux1/scripts/build_virtualenvs.sh @@ -0,0 +1,41 @@ +#!/bin/bash -e +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. See accompanying LICENSE file. + +# Build upon the scripts in https://github.com/matthew-brett/manylinux-builds +# * Copyright (c) 2013-2016, Matt Terry and Matthew Brett (BSD 2-clause) + +PYTHON_VERSIONS="${PYTHON_VERSIONS:-2.7 3.4 3.5 3.6}" + +# Package index with only manylinux1 builds +MANYLINUX_URL=https://nipy.bic.berkeley.edu/manylinux + +source /multibuild/manylinux_utils.sh + +for PYTHON in ${PYTHON_VERSIONS}; do + PYTHON_INTERPRETER="$(cpython_path $PYTHON)/bin/python" + PIP="$(cpython_path $PYTHON)/bin/pip" + PIPI_IO="$PIP install -f $MANYLINUX_URL" + PATH="$PATH:$(cpython_path $PYTHON)" + + echo "=== (${PYTHON}) Installing build dependencies ===" + $PIPI_IO "numpy==1.9.0" + $PIPI_IO "cython==0.25.2" + $PIPI_IO "pandas==0.20.1" + $PIPI_IO "virtualenv==15.1.0" + + echo "=== (${PYTHON}) Preparing virtualenv for tests ===" + "$(cpython_path $PYTHON)/bin/virtualenv" -p ${PYTHON_INTERPRETER} --no-download /venv-test-${PYTHON} + source /venv-test-${PYTHON}/bin/activate + pip install pytest 'numpy==1.12.1' 'pandas==0.20.1' + deactivate +done From 20228a2becc22fbbf72b8e5e9b3c875ac835c0af Mon Sep 17 00:00:00 2001 From: "Uwe L. Korn" Date: Sat, 6 May 2017 17:59:18 -0400 Subject: [PATCH 0616/1644] ARROW-909: Link jemalloc statically if build as external project Author: Uwe L. Korn Closes #651 from xhochy/ARROW-909 and squashes the following commits: a3e7a44 [Uwe L. 
Korn] ARROW-909: Link jemalloc statically if build as external project --- cpp/CMakeLists.txt | 1 + 1 file changed, 1 insertion(+) diff --git a/cpp/CMakeLists.txt b/cpp/CMakeLists.txt index 5abe5f1436ea7..72e5ea90948b9 100644 --- a/cpp/CMakeLists.txt +++ b/cpp/CMakeLists.txt @@ -692,6 +692,7 @@ if (ARROW_JEMALLOC) find_package(jemalloc) if(NOT JEMALLOC_FOUND) + set(ARROW_JEMALLOC_USE_SHARED OFF) set(JEMALLOC_PREFIX "${CMAKE_CURRENT_BINARY_DIR}/jemalloc_ep-prefix/src/jemalloc_ep/dist/") set(JEMALLOC_HOME "${JEMALLOC_PREFIX}") set(JEMALLOC_INCLUDE_DIR "${JEMALLOC_PREFIX}/include") From c48f6493fa7301260fce709eb16ce5382bc4673e Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Sun, 7 May 2017 10:34:41 -0400 Subject: [PATCH 0617/1644] ARROW-963: [GLib] Add equal Author: Kouhei Sutou Closes #654 from kou/glib-equal and squashes the following commits: 63f071d [Kouhei Sutou] [GLib] Add equal --- c_glib/arrow-glib/array.cpp | 66 +++++++++++++++++++++++++++++ c_glib/arrow-glib/array.h | 10 +++++ c_glib/arrow-glib/buffer.cpp | 39 +++++++++++++++++ c_glib/arrow-glib/buffer.h | 5 +++ c_glib/arrow-glib/chunked-array.cpp | 20 +++++++++ c_glib/arrow-glib/chunked-array.h | 3 ++ c_glib/arrow-glib/column.cpp | 18 ++++++++ c_glib/arrow-glib/column.h | 3 ++ c_glib/arrow-glib/data-type.cpp | 5 ++- c_glib/arrow-glib/field.cpp | 5 ++- c_glib/arrow-glib/record-batch.cpp | 20 +++++++++ c_glib/arrow-glib/record-batch.h | 3 ++ c_glib/arrow-glib/schema.cpp | 18 ++++++++ c_glib/arrow-glib/schema.h | 2 + c_glib/arrow-glib/table.cpp | 18 ++++++++ c_glib/arrow-glib/table.h | 3 ++ c_glib/arrow-glib/tensor.cpp | 18 ++++++++ c_glib/arrow-glib/tensor.h | 2 + c_glib/test/test-array.rb | 23 ++++++++++ c_glib/test/test-buffer.rb | 13 ++++++ c_glib/test/test-chunked-array.rb | 13 ++++++ c_glib/test/test-column.rb | 13 ++++++ c_glib/test/test-field.rb | 5 +++ c_glib/test/test-record-batch.rb | 15 +++++++ c_glib/test/test-schema.rb | 11 +++++ c_glib/test/test-table.rb | 14 ++++++ c_glib/test/test-tensor.rb | 13 ++++++ 27 files changed, 374 insertions(+), 4 deletions(-) diff --git a/c_glib/arrow-glib/array.cpp b/c_glib/arrow-glib/array.cpp index 3ca860d2ff6d3..8a78984349c62 100644 --- a/c_glib/arrow-glib/array.cpp +++ b/c_glib/arrow-glib/array.cpp @@ -188,6 +188,72 @@ garrow_array_class_init(GArrowArrayClass *klass) g_object_class_install_property(gobject_class, PROP_ARRAY, spec); } +/** + * garrow_array_equal: + * @array: A #GArrowArray. + * @other_array: A #GArrowArray to be compared. + * + * Returns: %TRUE if both of them have the same data, %FALSE + * otherwise. + * + * Since: 0.4.0 + */ +gboolean +garrow_array_equal(GArrowArray *array, GArrowArray *other_array) +{ + const auto arrow_array = garrow_array_get_raw(array); + const auto arrow_other_array = garrow_array_get_raw(other_array); + return arrow_array->Equals(arrow_other_array); +} + +/** + * garrow_array_equal_approx: + * @array: A #GArrowArray. + * @other_array: A #GArrowArray to be compared. + * + * Returns: %TRUE if both of them have the approx same data, %FALSE + * otherwise. + * + * Since: 0.4.0 + */ +gboolean +garrow_array_equal_approx(GArrowArray *array, GArrowArray *other_array) +{ + const auto arrow_array = garrow_array_get_raw(array); + const auto arrow_other_array = garrow_array_get_raw(other_array); + return arrow_array->ApproxEquals(arrow_other_array); +} + +/** + * garrow_array_equal_range: + * @array: A #GArrowArray. + * @start_index: The start index of @array to be used. + * @other_array: A #GArrowArray to be compared. 
+ * @other_start_index: The start index of @other_array to be used. + * @end_index: The end index of @array to be used. The end index of + * @other_array is "@other_start_index + (@end_index - + * @start_index)". + * + * Returns: %TRUE if both of them have the same data in the range, + * %FALSE otherwise. + * + * Since: 0.4.0 + */ +gboolean +garrow_array_equal_range(GArrowArray *array, + gint64 start_index, + GArrowArray *other_array, + gint64 other_start_index, + gint64 end_index) +{ + const auto arrow_array = garrow_array_get_raw(array); + const auto arrow_other_array = garrow_array_get_raw(other_array); + return arrow_array->RangeEquals(*arrow_other_array, + start_index, + end_index, + other_start_index); +} + /** * garrow_array_is_null: * @array: A #GArrowArray. diff --git a/c_glib/arrow-glib/array.h b/c_glib/arrow-glib/array.h index 9bb502e4044a9..f750ee10f8cbe 100644 --- a/c_glib/arrow-glib/array.h +++ b/c_glib/arrow-glib/array.h @@ -58,6 +58,16 @@ struct _GArrowArrayClass GType garrow_array_get_type (void) G_GNUC_CONST; +gboolean garrow_array_equal (GArrowArray *array, + GArrowArray *other_array); +gboolean garrow_array_equal_approx(GArrowArray *array, + GArrowArray *other_array); +gboolean garrow_array_equal_range (GArrowArray *array, + gint64 start_index, + GArrowArray *other_array, + gint64 other_start_index, + gint64 end_index); + gboolean garrow_array_is_null (GArrowArray *array, gint64 i); gint64 garrow_array_get_length (GArrowArray *array); diff --git a/c_glib/arrow-glib/buffer.cpp b/c_glib/arrow-glib/buffer.cpp index 4373ef1c83447..0970128ae3862 100644 --- a/c_glib/arrow-glib/buffer.cpp +++ b/c_glib/arrow-glib/buffer.cpp @@ -144,6 +144,45 @@ garrow_buffer_new(const guint8 *data, gint64 size) } +/** + * garrow_buffer_equal: + * @buffer: A #GArrowBuffer. + * @other_buffer: A #GArrowBuffer to be compared. + * + * Returns: %TRUE if both of them have the same data, %FALSE + * otherwise. + * + * Since: 0.4.0 + */ +gboolean +garrow_buffer_equal(GArrowBuffer *buffer, GArrowBuffer *other_buffer) +{ + const auto arrow_buffer = garrow_buffer_get_raw(buffer); + const auto arrow_other_buffer = garrow_buffer_get_raw(other_buffer); + return arrow_buffer->Equals(*arrow_other_buffer); +} + +/** + * garrow_buffer_equal_n_bytes: + * @buffer: A #GArrowBuffer. + * @other_buffer: A #GArrowBuffer to be compared. + * @n_bytes: The number of first bytes to be compared. + * + * Returns: %TRUE if both of them have the same data in the first + * `n_bytes`, %FALSE otherwise. + * + * Since: 0.4.0 + */ +gboolean +garrow_buffer_equal_n_bytes(GArrowBuffer *buffer, + GArrowBuffer *other_buffer, + gint64 n_bytes) +{ + const auto arrow_buffer = garrow_buffer_get_raw(buffer); + const auto arrow_other_buffer = garrow_buffer_get_raw(other_buffer); + return arrow_buffer->Equals(*arrow_other_buffer, n_bytes); +} + /** * garrow_buffer_is_mutable: * @buffer: A #GArrowBuffer. 
diff --git a/c_glib/arrow-glib/buffer.h b/c_glib/arrow-glib/buffer.h index 22a5e9bb2549a..b3f3a2cdc5e9b 100644 --- a/c_glib/arrow-glib/buffer.h +++ b/c_glib/arrow-glib/buffer.h @@ -59,6 +59,11 @@ GType garrow_buffer_get_type (void) G_GNUC_CONST; GArrowBuffer *garrow_buffer_new (const guint8 *data, gint64 size); +gboolean garrow_buffer_equal (GArrowBuffer *buffer, + GArrowBuffer *other_buffer); +gboolean garrow_buffer_equal_n_bytes(GArrowBuffer *buffer, + GArrowBuffer *other_buffer, + gint64 n_bytes); gboolean garrow_buffer_is_mutable (GArrowBuffer *buffer); gint64 garrow_buffer_get_capacity (GArrowBuffer *buffer); GBytes *garrow_buffer_get_data (GArrowBuffer *buffer); diff --git a/c_glib/arrow-glib/chunked-array.cpp b/c_glib/arrow-glib/chunked-array.cpp index e732ece73c7f9..62d666fbcaaba 100644 --- a/c_glib/arrow-glib/chunked-array.cpp +++ b/c_glib/arrow-glib/chunked-array.cpp @@ -143,6 +143,26 @@ garrow_chunked_array_new(GList *chunks) return garrow_chunked_array_new_raw(&arrow_chunked_array); } +/** + * garrow_chunked_array_equal: + * @chunked_array: A #GArrowChunkedArray. + * @other_chunked_array: A #GArrowChunkedArray to be compared. + * + * Returns: %TRUE if both of them have the same data, %FALSE + * otherwise. + * + * Since: 0.4.0 + */ +gboolean +garrow_chunked_array_equal(GArrowChunkedArray *chunked_array, + GArrowChunkedArray *other_chunked_array) +{ + const auto arrow_chunked_array = garrow_chunked_array_get_raw(chunked_array); + const auto arrow_other_chunked_array = + garrow_chunked_array_get_raw(other_chunked_array); + return arrow_chunked_array->Equals(arrow_other_chunked_array); +} + /** * garrow_chunked_array_get_length: * @chunked_array: A #GArrowChunkedArray. diff --git a/c_glib/arrow-glib/chunked-array.h b/c_glib/arrow-glib/chunked-array.h index 338930b9bd84a..c5f986a631835 100644 --- a/c_glib/arrow-glib/chunked-array.h +++ b/c_glib/arrow-glib/chunked-array.h @@ -67,6 +67,9 @@ GType garrow_chunked_array_get_type(void) G_GNUC_CONST; GArrowChunkedArray *garrow_chunked_array_new(GList *chunks); +gboolean garrow_chunked_array_equal(GArrowChunkedArray *chunked_array, + GArrowChunkedArray *other_chunked_array); + guint64 garrow_chunked_array_get_length (GArrowChunkedArray *chunked_array); guint64 garrow_chunked_array_get_n_nulls(GArrowChunkedArray *chunked_array); guint garrow_chunked_array_get_n_chunks (GArrowChunkedArray *chunked_array); diff --git a/c_glib/arrow-glib/column.cpp b/c_glib/arrow-glib/column.cpp index 94df640d6b2b5..a7222b17650bb 100644 --- a/c_glib/arrow-glib/column.cpp +++ b/c_glib/arrow-glib/column.cpp @@ -160,6 +160,24 @@ garrow_column_new_chunked_array(GArrowField *field, return garrow_column_new_raw(&arrow_column); } +/** + * garrow_column_equal: + * @column: A #GArrowColumn. + * @other_column: A #GArrowColumn to be compared. + * + * Returns: %TRUE if both of them have the same data, %FALSE + * otherwise. + * + * Since: 0.4.0 + */ +gboolean +garrow_column_equal(GArrowColumn *column, GArrowColumn *other_column) +{ + const auto arrow_column = garrow_column_get_raw(column); + const auto arrow_other_column = garrow_column_get_raw(other_column); + return arrow_column->Equals(arrow_other_column); +} + /** * garrow_column_get_length: * @column: A #GArrowColumn. 
diff --git a/c_glib/arrow-glib/column.h b/c_glib/arrow-glib/column.h index fba3c26b2f08f..b649c5f1e50be 100644 --- a/c_glib/arrow-glib/column.h +++ b/c_glib/arrow-glib/column.h @@ -72,6 +72,9 @@ GArrowColumn *garrow_column_new_array(GArrowField *field, GArrowColumn *garrow_column_new_chunked_array(GArrowField *field, GArrowChunkedArray *chunked_array); +gboolean garrow_column_equal (GArrowColumn *column, + GArrowColumn *other_column); + guint64 garrow_column_get_length (GArrowColumn *column); guint64 garrow_column_get_n_nulls (GArrowColumn *column); GArrowField *garrow_column_get_field (GArrowColumn *column); diff --git a/c_glib/arrow-glib/data-type.cpp b/c_glib/arrow-glib/data-type.cpp index c3c7fdb0f7c21..9ce8c16e914e3 100644 --- a/c_glib/arrow-glib/data-type.cpp +++ b/c_glib/arrow-glib/data-type.cpp @@ -164,9 +164,10 @@ garrow_data_type_class_init(GArrowDataTypeClass *klass) /** * garrow_data_type_equal: * @data_type: A #GArrowDataType. - * @other_data_type: A #GArrowDataType. + * @other_data_type: A #GArrowDataType to be compared. * - * Returns: Whether they are equal or not. + * Returns: %TRUE if both of them have the same data, %FALSE + * otherwise. */ gboolean garrow_data_type_equal(GArrowDataType *data_type, diff --git a/c_glib/arrow-glib/field.cpp b/c_glib/arrow-glib/field.cpp index 5fd0c4d221bba..09c7ca33e6a13 100644 --- a/c_glib/arrow-glib/field.cpp +++ b/c_glib/arrow-glib/field.cpp @@ -204,9 +204,10 @@ garrow_field_is_nullable(GArrowField *field) /** * garrow_field_equal: * @field: A #GArrowField. - * @other_field: A #GArrowField. + * @other_field: A #GArrowField to be compared. * - * Returns: Whether they are equal or not. + * Returns: %TRUE if both of them have the same data, %FALSE + * otherwise. */ gboolean garrow_field_equal(GArrowField *field, diff --git a/c_glib/arrow-glib/record-batch.cpp b/c_glib/arrow-glib/record-batch.cpp index 8ac1791feef8c..3eed1a097c9e7 100644 --- a/c_glib/arrow-glib/record-batch.cpp +++ b/c_glib/arrow-glib/record-batch.cpp @@ -153,6 +153,26 @@ garrow_record_batch_new(GArrowSchema *schema, return garrow_record_batch_new_raw(&arrow_record_batch); } +/** + * garrow_record_batch_equal: + * @record_batch: A #GArrowRecordBatch. + * @other_record_batch: A #GArrowRecordBatch to be compared. + * + * Returns: %TRUE if both of them have the same data, %FALSE + * otherwise. + * + * Since: 0.4.0 + */ +gboolean +garrow_record_batch_equal(GArrowRecordBatch *record_batch, + GArrowRecordBatch *other_record_batch) +{ + const auto arrow_record_batch = garrow_record_batch_get_raw(record_batch); + const auto arrow_other_record_batch = + garrow_record_batch_get_raw(other_record_batch); + return arrow_record_batch->Equals(*arrow_other_record_batch); +} + /** * garrow_record_batch_get_schema: * @record_batch: A #GArrowRecordBatch. 
diff --git a/c_glib/arrow-glib/record-batch.h b/c_glib/arrow-glib/record-batch.h index 92eee4d9af973..61e8f3d42b1c8 100644 --- a/c_glib/arrow-glib/record-batch.h +++ b/c_glib/arrow-glib/record-batch.h @@ -70,6 +70,9 @@ GArrowRecordBatch *garrow_record_batch_new(GArrowSchema *schema, guint32 n_rows, GList *columns); +gboolean garrow_record_batch_equal(GArrowRecordBatch *record_batch, + GArrowRecordBatch *other_record_batch); + GArrowSchema *garrow_record_batch_get_schema (GArrowRecordBatch *record_batch); GArrowArray *garrow_record_batch_get_column (GArrowRecordBatch *record_batch, guint i); diff --git a/c_glib/arrow-glib/schema.cpp b/c_glib/arrow-glib/schema.cpp index 4d5ae5af4fb4a..be3ea4bbb8c3e 100644 --- a/c_glib/arrow-glib/schema.cpp +++ b/c_glib/arrow-glib/schema.cpp @@ -142,6 +142,24 @@ garrow_schema_new(GList *fields) return garrow_schema_new_raw(&arrow_schema); } +/** + * garrow_schema_equal: + * @schema: A #GArrowSchema. + * @other_schema: A #GArrowSchema to be compared. + * + * Returns: %TRUE if both of them have the same data, %FALSE + * otherwise. + * + * Since: 0.4.0 + */ +gboolean +garrow_schema_equal(GArrowSchema *schema, GArrowSchema *other_schema) +{ + const auto arrow_schema = garrow_schema_get_raw(schema); + const auto arrow_other_schema = garrow_schema_get_raw(other_schema); + return arrow_schema->Equals(*arrow_other_schema); +} + /** * garrow_schema_get_field: * @schema: A #GArrowSchema. diff --git a/c_glib/arrow-glib/schema.h b/c_glib/arrow-glib/schema.h index 7615634021bc3..483d55e562d31 100644 --- a/c_glib/arrow-glib/schema.h +++ b/c_glib/arrow-glib/schema.h @@ -67,6 +67,8 @@ GType garrow_schema_get_type (void) G_GNUC_CONST; GArrowSchema *garrow_schema_new (GList *fields); +gboolean garrow_schema_equal (GArrowSchema *schema, + GArrowSchema *other_schema); GArrowField *garrow_schema_get_field (GArrowSchema *schema, guint i); GArrowField *garrow_schema_get_field_by_name(GArrowSchema *schema, diff --git a/c_glib/arrow-glib/table.cpp b/c_glib/arrow-glib/table.cpp index 2aba21b564243..779f2ef62b8f5 100644 --- a/c_glib/arrow-glib/table.cpp +++ b/c_glib/arrow-glib/table.cpp @@ -148,6 +148,24 @@ garrow_table_new(GArrowSchema *schema, return garrow_table_new_raw(&arrow_table); } +/** + * garrow_table_equal: + * @table: A #GArrowTable. + * @other_table: A #GArrowTable to be compared. + * + * Returns: %TRUE if both of them have the same data, %FALSE + * otherwise. + * + * Since: 0.4.0 + */ +gboolean +garrow_table_equal(GArrowTable *table, GArrowTable *other_table) +{ + const auto arrow_table = garrow_table_get_raw(table); + const auto arrow_other_table = garrow_table_get_raw(other_table); + return arrow_table->Equals(*arrow_other_table); +} + /** * garrow_table_get_schema: * @table: A #GArrowTable. 
diff --git a/c_glib/arrow-glib/table.h b/c_glib/arrow-glib/table.h index 9ae0cce1b7d9d..9e21669cd11da 100644 --- a/c_glib/arrow-glib/table.h +++ b/c_glib/arrow-glib/table.h @@ -69,6 +69,9 @@ GType garrow_table_get_type (void) G_GNUC_CONST; GArrowTable *garrow_table_new (GArrowSchema *schema, GList *columns); +gboolean garrow_table_equal (GArrowTable *table, + GArrowTable *other_table); + GArrowSchema *garrow_table_get_schema (GArrowTable *table); GArrowColumn *garrow_table_get_column (GArrowTable *table, guint i); diff --git a/c_glib/arrow-glib/tensor.cpp b/c_glib/arrow-glib/tensor.cpp index 27af7532f3451..89e971c726ecb 100644 --- a/c_glib/arrow-glib/tensor.cpp +++ b/c_glib/arrow-glib/tensor.cpp @@ -170,6 +170,24 @@ garrow_tensor_new(GArrowDataType *data_type, return tensor; } +/** + * garrow_tensor_equal: + * @tensor: A #GArrowTensor. + * @other_tensor: A #GArrowTensor to be compared. + * + * Returns: %TRUE if both of them have the same data, %FALSE + * otherwise. + * + * Since: 0.4.0 + */ +gboolean +garrow_tensor_equal(GArrowTensor *tensor, GArrowTensor *other_tensor) +{ + const auto arrow_tensor = garrow_tensor_get_raw(tensor); + const auto arrow_other_tensor = garrow_tensor_get_raw(other_tensor); + return arrow_tensor->Equals(*arrow_other_tensor); +} + /** * garrow_tensor_get_value_data_type: * @tensor: A #GArrowTensor. diff --git a/c_glib/arrow-glib/tensor.h b/c_glib/arrow-glib/tensor.h index 71c6b4e9031dd..6529282f5f34b 100644 --- a/c_glib/arrow-glib/tensor.h +++ b/c_glib/arrow-glib/tensor.h @@ -66,6 +66,8 @@ GArrowTensor *garrow_tensor_new (GArrowDataType *data_type, gsize n_strides, gchar **dimention_names, gsize n_dimention_names); +gboolean garrow_tensor_equal (GArrowTensor *tensor, + GArrowTensor *other_tensor); GArrowDataType *garrow_tensor_get_value_data_type(GArrowTensor *tensor); GArrowType garrow_tensor_get_value_type (GArrowTensor *tensor); GArrowBuffer *garrow_tensor_get_buffer (GArrowTensor *tensor); diff --git a/c_glib/test/test-array.rb b/c_glib/test/test-array.rb index a2a2a1e003862..ca02fa283b014 100644 --- a/c_glib/test/test-array.rb +++ b/c_glib/test/test-array.rb @@ -16,6 +16,29 @@ # under the License. class TestArray < Test::Unit::TestCase + include Helper::Buildable + + def test_equal + assert_equal(build_boolean_array([true, false]), + build_boolean_array([true, false])) + end + + def test_equal_approx + array1 = build_double_array([1.1, 2.2 + Float::EPSILON * 10]) + array2 = build_double_array([1.1, 2.2]) + assert do + array1.equal_approx(array2) + end + end + + def test_equal_range + array1 = build_int32_array([1, 2, 3, 4, 5]) + array2 = build_int32_array([-2, -1, 0, 1, 2, 3, 4, 999]) + assert do + array1.equal_range(1, array2, 4, 3) + end + end + def test_is_null builder = Arrow::BooleanArrayBuilder.new builder.append_null diff --git a/c_glib/test/test-buffer.rb b/c_glib/test/test-buffer.rb index 9f76a805f7577..39ae631a0f68d 100644 --- a/c_glib/test/test-buffer.rb +++ b/c_glib/test/test-buffer.rb @@ -23,6 +23,19 @@ def setup @buffer = Arrow::Buffer.new(@data) end + def test_equal + assert_equal(@buffer, + Arrow::Buffer.new(@data.dup)) + end + + def test_equal_n_bytes + buffer1 = Arrow::Buffer.new("Hello!") + buffer2 = Arrow::Buffer.new("Hello World!") + assert do + buffer1.equal_n_bytes(buffer2, 5) + end + end + def test_mutable? assert do not @buffer.mutable? 
diff --git a/c_glib/test/test-chunked-array.rb b/c_glib/test/test-chunked-array.rb index 167d5d1033e42..cde7a8b0c61f1 100644 --- a/c_glib/test/test-chunked-array.rb +++ b/c_glib/test/test-chunked-array.rb @@ -18,6 +18,19 @@ class TestChunkedArray < Test::Unit::TestCase include Helper::Buildable + def test_equal + chunks1 = [ + build_boolean_array([true, false]), + build_boolean_array([true]), + ] + chunks2 = [ + build_boolean_array([true]), + build_boolean_array([false, true]), + ] + assert_equal(Arrow::ChunkedArray.new(chunks1), + Arrow::ChunkedArray.new(chunks2)) + end + def test_length chunks = [ build_boolean_array([true, false]), diff --git a/c_glib/test/test-column.rb b/c_glib/test/test-column.rb index ec75194edb830..96e02b60319fd 100644 --- a/c_glib/test/test-column.rb +++ b/c_glib/test/test-column.rb @@ -38,6 +38,19 @@ def test_chunked_array end end + def test_equal + field1 = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + array1 = build_boolean_array([true, false]) + field2 = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) + chunks = [ + build_boolean_array([true]), + build_boolean_array([false]), + ] + array2 = Arrow::ChunkedArray.new(chunks) + assert_equal(Arrow::Column.new(field1, array1), + Arrow::Column.new(field2, array2)) + end + def test_length field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) array = build_boolean_array([true, false]) diff --git a/c_glib/test/test-field.rb b/c_glib/test/test-field.rb index a20802c2ac653..1b9c46e8cd037 100644 --- a/c_glib/test/test-field.rb +++ b/c_glib/test/test-field.rb @@ -16,6 +16,11 @@ # under the License. class TestField < Test::Unit::TestCase + def test_equal + assert_equal(Arrow::Field.new("enabled", Arrow::BooleanDataType.new), + Arrow::Field.new("enabled", Arrow::BooleanDataType.new)) + end + def test_name field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) assert_equal("enabled", field.name) diff --git a/c_glib/test/test-record-batch.rb b/c_glib/test/test-record-batch.rb index 941ff35060154..048f6deb7245c 100644 --- a/c_glib/test/test-record-batch.rb +++ b/c_glib/test/test-record-batch.rb @@ -46,6 +46,21 @@ def setup @record_batch = Arrow::RecordBatch.new(schema, 5, columns) end + def test_equal + fields = [ + Arrow::Field.new("visible", Arrow::BooleanDataType.new), + Arrow::Field.new("valid", Arrow::BooleanDataType.new), + ] + schema = Arrow::Schema.new(fields) + columns = [ + build_boolean_array([true, false, true, false, true, false]), + build_boolean_array([false, true, false, true, false]), + ] + other_record_batch = Arrow::RecordBatch.new(schema, 5, columns) + assert_equal(@record_batch, + other_record_batch) + end + def test_schema assert_equal(["visible", "valid"], @record_batch.schema.fields.collect(&:name)) diff --git a/c_glib/test/test-schema.rb b/c_glib/test/test-schema.rb index c9cbb756944bb..4c09ecb40f51e 100644 --- a/c_glib/test/test-schema.rb +++ b/c_glib/test/test-schema.rb @@ -16,6 +16,17 @@ # under the License. 
class TestSchema < Test::Unit::TestCase + def test_equal + fields1 = [ + Arrow::Field.new("enabled", Arrow::BooleanDataType.new), + ] + fields2 = [ + Arrow::Field.new("enabled", Arrow::BooleanDataType.new), + ] + assert_equal(Arrow::Schema.new(fields1), + Arrow::Schema.new(fields2)) + end + def test_field field = Arrow::Field.new("enabled", Arrow::BooleanDataType.new) schema = Arrow::Schema.new([field]) diff --git a/c_glib/test/test-table.rb b/c_glib/test/test-table.rb index da6871ec1d090..08dd34861e51a 100644 --- a/c_glib/test/test-table.rb +++ b/c_glib/test/test-table.rb @@ -66,6 +66,20 @@ def setup @table = Arrow::Table.new(schema, columns) end + def test_equal + fields = [ + Arrow::Field.new("visible", Arrow::BooleanDataType.new), + Arrow::Field.new("valid", Arrow::BooleanDataType.new), + ] + schema = Arrow::Schema.new(fields) + columns = [ + Arrow::Column.new(fields[0], build_boolean_array([true])), + Arrow::Column.new(fields[1], build_boolean_array([false])), + ] + other_table = Arrow::Table.new(schema, columns) + assert_equal(@table, other_table) + end + def test_schema assert_equal(["visible", "valid"], @table.schema.fields.collect(&:name)) diff --git a/c_glib/test/test-tensor.rb b/c_glib/test/test-tensor.rb index 225857b52da98..780c9f179e18d 100644 --- a/c_glib/test/test-tensor.rb +++ b/c_glib/test/test-tensor.rb @@ -40,6 +40,19 @@ def setup names) end + def test_equal + data = Arrow::Buffer.new(@raw_data.pack("c*")) + strides = [] + names = ["a", "b", "c"] + other_tensor = Arrow::Tensor.new(Arrow::Int8DataType.new, + data, + @shape, + strides, + names) + assert_equal(@tensor, + other_tensor) + end + def test_value_data_type assert_equal(Arrow::Int8DataType, @tensor.value_data_type.class) end From d7a2a1e18457acb8a18cfcb7fbb3c3ba41543d4a Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Sun, 7 May 2017 17:48:18 +0200 Subject: [PATCH 0618/1644] ARROW-958: [Python] Fix conda source build instructions Author: Wes McKinney Closes #653 from wesm/ARROW-958 and squashes the following commits: 88c3c1d [Wes McKinney] Fix conda build instructions --- python/doc/source/development.rst | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/python/doc/source/development.rst b/python/doc/source/development.rst index 01add1142642a..440c1c459eed5 100644 --- a/python/doc/source/development.rst +++ b/python/doc/source/development.rst @@ -93,8 +93,11 @@ about our build toolchain: .. code-block:: shell export ARROW_BUILD_TYPE=release + export ARROW_BUILD_TOOLCHAIN=$CONDA_PREFIX export PARQUET_BUILD_TOOLCHAIN=$CONDA_PREFIX + export ARROW_HOME=$CONDA_PREFIX + export PARQUET_HOME=$CONDA_PREFIX Now build and install the Arrow C++ libraries: @@ -104,7 +107,7 @@ Now build and install the Arrow C++ libraries: pushd arrow/cpp/build cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \ - -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX \ + -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \ -DARROW_PYTHON=on \ -DARROW_BUILD_TESTS=OFF \ .. 
@@ -121,7 +124,7 @@ toolchain: pushd parquet-cpp/build cmake -DCMAKE_BUILD_TYPE=$ARROW_BUILD_TYPE \ - -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX \ + -DCMAKE_INSTALL_PREFIX=$PARQUET_HOME \ -DPARQUET_BUILD_BENCHMARKS=off \ -DPARQUET_BUILD_EXECUTABLES=off \ -DPARQUET_ZLIB_VENDORED=off \ From cb5e7b6fa7d75e14e163ce43cb333b02e9fe1c03 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 8 May 2017 00:49:37 -0400 Subject: [PATCH 0619/1644] ARROW-446: [Python] Expand Sphinx documentation for 0.3 I am going to finish the data model section and revamp the Parquet section, so we can get this pushed out with the release announcement tomorrow. We should continue to add a lot of new documentation over the coming weeks Author: Wes McKinney Closes #656 from wesm/ARROW-446 and squashes the following commits: b92c6d2 [Wes McKinney] Make pass over Parquet docs a46f846 [Wes McKinney] Make a pass over Parquet documentation 066d0b9 [Wes McKinney] Finish first cut at data model section 4f510fb [Wes McKinney] Install IPython before building docs 4885222 [Wes McKinney] Start on a data model section 1d512e9 [Wes McKinney] Add barebones IPC section 0f800d8 [Wes McKinney] Add section on OSFile, MemoryMappedFile aabf5b2 [Wes McKinney] Add draft about memory/io 5968847 [Wes McKinney] More on Memory/IO section --- ci/travis_script_python.sh | 2 +- python/doc/requirements.txt | 2 + python/doc/source/api.rst | 19 +- python/doc/source/conf.py | 14 +- python/doc/source/data.rst | 316 ++++++++++++++++++++++++++++++ python/doc/source/filesystems.rst | 8 +- python/doc/source/index.rst | 5 +- python/doc/source/ipc.rst | 136 +++++++++++++ python/doc/source/jemalloc.rst | 9 +- python/doc/source/memory.rst | 235 ++++++++++++++++++++++ python/doc/source/pandas.rst | 36 ++-- python/doc/source/parquet.rst | 243 ++++++++++++++++++----- python/pyarrow/_io.pyx | 1 + 13 files changed, 936 insertions(+), 90 deletions(-) create mode 100644 python/doc/source/data.rst create mode 100644 python/doc/source/ipc.rst create mode 100644 python/doc/source/memory.rst diff --git a/ci/travis_script_python.sh b/ci/travis_script_python.sh index 20b0f2aadb900..ce5f7ec506b73 100755 --- a/ci/travis_script_python.sh +++ b/ci/travis_script_python.sh @@ -117,7 +117,7 @@ python_version_tests() { # Build documentation once if [[ "$PYTHON_VERSION" == "3.6" ]] then - pip install -r doc/requirements.txt + conda install -y -q --file=doc/requirements.txt python setup.py build_sphinx -s doc/source fi } diff --git a/python/doc/requirements.txt b/python/doc/requirements.txt index ce0793c31de26..f3c3414a4be9a 100644 --- a/python/doc/requirements.txt +++ b/python/doc/requirements.txt @@ -1,3 +1,5 @@ +ipython +matplotlib numpydoc sphinx sphinx_rtd_theme diff --git a/python/doc/source/api.rst b/python/doc/source/api.rst index 08a06948a3fba..a8dd8c5e110ac 100644 --- a/python/doc/source/api.rst +++ b/python/doc/source/api.rst @@ -22,7 +22,7 @@ API Reference ************* -.. _api.functions: +.. _api.types: Type and Schema Factory Functions --------------------------------- @@ -58,6 +58,8 @@ Type and Schema Factory Functions schema from_numpy_dtype +.. _api.value: + Scalar Value Types ------------------ @@ -88,6 +90,7 @@ Scalar Value Types TimestampValue DecimalValue +.. _api.array: Array Types and Constructors ---------------------------- @@ -122,6 +125,8 @@ Array Types and Constructors DecimalArray ListArray +.. _api.table: + Tables and Record Batches ------------------------- @@ -134,6 +139,8 @@ Tables and Record Batches Table get_record_batch_size +.. 
_api.tensor: + Tensor type and Functions ------------------------- @@ -145,6 +152,8 @@ Tensor type and Functions get_tensor_size read_tensor +.. _api.io: + Input / Output and Shared Memory -------------------------------- @@ -160,6 +169,8 @@ Input / Output and Shared Memory create_memory_map PythonFile +.. _api.ipc: + Interprocess Communication and Messaging ---------------------------------------- @@ -171,6 +182,8 @@ Interprocess Communication and Messaging StreamReader StreamWriter +.. _api.memory_pool: + Memory Pools ------------ @@ -183,6 +196,8 @@ Memory Pools total_allocated_bytes set_memory_pool +.. _api.type_classes: + Type Classes ------------ @@ -201,6 +216,8 @@ Type Classes .. currentmodule:: pyarrow.parquet +.. _api.parquet: + Apache Parquet -------------- diff --git a/python/doc/source/conf.py b/python/doc/source/conf.py index a9262bf7db3dd..7f98979e88ff8 100644 --- a/python/doc/source/conf.py +++ b/python/doc/source/conf.py @@ -25,19 +25,11 @@ # add these directories to sys.path here. If the directory is relative to the # documentation root, use os.path.abspath to make it absolute, like shown here. # -import inspect import os import sys import sphinx_rtd_theme -on_rtd = os.environ.get('READTHEDOCS') == 'True' - -if not on_rtd: - # Hack: On RTD we use the pyarrow package from conda-forge as we cannot - # build pyarrow there. - sys.path.insert(0, os.path.abspath('..')) - sys.path.extend([ os.path.join(os.path.dirname(__file__), '..', '../..') @@ -60,6 +52,8 @@ 'sphinx.ext.mathjax', 'sphinx.ext.viewcode', 'sphinx.ext.napoleon', + 'IPython.sphinxext.ipython_directive', + 'IPython.sphinxext.ipython_console_highlighting' ] # numpydoc configuration @@ -86,7 +80,7 @@ # General information about the project. project = u'pyarrow' -copyright = u'2016 Apache Software Foundation' +copyright = u'2016-2017 Apache Software Foundation' author = u'Apache Software Foundation' # The version info for the project you're documenting, acts as replacement for @@ -156,7 +150,7 @@ # The theme to use for HTML and HTML Help pages. See the documentation for # a list of builtin themes. # -html_theme = 'sphinx_rtd_theme' +html_theme = 'sphinxdoc' # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the diff --git a/python/doc/source/data.rst b/python/doc/source/data.rst new file mode 100644 index 0000000000000..04e74ae64d437 --- /dev/null +++ b/python/doc/source/data.rst @@ -0,0 +1,316 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +.. currentmodule:: pyarrow +.. 
_data:
+
+In-Memory Data Model
+====================
+
+Apache Arrow defines columnar array data structures by composing type metadata
+with memory buffers, like the ones explained in the documentation on
+:ref:`Memory and IO <io>`. These data structures are exposed in Python through
+a series of interrelated classes:
+
+* **Type Metadata**: Instances of ``pyarrow.DataType``, which describe a logical
+  array type
+* **Schemas**: Instances of ``pyarrow.Schema``, which describe a named
+  collection of types. These can be thought of as the column types in a
+  table-like object.
+* **Arrays**: Instances of ``pyarrow.Array``, which are atomic, contiguous
+  columnar data structures composed from Arrow Buffer objects
+* **Record Batches**: Instances of ``pyarrow.RecordBatch``, which are a
+  collection of Array objects with a particular Schema
+* **Tables**: Instances of ``pyarrow.Table``, a logical table data structure in
+  which each column consists of one or more ``pyarrow.Array`` objects of the
+  same type.
+
+We will examine these in the sections below in a series of examples.
+
+.. _data.types:
+
+Type Metadata
+-------------
+
+Apache Arrow defines language-agnostic column-oriented data structures for
+array data. These include:
+
+* **Fixed-length primitive types**: numbers, booleans, dates and times, fixed
+  size binary, decimals, and other values that fit into a given number of bits
+* **Variable-length primitive types**: binary, string
+* **Nested types**: list, struct, and union
+* **Dictionary type**: An encoded categorical type (more on this later)
+
+Each logical data type in Arrow has a corresponding factory function for
+creating an instance of that type object in Python:
+
+.. ipython:: python
+
+   import pyarrow as pa
+   t1 = pa.int32()
+   t2 = pa.string()
+   t3 = pa.binary()
+   t4 = pa.binary(10)
+   t5 = pa.timestamp('ms')
+
+   t1
+   print(t1)
+   print(t4)
+   print(t5)
+
+We use the name **logical type** because the **physical** storage may be the
+same for one or more types. For example, ``int64``, ``float64``, and
+``timestamp[ms]`` all occupy 64 bits per value.
+
+These objects are `metadata`; they are used for describing the data in arrays,
+schemas, and record batches. In Python, they can be used in functions where the
+input data (e.g. Python objects) may be coerced to more than one Arrow type.
+
+The :class:`~pyarrow.Field` type is a type plus a name and optional
+user-defined metadata:
+
+.. ipython:: python
+
+   f0 = pa.field('int32_field', t1)
+   f0
+   f0.name
+   f0.type
+
+Arrow supports **nested value types** like list, struct, and union. When
+creating these, you must pass types or fields to indicate the data types of the
+types' children. For example, we can define a list of int32 values with:
+
+.. ipython:: python
+
+   t6 = pa.list_(t1)
+   t6
+
+A `struct` is a collection of named fields:
+
+.. ipython:: python
+
+   fields = [
+       pa.field('s0', t1),
+       pa.field('s1', t2),
+       pa.field('s2', t4),
+       pa.field('s3', t6)
+   ]
+
+   t7 = pa.struct(fields)
+   print(t7)
+
+See :ref:`Data Types API <api.types>` for a full listing of data type
+functions.
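Since each factory returns an ordinary ``DataType`` object, the factories
compose freely. A minimal sketch of a nested list-of-struct type, using only
the factory functions shown above:

    import pyarrow as pa

    # Nested types compose: a list of structs built from the same
    # factory functions introduced above.
    point = pa.struct([pa.field('x', pa.int32()),
                       pa.field('y', pa.int32())])
    points = pa.list_(point)
    print(points)  # e.g. list<item: struct<x: int32, y: int32>>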
+
+.. _data.schema:
+
+Schemas
+-------
+
+The :class:`~pyarrow.Schema` type is similar to the ``struct`` array type; it
+defines the column names and types in a record batch or table data
+structure. The ``pyarrow.schema`` factory function makes new Schema objects in
+Python:
+
+.. ipython:: python
+
+   fields = [
+       pa.field('s0', t1),
+       pa.field('s1', t2),
+       pa.field('s2', t4),
+       pa.field('s3', t6)
+   ]
+
+   my_schema = pa.schema(fields)
+   my_schema
+
+In some applications, you may not create schemas directly, only using the ones
+that are embedded in :ref:`IPC messages <ipc>`.
+
+.. _data.array:
+
+Arrays
+------
+
+For each data type, there is an accompanying array data structure for holding
+memory buffers that define a single contiguous chunk of columnar array
+data. When you are using PyArrow, this data may come from IPC tools, though it
+can also be created from various types of Python sequences (lists, NumPy
+arrays, pandas data).
+
+A simple way to create arrays is with ``pyarrow.array``, which is similar to
+the ``numpy.array`` function:
+
+.. ipython:: python
+
+   arr = pa.array([1, 2, None, 3])
+   arr
+
+The array's ``type`` attribute is the corresponding piece of type metadata:
+
+.. ipython:: python
+
+   arr.type
+
+Each in-memory array has a known length and null count (which will be 0 if
+there are no null values):
+
+.. ipython:: python
+
+   len(arr)
+   arr.null_count
+
+Scalar values can be selected with normal indexing. ``pyarrow.array`` converts
+``None`` values to Arrow nulls; we return the special ``pyarrow.NA`` value for
+nulls:
+
+.. ipython:: python
+
+   arr[0]
+   arr[2]
+
+Arrow data is immutable, so values can be selected but not assigned.
+
+Arrays can be sliced without copying:
+
+.. ipython:: python
+
+   arr[1:3]
+
+``pyarrow.array`` can create simple nested data structures like lists:
+
+.. ipython:: python
+
+   nested_arr = pa.array([[], None, [1, 2], [None, 1]])
+   print(nested_arr.type)
+
+Dictionary Arrays
+~~~~~~~~~~~~~~~~~
+
+The **Dictionary** type in PyArrow is a special array type that is similar to a
+factor in R or a ``pandas.Categorical``. It enables one or more record batches
+in a file or stream to transmit integer *indices* referencing a shared
+**dictionary** containing the distinct values in the logical array. This is
+used particularly often with strings to save memory and improve performance.
+
+The way that dictionaries are handled in the Apache Arrow format and the way
+they appear in C++ and Python is slightly different. We define a special
+:class:`~.DictionaryArray` type with a corresponding dictionary type. Let's
+consider an example:
+
+.. ipython:: python
+
+   indices = pa.array([0, 1, 0, 1, 2, 0, None, 2])
+   dictionary = pa.array(['foo', 'bar', 'baz'])
+
+   dict_array = pa.DictionaryArray.from_arrays(indices, dictionary)
+   dict_array
+
+Here we have:
+
+.. ipython:: python
+
+   print(dict_array.type)
+   dict_array.indices
+   dict_array.dictionary
+
+When using :class:`~.DictionaryArray` with pandas, the analogue is
+``pandas.Categorical`` (more on this later):
+
+.. ipython:: python
+
+   dict_array.to_pandas()
+
+.. _data.record_batch:
+
+Record Batches
+--------------
+
+A **Record Batch** in Apache Arrow is a collection of equal-length array
+instances. Let's consider a collection of arrays:
+
+.. ipython:: python
+
+   data = [
+       pa.array([1, 2, 3, 4]),
+       pa.array(['foo', 'bar', 'baz', None]),
+       pa.array([True, None, False, True])
+   ]
+
+A record batch can be created from this list of arrays using
+``RecordBatch.from_arrays``:
+
+.. ipython:: python
+
+   batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2'])
+   batch.num_columns
+   batch.num_rows
+   batch.schema
+
+   batch[1]
+
+Like an array, a record batch can be sliced without copying memory:
+
+.. ipython:: python
+
+   batch2 = batch.slice(1, 3)
+   batch2[1]
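A record batch can also be built directly from a pandas ``DataFrame``. A brief
sketch, assuming the ``RecordBatch.from_pandas`` constructor (note that the
DataFrame's index may be carried along as an extra column):

    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({'f0': [1, 2, 3, 4],
                       'f1': ['foo', 'bar', 'baz', None]})

    # One-step construction; roughly equivalent to building each column
    # with pa.array and calling RecordBatch.from_arrays as above.
    batch = pa.RecordBatch.from_pandas(df)
    print(batch.schema)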
+
+.. _data.table:
+
+Tables
+------
+
+The PyArrow :class:`~.Table` type is not part of the Apache Arrow
+specification, but is rather a tool to help with wrangling multiple record
+batches and array pieces as a single logical dataset. As a relevant example, we
+may receive multiple small record batches in a socket stream, then need to
+concatenate them into contiguous memory for use in NumPy or pandas. The Table
+object makes this efficient without requiring additional memory copying.
+
+Considering the record batch we created above, we can create a Table containing
+one or more copies of the batch using ``Table.from_batches``:
+
+.. ipython:: python
+
+   batches = [batch] * 5
+   table = pa.Table.from_batches(batches)
+   table
+   table.num_rows
+
+The table's columns are instances of :class:`~.Column`, which is a container
+for one or more arrays of the same type.
+
+.. ipython:: python
+
+   c = table[0]
+   c
+   c.data
+   c.data.num_chunks
+   c.data.chunk(0)
+
+As you'll see in the :ref:`pandas section `, we can convert these
+objects to contiguous NumPy arrays for use in pandas:
+
+.. ipython:: python
+
+   c.to_pandas()
+
+Custom Schema and Field Metadata
+--------------------------------
+
+TODO
diff --git a/python/doc/source/filesystems.rst b/python/doc/source/filesystems.rst
index 9e00ddd558127..61c03c57dfad9 100644
--- a/python/doc/source/filesystems.rst
+++ b/python/doc/source/filesystems.rst
@@ -15,10 +15,12 @@
 .. specific language governing permissions and limitations
 .. under the License.
 
-File interfaces and Memory Maps
-===============================
+Filesystem Interfaces
+=====================
 
-PyArrow features a number of file-like interfaces
+In this section, we discuss filesystem-like interfaces in PyArrow.
+
+.. _hdfs:
 
 Hadoop File System (HDFS)
 -------------------------
diff --git a/python/doc/source/index.rst b/python/doc/source/index.rst
index 55b4efc79bc3f..4bfbe44605767 100644
--- a/python/doc/source/index.rst
+++ b/python/doc/source/index.rst
@@ -36,8 +36,11 @@ structures.
 
    install
    development
-   pandas
+   memory
+   data
+   ipc
    filesystems
+   pandas
    parquet
    api
    getting_involved
diff --git a/python/doc/source/ipc.rst b/python/doc/source/ipc.rst
new file mode 100644
index 0000000000000..e63e7455bb815
--- /dev/null
+++ b/python/doc/source/ipc.rst
@@ -0,0 +1,136 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+
+.. http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. currentmodule:: pyarrow
+
+.. _ipc:
+
+IPC: Fast Streaming and Serialization
+=====================================
+
+Arrow defines two types of binary formats for serializing record batches:
+
+* **Streaming format**: for sending an arbitrary-length sequence of record
+  batches.
The format must be processed from start to end, and does not support + random access + +* **File or Random Access format**: for serializing a fixed number of record + batches. Supports random access, and thus is very useful when used with + memory maps + +To follow this section, make sure to first read the section on :ref:`Memory and +IO `. + +Writing and Reading Streams +--------------------------- + +First, let's create a small record batch: + +.. ipython:: python + + import pyarrow as pa + + data = [ + pa.array([1, 2, 3, 4]), + pa.array(['foo', 'bar', 'baz', None]), + pa.array([True, None, False, True]) + ] + + batch = pa.RecordBatch.from_arrays(data, ['f0', 'f1', 'f2']) + batch.num_rows + batch.num_columns + +Now, we can begin writing a stream containing some number of these batches. For +this we use :class:`~pyarrow.StreamWriter`, which can write to a writeable +``NativeFile`` object or a writeable Python object: + +.. ipython:: python + + sink = pa.InMemoryOutputStream() + writer = pa.StreamWriter(sink, batch.schema) + +Here we used an in-memory Arrow buffer stream, but this could have been a +socket or some other IO sink. + +When creating the ``StreamWriter``, we pass the schema, since the schema +(column names and types) must be the same for all of the batches sent in this +particular stream. Now we can do: + +.. ipython:: python + + for i in range(5): + writer.write_batch(batch) + writer.close() + + buf = sink.get_result() + buf.size + +Now ``buf`` contains the complete stream as an in-memory byte buffer. We can +read such a stream with :class:`~pyarrow.StreamReader`: + +.. ipython:: python + + reader = pa.StreamReader(buf) + reader.schema + + batches = [b for b in reader] + len(batches) + +We can check the returned batches are the same as the original input: + +.. ipython:: python + + batches[0].equals(batch) + +An important point is that if the input source supports zero-copy reads +(e.g. like a memory map, or ``pyarrow.BufferReader``), then the returned +batches are also zero-copy and do not allocate any new memory on read. + +Writing and Reading Random Access Files +--------------------------------------- + +The :class:`~pyarrow.FileWriter` has the same API as +:class:`~pyarrow.StreamWriter`: + +.. ipython:: python + + sink = pa.InMemoryOutputStream() + writer = pa.FileWriter(sink, batch.schema) + + for i in range(10): + writer.write_batch(batch) + writer.close() + + buf = sink.get_result() + buf.size + +The difference between :class:`~pyarrow.FileReader` and +:class:`~pyarrow.StreamReader` is that the input source must have a ``seek`` +method for random access. The stream reader only requires read operations: + +.. ipython:: python + + reader = pa.FileReader(buf) + +Because we have access to the entire payload, we know the number of record +batches in the file, and can read any at random: + +.. ipython:: python + + reader.num_record_batches + b = reader.get_batch(3) + b.equals(batch) diff --git a/python/doc/source/jemalloc.rst b/python/doc/source/jemalloc.rst index 8d7a5dc4a82ec..9389dcbd25cfe 100644 --- a/python/doc/source/jemalloc.rst +++ b/python/doc/source/jemalloc.rst @@ -18,7 +18,7 @@ jemalloc MemoryPool =================== -Arrow's default :class:`~pyarrow.memory.MemoryPool` uses the system's allocator +Arrow's default :class:`~pyarrow.MemoryPool` uses the system's allocator through the POSIX APIs. Although this already provides aligned allocation, the POSIX interface doesn't support aligned reallocation. 
The default reallocation strategy is to allocate a new region, copy over the old data and free the @@ -27,10 +27,9 @@ the existing memory allocation to the requested size. While this may still be linear in the size of allocated memory, it is magnitudes faster as only the page mapping in the kernel is touched, not the actual data. -The :mod:`~pyarrow.jemalloc` allocator is not enabled by default to allow the -use of the system allocator and/or other allocators like ``tcmalloc``. You can -either explicitly make it the default allocator or pass it only to single -operations. +The jemalloc-based allocator is not enabled by default to allow the use of the +system allocator and/or other allocators like ``tcmalloc``. You can either +explicitly make it the default allocator or pass it only to single operations. .. code:: python diff --git a/python/doc/source/memory.rst b/python/doc/source/memory.rst new file mode 100644 index 0000000000000..d1020da246407 --- /dev/null +++ b/python/doc/source/memory.rst @@ -0,0 +1,235 @@ +.. Licensed to the Apache Software Foundation (ASF) under one +.. or more contributor license agreements. See the NOTICE file +.. distributed with this work for additional information +.. regarding copyright ownership. The ASF licenses this file +.. to you under the Apache License, Version 2.0 (the +.. "License"); you may not use this file except in compliance +.. with the License. You may obtain a copy of the License at + +.. http://www.apache.org/licenses/LICENSE-2.0 + +.. Unless required by applicable law or agreed to in writing, +.. software distributed under the License is distributed on an +.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +.. KIND, either express or implied. See the License for the +.. specific language governing permissions and limitations +.. under the License. + +.. currentmodule:: pyarrow +.. _io: + +Memory and IO Interfaces +======================== + +This section will introduce you to the major concepts in PyArrow's memory +management and IO systems: + +* Buffers +* File-like and stream-like objects +* Memory pools + +pyarrow.Buffer +-------------- + +The :class:`~pyarrow.Buffer` object wraps the C++ ``arrow::Buffer`` type and is +the primary tool for memory management in Apache Arrow in C++. It permits +higher-level array classes to safely interact with memory which they may or may +not own. ``arrow::Buffer`` can be zero-copy sliced to permit Buffers to cheaply +reference other Buffers, while preserving memory lifetime and clean +parent-child relationships. + +There are many implementations of ``arrow::Buffer``, but they all provide a +standard interface: a data pointer and length. This is similar to Python's +built-in `buffer protocol` and ``memoryview`` objects. + +A :class:`~pyarrow.Buffer` can be created from any Python object which +implements the buffer protocol. Let's consider a bytes object: + +.. ipython:: python + + import pyarrow as pa + + data = b'abcdefghijklmnopqrstuvwxyz' + buf = pa.frombuffer(data) + buf + buf.size + +Creating a Buffer in this way does not allocate any memory; it is a zero-copy +view on the memory exported from the ``data`` bytes object. + +The Buffer's ``to_pybytes`` method can convert to a Python byte string: + +.. ipython:: python + + buf.to_pybytes() + +Buffers can be used in circumstances where a Python buffer or memoryview is +required, and such conversions are also zero-copy: + +.. ipython:: python + + memoryview(buf) + +.. 
_io.native_file:
+
+Native Files
+------------
+
+The Arrow C++ libraries have several abstract interfaces for different kinds of
+IO objects:
+
+* Read-only streams
+* Read-only files supporting random access
+* Write-only streams
+* Write-only files supporting random access
+* Files supporting reads, writes, and random access
+
+In the interest of making these objects behave more like Python's built-in
+``file`` objects, we have defined a :class:`~pyarrow.NativeFile` base class
+which is intended to mimic Python files and to be usable in functions where
+a Python file (such as ``file`` or ``BytesIO``) is expected.
+
+:class:`~pyarrow.NativeFile` has some important features which make it
+preferable to using Python files with PyArrow where possible:
+
+* Other Arrow classes can access the internal C++ IO objects natively, and do
+  not need to acquire the Python GIL
+* Native C++ IO may be able to do zero-copy IO, such as with memory maps
+
+There are several kinds of :class:`~pyarrow.NativeFile` options available:
+
+* :class:`~pyarrow.OSFile`, a native file that uses your operating system's
+  file descriptors
+* :class:`~pyarrow.MemoryMappedFile`, for reading (zero-copy) and writing with
+  memory maps
+* :class:`~pyarrow.BufferReader`, for reading :class:`~pyarrow.Buffer` objects
+  as a file
+* :class:`~pyarrow.InMemoryOutputStream`, for writing data in-memory, producing
+  a Buffer at the end
+* :class:`~pyarrow.HdfsFile`, for reading and writing data to the Hadoop Filesystem
+* :class:`~pyarrow.PythonFile`, for interfacing with Python file objects in C++
+
+We will discuss these in the following sections after explaining memory pools.
+
+Memory Pools
+------------
+
+All memory allocations and deallocations (like ``malloc`` and ``free`` in C)
+are tracked in an instance of ``arrow::MemoryPool``. This means that we can
+then precisely track the amount of memory that has been allocated:
+
+.. ipython:: python
+
+   pa.total_allocated_bytes()
+
+PyArrow uses a default built-in memory pool, but in the future there may be
+additional memory pools (and subpools) to choose from. Let's consider an
+``InMemoryOutputStream``, which is like a ``BytesIO``:
+
+.. ipython:: python
+
+   stream = pa.InMemoryOutputStream()
+   stream.write(b'foo')
+   pa.total_allocated_bytes()
+   for i in range(1024): stream.write(b'foo')
+   pa.total_allocated_bytes()
+
+The default allocator requests memory in a minimum increment of 64 bytes. If
+the stream is garbage-collected, all of the memory is freed:
+
+.. ipython:: python
+
+   stream = None
+   pa.total_allocated_bytes()
+
+Classes and functions that may allocate memory will often have an option to
+pass in a custom memory pool:
+
+.. ipython:: python
+
+   my_pool = pa.jemalloc_memory_pool()
+   my_pool
+   my_pool.bytes_allocated()
+   stream = pa.InMemoryOutputStream(my_pool)
+   stream.write(b'foo')
+   my_pool.bytes_allocated()
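The default pool can also be swapped globally. A small sketch using the
``set_memory_pool`` function listed in the API reference above, assuming a
jemalloc-enabled build of pyarrow:

    import pyarrow as pa

    # Route subsequent default allocations through the jemalloc-based
    # pool (assumes pyarrow was built with jemalloc support).
    pa.set_memory_pool(pa.jemalloc_memory_pool())

    stream = pa.InMemoryOutputStream()
    stream.write(b'foo')

    # The allocation now shows up in the jemalloc pool's counters
    print(pa.jemalloc_memory_pool().bytes_allocated())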
+
+On-Disk and Memory Mapped Files
+-------------------------------
+
+PyArrow includes two ways to interact with data on disk: standard operating
+system-level file APIs, and memory-mapped files. In regular Python we can
+write:
+
+.. ipython:: python
+
+   with open('example.dat', 'wb') as f:
+       f.write(b'some example data')
+
+Using pyarrow's :class:`~pyarrow.OSFile` class, you can write:
+
+.. ipython:: python
+
+   with pa.OSFile('example2.dat', 'wb') as f:
+       f.write(b'some example data')
+
+For reading files, you can use ``OSFile`` or
+:class:`~pyarrow.MemoryMappedFile`. The difference between these is that
+:class:`~pyarrow.OSFile` allocates new memory on each read, like Python file
+objects. In reads from memory maps, the library constructs a buffer referencing
+the mapped memory without any memory allocation or copying:
+
+.. ipython:: python
+
+   file_obj = pa.OSFile('example.dat')
+   mmap = pa.memory_map('example.dat')
+   file_obj.read(4)
+   mmap.read(4)
+
+The ``read`` method implements the standard Python file ``read`` API. To read
+into Arrow Buffer objects, use ``read_buffer``:
+
+.. ipython:: python
+
+   mmap.seek(0)
+   buf = mmap.read_buffer(4)
+   print(buf)
+   buf.to_pybytes()
+
+Many tools in PyArrow, particularly the Apache Parquet interface and the file
+and stream messaging tools, are more efficient when used with these ``NativeFile``
+types than with normal Python file objects.
+
+.. ipython:: python
+   :suppress:
+
+   buf = mmap = file_obj = None
+   !rm example.dat
+   !rm example2.dat
+
+In-Memory Reading and Writing
+-----------------------------
+
+To assist with serialization and deserialization of in-memory data, we have
+file interfaces that can read and write to Arrow Buffers.
+
+.. ipython:: python
+
+   writer = pa.InMemoryOutputStream()
+   writer.write(b'hello, friends')
+
+   buf = writer.get_result()
+   buf
+   buf.size
+   reader = pa.BufferReader(buf)
+   reader.seek(7)
+   reader.read(7)
+
+These have similar semantics to Python's built-in ``io.BytesIO``.
+
+Hadoop Filesystem
+-----------------
+
+:class:`~pyarrow.HdfsFile` is an implementation of :class:`~pyarrow.NativeFile`
+that can read and write to the Hadoop filesystem. Read more in the
+:ref:`Filesystems Section <hdfs>`.
diff --git a/python/doc/source/pandas.rst b/python/doc/source/pandas.rst
index 34445aed517d3..cb7a56d020d19 100644
--- a/python/doc/source/pandas.rst
+++ b/python/doc/source/pandas.rst
@@ -15,17 +15,17 @@
 .. specific language governing permissions and limitations
 .. under the License.
 
-Pandas Interface
-================
+Using PyArrow with pandas
+=========================
 
-To interface with Pandas, PyArrow provides various conversion routines to
-consume Pandas structures and convert back to them.
+To interface with pandas, PyArrow provides various conversion routines to
+consume pandas structures and convert back to them.
 
 DataFrames
 ----------
 
-The equivalent to a Pandas DataFrame in Arrow is a :class:`pyarrow.table.Table`.
-Both consist of a set of named columns of equal length. While Pandas only
+The equivalent to a pandas DataFrame in Arrow is a :class:`pyarrow.table.Table`.
+Both consist of a set of named columns of equal length. While pandas only
 supports flat columns, the Table also provides nested columns, thus it can
 represent more data than a DataFrame, so a full conversion is not always
 possible.
@@ -33,9 +33,9 @@ Conversion from a Table to a DataFrame is done by calling
 :meth:`pyarrow.table.Table.to_pandas`. The inverse is then achieved by using
 :meth:`pyarrow.Table.from_pandas`. This conversion routine provides the
 convenience parameter ``timestamps_to_ms``. Although Arrow supports timestamps of
-different resolutions, Pandas only supports nanosecond timestamps and most
+different resolutions, pandas only supports nanosecond timestamps and most
 other systems (e.g. Parquet) only work on millisecond timestamps. This parameter
-can be used to already do the time conversion during the Pandas to Arrow
+can be used to already do the time conversion during the pandas to Arrow
 conversion.
 
 .. code-block:: python
 
     import pyarrow as pa
import pandas as pd df = pd.DataFrame({"a": [1, 2, 3]}) - # Convert from Pandas to Arrow + # Convert from pandas to Arrow table = pa.Table.from_pandas(df) - # Convert back to Pandas + # Convert back to pandas df_new = table.to_pandas() Series ------ -In Arrow, the most similar structure to a Pandas Series is an Array. +In Arrow, the most similar structure to a pandas Series is an Array. It is a vector that contains data of the same type as linear memory. You can -convert a Pandas Series to an Arrow Array using :meth:`pyarrow.array.from_pandas_series`. +convert a pandas Series to an Arrow Array using :meth:`pyarrow.array.from_pandas_series`. As Arrow Arrays are always nullable, you can supply an optional mask using the ``mask`` parameter to mark all null-entries. Type differences ---------------- -With the current design of Pandas and Arrow, it is not possible to convert all -column types unmodified. One of the main issues here is that Pandas has no +With the current design of pandas and Arrow, it is not possible to convert all +column types unmodified. One of the main issues here is that pandas has no support for nullable columns of arbitrary type. Also ``datetime64`` is currently fixed to nanosecond resolution. On the other side, Arrow might be still missing support for some types. -Pandas -> Arrow Conversion +pandas -> Arrow Conversion ~~~~~~~~~~~~~~~~~~~~~~~~~~ +------------------------+--------------------------+ -| Source Type (Pandas) | Destination Type (Arrow) | +| Source Type (pandas) | Destination Type (Arrow) | +========================+==========================+ | ``bool`` | ``BOOL`` | +------------------------+--------------------------+ @@ -91,11 +91,11 @@ Pandas -> Arrow Conversion | ``datetime.date`` | ``DATE`` | +------------------------+--------------------------+ -Arrow -> Pandas Conversion +Arrow -> pandas Conversion ~~~~~~~~~~~~~~~~~~~~~~~~~~ +-------------------------------------+--------------------------------------------------------+ -| Source Type (Arrow) | Destination Type (Pandas) | +| Source Type (Arrow) | Destination Type (pandas) | +=====================================+========================================================+ | ``BOOL`` | ``bool`` | +-------------------------------------+--------------------------------------------------------+ diff --git a/python/doc/source/parquet.rst b/python/doc/source/parquet.rst index 8e011e4f19857..3317b99c0b685 100644 --- a/python/doc/source/parquet.rst +++ b/python/doc/source/parquet.rst @@ -15,77 +15,218 @@ .. specific language governing permissions and limitations .. under the License. -Reading/Writing Parquet files -============================= +.. currentmodule:: pyarrow +.. _parquet: -If you have built ``pyarrow`` with Parquet support, i.e. ``parquet-cpp`` was -found during the build, you can read files in the Parquet format to/from Arrow -memory structures. The Parquet support code is located in the -:mod:`pyarrow.parquet` module and your package needs to be built with the -``--with-parquet`` flag for ``build_ext``. +Reading and Writing the Apache Parquet Format +============================================= -Reading Parquet ---------------- +The `Apache Parquet `_ project provides a +standardized open-source columnar storage format for use in data analysis +systems. It was created originally for use in `Apache Hadoop +`_ with systems like `Apache Drill +`_, `Apache Hive `_, `Apache +Impala (incubating) `_, and `Apache Spark +`_ adopting it as a shared standard for high +performance data IO. 
-To read a Parquet file into Arrow memory, you can use the following code -snippet. It will read the whole Parquet file into memory as an -:class:`~pyarrow.table.Table`. +Apache Arrow is an ideal in-memory transport layer for data that is being read +or written with Parquet files. We have been concurrently developing the `C++ +implementation of Apache Parquet `_, +which includes a native, multithreaded C++ adapter to and from in-memory Arrow +data. PyArrow includes Python bindings to this code, which thus enables reading +and writing Parquet files with pandas as well. -.. code-block:: python +Obtaining PyArrow with Parquet Support +-------------------------------------- + +If you installed ``pyarrow`` with pip or conda, it should be built with Parquet +support bundled: + +.. ipython:: python + + import pyarrow.parquet as pq + +If you are building ``pyarrow`` from source, you must also build `parquet-cpp +`_ and enable the Parquet extensions when +building ``pyarrow``. See the :ref:`Development ` page for more +details. + +Reading and Writing Single Files +-------------------------------- + +The functions :func:`~.parquet.read_table` and :func:`~.parquet.write_table` +read and write the :ref:`pyarrow.Table ` objects, respectively. + +Let's look at a simple table: + +.. ipython:: python + + import numpy as np + import pandas as pd + import pyarrow as pa + + df = pd.DataFrame({'one': [-1, np.nan, 2.5], + 'two': ['foo', 'bar', 'baz'], + 'three': [True, False, True]}) + table = pa.Table.from_pandas(df) + +We write this to Parquet format with ``write_table``: + +.. ipython:: python + + import pyarrow.parquet as pq + pq.write_table(table, 'example.parquet') + +This creates a single Parquet file. In practice, a Parquet dataset may consist +of many files in many directories. We can read a single file back with +``read_table``: + +.. ipython:: python + + table2 = pq.read_table('example.parquet') + table2.to_pandas() + +You can pass a subset of columns to read, which can be much faster than reading +the whole file (due to the columnar layout): + +.. ipython:: python + + pq.read_table('example.parquet', columns=['one', 'three']) + +We need not use a string to specify the origin of the file. It can be any of: + +* A file path as a string +* A :ref:`NativeFile ` from PyArrow +* A Python file object + +In general, a Python file object will have the worst read performance, while a +string file path or an instance of :class:`~.NativeFIle` (especially memory +maps) will perform the best. - import pyarrow.parquet as pq +Finer-grained Reading and Writing +--------------------------------- - table = pq.read_table('') +``read_table`` uses the :class:`~.ParquetFile` class, which has other features: -As DataFrames stored as Parquet are often stored in multiple files, a -convenience method :meth:`~pyarrow.parquet.read_multiple_files` is provided. +.. ipython:: python -If you already have the Parquet available in memory or get it via non-file -source, you can utilize :class:`pyarrow.io.BufferReader` to read it from -memory. As input to the :class:`~pyarrow.io.BufferReader` you can either supply -a Python ``bytes`` object or a :class:`pyarrow.io.Buffer`. + parquet_file = pq.ParquetFile('example.parquet') + parquet_file.metadata + parquet_file.schema -.. code:: python +As you can learn more in the `Apache Parquet format +`_, a Parquet file consists of +multiple row groups. ``read_table`` will read all of the row groups and +concatenate them into a single table. 
You can read individual row groups with +``read_row_group``: - import pyarrow.io as paio - import pyarrow.parquet as pq +.. ipython:: python - buf = ... # either bytes or paio.Buffer - reader = paio.BufferReader(buf) - table = pq.read_table(reader) + parquet_file.num_row_groups + parquet_file.read_row_group(0) -Writing Parquet ---------------- +We can similarly write a Parquet file with multiple row groups by using +``ParquetWriter``: -Given an instance of :class:`pyarrow.table.Table`, the most simple way to -persist it to Parquet is by using the :meth:`pyarrow.parquet.write_table` -method. +.. ipython:: python + + writer = pq.ParquetWriter('example2.parquet', table.schema) + for i in range(3): + writer.write_table(table) + writer.close() + + pf2 = pq.ParquetFile('example2.parquet') + pf2.num_row_groups + +.. ipython:: python + :suppress: + + !rm example.parquet + !rm example2.parquet + +Compression, Encoding, and File Compatibility +--------------------------------------------- + +The most commonly used Parquet implementations use dictionary encoding when +writing files; if the dictionaries grow too large, then they "fall back" to +plain encoding. Whether dictionary encoding is used can be toggled using the +``use_dictionary`` option: .. code-block:: python - import pyarrow as pa - import pyarrow.parquet as pq + pq.write_table(table, where, use_dictionary=False) - table = pa.Table(..) - pq.write_table(table, '') +The data pages within a column in a row group can be compressed after the +encoding passes (dictionary, RLE encoding). In PyArrow we use Snappy +compression by default, but Brotli, Gzip, and uncompressed are also supported: -By default this will write the Table as a single RowGroup using ``DICTIONARY`` -encoding. To increase the potential of parallelism a query engine can process -a Parquet file, set the ``chunk_size`` to a fraction of the total number of rows. +.. code-block:: python + + pq.write_table(table, where, compression='snappy') + pq.write_table(table, where, compression='gzip') + pq.write_table(table, where, compression='brotli') + pq.write_table(table, where, compression='none') + +Snappy generally results in better performance, while Gzip may yield smaller +files. + +These settings can also be set on a per-column basis: + +.. code-block:: python -If you also want to compress the columns, you can select a compression -method using the ``compression`` argument. Typically, ``GZIP`` is the choice if -you want to minimize size and ``SNAPPY`` for performance. + pa.write_table(table, where, compression={'foo': 'snappy', 'bar': 'gzip'}, + use_dictionary=['foo', 'bar']) -Instead of writing to a file, you can also write to Python ``bytes`` by -utilizing an :class:`pyarrow.io.InMemoryOutputStream()`: +Reading Multiples Files and Partitioned Datasets +------------------------------------------------ -.. code:: python +Multiple Parquet files constitute a Parquet *dataset*. These may present in a +number of ways: - import pyarrow.io as paio - import pyarrow.parquet as pq +* A list of Parquet absolute file paths +* A directory name containing nested directories defining a partitioned dataset + +A dataset partitioned by year and month may look like on disk: + +.. code-block:: text + + dataset_name/ + year=2007/ + month=01/ + 0.parq + 1.parq + ... + month=02/ + 0.parq + 1.parq + ... + month=03/ + ... + year=2008/ + month=01/ + ... + ... 
+ +The :class:`~.ParquetDataset` class accepts either a directory name or a list +or file paths, and can discover and infer some common partition structures, +such as those produced by Hive: + +.. code-block:: python + + dataset = pq.ParquetDataset('dataset_name/') + table = dataset.read() + +Multithreaded Reads +------------------- + +Each of the reading functions have an ``nthreads`` argument which will read +columns with the indicated level of parallelism. Depending on the speed of IO +and how expensive it is to decode the columns in a particular file +(particularly with GZIP compression), this can yield significantly higher data +throughput: + +.. code-block:: python - table = ... - output = paio.InMemoryOutputStream() - pq.write_table(table, output) - pybytes = output.get_result().to_pybytes() + pq.read_table(where, nthreads=4) + pq.ParquetDataset(where).read(nthreads=4) diff --git a/python/pyarrow/_io.pyx b/python/pyarrow/_io.pyx index 40c76f8363cd2..e9e2ba01c0678 100644 --- a/python/pyarrow/_io.pyx +++ b/python/pyarrow/_io.pyx @@ -522,6 +522,7 @@ cdef class Buffer: buffer.strides = self.strides buffer.suboffsets = NULL + cdef shared_ptr[PoolBuffer] allocate_buffer(CMemoryPool* pool): cdef shared_ptr[PoolBuffer] result result.reset(new PoolBuffer(pool)) From 3d19831717297e91a74e008d44c71695088b39fd Mon Sep 17 00:00:00 2001 From: Kouhei Sutou Date: Mon, 8 May 2017 20:10:12 +0200 Subject: [PATCH 0620/1644] ARROW-967: [GLib] Support initializing array with buffer It's for zero-copy data conversion. Author: Kouhei Sutou Closes #657 from kou/glib-array-new-with-buffer and squashes the following commits: 57f4266 [Kouhei Sutou] [GLib] Support initializing array with buffer --- c_glib/arrow-glib/array.cpp | 600 ++++++++++++++++++++++++++++-- c_glib/arrow-glib/array.h | 80 ++++ c_glib/arrow-glib/buffer.cpp | 3 + c_glib/test/helper/buildable.rb | 55 +++ c_glib/test/test-binary-array.rb | 13 + c_glib/test/test-boolean-array.rb | 10 + c_glib/test/test-double-array.rb | 10 + c_glib/test/test-float-array.rb | 10 + c_glib/test/test-int16-array.rb | 10 + c_glib/test/test-int32-array.rb | 10 + c_glib/test/test-int64-array.rb | 10 + c_glib/test/test-int8-array.rb | 10 + c_glib/test/test-list-array.rb | 15 + c_glib/test/test-string-array.rb | 13 + c_glib/test/test-struct-array.rb | 33 ++ c_glib/test/test-uint16-array.rb | 10 + c_glib/test/test-uint32-array.rb | 10 + c_glib/test/test-uint64-array.rb | 10 + c_glib/test/test-uint8-array.rb | 10 + 19 files changed, 889 insertions(+), 33 deletions(-) diff --git a/c_glib/arrow-glib/array.cpp b/c_glib/arrow-glib/array.cpp index 8a78984349c62..8bc6ea95d6a9d 100644 --- a/c_glib/arrow-glib/array.cpp +++ b/c_glib/arrow-glib/array.cpp @@ -39,73 +39,89 @@ G_BEGIN_DECLS * #GArrowArray is a base class for all array classes such as * #GArrowBooleanArray. * - * All array classes are immutable. You need to use array builder - * class such as #GArrowBooleanArrayBuilder to create a new array - * except #GArrowNullArray. + * All array classes are immutable. You need to use binary data or + * array builder to create a new array except #GArrowNullArray. If you + * have binary data that uses Arrow format data, you can create a new + * array with the binary data as #GArrowBuffer object. If you don't + * have binary data, you can use array builder class such as + * #GArrowBooleanArrayBuilder that creates Arrow format data + * internally and a new array from the data. * * #GArrowNullArray is a class for null array. It can store zero or * more null values. 
You need to specify an array length to create a * new array. * * #GArrowBooleanArray is a class for binary array. It can store zero - * or more boolean data. You need to use #GArrowBooleanArrayBuilder to - * create a new array. + * or more boolean data. If you don't have Arrow format data, you need + * to use #GArrowBooleanArrayBuilder to create a new array. * * #GArrowInt8Array is a class for 8-bit integer array. It can store - * zero or more 8-bit integer data. You need to use - * #GArrowInt8ArrayBuilder to create a new array. + * zero or more 8-bit integer data. If you don't have Arrow format + * data, you need to use #GArrowInt8ArrayBuilder to create a new + * array. * * #GArrowUInt8Array is a class for 8-bit unsigned integer array. It - * can store zero or more 8-bit unsigned integer data. You need to use - * #GArrowUInt8ArrayBuilder to create a new array. + * can store zero or more 8-bit unsigned integer data. If you don't + * have Arrow format data, you need to use #GArrowUInt8ArrayBuilder to + * create a new array. * * #GArrowInt16Array is a class for 16-bit integer array. It can store - * zero or more 16-bit integer data. You need to use - * #GArrowInt16ArrayBuilder to create a new array. + * zero or more 16-bit integer data. If you don't have Arrow format + * data, you need to use #GArrowInt16ArrayBuilder to create a new + * array. * * #GArrowUInt16Array is a class for 16-bit unsigned integer array. It - * can store zero or more 16-bit unsigned integer data. You need to use - * #GArrowUInt16ArrayBuilder to create a new array. + * can store zero or more 16-bit unsigned integer data. If you don't + * have Arrow format data, you need to use #GArrowUInt16ArrayBuilder + * to create a new array. * * #GArrowInt32Array is a class for 32-bit integer array. It can store - * zero or more 32-bit integer data. You need to use - * #GArrowInt32ArrayBuilder to create a new array. + * zero or more 32-bit integer data. If you don't have Arrow format + * data, you need to use #GArrowInt32ArrayBuilder to create a new + * array. * * #GArrowUInt32Array is a class for 32-bit unsigned integer array. It - * can store zero or more 32-bit unsigned integer data. You need to use - * #GArrowUInt32ArrayBuilder to create a new array. + * can store zero or more 32-bit unsigned integer data. If you don't + * have Arrow format data, you need to use #GArrowUInt32ArrayBuilder + * to create a new array. * * #GArrowInt64Array is a class for 64-bit integer array. It can store - * zero or more 64-bit integer data. You need to use - * #GArrowInt64ArrayBuilder to create a new array. + * zero or more 64-bit integer data. If you don't have Arrow format + * data, you need to use #GArrowInt64ArrayBuilder to create a new + * array. * * #GArrowUInt64Array is a class for 64-bit unsigned integer array. It - * can store zero or more 64-bit unsigned integer data. You need to - * use #GArrowUInt64ArrayBuilder to create a new array. + * can store zero or more 64-bit unsigned integer data. If you don't + * have Arrow format data, you need to use #GArrowUInt64ArrayBuilder + * to create a new array. * * #GArrowFloatArray is a class for 32-bit floating point array. It - * can store zero or more 32-bit floating data. You need to use - * #GArrowFloatArrayBuilder to create a new array. + * can store zero or more 32-bit floating data. If you don't have + * Arrow format data, you need to use #GArrowFloatArrayBuilder to + * create a new array. * * #GArrowDoubleArray is a class for 64-bit floating point array. 
It - * can store zero or more 64-bit floating data. You need to use - * #GArrowDoubleArrayBuilder to create a new array. + * can store zero or more 64-bit floating data. If you don't have + * Arrow format data, you need to use #GArrowDoubleArrayBuilder to + * create a new array. * * #GArrowBinaryArray is a class for binary array. It can store zero - * or more binary data. You need to use #GArrowBinaryArrayBuilder to - * create a new array. + * or more binary data. If you don't have Arrow format data, you need + * to use #GArrowBinaryArrayBuilder to create a new array. * * #GArrowStringArray is a class for UTF-8 encoded string array. It - * can store zero or more UTF-8 encoded string data. You need to use - * #GArrowStringArrayBuilder to create a new array. + * can store zero or more UTF-8 encoded string data. If you don't have + * Arrow format data, you need to use #GArrowStringArrayBuilder to + * create a new array. * * #GArrowListArray is a class for list array. It can store zero or - * more list data. You need to use #GArrowListArrayBuilder to create a - * new array. + * more list data. If you don't have Arrow format data, you need to + * use #GArrowListArrayBuilder to create a new array. * * #GArrowStructArray is a class for struct array. It can store zero - * or more structs. One struct has zero or more fields. You need to - * use #GArrowStructArrayBuilder to create a new array. + * or more structs. One struct has zero or more fields. If you don't + * have Arrow format data, you need to use #GArrowStructArrayBuilder + * to create a new array. */ typedef struct GArrowArrayPrivate_ { @@ -455,6 +471,39 @@ garrow_boolean_array_class_init(GArrowBooleanArrayClass *klass) { } +/** + * garrow_boolean_array_new: + * @length: The number of elements. + * @data: The binary data in Arrow format of the array. + * @null_bitmap: (nullable): The bitmap that shows null elements. The + * N-th element is null when the N-th bit is 0, not null otherwise. + * If the array has no null elements, the bitmap must be %NULL and + * @n_nulls is 0. + * @n_nulls: The number of null elements. If -1 is specified, the + * number of nulls are computed from @null_bitmap. + * + * Returns: A newly created #GArrowBooleanArray. + * + * Since: 0.4.0 + */ +GArrowBooleanArray * +garrow_boolean_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls) +{ + const auto arrow_data = garrow_buffer_get_raw(data); + const auto arrow_bitmap = garrow_buffer_get_raw(null_bitmap); + auto arrow_boolean_array = + std::make_shared(length, + arrow_data, + arrow_bitmap, + n_nulls); + auto arrow_array = + std::static_pointer_cast(arrow_boolean_array); + return GARROW_BOOLEAN_ARRAY(garrow_array_new_raw(&arrow_array)); +} + /** * garrow_boolean_array_get_value: * @array: A #GArrowBooleanArray. @@ -485,6 +534,39 @@ garrow_int8_array_class_init(GArrowInt8ArrayClass *klass) { } +/** + * garrow_int8_array_new: + * @length: The number of elements. + * @data: The binary data in Arrow format of the array. + * @null_bitmap: (nullable): The bitmap that shows null elements. The + * N-th element is null when the N-th bit is 0, not null otherwise. + * If the array has no null elements, the bitmap must be %NULL and + * @n_nulls is 0. + * @n_nulls: The number of null elements. If -1 is specified, the + * number of nulls are computed from @null_bitmap. + * + * Returns: A newly created #GArrowInt8Array. 
+ * + * Since: 0.4.0 + */ +GArrowInt8Array * +garrow_int8_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls) +{ + const auto arrow_data = garrow_buffer_get_raw(data); + const auto arrow_bitmap = garrow_buffer_get_raw(null_bitmap); + auto arrow_int8_array = + std::make_shared(length, + arrow_data, + arrow_bitmap, + n_nulls); + auto arrow_array = + std::static_pointer_cast(arrow_int8_array); + return GARROW_INT8_ARRAY(garrow_array_new_raw(&arrow_array)); +} + /** * garrow_int8_array_get_value: * @array: A #GArrowInt8Array. @@ -515,6 +597,39 @@ garrow_uint8_array_class_init(GArrowUInt8ArrayClass *klass) { } +/** + * garrow_uint8_array_new: + * @length: The number of elements. + * @data: The binary data in Arrow format of the array. + * @null_bitmap: (nullable): The bitmap that shows null elements. The + * N-th element is null when the N-th bit is 0, not null otherwise. + * If the array has no null elements, the bitmap must be %NULL and + * @n_nulls is 0. + * @n_nulls: The number of null elements. If -1 is specified, the + * number of nulls are computed from @null_bitmap. + * + * Returns: A newly created #GArrowUInt8Array. + * + * Since: 0.4.0 + */ +GArrowUInt8Array * +garrow_uint8_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls) +{ + const auto arrow_data = garrow_buffer_get_raw(data); + const auto arrow_bitmap = garrow_buffer_get_raw(null_bitmap); + auto arrow_uint8_array = + std::make_shared(length, + arrow_data, + arrow_bitmap, + n_nulls); + auto arrow_array = + std::static_pointer_cast(arrow_uint8_array); + return GARROW_UINT8_ARRAY(garrow_array_new_raw(&arrow_array)); +} + /** * garrow_uint8_array_get_value: * @array: A #GArrowUInt8Array. @@ -545,6 +660,39 @@ garrow_int16_array_class_init(GArrowInt16ArrayClass *klass) { } +/** + * garrow_int16_array_new: + * @length: The number of elements. + * @data: The binary data in Arrow format of the array. + * @null_bitmap: (nullable): The bitmap that shows null elements. The + * N-th element is null when the N-th bit is 0, not null otherwise. + * If the array has no null elements, the bitmap must be %NULL and + * @n_nulls is 0. + * @n_nulls: The number of null elements. If -1 is specified, the + * number of nulls are computed from @null_bitmap. + * + * Returns: A newly created #GArrowInt16Array. + * + * Since: 0.4.0 + */ +GArrowInt16Array * +garrow_int16_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls) +{ + const auto arrow_data = garrow_buffer_get_raw(data); + const auto arrow_bitmap = garrow_buffer_get_raw(null_bitmap); + auto arrow_int16_array = + std::make_shared(length, + arrow_data, + arrow_bitmap, + n_nulls); + auto arrow_array = + std::static_pointer_cast(arrow_int16_array); + return GARROW_INT16_ARRAY(garrow_array_new_raw(&arrow_array)); +} + /** * garrow_int16_array_get_value: * @array: A #GArrowInt16Array. @@ -575,6 +723,39 @@ garrow_uint16_array_class_init(GArrowUInt16ArrayClass *klass) { } +/** + * garrow_uint16_array_new: + * @length: The number of elements. + * @data: The binary data in Arrow format of the array. + * @null_bitmap: (nullable): The bitmap that shows null elements. The + * N-th element is null when the N-th bit is 0, not null otherwise. + * If the array has no null elements, the bitmap must be %NULL and + * @n_nulls is 0. + * @n_nulls: The number of null elements. If -1 is specified, the + * number of nulls are computed from @null_bitmap. 
+ * + * Returns: A newly created #GArrowUInt16Array. + * + * Since: 0.4.0 + */ +GArrowUInt16Array * +garrow_uint16_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls) +{ + const auto arrow_data = garrow_buffer_get_raw(data); + const auto arrow_bitmap = garrow_buffer_get_raw(null_bitmap); + auto arrow_uint16_array = + std::make_shared(length, + arrow_data, + arrow_bitmap, + n_nulls); + auto arrow_array = + std::static_pointer_cast(arrow_uint16_array); + return GARROW_UINT16_ARRAY(garrow_array_new_raw(&arrow_array)); +} + /** * garrow_uint16_array_get_value: * @array: A #GArrowUInt16Array. @@ -605,6 +786,39 @@ garrow_int32_array_class_init(GArrowInt32ArrayClass *klass) { } +/** + * garrow_int32_array_new: + * @length: The number of elements. + * @data: The binary data in Arrow format of the array. + * @null_bitmap: (nullable): The bitmap that shows null elements. The + * N-th element is null when the N-th bit is 0, not null otherwise. + * If the array has no null elements, the bitmap must be %NULL and + * @n_nulls is 0. + * @n_nulls: The number of null elements. If -1 is specified, the + * number of nulls are computed from @null_bitmap. + * + * Returns: A newly created #GArrowInt32Array. + * + * Since: 0.4.0 + */ +GArrowInt32Array * +garrow_int32_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls) +{ + const auto arrow_data = garrow_buffer_get_raw(data); + const auto arrow_bitmap = garrow_buffer_get_raw(null_bitmap); + auto arrow_int32_array = + std::make_shared(length, + arrow_data, + arrow_bitmap, + n_nulls); + auto arrow_array = + std::static_pointer_cast(arrow_int32_array); + return GARROW_INT32_ARRAY(garrow_array_new_raw(&arrow_array)); +} + /** * garrow_int32_array_get_value: * @array: A #GArrowInt32Array. @@ -635,6 +849,39 @@ garrow_uint32_array_class_init(GArrowUInt32ArrayClass *klass) { } +/** + * garrow_uint32_array_new: + * @length: The number of elements. + * @data: The binary data in Arrow format of the array. + * @null_bitmap: (nullable): The bitmap that shows null elements. The + * N-th element is null when the N-th bit is 0, not null otherwise. + * If the array has no null elements, the bitmap must be %NULL and + * @n_nulls is 0. + * @n_nulls: The number of null elements. If -1 is specified, the + * number of nulls are computed from @null_bitmap. + * + * Returns: A newly created #GArrowUInt32Array. + * + * Since: 0.4.0 + */ +GArrowUInt32Array * +garrow_uint32_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls) +{ + const auto arrow_data = garrow_buffer_get_raw(data); + const auto arrow_bitmap = garrow_buffer_get_raw(null_bitmap); + auto arrow_uint32_array = + std::make_shared(length, + arrow_data, + arrow_bitmap, + n_nulls); + auto arrow_array = + std::static_pointer_cast(arrow_uint32_array); + return GARROW_UINT32_ARRAY(garrow_array_new_raw(&arrow_array)); +} + /** * garrow_uint32_array_get_value: * @array: A #GArrowUInt32Array. @@ -665,6 +912,39 @@ garrow_int64_array_class_init(GArrowInt64ArrayClass *klass) { } +/** + * garrow_int64_array_new: + * @length: The number of elements. + * @data: The binary data in Arrow format of the array. + * @null_bitmap: (nullable): The bitmap that shows null elements. The + * N-th element is null when the N-th bit is 0, not null otherwise. + * If the array has no null elements, the bitmap must be %NULL and + * @n_nulls is 0. + * @n_nulls: The number of null elements. 
If -1 is specified, the + * number of nulls are computed from @null_bitmap. + * + * Returns: A newly created #GArrowInt64Array. + * + * Since: 0.4.0 + */ +GArrowInt64Array * +garrow_int64_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls) +{ + const auto arrow_data = garrow_buffer_get_raw(data); + const auto arrow_bitmap = garrow_buffer_get_raw(null_bitmap); + auto arrow_int64_array = + std::make_shared(length, + arrow_data, + arrow_bitmap, + n_nulls); + auto arrow_array = + std::static_pointer_cast(arrow_int64_array); + return GARROW_INT64_ARRAY(garrow_array_new_raw(&arrow_array)); +} + /** * garrow_int64_array_get_value: * @array: A #GArrowInt64Array. @@ -695,6 +975,39 @@ garrow_uint64_array_class_init(GArrowUInt64ArrayClass *klass) { } +/** + * garrow_uint64_array_new: + * @length: The number of elements. + * @data: The binary data in Arrow format of the array. + * @null_bitmap: (nullable): The bitmap that shows null elements. The + * N-th element is null when the N-th bit is 0, not null otherwise. + * If the array has no null elements, the bitmap must be %NULL and + * @n_nulls is 0. + * @n_nulls: The number of null elements. If -1 is specified, the + * number of nulls are computed from @null_bitmap. + * + * Returns: A newly created #GArrowUInt64Array. + * + * Since: 0.4.0 + */ +GArrowUInt64Array * +garrow_uint64_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls) +{ + const auto arrow_data = garrow_buffer_get_raw(data); + const auto arrow_bitmap = garrow_buffer_get_raw(null_bitmap); + auto arrow_uint64_array = + std::make_shared(length, + arrow_data, + arrow_bitmap, + n_nulls); + auto arrow_array = + std::static_pointer_cast(arrow_uint64_array); + return GARROW_UINT64_ARRAY(garrow_array_new_raw(&arrow_array)); +} + /** * garrow_uint64_array_get_value: * @array: A #GArrowUInt64Array. @@ -724,6 +1037,39 @@ garrow_float_array_class_init(GArrowFloatArrayClass *klass) { } +/** + * garrow_float_array_new: + * @length: The number of elements. + * @data: The binary data in Arrow format of the array. + * @null_bitmap: (nullable): The bitmap that shows null elements. The + * N-th element is null when the N-th bit is 0, not null otherwise. + * If the array has no null elements, the bitmap must be %NULL and + * @n_nulls is 0. + * @n_nulls: The number of null elements. If -1 is specified, the + * number of nulls are computed from @null_bitmap. + * + * Returns: A newly created #GArrowFloatArray. + * + * Since: 0.4.0 + */ +GArrowFloatArray * +garrow_float_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls) +{ + const auto arrow_data = garrow_buffer_get_raw(data); + const auto arrow_bitmap = garrow_buffer_get_raw(null_bitmap); + auto arrow_float_array = + std::make_shared(length, + arrow_data, + arrow_bitmap, + n_nulls); + auto arrow_array = + std::static_pointer_cast(arrow_float_array); + return GARROW_FLOAT_ARRAY(garrow_array_new_raw(&arrow_array)); +} + /** * garrow_float_array_get_value: * @array: A #GArrowFloatArray. @@ -754,6 +1100,39 @@ garrow_double_array_class_init(GArrowDoubleArrayClass *klass) { } +/** + * garrow_double_array_new: + * @length: The number of elements. + * @data: The binary data in Arrow format of the array. + * @null_bitmap: (nullable): The bitmap that shows null elements. The + * N-th element is null when the N-th bit is 0, not null otherwise. + * If the array has no null elements, the bitmap must be %NULL and + * @n_nulls is 0. 
+ * @n_nulls: The number of null elements. If -1 is specified, the + * number of nulls are computed from @null_bitmap. + * + * Returns: A newly created #GArrowDoubleArray. + * + * Since: 0.4.0 + */ +GArrowDoubleArray * +garrow_double_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls) +{ + const auto arrow_data = garrow_buffer_get_raw(data); + const auto arrow_bitmap = garrow_buffer_get_raw(null_bitmap); + auto arrow_double_array = + std::make_shared(length, + arrow_data, + arrow_bitmap, + n_nulls); + auto arrow_array = + std::static_pointer_cast(arrow_double_array); + return GARROW_DOUBLE_ARRAY(garrow_array_new_raw(&arrow_array)); +} + /** * garrow_double_array_get_value: * @array: A #GArrowDoubleArray. @@ -784,6 +1163,43 @@ garrow_binary_array_class_init(GArrowBinaryArrayClass *klass) { } +/** + * garrow_binary_array_new: + * @length: The number of elements. + * @value_offsets: The value offsets of @data in Arrow format. + * @data: The binary data in Arrow format of the array. + * @null_bitmap: (nullable): The bitmap that shows null elements. The + * N-th element is null when the N-th bit is 0, not null otherwise. + * If the array has no null elements, the bitmap must be %NULL and + * @n_nulls is 0. + * @n_nulls: The number of null elements. If -1 is specified, the + * number of nulls are computed from @null_bitmap. + * + * Returns: A newly created #GArrowBinaryArray. + * + * Since: 0.4.0 + */ +GArrowBinaryArray * +garrow_binary_array_new(gint64 length, + GArrowBuffer *value_offsets, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls) +{ + const auto arrow_value_offsets = garrow_buffer_get_raw(value_offsets); + const auto arrow_data = garrow_buffer_get_raw(data); + const auto arrow_bitmap = garrow_buffer_get_raw(null_bitmap); + auto arrow_binary_array = + std::make_shared(length, + arrow_value_offsets, + arrow_data, + arrow_bitmap, + n_nulls); + auto arrow_array = + std::static_pointer_cast(arrow_binary_array); + return GARROW_BINARY_ARRAY(garrow_array_new_raw(&arrow_array)); +} + /** * garrow_binary_array_get_value: * @array: A #GArrowBinaryArray. @@ -835,6 +1251,43 @@ garrow_string_array_class_init(GArrowStringArrayClass *klass) { } +/** + * garrow_string_array_new: + * @length: The number of elements. + * @value_offsets: The value offsets of @data in Arrow format. + * @data: The binary data in Arrow format of the array. + * @null_bitmap: (nullable): The bitmap that shows null elements. The + * N-th element is null when the N-th bit is 0, not null otherwise. + * If the array has no null elements, the bitmap must be %NULL and + * @n_nulls is 0. + * @n_nulls: The number of null elements. If -1 is specified, the + * number of nulls are computed from @null_bitmap. + * + * Returns: A newly created #GArrowStringArray. + * + * Since: 0.4.0 + */ +GArrowStringArray * +garrow_string_array_new(gint64 length, + GArrowBuffer *value_offsets, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls) +{ + const auto arrow_value_offsets = garrow_buffer_get_raw(value_offsets); + const auto arrow_data = garrow_buffer_get_raw(data); + const auto arrow_bitmap = garrow_buffer_get_raw(null_bitmap); + auto arrow_string_array = + std::make_shared(length, + arrow_value_offsets, + arrow_data, + arrow_bitmap, + n_nulls); + auto arrow_array = + std::static_pointer_cast(arrow_string_array); + return GARROW_STRING_ARRAY(garrow_array_new_raw(&arrow_array)); +} + /** * garrow_string_array_get_string: * @array: A #GArrowStringArray. 
@@ -870,6 +1323,45 @@ garrow_list_array_class_init(GArrowListArrayClass *klass) { } +/** + * garrow_list_array_new: + * @length: The number of elements. + * @value_offsets: The offsets of @values in Arrow format. + * @values: The values as #GArrowArray. + * @null_bitmap: (nullable): The bitmap that shows null elements. The + * N-th element is null when the N-th bit is 0, not null otherwise. + * If the array has no null elements, the bitmap must be %NULL and + * @n_nulls is 0. + * @n_nulls: The number of null elements. If -1 is specified, the + * number of nulls are computed from @null_bitmap. + * + * Returns: A newly created #GArrowListArray. + * + * Since: 0.4.0 + */ +GArrowListArray * +garrow_list_array_new(gint64 length, + GArrowBuffer *value_offsets, + GArrowArray *values, + GArrowBuffer *null_bitmap, + gint64 n_nulls) +{ + const auto arrow_value_offsets = garrow_buffer_get_raw(value_offsets); + const auto arrow_values = garrow_array_get_raw(values); + const auto arrow_bitmap = garrow_buffer_get_raw(null_bitmap); + auto arrow_data_type = arrow::list(arrow_values->type()); + auto arrow_list_array = + std::make_shared(arrow_data_type, + length, + arrow_value_offsets, + arrow_values, + arrow_bitmap, + n_nulls); + auto arrow_array = + std::static_pointer_cast(arrow_list_array); + return GARROW_LIST_ARRAY(garrow_array_new_raw(&arrow_array)); +} + /** * garrow_list_array_get_value_type: * @array: A #GArrowListArray. @@ -921,6 +1413,48 @@ garrow_struct_array_class_init(GArrowStructArrayClass *klass) { } +/** + * garrow_struct_array_new: + * @data_type: The data type of the struct. + * @length: The number of elements. + * @children: (element-type GArrowArray): The arrays for each field + * as #GList of #GArrowArray. + * @null_bitmap: (nullable): The bitmap that shows null elements. The + * N-th element is null when the N-th bit is 0, not null otherwise. + * If the array has no null elements, the bitmap must be %NULL and + * @n_nulls is 0. + * @n_nulls: The number of null elements. If -1 is specified, the + * number of nulls are computed from @null_bitmap. + * + * Returns: A newly created #GArrowStructArray. + * + * Since: 0.4.0 + */ +GArrowStructArray * +garrow_struct_array_new(GArrowDataType *data_type, + gint64 length, + GList *children, + GArrowBuffer *null_bitmap, + gint64 n_nulls) +{ + const auto arrow_data_type = garrow_data_type_get_raw(data_type); + std::vector> arrow_children; + for (GList *node = children; node; node = node->next) { + GArrowArray *child = GARROW_ARRAY(node->data); + arrow_children.push_back(garrow_array_get_raw(child)); + } + const auto arrow_bitmap = garrow_buffer_get_raw(null_bitmap); + auto arrow_struct_array = + std::make_shared(arrow_data_type, + length, + arrow_children, + arrow_bitmap, + n_nulls); + auto arrow_array = + std::static_pointer_cast(arrow_struct_array); + return GARROW_STRUCT_ARRAY(garrow_array_new_raw(&arrow_array)); +} + /** * garrow_struct_array_get_field * @array: A #GArrowStructArray. 
diff --git a/c_glib/arrow-glib/array.h b/c_glib/arrow-glib/array.h index f750ee10f8cbe..c4efeafd6404a 100644 --- a/c_glib/arrow-glib/array.h +++ b/c_glib/arrow-glib/array.h @@ -211,6 +211,12 @@ struct _GArrowBooleanArrayClass }; GType garrow_boolean_array_get_type (void) G_GNUC_CONST; + +GArrowBooleanArray *garrow_boolean_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls); + gboolean garrow_boolean_array_get_value (GArrowBooleanArray *array, gint64 i); @@ -257,6 +263,11 @@ struct _GArrowInt8ArrayClass GType garrow_int8_array_get_type(void) G_GNUC_CONST; +GArrowInt8Array *garrow_int8_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls); + gint8 garrow_int8_array_get_value(GArrowInt8Array *array, gint64 i); @@ -303,6 +314,11 @@ struct _GArrowUInt8ArrayClass GType garrow_uint8_array_get_type(void) G_GNUC_CONST; +GArrowUInt8Array *garrow_uint8_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls); + guint8 garrow_uint8_array_get_value(GArrowUInt8Array *array, gint64 i); @@ -349,6 +365,11 @@ struct _GArrowInt16ArrayClass GType garrow_int16_array_get_type(void) G_GNUC_CONST; +GArrowInt16Array *garrow_int16_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls); + gint16 garrow_int16_array_get_value(GArrowInt16Array *array, gint64 i); @@ -395,6 +416,11 @@ struct _GArrowUInt16ArrayClass GType garrow_uint16_array_get_type(void) G_GNUC_CONST; +GArrowUInt16Array *garrow_uint16_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls); + guint16 garrow_uint16_array_get_value(GArrowUInt16Array *array, gint64 i); @@ -441,6 +467,11 @@ struct _GArrowInt32ArrayClass GType garrow_int32_array_get_type(void) G_GNUC_CONST; +GArrowInt32Array *garrow_int32_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls); + gint32 garrow_int32_array_get_value(GArrowInt32Array *array, gint64 i); @@ -487,6 +518,11 @@ struct _GArrowUInt32ArrayClass GType garrow_uint32_array_get_type(void) G_GNUC_CONST; +GArrowUInt32Array *garrow_uint32_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls); + guint32 garrow_uint32_array_get_value(GArrowUInt32Array *array, gint64 i); @@ -533,6 +569,11 @@ struct _GArrowInt64ArrayClass GType garrow_int64_array_get_type(void) G_GNUC_CONST; +GArrowInt64Array *garrow_int64_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls); + gint64 garrow_int64_array_get_value(GArrowInt64Array *array, gint64 i); @@ -579,6 +620,11 @@ struct _GArrowUInt64ArrayClass GType garrow_uint64_array_get_type(void) G_GNUC_CONST; +GArrowUInt64Array *garrow_uint64_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls); + guint64 garrow_uint64_array_get_value(GArrowUInt64Array *array, gint64 i); @@ -625,6 +671,11 @@ struct _GArrowFloatArrayClass GType garrow_float_array_get_type(void) G_GNUC_CONST; +GArrowFloatArray *garrow_float_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls); + gfloat garrow_float_array_get_value(GArrowFloatArray *array, gint64 i); @@ -671,6 +722,11 @@ struct _GArrowDoubleArrayClass GType garrow_double_array_get_type(void) G_GNUC_CONST; +GArrowDoubleArray *garrow_double_array_new(gint64 length, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls); + gdouble 
garrow_double_array_get_value(GArrowDoubleArray *array, gint64 i); @@ -717,6 +773,12 @@ struct _GArrowBinaryArrayClass GType garrow_binary_array_get_type(void) G_GNUC_CONST; +GArrowBinaryArray *garrow_binary_array_new(gint64 length, + GArrowBuffer *value_offsets, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls); + GBytes *garrow_binary_array_get_value(GArrowBinaryArray *array, gint64 i); GArrowBuffer *garrow_binary_array_get_buffer(GArrowBinaryArray *array); @@ -763,6 +825,12 @@ struct _GArrowStringArrayClass GType garrow_string_array_get_type(void) G_GNUC_CONST; +GArrowStringArray *garrow_string_array_new(gint64 length, + GArrowBuffer *value_offsets, + GArrowBuffer *data, + GArrowBuffer *null_bitmap, + gint64 n_nulls); + gchar *garrow_string_array_get_string(GArrowStringArray *array, gint64 i); @@ -809,6 +877,12 @@ struct _GArrowListArrayClass GType garrow_list_array_get_type(void) G_GNUC_CONST; +GArrowListArray *garrow_list_array_new(gint64 length, + GArrowBuffer *value_offsets, + GArrowArray *values, + GArrowBuffer *null_bitmap, + gint64 n_nulls); + GArrowDataType *garrow_list_array_get_value_type(GArrowListArray *array); GArrowArray *garrow_list_array_get_value(GArrowListArray *array, gint64 i); @@ -856,6 +930,12 @@ struct _GArrowStructArrayClass GType garrow_struct_array_get_type(void) G_GNUC_CONST; +GArrowStructArray *garrow_struct_array_new(GArrowDataType *data_type, + gint64 length, + GList *children, + GArrowBuffer *null_bitmap, + gint64 n_nulls); + GArrowArray *garrow_struct_array_get_field(GArrowStructArray *array, gint i); GList *garrow_struct_array_get_fields(GArrowStructArray *array); diff --git a/c_glib/arrow-glib/buffer.cpp b/c_glib/arrow-glib/buffer.cpp index 0970128ae3862..4be8fed18ea01 100644 --- a/c_glib/arrow-glib/buffer.cpp +++ b/c_glib/arrow-glib/buffer.cpp @@ -504,6 +504,9 @@ garrow_buffer_new_raw(std::shared_ptr *arrow_buffer) std::shared_ptr garrow_buffer_get_raw(GArrowBuffer *buffer) { + if (!buffer) + return nullptr; + auto priv = GARROW_BUFFER_GET_PRIVATE(buffer); return priv->buffer; } diff --git a/c_glib/test/helper/buildable.rb b/c_glib/test/helper/buildable.rb index 900e180675b45..4120eed6fe2d0 100644 --- a/c_glib/test/helper/buildable.rb +++ b/c_glib/test/helper/buildable.rb @@ -61,6 +61,61 @@ def build_double_array(values) build_array(Arrow::DoubleArrayBuilder, values) end + def build_binary_array(values) + build_array(Arrow::BinaryArrayBuilder, values) + end + + def build_string_array(values) + build_array(Arrow::StringArrayBuilder, values) + end + + def build_list_array(value_builder_class, values_list) + value_builder = value_builder_class.new + builder = Arrow::ListArrayBuilder.new(value_builder) + values_list.each do |values| + if values.nil? + builder.append_null + else + builder.append + values.each do |value| + if value.nil? + value_builder.append_null + else + value_builder.append(value) + end + end + end + end + builder.finish + end + + def build_struct_array(fields, structs) + field_builders = fields.collect do |field| + data_type_name = field.data_type.class.name + builder_name = data_type_name.gsub(/DataType/, "ArrayBuilder") + Arrow.const_get(builder_name).new + end + data_type = Arrow::StructDataType.new(fields) + builder = Arrow::StructArrayBuilder.new(data_type, field_builders) + structs.each do |struct| + if struct.nil? 
+ builder.append_null + else + builder.append + struct.each do |name, value| + field_builder_index = fields.index {|field| field.name == name} + field_builder = builder.get_field_builder(field_builder_index) + if value.nil? + field_builder.append_null + else + field_builder.append(value) + end + end + end + end + builder.finish + end + private def build_array(builder_class, values) builder = builder_class.new diff --git a/c_glib/test/test-binary-array.rb b/c_glib/test/test-binary-array.rb index ccdf378ad41b9..9ae122a9a742b 100644 --- a/c_glib/test/test-binary-array.rb +++ b/c_glib/test/test-binary-array.rb @@ -16,6 +16,19 @@ # under the License. class TestBinaryArray < Test::Unit::TestCase + include Helper::Buildable + + def test_new + value_offsets = Arrow::Buffer.new([0, 2, 5, 5].pack("l*")) + data = Arrow::Buffer.new("\x00\x01\x02\x03\x04") + assert_equal(build_binary_array(["\x00\x01", "\x02\x03\x04", nil]), + Arrow::BinaryArray.new(3, + value_offsets, + data, + Arrow::Buffer.new([0b011].pack("C*")), + -1)) + end + def test_value data = "\x00\x01\x02" builder = Arrow::BinaryArrayBuilder.new diff --git a/c_glib/test/test-boolean-array.rb b/c_glib/test/test-boolean-array.rb index 15df1ed95b274..43b83655638e3 100644 --- a/c_glib/test/test-boolean-array.rb +++ b/c_glib/test/test-boolean-array.rb @@ -16,6 +16,16 @@ # under the License. class TestBooleanArray < Test::Unit::TestCase + include Helper::Buildable + + def test_new + assert_equal(build_boolean_array([true, false, nil]), + Arrow::BooleanArray.new(3, + Arrow::Buffer.new([0b001].pack("C*")), + Arrow::Buffer.new([0b011].pack("C*")), + -1)) + end + def test_buffer builder = Arrow::BooleanArrayBuilder.new builder.append(true) diff --git a/c_glib/test/test-double-array.rb b/c_glib/test/test-double-array.rb index c644ac6cc0c07..935fbe5b93dd9 100644 --- a/c_glib/test/test-double-array.rb +++ b/c_glib/test/test-double-array.rb @@ -16,6 +16,16 @@ # under the License. class TestDoubleArray < Test::Unit::TestCase + include Helper::Buildable + + def test_new + assert_equal(build_double_array([-1.1, 2.2, nil]), + Arrow::DoubleArray.new(3, + Arrow::Buffer.new([-1.1, 2.2].pack("d*")), + Arrow::Buffer.new([0b011].pack("C*")), + -1)) + end + def test_buffer builder = Arrow::DoubleArrayBuilder.new builder.append(-1.1) diff --git a/c_glib/test/test-float-array.rb b/c_glib/test/test-float-array.rb index 84876f9754da7..fcac9021e56d2 100644 --- a/c_glib/test/test-float-array.rb +++ b/c_glib/test/test-float-array.rb @@ -16,6 +16,16 @@ # under the License. class TestFloatArray < Test::Unit::TestCase + include Helper::Buildable + + def test_new + assert_equal(build_float_array([-1.1, 2.2, nil]), + Arrow::FloatArray.new(3, + Arrow::Buffer.new([-1.1, 2.2].pack("f*")), + Arrow::Buffer.new([0b011].pack("C*")), + -1)) + end + def test_buffer builder = Arrow::FloatArrayBuilder.new builder.append(-1.1) diff --git a/c_glib/test/test-int16-array.rb b/c_glib/test/test-int16-array.rb index 4b30ddd99ff9b..6bc7f8815c26e 100644 --- a/c_glib/test/test-int16-array.rb +++ b/c_glib/test/test-int16-array.rb @@ -16,6 +16,16 @@ # under the License. 
class TestInt16Array < Test::Unit::TestCase + include Helper::Buildable + + def test_new + assert_equal(build_int16_array([-1, 2, nil]), + Arrow::Int16Array.new(3, + Arrow::Buffer.new([-1, 2].pack("s*")), + Arrow::Buffer.new([0b011].pack("C*")), + -1)) + end + def test_buffer builder = Arrow::Int16ArrayBuilder.new builder.append(-1) diff --git a/c_glib/test/test-int32-array.rb b/c_glib/test/test-int32-array.rb index 90cf0224c1c30..0b68273aca7dd 100644 --- a/c_glib/test/test-int32-array.rb +++ b/c_glib/test/test-int32-array.rb @@ -16,6 +16,16 @@ # under the License. class TestInt32Array < Test::Unit::TestCase + include Helper::Buildable + + def test_new + assert_equal(build_int32_array([-1, 2, nil]), + Arrow::Int32Array.new(3, + Arrow::Buffer.new([-1, 2].pack("l*")), + Arrow::Buffer.new([0b011].pack("C*")), + -1)) + end + def test_buffer builder = Arrow::Int32ArrayBuilder.new builder.append(-1) diff --git a/c_glib/test/test-int64-array.rb b/c_glib/test/test-int64-array.rb index d3022017bb0ee..c2174345746c2 100644 --- a/c_glib/test/test-int64-array.rb +++ b/c_glib/test/test-int64-array.rb @@ -16,6 +16,16 @@ # under the License. class TestInt64Array < Test::Unit::TestCase + include Helper::Buildable + + def test_new + assert_equal(build_int64_array([-1, 2, nil]), + Arrow::Int64Array.new(3, + Arrow::Buffer.new([-1, 2].pack("q*")), + Arrow::Buffer.new([0b011].pack("C*")), + -1)) + end + def test_buffer builder = Arrow::Int64ArrayBuilder.new builder.append(-1) diff --git a/c_glib/test/test-int8-array.rb b/c_glib/test/test-int8-array.rb index 9f28fa7fcd3a3..c7ff2165056cb 100644 --- a/c_glib/test/test-int8-array.rb +++ b/c_glib/test/test-int8-array.rb @@ -16,6 +16,16 @@ # under the License. class TestInt8Array < Test::Unit::TestCase + include Helper::Buildable + + def test_new + assert_equal(build_int8_array([-1, 2, nil]), + Arrow::Int8Array.new(3, + Arrow::Buffer.new([-1, 2].pack("c*")), + Arrow::Buffer.new([0b011].pack("C*")), + -1)) + end + def test_buffer builder = Arrow::Int8ArrayBuilder.new builder.append(-1) diff --git a/c_glib/test/test-list-array.rb b/c_glib/test/test-list-array.rb index 34177de9dcdeb..8e481e8367053 100644 --- a/c_glib/test/test-list-array.rb +++ b/c_glib/test/test-list-array.rb @@ -16,6 +16,21 @@ # under the License. class TestListArray < Test::Unit::TestCase + include Helper::Buildable + + def test_new + value_offsets = Arrow::Buffer.new([0, 2, 5, 5].pack("l*")) + data = Arrow::Buffer.new([1, 2, 3, 4, 5].pack("c*")) + values = Arrow::Int8Array.new(5, data, nil, 0) + assert_equal(build_list_array(Arrow::Int8ArrayBuilder, + [[1, 2], [3, 4, 5], nil]), + Arrow::ListArray.new(3, + value_offsets, + values, + Arrow::Buffer.new([0b011].pack("C*")), + -1)) + end + def test_value builder = Arrow::ListArrayBuilder.new(Arrow::Int8ArrayBuilder.new) value_builder = builder.value_builder diff --git a/c_glib/test/test-string-array.rb b/c_glib/test/test-string-array.rb index a076c228e0a4f..a9edb0ae49152 100644 --- a/c_glib/test/test-string-array.rb +++ b/c_glib/test/test-string-array.rb @@ -16,6 +16,19 @@ # under the License. 
class TestStringArray < Test::Unit::TestCase + include Helper::Buildable + + def test_new + value_offsets = Arrow::Buffer.new([0, 5, 11, 11].pack("l*")) + data = Arrow::Buffer.new("HelloWorld!") + assert_equal(build_string_array(["Hello", "World!", nil]), + Arrow::StringArray.new(3, + value_offsets, + data, + Arrow::Buffer.new([0b011].pack("C*")), + -1)) + end + def test_value builder = Arrow::StringArrayBuilder.new builder.append("Hello") diff --git a/c_glib/test/test-struct-array.rb b/c_glib/test/test-struct-array.rb index cf450f52d299a..ef0bc7179f290 100644 --- a/c_glib/test/test-struct-array.rb +++ b/c_glib/test/test-struct-array.rb @@ -16,6 +16,39 @@ # under the License. class TestStructArray < Test::Unit::TestCase + include Helper::Buildable + + def test_new + fields = [ + Arrow::Field.new("score", Arrow::Int8DataType.new), + Arrow::Field.new("enabled", Arrow::BooleanDataType.new), + ] + structs = [ + { + "score" => -29, + "enabled" => true, + }, + { + "score" => 2, + "enabled" => false, + }, + nil, + ] + struct_array1 = build_struct_array(fields, structs) + + data_type = Arrow::StructDataType.new(fields) + children = [ + Arrow::Int8Array.new(2, Arrow::Buffer.new([-29, 2].pack("C*")), nil, 0), + Arrow::BooleanArray.new(2, Arrow::Buffer.new([0b01].pack("C*")), nil, 0), + ] + assert_equal(struct_array1, + Arrow::StructArray.new(data_type, + 3, + children, + Arrow::Buffer.new([0b011].pack("C*")), + -1)) + end + def test_fields fields = [ Arrow::Field.new("score", Arrow::Int8DataType.new), diff --git a/c_glib/test/test-uint16-array.rb b/c_glib/test/test-uint16-array.rb index 82e898e733625..e3ffa5d28b6fa 100644 --- a/c_glib/test/test-uint16-array.rb +++ b/c_glib/test/test-uint16-array.rb @@ -16,6 +16,16 @@ # under the License. class TestUInt16Array < Test::Unit::TestCase + include Helper::Buildable + + def test_new + assert_equal(build_uint16_array([1, 2, nil]), + Arrow::UInt16Array.new(3, + Arrow::Buffer.new([1, 2].pack("S*")), + Arrow::Buffer.new([0b011].pack("C*")), + -1)) + end + def test_buffer builder = Arrow::UInt16ArrayBuilder.new builder.append(1) diff --git a/c_glib/test/test-uint32-array.rb b/c_glib/test/test-uint32-array.rb index c8be06fead5b9..95aee79921929 100644 --- a/c_glib/test/test-uint32-array.rb +++ b/c_glib/test/test-uint32-array.rb @@ -16,6 +16,16 @@ # under the License. class TestUInt32Array < Test::Unit::TestCase + include Helper::Buildable + + def test_new + assert_equal(build_uint32_array([1, 2, nil]), + Arrow::UInt32Array.new(3, + Arrow::Buffer.new([1, 2].pack("L*")), + Arrow::Buffer.new([0b011].pack("C*")), + -1)) + end + def test_buffer builder = Arrow::UInt32ArrayBuilder.new builder.append(1) diff --git a/c_glib/test/test-uint64-array.rb b/c_glib/test/test-uint64-array.rb index 03082f33014ce..7d9185459b295 100644 --- a/c_glib/test/test-uint64-array.rb +++ b/c_glib/test/test-uint64-array.rb @@ -16,6 +16,16 @@ # under the License. class TestUInt64Array < Test::Unit::TestCase + include Helper::Buildable + + def test_new + assert_equal(build_uint64_array([1, 2, nil]), + Arrow::UInt64Array.new(3, + Arrow::Buffer.new([1, 2].pack("Q*")), + Arrow::Buffer.new([0b011].pack("C*")), + -1)) + end + def test_buffer builder = Arrow::UInt64ArrayBuilder.new builder.append(1) diff --git a/c_glib/test/test-uint8-array.rb b/c_glib/test/test-uint8-array.rb index d7464e336da79..9c93abe7c349e 100644 --- a/c_glib/test/test-uint8-array.rb +++ b/c_glib/test/test-uint8-array.rb @@ -16,6 +16,16 @@ # under the License. 
class TestUInt8Array < Test::Unit::TestCase + include Helper::Buildable + + def test_new + assert_equal(build_uint8_array([1, 2, nil]), + Arrow::UInt8Array.new(3, + Arrow::Buffer.new([1, 2].pack("C*")), + Arrow::Buffer.new([0b011].pack("C*")), + -1)) + end + def test_buffer builder = Arrow::UInt8ArrayBuilder.new builder.append(1) From fe945a276206d597039f027004c14e141fffa0f5 Mon Sep 17 00:00:00 2001 From: Wes McKinney Date: Mon, 8 May 2017 15:22:41 -0400 Subject: [PATCH 0621/1644] ARROW-965: Website updates for 0.3.0 Author: Wes McKinney Closes #658 from wesm/ARROW-965 and squashes the following commits: 61c69d1 [Wes McKinney] Add license header 9e2c969 [Wes McKinney] Fix post template 5a55645 [Wes McKinney] Add Release page with changelog 7e674e1 [Wes McKinney] Website updates for 0.3.0 --- NOTICE.txt | 2 + site/_config.yml | 2 + site/_data/contributors.yml | 32 +++ site/_includes/blog_entry.html | 20 +- site/_includes/header.html | 1 + site/_layouts/post.html | 20 +- site/_posts/2017-05-08-0.3-release.md | 4 +- site/_release/0.1.0.md | 222 +++++++++++++++++ site/_release/0.2.0.md | 238 ++++++++++++++++++ site/_release/0.3.0.md | 343 ++++++++++++++++++++++++++ site/_release/index.md | 35 +++ site/install.html | 11 - site/install.md | 75 ++++++ 13 files changed, 990 insertions(+), 15 deletions(-) create mode 100644 site/_data/contributors.yml create mode 100644 site/_release/0.1.0.md create mode 100644 site/_release/0.2.0.md create mode 100644 site/_release/0.3.0.md create mode 100644 site/_release/index.md delete mode 100644 site/install.html create mode 100644 site/install.md diff --git a/NOTICE.txt b/NOTICE.txt index e71835c233de6..c02e75f91d966 100644 --- a/NOTICE.txt +++ b/NOTICE.txt @@ -46,6 +46,8 @@ This product includes software from the Ibis project (Apache 2.0) * Copyright (c) 2015 Cloudera, Inc. * https://github.com/cloudera/ibis +The web site includes files generated by Jekyll. + -------------------------------------------------------------------------------- This product includes code from Apache Kudu, which includes the following in diff --git a/site/_config.yml b/site/_config.yml index d7e0bb37e3eb0..8bb969abe848c 100644 --- a/site/_config.yml +++ b/site/_config.yml @@ -30,6 +30,8 @@ exclude: collections: docs: output: true + release: + output: true sass: style: compressed diff --git a/site/_data/contributors.yml b/site/_data/contributors.yml new file mode 100644 index 0000000000000..7bed83d21823d --- /dev/null +++ b/site/_data/contributors.yml @@ -0,0 +1,32 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to you under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# +# Database of contributors to Apache Arrow (WIP) +# Blogs and other pages use this data +# +- name: Wes McKinney + apacheId: wesm + githubId: wesm + homepage: http://wesmckinney.com + role: PMC +- name: Uwe Korn + apacheId: uwe + githubId: xhochy + role: PMC +- name: Julien Le Dem + apacheId: julienledem + githubId: julienledem + role: PMC +# End contributors.yml diff --git a/site/_includes/blog_entry.html b/site/_includes/blog_entry.html index cdc0060669c2f..ffcbd7a7ae208 100644 --- a/site/_includes/blog_entry.html +++ b/site/_includes/blog_entry.html @@ -20,6 +20,24 @@

    + {% capture discard %} + {% for c in site.data.contributors %} + {% if c.apacheId == post.author %} + {% assign author_name = c.name %} + {% if c.homepage %} + {% assign homepage = c.homepage %} + {% else %} + {% capture homepage %}http://github.com/{{ c.githubId }}{% endcapture %} + {% endif %} + {% if c.avatar %} + {% assign avatar = c.avatar %} + {% else %} + {% capture avatar %}http://github.com/{{ c.githubId }}.png{% endcapture %} + {% endif %} + {% endif %} + {% endfor %} + {% endcapture %}{% assign discard = nil %} + diff --git a/site/_includes/header.html b/site/_includes/header.html index d1526f69faf16..6c0ec30f39ca7 100644 --- a/site/_includes/header.html +++ b/site/_includes/header.html @@ -21,6 +21,7 @@